arXiv 论文速递

Snapshot: 20260322_0340

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Authors: Shang-Jui Ray Kuo, Paola Cascante-Bonilla

First: 2026-03-19T17:56:32+00:00 · Latest: 2026-03-19T17:56:32+00:00

Comments: Project page: https://lab-spell.github.io/vlm-ssm-vision-encoders/ ; Code: https://github.com/raykuo18/vlm-ssm-vision-encoders

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.

中文标题/摘要

标题：VLMs是否需要视觉变换器？评估状态空间模型作为视觉编码器

大型视觉-语言模型（VLMs）通常使用冻结的视觉主干，其图像特征通过轻量级连接器映射到大型语言模型中。虽然基于变换器的编码器是标准的视觉主干，但我们询问状态空间模型（SSM）视觉主干是否可以成为强有力的替代品。我们在受控环境中系统地评估了SSM视觉主干在VLMs中的表现。在匹配的ImageNet-1K初始化下，SSM主干在VQA和定位/标注方面实现了最强的整体性能。我们进一步适应了SSM和ViT家族的主干，并进行了检测或分割训练，发现密集任务调整通常在家族中提高了性能；在这一适应后，SSM主干保持竞争力，同时以较小的模型规模运行。我们还观察到，(i) 更高的ImageNet准确度或更大的主干并不一定能可靠地转化为更好的VLM性能，(ii) 一些视觉主干在定位方面不稳定。基于这些发现，我们提出了稳定策略，以提高两个主干家族的鲁棒性，并强调SSM主干作为VLMs中基于变换器视觉编码器的强有力替代品。

Summary / 总结

This study evaluates state space model (SSM) vision backbones in large vision-language models (VLMs) and finds that, under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across VQA and grounding/localization. After dense-task tuning (detection or segmentation training), the SSM backbone remains competitive at a substantially smaller model scale. The study also finds that higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and that some visual backbones are unstable in localization. It proposes stabilization strategies that improve robustness for both backbone families and positions SSMs as a strong alternative to transformer-based vision encoders in VLMs.

研究评估了状态空间模型（SSM）在大型视觉语言模型（VLM）中的表现，发现SSM在VQA和定位/检测任务中均优于基于变换器的编码器，尤其是在匹配的ImageNet-1K初始化条件下。密集任务调优可提升各家族的表现，而SSM仍能在较小的模型规模下保持竞争力。研究还指出，更高的ImageNet准确度或更大的模型并不一定意味着更好的VLM性能，且某些视觉编码器在定位任务中不稳定。这些发现表明，SSM是VLM中变换器基视觉编码器的强大替代方案。

Tinted Frames: Question Framing Blinds Vision-Language Models

Authors: Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta

First: 2026-03-19T17:53:09+00:00 · Latest: 2026-03-19T17:53:09+00:00

Comments: Preprint. Project page: https://davidhalladay.github.io/tinted_frames_demo/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and improving performance across framings.

中文标题/摘要

标题：着色框：问题框架使视觉-语言模型失明

视觉-语言模型（VLMs）已被证明是失明的，即使在需要视觉推理的任务中，它们也经常未能充分利用视觉输入。在本研究中，我们展示了VLMs是选择性失明的。它们根据语言框架调整对视觉输入的注意力程度，即使存在其他框架要求相同的视觉推理。通过使用视觉注意力作为探针，我们量化了框架如何改变对图像的关注量及其分布。受限的框架，如多项选择和是/否，相比开放式框架，显著降低了对图像上下文的关注，减少了对任务相关区域的关注，并将注意力转移到无信息的标记上。我们进一步证明，这种注意力分配不当是导致准确度下降和跨框架不一致的主要原因。基于这一机制洞察，我们引入了一种轻量级的提示调优方法，使用可学习标记来鼓励在开放式设置中观察到的稳健、视觉接地的注意力模式，从而提高视觉接地并改善跨框架性能。

Summary / 总结

This study investigates why Vision-Language Models (VLMs) are selectively blind to visual inputs, depending on how questions are framed. By using visual attention as a probe, the researchers found that constrained framings like multiple choice or yes/no lead to less attention on the image context and more on uninformative tokens, compared to open-ended questions. This misallocation of attention is the main cause of reduced accuracy and inconsistency across different framings. The study introduces a prompt-tuning method that encourages robust, visually grounded attention patterns, improving performance across various framings.

该研究探讨了为什么视觉语言模型（VLMs）在需要视觉推理的任务中未能充分利用视觉输入。通过分析视觉注意力模式，研究者发现，VLMs会根据问题的语义框架调整其注意力，即使不同的框架要求相同的视觉推理。研究显示，受限框架，如多项选择或是非题，会导致对图像上下文的关注减少，更多地关注无信息性标记，这降低了准确性和不同框架之间的一致性。研究者提出了一种使用可学习标记的提示调优方法，以促进在开放性设置中观察到的稳健的视觉定位，从而提高性能并跨不同框架保持一致性。
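
The attention probe described above can be illustrated with a minimal sketch. It assumes one query's attention weights are available as a flat list together with a boolean mask marking image tokens; the function name and the toy numbers are hypothetical and are not taken from the paper.

```python
def image_attention_fraction(attn_row, is_image_token):
    """Fraction of one query's attention mass that lands on image tokens."""
    if len(attn_row) != len(is_image_token):
        raise ValueError("attention row and token mask must have equal length")
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    on_image = sum(w for w, is_img in zip(attn_row, is_image_token) if is_img)
    return on_image / total


# Toy example (made-up numbers): 3 image tokens followed by 2 text tokens.
mask = [True, True, True, False, False]
open_ended = [0.30, 0.25, 0.15, 0.20, 0.10]       # attention under an open-ended framing
multiple_choice = [0.10, 0.05, 0.05, 0.30, 0.50]  # attention under a constrained framing

print(image_attention_fraction(open_ended, mask))
print(image_attention_fraction(multiple_choice, mask))
```

Comparing this fraction for the same image and question under different framings is one simple way to quantify the shift in attention that the abstract reports.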

Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Authors: Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan

First: 2026-03-19T17:20:56+00:00 · Latest: 2026-03-19T17:20:56+00:00

Comments: Equal contribution: Swagat Padhan and Lakshya Jain, 9 pages, 6 figures, paper website: https://lakshya-asu.github.io/Meanings-Measurements-Multi-Agent-Probabilistic-Grounding/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.

中文标题/摘要

标题：意义与测量：多智能体概率对接在视觉语言导航中的应用

与人类合作的机器人必须将自然语言目标转化为可执行的、物理上可对接的决策。例如，执行“向冰箱右边两米处走”的命令需要在三维场景中对接语义参考、空间关系和度量约束。虽然最近的视觉语言模型（VLMs）展示了强大的语义对接能力，但它们并未明确设计用于在物理定义的空间中推理度量约束。在本文中，我们实验证明了最先进的基于VLM的对接方法在处理复杂的度量语义语言查询时存在困难。为解决这一局限，我们提出了MAPG（多智能体概率对接）框架，该框架将语言查询分解为结构化的子组件，并查询VLM对接每个组件。然后，MAPG通过概率性地组合这些对接输出，生成在三维空间中度量一致的可执行决策。我们使用HM-EQA基准对MAPG进行了评估，并展示了相对于强大基线的一致性能改进。此外，我们引入了一个新的基准MAPG-Bench，专门用于评估度量语义目标对接，填补了现有语言对接评估中的空白。我们还展示了在可用结构化场景表示时，MAPG在真实世界机器人演示中的应用。

Summary / 总结

This paper addresses the challenge of converting natural language goals into actionable decisions for robots, focusing on metric-semantic language queries. The authors propose MAPG (Multi-Agent Probabilistic Grounding), which decomposes language queries into structured subcomponents, queries a VLM to ground each one, and then probabilistically composes the grounded outputs into metrically consistent, actionable decisions in 3D space. Experiments on HM-EQA show consistent improvements over strong baselines. The authors also introduce MAPG-Bench to evaluate metric-semantic goal grounding, and a real-world robot demonstration shows that MAPG transfers beyond simulation when a structured scene representation is available.

该研究旨在将复杂的度量语义语言查询转化为机器人的可执行决策。提出了一种名为MAPG（多智能体概率定位）的方法，该方法将语言查询分解为子组件，并使用VLM对每个部分进行定位，然后通过概率组合确保度量一致性。HM-EQA基准上的实验表明，MAPG优于强基线，并引入了新的MAPG-Bench基准来评估度量语义目标定位。此外，一个实际的机器人演示验证了MAPG在模拟之外的有效性。
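
As a rough illustration of probabilistic composition for a query such as "go two meters to the right of the fridge", the sketch below scores candidate goal points by multiplying a metric term with a spatial-relation term and marginalizing over uncertain anchor locations. The Gaussian distance model, the clipped-cosine direction term, and all function names are assumptions made for this example; the abstract does not specify MAPG's actual formulation.

```python
import math


def gaussian(x, mean, sigma):
    """Unnormalized Gaussian likelihood."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def score_goal(candidate, anchor, direction, distance_m, sigma_m=0.3):
    """Score a 2D candidate point against one anchor hypothesis."""
    norm = math.hypot(direction[0], direction[1])
    if norm == 0.0:
        raise ValueError("direction must be a non-zero vector")
    dx, dy = candidate[0] - anchor[0], candidate[1] - anchor[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return 0.0
    # Metric constraint: the offset length should be close to distance_m.
    metric = gaussian(dist, distance_m, sigma_m)
    # Spatial relation: cosine between offset and requested direction, clipped at 0.
    cos = (dx * direction[0] + dy * direction[1]) / (dist * norm)
    return metric * max(0.0, cos)


def best_goal(candidates, anchor_hypotheses, direction, distance_m):
    """Pick the candidate with the highest score, marginalized over anchors.

    anchor_hypotheses is a list of ((x, y), probability) pairs, e.g. possible
    fridge locations returned by a grounding step.
    """
    def total(c):
        return sum(p * score_goal(c, a, direction, distance_m)
                   for a, p in anchor_hypotheses)
    return max(candidates, key=total)


candidates = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0), (1.0, 0.0)]
print(best_goal(candidates, [((0.0, 0.0), 1.0)], direction=(1.0, 0.0), distance_m=2.0))
```

Here "right" is hard-coded as the +x axis; in a real system the direction would depend on the reference frame of the anchor or the speaker.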

Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation

Authors: Kwanyoung Lee, SeungJu Cha, Yebin Ahn, Hyunwoo Oh, Sungho Koh, Dong-Jin Kim

Venue: CVPR 2026

First: 2026-03-19T17:12:03+00:00 · Latest: 2026-03-19T17:12:03+00:00

Comments: Accepted in CVPR 2026 (main track). 10 pages, 6 figures; supplementary material included (14 pages, 11 figures)

Abs · PDF · Code1 · Code2

Abstract

Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) - a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.

中文标题/摘要

标题：适应性辅助提示融合以实现目标忠实的扩散生成

基于扩散的文本到图像（T2I）模型在生成逼真且语义丰富的图像方面取得了显著进展。然而，当目标概念位于训练分布的低密度区域时，这些模型往往会生成语义不匹配或结构不一致的结果。这一限制源于文本-图像数据集的长尾性质，其中稀有概念或编辑指令的代表性不足。为了解决这一问题，我们引入了适应性辅助提示融合（AAPB）——一种统一框架，用于在低密度区域稳定扩散过程。AAPB 利用辅助锚提示提供稀有概念生成的语义支持和图像编辑的结构支持，确保目标提示的忠实指导。与先前的启发式提示交替方法不同，AAPB 在每个扩散步骤中推导出一个闭式自适应系数，以最优地平衡辅助锚提示和目标提示之间的影响力。基于 Tweedie 的恒等式，我们的公式提供了一种原理上和无需训练的自适应提示融合框架，确保稳定和目标忠实的生成。通过受控实验，我们展示了自适应插值优于固定插值的有效性，并在 RareBench 和 FlowEdit 数据集上实验证明了一致的改进，实现了与先前无需训练基线相比的更优语义准确性和结构保真度。

Summary / 总结

The paper introduces Adaptive Auxiliary Prompt Blending (AAPB), a method to stabilize diffusion models in generating images for rare concepts. AAPB uses auxiliary anchor prompts to provide semantic and structural support, ensuring target-faithful generation. It derives an adaptive coefficient for each diffusion step, balancing the influence between the auxiliary and target prompts. Experiments on RareBench and FlowEdit show AAPB outperforms previous methods in semantic accuracy and structural fidelity.

论文针对扩散模型在处理稀有概念时生成的图像出现语义不匹配或结构不一致的问题，引入了自适应辅助提示融合（AAPB）方法，通过辅助锚提示提供语义和结构支持，确保生成目标一致。AAPB在每个扩散步骤中计算一个自适应系数，平衡辅助锚提示和目标提示的影响。实验结果表明，AAPB在RareBench和FlowEdit数据集上的一致改进，优于之前的无训练基线，在语义准确性和结构保真度方面表现更优。
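
The idea of blending an auxiliary anchor prompt with the target prompt at each diffusion step can be sketched as follows. The blend is a plain convex combination of two noise predictions, and the clean-sample estimate uses the standard Tweedie-style posterior mean for a forward process of the form x_t = alpha_t * x0 + sigma_t * eps. The linear schedule is only a placeholder: the abstract states that AAPB derives a closed-form adaptive coefficient but does not give it, so it is not reproduced here. All function names are hypothetical.

```python
def blend_noise(eps_target, eps_anchor, gamma):
    """Convex blend of two noise predictions; gamma = 0 uses the target prompt only."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return [(1.0 - gamma) * t + gamma * a for t, a in zip(eps_target, eps_anchor)]


def tweedie_x0(x_t, eps, alpha_t, sigma_t):
    """Posterior-mean estimate of x0, assuming x_t = alpha_t * x0 + sigma_t * eps."""
    return [(x - sigma_t * e) / alpha_t for x, e in zip(x_t, eps)]


def linear_gamma(step, num_steps, gamma_start=0.8):
    """Placeholder schedule (NOT AAPB's closed form): anchor weight decays linearly to 0."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    return gamma_start * (1.0 - step / (num_steps - 1))


# Toy usage with made-up two-dimensional "latents".
eps_t, eps_a = [1.0, 2.0], [3.0, 4.0]
for step in range(5):
    g = linear_gamma(step, 5)
    print(step, round(g, 3), blend_noise(eps_t, eps_a, g))
```

A fixed interpolation corresponds to holding gamma constant across steps; an adaptive method replaces `linear_gamma` with a coefficient computed from the current state.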

ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation

Authors: Kwanyoung Lee, Hyunwoo Oh, SeungJu Cha, Sungho Koh, Dong-Jin Kim

Venue: CVPR 2026

First: 2026-03-19T17:11:49+00:00 · Latest: 2026-03-19T17:11:49+00:00

Comments: Accepted in CVPR 2026 (findings). 10 pages, 4 figures; supplementary material included (8 pages, 10 figures)

Abs · PDF · Code1 · Code2

Abstract

Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose the ADAPT framework, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.

中文标题/摘要

标题：ADAPT：注意力驱动的自适应提示调度和插值正交补对于稀有概念生成

对于文本到图像合成而言，在生成稀有组合概念方面，扩散模型仍然面临挑战，尤其是对于训练数据中不常见的属性。虽然最近的方法，如R2F，通过利用LLM进行提示调度来解决这一挑战，但由于语言模型的随机性和迭代文本嵌入切换的次优指导，它们仍然存在固有的方差问题。为了解决这些问题，我们提出了ADAPT框架，这是一种无需训练的框架，可以确定性地规划和语义对齐提示调度，提供一致的指导以增强稀有概念的组合。通过利用注意力分数和正交组件，ADAPT在RareBench基准上显著增强了稀有概念的组合生成，无需额外的训练或微调。通过全面的实验，我们证明ADAPT在RareBench上实现了优越的性能，并准确反映了稀有属性的语义信息，提供了对稀有组合生成的确定性和精确控制，而不损害视觉完整性。

Summary / 总结

The research addresses the challenge of generating rare compositional concepts in text-to-image synthesis using diffusion models. It proposes the ADAPT framework, which uses attention scores and orthogonal components to deterministically plan and align prompt schedules, providing consistent guidance for the generation of rare concepts. Experiments show that ADAPT outperforms existing methods on the RareBench benchmark, enhancing the compositional generation of rare concepts without additional training or fine-tuning.

研究旨在使用扩散模型提高文本到图像合成中稀有组合概念的生成。ADAPT，一个无需训练的框架，通过确定性地规划提示调度并进行语义对齐，提供一致的指导，以生成稀有概念。实验表明，ADAPT 在 RareBench 基准上优于现有方法如 R2F，增强了稀有概念的组合生成，无需额外的训练或微调，同时保持视觉完整性。

GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning

Authors: Yiren Lu, Yi Du, Disheng Liu, Yunlai Zhou, Chen Wang, Yu Yin

First: 2026-03-19T16:55:54+00:00 · Latest: 2026-03-19T16:55:54+00:00

Comments: Project page at https://vulab-ai.github.io/GSMem/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose GSMem, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to "hallucinate" optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.

中文标题/摘要

标题：GSMem：3D高斯泼溅作为持久空间记忆的零样本具身探索与推理框架

有效的具身探索需要智能体随时间积累和保留空间知识。然而，现有的场景表示，如离散场景图或静态视角快照，缺乏“事后重新观察”的能力。如果初始观察错过了目标，由此产生的记忆遗漏往往是不可恢复的。为了解决这一问题，我们提出了GSMem，一种基于3D高斯泼溅（3DGS）的零样本具身探索与推理框架。通过显式参数化连续几何和密集外观，3DGS充当持久空间记忆，赋予智能体“空间回忆”的能力：从先前未占用的最佳视角渲染逼真的新视图。为了实现这一点，GSMem采用了一种检索机制，同时利用并行的对象级场景图和语义级语言场。这种互补设计能够稳健地定位目标区域，使智能体能够“想象”出用于高保真视觉-语言模型（VLM）推理的最佳视图。此外，我们引入了一种混合探索策略，将VLM驱动的语义评分与基于3DGS的覆盖目标相结合，平衡任务感知探索与几何覆盖。在具身问答和终身导航上的大量实验证明了我们框架的稳健性和有效性。

Summary / 总结

GSMem is a zero-shot embodied exploration and reasoning framework that uses 3D Gaussian Splatting (3DGS) to create a persistent spatial memory. This allows the agent to render photorealistic novel views from previously unoccupied viewpoints, enabling spatial recollection. GSMem combines object-level scene graphs and semantic-level language fields for robust target localization and uses a hybrid exploration strategy that balances task-aware exploration with geometric coverage. Experiments show that GSMem is robust and effective for embodied question answering and lifelong navigation.

GSMem 是一个基于 3D 高斯泼溅 (3DGS) 的零样本具身探索和推理框架，创建持久的空间记忆，使智能体能够从先前未占用的视角渲染逼真的新视图，实现空间回忆。该框架结合了对象级场景图和语义级语言场进行稳健的目标定位，并采用结合语义评分和几何覆盖的混合探索策略。实验表明，GSMem 在具身问答和终身导航中表现出色且稳健。

Efficient Reasoning with Balanced Thinking

Authors: Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian

Venue: ICLR 2026

First: 2026-03-12T18:48:07+00:00 · Latest: 2026-03-19T16:54:22+00:00

Comments: Accepted by ICLR 2026

Abs · PDF · Code1 · Code2 · Project1

Abstract

Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .

中文标题/摘要

标题：平衡思考实现高效推理

大型推理模型（LRMs）展示了出色的推理能力，但往往存在过度推理的问题，即在简单问题上浪费冗余计算步骤，或者存在欠推理的问题，即在具备推理能力的情况下未能充分探索推理路径。这些问题导致了效率低下和潜在的不准确性，限制了其在资源受限环境中的实际部署。现有减少过度推理的方法，如抑制反思关键词或调整推理长度，可能会无意中导致欠推理，从而损害准确性。因此，我们提出了ReBalance，这是一种无需训练的框架，实现了平衡思考下的高效推理。ReBalance 利用置信度作为推理动态的连续指标，通过高置信度波动识别过度推理，通过一致的过度自信识别欠推理。通过将小型数据集中的隐藏状态聚合为推理模式原型，我们计算出一个引导向量来引导LRMs的推理轨迹。动态控制函数根据实时置信度调整该向量的强度和方向，在过度推理时修剪冗余，在欠推理时促进探索。在四个从0.5B到32B的模型以及九个涵盖数学推理、通用问答和编程任务的基准测试中进行的广泛实验表明，ReBalance 有效减少了输出冗余并提高了准确性，提供了一种通用、无需训练且即插即用的策略，用于高效和稳健的LRM部署。项目页面和代码可在https://rebalance-ai.github.io 获取。

Summary / 总结

The paper addresses the inefficiencies of Large Reasoning Models (LRMs) by proposing ReBalance, a training-free framework that aims to achieve efficient reasoning with balanced thinking. ReBalance uses confidence as a continuous indicator to detect overthinking and underthinking, and it guides LRMs' reasoning trajectories by computing a steering vector that is dynamically controlled based on real-time confidence. Experiments on four models across nine benchmarks show that ReBalance reduces output redundancy while improving accuracy, offering a general and plug-and-play strategy for efficient and robust LRM deployment.

论文针对大型推理模型（LRMs）由于过度推理或不足推理导致的效率问题，提出了ReBalance框架，该框架利用信心来平衡推理动态。ReBalance通过高信心变异识别过度推理，通过一致的过度自信识别不足推理，引导LRMs减少冗余并促进探索。实验表明，ReBalance在各种模型和基准测试中提高了准确性和减少了输出冗余，提供了一种通用且即插即用的解决方案，用于高效和稳健的LRM部署。
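
The mechanism summarized above (prototypes from hidden states, a steering vector, and confidence-based control) can be sketched in a few lines. This is a simplified stand-in and not the ReBalance implementation: the prototypes are plain means, the control function is a three-way switch, and the thresholds `var_thresh` and `conf_thresh` are hypothetical values chosen only for the example.

```python
from statistics import mean, pvariance


def prototype(hidden_states):
    """Element-wise mean of a list of equal-length hidden-state vectors."""
    return [mean(col) for col in zip(*hidden_states)]


def steering_vector(over_states, under_states):
    """Direction from the underthinking prototype to the overthinking prototype."""
    p_over, p_under = prototype(over_states), prototype(under_states)
    return [o - u for o, u in zip(p_over, p_under)]


def control_strength(recent_conf, var_thresh=0.02, conf_thresh=0.9, scale=1.0):
    """Signed steering strength from a window of recent step confidences."""
    if pvariance(recent_conf) > var_thresh:
        return -scale  # high confidence variance: treat as overthinking, steer away
    if min(recent_conf) > conf_thresh:
        return scale   # consistent overconfidence: treat as underthinking, steer toward
    return 0.0         # otherwise leave the hidden state untouched


def steer(hidden, vector, strength):
    """Shift a hidden state along the steering vector."""
    return [h + strength * v for h, v in zip(hidden, vector)]


v = steering_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]])
print(v, control_strength([0.5, 0.9, 0.4, 0.95]), control_strength([0.95, 0.96, 0.97]))
```

With this sign convention, a negative strength moves the state toward the underthinking prototype (pruning redundancy) and a positive strength moves it toward the overthinking prototype (encouraging exploration).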

MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

Authors: Fan Yang, Xingping Dong, Xin Yu, Wenhan Luo, Wei Liu, Kaihao Zhang

Venue: CVPR 2026

First: 2025-12-02T16:22:01+00:00 · Latest: 2026-03-19T16:35:02+00:00

Comments: Accepted to CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Understanding high-resolution (HR) images remains a critical challenge for multimodal large language models (MLLMs). Recent approaches leverage vision-based retrieval-augmented generation (RAG) to retrieve query-relevant crops from HR images, improving understanding capacity of MLLMs. However, this paradigm often leads to object fragmentation, resulting in semantic bias and incomplete retrieval, while also introducing false positives from irrelevant background patches. To address these issues, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework that enhances HR image understanding from both local and global perspectives. Locally, MRD enforces cross-scale semantic consistency via multi-resolution semantic fusion to mitigate single-resolution bias and alleviate object fragmentation. Globally, it integrates open-vocabulary object detection (OVD) as localization priors within a unified framework. Extensive experiments across multiple MLLMs on HR image benchmarks demonstrate that MRD achieves state-of-the-art (SOTA) performance on both single-object and multi-object understanding tasks. Code will be available at: https://github.com/yf0412/MRD.

中文标题/摘要

标题：MRD：多分辨率检索-检测融合用于高分辨率图像理解

理解高分辨率（HR）图像仍然是多模态大型语言模型（MLLM）的关键挑战。近期的方法利用基于视觉的检索增强生成（RAG）从HR图像中检索查询相关的片段，从而提高MLLM的理解能力。然而，这种范式往往导致对象碎片化，产生语义偏差和不完整的检索，同时还会引入来自无关背景片段的假阳性。为了解决这些问题，我们提出了多分辨率检索-检测（MRD），这是一种无需训练的框架，可以从局部和全局两个方面增强HR图像的理解。局部上，MRD通过多分辨率语义融合来缓解单一分辨率偏差并减轻对象碎片化。全局上，它将开放词汇对象检测（OVD）作为定位先验整合到统一框架中。在多个MLLM上的HR图像基准测试中，广泛的实验表明，MRD在单对象和多对象理解任务上均实现了最先进的（SOTA）性能。代码将在：https://github.com/yf0412/MRD/ 可用。

Summary / 总结

The paper proposes MRD, a training-free framework for enhancing high-resolution image understanding by addressing object fragmentation and semantic bias. It achieves this through multi-resolution semantic fusion for local consistency and integrates open-vocabulary object detection as localization priors for global understanding. Experiments show MRD outperforms existing methods on both single-object and multi-object understanding tasks across multiple multimodal large language models.

论文提出了一种名为MRD的无训练框架，通过融合多分辨率语义信息并结合开放词汇量物体检测来提升高分辨率图像理解。实验表明，MRD在多个大规模多模态语言模型上，无论是单物体还是多物体理解任务，都优于现有方法。全局上，MRD使用物体检测作为定位先验，局部上则通过跨尺度语义一致性来减少物体碎片化和语义偏差。
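
To make the local and global steps more concrete, the sketch below fuses query-relevance maps computed at two resolutions and then adds a bonus inside detector boxes. Equal-weight averaging, nearest-neighbour upsampling, the additive bonus, and every function name are assumptions for illustration; the abstract does not describe MRD's fusion rule at this level of detail.

```python
def upsample_nearest(grid, factor):
    """Repeat every cell of a 2D grid factor x factor times."""
    return [[v for v in row for _ in range(factor)]
            for row in grid for _ in range(factor)]


def fuse_scales(coarse, fine):
    """Average a coarse relevance map with a fine one on the fine grid.

    Assumes square-compatible grids where the fine size is an integer
    multiple of the coarse size.
    """
    factor = len(fine) // len(coarse)
    up = upsample_nearest(coarse, factor)
    return [[0.5 * (u + f) for u, f in zip(urow, frow)]
            for urow, frow in zip(up, fine)]


def apply_box_prior(scores, boxes, bonus=0.2):
    """Add a bonus to cells inside detector boxes (row0, col0, row1, col1), end-exclusive."""
    out = [row[:] for row in scores]
    for r0, c0, r1, c1 in boxes:
        for r in range(r0, r1):
            for c in range(c0, c1):
                out[r][c] += bonus
    return out


coarse = [[0.8]]                    # one patch covering the whole image
fine = [[0.2, 0.4], [0.6, 1.0]]     # four patches at twice the resolution
fused = fuse_scales(coarse, fine)
print(apply_box_prior(fused, [(1, 0, 2, 2)]))
```

Averaging across scales lets a patch that cuts through an object inherit evidence from the larger patch that contains the whole object, which is the intuition behind mitigating fragmentation.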

TAU-R1: Visual Language Model for Traffic Anomaly Understanding

Authors: Yuqiang Lin, Kehua Chen, Sam Lockyer, Arjun Yadav, Mingxuan Sui, Shucheng Zhang, Yan Shi, Bingzhang Wang, Yuang Zhang, Markus Zarbock, Florain Stanek, Adrian Evans, Wenbin Li, Yinhai Wang, Nic Zhang

First: 2026-03-19T16:23:21+00:00 · Latest: 2026-03-19T16:23:21+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: https://github.com/siri-rouser/TAU-R1

中文标题/摘要

标题：TAU-R1：交通异常理解的视觉语言模型

交通异常理解（TAU）对于智能交通系统中的交通安全至关重要。最近的视觉-语言模型（VLMs）在视频理解方面表现出强大的能力。然而，由于缺乏基准和特定任务的方法，TAU 的进展仍然有限。为了解决这一限制，我们引入了Roundabout-TAU数据集，该数据集由与印第安纳州卡梅尔市合作收集的真实环形交叉口视频构建而成。该数据集包含342个片段，并且带有超过2000个问题-答案对，涵盖了交通异常理解的多个方面。基于此基准，我们提出了TAU-R1，一种两层视觉-语言框架用于TAU。第一层是一个轻量级的异常分类器，执行粗略的异常分类，而第二层是一个较大的异常推理器，生成详细的事件总结。为了提高特定任务的推理，我们引入了一种两阶段训练策略，包括分解-问答增强监督微调，随后是基于GRPO的TAU-GRPO后训练方法，带有TAU特定的奖励函数。实验结果表明，TAU-R1在异常分类和推理任务上均表现出色，同时保持了部署效率。数据集和代码可在：https://github.com/siri-rouser/TAU-R1 获取

Summary / 总结

The research aims to enhance traffic safety in Intelligent Transportation Systems by developing a visual language model for traffic anomaly understanding (TAU). The study introduces Roundabout-TAU, a dataset of 342 clips from real-world roundabouts, and proposes TAU-R1, a two-layer vision-language framework. TAU-R1 includes a lightweight anomaly classifier and a larger anomaly reasoner, and employs a two-stage training strategy to improve task-specific reasoning. The model demonstrates strong performance in both anomaly classification and reasoning tasks while maintaining efficiency for deployment.

研究旨在通过开发交通异常理解（TAU）的视觉语言模型来提高智能交通系统的交通安全。作者引入了Roundabout-TAU数据集，包含342个来自真实环形交叉口的视频片段，并提出了TAU-R1双层视觉语言框架。TAU-R1包括一个轻量级的异常分类器和一个较大的异常推理器，并采用两阶段训练策略以提高任务特定的推理能力。该模型在异常分类和推理任务上表现出色，同时保持了部署效率。
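
TAU-GRPO is described as a GRPO-based post-training method. The core of GRPO is a group-relative advantage: several responses are sampled for the same prompt, and each reward is normalized by the group's mean and standard deviation. The sketch below shows only that normalization. The population standard deviation and the `eps` stabilizer are implementation choices assumed here, and the TAU-specific reward functions are not given in the abstract, so the rewards in the example are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sd + eps) for r in rewards]


# Four sampled answers to the same clip and question, with made-up rewards.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
print(group_relative_advantages([0.5, 0.5, 0.5, 0.5]))
```

Responses that score above the group average receive positive advantages and are reinforced; a group with identical rewards yields zero advantages and contributes no learning signal.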

SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Authors: Carlos Hinojosa, Clemens Grange, Bernard Ghanem

First: 2026-03-19T16:18:00+00:00 · Latest: 2026-03-19T16:18:00+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.

中文标题/摘要

标题：SAVeS: 通过语义线索引导视觉-语言模型的安全判断

视觉-语言模型（VLMs）在现实世界和具身环境中越来越被部署，其中安全决策依赖于视觉上下文。然而，尚不清楚哪些视觉证据驱动了这些判断。我们研究了是否可以通过简单的语义线索来引导VLMs中的多模态安全行为。我们引入了一种语义引导框架，该框架在不改变底层场景内容的情况下施加受控的文本、视觉和认知干预。为了评估这些效果，我们提出了SAVeS基准，用于在语义线索下的情境安全性评估，以及一个将行为拒绝、有依据的安全推理和虚假拒绝分开的评估协议。在多个VLMs和一个额外的最先进的基准上的实验表明，安全决策对语义线索非常敏感，表明模型依赖于学习到的视觉-语言关联，而非有依据的视觉理解。我们进一步证明了自动化引导流程可以利用这些机制，突显了多模态安全系统中潜在的脆弱性。

Summary / 总结

The study aims to understand how visual evidence influences safety judgments in vision-language models (VLMs) and whether these judgments can be steered using simple semantic cues. A semantic steering framework was developed to introduce controlled textual, visual, and cognitive interventions without altering the scene content. The SAVeS benchmark was introduced to evaluate the effects of these interventions on safety behavior, distinguishing between behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional benchmark revealed that safety decisions are highly sensitive to semantic cues, suggesting reliance on learned visual-linguistic associations rather than grounded visual understanding. Automated steering pipelines can exploit these mechanisms, indicating a potential vulnerability in multimodal safety systems.

研究旨在理解视觉证据如何影响视觉语言模型（VLMs）中的安全判断，以及是否可以通过简单的语义提示来引导这些判断。研究开发了一种语义引导框架，在不改变场景内容的情况下引入受控的文本、视觉和认知干预。SAVeS基准用于评估这些干预措施对安全行为的影响，区分行为拒绝、有依据的安全推理和虚假拒绝。在多个VLMs和一个额外基准上的实验表明，安全决策对语义提示高度敏感，表明模型依赖于学习到的视觉-语言关联，而非有依据的视觉理解。自动化引导流程可以利用这些机制，揭示了多模态安全系统中的潜在漏洞。

SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation

Authors: Phuc Pham, Uy Dieu Tran, Binh-Son Hua, Phong Nguyen

Venue: CVPR 2026

First: 2026-03-19T15:47:43+00:00 · Latest: 2026-03-19T15:47:43+00:00

Comments: CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.

中文标题/摘要

标题：SwiftTailor：基于几何图像表示的高效3D服装生成

在计算机视觉和数字时尚领域，真实且高效的3D服装生成仍然是一个长期存在的挑战。现有方法通常依赖于大型视觉语言模型来生成2D缝纫图案的序列化表示，然后使用如GarmentCode等服装建模框架将其转换为可用于模拟的3D网格。尽管这些方法可以产生高质量的结果，但它们通常会遭受较慢的推理时间，从30秒到一分钟不等。在本工作中，我们引入了SwiftTailor，这是一种新颖的两阶段框架，通过紧凑的几何图像表示统一了缝纫图案推理和基于几何的网格合成。SwiftTailor 包含两个轻量级模块：PatternMaker，一种高效的视觉语言模型，可以从多种输入模态中预测缝纫图案；以及GarmentSewer，一种高效的密集预测变换器，将这些图案转换为一种新颖的服装几何图像，统一编码所有服装面板的3D表面在UV空间中。最终的3D网格通过一个高效的逆映射过程重建，该过程结合了重新网格化和动态缝合算法，直接组装服装，从而摊销物理模拟的成本。在Multimodal GarmentCodeData上的大量实验表明，SwiftTailor 在准确性和视觉保真度方面达到了最先进的水平，同时显著减少了推理时间。这项工作提供了一种可扩展、可解释且高性能的下一代3D服装生成解决方案。

Summary / 总结

SwiftTailor is a two-stage framework that efficiently generates 3D garments by predicting sewing patterns and converting them into a geometry image representation. This method reduces inference time while maintaining high accuracy and visual fidelity. Experiments show that SwiftTailor outperforms existing approaches in terms of both speed and quality.

SwiftTailor 是一个两阶段框架，结合了缝制模式推理和基于几何的网格合成，使用紧凑的几何图像表示。它包括 PatternMaker 预测缝制模式和 GarmentSewer 将这些模式转换为服装几何图像。该方法减少了推理时间，同时保持了高精度和视觉保真度，优于现有方法。

TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Authors: Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota

First: 2026-03-19T15:38:02+00:00 · Latest: 2026-03-19T15:38:02+00:00

Comments: Accepted by CVPR 2026 (Main Track)

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.

中文标题/摘要

标题：TerraScope：基于像素的地球观测视觉推理

视觉语言模型（VLMs）在地球观测（EO）中显示出潜力，但它们在需要在精确的像素级视觉表示中进行复杂的空间推理的任务上存在困难。为了解决这个问题，我们引入了TerraScope，这是一种统一的VLM，能够提供基于像素的地理空间推理，具有两个关键能力：（1）模态灵活推理：它处理单一模态输入（光学或SAR），并在两种模态都可用时适应性地将不同模态融合到推理过程中；（2）多时相推理：它整合时间序列以在多个时间点进行变化分析。此外，我们构建了Terra-CoT数据集，包含100万个样本，其中包含嵌入在多源推理链中的像素级掩码。我们还提出了TerraScope-Bench，这是第一个用于基于像素的地理空间推理的基准，包含六个子任务，评估答案准确性和掩码质量，以确保真实的基于像素的推理。实验表明，TerraScope在基于像素的地理空间推理方面显著优于现有VLMs，同时提供可解释的视觉证据。

Summary / 总结

TerraScope is a unified vision-language model designed for earth observation tasks that require precise pixel-level reasoning. It supports both single-modality inputs and adaptive fusion of optical and SAR data, and can handle multi-temporal sequences for change analysis. The model is evaluated on a new dataset, Terra-CoT, and a benchmark, TerraScope-Bench, which includes six sub-tasks assessing both answer accuracy and mask quality. TerraScope demonstrates superior performance compared to existing vision-language models in pixel-grounded geospatial reasoning, providing interpretable visual evidence.

TerraScope 是一种统一的视觉语言模型，用于需要精确像素级推理的地球观测任务。它支持单模态输入，并且可以在两种数据都可用时进行模态融合，还可以处理多时序序列进行变化分析。该模型在 Terra-CoT 数据集和 TerraScope-Bench 基准上进行评估，后者包含六个子任务，评估答案准确性和掩码质量。实验表明，TerraScope 在像素级地理空间推理方面优于现有视觉语言模型，并提供了可解释的视觉证据。

AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

Authors: Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Zurong Mai, Jing Wu, Shuohong Lou, Yuhang Chen, Henglian Huang, Xiaoya Fan, Yang Zhang, Defeng Gu, Lingyuan Zhao, Yutong Lu, Haohuan Fu, Jianxi Huang, Juepeng Zheng

First: 2025-11-28T15:02:19+00:00 · Latest: 2026-03-19T15:28:24+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advancements in Vision-Language Models (VLMs) have significantly impacted various industries. In agriculture, these multimodal capabilities hold great promise for applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. However, while several Visual Question Answering (VQA) datasets and benchmarks have been developed to assess VLM performance, they often fail to effectively evaluate the critical reasoning and problem-solving skills needed in complex agricultural contexts. To address this gap, we introduce AgroCoT, a VQA dataset that integrates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,759 carefully curated samples, AgroCoT provides a comprehensive and robust evaluation of reasoning abilities, particularly in zero-shot scenarios, focusing on the models' ability to engage in logical reasoning and effective problem-solving. Our evaluation of 30 representative VLMs, including both proprietary and open-source models, reveals a gap in their reasoning capabilities, which underscores the importance of incorporating CoT for assessments. Our dataset is available at https://huggingface.co/datasets/wenyb/AgriCoT.

中文标题/摘要

标题：AgroCoT：农业领域视觉语言模型推理能力评估的链式思考基准

近年来，视觉语言模型（VLMs）在各个行业中的进步产生了重大影响。在农业领域，这些多模态能力在精准农业、作物监测、病虫害检测和环境可持续性等方面具有巨大的应用潜力。然而，尽管已经开发了多个视觉问答（VQA）数据集和基准来评估VLM性能，它们往往未能有效评估在复杂农业情境中所需的关键推理和问题解决能力。为解决这一问题，我们引入了AgroCoT，这是一个结合了链式思考（CoT）推理的VQA数据集，专门用于评估VLM的推理能力。AgroCoT包含4,759个精心策划的样本，提供了全面且稳健的推理能力评估，特别是在零样本场景中，重点关注模型进行逻辑推理和有效问题解决的能力。我们对30个代表性VLM的评估，包括专有和开源模型，揭示了它们在推理能力上的差距，突显了在评估中纳入CoT的重要性。我们的数据集可在https://huggingface.co/datasets/wenyb/AgriCoT获取。

Summary / 总结

AgroCoT is a VQA dataset designed to evaluate the reasoning capabilities of Vision-Language Models (VLMs) in agricultural contexts. It includes 4,759 samples that require Chain-of-Thought reasoning, addressing the limitations of existing benchmarks. The evaluation of 30 VLMs shows a significant gap in their reasoning abilities, highlighting the necessity of incorporating CoT for comprehensive assessments.

AgroCoT 是一个 VQA 数据集，旨在评估 Vision-Language 模型 (VLMs) 在农业场景中的推理能力。它包含 4,759 个样本，需要进行链式思考 (CoT) 推理，特别关注零样本场景。对 30 个 VLMs 的评估显示它们在推理能力方面存在显著差距，强调了在农业评估中采用 CoT 的必要性。

SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

Authors: Quentin Guimard, Federico Bartsch, Simone Caldarella, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini

Venue: CVPR

First: 2026-03-19T15:28:08+00:00 · Latest: 2026-03-19T15:28:08+00:00

Comments: CVPR Findings 2026. Project website: https://sparse-embedding-modulation.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
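
The intervention described above (encode into a sparse latent space, attenuate bias-relevant neurons while leaving query-relevant ones untouched, decode back) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the ReLU encoder and linear decoder, the `scale` parameter, and the assumption that the bias-relevant and query-relevant neuron indices are already known are all hypothetical. The abstract does not say how SEM identifies those neurons.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sae_encode(x, w_enc, b_enc):
    """Hypothetical SAE encoder: linear map followed by ReLU."""
    return [max(0.0, h + b) for h, b in zip(matvec(w_enc, x), b_enc)]


def sae_decode(z, w_dec):
    """Hypothetical SAE decoder: linear map back to embedding space."""
    return matvec(w_dec, z)


def debias_embedding(x, w_enc, b_enc, w_dec, bias_neurons, query_neurons, scale=0.0):
    """Attenuate bias-relevant latents, but never touch query-relevant ones."""
    z = sae_encode(x, w_enc, b_enc)
    protected = set(query_neurons)
    for i in bias_neurons:
        if i not in protected:
            z[i] *= scale  # scale=0.0 removes the neuron's contribution entirely
    return sae_decode(z, w_dec)
```

Because the edit happens on individual latent units after a non-linearity, the resulting change in embedding space is not a single linear projection, which is one way to read the abstract's "non-linear interventions".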

中文标题/摘要

标题：SEM：用于视觉-语言模型后验去偏的稀疏嵌入调制

连接视觉和语言的模型，如CLIP，是多模态AI的关键组成部分，但其大规模、未经筛选的训练数据引入了严重的社会和统计偏见。现有的后验消除偏见方法通常直接在密集的CLIP嵌入空间中操作，其中偏见和任务相关信息高度纠缠。这种纠缠限制了它们在不损害语义保真度的情况下去除偏见的能力。在本文中，我们提出了一种后验、零样本的稀疏嵌入调制（SEM）框架，该框架在稀疏自编码器（SAE）的潜在空间中操作。通过将CLIP文本嵌入分解为分离的特征，SEM识别并调节与偏见相关的神经元，同时保留与查询相关的神经元。这使得更精确、非线性的干预成为可能。在四个基准数据集和两个CLIP骨干网络上，SEM在检索和零样本分类中实现了显著的公平性提升。我们的结果表明，稀疏潜在表示为视觉-语言模型的后验消除偏见提供了有效的基础。

Summary / 总结

The paper proposes Sparse Embedding Modulation (SEM), a post-hoc debiasing method for vision-language models like CLIP. SEM operates in a Sparse Autoencoder latent space to disentangle bias and task-relevant information, allowing for more precise and non-linear interventions. Experiments show that SEM achieves significant improvements in fairness for retrieval and zero-shot classification tasks across multiple datasets and model architectures.

该研究提出了一种后处理去偏方法Sparse Embedding Modulation (SEM)，用于视觉-语言模型如CLIP。SEM在稀疏自编码器的潜在空间中操作，以分离偏见和任务相关的信息，从而实现更精确和非线性的干预。实验结果显示，SEM在多个数据集和模型架构上显著提高了检索和零样本分类任务的公平性。

Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

Authors: Hangeol Chang, Changsun Lee, Seungjoon Rho, Junho Yeo, Jong Chul Ye

First: 2026-03-19T15:15:58+00:00 · Latest: 2026-03-19T15:15:58+00:00

Abs · PDF · Code1 · Code2

Abstract

Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to drive the final decision. Existing RAG methods typically rely on a single initial query, which often favors topical relevance over decision-relevant evidence, and therefore retrieves background information that can fail to discriminate among answer options. To address this issue, here we propose Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, and then rewrites retrieval into three targeted queries that seek evidence to: (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify salient clues in the question. This approach enables context retrieval that is more directly aligned with answer selection, allowing the generator to confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at https://anonymous.4open.science/r/HCQR-1C2E.
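
As a rough illustration of the three-query rewriting step, the sketch below builds support, distinguish, and verify queries from a working hypothesis. Everything in it is an assumption made for illustration: the class and function names, the fixed string templates, and the idea that the hypothesis and salient clues are passed in by the caller. The abstract does not describe how HCQR actually produces the hypothesis or phrases its queries.

```python
from dataclasses import dataclass


@dataclass
class RewrittenQueries:
    support: str      # evidence for the working hypothesis
    distinguish: str  # evidence separating it from competing options
    verify: str       # evidence checking salient clues in the question


def rewrite_queries(question, options, hypothesis, clues):
    """Build three evidence-seeking queries (hypothetical templates)."""
    if hypothesis not in options:
        raise ValueError("hypothesis must be one of the candidate options")
    alternatives = [o for o in options if o != hypothesis]
    return RewrittenQueries(
        support=f"evidence that supports {hypothesis} for: {question}",
        distinguish=f"findings that distinguish {hypothesis} from {', '.join(alternatives)}",
        verify=f"significance of {', '.join(clues)}",
    )
```

Each of the three strings would then be sent to the retriever separately, so the retrieved context covers confirmation, discrimination, and clue verification rather than topical background alone.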

中文标题/摘要

标题：基于假设条件的查询重写以实现决策有用检索

检索增强生成（RAG）通过将生成与外部非参数知识相结合来提高大型语言模型（LLMs）。然而，当任务需要在竞争选项中进行选择时，仅将生成与广泛相关的内容相结合往往不足以驱动最终决策。现有的RAG方法通常依赖于单个初始查询，这往往倾向于主题相关性而非决策相关证据，因此检索到的信息可能无法区分答案选项。为了解决这一问题，我们提出了一种无需训练的预检索框架——基于假设条件的查询重写（HCQR），该框架将RAG从主题导向的检索转向证据导向的检索。HCQR首先从输入问题和候选选项中推导出一个轻量级的工作假设，然后将检索重写为三个针对特定证据的查询，以：（1）支持假设，（2）区分其与竞争替代方案，（3）验证问题中的关键线索。这种方法使上下文检索更直接地与答案选择对齐，使生成器能够根据检索到的证据确认或推翻初始假设。在MedQA和MMLU-Med上的实验表明，HCQR在平均准确性上始终优于单查询RAG和重新排名/过滤基线，分别提高了5.9和3.6个百分点。代码可在https://anonymous.4open.science/r/HCQR-1C2E获取。

Summary / 总结

The paper addresses a limitation of existing Retrieval-Augmented Generation (RAG) methods on tasks that require choosing among competing options by proposing Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework. HCQR derives a working hypothesis from the question and candidate options, then rewrites retrieval into three targeted queries that seek evidence to support the hypothesis, distinguish it from competing alternatives, and verify salient clues in the question. Experiments on MedQA and MMLU-Med show that HCQR outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively.

本文提出了一种假设条件下的查询重写方法（HCQR），以解决现有检索增强生成（RAG）方法在处理多个选项之间的决策时的局限性。HCQR将初始查询重写为三个有针对性的查询，以支持假设、区分其与替代方案以及验证问题中的线索，从而提高证据导向的检索效果。实验结果表明，HCQR在MedQA和MMLU-Med上优于单查询RAG和重排/过滤基线，平均准确率相对Simple RAG分别提高了5.9和3.6个百分点。

CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods

Authors: Qinqian Lei, Bo Wang, Robby T. Tan

Venue: CVPR 2026

First: 2025-08-26T07:30:53+00:00 · Latest: 2026-03-19T15:13:14+00:00

Comments: Accepted by CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

HOI detection has long been dominated by task-specific models, sometimes with early vision-language backbones such as CLIP. With the rise of large generative VLMs, a key question is whether standalone VLMs can perform HOI detection competitively against specialized HOI methods. Existing benchmarks such as HICO-DET require exact label matching under incomplete annotations, so any unmatched prediction is marked wrong. This unfairly penalizes valid outputs, especially from less constrained VLMs, and makes cross-paradigm comparison unreliable. To address this limitation, we introduce CrossHOI-Bench, a multiple-choice HOI benchmark with explicit positives and curated negatives, enabling unified and reliable evaluation of both VLMs and HOI-specific models. We further focus on challenging scenarios, such as multi-person scenes and fine-grained interaction distinctions, which are crucial for revealing real differences between the two paradigms. Experiments show that large VLMs achieve competitive, sometimes superior, zero-shot performance, yet they struggle with multiple concurrent actions and with correctly assigning interactions to the target person. Conversely, HOI-specific methods remain weaker in general HOI reasoning but demonstrate stronger multi-action recognition and more reliable identification of which person performs which action. These findings expose complementary strengths and weaknesses of VLMs and HOI-specific methods, which existing benchmarks fail to reveal due to incorrect penalization.

中文标题/摘要

标题：CrossHOI-Bench：跨范式的HOI评估统一基准

长期以来，HOI检测主要由专门的任务模型主导，有时甚至使用早期的跨模态视觉语言模型，如CLIP。随着大型生成型VLMs的兴起，一个关键问题是，独立的VLMs是否能够与专门的HOI方法竞争进行HOI检测。现有的基准，如HICO-DET，要求精确标签匹配，在不完整注释的情况下，任何未匹配的预测都会被标记为错误。这不公平地惩罚了有效的输出，尤其是来自较不约束的VLMs的输出，使得跨范式比较不可靠。为了解决这一局限性，我们引入了CrossHOI-Bench，这是一个具有明确正例和精心挑选负例的多项选择HOI基准，使VLMs和HOI特定模型的统一和可靠评估成为可能。我们进一步关注具有挑战性的场景，如多人场景和精细的交互区分，这对于揭示两种范式之间的真正差异至关重要。实验表明，大型VLMs在零样本情况下表现出竞争力，甚至有时更优，但它们在处理多个并发动作和正确分配交互给目标人物方面存在困难。相反，HOI特定方法在一般HOI推理方面仍然较弱，但在多动作识别和准确识别哪个执行者执行哪个动作方面表现出更强的能力。这些发现揭示了VLMs和HOI特定方法的互补优势和劣势，而现有的基准由于错误的惩罚未能揭示这些差异。

Summary / 总结

The research aims to evaluate the performance of vision-language models and HOI-specific methods in HOI detection by introducing CrossHOI-Bench, a new benchmark that uses multiple-choice questions with explicit positives and curated negatives. The experiments show that large VLMs can achieve competitive zero-shot performance but struggle with multiple concurrent actions and correctly assigning interactions to the target person, while HOI-specific methods are better at recognizing multiple actions and identifying the performer of each action. This benchmark reveals the strengths and weaknesses of both paradigms, which existing benchmarks fail to capture due to incorrect penalization of unmatched predictions.

研究旨在评估视觉语言模型（VLMs）和HOI特定模型在人类物体交互（HOI）检测中的性能。引入了CrossHOI-Bench作为具有明确正样本和精心筛选负样本的多项选择基准，解决了现有基准的局限性。实验表明，大型VLMs在性能上具有竞争力，但在处理多个并发动作和将交互分配给正确的人时存在困难，而HOI特定方法在多动作识别和确定每个动作的执行者方面表现出色，突显了两种方法的优势和劣势。

How to Take a Memorable Picture? Empowering Users with Actionable Feedback

Authors: Francesco Laiti, Davide Talon, Jacopo Staiano, Elisa Ricci

Venue: CVPR 2026

First: 2026-02-25T13:02:35+00:00 · Latest: 2026-03-19T15:10:18+00:00

Comments: Accepted @ CVPR 2026. Project page: https://laitifranz.github.io/MemCoach/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Image memorability, i.e., how likely an image is to be remembered, has traditionally been studied in computer vision either as a passive prediction task, with models regressing a scalar score, or with generative methods altering the visual input to boost the image's likelihood of being remembered. Yet, none of these paradigms supports users at capture time, when the crucial question is how to improve a photo's memorability. We introduce the task of Memorability Feedback (MemFeed), where an automated model should provide actionable, human-interpretable guidance to users with the goal of enhancing an image's future recall. We also present MemCoach, the first approach designed to provide concrete suggestions in natural language for memorability improvement (e.g., "emphasize facial expression," "bring the subject forward"). Our method, based on Multimodal Large Language Models (MLLMs), is training-free and employs a teacher-student steering strategy, aligning the model's internal activations toward more memorable patterns learned from a teacher model progressing along least-to-most memorable samples. To enable systematic evaluation on this novel task, we further introduce MemBench, a new benchmark featuring sequence-aligned photoshoots with annotated memorability scores. Our experiments, considering multiple MLLMs, demonstrate the effectiveness of MemCoach, showing consistently improved performance over several zero-shot models. The results indicate that memorability can not only be predicted but also taught and instructed, shifting the focus from mere prediction to actionable feedback for human creators.

中文标题/摘要

标题：如何拍摄令人难忘的照片？通过可操作反馈赋能用户

图像的难忘性，即图像被记住的可能性，传统上在计算机视觉中要么作为被动预测任务进行研究，模型回归一个标量分数，要么使用生成方法改变视觉输入以提高图像被记住的可能性。然而，这些范式在拍摄时并不支持用户，而此时的关键问题是如何提高照片的难忘性。我们引入了难忘性反馈（MemFeed）任务，其中自动化模型应提供可操作的、人类可理解的指导，以提高图像未来被回忆起的可能性。我们还提出了MemCoach，这是第一个提供具体自然语言建议以提高难忘性的方法（例如，“强调面部表情”，“将主体前移”）。我们的方法基于多模态大型语言模型（MLLMs），无需训练，并采用教师-学生引导策略，使模型内部激活与教师模型沿从最不难忘到最难忘的样本推进时学到的更难忘模式对齐。为了在这一新任务上进行系统评估，我们进一步引入了MemBench，这是一个新的基准，包含序列对齐的照片拍摄，并附有标注的难忘性分数。我们的实验，考虑了多个MLLMs，证明了MemCoach的有效性，显示出在多个零样本模型上的一致改进。结果表明，难忘性不仅可以被预测，还可以被教授和指导，将重点从单纯的预测转移到对人类创作者的可操作反馈。

Summary / 总结

The paper introduces the task of Memorability Feedback (MemFeed), where an automated model provides actionable guidance to users to enhance the memorability of their photos. The method, MemCoach, uses Multimodal Large Language Models to offer concrete suggestions in natural language, such as 'emphasize facial expression.' Experiments show that MemCoach outperforms several zero-shot models, demonstrating that memorability can be both predicted and instructed, shifting the focus from mere prediction to actionable feedback for human creators.

论文提出了记忆反馈（MemFeed）任务，自动模型为用户提供建议以提升照片的记忆力，如建议‘强调面部表情’。实验表明，MemCoach 方法优于多个零样本模型，展示了不仅可以预测记忆力，还可以进行教学和指导，从单纯的预测转向为创作者提供可操作的反馈。

VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

Authors: Jiayi Yuan, Haobo Jiang, De Wen Soh, Na Zhao

First: 2026-03-19T14:18:17+00:00 · Latest: 2026-03-19T14:18:17+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.

中文标题/摘要

标题：VGGT-360：几何一致的零样本全景深度估计

本文提出了VGGT-360，这是一种无需训练的新型框架，用于实现零样本、几何一致的全景深度估计。与之前的视图无关的无需训练方法不同，VGGT-360将任务重新表述为利用VGGT类基础模型的内在三维一致性，通过多视图重建的全景重投影来统一视图分割的推理，从而形成一个连贯的全景理解。为了实现稳健且准确的估计，VGGT-360整合了三个即插即用模块，形成了一个统一的全景到三维再到深度框架：(i) 不确定性引导的自适应投影将全景图切分为透视视图，以弥合全景输入与VGGT的透视先验之间的领域差距。它估计基于梯度的不确定性，将更密集的视图分配给几何贫瘠区域，为VGGT提供几何信息丰富的输入。(ii) 结构显著性增强的注意力在三维重建过程中增强VGGT的鲁棒性，通过将结构感知的置信度注入其注意力层，引导关注几何可靠区域，增强跨视图的一致性。(iii) 相关加权三维模型校正通过使用注意力推断的相关分数重新加权重叠点，以提供准确的全景重投影的一致几何基础。广泛的实验表明，VGGT-360在多个分辨率和多种室内外数据集上均优于已训练和无需训练的最新方法。

Summary / 总结

VGGT-360 is a novel training-free framework for zero-shot panoramic depth estimation that leverages 3D consistency of VGGT-like models to unify per-view reasoning into a coherent panoramic understanding. It integrates three modules: uncertainty-guided adaptive projection, structure-saliency enhanced attention, and correlation-weighted 3D model correction, which collectively enhance robustness and accuracy. Experiments demonstrate that VGGT-360 outperforms both trained and training-free state-of-the-art methods across various datasets and resolutions.

VGGT-360 是一个无需训练的框架，用于全景深度估计，通过利用 3D 一致性将单视图推理统一到一个连贯的全景理解中。它整合了三个模块：不确定性引导自适应投影、结构显著性增强注意力和相关加权 3D 模型校正。实验表明，VGGT-360 在各种分辨率和不同室内外数据集上均优于需要训练和无需训练的最新方法。

MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

Authors: Youngwan Lee, Soojin Jang, Yoorhim Cho, Seunghwan Lee, Yong-Ju Lee, Sung Ju Hwang

First: 2026-03-19T13:33:26+00:00 · Latest: 2026-03-19T13:33:26+00:00

Comments: Project page: https://youngwanlee.github.io/multihopspatial

Abs · PDF · Code1 · Code2 · Project1

Abstract

Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
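
A metric of the kind described in contribution (2) can be sketched as follows: a sample counts as correct only when the selected answer matches and the predicted box overlaps the reference box sufficiently. This is a minimal sketch under stated assumptions: the 0.5 IoU threshold is inferred from the metric's name, and the `(x1, y1, x2, y2)` box format and the dictionary keys are hypothetical. The benchmark's official scoring code may differ.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_50_iou(samples, threshold=0.5):
    """Fraction of samples with a correct answer AND box IoU >= threshold."""
    if not samples:
        return 0.0
    hits = sum(
        1
        for s in samples
        if s["pred_answer"] == s["gold_answer"]
        and box_iou(s["pred_box"], s["gold_box"]) >= threshold
    )
    return hits / len(samples)
```

Requiring both conditions means a model cannot score by guessing the right option while pointing at the wrong region.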

中文标题/摘要

标题：MultihopSpatial：面向视觉语言模型的多跳组合空间推理基准

空间推理是视觉语言模型（VLMs）的基础，尤其是在作为视觉语言行动（VLA）代理在物理环境中部署时。然而，现有的基准测试主要集中在基础的单跳关系上，忽视了多跳组合推理和精确的视觉定位，这对于现实世界场景至关重要。为了解决这一问题，我们引入了MultihopSpatial，提供了三个关键贡献：（1）一个旨在进行多跳和组合空间推理的全面基准测试，涵盖从1到3跳的复杂查询，跨越多种空间视角。（2）Acc@50IoU，一个补充性度量标准，同时评估推理和视觉定位，要求进行答案选择和精确边界框预测——这些能力对于稳健的VLA部署至关重要。（3）MultihopSpatial-Train，一个专门的大规模训练语料库，以促进空间智能的发展。对37个最先进的VLMs的广泛评估揭示了八个关键见解，表明组合空间推理仍然是一个严峻的挑战。最后，我们证明了在我们的语料库上进行强化学习后训练可以提高VLM的内在空间推理能力和下游实体操作性能。

Summary / 总结

The research introduces MultihopSpatial, a benchmark for multi-hop and compositional spatial reasoning in Vision-Language Models (VLMs), addressing the limitations of existing benchmarks. It includes 1- to 3-hop complex queries and a metric Acc@50IoU that evaluates both reasoning and visual grounding. The evaluation of 37 state-of-the-art VLMs reveals that compositional spatial reasoning is challenging, and post-training reinforcement learning improves both spatial reasoning and embodied manipulation performance.

研究旨在提高视觉-语言模型（VLMs）在多跳和组合空间推理方面的性能，这对于实际应用至关重要。研究引入了MultihopSpatial基准，包含1-到3跳的复杂查询，并提出了一种新的评估指标Acc@50IoU，同时评估推理和视觉定位能力。对37个最先进的VLMs的评估显示，组合空间推理具有挑战性，而通过MultihopSpatial数据集进行强化学习后训练可以提高空间推理能力和下游的实体操作性能。

LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer

Authors: Song Fei, Tian Ye, Lujia Wang, Lei Zhu

First: 2025-09-26T14:39:08+00:00 · Latest: 2026-03-19T12:57:49+00:00

Comments: Project Page: https://w2genai-lab.github.io/LucidFlux

Abs · PDF · Code1 · Code2 · Project1

Abstract

Image restoration (IR) aims to recover images degraded by unknown mixtures while preserving semantics, conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free IR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbone's hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or Vision-Language Model (VLM) captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on, rather than adding parameters or relying on text prompts, is the governing lever for robust and caption-free image restoration in the wild.

Summary / 总结

LucidFlux is a caption-free image restoration framework that uses a large diffusion transformer to recover images degraded by unknown factors while preserving semantics. It introduces a lightweight dual-branch conditioner and a timestep- and layer-adaptive modulation schedule to protect global structure and recover texture. The method avoids the use of text prompts or VLM captions, instead using SigLIP features for semantic alignment. Experiments show that LucidFlux outperforms strong open-source and commercial baselines across synthetic and real-world benchmarks, and ablation studies confirm the importance of each component.

LucidFlux 是一个无字幕的图像恢复框架，使用大型扩散变换器来恢复未知因素导致退化的图像，同时保留语义。它引入了一个轻量级的双分支条件器和时间步长和层自适应调制计划，以保护全局结构并恢复纹理。该方法避免使用文本提示或视觉语言模型字幕，而是使用代理的 SigLIP 特征进行语义对齐。实验表明，LucidFlux 在合成和真实世界基准测试中均优于强大的开源和商用基线，并且消融研究证实了每个组件的重要性。

HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

Authors: Xiangyu Bai, Bishoy Galoaa, Sarah Ostadabbas

First: 2026-03-19T12:53:32+00:00 · Latest: 2026-03-19T12:53:32+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet's policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing what a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at https://github.com/ostadabbas/HORNet.
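
The core of GRPO is that each sampled action is scored relative to the other samples drawn for the same input, rather than against a learned value function. The sketch below shows that group-relative normalization for one question. It is an illustrative sketch only: the function name, the use of population standard deviation, the `eps` constant, and the idea that each reward reflects whether the frozen VLM answered correctly with a sampled frame subset are assumptions, since the abstract does not specify HORNet's reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled frame selections.

    rewards: one scalar per sampled selection for the same question
             (hypothetically, 1.0 if the frozen VLM answered correctly).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Selections that lead to better-than-average answers within the group receive positive advantages and are reinforced. When every selection in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.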

中文标题/摘要

标题：HORNet：基于任务指导的视频问答中帧选择策略

视频问答（VQA）使用视觉-语言模型（VLMs）依赖于从输入视频中选择哪些帧，但大多数系统依赖于均匀或启发式的采样，这些采样无法优化下游问答质量。我们引入了HORNet，这是一种通过组相对策略优化（GRPO）训练的轻量级帧选择策略，以学习冻结的VLM需要查看哪些帧才能正确回答问题。HORNet的可训练参数少于100万，可减少输入帧多达99%和VLM处理时间多达93%，同时在短格式基准上提高答案质量（MSVD-QA上的F1分数提高1.7%），并在时间推理任务上表现出色（NExT-QA上比均匀采样高7.3分）。我们将此任务形式化为“选择任意帧”（SAF），该任务将视觉输入的策划与VLM推理解耦，并表明GRPO训练的选择在分布外泛化能力优于监督学习和PPO替代方案。HORNet的策略无需重新训练即可迁移到不同的VLM回答器，与更强的模型配对时可额外获得8.5%的相对增益。在涵盖341,877个问答对和114.2小时视频的六个基准测试上的评估结果表明，优化VLM所见的内容是对优化其生成内容的一种实用且互补的替代方案，同时提高了效率。代码可在https://github.com/ostadabbas/HORNet/获取。

Summary / 总结

HORNet is a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to optimize frame selection for video question answering with vision-language models (VLMs). It reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks and achieving strong performance on temporal reasoning tasks. HORNet's policy transfers across different VLMs, yielding additional gains. Evaluated across six benchmarks, the results show that optimizing what a VLM sees is practical and complementary to optimizing what it generates, improving efficiency.

HORNet 是一种通过组相对策略优化（GRPO）训练的轻量级帧选择策略，用于优化视频问答（VQA）中视觉语言模型（VLMs）的帧选择。它将输入帧减少高达99%，并将VLM处理时间减少高达93%，同时在短格式基准上提高答案质量，并在时间推理任务上表现出色。HORNet 的策略在不同 VLM 上无需重新训练即可转移，从而获得额外收益。在六个基准测试中评估了超过341,877个问答对和114.2小时的视频，结果表明优化 VLM 所见的内容是提高效率和优化生成内容的实用且互补的方法。

Activation Quantization of Vision Encoders Needs Prefixing Registers

Authors: Seunghyeon Kim, Taesun Yeom, Jinho Kim, Wonpyo Park, Kyuyeun Kim, Jaeho Lee

First: 2025-10-06T07:27:46+00:00 · Latest: 2026-03-19T12:18:57+00:00

Comments: under review; 28 pages, 9 figures

Abs · PDF · Code1 · Code2

Abstract

Large pretrained vision encoders are central to multimodal intelligence, powering applications from on-device vision processing to vision-language models. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Quantization offers a practical path, but it remains challenging even at 8-bit precision due to so-called outliers. In this work, we propose RegCache, a training-free algorithm that mitigates outliers in large-scale pretrained vision encoders and serves as a plug-in module that can be applied on top of other quantization methods. RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the vision encoder, which prevent other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experimental results show that our method consistently improves quantized model performance across various vision encoders, particularly in extremely low-bit regimes (e.g., 4-bit).
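
To see why activation outliers make low-bit quantization hard, consider symmetric absmax quantization, where one scale factor is chosen from the largest magnitude in the tensor. The sketch below is a generic illustration of that effect, not RegCache itself. The function names and the toy activation values are hypothetical.

```python
def quantize_absmax(values, bits=8):
    """Symmetric per-tensor absmax quantization; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(reference, approx):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


# Toy activations (hypothetical values) with and without a single outlier.
clean = [0.11, -0.42, 0.37, 0.05]
err_clean = mean_abs_error(clean, quantize_absmax(clean, bits=4))
with_outlier = quantize_absmax(clean + [60.0], bits=4)[:4]
err_outlier = mean_abs_error(clean, with_outlier)
```

With the outlier present, the shared scale becomes so coarse that every ordinary activation rounds to zero at 4 bits. That illustrates why confining outliers to a few dedicated tokens could leave the remaining tokens easier to quantize.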

中文标题/摘要

标题：视觉编码器的激活量化需要前缀寄存器

大型预训练视觉编码器是多模态智能的核心，驱动着从设备端视觉处理到视觉语言模型的各种应用。由于这些应用通常需要实时处理大量视觉数据，因此降低视觉编码器的推理成本至关重要。量化提供了一条可行的路径，但即使在8位精度下仍面临挑战，主要是所谓的异常值问题。在本工作中，我们提出了一种名为RegCache的无训练算法，该算法可以缓解大型预训练视觉编码器中的异常值问题，并作为可插拔模块应用于其他量化方法之上。RegCache向视觉编码器引入易产生异常值但语义上无意义的前缀标记，从而防止其他标记产生异常值。值得注意的是，我们观察到视觉编码器中的异常值与语言模型中的异常值行为不同，这促使我们提出了两种技术创新：中间层前缀和标记删除。实验结果表明，我们的方法在各种视觉编码器中都能提高量化模型的性能，特别是在极低位宽（例如4位）的情况下。

Summary / 总结

This work addresses the challenge of reducing the inference cost of large pretrained vision encoders by proposing RegCache, a training-free algorithm that mitigates outliers through the introduction of prefix tokens. The method is designed as a plug-in module that can be applied on top of other quantization techniques. Experimental results demonstrate that RegCache improves quantized model performance, especially in low-bit regimes like 4-bit, by preventing outliers in vision encoders, which behave differently from those in language models.

本文旨在通过激活量化减少大型预训练视觉编码器的推理成本。提出了一种名为RegCache的无训练方法，通过在视觉编码器中添加前缀标记来缓解异常值问题。该方法在低位宽区间（如4位）表现出色，并且在各种视觉编码器中表现出一致的性能提升。
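
To make the outlier problem concrete, the sketch below applies symmetric per-tensor uniform quantization to a toy activation vector with and without one large entry. It illustrates the general failure mode that the abstract attributes to outliers; it is not RegCache itself, and the function names and numbers are hypothetical.

```python
def quantize_dequantize(values, bits):
    """Symmetric per-tensor uniform quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax                # one scale shared by all entries
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(values, bits, first_n):
    """Mean absolute quantization error over the first `first_n` entries."""
    restored = quantize_dequantize(values, bits)
    pairs = zip(values[:first_n], restored[:first_n])
    return sum(abs(a - b) for a, b in pairs) / first_n


# Toy activations in [-0.8, 0.8]; the numbers are made up for illustration.
normal = [0.1 * i - 0.8 for i in range(17)]
with_outlier = normal + [60.0]            # one hypothetical outlier entry

# Error is measured on the 17 ordinary entries in both cases.
err_clean = mean_abs_error(normal, bits=4, first_n=len(normal))
err_outlier = mean_abs_error(with_outlier, bits=4, first_n=len(normal))
print(f"4-bit error without outlier: {err_clean:.4f}")
print(f"4-bit error with outlier:    {err_outlier:.4f}")
```

Because the single large entry sets the shared scale, every ordinary entry rounds to zero in the second case. Keeping outliers away from the tokens that carry information, as the prefix tokens are described as doing, avoids paying that cost.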

Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Authors: Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler

First: 2026-03-19T11:46:01+00:00 · Latest: 2026-03-19T11:46:01+00:00

Abstract

Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.

中文标题/摘要

标题：Perceptio：通过空间标记生成增强感知的视觉语言模型

大型视觉语言模型（LVLMs）在语义理解方面表现出色，但在精细的空间定位方面存在困难，因为模型必须隐式推断复杂的几何结构，而从未产生过空间解释。我们提出了Perceptio，这是一种具有2D和3D空间推理能力的感知增强LVLM，通过在自回归序列中直接生成显式的语义分割标记和深度标记来实现。具体来说，我们(i)从强大的单目教师模型中蒸馏出VQ-VAE深度码本，将稠密深度离散化为紧凑序列，(ii)在LLM中集成基于SAM2的语义分割标记和VQ-VAE深度标记，使模型首先输出空间标记，然后进行回答。为了稳定深度标记生成，我们引入了新颖的复合深度标记目标（标识符损失、标记损失和计数损失）和用于可微重构的软合并技术。我们采用跨多种数据集的多任务协同训练策略，让模型学习感知标记以应对多种下游任务。基于InternVL，Perceptio在基准测试中实现了最先进的性能：在RefCOCO/+/g上将指代表达分割的cIoU分别提高0.8/1.4/1.1，在HardBLINK上将空间理解准确率提高10.3%，在MMBench上将准确率提高1.0%，证明了显式空间思维链对LVLM空间定位的实质性增强。

Summary / 总结

Perceptio is a perception-enhanced Vision Language Model that integrates 2D and 3D spatial reasoning capabilities by generating semantic segmentation and depth tokens within the autoregressive sequence. It uses a VQ-VAE depth codebook and SAM2-based semantic segmentation tokens to improve spatial grounding. Perceptio achieves state-of-the-art performance on benchmarks, including a 10.3% improvement in spatial understanding accuracy on HardBLINK and a 1.0% increase in MMBench accuracy.

Perceptio 是一种通过在自回归序列中生成显式的语义分割和深度令牌来增强 2D 和 3D 空间推理的视觉语言模型。该模型使用 VQ-VAE 深度码本和基于 SAM2 的语义分割令牌来改善空间定位。Perceptio 在基准测试中实现了最先进的性能，其中 HardBLINK 空间理解准确率提高 10.3%，MMBench 准确率提高 1.0%。
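
As a rough illustration of how a VQ-VAE-style codebook turns dense depth into a compact token sequence, the sketch below maps each depth patch to its nearest codebook entry and wraps the resulting ids in marker tokens. The codebook values, patch vectors, and the `<depth_start>` / `<depth_end>` / `<d…>` token strings are hypothetical; the abstract does not specify the actual vocabulary or patching scheme.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared L2 distance)."""
    distances = [
        sum((v - c) ** 2 for v, c in zip(vector, code)) for code in codebook
    ]
    return distances.index(min(distances))


def tokenize_depth(patches, codebook):
    """Map each depth patch (a flat list of floats) to a discrete token id."""
    return [nearest_code(patch, codebook) for patch in patches]


def reconstruct_depth(tokens, codebook):
    """Look the token ids back up to get a quantized depth approximation."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical 3-entry codebook over 2-value depth patches.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
patches = [[0.1, 0.2], [4.0, 6.0], [0.9, 1.2]]

tokens = tokenize_depth(patches, codebook)
# Hypothetical marker tokens delimit the depth span in the output sequence.
sequence = ["<depth_start>"] + [f"<d{t}>" for t in tokens] + ["<depth_end>"]
print(tokens)
print(sequence)
```

In this picture the language model would emit the delimited depth span before its textual answer, which is the "spatial tokens first, then answer" ordering the abstract describes.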

Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

Authors: Shaked Perek, Ben Wiesel, Avihu Dekel, Nimrod Shabtay, Eli Schwartz

First: 2026-03-19T09:21:49+00:00 · Latest: 2026-03-19T09:21:49+00:00

Abstract

Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long <think> traces overshadow short but task-critical <answer> segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the <think> segment, SCALe-SFT gradually shifts the focus from <think> to <answer> throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.

中文标题/摘要

标题：平衡思维：提升视觉语言模型的推理训练

视觉语言模型（VLMs）中的多模态推理通常依赖于两阶段过程：监督微调（SFT）和强化学习（RL）。在标准的SFT中，所有标记对损失的贡献是平等的，即使推理数据本质上是标记不平衡的。长的<think>推理轨迹掩盖了短但对任务至关重要的<answer>片段，导致冗长的推理和不准确的答案。我们提出了SCALe（Scheduled Curriculum Adaptive Loss，调度式课程自适应损失），它通过动态的、与长度无关的加权，明确地分离对推理片段和答案片段的监督。与过度加权<think>片段的普通SFT不同，SCALe-SFT通过余弦调度策略在整个训练过程中逐渐将重点从<think>转移到<answer>，鼓励简洁且有根据的推理。我们在多种基准和架构上评估了SCALe。结果显示，SCALe在准确性上始终优于普通SFT，并且在训练时间仅为完整两阶段SFT + GRPO流水线的约七分之一的情况下达到了相当的性能，使其成为一种轻量级但有效的替代方案。当与GRPO结合使用时，SCALe实现了最佳的整体性能，突显了其作为独立方法以及作为后续强化学习优化的坚实基础的价值。

Summary / 总结

The paper addresses token imbalance in supervised fine-tuning of vision-language models: long <think> traces dominate the loss over short but task-critical <answer> segments, leading to verbose reasoning and inaccurate answers. It introduces SCALe, which separates supervision over reasoning and answer segments using dynamic, length-independent weighting and shifts the focus from <think> to <answer> during training via a cosine schedule. Experiments show that SCALe improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative.

论文针对视觉-语言模型中监督微调过程中推理数据往往被长推理片段主导而忽视关键答案片段的问题，提出了一种名为SCALe的方法，该方法在训练过程中动态调整推理和答案片段之间的损失权重。实验结果显示，SCALe在各种基准测试中提高了准确性，并且与完整的两阶段训练管道相比，所需训练时间仅为七分之一，使其成为一个轻量级且有效的解决方案。
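
The weighting idea can be sketched in a few lines. The example below averages the loss within each segment, so that segment length no longer determines its influence, and then mixes the two averages with a cosine-scheduled weight that moves from the reasoning segment toward the answer segment. The start and end weights (0.2 and 0.8) and all function names are assumptions for illustration; the abstract does not give the exact schedule used by SCALe.

```python
import math


def cosine_answer_weight(step, total_steps, w_start=0.2, w_end=0.8):
    """Answer-segment weight that rises from w_start to w_end along a cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_start + (w_end - w_start) * 0.5 * (1.0 - math.cos(math.pi * progress))


def scheduled_loss(think_losses, answer_losses, step, total_steps):
    """Length-independent loss: per-segment means mixed by the schedule."""
    think_mean = sum(think_losses) / len(think_losses)
    answer_mean = sum(answer_losses) / len(answer_losses)
    w_answer = cosine_answer_weight(step, total_steps)
    return (1.0 - w_answer) * think_mean + w_answer * answer_mean


def vanilla_loss(think_losses, answer_losses):
    """Standard SFT: every token contributes equally."""
    all_losses = list(think_losses) + list(answer_losses)
    return sum(all_losses) / len(all_losses)


# Hypothetical per-token losses: a long reasoning trace and a short answer.
think = [1.0] * 98
answer = [3.0] * 2

print(vanilla_loss(think, answer))              # dominated by the 98 think tokens
print(scheduled_loss(think, answer, 0, 100))    # early: mostly think
print(scheduled_loss(think, answer, 100, 100))  # late: mostly answer
```

With token-level averaging, the two answer tokens account for only 2% of the loss terms regardless of how wrong they are; the per-segment mean removes that dependence on length before the schedule is applied.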

MeInTime: Bridging Age Gap in Identity-Preserving Face Restoration

Authors: Teer Song, Yue Zhang, Yu Tian, Ziyang Wang, Xianlin Zhang, Guixuan Zhang, Xuan Liu, Xueming Li, Yasen Zhang

First: 2026-03-19T09:11:07+00:00 · Latest: 2026-03-19T09:11:07+00:00

Abstract

To better preserve an individual's identity, face restoration has evolved from reference-free to reference-based approaches, which leverage high-quality reference images of the same identity to enhance identity fidelity in the restored outputs. However, most existing methods implicitly assume that the reference and degraded input are age-aligned, limiting their effectiveness in real-world scenarios where only cross-age references are available, such as historical photo restoration. This paper proposes MeInTime, a diffusion-based face restoration method that extends reference-based restoration from same-age to cross-age settings. Given one or few reference images along with an age prompt corresponding to the degraded input, MeInTime achieves faithful restoration with both identity fidelity and age consistency. Specifically, we decouple the modeling of identity and age conditions. During training, we focus solely on effectively injecting identity features through a newly introduced attention mechanism and introduce Gated Residual Fusion modules to facilitate the integration between degraded features and identity representations. At inference, we propose Age-Aware Gradient Guidance, a training-free sampling strategy, using an age-driven direction to iteratively nudge the identity-aware denoising latent toward the desired age semantic manifold. Extensive experiments demonstrate that MeInTime outperforms existing face restoration methods in both identity preservation and age consistency. Our code is available at: https://github.com/teer4/MeInTime

中文标题/摘要

标题：MeInTime: 跨年龄身份保留面部恢复方法

为了更好地保留个体的身份，面部恢复从无参考方法发展到了有参考方法，利用同一身份的高质量参考图像来增强恢复输出中的身份保真度。然而，现有大多数方法隐含地假设参考图像和退化输入在年龄上是匹配的，这限制了它们在只有跨年龄参考图像可用的真实场景中的效果，例如历史照片恢复。本文提出了一种基于扩散的面部恢复方法MeInTime，该方法将参考方法从同龄扩展到跨龄设置。给定一个或几个参考图像以及与退化输入对应的年龄提示，MeInTime 能够实现身份保真度和年龄一致性兼具的恢复。具体而言，我们解耦身份和年龄条件的建模。在训练过程中，我们专注于通过新引入的注意力机制有效注入身份特征，并引入门控残差融合模块以促进退化特征与身份表示的整合。在推理过程中，我们提出了一种无需训练的年龄感知梯度引导策略，使用年龄驱动的方向逐步引导身份感知去噪潜在变量向所需的年龄语义流形。大量实验表明，MeInTime 在身份保留和年龄一致性方面均优于现有面部恢复方法。我们的代码可在 https://github.com/teer4/MeInTime 获取

Summary / 总结

MeInTime is a diffusion-based face restoration method designed to handle cross-age references, enhancing identity fidelity and age consistency in restored outputs. By decoupling identity and age conditions, MeInTime uses a newly introduced attention mechanism and Gated Residual Fusion modules during training, and an Age-Aware Gradient Guidance strategy at inference. Experiments show that MeInTime outperforms existing methods in both identity preservation and age consistency.

MeInTime 是一种基于扩散的面部恢复方法，旨在在只有跨年龄参考的情况下，实现面部恢复的同时保持身份一致性和年龄一致性。该方法将身份和年龄条件分离，通过引入注意力机制和门控残差融合模块进行训练，并在推理时使用年龄导向的方向进行迭代调整。实验表明，MeInTime 在保持身份和年龄一致性方面优于现有方法。

Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering

Authors: Jiayi Luo, Jiayu Chen, Jiankun Wang, Cong Wang, Hanxin Zhu, Qingyun Sun, Chen Gao, Zhibo Chen, Jianxin Li

First: 2026-03-19T09:00:08+00:00 · Latest: 2026-03-19T09:00:08+00:00

Abstract

Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention techniques to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to 1.93× speedup while maintaining a PSNR of up to 29 dB on Wan2.1.

中文标题/摘要

标题：基于离线层间稀疏性表征和在线双向共聚类的无需训练的稀疏注意力以实现快速视频生成

扩散变换器（DiTs）在视频生成质量上表现出色，但由于密集的3D注意力导致推理成本高，因此开发了稀疏注意力技术以提高效率。然而，现有的无需训练的视频生成稀疏注意力方法仍然面临两个未解决的局限性：忽略注意力剪枝中的层间异质性以及忽略查询-键耦合在块分割中的作用，这阻碍了更好的质量-加速权衡。在本文中，我们揭示了一个关键见解，即每个层的注意力稀疏性是其固有的属性，不同输入之间的影响较小。受此启发，我们提出了SVOO，一种基于离线层间稀疏性表征和在线双向共聚类的无需训练的稀疏注意力框架。具体而言，SVOO采用两阶段范式：（i）离线层间灵敏度表征以推导固有的每层剪枝水平，（ii）通过一种新颖的双向共聚类算法实现块级稀疏注意力。在七个广泛使用的视频生成模型上的大量实验表明，SVOO在质量-加速权衡上优于最先进的方法，同时在Wan2.1上保持高达29 dB的PSNR，实现高达1.93倍的加速。

Summary / 总结

This work addresses the high inference cost of diffusion transformers (DiTs) in video generation by proposing SVOO, a training-free sparse attention framework. SVOO uses offline layer-wise sensitivity profiling to determine intrinsic pruning levels and an online bidirectional co-clustering algorithm for block-wise sparse attention. Experiments show that SVOO outperforms existing methods, achieving up to 1.93 times speedup with PSNR up to 29 dB on Wan2.1.

本文通过提出SVOO，一种无需训练的稀疏注意力框架，解决了扩散变换器(DiT)在视频生成中的高推理成本问题。SVOO包括离线逐层灵敏度分析以确定各层固有的剪枝水平，以及用于块级稀疏注意力的在线双向共聚类算法。实验表明，SVOO优于现有方法，实现最高1.93倍的加速，同时在Wan2.1上保持最高29 dB的PSNR。
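
One way to picture offline layer-wise profiling is to measure, for each layer, what fraction of keys is needed to retain most of the attention mass, and then reuse that fraction as the layer's pruning level at inference. The sketch below does this with a simple attention-mass criterion. That criterion is a stand-in chosen for illustration, not the sensitivity measure used by SVOO, and the layer names and attention rows are made up.

```python
def keep_ratio_for_mass(attn_row, target_mass=0.95):
    """Smallest fraction of keys whose attention weights sum to target_mass."""
    cumulative = 0.0
    for kept, weight in enumerate(sorted(attn_row, reverse=True), start=1):
        cumulative += weight
        if cumulative >= target_mass - 1e-12:   # tolerance for float rounding
            return kept / len(attn_row)
    return 1.0


def profile_layers(layer_rows, target_mass=0.95):
    """Average keep ratio per layer over a set of calibration attention rows."""
    profile = {}
    for layer, rows in layer_rows.items():
        ratios = [keep_ratio_for_mass(row, target_mass) for row in rows]
        profile[layer] = sum(ratios) / len(ratios)
    return profile


# Hypothetical softmax attention rows collected on calibration inputs.
calibration = {
    "layer_0": [[0.97, 0.01, 0.01, 0.01]],   # sharply peaked: prune heavily
    "layer_1": [[0.25, 0.25, 0.25, 0.25]],   # uniform: keep everything
}
print(profile_layers(calibration))
```

If per-layer sparsity really is largely input-independent, as the abstract claims, a profile like this only needs to be computed once and can be stored alongside the model.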

Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation

Authors: Ping Chen, Daoxuan Zhang, Xiangming Wang, Yungeng Liu, Haijin Zeng, Yongyong Chen

First: 2026-03-19T08:50:49+00:00 · Latest: 2026-03-19T08:50:49+00:00

Abstract

Precise Text-to-Image (T2I) generation has achieved great success but is hindered by the limited relational reasoning of static text encoders and the error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities during the Ordinary Differential Equation trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search incorporates a training-free closed-loop parallel rollout search and flow steering mechanism, which leverages a Vision-Language Model (VLM) as a semantic critic to diagnose intermediate latents and dynamically steer the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process, exploring multiple trajectories through lookahead simulations and selecting the optimal path based on VLM-guided rewards. Further, we provide AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. Experimental results show that our AFS-Search-Pro greatly boosts the performance of the original FLUX.1-dev, achieving state-of-the-art results across three different benchmarks. Meanwhile, AFS-Search-Fast also significantly enhances performance while maintaining fast generation speed.

中文标题/摘要

标题：用于空间定位文本到图像生成的智能体流引导与并行展开搜索

精确的文本到图像(T2I)生成已经取得了巨大成功，但受限于静态文本编码器的有限关系推理能力和开环采样中的误差累积。由于缺乏实时反馈，常微分方程轨迹中的初始语义模糊性不可避免地演变成对空间约束的随机偏差。为弥合这一差距，我们引入了AFS-Search（Agentic Flow Steering and Parallel Rollout Search），这是一种基于FLUX.1-dev的无需训练的闭环框架。AFS-Search结合了无需训练的闭环并行展开搜索和流引导机制，利用视觉语言模型（VLM）作为语义评判者来诊断中间潜变量，并通过精确的空间定位动态引导速度场。此外，我们将T2I生成视为一个顺序决策过程，通过前瞻模拟探索多个轨迹，并基于VLM引导的奖励选择最优路径。进一步，我们提供了AFS-Search-Pro以获得更高性能，并提供了AFS-Search-Fast以实现更快的生成速度。实验结果表明，我们的AFS-Search-Pro极大地提升了原始FLUX.1-dev的性能，在三个不同的基准测试中达到了最先进的结果。同时，AFS-Search-Fast也显著提高了性能，同时保持了快速生成速度。

Summary / 总结

The research aims to improve the precision of Text-to-Image (T2I) generation by addressing the limitations of static text encoders and open-loop sampling. The method introduces AFS-Search, a training-free closed-loop framework that uses a Vision-Language Model (VLM) to diagnose and steer the generation process, ensuring spatial constraints are met. Experimental results demonstrate that AFS-Search-Pro significantly enhances the performance of FLUX.1-dev, achieving state-of-the-art results across three benchmarks, while AFS-Search-Fast maintains fast generation speed with improved performance.

研究旨在通过解决静态文本编码器和开环采样的局限性，提高文本到图像（T2I）生成的精确性和可靠性。方法AFS-Search引入了一个无需训练的闭环框架，使用视觉语言模型（VLM）诊断和引导生成过程，确保满足空间约束。实验结果表明，AFS-Search-Pro显著提升了FLUX.1-dev的性能，在三个不同基准上达到了最先进的结果，而AFS-Search-Fast则保持了快速生成速度的同时提高了性能。

GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?

Authors: Yueying Zou, Pei Pei Li, Zekun Li, Xinyu Guo, Xing Cui, Huaibo Huang, Ran He

Venue: ECCV 2026

First: 2026-03-19T08:44:08+00:00 · Latest: 2026-03-19T08:44:08+00:00

Comments: ECCV 2026 submission. 14 pages, 6 figures, 4 tables. Supplementary material included

Abstract

In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.

中文标题/摘要

标题：GenVideoLens：LVLM在AI生成视频检测中的不足之处？

近年来，AI生成的视频越来越逼真和复杂。与此同时，大型视觉-语言模型（LVLM）在检测此类内容方面显示出强大的潜力。然而，现有的评估协议主要将任务视为二元分类问题，并依赖于粗粒度的指标，如总体准确率，这为了解LVLM的成功或失败提供了有限的见解。为了解决这一局限性，我们引入了GenVideoLens，这是一个细粒度基准，使我们能够从维度上评估LVLM在AI生成视频检测中的能力。基准数据集包含400个高度欺骗性的AI生成视频和100个真实视频，并由专家在15个涵盖感知、光学、物理和时间线索的真伪维度上进行标注。我们在这项基准上评估了11个代表性LVLM。我们的分析揭示了明显的维度不平衡。虽然LVLM在感知线索方面表现相对较好，但在光学一致性、物理交互和时间因果推理方面却表现不佳。模型在不同维度上的表现也存在显著差异，较小的开源模型有时在特定真伪线索上会优于更强的专有模型。时间扰动实验进一步表明，当前的LVLM对时间信息的利用有限。总体而言，GenVideoLens为LVLM的行为提供了诊断性见解，揭示了关键的能力差距，并为改进未来的AI生成视频检测系统提供了指导。

Summary / 总结

GenVideoLens is introduced to evaluate the performance of Large Vision-Language Models (LVLMs) in detecting AI-generated videos. Unlike existing binary classification approaches, GenVideoLens offers a fine-grained benchmark with 400 deceptive AI-generated videos and 100 real videos, annotated across 15 dimensions. The evaluation reveals that LVLMs excel in perceptual cues but struggle with optical consistency, physical interactions, and temporal reasoning. Additionally, smaller open-source models sometimes outperform larger proprietary models in specific authenticity cues, and current LVLMs underutilize temporal information.

GenVideoLens 是一个基准，用于评估大型视觉-语言模型（LVLMs）在检测 AI 生成视频方面的性能。基准包括 400 条具有欺骗性的 AI 生成视频和 100 条真实视频，并在 15 个维度上进行了标注。评估结果显示，LVLMs 在感知线索方面表现良好，但在光学一致性、物理交互和时间因果推理方面存在困难。不同维度上的性能也存在显著差异，有时较小的开源模型在特定真实度线索上会优于较大的专有模型。时间扰动实验表明，当前的 LVLMs 并未有效利用时间信息。这项工作提供了关于 LVLMs 限制的诊断性见解，并指出了改进未来 AI 生成视频检测系统的方向。
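
The difference between a single overall accuracy and a dimension-wise breakdown is easy to show in code. The sketch below groups judgments by authenticity dimension and reports accuracy per group; the record schema (`dimension`, `predicted`, `label`) and the sample values are hypothetical and do not reflect the benchmark's actual data format.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of records where the prediction matches the label."""
    hits = sum(1 for r in records if r["predicted"] == r["label"])
    return hits / len(records)


def per_dimension_accuracy(records):
    """Accuracy computed separately for each authenticity dimension."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["dimension"]] += 1
        if r["predicted"] == r["label"]:
            hits[r["dimension"]] += 1
    return {dim: hits[dim] / totals[dim] for dim in totals}


# Made-up judgments on AI-generated clips, grouped by cue type.
records = [
    {"dimension": "perceptual", "predicted": "fake", "label": "fake"},
    {"dimension": "perceptual", "predicted": "fake", "label": "fake"},
    {"dimension": "perceptual", "predicted": "fake", "label": "fake"},
    {"dimension": "temporal", "predicted": "real", "label": "fake"},
]

print(overall_accuracy(records))
print(per_dimension_accuracy(records))
```

In this toy case the overall figure looks respectable while the temporal dimension is entirely wrong, which is the kind of imbalance a coarse metric hides.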

REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation

Authors: Shuqi Xiao, Maani Ghaffari, Chengzhong Xu, Hui Kong

First: 2026-03-19T08:43:40+00:00 · Latest: 2026-03-19T08:43:40+00:00

Abstract

Zero-shot object-goal navigation (ZSON) requires navigating unknown environments to find a target object without task-specific training. Prior hierarchical training-free solutions invest in scene understanding (belief) and high-level decision-making (policy), yet overlook the design of option, i.e., a subgoal candidate proposed from evolving belief and presented to policy for selection. In practice, options are reduced to isolated waypoints scored independently: single destinations hide the value gathered along the journey; an unstructured collection obscures the relationships among candidates. Our insight is that the option space should be a tree of paths. Full paths expose en-route information gain that destination-only scoring systematically neglects; a tree of shared segments enables coarse-to-fine LLM reasoning that dismisses or pursues entire branches before examining individual leaves, compressing the combinatorial path space into an efficient hierarchy. We instantiate this insight in REST (Receding Horizon Explorative Steiner Tree), a training-free framework that (1) builds an explicit open-vocabulary 3D map from online RGB-D streams; (2) grows an agent-centric tree of safe and informative paths as the option space via sampling-based planning; and (3) textualizes each branch into a spatial narrative and selects the next-best path through chain-of-thought LLM reasoning. Across the Gibson, HM3D, and HSSD benchmarks, REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency, demonstrating a favorable efficiency-success balance.

中文标题/摘要

标题：REST：用于零样本物体目标导航的滚动时域探索性Steiner树

零样本物体目标导航（ZSON）要求在未知环境中导航以找到目标物体，无需特定任务的训练。先前无需训练的分层方案在场景理解（信念）和高层决策（策略）上投入了大量精力，但忽视了选项（即从不断演变的信念中提出、并呈递给策略进行选择的子目标候选）的设计。实际上，选项被简化为独立评分的孤立航点：单一目的地掩盖了沿途获得的价值；无结构的集合掩盖了候选者之间的关系。我们的见解是，选项空间应该是一棵路径树。完整路径揭示了仅对目的地评分时被系统性忽视的沿途信息增益；由共享路径段构成的树使LLM能够进行从粗到细的推理，先舍弃或选定整个分支，再检查各个叶节点，从而将组合路径空间压缩成一个高效的层次结构。我们通过REST（Receding Horizon Explorative Steiner Tree，滚动时域探索性Steiner树）这一无需训练的框架实现了这一见解，该框架（1）从在线RGB-D流中构建显式的开放词汇3D地图；（2）通过基于采样的规划生成以智能体为中心的、安全且信息丰富的路径树作为选项空间；（3）将每个分支文本化为一段空间叙述，并通过思维链LLM推理选择下一条最佳路径。在Gibson、HM3D和HSSD基准测试中，REST在成功率方面始终名列前茅，同时在路径效率方面达到最佳或第二佳，展示了良好的效率与成功率平衡。

Summary / 总结

REST is a training-free framework for zero-shot object-goal navigation that addresses the limitations of prior approaches by focusing on the design of options as a tree of paths. It builds an explicit 3D map from RGB-D streams, grows an agent-centric tree of safe and informative paths, and selects the next-best path using LLM reasoning. REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency across various benchmarks, showcasing a favorable efficiency-success balance.

REST 是一个无需训练的零样本物体目标导航框架，通过将选项设计为路径树来解决先前方法的局限性。它从 RGB-D 流中构建显式的 3D 地图，生成一个以代理为中心的安全且信息丰富的路径树，并使用 LLM 推理选择下一个最佳路径。REST 在成功率方面始终名列前茅，同时在路径效率方面达到最佳或第二佳，展示了效率与成功率的良好平衡。
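
The contrast between destination-only scoring and path-aware scoring over a tree of shared segments can be sketched with a toy example. Here each node carries a numeric information gain and a path's value is the sum of gains along it. This numeric sum is a stand-in for illustration only: REST is described as selecting paths through LLM reasoning over textualized branches, and the class, function names, room labels, and gains below are hypothetical.

```python
class PathNode:
    """A waypoint in the option tree with the information gained on reaching it."""

    def __init__(self, name, gain, children=None):
        self.name = name
        self.gain = gain
        self.children = children or []


def best_path(node):
    """Return (total_gain, names) for the best root-to-leaf path."""
    if not node.children:
        return node.gain, [node.name]
    sub_gain, sub_path = max(
        (best_path(child) for child in node.children), key=lambda r: r[0]
    )
    return node.gain + sub_gain, [node.name] + sub_path


def best_destination(node):
    """Destination-only baseline: the leaf with the highest gain of its own."""
    if not node.children:
        return node.gain, node.name
    return max(
        (best_destination(child) for child in node.children), key=lambda r: r[0]
    )


# Two branches; the hallway branch shares a segment between two leaves.
tree = PathNode("start", 0, [
    PathNode("hallway", 3, [PathNode("bedroom", 1), PathNode("kitchen", 2)]),
    PathNode("stairs", 0, [PathNode("attic", 4)]),
])

print(best_path(tree))
print(best_destination(tree))
```

Scoring leaves in isolation prefers the attic, while accounting for what is observed on the way favors the hallway branch. Because the hallway segment is shared, an entire branch can be accepted or dismissed before its individual leaves are compared.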

Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge

Authors: Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu, Shengzhu Yang, Weihang Zhang, Huazhu Fu, Huiqi Li

First: 2026-03-07T09:43:49+00:00 · Latest: 2026-03-19T08:24:18+00:00

Abstract

Large Vision Language Models (LVLMs) show immense potential for automated ophthalmic diagnosis. However, their clinical deployment is severely hindered by lacking domain-specific knowledge. In this work, we identify two structural deficiencies hindering reliable medical reasoning: 1) the Perception Gap, where general-purpose visual encoders fail to resolve fine-grained pathological cues (e.g., microaneurysms); and 2) the Reasoning Gap, where sparse visual evidence is progressively overridden by massive language priors in deeper transformer layers, leading to ungrounded hallucinations. To bridge these gaps, we propose EyExIn, a data-efficient framework designed to anchor retinal VLMs with expert knowledge via a Deep Expert Injection mechanism. Our architecture employs an Expert-Aware Dual-Stream encoding strategy that decouples visual representation into a general stream for anatomical context and a specialized expert stream for pathological semantics. To ensure high-fidelity integration, we design a Semantic-Adaptive Gated Fusion module, which dynamically amplifies subtle lesion signals while filtering irrelevant background noise. Furthermore, we introduce Adaptive Deep Expert Injection to embed persistent "Vision Anchors" by integrating fused visual features as residual biases directly into intermediate LLM layers. This mechanism creates a visual shortcut that forces the reasoning stack to remain strictly grounded in visual evidence. Extensive experiments across four benchmarks demonstrate that our model consistently outperforms massive proprietary systems. EyExIn significantly enhances domain-specific knowledge embedding and achieves state-of-the-art precision in ophthalmic visual question answering, advancing the development of trustworthy ophthalmic AI.

中文标题/摘要

标题：深度专家注入以领域特定知识锚定视网膜VLM

大型视觉语言模型（LVLMs）在眼科自动化诊断方面展现出巨大的潜力。然而，它们的临床部署受到缺乏领域特定知识的严重阻碍。在本工作中，我们识别出两个阻碍可靠医学推理的结构性缺陷：1）感知差距，即通用视觉编码器无法分辨细粒度的病理线索（如微动脉瘤）；2）推理差距，即在较深的Transformer层中，稀疏的视觉证据逐渐被大量语言先验所覆盖，导致缺乏依据的幻觉。为弥合这些差距，我们提出了EyExIn，这是一种数据高效的框架，通过深度专家注入机制将专家知识锚定到视网膜VLM中。我们的架构采用了一种专家感知的双流编码策略，将视觉表示解耦为一个用于解剖学上下文的通用流和一个用于病理学语义的专门专家流。为了确保高保真集成，我们设计了一种语义自适应门控融合模块，该模块动态放大细微病灶信号并过滤无关背景噪声。此外，我们引入了自适应深度专家注入，将融合后的视觉特征作为残差偏置直接注入中间LLM层，以嵌入持久的“视觉锚点”。该机制创建了一个视觉捷径，迫使推理堆栈始终严格基于视觉证据。在四个基准上的广泛实验表明，我们的模型始终优于大规模专有系统。EyExIn显著增强了领域特定知识的嵌入，并在眼科视觉问答中实现了最先进的精度，推动了可信赖眼科AI的发展。

Summary / 总结

This work addresses the limitations of large vision language models (LVLMs) in ophthalmic diagnosis by proposing EyExIn, a framework that injects domain-specific knowledge to bridge the Perception and Reasoning Gaps. EyExIn uses an Expert-Aware Dual-Stream encoding and a Semantic-Adaptive Gated Fusion module to ensure high-fidelity integration of visual and expert knowledge, and introduces Adaptive Deep Expert Injection to embed persistent 'Vision Anchors' into intermediate layers. Experiments show that EyExIn outperforms proprietary systems and achieves state-of-the-art precision in ophthalmic visual question answering.

该研究提出了EyExIn框架，通过注入领域特定知识来解决大型视觉语言模型（LVLM）在眼科诊断中的局限性，以弥合感知和推理缺口。EyExIn采用专家感知双流编码和语义自适应门控融合模块，确保视觉和专家知识的高保真融合，并引入适应性深度专家注入机制，将持久的‘视觉锚点’直接嵌入中间层。实验表明，EyExIn在眼科视觉问答任务中超越了专有系统，并达到了最先进的精度。
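
A gated fusion of two feature streams followed by a residual injection can be sketched as below. The per-dimension sigmoid gate, the scaling factor `alpha`, and the function names are assumptions made for illustration. In a real module the gate logits would come from learned parameters, and the abstract does not specify how EyExIn computes them or how the injection is scaled.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(general, expert, gate_logits):
    """Blend two feature vectors element-wise; gate near 1 favors the expert."""
    fused = []
    for g, e, z in zip(general, expert, gate_logits):
        gate = sigmoid(z)
        fused.append(gate * e + (1.0 - gate) * g)
    return fused


def inject_residual(hidden, fused, alpha=0.1):
    """Add the fused visual features to a hidden state as a residual bias."""
    return [h + alpha * f for h, f in zip(hidden, fused)]


# Hypothetical 2-dimensional features; zero logits give an even blend.
general_stream = [0.0, 2.0]
expert_stream = [2.0, 4.0]
fused = gated_fusion(general_stream, expert_stream, gate_logits=[0.0, 0.0])
anchored = inject_residual([1.0, 1.0], fused, alpha=0.5)
print(fused)
print(anchored)
```

Adding the fused features at intermediate layers, rather than only at the input, gives later layers a direct path back to the visual evidence, which is the "visual shortcut" the abstract refers to.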
