Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
Authors: Shang-Jui Ray Kuo, Paola Cascante-Bonilla
First: 2026-03-19T17:56:32+00:00 · Latest: 2026-03-19T17:56:32+00:00
Comments: Project page: https://lab-spell.github.io/vlm-ssm-vision-encoders/ ; Code: https://github.com/raykuo18/vlm-ssm-vision-encoders
Abstract
Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
中文标题/摘要
标题:VLMs是否需要视觉变换器?评估状态空间模型作为视觉编码器的效果
大型视觉-语言模型(VLMs)通常使用冻结的视觉骨干,其图像特征通过轻量级连接器映射到大型语言模型中。虽然基于变换器的编码器是标准的视觉骨干,但我们询问状态空间模型(SSM)视觉骨干是否可以成为强有力的替代品。我们在受控环境中系统地评估了SSM视觉骨干在VLMs中的表现。在匹配的ImageNet-1K初始化下,SSM骨干在VQA和定位/标注方面实现了最强的整体性能。我们进一步适应了SSM和ViT家族的骨干,并进行了检测或分割训练,发现密集任务调整通常在家族中提高了性能;在这一适应后,SSM骨干保持竞争力,同时运行于显著更小的模型规模。我们还观察到,(i) 更高的ImageNet准确度或更大的骨干并不一定能可靠地转化为更好的VLM性能,(ii) 一些视觉骨干在定位方面不稳定。基于这些发现,我们提出了稳定策略,以提高两个骨干家族的鲁棒性,并强调SSM骨干作为VLMs中基于变换器视觉编码器的强有力替代品。
Summary / 总结
This study evaluates state space model (SSM) vision backbones as vision encoders in large vision-language models (VLMs). Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across VQA and grounding/localization. After dense-task adaptation, it remains competitive while operating at a substantially smaller model scale. The study also finds that higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance and that some visual backbones are unstable in localization; the authors propose stabilization strategies that improve robustness for both backbone families.
研究探讨了状态空间模型(SSM)视觉骨干是否可以作为大型视觉-语言模型(VLM)中基于变换器编码器的替代方案。研究在匹配的ImageNet-1K初始化条件下评估了SSM骨干,并发现它们在VQA和定位/检测任务中表现出最强的整体性能。在对SSM和ViT家族骨干进行检测或分割训练后,SSM骨干在较小的模型规模下仍保持竞争力。研究还指出,更高的ImageNet准确度或更大的骨干并不一定意味着更好的VLM性能,并提出了稳定策略以提高两种骨干家族的鲁棒性。
Tinted Frames: Question Framing Blinds Vision-Language Models
Authors: Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta
First: 2026-03-19T17:53:09+00:00 · Latest: 2026-03-19T17:53:09+00:00
Comments: Preprint. Project page: https://davidhalladay.github.io/tinted_frames_demo/
Abstract
Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended framings, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and performance across framings.
中文标题/摘要
标题:着色框:问题框架限制了视觉语言模型的视野
视觉语言模型(VLMs)已被证明是盲目的,即使在需要视觉推理的任务中,它们也经常未能充分利用视觉输入。在本研究中,我们展示了VLMs是选择性地盲目的。它们根据语言框架调整对视觉输入的注意力程度,即使存在替代框架要求相同的视觉推理。通过使用视觉注意力作为探针,我们量化了框架如何改变对图像的关注量及其分布。受限的框架,如多项选择和是/否,会显著降低对图像上下文的关注度,减少对任务相关区域的关注,并将注意力转移到无信息性标记上。我们进一步证明,这种注意力分配不当是导致准确度下降和跨框架不一致的主要原因。基于这一机制洞察,我们引入了一种轻量级的提示调优方法,使用可学习标记来鼓励在开放性设置中观察到的稳健、视觉基础的注意力模式,从而提高视觉定位并改善不同框架下的性能。
Summary / 总结
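This work shows that vision-language models (VLMs) are selectively blind: they modulate how much attention they pay to visual inputs based on linguistic framing, even when alternative framings demand identical visual reasoning. Using visual attention as a probe, the authors find that constrained framings such as multiple choice and yes/no induce substantially lower attention to image context than open-ended framings, reduce focus on task-relevant regions, and shift attention towards uninformative tokens, and that this misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. They introduce a lightweight prompt-tuning method with learnable tokens that encourages the attention patterns observed in open-ended settings, improving visual grounding and performance across framings.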
研究探讨了为什么视觉语言模型(VLMs)在不同语言框架下对视觉输入的选择性忽视。通过使用视觉注意力作为探针,研究发现,如多项选择和是非题等受限框架会导致对图像上下文的关注度降低,并且注意力偏向于无关信息。这种注意力分配的错误导致了较低的准确性和不同框架之间的不一致性。研究提出了一种使用可学习标记的提示调优方法,以促进在开放性设置中观察到的稳健且视觉基础的注意力模式,从而提高视觉定位和跨框架的性能。
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Authors: Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan
First: 2026-03-19T17:20:56+00:00 · Latest: 2026-03-19T17:20:56+00:00
Comments: Equal contribution: Swagat Padhan and Lakshya Jain, 9 pages, 6 figures, paper website: https://lakshya-asu.github.io/Meanings-Measurements-Multi-Agent-Probabilistic-Grounding/
Abstract
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
中文标题/摘要
标题:意义与测量:多智能体概率对接在视觉语言导航中的应用
与人类合作的机器人必须将自然语言目标转化为可执行的、物理上可对接的决策。例如,执行“向冰箱右边两米处走”的命令需要在三维场景中对接语义参考、空间关系和度量约束。虽然最近的视觉语言模型(VLMs)展示了强大的语义对接能力,但它们并未明确设计用于在物理定义的空间中推理度量约束。在本研究中,我们实证展示了最先进的基于VLM的对接方法在处理复杂的度量语义语言查询时存在困难。为解决这一局限,我们提出了MAPG(多智能体概率对接)框架,将语言查询分解为结构化的子组件,并查询VLM对接每个组件。然后,MAPG通过概率组合这些对接输出,生成在三维空间中度量一致的可执行决策。我们使用HM-EQA基准评估MAPG,并展示了相对于强大基线的一致性能改进。此外,我们引入了一个新的基准MAPG-Bench,专门用于评估度量语义目标对接,填补了现有语言对接评估中的空白。我们还展示了在可用结构化场景表示的现实世界机器人演示,表明MAPG可以超越仿真。
Summary / 总结
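The paper targets metric-semantic language queries such as "go two meters to the right of the fridge", which state-of-the-art VLM-based grounding approaches struggle with. It proposes MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes a query into structured subcomponents, queries a VLM to ground each one, and probabilistically composes the grounded outputs into metrically consistent, actionable decisions in 3D space. MAPG shows consistent improvements over strong baselines on HM-EQA; the authors also introduce MAPG-Bench for metric-semantic goal grounding and present a real-world robot demonstration showing transfer beyond simulation when a structured scene representation is available.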
该研究旨在将复杂的度量语义语言查询转化为机器人的可执行决策。提出了一种名为MAPG(多代理概率定位)的方法,将语言查询分解为子组件,并使用VLM进行定位,然后通过概率组合生成度量上一致的行动。实验在HM-EQA和新推出的MAPG-Bench基准上表明,MAPG优于强基线。此外,一个实际的机器人演示验证了MAPG在提供结构化场景表示时能够超越仿真环境的有效性。
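The probabilistic composition step described in the MAPG abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a 2D floor plane, an already grounded anchor position, and two independent Gaussian likelihoods (one for the metric distance, one for the direction of the spatial relation). All function names, the noise scales `sigma_d` and `sigma_a`, and the candidate points are hypothetical.

```python
import math

def gaussian(x, mu, sigma):
    """Unnormalized Gaussian likelihood of x around mu."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def score(point, anchor, direction, distance_m, sigma_d=0.3, sigma_a=0.5):
    """Score a candidate 2D point against one metric-semantic query.

    anchor     -- (x, y) of the grounded reference object (e.g. the fridge)
    direction  -- unit vector encoding the spatial relation (e.g. "right of")
    distance_m -- metric constraint in meters (e.g. 2.0)
    sigma_d, sigma_a -- hypothetical tolerances for distance and angle
    """
    dx, dy = point[0] - anchor[0], point[1] - anchor[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return 0.0
    # Angle between the anchor-to-point offset and the requested direction.
    cos_angle = (dx * direction[0] + dy * direction[1]) / dist
    angle = math.acos(max(-1.0, min(1.0, cos_angle)))
    # Assume the two constraints are independent and multiply their likelihoods.
    return gaussian(dist, distance_m, sigma_d) * gaussian(angle, 0.0, sigma_a)

def best_candidate(candidates, anchor, direction, distance_m):
    """Return the candidate with the highest composed score."""
    return max(candidates, key=lambda p: score(p, anchor, direction, distance_m))

# "Go two meters to the right of the fridge", with "right" assumed to be +x.
fridge = (0.0, 0.0)
right = (1.0, 0.0)
candidates = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0), (1.0, 0.0), (3.0, 1.0)]
goal = best_candidate(candidates, fridge, right, 2.0)
```

Multiplying per-component likelihoods is only one possible way to compose grounded outputs; the abstract does not specify which distributions or composition rule MAPG uses.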
Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
Authors: Kwanyoung Lee, SeungJu Cha, Yebin Ahn, Hyunwoo Oh, Sungho Koh, Dong-Jin Kim
Venue: CVPR 2026
First: 2026-03-19T17:12:03+00:00 · Latest: 2026-03-19T17:12:03+00:00
Comments: Accepted in CVPR 2026 (main track). 10 pages, 6 figures; supplementary material included (14 pages, 11 figures)
Abstract
Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) - a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.
中文标题/摘要
标题:适应性辅助提示融合以实现目标忠实的扩散生成
基于扩散的文本到图像(T2I)模型在生成逼真且语义丰富的图像方面取得了显著进展。然而,当目标概念位于训练分布的低密度区域时,这些模型往往会生成语义不匹配或结构不一致的结果。这一局限性源于文本-图像数据集的长尾性质,其中稀有概念或编辑指令的代表性不足。为了解决这一问题,我们引入了适应性辅助提示融合(AAPB)——一种统一框架,用于在低密度区域稳定扩散过程。AAPB利用辅助锚提示提供稀有概念生成的语义支持和图像编辑的结构支持,确保目标提示的忠实指导。与先前的启发式提示交替方法不同,AAPB在每个扩散步骤中推导出一个闭式自适应系数,以最优地平衡辅助锚提示和目标提示之间的影响力。基于Tweedie恒等式,我们的公式提供了一种原理上和无需训练的自适应提示融合框架,确保稳定和目标忠实的生成。通过受控实验,我们展示了自适应插值优于固定插值的有效性,并在RareBench和FlowEdit数据集上实验证明了一致的改进,实现了比先前的无需训练基线更高的语义准确性和结构保真度。
Summary / 总结
The paper addresses the issue of semantically misaligned or structurally inconsistent image generation by diffusion models when dealing with rare concepts. It introduces Adaptive Auxiliary Prompt Blending (AAPB), which uses auxiliary anchor prompts to provide semantic and structural support, ensuring target-faithful generation. AAPB derives an adaptive coefficient for each diffusion step, leading to stable and accurate image generation. Experiments on RareBench and FlowEdit datasets show consistent improvements in semantic accuracy and structural fidelity compared to previous methods.
论文提出了自适应辅助提示融合(AAPB)方法,以稳定生成罕见或欠代表概念的图像过程。AAPB 使用辅助锚提示提供语义和结构支持,确保生成的图像忠实于目标提示。实验结果表明,AAPB 在语义准确性和结构保真度方面优于之前的无训练基线方法。
ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation
Authors: Kwanyoung Lee, Hyunwoo Oh, SeungJu Cha, Sungho Koh, Dong-Jin Kim
Venue: CVPR 2026
First: 2026-03-19T17:11:49+00:00 · Latest: 2026-03-19T17:11:49+00:00
Comments: Accepted in CVPR 2026 (findings). 10 pages, 4 figures; supplementary material included (8 pages, 10 figures)
Abstract
Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing an LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose ADAPT, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.
中文标题/摘要
标题:ADAPT:注意力驱动的自适应提示调度和插值正交补对于稀有概念生成
对于文本到图像合成而言,生成稀有组合概念仍然是扩散模型面临的挑战,尤其是对于训练数据中不常见的属性。虽然最近的方法,如R2F,通过利用LLM进行提示调度来解决这一挑战,但由于语言模型的随机性和迭代文本嵌入切换的次优指导,它们仍然存在固有的方差问题。为了解决这些问题,我们提出了ADAPT框架,这是一个无需训练的框架,可以确定性地规划和语义对齐提示调度,提供一致的指导以增强稀有概念的组合。通过利用注意力分数和正交组件,ADAPT在无需额外训练或微调的情况下,显著增强了RareBench基准上稀有概念的组合生成。通过全面的实验,我们证明ADAPT在RareBench上实现了优越的性能,并准确反映了稀有属性的语义信息,提供了对稀有组合生成的确定性和精确控制,而不损害视觉完整性。
Summary / 总结
The research addresses the challenge of generating rare compositional concepts in text-to-image synthesis using diffusion models. ADAPT, an attention-driven framework, deterministically plans prompt schedules and semantically aligns them to provide consistent guidance, enhancing the generation of rare concepts. Experiments show that ADAPT outperforms existing methods like R2F in the RareBench benchmark, maintaining visual integrity while accurately reflecting the semantic information of rare attributes.
该论文旨在解决使用扩散模型在文本到图像合成中生成稀有组合概念的挑战。它提出了ADAPT框架,利用注意力分数和正交组件来确定性地规划提示调度,提供一致的指导以生成稀有概念。实验表明,ADAPT在RareBench基准上优于现有方法,实现了更好的稀有概念组合生成,且无需额外训练或微调。
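The "orthogonal components" mentioned in the ADAPT abstract refer to a standard linear-algebra operation that can be shown in a few lines. The sketch below is an illustration only, not the paper's procedure: it assumes that a rare-attribute embedding is decomposed against a base prompt embedding and that only the part orthogonal to the base is blended in. The vectors, the `blend` helper, and the fixed `weight` are hypothetical; in the paper the weighting is reportedly driven by attention scores.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthogonal_component(v, base):
    """Part of v orthogonal to base, i.e. v minus its projection onto base."""
    denom = dot(base, base)
    if denom == 0.0:
        return list(v)
    scale = dot(v, base) / denom
    return [x - scale * y for x, y in zip(v, base)]

def blend(base, v, weight):
    """Add a weighted orthogonal component of v to base (hypothetical helper)."""
    orth = orthogonal_component(v, base)
    return [b + weight * o for b, o in zip(base, orth)]

# Toy 3-dimensional "embeddings"; real text embeddings are far larger.
base_emb = [1.0, 0.0, 0.0]
rare_emb = [0.5, 2.0, 0.0]
orth = orthogonal_component(rare_emb, base_emb)
mixed = blend(base_emb, rare_emb, 0.25)
```

Because the added component is orthogonal to the base embedding, the blend changes only directions the base prompt does not already cover, which is one plausible reading of why such a decomposition would help preserve the base concept.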
GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning
Authors: Yiren Lu, Yi Du, Disheng Liu, Yunlai Zhou, Chen Wang, Yu Yin
First: 2026-03-19T16:55:54+00:00 · Latest: 2026-03-19T16:55:54+00:00
Comments: Project page at https://vulab-ai.github.io/GSMem/
Abstract
Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose GSMem, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to "hallucinate" optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.
中文标题/摘要
标题:GSMem:基于3D高斯斑点的持久空间记忆用于零样本体态探索与推理
有效的体态探索需要代理人在时间上积累和保留空间知识。然而,现有的场景表示,如离散场景图或静态视角快照,缺乏“事后重新观察”的能力。如果初始观察错过了目标,生成的记忆遗漏往往是不可恢复的。为了解决这一问题,我们提出了GSMem,一种基于3D高斯斑点(3DGS)的零样本体态探索与推理框架。通过显式参数化连续几何和密集外观,3DGS充当持久空间记忆,赋予代理“空间回忆”的能力:从之前未占用的最佳视角生成逼真的新视角。为了实现这一点,GSMem采用了一种检索机制,同时利用并行的对象级场景图和语义级语言字段。这种互补设计能够稳健地定位目标区域,使代理能够“想象”出高质量的视图,用于高保真视觉-语言模型(VLM)推理。此外,我们引入了一种结合VLM驱动的语义评分与基于3DGS的覆盖目标的混合探索策略,平衡任务感知探索与几何覆盖。在体态问答和终身导航的广泛实验中,证明了我们框架的稳健性和有效性。
Summary / 总结
GSMem is a zero-shot embodied exploration and reasoning framework that uses 3D Gaussian Splatting (3DGS) to create a persistent spatial memory. This allows the agent to render photorealistic novel views from previously unoccupied viewpoints, enabling spatial recollection. GSMem combines object-level scene graphs and semantic-level language fields for robust target localization and uses a hybrid exploration strategy that balances semantic scoring with geometric coverage. Experiments show its effectiveness in embodied question answering and lifelong navigation.
GSMem 是一个基于 3D 高斯斑点(3DGS)的零样本体态探索和推理框架,通过创建持久的空间记忆来生成从未占用视角的逼真新视图,实现空间回忆。该框架结合了对象级场景图和语义级语言字段进行鲁棒定位,并采用混合探索策略平衡任务感知探索与几何覆盖。实验表明其在体态问答和终身导航中的有效性。
Efficient Reasoning with Balanced Thinking
Authors: Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian
Venue: ICLR 2026
First: 2026-03-12T18:48:07+00:00 · Latest: 2026-03-19T16:54:22+00:00
Comments: Accepted by ICLR 2026
Abstract
Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .
中文标题/摘要
标题:平衡思考实现高效推理
大型推理模型(LRMs)展示了出色的推理能力,但往往存在过度推理的问题,即在简单问题上浪费冗余计算步骤,或者存在欠推理的问题,即在具备推理能力的情况下未能充分探索推理路径。这些问题导致了效率低下和潜在的不准确性,限制了其在资源受限环境中的实际部署。现有减少过度推理的方法,如抑制反思关键词或调整推理长度,可能会无意中导致欠推理,从而影响准确性。因此,我们提出了ReBalance,这是一种无需训练的框架,实现了平衡思考下的高效推理。ReBalance 利用置信度作为推理动态的连续指标,通过高置信度波动识别过度推理,通过一致的高置信度识别欠推理。通过将小型数据集中的隐藏状态聚合为推理模式原型,我们计算出一个引导向量来引导LRMs的推理轨迹。动态控制函数根据实时置信度调整该向量的强度和方向,在过度推理时修剪冗余,在欠推理时促进探索。在四个从0.5B到32B的模型以及九个涉及数学推理、通用问答和编程任务的基准测试中进行的广泛实验表明,ReBalance 有效减少了输出冗余并提高了准确性,提供了一种通用、无需训练且即插即用的策略,用于高效和稳健的LRM部署。项目页面和代码可在https://rebalance-ai.github.io 获取。
Summary / 总结
The paper addresses the inefficiencies of Large Reasoning Models (LRMs) by proposing ReBalance, a training-free framework that balances overthinking and underthinking. ReBalance uses confidence to identify and mitigate these issues, guiding LRMs to efficient reasoning. Experiments show that ReBalance reduces output redundancy and improves accuracy across various models and tasks, offering a general and plug-and-play solution for efficient LRM deployment.
论文提出了一种名为ReBalance的无训练框架,旨在平衡LRM的过度思考和不足思考,提高其效率。ReBalance通过信心指标来识别和纠正这些问题,引导LRM进行更有效的推理。实验结果显示,ReBalance减少了输出冗余并提高了准确性,使其在各种模型和任务中更加实用,适用于资源受限的环境。
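The confidence-driven steering described in the ReBalance abstract can be sketched in miniature. This is a hedged illustration under simplifying assumptions, not the released method: hidden states are short Python lists, the two reasoning-mode prototypes are plain means, and the control function is a three-way switch. The thresholds `var_thresh` and `conf_thresh`, the sign convention, and all function names are hypothetical.

```python
from statistics import mean, pvariance

def prototype(states):
    """Element-wise mean of a list of equal-length hidden-state vectors."""
    return [mean(col) for col in zip(*states)]

def steering_vector(over_states, under_states):
    """Vector pointing from the underthinking prototype to the overthinking one."""
    p_over, p_under = prototype(over_states), prototype(under_states)
    return [o - u for o, u in zip(p_over, p_under)]

def control(confidences, var_thresh=0.02, conf_thresh=0.9):
    """Signed steering strength from recent step confidences.

    High variance is read as overthinking (steer away from it, -1.0);
    consistently high confidence is read as underthinking (steer towards
    more exploration, +1.0); otherwise no steering is applied (0.0).
    """
    if pvariance(confidences) > var_thresh:
        return -1.0
    if mean(confidences) >= conf_thresh:
        return 1.0
    return 0.0

def steer(hidden, vector, strength):
    """Shift a hidden state along the steering vector."""
    return [h + strength * v for h, v in zip(hidden, vector)]

# Toy 2-dimensional hidden states collected from a small calibration set.
vec = steering_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]])
strength = control([0.2, 0.9, 0.3, 0.95])
steered = steer([0.0, 0.0], vec, strength)
```

The abstract describes a dynamic control function that modulates both strength and direction continuously; the discrete switch above is only the simplest stand-in for that behavior.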
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
Authors: Fan Yang, Xingping Dong, Xin Yu, Wenhan Luo, Wei Liu, Kaihao Zhang
Venue: CVPR 2026
First: 2025-12-02T16:22:01+00:00 · Latest: 2026-03-19T16:35:02+00:00
Comments: Accepted to CVPR 2026
Abstract
Understanding high-resolution (HR) images remains a critical challenge for multimodal large language models (MLLMs). Recent approaches leverage vision-based retrieval-augmented generation (RAG) to retrieve query-relevant crops from HR images, improving understanding capacity of MLLMs. However, this paradigm often leads to object fragmentation, resulting in semantic bias and incomplete retrieval, while also introducing false positives from irrelevant background patches. To address these issues, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework that enhances HR image understanding from both local and global perspectives. Locally, MRD enforces cross-scale semantic consistency via multi-resolution semantic fusion to mitigate single-resolution bias and alleviate object fragmentation. Globally, it integrates open-vocabulary object detection (OVD) as localization priors within a unified framework. Extensive experiments across multiple MLLMs on HR image benchmarks demonstrate that MRD achieves state-of-the-art (SOTA) performance on both single-object and multi-object understanding tasks. Code will be available at: https://github.com/yf0412/MRD.
中文标题/摘要
标题:MRD:多分辨率检索-检测融合用于高分辨率图像理解
理解高分辨率(HR)图像仍然是多模态大型语言模型(MLLM)的关键挑战。近期的方法利用基于视觉的检索增强生成(RAG)从HR图像中检索查询相关的片段,从而提高MLLM的理解能力。然而,这种范式往往导致对象碎片化,产生语义偏差和不完整的检索,同时还会引入来自无关背景片段的假阳性。为了解决这些问题,我们提出了一种无需训练的多分辨率检索-检测(MRD)框架,从局部和全局两个方面增强HR图像理解。局部上,MRD通过多分辨率语义融合来缓解单一分辨率偏差并减轻对象碎片化。全局上,它将开放词汇对象检测(OVD)作为定位先验整合到统一框架中。在多个MLLM上的HR图像基准测试中,广泛的实验表明,MRD在单对象和多对象理解任务上均实现了最先进的(SOTA)性能。代码将在:https://github.com/yf0412/MRD.上提供。
Summary / 总结
The paper addresses the challenge of understanding high-resolution images for multimodal large language models by proposing MRD, a training-free framework that enhances both local and global perspectives. MRD uses multi-resolution semantic fusion to reduce single-resolution bias and object fragmentation, and integrates open-vocabulary object detection as localization priors. Experiments show that MRD outperforms existing methods on single-object and multi-object understanding tasks across multiple MLLMs.
论文提出了一种名为MRD的训练-free框架,通过增强局部和全局视角来解决高分辨率图像理解的挑战。MRD利用多分辨率语义融合减少单分辨率偏差和物体碎片化问题,并将开放词汇量物体检测作为定位先验。实验表明,MRD在多个MLLM上的单物体和多物体理解任务中均优于现有方法。
TAU-R1: Visual Language Model for Traffic Anomaly Understanding
Authors: Yuqiang Lin, Kehua Chen, Sam Lockyer, Arjun Yadav, Mingxuan Sui, Shucheng Zhang, Yan Shi, Bingzhang Wang, Yuang Zhang, Markus Zarbock, Florain Stanek, Adrian Evans, Wenbin Li, Yinhai Wang, Nic Zhang
First: 2026-03-19T16:23:21+00:00 · Latest: 2026-03-19T16:23:21+00:00
Abstract
Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: https://github.com/siri-rouser/TAU-R1
中文标题/摘要
标题:TAU-R1:交通异常理解的视觉语言模型
交通异常理解(TAU)对于智能交通系统中的交通安全至关重要。最近的视觉-语言模型(VLMs)在视频理解方面表现出强大的能力。然而,由于缺乏基准和特定任务的方法,TAU 的进展仍然有限。为了解决这一限制,我们引入了Roundabout-TAU数据集,该数据集由与印第安纳州卡梅尔市合作收集的真实环形交叉口视频构建而成。该数据集包含342个片段,并且带有超过2000个问题-答案对,涵盖了交通异常理解的多个方面。基于此基准,我们提出了TAU-R1,一种两层视觉-语言框架用于TAU。第一层是一个轻量级的异常分类器,执行粗略的异常分类,而第二层是一个较大的异常推理器,生成详细的事件总结。为了提高特定任务的推理,我们引入了一种两阶段训练策略,包括分解-问答增强监督微调,随后是基于GRPO的TAU-GRPO后训练方法,具有TAU特定的奖励函数。实验结果表明,TAU-R1在异常分类和推理任务上均表现出色,同时保持了部署效率。数据集和代码可在:https://github.com/siri-rouser/TAU-R1 获取
Summary / 总结
The research aims to enhance traffic safety in Intelligent Transportation Systems by developing a visual language model for Traffic Anomaly Understanding (TAU). To address the lack of benchmarks and task-specific methodologies, the authors introduce Roundabout-TAU, a dataset with 342 clips and more than 2,000 annotated question-answer pairs. They propose TAU-R1, a two-layer vision-language framework, where the first layer classifies anomalies and the second layer generates detailed event summaries. The model uses a two-stage training strategy, consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, to improve task-specific reasoning. Experimental results demonstrate strong performance in both anomaly classification and reasoning tasks while maintaining efficiency for deployment.
研究旨在通过开发交通异常理解(TAU)的视觉语言模型来提高智能交通系统的交通安全。研究引入了Roundabout-TAU数据集,包含342个真实交叉口视频片段和超过2,000个标注的问答对。在此基础上,提出了TAU-R1双层视觉语言框架,包括一个轻量级的异常分类器和一个更大的异常推理器。该模型采用两阶段训练策略以提高任务特定的推理能力,并在异常分类和推理任务上均表现出色,同时保持了高效性。数据集和代码已公开可用。
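TAU-GRPO is described above as a GRPO-based post-training method. As background, the group-relative advantage at the core of GRPO can be sketched as follows: several responses are sampled for the same prompt, each receives a scalar reward, and rewards are normalized within the group. The reward values below are made up, the TAU-specific reward functions are not reproduced, and the use of the population standard deviation and the `eps` constant are assumptions of this sketch.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one traffic-anomaly question.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# When every sample gets the same reward, all advantages are zero.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Answers that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged; no separate value model is needed for this normalization.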
SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
Authors: Carlos Hinojosa, Clemens Grange, Bernard Ghanem
First: 2026-03-19T16:18:00+00:00 · Latest: 2026-03-19T16:18:00+00:00
Abstract
Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.
中文标题/摘要
标题:SAVeS: 通过语义线索引导视觉-语言模型的安全判断
视觉-语言模型(VLMs)在现实世界和具身环境中越来越被部署,其中安全决策依赖于视觉上下文。然而,尚不清楚哪些视觉证据驱动这些判断。我们研究了是否可以通过简单的语义线索来引导VLMs中的多模态安全行为。我们引入了一种语义引导框架,该框架通过控制文本、视觉和认知干预而不改变底层场景内容来实现这一目标。为了评估这些效果,我们提出了SAVeS基准,用于在语义线索下的情境安全性评估,并提出了一种评估协议,将行为拒绝、基于事实的安全推理和虚假拒绝区分开来。跨多个VLMs的实验以及一个额外的最新基准表明,安全决策对语义线索非常敏感,表明依赖于学习到的视觉-语言关联而非基于事实的视觉理解。我们进一步证明,自动化引导管道可以利用这些机制,突显了多模态安全系统中的潜在漏洞。
Summary / 总结
The research aims to understand how visual evidence influences safety judgments in vision-language models (VLMs) and whether these judgments can be steered using simple semantic cues. A semantic steering framework was developed to apply textual, visual, and cognitive interventions without altering the scene content. Experiments across various VLMs and an additional benchmark showed that safety decisions are highly sensitive to semantic cues, suggesting reliance on learned visual-linguistic associations rather than grounded visual understanding. This indicates a potential vulnerability in multimodal safety systems.
研究旨在理解视觉证据如何影响视觉语言模型(VLM)中的安全判断,以及是否可以通过简单的语义提示来引导这些判断。开发了一种语义引导框架,引入了控制的文本、视觉和认知干预,而不改变场景内容。研究引入了SAVeS基准,用于评估在语义提示下的情境安全性,并展示了VLM中的安全决策高度依赖于这些提示,表明其依赖于学习到的视觉-语言关联而非基于视觉的理解。自动引导管道可以利用这些机制,表明多模态安全系统中存在潜在的安全性漏洞。
SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
Authors: Phuc Pham, Uy Dieu Tran, Binh-Son Hua, Phong Nguyen
Venue: CVPR 2026
First: 2026-03-19T15:47:43+00:00 · Latest: 2026-03-19T15:47:43+00:00
Comments: CVPR 2026
Abstract
Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.
中文标题/摘要
标题:SwiftTailor:基于几何图像表示的高效3D服装生成
在计算机视觉和数字时尚领域,真实且高效的3D服装生成仍然是一个长期存在的挑战。现有方法通常依赖于大型视觉语言模型来生成2D缝纫图案的序列化表示,然后使用如GarmentCode等服装建模框架将其转换为可用于模拟的3D网格。尽管这些方法可以产生高质量的结果,但它们通常会遭受较慢的推理时间,从30秒到一分钟不等。在本工作中,我们引入了SwiftTailor,这是一种新颖的两阶段框架,通过紧凑的几何图像表示统一了缝纫图案推理和基于几何的网格合成。SwiftTailor 包含两个轻量级模块:PatternMaker,一种高效的视觉语言模型,可以从多种输入模态中预测缝纫图案;以及GarmentSewer,一种高效的密集预测变换器,将这些图案转换为一种新颖的服装几何图像,编码所有服装面板的3D表面在统一的UV空间中。最终的3D网格通过一个高效的逆映射过程重建,该过程结合了重新网格化和动态缝合算法,直接组装服装,从而摊销物理模拟的成本。在Multimodal GarmentCodeData上的大量实验表明,SwiftTailor 在准确性和视觉保真度方面达到了最先进的水平,同时显著减少了推理时间。这项工作提供了一种可扩展、可解释且高性能的下一代3D服装生成解决方案。
Summary / 总结
SwiftTailor is a two-stage framework that combines sewing-pattern reasoning and geometry-based mesh synthesis using a compact geometry image representation. It includes PatternMaker, which predicts sewing patterns from various inputs, and GarmentSewer, which converts these patterns into a Garment Geometry Image. The final 3D mesh is reconstructed through an efficient inverse mapping process. SwiftTailor achieves high accuracy and visual fidelity while reducing inference time significantly compared to existing methods.
SwiftTailor 是一个两阶段框架,结合了缝制模式推理和基于几何的网格合成,使用紧凑的几何图像表示。它包括 PatternMaker,可以从多种输入中预测缝制模式,以及 GarmentSewer,将这些模式转换为 Garment 几何图像。最终的 3D 网格通过高效的逆映射过程重建。SwiftTailor 达到了高准确性和视觉保真度,同时显著减少了推理时间。
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
Authors: Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota
First: 2026-03-19T15:38:02+00:00 · Latest: 2026-03-19T15:38:02+00:00
Comments: Accepted by CVPR 2026 (Main Track)
Abstract
Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
中文标题/摘要
标题:TerraScope:基于像素的地球观测视觉推理
视觉语言模型(VLMs)在地球观测(EO)中显示出潜力,但它们在需要将复杂的空间推理精确地与像素级视觉表示相结合的任务中表现不佳。为了解决这个问题,我们引入了TerraScope,这是一种统一的VLM,能够提供基于像素的地理空间推理,具有两个关键能力:(1)模态灵活的推理:它可以处理单一模态输入(光学或SAR),并在两种模态都可用时适应性地将不同模态融合到推理过程中;(2)多时相推理:它整合了时间序列以在多个时间点进行变化分析。此外,我们还整理了Terra-CoT数据集,包含100万样本,其中包含嵌入在多个来源推理链中的像素级掩码。我们还提出了TerraScope-Bench,这是第一个用于基于像素的地理空间推理的基准,包含六个子任务,评估答案准确性和掩码质量,以确保真实的基于像素的推理。实验表明,TerraScope在基于像素的地理空间推理方面显著优于现有VLMs,同时提供了可解释的视觉证据。
Summary / 总结
TerraScope is a unified vision-language model for earth observation that grounds complex spatial reasoning in pixel-level masks. It handles single-modality inputs (optical or SAR), fuses both modalities when available, and integrates temporal sequences for change analysis. The authors also curate Terra-CoT, a dataset of 1 million samples with pixel-level masks embedded in reasoning chains, and propose TerraScope-Bench, a six-sub-task benchmark that scores both answer accuracy and mask quality. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
TerraScope 是一种统一的视觉-语言模型,用于需要精确像素级推理的地球观测任务。它支持单模态输入,并且可以在两种数据都可用时进行模态融合,还可以处理多时序序列进行变化分析。该模型在新数据集 Terra-CoT 上进行了评估,并在像素级地理空间推理方面优于现有视觉-语言模型,提供了可解释的视觉证据。
AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture
Authors: Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Zurong Mai, Jing Wu, Shuohong Lou, Yuhang Chen, Henglian Huang, Xiaoya Fan, Yang Zhang, Defeng Gu, Lingyuan Zhao, Yutong Lu, Haohuan Fu, Jianxi Huang, Juepeng Zheng
First: 2025-11-28T15:02:19+00:00 · Latest: 2026-03-19T15:28:24+00:00
Abstract
Recent advancements in Vision-Language Models (VLMs) have significantly impacted various industries. In agriculture, these multimodal capabilities hold great promise for applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. However, while several Visual Question Answering (VQA) datasets and benchmarks have been developed to assess VLM performance, they often fail to effectively evaluate the critical reasoning and problem-solving skills needed in complex agricultural contexts. To address this gap, we introduce AgroCoT, a VQA dataset that integrates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,759 carefully curated samples, AgroCoT provides a comprehensive and robust evaluation of reasoning abilities, particularly in zero-shot scenarios, focusing on the models' ability to engage in logical reasoning and effective problem-solving. Our evaluation of 30 representative VLMs, including both proprietary and open-source models, reveals a gap in their reasoning capabilities, which underscores the importance of incorporating CoT for assessments. Our dataset is available at https://huggingface.co/datasets/wenyb/AgriCoT.
中文标题/摘要
标题:AgroCoT:农业领域视觉语言模型推理能力评估的链式思考基准
近年来,视觉语言模型(VLMs)在各个行业中的进步产生了重大影响。在农业领域,这些跨模态能力为精准农业、作物监测、病虫害检测和环境可持续性等应用带来了巨大潜力。然而,尽管已经开发了多个视觉问答(VQA)数据集和基准来评估VLM性能,它们往往未能有效评估在复杂农业背景下所需的关键推理和问题解决能力。为解决这一缺口,我们引入了AgroCoT,这是一个结合了链式思考(CoT)推理的VQA数据集,专门用于评估VLM的推理能力。AgroCoT包含4,759个精心策划的样本,提供了全面且稳健的推理能力评估,特别是在零样本场景中,重点关注模型进行逻辑推理和有效问题解决的能力。我们对30个代表性VLM的评估,包括专有和开源模型,揭示了它们在推理能力上的差距,突显了在评估中纳入CoT的重要性。我们的数据集可在https://huggingface.co/datasets/wenyb/AgriCoT获取。
Summary / 总结
AgroCoT is a VQA dataset designed to evaluate the reasoning capabilities of Vision-Language Models (VLMs) in agricultural contexts. It includes 4,759 samples that require Chain-of-Thought (CoT) reasoning, focusing on zero-shot scenarios. The evaluation of 30 VLMs shows a significant gap in their reasoning abilities, highlighting the need for CoT in assessing VLMs for agriculture.
AgroCoT 是一个 VQA 数据集,旨在评估 Vision-Language 模型 (VLM) 在农业环境中的推理能力。它包含 4,759 个样本,需要进行链式思考 (CoT) 推理,特别是在零样本场景中。对 30 个 VLM 的评估显示它们在推理能力方面存在显著差距,强调了在农业评估中引入 CoT 的必要性。数据集可在 https://huggingface.co/datasets/wenyb/AgriCoT 获取。
SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models
Authors: Quentin Guimard, Federico Bartsch, Simone Caldarella, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini
Venue: CVPR
First: 2026-03-19T15:28:08+00:00 · Latest: 2026-03-19T15:28:08+00:00
Comments: CVPR Findings 2026. Project website: https://sparse-embedding-modulation.github.io/
Abstract
Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
中文标题/摘要
标题:SEM:稀疏嵌入调制用于视觉-语言模型的后验去偏
连接视觉和语言的模型,如CLIP,是多模态AI的关键组成部分,但其大规模、未经筛选的训练数据引入了严重的社会和统计偏差。现有的后验去偏方法通常直接在密集的CLIP嵌入空间中操作,其中偏差和任务相关信息高度纠缠。这种纠缠限制了它们在不损害语义保真度的情况下去除偏差的能力。在本工作中,我们提出了稀疏嵌入调制(SEM),这是一种后验、零样本的去偏框架,它在稀疏自编码器(SAE)的潜在空间中操作。通过将CLIP文本嵌入分解为解纠缠的特征,SEM识别并调制与偏差相关的神经元,同时保留与查询相关的神经元。这使得可以进行更精确、非线性的干预。在四个基准数据集和两个CLIP骨干网络上,SEM在检索和零样本分类中实现了显著的公平性提升。我们的结果表明,稀疏潜在表示为视觉-语言模型的后验去偏提供了有效的基础。
Summary / 总结
The research aims to address the social and spurious biases in vision-language models like CLIP by proposing Sparse Embedding Modulation (SEM), a post-hoc debiasing framework. SEM operates in a Sparse Autoencoder latent space to decompose and disentangle CLIP text embeddings, allowing for precise modulation of bias-relevant neurons while preserving task-relevant features. The method achieves significant improvements in fairness across four benchmark datasets and two CLIP backbones in retrieval and zero-shot classification tasks.
研究旨在通过提出Sparse Embedding Modulation (SEM)框架解决CLIP等视觉-语言模型中的社会性和统计性偏见问题。SEM在Sparse Autoencoder的潜在空间中操作,分解和分离CLIP文本嵌入,从而精确调节与偏见相关的神经元,同时保留与任务相关的特征。该方法在四个基准数据集和两个CLIP模型上实现了检索和零样本分类任务中的显著公平性改进。
Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval
Authors: Hangeol Chang, Changsun Lee, Seungjoon Rho, Junho Yeo, Jong Chul Ye
First: 2026-03-19T15:15:58+00:00 · Latest: 2026-03-19T15:15:58+00:00
Abstract
Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to drive the final decision. Existing RAG methods typically rely on a single initial query, which often favors topical relevance over decision-relevant evidence, and therefore retrieves background information that can fail to discriminate among answer options. To address this issue, here we propose Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, and then rewrites retrieval into three targeted queries that seek evidence to: (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify salient clues in the question. This approach enables context retrieval that is more directly aligned with answer selection, allowing the generator to confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at https://anonymous.4open.science/r/HCQR-1C2E.
中文标题/摘要
标题:基于假设条件的查询重写以实现决策有用检索
检索增强生成(RAG)通过将生成与外部非参数知识相结合来提高大型语言模型(LLMs)。然而,当任务要求在竞争选项中进行选择时,仅将生成与广泛相关背景相结合往往不足以驱动最终决策。现有RAG方法通常依赖于单个初始查询,这往往倾向于主题相关性而非决策相关证据,因此检索到的信息可能无法区分答案选项。为了解决这一问题,我们提出了一种无需训练的预检索框架——基于假设条件的查询重写(HCQR),该框架将RAG从主题导向检索转向证据导向检索。HCQR首先从输入问题和候选选项中推导出一个轻量级的工作假设,然后将检索重写为三个针对特定证据的查询,以:(1)支持假设,(2)区分其与竞争替代方案,(3)验证问题中的关键线索。这种方法使上下文检索更直接地与答案选择对齐,使生成器能够根据检索到的证据确认或推翻初始假设。在MedQA和MMLU-Med上的实验表明,HCQR在平均准确性上始终优于单查询RAG和重排/过滤基线,分别提高了5.9和3.6个百分点。代码可在https://anonymous.4open.science/r/HCQR-1C2E获取。
Summary / 总结
The paper addresses a limitation of Retrieval-Augmented Generation (RAG) in tasks that require choosing among competing options by proposing Hypothesis-Conditioned Query Rewriting (HCQR). HCQR derives a working hypothesis from the question and candidate options, then rewrites retrieval into three targeted queries that seek evidence to support the hypothesis, distinguish it from competing alternatives, and verify salient clues in the question. HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 points on MedQA and 3.6 points on MMLU-Med.
论文针对现有检索增强生成(RAG)方法在处理竞争选项决策时的不足,提出了假设条件下的查询重写(HCQR)方法。HCQR将初始查询重写为三个有针对性的查询,以支持、区分和验证与输入问题和候选选项相关的证据。实验结果表明,HCQR在MedQA和MMLU-Med上的准确率分别比简单RAG和重排/过滤基线提高了5.9和3.6个百分点。
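As a rough illustration of the HCQR idea summarized above, the three-query rewrite can be sketched in a few lines. This is not the authors' implementation: the name `rewrite_queries`, the `RewrittenQueries` container, and the prompt templates are all hypothetical, and the working hypothesis is passed in by hand here, whereas the abstract describes it as being derived from the question and candidate options.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RewrittenQueries:
    support: str      # seeks evidence supporting the working hypothesis
    distinguish: str  # seeks evidence separating it from competing options
    verify: str       # seeks evidence checking salient clues in the question


def rewrite_queries(question: str, options: List[str], hypothesis: str,
                    clues: List[str]) -> RewrittenQueries:
    """Build three targeted retrieval queries (templates are assumptions)."""
    if hypothesis not in options:
        raise ValueError("hypothesis must be one of the candidate options")
    # Every option other than the hypothesis is a competing alternative.
    rivals = [o for o in options if o != hypothesis]
    return RewrittenQueries(
        support=f"evidence that {hypothesis} explains: {question}",
        distinguish=f"how to differentiate {hypothesis} from {', '.join(rivals)}",
        verify=f"significance of {', '.join(clues)} in: {question}",
    )
```

Each of the three strings would then be sent to the retriever separately, and the generator would see the combined evidence when confirming or overturning the hypothesis.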
CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
Authors: Qinqian Lei, Bo Wang, Robby T. Tan
Venue: CVPR 2026
First: 2025-08-26T07:30:53+00:00 · Latest: 2026-03-19T15:13:14+00:00
Comments: Accepted by CVPR 2026
Abstract
HOI detection has long been dominated by task-specific models, sometimes with early vision-language backbones such as CLIP. With the rise of large generative VLMs, a key question is whether standalone VLMs can perform HOI detection competitively against specialized HOI methods. Existing benchmarks such as HICO-DET require exact label matching under incomplete annotations, so any unmatched prediction is marked wrong. This unfairly penalizes valid outputs, especially from less constrained VLMs, and makes cross-paradigm comparison unreliable. To address this limitation, we introduce CrossHOI-Bench, a multiple-choice HOI benchmark with explicit positives and curated negatives, enabling unified and reliable evaluation of both VLMs and HOI-specific models. We further focus on challenging scenarios, such as multi-person scenes and fine-grained interaction distinctions, which are crucial for revealing real differences between the two paradigms. Experiments show that large VLMs achieve competitive, sometimes superior, zero-shot performance, yet they struggle with multiple concurrent actions and with correctly assigning interactions to the target person. Conversely, HOI-specific methods remain weaker in general HOI reasoning but demonstrate stronger multi-action recognition and more reliable identification of which person performs which action. These findings expose complementary strengths and weaknesses of VLMs and HOI-specific methods, which existing benchmarks fail to reveal due to incorrect penalization.
中文标题/摘要
标题:CrossHOI-Bench:跨范式的HOI评估统一基准
长期以来,HOI检测主要由任务特定模型主导,有时使用早期的视觉-语言模型,如CLIP。随着大型生成VLM的兴起,一个关键问题是独立的VLM是否能在HOI检测上与专门的HOI方法竞争。现有基准如HICO-DET要求精确标签匹配,在不完整注释下,任何未匹配的预测都会被标记为错误。这不公平地惩罚了有效的输出,尤其是对于较不约束的VLM,使得跨范式比较不可靠。为了解决这一局限性,我们引入了CrossHOI-Bench,这是一个具有明确正例和精心挑选负例的多项选择HOI基准,使VLM和HOI特定模型的统一和可靠评估成为可能。我们进一步关注具有挑战性的场景,如多人场景和精细的交互区分,这对于揭示两种范式之间的真正差异至关重要。实验表明,大型VLM在零样本情况下表现出竞争力,甚至有时更优,但它们在处理多个并发动作和正确分配交互给目标人物方面存在困难。相反,HOI特定方法在一般HOI推理方面较弱,但在多动作识别和更可靠地识别哪个人执行哪种动作方面表现出更强的能力。这些发现揭示了VLM和HOI特定方法的互补优势和劣势,而现有基准由于错误的惩罚未能揭示这些差异。
Summary / 总结
The research aims to evaluate the performance of vision-language models and HOI-specific methods in HOI detection by introducing CrossHOI-Bench, a new benchmark that uses multiple-choice questions with explicit positives and curated negatives. The study finds that large vision-language models can achieve competitive zero-shot performance but struggle with multiple concurrent actions and correctly assigning interactions to the target person. In contrast, HOI-specific methods are better at recognizing multiple actions and identifying which person performs which action, though they are weaker in general HOI reasoning. This benchmark addresses the limitations of existing benchmarks by providing a fairer evaluation framework for both paradigms.
CrossHOI-Bench 旨在评估视觉语言模型和HOI特定方法在HOI检测中的表现。它采用多项选择格式,包含明确的正例和精心挑选的负例,解决了现有基准的局限性。实验表明,大型VLM可以在性能上与HOI特定方法竞争,但难以处理多个并发动作和目标人物的识别,而HOI特定方法在多动作识别方面表现出色,但在一般HOI推理方面较弱。
How to Take a Memorable Picture? Empowering Users with Actionable Feedback
Authors: Francesco Laiti, Davide Talon, Jacopo Staiano, Elisa Ricci
Venue: CVPR 2026
First: 2026-02-25T13:02:35+00:00 · Latest: 2026-03-19T15:10:18+00:00
Comments: Accepted @ CVPR 2026. Project page: https://laitifranz.github.io/MemCoach/
Abstract
Image memorability, i.e., how likely an image is to be remembered, has traditionally been studied in computer vision either as a passive prediction task, with models regressing a scalar score, or with generative methods altering the visual input to boost the image's likelihood of being remembered. Yet, none of these paradigms supports users at capture time, when the crucial question is how to improve a photo's memorability. We introduce the task of Memorability Feedback (MemFeed), where an automated model should provide actionable, human-interpretable guidance to users with the goal of enhancing an image's future recall. We also present MemCoach, the first approach designed to provide concrete suggestions in natural language for memorability improvement (e.g., "emphasize facial expression," "bring the subject forward"). Our method, based on Multimodal Large Language Models (MLLMs), is training-free and employs a teacher-student steering strategy, aligning the model's internal activations toward more memorable patterns learned from a teacher model progressing along least-to-most memorable samples. To enable systematic evaluation on this novel task, we further introduce MemBench, a new benchmark featuring sequence-aligned photoshoots with annotated memorability scores. Our experiments, considering multiple MLLMs, demonstrate the effectiveness of MemCoach, showing consistently improved performance over several zero-shot models. The results indicate that memorability can not only be predicted but also taught and instructed, shifting the focus from mere prediction to actionable feedback for human creators.
中文标题/摘要
标题:如何拍摄令人难忘的照片?赋予用户可操作反馈
图像的难忘性,即图像被记住的可能性,传统上在计算机视觉中要么作为被动预测任务进行研究,模型回归一个标量分数,要么使用生成方法改变视觉输入以提高图像被记住的可能性。然而,这些范式在拍摄时并不支持用户,关键问题是如何提高照片的难忘性。我们引入了难忘性反馈(MemFeed)任务,其中自动化模型应提供可操作的、人类可理解的指导,以提高图像未来回忆的可能性。我们还提出了MemCoach,这是第一个提供具体自然语言建议以提高难忘性的方法(例如,“强调面部表情”,“将主题置于前景”)。我们的方法基于多模态大型语言模型(MLLMs),无需训练,并采用教师-学生引导策略,使模型内部激活与从教师模型中学习到的从最难忘到最难忘的样本对齐。为了在这一新任务上进行系统评估,我们进一步引入了MemBench,这是一个新的基准,包含序列对齐的照片拍摄,并附有标注的难忘性分数。我们的实验,考虑了多个MLLMs,证明了MemCoach的有效性,显示其在多个零样本模型上的一致性改进。结果表明,难忘性不仅可以被预测,也可以被教授和指导,从单纯的预测转向对人类创作者的可操作反馈。
Summary / 总结
The paper introduces the task of Memorability Feedback (MemFeed), where an automated model provides actionable guidance to users to enhance the memorability of their photos. The method, MemCoach, uses Multimodal Large Language Models to offer concrete suggestions in natural language, such as 'emphasize facial expression.' Experiments show that MemCoach outperforms several zero-shot models, indicating that memorability can be both predicted and instructed, shifting the focus from mere prediction to actionable feedback for human creators.
论文提出了记忆反馈(MemFeed)任务,通过自动化模型为用户提供增强照片记忆性的具体建议。方法MemCoach使用多模态大型语言模型(MLLMs)和教师-学生引导策略生成自然语言建议,如“强调面部表情”。实验表明,MemCoach在多个零样本模型中表现出更优的效果,表明记忆性不仅可以预测,还可以被指导和教授,从而将重点从单纯的预测转向创作者的实际反馈。
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
Authors: Jiayi Yuan, Haobo Jiang, De Wen Soh, Na Zhao
First: 2026-03-19T14:18:17+00:00 · Latest: 2026-03-19T14:18:17+00:00
Abstract
This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
中文标题/摘要
标题:VGGT-360:几何一致的零样本全景深度估计
本文提出了VGGT-360,这是一种无需训练的新型框架,用于实现零样本、几何一致的全景深度估计。与之前的视图无关的无需训练方法不同,VGGT-360将任务重新表述为利用VGGT类基础模型的内在三维一致性,通过多视图重建的全景重投影来统一视图分割的推理,从而形成一个连贯的全景理解。为了实现稳健且准确的估计,VGGT-360整合了三个即插即用模块,形成了一个统一的全景到三维再到深度框架:(i) 不确定性引导的自适应投影将全景图切分为透视视图,以弥合全景输入与VGGT的透视先验之间的领域差距。它估计基于梯度的不确定性,将更密集的视图分配给几何贫瘠区域,为VGGT提供几何信息丰富的输入。(ii) 结构显著性增强的注意力在三维重建过程中增强VGGT的鲁棒性,通过将结构感知的置信度注入其注意力层,引导关注几何可靠区域,增强跨视图的一致性。(iii) 相关加权三维模型校正通过使用注意力推断的相关分数重新加权重叠点,细化重建的三维模型,为准确的全景重投影提供一致的几何基础。广泛的实验表明,VGGT-360在多个分辨率和多种室内外数据集上均优于训练有素和无需训练的最新方法。
Summary / 总结
VGGT-360 is a novel training-free framework for zero-shot panoramic depth estimation that leverages 3D consistency of VGGT-like models to unify per-view reasoning into a coherent panoramic understanding. It integrates three modules: uncertainty-guided adaptive projection, structure-saliency enhanced attention, and correlation-weighted 3D model correction, which enhance robustness and accuracy. Experiments demonstrate that VGGT-360 outperforms both trained and training-free state-of-the-art methods across various resolutions and datasets.
VGGT-360 是一个无需训练的框架,用于零样本全景深度估计,通过将任务重新表述为在多视图 3D 模型上的全景重投影,利用 VGGT 类基础模型的 3D 一致性。它整合了三个模块:不确定性引导自适应投影、结构显著性增强注意力和相关加权 3D 模型修正,这些模块共同提高了鲁棒性和准确性。实验表明,VGGT-360 在各种分辨率和不同室内外数据集上均优于已训练和未训练的最新方法。
MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model
Authors: Youngwan Lee, Soojin Jang, Yoorhim Cho, Seunghwan Lee, Yong-Ju Lee, Sung Ju Hwang
First: 2026-03-19T13:33:26+00:00 · Latest: 2026-03-19T13:33:26+00:00
Comments: Project page: https://youngwanlee.github.io/multihopspatial
Abstract
Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
中文标题/摘要
标题:MultihopSpatial:多跳组合空间推理基准模型用于视觉-语言模型
空间推理是视觉-语言模型(VLMs)的基础,特别是在作为视觉-语言-行动(VLA)代理在物理环境中部署时。然而,现有的基准测试主要集中在基础的单跳关系上,忽视了多跳组合推理和精确的视觉定位,这对于现实世界场景至关重要。为了解决这个问题,我们引入了MultihopSpatial,提供了三个关键贡献:(1)一个旨在进行多跳和组合空间推理的全面基准测试,涵盖从1到3跳的复杂查询,跨越多种空间视角。(2)Acc@50IoU,一个补充性度量标准,同时评估推理和视觉定位,要求进行答案选择和精确边界框预测——这些能力对于稳健的VLA部署至关重要。(3)MultihopSpatial-Train,一个专门的大规模训练语料库,以促进空间智能的发展。对37个最先进的VLMs的广泛评估揭示了八个关键见解,表明组合空间推理仍然是一个严峻的挑战。最后,我们证明了在我们的语料库上进行强化学习后训练可以提高VLM的内在空间推理能力和下游实体操作性能。
Summary / 总结
The research aims to improve Vision-Language Models (VLMs) for multi-hop and compositional spatial reasoning, which is crucial for real-world applications. The study introduces MultihopSpatial, a benchmark that includes complex 1- to 3-hop spatial queries and a new metric Acc@50IoU that evaluates both reasoning and visual grounding. The evaluation of 37 state-of-the-art VLMs reveals significant challenges in compositional spatial reasoning, and post-training reinforcement learning on the MultihopSpatial-Train corpus improves both spatial reasoning and embodied manipulation performance.
研究旨在提升Vision-Language Models (VLMs)在真实场景中的多跳和组合空间推理能力。研究引入了MultihopSpatial基准,包含1-到3跳的复杂查询,并提出了一种新的评估指标Acc@50IoU,同时评估推理和视觉定位。主要发现表明,现有VLMs在组合空间推理方面存在挑战,而通过MultihopSpatial数据集进行强化学习后训练,可以提升空间推理能力和下游的实体操作性能。
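The Acc@50IoU metric described above can be read as a joint correctness check. The sketch below is one plausible interpretation rather than the benchmark's official scoring code: it assumes boxes in `(x1, y1, x2, y2)` format, an inclusive IoU threshold of 0.5, and hypothetical field names (`pred_answer`, `gt_answer`, `pred_box`, `gt_box`).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_50_iou(samples, threshold=0.5):
    """Fraction of samples with a correct answer AND a well-localized box.

    Field names and the inclusive threshold are assumptions for illustration.
    """
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(
        1 for s in samples
        if s["pred_answer"] == s["gt_answer"]
        and iou(s["pred_box"], s["gt_box"]) >= threshold
    )
    return hits / len(samples)
```

Under this reading, a model that picks the right option but draws a loose box gets no credit, which is what lets the metric evaluate reasoning and grounding together.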
LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer
Authors: Song Fei, Tian Ye, Lujia Wang, Lei Zhu
First: 2025-09-26T14:39:08+00:00 · Latest: 2026-03-19T12:57:49+00:00
Comments: Project Page: https://w2genai-lab.github.io/LucidFlux
Abstract
Image restoration (IR) aims to recover images degraded by unknown mixtures while preserving semantics, conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free IR framework that adapts a large diffusion transformer (Flux.1) without image captions. Our LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbone's hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or Vision-Language Model (VLM) captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, our LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on, rather than adding parameters or relying on text prompts, is the governing lever for robust and caption-free image restoration in the wild.
中文标题/摘要
标题:LucidFlux:无需描述的高保真图像恢复大型扩散变换器
图像恢复(IR)旨在恢复被未知混合物降级的图像,同时保留语义。在语义条件下,判别恢复器和基于UNet的扩散先验往往过度平滑、幻觉或漂移。我们提出了LucidFlux,这是一种无需描述的IR框架,它适应了一个大型扩散变换器(Flux.1),而无需图像描述。我们的LucidFlux引入了一个轻量级的双分支条件器,分别从降级输入和轻度恢复的代理中注入信号,以分别锚定几何结构和抑制伪影。然后,设计了一种时间步长和层自适应调制调度,以在骨干网络层次结构中路由这些线索,从而实现从粗到细和上下文感知的更新,以保护全局结构并恢复纹理。之后,为了避免文本提示或视觉语言模型(VLM)描述的延迟和不稳定,我们通过从代理中提取的SigLIP特征强制执行无描述的语义对齐。一个可扩展的策展管道进一步筛选大规模数据以提供结构丰富的监督。在合成和野外基准测试中,我们的LucidFlux始终优于强大的开源和商用基线,消融研究验证了每个组件的必要性。LucidFlux表明,对于大型DiTs,何时、何地以及如何进行条件控制,而不是增加参数或依赖于文本提示,是野外稳健和无描述图像恢复的关键杠杆。
Summary / 总结
LucidFlux is a caption-free image restoration framework that uses a large diffusion transformer to recover images degraded by unknown mixtures while preserving semantics. It introduces a lightweight dual-branch conditioner and a timestep- and layer-adaptive modulation schedule to protect global structure and recover texture. LucidFlux consistently outperforms strong open-source and commercial baselines across synthetic and in-the-wild benchmarks, and ablation studies confirm the necessity of each component.
LucidFlux 是一个无图描述的图像恢复框架,使用大型扩散变换器来恢复被未知混合物降级的图像,同时保持语义。它引入了一个轻量级的双分支条件器和一个时间步和层自适应调制调度,以保护全局结构并恢复纹理。LucidFlux 在合成和现实世界基准测试中均优于强大的开源和商用基线,并且消融研究证实了每个组件的必要性。
HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models
Authors: Xiangyu Bai, Bishoy Galoaa, Sarah Ostadabbas
First: 2026-03-19T12:53:32+00:00 · Latest: 2026-03-19T12:53:32+00:00
Abstract
Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet's policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing what a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at https://github.com/ostadabbas/HORNet.
中文标题/摘要
标题:HORNet:基于任务指导的视频问答中帧选择策略
视频问答(VQA)使用视觉-语言模型(VLMs)依赖于从输入视频中选择哪些帧,但大多数系统依赖于均匀或启发式的采样,这些采样无法优化下游问答质量。我们引入了**HORNet**,这是一种通过组相对策略优化(GRPO)训练的轻量级帧选择策略,以学习冻结的VLM需要查看哪些帧才能正确回答问题。HORNet通过减少输入帧多达99%和VLM处理时间多达93%,同时在短格式基准上提高答案质量(MSVD-QA的F1分数提高1.7%),并在时间推理任务上表现出色(NExT-QA上比均匀采样高7.3分)。我们将此任务形式化为“选择任意帧”(SAF),该任务将视觉输入的策划与VLM推理解耦,并证明GRPO训练的选择在分布外泛化能力优于监督学习和PPO替代方案。HORNet的策略在与不同VLM回答器结合使用时无需重新训练,与更强的模型配对时可获得额外8.5%的相对增益。在六个基准测试中评估了341,877个问答对和114.2小时的视频,我们的结果表明,优化VLM所见内容是一种实用且互补的替代方案,同时提高了效率。代码可在https://github.com/ostadabbas/HORNet/获取。
Summary / 总结
HORNet is a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to optimize which frames a frozen vision-language model needs to answer video questions correctly. It reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks and achieving strong performance on temporal reasoning tasks. HORNet's policy transfers across different vision-language models, yielding additional gains. Evaluated across six benchmarks, the results show that optimizing what a VLM sees is practical and complementary to optimizing what it generates, improving efficiency.
HORNet 是一种通过组相对策略优化(GRPO)训练的轻量级帧选择策略,旨在优化一个冻结的视觉语言模型需要查看的视频帧,以正确回答问题。它将输入帧减少到最多 99%,并将视觉语言模型的处理时间减少到最多 93%,同时在短格式基准上提高答案质量,并在时间推理任务上表现出强劲性能。HORNet 的策略在不同视觉语言模型之间具有转移性,可以带来额外的收益。在六个基准测试中评估了这些结果,显示优化视觉语言模型看到的内容是实用且与优化生成内容互补的,同时提高了效率。
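GRPO, the training method named above, scores each sampled output relative to the other samples drawn for the same input instead of using a learned value baseline. The sketch below shows only that group-relative normalization step. How HORNet defines its reward and samples frame subsets is not stated in the abstract, so the example rewards are placeholders, and the use of the population standard deviation with a small `eps` is an assumption (implementations differ on this detail).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same input.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and the eps value are illustrative assumptions.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Placeholder rewards for four candidate frame selections of one video question,
# e.g. 1.0 if the frozen VLM answered correctly from those frames, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Selections that beat the group average receive positive advantages and are reinforced; when every sample in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.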
Activation Quantization of Vision Encoders Needs Prefixing Registers
Authors: Seunghyeon Kim, Taesun Yeom, Jinho Kim, Wonpyo Park, Kyuyeun Kim, Jaeho Lee
First: 2025-10-06T07:27:46+00:00 · Latest: 2026-03-19T12:18:57+00:00
Comments: under review; 28 pages, 9 figures
Abstract
Large pretrained vision encoders are central to multimodal intelligence, powering applications from on-device vision processing to vision-language models. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Quantization offers a practical path, but it remains challenging even at 8-bit precision due to so-called outliers. In this work, we propose RegCache, a training-free algorithm that mitigates outliers in large-scale pretrained vision encoders and serves as a plug-in module that can be applied on top of other quantization methods. RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the vision encoder, which prevent other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experimental results show that our method consistently improves quantized model performance across various vision encoders, particularly in extremely low-bit regimes (e.g., 4-bit).
中文标题/摘要
标题:视觉编码器的激活量化需要前缀寄存器
大型预训练视觉编码器是多模态智能的核心,驱动着从设备端视觉处理到视觉语言模型的各种应用。由于这些应用通常需要实时处理大量视觉数据,因此降低视觉编码器的推理成本至关重要。量化提供了一条可行的路径,但在8位精度下仍面临挑战,主要是所谓的异常值问题。在本工作中,我们提出了一种无需训练的算法$\textit{RegCache}$,该算法可以缓解大型预训练视觉编码器中的异常值问题,并作为可插拔模块应用于其他量化方法之上。RegCache 引入了前缀的、但具有语义意义的前缀标记到视觉编码器中,防止其他标记出现异常值。值得注意的是,我们观察到视觉编码器中的异常值与语言模型中的异常值行为不同,这促使了两项技术创新:中间层前缀和标记删除。实验结果表明,我们的方法在各种视觉编码器中一致地提高了量化模型的性能,特别是在极低位数(例如4位)的情况下。
Summary / 总结
This work addresses the challenge of quantizing large pretrained vision encoders to reduce inference costs while maintaining performance. The proposed RegCache method introduces prefix tokens to mitigate outliers, which are problematic at low-bit precision. Experimental results demonstrate consistent improvements in quantized model performance, especially in 4-bit regimes, through innovations like middle-layer prefixing and token deletion.
该研究解决了大规模预训练视觉编码器量化的问题,这些编码器对于实时视觉处理至关重要。提出的$\textit{RegCache}$方法通过引入前缀标记来缓解异常值问题,并作为其他量化技术的插件模块。实验结果表明,$\textit{RegCache}$在低位宽情况下(如4位)显著提升了量化模型的性能。
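To see why activation outliers make quantization hard, consider a generic symmetric per-tensor quantizer. The sketch below is not RegCache and does not model prefix tokens. It only illustrates the underlying problem under simple assumptions: one large value stretches the quantization scale so far that ordinary activations collapse to zero. The function names and the toy activation values are made up for the example.

```python
def quantize_dequantize(values, bits=8):
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax              # the largest magnitude sets the step size
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    """Mean absolute error over the positions present in `original`."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


normal = [0.11, -0.42, 0.73, -0.95, 0.58]   # toy "ordinary" activations
with_outlier = normal + [120.0]             # same tensor plus one outlier

err_clean = mean_abs_error(normal, quantize_dequantize(normal, bits=4))
err_outlier = mean_abs_error(normal, quantize_dequantize(with_outlier, bits=4))
```

With the outlier present, the 4-bit step size grows from roughly 0.14 to roughly 17, so every ordinary value rounds to zero. Confining such outliers to a few dedicated tokens, as the abstract describes, keeps them from dictating the scale for the rest.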
Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation
Authors: Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler
First: 2026-03-19T11:46:01+00:00 · Latest: 2026-03-19T11:46:01+00:00
Abstract
Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks, improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
中文标题/摘要
标题:Perceptio:通过空间标记生成增强感知的视觉语言模型
大型视觉语言模型(LVLMs)在语义理解方面表现出色,但在精细的空间定位方面存在困难,因为模型必须通过隐式推断复杂的几何结构,而从未产生过空间解释。我们提出了Perceptio,这是一种增强感知的LVLM,具有二维和三维空间推理能力,通过在自回归序列中直接生成语义分割标记和深度标记来实现。具体来说,我们(i) 从强大的单目教师中提取VQVAE深度码本,将密集深度标记化为紧凑序列,(ii) 在LLM中集成基于SAM2的语义分割标记和VQ-VAE深度标记,使模型首先发出空间标记,然后回答。为了稳定深度标记生成,我们引入了新颖的复合深度标记目标(标记、标记计数和标记损失)以及一种可微重构的软合并技术。我们采用跨多种数据集的多任务协同训练策略,让模型学习感知标记以应对多种下游任务。基于InternVL,Perceptio在基准测试中取得了最先进的性能:在RefCOCO/+/g HardBLINK中空间理解准确性分别提高10.3%,在MMBench中准确性提高1.0%,证明了显式空间推理链对LVLM中空间定位的实质性增强。
Summary / 总结
Perceptio is a perception-enhanced Vision Language Model that integrates 2D and 3D spatial reasoning through explicit semantic segmentation and depth tokens generated within the autoregressive sequence. It uses a VQ-VAE depth codebook and SAM2-based semantic segmentation tokens so that the model emits spatial tokens before answering, which improves spatial grounding. Built on InternVL, Perceptio achieves state-of-the-art performance, improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%.
Perceptio 是一种通过在自回归序列中生成显式的语义分割和深度令牌来集成 2D 和 3D 空间推理的视觉语言模型。它使用 VQ-VAE 深度码本和基于 SAM2 的语义分割令牌,使模型首先生成空间令牌再进行回答。Perceptio 在多个基准测试中提高了性能,包括在 HardBLINK 上的空间理解准确率提高了 10.3%,以及在 MMBench 上的准确率提高了 1.0%,表明显式的空间推理增强了视觉语言模型中的空间定位。
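The depth tokenization step described above relies on vector quantization: each depth feature is replaced by the index of its nearest codebook entry. The following is a minimal sketch of that lookup only, assuming a tiny hand-written codebook and 2-dimensional patch features. The names `quantize`, `dequantize`, `codebook`, and `patches` are illustrative and are not taken from the Perceptio code; the actual model learns its codebook by distillation from a monocular depth teacher.

```python
from typing import List, Sequence


def quantize(vectors: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
        for v in vectors
    ]


def dequantize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Reconstruct approximate vectors by looking tokens up in the codebook."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy codebook and depth-patch features (assumptions for illustration).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
patches = [[0.1, -0.2], [0.9, 1.2], [3.5, 4.1]]

tokens = quantize(patches, codebook)          # discrete depth tokens
reconstruction = dequantize(tokens, codebook)  # lossy reconstruction
print(tokens, reconstruction)
```

The token list is what an autoregressive model would emit in place of a dense depth map; the reconstruction shows the information that survives quantization.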
Balanced Thinking: Improving Chain of Thought Training in Vision Language Models
Authors: Shaked Perek, Ben Wiesel, Avihu Dekel, Nimrod Shabtay, Eli Schwartz
First: 2026-03-19T09:21:49+00:00 · Latest: 2026-03-19T09:21:49+00:00
Abstract
Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long <think> traces overshadow short but task-critical <answer> segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the <think> segment, SCALe-SFT gradually shifts the focus from <think> to <answer> throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.
中文标题/摘要
标题:平衡思维:提升视觉语言模型的链式思维训练
视觉语言模型(VLMs)中的多模态推理通常依赖于两阶段过程:监督微调(SFT)和强化学习(RL)。在标准的SFT中,所有标记对损失的贡献是平等的,即使推理数据本质上是标记不平衡的。长的<思考>痕迹遮盖了短但任务关键的<答案>片段,导致冗长的推理和不准确的答案。我们提出了SCALe(逐步课程自适应损失),它明确地通过动态、与长度无关的加权来分离对推理和答案片段的监督。与传统的SFT不同,SCALe-SFT通过余弦调度策略在整个训练过程中逐渐将重点从<思考>转移到<答案>,鼓励简洁且有根据的推理。我们在多种基准和架构上评估了SCALe。结果显示,SCALe在准确性上始终优于传统的SFT,并且在训练时间仅为完整两阶段SFT + GRPO流水线的约七分之一的情况下达到了相当的性能,使其成为一种轻量级但有效的替代方案。当与GRPO结合使用时,SCALe实现了最佳的整体性能,突显了其作为独立方法和强化细化强大基础的价值。
Summary / 总结
The paper addresses token imbalance in supervised fine-tuning of vision-language models, where long <think> traces dominate the loss and overshadow short but task-critical <answer> segments. It introduces SCALe, which weights reasoning and answer segments separately with dynamic, length-independent weights and shifts the focus from <think> to <answer> over training via a cosine schedule. Experiments across diverse benchmarks and architectures show that SCALe improves accuracy over vanilla SFT and matches the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight and effective alternative.
论文针对视觉语言模型监督微调中的标记不平衡问题,其中推理数据往往被冗长的<think>痕迹主导,而掩盖了简短但关键的<answer>片段。文中提出了一种名为SCALe的方法,该方法动态调整推理和答案片段之间的损失权重,促进简洁且准确的推理。实验结果表明,SCALe在各种基准测试中提高了准确性,与两阶段的SFT + GRPO流水线相比性能相当,而训练时间仅约为其七分之一,使其成为一种轻量级且有效的替代方案。
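The segment weighting described above can be sketched as follows. This is an illustration under stated assumptions, not the SCALe implementation: the abstract only says the weighting is length-independent and follows a cosine schedule from <think> toward <answer>. The endpoint weights (0.2 and 0.8), the choice of per-segment means, and the function names `answer_weight` and `segment_weighted_loss` are all assumptions.

```python
import math
from typing import Sequence


def answer_weight(step: int, total_steps: int,
                  start: float = 0.2, end: float = 0.8) -> float:
    """Cosine ramp of the <answer> weight from `start` to `end` (assumed endpoints)."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * 0.5 * (1.0 - math.cos(math.pi * progress))


def segment_weighted_loss(think_losses: Sequence[float],
                          answer_losses: Sequence[float],
                          step: int, total_steps: int) -> float:
    """Combine per-token losses so that segment length does not set the balance."""
    if not think_losses or not answer_losses:
        raise ValueError("both segments need at least one token")
    # Averaging inside each segment removes the dependence on segment length.
    mean_think = sum(think_losses) / len(think_losses)
    mean_answer = sum(answer_losses) / len(answer_losses)
    w_answer = answer_weight(step, total_steps)
    return (1.0 - w_answer) * mean_think + w_answer * mean_answer


# Toy example: 100 reasoning tokens with loss 1.0, 2 answer tokens with loss 3.0.
think = [1.0] * 100
answer = [3.0] * 2
for step in (0, 50, 100):
    print(step, round(segment_weighted_loss(think, answer, step, 100), 3))
```

With a plain token-level mean, the two answer tokens in this toy example would contribute under 6% of the loss; here their share is set by the schedule rather than by their count.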
MeInTime: Bridging Age Gap in Identity-Preserving Face Restoration
Authors: Teer Song, Yue Zhang, Yu Tian, Ziyang Wang, Xianlin Zhang, Guixuan Zhang, Xuan Liu, Xueming Li, Yasen Zhang
First: 2026-03-19T09:11:07+00:00 · Latest: 2026-03-19T09:11:07+00:00
Abstract
To better preserve an individual's identity, face restoration has evolved from reference-free to reference-based approaches, which leverage high-quality reference images of the same identity to enhance identity fidelity in the restored outputs. However, most existing methods implicitly assume that the reference and degraded input are age-aligned, limiting their effectiveness in real-world scenarios where only cross-age references are available, such as historical photo restoration. This paper proposes MeInTime, a diffusion-based face restoration method that extends reference-based restoration from same-age to cross-age settings. Given one or few reference images along with an age prompt corresponding to the degraded input, MeInTime achieves faithful restoration with both identity fidelity and age consistency. Specifically, we decouple the modeling of identity and age conditions. During training, we focus solely on effectively injecting identity features through a newly introduced attention mechanism and introduce Gated Residual Fusion modules to facilitate the integration between degraded features and identity representations. At inference, we propose Age-Aware Gradient Guidance, a training-free sampling strategy, using an age-driven direction to iteratively nudge the identity-aware denoising latent toward the desired age semantic manifold. Extensive experiments demonstrate that MeInTime outperforms existing face restoration methods in both identity preservation and age consistency. Our code is available at: https://github.com/teer4/MeInTime
中文标题/摘要
标题:MeInTime: 跨年龄身份保留面部恢复方法
为了更好地保留个体的身份,面部恢复从无参考方法发展到了有参考方法,后者利用同一身份的高质量参考图像来增强恢复输出中的身份保真度。然而,大多数现有方法隐含地假设参考图像和退化输入在年龄上是匹配的,这限制了它们在只有跨年龄参考图像可用的真实场景中的效果,例如历史照片恢复。本文提出了一种基于扩散的面部恢复方法MeInTime,该方法将参考方法从同龄扩展到跨龄设置。给定一个或几个参考图像以及与退化输入对应的年龄提示,MeInTime 能够实现具有身份保真度和年龄一致性的忠实恢复。具体而言,我们解耦身份和年龄条件的建模。在训练过程中,我们专注于通过新引入的注意力机制有效注入身份特征,并引入门控残差融合模块以促进退化特征与身份表示之间的整合。在推理过程中,我们提出了一种无需训练的年龄感知梯度引导策略,使用年龄驱动的方向逐步引导身份感知去噪潜在变量向所需的年龄语义流形。大量实验表明,MeInTime 在身份保留和年龄一致性方面均优于现有面部恢复方法。我们的代码可在以下链接获取:https://github.com/teer4/MeInTime
Summary / 总结
MeInTime is a diffusion-based face restoration method that addresses the challenge of restoring faces from historical photos where the reference and input images are of different ages. By decoupling identity and age modeling, MeInTime uses a newly introduced attention mechanism and Gated Residual Fusion modules during training to enhance identity fidelity. At inference, it employs Age-Aware Gradient Guidance to ensure age consistency. Experiments show that MeInTime outperforms existing methods in preserving identity and maintaining age consistency.
MeInTime 是一种基于扩散的面部恢复方法,旨在在只有跨年龄参考的情况下,实现面部恢复的同时保持身份一致性和年龄一致性。该方法将身份与年龄条件的建模解耦:训练时通过新引入的注意力机制和 Gated Residual Fusion 模块注入身份特征;推理阶段使用无需训练的 Age-Aware Gradient Guidance 迭代引导恢复结果向目标年龄语义流形靠拢。实验表明,MeInTime 在保持身份和年龄一致性方面优于现有方法。
Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering
Authors: Jiayi Luo, Jiayu Chen, Jiankun Wang, Cong Wang, Hanxin Zhu, Qingyun Sun, Chen Gao, Zhibo Chen, Jianxin Li
First: 2026-03-19T09:00:08+00:00 · Latest: 2026-03-19T09:00:08+00:00
Abstract
Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to $1.93\times$ speedup while maintaining a PSNR of up to 29 dB on Wan2.1.
中文标题/摘要
标题:基于离线层间稀疏性表征和在线双向共聚类的无需训练的稀疏注意力以实现快速视频生成
扩散变换器(DiTs)在视频生成质量上表现出色,但由于密集的3D注意力导致推理成本高,因此开发了稀疏注意力技术以提高效率。然而,现有的无需训练的视频生成稀疏注意力方法仍然面临两个未解决的局限性:忽略注意力剪枝中的层间异质性以及忽略查询-键耦合在块分割中的作用,这阻碍了更好的质量-加速权衡。在本文中,我们揭示了一个关键见解,即每个层的注意力稀疏性是其固有的属性,不同输入之间的影响较小。受此启发,我们提出了SVOO,一种基于离线层间稀疏性表征和在线双向共聚类的无需训练的稀疏注意力框架。具体而言,SVOO采用两阶段范式:(i)离线层间敏感性表征以推导固有的每层剪枝水平,(ii)通过一种新颖的双向共聚类算法实现块级稀疏注意力。在七个广泛使用的视频生成模型上的大量实验表明,SVOO在质量-加速权衡上优于最先进的方法,同时在Wan2.1上保持高达29 dB的PSNR,实现高达1.93倍的加速。
Summary / 总结
The work addresses the high inference cost of dense 3D attention in Diffusion Transformers (DiTs) for video generation by proposing SVOO, a training-free sparse attention framework. SVOO involves offline layer-wise sparsity profiling to determine intrinsic pruning levels and online bidirectional co-clustering for block-wise sparse attention. Experiments show that SVOO outperforms existing methods, achieving up to 1.93 times speedup with PSNR up to 29 dB on Wan2.1.
该研究针对扩散变换器(DiTs)在视频生成中因密集3D注意力导致的高推理成本,提出了SVOO,一种无需训练的稀疏注意力框架。SVOO通过离线逐层稀疏性分析确定固有的剪枝水平,并使用在线双向共聚类算法进行块级稀疏注意力。实验表明,SVOO优于现有方法,最高可实现1.93倍的加速,同时在Wan2.1上保持PSNR高达29 dB。
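For readers unfamiliar with the quality metric quoted above, PSNR is defined as 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value. A minimal sketch of the standard formula follows, assuming flattened 8-bit pixel sequences. The abstract does not state which reference the frames are compared against, so the inputs here are purely illustrative.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], candidate: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Illustrative pixels only: a uniform error of about 9 intensity levels.
reference = [100.0, 120.0, 140.0, 160.0]
candidate = [p + 9.05 for p in reference]
print(round(psnr(reference, candidate), 2))
```

On an 8-bit scale, 29 dB corresponds to a root-mean-square error of roughly 9 intensity levels, which gives a sense of how close the sparse-attention output stays to its reference.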
Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation
Authors: Ping Chen, Daoxuan Zhang, Xiangming Wang, Yungeng Liu, Haijin Zeng, Yongyong Chen
First: 2026-03-19T08:50:49+00:00 · Latest: 2026-03-19T08:50:49+00:00
Abstract
Precise Text-to-Image (T2I) generation has achieved great success but is hindered by the limited relational reasoning of static text encoders and the error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities during the Ordinary Differential Equation trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search incorporates a training-free closed-loop parallel rollout search and flow steering mechanism, which leverages a Vision-Language Model (VLM) as a semantic critic to diagnose intermediate latents and dynamically steer the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process, exploring multiple trajectories through lookahead simulations and selecting the optimal path based on VLM-guided rewards. Further, we provide AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. Experimental results show that our AFS-Search-Pro greatly boosts the performance of the original FLUX.1-dev, achieving state-of-the-art results across three different benchmarks. Meanwhile, AFS-Search-Fast also significantly enhances performance while maintaining fast generation speed.
中文标题/摘要
标题:代理流动引导和并行展开搜索在空间定位文本到图像生成中的应用
精确的文本到图像(T2I)生成已经取得了巨大成功,但受限于静态文本编码器的有限关系推理能力和开环采样中的误差累积。缺乏实时反馈,初始语义模糊性在常微分方程轨迹中不可避免地演变成空间约束的随机偏差。为解决这一问题,我们引入了AFS-Search(代理流动引导和并行展开搜索),这是一种基于FLUX.1-dev的无需训练的闭环框架。AFS-Search 结合了无需训练的闭环并行展开搜索和流动引导机制,利用视觉-语言模型(VLM)作为语义批评家来诊断中间潜变量,并通过精确的空间定位动态引导速度场。此外,我们将T2I生成视为一个顺序决策过程,通过前瞻模拟探索多个轨迹,并基于VLM引导的奖励选择最优路径。进一步,我们提供了AFS-Search-Pro以获得更高性能,并提供了AFS-Search-Fast以实现更快的生成速度。实验结果表明,我们的AFS-Search-Pro极大地提升了原始FLUX.1-dev的性能,在三个不同的基准测试中取得了最先进的结果。同时,AFS-Search-Fast也显著提高了性能,同时保持了快速生成速度。
Summary / 总结
The research aims to improve the precision and reliability of Text-to-Image (T2I) generation by addressing the limitations of static text encoders and open-loop sampling. The method introduces AFS-Search, a training-free closed-loop framework that uses a Vision-Language Model (VLM) to diagnose and steer the generation process, ensuring spatial constraints are met. Experimental results demonstrate that AFS-Search-Pro significantly improves the performance of FLUX.1-dev, achieving state-of-the-art results across three benchmarks, while AFS-Search-Fast maintains high performance with faster generation speed.
研究旨在通过解决静态文本编码器和开环采样的局限性,提高文本到图像生成的精确性和空间一致性。方法AFS-Search引入了一个无需训练的闭环框架,使用视觉语言模型(VLM)诊断和引导生成过程,确保满足空间约束。实验结果表明,AFS-Search-Pro显著提升了FLUX.1-dev的性能,在三个基准测试中达到了最先进的效果,而AFS-Search-Fast则保持了快速生成速度的同时提高了性能。
GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?
Authors: Yueying Zou, Pei Pei Li, Zekun Li, Xinyu Guo, Xing Cui, Huaibo Huang, Ran He
Venue: ECCV 2026
First: 2026-03-19T08:44:08+00:00 · Latest: 2026-03-19T08:44:08+00:00
Comments: ECCV 2026 submission. 14 pages, 6 figures, 4 tables. Supplementary material included
Abstract
In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.
中文标题/摘要
标题:GenVideoLens:LVLM在AI生成视频检测中存在哪些不足?
近年来,AI生成的视频越来越逼真和复杂。与此同时,大型视觉-语言模型(LVLM)在检测此类内容方面显示出强大的潜力。然而,现有的评估协议主要将任务视为二元分类问题,并依赖于粗粒度的指标,如总体准确率,这为了解LVLM的成功或失败提供了有限的见解。为了解决这一局限性,我们引入了GenVideoLens,这是一个细粒度基准,使我们能够从维度上评估LVLM在AI生成视频检测中的能力。基准数据集包含400个高度欺骗性的AI生成视频和100个真实视频,由专家在15个涵盖感知、光学、物理和时间线索的真伪维度上进行标注。我们在这项基准上评估了11个代表性LVLM。我们的分析揭示了明显的维度不平衡。虽然LVLM在感知线索方面表现相对较好,但在光学一致性、物理交互和时间因果推理方面却表现不佳。模型在不同维度上的表现也存在显著差异,较小的开源模型有时在特定真伪线索上会优于更强的专有模型。时间扰动实验进一步表明,当前的LVLM对时间信息的利用有限。总体而言,GenVideoLens为诊断LVLM的行为提供了诊断性见解,揭示了关键的能力差距,并为改进未来的AI生成视频检测系统提供了指导。
Summary / 总结
GenVideoLens is introduced to evaluate the performance of Large Vision-Language Models (LVLMs) in detecting AI-generated videos. The benchmark includes 400 highly deceptive AI-generated videos and 100 real videos, annotated across 15 dimensions. The study reveals that LVLMs excel in perceptual cues but struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance varies across dimensions, with smaller open-source models sometimes outperforming larger proprietary models. Temporal perturbation experiments indicate limited use of temporal information by current LVLMs. This work provides insights into LVLM behavior and highlights key capability gaps in AI-generated video detection systems.
研究旨在评估大型视觉语言模型(LVLMs)在检测AI生成视频方面的性能,解决现有二元分类指标的局限性。引入了GenVideoLens基准,评估LVLMs在15个真实度维度上的表现,包括感知、光学、物理和时间线索。研究发现,LVLMs在感知线索方面表现出色,但在光学一致性、物理交互和时间因果推理方面存在困难。此外,小型开源模型有时在特定真实度线索上优于大型专有模型,而时间扰动实验表明,当前的LVLMs对时间信息的利用有限。
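The contrast between a single overall accuracy and dimension-wise evaluation can be made concrete with a short sketch. The record format `(dimension, is_correct)` and the function name `per_dimension_accuracy` are assumptions for illustration; the benchmark's real annotation schema is not given in the abstract, and the toy records below are not benchmark results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_dimension_accuracy(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy of model judgments grouped by authenticity dimension."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for dimension, is_correct in records:
        totals[dimension] += 1
        hits[dimension] += int(is_correct)
    return {d: hits[d] / totals[d] for d in totals}


# Hypothetical judgments, invented for this example.
records = [
    ("perceptual", True), ("perceptual", True),
    ("optical", False), ("optical", True),
    ("temporal", False),
]
overall = sum(ok for _, ok in records) / len(records)
by_dimension = per_dimension_accuracy(records)
print(overall, by_dimension)
```

In this toy data the overall accuracy of 0.6 hides a complete failure on the temporal dimension, which is the kind of imbalance a coarse metric cannot expose.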
REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation
Authors: Shuqi Xiao, Maani Ghaffari, Chengzhong Xu, Hui Kong
First: 2026-03-19T08:43:40+00:00 · Latest: 2026-03-19T08:43:40+00:00
Abstract
Zero-shot object-goal navigation (ZSON) requires navigating unknown environments to find a target object without task-specific training. Prior hierarchical training-free solutions invest in scene understanding (belief) and high-level decision-making (policy), yet overlook the design of option, i.e., a subgoal candidate proposed from evolving belief and presented to policy for selection. In practice, options are reduced to isolated waypoints scored independently: single destinations hide the value gathered along the journey; an unstructured collection obscures the relationships among candidates. Our insight is that the option space should be a tree of paths. Full paths expose en-route information gain that destination-only scoring systematically neglects; a tree of shared segments enables coarse-to-fine LLM reasoning that dismisses or pursues entire branches before examining individual leaves, compressing the combinatorial path space into an efficient hierarchy. We instantiate this insight in REST (Receding Horizon Explorative Steiner Tree), a training-free framework that (1) builds an explicit open-vocabulary 3D map from online RGB-D streams; (2) grows an agent-centric tree of safe and informative paths as the option space via sampling-based planning; and (3) textualizes each branch into a spatial narrative and selects the next-best path through chain-of-thought LLM reasoning. Across the Gibson, HM3D, and HSSD benchmarks, REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency, demonstrating a favorable efficiency-success balance.
中文标题/摘要
标题:REST:退缩视野探索性Steiner树在零样本物体-目标导航中的应用
零样本物体-目标导航(ZSON)要求在未知环境中导航以找到目标物体,无需针对特定任务进行训练。先前的无需训练的分层解决方案在场景理解(信念)和高层次决策(策略)上投入了大量资源,但忽视了选项(即从不断演变的信念中提出的子目标候选,并呈交给策略进行选择)的设计。实际上,选项被简化为孤立的航点,独立评分:单一目的地忽略了旅程中积累的价值;无序的集合掩盖了候选者之间的关系。我们的见解是,选项空间应该是一个路径树。完整路径揭示了仅对目的地评分时被系统性忽视的沿途信息增益;路径树中的共享路段使由粗到细的LLM推理成为可能,在检查个别叶子之前,可以先舍弃或追求整个分支,从而将组合路径空间压缩成一个高效的层次结构。我们通过REST(退缩视野探索性Steiner树)这一无需训练的框架将这一见解付诸实践,该框架(1)从在线RGB-D流中构建显式的开放词汇3D地图;(2)通过基于采样的规划生成以智能体为中心的安全且信息丰富的路径树作为选项空间;(3)将每一分支文本化为一个空间叙述,并通过链式思维LLM推理选择下一个最佳路径。在Gibson、HM3D和HSSD基准测试中,REST在成功率方面始终名列前茅,同时在路径效率方面达到最佳或第二佳,展示了有利的效率-成功率平衡。
Summary / 总结
REST is a training-free framework for zero-shot object-goal navigation that addresses the limitations of prior approaches by focusing on the design of options as a tree of paths. It constructs an explicit 3D map from RGB-D streams, grows an agent-centric tree of safe and informative paths, and uses LLM reasoning to select the next-best path. REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency across various benchmarks, showing a favorable efficiency-success balance.
论文解决了在未知环境中进行零样本物体目标导航的问题,无需特定任务的训练。它提出了REST(Receding Horizon Explorative Steiner Tree)框架,该框架从RGB-D流中构建显式的3D地图,生成一个以代理为中心的安全且信息丰富的路径树,并使用LLM推理选择最佳路径。REST在成功率方面始终名列前茅,并且在路径效率方面达到最佳或第二佳,展示了效率与成功率的良好平衡。
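The "tree of paths" idea above can be illustrated with a small prefix tree in which paths that share early waypoints share nodes, and a branch is judged by the gain accumulated along it rather than by its destination alone. This sketch substitutes a numeric score for the LLM reasoning that REST actually uses, and it does no sampling-based planning or mapping. The waypoint names, gain values, and the names `Node`, `build_tree`, `best_value`, and `select_path` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class Node:
    gain: float = 0.0  # information gained on reaching this waypoint
    children: Dict[str, "Node"] = field(default_factory=dict)


def build_tree(paths: Sequence[Sequence[Tuple[str, float]]]) -> Node:
    """Merge candidate paths into a prefix tree so shared segments share nodes."""
    root = Node()
    for path in paths:
        node = root
        for waypoint, gain in path:
            node = node.children.setdefault(waypoint, Node(gain))
    return root


def best_value(node: Node) -> float:
    """Best cumulative gain reachable through this node, including en-route gain."""
    return node.gain + max((best_value(c) for c in node.children.values()),
                           default=0.0)


def select_path(root: Node) -> List[str]:
    """Coarse-to-fine selection: commit to the best branch, then refine inside it."""
    path: List[str] = []
    node = root
    while node.children:
        name, node = max(node.children.items(),
                         key=lambda item: best_value(item[1]))
        path.append(name)
    return path


# Hypothetical candidate paths as (waypoint, gain) sequences.
paths = [
    [("hall", 2.0), ("bedroom", 1.0)],
    [("hall", 2.0), ("kitchen", 0.5)],
    [("garage", 0.2), ("shed", 2.5)],
]
chosen = select_path(build_tree(paths))
print(chosen)
```

Destination-only scoring would prefer the shed (gain 2.5), whereas scoring whole paths prefers going through the hall to the bedroom (cumulative gain 3.0), and the garage branch is dismissed as a whole before its leaves are examined.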
Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge
Authors: Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu, Shengzhu Yang, Weihang Zhang, Huazhu Fu, Huiqi Li
First: 2026-03-07T09:43:49+00:00 · Latest: 2026-03-19T08:24:18+00:00
Abstract
Large Vision Language Models (LVLMs) show immense potential for automated ophthalmic diagnosis. However, their clinical deployment is severely hindered by lacking domain-specific knowledge. In this work, we identify two structural deficiencies hindering reliable medical reasoning: 1) the Perception Gap, where general-purpose visual encoders fail to resolve fine-grained pathological cues (e.g., microaneurysms); and 2) the Reasoning Gap, where sparse visual evidence is progressively overridden by massive language priors in deeper transformer layers, leading to ungrounded hallucinations. To bridge these gaps, we propose EyExIn, a data-efficient framework designed to anchor retinal VLMs with expert knowledge via a Deep Expert Injection mechanism. Our architecture employs an Expert-Aware Dual-Stream encoding strategy that decouples visual representation into a general stream for anatomical context and a specialized expert stream for pathological semantics. To ensure high-fidelity integration, we design a Semantic-Adaptive Gated Fusion module, which dynamically amplifies subtle lesion signals while filtering irrelevant background noise. Furthermore, we introduce Adaptive Deep Expert Injection to embed persistent "Vision Anchors" by integrating fused visual features as residual biases directly into intermediate LLM layers. This mechanism creates a visual shortcut that forces the reasoning stack to remain strictly grounded in visual evidence. Extensive experiments across four benchmarks demonstrate that our model consistently outperforms massive proprietary systems. EyExIn significantly enhances domain-specific knowledge embedding and achieves state-of-the-art precision in ophthalmic visual question answering, advancing the development of trustworthy ophthalmic AI.
中文标题/摘要
标题:深度专家注入以领域特定知识锚定视网膜VLM
大型视觉语言模型(LVLMs)在眼科自动化诊断方面展现出巨大的潜力。然而,它们的临床部署受到缺乏领域特定知识的严重阻碍。在本工作中,我们识别出两个阻碍可靠医学推理的结构性缺陷:1)感知差距,其中通用视觉编码器无法分辨细微的病理线索(例如微动脉瘤);2)推理差距,其中稀疏的视觉证据在更深的Transformer层中逐渐被大量的语言先验所取代,导致无根据的幻觉。为了弥合这些差距,我们提出了EyExIn,这是一个数据高效的框架,通过深度专家注入机制利用专家知识锚定视网膜VLM。我们的架构采用了一种专家感知的双流编码策略,将视觉表示分解为一个用于解剖学上下文的一般流和一个用于病理学语义的专门专家流。为了确保高保真集成,我们设计了一种语义自适应门控融合模块,该模块动态放大细微的病灶信号并过滤掉无关的背景噪声。此外,我们引入了自适应深度专家注入,通过将融合的视觉特征直接作为残差偏置集成到中间的LLM层中,嵌入持久的“视觉锚点”。该机制创建了一个视觉捷径,迫使推理堆栈始终严格地基于视觉证据。在四个基准上的广泛实验表明,我们的模型始终优于大规模的专有系统。EyExIn显著增强了领域特定知识的嵌入,并在眼科视觉问答方面达到了最先进的精度,推动了可信赖的眼科AI的发展。
Summary / 总结
This work addresses the limitations of large vision language models (LVLMs) in ophthalmic diagnosis by proposing EyExIn, a framework that injects expert knowledge to bridge the perception and reasoning gaps. EyExIn uses an Expert-Aware Dual-Stream encoding and a Semantic-Adaptive Gated Fusion module to integrate visual and pathological information effectively. The model consistently outperforms large proprietary systems and achieves state-of-the-art precision in ophthalmic visual question answering, enhancing the trustworthiness of ophthalmic AI systems.
该研究通过提出EyExIn框架,将专家知识注入大型视觉语言模型(LVLM),以解决眼科诊断中的感知和推理缺口问题。EyExIn采用专家感知双流编码和语义自适应门控融合模块,有效整合视觉和病理信息。该模型在眼科视觉问答任务中始终优于大型专有系统,并达到最先进的精度,提升了眼科AI系统的可信度。
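Gated fusion of two feature streams, as mentioned above, is commonly written as a per-dimension blend controlled by a sigmoid gate. The sketch below shows that generic pattern only. The abstract does not give the form of EyExIn's Semantic-Adaptive Gated Fusion, so the convex combination, the externally supplied gate logits (which a real module would compute from the features with learned weights), and the names `sigmoid` and `gated_fusion` are assumptions.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(general: Sequence[float], expert: Sequence[float],
                 gate_logits: Sequence[float]) -> List[float]:
    """Blend two feature vectors; a gate near 1 favors the expert stream."""
    if not (len(general) == len(expert) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for g_feat, e_feat, logit in zip(general, expert, gate_logits):
        gate = sigmoid(logit)
        fused.append(gate * e_feat + (1.0 - gate) * g_feat)
    return fused


# Hypothetical features: the gate passes the expert signal in dimension 0
# and suppresses it in dimension 1.
general = [0.0, 2.0]
expert = [4.0, -3.0]
print(gated_fusion(general, expert, [10.0, -10.0]))
```

A strongly positive logit lets the expert feature through almost unchanged, while a strongly negative one falls back to the general feature, which matches the amplify-or-filter behavior the abstract describes.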