arXiv Paper Digest

2026-03-21 03:47
Snapshot: 20260321_0347
Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
Authors: Shang-Jui Ray Kuo, Paola Cascante-Bonilla
First: 2026-03-19T17:56:32+00:00 · Latest: 2026-03-19T17:56:32+00:00
Comments: Project page: https://lab-spell.github.io/vlm-ssm-vision-encoders/ ; Code: https://github.com/raykuo18/vlm-ssm-vision-encoders
Abstract
Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We also observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
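Title / abstract (translated from Chinese)
Title: Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
Large vision-language models (VLMs) typically use a frozen vision backbone whose image features are mapped into a large language model through a lightweight connector. Although transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance on both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive at a substantially smaller model scale. We also observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.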
Summary
This study evaluates state space model (SSM) vision backbones in large vision-language models (VLMs), finding that SSMs outperform transformer-based encoders in VQA and grounding/localization tasks under matched ImageNet-1K initialization. Dense-task tuning improves performance for both SSM and ViT-family backbones, but SSMs maintain competitiveness at a smaller model scale. The research also highlights that higher ImageNet accuracy or larger backbones do not reliably enhance VLM performance, and some visual backbones are unstable in localization tasks, suggesting SSMs as a strong alternative to transformer-based vision encoders.
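The study examines state space model (SSM) vision backbones as an alternative to transformer encoders in large vision-language models (VLMs). Evaluated in a controlled setting, SSM backbones achieve the strongest overall performance on VQA and grounding/localization. After detection or segmentation training, the SSM backbone remains competitive at a smaller model scale. The study also notes that higher ImageNet accuracy or larger backbones do not necessarily yield better VLM performance, and that some visual backbones are unstable in localization. The results indicate that SSM backbones can serve as a strong alternative to transformer-based vision encoders in VLMs.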
Tinted Frames: Question Framing Blinds Vision-Language Models
Authors: Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta
First: 2026-03-19T17:53:09+00:00 · Latest: 2026-03-19T17:53:09+00:00
Comments: Preprint. Project page: https://davidhalladay.github.io/tinted_frames_demo/
Abstract
Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and improving performance across framings.
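Title / abstract (translated from Chinese)
Title: Tinted Frames: Question Framing Blinds Vision-Language Models
Vision-Language Models (VLMs) have been shown to be blind: they often underuse their visual inputs even on tasks that require visual reasoning. In this work, we show that VLMs are selectively blind. They modulate how much attention they apply to visual inputs according to linguistic framing, even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and the distribution of attention over the image. Compared with open-ended framings, constrained framings such as multiple choice and yes/no induce substantially lower attention to image context, reduce focus on task-relevant regions, and shift attention toward uninformative tokens. We further show that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method with learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving visual grounding and performance across framings.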
Summary
This study investigates why Vision-Language Models (VLMs) are selectively blind to visual inputs, especially in tasks requiring visual reasoning. By using visual attention as a probe, the researchers found that VLMs apply less attention to images when given constrained framings like multiple choice or yes/no, compared to open-ended questions. This misallocation of attention leads to lower accuracy and inconsistency across different framings. The study introduces a prompt-tuning method using learnable tokens to encourage robust, visually grounded attention patterns, improving both visual grounding and performance across different framings.
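The study examines the selective blindness of vision-language models (VLMs) to visual inputs, which depends on how the question is framed. Using visual attention as a probe, it finds that constrained framings such as multiple choice and yes/no reduce attention to image context and shift attention toward uninformative tokens, which degrades performance. It also proposes a lightweight prompt-tuning method with learnable tokens that encourages the robust, visually grounded attention patterns seen in open-ended settings, improving visual grounding and performance across framings.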
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Authors: Swagat Padhan, Lakshya Jain, Bhavya Minesh Shah, Omkar Patil, Thao Nguyen, Nakul Gopalan
First: 2026-03-19T17:20:56+00:00 · Latest: 2026-03-19T17:20:56+00:00
Comments: Equal contribution: Swagat Padhan and Lakshya Jain, 9 pages, 6 figures, paper website: https://lakshya-asu.github.io/Meanings-Measurements-Multi-Agent-Probabilistic-Grounding/
Abstract
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
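Title / abstract (translated from Chinese)
Title: Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically show that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), a framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. In addition, we introduce a new benchmark, MAPG-Bench, designed specifically to evaluate metric-semantic goal grounding, filling a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.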
Summary
This work addresses the challenge of converting complex metric-semantic language queries into actionable decisions for robots. It introduces MAPG (Multi-Agent Probabilistic Grounding), which decomposes language queries into structured components and uses a VLM to ground each part, then probabilistically composes the results to produce metrically consistent actions. Experiments on HM-EQA show that MAPG outperforms strong baselines, and a new benchmark, MAPG-Bench, is introduced to evaluate metric-semantic goal grounding. A real-world robot demonstration also shows that MAPG transfers beyond simulation.
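The study aims to turn complex metric-semantic language queries into actionable decisions for robots. It proposes the MAPG (Multi-Agent Probabilistic Grounding) framework, which decomposes a language query into structured subcomponents, uses a VLM to ground each one, and then probabilistically composes the grounded results into metrically consistent actions. Experiments on the HM-EQA benchmark show that MAPG outperforms strong baselines, and a new benchmark, MAPG-Bench, is introduced to evaluate metric-semantic goal grounding. A real-world robot demonstration further validates MAPG beyond simulation.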
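Illustrative sketch (not from the paper): one way to picture "probabilistically composing grounded components" is to multiply a distance likelihood and a direction likelihood over candidate goal points for a query such as "go two meters to the right of the fridge". The function names, the Gaussian and von Mises-like terms, the grid search, and every parameter value below are assumptions made for illustration; the anchor position stands in for whatever a VLM would ground.

```python
import math

def distance_likelihood(point, anchor, target_m, sigma=0.3):
    """Gaussian score for being `target_m` meters from the anchor (sigma is assumed)."""
    d = math.dist(point, anchor)
    return math.exp(-0.5 * ((d - target_m) / sigma) ** 2)

def direction_likelihood(point, anchor, direction, kappa=4.0):
    """Unnormalized von Mises-like score for lying along the unit vector `direction`."""
    dx, dy = point[0] - anchor[0], point[1] - anchor[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return 0.0
    cos_angle = (dx * direction[0] + dy * direction[1]) / norm
    return math.exp(kappa * (cos_angle - 1.0))

def best_goal(anchor, direction, target_m, extent=4.0, step=0.25):
    """Grid-search the product of the two component likelihoods around the anchor."""
    best, best_score = None, -1.0
    n = int(2 * extent / step) + 1
    for i in range(n):
        for j in range(n):
            p = (anchor[0] - extent + i * step, anchor[1] - extent + j * step)
            score = (distance_likelihood(p, anchor, target_m)
                     * direction_likelihood(p, anchor, direction))
            if score > best_score:
                best, best_score = p, score
    return best, best_score

fridge = (1.0, 2.0)   # hypothetical grounded anchor position, in meters
right = (1.0, 0.0)    # hypothetical unit vector for "right" in the map frame
goal, score = best_goal(fridge, right, target_m=2.0)
```

With these toy inputs the highest-scoring grid point lies two meters along the "right" direction from the anchor. A real system would replace the grid with scene geometry and the hand-set terms with learned or calibrated distributions.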
Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
Authors: Kwanyoung Lee, SeungJu Cha, Yebin Ahn, Hyunwoo Oh, Sungho Koh, Dong-Jin Kim
Venue: CVPR 2026
First: 2026-03-19T17:12:03+00:00 · Latest: 2026-03-19T17:12:03+00:00
Comments: Accepted in CVPR 2026 (main track). 10 pages, 6 figures; supplementary material included (14 pages, 11 figures)
Abstract
Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) - a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.
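Title / abstract (translated from Chinese)
Title: Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation
Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation stems from the long-tailed nature of text-image datasets, in which rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB), a unified framework that stabilizes the diffusion process in low-density regions. AAPB uses auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence of the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled, training-free framework for adaptive prompt blending that ensures stable and target-faithful generation. Through controlled experiments we show that adaptive interpolation is more effective than fixed interpolation, and we empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving better semantic accuracy and structural fidelity than prior training-free baselines.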
Summary
The paper addresses the issue of semantically misaligned or structurally inconsistent image generation by diffusion-based models when dealing with rare concepts. It introduces Adaptive Auxiliary Prompt Blending (AAPB), a method that uses auxiliary anchor prompts to provide semantic and structural support, ensuring target-faithful generation. AAPB derives an adaptive coefficient for each diffusion step, balancing the influence between the auxiliary anchor and the target prompt. Experiments on RareBench and FlowEdit datasets show consistent improvements in semantic accuracy and structural fidelity compared to previous methods.
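To address the semantically misaligned or structurally inconsistent images that diffusion models produce for rare concepts, the paper introduces the Adaptive Auxiliary Prompt Blending (AAPB) framework, which uses auxiliary anchor prompts to provide semantic and structural support and keep generation faithful to the target. At each diffusion step, AAPB computes an adaptive coefficient that optimally balances the influence of the auxiliary anchor and the target prompt, yielding higher semantic accuracy and structural fidelity on the RareBench and FlowEdit datasets.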
ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation
Authors: Kwanyoung Lee, Hyunwoo Oh, SeungJu Cha, Sungho Koh, Dong-Jin Kim
Venue: CVPR 2026
First: 2026-03-19T17:11:49+00:00 · Latest: 2026-03-19T17:11:49+00:00
Comments: Accepted in CVPR 2026 (findings). 10 pages, 4 figures; supplementary material included (8 pages, 10 figures)
Abstract
Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose the ADAPT framework, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.
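Title / abstract (translated from Chinese)
Title: ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation
Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. Recent approaches such as R2F address this challenge by using an LLM for prompt scheduling, but they suffer from inherent variance caused by the randomness of language models and from suboptimal guidance caused by iterative text embedding switching. To address these problems, we propose ADAPT, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts on the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we show that ADAPT achieves superior performance on RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.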
Summary
The motivation for this work is to improve the generation of rare compositional concepts in text-to-image synthesis using diffusion models. The ADAPT framework is proposed to address the challenges of randomness in language models and suboptimal guidance from iterative text embedding switching. By deterministically planning and semantically aligning prompt schedules, ADAPT enhances the composition of rare concepts without additional training. Experiments on the RareBench benchmark show that ADAPT outperforms existing methods and accurately reflects the semantic information of rare attributes, providing precise control over the generation of rare compositions.
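The motivation for this work is to improve the generation of rare compositional concepts in text-to-image synthesis with diffusion models. The ADAPT framework is proposed to address the challenges posed by the randomness of language models and by suboptimal guidance from iterative text embedding switching. By deterministically planning and semantically aligning prompt schedules, ADAPT enhances the compositional generation of rare concepts without additional training. Experiments show that ADAPT outperforms existing methods on the RareBench benchmark and accurately reflects the semantic information of rare attributes, providing precise control over the generation of rare compositions while preserving visual integrity.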
GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning
Authors: Yiren Lu, Yi Du, Disheng Liu, Yunlai Zhou, Chen Wang, Yu Yin
First: 2026-03-19T16:55:54+00:00 · Latest: 2026-03-19T16:55:54+00:00
Comments: Project page at https://vulab-ai.github.io/GSMem/
Abstract
Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose GSMem, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to "hallucinate" optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.
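Title / abstract (translated from Chinese)
Title: GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning
Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To address this, we propose GSMem, a zero-shot embodied exploration and reasoning framework built on 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that gives the agent Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To make this work in practice, GSMem uses a retrieval mechanism that draws on parallel object-level scene graphs and semantic-level language fields at the same time. This complementary design robustly localizes target regions, letting the agent "hallucinate" optimal views for high-fidelity Vision-Language Model (VLM) reasoning. In addition, we introduce an exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.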
Summary
GSMem is a zero-shot embodied exploration and reasoning framework that uses 3D Gaussian Splatting to create a persistent spatial memory. This allows the agent to render photorealistic novel views from previously unoccupied viewpoints, enabling spatial recollection. GSMem combines object-level scene graphs and semantic-level language fields for robust target localization and uses a hybrid exploration strategy that balances semantic scoring with geometric coverage. Experiments show that GSMem is robust and effective for embodied question answering and lifelong navigation.
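GSMem is a zero-shot embodied exploration and reasoning framework built on 3D Gaussian Splatting. It creates a persistent spatial memory that lets the agent render photorealistic novel views from previously unoccupied viewpoints. The framework combines object-level scene graphs with semantic-level language fields for robust target localization, and uses a hybrid exploration strategy that combines semantic scoring with a 3D Gaussian Splatting coverage objective to balance task-aware exploration with geometric coverage. Experiments show that GSMem effectively supports embodied question answering and lifelong navigation.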
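Illustrative sketch (not from the paper): the hybrid exploration strategy can be pictured as ranking candidate viewpoints by a weighted mix of a semantic score and a coverage gain. The convex combination, the `weight` parameter, the candidate names, and all numbers are assumptions; the abstract does not state how GSMem actually combines the two terms.

```python
def hybrid_score(semantic, coverage_gain, weight=0.5):
    """Convex combination of a semantic score and a coverage gain, both in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * semantic + (1.0 - weight) * coverage_gain

def pick_candidate(candidates, weight=0.5):
    """Return the name of the candidate with the highest hybrid score."""
    best = max(
        candidates,
        key=lambda c: hybrid_score(c["semantic"], c["coverage_gain"], weight),
    )
    return best["name"]

# Hypothetical candidates: semantic scores would come from a VLM,
# coverage gains from the amount of unobserved geometry a view would reveal.
candidates = [
    {"name": "kitchen_door", "semantic": 0.9, "coverage_gain": 0.2},
    {"name": "hallway_end", "semantic": 0.3, "coverage_gain": 0.9},
    {"name": "closet", "semantic": 0.1, "coverage_gain": 0.1},
]
balanced_choice = pick_candidate(candidates, weight=0.5)
task_driven_choice = pick_candidate(candidates, weight=0.8)
```

Raising `weight` favors task-relevant views; lowering it favors views that fill in unseen geometry.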
Efficient Reasoning with Balanced Thinking
Authors: Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian
Venue: ICLR 2026
First: 2026-03-12T18:48:07+00:00 · Latest: 2026-03-19T16:54:22+00:00
Comments: Accepted by ICLR 2026
Abstract
Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .
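Title / abstract (translated from Chinese)
Title: Efficient Reasoning with Balanced Thinking
Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often overthink, spending redundant computational steps on simple problems, or underthink, failing to explore enough reasoning paths despite having the capability to do so. These issues lead to inefficiency and potential inaccuracy, limiting practical deployment in resource-constrained settings. Existing methods for reducing overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking and hurt accuracy. We therefore propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance uses confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking through consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector that guides the reasoning trajectories of LRMs. A dynamic control function modulates this vector's strength and direction according to real-time confidence, pruning redundancy during overthinking and promoting exploration during underthinking. Extensive experiments on four models ranging from 0.5B to 32B, across nine benchmarks in math reasoning, general question answering, and coding, show that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .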
Summary
The paper addresses the inefficiencies of Large Reasoning Models (LRMs) due to overthinking or underthinking, proposing ReBalance as a training-free framework to achieve balanced reasoning. ReBalance uses confidence as a dynamic indicator to steer LRMs, reducing redundancy and improving accuracy. Experiments across various models and benchmarks show that ReBalance effectively enhances the efficiency and robustness of LRMs.
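The paper proposes ReBalance, a training-free framework that balances overthinking and underthinking in LRMs. Using confidence as an indicator to steer LRMs, ReBalance reduces redundancy and improves accuracy. Experiments across multiple models and benchmarks show that it improves efficiency and robustness without additional training.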
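Illustrative sketch (not from the paper): the abstract describes a dynamic control function that sets the strength and direction of a steering vector from real-time confidence. The toy version below maps a window of per-step confidences to a signed coefficient and adds the scaled vector to a hidden state. The thresholds, the linear ramps, the sign convention, and the function names are all assumptions, and the `direction` list merely stands in for a steering vector derived from reasoning mode prototypes.

```python
from statistics import mean, pvariance

def steering_coefficient(confidences, var_threshold=0.02,
                         conf_threshold=0.9, max_strength=1.0):
    """Map a window of step confidences to a signed steering strength.

    Positive: push toward concise reasoning (suspected overthinking).
    Negative: push toward more exploration (suspected underthinking).
    """
    if len(confidences) < 2:
        return 0.0
    var = pvariance(confidences)
    avg = mean(confidences)
    if var > var_threshold:
        # High confidence variance is treated as a sign of overthinking.
        return min(max_strength, max_strength * var / (2 * var_threshold))
    if avg > conf_threshold:
        # Consistently high confidence is treated as a sign of underthinking.
        return -max_strength * (avg - conf_threshold) / (1.0 - conf_threshold)
    return 0.0

def steer(hidden, direction, coefficient):
    """Add the scaled steering direction to a hidden-state vector."""
    return [h + coefficient * d for h, d in zip(hidden, direction)]

oscillating = steering_coefficient([0.2, 0.9, 0.3, 0.95])
overconfident = steering_coefficient([0.95, 0.96, 0.95, 0.96])
steady = steering_coefficient([0.5, 0.55, 0.5, 0.52])
shifted = steer([1.0, 2.0], [0.5, -0.5], oscillating)
```

In this toy setup an oscillating confidence trace gets a positive coefficient, a uniformly overconfident trace gets a negative one, and a steady mid-confidence trace is left alone.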
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
Authors: Fan Yang, Xingping Dong, Xin Yu, Wenhan Luo, Wei Liu, Kaihao Zhang
Venue: CVPR 2026
First: 2025-12-02T16:22:01+00:00 · Latest: 2026-03-19T16:35:02+00:00
Comments: Accepted to CVPR 2026
Abstract
Understanding high-resolution (HR) images remains a critical challenge for multimodal large language models (MLLMs). Recent approaches leverage vision-based retrieval-augmented generation (RAG) to retrieve query-relevant crops from HR images, improving understanding capacity of MLLMs. However, this paradigm often leads to object fragmentation, resulting in semantic bias and incomplete retrieval, while also introducing false positives from irrelevant background patches. To address these issues, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework that enhances HR image understanding from both local and global perspectives. Locally, MRD enforces cross-scale semantic consistency via multi-resolution semantic fusion to mitigate single-resolution bias and alleviate object fragmentation. Globally, it integrates open-vocabulary object detection (OVD) as localization priors within a unified framework. Extensive experiments across multiple MLLMs on HR image benchmarks demonstrate that MRD achieves state-of-the-art (SOTA) performance on both single-object and multi-object understanding tasks. Code will be available at: https://github.com/yf0412/MRD.
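Title / abstract (translated from Chinese)
Title: MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
Understanding high-resolution (HR) images remains a critical challenge for multimodal large language models (MLLMs). Recent approaches use vision-based retrieval-augmented generation (RAG) to retrieve query-relevant crops from HR images, improving the understanding capacity of MLLMs. However, this paradigm often leads to object fragmentation, which causes semantic bias and incomplete retrieval, and it also introduces false positives from irrelevant background patches. To address these issues, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework that enhances HR image understanding from both local and global perspectives. Locally, MRD enforces cross-scale semantic consistency through multi-resolution semantic fusion to mitigate single-resolution bias and alleviate object fragmentation. Globally, it integrates open-vocabulary object detection (OVD) as localization priors within a unified framework. Extensive experiments across multiple MLLMs on HR image benchmarks show that MRD achieves state-of-the-art (SOTA) performance on both single-object and multi-object understanding tasks. Code will be available at: https://github.com/yf0412/MRD.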
Summary
The paper addresses the challenge of understanding high-resolution images for multimodal large language models by proposing MRD, a training-free framework that enhances both local and global perspectives. Locally, MRD uses multi-resolution semantic fusion to reduce single-resolution bias and object fragmentation. Globally, it integrates open-vocabulary object detection as localization priors. Experiments show that MRD outperforms existing methods on single-object and multi-object understanding tasks across multiple MLLMs on HR image benchmarks.
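The paper proposes MRD, a training-free framework that addresses high-resolution image understanding from both local and global perspectives. Locally, MRD uses multi-resolution semantic fusion to reduce single-resolution bias and object fragmentation. Globally, it uses open-vocabulary object detection as localization priors. Experiments show that MRD outperforms existing methods on single-object and multi-object understanding tasks on HR image benchmarks across multiple MLLMs.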
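Illustrative sketch (not from the paper): multi-resolution fusion can be pictured as combining query-relevance maps computed over coarse and fine patch grids. The example below upsamples a coarse map by nearest neighbor and averages it with a fine map. The equal-weight average, the grid sizes, and the scores are assumptions; the abstract does not specify MRD's actual fusion rule.

```python
def upsample_nearest(grid, factor):
    """Repeat each cell `factor` times along both axes."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def fuse_scores(coarse, fine):
    """Average an upsampled coarse score map with a fine map k times its size."""
    factor = len(fine) // len(coarse)
    if len(coarse) * factor != len(fine):
        raise ValueError("fine grid size must be a multiple of coarse grid size")
    up = upsample_nearest(coarse, factor)
    return [[0.5 * (u + f) for u, f in zip(ur, fr)] for ur, fr in zip(up, fine)]

# Hypothetical query-relevance scores per patch at two resolutions.
coarse = [[0.8, 0.1],
          [0.2, 0.1]]
fine = [[0.9, 0.2, 0.1, 0.0],
        [0.3, 0.7, 0.6, 0.1],
        [0.1, 0.2, 0.1, 0.0],
        [0.0, 0.1, 0.0, 0.0]]
fused = fuse_scores(coarse, fine)
```

In this toy input, a fine patch that scores low on its own but sits inside a high-scoring coarse region is raised (0.2 to 0.5), which is the kind of fragment a single-resolution retriever might drop, while an isolated fine-scale hit in a low-scoring coarse region is damped (0.6 to 0.35).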
TAU-R1: Visual Language Model for Traffic Anomaly Understanding
Authors: Yuqiang Lin, Kehua Chen, Sam Lockyer, Arjun Yadav, Mingxuan Sui, Shucheng Zhang, Yan Shi, Bingzhang Wang, Yuang Zhang, Markus Zarbock, Florain Stanek, Adrian Evans, Wenbin Li, Yinhai Wang, Nic Zhang
First: 2026-03-19T16:23:21+00:00 · Latest: 2026-03-19T16:23:21+00:00
Abstract
Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: https://github.com/siri-rouser/TAU-R1
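Title / abstract (translated from Chinese)
Title: TAU-R1: Visual Language Model for Traffic Anomaly Understanding
Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited by the lack of benchmarks and task-specific methods. To address this limitation, we introduce Roundabout-TAU, a dataset built from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 performs strongly on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: https://github.com/siri-rouser/TAU-R1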
Summary
The research addresses the need for better traffic anomaly understanding in Intelligent Transportation Systems by introducing Roundabout-TAU, a dataset of real-world roundabout videos with over 2,000 annotated question-answer pairs. Building on this dataset, TAU-R1, a two-layer vision-language framework, is proposed, which includes a lightweight anomaly classifier and a larger anomaly reasoner. The framework uses a two-stage training strategy to enhance task-specific reasoning. Experimental results demonstrate that TAU-R1 performs well in both anomaly classification and reasoning tasks while maintaining efficiency for deployment.
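The study aims to improve Traffic Anomaly Understanding (TAU) in Intelligent Transportation Systems with vision-language models, in support of traffic safety. It introduces the Roundabout-TAU dataset, with 342 clips and more than 2,000 annotated question-answer pairs, and proposes TAU-R1, a two-layer vision-language framework. TAU-R1 comprises a lightweight anomaly classifier and a larger anomaly reasoner and uses a two-stage training strategy to improve task-specific reasoning. The model performs strongly on anomaly classification and reasoning while maintaining deployment efficiency.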
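Illustrative sketch (not from the paper): GRPO-style post-training scores several sampled responses to the same prompt and normalizes each reward against its group. The normalization below, (reward minus group mean) divided by group standard deviation, is the commonly described group-relative advantage; the use of the population standard deviation, the `eps` term, and the `toy_reward` function are assumptions, and TAU-GRPO's actual reward functions are not given in the abstract.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

def toy_reward(predicted_category, true_category, has_summary):
    """Hypothetical reward: 1.0 for the right category, plus 0.5 for a non-empty summary."""
    category_term = 1.0 if predicted_category == true_category else 0.0
    summary_term = 0.5 if has_summary else 0.0
    return category_term + summary_term

# Four hypothetical sampled responses for one clip whose true label is "collision".
rewards = [
    toy_reward("collision", "collision", True),
    toy_reward("collision", "collision", False),
    toy_reward("wrong_way", "collision", True),
    toy_reward("normal", "collision", False),
]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive positive advantages and the rest negative ones, so the policy update depends on relative rather than absolute reward.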
SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
Authors: Carlos Hinojosa, Clemens Grange, Bernard Ghanem
First: 2026-03-19T16:18:00+00:00 · Latest: 2026-03-19T16:18:00+00:00
Abstract
Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.
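Title / abstract (translated from Chinese)
Title: SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further show that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.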
Summary
The study aims to understand how visual evidence influences safety judgments in vision-language models (VLMs) and whether these judgments can be steered using simple semantic cues. A semantic steering framework was developed to apply controlled interventions without altering the scene content. The SAVeS benchmark was introduced to evaluate the effects of semantic cues on safety behaviors, distinguishing between behavioral refusal, grounded safety reasoning, and false refusals. Experiments across various VLMs and an additional benchmark showed that safety decisions are highly sensitive to semantic cues, suggesting reliance on learned visual-linguistic associations rather than grounded visual understanding. This indicates a potential vulnerability in multimodal safety systems that can be exploited through automated steering pipelines.
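The study aims to understand how visual evidence shapes safety judgments in vision-language models (VLMs) deployed in real-world settings. It develops a semantic steering framework that uses simple textual and visual cues to steer these judgments without changing the scene content. It introduces the SAVeS benchmark for evaluating situational safety under semantic cues and finds that VLMs are highly sensitive to such cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. Automated steering pipelines can exploit these mechanisms, which exposes a potential vulnerability in multimodal safety systems.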
SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
Authors: Phuc Pham, Uy Dieu Tran, Binh-Son Hua, Phong Nguyen
Venue: CVPR 2026
First: 2026-03-19T15:47:43+00:00 · Latest: 2026-03-19T15:47:43+00:00
Comments: CVPR 2026
Abstract
Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.
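Title / abstract (translated from Chinese)
Title: SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then converted into simulation-ready 3D meshes using a garment modeling framework such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, which encodes the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to assemble the garment directly, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData show that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.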
Summary
SwiftTailor is a two-stage framework for efficient 3D garment generation that combines sewing-pattern reasoning and geometry-based mesh synthesis using a compact geometry image representation. It includes PatternMaker for predicting sewing patterns and GarmentSewer for converting these patterns into a Garment Geometry Image, which encodes the 3D surface of all garment panels. The final 3D mesh is reconstructed through an efficient inverse mapping process. Experiments show that SwiftTailor achieves high accuracy and visual fidelity while significantly reducing inference time compared to existing methods.
SwiftTailor 是一个两阶段框架,结合了缝制模式推理和基于几何的网格合成,使用紧凑的几何图像表示。它包括 PatternMaker 预测缝制模式和 GarmentSewer 将这些模式转换为服装几何图像。该方法减少了推理时间,同时保持了高准确性和视觉保真度,优于现有方法。
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
Authors: Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota
First: 2026-03-19T15:38:02+00:00 · Latest: 2026-03-19T15:38:02+00:00
Comments: Accepted by CVPR 2026 (Main Track)
Abstract
Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
中文标题/摘要
标题:TerraScope:基于像素的地球观测视觉推理
视觉语言模型(VLMs)在地球观测(EO)中显示出潜力,但它们在需要将复杂的空间推理精确地与像素级视觉表示联系起来的任务中表现不佳。为了解决这个问题,我们引入了TerraScope,这是一种统一的VLM,能够提供基于像素的地理空间推理,具有两个关键能力:(1)模态灵活的推理:它可以处理单一模态输入(光学或SAR),并在两种模态都可用时,适应性地将不同模态融合到推理过程中;(2)多时相推理:它整合了时间序列,以跨多个时间点进行变化分析。此外,我们还整理了Terra-CoT,这是一个包含一百万个样本的大规模数据集,这些样本在多个来源的推理链中嵌入了像素级掩码。我们还提出了TerraScope-Bench,这是第一个用于基于像素的地理空间推理的基准,包含六个子任务,评估答案准确性和掩码质量,以确保真实的基于像素的推理。实验表明,TerraScope 在基于像素的地理空间推理方面显著优于现有VLMs,同时提供了可解释的视觉证据。
Summary / 总结
TerraScope is a unified vision-language model designed for pixel-grounded geospatial reasoning in earth observation. It handles single-modality inputs (optical or SAR), adaptively fuses both modalities when they are available, and integrates temporal sequences for change analysis. The authors also curate Terra-CoT, a dataset of 1 million samples with pixel-level masks embedded in reasoning chains, and propose TerraScope-Bench, a benchmark with six sub-tasks that scores both answer accuracy and mask quality. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
TerraScope 是一种统一的视觉语言模型,用于地球观测中的像素级地理空间推理。它支持单模态(光学或SAR)和多模态输入,并结合时间序列进行变化分析。作者还整理了包含100万样本的Terra-CoT数据集,并提出了包含六个子任务、同时评估答案准确性和掩码质量的TerraScope-Bench基准。实验表明,TerraScope 在像素级地理空间推理任务中显著优于现有视觉语言模型,并提供了可解释的视觉证据。
AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture
Authors: Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Zurong Mai, Jing Wu, Shuohong Lou, Yuhang Chen, Henglian Huang, Xiaoya Fan, Yang Zhang, Defeng Gu, Lingyuan Zhao, Yutong Lu, Haohuan Fu, Jianxi Huang, Juepeng Zheng
First: 2025-11-28T15:02:19+00:00 · Latest: 2026-03-19T15:28:24+00:00
Abstract
Recent advancements in Vision-Language Models (VLMs) have significantly impacted various industries. In agriculture, these multimodal capabilities hold great promise for applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. However, while several Visual Question Answering (VQA) datasets and benchmarks have been developed to assess VLM performance, they often fail to effectively evaluate the critical reasoning and problem-solving skills needed in complex agricultural contexts. To address this gap, we introduce AgroCoT, a VQA dataset that integrates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,759 carefully curated samples, AgroCoT provides a comprehensive and robust evaluation of reasoning abilities, particularly in zero-shot scenarios, focusing on the models' ability to engage in logical reasoning and effective problem-solving. Our evaluation of 30 representative VLMs, including both proprietary and open-source models, reveals a gap in their reasoning capabilities, which underscores the importance of incorporating CoT for assessments. Our dataset is available at https://huggingface.co/datasets/wenyb/AgriCoT.
Summary / 总结
AgroCoT is a VQA dataset designed to evaluate the reasoning capabilities of Vision-Language Models (VLMs) in agricultural contexts. It includes 4,759 samples that require Chain-of-Thought (CoT) reasoning, focusing on zero-shot scenarios. The evaluation of 30 VLMs shows a significant gap in their reasoning abilities, highlighting the necessity of CoT for effective assessments in agriculture.
AgroCoT 是一个 VQA 数据集,旨在评估 Vision-Language 模型 (VLM) 在农业环境中的推理能力。它包含 4,759 个样本,需要进行链式思考 (CoT) 推理,特别关注零样本场景。对 30 个 VLM 的评估显示它们在推理能力方面存在显著差距,强调了在农业评估中整合 CoT 的必要性。数据集可在 https://huggingface.co/datasets/wenyb/AgriCoT 获取。
SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models
Authors: Quentin Guimard, Federico Bartsch, Simone Caldarella, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini
Venue: CVPR
First: 2026-03-19T15:28:08+00:00 · Latest: 2026-03-19T15:28:08+00:00
Comments: CVPR Findings 2026. Project website: https://sparse-embedding-modulation.github.io/
Abstract
Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
中文标题/摘要
标题:SEM:稀疏嵌入调制在后验消除视觉-语言模型中的社会偏见
连接视觉和语言的模型,如CLIP,是多模态AI的关键组成部分,但其大规模、未经筛选的训练数据引入了严重的社会和统计偏见。现有的后验消除偏见方法通常直接在密集的CLIP嵌入空间中操作,其中偏见和任务相关信息高度纠缠。这种纠缠限制了它们在不损害语义保真度的情况下去除偏见的能力。在本研究中,我们提出了一种后验、零样本的稀疏嵌入调制(SEM)框架,该框架在稀疏自编码器(SAE)的潜在空间中操作。通过将CLIP文本嵌入分解为分离的特征,SEM识别并调节与偏见相关的神经元,同时保留与查询相关的神经元。这使得可以进行更精确、非线性的干预。在四个基准数据集和两个CLIP骨干网络上,SEM在检索和零样本分类中实现了显著的公平性提升。我们的结果表明,稀疏潜在表示为视觉-语言模型的后验消除偏见提供了有效的基础。
Summary / 总结
The research aims to address the social and spurious biases in vision-language models like CLIP by proposing Sparse Embedding Modulation (SEM), a zero-shot debiasing framework. SEM operates in a Sparse Autoencoder latent space to disentangle bias-relevant and query-relevant features, allowing for precise and non-linear bias modulation. The method achieves significant fairness improvements in retrieval and zero-shot classification across multiple datasets and model backbones.
研究旨在通过提出Sparse Embedding Modulation (SEM) 后处理去偏方法来解决像CLIP这样的视觉-语言模型中的社会性和统计性偏差问题。SEM 在稀疏自编码器的潜在空间中操作,分解并分离文本嵌入,选择性地调节与偏差相关的神经元,同时保留与任务相关的神经元。该方法在四个基准数据集和两个CLIP骨干网络上实现了检索和零样本分类任务中的显著公平性改进。
Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval
Authors: Hangeol Chang, Changsun Lee, Seungjoon Rho, Junho Yeo, Jong Chul Ye
First: 2026-03-19T15:15:58+00:00 · Latest: 2026-03-19T15:15:58+00:00
Abstract
Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to drive the final decision. Existing RAG methods typically rely on a single initial query, which often favors topical relevance over decision-relevant evidence, and therefore retrieves background information that can fail to discriminate among answer options. To address this issue, here we propose Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, and then rewrites retrieval into three targeted queries that seek evidence to: (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify salient clues in the question. This approach enables context retrieval that is more directly aligned with answer selection, allowing the generator to confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at https://anonymous.4open.science/r/HCQR-1C2E.
中文标题/摘要
标题:基于假设条件的查询重写以实现决策有用检索
检索增强生成(RAG)通过将生成与外部非参数知识相结合来提高大型语言模型(LLMs)。然而,当任务要求在竞争选项中进行选择时,仅将生成与广泛相关背景相结合往往不足以驱动最终决策。现有RAG方法通常依赖于单个初始查询,这往往倾向于主题相关性而非决策相关证据,因此检索到的背景信息可能无法区分答案选项。为了解决这一问题,我们提出了一种无需训练的预检索框架——基于假设条件的查询重写(HCQR),该框架将RAG从主题导向检索转向证据导向检索。HCQR首先从输入问题和候选选项中推导出一个轻量级的工作假设,然后将检索重写为三个针对特定证据的查询,以:(1)支持假设,(2)区分其与竞争替代方案,(3)验证问题中的关键线索。这种方法使上下文检索更直接地与答案选择对齐,使生成器能够根据检索到的证据确认或推翻初始假设。在MedQA和MMLU-Med上的实验表明,HCQR在平均准确性上始终优于单查询RAG和重排/过滤基线,分别提高了5.9和3.6个百分点。代码可在https://anonymous.4open.science/r/HCQR-1C2E获取。
Summary / 总结
This paper addresses a limitation of existing Retrieval-Augmented Generation (RAG) methods when a task requires choosing among competing options: a single topical query tends to retrieve background information that does not discriminate among the answers. It proposes Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that derives a working hypothesis from the question and candidate options, then rewrites retrieval into three targeted queries that seek evidence to support the hypothesis, distinguish it from competing alternatives, and verify salient clues in the question. Experiments on MedQA and MMLU-Med show that HCQR outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively.
论文提出了一种名为Hypothesis-Conditioned Query Rewriting (HCQR)的方法,通过将Retrieval-Augmented Generation (RAG)从主题导向的检索转向证据导向的检索来提高大型语言模型(LLMs)的决策有用检索效果。HCQR从输入问题和候选选项中推导出一个轻量级的工作假设,并将其重写为三个有针对性的查询,以支持假设、区分其与替代方案以及验证关键线索。实验结果显示,HCQR在MedQA和MMLU-Med上的平均准确率分别提高了5.9和3.6个百分点,优于单查询RAG和重排/过滤基线。
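The three-query rewrite described in the HCQR abstract can be sketched as follows. This is only an illustration under stated assumptions: in the paper the hypothesis and the rewritten queries are presumably produced by an LLM, whereas here fixed string templates stand in for that step. The function name `rewrite_queries`, its parameters, and the template wording are all hypothetical and do not come from the paper or its code.

```python
from typing import Dict, List


def rewrite_queries(question: str, options: Dict[str, str], hypothesis_key: str) -> List[str]:
    """Build three evidence-seeking queries from a question and a working hypothesis.

    hypothesis_key selects the option treated as the working hypothesis
    (hypothetical interface; the paper derives the hypothesis automatically).
    """
    hypothesis = options[hypothesis_key]
    alternatives = [text for key, text in options.items() if key != hypothesis_key]

    # (1) Evidence that supports the hypothesis.
    support = f"evidence supporting '{hypothesis}' as the answer to: {question}"
    # (2) Evidence that separates the hypothesis from the competing options.
    distinguish = f"how to distinguish '{hypothesis}' from " + ", ".join(
        f"'{alt}'" for alt in alternatives
    )
    # (3) Verification of salient clues in the question itself.
    verify = f"what do the key findings indicate in: {question}"
    return [support, distinguish, verify]


if __name__ == "__main__":
    demo_options = {"A": "iron deficiency", "B": "vitamin B12 deficiency", "C": "thalassemia"}
    for query in rewrite_queries("Patient with fatigue and microcytic anemia", demo_options, "A"):
        print(query)
```

Each query would then be sent to the retriever separately and the results merged before generation; how the merge is done is not specified in the abstract.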
CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods
Authors: Qinqian Lei, Bo Wang, Robby T. Tan
Venue: CVPR 2026
First: 2025-08-26T07:30:53+00:00 · Latest: 2026-03-19T15:13:14+00:00
Comments: Accepted by CVPR 2026
Abstract
HOI detection has long been dominated by task-specific models, sometimes with early vision-language backbones such as CLIP. With the rise of large generative VLMs, a key question is whether standalone VLMs can perform HOI detection competitively against specialized HOI methods. Existing benchmarks such as HICO-DET require exact label matching under incomplete annotations, so any unmatched prediction is marked wrong. This unfairly penalizes valid outputs, especially from less constrained VLMs, and makes cross-paradigm comparison unreliable. To address this limitation, we introduce CrossHOI-Bench, a multiple-choice HOI benchmark with explicit positives and curated negatives, enabling unified and reliable evaluation of both VLMs and HOI-specific models. We further focus on challenging scenarios, such as multi-person scenes and fine-grained interaction distinctions, which are crucial for revealing real differences between the two paradigms. Experiments show that large VLMs achieve competitive, sometimes superior, zero-shot performance, yet they struggle with multiple concurrent actions and with correctly assigning interactions to the target person. Conversely, HOI-specific methods remain weaker in general HOI reasoning but demonstrate stronger multi-action recognition and more reliable identification of which person performs which action. These findings expose complementary strengths and weaknesses of VLMs and HOI-specific methods, which existing benchmarks fail to reveal due to incorrect penalization.
中文标题/摘要
标题:CrossHOI-Bench:跨范式的HOI评估统一基准
长期以来,HOI检测主要由专门的任务模型主导,有时甚至使用早期的视觉-语言模型如CLIP作为基础。随着大型生成VLM的兴起,一个关键问题是独立的VLM是否能够与专门的HOI方法竞争进行HOI检测。现有的基准如HICO-DET要求精确的标签匹配,在不完整注释的情况下,任何未匹配的预测都会被标记为错误。这不公平地惩罚了有效的输出,尤其是来自较不约束的VLMs,使得跨范式比较不可靠。为了解决这一局限性,我们引入了CrossHOI-Bench,这是一个具有明确正例和精心挑选负例的多项选择HOI基准,使VLMs和HOI特定模型的统一和可靠评估成为可能。我们进一步关注具有挑战性的场景,如多人场景和精细的交互区分,这对于揭示两种范式之间的真正差异至关重要。实验表明,大型VLMs在零样本情况下表现出竞争力,甚至有时更优,但它们在处理多个并发动作和正确分配交互给目标人物方面存在困难。相反,HOI特定方法在一般HOI推理方面仍然较弱,但在多动作识别和更可靠地识别哪个人执行哪种动作方面表现出更强的能力。这些发现揭示了VLMs和HOI特定方法的互补优势和劣势,而现有的基准由于错误的惩罚未能揭示这些差异。
Summary / 总结
The research aims to evaluate the performance of vision-language models (VLMs) and HOI-specific models in HOI detection by introducing CrossHOI-Bench, a new benchmark that uses multiple-choice questions with explicit positives and curated negatives. The study finds that large VLMs can achieve competitive zero-shot performance but struggle with multiple concurrent actions and correctly assigning interactions to the target person. HOI-specific methods, while weaker in general HOI reasoning, show stronger multi-action recognition and more reliable identification of which person performs which action. This benchmark addresses the limitations of previous benchmarks by providing a fairer evaluation framework for both paradigms.
研究旨在通过引入CrossHOI-Bench这一新的基准,使用多选题格式并包含明确的正确选项和精心筛选的错误选项,来评估视觉语言模型和HOI特定方法在HOI检测中的表现。研究发现,大型VLM在性能上具有竞争力,但在处理多个并发动作和正确识别目标人物方面存在困难,而HOI特定方法在多动作识别方面表现出色,但在一般HOI推理方面较弱。这一基准解决了现有基准的局限性,提供了跨范式的公平评估。
How to Take a Memorable Picture? Empowering Users with Actionable Feedback
Authors: Francesco Laiti, Davide Talon, Jacopo Staiano, Elisa Ricci
Venue: CVPR 2026
First: 2026-02-25T13:02:35+00:00 · Latest: 2026-03-19T15:10:18+00:00
Comments: Accepted @ CVPR 2026. Project page: https://laitifranz.github.io/MemCoach/
Abstract
Image memorability, i.e., how likely an image is to be remembered, has traditionally been studied in computer vision either as a passive prediction task, with models regressing a scalar score, or with generative methods altering the visual input to boost the image's likelihood of being remembered. Yet, none of these paradigms supports users at capture time, when the crucial question is how to improve a photo's memorability. We introduce the task of Memorability Feedback (MemFeed), where an automated model should provide actionable, human-interpretable guidance to users with the goal of enhancing an image's future recall. We also present MemCoach, the first approach designed to provide concrete suggestions in natural language for memorability improvement (e.g., "emphasize facial expression," "bring the subject forward"). Our method, based on Multimodal Large Language Models (MLLMs), is training-free and employs a teacher-student steering strategy, aligning the model's internal activations toward more memorable patterns learned from a teacher model progressing along least-to-most memorable samples. To enable systematic evaluation on this novel task, we further introduce MemBench, a new benchmark featuring sequence-aligned photoshoots with annotated memorability scores. Our experiments, considering multiple MLLMs, demonstrate the effectiveness of MemCoach, showing consistently improved performance over several zero-shot models. The results indicate that memorability can not only be predicted but also taught and instructed, shifting the focus from mere prediction to actionable feedback for human creators.
中文标题/摘要
标题:如何拍摄令人难忘的照片?赋予用户可操作反馈
图像的难忘性,即图像被记住的可能性,传统上在计算机视觉中要么作为被动预测任务进行研究,模型回归一个标量分数,要么使用生成方法改变视觉输入以提高图像被记住的可能性。然而,这些范式在拍摄时并不支持用户,关键问题是如何提高照片的难忘性。我们引入了难忘性反馈(MemFeed)任务,其中自动化模型应提供可操作的、人类可理解的指导,以提高图像未来回忆的可能性。我们还介绍了MemCoach,这是第一个提供自然语言具体建议以提高难忘性的方法(例如,“强调面部表情”,“将主题置于前景”)。我们的方法基于多模态大型语言模型(MLLMs),无需训练,并采用教师-学生引导策略,使模型内部激活与更难忘的样本中学习到的模式对齐。为了在这一新任务上进行系统评估,我们进一步引入了MemBench,这是一个新的基准,包含序列对齐的照片拍摄,并附有标注的难忘性评分。我们的实验,考虑了多个MLLMs,证明了MemCoach的有效性,显示出在多个零样本模型上的一致改进。结果表明,难忘性不仅可以被预测,也可以被教授和指导,从单纯的预测转向对人类创作者的可操作反馈。
Summary / 总结
The paper introduces the task of Memorability Feedback (MemFeed), where an automated model provides actionable guidance to users to enhance the memorability of their photos. The method, MemCoach, uses Multimodal Large Language Models to offer concrete suggestions in natural language, such as 'emphasize facial expression.' Experiments show that MemCoach outperforms several zero-shot models, demonstrating the potential to teach and instruct memorability rather than just predicting it.
论文提出了记忆反馈(MemFeed)任务,自动模型为用户提供建议以增强照片的记忆力。方法MemCoach使用多模态大型语言模型提供具体的自然语言建议,如“强调面部表情”。实验表明,MemCoach在多个零样本模型中表现出色,展示了不仅可以预测记忆力,还可以教和指导记忆力的潜力。
VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
Authors: Jiayi Yuan, Haobo Jiang, De Wen Soh, Na Zhao
First: 2026-03-19T14:18:17+00:00 · Latest: 2026-03-19T14:18:17+00:00
Abstract
This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
中文标题/摘要
标题:VGGT-360:几何一致的零样本全景深度估计
本文提出了VGGT-360,这是一种无需训练的新型框架,用于实现零样本、几何一致的全景深度估计。与之前的视图无关的无需训练方法不同,VGGT-360将任务重新表述为利用VGGT类基础模型的内在三维一致性,通过多视图重建的全景重投影来统一视图分割的推理,从而形成一个连贯的全景理解。为了实现稳健且准确的估计,VGGT-360整合了三个即插即用模块,形成了一个统一的全景到三维再到深度框架:(i) 不确定性引导的自适应投影将全景图切分为透视视图,以弥合全景输入与VGGT的透视先验之间的领域差距。它估计梯度不确定性,将更密集的视图分配给几何贫瘠区域,为VGGT提供几何信息丰富的输入。(ii) 结构显著性增强的注意力在三维重建过程中增强VGGT的鲁棒性,通过将结构感知的置信度注入其注意力层,引导关注几何可靠区域,增强跨视图的一致性。(iii) 相关加权三维模型校正通过使用注意力推断的相关分数重新加权重叠点,细化重建的三维模型,为准确的全景重投影提供一致的几何基础。广泛的实验表明,VGGT-360在多个分辨率和多种室内外数据集上均优于训练有素和无需训练的最新方法。
Summary / 总结
VGGT-360 is a training-free framework for zero-shot panoramic depth estimation that leverages 3D consistency to unify per-view reasoning into a coherent panoramic understanding. It integrates three modules: uncertainty-guided adaptive projection, structure-saliency enhanced attention, and correlation-weighted 3D model correction. Experiments demonstrate that VGGT-360 outperforms both trained and training-free state-of-the-art methods across various datasets and resolutions.
VGGT-360 是一种新颖的无训练框架,用于零样本全景深度估计,通过利用 VGGT 类似模型的内在 3D 一致性,将单视图推理统一为一个连贯的全景理解。它整合了三个模块:不确定性引导自适应投影、结构显著性增强注意力和相关加权 3D 模型校正,这些模块共同增强了鲁棒性和准确性。实验表明,VGGT-360 在各种分辨率和不同室内外数据集上均优于训练有素和无训练的最新方法。
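Slicing a panorama into perspective views, as the VGGT-360 abstract describes, rests on mapping each equirectangular pixel to a viewing ray. The sketch below shows that mapping only. It assumes an equirectangular panorama with longitude spanning the image width and latitude spanning its height, and a y-up, z-forward axis convention; the paper's actual conventions and its uncertainty-guided view allocation are not given in the abstract, and the function name is hypothetical.

```python
import math
from typing import Tuple


def equirect_pixel_to_ray(u: float, v: float, width: int, height: int) -> Tuple[float, float, float]:
    """Map equirectangular pixel coordinates (u, v) to a unit ray direction.

    Assumed convention: u in [0, width) covers longitude -pi..pi,
    v in [0, height) covers latitude +pi/2 (top) .. -pi/2 (bottom).
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)  # right
    y = math.sin(lat)                  # up
    z = math.cos(lat) * math.cos(lon)  # forward
    return (x, y, z)


if __name__ == "__main__":
    # The image centre looks straight ahead along +z under this convention.
    print(equirect_pixel_to_ray(512, 256, 1024, 512))
```

A perspective view is then obtained by sampling the panorama along the rays of a virtual pinhole camera; the inverse of this mapping gives the pixel to sample for each ray.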
MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model
Authors: Youngwan Lee, Soojin Jang, Yoorhim Cho, Seunghwan Lee, Yong-Ju Lee, Sung Ju Hwang
First: 2026-03-19T13:33:26+00:00 · Latest: 2026-03-19T13:33:26+00:00
Comments: Project page: https://youngwanlee.github.io/multihopspatial
Abstract
Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
中文标题/摘要
标题:MultihopSpatial:多跳组合空间推理基准测试,用于视觉-语言模型
空间推理是视觉-语言模型(VLMs)的基础,特别是在作为视觉-语言-行动(VLA)代理在物理环境中部署时。然而,现有的基准测试主要集中在基础的单跳关系上,忽视了多跳组合推理和精确的视觉定位,这对于现实世界场景至关重要。为了解决这个问题,我们引入了MultihopSpatial,提供了三个关键贡献:(1) 一个旨在进行多跳和组合空间推理的全面基准测试,涵盖从1到3跳的复杂查询,跨越多种空间视角。(2) Acc@50IoU,一个补充性指标,同时评估推理和视觉定位,要求进行答案选择和精确的边界框预测——这些能力对于稳健的VLA部署至关重要。(3) MultihopSpatial-Train,一个专门的大规模训练语料库,以促进空间智能。对37个最先进的VLMs的广泛评估揭示了八个关键见解,表明组合空间推理仍然是一个严峻的挑战。最后,我们证明了在我们的语料库上进行强化学习后训练可以提高VLM的内在空间推理能力和下游的实体操作性能。
Summary / 总结
The research aims to improve Vision-Language Models (VLMs) for multi-hop and compositional spatial reasoning in real-world scenarios. The study introduces MultihopSpatial, a benchmark that includes complex 1- to 3-hop spatial queries and a new metric Acc@50IoU to evaluate both reasoning and visual grounding. Key findings show that current VLMs struggle with compositional spatial reasoning, and post-training reinforcement learning on the MultihopSpatial corpus improves both spatial reasoning and embodied manipulation performance.
研究旨在提升Vision-Language模型(VLMs)在多跳和组合空间推理方面的表现,这对于实际应用至关重要。研究引入了MultihopSpatial基准,包含1-到3跳的复杂查询,并提出了一种新的Acc@50IoU评估指标,同时评估推理和视觉定位能力。对37个最先进的VLMs的广泛评估显示,组合空间推理仍然是一个挑战,而通过MultihopSpatial数据集进行强化学习后训练可以提升空间推理能力和下游的实体操作性能。
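The Acc@50IoU metric in the MultihopSpatial abstract counts a sample as correct only when the selected answer is right and the predicted box is precise. A minimal sketch of such a joint metric follows. It assumes the "50" refers to an IoU threshold of 0.5 and that boxes are given as (x1, y1, x2, y2); neither detail is confirmed by the abstract, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(
    preds: Sequence[Tuple[str, Box]],
    golds: Sequence[Tuple[str, Box]],
    threshold: float = 0.5,
) -> float:
    """Fraction of samples with a correct answer AND box IoU >= threshold."""
    if not golds:
        return 0.0
    hits = sum(
        1
        for (p_ans, p_box), (g_ans, g_box) in zip(preds, golds)
        if p_ans == g_ans and iou(p_box, g_box) >= threshold
    )
    return hits / len(golds)


if __name__ == "__main__":
    golds = [("A", (0, 0, 10, 10)), ("A", (0, 0, 10, 10)), ("A", (20, 20, 30, 30))]
    preds = [("A", (0, 0, 10, 10)), ("B", (0, 0, 10, 10)), ("A", (0, 0, 10, 10))]
    # Only the first sample has both the right answer and a matching box.
    print(acc_at_iou(preds, golds))
```

Requiring both conditions means a model cannot score by guessing the answer without localizing the referenced object, which is the point the abstract makes about grounding.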
LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer
Authors: Song Fei, Tian Ye, Lujia Wang, Lei Zhu
First: 2025-09-26T14:39:08+00:00 · Latest: 2026-03-19T12:57:49+00:00
Comments: Project Page: https://w2genai-lab.github.io/LucidFlux
Abstract
Image restoration (IR) aims to recover images degraded by unknown mixtures while preserving semantics, conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free IR framework that adapts a large diffusion transformer (Flux.1) without image captions. Our LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbone's hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or Vision-Language Model (VLM) captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, our LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on, rather than adding parameters or relying on text prompts, is the governing lever for robust and caption-free image restoration in the wild.
中文标题/摘要
标题:LucidFlux:无需描述的高保真图像恢复大型扩散变换器
图像恢复(IR)旨在恢复被未知混合物破坏的图像,同时保留语义。在语义条件下,判别恢复器和基于UNet的扩散先验往往会过度平滑、虚构或漂移。我们提出了LucidFlux,这是一种无需描述的IR框架,它适应了一个大型扩散变换器(Flux.1),而无需使用图像描述。我们的LucidFlux引入了一个轻量级的双分支条件器,分别从退化输入和轻度恢复的代理中注入信号,以分别锚定几何结构和抑制伪影。然后,设计了一种时间步长和层自适应调制调度,以在骨干网络层次结构中路由这些线索,从而实现从粗到细和上下文感知的更新,以保护全局结构并恢复纹理。之后,为了避免文本提示或视觉语言模型(VLM)描述的延迟和不稳定,我们通过从代理中提取的SigLIP特征强制执行无描述的语义对齐。一个可扩展的策展管道进一步筛选大规模数据以提供结构丰富的监督。在合成和野外基准测试中,我们的LucidFlux始终优于强大的开源和商用基线,消融研究验证了每个组件的必要性。LucidFlux表明,对于大型DiTs,何时、何地以及如何进行条件控制,而不是增加参数或依赖于文本提示,是野外稳健且无需描述的图像恢复的关键杠杆。
Summary / 总结
LucidFlux is a caption-free image restoration framework that uses a large-scale diffusion transformer to recover degraded images while preserving semantics. It introduces a lightweight dual-branch conditioner and a timestep- and layer-adaptive modulation schedule to protect global structure and recover texture. LucidFlux outperforms strong open-source and commercial baselines on both synthetic and in-the-wild benchmarks, and ablation studies confirm the necessity of each component for its success.
LucidFlux 是一个无图描述的图像恢复框架,利用大规模扩散变换器来恢复退化图像并保留语义。它引入了轻量级的双分支条件器和时间步长和层自适应调制调度,以保护全局结构并恢复纹理。LucidFlux 在合成和现实世界基准测试中均优于强大的开源和商业基线,并且消融研究证实了每个组件对于其成功的重要性。
HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models
Authors: Xiangyu Bai, Bishoy Galoaa, Sarah Ostadabbas
First: 2026-03-19T12:53:32+00:00 · Latest: 2026-03-19T12:53:32+00:00
Abstract
Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet's policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing what a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at https://github.com/ostadabbas/HORNet.
中文标题/摘要
标题:HORNet:基于任务指导的视频帧选择以利用视觉-语言模型进行视频问答
视觉-语言模型(VLMs)驱动的视频问答(VQA)高度依赖于从输入视频中选择哪些帧,但大多数系统依赖于均匀或启发式的采样,这些采样无法针对下游问答质量进行优化。我们引入了HORNet,这是一种通过组相对策略优化(GRPO)训练的轻量级帧选择策略,以学习一个冻结的VLM需要查看哪些帧才能正确回答问题。HORNet的可训练参数不足100万,可将输入帧数减少高达99%,将VLM处理时间减少高达93%,同时在短格式基准上提高了答案质量(MSVD-QA上的F1分数提高1.7%),并在时间推理任务上取得了出色的表现(NExT-QA上比均匀采样高出7.3分)。我们将此任务形式化为“选择任意帧”(SAF),该任务将视觉输入的编排与VLM推理解耦,并展示了GRPO训练的选择在分布外表现更好,优于监督学习和PPO替代方案。HORNet的策略进一步在不同VLM回答器之间进行迁移,无需重新训练,与更强的模型配对时可获得额外8.5%的相对增益。在涵盖341,877个问答对和114.2小时视频的六个基准测试上的评估结果表明,优化VLM所见的内容是对优化其生成内容的一种实用且互补的替代方案,同时提高了效率。代码可在 https://github.com/ostadabbas/HORNet 获取。
Summary / 总结
HORNet is a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to optimize frame selection for video question answering with vision-language models (VLMs). It reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks and achieving strong performance on temporal reasoning tasks. HORNet's policy transfers across different VLMs, yielding additional gains. Evaluated across six benchmarks, optimizing what a VLM sees is shown to be a practical and complementary approach to improving efficiency and answer quality.
HORNet 是一种通过组相对策略优化(GRPO)训练的轻量级帧选择策略,用于优化视频问答中的视觉语言模型(VLMs)的帧选择。它将输入帧减少99%,并将 VLM 处理时间减少93%,同时在短格式基准上提高答案质量,并在时间推理任务上表现出色。HORNet 的策略在不同 VLM 之间具有良好的转移性,可以带来额外的收益。在六个基准测试中评估了该方法,结果表明优化 VLM 所见的内容是一种实用且互补的方法,可以提高效率和答案质量。
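HORNet's frame selector is trained with GRPO, whose core idea is to score each sampled action relative to the other samples drawn for the same input. The sketch below shows only that group-relative normalization, in its commonly used mean/standard-deviation form. It assumes each sampled frame subset receives a scalar reward (for example, answer correctness from the frozen VLM); how HORNet actually defines rewards is not stated in the abstract, and the use of the population standard deviation and the `eps` value are assumptions.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are judged against their siblings rather than against a learned critic.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four frame subsets sampled for one video question; two led to a
    # correct answer (reward 1.0) and two did not (reward 0.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every sample in a group earns the same reward, all advantages are zero and the group contributes no gradient, which is why the reward needs to vary across the sampled frame subsets.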
Activation Quantization of Vision Encoders Needs Prefixing Registers
Authors: Seunghyeon Kim, Taesun Yeom, Jinho Kim, Wonpyo Park, Kyuyeun Kim, Jaeho Lee
First: 2025-10-06T07:27:46+00:00 · Latest: 2026-03-19T12:18:57+00:00
Comments: under review; 28 pages, 9 figures
Abstract
Large pretrained vision encoders are central to multimodal intelligence, powering applications from on-device vision processing to vision-language models. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Quantization offers a practical path, but it remains challenging even at 8-bit precision due to so-called outliers. In this work, we propose RegCache, a training-free algorithm that mitigates outliers in large-scale pretrained vision encoders and serves as a plug-in module that can be applied on top of other quantization methods. RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the vision encoder, which prevent other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experimental results show that our method consistently improves quantized model performance across various vision encoders, particularly in extremely low-bit regimes (e.g., 4-bit).
中文标题/摘要
标题:视觉编码器的激活量化需要前缀寄存器
大型预训练视觉编码器是多模态智能的核心,驱动着从设备端视觉处理到视觉语言模型的各种应用。由于这些应用通常需要实时处理大量视觉数据,因此降低视觉编码器的推理成本至关重要。量化提供了一条可行的路径,但即使在8位精度下仍面临挑战,主要是所谓的异常值问题。在本工作中,我们提出了一种名为RegCache的无训练算法,该算法可以缓解大规模预训练视觉编码器中的异常值问题,并作为可插拔模块应用于其他量化方法之上。RegCache向视觉编码器引入易出现异常值但在语义上无意义的前缀标记,从而防止其他标记出现异常值。值得注意的是,我们观察到视觉编码器中的异常值与语言模型中的异常值行为不同,这促使我们提出了两种技术创新:中间层前缀和标记删除。实验结果表明,我们的方法在各种视觉编码器中都能一致地提高量化模型的性能,特别是在极低位宽(例如4位)的情况下。
Summary / 总结
This work addresses the challenge of quantizing large pretrained vision encoders to reduce inference costs while maintaining performance. The proposed RegCache method introduces prefix tokens to mitigate outliers, which are problematic at low-bit precision. Experimental results demonstrate consistent improvements in quantized model performance, especially in 4-bit regimes.
该研究旨在通过引入前缀标记来量化大型预训练视觉编码器,以降低推理成本并保持性能。RegCache方法通过减轻异常值的影响来实现这一目标,特别是在4比特精度下。实验结果表明,RegCache能够提升量化模型的性能。
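The RegCache abstract attributes the difficulty of activation quantization to outliers. The sketch below is not RegCache itself; it only illustrates, under simple assumptions, why a single outlier hurts. It uses symmetric per-tensor quantization with one shared scale, and all function names and toy values are hypothetical.

```python
from typing import List, Sequence


def quantize_dequantize(values: Sequence[float], bits: int) -> List[float]:
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax  # one scale for the whole tensor
    if scale == 0:
        return list(values)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def error_on_prefix(values: Sequence[float], n: int, bits: int) -> float:
    """Mean absolute quantization error over the first n entries only."""
    restored = quantize_dequantize(values, bits)
    return sum(abs(values[i] - restored[i]) for i in range(n)) / n


if __name__ == "__main__":
    normal = [0.1 * i for i in range(-10, 11)]  # well-behaved activations
    with_outlier = normal + [100.0]             # same values plus one outlier
    clean = error_on_prefix(normal, len(normal), bits=4)
    dirty = error_on_prefix(with_outlier, len(normal), bits=4)
    # The outlier stretches the shared scale, so the ordinary values
    # collapse onto very few quantization levels and their error grows.
    print(clean, dirty)
```

Because the scale is set by the largest magnitude, confining outliers to a few dedicated tokens (as the abstract describes for the prefix tokens) would leave the remaining tokens with a much tighter range to quantize.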
Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation
Authors: Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler
First: 2026-03-19T11:46:01+00:00 · Latest: 2026-03-19T11:46:01+00:00
Abstract
Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
中文标题/摘要
标题:Perceptio:通过空间标记生成增强视觉语言模型的感知能力
大型视觉语言模型(LVLMs)在语义理解方面表现出色,但在细粒度的空间定位方面存在困难,因为模型必须在从未产生空间解释的情况下隐式推断复杂的几何结构。我们提出了Perceptio,这是一种具有2D和3D空间推理能力的感知增强LVLM,通过在自回归序列中直接生成显式的语义分割标记和深度标记来实现。具体来说,我们(i)从强大的单目教师模型中蒸馏出VQ-VAE深度码本,将稠密深度标记化为紧凑序列,(ii)在LLM中集成基于SAM2的语义分割标记和VQ-VAE深度标记,使模型先输出空间标记,再作答。为了稳定深度标记的生成,我们引入了新颖的复合深度标记目标(marker、token和count损失)以及一种用于可微重建的软合并技术。我们采用跨多种数据集的多任务协同训练策略,让模型学习感知标记以应对多个下游任务。基于InternVL,Perceptio在多个基准测试中取得了最先进的性能:在RefCOCO/+/g上将指代表达分割的cIoU分别提高0.8/1.4/1.1,将HardBLINK空间理解准确率提高10.3%,将MMBench准确率提高1.0%,证明了显式的空间思维链能够实质性地增强LVLM的空间定位能力。
Summary / 总结
Perceptio is a perception-enhanced Vision Language Model that integrates 2D and 3D spatial reasoning through explicit semantic segmentation tokens and depth tokens generated within the autoregressive sequence. It uses a VQ-VAE depth codebook and SAM2-based semantic segmentation tokens to improve spatial grounding. Built on InternVL, Perceptio achieves state-of-the-art performance, improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%.
Perceptio 是一种通过在自回归序列中生成显式的语义分割和深度令牌来增强 2D 和 3D 空间推理的视觉语言模型。它使用 VQ-VAE 深度码本和基于 SAM2 的语义分割令牌来改善空间定位。Perceptio 达到了最先进的性能:在 RefCOCO/+/g 上将指代表达分割的 cIoU 提高 0.8/1.4/1.1,将 HardBLINK 空间理解准确率提高 10.3%,并将 MMBench 准确率提高 1.0%。
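The depth tokenization step mentioned in the Perceptio abstract can be illustrated with a minimal vector-quantization sketch. It assumes depth patches have already been encoded into small vectors and shows only the nearest-codebook lookup at the core of a VQ-VAE; the encoder, decoder, distillation from the monocular teacher, and the marker/token/count losses are omitted. The function names and the toy codebook are hypothetical and are not taken from the paper.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize_depth(patch_vectors, codebook):
    """Map each encoded depth patch to a discrete token id."""
    return [nearest_code(vector, codebook) for vector in patch_vectors]


def detokenize_depth(tokens, codebook):
    """Recover the quantized patch vectors from token ids."""
    return [list(codebook[token]) for token in tokens]


# Toy example: a hypothetical 3-entry codebook over 2-dimensional patch vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
patches = [[0.1, -0.1], [3.5, 4.2], [0.9, 1.2]]
tokens = tokenize_depth(patches, codebook)
```

In a setup like the one the abstract describes, such token ids would be emitted inside the autoregressive sequence before the answer; how Perceptio actually orders and delimits them is not specified in the text above.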
Balanced Thinking: Improving Chain of Thought Training in Vision Language Models
Authors: Shaked Perek, Ben Wiesel, Avihu Dekel, Nimrod Shabtay, Eli Schwartz
First: 2026-03-19T09:21:49+00:00 · Latest: 2026-03-19T09:21:49+00:00
Abstract
Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long <think> traces overshadow short but task-critical <answer> segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the <think> segment, SCALe-SFT gradually shifts the focus from <think> to <answer> throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement refinement.
中文标题/摘要
标题:平衡思维:提升视觉语言模型的链式思维训练
视觉语言模型(VLMs)中的多模态推理通常依赖于两阶段过程:监督微调(SFT)和强化学习(RL)。在标准的SFT中,所有标记对损失的贡献是平等的,即使推理数据本质上是标记不平衡的。长的<think>痕迹遮盖了短但任务关键的<answer>片段,导致冗长的推理和不准确的答案。我们提出了SCALe(逐步课程自适应损失),它明确地通过动态、与长度无关的加权来分离对推理和答案片段的监督。与传统的SFT不同,SCALe-SFT通过余弦调度策略在整个训练过程中逐渐将重点从<think>转移到<answer>,鼓励简洁且有根据的推理。我们在多种基准和架构上评估了SCALe。结果显示,SCALe在准确性上始终优于传统的SFT,并且在训练时间仅为完整两阶段SFT + GRPO流水线的约七分之一的情况下达到了相当的性能,使其成为一种轻量级但有效的替代方案。当与GRPO结合使用时,SCALe实现了最佳的整体性能,突显了其作为独立方法和强化细化强大基础的价值。
Summary / 总结
The paper addresses token imbalance in supervised fine-tuning (SFT) of vision-language models, where long <think> traces overshadow short but task-critical <answer> segments. It introduces SCALe (Scheduled Curriculum Adaptive Loss), which supervises reasoning and answer segments separately with dynamic, length-independent weights and shifts the emphasis from <think> to <answer> over training via a cosine schedule. Across diverse benchmarks and architectures, SCALe improves accuracy over vanilla SFT and matches the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time; combined with GRPO, it achieves the best overall performance.
论文针对视觉-语言模型监督微调中的标记不平衡问题:推理数据中较长的<think>轨迹会掩盖较短但关键的<answer>片段,导致推理冗长、答案不准确。论文提出了SCALe方法,通过动态、与长度无关的加权分别监督推理段和答案段。实验结果显示,SCALe在准确性上优于标准的监督微调,并且与完整的两阶段流水线性能相当,而所需训练时间仅约为其七分之一,是一种轻量但有效的替代方案。
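The idea of length-independent segment weighting with a cosine schedule can be sketched as follows. This is an illustrative reading of the abstract, not the paper's formula: each segment's loss is averaged separately (so a long <think> trace cannot dominate by token count), and the weight on the <answer> segment rises along a cosine curve during training. The function names and the start/end weights `w_start` and `w_end` are assumptions chosen for the example.

```python
import math


def cosine_answer_weight(step, total_steps, w_start=0.2, w_end=0.8):
    """Weight on the <answer> segment, moving from w_start to w_end on a cosine curve."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def segment_weighted_loss(token_losses, is_answer, step, total_steps):
    """Combine per-segment mean losses so that segment length does not set the weight."""
    think = [loss for loss, flag in zip(token_losses, is_answer) if not flag]
    answer = [loss for loss, flag in zip(token_losses, is_answer) if flag]
    think_mean = sum(think) / len(think) if think else 0.0
    answer_mean = sum(answer) / len(answer) if answer else 0.0
    weight = cosine_answer_weight(step, total_steps)
    return (1.0 - weight) * think_mean + weight * answer_mean
```

For comparison, a plain token-averaged loss over 8 <think> tokens and 2 <answer> tokens would implicitly give the answer segment a weight of 0.2 for the whole run; in this sketch that share is set by the schedule rather than by segment length.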
MeInTime: Bridging Age Gap in Identity-Preserving Face Restoration
Authors: Teer Song, Yue Zhang, Yu Tian, Ziyang Wang, Xianlin Zhang, Guixuan Zhang, Xuan Liu, Xueming Li, Yasen Zhang
First: 2026-03-19T09:11:07+00:00 · Latest: 2026-03-19T09:11:07+00:00
Abstract
To better preserve an individual's identity, face restoration has evolved from reference-free to reference-based approaches, which leverage high-quality reference images of the same identity to enhance identity fidelity in the restored outputs. However, most existing methods implicitly assume that the reference and degraded input are age-aligned, limiting their effectiveness in real-world scenarios where only cross-age references are available, such as historical photo restoration. This paper proposes MeInTime, a diffusion-based face restoration method that extends reference-based restoration from same-age to cross-age settings. Given one or few reference images along with an age prompt corresponding to the degraded input, MeInTime achieves faithful restoration with both identity fidelity and age consistency. Specifically, we decouple the modeling of identity and age conditions. During training, we focus solely on effectively injecting identity features through a newly introduced attention mechanism and introduce Gated Residual Fusion modules to facilitate the integration between degraded features and identity representations. At inference, we propose Age-Aware Gradient Guidance, a training-free sampling strategy, using an age-driven direction to iteratively nudge the identity-aware denoising latent toward the desired age semantic manifold. Extensive experiments demonstrate that MeInTime outperforms existing face restoration methods in both identity preservation and age consistency. Our code is available at: https://github.com/teer4/MeInTime
中文标题/摘要
标题:MeInTime: 跨年龄身份保留面部恢复方法
为了更好地保留个体的身份,面部恢复已从无参考方法发展为基于参考的方法,后者利用同一身份的高质量参考图像来增强恢复结果的身份保真度。然而,大多数现有方法隐含地假设参考图像与退化输入在年龄上是对齐的,这限制了它们在只有跨年龄参考图像可用的真实场景(例如历史照片恢复)中的有效性。本文提出了一种基于扩散的面部恢复方法MeInTime,该方法将基于参考的恢复从同龄设置扩展到跨龄设置。给定一个或几个参考图像以及与退化输入对应的年龄提示,MeInTime 能够实现兼具身份保真度和年龄一致性的忠实恢复。具体来说,我们将身份条件与年龄条件的建模解耦。在训练过程中,我们仅专注于通过新引入的注意力机制有效地注入身份特征,并引入门控残差融合(Gated Residual Fusion)模块以促进退化特征与身份表示之间的融合。在推理过程中,我们提出了年龄感知梯度引导(Age-Aware Gradient Guidance),这是一种无需训练的采样策略,利用年龄驱动的方向迭代地将身份感知的去噪潜变量推向所需的年龄语义流形。大量实验表明,MeInTime 在身份保留和年龄一致性方面均优于现有面部恢复方法。我们的代码可在以下链接获取:https://github.com/teer4/MeInTime
Summary / 总结
MeInTime is a diffusion-based face restoration method designed to bridge the age gap in identity-preserving face restoration. It uses one or a few reference images along with an age prompt to achieve faithful restoration with both identity fidelity and age consistency. MeInTime decouples identity and age conditions during training and employs an age-aware gradient guidance strategy at inference to iteratively adjust the denoising latent towards the desired age semantic manifold. Experimental results show that MeInTime outperforms existing methods in both identity preservation and age consistency.
MeInTime 是一种基于扩散的面部恢复方法,旨在弥合身份保留面部恢复中的年龄差距。它使用一个或几个参考图像以及年龄提示来实现同时保持身份一致性和年龄一致性的忠实恢复。MeInTime 在训练中解耦身份和年龄条件,并使用 Gated Residual Fusion 模块将退化特征与身份表示集成。在推理时,使用 Age-Aware Gradient Guidance 逐步引导身份感知去噪潜变量向所需的年龄语义流形靠拢。实验表明,MeInTime 在身份保持和年龄一致性方面优于现有方法。
Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering
Authors: Jiayi Luo, Jiayu Chen, Jiankun Wang, Cong Wang, Hanxin Zhu, Qingyun Sun, Chen Gao, Zhibo Chen, Jianxin Li
First: 2026-03-19T09:00:08+00:00 · Latest: 2026-03-19T09:00:08+00:00
Abstract
Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is its intrinsic property, with minor effects across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to $1.93\times$ speedup while maintaining a PSNR of up to 29 dB on Wan2.1.
中文标题/摘要
标题:基于离线层间稀疏性表征和在线双向共聚类的无需训练快速视频生成稀疏注意力
扩散变换器(DiTs)在视频生成质量上表现出色,但由于密集的3D注意力导致推理成本高,因此开发了稀疏注意力技术以提高效率。然而,现有的无需训练的视频生成稀疏注意力方法仍然面临两个未解决的局限性:忽略注意力剪枝中的层异质性以及忽略查询-键耦合在块分割中的作用,这阻碍了更好的质量-加速权衡。在本文中,我们揭示了一个关键见解,即每个层的注意力稀疏性是其固有的属性,不同输入之间的影响较小。受此启发,我们提出了SVOO,一种基于离线层间稀疏性表征和在线双向共聚类的无需训练快速视频生成稀疏注意力框架。具体而言,SVOO采用两阶段范式:(i)离线层间敏感性表征以推导每层固有的剪枝水平,(ii)通过一种新颖的双向共聚类算法实现块级稀疏注意力。在七个广泛使用的视频生成模型上的大量实验表明,SVOO在质量-加速权衡上优于最先进的方法,同时在Wan2.1上保持高达29 dB的PSNR,加速高达1.93倍。
Summary / 总结
The work addresses the high inference cost of diffusion transformers (DiTs) in video generation by proposing SVOO, a training-free sparse attention framework. SVOO involves offline layer-wise sensitivity profiling to determine intrinsic pruning levels and online bidirectional co-clustering for block-wise sparse attention. Experiments show SVOO outperforms existing methods, achieving up to 1.93 times speedup with PSNR up to 29 dB on Wan2.1.
该研究提出了一种名为SVOO的无需训练的稀疏注意力框架,以解决扩散变换器在视频生成中的高推理成本问题。SVOO通过离线层间敏感性分析确定每层固有的剪枝水平,并通过在线双向共聚类实现块级稀疏注意力。实验表明,SVOO优于现有方法,在Wan2.1上实现了高达1.93倍的加速,同时保持最高29 dB的PSNR。
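The observation that each layer's attention sparsity is largely input-independent suggests that per-layer pruning levels can be estimated once, offline, from a few calibration inputs. The sketch below uses a simple mass-retention heuristic as a stand-in: for each layer it measures the fraction of attention entries needed to retain a target share of the attention mass, averaged over calibration rows. This is not SVOO's sensitivity profiling procedure, which the abstract does not detail; the function names, the `target_mass` parameter, and the toy data are all hypothetical.

```python
def keep_ratio_for_mass(attention_row, target_mass=0.95):
    """Fraction of entries (largest first) needed to retain `target_mass` of the row's total."""
    total = sum(attention_row)
    kept, running = 0, 0.0
    for weight in sorted(attention_row, reverse=True):
        running += weight
        kept += 1
        if running >= target_mass * total:
            break
    return kept / len(attention_row)


def profile_layers(calibration_rows_per_layer, target_mass=0.95):
    """Average the keep ratio over calibration rows to get one pruning level per layer."""
    profile = {}
    for layer, rows in calibration_rows_per_layer.items():
        ratios = [keep_ratio_for_mass(row, target_mass) for row in rows]
        profile[layer] = sum(ratios) / len(ratios)
    return profile


# Toy calibration data: one peaked layer and one uniform layer (non-empty rows assumed).
calibration = {
    "layer0": [[8, 1, 1, 0]],
    "layer1": [[1, 1, 1, 1]],
}
layer_profile = profile_layers(calibration, target_mass=0.5)
```

A peaked layer ends up with a small keep ratio and a uniform layer with a larger one, which is the kind of layer heterogeneity the abstract argues a single global pruning level ignores.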
Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation
Authors: Ping Chen, Daoxuan Zhang, Xiangming Wang, Yungeng Liu, Haijin Zeng, Yongyong Chen
First: 2026-03-19T08:50:49+00:00 · Latest: 2026-03-19T08:50:49+00:00
Abstract
Precise Text-to-Image (T2I) generation has achieved great success but is hindered by the limited relational reasoning of static text encoders and the error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities during the Ordinary Differential Equation trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search incorporates a training-free closed-loop parallel rollout search and flow steering mechanism, which leverages a Vision-Language Model (VLM) as a semantic critic to diagnose intermediate latents and dynamically steer the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process, exploring multiple trajectories through lookahead simulations and selecting the optimal path based on VLM-guided rewards. Further, we provide AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. Experimental results show that our AFS-Search-Pro greatly boosts the performance of the original FLUX.1-dev, achieving state-of-the-art results across three different benchmarks. Meanwhile, AFS-Search-Fast also significantly enhances performance while maintaining fast generation speed.
中文标题/摘要
标题:代理流动引导和并行展开搜索在空间定位文本到图像生成中的应用
精确的文本到图像(T2I)生成已经取得了巨大成功,但受限于静态文本编码器的有限关系推理能力和开环采样中的误差累积。缺乏实时反馈,初始语义模糊性在常微分方程轨迹中不可避免地演变成对空间约束的随机偏差。为解决这一问题,我们引入了AFS-Search(代理流动引导和并行展开搜索),这是一种基于FLUX.1-dev的无需训练的闭环框架。AFS-Search 结合了无需训练的闭环并行展开搜索和流动引导机制,利用视觉语言模型(VLM)作为语义批评家来诊断中间潜变量,并通过精确的空间定位动态引导速度场。此外,我们将T2I生成视为一个顺序决策过程,通过前瞻模拟探索多个轨迹,并基于VLM引导的奖励选择最优路径。进一步地,我们提供了AFS-Search-Pro以获得更高性能,并提供了AFS-Search-Fast以实现更快的生成速度。实验结果表明,我们的AFS-Search-Pro极大地提升了原始FLUX.1-dev的性能,在三个不同的基准测试中达到了最先进的结果。同时,AFS-Search-Fast也显著提高了性能,同时保持了快速生成速度。
Summary / 总结
The research aims to improve the precision and reliability of Text-to-Image (T2I) generation by addressing the limitations of static text encoders and open-loop sampling. The method, AFS-Search, introduces a training-free closed-loop framework that uses a Vision-Language Model (VLM) to diagnose and steer the generation process, ensuring spatial constraints are met. Experimental results demonstrate that AFS-Search-Pro significantly improves the performance of FLUX.1-dev, achieving state-of-the-art results across three benchmarks, while AFS-Search-Fast enhances performance without compromising speed.
研究旨在通过解决静态文本编码器和开环采样限制,提高文本到图像生成的精度和可靠性。方法AFS-Search引入了一个无需训练的闭环框架,使用视觉语言模型(VLM)诊断和引导生成过程,确保满足空间约束。实验结果显示,AFS-Search-Pro显著提升了FLUX.1-dev的性能,在三个基准测试中达到最先进的结果,而AFS-Search-Fast则保持了快速生成速度的同时提高了性能。
GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?
Authors: Yueying Zou, Pei Pei Li, Zekun Li, Xinyu Guo, Xing Cui, Huaibo Huang, Ran He
Venue: ECCV 2026
First: 2026-03-19T08:44:08+00:00 · Latest: 2026-03-19T08:44:08+00:00
Comments: ECCV 2026 submission. 14 pages, 6 figures, 4 tables. Supplementary material included
Abstract
In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.
中文标题/摘要
标题:GenVideoLens:LVLM在AI生成视频检测中存在哪些不足?
近年来,AI生成的视频越来越逼真和复杂。与此同时,大型视觉-语言模型(LVLMs)在检测此类内容方面显示出强大的潜力。然而,现有的评估协议主要将任务视为二元分类问题,并依赖于粗粒度的指标,如总体准确率,这为了解LVLMs的成功或失败提供了有限的洞察。为了解决这一局限性,我们引入了GenVideoLens,这是一个细粒度基准,使我们能够从维度上评估LVLM在AI生成视频检测中的能力。基准数据集包含400个高度欺骗性的AI生成视频和100个真实视频,由专家在15个涵盖感知、光学、物理和时间线索的真伪维度上进行标注。我们在这项基准上评估了11个代表性LVLM。我们的分析揭示了明显的维度不平衡。虽然LVLMs在感知线索方面表现相对较好,但在光学一致性、物理交互和时间因果推理方面却表现不佳。模型在不同维度上的表现也存在显著差异,较小的开源模型有时在特定真伪线索上会优于更强的专有模型。时间扰动实验进一步表明,当前的LVLMs对时间信息的利用有限。总体而言,GenVideoLens为LVLM的行为提供了诊断性见解,揭示了关键的能力差距,并为改进未来的AI生成视频检测系统提供了指导。
Summary / 总结
GenVideoLens introduces a fine-grained benchmark for evaluating Large Vision-Language Models (LVLMs) in detecting AI-generated videos, containing 400 highly deceptive videos and 100 real videos, annotated across 15 dimensions. The study reveals that LVLMs perform well on perceptual cues but struggle with optical consistency, physical interactions, and temporal-causal reasoning. Performance varies across dimensions, with smaller models sometimes outperforming larger ones on specific cues. Temporal perturbation experiments show limited use of temporal information by current LVLMs. This benchmark provides diagnostic insights into LVLM behavior and highlights key capability gaps.
GenVideoLens 是一个基准,用于评估大型视觉-语言模型(LVLMs)在检测 AI 生成视频方面的性能,解决了现有二元分类指标的局限性。基准包括 400 个高度欺骗性的 AI 生成视频和 100 个真实视频,并在 15 个维度上进行了标注。研究发现,LVLMs 在感知线索方面表现良好,但在光学一致性、物理交互和时间因果推理方面存在困难。此外,较小的开源模型有时在特定的真实性线索上优于较大的专有模型,而当前的 LVLMs 对时间信息的利用有限。
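The contrast between a single overall accuracy and dimension-wise evaluation can be made concrete with a small sketch. It assumes each judged item is recorded as a (dimension, is_correct) pair; this record format, the function names, and the dimension labels are illustrative assumptions, not the benchmark's actual schema or scoring protocol.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Compute accuracy separately for each authenticity dimension."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dimension, is_correct in records:
        totals[dimension] += 1
        hits[dimension] += int(is_correct)
    return {dimension: hits[dimension] / totals[dimension] for dimension in totals}


def overall_accuracy(records):
    """Coarse-grained accuracy over all records, ignoring dimensions."""
    return sum(int(is_correct) for _, is_correct in records) / len(records)


# Toy records with hypothetical dimension labels.
records = [
    ("perceptual", True),
    ("perceptual", True),
    ("temporal", False),
    ("temporal", True),
]
per_dimension = accuracy_by_dimension(records)
overall = overall_accuracy(records)
```

Here the overall figure sits between the two per-dimension values and hides the gap between them, which is the kind of dimensional imbalance the benchmark is designed to expose.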
REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation
Authors: Shuqi Xiao, Maani Ghaffari, Chengzhong Xu, Hui Kong
First: 2026-03-19T08:43:40+00:00 · Latest: 2026-03-19T08:43:40+00:00
Abstract
Zero-shot object-goal navigation (ZSON) requires navigating unknown environments to find a target object without task-specific training. Prior hierarchical training-free solutions invest in scene understanding (belief) and high-level decision-making (policy), yet overlook the design of the option, i.e., a subgoal candidate proposed from evolving belief and presented to policy for selection. In practice, options are reduced to isolated waypoints scored independently: single destinations hide the value gathered along the journey; an unstructured collection obscures the relationships among candidates. Our insight is that the option space should be a tree of paths. Full paths expose en-route information gain that destination-only scoring systematically neglects; a tree of shared segments enables coarse-to-fine LLM reasoning that dismisses or pursues entire branches before examining individual leaves, compressing the combinatorial path space into an efficient hierarchy. We instantiate this insight in REST (Receding Horizon Explorative Steiner Tree), a training-free framework that (1) builds an explicit open-vocabulary 3D map from online RGB-D streams; (2) grows an agent-centric tree of safe and informative paths as the option space via sampling-based planning; and (3) textualizes each branch into a spatial narrative and selects the next-best path through chain-of-thought LLM reasoning. Across the Gibson, HM3D, and HSSD benchmarks, REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency, demonstrating a favorable efficiency-success balance.
中文标题/摘要
标题:REST:用于零样本物体-目标导航的滚动时域探索性Steiner树
零样本物体-目标导航(ZSON)要求在未知环境中导航以找到目标物体,且无需针对特定任务的训练。先前无需训练的分层方案着力于场景理解(信念,belief)和高层决策(策略,policy),却忽视了选项(option)的设计,即从不断演变的信念中提出、并交由策略选择的子目标候选。在实践中,选项被简化为彼此孤立、独立评分的航点:单一目的地掩盖了沿途可获得的价值;无结构的集合则掩盖了候选之间的关系。我们的见解是,选项空间应当是一棵路径树。完整路径揭示了仅对目的地评分时被系统性忽视的沿途信息增益;由共享路径段构成的树使LLM能够进行由粗到细的推理,先整体舍弃或选择整条分支,再检查各个叶节点,从而将组合式的路径空间压缩为高效的层次结构。我们将这一见解具体化为REST(Receding Horizon Explorative Steiner Tree,滚动时域探索性Steiner树),这是一个无需训练的框架:(1)从在线RGB-D流中构建显式的开放词汇3D地图;(2)通过基于采样的规划,生成一棵以智能体为中心、由安全且信息丰富的路径构成的树作为选项空间;(3)将每条分支文本化为空间叙事,并通过思维链LLM推理选择下一条最佳路径。在Gibson、HM3D和HSSD基准测试中,REST的成功率始终名列前茅,同时路径效率达到最佳或次佳,展现出良好的效率与成功率平衡。
Summary / 总结
REST is a training-free framework for zero-shot object-goal navigation that addresses the limitations of prior methods by focusing on the design of options as a tree of paths. It builds an explicit 3D map from RGB-D streams, grows an agent-centric tree of safe and informative paths, and uses LLM reasoning to select the next-best path. REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency, showing a favorable efficiency-success balance.
REST 是一种无需训练的零样本物体目标导航框架,通过将选项设计为路径树来解决先前方法的局限性。它从 RGB-D 流中构建显式的 3D 地图,生成一棵以智能体为中心、安全且信息丰富的路径树,并使用 LLM 推理选择下一条最佳路径。REST 在成功率方面始终名列前茅,同时实现最佳或次佳的路径效率,显示出良好的效率与成功率平衡。
Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge
Authors: Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu, Shengzhu Yang, Weihang Zhang, Huazhu Fu, Huiqi Li
First: 2026-03-07T09:43:49+00:00 · Latest: 2026-03-19T08:24:18+00:00
Abstract
Large Vision Language Models (LVLMs) show immense potential for automated ophthalmic diagnosis. However, their clinical deployment is severely hindered by lacking domain-specific knowledge. In this work, we identify two structural deficiencies hindering reliable medical reasoning: 1) the Perception Gap, where general-purpose visual encoders fail to resolve fine-grained pathological cues (e.g., microaneurysms); and 2) the Reasoning Gap, where sparse visual evidence is progressively overridden by massive language priors in deeper transformer layers, leading to ungrounded hallucinations. To bridge these gaps, we propose EyExIn, a data-efficient framework designed to anchor retinal VLMs with expert knowledge via a Deep Expert Injection mechanism. Our architecture employs an Expert-Aware Dual-Stream encoding strategy that decouples visual representation into a general stream for anatomical context and a specialized expert stream for pathological semantics. To ensure high-fidelity integration, we design a Semantic-Adaptive Gated Fusion module, which dynamically amplifies subtle lesion signals while filtering irrelevant background noise. Furthermore, we introduce Adaptive Deep Expert Injection to embed persistent "Vision Anchors" by integrating fused visual features as residual biases directly into intermediate LLM layers. This mechanism creates a visual shortcut that forces the reasoning stack to remain strictly grounded in visual evidence. Extensive experiments across four benchmarks demonstrate that our model consistently outperforms massive proprietary systems. EyExIn significantly enhances domain-specific knowledge embedding and achieves state-of-the-art precision in ophthalmic visual question answering, advancing the development of trustworthy ophthalmic AI.
中文标题/摘要
标题:深度专家注入以领域特定知识锚定视网膜VLM
大型视觉语言模型(LVLMs)在自动化眼科诊断方面显示出巨大的潜力。然而,由于缺乏领域特定知识,其临床部署受到严重阻碍。在这项工作中,我们指出了两个妨碍可靠医学推理的结构性缺陷:1)感知差距,即通用视觉编码器无法分辨细粒度的病理线索(例如微动脉瘤);2)推理差距,即稀疏的视觉证据在更深的Transformer层中逐步被大量的语言先验所覆盖,导致缺乏依据的幻觉。为了弥合这些差距,我们提出了EyExIn,这是一个数据高效的框架,旨在通过深度专家注入机制以专家知识锚定视网膜VLM。我们的架构采用专家感知的双流编码策略,将视觉表示解耦为用于解剖学上下文的通用流和用于病理语义的专门专家流。为了确保高保真的融合,我们设计了语义自适应门控融合模块,它在动态放大细微病灶信号的同时滤除无关的背景噪声。此外,我们引入了自适应深度专家注入,将融合后的视觉特征作为残差偏置直接注入LLM的中间层,从而嵌入持久的“视觉锚点”。该机制建立了一条视觉捷径,迫使推理堆栈始终严格以视觉证据为依据。在四个基准上的大量实验表明,我们的模型始终优于大规模的专有系统。EyExIn显著增强了领域特定知识的嵌入,并在眼科视觉问答中达到了最先进的精度,推动了可信赖眼科AI的发展。