arXiv Paper Digest

Snapshot: 20260319_0358

Demystifying Video Reasoning

Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

Venue: www

First: 2026-03-17T17:59:55+00:00 · Latest: 2026-03-17T17:59:55+00:00

Comments: Homepage: https://www.wruisi.com/demystifying_video_reasoning

Abs · PDF · Code1 · Code2

Abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

Summary

This work challenges the assumption that reasoning in video models unfolds sequentially across frames and instead identifies a Chain-of-Steps (CoS) mechanism in which reasoning emerges primarily along the diffusion denoising steps. Models explore multiple candidate solutions in early steps and progressively converge to a final answer. Key emergent behaviors include working memory, self-correction, and perception before action. The study also uncovers functional specialization within Diffusion Transformers: early layers encode perceptual structure, middle layers execute reasoning, and later layers consolidate representations. As a proof of concept, a training-free strategy improves reasoning by ensembling latent trajectories from identical models with different random seeds.
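
The seed-ensembling idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `toy_denoise_step` is a hypothetical stand-in for a real video denoiser, and averaging the latents across seeds during the first `merge_steps` steps is only one plausible reading of "ensembling latent trajectories", not necessarily the authors' procedure.

```python
import random


def toy_denoise_step(latent, step, total_steps, rng):
    """Hypothetical stand-in for one denoising step (not a real model).

    Pulls each latent value halfway toward 1.0 and adds seed-dependent
    noise whose scale shrinks to zero at the final step.
    """
    noise_scale = 1.0 - (step + 1) / total_steps
    return [0.5 * x + 0.5 + noise_scale * rng.gauss(0.0, 0.1) for x in latent]


def ensemble_trajectories(seeds, dim=4, total_steps=10, merge_steps=3):
    """Run one trajectory per seed and average the latents in early steps."""
    rngs = [random.Random(s) for s in seeds]
    latents = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for rng in rngs]
    for step in range(total_steps):
        latents = [
            toy_denoise_step(z, step, total_steps, rng)
            for z, rng in zip(latents, rngs)
        ]
        if step < merge_steps:
            # Assumed ensembling rule: replace every trajectory with the mean.
            mean = [sum(col) / len(col) for col in zip(*latents)]
            latents = [list(mean) for _ in latents]
    # After the merge phase the trajectories diverge again; return the first.
    return latents[0]


final_latent = ensemble_trajectories(seeds=[0, 1, 2])
```

The merge is restricted to early steps because, per the abstract, that is where the model is still exploring several candidate solutions.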

OrigamiBench: An Interactive Environment to Synthesize Flat-Foldable Origamis

Authors: Naaisha Agarwal, Yihan Wu, Yichang Jian, Yikuan Hu, Nishad Mansoor, Mohan Li, Yifei Peng, Wang-Zhou Dai, Yao-Xiang Ding, Emanuele Sansone

First: 2026-03-14T09:33:29+00:00 · Latest: 2026-03-17T17:36:55+00:00

Abs · PDF · Code1 · Code2

Abstract

Building AI systems that can plan, act, and create in the physical world requires more than pattern recognition. Such systems must understand the causal mechanisms and constraints governing physical processes in order to guide sequential decisions. This capability relies on internal representations, analogous to an internal language model, that relate observations, actions, and resulting environmental changes. However, many existing benchmarks treat visual perception and programmatic reasoning as separate problems, focusing either on visual recognition or on symbolic tasks. The domain of origami provides a natural testbed that integrates these modalities. Constructing shapes through folding operations requires visual perception, reasoning about geometric and physical constraints, and sequential planning, while remaining sufficiently structured for systematic evaluation. We introduce OrigamiBench, an interactive benchmark in which models iteratively propose folds and receive feedback on physical validity and similarity to a target configuration. Experiments with modern vision-language models show that scaling model size alone does not reliably produce causal reasoning about physical transformations. Models fail to generate coherent multi-step folding strategies, suggesting that visual and language representations remain weakly integrated.
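
As a small illustration of what a "physical validity" check for flat folding can involve, the sketch below tests Kawasaki's theorem at a single interior vertex: the crease pattern can fold flat locally only if the number of sector angles is even and their alternating sum is zero. Whether OrigamiBench uses this exact test is not stated in the abstract, so the function name and interface are assumptions.

```python
def satisfies_kawasaki(sector_angles_deg, tol=1e-9):
    """Check Kawasaki's condition at one interior vertex.

    sector_angles_deg lists the angles between consecutive creases, in
    order around the vertex, in degrees.
    """
    if len(sector_angles_deg) % 2 != 0:
        return False  # an odd number of creases cannot fold flat
    if abs(sum(sector_angles_deg) - 360.0) > tol:
        return False  # the angles must go once around the vertex
    alternating = sum(
        angle if i % 2 == 0 else -angle
        for i, angle in enumerate(sector_angles_deg)
    )
    return abs(alternating) <= tol


print(satisfies_kawasaki([100.0, 80.0, 80.0, 100.0]))
print(satisfies_kawasaki([120.0, 60.0, 90.0, 90.0]))
```

This is only a local, single-vertex condition; a full validity check for a multi-vertex pattern needs more than this.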

Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning

Authors: Yifan Wang, Yanyu Li, Gordon Guocheng Qian, Sergey Tulyakov, Yun Fu, Anil Kag

First: 2026-01-07T18:05:08+00:00 · Latest: 2026-03-17T17:21:55+00:00

Comments: Webpage: https://snap-research.github.io/diffusion-drf/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Video diffusion alignment has relied heavily on scalar rewards. These rewards are typically derived from reward models learned on human preference datasets, requiring additional training and extensive data collection. Moreover, scalar rewards provide coarse, global supervision, offering limited credit assignment for prompt-generation mismatches and making models prone to reward exploitation and unstable optimization. We propose Diffusion-DRF, a free, rich, and differentiable reward framework for video diffusion fine-tuning. Diffusion-DRF employs a frozen, off-the-shelf Vision-Language Model (VLM) as the critic, eliminating the need for reward model training. Instead of relying on a single scalar reward, it decomposes each user prompt into multi-dimensional questions with freeform dense VQA explanation queries, yielding information-rich feedback. By direct differentiable optimization over this rich feedback, Diffusion-DRF achieves stable reward-based tuning without collecting preference datasets. Diffusion-DRF achieves significant gains both quantitatively and qualitatively, outperforming state-of-the-art Flow-GRPO by 4.74% in overall performance on unseen VBench-2.0.

Summary

The paper proposes Diffusion-DRF, a reward framework for video diffusion fine-tuning that uses a frozen Vision-Language Model as the critic, so no reward model has to be trained. It decomposes each user prompt into multi-dimensional questions and returns rich, dense feedback, enabling stable reward-based tuning without preference datasets. Diffusion-DRF outperforms Flow-GRPO by 4.74% in overall performance on unseen VBench-2.0, with both quantitative and qualitative gains.

IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans

Authors: Huimin Xiong, Zijie Meng, Tianxiang Hu, Chenyi Zhou, Yang Feng, Zuozhu Liu

First: 2026-03-17T16:57:02+00:00 · Latest: 2026-03-17T16:57:02+00:00

Abs · PDF · Code1 · Code2

Abstract

3D intraoral scans (IOS) are increasingly adopted in routine dentistry because they provide abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. While recent works introduce dental vision-language models (VLMs) to enable unified diagnosis and report generation on 2D images or multi-view images rendered from IOS, they do not fully leverage native 3D geometry. Modeling that geometry directly is necessary but challenging because of (i) heterogeneous scan forms and complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, and (iii) limited paired 3D IOS-text data. We therefore present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question-answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pre-training, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness. IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.

Summary

The paper introduces IOSVLM, an end-to-end 3D vision-language model that works directly on 3D intraoral scans for unified dental diagnosis and generative visual question-answering. It addresses heterogeneous scan forms, multi-disease co-occurrence, and limited paired data by representing scans as point clouds and using a 3D encoder-projector-LLM design. The model outperforms strong baselines with gains of at least +9.58% macro accuracy and +1.46% macro F1, supporting the effectiveness of direct 3D geometry modeling for scan-based diagnosis.
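
Because the abstract stresses class imbalance, it may help to see why macro-averaged metrics are reported. The sketch below is a minimal single-label illustration, not the paper's evaluation code. Macro F1 is the unweighted mean of per-class F1; "macro accuracy" is read here as the unweighted mean of per-class recall, which is an assumption, since the paper may define it differently (for example, per-disease binary accuracy).

```python
def per_class_counts(y_true, y_pred, label):
    """Return (true positives, false positives, false negatives) for a label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall (assumed reading of macro accuracy)."""
    labels = sorted(set(y_true))
    recalls = []
    for label in labels:
        tp, _, fn = per_class_counts(y_true, y_pred, label)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)


# A classifier that always predicts the majority class on imbalanced data.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, macro_recall(y_true, y_pred), macro_f1(y_true, y_pred))
```

Plain accuracy rewards the majority-class predictor, while both macro scores penalize it for missing the rare class entirely.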

Retrieving Counterfactuals Improves Visual In-Context Learning

Authors: Guangzhi Xiong, Sanchit Sinha, Zhenghao He, Aidong Zhang

Venue: CVPR 2026

First: 2026-03-17T16:18:09+00:00 · Latest: 2026-03-17T16:18:09+00:00

Comments: CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.

Summary

The research aims to strengthen the causal reasoning of vision-language models (VLMs), which struggle to disentangle fine-grained visual attributes and to reason about causal relationships. The method, CIRCLES, actively constructs demonstration sets from counterfactual-style examples retrieved through attribute-guided composed image retrieval, improving model robustness and grounding. Experiments on four datasets show that CIRCLES outperforms existing methods across multiple architectures, particularly on small-scale models and under information scarcity, and that it retrieves more diverse and causally informative examples.

Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking

Authors: Shengqiong Wu, Bobo Li, Xinkai Wang, Xiangtai Li, Lei Cui, Furu Wei, Shuicheng Yan, Hao Fei, Tat-seng Chua

Venue: ICLR

First: 2026-02-24T23:26:09+00:00 · Latest: 2026-03-17T16:12:38+00:00

Comments: 28 pages, 17 figures, 6 tables, ICLR conference

Abs · PDF · Code1 · Code2

Abstract

Unified Vision-Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. However, existing approaches largely focus on architectural unification while overlooking the need for explicit interaction between the two capabilities during task solving. As a result, current models treat understanding and generation as parallel skills rather than synergistic processes. To achieve real synergy, we introduce the interleaved Analyzing-Drafting problem-solving loop (AD-Loop), a new thinking paradigm that dynamically alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy. To train this mechanism, we design a two-stage strategy: supervised learning on interleaved thought data to initialize alternation, followed by reinforcement learning to promote adaptive and autonomous control. Extensive experiments demonstrate that AD-Loop consistently improves performance across standard benchmarks for both understanding and generation, with strong transferability to various UVLM architectures. Visual analyses further validate the effectiveness of implicit visual thoughts. These results highlight AD-Loop as a principled and broadly applicable strategy for synergizing comprehension and creation. The project page is at https://sqwu.top/AD-Loop.

Summary

The research aims to enhance the synergy between understanding and generation in Unified Vision-Language Models (UVLMs) by introducing an interleaved Analyzing-Drafting problem-solving loop (AD-Loop). The loop alternates between analytic and drafting operations, enabling models to iteratively refine both comprehension and outputs. Training follows a two-stage strategy: supervised learning on interleaved thought data to initialize alternation, followed by reinforcement learning to promote adaptive control. Experiments show that AD-Loop improves performance across standard benchmarks for both understanding and generation, with strong transferability to various UVLM architectures, and visual analyses further validate the effectiveness of implicit visual thoughts.

The Cost of Reasoning: Chain-of-Thought Induces Overconfidence in Vision-Language Models

Authors: Robert Welch, Emir Konuk, Kevin Smith

First: 2026-03-17T16:12:06+00:00 · Latest: 2026-03-17T16:12:06+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) are increasingly deployed in high-stakes settings where reliable uncertainty quantification (UQ) is as important as predictive accuracy. Extended reasoning via chain-of-thought (CoT) prompting or reasoning-trained models has become ubiquitous in modern VLM pipelines, yet its effect on UQ reliability remains poorly understood. We show that reasoning consistently degrades the quality of most uncertainty estimates, even when it improves task accuracy. We identify implicit answer conditioning as the primary mechanism: as reasoning traces converge on a conclusion before the final answer is generated, token probabilities increasingly reflect consistency with the model's own reasoning trace rather than uncertainty about correctness. In effect, the model becomes overconfident in its answer. In contrast, agreement-based consistency remains robust and often improves under reasoning, making it a practical choice for uncertainty estimation in reasoning-enabled VLMs.

Summary

The research investigates how chain-of-thought (CoT) reasoning affects the reliability of uncertainty quantification in vision-language models (VLMs). It finds that reasoning, even when it improves task accuracy, consistently degrades the quality of most uncertainty estimates. The primary mechanism is implicit answer conditioning: token probabilities come to reflect consistency with the model's own reasoning trace rather than uncertainty about correctness, so the model becomes overconfident. Agreement-based consistency, however, remains robust and often improves under reasoning, making it a practical choice for uncertainty estimation in reasoning-enabled VLMs.
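
Agreement-based consistency can be sketched in a few lines: sample several answers to the same question and use the share that agrees with the majority as the confidence score. This is a minimal illustration, assuming the answers have already been sampled from the model and that exact match after light normalization is an adequate notion of agreement; the function name is hypothetical and the paper's exact estimator may differ.

```python
from collections import Counter


def agreement_confidence(sampled_answers):
    """Return (majority answer, fraction of samples that agree with it)."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so that "B" and "b " count as the same answer.
    counts = Counter(answer.strip().lower() for answer in sampled_answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(sampled_answers)


answer, confidence = agreement_confidence(["B", "b ", "A", "B"])
print(answer, confidence)  # b 0.75
```

The score depends only on the final answers, not on token probabilities, which is why it is unaffected by the conditioning effect described above.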

Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search

Authors: Sainan Liu, Tz-Ying Wu, Hector A Valdez, Subarna Tripathi

First: 2026-03-17T16:02:38+00:00 · Latest: 2026-03-17T16:02:38+00:00

Comments: 14 pages, 9 figures

Abs · PDF · Code1 · Code2

Abstract

We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.

Summary

Search2Motion is a training-free framework for object-level motion editing in image-to-video generation. It uses target-frame-based control and first-last-frame motion priors to relocate objects while maintaining scene stability, and it constructs reliable target frames through semantic-guided object insertion and robust background inpainting. Early-step self-attention maps are shown to predict object and camera dynamics, which motivates ACE-Seed, a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Because existing benchmarks conflate object and camera motion, the authors introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, along with FLF2V-obj metrics that isolate object artifacts. Search2Motion outperforms baselines on FLF2V-obj and VBench.

VideoVerse: Does Your T2V Generator Have World Model Capability to Synthesize Videos?

Authors: Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, Lei Zhang

First: 2025-10-09T16:18:20+00:00 · Latest: 2026-03-17T16:00:23+00:00

Comments: 26 Pages, 10 Figures, 13 Tables

Abs · PDF · Code1 · Code2

Abstract

The recent rapid advancement of Text-to-Video (T2V) generation technologies is endowing trained models with greater world-modeling ability, making existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, an essential property that differentiates videos from other modalities, remains largely unexplored. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that evaluates whether current T2V models can understand complex temporal causality and world knowledge to synthesize videos. We collect representative videos across diverse domains and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design ten evaluation dimensions covering dynamic and static properties, resulting in 300 prompts, 815 events, and 793 evaluation questions. Consequently, a human preference-aligned QA-based evaluation pipeline is developed by using modern vision-language models to systematically benchmark leading open- and closed-source T2V systems, revealing the current gap between T2V models and desired world modeling abilities.

Summary

The paper introduces VideoVerse, a benchmark that evaluates the world-modeling capabilities of Text-to-Video (T2V) generators. It addresses the limitations of existing benchmarks by focusing on event-level temporal causality and world knowledge. The benchmark comprises 300 prompts, 815 events, and 793 evaluation questions covering dynamic and static properties. A human preference-aligned, QA-based evaluation pipeline built on modern vision-language models systematically assesses leading open- and closed-source T2V systems and reveals the current gap between T2V models and the desired world-modeling abilities.

AutoThinkRAG: Complexity-Aware Control of Retrieval-Augmented Reasoning for Image-Text Interaction

Authors: Jiashu Yang, Chi Zhang, Abudukelimu Wuerkaixi, Xuxin Cheng, Cao Liu, Ke Zeng, Xu Jia, Xunliang Cai

First: 2026-03-05T02:29:33+00:00 · Latest: 2026-03-17T15:59:36+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal document question answering requires retrieving dispersed evidence from visually rich long documents and performing reliable reasoning over heterogeneous information. Existing multimodal RAG systems remain limited by two bottlenecks: static retrieval that ignores query complexity, and end-to-end Vision-Language Models (VLMs) that couple visual perception with logical reasoning, leading to inefficient computation and unstable answer generation. We propose AutoThinkRAG, a complexity-aware inference architecture for multimodal document QA. It has two components: (1) a Query Complexity Router that analyzes query difficulty and structure to adaptively select retrieval and reasoning paths; and (2) a Perception-Reasoning Decoupling architecture that uses a lightweight VLM as a high-fidelity visual interpreter to convert query-relevant visual cues into textual representations, which are then passed to an LLM for logical reasoning and answer synthesis. This design improves both efficiency and robustness, especially on long-document and unanswerable queries. Experiments on DocBench and MMLongBench show that AutoThinkRAG achieves 82.13% and 51.29% overall accuracy, respectively, while reducing per-query token consumption by 18.9% and monetary cost by 18.2%. Further analyses show that the gains are most pronounced on complex queries requiring adaptive retrieval and multi-step reasoning.

Summary

AutoThinkRAG addresses the limitations of existing multimodal RAG systems with a complexity-aware architecture. It combines a Query Complexity Router, which adaptively selects retrieval and reasoning paths, with a Perception-Reasoning Decoupling architecture, in which a lightweight VLM converts visual cues into textual representations that an LLM then uses for logical reasoning and answer synthesis. The system achieves 82.13% and 51.29% overall accuracy on DocBench and MMLongBench, respectively, while reducing per-query token consumption by 18.9% and monetary cost by 18.2%. The gains are most pronounced on complex queries requiring adaptive retrieval and multi-step reasoning.

MASS: MoErging through Adaptive Subspace Selection

Authors: Donato Crisostomi, Alessandro Zirilli, Antonio Andrea Gargiulo, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, Iacopo Masi, Emanuele Rodolà

First: 2025-04-06T08:49:52+00:00 · Latest: 2026-03-17T15:42:44+00:00

Abs · PDF · Code1 · Code2

Abstract

Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet, existing merging methods fall short of matching the full accuracy of separately fine-tuned endpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input's intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a ~2× storage factor compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 for benchmarks of 8, 14 and 20 tasks respectively, establishing a new state-of-the-art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.

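Summary

MASS is a model merging approach that combines multiple fine-tuned models into a single set of parameters while retaining near state-of-the-art performance. It builds on a low-rank decomposition of per-task updates, storing only the most salient singular components for each task and merging them into a shared model. At inference time, a non-parametric, data-free router selects the subspace (or combination of subspaces) that best explains the input's intermediate features, at the cost of a two-pass inference overhead and a ~2x storage factor. On CLIP-based image classification benchmarks, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.

MASS is a model merging technique that combines multiple fine-tuned models into a single set of parameters with no additional training overhead. By selecting and merging the most salient singular components of each task, MASS retains near state-of-the-art performance. At inference time, a non-parametric router determines the subspace that best fits the input's intermediate features, adding little overhead. On CLIP-based image classification benchmarks, MASS reaches up to ~98% of the average accuracy of individual fine-tuned models, making it a practical, lower-storage alternative to ensembling.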
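The routing idea in the MASS entry can be illustrated with a minimal sketch. The abstract does not state the router's scoring rule, so the projection-residual criterion below is an assumption, and the names `project_residual`, `route`, and `task_bases` are hypothetical. Each task is represented by a small orthonormal basis (standing in for its retained singular directions), and the input is routed to the task whose subspace leaves the smallest unexplained residual.

```python
import math


def project_residual(x, basis):
    """Norm of the part of x that the span of `basis` does not explain.

    `basis` is a list of orthonormal vectors with the same length as x
    (a hypothetical stand-in for a task's retained singular directions).
    """
    # Coefficient of x along each basis vector.
    coeffs = [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in basis]
    # Reconstruction of x inside the subspace.
    recon = [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(len(x))]
    return math.sqrt(sum((x_i - r_i) ** 2 for x_i, r_i in zip(x, recon)))


def route(x, task_bases):
    """Pick the task whose subspace leaves the smallest residual for x.

    Assumed scoring rule; the paper's router may score subspaces differently.
    """
    residuals = {task: project_residual(x, basis) for task, basis in task_bases.items()}
    return min(residuals, key=residuals.get), residuals


# Toy example: two tasks with rank-1 and rank-2 subspaces of a 3-d feature space.
task_bases = {
    "task_a": [[1.0, 0.0, 0.0]],
    "task_b": [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
chosen, residuals = route([0.1, 0.9, 0.4], task_bases)
print(chosen, residuals)
```

In this toy setup the feature vector lies almost entirely in the second subspace, so the router selects `task_b`. No training data is needed at routing time, which matches the "non-parametric, data-free" description.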

Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation

Authors: Jiawei Mao, Hardy Chen, Haoqin Tu, Yuhan Wang, Letian Zhang, Zeyu Zheng, Huaxiu Yao, Zirui Wang, Cihang Xie, Yuyin Zhou

First: 2026-03-17T15:30:47+00:00 · Latest: 2026-03-17T15:30:47+00:00

Comments: 16 pages, 11 figures, 5 tables

Abstract

Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly limits their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable and structured textual evidence. Second, to take full advantage of this evidence, Kestrel verifies it via an LVLM judge and then iteratively self-refines answers based on the verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis; for example, both the integrated self-refinement module and the grounding agent contribute an average +2.0% gain on POPE.

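Translated title / abstract

Title: Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation

Large vision-language models (LVLMs) have become increasingly capable on multimodal tasks but remain prone to hallucinations, which greatly limits their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive, training-free methods offer a cheap and flexible solution, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for mitigating LVLM hallucinations that combines an explicit visual-grounding agent with an evidence-verified self-refinement mechanism. Specifically, Kestrel first collects explicit visual evidence and converts tool outputs into reusable, structured textual evidence. Second, to make full use of this evidence, Kestrel verifies it with an LVLM judge and then iteratively self-refines answers based on the verified evidence, reducing the risk of over-correction. Extensive experiments show that Kestrel outperforms strong baselines on hallucination benchmarks (e.g., an average of +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL) while providing transparent verification traces for hallucination diagnosis and analysis; for example, both the integrated self-refinement module and the grounding agent contribute an average +2.0% gain on POPE.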

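Summary

Kestrel is a training-free framework designed to mitigate hallucinations in large vision-language models (LVLMs) by integrating an explicit visual-grounding agent and an evidence-verified self-refinement mechanism. It collects visual evidence, converts tool outputs into structured textual evidence, verifies that evidence with an LVLM judge, and iteratively refines answers based on the verified evidence. Experiments show that Kestrel outperforms strong baselines on hallucination benchmarks, with average improvements of +3.31% on POPE and +28.34 on MME-Hallucination, and provides transparent verification traces for diagnosis and analysis.

Kestrel is a training-free framework that aims to reduce hallucinations in large vision-language models (LVLMs) by combining an explicit visual-grounding agent with an evidence-based self-refinement mechanism. It collects visual evidence, converts tool outputs into structured textual evidence, and verifies that evidence with an LVLM judge. Iterative self-refinement on the verified evidence lowers the risk of over-correction. Experiments show that Kestrel outperforms strong baselines on hallucination benchmarks, with average gains of +3.31% on POPE and +28.34 on MME-Hallucination, and provides transparent verification traces for analysis.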

HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

Authors: Md Jahidul Islam

First: 2026-03-17T15:23:04+00:00 · Latest: 2026-03-17T15:23:04+00:00

Abstract

Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a "one-size-fits-all" architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities -- spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm, utilizing a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone's pre-trained knowledge. Extensive experiments demonstrate that HeBA's architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.

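Translated title / abstract

Title: HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models

Adapting large-scale vision-language models (VLMs) such as CLIP to downstream tasks usually follows a "one-size-fits-all" architectural approach, in which visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structure of the modalities: spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: visual tokens are processed with 2D depthwise-separable convolutions to preserve spatial correlations, while text tokens are processed with dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: unlike standard expanding adapters, HeBA uses a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: we challenge the restrictive zero-initialization paradigm and use a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without sacrificing the frozen backbone's pre-trained knowledge. Extensive experiments show that HeBA's architecturally specialized design achieves higher stability and accuracy and sets a new state-of-the-art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA/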

Summary

The paper proposes HeBA (Heterogeneous Bottleneck Adapter) to improve the robustness of Vision-Language Models (VLMs) like CLIP by addressing the uniform processing of visual and textual tokens. HeBA introduces three key innovations: heterogeneity in processing visual and textual tokens differently, bottleneck regularization to force learning compact features, and active gradient initialization to enhance convergence. Experiments show that HeBA outperforms existing methods on 11 few-shot benchmarks, setting a new state-of-the-art.

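The paper proposes HeBA (Heterogeneous Bottleneck Adapter), which improves the robustness of vision-language models (VLMs) such as CLIP by addressing the uniform processing of visual and textual tokens. HeBA introduces three key innovations: heterogeneous processing of visual and textual tokens, bottleneck regularization that forces the model to learn compact features, and active gradient initialization that speeds up convergence. Experiments show that HeBA outperforms existing methods on 11 few-shot benchmarks, reaching a new state-of-the-art.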

Closed-Loop Action Chunks with Dynamic Corrections for Training-Free Diffusion Policy

Authors: Pengyuan Wu, Pingrui Zhang, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li

First: 2026-03-02T15:04:18+00:00 · Latest: 2026-03-17T15:17:12+00:00

Comments: Accepted by ICRA2026

Abstract

Diffusion-based policies have achieved remarkable results in robotic manipulation but often struggle to adapt rapidly in dynamic scenarios, leading to delayed responses or task failures. We present DCDP, a Dynamic Closed-Loop Diffusion Policy framework that integrates chunk-based action generation with real-time correction. DCDP integrates a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder to inject environmental dynamics before action execution, achieving real-time closed-loop action correction and enhancing the system's adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug-and-play integration, achieving both temporal coherence and real-time responsiveness in dynamic robotic scenarios, including real-world manipulation tasks. The project page is at: https://github.com/wupengyuan/dcdp

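Translated title / abstract

Title: Closed-Loop Action Chunks with Dynamic Corrections for Training-Free Diffusion Policy

Diffusion-based policies have achieved remarkable results in robotic manipulation but often struggle to adapt quickly in dynamic scenarios, leading to delayed responses or task failures. We propose DCDP, a Dynamic Closed-Loop Diffusion Policy framework that combines chunk-based action generation with real-time correction. DCDP integrates a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder to inject environmental dynamics before action execution, achieving real-time closed-loop action correction and improving the system's adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug-and-play integration and delivers both temporal coherence and real-time responsiveness in dynamic robotic scenarios, including real-world manipulation tasks. The project page is at: https://github.com/wupengyuan/dcdp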

Summary

The research addresses the challenge of rapid adaptation in dynamic scenarios for diffusion-based robotic manipulation policies. DCDP, a Dynamic Closed-Loop Diffusion Policy framework, integrates chunk-based action generation with real-time correction using a self-supervised dynamic feature encoder and an asymmetric action encoder-decoder. This approach enhances adaptability by 19% in dynamic PushT simulations with only 5% additional computation. The modular design allows for plug-and-play integration, improving both temporal coherence and real-time responsiveness in dynamic robotic tasks.

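The research targets rapid adaptation of diffusion-based robotic manipulation policies in dynamic scenarios. DCDP, a Dynamic Closed-Loop Diffusion Policy framework, combines chunk-based action generation with real-time correction, using a self-supervised dynamic feature encoder and cross-attention fusion. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining and needs only 5% additional computation. Its modular design provides both real-time responsiveness and temporal coherence in dynamic robotic scenarios.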

FlowComposer: Composable Flows for Compositional Zero-Shot Learning

Authors: Zhenqi He, Lin Li, Long Chen

First: 2026-03-17T15:12:39+00:00 · Latest: 2026-03-17T15:12:39+00:00

Comments: Accepted to CVPR2026

Abstract

Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT). They apply visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions. However, such PEFT-based designs suffer from two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only via token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Remained Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models. In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, and a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals. We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.

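Translated title / abstract

Title: FlowComposer: Composable Flows for Compositional Zero-Shot Learning

Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT). They use visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions. However, these PEFT-based designs have two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only through token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Remained Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models. In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, together with a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals. We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.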

Summary

FlowComposer addresses the limitations of parameter-efficient fine-tuning in compositional zero-shot learning by introducing a model-agnostic framework that learns two primitive flows to transport visual features towards attribute and object text embeddings, and a learnable Composer that explicitly fuses these flows. This approach, combined with a leakage-guided augmentation scheme, improves generalization and achieves significant performance gains on three public CZSL benchmarks.

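FlowComposer introduces a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings and uses a learnable Composer to fuse these flows explicitly. This addresses the implicit composition construction and feature entanglement found in current compositional zero-shot learning methods and improves generalization. Experiments on three public CZSL benchmarks show consistent improvements when FlowComposer is integrated into various baselines.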
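The idea of fusing two primitive velocity fields into one composition flow can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's method: the primitive flows here are hand-written fields that point straight at a target embedding (in FlowComposer they are learned), the fusion uses fixed weights as a stand-in for the learnable Composer, and the names `velocity_toward`, `compose`, and `integrate` are hypothetical.

```python
def velocity_toward(target):
    """Toy primitive flow: a velocity field pointing from x toward `target`."""
    return lambda x: [t - x_i for t, x_i in zip(target, x)]


def compose(v_attr, v_obj, w_attr=0.5, w_obj=0.5):
    """Fuse two velocity fields with fixed weights.

    Assumption: a weighted sum stands in for the learnable Composer.
    """
    def v(x):
        return [w_attr * a + w_obj * o for a, o in zip(v_attr(x), v_obj(x))]
    return v


def integrate(x0, v, steps=100, dt=0.1):
    """Transport a feature along the velocity field with explicit Euler steps."""
    x = list(x0)
    for _ in range(steps):
        x = [x_i + dt * d for x_i, d in zip(x, v(x))]
    return x


attr_embedding = [1.0, 0.0]   # hypothetical attribute text embedding
obj_embedding = [0.0, 1.0]    # hypothetical object text embedding
v_comp = compose(velocity_toward(attr_embedding), velocity_toward(obj_embedding))
x_final = integrate([0.0, 0.0], v_comp)
print(x_final)
```

With equal weights the fused field pulls the feature toward the midpoint of the two embeddings, which makes the composition an explicit operation in embedding space rather than a by-product of token concatenation.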

Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

Authors: Weijie Qiu, Dai Guan, Junxin Wang, Zhihang Li, Yongbo Gai, Mengyu Zhou, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

First: 2026-03-17T14:45:49+00:00 · Latest: 2026-03-17T14:45:49+00:00

Comments: 25 pages, 10 figures

Abstract

Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming the methods trained on four times the data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM.

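Translated title / abstract

Title: Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models

Generative reward models (GRMs) for vision-language models (VLMs) usually evaluate outputs through a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks, which provide no differentiable signal and only limited guidance at training time. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly improve rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, encouraging the model to produce rubrics that are internally consistent and transferable. With about 50k data samples, Proxy-GRM reaches state-of-the-art results on VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming methods trained on four times the data. Ablations show that Proxy-SFT is a stronger verifier than Proxy-RL and that implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM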

Summary

The paper addresses the issue of optimizing intermediate rubrics in generative reward models (GRMs) for vision-language models (VLMs) by proposing Proxy-GRM, which uses proxy-guided rubric verification in RL to enhance rubric quality. The method trains lightweight proxy agents to predict preference orderings based on rubrics, using their prediction accuracy as a reward signal. Experiments show that Proxy-GRM achieves state-of-the-art results on various benchmarks with fewer data samples compared to other methods, and the learned rubrics transfer well to unseen evaluators, improving reward accuracy without additional training.

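The paper proposes Proxy-GRM, which optimizes the intermediate rubrics in generative reward models (GRMs) for vision-language models (VLMs) by introducing proxy-guided rubric verification into RL. The method trains lightweight proxy agents (Proxy-SFT and Proxy-RL) that predict preference orderings from the rubric, and uses the proxy's accuracy as a reward signal. Proxy-GRM achieves state-of-the-art results on several benchmarks with fewer data samples than existing methods, and the learned rubrics transfer well to unseen evaluators without additional training.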
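The reward signal described in the Proxy-GRM entry, a proxy's accuracy at recovering preferences from the rubric alone, can be sketched as follows. This is a minimal illustration under stated assumptions: `proxy_accuracy_reward` and `keyword_proxy` are hypothetical names, and the keyword-matching proxy is a toy stand-in for the trained Proxy-SFT or Proxy-RL agents, whose interfaces the abstract does not specify.

```python
def proxy_accuracy_reward(proxy, rubric, examples):
    """Fraction of preference pairs the proxy orders correctly given the rubric.

    `examples` holds (query, (response_a, response_b), preferred) tuples,
    where `preferred` is "a" or "b". The returned value plays the role of
    the rubric-quality reward.
    """
    correct = 0
    for query, (resp_a, resp_b), preferred in examples:
        if proxy(rubric, query, resp_a, resp_b) == preferred:
            correct += 1
    return correct / len(examples)


def keyword_proxy(rubric, query, resp_a, resp_b):
    """Toy proxy: prefer the response that mentions more rubric terms.

    Assumption: a real proxy would be a trained model, not keyword matching.
    """
    terms = rubric.lower().split()

    def score(response):
        return sum(term in response.lower() for term in terms)

    return "a" if score(resp_a) >= score(resp_b) else "b"


examples = [
    ("Describe the image.", ("A red car on a street.", "Something."), "a"),
    ("Count the cats.", ("Many.", "Two cats on a sofa."), "b"),
]
good_reward = proxy_accuracy_reward(keyword_proxy, "car street cats sofa", examples)
weak_reward = proxy_accuracy_reward(keyword_proxy, "zzz", examples)
print(good_reward, weak_reward)
```

An informative rubric lets the toy proxy recover both preferences, while an uninformative one leaves it guessing, so the rubric generator is rewarded for producing criteria that carry the decision on their own.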

V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models

Authors: Seyed Mahed Mousavi, Christian Moiola, Massimo Rizzoli, Simone Alghisi, Giuseppe Riccardi

First: 2026-03-17T14:33:08+00:00 · Latest: 2026-03-17T14:33:08+00:00

Abstract

Vision-Language Models (VLMs) are trained on data snapshots of documents, including images and texts. Their training data and evaluation benchmarks are typically static, implicitly treating factual knowledge as time-invariant. However, real-world facts are intrinsically time-sensitive and subject to erratic and periodic changes, causing model predictions to become outdated. We present V-DyKnow, a Visual Dynamic Knowledge benchmark for evaluating time-sensitive factual knowledge in VLMs. Using V-DyKnow, we benchmark closed- and open-source VLMs and analyze a) the reliability (correctness and consistency) of model responses across modalities and input perturbations; b) the efficacy of knowledge editing and multi-modal RAG methods for knowledge updates across modalities; and c) the sources of outdated predictions, through data and mechanistic analysis. Our results show that VLMs frequently output outdated facts, reflecting outdated snapshots used in the (pre-)training phase. Factual reliability degrades from textual to visual stimuli, even when entities are correctly recognized. Besides, existing alignment approaches fail to consistently update the models' knowledge across modalities. Together, these findings highlight fundamental limitations in how current VLMs acquire and update time-sensitive knowledge across modalities. We release the benchmark, code, and evaluation data.

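Translated title / abstract

Title: V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models

Vision-Language Models (VLMs) are trained on data snapshots that include images and text. Their training data and evaluation benchmarks are typically static, implicitly treating factual knowledge as time-invariant. However, real-world facts are inherently time-sensitive and subject to erratic and periodic changes, which causes model predictions to become outdated. We present V-DyKnow, a Visual Dynamic Knowledge benchmark for evaluating time-sensitive factual knowledge in VLMs. Using V-DyKnow, we benchmark closed- and open-source VLMs and analyze a) the reliability (correctness and consistency) of model responses across modalities and input perturbations; b) the efficacy of knowledge editing and multi-modal RAG methods for knowledge updates across modalities; and c) the sources of outdated predictions, through data and mechanistic analysis. Our results show that VLMs frequently output outdated facts, reflecting the outdated data snapshots used in the (pre-)training phase. Factual reliability degrades from textual to visual stimuli, even when entities are correctly recognized. In addition, existing alignment approaches fail to update the models' knowledge consistently across modalities. Together, these findings reveal fundamental limitations in how current VLMs acquire and update time-sensitive knowledge across modalities. We release the benchmark, code, and evaluation data.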

Summary

The paper introduces V-DyKnow, a benchmark for evaluating time-sensitive factual knowledge in Vision-Language Models (VLMs). It assesses the reliability and consistency of model responses, the effectiveness of knowledge editing and multi-modal RAG methods, and the sources of outdated predictions. The study reveals that VLMs often provide outdated information due to static training data, with textual stimuli being more reliable than visual ones. Existing alignment approaches also struggle to update knowledge consistently across modalities.

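The paper presents the V-DyKnow benchmark for evaluating time-sensitive factual knowledge in vision-language models (VLMs). It assesses the reliability and consistency of model responses, the effectiveness of knowledge editing and multi-modal RAG methods, and the sources of outdated predictions. The study finds that VLMs often give outdated information because of static training data, and that textual stimuli are more reliable than visual ones. Existing alignment approaches also fail to update model knowledge consistently across modalities, which points to fundamental limitations in how current VLMs handle time-sensitive knowledge across modalities.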

VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration

Authors: Saeed Khaki, Ashudeep Singh, Nima Safaei, Kamal Ginotra

First: 2026-01-20T19:54:49+00:00 · Latest: 2026-03-17T14:23:34+00:00

Abstract

Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We empirically characterize this as a modality gap: the same question in text form yields markedly higher accuracy than its visually typeset counterpart, due to compounded failures in reading dense formulas, layout, and mixed symbolic-diagrammatic context. First, we introduce VisTIRA (Vision and Tool-Integrated Reasoning Agent), a tool-integrated reasoning framework that enables structured problem solving by iteratively decomposing a given math problem (as an image) into natural language rationales and executable Python steps to determine the final answer. Second, we build a framework to measure and improve visual math reasoning: a LaTeX-based pipeline that converts chain-of-thought math corpora (e.g., NuminaMath) into challenging image counterparts, and a large set of synthetic tool-use trajectories derived from a real-world, homework-style image dataset (called SnapAsk) for fine-tuning VLMs. Our experiments show that tool-integrated supervision improves image-based reasoning, and OCR grounding can further narrow the gap for smaller models, although its benefit diminishes at scale. These findings highlight that modality gap severity inversely correlates with model size, and that structured reasoning and OCR-based grounding are complementary strategies for advancing visual mathematical reasoning.

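Translated title / abstract

Title: VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration

Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We empirically characterize this as a modality gap: the same question in text form yields markedly higher accuracy than its visually typeset counterpart, because of compounded failures in reading dense formulas, layout, and mixed symbolic-diagrammatic context. First, we introduce VisTIRA (Vision and Tool-Integrated Reasoning Agent), a tool-integrated reasoning framework that iteratively decomposes a given math problem (as an image) into natural language rationales and executable Python steps to determine the final answer. Second, we build a framework to measure and improve visual math reasoning: a LaTeX-based pipeline that converts chain-of-thought math corpora (e.g., NuminaMath) into challenging image counterparts, and a large set of synthetic tool-use trajectories derived from a real-world, homework-style image dataset (called SnapAsk) for fine-tuning VLMs. Our experiments show that tool-integrated supervision improves image-based reasoning, and that OCR grounding can further narrow the gap for smaller models, although its benefit diminishes at scale. These findings indicate that modality gap severity inversely correlates with model size, and that structured reasoning and OCR-based grounding are complementary strategies for advancing visual mathematical reasoning.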

Summary

The research addresses the modality gap in visual math reasoning, where vision-language models perform worse than text-only models when the same problems are presented as images. The study introduces VisTIRA, a tool-integrated reasoning framework that decomposes math problems into natural language rationales and executable Python steps. Experiments show that tool-integrated supervision improves image-based reasoning and that OCR grounding further narrows the gap for smaller models, although the benefit of OCR grounding diminishes at scale.

The research aims to close the modality gap in visual math reasoning, where vision-language models perform far worse on images than on text. It introduces VisTIRA, a tool-integrated reasoning framework that decomposes math problems into natural language rationales and executable steps. Experiments show that tool-integrated supervision improves image-based reasoning and that OCR grounding helps further, especially for smaller models, with the benefit of OCR grounding shrinking as model size grows.

Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models

Authors: Jiale Song, Jiaxin Luo, Xue-song Tang, Kuangrong Hao, Mingbo Zhao

First: 2026-03-17T14:19:22+00:00 · Latest: 2026-03-17T14:19:22+00:00

Abstract

Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy (SAE), which leverages semantic segmentation to quantify visual attention uncertainty in an object-level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE-guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, thereby enabling more trustworthy LVLM-driven perception and decision-making.

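Translated title / abstract

Title: Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models

Large Vision-Language Models (LVLMs) perform strongly on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy (SAE), which uses semantic segmentation to quantify visual attention uncertainty in an object-level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE-guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, enabling more trustworthy LVLM-driven perception and decision-making.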

Summary

The research aims to address the issue of object hallucinations in large vision-language models (LVLMs) by focusing on the visual modality. The authors propose Segmentation-based Attention Entropy (SAE) to quantify visual attention uncertainty at the object level and develop a reliability score for hallucination detection. Additionally, they introduce an SAE-guided attention adjustment method to mitigate hallucinations during inference. Experiments on public benchmarks and real-world scenarios with quadruped robots demonstrate that SAE effectively reduces hallucinations without additional training costs, enhancing the reliability of LVLM-driven perception and decision-making.

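The research addresses object hallucinations in large vision-language models (LVLMs) by focusing on attention patterns in the visual modality. It proposes Segmentation-based Attention Entropy (SAE) to quantify visual attention uncertainty, together with a reliability score for hallucination detection. It also designs an SAE-based attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. Experiments on public benchmarks and in real scenarios with quadruped robots show that SAE effectively reduces hallucinations without additional training cost, improving the reliability of LVLM-driven perception and decision-making.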
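A segment-level attention entropy can be sketched as below. The abstract does not give the SAE formula, so this is an assumption: attention weights over image patches are pooled by the semantic segment each patch belongs to, and the Shannon entropy of the pooled distribution serves as the uncertainty measure. The function name `segment_attention_entropy` and the toy inputs are hypothetical.

```python
import math


def segment_attention_entropy(attention, segment_ids):
    """Shannon entropy of attention mass pooled over semantic segments.

    attention   : non-negative attention weight per image patch
    segment_ids : segment label per patch (e.g. from a segmentation model)
    Assumption: the paper's exact SAE definition may differ from this.
    """
    mass = {}
    for weight, seg in zip(attention, segment_ids):
        mass[seg] = mass.get(seg, 0.0) + weight
    total = sum(mass.values())
    # Normalize to a distribution over segments, skipping empty segments.
    probs = [m / total for m in mass.values() if m > 0]
    return -sum(p * math.log(p) for p in probs)


# Attention concentrated on one object versus spread evenly over four objects.
focused = segment_attention_entropy([0.9, 0.05, 0.05, 0.0], [0, 0, 1, 2])
diffuse = segment_attention_entropy([0.25, 0.25, 0.25, 0.25], [0, 1, 2, 3])
print(focused, diffuse)
```

Low entropy means the attention is committed to a few objects, while high entropy means it is scattered across many; a detector could threshold this value, though the threshold and the direction of the reliability score would need to follow the paper.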

An approximate graph elicits detonation lattice

Authors: Vansh Sharma, Venkat Raman

First: 2026-03-17T13:46:29+00:00 · Latest: 2026-03-17T13:46:29+00:00

Abstract

This study presents a novel algorithm based on graph theory for the precise segmentation and measurement of detonation cells from 3D pressure traces, termed detonation lattices, addressing the limitations of manual and primitive 2D edge detection methods prevalent in the field. Using a segmentation model, the proposed training-free algorithm is designed to accurately extract cellular patterns, a longstanding challenge in detonation research. First, the efficacy of segmentation on generated data is shown with a prediction error of 2%. Next, 3D simulation data is used to establish the performance of the graph-based workflow. The results of statistics and joint probability densities show oblong cells aligned with the wave propagation axis with 17% deviation, whereas larger dispersion in volume reflects cubic amplification of linear variability. Although the framework is robust, it remains challenging to reliably segment and quantify highly complex cellular patterns. However, the graph-based formulation generalizes across diverse cellular geometries, positioning it as a practical tool for detonation analysis and a strong foundation for future extensions in triple-point collision studies.

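Translated title / abstract

Title: An approximate graph elicits detonation lattice

This study presents a novel graph-theory-based algorithm for precisely segmenting and measuring detonation cells from 3D pressure traces, termed detonation lattices, addressing the limitations of the manual and primitive 2D edge detection methods used in the field. Using a segmentation model, the proposed training-free algorithm is designed to extract cellular patterns accurately, a longstanding challenge in detonation research. First, the efficacy of segmentation is shown on generated data with a prediction error of 2%. Next, 3D simulation data is used to establish the performance of the graph-based workflow. Statistics and joint probability densities show oblong cells aligned with the wave propagation axis with 17% deviation, whereas the larger dispersion in volume reflects cubic amplification of linear variability. Although the framework is robust, reliably segmenting and quantifying highly complex cellular patterns remains challenging. However, the graph-based formulation generalizes across diverse cellular geometries, positioning it as a practical tool for detonation analysis and a strong foundation for future extensions in triple-point collision studies.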

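Summary

This study introduces a graph-theory-based algorithm for precise segmentation and measurement of detonation cells from 3D pressure traces, termed detonation lattices. The algorithm requires no training and shows a 2% prediction error on generated data. On 3D simulation data, the resulting statistics show oblong cells aligned with the wave propagation axis with 17% deviation. The method generalizes across various cellular geometries, providing a practical tool for detonation analysis and a foundation for future research.

The study proposes a graph-theory-based algorithm for precisely segmenting and measuring detonation cells from 3D pressure traces, termed detonation lattices. The algorithm needs no training and has a 2% prediction error on generated data. On 3D simulation data, the cells are oblong and aligned with the wave propagation axis, with 17% deviation. The method applies to different cellular geometries, making it a practical tool for detonation analysis and a basis for future research.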
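The remark that volume dispersion reflects cubic amplification of linear variability can be made concrete with a first-order error-propagation sketch. It assumes, which the abstract does not state, that a cell's volume scales with the cube of a single characteristic length $\ell$, with $k$ a hypothetical shape constant:

```latex
V = k\,\ell^{3}
\quad\Longrightarrow\quad
\frac{\mathrm{d}V}{V} = 3\,\frac{\mathrm{d}\ell}{\ell}
\quad\Longrightarrow\quad
\frac{\sigma_V}{V} \approx 3\,\frac{\sigma_\ell}{\ell}
\qquad (\sigma_\ell \ll \ell).
```

Under this assumption, a small relative scatter in cell length appears roughly three times larger in cell volume. For oblong cells whose three dimensions vary separately, the factor depends on how those dimensions co-vary, so the relation is only a rough guide.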

TempCore: Are Video QA Benchmarks Temporally Grounded? A Frame Selection Sensitivity Analysis and Benchmark

Authors: Hyunjong Ok, Jaeho Lee

First: 2025-09-01T06:39:08+00:00 · Latest: 2026-03-17T13:01:54+00:00

Comments: preprint

Abstract

Vision-language models (VLMs) can ingest only a limited number of video frames, making frame selection a practical necessity. But do current Video QA benchmarks genuinely require temporal frame selection, or can most questions be answered regardless of which frames are shown? We introduce Frame Selection Sensitivity (FSS), a per-sample diagnostic that measures how much VLM accuracy changes when the most relevant frames are replaced with the least relevant ones. Across six benchmarks and eight VLMs, we find that a large majority of samples are frame-agnostic: only a minority are genuinely sensitive to frame choice. Combining FSS with a Language Independence Score (LIS) reveals that merely 8-33% of samples are Temporally Sensitive. We construct TempCore, compact evaluation subsets that isolate these temporal samples from existing benchmarks, and will release code and per-sample annotations upon publication.

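Translated title / abstract

Title: TempCore: Are Video QA Benchmarks Temporally Grounded? A Frame Selection Sensitivity Analysis and Benchmark

Vision-language models (VLMs) can process only a limited number of video frames, so frame selection is a practical necessity. But do current Video QA benchmarks genuinely require temporal frame selection, or can most questions be answered regardless of which frames are shown? We introduce Frame Selection Sensitivity (FSS), a per-sample diagnostic that measures how much VLM accuracy changes when the most relevant frames are replaced with the least relevant ones. Across six benchmarks and eight VLMs, we find that most samples are frame-agnostic: only a minority are genuinely sensitive to frame choice. Combining FSS with a Language Independence Score (LIS) shows that only 8-33% of samples are Temporally Sensitive. We construct TempCore, compact evaluation subsets that isolate these temporal samples from existing benchmarks, and will release code and per-sample annotations upon publication.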

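Summary

The research evaluates whether current Video QA benchmarks genuinely require temporal frame selection. The study introduces Frame Selection Sensitivity (FSS), which measures how VLM accuracy changes when the most relevant frames are replaced with the least relevant ones. Across six benchmarks and eight VLMs, most samples are frame-agnostic and only a minority are genuinely sensitive to frame choice; combining FSS with a Language Independence Score (LIS) shows that merely 8-33% of samples are Temporally Sensitive. The study constructs TempCore, compact evaluation subsets that isolate these temporal samples, and the authors will release code and per-sample annotations upon publication.

The research evaluates how necessary temporal frame selection is in Video QA benchmarks used to assess vision-language models. It introduces Frame Selection Sensitivity (FSS) to measure how VLM accuracy changes when relevant frames are replaced with irrelevant ones. Across six benchmarks and eight VLMs, most samples are frame-agnostic, and only a minority are genuinely sensitive to frame choice. The study constructs TempCore, compact evaluation subsets that isolate these temporally sensitive samples, and the authors will release code and per-sample annotations upon publication.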
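The per-sample diagnostic described in the TempCore entry can be sketched as an accuracy difference. This is a minimal illustration under stated assumptions: the abstract does not define how FSS is aggregated, so averaging 0/1 correctness over several VLMs is an assumption, and `frame_selection_sensitivity` and the toy data are hypothetical.

```python
def frame_selection_sensitivity(correct_top, correct_bottom):
    """Accuracy change for one sample when frames are swapped.

    correct_top    : 0/1 outcome per VLM when shown the most relevant frames
    correct_bottom : 0/1 outcome per VLM when shown the least relevant frames
    Returns accuracy(top) - accuracy(bottom); values near 0 suggest the
    sample is frame-agnostic.
    """
    acc_top = sum(correct_top) / len(correct_top)
    acc_bottom = sum(correct_bottom) / len(correct_bottom)
    return acc_top - acc_bottom


samples = {
    "q1": ([1, 1, 1, 1], [1, 1, 1, 1]),  # answered correctly either way
    "q2": ([1, 1, 1, 0], [0, 0, 1, 0]),  # accuracy drops after the swap
}
fss = {qid: frame_selection_sensitivity(top, bottom) for qid, (top, bottom) in samples.items()}
print(fss)
```

Here `q1` scores zero and would count as frame-agnostic, while `q2` shows a clear drop. Selecting a temporally sensitive subset would additionally need a cutoff and the Language Independence Score, neither of which is specified in the abstract.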

Follow the Clues, Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimodal Emotion Recognition

Authors: Yu Liu, Lei Zhang, Haoxun Li, Hanlei Shi, Yuxuan Ding, Leyuan Qu, Taihao Li

First: 2026-03-17T12:49:51+00:00 · Latest: 2026-03-17T12:49:51+00:00

Abstract

Open-Vocabulary Multimodal Emotion Recognition (OV-MER) is inherently challenging due to the ambiguity of equivocal multimodal cues, which often stem from distinct unobserved situational dynamics. While Multimodal Large Language Models (MLLMs) offer extensive semantic coverage, their performance is often bottlenecked by premature commitment to dominant data priors, resulting in suboptimal heuristics that overlook crucial, complementary affective cues across modalities. We argue that effective affective reasoning requires more than surface-level association; it necessitates reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile these observations from diverse latent perspectives. We introduce HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes inference as a Propose-Verify-Decide protocol. To internalize this abductive process, we employ reinforcement learning with hierarchical reward shaping, aligning the reasoning trajectories with final task performance to ensure they best reconcile the observed multimodal cues. Systematic evaluations validate our design choices, with HyDRA consistently outperforming strong baselines, especially in ambiguous or conflicting scenarios, while providing interpretable, diagnostic evidence traces.

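Translated title / abstract

Title: Follow the Clues, Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimodal Emotion Recognition

Open-Vocabulary Multimodal Emotion Recognition (OV-MER) is inherently challenging because equivocal multimodal cues are ambiguous and often stem from distinct unobserved situational dynamics. Although Multimodal Large Language Models (MLLMs) offer broad semantic coverage, their performance is often limited by premature commitment to dominant data priors, which leads to suboptimal heuristics that overlook important complementary affective cues across modalities. We argue that effective affective reasoning requires more than surface-level association: it requires reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile the observations from diverse latent perspectives. We propose HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes inference as a Propose-Verify-Decide protocol. To internalize this abductive process, we use reinforcement learning with hierarchical reward shaping, aligning the reasoning trajectories with final task performance so that they best reconcile the observed multimodal cues. Systematic evaluations validate our design choices: HyDRA consistently outperforms strong baselines, especially in ambiguous or conflicting scenarios, while providing interpretable, diagnostic evidence traces.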

Visual Distraction Undermines Moral Reasoning in Vision-Language Models

Authors: Xinyi Yang, Chenheng Xu, Weijun Hong, Ce Mo, Qian Wang, Fang Fang, Yixin Zhu

First: 2026-03-17T12:29:09+00:00 · Latest: 2026-03-17T12:29:09+00:00

Abstract

Moral reasoning is fundamental to safe Artificial Intelligence (AI), yet ensuring its consistency across modalities becomes critical as AI systems evolve from text-based assistants to embodied agents. Current safety techniques demonstrate success in textual contexts, but concerns remain about generalization to visual inputs. Existing moral evaluation benchmarks rely on text-only formats and lack systematic control over variables that influence moral decision-making. Here we show that visual inputs fundamentally alter moral decision-making in state-of-the-art (SOTA) Vision-Language Models (VLMs), bypassing text-based safety mechanisms. We introduce Moral Dilemma Simulation (MDS), a multimodal benchmark grounded in Moral Foundation Theory (MFT) that enables mechanistic analysis through orthogonal manipulation of visual and contextual variables. The evaluation reveals that the vision modality activates intuition-like pathways that override the more deliberate and safer reasoning patterns observed in text-only contexts. These findings expose critical fragilities where language-tuned safety filters fail to constrain visual processing, demonstrating the urgent need for multimodal safety alignment.

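Translated title / abstract

Title: Visual Distraction Undermines Moral Reasoning in Vision-Language Models

Moral reasoning is fundamental to safe Artificial Intelligence (AI), and ensuring its consistency across modalities becomes critical as AI systems evolve from text-based assistants to embodied agents. Current safety techniques succeed in textual contexts, but concerns remain about how well they generalize to visual inputs. Existing moral evaluation benchmarks rely on text-only formats and lack systematic control over the variables that influence moral decision-making. We show that visual inputs fundamentally alter moral decision-making in state-of-the-art (SOTA) Vision-Language Models (VLMs), bypassing text-based safety mechanisms. We introduce Moral Dilemma Simulation (MDS), a multimodal benchmark grounded in Moral Foundation Theory (MFT) that enables mechanistic analysis through orthogonal manipulation of visual and contextual variables. The evaluation reveals that the vision modality activates intuition-like pathways that override the more deliberate and safer reasoning patterns observed in text-only contexts. These findings expose critical fragilities where language-tuned safety filters fail to constrain visual processing, showing the urgent need for multimodal safety alignment.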

Summary / 总结

The research aims to address the challenge of ensuring moral reasoning consistency across modalities in AI, particularly in Vision-Language Models (VLMs). The study introduces Moral Dilemma Simulation (MDS), a multimodal benchmark based on Moral Foundation Theory, to evaluate how visual inputs affect moral decision-making. Key findings show that visual inputs bypass text-based safety mechanisms, activating intuition-like pathways that override safer reasoning patterns, highlighting the need for multimodal safety alignment in VLMs.

研究旨在解决在AI中确保道德推理在不同模态之间的一致性问题，特别是在视觉语言模型（VLMs）中的问题。研究引入了基于道德基础理论的道德困境模拟（MDS）基准，以评估视觉输入如何影响道德决策。关键发现表明，视觉输入绕过了基于文本的安全机制，激活了类似于直觉的路径，从而压倒了更安全的推理模式，这突显了在VLMs中进行多模态安全对齐的迫切需求。

A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods

Authors: Qinqian Lei, Bo Wang, Robby T. Tan

Venue: CVPR 2026

First: 2025-08-26T07:30:53+00:00 · Latest: 2026-03-17T12:07:24+00:00

Comments: Accepted by CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

HOI detection has long been dominated by task-specific models, sometimes with early vision-language backbones such as CLIP. With the rise of large generative VLMs, a key question is whether standalone VLMs can perform HOI detection competitively against specialized HOI methods. Existing benchmarks such as HICO-DET require exact label matching under incomplete annotations, so any unmatched prediction is marked wrong. This unfairly penalizes valid outputs, especially from less constrained VLMs, and makes cross-paradigm comparison unreliable. To address this limitation, we introduce CrossHOI-Bench, a multiple-choice HOI benchmark with explicit positives and curated negatives, enabling unified and reliable evaluation of both VLMs and HOI-specific models. We further focus on challenging scenarios, such as multi-person scenes and fine-grained interaction distinctions, which are crucial for revealing real differences between the two paradigms. Experiments show that large VLMs achieve competitive, sometimes superior, zero-shot performance, yet they struggle with multiple concurrent actions and with correctly assigning interactions to the target person. Conversely, HOI-specific methods remain weaker in general HOI reasoning but demonstrate stronger multi-action recognition and more reliable identification of which person performs which action. These findings expose complementary strengths and weaknesses of VLMs and HOI-specific methods, which existing benchmarks fail to reveal due to incorrect penalization.

中文标题/摘要

标题：跨视觉语言模型和特定于HOI方法的统一HOI评估基准

长期以来，HOI检测一直由任务特定模型主导，有时使用早期的视觉语言模型，如CLIP。随着大型生成VLM的兴起，一个关键问题是独立的VLM是否能在HOI检测中与专门的HOI方法竞争。现有的基准，如HICO-DET，要求精确的标签匹配，在不完整注释下，任何未匹配的预测都会被标记为错误。这不公平地惩罚了有效的输出，尤其是来自较少约束的VLM，使得跨范式比较不可靠。为了解决这一局限性，我们引入了CrossHOI-Bench，这是一个具有明确正例和精心挑选负例的多项选择HOI基准，使VLM和HOI特定模型的统一和可靠评估成为可能。我们进一步关注具有挑战性的场景，如多人场景和精细的交互区分，这对于揭示两种范式之间的真正差异至关重要。实验表明，大型VLM在零样本性能上具有竞争力，甚至有时更优，但它们在处理多个并发动作和正确分配交互给目标人物方面存在困难。相反，HOI特定方法在一般HOI推理方面仍然较弱，但在多动作识别和更可靠地识别哪个执行者执行哪个动作方面表现出更强的能力。这些发现揭示了VLM和HOI特定方法的互补优势和劣势，而现有的基准由于错误的惩罚未能揭示这些差异。

Summary / 总结

The research aims to evaluate the performance of vision-language models and HOI-specific methods in HOI detection by introducing CrossHOI-Bench, a multiple-choice benchmark with explicit positives and curated negatives. The study finds that large VLMs achieve competitive zero-shot performance but struggle with multiple concurrent actions and correctly assigning interactions to the target person, while HOI-specific methods excel in multi-action recognition and identifying the performer of each action, revealing complementary strengths and weaknesses of the two approaches.

研究旨在通过引入CrossHOI-Bench（一个包含明确正例和精心挑选负例的多项选择基准）来评估视觉语言模型和HOI特定方法在HOI检测中的表现。实验表明，大型VLM在零样本性能上具有竞争力，但在处理多个并发动作以及将交互正确分配给目标人物方面存在困难，而HOI特定方法在多动作识别和识别每个动作的执行者方面表现出色。这些发现揭示了两种方法的互补优势和劣势，而现有的基准由于严格的标签匹配标准未能捕捉到这些差异。

ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning

Authors: Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang

First: 2025-09-04T07:26:20+00:00 · Latest: 2026-03-17T12:05:24+00:00

Comments: Accepted by CVPR2026, Project Page: https://zhuwenjie98.github.io/ANTS-project-page/

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. Furthermore, the absence of negative labels semantically similar to ID labels constrains their capability in near-OOD detection. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we cache images likely to be OOD samples from the historical test images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we cache the subset of ID classes that are visually similar to historical test images and then leverage MLLM reasoning to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD), making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 3.1%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability. Codes are available at https://github.com/ZhuWenjie98/ANTS.

中文标题/摘要

标题：ANTS：通过测试时MLLM理解和推理构建自适应负文本空间进行OOD检测

引入负标签（NLs）已被证明能有效提升Out-of-Distribution（OOD）检测。然而，现有方法往往缺乏对OOD图像的理解，难以构建准确的负空间。此外，缺乏与ID标签语义相似的负标签限制了其在近OOD检测中的能力。为解决这些问题，我们提出通过利用多模态大型语言模型（MLLM）的理解和推理能力来塑造自适应负文本空间（ANTS）。具体而言，我们从历史测试图像中缓存可能为OOD样本的图像，并提示MLLM描述这些图像，生成能够精确刻画OOD分布的表达性负句子，从而增强远OOD检测。对于近OOD设置，其中OOD样本类似于分布内（ID）子集，我们缓存与历史测试图像视觉相似的ID类子集，并利用MLLM推理生成针对该子集的视觉相似负标签，有效减少假阴性并提高近OOD检测。为了平衡这两种类型的负文本空间，我们设计了一个自适应加权分数，使方法能够处理不同的OOD任务设置（近OOD和远OOD），使其在开放环境中具有高度适应性。在ImageNet基准测试中，我们的ANTS显著降低了FPR95 3.1%，建立了新的最佳水平。此外，我们的方法是无训练和零样本的，具有高可扩展性。代码可在https://github.com/ZhuWenjie98/ANTS获取。

Summary / 总结

This paper introduces ANTS, which addresses the limitations of existing OOD detection methods by leveraging MLLMs to generate adaptive negative textual spaces. ANTS shapes a far-OOD negative space by prompting MLLMs to describe OOD images and a near-OOD negative space by generating visually similar negative labels. The method uses an adaptive weighted score to balance these spaces, improving both far-OOD and near-OOD detection. On ImageNet, ANTS reduces FPR95 by 3.1%, setting a new state-of-the-art, and is training-free and zero-shot, enhancing scalability.

该论文提出了一种名为ANTS的方法，通过利用多模态大语言模型（MLLMs）生成自适应的负文本空间来增强Out-of-Distribution (OOD)检测。ANTS通过使用MLLMs描述OOD图像并生成精确的负句子来解决现有方法的局限性，从而提高远OOD检测效果。对于近OOD检测，ANTS生成视觉上相似的负标签以减少假阴性。该方法通过自适应加权评分平衡这两种类型的负文本空间，使其在不同OOD设置下具有高度适应性。在ImageNet基准测试中，ANTS显著降低了FPR95达3.1%，并建立了新的最佳水平。
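
The ANTS result above is reported as FPR95, the false-positive rate on OOD samples at the threshold that still accepts 95% of ID samples. As a minimal sketch of how that metric is commonly computed (not the authors' evaluation code), the function below assumes that a higher score means "more in-distribution"; the function name and the toy scores are illustrative only.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False-positive rate on OOD samples at the threshold keeping `tpr` of ID samples.

    Assumption: a higher score means the sample looks more in-distribution.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of top-ranked ID samples that reaches the target TPR;
    # the small epsilon guards against floating-point round-up.
    keep = max(1, math.ceil(tpr * len(ranked) - 1e-9))
    threshold = ranked[keep - 1]
    # An OOD sample scoring at or above the threshold is wrongly accepted as ID.
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


if __name__ == "__main__":
    toy_id = list(range(1, 21))   # hypothetical ID scores
    toy_ood = [0, 1, 2, 30]       # hypothetical OOD scores
    print(fpr_at_tpr(toy_id, toy_ood))
```

A lower value is better: it means fewer OOD images slip through at a threshold that keeps nearly all ID images.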

Think3D: Thinking with Space for Spatial Reasoning

Authors: Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, Lijun Wang, Huchuan Lu

First: 2026-01-19T13:13:54+00:00 · Latest: 2026-03-17T11:48:50+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

While contemporary Vision-Language Models (VLMs) excel at 2D visual understanding, they remain constrained by a passive, 2D-centric paradigm that severely limits genuine 3D spatial reasoning. To bridge this gap, we introduce Think3D, a novel framework that equips VLM agents with interactive, 3D chain-of-thought reasoning capabilities. By integrating a suite of 3D manipulation tools, Think3D transforms passive perception into active spatial exploration, closely mirroring human geometric reasoning. We demonstrate that Think3D acts as a highly effective zero-shot plug-in for state-of-the-art closed-source models (e.g., GPT-4.1, Gemini 2.5 Pro), yielding absolute performance gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. Furthermore, to optimize tool-use in smaller open-weight models, we propose Think3D-RL, a reinforcement learning paradigm designed to autonomously learn spatial exploration strategies. When applied to Qwen3-VL-4B, Think3D-RL amplifies the performance gain from a marginal +0.7% to a substantial +10.7%. Notably, this RL formulation induces an exploration policy that qualitatively aligns with the sophisticated behavior of much larger models, entirely circumventing the need for costly operation-trajectory annotations. Ultimately, Think3D establishes tool-augmented active exploration as an effective paradigm for unlocking human-like 3D reasoning in multimodal agents. Code, models, and data are available at https://github.com/zhangzaibin/spagent.

中文标题/摘要

标题：Think3D：以空间思维进行空间推理

尽管当前的视觉-语言模型（VLMs）在二维视觉理解方面表现出色，但它们仍然受限于一种被动的、二维中心的范式，严重限制了真正的三维空间推理能力。为了解决这一问题，我们提出了Think3D，这是一种新型框架，能够为VLM代理提供互动的、三维的链式思考推理能力。通过整合一系列三维操作工具，Think3D将被动感知转变为积极的空间探索，紧密模拟人类几何推理。我们证明，Think3D作为最先进的闭源模型（如GPT-4.1、Gemini 2.5 Pro）的高效零样本插件，能够在BLINK多视图和MindCube上实现绝对性能提升+7.8%，在VSI-Bench上提升+4.7%。此外，为了优化较小的开放权重模型中的工具使用，我们提出了Think3D-RL，这是一种强化学习范式，旨在自主学习空间探索策略。当应用于Qwen3-VL-4B时，Think3D-RL将性能提升从微不足道的+0.7%显著提高到+10.7%。值得注意的是，这种RL形式诱导出的探索策略在定性上与更大模型的复杂行为一致，完全避免了昂贵的操作轨迹注解的需要。最终，Think3D确立了工具增强的主动探索作为一种有效范式，以解锁多模态代理中的类人三维推理。代码、模型和数据可在https://github.com/zhangzaibin/spagent/获取。

Summary / 总结

Think3D is a novel framework that enhances Vision-Language Models (VLMs) with 3D spatial reasoning capabilities, enabling active exploration and manipulation in 3D space. It integrates 3D tools to transform passive perception into active spatial exploration, yielding absolute gains of +7.8% on BLINK Multi-view and MindCube and +4.7% on VSI-Bench when plugged into closed-source models such as GPT-4.1 and Gemini 2.5 Pro. Think3D-RL, a reinforcement learning approach, further improves tool use in smaller models, raising the gain on Qwen3-VL-4B from +0.7% to +10.7%. This work demonstrates the effectiveness of tool-augmented active exploration for 3D reasoning in multimodal agents.

Think3D 是一种新型框架，通过集成 3D 操作工具增强视觉-语言模型（VLMs）的 3D 空间推理能力，使其能够进行主动探索和操作。它将被动感知转化为主动的空间探索，在 BLINK 多视图和 MindCube 上带来 7.8% 的绝对性能提升，在 VSI-Bench 上带来 4.7% 的提升。Think3D-RL 是一种强化学习方法，进一步增强了小型模型的工具使用能力，将 Qwen3-VL-4B 上的性能增益从 0.7% 提高到 10.7%。这项工作展示了工具增强的主动探索在多模态代理中实现类人 3D 推理的潜力。

VALD: Multi-Stage Vision Attack Detection for Efficient LVLM Defense

Authors: Nadav Kadvil, Malak Fares, Ayellet Tal

First: 2026-02-23T07:39:43+00:00 · Latest: 2026-03-17T11:20:48+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Vision-Language Models (LVLMs) can be vulnerable to adversarial images that subtly bias their outputs toward plausible yet incorrect responses. We introduce a general, efficient, and training-free defense that combines image transformations with agentic data consolidation to recover correct model behavior. A key component of our approach is a two-stage detection mechanism that quickly filters out the majority of clean inputs. We first assess image consistency under content-preserving transformations at negligible computational cost. For more challenging cases, we examine discrepancies in a text-embedding space. Only when necessary do we invoke a powerful LLM to resolve attack-induced divergences. A key idea is to consolidate multiple responses, leveraging both their similarities and their differences. We show that our method achieves state-of-the-art accuracy while maintaining notable efficiency: most clean images skip costly processing, and even in the presence of numerous adversarial examples, the overhead remains minimal.

中文标题/摘要

标题：VALD：多阶段视觉攻击检测以实现高效的LVLM防御

大型视觉-语言模型（LVLMs）可能会受到微妙偏倚其输出的对抗图像的攻击，使其倾向于合理但错误的响应。我们提出了一种通用、高效且无需训练的防御方法，该方法结合了图像变换与代理数据整合，以恢复正确的模型行为。我们方法的关键组成部分是一种两阶段检测机制，可以快速过滤掉大部分干净输入。我们首先在几乎不增加计算成本的情况下，评估内容保持变换下的图像一致性。对于更具挑战性的案例，我们检查文本嵌入空间中的差异。只有在必要时，我们才调用强大的LLM来解决攻击引起的偏差。一个关键思想是整合多个响应，利用它们的相似性和差异性。我们展示了我们的方法在保持显著效率的同时达到了最先进的准确性：大多数干净图像可以跳过昂贵的处理，即使存在大量对抗样本，开销也保持在最小。

Summary / 总结

The research aims to develop an efficient defense mechanism against adversarial images that can bias the outputs of large vision-language models. The method involves a two-stage detection process that uses content-preserving image transformations and text-embedding space analysis to filter out clean inputs quickly. Only when necessary, a powerful LLM is used to resolve any attack-induced divergences. The key finding is that the proposed approach achieves state-of-the-art accuracy with minimal overhead, as most clean images avoid costly processing and the system remains efficient even with many adversarial examples.

研究旨在开发一种高效的防御机制，以应对可能使大型视觉语言模型输出偏移的对抗性图像。方法包括两阶段检测过程，使用内容保持的图像变换和文本嵌入空间分析来快速过滤干净输入。只有在必要时，才会使用强大的LLM来解决由攻击引起的偏差。关键发现是，所提出的方法在保持高效的同时达到了最先进的准确性，大多数干净的图像避免了昂贵的处理，即使存在大量对抗性示例，系统也保持高效。
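
The control flow described in the VALD abstract (cheap image-level check first, text-embedding check next, LLM consolidation only as a last resort) can be sketched as an early-exit cascade. This is a hypothetical outline, not the authors' implementation: every callable name below (`answer`, `image_is_consistent`, `text_is_consistent`, `consolidate`) is an assumption introduced for illustration, and the abstract does not specify how each stage is scored.

```python
from typing import Any, Callable, Tuple


def cascade_defense(
    image: Any,
    answer: Callable[[Any], str],
    image_is_consistent: Callable[[Any], bool],
    text_is_consistent: Callable[[Any], bool],
    consolidate: Callable[[Any], str],
) -> Tuple[str, str]:
    """Return (response, route), escalating only when a cheaper stage flags the input.

    All callables are hypothetical stand-ins for the stages named in the abstract.
    """
    # Stage 1: cheap consistency check under content-preserving transformations.
    if image_is_consistent(image):
        return answer(image), "stage1_clean"
    # Stage 2: compare responses in a text-embedding space for harder cases.
    if text_is_consistent(image):
        return answer(image), "stage2_clean"
    # Last resort: the expensive LLM consolidates multiple responses.
    return consolidate(image), "consolidated"
```

The point of the ordering is cost: an input that passes the first check never triggers the later, more expensive stages.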

AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification

Authors: Hamza Mooraj, George Pantazopoulos, Alessandro Suglia

First: 2026-03-08T17:28:01+00:00 · Latest: 2026-03-17T10:54:10+00:00

Comments: 11 pages main text, 22 pages total including references and appendix. 6 figures, 10 tables. Code and dataset will be released upon publication

Abs · PDF · Code1 · Code2

Abstract

Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark containing 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardized training and evaluation. All models are trained and evaluated under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability. The results reveal distinct performance profiles. CNNs achieve the highest accuracy on lab imagery but degrade under domain shift. Contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance. Generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate accuracy alone.

中文标题/摘要

标题：AgriPath：作物病害分类架构权衡的系统性探索

可靠的作物病害检测需要在多种获取条件下表现一致的模型，但现有评估往往集中在单一架构家族或实验室生成的数据集上。本研究系统性地比较了三种模型范式在细粒度作物病害分类中的表现：卷积神经网络（CNNs）、对比视觉语言模型（VLMs）和生成性VLMs。为了进行可控的领域效应分析，我们引入了AgriPath-LF16基准，包含111k张图像，覆盖16种作物和41种病害，其中实验室和田间图像明确分离，同时提供一个平衡的30k子集用于标准化训练和评估。所有模型均在统一协议下进行训练和评估，涵盖全领域、仅实验室和仅田间训练制度，使用宏F1和解析成功率（PSR）来考虑生成可靠性。结果表明，不同模型具有不同的性能特征。CNNs在实验室图像上表现最佳，但在领域转移时表现下降。对比VLMs提供了一种稳健且参数高效的替代方案，具有竞争力的跨领域性能。生成性VLMs在分布变化中表现出最强的鲁棒性，但存在额外的失败模式，源于自由文本生成。这些发现表明，架构选择应根据部署环境而非单一准确率来指导。

Summary / 总结

This work aims to evaluate the performance of different model paradigms for crop disease classification under varying acquisition conditions. It introduces AgriPath-LF16, a benchmark dataset with 111k images from 16 crops and 41 diseases, separated into laboratory and field imagery. The study compares Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. Results show that CNNs excel in lab conditions but struggle with domain shift, contrastive VLMs offer robust performance across domains with fewer parameters, and generative VLMs are highly resilient to distributional shifts but have additional failure modes. The findings suggest that architectural choice should be tailored to the specific deployment context.

该研究旨在评估不同模型架构在多样采集条件下对作物病害分类的表现。它比较了卷积神经网络（CNNs）、对比视觉语言模型（VLMs）和生成型VLMs在包含111k张来自16种作物和41种病害的图像的新基准AgriPath-LF16上的性能，这些图像被分为实验室和田间图像。结果显示，CNNs在实验室条件下表现出色，但在领域转移时表现不佳，而对比VLMs则提供了鲁棒且参数效率高的跨域性能。生成型VLMs对分布变化具有最强的抵抗力，但因文本生成而存在额外的失败模式。这些发现表明，架构选择应考虑部署环境，而不仅仅是整体准确性。
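
AgriPath evaluates with macro-F1 and Parse Success Rate (PSR). The sketch below shows macro-F1 in its standard form (the unweighted mean of per-class F1) and one plausible reading of PSR as the fraction of free-text outputs that can be mapped to a valid label. The abstract does not define PSR precisely, so `parse_label` and its exact-match rule are assumptions, and the label names are made up.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over `labels`."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never present or predicted contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def parse_label(text, labels):
    """Hypothetical parser: accept the output only if it is exactly a known label."""
    cleaned = text.strip().lower()
    return cleaned if cleaned in labels else None


def parse_success_rate(outputs, labels):
    """Fraction of generated outputs that parse into a valid label (assumed definition)."""
    parsed = [parse_label(o, labels) for o in outputs]
    return sum(p is not None for p in parsed) / len(outputs)


if __name__ == "__main__":
    labels = ["healthy", "rust", "blight"]  # illustrative class names
    print(macro_f1(["healthy", "rust", "blight", "rust"],
                   ["healthy", "rust", "rust", "rust"], labels))
    print(parse_success_rate([" Rust ", "blight", "I think it is rust", "healthy"], labels))
```

Reporting PSR next to macro-F1 separates two failure modes of a generative model: predicting the wrong class, and producing text that cannot be read as any class.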

Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

Authors: Yulong Zhang, Tianyi Liang, Xinyue Huang, Erfei Cui, Guoqing Wang, Xu Guo, Chenhui Li, Gongshen Liu

First: 2025-04-15T11:51:18+00:00 · Latest: 2026-03-17T10:40:23+00:00

Abs · PDF · Code1 · Code2

Abstract

Optical Character Recognition (OCR) is fundamental to Vision-Language Models (VLMs) and high-quality data generation for LLM training. Yet, despite progress in average OCR accuracy, state-of-the-art VLMs still struggle with detecting sample-level errors and lack effective unsupervised quality control. We introduce Consensus Entropy (CE), a training-free, model-agnostic metric that estimates output reliability by measuring inter-model agreement entropy. The core insight is that correct predictions converge in output space, while errors diverge. Based on CE, we develop CE-OCR, a lightweight multi-model framework that verifies outputs by ensemble agreement, selects the best outputs, and further improves efficiency through adaptive routing. Experiments demonstrate that CE is robust for quality verification, improving F1 scores by 42.1% over VLM-as-Judge. CE-OCR achieves consistent OCR gains, outperforming self-consistency and single-model baselines at the same cost. Notably, CE requires no training or supervision, enabling plug-and-play integration.

中文标题/摘要

标题：共识熵：利用多VLM一致性的自我验证与自我改进OCR

光学字符识别（OCR）是视觉语言模型（VLMs）以及为大规模语言模型（LLMs）训练生成高质量数据的基础。尽管平均OCR准确率有所提高，但最先进的VLMs仍然难以检测样本级别的错误，缺乏有效的无监督质量控制。我们引入了共识熵（CE），这是一种无需训练、模型无关的度量方法，通过测量模型间一致性的熵来估计输出的可靠性。核心洞察是正确预测在输出空间中收敛，而错误则发散。基于CE，我们开发了CE-OCR，这是一种轻量级的多模型框架，通过集成一致性来验证输出，选择最佳输出，并通过自适应路由进一步提高效率。实验表明，CE在质量验证方面表现稳健，相比VLM-as-Judge提高了42.1%的F1分数。CE-OCR在相同成本下实现了稳定的OCR改进，优于自我一致性与单模型基线。值得注意的是，CE无需训练或监督，可实现即插即用集成。

Summary / 总结

The paper introduces Consensus Entropy (CE), a model-agnostic metric for estimating OCR output reliability by measuring inter-model agreement entropy. CE-OCR, a lightweight multi-model framework, uses CE to verify outputs, select the best outputs, and improve efficiency through adaptive routing. Experiments show that CE enhances F1 scores by 42.1% over VLM-as-Judge and outperforms self-consistency and single-model baselines without requiring training or supervision.

研究旨在通过利用多VLM的一致性来提高OCR的准确性和质量控制。方法引入了共识熵（CE），这是一种模型无关的度量，通过测量模型间的一致性熵来估计输出的可靠性。实验表明，CE相比VLM-as-Judge提高了42.1%的F1分数，并且在无需训练或监督的情况下优于自我一致性及单模型基线。CE-OCR是一个轻量级的多模型框架，用于验证输出、选择最佳输出并通过对路由的自适应优化来提高效率，实现了稳定的OCR改进。
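
The intuition behind Consensus Entropy (correct outputs converge, errors diverge) can be illustrated with a deliberately coarse stand-in: the Shannon entropy of the distribution of distinct outputs across models, with outputs grouped by exact string match. This is not the paper's CE formula, which the abstract does not spell out; the function names are illustrative, and a real OCR setting would likely need a softer similarity than exact match.

```python
import math
from collections import Counter


def agreement_entropy(outputs):
    """Shannon entropy (bits) of the distribution of distinct model outputs.

    Zero when all models agree exactly; log2(n) when n models all disagree.
    """
    counts = Counter(o.strip() for o in outputs)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return max(0.0, entropy)  # avoid returning -0.0 when all outputs agree


def majority_output(outputs):
    """Pick the most frequent output as the ensemble's answer."""
    counts = Counter(o.strip() for o in outputs)
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    readings = ["Invoice 1042", "Invoice 1042", "Invoice 1O42"]  # hypothetical OCR outputs
    print(agreement_entropy(readings), majority_output(readings))
```

A low score suggests the sample can be accepted as is, while a high score flags it for a stronger model or for review, which is the routing idea the abstract describes.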

Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment

Authors: Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong

First: 2025-08-21T13:42:49+00:00 · Latest: 2026-03-17T10:32:25+00:00

Abs · PDF · Code1 · Code2

Abstract

Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.

中文标题/摘要

标题：无需反向传播的测试时自适应通过概率高斯对齐

测试时自适应（TTA）通过在推理过程中利用未标记的测试数据，增强分布偏移下的零样本鲁棒性。尽管取得了显著进展，但仍存在几个挑战限制了其更广泛的适用性。首先，大多数方法依赖于反向传播或迭代优化，这限制了其可扩展性并阻碍了实时部署。其次，它们缺乏对类条件特征分布的显式建模。这种建模对于生成可靠的决策边界和校准预测至关重要，但由于测试时既缺乏源数据又缺乏监督，这一方向仍未得到充分探索。在本文中，我们提出了一种名为ADAPT的高级分布感知且无需反向传播的测试时自适应方法。我们将TTA重新定义为一个高斯概率推理任务，通过使用逐渐更新的类均值和共享协方差矩阵来建模类条件似然。这使得可以进行无需训练的闭式推理。为了纠正潜在的似然偏差，我们引入了由CLIP先验和历史知识库引导的轻量级正则化。ADAPT不需要源数据、不需要梯度更新，并且不需要完全访问目标数据，支持在线和直推式设置。在多种基准上的广泛实验表明，我们的方法在各种分布偏移下实现了最先进的性能，具有更好的可扩展性和鲁棒性。

Summary / 总结

The paper addresses the challenges of test-time adaptation (TTA) by proposing ADAPT, which reframes TTA as a Gaussian probabilistic inference task. ADAPT avoids backpropagation and iterative optimization, enabling scalable and real-time deployment. It models class-conditional likelihoods using updated class means and a shared covariance matrix, allowing for closed-form inference. ADAPT also introduces regularization to correct likelihood bias, using CLIP priors and a historical knowledge bank. Experiments show that ADAPT outperforms existing methods across various benchmarks with better scalability and robustness under distribution shifts.

论文提出ADAPT方法，将TTA重新定义为高斯概率推断任务，通过逐步更新类均值和共享协方差矩阵进行类条件似然建模，实现无需反向传播的闭式推断。引入轻量级正则化来纠正似然偏差，并支持在线和直推式设置。实验表明，ADAPT在各种分布偏移下实现了最先进的性能，具有更好的可扩展性和鲁棒性。
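
To see why class means plus a shared covariance allow closed-form inference, recall the standard result for Gaussian class-conditionals with a common covariance: the quadratic term in x cancels across classes, so the posterior is a softmax over linear scores. The symbols below (class prior \pi_k, mean \mu_k, shared covariance \Sigma) are notation assumed for this sketch; ADAPT's actual update rules and CLIP-guided regularization are not reproduced here.

```latex
p(x \mid y = k) = \mathcal{N}(x;\, \mu_k, \Sigma), \qquad
p(y = k \mid x) = \frac{\pi_k \exp\!\big(w_k^\top x + b_k\big)}{\sum_{j} \pi_j \exp\!\big(w_j^\top x + b_j\big)},
\qquad w_k = \Sigma^{-1}\mu_k, \quad b_k = -\tfrac{1}{2}\,\mu_k^\top \Sigma^{-1} \mu_k .
```

Because w_k and b_k depend only on running estimates of \mu_k and \Sigma, a prediction needs a matrix-vector product rather than any gradient step.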
