arXiv Paper Digest

CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images

Authors: Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, Hongsheng Li, Yi Ma, Xihui Liu

First: 2025-10-13T17:59:55+00:00 · Latest: 2025-10-13T17:59:55+00:00

Abstract

Recent advances in Large Language Models (LLMs) and Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck with problems requiring visual assistance, such as drawing auxiliary lines or plotting functions to solve the problems. Most LLMs and VLMs are constrained to text-only reasoning chains, while multimodal unified models that can generate interleaved text and images lack the necessary precision and controllability for such tasks. To address this, we propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. Our approach leverages the VLM to generate text reasoning as well as executable plotting code, which is then rendered into images as "visual thought", to solve mathematical problems. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning, comprising 178K samples. Second, to create high-quality training data, we develop a state-of-the-art image-to-code converter specialized for parsing complex mathematical figures into code. Finally, using these training data, we train the CodePlot-CoT model for solving mathematical problems. Experimental results show that our model achieves up to a 21% increase over the base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm. Our work opens a new direction for multimodal mathematical reasoning and provides the community with the first large-scale dataset, comprehensive benchmark, and strong approach for such problems. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/HKU-MMLab/Math-VR-CodePlot-CoT.
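
The "text reasoning plus executable plotting code, rendered into images" loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which plotting language or renderer the model uses, so the fenced-snippet format, the `line` drawing primitive, and the SVG output are all hypothetical stand-ins.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself live in a fenced block
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def render_visual_thoughts(response: str) -> list[str]:
    """Run each plotting snippet in a model response and return one SVG per snippet."""
    images = []
    for code in CODE_BLOCK.findall(response):
        segments = []
        # Hypothetical drawing API: the snippet may only call `line`.
        # A real system would sandbox the execution of generated code.
        namespace = {"line": lambda x1, y1, x2, y2: segments.append((x1, y1, x2, y2))}
        exec(code, namespace)
        body = "".join(
            f'<line x1="{a}" y1="{b}" x2="{c}" y2="{d}" stroke="black"/>'
            for a, b, c, d in segments
        )
        images.append(
            f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">{body}</svg>'
        )
    return images


# A made-up response: triangle ABC plus an auxiliary line (the altitude from C).
snippet = "line(1, 9, 9, 9)\nline(9, 9, 4, 2)\nline(4, 2, 1, 9)\nline(4, 2, 4, 9)\n"
response = (
    "Draw triangle ABC and the altitude from C.\n"
    + FENCE + "python\n" + snippet + FENCE
    + "\nThe altitude has length 7."
)
images = render_visual_thoughts(response)
print(len(images), images[0].count("<line"))
```

The point of the sketch is the control flow: because the image is produced by executing code rather than by direct pixel generation, every auxiliary line lands at exact coordinates.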

Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation

Authors: Maggie Wang, Stephen Tian, Aiden Swann, Ola Shorinwa, Jiajun Wu, Mac Schwager

First: 2025-10-13T17:51:23+00:00 · Latest: 2025-10-13T17:51:23+00:00

Abstract

Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass (CoM) and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% in the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success. Project website: https://phys2real.github.io/ .
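
As a rough illustration of uncertainty-aware fusion, the sketch below combines a VLM prior with an online ensemble estimate by precision weighting (a product of Gaussians). The abstract does not state the fusion rule Phys2Real actually uses, so the Gaussian assumption, the `fuse` and `ensemble_estimate` helpers, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from statistics import fmean, pvariance


@dataclass
class Gaussian:
    mean: float
    var: float


def ensemble_estimate(predictions: list[float]) -> Gaussian:
    """Summarise an ensemble's predictions as a mean and a variance."""
    # The spread across ensemble members serves as the uncertainty estimate.
    return Gaussian(fmean(predictions), max(pvariance(predictions), 1e-9))


def fuse(prior: Gaussian, online: Gaussian) -> Gaussian:
    """Precision-weighted (product-of-Gaussians) fusion of two estimates."""
    w_prior, w_online = 1.0 / prior.var, 1.0 / online.var
    var = 1.0 / (w_prior + w_online)
    mean = var * (w_prior * prior.mean + w_online * online.mean)
    return Gaussian(mean, var)


# Hypothetical numbers: centre-of-mass offset along one axis, in metres.
vlm_prior = Gaussian(mean=0.00, var=0.02 ** 2)             # broad VLM guess
online = ensemble_estimate([0.031, 0.029, 0.030, 0.032])   # tight after a few pushes
fused = fuse(vlm_prior, online)
print(round(fused.mean, 4), fused.var < online.var)
```

Under this rule the fused estimate leans toward whichever source is more certain, so the VLM prior dominates before any interaction and the online estimate takes over once the ensemble agrees.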

Summary

The research aims to improve sim-to-real transfer in robotic manipulation, particularly for tasks that require precise dynamics. Phys2Real combines high-fidelity geometric reconstruction, VLM-inferred prior distributions over physical parameters, and online parameter estimation from interaction data, fusing the VLM and online estimates in an uncertainty-aware way. On planar pushing tasks it outperforms a domain randomization baseline: 100% vs 79% success on the bottom-weighted T-block, 57% vs 23% on the top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that combining VLM and interaction information is essential for success.

EvoCAD: Evolutionary CAD Code Generation with Vision Language Models

Authors: Tobias Preintner, Weixuan Yuan, Adrian König, Thomas Bäck, Elena Raponi, Niki van Stein

First: 2025-10-13T17:12:02+00:00 · Latest: 2025-10-13T17:12:02+00:00

Comments: Accepted to IEEE ICTAI 2025

Abstract

Combining large language models with evolutionary computation algorithms represents a promising research direction leveraging the remarkable generative and in-context learning capabilities of LLMs with the strengths of evolutionary algorithms. In this work, we present EvoCAD, a method for generating computer-aided design (CAD) objects through their symbolic representations using vision language models and evolutionary optimization. Our method samples multiple CAD objects, which are then optimized using an evolutionary approach with vision language and reasoning language models. We assess our method using GPT-4V and GPT-4o, evaluating it on the CADPrompt benchmark dataset and comparing it to prior methods. Additionally, we introduce two new metrics based on topological properties defined by the Euler characteristic, which capture a form of semantic similarity between 3D objects. Our results demonstrate that EvoCAD outperforms previous approaches on multiple metrics, particularly in generating topologically correct objects, which can be efficiently evaluated using our two novel metrics that complement existing spatial metrics.
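
For readers unfamiliar with the Euler characteristic mentioned above, here is a minimal sketch of how it can be computed for a polygon mesh as V - E + F. The abstract does not define the two EvoCAD metrics themselves, so this shows only the underlying quantity; the face-list mesh format is an assumption.

```python
def euler_characteristic(faces: list[tuple[int, ...]]) -> int:
    """Return V - E + F for a polygon mesh given as tuples of vertex indices."""
    vertices = {v for face in faces for v in face}
    edges = set()
    for face in faces:
        # Consecutive vertices (wrapping around) form the face's edges.
        for a, b in zip(face, face[1:] + face[:1]):
            edges.add(frozenset((a, b)))
    return len(vertices) - len(edges) + len(faces)


# A tetrahedron: a closed surface with no holes.
tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
print(euler_characteristic(tetrahedron))  # 4 - 6 + 4 = 2
```

A closed surface without holes gives 2, while a torus-like surface gives 0. Two objects can therefore be compared on whether they have the same number of holes even when their exact geometry differs.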

Summary

EvoCAD uses vision language models and evolutionary optimization to generate CAD objects through their symbolic representations. It samples multiple CAD objects and optimizes them with an evolutionary approach that draws on vision language and reasoning language models. Evaluated on the CADPrompt benchmark dataset, the method outperforms previous approaches on multiple metrics, particularly in generating topologically correct objects, as measured by two new metrics based on the Euler characteristic.

InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition

Authors: Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li, Xue Yang

Venue: NeurIPS 2025

First: 2025-05-21T17:59:56+00:00 · Latest: 2025-10-13T16:36:15+00:00

Comments: Accepted to NeurIPS 2025

Abstract

Language-Guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, necessitating models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems.
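
The idea of casting mask-label assignment as a binary integer program with counting constraints can be illustrated on a toy instance. The abstract does not give the exact objective or constraints, so the formulation below (maximize total similarity, at most one category per mask, at most the predicted count of masks per category) is an assumption, and the brute-force search stands in for a real integer programming solver.

```python
from itertools import product


def assign_labels(similarity, counts):
    """Brute-force a tiny binary integer program for mask-label assignment.

    similarity[i][j]: score between mask i and category j.
    counts[j]: number of objects predicted for category j.
    Each mask gets one category or None; category j receives at most
    counts[j] masks (an assumed form of the counting constraint).
    """
    n_masks, n_cats = len(similarity), len(counts)
    options = list(range(n_cats)) + [None]
    best_score, best = float("-inf"), None
    for choice in product(options, repeat=n_masks):
        used = [sum(1 for c in choice if c == j) for j in range(n_cats)]
        if any(u > cap for u, cap in zip(used, counts)):
            continue  # violates a counting constraint
        score = sum(similarity[i][c] for i, c in enumerate(choice) if c is not None)
        if score > best_score:
            best_score, best = score, choice
    return list(best), best_score


# Hypothetical scores for 4 mask proposals and 2 categories (say, car and ship).
similarity = [
    [0.9, 0.1],
    [0.8, 0.3],
    [0.2, 0.7],
    [0.6, 0.5],
]
labels, score = assign_labels(similarity, counts=[2, 1])
print(labels)
```

In this toy case the last mask is left unlabeled because the counts are exhausted, not because its score fell below a threshold, which mirrors the abstract's point about avoiding confidence thresholds.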

Summary

The research aims to improve language-guided object recognition in remote sensing imagery, where existing methods struggle with complex or implicit queries. It introduces the InstructCDS task suite, the EarthInstruct benchmark, and InstructSAM, a training-free framework that uses large vision-language models to interpret user instructions and estimate object counts, SAM2 to propose masks, and binary integer programming to assign categories to the predicted masks. Experiments show that InstructSAM matches or surpasses specialized baselines across multiple tasks while reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches.

Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization

Authors: Shuo Xing, Soumik Dey, Mingyang Wu, Ashirbad Mishra, Naveen Ravipati, Binbin Li, Hansi Wu, Zhengzhong Tu

First: 2025-10-09T20:11:38+00:00 · Latest: 2025-10-13T16:16:11+00:00

Abstract

Video quality assessment (VQA) is a fundamental computer vision task that aims to predict the perceptual quality of a given video in alignment with human judgments. Existing performant VQA models trained with direct score supervision suffer from (1) poor generalization across diverse content and tasks, ranging from user-generated content (UGC), short-form videos, to AI-generated content (AIGC), (2) limited interpretability, and (3) lack of extensibility to novel use cases or content types. We propose Q-Router, an agentic framework for universal VQA with a multi-tier model routing system. Q-Router integrates a diverse set of expert models and employs vision-language models (VLMs) as real-time routers that dynamically reason and then ensemble the most appropriate experts conditioned on the input video semantics. We build a multi-tiered routing system based on the computing budget, with the heaviest tier involving a specific spatiotemporal artifact localization step for interpretability. This agentic design enables Q-Router to combine the complementary strengths of specialized experts, achieving both flexibility and robustness in delivering consistent performance across heterogeneous video sources and tasks. Extensive experiments demonstrate that Q-Router matches or surpasses state-of-the-art VQA models on a variety of benchmarks, while substantially improving generalization and interpretability. Moreover, Q-Router excels on the quality-based question answering benchmark, Q-Bench-Video, highlighting its promise as a foundation for next-generation VQA systems. Finally, we show that Q-Router capably localizes spatiotemporal artifacts, showing potential as a reward function for post-training video generation models.

SNAP: Towards Segmenting Anything in Any Point Cloud

Authors: Aniket Gupta, Hanhui Wang, Charles Saunders, Aruni RoyChowdhury, Hanumant Singh, Huaizu Jiang

First: 2025-10-13T16:07:00+00:00 · Latest: 2025-10-13T16:07:00+00:00

Comments: Project Page, https://neu-vi.github.io/SNAP/

Abstract

Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted in scope to a single domain (indoor or outdoor), and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability. To address these limitations, we present SNAP (Segment aNything in Any Point cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments, while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation. Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and demonstrate competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation. Project page: https://neu-vi.github.io/SNAP/
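
Matching mask proposals against text-query embeddings can be pictured as a nearest-neighbour lookup under cosine similarity. The sketch below assumes each mask proposal already has an embedding in the same space as the text queries; how SNAP obtains those embeddings is not described in the abstract, and the 3-dimensional vectors are made-up placeholders rather than real CLIP features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_masks(mask_embs, text_embs):
    """Label each mask proposal with the best-matching text query."""
    labels = {}
    for mask_id, emb in mask_embs.items():
        labels[mask_id] = max(text_embs, key=lambda q: cosine(emb, text_embs[q]))
    return labels


# Toy embeddings standing in for mask features and text-query features.
mask_embs = {"mask_0": [0.9, 0.1, 0.0], "mask_1": [0.1, 0.8, 0.2]}
text_embs = {"chair": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}
print(match_masks(mask_embs, text_embs))
```

Because the set of text queries is just a dictionary of embeddings, new category names can be added at query time without retraining, which is what makes the segmentation open-vocabulary.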

Summary

SNAP is a unified model for interactive 3D point cloud segmentation that supports both point-based and text-based prompts across diverse domains. By training on seven datasets spanning indoor, outdoor, and aerial environments and using domain-adaptive normalization to prevent negative transfer, SNAP achieves cross-domain generalizability. It reaches state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and competitive results on all 5 text-prompted benchmarks, showing that a unified model can match or exceed specialized domain-specific approaches for scalable 3D annotation.

ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?

Authors: Liu Yang, Huiyu Duan, Ran Tao, Juntao Cheng, Sijing Wu, Yunhao Li, Jing Liu, Xiongkuo Min, Guangtao Zhai

First: 2025-10-13T15:51:47+00:00 · Latest: 2025-10-13T15:51:47+00:00

Abstract

Omnidirectional images (ODIs) provide a full 360x180-degree view and are widely adopted in VR, AR and embodied intelligence applications. While multi-modal large language models (MLLMs) have demonstrated remarkable performance on conventional 2D image and video understanding benchmarks, their ability to comprehend the immersive environments captured by ODIs remains largely unexplored. To address this gap, we first present ODI-Bench, a novel comprehensive benchmark specifically designed for omnidirectional image understanding. ODI-Bench contains 2,000 high-quality omnidirectional images and over 4,000 manually annotated question-answering (QA) pairs across 10 fine-grained tasks, covering both general-level and spatial-level ODI understanding. Extensive experiments are conducted to benchmark 20 representative MLLMs, including proprietary and open-source models, under both closed-ended and open-ended settings. Experimental results reveal that current MLLMs still struggle to capture the immersive context provided by ODIs. To this end, we further introduce Omni-CoT, a training-free method which significantly enhances MLLMs' comprehension ability in the omnidirectional environment through chain-of-thought reasoning across both textual information and visual cues. Both the benchmark and the code will be released upon publication.

Summary

The research evaluates how well multi-modal large language models (MLLMs) understand the immersive environments captured by omnidirectional images (ODIs), which are widely used in VR, AR, and embodied intelligence applications. ODI-Bench contains 2,000 high-quality ODIs and over 4,000 manually annotated QA pairs across 10 fine-grained tasks covering general-level and spatial-level understanding. Experiments on 20 representative MLLMs show that current models still struggle to capture the immersive context of ODIs. The study further introduces Omni-CoT, a training-free method that enhances MLLMs' comprehension through chain-of-thought reasoning across textual information and visual cues.

Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers

Authors: Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, Weiyao Lin

First: 2025-10-13T15:39:13+00:00 · Latest: 2025-10-13T15:39:13+00:00

Abstract

Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for visual generation. Recent observations reveal Massive Activations (MAs) in their internal feature maps, yet their function remains poorly understood. In this work, we systematically investigate these activations to elucidate their role in visual generation. We found that these massive activations occur across all spatial tokens, and their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that these massive activations play a key role in local detail synthesis, while having minimal impact on the overall semantic content of output. Building on these insights, we propose Detail Guidance (DG), an MA-driven, training-free self-guidance strategy to explicitly enhance local detail fidelity for DiTs. Specifically, DG constructs a degraded "detail-deficient" model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. Our DG can seamlessly integrate with Classifier-Free Guidance (CFG), enabling further refinements of fine-grained details. Extensive experiments demonstrate that our DG consistently improves fine-grained detail quality across various pre-trained DiTs (e.g., SD3, SD3.5, and Flux).
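
One common way to "guide the original network" with a weaker copy of itself is to extrapolate away from the weaker prediction, in the same spirit as Classifier-Free Guidance. The sketch below shows that extrapolation on plain lists; the abstract does not give DG's actual update rule, so the formula, the `detail_guidance` name, and the `scale` parameter are assumptions made for illustration.

```python
def detail_guidance(pred, pred_degraded, scale):
    """Push the original prediction away from the detail-deficient one.

    pred:          output of the original network at one denoising step.
    pred_degraded: output of the same network with massive activations disrupted.
    scale:         guidance strength; 0 returns the original prediction unchanged.
    """
    return [p + scale * (p - d) for p, d in zip(pred, pred_degraded)]


# Toy values: the two predictions differ only in the first component.
guided = detail_guidance([1.0, 2.0], [0.5, 2.0], scale=2.0)
print(guided)
```

Only the components where the degraded model disagrees with the original are changed. If disrupting MAs mainly removes local detail, as the abstract reports, then this kind of update would amplify detail while leaving shared semantic content untouched.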

Summary

This work investigates the role of massive activations (MAs) in Diffusion Transformers (DiTs) for visual generation. It finds that MAs play a key role in local detail synthesis while having minimal impact on the overall semantic content. Based on this, the authors propose Detail Guidance (DG), a training-free self-guidance strategy that builds a degraded, detail-deficient model by disrupting MAs and uses it to guide the original network toward higher-quality detail synthesis. Experiments show that DG improves fine-grained detail quality across pre-trained DiTs such as SD3, SD3.5, and Flux.

mmWalk: Towards Multi-modal Multi-view Walking Assistance

Authors: Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

Venue: NeurIPS 2025

First: 2025-10-13T15:25:52+00:00 · Latest: 2025-10-13T15:25:52+00:00

Comments: Accepted by NeurIPS 2025 Datasets and Benchmarks Track. Data and Code: https://github.com/KediYing/mmWalk

Abstract

Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) using zero- and few-shot settings and found they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.

Summary

The research addresses walking assistance in extreme or complex environments for people with blindness or low vision by building mmWalk, a simulated multi-modal dataset that integrates multi-view sensor data and accessibility-oriented features. The dataset includes 120 walking trajectories with 62k synchronized frames and over 559k panoramic images across RGB, depth, and semantic modalities. State-of-the-art Vision-Language Models struggle with its risk assessment and navigational tasks, and validating the mmWalk-finetuned model on real-world datasets shows the dataset's effectiveness for advancing multi-modal walking assistance.

LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference

Authors: Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, Daniele De Martini

First: 2025-10-13T15:19:07+00:00 · Latest: 2025-10-13T15:19:07+00:00

Comments: 22 pages, 9 figures

Abstract

Intuitive physics understanding in video diffusion models plays an essential role in building general-purpose physically plausible world simulators, yet accurately evaluating such capacity remains a challenging task due to the difficulty in disentangling physics correctness from visual appearance in generation. To this end, we introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models by distinguishing physically valid and impossible videos using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. By testing on our constructed benchmark of twelve scenarios spanning four physics domains, we show that our evaluation metric, Plausibility Preference Error (PPE), demonstrates strong alignment with human preference, outperforming state-of-the-art evaluator baselines. We then systematically benchmark intuitive physics understanding in current video diffusion models. Our study further analyses how model design and inference settings affect intuitive physics understanding and highlights domain-specific capacity variations across physical laws. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.
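
A preference-error metric over valid-invalid pairs can be sketched as the fraction of pairs on which the model assigns the physically impossible video a loss that is no higher than the valid one. This is one plausible reading of Plausibility Preference Error; the abstract does not give the exact definition (for example, how losses are averaged over noise levels or how ties are treated), so the tie handling and the input format below are assumptions.

```python
def plausibility_preference_error(pairs):
    """Fraction of valid/invalid pairs where the model prefers the invalid video.

    pairs: iterable of (loss_valid, loss_invalid) denoising losses.
    A lower loss is read as a higher likelihood; ties count as errors here.
    """
    pairs = list(pairs)
    errors = sum(1 for loss_valid, loss_invalid in pairs if loss_valid >= loss_invalid)
    return errors / len(pairs)


# Hypothetical denoising losses for four valid/invalid video pairs.
pairs = [(0.10, 0.15), (0.20, 0.18), (0.05, 0.09), (0.30, 0.30)]
print(plausibility_preference_error(pairs))
```

Because the score only compares losses within a pair, it needs no training and no generated samples, which is consistent with the abstract's description of LikePhys as a training-free evaluation.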

Summary

LikePhys evaluates intuitive physics understanding in video diffusion models by distinguishing physically valid and impossible videos, using the denoising objective as an ELBO-based likelihood surrogate. Its metric, Plausibility Preference Error (PPE), aligns strongly with human preference and outperforms state-of-the-art evaluator baselines. The study benchmarks current video diffusion models and finds that, although they still struggle with complex and chaotic dynamics, physics understanding improves as model capacity and inference settings scale, with domain-specific capacity variations across physical laws.

Coupled Degradation Modeling and Fusion: A VLM-Guided Degradation-Coupled Network for Degradation-Aware Infrared and Visible Image Fusion

Authors: Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui

First: 2025-10-13T14:26:33+00:00 · Latest: 2025-10-13T14:26:33+00:00

Abstract

Existing Infrared and Visible Image Fusion (IVIF) methods typically assume high-quality inputs. However, when handling degraded images, these methods heavily rely on manually switching between different pre-processing techniques. This decoupling of degradation handling and image fusion leads to significant performance degradation. In this paper, we propose a novel VLM-Guided Degradation-Coupled Fusion network (VGDCFusion), which tightly couples degradation modeling with the fusion process and leverages vision-language models (VLMs) for degradation-aware perception and guided suppression. Specifically, the proposed Specific-Prompt Degradation-Coupled Extractor (SPDCE) enables modality-specific degradation awareness and establishes a joint modeling of degradation suppression and intra-modal feature extraction. In parallel, the Joint-Prompt Degradation-Coupled Fusion (JPDCF) facilitates cross-modal degradation perception and couples residual degradation filtering with complementary cross-modal feature fusion. Extensive experiments demonstrate that our VGDCFusion significantly outperforms existing state-of-the-art fusion approaches under various degraded image scenarios. Our code is available at https://github.com/Lmmh058/VGDCFusion.

Summary

The paper addresses the performance degradation of Infrared and Visible Image Fusion (IVIF) methods on degraded inputs. It proposes VGDCFusion, a VLM-Guided Degradation-Coupled Fusion network that tightly couples degradation modeling with the fusion process and uses vision-language models for degradation-aware perception and guided suppression. The network comprises a Specific-Prompt Degradation-Coupled Extractor (SPDCE) for modality-specific degradation awareness and a Joint-Prompt Degradation-Coupled Fusion (JPDCF) module for cross-modal degradation perception. Experiments show that VGDCFusion outperforms existing methods across various degraded image scenarios.

The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models

Authors: Lijun Sheng, Jian Liang, Ran He, Zilei Wang, Tieniu Tan

Venue: NeurIPS 2025

First: 2025-06-30T16:05:55+00:00 · Latest: 2025-10-13T13:09:11+00:00

Comments: NeurIPS 2025 Datasets and Benchmarks Track. GitHub link: https://github.com/TomSheng21/tta-vlm

Abs · PDF · Code1 · Code2 · Code3

Abstract

Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA research generally suffers from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. These problems hinder fair comparisons between TTA methods and make it difficult to assess their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies focused solely on CLIP, we extend the evaluation to SigLIP--a model trained with a Sigmoid loss--and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates various evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains compared to the previous pioneering work; 2) current TTA methods exhibit poor collaboration with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to provide fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies.

中文标题/摘要

标题：进步的错觉？视觉-语言模型测试时适应的批判性审视

测试时适应（TTA）方法因能在不需额外标注数据的情况下提升视觉-语言模型（VLMs）如CLIP的推理性能而受到广泛关注。然而，当前的TTA研究普遍面临诸如基准结果重复、评价指标有限、实验设置不一致和分析不足等重大局限。这些问题阻碍了TTA方法之间的公平比较，使得评估其实际优劣变得困难。为应对这些挑战，我们引入了TTA-VLM，这是一个全面的基准，用于评估VLM上的TTA方法。我们的基准在统一且可复现的框架内实现了8种 episodic TTA 和7种 online TTA 方法，并在15个广泛使用的数据集上进行评估。不同于仅专注于CLIP的先前研究，我们还将评估扩展到使用Sigmoid损失训练的SigLIP，并包括CoOp、MaPLe和TeCoA等训练时调优方法以评估其普适性。除了分类准确性，TTA-VLM还采用了多种评价指标，包括鲁棒性、校准、分布外检测和稳定性，从而实现对TTA方法的更全面评估。通过大量实验，我们发现1）现有TTA方法与之前的开创性工作相比，增益有限；2）当前的TTA方法与训练时调优方法的协作性较差；3）准确性提升往往以降低模型可信度为代价。我们发布了TTA-VLM，以提供公平比较和全面评估VLM上的TTA方法，我们希望这能鼓励社区开发更可靠和普适的TTA策略。

Summary / 总结

The study addresses the limitations in current test-time adaptation (TTA) methods for vision-language models (VLMs) by introducing TTA-VLM, a comprehensive benchmark. It evaluates 8 episodic and 7 online TTA methods across 15 datasets, covering various metrics like robustness and calibration. The research finds that existing TTA methods offer limited improvements, struggle to collaborate with training-time fine-tuning methods, and often reduce model trustworthiness despite gains in accuracy. The benchmark aims to promote fair comparisons and the development of more reliable TTA strategies for VLMs.

研究通过引入TTA-VLM基准，旨在解决当前针对视觉-语言模型（VLMs）的测试时适应（TTA）方法的局限性。该基准评估了15个数据集上的8种 episodic TTA 和7种 online TTA 方法，涵盖了诸如鲁棒性、校准等多方面指标。研究发现，现有的 TTA 方法提供的改进有限，难以与训练时微调方法协作，并且在提高准确率的同时往往降低了模型的可信度。该基准旨在促进公平比较和更可靠、更通用的 TTA 策略的发展。

OVS Meets Continual Learning: Towards Sustainable Open-Vocabulary Segmentation

Authors: Dongjun Hwang, Yejin Kim, Minyoung Lee, Seong Joon Oh, Junsuk Choe

First: 2024-10-15T12:11:41+00:00 · Latest: 2025-10-13T11:59:16+00:00

Abs · PDF · Code1 · Code2

Abstract

Open-Vocabulary Segmentation (OVS) aims to segment classes that are not present in the training dataset. However, most existing studies assume that the training data is fixed in advance, overlooking more practical scenarios where new datasets are continuously collected over time. To address this, we first analyze how existing OVS models perform under such conditions. In this context, we explore several approaches such as retraining, fine-tuning, and continual learning but find that each of them has clear limitations. To address these issues, we propose ConOVS, a novel continual learning method based on a Mixture-of-Experts framework. ConOVS dynamically combines expert decoders based on the probability that an input sample belongs to the distribution of each incremental dataset. Through extensive experiments, we show that ConOVS consistently outperforms existing methods across pre-training, incremental, and zero-shot test datasets, effectively expanding the recognition capabilities of OVS models when data is collected sequentially.
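The combination step described in the abstract can be illustrated with a minimal sketch. It assumes each expert decoder has already produced per-class scores and that a membership probability per incremental dataset is given; how ConOVS estimates those probabilities is not stated in the abstract, so `membership_probs` and the function name `combine_experts` are hypothetical.

```python
def combine_experts(expert_outputs, membership_probs):
    """Weight each expert decoder's per-class scores by the (normalized)
    probability that the input belongs to that expert's dataset distribution.

    expert_outputs: list of per-class score lists, one list per expert.
    membership_probs: one non-negative number per expert (assumed given).
    """
    total = sum(membership_probs)
    weights = [p / total for p in membership_probs]
    n_classes = len(expert_outputs[0])
    return [
        sum(w * out[c] for w, out in zip(weights, expert_outputs))
        for c in range(n_classes)
    ]


# Hypothetical example: two experts, two classes.
combined = combine_experts([[1.0, 0.0], [0.0, 1.0]], [0.75, 0.25])
```

A soft weighted sum is only one way to "dynamically combine" experts; a hard top-1 selection would be another reading of the same sentence.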

中文标题/摘要

标题：OVS与持续学习的交汇：迈向可持续的开放词汇分割

开放词汇分割（OVS）旨在分割训练数据集中不存在的类别。然而，大多数现有研究假设训练数据在事先固定，忽略了新数据集随着时间不断收集的更实际场景。为解决这一问题，我们首先分析现有OVS模型在这种条件下的表现。在此背景下，我们探索了重新训练、微调和持续学习等多种方法，但发现每种方法都有明显的局限性。为解决这些问题，我们提出了一种基于专家混合框架的新型持续学习方法ConOVS。ConOVS根据输入样本属于每个增量数据集分布的概率动态组合专家解码器。通过广泛的实验，我们展示了ConOVS在预训练、增量和零样本测试数据集上始终优于现有方法，有效扩展了OVS模型在数据按顺序收集时的识别能力。

Summary / 总结

The paper addresses Open-Vocabulary Segmentation (OVS) in dynamic scenarios where new datasets are continuously collected. It finds that retraining, fine-tuning, and existing continual learning approaches each have clear limitations in this setting. To overcome them, the authors propose ConOVS, a continual learning method built on a Mixture-of-Experts framework. ConOVS dynamically combines expert decoders according to the probability that an input sample belongs to the distribution of each incremental dataset, and it outperforms existing methods on pre-training, incremental, and zero-shot test datasets.

论文针对新数据集不断收集的动态场景中的开放词汇分割（OVS）问题。研究发现，重新训练、微调和现有的持续学习方法在该场景下各有明显的局限性。为解决这些问题，作者提出了一种基于专家混合框架的持续学习方法ConOVS。ConOVS根据输入样本属于每个增量数据集分布的概率动态组合专家解码器，在预训练、增量和零样本测试数据集上均优于现有方法。

When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

Authors: Samer Al-Hamadani

First: 2025-10-13T11:48:48+00:00 · Latest: 2025-10-13T11:48:48+00:00

Comments: 23 pages, 4 figures, 4 tables

Abs · PDF · Code1 · Code2

Abstract

Object detection systems have traditionally relied on supervised learning with manually annotated bounding boxes, achieving high accuracy at the cost of substantial annotation investment. The emergence of Vision-Language Models (VLMs) offers an alternative paradigm enabling zero-shot detection through natural language queries, eliminating annotation requirements but operating with reduced accuracy. This paper presents the first comprehensive cost-effectiveness analysis comparing supervised detection (YOLO) with zero-shot VLM inference (Gemini Flash 2.5). Through systematic evaluation on 1,000 stratified COCO images and 200 diverse product images spanning consumer electronics and rare categories, combined with detailed Total Cost of Ownership modeling, we establish quantitative break-even thresholds governing architecture selection. Our findings reveal that supervised YOLO achieves 91.2% accuracy versus 68.5% for zero-shot Gemini on standard categories, representing a 22.7 percentage point advantage that costs $10,800 in annotation for 100-category systems. However, this advantage justifies investment only beyond 55 million inferences, equivalent to 151,000 images daily for one year. Zero-shot Gemini demonstrates 52.3% accuracy on diverse product categories (ranging from highly web-prevalent consumer electronics at 75-85% to rare specialized equipment at 25-40%) where supervised YOLO achieves 0% due to architectural constraints preventing detection of untrained classes. Cost per Correct Detection analysis reveals substantially lower per-detection costs for Gemini ($0.00050 vs $0.143) at 100,000 inferences despite accuracy deficits. We develop decision frameworks demonstrating that optimal architecture selection depends critically on deployment volume, category stability, budget constraints, and accuracy requirements rather than purely technical performance metrics.
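The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that "Cost per Correct Detection" means total cost divided by the number of correct detections (inferences × accuracy); the abstract does not spell out the definition, so the implied totals are back-of-the-envelope values, not numbers reported by the paper.

```python
# Figures quoted in the abstract.
BREAK_EVEN_INFERENCES = 55_000_000
YOLO_ACCURACY = 0.912
GEMINI_ACCURACY = 0.685
INFERENCES = 100_000
YOLO_COST_PER_CORRECT = 0.143      # USD, at 100,000 inferences
GEMINI_COST_PER_CORRECT = 0.00050  # USD, at 100,000 inferences


def daily_volume(total_inferences, days=365):
    """Average images per day needed to reach a total within `days`."""
    return total_inferences / days


def implied_total_cost(cost_per_correct, inferences, accuracy):
    """Assumed definition: cost_per_correct = total_cost / (inferences * accuracy)."""
    return cost_per_correct * inferences * accuracy


daily = daily_volume(BREAK_EVEN_INFERENCES)
accuracy_gap_points = (YOLO_ACCURACY - GEMINI_ACCURACY) * 100
yolo_total = implied_total_cost(YOLO_COST_PER_CORRECT, INFERENCES, YOLO_ACCURACY)
gemini_total = implied_total_cost(GEMINI_COST_PER_CORRECT, INFERENCES, GEMINI_ACCURACY)
```

Under these assumptions the break-even volume works out to roughly 150,685 images per day, which matches the abstract's rounded 151,000, and the accuracy gap is 22.7 percentage points.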

中文标题/摘要

标题：监督训练何时见效？视觉语言模型时代目标检测的隐含经济学

传统的目标检测系统依赖于带有手动标注边框的监督学习，虽然准确率高，但需要大量的标注投资。视觉语言模型（VLMs）的出现提供了一种替代范式，通过自然语言查询实现零样本检测，消除了标注需求，但准确性较低。本文首次进行了全面的成本效益分析，比较了监督检测（YOLO）与零样本VLM推理（Gemini Flash 2.5）。通过在1000张分层COCO图像和200张涵盖消费电子和稀有类别的多样化产品图像上的系统评估，结合详细的总拥有成本建模，我们建立了架构选择的定量临界值。研究发现，监督YOLO在标准类别中的准确率为91.2%，而零样本Gemini为68.5%，差距为22.7个百分点，这需要100类别系统10800美元的标注成本。然而，这种优势仅在超过5500万次推理时才值得投资，相当于每天处理151,000张图像一年。零样本Gemini在多样化产品类别中（从高度网络普及的消费电子设备的75-85%到稀有专业设备的25-40%）的准确率为52.3%，而监督YOLO在这些类别中由于架构限制无法检测未训练的类别，准确率为0%。每正确检测成本分析显示，尽管准确性较低，Gemini的每检测成本（0.00050美元 vs 0.143美元）在10万次推理时仍显著较低。我们开发的决策框架表明，最优架构选择取决于部署量、类别稳定性、预算限制和准确性要求，而不仅仅是纯粹的技术性能指标。

Summary / 总结

This paper evaluates the cost-effectiveness of supervised object detection using YOLO and zero-shot detection with VLMs like Gemini Flash 2.5. Through a comprehensive analysis on 1,000 COCO images and 200 product images, it finds that YOLO outperforms Gemini with 91.2% accuracy versus 68.5%, but this advantage justifies the additional annotation cost only beyond 55 million inferences. Gemini shows 52.3% accuracy on diverse product categories, compared to 0% for YOLO on classes it was not trained on, and has a much lower cost per correct detection at 100,000 inferences ($0.00050 vs $0.143), making it the more cost-effective choice below the break-even volume.

该研究评估了监督目标检测（YOLO）与零样本视觉语言模型（Gemini Flash 2.5）在目标检测中的成本效益。通过对1,200张图像的全面分析和详细的总拥有成本建模，研究发现，YOLO在标准类别上的准确率（91.2%）高于零样本Gemini（68.5%），但这一准确率优势仅在超过5500万次推理时才值得付出标注成本。Gemini在多种产品类别上表现出52.3%的准确率，而YOLO无法检测未训练的类别；在10万次推理时，Gemini的每正确检测成本更低。

$Δ\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization

Authors: Lin Zhu, Yifeng Yang, Xinbing Wang, Qinying Gu, Nanyang Ye

First: 2025-10-13T11:36:58+00:00 · Latest: 2025-10-13T11:36:58+00:00

Comments: Accepted by NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both in-distribution (ID) and out-of-distribution (OOD) data. The OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs' generalization ability to covariate-shifted OOD data, while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named $Δ\mathrm{Energy}$. $Δ\mathrm{Energy}$ significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach for OOD detection. Furthermore, $Δ\mathrm{Energy}$ can simultaneously improve OOD generalization under covariate shifts, which is achieved by lower-bound maximization for $Δ\mathrm{Energy}$ (termed EBM). EBM is theoretically proven to not only enhance OOD detection but also yield a domain-consistent Hessian, which serves as a strong indicator for OOD generalization. Based on this finding, we developed a unified fine-tuning framework that allows for improving VLMs' robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, outperforming recent approaches by 10% to 25% in AUROC.
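A rough sketch of the idea behind the score, under stated assumptions: it uses the standard energy score $-T \log \sum_i \exp(s_i / T)$ over image-text cosine similarities and measures how much that energy changes when the maximum similarity is replaced by a low value. The temperature, the replacement value, the sign convention, and the similarity numbers are all illustrative assumptions; the paper's exact definition of ΔEnergy may differ.

```python
import math


def energy(sims, temperature=0.01):
    """Energy score -T * log(sum(exp(s / T))), computed in a numerically stable way."""
    scaled = [s / temperature for s in sims]
    m = max(scaled)
    return -temperature * (m + math.log(sum(math.exp(x - m) for x in scaled)))


def delta_energy(sims, low_value=0.0, temperature=0.01):
    """Energy change after the maximum cosine similarity is reduced to `low_value`.

    Assumption: a larger change suggests closed-set (ID) data, where one class
    dominates the similarities; the paper may use a different formulation.
    """
    top = max(range(len(sims)), key=lambda i: sims[i])
    masked = list(sims)
    masked[top] = low_value
    return energy(masked, temperature) - energy(sims, temperature)


# Hypothetical cosine similarities between one image and four class prompts.
id_like = [0.31, 0.18, 0.17, 0.16]   # one class clearly dominates
ood_like = [0.20, 0.19, 0.19, 0.18]  # no class dominates

id_score = delta_energy(id_like)
ood_score = delta_energy(ood_like)
```

With these made-up inputs the ID-like sample shows a much larger energy change than the OOD-like one, which is the behavior the abstract describes for closed-set data.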

中文标题/摘要

标题：$Δ\mathrm{能量}$：在视觉-语言对齐过程中优化能量变化可同时提高OOD检测和OOD泛化

近期针对视觉-语言模型（VLMs）的方法在实现快速下游适应方面取得了显著成功。当应用于实际的下游任务时，VLMs不可避免地会遇到同分布（ID）数据和异分布（OOD）数据。OOD数据集通常包括协变量偏移（例如，已知类别的图像风格变化）和语义偏移（例如，测试时未见过的类别）。这突显了提高VLMs对协变量偏移的OOD数据的泛化能力的重要性，同时有效检测开放集的语义偏移OOD类别。在本文中，我们受到在重新对齐视觉-语言模态（特别是直接将最大余弦相似度降低到低值）时观察到的显著能量变化的启发，引入了一种新的OOD得分，称为$Δ\mathrm{Energy}$。$Δ\mathrm{Energy}$显著优于传统的基于能量的OOD得分，并提供了一种更可靠的OOD检测方法。此外，$Δ\mathrm{Energy}$可以通过$Δ\mathrm{Energy}$的下界最大化（称为EBM）同时提高协变量偏移下的OOD泛化。EBM不仅理论上证明了可以增强OOD检测，还产生了一个领域一致的海森矩阵，这为OOD泛化提供了一个强有力的指标。基于这一发现，我们开发了一个统一的微调框架，可以同时提高VLMs在OOD泛化和OOD检测方面的鲁棒性。在具有挑战性的OOD检测和泛化基准上的广泛实验表明，我们的方法优于最近的方法，AUROC提高了10%到25%。

Summary / 总结

This paper addresses the challenges of out-of-distribution (OOD) detection and generalization in vision-language models (VLMs) by introducing a novel OOD score, ΔEnergy. Inspired by the energy change observed during vision-language alignment, ΔEnergy is designed to effectively detect OOD data and improve OOD generalization under covariate shifts. The method, named EBM, maximizes the lower bound of ΔEnergy, which not only enhances OOD detection but also provides a domain-consistent Hessian for better generalization. Experiments on OOD benchmarks show that this approach outperforms recent methods by 10% to 25% in AUROC.

本文旨在通过优化视觉-语言模型（VLM）在对齐过程中的能量变化来提高其对分布外（OOD）数据的检测和泛化能力。作者提出了一种新的OOD得分ΔEnergy，显著优于传统的基于能量的得分。ΔEnergy源于将最大余弦相似度直接降低到低值时观察到的能量变化；对ΔEnergy进行下界最大化（称为EBM）可增强OOD检测，并产生领域一致的Hessian，从而更好地泛化到协变量偏移的OOD数据。实验结果显示，该方法在挑战性的OOD检测和泛化基准上优于最近的方法，AUROC提高了10%到25%。

Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering

Authors: Jian Lan, Zhicheng Liu, Udo Schlegel, Raoyuan Zhao, Yihong Liu, Hinrich Schütze, Michael A. Hedderich, Thomas Seidl

First: 2025-10-13T11:35:30+00:00 · Latest: 2025-10-13T11:35:30+00:00

Abs · PDF · Code1 · Code2

Abstract

Large vision-language models (VLMs) achieve strong performance in Visual Question Answering but still rely heavily on supervised fine-tuning (SFT) with massive labeled datasets, which is costly due to human annotations. Crucially, real-world datasets often exhibit human uncertainty (HU) -- variation in human confidence across annotations -- but standard SFT simply optimizes toward the most frequent label, disregarding HU distributions. This leaves two open questions: How does HU affect SFT, and how can HU be effectively leveraged in training? In this work, we first conduct a systematic evaluation of VLMs across varying HU levels. We have two key findings: (i) surprisingly, high-HU samples contribute little or even degrade model performance, and (ii) naively training on the full dataset yields under-calibrated models that fail to capture HU distributions. Motivated by these findings, we introduce HaDola, a human uncertainty-aware data selection and automatic labeling framework. HaDola operates in four stages -- discriminate, self-annotate, error trigger, and training -- to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set (5% of data). Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated. Extensive experiments on VQAv2 and VizWiz datasets demonstrate that HaDola consistently matches or outperforms state-of-the-art baselines with less training data. Our work highlights the importance of explicitly modeling HU in SFT, suggesting that better utilization of HU is more effective than merely scaling up dataset size.

Summary / 总结

This work addresses the issue of human uncertainty (HU) in supervised fine-tuning (SFT) for Visual Question Answering (VQA) models. It evaluates the impact of HU on model performance and finds that high-HU samples can degrade model accuracy. The authors propose HaDola, a framework that selects data based on HU and automatically labels it, improving model calibration and accuracy. Experiments show that HaDola outperforms existing methods with less training data.

该研究探讨了人类不确定性（HU）对视觉问答（VQA）模型监督微调（SFT）的影响。研究发现，高HU样本会降低模型准确性。作者提出了一种名为HaDola的框架，该框架基于HU选择数据并自动标注，从而提高模型的校准度和准确性。实验表明，与现有方法相比，HaDola使用更少的训练数据仍能取得更好的效果。

Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation

Authors: Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang

First: 2025-10-09T11:08:07+00:00 · Latest: 2025-10-13T11:22:24+00:00

Abs · PDF · Code1 · Code2

Abstract

Video-to-Audio (V2A) generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination (IH) and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction (PFC), a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
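The two metrics and the majority-voting step can be sketched as follows. The sketch assumes each detector emits one boolean per time step meaning "hallucinated speech or music detected here", and that IH@dur pools time steps over all videos; the paper may instead average per video, and the input layout is hypothetical.

```python
def majority_vote(detector_flags):
    """detector_flags: one list of booleans per detector, aligned by time step.
    A time step counts as hallucinated when more than half the detectors flag it."""
    n_detectors = len(detector_flags)
    return [sum(step) * 2 > n_detectors for step in zip(*detector_flags)]


def ih_metrics(videos):
    """videos: list of per-video detector_flags.
    Returns (IH@vid, IH@dur): the fraction of videos with any hallucinated step,
    and the fraction of all time steps that are hallucinated (pooled; an assumption)."""
    voted = [majority_vote(flags) for flags in videos]
    ih_vid = sum(any(v) for v in voted) / len(voted)
    ih_dur = sum(sum(v) for v in voted) / sum(len(v) for v in voted)
    return ih_vid, ih_dur


# Hypothetical data: two videos, three detectors, four time steps each.
video_a = [
    [False, True, True, False],
    [False, True, False, False],
    [False, False, True, False],
]
video_b = [
    [False, False, True, False],
    [False, False, False, False],
    [False, False, False, False],
]
ih_vid, ih_dur = ih_metrics([video_a, video_b])
```

In this toy case one of the two videos contains voted hallucinations (IH@vid = 0.5) and two of the eight time steps are flagged (IH@dur = 0.25); the single-detector flag in the second video is outvoted.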

中文标题/摘要

标题：视频到音频生成中的插入幻觉检测与缓解

视频到音频生成在自动合成视频声音方面取得了显著进展。然而，现有的评估指标侧重于语义和时间对齐，忽视了一个关键的失败模式：模型经常生成声学事件，特别是语音和音乐，这些事件在视频中没有对应的视觉来源。我们称这种现象为插入幻觉，并将其识别为由数据集偏差驱动的系统性风险，这种风险目前完全未被现有指标检测到。为应对这一挑战，我们首先开发了一种系统性的评估框架，该框架采用多个声学事件检测器的多数投票集成。我们还引入了两个新的度量标准来量化这一问题的普遍性和严重性：IH@vid（带有幻觉的视频比例）和IH@dur（幻觉持续时间的比例）。在此基础上，我们提出了后验特征校正（PFC），这是一种无需训练的推理时方法，可以缓解插入幻觉。PFC采用两步过程：首先生成初始音频输出以检测幻觉段落，然后在这些时间戳处遮蔽相应的视频特征后再生音频。在几个主流的V2A基准上的实验首次揭示，最先进的模型遭受严重的插入幻觉。相比之下，我们的PFC方法平均将幻觉的普遍性和持续时间降低了超过50%，且不降低，甚至在某些情况下还改善了传统的音频质量和时间同步度指标。我们的工作首次正式定义、系统性测量并有效缓解了插入幻觉，为更可靠和忠实的V2A模型铺平了道路。

Summary / 总结

The research addresses the issue of Insertion Hallucination in video-to-audio generation, where models generate sounds that do not correspond to any visual source. To tackle this, the authors developed a systematic evaluation framework using multiple audio event detectors and introduced two metrics, IH@vid and IH@dur, to quantify the problem. They also proposed Posterior Feature Correction (PFC), a training-free method that reduces hallucinations by over 50% on average without degrading conventional audio quality metrics. This work is the first to formally define, measure, and mitigate Insertion Hallucination in V2A models, enhancing their reliability and faithfulness.

研究针对视频到音频生成中的插入幻觉问题，即模型生成没有对应视觉来源的音频。研究引入了一种系统评估框架，使用多个音频事件检测器的多数投票集成，并提出了两个新的度量标准（IH@vid 和 IH@dur）来量化这一问题。研究提出了一种名为后验特征校正（PFC）的无需训练的方法，该方法先生成初始音频并检测幻觉段落，然后在这些时间戳处屏蔽相应的视频特征并重新生成音频，从而减轻幻觉。实验表明，PFC方法平均将幻觉的频率和持续时间减少了50%以上，同时没有降低传统的音频质量指标。

TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models

Authors: Chenghao Liu, Jiachen Zhang, Chengxuan Li, Zhimu Zhou, Shixin Wu, Songfang Huang, Huiling Duan

Venue: AAAI 2026

First: 2025-08-15T12:03:34+00:00 · Latest: 2025-10-13T10:18:34+00:00

Comments: Manuscript submitted to AAAI 2026, currently under review

Abs · PDF · Code1 · Code2

Abstract

Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent improvements: 4.0 percentage points average on LIBERO (72.4% vs 68.4% baseline), cross-environment validation on SimplerEnv (4.8% relative improvement), and 8.7% relative improvement on real robot tasks. Our approach proves model-agnostic, working across OpenVLA and VLA-Cache architectures. Notably, TTF reveals that selective Query matrix reuse in attention mechanisms enhances rather than compromises performance, suggesting promising directions for direct KQV matrix reuse strategies that achieve computational acceleration while improving task success rates.
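One way to picture the selective fusion described above is the following sketch. It assumes per-patch inputs (a grayscale difference and an attention relevance score), reuses the previous token only when a patch is both visually unchanged and not semantically relevant, and refreshes every token on keyframes. The thresholds, the keyframe interval, and the rule for combining the two signals are illustrative assumptions, not values from the paper.

```python
def fuse_tokens(prev_tokens, curr_tokens, pixel_diff, attn_score, frame_idx,
                diff_thresh=0.03, attn_thresh=0.5, keyframe_interval=3):
    """Hard fusion: each output token is either the previous or the current token.

    pixel_diff: per-patch mean absolute grayscale difference between frames.
    attn_score: per-patch attention-based relevance to the task.
    All thresholds and the keyframe interval are hypothetical.
    """
    # Keyframe anchoring: periodically take all current tokens so that
    # errors from reused tokens cannot accumulate indefinitely.
    if prev_tokens is None or frame_idx % keyframe_interval == 0:
        return list(curr_tokens)
    fused = []
    for prev, curr, diff, attn in zip(prev_tokens, curr_tokens, pixel_diff, attn_score):
        changed = diff > diff_thresh    # pixel dimension
        relevant = attn > attn_thresh   # semantic dimension
        fused.append(curr if (changed or relevant) else prev)
    return fused


# Hypothetical three-patch frame; strings stand in for token vectors.
prev = ["p0", "p1", "p2"]
curr = ["c0", "c1", "c2"]
fused = fuse_tokens(prev, curr, [0.01, 0.10, 0.01], [0.1, 0.1, 0.9], frame_idx=1)
keyframe = fuse_tokens(prev, curr, [0.01, 0.10, 0.01], [0.1, 0.1, 0.9], frame_idx=3)
```

Here patch 0 is static and irrelevant, so its previous token is kept; patch 1 changed visually and patch 2 is semantically relevant, so both take the current token.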

中文标题/摘要

标题：TTF-VLA：基于像素注意集成的时间令牌融合用于视觉-语言-动作模型

视觉-语言-动作（VLA）模型在每个时间步独立处理视觉输入，忽略了机器人操作任务中固有的宝贵时间信息。这种帧帧处理使模型容易受到视觉噪声的影响，同时忽略了连续帧之间的重要连贯性。我们提出了时间令牌融合（TTF），这是一种无需训练的方法，通过智能地整合历史和当前的视觉表示来增强VLA推理质量。我们的方法结合了高效的灰度像素差异分析和基于注意力的语义相关性评估，通过硬融合策略和关键帧锚定来实现选择性的时间令牌融合，防止错误累积。在LIBERO、SimplerEnv和真实机器人任务中的全面实验表明，一致性改进：在LIBERO上平均提高了4.0个百分点（72.4% vs 68.4%基线），在SimplerEnv上的跨环境验证（相对改进4.8%），以及在真实机器人任务上的相对改进8.7%。我们的方法具有模型无关性，适用于OpenVLA和VLA-Cache架构。值得注意的是，TTF表明在注意力机制中选择性地重用查询矩阵实际上可以提高性能，而不是削弱性能，这表明直接的KQV矩阵重用策略具有前景，可以在实现计算加速的同时提高任务成功率。

Summary / 总结

The research addresses the issue of temporal information loss in Vision-Language-Action models, which process visual inputs frame-by-frame, leading to vulnerability to visual noise and ignoring temporal coherence. It introduces Temporal Token Fusion (TTF), a training-free method that integrates historical and current visual representations through dual-dimension detection and hard fusion strategies. Experiments across LIBERO, SimplerEnv, and real robot tasks show consistent improvements, with 4.0 percentage points on LIBERO, 4.8% relative improvement on SimplerEnv, and 8.7% relative improvement on real robot tasks. TTF is model-agnostic and works across different architectures, suggesting potential for computational acceleration and improved task success rates.

研究旨在通过整合历史和当前的视觉表示来提高Vision-Language-Action (VLA)模型的性能，增强时间连贯性。方法是Temporal Token Fusion (TTF)，结合像素差异分析和基于注意力的语义相关性进行双维度检测，以选择性地融合时间令牌。实验在LIBERO、SimplerEnv和真实机器人任务上显示了一致的改进，LIBERO上平均提高了4.0个百分点，SimplerEnv上相对提高了4.8%，真实机器人任务上相对提高了8.7%。TTF在不同的VLA架构中具有通用性，证明了其灵活性和有效性。

MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs

Authors: Haonan Ge, Yiwei Wang, Ming-Hsuan Yang, Yujun Cai

Venue: EMNLP 2025

First: 2025-08-14T01:17:39+00:00 · Latest: 2025-10-13T09:52:34+00:00

Comments: EMNLP 2025

Abs · PDF · Code1 · Code2

Abstract

Large Vision-Language Models (LVLMs) have shown strong performance across multimodal tasks. However, they often produce hallucinations -- text that is inconsistent with visual input, due to the limited ability to verify information in different regions of the image. To address this, we propose Multi-Region Fusion Decoding (MRFD), a training-free decoding method that improves factual grounding by modeling inter-region consistency. MRFD identifies salient regions using cross-attention, generates initial responses for each, and computes reliability weights based on Jensen-Shannon Divergence (JSD) among the responses. These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts inspired by Chain-of-Thought reasoning. Experiments across multiple LVLMs and benchmarks show that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.
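The JSD-based weighting can be sketched as below. The sketch assumes each region's response is summarized as a probability distribution over the same small vocabulary, and turns each region's mean divergence from the other regions into a weight with a softmax-style normalization. The `sharpness` parameter, the normalization, and the example distributions are assumptions; the abstract does not give the exact weighting formula.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def reliability_weights(dists, sharpness=5.0):
    """Give more weight to regions whose response agrees with the others.
    `sharpness` is a hypothetical knob controlling how strongly outliers are penalized."""
    n = len(dists)
    mean_div = [
        sum(jsd(dists[i], dists[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    scores = [math.exp(-sharpness * d) for d in mean_div]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical per-region answer distributions; the third region disagrees.
regions = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
weights = reliability_weights(regions)
```

The disagreeing third region receives the smallest weight, so a fusion step that uses these weights would lean on the two mutually consistent regions.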

中文标题/摘要

标题：MRFD：多区域融合解码及其自一致性方法在减轻LVLMs幻觉中的应用

大型视觉-语言模型（LVLMs）在多模态任务中表现出强大的性能。然而，它们经常产生幻觉，即与视觉输入不一致的文本，这是由于验证图像不同区域信息的能力有限。为了解决这个问题，我们提出了多区域融合解码（MRFD），这是一种无需训练的解码方法，通过建模区域间的一致性来增强事实依据。MRFD 使用交叉注意力识别显著区域，为每个区域生成初始响应，并基于响应之间的詹森-香农散度（JSD）计算可靠性权重。这些权重指导对各区域预测进行一致性感知的融合，其中使用了受思维链推理启发的区域感知提示。在多个LVLMs和基准测试中的实验表明，MRFD 显著减少了幻觉并提高了响应的事实性，而无需对模型进行更新。

Summary / 总结

The research aims to address the issue of hallucinations in Large Vision-Language Models (LVLMs), where text generated is inconsistent with the visual input. The proposed Multi-Region Fusion Decoding (MRFD) method enhances factual grounding by modeling inter-region consistency. It identifies salient regions using cross-attention, generates initial responses for each region, and computes reliability weights based on Jensen-Shannon Divergence (JSD). These weights guide a consistency-aware fusion of per-region predictions, using region-aware prompts. Experiments across various LVLMs and benchmarks demonstrate that MRFD significantly reduces hallucinations and improves response factuality without requiring model updates.

研究旨在通过提出多区域融合解码（MRFD）方法来解决大型视觉语言模型（LVLM）中的幻觉问题，该方法通过跨区域一致性建模增强事实性。MRFD 使用交叉注意力识别图像中的显著区域，生成初始响应，并基于 Jensen-Shannon 散度计算可靠性权重。这些权重指导各区域预测的一致性融合，提高响应的事实性。实验表明，MRFD 在多个 LVLM 和基准测试中显著减少了幻觉，且无需对模型进行更新。

LiTransProQA: an LLM-based Literary Translation evaluation metric with Professional Question Answering

Authors: Ran Zhang, Wei Zhao, Lieve Macken, Steffen Eger

Venue: EMNLP 2025

First: 2025-05-08T17:12:56+00:00 · Latest: 2025-10-13T09:31:45+00:00

Comments: Accepted as a main paper at EMNLP 2025. CR version

Abs · PDF · Code1 · Code2

Abstract

The impact of Large Language Models (LLMs) has extended into literary domains. However, existing evaluation metrics for literature prioritize mechanical accuracy over artistic expression and tend to overrate machine translation as being superior to human translation from experienced professionals. In the long run, this bias could result in an irreversible decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce LITRANSPROQA, a novel, reference-free, LLM-based question-answering framework designed for literary translation evaluation. LITRANSPROQA integrates humans in the loop to incorporate insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, LITRANSPROQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation and surpassing the best state-of-the-art metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, LITRANSPROQA reaches an adequacy performance comparable to trained linguistic student evaluators, though it still falls behind experienced professional translators. LITRANSPROQA shows broad applicability to open-source models like LLaMA3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free tool for evaluating literary translations that require local processing due to copyright or ethical considerations.

中文标题/摘要

标题：LiTransProQA：基于大语言模型的专业文学翻译评估指标

大型语言模型（LLMs）的影响已扩展到文学领域。然而，现有的文学评估指标更注重机械准确性而忽视了艺术表达，倾向于高估机器翻译优于有经验的专业人士的人工翻译。长期来看，这种偏见可能导致翻译质量和文化真实性的不可逆转下降。为应对对专门文学评估指标的迫切需求，我们引入了LITRANSPROQA，这是一种新颖的、无需参考的、基于大语言模型的专业问答框架，用于文学翻译评估。LITRANSPROQA将人类纳入评估过程，结合专业文学翻译者和研究人员的见解，重点关注文学质量评估中的关键要素，如文学手法、文化理解和作者声音。我们的广泛评估显示，虽然经过文学微调的XCOMET-XL略有改进，但LITRANSPROQA在相关性方面显著优于现有指标，达到0.07的提升，在充分性评估中超过最佳最先进的指标15分以上。将专业翻译者的见解作为权重进一步提高了性能，突显了翻译者输入的价值。值得注意的是，LITRANSPROQA在充分性方面的表现与训练有素的语言学学生评估者相当，但仍落后于有经验的专业翻译者。LITRANSPROQA适用于开源模型如LLaMA3.3-70b和Qwen2.5-32b，表明其作为评估受版权或伦理考虑限制的文学翻译的可访问且无需训练工具的潜力。

Summary / 总结

LiTransProQA is an LLM-based evaluation metric for literary translation that incorporates professional insights to assess literary devices, cultural understanding, and authorial voice. It outperforms existing metrics by up to 0.07 in correlation and by over 15 points in adequacy assessments, achieving performance comparable to trained linguistic student evaluators but falling short of experienced professional translators. It is applicable to various open-source models, making it a valuable tool for evaluating literary translations with local processing needs due to copyright or ethical considerations.

LiTransProQA 是一个基于大语言模型的文学翻译评估指标，融合了专业译者和研究者的见解，评估文学手法、文化理解和作者声音。它在相关性上比现有指标高出至多0.07，在充分性评估中高出15分以上，达到了受过训练的语言学学生评估者的水平，但仍未达到经验丰富的专业译者的水平。它适用于各种开源模型，是一个在因版权或伦理原因需要本地处理的情况下评估文学翻译的有价值工具。

Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations

Authors: Johannes Moll, Markus Graf, Tristan Lemke, Nicolas Lenhart, Daniel Truhn, Jean-Benoit Delbrouck, Jiazhen Pan, Daniel Rueckert, Lisa C. Adams, Keno K. Bressem

First: 2025-10-13T09:28:22+00:00 · Latest: 2025-10-13T09:28:22+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process, undermining trust in high-stakes clinical use. Existing evaluations rarely catch this misalignment, prioritizing answer accuracy or adherence to formats. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration. In a reader study (n=4), evaluator-radiologist correlations fall within the observed inter-radiologist range for all axes, with strong alignment for attribution (Kendall's $\tau_b=0.670$), moderate alignment for fidelity ($\tau_b=0.387$), and weak alignment for confidence tone ($\tau_b=0.091$), which we report with caution. Benchmarking six VLMs shows that answer accuracy and explanation quality are decoupled, acknowledging injected cues does not ensure grounding, and text cues shift explanations more than visual cues. While some open-source models match final answer accuracy, proprietary models score higher on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to evaluate beyond final answer accuracy.
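The agreement statistic reported above, Kendall's $\tau_b$, can be computed with a short standard-library sketch. It uses the usual tie-corrected definition $\tau_b = (C - D) / \sqrt{(C + D + T_x)(C + D + T_y)}$, where $C$ and $D$ count concordant and discordant pairs and $T_x$, $T_y$ count pairs tied only in one variable. The ratings in the example are made up and are not data from the paper.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation with tie correction."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every count
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + only_x_tied)
                      * (concordant + discordant + only_y_tied))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical scores from an automatic evaluator and a radiologist.
evaluator = [1, 2, 3]
radiologist = [1, 3, 2]
tau = kendall_tau_b(evaluator, radiologist)
```

For these three made-up items two pairs agree in order and one disagrees, giving $\tau_b = 1/3$; identical rankings give 1 and fully reversed rankings give -1.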

中文标题/摘要

标题：使用多模态扰动评估医学视觉语言模型的推理忠实性

视觉语言模型（VLMs）通常生成听起来合理但实际上未能反映决策过程的推理链（CoT）解释，这在高风险临床应用中削弱了信任。现有评估很少捕捉到这种不一致，而是优先考虑答案的准确性或格式的遵守。我们提出了一种基于临床的框架，用于胸部X光视觉问答（VQA），通过控制文本和图像修改在三个轴上探究推理链的忠实性：临床忠实性、因果归因和置信度校准。在一项读者研究（n=4）中，评估者-放射科医生的相关性在所有轴上均落在观察到的放射科医生间范围内，归因有很强的对齐（Kendall's $\tau_b=0.670$），忠实性有中等的对齐（$\tau_b=0.387$），而置信度语气有弱的对齐（$\tau_b=0.091$），我们对此持谨慎态度。基准测试六种VLMs显示，答案的准确性与解释质量是脱钩的，承认注入的提示并不保证扎根，文本提示比视觉提示更改变解释。虽然一些开源模型在最终答案准确性上与之匹配，但专有模型在归因上得分更高（25.0% vs. 1.4%），并且经常在忠实性上得分更高（36.1% vs. 31.7%），这突显了部署风险并强调了评估应超越最终答案准确性的重要性。

When Language Model Guides Vision: Grounding DINO for Cattle Muzzle Detection

Authors: Rabin Dulal, Lihong Zheng, Muhammad Ashad Kabir

Venue: Australasian Joint Conference on Artificial Intelligence 2025

First: 2025-09-08T08:21:34+00:00 · Latest: 2025-10-13T09:28:02+00:00

Abs · PDF · Code1 · Code2

Abstract

Muzzle patterns are among the most effective biometric traits for cattle identification. Fast and accurate detection of the muzzle region as the region of interest is critical to automatic visual cattle identification. Earlier approaches relied on manual detection, which is labor-intensive and inconsistent. Recently, automated methods using supervised models like YOLO have become popular for muzzle detection. Although effective, these methods require extensive annotated datasets and tend to depend heavily on their training data, limiting their performance on new or unseen cattle. To address these limitations, this study proposes a zero-shot muzzle detection framework based on Grounding DINO, a vision-language model capable of detecting muzzles without any task-specific training or annotated data. This approach leverages natural language prompts to guide detection, enabling scalable and flexible muzzle localization across diverse breeds and environments. Our model achieves a mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance without requiring annotated data. To our knowledge, this is the first research to provide a real-world, industry-oriented, and annotation-free solution for cattle muzzle detection. The framework offers a practical alternative to supervised methods, promising improved adaptability and ease of deployment in livestock monitoring applications.
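The "@0.5" in the reported metric refers to an Intersection-over-Union (IoU) threshold: a predicted box is matched to a ground-truth muzzle box only if their IoU is at least 0.5. A minimal sketch of that check is shown below, assuming boxes in `(x1, y1, x2, y2)` corner format; the box coordinates are made up, and the full mAP computation (ranking by confidence, precision-recall integration) is omitted.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a true-positive candidate at IoU >= threshold."""
    return iou(pred_box, gt_box) >= threshold


# Hypothetical ground-truth and predicted muzzle boxes.
gt = (0, 0, 10, 10)
good_pred = (1, 1, 10, 10)   # overlaps most of the ground truth
poor_pred = (5, 0, 15, 10)   # overlaps only half of it
```

The first prediction has IoU 0.81 and would count as a match; the second has IoU 1/3 and would not.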

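Title and Abstract (translated from Chinese)

Title: When Language Models Guide Vision: Grounding DINO-Based Cattle Muzzle Detection

Muzzle patterns are among the most effective biometric traits for identifying cattle. Fast, accurate detection of the muzzle region as the region of interest is key to automated visual cattle identification. Early methods relied on manual detection, which is time-consuming and inconsistent. Recently, automated methods using supervised models such as YOLO have become popular for muzzle detection. Although effective, these methods need large annotated datasets and tend to depend on the specific data they were trained on, which limits their performance on new or unseen cattle. To address these limitations, this study proposes a zero-shot muzzle detection framework based on Grounding DINO, a vision-language model that can detect muzzles without any task-specific training or annotated data. The method uses natural language prompts to guide detection, making muzzle localization scalable and flexible across breeds and environments. Our model reaches 76.8% mAP@0.5, a promising result obtained without annotated data. To our knowledge, this is the first work to offer a practical, industry-oriented, annotation-free solution for cattle muzzle detection. The framework provides a practical alternative to supervised methods for livestock monitoring and is expected to improve adaptability and ease of deployment.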

Summary

This study aims to improve the accuracy and efficiency of cattle muzzle detection by proposing a zero-shot framework using Grounding DINO, a vision-language model. The method leverages natural language prompts to guide detection, avoiding the need for annotated data. The model achieves a mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance in cattle muzzle detection without requiring annotated data. This is the first research to provide an annotation-free solution for cattle muzzle detection in real-world applications, offering a practical alternative to supervised methods.

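The study aims to improve the accuracy and efficiency of cattle muzzle region detection by proposing a zero-shot framework based on Grounding DINO. The method uses natural language prompts to guide detection and needs no annotated data. The model reaches 76.8% mAP@0.5, showing strong performance in cattle muzzle detection without annotated data. It is the first annotation-free solution suited to real-world use, offering a practical alternative to supervised methods for cattle muzzle detection.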

FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models

Authors: Shengming Yuan, Xinyu Lyu, Shuailong Wang, Beitao Chen, Jingkuan Song, Lianli Gao

Venue: NeurIPS 2025

First: 2025-10-13T09:22:12+00:00 · Latest: 2025-10-13T09:22:12+00:00

Comments: 19 pages, 11 figures. Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Abs · PDF · Code1 · Code2 · Code3

Abstract

Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs' adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping the model's associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight and training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions. Then, it selects high-association instances to construct effective associative steering vectors, whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, enabling models to follow diverse associative directions and better adapt to creative tasks. Notably, our method achieves up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines and demonstrating its effectiveness in enabling flexible control over associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC.
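To make the idea of a steering vector concrete, the sketch below uses a common generic recipe: take the difference between the mean hidden representation of high-association examples and that of faithful examples, then add a scaled copy of it to a middle-layer hidden state. The mean-difference formula, the function names, and the toy two-dimensional vectors are all assumptions for illustration. The abstract does not specify FlexAC's exact construction or calibration rule.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(associative_reps, faithful_reps):
    """Assumed recipe: mean(associative) - mean(faithful)."""
    mu_assoc = mean_vector(associative_reps)
    mu_faith = mean_vector(faithful_reps)
    return [a - f for a, f in zip(mu_assoc, mu_faith)]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction; alpha sets the strength."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional "middle-layer" representations (hypothetical values).
associative = [[1.0, 2.0], [3.0, 4.0]]
faithful = [[0.0, 1.0], [2.0, 1.0]]
direction = steering_vector(associative, faithful)

hidden_state = [0.5, 0.5]
more_associative = steer(hidden_state, direction, alpha=1.0)
more_faithful = steer(hidden_state, direction, alpha=-1.0)
```

In this sketch a positive `alpha` pushes the representation toward the associative direction and a negative `alpha` pushes it away, which mirrors the trade-off between creativity and faithfulness described in the abstract.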

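Title and Abstract (translated from Chinese)

Title: FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models

Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, because different tasks call for different degrees of associative reasoning. Existing methods, however, lack the flexibility to adjust this reasoning strength, which limits how well MLLMs adapt across factual and creative scenarios. To address this, we propose equipping MLLMs with mechanisms for flexible control of associative reasoning. We first study the internal mechanisms that drive associative behavior in MLLMs and find that (1) middle layers play a key role in shaping the model's associative tendencies, (2) modifying the representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be used to derive steering vectors that guide this regulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight, training-free framework for modulating associative behavior in MLLMs. FlexAC first encodes associative directions through hallucination-guided intermediate representations. It then selects high-association instances to build effective associative steering vectors, whose strength is adaptively calibrated to balance creative guidance against output stability. Finally, given the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors extracted from a forward pass over a few target-domain samples, so that models can follow diverse associative directions and adapt better to creative tasks. Notably, our method improves creativity on Creation-MMBench by up to 5.8x and reduces the hallucination rate on CHAIR by 29%, surpassing existing baselines and demonstrating effective, flexible control of associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC/.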

Summary

The research aims to enhance the adaptability of multimodal large language models (MLLMs) by addressing the trade-off between faithfulness and creativity. The study introduces Flexible Association Control (FlexAC), a lightweight framework that modulates associative reasoning in MLLMs. FlexAC uses hallucination-guided intermediate representations and task-specific associative vectors to balance creativity and output stability, achieving up to a 5.8x improvement in creativity and a 29% reduction in hallucination rate compared to existing methods. The method effectively enables flexible control over associative reasoning in MLLMs.

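The research aims to make multimodal large language models (MLLMs) more adaptable by addressing the trade-off between faithfulness and creativity. It introduces the Flexible Association Control (FlexAC) framework, which modulates associative reasoning in MLLMs. FlexAC uses hallucination-guided intermediate representations and task-specific associative vectors to balance creativity and output stability; compared with existing methods, it improves creativity by up to 5.8x and lowers the hallucination rate by 29%. The method gives effective, flexible control over associative reasoning.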

BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models

Authors: Bryan Chen Zhengyu Tan, Zheng Weihua, Zhengyuan Liu, Nancy F. Chen, Hwaran Lee, Kenny Tsu Wei Choo, Roy Ka-Wei Lee

First: 2025-10-13T09:10:05+00:00 · Latest: 2025-10-13T09:10:05+00:00

Comments: Code and Dataset to be released

Abs · PDF · Code1 · Code2

Abstract

As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce BLEnD-Vis, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region $\to$ Entity, (ii) an inverted text-only variant (Entity $\to$ Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation. BLEnD-Vis reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing and, whilst visual cues often aid performance, low cross-modal consistency highlights challenges in robustly integrating textual and visual understanding, particularly for lower-resource regions. BLEnD-Vis thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs.
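The gap between accuracy and cross-modal consistency can be illustrated with a small sketch. It assumes consistency is measured as the fraction of paired items on which a model gives the same answer in the text-only format (ii) and the VQA format (iii). That definition, the function names, and the toy answers are assumptions, since the abstract does not give the exact metric.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that equal the gold answers."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def consistency(predictions_a, predictions_b):
    """Fraction of paired items answered identically in both formats."""
    pairs = list(zip(predictions_a, predictions_b))
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical answers to four paired multiple-choice questions.
gold = ["A", "C", "B", "D"]
text_only = ["A", "C", "D", "D"]  # format (ii): Entity -> Region, text
vqa_style = ["A", "B", "B", "D"]  # format (iii): same question with an image

text_acc = accuracy(text_only, gold)
vqa_acc = accuracy(vqa_style, gold)
agreement = consistency(text_only, vqa_style)
```

Here both formats reach the same accuracy while agreeing on only half of the items, which is the kind of pattern that a per-format accuracy score alone would hide.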

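Title and Abstract (translated from Chinese)

Title: BLEnD-Vis: A Benchmark for Multimodal Cultural Understanding in Vision-Language Models

As vision-language models (VLMs) are deployed worldwide, their ability to understand culturally situated knowledge becomes essential. Existing evaluations, however, focus mainly on static recall or isolated visual grounding and do not answer whether VLMs have robust, transferable cultural understanding. We introduce BLEnD-Vis, a multimodal, multicultural benchmark that evaluates how robust VLMs' everyday cultural knowledge is across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, BLEnD-Vis constructs 313 culturally grounded question templates covering 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline that queries from region to entity, (ii) an inverted text-only variant (entity to region), and (iii) a VQA-style version of (ii) with generated images. The benchmark contains 4,916 images and more than 21,000 multiple-choice question (MCQ) instances, validated through human annotation. BLEnD-Vis reveals significant fragility in current VLM cultural knowledge: performance drops under linguistic rephrasing, and although visual cues often help, low cross-modal consistency highlights the difficulty of robustly integrating textual and visual understanding, especially for lower-resource regions. BLEnD-Vis therefore provides a key testbed for systematically analyzing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs.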

Summary

The research aims to evaluate the cultural understanding capabilities of vision-language models (VLMs) by introducing BLEnD-Vis, a multimodal benchmark. The method involves creating 313 culturally grounded question templates across 16 regions and generating three aligned multiple-choice formats. The main experimental findings show that current VLMs exhibit significant fragility in cultural knowledge, with performance drops under linguistic rephrasing and low cross-modal consistency, especially for lower-resource regions. This benchmark is crucial for analyzing cultural robustness and guiding the development of more culturally competent VLMs.

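The research aims to evaluate the cultural understanding of vision-language models (VLMs) by introducing the multimodal benchmark BLEnD-Vis. The method creates 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats. The main experimental findings show that current VLMs are significantly fragile in cultural knowledge: performance drops under linguistic rephrasing and cross-modal consistency is low, especially for lower-resource regions. The benchmark is important for analyzing cultural robustness and multimodal grounding, and it guides the development of more culturally competent VLMs.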

EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

Authors: Daniel Scalena, Leonidas Zotos, Elisabetta Fersini, Malvina Nissim, Ahmet Üstün

First: 2025-10-13T09:04:28+00:00 · Latest: 2025-10-13T09:04:28+00:00

Abs · PDF · Code1 · Code2

Abstract

With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution, but it allocates the same compute budget to every prompt. Based on the assumption that different prompts carry different degrees of complexity, and thus different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through the token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance. EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and then reallocates the saved compute budget to the instances where exploration of alternative paths is most needed. We find that across multiple open-source models on complex reasoning benchmarks such as AIME 2025, EAGer can reallocate the budget without accessing target labels, achieving the best efficiency-performance trade-off in terms of reasoning length and Pass@k. When target labels are accessible, EAGer generates up to 65% fewer tokens (hence saving compute) and achieves up to 37% improvement in Pass@k compared to Full Parallel Sampling.
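The branching rule can be sketched in a few lines: compute the Shannon entropy of the next-token distribution and branch only when it exceeds a threshold. The threshold value and the toy distributions below are hypothetical. The abstract does not state how EAGer sets its threshold or how it redistributes the saved budget.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_branch(probs, threshold):
    """Branch into several reasoning paths only at high-entropy tokens."""
    return token_entropy(probs) > threshold


# Hypothetical next-token distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]  # the model is nearly certain
uncertain = [0.25, 0.25, 0.25, 0.25]  # the model is maximally unsure

THRESHOLD = 1.0  # hypothetical cut-off in nats
branch_on_confident = should_branch(confident, THRESHOLD)
branch_on_uncertain = should_branch(uncertain, THRESHOLD)
```

With these numbers only the uniform distribution triggers a branch, so a prompt whose tokens are all low-entropy would be decoded along a single path and leave its unused budget for harder prompts.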

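Title and Abstract (translated from Chinese)

Title: EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

With the rise of reasoning language models and test-time scaling as a paradigm for improving model performance, substantial computation is often needed to generate multiple candidate sequences from the same prompt. This makes it possible to explore different reasoning paths, but every prompt receives the same compute budget. Based on the assumption that different prompts differ in complexity and therefore in compute needs, we propose EAGer, a training-free generation method that exploits model uncertainty through the token-wise entropy distribution to reduce redundant computation while improving overall performance. EAGer branches into multiple reasoning paths only when high-entropy tokens are present, and reallocates the saved compute budget to the instances where exploring alternative paths is most needed. On complex reasoning benchmarks such as AIME 2025, we find that EAGer can reallocate the budget without access to target labels, achieving the best efficiency-performance trade-off in terms of reasoning length and Pass@k. When target labels are accessible, EAGer generates up to 65% fewer tokens than Full Parallel Sampling (and so saves compute) and improves Pass@k by up to 37%.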

Summary

EAGer is a training-free generation method that uses token-wise entropy to adaptively allocate computation during inference. By branching into multiple reasoning paths only when high-entropy tokens are present, EAGer reallocates saved compute budget to instances where alternative paths are needed. On complex reasoning benchmarks, EAGer achieves the best efficiency-performance trade-off, saving up to 65% of tokens and improving Pass@k by up to 37% compared to Full Parallel Sampling when target labels are available.

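EAGer is a training-free generation method that uses token entropy to allocate compute adaptively at inference time. It branches into multiple reasoning paths only on high-entropy tokens and reallocates the saved compute budget to instances that need to explore alternative paths. On complex reasoning benchmarks, EAGer achieves the best efficiency-performance trade-off; when target labels are available, it saves up to 65% of tokens and improves Pass@k by up to 37% compared with Full Parallel Sampling.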