arXiv Paper Digest

Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

Authors: William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald

First: 2026-01-08T18:23:03+00:00 · Latest: 2026-01-08T18:23:03+00:00

Abstract

Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
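The head-ablation idea in the abstract can be illustrated with a minimal sketch. It assumes zero-ablation (the ablated head simply contributes nothing to the layer output); the abstract does not say which ablation variant the authors use, and the function name `attention_layer_output` and the toy numbers are hypothetical.

```python
from typing import List, Set

Vector = List[float]


def attention_layer_output(head_outputs: List[Vector], ablated: Set[int]) -> Vector:
    """Sum per-head contributions, skipping any ablated head.

    head_outputs[h] stands for head h's contribution to the layer output,
    i.e. its attention result after that head's slice of the output projection.
    Zero-ablation is an assumption of this sketch, not the paper's stated method.
    """
    dim = len(head_outputs[0])
    total = [0.0] * dim
    for h, contribution in enumerate(head_outputs):
        if h in ablated:
            continue  # the ablated head writes nothing
        for i, value in enumerate(contribution):
            total[i] += value
    return total


# Toy layer with three heads and a 2-dimensional output (made-up values).
heads = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
full = attention_layer_output(heads, ablated=set())
without_head_2 = attention_layer_output(heads, ablated={2})
```

Comparing `full` with `without_head_2` shows how removing one head changes the layer output while every other weight stays untouched, which is why this kind of intervention needs no additional training.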

中文标题/摘要

标题：视觉语言模型中提示诱发幻觉的机制

大型视觉语言模型（VLMs）功能强大，但往往因偏向文本提示而非视觉证据而产生幻觉。我们在一个受控的物体计数设置中研究了这种失败模式，其中提示夸大了图像中的物体数量（例如，要求模型描述四朵睡莲，而实际上只有三朵）。在低物体数量时，模型通常会纠正这种夸大，但随着物体数量的增加，它们越来越倾向于遵循提示，无视差异。通过对三种VLMs的机制分析，我们确定了一小组注意力头，对其进行消融可在无需额外训练的情况下将提示诱发幻觉（PIH）减少至少40%。在不同模型中，PIH头以模型特有的方式介导提示复制。我们描述了这些差异，并表明消融PIH头会增强模型依据视觉证据进行纠正的倾向。我们的研究为提示诱发幻觉的内部机制提供了见解，揭示了这些行为在不同模型中实现方式的差异。

Summary / 总结

The study investigates why large vision-language models (VLMs) hallucinate by following textual prompts rather than visual evidence. Using prompts that overstate the number of objects in an image, the researchers found that at low object counts models tend to correct the overestimation, but as the number of objects increases they increasingly conform to the prompt. Through mechanistic analysis of three VLMs, the team identified a small set of attention heads whose ablation reduces prompt-induced hallucinations by at least 40% without further training. These findings highlight model-specific mechanisms that drive prompt-induced hallucinations and suggest ways to mitigate them.

研究通过观察视觉语言模型（VLMs）在物体计数任务中的表现，探讨了提示诱导幻觉的机制。研究发现，随着图像中物体数量的增加，VLMs越来越倾向于遵循提示的过度陈述，即使这与视觉证据相矛盾。通过对三种VLMs的分析，研究发现特定的注意力头在移除后，可以显著减少提示诱导幻觉至少40%，无需额外训练。研究结果表明，这些头在复制提示方面起着关键作用，移除它们会增强模型依赖视觉证据进行修正的能力。

CoV: Chain-of-View Prompting for Spatial Reasoning

Authors: Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang

First: 2026-01-08T17:59:42+00:00 · Latest: 2026-01-08T17:59:42+00:00

Abstract

Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
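The coarse-to-fine loop described in the abstract can be sketched as follows. This is only an illustration of the control flow under stated assumptions: the callables `select_anchors`, `reason`, and `take_action` are hypothetical stand-ins for the View Selection agent, the VLM reasoning step, and the 3D scene renderer, and the sketch assumes at least one anchor view is returned.

```python
from typing import Callable, List, Optional, Tuple

View = str    # placeholder for an image or camera pose (assumption)
Action = str  # placeholder for a discrete camera action (assumption)


def chain_of_view(
    question: str,
    frames: List[View],
    select_anchors: Callable[[str, List[View]], List[View]],
    reason: Callable[[str, List[View]], Tuple[Optional[str], Optional[Action]]],
    take_action: Callable[[View, Action], View],
    step_budget: int,
) -> Optional[str]:
    # Coarse stage: drop redundant frames, keep question-aligned anchor views.
    views = list(select_anchors(question, frames))
    # Fine stage: alternate reasoning with camera actions until the model
    # answers or the step budget runs out.
    for _ in range(step_budget):
        answer, action = reason(question, views)
        if answer is not None:
            return answer
        if action is None:
            break
        views.append(take_action(views[-1], action))
    # Budget exhausted: return whatever the model can say with the views so far.
    answer, _ = reason(question, views)
    return answer


# Toy stand-ins so the loop can be run end to end.
def toy_select(question: str, frames: List[View]) -> List[View]:
    return [f for f in frames if "kitchen" in f]


def toy_reason(question: str, views: List[View]):
    if any("sink" in v for v in views):
        return ("yes", None)
    return (None, "turn_left")


def toy_action(view: View, action: Action) -> View:
    return view + "+" + action + ":sink"


result = chain_of_view(
    "Is there a sink?", ["hall", "kitchen"], toy_select, toy_reason, toy_action, step_budget=3
)
```

In the toy run the anchor view alone is not enough, so one camera action is taken before the model answers.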

中文标题/摘要

标题：CoV：空间推理的链式视角提示

在3D环境中的具身问答（EQA）通常需要收集分布在多个视角且部分被遮挡的上下文。然而，大多数最新的视觉-语言模型（VLMs）仅限于固定且有限的输入视角集，这限制了它们在推理时获取与问题相关上下文的能力，并阻碍了复杂的空间推理。我们提出了一种名为Chain-of-View（CoV）的提示方法，这是一种无需训练、在测试时进行推理的框架，通过从粗到细的探索过程将VLM转变为主动的视角推理器。CoV首先使用视图选择代理筛选冗余帧并识别与问题对齐的锚视图，然后通过交替进行迭代推理和离散相机动作进行细粒度视图调整，从底层3D场景表示中获取新观察，直到收集到足够上下文或达到步骤预算。我们在OpenEQA上对CoV进行了评估，跨四个主流VLMs获得了平均+11.56%的LLM-Match改进，最大增益为Qwen3-VL-Flash上的+13.62%。CoV还表现出测试时的扩展性：增加最小动作预算可额外获得平均+2.51%的改进，峰值为Gemini-2.5-Flash上的+3.73%。在ScanQA和SQA3D上，CoV表现出强大的性能（例如，ScanQA上的116 CIDEr / 31.9 EM@1和SQA3D上的51.1 EM@1）。总体而言，这些结果表明，与问题对齐的视图选择结合开放视图搜索是提高3D EQA中空间推理能力的有效、模型无关的策略，无需额外训练。

Summary / 总结

The research aims to enhance embodied question answering in 3D environments by addressing the limitations of existing vision-language models (VLMs) that are constrained to a fixed set of input views. The proposed Chain-of-View (CoV) prompting method enables VLMs to actively explore and gather context from multiple viewpoints through a coarse-to-fine process. Evaluation on OpenEQA across four VLMs shows an average improvement of +11.56% in LLM-Match, with the best gain of +13.62% on Qwen3-VL-Flash. CoV also demonstrates test-time scalability, with additional improvements observed as the minimum action budget increases, peaking at +3.73% on Gemini-2.5-Flash. The method performs well on ScanQA and SQA3D, indicating its effectiveness in improving spatial reasoning without additional training.

研究旨在通过解决现有视觉-语言模型（VLMs）仅限于固定视角输入的限制，提升3D环境中的具身问答能力。提出的Chain-of-View（CoV）提示方法通过粗到细的过程使VLMs能够主动探索并从多个视角收集上下文信息。在OpenEQA上对四种VLMs的评估显示，平均提高了11.56%的LLM-Match，其中Qwen3-VL-Flash的最佳增益为13.62%。CoV还展示了测试时的可扩展性，随着最小动作预算的增加，额外的改进逐渐显现，最高达到3.73%的提升，出现在Gemini-2.5-Flash上。该方法在ScanQA和SQA3D上表现出色，表明其在不进行额外训练的情况下有效提升了空间推理能力。

MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging

Authors: Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li

First: 2025-09-23T06:23:56+00:00 · Latest: 2026-01-08T17:56:05+00:00

Comments: The project is available at https://charlescsyyy.github.io/MVT

Abstract

Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.

中文标题/摘要

标题：MVT：基于掩码的视觉-语言模型在分类学对齐的土地覆盖标签化中的应用

遥感中的土地覆盖理解越来越多地需要跨数据集泛化但保持空间精确性和可解释性的类无差别系统。我们研究了在领域转移下的几何优先发现与解释设置，其中候选区域以类无差别方式划定，监督避免使用类名的明码标识。除了开放集识别和开放世界学习，我们专注于将类无差别掩码证据与分类学导向的场景解释相结合，而不是未知拒绝或持续类扩展。我们提出了MVT，一个三阶段框架，(i) 使用SAM2进行领域适应以提取边界忠实的区域掩码，(ii) 通过双步骤LoRA微调多模态LLM进行掩码导向的语义标签和场景描述生成，(iii) 使用LLM作为裁判评分进行输出评估，评分通过分层专家评分校准。在跨数据集分割迁移（在OpenEarthMap上训练，在LoveDA上评估）中，领域适应的SAM2提高了掩码质量；同时，双步骤多模态LLM微调产生了更准确的分类学对齐标签和更具有信息性的掩码导向场景描述。

Summary / 总结

The research aims to develop class-agnostic systems for land-cover understanding in remote sensing that generalize across datasets while maintaining spatial precision and interpretability. The method involves a three-stage framework: (i) extracting boundary-faithful region masks using SAM2 with domain adaptation, (ii) performing mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluating outputs with LLM-as-judge scoring calibrated by stratified expert ratings. The study shows that domain-adapted SAM2 improves mask quality, and dual-step MLLM fine-tuning leads to more accurate taxonomy-aligned tags and more informative scene descriptions on cross-dataset segmentation transfer.

研究旨在开发在遥感中进行土地覆盖理解的类无偏系统，使其能够在不同数据集之间泛化，同时保持空间精度和可解释性。方法包括三个阶段：(i) 使用域适应的SAM2提取边界忠实的区域掩码，(ii) 通过双步骤LoRA微调多模态LLM进行掩码导向的语义标签和场景描述生成，(iii) 使用LLM作为裁判评分评估输出，并通过分层专家评分进行校准。研究显示，域适应的SAM2提高了掩码质量，而双步骤MLLM微调则产生了更准确的分类学对齐标签和更具信息量的掩码导向场景描述，在跨数据集分割转移中得到了验证。

Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering

Authors: Shuliang Liu, Songbo Yang, Dong Fang, Sihang Jia, Yuqi Tang, Lingfeng Su, Ruoshui Peng, Yibo Yan, Xin Zou, Xuming Hu

First: 2026-01-08T17:49:13+00:00 · Latest: 2026-01-08T17:49:13+00:00

Abstract

Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.

中文标题/摘要

标题：视觉-语言内省：通过可解释的双因归向引导减轻MLLM中的过度自信幻觉

物体幻觉严重削弱了多模态大型语言模型的可靠性，通常源于认知内省的根本失败，模型盲目信任语言先验而非具体的视觉证据。现有缓解措施仍有限：对比解码方法仅表面操作而不纠正内部语义错位，而当前的潜在引导方法依赖于静态向量，缺乏实例特定的精确性。我们引入了视觉-语言内省（VLI），这是一种无需训练的推理框架，模拟了元认知的自我纠正过程。VLI 首先进行属性内省，通过概率冲突检测诊断幻觉风险并定位因果视觉锚点。然后使用可解释的双因归向引导主动调节推理过程，动态隔离视觉证据与背景噪声，通过适应性校准消除盲目的自信。VLI 在先进模型上实现了最先进的性能，在MMHal-Bench 上将物体幻觉率降低了12.67%，在POPE 上提高了5.8% 的准确性。

Summary / 总结

The research aims to address the issue of object hallucination in Multimodal Large Language Models (MLLMs) by enhancing their cognitive introspection. The method introduced is Vision-Language Introspection (VLI), which simulates a self-correction process through Attributive Introspection and Interpretable Bi-Causal Steering. VLI detects and localizes hallucination risks and dynamically modulates the inference process to isolate relevant visual evidence, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.

研究旨在通过增强认知反省来解决多模态大型语言模型（MLLMs）中的物体幻觉问题。方法是引入视觉-语言反省（VLI），模拟自我纠正过程，通过属性反省检测和定位幻觉风险，并动态调节推理过程以隔离相关视觉证据，使MMHal-Bench上的物体幻觉率降低12.67%，POPE上的准确率提高5.8%。

FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

Authors: Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo

First: 2025-03-25T17:17:19+00:00 · Latest: 2026-01-08T17:17:54+00:00

Abstract

Finding information in hour-long videos is a challenging task even for top-performing Vision Language Models (VLMs), as encoding visual content quickly exceeds available context windows. To tackle this challenge, we present FALCONEye, a novel video agent based on a training-free, model-agnostic meta-architecture composed of a VLM and a Large Language Model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by calibrated confidence from the VLM's answers. We also introduce the FALCON-Bench benchmark, extending the Question Answering problem to Video Answer Search, which requires models to return both the answer and its supporting temporal window for open-ended questions in hour-long videos. With just a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench. It further demonstrates its generalization capability on the MLVU benchmark, with shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while slashing inference cost by roughly an order of magnitude.
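A confidence-guided temporal search of the kind the abstract describes might look like the sketch below. It is a simplification under stated assumptions: the abstract does not specify the exploration strategy, so this version repeatedly splits the current window into equal parts and zooms into the part where a hypothetical `ask_vlm` callable reports the highest confidence. The thresholds and the branching factor are illustrative values, and the LLM component of the agent is omitted.

```python
from typing import Callable, Tuple

Window = Tuple[float, float]  # (start_s, end_s) within the video


def search_video(
    question: str,
    window: Window,
    ask_vlm: Callable[[str, Window], Tuple[str, float]],
    confidence_threshold: float = 0.9,  # assumed stopping threshold
    min_length_s: float = 60.0,         # assumed shortest window worth splitting
    branches: int = 4,                  # assumed number of sub-windows per step
) -> Tuple[str, Window, float]:
    """Return (answer, supporting window, confidence)."""
    answer, confidence = ask_vlm(question, window)
    while confidence < confidence_threshold and (window[1] - window[0]) > min_length_s:
        start, end = window
        step = (end - start) / branches
        candidates = [(start + i * step, start + (i + 1) * step) for i in range(branches)]
        # Query every sub-window and continue with the most confident one.
        scored = [(ask_vlm(question, c), c) for c in candidates]
        (answer, confidence), window = max(scored, key=lambda item: item[0][1])
    return answer, window, confidence


# Toy VLM: the evidence sits at t = 1234 s and confidence grows as the window shrinks.
def toy_vlm(question: str, w: Window) -> Tuple[str, float]:
    start, end = w
    if start <= 1234.0 < end:
        return ("a red car", min(1.0, 120.0 / (end - start)))
    return ("unknown", 0.0)


answer, support, conf = search_video("What passes the gate?", (0.0, 3600.0), toy_vlm)
```

Returning the window together with the answer mirrors the Video Answer Search setting, where a model has to report both the answer and the temporal span that supports it.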

中文标题/摘要

标题：FALCONEye：利用多模态大语言模型在一小时视频中查找答案并定位内容

在长达一小时的视频中查找信息即使对顶级视觉语言模型（VLMs）来说也是一项具有挑战性的任务，因为编码视觉内容会迅速超出可用的上下文窗口。为了解决这一挑战，我们提出了FALCONEye，这是一种基于无需训练、模型无关的元架构的新型视频代理，该架构由VLM和大语言模型（LLM）组成。FALCONEye使用由VLM答案校准置信度引导的基于探索的搜索算法来回答开放式问题。我们还引入了FALCON-Bench基准测试，将问答问题扩展到视频答案搜索，要求模型返回开放式问题在一小时长视频中的答案及其支持的时间窗口。仅使用一个7B VLM和一个轻量级LLM，FALCONEye在FALCON-Bench中得分超过了所有开源7B VLMs和可比代理。此外，FALCONEye还在包含较短视频和不同任务的MLVU基准测试中展示了其泛化能力，在单一细节任务上超越了GPT-4o，同时将推理成本降低了大约一个数量级。

Summary / 总结

FALCONEye is a novel video agent that uses a VLM and an LLM to answer open-ended questions in one-hour-long videos. It employs an exploration-based search algorithm guided by the VLM's calibrated confidence. FALCONEye outperforms all open-source 7B VLMs and comparable agents on FALCON-Bench and generalizes to the MLVU benchmark, where it surpasses GPT-4o on single-detail tasks while cutting inference cost by roughly an order of magnitude.

FALCONEye 是一个使用 VLM 和 LLM 来回答一小时长视频中的开放性问题的新视频代理。它采用了一种基于探索的搜索算法，并由 VLM 的校准置信度引导。FALCONEye 在 FALCON-Bench 上超越了所有开源的 7B VLM 及其同类代理，并在 MLVU 基准测试中展示了强大的泛化能力，超越了 GPT-4o 在单一细节任务上的表现，同时大幅降低了推理成本。

VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding

Authors: Ignacio de Rodrigo, Alvaro J. Lopez-Lopez, Jaime Boal

First: 2026-01-08T17:15:15+00:00 · Latest: 2026-01-08T17:15:15+00:00

Abstract

This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
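One step of the workflow above, finding the clusters where errors concentrate, can be sketched in a few lines. This assumes the embeddings have already been clustered by some method; the function name `error_prone_clusters` and the 0.5 threshold are illustrative choices, not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List


def error_prone_clusters(
    cluster_ids: List[int],
    is_correct: List[bool],
    min_error_rate: float = 0.5,  # assumed cut-off for "problematic"
) -> Dict[int, float]:
    """Map each cluster whose error rate reaches the cut-off to that error rate."""
    totals: Dict[int, int] = defaultdict(int)
    errors: Dict[int, int] = defaultdict(int)
    for cluster, correct in zip(cluster_ids, is_correct):
        totals[cluster] += 1
        if not correct:
            errors[cluster] += 1
    rates = {c: errors[c] / totals[c] for c in totals}
    return {c: r for c, r in rates.items() if r >= min_error_rate}


# Six evaluation samples spread over three clusters (made-up labels).
flagged = error_prone_clusters(
    cluster_ids=[0, 0, 1, 1, 1, 2],
    is_correct=[True, True, False, False, True, True],
)
```

Samples from the flagged clusters are the ones whose visual features would then be inspected and targeted with additional synthetic data.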

中文标题/摘要

标题：VERSE：视觉嵌入空间探索与缩减. 基于聚类指导的训练数据增强在视觉丰富文档理解中的见解

本文介绍了VERSE，一种用于分析和改进应用于视觉丰富文档理解的视觉语言模型的方法，通过探索其视觉嵌入空间。VERSE使潜在表示的可视化成为可能，支持模型可行性的评估。它还促进了问题区域的识别，并指导生成合成数据以在这些聚类中增强性能。我们通过在合成MERIT数据集上进行训练并在其现实世界对应物MERIT Secret上进行评估来验证该方法。结果表明，VERSE有助于揭示与错误倾向聚类相关的视觉特征，并且使用包含这些特征的样本重新训练显著提高了F1性能，而不会损害泛化能力。此外，我们证明了使用VERSE优化的本地模型（如Donut和Idefics2）在性能上可以与GPT-4和Pixtral等SaaS解决方案相匹敌，甚至超越它们。

Summary / 总结

VERSE is a methodology for enhancing Vision-Language Models in Visually-rich Document Understanding by visualizing and exploring their visual embedding space. It helps identify problematic regions and guides the generation of synthetic data to improve model performance. Experiments show that VERSE can uncover visual features associated with error-prone clusters and retraining with these features significantly improves F1 performance without degrading generalization. VERSE also enables on-premise models to match or surpass the performance of SaaS solutions like GPT-4 and Pixtral.

VERSE 是一种方法，通过探索视觉嵌入空间来提高视觉丰富文档理解中的视觉-语言模型。它可视化潜在表示以识别问题区域，并生成合成数据以增强模型性能。实验表明，VERSE 帮助发现与错误密集区域相关的视觉特征，并通过这些特征重新训练显著提高了 F1 性能，而不会损害泛化能力。此外，VERSE 优化的本地模型可以匹配甚至超越 GPT-4 和 Pixtral 等 SaaS 解决方案的性能。

$π_0$: A Vision-Language-Action Flow Model for General Robot Control

Authors: Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, Ury Zhilinsky

Venue: RSS 2025

First: 2024-10-31T17:22:30+00:00 · Latest: 2026-01-08T17:01:05+00:00

Comments: See project website for videos: https://physicalintelligence.company/blog/pi0 (published in RSS 2025)

Abstract

Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.

中文标题/摘要

标题：$π_0$: 一种视觉-语言-行动流模型用于通用机器人控制

机器人学习在解锁灵活、通用和灵巧的机器人系统潜力以及解决人工智能中最深层次的问题方面具有巨大的前景。然而，将机器人学习提升到有效现实系统所需的通用性水平面临着数据、泛化和鲁棒性方面的重大障碍。在本文中，我们讨论了通用机器人策略（即机器人基础模型）如何应对这些挑战，以及如何设计有效的通用机器人策略以应对复杂和高度灵巧的任务。我们提出了一种基于预训练视觉-语言模型（VLM）的新颖流匹配架构，以继承互联网规模的语义知识。然后，我们讨论了如何使用多种灵巧机器人平台的大规模和多样化数据集对该模型进行训练，包括单臂机器人、双臂机器人和移动操作器。我们从预训练后执行任务的能力、遵循人类和高级VLM策略的语言指令以及通过微调获取新技能等方面评估了该模型。我们的结果涵盖了各种任务，如衣物折叠、桌面清洁和组装盒子。

Summary / 总结

This paper addresses the challenges of general robot learning by proposing a vision-language-action flow model, leveraging a pre-trained vision-language model to inherit semantic knowledge from the Internet. The model is trained on diverse datasets from various robotic platforms and evaluated for its ability to perform tasks in zero-shot settings, follow human language instructions, and acquire new skills through fine-tuning. Key tasks include laundry folding, table cleaning, and assembling boxes.

本文提出了一种视觉-语言-动作流模型，利用预训练的视觉-语言模型继承互联网上的语义知识，以应对通用机器人学习的挑战。该模型在多种机器人平台的数据集上进行训练，并评估其在零样本设置下执行任务、遵循人类语言指令以及通过微调获取新技能的能力。关键任务包括折叠衣物、清理桌子和组装盒子。

POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering

Authors: Yichen Xu, Liangyu Chen, Liang Zhang, Jianzhe Ma, Wenxuan Wang, Qin Jin

First: 2025-07-16T06:09:02+00:00 · Latest: 2026-01-08T17:00:25+00:00

Comments: Work in Progress

Abstract

Charts are a universally adopted medium for data communication, yet existing chart understanding benchmarks are overwhelmingly English-centric, limiting their accessibility and relevance to global audiences. To address this limitation, we introduce PolyChartQA, the first large-scale multilingual benchmark for chart question answering, comprising 22,606 charts and 26,151 QA pairs across 10 diverse languages. PolyChartQA is constructed through a scalable pipeline that enables efficient multilingual chart generation via data translation and code reuse, supported by LLM-based translation and rigorous quality control. We systematically evaluate multilingual chart understanding with PolyChartQA on state-of-the-art LVLMs and reveal a significant performance gap between English and other languages, particularly low-resource ones. Additionally, we introduce a companion multilingual chart question answering training set, PolyChartQA-Train, on which fine-tuning LVLMs yields substantial gains in multilingual chart understanding across diverse model sizes and architectures. Together, our benchmark provides a foundation for developing globally inclusive vision-language models capable of understanding charts across diverse linguistic contexts.

中文标题/摘要

标题：POLYCHARTQA：使用多语言图表问答基准评估大型视觉语言模型

图表是数据交流的普遍采用媒介，但现有的图表理解基准主要以英语为中心，限制了其对全球受众的可访问性和相关性。为解决这一限制，我们引入了PolyChartQA，这是首个大规模多语言图表问答基准，包含22,606张图表和26,151个问答对，覆盖10种不同的语言。PolyChartQA通过可扩展的管道构建，通过数据翻译和代码重用实现高效的多语言图表生成，支持基于LLM的翻译和严格的质量控制。我们系统地使用PolyChartQA对最先进的LVLM进行多语言图表理解评估，并揭示了英语与其他语言之间，尤其是低资源语言之间存在显著的性能差距。此外，我们还引入了PolyChartQA-Train多语言图表问答训练集，在此集上微调LVLM可显著提高多语言图表理解能力，适用于各种模型大小和架构。我们的基准为开发能够跨多种语言环境理解图表的全球包容性视觉语言模型提供了基础。

Summary / 总结

The research aims to address the limitation of existing chart understanding benchmarks being predominantly English-centric. PolyChartQA, a new multilingual benchmark, is introduced, containing 22,606 charts and 26,151 QA pairs in 10 languages. The evaluation shows a significant performance gap between English and other languages, especially low-resource ones. Fine-tuning large vision-language models on PolyChartQA-Train improves multilingual chart understanding across different model sizes and architectures, highlighting the need for globally inclusive models.

研究旨在解决现有图表理解基准主要以英语为中心的问题。PolyChartQA 是一个新的多语言基准，包含 22,606 张图表和 26,151 个问答对，覆盖 10 种语言。评估结果显示，英语和其他语言之间的性能差距很大，尤其是低资源语言。通过 PolyChartQA-Train 对大型视觉语言模型进行微调，可以提高不同模型大小和架构下的多语言图表理解能力，突显了开发全球包容性模型的需求。

GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

Authors: Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu

First: 2026-01-08T16:58:07+00:00 · Latest: 2026-01-08T16:58:07+00:00

Comments: Code available at https://github.com/Zengwh02/GlimpRouter

Abstract

Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.
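The routing rule in the abstract (send a step to the larger model only when the first token's entropy exceeds a threshold) can be written down directly. The sketch below assumes the small model exposes a next-token probability distribution as a dictionary; the threshold of 0.5 nats and the example distributions are made-up values for illustration.

```python
import math
from typing import Dict


def token_entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy, in nats, of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def route_step(first_token_probs: Dict[str, float], threshold: float) -> str:
    """Return which model should generate the rest of the reasoning step."""
    return "large" if token_entropy(first_token_probs) > threshold else "small"


# Hypothetical first-token distributions produced by the small model.
confident = {"So": 0.97, "Wait": 0.02, "Hmm": 0.01}
uncertain = {"So": 0.40, "Wait": 0.35, "Hmm": 0.25}

easy_route = route_step(confident, threshold=0.5)
hard_route = route_step(uncertain, threshold=0.5)
```

Because only one token is generated before the decision, the routing cost stays small compared with approaches that verify a whole step after the fact.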

中文标题/摘要

标题：GlimpRouter：通过窥视一个思维令牌实现高效的协作推理

大型推理模型（LRMs）通过显式生成多步推理链路实现了显著的性能，但这种能力会带来显著的推理延迟和计算成本。协作推理通过在轻量级和大型模型之间选择性地分配工作提供了有希望的解决方案，但仍然存在一个基本挑战：确定推理步骤何时需要大型模型的容量或小型模型的效率。现有的路由策略要么依赖于局部令牌概率，要么进行事后验证，引入了显著的推理开销。在本文中，我们提出了一种新的步骤协作视角：推理步骤的难度可以从其第一个令牌中推断出来。受LRMs中的“顿悟”现象启发，我们表明初始令牌的熵是步骤难度的强预测器。基于这一洞察，我们引入了GlimpRouter，这是一种无需训练的步骤协作框架。GlimpRouter使用一个轻量级模型仅生成每个推理步骤的第一个令牌，并仅当初始令牌的熵超过阈值时才将步骤路由到一个更大的模型。在多个基准上的实验表明，我们的方法在显著减少推理延迟的同时保持了准确性。例如，在AIME25上，与单独的大型模型相比，GlimpRouter在准确率大幅提升10.7%的同时，将推理延迟降低了25.9%。这些结果表明，一种简单而有效的推理机制是：根据思维的一瞥来分配计算，而不是对整个步骤进行评估。

Summary / 总结

GlimpRouter proposes a novel approach to collaborative inference by using the entropy of the first token generated in each reasoning step to predict its difficulty. This method reduces inference latency and computational cost without requiring additional training. Experiments show that GlimpRouter improves accuracy by 10.7% while decreasing inference latency by 25.9% compared to a standalone large model on AIME25.

GlimpRouter通过使用初始令牌的熵来预测推理步骤的难度，提出了一种新的步骤协作方法。与AIME25上的单独大型模型相比，这种方法将推理延迟降低了25.9%，同时准确率提高了10.7%。轻量级模型仅生成每个推理步骤的第一个令牌，并在初始令牌的熵超过阈值时才将步骤路由到一个更大的模型，从而避免不必要的计算开销。

Instruction Tuning with and without Context: Behavioral Shifts and Downstream Impact

Authors: Hyunji Lee, Seunghyun Yoon, Yunjae Won, Hanseok Oh, Geewook Kim, Trung Bui, Franck Dernoncourt, Elias Stengel-Eskin, Mohit Bansal, Minjoon Seo

First: 2025-06-18T14:13:56+00:00 · Latest: 2026-01-08T16:32:25+00:00

Abstract

Instruction tuning is a widely used approach to improve the instruction-following ability of large language models (LLMs). Instruction-tuning datasets typically include a mixture of context-augmented and context-free examples, yet prior work has largely combined these data types without examining their distinct effects. In this paper, we investigate how training LLMs with or without context affects model behavior and downstream performance. First, in the text domain, we show that LLMs trained with context attend more strongly to the provided knowledge, achieving better grounding. We also observe that context-augmented training shifts how LLMs use knowledge: models rely less on parametric knowledge and instead depend more on the provided context. Second, we observe that using an LLM trained with context-augmented data as the backbone for vision-language models reduces hallucination and improves grounding in the visual domain. Finally, we explore practical strategies for real-world deployments where context availability varies. We show that maintaining separate context-augmented and context-free models and routing inputs between them yields more robust overall performance than training a single mixed model, as it better preserves their complementary strengths.

中文标题/摘要

标题：基于上下文和不基于上下文的指令调优：行为变化及下游影响

指令调优是广泛用于提高大型语言模型（LLM）遵循指令能力的一种方法。指令调优数据集通常包含上下文增强和无上下文示例的混合，但先前的工作大多将这些数据类型结合起来而没有考察它们的不同影响。在本文中，我们研究了在有无上下文的情况下训练LLM如何影响模型行为和下游性能。首先，在文本领域，我们表明，使用上下文训练的LLM更强烈地关注提供的知识，从而实现更好的定位。我们还观察到，上下文增强的训练改变了LLM使用知识的方式：模型存储和利用的参数化知识较少，而是更多地依赖提供的上下文。其次，我们观察到，使用基于上下文增强数据训练的LLM作为视觉语言模型的骨干，可以减少幻觉并改善视觉领域的定位。最后，我们探讨了在上下文可用性变化的现实世界部署中实用的策略。我们表明，维护独立的上下文增强和无上下文模型，并在它们之间路由输入，比训练单一混合模型能获得更稳健的整体性能，因为它更好地保留了它们互补的优势。

Summary / 总结

This paper investigates the effects of training large language models (LLMs) with or without context on their instruction-following ability and downstream performance. It finds that context-augmented training improves grounding and shifts how models use knowledge, reducing parametric knowledge reliance and increasing context dependence. The study also shows that using context-augmented LLMs as backbones in vision-language models reduces hallucination and improves visual grounding. Additionally, the research suggests maintaining separate context-augmented and context-free models for robust performance in varying context availability scenarios.

研究探讨了训练大型语言模型（LLMs）时使用或不使用上下文对其指令遵循能力和下游性能的影响。研究发现，带有上下文的训练能够提高模型的接地能力，并使模型更多依赖提供的上下文而非参数化知识，从而减少视觉语言模型中的幻觉现象。研究还建议，在不同上下文可用性场景中维护单独的上下文增强和无上下文模型，以获得更稳健的整体性能。

From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)

Authors: Suyash Mishra, Qiang Li, Srikanth Patil, Anubhav Girdhar

First: 2026-01-08T16:02:56+00:00 · Latest: 2026-01-08T16:02:56+00:00

Comments: Contributed original research to top tier conference in VLM; currently undergoing peer review

Abstract

Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data (e.g., long clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain-adapted Video to Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut & Merge algorithm with fade in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost-efficient end-to-end (e2e) pipeline strategy balancing ALM/VLM enhanced processing. Evaluations on the Video MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate a 3 to 4 times speedup, a 4 times cost reduction, and competitive clip quality. Beyond efficiency gains, we also report that our method improves clip coherence scores (0.348) and informativeness scores (0.721) over state-of-the-art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance-supporting video summarization for life sciences.
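The abstract names a Cut & Merge algorithm with timestamp normalization but does not spell it out. The sketch below shows one plausible reading, stated as an assumption: clamp each selected segment to the video's duration, sort the segments, and merge those that overlap or sit closer than a chosen gap. The function name `normalize_and_merge` and its parameters are hypothetical, and fades are not modelled.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s)


def normalize_and_merge(
    segments: List[Segment],
    duration_s: float,
    max_gap_s: float = 0.0,  # assumed: merge segments separated by at most this gap
) -> List[Segment]:
    # Normalize: clamp timestamps to [0, duration] and drop empty segments.
    clamped: List[Segment] = []
    for start, end in segments:
        start, end = max(0.0, start), min(duration_s, end)
        if end > start:
            clamped.append((start, end))
    clamped.sort()
    # Merge: fuse each segment into the previous one when they overlap or nearly touch.
    merged: List[Segment] = []
    for start, end in clamped:
        if merged and start - merged[-1][1] <= max_gap_s:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Out-of-range and overlapping picks for a 100-second video (made-up values).
clips = normalize_and_merge(
    [(50.0, 70.0), (-5.0, 10.0), (8.0, 20.0), (95.0, 130.0)], duration_s=100.0
)
```

Merging before cutting avoids emitting two clips that replay the same footage, which is one way to keep transitions smooth.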

中文标题/摘要

标题：从理解到参与：基于视觉语言模型的个性化药房视频片段生成

视觉语言模型（VLMs）有望通过实现智能、可扩展和自动化的多模态内容处理来革新制药行业的数字化转型。传统的异构数据模态（文本、图像、视频、音频和网页链接）的手动标注容易导致不一致、内容质量下降和内容利用效率低下。大量的长视频和音频数据进一步加剧了这些挑战（例如，长期的临床试验访谈和教育研讨会）。本文介绍了一种针对药房领域的视频到视频片段生成框架，该框架结合了音频语言模型（ALMs）和视觉语言模型（VLMs）以生成高光片段。我们的贡献包括三个方面：（i）一种可复现的剪辑与合并算法，带有淡入淡出和时间戳规范化，确保平滑过渡和音视频对齐；（ii）基于角色定义和提示注入的个性化机制，以生成定制输出（营销、培训、监管）；（iii）一种成本效益高的端到端管道策略，平衡了ALM/VLM增强处理。在Video MME基准（900）和我们自有的包含16,159个药房视频的14个疾病领域的数据集上进行的评估表明，该方法实现了3到4倍的速度提升，4倍的成本降低，并且片段质量与最先进的VLM基线（如Gemini 2.5 Pro）相当。除了效率提升，我们还报告了我们的方法提高了片段连贯性评分（0.348）和信息量评分（0.721），突显了透明、自提取和合规支持的视频摘要在生命科学领域的潜力。

Summary / 总结

This study aims to enhance the digital transformation of the pharmaceutical industry by leveraging Vision Language Models (VLMs) and Audio Language Models (ALMs) for automated video content processing. The research introduces a framework that includes a reproducible Cut & Merge algorithm and a personalization mechanism based on role definitions. Evaluations show a 3 to 4 times speedup, 4 times cost reduction, and improved clip coherence and informativeness scores compared to state-of-the-art VLMs.

该研究引入了一种基于音频和视觉语言模型的领域适应视频到视频剪辑生成框架，用于生成制药行业的个性化高光剪辑。该框架包括一个可重复的剪切与合并算法和个人化机制，基于角色定义。评估表明，与最先进的VLM相比，该方法在基准和自有数据集上的速度提高了3到4倍，成本降低了4倍，并且剪辑的连贯性和信息性得分更高。

Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform

Authors: Suyash Mishra, Qiang Li, Srikanth Patil, Satyanarayan Pati, Baddu Narendra

First: 2026-01-08T12:42:17+00:00 · Latest: 2026-01-08T12:42:17+00:00

Comments: Submitted to the Industry Track of Top Tier Conference; currently under peer review

Abs · PDF · Code1 · Code2

Abstract

Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4, M4V), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3-8 times efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8/12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. Rather than proposing a new "A+B" model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provides actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.

中文标题/摘要

标题：在工业级GenAI平台上扩展视觉语言模型以处理制药长格式视频推理

视觉语言模型（VLMs）在多模态推理任务中表现出强大的性能，但大多数评估集中在短视频上，并假设不受限制的计算资源。在制药内容理解等工业环境中，从业人员必须在严格的GPU、延迟和成本约束下处理长格式视频，而许多现有方法无法扩展。在本研究中，我们提出了一种工业级GenAI框架，处理了超过200,000个PDF、25,326个八种格式（例如MP4、M4V等）的视频以及888个多语言音频文件，涉及20多种语言。我们的研究做出了三项贡献：（i）制药领域的大规模多模态推理工业架构；（ii）在两个领先基准（Video-MME和MMBench）和包含25,326个视频的14个疾病领域的专有数据集上对超过40个VLMs的实证分析；（iii）关于长格式视频推理的四项发现：多模态的作用、注意力机制权衡、时间推理限制以及在GPU约束下的视频分割挑战。结果表明，在普通GPU上使用SDPA注意力机制可获得3-8倍的效率提升，多模态最多可在8/12个任务领域（尤其是长度依赖任务）上提高性能，开源和闭源VLMs在时间对齐和关键帧检测方面存在明显瓶颈。本文并未提出新的“A+B”模型，而是对在现实部署约束下当前VLMs的实用极限、权衡和失败模式进行了描述，并为研究人员和从业者设计可扩展的多模态系统提供了实用指导，以用于工业领域的长格式视频理解。

Summary / 总结

This work addresses the scalability of Vision Language Models (VLMs) for processing long-form videos in pharmaceutical content understanding, where strict computational constraints are common. The study evaluates over 40 VLMs on industrial-scale datasets and benchmarks, highlighting the importance of multimodality, attention mechanisms, and temporal reasoning. Key findings include efficiency gains with SDPA attention, improved performance in length-dependent tasks through multimodality, and identified bottlenecks in temporal alignment and keyframe detection. The research provides practical insights and actionable guidance for deploying VLMs in industrial settings.

该研究针对工业环境中，特别是制药内容理解领域，Vision Language Models (VLMs) 处理长视频的可扩展性问题。研究提出了一种工业级的GenAI框架，处理了超过20万份PDF、25,326个视频和888个多语言音频文件。关键发现包括SDPA注意力机制的效率提升、多模态性带来的性能改进以及在时间对齐和关键帧检测方面的局限性。研究提供了在现实部署条件下设计可扩展多模态系统的实用见解和行动指南。

SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models

Authors: Oriol Rabasseda, Zenjie Li, Kamal Nasrollahi, Sergio Escalera

Venue: WACV

First: 2026-01-08T10:58:59+00:00 · Latest: 2026-01-08T10:58:59+00:00

Comments: This work has been accepted at Real World Surveillance: Applications and Challenges, 6th (in WACV Workshops)

Abs · PDF · Code1 · Code2

Abstract

Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.

中文标题/摘要

标题：SOVABench：一种针对多模态大语言模型的车辆监视动作检索基准

自动识别事件和重复行为分析是视频监视的关键。然而，大多数现有的基于内容的视频检索基准主要关注场景相似性，而不评估监视所需的动作区分。为了解决这一差距，我们引入了SOVABench（监视相反车辆动作基准），这是一个基于监视视频构建的真实世界检索基准，专注于车辆相关动作。SOVABench 定义了两种评估协议（跨对和同对），以评估跨动作区分和时间方向理解。尽管动作区分对人类观察者来说通常很直观，但我们的实验表明，它们仍然对最先进的视觉和多模态模型构成挑战。利用多模态大语言模型（MLLM）的视觉推理和指令跟随能力，我们提出了一种无需训练的框架，用于从MLLM生成的描述中生成可解释的嵌入，适用于图像和视频。该框架在SOVABench 以及几个对比视觉-语言模型经常失败的空间和计数基准上都取得了良好的性能。基准的代码、注释和构建说明已公开。

Summary / 总结

SOVABench is a new benchmark for vehicle surveillance action retrieval, addressing the gap in existing benchmarks that focus on scene-level similarity rather than action discrimination. It evaluates models through two protocols: inter-pair and intra-pair, assessing cross-action discrimination and temporal direction understanding. Despite being intuitive for humans, these tasks remain challenging for state-of-the-art models. A training-free framework using Multimodal Large Language Models (MLLMs) generates interpretable embeddings from MLLM descriptions, achieving strong performance on SOVABench and other benchmarks where contrastive Vision-Language Models often fail.

SOVABench 是一个新的车辆监视动作检索基准，填补了现有基准主要关注场景相似性而非动作区分的空白。它通过两个协议进行评估：跨动作对和同动作对，分别评估动作区分能力和时间方向理解能力。尽管这些任务对人类来说很直观，但对最先进的模型仍然具有挑战性。一个无需训练的框架利用多模态大型语言模型（MLLMs）生成从 MLLM 描述中提取的可解释嵌入，这些嵌入在 SOVABench 和其他基准测试中表现出色，而这些基准测试往往是对比视觉语言模型的弱项。

Agentic Retoucher for Text-To-Image Generation

Authors: Shaocheng Shen, Jianfeng Liang, Chunlei Cai, Cong Geng, Huiyu Duan, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai

First: 2026-01-05T12:06:43+00:00 · Latest: 2026-01-08T10:57:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, face, text and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.

中文标题/摘要

标题：代理修图师：用于文本到图像生成

文本到图像（T2I）扩散模型如SDXL和FLUX已经实现了令人印象深刻的写实效果，但在肢体、面部、文本等方面仍然普遍存在小规模失真。现有的精修方法要么进行昂贵的迭代重新生成，要么依赖于弱空间定位的视觉语言模型（VLMs），导致语义漂移和不可靠的局部编辑。为了解决这一问题，我们提出了一种名为代理修图师的分层决策驱动框架，将后生成修正重新定义为类似人类感知-推理-行动的循环。具体来说，我们设计了（1）一个感知代理，学习在文本-图像一致性线索下的细粒度失真定位的上下文显著性，（2）一个推理代理，通过逐步偏好对齐进行符合人类的推断诊断，以及（3）一个行动代理，根据用户偏好自适应地计划局部修复。该设计将感知证据、语言推理和可控修正整合到一个统一的、自我修正的决策过程中。为了实现细粒度的监督和定量评估，我们进一步构建了包含6000张T2I图像和27000个注释缺陷区域的12个类别的GenBlemish-27K数据集。广泛的实验表明，代理修图师在感知质量、失真定位和人类偏好对齐方面始终优于最先进的方法，建立了自修正和感知可靠的T2I生成的新范式。

Summary / 总结

Agentic Retoucher is a hierarchical framework designed to correct small-scale distortions in text-to-image generation, addressing limitations of existing methods. It includes a perception agent for fine-grained distortion localization, a reasoning agent for human-aligned inferential diagnosis, and an action agent for localized inpainting. The framework integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified decision process. Experiments show that Agentic Retoucher outperforms state-of-the-art methods in perceptual quality, distortion localization, and human preference alignment, setting a new standard for self-corrective and perceptually reliable T2I generation.

研究旨在解决SDXL和FLUX等文本到图像生成模型中的小规模失真问题。提出了一个层次框架Agentic Retoucher，将其后生成修正过程重新构想为感知-推理-行动循环。该框架包括用于精细失真定位的感知代理、用于人类对齐的推理诊断代理以及根据用户偏好进行自适应局部修复的行动代理。该框架将感知证据、语言推理和可控修正统一到一个自纠正决策过程中。实验表明，Agentic Retoucher在感知质量、失真定位和人类偏好对齐方面优于现有方法，为自纠正和感知可靠的文本到图像生成设定了新标准。

AECV-Bench: Benchmarking Multimodal Models on Architectural and Engineering Drawings Understanding

Authors: Aleksei Kondratenko, Mussie Birhane, Houssame E. Hsain, Guido Maciocci

First: 2026-01-08T10:54:32+00:00 · Latest: 2026-01-08T10:54:32+00:00

Abs · PDF · Code1 · Code2

Abstract

AEC drawings encode geometry and semantics through symbols, layout conventions, and dense annotation, yet it remains unclear whether modern multimodal and vision-language models can reliably interpret this graphical language. We present AECV-Bench, a benchmark for evaluating multimodal and vision-language models on realistic AEC artefacts via two complementary use cases: (i) object counting on 120 high-quality floor plans (doors, windows, bedrooms, toilets), and (ii) drawing-grounded document QA spanning 192 question-answer pairs that test text extraction (OCR), instance counting, spatial reasoning, and comparative reasoning over common drawing regions. Object-counting performance is reported using per-field exact-match accuracy and MAPE results, while document-QA performance is reported using overall accuracy and per-category breakdowns with an LLM-as-a-judge scoring pipeline and targeted human adjudication for edge cases. Evaluating a broad set of state-of-the-art models under a unified protocol, we observe a stable capability gradient: OCR and text-centric document QA are strongest (up to 0.95 accuracy), spatial reasoning is moderate, and symbol-centric drawing understanding, especially reliable counting of doors and windows, remains unsolved (often 0.40-0.55 accuracy) with substantial proportional errors. These results suggest that current systems function well as document assistants but lack robust drawing literacy, motivating domain-specific representations and tool-augmented, human-in-the-loop workflows for efficient AEC automation.

中文标题/摘要

标题：AECV-Bench：在建筑和工程图纸理解上的多模态模型基准测试

建筑和工程(AEC)图纸通过符号、布局规范和密集注释来编码几何和语义，但尚不清楚现代多模态和视觉-语言模型是否能可靠地解释这种图形语言。我们提出了AECV-Bench，这是一个基准测试，通过两个互补的应用场景来评估多模态和视觉-语言模型在现实AEC制品上的表现：(i) 在120份高质量的楼层平面图上进行物体计数（门、窗、卧室、卫生间），(ii) 包含192个问题-答案对的图纸指导文档问答，测试文本提取（OCR）、实例计数、空间推理和对常见图纸区域的比较推理。物体计数性能使用每个字段的精确匹配准确率和MAPE结果报告，而文档问答性能使用总体准确率和按类别细分的评分管道报告，并通过LLM作为法官的评分流程和针对边缘情况的人工复核。在统一协议下评估一系列最先进的模型，我们观察到一个稳定的性能梯度；文本提取和文本中心的文档问答表现最强（高达0.95的准确率），空间推理表现适中，而以符号为中心的图纸理解——尤其是可靠的门和窗计数——仍然未解决（通常0.40-0.55的准确率），存在大量比例错误。这些结果表明，当前系统在文档助手方面表现良好，但在绘制图的阅读能力方面缺乏稳健性，这激励了针对特定领域的表示和工具增强的人在环工作流程，以实现高效的AEC自动化。

Summary / 总结

AECV-Bench evaluates multimodal and vision-language models on architectural and engineering drawings through object counting and drawing-grounded document QA. The benchmark uses 120 floor plans for object counting and 192 question-answer pairs for document QA. Results show strong performance in OCR and text extraction, moderate spatial reasoning, and poor accuracy in symbol-centric drawing understanding, particularly for counting doors and windows. This highlights the need for domain-specific representations and human-in-the-loop workflows for AEC automation.

AECV-Bench 通过对象计数和图纸相关的文档问答评估多模态和视觉-语言模型在建筑和工程图纸上的能力。基准使用120个楼层平面图进行对象计数，192个问题-答案对进行文档问答。结果显示，在OCR和文本提取方面表现出色，在空间推理方面表现一般，在符号相关的图纸理解方面，尤其是门窗计数方面表现较差。这表明当前系统在文档助手方面表现良好，但在图纸阅读能力方面仍需改进，需要特定领域的表示和人工在环的工作流程以实现高效的AEC自动化。
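
The two object-counting metrics named in the abstract, per-field exact-match accuracy and MAPE, can be illustrated with a short Python sketch. The data layout (one dict of counts per floor plan) and the choice to skip plans whose true count is zero when computing MAPE are assumptions made here for illustration; the benchmark's own scoring code may differ.

```python
from typing import Dict, List

Counts = Dict[str, int]  # e.g. {"doors": 4, "windows": 8}


def exact_match_accuracy(pred: List[Counts], gold: List[Counts], field: str) -> float:
    """Fraction of floor plans whose predicted count for `field` is exactly right."""
    hits = sum(1 for p, g in zip(pred, gold) if p[field] == g[field])
    return hits / len(gold)


def mape(pred: List[Counts], gold: List[Counts], field: str) -> float:
    """Mean absolute percentage error for `field`, in percent.
    Plans with a true count of zero are skipped (assumption) to avoid
    dividing by zero."""
    errors = [
        abs(p[field] - g[field]) / g[field]
        for p, g in zip(pred, gold)
        if g[field] != 0
    ]
    return 100.0 * sum(errors) / len(errors)


gold = [{"doors": 4, "windows": 8}, {"doors": 5, "windows": 10}]
pred = [{"doors": 4, "windows": 6}, {"doors": 4, "windows": 10}]

door_acc = exact_match_accuracy(pred, gold, "doors")
door_mape = mape(pred, gold, "doors")
window_mape = mape(pred, gold, "windows")
```

Reporting both metrics is informative because exact match treats an off-by-one error and an off-by-ten error the same, while MAPE captures how large the miss is relative to the true count.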

Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization

Authors: Jiwei Guan, Haibo Jin, Haohan Wang

First: 2026-01-05T02:49:33+00:00 · Latest: 2026-01-08T10:46:04+00:00

Comments: EACL

Abs · PDF · Code1 · Code2

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have shown groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attack methods require full model accessibility, suffer from high computing costs, and exhibit insufficient adversarial transferability, making them impractical for real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation by input-output interactions without requiring model knowledge, (ii) model-agnostic optimization without a surrogate model, and (iii) lower resource requirements with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs, including InstructBLIP, LLaVA and MiniGPT-4, achieving the highest jailbreak success rate of 83.0% on InstructBLIP, while maintaining imperceptible perturbations comparable to white-box methods. Moreover, adversarial examples generated from MiniGPT-4 exhibit strong transferability to other LVLMs, with ASR reaching 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs.

中文标题/摘要

标题：使用黑盒优化构建针对大型视觉-语言模型的对抗输入

大型视觉-语言模型（LVLMs）在多种跨模态任务中展现了突破性的能力。然而，这些模型仍然容易受到对抗性脱狱攻击的影响，攻击者通过施加微妙的扰动来绕过安全机制并触发有害输出。现有的白盒攻击方法需要完全访问模型，计算成本高且对抗性转移性不足，使其在实际的黑盒环境中不切实际。为了解决这些限制，我们提出了一种通过零阶优化（ZO-SPSA）使用同时扰动随机近似（Simultaneous Perturbation Stochastic Approximation）对LVLMs进行黑盒脱狱攻击的方法。ZO-SPSA提供了三个关键优势：(i) 无需模型知识的输入输出交互的无梯度近似，(ii) 不依赖于替代模型的模型无关优化，(iii) 降低资源需求，减少GPU内存消耗。我们在包括InstructBLIP、LLaVA和MiniGPT-4在内的三个LVLMs上评估了ZO-SPSA，实现了InstructBLIP上最高的脱狱成功率83.0%，同时保持与白盒方法相当的不可感知扰动。此外，从MiniGPT-4生成的对抗性示例在其他LVLMs上表现出强大的转移性，ASR达到64.18%。这些发现强调了黑盒脱狱在实际环境中的可行性，并揭示了当前LVLMs安全机制中的关键弱点

Summary / 总结

This study addresses the vulnerability of Large Vision-Language Models (LVLMs) to adversarial attacks by proposing a black-box jailbreak attack using Zeroth-Order optimization with Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). This method does not require model knowledge, is model-agnostic, and has lower resource requirements. Experiments on InstructBLIP, LLaVA, and MiniGPT-4 showed a jailbreak success rate of up to 83.0% (on InstructBLIP) and strong transferability: adversarial examples generated from MiniGPT-4 reached an ASR of 64.18% on other LVLMs. These results highlight the real-world feasibility of black-box attacks and the need for improved safety mechanisms in LVLMs.

该研究通过提出基于零阶优化的Simultaneous Perturbation Stochastic Approximation (ZO-SPSA) 方法，解决大型视觉-语言模型（LVLMs）对黑盒攻击的脆弱性问题。该方法无需模型知识，具有模型无关性，并减少资源消耗。实验表明，在InstructBLIP、LLaVA和MiniGPT-4上的破解成功率高达83.0%，且生成的对抗样本在其他模型上具有较强的迁移性，突显了LVLMs安全机制的改进需求。
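
For readers unfamiliar with SPSA, the general estimator can be sketched on a harmless toy problem. The Python below is not the paper's attack: it minimizes a simple quadratic using only function evaluations, which is the "gradient-free approximation by input-output interactions" idea in its generic form. The names `spsa_gradient`, `minimize`, and `toy_loss`, along with the step size `lr` and perturbation size `c`, are illustrative assumptions.

```python
import random
from typing import Callable, List

Vector = List[float]


def spsa_gradient(loss: Callable[[Vector], float], x: Vector, c: float,
                  rng: random.Random) -> Vector:
    """Estimate the gradient of `loss` at `x` from two evaluations,
    perturbing all coordinates at once with a random +/-1 direction."""
    delta = [rng.choice((-1.0, 1.0)) for _ in x]
    x_plus = [xi + c * di for xi, di in zip(x, delta)]
    x_minus = [xi - c * di for xi, di in zip(x, delta)]
    diff = loss(x_plus) - loss(x_minus)
    return [diff / (2.0 * c * di) for di in delta]


def minimize(loss: Callable[[Vector], float], x0: Vector, steps: int = 200,
             lr: float = 0.1, c: float = 0.01, seed: int = 0) -> Vector:
    """Plain gradient descent driven by SPSA estimates (no true gradients)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        grad = spsa_gradient(loss, x, c, rng)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


def toy_loss(x: Vector) -> float:
    """Toy objective with its minimum at (1, 1, 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


start = [0.0, 0.0, 0.0]
result = minimize(toy_loss, start)
```

Each step costs two loss evaluations regardless of the input dimension, which is why this family of estimators suits settings where only model outputs are observable.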

CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics

Authors: Dahyeon Kye, Jeahun Sung, Minkyu Jeon, Jihyong Oh

First: 2025-12-08T04:39:12+00:00 · Latest: 2026-01-08T10:29:58+00:00

Comments: Please visit our project page at https://cmlab-korea.github.io/CHIMERA/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down-, mid-, and up-block features from both inputs during DDIM inversion and re-injects them adaptively during denoising, enabling spatial and semantic alignment in depth- and time-adaptive manners as well as natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.

中文标题/摘要

标题：CHIMERA：自适应缓存注入与语义锚点提示在基于扩散模型的零样本图像变形中的应用及其评价指标

扩散模型在生成能力方面表现出色，但在实现平滑且语义一致的图像变形方面仍面临挑战。现有方法往往由于缺乏自适应结构和语义对齐而产生突兀的过渡或过度饱和的外观。我们提出CHIMERA，一种基于扩散模型的零样本框架，将变形视为缓存反演引导的去噪过程。为处理大规模的语义和外观差异，我们提出了自适应缓存注入和语义锚点提示。自适应缓存注入（ACI）在DDIM反演过程中缓存输入的低、中、高层特征，并在去噪过程中适当地重新注入，从而在深度和时间自适应的方式下实现空间和语义对齐，并实现自然特征融合和平滑过渡。语义锚点提示（SAP）利用视觉-语言模型生成共享的锚点提示，作为语义锚点，连接不相似的输入，并引导去噪过程向一致的结果发展。最后，我们引入全局-局部一致性评分（GLCS），这是一种面向变形的评价指标，同时评估两个输入的全局协调性和局部变形的平滑度。广泛的实验和用户研究显示，CHIMERA实现了比现有方法更平滑且更语义一致的过渡，建立了图像变形的新基准。代码和项目页面将公开发布。

Summary / 总结

CHIMERA is a zero-shot diffusion-based framework that addresses the challenge of achieving smooth and semantically consistent image morphing. It introduces Adaptive Cache Injection and Semantic Anchor Prompting to handle large semantic and appearance disparities. Experiments demonstrate that CHIMERA outperforms existing methods in producing smoother and more semantically aligned transitions, setting a new state of the art in image morphing. The framework evaluates morphing quality using the Global-Local Consistency Score (GLCS).

CHIMERA 是一个零样本扩散基础框架，旨在解决实现平滑且语义一致的图像变形的挑战。它引入了自适应缓存注入和语义锚点提示来处理大规模的语义和外观差异。实验表明，CHIMERA 在生成更平滑且语义对齐的过渡方面优于现有方法，建立了图像变形的新基准。该框架使用全局-局部一致性评分（GLCS）来评估变形质量。

ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting

Authors: Yen-Jen Chiou, Wei-Tse Cheng, Yuan-Fu Yang

First: 2026-01-08T09:20:46+00:00 · Latest: 2026-01-08T09:20:46+00:00

Comments: 10 pages, 5 figures

Abs · PDF · Code1 · Code2

Abstract

We present ProFuse, an efficient context-aware framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). The pipeline enhances cross-view consistency and intra-mask cohesion within a direct registration setup, adding minimal overhead and requiring no render-supervised fine-tuning. Instead of relying on a pretrained 3DGS scene, we introduce a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry while jointly constructing 3D Context Proposals via cross-view clustering. Each proposal carries a global feature obtained through weighted aggregation of member embeddings, and this feature is fused onto Gaussians during direct registration to maintain per-primitive language coherence across views. With associations established in advance, semantic fusion requires no additional optimization beyond standard reconstruction, and the model retains geometric refinement without densification. ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene, which is two times faster than SOTA.

中文标题/摘要

标题：ProFuse：开放词汇3D高斯点云融合的高效跨视图上下文融合

我们提出了ProFuse，一种基于3D高斯点云（3DGS）的开放词汇3D场景理解的高效上下文感知框架。该流水线在直接配准设置中增强跨视图一致性及掩膜内的内聚性，增加的开销极小，无需渲染监督微调。我们引入了一种密集对应关系引导的预配准阶段，该阶段以准确的几何形状初始化高斯点，同时通过跨视图聚类联合构建3D上下文提案。每个提案携带一个通过加权聚合成员嵌入获得的全局特征，并在直接配准过程中将该特征融合到高斯点上，以保持视图间的语言一致性。通过预先建立的关联，语义融合无需额外优化，且模型在无需密集化的情况下保留几何细化。ProFuse在每场景约五分钟内实现强大的开放词汇3DGS理解，比当前最佳方案快两倍。

Summary / 总结

ProFuse is an efficient framework for open-vocabulary 3D scene understanding using 3D Gaussian Splatting. It enhances cross-view consistency and intra-mask cohesion through a dense correspondence-guided pre-registration phase and cross-view clustering, without requiring render-supervised fine-tuning. The model achieves strong open-vocabulary 3DGS understanding and completes semantic attachment in about five minutes per scene, which is two times faster than the state-of-the-art.

ProFuse 是一种使用 3D 高斯点积进行开放词汇 3D 场景理解的高效框架。通过密集对应关系引导的预注册阶段和跨视图聚类，该方法增强了跨视图一致性和内部掩码的连贯性。此方法以准确的几何形状初始化高斯点，并在直接注册过程中融合全局特征，保持视图间的语义连贯性。ProFuse 每个场景的语义连接只需约五分钟，比当前最先进的方法快两倍。

Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition

Authors: Masatomo Yoshida, Haruto Namura, Nicola Adami, Masahiro Okuda

Venue: Proc. ITC-CSCC 2025

First: 2026-01-08T09:15:27+00:00 · Latest: 2026-01-08T09:15:27+00:00

Comments: accepted to ITC-CSCC 2025

Abs · PDF · Code1 · Code2

Abstract

This work explores the visual capabilities and limitations of foundation models by introducing a novel adversarial attack method utilizing skeletonization to reduce the search space effectively. Our approach specifically targets images containing text, particularly mathematical formula images, which are more challenging due to their LaTeX conversion and intricate structure. We conduct a detailed evaluation of both character and semantic changes between original and adversarially perturbed outputs to provide insights into the models' visual interpretation and reasoning abilities. The effectiveness of our method is further demonstrated through its application to ChatGPT, which shows its practical implications in real-world scenarios.

中文标题/摘要

标题：基于骨架化的大规模视觉语言模型数学文本识别的对抗性扰动

本研究通过引入一种利用骨架化减少搜索空间的新颖对抗攻击方法，探索基础模型的视觉能力和局限性。我们的方法特别针对包含文本的图像，尤其是由于其LaTeX转换和复杂的结构，数学公式图像更具挑战性。我们详细评估了原始输出和对抗性扰动输出之间的字符和语义变化，以提供模型视觉解释和推理能力的见解。通过将其应用于ChatGPT，进一步证明了该方法的有效性及其在实际场景中的实际意义。

Summary / 总结

This study investigates the visual recognition capabilities of large vision-language models by employing a skeletonization-based adversarial attack method. The method targets mathematical formula images, reducing the search space and evaluating character and semantic changes. The findings highlight the models' limitations in visual interpretation and reasoning, with practical implications shown through application to ChatGPT.

该研究通过采用基于骨架化的对抗攻击方法，考察大型视觉-语言模型的视觉能力。该方法针对数学公式图像，减少搜索空间，并评估原始图像与对抗扰动图像之间的字符和语义变化。研究结果揭示了模型在视觉解释和推理方面的局限性，并通过应用到ChatGPT展示了其实用意义。

Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning

Authors: Wentao Zhang, Lifei Wang, Lina Lu, MingKun Xu, Shangyang Li, Yanchao Yang, Tao Fang

Venue: ACL 2026 long

First: 2026-01-08T07:34:37+00:00 · Latest: 2026-01-08T07:34:37+00:00

Comments: This paper is submitted for review to ACL 2026. It is 17 pages long and includes 5 figures. The corresponding authors are Tao Fang and Lina Lu

Abs · PDF · Code1 · Code2

Abstract

Agricultural disease diagnosis challenges VLMs, as conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. While reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose Agri-R1, a reasoning-enhanced large model for agriculture. Our framework automates high-quality reasoning data generation via vision-language synthesis and LLM-based filtering, using only 19% of available samples. Training employs Group Relative Policy Optimization (GRPO) with a novel reward function that integrates domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility in open-ended responses. Evaluated on CDDMBench, our resulting 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines, showing a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, with benefits scaling as question complexity increases.

中文标题/摘要

标题：Agri-R1：通过强化学习增强农业视觉语言模型的一般化农业推理能力

农业疾病诊断对VLM构成挑战，因为传统的微调需要大量标签，缺乏可解释性且泛化能力差。尽管推理可以提高模型的鲁棒性，但现有方法依赖昂贵的专家注释，很少解决农业查询的开放性和多样性。为解决这些局限性，我们提出了Agri-R1，一种推理增强的农业大模型。我们的框架通过视觉语言合成和基于LLM的过滤自动生成高质量的推理数据，仅使用可用样本的19%。训练使用组相对策略优化（GRPO）和一个新颖的奖励函数，该函数结合领域特定词汇和模糊匹配来评估开放性回答的正确性和语言灵活性。在CDDMBench上评估，我们的3B参数模型取得了与7B到13B参数基线模型相当的性能；相较于标准微调，疾病识别准确率相对提升23.2%，农业知识问答提升33.3%，跨域泛化提高26.10分。消融研究证实，结构化推理数据与GRPO驱动的探索之间的协同作用是这些改进的基础，随着问题复杂性的增加，这种优势会进一步增强。

Summary / 总结

Agri-R1 addresses the limitations of conventional fine-tuning for agricultural disease diagnosis by proposing a reasoning-enhanced framework. It generates high-quality reasoning data through vision-language synthesis and LLM-based filtering, requiring only 19% of available samples. Training uses Group Relative Policy Optimization with a reward function that integrates domain-specific lexicons and fuzzy matching. The resulting 3B-parameter model is competitive with 7B- to 13B-parameter baselines and, relative to standard fine-tuning, achieves a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization.

Agri-R1 通过提出一种增强推理的框架来解决传统微调在农业疾病诊断中的局限性。该框架使用视觉语言合成和基于LLM的数据过滤生成高质量的推理数据，仅需使用19%的可用样本。模型使用组相对策略优化，并结合领域特定词汇和模糊匹配的奖励函数，实现了与更大基线模型竞争的性能，并在疾病识别和农业知识问答方面取得了显著改进。
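
Two ingredients mentioned in the abstract, a reward that mixes lexicon matching with fuzzy matching and GRPO's group-relative scoring, can be sketched in Python. This is an illustration under stated assumptions, not the paper's reward: the equal weighting `w_fuzzy`, the use of `difflib.SequenceMatcher` for fuzzy similarity, the lowercase lexicon, and the population standard deviation in the normalization are all choices made here. The normalization itself, each reward minus the group mean divided by the group standard deviation, follows the commonly described GRPO formulation.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev
from typing import List, Set


def reward(response: str, reference: str, lexicon: Set[str],
           w_fuzzy: float = 0.5) -> float:
    """Blend fuzzy string similarity with coverage of domain terms
    (lowercase lexicon entries) that appear in the reference answer."""
    resp, ref = response.lower(), reference.lower()
    fuzzy = SequenceMatcher(None, resp, ref).ratio()
    ref_terms = {term for term in lexicon if term in ref}
    if ref_terms:
        coverage = sum(term in resp for term in ref_terms) / len(ref_terms)
    else:
        coverage = 0.0
    return w_fuzzy * fuzzy + (1.0 - w_fuzzy) * coverage


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each sampled response relative to its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


lexicon = {"late blight", "powdery mildew"}
reference = "The leaf shows late blight."
group = [
    "The leaf shows late blight.",
    "This looks like powdery mildew.",
    "The plant is healthy.",
]
rewards = [reward(resp, reference, lexicon) for resp in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed within a group of responses to the same prompt, no separate value model is needed to decide which samples to reinforce.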

GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning

Authors: Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuxin Hu

First: 2026-01-07T17:26:41+00:00 · Latest: 2026-01-08T06:19:12+00:00

Abs · PDF · Code1 · Code2

Abstract

The evolution of Remote Sensing Vision-Language Models (RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.

中文标题/摘要

标题：GeoReason: 通过逻辑一致性强化学习使遥感视觉语言模型的思考与回答保持一致

遥感视觉语言模型(RS-VLMs)的发展强调了从感知中心的识别向高级演绎推理过渡的重要性，以增强复杂空间任务中的认知可靠性。然而，当前的模型往往会出现逻辑幻觉，即正确的答案是基于有缺陷的推理链或依赖于位置捷径而非空间逻辑得出的。这种脱节削弱了在战略空间决策中的可靠性。为了解决这一问题，我们提出了GeoReason框架，旨在使内部思考与最终决策同步。我们首先构建了GeoReason-Bench，这是一个逻辑驱动的数据集，包含4,000条从几何原语和专家知识中合成的推理轨迹。然后，我们制定了两阶段训练策略：(1) 监督知识初始化，以使模型具备推理语法和领域专业知识；(2) 一致性感知强化学习，以提高演绎可靠性。这一阶段整合了一种新颖的逻辑一致性奖励，通过选项排列策略惩罚逻辑漂移，以确保决策基于可验证的推理轨迹。实验结果表明，我们的框架显著提高了RS-VLMs的认知可靠性和可解释性，达到了与其他先进方法相比的最先进性能。

Summary / 总结

GeoReason is a framework designed to improve the cognitive reliability of Remote Sensing Vision-Language Models (RS-VLMs) by aligning internal reasoning with final decisions. It introduces a two-stage training strategy: supervised knowledge initialization to equip the model with reasoning syntax and domain expertise, followed by consistency-aware reinforcement learning that penalizes logical drift through an option permutation strategy. This approach enhances the interpretability and reliability of RS-VLMs, achieving state-of-the-art performance in experiments.

GeoReason 是一个框架，旨在通过使内部推理与最终决策保持一致来提高遥感视觉语言模型（RS-VLMs）的认知可靠性。它采用两阶段训练策略：监督知识初始化，使模型具备推理语法和领域专业知识，随后是通过逻辑一致性奖励来惩罚逻辑漂移的一致性感知强化学习。这种方法提高了 RS-VLMs 的可解释性和可靠性，并在实验中达到了最先进的性能。
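
The option permutation idea can be made concrete with a small Python sketch. The abstract does not define the Logical Consistency Reward, so what follows is only one plausible reading: the same multiple-choice question is asked with its options shuffled, and the score is the fraction of shuffles in which the chosen option content matches the original choice. The function `consistency_reward`, the `answer_fn` callback (a stand-in for a model), and the assumption that option strings are distinct are all hypothetical.

```python
import random
from typing import Callable, List, Sequence


def consistency_reward(options: Sequence[str],
                       answer_fn: Callable[[List[str]], int],
                       n_perms: int = 4, seed: int = 0) -> float:
    """Fraction of shuffled presentations where the selected option text
    equals the text selected in the original order."""
    rng = random.Random(seed)
    base_choice = options[answer_fn(list(options))]
    agree = 0
    for _ in range(n_perms):
        shuffled = list(options)
        rng.shuffle(shuffled)
        if shuffled[answer_fn(shuffled)] == base_choice:
            agree += 1
    return agree / n_perms


options = ["bridge", "harbor", "airport", "stadium"]


def content_based(opts: List[str]) -> int:
    """Stand-in answerer that always finds the same content."""
    return opts.index("bridge")


def positional_shortcut(opts: List[str]) -> int:
    """Stand-in answerer that always picks the first slot."""
    return 0


stable = consistency_reward(options, content_based)
shortcut = consistency_reward(options, positional_shortcut)
```

An answerer that relies on option position rather than content can only agree with itself when a shuffle happens to leave its preferred slot unchanged, which is the kind of shortcut the abstract says the reward is meant to discourage.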

MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models

Authors: Dianyi Wang, Siyuan Wang, Zejun Li, Yikun Wang, Yitong Li, Duyu Tang, Xiaoyu Shen, Xuanjing Huang, Zhongyu Wei

First: 2025-08-13T13:00:05+00:00 · Latest: 2026-01-08T05:44:40+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across multi-modal tasks by scaling model size and training data. However, these dense LVLMs incur significant computational costs and motivate the exploration of sparse Mixture of Experts (MoE) architectures. While MoE improves parameter efficiency, effectively applying MoE to simultaneously model modality-specific features and cross-modal associations in LVLMs remains challenging. In this work, we propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) into LVLMs. For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts, enabling the model to jointly learn rich intra-modal features and cross-modal interactions. We further introduce an effective and straightforward two-stage training strategy, which facilitates the direct activation of both MoE and multi-modal capabilities. Extensive experiments across different data scales and LLM backbones demonstrate the effectiveness, efficiency and generality of our approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source MoE-LLM-based multi-modal models that involve more activated parameters. The code is available at https://github.com/AlenjandroWang/MoIIE.

中文标题/摘要

标题：MoIIE：混合内模间模专家体系结构用于大型视觉语言模型

大型多模态语言模型（LVLMs）通过扩大模型规模和训练数据，在多模态任务中表现出色。然而，这些密集的LVLMs带来了显著的计算成本，促使人们探索稀疏专家混合（MoE）架构。虽然MoE提高了参数效率，但同时建模模态特定特征和跨模态关联在LVLMs中仍然具有挑战性。在本文中，我们提出将混合内模间模专家（MoIIE）引入LVLMs。对于每个标记，专家路由由其模态指导，将标记导向其各自的内模专家以及共享的跨模专家池，使模型能够同时学习丰富的内模特征和跨模交互。我们还引入了一种有效且简单的两阶段训练策略，这有助于直接激活MoE和多模态能力。在不同数据规模和LLM主干网络的广泛实验中，证明了我们方法的有效性、效率和通用性。值得注意的是，我们的MoIIE模型在激活参数为55亿和113亿的情况下，与现有基于MoE-LLMs的多模态模型相比，性能相当甚至更优。代码可在https://github.com/AlenjandroWang/MoIIE/ 获取。

Summary / 总结

This paper addresses the challenge of efficiently modeling both modality-specific features and cross-modal interactions in large vision-language models (LVLMs) by proposing MoIIE, a Mixture of Intra- and Inter-Modality Experts. Each token is routed by modality to its own intra-modality experts and to a shared pool of inter-modality experts, so the model learns rich intra-modal features and cross-modal interactions jointly. The authors introduce a two-stage training strategy to facilitate the direct activation of both MoE and multi-modal capabilities. Across different data scales and LLM backbones, MoIIE models with 5.5B and 11.3B activated parameters match or surpass existing advanced open-source multi-modal models built on MoE-LLMs that use more activated parameters.

本文提出了MoIIE，一种用于LVLM的混合内模-跨模专家模型，以解决密集模型的计算挑战。通过将令牌路由到特定模态和共享跨模态专家，该模型能够有效学习内模特征和跨模态交互。两阶段训练策略确保了模型同时激活MoE和多模态能力。实验结果表明，MoIIE模型在更少激活参数的情况下，能够匹配甚至超越更大、更复杂的模型，展示了其在各种数据规模和LLM骨干网络上的有效性和效率。
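
The modality-guided routing described in the MoIIE abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the expert grouping, the `top_k` value, the softmax renormalisation over the chosen experts, and the names `route_token` and `expert_groups` are all assumptions made for illustration. The only property taken from the abstract is that a token may reach its own intra-modality experts plus a shared inter-modality pool.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, modality, expert_groups, top_k=2):
    """Pick top_k experts for one token, restricted by the token's modality.

    router_logits: one score per expert (hypothetical router output).
    modality: "text" or "image" (hypothetical labels).
    expert_groups: maps "text", "image", and "shared" to expert indices.
    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    # A token may only use its own intra-modality experts and the shared pool.
    allowed = expert_groups[modality] + expert_groups["shared"]
    # Rank the allowed experts by router score and keep the best top_k.
    chosen = sorted(allowed, key=lambda e: router_logits[e], reverse=True)[:top_k]
    # Renormalise the scores of the chosen experts (an assumed design choice).
    weights = softmax([router_logits[e] for e in chosen])
    return list(zip(chosen, weights))
```

With `expert_groups = {"text": [0, 1], "image": [2, 3], "shared": [4, 5]}`, a text token is never sent to experts 2 or 3, even when the router scores them highest.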

Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction

Authors: Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen

First: 2025-12-12T09:19:45+00:00 · Latest: 2026-01-08T05:10:53+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) are able to process increasingly long videos. Yet important visual information is easily lost across the full context and missed by VLMs. It is also important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.

中文标题/摘要

标题：最少片段，最大显著性：通过关键时刻提取进行长视频摘要

视觉-语言模型（VLMs）能够处理越来越长的视频。然而，重要视觉信息很容易在整个上下文中丢失并被VLMs忽略。此外，设计能够经济有效地分析长视频内容的工具也很重要。在本文中，我们提出了一种片段选择方法，旨在选择应包含在多模态摘要中的关键视频时刻。我们将视频划分为短片段，并使用轻量级视频描述模型生成每个片段的紧凑视觉描述。然后将这些描述传递给大型语言模型（LLM），该模型选择包含最多相关视觉信息的K个片段以构建多模态摘要。我们在MovieSum数据集中的人类标注屏幕剧和摘要的参考片段上评估了我们的方法。我们进一步表明，这些参考片段（不到电影的6%）足以构建MovieSum中电影的完整多模态摘要。使用我们的片段选择方法，我们实现了与这些参考片段相当的摘要性能，同时捕获了比随机片段选择更多的相关视频信息。重要的是，我们通过依赖轻量级描述模型保持了较低的计算成本。

Summary / 总结

This paper addresses the challenge of summarizing long videos by focusing on key moments. It proposes a method that divides videos into short clips and uses a lightweight video captioning model to generate visual descriptions. These descriptions are then analyzed by a large language model to select the most relevant clips for a multimodal summary. The approach is evaluated on reference clips from the MovieSum dataset, showing that these clips (less than 6% of the movie) are sufficient to create a complete summary. The method achieves summarization performance close to the reference clips while capturing more relevant video information than random selection, maintaining low computational cost.

本文提出了一种针对长视频进行关键时刻总结的方法。该方法将视频划分为短片段，并使用轻量级视频描述模型生成视觉描述。这些描述随后由大型语言模型分析，以选择最相关的片段用于多模态总结。该方法在MovieSum数据集的参考片段上进行了评估，显示这些片段（不到电影的6%）足以创建完整的总结。该方法在捕获更多相关视频信息的同时，其总结性能接近参考片段，且保持了较低的计算成本。
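
The clip selection pipeline in this entry (split the video, caption each clip, let an LLM pick K clips) can be sketched as follows. This is a minimal sketch under stated assumptions: `caption_fn` and `select_fn` are hypothetical callables standing in for the lightweight captioning model and the LLM, and fixed-length splitting is an assumption, since the abstract does not say how clip boundaries are chosen.

```python
def split_into_clips(duration_s, clip_len_s):
    """Return (start, end) ranges in seconds that cover the whole video."""
    if clip_len_s <= 0:
        raise ValueError("clip_len_s must be positive")
    clips = []
    start = 0.0
    while start < duration_s:
        # The last clip is truncated at the end of the video.
        clips.append((start, min(start + clip_len_s, duration_s)))
        start += clip_len_s
    return clips


def select_key_clips(clips, caption_fn, select_fn, k):
    """Caption every clip, then keep at most k clips chosen by select_fn.

    caption_fn(clip) -> str        hypothetical stand-in for the captioner.
    select_fn(captions, k) -> list hypothetical stand-in for the LLM; it
                                   returns indices into the captions list.
    """
    captions = [caption_fn(clip) for clip in clips]
    picked = select_fn(captions, k)
    # Drop duplicates and out-of-range indices, and keep temporal order.
    valid = [i for i in sorted(set(picked)) if 0 <= i < len(clips)]
    return [clips[i] for i in valid[:k]]
```

Only the short captions reach the selector, which is what keeps the cost low in this sketch: the expensive model never sees the raw frames.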

BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text Generation

Authors: Amit Bin Tariqul, A N M Zahid Hossain Milkan, Sahab-Al-Chowdhury, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan

First: 2026-01-08T03:01:59+00:00 · Latest: 2026-01-08T03:01:59+00:00

Comments: Under review, 12 pages, 7 figures, 5 tables

Abs · PDF · Code1 · Code2

Abstract

As large language models (LLMs) are increasingly deployed for text generation, watermarking has become essential for authorship attribution, intellectual property protection, and misuse detection. While existing watermarking methods perform well in high-resource languages, their robustness in low-resource languages remains underexplored. This work presents the first systematic evaluation of state-of-the-art text watermarking methods: KGW, Exponential Sampling (EXP), and Waterfall, for Bangla LLM text generation under cross-lingual round-trip translation (RTT) attacks. Under benign conditions, KGW and EXP achieve high detection accuracy (>88%) with negligible perplexity and ROUGE degradation. However, RTT causes detection accuracy to collapse to 9-13%, indicating a fundamental failure of token-level watermarking. To address this, we propose a layered watermarking strategy that combines embedding-time and post-generation watermarks. Experimental results show that layered watermarking improves post-RTT detection accuracy by 25-35%, achieving 40-50% accuracy, representing a 3$\times$ to 4$\times$ relative improvement over single-layer methods, at the cost of controlled semantic degradation. Our findings quantify the robustness-quality trade-off in multilingual watermarking and establish layered watermarking as a practical, training-free solution for low-resource languages such as Bangla. Our code and data will be made public.

中文标题/摘要

标题：BanglaLorica：面向孟加拉语大型语言模型文本生成的鲁棒水印算法设计与评估

随着大型语言模型（LLMs）在文本生成中的广泛应用，水印技术对于作者归属、知识产权保护和滥用检测变得至关重要。尽管现有的水印方法在高资源语言中表现良好，但在低资源语言中的鲁棒性仍被忽视。本研究首次系统评估了针对孟加拉语LLM文本生成的先进文本水印方法：KGW、指数采样（EXP）和Waterfall，在跨语言往返翻译（RTT）攻击下的表现。在正常条件下，KGW和EXP的检测准确率超过88%，且几乎无困惑度和ROUGE降解。然而，RTT导致检测准确率降至9-13%，表明基于标记级别的水印方法存在根本性失败。为解决这一问题，我们提出了一种分层水印策略，结合了嵌入时和生成后水印。实验结果表明，分层水印策略在RTT后的检测准确率提高了25-35%，达到40-50%，相比单层方法有3到4倍的相对改进，但代价是可控的语义降解。我们的研究量化了多语言水印的鲁棒性-质量权衡，并将分层水印确立为低资源语言如孟加拉语的一种实用、无需训练的解决方案。我们的代码和数据将公开。

Summary / 总结

This paper evaluates the robustness of state-of-the-art watermarking methods (KGW, EXP, and Waterfall) for Bangla large language model (LLM) text generation under cross-lingual round-trip translation (RTT) attacks. Under benign conditions, KGW and EXP achieve detection accuracy above 88%, but RTT collapses detection accuracy to 9-13%. To address this, the authors propose a layered watermarking strategy that combines embedding-time and post-generation watermarks, improving post-RTT detection accuracy by 25-35% to reach 40-50%, at the cost of controlled semantic degradation. This work quantifies the robustness-quality trade-off in multilingual watermarking and provides a practical, training-free solution for low-resource languages like Bangla.

这项工作评估了KGW、EXP和Waterfall等最先进的水印方法在跨语言往返翻译攻击下对孟加拉语LLM文本生成的鲁棒性。虽然这些方法在良性条件下可以实现高检测准确率，但在往返翻译攻击下表现不佳。为此，提出了一种结合嵌入时间和生成后水印的分层水印策略，该策略在可控语义降级的情况下将往返翻译后的检测准确率提高了25-35%。本研究突出了多语言水印的鲁棒性-质量权衡，并将分层水印确立为低资源语言如孟加拉语的实际解决方案。
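
To see why a token-level watermark is fragile under round-trip translation, it helps to look at how green-list detection of the KGW kind scores a text. The sketch below is a simplified illustration, not the configuration evaluated in the paper: it hashes each (previous token, token) pair directly instead of partitioning a real vocabulary, and the key, `gamma`, and function names are assumptions. The statistic itself is the standard one-proportion z-score, z = (g - γT) / sqrt(T·γ·(1 - γ)), where g is the number of green tokens among T scored tokens.

```python
import hashlib
import math


def is_green(prev_token, token, gamma=0.5, key="demo-key"):
    """Pseudo-randomly place `token` on the green list, seeded by the previous token.

    Simplification: a keyed hash of the token pair replaces the seeded
    vocabulary partition used by real implementations.
    """
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode("utf-8")).digest()
    # Map the first 8 bytes to a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < gamma


def kgw_z_score(tokens, gamma=0.5, key="demo-key"):
    """z-score of the green-token count; large positive values suggest a watermark."""
    scored = len(tokens) - 1  # the first token has no predecessor to seed from
    if scored <= 0:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, c, gamma, key) for p, c in zip(tokens, tokens[1:]))
    return (green - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))
```

Every score depends on an exact token and its exact predecessor. Translation out of Bangla and back rewrites most token pairs, so the green count drifts back toward the chance level γT and the z-score falls with it, which is consistent with the collapse the abstract reports.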

Current Agents Fail to Leverage World Model as Tool for Foresight

Authors: Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, Heng Ji

First: 2026-01-07T13:15:23+00:00 · Latest: 2026-01-08T02:36:21+00:00

Comments: 36 Pages, 13 Figures, 17 Tables (Meta data updated)

Abs · PDF · Code1 · Code2

Abstract

Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.

中文标题/摘要

标题：当前代理无法利用世界模型作为前瞻工具

基于视觉-语言模型构建的代理越来越多地面临需要预测未来状态的任务，而不是依赖于短期推理。生成的世界模型提供了一种有希望的解决方案：代理可以使用它们作为外部模拟器，在行动前预见结果。本文实证研究了当前代理是否能够利用此类世界模型作为工具来增强其认知能力。在各种各样的代理和视觉问答任务中，我们观察到一些代理很少使用模拟（不到1%），频繁误用预测滚动（约15%），并且在模拟可用或强制时，经常表现出不一致甚至退化的性能（最高达5%）。进一步的归因分析表明，主要瓶颈在于代理决定何时模拟、如何解释预测结果以及如何将前瞻性纳入下游推理的能力。这些发现强调了需要机制来促进与世界模型的校准、战略性互动，为未来代理系统更可靠的前瞻性认知铺平道路。

Summary / 总结

This paper investigates whether current agents can effectively use generative world models as tools for foresight in tasks that require anticipating future states. Across various agentic and visual question answering tasks, the study finds that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often show inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis locates the main bottleneck in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings highlight the need for mechanisms that enable calibrated, strategic interaction with world models in future agent systems.

该研究探讨了当前代理是否能够有效利用生成的世界模型来预测未来状态，从而增强其认知能力。研究发现，一些代理很少使用模拟，许多代理错误地使用预测结果，而有些代理在模拟可用时甚至表现更差。主要问题在于代理无法决定何时进行模拟、如何解释预测结果以及如何将前瞻性思维整合到后续推理中。这些发现强调了在未来代理系统中需要更好的机制来促进与世界模型的战略互动。

Vision-Language Agents for Interactive Forest Change Analysis

Authors: James Brock, Ce Zhang, Nantheera Anantrasirichai

First: 2026-01-08T02:02:36+00:00 · Latest: 2026-01-08T02:02:36+00:00

Comments: 5 pages, 4 figures, Submitted to IGARSS 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of the LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.

中文标题/摘要

标题：视觉-语言代理在交互式森林变化分析中的应用

现代森林监测工作流程越来越多地受益于高分辨率卫星图像的日益可用和深度学习的进步。在此背景下，准确的像素级变化检测和复杂森林动态的有意义语义变化描述是两个持续存在的挑战。虽然大型语言模型（LLMs）正在被适应用于交互式数据探索，但它们与视觉-语言模型（VLMs）在遥感图像变化解释（RSICI）中的集成仍然未被充分探索。为了解决这一差距，我们引入了一个由LLM驱动的集成森林变化分析代理，该代理支持跨多个RSICI任务的自然语言查询。所提出的系统基于多级变化解释（MCI）视觉-语言骨干，并通过LLM进行编排。为了在森林环境中促进适应和评估，我们进一步引入了森林变化数据集，该数据集包含双时相卫星图像、像素级变化掩码和使用结合人工注释和基于规则的方法生成的多粒度语义变化描述。实验结果表明，所提出的系统在森林变化数据集上的mIoU和BLEU-4得分为67.10%和40.17%，在LEVIR-MCI-Trees上的得分为88.13%和34.41%，LEVIR-MCI基准数据集的一个专注于树木的子集。这些结果突显了交互式、LLM驱动的RSICI系统在提高森林变化分析的可访问性、可解释性和效率方面的潜力。所有数据和代码均可在https://github.com/JamesBrockUoB/ForestChat/上公开获取。

Summary / 总结

The research aims to address the challenges of accurate pixel-level change detection and semantic change captioning in forest monitoring using deep learning and large language models. The method involves an LLM-driven agent that integrates a multi-level change interpretation vision-language backbone for RSICI tasks. The system was evaluated on the Forest-Change dataset and achieved mIoU and BLEU-4 scores of 67.10% and 40.17%, respectively, demonstrating improved accessibility and interpretability in forest change analysis. All data and code are publicly available.

本文旨在利用深度学习和大型语言模型解决森林监测中的像素级变化检测和语义变化描述的挑战。作者提出了一种基于LLM的代理，结合了多级变化解释视觉语言骨干以执行RSICI任务。他们引入了Forest-Change数据集用于评估，其中包括双时相卫星图像、像素级变化掩码和语义变化描述。该系统在Forest-Change数据集上实现了67.10%的mIoU和40.17%的BLEU-4得分，在LEVIR-MCI-Trees上实现了88.13%的mIoU和34.41%的BLEU-4得分，展示了在森林变化分析中提高的可访问性、可解释性和效率。
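
The mIoU figures reported for this entry measure overlap between predicted and reference change masks. As a reminder of what the metric computes, here is a minimal sketch over flattened label lists. The averaging convention is an assumption: the abstract does not say whether the authors average per image or over the whole dataset, or which classes are included. Classes that appear in neither mask are skipped here.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for two flat label sequences.

    pred, target: equal-length sequences of integer class labels, one per pixel.
    Classes absent from both pred and target are left out of the mean.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        # Pixels where both masks agree on this class.
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        # Pixels where either mask carries this class.
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in either mask")
    return sum(ious) / len(ious)
```

For a binary change mask, `num_classes=2` averages the IoU of the "no change" and "change" classes, so a model that predicts "no change" everywhere is still penalised on the change class.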

From Dataset to Real-world: General 3D Object Detection via Generalized Cross-domain Few-shot Learning

Authors: Shuangzhi Li, Junlong Shen, Lei Ma, Xingyu Li

First: 2025-03-08T17:05:21+00:00 · Latest: 2026-01-08T01:19:36+00:00

Comments: The latest version refines the few-shot setting on common classes, enforcing a stricter object-level definition

Abs · PDF · Code1 · Code2

Abstract

LiDAR-based 3D object detection models often struggle to generalize to real-world environments due to limited object diversity in existing datasets. To address this, we introduce the first generalized cross-domain few-shot (GCFS) task in 3D object detection, aiming to adapt a source-pretrained model to both common and novel classes in a new domain with only few-shot annotations. We propose a unified framework that learns stable target semantics under limited supervision by bridging 2D open-set semantics with 3D spatial reasoning. Specifically, an image-guided multi-modal fusion injects transferable 2D semantic cues into the 3D pipeline via vision-language models, while a physically-aware box search enhances 2D-to-3D alignment via LiDAR priors. To capture class-specific semantics from sparse data, we further introduce contrastive-enhanced prototype learning, which encodes few-shot instances into discriminative semantic anchors and stabilizes representation learning. Extensive experiments on GCFS benchmarks demonstrate the effectiveness and generality of our approach in realistic deployment settings.

中文标题/摘要

标题：从数据集到现实世界：通过通用跨域少样本学习进行三维物体检测

基于LiDAR的三维物体检测模型往往难以在现实环境中泛化，因为现有数据集中的物体多样性有限。为了解决这个问题，我们引入了三维物体检测中的第一个通用跨域少样本（GCFS）任务，旨在仅通过少量标注将源预训练模型适应到新域中的常见和新型类。我们提出了一种统一框架，通过将开放集2D语义与3D空间推理相结合，在有限监督下学习稳定的靶标语义。具体而言，图像引导的多模态融合通过视觉语言模型将可转移的2D语义线索注入3D管道，而物理感知的框搜索通过LiDAR先验增强2D到3D对齐。为了从稀疏数据中捕获类特定的语义，我们进一步引入了对比增强的原型学习，将少样本实例编码为判别性语义锚点，并稳定表示学习。在GCFS基准上的大量实验表明，我们的方法在现实部署环境中具有有效性和普适性。

Summary / 总结

The paper addresses the challenge of LiDAR-based 3D object detection models generalizing to real-world environments by introducing a generalized cross-domain few-shot (GCFS) task. It proposes a unified framework that combines 2D open-set semantics with 3D spatial reasoning to adapt a source-pretrained model to both common and novel classes with limited annotations. Key findings include the use of image-guided multi-modal fusion and physically-aware box search to enhance 2D-to-3D alignment, and contrastive-enhanced prototype learning to stabilize representation learning from sparse data. Experiments show the approach's effectiveness in realistic settings.

研究针对LiDAR基于的3D物体检测模型在现实环境中难以泛化的挑战，由于现有数据集中的物体多样性有限。引入了通用跨域少量样本学习（GCFS）任务，并提出了一种结合2D开放集语义和3D空间推理的统一框架。该方法利用图像引导的多模态融合和物理感知的框搜索来增强迁移学习和2D到3D对齐。引入对比增强原型学习以从稀疏数据中稳定表示学习。实验表明该方法在现实场景中的有效性和普适性。

UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving

Authors: Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren

First: 2026-01-07T23:49:52+00:00 · Latest: 2026-01-07T23:49:52+00:00

Comments: Project Page: https://unidrive-wm.github.io/UniDrive-WM

Abs · PDF · Code1 · Code2 · Project1

Abstract

World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM .

中文标题/摘要

标题：UniDrive-WM：统一理解、规划和生成世界模型的自主驾驶

世界模型已成为自主驾驶的核心，准确的场景理解和未来预测对于安全控制至关重要。近期研究探索了使用视觉-语言模型（VLMs）进行规划，但现有方法通常将感知、预测和规划视为独立模块。我们提出了UniDrive-WM，这是一种基于VLM的统一世界模型，能够在单一架构中联合执行驾驶场景理解、轨迹规划和基于轨迹的未来图像生成。UniDrive-WM的轨迹规划器预测未来轨迹，条件化VLM图像生成器以生成合理的未来帧。这些预测提供了额外的监督信号，增强场景理解并迭代细化轨迹生成。我们进一步比较了离散和连续输出表示对未来图像预测的影响，分析其对下游驾驶性能的影响。在具有挑战性的Bench2Drive基准测试中，UniDrive-WM生成了高保真度的未来图像，并在L2轨迹误差和碰撞率方面分别提高了5.9%和9.2%，超过了之前的最佳方法。这些结果表明，将VLM驱动的推理、规划和生成世界建模紧密集成对于自主驾驶的优势。项目页面可在https://unidrive-wm.github.io/UniDrive-WM 查看。

Summary / 总结

UniDrive-WM is a unified VLM-based world model that integrates driving-scene understanding, trajectory planning, and future image generation. It uses a trajectory planner to predict future trajectories, which conditions a VLM to generate plausible future frames. Experiments show that UniDrive-WM improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate compared to the previous best method on the Bench2Drive benchmark, highlighting the benefits of tightly integrating VLM-driven reasoning and generative world modeling for autonomous driving.

UniDrive-WM 是一个统一的 VLM 基础世界模型，将场景理解、轨迹规划和未来图像生成集成在一个架构中。它使用轨迹规划器预测未来路径，以条件化 VLM 生成合理的未来帧，增强场景理解和轨迹生成。实验结果显示，UniDrive-WM 在 Bench2Drive 基准上的 L2 轨迹误差降低了 5.9%，碰撞率降低了 9.2%，优于之前的方法，突显了将 VLM 驱动的推理和生成性建模紧密集成的优势。

Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization

Authors: Xingjian Diao, Zheyuan Liu, Chunhui Zhang, Weiyi Wu, Keyi Kong, Lin Shi, Kaize Ding, Soroush Vosoughi, Jiang Gui

First: 2026-01-07T23:05:17+00:00 · Latest: 2026-01-07T23:05:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.

中文标题/摘要

标题：通过门控感知推理优化解决大型视觉-语言模型中的过度思考问题

大型视觉-语言模型（LVLMs）通过链式思考机制展示了强大的推理能力，生成逐步的推理过程。然而，这种缓慢思考的方法往往会导致过度思考，模型会对简单查询产生冗长的响应，导致测试时的低效率甚至降低了准确性。先前的工作试图通过自适应推理策略来缓解这一问题，但这些方法大多忽视了一个根本瓶颈：视觉感知失败。我们认为，稳定的推理依赖于低级视觉定位，推理错误通常源自不完美的感知而非不足的思考。为了解决这一限制，我们提出了门控感知推理优化（GPRO），这是一种元推理控制器，在每一步生成中动态地在三条决策路径之间分配计算：一条轻量级的快速路径，一条缓慢的感知路径用于重新审视视觉输入，以及一条缓慢的推理路径用于内部自我反思。为了学习这种区分，我们从大约79万样本中推导出大规模的失败归因监督，使用教师模型区分感知幻觉和推理错误。然后，我们使用多目标强化学习训练控制器，在不确定性下优化任务准确性和计算成本之间的权衡。在五个基准上的实验表明，GPRO在准确性和效率上都有显著改进，优于最近的缓慢思考方法，同时生成的响应显著更短。

Summary / 总结

The paper addresses the issue of overthinking in large vision-language models (LVLMs) by proposing Gated Perception-Reasoning Optimization (GPRO). GPRO introduces a meta-reasoning controller that dynamically routes computation among three paths: a fast path, a slow perception path, and a slow reasoning path. The method uses failure attribution supervision from teacher models to distinguish perceptual hallucinations from reasoning errors and trains the controller with multi-objective reinforcement learning. Experiments show that GPRO improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating shorter responses.

论文通过提出Gated Perception-Reasoning Optimization (GPRO)来解决大型视觉-语言模型（LVLM）中的过度思考问题，该方法在快速路径、感知路径和推理路径之间动态分配计算。该方法使用教师模型的失败归因监督来区分感知错误和推理错误，并在不确定性下优化准确性和计算成本之间的权衡。实验表明，GPRO在提高准确性和效率方面优于最近的慢思考方法，并生成了更短的回答。
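
The GPRO abstract describes a controller that picks one of three paths at each generation step and is trained to balance accuracy against compute. The sketch below illustrates that idea only. The argmax gate, the per-path costs, the weight `lam`, and the scalarised reward (correctness minus weighted cost) are assumptions made for illustration; the abstract does not specify the gating rule or the form of the multi-objective reward.

```python
PATHS = ("fast", "slow_perception", "slow_reasoning")


def choose_path(gate_scores):
    """Pick the path with the highest controller score for this step.

    gate_scores: dict mapping each name in PATHS to a float (hypothetical
    controller output). Ties resolve to the earliest entry in PATHS.
    """
    return max(PATHS, key=lambda path: gate_scores[path])


def scalarized_reward(correct, path_counts, path_cost, lam=0.01):
    """Assumed reward: 1 for a correct answer, minus lam times total compute.

    path_counts: how many steps used each path in one response.
    path_cost: assumed relative compute cost of one step on each path.
    """
    total_cost = sum(path_cost[path] * count for path, count in path_counts.items())
    return (1.0 if correct else 0.0) - lam * total_cost
```

Under a reward of this shape, a response that stays on the fast path and is still correct scores higher than an equally correct response that re-examines the image at every step, which is the pressure against overthinking that the entry describes.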

3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation

Authors: Jusheng Zhang, Yijia Fan, Zimo Wen, Jian Wang, Keze Wang

Venue: NeurIPS 2025

First: 2026-01-07T21:23:05+00:00 · Latest: 2026-01-07T21:23:05+00:00

Comments: Accepted at NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation presents challenges beyond 2D annotation, including spatial complexity, occlusion, and viewpoint inconsistency. Existing approaches based on single models often struggle to address these issues effectively. We propose Tri-MARF, a novel framework that integrates tri-modal inputs, including 2D multi-view images, textual descriptions, and 3D point clouds, within a multi-agent collaborative architecture to enhance large-scale 3D annotation. Tri-MARF consists of three specialized agents: a vision-language model agent for generating multi-view descriptions, an information aggregation agent for selecting optimal descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined captioning. Extensive experiments on Objaverse-LVIS, Objaverse-XL, and ABO demonstrate that Tri-MARF substantially outperforms existing methods, achieving a CLIPScore of 88.7 compared to prior state-of-the-art methods, retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.

中文标题/摘要

标题：3D-Agent：三模态多智能体协作的可扩展3D物体标注

受自动驾驶、机器人和增强现实应用的驱动，3D物体标注面临的挑战远超2D标注，包括空间复杂性、遮挡和视角不一致。现有基于单一模型的方法往往难以有效解决这些问题。我们提出了一种名为Tri-MARF的新框架，该框架将2D多视角图像、文本描述和3D点云的三模态输入整合到多智能体协作架构中，以增强大规模3D标注。Tri-MARF包括三个专门的智能体：一个视觉语言模型智能体用于生成多视角描述，一个信息聚合智能体用于选择最优描述，以及一个门控智能体，用于将文本语义与3D几何对齐以进行精细的标注。在Objaverse-LVIS、Objaverse-XL和ABO上的广泛实验表明，Tri-MARF在CLIPScore、检索准确率和吞吐量方面显著优于现有方法，CLIPScore达到88.7，在ViLT R@5上的检索准确率分别为45.2和43.8，单块NVIDIA A100 GPU上的吞吐量可达每小时12000个物体。

Summary / 总结

The research aims to address the challenges of 3D object annotation in autonomous driving and augmented reality by proposing Tri-MARF, a framework that integrates 2D multi-view images, textual descriptions, and 3D point clouds. This framework uses three specialized agents for generating multi-view descriptions, aggregating information, and aligning textual semantics with 3D geometry. Experiments show that Tri-MARF outperforms existing methods with a CLIPScore of 88.7, retrieval accuracy of 45.2% and 43.8% on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.

研究旨在通过开发三模态多代理系统解决自动驾驶和增强现实中的3D对象标注挑战。该系统整合了2D多视角图像、文本描述和3D点云。提出的Tri-MARF框架包括三个代理：视觉语言模型代理、信息聚合代理和门控代理。实验表明，Tri-MARF显著优于现有方法，CLIPScore达到88.7，在ViLT R@5上的检索准确率分别为45.2%和43.8%，并且在单个NVIDIA A100 GPU上的吞吐量可达每小时12,000个对象。