arXiv Paper Digest / arXiv 论文速递

2026-01-12 03:31
Snapshot: 20260112_0331
Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
Authors: William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald
First: 2026-01-08T18:23:03+00:00 · Latest: 2026-01-08T18:23:03+00:00
Abstract
Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
中文标题/摘要
标题:视觉语言模型中提示诱发幻觉的机制
大型视觉语言模型(VLMs)功能强大,但经常倾向于文本提示而非视觉证据,从而产生幻觉。我们在一个受控的对象计数设置中研究了这种失败模式,其中提示夸大了图像中的对象数量(例如,要求模型描述四朵睡莲,而实际上只有三朵)。在对象数量较低时,模型通常会纠正这种夸大,但随着对象数量的增加,它们越来越倾向于遵循提示,无视差异。通过对三种VLMs的机制分析,我们发现了一小组注意力头,对其进行消融可以在无需额外训练的情况下将提示诱发幻觉(PIH)减少至少40%。在不同模型中,PIH头以模型特定的方式介导提示复制。我们描述了这些差异,并表明对PIH头的消融增强了模型向视觉证据的纠正。我们的研究结果为理解驱动提示诱发幻觉的内部机制提供了见解,并揭示了这些行为在实现方式上的模型特定差异。
Summary / 总结
The study investigates how large vision-language models (VLMs) hallucinate when a prompt overstates the number of objects in an image. Models often correct the overestimate at low object counts but increasingly conform to the prompt as the count grows. Through mechanistic analysis of three VLMs, the researchers identify a small set of attention heads whose ablation reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. These PIH-heads mediate prompt copying in model-specific ways, and ablating them increases correction toward the visual evidence.
研究探讨了视觉语言模型(VLMs)如何基于文本提示而非视觉证据产生幻觉。通过操控图像中的物体数量,研究人员发现,随着物体数量的增加,模型越来越倾向于遵循提示。移除特定的注意力头可以减少至少40%的提示诱导幻觉。研究结果表明,这些注意力头对于提示复制行为至关重要,并且移除这些头可以提高模型与视觉证据的一致性。
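The head ablation described in this entry can be illustrated with a minimal sketch. It assumes zero-ablation (replacing a head's output with zeros before the heads are concatenated); the abstract does not say which ablation variant the authors use, and the names `ablate_heads` and `concat_heads` are hypothetical.

```python
from typing import List, Sequence, Set

Vector = List[float]


def ablate_heads(head_outputs: Sequence[Vector], ablated: Set[int]) -> List[Vector]:
    """Zero the output of every attention head whose index is in `ablated`.

    Zero-ablation is an assumption made for this sketch; mean-ablation or
    other variants would replace the zeros with a different constant.
    """
    return [
        [0.0] * len(out) if idx in ablated else list(out)
        for idx, out in enumerate(head_outputs)
    ]


def concat_heads(head_outputs: Sequence[Vector]) -> Vector:
    """Concatenate per-head outputs, as is done before the output projection."""
    return [x for out in head_outputs for x in out]


# Toy example: 3 heads, each producing a 2-dimensional output for one token.
heads = [[0.5, -1.0], [2.0, 0.25], [1.5, 1.5]]
patched = concat_heads(ablate_heads(heads, ablated={1}))
```

In a real VLM the same idea would be applied inside the forward pass of the chosen layers, so no retraining is involved; only the selected heads stop contributing to the residual stream.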
CoV: Chain-of-View Prompting for Spatial Reasoning
Authors: Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang
First: 2026-01-08T17:59:42+00:00 · Latest: 2026-01-08T17:59:42+00:00
Abstract
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
中文标题/摘要
标题:CoV:空间推理的链式视角提示
在3D环境中的具身问答(EQA)通常需要收集分布在多个视角且部分被遮挡的上下文。然而,大多数最近的视觉-语言模型(VLMs)受限于固定且有限的输入视角集,这限制了它们在推理时获取与问题相关上下文的能力,并阻碍了复杂的空间推理。我们提出了链式视角(CoV)提示,这是一种无需训练、在测试时进行推理的框架,通过由粗到细的探索过程将VLM转变为主动的视角推理器。CoV首先使用视角选择代理过滤冗余帧并识别与问题对齐的锚定视角,然后通过交替进行迭代推理和离散相机动作进行细粒度视角调整,从底层3D场景表示中获取新观察,直到收集到足够上下文或达到步骤预算。我们在OpenEQA上对CoV进行了评估,跨四个主流VLMs获得了平均+11.56%的LLM-Match改进,最大增益为Qwen3-VL-Flash上的+13.62%。CoV还表现出测试时的扩展性:增加最小动作预算可额外获得平均+2.51%的改进,峰值为Gemini-2.5-Flash上的+3.73%。在ScanQA和SQA3D上,CoV表现出强大的性能(例如,ScanQA上的116 CIDEr / 31.9 EM@1和SQA3D上的51.1 EM@1)。总体而言,这些结果表明,与问题对齐的视角选择结合开放视角搜索是提高3D EQA中空间推理能力的有效、模型无关的策略,无需额外训练。
Summary / 总结
The research aims to enhance embodied question answering (EQA) in 3D environments by addressing the limitations of vision-language models (VLMs) in collecting relevant context across multiple viewpoints. The proposed Chain-of-View (CoV) prompting method involves a coarse-to-fine exploration process, including view selection and fine-grained view adjustment, to improve spatial reasoning. CoV achieves an average improvement of +11.56% in LLM-Match across four VLMs, with the best gain of +13.62% on Qwen3-VL-Flash. It also shows test-time scalability, with additional improvements up to +3.73% on Gemini-2.5-Flash.
研究旨在通过提出链式视角(CoV)提示来解决视觉语言模型在3D环境中的体感问答(EQA)中的局限性,增强模型在多个视角下收集相关上下文并进行复杂空间推理的能力。方法包括粗到细的探索过程,包括视角选择和精细视角调整。在OpenEQA上的实验显示,平均提高了11.56%的LLM-Match,最大增益为Qwen3-VL-Flash上的13.62%。CoV还展示了测试时的扩展性,当增加最小动作预算时,平均额外提高了2.51%,峰值为Gemini-2.5-Flash上的3.73%。在ScanQA和SQA3D上,CoV取得了强劲的表现,例如ScanQA上的116 CIDEr和31.9 EM@1,以及SQA3D上的51.1 EM@1。
MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging
Authors: Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li
First: 2025-09-23T06:23:56+00:00 · Latest: 2026-01-08T17:56:05+00:00
Comments: The project is available at https://charlescsyyy.github.io/MVT
Abstract
Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.
中文标题/摘要
标题:MVT:基于掩码的视觉-语言模型在分类学对齐的土地覆盖标签化中的应用
遥感中的土地覆盖理解越来越需要能够跨数据集泛化、同时保持空间精确性和可解释性的类别无关系统。我们研究了领域偏移下的几何优先发现与解释设置,其中候选区域以类别无关的方式划定,监督则通过匿名标识符避免使用词汇化的类别名称。作为对开放集识别和开放世界学习的补充,我们专注于将类别无关的掩码证据与基于分类体系的场景解释相结合,而不是未知类别拒绝或持续的类别扩展。我们提出了MVT,一个三阶段框架,(i) 使用经过领域适应的SAM2提取边界忠实的区域掩码,(ii) 通过对多模态LLM进行双步骤LoRA微调来完成基于掩码的语义标注和场景描述生成,(iii) 使用LLM作为裁判的评分对输出进行评估,评分通过分层专家评分校准。在跨数据集分割迁移(在OpenEarthMap上训练,在LoveDA上评估)中,领域适应的SAM2提高了掩码质量;同时,双步骤多模态LLM微调产生了更准确的分类体系对齐标签和信息量更大的基于掩码的场景描述。
Summary / 总结
The research aims to develop class-agnostic systems for land-cover understanding in remote sensing that can generalize across datasets while maintaining spatial precision and interpretability. The proposed MVT framework consists of three stages: (i) extracting boundary-faithful region masks using SAM2 with domain adaptation, (ii) performing mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluating outputs with LLM-as-judge scoring calibrated by stratified expert ratings. The study shows that domain-adapted SAM2 improves mask quality, and dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions on cross-dataset segmentation transfer tasks.
研究旨在开发用于遥感的土地覆盖理解系统,注重空间精度和可解释性。方法包括三个阶段:(i) 使用SAM2进行领域适应以提取边界忠实的区域掩码,(ii) 通过多模态LLM的双重步骤LoRA微调进行掩码导向的语义标签和场景描述生成,(iii) 使用LLM作为评判者进行输出评估,并通过分层专家评分进行校准。关键发现包括领域适应的SAM2提高了掩码质量,以及通过双重步骤的MLLM微调获得了更准确的分类对齐标签和更具信息量的掩码导向场景描述。
Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
Authors: Shuliang Liu, Songbo Yang, Dong Fang, Sihang Jia, Yuqi Tang, Lingfeng Su, Ruoshui Peng, Yibo Yan, Xin Zou, Xuming Hu
First: 2026-01-08T17:49:13+00:00 · Latest: 2026-01-08T17:49:13+00:00
Abstract
Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.
中文标题/摘要
标题:视觉-语言内省:通过可解释的双因果引导减轻MLLM中的过度自信幻觉
物体幻觉严重削弱了多模态大型语言模型的可靠性,通常源于认知内省的基本失败,模型盲目信任语言先验而非具体的视觉证据。现有缓解措施仍有限:对比解码方法仅表面操作而不纠正内部语义错位,而当前的潜在引导方法依赖于静态向量,缺乏实例特定的精确性。我们引入了视觉-语言内省(VLI),这是一种无需训练的推理框架,模拟了元认知的自我纠正过程。VLI 首先进行属性内省,通过概率冲突检测诊断幻觉风险并定位因果视觉锚点。然后使用可解释的双因果引导来主动调节推理过程,动态隔离视觉证据与背景噪声,同时通过自适应校准消除盲目的自信。VLI 在先进模型上实现了最先进的性能,在MMHal-Bench 上将物体幻觉率降低了12.67%,在POPE 上提高了5.8%的准确性。
Summary / 总结
The research aims to address the issue of object hallucination in Multimodal Large Language Models (MLLMs) by enhancing their cognitive introspection. The method introduced is Vision-Language Introspection (VLI), which uses Attributive Introspection to detect and localize hallucination risks and Interpretable Bi-Causal Steering to dynamically adjust the inference process, reducing hallucinations and improving accuracy. Key findings show that VLI reduces object hallucination rates by 12.67% on MMHal-Bench and increases accuracy by 5.8% on POPE.
研究旨在通过增强认知自我反省来解决多模态大型语言模型(MLLMs)中的物体幻觉问题。方法是引入视觉-语言反省(VLI),它通过属性反省检测和定位幻觉风险,并通过可解释的双向因果引导动态调整推理过程,从而减少幻觉并提高准确性。关键发现表明,VLI在MMHal-Bench上将物体幻觉率降低了12.67%,在POPE上提高了5.8%的准确性。
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Authors: Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo
First: 2025-03-25T17:17:19+00:00 · Latest: 2026-01-08T17:17:54+00:00
Abstract
Finding information in hour-long videos is a challenging task even for top-performing Vision Language Models (VLMs), as encoding visual content quickly exceeds available context windows. To tackle this challenge, we present FALCONEye, a novel video agent based on a training-free, model-agnostic meta-architecture composed of a VLM and a Large Language Model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by calibrated confidence from the VLM's answers. We also introduce the FALCON-Bench benchmark, extending the Question Answering problem to Video Answer Search, which requires models to return both the answer and its supporting temporal window for open-ended questions in hour-long videos. With just a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench. It further demonstrates its generalization capability on the MLVU benchmark with shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while slashing inference cost by roughly an order of magnitude.
中文标题/摘要
标题:FALCONEye:使用多模态大语言模型在一小时长视频中查找答案并定位内容
在长达一小时的视频中查找信息即使对顶级视觉语言模型(VLMs)来说也是一项具有挑战性的任务,因为编码视觉内容会迅速超出可用的上下文窗口。为了解决这一挑战,我们提出了FALCONEye,这是一种基于无需训练、模型无关的元架构的新型视频代理,该架构由VLM和大语言模型(LLM)组成。FALCONEye 使用由VLM答案的校准置信度引导的探索式搜索算法来回答开放式问题。我们还引入了FALCON-Bench基准测试,将问答问题扩展到视频答案搜索,要求模型返回一小时长视频中开放式问题的答案及其支持的时间窗口。仅使用7B VLM和轻量级LLM,FALCONEye 在FALCON-Bench中得分超过所有开源7B VLMs和可比代理。此外,FALCONEye 在包含较短视频和不同任务的MLVU基准测试中展示了其泛化能力,在单一细节任务上超越了GPT-4o,同时将推理成本降低了约一个数量级。
Summary / 总结
FALCONEye is a novel video agent that uses a VLM and an LLM to answer open-ended questions in one-hour-long videos. It employs an exploration-based search algorithm guided by the VLM's calibrated confidence. FALCONEye outperforms all open-source 7B VLMs and comparable agents in the FALCON-Bench benchmark and shows strong generalization in the MLVU benchmark, surpassing GPT-4o on single-detail tasks while reducing inference cost significantly.
FALCONEye 是一个利用 VLM 和 LLM 来回答一小时长视频中的开放性问题的新视频代理。它采用了一种基于探索的搜索算法,并由 VLM 的校准置信度引导。FALCONEye 在 FALCON-Bench 基准测试中超越了所有开源的 7B VLM 及其可比代理,并在 MLVU 基准测试中展示了强大的泛化能力,超越了 GPT-4o 在单一细节任务上的表现,同时大幅降低了推理成本。
VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
Authors: Ignacio de Rodrigo, Alvaro J. Lopez-Lopez, Jaime Boal
First: 2026-01-08T17:15:15+00:00 · Latest: 2026-01-08T17:15:15+00:00
Abstract
This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
中文标题/摘要
标题:VERSE:视觉嵌入空间探索与缩减。基于聚类指导的见解以增强视觉丰富文档理解的训练数据
本文介绍了VERSE,一种用于分析和改进应用于视觉丰富文档理解的视觉语言模型的方法,通过探索其视觉嵌入空间。VERSE使潜在表示的可视化成为可能,支持模型可行性的评估。它还使识别问题区域并指导生成合成数据以在这些聚类中增强性能成为可能。我们通过在合成MERIT数据集上进行训练并在其现实世界对应物MERIT Secret上进行评估来验证该方法。结果表明,VERSE有助于揭示与错误频发聚类相关的视觉特征,并且使用包含这些特征的样本重新训练显著提高了F1性能,而不会损害泛化能力。此外,我们证明了使用VERSE优化的本地模型(如Donut和Idefics2)在性能上可以与GPT-4和Pixtral等SaaS解决方案相匹敌甚至超越它们。
Summary / 总结
VERSE is a methodology that explores the visual embedding space of Vision-Language Models to enhance visually-rich document understanding. It visualizes latent representations to identify problematic regions and generate synthetic data to improve model performance. Experiments show that VERSE helps uncover visual features associated with error-prone clusters, and retraining with these features significantly boosts F1 performance without degrading generalization. VERSE also enables on-premise models like Donut and Idefics2 to match or surpass the performance of SaaS solutions like GPT-4 and Pixtral.
VERSE 是一种方法,通过探索 Vision-Language 模型的视觉嵌入空间来提升视觉丰富的文档理解。它可视化潜在表示以识别问题区域,并生成合成数据以提高模型性能。实验表明,VERSE 帮助发现了与错误多发簇相关的视觉特征,并通过这些特征的重新训练显著提升了 F1 表现,而不会损害泛化能力。此外,通过 VERSE 优化的本地模型如 Donut 和 Idefics2 可以匹配甚至超越 GPT-4 和 Pixtral 等 SaaS 解决方案的表现。
$π_0$: A Vision-Language-Action Flow Model for General Robot Control
Authors: Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, Ury Zhilinsky
Venue: RSS 2025
First: 2024-10-31T17:22:30+00:00 · Latest: 2026-01-08T17:01:05+00:00
Comments: See project website for videos: https://physicalintelligence.company/blog/pi0 Published in RSS 2025
Abstract
Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.
中文标题/摘要
标题:$π_0$: 一种视觉-语言-行动流模型用于通用机器人控制
机器人学习具有巨大的潜力,可以解锁灵活、通用和灵巧的机器人系统的全部潜力,并解决人工智能领域的一些最深层次问题。然而,将机器人学习提升到有效现实系统所需的通用性水平面临数据、泛化和鲁棒性方面的重大障碍。在本文中,我们讨论了通用机器人策略(即机器人基础模型)如何应对这些挑战,以及如何设计有效的通用机器人策略以应对复杂和高度灵巧的任务。我们提出了一种基于预训练视觉-语言模型(VLM)的新颖流匹配架构,以继承互联网规模的语义知识。然后,我们讨论了如何在多种灵巧机器人平台上大规模多样化的数据集上训练该模型,包括单臂机器人、双臂机器人和移动操作器。我们从预训练后执行任务的能力、遵循人类和高级VLM策略的语言指令以及通过微调获取新技能等方面评估了该模型。我们的结果涵盖了各种任务,如衣物折叠、桌面清洁和组装盒子。
Summary / 总结
This paper addresses the challenges of general robot learning by proposing a vision-language-action flow model: a flow matching architecture built on a pre-trained vision-language model to inherit Internet-scale semantic knowledge. The model is trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. It is evaluated on performing tasks zero-shot after pre-training, following language instructions from people and from a high-level VLM policy, and acquiring new skills through fine-tuning, across tasks such as laundry folding, table cleaning, and assembling boxes.
本文提出了一种视觉-语言-动作流模型,利用预训练的视觉-语言模型从互联网中继承语义知识,解决通用机器人学习的挑战。该模型在多种机器人平台的数据集上进行训练,并评估其在无先验训练的情况下执行任务、遵循人类语言指令以及通过微调获取新技能的能力。主要发现包括成功执行洗衣折叠、桌面清洁和组装盒子等任务。
POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering
Authors: Yichen Xu, Liangyu Chen, Liang Zhang, Jianzhe Ma, Wenxuan Wang, Qin Jin
First: 2025-07-16T06:09:02+00:00 · Latest: 2026-01-08T17:00:25+00:00
Comments: Work in Progress
Abstract
Charts are a universally adopted medium for data communication, yet existing chart understanding benchmarks are overwhelmingly English-centric, limiting their accessibility and relevance to global audiences. To address this limitation, we introduce PolyChartQA, the first large-scale multilingual benchmark for chart question answering, comprising 22,606 charts and 26,151 QA pairs across 10 diverse languages. PolyChartQA is constructed through a scalable pipeline that enables efficient multilingual chart generation via data translation and code reuse, supported by LLM-based translation and rigorous quality control. We systematically evaluate multilingual chart understanding with PolyChartQA on state-of-the-art LVLMs and reveal a significant performance gap between English and other languages, particularly low-resource ones. Additionally, we introduce a companion multilingual chart question answering training set, PolyChartQA-Train, on which fine-tuning LVLMs yields substantial gains in multilingual chart understanding across diverse model sizes and architectures. Together, our benchmark provides a foundation for developing globally inclusive vision-language models capable of understanding charts across diverse linguistic contexts.
中文标题/摘要
标题:POLYCHARTQA:使用多语言图表问答基准评估大型视觉-语言模型
图表是数据交流的普遍采用媒介,但现有的图表理解基准主要以英语为中心,限制了其对全球受众的可访问性和相关性。为解决这一限制,我们引入了PolyChartQA,这是首个大规模多语言图表问答基准,包含22,606张图表和26,151个问答对,覆盖10种不同的语言。PolyChartQA通过可扩展的管道构建,通过数据翻译和代码重用实现高效的多语言图表生成,支持基于LLM的翻译和严格的质量控制。我们系统地使用PolyChartQA对最先进的LVLM进行多语言图表理解评估,并揭示了英语与其他语言之间,尤其是低资源语言之间存在显著的性能差距。此外,我们还引入了PolyChartQA-Train,这是一个多语言图表问答训练集,通过微调LVLM可以在不同模型大小和架构下显著提高多语言图表理解能力。我们的基准为开发能够跨多种语言环境理解图表的全球包容性视觉-语言模型提供了基础。
Summary / 总结
PolyChartQA is introduced as the first large-scale multilingual benchmark for chart question answering, containing 22,606 charts and 26,151 QA pairs in 10 languages. It addresses the limitation of existing English-centric benchmarks by using a scalable pipeline for multilingual chart generation and evaluation. The benchmark reveals a significant performance gap between English and other languages, especially low-resource ones, and fine-tuning LVLMs on PolyChartQA-Train improves multilingual chart understanding across different model sizes and architectures. This work provides a foundation for developing globally inclusive vision-language models.
PolyChartQA 是一个新的多语言图表问答基准,旨在解决现有基准对全球受众的不包容性问题。它包含 22,606 张图表和 26,151 对 QA 对,在 10 种语言中评估了最先进的大型视觉语言模型,结果显示英语与其他语言之间的性能差距显著。在 PolyChartQA-Train 上进行微调可以提高不同模型大小和架构的多语言图表理解能力。
GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts
Authors: Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu
First: 2026-01-08T16:58:07+00:00 · Latest: 2026-01-08T16:58:07+00:00
Comments: Code available at https://github.com/Zengwh02/GlimpRouter
Abstract
Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.
中文标题/摘要
标题:GlimpRouter:通过窥视一个思维令牌实现高效的协作推理
大型推理模型(LRMs)通过显式生成多步思维链实现显著性能,但这种能力导致了推理延迟和计算成本的大幅增加。协作推理通过在轻量级和大型模型之间选择性分配工作提供了有希望的解决方案,但一个基本挑战仍然存在:确定推理步骤何时需要大型模型的容量或小型模型的效率。现有的路由策略要么依赖于局部令牌概率,要么进行事后验证,引入了显著的推理开销。在本文中,我们提出了一种新的步骤协作视角:推理步骤的难度可以从其第一个令牌中推断出来。受LRMs中“恍然大悟”现象的启发,我们展示了初始令牌的熵是步骤难度的强预测器。基于这一洞察,我们引入了GlimpRouter,这是一种无需训练的步骤协作框架。GlimpRouter使用一个轻量级模型仅生成每个推理步骤的第一个令牌,并仅当初始令牌的熵超过阈值时才将步骤路由到一个更大的模型。在多个基准上的实验表明,我们的方法在显著减少推理延迟的同时保持了准确性。例如,与单独使用大型模型相比,GlimpRouter在AIME25上的准确率提高了10.7%,推理延迟减少了25.9%。这些结果表明,一种简单而有效的推理机制是:根据思维的一瞥来分配计算,而不是对整个步骤进行评估。
Summary / 总结
GlimpRouter is a training-free, step-wise collaborative inference framework that uses the entropy of the first token of each reasoning step to predict the step's difficulty. A lightweight model generates only that first token, and the step is routed to a larger model only when the initial-token entropy exceeds a threshold. On AIME25, GlimpRouter improves accuracy by 10.7% while reducing inference latency by 25.9% compared to a standalone large model.
GlimpRouter通过推理步骤的第一个令牌的熵来推断推理步骤的难度,实现了一种步进协作的方法,减少了推理延迟和计算成本,同时保持了准确性。实验结果显示,在AIME25上相比单一的大模型,准确率提高了10.7%,推理延迟减少了25.9%。
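The routing rule described in this entry can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the small model exposes a next-token probability distribution for the first token of a step, and the function names and the threshold value of 1.0 nats are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_step(first_token_probs: Sequence[float], threshold: float) -> str:
    """Pick the model that writes the step from the small model's first-token distribution.

    High entropy is treated as a sign of a hard step and sent to the large
    model; low entropy keeps the step on the small model.
    """
    return "large" if token_entropy(first_token_probs) > threshold else "small"


# Hypothetical first-token distributions over a 4-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]   # peaked: small model is sure how the step starts
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform: maximal entropy, log(4) nats

decision_easy = route_step(confident, threshold=1.0)
decision_hard = route_step(uncertain, threshold=1.0)
```

Because only one token is generated before the decision, the routing overhead stays small compared with strategies that verify a full step after it has been written.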
Instruction Tuning with and without Context: Behavioral Shifts and Downstream Impact
Authors: Hyunji Lee, Seunghyun Yoon, Yunjae Won, Hanseok Oh, Geewook Kim, Trung Bui, Franck Dernoncourt, Elias Stengel-Eskin, Mohit Bansal, Minjoon Seo
First: 2025-06-18T14:13:56+00:00 · Latest: 2026-01-08T16:32:25+00:00
Abstract
Instruction tuning is a widely used approach to improve the instruction-following ability of large language models (LLMs). Instruction-tuning datasets typically include a mixture of context-augmented and context-free examples, yet prior work has largely combined these data types without examining their distinct effects. In this paper, we investigate how training LLMs with or without context affects model behavior and downstream performance. First, in the text domain, we show that LLMs trained with context attend more strongly to the provided knowledge, achieving better grounding. We also observe that context-augmented training shifts how LLMs use knowledge: models store and leverage less parametric knowledge and instead depend more on the provided context. Second, we observe that using an LLM trained with context-augmented data as the backbone for vision-language models reduces hallucination and improves grounding in the visual domain. Finally, we explore practical strategies for real-world deployments where context availability varies. We show that maintaining separate context-augmented and context-free models and routing inputs between them yields more robust overall performance than training a single mixed model, as it better preserves their complementary strengths.
中文标题/摘要
标题:带有和不带有上下文的指令调优:行为变化及下游影响
指令调优是广泛用于提高大型语言模型(LLM)遵循指令能力的一种方法。指令调优数据集通常包含上下文增强和无上下文示例的混合,但先前的工作大多将这些数据类型结合起来而没有考察它们各自的影响。在本文中,我们研究了在有无上下文的情况下训练LLM如何影响模型行为和下游性能。首先,在文本领域,我们展示了使用上下文训练的LLM更强烈地关注提供的知识,从而实现更好的定位。我们还观察到,上下文增强的训练改变了LLM使用知识的方式:模型存储和利用较少的参数化知识,而是更多地依赖提供的上下文。其次,我们观察到,使用使用上下文增强数据训练的LLM作为视觉-语言模型的骨干可以减少幻觉并改善视觉领域的定位。最后,我们探讨了在上下文可用性变化的现实世界部署中的实用策略。我们展示了保持分离的上下文增强和无上下文模型,并在它们之间路由输入,比训练单一混合模型能获得更稳健的整体性能,因为它更好地保留了它们的互补优势。
Summary / 总结
This paper investigates the effects of training large language models (LLMs) with or without context on their instruction-following ability and downstream performance. It finds that context-augmented training improves grounding and shifts model behavior, making them more dependent on provided context rather than parametric knowledge. This approach also reduces hallucination and improves grounding in vision-language models. The study suggests maintaining separate models for context-augmented and context-free data can yield better overall performance by leveraging the complementary strengths of both models.
本文研究了在有或无上下文的情况下训练大型语言模型(LLMs)对其指令遵循能力和下游性能的影响。研究发现,带有上下文的训练可以提高模型的定位能力,并改变模型的行为,使其更多依赖于提供的上下文而非参数知识。这种方法还能减少幻觉现象并在视觉语言模型中提高定位能力。研究建议,维护独立的上下文增强和无上下文模型,并在两者之间路由输入,可以更好地利用两者的优势,从而获得更好的整体性能。
From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)
Authors: Suyash Mishra, Qiang Li, Srikanth Patil, Anubhav Girdhar
First: 2026-01-08T16:02:56+00:00 · Latest: 2026-01-08T16:02:56+00:00
Comments: Contributed original research to top tier conference in VLM; currently undergoing peer review
Abstract
Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data (e.g., long clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain-adapted Video to Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut & Merge algorithm with fade in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost-efficient end-to-end pipeline strategy balancing ALM/VLM enhanced processing. Evaluations on the Video MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate a 3 to 4 times speedup, a 4 times cost reduction, and competitive clip quality. Beyond efficiency gains, we also report that our method improves clip coherence scores (0.348) and informativeness scores (0.721) over state-of-the-art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance-supporting video summarization for life sciences.
中文标题/摘要
标题:从理解到参与:通过视觉语言模型(VLMs)个性化制药视频片段
视觉语言模型(VLMs)有望通过实现智能、可扩展和自动化的多模态内容处理来革新制药行业的数字化转型。传统的异构数据模态(文本、图像、视频、音频和网页链接)的手动注释容易导致不一致、内容质量下降和内容利用效率低下。大量的长视频和音频数据进一步加剧了这些挑战(例如,长期的临床试验访谈和教育研讨会)。 在这里,我们介绍了一种针对制药领域的视频到视频片段生成框架,该框架结合了音频语言模型(ALMs)和视觉语言模型(VLMs)以生成高光片段。我们的贡献包括三个方面:(i)一种可重复的剪切与合并算法,带有淡入淡出和时间戳规范化,确保平滑过渡和音视频对齐;(ii)基于角色定义和提示注入的个性化机制,以实现定制输出(营销、培训、监管);(iii)一种成本效益高的端到端管道策略,平衡了ALM/VLM增强处理。在Video MME基准(900)和我们14个疾病领域16,159个制药视频的专有数据集上进行的评估显示,速度提高了3到4倍,成本降低了4倍,片段质量具有竞争力。除了效率提升,我们还报告了我们的方法提高了片段连贯性评分(0.348)和信息量评分(0.721),超过了最先进的VLM基线(例如,Gemini 2.5 Pro),突显了透明、定制提取和符合法规要求的视频摘要在生命科学领域的潜力。
Summary / 总结
The research aims to leverage Vision Language Models (VLMs) and Audio Language Models (ALMs) to automate the generation of personalized highlight clips from long pharmacy video data. The method includes a reproducible Cut & Merge algorithm and a personalization mechanism based on role definition and prompt injection. The study demonstrates a 3 to 4 times speedup, 4 times cost reduction, and improved clip coherence and informativeness scores compared to state-of-the-art VLMs, highlighting the potential for efficient and compliant video summarization in the life sciences.
该研究提出了一种基于音频语言模型(ALMs)和视觉语言模型(VLMs)的领域适配视频到视频剪辑生成框架,以自动化生成长药学视频的高光剪辑。该框架包括可重复的剪辑与合并算法和个人化机制以生成定制输出。在基准数据集和自有数据集上的评估显示,该方法比最先进的VLMs提高了3到4倍的速度,降低了4倍的成本,并且提高了剪辑的连贯性和信息性评分。
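The abstract names a Cut & Merge algorithm with timestamp normalization but gives no details, so the following is only a guess at what such a step could look like. It assumes highlight segments arrive as (start, end) pairs in seconds; the function name `normalize_and_merge` and the `max_gap` parameter are hypothetical, and fade in/out is not modeled.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def normalize_and_merge(
    segments: List[Segment], duration: float, max_gap: float = 0.0
) -> List[Segment]:
    """Clamp segments to [0, duration], drop empty ones, then merge segments
    that overlap or are separated by at most `max_gap` seconds."""
    clamped = sorted(
        (max(0.0, start), min(duration, end))
        for start, end in segments
        if min(duration, end) > max(0.0, start)
    )
    merged: List[Segment] = []
    for start, end in clamped:
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous clip instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical candidate highlights for a 120-second video.
clips = normalize_and_merge(
    [(50.0, 70.0), (-5.0, 10.0), (8.0, 20.0), (95.0, 130.0)],
    duration=120.0,
    max_gap=1.0,
)
```

Merging before cutting keeps the number of separate clips, and therefore the number of transitions that need a fade, as small as possible.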
Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform
Authors: Suyash Mishra, Qiang Li, Srikanth Patil, Satyanarayan Pati, Baddu Narendra
First: 2026-01-08T12:42:17+00:00 · Latest: 2026-01-08T12:42:17+00:00
Comments: Submitted to the Industry Track of Top Tier Conference; currently under peer review
Abstract
Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4, M4V, etc.), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3-8 times efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8/12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. Rather than proposing a new "A+B" model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provides actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.
中文标题/摘要
标题:在工业级GenAI平台上扩展视觉语言模型以处理制药长格式视频推理
视觉语言模型(VLMs)在多模态推理任务中表现出强大的性能,但大多数评估集中在短视频上,并假设不受限制的计算资源。在制药内容理解等工业环境中,从业人员必须在严格的GPU、延迟和成本约束下处理长格式视频,而许多现有方法无法扩展。在本研究中,我们提出了一种工业级GenAI框架,处理了超过200,000个PDF、25,326个八种格式(例如MP4、M4V)的视频以及888个涉及20多种语言的多语言音频文件。我们的研究做出了三项贡献:(i)制药领域的大规模多模态推理工业架构;(ii)在两个领先基准(Video-MME和MMBench)和涵盖14个疾病领域、包含25,326个视频的自有数据集上对超过40个VLMs的实证分析;(iii)关于长格式视频推理的四项发现:多模态的作用、注意力机制权衡、时间推理限制以及在GPU约束下视频分割的挑战。结果表明,在普通商用GPU上,SDPA注意力机制可带来3-8倍的效率提升,多模态在多达8/12个任务领域(尤其是长度依赖性任务)中可提高性能,开源和闭源VLMs在时间对齐和关键帧检测方面存在明显瓶颈。本文并未提出新的“A+B”模型,而是对在现实部署约束下当前VLMs的实用极限、权衡和失败模式进行了描述,并为研究人员和从业者设计可扩展的多模态系统提供了实用指导,以用于工业领域的长格式视频理解。
Summary / 总结
This work addresses the scalability of Vision Language Models (VLMs) for processing long-form pharmaceutical videos under industrial GPU, latency, and cost constraints. The study evaluates over 40 VLMs on two public benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas. Key findings include 3-8 times efficiency gains with SDPA attention on commodity GPUs, multimodality improving performance in up to 8 of 12 task domains (especially length-dependent tasks), and bottlenecks in temporal alignment and keyframe detection. The research provides actionable guidance for designing scalable multimodal systems in industrial settings.
该研究针对工业环境中Vision Language Models (VLMs)在处理制药领域长视频推理时的可扩展性挑战。研究展示了一个工业级的GenAI框架,处理了25,326个视频和888个音频文件,覆盖14个疾病领域。关键发现包括使用SDPA注意力机制提高效率,通过多模态提高长度依赖任务的表现,以及在时间对齐和关键帧检测方面存在的瓶颈。研究提供了在实际部署条件下设计可扩展多模态系统的实用见解和行动指南。
SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models
Authors: Oriol Rabasseda, Zenjie Li, Kamal Nasrollahi, Sergio Escalera
Venue: WACV
First: 2026-01-08T10:58:59+00:00 · Latest: 2026-01-08T10:58:59+00:00
Comments: This work has been accepted at Real World Surveillance: Applications and Challenges, 6th (in WACV Workshops)
Abstract
Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.
中文标题/摘要
标题:SOVABench:一种针对多模态大语言模型的车辆监控动作检索基准
自动识别事件和重复行为分析是视频监控的关键。然而,大多数现有的基于内容的视频检索基准主要关注场景相似性,而不评估监控所需的动作区分。为了解决这一差距,我们引入了SOVABench(Surveillance Opposite Vehicle Actions Benchmark),这是一个基于监控视频构建的现实世界检索基准,专注于车辆相关动作。SOVABench 定义了两种评估协议(跨对和内对),以评估跨动作区分和时间方向理解。尽管动作区分对人类观察者来说通常很直观,但我们的实验表明,它们仍然对最先进的视觉和多模态模型构成挑战。 利用多模态大语言模型(MLLMs)的视觉推理和指令跟随能力,我们提出了一种无需训练的框架,用于从MLLM生成的描述中生成可解释的嵌入,适用于图像和视频。该框架在SOVABench以及几个对比视觉-语言模型常常失败的空间和计数基准上都取得了良好的性能。基准的代码、注释和构建说明已公开。
Summary / 总结
SOVABench is a retrieval benchmark built from real surveillance footage that evaluates vehicle action discrimination, a capability that existing scene-level retrieval benchmarks do not test. Two protocols (inter-pair and intra-pair) assess cross-action discrimination and temporal direction understanding, and both remain challenging for state-of-the-art vision and multimodal models. The authors also present a training-free framework that derives interpretable embeddings from MLLM-generated descriptions; it performs strongly on SOVABench and on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code and annotations are publicly available.
SOVABench 是一个新的车辆监视动作检索基准,填补了现有基准侧重于场景相似性而非动作区分的空白。它评估模型在跨动作区分和时间方向理解方面的表现。该基准使用实际的监视录像,并引入了两个评估协议。尽管对人类来说这些任务是直观的,但最先进的视觉和多模态模型在这些任务上仍然面临挑战。一个无需训练的框架利用多模态大型语言模型(MLLMs)生成可解释的嵌入,实现了在 SOVABench 和其他基准上的强大性能,而这些基准往往是对比视觉语言模型的弱项。
Agentic Retoucher for Text-To-Image Generation
Authors: Shaocheng Shen, Jianfeng Liang, Chunlei Cai, Cong Geng, Huiyu Duan, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai
First: 2026-01-05T12:06:43+00:00 · Latest: 2026-01-08T10:57:37+00:00
Abstract
Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, face, text and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.
中文标题/摘要
标题:代理修图师:用于文本到图像生成
文本到图像(T2I)扩散模型如SDXL和FLUX已经实现了令人印象深刻的写实效果,但在肢体、面部、文本等方面仍然普遍存在小规模失真。现有的精修方法要么进行昂贵的迭代重新生成,要么依赖于缺乏空间定位能力的视觉语言模型(VLMs),导致语义漂移和不可靠的局部编辑。为了解决这一问题,我们提出了一种名为代理修图师的分层决策驱动框架,将后生成修正重新构想为类似人类感知-推理-行动的循环。具体来说,我们设计了(1)一个感知代理,学习在文本-图像一致性线索下的细粒度失真定位的上下文显著性,(2)一个推理代理,通过逐步偏好对齐进行与人类对齐的推断诊断,以及(3)一个行动代理,根据用户偏好自适应地计划局部修复。该设计将感知证据、语言推理和可控修正整合到一个统一的、自我修正的决策过程中。为了实现精细监督和定量评估,我们进一步构建了包含6000张T2I图像和27000个注释缺陷区域的GenBlemish-27K数据集,覆盖12个类别。大量实验表明,代理修图师在感知质量、失真定位和人类偏好对齐方面始终优于最先进的方法,建立了自修正和感知可靠的T2I生成的新范式。
Summary / 总结
Agentic Retoucher is a hierarchical framework that addresses small-scale distortions in text-to-image generation by integrating perceptual evidence, linguistic reasoning, and controllable correction. It consists of a perception agent for fine-grained distortion localization, a reasoning agent for human-aligned inferential diagnosis, and an action agent for adaptive localized inpainting. The framework demonstrates superior performance in perceptual quality, distortion localization, and human preference alignment compared to existing methods, setting a new standard for self-corrective and perceptually reliable T2I generation.
Agentic Retoucher 是一个层次框架,通过整合感知证据、语言推理和可控修正来解决文本到图像生成中的小尺度失真问题。它包括一个用于精细失真定位的感知代理、一个进行人类对齐的推理诊断代理以及一个根据用户偏好进行自适应局部修复的行动代理。该框架在感知质量、失真定位和人类偏好对齐方面优于现有方法,确立了自纠正和感知可靠的 T2I 生成的新标准。
AECV-Bench: Benchmarking Multimodal Models on Architectural and Engineering Drawings Understanding
Authors: Aleksei Kondratenko, Mussie Birhane, Houssame E. Hsain, Guido Maciocci
First: 2026-01-08T10:54:32+00:00 · Latest: 2026-01-08T10:54:32+00:00
Abstract
AEC drawings encode geometry and semantics through symbols, layout conventions, and dense annotation, yet it remains unclear whether modern multimodal and vision-language models can reliably interpret this graphical language. We present AECV-Bench, a benchmark for evaluating multimodal and vision-language models on realistic AEC artefacts via two complementary use cases: (i) object counting on 120 high-quality floor plans (doors, windows, bedrooms, toilets), and (ii) drawing-grounded document QA spanning 192 question-answer pairs that test text extraction (OCR), instance counting, spatial reasoning, and comparative reasoning over common drawing regions. Object-counting performance is reported using per-field exact-match accuracy and MAPE results, while document-QA performance is reported using overall accuracy and per-category breakdowns with an LLM-as-a-judge scoring pipeline and targeted human adjudication for edge cases. Evaluating a broad set of state-of-the-art models under a unified protocol, we observe a stable capability gradient; OCR and text-centric document QA are strongest (up to 0.95 accuracy), spatial reasoning is moderate, and symbol-centric drawing understanding - especially reliable counting of doors and windows - remains unsolved (often 0.40-0.55 accuracy) with substantial proportional errors. These results suggest that current systems function well as document assistants but lack robust drawing literacy, motivating domain-specific representations and tool-augmented, human-in-the-loop workflows for an efficient AEC automation.
中文标题/摘要
标题:AECV-Bench:建筑和工程图纸理解的多模态模型基准测试
建筑和工程(AEC)图纸通过符号、布局规范和密集注释来编码几何和语义,但尚不清楚现代多模态和视觉-语言模型是否能可靠地解释这种图形语言。我们提出了AECV-Bench,这是一个基准测试,通过两个互补的应用场景来评估多模态和视觉-语言模型在现实AEC制品上的表现:(i) 在120份高质量的楼层平面图上进行物体计数(门、窗、卧室、厕所),(ii) 包含192个问题-答案对的图纸指导文档问答,测试文本提取(OCR)、实例计数、空间推理和常见绘图区域上的比较推理。物体计数性能使用每个字段的精确匹配准确率和MAPE结果报告,而文档问答性能使用总体准确率和按类别细分的评分管道报告,并通过LLM作为法官的评分流程和针对边缘情况的人工复核。在统一协议下评估一系列最先进的模型,我们观察到一种稳定的性能梯度;文本提取和文本为中心的文档问答表现最强(高达0.95的准确率),空间推理表现中等,而以符号为中心的绘图理解——尤其是门和窗的可靠计数——仍然无法解决(通常准确率在0.40-0.55之间),存在大量比例错误。这些结果表明,当前系统在文档助手方面表现良好,但在绘图素养方面缺乏稳健性,这促使了针对特定领域的表示和工具增强、人类在环的工作流程,以实现高效的AEC自动化。
Summary / 总结
The research evaluates how well modern multimodal and vision-language models understand architectural and engineering drawings. AECV-Bench, a new benchmark, assesses models through object counting on 120 floor plans and drawing-grounded document QA over 192 question-answer pairs. Models are strongest on OCR and text-centric document QA (up to 0.95 accuracy), moderate on spatial reasoning, and weakest on symbol-centric understanding such as counting doors and windows (often 0.40-0.55 accuracy), indicating a need for domain-specific representations and tool-augmented, human-in-the-loop workflows for AEC automation.
AECV-Bench 通过对象计数和基于图纸的文档问答评估多模态和视觉-语言模型在建筑和工程图纸上的表现。基准数据集包括120个楼层平面图用于对象计数,以及192个问题-答案对用于文档问答。结果显示,在OCR和文本提取方面表现强劲,在空间推理方面表现一般,在符号导向的图纸理解方面表现较差,尤其是门窗的计数。这表明当前系统在文档助手方面表现良好,但在图纸阅读方面缺乏稳健性,需要领域特定的表示和人工在环的工作流程以实现高效的AEC自动化。
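The two object-counting metrics named in the AECV-Bench abstract, per-field exact-match accuracy and MAPE, can be illustrated with a minimal Python sketch. This is not the benchmark's evaluation code: the function names, the sample counts, and the choice to skip items whose true count is zero when computing MAPE are all assumptions made for illustration.

```python
def exact_match_accuracy(predicted, actual):
    """Fraction of items whose predicted count equals the true count."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)


def mape(predicted, actual):
    """Mean absolute percentage error, in percent.

    Assumption: items with a true count of zero are skipped, because the
    percentage error is undefined for them.
    """
    if len(predicted) != len(actual):
        raise ValueError("inputs must be of equal length")
    terms = [abs(p - a) / a for p, a in zip(predicted, actual) if a != 0]
    if not terms:
        raise ValueError("no item with a non-zero true count")
    return 100.0 * sum(terms) / len(terms)


# Hypothetical door counts for four floor plans (illustrative data only).
doors_true = [4, 10, 8, 5]
doors_pred = [4, 8, 8, 6]

print("exact match:", exact_match_accuracy(doors_pred, doors_true))
print("MAPE (%):", mape(doors_pred, doors_true))
```

Reporting both metrics is useful because exact match treats a miss by one door the same as a miss by ten, while MAPE captures the proportional size of the error.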
Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization
Authors: Jiwei Guan, Haibo Jin, Haohan Wang
First: 2026-01-05T02:49:33+00:00 · Latest: 2026-01-08T10:46:04+00:00
Comments: EACL
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have shown groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attack methods require full model access, incur high computing costs, and exhibit insufficient adversarial transferability, making them impractical for real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation by input-output interactions without requiring model knowledge, (ii) model-agnostic optimization without a surrogate model, and (iii) lower resource requirements with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs, including InstructBLIP, LLaVA and MiniGPT-4, achieving the highest jailbreak success rate of 83.0% on InstructBLIP, while maintaining imperceptible perturbations comparable to white-box methods. Moreover, adversarial examples generated from MiniGPT-4 exhibit strong transferability to other LVLMs, with ASR reaching 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs.
中文标题/摘要
标题:使用黑盒优化构建针对大型视觉-语言模型的对抗输入
大型视觉-语言模型(LVLMs)在多种多模态任务中展现了突破性的能力。然而,这些模型仍然容易受到对抗性越狱攻击的影响,攻击者通过施加微妙的扰动来绕过安全机制并触发有害输出。现有的白盒攻击方法需要完全访问模型,计算成本高且对抗性迁移性不足,使其在实际的黑盒环境中不切实际。为了解决这些限制,我们提出了一种通过零阶优化和同时扰动随机近似(ZO-SPSA)对LVLMs进行黑盒越狱攻击的方法。ZO-SPSA提供了三个关键优势:(i) 无需模型知识的输入-输出交互的无梯度近似,(ii) 不依赖于替代模型的模型无关优化,(iii) 降低资源需求,减少GPU内存消耗。我们在三个LVLMs上评估了ZO-SPSA,包括InstructBLIP、LLaVA和MiniGPT-4,在InstructBLIP上实现了最高的越狱攻击成功率83.0%,同时保持与白盒方法相当的不可感知扰动。此外,从MiniGPT-4生成的对抗性示例在其他LVLMs上表现出强大的迁移性,ASR达到64.18%。这些发现强调了黑盒越狱攻击在实际环境中的可行性,并揭示了当前LVLMs安全机制中的关键弱点。
Summary / 总结
This study addresses the vulnerability of Large Vision-Language Models (LVLMs) to adversarial attacks by proposing a black-box jailbreak attack using Zeroth-Order optimization with Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA allows for gradient-free optimization without model knowledge, reducing computational costs and resource requirements. The method achieves high jailbreak success rates, with 83.0% on InstructBLIP and strong transferability to other models, highlighting the need for improved safety mechanisms in LVLMs.
该研究提出了一种使用零阶优化与同时扰动随机近似(ZO-SPSA)的黑盒越狱攻击方法,以揭示大型视觉-语言模型(LVLMs)在对抗攻击下的脆弱性。该方法无需模型知识,具有模型无关性,并且资源需求较低。在InstructBLIP、LLaVA和MiniGPT-4三个模型上的实验显示,InstructBLIP上的越狱成功率最高,达到83.0%;由MiniGPT-4生成的对抗样本对其他LVLMs具有较强的迁移性,ASR达到64.18%。这表明黑盒攻击在现实环境中切实可行,并揭示了当前LVLMs安全机制的关键弱点。
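The gradient-free estimate behind SPSA, which the abstract above relies on, can be sketched generically in Python. This is the textbook two-query SPSA estimator applied to a toy quadratic objective, not the authors' attack pipeline: the loss function, step sizes, and function names are assumptions, and the toy loss merely stands in for any objective that can only be queried through inputs and outputs.

```python
import random


def spsa_gradient(loss, x, c, rng):
    """Estimate the gradient of `loss` at `x` from two loss queries.

    All coordinates are perturbed at once with a random +/-1 vector, so the
    query cost does not grow with the dimension of `x`.
    """
    delta = [rng.choice((-1.0, 1.0)) for _ in x]
    x_plus = [xi + c * di for xi, di in zip(x, delta)]
    x_minus = [xi - c * di for xi, di in zip(x, delta)]
    diff = loss(x_plus) - loss(x_minus)
    return [diff / (2.0 * c * di) for di in delta]


def spsa_minimize(loss, x0, steps=300, lr=0.1, c=0.01, seed=0):
    """Plain descent on the SPSA estimate (fixed step sizes for simplicity)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        grad = spsa_gradient(loss, x, c, rng)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


# Toy black-box objective (an assumption): squared distance to a fixed target.
TARGET = [1.0, -2.0, 0.5]


def toy_loss(x):
    return sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET))


solution = spsa_minimize(toy_loss, [0.0, 0.0, 0.0])
print([round(v, 3) for v in solution])
```

Each iteration needs only two evaluations of the objective and no backward pass, which is consistent with the lower GPU memory use the abstract reports for the zeroth-order approach.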
CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics
Authors: Dahyeon Kye, Jeahun Sung, Minkyu Jeon, Jihyong Oh
First: 2025-12-08T04:39:12+00:00 · Latest: 2026-01-08T10:29:58+00:00
Comments: Please visit our project page at https://cmlab-korea.github.io/CHIMERA/
Abstract
Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down-, mid-, and up-block features from both inputs during DDIM inversion and re-injects them adaptively during denoising, which enables spatial and semantic alignment in a depth- and time-adaptive manner as well as natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.
中文标题/摘要
标题:CHIMERA:自适应缓存注入与语义锚点提示的零样本图像形态变换框架
扩散模型展示了惊人的生成能力,但在实现平滑且语义一致的图像形态变换方面仍面临挑战。现有方法往往由于缺乏自适应结构和语义对齐而产生突兀的过渡或过度饱和的外观。我们提出CHIMERA,一种基于扩散的零样本框架,将形态变换形式化为缓存反演引导的去噪过程。为处理大规模语义和外观差异,我们提出了自适应缓存注入和语义锚点提示。自适应缓存注入(ACI)在DDIM反演过程中缓存来自两个输入的下、中、上层特征,并在去噪过程中适当地重新注入,从而以深度和时间自适应的方式实现空间和语义对齐,并实现自然特征融合和平滑过渡。语义锚点提示(SAP)利用视觉-语言模型生成共享的锚点提示,作为语义锚点,连接不相似的输入,并引导去噪过程向一致的结果发展。最后,我们引入全局-局部一致性评分(GLCS),这是一种形态变换导向的度量标准,同时评估两个输入的全局协调性和局部形态变换的平滑度。广泛的实验和用户研究显示,CHIMERA实现了比现有方法更平滑且更语义对齐的过渡,建立了图像形态变换的新基准。代码和项目页面将公开发布。
Summary / 总结
CHIMERA is a zero-shot diffusion-based framework for image morphing that addresses abrupt transitions and over-saturation by introducing Adaptive Cache Injection (ACI) and Semantic Anchor Prompting (SAP). ACI caches features from both inputs during DDIM inversion and re-injects them adaptively during denoising, enabling spatial and semantic alignment. SAP uses a vision-language model to generate a shared anchor prompt that guides the denoising process toward coherent results. GLCS, a morphing-oriented metric, evaluates global harmonization and local smoothness. Experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, setting a new state of the art in image morphing.
CHIMERA 是一种零样本扩散基础框架,用于解决图像变形中实现平滑且语义一致过渡的挑战。它使用自适应缓存注入和语义锚点提示来处理大语义和外观差异。实验结果表明,CHIMERA 在生成更平滑且语义对齐的过渡方面优于现有方法,建立了图像变形的新基准。该框架引入了全局-局部一致性评分来评估全局和谐性和局部平滑性。
ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting
Authors: Yen-Jen Chiou, Wei-Tse Cheng, Yuan-Fu Yang
First: 2026-01-08T09:20:46+00:00 · Latest: 2026-01-08T09:20:46+00:00
Comments: 10 pages, 5 figures
Abstract
We present ProFuse, an efficient context-aware framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). The pipeline enhances cross-view consistency and intra-mask cohesion within a direct registration setup, adding minimal overhead and requiring no render-supervised fine-tuning. Instead of relying on a pretrained 3DGS scene, we introduce a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry while jointly constructing 3D Context Proposals via cross-view clustering. Each proposal carries a global feature obtained through weighted aggregation of member embeddings, and this feature is fused onto Gaussians during direct registration to maintain per-primitive language coherence across views. With associations established in advance, semantic fusion requires no additional optimization beyond standard reconstruction, and the model retains geometric refinement without densification. ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene, which is two times faster than SOTA.
中文标题/摘要
标题:ProFuse:面向开放词汇3D高斯泼溅的高效跨视图上下文融合
我们提出了ProFuse,一种基于3D高斯泼溅(3DGS)的开放词汇3D场景理解的高效上下文感知框架。该流水线在直接配准设置中增强跨视图一致性及掩码内的内聚性,增加的开销极小,且无需渲染监督微调。我们不依赖预训练的3DGS场景,而是引入了一个由密集对应关系引导的预配准阶段,该阶段以准确的几何信息初始化高斯基元,同时通过跨视图聚类联合构建3D上下文提案。每个提案携带一个通过加权聚合成员嵌入获得的全局特征,并在直接配准过程中将该特征融合到高斯基元上,以保持各基元的语言特征在不同视图间的一致性。由于关联已预先建立,语义融合无需标准重建之外的额外优化,且模型无需致密化即可保留几何细化。ProFuse在实现强大的开放词汇3DGS理解的同时,每个场景约五分钟即可完成语义附着,比当前最先进方法快两倍。
Summary / 总结
ProFuse is an efficient framework for open-vocabulary 3D scene understanding using 3D Gaussian Splatting. It enhances cross-view consistency and intra-mask cohesion through a dense correspondence-guided pre-registration phase and cross-view clustering. This method initializes Gaussians with accurate geometry and fuses global features during direct registration, maintaining semantic coherence across views. ProFuse completes semantic attachment in about five minutes per scene, which is twice as fast as the state-of-the-art methods.
ProFuse 是一种使用 3D 高斯泼溅的高效开放词汇 3D 场景理解框架。它通过密集对应关系引导的预配准阶段和跨视图聚类增强跨视图一致性与掩码内的内聚性。该方法为每个 3D 上下文提案引入全局特征,并在直接配准过程中将其融合以保持语义一致性。ProFuse 每个场景完成语义附着大约需要五分钟,比最先进的方法快两倍。
Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition
Authors: Masatomo Yoshida, Haruto Namura, Nicola Adami, Masahiro Okuda
Venue: Proc. ITC-CSCC 2025
First: 2026-01-08T09:15:27+00:00 · Latest: 2026-01-08T09:15:27+00:00
Comments: accepted to ITC-CSCC 2025
Abstract
This work explores the visual capabilities and limitations of foundation models by introducing a novel adversarial attack method utilizing skeletonization to reduce the search space effectively. Our approach specifically targets images containing text, particularly mathematical formula images, which are more challenging due to their LaTeX conversion and intricate structure. We conduct a detailed evaluation of both character and semantic changes between original and adversarially perturbed outputs to provide insights into the models' visual interpretation and reasoning abilities. The effectiveness of our method is further demonstrated through its application to ChatGPT, which shows its practical implications in real-world scenarios.
中文标题/摘要
标题:基于骨架化的大规模视觉语言模型数学文本识别的对抗性扰动
本研究通过引入一种利用骨架化减少搜索空间的新颖对抗攻击方法,探索基础模型的视觉能力和局限性。我们的方法特别针对包含文本的图像,尤其是由于其LaTeX转换和复杂的结构,数学公式图像更具挑战性。我们详细评估了原始输出和对抗性扰动输出之间的字符和语义变化,以提供模型视觉解释和推理能力的见解。通过将其应用于ChatGPT,进一步证明了该方法的有效性及其在实际场景中的实际意义。
Summary / 总结
This work investigates the visual recognition capabilities of large vision language models by introducing a skeletonization-based adversarial attack method. The method targets mathematical formula images, reducing the search space and evaluating character and semantic changes. The effectiveness is demonstrated through its application to ChatGPT, highlighting the models' limitations in visual interpretation and reasoning.
该研究通过使用骨架化基于的对抗攻击方法,探索大型视觉语言模型的视觉识别能力。该方法针对数学公式图像,减少搜索空间,并评估原始图像和对抗扰动输出之间的字符和语义变化。研究结果揭示了模型在视觉解释和推理方面的局限性,并通过ChatGPT展示了其实用意义。
Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning
Authors: Wentao Zhang, Lifei Wang, Lina Lu, MingKun Xu, Shangyang Li, Yanchao Yang, Tao Fang
Venue: ACL 2026 long
First: 2026-01-08T07:34:37+00:00 · Latest: 2026-01-08T07:34:37+00:00
Comments: This paper is submitted for review to ACL 2026. It is 17 pages long and includes 5 figures. The corresponding authors are Tao Fang and Lina Lu
Abstract
Agricultural disease diagnosis challenges VLMs, as conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. While reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose Agri-R1, a reasoning-enhanced large model for agriculture. Our framework automates high-quality reasoning data generation via vision-language synthesis and LLM-based filtering, using only 19% of available samples. Training employs Group Relative Policy Optimization (GRPO) with a newly proposed reward function that integrates domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility in open-ended responses. Evaluated on CDDMBench, our resulting 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines, showing a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, with benefits scaling as question complexity increases.
中文标题/摘要
标题:Agri-R1:通过强化学习增强通用农业推理能力的视觉语言模型
农业疾病诊断对VLMs构成挑战,因为传统的微调需要大量的标签,缺乏可解释性且泛化能力差。尽管推理可以提高模型的鲁棒性,但现有方法依赖于昂贵的专家注释,很少解决农业查询的开放性和多样性。为了解决这些限制,我们提出了Agri-R1,一种面向农业的推理增强大型模型。我们的框架通过视觉语言合成和基于LLM的过滤自动生成高质量的推理数据,仅使用可用样本的19%。训练使用组相对策略优化(GRPO)和一个新提出的奖励函数,该函数结合领域特定词汇和模糊匹配来评估开放性回答的正确性和语言灵活性。在CDDMBench上评估,我们的3B参数模型取得了与7B至13B参数基线相当的性能;相比标准微调,疾病识别准确率相对提升23.2%,农业知识问答提升33.3%,跨域泛化提升26.10分。消融研究证实,结构化推理数据与GRPO驱动的探索之间的协同作用是这些收益的基础,且随着问题复杂性的增加,收益也随之增大。
Summary / 总结
The paper addresses agricultural disease diagnosis with vision-language models by proposing Agri-R1. Reasoning data is generated automatically through vision-language synthesis and LLM-based filtering, using only 19% of available samples, and the model is then trained with Group Relative Policy Optimization (GRPO) under a reward function that integrates domain-specific lexicons and fuzzy matching. On CDDMBench, the resulting 3B-parameter model is competitive with 7B- to 13B-parameter baselines and improves over standard fine-tuning by 23.2% (relative) in disease recognition accuracy, 33.3% in agricultural knowledge QA, and 26.10 points in cross-domain generalization.
Agri-R1通过提出一种增强推理的模型来解决传统微调在农业疾病诊断中的局限性。该模型通过视觉-语言合成和基于LLM的数据过滤自动生成高质量的推理数据,仅需19%的可用样本。训练使用Group Relative Policy Optimization,并结合领域特定词汇和模糊匹配的奖励函数。最终生成的3B参数模型在疾病识别准确性上提高了23.2%,在农业知识问答上提高了33.3%,并且在跨域泛化上比标准微调提高了26.10分。
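Two ingredients mentioned in the Agri-R1 abstract, a reward that mixes fuzzy matching with a domain lexicon and GRPO's group-relative advantage, can be sketched in Python as follows. The reward shown here is a hypothetical stand-in, since the abstract does not specify the actual formula: the weighting, the tiny lexicon, and the function names are assumptions. The advantage computation follows the commonly described GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
import difflib
import statistics


def fuzzy_lexicon_reward(response, reference, lexicon, lexicon_weight=0.3):
    """Hypothetical reward: string similarity plus coverage of domain terms.

    Coverage is the share of lexicon terms found in the reference answer
    that also appear in the model response.
    """
    resp, ref = response.lower(), reference.lower()
    similarity = difflib.SequenceMatcher(None, resp, ref).ratio()
    ref_terms = [t.lower() for t in lexicon if t.lower() in ref]
    if ref_terms:
        coverage = sum(1 for t in ref_terms if t in resp) / len(ref_terms)
    else:
        coverage = 0.0
    return (1.0 - lexicon_weight) * similarity + lexicon_weight * coverage


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative data only: one question, three sampled answers.
lexicon = ["late blight", "powdery mildew", "leaf rust"]
reference = "The tomato leaf shows late blight."
samples = [
    "This tomato leaf shows late blight.",
    "The leaf looks like it has powdery mildew.",
    "The plant is healthy.",
]
rewards = [fuzzy_lexicon_reward(s, reference, lexicon) for s in samples]
print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group, no separate value network is needed; answers are ranked only against other answers to the same question.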
GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning
Authors: Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuxin Hu
First: 2026-01-07T17:26:41+00:00 · Latest: 2026-01-08T06:19:12+00:00
Abstract
The evolution of Remote Sensing Vision-Language Models (RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.
中文标题/摘要
标题:GeoReason: 通过逻辑一致性强化学习使遥感视觉语言模型的思考与回答保持一致
遥感视觉语言模型(RS-VLMs)的发展强调了从感知中心的识别向高级演绎推理过渡的重要性,以增强复杂空间任务中的认知可靠性。然而,当前的模型往往遭受逻辑幻觉的困扰,即正确的答案是基于有缺陷的推理链或依赖于位置捷径而非空间逻辑。这种脱节削弱了在战略空间决策中的可靠性。为了解决这一问题,我们提出了GeoReason框架,旨在同步内部思考与最终决策。我们首先构建了GeoReason-Bench,这是一个逻辑驱动的数据集,包含4,000条从几何原语和专家知识中合成的推理轨迹。然后我们制定了两阶段训练策略:(1) 监督知识初始化,以装备模型推理语法和领域专业知识,(2) 一致性感知强化学习,以提高演绎可靠性。这一阶段整合了一种新颖的逻辑一致性奖励,通过选项排列策略惩罚逻辑漂移,以确保决策基于可验证的推理轨迹。实验结果表明,我们的框架显著提高了RS-VLMs的认知可靠性和可解释性,达到了与其他先进方法相比的最优性能。
Summary / 总结
GeoReason is designed to improve the cognitive reliability of Remote Sensing Vision-Language Models (RS-VLMs) by addressing logical hallucinations. It introduces a two-stage training strategy: supervised knowledge initialization and consistency-aware reinforcement learning. The framework uses a novel Logical Consistency Reward to penalize logical drift and anchor decisions in verifiable reasoning traces. Experiments show that GeoReason enhances the interpretability and reliability of RS-VLMs, achieving state-of-the-art performance.
GeoReason 是一个框架,旨在通过使内部思考与最终决策同步来提高遥感视觉语言模型(RS-VLMs)的认知可靠性。它采用两阶段训练策略:监督知识初始化,使模型具备推理语法和领域专业知识,随后是通过引入新型逻辑一致性奖励来惩罚逻辑漂移的一致性感知强化学习。这种方法提高了 RS-VLMs 的可解释性和可靠性,并达到了最先进的性能。
MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models
Authors: Dianyi Wang, Siyuan Wang, Zejun Li, Yikun Wang, Yitong Li, Duyu Tang, Xiaoyu Shen, Xuanjing Huang, Zhongyu Wei
First: 2025-08-13T13:00:05+00:00 · Latest: 2026-01-08T05:44:40+00:00
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across multi-modal tasks by scaling model size and training data. However, these dense LVLMs incur significant computational costs, which motivates the exploration of sparse Mixture of Experts (MoE) architectures. While MoE improves parameter efficiency, effectively applying MoE to simultaneously model modality-specific features and cross-modal associations in LVLMs remains challenging. In this work, we propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) into LVLMs. For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts, enabling the model to jointly learn rich intra-modal features and cross-modal interactions. We further introduce an effective and straightforward two-stage training strategy, which facilitates the direct activation of both MoE and multi-modal capabilities. Extensive experiments across different data scales and LLM backbones demonstrate the effectiveness, efficiency and generality of our approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source multi-modal models based on MoE-LLMs that involve more activated parameters. The code is available at https://github.com/AlenjandroWang/MoIIE.
中文标题/摘要
标题:MoIIE:混合内模间模专家体系结构用于大型视觉语言模型
大型多模态视觉-语言模型(LVLMs)通过扩大模型规模和训练数据,在多模态任务中表现出显著的性能。然而,这些密集的LVLMs带来了显著的计算成本,并促使人们探索稀疏专家混合(MoE)架构。虽然MoE提高了参数效率,但同时建模LVLMs中的模态特定特征和跨模态关联仍然具有挑战性。在本文中,我们提出将混合内模间模专家(MoIIE)引入LVLMs。对于每个标记,专家路由由其模态引导,将标记导向其各自的内模专家以及共享的跨模专家池,使模型能够同时学习丰富的内模特征和跨模交互。我们还引入了一种有效且简单的两阶段训练策略,这有助于直接激活MoE和多模态能力。在不同数据规模和LLM主干网络的广泛实验中,证明了我们方法的有效性、效率和通用性。值得注意的是,我们的MoIIE模型在激活参数为55亿和113亿的情况下,与现有基于MoE-LLMs的多模态模型相比,性能相当甚至更优。代码可在https://github.com/AlenjandroWang/MoIIE/ 获取。
Summary / 总结
This work addresses the challenge of applying Mixture of Experts (MoE) to large Vision-Language Models (LVLMs) to improve parameter efficiency while maintaining performance. The proposed MoIIE framework routes tokens to both modality-specific and shared cross-modal experts, enabling the model to learn rich intra-modal features and cross-modal interactions. Experiments show that MoIIE models with fewer activated parameters match or outperform existing MoE-LLMs on multi-modal tasks, demonstrating effectiveness and efficiency. The code is available at https://github.com/AlenjandroWang/MoIIE.
本文旨在通过应用Mixture of Experts (MoE)来提高Vision-Language Models (LVLMs)的参数效率,同时保持性能。它提出了一种新的MoIIE架构,结合了内模和跨模专家,使令牌根据其模态被导向相应的专家。模型使用两阶段训练策略来增强MoE和多模态能力的激活。实验结果表明,MoIIE模型在不同数据规模和骨干模型上的多模态任务中,即使参数较少,也能匹配甚至超越现有先进的MoE-LLMs。
Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction
Authors: Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen
First: 2025-12-12T09:19:45+00:00 · Latest: 2026-01-08T05:10:53+00:00
Abstract
Vision-Language Models (VLMs) are able to process increasingly long videos. Yet, important visual information is easily lost across such long contexts and missed by VLMs. It is also important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.
中文标题/摘要
标题:最少片段,最大显著性:通过关键时刻提取进行长视频摘要
视觉-语言模型(VLMs)能够处理越来越长的视频。然而,重要视觉信息很容易在整个上下文中丢失并被VLMs忽略。此外,设计能够经济有效地分析长视频内容的工具也很重要。在本文中,我们提出了一种片段选择方法,旨在选择应包含在多模态摘要中的关键视频时刻。我们将视频划分为短片段,并使用轻量级视频描述模型生成每个片段的紧凑视觉描述。然后将这些描述传递给大型语言模型(LLM),该模型选择包含最多相关视觉信息的K个片段以构建多模态摘要。我们在MovieSum数据集中的人类标注屏幕剧和摘要的参考片段上评估了我们的方法。我们进一步表明,这些参考片段(不到电影的6%)足以构建MovieSum中电影的完整多模态摘要。使用我们的片段选择方法,我们实现了与这些参考片段相当的摘要性能,同时捕获了比随机片段选择多得多的相关视频信息。重要的是,我们通过依赖轻量级描述模型维持了较低的计算成本。
BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text Generation
Authors: Amit Bin Tariqul, A N M Zahid Hossain Milkan, Sahab-Al-Chowdhury, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan
First: 2026-01-08T03:01:59+00:00 · Latest: 2026-01-08T03:01:59+00:00
Comments: Under review, 12 pages, 7 figures, 5 tables
Abstract
As large language models (LLMs) are increasingly deployed for text generation, watermarking has become essential for authorship attribution, intellectual property protection, and misuse detection. While existing watermarking methods perform well in high-resource languages, their robustness in low-resource languages remains underexplored. This work presents the first systematic evaluation of state-of-the-art text watermarking methods (KGW, Exponential Sampling (EXP), and Waterfall) for Bangla LLM text generation under cross-lingual round-trip translation (RTT) attacks. Under benign conditions, KGW and EXP achieve high detection accuracy (>88%) with negligible perplexity and ROUGE degradation. However, RTT causes detection accuracy to collapse to 9-13%, indicating a fundamental failure of token-level watermarking. To address this, we propose a layered watermarking strategy that combines embedding-time and post-generation watermarks. Experimental results show that layered watermarking improves post-RTT detection accuracy by 25-35%, achieving 40-50% accuracy, representing a 3× to 4× relative improvement over single-layer methods, at the cost of controlled semantic degradation. Our findings quantify the robustness-quality trade-off in multilingual watermarking and establish layered watermarking as a practical, training-free solution for low-resource languages such as Bangla. Our code and data will be made public.
中文标题/摘要
标题:BanglaLorica:面向孟加拉语大型语言模型文本生成的鲁棒水印算法设计与评估
随着大型语言模型(LLMs)在文本生成中的广泛应用,水印技术对于作者归属、知识产权保护和滥用检测变得至关重要。尽管现有的水印方法在高资源语言中表现良好,但在低资源语言中的鲁棒性仍被忽视。本研究首次系统评估了最先进的文本水印方法:KGW、指数采样(EXP)和Waterfall,针对孟加拉语LLM文本生成在跨语言往返翻译(RTT)攻击下的表现。在正常条件下,KGW和EXP的检测准确率超过88%,且几乎无困惑度和ROUGE降解。然而,RTT导致检测准确率降至9-13%,表明基于标记级别的水印方法存在根本性失败。为解决这一问题,我们提出了一种分层水印策略,结合了嵌入时和生成后水印。实验结果表明,分层水印策略在RTT后的检测准确率提高了25-35%,达到40-50%,相比单层方法有3-4倍的相对改进,但代价是可控的语义降解。我们的研究量化了多语言水印的鲁棒性-质量权衡,并确立了分层水印策略作为一种适用于低资源语言如孟加拉语的实用、无需训练的解决方案。我们的代码和数据将公开。
Summary / 总结
This paper evaluates the robustness of state-of-the-art watermarking methods (KGW, EXP, and Waterfall) for Bangla large language model (LLM) text generation under cross-lingual round-trip translation (RTT) attacks. Under benign conditions, KGW and EXP achieve high detection accuracy (>88%), but RTT collapses detection accuracy to 9-13%. To address this, the authors propose a layered watermarking strategy that combines embedding-time and post-generation watermarks, improving post-RTT detection accuracy by 25-35% to reach 40-50%, a 3-4× relative improvement over single-layer methods, at the cost of controlled semantic degradation. The work quantifies the robustness-quality trade-off in multilingual watermarking and establishes layered watermarking as a practical, training-free solution for low-resource languages such as Bangla.
这项工作评估了KGW、EXP和Waterfall等最先进的水印方法在孟加拉语大型语言模型(LLM)下跨语言往返翻译(RTT)攻击中的鲁棒性。虽然在良性条件下,KGW和EXP显示出高检测准确性,但RTT显著降低了它们的性能。作者提出了一种结合嵌入时间和生成后水印的分层水印策略,该策略在RTT后的检测准确性提高了25-35%,达到了40-50%的准确性,相对于单层方法提高了3-4倍,但代价是可控的语义降级。这项研究量化了多语言水印的鲁棒性-质量权衡,并将分层水印确立为低资源语言如孟加拉语的实用解决方案。
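To make the "token-level watermarking" failure concrete, the following is a simplified sketch of green-list detection in the style of KGW: each token is scored as "green" or not based on the preceding token, and a one-proportion z-test checks whether the green fraction exceeds the expected fraction `gamma`. The hash-based `is_green` rule is an assumption made for illustration (the published scheme seeds a vocabulary partition from the previous token), and none of this is taken from the paper's code.

```python
import hashlib
import math
from typing import Sequence


def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Illustrative rule: hash the (previous, current) pair to a fraction in [0, 1)."""
    digest = hashlib.sha256(f"{prev_token}\x1f{token}".encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < gamma


def green_count(tokens: Sequence[str], gamma: float = 0.5) -> int:
    """Count green tokens; the first token has no predecessor and is not scored."""
    return sum(is_green(p, t, gamma) for p, t in zip(tokens, tokens[1:]))


def z_score(green: int, scored: int, gamma: float = 0.5) -> float:
    """z = (green - gamma * T) / sqrt(T * gamma * (1 - gamma)) for T scored tokens."""
    if scored <= 0:
        raise ValueError("need at least one scored token")
    return (green - gamma * scored) / math.sqrt(scored * gamma * (1.0 - gamma))
```

Because the test depends on exact token pairs, a round-trip translation that rewrites most tokens plausibly pushes the green fraction back toward `gamma` and the z-score toward zero, which is consistent with the collapse the abstract reports.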
Current Agents Fail to Leverage World Model as Tool for Foresight
Authors: Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, Heng Ji
First: 2026-01-07T13:15:23+00:00 · Latest: 2026-01-08T02:36:21+00:00
Comments: 36 pages, 13 figures, 17 tables (metadata updated)
Abstract
Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.
中文标题/摘要
标题:当前代理无法利用世界模型作为前瞻工具
基于视觉语言模型构建的代理越来越多地面临需要预测未来状态的任务,而不仅仅是依赖短期推理。生成的世界模型提供了一种有希望的解决方案:代理可以使用它们作为外部模拟器,在行动前预见结果。本文实证研究了当前代理是否能够利用此类世界模型作为工具来增强其认知能力。在各种各样的代理和视觉问答任务中,我们观察到一些代理很少使用模拟(不到1%),频繁误用预测滚动(约15%),并且在模拟可用或强制时,经常表现出不一致甚至退化的性能(最高5%)。进一步的归因分析表明,主要瓶颈在于代理决定何时模拟、如何解释预测结果以及如何将前瞻性纳入下游推理的能力。这些发现强调了需要机制来促进与世界模型的校准、战略性互动,为未来代理系统更可靠的前瞻性认知铺平道路。
Summary / 总结
This paper investigates whether current agents can use generative world models as external simulators to anticipate future states before acting. Across agentic and visual question answering tasks, some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often show inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis points to the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning as the main bottleneck. This highlights the need for mechanisms that foster calibrated, strategic interaction with world models.
本文研究了当前代理是否能够有效利用生成的世界模型来增强其在需要长期推理的任务中的预见能力。尽管世界模型具有潜力,但研究发现许多代理很少使用模拟,经常错误使用它们,有时甚至在模拟可用时表现更差。主要问题在于代理无法决定何时进行模拟、如何解释预测以及如何将预见性整合到后续推理中。这些结果强调了在未来代理系统中更好地促进与世界模型的战略互动的必要性。
Vision-Language Agents for Interactive Forest Change Analysis
Authors: James Brock, Ce Zhang, Nantheera Anantrasirichai
First: 2026-01-08T02:02:36+00:00 · Latest: 2026-01-08T02:02:36+00:00
Comments: 5 pages, 4 figures, Submitted to IGARSS 2026
Abstract
Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.
中文标题/摘要
标题:视觉-语言代理在交互式森林变化分析中的应用
现代森林监测工作流程越来越多地得益于高分辨率卫星图像的日益可用和深度学习的进步。在此背景下,准确的像素级变化检测和复杂森林动态的有意义语义变化描述是两个持续存在的挑战。虽然大型语言模型(LLMs)正在被适应用于交互式数据探索,但它们与视觉-语言模型(VLMs)在遥感图像变化解释(RSICI)中的集成仍然未被充分探索。为了解决这一差距,我们引入了一个由LLM驱动的集成森林变化分析代理,支持跨多个RSICI任务的自然语言查询。所提出的系统基于多级变化解释(MCI)视觉-语言骨干,并通过LLM进行编排。为了在森林环境中促进适应和评估,我们进一步引入了森林变化数据集,该数据集包含双时相卫星图像、像素级变化掩码和使用人类注释和基于规则的方法生成的多粒度语义变化描述。实验结果表明,所提出的系统在森林变化数据集上的mIoU和BLEU-4得分为67.10%和40.17%,在LEVIR-MCI-Trees上的得分为88.13%和34.41%,LEVIR-MCI基准数据集的一个以树木为重点的子集,用于联合变化检测和描述。这些结果突显了交互式、LLM驱动的RSICI系统在提高森林变化分析的可访问性、可解释性和效率方面的潜力。所有数据和代码均可在https://github.com/JamesBrockUoB/ForestChat/上公开获取。
Summary / 总结
The paper addresses the challenges of accurate pixel-level change detection and semantic change captioning in forest monitoring. It introduces an LLM-driven agent that builds on a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration and supports natural language querying across multiple RSICI tasks. The system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the new Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, highlighting the potential of interactive, LLM-driven systems to improve the accessibility, interpretability, and efficiency of forest change analysis.
本文旨在通过深度学习和大型语言模型(LLMs)解决森林监测中的像素级变化检测和语义变化描述的挑战。它提出了一种将LLMs与多级变化解释(MCI)骨干网集成的视觉-语言代理,用于遥感图像变化解释(RSICI)。该系统支持自然语言查询,并在Forest-Change数据集上进行了评估,分别取得了67.10%的mIoU和40.17%的BLEU-4分数。结果表明,交互式、LLM驱动的RSICI系统能够提升森林变化分析的可访问性、可解释性和效率。所有数据和代码可在https://github.com/JamesBrockUoB/ForestChat获取。
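For readers unfamiliar with the mIoU metric reported above, here is a minimal sketch of mean intersection-over-union on flattened label masks. It assumes the common definition (per-class IoU averaged over the classes that occur); the paper's exact evaluation protocol may differ, and the function names are illustrative.

```python
from typing import Iterable, List, Sequence


def class_iou(pred: Sequence[int], target: Sequence[int], cls: int) -> float:
    """IoU of one class over two equally sized flattened masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    # By convention here, a class absent from both masks is reported as NaN.
    return inter / union if union else float("nan")


def mean_iou(pred: Sequence[int], target: Sequence[int], classes: Iterable[int]) -> float:
    """Average the per-class IoU, skipping classes absent from both masks."""
    scores: List[float] = []
    for cls in classes:
        value = class_iou(pred, target, cls)
        if value == value:  # NaN is the only value not equal to itself
            scores.append(value)
    if not scores:
        raise ValueError("no class occurs in either mask")
    return sum(scores) / len(scores)


# Toy binary change masks: 0 = no change, 1 = change.
pred_mask = [0, 0, 1, 1]
true_mask = [0, 1, 1, 1]
score = mean_iou(pred_mask, true_mask, classes=[0, 1])
```

In the toy masks, the "no change" class has IoU 1/2 and the "change" class 2/3, so the mean is 7/12.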
From Dataset to Real-world: General 3D Object Detection via Generalized Cross-domain Few-shot Learning
Authors: Shuangzhi Li, Junlong Shen, Lei Ma, Xingyu Li
First: 2025-03-08T17:05:21+00:00 · Latest: 2026-01-08T01:19:36+00:00
Comments: The latest version refines the few-shot setting on common classes, enforcing a stricter object-level definition
Abstract
LiDAR-based 3D object detection models often struggle to generalize to real-world environments due to limited object diversity in existing datasets. To tackle it, we introduce the first generalized cross-domain few-shot (GCFS) task in 3D object detection, aiming to adapt a source-pretrained model to both common and novel classes in a new domain with only few-shot annotations. We propose a unified framework that learns stable target semantics under limited supervision by bridging 2D open-set semantics with 3D spatial reasoning. Specifically, an image-guided multi-modal fusion injects transferable 2D semantic cues into the 3D pipeline via vision-language models, while a physically-aware box search enhances 2D-to-3D alignment via LiDAR priors. To capture class-specific semantics from sparse data, we further introduce contrastive-enhanced prototype learning, which encodes few-shot instances into discriminative semantic anchors and stabilizes representation learning. Extensive experiments on GCFS benchmarks demonstrate the effectiveness and generality of our approach in realistic deployment settings.
中文标题/摘要
标题:从数据集到现实世界:通过通用跨域少样本学习进行三维物体检测
基于LiDAR的三维物体检测模型往往难以在现实环境中泛化,因为现有数据集中的物体多样性有限。为解决这一问题,我们引入了三维物体检测中的首个通用跨域少样本(GCFS)任务,旨在仅通过少量标注将源预训练模型适应到新域中的常见和新型类。我们提出了一种统一框架,通过将开放集2D语义与三维空间推理相结合,在有限监督下学习稳定的靶标语义。具体而言,图像引导的多模态融合通过视觉语言模型将可转移的2D语义线索注入三维管道,而物理感知的框搜索通过LiDAR先验增强2D到3D对齐。为了从稀疏数据中捕获类特定语义,我们进一步引入对比增强原型学习,将少样本实例编码为判别性语义锚点,并稳定表示学习。在GCFS基准上的大量实验表明,我们的方法在现实部署场景中具有有效性和普适性。
Summary / 总结
The research addresses the challenge of LiDAR-based 3D object detection models failing to generalize to real-world environments due to limited dataset diversity. It introduces a generalized cross-domain few-shot learning framework to adapt a source-pretrained model to both common and novel classes with minimal annotations. The approach uses image-guided multi-modal fusion and physically-aware box search to enhance 2D-to-3D alignment and introduces contrastive-enhanced prototype learning to stabilize representation learning. Experiments show the effectiveness and generality of the proposed method in realistic settings.
论文针对3D物体检测模型在现实环境中难以泛化的挑战,由于现有数据集的物体多样性有限。它引入了一种通用的跨域少量样本学习框架,以少量标注适应源预训练模型到常见和新型类。该方法利用图像引导的多模态融合和物理感知的框搜索增强2D到3D对齐,并引入对比增强的原型学习来从稀疏数据中稳定表示学习。实验表明,该方法在现实场景中具有有效性和普适性。
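The idea of encoding few-shot instances into class prototypes can be illustrated with a plain prototype classifier: average the embeddings of each class and assign a query to the most similar prototype. This sketch deliberately omits the contrastive enhancement the abstract describes, and all names and the 2-D toy embeddings are assumptions for illustration only.

```python
import math
from typing import Dict, List, Sequence


def class_prototypes(embeddings: Sequence[Sequence[float]], labels: Sequence[str]) -> Dict[str, List[float]]:
    """Mean embedding per class label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest_prototype(query: Sequence[float], protos: Dict[str, List[float]]) -> str:
    """Label of the prototype most similar to the query embedding."""
    return max(protos, key=lambda label: cosine(query, protos[label]))


# Two hypothetical classes with two few-shot instances each.
shots = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
shot_labels = ["car", "car", "cone", "cone"]
protos = class_prototypes(shots, shot_labels)
```

A contrastive objective would additionally pull instances toward their own prototype and push them away from others during training; the sketch only shows the inference-time use of prototypes as semantic anchors.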
UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving
Authors: Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren
First: 2026-01-07T23:49:52+00:00 · Latest: 2026-01-07T23:49:52+00:00
Comments: Project Page: https://unidrive-wm.github.io/UniDrive-WM
Abstract
World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM .
中文标题/摘要
标题:UniDrive-WM:统一的理解、规划和生成世界模型用于自动驾驶
世界模型已成为自动驾驶的核心,准确的场景理解和未来预测对于安全控制至关重要。近期研究探索了使用视觉-语言模型(VLM)进行规划,但现有方法通常将感知、预测和规划视为独立模块。我们提出了UniDrive-WM,这是一种基于VLM的统一世界模型,能够在单一架构中联合执行驾驶场景理解、轨迹规划和基于轨迹的未来图像生成。UniDrive-WM的轨迹规划器预测未来轨迹,条件化VLM图像生成器以生成合理的未来帧。这些预测提供了额外的监督信号,增强场景理解并逐步细化轨迹生成。我们进一步比较了离散和连续输出表示对未来图像预测的影响,分析其对下游驾驶性能的影响。在具有挑战性的Bench2Drive基准测试中,UniDrive-WM生成了高保真度的未来图像,并在L2轨迹误差和碰撞率方面分别提高了5.9%和9.2%,超过了之前的最佳方法。这些结果表明,将VLM驱动的推理、规划和生成世界建模紧密集成对于自动驾驶的优势。项目页面可在https://unidrive-wm.github.io/UniDrive-WM 查看。
Summary / 总结
UniDrive-WM is a unified VLM-based world model that integrates driving-scene understanding, trajectory planning, and future image generation. It uses a trajectory planner to predict future trajectories, which conditions a VLM to generate plausible future frames. Experiments show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate compared to the previous best method.
UniDrive-WM 是一个统一的基于 VLM 的世界模型,将驾驶场景理解、轨迹规划和未来图像生成集成在一个架构中。它使用轨迹规划器预测未来轨迹,以条件化 VLM 生成可能的未来帧,增强场景理解和轨迹生成。在 Bench2Drive 基准上的实验表明,UniDrive-WM 生成了高保真度的未来图像,并将 L2 轨迹误差和碰撞率的规划性能分别提高了 5.9% 和 9.2%,优于之前的最佳方法。
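The L2 trajectory error mentioned above is commonly computed as the mean Euclidean distance between predicted and ground-truth waypoints. The sketch below assumes that definition with 2-D waypoints; the benchmark's exact horizons and averaging scheme are not given in the abstract, and `l2_trajectory_error` is a hypothetical helper name.

```python
import math
from typing import Sequence, Tuple

Waypoint = Tuple[float, float]


def l2_trajectory_error(pred: Sequence[Waypoint], truth: Sequence[Waypoint]) -> float:
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = 0.0
    for (px, py), (tx, ty) in zip(pred, truth):
        total += math.hypot(px - tx, py - ty)
    return total / len(pred)


# Toy example: the second predicted waypoint is 5 m off, the first is exact.
error = l2_trajectory_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```

Lower is better for this metric, so an "improvement" in L2 error corresponds to a reduction in the returned value.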
Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
Authors: Xingjian Diao, Zheyuan Liu, Chunhui Zhang, Weiyi Wu, Keyi Kong, Lin Shi, Kaize Ding, Soroush Vosoughi, Jiang Gui
First: 2026-01-07T23:05:17+00:00 · Latest: 2026-01-07T23:05:17+00:00
Abstract
Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.
中文标题/摘要
标题:通过门控感知推理优化解决大型视觉-语言模型中的过度思考问题
大型视觉-语言模型(LVLMs)通过生成逐步推理机制展示了强大的推理能力。然而,这种缓慢思考的方法往往会导致过度思考,模型会对简单查询产生冗长的响应,导致测试时效率低下,甚至降低准确性。先前的工作试图通过自适应推理策略来缓解这一问题,但这些方法大多忽视了一个基本瓶颈:视觉感知失败。我们认为,稳定的推理依赖于低级视觉定位,推理错误通常源自不完美的感知而非不足的思考。为了解决这一限制,我们提出了门控感知推理优化(GPRO),这是一种元推理控制器,在每次生成步骤中动态地在三条决策路径之间分配计算:一条轻量级的快速路径,一条缓慢的感知路径用于重新审视视觉输入,以及一条缓慢的推理路径用于内部自我反思。为了学习这种区分,我们从大约79万样本中推导出大规模的失败归因监督,使用教师模型区分感知幻觉和推理错误。然后,我们使用多目标强化学习训练控制器,在不确定性下优化任务准确性和计算成本之间的权衡。在五个基准上的实验表明,GPRO在准确性和效率上都有显著提升,优于最近的缓慢思考方法,同时生成的响应也显著更短。
Summary / 总结
The paper addresses the issue of overthinking in large vision-language models (LVLMs) by proposing Gated Perception-Reasoning Optimization (GPRO), which dynamically routes computation among a fast path, a slow perception path, and a slow reasoning path. The method uses large-scale failure attribution supervision to train a meta-reasoning controller and optimizes the trade-off between accuracy and computational cost. Experiments show that GPRO improves both accuracy and efficiency, outperforming recent slow-thinking methods and generating shorter responses.
论文提出了一种称为Gated Perception-Reasoning Optimization (GPRO)的方法,动态地在快速、感知和推理路径之间分配计算。该方法使用多目标强化学习来优化准确性和计算成本之间的权衡。实验表明,GPRO在提高准确性和效率方面优于最近的慢思考方法,同时生成的响应更短。
3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation
Authors: Jusheng Zhang, Yijia Fan, Zimo Wen, Jian Wang, Keze Wang
Venue: NeurIPS 2025
First: 2026-01-07T21:23:05+00:00 · Latest: 2026-01-07T21:23:05+00:00
Comments: Accepted at NeurIPS 2025
Abstract
Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation presents challenges beyond 2D annotation, including spatial complexity, occlusion, and viewpoint inconsistency. Existing approaches based on single models often struggle to address these issues effectively. We propose Tri-MARF, a novel framework that integrates tri-modal inputs, including 2D multi-view images, textual descriptions, and 3D point clouds, within a multi-agent collaborative architecture to enhance large-scale 3D annotation. Tri-MARF consists of three specialized agents: a vision-language model agent for generating multi-view descriptions, an information aggregation agent for selecting optimal descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined captioning. Extensive experiments on Objaverse-LVIS, Objaverse-XL, and ABO demonstrate that Tri-MARF substantially outperforms existing methods, achieving a CLIPScore of 88.7 compared to prior state-of-the-art methods, retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.
中文标题/摘要
标题:3D-Agent:三模态多智能体协作的可扩展3D物体标注
受自主驾驶机器人和增强现实应用的驱动,3D物体标注面临着超越2D标注的挑战,包括空间复杂性、遮挡和视角不一致等问题。现有基于单一模型的方法往往难以有效解决这些问题。我们提出了一种名为Tri MARF的新框架,该框架将2D多视角图像、文本描述和3D点云的三模态输入整合到多智能体协作架构中,以增强大规模3D标注。Tri MARF包括三个专门的智能体:一个视觉语言模型智能体用于生成多视角描述,一个信息聚合智能体用于选择最优描述,以及一个门控智能体,用于将文本语义与3D几何对齐以进行精细的标注。在Objaverse LVIS、Objaverse XL和ABO上的广泛实验表明,Tri MARF在CLIPScore、检索准确率和吞吐量方面显著优于现有方法,CLIPScore达到88.7,检索准确率分别为ViLT R@5的45.2和43.8,单块NVIDIA A100 GPU上的吞吐量可达每小时12000个物体。
Summary / 总结
The research addresses the challenges of 3D object annotation for autonomous driving, robotics, and augmented reality by proposing Tri-MARF, a framework that integrates 2D multi-view images, textual descriptions, and 3D point clouds. Tri-MARF uses three specialized agents to generate multi-view descriptions, select optimal descriptions, and align textual semantics with 3D geometry. Experiments show that Tri-MARF outperforms existing methods with a CLIPScore of 88.7, retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.
论文提出了一种名为Tri MARF的新框架,通过多智能体协作系统整合2D多视角图像、文本描述和3D点云,以解决自动驾驶和增强现实中的3D对象标注挑战。该框架包括视觉语言模型智能体、信息聚合智能体和对齐智能体,用于精细标注。实验结果显示,Tri MARF在CLIPScore上达到88.7,在ViLT R 5的检索准确率为45.2%和43.8%,并在单块NVIDIA A100 GPU上每小时处理多达12000个对象。
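The R@5 retrieval figures above follow the usual Recall@K idea: a query counts as a hit if its correct match appears among the K highest-scoring candidates. The sketch below assumes a square similarity matrix in which query i matches candidate i; how the paper builds its similarity scores with ViLT is not described in the abstract, so the matrix here is toy data.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose true match (same index) ranks in the top k."""
    if k <= 0 or not similarity:
        raise ValueError("need k >= 1 and at least one query")
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: rows are text queries, columns are 3D objects.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
r_at_1 = recall_at_k(toy_similarity, k=1)
```

In the toy matrix the second query ranks its true match last, so Recall@1 and Recall@2 are both 2/3 and only Recall@3 reaches 1.0.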