arXiv 论文速递

2026-01-13 03:34
Snapshot: 20260113_0334
Co-Training Vision Language Models for Remote Sensing Multi-task Learning
Authors: Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen, Xue Yang
First: 2025-11-26T10:55:07+00:00 · Latest: 2026-01-09T18:43:00+00:00
Comments: 14 pages, 6 figures
Abstract
With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we create the data curation engine, including data acquisition, offline processing and integration, as well as online loading and weighting. This data engine effectively addresses the complex RS data environment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens. Additionally, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluation tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.
中文标题/摘要
标题:遥感多任务学习中联合训练视觉语言模型
随着Transformer在遥感(RS)单一任务中取得卓越表现,我们正接近通过多任务学习(MTL)实现跨多个任务的统一模型。与单一任务方法相比,MTL方法提供了更好的泛化能力、更强的可扩展性和更高的实际应用价值。最近,视觉语言模型(VLMs)在RS图像理解、语义分割和超高清(UHR)图像推理方面取得了令人鼓舞的结果。此外,统一的文本界面展示了MTL的巨大潜力。因此,在这项工作中,我们提出了RSCoVLM,这是一种简单而灵活的VLM基线模型,用于RS MTL。首先,我们创建了数据编排引擎,包括数据获取、离线处理和集成,以及在线加载和加权。该数据引擎有效地解决了复杂RS数据环境问题,并生成了灵活的视觉-语言对话。此外,我们提出了一种统一的动态分辨率策略,以应对RS图像中固有的不同图像尺度。对于UHR图像,我们引入了缩放链机制及其相应的数据集LRS-VQA-Zoom。这些策略灵活且有效地减轻了计算负担。此外,我们显著增强了模型的物体检测能力,并提出了一种新的评估协议,以确保VLMs和传统检测模型之间的公平比较。广泛的实验表明,RSCoVLM在多种任务中均取得了最先进的性能,超越了现有的RS VLMs,甚至与专门的专家模型相媲美。所有训练和评估工具、模型权重和数据集均已完全开源,以支持可再现性。我们期望这一基线将促进通用RS模型的进一步发展。
Open-Vocabulary 3D Instruction Ambiguity Detection
Authors: Jiayu Ding, Haoran Tang, Ge Li
First: 2026-01-09T18:17:11+00:00 · Latest: 2026-01-09T18:17:11+00:00
Abstract
In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, a large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the challenge of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Code and dataset available at https://jiayuding031020.github.io/ambi3d/.
中文标题/摘要
标题:开放词汇3D指令歧义检测
在安全关键领域,语言歧义可能导致严重后果;手术环境中一个模糊的命令“递给我那个药瓶”可能会导致灾难性错误。然而,大多数具身AI研究忽略了这一点,假设指令是清晰的,而专注于执行而不是确认。为解决这一关键安全缺口,我们首次定义了开放词汇3D指令歧义检测这一基本新任务,其中模型必须确定命令在给定的3D场景中是否有单一且明确的意义。为了支持这一研究,我们构建了Ambi3D,这是该任务的大规模基准,包含超过700个多样化的3D场景和约22000条指令。我们的分析揭示了一个令人惊讶的局限性:最先进的3D大型语言模型(LLMs)难以可靠地判断指令是否具有歧义性。为应对这一挑战,我们提出了AmbiVer,这是一种两阶段框架,通过从多个视角收集明确的视觉证据,并利用这些证据指导视觉-语言模型(VLM)判断指令的歧义性。广泛的实验表明了我们任务的挑战性以及AmbiVer的有效性,为更安全和更可信赖的具身AI铺平了道路。代码和数据集可在https://jiayuding031020.github.io/ambi3d/获取。
Summary / 总结
The research addresses linguistic ambiguity in safety-critical domains by defining a new task, Open-Vocabulary 3D Instruction Ambiguity Detection. To support it, the authors created Ambi3D, a benchmark with over 700 3D scenes and around 22,000 instructions. They found that state-of-the-art 3D LLMs struggle to reliably determine whether an instruction is ambiguous. To address this, they proposed AmbiVer, a two-stage framework that collects visual evidence from multiple views and uses it to guide a vision-language model in judging instruction ambiguity; experiments show it to be effective.
研究旨在通过定义新的任务——开放词汇3D指令歧义检测,来解决手术等关键领域中语言歧义带来的安全风险。为此,作者构建了Ambi3D基准,包含700多个3D场景和22,000个指令。他们发现当前的3D大型语言模型难以检测歧义。为解决这一问题,他们提出了AmbiVer框架,该框架利用视觉证据帮助视觉-语言模型判断指令的歧义性,并在实验中展示了其有效性。
Context-Aware Decoding for Faithful Vision-Language Generation
Authors: Mehrdad Fazli, Bowen Wei, Ziwei Zhu
First: 2026-01-09T16:50:57+00:00 · Latest: 2026-01-09T16:50:57+00:00
Abstract
Hallucinations, generating responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.
中文标题/摘要
标题:基于上下文的解码以实现忠实的跨模态语言生成
幻觉,即生成与视觉输入不一致的响应,仍然是大型跨模态语言模型(LVLM)的一个关键限制,尤其是在开放任务如图像字幕和视觉推理中。在本文中,我们探究了导致幻觉的逐层生成动态,并提出了一种无需训练的缓解策略。利用Logit Lens,我们检查了LVLMs在解码器各层如何构建下一个词的概率分布,发现了一种显著的可信度深度差距:真实的词更早地将概率质量集中在它们的最终候选词上,而幻觉词则不然。基于这一发现,我们引入了上下文嵌入注入(CEI),这是一种轻量级的方法,利用最后一个输入词的隐藏状态——上下文嵌入——作为接地信号,以在整个解码过程中保持视觉保真度并抑制幻觉。在CHAIR、AMBER和MMHal-Bench基准测试(最大词长512)上评估,CEI在三种LVLM中均优于最先进的基线,其动态变体的幻觉率最低。通过将新颖的机制见解与可扩展的干预措施相结合,本文推进了LVLM中幻觉的缓解。
Summary / 总结
This work addresses the issue of hallucinations in large vision-language models (LVLMs) by analyzing the layer-wise generation dynamics and proposing a training-free mitigation strategy called Context Embedding Injection (CEI). CEI uses the hidden state of the last input token as a grounding signal to maintain visual fidelity during decoding. The method outperforms state-of-the-art baselines on CHAIR, AMBER, and MMHal-Bench benchmarks, with the dynamic variant achieving the lowest hallucination rates across three LVLMs.
该研究通过分析导致不一致响应的层间生成动态来解决大型视觉语言模型(LVLMs)中的幻觉问题。它提出了一种名为Context Embedding Injection (CEI)的方法,该方法利用最后一个输入令牌的隐藏状态作为接地信号,以保持视觉保真度并减少幻觉。CEI在CHAIR、AMBER和MMHal-Bench基准上优于最先进的基线方法,其动态变体实现了最低的幻觉率。
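The commitment-depth gap described in the abstract can be illustrated with a small sketch. The Logit Lens projects each decoder layer's hidden state through the output head, which gives one logit vector per layer. The function below is a hypothetical reading of "commitment depth", not the authors' definition: it returns the first layer at which the token finally emitted by the last layer reaches an assumed probability threshold of 0.5. The names `commitment_depth` and `threshold`, and the toy logits, are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def commitment_depth(layer_logits, threshold=0.5):
    """Return the first layer index whose probability for the final token
    is at least `threshold` (hypothetical definition, assumed here).

    layer_logits[k] is the logit vector obtained by applying the output
    head to the hidden state of layer k (the Logit Lens view).
    """
    last = layer_logits[-1]
    # The "final candidate" is the argmax token of the last layer.
    final_token = max(range(len(last)), key=last.__getitem__)
    for depth, logits in enumerate(layer_logits):
        if softmax(logits)[final_token] >= threshold:
            return depth
    # The final token never reached the threshold: report the last layer.
    return len(layer_logits) - 1


# Toy 3-token vocabulary, 4 layers (made-up numbers for illustration only).
truthful = [[0.1, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
hallucinated = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.5, 1.0, 0.0], [3.0, 1.0, 0.0]]

print(commitment_depth(truthful))      # commits early
print(commitment_depth(hallucinated))  # commits only at the last layer
```

Under this reading, a token that commits late is a candidate hallucination; how the paper turns that signal into the CEI intervention is not specified in the abstract and is not modeled here.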
Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs
Authors: Sandeep Mishra, Devichand Budagam, Anubhab Mandal, Bishal Santra, Pawan Goyal, Manish Gupta
First: 2026-01-09T15:29:50+00:00 · Latest: 2026-01-09T15:29:50+00:00
Comments: Accepted to EACL 2026 Industry Track, 12 pages, 6 figures
Abstract
Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly excel over textual models on user satisfaction, notably saving user typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completions, leading to smarter, user-aware assistants.
中文标题/摘要
标题:Router-Suggest:基于视觉上下文的多模态自动补全动态路由
实时多模态自动补全对于数字助手、聊天机器人、设计工具和医疗咨询至关重要,其中用户输入依赖于共享的视觉上下文。我们引入了多模态自动补全(MAC)任务,该任务使用部分输入文本和视觉提示来预测实时聊天中的下一个字符。与传统的纯文本自动补全(TAC)不同,MAC将预测基于多模态上下文,以更好地捕捉用户意图。为了实现这一任务,我们调整了MMDialog和ImageChat以创建基准数据集。我们评估了领先的语言-视觉模型(VLMs)与强大的文本基线模型,突显了准确性和效率之间的权衡。我们提出了Router-Suggest,这是一种路由框架,根据对话上下文动态选择文本模型和VLMs,还提供了一种轻量级变体以适应资源受限的环境。Router-Suggest 在最佳性能的VLM上实现了2.3到10倍的速度提升。用户研究显示,VLMs在用户满意度方面显著优于文本模型,特别是在多轮对话中节省了用户的输入努力并提高了补全质量。这些发现强调了在自动补全中使用多模态上下文的必要性,从而实现更智能、更用户友好的助手。
Summary / 总结
The paper introduces Multimodal Auto-Completion (MAC), which predicts upcoming characters in live chats using both partially typed text and visual cues. It evaluates vision-language models (VLMs) against textual baselines and proposes Router-Suggest, a dynamic router framework that selects between textual models and VLMs based on dialog context, achieving significant speedups. User studies show that VLMs outperform textual models in terms of user satisfaction and completion quality in multi-turn conversations, emphasizing the importance of multimodal context in auto-completions.
论文提出了多模态自动完成(MAC),该方法利用部分输入文本和视觉线索来预测聊天内容,以更好地理解用户意图。它评估了VLMs和文本基线,并提出了Router-Suggest动态路由框架,根据对话上下文选择文本模型或VLMs,实现了显著的加速并提高了多轮对话中的用户满意度。
From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation
Authors: Zezhou Wang, Ziyun Zhang, Xiaoyi Zhang, Zhuzhong Qian, Yan Lu
First: 2026-01-09T13:26:38+00:00 · Latest: 2026-01-09T13:26:38+00:00
Comments: Work In Progress
Abstract
Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. Top-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified. GUI datasets like OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be gathered by interacting with these environments, making such data hard to scale. We therefore ask how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of existing expert trajectories to train end-to-end policies. Naively mixing these off-policy traces into on-policy RLVR is brittle: even after format conversion, expert trajectories exhibit structural mismatch and distribution shift from the learner. We propose BEPA (Bi-Level Expert-to-Policy Assimilation), which turns static expert traces into policy-aligned guidance via self-rolled reachable trajectories under the base policy (LEVEL-1) and a per-task, dynamically updated cache used in RLVR (LEVEL-2). On OSWorld-Verified, BEPA improves UITARS1.5-7B success from 22.87% to 32.13% and raises a held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web. Our code and data are available at: https://github.com/LEON-gittech/Verl_GUI.git
中文标题/摘要
标题:从离策学到近策学:通过双层专家到策略同化提升GUI代理
视觉语言模型越来越多地被部署为计算机使用代理(CUAs),用于操作桌面和浏览器。表现最佳的CUAs是基于框架的系统,将规划和执行分解开来,而端到端的截图到动作策略更容易部署,但在OSWorld-Verified等基准测试中却落后。GUI数据集如OSWorld存在两个瓶颈:它们仅暴露了几百个可交互的、可验证的任务和环境,并且专家轨迹必须通过与这些环境交互来收集,使得此类数据难以扩展。因此,我们探讨了如何利用可验证奖励的强化学习(RLVR)最好地利用少量现有的专家轨迹来训练端到端策略。简单地将这些离策轨迹混合到RLVR中是脆弱的:即使在格式转换后,专家轨迹也表现出结构不匹配和从学习者分布的偏移。我们提出了BEPA(双层专家到策略同化),通过在基础策略(LEVEL-1)下生成自定义可达轨迹和RLVR中使用的每任务动态更新缓存(LEVEL-2)将静态专家轨迹转化为与策略对齐的指导。在OSWorld-Verified上,BEPA将UITARS1.5-7B的成功率从22.87%提高到32.13%,并将保留分割从5.74%提高到10.30%,在MMBench-GUI和Online-Mind2Web上也取得了持续的改进。我们的代码和数据可在:https://github.com/LEON-gittech/Verl_GUI.git
Summary / 总结
The paper aims to improve end-to-end screenshot-to-action computer-use agents (CUAs) by making better use of a small pool of off-policy expert trajectories during reinforcement learning from verifiable rewards (RLVR). It introduces BEPA (Bi-Level Expert-to-Policy Assimilation), a method that converts static expert trajectories into policy-aligned guidance through two levels: LEVEL-1 generates self-rolled reachable trajectories under the base policy, and LEVEL-2 uses a per-task, dynamically updated cache in RLVR. On OSWorld-Verified, BEPA improves the success rate of UITARS1.5-7B from 22.87% to 32.13% and raises the held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web.
论文针对在GUI数据集如OSWorld中使用可验证奖励训练端到端策略的挑战,这些数据集规模有限且需要专家轨迹。提出了一种名为BEPA(Bi-Level Expert-to-Policy Assimilation)的方法,通过两个层次将静态专家轨迹转化为策略对齐的指导:LEVEL-1生成基于基策略的自定义可达轨迹,LEVEL-2使用每任务动态更新的缓存进行RLVR。该方法在OSWorld-Verified上将UITARS1.5-7B的成功率从22.87%提高到32.13%,并将保留分割从5.74%提高到10.30%,在MMBench-GUI和Online-Mind2Web上也取得了持续的改进。
Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism
Authors: Siyu Zhang, Lianlei Shan, Runhe Qiu
First: 2025-12-29T06:51:20+00:00 · Latest: 2026-01-09T12:31:40+00:00
Abstract
Multimodal fusion of remote sensing images serves as a core technology for overcoming the limitations of single-source data and improving the accuracy of surface information extraction, which exhibits significant application value in fields such as environmental monitoring and urban planning. To address the deficiencies of existing methods, including the failure of fixed resolutions to balance efficiency and detail, as well as the lack of semantic hierarchy in single-scale alignment, this study proposes a Vision-language Model (VLM) framework integrated with two key innovations: the Dynamic Resolution Input Strategy (DRIS) and the Multi-scale Vision-language Alignment Mechanism (MS-VLAM). Specifically, the DRIS adopts a coarse-to-fine approach to adaptively allocate computational resources according to the complexity of image content, thereby preserving key fine-grained features while reducing redundant computational overhead. The MS-VLAM constructs a three-tier alignment mechanism covering object, local-region and global levels, which systematically captures cross-modal semantic consistency and alleviates issues of semantic misalignment and granularity imbalance. Experimental results on the RS-GPT4V dataset demonstrate that the proposed framework significantly improves the accuracy of semantic understanding and computational efficiency in tasks including image captioning and cross-modal retrieval. Compared with conventional methods, it achieves superior performance in evaluation metrics such as BLEU-4 and CIDEr for image captioning, as well as R@10 for cross-modal retrieval. This technical framework provides a novel approach for constructing efficient and robust multimodal remote sensing systems, laying a theoretical foundation and offering technical guidance for the engineering application of intelligent remote sensing interpretation.
中文标题/摘要
标题:遥感图像多模态解释:动态分辨率输入策略和多尺度视觉-语言对齐机制
遥感图像的多模态融合是一种克服单一数据源限制、提高地表信息提取准确性的核心技术,在环境监测和城市规划等领域具有显著的应用价值。为解决现有方法存在的固定分辨率难以平衡效率与细节、单尺度对齐缺乏语义层次等问题,本研究提出了一种结合两种关键创新的视觉-语言模型(VLM)框架:动态分辨率输入策略(DRIS)和多尺度视觉-语言对齐机制(MS-VLAM)。具体而言,DRIS采用由粗到细的方法,根据图像内容的复杂性动态分配计算资源,从而保留关键的细粒度特征并减少冗余计算开销。MS-VLAM构建了涵盖对象、局部区域和全局三个层级的对齐机制,系统地捕捉跨模态语义一致性,缓解语义错位和粒度失衡问题。在RS-GPT4V数据集上的实验结果表明,所提出的框架在图像描述和跨模态检索等任务中显著提高了语义理解和计算效率。与传统方法相比,它在图像描述任务中的BLEU-4和CIDEr等评估指标以及跨模态检索任务中的R@10上均表现出更优的性能。该技术框架为构建高效稳健的多模态遥感系统提供了新的方法,为智能遥感解释的工程应用奠定了理论基础并提供了技术指导。
Summary / 总结
This study addresses the limitations of fixed resolutions in remote sensing image processing and the lack of semantic hierarchy in single-scale alignment by proposing a Vision-language Model (VLM) framework with Dynamic Resolution Input Strategy (DRIS) and Multi-scale Vision-language Alignment Mechanism (MS-VLAM). The DRIS adapts computational resources based on image complexity, preserving fine-grained features while reducing overhead. The MS-VLAM constructs a three-tier alignment mechanism to capture semantic consistency across object, local-region, and global levels. Experimental results show that the proposed framework improves semantic understanding and computational efficiency, outperforming conventional methods in tasks like image captioning and cross-modal retrieval.
本研究通过提出结合动态分辨率输入策略(DRIS)和多尺度视觉语言对齐机制(MS-VLAM)的视觉语言模型(VLM)框架,解决了固定分辨率在遥感图像处理中的局限性和单尺度对齐中缺乏语义层次的问题。DRIS根据图像复杂性动态分配计算资源,保留细粒度特征并减少冗余。MS-VLAM构建了涵盖对象、局部区域和全局三个层级的对齐机制,以捕捉跨模态语义一致性。实验结果表明,该框架在图像描述和跨模态检索等任务中提高了语义理解和计算效率,并在BLEU-4、CIDEr等指标上优于传统方法。
ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers
Authors: Guray Ozgur, Eduarda Caldeira, Tahar Chettaoui, Jan Niklas Kolf, Marco Huber, Naser Damer, Fadi Boutros
Venue: WACV
First: 2026-01-09T11:46:25+00:00 · Latest: 2026-01-09T11:46:25+00:00
Comments: Accepted at WACV Workshops
Abstract
Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit only final-layer representations, while training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, while degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass without backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA achieves competitive performance with state-of-the-art methods while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.
中文标题/摘要
标题:ViTNT-FIQA:无需训练的面部图像质量评估
面部图像质量评估(FIQA)对于可靠的面部识别系统至关重要。当前的方法主要利用最终层的表示,而无需训练的方法需要多次前向传递或反向传播。我们提出了ViTNT-FIQA,这是一种无需训练的方法,它通过测量中间视觉变换器(ViT)块中补丁嵌入演变的稳定性来进行评估。我们证明高质量的面部图像在各个块中表现出稳定的特征细化轨迹,而退化的图像则表现出不规则的变换。我们的方法计算连续变换器块中L2归一化补丁嵌入之间的欧几里得距离,并将它们聚合为图像级别的质量评分。我们通过在具有受控退化级别的质量标注合成数据集上进行实证验证了这种相关性。与现有的无需训练的方法不同,ViTNT-FIQA 只需要一次前向传递,无需反向传播或架构修改。通过在八个基准数据集(LFW,AgeDB-30,CFP-FP,CALFW,Adience,CPLFW,XQLFW,IJB-C)上进行广泛的评估,我们展示了ViTNT-FIQA 达到了与最先进的方法相当的性能,同时保持了计算效率和立即应用于任何预训练的ViT基面部识别模型的适用性。
Summary / 总结
The research aims to develop a training-free method for Face Image Quality Assessment (FIQA) to enhance the reliability of face recognition systems. ViTNT-FIQA measures the stability of patch embedding evolution across intermediate Vision Transformer blocks, computing Euclidean distances between L2-normalized patch embeddings. The method shows that high-quality images have stable feature refinement trajectories, while degraded images exhibit erratic transformations. ViTNT-FIQA requires only a single forward pass and demonstrates competitive performance on eight benchmarks while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.
ViTNT-FIQA 是一种无需训练的方法,用于评估人脸图像质量(FIQA),它通过评估中间 Vision Transformer (ViT) 块中 patch 嵌入演化的稳定性来工作。该方法计算连续 transformer 块之间 L2 归一化 patch 嵌入的欧几里得距离,并将它们聚合为图像级别的质量评分。该方法在八个基准上展示了与最先进的方法相当的性能,同时只需要一次前向传播,保持了计算效率和对任何预训练的 ViT 基本人脸识别模型的即时适用性。
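The scoring idea in the abstract (Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks, aggregated into an image-level score) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the aggregation by plain mean, the sign convention (less drift means higher quality), and the names `instability` and `quality_score` are all assumptions, and the toy embeddings stand in for real ViT activations.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def instability(blocks):
    """Mean distance between the L2-normalized embeddings of the same patch
    in consecutive blocks. blocks[b][p] is patch p's embedding after block b.
    Aggregation by mean is an assumption made for this sketch.
    """
    if len(blocks) < 2:
        raise ValueError("need embeddings from at least two blocks")
    dists = []
    for prev, curr in zip(blocks, blocks[1:]):
        for p_prev, p_curr in zip(prev, curr):
            dists.append(euclidean(l2_normalize(p_prev), l2_normalize(p_curr)))
    return sum(dists) / len(dists)


def quality_score(blocks):
    """Assumed convention: smoother refinement (less drift) scores higher."""
    return -instability(blocks)


# Toy data: 3 blocks, 2 patches, 2-d embeddings (illustrative values only).
stable = [[[1, 0], [0, 1]], [[2, 0], [0, 3]], [[4, 0], [0, 5]]]
erratic = [[[1, 0], [0, 1]], [[0, 1], [1, 0]], [[1, 0], [0, 1]]]

print(quality_score(stable) > quality_score(erratic))
```

In practice the per-block patch embeddings would be captured from a pre-trained ViT during a single forward pass, which is what makes the approach training-free and free of backpropagation.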
PII-VisBench: Evaluating Personally Identifiable Information Safety in Vision Language Models Along a Continuum of Visibility
Authors: G M Shahariar, Zabir Al Nazi, Md Olid Hasan Bhuiyan, Zhouxing Shi
First: 2026-01-09T11:40:56+00:00 · Latest: 2026-01-09T11:40:56+00:00
Abstract
Vision Language Models (VLMs) are increasingly integrated into privacy-critical domains, yet existing evaluations of personally identifiable information (PII) leakage largely treat privacy as a static extraction task and ignore how a subject's online presence (the volume of their data available online) influences privacy alignment. We introduce PII-VisBench, a novel benchmark containing 4000 unique probes designed to evaluate VLM safety through the continuum of online presence. The benchmark stratifies 200 subjects into four visibility categories (high, medium, low, and zero) based on the extent and nature of their information available online. We evaluate 18 open-source VLMs (0.3B-32B) based on two key metrics: percentage of PII probing queries refused (Refusal Rate) and the fraction of non-refusal responses flagged for containing PII (Conditional PII Disclosure Rate). Across models, we observe a consistent pattern: refusals increase and PII disclosures decrease (9.10% high to 5.34% low) as subject visibility drops. We identify that models are more likely to disclose PII for high-visibility subjects, alongside substantial model-family heterogeneity and PII-type disparities. Finally, paraphrasing and jailbreak-style prompts expose attack- and model-dependent failures, motivating visibility-aware safety evaluation and training interventions.
中文标题/摘要
标题:PII-VisBench:评估视觉语言模型中个人可识别信息安全性沿可见度连续谱的变化
视觉语言模型(VLMs)越来越多地被集成到隐私关键领域,但现有的个人可识别信息(PII)泄露评估大多将隐私视为静态提取任务,并忽略了一个人的在线存在——其在线数据量——如何影响隐私对齐。我们引入了PII-VisBench,这是一个包含4000个独特探针的新基准,旨在通过在线存在连续谱评估VLM的安全性。基准将200个受试者按其在线信息的范围和性质分为四个可见度类别:高、中、低和零。我们基于两个关键指标评估了18个开源VLM(0.3B-32B):拒绝PII探针查询的比例(拒绝率)和非拒绝响应中包含PII的比例(条件PII披露率)。在所有模型中,我们观察到一个一致的趋势:随着受试者可见度的降低,拒绝率增加,PII披露率降低(从高可见度的9.10%降至低可见度的5.34%)。我们发现,模型更可能披露高可见度受试者的PII,同时存在显著的模型家族异质性和PII类型差异。最后,改写和jailbreak风格的提示揭示了攻击和模型依赖的失败,促使了可见度意识的安全评估和训练干预。
Summary / 总结
PII-VisBench evaluates the safety of personally identifiable information (PII) in Vision Language Models (VLMs) across different levels of online visibility. The benchmark includes 4000 unique probes and stratifies 200 subjects into four visibility categories. The study finds that refusals of PII probing queries increase and PII disclosures decrease as subject visibility drops, with models more likely to disclose PII for high-visibility subjects. There is also significant heterogeneity among model families and disparities in PII types disclosed. The research highlights the need for visibility-aware safety evaluations and training interventions.
PII-VisBench 评估 Vision Language Models (VLMs) 在不同在线可见度水平下的个人可识别信息 (PII) 安全性。基准包括 4000 个探针并将 200 个主体分为四个可见度类别。研究发现,随着可见度的降低,PII 探针查询的拒绝率增加,有条件地披露 PII 的比例降低。模型倾向于为高可见度的主体披露更多 PII,并且在模型家族和 PII 类型之间存在显著差异。研究强调需要进行可见度意识的安全评估和训练干预。
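The two metrics named in the abstract follow directly from their definitions and can be sketched as below. The record layout (dicts with `refused` and `contains_pii` flags) and the function names are assumptions made for illustration; how the benchmark actually detects refusals and flags PII is not described in the abstract. Both values are returned as percentages, matching how the abstract reports disclosure rates.

```python
def refusal_rate(responses):
    """Percentage of PII probing queries that the model refused."""
    return 100 * sum(r["refused"] for r in responses) / len(responses)


def conditional_pii_disclosure_rate(responses):
    """Percentage of non-refusal responses flagged as containing PII.
    Returns 0.0 when every query was refused (an assumed convention).
    """
    answered = [r for r in responses if not r["refused"]]
    if not answered:
        return 0.0
    return 100 * sum(r["contains_pii"] for r in answered) / len(answered)


# Hypothetical results for one model on ten probes (made-up data).
responses = (
    [{"refused": True, "contains_pii": False}] * 6
    + [{"refused": False, "contains_pii": False}] * 3
    + [{"refused": False, "contains_pii": True}]
)

print(refusal_rate(responses))                     # 60.0
print(conditional_pii_disclosure_rate(responses))  # 25.0
```

Conditioning the disclosure rate on non-refusals keeps the two metrics separable: a model cannot lower its disclosure rate simply by refusing more often.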
SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models
Authors: Oriol Rabasseda, Zenjie Li, Kamal Nasrollahi, Sergio Escalera
Venue: WACV
First: 2026-01-08T10:58:59+00:00 · Latest: 2026-01-09T10:27:37+00:00
Comments: This work has been accepted at Real World Surveillance: Applications and Challenges, 6th (in WACV Workshops)
Abstract
Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.
中文标题/摘要
标题:SOVABench:面向多模态大语言模型的车辆监控动作检索基准
自动识别事件和重复行为分析是视频监控的关键。然而,大多数现有的基于内容的视频检索基准主要关注场景相似性,而不评估监控所需的动作区分。为了解决这一差距,我们引入了SOVABench(Surveillance Opposite Vehicle Actions Benchmark),这是一个基于监控视频构建的现实世界检索基准,专注于车辆相关动作。SOVABench 定义了两种评估协议(跨对和内对),以评估跨动作区分和时间方向理解。尽管动作区分对人类观察者来说通常很直观,但我们的实验表明,即使是最先进的视觉和多模态模型也难以完成这些任务。 利用多模态大语言模型(MLLMs)的视觉推理和指令跟随能力,我们提出了一种无需训练的框架,用于从MLLM生成的图像和视频描述中生成可解释的嵌入。该框架在SOVABench以及几个对比视觉-语言模型经常失败的空间和计数基准上都取得了良好的性能。基准的代码、注释和构建说明已公开。
Summary / 总结
SOVABench is a new benchmark for vehicle surveillance action retrieval, addressing the lack of action discrimination in existing video retrieval benchmarks. Built from surveillance footage, it defines two evaluation protocols (inter-pair and intra-pair) that assess cross-action discrimination and temporal direction understanding. Although these distinctions are intuitive for humans, state-of-the-art vision and multimodal models struggle with them. A training-free framework that derives interpretable embeddings from MLLM-generated descriptions achieves strong performance on SOVABench and on other benchmarks where contrastive VLMs often fail.
SOVABench 是一个新的车辆 surveillance 行动检索基准,解决了现有视频检索基准中缺乏动作区分的问题。它评估模型在跨动作区分和时间方向理解上的表现。基准使用监控录像并定义了两个协议。尽管对人类来说这些任务是直观的,但最先进的模型仍然难以完成这些任务。一个无需训练的框架使用 MLLM 生成可解释的嵌入,这些嵌入在 SOVABench 和其他对比 VLM 常常失败的基准上表现出色。
SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
Authors: Jingyu Li, Junjie Wu, Dongnan Hu, Xiangkai Huang, Bin Sun, Zhihui Hao, Xianpeng Lang, Xiatian Zhu, Li Zhang
First: 2026-01-09T08:55:42+00:00 · Latest: 2026-01-09T08:55:42+00:00
Abstract
Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.
中文标题/摘要
标题:SGDrive:场景到目标的层次世界认知在自动驾驶中的应用
近期的端到端自动驾驶方法利用视觉-语言模型(VLMs)增强了在复杂驾驶场景中的规划能力。然而,VLMs本质上是作为通用模型训练的,缺乏对驾驶特定推理在三维空间和时间中的专门理解。当应用于自动驾驶时,这些模型难以建立能够捕捉几何关系、场景上下文和对安全轨迹规划至关重要的运动模式的结构化时空表示。为了解决这些限制,我们提出了SGDrive,这是一种新颖的框架,明确地将VLM的表示学习结构化为驾驶特定知识的层次结构。基于预训练的VLM主干,SGDrive将驾驶理解分解为场景-代理-目标层次结构,这与人类的驾驶认知相呼应:驾驶员首先感知整体环境(场景上下文),然后关注安全关键的代理及其行为,最后制定短期目标再执行动作。这种层次分解提供了通用VLM缺乏的结构化时空表示,将多级信息整合为一种紧凑而全面的格式,用于轨迹规划。在NAVSIM基准上的广泛实验表明,SGDrive在PDMS和EPDMS上的表现优于仅使用摄像头的方法,验证了层次知识结构化对适应通用VLMs到自动驾驶的有效性。
Summary / 总结
The paper addresses the limitations of Vision-Language Models (VLMs) in autonomous driving by proposing SGDrive, a framework that structures VLMs around driving-specific knowledge hierarchies. SGDrive decomposes driving understanding into a scene-agent-goal hierarchy, enhancing the model's ability to capture geometric relationships, scene context, and motion patterns. Experiments on the NAVSIM benchmark show that SGDrive outperforms existing camera-only methods, validating the effectiveness of hierarchical knowledge structuring for VLMs in autonomous driving.
SGDrive 是一个框架,通过将 Vision-Language 模型 (VLM) 的表示学习结构化为驾驶特定的知识层次结构来提升自动驾驶能力。它将驾驶理解分解为场景-代理-目标层次结构,使模型能够捕捉几何关系、场景上下文和运动模式。在 NAVSIM 基准上的实验结果表明,SGDrive 在 PDMS 和 EPDMS 上均优于现有基于摄像头的方法,验证了层次知识结构化对适应通用 VLM 的自动驾驶的有效性。
Video Generation Models Are Good Latent Reward Models
Authors: Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Fan Tang
First: 2025-11-26T16:14:18+00:00 · Latest: 2026-01-09T08:31:36+00:00
Abstract
Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.
中文标题/摘要
标题:视频生成模型是良好的潜在空间奖励模型
奖励反馈学习(ReFL)已被证明对于使图像生成与人类偏好对齐非常有效。然而,将其扩展到视频生成面临着重大挑战。现有的视频奖励模型依赖于为像素空间输入设计的视觉语言模型,这将ReFL优化限制在昂贵的VAE解码之后的近完全去噪步骤中。这种像素空间的方法会产生大量的内存开销并增加训练时间,而且其后期优化缺乏早期监督,只能优化视觉质量而不是基本的运动动态和结构一致性。在本文中,我们展示了预训练的视频生成模型天然适合在噪声潜在空间中进行奖励建模,因为它们明确设计为可以处理任意时间步的噪声潜在表示,并通过其序列建模能力内在地保留时间信息。因此,我们提出了过程奖励反馈学习(PRFL)框架,该框架在潜在空间中完全进行偏好优化,从而在整个去噪链中实现高效的梯度反向传播,而无需VAE解码。广泛的实验表明,PRFL在与人类偏好对齐方面显著提高,同时与RGB ReFL相比在内存消耗和训练时间上实现了显著减少。
Summary / 总结
This work addresses the challenge of applying reward feedback learning (ReFL) to video generation by proposing Process Reward Feedback Learning (PRFL). PRFL leverages pre-trained video generation models to optimize preferences entirely in the noisy latent space, avoiding the need for computationally expensive VAE decoding. This approach reduces memory consumption and training time while improving alignment with human preferences compared to traditional RGB ReFL methods.
本文提出了Process Reward Feedback Learning (PRFL) 方法来解决将奖励反馈学习 (ReFL) 应用于视频生成的问题。PRFL 利用预训练的视频生成模型在噪声的潜在空间中优化偏好,避免了昂贵的 VAE 解码步骤。这种方法在与人类偏好对齐方面表现更好,并且相比于传统的像素空间 ReFL 方法,减少了内存消耗和训练时间。
See or Say Graphs: Agent-Driven Scalable Graph Structure Understanding with Vision-Language Models
Authors: Shuo Han, Yukun Cao, Zezhong Ding, Zengyi Gao, S Kevin Zhou, Xike Xie
First: 2025-10-19T09:20:44+00:00 · Latest: 2026-01-09T08:10:28+00:00
Abstract
Vision-language models (VLMs) have shown promise in graph structure understanding, but remain limited by input-token constraints, facing scalability bottlenecks and lacking effective mechanisms to coordinate textual and visual modalities. To address these challenges, we propose GraphVista, a unified framework that enhances both scalability and modality coordination in graph structure understanding. For scalability, GraphVista organizes graph information hierarchically into a lightweight GraphRAG base, which retrieves only task-relevant textual descriptions and high-resolution visual subgraphs, compressing redundant context while preserving key reasoning elements. For modality coordination, GraphVista introduces a planning agent that decomposes and routes tasks to the most suitable modality, using the text modality for direct access to explicit graph properties and the visual modality for local graph structure reasoning grounded in explicit topology. Extensive experiments demonstrate that GraphVista scales to large graphs, up to 200× larger than those used in existing benchmarks, and consistently outperforms existing textual, visual, and fusion-based methods, achieving up to 4.4× quality improvement over the state-of-the-art baselines by fully exploiting the complementary strengths of both modalities.
中文标题/摘要
标题:见或说图:基于视觉语言模型的代理驱动可扩展图结构理解
视觉语言模型(VLMs)在图结构理解方面显示出潜力,但仍然受到输入标记数量的限制,面临可扩展性瓶颈,并缺乏有效机制来协调文本和视觉模态。为了解决这些挑战,我们提出了GraphVista,这是一种统一框架,增强了图结构理解中的可扩展性和模态协调。为了解决可扩展性问题,GraphVista 将图信息分层组织为一个轻量级的GraphRAG基底,仅检索与任务相关的文本描述和高分辨率的视觉子图,压缩冗余上下文同时保留关键推理元素。为了解决模态协调问题,GraphVista 引入了一个规划代理,将任务分解并路由到最合适的模态:使用文本模态直接访问显式的图属性,使用视觉模态进行基于显式拓扑的局部图结构推理。广泛的实验表明,GraphVista 可以扩展到大型图,规模最多可达现有基准所用图的200倍,且始终优于现有的文本、视觉和融合方法,通过充分利用两种模态的互补优势,相对于最先进的基线方法,质量提升最高达4.4倍。
Summary / 总结
The paper addresses the scalability and modality coordination challenges in graph structure understanding using vision-language models. It introduces GraphVista, a unified framework that organizes graph information hierarchically and uses a planning agent to coordinate text and visual modalities. Experiments show that GraphVista scales to graphs up to 200× larger than those in existing benchmarks and achieves up to a 4.4× quality improvement over state-of-the-art baselines.
论文旨在解决使用视觉-语言模型进行图结构理解时的可扩展性和模态协调问题。提出了一种名为GraphVista的统一框架,该框架通过分层组织图信息,并使用规划代理协调文本和视觉模态。实验表明,GraphVista可以处理规模最多达现有基准200倍的大型图,并且相对于最先进的基线方法,质量提升最高达4.4倍。
LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction
Authors: Chengen Xie, Bin Sun, Tianyu Li, Junjie Wu, Zhihui Hao, XianPeng Lang, Hongyang Li
First: 2026-01-09T08:06:44+00:00 · Latest: 2026-01-09T08:06:44+00:00
Abstract
End-to-end autonomous driving models trained on large-scale datasets perform well in common scenarios but struggle with rare, long-tail situations due to limited scenario diversity. Recent Vision-Language-Action (VLA) models leverage broad knowledge from pre-trained vision-language models to address this limitation, yet face critical challenges: (1) numerical imprecision in trajectory prediction due to discrete tokenization, (2) heavy reliance on language annotations that introduce linguistic bias and annotation burden, and (3) computational inefficiency from multi-step chain-of-thought reasoning that hinders real-time deployment. We propose LatentVLA, a novel framework that employs self-supervised latent action prediction to train VLA models without language annotations, eliminating linguistic bias while learning rich driving representations from unlabeled trajectory data. Through knowledge distillation, LatentVLA transfers the generalization capabilities of VLA models to efficient vision-based networks, achieving both robust performance and real-time efficiency. LatentVLA establishes a new state-of-the-art on the NAVSIM benchmark with a PDMS score of 92.4 and demonstrates strong zero-shot generalization on the nuScenes benchmark.
中文标题/摘要
标题:LatentVLA:通过潜在动作预测实现自主驾驶的高效视觉-语言模型
端到端的自主驾驶模型在大规模数据集上训练,在常见场景中表现良好,但在罕见的长尾情况下表现不佳,原因在于场景多样性有限。最近的视觉-语言-动作(VLA)模型利用预训练的视觉-语言模型的广泛知识来解决这一限制,但面临关键挑战:(1)由于离散标记化导致轨迹预测中的数值不精确,(2)对语言注解的重度依赖引入了语言偏见和注解负担,(3)多步推理计算效率低下阻碍了实时部署。我们提出了一种名为LatentVLA的新框架,该框架采用自监督的潜在动作预测来训练VLA模型,无需语言注解,从而消除语言偏见并从未标记的轨迹数据中学习丰富的驾驶表示。通过知识蒸馏,LatentVLA将VLA模型的泛化能力转移到高效的基于视觉的网络中,实现了稳健的性能和实时效率。LatentVLA在NAVSIM基准测试中建立了新的最佳水平,得分为92.4,并在nuScenes基准测试中展示了强大的零样本泛化能力。
Summary / 总结
LatentVLA is a novel framework that uses self-supervised latent action prediction to train Vision-Language-Action models without language annotations, addressing issues of linguistic bias and computational inefficiency. It achieves robust performance and real-time efficiency, setting a new state-of-the-art on the NAVSIM benchmark with a PDMS score of 92.4 and showing strong zero-shot generalization on the nuScenes benchmark.
LatentVLA 是一种使用自监督的潜在动作预测来训练无需语言注释的视觉-语言-动作模型的新框架,解决了语言偏见和计算效率低下的问题。它实现了稳健的性能和实时效率,并在 NAVSIM 基准上达到了 92.4 的 PDMS 分数,同时在 nuScenes 基准上展示了强大的零样本泛化能力。
CoV: Chain-of-View Prompting for Spatial Reasoning
Authors: Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang
First: 2026-01-08T17:59:42+00:00 · Latest: 2026-01-09T07:20:05+00:00
Comments: Code link https://github.com/ziplab/CoV
Abstract
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training. Code is available at https://github.com/ziplab/CoV.
中文标题/摘要
标题:CoV:空间推理的链式视角提示
3D环境中的具身问答(EQA)通常需要收集分布在多个视角且部分被遮挡的上下文。然而,大多数最近的视觉-语言模型(VLMs)仅限于固定且有限的输入视图集,这限制了它们在推理时获取与问题相关上下文的能力,并阻碍了复杂的空间推理。我们提出了链式视角(CoV)提示,这是一种无需训练、在测试时进行推理的框架,通过从粗到细的探索过程将VLM转换为主动的视角推理器。CoV首先使用视图选择代理筛选冗余帧并识别与问题对齐的锚视图,然后通过交替进行迭代推理和离散相机动作进行细粒度视图调整,从底层3D场景表示中获取新观察,直到收集到足够上下文或达到步骤预算。我们在OpenEQA上对CoV进行了评估,跨四个主流VLMs获得了平均+11.56%的LLM-Match改进,最大增益为Qwen3-VL-Flash上的+13.62%。CoV还表现出测试时的扩展性:增加最小动作预算可额外获得平均+2.51%的改进,峰值为Gemini-2.5-Flash上的+3.73%。在ScanQA和SQA3D上,CoV表现出强大的性能(例如,ScanQA上的116 CIDEr / 31.9 EM@1和SQA3D上的51.1 EM@1)。总体而言,这些结果表明,与问题对齐的视图选择结合开放视图搜索是提高3D EQA中空间推理的有效、模型无关的策略,无需额外训练。代码可在 https://github.com/ziplab/CoV 获取。
Summary / 总结
The research aims to enhance embodied question answering (EQA) in 3D environments by addressing the limitations of existing vision-language models (VLMs) that are constrained to a fixed set of input views. The proposed Chain-of-View (CoV) prompting method involves a coarse-to-fine exploration process, using a View Selection agent to filter redundant frames and identify relevant anchor views, followed by fine-grained view adjustment through iterative reasoning and discrete camera actions. CoV shows significant improvements across various VLMs, with an average increase of 11.56% in LLM-Match, and further scaling benefits with increased action budgets. CoV performs well on ScanQA and SQA3D, demonstrating its effectiveness in improving spatial reasoning in 3D EQA without additional training.
研究旨在通过解决固定输入视图限制,提升3D环境中的具身问答能力。提出的Chain-of-View (CoV) 提示方法采用从粗到细的探索过程,包括视图选择和精细视图调整。在OpenEQA上的实验显示,LLM-Match平均提升+11.56%,在Qwen3-VL-Flash上的最高增益为+13.62%。CoV还展示了测试时的可扩展性,在Gemini-2.5-Flash上的最高额外增益达到+3.73%。该方法在ScanQA和SQA3D上表现出色,表明其在不额外训练的情况下有效提升空间推理能力。
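The coarse-to-fine procedure described in the abstract above can be sketched as a simple control loop. This is a minimal sketch under stated assumptions, not the authors' implementation (the linked repository is the reference for that): the callables `select_anchor_views`, `propose_action`, and `render_view`, and the toy stand-ins below, are hypothetical placeholders for the View Selection agent, the VLM reasoning step, and the 3D scene renderer.

```python
from typing import Callable, List, Optional


def chain_of_view(
    frames: List[str],
    question: str,
    select_anchor_views: Callable[[List[str], str], List[str]],
    propose_action: Callable[[List[str], str], Optional[str]],
    render_view: Callable[[str, str], str],
    step_budget: int = 4,
) -> List[str]:
    """Return the views gathered for answering `question`."""
    # Coarse stage: drop redundant frames, keep question-aligned anchors.
    views = select_anchor_views(frames, question)
    # Fine stage: interleave reasoning with discrete camera actions.
    for _ in range(step_budget):
        action = propose_action(views, question)
        if action is None:  # the reasoner judges the context sufficient
            break
        views.append(render_view(views[-1], action))
    return views


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_select(frames, question):
    return sorted(set(frames))[:1]  # deduplicate, keep a single anchor


def toy_propose(views, question):
    return "turn_left" if len(views) < 3 else None


def toy_render(view, action):
    return f"{view}>{action}"


views = chain_of_view(["f0", "f0", "f1"], "Where is the chair?",
                      toy_select, toy_propose, toy_render)
print(views)
```

The loop stops on whichever comes first, a `None` action or the step budget, which mirrors the two stopping conditions named in the abstract.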
ReasonAny: Incorporating Reasoning Capability to Any Model via Simple and Effective Model Merging
Authors: Junyao Yang, Chen Qian, Dongrui Liu, Wen Shen, Yong Liu, Jing Shao
First: 2026-01-09T06:19:00+00:00 · Latest: 2026-01-09T06:19:00+00:00
Comments: 22 pages, 6 figures, 14 tables
Abstract
Large Reasoning Models (LRMs) with long chain-of-thought reasoning have recently achieved remarkable success. Yet, equipping domain-specialized models with such reasoning capabilities, referred to as "Reasoning + X", remains a significant challenge. While model merging offers a promising training-free solution, existing methods often suffer from a destructive performance collapse: they tend to both weaken reasoning depth and compromise domain-specific utility. Interestingly, we identify a counter-intuitive phenomenon underlying this failure: reasoning ability predominantly resides in parameter regions with low gradient sensitivity, contrary to the common assumption that domain capabilities correspond to high-magnitude parameters. Motivated by this insight, we propose ReasonAny, a novel merging framework that resolves the reasoning-domain performance collapse through Contrastive Gradient Identification. Experiments across safety, biomedicine, and finance domains show that ReasonAny effectively synthesizes "Reasoning + X" capabilities, significantly outperforming state-of-the-art baselines while retaining robust reasoning performance.
中文标题/摘要
标题:ReasonAny:通过简单有效的模型合并将推理能力赋予任何模型
大型推理模型(LRMs)具有长链推理能力,最近取得了显著的成功。然而,为领域特定模型配备此类推理能力,即所谓的“推理+X”,仍然是一个重大挑战。虽然模型合并提供了一种有前景的无需训练的解决方案,但现有方法往往遭受性能崩溃的破坏:这些方法往往会削弱推理深度并损害领域特定的实用性。有趣的是,我们发现这一失败背后存在一个反直觉的现象:推理能力主要存在于梯度敏感度低的参数区域,这与领域能力对应高幅度参数的常见假设相反。受此见解的启发,我们提出了ReasonAny,这是一种新颖的合并框架,通过对比梯度识别解决推理-领域性能崩溃问题。在安全、生物医学和金融领域进行的实验表明,ReasonAny能够有效合成“推理+X”能力,显著优于最先进的基线方法,同时保持稳健的推理性能。
Summary / 总结
The paper addresses the challenge of integrating reasoning capabilities into domain-specialized models, known as 'Reasoning + X'. It introduces ReasonAny, a novel merging framework that uses Contrastive Gradient Identification to avoid the performance collapse often seen in existing methods. Experiments across safety, biomedicine, and finance domains demonstrate that ReasonAny effectively combines reasoning and domain-specific utility, outperforming existing methods while maintaining robust reasoning performance.
研究旨在为领域专业化模型赋予推理能力,同时不损害其领域性能。提出的ReasonAny框架利用对比梯度识别将推理模型与其它专业化模型合并,有效合成“推理+X”的能力。实验结果显示,ReasonAny在保持稳健推理性能的同时,整体表现优于现有方法。
VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck
Authors: Feiran Zhang, Yixin Wu, Zhenghua Wang, Xiaohua Wang, Changze Lv, Xuanjing Huang, Xiaoqing Zheng
First: 2026-01-09T05:58:22+00:00 · Latest: 2026-01-09T05:58:22+00:00
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal tasks, but remain susceptible to hallucinations, where generated text deviates from the underlying visual content. Existing hallucination detection methods primarily rely on output logits or external verification tools, often overlooking their internal mechanisms. In this work, we investigate the outputs of internal attention heads, postulating that specific heads carry the primary signals for truthful generation. However, directly probing these high-dimensional states is challenging due to the entanglement of visual-linguistic syntax and noise. To address this, we propose VIB-Probe, a novel hallucination detection and mitigation framework leveraging the Variational Information Bottleneck (VIB) theory. Our method extracts discriminative patterns across layers and heads while filtering out semantic nuisances through the information bottleneck principle. Furthermore, by leveraging the gradients of our VIB probe, we identify attention heads with strong causal influence on hallucinations and introduce an inference-time intervention strategy for hallucination mitigation. Extensive experiments across diverse benchmarks demonstrate that VIB-Probe significantly outperforms existing baselines in both settings. Our code will be made publicly available.
中文标题/摘要
标题:VIB-Probe:通过变分信息瓶颈检测和缓解视觉-语言模型中的幻觉
视觉-语言模型(VLMs)在多模态任务中取得了显著进展,但仍然容易出现幻觉,即生成的文本与底层视觉内容不符。现有的幻觉检测方法主要依赖于输出logits或外部验证工具,往往忽视了其内部机制。在本文中,我们研究了内部注意力头的输出,假设特定的头携带着真实生成的主要信号。然而,直接探测这些高维状态由于视觉-语言语法和噪声的纠缠而具有挑战性。为了解决这个问题,我们提出了VIB-Probe,这是一种利用变分信息瓶颈(VIB)理论的新颖幻觉检测和缓解框架。我们的方法通过信息瓶颈原则提取各层和各头的判别模式,同时过滤掉语义噪声。此外,通过利用我们的VIB探针的梯度,我们识别出对幻觉有强烈因果影响的注意力头,并引入了一种推理时的干预策略以缓解幻觉。广泛的实验表明,VIB-Probe在两种设置中均显著优于现有基线。我们的代码将公开发布。
Summary / 总结
The research aims to address the issue of hallucinations in Vision-Language Models (VLMs), where generated text does not align with visual content. The authors propose VIB-Probe, which uses the Variational Information Bottleneck (VIB) theory to detect and mitigate hallucinations by analyzing internal attention heads. The method identifies discriminative patterns and filters out semantic noise, and it also introduces an inference-time intervention strategy based on gradient analysis. Experiments show that VIB-Probe outperforms existing methods in both detection and mitigation of hallucinations across various benchmarks.
研究旨在解决视觉-语言模型(VLMs)中的幻觉问题,即生成的文本与视觉内容不符。方法VIB-Probe利用变分信息瓶颈(VIB)理论,通过分析内部注意力头并过滤掉语义噪声来检测和缓解幻觉。实验表明,VIB-Probe在各种基准测试中比现有方法在幻觉检测和缓解方面表现更优。
Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making
Authors: Jua Han, Jaeyoon Seo, Jungbin Min, Jean Oh, Jihie Kim
First: 2026-01-09T05:04:15+00:00 · Latest: 2026-01-09T05:04:15+00:00
Abstract
One mistake by an AI system in a safety-critical setting can cost lives. As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows; a single wrong instruction can directly endanger human safety. This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic. Through a qualitative evaluation of a fire evacuation scenario, we identified critical failure cases in LLM-based decision-making. Based on these, we designed seven tasks for quantitative assessment, categorized into: Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). Complete information tasks utilize ASCII maps to minimize interpretation ambiguity and isolate spatial reasoning from visual processing. Incomplete information tasks require models to infer missing context, testing for spatial continuity versus hallucinations. SOSR tasks use natural language to evaluate safe decision-making in life-threatening contexts. We benchmark various LLMs and Vision-Language Models (VLMs) across these tasks. Beyond aggregate performance, we analyze the implications of a 1% failure rate, highlighting how "rare" errors escalate into catastrophic outcomes. Results reveal serious vulnerabilities: several models achieved a 0% success rate in ASCII navigation, while in a simulated fire drill, models instructed robots to move toward hazardous areas instead of emergency exits. Our findings lead to a sobering conclusion: current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics, as it implies one out of every hundred executions could result in catastrophic harm. We demonstrate that even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.
中文标题/摘要
标题:安全未找到(404):基于LLM的机器人决策中的隐藏风险
AI系统在关键安全环境中的一个错误可能会导致生命损失。随着大型语言模型(LLMs)在机器人决策中的作用日益重要,物理风险的范围也在扩大;一个错误指令可以直接危及人类安全。本文针对即使是最小的错误都可能导致灾难性后果的场景,系统地评估LLM的表现提出了迫切需求。通过定性评估火灾疏散场景,我们识别出基于LLM的决策中的关键失败案例。基于这些案例,我们设计了七个任务进行定量评估,分为:完全信息、不完全信息和安全导向的空间推理(SOSR)。完全信息任务利用ASCII地图来减少解释歧义,将空间推理与视觉处理隔离。不完全信息任务要求模型推断缺失的上下文,测试空间连续性与幻觉。SOSR任务使用自然语言评估在生命威胁情境下的安全决策。我们跨这些任务基准测试了各种LLM和视觉-语言模型(VLM)。除了整体性能外,我们还分析了1%失败率的影响,强调“罕见”的错误如何升级为灾难性后果。结果揭示了严重的漏洞:几种模型在ASCII导航中实现了0%的成功率,在模拟的消防演习中,模型指示机器人向危险区域移动而不是紧急出口。我们的发现得出一个令人警醒的结论:当前的LLM尚不适合直接部署在关键安全系统中。99%的准确率在机器人领域是危险的误导,因为它意味着每一百次执行中就可能有一次会导致灾难性伤害。我们证明即使是最先进的模型也无法保证安全,完全依赖它们会带来不可接受的风险。
Summary / 总结
This paper addresses the safety risks associated with Large Language Models (LLMs) in robotics decision-making by evaluating their performance in critical scenarios. Through a qualitative analysis of a fire evacuation scenario and a quantitative assessment using seven tasks, the study identifies serious vulnerabilities in LLMs. The tasks include scenarios with complete and incomplete information, as well as safety-oriented spatial reasoning. Key findings show that several models failed completely in ASCII navigation tasks and incorrectly directed robots towards hazardous areas during a simulated fire drill, indicating that current LLMs are not suitable for safety-critical systems despite high accuracy rates.
本文探讨了大型语言模型(LLMs)在机器人决策中的安全风险,特别是在即使出现轻微错误也可能导致灾难性后果的场景中。研究通过一系列任务评估了LLMs和视觉语言模型(VLMs),包括完整信息、不完整信息和安全导向的空间推理任务。研究结果揭示了严重的漏洞,多个模型在ASCII导航任务中完全失败,并在模拟火灾演习中指示机器人向危险区域移动。研究结果表明,当前的LLMs尚不适合部署在安全关键系统中,99%的准确率仍然可能导致灾难性后果。
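The abstract's point about a 1% failure rate can be made concrete with a short calculation. The sketch below assumes that executions fail independently with the same probability; that independence assumption, and the function name, are illustrative and not taken from the paper.

```python
def prob_at_least_one_failure(per_run_failure_rate: float, runs: int) -> float:
    """P(at least one failure) over `runs` independent executions."""
    return 1.0 - (1.0 - per_run_failure_rate) ** runs


for runs in (1, 100, 1000):
    p = prob_at_least_one_failure(0.01, runs)
    print(f"{runs:>5} runs: P(at least one failure) = {p:.3f}")
```

Under this assumption, a system that is "99% accurate" per run has roughly a 63% chance of at least one failure within 100 runs, which is why a per-run accuracy figure understates the risk of a deployed robot that acts repeatedly.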
Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models
Authors: Ziwei Liu, Borui Kang, Wei Li, Hangjie Yuan, Yanbing Yang, Wenbin Li, Yifan Zhu, Tao Feng, Jun Luo
Venue: AAAI
First: 2025-06-14T08:59:19+00:00 · Latest: 2026-01-09T04:06:17+00:00
Comments: Published in the Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2026)
Abstract
Vision-Language Continual Learning (VLCL) has attracted significant research attention for its robust capabilities, and the adoption of Parameter-Efficient Fine-Tuning (PEFT) strategies is enabling these models to achieve competitive performance with substantially reduced resource consumption. However, the dominant First-Order (FO) optimization is prone to trapping models in suboptimal local minima, especially in the limited exploration subspace within PEFT. To overcome this challenge, this paper pioneers a systematic exploration of adopting Zeroth-Order (ZO) optimization for PEFT-based VLCL. We first identify the incompatibility of naive full-ZO adoption in VLCL due to optimization process instability. We then investigate the application of ZO optimization at granularities ranging from modality branch-wise to fine-grained layer-wise across various training units to identify an optimal strategy. In addition, a key theoretical insight reveals that the vision modality exhibits higher variance than its language counterpart in VLCL during the ZO optimization process, so we propose a modality-aware ZO strategy, which adopts gradient sign normalization in ZO and constrains vision modality perturbation to further improve performance. Benefiting from the adoption of ZO optimization, PEFT-based VLCL is better able to escape local minima during the optimization process. Extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art results.
中文标题/摘要
标题:分支,还是层?视觉-语言模型连续学习的零阶优化
视觉-语言连续学习(VLCL)因其强大的能力吸引了大量研究关注,参数高效微调(PEFT)策略的应用使这些模型能够在大幅减少资源消耗的情况下实现有竞争力的性能。然而,占主导地位的一阶(FO)优化容易使模型陷入次优的局部极小值,特别是在PEFT的有限探索子空间内。为克服这一挑战,本文首次系统地探索了采用零阶(ZO)优化进行基于PEFT的VLCL。我们首先指出,由于优化过程不稳定,直接采用全ZO优化与VLCL并不兼容。然后,我们在不同训练单元上研究从模态分支级到细粒度层级的ZO优化应用,以确定最优策略。此外,一个关键的理论洞察表明,在ZO优化过程中,视觉模态的方差高于语言模态,因此我们提出了一种模态感知的ZO策略,在ZO中采用梯度符号归一化,并约束视觉模态扰动,以进一步提高性能。得益于ZO优化的采用,基于PEFT的VLCL在优化过程中更容易逃离局部极小值。在四个基准上的广泛实验表明,我们的方法达到了最先进的结果。
Summary / 总结
This paper addresses the challenge of suboptimal local minima in First-Order optimization for Parameter-Efficient Fine-Tuning in Vision-Language Continual Learning. It explores the use of Zeroth-Order (ZO) optimization, identifying that full ZO adoption is unstable in VLCL. Instead, the authors propose a modality-aware ZO strategy that normalizes gradient signs and constrains vision modality perturbation, leading to better performance and improved ability to escape local minima. Experiments on four benchmarks show state-of-the-art results.
该研究针对参数高效微调(PEFT)在视觉-语言持续学习(VLCL)中使用一阶(FO)优化时出现的次优局部极小值问题,探索了零阶(ZO)优化的应用以提升性能。研究发现,直接采用全ZO优化不稳定,并提出了一种模态感知的ZO策略,该策略通过归一化梯度符号并约束视觉模态的扰动来进一步提高性能。实验表明,这种方法增强了模型在优化过程中逃离局部极小值的能力,并在四个基准测试上达到了最先进的结果。
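For readers unfamiliar with zeroth-order optimization, the sketch below shows a generic two-point ZO gradient estimate followed by a sign-normalized update. It is a textbook-style illustration on a toy quadratic, not the paper's modality-aware strategy: the function names, step sizes, and the toy loss are all assumptions made for this example.

```python
import random


def zo_sign_step(loss, params, eps=1e-3, lr=0.01, rng=random):
    """One two-point zeroth-order step with a sign-normalized update."""
    u = [rng.gauss(0.0, 1.0) for _ in params]  # random probe direction
    plus = [p + eps * d for p, d in zip(params, u)]
    minus = [p - eps * d for p, d in zip(params, u)]
    # Finite-difference estimate of the directional derivative along u;
    # only loss evaluations are needed, no backpropagation.
    slope = (loss(plus) - loss(minus)) / (2.0 * eps)
    grad_est = [slope * d for d in u]

    def sign(g):
        return (g > 0) - (g < 0)

    # Sign normalization: every coordinate moves by exactly lr (or 0).
    return [p - lr * sign(g) for p, g in zip(params, grad_est)]


def quadratic(x):
    return sum(v * v for v in x)


rng = random.Random(0)  # fixed seed for reproducibility
x0 = [1.0, -2.0, 3.0]
x = list(x0)
for _ in range(2000):
    x = zo_sign_step(quadratic, x, rng=rng)

print(quadratic(x0), quadratic(x))
```

Because the update uses only the sign of the estimate, the step length no longer depends on the high-variance magnitude of the ZO gradient estimate, which is one plausible reading of why sign normalization helps when one branch is noisier than the other.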
e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings
Authors: Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, Zhicheng Dou
First: 2026-01-07T07:39:40+00:00 · Latest: 2026-01-09T02:24:32+00:00
Comments: https://huggingface.co/Haon-Chen/e5-omni-7B
Abstract
Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/Haon-Chen/e5-omni-7B.
中文标题/摘要
标题:e5-omni:显式跨模态对齐的全模态嵌入
现代信息系统通常涉及不同类型的条目,例如文本查询、图像、视频片段或音频片段。这促使了全模态嵌入模型的发展,将异构模态映射到共享空间以直接进行比较。然而,大多数最新的全模态嵌入仍然严重依赖于从预训练的视觉-语言模型(VLM)骨干继承的隐式对齐。实践中,这导致了三个常见问题:(i)相似度logits具有模态依赖的锐度,因此分数不在一致的尺度上;(ii)批次内负样本随时间变得不那么有效,因为混合模态批次创建了一个不平衡的难度分布;结果,许多负样本很快变得过于简单,贡献的梯度很少;(iii)不同模态的嵌入显示出不匹配的一阶和二阶统计,这使得排名不够稳定。为了解决这些问题,我们提出了e5-omni,这是一种轻量级的显式对齐方法,将现成的VLM调整为稳健的全模态嵌入模型。e5-omni结合了三个简单的组件:(1)模态感知的温度校准以对齐相似度尺度,(2)可控的负样本课程学习与去偏置,以专注于易混淆的负样本并减少假负样本的影响,(3)批次白化与协方差正则化以更好地匹配共享嵌入空间中的跨模态几何。在MMEB-V2和AudioCaps上的实验显示,e5-omni相对于强大的双模态和全模态基线具有一致的改进,并且相同的方法也能很好地迁移到其他VLM骨干上。我们已在 https://huggingface.co/Haon-Chen/e5-omni-7B 发布模型检查点。
Summary / 总结
The paper addresses the challenges of using implicit alignment in omni-modal embedding models, which can lead to inconsistent similarity scales, imbalanced negative hardness, and mismatched statistics across modalities. To solve these issues, the authors propose e5-omni, which includes modality-aware temperature calibration, a controllable negative curriculum, and batch whitening with covariance regularization. Experiments on MMEB-V2 and AudioCaps demonstrate consistent improvements over existing bi-modal and omni-modal baselines, and the method is adaptable to other VLM backbones.
论文针对当前模型中存在的模态依赖性锐度、硬度分布不平衡以及跨模态统计不匹配等问题,提出了一种轻量级的方法e5-omni,该方法包括模态感知温度校准、可控负样本课程以及批量白化和协方差正则化。实验结果表明,该方法在MMEB-V2和AudioCaps上相对于现有的双模态和全模态基线模型具有一致的改进效果,并且该方法适用于其他VLM骨干。
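The first component, modality-aware temperature calibration, can be illustrated with a small sketch. The abstract does not give the exact formulation, so the code below assumes the simplest possible reading: each candidate's similarity score is divided by a temperature tied to its modality before the softmax. The function name and the temperature values are hypothetical.

```python
import math


def calibrated_softmax(similarities, modalities, temperature):
    """Softmax over similarity scores, each divided by the temperature
    assigned to its candidate's modality."""
    logits = [s / temperature[m] for s, m in zip(similarities, modalities)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical temperatures; in practice they would be learned or tuned.
temperature = {"text": 0.05, "image": 0.10}
probs = calibrated_softmax(
    [0.80, 0.80, 0.40], ["text", "image", "image"], temperature
)
print(probs)
```

A larger temperature flattens the scores of a modality whose raw similarities are systematically sharper, which is the sense in which per-modality temperatures can put scores from different modalities on a more comparable scale.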
Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization
Authors: Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang, Xuecai Hu, Guanhua Chen, Liaoni Wu, Xiangxiang Chu
First: 2026-01-08T23:47:30+00:00 · Latest: 2026-01-08T23:47:30+00:00
Abstract
The image geolocalization task aims to predict the location where an image was taken anywhere on Earth using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans: using maps. In this work, we first equip the model with the Thinking with Map ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, including agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode.
中文标题/摘要
标题:地图思维:强化并行地图增强代理的地理定位
图像地理定位任务旨在利用视觉线索预测地球上任何地方拍摄图像的位置。现有的大型视觉-语言模型(LVLM)方法利用世界知识、链式推理和代理能力,但忽略了人类常用的一种策略——使用地图。在本文中,我们首先赋予模型“地图思维”能力,并将其形式化为地图中的代理循环。我们为此开发了一种两阶段优化方案,包括代理强化学习(RL)后跟并行测试时缩放(TTS)。强化学习增强了模型的代理能力,以提高采样效率,而并行TTS使模型能够在做出最终预测之前探索多个候选路径,这对于地理定位至关重要。为了在最新和真实世界的图像上评估我们的方法,我们进一步提出了MAPBench,这是一个由完全真实世界图像组成的全面地理定位训练和评估基准。实验结果表明,我们的方法在大多数指标上优于现有开源和闭源模型,特别是在与“Gemini-3-Pro”(带有Google搜索/地图支持模式)相比时,500米准确率从8.0%提高到22.1%。
Summary / 总结
The research aims to enhance image geolocalization by incorporating map-based reasoning into large vision-language models. The method, named Thinking with Map, uses an agent-in-the-map loop with two stages: reinforcement learning to improve sampling efficiency and parallel test-time scaling to explore multiple paths. This approach significantly improves performance, raising Acc@500m from 8.0% to 22.1% (a gain of 14.1 percentage points) over Gemini-3-Pro with Google Search/Map grounded mode.
研究旨在通过将基于地图的推理融入大型视觉-语言模型来提升图像地理定位能力。方法包括地图中的智能体循环,分为两个阶段:强化学习以提高采样效率和并行测试时扩展以探索多条路径。该方法显著提高了准确性,特别是在500米内的准确率从8.0%提升到22.1%,超过了具有Google地图集成模式的Gemini-3-Pro。
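The Acc@500m figure quoted above is a distance-thresholded accuracy. A minimal sketch of how such a metric is commonly computed is given below, assuming it means the fraction of predictions within 500 m great-circle distance of the ground truth; the benchmark's exact protocol is not stated in the abstract, and the coordinates used here are illustrative, not benchmark data.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def accuracy_at(predictions, targets, threshold_m=500.0):
    """Fraction of predictions within `threshold_m` of their target."""
    hits = sum(haversine_m(*p, *t) <= threshold_m
               for p, t in zip(predictions, targets))
    return hits / len(targets)


# Illustrative coordinates only.
preds = [(48.8584, 2.2945), (40.0, -74.0)]
truth = [(48.8606, 2.3376), (40.001, -74.0)]
print(accuracy_at(preds, truth))
```

The first prediction is a few kilometers off and counts as a miss, while the second is about 111 m off and counts as a hit, so a tight threshold such as 500 m rewards only near-exact localization.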
Coding the Visual World: From Image to Simulation Using Vision Language Models
Authors: Sagi Eppel
First: 2026-01-08T19:49:05+00:00 · Latest: 2026-01-08T19:49:05+00:00
Abstract
The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations. Through analysis of the models and images generated by the VLMs, we examine their understanding of the systems in images. The results show that leading VLMs (GPT, Gemini) demonstrate the capacity to understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, the VLMs exhibit limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine high-level, deep visual understanding of images with limited perception of fine details.
中文标题/摘要
标题:编码可视世界:使用视觉语言模型从图像到模拟
构建世界的心理模型是理解的核心方面。同样,视觉理解可以被视为构建图像中所描绘系统代表模型的能力。这项工作探讨了视觉语言模型(VLMs)使用Im2Sim方法识别和模拟图像中所示系统和机制的能力。VLM被给予一个真实世界的系统的自然图像(例如,城市、云、植被),并被要求描述该系统并编写模拟和生成它的代码。然后执行生成的代码以产生合成图像,并将其与原始图像进行比较。这种方法在各种复杂的涌现系统上进行了测试,从物理系统(波、光、云)到植被、城市、材料和地质构造。通过分析VLM生成的模型和图像,我们研究了它们对图像中系统的理解。结果表明,领先的VLM(GPT、Gemini)能够理解并建模跨多个抽象层次和多个领域的复杂、多组件系统。同时,VLM在复制图像中的细部和低级模式排列方面表现出有限的能力。这些发现揭示了一个有趣的不对称性:VLM结合了对图像的高层次、深入的视觉理解,但对细部感知有限。
Summary / 总结
This study investigates the capability of Vision Language Models (VLMs) to recognize and simulate complex systems depicted in images using the Im2Sim methodology. The models are provided with natural images and asked to describe the depicted systems and write code that simulates them; the code is then executed to produce a synthetic image that is compared with the original. The results indicate that leading VLMs like GPT and Gemini can understand and model complex, multi-component systems across various domains but struggle with replicating fine details and low-level patterns. This reveals an interesting asymmetry in VLMs' visual understanding capabilities.
这项研究探讨了视觉语言模型(VLMs)使用Im2Sim方法模拟图像中所示现实世界系统的能力。VLMs被给予图像并被要求描述系统和编写代码,该代码被执行以生成合成图像,再与原始图像进行比较。结果表明,领先的VLMs如GPT和Gemini能够跨多个领域理解和模拟复杂的多组件系统,但在复制图像中的细节和低级模式方面存在局限性。这揭示了一个有趣的不对称性:高层次的视觉理解与有限的细节感知并存。
Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
Authors: William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald
First: 2026-01-08T18:23:03+00:00 · Latest: 2026-01-08T18:23:03+00:00
Abstract
Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
中文标题/摘要
标题:视觉语言模型中提示诱发幻觉的机制
大型视觉语言模型(VLMs)功能强大,但经常倾向于文本提示而非视觉证据,从而产生幻觉。我们在一个受控的对象计数设置中研究了这种失败模式,其中提示夸大了图像中的对象数量(例如,要求模型描述四朵睡莲,而实际上只有三朵)。在对象数量较低时,模型通常会纠正这种夸大,但随着对象数量的增加,它们越来越倾向于遵循提示,无视差异。通过对三种VLMs的机制分析,我们识别出一小组注意力头,对其进行消融可以将提示诱发幻觉(PIH)减少至少40%而无需额外训练。在不同模型中,PIH头以各模型特有的方式介导提示复制。我们刻画了这些差异,并表明消融PIH头会增加模型依据视觉证据进行的纠正。我们的研究结果揭示了驱动提示诱发幻觉的内部机制,以及这些行为在不同模型中实现方式的差异。
Summary / 总结
This study investigates the mechanisms of prompt-induced hallucination (PIH) in vision-language models (VLMs) using prompts that overstate the number of objects in an image. At low object counts, models tend to correct the overestimation, but as the number increases, they conform more to the prompt. By analyzing three VLMs, the researchers identified a small set of attention heads whose ablation reduces PIH by at least 40% without additional training. These PIH-heads mediate prompt copying in model-specific ways, and ablation increases correction toward visual evidence, providing insights into the internal mechanisms driving these behaviors.
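The study examines how large vision-language models (VLMs) prioritize textual prompts over visual evidence and hallucinate in an object-counting task. Analyzing three VLMs, the researchers find that removing specific attention heads reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. The results reveal model-specific differences in how these behaviors are implemented and show that targeting these heads can correct the models' reliance on the prompt.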
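As a rough illustration of what "ablating an attention head" can mean, the sketch below zeroes the output of selected heads before the per-head outputs are concatenated. The abstract does not say whether the authors use zero ablation, mean ablation, or another variant, so zero ablation is an assumption here, and the function names and toy values are hypothetical.

```python
def ablate_heads(head_outputs, heads_to_ablate):
    """Return a copy of the per-head outputs with the selected heads zeroed.

    head_outputs: list over heads; each entry is that head's output vector.
    heads_to_ablate: set of head indices to zero out (assumed zero ablation).
    """
    return [
        [0.0] * len(vec) if idx in heads_to_ablate else list(vec)
        for idx, vec in enumerate(head_outputs)
    ]


def combine_heads(head_outputs):
    """Concatenate per-head outputs, as is done before the output projection."""
    return [x for vec in head_outputs for x in vec]


# Toy values: 3 heads, each with a 2-dimensional output.
heads = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.9]]
ablated = ablate_heads(heads, heads_to_ablate={1})
print(combine_heads(ablated))
```

In a real model the same idea would be applied inside the attention layer for a chosen (layer, head) pair, with the hallucination rate compared before and after.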
MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging
Authors: Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li
First: 2025-09-23T06:23:56+00:00 · Latest: 2026-01-08T17:56:05+00:00
Comments: The project is available at https://charlescsyyy.github.io/MVT
Abstract
Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.
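Title / Abstract (translated from Chinese)
Title: MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging
Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names by using anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation rather than on unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. In cross-dataset segmentation transfer (training on OpenEarthMap, evaluating on LoveDA), domain-adapted SAM2 improves mask quality, while dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.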
Summary / 总结
The research aims to develop class-agnostic systems for land-cover understanding in remote sensing that can generalize across datasets while maintaining spatial precision and interpretability. The method involves a three-stage framework: (i) extracting boundary-faithful region masks using SAM2 with domain adaptation, (ii) performing mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluating outputs with LLM-as-judge scoring calibrated by stratified expert ratings. Key findings include improved mask quality through domain-adapted SAM2 and more accurate taxonomy-aligned tags and informative scene descriptions from dual-step MLLM fine-tuning.
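The research aims to develop class-agnostic systems for land-cover understanding in remote sensing, with an emphasis on spatial precision and interpretability. The method has three stages: (i) extracting boundary-faithful region masks using SAM2 with domain adaptation, (ii) performing mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) scoring outputs with an LLM-as-judge calibrated against stratified expert ratings. Key findings show that domain-adapted SAM2 improves mask quality and that dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions, as validated in cross-dataset segmentation transfer.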
Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
Authors: Shuliang Liu, Songbo Yang, Dong Fang, Sihang Jia, Yuqi Tang, Lingfeng Su, Ruoshui Peng, Yibo Yan, Xin Zou, Xuming Hu
First: 2026-01-08T17:49:13+00:00 · Latest: 2026-01-08T17:49:13+00:00
Abstract
Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.
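Title / Abstract (translated from Chinese)
Title: Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
Object hallucination severely undermines the reliability of multimodal large language models. It often stems from a fundamental failure of cognitive introspection, in which models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without correcting internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection, diagnosing hallucination risk via probabilistic conflict detection and localizing the causal visual anchors. It then applies Interpretable Bi-Causal Steering to actively modulate inference, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.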
Summary / 总结
The research aims to address the issue of object hallucination in Multimodal Large Language Models (MLLMs) by enhancing their cognitive introspection. The proposed method, Vision-Language Introspection (VLI), introduces a training-free inference framework that includes Attributive Introspection for diagnosing hallucination risks and Interpretable Bi-Causal Steering for dynamically modulating the inference process. Key findings show that VLI significantly reduces object hallucination rates by 12.67% on MMHal-Bench and improves accuracy by 5.8% on POPE.
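The research aims to address object hallucination in multimodal large language models (MLLMs) by strengthening cognitive introspection. The method is a training-free inference framework, Vision-Language Introspection (VLI), which first diagnoses hallucination risk via probabilistic conflict detection and localizes the causal visual anchors. It then uses Interpretable Bi-Causal Steering to dynamically isolate visual evidence from background noise and to adjust confidence through adaptive calibration. Key findings show that VLI reduces object hallucination rates by 12.67% on MMHal-Bench and improves accuracy by 5.8% on POPE.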
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Authors: Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo
First: 2025-03-25T17:17:19+00:00 · Latest: 2026-01-08T17:17:54+00:00
Abstract
Finding information in hour-long videos is a challenging task even for top-performing Vision Language Models (VLMs), as encoding visual content quickly exceeds available context windows. To tackle this challenge, we present FALCONEye, a novel video agent based on a training-free, model-agnostic meta-architecture composed of a VLM and a Large Language Model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by calibrated confidence from the VLM's answers. We also introduce the FALCON-Bench benchmark, extending the Question Answering problem to Video Answer Search, which requires models to return both the answer and its supporting temporal window for open-ended questions in hour-long videos. With just a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench. It further demonstrates its generalization capability on the MLVU benchmark with shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while slashing inference cost by roughly an order of magnitude.
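Title / Abstract (translated from Chinese)
Title: FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Finding information in hour-long videos is challenging even for the best-performing vision language models (VLMs), because encoding the visual content quickly exceeds the available context window. To address this, we present FALCONEye, a novel video agent built on a training-free, model-agnostic meta-architecture composed of a VLM and a large language model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by the calibrated confidence of the VLM's answers. We also introduce the FALCON-Bench benchmark, which extends question answering to Video Answer Search, requiring models to return both the answer and its supporting temporal window for open-ended questions in hour-long videos. With only a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents on FALCON-Bench. It also demonstrates generalization on the MLVU benchmark, which has shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while cutting inference cost by roughly an order of magnitude.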
Summary / 总结
FALCONEye is a novel video agent that uses a VLM and an LLM to answer open-ended questions in hour-long videos. It employs an exploration-based search algorithm guided by the VLM's calibrated confidence. FALCONEye outperforms all open-source 7B VLMs and comparable agents in the FALCON-Bench and shows strong generalization in the MLVU benchmark, outperforming GPT-4o on single-detail tasks while reducing inference cost significantly.
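FALCONEye is a novel video agent that uses a VLM and an LLM to answer open-ended questions about hour-long videos. It employs an exploration-based search algorithm guided by the VLM's calibrated confidence. FALCONEye outperforms all open-source 7B VLMs and comparable agents on FALCON-Bench and shows strong generalization on the MLVU benchmark, surpassing GPT-4o on single-detail tasks while substantially reducing inference cost.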
VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
Authors: Ignacio de Rodrigo, Alvaro J. Lopez-Lopez, Jaime Boal
First: 2026-01-08T17:15:15+00:00 · Latest: 2026-01-08T17:15:15+00:00
Abstract
This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
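Title / Abstract (translated from Chinese)
Title: VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
This work introduces VERSE, a methodology for analyzing and improving vision-language models applied to visually-rich document understanding by exploring their visual embedding space. VERSE makes it possible to visualize latent representations, which supports assessing model feasibility. It also helps identify problematic regions and guides the generation of synthetic data to improve performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially improves F1 performance without harming generalization. Furthermore, we show that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass SaaS solutions such as GPT-4 and Pixtral.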
Summary / 总结
VERSE is a methodology for enhancing Vision-Language Models in Visually-rich Document Understanding by exploring their visual embedding space. It visualizes latent representations to identify problematic regions and generate synthetic data to improve model performance. Experiments show that VERSE helps uncover visual features associated with error-prone clusters, and retraining with these features significantly boosts F1 performance without degrading generalization. VERSE also enables on-premise models to match or surpass the performance of SaaS solutions like GPT-4 and Pixtral.
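VERSE is a methodology that improves vision-language models on visually-rich document understanding by exploring their visual embedding space. It visualizes latent representations to identify problematic regions and guides the generation of synthetic data to improve model performance. Experiments show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with these features significantly improves F1 performance without harming generalization. VERSE also helps on-premise models match or even surpass SaaS solutions such as GPT-4 and Pixtral.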
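The idea of locating "error-prone clusters" can be pictured with a small sketch: given a cluster assignment for each sample's visual embedding and a flag for whether the model got that sample right, compute the error rate per cluster and keep the clusters above a cutoff. The abstract does not describe the clustering algorithm or any threshold, so the function name, the inputs, and the 0.5 cutoff below are all hypothetical.

```python
from collections import defaultdict


def error_prone_clusters(cluster_ids, is_correct, min_error_rate):
    """Return {cluster_id: error_rate} for clusters at or above min_error_rate."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for cid, ok in zip(cluster_ids, is_correct):
        totals[cid] += 1
        if not ok:
            errors[cid] += 1
    rates = {cid: errors[cid] / totals[cid] for cid in totals}
    return {cid: rate for cid, rate in rates.items() if rate >= min_error_rate}


# Toy data: cluster assignment and prediction correctness for 8 samples.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
is_correct = [True, True, False, False, False, True, True, True]
print(error_prone_clusters(cluster_ids, is_correct, min_error_rate=0.5))
```

Clusters flagged this way would be the natural candidates for inspecting visual features and generating additional synthetic training samples.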
$π_0$: A Vision-Language-Action Flow Model for General Robot Control
Authors: Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, Ury Zhilinsky
Venue: RSS 2025
First: 2024-10-31T17:22:30+00:00 · Latest: 2026-01-08T17:01:05+00:00
Comments: See project website for videos: https://physicalintelligence.company/blog/pi0 Published in RSS 2025
Abstract
Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.
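Title / Abstract (translated from Chinese)
Title: $π_0$: A Vision-Language-Action Flow Model for General Robot Control
Robot learning holds tremendous promise for unlocking the potential of flexible, general, and dexterous robot systems, and for addressing some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how to design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate the model on its ability to perform tasks after pre-training, to follow language instructions from people and from a high-level VLM policy, and to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.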
Summary / 总结
This paper aims to develop general robot policies that can address the challenges of data, generalization, and robustness in robot learning. The authors propose a vision-language-action flow model based on a pre-trained vision-language model to inherit semantic knowledge from the Internet. The model is trained on a diverse dataset from various robot platforms and evaluated for zero-shot task performance, following language instructions, and acquiring new skills through fine-tuning. Key findings include the model's capability to perform tasks like laundry folding, table cleaning, and assembling boxes after pre-training.
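This paper proposes $π_0$, a vision-language-action flow model built on a pre-trained vision-language model, to address the challenges robot learning faces in real-world applications. The model inherits semantic knowledge from the Internet and is trained on a diverse dataset from multiple robot platforms. It is evaluated on performing tasks in a zero-shot setting, following human language instructions, and learning new skills through fine-tuning. Key findings include the model's effectiveness on tasks such as folding laundry, cleaning tables, and assembling boxes.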
POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering
Authors: Yichen Xu, Liangyu Chen, Liang Zhang, Jianzhe Ma, Wenxuan Wang, Qin Jin
First: 2025-07-16T06:09:02+00:00 · Latest: 2026-01-08T17:00:25+00:00
Comments: Work in Progress
Abstract
Charts are a universally adopted medium for data communication, yet existing chart understanding benchmarks are overwhelmingly English-centric, limiting their accessibility and relevance to global audiences. To address this limitation, we introduce PolyChartQA, the first large-scale multilingual benchmark for chart question answering, comprising 22,606 charts and 26,151 QA pairs across 10 diverse languages. PolyChartQA is constructed through a scalable pipeline that enables efficient multilingual chart generation via data translation and code reuse, supported by LLM-based translation and rigorous quality control. We systematically evaluate multilingual chart understanding with PolyChartQA on state-of-the-art LVLMs and reveal a significant performance gap between English and other languages, particularly low-resource ones. Additionally, we introduce a companion multilingual chart question answering training set, PolyChartQA-Train, on which fine-tuning LVLMs yields substantial gains in multilingual chart understanding across diverse model sizes and architectures. Together, our benchmark provides a foundation for developing globally inclusive vision-language models capable of understanding charts across diverse linguistic contexts.
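Title / Abstract (translated from Chinese)
Title: POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering
Charts are a universally adopted medium for data communication, yet existing chart understanding benchmarks are predominantly English-centric, which limits their applicability and relevance to global audiences. To address this limitation, we introduce PolyChartQA, the first large-scale multilingual benchmark for chart question answering, comprising 22,606 charts and 26,151 QA pairs across 10 different languages. PolyChartQA is built with a scalable pipeline that enables efficient multilingual chart generation through data translation and code reuse, supported by LLM-based translation and rigorous quality control. We systematically evaluate multilingual chart understanding with PolyChartQA on state-of-the-art LVLMs and reveal a significant performance gap between English and other languages, particularly low-resource ones. In addition, we introduce PolyChartQA-Train, a companion multilingual chart question answering training set; fine-tuning LVLMs on it yields substantial gains in multilingual chart understanding across model sizes and architectures. Our benchmark provides a foundation for developing globally inclusive vision-language models that can understand charts across diverse linguistic contexts.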
Summary / 总结
PolyChartQA is a new multilingual benchmark for chart question answering, comprising 22,606 charts and 26,151 QA pairs in 10 languages. It addresses the limitation of existing English-centric benchmarks by providing a scalable pipeline for efficient multilingual chart generation. Evaluations show a significant performance gap between English and other languages, especially low-resource ones. Fine-tuning large vision-language models on PolyChartQA-Train improves multilingual chart understanding across different model sizes and architectures, highlighting the need for globally inclusive models.
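PolyChartQA is a multilingual chart question answering benchmark comprising 22,606 charts and 26,151 QA pairs across 10 languages. It addresses the limitations of existing English-centric benchmarks with a scalable pipeline that generates multilingual charts efficiently. Evaluations show a large performance gap between English and other languages, especially low-resource ones. Fine-tuning large vision-language models on PolyChartQA-Train substantially improves multilingual chart understanding across model sizes and architectures, underscoring the importance of developing globally inclusive models.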
GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts
Authors: Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu
First: 2026-01-08T16:58:07+00:00 · Latest: 2026-01-08T16:58:07+00:00
Comments: Code available at https://github.com/Zengwh02/GlimpRouter
Abstract
Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.
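Title / Abstract (translated from Chinese)
Title: GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts
Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining whether a reasoning step needs the capacity of a large model or the efficiency of a small one. Existing routing strategies rely either on local token probabilities or on post-hoc verification, which introduces significant inference overhead. In this work, we propose a new perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token is a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter uses a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks show that our approach significantly reduces inference latency while preserving accuracy. For example, compared with a standalone large model on AIME25, GlimpRouter improves accuracy by 10.7% while reducing inference latency by 25.9%. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.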
Summary / 总结
GlimpRouter proposes a novel approach to collaborative inference by focusing on the entropy of the first token generated during reasoning steps. This method reduces inference latency and computational cost without significant loss of accuracy. Experiments show that GlimpRouter improves accuracy by 10.7% and reduces inference latency by 25.9% compared to a standalone large model on AIME25.
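GlimpRouter proposes a new step-level collaboration method that uses the entropy of the initial token to predict the difficulty of a reasoning step. On AIME25, the method reduces inference latency by 25.9% while achieving 10.7% higher accuracy. A lightweight model generates only the first token of each reasoning step, and the step is routed to a larger model only when the entropy of that token exceeds a threshold, which avoids unnecessary computation and overhead.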
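The routing rule described above can be sketched in a few lines: compute the Shannon entropy of the small model's next-token distribution at the start of a reasoning step, and send the step to the large model only when that entropy exceeds a threshold. This is a minimal illustration rather than the authors' implementation; the function names, the toy distributions, and the threshold of 1.0 nats are assumptions that do not come from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_step(first_token_probs, threshold):
    """Pick "large" when the first-token entropy exceeds the threshold, else "small"."""
    return "large" if token_entropy(first_token_probs) > threshold else "small"


# A peaked distribution (confident first token) versus a flat one (uncertain).
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
THRESHOLD = 1.0  # hypothetical value, not taken from the paper

print(route_step(confident, THRESHOLD))
print(route_step(uncertain, THRESHOLD))
```

Because only one token is generated by the small model before the decision is made, the routing overhead per step stays small compared with approaches that verify a full step after the fact.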
Instruction Tuning with and without Context: Behavioral Shifts and Downstream Impact
Authors: Hyunji Lee, Seunghyun Yoon, Yunjae Won, Hanseok Oh, Geewook Kim, Trung Bui, Franck Dernoncourt, Elias Stengel-Eskin, Mohit Bansal, Minjoon Seo
First: 2025-06-18T14:13:56+00:00 · Latest: 2026-01-08T16:32:25+00:00
Abstract
Instruction tuning is a widely used approach to improve the instruction-following ability of large language models (LLMs). Instruction-tuning datasets typically include a mixture of context-augmented and context-free examples, yet prior work has largely combined these data types without examining their distinct effects. In this paper, we investigate how training LLMs with or without context affects model behavior and downstream performance. First, in the text domain, we show that LLMs trained with context attend more strongly to the provided knowledge, achieving better grounding. We also observe that context-augmented training shifts how LLMs use knowledge: models rely less on parametric knowledge and instead depend more on the provided context. Second, we observe that using an LLM trained with context-augmented data as the backbone for vision-language models reduces hallucination and improves grounding in the visual domain. Finally, we explore practical strategies for real-world deployments where context availability varies. We show that maintaining separate context-augmented and context-free models and routing inputs between them yields more robust overall performance than training a single mixed model, as it better preserves their complementary strengths.
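Title / Abstract (translated from Chinese)
Title: Instruction Tuning with and without Context: Behavioral Shifts and Downstream Impact
Instruction tuning is a widely used approach for improving the instruction-following ability of large language models (LLMs). Instruction-tuning datasets typically mix context-augmented and context-free examples, yet prior work has mostly combined these data types without examining their separate effects. In this paper, we investigate how training LLMs with or without context affects model behavior and downstream performance. First, in the text domain, we show that LLMs trained with context attend more strongly to the provided knowledge and achieve better grounding. We also observe that context-augmented training changes how LLMs use knowledge: models store and draw on less parametric knowledge and depend more on the provided context. Second, we observe that using an LLM trained on context-augmented data as the backbone of a vision-language model reduces hallucination and improves grounding in the visual domain. Finally, we explore practical strategies for real-world deployments where context availability varies. We show that keeping separate context-augmented and context-free models and routing inputs between them yields more robust overall performance than training a single mixed model, because it better preserves their complementary strengths.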
Summary / 总结
This paper investigates the impact of training large language models (LLMs) with or without context on their instruction-following ability and downstream performance. The study finds that context-augmented training improves grounding and shifts model behavior to rely more on provided context rather than parametric knowledge. It also shows that using context-augmented LLMs as the backbone for vision-language models reduces hallucination and improves visual grounding. The research suggests maintaining separate context-augmented and context-free models for robust performance in varying context availability scenarios.
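This paper studies how training large language models (LLMs) with or without context affects their instruction-following ability and downstream performance. It finds that training with context improves grounding and changes how models use knowledge, reducing reliance on parametric knowledge and increasing reliance on the provided context. It also shows that using a context-trained LLM as the backbone of a vision-language model reduces hallucination and improves grounding. Finally, the paper recommends maintaining separate context-augmented and context-free models for more robust overall performance when context availability varies.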