More Images, More Problems? A Controlled Analysis of VLM Failure Modes
Authors: Anurag Das, Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Bernt Schiele, Georgios Tzimiropoulos, Brais Martinez
First: 2026-01-12T18:45:13+00:00 · Latest: 2026-01-12T18:45:13+00:00
Comments: 19 pages, 16 figures
Abstract
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. While existing benchmarks have initiated the evaluation of multi-image models, a comprehensive analysis of their core weaknesses and their causes is still lacking. In this work, we introduce MIMIC (Multi-Image Model Insights and Challenges), a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs. Using MIMIC, we conduct a series of diagnostic experiments that reveal pervasive issues: LVLMs often fail to aggregate information across images and struggle to track or attend to multiple concepts simultaneously. To address these failures, we propose two novel complementary remedies. On the data side, we present a procedural data-generation strategy that composes single-image annotations into rich, targeted multi-image training examples. On the optimization side, we analyze layer-wise attention patterns and derive an attention-masking scheme tailored for multi-image inputs. Experiments show that these remedies substantially improve cross-image aggregation while also enhancing performance on existing multi-image benchmarks, outperforming the prior state of the art across tasks. Data and code will be made available at https://github.com/anurag-198/MIMIC.
中文标题/摘要
标题:更多图像,更多问题?对VLM失败模式的受控分析
大型视觉语言模型(LVLMs)展示了显著的能力,但它们在理解和推理多个图像方面的熟练程度仍鲜有探索。尽管现有的基准测试已经启动了对多图像模型的评估,但对其核心弱点及其原因的全面分析仍然缺乏。在本文中,我们引入了MIMIC(多图像模型见解与挑战),这是一个新的基准,旨在严格评估LVLMs的多图像能力。使用MIMIC,我们进行了一系列诊断实验,揭示了普遍存在的问题:LVLMs经常无法在图像间汇总信息,并且难以同时跟踪或关注多个概念。为了解决这些失败,我们提出了两种新的互补补救措施。在数据方面,我们提出了一种过程化的数据生成策略,将单图像注释组合成丰富的、有针对性的多图像训练示例。在优化方面,我们分析了逐层的注意力模式,并推导出一种针对多图像输入的注意力掩蔽方案。实验显著提高了跨图像的聚合能力,同时也在现有的多图像基准测试中提高了性能,超越了先前的最先进水平。数据和代码将在https://github.com/anurag-198/MIMIC上提供。
Summary / 总结
This study addresses the limitations of Large Vision Language Models (LVLMs) in handling multiple images by introducing MIMIC, a new benchmark. Through diagnostic experiments, it reveals that LVLMs struggle to aggregate information across images and track multiple concepts simultaneously. The research proposes two solutions: a procedural data-generation strategy and an attention-masking scheme. These improvements significantly enhance cross-image aggregation and outperform previous state-of-the-art methods on multi-image benchmarks.
本文通过引入MIMIC新基准,探讨了大型视觉语言模型(LVLM)在处理多张图像时的局限性。通过诊断实验,研究发现LVLM难以在图像间聚合信息和同时跟踪多个概念。为解决这些问题,作者提出了两种方法:一种程序化数据生成策略和一种针对多图像输入的注意力掩码方案。这些方法显著改善了图像间的信息聚合,并在多图像基准测试中超越了先前的最先进模型。
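The abstract above mentions an attention-masking scheme for multi-image inputs but does not specify it. As a minimal sketch of what such a mask could look like, the code below assumes one plausible variant: a causal mask in which tokens of one image cannot attend to tokens of a different image, while text tokens see everything before them. The function name `build_image_block_mask` and the `segment_ids` encoding are hypothetical and are not taken from the paper.

```python
from typing import List


def build_image_block_mask(segment_ids: List[int]) -> List[List[bool]]:
    """Build a causal attention mask with per-image isolation.

    segment_ids[i] is 0 for a text token and k > 0 for a token of image k
    (an encoding assumed for this illustration). mask[q][k] is True when
    query position q may attend to key position k.
    """
    n = len(segment_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: only current and earlier positions
            q_seg, k_seg = segment_ids[q], segment_ids[k]
            # Assumed rule: an image token never attends to another image.
            if q_seg > 0 and k_seg > 0 and q_seg != k_seg:
                continue
            mask[q][k] = True
    return mask


# Two images of two tokens each, followed by one text token.
segment_ids = [1, 1, 2, 2, 0]
mask = build_image_block_mask(segment_ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this assumed rule, cross-image information can only be combined by the text tokens that follow the images, which is one way to make the aggregation step explicit.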
StarFlow: Generating Structured Workflow Outputs From Sketch Images
Authors: Patrice Bechard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian
First: 2025-03-27T18:04:05+00:00 · Latest: 2026-01-12T18:27:42+00:00
Comments: To be presented at EACL 2026
Abstract
Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams -- including synthetic, manually annotated, and real-world samples -- to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.
中文标题/摘要
标题:StarFlow:从草图图像生成结构化工作流输出
工作流是企业平台自动化中的基本组成部分,能够实现任务编排、数据处理和系统集成。尽管被广泛使用,但构建工作流往往复杂,通常需要通过低代码平台或可视化编程工具进行手动配置。为了简化这一过程,我们探索了使用生成基础模型,特别是视觉语言模型(VLMs),从视觉输入自动生成结构化工作流的方法。将手绘草图或计算机生成的图表转换为可执行的工作流具有挑战性,因为自由形式的绘制具有歧义性,图表风格存在差异,从视觉元素中推断执行逻辑也具有难度。为了解决这一问题,我们引入了StarFlow框架,用于使用视觉语言模型从草图生成结构化工作流输出。我们收集了多样化的流程图数据集,包括合成、手动标注和实际样本,以实现稳健的训练和评估。我们对多个视觉语言模型进行了微调和基准测试,并进行了一系列消融研究,以分析我们方法的优势和局限性。我们的结果显示,微调显著提高了结构化工作流生成的效果,在此任务上优于大型视觉语言模型。
Summary / 总结
The paper introduces StarFlow, a framework that uses vision-language models to generate structured workflow outputs from sketch images. The motivation is to simplify the process of creating workflows by automatically translating hand-drawn or computer-generated diagrams into executable workflows. Key experimental findings show that fine-tuning vision-language models significantly improves structured workflow generation, outperforming large models on this task.
研究旨在通过使用视觉语言模型从草图图像自动生成结构化的工作流来简化工作流的创建过程。引入了StarFlow框架来解决将模糊的视觉输入转换为可执行工作流的挑战。研究显示,通过大量基准测试和消融研究,微调视觉语言模型在该任务上优于大型模型,显著提高了结构化工作流的生成效果。
Vision-Language Model for Accurate Crater Detection
Authors: Patrick Bauer, Marius Schwinning, Florian Renk, Andreas Weinmann, Hichem Snoussi
First: 2026-01-12T18:08:17+00:00 · Latest: 2026-01-12T18:08:17+00:00
Abstract
The European Space Agency (ESA), driven by its ambitions for planned lunar missions with the Argonaut lander, has a profound interest in reliable crater detection, since craters pose a risk to safe lunar landings. This task is usually addressed with automated crater detection algorithms (CDA) based on deep learning techniques. It is non-trivial due to the vast number of craters of various sizes and shapes, as well as challenging conditions such as varying illumination and rugged terrain. Therefore, we propose a deep-learning CDA based on the OWLv2 model, which is built on a Vision Transformer that has proven highly effective in various computer vision tasks. For fine-tuning, we utilize a manually labeled dataset from the IMPACT project that provides crater annotations on high-resolution Lunar Reconnaissance Orbiter Camera Calibrated Data Record images. We insert trainable parameters using a parameter-efficient fine-tuning strategy with Low-Rank Adaptation, and optimize a combined loss function consisting of Complete Intersection over Union (CIoU) for localization and a contrastive loss for classification. We achieve satisfactory visual results, along with a maximum recall of 94.0% and a maximum precision of 73.1% on a test dataset from IMPACT. Our method achieves reliable crater detection across challenging lunar imaging conditions, paving the way for robust crater analysis in future lunar exploration.
中文标题/摘要
标题:用于精确撞击坑检测的视觉-语言模型
欧洲航天局(ESA),因其计划中的阿戈纳特着陆器月球任务而雄心勃勃,对可靠的撞击坑检测有着深刻的兴趣,因为撞击坑对安全的月球着陆构成风险。通常使用基于深度学习技术的自动撞击坑检测算法(CDA)来解决这一任务。由于存在各种大小和形状的撞击坑,以及光照条件和崎岖地形等挑战性条件,这是一项非平凡的任务。因此,我们提出了一种基于OWLv2模型的深度学习CDA,该模型基于视觉变换器,在各种计算机视觉任务中已被证明非常有效。为了微调,我们使用IMPACT项目提供的手动标注数据集,该数据集提供了高分辨率月球轨道器摄像机校准数据记录图像上的撞击坑注释。我们使用参数高效微调策略Low-Rank Adaptation插入可训练参数,并优化了一个由完整交并比(CIoU)用于定位和对比损失用于分类组成的联合损失函数。我们在IMPACT提供的测试数据集上实现了令人满意的视觉效果,最大召回率为94.0%,最大精度为73.1%。我们的方法在具有挑战性的月球成像条件下实现了可靠的撞击坑检测,为未来月球探索中的稳健撞击坑分析铺平了道路。
Summary / 总结
The research aims to develop an accurate crater detection algorithm for lunar missions, addressing the challenges of varying crater sizes, shapes, and imaging conditions. The method fine-tunes the OWLv2 model, which is built on a Vision Transformer, using a parameter-efficient Low-Rank Adaptation strategy and a combined loss function (CIoU for localization, contrastive loss for classification). The model achieves a maximum recall of 94.0% and a maximum precision of 73.1% on an IMPACT test dataset, demonstrating reliable crater detection under challenging lunar imaging conditions.
研究旨在开发一种准确的撞击坑检测算法,以应对撞击坑对月球着陆安全的挑战。方法采用经过微调的OWLv2模型,这是一种视觉变换器,使用参数高效的低秩适应策略和结合损失函数。该模型在挑战性的月球成像条件下实现了最高召回率94.0%和最高精确率73.1%,展示了可靠的撞击坑检测能力。
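The localization term named in the abstract, Complete Intersection over Union (CIoU), can be illustrated with a short sketch. The code below follows the standard CIoU definition (IoU minus a normalized center-distance penalty minus an aspect-ratio penalty) for a single pair of axis-aligned boxes. The `(x1, y1, x2, y2)` box format and the function name `ciou_loss` are assumptions for illustration; how the authors weight this term against the contrastive loss is not stated in the abstract.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def ciou_loss(pred: Box, target: Box, eps: float = 1e-9) -> float:
    """Return 1 - CIoU for two axis-aligned boxes."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Plain IoU from intersection and union areas.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)

    # Squared distance between box centers.
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    enclose_w = max(px2, tx2) - min(px1, tx1)
    enclose_h = max(py2, ty2) - min(py1, ty1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan((tx2 - tx1) / (ty2 - ty1 + eps))
        - math.atan((px2 - px1) / (py2 - py1 + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes, near 0
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))      # disjoint boxes, above 1
```

Unlike plain IoU, the center-distance term still gives a useful signal when the predicted and labeled boxes do not overlap at all, which matters for small craters.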
OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent
Authors: Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, Qingyun Li, Yian Wang, Yu Qiao, Zun Wang, Zichen Ding
First: 2026-01-12T17:55:51+00:00 · Latest: 2026-01-12T17:55:51+00:00
Comments: 31 pages, 11 figures, 12 tables
Abstract
While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.
中文标题/摘要
标题:OS-Symphony:一种全面的鲁棒且通用的计算机使用代理框架
尽管视觉语言模型(VLMs)显著推进了计算机使用代理(CUAs),但当前框架在长时序工作流程中的鲁棒性和新领域中的泛化能力存在局限。这些局限源于对历史视觉上下文编纂缺乏细粒度控制以及缺乏视觉感知的教程检索。为解决这些问题,我们提出了OS-Symphony,一种全面框架,包含一个协调两个关键创新以实现鲁棒自动化的核心协调器:(1)一个反思记忆代理,利用里程碑驱动的长期记忆实现轨迹级自我纠正,有效缓解长时序任务中的视觉上下文损失;(2)多功能工具代理,配备多模态搜索器,采用“看做-行动”(SeeAct)范式在基于浏览器的沙箱中导航以合成实时、视觉对齐的教程,从而解决未见过场景中的保真度问题。实验结果表明,OS-Symphony在不同模型规模下实现了显著的性能提升,建立了三个在线基准的新最先进结果,特别在OSWorld上达到65.84%。
Summary / 总结
The research addresses the limitations of current Vision-Language Models (VLMs) in Computer-Using Agents (CUAs) by introducing OS-Symphony, a holistic framework. It includes an Orchestrator that manages a Reflection-Memory Agent and Versatile Tool Agents. The Reflection-Memory Agent uses milestone-driven long-term memory for self-correction in long-horizon tasks, while Versatile Tool Agents create live, visually aligned tutorials using a SeeAct paradigm. Experiments show OS-Symphony outperforms existing models, achieving 65.84% on the OSWorld benchmark.
研究针对当前视觉语言模型在计算机使用代理中的局限性,引入了OS-Symphony这一整体框架。该框架包括一个协调反射记忆代理和多功能工具代理的协调器,后者采用SeeAct范式在浏览器沙盒中生成实时、视觉对齐的教程。实验结果显示,OS-Symphony在OSWorld基准测试中取得了65.84%的优异成绩,超越了先前的模型。
LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
Authors: Yang Miao, Jan-Nico Zaech, Xi Wang, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool
Venue: NeurIPS 2025
First: 2025-10-29T08:21:59+00:00 · Latest: 2026-01-12T17:46:52+00:00
Comments: 10 pages, 5 figures, 14 tables, NeurIPS 2025
Abstract
We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.
中文标题/摘要
标题:LangHOPS:基于语言的层次开放词汇部件分割
我们提出了LangHOPS,这是第一个基于多模态大型语言模型(MLLM)的开放词汇对象部件实例分割框架。给定一张图像,LangHOPS 可以从开放词汇候选类别中联合检测和分割层次化对象和部件实例。与依赖启发式或可学习视觉分组的先前方法不同,我们的方法将对象部件层次结构扎根于语言空间。它将 MLLM 集成到对象部件解析管道中,利用其丰富的知识和推理能力,并在层次结构内链接多粒度概念。我们在多个具有挑战性的场景中评估了 LangHOPS,包括同域和跨数据集对象部件实例分割以及零样本语义分割。LangHOPS 达到了最先进的结果,在 PartImageNet 数据集上超越了先前方法 5.5% 的平均精度(AP)(同域)和 4.8%(跨数据集),以及在 ADE20K 中未见过的对象部件上达到了 2.5% 的 mIOU(零样本)。消融研究进一步验证了语言扎根层次结构和 MLLM 驱动部件查询精炼策略的有效性。代码将在此发布。
Summary / 总结
LangHOPS is a framework that uses a Multimodal Large Language Model to perform open-vocabulary object-part instance segmentation. It can detect and segment hierarchical object and part instances from various categories in an image. Unlike previous methods, LangHOPS grounds object-part hierarchies in language space and integrates the MLLM into the parsing pipeline to leverage its knowledge and reasoning capabilities. LangHOPS outperforms previous methods by 5.5% AP in-domain and 4.8% AP cross-dataset on PartImageNet, and by 2.5% mIOU on unseen object parts in ADE20K for zero-shot segmentation. Ablation studies confirm the effectiveness of the language-grounded hierarchy and the MLLM-driven part query refinement strategy.
LangHOPS 是一种使用多模态大型语言模型进行开放词汇对象部件实例分割的框架。它可以检测和分割图像中不同类别中的层次化对象和部件实例。与以往方法不同,LangHOPS 将对象部件层次结构置于语言空间,并将 MLLM 集成到解析管道中,利用其知识和推理能力。LangHOPS 在 PartImageNet 数据集上的室内和跨数据集 AP 分别优于先前方法 5.5% 和 4.8%,在 ADE20K 上对未见过的对象部件进行零样本分割时的 mIOU 优于 2.5%。消融研究进一步验证了语言导向的层次结构和部件查询精炼策略的有效性。
Video Evidence to Reasoning: Efficient Video Understanding via Explicit Evidence Grounding
Authors: Yanxiang Huang, Guohua Gao, Zhaoyang Wei, Jianyuan Ni
Venue: ICME 2026
First: 2026-01-12T17:46:10+00:00 · Latest: 2026-01-12T17:46:10+00:00
Comments: 6 pages
Abstract
Large Vision-Language Models (LVLMs) face a fundamental dilemma in video reasoning: they are caught between the prohibitive computational costs of verbose reasoning and the hallucination risks of efficient, ungrounded approaches. To resolve this, we introduce the Chain of Evidence (CoE), a novel framework that architecturally decouples and co-optimizes perceptual grounding and reasoning efficiency. CoE incorporates two core innovations: (1) A lightweight Evidence Grounding Module (EGM) that acts as a query-guided filter, dynamically identifying and extracting a compact set of high-fidelity visual evidence; and (2) An Evidence-Anchoring Protocol optimized via Reinforcement Learning. Crucially, we design a composite reward mechanism that enforces process alignment, compelling the model to strictly reference identified temporal anchors during deduction, thereby mitigating hallucinations. To enable this, we construct CoE-Instruct, a large-scale dataset (164k samples) featuring a novel dual-annotation schema for separate perception and reasoning supervision. Extensive experiments on five benchmarks, including Video-MME, MVBench, and VSI-Bench, demonstrate that CoE-enhanced models establish a new state-of-the-art. They significantly outperform existing methods in accuracy, proving CoE to be a powerful and practical paradigm for reliable video understanding.
中文标题/摘要
标题:视频证据到推理:通过明确的证据关联实现高效视频理解
大型视觉-语言模型(LVLMs)在视频推理中面临一个根本性的困境:它们在冗长推理的高昂计算成本和高效但未关联方法的幻觉风险之间徘徊。为了解决这一问题,我们引入了证据链(CoE),这是一种新颖的框架,通过架构解耦和联合优化感知关联和推理效率来解决这一问题。CoE 包含两个核心创新:(1)一种轻量级的证据关联模块(EGM),作为查询引导的过滤器,动态地识别并提取一组高保真视觉证据;(2)一种通过强化学习优化的证据锚定协议。关键的是,我们设计了一种复合奖励机制,以确保过程对齐,迫使模型在推理过程中严格参考已识别的时间锚点,从而减轻幻觉。为了实现这一点,我们构建了CoE-指令,这是一个大规模数据集(164,000 个样本),包含一种新的双注释方案,用于分别监督感知和推理。在五个基准测试上的广泛实验,包括Video-MME、MVBench 和 VSI-Bench,表明CoE增强的模型达到了新的最佳水平。它们在准确性上显著优于现有方法,证明CoE是一种强大且实用的框架,用于可靠的视频理解。
Summary / 总结
The paper addresses the challenge of efficient video understanding by introducing the Chain of Evidence (CoE) framework, which decouples perceptual grounding and reasoning efficiency. CoE includes a lightweight Evidence Grounding Module (EGM) that dynamically selects relevant visual evidence and an Evidence-Anchoring Protocol optimized via Reinforcement Learning to prevent hallucinations. Experiments on five benchmarks show that CoE-enhanced models outperform existing methods in accuracy, establishing a new state-of-the-art for video understanding.
本文通过引入Chain of Evidence (CoE)框架解决了高效视频理解的挑战,该框架将感知接地和推理效率分离。CoE 包含一个轻量级的Evidence Grounding Module (EGM),用于过滤视觉证据,以及通过强化学习优化的Evidence-Anchoring Protocol。复合奖励机制确保模型在推理时引用时间锚点,减少幻觉。在五个基准测试上的实验表明,CoE增强的模型在准确性上超越了现有方法,证明了其在可靠视频理解中的有效性和实用性。
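The Evidence Grounding Module is described above as a query-guided filter that extracts a compact set of visual evidence. A minimal sketch of that idea, assuming precomputed query and frame embeddings and a plain cosine-similarity top-k rule, is given below. The scoring rule, the embedding inputs, and the function names are assumptions for illustration; the abstract does not describe the module's actual architecture.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_evidence(query: Sequence[float],
                    frames: List[Sequence[float]],
                    k: int) -> List[int]:
    """Return the indices of the k frames most similar to the query.

    Indices are returned in temporal order so that they can serve as
    temporal anchors for a later reasoning step.
    """
    ranked = sorted(range(len(frames)),
                    key=lambda i: cosine(query, frames[i]),
                    reverse=True)
    return sorted(ranked[:k])


# Toy 2-D embeddings: frames 1 and 2 point in the query's direction.
query = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
anchors = select_evidence(query, frames, k=2)
print(anchors)
```

Keeping only a few anchor frames is what lets the reasoning stage stay short, and returning explicit indices gives a reward function something concrete to check references against.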
Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model
Authors: Siwen Jiao, Tianxiong Lv, Kangan Qian, Chenxu Zhao, Xiuyuan Zhu, Tianlun Li, Xiaolong Cheng, Jinyu Li, Zhihao Liao, Yang Cai
First: 2026-01-12T16:26:42+00:00 · Latest: 2026-01-12T16:26:42+00:00
Abstract
Vision-Language Models (VLMs) face a critical bottleneck in achieving precise numerical prediction for 3D scene understanding. Traditional reinforcement learning (RL) approaches, primarily based on relative ranking, often suffer from severe reward sparsity and gradient instability, failing to effectively exploit the verifiable signals provided by 3D physical constraints. Notably, in standard GRPO frameworks, relative normalization causes "near-miss" samples (characterized by small but non-zero errors) to suffer from advantage collapse. This leads to a severe data utilization bottleneck where valuable boundary samples are discarded during optimization. To address this, we introduce the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA employs a dynamically parameterized Sigmoid function to transform raw feedback into a dense, continuous reward continuum. Concurrently, AP-GRPO integrates absolute scalar gradients to mitigate the numerical information loss inherent in conventional relative-ranking mechanisms. By leveraging this approach, we constructed Numerical3D-50k, a dataset comprising 50,000 verifiable 3D subtasks. Empirical results indicate that AP-GRPO achieves performance parity with large-scale supervised methods while maintaining higher data efficiency, effectively activating latent 3D reasoning in VLMs without requiring architectural modifications.
中文标题/摘要
标题:平滑操作员:平滑可验证奖励激活视觉语言模型的空间推理能力
视觉语言模型(VLMs)在实现精确的数值预测以理解3D场景方面面临关键瓶颈。传统的强化学习(RL)方法主要基于相对排名,通常会遭受严重的奖励稀疏性和梯度不稳定性,无法有效利用由3D物理约束提供的可验证信号。值得注意的是,在标准的GRPO框架中,相对归一化导致“接近但未命中”的样本(特征为小但非零的误差)遭受优势坍塌。这导致在优化过程中有价值边界样本被丢弃的数据利用瓶颈。为解决这一问题,我们引入了平滑数值奖励激活(SNRA)操作和绝对保留GRPO(AP-GRPO)框架。SNRA采用动态参数化的Sigmoid函数将原始反馈转换为密集的连续奖励连续体。同时,AP-GRPO整合绝对标量梯度以减轻传统相对排名机制固有的数值信息损失。通过这种方法,我们构建了包含50,000个可验证3D子任务的数据集Numerical3D-50k。实验证明,AP-GRPO在性能上与大规模监督方法相当,同时保持更高的数据效率,有效激活了VLMs中的潜在3D推理能力,无需进行架构修改。
Summary / 总结
The research aims to enhance the precision of numerical predictions in 3D scene understanding for Vision-Language Models (VLMs) by addressing the issues of reward sparsity and gradient instability in traditional reinforcement learning. It introduces the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA transforms raw feedback into a dense reward continuum, while AP-GRPO integrates absolute gradients to preserve numerical information. These methods enable the VLMs to effectively utilize verifiable 3D subtasks, achieving performance comparable to large-scale supervised methods with higher data efficiency.
研究旨在通过解决传统强化学习中的奖励稀疏性和梯度不稳定性问题,提高视觉-语言模型(VLMs)在3D场景理解中的精确数值预测能力。研究引入了Smooth Numerical Reward Activation (SNRA) 操作和Absolute-Preserving GRPO (AP-GRPO) 框架。SNRA将原始反馈转换为密集的连续奖励,而AP-GRPO整合绝对梯度以保留数值信息。这些方法使得能够构建包含50,000个可验证3D子任务的Numerical3D-50k数据集,并展示了AP-GRPO在数据效率更高的情况下,能够与大规模监督方法达到同等性能,有效激活VLMs中的3D推理能力。
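To make the advantage-collapse argument concrete, the sketch below compares a thresholded binary reward with a sigmoid-shaped smooth reward under group-relative normalization. It rests on assumptions: the fixed `sharpness` parameter stands in for the paper's dynamically parameterized Sigmoid, the relative-error input is a guess at the raw feedback, and the function names are hypothetical. It does not implement AP-GRPO's absolute-gradient term.

```python
import math
from statistics import mean, pstdev
from typing import List


def binary_reward(pred: float, target: float, tol: float = 0.01) -> float:
    """Sparse reward: 1 only when the relative error is within tol."""
    return 1.0 if abs(pred - target) <= tol * abs(target) else 0.0


def smooth_reward(pred: float, target: float, sharpness: float = 5.0) -> float:
    """Dense reward in (0, 1]: equals 1 at zero error and decays smoothly.

    Uses 2 * sigmoid(-sharpness * relative_error); the exponent is clamped
    to avoid overflow for very large errors.
    """
    rel_err = abs(pred - target) / max(abs(target), 1e-9)
    return 2.0 / (1.0 + math.exp(min(sharpness * rel_err, 50.0)))


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative normalization: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


target = 2.0
near_misses = [2.1, 2.3, 2.6]  # all wrong, but by different margins

sparse = group_advantages([binary_reward(p, target) for p in near_misses])
dense = group_advantages([smooth_reward(p, target) for p in near_misses])
print(sparse)  # every entry is zero, so the group carries no gradient
print(dense)   # the closest prediction gets the largest advantage
```

With the binary reward, every sample in the group scores the same, so the normalized advantages vanish and the near-miss samples contribute nothing. The smooth reward keeps their ordering visible to the optimizer.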
Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference
Authors: Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao
First: 2026-01-12T15:47:35+00:00 · Latest: 2026-01-12T15:47:35+00:00
Comments: Source code is available at https://github.com/TANIGUCHIREI/ASL
Abstract
Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received remarkable attention. Among the numerous works proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in the KV cache and prune the others, are among the most popular schemes. They primarily adopt a set of pre-defined layers at which tokens are selected. Such a design is inflexible in the sense that the accuracy varies significantly across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user-specified KV budget requirement. ASL operates during the prefilling stage and can be jointly used with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. Through evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that, equipped with one-shot token selection, where tokens are selected at a layer and propagated to deeper layers, ASL outperforms state-of-the-art layer-wise token selection methods in accuracy while maintaining decoding speed and KV cache reduction.
中文标题/摘要
标题:适应性层选择在LLM推理中按层剪枝token
由于大型语言模型(LLMs)的普及,LLM推理中的键值(KV)缓存减少受到了显著关注。近年来,提出的各种方法中,按层剪枝方法是最受欢迎的方案之一,这些方法在特定层选择保留token,并剪枝其他token。这种设计在灵活性方面存在不足,因为其准确率在不同任务中差异显著,在如键值检索等较难任务中会下降。本文提出了一种无需训练的方法ASL,该方法利用注意力分数排序的token秩的方差,自适应地选择KV缓存减少的层。该方法在满足用户指定的KV预算要求的同时,平衡了不同任务的性能。ASL在预填充阶段运行,并可以与现有的KV缓存减少方法(如SnapKV)联合使用,以优化解码阶段。通过在InfiniteBench、RULER和NIAH基准上的评估,我们展示了ASL在准确率上优于最先进的按层剪枝方法,同时保持解码速度和KV缓存减少。
Summary / 总结
This paper addresses the issue of key-value (KV) cache reduction in large language model (LLM) inference by proposing ASL, an adaptive layer selection method. Unlike existing layer-wise token pruning approaches that use pre-defined layers, ASL dynamically selects the layer for token selection based on the variance of token ranks ordered by attention score. This method improves performance across various tasks while adhering to user-specified KV budget requirements. Experimental results on InfiniteBench, RULER, and NIAH benchmarks demonstrate that ASL outperforms state-of-the-art methods in accuracy while maintaining decoding speed and KV cache reduction.
本文提出了一种名为ASL的自适应层选择方法,以解决大型语言模型(LLM)推理中的关键值(KV)缓存减少问题。不同于现有基于预定义层的层内令牌剪枝方法,ASL 根据注意力分数排序下的令牌排名方差动态选择剪枝层。该方法在不同任务上提高了性能,同时遵守用户指定的KV预算。实验结果表明,ASL 在 InfiniteBench、RULER 和 NIAH 基准上的准确率优于最先进的方法,同时保持了解码速度和KV缓存减少。
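The abstract describes two ingredients: picking the selection layer from the variance of attention-ordered token ranks, and one-shot token selection under a KV budget. The sketch below is only a loose illustration of both. The concrete criterion used here (choose the first layer at which the variance of rank shifts relative to the previous layer falls below a threshold) is an assumption, not ASL's actual rule, and the per-layer score lists stand in for real attention statistics.

```python
from statistics import pvariance
from typing import List


def token_ranks(scores: List[float]) -> List[int]:
    """Rank tokens by attention score; rank 0 is the highest-scoring token."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, token in enumerate(order):
        ranks[token] = rank
    return ranks


def choose_layer(layer_scores: List[List[float]], threshold: float) -> int:
    """Hypothetical criterion: first layer whose token ranks have stabilized.

    Stability is measured as the variance of per-token rank shifts between
    consecutive layers. Falls back to the last layer if none qualifies.
    """
    prev = token_ranks(layer_scores[0])
    for layer in range(1, len(layer_scores)):
        cur = token_ranks(layer_scores[layer])
        shift_variance = pvariance([c - p for c, p in zip(cur, prev)])
        if shift_variance <= threshold:
            return layer
        prev = cur
    return len(layer_scores) - 1


def select_tokens(scores: List[float], budget: int) -> List[int]:
    """One-shot selection: keep the `budget` highest-scoring token positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:budget])


# Toy per-layer attention scores for four tokens.
layer_scores = [
    [0.10, 0.40, 0.30, 0.20],
    [0.40, 0.10, 0.20, 0.30],
    [0.50, 0.10, 0.15, 0.25],
]
layer = choose_layer(layer_scores, threshold=0.5)
kept = select_tokens(layer_scores[layer], budget=2)
print(layer, kept)
```

In a real system the kept positions would then be propagated to all deeper layers, so the KV cache for those layers only stores the selected tokens.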
SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians
Authors: Siyun Liang, Sen Wang, Kunyi Li, Michael Niemeyer, Stefano Gasperini, Nassir Navab, Federico Tombari
First: 2024-12-13T16:01:19+00:00 · Latest: 2026-01-12T14:51:22+00:00
Comments: 13 pages, 8 figures
Abstract
3D Gaussian Splatting has recently gained traction for its efficient training and real-time rendering. While the vanilla Gaussian Splatting representation is mainly designed for view synthesis, more recent works have investigated how to extend it with scene understanding and language features. However, existing methods lack a detailed comprehension of scenes, limiting their ability to segment and interpret complex structures. To this end, we introduce SuperGSeg, a novel approach that fosters cohesive, context-aware scene representation by disentangling segmentation and language field distillation. SuperGSeg first employs neural Gaussians to learn instance and hierarchical segmentation features from multi-view images with the aid of off-the-shelf 2D masks. These features are then leveraged to create a sparse set of what we call Super-Gaussians. Super-Gaussians facilitate the distillation of 2D language features into 3D space. Through Super-Gaussians, our method enables high-dimensional language feature rendering without extreme increases in GPU memory. Extensive experiments demonstrate that SuperGSeg outperforms prior works on both open-vocabulary object localization and semantic segmentation tasks.
中文标题/摘要
标题:SuperGSeg:开放式词汇3D分割的结构化超高斯方法
3D 高斯点积近年来因其高效的训练和实时渲染而受到关注。虽然传统的高斯点积表示主要设计用于视图合成,但最近的研究探讨了如何通过场景理解和语言特征来扩展它。然而,现有方法对场景的理解不够详细,限制了它们对复杂结构进行分割和解释的能力。为此,我们提出了SuperGSeg,这是一种新颖的方法,通过分离分割和语言场提炼来促进连贯的、上下文感知的场景表示。SuperGSeg 首先利用神经高斯函数从多视图图像中学习实例和层次分割特征,并借助现成的2D掩码。这些特征随后被用来创建一组我们称之为超高斯的稀疏集。超高斯使2D语言特征向3D空间的提炼成为可能。通过超高斯,我们的方法能够在不极端增加GPU内存的情况下实现高维语言特征渲染。广泛的实验表明,SuperGSeg 在开放词汇对象定位和语义分割任务上均优于先前的工作。
Summary / 总结
SuperGSeg is designed to enhance 3D segmentation by integrating structured Super-Gaussians with language features. It uses neural Gaussians to learn segmentation and hierarchical features from multi-view images and then distills 2D language features into 3D space through Super-Gaussians. Experiments show that SuperGSeg outperforms previous methods in open-vocabulary object localization and semantic segmentation tasks.
SuperGSeg通过分离分割和语言场提炼,使用神经高斯和超级高斯来引入一种新的3D分割方法。它利用多视图图像和2D掩码来学习实例和层次分割特征,然后使用这些特征创建超级高斯,以高效地在3D空间中渲染高维语言特征。实验表明,SuperGSeg在开放词汇对象定位和语义分割任务中优于先前的方法。
LLMs Enable Bag-of-Texts Representations for Short-Text Clustering
Authors: I-Fan Lin, Faegheh Hasibi, Suzan Verberne
First: 2025-10-08T08:05:39+00:00 · Latest: 2026-01-12T14:39:21+00:00
Abstract
In this paper, we propose a training-free method for unsupervised short text clustering that relies less on careful selection of embedders than other methods. In customer-facing chatbots, companies are dealing with large amounts of user utterances that need to be clustered according to their intent. In these settings, no labeled data is typically available, and the number of clusters is not known. Recent approaches to short-text clustering in label-free settings incorporate LLM output to refine existing embeddings. While LLMs can identify similar texts effectively, the resulting similarities may not be directly represented by distances in the dense vector space, as they depend on the original embedding. We therefore propose a method for transforming LLM judgments directly into a bag-of-texts representation in which texts are initialized to be equidistant, without assuming any prior distance relationships. Our method achieves comparable or superior results to state-of-the-art methods, but without embeddings optimization or assuming prior knowledge of clusters or labels. Experiments on diverse datasets and smaller LLMs show that our method is model agnostic and can be applied to any embedder, with relatively small LLMs, and different clustering methods. We also show how our method scales to large datasets, reducing the computational cost of the LLM use. The flexibility and scalability of our method make it more aligned with real-world training-free scenarios than existing clustering methods.
中文标题/摘要
标题:LLMs 使短文本聚类具备文本包表示
在本文中,我们提出了一种无需训练的无监督短文本聚类方法,与其他方法相比,这种方法对嵌入器的选择要求较低。在面向客户的聊天机器人中,公司需要处理大量用户陈述,这些陈述需要根据其意图进行聚类。在这种情况下,通常没有标注数据,聚类的数量也不确定。在无标签设置下的短文本聚类中,最近的方法将LLM输出纳入现有嵌入的细化。虽然LLM可以有效识别相似文本,但由此产生的相似性可能不会直接由密集向量空间中的距离表示,因为它们依赖于原始嵌入。因此,我们提出了一种直接将LLM判断转换为文本包表示的方法,在这种表示中,文本初始化为等距,无需假设任何先前的距离关系。我们的方法在与最先进的方法相当或更优的同时,无需嵌入优化或假设任何先验的聚类或标签知识。在多样化的数据集和较小的LLM上的实验表明,我们的方法是模型无关的,可以应用于任何嵌入器,使用相对较小的LLM和不同的聚类方法。我们还展示了我们的方法如何扩展到大数据集,从而降低LLM使用带来的计算成本。我们的方法的灵活性和可扩展性使其更符合现实世界的无监督训练场景,优于现有的聚类方法。
Summary / 总结
This paper introduces a training-free method for unsupervised short-text clustering that leverages LLM judgments to create a bag-of-texts representation. The method initializes texts to be equidistant without assuming any prior distance relationships, avoiding the need for embedding optimization or labeled data. Experiments demonstrate that this approach achieves comparable or superior results to state-of-the-art methods, while being model agnostic and scalable to large datasets.
本文提出了一种无需训练的方法,利用LLM判断来创建文本包表示,避免了精心选择嵌入和标注数据的需要。该方法将文本初始化为等距,并将LLM判断转换为无需假设先前距离关系的表示。实验表明,该方法在无需优化嵌入或假设聚类或标签先验知识的情况下,能达到与最先进的方法相当或更优的结果,并且能够很好地应用于大型数据集,使用较小的LLM和不同的聚类方法。
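The bag-of-texts idea can be sketched in a few lines: every text starts as its own one-hot dimension, so all pairs are equidistant, and pairwise "same intent" judgments then pull texts together. The update rule below (set the shared entries to 1 for each judged-similar pair) and the function names are assumptions for illustration. In practice the pairs would come from LLM judgments, which are represented here by a hard-coded list.

```python
import math
from typing import List, Sequence, Tuple


def bag_of_texts(n_texts: int,
                 similar_pairs: Sequence[Tuple[int, int]]) -> List[List[float]]:
    """Build an n x n representation from pairwise similarity judgments.

    Row i starts as the one-hot vector e_i, so no prior distance structure
    is assumed. For each judged-similar pair (i, j), text i gains dimension
    j and text j gains dimension i (an assumed update rule).
    """
    reps = [[1.0 if i == j else 0.0 for j in range(n_texts)]
            for i in range(n_texts)]
    for i, j in similar_pairs:
        reps[i][j] = 1.0
        reps[j][i] = 1.0
    return reps


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Before any judgments, all four texts are equidistant.
initial = bag_of_texts(4, [])
print(euclidean(initial[0], initial[1]), euclidean(initial[0], initial[2]))

# Hypothetical judgment: texts 0 and 1 express the same intent.
reps = bag_of_texts(4, [(0, 1)])
print(euclidean(reps[0], reps[1]), euclidean(reps[0], reps[2]))
```

The resulting rows can be passed to any standard clustering method, which matches the abstract's claim that the representation does not depend on a particular embedder.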
Beyond Entangled Planning: Task-Decoupled Planning for Long-Horizon Agents
Authors: Yunfan Li, Bingbing Xu, Xueyun Tian, Xiucheng Xu, Huawei Shen
First: 2026-01-12T14:30:10+00:00 · Latest: 2026-01-12T14:30:10+00:00
Abstract
Recent advances in large language models (LLMs) have enabled agents to autonomously execute complex, long-horizon tasks, yet planning remains a primary bottleneck for reliable task execution. Existing methods typically fall into two paradigms: step-wise planning, which is reactive but often short-sighted; and one-shot planning, which generates a complete plan upfront yet is brittle to execution errors. Crucially, both paradigms suffer from entangled contexts, where the agent must reason over a monolithic history spanning multiple sub-tasks. This entanglement increases cognitive load and lets local errors propagate across otherwise independent decisions, making recovery computationally expensive. To address this, we propose Task-Decoupled Planning (TDP), a training-free framework that replaces entangled reasoning with task decoupling. TDP decomposes tasks into a directed acyclic graph (DAG) of sub-goals via a Supervisor. Using a Planner and Executor with scoped contexts, TDP confines reasoning and replanning to the active sub-task. This isolation prevents error propagation and corrects deviations locally without disrupting the workflow. Results on TravelPlanner, ScienceWorld, and HotpotQA show that TDP outperforms strong baselines while reducing token consumption by up to 82%, demonstrating that sub-task decoupling improves both robustness and efficiency for long-horizon agents.
中文标题/摘要
标题:超越纠缠规划:长时程代理的任务解耦规划
大型语言模型(LLMs)的最新进展使代理能够自主执行复杂的长时程任务,但规划仍然是可靠任务执行的主要瓶颈。现有方法通常分为两种范式:逐步规划,这是一种反应性但往往目光短浅的方法;一次性规划,这种方法会一次性生成完整的计划,但对执行错误的鲁棒性较差。关键的是,这两种方法都存在纠缠的上下文问题,即代理必须对跨越多个子任务的单一历史进行推理。这种纠缠增加了认知负担,并使局部错误在原本独立的决策之间传播,从而使得恢复计算成本高昂。为了解决这个问题,我们提出了任务解耦规划(TDP),这是一种无需训练的框架,它用任务解耦替代了纠缠的推理。TDP 通过一个监督者将任务分解为子目标的有向无环图(DAG)。使用具有限定上下文的规划器和执行器,TDP 将推理和重新规划限制在当前活动的子任务中。这种隔离防止了错误传播,并且可以在不中断工作流程的情况下局部纠正偏差。在 TravelPlanner、ScienceWorld 和 HotpotQA 上的结果表明,TDP 在性能上优于强大的基线模型,同时将令牌消耗减少了高达 82%,这表明子任务解耦可以提高长时程代理的鲁棒性和效率。
Summary / 总结
The paper addresses the challenge of planning for long-horizon tasks by proposing Task-Decoupled Planning (TDP), which replaces entangled reasoning with task decoupling. TDP decomposes tasks into a directed acyclic graph of sub-goals and confines reasoning and replanning to the active sub-task, thereby preventing error propagation. Experimental results on TravelPlanner, ScienceWorld, and HotpotQA show that TDP outperforms strong baselines and reduces token consumption by up to 82%.
论文提出了一种任务解耦规划(TDP)方法,通过将任务分解为有向无环图中的子目标来解决长期任务规划的挑战。TDP 使用一个监督者来分解任务,并使用具有限定上下文的规划器和执行器来独立处理子任务,从而减少认知负担和错误传播。实验结果表明,TDP 在 TravelPlanner、ScienceWorld 和 HotpotQA 上的表现优于现有方法,并且显著减少了令牌消耗。
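The core mechanism described above, a DAG of sub-goals executed with scoped contexts, can be sketched with Python's standard `graphlib` module (Python 3.9+). In this illustration the DAG is hand-written rather than produced by a Supervisor, and `execute` is a hypothetical callback standing in for the Planner and Executor. The point is only that each sub-goal sees the results of its direct dependencies and nothing else.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_decoupled(dag: Dict[str, Set[str]],
                  execute: Callable[[str, Dict[str, str]], str]) -> Dict[str, str]:
    """Run sub-goals in dependency order with a scoped context for each.

    dag maps each sub-goal to the set of sub-goals it depends on.
    execute(goal, context) receives only the results of that goal's direct
    dependencies, never the full history of the run.
    """
    results: Dict[str, str] = {}
    for goal in TopologicalSorter(dag).static_order():
        scoped_context = {dep: results[dep] for dep in dag.get(goal, set())}
        results[goal] = execute(goal, scoped_context)
    return results


# Hypothetical travel-planning decomposition.
dag = {
    "book_flight": set(),
    "book_hotel": set(),
    "build_itinerary": {"book_flight", "book_hotel"},
}

seen_context: Dict[str, Set[str]] = {}


def execute(goal: str, context: Dict[str, str]) -> str:
    # Record which earlier results were visible to this sub-goal.
    seen_context[goal] = set(context)
    return f"done:{goal}"


results = run_decoupled(dag, execute)
print(results["build_itinerary"], sorted(seen_context["build_itinerary"]))
```

Because a failed sub-goal can be re-run with the same small context, replanning stays local. This is consistent with the reported reduction in token consumption, although the actual savings depend on the framework's prompts.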
VirtualEnv: A Platform for Embodied AI Research
Authors: Kabir Swain, Sijie Han, Ayush Raina, Jin Zhang, Shuang Li, Michael Stopa, Antonio Torralba
First: 2026-01-12T14:04:38+00:00 · Latest: 2026-01-12T14:04:38+00:00
Abstract
As large language models (LLMs) continue to improve in reasoning and decision-making, there is a growing need for realistic and interactive environments where their abilities can be rigorously evaluated. We present VirtualEnv, a next-generation simulation platform built on Unreal Engine 5 that enables fine-grained benchmarking of LLMs in embodied and interactive scenarios. VirtualEnv supports rich agent-environment interactions, including object manipulation, navigation, and adaptive multi-agent collaboration, as well as game-inspired mechanics like escape rooms and procedurally generated environments. We provide a user-friendly API built on top of Unreal Engine, allowing researchers to deploy and control LLM-driven agents using natural language instructions. We integrate large-scale LLMs and vision-language models (VLMs), such as GPT-based models, to generate novel environments and structured tasks from multimodal inputs. Our experiments benchmark the performance of several popular LLMs across tasks of increasing complexity, analyzing differences in adaptability, planning, and multi-agent coordination. We also describe our methodology for procedural task generation, task validation, and real-time environment control. VirtualEnv is released as an open-source platform. With it, we aim to advance research at the intersection of AI and gaming, enable standardized evaluation of LLMs in embodied AI settings, and pave the way for future developments in immersive simulations and interactive entertainment.
中文标题/摘要
标题:VirtualEnv:一种具身AI研究平台
随着大型语言模型(LLMs)在推理和决策方面不断改进,对能够真实且互动地评估其能力的环境的需求也在增长。我们介绍了VirtualEnv,这是一个基于Unreal Engine 5的下一代模拟平台,它能够精细地在具身和互动场景中对LLMs进行基准测试。VirtualEnv 支持丰富的代理-环境交互,包括物体操作、导航和自适应多代理协作,以及像逃脱房间和程序生成环境等游戏启发的机制。我们提供了一个基于Unreal Engine的用户友好型API,允许研究人员使用自然语言指令部署和控制LLM驱动的代理。我们整合了大规模LLMs和视觉-语言模型(VLMs),如基于GPT的模型,以从多模态输入中生成新颖的环境和结构化任务。我们的实验在任务复杂度递增的情况下对几种流行的LLMs进行了基准测试,分析了它们在适应性、规划和多代理协调方面的差异。我们还描述了我们的程序化任务生成、任务验证和实时环境控制方法。VirtualEnv 作为开源平台发布,我们旨在推动AI与游戏交叉领域的研究,使LLMs在具身AI设置中的标准化评估成为可能,并为未来沉浸式模拟和互动娱乐的发展铺平道路。
Summary / 总结
VirtualEnv is a simulation platform built on Unreal Engine 5 designed to evaluate the reasoning and decision-making abilities of large language models (LLMs) in embodied and interactive scenarios. It supports rich interactions such as object manipulation, navigation, and multi-agent collaboration, and integrates LLMs and vision-language models to generate novel environments and tasks. Experiments show that LLMs vary in adaptability, planning, and multi-agent coordination across different tasks. The platform is open-source and aims to standardize the evaluation of LLMs in embodied AI settings and advance research in immersive simulations and interactive entertainment.
VirtualEnv 是一个基于 Unreal Engine 5 的模拟平台,旨在严格评估大型语言模型(LLMs)在具身和互动场景中的推理和决策能力。它支持丰富的交互,如物体操作、导航和多智能体协作,并结合 LLM 和视觉语言模型生成新的环境和任务。该平台对几种流行的 LLM 在复杂度递增的任务中的表现进行了基准测试,突显了适应性、规划和多智能体协调方面的差异。VirtualEnv 作为开源工具发布,旨在推动 AI 和游戏领域的研究,使 LLM 在具身 AI 设置中的标准化评估成为可能。
Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions
Authors: Yongqi Li, Hao Lang, Tieyun Qian, Yongbin Li
First: 2026-01-12T13:13:24+00:00 · Latest: 2026-01-12T13:13:24+00:00
Abstract
Vision-language models are increasingly employed as multimodal conversational agents (MCAs) for diverse conversational tasks. Recently, reinforcement learning (RL) has been widely explored for adapting MCAs to various human-AI interaction scenarios. Despite showing great enhancement in generalization performance, fine-tuning MCAs via RL still faces challenges in handling the extremely large text token space. To address this, we learn a compact latent action space for RL fine-tuning instead. Specifically, we adopt the learning from observation mechanism to construct the codebook for the latent action space, where future observations are leveraged to estimate current latent actions that could further be used to reconstruct future observations. However, the scarcity of paired image-text data hinders learning a codebook with sufficient coverage. Thus, we leverage both paired image-text data and text-only data to construct the latent action space, using a cross-modal projector for transforming text embeddings into image-text embeddings. We initialize the cross-modal projector on paired image-text data, and further train it on massive text-only data with a novel cycle consistency loss to enhance its robustness. We show that our latent action based method outperforms competitive baselines on two conversation tasks across various RL algorithms.
中文标题/摘要
标题:使用覆盖增强潜在动作控制多模态对话代理
视觉-语言模型越来越多地被用作多模态对话代理(MCAs)以执行各种对话任务。最近,强化学习(RL)被广泛探索以使MCAs适应各种人机交互场景。尽管在泛化性能上表现出色,但通过RL微调MCAs仍然面临处理极其庞大的文本标记空间的挑战。为了解决这个问题,我们学习了一个紧凑的潜在动作空间来进行RL微调。具体来说,我们采用从观察中学习的机制来构建潜在动作空间的码本,其中未来的观察用于估计当前潜在动作,这些潜在动作可以进一步用于重建未来的观察。然而,成对的图像-文本数据的稀缺性阻碍了学习具有足够覆盖范围的码本。因此,我们利用成对的图像-文本数据和仅文本数据来构建潜在动作空间,使用跨模态投影器将文本嵌入转换为图像-文本嵌入。我们使用成对的图像-文本数据初始化跨模态投影器,并使用一种新颖的循环一致性损失在大量仅文本数据上进一步训练它,以增强其鲁棒性。我们展示了我们的基于潜在动作的方法在各种RL算法的两个对话任务上优于竞争基线。
Summary / 总结
The paper addresses the challenge of fine-tuning multimodal conversational agents (MCAs) using reinforcement learning (RL) by learning a compact latent action space. This is achieved by leveraging both paired image-text data and text-only data through a cross-modal projector, which is trained with a cycle consistency loss to enhance robustness. The method outperforms competitive baselines on two conversation tasks across various RL algorithms.
论文通过学习一个紧凑的潜在动作空间来解决使用强化学习(RL)微调多模态对话代理(MCAs)的挑战。这通过利用配对的图像-文本数据和文本-only数据中的交叉模态投影器实现,并通过循环一致性损失进一步训练以增强其鲁棒性。该方法在各种RL算法下,在两个对话任务上优于竞争基线。
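To make the idea of a compact latent action space concrete, here is a minimal sketch of a codebook lookup. It assumes a generic nearest-neighbour vector quantization step, which may differ from the paper's actual construction. The codebook values and the `nearest_code` helper are hypothetical.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


# Toy codebook with three latent actions in a 2-D embedding space.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# A continuous embedding is mapped to one discrete latent action.
latent_action = nearest_code([0.9, 0.2], codebook)
```

Under this reading, an RL policy would choose among `len(codebook)` discrete actions rather than over the full text token vocabulary.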
CaTS-Bench: Can Language Models Describe Time Series?
Authors: Luca Zhou, Pratham Yashwante, Marshall Fisher, Alessio Sampieri, Zihao Zhou, Fabio Galasso, Rose Yu
First: 2025-09-25T07:10:03+00:00 · Latest: 2026-01-12T13:01:45+00:00
Comments: 8 pages, 6 figures, 3 tables in the main paper. Many more in the appendix
Abstract
Time series captioning, the task of describing time series in natural language, requires numeric and temporal reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on fully synthetic or generic captions, and typically neglect metadata and visual representations. We introduce CaTS-Bench, a comprehensive benchmark for Context-aware Time Series reasoning across 11 diverse domains, centered on a gold-standard evaluation set of 1746 human-rewritten captions that measure how effectively models translate numeric trends into immediately interpretable narratives. To address the scarcity of human-annotated data, we also propose a scalable pipeline for generating high-fidelity synthetic captions, the quality of which we validate. We evaluate leading Vision-Language Models on our benchmark, revealing that even proprietary models struggle to capture numeric nuances in temporal descriptions, while finetuning open-source models on synthetic data yields substantial performance gains. Finally, we release a diagnostic suite of 910 multiple-choice questions and tailored numeric metrics to gauge time-series-specific reasoning capabilities, establishing CaTS-Bench as a reliable foundation for grounded, multimodal language generation in numeric domains.
中文标题/摘要
标题:CaTS-Bench:语言模型能否描述时间序列?
时间序列描述是将时间序列用自然语言描述的任务,需要数值和时间推理、趋势解释和上下文理解。然而,现有的基准测试往往依赖于完全合成或通用的描述,通常忽略了元数据和视觉表示。我们引入了CaTS-Bench,这是一个涵盖11个不同领域的全面基准测试,围绕一个包含1746个人类重写描述的标准评估集,该集衡量模型如何将数值趋势转化为易于理解的叙述。为了解决人类标注数据稀缺的问题,我们还提出了一种可扩展的生成高保真合成描述的管道,并验证了其质量。我们在基准测试上评估了领先的视觉语言模型,发现即使是专有模型也难以捕捉时间描述中的数值细微差别,而使用合成数据微调开源模型则能显著提高性能。最后,我们发布了一个包含910个选择题和定制化数值指标的诊断套件,以评估时间序列特定的推理能力,使CaTS-Bench成为可靠的基础,用于数值领域中的接地多模态语言生成。
Safe Vision-Language Models via Unsafe Weights Manipulation
Authors: Moreno D'Incà, Elia Peruzzo, Xingqian Xu, Humphrey Shi, Nicu Sebe, Massimiliano Mancini
Venue: WACV 2026
First: 2025-03-14T17:00:22+00:00 · Latest: 2026-01-12T11:44:30+00:00
Comments: WACV 2026
Abstract
Vision-language models (VLMs) often inherit the biases and unsafe associations present within their large-scale training dataset. While recent approaches mitigate unsafe behaviors, their evaluation focuses on how safe the model is on unsafe inputs, ignoring potential shortcomings on safe ones. In this paper, we first revise safety evaluation by introducing SafeGround, a new set of metrics that evaluate safety at different levels of granularity. With these metrics, we uncover a surprising issue of training-based methods: they make the model less safe on safe inputs. From this finding, we take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM). UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter. Their values are then manipulated via negation. Experiments show that UWM achieves the best tradeoff between safety and knowledge preservation, consistently improving VLMs on unsafe queries while outperforming even training-based state-of-the-art methods on safe ones.
中文标题/摘要
标题:通过不安全权重操纵实现安全的视觉-语言模型
视觉-语言模型(VLMs)通常继承了其大规模训练数据集中存在的偏见和不安全关联。虽然最近的方法减轻了不安全行为,但它们的评估主要关注模型在不安全输入上的安全性,而忽略了在安全输入上的潜在不足。在本文中,我们首先通过引入SafeGround,一种新的评估指标来修订安全性评估,该指标在不同粒度级别上评估安全性。借助此指标,我们揭示了一个令人惊讶的问题:基于训练的方法会使模型在安全输入上变得更不安全。从这一发现出发,我们采取了不同的方向,探索是否可以在不训练的情况下使模型更安全,引入了不安全权重操纵(UWM)。UWM 使用一组安全和不安全的校准实例,比较安全和不安全内容之间的激活,识别处理后者最重要的参数。然后通过否定操作调整这些参数的值。实验表明,UWM 在不安全查询上实现了安全性和知识保留的最佳权衡,同时在安全输入上甚至超过了基于训练的最先进的方法。
Summary / 总结
The paper addresses the issue of safety in vision-language models (VLMs) by revising safety evaluation metrics and introducing a new method called Unsafe Weights Manipulation (UWM). The authors find that training-based methods make models less safe on safe inputs and propose UWM, which uses a calibration set to identify and manipulate parameters for safer processing of unsafe content. Experiments show that UWM improves safety on unsafe queries and outperforms training-based methods on safe inputs, achieving the best tradeoff between safety and knowledge preservation.
本文通过引入SafeGround,一种新的安全评估指标,以不同粒度评估安全性,解决了视觉-语言模型(VLMs)的安全性问题。作者发现基于训练的方法会使模型在处理安全输入时变得更不安全。他们提出了Unsafe Weights Manipulation (UWM),通过使用校准集来识别和调整处理不安全内容的关键参数,从而提高安全性而不牺牲知识保留。UWM 在安全和不安全查询上都优于基于训练的方法,实现了安全性与知识保留的最佳平衡。
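The calibrate, score, and negate loop described above can be sketched on a single toy linear layer. The scoring rule used here (how much more a weight contributes on unsafe inputs than on safe ones) is an assumption made for illustration and is not claimed to be the paper's criterion. All names and numbers are hypothetical.

```python
def mean_inputs(batch):
    """Column-wise mean of a list of input vectors."""
    n = len(batch)
    return [sum(col) / n for col in zip(*batch)]


def negate_unsafe_weights(weights, safe_inputs, unsafe_inputs, k):
    """Return a copy of `weights` with the k highest-scoring entries negated."""
    safe_mu = mean_inputs(safe_inputs)
    unsafe_mu = mean_inputs(unsafe_inputs)
    scored = []
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            # Hypothetical score: extra contribution magnitude on unsafe inputs.
            score = abs(w * unsafe_mu[j]) - abs(w * safe_mu[j])
            scored.append((score, i, j))
    scored.sort(reverse=True)
    edited = [row[:] for row in weights]  # leave the original weights intact
    for _, i, j in scored[:k]:
        edited[i][j] = -edited[i][j]  # manipulation via negation, no training
    return edited


# Toy calibration set: two safe and two unsafe input vectors.
weights = [[0.5, -2.0], [1.0, 0.1]]
safe_inputs = [[1.0, 0.0], [0.8, 0.2]]
unsafe_inputs = [[0.1, 1.0], [0.0, 0.9]]

edited = negate_unsafe_weights(weights, safe_inputs, unsafe_inputs, k=1)
```

Only the weight that responds most strongly to the unsafe calibration inputs is flipped, and the rest of the layer is left unchanged. That is the intuition behind preserving knowledge on safe inputs.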
Studying Illustrations in Manuscripts: An Efficient Deep-Learning Approach
Authors: Yoav Evron, Michal Bar-Asher Siegal, Michael Fire
First: 2025-11-15T18:30:42+00:00 · Latest: 2026-01-12T11:37:13+00:00
Comments: 17 pages, 5 figures
Abstract
The recent Artificial Intelligence (AI) revolution has opened transformative possibilities for the humanities, particularly in unlocking the visual-artistic content embedded in historical illuminated manuscripts. While digital archives now offer unprecedented access to these materials, the ability to systematically locate, extract, and analyze illustrations at scale remains a major challenge. We present a general and scalable AI-based pipeline for large-scale visual analysis of illuminated manuscripts. The framework integrates modern deep-learning models for page-level illustration detection, illustration extraction, and multimodal description, enabling scholars to search, cluster, and study visual materials and artistic trends across entire corpora. We demonstrate the applicability of this approach on large heterogeneous collections, including the Vatican Library and richly illuminated manuscripts such as the Bible of Borso d'Este. The system reveals meaningful visual patterns and cross-manuscript relationships by embedding illustrations into a shared representation space and analyzing their similarity structure (see figure 4). By harnessing recent advances in computer vision and vision-language models, our framework enables new forms of large-scale visual scholarship in historical studies, art history, and cultural heritage, making it possible to explore iconography, stylistic trends, and cultural connections in ways that were previously impractical.
中文标题/摘要
标题:手稿插图研究:一种高效的深度学习方法
近期的人工智能(AI)革命为人文科学带来了变革性的可能性,特别是在解锁历史手稿中嵌入的视觉艺术内容方面。虽然数字档案现在提供了前所未有的访问这些材料的机会,但系统地定位、提取和分析大规模插图仍然是一项重大挑战。我们提出了一种通用且可扩展的基于AI的流水线,用于大型手稿视觉分析。该框架结合了现代深度学习模型进行页面级插图检测、插图提取和多模态描述,使学者能够搜索、聚类和研究整个手稿集合中的视觉材料和艺术趋势。我们通过在梵蒂冈图书馆和富丽堂皇的手稿(如波尔索·德·埃斯特的圣经)等大型异构集合上展示该方法的应用性。该系统通过将插图嵌入共享表示空间并分析其相似性结构揭示了有意义的视觉模式和跨手稿关系(参见图4)。通过利用计算机视觉和视觉语言模型的最新进展,我们的框架使历史研究、艺术史和文化遗产中的大规模视觉研究成为可能,使其能够以前所未有的方式探索图像学、风格趋势和文化联系。
Summary / 总结
The paper presents an AI-based pipeline for analyzing illustrations in historical illuminated manuscripts, addressing the challenge of systematic analysis at scale. It uses deep-learning models for illustration detection, extraction, and multimodal description, allowing scholars to search, cluster, and study visual materials. The system reveals visual patterns and cross-manuscript relationships by embedding illustrations into a shared representation space, demonstrating its effectiveness on large collections such as the Vatican Library and the Bible of Borso d'Este.
本文介绍了一种基于AI的管道,用于分析历史手稿中的视觉内容。该方法结合了用于图像检测、提取和多模态描述的深度学习模型,以促进大规模视觉分析。主要发现包括能够跨整个收藏搜索、聚类和研究视觉材料和艺术趋势,通过共享表示空间和相似性分析揭示有意义的视觉模式和跨手稿关系。
LOST-3DSG: Lightweight Open-Vocabulary 3D Scene Graphs with Semantic Tracking in Dynamic Environments
Authors: Sara Micol Ferraina, Michele Brienza, Francesco Argenziano, Emanuele Musumeci, Vincenzo Suriani, Domenico D. Bloisi, Daniele Nardi
First: 2026-01-06T10:44:19+00:00 · Latest: 2026-01-12T09:08:59+00:00
Abstract
Tracking objects that move within dynamic environments is a core challenge in robotics. Recent research has advanced this topic significantly; however, many existing approaches remain inefficient due to their reliance on heavy foundation models. To address this limitation, we propose LOST-3DSG, a lightweight open-vocabulary 3D scene graph designed to track dynamic objects in real-world environments. Our method adopts a semantic approach to entity tracking based on word2vec and sentence embeddings, enabling an open-vocabulary representation while avoiding the necessity of storing dense CLIP visual features. As a result, LOST-3DSG achieves superior performance compared to approaches that rely on high-dimensional visual embeddings. We evaluate our method through qualitative and quantitative experiments conducted in a real 3D environment using a TIAGo robot. The results demonstrate the effectiveness and efficiency of LOST-3DSG in dynamic object tracking. Code and supplementary material are publicly available on the project website at https://lab-rococo-sapienza.github.io/lost-3dsg/.
中文标题/摘要
标题:LOST-3DSG:轻量级开放词汇3D场景图及其在动态环境中的语义跟踪
在动态环境中跟踪移动对象是机器人技术中的核心挑战。近期研究在这一领域取得了显著进展,但许多现有方法仍因依赖重模型而效率低下。为解决这一限制,我们提出LOST-3DSG,一种轻量级开放词汇3D场景图,旨在实现实时动态物体跟踪。我们的方法基于word2vec和句子嵌入采用语义实体跟踪,实现开放词汇表示,避免存储密集的CLIP视觉特征的必要性。因此,LOST-3DSG在性能上优于依赖高维视觉嵌入的方法。我们通过在真实3D环境中使用TIAGo机器人进行定性和定量实验来评估该方法。结果表明,LOST-3DSG在动态物体跟踪方面具有有效性和效率。代码和补充材料可在项目网站https://lab-rococo-sapienza.github.io/lost-3dsg/上公开获取。
OSCAR: Open-Set CAD Retrieval from a Language Prompt and a Single Image
Authors: Tessa Pulli, Jean-Baptiste Weibel, Peter Hönig, Matthias Hirschmanner, Markus Vincze, Andreas Holzinger
First: 2026-01-12T08:59:22+00:00 · Latest: 2026-01-12T08:59:22+00:00
Abstract
6D object pose estimation plays a crucial role in scene understanding for applications such as robotics and augmented reality. To support the needs of ever-changing object sets in such contexts, modern zero-shot object pose estimators were developed to not require object-specific training but only rely on CAD models. Such models are hard to obtain once deployed, and a continuously changing and growing set of objects makes it harder to reliably identify the instance model of interest. To address this challenge, we introduce Open-Set CAD Retrieval from a Language Prompt and a Single Image (OSCAR), a novel training-free method that retrieves a matching object model from an unlabeled 3D object database. During onboarding, OSCAR generates multi-view renderings of database models and annotates them with descriptive captions using an image captioning model. At inference, GroundedSAM detects the queried object in the input image, and multi-modal embeddings are computed for both the Region-of-Interest and the database captions. OSCAR employs a two-stage retrieval: text-based filtering using CLIP identifies candidate models, followed by image-based refinement using DINOv2 to select the most visually similar object. In our experiments we demonstrate that OSCAR outperforms all state-of-the-art methods on the cross-domain 3D model retrieval benchmark MI3DOR. Furthermore, we demonstrate OSCAR's direct applicability in automating object model sourcing for 6D object pose estimation. We propose using the most similar object model for pose estimation if the exact instance is not available and show that OSCAR achieves an average precision of 90.48% during object retrieval on the YCB-V object dataset. Moreover, we demonstrate that the most similar object model can be utilized for pose estimation using Megapose, achieving better results than a reconstruction-based approach.
中文标题/摘要
标题:OSCAR:基于语言提示和单张图像的开放集CAD检索
6D物体姿态估计在机器人技术和增强现实等应用的场景理解中起着关键作用。为了支持此类情境下不断变化的物体集合的需求,现代零样本物体姿态估计器被开发出来,不需要特定物体的训练,仅依赖CAD模型。一旦部署,获取此类模型变得困难,且不断变化和增长的物体集合使得准确识别所需实例模型变得更加困难。为解决这一挑战,我们提出了OSCAR(基于语言提示和单张图像的开放集CAD检索),这是一种无需训练的新颖方法,可以从未标记的3D物体数据库中检索匹配的物体模型。在上线过程中,OSCAR生成数据库模型的多视角渲染,并使用图像描述模型对其进行描述性注释。在推理过程中,GroundedSAM在输入图像中检测查询物体,计算区域兴趣和数据库描述的多模态嵌入。OSCAR采用两阶段检索:使用CLIP进行基于文本的过滤以识别候选模型,然后使用DINOv2进行基于图像的细化以选择最相似的物体。在我们的实验中,我们证明OSCAR在跨域3D模型检索基准MI3DOR上优于所有最先进的方法。此外,我们展示了OSCAR直接应用于6D物体姿态估计中物体模型的自动化获取。我们建议如果无法获取精确实例,则使用最相似的物体模型进行姿态估计,并展示了OSCAR在YCB-V物体数据集上进行物体检索时的平均精度为90.48%。此外,我们展示了最相似的物体模型可以用于姿态估计,使用Megapose方法获得比基于重建的方法更好的结果。
Summary / 总结
OSCAR is a training-free method for open-set CAD retrieval that uses a language prompt and a single image to find a matching 3D object model. During onboarding, it generates multi-view renderings of database models and annotates them with captions; at inference, it uses CLIP for text-based filtering and DINOv2 for image-based refinement to select the most visually similar object. OSCAR outperforms state-of-the-art methods on the MI3DOR benchmark and achieves an average precision of 90.48% in object retrieval on the YCB-V dataset. When the exact instance is not available, the most similar retrieved model can be used for 6D object pose estimation.
OSCAR 是一种无需训练的方法,用于从未标注的 3D 对象数据库中根据语言提示和单张图像检索匹配的对象模型。它生成多视角渲染并用描述性标题进行标注,然后使用 CLIP 进行文本过滤并使用 DINOv2 进行图像细化,以选择最相似的对象。实验表明,OSCAR 在 MI3DOR 基准测试中优于现有方法,并在 YCB-V 对象数据集上实现 90.48% 的对象检索平均精度,还可以在无法获取精确实例时用于 6D 对象姿态估计。
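The two-stage retrieval described above can be sketched as a text-based filter followed by an image-based rerank. The 2-D vectors below are toy stand-ins for CLIP and DINOv2 embeddings. The database layout, field names, and `top_k` value are assumptions for illustration and do not reflect OSCAR's actual interface.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_retrieve(query_text, query_image, database, top_k=2):
    # Stage 1: keep the top_k models whose caption embedding matches the prompt.
    candidates = sorted(
        database,
        key=lambda m: cosine(query_text, m["caption_emb"]),
        reverse=True,
    )[:top_k]
    # Stage 2: among those, pick the model most visually similar to the crop.
    best = max(candidates, key=lambda m: cosine(query_image, m["image_emb"]))
    return best["name"]


# Hypothetical database of three CAD models with precomputed embeddings.
database = [
    {"name": "mug_a", "caption_emb": [1.0, 0.0], "image_emb": [1.0, 0.0]},
    {"name": "mug_b", "caption_emb": [0.9, 0.1], "image_emb": [0.0, 1.0]},
    {"name": "drill", "caption_emb": [0.0, 1.0], "image_emb": [0.1, 1.0]},
]

match = two_stage_retrieve([1.0, 0.05], [0.1, 0.9], database)
```

In this toy case `drill` is visually closest to the query crop, but the text stage removes it first. This shows why the order of the two stages matters.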
VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing
Authors: Guanyuan Pan, Yugui Lin, Tiansheng Zhou, Pietro Liò, Shuai Wang, Yaqi Wang
First: 2026-01-12T08:37:32+00:00 · Latest: 2026-01-12T08:37:32+00:00
Comments: 8 pages, 5 figures
Abstract
Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches often underutilize circuit schematics and lack the explainability required for industry adoption. To tackle these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing, and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-starting from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experimental results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance, achieving a 100% success rate in optimizing an amplifier with a complementary input and a class-AB output stage, while maintaining total runtime under 43 minutes across all experiments.
中文标题/摘要
标题:VLM-CAD:优化视觉语言模型协作代理设计工作流以进行模拟电路尺寸优化
模拟混合信号电路尺寸优化涉及高维设计空间中的复杂权衡。现有的自动模拟电路尺寸优化方法往往未能充分利用电路图并缺乏行业采用所需的可解释性。为应对这些挑战,我们提出了一种视觉语言模型优化的协作代理设计工作流(VLM-CAD),该工作流分析电路、优化直流工作点、进行基于推理的尺寸优化并执行外部尺寸优化。我们整合了Image2Net来标注电路图并生成结构化的JSON描述,以便视觉语言模型精确解释。此外,我们提出了一种可解释的信任区域贝叶斯优化方法(ExTuRBO),该方法采用代理生成的种子进行协作预热启动,并提供外部尺寸优化的双重粒度敏感性分析,支持全面的最终设计报告。使用180nm、90nm和45nm预测技术模型进行的放大器尺寸优化任务实验结果表明,VLM-CAD有效地平衡了功率和性能,在优化具有互补输入和类AB输出阶段的放大器时实现了100%的成功率,同时在所有实验中保持总运行时间低于43分钟。
Summary / 总结
VLM-CAD is designed to address the complex trade-offs in analog mixed-signal circuit sizing by integrating a Vision Language Model-optimized collaborative agent design workflow. This method uses Image2Net for circuit schematic annotation and ExTuRBO for optimization, providing detailed sensitivity analysis. Experiments on amplifier sizing tasks with different technology nodes show that VLM-CAD successfully balances power and performance, achieving a 100% success rate and maintaining runtime under 43 minutes.
VLM-CAD 通过结合 Vision Language Model-优化的协作代理设计工作流来解决模拟混合信号电路尺寸设计中的复杂权衡问题。该方法使用 Image2Net 对电路图进行注释,并使用 ExTuRBO 进行可解释优化,提供详细的灵敏度分析。实验结果表明,VLM-CAD 在不同技术节点的放大器尺寸任务中成功平衡了功率和性能,实现了 100% 的优化成功率,并且整个实验的运行时间保持在 43 分钟以内。
ARM: Role-Conditioned Neuron Transplantation for Training-Free Generalist LLM Agent Merging
Authors: Zhuoka Feng, Kang Chen, Sihan Zhao, Kai Xiong, Yaoning Wang, Minshen Yu, Junjie Nian, Changyi Xiao, Yixin Cao, Yugang Jiang
First: 2026-01-12T08:31:53+00:00 · Latest: 2026-01-12T08:31:53+00:00
Comments: 17 pages, 12 figures. Project page: https://arkazhuo.github.io/ARM-homepage/
Abstract
Interactive large language model agents have advanced rapidly, but most remain specialized to a single environment and fail to adapt robustly to other environments. Model merging offers a training-free alternative by integrating multiple experts into a single model. In this paper, we propose Agent-Role Merging (ARM), an activation-guided, role-conditioned neuron transplantation method for model merging in LLM agents. ARM extends existing merging methods from static natural language tasks to multi-turn agent scenarios and improves generalization across various interactive environments. This is achieved with a well-designed 3-step framework: 1) constructing merged backbones, 2) selection based on role-conditioned activation analysis, and 3) neuron transplantation for fine-grained refinement. Without gradient-based optimization, ARM improves cross-benchmark generalization while remaining efficient. Across diverse domains, the model obtained via ARM merging outperforms prior model merging methods and domain-specific expert models, while demonstrating strong out-of-domain generalization.
中文标题/摘要
标题:ARM:基于角色条件的神经元移植以实现无需训练的一般性LLM代理融合
交互式大型语言模型代理已经取得了快速进展,但大多数仍局限于单一环境,并且难以在其他环境中稳健适应。模型融合提供了一种无需训练的替代方案,通过将多个专家整合到一个模型中。在本文中,我们提出了代理角色融合(ARM),这是一种基于激活引导、角色条件的神经元移植方法,用于LLM代理的模型融合。ARM将现有的融合方法从静态自然语言任务扩展到多轮代理场景,并提高了在各种交互环境中的泛化能力。这通过一个精心设计的3步框架实现:1)构建融合的主干,2)基于其角色条件激活分析进行选择,3)进行神经元移植以实现细粒度的改进。无需基于梯度的优化,ARM在保持效率的同时提高了跨基准的泛化能力。在不同领域中,通过ARM融合得到的模型优于先前的模型融合方法和领域特定的专家模型,并且表现出强大的跨域泛化能力。
Summary / 总结
The research aims to address the limitation of interactive large language model agents being specialized and failing to adapt to new environments. ARM, a role-conditioned neuron transplantation method, is proposed to merge multiple experts into a single model without training. The method uses a 3-step framework to construct merged backbones, select based on role-conditioned activation analysis, and refine through neuron transplantation. ARM improves cross-benchmark generalization and outperforms previous model merging methods and domain-specific models in various domains, showing strong out-of-domain generalization capabilities.
ARM 是一种通过角色条件下的神经元移植实现多个大型语言模型代理合并为通用代理的无训练方法。它通过适应多轮对话场景和增强跨不同环境的一般化能力来改进现有合并方法。ARM 使用三步框架:构建合并的骨干、基于角色条件下的激活分析进行选择以及进行神经元移植以进行精细调整。合并后的模型在不同领域中优于先前的方法和领域特定的专家模型,展示了强大的跨域一般化能力。
Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion
Authors: Yi Zhou, Xuechao Zou, Shun Zhang, Kai Li, Shiying Wang, Jingming Chen, Congyan Lang, Tengfei Cao, Pin Tao, Yuanchun Shi
First: 2025-12-28T18:24:19+00:00 · Latest: 2026-01-12T08:31:50+00:00
Comments: 12 pages, 5 figures, 9 tables
Abstract
Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at https://xavierjiezou.github.io/Co2S/.
中文标题/摘要
标题:通过共引导和共融合实现稳定的半监督遥感分割
半监督遥感(RS)图像语义分割提供了一种缓解全面标注负担的有希望的解决方案,但根本上它面临着伪标签漂移的问题,这是一种在训练过程中由于确认偏差导致错误累积的现象。在本文中,我们提出了一种名为Co2S的稳定半监督RS分割框架,该框架能够协同融合来自视觉语言模型和自监督模型的先验知识。具体而言,我们构建了一个异构双学生架构,其中包括两个使用预训练CLIP和DINOv3初始化的不同ViT视觉基础模型,以减轻错误累积和伪标签漂移。为了有效结合这些不同的先验知识,我们引入了一种显式-隐式语义共引导机制,该机制利用文本嵌入和可学习查询分别提供显式和隐式的类别级引导,从而共同增强语义一致性。此外,我们还开发了一种全局-局部特征协作融合策略,以有效地融合CLIP捕获的全局上下文信息和DINOv3生成的局部细节,使模型能够生成高度精确的分割结果。在六个流行数据集上的广泛实验表明,所提出的方法在各种分割协议和不同场景中始终表现出优越性。项目页面可在https://xavierjiezou.github.io/Co2S/访问。
Summary / 总结
This paper addresses the issue of pseudo-label drift in semi-supervised remote sensing image segmentation by proposing Co2S, a framework that integrates priors from vision-language and self-supervised models. It uses a dual-student architecture with CLIP and DINOv3 to reduce error accumulation and introduces a co-guidance mechanism and feature fusion strategy to enhance semantic consistency and precision. Experiments on six datasets show that Co2S outperforms existing methods across various scenarios.
该论文通过提出Co2S框架,将来自视觉语言模型和自监督模型的先验知识结合起来,解决半监督遥感图像分割中的伪标签漂移问题。该框架采用CLIP和DINOv3初始化的双学生架构,减少错误累积。引入了显式-隐式语义协同引导机制和全局-局部特征融合策略,以增强语义一致性并生成精确的分割结果。在六个数据集上的实验表明,Co2S在各种场景下优于现有方法。
VMMU: A Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark
Authors: Vy Tuong Dang, An Vo, Emilio Villa-Cueva, Quang Tau, Duc Dm, Thamar Solorio, Daeyoung Kim
First: 2025-08-19T09:31:18+00:00 · Latest: 2026-01-12T08:08:32+00:00
Abstract
We introduce VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark designed to evaluate how vision-language models (VLMs) interpret and reason over visual and textual information beyond English. VMMU consists of 2.5k multimodal questions across 7 tasks, covering a diverse range of problem contexts, including STEM problem solving, data interpretation, rule-governed visual reasoning, and abstract visual reasoning. All questions require genuine multimodal integration, rather than reliance on text-only cues or OCR-based shortcuts. We evaluate a diverse set of state-of-the-art proprietary and open-source VLMs on VMMU. Despite strong Vietnamese OCR performance, proprietary models achieve only 66% mean accuracy. Further analysis shows that the primary source of failure is not OCR, but instead multimodal grounding and reasoning over text and visual evidence. Code and data are available at https://vmmu.github.io.
中文标题/摘要
标题:VMMU:越南多任务多模态理解与推理基准
我们介绍了VMMU,一个越南多任务多模态理解与推理基准,旨在评估视觉语言模型(VLMs)如何超越英语对视觉和文本信息进行解释和推理。VMMU 包含7个任务中的2500个多模态问题,涵盖了从STEM问题解决到数据解释、规则指导的视觉推理和抽象视觉推理等多种问题情境。所有问题都需要真正的多模态整合,而不是依赖于仅基于文本的线索或OCR捷径。我们对VMMU上的一系列最先进的专有和开源VLMs进行了评估。尽管越南OCR表现出色,但专有模型的平均准确率仅为66%。进一步的分析表明,失败的主要原因是多模态定位和文本与视觉证据的推理,而不是OCR。代码和数据可在https://vmmu.github.io/获取。
Summary / 总结
VMMU is a Vietnamese benchmark for evaluating vision-language models in interpreting and reasoning over multimodal information, extending beyond English. It includes 2,500 questions across 7 tasks, focusing on STEM, data interpretation, and abstract reasoning. Despite strong OCR performance, proprietary models achieve only 66% mean accuracy, indicating challenges in multimodal grounding and reasoning. The benchmark aims to push the boundaries of VLMs in handling diverse and complex tasks in Vietnamese.
VMMU 是一个越南语多任务多模态理解与推理基准,旨在评估视觉语言模型在处理非英语任务时的能力。它包含 2,500 个多模态问题,覆盖七个不同的任务,需要真正的多模态整合。尽管 OCR 性能很强,但专有模型的平均准确率仅为 66%,表明在多模态定位和推理方面存在挑战。该基准旨在推动视觉语言模型在处理越南语和多模态数据方面的进步。
A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model
Authors: Qi Zheng, Shuliang Liu, Yu Huang, Sihang Jia, Jungang Li, Lyuhao Chen, Junhao Chen, Hanqian Li, Aiwei Liu, Yibo Yan, Xuming Hu
First: 2026-01-12T07:55:13+00:00 · Latest: 2026-01-12T07:55:13+00:00
Abstract
Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases, while some semantic-aware methods incur prohibitive inference latency due to rejection sampling. In this paper, we propose the VIsual Semantic Adaptive Watermark (VISA-Mark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. Our approach employs a lightweight, efficiently trained prefix-tuner to extract dynamic Visual-Evidence Weights, which quantify the evidentiary support for candidate tokens based on the visual input. These weights guide an adaptive vocabulary partitioning and logits perturbation mechanism, concentrating watermark strength specifically on visually-supported tokens. By actively aligning the watermark with visual evidence, VISA-Mark effectively maintains visual fidelity. Empirical results confirm that VISA-Mark outperforms conventional methods with a 7.8% improvement in visual consistency (Chair-I) and superior semantic fidelity. The framework maintains highly competitive detection accuracy (96.88% AUC) and robust attack resilience (99.3%) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multimodal watermarking.
中文标题/摘要
标题:一种基于前缀调优的面向大型视觉语言模型的视觉语义自适应水印
水印技术已成为大型视觉语言模型(LVLMs)中内容追溯和知识产权保护的关键解决方案。然而,视觉无关的水印引入了视觉无关的标记,并通过施加不分青红皂白的伪随机偏见破坏了视觉定位,而一些语义感知的方法则因拒绝采样而产生高昂的推理延迟。在本文中,我们提出了视觉语义自适应水印(VISA-Mark)这一新颖框架,该框架在严格保持视觉保真度的同时嵌入可检测的信号。我们的方法采用一种轻量级、高效训练的前缀调优器来提取动态视觉证据权重,这些权重根据视觉输入量化候选标记的证据支持度。这些权重指导自适应词汇分区和logits扰动机制,将水印强度集中在视觉支持的标记上。通过积极地使水印与视觉证据对齐,VISA-Mark 有效地保持了视觉保真度。实验证明,VISA-Mark 在视觉一致性(Chair-I)方面比传统方法提高了 7.8%,并且在语义保真度方面表现更优。该框架保持了高度竞争力的检测准确性(96.88% AUC)和鲁棒的攻击抵抗力(99.3%),而不牺牲推理效率,从而有效地确立了可靠保真的多模态水印的新标准。
Summary / 总结
The research addresses the challenge of embedding watermarks in Large Vision-Language Models (LVLMs) without compromising visual fidelity or introducing irrelevant tokens. It introduces VISA-Mark, which uses a lightweight prefix-tuner to dynamically adjust the strength of the watermark based on visual evidence, ensuring that the watermark is only applied to visually supported tokens. Experiments show that VISA-Mark improves visual consistency by 7.8% and maintains high detection accuracy and robustness against attacks while preserving inference efficiency.
该论文提出了一种名为VISA-Mark的新方法,通过使用轻量级前缀调谐器动态调整基于视觉输入的令牌权重来增强视觉保真度。该方法通过提高7.8%的视觉一致性,并保持高检测准确性和对攻击的鲁棒性,同时保留推理效率来提升视觉语言模型的水印可靠性。
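One way to picture concentrating watermark strength on visually supported tokens is a green-list style logit bias whose size is scaled by a per-token evidence weight. This is a generic watermarking sketch under that assumption and not VISA-Mark's actual partitioning or perturbation scheme. The key, the hashing rule, and the weights are hypothetical.

```python
import hashlib


def is_green(prev_token, token, key="demo-key"):
    """Pseudo-random vocabulary partition seeded by the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] % 2 == 0


def watermark_logits(logits, prev_token, evidence, delta=2.0):
    """Boost green tokens in proportion to their visual-evidence weight.

    `evidence` maps tokens to weights in [0, 1]; tokens without visual
    support receive no boost, so the watermark does not promote them.
    """
    out = {}
    for token, logit in logits.items():
        weight = evidence.get(token, 0.0)
        boost = delta * weight if is_green(prev_token, token) else 0.0
        out[token] = logit + boost
    return out


# Toy next-token logits and hypothetical evidence weights from the image.
logits = {"dog": 1.2, "cat": 1.0, "car": 0.3, "the": 0.8}
evidence = {"dog": 0.9, "cat": 0.1}

biased = watermark_logits(logits, prev_token="a", evidence=evidence)
```

A detector that holds the same key could recompute the partition and count green tokens, while tokens with no visual support are never pushed up by the watermark.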
VENUS: Visual Editing with Noise Inversion Using Scene Graphs
Authors: Thanh-Nhan Vo, Trong-Thuan Nguyen, Tam V. Nguyen, Minh-Triet Tran
First: 2026-01-12T05:24:58+00:00 · Latest: 2026-01-12T05:24:58+00:00
Abstract
State-of-the-art text-based image editing models often struggle to balance background preservation with semantic consistency, frequently resulting either in the synthesis of entirely new images or in outputs that fail to realize the intended edits. In contrast, scene graph-based image editing addresses this limitation by providing a structured representation of semantic entities and their relations, thereby offering improved controllability. However, existing scene graph editing methods typically depend on model fine-tuning, which incurs high computational cost and limits scalability. To this end, we introduce VENUS (Visual Editing with Noise inversion Using Scene graphs), a training-free framework for scene graph-guided image editing. Specifically, VENUS employs a split prompt conditioning strategy that disentangles the target object of the edit from its background context, while simultaneously leveraging noise inversion to preserve fidelity in unedited regions. Moreover, our proposed approach integrates scene graphs extracted from multimodal large language models with diffusion backbones, without requiring any additional training. Empirically, VENUS substantially improves both background preservation and semantic alignment on PIE-Bench, increasing PSNR from 22.45 to 24.80, SSIM from 0.79 to 0.84, and reducing LPIPS from 0.100 to 0.070 relative to the state-of-the-art scene graph editing model (SGEdit). In addition, VENUS enhances semantic consistency as measured by CLIP similarity (24.97 vs. 24.19). On EditVal, VENUS achieves the highest fidelity with a 0.87 DINO score and, crucially, reduces per-image runtime from 6-10 minutes to only 20-30 seconds. Beyond scene graph-based editing, VENUS also surpasses strong text-based editing baselines such as LEDIT++ and P2P+DirInv, thereby demonstrating consistent improvements across both paradigms.
中文标题/摘要
标题:VENUS:使用场景图的噪声反转视觉编辑
最先进的基于文本的图像编辑模型往往难以在背景保留与语义一致性之间取得平衡,经常导致合成全新的图像或输出未能实现预期的编辑。相比之下,基于场景图的图像编辑通过提供语义实体及其关系的结构化表示,解决了这一限制,从而提高了可控性。然而,现有的场景图编辑方法通常依赖于模型微调,这会带来高昂的计算成本并限制可扩展性。为此,我们提出了VENUS(使用场景图的噪声反转视觉编辑),这是一种无需训练的场景图引导图像编辑框架。具体而言,VENUS采用分割提示条件策略,将编辑目标对象与其背景上下文分离,同时利用噪声反转来保留未编辑区域的保真度。此外,我们提出的方法将从多模态大型语言模型中提取的场景图与扩散模型相结合,无需任何额外训练。实验表明,VENUS在PIE-Bench上显著提高了背景保留和语义对齐,PSNR从22.45提高到24.80,SSIM从0.79提高到0.84,LPIPS从0.100降低到0.070,相对于最先进的场景图编辑模型(SGEdit)有所提升。此外,VENUS通过CLIP相似度(24.97 vs. 24.19)提高了语义一致性。在EditVal上,VENUS实现了最高的保真度,DINO得分为0.87,并且关键地将每张图像的运行时间从6-10分钟缩短到仅20-30秒。除了基于场景图的编辑,VENUS还超越了强大的基于文本的编辑基线,如LEdit++和P2P+DirInv,从而在两种范式中都表现出一致的改进。
Summary / 总结
VENUS is a training-free framework for scene graph-guided image editing that improves background preservation and semantic alignment. It uses a split prompt conditioning strategy to disentangle the edit's target object from its background context, and noise inversion to preserve fidelity in unedited regions. On PIE-Bench, relative to SGEdit, VENUS raises PSNR from 22.45 to 24.80 and SSIM from 0.79 to 0.84, and lowers LPIPS from 0.100 to 0.070. On EditVal it achieves the highest fidelity with a 0.87 DINO score while cutting per-image runtime from 6-10 minutes to 20-30 seconds. It also surpasses strong text-based editing baselines such as LEDIT++ and P2P+DirInv.
VENUS 是一个无需训练的框架,用于基于场景图的图像编辑,能够提高背景保留和语义一致性。它使用分离提示条件策略来分离目标对象及其背景,并利用噪声反转来保持未编辑区域的保真度。VENUS 将来自多模态大型语言模型的场景图与扩散模型相结合,与最先进的场景图编辑模型(SGEdit)相比,在 PSNR、SSIM 和 LPIPS 方面取得了显著改进。在 EditVal 上,VENUS 也展示了最高的保真度,DINO 得分为 0.87,并将每张图像的运行时间缩短至 20-30 秒。它还超越了如 LEDIT++ 和 P2P+DirInv 等强大的基于文本的编辑模型,涵盖了两种范式都取得了持续改进。
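To put the reported PSNR gain in perspective, the sketch below applies the standard definition PSNR = 10 * log10(MAX^2 / MSE) and converts a PSNR difference into the implied ratio of mean squared errors. It is only an illustration: the function names are hypothetical, and because benchmark PSNR is typically averaged over many images, the resulting ratio is indicative rather than an exact statement about VENUS.

```python
import math


def psnr(mse: float, max_value: float = 255.0) -> float:
    """PSNR in dB for a given mean squared error and peak pixel value."""
    if mse <= 0:
        raise ValueError("mse must be positive (PSNR is infinite at zero error)")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_psnr_gain(psnr_before: float, psnr_after: float) -> float:
    """Ratio MSE_after / MSE_before implied by a PSNR change at a fixed peak value."""
    return 10.0 ** (-(psnr_after - psnr_before) / 10.0)


# PSNR values quoted in the abstract (SGEdit vs. VENUS on PIE-Bench).
ratio = mse_ratio_from_psnr_gain(22.45, 24.80)
print(f"implied MSE ratio (after / before): {ratio:.3f}")
```

Under these assumptions, a 2.35 dB gain corresponds to roughly a 40% lower mean squared error for a single image pair.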
From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D
Authors: Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, Li Zhang
First: 2025-03-29T04:51:50+00:00 · Latest: 2026-01-12T04:43:21+00:00
Comments: Project page: https://logosroboticsgroup.github.io/SPAR/
Abstract
Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.
中文标题/摘要
标题:从平面向空间:教学视-语言模型在三维中感知与推理
最近在LVLMs方面的进展提高了视-语言理解能力,但它们仍然在空间感知方面存在困难,限制了它们对复杂三维场景进行推理的能力。与之前将三维表示整合到模型中以提高空间理解能力的方法不同,我们旨在通过利用与空间相关的图像数据来解锁VLMs的潜力。为此,我们引入了一种基于具有三维真实值的场景数据的新颖二维空间数据生成和注释管道。该管道使我们能够创建从基本感知任务到更复杂推理任务的多样化空间任务集。利用此管道,我们构建了SPAR-7M,这是一个从多个公共数据集中数千个场景生成的大规模数据集。此外,我们引入了SPAR-Bench,这是一个旨在提供比现有空间基准更全面评估空间能力的基准,支持单视图和多视图输入。在SPAR-7M和大规模二维数据集上进行训练,使我们的模型在二维空间基准上达到最先进的性能。进一步针对三维任务特定数据集进行微调,取得了竞争力的结果,突显了我们数据集在增强空间推理方面的有效性。
Summary / 总结
This research addresses the limited spatial perception of vision-language models, which restricts their reasoning about complex 3D scenes. It introduces a 2D spatial data generation and annotation pipeline built on scene data with 3D ground truth, and uses it to construct SPAR-7M, a large-scale training dataset. The study also introduces SPAR-Bench, a benchmark for evaluating spatial capabilities with single-view and multi-view inputs. Models trained on SPAR-7M together with large-scale 2D datasets reach state-of-the-art performance on 2D spatial benchmarks, and further fine-tuning on 3D task-specific datasets yields competitive results.
研究旨在提升视觉语言模型的空间感知能力,以便更好地理解和推理3D场景。引入了一种新的2D空间数据生成管道,并构建了SPAR-7M大规模数据集用于模型训练。SPAR-Bench是一个新的基准,能够更全面地评估空间能力。在SPAR-7M上训练并在3D特定任务数据集上微调的模型在2D空间基准测试中表现出最先进的性能,并在3D任务上取得竞争力的结果。
SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
Authors: Jingyu Li, Junjie Wu, Dongnan Hu, Xiangkai Huang, Bin Sun, Zhihui Hao, Xianpeng Lang, Xiatian Zhu, Li Zhang
First: 2026-01-09T08:55:42+00:00 · Latest: 2026-01-12T03:29:14+00:00
Abstract
Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.
中文标题/摘要
标题:SGDrive:场景到目标的层次世界认知在自动驾驶中的应用
近期的端到端自动驾驶方法利用视觉-语言模型(VLMs)增强了在复杂驾驶场景中的规划能力。然而,VLMs本质上是作为通用模型训练的,缺乏对驾驶特定空间和时间推理的专门理解。在应用于自动驾驶时,这些模型难以建立能够捕捉几何关系、场景上下文和对安全轨迹规划至关重要的运动模式的结构化时空表示。为了解决这些限制,我们提出了SGDrive,这是一种新颖的框架,明确地将VLM的表示学习结构化为驾驶特定知识的层次结构。基于预训练的VLM主干,SGDrive将驾驶理解分解为场景-代理-目标层次结构,这与人类驾驶认知相呼应:驾驶员首先感知整体环境(场景上下文),然后关注安全关键的代理及其行为,最后制定短期目标并执行动作。这种层次分解提供了通用VLM所缺乏的结构化时空表示,将多级信息整合为一种紧凑而全面的格式,用于轨迹规划。在NAVSIM基准上的广泛实验表明,SGDrive在PDMS和EPDMS上的表现优于仅使用摄像头的方法,验证了层次知识结构化对于将通用VLM适应自动驾驶的有效性。
Summary / 总结
SGDrive is a novel framework that enhances autonomous driving by leveraging a pre-trained Vision-Language Model (VLM) to address the limitations of generalist models in understanding driving-specific reasoning. It decomposes driving understanding into a scene-agent-goal hierarchy, providing structured spatial-temporal representations that capture geometric relationships and motion patterns. Experiments on the NAVSIM benchmark show that SGDrive outperforms existing camera-only methods in both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for autonomous driving.
SGDrive 是一个框架,通过利用预训练的 Vision-Language 模型(VLM)并围绕驾驶特定的层次结构结构化其表示学习,来增强自动驾驶能力。它将驾驶理解分解为场景-代理-目标层次结构,有助于捕捉几何关系、场景上下文和运动模式。在 NAVSIM 基准上的实验结果表明,SGDrive 在 PDMS 和 EPDMS 中均优于其他基于摄像头的方法,验证了层次化知识结构对于将通用 VLM 适应自动驾驶的有效性。
GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
Authors: Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, Hesheng Wang
First: 2025-12-29T03:40:05+00:00 · Latest: 2026-01-12T02:02:10+00:00
Abstract
Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches that represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into the LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub at https://github.com/dtc111111/GaussianDWM.
中文标题/摘要
标题:GaussianDWM:基于3D高斯场景表示的统一场景理解和多模态生成世界模型
生成模型的发展推动了驾驶世界模型(DWMs)的快速发展。然而,现有的DWMs缺乏3D场景理解能力,只能根据输入数据生成内容,而无法解释或推理驾驶环境。此外,当前方法使用点云或BEV特征表示3D空间信息,无法准确将文本信息与底层3D场景对齐。为解决这些限制,我们提出了一种基于3D高斯场景表示的新型统一DWM框架,该框架能够同时实现3D场景理解和多模态场景生成,并且能够为理解和生成任务提供上下文增强。我们的方法通过将丰富的语言特征嵌入到每个高斯原语中,直接将文本信息与3D场景对齐,从而实现早期模态对齐。此外,我们设计了一种新的任务感知语言引导采样策略,该策略移除了冗余的3D高斯,并将准确且紧凑的3D标记注入到LLM中。此外,我们设计了一种双条件多模态生成模型,其中我们的视觉-语言模型捕获的信息作为高级语言条件与低级图像条件相结合,共同指导多模态生成过程。我们在nuScenes和NuInteract数据集上进行了全面研究,以验证我们框架的有效性。我们的方法达到了最先进的性能。我们将在GitHub上公开发布代码:https://github.com/dtc111111/GaussianDWM。
Summary / 总结
The paper introduces GaussianDWM, a unified driving world model built on a 3D Gaussian scene representation that supports both 3D scene understanding and multi-modal generation. It embeds linguistic features into each Gaussian primitive to align text with the 3D scene early, and uses a task-aware language-guided sampling strategy to remove redundant Gaussians and pass compact 3D tokens to the LLM. A dual-condition generation model combines a high-level language condition with a low-level image condition. The authors report state-of-the-art performance on the nuScenes and NuInteract datasets and state that the code will be released on GitHub.
论文提出了一种基于3D高斯分布的Driving World Model (GaussianDWM),该模型结合了3D场景理解和多模态生成能力。通过使用高斯原语嵌入丰富的语言特征,实现了早期模态对齐和文本信息与3D场景的准确对齐。模型还包括任务感知的语言引导采样策略和双条件多模态生成模型,实现了在nuScenes和NuInteract数据集上的最新性能。框架通过利用视觉-语言模型的信息,为高级语言条件和低级图像条件提供支持,以增强理解和生成任务。代码已公开发布在GitHub上。
MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning
Authors: Meng Lu, Yuxing Lu, Yuchen Zhuang, Megan Mullins, Yang Xie, Guanghua Xiao, Charles Fleming, Wenqi Shi, Xuan Wang
First: 2026-01-12T00:11:10+00:00 · Latest: 2026-01-12T00:11:10+00:00
Abstract
Vision language models (VLMs) achieve strong performance on general image understanding but struggle to think with medical images, especially when performing multi-step reasoning through iterative visual interaction. Medical VLMs often rely on static visual embeddings and single-pass inference, preventing models from re-examining, verifying, or refining visual evidence during reasoning. While tool-integrated reasoning offers a promising path forward, open-source VLMs lack the training infrastructure to learn effective tool selection, invocation, and coordination in multi-modal medical reasoning. We introduce MedVistaGym, a scalable and interactive training environment that incentivizes tool-integrated visual reasoning for medical image analysis. MedVistaGym equips VLMs to determine when and which tools to invoke, localize task-relevant image regions, and integrate single or multiple sub-image evidence into interleaved multimodal reasoning within a unified, executable interface for agentic training. Using MedVistaGym, we train MedVistaGym-R1 to interleave tool use with agentic reasoning through trajectory sampling and end-to-end reinforcement learning. Across six medical VQA benchmarks, MedVistaGym-R1-8B exceeds comparably sized tool-augmented baselines by 19.10% to 24.21%, demonstrating that structured agentic training--not tool access alone--unlocks effective tool-integrated reasoning for medical image analysis.
中文标题/摘要
标题:MEDVISTAGYM:通过工具集成强化学习进行医学图像思考的可扩展训练环境
视觉语言模型(VLMs)在通用图像理解方面表现出色,但在处理医学图像时,尤其是在通过迭代视觉交互进行多步推理时,它们难以进行思考。医学VLMs通常依赖静态视觉嵌入和单次推理,这阻止了模型在推理过程中重新检查、验证或细化视觉证据。虽然工具集成推理提供了前进的道路,但开源VLMs缺乏学习有效工具选择、调用和协调的训练基础设施,特别是在多模态医学推理中。我们介绍了MedVistaGym,这是一种可扩展且互动的训练环境,旨在激励工具集成视觉推理以进行医学图像分析。MedVistaGym使VLMs能够确定何时以及调用哪些工具,定位与任务相关图像区域,并在统一的可执行界面中进行单个或多个子图像证据的集成多模态推理,实现自主训练。使用MedVistaGym,我们通过轨迹采样和端到端强化学习训练了MedVistaGym-R1,使其能够将工具使用与自主推理交织在一起。在六个医学VQA基准测试中,MedVistaGym-R1-8B在大小相当的工具增强基线之上分别超过了19.10%到24.21%,这表明结构化的自主训练——而不仅仅是工具访问——解锁了有效的工具集成推理,用于医学图像分析。
Summary / 总结
The research aims to enhance the ability of vision language models (VLMs) to reason with medical images through tool-integrated reinforcement learning. MedVistaGym is introduced as a scalable training environment that encourages VLMs to select, invoke, and coordinate tools for iterative visual interaction. MedVistaGym-R1, trained using this environment, outperforms comparable tool-augmented baselines by 19.10% to 24.21% across six medical VQA benchmarks, highlighting the importance of structured agentic training over mere tool access for effective reasoning with medical images.
研究旨在提高视觉语言模型在处理医学图像时的多步推理能力,特别是对于多步推理任务。MedVistaGym 作为一种可扩展的训练环境,鼓励工具集成的视觉推理。通过轨迹采样和端到端强化学习训练的 MedVistaGym-R1 在六个医学 VQA 基准测试中,相对于可比的工具增强基线,性能提高了 19.10% 至 24.21%,表明结构化的主动训练对于医学图像分析中的有效工具集成推理至关重要。
Efficient Visual Question Answering Pipeline for Autonomous Driving via Scene Region Compression
Authors: Yuliang Cai, Dongqiangzi Ye, Zitian Chen, Chongruo Wu
First: 2026-01-11T23:25:49+00:00 · Latest: 2026-01-11T23:25:49+00:00
Comments: 7 pages
Abstract
Autonomous driving increasingly relies on Visual Question Answering (VQA) to enable vehicles to understand complex surroundings by analyzing visual inputs and textual queries. Currently, a paramount concern for VQA in this domain is the stringent requirement for low latency and real-time processing, as delays directly impact real-world safety in this safety-critical application. However, current state-of-the-art VQA models, particularly large vision-language models (VLMs), often prioritize performance over computational efficiency. These models typically process dense patch tokens for every frame, leading to prohibitive computational costs (FLOPs) and significant inference latency, especially with long video sequences. This focus on performance limits their practical deployment in real-time autonomous driving scenarios. To tackle this issue, we propose SRC-Pipeline, an efficient VLM framework for autonomous driving VQA tasks. It learns to compress early frame tokens into a small number of high-level tokens while retaining full patch tokens for recent frames. Experiments on autonomous driving video question answering tasks show that our approach achieves a 66% FLOPs reduction while maintaining comparable performance, enabling VLMs to operate more effectively in real-time, safety-critical autonomous driving settings.
中文标题/摘要
标题:自主驾驶场景区域压缩高效视觉问答流水线
自主驾驶越来越多地依赖视觉问答(VQA)来通过分析视觉输入和文本查询使车辆理解复杂的周围环境。目前,该领域中VQA的一个主要关注点是严格的快速延迟和实时处理要求,因为延迟直接影响这种安全关键应用中的现实世界安全性。然而,当前最先进的VQA模型,尤其是大型视觉-语言模型(VLMs),通常更注重性能而非计算效率。这些模型通常为每一帧处理密集的补丁标记,导致高昂的计算成本(FLOPs)和显著的推理延迟,尤其是在长视频序列中。这种关注限制了它们在实时自主驾驶场景中的实际部署。为了解决这一问题,我们提出了一种高效的VLM框架,用于自主驾驶VQA任务,即SRC-流水线。它学习将早期帧标记压缩成少量高级标记,同时为最近的帧保留完整的补丁标记。在自主驾驶视频问答任务上的实验表明,我们的方法在保持相当性能的同时实现了66%的FLOPs减少,使VLMs能够在实时、安全关键的自主驾驶环境中更有效地运行。
Summary / 总结
The research addresses the computational cost of Visual Question Answering (VQA) for autonomous driving, where low latency is crucial for safety. The proposed SRC-Pipeline framework learns to compress early frame tokens into a small number of high-level tokens while retaining full patch tokens for recent frames, achieving a 66% reduction in FLOPs while maintaining comparable performance. This makes VLM-based VQA more practical in real-time autonomous driving scenarios.
论文针对自主驾驶中高效视觉问答(VQA)的需求,旨在减少计算成本同时保持性能。提出了SRC-Pipeline框架,该框架将早期帧的tokens压缩成更少的高层次tokens,并保留最近帧的完整patch tokens。该方法实现了66%的FLOPs减少,同时保持了相当的性能,使VQA模型更适合实时自主驾驶应用。
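The saving from compressing early frames can be illustrated with simple token-count arithmetic. The sketch below is not the SRC-Pipeline implementation: the function name and all numbers (frame count, patches per frame, compressed tokens per early frame) are hypothetical, and token counts do not translate linearly into the FLOPs figure reported in the abstract.

```python
def visual_token_counts(num_frames: int, patches_per_frame: int,
                        recent_frames: int, compressed_per_frame: int):
    """Return (dense, compressed) visual token counts for one video clip.

    dense:      every frame keeps all of its patch tokens.
    compressed: the most recent `recent_frames` keep all patch tokens, while
                each earlier frame is reduced to `compressed_per_frame` tokens.
    """
    if not 0 <= recent_frames <= num_frames:
        raise ValueError("recent_frames must lie between 0 and num_frames")
    early_frames = num_frames - recent_frames
    dense = num_frames * patches_per_frame
    compressed = (early_frames * compressed_per_frame
                  + recent_frames * patches_per_frame)
    return dense, compressed


# Hypothetical clip: 16 frames of 256 patches, last 4 frames kept dense,
# each of the 12 earlier frames summarized by 16 tokens.
dense, compressed = visual_token_counts(16, 256, 4, 16)
print(dense, compressed, f"{1 - compressed / dense:.1%} fewer visual tokens")
```

The split between dense recent frames and compressed early frames reflects the idea described in the abstract that the latest observations matter most for the current driving decision, while older context can be kept in summarized form.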
Mid-Think: Training-Free Intermediate-Budget Reasoning via Token-Level Triggers
Authors: Wang Yang, Debargha Ganguly, Xinpeng Li, Chaoda Song, Shouren Wang, Vikash Singh, Vipin Chaudhary, Xiaotian Han
First: 2026-01-11T19:19:39+00:00 · Latest: 2026-01-11T19:19:39+00:00
Abstract
Hybrid reasoning language models are commonly controlled through high-level Think/No-think instructions to regulate reasoning behavior, yet we found that such mode switching is largely driven by a small set of trigger tokens rather than the instructions themselves. Through attention analysis and controlled prompting experiments, we show that a leading "Okay" token induces reasoning behavior, while the newline pattern following "</think>" suppresses it. Based on this observation, we propose Mid-Think, a simple training-free prompting format that combines these triggers to achieve intermediate-budget reasoning, consistently outperforming fixed-token and prompt-based baselines in terms of the accuracy-length trade-off. Furthermore, applying Mid-Think to RL training after SFT reduces training time by approximately 15% while improving final performance of Qwen3-8B on AIME from 69.8% to 72.4% and on GPQA from 58.5% to 61.1%, demonstrating its effectiveness for both inference-time control and RL-based reasoning training.
中文标题/摘要
标题:Mid-思考:通过标记级触发器实现无需训练的中级预算推理
混合推理语言模型通常通过高级的‘思考/不思考’指令来控制推理行为,但我们发现这种模式切换主要由一小组触发标记驱动,而不是指令本身。通过注意力分析和受控提示实验,我们展示了‘好的’标记会引发推理行为,而紧跟在‘</think>’之后的换行符则会抑制这种行为。基于这一观察,我们提出了Mid-思考,这是一种简单的无需训练的提示格式,结合了这些触发器以实现中级预算推理,并在准确性和长度的权衡上始终优于固定标记和基于提示的基线。此外,将Mid-思考应用于SFT后的RL训练,可将训练时间减少约15%,同时提高Qwen3-8B在AIME上的最终性能从69.8%到72.4%,在GPQA上的最终性能从58.5%到61.1%,证明了其在推理时间和基于RL的推理训练中的有效性。
Summary / 总结
The study examines how hybrid reasoning language models are controlled by specific token-level triggers rather than by high-level instructions. Through attention analysis and controlled prompting experiments, the researchers find that a leading 'Okay' token induces reasoning, while the newline pattern after '</think>' suppresses it. They then propose Mid-Think, a training-free prompting format that combines these triggers to achieve intermediate-budget reasoning and outperforms fixed-token and prompt-based baselines on the accuracy-length trade-off. Applying Mid-Think to reinforcement learning after supervised fine-tuning reduces training time by about 15% and improves Qwen3-8B from 69.8% to 72.4% on AIME and from 58.5% to 61.1% on GPQA.
研究旨在探索如何通过特定的标记级触发器而非高级指令来控制混合推理语言模型。通过分析注意力模式并进行受控实验,研究人员发现'Okay'标记会促进推理,而'</think>'后的换行符则会抑制推理。他们随后开发了Mid-Think,这是一种无需训练的提示格式,利用这些触发器实现中间预算的推理,其性能优于固定标记和基于提示的基线。Mid-Think在强化学习训练中的应用,在监督微调后减少了约15%的训练时间,并在AIME和GPQA任务上的性能分别提高了至72.4%和61.1%。