MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Authors: Tajamul Ashraf, Umair Nawaz, Abdelrahman M. Shaker, Rao Anwer, Philip Torr, Fahad Shahbaz Khan, Salman Khan
First: 2025-10-09T17:59:54+00:00 · Latest: 2025-10-09T17:59:54+00:00
Abstract
Vision language models (VLMs) are increasingly deployed as controllers with
access to external tools for complex reasoning and decision-making, yet their
effectiveness remains limited by the scarcity of high-quality multimodal
trajectories and the cost of manual annotation. We address this challenge with
a vision-centric agent tuning framework that automatically synthesizes
multimodal trajectories, generates step-wise preference pairs, and trains a VLM
controller for robust tool-use reasoning. Our pipeline first constructs
M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified
trajectories, enabling imitation-based trajectory tuning. Building on this, we
develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool
reasoning. To achieve finer alignment, we further introduce Pref-X, a set of
11K automatically generated preference pairs, and optimize MATRIX on it via
step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA,
MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating
scalable and effective multimodal tool use. Our data and code are available at
https://github.com/mbzuai-oryx/MATRIX.
中文标题/摘要
标题:MATRIX:多模态智能体调优以实现稳健的工具使用推理
视觉语言模型(VLMs)越来越多地被用作控制器,具有访问外部工具的能力,用于复杂的推理和决策,但其有效性受限于高质量多模态轨迹的稀缺性和手动注释的成本。我们通过一种以视觉为中心的智能体调优框架来应对这一挑战,该框架自动合成多模态轨迹、生成逐步偏好对,并训练一个VLM控制器以实现稳健的工具使用推理。我们的流水线首先构建了M-TRACE,这是一个包含28500个多模态任务和177000个验证轨迹的大规模数据集,使基于模仿的轨迹调优成为可能。在此基础上,我们开发了MATRIX智能体,该智能体是基于M-TRACE进行逐步工具推理的微调控制器。为了实现更精细的对齐,我们进一步引入了Pref-X,这是一个包含11000个自动生成的偏好对的集合,并通过逐步偏好学习对其进行优化。在三个基准测试Agent-X、GTA和GAIA上,MATRIX始终超越了开源和闭源的VLMs,展示了可扩展且有效的多模态工具使用能力。我们的数据和代码可在https://github.com/mbzuai-oryx/MATRIX/获得。
Summary / 总结
The research aims to enhance the effectiveness of vision language models (VLMs) as controllers for complex reasoning tasks involving external tools. The method is a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories and step-wise preference pairs, then trains a VLM controller on them. The resulting MATRIX Agent outperforms both open- and closed-source VLMs across three benchmarks (Agent-X, GTA, and GAIA), demonstrating scalable and effective multimodal tool use.
研究旨在提高视觉语言模型(VLMs)在涉及工具的复杂推理任务中的有效性。该研究引入了一个以视觉为中心的框架MATRIX,该框架自动生成多模态轨迹和偏好对来训练VLM控制器。该框架构建了包含28.5K多模态任务的大规模数据集M-TRACE,并对VLM控制器MATRIX Agent进行了微调。结果显示,MATRIX在三个基准测试中均优于开源和闭源的VLMs,展示了多模态工具使用的可扩展性和有效性。
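The abstract does not state which objective is used for step-wise preference learning. As a minimal sketch, assuming a DPO-style loss applied to a single step-level preference pair (the function name, the `beta` value, and the use of a frozen reference model are all assumptions for illustration, not the authors' method):

```python
import math


def stepwise_preference_loss(logp_chosen, logp_rejected,
                             ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss for one (chosen step, rejected step) pair.

    Inputs are log-probabilities of each step under the policy being tuned
    (logp_*) and under a frozen reference model (ref_*). Hypothetical helper.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is log(2) when the policy has no preference relative to the reference,
# and it drops below log(2) once the policy favors the chosen step.
neutral = stepwise_preference_loss(-1.0, -1.0, -1.0, -1.0)
favoring = stepwise_preference_loss(-1.0, -3.0, -2.0, -2.0)
```

In this form, each of the Pref-X pairs would contribute one such term per step, and the terms would be averaged over a batch.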
SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
Authors: Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
First: 2025-10-09T17:50:54+00:00 · Latest: 2025-10-09T17:50:54+00:00
Comments: Project Page: https://zju-real.github.io/SpatialLadder/ Code:
https://github.com/ZJU-REAL/SpatialLadder
Abstract
Spatial reasoning remains a fundamental challenge for Vision-Language Models
(VLMs), with current approaches struggling to achieve robust performance
despite recent advances. We identify that this limitation stems from a critical
gap: existing methods attempt to learn spatial reasoning directly without
establishing the hierarchical foundations of perception and understanding. To
address this challenge, we present a comprehensive methodology for building
spatial intelligence progressively. We introduce SpatialLadder-26k, a
multimodal dataset containing 26,610 samples spanning object localization,
single image, multi-view, and video spatial reasoning tasks, constructed
through a standardized pipeline that ensures systematic coverage across
modalities. Building on this dataset, we design a three-stage progressive
training framework that (1) establishes spatial perception through object
localization, (2) develops spatial understanding through multi-dimensional
spatial tasks, and (3) strengthens complex reasoning via reinforcement learning
with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter
model that achieves state-of-the-art performance on spatial reasoning
benchmarks, with 23.4% average improvement over the base model, surpassing
GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains
strong generalization with 7.2% improvement on out-of-domain benchmarks,
demonstrating that progressive training from perception to reasoning is
essential for robust spatial intelligence.
中文标题/摘要
标题:SpatialLadder:视觉语言模型中空间推理的渐进训练方法
空间推理仍然是视觉语言模型(VLMs)的基本挑战,尽管最近取得了进展,但当前方法在实现稳健性能方面仍存在困难。我们发现这一限制源于一个关键缺口:现有方法试图直接学习空间推理,而没有建立感知和理解的层次基础。为了解决这一挑战,我们提出了一种全面的方法来逐步构建空间智能。我们引入了包含26,610个样本的SpatialLadder-26k多模态数据集,这些样本覆盖了对象定位、单图像、多视图和视频空间推理任务,通过标准化流程确保了跨模态的系统覆盖。基于此数据集,我们设计了一个三阶段的渐进训练框架:(1)通过对象定位建立空间感知,(2)通过多维度空间任务发展空间理解,(3)通过带可验证奖励的强化学习强化复杂推理。这种方法产生了SpatialLadder,一个30亿参数的模型,在空间推理基准测试中达到了最先进的性能,相比基础模型平均提升23.4%,分别领先GPT-4o 20.8%和Gemini-2.0-Flash 10.1%。值得注意的是,SpatialLadder在域外基准测试中保持了较强的泛化能力,提升了7.2%,表明从感知到推理的渐进训练对于构建稳健的空间智能至关重要。
Summary / 总结
The paper addresses the challenge of spatial reasoning in Vision-Language Models (VLMs) by introducing SpatialLadder, a progressive training framework. It uses a large multimodal dataset, SpatialLadder-26k, to train VLMs in three stages: object localization, multi-dimensional spatial tasks, and complex reasoning with reinforcement learning. The resulting model, SpatialLadder, shows significant improvements in spatial reasoning benchmarks, with 23.4% average improvement over the base model and strong generalization capabilities.
论文通过引入SpatialLadder,提出了一种渐进式训练框架来解决视觉-语言模型(VLMs)中的空间推理问题。该框架利用SpatialLadder-26k这一大规模多模态数据集,在三个阶段进行训练:物体定位、多维度空间任务和通过强化学习进行复杂推理。最终模型SpatialLadder在空间推理基准测试中表现出显著提升,平均改进幅度为23.4%,并且具有良好的泛化能力。
To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models
Authors: Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, Leonid Sigal
First: 2025-10-09T17:44:42+00:00 · Latest: 2025-10-09T17:44:42+00:00
Comments: Preprint. Project page: https://davidhalladay.github.io/diysink_demo
Abstract
Large Vision Language Models (LVLMs) have recently emerged as powerful
architectures capable of understanding and reasoning over both visual and
textual information. These models typically rely on two key components: a
Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual
content into a sequence of image tokens and serves as the perceptual front-end
-- the eyes of the model. In contrast, the LLM interprets these tokens to
perform high-level reasoning, generates responses, and functions as the
cognitive core -- the brain of the model. However, it remains unclear which
visual tokens contribute most significantly to understanding and reasoning, and
how effectively these signals are propagated from ViT to the LLM. While most
existing works have focused on identifying attention sinks, low-semantic tokens
receiving disproportionately high attention, within the LLM, we shift the focus
to the vision encoder by identifying a class of high-norm visual tokens from
ViT, referred to as ViT attention sinks -- a problem that has been rarely
studied but is indeed very important for LVLMs. Our findings show that these
ViT sinks encapsulate high-level semantic concepts from images, allowing the
LLM to perform more effective understanding and reasoning. Despite their
importance, these sink tokens are often overlooked in existing LVLM
architectures. To explore their contribution, we present both qualitative and
quantitative analyses of the information embedded in these sink tokens. We also
propose both training-free and training-based approaches to better leverage how
this information is interpreted by the LLM, and to what extent. By explicitly
utilizing these tokens, we demonstrate substantial improvements across a range
of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT
attention sinks in enhancing visual reasoning.
中文标题/摘要
标题:沉还是不沉:大型视觉语言模型中的视觉信息路径
大型视觉语言模型(LVLMs)最近已成为能够理解和推理视觉和文本信息的强大架构。这些模型通常依赖于两个关键组件:视觉变换器(ViT)和大型语言模型(LLM)。ViT 将视觉内容编码为图像标记序列,并作为感知前端——模型的“眼睛”。相比之下,LLM 解释这些标记以进行高层次推理、生成响应,并作为认知核心——模型的“大脑”。然而,尚不清楚哪些视觉标记对理解和推理贡献最大,以及这些信号如何有效地从 ViT 传播到 LLM。虽然大多数现有工作都集中在识别 LLM 中的注意力“陷阱”(即获得不成比例高关注的低语义标记),但我们将重点转向视觉编码器,通过从 ViT 中识别一类高范数视觉标记,称为 ViT 注意力“陷阱”——这个问题虽然很少被研究,但对 LVLMs 来说确实非常重要。我们的研究发现,这些 ViT 陷阱包含了图像中的高层次语义概念,使 LLM 能够更有效地理解和推理。尽管这些陷阱标记十分重要,它们在现有 LVLM 架构中却经常被忽视。为了探索它们的贡献,我们对这些陷阱标记中嵌入的信息进行了定性和定量分析。我们还提出了无需训练和基于训练的方法,以更好地利用 LLM 对这些信息的解释及其程度。通过明确利用这些标记,我们展示了在一系列 LVLM 和视觉推理任务中取得了显著改进,突显了 ViT 注意力“陷阱”在增强视觉推理方面的未开发潜力。
Summary / 总结
This study investigates the role of visual tokens in large vision-language models (LVLMs) by identifying a class of high-norm visual tokens from the Vision Transformer (ViT), referred to as ViT attention sinks. The research shows that these tokens encapsulate high-level semantic concepts, enabling more effective reasoning by the language model. The study provides both qualitative and quantitative analyses of these sink tokens and proposes methods to better leverage their information, leading to improvements in various LVLMs and visual reasoning tasks.
研究通过识别来自视觉变换器(ViT)的一类高范数视觉令牌——ViT 注意力陷阱,探讨了视觉令牌在大型视觉语言模型(LVLM)中的作用。研究发现,这些令牌包含了高级语义概念,使语言模型能够更有效地进行推理。研究提供了这些陷阱令牌的定性和定量分析,并提出了更好地利用这些信息的方法,从而在各种LVLM和视觉推理任务中取得了显著改进。
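The abstract describes ViT attention sinks as a class of high-norm visual tokens but gives no selection rule. As a rough sketch, assuming sinks are flagged when their L2 norm exceeds a multiple of the median token norm (the function name, the `ratio` parameter, and the median-based cutoff are hypothetical and not taken from the paper):

```python
import math
from statistics import median


def high_norm_token_indices(tokens, ratio=3.0):
    """Return indices of tokens whose L2 norm exceeds ratio * median norm.

    tokens: list of equal-length lists of floats (one embedding per image token).
    The cutoff rule is an assumption made for illustration only.
    """
    norms = [math.sqrt(sum(x * x for x in token)) for token in tokens]
    cutoff = ratio * median(norms)
    return [i for i, norm in enumerate(norms) if norm > cutoff]


# Toy example: the third token has a much larger norm than the others.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [1.0, 1.0]]
sink_candidates = high_norm_token_indices(toy_tokens)
```

With real ViT outputs, the same function would be applied to the per-image token matrix before the tokens are passed to the LLM.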
MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration
Authors: Lu Liu, Chunlei Cai, Shaocheng Shen, Jianfeng Liang, Weimin Ouyang, Tianxiao Ye, Jian Mao, Huiyu Duan, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai
First: 2025-10-09T17:42:51+00:00 · Latest: 2025-10-09T17:42:51+00:00
Abstract
Real-world videos often suffer from complex degradations, such as noise,
compression artifacts, and low-light distortions, due to diverse acquisition
and transmission conditions. Existing restoration methods typically require
professional manual selection of specialized models or rely on monolithic
architectures that fail to generalize across varying degradations. Inspired by
expert experience, we propose MoA-VR, the first Mixture-of-Agents Video
Restoration system that mimics the reasoning and processing
procedures of human professionals through three coordinated agents: Degradation
Identification, Routing and Restoration, and Restoration Quality Assessment.
Specifically, we construct a large-scale and high-resolution video degradation
recognition benchmark and build a vision-language model (VLM) driven
degradation identifier. We further introduce a self-adaptive router powered by
large language models (LLMs), which autonomously learns effective restoration
strategies by observing tool usage patterns. To assess intermediate and final
processed video quality, we construct the Restored Video Quality (Res-VQ)
dataset and design a dedicated
VLM-based video quality assessment (VQA) model tailored for restoration tasks.
Extensive experiments demonstrate that MoA-VR effectively handles diverse and
compound degradations, consistently outperforming existing baselines in terms
of both objective metrics and perceptual quality. These results highlight the
potential of integrating multimodal intelligence and modular reasoning in
general-purpose video restoration systems.
中文标题/摘要
标题:MoA-VR:一种面向全方位视频修复的混合代理系统
现实世界的视频往往由于多样化的采集和传输条件而遭受复杂的退化,如噪声、压缩伪影和低光照失真。现有的修复方法通常需要专业的手动选择专门的模型,或者依赖于无法在不同退化类型之间泛化的单一架构。受专家经验的启发,我们提出了MoA-VR,这是首个混合代理(Mixture-of-Agents)视频修复系统,它通过三个协调的代理(退化识别、路由与修复、修复质量评估)来模仿人类专业人士的推理和处理过程。具体来说,我们构建了一个大规模和高分辨率的视频退化识别基准,并构建了一个由视觉语言模型(VLM)驱动的退化识别器。我们进一步引入了一个由大型语言模型(LLMs)驱动的自适应路由器,该路由器通过观察工具使用模式自主学习有效的修复策略。为了评估中间和最终处理视频的质量,我们构建了Res-VQ数据集,并设计了一个专门针对修复任务的基于VLM的视频质量评估(VQA)模型。广泛的实验表明,MoA-VR能够有效处理多样且复合的退化,其在客观指标和感知质量方面均优于现有基线。这些结果突显了在通用视频修复系统中集成多模态智能和模块化推理的潜力。
Summary / 总结
MoA-VR is a novel video restoration system that addresses complex degradations in real-world videos by using a mixture-of-agents approach. It consists of three agents: Degradation Identification, Routing and Restoration, and Restoration Quality Assessment. MoA-VR constructs a large-scale video degradation benchmark and uses a vision-language model for degradation identification, a self-adaptive router for restoration strategy learning, and a VQA model for quality assessment. Experiments show that MoA-VR outperforms existing methods in handling diverse degradations and improving both objective and perceptual quality.
MoA-VR 是一种新型视频修复系统,通过三个协调工作的代理:退化识别、路由和修复、以及修复质量评估,来应对真实世界视频中的复杂退化。该系统利用大规模视频退化识别基准和视觉语言模型进行退化识别,由大型语言模型驱动的自适应路由器,以及专为修复任务设计的视觉语言模型视频质量评估模型。大量实验表明,MoA-VR 在处理多样和复合退化方面优于现有方法,提高了客观指标和感知质量。
The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping
Authors: Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgö, Esam Ghaleb
First: 2025-10-09T17:21:59+00:00 · Latest: 2025-10-09T17:21:59+00:00
Abstract
Iconicity, the resemblance between linguistic form and meaning, is pervasive
in signed languages, offering a natural testbed for visual grounding. For
vision-language models (VLMs), the challenge is to recover such essential
mappings from dynamic human motion rather than static context. We introduce the
Visual Iconicity Challenge, a novel video-based benchmark that adapts
psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological
sign-form prediction (e.g., handshape, location), (ii) transparency (inferring
meaning from visual form), and (iii) graded iconicity ratings. We assess 13
state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the
Netherlands and compare them to human baselines. On phonological form
prediction, VLMs recover some handshape and location detail but remain below
human performance; on transparency, they are far from human baselines;
and only top models correlate moderately with human iconicity ratings.
Interestingly, models with stronger phonological form prediction
correlate better with human iconicity judgment, indicating shared sensitivity
to visually grounded structure. Our findings validate these diagnostic tasks
and motivate human-centric signals and embodied learning methods for modelling
iconicity and improving visual grounding in multimodal models.
中文标题/摘要
标题:视觉图示性挑战:评估视觉-语言模型在手语形式-意义映射上的表现
图示性,即语言形式与意义之间的相似性,在手语中普遍存在,为视觉定位提供了自然的测试平台。对于视觉-语言模型(VLMs),挑战在于从动态的人体动作中恢复这些基本的映射,而不是从静态的上下文中。我们引入了“视觉图示性挑战”,这是一个新颖的基于视频的基准测试,将心理语言学指标适应于评估VLMs在三项任务上的表现:(i)音位手语形式预测(如手势、位置),(ii)透明度(从视觉形式推断意义),(iii)图示性等级评分。我们评估了13个最先进的VLMs在零样本和少量样本设置下对荷兰手语的表现,并将其与人类基线进行比较。在音位形式预测方面,VLMs恢复了一些手势和位置细节,但低于人类表现;在透明度方面,它们远低于人类基线;只有顶级模型与人类的图示性评分有中等程度的相关性。有趣的是,具有更强音位形式预测能力的模型与人类图示性判断的相关性更好,表明它们对视觉定位结构具有共同的敏感性。我们的研究结果验证了这些诊断任务的有效性,并促进了以人类为中心的信号和具身学习方法来建模图示性并提高多模态模型的视觉定位能力。
Summary / 总结
The Visual Iconicity Challenge evaluates vision-language models on sign language form-meaning mapping by adapting psycholinguistic measures into three tasks: phonological sign-form prediction, transparency, and graded iconicity ratings. The study assesses 13 state-of-the-art VLMs on Sign Language of the Netherlands and finds that while models recover some handshape and location details in phonological form prediction, they perform poorly on transparency and only moderately correlate with human iconicity ratings. Models with better phonological form prediction correlate better with human iconicity judgments, suggesting shared sensitivity to visually grounded structure.
研究针对视觉语言模型(VLMs)在手语形式-意义映射任务中的挑战,特别是图标性。研究引入了视觉图标性挑战,该基准评估VLMs在三个任务上的表现:语音学的手语形式预测、透明度和分级图标性评分。研究发现,虽然VLMs可以在手型和位置细节上取得一些进展,但在透明度任务上表现不佳,仅在人类图标性评分上有中等程度的相关性,表明需要能够更好地捕捉视觉基础结构的模型。这项工作验证了这些诊断任务的有效性,并强调了人类为中心的信号和具身学习方法对于改善VLMs在多模态环境中的图标性和视觉定位的重要性。
Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
Authors: Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li
First: 2025-10-09T17:20:44+00:00 · Latest: 2025-10-09T17:20:44+00:00
Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable
potential in bridging visual and textual reasoning, yet their reliance on
text-centric priors often limits their ability to disentangle semantically
similar actions in open-vocabulary scenarios. To address this, we propose
Video-STAR, a framework that harmonizes contextual sub-motion decomposition
with tool-augmented reinforcement learning for open-vocabulary action
recognition (OVAR). Unlike prior methods that treat actions as monolithic
entities, our approach innovatively decomposes actions into discriminative
sub-motions for fine-grained matching while dynamically invoking
domain-specific tools for cross-modal interleaving, thereby enabling
category-specific reasoning capacity and reducing cross-modal hallucination.
Moreover, by designing a hierarchical reward that balances tool-usage
efficiency, sub-motion relevance, and structural coherence in reasoning, our
method autonomously leverages external tools to prioritize sub-motion patterns
without explicit supervision, transitioning from text-centric reasoning to
visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2,
Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art
performance, outperforming existing methods in distinguishing fine-grained
actions and handling cross-modal hallucination, validating our excellent
robustness and generalization.
中文标题/摘要
标题:Video-STAR:通过工具强化开放词汇动作识别
多模态大型语言模型(MLLMs)在视觉和文本推理方面展现了显著的潜力,但它们对文本中心先验的依赖往往限制了其在开放词汇场景中区分语义相似动作的能力。为了解决这一问题,我们提出了Video-STAR框架,该框架将上下文子动作分解与工具增强的强化学习相结合,用于开放词汇动作识别(OVAR)。与以往方法将动作视为单一实体不同,我们的方法创新地将动作分解为具有区分性的子动作进行精细匹配,同时动态调用领域特定工具进行跨模态交织,从而实现类别特定的推理能力和减少跨模态幻觉。此外,通过设计一个分层奖励,平衡工具使用效率、子动作相关性和推理结构连贯性,我们的方法自主利用外部工具优先考虑子动作模式,从文本中心推理过渡到视觉接地推理。在HMDB-51、UCF-101、SSv2、Kinetics-400和Kinetics-600数据集上的广泛评估表明,我们的方法在区分精细动作和处理跨模态幻觉方面表现出最先进的性能,验证了我们卓越的鲁棒性和泛化能力。
Summary / 总结
Video-STAR is a framework that decomposes actions into discriminative sub-motions and uses tool-augmented reinforcement learning for open-vocabulary action recognition. It outperforms existing methods by reducing cross-modal hallucination and enhancing category-specific reasoning. Extensive evaluations on multiple datasets show superior performance in distinguishing fine-grained actions.
Video-STAR 是一种结合了上下文子运动分解和工具增强强化学习的框架,以提升开放词汇动作识别。不同于以往方法,它将动作分解为可区分的子运动进行精细匹配,并使用领域特定工具进行跨模态交织。该方法自主利用外部工具优先处理子运动模式,减少跨模态幻觉。在 HMDB-51、UCF-101、SSv2、Kinetics-400 和 Kinetics-600 数据集上的实验表明,Video-STAR 在区分精细动作和处理跨模态幻觉方面优于现有方法,展示了其鲁棒性和泛化能力。
Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling
Authors: Bianca-Mihaela Ganescu, Suchir Salhan, Andrew Caines, Paula Buttery
Venue: EMNLP 2025
First: 2025-10-09T17:10:36+00:00 · Latest: 2025-10-09T17:10:36+00:00
Comments: Accepted to the EMNLP 2025 BabyLM Workshop
Abstract
Training vision-language models on cognitively-plausible amounts of data
requires rethinking how models integrate multimodal information. Within the
constraints of the Vision track for the BabyLM Challenge 2025, we propose a
lightweight decoder-based architecture with (1) token-wise dynamic gating for
adaptive fusion of linguistic and visual cues, (2) feature modulation and
channel attention to maximise the utility of limited visual information and (3)
auxiliary contrastive objectives for visual grounding. Evaluation on five
benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows
competitive or superior performance to multimodal baselines. More notably, our
dynamic gate discovers interpretable patterns without explicit supervision,
favouring visual cues for content words and linguistic cues for function words.
While we identify limitations in the Challenge constraints, such as the
information bottleneck created by global image embeddings and training
instability from the dataset split, our findings establish dynamic gating as a
powerful tool for efficient multimodal learning, offering both interpretability
and performance even under severe constraints.
中文标题/摘要
标题:欲学习:低资源视觉-语言建模的令牌级动态门控
在认知上合理的数据量上训练视觉-语言模型需要重新思考模型如何整合多模态信息。在BabyLM挑战赛2025视觉赛道的约束下,我们提出了一种轻量级解码器为基础的架构,包括(1)令牌级动态门控以适应性融合语言和视觉线索,(2)特征调制和通道注意力以最大化有限视觉信息的效用,以及(3)辅助对比目标以实现视觉定位。在五个基准(BLiMP、BLiMP补充、EWoK、Winoground和VQA)上的评估显示,我们的模型在多模态基线模型上具有竞争力或更优性能。更值得注意的是,我们的动态门控在没有显式监督的情况下发现了可解释的模式,倾向于使用视觉线索来处理内容词,使用语言线索来处理功能词。尽管我们在挑战约束中发现了局限性,如由全局图像嵌入创建的信息瓶颈以及从数据集划分引起的训练不稳定,但我们的发现确立了动态门控作为高效多模态学习的强大工具,即使在严重约束下也能提供可解释性和性能。
Summary / 总结
This paper aims to improve low-resource vision-language models by proposing a lightweight decoder-based architecture with token-wise dynamic gating for adaptive fusion of linguistic and visual cues, feature modulation, and channel attention to maximize the utility of limited visual information, and auxiliary contrastive objectives for visual grounding. The model shows competitive or superior performance on five benchmarks compared to multimodal baselines. Notably, the dynamic gate discovers interpretable patterns without explicit supervision, favoring visual cues for content words and linguistic cues for function words.
本文针对有限数据训练视觉语言模型的挑战,提出了一种轻量级解码器架构,包含基于token的动态门控以适应性融合语言和视觉线索、特征调制和通道注意力以增强有限视觉信息的利用,以及辅助对比目标以实现视觉定位。该模型在五个基准测试(BLiMP、BLiMP补充、EWoK、Winoground和VQA)上表现出与多模态基线相当或更优的性能,动态门控在无显式监督的情况下发现了可解释的模式,倾向于为内容词使用视觉线索,为功能词使用语言线索。
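The abstract names token-wise dynamic gating but does not give its equations. A minimal sketch, assuming a scalar sigmoid gate per token that mixes that token's text features with a global image embedding (the function names, the linear scoring form, and the convex mixing are assumptions, not the paper's architecture):

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(text_vec, image_vec, w_text, w_image, bias=0.0):
    """Fuse one token's text features with an image embedding.

    gate = sigmoid(w_text . text_vec + w_image . image_vec + bias)
    fused = gate * image_vec + (1 - gate) * text_vec
    All weights here are hypothetical; in practice they would be learned.
    """
    score = bias
    score += sum(w * t for w, t in zip(w_text, text_vec))
    score += sum(w * v for w, v in zip(w_image, image_vec))
    gate = sigmoid(score)
    fused = [gate * v + (1.0 - gate) * t for t, v in zip(text_vec, image_vec)]
    return gate, fused


# With zero weights the gate is 0.5, so the result is a plain average.
gate, fused = gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0])
```

Under this reading, a gate near 1 means the token leans on visual cues and a gate near 0 means it leans on linguistic cues, which is the kind of per-token value one would inspect to compare content words with function words.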
VideoVerse: How Far is Your T2V Generator from a World Model?
Authors: Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, Lei Zhang
First: 2025-10-09T16:18:20+00:00 · Latest: 2025-10-09T16:18:20+00:00
Comments: 24 Pages, 8 Figures, 11 Tables
Abstract
The recent rapid advancement of Text-to-Video (T2V) generation technologies,
which are critical to building "world models", makes the existing benchmarks
increasingly insufficient to evaluate state-of-the-art T2V models. First,
current evaluation dimensions, such as per-frame aesthetic quality and temporal
consistency, are no longer able to differentiate state-of-the-art T2V models.
Second, event-level temporal causality, which not only distinguishes video from
other modalities but also constitutes a crucial component of world models, is
severely underexplored in existing benchmarks. Third, existing benchmarks lack
a systematic assessment of world knowledge, which is an essential capability
for building world models. To address these issues, we introduce VideoVerse, a
comprehensive benchmark that focuses on evaluating whether a T2V model could
understand complex temporal causality and world knowledge in the real world. We
collect representative videos across diverse domains (e.g., natural landscapes,
sports, indoor scenes, science fiction, chemical and physical experiments) and
extract their event-level descriptions with inherent temporal causality, which
are then rewritten into text-to-video prompts by independent annotators. For
each prompt, we design a suite of binary evaluation questions from the
perspective of dynamic and static properties, with a total of ten carefully
defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully
curated prompts, involving 815 events and 793 binary evaluation questions.
We then develop a human-preference-aligned, QA-based evaluation pipeline
using modern vision-language models. Finally, we perform a
systematic evaluation of state-of-the-art open-source and closed-source T2V
models on VideoVerse, providing in-depth analysis on how far the current T2V
generators are from world models.
中文标题/摘要
标题:VideoVerse: 你的T2V生成器距离世界模型还有多远?
近期文本到视频(T2V)生成技术的迅速发展,这些技术对于构建“世界模型”至关重要,使得现有的基准越来越不足以评估最先进的T2V模型。首先,当前的评估维度,如每帧的美学质量和时间一致性,已不再能够区分最先进的T2V模型。其次,事件级的时间因果关系,不仅能够区分视频与其他模态,也是世界模型的关键组成部分,但在现有基准中严重缺乏探索。第三,现有的基准缺乏对世界知识的系统评估,这是构建世界模型所需的重要能力。为了解决这些问题,我们引入了VideoVerse,这是一个全面的基准,旨在评估T2V模型是否能够理解现实世界中的复杂时间因果关系和世界知识。我们收集了跨多个领域(如自然景观、体育、室内场景、科幻、化学和物理实验)的代表性视频,并提取了具有内在时间因果关系的事件级描述,这些描述随后由独立的注释者重写为文本到视频提示。对于每个提示,我们从动态和静态属性的角度设计了一系列二元评估问题,总共定义了十个精心设计的评估维度。总共,我们的VideoVerse包含300个精心策划的提示,涉及815个事件和793个二元评估问题。因此,我们通过使用现代视觉语言模型开发了一个与人类偏好对齐的问答式评估流水线。最后,我们在VideoVerse上系统地评估了最先进的开源和闭源T2V模型,深入分析了当前T2V生成器与世界模型之间的差距。
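The abstract states that each prompt comes with binary evaluation questions grouped into ten dimensions, but it does not say how answers are aggregated. One simple possibility, assuming the score for a dimension is the fraction of its questions answered affirmatively (the function name and the dimension labels below are placeholders, not VideoVerse's actual dimensions or scoring rule):

```python
from collections import defaultdict


def score_by_dimension(results):
    """Aggregate binary question outcomes into a pass rate per dimension.

    results: iterable of (dimension_name, passed) pairs, where passed is a bool
    produced by some judge (for example a vision-language model).
    """
    totals = defaultdict(lambda: [0, 0])  # dimension -> [passed, asked]
    for dimension, passed in results:
        totals[dimension][0] += int(passed)
        totals[dimension][1] += 1
    return {name: hit / asked for name, (hit, asked) in totals.items()}


# Placeholder dimension names, used only to show the shape of the data.
toy_results = [("dim_a", True), ("dim_a", False), ("dim_b", True)]
toy_scores = score_by_dimension(toy_results)
```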
Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception
Authors: Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising
First: 2025-10-09T15:38:41+00:00 · Latest: 2025-10-09T15:38:41+00:00
Abstract
Vision-Language Models (VLMs) are becoming increasingly powerful,
demonstrating strong performance on a variety of tasks that require both visual
and textual understanding. Their strong generalisation abilities make them a
promising component for automated driving systems, which must handle unexpected
corner cases. However, to be trusted in such safety-critical applications, a
model must first possess a reliable perception system. Moreover, since critical
objects and agents in traffic scenes are often at a distance, we require
systems that are not "shortsighted", i.e., systems with strong perception
capabilities at both close (up to 20 meters) and long (30+ meters) range. With
this in mind, we introduce Distance-Annotated Traffic Perception Question
Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused
solely on perception-based questions in traffic scenes, enriched with distance
annotations. By excluding questions that require reasoning, we ensure that
model performance reflects perception capabilities alone. Since automated
driving hardware has limited processing power and cannot support large VLMs,
our study centers on smaller VLMs. More specifically, we evaluate several
state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the
simplicity of the questions, these models significantly underperform compared
to humans (~60% average accuracy for the best-performing small VLM versus ~85%
human performance). However, it is important to note that the human sample size
was relatively small, which imposes statistical limitations. We also identify
specific perception tasks, such as distinguishing left from right, that remain
particularly challenging for these models.
中文标题/摘要
标题:评估小型视觉-语言模型在距离依赖交通感知上的表现
视觉-语言模型(VLMs)变得越来越强大,展示了在需要视觉和文本理解的各种任务中表现出色的能力。它们强大的泛化能力使它们成为自动驾驶系统的一个有前途的组成部分,这些系统必须处理意外的边缘情况。然而,要在这种安全关键的应用中获得信任,一个模型首先必须具备可靠的感知系统。此外,由于交通场景中的关键物体和代理通常处于远处,我们要求系统不是“短视”的,即在近距离(20米以内)和远距离(30米以上)范围内都具有强大的感知能力。基于此,我们引入了距离标注交通感知问答(DTPQA),这是第一个专注于交通场景中基于感知的问题的视觉问答(VQA)基准,其中包含距离标注。通过排除需要推理的问题,我们确保模型性能反映的是感知能力。由于自动驾驶硬件的处理能力有限,无法支持大型VLMs,我们的研究集中在较小的VLMs上。具体来说,我们在DTPQA上评估了几种最先进的(SOTA)小型VLMs,结果显示,尽管问题很简单,但这些模型的表现显著低于人类(最佳小型VLM的平均准确率为约60%,而人类的准确率为约85%)。然而,需要注意的是,人类样本量相对较小,这带来了统计上的限制。我们还确定了一些特定的感知任务,例如区分左和右,这些任务对这些模型来说仍然特别具有挑战性。
Summary / 总结
This study evaluates small Vision-Language Models (VLMs) on their ability to perceive traffic scenes at close (up to 20 meters) and long (30+ meters) range, introducing DTPQA, a perception-only VQA benchmark with distance annotations. Despite the simplicity of the questions, the best-performing small VLM reaches only about 60% average accuracy versus about 85% for humans, although the human sample was relatively small. Specific perception tasks, such as distinguishing left from right, remain particularly challenging for these models.
该研究评估了小型视觉-语言模型在不同距离感知交通场景的能力,引入了DTPQA作为新的基准。研究发现,小型VLM在需要感知的任务上显著低于人类的表现,平均准确率仅为约60%,而人类的得分为约85%。研究强调了模型需要提高其感知能力,特别是对远处物体的感知。
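Distance annotations make it possible to report accuracy separately for each range. The sketch below assumes the two ranges named in the abstract (close: up to 20 meters, long: 30 meters and beyond). The abstract does not say how distances between 20 and 30 meters are handled, so they are labeled "unbinned" here, and the record format is hypothetical:

```python
def distance_bin(distance_m):
    """Map an annotated distance in meters to a range label."""
    if distance_m <= 20:
        return "close"
    if distance_m >= 30:
        return "long"
    return "unbinned"  # the 20-30 m gap is not defined in the abstract


def accuracy_by_range(records):
    """records: iterable of (distance_m, correct) pairs, correct being a bool."""
    counts = {}
    for distance_m, correct in records:
        label = distance_bin(distance_m)
        hit, total = counts.get(label, (0, 0))
        counts[label] = (hit + int(correct), total + 1)
    return {label: hit / total for label, (hit, total) in counts.items()}


toy_records = [(5, True), (15, False), (40, True), (25, True)]
toy_accuracy = accuracy_by_range(toy_records)
```

Comparing the "close" and "long" entries for a model is one way to check whether it is "shortsighted" in the sense the abstract describes.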
A Multimodal Depth-Aware Method For Embodied Reference Understanding
Authors: Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel
First: 2025-10-09T14:32:21+00:00 · Latest: 2025-10-09T14:32:21+00:00
Abstract
Embodied Reference Understanding requires identifying a target object in a
visual scene based on both language instructions and pointing cues. While prior
works have shown progress in open-vocabulary object detection, they often fail
in ambiguous scenarios where multiple candidate objects exist in the scene. To
address these challenges, we propose a novel ERU framework that jointly
leverages LLM-based data augmentation, depth-map modality, and a depth-aware
decision module. This design enables robust integration of linguistic and
embodied cues, improving disambiguation in complex or cluttered environments.
Experimental results on two datasets demonstrate that our approach
significantly outperforms existing baselines, achieving more accurate and
reliable referent detection.
中文标题/摘要
标题:一种多模态深度感知方法用于体态参考理解
体态参考理解需要根据语言指令和指示手势在视觉场景中识别目标物体。尽管先前的工作在开放词汇对象检测方面取得了进展,但在存在多个候选物体的模糊场景中往往失败。为了解决这些挑战,我们提出了一种新颖的ERU框架,该框架联合利用基于LLM的数据增强、深度图模态和深度感知决策模块。这种设计能够稳健地整合语言和体态线索,提高在复杂或杂乱环境中消歧的效果。在两个数据集上的实验结果表明,我们的方法显著优于现有基线,实现了更准确和可靠的指代检测。
Summary / 总结
The research aims to improve embodied reference understanding by addressing ambiguities in visual scenes. The proposed method combines language instructions with depth-map information and a depth-aware decision module, enhancing the disambiguation of target objects. Experiments on two datasets show that this approach outperforms existing methods, leading to more accurate and reliable object detection in complex environments.
研究旨在通过解决视觉场景中的歧义问题来提升体态参考理解。提出的方法结合语言指令和指向线索,采用一种新的ERU框架,包括基于LLM的数据增强、深度图模态和深度感知决策模块。实验结果表明,该方法在两个数据集上优于现有方法,能够更准确和可靠地检测目标对象。
Chain-of-Trigger: An Agentic Backdoor that Paradoxically Enhances Agentic Robustness
Authors: Jiyang Qiu, Xinbei Ma, Yunqing Xu, Zhuosheng Zhang, Hai Zhao
First: 2025-10-09T14:01:43+00:00 · Latest: 2025-10-09T14:01:43+00:00
Abstract
The rapid deployment of large language model (LLM)-based agents in real-world
applications has raised serious concerns about their trustworthiness. In this
work, we reveal the security and robustness vulnerabilities of these agents
through backdoor attacks. Distinct from traditional backdoors limited to
single-step control, we propose the Chain-of-Trigger Backdoor (CoTri), a
multi-step backdoor attack designed for long-horizon agentic control. CoTri
relies on an ordered sequence. It starts with an initial trigger, and
subsequent ones are drawn from the environment, allowing multi-step
manipulation that diverts the agent from its intended task. Experimental
results show that CoTri achieves a near-perfect attack success rate (ASR) while
maintaining a near-zero false trigger rate (FTR). Due to training data modeling
the stochastic nature of the environment, the implantation of CoTri
paradoxically enhances the agent's performance on benign tasks and even
improves its robustness against environmental distractions. We further validate
CoTri on vision-language models (VLMs), confirming its scalability to
multimodal agents. Our work highlights that CoTri achieves stable, multi-step
control within agents, improving their inherent robustness and task
capabilities, which ultimately makes the attack more stealthy and raises
potential safety risks.
中文标题/摘要
标题:触发链:一种悖论性地增强自主鲁棒性的自主后门
基于大型语言模型(LLM)的代理在实际应用中的快速部署引发了对其可信度的重大担忧。在本工作中,我们通过后门攻击揭示了这些代理的安全性和鲁棒性漏洞。不同于传统的仅限单步控制的后门,我们提出了触发链后门(CoTri),这是一种为长期自主控制设计的多步后门攻击。CoTri 依赖于有序序列。它始于初始触发器,随后的触发器来自环境,允许多步操纵,使代理偏离其预定任务。实验结果表明,CoTri 实现了近乎完美的攻击成功率(ASR)同时保持了近乎零的误触发率(FTR)。由于训练数据模拟了环境的随机性,CoTri 的植入反而增强了代理在良性任务上的性能,甚至提高了其对环境干扰的鲁棒性。我们进一步在视觉语言模型(VLMs)上验证了 CoTri,证实了其对多模态代理的可扩展性。我们的工作突显了 CoTri 在代理中实现稳定多步控制,提高其固有鲁棒性和任务能力,最终使攻击更加隐蔽并引发了潜在的安全风险。
Summary / 总结
This paper exposes security and robustness vulnerabilities of large language model-based agents through a multi-step backdoor attack called Chain-of-Trigger (CoTri). Unlike traditional single-step backdoors, CoTri relies on an ordered sequence of triggers: an initial trigger followed by further ones drawn from the environment, which together divert the agent from its intended task over a long horizon. Experiments show a near-perfect attack success rate with a near-zero false trigger rate. Paradoxically, implanting CoTri also improves the agent's performance on benign tasks and its robustness to environmental distractions, which makes the attack stealthier and raises safety concerns.
本文通过一种名为Chain-of-Trigger (CoTri) 的新型多步后门攻击,揭示了基于大型语言模型的代理的安全和鲁棒性漏洞。与传统的单步后门不同,CoTri 依赖于一个有序的触发序列,实现长期操控。实验结果表明,CoTri 能够实现高攻击成功率且无误触发,有趣的是,其植入反而提升了代理在良性任务上的性能,并增强了其对环境干扰的鲁棒性。这项工作突显了CoTri 可以通过提高代理的内在鲁棒性和任务能力,使其攻击更加隐蔽,从而引发潜在的安全风险。
TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
Authors: Yi Han, Cheng Chi, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang
First: 2025-10-08T16:20:23+00:00 · Latest: 2025-10-09T13:56:25+00:00
Comments: 9 pages, 6 figures
Abstract
Vision-Language Models (VLMs) have shown remarkable capabilities in spatial
reasoning, yet they remain fundamentally limited to qualitative precision and
lack the computational precision required for real-world robotics. Current
approaches fail to leverage metric cues from depth sensors and camera
calibration, instead reducing geometric problems to pattern recognition tasks
that cannot deliver the centimeter-level accuracy essential for robotic
manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel
framework that transforms VLMs from perceptual estimators to geometric
computers by enabling them to generate and execute precise geometric
computations through external tools. Rather than attempting to internalize
complex geometric operations within neural networks, TIGeR empowers models to
recognize geometric reasoning requirements, synthesize appropriate
computational code, and invoke specialized libraries for exact calculations. To
support this paradigm, we introduce TIGeR-300K, a comprehensive
tool-invocation-oriented dataset covering point transformations, pose
estimation, and spatial compatibility verification, complete with tool
invocation sequences and intermediate computations. Through a two-stage
training pipeline combining supervised fine-tuning (SFT) and reinforcement
fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves
SOTA performance on geometric reasoning benchmarks while demonstrating
centimeter-level precision in real-world robotic manipulation tasks.
中文标题/摘要
标题:TIGeR: 工具集成几何推理在视觉-语言模型中的应用以实现机器人技术
视觉-语言模型(VLMs)在空间推理方面表现出色,但它们本质上仍局限于定性的精确度,并缺乏实现现实世界机器人技术所需的计算精确度。当前的方法未能利用深度传感器和相机校准的度量线索,而是将几何问题简化为模式识别任务,这些任务无法提供机器人操作所需的厘米级精度。我们提出了TIGeR(Tool-Integrated Geometric Reasoning),这是一种新颖的框架,通过使VLMs能够生成和执行精确的几何计算,从而将它们从感知估计器转变为几何计算机。TIGeR 不试图在神经网络中内化复杂的几何操作,而是赋予模型识别几何推理需求、合成适当的计算代码并调用专门的库进行精确计算的能力。为了支持这一范式,我们引入了TIGeR-300K,这是一个全面的工具调用导向数据集,涵盖了点变换、姿态估计和空间兼容性验证,包括工具调用序列和中间计算。通过结合监督微调(SFT)和强化微调(RFT)以及我们提出的分层奖励设计的两阶段训练管道,TIGeR 在几何推理基准测试中达到了最佳性能,同时在实际机器人操作任务中展示了厘米级的精度。
Summary / 总结
TIGeR is a novel framework that enhances Vision-Language Models (VLMs) for geometric reasoning in robotics by integrating external tools. This approach enables VLMs to generate and execute precise geometric computations, moving beyond qualitative reasoning to achieve centimeter-level accuracy. TIGeR uses a two-stage training pipeline combining supervised and reinforcement fine-tuning to achieve state-of-the-art performance on geometric reasoning benchmarks and demonstrates practical precision in real-world robotic tasks.
TIGeR 是一种新型框架,通过集成外部工具来增强视觉-语言模型(VLMs)在机器人中的几何推理能力。这种方法使 VLMs 能够生成并执行精确的几何计算,超越了定性推理,实现了厘米级的精度。TIGeR 使用结合监督微调和强化微调的两阶段训练管道,实现了几何推理基准的最先进性能,并在实际机器人任务中展示了实际的精度。
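The TIGeR abstract refers to using metric cues from depth sensors and camera calibration for point transformations. The following is a minimal sketch of that kind of exact computation under the standard pinhole camera model, not the authors' tool code. The function names `backproject_pixel` and `transform_point` and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth_m: float,
                      fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame.

    Uses the pinhole model: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point. Names are illustrative assumptions.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def transform_point(point: Vec3,
                    rotation: Sequence[Sequence[float]],
                    translation: Vec3) -> Vec3:
    """Apply a rigid transform p' = R p + t (for example camera frame to robot base)."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


if __name__ == "__main__":
    # Hypothetical intrinsics and a pixel observed at 2 m depth.
    p_cam = backproject_pixel(420, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    # Hypothetical extrinsics: 90 degree rotation about z plus a 1 m shift in x.
    R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    print(transform_point(p_cam, R, (1.0, 0.0, 0.0)))
```

Delegating this arithmetic to code instead of asking the model to estimate it is the design choice the abstract argues for: the result is exact up to sensor and calibration error.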
Approximate Domain Unlearning for Vision-Language Models
Authors: Kodai Kawamura, Yuta Goto, Rintaro Yanagi, Hirokatsu Kataoka, Go Irie
Venue: NeurIPS 2025 Spotlight
First: 2025-10-09T12:17:59+00:00 · Latest: 2025-10-09T12:17:59+00:00
Comments: NeurIPS 2025 (Spotlight)
Abstract
Pre-trained Vision-Language Models (VLMs) exhibit strong generalization
capabilities, enabling them to recognize a wide range of objects across diverse
domains without additional training. However, they often retain irrelevant
information beyond the requirements of specific downstream tasks, raising
concerns about computational efficiency and potential information leakage. This
has motivated growing interest in approximate unlearning, which aims to
selectively remove unnecessary knowledge while preserving overall model
performance. Existing approaches to approximate unlearning have primarily
focused on class unlearning, where a VLM is retrained to fail to recognize
specified object classes while maintaining accuracy for others. However, merely
forgetting object classes is often insufficient in practical applications. For
instance, an autonomous driving system should accurately recognize real cars
while avoiding misrecognition of illustrated cars depicted in roadside
advertisements as real cars, which could be hazardous. In this paper, we
introduce Approximate Domain Unlearning (ADU), a novel problem setting that
requires reducing recognition accuracy for images from specified domains (e.g.,
illustration) while preserving accuracy for other domains (e.g., real). ADU
presents new technical challenges: due to the strong domain generalization
capability of pre-trained VLMs, domain distributions are highly entangled in
the feature space, making naive approaches based on penalizing target domains
ineffective. To tackle this limitation, we propose a novel approach that
explicitly disentangles domain distributions and adaptively captures
instance-specific domain information. Extensive experiments show that our
approach outperforms baselines built upon VLM tuning techniques, paving the way
for practical and fine-grained unlearning in VLMs. Code:
https://kodaikawamura.github.io/Domain_Unlearning/.
中文标题/摘要
标题:视觉语言模型的近似域遗忘
预训练的视觉语言模型(VLMs)具有强大的泛化能力,能够在无需额外训练的情况下识别各种对象,跨越不同的领域。然而,它们往往会保留超出特定下游任务需求的相关信息,这引发了关于计算效率和潜在信息泄露的担忧。这激发了对近似遗忘的兴趣,其目标是在保留整体模型性能的同时,选择性地移除不必要的知识。现有的近似遗忘方法主要集中在类别遗忘,即重新训练VLM使其无法识别指定的对象类别,同时保持对其他类别的准确性。然而,在实际应用中,仅仅忘记对象类别往往是不够的。例如,自动驾驶系统应该准确识别真实的汽车,同时避免将路旁广告中描绘的汽车误认为真实的汽车,这可能会造成危险。在本文中,我们提出了近似域遗忘(ADU),这是一种新的问题设置,要求减少来自指定领域(例如,插图)的图像识别准确性,同时保持对其他领域(例如,真实)的准确性。ADU提出了新的技术挑战:由于预训练VLMs具有强大的域泛化能力,域分布高度纠缠在特征空间中,基于惩罚目标域的简单方法无效。为了解决这一局限性,我们提出了一种新的方法,明确地解纠缠域分布,并自适应地捕捉实例特定的域信息。广泛的实验表明,我们的方法优于基于VLM调优技术的基线方法,为视觉语言模型中的实用和精细遗忘铺平了道路。代码:https://kodaikawamura.github.io/Domain_Unlearning/
Summary / 总结
This paper introduces Approximate Domain Unlearning (ADU), a problem setting for Vision-Language Models (VLMs) that requires reducing recognition accuracy for images from specified domains while preserving accuracy for other domains. The proposed method explicitly disentangles domain distributions and adaptively captures instance-specific domain information, outperforming baselines built upon VLM tuning techniques in experiments. The work is motivated by concerns about computational efficiency and information leakage, and by applications such as autonomous driving, where illustrated cars should not be mistaken for real ones.
论文针对Vision-Language模型(VLM)的近似域卸载问题,旨在减少特定域的识别准确性同时保持对其他域的性能。作者引入了一个新的问题设置,称为近似域卸载(ADU),并提出了一种方法,该方法将域分布分离,并适应性地捕捉实例特定的域信息。实验表明,他们的方法优于现有的VLM调优技术,为VLM中的实际卸载提供了有前景的解决方案。
TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
Authors: Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Tianfu Wang, Kun Fu, Zheng Wang, Hui Xiong
First: 2024-11-05T07:56:24+00:00 · Latest: 2025-10-09T12:05:04+00:00
Comments: Accepted by EMNLP2025
Abstract
Rapid advances in Large Language Models (LLMs) have spurred demand for
processing extended context sequences in contemporary applications. However,
this progress faces two challenges: performance degradation due to
out-of-distribution sequence lengths, and excessively long inference times caused by the
quadratic computational complexity of attention. These issues limit LLMs in
long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache
Selection (TokenSelect), a training-free method for efficient and accurate
long-context inference. TokenSelect builds upon the observation of
non-contiguous attention sparsity, using QK dot products to measure per-head KV
Cache criticality at the token level. Through a per-head soft voting mechanism,
TokenSelect selectively involves only a few critical KV cache tokens in the attention
calculation without sacrificing accuracy. To further accelerate TokenSelect, we
design the Selection Cache based on observations of consecutive Query
similarity and implement the efficient Paged Dot Product Kernel,
significantly reducing the selection overhead. A comprehensive evaluation of
TokenSelect demonstrates up to $23.84\times$ speedup in attention computation
and up to $2.28\times$ acceleration in end-to-end latency, while providing
superior performance compared to state-of-the-art long-context inference
methods.
中文标题/摘要
标题:TokenSelect:通过动态选择令牌级KV缓存实现高效长上下文推理和长度外推
大型语言模型(LLMs)的迅速发展推动了现代应用中处理扩展上下文序列的需求。然而,这一进展面临两个挑战:由于序列长度超出分布范围导致的性能下降,以及由于注意力机制的二次计算复杂性引起的推理时间过长。这些问题限制了LLMs在长上下文场景中的应用。本文提出了一种无需训练的方法——动态令牌级KV缓存选择(TokenSelect),以实现高效准确的长上下文推理。TokenSelect 基于非连续注意力稀疏性的观察,使用QK点积来衡量每个头在令牌级的KV缓存关键性。通过每个头的软投票机制,TokenSelect 选择性地参与少量关键KV缓存令牌的注意力计算,而不牺牲准确性。为了进一步加速TokenSelect,我们基于连续查询相似性的观察设计了选择缓存,并实现了高效的分页点积内核,显著减少了选择开销。TokenSelect 的全面评估显示,在注意力计算中可实现高达23.84倍的加速,在端到端延迟中可实现高达2.28倍的加速,同时在长上下文推理方法中提供更优的性能。
Summary / 总结
TokenSelect is a training-free method for efficient long-context inference in LLMs, addressing performance degradation and long inference times. It uses QK dot products to measure the criticality of KV cache tokens and a per-head soft voting mechanism to selectively involve only a few critical tokens in attention calculation. TokenSelect also includes a Selection Cache and an efficient Paged Dot Product Kernel to further accelerate the process. Experimental results show up to 23.84 times speedup in attention computation and 2.28 times acceleration in end-to-end latency, outperforming existing methods.
TokenSelect 是一种无需训练的方法,用于在大语言模型中高效进行长上下文推理,解决性能下降和长时间推理的问题。它通过基于 QK 点积测量每个头的缓存关键性,并使用每头软投票机制仅选择关键的 KV 缓存令牌参与注意力计算。该方法在注意力计算中实现了高达 23.84 倍的加速,并在端到端延迟上实现了 2.28 倍的加速,优于现有长上下文推理方法。
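The TokenSelect selection step described above (per-head QK dot products followed by a soft vote across heads) can be sketched in plain Python. This is an illustration under stated assumptions, not the released implementation: the soft vote is modeled here as summing per-head softmax-normalized scores, and the function name `select_critical_tokens` and the data layout are hypothetical. The Selection Cache and the Paged Dot Product Kernel are not modeled.

```python
import math
from typing import List


def select_critical_tokens(query: List[List[float]],
                           keys: List[List[List[float]]],
                           k: int) -> List[int]:
    """Pick k cached token indices for one decoding step.

    query: [heads][dim] vectors for the current token.
    keys:  [heads][tokens][dim] cached key vectors.
    Assumption: each head casts a "soft vote" equal to its softmax over
    QK dot products, and votes are summed across heads.
    """
    num_tokens = len(keys[0])
    votes = [0.0] * num_tokens
    for q_h, keys_h in zip(query, keys):
        # Per-head criticality: dot product of the query with every cached key.
        scores = [sum(a * b for a, b in zip(q_h, key)) for key in keys_h]
        # Numerically stable softmax so each head contributes a total vote of 1.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        for t, e in enumerate(exps):
            votes[t] += e / total
    top = sorted(range(num_tokens), key=lambda t: votes[t], reverse=True)[:k]
    return sorted(top)  # keep original token order for the attention call


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    cache = [
        [[5.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0]],  # head 0 favors token 0
        [[0.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 4.0]],  # head 1 favors token 3
    ]
    print(select_critical_tokens(q, cache, k=2))
```

Normalizing per head before summing keeps a single head with large raw scores from dominating the selection, which is one plausible motivation for voting rather than summing raw dot products.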
Language learning shapes visual category-selectivity in deep neural networks
Authors: Zitong Lu, Yuxin Wang
First: 2025-02-23T06:15:51+00:00 · Latest: 2025-10-09T11:58:58+00:00
Abstract
Category-selective regions in the human brain, such as the fusiform face area
(FFA), extrastriate body area (EBA), parahippocampal place area (PPA), and
visual word form area (VWFA), support high-level visual recognition. Here, we
investigate whether artificial neural networks (ANNs) exhibit analogous
category-selective neurons and how these representations are shaped by language
experience. Using an fMRI-inspired functional localizer approach, we identified
face-, body-, place-, and word-selective neurons in deep networks presented
with category images and scrambled controls. Both the purely visual ResNet and
a linguistically supervised Lang-Learned ResNet contained category-selective
neurons that increased in proportion across layers. However, compared to the
vision-only model, the Lang-Learned ResNet showed a greater number but lower
specificity of category-selective neurons, along with reduced spatial
localization and attenuated activation strength, indicating a shift toward more
distributed, semantically aligned coding. These effects were replicated in the
large-scale vision-language model CLIP. Together, our findings reveal that
language experience systematically reorganizes visual category representations
in ANNs, providing a computational parallel to how linguistic context may shape
categorical organization in the human brain.
中文标题/摘要
标题:语言学习塑造深度神经网络中的视觉类别选择性
人类大脑中的类别选择性区域,如梭形面孔区(FFA)、外侧视皮层体区(EBA)、海马旁回地点区(PPA)和视觉单词形式区(VWFA),支持高级视觉识别。在这里,我们研究人工神经网络(ANNs)是否表现出类似的类别选择性神经元,以及这些表示如何受到语言经验的影响。使用一种基于fMRI的功能局部化方法,我们在向深层网络展示类别图像和杂乱控制时,识别出了面孔选择性、身体选择性、地点选择性和单词选择性神经元。无论是纯粹视觉的ResNet还是语言监督的Lang-Learned ResNet,都包含随着层次增加比例增加的类别选择性神经元。然而,与仅视觉模型相比,Lang-Learned ResNet显示出更多的但更不具体的类别选择性神经元,空间定位降低,激活强度减弱,表明向更分布式、语义对齐编码的转变。这些效应在大规模的视觉-语言模型CLIP中也得到了复制。总之,我们的研究发现语言经验系统地重新组织了ANN中的视觉类别表示,为语言上下文如何在人类大脑中塑造类别组织提供了计算上的类比。
Summary / 总结
This study investigates whether artificial neural networks exhibit category-selective neurons similar to those in the human brain and how language experience shapes these representations. Using an fMRI-inspired approach, the study identified face-, body-, place-, and word-selective neurons in both a purely visual ResNet and a linguistically supervised Lang-Learned ResNet. The Lang-Learned ResNet showed a greater number but lower specificity of category-selective neurons, indicating a shift towards more distributed, semantically aligned coding compared to the vision-only model.
研究探讨了人工神经网络(ANNs)是否会在类似人类大脑的情况下发展出类别选择性神经元,以及语言经验如何影响这些表示。通过一种类似fMRI的功能局部化方法,研究在纯视觉ResNet和语言监督下的Lang-Learned ResNet中识别出了面部、身体、地点和单词选择性神经元。与纯视觉模型相比,语言监督下的ResNet显示出更多的但更不具体的类别选择性神经元,具有较低的空间定位和激活强度,表明其编码方式更倾向于分布式的、语义对齐的编码。
Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
Authors: Yachun Mi, Yu Li, Yanting Li, Chen Hui, Tong Zhang, Zhixuan Li, Chenyue Song, Wei Yang Bryan Lim, Shaohui Liu
First: 2025-08-08T07:36:01+00:00 · Latest: 2025-10-09T11:58:11+00:00
Abstract
Accurate and efficient Video Quality Assessment (VQA) has long been a key
research challenge. Current mainstream VQA methods typically improve
performance by pretraining on large-scale classification datasets (e.g.,
ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this
strategy presents two significant challenges: (1) merely transferring semantic
knowledge learned from pretraining is insufficient for VQA, as video quality
depends on multiple factors (e.g., semantics, distortion, motion, aesthetics);
(2) pretraining on large-scale datasets demands enormous computational
resources, often dozens or even hundreds of times greater than training
directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown
remarkable generalization capabilities across a wide range of visual tasks, and
have begun to demonstrate promising potential in quality assessment. In this
work, we propose Q-CLIP, the first fully VLM-based framework for VQA. Q-CLIP
enhances both visual and textual representations through a Shared Cross-Modal
Adapter (SCMA), which contains only a minimal number of trainable parameters
and is the only component that requires training. This design significantly
reduces computational cost. In addition, we introduce a set of five learnable
quality-level prompts to guide the VLMs in perceiving subtle quality
variations, thereby further enhancing the model's sensitivity to video quality.
Furthermore, we investigate the impact of different frame sampling strategies
on VQA performance, and find that frame-difference-based sampling leads to
better generalization performance across datasets. Extensive experiments
demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets.
中文标题/摘要
标题:Q-CLIP:通过统一跨模态适应释放视觉语言模型在视频质量评估中的潜力
准确高效的视频质量评估(VQA)一直是关键的研究挑战。当前主流的VQA方法通常通过在大规模分类数据集(如ImageNet、Kinetics-400)上预训练,然后在VQA数据集上微调来提高性能。然而,这种方法存在两个重大挑战:(1)仅从预训练中转移语义知识不足以进行VQA,因为视频质量取决于多个因素(如语义、失真、运动、美学);(2)在大规模数据集上预训练需要巨大的计算资源,通常比直接在VQA数据集上训练大几十甚至几百倍。最近,视觉语言模型(VLMs)在多种视觉任务上展示了出色的泛化能力,并开始在质量评估方面显示出有前景的潜力。在这项工作中,我们提出了Q-CLIP,这是第一个基于VLMs的VQA框架。Q-CLIP通过共享跨模态适配器(SCMA)增强视觉和文本表示,该适配器仅包含少量可训练参数,并且是唯一需要训练的组件。这种设计显著降低了计算成本。此外,我们引入了一组五个可学习的质量级别提示,以指导VLMs感知细微的质量变化,从而进一步增强了模型对视频质量的敏感性。此外,我们研究了不同的帧采样策略对VQA性能的影响,并发现基于帧差的采样策略在不同数据集上具有更好的泛化性能。广泛的实验表明,Q-CLIP在多个VQA数据集上表现出色。
Summary / 总结
Q-CLIP is a novel framework for Video Quality Assessment (VQA) built entirely on Vision-Language Models (VLMs). It uses a Shared Cross-Modal Adapter (SCMA) to enhance visual and textual representations with a minimal number of trainable parameters, reducing computational cost. Q-CLIP also introduces five learnable quality-level prompts to better perceive subtle quality variations and investigates frame sampling strategies, finding that frame-difference-based sampling improves generalization across datasets. Experiments show that Q-CLIP performs strongly on several VQA datasets.
Q-CLIP 是一种利用视觉语言模型(VLMs)进行视频质量评估(VQA)的新框架。它通过共享跨模态适配器(SCMA)增强视觉和文本表示,使用少量可训练参数,从而降低计算成本。Q-CLIP 还引入了可学习的质量级别提示,以更好地感知质量变化,并研究了不同的帧采样策略,发现基于帧差异的采样策略能提高泛化性能。实验表明,Q-CLIP 在多个 VQA 数据集上表现出色。
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
Authors: Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, Jiankang Deng
First: 2025-09-28T05:52:55+00:00 · Latest: 2025-10-09T11:54:14+00:00
Comments: LLaVA-OneVision-1.5 Technical Report
Abstract
We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models
(LMMs) that achieve state-of-the-art performance with significantly reduced
computational and financial costs. Unlike existing work,
LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for
building high-quality vision-language models entirely from scratch. The
LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale
Curated Datasets: We construct an 85M concept-balanced pretraining dataset
LLaVA-OneVision-1.5-Mid-Traning and a meticulously curated 22M instruction
dataset LLaVA-OneVision-1.5-Instruct. (2) Efficient Training Framework: We
develop a complete end-to-end efficient training framework leveraging an
offline parallel data packing strategy to facilitate the training of
LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance:
Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally
competitive performance across a broad range of downstream tasks. Specifically,
LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and
LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We
anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community
to await further updates.
中文标题/摘要
标题:LLaVA-OneVision-1.5:民主化多模态训练的完全开源框架
我们介绍了LLaVA-OneVision-1.5,这是一种新型的大型多模态模型(LMMs),在显著降低计算和财务成本的同时达到最先进的性能。与现有工作不同,LLaVA-OneVision-1.5 提供了一个完全从零开始构建高质量视觉语言模型的开放、高效和可重复的框架。LLaVA-OneVision-1.5 发布包括三个主要组件:(1)大规模平衡概念预训练数据集:我们构建了包含 8500 万概念平衡预训练数据集 LLaVA-OneVision-1.5-Mid-Traning 和精心整理的 2200 万指令数据集 LLaVA-OneVision-1.5-Instruct。 (2)高效训练框架:我们开发了一个完整的端到端高效训练框架,利用离线并行数据打包策略,使 LLaVA-OneVision-1.5 在 16000 美元的预算内进行训练成为可能。 (3)最先进的性能:实验结果表明,LLaVA-OneVision-1.5 在一系列下游任务中表现出色。具体而言,LLaVA-OneVision-1.5-8B 在 27 个基准测试中有 18 个优于 Qwen2.5-VL-7B,而 LLaVA-OneVision-1.5-4B 在所有 27 个基准测试中均优于 Qwen2.5-VL-3B。我们预计很快将发布 LLaVA-OneVision-1.5-RL,并鼓励社区关注后续更新。
Summary / 总结
LLaVA-OneVision-1.5 is a novel framework for building high-quality vision-language models with reduced computational and financial costs. It includes a large-scale curated dataset, an efficient training framework, and state-of-the-art performance across various downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 out of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks, achieved within a $16,000 budget.
LLaVA-OneVision-1.5 是一种新型框架,用于以较低的计算和财务成本构建高质量的视觉-语言模型。它包括大规模的精制数据集、高效的训练框架,并在多种下游任务中表现出最先进的性能。具体来说,LLaVA-OneVision-1.5-8B 在 27 个基准中的 18 个上优于 Qwen2.5-VL-7B,而 LLaVA-OneVision-1.5-4B 在所有 27 个基准上都优于 Qwen2.5-VL-3B,且在 $16,000 的预算内实现。
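The LLaVA-OneVision-1.5 abstract mentions an offline data packing strategy but does not describe the algorithm. As a generic illustration of what sequence packing does, the sketch below groups variable-length samples into fixed-size training contexts with a first-fit-decreasing heuristic. The heuristic, the function name `pack_sequences`, and the example lengths are assumptions and may differ from what the authors use.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into packs whose total token count fits max_len.

    First-fit decreasing: visit samples from longest to shortest and place
    each into the first pack with enough remaining room. This is a generic
    bin-packing heuristic, assumed here for illustration only.
    """
    remaining: List[int] = []      # free tokens left in each pack
    packs: List[List[int]] = []    # sample indices assigned to each pack
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, exceeds max_len={max_len}")
        for p, free in enumerate(remaining):
            if free >= n:
                remaining[p] -= n
                packs[p].append(i)
                break
        else:
            # No existing pack has room, so open a new one.
            remaining.append(max_len - n)
            packs.append([i])
    return packs


if __name__ == "__main__":
    print(pack_sequences([6, 3, 5, 2, 4], max_len=8))
```

Because packing is computed once before training, padding waste is removed without adding work to the training loop, which is the usual argument for doing it offline.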
EFSA: Episodic Few-Shot Adaptation for Text-to-Image Retrieval
Authors: Muhammad Huzaifa, Yova Kementchedjhieva
First: 2024-11-28T17:09:20+00:00 · Latest: 2025-10-09T11:20:45+00:00
Abstract
Text-to-image retrieval is a critical task for managing diverse visual
content, but common benchmarks for the task rely on small, single-domain
datasets that fail to capture real-world complexity. Pre-trained
vision-language models tend to perform well with easy negatives but struggle
with hard negatives (visually similar yet incorrect images), especially in
open-domain scenarios. To address this, we introduce Episodic Few-Shot
Adaptation (EFSA), a novel test-time framework that adapts pre-trained models
dynamically to a query's domain by fine-tuning on top-k retrieved candidates
and synthetic captions generated for them. EFSA improves performance across
diverse domains while preserving generalization, as shown in evaluations on
queries from eight highly distinct visual domains and an open-domain retrieval
pool of over one million images. Our work highlights the potential of episodic
few-shot adaptation to enhance robustness in the critical and understudied task
of open-domain text-to-image retrieval.
中文标题/摘要
标题:EFSA: episodic few-shot 调适在文本到图像检索中的应用
文本到图像检索是管理多样视觉内容的关键任务,但该任务的常见基准依赖于小规模、单一领域的数据集,无法捕捉现实世界的复杂性。预训练的跨模态模型在容易的负样本上表现良好,但在处理视觉上相似但不正确的硬负样本时遇到困难,尤其是在开放领域场景中。为了解决这个问题,我们引入了Episodic Few-Shot 调适(EFSA),这是一种新颖的测试时框架,通过在检索出的顶级候选图像及其生成的合成描述上进行微调,动态地将预训练模型适配到查询的领域。EFSA 在多个领域中提高了性能,同时保持了泛化能力,如在八个高度不同的视觉领域查询和一个包含超过一百万张图像的开放领域检索池上的评估所示。我们的工作突显了 episodic few-shot 调适在开放领域文本到图像检索这一关键且未充分研究的任务中增强鲁棒性的潜力。
Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
Authors: Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang
First: 2025-10-09T11:08:07+00:00 · Latest: 2025-10-09T11:08:07+00:00
Abstract
Video-to-Audio generation has made remarkable strides in automatically
synthesizing sound for video. However, existing evaluation metrics, which focus
on semantic and temporal alignment, overlook a critical failure mode: models
often generate acoustic events, particularly speech and music, that have no
corresponding visual source. We term this phenomenon Insertion Hallucination
and identify it as a systemic risk driven by dataset biases, such as the
prevalence of off-screen sounds, that remains completely undetected by current
metrics. To address this challenge, we first develop a systematic evaluation
framework that employs a majority-voting ensemble of multiple audio event
detectors. We also introduce two novel metrics to quantify the prevalence and
severity of this issue: IH@vid (the fraction of videos with hallucinations) and
IH@dur (the fraction of hallucinated duration). Building on this, we propose
Posterior Feature Correction, a novel training-free inference-time method that
mitigates IH. PFC operates in a two-pass process: it first generates an initial
audio output to detect hallucinated segments, and then regenerates the audio
after masking the corresponding video features at those timestamps. Experiments
on several mainstream V2A benchmarks first reveal that state-of-the-art models
suffer from severe IH. In contrast, our PFC method reduces both the prevalence
and duration of hallucinations by over 50% on average, without degrading, and
in some cases even improving, conventional metrics for audio quality and
temporal synchronization. Our work is the first to formally define,
systematically measure, and effectively mitigate Insertion Hallucination,
paving the way for more reliable and faithful V2A models.
中文标题/摘要
标题:视频到音频生成中的插入幻觉检测与缓解
视频到音频生成在自动合成视频声音方面取得了显著进展。然而,现有的评估指标侧重于语义和时间对齐,忽视了一个关键的失败模式:模型经常生成声学事件,特别是语音和音乐,这些事件在视频中没有相应的视觉来源。我们称这种现象为插入幻觉,并将其识别为由数据集偏差驱动的系统性风险,这种风险目前完全未被现有指标检测到。为应对这一挑战,我们首先开发了一种系统性的评估框架,该框架采用多个声学事件检测器的多数投票集成。我们还引入了两个新的度量标准来量化这一问题的普遍性和严重性:IH@vid(带有幻觉的视频比例)和IH@dur(幻觉持续时间的比例)。在此基础上,我们提出了后验特征校正(PFC),这是一种无需训练的推理时方法,可以缓解插入幻觉。PFC采用两步过程:首先生成初始音频输出以检测幻觉段落,然后在这些时间戳处遮蔽相应的视频特征后重新生成音频。在几个主流的V2A基准上的实验首次揭示,最先进的模型遭受严重的插入幻觉。相比之下,我们的PFC方法平均将幻觉的普遍性和持续时间降低了超过50%,且不降低,甚至在某些情况下还改善了传统的音频质量和时间同步度指标。我们的工作首次正式定义、系统性测量并有效缓解了插入幻觉,为更可靠和忠实的V2A模型铺平了道路。
Summary / 总结
This paper addresses the issue of Insertion Hallucination in video-to-audio generation, where models generate sounds that do not have corresponding visual sources. It introduces a systematic evaluation framework using a majority-voting ensemble of audio event detectors and proposes a novel inference-time method called Posterior Feature Correction (PFC) to mitigate this issue. Experiments show that PFC reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading conventional audio quality and temporal synchronization metrics.
研究针对视频到音频生成中的插入幻觉问题,即模型生成没有对应视觉来源的声音。为此,作者开发了一个使用多个音频事件检测器的多数投票集成评估框架,并引入了两个指标IH@vid和IH@dur来量化该问题。他们还提出了后验特征校正(PFC)方法,这是一种无需训练的推理时方法,通过首先生成初始音频以检测幻觉段落,然后在这些时间戳处遮蔽相应的视频特征后再生成音频来减轻幻觉。实验表明,PFC平均减少了超过50%的幻觉,且在不损害传统音频质量指标的情况下,有时甚至还能提升。
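The two Insertion Hallucination metrics defined in the abstract, IH@vid and IH@dur, together with the majority-voting ensemble, can be sketched as follows. The data layout is an assumption: each detector is taken to emit one boolean flag per fixed-length segment, and IH@dur is pooled over all segments of all videos. The paper may aggregate differently, and the function names are hypothetical.

```python
from typing import List, Tuple

# One video = one list of boolean flags per detector, one flag per segment.
DetectorFlags = List[List[bool]]


def majority_vote(detector_flags: DetectorFlags) -> List[bool]:
    """Mark a segment as hallucinated when more than half of the detectors flag it."""
    num_detectors = len(detector_flags)
    return [sum(flags) * 2 > num_detectors for flags in zip(*detector_flags)]


def ih_metrics(videos: List[DetectorFlags]) -> Tuple[float, float]:
    """Return (IH@vid, IH@dur) over a set of generated audio tracks.

    IH@vid: fraction of videos with at least one hallucinated segment.
    IH@dur: fraction of total duration that is hallucinated (segments are
            assumed to be equal length, so counts stand in for seconds).
    """
    videos_with_ih = 0
    hallucinated_segments = 0
    total_segments = 0
    for detector_flags in videos:
        voted = majority_vote(detector_flags)
        videos_with_ih += any(voted)
        hallucinated_segments += sum(voted)
        total_segments += len(voted)
    return videos_with_ih / len(videos), hallucinated_segments / total_segments


if __name__ == "__main__":
    video_a = [
        [True, False, False, True],
        [True, False, False, False],
        [False, False, True, True],
    ]
    video_b = [[False] * 4, [False] * 4, [False] * 4]
    print(ih_metrics([video_a, video_b]))
```

In a two-pass scheme like the PFC method described above, the segments flagged by `majority_vote` would be the timestamps whose video features are masked before regeneration.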
RetouchLLM: Training-free White-box Image Retouching
Authors: Moon Ye-Bin, Roy Miles, Tae-Hyun Oh, Ismail Elezi, Jiankang Deng
First: 2025-10-09T10:40:49+00:00 · Latest: 2025-10-09T10:40:49+00:00
Abstract
Image retouching not only enhances visual quality but also serves as a means
of expressing personal preferences and emotions. However, existing
learning-based approaches require large-scale paired data and operate as black
boxes, making the retouching process opaque and limiting their adaptability to
handle diverse, user- or image-specific adjustments. In this work, we propose
RetouchLLM, a training-free white-box image retouching system, which requires
no training data and performs interpretable, code-based retouching directly on
high-resolution images. Our framework progressively enhances the image in a
manner similar to how humans perform multi-step retouching, allowing
exploration of diverse adjustment paths. It comprises of two main modules: a
visual critic that identifies differences between the input and reference
images, and a code generator that produces executable codes. Experiments
demonstrate that our approach generalizes well across diverse retouching
styles, while natural language-based user interaction enables interpretable and
controllable adjustments tailored to user intent.
中文标题/摘要
标题:RetouchLLM:无需训练的白盒图像润饰
图像润饰不仅提升了视觉质量,还是一种表达个人偏好和情感的方式。然而,现有的基于学习的方法需要大量配对数据,并且作为黑盒运行,使得润饰过程不透明,限制了其适应处理多样、用户或图像特定调整的能力。在本文中,我们提出了一种无需训练的白盒图像润饰系统RetouchLLM,该系统不需要训练数据,并可以直接在高分辨率图像上进行可解释的、基于代码的润饰。我们的框架以类似于人类多步润饰的方式逐步提升图像,允许探索多种调整路径。该框架包括两个主要模块:一个视觉批评家,用于识别输入图像和参考图像之间的差异,以及一个代码生成器,用于生成可执行代码。实验表明,我们的方法在多种润饰风格下具有良好的泛化能力,而基于自然语言的用户交互则使调整具有可解释性和可控性,以满足用户意图。
CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning
Authors: Weihuang Lin, Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
First: 2025-10-09T09:41:45+00:00 · Latest: 2025-10-09T09:41:45+00:00
Abstract
Composed Image Retrieval (CIR), which aims to find a target image from a
reference image and a modification text, presents the core challenge of
performing unified reasoning across visual and semantic modalities. While
current approaches based on Vision-Language Models (VLMs, e.g., CLIP) and more
recent Multimodal Large Language Models (MLLMs, e.g., Qwen-VL) have shown
progress, they predominantly function as "black boxes." This inherent opacity
not only prevents users from understanding the retrieval rationale but also
restricts the models' ability to follow complex, fine-grained instructions. To
overcome these limitations, we introduce CIR-CoT, the first end-to-end
retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT)
reasoning. By compelling the model to first generate an interpretable reasoning
chain, CIR-CoT enhances its ability to capture crucial cross-modal
interactions, leading to more accurate retrieval while making its decision
process transparent. Since existing datasets like FashionIQ and CIRR lack the
necessary reasoning data, a key contribution of our work is the creation of
structured CoT annotations using a three-stage process involving a caption,
reasoning, and conclusion. Our model is then fine-tuned to produce this
structured output before encoding its final retrieval intent into a dedicated
embedding. Comprehensive experiments show that CIR-CoT achieves highly
competitive performance on in-domain datasets (FashionIQ, CIRR) and
demonstrates remarkable generalization on the out-of-domain CIRCO dataset,
establishing a new path toward more effective and trustworthy retrieval
systems.
中文标题/摘要
标题:CIR-CoT:通过端到端链式推理实现可解释的组合图像检索
组合图像检索(CIR)旨在从参考图像和修改文本中找到目标图像,其核心挑战在于在视觉和语义模态之间进行统一推理。尽管基于视觉语言模型(VLM,例如CLIP)和更近期的多模态大型语言模型(MLLM,例如Qwen-VL)的方法已经取得进展,但它们大多作为“黑箱”运行。这种固有的不透明性不仅阻止用户理解检索逻辑,还限制了模型遵循复杂、精细指令的能力。为克服这些限制,我们提出了CIR-CoT,这是第一个面向检索的端到端多模态大型语言模型,旨在整合显式的链式推理(CoT)。通过迫使模型首先生成可解释的推理链,CIR-CoT增强了其捕捉关键跨模态交互的能力,从而提高了检索准确性,同时使其决策过程透明化。由于现有数据集如FashionIQ和CIRR缺乏必要的推理数据,我们工作的关键贡献是使用三阶段过程(包括描述、推理和结论)创建结构化的CoT注释。然后,我们的模型经过微调以生成这种结构化输出,并将其最终检索意图编码到专用嵌入中。全面的实验表明,CIR-CoT在领域内数据集(FashionIQ、CIRR)上取得了高度竞争力的表现,并在领域外数据集CIRCO上展示了显著的泛化能力,为更有效的和值得信赖的检索系统开辟了一条新路径。
Summary / 总结
CIR-CoT is designed to address the challenge of Composed Image Retrieval by integrating explicit Chain-of-Thought (CoT) reasoning into an end-to-end retrieval-oriented Multimodal Large Language Model (MLLM). It generates an interpretable reasoning chain to enhance cross-modal interactions, leading to more accurate retrieval and transparency in decision-making. The model was fine-tuned with structured CoT annotations and achieved competitive performance on in-domain datasets and remarkable generalization on the out-of-domain CIRCO dataset.
CIR-CoT 通过将显式的链式思考推理集成到端到端的多模态大型语言模型中,旨在解决组合图像检索的挑战。它生成可解释的推理链,以增强跨模态交互,提高检索准确性,同时使决策过程透明。该模型在领域内数据集上表现出色,并在领域外数据集上展示了出色的泛化能力,为更有效和可信的检索系统开辟了新路径。
Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation
Authors: Mingyang Sun, Jiude Wei, Qichen He, Donglin Wang, Cewu Lu, Jianhua Sun
First: 2025-10-09T09:08:33+00:00 · Latest: 2025-10-09T09:08:33+00:00
Abstract
Enabling robots to perform precise and generalized manipulation in
unstructured environments remains a fundamental challenge in embodied AI. While
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in
semantic reasoning and task planning, a significant gap persists between their
high-level understanding and the precise physical execution required for
real-world manipulation. To bridge this "semantic-to-physical" gap, we
introduce GRACE, a novel framework that grounds VLM-based reasoning through
executable analytic concepts (EAC): mathematically defined blueprints that
encode object affordances, geometric constraints, and semantics of
manipulation. Our approach integrates a structured policy scaffolding pipeline
that turns natural language instructions and visual information into an
instantiated EAC, from which we derive grasp poses and force directions and plan
a physically feasible motion trajectory for robot execution. GRACE thus provides
a unified and interpretable interface between high-level instruction
understanding and low-level robot control, effectively enabling precise and
generalizable manipulation through semantic-physical grounding. Extensive
experiments demonstrate that GRACE achieves strong zero-shot generalization
across a variety of articulated objects in both simulated and real-world
environments, without requiring task-specific training.
中文标题/摘要
标题:可执行分析概念作为VLM洞察与精确操作之间缺失的联系
使机器人在非结构化环境中执行精确且通用的操作仍然是具身AI中的基本挑战。尽管视觉语言模型(VLMs)在语义推理和任务规划方面展现了卓越的能力,但它们的高层次理解与现实世界操作所需的精确物理执行之间仍存在显著差距。为了弥合“语义到物理”的差距,我们提出了GRACE,这是一种新颖的框架,通过可执行分析概念(EAC)——数学定义的蓝图,这些蓝图编码了物体的功能、几何约束和操作的语义。我们的方法整合了一个结构化策略支撑管道,将自然语言指令和视觉信息转化为实例化的EAC,从中我们推导出抓取姿态、力的方向,并规划出机器人执行的物理可行运动轨迹。GRACE因此提供了一个统一且可解释的接口,连接高层次指令理解和低层次机器人控制,有效通过语义-物理对接实现精确且通用的操作。广泛的实验表明,GRACE在模拟和真实世界环境中对各种关节物体实现了强大的零样本泛化,无需特定任务的训练。
Summary / 总结
The research aims to bridge the gap between high-level semantic understanding and precise physical execution in robot manipulation. GRACE, a novel framework, uses executable analytic concepts (EAC) to ground VLM-based reasoning, converting natural language instructions and visual information into grasp poses, force directions, and motion trajectories. Experiments show that GRACE achieves strong zero-shot generalization across various articulated objects in both simulated and real-world environments without task-specific training.
论文通过引入使用可执行分析概念(EAC)的GRACE框架,解决了机器人在非结构化环境中进行精确操作的挑战,以弥合高层次语义理解与精确物理执行之间的差距。GRACE将自然语言指令和视觉信息转换为EAC,进而推导出抓取姿态、力的方向和机器人执行的运动轨迹。实验表明,GRACE可以在模拟和真实世界环境中,无需特定任务训练,就能实现各种关节物体的强零样本泛化。
TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
Authors: Leigang Qu, Ziyang Wang, Na Zheng, Wenjie Wang, Liqiang Nie, Tat-Seng Chua
First: 2025-10-09T08:37:00+00:00 · Latest: 2025-10-09T08:37:00+00:00
Comments: Project page: https://ttom-t2v.github.io/
Abstract
Video Foundation Models (VFMs) exhibit remarkable visual generation
performance, but struggle in compositional scenarios (e.g., motion, numeracy,
and spatial relation). In this work, we introduce Test-Time Optimization and
Memorization (TTOM), a training-free framework that aligns VFM outputs with
spatiotemporal layouts during inference for better text-image alignment. Rather
than intervening directly on latents or attention per sample, as existing work does,
we integrate and optimize new parameters guided by a general layout-attention
objective. Furthermore, we formulate video generation within a streaming
setting, and maintain historical optimization contexts with a parametric memory
mechanism that supports flexible operations, such as insert, read, update, and
delete. Notably, we found that TTOM disentangles compositional world knowledge,
showing powerful transferability and generalization. Experimental results on
the T2V-CompBench and Vbench benchmarks establish TTOM as an effective,
practical, scalable, and efficient framework to achieve cross-modal alignment
for compositional video generation on the fly.
中文标题/摘要
标题:TTOM:测试时优化与记忆以实现组合视频生成
视频基础模型(VFMs)在视觉生成方面表现出色,但在组合场景(如运动、数量关系和空间关系)中遇到困难。在本工作中,我们引入了测试时优化与记忆(TTOM),这是一种无需训练的框架,在推理过程中将VFMs的输出与时空布局对齐,以提高文本-图像对齐效果。与现有工作中直接干预潜在变量或注意力机制不同,我们通过一个通用布局-注意力目标整合并优化新的参数。此外,我们将视频生成置于流式处理环境中,并通过参数化记忆机制维护历史优化上下文,支持插入、读取、更新和删除等灵活操作。值得注意的是,我们发现TTOM能够分离组合世界知识,显示出强大的可转移性和泛化能力。在T2V-CompBench和Vbench基准测试上的实验结果表明,TTOM是一种有效、实用、可扩展和高效的框架,能够实现组合视频生成的跨模态对齐。
Summary / 总结
TTOM is a training-free framework that improves the compositional generation of videos by aligning Video Foundation Models (VFMs) with spatiotemporal layouts during inference. It introduces new parameters optimized by a layout-attention objective and uses a parametric memory mechanism to maintain historical contexts. Experiments on T2V-CompBench and Vbench show that TTOM effectively enhances text-image alignment and demonstrates strong transferability and generalization capabilities.
TTOM 是一个无需训练的框架,通过在推理过程中将视频基础模型(VFM)的输出与时空布局对齐来增强其组合生成能力。它通过一个通用的布局-注意力目标引入新的参数进行优化,并使用参数化记忆机制来维护历史上下文。实验结果表明,TTOM 能够有效提高跨模态对齐,并在组合场景中表现出良好的可迁移性和泛化能力。
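The abstract names four memory operations (insert, read, update, delete) but does not say how the memory is keyed or how entries are merged. The sketch below is only an illustration of that interface, not the authors' implementation. The class name `ParametricMemory`, the string keys, the oldest-first eviction, and the momentum blend in `update` are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ParametricMemory:
    """Toy key-value store for optimized parameter vectors (hypothetical design)."""

    capacity: int = 4
    _store: Dict[str, List[float]] = field(default_factory=dict)

    def insert(self, key: str, params: List[float]) -> None:
        # Assumption: when full, evict the oldest entry (dicts keep insertion order).
        if key not in self._store and len(self._store) >= self.capacity:
            oldest = next(iter(self._store))
            del self._store[oldest]
        self._store[key] = list(params)

    def read(self, key: str) -> Optional[List[float]]:
        # Return a copy so callers cannot mutate the stored parameters.
        params = self._store.get(key)
        return None if params is None else list(params)

    def update(self, key: str, new_params: List[float], momentum: float = 0.5) -> None:
        # Assumption: blend old and new parameters; raises KeyError if key is absent.
        old = self._store[key]
        self._store[key] = [
            momentum * o + (1.0 - momentum) * n for o, n in zip(old, new_params)
        ]

    def delete(self, key: str) -> None:
        # Deleting a missing key is a no-op.
        self._store.pop(key, None)
```

In a streaming setting, one would presumably `read` the parameters stored for a similar earlier prompt as a warm start, optimize them at test time, and then `update` or `insert` the result. That usage pattern is likewise an assumption here.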
MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
Authors: Peiran Wu, Zhuorui Yu, Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen
First: 2025-10-09T08:07:19+00:00 · Latest: 2025-10-09T08:07:19+00:00
Abstract
The rapid progress of large language models (LLMs) has laid the foundation
for multimodal models. However, visual language models (VLMs) still face heavy
computational costs when extended from images to videos due to high frame rates
and long durations. Token compression is a promising solution, yet most
existing training-free methods cause information loss and performance
degradation. To overcome this, we propose Memory-Augmented
Reinforcement Learning-based Token Compression (MARC), which integrates
structured retrieval and RL-based distillation. MARC adopts a
retrieve-then-compress strategy using a Visual Memory
Retriever (VMR) to select key clips and a Compression Group Relative
Policy Optimization (C-GRPO) framework to distil reasoning ability from a
teacher to a student model. Experiments on six video benchmarks show that MARC
achieves near-baseline accuracy using only one frame's tokens, reducing
visual tokens by 95%, GPU memory by 72%, and latency by
23.9%. This demonstrates its potential for efficient, real-time video
understanding in resource-constrained settings such as video QA, surveillance,
and autonomous driving.
中文标题/摘要
标题:MARC:基于记忆增强的RL标记压缩以实现高效的视频理解
大型语言模型(LLMs)的快速发展为多模态模型奠定了基础。然而,视觉语言模型(VLMs)在从图像扩展到视频时仍面临巨大的计算成本,因为视频具有高帧率和长持续时间。标记压缩是一种有前途的解决方案,但大多数现有的免训练方法会导致信息丢失和性能下降。为了解决这一问题,我们提出了基于记忆增强的强化学习标记压缩(MARC),该方法结合了结构化检索和基于RL的蒸馏。MARC采用“先检索后压缩”策略,使用视觉记忆检索器(VMR)选择关键片段,并使用压缩组相对策略优化(C-GRPO)框架从教师模型向学生模型蒸馏推理能力。在六个视频基准上的实验表明,MARC仅使用一帧的标记即可达到接近基线的准确性,视觉标记减少95%,GPU内存减少72%,延迟减少23.9%。这表明其在资源受限的环境中(如视频问答、监控和自动驾驶)实现高效、实时视频理解的潜力。
Summary / 总结
MARC is a method that integrates structured retrieval and RL-based distillation to compress visual tokens in video understanding models. It uses a Visual Memory Retriever to select key clips and a Compression Group Relative Policy Optimization framework to transfer reasoning ability from a teacher to a student model. Experiments show MARC can achieve near-baseline accuracy while reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This makes it suitable for resource-constrained applications like video QA, surveillance, and autonomous driving.
MARC 是一种通过结合结构化检索和基于强化学习的蒸馏来解决将视觉语言模型扩展到视频理解中的计算挑战的方法。它使用视觉记忆检索器选择关键片段,并使用压缩组相对策略优化框架从教师模型向学生模型蒸馏推理能力。实验结果显示,MARC 可以在视觉令牌减少 95%、GPU 内存减少 72% 和延迟减少 23.9% 的情况下达到接近基线的准确性,使其适用于资源受限的应用场景,如视频问答、监控和自动驾驶等。
CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D
Authors: Mohamad Amin Mirzaei, Pantea Amoie, Ali Ekhterachian, Matin Mirzababaei, Babak Khalaj
Venue: ICLR 2026
First: 2025-09-29T09:43:00+00:00 · Latest: 2025-10-09T07:45:58+00:00
Comments: 9 pages, 4 figures, submitted for ICLR 2026 conference
Abstract
3D scene understanding is fundamental for embodied AI and robotics,
supporting reliable perception for interaction and navigation. Recent
approaches achieve zero-shot, open-vocabulary 3D semantic mapping by assigning
embedding vectors to 2D class-agnostic masks generated via vision-language
models (VLMs) and projecting these into 3D. However, these methods often
produce fragmented masks and inaccurate semantic assignments due to the direct
use of raw masks, limiting their effectiveness in complex environments. To
address this, we leverage SemanticSAM with progressive granularity refinement
to generate more accurate and numerous object-level masks, mitigating the
over-segmentation commonly observed in mask generation models such as vanilla
SAM, and improving downstream 3D semantic segmentation. To further enhance
semantic context, we employ a context-aware CLIP encoding strategy that
integrates multiple contextual views of each mask using empirically determined
weighting, providing much richer visual context. We evaluate our approach on
multiple 3D scene understanding tasks, including 3D semantic segmentation and
object retrieval from language queries, across several benchmark datasets.
Experimental results demonstrate significant improvements over existing
methods, highlighting the effectiveness of our approach.
中文标题/摘要
标题:CORE-3D:基于嵌入的3D上下文感知开放词汇检索
3D场景理解是具身人工智能和机器人技术的基础,支持可靠的感知以进行交互和导航。最近的方法通过将嵌入向量分配给由视觉-语言模型(VLMs)生成的2D类别无关掩码,并将这些掩码投影到3D中,实现了零样本、开放词汇的3D语义建图。然而,这些方法由于直接使用原始掩码,经常产生碎片化的掩码和不准确的语义分配,限制了它们在复杂环境中的有效性。为了解决这个问题,我们利用具有渐进式粒度细化的SemanticSAM生成更准确、数量更多的对象级掩码,减轻了vanilla SAM等掩码生成模型中常见的过度分割问题,并提高了下游3D语义分割的效果。为了进一步增强语义上下文,我们采用了一种上下文感知的CLIP编码策略,通过经验确定的权重整合每个掩码的多种上下文视图,提供了更丰富的视觉上下文。我们在多个基准数据集上,针对多项3D场景理解任务评估了我们的方法,包括3D语义分割和基于语言查询的对象检索。实验结果表明,我们的方法显著优于现有方法,突显了其有效性。
Summary / 总结
The research aims to improve 3D scene understanding for embodied AI and robotics by addressing the limitations of existing zero-shot, open-vocabulary 3D semantic mapping methods. The method uses SemanticSAM with progressive granularity refinement to generate more accurate object-level masks and integrates context-aware CLIP encoding to enhance semantic context. The approach shows significant improvements in 3D semantic segmentation and object retrieval from language queries across various benchmark datasets.
研究旨在通过解决现有零样本、开放词汇3D语义建图方法的局限性,提高具身AI和机器人技术中的3D场景理解。方法使用具有渐进式粒度细化的SemanticSAM生成更准确的对象级掩码,并采用上下文感知的CLIP编码策略整合每个掩码的多个上下文视图,增强语义上下文。该方法在多个基准数据集上的3D语义分割和基于语言查询的对象检索任务中显示出显著的改进。
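The context-aware encoding described in the CORE-3D abstract combines several views of each mask using fixed weights. A minimal sketch of one way such a fusion could work is shown below; it is not the authors' code. It assumes each view has already been encoded into a vector (for example by a CLIP image encoder, which is not called here), and the function names, the choice of views, and the weight values are hypothetical. The abstract only says the weights are empirically determined.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def fuse_view_embeddings(
    view_embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Weighted sum of unit-normalized per-view embeddings, re-normalized."""
    if not view_embeddings or len(view_embeddings) != len(weights):
        raise ValueError("need one weight per view embedding")
    dim = len(view_embeddings[0])
    fused = [0.0] * dim
    for emb, w in zip(view_embeddings, weights):
        unit = l2_normalize(emb)
        for i in range(dim):
            fused[i] += w * unit[i]
    return l2_normalize(fused)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, e.g. between a fused mask embedding and a text query."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
```

Object retrieval from a language query would then rank masks by `cosine_similarity` between each fused embedding and the encoded query text, under the same assumptions.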