arXiv 论文速递

MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning

Authors: Tajamul Ashraf, Umair Nawaz, Abdelrahman M. Shaker, Rao Anwer, Philip Torr, Fahad Shahbaz Khan, Salman Khan

First: 2025-10-09T17:59:54+00:00 · Latest: 2025-10-09T17:59:54+00:00

Abstract

Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.

中文标题/摘要

标题：MATRIX：多模态智能体调优以实现稳健的工具使用推理

视觉语言模型（VLMs）越来越多地被用作可调用外部工具的控制器，用于复杂的推理和决策，但其有效性受限于高质量多模态轨迹的稀缺和人工标注的成本。我们通过一种以视觉为中心的智能体调优框架来应对这一挑战，该框架自动合成多模态轨迹、生成逐步偏好对，并训练一个VLM控制器以实现稳健的工具使用推理。我们的流水线首先构建了M-TRACE，这是一个包含28.5K个多模态任务和177K条经过验证的轨迹的大规模数据集，使基于模仿的轨迹调优成为可能。在此基础上，我们开发了MATRIX Agent，这是一个在M-TRACE上微调、用于逐步工具推理的控制器。为了实现更精细的对齐，我们进一步引入了Pref-X，这是一个包含11K个自动生成的偏好对的集合，并通过逐步偏好学习在其上优化MATRIX。在Agent-X、GTA和GAIA三个基准测试上，MATRIX始终超越开源和闭源的VLMs，展示了可扩展且有效的多模态工具使用能力。我们的数据和代码可在 https://github.com/mbzuai-oryx/MATRIX 获取。

Summary / 总结

The research aims to improve vision language models (VLMs) as controllers for complex reasoning tasks that involve external tools. The method is a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories and step-wise preference pairs, then trains a VLM controller on them: imitation-based tuning on M-TRACE (28.5K tasks, 177K verified trajectories), followed by step-wise preference learning on Pref-X (11K pairs). The resulting MATRIX Agent outperforms both open- and closed-source VLMs on three benchmarks (Agent-X, GTA, and GAIA), indicating that the approach yields scalable and effective multimodal tool use.

研究旨在提高视觉语言模型（VLMs）在涉及工具的复杂推理任务中的有效性。该研究引入了一个以视觉为中心的框架MATRIX，该框架自动生成多模态轨迹和偏好对来训练VLM控制器。该框架构建了包含28.5K多模态任务的大规模数据集M-TRACE，并对VLM控制器MATRIX Agent进行了微调。结果显示，MATRIX在三个基准测试中均优于开源和闭源的VLMs，展示了多模态工具使用的可扩展性和有效性。
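
Illustrative sketch (not from the paper): the abstract mentions step-wise preference pairs and step-wise preference learning but does not state the training objective. Assuming a DPO-style loss applied to one step at a time, a pair and its loss might look like the following. The names `StepPreferencePair` and `step_preference_loss`, the field layout, and the default `beta=0.1` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class StepPreferencePair:
    """One preference pair at a single step of a tool-use trajectory (hypothetical layout)."""
    context: str   # task description plus the trajectory prefix up to this step
    chosen: str    # preferred next step, e.g. a thought plus a tool call
    rejected: str  # dispreferred next step for the same context


def step_preference_loss(logp_chosen: float, logp_rejected: float,
                         ref_logp_chosen: float, ref_logp_rejected: float,
                         beta: float = 0.1) -> float:
    """DPO-style loss for one pair: -log(sigmoid(margin)).

    Inputs are log-probabilities of the chosen and rejected steps under the
    policy being trained and under a frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow in exp().
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


pair = StepPreferencePair(
    context="Task: count the cars in the image. Steps so far: (none)",
    chosen="Call an object detector on the image with label 'car'.",
    rejected="Answer '3' without looking at the image.",
)
# Made-up log-probabilities, only to show the call shape.
loss = step_preference_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls below log 2 when the policy favors the chosen step more strongly than the reference model does, and rises above it in the opposite case.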

SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

Authors: Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

First: 2025-10-09T17:50:54+00:00 · Latest: 2025-10-09T17:50:54+00:00

Comments: Project Page: https://zju-real.github.io/SpatialLadder/ Code: https://github.com/ZJU-REAL/SpatialLadder

Abstract

Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.

中文标题/摘要

标题：SpatialLadder：视觉语言模型中空间推理的渐进训练方法

空间推理仍然是视觉语言模型（VLMs）面临的基本挑战，尽管最近取得了进展，当前方法仍难以实现稳健的性能。我们发现这一限制源于一个关键缺口：现有方法试图直接学习空间推理，而没有建立感知和理解的层次基础。为了解决这一挑战，我们提出了一种逐步构建空间智能的完整方法。我们引入了SpatialLadder-26k，这是一个包含26,610个样本的多模态数据集，覆盖对象定位、单图像、多视图和视频空间推理任务，并通过标准化流程构建，以确保跨模态的系统覆盖。基于此数据集，我们设计了一个三阶段的渐进训练框架：（1）通过对象定位建立空间感知，（2）通过多维度空间任务发展空间理解，（3）通过带可验证奖励的强化学习强化复杂推理。这种方法产生了SpatialLadder，一个30亿（3B）参数的模型，在空间推理基准测试中达到了最先进的性能，相比基础模型平均提升23.4%，比GPT-4o高出20.8%，比Gemini-2.0-Flash高出10.1%。值得注意的是，SpatialLadder在域外基准测试中保持了较强的泛化能力，提升了7.2%，表明从感知到推理的渐进训练对于构建稳健的空间智能至关重要。

Summary / 总结

The paper addresses spatial reasoning in Vision-Language Models (VLMs) with a three-stage progressive training framework. Using SpatialLadder-26k, a multimodal dataset of 26,610 samples, it trains the model in three stages: spatial perception through object localization, spatial understanding through multi-dimensional spatial tasks, and complex reasoning through reinforcement learning with verifiable rewards. The resulting 3B-parameter model, SpatialLadder, achieves a 23.4% average improvement over the base model on spatial reasoning benchmarks and a 7.2% improvement on out-of-domain benchmarks.

论文通过引入SpatialLadder，提出了一种渐进式训练框架来解决视觉-语言模型（VLMs）中的空间推理问题。该框架利用SpatialLadder-26k这一大规模多模态数据集，在三个阶段进行训练：物体定位、多维度空间任务和通过强化学习进行复杂推理。最终模型SpatialLadder在空间推理基准测试中表现出显著提升，平均改进幅度为23.4%，并且具有良好的泛化能力。
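
Illustrative sketch (not from the paper): the third training stage uses reinforcement learning with verifiable rewards, where the reward comes from checking the model's answer against ground truth rather than from a learned reward model. The abstract does not define the reward, so the function below assumes exact match for choice-style answers and a relative tolerance for numeric answers. The name `verifiable_reward` and the `rel_tol` parameter are hypothetical.

```python
def verifiable_reward(prediction: str, ground_truth: str, rel_tol: float = 0.1) -> float:
    """Return 1.0 if the prediction can be verified as correct, else 0.0.

    Numeric answers (e.g. a distance estimate) count as correct when they fall
    within a relative tolerance of the ground truth; all other answers
    (e.g. a multiple-choice letter) must match exactly, ignoring case and
    surrounding whitespace.
    """
    pred = prediction.strip().lower()
    truth = ground_truth.strip().lower()
    try:
        pred_value, truth_value = float(pred), float(truth)
    except ValueError:
        # At least one side is not a number: fall back to exact string match.
        return 1.0 if pred == truth else 0.0
    if truth_value == 0.0:
        return 1.0 if pred_value == 0.0 else 0.0
    relative_error = abs(pred_value - truth_value) / abs(truth_value)
    return 1.0 if relative_error <= rel_tol else 0.0


choice_reward = verifiable_reward(" B ", "b")      # exact-match branch
numeric_reward = verifiable_reward("2.05", "2.0")  # tolerance branch
```

A binary reward like this is cheap to compute and hard to game, which is the usual motivation for verifiable rewards; graded or format-aware rewards are possible variations.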

To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models

Authors: Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, Leonid Sigal

First: 2025-10-09T17:44:42+00:00 · Latest: 2025-10-09T17:44:42+00:00

Comments: Preprint. Project page: https://davidhalladay.github.io/diysink_demo

Abstract

Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks -- a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.

中文标题/摘要

标题：沉还是不沉：大型视觉语言模型中的视觉信息路径

大型视觉语言模型（LVLMs）最近已成为能够理解视觉和文本信息并对其进行推理的强大架构。这些模型通常依赖于两个关键组件：视觉变换器（ViT）和大型语言模型（LLM）。ViT 将视觉内容编码为图像标记序列，并作为感知前端，即模型的“眼睛”。相比之下，LLM 解释这些标记以进行高层次推理、生成响应，并作为认知核心，即模型的“大脑”。然而，尚不清楚哪些视觉标记对理解和推理贡献最大，以及这些信号从 ViT 传播到 LLM 的效果如何。大多数现有工作都集中在识别 LLM 内部的注意力汇聚点（attention sink，即获得不成比例的高注意力的低语义标记），而我们将重点转向视觉编码器，从 ViT 中识别出一类高范数视觉标记，称为 ViT 注意力汇聚点；这个问题很少被研究，但对 LVLMs 来说确实非常重要。我们的研究发现，这些 ViT 汇聚点包含了图像中的高层次语义概念，使 LLM 能够更有效地理解和推理。尽管它们很重要，这些汇聚点标记在现有 LVLM 架构中却经常被忽视。为了探索它们的贡献，我们对这些汇聚点标记中嵌入的信息进行了定性和定量分析。我们还提出了无需训练和基于训练的方法，以更好地利用 LLM 对这些信息的解释方式及其程度。通过显式利用这些标记，我们在一系列 LVLM 和视觉推理任务上取得了显著改进，突显了 ViT 注意力汇聚点在增强视觉推理方面尚未开发的潜力。

Summary / 总结

This study investigates the role of visual tokens in large vision-language models (LVLMs) by identifying a class of high-norm visual tokens from the Vision Transformer (ViT), referred to as ViT attention sinks. The research shows that these tokens encapsulate high-level semantic concepts, enabling more effective reasoning by the language model. The study provides both qualitative and quantitative analyses of these sink tokens and proposes methods to better leverage their information, leading to improvements in various LVLMs and visual reasoning tasks.

研究通过识别来自视觉变换器（ViT）的一类高范数视觉标记，即 ViT 注意力汇聚点（attention sink），探讨了视觉标记在大型视觉语言模型（LVLM）中的作用。研究发现，这些标记包含了高层次语义概念，使语言模型能够更有效地进行推理。研究提供了这些汇聚点标记的定性和定量分析，并提出了更好地利用这些信息的方法，从而在各种LVLM和视觉推理任务中取得了改进。
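
Illustrative sketch (not from the paper): the abstract characterizes ViT attention sinks as high-norm visual tokens. One simple way to flag such tokens, assuming a cutoff at a fixed multiple of the median L2 norm, is shown below. The function name, the `ratio` parameter, and the median-based rule are assumptions; the abstract does not give the paper's actual selection criterion.

```python
import math
from statistics import median
from typing import List, Sequence


def find_high_norm_tokens(tokens: Sequence[Sequence[float]], ratio: float = 3.0) -> List[int]:
    """Return indices of tokens whose L2 norm exceeds `ratio` times the median norm.

    `tokens` is a sequence of embedding vectors, one per image token, as they
    would come out of a vision encoder.
    """
    if not tokens:
        return []
    norms = [math.sqrt(sum(x * x for x in token)) for token in tokens]
    cutoff = ratio * median(norms)
    return [index for index, norm in enumerate(norms) if norm > cutoff]


# Toy embeddings: the third token has a much larger norm than the others.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [30.0, 40.0], [1.0, 1.0]]
sink_indices = find_high_norm_tokens(toy_tokens)
```

In the toy input only the third token is flagged, since its norm of 50 is far above the others. A median-based cutoff is used here because a handful of very large norms would skew a mean-based one.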

MoA-VR: A Mixture-of-Agents System Towards All-in-One Video Restoration

Authors: Lu Liu, Chunlei Cai, Shaocheng Shen, Jianfeng Liang, Weimin Ouyang, Tianxiao Ye, Jian Mao, Huiyu Duan, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai

First: 2025-10-09T17:42:51+00:00 · Latest: 2025-10-09T17:42:51+00:00