FVG-PT: Adaptive Foreground View-Guided Prompt Tuning for Vision-Language Models
Authors: Haoyang Li, Liang Wang, Siyu Zhou, Jiacheng Sun, Jing Jiang, Chao Wang, Guodong Long, Yan Peng
First: 2026-03-09T17:59:18+00:00 · Latest: 2026-03-09T17:59:18+00:00
Comments: 27 Pages, 9 Figures, 15 Tables
Abstract
CLIP-based prompt tuning enables pretrained Vision-Language Models (VLMs) to efficiently adapt to downstream tasks. Although existing studies have made significant progress, they pay limited attention to changes in the internal attention representations of VLMs during the tuning process. In this paper, we attribute the failure modes of prompt tuning predictions to shifts in foreground attention of the visual encoder, and propose Foreground View-Guided Prompt Tuning (FVG-PT), an adaptive plug-and-play foreground attention guidance module, to alleviate the shifts. Concretely, FVG-PT introduces a learnable Foreground Reliability Gate to automatically enhance the foreground view quality, applies a Foreground Distillation Compensation module to guide visual attention toward the foreground, and further introduces a Prior Calibration module to mitigate generalization degradation caused by excessive focus on the foreground. Experiments on multiple backbone models and datasets show the effectiveness and compatibility of FVG-PT. Code is available at: https://github.com/JREion/FVG-PT
中文标题/摘要
标题:FVG-PT:自适应前景视图引导提示调优方法
基于CLIP的提示调优使预训练的视觉-语言模型(VLMs)能够高效地适应下游任务。尽管现有研究取得了显著进展,但它们在调优过程中对VLMs内部注意力表示的变化关注有限。本文将提示调优预测的失败模式归因于视觉编码器前景注意力的变化,并提出了一种自适应插件前景注意力引导模块——前景视图引导提示调优(FVG-PT),以缓解这种变化。具体而言,FVG-PT引入了一个可学习的前景可靠性门控,以自动增强前景视图质量,应用前景蒸馏补偿模块以引导视觉注意力朝向前景,并进一步引入先验校准模块以减轻过度关注前景导致的一般化退化。在多个骨干模型和数据集上的实验显示了FVG-PT的有效性和兼容性。代码可在:https://github.com/JREion/FVG-PT
Summary / 总结
This paper addresses the limitations of existing prompt tuning methods for Vision-Language Models (VLMs) by focusing on changes in the internal attention representations during the tuning process. It proposes FVG-PT, which includes a Foreground Reliability Gate, a Foreground Distillation Compensation module, and a Prior Calibration module to guide and enhance foreground attention. Experiments demonstrate the effectiveness and compatibility of FVG-PT across multiple backbone models and datasets.
本文针对现有视觉-语言模型的提示调优方法在调优过程中前景注意力转移的问题,提出了FVG-PT模块,包括前景可靠性门控、前景蒸馏补偿模块和先验校准模块,以提高前景视图质量并引导视觉注意力。实验结果表明,FVG-PT在不同模型和数据集上具有有效性和兼容性。
From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models
Authors: Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Jiuguang Wang, Tomás Lozano-Pérez, Leslie Pack Kaelbling
First: 2024-12-31T06:14:16+00:00 · Latest: 2026-03-09T17:35:57+00:00
Comments: A version of this paper appears in the official proceedings of RA-L, Volume 11, Issue 4
Abstract
Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.
中文标题/摘要
标题:从像素到谓词:通过预训练的视觉-语言模型学习符号世界模型
我们的目标是在给定低级技能和少量短时间 horizon 的演示(包含一系列图像序列)的情况下,学习解决复杂机器人领域的长期决策问题。为此,我们专注于学习抽象的符号世界模型,这些模型能够通过规划实现零样本泛化。此类模型的关键组成部分是定义对象属性及其之间关系的符号谓词集。在本工作中,我们利用预训练的视觉-语言模型(VLMs)提出一组可能适用于决策的视觉谓词,并直接从相机图像中评估这些谓词。在训练时,我们将提出的谓词和演示传递给基于优化的模型学习算法,以获得一个用提出的谓词子集定义的抽象符号世界模型。在测试时,给定一个新的目标和新的环境设置,我们使用VLM构建当前世界状态的符号描述,然后使用基于搜索的规划算法找到实现目标的一系列低级技能。我们通过在仿真和真实世界中的实验,实证地证明了我们的方法可以积极泛化,将其学习到的世界模型应用于解决各种对象类型、排列、对象数量和视觉背景广泛变化的问题,以及新的目标和远超训练时的更长时间 horizon 的问题。
Summary / 总结
The research aims to develop a method for learning symbolic world models from low-level skills and short demonstrations to solve long-horizon decision-making problems in robotics. The approach uses pretrained vision-language models to propose and evaluate visual predicates, which are then optimized to form an abstract symbolic model. At test time, the model is used to plan sequences of low-level skills to achieve novel goals. Experiments show that the method can generalize effectively across various object types and scenarios, solving problems with longer horizons than seen during training.
研究旨在利用低级技能和少量演示来解决复杂机器人环境中的长期决策问题。它利用预训练的视觉-语言模型提出一组视觉谓词,并直接从图像中评估这些谓词。在训练过程中,使用基于优化的算法选择这些谓词中的一个紧凑子集,以形成抽象的符号世界模型。在测试时,使用该模型规划一系列低级技能来实现新的目标。实验表明,该方法可以有效地泛化到各种物体类型、环境和比训练时更长的时间跨度。
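The test-time step described above (build a symbolic state, then search for a sequence of skills that reaches the goal) can be illustrated with a minimal breadth-first planner over STRIPS-style operators. This is only a sketch under stated assumptions, not the authors' implementation: the `Operator` structure, the predicate strings, and the `Pick`/`Place` skills are hypothetical, and the abstract does not say which search algorithm is used.

```python
from collections import deque
from typing import NamedTuple, Optional


class Operator(NamedTuple):
    """A hypothetical symbolic model of one low-level skill."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def plan(initial, goal, operators) -> Optional[list]:
    """Breadth-first search over symbolic states; returns skill names or None."""
    start = frozenset(initial)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, skills = queue.popleft()
        if goal <= state:  # every goal predicate holds in this state
            return skills
        for op in operators:
            if op.preconditions <= state:
                nxt = (state - op.delete_effects) | op.add_effects
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, skills + [op.name]))
    return None  # goal unreachable with the given operators


# Hypothetical predicates, standing in for what a VLM might report for a scene.
operators = [
    Operator("Pick",
             frozenset({"HandEmpty", "On(cup,table)"}),
             frozenset({"Holding(cup)"}),
             frozenset({"HandEmpty", "On(cup,table)"})),
    Operator("Place",
             frozenset({"Holding(cup)"}),
             frozenset({"On(cup,shelf)", "HandEmpty"}),
             frozenset({"Holding(cup)"})),
]
skills = plan({"HandEmpty", "On(cup,table)"}, {"On(cup,shelf)"}, operators)
print(skills)  # ['Pick', 'Place']
```

In this toy setting the state is just a set of true predicates, so a new goal only changes the `goal` argument; the operators stay the same.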
X-SYS: A Reference Architecture for Interactive Explanation Systems
Authors: Tobias Labarta, Nhi Hoang, Maximilian Dreyer, Jim Berend, Oleg Hein, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin
First: 2026-02-13T09:24:03+00:00 · Latest: 2026-03-09T17:21:38+00:00
Comments: 18 pages, 8 figures
Abstract
The explainable AI (XAI) research community has proposed numerous technical methods, yet deploying explainability as systems remains challenging: interactive explanation systems require both suitable algorithms and system capabilities that maintain explanation usability across repeated queries, evolving models and data, and governance constraints. We argue that operationalizing XAI requires treating explainability as an information systems problem where user interaction demands induce specific system requirements. We introduce X-SYS, a reference architecture for interactive explanation systems, which guides (X)AI researchers, developers and practitioners in connecting interactive explanation user interfaces (XUI) with system capabilities. X-SYS organizes around four quality attributes named STAR (scalability, traceability, responsiveness, and adaptability), and specifies a five-component decomposition (XUI Services, Explanation Services, Model Services, Data Services, Orchestration and Governance). It maps interaction patterns to system capabilities to decouple user interface evolution from backend computation. We implement X-SYS through SemanticLens, a system for semantic search and activation steering in vision-language models. SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability. Together, this work provides a reusable blueprint and concrete instantiation for interactive explanation systems supporting end-to-end design under operational constraints.
中文标题/摘要
标题:X-SYS:交互解释系统参考架构
可解释人工智能(XAI)研究社区提出了众多技术方法,但将可解释性部署为系统仍然具有挑战性:交互解释系统需要合适的算法和系统能力,以保持解释在重复查询、模型和数据演变以及治理约束下的可用性。我们认为,实现XAI需要将可解释性视为信息系统问题,其中用户交互需求引发特定系统需求。我们介绍了X-SYS,一种交互解释系统的参考架构,指导(X)AI研究人员、开发人员和从业者将交互解释用户界面(XUI)与系统能力连接起来。X-SYS围绕四个质量属性(可扩展性、可追溯性、响应性和适应性)组织,并规定了五个组件分解(XUI服务、解释服务、模型服务、数据服务、编排和治理)。它将交互模式映射到系统能力,以解耦用户界面的演变与后端计算。我们通过SemanticLens系统实现了X-SYS,SemanticLens是一个用于视觉语言模型的语义搜索和激活引导系统。SemanticLens展示了基于合同的服务边界如何实现独立演变,离线/在线分离如何确保响应性,持久状态管理如何支持可追溯性。这项工作一起提供了一个可重用的蓝图和具体的实例,支持在运营约束下端到端设计交互解释系统。
Summary / 总结
The research aims to address the challenge of deploying explainable AI (XAI) systems by treating explainability as an information systems problem. X-SYS, a reference architecture, is introduced to guide the development of interactive explanation systems. It focuses on four quality attributes (scalability, traceability, responsiveness, and adaptability) and decomposes the system into five components. X-SYS maps interaction patterns to system capabilities, decoupling user interface evolution from backend computation. The implementation through SemanticLens demonstrates the effectiveness of this approach in enabling independent evolution, ensuring responsiveness, and supporting traceability.
研究旨在通过提出X-SYS交互解释系统参考架构来解决可解释AI (XAI) 系统的部署挑战。X-SYS关注四个质量属性(可扩展性、可追溯性、响应性和适应性),并将系统分解为五个组件。关键发现包括能够将用户界面的演变与后端计算分离,通过基于合同的服务边界确保独立演变,并通过持久状态管理确保响应性和可追溯性。
Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Authors: Earl Ranario, Mason J. Earles
First: 2025-12-17T21:22:44+00:00 · Latest: 2026-03-09T16:58:19+00:00
Abstract
Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
中文标题/摘要
标题:视觉语言模型在农业中是否准备好零样本替代监督分类模型?
视觉语言模型(VLMs)越来越多地被提议作为视觉识别任务的一般解决方案,但它们在农业决策支持中的可靠性仍不清楚。我们对来自AgML集合(https://github.com/Project-AgML)的27个农业图像分类数据集进行了基准测试,这些数据集涵盖了162个类别和248,000张图片,包括植物病害、害虫和损伤以及植物和杂草种类识别。在所有任务中,零样本VLMs的表现显著低于监督任务特定基线(YOLO11),后者始终比任何基础模型获得更高的准确率。在多项选择提示下,表现最佳的VLM(Gemini-3 Pro)的平均准确率为约62%,而开放式提示则导致性能大幅下降,原始准确率通常低于25%。基于LLM的语义评估提高了开放式提示的准确率(例如,顶级模型从约21%提高到约30%),并改变了模型排名,表明评估方法对报告结论有实质性影响。在开源模型中,Qwen-VL-72B表现最佳,在受限提示下接近闭源模型的性能,但仍落后于顶级专有系统。任务级分析表明,植物和杂草种类分类始终比害虫和损伤识别更容易,后者仍然是最具挑战性的类别。总体而言,这些结果表明,当前的即用型VLMs尚不适合作为独立的农业诊断系统,但在与受限界面、明确标签本体和领域意识评估策略配对时,可以作为辅助组件发挥作用。
Summary / 总结
The study benchmarks vision-language models (VLMs) on 27 agricultural image classification datasets, finding that zero-shot VLMs underperform a supervised task-specific baseline. Under multiple-choice prompting, Gemini-3 Pro achieves around 62% accuracy, while open-ended prompting yields lower performance. Applying LLM-based semantic judging improves open-ended accuracy and alters model rankings. Among open-source models, Qwen-VL-72B performs best but still lags behind top proprietary systems. The results suggest that current VLMs are not yet suitable as standalone diagnostic tools but can assist with constrained interfaces and domain-aware evaluation strategies.
研究评估了视觉语言模型(VLMs)在农业图像分类任务中的表现,将其与特定监督任务基线进行比较。在27个数据集中,VLMs表现显著低于基线,Gemini-3 Pro在多项选择提示下的准确率为约62%,而开放提示下的表现较低。应用基于LLM的语义评估提高了开放提示的准确性并改变了模型排名。在开源模型中,Qwen-VL-72B表现最佳,但仍落后于顶级专有系统。研究结果表明,当前的VLMs尚不适合作为独立的诊断工具,但在特定约束和评估方法下可以作为辅助组件使用。
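The gap between raw open-ended accuracy and judged accuracy reported above can be illustrated with a small scoring sketch. It is an assumption-laden toy, not the benchmark's code: the function names are hypothetical, and `substring_judge` is a deliberately crude stand-in for the LLM-based semantic judge mentioned in the abstract.

```python
from typing import Callable, Sequence


def exact_match_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Raw accuracy: a prediction counts only if it equals the label string."""
    hits = sum(p.strip().lower() == t.strip().lower()
               for p, t in zip(predictions, labels))
    return hits / len(labels)


def judged_accuracy(predictions: Sequence[str], labels: Sequence[str],
                    judge: Callable[[str, str], bool]) -> float:
    """Accuracy where a judge decides whether prediction and label agree."""
    hits = sum(bool(judge(p, t)) for p, t in zip(predictions, labels))
    return hits / len(labels)


def substring_judge(prediction: str, label: str) -> bool:
    """Toy stand-in for a semantic judge (assumption): substring overlap."""
    p, t = prediction.lower(), label.lower()
    return p in t or t in p


# Made-up open-ended answers and ground-truth labels, for illustration only.
predictions = ["Tomato late blight", "aphid", "corn"]
labels = ["late blight", "aphids", "wheat"]

raw = exact_match_accuracy(predictions, labels)
judged = judged_accuracy(predictions, labels, substring_judge)
print(raw, judged)
```

The same predictions score differently under the two rules, which is the sense in which evaluation methodology can change reported accuracy and model rankings.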
MetaWorld-X: Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation
Authors: Yutong Shen, Hangxu Liu, Penghui Liu, Jiashuo Luo, Yongkang Zhang, Rex Morvley, Chen Jiang, Jianwei Zhang, Lei Zhang
First: 2026-03-09T16:28:26+00:00 · Latest: 2026-03-09T16:28:26+00:00
Comments: 8 figures, https://syt2004.github.io/metaworldX/
Abstract
Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco-manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross-skill gradient interference and motion pattern conflicts in high-degree-of-freedom systems. As a result, generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld-X, a hierarchical world model framework for humanoid control. Guided by a divide-and-conquer principle, our method decomposes complex control problems into a set of Specialized Expert Policies (SEPs). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, introducing biomechanically consistent inductive biases that ensure natural and physically plausible motion generation. Building upon this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantic-driven expert composition. The VLM-guided router dynamically integrates expert policies according to high-level task semantics, facilitating compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.
中文标题/摘要
标题:MetaWorld-X:通过VLM协调专家进行类人行走操作建模
学习类人机器人在同时进行行走和操作(行走操作)时具有自然、稳定和组成上通用的整体控制策略仍然是机器人学中的一个基本挑战。现有的强化学习方法通常依赖单一的庞大策略来获取多种技能,这往往会导致跨技能梯度干扰和高自由度系统中的运动模式冲突。因此,生成的行为经常表现出不自然的运动、有限的稳定性和对复杂任务组成的不良泛化。为了解决这些限制,我们提出了MetaWorld-X,一种类人控制的分层世界模型框架。我们的方法遵循分而治之的原则,将复杂的控制问题分解为一组专门的专家策略(专门专家策略,SEP)。每个专家在人类运动先验下通过模仿约束强化学习进行训练,引入生物力学一致的归纳偏置,确保自然和物理上合理的运动生成。在此基础上,我们进一步开发了一种由视觉语言模型(VLM)监督的智能路由机制(IRM),实现语义驱动的专家组合。VLM引导的路由器根据高层次任务语义动态整合专家策略,促进多阶段行走操作任务中的组合泛化和自适应执行。
Summary / 总结
MetaWorld-X is a hierarchical world model framework designed to address the challenges of learning natural, stable, and generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation. It decomposes complex tasks into specialized expert policies trained with human motion priors and an Intelligent Routing Mechanism guided by a Vision-Language Model, which dynamically integrates these experts based on task semantics. This approach enhances compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.
MetaWorld-X 提出了一种分层世界模型框架,用于使类人机器人执行同时移动和操作的任务。该方法将问题分解为基于人类运动先验训练的专业专家策略,并通过视觉语言模型引导的智能路由机制根据任务语义动态整合这些专家,解决了现有方法中不自然的运动和差的一般化问题。
The Neural Compass: Probabilistic Relative Feature Fields for Robotic Search
Authors: Gabriele Somaschini, Adrian Röfer, Abhinav Valada
Venue: IROS 2026
First: 2026-03-09T16:11:05+00:00 · Latest: 2026-03-09T16:11:05+00:00
Comments: 9 pages, 7 figures, 2 tables, submitted to IROS 2026
Abstract
Object co-occurrences provide a key cue for finding objects successfully and efficiently in unfamiliar environments. Typically, one looks for cups in kitchens and views fridges as evidence of being in a kitchen. Such priors have also been exploited in artificial agents, but they are typically learned from explicitly labeled data or queried from language models. It is still unclear whether these relations can be learned implicitly from unlabeled observations alone. In this work, we address this problem and propose ProReFF, a feature field model trained to predict relative distributions of features obtained from pre-trained vision-language models. In addition, we introduce a learning-based strategy that enables training from unlabeled and potentially contradictory data by aligning inconsistent observations into a coherent relative distribution. For the downstream object search task, we propose an agent that leverages predicted feature distributions as a semantic prior to guide exploration toward regions with a high likelihood of containing the object. We present extensive evaluations demonstrating that ProReFF captures meaningful relative feature distributions in natural scenes and providing insight into the impact of our proposed alignment step. We further evaluate the performance of our search agent in 100 challenges in the Matterport3D simulator, comparing with feature-based baselines and human participants. The proposed agent is 20% more efficient than the strongest baseline and achieves up to 80% of human performance.
中文标题/摘要
标题:神经罗盘:基于相对特征场的概率特征场模型用于机器人搜索
物体共现为在陌生环境中成功且高效地找到物体提供了关键线索。通常,人们会在厨房寻找杯子,并将冰箱视为厨房的证据。这些先验知识也已被人工代理利用,但它们通常是从明确标记的数据中学习或从语言模型中查询得到的。尚不清楚这些关系是否可以从未标记的观察中隐式学习。在本文中,我们解决了这一问题并提出了ProReFF,这是一种训练以预测由预训练视觉语言模型获得的特征分布的特征场模型。此外,我们引入了一种基于学习的策略,通过将不一致的观察调整为一致的相对分布,从而能够从未标记且可能矛盾的数据中进行训练。对于下游的物体搜索任务,我们提出了一种代理,利用预测的特征分布作为语义先验,引导探索具有高物体存在可能性的区域。我们进行了广泛的评估,证明ProReFF能够捕捉自然场景中的有意义的相对特征分布,并提供了我们提出的对齐步骤的影响见解。我们还在Matterport3D模拟器中对我们的搜索代理进行了评估,与基于特征的基线和人类参与者进行了比较。所提出的代理比最强基线效率高20%,并达到了人类性能的80%。
Summary / 总结
This paper addresses the challenge of learning object co-occurrences implicitly from unlabeled data to improve robotic search efficiency. It introduces ProReFF, a feature field model trained to predict relative distributions of features using pre-trained vision-language models. The model aligns inconsistent observations to form a coherent relative distribution. The proposed agent uses these predictions as a semantic prior to guide exploration. Experiments show that ProReFF captures meaningful relative feature distributions and the agent outperforms baselines, being 20% more efficient and achieving up to 80% of human performance in 100 challenges in the Matterport3D simulator.
研究旨在利用物体共现关系提高机器人在陌生环境中的搜索效率。方法是训练ProReFF特征场模型,预测预训练视觉-语言模型的特征相对分布。该模型将不一致的观察结果对齐以形成一致的相对分布。提出的搜索代理使用这些预测作为语义先验来引导探索。实验表明,ProReFF能够捕捉有意义的相对特征分布,搜索代理比基线更高效,最高可达人类性能的80%。
SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
Authors: Chao Wang, Zijin Yang, Yaofei Wang, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen
First: 2026-03-09T16:06:26+00:00 · Latest: 2026-03-09T16:06:26+00:00
Abstract
Recent advancements in video generation technologies have been significant, resulting in their widespread application across multiple domains. However, concerns have been mounting over the potential misuse of generated content. Tracing the origin of generated videos has become crucial to mitigate potential misuse and identify responsible parties. Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the "few-shot training-free generated video attribution" task and propose SWIFT, which is tightly integrated with the temporal characteristics of the video. By leveraging the "Pixel Frames (many) to Latent Frame (one)" temporal mapping within each video chunk, SWIFT applies a fixed-length sliding window to perform two distinct reconstructions: normal and corrupted. The variation in the losses between the two reconstructions is then used as an attribution signal. We conducted an extensive evaluation on five state-of-the-art (SOTA) video generation models. Experimental results show that SWIFT achieves over 90% average attribution accuracy with merely 20 video samples across all models and even enables zero-shot attribution for HunyuanVideo, EasyAnimate, and Wan2.2. Our source code is available at https://github.com/wangchao0708/SWIFT.
中文标题/摘要
标题:SWIFT:滑动窗口重建用于少量样本无训练生成视频归属
视频生成技术的最新进展显著,使其在多个领域得到广泛应用。然而,人们对生成内容可能被滥用的担忧日益增加。追踪生成视频的来源变得至关重要,以减轻潜在滥用并识别责任方。现有的视频归属方法需要额外的操作或训练源归属模型,这可能会降低视频质量或需要大量的训练样本。为了解决这些挑战,我们首次定义了“少量样本无训练生成视频归属”任务,并提出了SWIFT,该方法紧密结合了视频的时间特性。通过利用每个视频片段内的“像素帧(多个)到潜在帧(一个)”的时间映射,SWIFT 应用固定长度的滑动窗口进行两种不同的重建:正常和损坏。两种重建之间的损失变化被用作归属信号。我们对五种最先进的(SOTA)视频生成模型进行了广泛的评估。实验结果表明,SWIFT 在所有模型上仅使用 20 个视频样本即可实现超过 90% 的平均归属准确率,并且甚至可以实现零样本归属,适用于 HunyuanVideo、EasyAnimate 和 Wan2.2。我们的源代码可在 https://github.com/wangchao0708/SWIFT 获取。
Summary / 总结
The research aims to address the challenge of tracing the origin of generated videos in the era of advanced video generation technologies. SWIFT, a novel method, is proposed to perform few-shot training-free generated video attribution by leveraging the temporal characteristics of videos. It uses a fixed-length sliding window to reconstruct normal and corrupted pixel frames, and the variation in losses between these reconstructions serves as an attribution signal. Experiments on five state-of-the-art video generation models demonstrate that SWIFT achieves over 90% average attribution accuracy with just 20 video samples, even enabling zero-shot attribution for some models.
论文解决了生成视频溯源的问题,这对于防止滥用至关重要。提出了一种名为SWIFT的方法,利用视频片段中的时间映射关系,在像素帧和潜在帧之间进行滑动窗口下的两种重建。实验结果显示,SWIFT使用20个视频样本即可在五种最先进的视频生成模型上达到超过90%的平均溯源准确率,甚至对某些模型支持零样本溯源。
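The idea of comparing a normal and a corrupted reconstruction inside each fixed-length window can be sketched as below. This is not the SWIFT implementation: the corruption (additive Gaussian noise), the non-overlapping stride, the mean-squared-error loss, and the `reconstruct` callable standing in for a generator's video autoencoder are all assumptions, since the abstract does not specify them. How the signal becomes an attribution decision is also left out.

```python
import numpy as np


def window_scores(frames, reconstruct, window=4, noise_std=0.1, seed=0):
    """Per-window gap between corrupted and normal reconstruction loss.

    frames:      array of shape (T, H, W), pixel frames of one video.
    reconstruct: hypothetical callable mapping a chunk to its reconstruction.
    """
    rng = np.random.default_rng(seed)
    scores = []
    # Slide a fixed-length window over the video (non-overlapping here, by assumption).
    for start in range(0, len(frames) - window + 1, window):
        chunk = frames[start:start + window]
        corrupted = chunk + rng.normal(0.0, noise_std, size=chunk.shape)
        normal_loss = np.mean((reconstruct(chunk) - chunk) ** 2)
        corrupted_loss = np.mean((reconstruct(corrupted) - chunk) ** 2)
        scores.append(corrupted_loss - normal_loss)
    return np.array(scores)


# Toy usage: an identity "autoencoder" reconstructs clean chunks perfectly,
# so the gap is just the energy of the injected noise.
frames = np.ones((8, 2, 2))
scores = window_scores(frames, reconstruct=lambda x: x)
print(scores.shape)  # (2,)
```

With a real autoencoder, the size of this gap would depend on how well the model reconstructs the video in the first place, which is the kind of variation the abstract describes as the attribution signal.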
Reading $\neq$ Seeing: Diagnosing and Closing the Typography Gap in Vision-Language Models
Authors: Heng Zhou, Ao Yu, Li Kang, Yuchen Fan, Yutao Fan, Xiufeng Song, Hejia Geng, Yiran Qin
First: 2026-03-09T15:31:47+00:00 · Latest: 2026-03-09T15:31:47+00:00
Abstract
Vision-Language Models achieve near-perfect accuracy at reading text in images, yet prove largely typography-blind: capable of recognizing what text says, but not how it looks. We systematically investigate this gap by evaluating font family, size, style, and color recognition across 26 fonts, four scripts, and three difficulty levels. Our evaluation of 15 state-of-the-art VLMs reveals a striking perception hierarchy: color recognition is near-perfect, yet font style detection remains universally poor. We further find that model scale fails to predict performance and that accuracy is uniform across difficulty levels, together pointing to a training-data omission rather than a capacity ceiling. LoRA fine-tuning on a small set of synthetic samples substantially improves an open-source model, narrowing the gap to the best closed-source system and surpassing it on font size recognition. Font style alone remains resistant to fine-tuning, suggesting that relational visual reasoning may require architectural innovation beyond current patch-based encoders. We release our evaluation framework, data, and fine-tuning recipe to support progress in closing the typographic gap in vision-language understanding.
中文标题/摘要
标题:阅读$\neq$观看:诊断并缩小视觉语言模型中的排版差距
视觉语言模型在识别图像中的文本方面几乎达到完美准确度,但对排版却表现出极大的盲视:能够识别文本说什么,但无法识别文本如何呈现。我们通过评估26种字体、四种书写系统和三种难度级别下的字体家族、大小、风格和颜色识别,系统地研究了这一差距。对15个最先进的视觉语言模型的评估揭示了一个显著的认知层次:颜色识别几乎完美,但字体风格检测普遍较差。我们还发现,模型规模无法预测性能,且准确性在不同难度级别上保持一致,这表明训练数据缺失而非容量限制。对少量合成样本进行LoRA微调显著提高了开源模型的性能,缩小了与最佳闭源系统的差距,并在字体大小识别上超过了它。仅字体风格对微调的抵抗力表明,关系视觉推理可能需要超越当前基于块的编码器的架构创新。我们发布了评估框架、数据和微调食谱,以支持缩小视觉语言理解中排版差距的研究进展。
Global Cross-Modal Geo-Localization: A Million-Scale Dataset and a Physical Consistency Learning Framework
Authors: Yutong Hu, Jinhui Chen, Chaoqiang Xu, Yuan Kou, Sili Zhou, Shaocheng Yan, Pengcheng Shi, Qingwu Hu, Jiayuan Li
First: 2026-03-09T15:27:19+00:00 · Latest: 2026-03-09T15:27:19+00:00
Abstract
Cross-modal Geo-localization (CMGL) matches ground-level text descriptions with geo-tagged aerial imagery, which is crucial for pedestrian navigation and emergency response. However, existing research is constrained by narrow geographic coverage and simplistic scene diversity, failing to reflect the immense spatial heterogeneity of global architectural styles and topographic features. To bridge this gap and facilitate universal positioning, we introduce CORE, the first million-scale dataset dedicated to global CMGL. CORE comprises 1,034,786 cross-view images sampled from 225 distinct geographic regions across all continents, offering an unprecedented variety of perspectives in varying environmental conditions and urban layouts. We leverage the zero-shot reasoning of Large Vision-Language Models (LVLMs) to synthesize high-quality scene descriptions rich in discriminative cues. Furthermore, we propose a physical-law-aware network (PLANET) for cross-modal geo-localization. PLANET introduces a novel contrastive learning paradigm to guide textual representations in capturing the intrinsic physical signatures of satellite imagery. Extensive experiments across varied geographic regions demonstrate that PLANET significantly outperforms state-of-the-art methods, establishing a new benchmark for robust, global-scale geo-localization. The dataset and source code will be released at https://github.com/YtH0823/CORE.
中文标题/摘要
标题:全球跨模态地理定位:百万规模数据集及物理一致性学习框架
跨模态地理定位(CMGL)将地面文本描述与地理标记的航空影像匹配,对于行人导航和应急响应至关重要。然而,现有研究受限于狭窄的地理覆盖范围和简单的场景多样性,未能反映全球建筑风格和地形特征的巨大空间异质性。为弥合这一差距并促进全球定位,我们引入了CORE,这是首个专门用于全球CMGL的百万规模数据集。CORE包含来自全球225个不同地理区域的1,034,786张跨视角图像,提供了在各种环境条件和城市布局下的前所未有的视角多样性。我们利用大型视觉-语言模型(LVLM)的零样本推理能力生成富含区分性线索的高质量场景描述。此外,我们提出了一种物理定律感知网络(PLANET)用于跨模态地理定位。PLANET引入了一种新颖的对比学习范式,以引导文本表示捕捉卫星影像的内在物理特征。在不同地理区域的广泛实验中,PLANET显著优于现有最先进的方法,建立了新的全球规模地理定位基准。数据集和源代码将在https://github.com/YtH0823/CORE发布。
Summary / 总结
The research aims to improve cross-modal geo-localization by addressing the limitations of existing datasets in terms of geographic coverage and scene diversity. To this end, the authors introduce CORE, a million-scale dataset covering 225 geographic regions worldwide, with scene descriptions synthesized by Large Vision-Language Models. They also propose PLANET, a physical-law-aware network that uses contrastive learning to guide textual representations toward the physical signatures of satellite imagery. Experimental results show that PLANET outperforms existing methods across various geographic regions, setting a new benchmark for global-scale geo-localization.
研究旨在通过解决现有数据集地理覆盖狭窄的问题,改进跨模态地理定位。为此,作者引入了CORE,这是一个覆盖225个全球地理区域的百万级数据集。他们使用大型视觉-语言模型生成高质量的场景描述,并提出了PLANET,这是一种物理定律感知的网络,通过新颖的对比学习方法增强跨模态地理定位。实验结果显示,PLANET在各种地理区域的表现优于现有方法,为全球规模的地理定位设定了新基准。
ODI-Bench: Can MLLMs Understand Immersive Omnidirectional Environments?
Authors: Liu Yang, Huiyu Duan, Ran Tao, Juntao Cheng, Sijing Wu, Yunhao Li, Jing Liu, Xiongkuo Min, Guangtao Zhai
First: 2025-10-13T15:51:47+00:00 · Latest: 2026-03-09T15:21:50+00:00
Abstract
Omnidirectional images (ODIs) provide a full 360x180 view and are widely adopted in VR, AR and embodied intelligence applications. While multi-modal large language models (MLLMs) have demonstrated remarkable performance on conventional 2D image and video understanding benchmarks, their ability to comprehend the immersive environments captured by ODIs remains largely unexplored. To address this gap, we first present ODI-Bench, a novel comprehensive benchmark specifically designed for omnidirectional image understanding. ODI-Bench contains 2,000 high-quality omnidirectional images and over 4,000 manually annotated question-answering (QA) pairs across 10 fine-grained tasks, covering both general-level and spatial-level ODI understanding. Extensive experiments are conducted to benchmark 20 representative MLLMs, including proprietary and open-source models, under both close-ended and open-ended settings. Experimental results reveal that current MLLMs still struggle to capture the immersive context provided by ODIs. To this end, we further introduce Omni-CoT, a training-free method which significantly enhances MLLMs' comprehension ability in the omnidirectional environment through chain-of-thought reasoning across both textual information and visual cues. Both the benchmark and the code will be released at https://github.com/ylylyl-sjtu/ODI-Bench.
中文标题/摘要
标题:ODI-Bench:MLLMs能否理解沉浸式全景环境?
全景图像(ODIs)提供了360x180度的全视角,广泛应用于VR、AR和具身智能应用中。尽管多模态大型语言模型(MLLMs)在传统的2D图像和视频理解基准测试中表现出色,但它们理解由ODIs捕获的沉浸式环境的能力仍鲜有探索。为解决这一问题,我们首先提出了ODI-Bench,这是一个专门用于全景图像理解的新颖综合基准。ODI-Bench 包含2000张高质量的全景图像和超过4000个手动标注的问题-答案(QA)对,涵盖了10个细粒度任务,包括一般层面和空间层面的ODI理解。进行了广泛的实验,以在封闭式和开放式设置下对20个代表性MLLMs进行基准测试,包括专有和开源模型。实验结果表明,当前的MLLMs仍然难以捕捉ODIs提供的沉浸式环境。为此,我们进一步引入了Omni-CoT,这是一种无需训练的方法,通过跨文本信息和视觉线索的链式推理显著增强了MLLMs在全景环境中的理解能力。基准测试和代码将在https://github.com/ylylyl-sjtu/ODI-Bench上发布。
Summary / 总结
The research aims to evaluate the ability of multi-modal large language models (MLLMs) to understand immersive environments captured by omnidirectional images (ODIs). ODI-Bench, a novel benchmark, was created to assess 20 representative MLLMs on 10 tasks involving both general and spatial understanding of ODIs. The experiments show that current MLLMs have difficulty comprehending the immersive context of ODIs. To improve this, the study introduces Omni-CoT, a training-free method that enhances MLLMs' comprehension through chain-of-thought reasoning across textual and visual information, significantly improving their performance on ODI understanding tasks. The benchmark and code will be released at https://github.com/ylylyl-sjtu/ODI-Bench.
研究旨在评估多模态大型语言模型(MLLMs)理解全景图像(ODIs)所捕捉的沉浸式环境的能力。研究创建了ODI-Bench这一新型基准,评估了20种代表性MLLMs在涉及全景图像通用和空间理解的10项任务上的表现。实验结果显示,当前的MLLMs难以理解ODIs提供的沉浸式背景。为此,研究引入了Omni-CoT,这是一种无需训练的方法,通过在文本信息和视觉线索之间的链式推理来增强MLLMs在全景环境中的理解能力,显著提高了其在ODI理解任务上的表现。https://github.com/ylylyl-sjtu/ODI-Bench
Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images
Authors: Qishun Yang, Shu Yang, Lijie Hu, Di Wang
First: 2026-03-09T15:20:53+00:00 · Latest: 2026-03-09T15:20:53+00:00
Abstract
Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; yet, threat-related concepts are concrete and visually depictable, while safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the self-fulfilling mechanism from text to visual modalities, offering a label-free approach to VLM alignment.
中文标题/摘要
标题:视觉自我实现对齐:通过威胁相关图像塑造安全导向的人格
多模态大型语言模型(MLLMs)面临安全不对齐的问题,其中视觉输入可能导致有害输出。为了解决这一问题,现有方法需要显式安全标签或对比数据;然而,威胁相关概念具体且可视觉呈现,而诸如乐于助人等安全概念则抽象且缺乏视觉参照。受潜在的自我实现机制启发,我们提出视觉自我实现对齐(VSFA)。VSFA 在围绕威胁相关图像构建的中立视觉问答任务上微调视觉语言模型(VLMs),无需任何安全标签。通过反复接触威胁相关视觉内容,模型内化了警惕和谨慎的隐含语义,塑造了安全导向的人格。在多个 VLMs 和安全基准上的实验表明,VSFA 降低了攻击成功率,提高了响应质量,减轻了过度拒绝现象,同时保留了通用能力。我们的工作将自我实现机制从文本扩展到视觉模态,提供了一种无标签的 VLMs 对齐方法。
Summary / 总结
The research aims to address safety misalignment in multimodal large language models (MLLMs) by leveraging threat-related images to shape safety-oriented personas. VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks involving threat-related images without explicit safety labels. The experiments show that VSFA reduces attack success rates, improves response quality, and mitigates over-refusal while maintaining general capabilities.
研究旨在通过利用威胁相关图像来塑造安全导向的人格,解决多模态大型语言模型(MLLMs)的安全对齐问题。VSFA在涉及威胁相关图像的中性VQA任务上微调视觉语言模型(VLMs),无需明确的安全标签。实验表明,VSFA降低了攻击成功率,提高了响应质量,并减轻了过度拒绝现象,同时保持了通用能力。
R2F: Repurposing Ray Frontiers for LLM-free Object Navigation
Authors: Francesco Argenziano, John Mark Alexis Marcelo, Michele Brienza, Abdel Hakim Drid, Emanuele Musumeci, Daniele Nardi, Domenico D. Bloisi, Vincenzo Suriani
First: 2026-03-09T15:10:10+00:00 · Latest: 2026-03-09T15:10:10+00:00
Abstract
Zero-shot open-vocabulary object navigation has progressed rapidly with the emergence of large Vision-Language Models (VLMs) and Large Language Models (LLMs), now widely used as high-level decision-makers instead of end-to-end policies. Although effective, such systems often rely on iterative large-model queries at inference time, introducing latency and computational overhead that limit real-time deployment. To address this problem, we repurpose ray frontiers (R2F), a recently proposed frontier-based exploration paradigm, to develop an LLM-free framework for indoor open-vocabulary object navigation. While ray frontiers were originally used to bias exploration using semantic cues carried along rays, we reinterpret frontier regions as explicit, direction-conditioned semantic hypotheses that serve as navigation goals. Language-aligned features accumulated along out-of-range rays are stored sparsely at frontiers, where each region maintains multiple directional embeddings encoding plausible unseen content. In this way, navigation then reduces to embedding-based frontier scoring and goal tracking within a classical mapping and planning pipeline, eliminating iterative large-model reasoning. We further introduce R2F-VLN, a lightweight extension for free-form language instructions using syntactic parsing and relational verification without additional VLM or LLM components. Experiments in Habitat-sim and on a real robotic platform demonstrate competitive state-of-the-art zero-shot performance with real-time execution, achieving up to 6 times faster runtime than VLM-based alternatives.
中文标题/摘要
标题:R2F:重新利用射线前沿实现无需LLM的物体导航
零样本开放词汇物体导航随着大型视觉-语言模型(VLM)和大型语言模型(LLM)的出现而迅速发展,这些模型现在被广泛用作高级决策者,而不是端到端策略。尽管有效,但此类系统通常依赖于推理时的迭代大型模型查询,引入了延迟和计算开销,限制了实时部署。为了解决这个问题,我们重新利用了射线前沿(R2F)这一最近提出的基于前沿的探索范式,开发了一个无需LLM的框架,用于室内开放词汇物体导航。虽然射线前沿最初用于利用沿射线携带的语义线索来偏置探索,但我们将前沿区域重新解释为显式的、以方向为条件的语义假设,并将其作为导航目标。沿超出量程的射线累积的语言对齐特征稀疏地存储在前沿处,每个区域维护多个方向嵌入,编码可能存在的未见内容。这样,导航就简化为经典建图与规划流程中基于嵌入的前沿评分和目标跟踪,消除了迭代的大型模型推理。我们进一步引入了R2F-VLN,这是一种面向自由形式语言指令的轻量级扩展,使用句法解析和关系验证,无需额外的VLM或LLM组件。在Habitat-sim和真实机器人平台上的实验表明,该方法在实时执行的同时取得了具有竞争力的最先进零样本性能,运行速度最高可达基于VLM的替代方案的6倍。
Summary / 总结
The paper addresses the latency and computational overhead of using large models for zero-shot open-vocabulary object navigation. It proposes R2F, an LLM-free framework that repurposes ray frontiers as direction-conditioned semantic hypotheses for indoor navigation. By storing language-aligned features at frontiers and using embedding-based frontier scoring, R2F eliminates iterative large-model queries at inference time. Experiments in Habitat-sim and on a real robot show that R2F achieves competitive zero-shot performance with real-time execution, running up to 6 times faster than VLM-based alternatives.
研究旨在解决依赖大型模型的零样本开放词汇物体导航系统中存在的延迟和计算开销问题。提出了一个LLM-free框架R2F,通过重新利用射线前沿来导航室内环境。通过在前沿存储语言对齐特征并使用嵌入式评分,该系统减少了迭代大型模型查询的需求,实现了实时执行,并达到了与VLM基线相当的零样本性能,速度快6倍。
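The embedding-based frontier scoring described in the R2F entry above can be illustrated with a minimal sketch. It assumes each frontier stores a list of directional embeddings and that the goal is a text embedding in the same space; the function names, the toy 3-dimensional vectors, and the choice of max cosine similarity as the score are all hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def score_frontiers(frontiers, goal_embedding):
    """Score each frontier by its best-matching directional embedding.

    frontiers: dict mapping a frontier id to a list of directional embeddings.
    Taking the max over directions is an assumption made for this sketch.
    """
    return {
        fid: max((cosine(e, goal_embedding) for e in embs), default=-1.0)
        for fid, embs in frontiers.items()
    }

def select_goal(frontiers, goal_embedding):
    """Return the id of the highest-scoring frontier."""
    scores = score_frontiers(frontiers, goal_embedding)
    return max(scores, key=scores.get)

# Toy data: two frontiers, each with two directional embeddings (hypothetical values).
frontiers = {
    "hallway": [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0]],
    "kitchen": [[0.0, 1.0, 0.1], [0.0, 0.2, 1.0]],
}
goal = [0.0, 1.0, 0.0]  # stand-in for a language-aligned goal embedding
best = select_goal(frontiers, goal)
print(best)
```

In a full system the selected frontier would be handed to a classical planner as the navigation goal; no large model is queried inside this loop.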
Can Vision-Language Models Solve the Shell Game?
Authors: Tiedong Liu, Wee Sun Lee
First: 2026-03-09T14:33:25+00:00 · Latest: 2026-03-09T14:33:25+00:00
Abstract
Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io .
中文标题/摘要
标题:视觉-语言模型能否解决壳游戏问题?
视觉实体跟踪是人类的一种天生认知能力,但仍然是视觉-语言模型(VLMs)的关键瓶颈。这一缺陷在现有的视频基准测试中往往被视觉捷径所掩盖。我们引入了VET-Bench,这是一个合成诊断测试平台,其中包含视觉上相同的对象,需要通过时空连续性进行跟踪。我们的实验表明,当前最先进的VLMs在VET-Bench上的表现接近随机水平,揭示了一个根本性的局限:过度依赖静态帧级特征,无法在时间上保持实体表示。我们进行了理论分析,将问题与状态跟踪问题联系起来,证明固定深度的基于变压器的VLMs在没有中间监督的情况下跟踪不可区分的对象时存在表达能力限制。为了解决这个问题,我们提出了时空定位链式思考(SGCoT):生成对象轨迹作为显式的中间状态。利用Molmo2的物体跟踪能力,通过在合成文本数据上进行微调来实现SGCoT推理。我们的方法在VET-Bench上的准确率超过90%,证明了VLMs可以在不使用外部工具的情况下端到端地解决视频壳游戏任务。我们的代码和数据可在https://vetbench.github.io 获取。
Summary / 总结
The research evaluates whether Vision-Language Models (VLMs) can track visual entities over time, an ability that is innate in humans but remains a critical bottleneck for VLMs. The study introduces VET-Bench, a synthetic diagnostic testbed with visually identical objects that can only be tracked through spatiotemporal continuity. Experiments show that current state-of-the-art VLMs perform at or near chance level on VET-Bench, indicating an over-reliance on static frame-level features and a failure to maintain entity representations over time. The proposed Spatiotemporal Grounded Chain-of-Thought (SGCoT) method generates object trajectories as explicit intermediate states and raises accuracy to over 90% on VET-Bench, showing that VLMs can solve the video shell-game task end-to-end without external tools.
研究旨在评估视觉语言模型(VLMs)在时间上跟踪视觉上相同物体的能力,这是人类的一项关键认知技能。研究引入了VET-Bench,这是一个合成测试平台,要求仅通过时空连续性进行跟踪。实验显示,当前最先进的VLMs在VET-Bench上的表现不佳,表明它们依赖于静态帧级特征,并且无法在时间上保持实体表示。提出的时空定位链式思考(SGCoT)方法通过生成物体轨迹作为中间状态,显著提高了性能,在VET-Bench上达到了最先进的准确率。
SYNAPSE: Framework for Neuron Analysis and Perturbation in Sequence Encoding
Authors: Jesús Sánchez Ochoa, Enrique Tomás Martínez Beltrán, Alberto Huertas Celdrán
First: 2026-03-09T14:18:47+00:00 · Latest: 2026-03-09T14:18:47+00:00
Abstract
In recent years, Artificial Intelligence has become a powerful partner for complex tasks such as data analysis, prediction, and problem-solving, yet its lack of transparency raises concerns about its reliability. In sensitive domains such as healthcare or cybersecurity, ensuring transparency, trustworthiness, and robustness is essential, since the consequences of wrong decisions or successful attacks can be severe. Prior neuron-level interpretability approaches are primarily descriptive, task-dependent, or require retraining, which limits their use as systematic, reusable tools for evaluating internal robustness across architectures and domains. To overcome these limitations, this work proposes SYNAPSE, a systematic, training-free framework for understanding and stress-testing the internal behavior of Transformer models across domains. It extracts per-layer [CLS] representations, trains a lightweight linear probe to obtain global and per-class neuron rankings, and applies forward-hook interventions during inference. This design enables controlled experiments on internal representations without altering the original model, thereby allowing weaknesses, stability patterns, and label-specific sensitivities to be measured and compared directly across tasks and architectures. Across all experiments, SYNAPSE reveals a consistent, domain-independent organization of internal representations, in which task-relevant information is encoded in broad, overlapping neuron subsets. This redundancy provides a strong degree of functional stability, while class-wise asymmetries expose heterogeneous specialization patterns and enable label-aware analysis. In contrast, small structured manipulations in weight or logit space are sufficient to redirect predictions, highlighting complementary vulnerability profiles and illustrating how SYNAPSE can guide the development of more robust Transformer models.
中文标题/摘要
标题:SYNAPSE:序列编码中神经元分析和扰动的框架
近年来,人工智能已成为数据分析、预测和问题解决等复杂任务的强大合作伙伴,但其缺乏透明性引发了对其可靠性的担忧。在医疗保健或网络安全等敏感领域,确保透明性、可信性和鲁棒性至关重要,因为错误决策或成功攻击的后果可能非常严重。先前的神经元级可解释性方法主要具有描述性、任务依赖性或需要重新训练,这限制了它们作为系统性、可重用工具来评估跨架构和领域的内部鲁棒性。为克服这些限制,本研究提出了SYNAPSE,这是一种无需训练的系统性框架,用于跨领域理解并测试变压器模型的内部行为。该框架提取每层[CLS]表示,训练一个轻量级线性探针以获得全局和每类神经元排名,并在推理期间应用前向钩子干预。这种设计允许在不改变原始模型的情况下进行受控的内部表示实验,从而可以直接测量和比较任务和架构之间的弱点、稳定性模式和标签特定敏感性。在所有实验中,SYNAPSE揭示了一种一致的、跨领域的内部表示组织,在其中任务相关信息编码在广泛的、重叠的神经元子集中。这种冗余性提供了强大的功能稳定性,而类别间的不对称性揭示了异质的专业化模式,并允许进行标签感知分析。相比之下,在权重或对数概率空间中的小结构化操纵足以重定向预测,突显了互补的脆弱性特征,并说明了SYNAPSE如何指导更鲁棒的变压器模型的开发。
Summary / 总结
SYNAPSE is a framework designed to analyze and perturb neuron behavior in Transformer models, addressing the lack of transparency in AI systems. It extracts per-layer [CLS] representations, trains a lightweight linear probe for neuron ranking, and applies forward-hook interventions during inference. Key findings include a consistent, domain-independent organization of internal representations with redundant task-relevant information and class-wise asymmetries indicating heterogeneous specialization patterns. Small manipulations in weight or logit space can redirect predictions, highlighting vulnerabilities and guiding the development of more robust models.
SYNAPSE 是一个框架,旨在分析和扰动 Transformer 模型中的神经元行为,以解决 AI 系统缺乏透明度的问题。它提取每层的 [CLS] 表示,训练一个轻量级的线性探针来对神经元进行排名,并在推理过程中应用前向钩子干预而不改变原始模型。关键发现包括内部表示的一致且跨领域不变的组织,其中包含冗余的任务相关信息,以及类别间的不对称性表明了异质的专业化模式。小的权重或 logits 空间结构化操作足以改变预测,突显了 Transformer 模型潜在的脆弱性特征。
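The rank-then-intervene workflow in the SYNAPSE entry above can be sketched in a few lines. This is only an illustration under stated assumptions: the probe weights are given rather than trained, neurons are ranked by absolute probe weight (the paper's ranking criterion is not stated in the abstract), and a plain function stands in for the forward-hook intervention the framework applies during inference. All names and numbers are hypothetical.

```python
def rank_neurons(probe_weights):
    """Rank neurons globally and per class by absolute probe weight.

    probe_weights[c][j] is the weight of neuron j for class c.
    """
    n = len(probe_weights[0])
    global_score = [sum(abs(row[j]) for row in probe_weights) for j in range(n)]
    global_rank = sorted(range(n), key=lambda j: -global_score[j])
    per_class = [sorted(range(n), key=lambda j: -abs(row[j])) for row in probe_weights]
    return global_rank, per_class

def ablate(activation, neuron_ids):
    """Zero the selected neurons; stands in for a forward-hook intervention."""
    ids = set(neuron_ids)
    return [0.0 if j in ids else a for j, a in enumerate(activation)]

def predict(probe_weights, activation):
    """Argmax over the linear probe's logits."""
    logits = [sum(w * a for w, a in zip(row, activation)) for row in probe_weights]
    return max(range(len(logits)), key=lambda c: logits[c])

# Toy probe with 2 classes and 4 neurons, plus one [CLS]-like activation vector.
probe_weights = [[2.0, 0.1, -0.3, 0.0], [-1.5, 0.2, 0.4, 0.1]]
activation = [1.0, 0.5, 0.2, 0.3]

global_rank, per_class_rank = rank_neurons(probe_weights)
before = predict(probe_weights, activation)
after = predict(probe_weights, ablate(activation, global_rank[:1]))
print(global_rank, before, after)
```

The original model's weights are never modified here; only the activation passed to the probe changes, which mirrors the non-invasive design the abstract describes. The toy prediction flips after a single ablation only because the example is tiny and has no redundancy.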
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers
Authors: Fotios Lygerakis, Ozan Özdenizci, Elmar Rückert
First: 2025-05-26T14:19:29+00:00 · Latest: 2026-03-09T14:16:16+00:00
Abstract
Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added on the joint token sequence immediately before attention, providing a shared positional vocabulary at the stage where cross-modal interaction occurs. We make the positional injection points explicit and conduct controlled ablations that isolate their effect before a token-wise nonlinearity versus immediately before self-attention. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes
中文标题/摘要
标题:ViTaPEs:跨模态对齐的视触觉位置编码在多模态变换器中的应用
触觉感知提供了与视觉感知互补的局部关键信息,如纹理、顺应性和力。尽管在视触觉表示学习方面取得了进展,但在不严重依赖预训练的视觉-语言模型的情况下,将这些模态融合并在不同任务和环境中泛化仍然存在挑战。此外,现有方法没有研究位置编码,从而忽略了捕捉细粒度视触觉相关性的多阶段空间推理所需。我们提出了ViTaPEs,这是一种基于变换器的架构,用于从配对的视觉和触觉输入中学习任务无关的视触觉表示。我们的核心思想是两阶段的位置注入:在每个流中添加局部(模态特定)位置编码,并在注意力之前立即在联合标记序列上添加全局位置编码,从而在跨模态交互发生时提供共享的位置词汇表。我们在标记非线性之前和自注意力之前显式地使位置注入点变得明确,并进行受控消融实验以隔离其效果。在多个大规模真实世界数据集上的实验表明,ViTaPEs 不仅在各种识别任务中超越了最先进的基线,还展示了对未见过的、域外场景的零样本泛化能力。我们进一步展示了ViTaPEs在机器人抓取任务中的迁移学习能力,它在预测抓取成功率方面优于最先进的基线。项目页面:https://sites.google.com/view/vitapes
Summary / 总结
The research aims to improve the fusion of visual and tactile data for better multimodal understanding, addressing the limitations of existing methods in positional encoding and cross-modal alignment. The proposed ViTaPEs model uses a two-stage positional injection to enhance task-agnostic visuotactile representations, leading to superior performance across various recognition tasks and demonstrating zero-shot generalization to unseen scenarios. Additionally, ViTaPEs outperforms state-of-the-art methods in predicting grasp success in robotic grasping tasks.
研究旨在通过融合视觉和触觉数据来提升多模态理解,解决现有方法在位置编码和跨模态对齐方面的局限性。提出的ViTaPEs模型采用两阶段的位置注入来增强任务无关的视觉触觉表示,不仅在各种识别任务中表现优于现有基线,还展示了对未见过的场景的零样本泛化能力。此外,ViTaPEs在机器人抓取任务中预测抓取成功率方面也优于现有基线方法。
RoboLayout: Differentiable 3D Scene Generation for Embodied Agents
Authors: Ali Shamsaddinlou
First: 2026-02-18T08:17:05+00:00 · Latest: 2026-03-09T14:05:21+00:00
Abstract
Recent advances in vision language models (VLMs) have shown strong potential for spatial reasoning and 3D scene layout generation from open-ended language instructions. However, generating layouts that are not only semantically coherent but also feasible for interaction by embodied agents remains challenging, particularly in physically constrained indoor environments. In this paper, RoboLayout is introduced as an extension of LayoutVLM that augments the original framework with agent-aware reasoning and improved optimization stability. RoboLayout integrates explicit reachability constraints into a differentiable layout optimization process, enabling the generation of layouts that are navigable and actionable by embodied agents. Importantly, the agent abstraction is not limited to a specific robot platform and can represent diverse entities with distinct physical capabilities, such as service robots, warehouse robots, humans of different age groups, or animals, allowing environment design to be tailored to the intended agent. In addition, a local refinement stage is proposed that selectively reoptimizes problematic object placements while keeping the remainder of the scene fixed, improving convergence efficiency without increasing global optimization iterations. Overall, RoboLayout preserves the strong semantic alignment and physical plausibility of LayoutVLM while enhancing applicability to agent-centric indoor scene generation, as demonstrated by experimental results across diverse scene configurations.
中文标题/摘要
标题:RoboLayout:面向体态智能体的可微分3D场景生成
近期视觉语言模型(VLMs)的发展显示了其在基于开放语言指令进行空间推理和3D场景布局生成方面的强大潜力。然而,生成既具有语义一致性又适合体态智能体交互的布局,在物理受限的室内环境中仍然是一个挑战。本文介绍了RoboLayout,它是LayoutVLM的扩展,增加了体态智能体意识的推理和优化稳定性改进。RoboLayout将显式的可达性约束整合到可微分布局优化过程中,使得生成的布局能够被体态智能体导航和操作。重要的是,体态智能体的抽象不限于特定的机器人平台,可以代表具有不同物理能力的多种实体,如服务机器人、仓库机器人、不同年龄段的人类或动物,从而允许环境设计针对预期的智能体进行定制。此外,提出了一种局部细化阶段,该阶段选择性地重新优化有问题的对象放置,同时保持场景的其余部分固定,从而提高收敛效率,而不增加全局优化迭代次数。总体而言,RoboLayout保持了LayoutVLM强大的语义对齐和物理合理性,同时增强了其在面向体态智能体的室内场景生成中的应用性,实验结果表明了这一点,涵盖了多种场景配置。
Summary / 总结
RoboLayout extends LayoutVLM by incorporating agent-aware reasoning and reachability constraints into a differentiable optimization process, enabling the generation of navigable and actionable 3D layouts for embodied agents. The system can represent various entities with different physical capabilities, improving the applicability to diverse indoor environments. A local refinement stage further enhances convergence efficiency. Experiments show that RoboLayout maintains semantic alignment and physical plausibility while generating layouts suitable for different types of embodied agents.
RoboLayout 在 LayoutVLM 的基础上引入了基于代理的推理和可达性约束,通过可微优化过程生成可导航和可操作的 3D 布局,适用于实体代理。方法中包含一个局部细化阶段以提高收敛效率。实验结果表明,RoboLayout 生成的布局具有较强的语义对齐和物理合理性,适用于多种实体代理在室内环境中的场景生成。
Local-Global Prompt Learning via Sparse Optimal Transport
Authors: Deniz Kizaroğlu, Ülku Tuncer Küçüktas, Emre Çakmakyurdu, Alptekin Temizel
First: 2026-03-09T13:09:55+00:00 · Latest: 2026-03-09T13:09:55+00:00
Comments: 9 pages, 3 figures, 4 tables. Code available at GitHub
Abstract
Few-shot adaptation of vision-language models (VLMs) like CLIP typically relies on learning textual prompts matched to global image embeddings. Recent works extend this paradigm by incorporating local image-text alignment to capture fine-grained visual cues, yet these approaches often select local regions independently for each prompt, leading to redundant local feature usage and prompt overlap. We propose SOT-GLP, which introduces a shared sparse patch support and balanced optimal transport allocation to explicitly partition salient visual regions among class-specific local prompts while preserving global alignment. Our method learns shared global prompts and class-specific local prompts. The global branch maintains standard image-text matching for robust category-level alignment. The local branch constructs a class-conditioned sparse patch set using V-V attention and aligns it to multiple class-specific prompts via balanced entropic optimal transport, yielding a soft partition of patches that prevents prompt overlap and collapse. We evaluate our method on two complementary objectives: (i) few-shot classification accuracy on 11 standard benchmarks and (ii) out-of-distribution (OOD) detection. On the standard 11-dataset benchmark with 16-shot ViT-B/16, SOT-GLP achieves 85.1% average accuracy, outperforming prior prompt-learning methods. We identify a distinct accuracy-robustness trade-off in prompt learning: while learnable projections optimize in-distribution fit, they alter the foundational feature space. We demonstrate that a projection-free local alignment preserves the native geometry of the CLIP manifold, yielding state-of-the-art OOD detection performance (94.2% AUC) that surpasses fully adapted models. Implementation available at: https://github.com/Deniz2304988/SOT-GLP
中文标题/摘要
标题:基于稀疏最优传输的局部-全局提示学习
视觉-语言模型(VLMs)如CLIP的少样本适应通常依赖于学习与全局图像嵌入匹配的文本提示。近期工作通过引入局部图像-文本对齐来扩展这一范式,以捕捉细粒度的视觉线索,但这些方法通常独立地为每个提示选择局部区域,导致局部特征的冗余使用和提示重叠。我们提出了SOT-GLP,该方法引入了共享的稀疏补丁支持和平衡的最优传输分配,以明确地将显著的视觉区域在类别特定的局部提示之间进行分区,同时保持全局对齐。我们的方法学习共享的全局提示和类别特定的局部提示。全局分支保持标准的图像-文本匹配,以实现稳健的类别级对齐。局部分支使用V-V注意力构建类别条件下的稀疏补丁集,并通过平衡的熵最优传输将其对齐到多个类别特定的提示,从而产生防止提示重叠和崩溃的软分区。我们在两个互补目标上评估了我们的方法:(i)在11个标准基准上的少样本分类准确率和(ii)异常分布检测(OOD)。在11个数据集基准上使用16-shot ViT-B/16,SOT-GLP实现了85.1%的平均准确率,优于先前的提示学习方法。我们发现提示学习中存在准确率-鲁棒性权衡:可学习的投影优化了分布内拟合,但改变了基础特征空间。我们证明了无投影的局部对齐保留了CLIP流形的原生几何结构,实现了最先进的异常分布检测性能(94.2% AUC),超过了完全适应的模型。代码可在GitHub获取:https://github.com/Deniz2304988/SOT-GLP
Summary / 总结
The research aims to improve few-shot adaptation of vision-language models by addressing redundancy and prompt overlap in local image-text alignment. SOT-GLP introduces a shared sparse patch support and balanced optimal transport to partition salient visual regions among class-specific local prompts while maintaining global alignment. On the standard 11-dataset benchmark with 16-shot ViT-B/16, SOT-GLP achieves 85.1% average accuracy, outperforming prior prompt-learning methods. Its projection-free local alignment also reaches 94.2% AUC on out-of-distribution detection, surpassing fully adapted models.
研究旨在通过解决局部特征冗余和提示重叠的问题,改进视觉-语言模型的少量样本适应。方法SOT-GLP引入了共享稀疏补丁支持和平衡最优传输分配,以在保持全局对齐的同时将显著的视觉区域分配给类特定的局部提示。在11个标准基准测试中,使用16-shot ViT-B/16,SOT-GLP实现了85.1%的平均准确率,超过了之前的提示学习方法。此外,SOT-GLP在异常分布检测中的表现也更为出色,AUC达到94.2%,超过了完全适应的模型。
ExGS: Extreme 3D Gaussian Compression with Diffusion Priors
Authors: Jiaqi Chen, Xinhao Ji, Yuanyuan Gao, Hao Li, Yuning Gong, Yifei Liu, Dan Xu, Zhihang Zhong, Dingwen Zhang, Xiao Sun
First: 2025-09-29T13:23:06+00:00 · Latest: 2026-03-09T12:52:20+00:00
Abstract
Neural scene representations, such as 3D Gaussian Splatting (3DGS), have enabled high-quality neural rendering; however, their large storage and transmission costs hinder deployment in resource-constrained environments. Existing compression methods either rely on costly optimization, which is slow and scene-specific, or adopt training-free pruning and quantization, which degrade rendering quality under high compression ratios. In contrast, recent data-driven approaches provide a promising direction to overcome this trade-off, enabling efficient compression while preserving high rendering quality. We introduce ExGS, a novel feed-forward framework that unifies Universal Gaussian Compression (UGC) with GaussPainter for Extreme 3DGS compression. UGC performs re-optimization-free pruning to aggressively reduce Gaussian primitives while retaining only essential information, whereas GaussPainter leverages powerful diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned Gaussian scenes. Unlike conventional inpainting, GaussPainter not only fills in missing regions but also enhances visible pixels, yielding substantial improvements in degraded renderings. To ensure practicality, it adopts a lightweight VAE and a one-step diffusion design, enabling real-time restoration. Our framework can even achieve over 100X compression (reducing a typical 354.77 MB model to about 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions. These results highlight the central role of diffusion priors in bridging the gap between extreme compression and high-quality neural rendering. Our code repository will be released at: https://github.com/chenttt2001/ExGS
中文标题/摘要
标题:ExGS:极端3D高斯压缩与扩散先验
神经场景表示,如3D高斯斑点化(3DGS),已实现高质量的神经渲染;然而,其庞大的存储和传输成本阻碍了在资源受限环境中的部署。现有压缩方法要么依赖昂贵的优化,这既慢又针对特定场景,要么采用无训练剪枝和量化,这在高压缩比下会降低渲染质量。相比之下,最近的数据驱动方法为克服这一权衡提供了有希望的方向,实现了高效压缩同时保持高质量渲染。我们提出了ExGS,这是一种新颖的前馈框架,将通用高斯压缩(UGC)与GaussPainter统一起来,用于极端3DGS压缩。UGC通过不重新优化的剪枝激进地减少高斯原语,同时保留关键信息,而GaussPainter利用强大的扩散先验和掩码引导细化,从高度剪枝的高斯场景中恢复高质量渲染。与传统的修补不同,GaussPainter不仅填补缺失区域,还增强可见像素,显著改善了降级渲染。为了确保实用性,它采用轻量级的VAE和一步扩散设计,实现实时恢复。我们的框架甚至可以在保持保真度和显著提高图像质量的同时,实现超过100倍的压缩(将典型的354.77 MB模型减少到约3.31 MB)。这些结果突显了扩散先验在极端压缩与高质量神经渲染之间的桥梁作用。我们的代码库将在以下地址发布:https://github.com/chenttt2001/ExGS
Summary / 总结
ExGS is a novel feed-forward framework that combines Universal Gaussian Compression (UGC) and GaussPainter for efficient 3D Gaussian Splatting (3DGS) compression. UGC prunes Gaussian primitives without re-optimization to retain essential information, while GaussPainter uses diffusion priors and mask-guided refinement to restore high-quality renderings. ExGS achieves over 100X compression while maintaining fidelity and improving image quality, making it suitable for resource-constrained environments. Diffusion priors play a crucial role in bridging the gap between extreme compression and high-quality rendering.
ExGS 是一种结合了通用高斯压缩 (UGC) 和 GaussPainter 的新型前馈框架,用于高效压缩 3D 高斯散点图 (3DGS)。UGC 在不重新优化的情况下减少高斯基元,而 GaussPainter 利用扩散先验增强和恢复高质量渲染。ExGS 在保持保真度的同时实现超过 100 倍的压缩,并在具有挑战性的条件下显著提高图像质量,突显了扩散先验在极端压缩中的重要作用。
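The "over 100X" figure in the ExGS entry above follows directly from the two model sizes quoted in the abstract. A short check of that arithmetic, using only those two numbers:

```python
original_mb = 354.77   # typical model size quoted in the abstract
compressed_mb = 3.31   # size after ExGS compression quoted in the abstract

ratio = original_mb / compressed_mb
saved_fraction = 1 - compressed_mb / original_mb

print(f"compression ratio: {ratio:.1f}x")
print(f"storage saved: {saved_fraction:.2%}")
```

The ratio works out to roughly 107x, so the compressed model keeps under 1% of the original storage.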
Multimodal Large Language Models as Image Classifiers
Authors: Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas
First: 2026-03-06T18:59:58+00:00 · Latest: 2026-03-09T12:45:56+00:00
Abstract
The classification performance of Multimodal Large Language Models (MLLMs) depends critically on evaluation protocol and ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices - batch size, image ordering, and text encoder selection - showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocol rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation. This work is part of the Aiming for Perfect ImageNet-1k project, see https://klarajanouskova.github.io/ImageNet/.
中文标题/摘要
标题:多模态大型语言模型作为图像分类器
多模态大型语言模型(MLLM)的分类性能在很大程度上取决于评估协议和 ground truth 的质量。比较 MLLM、监督模型和视觉语言模型的研究报告结论不一,我们表明这些分歧源于要么夸大要么低估性能的评估协议。在最常见的评估协议中,我们识别并解决了关键问题:模型输出超出提供的类别列表并被丢弃、由于弱的选择题干扰项导致的夸大结果以及在开放世界设置中由于输出映射不佳而导致的低性能。此外,我们量化了通常被忽视的设计选择——批量大小、图像排序和文本编码器选择的影响,表明它们显著影响准确性。在 ReGT 上进行评估,这是对 ImageNet-1k 的 625 个多标签重新注释,表明 MLLM 最受益于修正标签(最多 +10.8%),显著缩小了与监督模型之间的感知差距。因此,报告的 MLLM 在分类中的表现不佳主要是由于噪声 ground truth 和有缺陷的评估协议,而不是真正的模型缺陷。对监督训练信号依赖较少的模型对注释质量更为敏感。最后,我们展示了 MLLM 可以帮助人类注释员:在受控案例研究中,注释员在大约 50% 的困难案例中确认或整合了 MLLM 的预测,证明了它们在大规模数据集整理中的潜力。这项工作是 Perfect ImageNet-1k 项目的一部分,详见 https://klarajanouskova.github.io/ImageNet/。
Summary / 总结
This study investigates the performance of Multimodal Large Language Models (MLLMs) as image classifiers and shows that the conflicting conclusions of earlier comparisons stem from evaluation protocols that either inflate or underestimate performance. The research identifies and fixes key problems: out-of-list model outputs that are discarded, weak multiple-choice distractors, and poor output mapping in the open-world setting. It also quantifies the impact of design choices such as batch size, image ordering, and text encoder selection. Evaluation on ReGT, a multilabel reannotation of 625 ImageNet-1k classes, shows that MLLMs gain the most from corrected labels (up to +10.8%), narrowing the gap with supervised models. In a controlled case study, human annotators confirmed or integrated MLLM predictions in about 50% of difficult cases, which points to the models' potential for assisting dataset curation.
研究探讨了多模态大型语言模型(MLLM)作为图像分类器的性能,解决了评估协议和地面真实数据质量的问题。通过修正这些问题,研究显示MLLMs的表现比之前报告的更好,使用修正后的标签时,性能最多可提高10.8%。研究还强调了批量大小和文本编码器选择等设计选择对准确率的影响。此外,MLLMs被发现能够帮助人类标注员进行数据集的整理,在困难案例中,标注员确认或整合MLLM预测的比例约为50%。
SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
Authors: Junxian Li, Tu Lan, Haozhen Tan, Yan Meng, Haojin Zhu
First: 2026-03-09T12:38:28+00:00 · Latest: 2026-03-09T12:38:28+00:00
Comments: 25 pages
Abstract
Modern vision-language-model (VLM) based graphical user interface (GUI) agents are expected not only to execute actions accurately but also to respond to user instructions with low latency. While existing research on GUI-agent security mainly focuses on manipulating action correctness, the security risks related to response efficiency remain largely unexplored. In this paper, we introduce SlowBA, a novel backdoor attack that targets the responsiveness of VLM-based GUI agents. The key idea is to manipulate response latency by inducing excessively long reasoning chains under specific trigger patterns. To achieve this, we propose a two-stage reward-level backdoor injection (RBI) strategy that first aligns the long-response format and then learns trigger-aware activation through reinforcement learning. In addition, we design realistic pop-up windows as triggers that naturally appear in GUI environments, improving the stealthiness of the attack. Extensive experiments across multiple datasets and baselines demonstrate that SlowBA can significantly increase response length and latency while largely preserving task accuracy. The attack remains effective even with a small poisoning ratio and under several defense settings. These findings reveal a previously overlooked security vulnerability in GUI agents and highlight the need for defenses that consider both action correctness and response efficiency. Code can be found in https://github.com/tu-tuing/SlowBA.
中文标题/摘要
标题:SlowBA:一种针对基于VLM的GUI代理的效率后门攻击
现代基于视觉语言模型(VLM)的图形用户界面(GUI)代理不仅期望能够准确执行操作,还期望能够以低延迟响应用户指令。虽然现有关于GUI代理安全性的研究主要集中在操控操作的正确性上,但与响应效率相关的安全风险却很少被探索。在本文中,我们介绍了SlowBA,这是一种针对基于VLM的GUI代理响应性的新型后门攻击。关键思想是通过在特定触发模式下诱导过长的推理链来操纵响应延迟。为此,我们提出了一种两阶段奖励级后门注入(RBI)策略,首先对齐长响应格式,然后通过强化学习学习触发感知激活。此外,我们设计了现实的弹出窗口作为触发器,这些触发器自然出现在GUI环境中,提高了攻击的隐蔽性。在多个数据集和基线上的广泛实验表明,SlowBA可以显著增加响应时间和延迟,同时在很大程度上保持任务准确性。即使在小污染比例和多种防御设置下,攻击仍然有效。这些发现揭示了GUI代理中一个之前未被注意到的安全漏洞,并强调了需要同时考虑操作正确性和响应效率的防御措施。代码可以在https://github.com/tu-tuing/SlowBA获取。
Summary / 总结
The research introduces SlowBA, a backdoor attack targeting the responsiveness of VLM-based GUI agents by manipulating response latency through specific trigger patterns. The method uses a two-stage reward-level backdoor injection strategy and reinforcement learning to induce long reasoning chains. Experiments show that SlowBA can significantly increase response length and latency while maintaining task accuracy, even with a small poisoning ratio and under various defense settings. This highlights the need for security measures that consider both action correctness and response efficiency.
研究引入了SlowBA,这是一种针对VLM基GUI代理响应性的后门攻击,通过特定触发模式操纵响应延迟。它采用了一种两阶段奖励级后门注入策略,并使用现实中的弹出窗口作为触发器。实验表明,SlowBA可以显著增加响应时间和延迟,同时保持任务准确性,即使在小污染比例和多种防御设置下仍然有效。
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
Authors: Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, Yung-Yao Chen
First: 2026-03-06T06:44:17+00:00 · Latest: 2026-03-09T12:16:48+00:00
Comments: Project page: https://vaisr.github.io/OVGGT/ Code: https://github.com/VAISR/OVGGT
Abstract
Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
中文标题/摘要
标题:OVGGT: O(1)常成本流式视觉几何变换器
从流式视频中重建3D几何形状需要在有限资源下持续推理。最近的几何基础模型通过全对全注意力机制实现了令人印象深刻的重建质量,但其二次成本限制了它们只能处理短的离线序列。StreamVGGT等因果注意力变体能够实现单次遍历的流式处理,但会累积一个不断增长的KV缓存,在数百帧内就会耗尽GPU内存,从而无法实现流式推理原本所追求的长时程部署。我们提出了OVGGT,这是一种无需训练的框架,无论序列长度如何,都能将内存和计算成本限制在一个固定预算内。我们的方法结合了自选择缓存和动态锚点保护:前者利用FFN残差幅度压缩KV缓存,同时完全兼容FlashAttention;后者保护对坐标至关重要的token不被逐出,从而抑制长轨迹上的几何漂移。在室内、室外和超长序列基准上的广泛实验表明,OVGGT能够在恒定的显存占用内处理任意长度的视频,同时实现最先进的3D几何精度。
Summary / 总结
OVGGT addresses the challenge of reconstructing 3D geometry from streaming video by bounding both memory and compute costs to a fixed budget. It combines Self-Selective Caching to compress the KV cache and Dynamic Anchor Protection to shield critical tokens, ensuring geometric accuracy over long sequences. Experiments show that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
OVGGT 解决了从流式视频中以有限资源重建 3D 几何结构的问题。它结合了自选择缓存和动态锚点保护,以保持固定的内存和计算预算,从而支持长时间部署。实验表明,OVGGT 能够处理任意长度的视频,并在常量 VRAM 环境下实现最先进的 3D 几何精度。
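The fixed-budget caching idea in the OVGGT entry above can be sketched as a small eviction policy: keep at most `budget` entries, never evict protected anchors, and drop the lowest-scoring unprotected entry when the budget is exceeded. The class name, the per-token `score` (a stand-in for the FFN residual magnitude the abstract mentions), and the toy values are assumptions for illustration, not the authors' implementation.

```python
class BoundedKVCache:
    """Toy cache with a fixed budget and protected anchor entries."""

    def __init__(self, budget):
        self.budget = budget
        self.entries = {}  # token_id -> (score, is_anchor)

    def add(self, token_id, score, is_anchor=False):
        """Insert an entry, then evict until the budget is respected."""
        self.entries[token_id] = (score, is_anchor)
        while len(self.entries) > self.budget:
            evictable = [t for t, (_, anchor) in self.entries.items() if not anchor]
            if not evictable:
                raise RuntimeError("budget too small for the protected anchors")
            # Evict the unprotected entry with the lowest importance score.
            victim = min(evictable, key=lambda t: self.entries[t][0])
            del self.entries[victim]

cache = BoundedKVCache(budget=4)
cache.add(0, score=0.1, is_anchor=True)  # low score, but protected from eviction
stream_scores = [0.5, 0.9, 0.2, 0.7, 0.3, 0.8, 0.1, 0.6, 0.4]
for token_id, score in enumerate(stream_scores, start=1):
    cache.add(token_id, score)
    assert len(cache.entries) <= cache.budget  # memory stays bounded at every step

print(sorted(cache.entries))
```

The cache size never grows with the stream length, which is the property that gives constant memory; the anchor survives despite having the lowest score, which is the role the abstract assigns to Dynamic Anchor Protection.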
Novel Semantic Prompting for Zero-Shot Action Recognition
Authors: Salman Iqbal, Waheed Rehman
First: 2026-03-09T12:07:55+00:00 · Latest: 2026-03-09T12:07:55+00:00
Abstract
Zero-shot action recognition relies on transferring knowledge from vision-language models to unseen actions using semantic descriptions. While recent methods focus on temporal modeling or architectural adaptations to handle video data, we argue that semantic prompting alone provides a strong and underexplored signal for zero-shot action understanding. We introduce SP-CLIP, a lightweight framework that augments frozen vision-language models with structured semantic prompts describing actions at multiple levels of abstraction, such as intent, motion, and object interaction. Without modifying the visual encoder or learning additional parameters, SP-CLIP aligns video representations with enriched textual semantics through prompt aggregation and consistency scoring. Experiments across standard benchmarks show that semantic prompting substantially improves zero-shot action recognition, particularly for fine-grained and compositional actions, while preserving the efficiency and generalization of pretrained models.
中文标题/摘要
标题:新颖的语义提示在零样本动作识别中的应用
零样本动作识别依赖于通过语义描述将视觉语言模型的知识转移到未见过的动作上。尽管最近的方法集中在时间建模或对视频数据进行架构调整,但我们认为仅通过语义提示可以提供一种强大且未被充分探索的信号,以实现零样本动作理解。我们引入了SP-CLIP,这是一种轻量级框架,通过在冻结的视觉语言模型中添加结构化的语义提示,描述动作在多个抽象层次上的意图、运动和物体交互。无需修改视觉编码器或学习额外参数,SP-CLIP通过提示聚合和一致性评分将视频表示与丰富的文本语义对齐。在标准基准上的实验表明,语义提示显著提高了零样本动作识别的性能,特别是在细粒度和组合动作方面,同时保持了预训练模型的效率和泛化能力。
Summary / 总结
The paper addresses the challenge of zero-shot action recognition by leveraging semantic descriptions to enhance vision-language models. It introduces SP-CLIP, a lightweight framework that uses structured semantic prompts to align video representations with enriched textual semantics without modifying the visual encoder or learning additional parameters. The experiments demonstrate that semantic prompting significantly improves zero-shot action recognition, especially for fine-grained and compositional actions, while maintaining the efficiency and generalization of pretrained models.
研究旨在通过利用语义描述来增强零样本动作识别,将视觉语言模型的知识转移到未见过的动作上。方法引入了SP-CLIP,这是一种轻量级框架,使用结构化的语义提示来对齐视频表示与丰富的文本语义,而不修改视觉编码器或学习额外参数。关键实验结果表明,语义提示显著提高了零样本动作识别,特别是对于细粒度和组合动作,同时保持了预训练模型的效率和泛化能力。
SAIL: Test-Time Scaling for In-Context Imitation Learning with VLM
Authors: Makoto Sato, Yusuke Iwasawa, Yujin Tang, So Kuroki
First: 2026-03-09T11:39:40+00:00 · Latest: 2026-03-09T11:39:40+00:00
Comments: 8 pages, 3 figures
Abstract
In-context imitation learning allows robots to acquire skills from demonstrations, yet one-shot trajectory generation remains fragile under environmental variation. We propose SAIL, a framework that reframes robot imitation as an iterative refinement problem capable of scaling with test-time compute. SAIL utilizes Monte Carlo Tree Search, where each node is a complete trajectory and edges correspond to trajectory refinements. The process is guided by three core components: an automated archive of successful trajectories for contextually relevant retrieval, a vision language model-based scoring mechanism for trajectory evaluation, and a step-level feedback that provides trajectory-aligned scores for iterative refinement. Experiments across six diverse manipulation tasks in simulation and real-world validation clearly demonstrate that increasing test-time compute consistently improves success rates, achieving up to 95% on complex tasks. Our results suggest that trajectory-level test-time scaling is a robust path toward more generalizable robotic agents.
中文标题/摘要
标题:SAIL:基于VLM的上下文模仿学习的测试时扩展
上下文模仿学习允许机器人从演示中获取技能,但一次性的轨迹生成在环境变化下仍然很脆弱。我们提出了SAIL,一种框架,将机器人模仿重新定义为一个迭代细化问题,能够根据测试时的计算能力进行扩展。SAIL 利用蒙特卡洛树搜索,其中每个节点是一个完整的轨迹,边对应于轨迹的细化。过程由三个核心组件引导:一个自动化的成功轨迹档案,用于上下文相关检索,基于视觉语言模型的评分机制,用于轨迹评估,以及步骤级反馈,提供与轨迹对齐的评分以进行迭代细化。在六个不同操作任务的模拟和现实世界验证中,实验结果清楚地表明,增加测试时的计算能力可以一致地提高成功率,复杂任务上最高可达95%。我们的结果表明,轨迹级测试时缩放是实现更通用机器人代理的稳健途径。
Summary / 总结
The research targets the fragility of one-shot trajectory generation under environmental variation in in-context imitation learning for robots. SAIL recasts imitation as iterative trajectory refinement via Monte Carlo Tree Search, guided by an archive of successful trajectories, VLM-based trajectory scoring, and step-level feedback. Across six manipulation tasks in simulation, plus real-world validation, increasing test-time compute consistently improves success rates, reaching up to 95% on complex tasks.
研究旨在提升机器人上下文模仿学习中单次轨迹生成在环境变化下的鲁棒性。SAIL框架将模仿重新定义为迭代细化过程,使用蒙特卡洛树搜索和基于视觉语言模型的轨迹评估,并通过步骤级反馈进行细化。实验表明,增加测试时计算量可以提高成功率,在复杂任务上最高可达95%。
ALOOD: Exploiting Language Representations for LiDAR-based Out-of-Distribution Object Detection
Authors: Michael Kösel, Marcel Schreiber, Michael Ulrich, Claudius Gläser, Klaus Dietmayer
First: 2026-03-09T10:02:45+00:00 · Latest: 2026-03-09T10:02:45+00:00
Comments: Accepted for publication at the 2025 IEEE Intelligent Transportation Systems Conference (ITSC)
Abstract
LiDAR-based 3D object detection plays a critical role for reliable and safe autonomous driving systems. However, existing detectors often produce overly confident predictions for objects not belonging to known categories, posing significant safety risks. This is caused by so-called out-of-distribution (OOD) objects, which were not part of the training data, resulting in incorrect predictions. To address this challenge, we propose ALOOD (Aligned LiDAR representations for Out-Of-Distribution Detection), a novel approach that incorporates language representations from a vision-language model (VLM). By aligning the object features from the object detector to the feature space of the VLM, we can treat the detection of OOD objects as a zero-shot classification task. We demonstrate competitive performance on the nuScenes OOD benchmark, establishing a novel approach to OOD object detection in LiDAR using language representations. The source code is available at https://github.com/uulm-mrm/mmood3d.
中文标题/摘要
标题:ALOOD:利用语言表示进行基于LiDAR的分布外对象检测
基于LiDAR的3D对象检测对于可靠和安全的自动驾驶系统至关重要。然而,现有的检测器经常对不属于已知类别的对象产生过于自信的预测,这会带来重大的安全风险。这是由所谓的分布外(OOD)对象引起的,这些对象未包含在训练数据中,导致错误的预测。为了解决这一挑战,我们提出了ALOOD(用于分布外检测的对齐LiDAR表示),这是一种新颖的方法,它结合了来自视觉-语言模型(VLM)的语言表示。通过将对象检测器的对象特征对齐到VLM的特征空间,我们可以将分布外对象的检测视为零样本分类任务。我们在nuScenes OOD基准上展示了具有竞争力的表现,建立了使用语言表示进行LiDAR分布外对象检测的新方法。源代码可在 https://github.com/uulm-mrm/mmood3d 获取。
Summary / 总结
The paper introduces ALOOD, a method that uses language representations from a vision-language model to improve LiDAR-based out-of-distribution object detection. By aligning object features from the detector with the VLM feature space, ALOOD treats OOD detection as a zero-shot classification task. The approach demonstrates competitive performance on the nuScenes OOD benchmark, offering a novel solution for safe autonomous driving systems.
ALOOD 通过使用视觉语言模型的语言表示来对齐 LiDAR 基础目标检测器的对象特征,将未知对象的检测视为零样本分类任务。它在 nuScenes OOD 基准测试中取得了竞争力的表现,解决了自动驾驶系统中对未知对象过于自信预测的问题。
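A minimal sketch of what "OOD detection as zero-shot classification" could look like once object features live in the same space as text embeddings. This is an assumption-laden illustration, not the ALOOD code: the function names, the toy 2-D embeddings, the temperature value, and the choice of `1 - max probability` as the OOD score are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_with_ood_score(obj_feat, text_embeds, temperature=0.07):
    """Return (best_label, ood_score) for one aligned object feature.

    text_embeds maps a known class name to its text embedding.
    The OOD score is 1 minus the largest softmax probability
    (an assumed scoring rule, chosen only for illustration).
    """
    sims = {name: cosine(obj_feat, emb) for name, emb in text_embeds.items()}
    top = max(sims.values())
    # Numerically stable softmax over temperature-scaled similarities.
    exps = {name: math.exp((s - top) / temperature) for name, s in sims.items()}
    total = sum(exps.values())
    probs = {name: e / total for name, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, 1.0 - probs[best]


known = {"car": [1.0, 0.0], "pedestrian": [0.0, 1.0]}
print(classify_with_ood_score([1.0, 0.0], known))  # confident: low OOD score
print(classify_with_ood_score([1.0, 1.0], known))  # ambiguous: higher OOD score
```

A feature that matches no known class prompt well spreads its probability mass, so its score rises; thresholding that score is one simple way to flag unknown objects.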
C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis
Authors: Jiayang Gao, Tianyi Zheng, Jiayang Zou, Fengxiang Yang, Shice Liu, Luyao Fan, Zheyu Zhang, Hao Zhang, Jinwei Chen, Peng-Tao Jiang, Bo Li, Jia Wang
First: 2026-03-09T09:37:17+00:00 · Latest: 2026-03-09T09:37:17+00:00
Abstract
Classifier-Free Guidance (CFG) is a cornerstone of modern conditional diffusion models, yet its reliance on the fixed or heuristic dynamic guidance weight is predominantly empirical and overlooks the inherent dynamics of the diffusion process. In this paper, we provide a rigorous theoretical analysis of the Classifier-Free Guidance. Specifically, we establish strict upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps based on the diffusion process. This finding explains the limitations of fixed-weight strategies and establishes a principled foundation for time-dependent guidance. Motivated by this insight, we introduce Control Classifier-Free Guidance (C$^2$FG), a novel, training-free, and plug-in method that aligns the guidance strength with the diffusion dynamics via an exponential decay control function. Extensive experiments demonstrate that C$^2$FG is effective and broadly applicable across diverse generative tasks, while also exhibiting orthogonality to existing strategies.
中文标题/摘要
标题:C$^2$FG: 通过评分差异分析控制分类器无关指导
分类器无关指导(CFG)是现代条件扩散模型的核心,但其依赖于固定或启发式的动态指导权重,主要是经验性的,并且忽略了扩散过程的内在动态。在本文中,我们对分类器无关指导进行了严格的理论分析。具体来说,我们基于扩散过程建立了不同时间步上条件分布和无条件分布之间评分差异的严格上界。这一发现解释了固定权重策略的局限性,并为时间依赖性指导奠定了原则性的基础。受此见解的启发,我们引入了控制分类器无关指导(C$^2$FG),这是一种新颖的、无需训练且可插拔的方法,通过指数衰减控制函数使指导强度与扩散动态保持一致。广泛的实验表明,C$^2$FG 在各种生成任务中有效且具有广泛适用性,同时与现有策略正交。
Summary / 总结
The paper analyzes Classifier-Free Guidance (CFG) theoretically, deriving upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps, which explains the limitations of fixed or heuristic guidance weights. Motivated by this, the authors propose Control Classifier-Free Guidance (C$^2$FG), a training-free, plug-in method that matches guidance strength to the diffusion dynamics via an exponential decay control function. Experiments show that C$^2$FG is effective and broadly applicable across diverse generative tasks and is orthogonal to existing strategies.
研究旨在通过解决固定或启发式指导权重的局限性,改进条件扩散模型中的Classifier-Free Guidance (CFG)。作者通过对条件和无条件分布之间得分差异的理论分析,提出了Control Classifier-Free Guidance (C$^2$FG)。C$^2$FG 使用指数衰减控制函数使指导强度与扩散动力学保持一致,证明了其在各种生成任务中的有效性。
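To illustrate what a time-dependent guidance weight means in practice, here is a small sketch that plugs a scheduled weight into the standard CFG combination. The abstract does not give the exact control function, so the form `w(t) = w0 * exp(-decay * t)`, the parameter names `w0` and `decay`, their default values, and the convention that `t` is a normalized time in [0, 1] are all assumptions for illustration.

```python
import math


def guidance_weight(t, w0=7.5, decay=2.0):
    """Hypothetical exponential-decay schedule: w(t) = w0 * exp(-decay * t).

    t is a normalized time in [0, 1]; which end of the sampling
    trajectory t = 0 refers to is left open in this sketch.
    """
    return w0 * math.exp(-decay * t)


def cfg_combine(eps_uncond, eps_cond, w):
    """Standard CFG mix: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a 2-element latent.
eps_u, eps_c = [0.0, 0.5], [1.0, 1.5]
for t in (0.0, 0.5, 1.0):
    w = guidance_weight(t)
    print(t, round(w, 3), cfg_combine(eps_u, eps_c, w))
```

A fixed-weight strategy corresponds to `decay=0.0`, which makes the contrast with a time-dependent schedule easy to test in isolation.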
FLARE: Learning Future-Aware Latent Representations from Vision-Language Models for Autonomous Driving
Authors: Chengen Xie, Chonghao Sima, Tianyu Li, Bin Sun, Junjie Wu, Zhihui Hao, Hongyang Li
First: 2026-01-09T08:06:44+00:00 · Latest: 2026-03-09T09:35:48+00:00
Abstract
While Vision-Language Models (VLMs) offer rich world knowledge for end-to-end autonomous driving, current approaches heavily rely on labor-intensive language annotations (e.g., VQA) to bridge perception and control. This paradigm suffers from a fundamental mismatch between discrete linguistic tokens and continuous driving trajectories, often leading to suboptimal control policies and inefficient utilization of pre-trained knowledge. To address these challenges, we propose FLARE (Future-aware LAtent REpresentation), a novel framework that activates the visual-semantic capabilities of pre-trained VLMs without requiring language supervision. Instead of aligning with text, we introduce a self-supervised future feature prediction objective. This mechanism compels the model to anticipate scene dynamics and ego-motion directly in the latent space, enabling the learning of robust driving representations from large-scale unlabeled trajectory data. Furthermore, we integrate Group Relative Policy Optimization (GRPO) into the planning process to refine decision-making quality. Extensive experiments on the NAVSIM benchmark demonstrate that FLARE achieves state-of-the-art performance, validating the effectiveness of leveraging VLM knowledge via predictive self-supervision rather than explicit language generation.
中文标题/摘要
标题:FLARE:从视觉语言模型中学习面向未来的潜在表示以实现自主驾驶
尽管视觉语言模型(VLMs)为端到端的自主驾驶提供了丰富的世界知识,但当前的方法严重依赖于劳动密集型的语言注释(例如,VQA)来连接感知和控制。这种范式在离散的语言标记和连续的驾驶轨迹之间存在根本性的不匹配,经常导致次优的控制策略和预训练知识的低效利用。为了解决这些挑战,我们提出了FLARE(面向未来的潜在表示),这是一种新颖的框架,可以在不需要语言监督的情况下激活预训练VLM的视觉语义能力。我们引入了一种自我监督的未来特征预测目标,而不是与文本对齐。这种机制促使模型直接在潜在空间中预测场景动态和自我运动,从而能够从大规模的未标记轨迹数据中学习稳健的驾驶表示。此外,我们还将组相对策略优化(GRPO)集成到规划过程中以提高决策质量。在NAVSIM基准上的广泛实验表明,FLARE实现了最先进的性能,验证了通过预测自我监督利用VLM知识的有效性,而不是显式的语言生成。
Summary / 总结
FLARE is a novel framework that enhances autonomous driving by leveraging the visual-semantic capabilities of pre-trained Vision-Language Models (VLMs) without requiring language supervision. It introduces a self-supervised future feature prediction objective to enable the model to anticipate scene dynamics and ego-motion directly in the latent space. Experimental results on the NAVSIM benchmark show that FLARE outperforms existing methods, demonstrating the effectiveness of using predictive self-supervision to learn robust driving representations from large-scale unlabeled trajectory data.
FLARE 是一种新颖的框架,通过利用预训练的 Vision-Language 模型(VLMs)的视觉语义能力来增强自动驾驶,无需语言监督。它引入了一种自我监督的未来特征预测目标,使模型能够在潜空间直接预测场景动态和自我运动。在 NAVSIM 基准上的实验结果表明,FLARE 的性能优于现有方法,证明了使用预测性自我监督从大规模未标记的轨迹数据中学习稳健的驾驶表示的有效性。
iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding
Authors: Hanpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He
First: 2026-03-03T08:49:41+00:00 · Latest: 2026-03-09T09:29:46+00:00
Abstract
Despite the success of Large Vision-Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.
中文标题/摘要
标题:iGVLM:动态指令引导视觉编码以实现问题感知的多模态理解
尽管大型视觉-语言模型(LVLMs)取得了成功,但大多数现有架构仍存在表示瓶颈:它们依赖于静态、与指令无关的视觉编码器,其视觉表示在不同文本任务中以不变的方式被使用。这种刚性阻碍了细粒度推理,特别是在任务特定的视觉线索至关重要时。为了解决这一问题,我们提出了iGVLM,这是一种指令引导视觉调制的通用框架。iGVLM引入了一个解耦的双分支架构:一个冻结表示分支,保留预训练期间学习到的任务无关视觉表示,以及一个动态条件分支,通过自适应层归一化(AdaLN)执行仿射特征调制。这种设计使从通用感知到指令感知推理的过渡变得平滑,同时保持预训练视觉先验的结构完整性和稳定性。除了标准基准之外,我们还引入了MM4,这是一种受控诊断探针,用于在多查询、多指令设置下量化逻辑一致性。广泛的结果表明,iGVLM在各种语言骨干模型上一致地增强了指令敏感性,提供了一种即插即用的范式,用于连接被动感知和主动推理。
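The affine feature modulation mentioned in this entry can be sketched in a few lines. This is a generic AdaLN illustration, not iGVLM's architecture: the `(1 + scale) * LN(x) + shift` form follows a common convention, and the function names, the tiny dimensions, and the weight matrices are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, vec):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def adaln_modulate(visual, instruction, w_scale, b_scale, w_shift, b_shift):
    """Scale and shift normalized visual features using parameters
    predicted from an instruction embedding (assumed convention)."""
    normed = layer_norm(visual)
    scale = linear(w_scale, b_scale, instruction)
    shift = linear(w_shift, b_shift, instruction)
    return [(1.0 + s) * n + h for n, s, h in zip(normed, scale, shift)]


visual = [1.0, 2.0, 3.0]
instruction = [1.0, 0.0]
zeros = [[0.0, 0.0]] * 3
# With zero-initialized projections the branch leaves LN(x) untouched.
print(adaln_modulate(visual, instruction, zeros, [0.0] * 3, zeros, [0.0] * 3))
```

Zero-initializing the scale and shift projections makes the conditioning branch start as an identity on the normalized features, one plausible way to preserve pre-trained visual priors at the start of training.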
UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing
Authors: Jiaxi Zhang, Yunheng Wang, Wei Lu, Taowen Wang, Weisheng Xu, Shuning Zhang, Yixiao Feng, Yuetong Fang, Renjing Xu
First: 2026-03-09T09:10:01+00:00 · Latest: 2026-03-09T09:10:01+00:00
Comments: 14 pages, 6 figures, 3 tables
Abstract
Understanding and localizing objects in complex 3D environments from natural language descriptions, known as 3D Visual Grounding (3DVG), is a foundational challenge in embodied AI, with broad implications for robotics, augmented reality, and human-machine interaction. Large-scale pre-trained foundation models have driven significant progress on this front, enabling open-vocabulary 3DVG that allows systems to locate arbitrary objects in a given scene. However, their reliance on pre-trained models constrains 3D perception and reasoning within the inherited knowledge boundaries, resulting in limited generalization to unseen spatial relationships and poor robustness to out-of-distribution scenes. In this paper, we replace this constrained perception with training-free visual and geometric reasoning, thereby unlocking open-world 3DVG that enables the localization of any object in any scene beyond the training data. Specifically, the proposed UniGround operates in two stages: a Global Candidate Filtering stage that constructs scene candidates through training-free 3D topology and multi-view semantic encoding, and a Local Precision Grounding stage that leverages multi-scale visual prompting and structured reasoning to precisely identify the target object. Experiments on ScanRefer and EmbodiedScan show that UniGround achieves 46.1%/34.1% Acc@0.25/0.5 on ScanRefer and 28.7% Acc@0.25 on EmbodiedScan, establishing a new state-of-the-art among zero-shot methods on EmbodiedScan without any 3D supervision. We further evaluate UniGround in real-world environments under uncontrolled reconstruction conditions and substantial domain shift, showing training-free reasoning generalizes robustly beyond curated benchmarks.
中文标题/摘要
标题:UniGround:通过无训练场景解析实现通用3D视觉定位
从自然语言描述中理解并定位复杂3D环境中的物体,即3D视觉定位(3DVG),是具身智能中的基础挑战,对机器人技术、增强现实和人机交互具有广泛影响。大规模预训练基础模型在这一领域取得了显著进展,实现了开放词汇3DVG,使系统能够定位给定场景中的任意物体。然而,它们依赖于预训练模型,将3D感知和推理限制在继承的知识边界之内,导致对未见过的空间关系的泛化能力有限,并且对分布外场景的鲁棒性较差。在本文中,我们用无需训练的视觉和几何推理替代这种受限的感知,从而解锁开放世界3DVG,使系统能够定位训练数据之外的任何场景中的任何物体。具体而言,提出的UniGround分为两个阶段:全局候选过滤阶段,通过无需训练的3D拓扑和多视图语义编码构建场景候选;局部精确定位阶段,利用多尺度视觉提示和结构化推理精确识别目标物体。在ScanRefer和EmbodiedScan上的实验表明,UniGround在ScanRefer上实现了46.1%/34.1%的Acc@0.25/0.5,在EmbodiedScan上实现了28.7%的Acc@0.25,在无需任何3D监督的情况下成为EmbodiedScan上零样本方法的新最佳。我们进一步在不受控的重建条件和显著领域偏移的现实环境中评估了UniGround,结果显示无需训练的推理在精心策划的基准之外具有稳健的泛化能力。
Summary / 总结
The research addresses 3D Visual Grounding (3DVG) with UniGround, which replaces perception constrained by pre-trained models with training-free visual and geometric reasoning to enable open-world 3DVG. UniGround operates in two stages: Global Candidate Filtering and Local Precision Grounding. It achieves 46.1%/34.1% Acc@0.25/0.5 on ScanRefer and 28.7% Acc@0.25 on EmbodiedScan, a new state of the art among zero-shot methods on EmbodiedScan without any 3D supervision, and it generalizes robustly in real-world environments.
研究提出UniGround,利用无需训练的视觉和几何推理来解决3D视觉定位(3DVG)问题。UniGround分为全局候选过滤和局部精确定位两个阶段。实验结果显示,UniGround在ScanRefer上达到46.1%/34.1%的Acc@0.25/0.5,在EmbodiedScan上达到28.7%的Acc@0.25,在无需任何3D监督的情况下成为EmbodiedScan上零样本方法的新最佳,并在不受控重建条件下的真实环境中展示了稳健的泛化能力。
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
Authors: Lianyu Hu, Liqing Gao, Fanhua Shang, Liang Wan, Wei Feng
First: 2024-12-09T07:22:19+00:00 · Latest: 2026-03-09T08:28:06+00:00
Comments: Accepted by ICLR 2026; code is released at https://github.com/hulianyuyy/iLLaVA
Abstract
Recent methods have made notable progress in accelerating Large Vision-Language Models (LVLMs) by exploiting the inherent redundancy in visual inputs. Most existing approaches, however, focus narrowly on reducing image tokens before or within the Large Language Model (LLM) stage to lower computational cost. This overlooks other major bottlenecks, particularly the image encoder, which itself requires substantial computation. As a result, these methods fall short of achieving true end-to-end acceleration. Importantly, the image encoder is the primary contributor of input tokens to the LLM. Thus, reducing visual redundancy at the encoder stage not only speeds up the encoder itself but also significantly lightens the workload for the subsequent LLM. Motivated by this, we investigate how to jointly optimize the image encoder and the LLM along with other LVLM components for comprehensive acceleration. To mitigate the risk of performance degradation from token reduction, we propose a novel token merging strategy that recycles useful information from otherwise discarded tokens. Our approach, iLLaVA, delivers consistent improvements across both image and video understanding tasks, achieving up to a 2 times throughput boost and a 4 times reduction in prefilling time. Notably, iLLaVA enables a larger model (e.g., InternVL-2.5 26B) to surpass a smaller counterpart (e.g., InternVL-2.5 8B) in both accuracy and efficiency. Extensive comparisons with state-of-the-art token pruning and merging techniques demonstrate the clear superiority of our method. Finally, we provide detailed visualizations for the merging steps of iLLaVA, offering deeper insights into how different LVLM components contribute to efficient computation.
中文标题/摘要
标题:iLLaVA:在大型多模态模型中,一幅图像所需的输入标记不到1/3
最近的方法通过利用视觉输入中的固有冗余性,在加速大型视觉-语言模型(LVLMs)方面取得了显著进展。然而,大多数现有方法仅集中在减少输入到大型语言模型(LLM)阶段之前的图像标记数量,以降低计算成本。这忽视了其他主要瓶颈,特别是图像编码器本身需要大量的计算。因此,这些方法未能实现真正的端到端加速。重要的是,图像编码器是输入标记的主要贡献者。因此,在编码器阶段减少视觉冗余不仅加速了编码器本身,还显著减轻了后续LLM的工作量。受此启发,我们研究如何联合优化图像编码器、LLM以及其他LVLM组件以实现全面加速。为减轻标记减少带来的性能下降风险,我们提出了一种新的标记合并策略,以回收被丢弃标记中的有用信息。我们的方法iLLaVA在图像和视频理解任务中均实现了持续改进,吞吐量提升高达2倍,预填充时间减少高达4倍。值得注意的是,iLLaVA使一个较大的模型(例如,InternVL-2.5 26B)在准确性和效率上超越了一个较小的对应模型(例如,InternVL-2.5 8B)。与最先进的标记剪枝和合并技术的广泛比较表明,我们的方法具有明显的优势。最后,我们提供了iLLaVA合并步骤的详细可视化,以更深入地了解不同LVLM组件如何贡献于高效计算。
FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
Authors: Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke, Minyi Guo, Jieru Zhao
First: 2025-05-19T13:36:45+00:00 · Latest: 2026-03-09T08:08:41+00:00
Abstract
Large language models (LLMs) are widely deployed with rapidly expanding context windows to support increasingly demanding applications. However, long contexts pose significant deployment challenges, primarily due to the KV cache whose size grows proportionally with context length. While KV cache compression methods have been proposed to address this issue, KV dropping methods incur considerable accuracy loss, and KV retrieval methods suffer from significant efficiency bottlenecks. We propose FreeKV, a training-free algorithm-system co-optimization framework to enhance KV retrieval efficiency while preserving accuracy. On the algorithm side, FreeKV introduces speculative retrieval to shift the KV selection and recall processes out of the critical path, combined with fine-grained correction to ensure accuracy. On the system side, FreeKV employs hybrid KV layouts across CPU and GPU memory to eliminate fragmented data transfers, and leverages double-buffered streamed recall to further improve efficiency, enabling effective overlap with computation, full latency hiding, and practical speedups from speculative recall. Experiments demonstrate that FreeKV achieves near-lossless accuracy across various scenarios and models, delivering up to a 13$\times$ speedup compared to SOTA KV retrieval methods. Code is available at https://github.com/sjtu-zhao-lab/FreeKV.
中文标题/摘要
标题:FreeKV:提升键值缓存检索以实现高效的LLM推理
大型语言模型(LLMs)被广泛部署,其上下文窗口迅速扩展,以支持要求日益严苛的应用。然而,长上下文带来了重大的部署挑战,主要由于键值(KV)缓存的大小与上下文长度成正比增长。虽然已经提出了KV缓存压缩方法来解决这一问题,但KV丢弃方法会导致显著的准确率损失,而KV检索方法则面临显著的效率瓶颈。我们提出FreeKV,这是一种无需训练的算法-系统协同优化框架,旨在提高KV检索效率同时保持准确率。在算法方面,FreeKV引入了推测性检索,将KV选择和召回过程移出关键路径,并结合细粒度校正以确保准确率。在系统方面,FreeKV在CPU和GPU内存中采用混合KV布局以消除碎片化数据传输,并利用双缓冲流式召回进一步提高效率,从而实现有效的计算重叠、完整的延迟隐藏以及从推测性召回中获得的实际加速。实验表明,FreeKV在各种场景和模型中实现了接近无损的准确率,并与当前最佳KV检索方法相比,提供了高达13倍的加速。代码可在 https://github.com/sjtu-zhao-lab/FreeKV 获取。
Summary / 总结
FreeKV is a training-free framework that optimizes KV cache retrieval for efficient LLM inference. It introduces speculative retrieval to shift KV selection and recall processes out of the critical path and uses fine-grained correction to maintain accuracy. On the system side, FreeKV employs hybrid KV layouts and double-buffered streamed recall to eliminate data fragmentation and improve efficiency. Experiments show that FreeKV achieves near-lossless accuracy and up to a 13 times speedup compared to state-of-the-art KV retrieval methods.
FreeKV 是一个无需训练的框架,旨在优化 KV 缓存检索以提高 LLM 推理效率。它在算法侧引入了推测性检索和细粒度校正,在系统侧采用了跨 CPU 和 GPU 内存的混合 KV 布局以及双缓冲流式检索。实验表明,FreeKV 保持了近乎无损的准确性和高达 13 倍的速度提升,相比最先进的 KV 检索方法。
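The statement that the KV cache grows proportionally with context length can be made concrete with a back-of-the-envelope calculation. The formula below is the usual accounting for a decoder-only transformer (one key and one value tensor per layer); the model shape used in the example (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical configuration chosen only for illustration and is not taken from the FreeKV paper.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # The factor 2 covers the key tensor plus the value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return batch * seq_len * per_token


GIB = 1024 ** 3
for tokens in (8_192, 32_768, 131_072):
    size = kv_cache_bytes(tokens, num_layers=32, num_kv_heads=8, head_dim=128)
    print(f"{tokens:>7} tokens -> {size / GIB:.1f} GiB")
```

Doubling the context doubles the cache, which is why long contexts motivate offloading part of the cache to CPU memory and retrieving only the entries that are needed.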