arXiv 论文速递

2025-11-18 03:28
Snapshot: 20251118_0328
DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
Authors: Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister, Jinsung Yoon
First: 2025-11-14T18:42:18+00:00 · Latest: 2025-11-14T18:42:18+00:00
Abstract
Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.
中文标题/摘要
标题:DocLens : 一种工具增强的多智能体框架,用于长视觉文档理解
理解长视觉文档,其中信息分布在大量的文本和视觉元素页面上,是现代视觉-语言模型(VLMs)面临的一个关键但具有挑战性的任务。现有方法在根本挑战上失败:证据定位。它们难以检索相关页面并忽略视觉元素中的细粒度细节,导致性能有限和模型幻觉。为了解决这个问题,我们提出了DocLens,一种工具增强的多智能体框架,能够有效地“聚焦”在证据上,就像一个镜头。它首先从整个文档导航到相关页面上的特定视觉元素,然后采用采样-裁定机制生成一个可靠的答案。与Gemini-2.5-Pro结合使用时,DocLens在MMLongBench-Doc和FinRAGBench-V上达到了最先进的性能,甚至超过了人类专家。该框架在视觉中心和无法回答的查询方面表现出色,展示了其增强定位能力的强大之处。
Summary / 总结
DocLens is a tool-augmented multi-agent framework designed to improve the understanding of long visual documents by addressing the challenge of evidence localization. It navigates from the full document to specific visual elements on relevant pages and uses a sampling-adjudication mechanism to generate reliable answers. DocLens, paired with Gemini-2.5-Pro, outperforms existing models and even human experts on MMLongBench-Doc and FinRAGBench-V, especially on vision-centric and unanswerable queries.
DocLens 是一种工具增强的多智能体框架,旨在通过解决证据定位问题来提高对长视觉文档的理解能力。它从整个文档导航到相关页面的具体视觉元素,并使用抽样-裁定机制生成可靠的答案。DocLens 与 Gemini-2.5-Pro 结合使用,在 MMLongBench-Doc 和 FinRAGBench-V 上超越了现有模型和人类专家,特别是在视觉中心和无法回答的问题上表现出色。
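The sampling-adjudication step described for DocLens can be illustrated with a minimal sketch. The abstract does not specify how adjudication works, so the majority vote below is a swapped-in stand-in, not the authors' method; `sample_answers`, `adjudicate`, and the `generate` callback are hypothetical names introduced only for this example.

```python
from collections import Counter
from typing import Callable, List


def sample_answers(generate: Callable[[str, int], str], question: str, n: int) -> List[str]:
    """Draw n candidate answers; `generate` is a hypothetical stand-in for a model call."""
    return [generate(question, seed) for seed in range(n)]


def adjudicate(candidates: List[str]) -> str:
    """Stand-in adjudicator (an assumption): return the most frequent normalized answer.

    Ties are resolved in favor of the answer that was seen first.
    """
    normalized = [c.strip().lower() for c in candidates]
    return Counter(normalized).most_common(1)[0][0]


def answer_question(generate: Callable[[str, int], str], question: str, n: int = 5) -> str:
    """Sample several candidates, then reduce them to a single answer."""
    return adjudicate(sample_answers(generate, question, n))
```

In a real pipeline the adjudicator would more plausibly be another model call that inspects the candidates and their evidence; the vote is used here only because it is deterministic and easy to check.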
Bridging Hidden States in Vision-Language Models
Authors: Benjamin Fein-Ashley, Jacob Fein-Ashley
First: 2025-11-14T17:55:25+00:00 · Latest: 2025-11-14T17:55:25+00:00
Abstract
Vision-Language Models (VLMs) are a new family of models that align image content with natural language. Existing approaches typically fuse either (a) early: by mixing tokens/features inside the encoders, or (b) late: by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities "think". We propose a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder. Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. We make our code publicly available at https://github.com/jfeinashley/BRIDGE.
中文标题/摘要
标题:视觉-语言模型中隐藏状态的连接
视觉-语言模型(VLMs)是一类新的模型,能够将图像内容与自然语言对齐。现有方法通常在编码器内部通过混合标记/特征(早期融合)或通过比较聚合表示(晚期融合)来进行融合。许多方法还将融合与自回归解码器联系起来。然而,两种模态的隐藏状态本身已经携带了丰富的、模态特定的结构(视觉中的空间布局;文本中的句法和语义),因此直接对齐这些状态是匹配这两种模态“思考”的自然方式。我们提出了一种轻量级的融合模块:在两个编码器的顶部附近放置几层仅跨模态的双向注意力层。每一层将视觉和文本编码器的隐藏状态序列投影到共享空间,跨模态进行注意,并通过简单的稳定器发送门控残差更新,从而改善对齐。编码器保持非因果性,强于理解,而生成则通过可选的解码器保持清晰地分离。在标准检索、VQA和视觉推理基准测试中,BRIDGE在保持对比模型的双编码器效率的同时,优于可比的VLMs。我们将在https://github.com/jfeinashley/BRIDGE公开我们的代码。
Summary / 总结
BRIDGE aims to improve the alignment of visual and textual information in Vision-Language Models (VLMs) by directly aligning the hidden states of both modalities. It places a few cross-only, bidirectional attention layers near the top of both encoders; each layer projects the vision and text hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back. Across standard retrieval, VQA, and visual reasoning benchmarks, it outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. The encoders remain non-causal, and generation stays cleanly decoupled via an optional decoder. The code is publicly available.
研究旨在通过直接对齐图像和文本的隐藏状态来改进视觉语言模型(VLMs)中的图像和文本表示。提出的BRIDGE方法在两个编码器的顶部引入了几层跨模态双向注意力层,将视觉和文本的隐藏状态投影到共享空间并进行对齐。这种方法在标准基准测试中优于现有方法,同时保持对比模型的高效性,并保留编码器和解码器的解耦。
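A cross-only, bidirectional attention layer with gated residual updates, as described for BRIDGE, might look roughly like the sketch below. It is a simplification under stated assumptions: both hidden-state sequences are taken to already share one dimension (so the projection into the shared space is the identity), the gate is a fixed scalar rather than a learned value, and the stabilizers mentioned in the abstract are omitted. `cross_attend` and `bridge_layer` are hypothetical names, not the repository's API.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, other: Matrix) -> Matrix:
    """Each query row attends only over the other modality's rows (no self-attention)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    updates = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in other]
        weights = softmax(scores)
        # Weighted average of the other modality's hidden states.
        updates.append([sum(w * row[d] for w, row in zip(weights, other))
                        for d in range(len(other[0]))])
    return updates


def bridge_layer(vision: Matrix, text: Matrix, gate: float = 0.1) -> Tuple[Matrix, Matrix]:
    """One bidirectional update: vision reads text, text reads vision, both gated."""
    v_update = cross_attend(vision, text)
    t_update = cross_attend(text, vision)
    new_vision = [[h + gate * u for h, u in zip(hr, ur)] for hr, ur in zip(vision, v_update)]
    new_text = [[h + gate * u for h, u in zip(hr, ur)] for hr, ur in zip(text, t_update)]
    return new_vision, new_text
```

With `gate=0.0` the layer is an exact no-op, which is the property that lets a gated residual start from the pretrained encoders' behavior and move away from it gradually.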
Collaborative Representation Learning for Alignment of Tactile, Language, and Vision Modalities
Authors: Yiyun Zhou, Mingjing Xu, Jingwei Shi, Quanjiang Li, Jingyuan Chen
First: 2025-11-14T17:34:20+00:00 · Latest: 2025-11-14T17:34:20+00:00
Abstract
Tactile sensing offers rich and complementary information to vision and language, enabling robots to perceive fine-grained object properties. However, existing tactile sensors lack standardization, leading to redundant features that hinder cross-sensor generalization. Moreover, existing methods fail to fully integrate the intermediate communication among tactile, language, and vision modalities. To address this, we propose TLV-CoRe, a CLIP-based Tactile-Language-Vision Collaborative Representation learning method. TLV-CoRe introduces a Sensor-Aware Modulator to unify tactile features across different sensors and employs tactile-irrelevant decoupled learning to disentangle irrelevant tactile features. Additionally, a Unified Bridging Adapter is introduced to enhance tri-modal interaction within the shared representation space. To fairly evaluate the effectiveness of tactile models, we further propose the RSS evaluation framework, focusing on Robustness, Synergy, and Stability across different methods. Experimental results demonstrate that TLV-CoRe significantly improves sensor-agnostic representation learning and cross-modal alignment, offering a new direction for multimodal tactile representation.
中文标题/摘要
标题:协作学习表示法在触觉、语言和视觉模态对齐中的应用
触觉传感提供了丰富且互补的信息,能够使机器人感知物体的细微属性。然而,现有的触觉传感器缺乏标准化,导致冗余特征,阻碍了跨传感器的一般化。此外,现有的方法未能充分整合触觉、语言和视觉模态之间的中间通信。为了解决这个问题,我们提出了基于CLIP的触觉-语言-视觉协作表示学习方法TLV-CoRe。TLV-CoRe引入了传感器感知调制器以统一不同传感器的触觉特征,并采用触觉无关解耦学习以分离无关的触觉特征。此外,还引入了统一桥梁适配器以增强三模态在共享表示空间内的交互。为了公平评估触觉模型的效果,我们进一步提出了RSS评估框架,重点关注不同方法下的鲁棒性、协同性和稳定性。实验结果表明,TLV-CoRe显著提高了传感器无关的表示学习和跨模态对齐,为多模态触觉表示提供了新的方向。
PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models
Authors: Nhat Hoang-Xuan, Minh Vu, My T. Thai, Manish Bhattarai
First: 2025-11-14T17:23:55+00:00 · Latest: 2025-11-14T17:23:55+00:00
Abstract
Large vision-language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (prelim) tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
中文标题/摘要
标题:PAS:用于检测大型视觉-语言模型中对象幻觉的初步注意分数
大型视觉-语言模型(LVLMs)非常强大,但仍然不可靠,因为存在对象幻觉。在这项工作中,我们表明,在许多幻觉预测中,LVLM实际上忽略了图像,而是依赖于先前生成的输出(初步)令牌来推断新对象。我们通过在初步条件下条件化图像和预测对象之间的互信息来量化这种行为,证明了弱图像依赖性与幻觉强烈相关。基于这一发现,我们引入了初步注意分数(PAS),这是一种轻量级、无需训练的信号,由初步令牌上的注意权重计算得出。PAS 不需要额外的前向传递,并且可以在推理过程中实时计算。利用这一之前被忽视的信号,PAS 在多个模型和数据集上实现了最先进的对象幻觉检测,从而实现实时过滤和干预。
Summary / 总结
This research addresses object hallucinations in large vision-language models (LVLMs). It shows that in many hallucinatory predictions the model effectively ignores the image and relies on previously generated (prelim) tokens to infer new objects, and that weak image dependence, measured by the mutual information between the image and the predicted object conditioned on the prelim, correlates strongly with hallucination. Building on this, the study introduces the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens that requires no additional forward passes. PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
该研究针对大型视觉-语言模型(LVLM)中的物体幻觉问题,发现模型往往依赖于之前生成的令牌而非输入图像。作者提出了一种轻量级方法——初步注意得分(PAS),该方法无需额外训练即可在推理过程中计算。PAS 能够有效检测幻觉并在多个模型和数据集上达到最先进的性能,从而实现实时过滤和干预。
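To make the idea of a score "computed from attention weights over prelim tokens" concrete, here is a minimal sketch. The exact PAS formula is not given in the abstract, so the aggregation below (attention mass on prelim positions, averaged over heads) and the threshold are assumptions; `prelim_attention_score` and `flag_possible_hallucination` are hypothetical names. Each attention row is assumed to be already softmax-normalized over all context positions (image, prompt, and prelim tokens).

```python
from typing import Sequence


def prelim_attention_score(
    attn_per_head: Sequence[Sequence[float]],
    prelim_positions: Sequence[int],
) -> float:
    """Average, over heads, of the attention mass placed on prelim (already generated) tokens.

    attn_per_head[h][i] is the weight head h assigns to context position i
    when predicting the current object token.
    """
    per_head = [sum(head[i] for i in prelim_positions) for head in attn_per_head]
    return sum(per_head) / len(per_head)


def flag_possible_hallucination(score: float, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: high reliance on prelim tokens is treated as suspicious."""
    return score > threshold
```

Because the weights are read from the forward pass that is already producing the token, a score of this kind needs no extra model calls, which matches the "on the fly" property claimed in the abstract.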
ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation
Authors: Kaishen Wang, Ruibo Chen, Tong Zheng, Heng Huang
First: 2025-11-14T17:00:29+00:00 · Latest: 2025-11-14T17:00:29+00:00
Comments: 12 pages, 5 tables, 6 figures
Abstract
Recent text-to-image (T2I) models have made remarkable progress in generating visually realistic and semantically coherent images. However, they still suffer from randomness and inconsistency with the given prompts, particularly when textual descriptions are vague or underspecified. Existing approaches, such as prompt rewriting, best-of-N sampling, and self-refinement, can mitigate these issues but usually require additional modules and operate independently, hindering test-time scaling efficiency and increasing computational overhead. In this paper, we introduce ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework for efficient test-time scaling. Guided by a policy controller, multiple generation actions dynamically interact and self-organize to enhance image fidelity and semantic alignment without relying on external models. Extensive experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone and even surpasses other strong baselines where the backbone model fails, highlighting the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling.
中文标题/摘要
标题:ImAgent:一种用于测试时可扩展图像生成的统一多模态代理框架
近期的文本到图像(T2I)模型在生成视觉真实且语义一致的图像方面取得了显著进展。然而,它们仍然存在随机性和与给定提示不一致的问题,特别是在文本描述模糊或不明确时更为明显。现有的方法,如提示重写、最佳N采样和自我完善,可以缓解这些问题,但通常需要额外的模块并独立运行,阻碍了测试时可扩展性的效率并增加了计算开销。在本文中,我们引入了ImAgent,这是一种无需训练的统一多模态代理,将推理、生成和自我评估整合在一个框架中,以实现高效的测试时可扩展性。在策略控制器的引导下,多个生成动作动态交互和自我组织,以提高图像保真度和语义对齐,而不依赖于外部模型。在图像生成和编辑任务上的大量实验表明,ImAgent在基线模型上始终表现出改进,并且在基线模型失败的情况下甚至超越了其他强基线,突显了统一多模态代理在测试时可扩展性下的自适应和高效图像生成的潜力。
Summary / 总结
ImAgent targets the randomness and prompt inconsistency of text-to-image models, which are most pronounced when textual descriptions are vague or underspecified. It is a training-free, unified multimodal agent that integrates reasoning, generation, and self-evaluation in a single framework, with a policy controller coordinating multiple generation actions without relying on external models. Experiments on image generation and editing tasks show that ImAgent consistently improves over its backbone and even surpasses other strong baselines in cases where the backbone fails.
研究动机是解决文本到图像生成模型中存在的随机性和不一致性问题,尤其是在文本描述模糊时更为明显。主要方法是引入ImAgent,这是一种统一的多模态代理框架,将推理、生成和自我评估整合在一个无训练框架中。关键实验结果表明,ImAgent在图像生成和编辑任务中始终提高了图像保真度和语义对齐,超越了基础模型和其他强大基线,特别是在测试时缩放条件下。
Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective
Authors: Nhat Chung, Taisei Hanyu, Toan Nguyen, Huy Le, Frederick Bumgarner, Duy Minh Ho Nguyen, Khoa Vo, Kashu Yamazaki, Chase Rainwater, Tung Kieu, Anh Nguyen, Ngan Le
Venue: AAAI 2026
First: 2025-11-14T16:56:01+00:00 · Latest: 2025-11-14T16:56:01+00:00
Comments: Accepted at AAAI 2026
Abstract
As embodied agents operate in increasingly complex environments, the ability to perceive, track, and reason about individual object instances over time becomes essential, especially in tasks requiring sequenced interactions with visually similar objects. In these non-Markovian settings, key decision cues are often hidden in object-specific histories rather than the current scene. Without persistent memory of prior interactions (what has been interacted with, where it has been, or how it has changed) visuomotor policies may fail, repeat past actions, or overlook completed ones. To surface this challenge, we introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. However, vision-language-action (VLA) models often struggle in such settings, with token scaling quickly becoming intractable even for tasks spanning just a few hundred frames. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability. It maintains spatio-temporally consistent slot identities and leverages them through two mechanisms: (1) slot-state-space modeling for reconstructing short-term history, and (2) a relational encoder to align the input tokens with action decoding. Together, these components enable temporally grounded, context-aware action prediction. Experiments show Embodied-SlotSSM's baseline performance on LIBERO-Mem and general tasks, offering a scalable solution for non-Markovian reasoning in object-centric robotic policies.
中文标题/摘要
标题:从物体中心视角重新思考机器人操作中记忆状态的演变
随着嵌入式代理在日益复杂的环境中操作,感知、跟踪和随时间推移对个体物体实例进行推理的能力变得至关重要,尤其是在需要与视觉上相似的物体进行顺序交互的任务中。在这些非马尔可夫环境中,关键决策线索往往隐藏在物体特定的历史记录中,而不是当前场景中。如果没有持续的记忆(之前交互过什么,它在哪里,或者它如何变化),视知觉运动策略可能会失败,重复过去的动作,或者忽略已完成的动作。为了揭示这一挑战,我们引入了LIBERO-Mem,这是一种非马尔可夫任务套件,用于在物体级别部分可观测性下对机器人操作进行压力测试。它结合了短期和长期的物体跟踪以及时间序列子目标,要求超越当前帧进行推理。然而,视觉-语言-动作(VLA)模型在这些环境中往往难以应对,即使对于仅跨越几百帧的任务,标记缩放也很快变得不可行。我们提出了一种基于槽的VLA框架Embodied-SlotSSM,该框架旨在实现时间上的可扩展性。它保持时空一致的槽身份,并通过两种机制利用它们:(1)槽状态空间建模以重构短期历史,(2)关系编码器将输入标记与动作解码对齐。这些组件共同使基于时间的、上下文相关的动作预测成为可能。实验表明,Embodied-SlotSSM在LIBERO-Mem和通用任务上的基线性能,提供了一种在物体中心的机器人策略中进行非马尔可夫推理的可扩展解决方案。
Summary / 总结
This paper addresses robotic manipulation in non-Markovian settings, where key decision cues lie in object-specific histories rather than the current scene. It introduces LIBERO-Mem, a task suite that stress-tests manipulation under object-level partial observability. Because token scaling in vision-language-action (VLA) models quickly becomes intractable even for tasks spanning a few hundred frames, the authors propose Embodied-SlotSSM, a slot-centric VLA framework that maintains spatio-temporally consistent slot identities and combines slot-state-space modeling with a relational encoder for temporally grounded action prediction. Experiments report Embodied-SlotSSM's baseline performance on LIBERO-Mem and general tasks, positioning it as a scalable solution for non-Markovian reasoning in object-centric robotic policies.
论文针对复杂非马尔可夫环境中物体交互记忆的挑战,引入了LIBERO-Mem任务套件来测试部分可观测条件下的机器人操作,并提出了Embodied-SlotSSM,这是一种基于槽的视觉-语言-动作框架,保持时空一致的槽身份,并使用槽状态空间建模和关系编码器实现时间可扩展的动作预测。实验表明,Embodied-SlotSSM在非马尔可夫推理任务中表现出色,为物体中心的机器人策略提供了一个可扩展的解决方案。
VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
Authors: Maximilian Rokuss, Moritz Langenberg, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm, Constantin Ulrich, Sebastian Regnery, Lukas Bauer, Efthimios Katsigiannopulos, Tobias Norajitra, Klaus Maier-Hein
First: 2025-11-14T16:20:07+00:00 · Latest: 2025-11-14T16:20:07+00:00
Abstract
We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell
中文标题/摘要
标题:VoxTell:可文本提示的通用3D医学图像分割
我们介绍了VoxTell,一种用于文本提示的体积医学图像分割的视觉语言模型。它将从单个单词到完整的临床句子的自由形式描述映射到3D掩码。VoxTell基于超过62,000个CT、MRI和PET体积,涵盖1,000多个解剖和病理类,通过解码器层的多阶段视觉语言融合,在多个尺度上对齐文本和视觉特征。它在未见过的数据集上实现了跨模态的零样本最佳性能,对熟悉的概念表现出色,同时能够泛化到相关的未见过的类别。大量实验进一步证明了其跨模态的强迁移能力、对语言变化和临床语言的鲁棒性,以及对真实世界文本的准确实例特定分割。代码可在:https://www.github.com/MIC-DKFZ/VoxTell 获取
Summary / 总结
VoxTell is a vision-language model designed for text-prompted volumetric medical image segmentation, trained on over 62,000 CT, MRI, and PET volumes. It uses multi-stage vision-language fusion to align textual and visual features at multiple scales, achieving state-of-the-art zero-shot performance across modalities on unseen datasets. The model excels on familiar concepts and generalizes well to related unseen classes, demonstrating strong cross-modality transfer and robustness to linguistic variations and clinical language. Accurate instance-specific segmentation from real-world text is also demonstrated in experiments.
VoxTell 是一种用于文本提示的体积医学图像分割的视觉语言模型,训练数据包括 CT、MRI 和 PET 图像。该模型通过多阶段的视觉语言融合来对齐文本和视觉特征,展示了跨模态的先进零样本性能,并且在语言变化和临床语言方面表现出较强的鲁棒性,同时能够进行准确的实例特定分割。
From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs
Authors: Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi
First: 2025-11-14T16:07:18+00:00 · Latest: 2025-11-14T16:07:18+00:00
Abstract
Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following an ad-hoc data collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects' attributes, including color, shape, size, and position within the scene. Secondly, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.
中文标题/摘要
标题:从合成场景到真实表现:增强VLM的空间推理能力
对视觉-语言模型(VLMs)进行微调是一种常见的策略,以提高性能,通常是在收集和标注真实场景数据后进行。然而,这一过程往往容易出现偏差、错误和分布不平衡,导致过拟合和性能不平衡。尽管有一些研究尝试通过生成合成数据来解决这个问题,但它们缺乏对分布偏差和标注质量的控制。为了解决这些挑战,我们以两种方式重新设计了微调过程。首先,我们控制数据及其标注的生成,确保其无偏差、无分布不平衡和无标注错误。我们通过全面采样场景中对象的属性(包括颜色、形状、大小和位置)自动构建数据集。其次,使用这个标注数据集,我们微调最先进的VLMs,并在绝对位置任务上评估其性能转移性。我们在合成和真实世界基准上进行了详尽的评估。我们的实验揭示了两个关键发现:1)在平衡的合成数据上进行微调可以在视觉场景中获得一致的性能并减轻常见偏差;2)在合成刺激上进行微调显著提高了在真实世界数据(COCO)上的性能,超过了在匹配设置中进行微调的模型。
Summary / 总结
The research aims to enhance the spatial reasoning of Vision-Language Models (VLMs) by fine-tuning them on controlled synthetic data that avoids bias, distribution imbalance, and annotation errors. The dataset is constructed automatically by comprehensively sampling object attributes (color, shape, size, and position within the scene). On the absolute position task, fine-tuning on balanced synthetic data yields uniform performance across the visual scene, mitigates common biases, and significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.
该研究旨在通过解决细调过程中的偏差和分布不平衡问题,增强视觉-语言模型(VLMs)的空间推理能力。作者重新设计了细调过程,生成了具有控制注释的平衡合成数据,确保场景中的对象属性(如颜色、形状、大小和位置)准确表示。实验表明,使用这种合成数据进行细调可以提高在真实世界任务中的表现,特别是在COCO数据集上,并且能够缓解真实世界数据中存在的常见偏差。
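"Comprehensively sampling objects' attributes" can be pictured as enumerating the full Cartesian product of attribute values, which makes the dataset balanced by construction. The sketch below is illustrative only: the attribute vocabularies, the question template, and the `build_scenes` helper are assumptions, not the authors' generator.

```python
from collections import Counter
from itertools import product
from typing import Dict, List

# Hypothetical attribute vocabularies; the paper's actual values are not given in the abstract.
COLORS = ["red", "green", "blue"]
SHAPES = ["circle", "square"]
SIZES = ["small", "large"]
POSITIONS = ["top-left", "top-right", "bottom-left", "bottom-right"]


def build_scenes() -> List[Dict[str, str]]:
    """Enumerate every attribute combination once, so no value is over-represented."""
    scenes = []
    for color, shape, size, position in product(COLORS, SHAPES, SIZES, POSITIONS):
        scenes.append({
            "color": color,
            "shape": shape,
            "size": size,
            "position": position,
            # The annotation is derived from the generating parameters, so it cannot be wrong.
            "question": f"Where is the {size} {color} {shape}?",
            "answer": position,
        })
    return scenes


def position_counts(scenes: List[Dict[str, str]]) -> Counter:
    """Count how often each absolute position is the correct answer."""
    return Counter(scene["answer"] for scene in scenes)
```

With these vocabularies the product yields 3 x 2 x 2 x 4 = 48 scenes, and each position is the answer exactly 12 times, which is the kind of uniform label distribution the abstract contrasts with ad-hoc real-world collections.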
Retrofit: Continual Learning with Bounded Forgetting for Security Applications
Authors: Yiling He, Junchi Lei, Hongyu She, Shuo Shao, Xinran Zheng, Yiping Liu, Zhan Qin, Lorenzo Cavallaro
First: 2025-11-14T16:07:03+00:00 · Latest: 2025-11-14T16:07:03+00:00
Abstract
Modern security analytics are increasingly powered by deep learning models, but their performance often degrades as threat landscapes evolve and data representations shift. While continual learning (CL) offers a promising paradigm to maintain model effectiveness, many approaches rely on full retraining or data replay, which are infeasible in data-sensitive environments. Moreover, existing methods remain inadequate for security-critical scenarios, facing two coupled challenges in knowledge transfer: preserving prior knowledge without old data and integrating new knowledge with minimal interference. We propose RETROFIT, a data retrospective-free continual learning method that achieves bounded forgetting for effective knowledge transfer. Our key idea is to consolidate previously trained and newly fine-tuned models, serving as teachers of old and new knowledge, through parameter-level merging that eliminates the need for historical data. To mitigate interference, we apply low-rank and sparse updates that confine parameter changes to independent subspaces, while a knowledge arbitration dynamically balances the teacher contributions guided by model confidence. Our evaluation on two representative applications demonstrates that RETROFIT consistently mitigates forgetting while maintaining adaptability. In malware detection under temporal drift, it substantially improves the retention score, from 20.2% to 38.6% over CL baselines, and exceeds the oracle upper bound on new data. In binary summarization across decompilation levels, where analyzing stripped binaries is especially challenging, RETROFIT achieves around twice the BLEU score of transfer learning used in prior work and surpasses all baselines in cross-representation generalization.
中文标题/摘要
标题:Retrofit:在安全应用中具有有限遗忘的持续学习方法
现代安全分析越来越多地依赖深度学习模型,但随着威胁环境的变化和数据表示的转变,其性能往往会下降。虽然持续学习(CL)提供了一种有希望的范式来保持模型的有效性,但许多方法依赖于完全重新训练或数据回放,这在敏感数据环境中是不可行的。此外,现有方法在安全关键场景中仍然不足,面临着知识转移的两个耦合挑战:在没有旧数据的情况下保留先验知识和在最小干扰下整合新知识。 我们提出了一种名为Retrofit的数据回顾自由持续学习方法,以实现有效的知识转移并具有有限遗忘。我们的核心思想是通过参数级合并将先前训练和新微调的模型结合起来,作为旧知识和新知识的教师,从而消除对历史数据的需求。为了减轻干扰,我们应用了低秩和稀疏更新,将参数变化限制在独立子空间中,同时知识仲裁根据模型置信度动态平衡教师贡献。我们在两个代表性应用上的评估表明,Retrofit在减轻遗忘的同时保持了适应性。在时间漂移下的恶意软件检测中,它在持续学习基线上的保留分数从20.2%提高到38.6%,并超过了新数据上的先验上限。在跨分解级别进行二元总结化时,特别是在分析剥离二进制文件特别具有挑战性的场景中,Retrofit的BLEU分数大约是先前工作中使用的迁移学习的两倍,并且在跨表示泛化方面超过了所有基线。
Summary / 总结
The paper addresses the challenge of maintaining deep learning models' performance in dynamic security environments where data distributions shift. It introduces RETROFIT, a continual learning method that consolidates old and new models through parameter-level merging without requiring historical data. This approach mitigates forgetting while integrating new knowledge, as demonstrated by improved retention scores in malware detection and higher BLEU scores in binary summarization tasks compared to existing methods.
论文针对数据表示变化和威胁演化的动态安全环境中保持深度学习模型性能的挑战,提出了一种名为RETROFIT的持续学习方法,该方法通过合并先前训练和新微调模型的参数来避免使用历史数据,从而减少旧知识和新知识之间的干扰。实验结果显示,RETROFIT有效缓解了遗忘问题,在恶意软件检测中提高了保留分数,超过了基线方法和Oracle上限。在二进制摘要化任务中,它实现了比先前工作中的迁移学习更好的BLEU分数,并在跨表示泛化方面超越所有基线方法。
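The combination of parameter-level merging, sparse updates, and confidence-guided arbitration described for RETROFIT can be sketched on flat parameter vectors. This is a loose illustration, not the paper's algorithm: the low-rank constraint is omitted, sparsity is imposed by keeping the largest-magnitude changes, and the arbitration weight is a simple confidence ratio. `sparsify` and `merge_models` are hypothetical names.

```python
from typing import List


def sparsify(delta: List[float], keep: int) -> List[float]:
    """Keep only the `keep` largest-magnitude entries of the update; zero the rest."""
    if keep <= 0:
        return [0.0] * len(delta)
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    kept = set(ranked[:keep])
    return [d if i in kept else 0.0 for i, d in enumerate(delta)]


def merge_models(
    old: List[float],
    new: List[float],
    conf_old: float,
    conf_new: float,
    keep: int,
) -> List[float]:
    """Merge two parameter vectors without touching any historical data.

    The new model's influence (alpha) grows with its confidence relative to the old model's.
    This ratio is an assumption made for the sketch.
    """
    alpha = conf_new / (conf_old + conf_new)
    delta = [n - o for o, n in zip(old, new)]
    sparse_delta = sparsify(delta, keep)
    return [o + alpha * d for o, d in zip(old, sparse_delta)]
```

For example, `merge_models([1.0, 2.0, 3.0], [1.5, 2.0, 5.0], 0.5, 0.5, keep=1)` changes only the third parameter, moving it halfway from 3.0 to 5.0 and leaving the rest at their old values.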
VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models
Authors: Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu, Zekang Du, Mengyang Wu, Pingping Zhang, Kun Li, Hongzheng Yang, Wenao Ma, Jiaheng Wei, Qinbin Li, Kangcheng Liu, Wenqiang Lei
Venue: AAAI 2026
First: 2025-11-14T16:06:25+00:00 · Latest: 2025-11-14T16:06:25+00:00
Comments: This is the extended version of the paper accepted at AAAI 2026, which includes all technical appendices and additional experimental details
Abstract
Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.
中文标题/摘要
标题:VP-Bench:多模态大型语言模型视觉提示综合基准
多模态大型语言模型(MLLMs)已使一系列高级视觉-语言应用成为可能,包括细粒度的目标识别和上下文理解。当查询图像中的特定区域或对象时,人类用户自然会使用“视觉提示”(VPs),如边界框,来提供参考。然而,目前没有基准能够系统地评估MLLMs理解VPs的能力。这一空白使得不清楚当前的MLLMs是否能够有效识别VPs,这是一种直观的人类提示方法,并利用它们解决问题。为解决这一局限,我们引入了VP-Bench,一个评估MLLMs在VP感知和利用方面能力的基准。VP-Bench采用两阶段评估框架:第一阶段检查模型在自然场景中感知VPs的能力,使用30,000个可视化提示,涵盖八种形状和355种属性组合。第二阶段研究VPs对下游任务的影响,测量其在现实世界问题解决场景中的有效性。使用VP-Bench,我们评估了28个MLLMs,包括专有系统(如GPT-4o)和开源模型(如InternVL3和Qwen2.5-VL),并提供了影响VP理解的因素的全面分析,如VP属性的变化、问题排列和模型规模。VP-Bench为研究MLLMs如何理解和解决基于参照的问题建立了新的参考框架。
Summary / 总结
VP-Bench is a benchmark designed to evaluate the ability of multimodal large language models (MLLMs) to interpret and utilize visual prompts (VPs). It consists of two stages: the first evaluates models' VP perception in natural scenes with 30k visualized prompts spanning eight shapes and 355 attribute combinations, and the second measures the impact of VPs on downstream, real-world problem-solving tasks. The study evaluates 28 MLLMs, both proprietary and open-source, and analyzes factors that affect VP understanding, such as VP attributes, question arrangement, and model scale.
VP-Bench 是一个用于评估多模态大语言模型 (MLLM) 解释和利用视觉提示 (VP) 能力的基准。它分为两个阶段:第一阶段通过 30k 个提示评估模型的 VP 感知能力,第二阶段评估 VP 对下游任务的影响。研究测试了 28 个 MLLM,揭示了影响 VP 理解的因素,如 VP 属性和模型规模。VP-Bench 补充了 MLLM VP 解释能力评估的空白,提供了一个新的研究参考框架。
BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning
Authors: Lan Li, Tao Hu, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan
Venue: AAAI 2026
First: 2025-11-14T15:51:40+00:00 · Latest: 2025-11-14T15:51:40+00:00
Comments: Accepted by AAAI 2026
Abstract
Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them promising for CIL. However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, increasing model complexity and susceptibility to forgetting; and (2) while multi-modal representations offer complementary strengths, existing methods have yet to fully realize their potential in effectively integrating visual and textual modalities. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. BOFA confines all model adaptation exclusively to CLIP's existing cross-modal bridge-layer, thereby adding no extra parameters or inference cost. To prevent forgetting within this layer, it leverages Orthogonal Low-Rank Fusion, a mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past task features. This ensures stable knowledge accumulation without data replay. Furthermore, BOFA employs a cross-modal hybrid prototype that synergizes stable textual prototypes with visual counterparts derived from our stably adapted bridge-layer, enhancing classification performance. Extensive experiments on standard benchmarks show that BOFA achieves superior accuracy and efficiency compared to existing methods.
中文标题/摘要
标题:BOFA:桥梁层正交低秩融合在CLIP基类增量学习中的应用
类增量学习(CIL)旨在不断学习新类别而不忘记之前获得的知识。视觉-语言模型如CLIP通过多模态监督提供强大的可迁移表示,使其在CIL中具有潜力。然而,将CLIP应用于CIL面临两大挑战:(1)适应下游任务通常需要额外的可学习模块,增加模型复杂性和遗忘风险;(2)尽管多模态表示提供了互补的优势,但现有方法尚未充分利用其在有效整合视觉和文本模态方面的潜力。为解决这些问题,我们提出了BOFA(桥梁层正交融合),一种用于CIL的新框架。BOFA将所有模型适应完全限制在CLIP现有的跨模态桥梁层,从而不增加额外参数或推理成本。为防止在该层内遗忘,它利用正交低秩融合机制,将参数更新约束在一个数学上与过去任务特征正交的低秩“安全子空间”。这确保了在不重放数据的情况下稳定的知识积累。此外,BOFA采用跨模态混合原型,结合稳定文本原型与我们稳定适应的桥梁层衍生的视觉对应物,增强分类性能。在标准基准上的广泛实验表明,BOFA在准确性和效率方面优于现有方法。
Summary / 总结
BOFA is a novel framework for Class-Incremental Learning (CIL) that addresses the challenges of adapting CLIP models without increasing complexity or forgetting. It confines all model adaptation to CLIP's existing cross-modal bridge-layer using Orthogonal Low-Rank Fusion to prevent forgetting and enhance stability. BOFA also uses a cross-modal hybrid prototype to improve classification performance. Experiments show BOFA outperforms existing methods in both accuracy and efficiency on standard benchmarks.
BOFA 是一种针对类增量学习(CIL)的新框架,旨在解决在不增加复杂度或遗忘的情况下适应 CLIP 模型的问题。它将所有适应限定在 CLIP 的现有跨模态桥接层,并使用正交低秩融合来约束参数更新以防止遗忘。BOFA 还通过跨模态混合原型增强分类性能。实验表明,BOFA 在标准基准上的准确性和效率都优于现有方法。
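The idea of constraining updates to a subspace orthogonal to past task features can be shown with a small projection sketch. It treats the update as a single vector and builds the orthogonal complement with Gram-Schmidt; the real method operates on the bridge-layer's weight matrix with a low-rank parameterization, which is not reproduced here. `orthonormal_basis` and `project_to_safe_subspace` are hypothetical names.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: List[Vector], eps: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: an orthonormal basis for the span of the past-task feature vectors."""
    basis: List[Vector] = []
    for v in vectors:
        residual = list(v)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * bi for r, bi in zip(residual, b)]
        norm = math.sqrt(dot(residual, residual))
        if norm > eps:  # skip vectors already contained in the span
            basis.append([r / norm for r in residual])
    return basis


def project_to_safe_subspace(update: Vector, past_features: List[Vector]) -> Vector:
    """Remove every component of the update that lies in the span of past features."""
    residual = list(update)
    for b in orthonormal_basis(past_features):
        coeff = dot(residual, b)
        residual = [r - coeff * bi for r, bi in zip(residual, b)]
    return residual
```

After projection the update has zero inner product with each stored feature, so, to first order, applying it does not change the layer's response to inputs from earlier tasks; that is the intuition behind calling the subspace "safe".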
Low-Bit, High-Fidelity: Optimal Transport Quantization for Flow Matching
Authors: Dara Varam, Diaa A. Abuhani, Imran Zualkernan, Raghad AlDamani, Lujain Khalil
First: 2025-11-14T15:49:36+00:00 · Latest: 2025-11-14T15:49:36+00:00
Comments: 12 pages, 8 figures
Abstract
Flow Matching (FM) generative models offer efficient simulation-free training and deterministic sampling, but their practical deployment is challenged by high-precision parameter requirements. We adapt optimal transport (OT)-based post-training quantization to FM models, minimizing the 2-Wasserstein distance between quantized and original weights, and systematically compare its effectiveness against uniform, piecewise, and logarithmic quantization schemes. Our theoretical analysis provides upper bounds on generative degradation under quantization, and empirical results across five benchmark datasets of varying complexity show that OT-based quantization preserves both visual generation quality and latent space stability down to 2-3 bits per parameter, where alternative methods fail. This establishes OT-based quantization as a principled, effective approach to compress FM generative models for edge and embedded AI applications.
中文标题/摘要
标题:低比特,高保真:基于最优传输的流匹配量化
流匹配(FM)生成模型提供高效的无模拟训练和确定性采样,但其实际部署受到高精度参数要求的挑战。我们采用基于最优传输(OT)的后训练量化方法,最小化量化和原始权重之间的2-Wasserstein距离,并系统地将其有效性与均匀、分段和对数量化方案进行比较。我们的理论分析提供了量化下生成降级的上界,而跨五个不同复杂度的基准数据集的实证结果表明,基于OT的量化方法在每参数2-3比特时仍能保持视觉生成质量和潜在空间稳定性,而其他方法在此处失效。这确立了基于OT的量化方法作为压缩FM生成模型以适应边缘和嵌入式AI应用的原理性、有效方法。
Summary / 总结
The research aims to address the high-precision parameter requirements of Flow Matching generative models for practical deployment. The study employs optimal transport (OT)-based post-training quantization to minimize the 2-Wasserstein distance between quantized and original weights. Experiments across five benchmark datasets show that OT-based quantization preserves both visual generation quality and latent space stability down to 2-3 bits per parameter, outperforming uniform, piecewise, and logarithmic quantization schemes.
研究旨在解决Flow Matching生成模型因高精度参数要求而在实际部署中遇到的问题。作者采用最优传输(OT)后训练量化方法,以最小化量化后和原始权重之间的2- Wasserstein距离。实验结果表明,OT基量化方法在5个不同复杂度的基准数据集上,能够保持视觉生成质量和潜在空间稳定性,直至每参数2-3位,优于均匀、分段和对数量化方案。
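To make the 2-Wasserstein objective in the entry above concrete: for a fixed set of levels, the mean squared error between each weight and its nearest level equals the squared 2-Wasserstein distance between the empirical weight distribution and its quantized version. In one dimension that objective can therefore be reduced with Lloyd's algorithm (1-D k-means). The sketch below illustrates this under that assumption only; it is not the authors' procedure, and the function names and toy Gaussian weights are hypothetical.

```python
import random

def uniform_codebook(weights, bits):
    """Evenly spaced levels between the smallest and largest weight."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]

def nearest_index(codebook, w):
    """Index of the codebook level closest to w."""
    return min(range(len(codebook)), key=lambda i: abs(w - codebook[i]))

def lloyd_codebook(weights, bits, iters=50):
    """Lloyd's algorithm (1-D k-means), started from the uniform codebook."""
    codebook = uniform_codebook(weights, bits)
    for _ in range(iters):
        buckets = [[] for _ in codebook]
        for w in weights:
            buckets[nearest_index(codebook, w)].append(w)
        # Move each level to the mean of its bucket; empty buckets keep their level.
        codebook = [sum(b) / len(b) if b else c for c, b in zip(codebook, buckets)]
    return sorted(codebook)

def mse(weights, codebook):
    """Mean squared quantization error, i.e. the squared W2 distance here."""
    return sum((w - codebook[nearest_index(codebook, w)]) ** 2 for w in weights) / len(weights)

# Toy stand-in for a layer's weights (hypothetical data).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(2000)]
uniform = uniform_codebook(weights, bits=2)
lloyd = lloyd_codebook(weights, bits=2)
print("uniform MSE:", mse(weights, uniform))
print("lloyd   MSE:", mse(weights, lloyd))
```

Because each Lloyd step cannot increase the error, the fitted codebook is never worse than the uniform one it starts from on the same weights.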
Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models
Authors: Jiaxi Huang, Dongxu Wu, Hanwei Zhu, Lingyu Zhu, Jun Xing, Xu Wang, Baoliang Chen
First: 2025-11-14T15:41:17+00:00 · Latest: 2025-11-14T15:41:17+00:00
Abstract
The rapid advancement of Multi-modal Large Language Models (MLLMs) has expanded their capabilities beyond high-level vision tasks. Nevertheless, their potential for Document Image Quality Assessment (DIQA) remains underexplored. To bridge this gap, we propose Q-Doc, a three-tiered evaluation framework for systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. a) At the coarse level, we instruct MLLMs to assign quality scores to document images and analyze their correlation with Quality Annotations. b) At the middle level, we design distortion-type identification tasks, including single-choice and multi-choice tests for multi-distortion scenarios. c) At the fine level, we introduce distortion-severity assessment where MLLMs classify distortion intensity against human-annotated references. Our evaluation demonstrates that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Significantly, we show that Chain-of-Thought (CoT) prompting substantially enhances performance across all levels. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement. The benchmark and code are publicly available at: https://github.com/cydxf/Q-Doc.
中文标题/摘要
标题:Q-Doc:多模态大型语言模型文档图像质量评估能力基准测试
多模态大型语言模型(MLLMs)的快速发展使其能力超越了高级视觉任务,但其在文档图像质量评估(DIQA)方面的潜力尚未得到充分探索。为填补这一空白,我们提出了Q-Doc,这是一种三级评估框架,用于系统地探究MLLMs在粗粒度、中粒度和细粒度水平上的DIQA能力。a) 在粗粒度级别,我们指导MLLMs对文档图像进行质量评分,并分析其与质量注释的相关性。b) 在中粒度级别,我们设计了失真类型识别任务,包括单选和多选测试,适用于多种失真场景。c) 在细粒度级别,我们引入了失真严重性评估,MLLMs将根据人类注释的参考对失真强度进行分类。我们的评估表明,尽管MLLMs具备初步的DIQA能力,但它们存在关键限制:评分不一致、失真误识别和严重性误判。重要的是,我们展示了链式思考(CoT)提示在所有级别上显著提升了性能。我们的工作为MLLMs的DIQA能力提供了一个基准,揭示了其质量感知的显著缺陷,并指出了改进的潜在途径。基准和代码可在以下网址公开获取: https://github.com/cydxf/Q-Doc.
Summary / 总结
Q-Doc is a three-tiered evaluation framework designed to assess the Document Image Quality Assessment (DIQA) capabilities of Multi-modal Large Language Models (MLLMs). It evaluates MLLMs at three levels: coarse (assigning quality scores), middle (identifying distortions), and fine (assessing distortion severity). The evaluation shows that MLLMs have nascent DIQA abilities but face challenges such as inconsistent scoring and misidentification of distortions and their severity. Notably, Chain-of-Thought (CoT) prompting improves performance across all levels. This work provides a benchmark for DIQA capabilities in MLLMs, highlighting their deficiencies and suggesting potential improvement paths.
论文提出了Q-Doc,一个三层评估框架,用于评估多模态大型语言模型(MLLMs)的文档图像质量评估(DIQA)能力。该评估框架在粗粒度、中粒度和细粒度层次上评估MLLMs,包括质量评分、失真识别和严重性评估。评估结果显示,尽管MLLMs具有初步的DIQA能力,但它们在评分一致性、失真识别和严重性判断方面存在挑战。研究表明,链式思考(CoT)提示可以提高所有层次的性能。这项工作为MLLMs的DIQA能力提供了一个基准,揭示了它们在质量感知方面的不足,并指出了改进的潜在途径。
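The coarse-level analysis in the entry above compares model-assigned scores with quality annotations. The abstract does not say which correlation measure is used; the sketch below assumes Spearman rank correlation, a common choice in quality assessment, and uses hypothetical toy scores.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical model scores vs. human quality annotations for five images.
model_scores = [3.1, 4.5, 2.0, 4.9, 1.2]
human_scores = [55, 80, 40, 95, 10]
print(spearman(model_scores, human_scores))
```

Rank correlation only checks that the model orders images the way annotators do, so it is insensitive to the scale the model happens to score on.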
Leveraging NTPs for Efficient Hallucination Detection in VLMs
Authors: Ofir Azachi, Kfir Eliyahu, Eyal El Ani, Rom Himelstein, Roi Reichart, Yuval Pinter, Nitay Calderon
First: 2025-09-20T14:36:22+00:00 · Latest: 2025-11-14T15:38:48+00:00
Comments: Accepted to The First Workshop on Confabulation, Hallucinations, & Overgeneration in Multilingual & Precision-critical Setting - AACL-IJCNLP2025
Abstract
Hallucinations of vision-language models (VLMs), which are misalignments between visual content and generated text, undermine the reliability of VLMs. One common approach for detecting them employs the same VLM, or a different one, to assess generated outputs. This process is computationally intensive and increases model latency. In this paper, we explore an efficient on-the-fly method for hallucination detection by training traditional ML models over signals based on the VLM's next-token probabilities (NTPs). NTPs provide a direct quantification of model uncertainty. We hypothesize that high uncertainty (i.e., a low NTP value) is strongly associated with hallucinations. To test this, we introduce a dataset of 1,400 human-annotated statements derived from VLM-generated content, each labeled as hallucinated or not, and use it to test our NTP-based lightweight method. Our results demonstrate that NTP-based features are valuable predictors of hallucinations, enabling fast and simple ML models to achieve performance comparable to that of strong VLMs. Furthermore, augmenting these NTPs with linguistic NTPs, computed by feeding only the generated text back into the VLM, enhances hallucination detection performance. Finally, integrating hallucination prediction scores from VLMs into the NTP-based models led to better performance than using either VLMs or NTPs alone. We hope this study paves the way for simple, lightweight solutions that enhance the reliability of VLMs.
中文标题/摘要
标题:利用NTPs提高VLMs幻觉检测效率
视觉语言模型(VLMs)中的幻觉,即视觉内容与生成文本之间的不一致,削弱了VLMs的可靠性。一种常见的检测方法是使用相同的VLM或不同的VLM来评估生成的输出。这一过程计算密集且增加模型延迟。本文探讨了一种高效的实时幻觉检测方法,通过训练传统机器学习模型来评估VLM的下一个标记概率(NTPs)信号。NTPs直接量化了模型的不确定性。我们假设高不确定性(即低NTP值)与幻觉密切相关。为此,我们引入了一个由1,400个人标注的陈述数据集,这些陈述来自VLM生成的内容,并且每个陈述都标记为幻觉或非幻觉,用于测试我们的基于NTP的轻量级方法。结果显示,基于NTP的特征是幻觉的有价值预测器,使得快速简单的机器学习模型能够达到与强大VLM相当的性能。此外,将仅通过生成文本反馈给VLM计算的语言NTPs与NTPs结合使用,可以提高幻觉检测性能。最后,将VLM的幻觉预测分数整合到基于NTP的模型中,其性能优于单独使用VLMs或NTPs。我们希望这项研究为提高VLMs可靠性的简单轻量级解决方案铺平道路。
Summary / 总结
This paper addresses the issue of hallucinations in vision-language models (VLMs) by proposing an efficient method using next-token probabilities (NTPs) to detect hallucinations. The authors train lightweight ML models on NTP signals to identify high uncertainty, which is indicative of hallucinations. Using a dataset of 1,400 human-annotated statements, they show that NTP-based features can predict hallucinations with performance comparable to strong VLMs. Adding linguistic NTPs further improves detection, and integrating VLM hallucination scores with NTPs results in better overall performance. This method offers a fast and simple alternative to traditional VLM-based detection, enhancing the reliability of VLMs.
本文提出了一种利用下一个标记概率(NTP)进行视觉语言模型(VLM)幻觉检测的高效方法。作者通过训练传统机器学习模型来预测幻觉,表明这些特征是有效的预测器。该方法计算效率高,性能可与强大的VLM媲美。通过结合语言NTP和将VLM的幻觉预测分数集成到NTP模型中,性能得到了进一步提升。研究建议了一种简单轻量级的解决方案,以提高VLM的可靠性。
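A rough illustration of the idea in the entry above: summarize a statement's next-token probabilities into a few features, then fit a very small classifier on them. The feature set, the decision-stump classifier, and the toy data are all assumptions for illustration; the abstract only says that traditional ML models are trained on NTP-based signals.

```python
import math

def ntp_features(token_probs):
    """Summary statistics over the probabilities the VLM gave its own tokens."""
    log_probs = [math.log(p) for p in token_probs]
    return {
        "min_prob": min(token_probs),
        "mean_prob": sum(token_probs) / len(token_probs),
        "mean_logprob": sum(log_probs) / len(log_probs),
    }

def fit_stump(feature_values, labels):
    """Pick the threshold t that best separates 'hallucinated if value <= t'."""
    best_threshold, best_accuracy = None, -1.0
    for t in sorted(set(feature_values)):
        predictions = [v <= t for v in feature_values]
        accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy

# Hypothetical statements: token probabilities plus a human label
# (True = hallucinated).
statements = [
    ([0.95, 0.90, 0.10], True),
    ([0.80, 0.20, 0.70], True),
    ([0.90, 0.80, 0.85], False),
    ([0.99, 0.95, 0.90], False),
]
min_probs = [ntp_features(probs)["min_prob"] for probs, _ in statements]
labels = [label for _, label in statements]
threshold, accuracy = fit_stump(min_probs, labels)
print(threshold, accuracy)
```

The features come for free from the generation pass, which is the efficiency argument: no second VLM call is needed to score a statement.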
MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model
Authors: Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan
First: 2025-11-14T15:35:43+00:00 · Latest: 2025-11-14T15:35:43+00:00
Comments: 11 pages, 4 figures
Abstract
Multimodal Large Language Models are increasingly applied to biomedical imaging, yet scientific reasoning for microscopy remains limited by the scarcity of large-scale, high-quality training data. We introduce MicroVQA++, a three-stage, large-scale and high-quality microscopy VQA corpus derived from the BIOMEDICA archive. Stage one bootstraps supervision from expert-validated figure-caption pairs sourced from peer-reviewed articles. Stage two applies HiCQA-Graph, a novel heterogeneous graph over images, captions, and QAs that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses a Multimodal Large Language Model (MLLM) agent to generate multiple-choice questions (MCQs), followed by human screening. The resulting release comprises a large training split and a human-checked test split whose Bloom's level hard-sample distribution exceeds that of the MicroVQA benchmark. Our work delivers (i) a quality-controlled dataset that couples expert literature with graph-based filtering and human refinement; (ii) HiCQA-Graph, the first graph that jointly models (image, caption, QA) for cross-modal consistency filtering; and (iii) evidence that careful data construction enables 4B-scale MLLMs to reach microscopy reasoning performance competitive with models such as GPT-5 and to achieve state-of-the-art performance among open-source MLLMs. Code and dataset will be released after the review process concludes.
中文标题/摘要
标题:MicroVQA++:高质微观成像推理数据集及弱监督图模型用于多模态大型语言模型
多模态大型语言模型在生物医学成像中应用日益广泛,但微观成像的科学推理受限于高质量训练数据的稀缺性。我们介绍了MicroVQA++,这是一个从BIOMEDICA档案中衍生出的三阶段、大规模、高质量的微观成像问答(VQA)语料库。第一阶段从同行评审文章中专家验证的图-标题对中获取监督信息。第二阶段应用HiCQA-Graph,这是一种新颖的异构图,结合了基于NLI的文本蕴含、基于CLIP的视觉-语言对齐以及代理信号,以识别和过滤不一致样本。第三阶段使用多模态大型语言模型(MLLM)代理生成多项选择题(MCQ),随后由人类筛选。最终发布的数据集包括一个大规模训练集和一个经过人工检查的测试集,其布卢姆水平的难样本分布超过了MicroVQA基准。我们的工作提供了(i)一个质量控制的数据集,结合了专家文献与基于图的过滤和人工精炼;(ii)HiCQA-Graph,这是第一个联合建模(图像、标题、问答)以实现跨模态一致性过滤的图;(iii)证据表明,精心构建的数据使4B规模的MLLM能够达到与GPT-5相当的微观成像推理性能,并在开源MLLM中达到最佳性能。代码和数据集将在审稿过程结束后发布。
Summary / 总结
MicroVQA++ is a high-quality microscopy VQA dataset created through a three-stage process that starts with expert-validated figure-caption pairs, uses a novel HiCQA-Graph to filter inconsistent samples, and ends with human screening of multiple-choice questions generated by a multimodal large language model. Its human-checked test split has a Bloom's level hard-sample distribution exceeding that of the MicroVQA benchmark, and the results indicate that careful data construction can enable 4B-scale MLLMs to achieve competitive performance in microscopy reasoning tasks.
MicroVQA++ 是一个大规模、高质量的显微镜 VQA 数据集,旨在解决生物医学成像训练数据稀缺的问题。它包括三个阶段:从专家验证的图例-标题对中提取监督信息,使用 HiCQA-Graph 进行图模型过滤,以及通过人工筛选生成多项选择题。该数据集表明,4B 级 MLLM 可以在显微镜推理任务中达到竞争力的表现,并且在开源 MLLM 中达到最先进的性能。
Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models
Authors: Khanh-Binh Nguyen, Phuoc-Nguyen Bui, Hyunseung Choo, Duc Thanh Nguyen
First: 2025-08-11T03:03:34+00:00 · Latest: 2025-11-14T15:34:24+00:00
Comments: 12 pages, Under review
Abstract
Vision-language models (VLMs) exhibit remarkable zero-shot generalization but suffer performance degradation under distribution shifts in downstream tasks, particularly in the absence of labeled data. Test-Time Adaptation (TTA) addresses this challenge by enabling online optimization of VLMs during inference, eliminating the need for annotated data. Cache-based TTA methods exploit historical knowledge by maintaining a dynamic memory cache of low-entropy or high-confidence samples, promoting efficient adaptation to out-of-distribution data. Nevertheless, these methods face two critical challenges: (1) unreliable confidence metrics under significant distribution shifts, resulting in error accumulation within the cache and degraded adaptation performance; and (2) rigid decision boundaries that fail to accommodate substantial distributional variations, leading to suboptimal predictions. To overcome these limitations, we introduce the Adaptive Cache Enhancement (ACE) framework, which constructs a robust cache by selectively storing high-confidence or low-entropy image embeddings per class, guided by dynamic, class-specific thresholds initialized from zero-shot statistics and iteratively refined using an exponential moving average and exploration-augmented updates. This approach enables adaptive, class-wise decision boundaries, ensuring robust and accurate predictions across diverse visual distributions. Extensive experiments on 15 diverse benchmark datasets demonstrate that ACE achieves state-of-the-art performance, delivering superior robustness and generalization compared to existing TTA methods in challenging out-of-distribution scenarios.
中文标题/摘要
标题:视觉语言模型测试时自适应缓存增强
视觉语言模型(VLMs)在零样本泛化方面表现出色,但在下游任务中面对分布变化时性能会下降,尤其是在缺乏标注数据的情况下。测试时自适应(TTA)通过在推理过程中在线优化VLMs,消除了对标注数据的需求。基于缓存的TTA方法通过维护动态记忆缓存中的低熵或高置信度样本,促进对分布外数据的高效适应。然而,这些方法面临两个关键挑战:(1)在显著分布变化下不可靠的置信度度量,导致缓存中的错误累积和适应性能下降;(2)僵硬的决策边界无法适应显著的分布变化,导致次优预测。为克服这些限制,我们提出了自适应缓存增强(ACE)框架,该框架通过动态、类特定的阈值初始化和迭代优化,选择性地存储每个类的高置信度或低熵图像嵌入,这些阈值由零样本统计信息引导并使用指数移动平均和探索增强更新进行调整。这种方法允许类别的自适应决策边界,确保在各种视觉分布中实现稳健和准确的预测。在15个不同的基准数据集上的广泛实验表明,ACE在具有挑战性的分布外场景中实现了最先进的性能,相比现有TTA方法具有更好的稳健性和泛化能力。
Summary / 总结
The research aims to improve the performance of vision-language models under distribution shifts by enhancing test-time adaptation methods. The Adaptive Cache Enhancement (ACE) framework is introduced, which dynamically stores high-confidence or low-entropy image embeddings per class, guided by class-specific thresholds. Experiments show that ACE outperforms existing methods in handling out-of-distribution data, achieving superior robustness and generalization.
论文提出了一种自适应缓存增强(ACE)框架,以应对视觉-语言模型在分布变化下的性能下降问题。ACE动态地在类特定缓存中存储高置信度或低熵的图像嵌入,提高鲁棒性和泛化能力。实验表明,ACE在处理出分布数据时优于现有的测试时自适应方法。
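The class-wise thresholding in the entry above can be pictured with a small cache that keeps the most confident embeddings per class and moves each class threshold with an exponential moving average. This is a simplified sketch, not the ACE implementation: the class name, the update rule, the capacity, and the choice to update on every observed sample are assumptions, and the exploration-augmented updates mentioned in the abstract are omitted.

```python
class ClassCache:
    """Per-class cache gated by an EMA-updated confidence threshold (hypothetical)."""

    def __init__(self, init_thresholds, momentum=0.9, capacity=3):
        # init_thresholds stands in for statistics from zero-shot predictions.
        self.thresholds = dict(init_thresholds)
        self.momentum = momentum
        self.capacity = capacity
        self.items = {cls: [] for cls in init_thresholds}

    def observe(self, cls, confidence, embedding):
        """Try to cache a sample, then update the class threshold. Returns True if cached."""
        stored = False
        if confidence >= self.thresholds[cls]:
            entries = self.items[cls]
            entries.append((confidence, embedding))
            # Keep only the most confident entries for this class.
            entries.sort(key=lambda entry: entry[0], reverse=True)
            del entries[self.capacity:]
            stored = (confidence, embedding) in entries
        m = self.momentum
        self.thresholds[cls] = m * self.thresholds[cls] + (1 - m) * confidence
        return stored

cache = ClassCache({"cat": 0.5, "dog": 0.6}, momentum=0.5)
print(cache.observe("cat", 0.9, [0.1, 0.2]))  # above the initial threshold
print(cache.observe("cat", 0.6, [0.3, 0.4]))  # threshold has moved up since
print(cache.thresholds["cat"])
```

A per-class threshold lets an easy class demand higher confidence than a hard one, which a single global cutoff cannot do.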
Unifying Segment Anything in Microscopy with Vision-Language Knowledge
Authors: Manyu Li, Ruian He, Zixian Zhang, Chenxi Ma, Weimin Tan, Bo Yan
First: 2025-05-16T00:55:56+00:00 · Latest: 2025-11-14T15:03:51+00:00
Comments: 15 pages, 5 figures
Abstract
Accurate segmentation of regions of interest in biomedical images holds substantial value in image analysis. Although several foundation models for biomedical segmentation have achieved excellent performance on certain datasets, they typically demonstrate sub-optimal performance on unseen domain data. We attribute this deficiency to the lack of vision-language knowledge before segmentation. Multimodal Large Language Models (MLLMs) bring outstanding understanding and reasoning capabilities to multimodal tasks, which inspires us to leverage MLLMs to inject Vision-Language Knowledge (VLK), thereby enabling vision models to demonstrate superior generalization capabilities on cross-domain datasets. In this paper, we propose a novel framework, named uLLSAM, that uses MLLMs to guide SAM in learning microscopy cross-domain data, unifying Segment Anything in Microscopy. Specifically, we propose the Vision-Language Semantic Alignment (VLSA) module, which injects VLK into the Segment Anything Model (SAM). We find that after SAM receives global VLK prompts, its performance improves significantly, but there are deficiencies in boundary contour perception. Therefore, we further propose Semantic Boundary Regularization (SBR) to regularize SAM. Our method achieves performance improvements of 11.8% in SA across 9 in-domain microscopy datasets, achieving state-of-the-art performance. Our method also demonstrates improvements of 9.2% in SA across 10 out-of-domain datasets, exhibiting strong generalization capabilities. Code is available at https://github.com/ieellee/uLLSAM.
中文标题/摘要
标题:在显微镜下统一分割一切与视觉语言知识
在生物医学图像中准确分割感兴趣区域在图像分析中具有重大价值。尽管目前已有多种基础模型在某些数据集上实现了卓越的性能,但在未见过的数据集上通常表现不佳。我们归因于分割前缺乏视觉语言知识。多模态大型语言模型(MLLMs)为多模态任务带来了出色的理解和推理能力,这启发我们利用MLLMs注入视觉语言知识(VLK),从而使视觉模型在跨域数据集上表现出更强的泛化能力。在本文中,我们提出了一种新颖的框架,利用MLLMs引导SAM学习显微镜跨域数据,统一分割显微镜中的Segment Anything,命名为uLLSAM。具体而言,我们提出了视觉语言语义对齐(VLSA)模块,将VLK注入分割一切模型(SAM)。我们发现,在SAM接收全局VLK提示后,其性能显著提高,但在边界轮廓感知方面存在不足。因此,我们进一步提出了语义边界正则化(SBR)来正则化SAM。我们的方法在9个显微镜领域数据集上实现了11.8%的SA性能提升,达到最先进的性能。我们的方法还在10个未见过的领域数据集上实现了9.2%的SA性能提升,展示了强大的泛化能力。代码可在https://github.com/ieellee/uLLSAM获取。
Summary / 总结
This paper addresses the challenge of accurate segmentation in biomedical images by proposing a novel framework, uLLSAM, which integrates Vision-Language Knowledge (VLK) into Segment Anything Model (SAM) using a Vision-Language Semantic Alignment (VLSA) module and Semantic Boundary Regularization (SBR). The method significantly improves segmentation performance by 11.8% across 9 in-domain microscopy datasets and 9.2% across 10 out-of-domain datasets, demonstrating strong generalization capabilities.
本文提出了一种名为uLLSAM的新框架,通过Vision-Language Semantic Alignment (VLSA)模块和Semantic Boundary Regularization (SBR)将Vision-Language Knowledge (VLK)整合到Segment Anything Model (SAM)中,以解决生物医学图像准确分割的挑战。该方法在9个领域内显微镜数据集上显著提高了11.8%的分割性能,在10个跨域数据集上提高了9.2%,展示了强大的泛化能力。
Free3D: 3D Human Motion Emerges from Single-View 2D Supervision
Authors: Sheng Liu, Yuanzhi Liang, Sidan Du
First: 2025-11-14T14:49:19+00:00 · Latest: 2025-11-14T14:49:19+00:00
Abstract
Recent 3D human motion generation models demonstrate remarkable reconstruction accuracy yet struggle to generalize beyond training distributions. This limitation arises partly from the use of precise 3D supervision, which encourages models to fit fixed coordinate patterns instead of learning the essential 3D structure and motion semantic cues required for robust generalization. To overcome this limitation, we propose Free3D, a framework that synthesizes realistic 3D motions without any 3D motion annotations. Free3D introduces a Motion-Lifting Residual Quantized VAE (ML-RQ) that maps 2D motion sequences into 3D-consistent latent spaces, and a suite of 3D-free regularization objectives enforcing view consistency, orientation coherence, and physical plausibility. Trained entirely on 2D motion data, Free3D generates diverse, temporally coherent, and semantically aligned 3D motions, achieving performance comparable to or even surpassing fully 3D-supervised counterparts. These results suggest that relaxing explicit 3D supervision encourages stronger structural reasoning and generalization, offering a scalable and data-efficient paradigm for 3D motion generation.
中文标题/摘要
标题:Free3D:单视角2D监督下的3D人体动作生成
近期的3D人体动作生成模型在重建精度方面表现出色,但在泛化能力上却难以超越训练分布。这一局限部分源于使用精确的3D监督,这促使模型拟合固定的坐标模式,而不是学习对于稳健泛化至关重要的3D结构和动作语义线索。为克服这一局限,我们提出了Free3D框架,该框架无需任何3D动作标注即可合成逼真的3D动作。Free3D引入了Motion-Lifting Residual Quantized VAE (ML-RQ),将2D动作序列映射到3D一致的潜在空间,并通过一系列无需3D的正则化目标强制视图一致性、方向连贯性和物理合理性。完全基于2D动作数据训练,Free3D生成多样、时间连贯且语义对齐的3D动作,其性能与完全3D监督的模型相当甚至更优。这些结果表明,放松显式的3D监督可以促进更强的结构推理和泛化,为3D动作生成提供了一种可扩展且数据高效的范式。
Summary / 总结
The research aims to improve the generalization ability of 3D human motion generation models by reducing reliance on precise 3D supervision. Free3D uses a Motion-Lifting Residual Quantized VAE to map 2D motion sequences into 3D-consistent latent spaces and includes regularization objectives to enforce view consistency, orientation coherence, and physical plausibility. The model, trained solely on 2D data, generates diverse, temporally coherent, and semantically aligned 3D motions, matching or surpassing the performance of fully 3D-supervised models.
研究旨在通过减少对精确3D监督的依赖来提高3D人体动作生成模型的泛化能力。Free3D框架利用Motion-Lifting Residual Quantized VAE将单视角2D动作序列映射到3D一致的潜在空间,并包含正则化目标以确保视角一致性、方向连贯性和物理合理性。该模型仅使用2D数据训练,生成多样、时间连贯且语义对齐的3D动作,其性能与完全3D监督的模型相当或更优。
Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication
Authors: Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis
First: 2025-09-11T06:05:35+00:00 · Latest: 2025-11-14T14:05:04+00:00
Comments: Accepted for presentation in IEEE Globecom 2025
Abstract
Large-scale transformer models have emerged as a powerful tool for semantic communication systems, enabling edge devices to extract rich representations for robust inference across noisy wireless channels. However, their substantial computational demands remain a major barrier to practical deployment in resource-constrained 6G networks. In this paper, we present a training-free framework for adaptive token merging in pretrained vision transformers to jointly reduce inference time and transmission resource usage. We formulate the selection of per-layer merging proportions as a multi-objective optimization problem to balance accuracy and computational cost. We employ Gaussian process-based Bayesian optimization to construct a Pareto frontier of optimal configurations, enabling flexible runtime adaptation to dynamic application requirements and channel conditions. Extensive experiments demonstrate that our method consistently outperforms other baselines and achieves significant reductions in floating-point operations while maintaining competitive accuracy across a wide range of signal-to-noise ratio (SNR) conditions. Additional results highlight the effectiveness of adaptive policies that adjust merging aggressiveness in response to channel quality, providing a practical mechanism to trade off latency and semantic fidelity on demand. These findings establish a scalable and efficient approach for deploying transformer-based semantic communication in future edge intelligence systems.
中文标题/摘要
标题:边缘变换器模型在语义通信中的自适应帕累托最优标记合并
大规模变换器模型已成为语义通信系统中的强大工具,使边缘设备能够在嘈杂的无线信道中提取丰富的表示以进行稳健的推理。然而,它们巨大的计算需求仍然是在资源受限的6G网络中实际部署的主要障碍。本文提出了一种无需训练的框架,用于在预训练的视觉变换器中自适应地合并标记,以同时减少推理时间和传输资源使用。我们将每层合并比例的选择形式化为一个多目标优化问题,以平衡准确性和计算成本。我们采用基于高斯过程的贝叶斯优化来构建最优配置的帕累托前沿,从而灵活地适应动态应用程序需求和信道条件。广泛实验表明,我们的方法在各种信噪比(SNR)条件下始终优于其他基线,并在保持竞争力的同时显著减少了浮点运算。附加结果强调了自适应策略的有效性,这些策略根据信道质量调整合并的激进程度,提供了一种按需权衡延迟和语义保真度的实用机制。这些发现为在未来的边缘智能系统中部署基于变换器的语义通信奠定了可扩展和高效的方法。
Summary / 总结
This paper addresses the computational challenges of deploying large-scale transformer models in edge devices for semantic communication. It introduces a training-free adaptive token merging framework for vision transformers, optimizing both inference time and resource usage. By formulating the merging proportions as a multi-objective optimization problem and using Gaussian process-based Bayesian optimization, the method constructs a Pareto frontier of optimal configurations. Experiments show that the proposed method significantly reduces floating-point operations while maintaining competitive accuracy across various SNR conditions, and adaptive policies further enhance performance based on channel quality. This work provides a practical solution for efficient transformer-based semantic communication in 6G networks.
本文提出了一种无训练的自适应token合并框架,用于减少边缘transformer模型的推理时间和传输资源使用。它将合并比例形式化为一个多目标优化问题,并使用高斯过程基于的贝叶斯优化来找到最优配置的帕累托前沿。实验表明,该方法在各种信噪比条件下显著减少了浮点运算次数,同时保持了竞争力的准确性,并且基于信道质量调整合并激进性的自适应策略在权衡延迟和语义保真度方面是有效的。
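The Pareto-frontier step in the entry above can be shown on its own. The paper searches merging configurations with Gaussian-process Bayesian optimization; the sketch below skips the search, takes already-evaluated configurations as given, and only filters the non-dominated ones and picks one for a runtime budget. The configuration names and numbers are made up for illustration.

```python
def pareto_front(configs):
    """Keep configs not dominated in (lower cost, higher accuracy).

    configs: list of (name, gflops, accuracy) tuples.
    """
    front = []
    for name, cost, acc in configs:
        dominated = any(
            other_cost <= cost and other_acc >= acc
            and (other_cost < cost or other_acc > acc)
            for _, other_cost, other_acc in configs
        )
        if not dominated:
            front.append((name, cost, acc))
    return sorted(front, key=lambda cfg: cfg[1])

def pick_for_budget(front, max_cost):
    """Most accurate frontier config whose cost fits the budget, or None."""
    feasible = [cfg for cfg in front if cfg[1] <= max_cost]
    return max(feasible, key=lambda cfg: cfg[2]) if feasible else None

# Hypothetical per-layer merging configurations: (name, GFLOPs, accuracy).
candidates = [
    ("no-merge", 4.0, 0.80),
    ("light", 3.0, 0.78),
    ("uneven", 3.5, 0.75),   # dominated by "light"
    ("aggressive", 2.0, 0.70),
]
front = pareto_front(candidates)
print(front)
print(pick_for_budget(front, max_cost=3.2))
```

Once the frontier is stored, adapting at runtime is a lookup: a worse channel or tighter latency target simply lowers `max_cost`.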
DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
Authors: Tanveer Hannan, Dimitrios Mallios, Parth Pathak, Faegheh Sardari, Thomas Seidl, Gedas Bertasius, Mohsen Fayyaz, Sunando Sengupta
First: 2025-11-14T13:56:39+00:00 · Latest: 2025-11-14T13:56:39+00:00
Abstract
Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code is available in the supplementary material.
中文标题/摘要
标题:DocSLM:一种用于长多模态文档理解的小型视觉-语言模型
大型视觉-语言模型(LVLMs)在长且复杂的文档上展示了强大的多模态推理能力。然而,它们较高的内存占用使其在资源受限的边缘设备上部署不切实际。我们提出了DocSLM,这是一种为受限内存资源设计的高效小型视觉-语言模型,用于长文档理解。DocSLM 包含一个分层多模态压缩器,能够将每页的视觉、文本和布局信息联合编码为固定长度的序列,大幅减少内存消耗同时保留局部和全局语义。为了实现对任意长输入的可扩展处理,我们引入了一种流式弃权机制,该机制按文档段顺序操作,并使用基于熵的不确定性校准器过滤低置信度响应。在多个长多模态文档基准测试中,DocSLM 在使用 82% 更少的视觉标记、75% 更少的参数和 71% 更低的延迟的同时,达到了或超过了最先进的方法,实现了轻量级边缘设备上的可靠多模态文档理解。代码可在附录中获取。
Summary / 总结
DocSLM is a small vision-language model designed for efficient long-document understanding on resource-constrained devices. It uses a Hierarchical Multimodal Compressor to encode visual, textual, and layout information into a fixed-length sequence, reducing memory usage. Additionally, DocSLM employs a Streaming Abstention mechanism to handle long inputs by processing them in segments and filtering low-confidence responses. The model matches or surpasses state-of-the-art methods with 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, making it suitable for edge devices.
DocSLM 是一种小型视觉语言模型,旨在资源受限的设备上高效理解长文档。它使用层次多模态压缩器将视觉、文本和布局信息编码为固定长度的序列,减少内存使用。DocSLM 还采用流式弃权机制按段处理长输入,并使用基于熵的不确定性校准器过滤低置信度响应。该模型在视觉令牌、参数和延迟方面显著减少,同时在多个长多模态文档基准测试中匹配或超越了最先进的方法,展示了在边缘设备上的可靠性能。
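A minimal sketch of entropy-based abstention over streamed segments, in the spirit of the mechanism described in the entry above. The per-segment answer distributions, the fixed entropy cutoff, and the rule of keeping the lowest-entropy answer are assumptions; the abstract does not specify how the calibrator or the final aggregation work.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def stream_answer(segment_outputs, max_entropy):
    """Process segments in order; abstain on uncertain ones.

    segment_outputs: iterable of (answer, probs) pairs, where probs is the
    model's distribution over candidate answers for that segment.
    Returns the most confident retained answer, or None if all abstain.
    """
    best = None
    for answer, probs in segment_outputs:
        h = entropy(probs)
        if h > max_entropy:
            continue  # abstain: this segment likely lacks the evidence
        if best is None or h < best[1]:
            best = (answer, h)
    return best[0] if best else None

# Hypothetical outputs for three document segments.
segments = [
    ("unknown", [0.40, 0.30, 0.30]),
    ("42", [0.95, 0.03, 0.02]),
    ("41", [0.50, 0.45, 0.05]),
]
print(stream_answer(segments, max_entropy=0.5))
```

Only one segment is held in memory at a time, which is what makes the approach workable for arbitrarily long documents.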
NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation
Authors: Max Gandyra, Alessandro Santonicola, Michael Beetz
Venue: CVPR 2026
First: 2025-07-02T08:23:14+00:00 · Latest: 2025-11-14T13:41:06+00:00
Comments: 9 pages, 3 figures, 5 tables, CVPR 2026 preprint
Abstract
Instance segmentation of novel object instances in RGB images, given some example images for each object, is a well-known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re-)training has proven to be a difficult task. To handle this, we present a new training-free framework called Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). NOCTIS integrates two pre-trained models: Grounded-SAM 2 for object proposals with precise bounding boxes and corresponding segmentation masks; and DINOv2 for robust class and patch embeddings, due to its zero-shot capabilities. Internally, the proposal-object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings, with a new cyclic thresholding (CT) mechanism that mitigates unstable matches caused by repetitive textures or visually similar patterns. Beyond CT, NOCTIS introduces: (i) an appearance score that is unaffected by object selection bias; (ii) the usage of the average confidence of the proposals' bounding box and mask as a scoring component; and (iii) an RGB-only pipeline that performs even better than RGB-D ones. We empirically show that NOCTIS, without further training/fine-tuning, outperforms the best RGB and RGB-D methods regarding the mean AP score on the seven core datasets of the BOP 2023 challenge for the "Model-based 2D segmentation of unseen objects" task.
中文标题/摘要
标题:NOCTIS:新颖对象循环阈值基于实例分割
给定每种对象的一些示例图像,在RGB图像中进行新颖对象实例分割是一个在计算机视觉中广为人知的问题。设计一个适用于所有类型新颖对象的通用模型而不需重新训练,证明是一个困难的任务。为此,我们提出了一种新的无需训练框架,称为:新颖对象循环阈值基于实例分割(NOCTIS)。NOCTIS 结合了两个预训练模型:Grounded-SAM 2 用于生成具有精确边界框和相应分割掩码的对象提案;以及 DINOv2 用于稳健的类别和补丁嵌入,由于其零样本能力。内部,提案-对象匹配通过基于类别嵌入的相似性和补丁嵌入的平均最大相似性来确定对象匹配得分,采用新的循环阈值(CT)机制来缓解由重复纹理或视觉相似模式引起的不稳定匹配。除了CT,NOCTIS 引入了:(i)不受对象选择偏差影响的外观得分;(ii)使用提案边界框和掩码的平均置信度作为评分组件;(iii)仅使用RGB的管道,其性能甚至优于RGB-D管道。我们实验证明,NOCTIS 在BOP 2023挑战赛七个核心数据集的“基于模型的未见对象2D分割”任务中,无需进一步训练/微调,其平均AP得分优于最佳RGB和RGB-D方法。
Summary / 总结
NOCTIS is a training-free framework for instance segmentation of novel objects in RGB images. It leverages Grounded-SAM 2 for precise object proposals and DINOv2 for robust class and patch embeddings. NOCTIS introduces a cyclic thresholding mechanism to mitigate unstable matches, an appearance score, and the average confidence of each proposal's bounding box and mask as a scoring component. Experiments show that NOCTIS outperforms existing RGB and RGB-D methods on the BOP 2023 challenge for unseen object segmentation.
NOCTIS 是一个无需训练的框架,用于 RGB 图像中新型物体的实例分割,结合了 Grounded-SAM 进行对象提案和 DINOv2 进行稳健嵌入。它使用循环阈值机制来匹配提案与对象,并引入了外观得分和置信度得分以提高匹配稳定性。NOCTIS 在 BOP 2023 挑战赛中未进一步训练的情况下,优于其他 RGB 和 RGB-D 方法,特别是在未见物体的 2D 分割任务上表现出色。
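The "average maximum similarity of the patch embeddings" named in the entry above can be written down directly: for each proposal patch, take its best cosine similarity against the reference patches, then average. The sketch below shows only that term plus class-embedding similarity; the cyclic thresholding step is not reproduced because the abstract does not define it, and the equal-weight combination in `match_score` is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def avg_max_similarity(proposal_patches, reference_patches):
    """Mean over proposal patches of the best match among reference patches."""
    best_per_patch = [
        max(cosine(p, r) for r in reference_patches) for p in proposal_patches
    ]
    return sum(best_per_patch) / len(best_per_patch)

def match_score(proposal_cls, reference_cls, proposal_patches, reference_patches):
    """Hypothetical combination: plain mean of the class and patch terms."""
    class_term = cosine(proposal_cls, reference_cls)
    patch_term = avg_max_similarity(proposal_patches, reference_patches)
    return (class_term + patch_term) / 2

# Toy 2-D embeddings standing in for DINOv2 features (hypothetical values).
proposal = [[1.0, 0.0], [0.0, 1.0]]
reference = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
print(avg_max_similarity(proposal, reference))
print(match_score([1.0, 0.0], [1.0, 0.0], proposal, reference))
```

Taking the maximum per patch makes the score tolerant of pose changes, but it also explains why repetitive textures can produce unstable matches, which is the problem the paper's thresholding targets.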
EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment
Authors: Ruoxi Cheng, Haoxuan Ma, Teng Ma, Hongyi Zhang
First: 2025-11-14T13:38:13+00:00 · Latest: 2025-11-14T13:38:13+00:00
Abstract
Large Vision-Language Models (LVLMs) exhibit powerful reasoning capabilities but suffer from sophisticated jailbreak vulnerabilities. Fundamentally, aligning LVLMs is not just a safety challenge but a problem of economic efficiency. Current alignment methods struggle with the trade-off between safety, utility, and operational costs. Critically, a focus solely on final outputs (process-blindness) wastes significant computational budget on unsafe deliberation. This flaw allows harmful reasoning to be disguised with benign justifications, thereby circumventing simple additive safety scores. To address this, we propose EcoAlign, an inference-time framework that reframes alignment as an economically rational search by treating the LVLM as a boundedly rational agent. EcoAlign incrementally expands a thought graph and scores actions using a forward-looking function (analogous to net present value) that dynamically weighs expected safety, utility, and cost against the remaining budget. To prevent deception, path safety is enforced via the weakest-link principle. Extensive experiments across 3 closed-source and 2 open-source models on 6 datasets show that EcoAlign matches or surpasses state-of-the-art safety and utility at a lower computational cost, thereby offering a principled, economical pathway to robust LVLM alignment.
中文标题/摘要
标题:EcoAlign:一种用于高效大型视觉-语言模型(LVLM)对齐的经济理性框架
大型视觉-语言模型(LVLMs)表现出强大的推理能力,但遭受复杂的脱缰漏洞。从根本上说,对LVLMs进行对齐不仅是安全挑战,还是经济效率的问题。当前的对齐方法在安全、效用和运营成本之间难以权衡。关键的是,仅关注最终输出(过程盲视)会浪费大量的计算预算在不安全的推理上。这一缺陷允许有害推理被善意的解释所掩盖,从而绕过简单的加性安全评分。为了解决这一问题,我们提出了EcoAlign,一种推理时框架,将对齐重新定义为一种经济理性的搜索,将LVLM视为一个有界理性代理。EcoAlign逐步扩展思维图,并使用前瞻函数(类似于净现值)来动态衡量预期的安全性、效用和成本与剩余预算之间的权衡。为了防止欺骗,路径安全性通过最弱环节原则来强制执行。在3个闭源和2个开源模型上的6个数据集上的广泛实验表明,EcoAlign在较低的计算成本下达到了或超过了最先进的安全性和效用,从而提供了一种原则性的、经济的路径来实现稳健的LVLM对齐。
Summary / 总结
EcoAlign is an inference-time framework that addresses the economic efficiency challenge in aligning LVLMs by treating them as boundedly rational agents. It incrementally expands a thought graph and scores actions using a forward-looking function that dynamically weighs expected safety, utility, and cost against the remaining budget. Experiments show that EcoAlign matches or surpasses state-of-the-art safety and utility while reducing computational costs.
EcoAlign 是一种在推理时处理 LVLM 对齐经济效率挑战的框架,将其视为有边界理性的代理。该框架逐步扩展思维图,并基于一个前瞻性的函数对行动进行评分,该函数平衡了预期的安全性、效用和成本。实验表明,EcoAlign 在较低的计算成本下实现了或超过了现有方法的安全性和效用。
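The weakest-link principle in the entry above is easy to contrast with an additive score: a reasoning path is only as safe as its least safe step, so benign steps cannot average away a harmful one. The per-step safety values below are hypothetical, and the sketch covers only this one component, not the forward-looking scoring function.

```python
def additive_safety(step_scores):
    """Average per-step safety; a single unsafe step can be masked."""
    return sum(step_scores) / len(step_scores)

def weakest_link_safety(step_scores):
    """Path safety is the minimum per-step safety."""
    return min(step_scores)

# Hypothetical reasoning path: one harmful step wrapped in benign ones.
path = [0.90, 0.95, 0.10, 0.90]
print(additive_safety(path))
print(weakest_link_safety(path))
```

Under the additive score this path looks acceptable, while the minimum exposes the unsafe step directly.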
Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation
Authors: Yihao Zhang, Yuankai Qi, Xi Zheng
First: 2025-11-14T13:35:30+00:00 · Latest: 2025-11-14T13:35:30+00:00
Abstract
Foundation models applied in robotics, particularly Vision-Language-Action (VLA) models, hold great promise for achieving general-purpose manipulation. Yet, systematic real-world evaluations and cross-model comparisons remain scarce. This paper reports our empirical experiences from benchmarking four representative VLAs -- ACT, OpenVLA-OFT, RDT-1B, and π_0 -- across four manipulation tasks conducted in both simulation and on the ALOHA Mobile platform. We establish a standardized evaluation framework that measures performance along three key dimensions: (1) accuracy and efficiency (success rate and time-to-success), (2) adaptability across in-distribution, spatial out-of-distribution, and instance-plus-spatial out-of-distribution settings, and (3) language instruction-following accuracy. Through this process, we observe that π_0 demonstrates superior adaptability in out-of-distribution scenarios, while ACT provides the highest stability in-distribution. Further analysis highlights differences in computational demands, data-scaling behavior, and recurring failure modes such as near-miss grasps, premature releases, and long-horizon state drift. These findings reveal practical trade-offs among VLA model architectures in balancing precision, generalization, and deployment cost, offering actionable insights for selecting and deploying VLAs in real-world robotic manipulation tasks.
Summary / 总结
This paper reports empirical experiences from benchmarking four Vision-Language-Action (VLA) models -- ACT, OpenVLA-OFT, RDT-1B, and π_0 -- across four manipulation tasks in simulation and on the ALOHA Mobile platform. The study establishes a standardized evaluation framework focusing on accuracy and efficiency, adaptability, and language instruction-following accuracy. Key findings include π_0's superior adaptability in out-of-distribution scenarios and ACT's highest in-distribution stability, with insights into computational demands and recurring failure modes.
本文报告了对四个Vision-Language-Action (VLA) 模型(ACT、OpenVLA-OFT、RDT-1B 和 π_0)在仿真和ALOHA Mobile平台上的四种操作任务中的基准测试经验。评估框架衡量了准确性与效率、不同分布设置下的适应性以及语言指令跟随准确性。关键发现包括π_0在分布外场景中的优越适应性,以及ACT在分布内场景中的最高稳定性,同时还揭示了计算需求、数据扩展行为以及常见的失败模式,如抓取险些成功(near-miss grasps)、过早释放和长时程状态漂移等。
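The two efficiency metrics named in the abstract (success rate and time-to-success) can be computed from per-trial logs roughly as follows. This is a minimal sketch, not the authors' evaluation code: the function name, the log layout, and every number below are hypothetical, and averaging time-to-success over successful trials only is an assumption.

```python
from statistics import mean

def summarize_trials(trials):
    """trials: list of (succeeded, seconds) pairs for one model, task, and setting."""
    if not trials:
        raise ValueError("need at least one trial")
    success_times = [seconds for ok, seconds in trials if ok]
    success_rate = len(success_times) / len(trials)
    # Assumption: time-to-success is averaged over successful trials only.
    mean_time_to_success = mean(success_times) if success_times else None
    return success_rate, mean_time_to_success

# Hypothetical trial logs grouped by evaluation setting (values are made up).
logs = {
    "in_distribution": [(True, 12.0), (True, 14.0), (False, 30.0), (True, 16.0)],
    "spatial_ood": [(True, 20.0), (False, 30.0), (False, 30.0), (False, 30.0)],
}
report = {setting: summarize_trials(trials) for setting, trials in logs.items()}
```

Reporting the pair per setting keeps the in-distribution and out-of-distribution results separate, which is what the adaptability dimension compares.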
SimuFreeMark: A Noise-Simulation-Free Robust Watermarking Against Image Editing
Authors: Yichao Tang, Mingyang Li, Di Miao, Sheng Li, Zhenxing Qian, Xinpeng Zhang
First: 2025-11-14T13:30:43+00:00 · Latest: 2025-11-14T13:30:43+00:00
Abstract
The advancement of artificial intelligence generated content (AIGC) has created a pressing need for robust image watermarking that can withstand both conventional signal processing and novel semantic editing attacks. Current deep learning-based methods rely on training with hand-crafted noise simulation layers, which inherently limit their generalization to unforeseen distortions. In this work, we propose SimuFreeMark, a noise-simulation-free watermarking framework that circumvents this limitation by exploiting the inherent stability of image low-frequency components. We first systematically establish that low-frequency components exhibit significant robustness against a wide range of attacks. Building on this foundation, SimuFreeMark embeds watermarks directly into the deep feature space of the low-frequency components, leveraging a pre-trained variational autoencoder (VAE) to bind the watermark with structurally stable image representations. This design completely eliminates the need for noise simulation during training. Extensive experiments demonstrate that SimuFreeMark outperforms state-of-the-art methods across a wide range of conventional and semantic attacks, while maintaining superior visual quality.
中文标题/摘要
标题:SimuFreeMark:一种无需噪声模拟、可抵御图像编辑的鲁棒水印方法
人工智能生成内容(AIGC)的发展迫切需要一种能够抵御传统信号处理和新型语义编辑攻击的鲁棒图像水印技术。当前基于深度学习的方法依赖于手工构建的噪声模拟层进行训练,这从根本上限制了它们对未预见失真的泛化能力。在本文中,我们提出了SimuFreeMark,一种无需噪声模拟(noise-simulation-free)的水印框架,通过利用图像低频分量的固有稳定性来克服这一限制。我们首先系统地证明了低频分量在广泛攻击下具有显著的鲁棒性。在此基础上,SimuFreeMark 直接将水印嵌入低频分量的深度特征空间中,利用预训练的变分自编码器(VAE)将水印与结构稳定的图像表示绑定在一起。这种设计完全消除了训练过程中对噪声模拟的需求。广泛的实验表明,SimuFreeMark 在多种传统和语义攻击下均优于现有最佳方法,同时保持了更高的视觉质量。
Summary / 总结
SimuFreeMark is a noise-simulation-free watermarking framework designed to protect images against both conventional signal processing and novel semantic editing attacks. It exploits the inherent stability of image low-frequency components, embedding watermarks directly into their deep feature space and using a pre-trained VAE to bind the watermark to structurally stable image representations. Experimental results show that SimuFreeMark outperforms state-of-the-art methods across a wide range of conventional and semantic attacks while maintaining superior visual quality.
论文针对传统和新型语义编辑攻击下的图像水印鲁棒性问题,提出了一种无需噪声模拟的SimuFreeMark水印框架,利用图像低频分量的固有稳定性。通过使用预训练的VAE将水印直接嵌入低频分量的深层特征空间,SimuFreeMark在多种攻击下表现出色,同时保持了良好的视觉质量。
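The claim that low-frequency components are stable under distortion can be illustrated with a toy example. This sketch is only an intuition aid under stated assumptions: block averaging stands in for low-frequency extraction (the abstract does not say which decomposition the paper uses), the checkerboard perturbation stands in for a high-frequency attack, and no VAE or watermark embedding is modeled.

```python
def block_average(img, k=2):
    """Crude low-pass filter: mean of each non-overlapping k x k block.

    Assumes both image dimensions are divisible by k.
    """
    height, width = len(img), len(img[0])
    return [
        [
            sum(img[r + dr][c + dc] for dr in range(k) for dc in range(k)) / (k * k)
            for c in range(0, width, k)
        ]
        for r in range(0, height, k)
    ]

def add_checkerboard(img, eps):
    """High-frequency perturbation: +eps and -eps alternating from pixel to pixel."""
    return [
        [v + (eps if (r + c) % 2 == 0 else -eps) for c, v in enumerate(row)]
        for r, row in enumerate(img)
    ]

# Made-up 4 x 4 grayscale image.
image = [
    [10.0, 12.0, 50.0, 52.0],
    [14.0, 16.0, 54.0, 56.0],
    [90.0, 92.0, 30.0, 32.0],
    [94.0, 96.0, 34.0, 36.0],
]
attacked = add_checkerboard(image, eps=8.0)
low_clean = block_average(image)
low_attacked = block_average(attacked)
```

Every pixel of `attacked` differs from `image`, yet the two block-averaged versions are identical, because each 2 x 2 block receives two +eps and two -eps offsets that cancel in the mean.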
GraphPilot: Grounded Scene Graph Conditioning for Language-Based Autonomous Driving
Authors: Fabian Schmidt, Markus Enzweiler, Abhinav Valada
First: 2025-11-14T12:57:39+00:00 · Latest: 2025-11-14T12:57:39+00:00
Abstract
Vision-language models have recently emerged as promising planners for autonomous driving, where success hinges on topology-aware reasoning over spatial structure and dynamic interactions from multimodal input. However, existing models are typically trained without supervision that explicitly encodes these relational dependencies, limiting their ability to infer how agents and other traffic entities influence one another from raw sensor data. In this work, we bridge this gap with a novel model-agnostic method that conditions language-based driving models on structured relational context in the form of traffic scene graphs. We serialize scene graphs at various abstraction levels and formats, and incorporate them into the models via structured prompt templates, enabling a systematic analysis of when and how relational supervision is most beneficial. Extensive evaluations on the public LangAuto benchmark show that scene graph conditioning of state-of-the-art approaches yields large and persistent improvement in driving performance. Notably, we observe up to a 15.6% increase in driving score for LMDrive and 17.5% for BEVDriver, indicating that models can better internalize and ground relational priors through scene graph-conditioned training, even without requiring scene graph input at test-time. Code, fine-tuned models, and our scene graph dataset are publicly available at https://github.com/iis-esslingen/GraphPilot.
中文标题/摘要
标题:GraphPilot:面向基于语言的自动驾驶的接地场景图条件化
视觉-语言模型最近成为自动驾驶中颇具潜力的规划器,其成功依赖于对多模态输入中的空间结构和动态交互进行拓扑感知推理。然而,现有模型通常在没有明确编码这些关系依赖性的监督下进行训练,限制了它们从原始传感器数据中推断交通参与者和其他交通实体如何相互影响的能力。在本文中,我们通过一种新颖的模型无关方法弥合了这一差距,该方法以交通场景图形式的结构化关系上下文作为基于语言的驾驶模型的条件。我们以不同抽象级别和格式序列化场景图,并通过结构化提示模板将它们整合到模型中,从而系统地分析关系监督在何时以及如何最有益。在公开的LangAuto基准上的广泛评估表明,对最先进方法进行场景图条件化可以显著且持续地提高驾驶性能。值得注意的是,我们观察到LMDrive的驾驶分数最高提高了15.6%,BEVDriver最高提高了17.5%,表明即使在测试时不需要场景图输入,模型也可以通过场景图条件化训练更好地内化和接地关系先验。代码、微调模型和我们的场景图数据集可在https://github.com/iis-esslingen/GraphPilot上公开获取。
Summary / 总结
The research aims to enhance language-based autonomous driving models by incorporating relational context from traffic scene graphs. The method involves serializing scene graphs at different levels and integrating them into existing models using structured prompt templates. Key findings show significant improvements in driving performance, with up to 15.6% and 17.5% increases in scores for LMDrive and BEVDriver, respectively, indicating better internalization of relational priors through scene graph-conditioned training.
研究旨在通过引入交通场景图中的关系上下文来提升基于语言的自动驾驶模型。方法包括在不同抽象层次上序列化场景图,并通过结构化提示模板将其整合到现有模型中。在公开的LangAuto基准测试上的实验结果表明,这带来了显著的改进,LMDrive和BEVDriver的驾驶得分分别最高提高了15.6%和17.5%,表明通过场景图条件化训练,模型能够更好地内化和接地关系先验,即使在测试时不使用场景图输入。
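Serializing a scene graph into a structured prompt, as the abstract describes, might look roughly like the sketch below. The node and edge schema, the arrow notation, the `detailed` switch (standing in for different abstraction levels), and the prompt template are all hypothetical; the released GraphPilot code may use entirely different formats.

```python
def serialize_scene_graph(nodes, edges, detailed=True):
    """Turn a traffic scene graph into text lines for a prompt.

    nodes: {node_id: {"type": str, ...other attributes}}
    edges: list of (source_id, relation, target_id) triples
    detailed=False drops node attributes, giving a coarser abstraction level.
    """
    lines = []
    for node_id, attrs in nodes.items():
        desc = f"{node_id}: {attrs['type']}"
        extra = {k: v for k, v in attrs.items() if k != "type"}
        if detailed and extra:
            desc += " (" + ", ".join(f"{k}={v}" for k, v in extra.items()) + ")"
        lines.append(desc)
    for src, rel, dst in edges:
        lines.append(f"{src} -[{rel}]-> {dst}")
    return "\n".join(lines)

# Hypothetical prompt template and scene content.
PROMPT_TEMPLATE = "Scene graph:\n{graph}\n\nInstruction: {instruction}"

nodes = {
    "ego": {"type": "vehicle", "speed_mps": 8.0},
    "ped_1": {"type": "pedestrian", "distance_m": 12.5},
}
edges = [("ped_1", "crossing_in_front_of", "ego")]
prompt = PROMPT_TEMPLATE.format(
    graph=serialize_scene_graph(nodes, edges),
    instruction="Turn right at the next intersection.",
)
```

Because the graph enters only through the prompt text, the same serializer can be applied to any language-based driving model, which is one way to read the abstract's "model-agnostic" claim.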
Discovering Meaningful Units with Visually Grounded Semantics from Image Captions
Authors: Melika Behjati, James Henderson
First: 2025-11-14T12:56:18+00:00 · Latest: 2025-11-14T12:56:18+00:00
Abstract
Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning to group the tokens, the vision-language model has a better fine-grained understanding of vision and language. In addition, the token groups that our model discovers are highly similar to groundable phrases in text, both qualitatively and quantitatively.
Summary / 总结
The research aims to enhance the fine-grained understanding of vision-language models by focusing on the meaningful units in image captions. The method involves grouping caption tokens to capture detailed language representations aligned with object-level image features. The key findings show that this approach improves the model's fine-grained understanding of both vision and language, with the discovered token groups closely resembling groundable phrases in text.
本文旨在通过关注图像字幕中有意义的词元组,而不是单独的图像块或词元,来提高视觉语言模型的细粒度理解。该模型将字幕中的词元分组,以捕获细粒度的语言表示,并将其与用于发现对象的图像编码器输出的对象级表示对齐。结果显示,该模型实现了更好的细粒度理解,且模型发现的词元组与文本中的可接地短语(groundable phrases)在定性和定量上都高度相似。
CountSteer: Steering Attention for Object Counting in Diffusion Models
Authors: Hyemin Boo, Hyoryung Kim, Myungjin Lee, Seunghyeon Lee, Jiyoung Lee, Jang-Hwan Choi, Hyunsoo Cho
Venue: AAAI 2026 Workshop (RSD)
First: 2025-11-14T12:52:11+00:00 · Latest: 2025-11-14T12:52:11+00:00
Comments: Accepted to AAAI 2026 Workshop on Shaping Responsible Synthetic Data in the Era of Foundation Models (RSD)
Abstract
Text-to-image diffusion models generate realistic and coherent images but often fail to follow numerical instructions in text, revealing a gap between language and visual representation. Interestingly, we found that these models are not entirely blind to numbers: they are implicitly aware of their own counting accuracy, as their internal signals shift in consistent ways depending on whether the output meets the specified count. This observation suggests that the model already encodes a latent notion of numerical correctness, which can be harnessed to guide generation more precisely. Building on this intuition, we introduce CountSteer, a training-free method that improves generation of specified object counts by steering the model's cross-attention hidden states during inference. In our experiments, CountSteer improved object-count accuracy by about 4% without compromising visual quality, demonstrating a simple yet effective step toward more controllable and semantically reliable text-to-image generation.
中文标题/摘要
标题:CountSteer:在扩散模型中引导对象计数的注意力方法
文本到图像的扩散模型能够生成逼真且连贯的图像,但往往无法遵循文本中的数字指令,揭示了语言与视觉表示之间的差距。有趣的是,我们发现这些模型并非完全无视数字——它们在输出是否符合指定计数的情况下,内部信号会以一致的方式发生变化,表明模型已经编码了一种潜在的数值正确性概念,可以利用这一概念来更精确地引导生成。基于这一直觉,我们提出了一种无需训练的方法CountSteer,在推理过程中引导模型的交叉注意力隐藏状态,以提高指定对象计数的生成效果。在我们的实验中,CountSteer 在不牺牲视觉质量的情况下,将对象计数的准确性提高了约 4%,展示了更可控且语义可靠的文本到图像生成的一个简单而有效的步骤。
Summary / 总结
CountSteer is a training-free method that enhances the accuracy of object counts in text-to-image generation by steering the model's cross-attention hidden states during inference. This approach leverages the model's implicit awareness of its own counting accuracy to improve precision without sacrificing visual quality. Experiments show that CountSteer increases object-count accuracy by about 4%.
CountSteer 是一种无需训练的方法,通过在推理过程中引导模型的交叉注意力隐藏状态来提高文本到图像生成中指定对象数量的准确性。该方法利用模型对其自身计数准确性的隐含意识来更精确地引导生成过程。实验表明,CountSteer 可以将对象计数的准确性提高约 4%,同时不损害图像质量,表明这是一种简单而有效的提高文本到图像生成可控性和语义可靠性的方法。
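One common way to steer hidden states at inference time is to add a difference-of-means direction, sketched below on toy vectors. This is an assumption for illustration only: the abstract says that cross-attention hidden states are steered but does not state how the steering direction is obtained, and the function names, the `alpha` scale, and all numbers here are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_direction(states_correct, states_incorrect):
    """Difference of mean hidden states: count-correct minus count-incorrect generations."""
    mu_ok = mean_vector(states_correct)
    mu_bad = mean_vector(states_incorrect)
    return [a - b for a, b in zip(mu_ok, mu_bad)]

def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along the steering direction; no weights are updated."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Made-up 3-dimensional hidden states.
correct = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
incorrect = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
direction = steering_direction(correct, incorrect)
steered = steer([0.5, 0.5, 0.5], direction, alpha=0.5)
```

The third coordinate is identical in both groups, so the direction leaves it untouched; only the coordinates that separate correct from incorrect outputs are shifted.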
Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs
Authors: Jitesh Chavan, Rohit Lal, Anand Kamat, Mengjia Xu
First: 2025-11-14T12:44:02+00:00 · Latest: 2025-11-14T12:44:02+00:00
Abstract
State-space models (SSMs), Mamba in particular, are increasingly adopted for long-context sequence modeling, providing linear-time aggregation via an input-dependent, causal selective-scan operation. Along this line, recent "Mamba-for-vision" variants largely explore multiple scan orders to relax strict causality for non-sequential signals (e.g., images). Rather than preserving cross-block memory, the conventional formulation of the selective-scan operation in Mamba reinitializes each block's state-space dynamics from zero, discarding the terminal state-space representation (SSR) from the previous block. Arcee, a cross-block recurrent state chain, reuses each block's terminal state-space representation as the initial condition for the next block. Handoff across blocks is constructed as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. Key to practicality, Arcee is compatible with all prior "vision-mamba" variants, parameter-free, and incurs constant, negligible cost. As a modeling perspective, we view terminal SSR as a mild directional prior induced by a causal pass over the input, rather than an estimator of the non-sequential signal itself. To quantify the impact, for unconditional generation on CelebA-HQ (256×256) with Flow Matching, Arcee reduces FID (lower is better) from 82.81 to 15.33 (5.4× lower) on a single scan-order Zigzag Mamba baseline. Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.
中文标题/摘要
标题:Arcee:基于Mamba SSM的生成式视觉建模可微循环状态链
状态空间模型(SSMs),特别是Mamba,越来越多地被用于长上下文序列建模,通过输入依赖的、因果的选择性扫描操作提供线性时间聚合。沿着这一思路,最近的“Mamba-for-vision”变体主要探索多种扫描顺序,以放松非序列信号(例如图像)上的严格因果性约束。Mamba中选择性扫描操作的常规形式并不保留跨块记忆,而是从零重新初始化每个块的状态空间动力学,丢弃前一个块的终端状态空间表示(SSR)。Arcee是一种跨块循环状态链,它将每个块的终端状态空间表示复用为下一个块的初始条件。跨块的传递被构建为一个可微边界映射,其雅可比矩阵使梯度能够跨终端边界进行端到端流动。在实用性方面,Arcee与所有先前的“vision-mamba”变体兼容,无参数,并且仅引入恒定且可忽略的开销。从建模视角看,我们将终端SSR视为由对输入的因果扫描所引入的轻微方向先验,而不是非序列信号本身的估计器。为了量化其影响,在使用Flow Matching的CelebA-HQ(256×256)无条件生成任务上,Arcee将单扫描顺序Zigzag Mamba基线的FID从82.81降低到15.33(降低5.4倍)。高效的CUDA内核和训练代码将被发布,以支持严谨且可复现的研究。
Summary / 总结
Arcee is a novel approach that extends Mamba state-space models for generative vision tasks by reusing the terminal state-space representation (SSR) from one block as the initial condition for the next block, enabling a cross-block recurrent state chain. This method facilitates end-to-end gradient flow through a differentiable boundary map. On CelebA-HQ unconditional generation, Arcee significantly reduces the FID score from 82.81 to 15.33, demonstrating its effectiveness in improving model performance with minimal computational overhead.
Arcee 是一种跨块递归状态链,它将一个块的终端状态空间表示作为下一个块的初始条件,通过一个可微边界映射实现端到端的梯度流动。这种方法在 CelebA-HQ 上的无条件生成中显著提高了性能,将 FID 从 82.81 降低到 15.33,相比 Zigzag Mamba 基线降低了 5.4 倍。Arcee 与现有 Mamba 变体兼容,无参数,并且具有可忽略的计算开销。
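The difference between resetting each block's state to zero and chaining terminal states can be seen on a toy scalar recurrence. This is a sketch under heavy assumptions, not Arcee itself: real Mamba blocks use input-dependent, multi-dimensional selective scans, the blocks are assumed here to be stacked in depth, and the boundary map is taken to be the identity.

```python
def scan(xs, a, b, h0=0.0):
    """Scalar linear recurrence h_t = a * h_{t-1} + b * x_t.

    Returns the per-step states and the terminal state.
    """
    h = h0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(h)
    return ys, h

def run_blocks(xs, params, chain):
    """Apply a stack of scan blocks to a sequence.

    chain=False: every block starts from a zero state (conventional reset).
    chain=True: every block starts from the previous block's terminal state.
    """
    seq, carry = list(xs), 0.0
    for a, b in params:
        seq, terminal = scan(seq, a, b, h0=carry if chain else 0.0)
        carry = terminal  # identity boundary map (assumption)
    return seq

# Hypothetical fixed parameters (a, b) for two blocks.
params = [(0.5, 1.0), (0.5, 1.0)]
baseline = run_blocks([1.0, 2.0], params, chain=False)
chained = run_blocks([1.0, 2.0], params, chain=True)
```

Chaining adds no parameters: the only change is the initial condition of each scan, which matches the abstract's description of the method as parameter-free.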
Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression
Authors: Zhongbin Guo, Jiahe Liu, Yushan Li, Wenyu Gao, Zhen Yang, Chenzhi Li, Xinyue Zhang, Ping Jian
First: 2025-11-14T12:42:07+00:00 · Latest: 2025-11-14T12:42:07+00:00
Abstract
Existing Vision Language Models (VLMs), architecturally rooted in "flatland" perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual bottleneck: an input-stage conflict between computationally exorbitant geometry-aware encoders and superficial 2D-only features, and an output-stage misalignment in which discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM) that acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an "Embedding-as-Value" paradigm that routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.
中文标题/摘要
标题:超越平地:通过解耦三维推理与数值回归解锁空间智能
现有的视觉语言模型(VLMs)在架构上根植于“平地”感知,从根本上难以理解现实世界的三维空间智能。这种失败源于双重瓶颈:输入阶段计算成本高昂的几何感知编码器与浅层的纯二维特征之间的冲突,以及输出阶段的不匹配,即离散分词器在结构上无法生成精确的连续数值。为打破这一僵局,我们引入了GEODE(几何输出和解耦输入引擎),这是一种新型架构,通过解耦三维推理与数值生成来解决这一双重瓶颈。GEODE通过两个专门的即插即用模块增强主VLM:解耦推理模块(DRM),它充当空间协处理器,通过交叉注意力将显式的三维数据与二维视觉特征对齐,并将空间链式思维(CoT)逻辑提炼成可注入的推理令牌;以及直接回归头(DRH),这是一种“嵌入即值”范式,将专门的控制令牌路由到轻量级MLP中,以实现对标量和三维边界框的精确连续回归。这些模块的协同作用使我们的15亿参数(1.5B)模型能够作为高级语义调度器运行,实现可与70亿以上参数(7B+)模型相媲美的最先进空间推理性能。
Summary / 总结
The paper addresses the limitation of existing Vision Language Models (VLMs) in handling 3D spatial intelligence, which stems from a dual bottleneck at the input and output stages. It proposes GEODE, a novel architecture that decouples 3D reasoning from numerical generation by introducing a Decoupled Rationale Module and a Direct Regression Head. With only 1.5B parameters, the model achieves state-of-the-art spatial reasoning performance that rivals 7B+ models.
论文针对现有视觉语言模型因输入和输出阶段的双重瓶颈而难以处理3D空间智能的问题,提出了GEODE,一种通过引入解耦推理模块和直接回归头来解耦3D推理和数值生成的新架构。该模型以1.5B参数实现了最先进的空间推理性能,可与7B以上的模型相媲美。
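The "Embedding-as-Value" idea, in which a control token's hidden state is sent to a small MLP instead of being decoded as digit tokens, can be sketched as follows. Everything concrete here is hypothetical: the `<REG>` token name, the two-dimensional hidden states, and the hand-picked (untrained) weights are illustration only and are not taken from GEODE.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def regress_from_control_token(tokens, hidden_states, control_token, mlp):
    """Locate the control token and map its hidden state to continuous values.

    mlp is (w1, b1, w2, b2) for a two-layer perceptron with a ReLU in between.
    """
    idx = tokens.index(control_token)  # raises ValueError if the token is absent
    w1, b1, w2, b2 = mlp
    return linear(relu(linear(hidden_states[idx], w1, b1)), w2, b2)

# Made-up token sequence and hidden states; only the control token's state matters.
tokens = ["the", "chair", "is", "<REG>", "m", "away"]
hidden = [[1.0, 2.0] if t == "<REG>" else [0.0, 0.0] for t in tokens]
mlp = (
    [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0],  # layer 1: 2 inputs -> 2 hidden units
    [[0.5, 4.0]], [0.25],                   # layer 2: 2 hidden units -> 1 scalar
)
distance = regress_from_control_token(tokens, hidden, "<REG>", mlp)
```

The output is a float produced directly by the head, so its precision is not limited by the tokenizer's vocabulary, which is the output-stage bottleneck the abstract describes.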
TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models
Authors: Chenghao Liu, Jiachen Zhang, Chengxuan Li, Zhimu Zhou, Shixin Wu, Songfang Huang, Huiling Duan
Venue: AAAI 2026
First: 2025-08-15T12:03:34+00:00 · Latest: 2025-11-14T12:35:36+00:00
Comments: Accepted to AAAI 2026. Camera-ready version
Abstract
Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent improvements: 4.0 percentage points average on LIBERO (72.4% vs 68.4% baseline), cross-environment validation on SimplerEnv (4.8% relative improvement), and 8.7% relative improvement on real robot tasks. Our approach proves model-agnostic, working across OpenVLA and VLA-Cache architectures. Notably, TTF reveals that selective Query matrix reuse in attention mechanisms enhances rather than compromises performance, suggesting promising directions for direct KQV matrix reuse strategies that achieve computational acceleration while improving task success rates.
中文标题/摘要
标题:TTF-VLA:通过像素-注意力融合实现视觉-语言-动作模型的时间令牌融合
视觉-语言-动作(VLA)模型在每个时间步独立处理视觉输入,丢弃了机器人操作任务中固有的宝贵时间信息。这种逐帧处理方式使模型容易受到视觉噪声的影响,同时忽略了操作序列中连续帧之间的高度一致性。我们提出了时间令牌融合(TTF),这是一种无需训练的方法,通过智能地整合历史和当前的视觉表示来增强VLA推理质量。我们的方法采用双维度检测,结合高效的灰度像素差异分析和基于注意力的语义相关性评估,通过硬融合策略和关键帧锚定来实现选择性的时间令牌融合,以防止错误累积。在LIBERO、SimplerEnv和真实机器人任务中的全面实验表明,我们的方法带来了一致的改进:在LIBERO上平均提高了4.0个百分点(72.4% vs 68.4%基线),在SimplerEnv上的跨环境验证中相对提高了4.8%,在真实机器人任务中相对提高了8.7%。我们的方法具有模型无关性,适用于OpenVLA和VLA-Cache架构。值得注意的是,TTF表明在注意力机制中选择性地重用查询矩阵可以提高而不是削弱性能,这表明直接的KQV矩阵重用策略有望在实现计算加速的同时提高任务成功率。
Summary / 总结
The research aims to improve Vision-Language-Action (VLA) models by integrating temporal information, which is usually discarded in frame-by-frame processing. The proposed Temporal Token Fusion (TTF) method is training-free and uses dual-dimension detection, combining grayscale pixel difference analysis with attention-based semantic relevance assessment, to selectively fuse historical and current visual representations. Experiments show consistent improvements: an average increase of 4.0 percentage points on LIBERO (72.4% vs 68.4% baseline), a 4.8% relative improvement on SimplerEnv, and an 8.7% relative improvement on real robot tasks. The approach is model-agnostic, working across OpenVLA and VLA-Cache architectures, and its results suggest that selective Query matrix reuse can bring computational acceleration together with higher task success rates.
研究针对视觉-语言-动作模型中时间信息丢失的问题,这些模型逐帧处理视觉输入,导致对视觉噪声的脆弱性和对连续帧间一致性忽视。提出了一种无需训练的方法——时间令牌融合(TTF),将历史和当前的视觉表示进行整合以提高推理质量。实验结果显示,在LIBERO、SimplerEnv和真实机器人任务中的一致改进,包括LIBERO平均提高4.0个百分点,SimplerEnv相对提高4.8%,真实机器人任务相对提高8.7%。该方法适用于不同的VLA架构,并展示了选择性查询矩阵重用在注意力机制中的潜力,这不仅能实现计算加速,还能提高任务成功率。
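A hard, per-patch fusion rule with keyframe anchoring could be sketched as below. The decision rule (reuse the historical token only when a patch is both visually static and weakly attended), the thresholds, the keyframe interval, and the function name are assumptions made for illustration; the abstract names the ingredients but does not specify how they are combined.

```python
def fuse_tokens(prev_tokens, curr_tokens, pixel_diff, attn_score, frame_idx,
                diff_thresh=0.05, attn_thresh=0.5, keyframe_interval=4):
    """Hard per-patch fusion of historical and current visual tokens.

    A historical token is reused only where the patch is static (small grayscale
    difference) and weakly attended; otherwise the current token is taken.
    """
    # Keyframe anchoring: periodically take only current tokens so that
    # stale tokens cannot accumulate errors indefinitely.
    if prev_tokens is None or frame_idx % keyframe_interval == 0:
        return list(curr_tokens)
    fused = []
    for prev, curr, diff, score in zip(prev_tokens, curr_tokens, pixel_diff, attn_score):
        reuse = diff < diff_thresh and score < attn_thresh
        fused.append(prev if reuse else curr)
    return fused

# Made-up tokens and per-patch signals for three patches.
prev = ["p0", "p1", "p2"]
curr = ["c0", "c1", "c2"]
pixel_diff = [0.0, 0.2, 0.0]  # mean absolute grayscale change per patch
attn_score = [0.1, 0.1, 0.9]  # attention-based relevance per patch
fused = fuse_tokens(prev, curr, pixel_diff, attn_score, frame_idx=1)
keyframe = fuse_tokens(prev, curr, pixel_diff, attn_score, frame_idx=4)
```

Patch 1 changed visibly and patch 2 is strongly attended, so both take the current token; only the static, weakly attended patch 0 reuses its historical token, and on the keyframe all three are refreshed.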