arXiv Paper Express

Snapshot: 20260307_0347

Accelerating Text-to-Video Generation with Calibrated Sparse Attention

Authors: Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar

First: 2026-03-05T18:59:32+00:00 · Latest: 2026-03-05T18:59:32+00:00

Abstract

Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.

中文标题/摘要

标题：校准稀疏注意加速文本到视频生成

最近的扩散模型能够生成高质量的视频，但运行速度较慢。这些模型中使用的大型基于变压器的骨干网络由于时空注意机制而成为瓶颈。在本文中，我们发现许多词元到词元的连接在各种输入中始终产生微不足道的分数，并且它们的模式在查询之间经常重复。因此，在这些情况下可以跳过注意计算，对结果影响甚微。这一观察结果同样适用于局部词元块之间的连接。受此启发，我们引入了CalibAtt，这是一种无需训练的方法，通过校准稀疏注意来加速视频生成。CalibAtt 进行了一次离线校准过程，以识别在输入之间稳定的块级稀疏性和重复模式，并将这些模式编译为每层、每个头和每个扩散时间步的优化注意操作。在推理时，我们密集地计算选定的输入相关连接，并以硬件高效的方式跳过未选择的连接。在Wan 2.1 14B、Mochi 1和不同分辨率下的少量步骤蒸馏模型上进行的广泛实验表明，CalibAtt 可以实现高达1.58倍的端到端加速，同时优于现有无需训练的方法，保持视频生成质量和文本-视频对齐。

Summary / 总结

This paper targets the slow runtime of diffusion models for text-to-video generation with CalibAtt, a training-free method built on calibrated sparse attention. An offline calibration pass identifies block-level sparsity and repetition patterns that are stable across inputs, and at inference the negligible token-to-token connections are skipped. Experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions report up to 1.58x end-to-end speedup while maintaining video quality and text-video alignment.
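
The calibration idea can be illustrated with a minimal sketch. It assumes attention maps are available as plain nested lists, handles a single layer, head, and timestep, and uses a hypothetical threshold rule (a block is skipped only if its average mass stays below the threshold on every calibration input). The function names are illustrative and not taken from the paper, and the repetition patterns mentioned in the abstract are not modeled here.

```python
def calibrate_block_mask(attn_maps, block, threshold):
    """Offline pass: keep a block if any calibration input gives it real mass."""
    n = len(attn_maps[0])
    nb = n // block
    keep = [[False] * nb for _ in range(nb)]
    for attn in attn_maps:
        for bi in range(nb):
            for bj in range(nb):
                rows = range(bi * block, (bi + 1) * block)
                cols = range(bj * block, (bj + 1) * block)
                # Average attention mass a query in block bi sends to key block bj.
                mass = sum(attn[q][k] for q in rows for k in cols) / block
                if mass >= threshold:
                    keep[bi][bj] = True
    return keep


def skipped_fraction(keep):
    """Share of blocks that an inference kernel would not compute."""
    flat = [kept for row in keep for kept in row]
    return flat.count(False) / len(flat)


# Two toy 4x4 attention maps (rows sum to 1) standing in for calibration inputs.
calib_maps = [
    [[0.5, 0.5, 0.0, 0.0],
     [0.4, 0.6, 0.0, 0.0],
     [0.01, 0.01, 0.49, 0.49],
     [0.0, 0.0, 0.5, 0.5]],
    [[0.6, 0.4, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [0.0, 0.02, 0.48, 0.5],
     [0.01, 0.0, 0.5, 0.49]],
]

mask = calibrate_block_mask(calib_maps, block=2, threshold=0.05)
print(mask)
print(skipped_fraction(mask))
```

On this toy input only the two diagonal blocks survive, so half of the block-level attention work would be skipped. A real system would store one such mask per layer, head, and diffusion timestep.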

HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token

Authors: Sai Akhil Kogilathota, Sripadha Vallabha E G, Luzhe Sun, Jiawei Zhou

Venue: The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)

First: 2026-03-05T18:36:31+00:00 · Latest: 2026-03-05T18:36:31+00:00

Abstract

Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.

中文标题/摘要

标题：HALP：无需生成单个词元即可检测视觉语言模型中的幻觉

幻觉仍然是视觉语言模型（VLMs）的一个持续性挑战，它们经常描述不存在的对象或编造事实。现有的检测方法通常在文本生成之后进行操作，这使得干预既昂贵又不及时。我们研究了是否可以在生成任何词元之前通过探测模型的内部表示来预测幻觉风险，而只需一次前向传递。在一系列视觉语言任务和八个现代VLMs（包括Llama-3.2-Vision、Gemma-3、Phi-4-VL和Qwen2.5-VL）中，我们检查了三种内部表示家族：（i）仅视觉特征而不进行多模态融合，（ii）文本解码器中的视觉词元表示，以及（iii）在生成之前整合视觉和文本信息的查询词元表示。基于这些表示训练的探测器在无需解码的情况下实现了强大的幻觉检测性能，达到Gemma-3-12B、Phi-4-VL 5.6B和Molmo 7B上的0.93 AUROC。大多数模型中，后期查询词元状态最具预测性，而在少数架构中（例如，使用仅视觉特征的Qwen2.5-VL-7B达到约0.79 AUROC），视觉或中间层特征占主导地位。这些结果表明：（1）幻觉风险可以在生成之前检测到，（2）最具信息量的层和模态在不同架构中有所不同，（3）轻量级探测器有可能实现早期回避、选择性路由和自适应解码，以提高安全性和效率。

Summary / 总结

This work predicts hallucination risk in vision-language models (VLMs) before any token is generated, by probing the model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, the probes reach up to 0.93 AUROC. Late query-token states are the most predictive for most models, while visual-only or mid-layer features dominate in a few architectures. The results indicate that hallucination risk is detectable pre-generation, that the most informative layer and modality vary across architectures, and that lightweight probes could enable early abstention, selective routing, and adaptive decoding.
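
As a rough illustration of what probing and AUROC evaluation involve, the sketch below scores toy hidden states with a logistic probe and computes AUROC by pairwise comparison. The hidden states, probe weights, and labels are made-up values, and the probe form is an assumption; the abstract does not specify the probe architecture.

```python
import math


def probe_score(hidden, weights, bias):
    """Logistic probe: maps one hidden-state vector to a hallucination risk in (0, 1)."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical 2-d "query-token states"; label 1 means the later answer hallucinated.
hidden_states = [[2.0, 0.5], [1.5, 0.0], [0.2, 0.1], [0.6, -0.2], [-1.0, 0.3], [-2.0, 0.0]]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = [1.0, 0.0], 0.0  # placeholder probe parameters, not learned here

scores = [probe_score(h, weights, bias) for h in hidden_states]
print(round(auroc(scores, labels), 3))
```

Because the score is available before decoding starts, a threshold on it could in principle gate abstention or routing without paying for generation.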

Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

Authors: Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan

Venue: ICLR 2026

First: 2026-03-05T18:25:26+00:00 · Latest: 2026-03-05T18:25:26+00:00

Comments: Accepted at ICLR 2026

Abstract

Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance': committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.

中文标题/摘要

标题：超越零散接受：通过最长稳定前缀实现DLMs的快速和连贯推理

扩散语言模型（DLMs）承诺实现高度并行的文本生成，但其实用推理速度往往受限于次优解码调度器。标准方法依赖于“零散接受”——在序列中不连续位置上提交高置信度的标记。这种方法无意中破坏了键值（KV）缓存，破坏了内存局部性，并迫使模型在不稳定的标记边界上进行昂贵的重复修复。为了解决这个问题，我们提出了最长稳定前缀（LSP）调度器，这是一种基于单一前缀吸收的无训练和模型无关的推理范式。在每次去噪步骤中，LSP 通过单向传递评估标记的稳定性，动态识别一个连续的左对齐的稳定预测块，并在自然语言或结构分隔符之前将其边界锁定，然后进行原子提交。这种前缀优先的拓扑结构带来了双重好处：系统上，它将碎片化的KV缓存更新转换为高效的连续追加；算法上，它保留了对几何缩小的活动后缀的双向前瞻，大幅减少了标记翻转率和去噪器调用次数。在LLaDA-8B和Dream-7B上的广泛评估表明，LSP 在包括数学推理、代码生成、多语言（CJK）任务和创意写作在内的严格基准测试中将推理加速了高达3.4倍，同时保持或略微提高了输出质量。通过从根本上重新结构化提交拓扑，LSP 桥接了DLMs的理论并行性和实际硬件效率之间的差距。

Summary / 总结

The paper attributes slow Diffusion Language Model (DLM) inference to decoding schedulers that commit high-confidence tokens at disjoint positions, which fragments the Key-Value (KV) cache and forces costly repeated repairs. It introduces the Longest Stable Prefix (LSP) scheduler, which evaluates token stability in a single forward pass, identifies a contiguous left-aligned block of stable predictions, snaps its boundary to a linguistic or structural delimiter, and commits the block atomically. On LLaDA-8B and Dream-7B, LSP accelerates inference by up to 3.4x across the evaluated tasks while matching or slightly improving output quality.
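
The prefix-selection step described above can be sketched as follows. This is a simplified reading, not the authors' scheduler: stability is approximated by a per-token confidence threshold, the boundary is snapped backward to the last delimiter inside the stable run, and the raw run is committed when it contains no delimiter. All three choices are assumptions made for illustration.

```python
def longest_stable_prefix(tokens, confidences, threshold, delimiters):
    """Return how many leading tokens to commit in this denoising step."""
    # Length of the contiguous, left-aligned run of confident predictions.
    run = 0
    for conf in confidences:
        if conf < threshold:
            break
        run += 1
    # Snap the boundary back to the last delimiter inside that run.
    cut = run
    while cut > 0 and tokens[cut - 1] not in delimiters:
        cut -= 1
    # Assumed fallback: with no delimiter in the run, commit the raw run.
    return cut if cut > 0 else run


tokens = ["The", "answer", "is", "42", ".", "So", "we", "stop", "."]
confidences = [0.99, 0.97, 0.95, 0.96, 0.93, 0.91, 0.55, 0.98, 0.99]

n_commit = longest_stable_prefix(tokens, confidences, 0.9, {".", ",", "\n"})
print(tokens[:n_commit])
```

Here the confident run covers six tokens, but the commit stops after the first full stop, so five tokens are appended to the prefix in one contiguous step. The confident tokens at positions 7 and 8 are left uncommitted, which is the difference from scattered acceptance.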

RelaxFlow: Text-Driven Amodal 3D Generation

Authors: Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao

First: 2026-03-05T17:45:47+00:00 · Latest: 2026-03-05T17:45:47+00:00

Comments: Code: https://github.com/viridityzhu/RelaxFlow

Abstract

Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.

中文标题/摘要

标题：RelaxFlow：文本驱动的无遮挡3D生成

从图像到3D的生成面临着在遮挡下固有的语义模糊性，仅凭部分观察往往不足以确定物体类别。在本文中，我们形式化了文本驱动的无遮挡3D生成，其中文本提示引导未见区域的完成，同时严格保留输入观察。关键的是，我们发现这些目标需要不同的控制粒度：对观察进行刚性控制，而对提示进行放松的结构控制。为此，我们提出了RelaxFlow，这是一种无需训练的双分支框架，通过多先验一致性模块和放松机制解耦控制粒度。理论上，我们证明了我们的放松等同于在生成向量场中应用低通滤波器，这抑制了高频实例细节以隔离几何结构，以适应观察。为了便于评估，我们引入了两个诊断基准，ExtremeOcc-3D和AmbiSem-3D。广泛的实验表明，RelaxFlow成功地引导了未见区域的生成以匹配提示意图，而不牺牲视觉保真度。

Summary / 总结

The paper formalizes text-driven amodal 3D generation: given a partially occluded observation, a text prompt steers the completion of unseen regions while the observed parts are strictly preserved. It introduces RelaxFlow, a training-free dual-branch framework that decouples control granularity through a Multi-Prior Consensus Module and a Relaxation Mechanism. The theoretical analysis shows that the relaxation is equivalent to a low-pass filter on the generative vector field, which suppresses high-frequency instance details and isolates geometric structure. Experiments show that RelaxFlow steers unseen regions toward the prompt intent without compromising visual fidelity.

ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking

Authors: Sijia Chen, Zihan Zhou, Yanqiu Yu, En Yu, Wenbing Tao

First: 2026-03-05T17:15:01+00:00 · Latest: 2026-03-05T17:15:01+00:00

Comments: https://github.com/chen-si-jia/ORMOT

Abstract

Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.

中文标题/摘要

标题：ORMOT：全景描述多目标跟踪的数据集和框架

多目标跟踪（MOT）是计算机视觉中的一个基本任务，旨在跨视频帧跟踪目标。现有的MOT方法在一般视觉场景中表现良好，但在扩展到视觉语言设置时面临重大挑战和限制。为了解决这一差距，最近提出了描述多目标跟踪（RMOT）任务，旨在跟踪与语言描述对应的物体。然而，当前的RMOT方法主要是在由传统相机拍摄的数据集上开发的，这些数据集存在视野有限的限制。这种限制往往导致目标移出画面，从而导致跟踪片段化并丢失上下文信息。在本文中，我们提出了一项新的任务，称为全景描述多目标跟踪（ORMOT），该任务将RMOT扩展到全景图像，旨在克服传统数据集的视野限制，并提高模型理解长时语言描述的能力。为了推进ORMOT任务，我们构建了ORSet，一个全景描述多目标跟踪数据集，包含27个多样化的全景场景、848个语言描述和3,401个标注物体，提供了丰富的视觉、时间和语言信息。此外，我们提出了ORTrack，一种针对全景描述多目标跟踪的大型视觉-语言模型（LVLM）驱动框架。在ORSet数据集上的广泛实验表明，我们的ORTrack框架是有效的。数据集和代码将在https://github.com/chen-si-jia/ORMOT开源。

Summary / 总结

The paper proposes Omnidirectional Referring Multi-Object Tracking (ORMOT), a task that extends Referring Multi-Object Tracking (RMOT) to omnidirectional imagery in order to overcome the limited field of view of conventional camera datasets. The authors construct ORSet, a dataset with 27 omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, and develop ORTrack, a framework driven by a Large Vision-Language Model. Experiments on ORSet demonstrate the effectiveness of ORTrack.

OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

Authors: Esteban Padilla, Boyang Sun, Marc Pollefeys, Hermann Blum

First: 2026-03-05T17:02:22+00:00 · Latest: 2026-03-05T17:02:22+00:00

Abstract

Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select navigation frontiers as semantic anchors and propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D mapping, policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.

Summary / 总结

The research targets robot navigation in complex everyday environments under flexible task requirements. Navigation is formulated as a sparse subgoal identification and reaching problem, with navigation frontiers serving as semantic anchors for vision-language priors. The resulting training-free framework, OpenFrontier, navigates efficiently without dense 3D mapping, policy training, or model fine-tuning, and shows strong zero-shot performance across multiple benchmarks as well as effective real-world deployment on a mobile robot.
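
For readers unfamiliar with frontiers, the sketch below uses the classical grid-based definition: a frontier is a known-free cell that borders unexplored space. This is only an illustration of the concept. The abstract does not state how OpenFrontier represents frontiers (it explicitly avoids dense 3D mapping), and the semantic scores here are placeholder numbers standing in for a vision-language prior.

```python
FREE, OCCUPIED, UNKNOWN = 0, 1, -1


def frontier_cells(grid):
    """Free cells with at least one 4-connected unknown neighbour."""
    rows, cols = len(grid), len(grid[0])
    found = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] == UNKNOWN:
                    found.append((r, c))
                    break
    return found


def pick_subgoal(frontiers, semantic_score):
    """Choose the frontier that the (hypothetical) language-grounded prior rates highest."""
    return max(frontiers, key=lambda cell: semantic_score[cell])


grid = [
    [FREE, FREE, UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE, FREE],
]

frontiers = frontier_cells(grid)
# Placeholder relevance of each frontier to the instruction, e.g. "find the kitchen".
scores = {(0, 1): 0.2, (2, 2): 0.7}
print(frontiers, pick_subgoal(frontiers, scores))
```

Treating each frontier as a sparse subgoal keeps the decision space small: the prior only has to rank a handful of candidates rather than produce low-level actions.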

Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models

Authors: Haidong Kang, Jun Du, Lihong Lin

First: 2025-12-08T10:52:55+00:00 · Latest: 2026-03-05T16:57:43+00:00

Abstract

Mixed-Precision Quantization (MPQ) liberates Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck and has garnered increasing research attention. However, conventional methods either rely on costly differentiable optimization search, which is neither efficient nor flexible, or learn a quantized DNN from a proxy (e.g., HAWQ) manually designed by human experts, which is labor-intensive and requires extensive expert knowledge. Can we design a proxy without involving any human experts or training? In this paper, we provide an affirmative answer by proposing a novel Large Language Model (LLM)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework. It reforms the design paradigm of MPQ by utilizing LLMs and evolutionary search strategies to automatically find superior TAP tailored for MPQ. In addition, to bridge the gap between black-box LLMs and the challenging MPQ task, we introduce a lightweight Direct Preference Optimization (DPO)-based strategy controller that dynamically reweights the selection probabilities of the three prompt templates for evolutionary search strategies according to fitness signals, without fine-tuning the LLM. This forms a task-aware feedback loop that improves proxy generation across evolutions. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we believe that our TAP will significantly contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.

中文标题/摘要

标题：革新混合精度量化：通过大型语言模型实现无训练自动代理发现

混合精度量化（MPQ）使深度神经网络（DNNs）摆脱了内存不足（OOM）的瓶颈，并引起了越来越多的研究关注。然而，传统方法要么依赖于昂贵的可微优化搜索，这既不高效也不灵活，要么从人类专家人工设计的代理（例如HAWQ）中学习量化DNN，这既耗时又需要大量专家知识。我们能否设计一个无需任何人类专家或训练的代理？在本文中，我们通过提出一种新颖的大型语言模型（LLM）驱动的无训练自动代理（简称TAP）发现框架，给出了肯定的答案。该框架通过利用LLM和进化搜索策略，自动发现适用于MPQ的优质TAP，改革了MPQ的设计范式。此外，为了弥合黑盒LLM与挑战性的MPQ任务之间的差距，我们引入了一种轻量级的直接偏好优化（DPO）为基础的策略控制器，根据适应度信号动态调整进化搜索策略中三种提示模板的选择概率，无需微调LLM。这形成了一种任务感知的反馈循环，提高了代理生成的性能。在主流基准上的广泛实验表明，TAP达到了最先进的性能。最后，我们认为，我们的TAP将通过提供一种LLM驱动设计算法的新视角，对MPQ社区产生重大贡献。

Summary / 总结

This paper asks whether a proxy for Mixed-Precision Quantization (MPQ) can be designed without human experts or training. It introduces TAP, a framework that uses a Large Language Model (LLM) and evolutionary search strategies to discover proxies automatically. A lightweight Direct Preference Optimization (DPO)-based strategy controller reweights the selection probabilities of the three prompt templates according to fitness signals, without fine-tuning the LLM, which forms a task-aware feedback loop. Experiments on mainstream benchmarks show that TAP achieves state-of-the-art performance.

Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers

Authors: Guandong Li

First: 2026-03-05T15:58:06+00:00 · Latest: 2026-03-05T15:58:06+00:00

Abstract

Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time, depth, and feature dimensions. In this work, we identify three orthogonal axes of non-uniformity in DiT denoising: (1) temporal: sensitivity to caching errors varies dramatically across the denoising trajectory; (2) depth: consecutive caching decisions lead to cascading approximation errors; and (3) feature: different components of the hidden state exhibit heterogeneous temporal dynamics. Based on these observations, we propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) by 16% in speed while maintaining comparable quality (LPIPS difference < 1%). Our approach is training-free, plug-and-play, and compatible with existing DiT architectures.

中文标题/摘要

标题：基于频率感知的误差受限缓存加速扩散变换器

扩散变换器（DiTs）已成为高质量图像和视频生成的主要架构，但其迭代去噪过程在推理时会带来巨大的计算成本。现有的缓存方法通过在时间步之间重用中间计算来加速DiTs，但它们都存在一个共同的局限性：将去噪过程视为在时间、深度和特征维度上均匀的。在这项工作中，我们识别了DiT去噪中的三个正交非均匀轴：（1）时间轴——缓存误差对去噪轨迹的敏感性差异巨大；（2）深度轴——连续的缓存决策会导致级联的近似误差；（3）特征轴——隐藏状态的不同组成部分表现出异质的时间动态。基于这些观察，我们提出了SpectralCache，这是一种统一的缓存框架，包括时间步感知动态调度（TADS）、累积误差预算（CEB）和频率分解缓存（FDC）。在FLUX.1-schnell，512x512分辨率下，SpectralCache实现了2.46倍的加速，LPIPS为0.217，SSIM为0.727，与TeaCache（2.12倍，LPIPS为0.215，SSIM为0.734）相比，在速度上提高了16%，同时保持了相当的质量（LPIPS差异<1%）。我们的方法是无需训练的、即插即用的，并且与现有的DiT架构兼容。

Summary / 总结

This work addresses the inference cost of Diffusion Transformers (DiTs) with SpectralCache, a training-free caching framework that accounts for non-uniformity in the denoising process along the temporal, depth, and feature axes. It combines Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves a 2.46x speedup with LPIPS 0.217 and SSIM 0.727, about 16% faster than TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) at comparable quality.
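
The headline comparison can be re-derived from the reported numbers. The short check below assumes that "16% in speed" is the ratio of the two speedups and that "LPIPS difference < 1%" is meant as a relative difference; the abstract does not state either convention explicitly.

```python
# Figures as reported for FLUX.1-schnell at 512x512.
spectral_cache = {"speedup": 2.46, "lpips": 0.217, "ssim": 0.727}
tea_cache = {"speedup": 2.12, "lpips": 0.215, "ssim": 0.734}


def relative_change(new, old):
    """Signed change of `new` with respect to `old`, as a fraction of `old`."""
    return (new - old) / old


speed_gain = relative_change(spectral_cache["speedup"], tea_cache["speedup"])
lpips_gap = abs(relative_change(spectral_cache["lpips"], tea_cache["lpips"]))
ssim_gap = abs(relative_change(spectral_cache["ssim"], tea_cache["ssim"]))

print(f"speed gain over TeaCache: {speed_gain:.1%}")
print(f"relative LPIPS gap: {lpips_gap:.2%}")
print(f"relative SSIM gap: {ssim_gap:.2%}")
```

Under these assumptions the speed gain comes out at roughly 16% and both quality gaps stay just under 1%, which matches the claims as stated.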

FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding

Authors: Janghoon Cho, Jungsoo Lee, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi

Venue: ICLR 2026

First: 2025-10-31T17:29:39+00:00 · Latest: 2026-03-05T15:43:07+00:00

Comments: Accepted to ICLR 2026

Abstract

Recent studies in long video understanding have harnessed the advanced visual-language reasoning capabilities of Large Multimodal Models (LMMs), driving the evolution of video-LMMs specialized for processing extended video sequences. However, the scalability of these models is severely limited by the overwhelming volume of visual tokens generated from extended video sequences. To address this challenge, we propose FLoC, an efficient visual token compression framework based on the facility location function, a principled approach that selects a compact yet highly representative and diverse subset of visual tokens within a predefined token budget. By integrating the lazy greedy algorithm, our method performs this selection swiftly, drastically reducing the number of visual tokens while guaranteeing near-optimal performance. Notably, our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution that seamlessly integrates with diverse video-LMMs and existing workflows. Extensive evaluations on large-scale benchmarks, such as Video-MME, MLVU, LongVideoBench, and EgoSchema, show that our framework consistently surpasses recent compression techniques, highlighting its effectiveness and robustness in addressing the challenges of long video understanding as well as its processing efficiency.

中文标题/摘要

标题：FLoC：基于设施位置的高效视觉标记压缩框架以实现长视频理解

近期关于长视频理解的研究利用了大型多模态模型（LMMs）的先进视觉-语言推理能力，推动了专门用于处理扩展视频序列的视频-LMMs的发展。然而，这些模型的可扩展性受到从扩展视频序列生成的大量视觉标记的限制。为了解决这一挑战，我们提出了FLoC，一种基于设施位置函数的高效视觉标记压缩框架，这是一种原理性的方法，能够迅速选择在预定义的视觉标记数量预算内具有高度代表性且多样化的紧凑子集。通过集成懒惰贪婪算法，我们的方法通过迅速选择紧凑的标记子集实现了显著的效率提升，大幅减少了视觉标记的数量，同时保证了接近最优的性能。值得注意的是，我们的方法是无需训练的、模型无关的、查询无关的，提供了一种灵活的解决方案，能够无缝集成到各种视频-LLMs和现有工作流程中。在Video-MME、MLVU、LongVideoBench和EgoSchema等大规模基准上的广泛评估表明，我们的框架在压缩技术方面始终优于近期的技术，突显了其在解决长视频理解挑战方面的有效性、鲁棒性以及处理效率。

Summary / 总结

The paper addresses the scalability of Large Multimodal Models (LMMs) in long video understanding with FLoC, a visual token compression framework based on the facility location function. Using the lazy greedy algorithm, FLoC selects a compact, representative, and diverse subset of visual tokens within a fixed budget, reducing the token count while keeping near-optimal performance. The method is training-free, model-agnostic, and query-agnostic, so it integrates readily with different video-LMMs. Results on Video-MME, MLVU, LongVideoBench, and EgoSchema show that FLoC outperforms recent compression techniques in both effectiveness and processing efficiency.
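
A minimal sketch of facility-location selection with lazy greedy evaluation follows. It assumes a precomputed, non-negative token similarity matrix and the standard objective f(S) = sum over tokens i of the maximum similarity between i and any selected token. How FLoC builds its similarities and any further engineering are not described in the abstract, so the matrix and function name here are illustrative.

```python
import heapq


def facility_location_lazy_greedy(sim, budget):
    """Pick up to `budget` tokens that best cover all tokens under sim[i][j] >= 0."""
    n = len(sim)
    best = [0.0] * n  # how well each token is covered by the current selection

    def gain(j):
        # Marginal increase of the objective if token j were added.
        return sum(max(sim[i][j] - best[i], 0.0) for i in range(n))

    # Max-heap via negated gains; the third field records when the gain was computed.
    heap = [(-gain(j), j, 0) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < budget:
        _, j, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Fresh gain on top of the heap: submodularity makes it the true maximum.
            selected.append(j)
            for i in range(n):
                best[i] = max(best[i], sim[i][j])
        else:
            # Stale upper bound: recompute and push back instead of rescanning everything.
            heapq.heappush(heap, (-gain(j), j, len(selected)))
    return selected


# Five toy tokens: {0, 1} are near-duplicates, {2, 3, 4} form a second cluster.
sim = [
    [1.0, 0.9, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8, 0.7],
    [0.1, 0.1, 0.8, 1.0, 0.8],
    [0.1, 0.1, 0.7, 0.8, 1.0],
]

selected = facility_location_lazy_greedy(sim, budget=2)
print(selected)
```

With a budget of two, the selection takes the most central token of the larger cluster first and then one token from the other cluster, rather than two redundant neighbours. The lazy variant relies on marginal gains only shrinking as the selection grows, so most candidates never need re-evaluation.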

Pursuing Minimal Sufficiency in Spatial Reasoning

Authors: Yejie Guo, Yunzhong Hou, Wufei Ma, Meng Tang, Ming-Hsuan Yang

First: 2025-10-19T02:29:09+00:00 · Latest: 2026-03-05T14:41:14+00:00

Abstract

Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from expert models. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.

中文标题/摘要

标题：在空间推理中追求最小充分性

空间推理，即在三维理解基础上将语言接地的能力，仍然是视觉-语言模型（VLMs）的一个持续性挑战。我们识别出两个根本瓶颈：源于二维中心预训练的不充分三维理解能力，以及由冗余三维信息引发的推理失败。为解决这些问题，我们首先在回答给定问题之前构建一个最小充分集（MSS）：从专家模型中提取的紧凑三维感知结果的选择。我们引入了MSSR（最小充分空间推理器），这是一种双智能体框架，实现了这一原则。感知智能体使用多功能感知工具箱程序化地查询三维场景，提取足够的信息，包括一种新颖的SOG（情境定向定位）模块，该模块能够稳健地提取语言导向的方向。推理智能体随后迭代地精炼这些信息，追求最小性，通过闭环不断修剪冗余细节并请求缺失信息，直到MSS被精心整理。大量实验表明，通过明确追求充分性和最小性，我们的方法显著提高了准确性，并在两个具有挑战性的基准测试中达到了最先进的性能。此外，我们的框架生成可解释的推理路径，为未来模型提供了一个高质量的训练数据来源。源代码可在https://github.com/gyj155/mssr/获取。

Summary / 总结

This paper traces the spatial reasoning difficulties of Vision-Language Models (VLMs) to two issues: inadequate 3D understanding and reasoning failures caused by redundant 3D information. The authors propose MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that constructs a Minimal Sufficient Set (MSS) of 3D information before answering. A Perception Agent queries 3D scenes with a versatile perception toolbox, and a Reasoning Agent iteratively refines the result, pruning redundant details and requesting missing ones. Experiments show that the method significantly improves accuracy and achieves state-of-the-art performance on two benchmarks, while also producing interpretable reasoning paths.

SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Authors: Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongyan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan

First: 2025-10-03T16:32:02+00:00 · Latest: 2026-03-05T14:25:09+00:00

Abstract

Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.

中文标题/摘要

标题：SpineBench：由SpineMed-450k语料库驱动的临床相关、分级感知基准

脊椎疾病影响全球6.19亿人，是导致残疾的主要原因之一，但AI辅助诊断仍受限于缺乏分级感知的多模态数据集。脊椎疾病的临床决策需要在特定椎体水平上对X光、CT和MRI进行复杂的推理。然而，由于缺乏可追溯的、临床依据的数据和标准化的脊椎特定基准，进展受到限制。为了解决这一问题，我们引入了SpineMed，一个与执业脊椎外科医生共同设计的生态系统。它包括SpineMed-450k，这是第一个专门为跨成像模态的椎体级推理设计的大规模数据集，包含超过45万个指令实例，以及SpineBench，一个临床依据的评估框架。SpineMed-450k从教科书、指南、开放数据集和约1000个匿名医院病例中收集，使用临床医生在环的管道和两阶段LLM生成方法（草案和修订）来确保高质量、可追溯的数据，用于问题回答、多轮咨询和报告生成。SpineBench在临床相关轴上评估模型，包括椎体识别、病理评估和手术规划。我们对SpineBench上几种最近先进的大型视觉-语言模型的全面评估揭示了其在细粒度、椎体特定推理方面的系统性弱点。相比之下，我们基于SpineMed-450k微调的模型在所有任务上都表现出一致且显著的改进。临床医生评估证实了我们模型输出的诊断清晰度和实用价值。

Summary / 总结

The research addresses the lack of level-aware, multimodal datasets for AI-assisted diagnosis of spine disorders, which affect 619 million people globally. It introduces SpineMed-450k, a dataset of over 450,000 instruction instances for vertebral-level reasoning across X-ray, CT, and MRI, and SpineBench, a clinically grounded evaluation framework. Evaluating several recent large vision-language models on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning, whereas the authors' model fine-tuned on SpineMed-450k improves consistently across all tasks, including level identification, pathology assessment, and surgical planning.

该论文介绍了SpineMed生态系统，包括SpineMed-450k，这是一个用于脊椎水平跨影像模态推理的大规模数据集，以及SpineBench，一个临床相关的评估框架。该数据集从多种来源中提炼，确保高质量和可追溯的数据。SpineBench在临床相关任务上评估模型，评估结果显示，基于SpineMed-450k微调的模型在层次特定推理方面优于最近的大型视觉-语言模型，在所有任务上表现出一致的改进。

RadarVLM: A Vision-Language Model Approach for Radar Scene Understanding

Authors: Pushkal Mishra, Kshitiz Bansal, Dinesh Bharadia

First: 2025-11-26T06:41:00+00:00 · Latest: 2026-03-05T14:00:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions, yet existing machine learning approaches remain fragmented and task-specific, with each downstream task employing distinct architectures and training objectives. We present RadarVLM, a vision-language framework that learns unified scene-level representations through structured spatial language supervision. Leveraging the CARLA simulator with a realistic radar model, we collect over 800k radar-caption pairs across 110+ hours of simulated driving in diverse scenarios. We make two key contributions: (1) a structured caption framework encoding vehicle distributions in the radar's native coordinate system, and (2) a Spatially-Grounded CLIP (SG-CLIP) objective that replaces binary matching with continuous scene similarity, enabling fine-grained spatial reasoning. We further propose localization-aware evaluation metrics that directly assess spatial accuracy beyond traditional linguistic similarity measures. Validated on generative captioning and vehicle segmentation, SG-CLIP achieves up to 50% relative F1-score improvement over vanilla CLIP and a 21% AP gain on segmentation, demonstrating that language grounding produces spatially structured representations.

中文标题/摘要

标题：RadarVLM：雷达场景理解的视觉语言模型方法

雷达传感器在恶劣天气、光照和远距离条件下提供可靠的感知，但现有的机器学习方法仍然支离破碎且任务特定，每个下游任务都采用不同的架构和训练目标。我们提出了RadarVLM，这是一种视觉语言框架，通过结构化的空间语言监督学习统一的场景级表示。利用CARLA模拟器和现实的雷达模型，我们收集了超过80万对雷达-描述符，覆盖了110多个小时的模拟驾驶，涉及多种场景。我们做出了两项关键贡献：(1) 结构化的描述符框架，编码车辆在雷达原坐标系中的分布，以及(2) 空间接地CLIP (SG-CLIP) 目标，用连续的场景相似度替代二元匹配，使细粒度的空间推理成为可能。我们还提出了定位感知的评估指标，直接评估空间准确性，超越传统的语言相似度度量。在生成描述符和车辆分割上，SG-CLIP 达到了比原始CLIP高达50%的相对F1分数提升，分割的AP值提高了21%，证明了语言接地产生了空间结构化的表示。

Summary / 总结

RadarVLM is a vision-language model that learns unified scene-level representations using structured spatial language supervision. It leverages the CARLA simulator to collect 800k radar-caption pairs and introduces a structured caption framework and the Spatially-Grounded CLIP (SG-CLIP) objective, which improves generative captioning and vehicle segmentation tasks. The SG-CLIP objective achieves up to 50% relative F1-score improvement over vanilla CLIP and a 21% AP gain on segmentation, highlighting the effectiveness of language grounding for spatial reasoning.

RadarVLM 是一种利用结构化空间语言监督学习统一场景级表示的视觉-语言模型。它利用 CARLA 模拟器收集了 800k 雷达-描述对，并引入了结构化描述框架和空间定位 CLIP (SG-CLIP) 目标，该目标在生成描述和车辆分割任务中取得了显著改进。SG-CLIP 目标在生成描述中实现了高达 50% 的相对 F1 分数提升，并在分割中获得了 21% 的 AP 增益，表明语言定位能够产生空间结构化的表示。
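
The abstract describes SG-CLIP as replacing binary matching with continuous scene similarity. The following is a minimal sketch of that idea, not the authors' implementation: it assumes the soft targets come from row-normalizing a given scene-similarity matrix, and it covers only the radar-to-caption direction. The function names and all numeric values are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_contrastive_loss(logits, scene_sim):
    """Cross-entropy between softmax(logits) and row-normalized scene similarity.

    logits[i][j]: score between radar frame i and caption j.
    scene_sim[i][j]: non-negative similarity between scenes i and j (assumed given).
    """
    loss = 0.0
    for logit_row, sim_row in zip(logits, scene_sim):
        probs = softmax(logit_row)
        total = sum(sim_row)
        targets = [s / total for s in sim_row]
        loss -= sum(t * math.log(p) for t, p in zip(targets, probs))
    return loss / len(logits)

logits = [[4.0, 1.0], [1.0, 4.0]]
binary_targets = [[1.0, 0.0], [0.0, 1.0]]  # vanilla CLIP: only the diagonal pair matches
soft_targets = [[1.0, 0.5], [0.5, 1.0]]    # similar scenes receive partial credit

clip_loss = soft_target_contrastive_loss(logits, binary_targets)
sg_loss = soft_target_contrastive_loss(logits, soft_targets)
```

With an identity similarity matrix the loss reduces to the usual one-hot contrastive cross-entropy. With soft targets, confidently separating two similar scenes is penalized, so `sg_loss` is larger than `clip_loss` for the same logits.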

Logi-PAR: Logic-Infused Patient Activity Recognition via Differentiable Rule

Authors: Muhammad Zarar, MingZheng Zhang, Xiaowang Zhang, Zhiyong Feng, Sofonias Yitagesu, Kawsar Farooq

First: 2026-03-05T13:52:50+00:00 · Latest: 2026-03-05T13:52:50+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Patient Activity Recognition (PAR) in clinical settings uses activity data to improve safety and quality of care. Although significant progress has been made, current models mainly identify which activity is occurring. They often spatially compose sub-sparse visual cues using global and local attention mechanisms, yet only learn logically implicit patterns due to their neural pipeline. Advancing clinical safety requires methods that can infer why a set of visual cues implies a risk, and how these can be compositionally reasoned through explicit logic beyond mere classification. To address this, we propose Logi-PAR, the first Logic-Infused Patient Activity Recognition Framework that integrates contextual fact fusion as a multi-view primitive extractor and injects neural-guided differentiable rules. Our method automatically learns rules from visual cues, optimizing them end-to-end while enabling the implicitly emerging patterns to be explicitly labelled during training. To the best of our knowledge, Logi-PAR is the first framework to recognize patient activity by applying learnable logic rules to symbolic mappings. It produces auditable "why" explanations as rule traces and supports counterfactual interventions (e.g., risk would decrease by 65% if assistance were present). Extensive evaluation on clinical benchmarks (VAST and OmniFall) demonstrates state-of-the-art performance, significantly outperforming Vision-Language Models and transformer baselines. The code is available at https://github.com/zararkhan985/Logi-PAR.git

中文标题/摘要

标题：Logi-PAR：基于可微规则的逻辑注入患者活动识别

在临床环境中，患者活动识别（PAR）利用活动数据以提高安全性和护理质量。尽管取得了显著进展，当前模型主要识别正在进行的活动。它们通常使用全局和局部注意力机制组合稀疏的视觉线索，但由于其神经管道，只能学习逻辑隐含模式。为了提高临床安全性，需要能够推断出一组视觉线索为何表示风险的方法，并通过明确的逻辑进行组合推理，而不仅仅是分类。为此，我们提出了Logi-PAR，这是第一个融合上下文事实的逻辑注入患者活动识别框架，将其作为多视图原始提取器，并注入神经引导的可微规则。我们的方法自动从视觉线索中学习规则，在端到端优化的同时，使隐含模式在训练期间明确地被标记。据我们所知，Logi-PAR 是第一个通过应用可学习逻辑规则到符号映射来识别患者活动的框架。它产生可审计的“为什么”解释作为规则跟踪，并支持反事实干预（例如，如果提供帮助，风险将降低65%）。在临床基准测试（VAST和OmniFall）上的广泛评估表明，其性能达到最先进的水平，显著优于视觉-语言模型和变压器基线。代码可通过：https://github.com/zararkhan985/Logi-PAR.git 获取

Summary / 总结

Logi-PAR is a novel framework for Patient Activity Recognition (PAR) that integrates logic into the recognition process. It uses differentiable rules to infer why certain visual cues imply a risk and enables explicit logical reasoning beyond mere classification. Logi-PAR outperforms existing models on clinical benchmarks and provides auditable explanations and counterfactual interventions. It automatically learns rules from visual cues and optimizes them end-to-end, demonstrating state-of-the-art performance.

Logi-PAR 是一种新颖的患者活动识别（PAR）框架，将逻辑推理集成到模型中以提供风险评估的明确解释。它使用多视图原始提取器和可微规则来自动学习逻辑规则，并端到端地优化它们。Logi-PAR 在临床基准测试中表现出色，提供了可审计的解释和反事实干预措施，显示出在性能和可解释性方面的显著改进。
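
The abstract mentions differentiable rules and counterfactual interventions. The toy sketch below shows how a soft logic rule can be evaluated and then re-evaluated under an intervention. It assumes a product t-norm for conjunction, which is an illustrative choice rather than the paper's stated formulation. The rule, the fact names, and the truth values are all hypothetical.

```python
def soft_and(*truths):
    """Product t-norm: a differentiable stand-in for logical AND on values in [0, 1]."""
    out = 1.0
    for t in truths:
        out *= t
    return out

def soft_not(t):
    """Differentiable negation on a truth value in [0, 1]."""
    return 1.0 - t

def fall_risk(facts):
    # Hypothetical rule: risk <- near_bed_edge AND standing_up AND NOT assistance_present
    return soft_and(
        facts["near_bed_edge"],
        facts["standing_up"],
        soft_not(facts["assistance_present"]),
    )

observed = {"near_bed_edge": 0.9, "standing_up": 0.8, "assistance_present": 0.1}
# Counterfactual intervention: same scene, but assistance is (almost certainly) present.
counterfactual = dict(observed, assistance_present=0.9)

risk = fall_risk(observed)
risk_cf = fall_risk(counterfactual)
relative_drop = (risk - risk_cf) / risk
```

Because every operation is a product or a subtraction, gradients flow from the risk score back to the fact extractors. The rule itself also serves as a readable trace of why the risk was raised.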

Mario: Multimodal Graph Reasoning with Large Language Models

Authors: Yuanfu Sun, Kang Li, Pengkang Guo, Jiajin Liu, Qiaoyu Tan

Venue: CVPR 2026

First: 2026-03-05T13:49:41+00:00 · Latest: 2026-03-05T13:49:41+00:00

Comments: CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Secondly, a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.

中文标题/摘要

标题：Mario：基于大语言模型的多模态图推理

大规模语言模型（LLMs）的最新进展为多模态推理开辟了新的途径。然而，大多数现有方法仍然依赖预训练的视觉-语言模型（VLMs）来孤立地编码图像-文本对，忽略了真实世界多模态数据自然形成的关联结构。这促使我们在多模态图（MMGs）上进行推理，其中每个节点具有文本和视觉属性，边提供结构线索。在保持图拓扑的同时，利用LLM进行这样的异构多模态信号推理引入了两个关键挑战：解决弱跨模态一致性并处理异构模态偏好。为了解决这个问题，我们提出了Mario，这是一种统一框架，同时解决了上述两个挑战，并使LLM能够在MMGs上进行有效的推理。Mario由两个创新阶段组成。首先，一种基于图条件的VLM设计，通过由图拓扑引导的细粒度跨模态对比学习联合精炼文本和视觉特征。其次，一种模态自适应图指令调优机制，将对齐的多模态特征组织成图感知指令视图，并使用可学习的路由器为每个节点及其邻域呈现最相关信息模态配置给LLM。在各种多模态图基准上的广泛实验表明，Mario在节点分类和链接预测的监督和零样本场景中均优于最先进的图模型。代码将在https://github.com/sunyuanfu/Mario上公开。

Summary / 总结

Mario is a unified framework that addresses the challenges of multimodal graph reasoning using large language models. It resolves weak cross-modal consistency and handles heterogeneous modality preferences by designing a graph-conditioned vision-language model and a modality-adaptive graph instruction tuning mechanism. Mario outperforms state-of-the-art graph models in node classification and link prediction across various multimodal graph benchmarks in both supervised and zero-shot settings.

研究动机是利用大型语言模型（LLMs）进行多模态推理，解决现有方法依赖预训练的视觉语言模型（VLMs）孤立编码图像-文本对的局限性。Mario是一个统一框架，通过两个阶段解决弱跨模态一致性问题和处理异构模态偏好：图条件下的VLM设计进行细粒度的跨模态对比学习，以及模态自适应图指令调优机制，将对齐的多模态特征组织成图感知指令视图，并使用可学习的路由器为每个节点及其邻域呈现最相关信息模态配置。实验表明，Mario在节点分类和链接预测的各种基准测试中，无论是监督学习还是零样本场景，都优于最先进的图模型。

Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

Authors: Shule Lu, Yujing Wang, Hainan Zhang, Xiaoshan Yang, Hongwei Zheng, Yongxin Tong, Changsheng Xu, Zhiming Zheng

First: 2026-01-31T03:11:51+00:00 · Latest: 2026-03-05T12:23:38+00:00

Comments: Due to the need for substantial revisions, the authors believe that the paper should be retracted first. A revised version may be resubmitted.

Abs · PDF · Code1 · Code2

Abstract

VLMs have broad potential in privacy-sensitive domains such as healthcare and finance, yet strict data-sharing constraints render centralized training infeasible. FL mitigates this issue by enabling decentralized training, but practical deployments face challenges due to client heterogeneity in computational resources, application requirements, and model architectures. We argue that while replacing data with model parameters characterizes the present of FL, replacing parameters with preferences represents a more scalable and privacy-preserving future. Motivated by this perspective, we propose MoR, a federated alignment framework based on GRPO with Mixture-of-Rewards for heterogeneous VLMs. MoR initializes a visual foundation model as a KL-regularized reference, while each client locally trains a reward model from local preference annotations, capturing specific evaluation signals without exposing raw data. To reconcile heterogeneous rewards, we introduce a routing-based fusion mechanism that adaptively aggregates client reward signals. Finally, the server performs GRPO with this mixed reward to optimize the base VLM. Experiments on three public VQA benchmarks demonstrate that MoR consistently outperforms federated alignment baselines in generalization, robustness, and cross-client adaptability. Our approach provides a scalable solution for privacy-preserving alignment of heterogeneous VLMs under federated settings.

中文标题/摘要

标题：用偏好替换参数：异构视觉-语言模型的联邦对齐

视觉语言模型（VLMs）在医疗保健和金融等隐私敏感领域具有广泛的应用潜力，但由于严格的数据共享限制，集中式训练变得不可行。联邦学习（FL）通过使训练去中心化来缓解这一问题，但实际部署面临挑战，因为客户端在计算资源、应用需求和模型架构方面存在异质性。我们认为，虽然用模型参数替换数据是当前FL的特点，但用偏好替换参数代表了更具有扩展性和隐私保护的未来。基于这一视角，我们提出了MoR，一种基于GRPO的混合奖励的异构VLM联邦对齐框架。MoR以KL正则化的视觉基础模型作为参考，每个客户端从本地偏好注释中局部训练奖励模型，捕捉特定的评估信号而不暴露原始数据。为了协调异质奖励，我们引入了一种基于路由的融合机制，以自适应地聚合客户端的奖励信号。最后，服务器使用这种混合奖励进行GRPO优化基础VLM。在三个公开的VQA基准测试上进行的实验表明，MoR在泛化能力、鲁棒性和跨客户端适应性方面始终优于联邦对齐基线。我们的方法为在联邦设置下提供了一种异构VLM的可扩展的隐私保护对齐解决方案。

Summary / 总结

The paper addresses the challenge of training vision-language models (VLMs) in privacy-sensitive domains where centralized training is infeasible due to data-sharing constraints. It proposes MoR, a federated learning framework that replaces model parameters with preferences to enhance scalability and privacy. MoR initializes a reference model and allows clients to locally train reward models based on preference annotations, which are then fused to optimize the base model. Experiments show that MoR outperforms existing federated alignment methods in terms of generalization, robustness, and cross-client adaptability.

论文针对在隐私敏感领域中无法进行集中训练的问题，提出了一种名为MoR的联邦学习框架，该框架通过将模型参数替换为偏好来增强可扩展性和隐私性。MoR初始化一个参考模型，并允许客户端从偏好注释中训练本地奖励模型，然后将这些奖励信号融合以优化基础模型。实验结果显示，MoR在泛化能力、鲁棒性和跨客户端适应性方面优于现有联邦对齐方法。
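
The abstract describes a routing-based fusion of client reward signals followed by GRPO on the server. The sketch below illustrates those two steps under simplifying assumptions: routing is modeled as a softmax over per-client router scores, which is a hypothetical stand-in for the paper's mechanism. The advantage computation is the standard GRPO group normalization, where rewards are standardized within a group of responses to the same prompt. All names and numbers are illustrative.

```python
import math

def route_weights(router_scores):
    """Softmax over per-client router scores (the scores are assumed inputs)."""
    m = max(router_scores)
    exps = [math.exp(s - m) for s in router_scores]
    z = sum(exps)
    return [e / z for e in exps]

def mixed_reward(client_rewards, router_scores):
    """Weighted sum of the rewards that each client's reward model assigns."""
    weights = route_weights(router_scores)
    return sum(w * r for w, r in zip(weights, client_rewards))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Three candidate responses to one prompt, each scored by two client reward models.
client_scores = [[0.2, 0.4], [0.9, 0.7], [0.5, 0.5]]
router_scores = [0.0, 0.0]  # equal routing, so the mix is a plain average
rewards = [mixed_reward(scores, router_scores) for scores in client_scores]
advantages = group_relative_advantages(rewards)
```

Only scalar rewards cross the client boundary in this sketch. That matches the summary's point that preferences, not raw data or model parameters, are what gets shared.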

GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement

Authors: Xiaodong Zhu, Yuanming Zheng, Suting Wang, Junqi Yang, Yuhong Yang, Weiping Tu, Zhongyuan Wang

Venue: CVPR 2026

First: 2026-03-05T12:07:26+00:00 · Latest: 2026-03-05T12:07:26+00:00

Comments: 10 pages, 4 figures, accepted by CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.

中文标题/摘要

标题：GEM-TFL：通过EM引导分解和时间精炼，实现弱监督与全监督之间的伪造定位桥梁

时间伪造定位（TFL）旨在精确识别视频或音频流中的篡改段落，为多媒体取证和安全提供可解释的证据。虽然大多数现有的TFL方法依赖于密集的帧级标签进行全监督学习，但弱监督TFL（WS-TFL）通过仅从二进制视频级标签中学习来降低标注成本。然而，当前的WS-TFL方法存在训练和推理目标不匹配、二进制标签监督有限、由于非可微的top-k聚合导致梯度阻塞以及缺乏对提案间关系的显式建模等问题。为了解决这些问题，我们提出了GEM-TFL（基于图的EM增强时间伪造定位），这是一种两阶段分类-回归框架，有效地弥合了训练和推理之间的监督差距。在此基础上，（1）我们通过基于EM的优化过程将二进制标签重新表述为多维潜在属性，增强弱监督；（2）我们引入了一种无需训练的时间一致性精炼方法，重新对齐帧级预测以实现更平滑的时间动态；（3）我们设计了一种基于图的提案精炼模块，建模提案之间的时空语义关系，以实现全局一致的置信度估计。在基准数据集上的广泛实验表明，GEM-TFL实现了更准确和稳健的时间伪造定位，显著缩小了与全监督方法的差距。

Summary / 总结

GEM-TFL is designed to improve weakly supervised temporal forgery localization by bridging the gap between training and inference. It reformulates binary video-level labels into multi-dimensional latent attributes using an EM-based optimization process, introduces a training-free temporal consistency refinement for smoother predictions, and designs a graph-based proposal refinement module to model temporal-semantic relationships among proposals. Experiments on benchmark datasets show that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.

GEM-TFL 提出了一种两阶段框架来增强弱监督并引入时间一致性精炼，通过 EM 基础优化将二元标签转换为多维潜在属性，并建模提案之间的时序语义关系。实验表明，GEM-TFL 实现了更准确、更鲁棒的伪造定位，显著缩小了与全监督方法的差距。

CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection

Authors: Zhaonian Kuang, Rui Ding, Haotian Wang, Xinhu Zheng, Meng Yang, Gang Hua

Venue: CVPR 2026

First: 2026-03-05T10:49:46+00:00 · Latest: 2026-03-05T10:49:46+00:00

Comments: Accepted to CVPR 2026 main track

Abs · PDF · Code1 · Code2

Abstract

Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches feature space by integrating four spatial representations, such as focal length, ground depth, ground gradient, and Plücker coordinate. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.

中文标题/摘要

标题：CoIn3D：重新审视配置不变的多相机3D物体检测

多相机3D物体检测（MC3D）随着多传感器物理代理（如机器人和自动驾驶车辆）的部署越来越多而受到越来越多的关注。然而，MC3D模型仍然难以在具有新多相机配置的未见过的平台上泛化。当前的解决方案只是使用一个元相机进行统一表示，但缺乏全面的考虑。在本文中，我们重新审视了这一问题，并确定了源配置和目标配置之间的空间先验差异是问题的关键，包括不同的内参、外参和阵列布局。为了解决这一问题，我们提出了CoIn3D，这是一种通用的MC3D框架，能够从源配置高效地转移到未见过的目标配置。CoIn3D通过空间感知特征调制（SFM）和相机感知数据增强（CDA）将所有识别的空间先验显式地整合到特征嵌入和图像观察中。SFM通过整合焦距、地面深度、地面梯度和Plücker坐标等四种空间表示来丰富特征空间。CDA通过一种无需训练的动态新颖视角图像合成方案来在各种配置下提高观察多样性。广泛的实验表明，CoIn3D在NuScenes、Waymo和Lyft等地标数据集上，在BEVDepth、BEVFormer和PETR等三种主导的MC3D范式下，实现了强大的跨配置性能。

Summary / 总结

The paper addresses the challenge of multi-camera 3D object detection (MC3D) models generalizing to new configurations. It proposes CoIn3D, a framework that explicitly considers spatial priors like intrinsics, extrinsics, and array layouts. CoIn3D uses spatial-aware feature modulation and camera-aware data augmentation to enhance feature embedding and observation diversity, respectively. Experiments show CoIn3D performs well across different datasets and MC3D paradigms.

CoIn3D重新审视了多相机3D目标检测(MC3D)的问题，并提出了一种通用框架来解决不同相机配置间的可迁移性问题。它通过空间感知特征调制和相机感知数据增强，将空间先验融入特征嵌入和图像观察中。实验表明，CoIn3D在NuScenes、Waymo和Lyft等地标数据集的各种MC3D范式下取得了强大的跨配置性能。
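
One of the spatial representations the abstract lists is the Plücker coordinate. As a brief illustration, the sketch below computes the standard Plücker representation (unit direction d, moment c × d) of the viewing ray through a pixel under a pinhole camera model. How CoIn3D feeds this into its feature modulation is not specified in the abstract. The function names and camera values here are hypothetical.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def pixel_plucker(u, v, fx, fy, cx, cy, cam_to_world_rot, cam_center):
    """Return the 6-D Plücker coordinates (direction, moment) of a pixel's viewing ray."""
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]  # back-project with intrinsics
    d = matvec(cam_to_world_rot, d_cam)           # rotate into the world frame
    norm = math.sqrt(sum(x * x for x in d))
    d = [x / norm for x in d]
    moment = cross(cam_center, d)                 # encodes the ray's offset from the origin
    return d + moment

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ray = pixel_plucker(u=800.0, v=450.0, fx=1000.0, fy=1000.0, cx=800.0, cy=450.0,
                    cam_to_world_rot=identity, cam_center=[1.0, 0.0, 0.0])
```

The representation depends on both intrinsics (through the back-projection) and extrinsics (through the rotation and camera center). This is why it can expose configuration differences to a detector.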

Flatness Guided Test-Time Adaptation for Vision-Language Models

Authors: Aodi Li, Liansheng Zhuang, Xiao Long, Houqiang Li, Shafei Wang

First: 2025-01-31T03:10:48+00:00 · Latest: 2026-03-05T10:05:46+00:00

Abs · PDF · Code1 · Code2

Abstract

Test-time adaptation (TTA) of Vision-Language Models (VLMs) has emerged as a technique for tackling distribution shifts at test time. Recent research indicates that test-time adaptation is intrinsically linked to the model's training history. However, existing TTA methods, such as Test-time Prompt Tuning, often design adaptation strategies in isolation from the models' training characteristics, which degrades their performance. This paper argues that the flatness acquired via sharpness-aware training is an efficient clue for the test-time adaptation of VLMs. Built on this insight, this paper proposes a novel Flatness-Guided Adaptation framework (FGA) for VLMs to cohesively unify training and test-time procedures. Its core idea is to leverage the alignment between the training minimum and test loss flat regions to guide the adaptation process. Specifically, our FGA consists of a prompt-tuning stage and a test-time adaptation stage. In the tuning stage, a Sharpness-Aware Prompt Tuning method is utilized to identify the training flat minimum, offering a geometric clue of flatness for subsequent adaptation. In the test stage, a Sharpness-based Test Sample Selection approach is proposed to ensure the alignment of flat minima between the training and each augmented test sample's loss landscape. In comparison to existing TTA methods, our FGA avoids the expensive prompt parameter updates during test time, and substantially reduces the computation overhead. Extensive experiments on both domain generalization and cross-dataset benchmarks demonstrate that our FGA achieves superior performance over prevalent TTA methods. Notably, when employing a ViT-B/16 image encoder, FGA even outperforms TPT+CoOp by an average of 4.88% across all four ImageNet out-of-domain variants.

中文标题/摘要

标题：基于平坦度引导的视觉-语言模型测试时适应

视觉-语言模型（VLMs）的测试时适应（TTA）已成为解决测试时分布偏移的技术。现有研究表明，测试时适应与模型的训练历史密切相关。然而，现有的TTA方法，如测试时提示调优，往往孤立地设计适应策略，这会降低其性能。本文认为，通过尖锐感知训练获得的平坦度是视觉-语言模型测试时适应的有效线索。基于这一见解，本文提出了一种新颖的基于平坦度引导的适应框架（FGA），以统一训练和测试过程。其核心思想是利用训练最小值和平坦损失区域之间的对齐来引导适应过程。具体而言，我们的FGA包括一个提示调优阶段和一个测试时适应阶段。在调优阶段，使用尖锐感知提示调优方法来识别训练平坦最小值，为后续适应提供平坦度的几何线索。在测试阶段，提出了一种基于尖锐性的测试样本选择方法，以确保训练最小值和平滑每个增强测试样本损失景观之间的对齐。与现有TTA方法相比，我们的FGA避免了测试时昂贵的提示参数更新，并显著减少了计算开销。在领域泛化和跨数据集基准测试上的广泛实验表明，我们的FGA在主流TTA方法中表现出更优的性能。值得注意的是，当使用ViT-B/16图像编码器时，FGA在所有四个ImageNet离域变体上平均优于TPT+CoOp 4.88%。

Summary / 总结

This paper addresses the challenge of test-time adaptation (TTA) for Vision-Language Models (VLMs) by proposing a Flatness-Guided Adaptation (FGA) framework. The motivation is to improve TTA performance by leveraging the model's training history. The FGA framework consists of a prompt-tuning stage and a test-time adaptation stage, where the former identifies the training flat minimum using Sharpness-Aware Prompt Tuning, and the latter ensures alignment of flat minima through a Sharpness-based Test Sample Selection approach. Experiments show that FGA outperforms existing TTA methods, achieving superior performance on domain generalization and cross-dataset benchmarks, with an average improvement of 4.88% over TPT+CoOp on ImageNet out-of-domain variants.

本文提出了一种名为Flatness-Guided Adaptation (FGA)的框架，以解决视觉-语言模型（VLMs）在测试时的适应性问题。受模型训练历史与测试时适应性之间关系的启发，作者认为训练最小值的平坦性可以指导适应过程。FGA框架包括使用Sharpness-Aware Prompt Tuning在调优阶段识别训练平坦最小值，并在测试阶段确保训练和每个增强测试样本损失景观之间的平坦最小值对齐。实验表明，FGA在性能上优于现有方法，并且减少了计算开销。
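
The test stage selects augmented test samples by sharpness instead of updating prompt parameters. The toy sketch below conveys the general idea under strong assumptions: the prompt is a single scalar, each augmented view has a known loss function, and sharpness is measured as the worst-case loss increase within a small radius of the trained parameter. None of these choices are taken from the paper, and all names are hypothetical.

```python
def sharpness(loss_fn, theta, rho=0.1):
    """Worst-case loss increase within radius rho of theta (1-D toy: check both sides)."""
    base = loss_fn(theta)
    return max(loss_fn(theta + rho), loss_fn(theta - rho)) - base

def select_flat_views(view_losses, theta, keep):
    """Return indices of the `keep` augmented views whose loss is flattest at theta."""
    ranked = sorted(range(len(view_losses)),
                    key=lambda i: sharpness(view_losses[i], theta))
    return sorted(ranked[:keep])

theta_star = 0.0  # stand-in for the flat minimum found during prompt tuning
view_losses = [
    lambda t: 0.5 * t ** 2,            # flat around theta_star
    lambda t: 20.0 * (t - 0.3) ** 2,   # sharp, with its minimum shifted away
    lambda t: 1.0 * t ** 2,            # moderately flat
]
kept = select_flat_views(view_losses, theta_star, keep=2)
```

Selection only evaluates losses and never updates `theta_star`. This is consistent with the abstract's claim that FGA avoids prompt parameter updates at test time.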

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

Authors: Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu

First: 2025-03-14T19:52:08+00:00 · Latest: 2026-03-05T09:05:50+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the "safety mirage", where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we show machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.

中文标题/摘要

标题：安全幻象：虚假相关性如何削弱VLM安全微调并可通过机器遗忘加以缓解

近期的视觉语言模型（VLMs）在使用多模态输入（特别是文本和图像）的生成建模方面取得了显著进展。然而，当暴露于不安全查询时，它们生成有害内容的脆弱性引发了重要的安全问题。尽管当前的对齐策略主要依赖于监督安全微调和精心策划的数据集，但我们发现了一个根本性的局限性，我们称之为“安全幻象”，即监督微调无意中强化了表面文本模式与安全响应之间的虚假相关性，而不是培养深层次、内在的有害行为缓解。我们展示了这些虚假相关性使微调后的VLMs即使在简单的基于单词替换的攻击中也容易受到攻击，其中用一个诱导虚假相关性的替代词替换文本查询中的单个词可以有效绕过防护措施。此外，这些相关性导致过度谨慎，使微调后的VLMs无谓地拒绝良性查询。为了解决这些问题，我们展示了机器遗忘（MU）作为监督安全微调的强大替代方案，因为它避免了有偏的特征-标签映射，并直接从VLMs中移除有害知识，同时保留其一般能力。广泛的评估表明，在基于MU的对齐下，攻击成功率降低了高达60.27%，不必要的拒绝率减少了超过84.20%。注意：存在可能具有冒犯性的AI生成内容。

Summary / 总结

This paper examines how spurious correlations undermine safety fine-tuning of vision language models (VLMs). It identifies a "safety mirage": supervised safety fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, leaving VLMs vulnerable to a simple one-word substitution attack and prone to refusing benign queries. The study shows that machine unlearning (MU) mitigates these issues by avoiding biased feature-label mappings and directly removing harmful knowledge while preserving general capabilities. Across safety benchmarks, MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%.

本文通过识别'安全幻象'问题，指出监督微调可能会无意中强化虚假关联，使视觉语言模型（VLM）对简单的攻击变得脆弱且过于谨慎。作者提出机器遗忘（MU）作为替代方法来缓解这些问题，实验表明MU可以将攻击成功率降低高达60.27%，并将不必要的拒绝率降低超过84.20%。

Retrieval-Augmented Generation with Covariate Time Series

Authors: Kenny Ye Liang, Zhongyi Pei, Huan Zhang, Yuhui Liu, Shaoxu Song, Jianmin Wang

First: 2026-03-05T08:45:24+00:00 · Latest: 2026-03-05T08:45:24+00:00

Comments: 12 pages. Preprint

Abs · PDF · Code1 · Code2

Abstract

While RAG has greatly enhanced LLMs, extending this paradigm to Time-Series Foundation Models (TSFMs) remains a challenge. This is exemplified in the Predictive Maintenance of the Pressure Regulating and Shut-Off Valve (PRSOV), a high-stakes industrial scenario characterized by (1) data scarcity, (2) short transient sequences, and (3) covariate-coupled dynamics. Unfortunately, existing time-series RAG approaches predominantly rely on generated static vector embeddings and learnable context augmenters, which may fail to distinguish similar regimes in such scarce, transient, and covariate-coupled scenarios. To address these limitations, we propose RAG4CTS, a regime-aware, training-free RAG framework for Covariate Time-Series. Specifically, we construct a hierarchical time-series native knowledge base to enable lossless storage and physics-informed retrieval of raw historical regimes. We design a two-stage bi-weighted retrieval mechanism that aligns historical trends through point-wise and multivariate similarities. For context augmentation, we introduce an agent-driven strategy to dynamically optimize context in a self-supervised manner. Extensive experiments on PRSOV demonstrate that our framework significantly outperforms state-of-the-art baselines in prediction accuracy. The proposed system is deployed in Apache IoTDB within China Southern Airlines. Since deployment, our method has successfully identified one PRSOV fault in two months with zero false alarms.

中文标题/摘要

标题：基于协变量时间序列的检索增强生成

尽管检索增强生成（RAG）极大地提升了语言模型（LLMs），将其扩展到时间序列基础模型（TSFMs）仍面临挑战。这在压力调节和关断阀（PRSOV）的预测维护中尤为明显，这是一个高风险的工业场景，具有（1）数据稀缺性，（2）短暂序列，以及（3）协变量耦合动力学的特点。不幸的是，现有的时间序列RAG方法主要依赖生成的静态向量嵌入和可学习的上下文增强器，这在数据稀缺、短暂且协变量耦合的场景中可能无法区分相似的运行状态。为解决这些局限性，我们提出了RAG4CTS，这是一种针对协变量时间序列的无训练检索增强生成框架。具体而言，我们构建了一个层次化的时间序列本体知识库，以实现无损存储和基于物理的检索历史运行状态。我们设计了一种两阶段的双加权检索机制，通过点对点和多变量相似性对历史趋势进行对齐。对于上下文增强，我们引入了一种基于代理的策略，以自监督方式动态优化上下文。在PRSOV上的广泛实验表明，我们的框架在预测准确性上显著优于最先进的基线。所提出系统已部署在中国南方航空公司的Apache IoTDB中。自部署以来，我们的方法在两个月内成功检测到一个PRSOV故障，且无误报。

Summary / 总结

This paper addresses the challenge of applying Retrieval-Augmented Generation (RAG) to Time-Series Foundation Models (TSFMs) in high-stakes industrial scenarios like Predictive Maintenance of the PRSOV valve, where data is scarce, sequences are short, and dynamics are covariate coupled. The authors propose RAG4CTS, a regime-aware RAG framework that constructs a hierarchical time-series knowledge base for lossless storage and physics-informed retrieval of historical regimes, and introduces a two-stage bi-weighted retrieval mechanism. Experiments show that RAG4CTS significantly improves prediction accuracy compared to existing methods, and it has been deployed in China Southern Airlines, successfully identifying a fault with no false alarms.

研究针对现有RAG方法在数据稀缺和协变量耦合环境如PRSOV预测维护中的局限性，提出了一种基于层次时间序列知识库的RAG4CTS框架，实现历史制度的无损存储和检索，并采用两阶段加权检索机制进行上下文增强。实验表明，RAG4CTS在预测准确性上显著优于现有方法。

Collaborative Learning of Local 3D Occupancy Prediction and Versatile Global Occupancy Mapping

Authors: Shanshuai Yuan, Julong Wei, Muer Tie, Xiangyun Ren, Zhongxue Gan, Wenchao Ding

Venue: ICRA 2026

First: 2025-04-18T09:58:48+00:00 · Latest: 2026-03-05T07:52:27+00:00

Comments: Accepted by ICRA 2026

Abs · PDF · Code1 · Code2

Abstract

Vision-based 3D semantic occupancy prediction is vital for autonomous driving, enabling unified modeling of static infrastructure and dynamic agents. Global occupancy maps serve as long-term memory priors, providing valuable historical context that enhances local perception. This is particularly important in challenging scenarios such as occlusion or poor illumination, where current and nearby observations may be unreliable or incomplete. Priors aggregated from previous traversals under better conditions help fill gaps and enhance the robustness of local 3D occupancy prediction. In this paper, we propose Long-term Memory Prior Occupancy (LMPOcc), a plug-and-play framework that incorporates global occupancy priors to boost local prediction and simultaneously updates global maps with new observations. To realize the information gain from global priors, we design an efficient and lightweight Current-Prior Fusion module that adaptively integrates prior and current features. Meanwhile, we introduce a model-agnostic prior format to enable continual updating of global occupancy and ensure compatibility across diverse prediction baselines. LMPOcc achieves state-of-the-art local occupancy prediction performance validated on the Occ3D-nuScenes benchmark, especially on static semantic categories. Furthermore, we verify LMPOcc's capability to build large-scale global occupancy maps through multi-vehicle crowdsourcing, and utilize occupancy-derived dense depth to support the construction of 3D open-vocabulary maps. Our method opens up a new paradigm for continuous global information updating and storage, paving the way towards more comprehensive and scalable scene understanding in large outdoor environments.

中文标题/摘要

标题：协作学习局部3D占用预测和多功能全局占用映射

基于视觉的3D语义占用预测对于自动驾驶至关重要，能够统一建模静态基础设施和动态代理。全局占用地图作为长期记忆先验，提供有价值的历史上下文，增强局部感知。特别是在遮挡或光照不良等具有挑战性的场景中，当前和附近的观测可能不可靠或不完整。来自更好条件下的先前遍历的先验有助于填补空白并增强局部3D占用预测的鲁棒性。在本文中，我们提出了一种名为长时记忆先验占用（LMPOcc）的即插即用框架，该框架结合全局占用先验以增强局部预测，并同时使用新观测更新全局地图。为了实现全局先验的信息增益，我们设计了一种高效且轻量级的当前-先验融合模块，以自适应地整合先验和当前特征。同时，我们引入了一种模型无关的先验格式，以实现全局占用的持续更新并确保与各种预测基线的兼容性。LMPOcc在Occ3D-nuScenes基准上实现了最先进的局部占用预测性能，特别是在静态语义类别方面。此外，我们通过多车辆众包验证了LMPOcc构建大规模全局占用地图的能力，并利用占用衍生的密集深度支持3D开放词汇地图的构建。我们的方法为持续的全局信息更新和存储开辟了一个新的范式，为大型户外环境中的更全面和可扩展的场景理解铺平了道路。

Summary / 总结

This paper addresses the challenge of 3D semantic occupancy prediction in autonomous driving by proposing LMPOcc, a framework that integrates global occupancy priors to enhance local prediction and simultaneously updates global maps. The method includes an efficient Current-Prior Fusion module and a model-agnostic prior format, achieving state-of-the-art performance on the Occ3D-nuScenes benchmark and demonstrating the capability to build large-scale global occupancy maps through multi-vehicle crowdsourcing.

论文针对自主驾驶中3D语义占用预测的挑战，特别是在遮挡或光照不良场景下的问题。提出了一种名为LMPOcc的框架，该框架整合全局占用先验以增强局部预测，并同时更新全局地图。Current-Prior Fusion模块高效地结合了先验和当前特征，而一种模型无关的先验格式确保了不同预测模型之间的兼容性。LMPOcc在局部占用预测性能上表现出色，并通过多车辆众包展示了构建大规模全局占用地图的能力，支持3D开放词汇地图的构建。

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

Authors: Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker

First: 2026-02-08T02:16:02+00:00 · Latest: 2026-03-05T07:52:20+00:00

Comments: Figure PDFs were compressed to 150 dpi to comply with arXiv's submission size limit. Project page: https://rolling-sink.github.io/

Abstract

Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradation. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To explore a training-free solution, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales the AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/

中文标题/摘要

标题：滚动水槽：在自回归视频扩散模型中弥合有限训练期与开放测试期之间的差距

最近，自回归（AR）视频扩散模型取得了显著的性能。然而，由于其有限的训练时长，当在更长的时间范围内进行测试时，会出现训练-测试差距，导致视觉质量迅速退化。在研究了训练时长内的训练-测试差距之后，这项工作研究了训练时长之外的训练-测试差距，即训练时有限时间范围与测试时开放时间范围之间的差距。由于开放测试可以超出任何有限的训练窗口，且长视频训练计算成本高昂，我们寻求一种无需训练的解决方案来弥合这一差距。为了探索无需训练的解决方案，我们系统地分析了AR缓存维护。这些见解导致了滚动水槽（Rolling Sink）的提出。基于仅使用5秒片段训练的Self Forcing，滚动水槽在测试时能够将AR视频合成扩展到超长时长（例如，16 FPS下5-30分钟），且保持一致的主题、稳定的颜色、连贯的结构和流畅的运动。通过广泛的实验表明，滚动水槽在长时范围内的视觉保真度和时间一致性方面优于当前最佳基线。项目页面：https://rolling-sink.github.io/

Summary / 总结

This work addresses the train-test gap in autoregressive video diffusion models by focusing on the gap between limited training horizons and open-ended testing horizons. It introduces Rolling Sink, a training-free solution based on insights from AR cache maintenance, which effectively extends the video synthesis to ultra-long durations (5-30 minutes at 16 FPS) with consistent subjects, stable colors, coherent structures, and smooth motions. Experiments show that Rolling Sink outperforms state-of-the-art baselines in terms of long-horizon visual fidelity and temporal consistency.

该研究通过提出Rolling Sink来解决自回归视频扩散模型中的训练-测试差距问题，该方法将有限的训练时间段与开放的测试时间段之间的差距进行了弥合。基于Self Forcing，Rolling Sink将视频合成扩展到超长持续时间（5-30分钟），保持一致的主题、稳定的颜色、连贯的结构和流畅的运动。实验表明，Rolling Sink在长时间段的视觉保真度和时间一致性方面优于最先进的基线方法。

AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM

Authors: Li'an Zhong, Ziqiang He, Jibin Zheng, Jin Li, Z. Jane Wang, Xiangui Kang

First: 2026-03-05T07:52:11+00:00 · Latest: 2026-03-05T07:52:11+00:00

Abstract

Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT), which employs a layer-wise threshold to control intervention time and a fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results on several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.

中文标题/摘要

标题：AdaIAT：通过增加生成文本的关注度来缓解LVLM中的幻觉

幻觉已成为当前大型视觉-语言模型（LVLM）发展和应用中的重大障碍。为了减轻幻觉，一种直观且有效的方法是在推理过程中直接增加对图像标记的关注权重。尽管这种方法有效地降低了幻觉率，但往往会引起重复描述。为了解决这一问题，我们首先分析了注意力模式，并发现真实对象标记倾向于比幻觉标记更关注生成的文本。这启发我们利用包含指令相关视觉信息和上下文知识的生成文本来缓解幻觉，同时保持语言连贯性。因此，我们提出了生成文本注意力（IAT），并证明它显著降低了幻觉率，同时避免了重复描述。为了防止简单的放大损害LVLM的固有预测能力，我们进一步探索了分层阈值控制干预时间和细粒度放大幅度的自适应IAT（AdaIAT）。分析和实验都证明了AdaIAT的有效性。多个LVLM的结果表明，AdaIAT有效地缓解了幻觉（分别在LLaVA-1.5上减少了幻觉率$C_S$和$C_I$的35.8%和37.1%），同时保持了语言性能和预测能力，实现了令人满意的权衡。

Summary / 总结

The paper addresses the issue of hallucination in Large Vision-Language Models (LVLMs) by proposing AdaIAT, which adaptively increases attention to generated text to reduce hallucinations without causing repetitive descriptions. Experiments show that AdaIAT significantly reduces hallucination rates on LLaVA-1.5 by 35.8% and 37.1% for $C_S$ and $C_I$, respectively, while maintaining linguistic coherence and prediction capability.

该论文通过提出AdaIAT方法，针对大型视觉-语言模型（LVLM）中的幻觉问题，增加对生成文本的关注以减少幻觉现象，同时避免重复描述。该方法基于观察到真实对象令牌对生成文本给予更高关注的事实进行改进，进一步引入了层间阈值来控制干预时间和精细调节放大程度。实验结果显示，AdaIAT在LLaVA-1.5上将幻觉率分别降低了35.8%和37.1%，同时保持了语言性能和预测能力，实现了良好的权衡。
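
The core idea in the abstract above, increasing the attention paid to already-generated text, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the intervention acts on a normalized attention row, and the function name `amplify_attention` and the factor `alpha` are hypothetical. The layer-wise threshold and per-head magnitudes that AdaIAT adds are not modeled here.

```python
def amplify_attention(weights, generated_idx, alpha):
    """Scale attention on generated-text positions by (1 + alpha), then renormalize.

    weights: one attention row (non-negative, sums to 1).
    generated_idx: set of positions holding generated-text tokens (assumed known).
    alpha: hypothetical amplification factor; alpha = 0 leaves the row unchanged.
    """
    boosted = [
        w * (1.0 + alpha) if i in generated_idx else w
        for i, w in enumerate(weights)
    ]
    total = sum(boosted)
    # Renormalize so the row is a valid probability distribution again.
    return [w / total for w in boosted]


# Toy row over 4 positions: two image tokens, then two generated-text tokens.
row = [0.4, 0.3, 0.2, 0.1]
new_row = amplify_attention(row, generated_idx={2, 3}, alpha=0.5)
```

After the call, the share of attention on the generated-text positions is larger than in `row`, and the image-token share shrinks accordingly because of the renormalization.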

Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs

Authors: Lianyu Wang, Meng Wang, Huazhu Fu, Daoqiang Zhang

First: 2026-03-05T07:36:07+00:00 · Latest: 2026-03-05T07:36:07+00:00

Abstract

The rapid adoption of vision-language models (VLMs) has heightened the demand for robust intellectual property (IP) protection of these high-value pretrained models. Effective IP protection should proactively confine model deployment within authorized domains and prevent unauthorized transfers. However, existing methods rely on static training-time definitions, limiting flexibility in dynamic environments and often producing opaque responses to unauthorized inputs. To address these limitations, we propose a novel dynamic authorization with legality-aware intellectual property protection (AoD-IP) for VLMs, a framework that supports authorize-on-demand and legality-aware assessment. AoD-IP introduces a lightweight dynamic authorization module that enables flexible, user-controlled authorization, allowing users to actively specify or switch authorized domains on demand at deployment time. This enables the model to adapt seamlessly as application scenarios evolve and provides substantially greater extensibility than existing static-domain approaches. In addition, AoD-IP incorporates a dual-path inference mechanism that jointly predicts input legality-aware and task-specific outputs. Comprehensive experimental results on multiple cross-domain benchmarks demonstrate that AoD-IP maintains strong authorized-domain performance and reliable unauthorized detection, while supporting user-controlled authorization for adaptive deployment in dynamic environments.

中文标题/摘要

标题：按需授权：具有法律意识的知识产权保护以实现VLM的动态授权

视觉-语言模型（VLMs）的快速采用加剧了对这些高价值预训练模型的知识产权（IP）保护需求。有效的IP保护应主动限制模型部署在授权领域内，并防止未经授权的转移。然而，现有方法依赖于静态训练时定义，限制了在动态环境中的灵活性，并经常对未经授权的输入产生不透明的响应。为解决这些限制，我们提出了一种新颖的具有法律意识的知识产权保护（AoD-IP）框架，用于VLMs，该框架支持按需授权和法律意识评估。AoD-IP引入了一个轻量级的动态授权模块，使授权更加灵活和用户可控，允许用户在部署时主动指定或切换授权领域。这使模型能够随着应用场景的变化无缝适应，并提供了比现有静态领域方法更大的可扩展性。此外，AoD-IP结合了双路径推理机制，同时预测输入的法律意识和任务特定输出。在多个跨域基准上的全面实验结果表明，AoD-IP在授权领域内保持了强大的性能，并可靠地检测未经授权的输入，同时支持用户控制的授权以适应动态环境的部署。

Summary / 总结

The paper proposes AoD-IP, a dynamic authorization framework for VLMs that supports on-demand authorization and legality-aware assessment, addressing the limitations of static training-time definitions. Experimental results show that AoD-IP maintains strong performance in authorized domains and reliable unauthorized detection, while enabling flexible and user-controlled authorization for adaptive deployment in dynamic environments.

论文提出了AoD-IP，这是一种针对视觉语言模型（VLMs）的动态授权框架，支持按需授权和合法性评估。它引入了一个轻量级的动态授权模块，允许用户在部署时指定授权领域，增强了灵活性和可扩展性。实验结果表明，AoD-IP在授权领域保持了强大的性能，并且对未授权检测可靠，支持在动态环境中进行自适应部署。

Differentially Private Multimodal In-Context Learning

Authors: Ivoline C. Ngong, Zarreen Reza, Joseph P. Near

First: 2026-03-05T07:36:02+00:00 · Latest: 2026-03-05T07:36:02+00:00

Abstract

Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. We evaluate on eight benchmarks across three VLM architectures, supporting deployment with or without auxiliary data. At $\varepsilon=1.0$, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints.

Summary / 总结

The research aims to enable privacy-preserving multimodal in-context learning for sensitive applications like medical imaging. The method, DP-MTV, partitions private data into disjoint chunks, applies per-layer clipping, and adds calibrated noise to the aggregated task vector to support many-shot learning with formal differential privacy. At ε=1.0, DP-MTV achieves 50% on VizWiz, compared with 55% for the non-private setting and 35% for zero-shot, retaining most of the in-context learning gain under the privacy constraint.

研究旨在为敏感应用如医疗成像提供隐私保护的多模态上下文学习。方法DP-MTV将私有数据分区，逐层剪裁激活，并添加校准噪声以支持具有形式差分隐私的多示例学习。在ε=1.0时，DP-MTV在VizWiz上的准确率为50%，与非隐私的55%和零样本的35%相比，保持了大部分上下文学习的好处。
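
The clip-aggregate-noise recipe described in the abstract above follows the general shape of the Gaussian mechanism. The sketch below is a generic illustration under stated assumptions, not DP-MTV itself: it handles a single layer, assumes replace-one adjacency over chunks (so the mean of `n` clipped vectors has L2 sensitivity `2 * max_norm / n`), and uses the classic Gaussian-mechanism noise scale, which is stated for ε < 1. All function names are hypothetical.

```python
import math
import random


def clip_l2(vec, max_norm):
    """Rescale vec so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def gaussian_sigma(sensitivity, epsilon, delta):
    """Classic Gaussian-mechanism noise scale (assumes 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def private_mean(chunk_vectors, max_norm, epsilon, delta, rng):
    """Clip each per-chunk vector, average, and add Gaussian noise once."""
    n = len(chunk_vectors)
    clipped = [clip_l2(v, max_norm) for v in chunk_vectors]
    dim = len(clipped[0])
    mean = [sum(v[i] for v in clipped) / n for i in range(dim)]
    # Assumed sensitivity of the mean under replace-one adjacency.
    sigma = gaussian_sigma(2.0 * max_norm / n, epsilon, delta)
    noisy = [m + rng.gauss(0.0, sigma) for m in mean]
    return noisy, sigma


# Toy example: two chunks, each summarized by a 2-dimensional vector.
rng = random.Random(0)
noisy_vec, sigma = private_mean(
    [[3.0, 4.0], [0.0, 0.5]], max_norm=1.0, epsilon=0.5, delta=1e-5, rng=rng
)
```

Because the noise is added to the aggregate once, the noisy vector can be reused for any number of downstream queries without further privacy cost, which matches the single-noise-addition property the abstract highlights.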

Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models

Authors: Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish

First: 2026-03-05T07:35:07+00:00 · Latest: 2026-03-05T07:35:07+00:00

Abstract

Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@$k$ problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training free, low cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model. Our results demonstrate significantly improved diversity and Pass@$k$ performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at https://github.com/sean-lamont/odd.

中文标题/摘要

标题：免费午餐？低成本多样化采样以提高扩散语言模型性能

文本生成中的多样化输出对于复杂推理任务（如代码生成和数学问题解决）的有效探索是必要的。这样的Pass@$k$问题可以从不同的候选方案中受益，这些方案覆盖了解空间。然而，传统的采样方法往往在重复的失败模式上浪费计算资源。虽然扩散语言模型已经成为了与自回归范式竞争的替代方案，但它们仍然容易受到这种冗余的影响，独立的样本经常陷入相似的模式。为了解决这个问题，我们提出了一种无需训练、低成本的干预措施，以增强扩散语言模型的生成多样性。我们的方法在批次中的中间样本之间顺序修改，每个样本都远离先前样本的特征空间，积极惩罚冗余。与需要重新训练或使用束搜索的先前方法不同，我们的策略几乎不增加计算开销，同时确保每个样本都为批次贡献了独特的视角。我们使用LLaDA-8B-Instruct模型在HumanEval和GSM8K基准上评估了我们的方法。我们的结果表明，在各种温度设置下，多样性显著提高，Pass@$k$性能也得到了改善。作为一种简单的采样过程修改，我们的方法为当前和未来的扩散语言模型在需要多样化解决方案搜索的任务中提供了即时、低成本的改进。我们将在https://github.com/sean-lamont/odd/提供我们的代码。

Summary / 总结

This paper addresses the need for diverse outputs in text generation tasks, such as code generation and mathematical problem solving, by proposing a low-cost intervention to enhance generative diversity in Diffusion Language Models. The method modifies intermediate samples in a batch by repelling each sample from the feature space of previous samples, ensuring unique contributions. Evaluations on HumanEval and GSM8K benchmarks with the LLaDA-8B-Instruct model show significantly improved diversity and Pass@$k$ performance across different temperature settings, offering a simple and cost-effective solution for current and future Diffusion Language Models.

本文提出了一种低成本干预方法，以增强扩散语言模型的生成多样性，解决代码生成和数学问题求解等文本生成任务中多样性的需求。该方法通过在批次中依次修改中间样本，使每个样本远离前一个样本的特征空间，确保每个样本的独特贡献。在HumanEval和GSM8K基准上使用LLaDA-8B-Instruct模型的评估结果显示，在不同温度设置下显著提高了多样性和Pass@$k$性能，为当前和未来的扩散语言模型提供了一个简单且成本效益高的解决方案。
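
The abstract above reports Pass@$k$ without defining it. A commonly used definition is the unbiased estimator introduced with the HumanEval benchmark: given `n` samples per problem of which `c` are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether this paper computes the metric exactly this way is an assumption; the sketch below only shows the standard estimator.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of samples drawn, c: number of correct samples, k: budget (k <= n).
    Returns the probability that at least one of k samples, drawn without
    replacement from the n, is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The estimator makes clear why diversity matters: near-duplicate samples tend to be correct or incorrect together, so redundant failures fill the sampling budget without raising `c`.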

RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations

Authors: I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu, Yu-Chien Chiang, Jen-Chien Yang, Wei-Ting Chen

First: 2026-02-25T15:27:57+00:00 · Latest: 2026-03-05T07:12:37+00:00

Comments: Accepted by CVPR2026; Project Page: https://robustvisrag.github.io

Abstract

Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.

中文标题/摘要

标题：RobustVisRAG：视觉降级条件下的因果关系意识视觉检索增强生成

基于视觉的检索增强生成（VisRAG）利用视觉语言模型（VLMs）联合检索相关视觉文档，并基于多模态证据生成基于地面的答案。然而，现有的VisRAG模型在视觉输入遭受模糊、噪声、低光照或阴影等失真时性能会下降，因为语义和失真因素在预训练视觉编码器中交织在一起，导致检索和生成阶段出现错误。为了解决这一局限性，我们提出了RobustVisRAG，这是一种因果关系指导下的双路径框架，可以提高VisRAG的鲁棒性，同时保持效率和零样本泛化能力。RobustVisRAG使用非因果路径通过单向注意力捕捉失真信号，并使用因果路径通过这些信号学习净化的语义。结合提出的非因果失真建模和因果语义对齐目标，该框架确保语义和失真之间的清晰分离，使在具有挑战性的视觉条件下检索和生成变得稳定。为了在现实条件下评估鲁棒性，我们引入了Distortion-VisRAG数据集，这是一个包含合成和真实世界降级文档的大规模基准，覆盖七个领域，包含12种合成和5种真实失真类型，全面反映了实际视觉降级。实验结果表明，RobustVisRAG在真实世界降级条件下分别提高了检索、生成和端到端性能7.35%、6.35%和12.40%，同时在干净输入上保持了相当的准确性。

Summary / 总结

RobustVisRAG is a causality-guided dual-path framework that enhances the robustness of Vision-based Retrieval-Augmented Generation (VisRAG) models under visual degradations. It uses a non-causal path to capture degradation signals and a causal path to learn purified semantics, which are aligned through specific objectives. This approach improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40% respectively on real-world degradations, while maintaining comparable accuracy on clean inputs. The framework is evaluated using the Distortion-VisRAG dataset, which includes both synthetic and real-world degraded documents across seven domains.

RobustVisRAG 是一种因果引导的双路径框架，旨在增强视觉检索增强生成（VisRAG）模型在视觉退化条件下的鲁棒性。该框架通过非因果路径捕捉退化信号，并通过因果路径学习净化的语义，这些语义通过特定目标进行对齐。这种方法在真实世界退化条件下分别提高了检索、生成和端到端性能 7.35%、6.35% 和 12.40%，同时在干净输入上保持了相当的准确性。该框架使用 Distortion-VisRAG 数据集进行评估，该数据集包含七个领域中的合成和真实世界退化文档，涵盖了实际视觉退化情况。

GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?

Authors: Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, Yingchun Wang

First: 2025-10-23T08:33:24+00:00 · Latest: 2026-03-05T06:51:24+00:00

Abstract

Vision-Language Models (VLMs) are increasingly deployed as autonomous agents to navigate mobile graphical user interfaces (GUIs). Operating in dynamic on-device ecosystems, which include notifications, pop-ups, and inter-app interactions, exposes them to a unique and underexplored threat vector: environmental injection. Unlike prompt-based attacks that manipulate textual instructions, environmental injection corrupts an agent's visual perception by inserting adversarial UI elements (for example, deceptive overlays or spoofed notifications) directly into the GUI. This bypasses textual safeguards and can derail execution, causing privacy leakage, financial loss, or irreversible device compromise. To systematically evaluate this threat, we introduce GhostEI-Bench, the first benchmark for assessing mobile agents under environmental injection attacks within dynamic, executable environments. Moving beyond static image-based assessments, GhostEI-Bench injects adversarial events into realistic application workflows inside fully operational Android emulators and evaluates performance across critical risk scenarios. We further propose a judge-LLM protocol that conducts fine-grained failure analysis by reviewing the agent's action trajectory alongside the corresponding screenshot sequence, pinpointing failure in perception, recognition, or reasoning. Comprehensive experiments on state-of-the-art agents reveal pronounced vulnerability to deceptive environmental cues: current models systematically fail to perceive and reason about manipulated UIs. GhostEI-Bench provides a framework for quantifying and mitigating this emerging threat, paving the way toward more robust and secure embodied agents.

中文标题/摘要

标题：GhostEI-Bench：移动代理在动态设备环境中对环境注入的韧性如何？

视觉-语言模型（VLMs）越来越多地被部署为自主代理以导航移动图形用户界面（GUIs）。在包括通知、弹出窗口和跨应用交互的动态设备生态系统中运行，使它们面临一种独特的、尚未充分探索的威胁向量：环境注入。与基于提示的攻击不同，后者操纵文本指令，环境注入通过直接插入敌对的UI元素（例如，欺骗性覆盖或伪造的通知）来篡改代理的视觉感知，从而绕过文本保护措施并可能导致执行中断、隐私泄露、财务损失或设备不可逆的破坏。为了系统地评估这一威胁，我们引入了GhostEI-Bench，这是首个评估移动代理在动态可执行环境中遭受环境注入攻击的基准。超越基于静态图像的评估，GhostEI-Bench在完全运行的Android模拟器中注入敌对事件到现实的应用工作流程中，并在关键风险场景中评估性能。我们进一步提出了一种裁判LLM协议，通过审查代理的动作轨迹与相应的屏幕截图序列来执行精细的失败分析，以确定感知、识别或推理中的失败。全面的实验表明，最先进的代理模型对欺骗性环境线索表现出明显的脆弱性：当前模型系统地无法感知和推理关于被操纵的UIs。GhostEI-Bench提供了一种量化和缓解这一新兴威胁的框架，为更稳健和安全的实体代理铺平了道路。

Summary / 总结

The research evaluates how resilient mobile agents, i.e., vision-language models that navigate mobile graphical user interfaces, are to environmental injection attacks in dynamic on-device environments. The study introduces GhostEI-Bench, a benchmark that injects adversarial events into realistic application workflows on Android emulators to assess performance under such attacks. Key findings show that state-of-the-art agents are highly vulnerable to deceptive UI manipulations, failing to perceive and reason about manipulated interfaces. This work provides a framework for quantifying and mitigating these threats, paving the way toward more robust and secure embodied agents.

研究旨在评估移动代理在动态设备环境中的抗环境注入攻击能力。研究引入了GhostEI-Bench基准，该基准将恶意事件注入到现实的应用工作流中，以评估其在这些攻击下的表现。关键发现表明，最先进的代理对欺骗性环境提示高度脆弱，无法感知和推理关于被操纵的UI。这项工作提供了一个量化和缓解这种威胁的框架，增强了实体代理的安全性。

On Multi-Step Theorem Prediction via Non-Parametric Structural Priors

Authors: Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang, Hua Huang

First: 2026-03-05T06:08:50+00:00 · Latest: 2026-03-05T06:08:50+00:00

Abstract

Multi-step theorem prediction is a central challenge in automated reasoning. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM's inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.

中文标题/摘要

标题：基于非参数结构先验的多步定理预测

多步定理预测是自动推理中的一个核心挑战。现有的神经符号方法主要依赖于监督参数模型，这些模型在处理不断演化的定理库时表现出有限的泛化能力。在本文中，我们通过上下文学习（ICL）的视角探索无训练的定理预测。我们识别出一个关键的可扩展性瓶颈，称为结构漂移：随着推理深度的增加，vanilla ICL的性能急剧下降，通常会崩溃到接近零。我们将这一失败归因于LLM无法恢复潜在的拓扑依赖性，导致无序探索。为了解决这个问题，我们提出了定理优先图，它将历史解题轨迹中的时间依赖性编码为有向图，并施加显式的拓扑约束，有效地在推理期间剪枝搜索空间。结合检索增强的图构建和逐步符号执行，我们的方法使LLM能够作为结构化规划者而无需任何基于梯度的优化。在FormalGeo7k基准测试上的实验表明，我们的方法达到了89.29%的准确率，显著优于ICL基线，并且与最先进的监督模型相当。这些结果表明，显式的结构先验为扩展基于LLM的符号推理提供了一个有希望的方向。

Summary / 总结

This work addresses the challenge of multi-step theorem prediction in automated reasoning by exploring training-free methods through in-context learning (ICL). The authors identify a scalability issue called Structural Drift, where vanilla ICL performance drops significantly with increased reasoning depth. To overcome this, they propose Theorem Precedence Graphs, which encode temporal dependencies and impose topological constraints, enabling LLMs to act as structured planners. Experiments on FormalGeo7k show their method achieves 89.29% accuracy, outperforming ICL baselines and matching state-of-the-art supervised models.

该研究通过提出编码历史解题轨迹中时间依赖性的定理优先图，解决了自动推理中的多步定理预测挑战。该方法通过显式的拓扑约束改进了语言模型的表现，避免了上下文学习的可扩展性瓶颈。实验结果显示，所提出的方法达到了89.29%的准确率，显著优于现有上下文学习基线，并与最先进的监督模型相当。
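
The idea of encoding solution traces as a directed graph and using it to prune candidates can be sketched as follows. This is a simplified illustration, not the paper's construction: it assumes an edge is added only between consecutive theorems in a trace, and it falls back to the full candidate list when the graph offers no guidance. The function names and the theorem labels in the usage note are hypothetical.

```python
from collections import defaultdict


def build_precedence_graph(traces):
    """Build a directed graph with an edge a -> b when b directly follows a in a trace."""
    graph = defaultdict(set)
    for trace in traces:
        for earlier, later in zip(trace, trace[1:]):
            graph[earlier].add(later)
    return graph


def prune_candidates(graph, applied, candidates):
    """Keep only candidates that have followed the last applied theorem before."""
    if not applied:
        return list(candidates)
    allowed = graph.get(applied[-1], set())
    kept = [c for c in candidates if c in allowed]
    # Fall back to all candidates if the graph has no matching successor.
    return kept or list(candidates)
```

For example, with traces `[["A", "B", "C"], ["A", "D"]]`, after applying `"A"` the candidates `["B", "C", "D"]` are pruned to `["B", "D"]`, so a planner would only need to score two options at that step.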

AutoV: Loss-Oriented Ranking for Visual Prompt Retrieval in LVLMs

Authors: Yuan Zhang, Chun-Kai Fan, Sicheng Yu, Junwen Pan, Tao Huang, Ming Lu, Kuan Cheng, Qi She, Shanghang Zhang

First: 2025-06-19T08:02:53+00:00 · Latest: 2026-03-05T05:25:24+00:00

Abstract

Inspired by text prompts in large language models, visual prompts have been explored to enhance the perceptual capabilities of large vision-language models (LVLMs). However, performance tends to saturate under single visual prompt designs, making further prompt engineering increasingly ineffective. To address this limitation, we shift from prompt engineering to prompt retrieval and propose AutoV, a lightweight framework for instance-adaptive visual prompt identification. Given an input image and a textual query, AutoV automatically locates the most suitable visual prompt from a diverse candidate pool. Training such a retrieval framework requires prompt-level supervision, yet prompt quality is inherently ambiguous and difficult to assess reliably, even for humans. To enable automatic supervision, we evaluate visual prompts using a pre-trained LVLM and label them according to their prediction losses. Using the loss-oriented ranking as a robust training signal, AutoV learns to retrieve the query-aware optimal prompt for each instance without manual annotation. Experiments indicate that AutoV enhances the performance of various LVLMs on image understanding, captioning, grounding, and classification tasks. For example, AutoV improves LLaVA-OV by 10.2% on VizWiz and boosts Qwen2.5-VL by 3.8% on MMMU.

中文标题/摘要

标题：AutoV：面向视觉提示检索的损失导向排名在LVLM中的应用

受大型语言模型中文本提示的启发，视觉提示已被探索以增强大型视觉-语言模型（LVLM）的感知能力。然而，在单一视觉提示设计下，性能往往会饱和，使得进一步的提示工程变得越来越无效。为了解决这一局限性，我们从提示工程转向提示检索，并提出AutoV，这是一种轻量级框架，用于实例自适应视觉提示识别。给定输入图像和文本查询，AutoV 自动从多样化的候选池中定位最合适的视觉提示。训练这种检索框架需要提示级别的监督，但提示质量本质上是模糊的，即使对人类来说也难以可靠地评估。为了实现自动监督，我们使用预训练的LVLM评估视觉提示，并根据其预测损失对其进行标记。利用基于损失的排名作为稳健的训练信号，AutoV 学习在每个实例中检索与查询相关的最佳提示，而无需手动注释。实验表明，AutoV 在图像理解、描述、定位和分类任务中提高了各种LVLM的表现。例如，AutoV 在VizWiz上将LLaVA-OV 的性能提高了10.2%，在MMMU上将Qwen2.5-VL 的性能提高了3.8%。

Summary / 总结

The paper introduces AutoV, a framework for automatically retrieving visual prompts from a diverse pool to enhance the performance of large vision-language models (LVLMs). It uses a loss-oriented ranking method to train the framework without manual annotation, improving tasks like image understanding, captioning, and classification. For instance, AutoV boosts LLaVA-OV by 10.2% on VizWiz and Qwen2.5-VL by 3.8% on MMMU.

论文提出了AutoV框架，用于自动检索视觉提示以增强大型视觉语言模型的性能。该框架使用基于损失的排名方法进行训练，无需人工标注，提升了一系列任务，如图像理解、描述、定位和分类。例如，AutoV在VizWiz上将LLaVA-OV的性能提高了10.2%，在MMMU上将Qwen2.5-VL的性能提高了3.8%。
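
The loss-oriented labeling step described above, scoring each candidate visual prompt with a pre-trained LVLM and ranking by prediction loss, reduces to a sort once the losses are available. The sketch below assumes the per-prompt losses have already been computed; the prompt names and loss values are made up for illustration, and how AutoV turns ranks into a training objective is not shown.

```python
def rank_prompts_by_loss(losses):
    """Return prompt names ordered from lowest to highest loss."""
    return sorted(losses, key=losses.get)


def rank_labels(losses):
    """Map each prompt name to its rank (0 = lowest loss, i.e. preferred)."""
    ordered = rank_prompts_by_loss(losses)
    return {name: rank for rank, name in enumerate(ordered)}


# Hypothetical losses from a frozen LVLM for one (image, query) pair.
example_losses = {"bounding_box": 1.8, "circle": 0.9, "blur_background": 1.2}
labels = rank_labels(example_losses)
```

Using ranks rather than raw loss values keeps the supervision comparable across instances whose absolute losses differ in scale, which is one plausible reading of why a ranking signal is described as robust.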
