Accelerating Text-to-Video Generation with Calibrated Sparse Attention
Authors: Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar
First: 2026-03-05T18:59:32+00:00 · Latest: 2026-03-05T18:59:32+00:00
Abstract
Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
中文标题/摘要
标题:校准稀疏注意加速文本到视频生成
最近的扩散模型能够生成高质量的视频,但运行速度较慢。这些模型中使用的大型基于变压器的骨干网络由于时空注意机制而成为瓶颈。在本文中,我们发现许多词元到词元的连接在各种输入中持续产生微不足道的分数,并且其模式在查询之间经常重复。因此,在这些情况下可以跳过注意计算,对结果影响甚微。这一观察结果也适用于局部词元块之间的连接。受此启发,我们引入了CalibAtt,这是一种无需训练的方法,通过校准稀疏注意来加速视频生成。CalibAtt 进行了一次离线校准过程,以识别在输入之间稳定的块级稀疏性和重复模式,并将这些模式编译为每层、每个头和每个扩散时间步的优化注意操作。在推理时,我们密集地计算选定的输入相关连接,并以硬件高效的方式跳过未选中的连接。在Wan 2.1 14B、Mochi 1和不同分辨率下的少量步骤蒸馏模型上进行的广泛实验表明,CalibAtt 可以实现高达1.58倍的端到端加速,同时优于现有无需训练的方法,保持视频生成质量和文本-视频对齐。
Summary / 总结
This paper addresses the slow runtime of diffusion models for text-to-video generation by proposing CalibAtt, a training-free method based on calibrated sparse attention. An offline calibration pass identifies block-level sparsity and repetition patterns that are stable across inputs, and inference skips the unselected token-to-token connections. Experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models show up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video quality and text-video alignment.
论文针对用于高质量视频生成的扩散模型运行缓慢的问题,提出了一种名为CalibAtt的无训练方法,该方法通过离线校准识别并跳过无用的token-to-token连接,从而在保持视频质量和文本-视频对齐的情况下,实现高达1.58倍的端到端加速。
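The calibrate-then-skip idea described for CalibAtt can be illustrated with a toy single-head example. This is a minimal sketch, not the authors' implementation: the function names, the average-attention-mass criterion, the threshold, and the rule of always keeping diagonal blocks are assumptions made for illustration, and the pure-Python loops show the logic only, not the hardware-efficient kernels the abstract refers to.

```python
import math

def _softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def _scores(qi, keys):
    # Scaled dot-product scores of one query against a list of keys.
    scale = 1.0 / math.sqrt(len(qi))
    return [scale * sum(a * b for a, b in zip(qi, kj)) for kj in keys]

def dense_attention(q, k, v):
    """Reference attention for one head; q, k, v are lists of vectors."""
    out = []
    for qi in q:
        w = _softmax(_scores(qi, k))
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) for d in range(len(v[0]))])
    return out

def calibrate_block_mask(calib_qk, block, threshold):
    """Offline pass (hypothetical criterion): keep a (query-block, key-block)
    pair if its attention mass, averaged over the calibration inputs, reaches
    `threshold`. Assumes equal query/key lengths divisible by `block`."""
    nb = len(calib_qk[0][0]) // block
    mass = [[0.0] * nb for _ in range(nb)]
    for q, k in calib_qk:
        for i, qi in enumerate(q):
            for j, wj in enumerate(_softmax(_scores(qi, k))):
                mass[i // block][j // block] += wj
    norm = len(calib_qk) * block  # each query row contributes total mass 1
    # Assumption: diagonal blocks are always kept so no query is left without keys.
    return [[a == b or mass[a][b] / norm >= threshold for b in range(nb)]
            for a in range(nb)]

def block_sparse_attention(q, k, v, mask, block):
    """Inference: attend densely inside kept blocks, skip the rest entirely."""
    out = []
    for i, qi in enumerate(q):
        keep = [j for j in range(len(k)) if mask[i // block][j // block]]
        w = _softmax(_scores(qi, [k[j] for j in keep]))
        out.append([sum(wj * v[j][d] for wj, j in zip(w, keep)) for d in range(len(v[0]))])
    return out
```

With a mask that keeps every block, `block_sparse_attention` reproduces `dense_attention`; raising the threshold trades accuracy for fewer computed blocks. The abstract's per-layer, per-head, per-timestep masks would correspond to calling `calibrate_block_mask` once for each such combination.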
HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token
Authors: Sai Akhil Kogilathota, Sripadha Vallabha E G, Luzhe Sun, Jiawei Zhou
Venue: The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)
First: 2026-03-05T18:36:31+00:00 · Latest: 2026-03-05T18:36:31+00:00
Abstract
Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
中文标题/摘要
标题:HALP:无需生成单个词元即可检测视觉语言模型中的幻觉
幻觉仍然是视觉语言模型(VLMs)的一个持续性挑战,它们经常描述不存在的对象或编造事实。现有的检测方法通常在文本生成之后进行操作,使得干预既昂贵又不及时。我们研究了是否可以在生成任何词元之前通过探测模型的内部表示来预测幻觉风险。在一系列视觉语言任务和八种现代VLMs(包括Llama-3.2-Vision、Gemma-3、Phi-4-VL和Qwen2.5-VL)中,我们检查了三种内部表示家族:(i)仅视觉特征而不进行多模态融合,(ii)文本解码器中的视觉词元表示,以及(iii)在生成之前整合视觉和文本信息的查询词元表示。基于这些表示训练的探测器在无需解码的情况下实现了强大的幻觉检测性能,达到Gemma-3-12B、Phi-4-VL 5.6B和Molmo 7B上的0.93 AUROC。大多数模型中,后期查询词元状态最具预测性,而视觉或中间层特征在少数架构中占主导地位(例如,Qwen2.5-VL-7B使用仅视觉特征的AUROC约为0.79)。这些结果表明:(1)幻觉风险可以在生成之前检测到;(2)最具信息量的层和模态在不同架构中有所不同;(3)轻量级探测器有可能实现早期避免、选择性路由和自适应解码,以提高安全性和效率。
Summary / 总结
The paper introduces HALP, a method for detecting hallucinations in vision-language models before any token is generated. Probes trained on internal representations from a single forward pass reach up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Across eight VLMs, late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures.
研究通过提出一种在文本生成之前预测幻觉风险的方法,来应对视觉语言模型中的幻觉问题。它利用内部模型表示的探针来检测幻觉,实现了高达0.93 AUROC的强性能。结果表明,不同的层和模态对不同的架构来说是最具预测性的,并且轻量级的探针能够实现早期干预,以提高安全性和效率。
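To make the probing setup and the AUROC figures concrete, here is a minimal sketch of a linear probe score and a pairwise AUROC computation. The probe form (logistic over a hidden-state vector), the function names, and the toy numbers are assumptions for illustration; the abstract does not specify the probe architecture.

```python
import math

def probe_score(hidden_state, weights, bias):
    """Hypothetical linear probe: maps one hidden-state vector to a
    hallucination-risk score in (0, 1) via a logistic function."""
    z = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: label 1 = the model went on to hallucinate, 0 = it did not.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(auroc(toy_scores, toy_labels))  # 0.75: three of four pairs ranked correctly
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported 0.93 and ~0.79 values should be read.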
Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes
Authors: Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan
Venue: ICLR 2026
First: 2026-03-05T18:25:26+00:00 · Latest: 2026-03-05T18:25:26+00:00
Comments: Accepted at ICLR 2026
Abstract
Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance': committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
中文标题/摘要
标题:超越零散接受:通过最长稳定前缀实现DLMs的快速和连贯推理
扩散语言模型(DLMs)承诺实现高度并行的文本生成,但其实用推理速度往往受限于次优解码调度器。标准方法依赖于“零散接受”——在序列中不连续位置上提交高置信度的标记。这种方法无意中破坏了键值(KV)缓存,破坏了内存局部性,并迫使模型在不稳定的标记边界上进行昂贵的重复修复。为了解决这个问题,我们提出了最长稳定前缀(LSP)调度器,这是一种基于整体前缀吸收的无训练和模型无关的推理范式。在每个去噪步骤中,LSP 通过单次前向传播评估标记的稳定性,动态识别一个连续的左对齐的稳定预测块,并在原子提交前将其边界对齐到自然语言或结构分隔符。这种前缀优先的拓扑结构带来了双重好处:系统上,它将碎片化的KV缓存更新转换为高效的连续追加;算法上,它保留了对几何缩小的活动后缀的双向前瞻,大幅减少了标记翻转率和去噪器调用次数。在LLaDA-8B和Dream-7B上的广泛评估表明,LSP 在包括数学推理、代码生成、多语言(CJK)任务和创造性写作在内的严格基准测试中将推理加速了高达3.4倍,同时保持或略微提高了输出质量。通过从根本上重新结构化提交拓扑,LSP 桥接了DLMs的理论并行性和实际硬件效率之间的差距。
Summary / 总结
The paper addresses the slow inference of Diffusion Language Models (DLMs) caused by decoding schedulers that commit high-confidence tokens at disjoint positions, which fractures the KV cache and forces repeated repairs. It introduces the Longest Stable Prefix (LSP) scheduler, which evaluates token stability in a single forward pass, identifies a contiguous left-aligned block of stable predictions, snaps its boundary to a linguistic or structural delimiter, and commits the block atomically. On LLaDA-8B and Dream-7B, this accelerates inference by up to 3.4x while matching or slightly improving output quality, by preserving bidirectional lookahead over the remaining suffix and reducing token flip rates and denoiser calls.
论文解决了由于解码调度器不理想而导致的扩散语言模型(DLMs)推理速度慢的问题。它提出了最长稳定前缀(LSP)调度器,通过单次前向传播评估token稳定性,并提交一个连续的稳定预测块,从而改善了内存局部性并减少了昂贵的修复需求。在LLaDA-8B和Dream-7B上的实验表明,LSP可以将推理加速多达3.4倍,同时保持或略微提高输出质量。
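The prefix-selection step of LSP can be sketched as follows. This is an illustrative reading of the abstract only: using a per-token confidence threshold as the stability test, the particular delimiter set, and the fallback of committing at least one token are assumptions, and the function name is hypothetical.

```python
def longest_stable_prefix(tokens, confidences, threshold=0.9,
                          delimiters=(".", ",", ";", "\n")):
    """Return how many leading tokens of the active suffix to commit.

    tokens       -- predicted tokens for the still-uncommitted suffix
    confidences  -- one score per token from a single forward pass
                    (assumed stability signal; the paper may use another)
    """
    # 1. Longest contiguous, left-aligned run of stable predictions.
    stable = 0
    for c in confidences:
        if c < threshold:
            break
        stable += 1
    if stable == len(tokens):
        return stable  # the whole suffix is stable, commit it all

    # 2. Snap the boundary back to the last delimiter inside that run,
    #    so the committed block ends at a natural break.
    for end in range(stable, 0, -1):
        if tokens[end - 1] in delimiters:
            return end

    # 3. No delimiter found: commit the raw run, and at least one token
    #    so that decoding always makes progress (assumed fallback).
    return max(stable, 1)


tokens = ["The", "answer", "is", "42", ".", "So", "we", "stop"]
confs = [0.99, 0.97, 0.95, 0.93, 0.96, 0.92, 0.40, 0.90]
n = longest_stable_prefix(tokens, confs)
print(tokens[:n])  # ['The', 'answer', 'is', '42', '.']
```

Because each step commits only a contiguous prefix, the committed keys and values can be appended to the KV cache in order, which is the contiguous-append behaviour the abstract contrasts with scattered acceptance.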
RelaxFlow: Text-Driven Amodal 3D Generation
Authors: Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao
First: 2026-03-05T17:45:47+00:00 · Latest: 2026-03-05T17:45:47+00:00
Comments: Code: https://github.com/viridityzhu/RelaxFlow
Abstract
Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.
中文标题/摘要
标题:RelaxFlow:文本驱动的无遮挡3D生成
在遮挡下,从图像到3D的生成面临着固有的语义模糊性,仅凭部分观察往往不足以确定物体类别。在本文中,我们形式化了文本驱动的无遮挡3D生成,其中文本提示引导未见区域的完成,同时严格保留输入观察。关键的是,我们发现这些目标需要不同的控制粒度:对观察进行刚性控制,而对提示进行放松的结构控制。为此,我们提出了RelaxFlow,这是一种无需训练的双分支框架,通过多先验一致性模块和放松机制解耦控制粒度。理论上,我们证明我们的放松等同于在生成向量场中应用低通滤波器,这抑制了高频实例细节,以隔离几何结构,使其适应观察。为了便于评估,我们引入了两个诊断基准,ExtremeOcc-3D和AmbiSem-3D。广泛的实验表明,RelaxFlow成功地引导了未见区域的生成,使其与提示意图一致,而不牺牲视觉保真度。
Summary / 总结
The research addresses the challenge of generating complete 3D models from partial observations, using text prompts to guide the unseen parts while preserving the observed regions. It introduces RelaxFlow, a dual-branch framework that uses a Multi-Prior Consensus Module and a Relaxation Mechanism to control the generation process at different granularities. Experiments show that RelaxFlow effectively matches the prompt intent for unseen regions without sacrificing visual quality.
该研究旨在通过文本提示从部分观察中生成完整的3D模型。提出的RelaxFlow框架采用双分支方法,通过多先验一致性模块和放松机制来解耦观察区域和文本驱动区域的控制粒度。实验表明,RelaxFlow能够有效生成与文本提示一致的未观察区域,同时保持视觉保真度。
ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking
Authors: Sijia Chen, Zihan Zhou, Yanqiu Yu, En Yu, Wenbing Tao
First: 2026-03-05T17:15:01+00:00 · Latest: 2026-03-05T17:15:01+00:00
Comments: https://github.com/chen-si-jia/ORMOT
Abstract
Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.
中文标题/摘要
标题:ORMOT:全方位引用多目标跟踪的数据集和框架
多目标跟踪(MOT)是计算机视觉中的一个基本任务,旨在跨视频帧跟踪目标。现有的MOT方法在通用视觉场景中表现良好,但在扩展到视觉语言设置时面临重大挑战和限制。为了解决这一差距,最近提出了引用多目标跟踪(RMOT)任务,旨在跟踪与语言描述对应的物体。然而,当前的RMOT方法主要是在由传统相机捕获的数据集上开发的,这些数据集的视野有限。这种限制通常会导致目标移出画面,从而导致跟踪片段化并丢失上下文信息。在本文中,我们提出了一项新的任务,称为全方位引用多目标跟踪(ORMOT),该任务将RMOT扩展到全方位图像,旨在克服传统数据集的视野限制,并提高模型理解长时语言描述的能力。为了推进ORMOT任务,我们构建了ORSet,一个全方位引用多目标跟踪数据集,包含27个多样化的全方位场景、848个语言描述和3,401个标注物体,提供了丰富的视觉、时间和语言信息。此外,我们提出了ORTrack,一种针对全方位引用多目标跟踪的大型视觉-语言模型驱动框架。在ORSet数据集上的广泛实验表明,我们的ORTrack框架是有效的。数据集和代码将在https://github.com/chen-si-jia/ORMOT开放。
Summary / 总结
The research addresses the limitations of existing Referring Multi-Object Tracking (RMOT) methods, which are developed on conventional-camera datasets with a limited field of view, by proposing Omnidirectional Referring Multi-Object Tracking (ORMOT). The authors construct ORSet, a dataset for ORMOT with 27 omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects. They also introduce ORTrack, a framework driven by a Large Vision-Language Model (LVLM), to tackle ORMOT. Experiments on ORSet demonstrate the effectiveness of ORTrack.
研究旨在通过提出新的Omnidirectional Referring Multi-Object Tracking (ORMOT)任务来解决现有MOT方法在视觉-语言设置中的局限性。作者开发了ORSet数据集,包含27个全景场景、848个语言描述和3,401个标注对象,并引入了ORTrack,一种基于大型视觉-语言模型的框架。实验表明ORTrack在ORMOT任务中的有效性。
OpenFrontier: General Navigation with Visual-Language Grounded Frontiers
Authors: Esteban Padilla, Boyang Sun, Marc Pollefeys, Hermann Blum
First: 2026-03-05T17:02:22+00:00 · Latest: 2026-03-05T17:02:22+00:00
Abstract
Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select navigation frontiers as semantic anchors and propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D mapping, policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.
中文标题/摘要
标题:OpenFrontier:基于视觉-语言引导边界的通用导航
开放世界的导航要求机器人在复杂的日常环境中做出决策并适应灵活的任务需求。传统的导航方法通常依赖于密集的3D重建和手工制作的目标度量标准,这限制了它们在不同任务和环境中的泛化能力。视觉-语言导航(VLN)和视觉-语言-动作(VLA)模型的最新进展使基于自然语言的端到端策略成为可能,但通常需要交互式训练、大规模数据收集或针对移动代理的任务特定微调。我们将导航问题表述为稀疏子目标识别和到达问题,并观察到提供视觉锚定目标以支持高层语义先验可以实现高效的基于目标的导航。基于这一洞察,我们选择导航边界作为语义锚点,并提出OpenFrontier,这是一种无需训练的导航框架,能够无缝集成多种视觉-语言先验模型。OpenFrontier 通过轻量级系统设计实现了高效的导航,无需密集的3D映射、策略训练或模型微调。我们在多个导航基准上评估了OpenFrontier,并展示了其强大的零样本性能,以及在移动机器人上的有效实际部署。
Summary / 总结
The paper addresses the challenge of open-world navigation by proposing OpenFrontier, a training-free framework that uses visual and language priors to identify and reach sparse subgoals. This approach enables efficient navigation without the need for dense 3D mapping, policy training, or model fine-tuning, and demonstrates strong zero-shot performance across multiple benchmarks and real-world deployment on a mobile robot.
论文提出了一种名为OpenFrontier的无需训练的框架,利用视觉和语言先验来识别和到达稀疏子目标,从而实现高效的导航,无需密集的3D建图、策略训练或模型微调,并在多个基准测试和移动机器人的真实世界部署中展示了强大的零样本性能。
Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models
Authors: Haidong Kang, Jun Du, Lihong Lin
First: 2025-12-08T10:52:55+00:00 · Latest: 2026-03-05T16:57:43+00:00
Abstract
Mixed-Precision Quantization (MPQ) liberates Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck and has garnered increasing research attention. However, conventional methods either rely on costly differentiable optimization search, which is neither efficient nor flexible, or learn a quantized DNN from a proxy (e.g., HAWQ) manually designed by human experts, which is labor-intensive and requires extensive expert knowledge. Can we design a proxy without involving any human experts or training? In this paper, we provide an affirmative answer by proposing a novel Large Language Model (LLM)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework. It reforms the design paradigm of MPQ by utilizing LLMs and evolutionary search strategies to automatically find superior TAP tailored for MPQ. In addition, to bridge the gap between black-box LLMs and the challenging MPQ task, we introduce a lightweight Direct Preference Optimization (DPO)-based strategy controller that dynamically reweights the selection probabilities of the three prompt templates for evolutionary search strategies according to fitness signals, without fine-tuning the LLM. This forms a task-aware feedback loop that improves proxy generation across evolutions. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we believe that our TAP will significantly contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.
中文标题/摘要
标题:革新混合精度量化:通过大型语言模型实现无需训练的自动代理发现
混合精度量化(MPQ)使深度神经网络(DNNs)摆脱了内存不足(OOM)的瓶颈,并引起了越来越多的研究关注。然而,传统方法要么依赖于昂贵的可微优化搜索,这既不高效也不灵活,要么从人类专家手动设计的代理(例如HAWQ)中学习量化DNN,这既耗时又需要大量专家知识。我们能否设计一个无需任何人类专家或训练的代理?在本文中,我们通过提出一种新颖的大型语言模型(LLM)驱动的无需训练的自动代理(简称TAP)发现框架,给出了肯定的答案。该框架通过利用LLM和进化搜索策略,自动发现适用于MPQ的优质TAP,改革了MPQ的设计范式。此外,为了弥合黑盒LLM与挑战性的MPQ任务之间的差距,我们引入了一种轻量级的直接偏好优化(DPO)为基础的策略控制器,根据适应度信号动态调整进化搜索策略中三种提示模板的选择概率,无需对LLM进行微调。这形成了一种任务感知的反馈循环,提高了代理生成的性能。在主流基准上的广泛实验表明,TAP达到了最先进的性能。最后,我们认为,我们的TAP将通过提供一种LLM驱动设计算法的新视角,对MPQ社区产生重大贡献。
Summary / 总结
This paper addresses the challenge of designing Mixed-Precision Quantization (MPQ) proxies without human experts or training. It introduces a framework for discovering a Training-free Automatic Proxy (TAP), using Large Language Models (LLMs) and evolutionary search strategies to automatically find superior proxies tailored for MPQ. A lightweight Direct Preference Optimization (DPO)-based strategy controller dynamically reweights the selection probabilities of three prompt templates according to fitness signals, without fine-tuning the LLM, which improves proxy generation across evolutions. Experiments on mainstream benchmarks show that TAP achieves state-of-the-art performance.
本文解决了无需人工干预或训练即可设计混合精度量化(MPQ)代理的问题。它提出了一种TAP框架,利用大型语言模型(LLMs)和进化搜索策略自动发现最优代理。TAP框架包含一个轻量级的直接偏好优化(DPO)策略控制器,根据适应度信号动态调整提示模板的选择概率,从而提高代理生成效果。实验表明,TAP在主流基准上优于现有方法,为MPQ设计提供了新的视角。
Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers
Authors: Guandong Li
First: 2026-03-05T15:58:06+00:00 · Latest: 2026-03-05T15:58:06+00:00
Abstract
Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time, depth, and feature dimensions. In this work, we identify three orthogonal axes of non-uniformity in DiT denoising: (1) temporal: sensitivity to caching errors varies dramatically across the denoising trajectory; (2) depth: consecutive caching decisions lead to cascading approximation errors; and (3) feature: different components of the hidden state exhibit heterogeneous temporal dynamics. Based on these observations, we propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) by 16% in speed while maintaining comparable quality (LPIPS difference < 1%). Our approach is training-free, plug-and-play, and compatible with existing DiT architectures.
中文标题/摘要
标题:基于频率感知的误差有界缓存加速扩散变换器
扩散变换器(DiTs)已成为高质量图像和视频生成的主要架构,但其迭代去噪过程在推理时会带来巨大的计算成本。现有的缓存方法通过在时间步之间重用中间计算来加速DiTs,但它们共同的局限性在于将去噪过程视为在时间、深度和特征维度上均匀的。在这项工作中,我们识别了DiT去噪中的三个非均匀轴:(1)时间——缓存误差对去噪轨迹的敏感性在不同阶段差异巨大;(2)深度——连续的缓存决策会导致级联的近似误差;(3)特征——隐藏状态的不同组成部分表现出异质的时间动态。基于这些观察,我们提出了SpectralCache,这是一种统一的缓存框架,包括时间感知动态调度(TADS)、累积误差预算(CEB)和频率分解缓存(FDC)。在FLUX.1-schnell,512x512分辨率下,SpectralCache实现了2.46倍的加速,LPIPS为0.217,SSIM为0.727,比TeaCache(2.12倍,LPIPS为0.215,SSIM为0.734)快16%,同时保持了相当的质量(LPIPS差异<1%)。我们的方法是无需训练的、即插即用的,并且与现有的DiT架构兼容。
Summary / 总结
This paper addresses the high computational cost of inference in Diffusion Transformers (DiTs) by proposing SpectralCache, a frequency-aware caching framework. It introduces Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC) to account for non-uniformity in the denoising process. On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves a 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache by 16% in speed while maintaining similar quality metrics.
该研究通过提出SpectralCache统一缓存框架来解决Diffusion Transformers (DiTs)在推理过程中的高计算成本问题。SpectralCache结合了Timestep-Aware Dynamic Scheduling、Cumulative Error Budgets和Frequency-Decomposed Caching,以应对去噪过程中的非均匀性。在FLUX.1-schnell的512x512分辨率下,SpectralCache实现了2.46倍的加速,LPIPS为0.217,SSIM为0.727,比TeaCache在速度上提高了16%,同时保持了相近的质量。
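The headline comparison with TeaCache can be checked directly from the numbers quoted in the abstract. The short script below is only an arithmetic check; the helper name `relative_change` is ours, and no values beyond those in the abstract are used.

```python
def relative_change(value, baseline):
    """Signed change of `value` relative to `baseline`, as a fraction."""
    return (value - baseline) / baseline

# Figures quoted in the abstract (FLUX.1-schnell, 512x512).
spectral = {"speedup": 2.46, "lpips": 0.217, "ssim": 0.727}
teacache = {"speedup": 2.12, "lpips": 0.215, "ssim": 0.734}

changes = {m: relative_change(spectral[m], teacache[m]) for m in spectral}
for metric, change in changes.items():
    print(f"{metric}: {change:+.2%}")
# speedup: +16.04%
# lpips: +0.93%
# ssim: -0.95%
```

Since lower LPIPS and higher SSIM are better, SpectralCache is marginally behind TeaCache on both quality metrics (each by under 1%) in exchange for the 16% speed gain, which matches the abstract's "comparable quality" wording.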
FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding
Authors: Janghoon Cho, Jungsoo Lee, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi
Venue: ICLR 2026
First: 2025-10-31T17:29:39+00:00 · Latest: 2026-03-05T15:43:07+00:00
Comments: Accepted to ICLR 2026
Abstract
Recent studies in long video understanding have harnessed the advanced visual-language reasoning capabilities of Large Multimodal Models (LMMs), driving the evolution of video-LMMs specialized for processing extended video sequences. However, the scalability of these models is severely limited by the overwhelming volume of visual tokens generated from extended video sequences. To address this challenge, we propose FLoC, an efficient visual token compression framework based on the facility location function, a principled approach that swiftly selects a compact yet highly representative and diverse subset of visual tokens within a predefined budget on the number of visual tokens. By integrating the lazy greedy algorithm, our method achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens, drastically reducing the number of visual tokens while guaranteeing near-optimal performance. Notably, our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution that seamlessly integrates with diverse video-LLMs and existing workflows. Extensive evaluations on large-scale benchmarks, such as Video-MME, MLVU, LongVideoBench, and EgoSchema, show that our framework consistently surpasses recent compression techniques, highlighting its effectiveness and robustness in addressing the challenges of long video understanding as well as its processing efficiency.
中文标题/摘要
标题:FLoC:基于设施位置的高效视觉标记压缩框架以实现长视频理解
近期关于长视频理解的研究利用了大型多模态模型(LMMs)的先进视觉-语言推理能力,推动了专门用于处理扩展视频序列的视频-LMMs的发展。然而,这些模型的可扩展性受到从扩展视频序列生成的大量视觉标记的限制。为了解决这一挑战,我们提出了FLoC,一种基于设施位置函数的高效视觉标记压缩框架,这是一种原理性的方法,能够迅速选择在预定义的视觉标记数量预算内具有高度代表性且多样化的紧凑子集。通过集成懒惰贪婪算法,我们的方法通过迅速选择紧凑的标记子集实现了显著的效率提升,大幅减少了视觉标记的数量,同时保证了接近最优的性能。值得注意的是,我们的方法是无需训练的、模型无关的、查询无关的,提供了一种灵活的解决方案,能够无缝集成到各种视频-LLMs和现有工作流中。在Video-MME、MLVU、LongVideoBench和EgoSchema等大规模基准上的广泛评估表明,我们的框架在压缩技术方面始终优于近期的技术,突显了其在解决长视频理解挑战方面的有效性、鲁棒性以及处理效率。
Summary / 总结
The paper addresses the scalability issue of Large Multimodal Models (LMMs) in long video understanding by proposing FLoC, a facility location-based visual token compression framework. FLoC uses a lazy greedy algorithm to efficiently select a compact subset of visual tokens, reducing the computational burden while maintaining performance. Experimental results on large-scale benchmarks demonstrate that FLoC outperforms recent compression techniques, showing its effectiveness and efficiency in long video understanding tasks.
论文通过提出基于设施位置函数的视觉标记压缩框架FLoC,解决了大型多模态模型(LMMs)在长视频理解中的可扩展性问题。FLoC 使用设施位置函数和懒惰贪婪算法高效地选择一个紧凑的标记子集,减少标记数量同时保持接近最优的性能。该方法是训练无关、模型无关和查询无关的,且在大规模基准测试上优于最近的压缩技术,展示了其在长视频理解中的有效性和鲁棒性。
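The selection step behind FLoC can be sketched with the standard facility location objective, f(S) = sum over tokens i of max over selected j of sim(i, j), maximized by lazy greedy. This is a generic illustration rather than the authors' code: the similarity matrix, the 1-D toy "embeddings", and the function names are assumptions, and similarities are assumed non-negative.

```python
import heapq

def marginal_gain(cover, sims, j):
    """Increase in the facility location objective if candidate j is added.
    cover[i] is token i's best similarity to the already selected set."""
    return sum(max(sims[i][j] - cover[i], 0.0) for i in range(len(cover)))

def lazy_greedy_select(sims, budget):
    """Pick up to `budget` token indices, greedily maximizing coverage.

    Lazy greedy keeps possibly stale gains in a max-heap and only
    recomputes the top entry; submodularity guarantees stale gains are
    upper bounds, so a fresh top entry is the true greedy choice."""
    n = len(sims)
    cover = [0.0] * n
    selected = []
    # Entries: (-gain, candidate, number of selections when gain was computed)
    heap = [(-marginal_gain(cover, sims, j), j, 0) for j in range(n)]
    heapq.heapify(heap)
    while heap and len(selected) < budget:
        neg_gain, j, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Gain is up to date: commit j and update per-token coverage.
            selected.append(j)
            cover = [max(c, sims[i][j]) for i, c in enumerate(cover)]
        else:
            # Gain is stale: refresh it and push the candidate back.
            heapq.heappush(heap, (-marginal_gain(cover, sims, j), j, len(selected)))
    return selected

# Toy tokens on a line: two tight clusters and one outlier.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
sims = [[1.0 / (1.0 + abs(a - b)) for b in points] for a in points]
picked = lazy_greedy_select(sims, budget=3)
print(picked)
```

With a budget of three, the selection takes one representative from each cluster plus the outlier, which is the "representative and diverse" behaviour the abstract attributes to the facility location function. Greedy maximization of a monotone submodular function carries the classical (1 - 1/e) approximation guarantee, which is presumably the sense of "near-optimal" in the abstract.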
Pursuing Minimal Sufficiency in Spatial Reasoning
Authors: Yejie Guo, Yunzhong Hou, Wufei Ma, Meng Tang, Ming-Hsuan Yang
First: 2025-10-19T02:29:09+00:00 · Latest: 2026-03-05T14:41:14+00:00
Abstract
Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from expert models. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.
中文标题/摘要
标题:追求空间推理的最小充分性
空间推理,即在三维理解基础上将语言接地的能力,仍然是视觉-语言模型(VLMs)的一个持续性挑战。我们识别出两个根本瓶颈:源于二维中心预训练的不充分三维理解能力,以及由冗余三维信息引起的推理失败。为解决这些问题,我们首先在回答给定问题之前构建一个最小充分集(MSS)的信息:从专家模型中提取的紧凑三维感知结果的选择。我们引入了MSSR(最小充分空间推理器),这是一种双智能体框架,实现了这一原则。感知智能体使用多功能感知工具箱程序化地查询三维场景,提取足够的信息,包括一个新颖的SOG(情境定向接地)模块,该模块能够稳健地提取语言导向的方向。推理智能体随后迭代地精炼这些信息,追求最小性,通过闭环修剪冗余细节并请求缺失信息,直到MSS被精心挑选出来。大量实验表明,通过明确追求充分性和最小性,我们的方法显著提高了准确性,并在两个具有挑战性的基准上达到了最先进的性能。此外,我们的框架生成可解释的推理路径,为未来模型提供了一种高质量的训练数据来源。源代码可在https://github.com/gyj155/mssr/获得。
Summary / 总结
This paper addresses the challenge of spatial reasoning in Vision-Language Models by identifying two key issues: inadequate 3D understanding from 2D-centric pre-training and reasoning failures due to redundant 3D information. To tackle these, the authors propose MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that constructs a Minimal Sufficient Set (MSS) of necessary 3D information. The Perception Agent queries 3D scenes using a versatile perception toolbox, while the Reasoning Agent iteratively refines this information to ensure both sufficiency and minimality. Experimental results show that MSSR significantly improves accuracy and achieves state-of-the-art performance on two challenging benchmarks, while also producing interpretable reasoning paths.
论文通过识别两个关键问题——源自2D中心预训练的不足3D理解以及由于冗余3D信息导致的推理失败,来解决视觉-语言模型中的空间推理挑战。为此,作者提出了MSSR(最小充分空间推理器)框架,该框架构建了一个最小充分集(MSS)的3D感知结果。该框架使用感知代理查询3D场景并提取必要信息,以及使用推理代理逐步精炼这些信息以减少冗余。实验表明,MSSR在两个基准测试中显著提高了准确性,并达到了最先进的性能,同时提供了可解释的推理路径,为未来模型训练提供了高质量的数据来源。
SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus
Authors: Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongyan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan
First: 2025-10-03T16:32:02+00:00 · Latest: 2026-03-05T14:25:09+00:00
Abstract
Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.
中文标题/摘要
标题:SpineBench:由SpineMed-450k语料库驱动的临床相关、分级感知基准
脊椎疾病影响全球6.19亿人,是导致残疾的主要原因之一,但AI辅助诊断仍受限于缺乏分级感知的多模态数据集。脊椎疾病的临床决策需要在特定椎体水平上对X光、CT和MRI进行复杂的推理。然而,由于缺乏可追溯的、临床依据的数据和标准化的脊椎特定基准,进展受到限制。为了解决这一问题,我们引入了SpineMed,一个与执业脊椎外科医生共同设计的生态系统。它包括SpineMed-450k,这是第一个专门设计用于跨成像模态的椎体级推理的大规模数据集,包含超过45万个指令实例,以及SpineBench,一个临床依据的评估框架。SpineMed-450k从多种来源收集,包括教科书、指南、开放数据集和约1000个匿名医院病例,使用临床医生在环的管道和两阶段LLM生成方法(草案和修订)来确保高质量、可追溯的数据,用于问题回答、多轮咨询和报告生成。SpineBench在临床相关轴上评估模型,包括椎体识别、病理评估和手术规划。我们对SpineBench上几种最近先进的大型视觉-语言模型的全面评估揭示了其在细粒度、椎体特定推理方面的系统性弱点。相比之下,我们基于SpineMed-450k微调的模型在所有任务上都表现出一致且显著的改进。临床医生评估证实了我们模型输出的诊断清晰度和实用价值。
Summary / 总结
The paper introduces SpineMed, a new ecosystem for spine disorders, featuring SpineMed-450k, a large-scale dataset for vertebral-level reasoning across imaging modalities, and SpineBench, an evaluation framework. The dataset is curated from various sources and ensures high-quality, traceable data. SpineBench evaluates models on clinically relevant tasks such as level identification, pathology assessment, and surgical planning. The evaluation shows that models fine-tuned on SpineMed-450k outperform recent large vision-language models in level-specific reasoning.
论文介绍了SpineMed生态系统,包括SpineMed-450k,这是一个用于脊椎水平跨影像模态推理的大规模数据集,以及SpineBench,一个临床相关的评估框架。该数据集从多种来源中整理而来,确保了高质量和可追溯的数据。SpineBench评估模型在脊椎水平识别、病理评估和手术规划等临床相关任务上的表现。评估结果显示,基于SpineMed-450k微调的模型在细粒度、水平特定的推理方面表现优于先进的大型视觉-语言模型。
RadarVLM: A Vision-Language Model Approach for Radar Scene Understanding
Authors: Pushkal Mishra, Kshitiz Bansal, Dinesh Bharadia
First: 2025-11-26T06:41:00+00:00 · Latest: 2026-03-05T14:00:17+00:00
Abstract
Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions, yet existing machine learning approaches remain fragmented and task-specific, with each downstream task employing distinct architectures and training objectives. We present RadarVLM, a vision-language framework that learns unified scene-level representations through structured spatial language supervision. Leveraging the CARLA simulator with a realistic radar model, we collect over 800k radar-caption pairs across 110+ hours of simulated driving in diverse scenarios. We make two key contributions: (1) a structured caption framework encoding vehicle distributions in the radar's native coordinate system, and (2) a Spatially-Grounded CLIP (SG-CLIP) objective that replaces binary matching with continuous scene similarity, enabling fine-grained spatial reasoning. We further propose localization-aware evaluation metrics that directly assess spatial accuracy beyond traditional linguistic similarity measures. Validated on generative captioning and vehicle segmentation, SG-CLIP achieves up to 50% relative F1-score improvement over vanilla CLIP and a 21% AP gain on segmentation, demonstrating that language grounding produces spatially structured representations.
中文标题/摘要
标题:RadarVLM:雷达场景理解的视觉语言模型方法
雷达传感器在恶劣天气、光照和远距离条件下提供可靠的感知,但现有的机器学习方法仍然支离破碎且任务特定,每个下游任务都采用不同的架构和训练目标。我们提出了RadarVLM,这是一种视觉语言框架,通过结构化的空间语言监督学习统一的场景级表示。利用CARLA模拟器和现实的雷达模型,我们收集了超过80万对雷达-描述符,涵盖了110多个小时的模拟驾驶,场景多样。我们做出了两项关键贡献:(1) 结构化的描述符框架,编码车辆在雷达原坐标系中的分布,以及(2) 基于空间的CLIP (SG-CLIP) 目标,用连续的场景相似度替代二元匹配,使细粒度的空间推理成为可能。我们还提出了定位感知的评估指标,直接评估空间准确性,超越传统的语言相似度度量。在生成描述符和车辆分割上,SG-CLIP相比vanilla CLIP的相对F1分数提高了50%,分割的AP提高了21%,表明语言定位产生了空间结构化的表示。
Summary / 总结
RadarVLM is a vision-language model that uses structured spatial language supervision to learn unified scene-level representations for radar scene understanding. It leverages the CARLA simulator to collect over 800k radar-caption pairs and introduces a structured caption framework and a Spatially-Grounded CLIP (SG-CLIP) objective. SG-CLIP improves the generative captioning F1-score by up to 50% (relative) and vehicle segmentation AP by 21% compared to vanilla CLIP, showing that language grounding enhances spatial reasoning.
RadarVLM 是一种视觉-语言模型,通过结构化的空间语言监督从雷达数据中学习统一的场景级表示。它利用 CARLA 模拟器收集了超过 80 万对雷达-描述,并引入了结构化描述框架和空间定位 CLIP 目标,这在生成描述和车辆分割任务上分别实现了高达 50% 的相对 F1 分数提升和 21% 的 AP 增益。
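The RadarVLM abstract says only that SG-CLIP "replaces binary matching with continuous scene similarity"; it does not give the loss. One plausible reading, sketched below purely as an assumption, is a contrastive cross-entropy whose one-hot targets are swapped for row-normalized scene-similarity scores. The function names and the normalization are illustrative and not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_contrastive_loss(logits, scene_similarity):
    """Cross-entropy between predicted radar-to-caption matches and soft targets.

    logits[i][j]: model score for radar frame i against caption j.
    scene_similarity[i][j]: non-negative similarity between scenes i and j
    (hypothetical input; each row is assumed to have a positive sum).
    With an identity similarity matrix this reduces to the usual
    one-directional CLIP cross-entropy with binary targets.
    """
    loss = 0.0
    for row_logits, row_sim in zip(logits, scene_similarity):
        probs = softmax(row_logits)
        total = sum(row_sim)
        targets = [s / total for s in row_sim]
        loss -= sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)
    return loss / len(logits)
```

With identity targets, only the matched pair contributes to the loss. With soft targets, partially similar scenes also pull probability mass toward themselves, which is the intuition behind continuous supervision.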
Logi-PAR: Logic-Infused Patient Activity Recognition via Differentiable Rule
Authors: Muhammad Zarar, MingZheng Zhang, Xiaowang Zhang, Zhiyong Feng, Sofonias Yitagesu, Kawsar Farooq
First: 2026-03-05T13:52:50+00:00 · Latest: 2026-03-05T13:52:50+00:00
Abstract
Patient Activity Recognition (PAR) in clinical settings uses activity data to improve safety and quality of care. Although significant progress has been made, current models mainly identify which activity is occurring. They often spatially compose sub-sparse visual cues using global and local attention mechanisms, yet learn only logically implicit patterns because of their purely neural pipelines. Advancing clinical safety requires methods that can infer why a set of visual cues implies a risk, and how these cues can be compositionally reasoned over through explicit logic beyond mere classification. To address this, we propose Logi-PAR, the first Logic-Infused Patient Activity Recognition Framework that integrates contextual fact fusion as a multi-view primitive extractor and injects neural-guided differentiable rules. Our method automatically learns rules from visual cues, optimizing them end-to-end while enabling implicitly emerging patterns to be explicitly labelled during training. To the best of our knowledge, Logi-PAR is the first framework to recognize patient activity by applying learnable logic rules to symbolic mappings. It produces auditable "why" explanations as rule traces and supports counterfactual interventions (e.g., risk would decrease by 65% if assistance were present). Extensive evaluation on clinical benchmarks (VAST and OmniFall) demonstrates state-of-the-art performance, significantly outperforming Vision-Language Models and transformer baselines. The code is available via: https://github.com/zararkhan985/Logi-PAR.git
中文标题/摘要
标题:Logi-PAR:通过可微规则融合上下文事实的患者活动识别
在临床环境中,患者活动识别(PAR)利用活动数据以提高安全性和护理质量。尽管取得了显著进展,当前模型主要识别正在进行的活动。它们通常使用全局和局部注意力机制组合稀疏的视觉线索,但由于其神经管道,只能学习逻辑隐含模式。为了提高临床安全性,需要能够推断出一组视觉线索为何表示风险的方法,并通过明确的逻辑进行组合推理,而不仅仅是分类。为此,我们提出了Logi-PAR,这是第一个融合上下文事实的逻辑注入患者活动识别框架,将其作为多视图原始提取器,并注入神经引导的可微规则。我们的方法自动从视觉线索中学习规则,在端到端优化的同时,使隐含模式在训练期间明确地被标记。据我们所知,Logi-PAR 是第一个通过应用可学习逻辑规则到符号映射来识别患者活动的框架。它产生可审计的“为什么”解释作为规则跟踪,并支持反事实干预(例如,如果提供帮助,风险将降低65%)。在临床基准测试(VAST和OmniFall)上的广泛评估表明,其性能达到最先进的水平,显著优于视觉-语言模型和变压器基线。代码可通过:https://github.com/zararkhan985/Logi-PAR.git 获取
Summary / 总结
Logi-PAR is a novel framework for Patient Activity Recognition (PAR) that integrates logical rules into the recognition process. It uses contextual fact fusion as a multi-view primitive extractor and injects neural-guided differentiable rules to learn and optimize logical patterns. Logi-PAR demonstrates superior performance on clinical benchmarks, providing auditable explanations and supporting counterfactual interventions, thus significantly outperforming existing vision-language models and transformer baselines.
Logi-PAR 是一种新颖的患者活动识别框架,将逻辑融入识别过程。它使用多视图原始提取器融合上下文事实,并注入神经引导的可微规则来从视觉线索中学习明确的逻辑。这种方法不仅提高了活动识别的准确性,还提供了可审计的解释并支持反事实干预。Logi-PAR 在临床基准测试中表现出色,超越了现有模型,并为带有可学习逻辑规则的活动识别设定了新标准。
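To make "differentiable rules" and "counterfactual interventions" concrete, here is a toy sketch that assumes a product t-norm for soft conjunction. The rule, the fact names (`leaving_bed`, `assistance_present`), and the scores are all hypothetical; the Logi-PAR abstract does not specify its rule language or operators.

```python
def soft_and(*truths):
    """Product t-norm: a smooth stand-in for logical AND on values in [0, 1]."""
    result = 1.0
    for truth in truths:
        result *= truth
    return result


def soft_not(truth):
    """Soft negation on a truth value in [0, 1]."""
    return 1.0 - truth


def fall_risk(facts):
    """Hypothetical rule: risk <- leaving_bed AND NOT assistance_present."""
    return soft_and(facts["leaving_bed"], soft_not(facts["assistance_present"]))


def counterfactual_risk(facts, **overrides):
    """Re-evaluate the rule after intervening on selected facts."""
    changed = dict(facts)
    changed.update(overrides)
    return fall_risk(changed)
```

For example, with `facts = {"leaving_bed": 0.9, "assistance_present": 0.2}` the rule gives a risk of about 0.72, and `counterfactual_risk(facts, assistance_present=1.0)` drops it to 0. Because every operation is a product or a subtraction, gradients can flow back to whatever network produces the fact scores.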
Mario: Multimodal Graph Reasoning with Large Language Models
Authors: Yuanfu Sun, Kang Li, Pengkang Guo, Jiajin Liu, Qiaoyu Tan
Venue: CVPR 2026
First: 2026-03-05T13:49:41+00:00 · Latest: 2026-03-05T13:49:41+00:00
Comments: CVPR 2026
Abstract
Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that resolves both challenges simultaneously and enables effective LLM-based reasoning over MMGs. Mario consists of two stages. The first is a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. The second is a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.
中文标题/摘要
标题:马里奥:大规模语言模型的多模态图推理
大规模语言模型(LLMs)的最新进展为多模态推理开辟了新的途径。然而,大多数现有方法仍然依赖预训练的视觉-语言模型(VLMs)来孤立地编码图像-文本对,忽略了真实世界多模态数据自然形成的关联结构。这促使我们在多模态图(MMGs)上进行推理,其中每个节点具有文本和视觉属性,边提供结构线索。在保持图拓扑的同时,利用LLM进行这样的异构多模态信号推理引入了两个关键挑战:解决弱跨模态一致性并处理异构模态偏好。为了解决这个问题,我们提出了马里奥,这是一种统一框架,同时解决了上述两个挑战,并使LLM能够在MMGs上进行有效的推理。马里奥由两个创新阶段组成。首先,一种图条件下的VLM设计,通过由图拓扑引导的细粒度跨模态对比学习联合精炼文本和视觉特征。其次,一种模态自适应图指令调优机制,将对齐的多模态特征组织成图感知指令视图,并使用可学习的路由器为每个节点及其邻域呈现最相关信息模态配置给LLM。在各种多模态图基准上的广泛实验表明,马里奥在节点分类和链接预测的监督和零样本场景中均能一致地优于最先进的图模型。代码将在https://github.com/sunyuanfu/Mario上公开。
Summary / 总结
The research is motivated by the need to leverage large language models (LLMs) for multimodal reasoning, addressing the limitations of existing methods that rely on pretrained vision-language models (VLMs) in encoding image-text pairs. Mario, a unified framework, is proposed to resolve weak cross-modal consistency and handle heterogeneous modality preference by using a graph-conditioned VLM design and a modality-adaptive graph instruction tuning mechanism. Experiments show that Mario outperforms state-of-the-art graph models in node classification and link prediction tasks across various multimodal graph benchmarks.
Mario 是一个使用大型语言模型进行多模态图推理的统一框架。通过设计图条件下的视觉-语言模型和模态自适应图指令调优机制来解决跨模态一致性弱和模态偏好异质性的挑战。Mario 在各种多模态图基准上的节点分类和链接预测任务中均优于现有最佳图模型。
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
Authors: Shule Lu, Yujing Wang, Hainan Zhang, Xiaoshan Yang, Hongwei Zheng, Yongxin Tong, Changsheng Xu, Zhiming Zheng
First: 2026-01-31T03:11:51+00:00 · Latest: 2026-03-05T12:23:38+00:00
Comments: Due to the need for substantial revisions, the authors believe that the paper should be retracted first. A revised version may be resubmitted.
Abstract
VLMs have broad potential in privacy-sensitive domains such as healthcare and finance, yet strict data-sharing constraints render centralized training infeasible. FL mitigates this issue by enabling decentralized training, but practical deployments face challenges due to client heterogeneity in computational resources, application requirements, and model architectures. We argue that while replacing data with model parameters characterizes the present of FL, replacing parameters with preferences represents a more scalable and privacy-preserving future. Motivated by this perspective, we propose MoR, a federated alignment framework based on GRPO with Mixture-of-Rewards for heterogeneous VLMs. MoR initializes a visual foundation model as a KL-regularized reference, while each client locally trains a reward model from local preference annotations, capturing specific evaluation signals without exposing raw data. To reconcile heterogeneous rewards, we introduce a routing-based fusion mechanism that adaptively aggregates client reward signals. Finally, the server performs GRPO with this mixed reward to optimize the base VLM. Experiments on three public VQA benchmarks demonstrate that MoR consistently outperforms federated alignment baselines in generalization, robustness, and cross-client adaptability. Our approach provides a scalable solution for privacy-preserving alignment of heterogeneous VLMs under federated settings.
中文标题/摘要
标题:用偏好替换参数:异构视觉-语言模型的联邦对齐
视觉语言模型(VLMs)在医疗保健和金融等隐私敏感领域具有广泛的应用潜力,但由于严格的数据共享限制,集中式训练变得不可行。联邦学习(FL)通过使训练去中心化来缓解这一问题,但实际部署面临挑战,因为客户端在计算资源、应用需求和模型架构方面存在异质性。我们认为,虽然用模型参数替换数据是当前FL的特点,但用偏好替换参数代表了更具有扩展性和隐私保护的未来。基于这一视角,我们提出了MoR,一种基于GRPO的混合奖励的异构VLM联邦对齐框架。MoR以KL正则化的视觉基础模型作为参考,每个客户端从本地偏好注释中局部训练奖励模型,捕捉特定的评估信号而不暴露原始数据。为了协调异质奖励,我们引入了一种基于路由的融合机制,以自适应地聚合客户端的奖励信号。最后,服务器使用这种混合奖励进行GRPO优化基础VLM。在三个公开的VQA基准测试上进行的实验表明,MoR在泛化能力、鲁棒性和跨客户端适应性方面始终优于联邦对齐基线。我们的方法为联邦设置下异构VLM的隐私保护对齐提供了可扩展的解决方案。
Summary / 总结
The paper addresses the challenge of aligning vision-language models (VLMs) in privacy-sensitive domains where centralized training is impractical due to data-sharing constraints. It proposes MoR, a federated alignment framework that exchanges preferences instead of model parameters to enhance scalability and privacy. MoR initializes a visual foundation model as a KL-regularized reference and lets each client train a reward model from local preference annotations. A routing-based fusion mechanism adaptively aggregates the client reward signals, and the server optimizes the base VLM via GRPO with this mixed reward. Experiments on three public VQA benchmarks show that MoR outperforms federated alignment baselines in generalization, robustness, and cross-client adaptability.
论文针对在隐私敏感领域中由于数据共享限制而无法进行集中训练的问题,提出了一种名为MoR的联邦学习框架,该框架通过将模型参数替换为偏好来增强可扩展性和隐私保护。MoR初始化一个参考模型,并允许客户端基于本地偏好注释训练奖励模型。该框架使用基于路由的融合机制来聚合客户端的奖励信号,并通过基于梯度的策略优化来优化基础模型。实验结果显示,MoR在泛化能力、鲁棒性和跨客户端适应性方面优于现有联邦对齐方法。
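The MoR server step described above can be sketched in two parts: fusing the per-client reward scores with router weights, then converting the fused rewards for a group of sampled responses into group-relative advantages, as GRPO-style methods do. The softmax router and the function names are assumptions for illustration; the abstract does not describe the router's form.

```python
import math
from statistics import mean, pstdev


def fuse_rewards(client_rewards, router_scores):
    """Mix client reward scores with softmax router weights (assumed form)."""
    peak = max(router_scores)
    exps = [math.exp(s - peak) for s in router_scores]
    total = sum(exps)
    return sum((e / total) * r for e, r in zip(exps, client_rewards))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch each response's advantage depends only on how its mixed reward compares with the other responses sampled for the same prompt, so no separate value network is needed.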
GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement
Authors: Xiaodong Zhu, Yuanming Zheng, Suting Wang, Junqi Yang, Yuhong Yang, Weiping Tu, Zhongyuan Wang
Venue: CVPR 2026
First: 2026-03-05T12:07:26+00:00 · Latest: 2026-03-05T12:07:26+00:00
Comments: 10 pages, 4 figures, accepted by CVPR 2026
Abstract
Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.
中文标题/摘要
标题:GEM-TFL:通过EM引导分解和时间精炼,实现弱监督与全监督之间的伪造定位桥梁
时间伪造定位(TFL)旨在精确识别视频或音频流中的篡改段落,为多媒体取证和安全提供可解释的证据。虽然大多数现有的TFL方法依赖于密集的帧级标签进行全监督学习,但弱监督TFL(WS-TFL)通过仅从二元视频级标签中学习来降低标注成本。然而,当前的WS-TFL方法存在训练和推理目标不匹配、二元标签监督有限、由于非可微的top-k聚合导致梯度阻塞以及缺乏对提案间关系的显式建模等问题。为了解决这些问题,我们提出了GEM-TFL(基于图的EM增强时间伪造定位),这是一种两阶段分类-回归框架,有效地弥合了训练和推理之间的监督差距。在此基础上,(1)我们通过基于EM的优化过程将二元标签重新表述为多维潜在属性,增强弱监督;(2)我们引入了一种无需训练的时间一致性精炼方法,重新对齐帧级预测以实现更平滑的时间动态;(3)我们设计了一种基于图的提案精炼模块,建模提案之间的时空语义关系,以实现全局一致的置信度估计。在基准数据集上的广泛实验表明,GEM-TFL实现了更准确和稳健的时间伪造定位,显著缩小了与全监督方法的差距。
Summary / 总结
GEM-TFL addresses the challenges in weakly supervised temporal forgery localization by proposing a two-phase classification-regression framework that enhances weak supervision and introduces temporal consistency refinement and graph-based proposal refinement. The method reformulates binary labels into multi-dimensional latent attributes via EM-based optimization and realigns frame-level predictions for smoother temporal dynamics, achieving more accurate and robust forgery localization and substantially narrowing the gap with fully supervised methods.
GEM-TFL通过提出一个两阶段框架,利用EM优化增强弱监督,并引入时间一致性精炼和基于图的提案精炼,解决弱监督时间伪造定位的挑战。该方法有效地弥合了训练和推理之间的差距,从而在伪造定位的准确性和鲁棒性方面优于全监督方法。
CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
Authors: Zhaonian Kuang, Rui Ding, Haotian Wang, Xinhu Zheng, Meng Yang, Gang Hua
Venue: CVPR 2026
First: 2026-03-05T10:49:46+00:00 · Latest: 2026-03-05T10:49:46+00:00
Comments: Accepted to CVPR 2026 main track
Abstract
Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches the feature space by integrating four spatial representations: focal length, ground depth, ground gradient, and Plücker coordinates. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
中文标题/摘要
标题:CoIn3D:重新审视配置不变的多相机3D物体检测
多相机3D物体检测(MC3D)随着多传感器物理代理,如机器人和自动驾驶车辆的部署越来越多地受到关注。然而,MC3D模型仍然难以在具有新多相机配置的未见过的平台上泛化。当前的解决方案仅使用一个元相机进行统一表示,但缺乏全面考虑。在本文中,我们重新审视了这一问题,并发现问题在于源配置和目标配置之间的空间先验差异,包括不同的内参、外参和阵列布局。为了解决这一问题,我们提出了CoIn3D,这是一种通用的MC3D框架,能够从源配置高效地转移到未见过的目标配置。CoIn3D通过空间感知特征调制(SFM)和相机感知数据增强(CDA)将所有识别的空间先验显式地整合到特征嵌入和图像观察中。SFM通过整合焦距、地面深度、地面梯度和Plücker坐标等四种空间表示丰富了特征空间。CDA通过无训练动态新颖视角图像合成方案在各种配置下提高观察多样性。广泛的实验表明,CoIn3D在NuScenes、Waymo和Lyft等地标数据集上,在BEVDepth、BEVFormer和PETR等三种主导的MC3D范式下实现了强大的跨配置性能。
Summary / 总结
CoIn3D revisits the challenge of multi-camera 3D object detection (MC3D) and proposes a framework for transferring models to unseen camera configurations. It introduces spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA) to handle discrepancies in intrinsics, extrinsics, and array layouts. Experiments show that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft under the BEVDepth, BEVFormer, and PETR paradigms.
CoIn3D重新审视了多相机3D物体检测(MC3D)中的挑战,并提出了一种框架来解决向未见相机配置泛化的难题。通过整合诸如内参、外参和阵列布局等空间先验,CoIn3D使用空间感知特征调制和相机感知数据增强来增强特征嵌入和图像观察。实验表明,CoIn3D在不同数据集和MC3D范式下表现出色,展示了强大的跨配置性能。
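Of the four spatial representations listed for SFM, Plücker coordinates are the most compact to illustrate. A common convention represents a camera ray with origin o and unit direction d as the six-vector (d, o × d). The sketch below uses that convention as an assumption; the abstract does not say which ordering or normalization CoIn3D uses.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin, direction):
    """Return (d, o x d) for the ray through `origin` along `direction`."""
    norm = math.sqrt(sum(c * c for c in direction))
    unit = tuple(c / norm for c in direction)
    moment = cross(origin, unit)
    return unit + moment
```

The moment term o × d stays the same when the origin slides along the ray, so the six-vector identifies the ray itself and not a particular point on it. That makes it a convenient per-pixel encoding of camera geometry.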
Flatness Guided Test-Time Adaptation for Vision-Language Models
Authors: Aodi Li, Liansheng Zhuang, Xiao Long, Houqiang Li, Shafei Wang
First: 2025-01-31T03:10:48+00:00 · Latest: 2026-03-05T10:05:46+00:00
Abstract
Test-time adaptation (TTA) of Vision-Language Models (VLMs) has emerged as a technique for tackling distribution shifts at test time. Recent research indicates that test-time adaptation is intrinsically linked to the model's training history. However, existing TTA methods, such as Test-time Prompt Tuning, often design adaptation strategies in isolation from the models' training characteristics, which degrades their performance. This paper argues that the flatness acquired via sharpness-aware training is an efficient clue for the test-time adaptation of VLMs. Built on this insight, this paper proposes a novel Flatness-Guided Adaptation framework (FGA) for VLMs to cohesively unify training and test-time procedures. Its core idea is to leverage the alignment between the training minimum and test loss flat regions to guide the adaptation process. Specifically, our FGA consists of a prompt-tuning stage and a test-time adaptation stage. In the tuning stage, a Sharpness-Aware Prompt Tuning method is utilized to identify the training flat minimum, offering a geometric clue of flatness for subsequent adaptation. In the test stage, a Sharpness-based Test Sample Selection approach is proposed to ensure the alignment of flat minima between the training and each augmented test sample's loss landscape. In comparison to existing TTA methods, our FGA avoids the expensive prompt parameter updates during test time, and substantially reduces the computation overhead. Extensive experiments on both domain generalization and cross-dataset benchmarks demonstrate that our FGA achieves superior performance over prevalent TTA methods. Notably, when employing a ViT-B/16 image encoder, FGA even outperforms TPT+CoOp by an average of 4.88% across all four ImageNet out-of-domain variants.
中文标题/摘要
标题:基于平坦度引导的视觉-语言模型测试时适应
视觉-语言模型(VLMs)的测试时适应(TTA)已成为解决测试时分布偏移的技术。现有研究表明,测试时适应与模型的训练历史密切相关。然而,现有的TTA方法,如测试时提示调优,往往孤立地设计适应策略,这会降低其性能。本文认为,通过尖锐性感知训练获得的平坦度是视觉-语言模型测试时适应的有效线索。基于这一见解,本文提出了一种新颖的基于平坦度引导的适应框架(FGA),以统一训练和测试过程。其核心思想是利用训练最小值和平坦损失区域之间的对齐来引导适应过程。具体而言,我们的FGA包括一个提示调优阶段和一个测试时适应阶段。在调优阶段,使用尖锐性感知提示调优方法来识别训练平坦最小值,为后续适应提供平坦度的几何线索。在测试阶段,提出了一种基于尖锐性的测试样本选择方法,以确保训练最小值和平滑每个增强测试样本损失景观之间的对齐。与现有的TTA方法相比,我们的FGA避免了测试时昂贵的提示参数更新,并显著减少了计算开销。在领域泛化和跨数据集基准测试上的广泛实验表明,我们的FGA在所有四个ImageNet离域变体中均优于流行的TTA方法。值得注意的是,当使用ViT-B/16图像编码器时,FGA在所有四个ImageNet离域变体中平均优于TPT+CoOp 4.88%。
Summary / 总结
This paper addresses the challenge of test-time adaptation (TTA) for Vision-Language Models (VLMs) by proposing a Flatness-Guided Adaptation (FGA) framework. The motivation is to improve TTA performance by leveraging the flatness acquired during training. The method involves a two-stage process: first, a Sharpness-Aware Prompt Tuning identifies the training flat minimum, providing a geometric clue for adaptation. Second, a Sharpness-based Test Sample Selection ensures alignment between the training and test loss flat regions. Experiments show that FGA outperforms existing TTA methods, achieving superior performance and reducing computational overhead.
本文提出了一种Flatness-Guided Adaptation (FGA)框架,以解决Vision-Language Models (VLMs)在测试时的分布变化问题。动机是通过利用训练过程中获得的平坦性来改进现有的测试时适应方法。FGA框架包括一个调优阶段和一个测试时适应阶段,调优阶段使用Sharpness-Aware Prompt Tuning方法识别训练平坦最小值,测试阶段通过Sharpness-based Test Sample Selection方法确保训练和每个增强测试样本损失景观之间的平坦最小值对齐。实验表明,FGA在性能上优于现有方法,并且减少了计算开销。
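The idea of selecting test samples by sharpness, with no parameter updates, can be sketched as follows. The sharpness proxy here (the worst loss increase under small axis-aligned perturbations of the tuned parameters) and the keep-the-flattest rule are assumptions chosen for illustration; the FGA abstract does not define its sharpness measure or selection criterion.

```python
def sharpness(loss_fn, params, view, rho=0.05):
    """Worst-case loss increase when one parameter is shifted by +/- rho."""
    base = loss_fn(params, view)
    worst = base
    for index in range(len(params)):
        for sign in (1.0, -1.0):
            shifted = list(params)
            shifted[index] += sign * rho
            worst = max(worst, loss_fn(shifted, view))
    return worst - base


def select_flat_views(loss_fn, params, views, keep):
    """Keep the `keep` augmented views whose loss landscape is flattest.

    The parameters are only probed, never updated, mirroring the claim that
    no prompt parameter updates happen at test time.
    """
    ranked = sorted(views, key=lambda v: sharpness(loss_fn, params, v))
    return ranked[:keep]
```

As a toy check, with `loss_fn = lambda p, v: v * sum(x * x for x in p)` each view acts as a curvature, and the selection returns the views with the smallest curvature around the tuned parameters.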
Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning
Authors: Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu
First: 2025-03-14T19:52:08+00:00 · Latest: 2026-03-05T09:05:50+00:00
Abstract
Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the "safety mirage", where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we present machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.
中文标题/摘要
标题:安全幻象:虚假相关性如何削弱VLM安全微调并可通过机器遗忘加以缓解
近期的视觉语言模型(VLMs)在多模态输入(尤其是文本和图像)的生成建模方面取得了显著进展。然而,当暴露于不安全查询时,它们生成有害内容的脆弱性引发了重要的安全问题。尽管当前的对齐策略主要依赖于监督安全微调和精心策划的数据集,但我们发现了一个根本性的局限性,我们称之为“安全幻象”,即监督微调无意中强化了表面文本模式与安全响应之间的虚假相关性,而不是培养深层次、内在的有害行为缓解。我们展示了这些虚假相关性使微调后的VLMs即使面对简单的基于单词替换的攻击也依然脆弱,其中用一个诱导虚假相关性的替代词替换文本查询中的单个词可以有效绕过防护措施。此外,这些相关性导致过度谨慎,使微调后的VLMs无故拒绝良性查询。为解决这些问题,我们展示了机器遗忘(MU)作为监督安全微调的强大替代方案,因为它避免了有偏的特征-标签映射,并直接从VLMs中移除有害知识,同时保留其一般能力。广泛的跨安全基准评估表明,基于MU的对齐将攻击成功率降低高达60.27%,并减少了超过84.20%的无谓拒绝。注意:存在可能具有冒犯性的AI生成内容。
Summary / 总结
The research addresses the safety concerns of vision language models (VLMs) when exposed to unsafe queries, identifying a "safety mirage" where supervised fine-tuning can inadvertently reinforce spurious correlations, making VLMs vulnerable to simple one-word attacks and overly cautious on benign queries. The study demonstrates that machine unlearning (MU) can mitigate these issues by directly removing harmful knowledge while avoiding biased feature-label mappings, reducing attack success rates by up to 60.27% and unnecessary rejections by over 84.20%.
论文探讨了视觉语言模型(VLMs)中虚假相关性的问题,这些虚假相关性会削弱模型的安全性,导致误报和漏报。研究引入了‘安全幻象’的概念,指出监督微调可能会无意中强化这些虚假相关性。研究显示,这些虚假相关性使VLMs对简单的单词修改攻击变得脆弱,并导致对良性查询的不必要的拒绝。为解决这些问题,论文提出机器遗忘(MU)作为监督微调的替代方案,表明MU可以在安全基准测试中将攻击成功率降低高达60.27%,并将不必要的拒绝率降低超过84.20%。
Retrieval-Augmented Generation with Covariate Time Series
Authors: Kenny Ye Liang, Zhongyi Pei, Huan Zhang, Yuhui Liu, Shaoxu Song, Jianmin Wang
First: 2026-03-05T08:45:24+00:00 · Latest: 2026-03-05T08:45:24+00:00
Comments: 12 pages. Preprint
Abstract
While RAG has greatly enhanced LLMs, extending this paradigm to Time-Series Foundation Models (TSFMs) remains a challenge. This is exemplified in the Predictive Maintenance of the Pressure Regulating and Shut-Off Valve (PRSOV), a high-stakes industrial scenario characterized by (1) data scarcity, (2) short transient sequences, and (3) covariate-coupled dynamics. Unfortunately, existing time-series RAG approaches predominantly rely on generated static vector embeddings and learnable context augmenters, which may fail to distinguish similar regimes in such scarce, transient, and covariate-coupled scenarios. To address these limitations, we propose RAG4CTS, a regime-aware, training-free RAG framework for Covariate Time-Series. Specifically, we construct a hierarchical time-series native knowledge base to enable lossless storage and physics-informed retrieval of raw historical regimes. We design a two-stage bi-weighted retrieval mechanism that aligns historical trends through point-wise and multivariate similarities. For context augmentation, we introduce an agent-driven strategy to dynamically optimize context in a self-supervised manner. Extensive experiments on PRSOV demonstrate that our framework significantly outperforms state-of-the-art baselines in prediction accuracy. The proposed system is deployed in Apache IoTDB within China Southern Airlines. Since deployment, our method has successfully identified one PRSOV fault in two months with zero false alarms.
中文标题/摘要
标题:基于协变量时间序列的检索增强生成
尽管检索增强生成(RAG)极大地提升了语言模型(LLMs),将其扩展到时间序列基础模型(TSFMs)仍面临挑战。这在压力调节和关断阀(PRSOV)的预测维护中尤为明显,这是一个高风险的工业场景,具有(1)数据稀缺性,(2)短暂的瞬态序列,以及(3)协变量耦合的动力学。不幸的是,现有的时间序列RAG方法主要依赖于生成的静态向量嵌入和可学习的上下文增强器,这在稀缺、瞬态和协变量耦合的场景中可能无法区分相似的运行状态。为了解决这些局限性,我们提出了RAG4CTS,这是一种针对协变量时间序列的训练无监督的检索增强生成框架。具体而言,我们构建了一个层次化的时间序列本体知识库,以实现无损存储和基于物理的检索历史运行状态。我们设计了一种两阶段的双加权检索机制,通过点对点和多变量相似性对历史趋势进行对齐。对于上下文增强,我们引入了一种基于代理的策略,以自监督的方式动态优化上下文。在PRSOV上的广泛实验表明,我们的框架在预测准确性上显著优于最先进的基线。所提出系统已部署在中国南方航空公司的Apache IoTDB中。自部署以来,我们的方法在两个月内成功检测到一个PRSOV故障,且无误报。
Summary / 总结
The research addresses the challenge of applying Retrieval-Augmented Generation (RAG) to Time-Series Foundation Models (TSFMs) in high-stakes industrial scenarios such as Predictive Maintenance of the PRSOV, which are characterized by data scarcity, short transient sequences, and covariate-coupled dynamics. The proposed RAG4CTS framework introduces a hierarchical time-series native knowledge base and a two-stage bi-weighted retrieval mechanism to align historical trends. It also employs an agent-driven strategy for context augmentation. Experiments show that RAG4CTS significantly improves prediction accuracy compared to state-of-the-art baselines. It has been deployed at China Southern Airlines, where it identified one PRSOV fault in two months with zero false alarms.
研究旨在解决在高风险工业场景如PRSOV预测维护中应用时间序列增强生成(RAG)技术的挑战,这些场景具有数据稀缺、短暂序列和协变量耦合动态的特点。提出的RAG4CTS框架引入了层次时间序列本地知识库和两阶段双加权检索机制来对齐历史趋势,并采用代理驱动策略进行上下文增强。实验表明,RAG4CTS在预测准确性上显著优于现有方法,并在中国南方航空公司的部署中成功识别了一个故障,没有误报。
Collaborative Learning of Local 3D Occupancy Prediction and Versatile Global Occupancy Mapping
Authors: Shanshuai Yuan, Julong Wei, Muer Tie, Xiangyun Ren, Zhongxue Gan, Wenchao Ding
Venue: ICRA 2026
First: 2025-04-18T09:58:48+00:00 · Latest: 2026-03-05T07:52:27+00:00
Comments: Accepted by ICRA 2026
Abstract
Vision-based 3D semantic occupancy prediction is vital for autonomous driving, enabling unified modeling of static infrastructure and dynamic agents. Global occupancy maps serve as long-term memory priors, providing valuable historical context that enhances local perception. This is particularly important in challenging scenarios such as occlusion or poor illumination, where current and nearby observations may be unreliable or incomplete. Priors aggregated from previous traversals under better conditions help fill gaps and enhance the robustness of local 3D occupancy prediction. In this paper, we propose Long-term Memory Prior Occupancy (LMPOcc), a plug-and-play framework that incorporates global occupancy priors to boost local prediction and simultaneously updates global maps with new observations. To realize the information gain from global priors, we design an efficient and lightweight Current-Prior Fusion module that adaptively integrates prior and current features. Meanwhile, we introduce a model-agnostic prior format to enable continual updating of global occupancy and ensure compatibility across diverse prediction baselines. LMPOcc achieves state-of-the-art local occupancy prediction performance validated on the Occ3D-nuScenes benchmark, especially on static semantic categories. Furthermore, we verify LMPOcc's capability to build large-scale global occupancy maps through multi-vehicle crowdsourcing, and utilize occupancy-derived dense depth to support the construction of 3D open-vocabulary maps. Our method opens up a new paradigm for continuous global information updating and storage, paving the way towards more comprehensive and scalable scene understanding in large outdoor environments.
中文标题/摘要
标题:协作学习局部3D占用预测和多功能全局占用映射
基于视觉的3D语义占用预测对于自动驾驶至关重要,能够统一建模静态基础设施和动态代理。全局占用地图作为长期记忆先验,提供有价值的历史上下文,增强局部感知。特别是在遮挡或光照不良等具有挑战性的场景中,当前和附近的观测可能不可靠或不完整。来自更好条件下的先前遍历的先验有助于填补空白并增强局部3D占用预测的鲁棒性。在本文中,我们提出了一种名为长时记忆先验占用(LMPOcc)的即插即用框架,该框架结合全局占用先验以增强局部预测,并同时用新观测更新全局地图。为了实现全局先验的信息增益,我们设计了一种高效且轻量级的当前-先验融合模块,以自适应地整合先验和当前特征。同时,我们引入了一种模型无关的先验格式,以实现全局占用的持续更新并确保与各种预测基线的兼容性。LMPOcc在Occ3D-nuScenes基准上实现了最先进的局部占用预测性能,特别是在静态语义类别方面。此外,我们通过多车辆众包验证了LMPOcc构建大规模全局占用地图的能力,并利用占用衍生的密集深度支持3D开放词汇地图的构建。我们的方法为持续的全局信息更新和存储开辟了新的范式,为大型户外环境中的更全面和可扩展的场景理解铺平了道路。
Summary / 总结
This paper addresses the challenge of 3D semantic occupancy prediction in autonomous driving by proposing LMPOcc, a framework that integrates global occupancy priors to enhance local prediction and simultaneously updates global maps. The method includes an efficient Current-Prior Fusion module for adaptive feature integration and a model-agnostic prior format for continuous updating. LMPOcc achieves superior local occupancy prediction performance on the Occ3D-nuScenes benchmark, particularly for static categories, and demonstrates capability in building large-scale global occupancy maps through multi-vehicle crowdsourcing.
本文提出LMPOcc框架,通过整合全局占用先验信息来增强局部预测,并同时更新全局地图。该方法包含高效的当前-先验融合模块和模型无关的先验格式,实现了在Occ3D-nuScenes基准上的最佳局部占用预测性能,特别是在静态类别方面。此外,LMPOcc展示了通过多车辆众包构建大规模全局占用地图的能力,并支持3D开放词汇地图的构建。
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
Authors: Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker
First: 2026-02-08T02:16:02+00:00 · Latest: 2026-03-05T07:52:20+00:00
Comments: Figure PDFs were compressed to 150 dpi to comply with arXiv's submission size limit. Project page: https://rolling-sink.github.io/
Abstract
Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradation. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To explore a training-free solution, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales the AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/
中文标题/摘要
标题:滚动水槽:在自回归视频扩散模型中弥合有限训练期与开放测试期之间的差距
最近,自回归(AR)视频扩散模型取得了显著的性能。然而,由于其有限的训练时长,当在更长的时间范围内进行测试时,会出现训练-测试差距,导致视觉质量迅速退化。在研究了训练时长内的训练-测试差距之后,这项工作研究了训练时长之外的训练-测试差距,即训练时有限的时间范围与测试时开放的时间范围之间的差距。由于开放的测试可以超出任何有限的训练窗口,且长视频训练计算成本高昂,我们寻求一种无需训练的解决方案来弥合这一差距。为了探索无需训练的解决方案,我们系统地分析了AR缓存维护。这些见解导致了滚动水槽(Rolling Sink)的提出。基于仅使用5秒片段训练的Self Forcing,滚动水槽在测试时能够将AR视频合成扩展到超长时长(例如,16 FPS下的5-30分钟),且保持一致的主题、稳定的颜色、连贯的结构和流畅的运动。通过广泛的实验表明,滚动水槽在长时域视觉保真度和时间一致性方面优于当前最佳基线。项目页面:https://rolling-sink.github.io/
Summary / 总结
This work addresses the train-test gap that arises when autoregressive video diffusion models are tested at horizons longer than their training duration. Whereas Self Forcing studies the train-test gap within the training duration, this work introduces Rolling Sink, a training-free method derived from a systematic analysis of AR cache maintenance, to bridge the gap between limited training horizons and open-ended testing. Built on Self Forcing (trained on only 5-second clips), Rolling Sink extends video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) with consistent subjects, stable colors, coherent structures, and smooth motions, outperforming state-of-the-art baselines in long-horizon visual fidelity and temporal consistency.
该研究通过引入无训练解决方案Rolling Sink来解决自回归视频扩散模型中的训练-测试差距问题。受需要弥合有限训练窗口和开放测试窗口之间差距的驱动,Rolling Sink基于Self Forcing,将AR视频合成扩展到超长持续时间,同时保持一致的主题、稳定的颜色、连贯的结构和流畅的运动。实验表明,Rolling Sink在长持续时间视觉保真度和时间一致性方面优于当前最佳基线。
AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM
Authors: Li'an Zhong, Ziqiang He, Jibin Zheng, Jin Li, Z. Jane Wang, Xiangui Kang
First: 2026-03-05T07:52:11+00:00 · Latest: 2026-03-05T07:52:11+00:00
Abstract
Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Increasing Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT), which employs a layer-wise threshold to control intervention time and a fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results on several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
中文标题/摘要
标题:AdaIAT:通过增加生成文本的关注度来适应性地减轻LVLM中的幻觉
幻觉已成为当前大型视觉-语言模型(LVLM)发展和应用中的重大障碍。为了减轻幻觉,一种直观且有效的方法是在推理过程中直接增加对图像标记的关注权重。尽管这种方法有效地降低了幻觉率,但往往会引起重复描述。为了解决这一问题,我们首先分析了注意力模式,并发现真实对象标记倾向于比幻觉标记更关注生成的文本。这启发我们利用包含指令相关视觉信息和上下文知识的生成文本来减轻幻觉,同时保持语言连贯性。因此,我们提出了生成文本注意力(IAT),并证明它显著降低了幻觉率,同时避免了重复描述。为了防止简单的放大损害LVLM的固有预测能力,我们进一步探索了分层阈值的自适应IAT(AdaIAT),以控制干预时间和针对每个注意力头特性的精细放大程度。分析和实验都证明了AdaIAT的有效性。多个LVLM的结果表明,AdaIAT有效地减轻了幻觉(分别在LLaVA-1.5上将幻觉率$C_S$和$C_I$降低了35.8%和37.1%),同时保持了语言性能和预测能力,实现了令人满意的权衡。
Summary / 总结
The paper addresses hallucination in Large Vision-Language Models (LVLMs) by proposing AdaIAT, which adaptively increases attention to generated text to reduce hallucinations without causing repetitive descriptions. The method leverages the generated text to maintain linguistic coherence, and its adaptive variant uses a layer-wise threshold to control intervention time and a per-head amplification magnitude. Experiments show that AdaIAT reduces the hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively, while preserving linguistic performance and prediction capability.
论文通过提出AdaIAT方法,该方法通过增加对生成文本的关注来减少大型视觉-语言模型(LVLM)中的幻觉现象,同时避免重复描述。该方法包含逐层阈值以控制干预,有效将LLaVA-1.5中的幻觉率分别降低35.8%和37.1%,同时保持语言一致性和预测能力。
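The core idea in the AdaIAT abstract, amplifying the attention a token pays to already generated text, can be illustrated with a minimal sketch. It assumes a single post-softmax attention row, a hypothetical uniform gain `alpha`, and renormalization after boosting; the abstract does not specify these details, and the layer-wise threshold and per-head magnitudes of AdaIAT are not modeled here.

```python
def amplify_generated_text(attn_row, generated_idx, alpha):
    """Boost attention on generated-text positions, then renormalize.

    attn_row: post-softmax attention weights for one query (sums to 1).
    generated_idx: set of key positions that hold generated text tokens.
    alpha: hypothetical amplification gain (an assumption of this sketch).
    """
    boosted = [
        w * (1.0 + alpha) if i in generated_idx else w
        for i, w in enumerate(attn_row)
    ]
    total = sum(boosted)
    # Renormalize so the row is a valid attention distribution again.
    return [w / total for w in boosted]


# Positions 0-1 stand for image/instruction tokens, position 2 for generated text.
row = amplify_generated_text([0.5, 0.3, 0.2], {2}, alpha=1.0)
```

After the call, the share of attention on position 2 rises while the relative weights among the untouched positions are preserved.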
Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
Authors: Lianyu Wang, Meng Wang, Huazhu Fu, Daoqiang Zhang
First: 2026-03-05T07:36:07+00:00 · Latest: 2026-03-05T07:36:07+00:00
Abstract
The rapid adoption of vision-language models (VLMs) has heightened the demand for robust intellectual property (IP) protection of these high-value pretrained models. Effective IP protection should proactively confine model deployment within authorized domains and prevent unauthorized transfers. However, existing methods rely on static training-time definitions, limiting flexibility in dynamic environments and often producing opaque responses to unauthorized inputs. To address these limitations, we propose a novel dynamic authorization with legality-aware intellectual property protection (AoD-IP) for VLMs, a framework that supports authorize-on-demand and legality-aware assessment. AoD-IP introduces a lightweight dynamic authorization module that enables flexible, user-controlled authorization, allowing users to actively specify or switch authorized domains on demand at deployment time. This enables the model to adapt seamlessly as application scenarios evolve and provides substantially greater extensibility than existing static-domain approaches. In addition, AoD-IP incorporates a dual-path inference mechanism that jointly predicts input legality-aware and task-specific outputs. Comprehensive experimental results on multiple cross-domain benchmarks demonstrate that AoD-IP maintains strong authorized-domain performance and reliable unauthorized detection, while supporting user-controlled authorization for adaptive deployment in dynamic environments.
中文标题/摘要
标题:按需授权:具有法律意识的知识产权保护以实现VLM的动态授权
视觉-语言模型(VLMs)的快速采用加剧了对这些高价值预训练模型的知识产权(IP)保护需求。有效的IP保护应主动限制模型部署在授权领域内,并防止未经授权的转移。然而,现有方法依赖于静态训练时定义,限制了在动态环境中的灵活性,并经常对未经授权的输入产生不透明的响应。为了解决这些限制,我们提出了一种新颖的具有法律意识的知识产权保护(AoD-IP)框架,用于VLMs,该框架支持按需授权和法律意识评估。AoD-IP引入了一个轻量级的动态授权模块,使授权更加灵活和用户可控,允许用户在部署时主动指定或切换授权领域。这使模型能够无缝适应应用场景的变化,并提供了比现有静态领域方法更大的可扩展性。此外,AoD-IP结合了一种双路径推理机制,同时预测输入的法律意识和任务特定输出。在多个跨域基准上的全面实验结果表明,AoD-IP在授权领域内保持了强大的性能,并且在未经授权检测方面具有可靠性,同时支持用户控制的授权以适应动态环境中的部署。
Summary / 总结
The paper proposes AoD-IP, a dynamic authorization framework for vision-language models (VLMs) that supports authorize-on-demand and legality-aware assessment. It introduces a lightweight dynamic authorization module allowing users to specify authorized domains at deployment time, enhancing flexibility and extensibility. Experimental results show that AoD-IP maintains strong performance in authorized domains and reliable unauthorized detection, supporting adaptive deployment in dynamic environments.
论文提出了AoD-IP,一种支持按需授权和合法性评估的VLM动态授权框架。它引入了一个轻量级的动态授权模块,允许用户在部署时指定授权域,增强了灵活性和扩展性。实验结果表明,AoD-IP在授权域中保持了强大的性能,并且对未授权检测可靠,支持在动态环境中进行自适应部署。
Differentially Private Multimodal In-Context Learning
Authors: Ivoline C. Ngong, Zarreen Reza, Joseph P. Near
First: 2026-03-05T07:36:02+00:00 · Latest: 2026-03-05T07:36:02+00:00
Abstract
Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. We evaluate on eight benchmarks across three VLM architectures, supporting deployment with or without auxiliary data. At $\varepsilon=1.0$, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints.
Summary / 总结
The research aims to enable privacy-preserving multimodal in-context learning for sensitive applications like medical imaging. The method, DP-MTV, partitions private data into disjoint chunks, applies per-layer clipping, and adds calibrated noise to the aggregate, enabling many-shot learning with formal differential privacy. At ε=1.0, DP-MTV achieves 50% on VizWiz, compared to 55% for the non-private setting and 35% zero-shot, preserving most of the gain from in-context learning under privacy constraints.
研究旨在为敏感应用如医疗影像提供隐私保护的多模态上下文学习。方法DP-MTV将私有数据分区,应用逐层剪裁,并添加校准噪声,以实现多轮次学习并满足形式上的差分隐私。在ε=1.0时,DP-MTV在VizWiz上的准确率为50%,接近55%的非隐私模型和35%的零-shot模型,同时保留了上下文学习的优势。
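The clip, aggregate, and add-noise recipe described for DP-MTV can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles one layer's vector per chunk, uses the classic Gaussian-mechanism calibration (whose standard proof covers 0 < ε < 1), and assumes that changing one private example alters at most one chunk vector. The function names are hypothetical.

```python
import math
import random


def clip_l2(vec, max_norm):
    """Scale vec down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    scale = max_norm / norm
    return [x * scale for x in vec]


def gaussian_sigma(epsilon, delta, sensitivity):
    """Classic Gaussian-mechanism noise scale (proof covers 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def private_task_vector(chunk_vectors, max_norm, epsilon, delta, rng):
    """Average clipped per-chunk vectors and add Gaussian noise once."""
    clipped = [clip_l2(v, max_norm) for v in chunk_vectors]
    n = len(clipped)
    dim = len(clipped[0])
    mean = [sum(v[i] for v in clipped) / n for i in range(dim)]
    # Assumption: replacing one chunk vector moves the mean by at most
    # 2 * max_norm / n in L2 norm, which is the sensitivity used here.
    sigma = gaussian_sigma(epsilon, delta, 2.0 * max_norm / n)
    return [m + rng.gauss(0.0, sigma) for m in mean]


# Toy chunk vectors; real task vectors would come from model activations.
chunks = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
noisy = private_task_vector(chunks, 1.0, 0.5, 1e-5, random.Random(0))
```

Because noise is added once to the aggregate, the resulting vector can be reused for any number of inference queries without further privacy cost, which matches the single noise addition mentioned in the abstract.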
Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models
Authors: Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish
First: 2026-03-05T07:35:07+00:00 · Latest: 2026-03-05T07:35:07+00:00
Abstract
Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@$k$ problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training free, low cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model. Our results demonstrate significantly improved diversity and Pass@$k$ performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at https://github.com/sean-lamont/odd.
中文标题/摘要
标题:免费午餐?低成本多样化采样以提升扩散语言模型
文本生成中的多样化输出对于复杂推理任务(如代码生成和数学问题解决)的有效探索是必要的。此类Pass@$k$问题可以从不同的候选方案中受益,这些方案覆盖了解空间。然而,传统的采样方法往往在重复的失败模式上浪费计算资源。尽管扩散语言模型已经成为了与自回归范式竞争的有力替代方案,但它们仍然容易受到这种冗余的影响,独立样本经常陷入相似的模式。为了解决这一问题,我们提出了一种无需训练、低成本的干预措施,以增强扩散语言模型的生成多样性。我们的方法在批次中的中间样本按顺序进行修改,每个样本都远离前一个样本的特征空间,积极惩罚冗余。与需要重新训练或使用束搜索的先前方法不同,我们的策略几乎不增加计算开销,同时确保每个样本都为批次贡献了独特的视角。我们在HumanEval和GSM8K基准上使用LLaDA-8B-Instruct模型评估了我们的方法。结果显示,在各种温度设置下,我们的方法显著提高了多样性和Pass@$k$性能。作为一种简单的采样过程修改,我们的方法为当前和未来的扩散语言模型提供了即时、低成本的改进,特别是在需要多样化解决方案搜索的任务中。我们将在https://github.com/sean-lamont/odd/提供我们的代码。
Summary / 总结
This paper addresses the need for diverse outputs in text generation for complex reasoning tasks, such as code generation and mathematical problem solving. It proposes a low-cost intervention to enhance generative diversity in Diffusion Language Models by sequentially modifying intermediate samples to repel them from the feature space of previous samples. The method does not require retraining or beam search, thus incurring minimal computational overhead. Experimental results on HumanEval and GSM8K benchmarks show significantly improved diversity and Pass@$k$ performance across various temperature settings, demonstrating the method's effectiveness and low cost. The code is available at https://github.com/sean-lamont/odd.
本文旨在解决复杂推理任务(如代码生成和数学问题解决)中需要多样输出的问题。它提出了一种低成本的方法来增强扩散语言模型的生成多样性,通过顺序修改中间样本,使其远离之前样本的特征空间。该方法无需重新训练或使用束搜索,并在HumanEval和GSM8K基准上使用LLaDA-8B-Instruct模型展示了在各种温度设置下显著提高的多样性和Pass@$k$性能。
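The sequential repulsion described in this abstract, where each sample in a batch is pushed away from the features of earlier samples, can be sketched on plain feature vectors. This is an illustrative assumption rather than the authors' procedure: the paper modifies intermediate samples during diffusion, whereas the sketch below uses a hypothetical distance-decaying step on static vectors, and `strength` is an invented parameter.

```python
import math


def repel(feature, previous, strength=0.5, eps=1e-8):
    """Push `feature` away from each earlier sample's feature vector."""
    out = list(feature)
    for prev in previous:
        diff = [a - b for a, b in zip(out, prev)]
        dist = math.sqrt(sum(d * d for d in diff))
        # Nearby samples are pushed harder than distant ones.
        weight = strength / (1.0 + dist)
        if dist < eps:
            # Coincident points have no separating direction; break the tie
            # along the first axis (an arbitrary choice for this sketch).
            diff = [1.0] + [0.0] * (len(out) - 1)
            dist = 1.0
        out = [a + weight * d / dist for a, d in zip(out, diff)]
    return out


def diversify_batch(features, strength=0.5):
    """Process the batch in order; each sample sees only earlier ones."""
    adjusted = []
    for feature in features:
        adjusted.append(repel(feature, adjusted, strength))
    return adjusted


# Two identical samples and one nearby sample end up separated.
batch = diversify_batch([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
```

The first sample is left unchanged, so the cost grows only with the number of earlier samples each new sample is compared against.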
RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
Authors: I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu, Yu-Chien Chiang, Jen-Chien Yang, Wei-Ting Chen
First: 2026-02-25T15:27:57+00:00 · Latest: 2026-03-05T07:12:37+00:00
Comments: Accepted by CVPR2026; Project Page: https://robustvisrag.github.io
Abstract
Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.
中文标题/摘要
标题:RobustVisRAG:视觉退化条件下的因果关系感知视觉检索增强生成
基于视觉的检索增强生成(VisRAG)利用视觉语言模型(VLMs)联合检索相关视觉文档,并基于多模态证据生成基于事实的答案。然而,现有的VisRAG模型在视觉输入遭受模糊、噪声、低光照或阴影等退化时性能会下降,因为语义和退化因素在预训练的视觉编码器中交织在一起,导致检索和生成阶段出现错误。为了解决这一局限性,我们提出了RobustVisRAG,这是一种因果关系引导的双路径框架,该框架在保持效率和零样本泛化能力的同时提高了VisRAG的鲁棒性。RobustVisRAG使用非因果路径通过单向注意力捕捉退化信号,并使用因果路径通过这些信号学习净化的语义。通过提出的非因果退化建模和因果语义对齐目标,该框架确保语义和退化之间的清晰分离,从而在具有挑战性的视觉条件下实现稳定的检索和生成。为了在现实条件下评估鲁棒性,我们引入了Distortion-VisRAG数据集,这是一个包含合成和真实世界退化文档的大规模基准,涵盖了七个领域,包括12种合成和5种真实退化类型,这些类型全面反映了实际的视觉退化。实验结果表明,RobustVisRAG在真实世界退化条件下分别提高了检索、生成和端到端性能7.35%、6.35%和12.40%,同时在干净输入上保持了相当的准确性。
Summary / 总结
RobustVisRAG is a causality-guided dual-path framework that enhances the robustness of Vision-based Retrieval-Augmented Generation (VisRAG) models under visual degradations. It uses a non-causal path to capture degradation signals and a causal path to learn purified semantics, improving retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40% respectively on real-world degradations. The framework is evaluated on the Distortion-VisRAG dataset, which includes both synthetic and real-world degraded documents across seven domains.
RobustVisRAG 是一个因果引导的双路径框架,旨在增强视觉检索增强生成(VisRAG)模型在视觉退化条件下的鲁棒性。该框架通过非因果路径捕捉退化信号,通过因果路径学习净化的语义,分别在真实世界退化条件下提高了检索、生成和端到端性能7.35%、6.35%和12.40%。该框架在包含七个领域中合成和真实世界退化文档的 Distortion-VisRAG 数据集上进行了评估。
GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?
Authors: Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, Yingchun Wang
First: 2025-10-23T08:33:24+00:00 · Latest: 2026-03-05T06:51:24+00:00
Abstract
Vision-Language Models (VLMs) are increasingly deployed as autonomous agents to navigate mobile graphical user interfaces (GUIs). Operating in dynamic on-device ecosystems, which include notifications, pop-ups, and inter-app interactions, exposes them to a unique and underexplored threat vector: environmental injection. Unlike prompt-based attacks that manipulate textual instructions, environmental injection corrupts an agent's visual perception by inserting adversarial UI elements (for example, deceptive overlays or spoofed notifications) directly into the GUI. This bypasses textual safeguards and can derail execution, causing privacy leakage, financial loss, or irreversible device compromise. To systematically evaluate this threat, we introduce GhostEI-Bench, the first benchmark for assessing mobile agents under environmental injection attacks within dynamic, executable environments. Moving beyond static image-based assessments, GhostEI-Bench injects adversarial events into realistic application workflows inside fully operational Android emulators and evaluates performance across critical risk scenarios. We further propose a judge-LLM protocol that conducts fine-grained failure analysis by reviewing the agent's action trajectory alongside the corresponding screenshot sequence, pinpointing failure in perception, recognition, or reasoning. Comprehensive experiments on state-of-the-art agents reveal pronounced vulnerability to deceptive environmental cues: current models systematically fail to perceive and reason about manipulated UIs. GhostEI-Bench provides a framework for quantifying and mitigating this emerging threat, paving the way toward more robust and secure embodied agents.
中文标题/摘要
标题:GhostEI-Bench:移动代理在动态设备环境中对环境注入的韧性如何?
视觉-语言模型(VLMs)正越来越多地被部署为自主代理,以导航移动图形用户界面(GUIs)。在包括通知、弹出窗口和跨应用交互的动态设备生态系统中运行,使它们面临一种独特的、尚未充分探索的威胁向量:环境注入。与基于提示的攻击不同,后者操纵文本指令,环境注入通过直接向GUI插入对抗性UI元素(例如,欺骗性覆盖或伪造的通知)来篡改代理的视觉感知。这绕过了文本保护措施,可能导致执行中断,造成隐私泄露、经济损失或设备不可逆的破坏。为了系统地评估这一威胁,我们引入了GhostEI-Bench,这是首个评估移动代理在动态可执行环境中遭受环境注入攻击的基准。超越基于静态图像的评估,GhostEI-Bench在完全运行的Android模拟器中注入对抗性事件到现实的应用工作流程中,并在关键风险场景中评估性能。我们进一步提出了一种裁判LLM协议,通过审查代理的动作轨迹与相应的屏幕截图序列来开展精细的失败分析,定位感知、识别或推理中的失败。全面的实验表明,最先进的代理模型对欺骗性环境线索表现出明显的脆弱性:当前模型系统地无法感知和推理关于被操纵的UI。GhostEI-Bench提供了一种量化和缓解这一新兴威胁的框架,为更稳健和安全的实体代理铺平了道路。
Summary / 总结
The paper introduces GhostEI-Bench, a benchmark for evaluating mobile agents' resilience to environmental injection attacks in dynamic on-device environments. It addresses the threat of adversarial UI elements that bypass textual safeguards and can cause privacy or financial issues. The benchmark injects adversarial events into realistic workflows and evaluates performance in critical scenarios. Key findings show that state-of-the-art agents are highly vulnerable to deceptive environmental cues, failing to perceive and reason about manipulated UIs. This work provides a framework for quantifying and mitigating this emerging threat.
论文介绍了GhostEI-Bench,这是一个用于评估移动代理在动态设备环境中对环境注入攻击的抗性基准。它解决了通过文本保护措施绕过的对抗性UI元素的威胁,可能导致隐私或财务问题。方法是将对抗性事件注入到Android模拟器中的现实应用工作流中,并在关键场景中评估性能。主要发现表明,最先进的代理对欺骗性环境线索高度脆弱,无法感知和推理关于被操纵的UI,突显了需要采取有效的缓解策略。
On Multi-Step Theorem Prediction via Non-Parametric Structural Priors
Authors: Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang, Hua Huang
First: 2026-03-05T06:08:50+00:00 · Latest: 2026-03-05T06:08:50+00:00
Abstract
Multi-step theorem prediction is a central challenge in automated reasoning. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM's inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.
中文标题/摘要
标题:基于非参数结构先验的多步定理预测
多步定理预测是自动推理中的一个核心挑战。现有的神经符号方法主要依赖于监督参数模型,这些模型在处理不断演化的定理库时表现出有限的泛化能力。在本文中,我们通过上下文学习(ICL)的视角探索无训练的定理预测。我们识别出一个关键的可扩展性瓶颈,称为结构漂移:随着推理深度的增加,vanilla ICL的性能急剧下降,通常会崩溃到接近零。我们将这种失败归因于LLM无法恢复潜在的拓扑依赖性,导致无序探索。为了解决这个问题,我们提出了定理优先图,它将历史解题轨迹中的时间依赖性编码为有向图,并施加显式的拓扑约束,有效地在推理期间剪枝搜索空间。结合检索增强的图构建和逐步符号执行,我们的方法使LLM能够作为结构化规划者而无需任何基于梯度的优化。在FormalGeo7k基准测试上的实验表明,我们的方法达到了89.29%的准确率,显著优于ICL基线,并且与最先进的监督模型相当。这些结果表明,显式的结构先验为扩展基于LLM的符号推理提供了一个有希望的方向。
Summary / 总结
This paper addresses the challenge of multi-step theorem prediction in automated reasoning by proposing a training-free approach using Theorem Precedence Graphs. The method leverages in-context learning and explicit topological constraints to overcome the scalability bottleneck of unstructured exploration. Experiments on the FormalGeo7k benchmark demonstrate that this approach achieves 89.29% accuracy, significantly outperforming in-context learning baselines and matching state-of-the-art supervised models.
该论文通过提出使用定理 precedence 图的方法,解决自动推理中的多步定理预测挑战,该方法利用上下文学习和显式的拓扑约束来克服无结构探索的可扩展性瓶颈。实验表明,该方法在 FormalGeo7k 基准上的准确率达到 89.29%,显著优于上下文学习基线,并与最先进的监督模型相当。
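The pruning role of a Theorem Precedence Graph can be illustrated with a small sketch. It assumes, for illustration only, that an edge a -> b is recorded whenever theorem a is applied immediately before theorem b in a historical trace, and that candidates are kept only if they follow the last applied theorem; the paper's actual graph construction and retrieval step may differ, and the theorem names are made up.

```python
from collections import defaultdict


def build_precedence_graph(traces):
    """Collect first-step theorems and direct successor edges from traces."""
    successors = defaultdict(set)
    starts = set()
    for trace in traces:
        if trace:
            starts.add(trace[0])
        for earlier, later in zip(trace, trace[1:]):
            successors[earlier].add(later)
    return starts, dict(successors)


def allowed_next(candidates, applied, starts, successors):
    """Keep only candidates consistent with the precedence graph."""
    if not applied:
        return [c for c in candidates if c in starts]
    permitted = successors.get(applied[-1], set())
    return [c for c in candidates if c in permitted]


# Hypothetical solution traces (theorem names are illustrative only).
traces = [
    ["parallel_property", "triangle_sum", "pythagorean"],
    ["triangle_sum", "similar_triangles"],
]
starts, successors = build_precedence_graph(traces)
pruned = allowed_next(
    ["pythagorean", "parallel_property", "similar_triangles"],
    ["triangle_sum"],
    starts,
    successors,
)
```

Here `pruned` keeps only the theorems that have followed `triangle_sum` in some trace, which is how a structural prior can shrink the set of options an LLM is asked to choose from at each step.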
AutoV: Loss-Oriented Ranking for Visual Prompt Retrieval in LVLMs
Authors: Yuan Zhang, Chun-Kai Fan, Sicheng Yu, Junwen Pan, Tao Huang, Ming Lu, Kuan Cheng, Qi She, Shanghang Zhang
First: 2025-06-19T08:02:53+00:00 · Latest: 2026-03-05T05:25:24+00:00
Abstract
Inspired by text prompts in large language models, visual prompts have been explored to enhance the perceptual capabilities of large vision-language models (LVLMs). However, performance tends to saturate under single visual prompt designs, making further prompt engineering increasingly ineffective. To address this limitation, we shift from prompt engineering to prompt retrieval and propose AutoV, a lightweight framework for instance-adaptive visual prompt identification. Given an input image and a textual query, AutoV automatically locates the most suitable visual prompt from a diverse candidate pool. Training such a retrieval framework requires prompt-level supervision, yet prompt quality is inherently ambiguous and difficult to assess reliably, even for humans. To enable automatic supervision, we evaluate visual prompts using a pre-trained LVLM and label them according to their prediction losses. Using the loss-oriented ranking as a robust training signal, AutoV learns to retrieve the query-aware optimal prompt for each instance without manual annotation. Experiments indicate that AutoV enhances the performance of various LVLMs on image understanding, captioning, grounding, and classification tasks. For example, AutoV improves LLaVA-OV by 10.2% on VizWiz and boosts Qwen2.5-VL by 3.8% on MMMU.
中文标题/摘要
标题:AutoV:面向视觉提示检索的损失导向排名
受大型语言模型中文本提示的启发,视觉提示已被探索以增强大型视觉-语言模型(LVLM)的感知能力。然而,在单一视觉提示设计下,性能往往会饱和,使得进一步的提示工程变得越来越无效。为了解决这一局限性,我们从提示工程转向提示检索,并提出AutoV,这是一种轻量级框架,用于实例自适应视觉提示识别。给定输入图像和文本查询,AutoV 自动从多样化的候选池中定位最合适的视觉提示。训练这种检索框架需要提示级别的监督,但提示质量本质上是模糊的,即使对人类来说也难以可靠地评估。为了实现自动监督,我们使用预训练的LVLM评估视觉提示,并根据其预测损失对其进行标记。利用损失导向的排名作为稳健的训练信号,AutoV 学习在每个实例中检索与查询相关的最佳提示,而无需手动注释。实验表明,AutoV 在图像理解、描述、定位和分类任务中提高了各种LVLM的表现。例如,AutoV 在VizWiz上将LLaVA-OV 的性能提高了10.2%,在MMMU上将Qwen2.5-VL 的性能提高了3.8%。
Summary / 总结
The paper introduces AutoV, a framework for automatically retrieving visual prompts to improve the performance of large vision-language models. It uses loss-oriented ranking to train the framework without manual annotation, and demonstrates improvements in various tasks such as image understanding, captioning, grounding, and classification. For instance, AutoV enhances LLaVA-OV by 10.2% on VizWiz and Qwen2.5-VL by 3.8% on MMMU.
AutoV 是一个轻量级框架,旨在从多样化的候选池中自动检索最适合的视觉提示,以增强大型视觉-语言模型的感知能力。它使用基于损失的排名进行训练,无需人工标注,从而在图像理解、描述、定位和分类等多种任务上提高了性能。例如,AutoV 在 VizWiz 上将 LLaVA-OV 的性能提高了 10.2%,在 MMMU 上将 Qwen2.5-VL 的性能提高了 3.8%。
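The loss-oriented labeling step described for AutoV can be sketched as a simple ranking over candidate prompts. This is a minimal illustration: the prompt names and loss values are invented, and turning the ranking into pairwise preferences is an assumption about how such a signal could feed a retriever, not a detail stated in the abstract.

```python
def rank_prompts_by_loss(losses):
    """Order prompt names best-first, i.e. lowest LVLM prediction loss first."""
    # Ties are broken by name so the ordering is deterministic.
    return sorted(losses, key=lambda name: (losses[name], name))


def pairwise_preferences(losses):
    """Turn the ranking into (better, worse) pairs with strictly lower loss."""
    ranked = rank_prompts_by_loss(losses)
    return [
        (ranked[i], ranked[j])
        for i in range(len(ranked))
        for j in range(i + 1, len(ranked))
        if losses[ranked[i]] < losses[ranked[j]]
    ]


# Hypothetical losses of a pre-trained LVLM for one image-query pair.
losses = {"crop": 0.9, "attention_mask": 0.4, "no_prompt": 1.3}
ranking = rank_prompts_by_loss(losses)
pairs = pairwise_preferences(losses)
```

Since the labels come from model losses, no human judgment of prompt quality is needed, which is the point the abstract makes about avoiding manual annotation.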