Accelerating Text-to-Video Generation with Calibrated Sparse Attention
Authors: Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar
First: 2026-03-05T18:59:32+00:00 · Latest: 2026-03-05T18:59:32+00:00
Abstract
Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
中文标题/摘要
标题:校准稀疏注意加速文本生成视频
最近的扩散模型能够生成高质量的视频,但运行速度较慢。这些模型中使用的大型基于变压器的骨干网络由于时空注意机制而成为瓶颈。在本文中,我们发现大量词到词的连接在各种输入中持续产生微不足道的分数,并且它们的模式在查询之间经常重复。因此,在这些情况下可以跳过注意计算,对结果影响甚微。这一观察结果同样适用于局部词块之间的连接。受此启发,我们引入了CalibAtt,这是一种无需训练的方法,通过校准稀疏注意来加速视频生成。CalibAtt 进行了一次离线校准过程,以识别在各种输入中稳定的块级稀疏性和重复模式,并将这些模式编译为每层、每个头和每个扩散时间步的优化注意操作。在推理时,我们对选定的输入相关连接进行密集计算,并以硬件高效的方式跳过未选中的连接。在Wan 2.1 14B、Mochi 1和不同分辨率下的少量步骤蒸馏模型上进行的广泛实验表明,CalibAtt 可以实现高达1.58倍的端到端加速,同时优于现有无需训练的方法,保持视频生成质量和文本-视频对齐。
Summary / 总结
This paper addresses the slow runtime of diffusion models used for high-quality text-to-video generation. It introduces CalibAtt, a training-free method that accelerates video generation by skipping unnecessary attention computations based on identified sparsity and repetition patterns. Experiments show that CalibAtt achieves up to 1.58x speedup while maintaining video quality and text-video alignment.
本文通过引入CalibAtt方法,即一种无需训练的加速方法,利用校准的稀疏注意力来解决用于高质量文本到视频生成的扩散模型的运行缓慢问题。CalibAtt在离线校准过程中识别并跳过不重要的token-to-token连接,为每一层、每一个头和每一个扩散时间步优化注意力操作。实验结果显示,CalibAtt可以实现最高1.58倍的端到端加速,同时不牺牲视频质量和文本-视频对齐。
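To make the idea of skipping unselected block connections concrete, here is a minimal single-head sketch in plain Python. It assumes a boolean block mask `keep` has already been produced by some offline calibration pass. The function name, the mask layout, and the per-token loop are illustrative assumptions, not the CalibAtt implementation, which the abstract describes as compiled into hardware-efficient attention operations.

```python
import math

def block_sparse_attention(q, k, v, keep, block):
    """Single-head attention that only visits key blocks marked in `keep`.

    q, k, v: lists of float vectors, one per token.
    keep[i][j]: True when query block i attends to key block j
                (hypothetical mask from an offline calibration pass).
    Each query block must keep at least one key block.
    """
    d = len(q[0])
    out = []
    for qi, qvec in enumerate(q):
        row = keep[qi // block]
        # Only key positions inside kept blocks are scored; the rest are skipped.
        cols = [j for j in range(len(k)) if row[j // block]]
        scores = [sum(a * b for a, b in zip(qvec, k[j])) / math.sqrt(d)
                  for j in cols]
        m = max(scores)                      # subtract max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * v[j][c] for wj, j in zip(w, cols)) / z
                    for c in range(len(v[0]))])
    return out
```

With an all-`True` mask this reduces to ordinary dense softmax attention. With a sparser mask, the skipped blocks contribute neither compute nor output, which is where a speedup would come from in a real kernel.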
HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token
Authors: Sai Akhil Kogilathota, Sripadha Vallabha E G, Luzhe Sun, Jiawei Zhou
Venue: The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)
First: 2026-03-05T18:36:31+00:00 · Latest: 2026-03-05T18:36:31+00:00
Abstract
Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
中文标题/摘要
标题:HALP:无需生成单个词元即可检测视觉语言模型中的幻觉
幻觉仍然是视觉语言模型(VLMs)的一个持续性挑战,它们经常描述不存在的对象或编造事实。现有的检测方法通常在文本生成之后进行操作,这使得干预既昂贵又不及时。我们研究了是否可以在生成任何词元之前通过探测模型的内部表示来预测幻觉风险,而只需一次前向传递。在一系列视觉语言任务和八个现代VLMs(包括Llama-3.2-Vision、Gemma-3、Phi-4-VL和Qwen2.5-VL)上,我们检查了三种内部表示家族:(i)仅视觉特征而不进行多模态融合,(ii)文本解码器中的视觉词元表示,以及(iii)在生成之前整合视觉和文本信息的查询词元表示。基于这些表示训练的探测器在无需解码的情况下实现了强大的幻觉检测性能,达到Gemma-3-12B、Phi-4-VL 5.6B和Molmo 7B上高达0.93的AUROC。大多数模型中,后期查询词元状态最具预测性,而视觉或中间层特征在少数架构中占主导地位(例如,Qwen2.5-VL-7B使用仅视觉特征的AUROC约为0.79)。这些结果表明:(1)幻觉风险可以在生成之前检测到,(2)最具信息量的层和模态在不同架构中有所不同,(3)轻量级探测器有可能实现早期避免、选择性路由和自适应解码,以提高安全性和效率。
Summary / 总结
The paper introduces HALP, a method for detecting hallucinations in vision-language models before any text generation occurs. By probing internal representations during a single forward pass, the method achieves strong performance, with AUROCs up to 0.93 on several models. The study finds that late query-token states are most predictive for most models, while visual or mid-layer features are more informative for some architectures. This suggests that hallucination risk can be detected early, and that lightweight probes could enable more efficient and safe model operation.
该研究提出了HALP方法,无需生成任何标记即可检测视觉语言模型中的幻觉。通过在单次前向传播过程中探测内部表示,该方法在多个模型上取得了良好的性能,AUROC得分最高可达0.93。研究发现,大多数模型中晚期查询标记状态最具预测性,而某些架构中视觉或中间层特征更为重要。这项工作表明,幻觉风险可以在生成之前被检测到,并且轻量级探测器可以实现早期干预,以提高安全性和效率。
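The probing setup can be illustrated with a small sketch: a logistic-regression probe trained on feature vectors, scored with AUROC. The feature vectors below merely stand in for hidden states extracted in a single forward pass. The probe architecture, the names `train_probe` and `auroc`, and the hyperparameters are assumptions for illustration, since the abstract does not specify them.

```python
import math

def train_probe(feats, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe with batch gradient descent."""
    w = [0.0] * len(feats[0])
    b = 0.0
    n = len(feats)
    for _ in range(steps):
        gw = [0.0] * len(w)
        gb = 0.0
        for x, y in zip(feats, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted hallucination risk
            for i, xi in enumerate(x):
                gw[i] += (p - y) * xi
            gb += p - y
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because the probe only reads representations that exist before decoding, its score could in principle gate abstention or routing before any token is produced.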
Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes
Authors: Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan
Venue: ICLR 2026
First: 2026-03-05T18:25:26+00:00 · Latest: 2026-03-05T18:25:26+00:00
Comments: Accepted at ICLR 2026
Abstract
Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance', committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
中文标题/摘要
标题:超越零散接受:通过最长稳定前缀实现DLMs的快速和连贯推理
扩散语言模型(DLMs)承诺实现高度并行的文本生成,但其实用推理速度往往受限于次优解码调度器。标准方法依赖于“零散接受”——在序列中不连续的位置提交高置信度的标记。这种方法无意中破坏了键值(KV)缓存,破坏了内存局部性,并迫使模型在不稳定的标记边界上进行昂贵的重复修复。为了解决这个问题,我们提出了最长稳定前缀(LSP)调度器,这是一种基于单一前缀吸收的无训练和模型无关的推理范式。在每次去噪步骤中,LSP 通过单次前向传播评估标记的稳定性,动态识别一个连续的左对齐的稳定预测块,并在原子提交前将其边界对齐到自然语言或结构分隔符。这种前缀优先的拓扑结构带来了双重好处:系统上,它将碎片化的KV缓存更新转换为高效的连续追加;算法上,它保留了对几何缩小的活动后缀的双向前瞻,大幅减少了标记翻转率和去噪器调用次数。在LLaDA-8B和Dream-7B上的广泛评估表明,LSP 在包括数学推理、代码生成、多语言(CJK)任务和创造性写作在内的严格基准测试中将推理加速了高达3.4倍,同时保持或略微提高了输出质量。通过从根本上重新结构化提交拓扑,LSP 桥接了DLMs的理论并行性和实际硬件效率之间的差距。
Summary / 总结
The paper addresses the issue of slow inference in Diffusion Language Models (DLMs) due to suboptimal decoding schedulers. It introduces the Longest Stable Prefix (LSP) scheduler, which evaluates token stability and commits to a contiguous block of stable predictions, thereby preserving the KV cache and reducing token flip rates. Experiments on LLaDA-8B and Dream-7B show that LSP can accelerate inference by up to 3.4x while maintaining or slightly improving output quality across various tasks.
论文解决了由于解码调度器将高置信度的标记分散在不连续位置而导致的扩散语言模型(DLMs)的推理速度慢的问题,这会破坏键值缓存并增加计算成本。它提出了最长稳定前缀(LSP)调度器,该调度器评估标记的稳定性并一次性提交一个连续的稳定预测块,从而保持内存局部性并减少标记翻转率。在LLaDA-8B和Dream-7B上的实验表明,LSP可以加速推理最多3.4倍,同时保持或略微提高输出质量。
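The prefix selection step described above can be sketched as follows. This is a simplified illustration under stated assumptions: stability is approximated by a per-token confidence threshold, the delimiter set is hypothetical, and falling back to the unsnapped run when no delimiter is found is a choice made here, not one taken from the paper.

```python
def longest_stable_prefix(tokens, confidences, threshold=0.9,
                          delimiters=(".", ",", ";", "\n")):
    """Return how many leading tokens of the active suffix to commit.

    tokens, confidences: predictions for the still-uncommitted suffix.
    threshold and delimiters are illustrative assumptions.
    """
    # 1. Longest contiguous, left-aligned run of confident predictions.
    stable = 0
    for c in confidences:
        if c < threshold:
            break
        stable += 1
    # 2. Snap the boundary back to the last delimiter inside that run.
    for end in range(stable, 0, -1):
        if tokens[end - 1] in delimiters:
            return end
    # 3. No delimiter found: fall back to the raw stable run (assumption).
    return stable
```

Because the committed tokens always form a contiguous prefix, the corresponding KV cache entries can be appended in order. A practical scheduler would also need a rule for making progress when the function returns 0.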
RelaxFlow: Text-Driven Amodal 3D Generation
Authors: Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao
First: 2026-03-05T17:45:47+00:00 · Latest: 2026-03-05T17:45:47+00:00
Comments: Code: https://github.com/viridityzhu/RelaxFlow
Abstract
Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.
中文标题/摘要
标题:RelaxFlow:文本驱动的非模态3D生成
从图像到3D的生成面临着在遮挡下固有的语义模糊性,其中仅部分观察往往不足以确定物体类别。在本文中,我们形式化了文本驱动的非模态3D生成,其中文本提示引导未见区域的完成,同时严格保留输入观察。关键的是,我们发现这些目标需要不同的控制粒度:对观察进行刚性控制,而对提示进行放松的结构控制。为此,我们提出了一种无需训练的双分支框架RelaxFlow,通过多先验一致性模块和放松机制解耦控制粒度。理论上,我们证明了我们的放松等同于在生成向量场中应用低通滤波器,这抑制了高频实例细节以隔离几何结构,以适应观察。为了便于评估,我们引入了两个诊断基准,ExtremeOcc-3D和AmbiSem-3D。广泛的实验表明,RelaxFlow成功地引导未见区域的生成以匹配提示意图,而不牺牲视觉保真度。
Summary / 总结
The research aims to address the challenge of generating complete 3D models from partial observations using text prompts. The proposed RelaxFlow framework uses a dual-branch approach with a Multi-Prior Consensus Module and a Relaxation Mechanism to decouple control granularity between the observed regions and the text-driven regions. Experiments show that RelaxFlow can effectively generate unseen regions that align with the text prompt while maintaining visual fidelity.
研究旨在通过文本提示从部分观察中生成完整的3D模型,同时保持已观察部分的完整性。提出了一种无需训练的框架RelaxFlow,通过多先验一致性模块和放松机制来解耦控制粒度。理论分析表明,这种放松技术抑制了高频细节,专注于与输入观察相一致的几何结构。实验表明,RelaxFlow能够有效匹配未观察部分的提示意图,同时保持视觉保真度。
ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking
Authors: Sijia Chen, Zihan Zhou, Yanqiu Yu, En Yu, Wenbing Tao
First: 2026-03-05T17:15:01+00:00 · Latest: 2026-03-05T17:15:01+00:00
Comments: https://github.com/chen-si-jia/ORMOT
Abstract
Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model's ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.
中文标题/摘要
标题:ORMOT:全景描述多目标跟踪的数据集和框架
多目标跟踪(MOT)是计算机视觉中的一个基本任务,旨在跨视频帧跟踪目标。现有的MOT方法在一般视觉场景中表现良好,但在扩展到视觉语言设置时面临重大挑战和限制。为了解决这一差距,最近提出了描述多目标跟踪(RMOT)任务,旨在跟踪与语言描述对应的物体。然而,当前的RMOT方法主要是在由传统相机拍摄的数据集上开发的,这些数据集存在有限的视野。这种限制往往导致目标移出画面,从而导致跟踪片段化并丢失上下文信息。在本文中,我们提出了一项新的任务,称为全景描述多目标跟踪(ORMOT),该任务将RMOT扩展到全景图像,旨在克服传统数据集的视野(FoV)限制,并提高模型理解长时语言描述的能力。为了推进ORMOT任务,我们构建了ORSet,一个全景描述多目标跟踪数据集,包含27个多样化的全景场景、848个语言描述和3,401个标注物体,提供了丰富的视觉、时间和语言信息。此外,我们提出了ORTrack,一种针对全景描述多目标跟踪的大型视觉-语言模型(LVLM)驱动框架。在ORSet数据集上的广泛实验表明,我们的ORTrack框架是有效的。数据集和代码将在https://github.com/chen-si-jia/ORMOT开源。
Summary / 总结
The research aims to address the limitations of existing Multi-Object Tracking (MOT) methods in visual-language settings by proposing Omnidirectional Referring Multi-Object Tracking (ORMOT). The authors construct ORSet, a dataset with 27 omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, and develop ORTrack, a Large Vision-Language Model-driven framework. Experiments show the effectiveness of ORTrack in ORMOT tasks.
研究提出了一项新任务,即全景描述多目标跟踪(ORMOT),以解决现有多目标跟踪(MOT)方法在视觉-语言场景中的局限性。作者构建了包含27个全景场景、848个语言描述和3,401个标注物体的ORSet数据集,并提出了由大型视觉-语言模型(LVLM)驱动的ORTrack框架。在ORSet上的实验表明了ORTrack框架的有效性。数据集和代码将在https://github.com/chen-si-jia/ORMOT开源。
OpenFrontier: General Navigation with Visual-Language Grounded Frontiers
Authors: Esteban Padilla, Boyang Sun, Marc Pollefeys, Hermann Blum
First: 2026-03-05T17:02:22+00:00 · Latest: 2026-03-05T17:02:22+00:00
Abstract
Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select navigation frontiers as semantic anchors and propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D mapping, policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.
中文标题/摘要
标题:OpenFrontier:基于视觉-语言引导边界的通用导航
开放世界导航要求机器人在复杂日常环境中做出决策并适应灵活的任务需求。传统导航方法通常依赖密集的3D重建和手工制作的目标度量标准,这限制了它们在不同任务和环境中的泛化能力。视觉-语言导航(VLN)和视觉-语言-动作(VLA)模型的最新进展使基于自然语言的端到端策略成为可能,但通常需要交互式训练、大规模数据收集或针对移动代理的任务特定微调。我们将导航问题形式化为稀疏子目标识别和到达问题,并观察到提供视觉锚定目标以支持高层语义先验可以实现高效的基于目标的导航。基于这一洞察,我们选择导航边界作为语义锚点,并提出OpenFrontier,这是一种无需训练的导航框架,能够无缝集成多种视觉-语言先验模型。OpenFrontier 通过轻量级系统设计实现了高效的导航,无需密集的3D映射、策略训练或模型微调。我们在多个导航基准上评估了OpenFrontier,并展示了其强大的零样本性能,以及在移动机器人上的有效实际部署。
Summary / 总结
The research aims to develop a general navigation system for robots in complex environments with flexible task requirements. The method involves formulating navigation as a sparse subgoal identification problem and using visual frontiers as semantic anchors to enable efficient goal-conditioned navigation without dense 3D mapping or policy training. Key findings show strong zero-shot performance across multiple benchmarks and successful real-world deployment on a mobile robot.
研究旨在开发一种适用于复杂环境和灵活任务要求的通用导航系统。方法是将导航问题表述为稀疏子目标识别问题,并使用视觉前沿作为语义锚点,以实现无需密集3D建模或策略训练的高效目标导向导航。关键发现表明,在多个基准测试中表现出色,并成功应用于移动机器人的真实世界部署。
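For readers unfamiliar with frontiers, the sketch below uses the classical definition: a free cell that borders unexplored space. The 2D grid representation, the cell codes, and the `score` callback (a stand-in for whatever vision-language prior ranks candidates) are all assumptions for illustration. The abstract does not say how OpenFrontier represents or detects frontiers.

```python
FREE, OCCUPIED, UNKNOWN = 0, 1, -1   # hypothetical cell codes

def find_frontiers(grid):
    """Return free cells that have at least one unknown 4-neighbour."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers

def pick_subgoal(frontiers, score):
    """Choose the frontier with the highest score, or None if there are none.

    `score` stands in for a language-conditioned relevance estimate.
    """
    return max(frontiers, key=score) if frontiers else None
```

Treating navigation as repeatedly choosing and reaching one such subgoal keeps the decision space sparse, which matches the sparse subgoal formulation in the abstract.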
Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models
Authors: Haidong Kang, Jun Du, Lihong Lin
First: 2025-12-08T10:52:55+00:00 · Latest: 2026-03-05T16:57:43+00:00
Abstract
Mixed-Precision Quantization (MPQ) liberates Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck and has garnered increasing research attention. However, conventional methods either rely on costly differentiable optimization search, which is neither efficient nor flexible, or learn a quantized DNN from a proxy (e.g., HAWQ) manually designed by human experts, which is labor-intensive and requires extensive expert knowledge. Can we design a proxy without involving any human experts or training? In this paper, we provide an affirmative answer by proposing a novel Large Language Model (LLM)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework. It reforms the design paradigm of MPQ by utilizing LLMs and evolutionary search strategies to automatically find superior TAP tailored for MPQ. In addition, to bridge the gap between black-box LLMs and the challenging MPQ task, we introduce a lightweight Direct Preference Optimization (DPO)-based strategy controller that dynamically reweights the selection probabilities of the three prompt templates for evolutionary search strategies according to fitness signals, without fine-tuning the LLM. This forms a task-aware feedback loop that improves proxy generation across evolutions. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we believe that our TAP will significantly contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.
中文标题/摘要
标题:革新混合精度量化:通过大型语言模型实现无需训练的自动代理发现
混合精度量化(MPQ)使深度神经网络(DNNs)摆脱了内存不足(OOM)的瓶颈,并引起了越来越多的研究关注。然而,传统方法要么依赖于昂贵的可微优化搜索,这既不高效也不灵活,要么从人类专家手动设计的代理(例如HAWQ)中学习量化DNN,这既耗时又需要大量专家知识。我们能否设计一个无需任何人类专家或训练的代理?在本文中,我们通过提出一种新颖的大型语言模型(LLM)驱动的无需训练的自动代理(简称TAP)发现框架,给出了肯定的回答。该框架通过利用LLM和进化搜索策略,自动发现适用于MPQ的优质TAP,改革了MPQ的设计范式。此外,为了弥合黑盒LLM与挑战性的MPQ任务之间的差距,我们引入了一种轻量级的直接偏好优化(DPO)为基础的策略控制器,根据适应度信号动态调整进化搜索策略中三种提示模板的选择概率,无需微调LLM。这形成了一种任务感知的反馈循环,提高了代理生成的性能。在主流基准上的广泛实验表明,TAP达到了最先进的性能。最后,我们认为,我们的TAP将通过提供一种LLM驱动设计算法的新视角,对MPQ社区产生重大贡献。
Summary / 总结
The paper addresses the challenge of designing a proxy for Mixed-Precision Quantization (MPQ) without human intervention or training. It introduces a TAP framework that uses Large Language Models (LLMs) and evolutionary search strategies to automatically discover a superior proxy. The TAP framework includes a Direct Preference Optimization (DPO)-based strategy controller that dynamically adjusts the selection probabilities of prompt templates based on fitness signals, enhancing the proxy generation process. Experiments show that TAP outperforms existing methods on mainstream benchmarks.
论文解决了在无需人工干预或训练的情况下为混合精度量化(MPQ)设计代理的问题。它提出了一种TAP框架,利用大型语言模型(LLMs)和进化搜索策略自动发现更优的代理。TAP框架包含一个基于直接偏好优化(DPO)的策略控制器,根据适应度信号动态调整提示模板的选择概率,从而提高代理生成过程。实验表明,TAP在主流基准上优于现有方法。
Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers
Authors: Guandong Li
First: 2026-03-05T15:58:06+00:00 · Latest: 2026-03-05T15:58:06+00:00
Abstract
Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time, depth, and feature dimensions. In this work, we identify three orthogonal axes of non-uniformity in DiT denoising: (1) temporal: sensitivity to caching errors varies dramatically across the denoising trajectory; (2) depth: consecutive caching decisions lead to cascading approximation errors; and (3) feature: different components of the hidden state exhibit heterogeneous temporal dynamics. Based on these observations, we propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) by 16% in speed while maintaining comparable quality (LPIPS difference < 1%). Our approach is training-free, plug-and-play, and compatible with existing DiT architectures.
中文标题/摘要
标题:基于频率感知的误差受限缓存加速扩散变换器
扩散变换器(DiTs)已成为高质量图像和视频生成的主要架构,但其迭代去噪过程在推理时会带来巨大的计算成本。现有的缓存方法通过在时间步之间重用中间计算来加速DiTs,但它们都存在一个共同的局限性:将去噪过程视为在时间、深度和特征维度上均匀的。在这项工作中,我们识别了DiT去噪中的三个非均匀轴:(1)时间轴——缓存误差对去噪轨迹的敏感性差异巨大;(2)深度轴——连续的缓存决策会导致级联的近似误差;(3)特征轴——隐藏状态的不同组成部分表现出异质的时间动态。基于这些观察,我们提出了SpectralCache,这是一种统一的缓存框架,包括时间步感知动态调度(TADS)、累积误差预算(CEB)和频率分解缓存(FDC)。在FLUX.1-schnell,512x512分辨率下,SpectralCache实现了2.46倍的加速,LPIPS为0.217,SSIM为0.727,比TeaCache(2.12倍,LPIPS为0.215,SSIM为0.734)快16%,同时保持了相当的质量(LPIPS差异<1%)。我们的方法是无需训练的、即插即用的,并且与现有的DiT架构兼容。
Summary / 总结
This work addresses the high computational cost of Diffusion Transformers (DiTs) during inference by proposing SpectralCache, a frequency-aware caching framework. SpectralCache includes Timestep-Aware Dynamic Scheduling, Cumulative Error Budgets, and Frequency-Decomposed Caching to handle the non-uniformity in DiT denoising. On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves a 2.46x speedup with comparable image quality (LPIPS 0.217, SSIM 0.727) to TeaCache, which has a 2.12x speedup (LPIPS 0.215, SSIM 0.734).
本文提出了一种基于频率的缓存框架SpectralCache,以解决Diffusion Transformers (DiTs)在推理过程中的高计算成本问题。SpectralCache包括时间步长感知动态调度、累积误差预算和频率分解缓存,以解决DiT去噪过程在时间、深度和特征维度上的非均匀性。在FLUX.1-schnell 512x512分辨率下,SpectralCache实现了2.46倍的加速,LPIPS为0.217,SSIM为0.727,比TeaCache在速度上提高了16%,同时保持了相近的质量。
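The headline comparison against TeaCache can be checked with a few lines of arithmetic. The snippet uses only the numbers quoted in the abstract. The helper name `relative_change` is introduced here for illustration.

```python
def relative_change(new, baseline):
    """Relative difference of `new` with respect to `baseline`."""
    return (new - baseline) / baseline

# SpectralCache vs. TeaCache on FLUX.1-schnell at 512x512 (values from the abstract).
speedup_gain = relative_change(2.46, 2.12)    # speedup factors
lpips_gap = relative_change(0.217, 0.215)     # lower LPIPS is better
ssim_gap = relative_change(0.727, 0.734)      # higher SSIM is better

print(f"speed: {speedup_gain:+.1%}, LPIPS: {lpips_gap:+.2%}, SSIM: {ssim_gap:+.2%}")
```

The speed ratio comes out at roughly 16%, and both quality metrics differ from TeaCache by under 1% in relative terms, with SpectralCache slightly worse on each.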
FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding
Authors: Janghoon Cho, Jungsoo Lee, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi
Venue: ICLR 2026
First: 2025-10-31T17:29:39+00:00 · Latest: 2026-03-05T15:43:07+00:00
Comments: Accepted to ICLR 2026
Abstract
Recent studies in long video understanding have harnessed the advanced visual-language reasoning capabilities of Large Multimodal Models (LMMs), driving the evolution of video-LMMs specialized for processing extended video sequences. However, the scalability of these models is severely limited by the overwhelming volume of visual tokens generated from extended video sequences. To address this challenge, we propose FLoC, an efficient visual token compression framework based on the facility location function, a principled approach that swiftly selects a compact yet highly representative and diverse subset of visual tokens within a predefined budget on the number of visual tokens. By integrating the lazy greedy algorithm, our method achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens, drastically reducing the number of visual tokens while guaranteeing near-optimal performance. Notably, our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution that seamlessly integrates with diverse video-LLMs and existing workflows. Extensive evaluations on large-scale benchmarks, such as Video-MME, MLVU, LongVideoBench, and EgoSchema, show that our framework consistently surpasses recent compression techniques, highlighting its effectiveness and robustness in addressing the challenges of long video understanding as well as its processing efficiency.
中文标题/摘要
标题:FLoC:基于设施位置的高效视觉标记压缩框架以实现长视频理解
近期关于长视频理解的研究利用了大型多模态模型(LMMs)的先进视觉-语言推理能力,推动了专门用于处理扩展视频序列的视频-LMMs的发展。然而,这些模型的可扩展性受到从扩展视频序列生成的大量视觉标记的限制。为了解决这一挑战,我们提出了一种基于设施位置函数的高效视觉标记压缩框架FLoC,这是一种原理性的方法,能够迅速选择在预定义的视觉标记数量预算内具有高度代表性且多样化的紧凑子集。通过集成懒惰贪婪算法,我们的方法通过迅速选择紧凑的标记子集实现了显著的效率提升,大幅减少了视觉标记的数量,同时保证了接近最优的性能。值得注意的是,我们的方法是无需训练的、模型无关的、查询无关的,提供了一种灵活的解决方案,能够无缝集成到各种视频-LLMs和现有工作流程中。在Video-MME、MLVU、LongVideoBench和EgoSchema等大规模基准上的广泛评估表明,我们的框架在压缩技术方面始终优于近期的技术,突显了其在解决长视频理解挑战方面的有效性、稳健性以及处理效率。
Summary / 总结
The paper addresses the scalability issue of Large Multimodal Models (LMMs) in long video understanding by proposing FLoC, a facility location-based visual token compression framework. FLoC uses a lazy greedy algorithm to efficiently select a compact subset of visual tokens, reducing the number of tokens while maintaining near-optimal performance. The method is training-free, model-agnostic, and query-agnostic, making it versatile and easy to integrate with various video-LLMs. Experimental results on large-scale benchmarks demonstrate that FLoC outperforms recent compression techniques in terms of effectiveness and efficiency.
论文提出了一种高效的视觉令牌压缩框架FLoC,以解决大型多模态模型(LMMs)在长视频理解中的可扩展性问题。FLoC 使用设施位置函数和懒惰贪婪算法来选择一个紧凑的视觉令牌子集,同时减少令牌数量并保持接近最优的性能。该方法无需训练、模型无关且查询无关,使其能够无缝集成到各种视频-LLMs 中。大规模基准测试表明,FLoC 在有效性和处理效率方面优于最近的压缩技术。
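The selection step can be sketched with the textbook lazy greedy algorithm for the facility location objective f(S) = sum over tokens i of the maximum similarity between i and any selected token. This is a generic illustration, not FLoC's implementation. The similarity matrix is assumed non-negative and square (every token is also a candidate), and the function name is hypothetical.

```python
import heapq

def facility_location_lazy_greedy(sim, budget):
    """Select up to `budget` indices maximizing sum_i max_{j in S} sim[i][j].

    sim[i][j]: non-negative similarity between token i and candidate j.
    """
    n = len(sim)
    best = [0.0] * n   # best[i]: similarity of token i to its closest selected token

    def gain(j):
        # Marginal increase in the objective if candidate j were added.
        return sum(max(sim[i][j] - best[i], 0.0) for i in range(n))

    # Max-heap via negated gains; gains are exact for the empty selection.
    heap = [(-gain(j), j) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < budget:
        _, j = heapq.heappop(heap)
        fresh = gain(j)
        # Submodularity makes stale gains upper bounds, so if the refreshed
        # gain still beats the next bound, j is the true greedy choice.
        if not heap or fresh >= -heap[0][0]:
            selected.append(j)
            for i in range(n):
                best[i] = max(best[i], sim[i][j])
        else:
            heapq.heappush(heap, (-fresh, j))
    return selected
```

The lazy evaluation avoids recomputing every candidate's gain at each step. Greedy selection on a monotone submodular objective such as this one carries the classical (1 - 1/e) approximation guarantee, which is the usual sense of a near-optimality claim for this kind of selection.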
Pursuing Minimal Sufficiency in Spatial Reasoning
Authors: Yejie Guo, Yunzhong Hou, Wufei Ma, Meng Tang, Ming-Hsuan Yang
First: 2025-10-19T02:29:09+00:00 · Latest: 2026-03-05T14:41:14+00:00
Abstract
Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding capabilities stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from expert models. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes using a versatile perception toolbox to extract sufficient information, including a novel SOG (Situated Orientation Grounding) module that robustly extracts language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.
中文标题/摘要
标题:在空间推理中追求最小充分性
空间推理,即在三维理解基础上将语言接地的能力,仍然是视觉-语言模型(VLMs)的一个持续性挑战。我们识别出两个根本瓶颈:源于二维中心预训练的不充分三维理解能力,以及由冗余三维信息引起的推理失败。为解决这些问题,我们首先在回答给定问题之前构建一个最小充分集(MSS)的信息:从专家模型中提取的紧凑三维感知结果的选择。我们引入了MSSR(最小充分空间推理器),这是一种双代理框架,实现了这一原则。感知代理使用多功能感知工具箱程序化地查询三维场景,提取足够的信息,包括一个新颖的SOG(情境定向接地)模块,该模块能够稳健地提取语言接地的方向。然后,推理代理迭代地精炼这些信息以追求最小性,通过闭环剪枝冗余细节并请求缺失信息,直到MSS被精心挑选出来。大量实验表明,通过明确追求充分性和最小性,我们的方法显著提高了准确性,并在两个具有挑战性的基准测试中达到了最先进的性能。此外,我们的框架生成可解释的推理路径,为未来模型提供了一个高质量的训练数据来源。源代码可在https://github.com/gyj155/mssr/获取。
Summary / 总结
This paper addresses the challenge of spatial reasoning for Vision-Language Models by identifying two key issues: inadequate 3D understanding from 2D-centric pre-training and reasoning failures due to redundant 3D information. To tackle these, the authors propose MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that constructs a Minimal Sufficient Set (MSS) of 3D information. The Perception Agent queries 3D scenes using a versatile toolbox, while the Reasoning Agent iteratively refines this information to ensure both sufficiency and minimality. Experiments show that this approach significantly improves accuracy and achieves state-of-the-art performance on two benchmarks, while also providing interpretable reasoning paths for future model training.
论文通过识别两个关键问题——2D-centric预训练导致的3D理解不足以及冗余3D信息引起的推理失败,来解决视觉-语言模型的空间推理挑战。为此,作者提出了MSSR(Minimal Sufficient Spatial Reasoner)框架,该框架构建了一个3D信息的Minimal Sufficient Set (MSS)。该框架使用感知代理查询3D场景并提取必要的信息,然后使用推理代理迭代精炼这些信息以确保充分性和最小性。实验表明,MSSR在两个基准测试中显著提高了准确性并达到了最先进的性能,同时提供了可解释的推理路径,为未来模型训练提供了高质量的数据来源。
SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus
Authors: Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongyan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan
First: 2025-10-03T16:32:02+00:00 · Latest: 2026-03-05T14:25:09+00:00
Abstract
Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.
中文标题/摘要
标题:SpineBench:基于SpineMed-450k语料库的临床相关、分级驱动的基准
脊椎疾病影响全球6.19亿人,是导致残疾的主要原因之一,但AI辅助诊断仍受限于缺乏分级意识的多模态数据集。脊椎疾病的临床决策需要在特定椎体水平上对X光、CT和MRI进行复杂的推理。然而,由于缺乏可追溯的、临床依据的数据和标准化的脊椎特定基准,进展受限。为解决这一问题,我们引入了SpineMed,一个与实践中的脊椎外科医生共同设计的生态系统。它包括SpineMed-450k,这是首个专门设计用于跨影像模态的椎体级推理的大规模数据集,包含超过45万个指令实例,以及SpineBench,一个基于临床的评估框架。SpineMed-450k从多种来源收集,包括教科书、指南、开放数据集和约1000个匿名医院病例,通过临床医生在环的管道和两阶段LLM生成方法(草案和修订)进行编目,以确保高质量、可追溯的数据用于问答、多轮咨询和报告生成。SpineBench在临床相关轴上评估模型,包括椎体识别、病理评估和手术规划。我们对SpineBench上几种最近先进的大型视觉-语言模型的全面评估揭示了其在细粒度、椎体特定推理方面的系统性弱点。相比之下,我们基于SpineMed-450k微调的模型在所有任务上都表现出一致且显著的改进。临床医生评估证实了我们模型输出的诊断清晰度和实用价值。
Summary / 总结
The paper introduces SpineMed, a new ecosystem for spine disorders, featuring SpineMed-450k, a large-scale dataset for vertebral-level reasoning across imaging modalities, and SpineBench, an evaluation framework. SpineMed-450k includes over 450,000 instruction instances curated from various sources and processed through a clinician-in-the-loop pipeline. SpineBench evaluates models on clinically relevant tasks such as level identification, pathology assessment, and surgical planning. The evaluation shows that models fine-tuned on SpineMed-450k outperform recent large vision-language models in level-specific reasoning.
该论文介绍了SpineMed生态系统,包括用于脊椎水平跨影像模态推理的大型数据集SpineMed-450k和评估框架SpineBench。该数据集来自多种来源,确保了高质量和可追溯的数据。SpineBench在临床相关任务上评估模型,评估结果显示,基于SpineMed-450k微调的模型在细粒度、水平特定推理方面表现优于先进的大型视觉语言模型。
RadarVLM: A Vision-Language Model Approach for Radar Scene Understanding
Authors: Pushkal Mishra, Kshitiz Bansal, Dinesh Bharadia
First: 2025-11-26T06:41:00+00:00 · Latest: 2026-03-05T14:00:17+00:00
Abstract
Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions, yet existing machine learning approaches remain fragmented and task-specific, with each downstream task employing distinct architectures and training objectives. We present RadarVLM, a vision-language framework that learns unified scene-level representations through structured spatial language supervision. Leveraging the CARLA simulator with a realistic radar model, we collect over 800k radar-caption pairs across 110+ hours of simulated driving in diverse scenarios. We make two key contributions: (1) a structured caption framework encoding vehicle distributions in the radar's native coordinate system, and (2) a Spatially-Grounded CLIP (SG-CLIP) objective that replaces binary matching with continuous scene similarity, enabling fine-grained spatial reasoning. We further propose localization-aware evaluation metrics that directly assess spatial accuracy beyond traditional linguistic similarity measures. Validated on generative captioning and vehicle segmentation, SG-CLIP achieves up to 50% relative F1-score improvement over vanilla CLIP and a 21% AP gain on segmentation, demonstrating that language grounding produces spatially structured representations.
中文标题/摘要
标题:RadarVLM:雷达场景理解的视觉-语言模型方法
雷达传感器在恶劣天气、光照和远距离条件下提供可靠的感知,但现有的机器学习方法仍然支离破碎且任务特定,每个下游任务都采用不同的架构和训练目标。我们提出了RadarVLM,这是一种视觉-语言框架,通过结构化的空间语言监督学习统一的场景级表示。利用CARLA模拟器和现实的雷达模型,我们收集了超过80万对雷达-描述符,覆盖了110多个小时的模拟驾驶,涉及多种场景。我们做出了两项关键贡献:(1) 结构化的描述符框架,编码车辆在雷达原坐标系中的分布,(2) 空间定位CLIP (SG-CLIP) 目标,用连续的场景相似度替换二元匹配,使细粒度的空间推理成为可能。我们进一步提出了定位感知的评估指标,直接评估空间准确性,超越传统的语言相似度度量。在生成描述符和车辆分割上,SG-CLIP相比vanilla CLIP实现了高达50%的相对F1分数提升,分割的AP值提高了21%,证明了语言定位产生了空间结构化的表示。
Summary / 总结
RadarVLM is a vision-language model that addresses the fragmented and task-specific nature of existing machine learning approaches for radar scene understanding. It uses a structured spatial language supervision to learn unified scene-level representations. The model leverages the CARLA simulator to collect 800k radar-caption pairs and introduces a Spatially-Grounded CLIP (SG-CLIP) objective that improves fine-grained spatial reasoning. On generative captioning and vehicle segmentation tasks, SG-CLIP shows up to 50% relative F1-score improvement and a 21% AP gain, indicating that language grounding enhances spatially structured representations.
RadarVLM 是一种使用结构化空间语言监督来学习统一场景级表示的视觉-语言模型。它利用 CARLA 模拟器收集了超过 80 万对雷达-描述,并引入了结构化描述框架和空间定位 CLIP (SG-CLIP) 目标,该目标在生成描述和车辆分割任务中表现出色。SG-CLIP 目标在生成描述中实现了高达 50% 的相对 F1 分数改进,并在分割任务中获得了 21% 的 AP 增益,表明语言定位产生了空间结构化的表示。
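The RadarVLM abstract above says SG-CLIP replaces binary matching with continuous scene similarity. One plausible reading is a contrastive loss with soft targets, sketched below. This is an illustrative assumption, not the authors' formulation: the abstract does not give the loss, and the `scene_sim` matrix, the `target_temp` parameter, and both function names are hypothetical.

```python
import math

def softmax(row, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_contrastive_loss(logits, scene_sim, target_temp=0.1):
    """Cross-entropy between soft targets and the model's radar-to-caption
    distribution, averaged over the batch.

    logits[i][j]    : model similarity between radar sample i and caption j
    scene_sim[i][j] : ASSUMED continuous similarity between scenes i and j
                      (hypothetical input; not specified in the abstract)
    """
    loss = 0.0
    for logit_row, sim_row in zip(logits, scene_sim):
        targets = softmax(sim_row, target_temp)  # soft labels, not one-hot
        probs = softmax(logit_row)
        loss -= sum(t * math.log(p) for t, p in zip(targets, probs))
    return loss / len(logits)

# Toy batch of two radar-caption pairs.
logits = [[2.0, 0.0], [0.0, 2.0]]
identity_sim = [[1.0, 0.0], [0.0, 1.0]]   # scenes treated as unrelated
graded_sim = [[1.0, 0.8], [0.8, 1.0]]     # scenes treated as similar
hard_loss = soft_target_contrastive_loss(logits, identity_sim, target_temp=0.01)
soft_loss = soft_target_contrastive_loss(logits, graded_sim, target_temp=0.1)
```

With an identity similarity matrix and a very small target temperature, the soft targets collapse to one-hot labels and the loss reduces to the usual one-directional CLIP cross-entropy. Graded similarities instead spread the target mass over captions of similar scenes.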
Logi-PAR: Logic-Infused Patient Activity Recognition via Differentiable Rule
Authors: Muhammad Zarar, MingZheng Zhang, Xiaowang Zhang, Zhiyong Feng, Sofonias Yitagesu, Kawsar Farooq
First: 2026-03-05T13:52:50+00:00 · Latest: 2026-03-05T13:52:50+00:00
Abstract
Patient Activity Recognition (PAR) in clinical settings uses activity data to improve safety and quality of care. Although significant progress has been made, current models mainly identify which activity is occurring. They often spatially compose sub-sparse visual cues using global and local attention mechanisms, yet only learn logically implicit patterns due to their neural pipeline. Advancing clinical safety requires methods that can infer why a set of visual cues implies a risk, and how these cues can be compositionally reasoned about through explicit logic beyond mere classification. To address this, we propose Logi-PAR, the first Logic-Infused Patient Activity Recognition Framework that integrates contextual fact fusion as a multi-view primitive extractor and injects neural-guided differentiable rules. Our method automatically learns rules from visual cues, optimizing them end-to-end while enabling implicitly emerging patterns to be explicitly labelled during training. To the best of our knowledge, Logi-PAR is the first framework to recognize patient activity by applying learnable logic rules to symbolic mappings. It produces auditable "why" explanations as rule traces and supports counterfactual interventions (e.g., risk would decrease by 65% if assistance were present). Extensive evaluation on clinical benchmarks (VAST and OmniFall) demonstrates state-of-the-art performance, significantly outperforming Vision-Language Models and transformer baselines. The code is available at https://github.com/zararkhan985/Logi-PAR.git
中文标题/摘要
标题:Logi-PAR:通过可微规则融合上下文事实的患者活动识别
在临床环境中,患者活动识别(PAR)利用活动数据以提高安全性和护理质量。尽管取得了显著进展,当前模型主要识别正在进行的活动。它们通常使用全局和局部注意力机制组合稀疏的视觉线索,但由于其神经管道,只能学习逻辑隐含模式。为了提高临床安全性,需要能够推断出一组视觉线索为何表示风险的方法,并通过明确的逻辑进行组合推理,而不仅仅是分类。为此,我们提出了Logi-PAR,这是第一个融合上下文事实的逻辑注入患者活动识别框架,将其作为多视图原始提取器,并注入神经引导的可微规则。我们的方法自动从视觉线索中学习规则,在端到端优化的同时,使隐含模式在训练期间明确地被标记。据我们所知,Logi-PAR 是第一个通过应用可学习逻辑规则到符号映射来识别患者活动的框架。它产生可审计的“为什么”解释作为规则跟踪,并支持反事实干预(例如,如果提供帮助,风险将降低65%)。在临床基准测试(VAST和OmniFall)上的广泛评估表明,其性能达到最先进的水平,显著优于视觉-语言模型和变压器基线。代码可通过:https://github.com/zararkhan985/Logi-PAR.git 获取
Summary / 总结
Logi-PAR is a novel framework for Patient Activity Recognition (PAR) that integrates logic into the recognition process. It uses contextual fact fusion and neural-guided differentiable rules to learn explicit logic patterns from visual cues, providing auditable explanations and supporting counterfactual interventions. Logi-PAR outperforms existing models on clinical benchmarks and demonstrates state-of-the-art performance.
Logi-PAR 是一种新颖的患者活动识别框架,将逻辑融入识别过程。它利用上下文事实融合和神经引导的可微规则来从视觉线索中学习明确的逻辑模式,从而提供更可解释和实用的洞察。Logi-PAR 在临床基准测试中表现出色,提供了可审计的解释和反事实干预,证明了其优越的性能和实际应用价值。
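The Logi-PAR abstract above mentions differentiable rules and counterfactual interventions. As a rough illustration of the general idea only, the sketch below evaluates one soft logic rule with the product t-norm (a standard fuzzy-logic conjunction) and then re-evaluates it under an intervention. The rule, the primitive names (`near_bed_edge`, `standing`, `assisted`), and all numbers are hypothetical and not taken from the paper or its code.

```python
def soft_and(*truths):
    """Product t-norm: differentiable conjunction of truth values in [0, 1]."""
    result = 1.0
    for t in truths:
        result *= t
    return result

def soft_not(truth):
    """Standard fuzzy negation."""
    return 1.0 - truth

def fall_risk(facts, rule_weight=1.0):
    """HYPOTHETICAL rule: risk <- near_bed_edge AND standing AND NOT assisted."""
    body = soft_and(facts["near_bed_edge"], facts["standing"],
                    soft_not(facts["assisted"]))
    return rule_weight * body

# Made-up primitive confidences, as a perception module might emit them.
observed = {"near_bed_edge": 0.9, "standing": 0.8, "assisted": 0.1}
# Counterfactual intervention: what if assistance were (mostly) present?
counterfactual = dict(observed, assisted=0.75)

risk_observed = fall_risk(observed)
risk_counterfactual = fall_risk(counterfactual)
```

Because every operation is a product or a subtraction, the rule output is differentiable with respect to both the primitive confidences and `rule_weight`, which is what allows rules of this kind to be trained end-to-end. The rule body also doubles as a readable trace of why the risk score is high.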
Mario: Multimodal Graph Reasoning with Large Language Models
Authors: Yuanfu Sun, Kang Li, Pengkang Guo, Jiajin Liu, Qiaoyu Tan
Venue: CVPR 2026
First: 2026-03-05T13:49:41+00:00 · Latest: 2026-03-05T13:49:41+00:00
Comments: CVPR 2026
Abstract
Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Secondly, a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.
中文标题/摘要
标题:马里奥:大规模语言模型的多模态图推理
大规模语言模型(LLMs)的最新进展为多模态推理开辟了新的途径。然而,大多数现有方法仍然依赖预训练的视觉-语言模型(VLMs)来孤立地编码图像-文本对,忽略了真实世界多模态数据自然形成的关联结构。这促使我们在多模态图(MMGs)上进行推理,其中每个节点具有文本和视觉属性,边提供结构线索。在保持图拓扑的同时,使基于LLM的多模态异构信号推理引入了两个关键挑战:解决弱跨模态一致性并处理异构模态偏好。为了解决这些问题,我们提出了一种名为马里奥的统一框架,该框架同时解决了上述两个挑战,并在MMGs上实现了有效的基于LLM的推理。马里奥由两个创新阶段组成。首先,一种基于图条件的VLM设计,通过细粒度的跨模态对比学习来联合细化文本和视觉特征,该学习由图拓扑引导。其次,一种模态自适应图指令调优机制,将对齐的多模态特征组织成图感知指令视图,并使用可学习的路由器为每个节点及其邻域呈现最相关信息模态配置给LLM。在多种多模态图基准上的广泛实验表明,马里奥在节点分类和链接预测的监督和零样本场景中均优于最先进的图模型。代码将在https://github.com/sunyuanfu/Mario上公开。
Summary / 总结
The paper proposes Mario, a unified framework for multimodal graph reasoning using large language models. It addresses the challenges of weak cross-modal consistency and heterogeneous modality preference by designing a graph-conditioned vision-language model and a modality-adaptive graph instruction tuning mechanism. Mario outperforms state-of-the-art graph models in node classification and link prediction tasks across various multimodal graph benchmarks.
Mario 是一个统一框架,通过解决弱跨模一致性问题和处理异构模态偏好来实现多模态图推理。它包括一个图条件下的视觉-语言模型进行细粒度的跨模态对比学习,以及一个模态自适应图指令调优机制。实验表明,Mario 在各种多模态图基准上的节点分类和链接预测任务中均优于现有最佳图模型。
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
Authors: Shule Lu, Yujing Wang, Hainan Zhang, Xiaoshan Yang, Hongwei Zheng, Yongxin Tong, Changsheng Xu, Zhiming Zheng
First: 2026-01-31T03:11:51+00:00 · Latest: 2026-03-05T12:23:38+00:00
Comments: Due to the need for substantial revisions, the authors believe that the paper should be retracted first. A revised version may be resubmitted.
Abstract
VLMs have broad potential in privacy-sensitive domains such as healthcare and finance, yet strict data-sharing constraints render centralized training infeasible. FL mitigates this issue by enabling decentralized training, but practical deployments face challenges due to client heterogeneity in computational resources, application requirements, and model architectures. We argue that while replacing data with model parameters characterizes the present of FL, replacing parameters with preferences represents a more scalable and privacy-preserving future. Motivated by this perspective, we propose MoR, a federated alignment framework based on GRPO with Mixture-of-Rewards for heterogeneous VLMs. MoR initializes a visual foundation model as a KL-regularized reference, while each client locally trains a reward model from local preference annotations, capturing specific evaluation signals without exposing raw data. To reconcile heterogeneous rewards, we introduce a routing-based fusion mechanism that adaptively aggregates client reward signals. Finally, the server performs GRPO with this mixed reward to optimize the base VLM. Experiments on three public VQA benchmarks demonstrate that MoR consistently outperforms federated alignment baselines in generalization, robustness, and cross-client adaptability. Our approach provides a scalable solution for privacy-preserving alignment of heterogeneous VLMs under federated settings.
中文标题/摘要
标题:用偏好替换参数:异构视觉-语言模型的联邦对齐
视觉-语言模型(VLMs)在医疗保健和金融等隐私敏感领域具有广泛潜力,但由于严格的数据共享限制,集中式训练不可行。联邦学习(FL)通过使训练去中心化来缓解这一问题,但实际部署面临挑战,因为客户端在计算资源、应用需求和模型架构方面存在异质性。我们认为,虽然用模型参数替换数据是当前FL的特点,但用偏好替换参数代表了更可扩展和隐私保护的未来。基于这一视角,我们提出了MoR,一种基于GRPO的混合奖励混合的异构VLM联邦对齐框架。MoR以KL正则化的视觉基础模型作为参考,每个客户端从本地偏好注释中局部训练奖励模型,捕捉特定的评估信号而不暴露原始数据。为了协调异质奖励,我们引入了一种基于路由的融合机制,以自适应地聚合客户端的奖励信号。最后,服务器使用这种混合奖励进行GRPO优化基础VLM。在三个公开的VQA基准测试上进行的实验表明,MoR在泛化能力、鲁棒性和跨客户端适应性方面始终优于联邦对齐基线。我们的方法为在联邦设置下对齐异构VLM提供了可扩展的解决方案。
Summary / 总结
This paper addresses the challenge of training vision-language models (VLMs) in privacy-sensitive domains where centralized training is impractical due to data-sharing constraints. The authors propose MoR, a federated learning framework that replaces model parameters with preferences to enhance scalability and privacy. MoR initializes a reference model and allows clients to train local reward models based on preference annotations, which are then fused to optimize the base model. Experiments show that MoR outperforms existing federated alignment methods in terms of generalization, robustness, and cross-client adaptability.
论文针对在隐私敏感领域中由于严格的数据共享限制而无法进行集中训练的问题,提出了一种名为MoR的联邦学习框架,该框架通过将模型参数替换为偏好来增强可扩展性和隐私保护。MoR初始化一个参考模型,并允许客户端根据偏好注释训练局部奖励模型,然后将这些奖励信号融合以优化基础模型。实验结果显示,MoR在泛化能力、鲁棒性和跨客户端适应性方面优于现有联邦对齐方法。
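The MoR abstract above describes fusing client reward signals through routing and then running GRPO on the mixed reward. The sketch below shows the two arithmetic steps under stated assumptions: the softmax router and the names `mix_rewards` and `router_logits` are hypothetical stand-ins for the paper's fusion mechanism, while the group-standardized advantage is the usual GRPO formulation (here with the population standard deviation, which is a choice made for this example).

```python
import math
from statistics import mean, pstdev

def mix_rewards(client_rewards, router_logits):
    """Combine per-client reward scores for one response using softmax
    routing weights (ASSUMED form of the routing-based fusion)."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * r for w, r in zip(weights, client_rewards))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within the group of
    responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three sampled responses, each scored by two client reward models.
per_response_client_rewards = [[0.9, 0.1], [0.2, 0.4], [0.5, 0.5]]
router_logits = [0.0, 0.0]  # equal routing weights in this toy case
mixed = [mix_rewards(r, router_logits) for r in per_response_client_rewards]
advantages = group_relative_advantages(mixed)
```

Only scalar reward scores cross the client boundary in this sketch, which matches the abstract's point that preferences, rather than parameters or raw data, are what get shared.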
GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement
Authors: Xiaodong Zhu, Yuanming Zheng, Suting Wang, Junqi Yang, Yuhong Yang, Weiping Tu, Zhongyuan Wang
Venue: CVPR 2026
First: 2026-03-05T12:07:26+00:00 · Latest: 2026-03-05T12:07:26+00:00
Comments: 10 pages, 4 figures, accepted by CVPR 2026
Abstract
Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.
中文标题/摘要
标题:GEM-TFL:通过EM引导分解和时间精炼,实现弱监督与全监督之间的伪造定位桥梁
时间伪造定位(TFL)旨在精确识别视频或音频流中的篡改段落,为多媒体取证和安全提供可解释的证据。虽然大多数现有的TFL方法依赖于密集的帧级标签进行全监督学习,但弱监督TFL(WS-TFL)通过仅从二元视频级标签中学习来降低标注成本。然而,当前的WS-TFL方法存在训练和推理目标不匹配、二元标签监督有限、由于非可微的top-k聚合导致梯度阻塞以及缺乏对提案间关系的显式建模等问题。为了解决这些问题,我们提出了GEM-TFL(基于图的EM增强时间伪造定位),这是一种两阶段分类-回归框架,有效地弥合了训练和推理之间的监督差距。在此基础上,(1)我们通过基于EM的优化过程将二元标签重新表述为多维潜在属性,增强弱监督;(2)我们引入了一种无需训练的时间一致性精炼方法,重新对齐帧级预测以实现更平滑的时间动态;(3)我们设计了一种基于图的提案精炼模块,建模提案之间的时空语义关系,以实现全局一致的置信度估计。在基准数据集上的广泛实验表明,GEM-TFL实现了更准确和稳健的时间伪造定位,显著缩小了与全监督方法的差距。
Summary / 总结
GEM-TFL is a two-phase classification-regression framework designed to improve weakly supervised temporal forgery localization (WS-TFL) by addressing issues such as mismatched training and inference objectives and limited supervision. It reformulates binary video-level labels into multi-dimensional latent attributes using an EM-based optimization process and introduces a training-free temporal consistency refinement to align frame-level predictions. Additionally, it models temporal-semantic relationships among proposals through a graph-based module for globally consistent confidence estimation. Experiments show that GEM-TFL achieves more accurate and robust temporal forgery localization than prior weakly supervised approaches, substantially narrowing the gap with fully supervised methods.
GEM-TFL通过提出两阶段框架来增强弱监督并引入时间一致性精炼和基于图的提案精炼,解决了弱监督时间伪造定位的局限性。该方法将二元标签重新表述为多维潜在属性,并重新对齐帧级预测以改善时间动态。实验结果表明,GEM-TFL实现了更准确和稳健的伪造定位,缩小了与全监督方法的差距。
CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
Authors: Zhaonian Kuang, Rui Ding, Haotian Wang, Xinhu Zheng, Meng Yang, Gang Hua
Venue: CVPR 2026
First: 2026-03-05T10:49:46+00:00 · Latest: 2026-03-05T10:49:46+00:00
Comments: Accepted to CVPR 2026 main track
Abstract
Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches the feature space by integrating four spatial representations: focal length, ground depth, ground gradient, and Plücker coordinates. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
中文标题/摘要
标题:CoIn3D: 重新审视配置不变的多相机3D物体检测
多相机3D物体检测(MC3D)随着多传感器物理代理(如机器人和自动驾驶车辆)的部署越来越多而受到越来越多的关注。然而,MC3D模型仍然难以在具有新多相机配置的未见过的平台上泛化。当前的解决方案只是使用一个元相机进行统一表示,但缺乏全面的考虑。在本文中,我们重新审视了这一问题,并发现问题在于源配置和目标配置之间的空间先验差异,包括不同的内参、外参和阵列布局。为了解决这个问题,我们提出了CoIn3D,这是一种通用的MC3D框架,能够从源配置高效地转移到未见过的目标配置。CoIn3D通过空间感知特征调制(SFM)和相机感知数据增强(CDA)将所有识别的空间先验显式地整合到特征嵌入和图像观察中。SFM通过整合焦距、地面深度、地面梯度和Plücker坐标等四种空间表示来丰富特征空间。CDA通过一种无需训练的动态新颖视角图像合成方案来在各种配置下提高观察多样性。广泛的实验表明,CoIn3D在NuScenes、Waymo和Lyft等地标数据集上,在BEVDepth、BEVFormer和PETR等三种主导的MC3D范式下,实现了强大的跨配置性能。
Summary / 总结
CoIn3D revisits the challenge of multi-camera 3D object detection across different configurations and proposes a framework that addresses spatial prior discrepancies through spatial-aware feature modulation and camera-aware data augmentation. Experiments show that CoIn3D outperforms existing methods on landmark datasets like NuScenes, Waymo, and Lyft under various paradigms.
CoIn3D重新审视了多相机3D目标检测(MC3D)的问题,并提出了一种框架来解决不同相机配置之间的模型迁移问题。该框架通过空间感知特征调制和相机感知数据增强来融入空间先验,从而提高模型的迁移性。实验表明,CoIn3D在NuScenes、Waymo和Lyft等数据集上,以及不同的MC3D范式下,优于现有方法。
Flatness Guided Test-Time Adaptation for Vision-Language Models
Authors: Aodi Li, Liansheng Zhuang, Xiao Long, Houqiang Li, Shafei Wang
First: 2025-01-31T03:10:48+00:00 · Latest: 2026-03-05T10:05:46+00:00
Abstract
Test-time adaptation (TTA) of Vision-Language Models (VLMs) has emerged as a technique for tackling distribution shifts at test time. Recent research indicates that test-time adaptation is intrinsically linked to the model's training history. However, existing TTA methods, such as Test-time Prompt Tuning, often design adaptation strategies in isolation from the models' training characteristics, which degrades their performance. This paper argues that the flatness acquired via sharpness-aware training is an efficient clue for the test-time adaptation of VLMs. Built on this insight, this paper proposes a novel Flatness-Guided Adaptation framework (FGA) for VLMs to cohesively unify training and test-time procedures. Its core idea is to leverage the alignment between the training minimum and test loss flat regions to guide the adaptation process. Specifically, our FGA consists of a prompt-tuning stage and a test-time adaptation stage. In the tuning stage, a Sharpness-Aware Prompt Tuning method is utilized to identify the training flat minimum, offering a geometric clue of flatness for subsequent adaptation. In the test stage, a Sharpness-based Test Sample Selection approach is proposed to ensure the alignment of flat minima between the training loss landscape and each augmented test sample's loss landscape. In comparison to existing TTA methods, our FGA avoids expensive prompt parameter updates at test time and substantially reduces the computation overhead. Extensive experiments on both domain generalization and cross-dataset benchmarks demonstrate that our FGA achieves superior performance over prevalent TTA methods. Notably, when employing a ViT-B/16 image encoder, FGA even outperforms TPT+CoOp by an average of 4.88% across all four ImageNet out-of-domain variants.
中文标题/摘要
标题:基于平坦度引导的视觉-语言模型测试时适应
视觉-语言模型(VLMs)的测试时适应(TTA)已成为解决测试时分布偏移的技术。现有研究表明,测试时适应与模型的训练历史密切相关。然而,现有的TTA方法,如测试时提示调优,往往孤立于模型的训练特性之外,导致性能下降。本文认为,通过尖锐感知训练获得的平坦度是视觉-语言模型测试时适应的有效线索。基于此见解,本文提出了一种新颖的基于平坦度引导的适应框架(FGA),以统一训练和测试过程。其核心思想是利用训练最小值和平坦损失区域之间的对齐来引导适应过程。具体而言,我们的FGA包括一个提示调优阶段和一个测试时适应阶段。在调优阶段,使用尖锐感知提示调优方法来识别训练平坦最小值,为后续适应提供平坦度的几何线索。在测试阶段,提出了一种基于尖锐性的测试样本选择方法,以确保训练最小值和平滑每个增强测试样本损失景观之间的对齐。与现有TTA方法相比,我们的FGA避免了测试时昂贵的提示参数更新,并显著减少了计算开销。在领域泛化和跨数据集基准上的广泛实验表明,我们的FGA在主流TTA方法中表现出更优的性能。值得注意的是,当使用ViT-B/16图像编码器时,FGA在所有四个ImageNet离域变体上平均优于TPT+CoOp 4.88%。
Summary / 总结
This paper addresses the challenge of test-time adaptation (TTA) for Vision-Language Models (VLMs) by proposing a Flatness-Guided Adaptation (FGA) framework. The motivation is to improve TTA performance by leveraging the model's training characteristics. The FGA framework consists of a prompt-tuning stage and a test-time adaptation stage. In the tuning stage, a Sharpness-Aware Prompt Tuning method identifies the training flat minimum, providing a geometric clue for subsequent adaptation. In the test stage, a Sharpness-based Test Sample Selection approach ensures alignment between the training and test loss flat regions. Experiments show that FGA outperforms existing TTA methods, achieving superior performance and reducing computation overhead.
本文提出了一个用于视觉-语言模型(VLMs)的平滑性引导适应(FGA)框架,以提高测试时的适应性。该方法利用训练过程中获得的平滑性来引导适应过程,通过尖锐感知的提示调优阶段来识别训练平滑最小值,并通过基于尖锐性的测试样本选择方法来确保训练和每个增强测试样本损失景观之间的平滑最小值对齐。实验表明,FGA在域泛化和跨数据集基准测试中优于现有方法,平均改进幅度为4.88%,超过TPT+CoOp在四个ImageNet离域变体中的表现。
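The FGA abstract above relies on sharpness to choose among augmented test samples. The sketch below illustrates one way this could work, assuming a first-order sharpness estimate in the style of sharpness-aware minimization: the loss increase after one normalized gradient-ascent step of radius `rho`. The abstract does not say how FGA measures sharpness, so the estimator, the selection rule, and the toy quadratic losses are all assumptions made for illustration.

```python
import math

def sharpness(loss_fn, grad_fn, w, rho=0.05):
    """First-order SAM-style sharpness: loss increase after one normalized
    gradient-ascent step of length rho (no parameters are updated)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0  # guard against zero gradient
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    return loss_fn(w_adv) - loss_fn(w)

def select_flat_views(views, w, keep=2, rho=0.05):
    """Keep the `keep` augmented views whose loss landscape is flattest at w.

    views: list of (name, loss_fn, grad_fn) tuples, one per augmented view.
    """
    scored = [(sharpness(loss_fn, grad_fn, w, rho), name)
              for name, loss_fn, grad_fn in views]
    scored.sort()
    return [name for _, name in scored[:keep]]

def quadratic_view(name, curvature):
    """Toy stand-in for one augmented sample: L(w) = 0.5 * c * ||w||^2."""
    return (name,
            lambda w: 0.5 * curvature * sum(x * x for x in w),
            lambda w: [curvature * x for x in w])

prompt_params = [1.0, 0.0]  # hypothetical tuned prompt parameters
views = [quadratic_view("a", 1.0), quadratic_view("b", 10.0), quadratic_view("c", 3.0)]
selected = select_flat_views(views, prompt_params, keep=2)
```

In this toy setting the high-curvature view `b` is discarded and the two flatter views are kept. The prompt parameters are only probed, never updated, which is consistent with the abstract's claim that FGA avoids prompt updates at test time.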
Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning
Authors: Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu
First: 2025-03-14T19:52:08+00:00 · Latest: 2026-03-05T09:05:50+00:00
Abstract
Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the "safety mirage", where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we show that machine unlearning (MU) is a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.
中文标题/摘要
标题:安全幻象:虚假相关性如何削弱VLM安全微调并可通过机器遗忘加以缓解
近期的视觉语言模型(VLMs)在多模态输入的生成建模方面取得了显著进展,特别是文本和图像。然而,当暴露于不安全查询时,它们生成有害内容的倾向引发了重要的安全问题。尽管当前的对齐策略主要依赖于监督安全微调和精心策划的数据集,但我们发现了一个根本性的局限性,我们称之为“安全幻象”,即监督微调无意中强化了表面文本模式与安全响应之间的虚假相关性,而不是培养深层次、内在的有害行为缓解。我们展示了这些虚假相关性使微调后的VLMs即使在简单的基于单词替换的攻击中也容易受到攻击,其中用一个诱导虚假相关性的替代词替换文本查询中的单个词可以有效地绕过防护措施。此外,这些相关性导致过度谨慎,使微调后的VLMs无故拒绝良性查询。为了解决这些问题,我们展示了机器遗忘(MU)作为监督安全微调的强大替代方案,因为它避免了有偏的特征-标签映射,并直接从VLMs中移除有害知识,同时保留其一般能力。广泛的评估表明,基于MU的对齐将攻击成功率降低高达60.27%,并减少了超过84.20%的无谓拒绝。注意:存在可能具有冒犯性的AI生成内容。
Summary / 总结
This paper addresses the issue of spurious correlations in vision language models (VLMs) that can undermine their safety. The study identifies a 'safety mirage' where supervised fine-tuning inadvertently reinforces superficial textual patterns associated with safety responses, rather than addressing the root causes of harm. The research demonstrates that these spurious correlations make VLMs vulnerable to simple one-word modification attacks and lead to unnecessary rejections of benign queries. To mitigate these issues, the paper proposes machine unlearning (MU) as an alternative to supervised fine-tuning, showing that it can reduce attack success rates by up to 60.27% and decrease unnecessary rejections by over 84.20%.
研究关注视觉语言模型(VLMs)的安全问题,识别出一种称为‘安全幻象’的问题,即监督微调可能会无意中强化虚假关联,使VLMs对简单攻击和过度谨慎变得脆弱。研究提出机器遗忘(MU)作为替代方法来缓解这些问题,并通过在安全基准上的广泛评估,展示了显著降低攻击成功率和不必要的拒绝率。
Retrieval-Augmented Generation with Covariate Time Series
Authors: Kenny Ye Liang, Zhongyi Pei, Huan Zhang, Yuhui Liu, Shaoxu Song, Jianmin Wang
First: 2026-03-05T08:45:24+00:00 · Latest: 2026-03-05T08:45:24+00:00
Comments: 12 pages. Preprint
Abstract
While RAG has greatly enhanced LLMs, extending this paradigm to Time-Series Foundation Models (TSFMs) remains a challenge. This is exemplified in the Predictive Maintenance of the Pressure Regulating and Shut-Off Valve (PRSOV), a high-stakes industrial scenario characterized by (1) data scarcity, (2) short transient sequences, and (3) covariate-coupled dynamics. Unfortunately, existing time-series RAG approaches predominantly rely on generated static vector embeddings and learnable context augmenters, which may fail to distinguish similar regimes in such scarce, transient, and covariate-coupled scenarios. To address these limitations, we propose RAG4CTS, a regime-aware, training-free RAG framework for Covariate Time-Series. Specifically, we construct a hierarchical time-series native knowledge base to enable lossless storage and physics-informed retrieval of raw historical regimes. We design a two-stage bi-weighted retrieval mechanism that aligns historical trends through point-wise and multivariate similarities. For context augmentation, we introduce an agent-driven strategy to dynamically optimize context in a self-supervised manner. Extensive experiments on PRSOV demonstrate that our framework significantly outperforms state-of-the-art baselines in prediction accuracy. The proposed system is deployed in Apache IoTDB within China Southern Airlines. Since deployment, our method has successfully identified one PRSOV fault in two months with zero false alarms.
中文标题/摘要
标题:基于协变量时间序列的检索增强生成
尽管检索增强生成(RAG)极大地提升了语言模型(LLMs),将其扩展到时间序列基础模型(TSFMs)仍面临挑战。这在压力调节和关断阀(PRSOV)的预测维护中尤为明显,这是一个高风险的工业场景,具有(1)数据稀缺性,(2)短暂的瞬态序列,以及(3)协变量耦合的动力学特征。不幸的是,现有的时间序列RAG方法主要依赖于生成的静态向量嵌入和可学习的上下文增强器,这在稀缺、短暂且协变量耦合的场景中可能无法区分相似的运行状态。为了解决这些局限性,我们提出了RAG4CTS,这是一种针对协变量时间序列的训练无监督的检索增强生成框架。具体而言,我们构建了一个层次化的时间序列本体知识库,以实现无损存储和基于物理的检索历史运行状态。我们设计了一种两阶段的双加权检索机制,通过点对点和多变量相似性对历史趋势进行对齐。对于上下文增强,我们引入了一种基于代理的策略,以自监督的方式动态优化上下文。在PRSOV上的广泛实验表明,我们的框架在预测准确性上显著优于最先进的基线。所提出的系统已部署在中国南方航空公司的Apache IoTDB中。自部署以来,我们的方法在两个月内成功识别了一个PRSOV故障,且无误报。
Summary / 总结
This paper addresses the challenge of applying Retrieval-Augmented Generation (RAG) to Time-Series Foundation Models (TSFMs) in high-stakes industrial scenarios like Predictive Maintenance of the PRSOV valve. It proposes RAG4CTS, a regime-aware RAG framework that uses a hierarchical time-series knowledge base for lossless storage and physics-informed retrieval of historical regimes. The system employs a two-stage bi-weighted retrieval mechanism and an agent-driven context augmentation strategy. Experiments show that RAG4CTS significantly improves prediction accuracy compared to existing methods, and it has been successfully deployed in China Southern Airlines to identify faults without false alarms.
本文针对在高风险工业场景如PRSOV阀门的预测维护中应用时间序列检索增强生成(RAG4CTS)框架的挑战。该框架通过层次化知识库实现历史阶段的无损存储和基于物理的检索,并采用两阶段的双加权检索机制进行上下文对齐。实验表明,RAG4CTS在预测准确性上显著优于现有方法,并在中国南方航空公司的部署中成功检测到一个故障,且未产生误报。
Collaborative Learning of Local 3D Occupancy Prediction and Versatile Global Occupancy Mapping
Authors: Shanshuai Yuan, Julong Wei, Muer Tie, Xiangyun Ren, Zhongxue Gan, Wenchao Ding
Venue: ICRA 2026
First: 2025-04-18T09:58:48+00:00 · Latest: 2026-03-05T07:52:27+00:00
Comments: Accepted by ICRA 2026
Abstract
Vision-based 3D semantic occupancy prediction is vital for autonomous driving, enabling unified modeling of static infrastructure and dynamic agents. Global occupancy maps serve as long-term memory priors, providing valuable historical context that enhances local perception. This is particularly important in challenging scenarios such as occlusion or poor illumination, where current and nearby observations may be unreliable or incomplete. Priors aggregated from previous traversals under better conditions help fill gaps and enhance the robustness of local 3D occupancy prediction. In this paper, we propose Long-term Memory Prior Occupancy (LMPOcc), a plug-and-play framework that incorporates global occupancy priors to boost local prediction and simultaneously updates global maps with new observations. To realize the information gain from global priors, we design an efficient and lightweight Current-Prior Fusion module that adaptively integrates prior and current features. Meanwhile, we introduce a model-agnostic prior format to enable continual updating of global occupancy and ensure compatibility across diverse prediction baselines. LMPOcc achieves state-of-the-art local occupancy prediction performance validated on the Occ3D-nuScenes benchmark, especially on static semantic categories. Furthermore, we verify LMPOcc's capability to build large-scale global occupancy maps through multi-vehicle crowdsourcing, and utilize occupancy-derived dense depth to support the construction of 3D open-vocabulary maps. Our method opens up a new paradigm for continuous global information updating and storage, paving the way towards more comprehensive and scalable scene understanding in large outdoor environments.
中文标题/摘要
标题:协作学习局部3D占用预测和多功能全局占用映射
基于视觉的3D语义占用预测对于自动驾驶至关重要,能够统一建模静态基础设施和动态代理。全局占用地图作为长期记忆先验,提供有价值的历史上下文,增强局部感知。特别是在遮挡或光照不良等具有挑战性的场景中,当前和附近的观测可能不可靠或不完整。来自更好条件下的先前遍历的先验有助于填补空白并增强局部3D占用预测的鲁棒性。在本文中,我们提出了一种名为长时记忆先验占用(LMPOcc)的即插即用框架,该框架结合全局占用先验以增强局部预测,并同时使用新观测更新全局地图。为了实现全局先验的信息增益,我们设计了一种高效且轻量级的当前-先验融合模块,以自适应地整合先验和当前特征。同时,我们引入了一种模型无关的先验格式,以实现全局占用的持续更新并确保与各种预测基线的兼容性。LMPOcc在Occ3D-nuScenes基准上实现了最先进的局部占用预测性能,特别是在静态语义类别方面。此外,我们通过多车辆众包验证了LMPOcc构建大规模全局占用地图的能力,并利用占用衍生的密集深度支持3D开放词汇地图的构建。我们的方法为持续的全局信息更新和存储开辟了新的范式,为大型户外环境中的更全面和可扩展的场景理解铺平了道路。
Summary / 总结
This paper addresses the challenge of 3D semantic occupancy prediction in autonomous driving by proposing Long-term Memory Prior Occupancy (LMPOcc), which integrates global occupancy priors to enhance local prediction and simultaneously updates global maps. The framework includes an efficient Current-Prior Fusion module and a model-agnostic prior format. LMPOcc demonstrates superior local occupancy prediction performance, especially for static categories, and shows capability in building large-scale global occupancy maps through multi-vehicle crowdsourcing, supporting 3D open-vocabulary map construction.
研究旨在通过利用全局占用先验来提升自主驾驶中的局部3D占用预测。提出的Long-term Memory Prior Occupancy (LMPOcc)框架将全局占用地图作为先验,增强局部预测并同时用新观测更新这些地图。Current-Prior Fusion模块适配性地结合了先验和当前特征,而模型无关的先验格式确保了不同预测模型之间的兼容性。实验表明,LMPOcc在Occ3D-nuScenes基准上达到了最先进的性能,特别是在静态语义类别方面,并展示了通过多车辆众包构建大规模全局占用地图的能力。
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
Authors: Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker
First: 2026-02-08T02:16:02+00:00 · Latest: 2026-03-05T07:52:20+00:00
Comments: Figure PDFs were compressed to 150 dpi to comply with arXiv's submission size limit. Project page: https://rolling-sink.github.io/
Abstract
Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradation. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To this end, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales the AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/
中文标题/摘要
标题:滚动水槽:在自回归视频扩散模型中弥合有限训练期与开放测试期之间的差距
最近,自回归(AR)视频扩散模型取得了显著的性能。然而,由于其有限的训练时长,当在更长的时间范围内进行测试时,会出现训练-测试差距,导致视觉质量迅速退化。在研究了训练时长内的训练-测试差距之后,这项工作研究了训练时长之外的训练-测试差距,即训练时有限时间范围与测试时开放时间范围之间的差距。由于开放测试可以超出任何有限的训练窗口,且长视频训练计算成本高昂,我们寻求一种无需训练的解决方案来弥合这一差距。为了探索无需训练的解决方案,我们系统地分析了AR缓存维护。这些见解导致了滚动水槽(Rolling Sink)的提出。基于仅使用5秒片段训练的Self Forcing,滚动水槽在测试时能够将AR视频合成扩展到超长时长(例如,16 FPS下5-30分钟),保持一致的主题、稳定的颜色、连贯的结构和流畅的运动。通过广泛的实验表明,滚动水槽在长时域视觉保真度和时间一致性方面优于当前最佳基线。项目页面:https://rolling-sink.github.io/
Summary / 总结
This work addresses the train-test gap that arises when autoregressive video diffusion models are tested beyond their training horizon. It proposes Rolling Sink, a training-free method built on Self Forcing and derived from a systematic analysis of AR cache maintenance. Rolling Sink extends AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) with consistent subjects, stable colors, coherent structures, and smooth motions, outperforming state-of-the-art baselines in long-horizon visual fidelity and temporal consistency.
该研究关注自回归视频扩散模型在有限训练窗口与开放测试窗口之间的差距,提出了一种无需训练的解决方案Rolling Sink,该方案基于AR缓存维护的见解,能够在测试时将视频合成持续时间扩展到30分钟(每秒16帧),同时保持一致的主题、稳定的颜色、连贯的结构和流畅的运动。实验表明,Rolling Sink在长时段视觉保真度和时间一致性方面优于当前最先进的基线方法。
AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM
Authors: Li'an Zhong, Ziqiang He, Jibin Zheng, Jin Li, Z. Jane Wang, Xiangui Kang
First: 2026-03-05T07:52:11+00:00 · Latest: 2026-03-05T07:52:11+00:00
Abstract
Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Increasing Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT) that employs a layer-wise threshold to control intervention time and fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results on several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
中文标题/摘要
标题:AdaIAT:通过增加生成文本的关注度来缓解LVLM中的幻觉
幻觉已成为当前大型视觉-语言模型(LVLM)发展和应用中的重大障碍。为了减轻幻觉,一种直观且有效的方法是在推理过程中直接增加对图像标记的关注权重。尽管这种方法有效地降低了幻觉率,但往往会引起重复描述。为了解决这一问题,我们首先分析了注意力模式,并发现真实对象标记倾向于比幻觉标记更关注生成的文本。这启发我们利用包含指令相关视觉信息和上下文知识的生成文本来缓解幻觉,同时保持语言连贯性。因此,我们提出了生成文本注意力(IAT),并证明它显著降低了幻觉率,同时避免了重复描述。为了防止简单的放大损害LVLM的固有预测能力,我们进一步探索了分层阈值控制干预时间和针对每个注意力头特征进行精细放大调整的自适应IAT(AdaIAT)。分析和实验都证明了AdaIAT的有效性。多个LVLM的结果表明,AdaIAT有效地缓解了幻觉(分别在LLaVA-1.5上将幻觉率$C_S$和$C_I$降低了35.8%和37.1%),同时保持了语言性能和预测能力,实现了令人满意的权衡。
Summary / 总结
The paper addresses hallucination in Large Vision-Language Models (LVLMs). Motivated by the observation that real object tokens assign higher attention to generated text than hallucinated ones, it first proposes IAT, which increases attention to generated text to reduce hallucinations without causing repetitive descriptions. IAT is then refined into AdaIAT, which uses a layer-wise threshold to control the intervention time and sets a fine-grained amplification magnitude for each attention head. Experiments show that AdaIAT reduces the hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively, while maintaining linguistic performance and prediction capability.
论文通过提出AdaIAT方法,增加对生成文本的关注以减少LVLM中的幻觉现象,同时避免重复描述。分析表明,真实物体令牌倾向于对生成文本赋予更高的注意力,而不是幻觉。实验结果显示,AdaIAT在LLaVA-1.5上将幻觉率分别降低了35.8%和37.1%,同时保持了语言一致性和预测能力,实现了良好的权衡。
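The core idea of amplifying attention to generated text can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the choice to scale post-softmax weights, and the renormalization step are all assumptions made for illustration, and the adaptive parts of AdaIAT (layer-wise threshold, per-head magnitude) are not modeled.

```python
def amplify_generated_attention(weights, generated_idx, alpha):
    """Scale attention on generated-text positions by alpha, then renormalize.

    weights: attention weights of one query over all keys (assumed to sum to 1).
    generated_idx: positions of already generated text tokens (hypothetical input).
    alpha: amplification factor; alpha > 1 increases attention to generated text.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    generated = set(generated_idx)
    scaled = [w * alpha if i in generated else w for i, w in enumerate(weights)]
    total = sum(scaled)
    # Renormalize so the result is again a probability distribution.
    return [w / total for w in scaled]


row = [0.5, 0.3, 0.2]  # toy attention row: two image tokens, one generated token
boosted = amplify_generated_attention(row, generated_idx=[2], alpha=2.0)
```

With `alpha=2.0` the share of the generated token rises from 0.2 to one third, while the other weights shrink proportionally.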
Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
Authors: Lianyu Wang, Meng Wang, Huazhu Fu, Daoqiang Zhang
First: 2026-03-05T07:36:07+00:00 · Latest: 2026-03-05T07:36:07+00:00
Abstract
The rapid adoption of vision-language models (VLMs) has heightened the demand for robust intellectual property (IP) protection of these high-value pretrained models. Effective IP protection should proactively confine model deployment within authorized domains and prevent unauthorized transfers. However, existing methods rely on static training-time definitions, limiting flexibility in dynamic environments and often producing opaque responses to unauthorized inputs. To address these limitations, we propose AoD-IP, a framework for dynamic authorization with legality-aware intellectual property protection for VLMs that supports authorize-on-demand and legality-aware assessment. AoD-IP introduces a lightweight dynamic authorization module that enables flexible, user-controlled authorization, allowing users to actively specify or switch authorized domains on demand at deployment time. This enables the model to adapt seamlessly as application scenarios evolve and provides substantially greater extensibility than existing static-domain approaches. In addition, AoD-IP incorporates a dual-path inference mechanism that jointly predicts the legality of the input and the task-specific output. Comprehensive experimental results on multiple cross-domain benchmarks demonstrate that AoD-IP maintains strong authorized-domain performance and reliable unauthorized detection, while supporting user-controlled authorization for adaptive deployment in dynamic environments.
中文标题/摘要
标题:按需授权:具有法律意识的知识产权保护以实现VLM的动态授权
视觉-语言模型(VLMs)的快速采用加剧了对这些高价值预训练模型的知识产权(IP)保护需求。有效的IP保护应主动限制模型部署在授权领域内,并防止未经授权的转移。然而,现有方法依赖于静态训练时定义,限制了在动态环境中的灵活性,并经常对未经授权的输入产生不透明的响应。为解决这些限制,我们提出了一种新颖的具有法律意识的知识产权保护(AoD-IP)框架,用于VLMs,该框架支持按需授权和法律意识评估。AoD-IP引入了一个轻量级的动态授权模块,使授权更加灵活和用户可控,允许用户在部署时主动指定或切换授权领域。这使模型能够无缝适应应用场景的变化,并提供了比现有静态领域方法更大的可扩展性。此外,AoD-IP结合了一种双路径推理机制,同时预测输入的法律意识和任务特定输出。在多个跨域基准上的全面实验结果表明,AoD-IP在授权领域内保持了强大的性能,并可靠地检测未经授权的输入,同时支持用户控制的授权以适应动态环境中的部署。
Summary / 总结
The paper proposes AoD-IP, a dynamic authorization framework for VLMs that supports on-demand authorization and legality-aware assessment. It introduces a lightweight dynamic authorization module allowing users to specify authorized domains at deployment time, enhancing flexibility and extensibility compared to static-domain approaches. Experimental results show that AoD-IP maintains strong performance in authorized domains and reliable unauthorized detection, supporting adaptive deployment in dynamic environments.
论文提出AoD-IP框架,用于VLM中的动态授权和法律意识知识产权保护,通过在部署时提供灵活的用户控制授权,解决了静态方法的局限性。实验结果表明,AoD-IP在授权域中保持了强大的性能,并且对未授权检测具有可靠性,支持动态环境中的自适应部署。
Differentially Private Multimodal In-Context Learning
Authors: Ivoline C. Ngong, Zarreen Reza, Joseph P. Near
First: 2026-03-05T07:36:02+00:00 · Latest: 2026-03-05T07:36:02+00:00
Abstract
Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. We evaluate on eight benchmarks across three VLM architectures, supporting deployment with or without auxiliary data. At $\varepsilon=1.0$, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints.
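The clip-then-noise aggregation described in the abstract can be sketched for a single layer as follows. This is an illustrative sketch, not the DP-MTV code: the function names, the use of one activation vector per chunk, and the Gaussian noise are assumptions, and `noise_std` is assumed to come from a separate privacy calibration for the target $(\varepsilon, \delta)$ that is not shown here.

```python
import math
import random


def clip_l2(vec, max_norm):
    """Rescale vec so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def private_mean(chunk_vectors, max_norm, noise_std, rng):
    """Average per-chunk vectors with clipping and a single noise addition.

    chunk_vectors: one vector per disjoint chunk of private data (hypothetical).
    max_norm: clipping bound that limits each chunk's contribution.
    noise_std: standard deviation of the Gaussian noise (assumed pre-calibrated).
    """
    clipped = [clip_l2(v, max_norm) for v in chunk_vectors]
    dim = len(clipped[0])
    summed = [sum(v[j] for v in clipped) for j in range(dim)]
    # Noise is added once to the aggregate, not per query.
    noisy = [s + rng.gauss(0.0, noise_std) for s in summed]
    return [x / len(clipped) for x in noisy]


rng = random.Random(0)
task_vector = private_mean([[3.0, 4.0], [0.0, 0.5]], max_norm=1.0, noise_std=0.1, rng=rng)
```

Because the noisy aggregate is computed once and then reused, later inference queries read only the released vector and touch no private data again, which is the property the abstract describes as enabling unlimited queries.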
Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models
Authors: Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish
First: 2026-03-05T07:35:07+00:00 · Latest: 2026-03-05T07:35:07+00:00
Abstract
Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@$k$ problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training free, low cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model. Our results demonstrate significantly improved diversity and Pass@$k$ performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at https://github.com/sean-lamont/odd.
中文标题/摘要
标题:免费午餐?低成本多样化采样以提高扩散语言模型性能
文本生成中的多样化输出对于复杂推理任务(如代码生成和数学问题解决)的有效探索是必要的。此类Pass@$k$问题可以从不同的候选方案中受益,这些方案覆盖了解空间。然而,传统的采样方法往往在重复的失败模式上浪费计算资源。尽管扩散语言模型已经成为了与自回归范式竞争的有力替代方案,但它们仍然容易受到这种冗余的影响,独立样本经常陷入相似的模式。为了解决这一问题,我们提出了一种无需训练、低成本的干预措施,以增强扩散语言模型的生成多样性。我们的方法按顺序修改批次中的中间样本,其中每个样本远离先前样本的特征空间,积极惩罚冗余。与需要重新训练或使用束搜索的先前方法不同,我们的策略几乎不增加计算开销,同时确保每个样本都为批次贡献了独特的视角。我们在HumanEval和GSM8K基准上使用LLaDA-8B-Instruct模型评估了我们的方法。结果显示,在各种温度设置下,我们的方法显著提高了多样性和Pass@$k$性能。作为一种简单的采样过程修改,我们的方法为当前和未来的扩散语言模型在需要多样化解决方案搜索的任务中提供了即时、低成本的改进。我们已在https://github.com/sean-lamont/odd/开源了我们的代码。
Summary / 总结
The paper addresses the need for diverse outputs in text generation for complex reasoning tasks, such as code generation and mathematical problem solving. It proposes a low-cost method to enhance generative diversity in Diffusion Language Models by sequentially modifying intermediate samples to repel them from the feature space of previous samples. The method does not require retraining and shows significant improvement in diversity and Pass@$k$ performance across various temperature settings on HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model.
论文旨在解决复杂推理任务(如代码生成和数学问题解决)中需要多样输出的问题。它提出了一种低成本干预方法,通过顺序修改中间样本,使其远离先前样本的特征空间,以增强扩散语言模型的生成多样性。该方法无需重新训练或使用束搜索,并在HumanEval和GSM8K基准上使用LLaDA-8B-Instruct模型展示了在各种温度设置下显著提高的多样性和Pass@$k$性能。
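For readers unfamiliar with the metric, Pass@$k$ is commonly computed with the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn and $c$ of them are correct. The abstract does not state which estimator the authors use, so the sketch below is an assumption about the evaluation protocol rather than a description of it.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


estimate = pass_at_k(n=10, c=3, k=5)
```

The estimator only counts correct samples, so a batch whose samples collapse into the same failure mode contributes nothing, which is why diversity among samples matters for this metric.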
RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
Authors: I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu, Yu-Chien Chiang, Jen-Chien Yang, Wei-Ting Chen
First: 2026-02-25T15:27:57+00:00 · Latest: 2026-03-05T07:12:37+00:00
Comments: Accepted by CVPR2026; Project Page: https://robustvisrag.github.io
Abstract
Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.
中文标题/摘要
标题:RobustVisRAG:在视觉退化条件下具有因果关系意识的基于视觉检索增强生成
基于视觉的检索增强生成(VisRAG)利用视觉语言模型(VLMs)联合检索相关视觉文档,并基于多模态证据生成基于事实的答案。然而,现有的VisRAG模型在视觉输入遭受模糊、噪声、低光照或阴影等退化时性能会下降,因为语义和退化因素在预训练视觉编码器中交织在一起,导致检索和生成阶段出现错误。为了解决这一限制,我们提出了RobustVisRAG,这是一种因果关系引导的双路径框架,可以提高VisRAG的鲁棒性,同时保持效率和零样本泛化能力。RobustVisRAG使用非因果路径通过单向注意力捕捉退化信号,并使用因果路径通过这些信号学习净化的语义。结合提出的非因果退化建模和因果语义对齐目标,该框架确保语义和退化之间的清晰分离,从而在具有挑战性的视觉条件下实现稳定的检索和生成。为了在现实条件下评估鲁棒性,我们引入了Distortion-VisRAG数据集,这是一个大规模基准,包含七个领域中的合成和真实世界退化文档,具有12种合成和5种真实退化类型,全面反映了实际视觉退化。实验结果表明,RobustVisRAG在真实世界退化条件下分别提高了检索、生成和端到端性能7.35%、6.35%和12.40%,同时在干净输入上保持了相当的准确性。
Summary / 总结
RobustVisRAG is a causality-guided dual-path framework designed to enhance the robustness of Vision-based Retrieval-Augmented Generation (VisRAG) models under visual degradations. It uses a non-causal path to capture degradation signals and a causal path to learn purified semantics, which are aligned through specific objectives. This approach improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining accuracy on clean inputs. The framework is evaluated using the Distortion-VisRAG dataset, which includes both synthetic and real-world degraded documents across seven domains.
RobustVisRAG 是一种因果引导的双路径框架,旨在增强视觉检索增强生成(VisRAG)模型在视觉退化条件下的鲁棒性。该框架使用非因果路径捕捉退化信号,使用因果路径学习净化的语义,分别在真实世界退化条件下提高检索、生成和端到端性能 7.35%、6.35% 和 12.40%。框架还包括一个包含 12 种合成和 5 种真实退化类型的 Distortion-VisRAG 数据集,用于在现实条件下评估鲁棒性。
GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?
Authors: Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, Yingchun Wang
First: 2025-10-23T08:33:24+00:00 · Latest: 2026-03-05T06:51:24+00:00
Abstract
Vision-Language Models (VLMs) are increasingly deployed as autonomous agents to navigate mobile graphical user interfaces (GUIs). Operating in dynamic on-device ecosystems, which include notifications, pop-ups, and inter-app interactions, exposes them to a unique and underexplored threat vector: environmental injection. Unlike prompt-based attacks that manipulate textual instructions, environmental injection corrupts an agent's visual perception by inserting adversarial UI elements (for example, deceptive overlays or spoofed notifications) directly into the GUI. This bypasses textual safeguards and can derail execution, causing privacy leakage, financial loss, or irreversible device compromise. To systematically evaluate this threat, we introduce GhostEI-Bench, the first benchmark for assessing mobile agents under environmental injection attacks within dynamic, executable environments. Moving beyond static image-based assessments, GhostEI-Bench injects adversarial events into realistic application workflows inside fully operational Android emulators and evaluates performance across critical risk scenarios. We further propose a judge-LLM protocol that conducts fine-grained failure analysis by reviewing the agent's action trajectory alongside the corresponding screenshot sequence, pinpointing failure in perception, recognition, or reasoning. Comprehensive experiments on state-of-the-art agents reveal pronounced vulnerability to deceptive environmental cues: current models systematically fail to perceive and reason about manipulated UIs. GhostEI-Bench provides a framework for quantifying and mitigating this emerging threat, paving the way toward more robust and secure embodied agents.
中文标题/摘要
标题:GhostEI-Bench:移动代理在动态设备环境中对环境注入的韧性如何?
视觉-语言模型(VLMs)正越来越多地作为自主代理部署,以导航移动图形用户界面(GUIs)。在包括通知、弹出窗口和跨应用交互的动态设备生态系统中运行,使它们面临一种独特的、尚未充分探索的威胁向量:环境注入。与基于提示的攻击不同,后者操纵文本指令,环境注入通过直接向GUI插入对抗性UI元素(例如,欺骗性覆盖或伪造的通知)来篡改代理的视觉感知,从而绕过了文本保护措施,可能导致执行中断、隐私泄露、经济损失或设备不可逆的破坏。为了系统地评估这一威胁,我们引入了GhostEI-Bench,这是首个评估移动代理在动态可执行环境中遭受环境注入攻击的基准。超越基于静态图像的评估,GhostEI-Bench在完全运行的Android模拟器中注入对抗性事件到现实的应用工作流中,并在关键风险场景中评估性能。我们进一步提出了一种裁判LLM协议,通过审查代理的动作轨迹与相应的屏幕截图序列来开展精细的失败分析,以确定感知、识别或推理中的失败。全面的实验表明,最先进的代理模型对欺骗性环境线索表现出明显的脆弱性:当前模型系统地无法感知和推理关于被操纵的UIs。GhostEI-Bench提供了一种量化和缓解这一新兴威胁的框架,为更稳健和安全的实体代理铺平了道路。
Summary / 总结
The research aims to evaluate the resilience of mobile agents to environmental injection attacks in dynamic on-device environments, which can corrupt their visual perception through adversarial UI elements. The study introduces GhostEI-Bench, a benchmark that injects adversarial events into realistic application workflows on fully operational Android emulators, assessing performance in critical risk scenarios. Key findings show that state-of-the-art agents are highly vulnerable to deceptive environmental cues, failing to perceive and reason about manipulated UIs. This work provides a framework for quantifying and mitigating this emerging threat, enhancing the robustness and security of embodied agents.
论文引入了GhostEI-Bench,这是一个用于评估移动代理在动态设备环境中的环境注入攻击下的鲁棒性的基准。它解决了通过篡改视觉感知来绕过文本保护的对抗UI元素的威胁。方法是将对抗事件注入到完全运行的Android模拟器中的现实应用工作流中,并在关键风险场景中评估性能。关键发现表明,最先进的代理对欺骗性的环境线索高度脆弱,无法感知和推理关于篡改的UI。这项工作提供了一个量化和缓解这种新兴威胁的框架。
On Multi-Step Theorem Prediction via Non-Parametric Structural Priors
Authors: Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang, Hua Huang
First: 2026-03-05T06:08:50+00:00 · Latest: 2026-03-05T06:08:50+00:00
Abstract
Multi-step theorem prediction is a central challenge in automated reasoning. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM's inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.
中文标题/摘要
标题:基于非参数结构先验的多步定理预测
多步定理预测是自动推理中的一个核心挑战。现有的神经符号方法主要依赖于监督参数模型,这些模型在处理不断演化的定理库时表现出有限的泛化能力。在本文中,我们通过上下文学习(ICL)的视角探索无训练的定理预测。我们识别出一个关键的可扩展性瓶颈,称为结构漂移:随着推理深度的增加,vanilla ICL的性能急剧下降,通常会崩溃到接近零。我们认为这种失败是由于LLM无法恢复潜在的拓扑依赖性,导致无结构的探索。为了解决这个问题,我们提出了定理优先图,它将历史解题轨迹中的时间依赖性编码为有向图,并施加显式的拓扑约束,有效地在推理期间剪枝搜索空间。结合检索增强的图构建和逐步符号执行,我们的方法使LLM能够作为结构化规划者而无需任何基于梯度的优化。在FormalGeo7k基准测试上的实验表明,我们的方法达到了89.29%的准确率,显著优于ICL基线,并且与最先进的监督模型相当。这些结果表明,显式的结构先验为扩展基于LLM的符号推理提供了一个有希望的方向。
Summary / 总结
This work addresses the challenge of multi-step theorem prediction in automated reasoning by leveraging non-parametric structural priors. It identifies a scalability issue known as Structural Drift, where vanilla in-context learning degrades with increased reasoning depth. To overcome this, the authors propose Theorem Precedence Graphs, which encode temporal dependencies and impose topological constraints to prune the search space. Experiments on the FormalGeo7k benchmark demonstrate that this approach achieves 89.29% accuracy, significantly outperforming ICL baselines and matching state-of-the-art supervised models.
该研究通过提出编码历史解题轨迹中时间依赖性的定理优先图,解决了自动推理中的多步定理预测挑战。该方法在推理过程中有效剪枝搜索空间,并在FormalGeo7k基准测试中达到89.29%的准确率,优于上下文学习基线并匹配最先进的监督模型。
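A precedence graph of this kind can be sketched in a few lines. The sketch assumes that an edge records "theorem B was applied directly after theorem A in some historical trace" and that pruning keeps only candidates reachable by one such edge; the paper's actual graph construction and constraints may differ, and the theorem names are hypothetical.

```python
from collections import defaultdict


def build_precedence_graph(traces):
    """Build a directed graph from solution traces (lists of theorem names)."""
    graph = defaultdict(set)
    for trace in traces:
        for prev, nxt in zip(trace, trace[1:]):
            graph[prev].add(nxt)  # edge: prev was followed by nxt
    return graph


def allowed_next(graph, last_theorem, candidates):
    """Keep only candidates that followed last_theorem in some trace."""
    successors = graph.get(last_theorem, set())
    return [c for c in candidates if c in successors]


traces = [
    ["parallel_lines", "alternate_angles", "triangle_sum"],
    ["parallel_lines", "triangle_sum"],
]
graph = build_precedence_graph(traces)
pruned = allowed_next(graph, "parallel_lines", ["triangle_sum", "pythagoras", "alternate_angles"])
```

Here `pruned` drops `pythagoras`, which never followed `parallel_lines` in the toy traces, so a planner would have fewer candidates to consider at that step.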
AutoV: Loss-Oriented Ranking for Visual Prompt Retrieval in LVLMs
Authors: Yuan Zhang, Chun-Kai Fan, Sicheng Yu, Junwen Pan, Tao Huang, Ming Lu, Kuan Cheng, Qi She, Shanghang Zhang
First: 2025-06-19T08:02:53+00:00 · Latest: 2026-03-05T05:25:24+00:00
Abstract
Inspired by text prompts in large language models, visual prompts have been explored to enhance the perceptual capabilities of large vision-language models (LVLMs). However, performance tends to saturate under single visual prompt designs, making further prompt engineering increasingly ineffective. To address this limitation, we shift from prompt engineering to prompt retrieval and propose AutoV, a lightweight framework for instance-adaptive visual prompt identification. Given an input image and a textual query, AutoV automatically locates the most suitable visual prompt from a diverse candidate pool. Training such a retrieval framework requires prompt-level supervision, yet prompt quality is inherently ambiguous and difficult to assess reliably, even for humans. To enable automatic supervision, we evaluate visual prompts using a pre-trained LVLM and label them according to their prediction losses. Using the loss-oriented ranking as a robust training signal, AutoV learns to retrieve the query-aware optimal prompt for each instance without manual annotation. Experiments indicate that AutoV enhances the performance of various LVLMs on image understanding, captioning, grounding, and classification tasks. For example, AutoV improves LLaVA-OV by $\textbf{10.2}\%$ on VizWiz and boosts Qwen2.5-VL by $\textbf{3.8}\%$ on MMMU.
中文标题/摘要
标题:AutoV:面向视觉提示检索的损失导向排名
受大型语言模型中文本提示的启发,视觉提示已被探索以增强大型视觉-语言模型(LVLM)的感知能力。然而,在单一视觉提示设计下,性能往往会饱和,使得进一步的提示工程变得越来越无效。为了解决这一局限性,我们从提示工程转向提示检索,并提出AutoV,这是一种轻量级框架,用于实例自适应视觉提示识别。给定输入图像和文本查询,AutoV 自动从多样化的候选池中定位最合适的视觉提示。训练这种检索框架需要提示级别的监督,但提示质量本质上是模糊的,即使对人类来说也难以可靠地评估。为了实现自动监督,我们使用预训练的LVLM评估视觉提示,并根据其预测损失对其进行标记。利用损失导向的排名作为稳健的训练信号,AutoV 学习在每个实例中检索与查询相关的最佳提示,而无需手动注释。实验表明,AutoV 在图像理解、描述、定位和分类任务中提高了各种LVLM的表现。例如,AutoV 在VizWiz上将LLaVA-OV 的性能提高了10.2%,在MMMU上将Qwen2.5-VL 的性能提高了3.8%。
Summary / 总结
The paper introduces AutoV, a framework for automatically retrieving visual prompts to enhance large vision-language models (LVLMs). It uses a loss-oriented ranking method to label visual prompts based on their prediction losses, enabling the training of a retrieval system without manual annotation. Experiments show that AutoV improves performance on various tasks, such as increasing LLaVA-OV's accuracy by 10.2% on VizWiz and Qwen2.5-VL's performance by 3.8% on MMMU.
论文提出了一种名为AutoV的框架,用于自动检索视觉提示以增强大型视觉语言模型的性能。该框架通过提出基于损失的排名方法来评估和选择最适合每个输入图像和查询的视觉提示,解决了单一视觉提示设计的局限性。实验表明,AutoV在图像理解、描述等任务上提高了各种LVLM的表现,例如在VizWiz和MMMU上的改进分别为10.2%和3.8%。
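The loss-oriented labeling step can be illustrated with a small sketch. It assumes that each candidate visual prompt has already been scored by a pre-trained LVLM and that a lower prediction loss means a better prompt; the prompt names and loss values are made up, and how AutoV turns the ranking into a training objective is not modeled.

```python
def rank_prompts_by_loss(losses):
    """Return prompt names ordered from lowest to highest prediction loss."""
    return sorted(losses, key=losses.get)


def rank_labels(losses):
    """Map each prompt name to its rank (0 = lowest loss)."""
    return {name: rank for rank, name in enumerate(rank_prompts_by_loss(losses))}


# Hypothetical per-prompt losses for one (image, query) pair.
losses = {"crop": 0.9, "bounding_box": 0.4, "mask": 0.7}
ordering = rank_prompts_by_loss(losses)
labels = rank_labels(losses)
```

The resulting ranks could serve as supervision targets for a retriever, which matches the abstract's description of training without manual annotation.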