arXiv 论文速递

2026-04-04 03:50
Snapshot: 20260404_0350
Steerable Visual Representations
Authors: Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano
First: 2026-04-02T17:59:49+00:00 · Latest: 2026-04-02T17:59:49+00:00
Comments: preprint
Abstract
Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
中文标题/摘要
标题:可引导的视觉表示
预训练的视觉变换器(ViTs)如DINOv2和MAE提供了通用的图像特征,可用于检索、分类和分割等多种下游任务。然而,这些表示往往集中在图像中最显眼的视觉线索上,没有方法可以引导它们关注不那么突出的概念。相比之下,多模态LLMs可以通过文本提示进行引导,但生成的表示往往是语言中心的,对于通用的视觉任务效果不佳。为了解决这个问题,我们引入了可引导的视觉表示,这是一种新的视觉表示类别,其全局和局部特征可以通过自然语言进行引导。大多数视觉-语言模型(例如CLIP)在编码后将文本与视觉特征融合(晚期融合),而我们则通过轻量级的交叉注意力直接将文本注入视觉编码器的层中(早期融合)。我们引入了衡量表示可引导性的基准,并证明我们的可引导视觉特征可以在图像中聚焦于任何所需的对象,同时保持底层表示的质量。我们的方法在异常检测和个性化对象区分方面也与专门的方法相当或更优,展示了对未见过任务的零样本泛化。
Summary / 总结
The paper introduces Steerable Visual Representations, whose global and local features can be directed with natural language toward less prominent visual concepts while preserving generic feature quality. Unlike late-fusion vision-language models such as CLIP, which combine text with visual features after encoding, the method injects text directly into the layers of the visual encoder through lightweight cross-attention (early fusion). On newly introduced steerability benchmarks, the features can focus on queried objects in an image, and the method matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, showing zero-shot generalization to out-of-distribution tasks.
该论文提出了可引导的视觉表示,可以通过自然语言引导其关注图像中的特定对象,同时保持整体表示质量。不同于现有方法在编码后才融合文本和视觉特征,该方法通过轻量级交叉注意力将文本直接注入视觉编码器的各层。该方法在异常检测和个性化对象区分方面与专门方法相当或更优,展示了对分布外任务的零样本泛化能力。
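As a rough illustration of the early-fusion idea summarized above, the following minimal sketch lets each visual token attend to a set of text tokens and adds the result back residually. It is not the authors' implementation: the function name `cross_attend`, the `gate` parameter, the single attention head, and the absence of learned projections are all simplifying assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(visual_tokens, text_tokens, gate=1.0):
    """Hypothetical single-head cross-attention without learned weights.

    Each visual token is the query; the text tokens serve as keys and values.
    The attended text context is added to the visual token, scaled by `gate`.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    steered = []
    for v in visual_tokens:
        weights = softmax([dot(v, t) * scale for t in text_tokens])
        context = [sum(w * t[i] for w, t in zip(weights, text_tokens))
                   for i in range(dim)]
        steered.append([vi + gate * ci for vi, ci in zip(v, context)])
    return steered
```

With `gate=0.0` the visual tokens pass through unchanged, which is one plausible way to picture how a lightweight added layer can leave the underlying representation intact when no steering is wanted.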
Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning
Authors: Xueying Li, Feng Lyu, Hao Wu, Mingliu Liu, Jia-Nan Liu, Guozi Liu
First: 2026-04-02T17:58:08+00:00 · Latest: 2026-04-02T17:58:08+00:00
Comments: 10 pages, 6 figures
Abstract
Training-free Vision-Language Navigation (VLN) agents powered by foundation models can follow instructions and explore 3D environments. However, existing approaches rely on greedy frontier selection and passive spatial memory, leading to inefficient behaviors such as local oscillation and redundant revisiting. We argue that this stems from a lack of metacognitive capabilities: the agent cannot monitor its exploration progress, diagnose strategy failures, or adapt accordingly. To address this, we propose MetaNav, a metacognitive navigation agent integrating spatial memory, history-aware planning, and reflective correction. Spatial memory builds a persistent 3D semantic map. History-aware planning penalizes revisiting to improve efficiency. Reflective correction detects stagnation and uses an LLM to generate corrective rules that guide future frontier selection. Experiments on GOAT-Bench, HM3D-OVON, and A-EQA show that MetaNav achieves state-of-the-art performance while reducing VLM queries by 20.7%, demonstrating that metacognitive reasoning significantly improves robustness and efficiency.
中文标题/摘要
标题:停止漫游:通过元认知推理实现高效的视觉-语言导航
基于基础模型的无需训练的视觉-语言导航(VLN)代理可以遵循指令并探索3D环境。然而,现有方法依赖于贪婪的前沿选择和被动的空间记忆,导致诸如局部振荡和重复访问等低效行为。我们认为这源于缺乏元认知能力:代理无法监控其探索进度,诊断策略失败,或相应地进行调整。为了解决这个问题,我们提出了MetaNav,这是一种结合空间记忆、历史感知规划和反思性纠正的元认知导航代理。空间记忆构建持久的3D语义地图。历史感知规划通过惩罚重复访问来提高效率。反思性纠正检测停滞并使用LLM生成纠正规则,以指导未来的前沿选择。在GOAT-Bench、HM3D-OVON和A-EQA上的实验表明,MetaNav在保持最佳性能的同时减少了VLM查询20.7%,证明了元认知推理显著提高了鲁棒性和效率。
Summary / 总结
The research targets inefficient behaviors of training-free Vision-Language Navigation (VLN) agents, such as local oscillation and redundant revisiting, which the authors attribute to a lack of metacognition. The proposed MetaNav integrates spatial memory (a persistent 3D semantic map), history-aware planning that penalizes revisiting, and reflective correction, which detects stagnation and uses an LLM to generate corrective rules for future frontier selection. Experiments on GOAT-Bench, HM3D-OVON, and A-EQA show that MetaNav achieves state-of-the-art performance while reducing VLM queries by 20.7%.
研究旨在通过解决视觉-语言导航(VLN)代理的局部振荡和重复访问等问题,提高其效率。MetaNav通过集成空间记忆、历史感知规划和反思性纠正来增强代理的元认知能力。实验结果表明,MetaNav在GOAT-Bench、HM3D-OVON和A-EQA上的表现优于现有方法,同时减少了20.7%的VLM查询,显示出显著的鲁棒性和效率提升。
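The contrast between greedy frontier selection and history-aware planning can be sketched in a few lines. This is only an illustrative toy under stated assumptions: the scoring rule (utility minus a linear revisit penalty), the function name `select_frontier`, and the `revisit_penalty` parameter are hypothetical and are not taken from MetaNav.

```python
def select_frontier(frontier_scores, visit_counts, revisit_penalty=0.5):
    """Pick the frontier with the best penalized score.

    frontier_scores: dict mapping frontier name -> raw utility (higher is better).
    visit_counts:    dict mapping frontier name -> how often it was already visited.
    A penalty of 0.0 reduces this to plain greedy selection.
    """
    def adjusted(name):
        return frontier_scores[name] - revisit_penalty * visit_counts.get(name, 0)

    return max(frontier_scores, key=adjusted)


scores = {"hallway": 0.9, "kitchen": 0.7}
visits = {"hallway": 2}
greedy_choice = select_frontier(scores, visits, revisit_penalty=0.0)
history_aware_choice = select_frontier(scores, visits, revisit_penalty=0.5)
```

In this toy setup the greedy rule keeps returning to the hallway, while the penalized rule moves on to the kitchen once the hallway has been visited twice.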
VOID: Video Object and Interaction Deletion
Authors: Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng
First: 2026-04-02T17:36:53+00:00 · Latest: 2026-04-02T17:36:53+00:00
Abstract
Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
中文标题/摘要
标题:VOID:视频对象和交互删除
现有的视频对象移除方法在修复内容“背后”的内容和纠正阴影、反射等外观级伪影方面表现出色。然而,当移除的对象具有更显著的交互,如与其他对象的碰撞时,当前的模型无法纠正这些交互,从而产生不合理的结果。我们提出了VOID,一种旨在在这些复杂场景中执行物理上合理的修复的视频对象移除框架。为了训练模型,我们使用Kubric和HUMOTO生成了一个新的配对数据集,其中移除对象需要改变下游的物理交互。在推理过程中,一个视觉语言模型识别场景中受移除对象影响的区域。然后使用这些区域来引导一个视频扩散模型生成物理上一致的反事实结果。在合成和真实数据上的实验表明,与之前的视频对象移除方法相比,我们的方法在对象移除后更好地保持了场景动力学的一致性。我们希望这个框架能够揭示如何通过高层次的因果推理使视频编辑模型更好地模拟世界。
Summary / 总结
The research aims to address the limitations of existing video object removal methods, which struggle with scenarios involving significant physical interactions. The proposed VOID framework uses a vision-language model to identify affected regions and a video diffusion model to generate physically consistent outcomes. Experiments show that VOID better preserves scene dynamics after object removal compared to previous methods.
研究旨在解决现有视频对象移除方法在处理复杂交互(如碰撞)时的不足,当前模型无法正确修正这些问题。VOID框架使用Kubric和HUMOTO生成的配对数据集来训练一个能够进行物理上合理修补的模型。在推理过程中,视觉语言模型识别受影响的区域,这些区域指导视频扩散模型生成一致的反事实结果。实验表明,VOID在对象移除后更好地保持了场景的动力学,优于之前的视频对象移除方法。
Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models
Authors: Yaoteng Tan, Zikui Cai, M. Salman Asif
First: 2026-04-02T16:59:28+00:00 · Latest: 2026-04-02T16:59:28+00:00
Abstract
Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.
中文标题/摘要
标题:面向基础模型安全文本到图像生成的模块化能量引导
控制文本到图像生成模型的行为对于安全和实际部署至关重要。现有安全方法通常依赖于模型微调或精心策划的数据集,这可能会降低生成质量或限制可扩展性。我们提出了一种推理时的导向框架,该框架利用冻结的预训练基础模型的梯度反馈来引导生成过程,而不修改底层生成器。我们的主要观察是,视觉语言基础模型编码了丰富的语义表示,可以在生成过程中作为现成的监督信号重新利用。通过在每次采样步骤中注入这种反馈,我们的方法将安全导向建模为一种基于能量的采样问题。此设计使安全控制模块化,无需训练即可与扩散和流匹配模型兼容,并且可以跨多种视觉概念泛化。实验表明,我们的方法在NSFW红队测试基准上具有最先进的鲁棒性,并且能够有效进行多目标导向,同时在良性非目标提示上保持高质量的生成。我们的框架提供了一种原理性的方法,用于利用基础模型作为语义能量估计器,从而实现文本到图像生成的可靠和可扩展的安全控制。
Summary / 总结
The research aims to enhance the safety of text-to-image generation by proposing a modular energy steering framework that uses gradient feedback from frozen pretrained models to guide the generation process without altering the underlying generator. This method leverages the rich semantic representations of vision-language foundation models to provide off-the-shelf supervisory signals during generation. Experiments show that the approach achieves state-of-the-art robustness against NSFW benchmarks and effective multi-target steering while maintaining high generation quality on benign prompts.
研究旨在提升文本到图像生成的安全性,提出了一种模块化能量引导框架,利用冻结预训练基础模型的梯度反馈来引导生成过程,而不修改底层生成器。该方法利用视觉语言基础模型丰富的语义表示,在生成过程中提供现成的监督信号。实验表明,该方法在NSFW红队测试基准上达到了最先进的鲁棒性,并且能够有效进行多目标引导,同时保持对良性非目标提示的高质量生成。
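To make the idea of steering through a clean latent estimate more concrete, here is a toy sketch of one guided sampling step. It rests on several assumptions that are not in the abstract: the clean estimate uses the standard DDPM identity, the energy is a hand-written quadratic penalty along a hypothetical `unsafe_dir` vector (standing in for feedback from a frozen foundation model), and the chain-rule factor from the clean estimate back to the noisy latent is folded into `step_size`.

```python
import math


def clean_estimate(x_t, eps_pred, alpha_bar):
    """DDPM-style estimate: x0_hat = (x_t - sqrt(1 - a) * eps) / sqrt(a)."""
    noise_scale = math.sqrt(1.0 - alpha_bar)
    signal_scale = math.sqrt(alpha_bar)
    return [(x - noise_scale * e) / signal_scale for x, e in zip(x_t, eps_pred)]


def energy_grad(x0_hat, unsafe_dir):
    """Gradient of the toy energy E = 0.5 * max(0, <x0_hat, u>)^2 w.r.t. x0_hat."""
    projection = max(0.0, sum(a * b for a, b in zip(x0_hat, unsafe_dir)))
    return [projection * u for u in unsafe_dir]


def steer(x_t, eps_pred, alpha_bar, unsafe_dir, step_size=0.1):
    """One steering step: move x_t down the energy gradient at the clean estimate.

    The generator itself is untouched; only the sampled latent is nudged.
    """
    x0_hat = clean_estimate(x_t, eps_pred, alpha_bar)
    grad = energy_grad(x0_hat, unsafe_dir)
    return [x - step_size * g for x, g in zip(x_t, grad)]
```

A latent whose clean estimate has no positive component along the hypothetical unsafe direction is left unchanged, which mirrors the stated goal of preserving quality on benign prompts.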
Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
Authors: Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath
First: 2025-07-08T04:40:09+00:00 · Latest: 2026-04-02T16:59:25+00:00
Comments: Accepted at ACM CODASPY 2026
Abstract
Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus
中文标题/摘要
标题:Optimus:一种稳健的防御框架,用于在微调对话AI时减轻毒性
在不可信数据集上定制大型语言模型(LLMs)会严重增加注入毒性行为的风险。在本研究中,我们提出了Optimus,一种新颖的防御框架,旨在减轻微调危害同时保留对话实用性。与依赖精确毒性检测或严格过滤的现有防御不同,Optimus 通过确保即使在毒性分类器不准确或有偏见时也能实现稳健的缓解来解决关键挑战。Optimus 结合了一种无需训练的毒性分类方案,该方案重新利用了商品级LLM的安全对齐,并采用结合合成“治愈数据”与直接偏好优化(DPO)的双重策略对齐过程,以高效地引导模型向安全方向发展。广泛的评估表明,即使依赖于高度有偏见的分类器(召回率降低高达85%),Optimus 也能减轻毒性。Optimus 在对抗适应性对抗和越狱攻击方面表现出色,优于最先进的防御StarDSS。我们的源代码和数据集可在https://github.com/secml-lab-vt/Optimus 获取
Summary / 总结
Optimus is a defense framework designed to mitigate toxic behaviors in fine-tuned conversational AI models while preserving conversational utility. Unlike previous methods that rely on precise toxicity detection or restrictive filtering, Optimus uses a training-free toxicity classification scheme and a dual-strategy alignment process to steer models towards safety. Evaluations show that Optimus can effectively mitigate toxicity even with highly biased classifiers and demonstrates strong resilience against various attacks.
Optimus 是一个防御框架,旨在在保持对话功能的同时减轻对话 AI 模型在微调过程中产生的有毒行为。与依赖精确的毒性检测或严格过滤的方法不同,Optimus 使用无训练的毒性分类方案和双重策略对齐过程来引导模型向安全方向发展。评估结果显示,Optimus 即使在使用高度偏差的分类器时也能有效减轻毒性,并且对各种攻击具有很强的抗性。
Scaling Video Pretraining for Surgical Foundation Models
Authors: Sicheng Lu, Zikai Xiao, Jianhui Wei, Danyu Sun, Qi Lu, Keli Hu, Yang Feng, Jian Wu, Zongxin Yang, Zuozhu Liu
First: 2026-03-31T16:31:25+00:00 · Latest: 2026-04-02T16:46:06+00:00
Abstract
Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.
中文标题/摘要
标题:面向手术基础模型的视频预训练扩展
手术视频理解对于计算机辅助干预至关重要,但现有的手术基础模型仍然受到数据规模有限、程序多样性不足以及评估不一致的限制,往往缺乏可重复的训练管道。我们提出了SurgRec,这是一种可扩展且可重复的手术视频理解预训练方案,包括两种变体:SurgRec-MAE和SurgRec-JEPA。我们整理了一个包含10,535个视频和2.145亿帧的大型多源数据集,涵盖了内窥镜、腹腔镜、白内障和机器人手术。基于此数据集,我们开发了一个统一的预训练管道,采用平衡采样,并在16个下游数据集和四个临床领域中标准化了一个可重复的基准,数据集具有统一的数据分割。在与SSL基线和视觉-语言模型的广泛比较中,SurgRec在所有下游数据集上都表现出更优的性能。相比之下,视觉-语言模型在细粒度的时间识别上表现不稳定,表现出性能差距和对提示措辞的敏感性。我们的工作为社区提供了一个可重复和可扩展的基础,以构建更通用的手术视频模型。所有代码、模型和数据将公开发布。
Summary / 总结
The paper addresses the limitations of existing surgical foundation models in terms of data scale and procedural diversity. It introduces SurgRec, a scalable pretraining method for surgical video understanding, which includes SurgRec-MAE and SurgRec-JEPA. The authors curate a large dataset of 10,535 videos and 214.5M frames from various surgical procedures. They develop a unified pretraining pipeline and benchmark across 16 downstream datasets, showing that SurgRec outperforms SSL baselines and vision-language models in surgical video understanding tasks.
研究旨在解决现有手术基础模型在数据规模和程序多样性方面的限制。提出了一种可扩展的手术视频理解预训练方法SurgRec,包括两种变体:SurgRec-MAE和SurgRec-JEPA。研究收集了来自不同手术程序的10,535个视频和214.5M帧。研究开发了统一的预训练管道,并在16个下游数据集上建立了可重复的基准;SurgRec的表现始终优于SSL基线和视觉语言模型,而视觉语言模型在细粒度的时间识别任务上表现不可靠。
SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
Authors: Naomi Kombol, Ivan Martinović, Siniša Šegvić, Giorgos Tolias
Venue: CVPR 2026
First: 2026-04-02T16:45:34+00:00 · Latest: 2026-04-02T16:45:34+00:00
Comments: Accepted to CVPR 2026
Abstract
Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
中文标题/摘要
标题:SPAR:单次通过任意分辨率ViT的开放词汇分割
基础视觉变换器(ViTs)在需要精细空间理解的任务中效果有限,因为它们具有固定的预训练分辨率和固有的粗粒度的块级表示。这些挑战在密集预测场景中尤为明显,例如基于ViT的视觉-语言模型的开放词汇分割,其中高分辨率输入对于准确的像素级推理至关重要。现有方法通常使用滑动窗口策略在预训练分辨率下处理大分辨率图像。虽然这通过更精细的步幅提高了准确性,但会带来显著的计算成本。我们引入了SPAR:单次通过任意分辨率ViT,这是一种分辨率无关的密集特征提取器,旨在进行高效的高分辨率推理。我们通过特征回归损失将精细步幅的滑动窗口教师的空间推理能力提炼到单次通过的学生中,而无需进行架构更改或像素级监督。应用于开放词汇分割,SPAR将单次通过基线提高了最多10.5 mIoU,并且甚至超过了教师,证明了其在高效高分辨率推理中的有效性。代码:https://github.com/naomikombol/SPAR
Summary / 总结
SPAR is a resolution-agnostic ViT designed for efficient high-resolution inference in open-vocabulary segmentation tasks. It distills the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss. SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, showing effectiveness in efficient, high-resolution reasoning.
SPAR 是一种针对开放词汇分割任务的分辨率无关 ViT,旨在高效进行高分辨率推理。它通过特征回归损失将精细步长的滑动窗口教师模型的空间推理能力提炼到单次通过的学生模型中。SPAR 将单次通过基线模型最多提高了 10.5 mIoU,甚至超越了教师模型,展示了高效高分辨率推理的有效性。
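The cost argument behind replacing a sliding-window teacher with a single-pass student can be checked with simple arithmetic. The sketch below counts teacher forward passes for a given window and stride, and shows a mean-squared-error loss as one possible form of a feature regression loss. The image sizes, the function names, and the choice of MSE are assumptions made for illustration; the abstract does not specify them.

```python
import math


def windows_per_axis(image_size, window, stride):
    """Number of window positions along one axis, covering the full extent."""
    if image_size <= window:
        return 1
    return math.ceil((image_size - window) / stride) + 1


def num_forward_passes(height, width, window, stride):
    """Forward passes a sliding-window extractor needs for one image."""
    return (windows_per_axis(height, window, stride)
            * windows_per_axis(width, window, stride))


def feature_regression_loss(student, teacher):
    """Mean squared error between flattened student and teacher features."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


# Hypothetical 448x448 input with a 224x224 window.
coarse = num_forward_passes(448, 448, window=224, stride=224)
fine = num_forward_passes(448, 448, window=224, stride=112)
```

Halving the stride in this example more than doubles the number of teacher passes, whereas a single-pass student always needs one.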
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-04-02T16:01:02+00:00
Comments: Updated first authors
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当前最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs的合成数据,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。更重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是像素跟踪。即使私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLM家族,是开源模型中的最新技术,并展示了在单图像、多图像和视频任务中出色的基于指针的定位能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、用于微调的自由形式视频问答数据集、新的具有复杂查询的对象跟踪数据集以及创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的数据训练食谱,并展示了在视觉标记上进行双向注意以及一种新的标记权重策略可以提高性能。我们最好的8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型,并在长视频方面具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率为35.5% vs 29.6%),并在某些任务上超越了私有模型如Gemini 3 Pro(视频指针F1得分为38.4% vs 20.0%,视频跟踪J&F得分为56.2% vs 41.1%)。
Summary / 总结
Molmo2 is a new family of open-weight video-language models (VLMs) that is state-of-the-art among open-source models and adds point-driven grounding for single-image, multi-image, and video tasks. The work addresses the lack of open foundations for improving video-language models by contributing 7 new video datasets and 2 multi-image datasets, all collected without closed VLMs. The training recipe uses efficient packing and message-tree encoding, and bi-directional attention on vision tokens together with a novel token-weight strategy improves performance. The 8B model leads open weight-and-data models on short videos, counting, and captioning and is competitive on long videos. On video grounding it outperforms open-weight models such as Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing, 56.2 vs 41.1 J&F on video tracking).
Molmo2 是一个新的开放权重视频-语言模型家族,擅长视频理解和定位,在开源模型中达到最先进水平,并在部分任务上超越专有模型。研究通过提供 7 个新视频数据集和 2 个多图像数据集来解决缺乏开源基础的问题,包括一个视频指向数据集。该模型使用高效的训练方案,并结合新颖的 token 权重策略和视觉 token 上的双向注意力,在短视频、计数和字幕任务上优于其他开放权重和数据的模型。在视频定位方面,Molmo2 在视频计数上显著优于 Qwen3-VL 等现有开放权重模型,并在视频指向和视频跟踪上超越了 Gemini 3 Pro 等专有模型。
UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
Authors: Yongkang Li, Lijun Zhou, Sixu Yan, Bencheng Liao, Tianyi Yan, Kaixin Xiong, Long Chen, Hongwei Xie, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Haiyang Sun, Xinggang Wang
First: 2026-04-02T15:48:45+00:00 · Latest: 2026-04-02T15:48:45+00:00
Comments: code has been released at https://github.com/xiaomi-research/unidrivevla
Abstract
Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla
中文标题/摘要
标题:UniDriveVLA:统一理解、感知与行动规划的自动驾驶
视觉-语言-行动(VLA)模型最近在自动驾驶中崭露头角,有望利用丰富的世界知识提升驾驶系统的认知能力。然而,将这些模型适应驾驶任务目前面临一个关键困境:空间感知与语义推理之间的权衡。因此,现有的VLA系统被迫做出次优妥协:直接采用2D视觉-语言模型导致空间感知有限,而增强它们的空间表示则往往损害了VLM的原生推理能力。我们认为,这一困境主要源于共享模型参数中空间感知与语义推理的耦合优化。为克服这一问题,我们提出了基于混合变换器的UniDriveVLA统一驾驶视觉-语言-行动模型,通过专家解耦解决感知-推理冲突。具体而言,它包括三个专家,分别负责驾驶理解、场景感知和行动规划,通过掩蔽联合注意力协调。此外,我们结合稀疏感知范式和三阶段渐进式训练策略,以提高空间感知能力同时保持语义推理能力。广泛实验表明,UniDriveVLA在nuScenes的开环评估和Bench2Drive的闭环评估中均达到最先进的性能。此外,它在包括3D检测、在线建图、运动预测和驾驶导向的VQA等一系列感知、预测和理解任务中表现出色,突显了其作为统一模型在自动驾驶领域的广泛应用潜力。代码和模型已发布于https://github.com/xiaomi-research/unidrivevla
Summary / 总结
UniDriveVLA is a unified Vision-Language-Action model, built on a Mixture-of-Transformers, that resolves the conflict between spatial perception and semantic reasoning in autonomous driving through expert decoupling. It comprises three experts for driving understanding, scene perception, and action planning, coordinated by masked joint attention. A sparse perception paradigm and a three-stage progressive training strategy improve spatial perception while preserving semantic reasoning. Experiments show state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive, along with strong results on 3D detection, online mapping, motion forecasting, and driving-oriented VQA.
UniDriveVLA 是一种通过专家解耦解决空间感知和语义推理之间冲突的统一驾驶视觉-语言-行动模型。它包含三个专家分别负责理解、感知和行动规划,并通过掩蔽联合注意力进行协调。此外,它采用稀疏感知范式和三阶段训练策略来增强空间感知能力同时保持语义推理能力。实验结果表明,UniDriveVLA 在 nuScenes 的开环评估和 Bench2Drive 的闭环评估中均达到最先进的性能,并且在 3D 检测、在线建图、运动预测和驾驶导向的 VQA 等多种任务中表现出色。
CoRegOVCD: Consistency-Regularized Open-Vocabulary Change Detection
Authors: Weidong Tang, Hanbin Sun, Zihan Li, Yikai Wang, Feifan Zhang
First: 2026-04-02T15:28:29+00:00 · Latest: 2026-04-02T15:28:29+00:00
Abstract
Remote sensing change detection (CD) aims to identify where land-cover semantics change across time, but most existing methods still assume a fixed label space and therefore cannot answer arbitrary user-defined queries. Open-vocabulary change detection (OVCD) instead asks for the change mask of a queried concept. In the fully training-free setting, however, dense concept responses are difficult to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of many land-cover categories often produce noisy, fragmented, and semantically unreliable change evidence. We propose Consistency-Regularized Open-Vocabulary Change Detection (CoRegOVCD), a training-free dense inference framework that reformulates concept-specific change as calibrated posterior discrepancy. Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy, making semantic change evidence more comparable without explicit instance matching. Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) further suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus. Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently improves over the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.
中文标题/摘要
标题:CoRegOVCD: 一致性正则化开放词汇变化检测
遥感变化检测(CD)旨在识别不同时期土地覆盖语义的变化,但大多数现有方法仍然假设固定标签空间,因此无法回答任意用户定义的查询。开放词汇变化检测(OVCD)则要求提供查询概念的变化掩码。然而,在完全无训练设置中,密集的概念响应难以直接在不同日期之间进行比较:外观变化、弱跨概念竞争以及许多土地覆盖类别的空间连续性经常产生嘈杂、碎片化且语义不可靠的变化证据。我们提出了Consistency-Regularized Open-Vocabulary Change Detection(CoRegOVCD),这是一种完全无训练的密集推理框架,将概念特定的变化重新表述为校准后的后验差异。Competitive Posterior Calibration(CPC)和Semantic Posterior Delta(SPD)将原始概念响应转换为竞争意识的查询概念后验,并量化它们的跨时间差异,从而在无需显式实例匹配的情况下使语义变化证据更具可比性。Geometry-Token Consistency Gate(GeoGate)和Regional Consensus Discrepancy(RCD)进一步抑制不支持的响应,并通过几何感知结构验证和区域共识提高空间一致性。在四个涵盖建筑导向和多类别的基准测试中,CoRegOVCD在最强的先前完全无训练基线基础上,F1$_C$得分提高了2.24到4.98个百分点,并在SECOND数据集上达到了六类平均47.50%的F1$_C$得分。
Summary / 总结
CoRegOVCD is a training-free dense inference framework for open-vocabulary change detection that reformulates concept-specific change as calibrated posterior discrepancy. It uses Competitive Posterior Calibration and Semantic Posterior Delta to convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy. Geometry-Token Consistency Gate and Regional Consensus Discrepancy further refine the responses to suppress unsupported ones and improve spatial coherence. CoRegOVCD outperforms the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points across four benchmarks and achieves an average F1$_C$ of 47.50% on the six-class setting of SECOND.
CoRegOVCD 是一个无训练的密集推理框架,用于开放词汇变化检测,解决不同日期间密集概念响应难以直接比较的问题。它使用竞争后验校准和语义后验差异将原始概念响应转换为竞争感知的查询概念后验,并量化其跨时间的差异。几何标记一致性门控和区域共识差异进一步细化结果。CoRegOVCD 在四个基准测试中优于最强的无训练基线,F1$_C$ 分数提高了 2.24 到 4.98 点,并在 SECOND 基准测试中达到 47.50% 的平均 F1$_C$。
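The reason competition-aware posteriors are easier to compare across dates than raw concept responses can be shown with a small per-pixel example. The sketch below is a loose simplification, not the paper's CPC or SPD definitions: it assumes a plain softmax over concept responses as the calibration and an absolute difference as the discrepancy, and the function names are hypothetical.

```python
import math


def queried_posterior(responses, query):
    """Softmax probability of the queried concept among all competing concepts."""
    peak = max(responses.values())
    exps = {name: math.exp(value - peak) for name, value in responses.items()}
    return exps[query] / sum(exps.values())


def posterior_delta(before, after, query):
    """Absolute cross-temporal change in the queried concept's posterior."""
    return abs(queried_posterior(after, query) - queried_posterior(before, query))


# Appearance shift: every raw response rises by 2.0, but the ranking is unchanged.
appearance_only = posterior_delta(
    {"building": 1.0, "tree": 0.0}, {"building": 3.0, "tree": 2.0}, "building")

# Semantic change: the dominant concept flips from building to tree.
semantic_change = posterior_delta(
    {"building": 2.0, "tree": 0.0}, {"building": 0.0, "tree": 2.0}, "building")
```

A uniform shift in raw responses, as might come from lighting or seasonal variation, produces no discrepancy here, while a genuine flip in the dominant concept produces a large one.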
Be Tangential to Manifold: Discovering Riemannian Metric for Diffusion Models
Authors: Shinnosuke Saito, Takashi Matsubara
First: 2025-10-07T01:54:47+00:00 · Latest: 2026-04-02T14:50:54+00:00
Abstract
Diffusion models are powerful deep generative models, but unlike classical models, they lack an explicit low-dimensional latent space that parameterizes the data manifold. This absence makes it difficult to perform manifold-aware operations, such as geometrically faithful interpolation or conditional guidance that respects the learned manifold. We propose a training-free Riemannian metric on the noise space, derived from the Jacobian of the score function. The key insight is that the spectral structure of this Jacobian separates tangent and normal directions of the data manifold; our metric leverages this separation to encourage paths to stay tangential to the manifold rather than drift toward high-density regions. To validate that our metric faithfully captures the manifold geometry, we examine it from two complementary angles. First, geodesics under our metric yield perceptually more natural interpolations than existing methods on synthetic, image, and video frame datasets. Second, the tangent-normal decomposition induced by our metric prevents classifier-free guidance from deviating off the manifold, improving generation quality while preserving text-image alignment.
中文标题/摘要
标题:与流形相切:发现用于扩散模型的黎曼度量
扩散模型是强大的深度生成模型,但与经典模型不同,它们缺乏一个显式的低维潜在空间来参数化数据流形。这种缺失使得难以执行流形感知操作,如几何保真插值或尊重学习到的流形的条件引导。我们提出了一种在噪声空间上的无训练黎曼度量,该度量源自分数函数的雅可比矩阵。关键洞察是,该雅可比矩阵的谱结构将数据流形的切向和法向方向区分开来;我们的度量利用这种分离来鼓励路径保持在流形上而不是向高密度区域漂移。为了验证我们的度量是否准确捕捉了流形几何,我们从两个互补的角度进行了验证。首先,在我们的度量下,测地线在合成、图像和视频帧数据集上提供了感知上更自然的插值。其次,由我们的度量引起的切向-法向分解防止了无分类器引导偏离流形,从而提高了生成质量并保持了文本-图像对齐。
Summary / 总结
This paper addresses the difficulty of performing manifold-aware operations in diffusion models, which lack an explicit low-dimensional latent space, by proposing a training-free Riemannian metric on the noise space derived from the Jacobian of the score function. The spectral structure of this Jacobian separates tangent and normal directions of the data manifold, and the metric uses this separation to keep paths tangential to the manifold rather than drifting toward high-density regions. Geodesics under the metric yield perceptually more natural interpolations than existing methods, and the induced tangent-normal decomposition keeps classifier-free guidance from deviating off the manifold, improving generation quality while preserving text-image alignment.
本文提出了一种基于分数函数雅可比矩阵的无训练黎曼度量,以解决在扩散模型中执行流形感知操作的挑战。该度量鼓励路径与数据流形保持相切,而不是移向高密度区域。实验表明,在该度量下的测地线提供了更自然的插值,并且由该度量诱导的切向-法向分解防止了无分类器引导偏离流形,从而提高了生成质量并保持了文本-图像对齐。
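The tangent-normal decomposition mentioned above amounts, at its simplest, to an orthogonal projection. The following sketch shows only that projection step and assumes the normal directions are already available as an orthonormal list; in the paper they come from the spectral structure of the score Jacobian, which is not reproduced here. The function name `split_tangent_normal` is hypothetical.

```python
def split_tangent_normal(vector, normal_basis):
    """Split `vector` into tangent and normal parts.

    normal_basis: list of orthonormal vectors assumed to span the normal space.
    Returns (tangent, normal) with tangent + normal == vector component-wise.
    """
    normal = [0.0] * len(vector)
    for direction in normal_basis:
        coeff = sum(v * d for v, d in zip(vector, direction))
        normal = [n + coeff * d for n, d in zip(normal, direction)]
    tangent = [v - n for v, n in zip(vector, normal)]
    return tangent, normal


# Toy case: the manifold is the xy-plane, so the z-axis is the only normal direction.
tangent, normal = split_tangent_normal([1.0, 2.0, 3.0], [[0.0, 0.0, 1.0]])
```

Applying a guidance update only along `tangent` is one way to picture how a step can be kept from leaving the manifold.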
FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition
Authors: Taichi Endo, Guoqing Hao, Kazuhiko Sumi
First: 2026-04-02T14:16:06+00:00 · Latest: 2026-04-02T14:16:06+00:00
Comments: HuggingFace Space: https://huggingface.co/spaces/dominoer/FlowSlider
Abstract
Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can reduce reliability under distribution shifts in edits or domains. We propose FlowSlider, a training-free method for continuous editing in Rectified Flow that requires no post-training. FlowSlider decomposes FlowEdit's update into (i) a fidelity term, which acts as a source-conditioned stabilizer that preserves identity and structure, and (ii) a steering term that drives semantic transition toward the target edit. Geometric analysis and empirical measurements show that these terms are approximately orthogonal, enabling stable strength control by scaling only the steering term while keeping the fidelity term unchanged. As a result, FlowSlider provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks.
中文标题/摘要
标题:FlowSlider:无需训练的连续图像编辑方法
连续图像编辑旨在提供滑块式控制编辑强度的同时,保持源图像保真度并保持一致的编辑方向。现有的基于学习的滑块方法通常依赖于使用合成或代理监督训练的辅助模块。这引入了额外的训练开销,并将滑块行为与训练分布耦合,这在编辑或领域分布变化时可能会降低可靠性。我们提出了一种无需训练的连续编辑方法FlowSlider,该方法在修正流中不需要任何后训练。FlowSlider 将 FlowEdit 的更新分解为 (i) 保真度项,作为基于源条件的稳定器,保持身份和结构;(ii) 导航项,驱动语义过渡以达到目标编辑。几何分析和实证测量表明,这些项几乎正交,使得通过仅缩放导航项而不改变保真度项即可实现稳定的强度控制。因此,FlowSlider 在无需后训练的情况下提供了平滑且可靠的控制,从而提高了各种任务的连续编辑质量。
Summary / 总结
FlowSlider is a training-free method for continuous image editing in Rectified Flow that decomposes FlowEdit's update into a fidelity term and a steering term. The fidelity term acts as a source-conditioned stabilizer that preserves identity and structure, while the steering term drives the semantic transition toward the target edit. Because the two terms are approximately orthogonal, edit strength can be controlled by scaling only the steering term, which gives smooth and reliable control without post-training and improves continuous editing quality across diverse tasks.
FlowSlider 是一种无需训练的方法,用于连续图像编辑,将编辑过程分解为保真度项和导向项。保真度项稳定源图像,而导向项驱动语义过渡。这种正交分解允许在无需后训练的情况下平滑可靠地控制编辑强度,从而提高各种任务中连续编辑的质量。
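The strength control described in the FlowSlider entry can be sketched with plain vectors. This is a minimal illustration, assuming the two terms are already given as arrays; how FlowSlider actually computes them from FlowEdit's update is not stated in the abstract, and the names `slider_update` and `cosine` are hypothetical.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity; a value near 0 means the terms are close to orthogonal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def slider_update(fidelity, steering, strength):
    # Scale only the steering term; the fidelity term is left unchanged.
    return fidelity + strength * steering

# Toy terms, chosen to be exactly orthogonal for the illustration.
fidelity = np.array([1.0, 0.0, 0.0])
steering = np.array([0.0, 2.0, 0.0])

updates = [slider_update(fidelity, steering, s) for s in (0.0, 0.5, 1.0)]

# With orthogonal terms, the component of the update along the fidelity
# direction does not change as the slider moves.
fidelity_components = [float(np.dot(u, fidelity)) for u in updates]
```

When the terms are orthogonal, changing `strength` leaves the fidelity component of the update untouched, which is the intuition behind scaling one term only.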
Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection
Authors: Soo Won Seo, KyungChae Lee, Hyungchan Cho, Taein Son, Nam Ik Cho, Jun Won Choi
Venue: CVPR 2026
First: 2026-04-02T14:01:58+00:00 · Latest: 2026-04-02T14:01:58+00:00
Comments: Accepted to CVPR 2026. Code: https://github.com/nowuss/InCoM-Net
Abstract
Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
中文标题/摘要
标题:基于实例中心视觉-语言语境的人机物交互检测
人机物交互(HOI)检测旨在从单张图像中定位人与物体的配对并分类它们的交互,这需要强大的视觉理解能力和细腻的语境推理。最近的方法利用视觉-语言模型(VLMs)引入语义先验,显著提高了HOI检测性能。然而,现有方法往往未能充分利用场景中分散的多样化语境线索。为克服这些限制,我们提出了一种实例中心的语境挖掘网络(InCoM-Net)——一种新颖的框架,该框架有效地将从VLM提取的丰富语义知识与对象检测器生成的实例特定特征相结合。此设计通过建模不仅在每个检测实例内部的关系,还在实例之间及其周围场景语境中的关系,来实现更深入的交互推理。InCoM-Net 包含两个核心组件:实例中心的语境精炼(ICR),它分别从VLM特征中提取实例内、实例间和全局语境线索,以及渐进式语境聚合(ProCA),它迭代地将这些多语境特征与实例级检测器特征融合,以支持高级HOI推理。在HICO-DET和V-COCO基准上的广泛实验表明,InCoM-Net 达到了最先进的性能,超越了之前的HOI检测方法。代码可在 https://github.com/nowuss/InCoM-Net 获取。
Summary / 总结
The research aims to enhance Human-Object Interaction (HOI) detection by integrating rich semantic knowledge from Vision-Language Models (VLMs) with instance-specific features. The proposed Instance-centric Context Mining Network (InCoM-Net) extracts and refines intra-instance, inter-instance, and global contextual cues, then progressively aggregates them to support high-level HOI reasoning. Experiments on HICO-DET and V-COCO benchmarks demonstrate that InCoM-Net outperforms existing HOI detection methods, achieving state-of-the-art performance.
研究旨在通过将视觉语言模型(VLM)提取的丰富语义知识与实例特定特征相结合,来提升人类物体交互(HOI)检测。提出的实例中心上下文挖掘网络(InCoM-Net)提取并精炼了内部实例、跨实例和全局上下文线索,然后逐步融合这些多上下文特征以支持高级HOI推理。在HICO-DET和V-COCO基准上的实验表明,InCoM-Net在HOI检测方面超越了现有方法,达到了最先进的性能。
Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models
Authors: Issa Sugiura, Keito Sasagawa, Keisuke Nakao, Koki Maeda, Ziqi Yin, Zhishen Yang, Shuhei Kurita, Yusuke Oda, Ryoko Tokuhisa, Daisuke Kawahara, Naoaki Okazaki
First: 2026-04-02T13:48:43+00:00 · Latest: 2026-04-02T13:48:43+00:00
Comments: 18 pages, 7 figures
Abstract
Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.
中文标题/摘要
标题:Jagle:构建大规模日语多模态后训练数据集以支持视觉-语言模型
开发能够在多种任务中泛化的视觉-语言模型(VLMs)需要大规模且内容多样的训练数据集。在英语中,这类数据集通常通过聚合和整理大量现有的视觉问答(VQA)资源来构建。然而,这种方法并不容易扩展到其他语言,因为其他语言的VQA数据集在规模和领域覆盖方面都有限,这成为构建高质量的多语言和非英语VLMs的主要障碍。在本文中,我们介绍了迄今为止最大的日语多模态后训练数据集Jagle,包含约920万实例,涵盖了多种任务。我们没有依赖现有的VQA数据集,而是收集了异构源数据,包括图像、图像-文本对和PDF文档,并通过多种策略生成VQA对,如基于VLM的问答生成、翻译和文本渲染。实验表明,使用Jagle训练的22亿参数模型在日语任务上表现出色,在十个日语评估任务上的平均得分超过InternVL3.5-2B,与Qwen3-VL-2B-Instruct的差距在五分以内。此外,将Jagle与FineVision结合使用不会降低英语性能,反而比仅使用FineVision训练时的英语性能更高。为了促进可重复性和未来研究,我们发布了数据集、训练模型和代码。
Summary / 总结
Jagle is a large-scale Japanese multimodal post-training dataset comprising about 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, Jagle collects heterogeneous source data such as images, image-text pairs, and PDF documents, and generates VQA pairs through VLM-based QA generation, translation, and text rendering. A 2.2B model trained with Jagle surpasses InternVL3.5-2B in average score across ten Japanese evaluation tasks and comes within five points of Qwen3-VL-2B-Instruct. Additionally, combining Jagle with FineVision improves English performance compared to training with FineVision alone.
Jagle 是一个包含约 920 万实例的大型日语多模态后训练数据集,涵盖了各种任务。不同于现有的 VQA 数据集,Jagle 收集了多种来源的数据,如图像、图像-文本对和 PDF 文档,并通过多种策略生成 VQA 对。使用 Jagle 训练的 2.2B 模型在十个日语评估任务中表现优于 InternVL3.5-2B,并接近 Qwen3-VL-2B-Instruct。此外,将 Jagle 与 FineVision 结合使用会提高英语性能,优于仅使用 FineVision 的情况。
Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
Authors: Tao Jin, Phuong Minh Nguyen, Naoya Inoue
First: 2026-04-02T13:48:42+00:00 · Latest: 2026-04-02T13:48:42+00:00
Abstract
Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
中文标题/摘要
标题:Goose:用于无训练推测解码的各向异性推测树
推测解码通过起草多个候选令牌并在单次前向传递中验证它们来加速大型语言模型的推理。候选令牌组织成一棵树:更深的树每步接受更多令牌,但在固定验证预算下增加深度需要牺牲宽度(备用选项)。现有无训练方法从单一令牌源起草并塑造其树,而不区分不同来源的候选令牌质量。我们观察到,两种常见的无训练令牌源(从输入上下文中复制的n-gram匹配,以及来自先前前向传递的统计预测)在接受率上存在巨大差异(中位数差距约为6倍,在五个模型和五个基准上的范围为2到18倍)。我们证明,当存在这种质量差距时,最优树是各向异性的(不对称):可靠的令牌应形成一条深链,而不可靠的令牌则扩展为宽分支,突破平衡树的深度限制。我们通过GOOSE实现这一结构,这是一种无训练框架,构建自适应脊柱树,即一条由高接受率上下文匹配令牌组成的深链,每个节点都带有由低接受率替代选项构成的宽分支。我们证明,每步接受的令牌数量至少与单独使用任一来源一样多。在五个LLM(7B-33B)和五个基准上,GOOSE实现了1.9-4.3倍无损加速,在相同预算下比平衡树基线高出12-33%。
Summary / 总结
Goose is a training-free speculative decoding framework that exploits the acceptance-rate gap between two common token sources, n-gram matches copied from the input context and statistical predictions from prior forward passes, by constructing anisotropic speculation trees. Each tree is a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node, and the number of tokens accepted per step is provably at least as large as with either source alone. On five LLMs (7B-33B) and five benchmarks, Goose achieves 1.9-4.3x lossless speedup and outperforms balanced-tree baselines by 12-33% under the same verification budget.
Goose 是一种无训练的推测性解码框架,使用各向异性推测树来提高推理速度。它利用两种具有不同接受率的令牌源,并构建一个深链的高接受率上下文匹配令牌,以及每个节点上宽分支的低接受率替代选项。这种方法每步接受的令牌数量至少与单独使用任一来源一样多。在五个大型语言模型和五个基准上的实验显示,Goose 实现了1.9-4.3倍的无损加速,并在相同预算下比平衡树基线高出12-33%。
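The depth-versus-breadth trade-off in the Goose entry can be explored with a toy calculation. The model below is an assumption made for illustration, not the paper's analysis: every spine token is accepted independently with probability `p_spine`, each of `width` fallback branches at a node is accepted independently with probability `p_branch`, and accepting a fallback ends the step. Function and parameter names are hypothetical.

```python
def nodes_used(depth, width):
    # Each spine position holds one spine token plus `width` fallback branches.
    return depth * (1 + width)

def expected_accepted(depth, p_spine, p_branch, width):
    # Probability that at least one fallback branch is accepted at a node.
    fallback = 1.0 - (1.0 - p_branch) ** width
    expected = 0.0
    # Build the expectation from the deepest spine position back to the root:
    # accept the spine token and continue, or fall back to a branch and stop.
    for _ in range(depth):
        expected = p_spine * (1.0 + expected) + (1.0 - p_spine) * fallback
    return expected

# Two shapes that spend the same verification budget of 12 nodes.
chain_only = expected_accepted(depth=12, p_spine=0.6, p_branch=0.1, width=0)
spine_tree = expected_accepted(depth=4, p_spine=0.6, p_branch=0.1, width=2)
same_budget = nodes_used(12, 0) == nodes_used(4, 2)
```

Varying `p_spine`, `p_branch`, and the shape under a fixed `nodes_used` budget shows how the preferred tree changes with the quality gap; the probabilities here are placeholders, not measured acceptance rates.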
Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence
Authors: Dian Liu, Jie Feng, Di Li, Yuhui Zheng, Guanbin Li, Weisheng Dong, Guangming Shi
First: 2026-04-02T13:22:57+00:00 · Latest: 2026-04-02T13:22:57+00:00
Abstract
Synergistic spatial intelligence between UAVs and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated Unmanned Aerial Vehicle (UAV) videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS$^2$Bench, the first comprehensive benchmark designed to evaluate VLMs' wide-area, dynamic cross-view spatial intelligence. LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. Through an LMM-assisted pipeline and rigorous human annotation, we constructed 17.9k high-quality question-answer pairs comprising 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 representative VLMs reveal a substantial gap compared to human baselines, identifying accurate cross-view dynamic alignment as the critical bottleneck. To alleviate this, we design a Cross-View Alignment Adapter, demonstrating that explicit alignment significantly improves model performance. Furthermore, fine-tuning experiments underscore the potential of LinkS$^2$Bench in advancing VLM adaptation for complex spatial reasoning.
中文标题/摘要
标题:VLMs在天地之间迷失了吗?LinkS$^2$Bench无人机-卫星动态跨视角空间智能评估
无人机与卫星之间的协同空间智能对于应急响应和安全操作至关重要,因为它能够独特地结合宏观规模的全球覆盖与动态的实时局部感知。然而,视觉-语言模型(VLMs)掌握这种复杂互动的能力仍然鲜有探索。这一差距主要因为现有基准仅限于孤立的无人机视频或静态卫星图像,未能评估全面跨视角推理所必需的动态局部到全局的空间映射。为弥补这一差距,我们引入了LinkS$^2$Bench,这是首个旨在评估VLMs广泛区域、动态跨视角空间智能的综合基准。LinkS$^2$Bench将1,022分钟的动态无人机视频与覆盖超过200平方公里的高分辨率卫星图像相连。通过LMM辅助管道和严格的真人注释,我们构建了17,900个高质量的问题-答案对,涵盖四个维度的12个细粒度任务:感知、定位、关系和推理。对18个代表性VLMs的评估显示,与人类基准相比存在显著差距,准确的跨视角动态对齐是关键瓶颈。为缓解这一问题,我们设计了跨视角对齐适配器,表明显式对齐显著提高了模型性能。此外,微调实验强调了LinkS$^2$Bench在推进VLM适应复杂空间推理方面的潜力。
Summary / 总结
The paper addresses the gap in evaluating Vision-Language Models (VLMs) for the dynamic cross-view spatial intelligence between UAVs and satellites, which is crucial for emergency response and security operations. To bridge this gap, the authors introduce LinkS$^2$Bench, a comprehensive benchmark that links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery. The benchmark includes 17.9k high-quality question-answer pairs and reveals a significant gap between VLMs and human performance, with accurate cross-view dynamic alignment identified as the critical bottleneck. The authors also propose a Cross-View Alignment Adapter to improve model performance in this domain.
研究旨在评估视觉语言模型(VLMs)在应急响应和安全操作中掌握无人机和卫星之间复杂互动的能力。为解决现有基准的不足,作者引入了LinkS$^2$Bench,该基准将动态无人机视频与高分辨率卫星图像链接起来。对18个VLMs的评估显示,与人类表现相比存在显著差距,特别是在准确的跨视图动态对齐方面。研究还提出了一种跨视图对齐适配器,以提高模型在这一领域的性能。
Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation
Authors: Jie Feng, Fengze Li, Junpeng Zhang, Siyu Chen, Yuping Liang, Junying Chen, Ronghua Shang
First: 2026-04-02T13:15:05+00:00 · Latest: 2026-04-02T13:15:05+00:00
Abstract
Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its global-aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, so they fail to delineate boundaries effectively while risking the disruption of CLIP's semantic integrity. To address this limitation, we propose DR-Seg, a novel decouple-and-rectify framework. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state of the art.
中文标题/摘要
标题:解耦与校正:面向开放词汇遥感分割的语义保留结构增强
遥感(RS)领域的开放词汇语义分割需要语言对齐的识别和精细的空间界定。尽管CLIP提供了强大的语义泛化能力,但其全局对齐的视觉表示在捕捉结构细节方面存在固有困难。最近的方法通过引入RS预训练的DINO特征来弥补这一不足。然而,这些方法将CLIP表示视为一个统一的语义空间,无法定位需要结构增强的地方,无法有效界定边界,同时可能破坏CLIP的语义完整性。为解决这一局限,本文提出了一种新颖的解耦与校正框架DR-Seg。我们的方法受到一个关键观察的启发,即CLIP特征通道表现出功能异质性,而不是形成一个统一的语义空间。基于这一洞察,DR-Seg将CLIP特征分解为以语义为主导和以结构为主导的子空间,通过DINO实现有针对性的结构增强,而不破坏语言对齐的语义。随后,一个先验驱动的图校正模块在DINO的引导下注入高保真的结构先验,形成一个精炼分支,而一个基于不确定性自适应融合模块动态将该精炼分支与原始CLIP分支融合,以进行最终预测。在八个基准上的全面实验表明,DR-Seg建立了新的性能最佳水平。
Summary / 总结
The research aims to improve open-vocabulary semantic segmentation in remote sensing by addressing the limitations of CLIP's global-aligned visual representations in capturing structural details. DR-Seg proposes a decouple-and-rectify framework that separates CLIP features into semantics-dominated and structure-dominated subspaces, allowing for targeted structural enhancement by DINO without disrupting semantic integrity. The method includes a graph rectification module to inject structural priors and an adaptive fusion module to integrate the refined branch with the original CLIP branch. Experiments show that DR-Seg outperforms existing methods on eight benchmarks.
研究提出了一种去耦合和校正框架DR-Seg,以解决遥感领域的开放词汇语义分割问题。DR-Seg将CLIP特征分解为语义主导和结构主导子空间,通过DINO进行针对性的结构增强,同时保持语义一致性。此外,该方法还包含一个图校正模块和自适应融合模块,以细化和结合增强和原始特征。在八个基准测试中的实验表明,DR-Seg在性能上超过了现有方法,达到了新的最佳水平。
Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models
Authors: Osher Rafaeli, Tal Svoray, Ariel Nahlieli
First: 2026-04-02T13:13:17+00:00 · Latest: 2026-04-02T13:13:17+00:00
Abstract
Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches primarily rely on spatial interpolation or related techniques that assume spatial continuity and therefore fail when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, limiting their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights. Experiments demonstrate consistent improvements over interpolation-based methods, prior-based height rescaling approaches, and state-of-the-art monocular depth estimation (MDE) models. Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE, and further enables DSM updating and coupled RGB-DSM generation.
中文标题/摘要
标题:基于自监督ViT特征与单目基础模型的高程补全测试时自适应
准确的数字表面模型(DSMs)对于许多地理空间应用至关重要,包括城市监测、环境分析、基础设施管理和变化检测。然而,由于获取限制、重建伪影或建成环境的变化,大规模DSMs经常包含不完整或过时的区域。传统的高程完成方法主要依赖于空间插值或假设空间连续性,因此在物体缺失时会失效。最近的基于学习的方法可以提高重建质量,但通常需要在特定传感器数据集上进行监督训练,限制了它们在不同领域和传感条件下的泛化能力。我们提出了一种名为Prior2DSM的无需训练框架,该框架完全在测试时运行,通过利用基础模型来完成米制DSM。与之前需要特定任务训练的高程完成方法不同,所提出的方法结合了来自DINOv3的自我监督Vision Transformer(ViT)特征和单目深度基础模型,通过语义特征空间对应传播米制信息。测试时自适应(TTA)使用参数高效的低秩适应(LoRA)与轻量级多层感知机(MLP)一起进行,预测空间变化的尺度和偏移参数,将相对深度估计转换为米制高度。实验表明,Prior2DSM在减少重建误差的同时保持结构保真度,与线性拟合MDE相比,RMSE最多可减少46%,并进一步实现了DSM更新和耦合RGB-DSM生成。
Summary / 总结
The paper proposes Prior2DSM, a training-free framework for metric DSM completion that leverages self-supervised ViT features and monocular depth foundation models at test time. It uses parameter-efficient low-rank adaptation (LoRA) and a lightweight MLP to adapt relative depth estimates into metric heights, improving reconstruction accuracy and structural fidelity. Experiments show consistent improvements over interpolation-based methods and state-of-the-art monocular depth estimation models, with up to a 46% reduction in RMSE compared to linear fitting of MDE.
研究提出了Prior2DSM,一种无需训练的框架,用于解决DSM中区域不完整或过时的问题。该方法结合来自DINOv3的自监督Vision Transformer(ViT)特征与单目深度基础模型,通过语义特征空间对应来传播度量信息。测试时适应使用参数高效的低秩适应(LoRA)和轻量级多层感知机(MLP),将相对深度估计转换为度量高度。实验表明,Prior2DSM在减少重建误差方面优于基于插值的方法和基于先验的高度重缩放方法,与对单目深度估计(MDE)结果做线性拟合相比,RMSE最多降低46%。
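The "linear fitting of MDE" baseline that the Prior2DSM entry compares against can be sketched as a single global scale and shift fitted by least squares on the pixels where a height prior exists. This is a sketch of that baseline only, not of Prior2DSM, which predicts spatially varying parameters; the array names and the synthetic data are assumptions.

```python
import numpy as np

def fit_scale_shift(rel_depth, metric_height, valid):
    # Solve min over (scale, shift) of ||scale * rel + shift - metric||^2
    # using only pixels where the height prior is valid.
    x = rel_depth[valid].ravel()
    y = metric_height[valid].ravel()
    design = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(scale), float(shift)

def complete_heights(rel_depth, metric_height, valid):
    # Keep known heights; fill the holes with the rescaled relative depth.
    scale, shift = fit_scale_shift(rel_depth, metric_height, valid)
    return np.where(valid, metric_height, scale * rel_depth + shift)

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Synthetic example: the true heights are an exact affine map of relative depth.
rng = np.random.default_rng(0)
rel = rng.random((8, 8))
truth = 2.5 * rel + 10.0
valid = np.zeros((8, 8), dtype=bool)
valid[:, :4] = True  # the right half of the tile is "missing"

completed = complete_heights(rel, truth, valid)
```

A single global fit works here only because the synthetic relation is exactly affine; real tiles generally need the spatially varying correction that the abstract describes.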
Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
Authors: Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou, Jiwen Lu
First: 2026-04-02T12:51:07+00:00 · Latest: 2026-04-02T12:51:07+00:00
Abstract
Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
中文标题/摘要
标题:注意力静止则保持静止:打破视觉惯性以减轻认知幻觉
如同静止的物体保持静止,我们发现多模态大型语言模型(MLLMs)中的视觉注意力表现出明显的惯性,在早期解码步骤中一旦稳定下来就保持相对静止,无法支持认知推理所需的组合理解。现有的幻觉缓解方法主要针对与物体存在或属性相关的感知幻觉,但对于需要物体间关系推理的认知幻觉仍然不足。通过词元级别的注意力分析,我们发现这种视觉惯性是关键因素:对语义关键区域的注意力保持持续聚焦,无法动态支持关系推理。因此,我们提出了一种无需训练的惯性感知视觉激发(IVE)方法,通过将认知推理建模为视觉注意力的动态响应来打破这种惯性模式。具体而言,IVE 选择相对于历史注意力趋势动态出现的视觉词元,同时区分表现出惯性行为的词元。为了进一步促进组合推理,IVE 引入了一种惯性感知惩罚,以防止过度集中并限制注意力在局部区域的持久性。广泛的实验表明,IVE 在各种基础 MLLMs 和多个幻觉基准测试中都有效,特别是在处理认知幻觉方面。
Summary / 总结
The study addresses the issue of visual inertia in multimodal large language models (MLLMs), where attention remains static and fails to support compositional understanding needed for cognitive inference. It introduces Inertia-aware Visual Excitation (IVE), which models cognitive inference as dynamic responsiveness of visual attention, selecting tokens that are dynamically emerging and discouraging over-concentration. Experiments demonstrate IVE's effectiveness in mitigating cognitive hallucinations across different MLLMs and benchmarks.
研究旨在通过缓解视觉惯性来解决多模态大型语言模型(MLLMs)中的认知幻觉问题,视觉惯性导致注意力保持静态,无法支持关系推理。方法Inertia-aware Visual Excitation (IVE) 将认知推理建模为视觉注意力的动态响应,选择动态出现的令牌并避免过度集中。实验表明IVE在不同MLLM和幻觉基准测试中有效,特别是在处理认知幻觉方面。
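The token selection described in the IVE entry, picking visual tokens that emerge relative to the historical attention trend and flagging those that stay put, can be sketched as follows. The exponential moving average, the margin, and the uniform-attention threshold are all assumptions made for the illustration; the abstract does not specify how IVE measures the trend.

```python
import numpy as np

def ema_trend(attn_history, decay=0.9):
    # Exponential moving average of attention over earlier decoding steps.
    trend = attn_history[0].copy()
    for step in attn_history[1:]:
        trend = decay * trend + (1.0 - decay) * step
    return trend

def split_tokens(attn_history, current, margin=0.05):
    trend = ema_trend(attn_history)
    delta = current - trend
    # Emerging: attention rose clearly above the historical trend.
    emerging = np.flatnonzero(delta > margin)
    # Inertial: attention barely moved and was already above uniform.
    uniform = 1.0 / len(current)
    inertial = np.flatnonzero((np.abs(delta) <= margin) & (trend > uniform))
    return emerging, inertial

# Four visual tokens; attention sat on token 0 for three steps,
# then token 3 starts to receive attention.
history = np.array([[0.7, 0.2, 0.1, 0.0]] * 3)
current = np.array([0.68, 0.02, 0.1, 0.2])
emerging, inertial = split_tokens(history, current)
```

In this toy, token 3 is reported as emerging and token 0 as inertial; an inertia-aware penalty would then act on the inertial set.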
Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models
Authors: Antoine Saporta, Baptiste Callard, Corentin Dancette, Julien Khlaut, Charles Corbière, Leo Butsanets, Amaury Prat, Pierre Manceron
First: 2026-04-02T12:49:38+00:00 · Latest: 2026-04-02T12:49:38+00:00
Abstract
The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively with vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.
Summary / 总结
This study improves radiology Foundation Models (FMs) by refining the pre-training strategy and representation quality of the Curia framework. The resulting Curia-2 scales up to billion-parameter Vision Transformers, a first for multi-modal CT and MRI FMs. The work also extends and restructures CuriaBench into two tracks: a 2D track for slice-based vision models and a 3D track for volumetric benchmarking. Results show that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively with vision-language models on clinically complex tasks such as finding detection. The weights will be made publicly available to foster further research.
该研究旨在通过改进预训练策略和表示质量来提高放射学基础模型(FMs)的效率和准确性。Curia-2 是 Curia 框架的改进版本,能够将架构扩展到十亿参数的 Vision Transformers,更好地处理复杂的放射学数据。研究引入了 CuriaBench 的两个新赛道:一个针对切片模型的 2D 轨道和一个针对体素基准测试的 3D 轨道。实验结果表明,Curia-2 在视觉任务中优于现有 FMs,并在检测等临床复杂任务中表现出色。模型的权重将公开发布,以促进进一步研究。
SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting
Authors: Di Wu, Liu Liu, Xueyu Yuan, Wenxiao Chen, Lijun Yue, Liuzhu Chen, Yiming Tang, Meng Wang
First: 2025-11-21T09:49:53+00:00 · Latest: 2026-04-02T12:37:26+00:00
Comments: 10 pages, 7 figures
Abstract
Articulated objects are ubiquitous in daily environments, and their 3D reconstruction holds great significance across various fields. However, existing articulated object reconstruction methods typically require costly inputs such as multi-stage and multi-view observations. To address the limitations, we propose a category-agnostic articulated object reconstruction framework via planar Gaussian Splatting, which only uses sparse-view RGB images from a single state. Specifically, we first introduce a Gaussian information field to perceive the optimal sparse viewpoints from candidate camera poses. To ensure precise geometric fidelity, we constrain traditional 3D Gaussians into planar primitives, facilitating accurate normal and depth estimation. The planar Gaussians are then optimized in a coarse-to-fine manner, regularized by depth smoothness and few-shot diffusion priors. Furthermore, we leverage a Vision-Language Model (VLM) via visual prompting to achieve open-vocabulary part segmentation and joint parameter estimation. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach significantly outperforms existing baselines, achieving superior part-level surface reconstruction fidelity. Code and data are provided in the supplementary material.
中文标题/摘要
标题:SPAGS:基于平面高斯泼溅的单状态稀疏视角铰接物体重建
铰接物体在日常环境中无处不在,它们的3D重建在多个领域具有重要意义。然而,现有的铰接物体重建方法通常需要多阶段和多视角的观测,成本较高。为了解决这些限制,我们提出了一种基于平面高斯泼溅的类别无关铰接物体重建框架,仅使用单一状态下的稀疏视角RGB图像。具体来说,我们首先引入高斯信息场,从候选相机姿态中感知最优稀疏视角。为了确保精确的几何保真度,我们将传统的3D高斯约束为平面基元,便于准确的法线和深度估计。然后,平面高斯以由粗到细的方式进行优化,并通过深度平滑和少样本扩散先验进行正则化。此外,我们通过视觉提示利用视觉语言模型(VLM)实现开放词汇的部件分割和关节参数估计。在合成和真实世界数据集上的广泛实验表明,我们的方法显著优于现有基线,实现了更优的部件级表面重建保真度。代码和数据在补充材料中提供。
Summary / 总结
The research aims to address the high cost of existing articulated object reconstruction methods, which require multi-stage and multi-view observations. The proposed SPAGS framework is category-agnostic and reconstructs articulated objects from sparse-view RGB images of a single state. It employs a Gaussian information field to select optimal sparse viewpoints from candidate camera poses and constrains 3D Gaussians into planar primitives to estimate normals and depths accurately. The planar Gaussians are optimized in a coarse-to-fine manner, regularized by depth smoothness and few-shot diffusion priors, and a Vision-Language Model is used via visual prompting for open-vocabulary part segmentation and joint parameter estimation. Experiments show that SPAGS outperforms existing baselines in part-level surface reconstruction fidelity on both synthetic and real-world datasets.
研究旨在解决现有铰接物体重建方法需要多阶段、多视角观测的局限性。提出的 SPAGS 框架利用单一状态下的稀疏视角 RGB 图像以高几何精度重建铰接物体。该方法通过高斯信息场选择最优视角,并使用平面高斯泼溅估计法线和深度。该方法以由粗到细的方式进行优化,并通过深度平滑和少样本扩散先验进行正则化。此外,还通过视觉提示利用视觉语言模型进行部件分割和关节参数估计。实验表明,SPAGS 在合成和真实世界数据集上的部件级表面重建保真度优于现有方法。
Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts
Authors: Yifan Gao, Tao Zhou, Yi Zhou, Ke Zou, Yizhe Zhang, Huazhu Fu
First: 2026-04-02T11:31:30+00:00 · Latest: 2026-04-02T11:31:30+00:00
Comments: 10 pages, 6 figures
Abstract
Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.
中文标题/摘要
标题:通过知识引导的空间提示增强医学视觉定位
医学视觉定位(MVG)旨在从自由文本放射学报告中识别出诊断相关的短语,并在医学图像中定位其对应的区域,提供可解释的视觉证据以支持临床决策。尽管最近的视觉-语言模型(VLMs)展示了有希望的多模态推理能力,但它们的空间定位精度仍然不足,主要是因为在仅依赖潜在嵌入时缺乏明确的定位先验。在本文中,我们从注意力机制的角度分析了这一局限性,并提出了一种名为KnowMVG的知识先验和全局-局部注意力增强框架,以在VLMs中增强MVG的空间意识。具体而言,我们提出了一种知识增强的提示策略,将与短语相关的医学知识编码为紧凑的嵌入,同时结合全局-局部注意力机制,共同利用粗略的全局信息和精细的局部线索来引导精确的区域定位。此设计在不引入额外文本推理开销的情况下,将高层次的语义理解和精细的视觉感知相结合。在四个MVG基准上的广泛实验表明,我们的KnowMVG在AP50和mIoU方面均优于现有方法,分别提高了3.0%和2.6%。进一步的定性和消融研究还验证了每个组件的有效性。
Summary / 总结
This work addresses the limitation of insufficient spatial precision in Medical Visual Grounding (MVG) by proposing KnowMVG, a framework that enhances spatial awareness in Vision-Language Models (VLMs). KnowMVG incorporates a knowledge-enhanced prompting strategy and global-local attention to guide precise region localization. Experiments on four MVG benchmarks show that KnowMVG outperforms existing methods, achieving 3.0% gain in AP50 and 2.6% in mIoU.
该研究通过提出KnowMVG框架来解决医学视觉定位(MVG)中空间精度不足的问题,该框架增强了视觉语言模型(VLMs)的空间意识。KnowMVG使用知识增强的提示策略将医学知识编码到嵌入中,并使用全局-局部注意力机制来引导精确的区域定位。实验表明,KnowMVG在四个MVG基准上的表现优于现有方法,分别在AP50和mIoU上提高了3.0%和2.6%。
Robot Collapse: Supply Chain Backdoor Attacks Against VLM-based Robotic Manipulation
Authors: Xianlong Wang, Hewen Pan, Hangtao Zhang, Minghui Li, Shengshan Hu, Ziqi Zhou, Lulu Xue, Peijin Guo, Aishan Liu, Leo Yu Zhang, Xiaohua Jia
First: 2024-11-18T16:09:26+00:00 · Latest: 2026-04-02T10:50:53+00:00
Abstract
Robotic manipulation policies are increasingly empowered by large language models (LLMs) and vision-language models (VLMs), leveraging their understanding and perception capabilities. Recently, inference-time attacks against robotic manipulation have been extensively studied, yet backdoor attacks targeting model supply chain security in robotic policies remain largely unexplored. To fill this gap, we propose TrojanRobot, a backdoor injection framework for model supply chain attack scenarios, which embeds a malicious module into modular robotic policies via backdoor relationships to manipulate the LLM-to-VLM pathway and compromise the system. Our vanilla design instantiates this module as a backdoor-finetuned VLM. To further enhance attack performance, we propose a prime scheme by introducing the concept of LVLM-as-a-backdoor, which leverages in-context instruction learning (ICIL) to steer large vision-language model (LVLM) behavior through backdoored system prompts. Moreover, we develop three types of prime attacks, permutation, stagnation, and intentional, achieving flexible backdoor attack effects. Extensive physical-world and simulator experiments on 18 real-world manipulation tasks and 4 VLMs verify the superiority of the proposed TrojanRobot.
中文标题/摘要
标题:机器人崩溃:针对基于VLM的机器人操作的供应链后门攻击
机器人的操作策略越来越多地借助于大型语言模型(LLMs)和视觉语言模型(VLMs),利用它们的理解和感知能力。最近,针对机器人操作的推理时攻击得到了广泛研究,但针对机器人策略模型供应链安全的后门攻击却鲜有探索。为填补这一空白,我们提出了TrojanRobot,一种针对模型供应链攻击场景的后门注入框架,通过后门关系将恶意模块嵌入模块化机器人策略中,操控LLM到VLM的路径并破坏系统。我们的基础设计将此模块实例化为后门微调的VLM。为进一步增强攻击性能,我们提出了一种进阶(prime)方案,引入了LVLM-as-a-backdoor的概念,利用上下文指令学习(ICIL),通过带后门的系统提示引导大型视觉语言模型(LVLM)的行为。此外,我们开发了三种类型的进阶攻击,即排列(permutation)、停滞(stagnation)和故意(intentional),实现了灵活的后门攻击效果。在18个真实世界的操作任务和4个VLM上的物理世界和模拟器实验验证了所提TrojanRobot的优越性。
Summary / 总结
This paper addresses the security vulnerability in robotic manipulation policies that rely on large language models (LLMs) and vision-language models (VLMs). It introduces TrojanRobot, a framework for backdoor attacks on the model supply chain, embedding a malicious module into robotic policies to manipulate the LLM-to-VLM pathway. The paper proposes a prime scheme using in-context instruction learning (ICIL) to steer LVLM behavior through backdoored system prompts, and develops three types of prime attacks. Experimental results on 18 real-world manipulation tasks and 4 VLMs demonstrate the effectiveness of the proposed method.
本文提出了一种名为TrojanRobot的后门注入框架,以揭示机器人操作策略在模型供应链上的安全漏洞。该框架通过后门关系将恶意模块嵌入到机器人策略中,操控LLM到VLM的路径。基础设计使用后门微调的VLM,并进一步引入了一种名为LVLM-as-a-backdoor的进阶(prime)方案,利用上下文指令学习来引导LVLM的行为。开发了三种类型的进阶攻击,并在18个真实世界的操作任务和4个VLM上的实验验证了该方法的有效性。
One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Authors: Ezzeldin Shereen, Dan Ristea, Shae McFadden, Burak Hasircioglu, Vasilios Mavroudis, Chris Hicks
Venue: Transactions on Machine Learning Research (TMLR), 2026
First: 2025-04-02T21:08:33+00:00 · Latest: 2026-04-02T10:17:08+00:00
Comments: Published in Transactions on Machine Learning Research (03/2026)
Abstract
Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show that VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, while remaining robust to black-box attacks in the universal setting.
中文标题/摘要
标题:仅需一张图片:通过单张图像对视觉文档增强生成进行投毒攻击
检索增强生成(RAG)通过使用事实知识库(KB)来抑制大型语言模型(LLMs)中的幻觉。尽管PDF文档是知识的重要来源,但基于文本的RAG管道无法有效捕捉其丰富的多模态信息。相比之下,视觉文档RAG(VD-RAG)使用文档页面的截图作为KB,已被证明可达到最先进的效果。然而,通过引入图像模态,VD-RAG为对手提供了新的攻击向量,通过向KB注入恶意文档来破坏系统。在本文中,我们展示了VD-RAG在检索和生成方面对投毒攻击的脆弱性。我们定义了两种攻击目标,并证明只需向KB注入一张对抗性图像即可实现这两种目标。首先,我们介绍了一种针对一个或一组查询的定向攻击,其目标是传播有针对性的虚假信息。其次,我们提出了一种通用攻击,对于任何潜在的用户查询,都会影响响应以导致VD-RAG系统的拒绝服务。我们在白盒和黑盒假设下研究了这两种攻击目标,采用多目标梯度优化方法以及提示最先进的生成模型。使用两个视觉文档数据集、一组多样化的最先进的检索器(嵌入模型)和生成器(视觉语言模型),我们展示了VD-RAG在定向和通用设置下都容易受到投毒攻击的影响,但在通用设置下对黑盒攻击具有鲁棒性。
Summary / 总结
This paper investigates the vulnerability of visual document retrieval-augmented generation (VD-RAG) systems to poisoning attacks. The study demonstrates that a single adversarial image can be used to either spread targeted disinformation or cause a denial-of-service for any query. The research employs a multi-objective gradient-based optimization approach and state-of-the-art generative models to show that VD-RAG is susceptible to both targeted and universal poisoning attacks, though it remains robust to black-box attacks in the universal setting.
该研究探讨了视觉文档检索增强生成(VD-RAG)系统对投毒攻击的脆弱性。研究证明,只需一张恶意图像,就可以传播针对性的虚假信息或导致系统拒绝服务。研究采用多目标梯度优化方法,表明VD-RAG在针对性和通用攻击中都容易受到攻击,但在通用攻击中对黑盒攻击具有更强的鲁棒性。
Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance
Authors: Jason Qiu, Zachary Meurer, Xavier Thomas, Deepti Ghadiyaram
First: 2026-04-02T10:02:49+00:00 · Latest: 2026-04-02T10:02:49+00:00
Abstract
This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
中文标题/摘要
标题:语义丰富性或几何推理?VLM视觉不变性的脆弱性
本研究探讨了最先进的视觉-语言模型(VLMs)在基本几何变换下的根本脆弱性。尽管现代VLMs在识别处于标准方向的对象和描述复杂场景等语义任务上表现出色,但在更基本的层面上,它们表现出系统性的失败:缺乏可靠的确定物体身份所需的稳健的空间不变性和协变性。我们通过在包括符号草图、自然照片和抽象艺术在内的多种视觉领域进行系统评估,展示了这一局限性。随着语义内容的稀疏,性能急剧下降,这种行为在不同架构、模型容量和提示策略中均被观察到。总体而言,我们的结果揭示了当前VLMs在语义理解和空间推理之间的系统性差距,突显了未来多模态系统中需要更强的几何基础的重要性。
Summary / 总结
This work examines the fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations, showing that while VLMs perform well on semantic tasks, they struggle with fundamental spatial invariance and equivariance, particularly under simple rotations and scaling. The study evaluates VLMs across various visual domains and finds that performance decreases significantly when semantic content is sparse, indicating a gap between semantic understanding and spatial reasoning in current VLMs.
这项研究考察了最先进的视觉-语言模型(VLMs)在基本几何变换下的脆弱性,发现虽然VLMs在语义任务上表现良好,但在基本的空间不变性和协变性方面却表现出困难,尤其是在简单的旋转、缩放和身份变换下。研究在多种视觉领域评估了VLMs,并发现当语义内容稀少时,性能显著下降。这表明当前VLMs在语义理解和空间推理之间存在差距,强调了未来多模态系统中需要更强的几何基础的重要性。
Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
Authors: Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin
First: 2026-04-02T09:53:20+00:00 · Latest: 2026-04-02T09:53:20+00:00
Abstract
While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate \textit{Token Visual Dependency}, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), which is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published on https://github.com/Yzk1114/PGPO.
中文标题/摘要
标题:并非所有标记物都平等:基于感知的大型视觉-语言模型策略优化
虽然可验证奖励强化学习(RLVR)在大型视觉-语言模型(LVLMs)中推进了推理能力,但现有的框架存在根本性的方法论缺陷:通过向所有生成的标记物分配相同的优势,这些方法会稀释对于优化关键的视觉导向多模态推理步骤至关重要的学习信号。为弥补这一差距,我们提出了标记物视觉依赖性,通过计算视觉条件下的预测分布与仅基于文本的预测分布之间的Kullback-Leibler(KL)散度来量化视觉输入的因果信息增益。揭示出这种依赖性高度稀疏且语义关键,我们引入了基于感知的策略优化(PGPO),这是一种新颖的细粒度的信用分配框架,能够动态地在标记物级别重塑优势。通过一个阈值门控、质量守恒的机制,PGPO积极放大了依赖视觉的标记物的学习信号,同时抑制了语言先验的梯度噪声。基于Qwen2.5-VL系列在七个具有挑战性的多模态推理基准上的广泛实验表明,PGPO平均提升了模型18.7%。理论和实证分析均证实,PGPO有效减少了梯度方差,防止了训练崩溃,并作为强大的正则化器促进了稳健的、基于感知的多模态推理。代码将在https://github.com/Yzk1114/PGPO上发布。
Summary / 总结
This paper addresses the dilution of learning signals in RLVR training of Large Vision-Language Models (LVLMs), where identical advantages are assigned to all generated tokens. It formulates Token Visual Dependency, which quantifies the causal information gain of visual inputs via the KL divergence between visual-conditioned and text-only predictive distributions, and introduces Perception-Grounded Policy Optimization (PGPO), which dynamically reshapes advantages at the token level to amplify learning signals for visually-dependent tokens. Experiments on the Qwen2.5-VL series show that PGPO improves performance by 18.7% on average across seven multimodal reasoning benchmarks, and the accompanying analyses indicate that it reduces gradient variance and prevents training collapse.
本文针对大型视觉-语言模型(LVLM)中学习信号被稀释的问题,提出了Token Visual Dependency和感知导向的策略优化(PGPO)。PGPO量化了视觉输入的因果信息增益,并在token级别动态重塑优势,放大了视觉依赖token的学习信号。实验表明,PGPO在七个跨模态推理基准上的平均性能提高了18.7%,证实了其在减少梯度方差和防止训练崩溃方面的有效性。
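The token-level idea in the PGPO abstract can be illustrated with a minimal sketch. It assumes that per-token predictive distributions with and without the image are already available, and it uses a simple gate (`threshold`) and boost factor (`alpha`) that are hypothetical choices, not the paper's actual mechanism. The function names are likewise illustrative.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def token_visual_dependency(visual_dists, text_dists):
    """One KL value per generated token: visual-conditioned vs. text-only prediction."""
    return [kl_divergence(p, q) for p, q in zip(visual_dists, text_dists)]

def reshape_advantages(advantage, deps, threshold=0.1, alpha=1.0):
    """Threshold-gated, mass-conserving reweighting (assumed form, for illustration).

    Tokens whose dependency reaches `threshold` get a larger raw weight; the
    weights are then rescaled so they sum to the number of tokens, which keeps
    the total advantage equal to the uniform assignment.
    """
    raw = [1.0 + alpha * d if d >= threshold else 1.0 for d in deps]
    scale = len(raw) / sum(raw)
    return [advantage * w * scale for w in raw]

# Toy example: three tokens over a two-word vocabulary; only the second
# token's prediction changes when the image is visible.
visual = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
text_only = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
deps = token_visual_dependency(visual, text_only)
per_token = reshape_advantages(1.0, deps)
```

In this toy case the second token receives more than the uniform advantage and the other two receive less, while the sum over tokens is unchanged.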
Efficient Reasoning with Balanced Thinking
Authors: Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian
Venue: ICLR 2026
First: 2026-03-12T18:48:07+00:00 · Latest: 2026-04-02T09:30:13+00:00
Comments: Accepted by ICLR 2026
Abstract
Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .
中文标题/摘要
标题:平衡思考实现高效推理
大型推理模型(LRMs)展示了出色的推理能力,但往往存在过度推理的问题,即在简单问题上浪费冗余计算步骤,或者存在欠推理的问题,即尽管具有内在能力,但在探索足够的推理路径方面却不够充分。这些问题导致了效率低下和潜在的不准确性,限制了其在资源受限环境中的实际部署。现有减少过度推理的方法,如抑制反思关键词或调整推理长度,可能会无意中导致欠推理,从而损害准确性。因此,我们提出了ReBalance,这是一种无需训练的框架,实现了平衡思考下的高效推理。ReBalance 利用置信度作为推理动态的连续指标,通过高置信度波动识别过度推理,通过一致的过度自信识别欠推理。通过将小型数据集中的隐藏状态聚合为推理模式原型,我们计算出一个引导向量来引导LRMs的推理轨迹。动态控制函数根据实时置信度调整该向量的强度和方向,在过度推理时修剪冗余,在欠推理时促进探索。在四个从0.5B到32B的模型以及九个涉及数学推理、通用问答和编程任务的基准测试中进行的广泛实验表明,ReBalance 有效减少了输出冗余,提高了准确性,提供了一种通用、无需训练且即插即用的策略,用于高效和稳健的LRM部署。项目页面和代码可在https://rebalance-ai.github.io 获取。
Summary / 总结
The paper addresses the inefficiencies of Large Reasoning Models (LRMs) by proposing ReBalance, a training-free framework that balances overthinking and underthinking. ReBalance uses confidence as an indicator to steer LRMs, reducing redundancy and improving accuracy. Experiments on various models and benchmarks show that ReBalance effectively enhances the efficiency and robustness of LRMs without requiring additional training.
论文针对大型推理模型(LRMs)因过度推理或不足推理而导致的效率问题,提出了ReBalance框架以实现平衡思考。ReBalance利用信心作为指标来识别过度推理和不足推理,并通过计算基于实时信心动态控制的方向引导矢量来引导LRMs的推理轨迹。实验表明,ReBalance可以减少输出冗余并提高准确性,适用于各种模型和基准测试,提供了一种通用且即插即用的策略,用于高效和稳健的LRM部署。
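The confidence-driven steering described in the ReBalance abstract can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the steering vector is taken as the difference between the overthinking and underthinking prototypes, and the thresholds and the piecewise control function are hypothetical, not the authors' formulation.

```python
from statistics import mean, pvariance

def prototype(hidden_states):
    """Average a list of hidden-state vectors into one prototype vector."""
    dim = len(hidden_states[0])
    return [mean(h[i] for h in hidden_states) for i in range(dim)]

def steering_vector(over_states, under_states):
    """Assumed direction: from the underthinking prototype toward the overthinking one."""
    over, under = prototype(over_states), prototype(under_states)
    return [o - u for o, u in zip(over, under)]

def control_coefficient(confidences, var_threshold=0.02, conf_threshold=0.9, max_strength=1.0):
    """Map recent step confidences to a signed steering strength (hypothetical rule).

    High variance is read as overthinking, so the coefficient is negative
    (steer away from the overthinking prototype). Consistently high confidence
    is read as underthinking, so the coefficient is positive.
    """
    var = pvariance(confidences)
    avg = mean(confidences)
    if var > var_threshold:
        return -max_strength * min(1.0, var / (2 * var_threshold))
    if avg > conf_threshold:
        return max_strength * (avg - conf_threshold) / (1.0 - conf_threshold)
    return 0.0

def steer(hidden, vector, coeff):
    """Add the scaled steering vector to a hidden state."""
    return [h + coeff * v for h, v in zip(hidden, vector)]

vec = steering_vector([[1.0, 0.0]], [[0.0, 1.0]])
steered = steer([0.0, 0.0], vec, control_coefficient([0.2, 0.9, 0.3, 0.95]))
```

The sign convention is the main design choice: a single vector serves both failure modes, and only the coefficient changes with the observed confidence.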
Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning
Authors: Seyed Amir Kasaei, Arash Marioriyad, Mahbod Khaleti, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Venue: ICLR 2026 Workshop (HCAIR)
First: 2026-04-02T08:33:13+00:00 · Latest: 2026-04-02T08:33:13+00:00
Comments: Accepted at ICLR 2026 Workshop: From Human Cognition to AI Reasoning (HCAIR)
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.
中文标题/摘要
标题:显而易见中的隐藏含义:RebusBench 用于评估认知视觉推理能力
大型视觉-语言模型(LVLMs)在显性的视觉识别方面取得了显著的成就,能够有效地描述图像中直接可见的内容。然而,当视觉输入仅作为线索而非答案时,一个关键的认知差距出现了。我们发现,当前的模型在解决需要复杂多步推理的问题时存在困难,这些问题中的信息并未明确呈现。成功解决谜语谜题需要一种独特的认知工作流程:模型必须提取视觉和文本属性,检索语言先验知识(如成语),并进行抽象映射,将这些元素综合成一种存在于像素空间之外的意义。为了评估这种神经符号能力,我们引入了RebusBench,这是一个包含1,164个谜题的基准测试,旨在测试这种感知与知识的特定整合。我们对最先进的模型(包括Qwen、InternVL和LLaVA)的评估显示,性能在10%的精确匹配和20%的语义准确性以下饱和,没有观察到模型规模或上下文学习(ICL)的显著改进。这些发现表明,虽然模型具备必要的视觉和语言组件,但缺乏将它们连接起来的认知推理机制。项目页面可在https://amirkasaei.com/rebusbench/获取。
Summary / 总结
The research evaluates the cognitive visual reasoning capabilities of large vision-language models by introducing RebusBench, a benchmark of 1,164 rebus puzzles. Models such as Qwen, InternVL, and LLaVA are tested on these puzzles, which require extracting visual and textual attributes, retrieving linguistic prior knowledge, and synthesizing these into a meaning that is not explicitly depicted. Performance saturates below 10% Exact Match and below 20% semantic accuracy, with no significant improvement from model scaling or In-Context Learning, suggesting that the models lack the reasoning needed to connect their visual and linguistic components.
研究旨在通过引入包含1,164个谜题的RebusBench基准来评估大型视觉-语言模型的认知视觉推理能力。方法是测试Qwen、InternVL和LLaVA等最先进的模型,这些谜题需要提取视觉和文本信息、运用语言知识并将其综合成有意义的解释。关键发现表明,这些模型表现不佳,精确匹配率低于10%,语义准确率低于20%,且模型规模扩大或上下文学习均未带来显著提升,表明它们缺乏将视觉和语言信息整合起来的认知推理能力。
SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
Authors: Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, Yu Wang
First: 2026-01-23T07:28:53+00:00 · Latest: 2026-04-02T08:13:10+00:00
Abstract
Diffusion Transformers have demonstrated remarkable performance in video generation. However, their long input sequences incur substantial latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free approaches are limited to moderate sparsity and thus yield only modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation. In this work, we propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. Leveraging a Multi-level Static-Dynamic Scaling Strategy to balance the two branches, our method attains up to 90% sparsity and 1.52-2.03x inference speedup across different models and sequence lengths, while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples, fewer than 1,600 training steps, and no more than 30 GPU hours with a batch size of 8.
中文标题/摘要
标题:SALAD:通过高效线性注意力调优实现高稀疏度注意力以提高视频扩散变换器性能
扩散变换器在视频生成方面表现出色。然而,由于全注意力的二次复杂性,其长输入序列导致了显著的延迟。已经提出了各种稀疏注意力机制。无需训练的方法仅能达到中等稀疏度,因此只能提供适度的加速,而基于训练的方法可以达到更高的稀疏度,但需要大量的数据和计算。在本工作中,我们提出了SALAD,引入了一个轻量级的线性注意力分支与稀疏注意力并行。通过多级静态-动态缩放策略平衡两个分支,我们的方法在不同模型和序列长度上实现了高达90%的稀疏度和1.52-2.03倍的推理加速,同时保持与全注意力基线相当的生成质量。此外,我们的微调过程非常高效,只需要2,000个视频样本,少于1,600个训练步骤,且不超过30个GPU小时,批量大小为8。
Summary / 总结
SALAD reduces the attention latency of video diffusion transformers by adding a lightweight linear attention branch in parallel with sparse attention. A Multi-level Static-Dynamic Scaling Strategy balances the two branches, reaching up to 90% sparsity and a 1.52-2.03x inference speedup while keeping generation quality comparable to the full attention baseline. Finetuning is efficient, requiring only 2,000 video samples, fewer than 1,600 training steps, and no more than 30 GPU hours with a batch size of 8.
研究旨在通过提出SALAD,即引入轻量级线性注意力分支与稀疏注意力并行,解决视频生成中扩散变换器的延迟问题。通过多级静态-动态缩放策略平衡两个分支,该方法可实现高达90%的稀疏度和1.52-2.03倍的推理加速,同时保持与全注意力基线相当的生成质量。此外,微调过程高效,仅需2,000个视频样本和30个GPU小时。
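The parallel-branch design in the SALAD abstract can be pictured with a small pure-Python sketch. It makes several assumptions that are not from the paper: the sparse pattern is a local window, the linear branch uses the common `elu(x) + 1` feature map, and a single scalar `gate` stands in for the Multi-level Static-Dynamic Scaling Strategy, whose details the abstract does not give.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_attention(q, k, v, window):
    """Softmax attention where each query sees only keys within `window` positions."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        idx = [j for j in range(n) if abs(i - j) <= window]
        w = softmax([dot(q[i], k[j]) / math.sqrt(d) for j in idx])
        out.append([sum(wi * v[j][c] for wi, j in zip(w, idx)) for c in range(len(v[0]))])
    return out

def feature_map(x):
    """elu(x) + 1, a positive feature map often used for linear attention."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention(q, k, v):
    """Global attention in linear time: keys and values are summarized once."""
    dk, dv = len(k[0]), len(v[0])
    phi_k = [feature_map(x) for x in k]
    kv = [[sum(pk[a] * vj[c] for pk, vj in zip(phi_k, v)) for c in range(dv)] for a in range(dk)]
    k_sum = [sum(pk[a] for pk in phi_k) for a in range(dk)]
    out = []
    for qi in q:
        pq = feature_map(qi)
        norm = dot(pq, k_sum)
        out.append([sum(pq[a] * kv[a][c] for a in range(dk)) / norm for c in range(dv)])
    return out

def combined_attention(q, k, v, window, gate):
    """Sparse branch plus a gated linear branch (scalar gate is an assumption)."""
    s = sparse_attention(q, k, v, window)
    l = linear_attention(q, k, v)
    return [[si + gate * li for si, li in zip(sr, lr)] for sr, lr in zip(s, l)]

q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, -0.8]]
k = [[0.2, 0.1], [-0.3, 0.5], [0.4, -0.6], [0.0, 0.9]]
v = [[1.0, 2.0]] * 4
out = combined_attention(q, k, v, window=1, gate=0.5)
```

With `gate=0` the output reduces to the sparse branch alone, so the linear branch acts as a correction for the context that the sparse pattern drops.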
GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids
Authors: Mohamed Eltahir, Ahmed O. Ibrahim, Obada Siralkhatim, Tabarak Abdallah, Sondos Mohamed
First: 2026-03-26T14:08:41+00:00 · Latest: 2026-04-02T07:53:39+00:00
Abstract
Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11), and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at https://gridvad.github.io.
中文标题/摘要
标题:GridVAD:通过分层帧网格的空间推理实现开放集视频异常检测
视觉-语言模型(VLMs)是强大的开放集推理器,但在视频监控中直接用作异常检测器却很脆弱:没有校准的异常先验,它们会在漏检和虚假警报之间摇摆。我们认为问题不在于VLM本身,而在于其使用方式。VLMs应作为异常提议者,生成开放集候选描述,然后由专门构建的空间和时间模块进行定位和跟踪。我们在此原则中实例化了GridVAD,这是一种无需训练的流水线,能够在没有任何领域特定训练的情况下生成像素级异常掩码。VLM对视频片段的分层网格表示进行推理,生成自然语言异常提议。自我一致性聚合(SCC)通过仅保留跨多次独立采样中重复出现的提议来过滤虚假警报。DINO锚定每个幸存的提议到一个边界框,SAM2将其作为密集掩码在异常区间内传播。每段视频的VLM预算固定为M+1次调用,无论视频长度如何,M可以根据所需的提议进行设置。在UCSD Ped2上,GridVAD在所有比较方法中实现了最高的像素-AUROC(77.59),甚至超过了部分微调的TAO(75.11),在对象级RBDC上也优于其他零样本方法超过5倍。消融实验表明,SCC提供了可控制的精确度-召回率权衡:过滤可以提高所有像素级别指标,同时在对象级别召回率上付出适度的代价。效率实验表明,GridVAD比均匀的每帧VLM查询效率高2.7倍,同时还能生成密集分割掩码。代码和定性视频结果可在https://gridvad.github.io/获取。
Summary / 总结
GridVAD is a training-free pipeline for open-set video anomaly detection in which a Vision-Language Model (VLM) acts as an anomaly proposer rather than a direct detector. The VLM reasons over stratified frame grids, Self-Consistency Consolidation (SCC) filters out hallucinated proposals, Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense anomaly mask. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among compared methods and outperforms other zero-shot approaches by over 5x on object-level RBDC. Ablations show that SCC provides a controllable precision-recall tradeoff, and efficiency experiments show that GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while also producing dense segmentation masks.
GridVAD 提出了一种使用分层帧网格进行空间推理的视频异常检测方法,利用 Vision-Language 模型生成自然语言的异常提案,然后通过空间和时间模块进行定位和跟踪。在 UCSD Ped2 上,GridVAD 达到了最高的像素 AUROC(77.59),并在对象级 RBDC 上比其他零样本方法高出 5 倍以上。消融实验表明,Self-Consistency Consolidation (SCC) 改善了精确度-召回率的权衡,而效率实验则展示了 GridVAD 的调用效率以及密集分割掩码的生成。
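The Self-Consistency Consolidation step described for GridVAD, keeping only proposals that recur across independent samplings, can be sketched as a simple voting filter. This sketch assumes proposals are matched by exact text after normalization and that a `min_votes` cutoff decides what survives. Both are illustrative assumptions, since the abstract does not say how proposals are matched or counted.

```python
from collections import Counter

def normalize(proposal):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(proposal.lower().split())

def consolidate(samples, min_votes=2):
    """Keep proposals that appear in at least `min_votes` independent samplings.

    `samples` is a list of proposal lists, one per VLM sampling. Each sampling
    contributes at most one vote per distinct proposal.
    """
    votes = Counter()
    for sample in samples:
        votes.update({normalize(p) for p in sample})
    return sorted(p for p, n in votes.items() if n >= min_votes)

samples = [
    ["Person riding a bicycle", "cart on walkway"],
    ["person riding a bicycle"],
    ["skateboarder", "Person  riding a bicycle", "cart on walkway"],
]
kept = consolidate(samples, min_votes=2)
strict = consolidate(samples, min_votes=3)
```

Raising `min_votes` drops more one-off proposals at the risk of discarding real but rarely proposed anomalies, which mirrors the precision-recall tradeoff reported in the ablations.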