arXiv Paper Digest

2026-04-06 03:47
Snapshot: 20260406_0347
Steerable Visual Representations
Authors: Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano
First: 2026-04-02T17:59:49+00:00 · Latest: 2026-04-02T17:59:49+00:00
Comments: preprint
Abstract
Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
中文标题/摘要
标题:可引导的视觉表示
预训练的视觉变换器(ViTs)如DINOv2和MAE提供了通用的图像特征,可用于检索、分类和分割等多种下游任务。然而,这些表示往往集中在图像中最显著的视觉线索上,没有方法可以引导它们关注不那么突出的概念。相比之下,多模态LLMs可以通过文本提示进行引导,但生成的表示往往是语言中心的,对于通用的视觉任务效果不佳。为了解决这个问题,我们引入了可引导的视觉表示,这是一种新的视觉表示类别,其全局和局部特征可以通过自然语言进行引导。大多数视觉-语言模型(例如CLIP)在编码后将文本与视觉特征融合(晚期融合),而我们则通过轻量级的交叉注意力直接将文本注入视觉编码器的层中(早期融合)。我们引入了衡量表示可引导性的基准,并证明我们的可引导视觉特征可以在图像中聚焦于任何所需的对象,同时保持底层表示的质量。我们的方法在异常检测和个性化对象区分方面也与专门的方法相当或更优,展示了对未见过任务的零样本泛化。
Summary / 总结
The research introduces Steerable Visual Representations, visual features whose global and local components can be steered with natural language toward less prominent concepts while preserving generic representation quality. Unlike CLIP-style late fusion, the method injects text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. Experiments show that the steerable features can focus on specified objects in an image, and that the method matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, with zero-shot generalization to out-of-distribution tasks.
该研究引入了可引导视觉表示,可以通过自然语言将通用图像特征导向特定的概念。不同于侧重于图像中显著视觉线索的预训练视觉变换器,或变得以语言为中心的多模态大模型,该方法在视觉编码器中直接注入文本进行早期融合。作者证明了他们的可引导视觉特征可以聚焦于图像中的任何所需对象,同时保持质量,并且该方法在异常检测和个人化对象区分任务上与专门方法相当或更优,展示了对未见过任务的零样本泛化能力。
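Illustrative sketch for the entry above: the abstract describes injecting text into the visual encoder via lightweight cross-attention. The toy function below shows only the bare mechanism, with patch tokens as queries and prompt tokens as keys and values, added back through a residual connection. It is not the authors' implementation: the absence of learned projections, the `gate` parameter, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(patches, text_tokens, gate=1.0):
    """Each patch token attends over the text tokens (keys = values here,
    no learned projections) and adds the gated context as a residual."""
    dim = len(patches[0])
    steered = []
    for patch in patches:
        # Scaled dot-product scores between this patch and every text token.
        scores = [sum(p * t for p, t in zip(patch, tok)) / math.sqrt(dim)
                  for tok in text_tokens]
        weights = softmax(scores)
        # Attention-weighted mix of the text tokens.
        context = [sum(w * tok[i] for w, tok in zip(weights, text_tokens))
                   for i in range(dim)]
        steered.append([p + gate * c for p, c in zip(patch, context)])
    return steered

patches = [[1.0, 0.0], [0.0, 1.0]]   # two toy patch tokens (hypothetical)
text_tokens = [[0.5, 0.5]]           # one toy prompt token (hypothetical)
print(cross_attend(patches, text_tokens))
```

With `gate=0.0` the patch tokens pass through unchanged, which is one simple way such a layer could leave the pretrained representation intact when no steering is wanted.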
Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning
Authors: Xueying Li, Feng Lyu, Hao Wu, Mingliu Liu, Jia-Nan Liu, Guozi Liu
First: 2026-04-02T17:58:08+00:00 · Latest: 2026-04-02T17:58:08+00:00
Comments: 10 pages, 6 figures
Abstract
Training-free Vision-Language Navigation (VLN) agents powered by foundation models can follow instructions and explore 3D environments. However, existing approaches rely on greedy frontier selection and passive spatial memory, leading to inefficient behaviors such as local oscillation and redundant revisiting. We argue that this stems from a lack of metacognitive capabilities: the agent cannot monitor its exploration progress, diagnose strategy failures, or adapt accordingly. To address this, we propose MetaNav, a metacognitive navigation agent integrating spatial memory, history-aware planning, and reflective correction. Spatial memory builds a persistent 3D semantic map. History-aware planning penalizes revisiting to improve efficiency. Reflective correction detects stagnation and uses an LLM to generate corrective rules that guide future frontier selection. Experiments on GOAT-Bench, HM3D-OVON, and A-EQA show that MetaNav achieves state-of-the-art performance while reducing VLM queries by 20.7%, demonstrating that metacognitive reasoning significantly improves robustness and efficiency.
Summary / 总结
The research aims to improve the efficiency of training-free Vision-Language Navigation (VLN) agents, which tend to oscillate locally and revisit places redundantly because they rely on greedy frontier selection and passive spatial memory. The proposed agent, MetaNav, combines spatial memory, history-aware planning, and reflective correction: it builds a persistent 3D semantic map, penalizes revisiting, and, when it detects stagnation, uses an LLM to generate corrective rules that guide future frontier selection. On GOAT-Bench, HM3D-OVON, and A-EQA, MetaNav achieves state-of-the-art performance while reducing VLM queries by 20.7%.
研究旨在通过解决视觉-语言导航(VLN)代理的局部振荡和重复访问等低效行为,提高其效率。方法是开发了MetaNav,该方法结合了空间记忆、历史感知规划和反思性纠正。MetaNav构建了一个持久的3D语义地图,通过惩罚重复访问来提高效率,并使用LLM生成纠正规则以指导未来的前沿选择。实验表明,MetaNav在性能上超过了现有方法,并将VLM查询减少了20.7%,显示出显著的鲁棒性和效率改进。
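Illustrative sketch for the entry above: the abstract contrasts greedy frontier selection with history-aware planning that penalizes revisiting. The snippet below shows one simple way a revisit penalty can change the chosen frontier. The scoring rule, the `penalty` value, and the room names are hypothetical and are not taken from the paper.

```python
from collections import Counter

def select_frontier(frontiers, visit_counts, penalty=0.5):
    """frontiers: {name: relevance score}; visit_counts: Counter of prior visits.
    Returns the frontier with the highest revisit-penalized score."""
    def penalized(name):
        return frontiers[name] - penalty * visit_counts[name]
    return max(frontiers, key=penalized)

frontiers = {"hallway": 0.9, "kitchen": 0.8, "bedroom": 0.4}
visits = Counter({"hallway": 2})  # the agent has already been here twice

greedy_choice = max(frontiers, key=frontiers.get)
history_aware_choice = select_frontier(frontiers, visits)
print(greedy_choice, history_aware_choice)
```

The greedy rule keeps returning to the highest-scoring frontier, while the penalized rule moves on once that frontier has been visited often enough.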
VOID: Video Object and Interaction Deletion
Authors: Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng
First: 2026-04-02T17:36:53+00:00 · Latest: 2026-04-02T17:36:53+00:00
Abstract
Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
中文标题/摘要
标题:VOID:视频对象和交互删除
现有的视频对象移除方法在修复内容“背后”的内容和纠正阴影、反射等外观级伪影方面表现出色。然而,当移除的对象具有更显著的交互,如与其他对象的碰撞时,当前的模型无法纠正这些交互,从而产生不合理的结果。我们提出了VOID,一种旨在在这些复杂场景中执行物理上合理的修复的视频对象移除框架。为了训练模型,我们使用Kubric和HUMOTO生成了一个新的配对数据集,其中移除对象需要改变下游的物理交互。在推理过程中,一个视觉语言模型识别场景中受移除对象影响的区域。这些区域随后用于引导一个视频扩散模型,生成物理上一致的反事实结果。在合成和真实数据上的实验表明,与之前的视频对象移除方法相比,我们的方法在对象移除后更好地保持了场景动力学的一致性。我们希望这个框架能够揭示如何通过高层次的因果推理使视频编辑模型更好地模拟世界。
Summary / 总结
The research aims to address the limitations of existing video object removal methods, which struggle with scenarios involving significant physical interactions. The proposed VOID framework uses a paired dataset generated by Kubric and HUMOTO to train a model capable of physically plausible inpainting. During inference, a vision-language model identifies affected regions, which guide a video diffusion model to generate consistent outcomes. Experiments show that VOID better preserves scene dynamics after object removal compared to previous methods.
研究旨在解决现有视频对象移除方法在处理碰撞等复杂交互时的局限性。作者提出了VOID框架,该框架使用视觉语言模型识别受影响区域,并使用视频扩散模型生成物理上一致的结果。实验表明,VOID在对象移除后更好地保持了场景的动力学,优于之前的视频对象移除方法。
Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models
Authors: Yaoteng Tan, Zikui Cai, M. Salman Asif
First: 2026-04-02T16:59:28+00:00 · Latest: 2026-04-02T16:59:28+00:00
Abstract
Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.
中文标题/摘要
标题:模块化能量导向以确保基础模型驱动的文本到图像生成的安全性
控制文本到图像生成模型的行为对于安全和实际部署至关重要。现有安全方法通常依赖于模型微调或精心策划的数据集,这可能会降低生成质量或限制可扩展性。我们提出了一种推理时的导向框架,该框架利用冻结的预训练基础模型的梯度反馈来引导生成过程,而不修改底层生成器。我们的主要观察是,视觉-语言基础模型编码了丰富的语义表示,可以在生成过程中作为现成的监督信号重新利用。通过在每次采样步骤中注入这种反馈,我们的方法将安全性导向问题表述为能量导向的采样问题。这种设计使得安全性控制模块化、无需训练,并且兼容扩散和流匹配模型,可以跨多种视觉概念泛化。实验表明,我们的方法在NSFW红队测试基准上具有最先进的鲁棒性,并且能够有效进行多目标导向,同时保持对良性非目标提示的高质量生成。我们的框架提供了一种原理性的方法,用于利用基础模型作为语义能量估计器,从而实现文本到图像生成的可靠和可扩展的安全控制。
Summary / 总结
The research aims to make text-to-image generation safer with a modular, training-free energy steering framework that uses gradient feedback from frozen pretrained foundation models to guide sampling without modifying the underlying generator. The method repurposes the semantic representations of vision-language foundation models as off-the-shelf supervisory signals, injected through clean latent estimates at each sampling step. Experiments show state-of-the-art robustness on NSFW red-teaming benchmarks and effective multi-target steering, while generation quality on benign, non-targeted prompts is preserved.
研究旨在通过提出一种模块化能量引导框架,利用冻结预训练模型在推理时的梯度反馈来提高文本到图像生成的安全性。该方法避免了模型微调或依赖于定制数据集,从而保持了生成质量和可扩展性。关键发现表明,该方法在NSFW基准测试中达到了最先进的鲁棒性,并且能够有效进行多目标引导,同时保持对良性非目标提示的高质量图像生成。
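Illustrative sketch for the entry above: the abstract describes injecting feedback through clean latent estimates and treating steering as energy-based sampling. The toy below assumes the common parameterization x_t = alpha * x0 + sigma * eps, forms the clean estimate from a predicted noise, and takes one gradient step that lowers a hand-written energy. In the paper the energy comes from a frozen foundation model; the quadratic energy, the `concept` direction, and every numeric value here are assumptions.

```python
def clean_estimate(x_t, eps_hat, alpha, sigma):
    # Invert x_t = alpha * x0 + sigma * eps using the predicted noise.
    return [(x - sigma * e) / alpha for x, e in zip(x_t, eps_hat)]

def energy(x0_hat, concept):
    # Toy energy: squared projection onto an unwanted concept direction.
    proj = sum(a * b for a, b in zip(x0_hat, concept))
    return 0.5 * proj * proj

def steer(x_t, eps_hat, alpha, sigma, concept, step):
    x0_hat = clean_estimate(x_t, eps_hat, alpha, sigma)
    proj = sum(a * b for a, b in zip(x0_hat, concept))
    # Gradient of the energy w.r.t. x_t, holding eps_hat fixed (a simplification).
    grad = [proj * c / alpha for c in concept]
    return [x - step * g for x, g in zip(x_t, grad)]

x_t, eps_hat = [1.0, 2.0], [0.1, -0.2]
alpha, sigma, concept = 0.8, 0.6, [1.0, 0.0]

before = energy(clean_estimate(x_t, eps_hat, alpha, sigma), concept)
x_steered = steer(x_t, eps_hat, alpha, sigma, concept, step=0.1)
after = energy(clean_estimate(x_steered, eps_hat, alpha, sigma), concept)
print(before, after)
```

Only the latent is nudged; the generator that produced `eps_hat` is never updated, which is the sense in which such steering is training-free.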
Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
Authors: Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath
First: 2025-07-08T04:40:09+00:00 · Latest: 2026-04-02T16:59:25+00:00
Comments: Accepted at ACM CODASPY 2026
Abstract
Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus
中文标题/摘要
标题:Optimus:一种稳健的防御框架,用于在微调对话AI时减轻毒性
在不可信数据集上定制大型语言模型(LLMs)会严重增加注入毒性行为的风险。在本研究中,我们提出了Optimus,一种新颖的防御框架,旨在减轻微调危害同时保留对话实用性。与依赖精确毒性检测或严格过滤的现有防御不同,Optimus 通过确保即使在毒性分类器不准确或有偏见时也能实现稳健的缓解来解决关键挑战。Optimus 结合了一种无需训练的毒性分类方案,重新利用了商品级LLMs的安全对齐,并采用结合合成“治愈数据”与直接偏好优化(DPO)的双重策略对齐过程,高效地引导模型向安全方向发展。广泛的评估表明,即使依赖于高度有偏见的分类器(召回率降低高达85%),Optimus 也能减轻毒性。Optimus 在对抗适应性对手和突破攻击方面表现出色,优于最先进的防御StarDSS。我们的源代码和数据集可在https://github.com/secml-lab-vt/Optimus 获取
Summary / 总结
Optimus is a defense framework designed to mitigate toxic behaviors in fine-tuned conversational AI models while preserving conversational utility. Unlike other defenses that rely on precise toxicity detection or restrictive filtering, Optimus uses a training-free toxicity classification scheme and a dual-strategy alignment process to steer models towards safety. Evaluations show that Optimus can effectively mitigate toxicity even with highly biased classifiers and performs better than the state-of-the-art defense StarDSS, demonstrating strong resilience against adversarial attacks.
Optimus 是一个防御框架,旨在在保持对话功能的同时减轻对话 AI 模型细调过程中的有毒行为。不同于依赖精确的毒性检测或严格过滤的方法,Optimus 使用一种无需训练的毒性分类方案和双重策略对齐过程来引导模型向安全方向发展。评估结果显示,Optimus 即使在使用高度偏颇的分类器时也能有效减轻毒性,并且在各种攻击下表现出强大的抗攻击能力,优于现有防御方法如 StarDSS。
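Illustrative sketch for the entry above: the abstract names Direct Preference Optimization (DPO) as one half of the alignment process. The function below computes the standard per-pair DPO loss, -log sigmoid(beta * margin), from sequence log-probabilities under the policy and a frozen reference model. How Optimus builds its preference pairs and which `beta` it uses are not stated in the abstract, so the inputs here are placeholders.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Placeholder log-probabilities: the policy prefers the safe (chosen) reply
# more strongly than the reference does, so the loss falls below log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
improved = dpo_loss(-4.0, -7.0, -5.0, -5.0)
print(neutral, improved)
```

A pair on which the policy and the reference agree gives a loss of log 2; the loss shrinks as the policy raises the chosen reply relative to the rejected one.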
Scaling Video Pretraining for Surgical Foundation Models
Authors: Sicheng Lu, Zikai Xiao, Jianhui Wei, Danyu Sun, Qi Lu, Keli Hu, Yang Feng, Jian Wu, Zongxin Yang, Zuozhu Liu
First: 2026-03-31T16:31:25+00:00 · Latest: 2026-04-02T16:46:06+00:00
Abstract
Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.
中文标题/摘要
标题:手术视频预训练的扩展
手术视频理解对于计算机辅助干预至关重要,但现有的手术基础模型仍然受到数据规模有限、程序多样性不足以及评估不一致的限制,往往缺乏可重复的训练管道。我们提出了一种名为SurgRec的可扩展且可重复的手术视频理解预训练方案,包括两种变体:SurgRec-MAE和SurgRec-JEPA。我们整理了一个包含10,535个视频和2.145亿帧的大型多源数据集,涵盖内窥镜、腹腔镜、白内障和机器人手术。基于此数据集,我们开发了一个统一的预训练管道,采用平衡采样,并在16个下游数据集和四个临床领域中标准化了一个可重复的基准,数据分割一致。在与SSL基线和视觉-语言模型的广泛比较中,SurgRec在所有下游数据集上均表现出更优的性能。相比之下,视觉-语言模型在细粒度的时间识别上表现不稳定,表现出性能差距和对提示措辞的敏感性。我们的工作为社区提供了一个可重复且可扩展的基础,以构建更通用的手术视频模型。所有代码、模型和数据将公开发布。
Summary / 总结
The research aims to enhance surgical video understanding for computer-assisted interventions by addressing data limitations and evaluation inconsistencies. The authors propose SurgRec, a scalable pretraining method for surgical videos, using a large dataset of 10,535 videos and developing a unified pretraining pipeline. Experiments show that SurgRec outperforms SSL baselines and vision-language models across 16 downstream datasets, providing a reproducible benchmark for future research in surgical video models.
研究旨在通过解决数据限制和评估不一致性,提升手术视频的理解能力,以支持计算机辅助干预。作者提出了SurgRec,一种针对手术视频的可扩展预训练方法,使用了包含10,535个视频的大规模数据集,并开发了一个统一的预训练流水线。实验表明,SurgRec在16个下游数据集上优于SSL基线和视觉语言模型,为未来手术视频模型的研究提供了可重复的基准。
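Illustrative sketch for the entry above: the abstract mentions balanced sampling over a multi-source corpus but does not say how it is done. One simple option is to weight each video inversely to the size of its source, so that every source receives the same total sampling probability. The per-source counts below are invented for the example and do not reflect the real corpus composition.

```python
def per_video_weights(videos_per_source):
    """Weight each video so that every source has equal total probability."""
    num_sources = len(videos_per_source)
    return {source: 1.0 / (num_sources * count)
            for source, count in videos_per_source.items()}

# Hypothetical counts, chosen only to show a strong imbalance.
toy_corpus = {"endoscopy": 800, "laparoscopy": 150, "cataract": 40, "robotic": 10}

weights = per_video_weights(toy_corpus)
source_mass = {s: weights[s] * n for s, n in toy_corpus.items()}
print(source_mass)
```

Under this scheme a video from the smallest source is drawn far more often than one from the largest, which trades raw data volume for procedural diversity.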
SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
Authors: Naomi Kombol, Ivan Martinović, Siniša Šegvić, Giorgos Tolias
Venue: CVPR 2026
First: 2026-04-02T16:45:34+00:00 · Latest: 2026-04-02T16:45:34+00:00
Comments: Accepted to CVPR 2026
Abstract
Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
中文标题/摘要
标题:SPAR:单次通过任意分辨率ViT的开放词汇分割
基础视觉变换器(ViTs)在需要精细空间理解的任务中效果有限,因为它们具有固定的预训练分辨率和固有的粗粒度的块级表示。这些挑战在密集预测场景中尤为明显,例如基于ViT的视觉-语言模型的开放词汇分割,其中高分辨率输入对于准确的像素级推理至关重要。现有方法通常使用滑动窗口策略在预训练分辨率下处理大分辨率图像。虽然这通过更细的步长提高了准确性,但带来了显著的计算成本。我们引入了SPAR:单次通过任意分辨率ViT,这是一种为高效高分辨率推理设计的分辨率无关的密集特征提取器。我们通过特征回归损失将精细步长的滑动窗口教师的空间推理能力提炼到单次通过的学生中,无需进行架构更改或像素级监督。应用于开放词汇分割,SPAR将单次通过基线提高了最多10.5 mIoU,并且甚至超过了教师,证明了其在高效高分辨率推理中的有效性。代码:https://github.com/naomikombol/SPAR
Summary / 总结
SPAR is a resolution-agnostic ViT designed for efficient high-resolution inference in open-vocabulary segmentation tasks. It distills the spatial reasoning capabilities of a finely-strided teacher into a single-pass student using a feature regression loss. SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, showing effectiveness in efficient, high-resolution reasoning.
SPAR 是一种针对开放词汇分割任务的分辨率无关 ViT,旨在高效进行高分辨率推理。它通过特征回归损失将精细步长教师的空间推理能力提炼到单次通过的学生模型中。SPAR 将单次通过基线提高了最多 10.5 mIoU,甚至超越了教师模型,展示了其在高效高分辨率推理中的有效性。
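Illustrative sketch for the entry above: the abstract notes that sliding-window inference with finer strides improves accuracy at a significant computational cost. The helper below counts how many encoder passes a sliding window needs, which makes the cost growth concrete. The 224-pixel window and the strides are assumed values for the example, not settings reported in the paper.

```python
import math

def windows_per_axis(size, window, stride):
    # Number of window positions needed to cover one image axis.
    if size <= window:
        return 1
    return math.ceil((size - window) / stride) + 1

def sliding_window_passes(height, width, window, stride):
    return (windows_per_axis(height, window, stride)
            * windows_per_axis(width, window, stride))

coarse = sliding_window_passes(1024, 1024, window=224, stride=112)
fine = sliding_window_passes(1024, 1024, window=224, stride=56)
print(coarse, fine)  # 81 256
```

Each pass runs on a window-sized crop, so the counts are not a direct FLOP ratio against one full-resolution pass, but halving the stride roughly triples the number of crops here.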
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-04-02T16:01:02+00:00
Comments: Updated first authors
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当前最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs的合成数据,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。至关重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是通过像素跟踪。即使是私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLM家族,是开源模型中的最先进的,并展示了在单图像、多图像和视频任务中出色的基于指针的定位新能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、自由形式的视频问答数据集、新的具有复杂查询的对象跟踪数据集以及创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的训练方法,并展示了在视觉标记上进行双向注意以及一种新的标记权重策略可以提高性能。我们最好的8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型,并在长视频方面具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率为35.5 vs 29.6)并在某些任务上超越了私有模型如Gemini 3 Pro(视频指针F1得分为38.4 vs 20.0,视频跟踪J&F得分为56.2 vs 41.1)。
Summary / 总结
Molmo2 is a new family of open vision-language models with video understanding and point-driven grounding. It is released with 7 new video datasets and 2 multi-image datasets, all collected without closed VLMs, plus a training recipe that uses efficient packing, message-tree encoding, bi-directional attention on vision tokens, and a novel token-weight strategy. The 8B model outperforms other open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding it outperforms open-weight models such as Qwen3-VL and surpasses proprietary models such as Gemini 3 Pro on some tasks, including video pointing and tracking.
研究旨在解决缺乏具有强大定位能力的开源视频语言模型(VLMs)的问题。作者介绍了Molmo2,这是一种新的VLM家族,其在点驱动的定位任务中优于现有开源模型。关键贡献包括9个新数据集和一个通过高效打包和消息树编码、双向注意力以及新颖的标记权重策略来提升模型性能的训练方法。Molmo2在视频计数、字幕生成和视频定位等任务中显著提高了准确性,超越了开源和专有模型。
UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
Authors: Yongkang Li, Lijun Zhou, Sixu Yan, Bencheng Liao, Tianyi Yan, Kaixin Xiong, Long Chen, Hongwei Xie, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Haiyang Sun, Xinggang Wang
First: 2026-04-02T15:48:45+00:00 · Latest: 2026-04-02T15:48:45+00:00
Comments: code has been released at https://github.com/xiaomi-research/unidrivevla
Abstract
Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla
中文标题/摘要
标题:UniDriveVLA:统一理解、感知与行动规划的自动驾驶
视觉-语言-行动(VLA)模型最近在自动驾驶中崭露头角,有望利用丰富的世界知识提升驾驶系统的认知能力。然而,将这些模型适应驾驶任务目前面临一个关键困境:空间感知与语义推理之间的权衡。因此,现有的VLA系统被迫做出次优妥协:直接采用2D视觉-语言模型导致空间感知有限,而增强它们的3D空间表示往往损害了VLM的原生推理能力。我们认为,这一困境主要源于共享模型参数中空间感知与语义推理的耦合优化。为克服这一问题,我们提出了基于混合变换器的UniDriveVLA统一驾驶视觉-语言-行动模型,通过专家解耦解决感知-推理冲突。具体而言,它包括三个专家,分别负责驾驶理解、场景感知和行动规划,通过掩蔽联合注意力协调。此外,我们结合稀疏感知范式和三阶段渐进式训练策略,以提高空间感知能力同时保持语义推理能力。大量实验表明,UniDriveVLA在nuScenes的开环评估和Bench2Drive的闭环评估中均达到最先进的性能。此外,它在包括3D检测、在线制图、运动预测和驾驶导向的VQA等一系列感知、预测和理解任务中表现出色,突显了其作为统一模型在自动驾驶领域的广泛应用潜力。代码和模型已发布于https://github.com/xiaomi-research/unidrivevla
Summary / 总结
The paper addresses the conflict between spatial perception and semantic reasoning in Vision-Language-Action models for autonomous driving, which it attributes to optimizing both within shared model parameters. It proposes UniDriveVLA, a unified driving model based on Mixture-of-Transformers that resolves this conflict by decoupling experts. The model has three experts for driving understanding, scene perception, and action planning, coordinated through masked joint attention, and combines a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while preserving semantic reasoning. UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive, and performs strongly on 3D detection, online mapping, motion forecasting, and driving-oriented VQA.
论文旨在解决在自动驾驶中将空间感知和语义推理集成的挑战。提出了基于Mixture-of-Transformers的UniDriveVLA统一驾驶视觉-语言-行动模型,通过专家解耦来分离感知和推理。该模型包含三个专家,分别负责理解、感知和行动规划,并使用稀疏感知范式和三阶段训练策略来增强空间感知能力同时保持语义推理能力。实验结果显示,UniDriveVLA在nuScenes和Bench2Drive的开环和闭环评估中均优于现有模型,并在各种与自动驾驶相关的任务中表现出色。
CoRegOVCD: Consistency-Regularized Open-Vocabulary Change Detection
Authors: Weidong Tang, Hanbin Sun, Zihan Li, Yikai Wang, Feifan Zhang
First: 2026-04-02T15:28:29+00:00 · Latest: 2026-04-02T15:28:29+00:00
Abstract
Remote sensing change detection (CD) aims to identify where land-cover semantics change across time, but most existing methods still assume a fixed label space and therefore cannot answer arbitrary user-defined queries. Open-vocabulary change detection (OVCD) instead asks for the change mask of a queried concept. In the fully training-free setting, however, dense concept responses are difficult to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of many land-cover categories often produce noisy, fragmented, and semantically unreliable change evidence. We propose Consistency-Regularized Open-Vocabulary Change Detection (CoRegOVCD), a training-free dense inference framework that reformulates concept-specific change as calibrated posterior discrepancy. Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy, making semantic change evidence more comparable without explicit instance matching. Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) further suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus. Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently improves over the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.
中文标题/摘要
标题:CoRegOVCD: 一致性正则化开放词汇变化检测
遥感变化检测(CD)旨在识别不同时期土地覆盖语义的变化,但大多数现有方法仍然假设固定标签空间,因此无法回答任意用户定义的查询。开放词汇变化检测(OVCD)则要求提供查询概念的变化掩码。然而,在完全无需训练的情况下,密集的概念响应难以直接在不同日期之间进行比较:外观变化、弱跨概念竞争以及许多土地覆盖类别的空间连续性经常产生嘈杂、碎片化且语义不可靠的变化证据。我们提出了Consistency-Regularized Open-Vocabulary Change Detection(CoRegOVCD),这是一种无需训练的密集推理框架,将概念特定的变化重新表述为校准后的后验差异。Competitive Posterior Calibration(CPC)和Semantic Posterior Delta(SPD)将原始概念响应转换为竞争意识的查询概念后验,并量化它们的跨时间差异,从而在无需显式实例匹配的情况下使语义变化证据更具可比性。Geometry-Token Consistency Gate(GeoGate)和Regional Consensus Discrepancy(RCD)进一步抑制不支持的响应,并通过几何感知结构验证和区域共识提高空间一致性。在四个涵盖建筑导向和多类别的基准测试中,CoRegOVCD在最强的先前无需训练基准上持续提高了2.24到4.98个F1$_C$点,并在SECOND上达到了六类平均47.50%的F1$_C$。
Summary / 总结
CoRegOVCD is a training-free framework for open-vocabulary change detection that addresses the difficulty of comparing dense concept responses across dates. Competitive Posterior Calibration and Semantic Posterior Delta convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy. Geometry-Token Consistency Gate and Regional Consensus Discrepancy then suppress unsupported responses and improve spatial coherence. Across four benchmarks, CoRegOVCD improves over the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.
CoRegOVCD 是一种无需训练的密集推理框架,用于开放词汇变化检测,解决跨时间比较概念响应的挑战。它使用 Competitive Posterior Calibration 和 Semantic Posterior Delta 将原始响应转换为竞争意识的后验,并量化其跨时间的差异。Geometry-Token Consistency Gate 和 Regional Consensus Discrepancy 进一步细化结果。CoRegOVCD 在四个基准测试中优于最强的无训练基准,F1$_C$ 分数提高了 2.24 到 4.98 点,并在SECOND基准测试的六类设置中达到了 47.50% 的 F1$_C$。
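Illustrative sketch for the entry above: the abstract says raw concept responses are hard to compare across dates and are therefore converted into competition-aware posteriors before their discrepancy is measured. The toy below uses a plain softmax over concepts and an absolute difference as stand-ins. These are not the paper's definitions of CPC or SPD, and the concept names and response values are made up.

```python
import math

def posterior(responses, query, temperature=1.0):
    """Softmax over all concept responses at one pixel; returns P(query)."""
    peak = max(responses.values())
    exps = {k: math.exp((v - peak) / temperature) for k, v in responses.items()}
    return exps[query] / sum(exps.values())

def change_score(resp_t1, resp_t2, query):
    return abs(posterior(resp_t1, query) - posterior(resp_t2, query))

date1 = {"building": 2.0, "tree": 1.0, "road": 0.5}
# Same pixel, every response shifted by +1.0 (e.g. a brighter acquisition).
date2_shifted = {"building": 3.0, "tree": 2.0, "road": 1.5}
# Same pixel after the building response has genuinely collapsed.
date2_changed = {"building": 0.2, "tree": 1.0, "road": 0.5}

print(change_score(date1, date2_shifted, "building"))
print(change_score(date1, date2_changed, "building"))
```

The raw building response differs by 1.0 in the shifted case, yet the posterior is unchanged because all concepts moved together; only the genuine change produces a large score.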
Be Tangential to Manifold: Discovering Riemannian Metric for Diffusion Models
Authors: Shinnosuke Saito, Takashi Matsubara
First: 2025-10-07T01:54:47+00:00 · Latest: 2026-04-02T14:50:54+00:00
Abstract
Diffusion models are powerful deep generative models, but unlike classical models, they lack an explicit low-dimensional latent space that parameterizes the data manifold. This absence makes it difficult to perform manifold-aware operations, such as geometrically faithful interpolation or conditional guidance that respects the learned manifold. We propose a training-free Riemannian metric on the noise space, derived from the Jacobian of the score function. The key insight is that the spectral structure of this Jacobian separates tangent and normal directions of the data manifold; our metric leverages this separation to encourage paths to stay tangential to the manifold rather than drift toward high-density regions. To validate that our metric faithfully captures the manifold geometry, we examine it from two complementary angles. First, geodesics under our metric yield perceptually more natural interpolations than existing methods on synthetic, image, and video frame datasets. Second, the tangent-normal decomposition induced by our metric prevents classifier-free guidance from deviating off the manifold, improving generation quality while preserving text-image alignment.
中文标题/摘要
标题:与流形相切:为扩散模型发现黎曼度量
扩散模型是强大的深度生成模型,但与经典模型不同,它们缺乏一个显式的低维潜在空间来参数化数据流形。这种缺失使得执行流形感知操作变得困难,例如几何上忠实的插值或尊重学习到的流形的条件引导。我们提出了一种无需训练的噪声空间上的黎曼度量,该度量源自分数函数的雅可比矩阵。关键洞察是,该雅可比矩阵的谱结构将数据流形的切向和法向方向区分开来;我们的度量利用这种区分来鼓励路径保持在流形上而不是向高密度区域漂移。为了验证我们的度量是否忠实捕捉了流形几何,我们从两个互补的角度进行了验证。首先,在我们的度量下,测地线在合成、图像和视频帧数据集上提供了感知上更自然的插值。其次,由我们的度量引起的切向-法向分解防止了无分类器引导偏离流形,从而提高了生成质量并保持了文本-图像对齐。
Summary / 总结
This paper addresses the difficulty of performing manifold-aware operations in diffusion models by proposing a training-free Riemannian metric on the noise space, derived from the Jacobian of the score function. The spectral structure of this Jacobian separates tangent and normal directions of the data manifold, and the metric uses that separation to keep paths tangential to the manifold rather than drifting toward high-density regions. Geodesics under the metric yield perceptually more natural interpolations than existing methods on synthetic, image, and video frame datasets, and the induced tangent-normal decomposition keeps classifier-free guidance from deviating off the manifold, improving generation quality while preserving text-image alignment.
论文通过提出一个基于得分函数雅可比的无训练Riemannian度量来解决在扩散模型中执行流形感知操作的挑战。该度量鼓励路径保持在流形上,而不是向高密度区域漂移,从而改善几何保真插值和无分类器引导。实验表明,在该度量下的测地线提供了更自然的插值,并且由该度量诱导的切空间-法空间分解防止了偏离流形,从而提高了生成质量和文本-图像对齐。
FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition
Authors: Taichi Endo, Guoqing Hao, Kazuhiko Sumi
First: 2026-04-02T14:16:06+00:00 · Latest: 2026-04-02T14:16:06+00:00
Comments: HuggingFace Space: https://huggingface.co/spaces/dominoer/FlowSlider
Abstract
Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can reduce reliability under distribution shifts in edits or domains. We propose FlowSlider, a training-free method for continuous editing in Rectified Flow that requires no post-training. FlowSlider decomposes FlowEdit's update into (i) a fidelity term, which acts as a source-conditioned stabilizer that preserves identity and structure, and (ii) a steering term that drives semantic transition toward the target edit. Geometric analysis and empirical measurements show that these terms are approximately orthogonal, enabling stable strength control by scaling only the steering term while keeping the fidelity term unchanged. As a result, FlowSlider provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks.
中文标题/摘要
标题:FlowSlider:基于保真-导向分解的免训练连续图像编辑
连续图像编辑旨在提供滑块式的编辑强度控制,同时保持源图像保真度并维持一致的编辑方向。现有的基于学习的滑块方法通常依赖于使用合成或代理监督训练的辅助模块。这引入了额外的训练开销,并将滑块行为与训练分布耦合,这在编辑或领域分布变化时可能会降低可靠性。我们提出了一种名为FlowSlider的免训练连续编辑方法,该方法基于修正流(Rectified Flow),不需要后训练。FlowSlider将FlowEdit的更新分解为(i)保真度项,该项作为基于源条件的稳定器,保持身份和结构;(ii)导向项,驱动语义过渡以接近目标编辑。几何分析和实证测量表明,这些项近似正交,使得通过仅缩放导向项而保持保真度项不变即可实现稳定的强度控制。因此,FlowSlider在无需后训练的情况下提供了平滑且可靠的控制,从而提高了各种任务中的连续编辑质量。
Summary / 总结
FlowSlider is a training-free method for continuous image editing in Rectified Flow. It decomposes FlowEdit's update into a fidelity term, a source-conditioned stabilizer that preserves identity and structure, and a steering term that drives the semantic transition toward the target edit. Because the two terms are approximately orthogonal, scaling only the steering term gives smooth and reliable strength control without post-training, improving continuous editing quality across diverse tasks.
FlowSlider 是一种无需训练的方法,用于连续图像编辑,它将编辑过程分解为保真度项和导向项。保真度项稳定源图像,而导向项驱动语义过渡。这种正交分解允许在无需额外训练的情况下平滑且可靠地控制编辑强度,从而提高各种任务中连续编辑的质量。
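The strength control described in this entry can be sketched in a few lines. This is a minimal illustration, assuming the update has already been split into a fidelity vector and a steering vector; the names `scale_steering`, `fidelity`, `steering`, and `strength`, and the toy values, are hypothetical and not taken from the FlowSlider or FlowEdit code.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; values near 0 mean the vectors are nearly orthogonal."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def scale_steering(fidelity, steering, strength):
    """Combine the two terms, scaling only the steering term by `strength`."""
    return [f + strength * s for f, s in zip(fidelity, steering)]

# Toy stand-ins for the two terms (hypothetical values, chosen to be orthogonal).
fidelity = [1.0, 0.0, 0.5]
steering = [0.0, 2.0, 0.0]

# Sweep the slider: the fidelity part of the update is identical at every strength.
updates = {s: scale_steering(fidelity, steering, s) for s in (0.0, 0.5, 1.0)}
```

When the two vectors are close to orthogonal, changing `strength` moves the update along the steering direction without shrinking or growing its fidelity component, which is the intuition behind scaling only one term.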
Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection
Authors: Soo Won Seo, KyungChae Lee, Hyungchan Cho, Taein Son, Nam Ik Cho, Jun Won Choi
Venue: CVPR 2026
First: 2026-04-02T14:01:58+00:00 · Latest: 2026-04-02T14:01:58+00:00
Comments: Accepted to CVPR 2026. Code: https://github.com/nowuss/InCoM-Net
Abstract
Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
中文标题/摘要
标题:基于实例中心视觉-语言语境的人-物交互检测挖掘
人-物交互(HOI)检测旨在从单张图像中定位人-物对并分类其交互,这需要强大的视觉理解能力和细腻的语境推理。最近的方法利用视觉-语言模型(VLMs)引入语义先验,显著提高了HOI检测性能。然而,现有方法往往未能充分利用场景中分散的多样化语境线索。为克服这些限制,我们提出了一种实例中心语境挖掘网络(InCoM-Net)——一种新颖的框架,该框架有效整合了从VLM提取的丰富语义知识与对象检测器生成的实例特定特征。此设计通过建模不仅在每个检测实例内部的关系,还在实例之间及其周围场景语境中的关系,以实现更深入的交互推理。InCoM-Net 包含两个核心组件:实例中心语境精炼(ICR),该组件分别从VLM特征中提取实例内、实例间和全局语境线索,以及渐进语境聚合(ProCA),该组件迭代融合这些多语境特征与实例级检测器特征,以支持高级HOI推理。在HICO-DET和V-COCO基准上的广泛实验表明,InCoM-Net 达到了最先进的性能,超越了之前的HOI检测方法。代码可在 https://github.com/nowuss/InCoM-Net 获取。
Summary / 总结
The research aims to enhance Human-Object Interaction (HOI) detection by integrating rich semantic knowledge from Vision-Language Models (VLMs) with instance-specific features. The proposed Instance-centric Context Mining Network (InCoM-Net) refines intra-instance, inter-instance, and global contextual cues, and progressively aggregates them to support high-level HOI reasoning. Experiments on HICO-DET and V-COCO benchmarks demonstrate that InCoM-Net outperforms existing methods, achieving state-of-the-art performance in HOI detection. Code is available at https://github.com/nowuss/InCoM-Net.
研究旨在通过结合Vision-Language模型(VLM)的丰富语义知识和实例特定特征来提升人类物体交互(HOI)检测。提出的Instance-centric Context Mining Network(InCoM-Net)分别提炼实例内、实例间和全局上下文线索,并逐步融合这些多上下文特征以支持高级HOI推理。在HICO-DET和V-COCO基准上的实验表明,InCoM-Net在HOI检测中超越了现有方法,达到最先进的性能。代码可在https://github.com/nowuss/InCoM-Net获取。
Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models
Authors: Issa Sugiura, Keito Sasagawa, Keisuke Nakao, Koki Maeda, Ziqi Yin, Zhishen Yang, Shuhei Kurita, Yusuke Oda, Ryoko Tokuhisa, Daisuke Kawahara, Naoaki Okazaki
First: 2026-04-02T13:48:43+00:00 · Latest: 2026-04-02T13:48:43+00:00
Comments: 18 pages, 7 figures
Abstract
Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.
中文标题/摘要
标题:Jagle:构建大规模日语多模态后训练数据集以支持视觉-语言模型
开发能够跨多种任务泛化的视觉-语言模型(VLMs)需要大规模的训练数据集,这些数据集包含多样化的内容。在英语中,这样的数据集通常通过聚合和整理大量的现有视觉问答(VQA)资源来构建。然而,这种方法并不容易扩展到其他语言,在这些语言中,VQA数据集在规模和领域覆盖方面都受到限制,这构成了构建高质量的多语言和非英语VLMs的主要障碍。在本文中,我们介绍了迄今为止最大的日语多模态后训练数据集Jagle,包含约920万实例,涵盖了多种任务。我们没有依赖现有的VQA数据集,而是收集了异质源数据,包括图像、图像-文本对和PDF文档,并通过多种策略生成VQA对,如基于VLM的问答生成、翻译和文本渲染。实验表明,使用Jagle训练的2.2B参数模型在日语任务上表现出色,在十个日语评估任务上的平均得分超过InternVL3.5-2B,并与Qwen3-VL-2B-Instruct的差距缩小到五分以内。此外,将Jagle与FineVision结合使用不会降低英语性能;相反,与仅使用FineVision训练相比,英语性能有所提高。为了促进可重复性和未来研究,我们发布了数据集、训练模型和代码。
Summary / 总结
Jagle is the largest Japanese multimodal post-training dataset to date, comprising about 9.2 million instances. It is built by collecting heterogeneous sources such as images, image-text pairs, and PDF documents, and generating VQA pairs through VLM-based QA generation, translation, and text rendering. A 2.2B model trained with Jagle surpasses InternVL3.5-2B in average score across ten Japanese evaluation tasks and comes within five points of Qwen3-VL-2B-Instruct. Additionally, combining Jagle with FineVision improves English performance compared to training with FineVision alone.
研究旨在开发大规模的日语多模态数据集,以增强视觉语言模型(VLMs)在多样任务上的泛化能力。作者介绍了Jagle数据集,包含约920万实例,来源于图像、图像-文本对和PDF文档等多种来源,并通过多种策略生成VQA对。实验表明,使用Jagle训练的2.2B模型在十个日语评估任务上优于InternVL3.5-2B,并接近Qwen3-VL-2B-Instruct的性能。此外,将Jagle与FineVision结合使用可以提高英语性能,优于单独使用FineVision的训练效果。
Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
Authors: Tao Jin, Phuong Minh Nguyen, Naoya Inoue
First: 2026-04-02T13:48:42+00:00 · Latest: 2026-04-02T13:48:42+00:00
Abstract
Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
中文标题/摘要
标题:Goose:用于免训练推测解码的各向异性推测树
推测解码通过起草多个候选令牌并在单次前向传递中验证它们来加速大型语言模型的推理。候选令牌组织成一棵树:更深的树每步接受更多令牌,但在固定验证预算下增加深度需要牺牲宽度(备用选项)。现有免训练方法从单一令牌源起草,并且在塑造树结构时不区分不同来源的候选质量。我们观察到,两种常见的免训练令牌源(从输入上下文中复制的n-gram匹配,以及来自先前前向传递的统计预测)在接受率上存在巨大差异(中位数差距约为6倍,在五种模型和五种基准测试上范围为2-18倍)。我们证明,当存在这种质量差距时,最优树是各向异性的(不对称):可靠的令牌应形成一条深链,而不可靠的令牌则扩展为宽分支,突破平衡树的深度限制。我们在GOOSE中实现了这一结构,这是一种免训练框架,构建自适应脊柱树:一条由高接受率上下文匹配令牌组成的深链,每个节点处带有由低接受率替代选项组成的宽分支。我们证明,每步接受的令牌数量至少与单独使用任一来源时一样多。在五种LLM(7B-33B)和五种基准测试上,GOOSE实现了1.9-4.3倍无损加速,在相同预算下比平衡树基线高出12-33%。
Summary / 总结
Goose is a training-free speculative decoding framework that organizes candidate tokens into an anisotropic tree: a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node, making better use of a fixed verification budget. On five large language models ranging from 7B to 33B parameters and five benchmarks, Goose achieves a 1.9-4.3x lossless speedup and outperforms balanced-tree baselines by 12-33% under the same budget.
Goose 是一种无需训练的推测性解码框架,通过使用非对称推测树来最大化每步接受的令牌数量,这些树由高接受率的上下文匹配令牌组成的深链和每个节点上低接受率的替代分支组成。在从7B到33B参数的五个大型语言模型和五个基准测试上,Goose 达到了1.9-4.3倍的无损加速,相比平衡树基线,在相同验证预算下性能高出12-33%。
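The spine-tree shape described in this entry can be sketched as a flat list of nodes with parent pointers. This is only an illustration of the shape, not GOOSE's adaptive construction: the function `build_spine_tree`, the fixed `width` knob, and the toy tokens are all hypothetical, and the allocation policy here is a simple greedy one assumed for the example.

```python
def build_spine_tree(chain, alternatives, budget, width):
    """Return a list of (token, parent_index) nodes; parent -1 is the root.

    chain: high-acceptance draft tokens that form the deep spine.
    alternatives: alternatives[d] holds lower-acceptance candidates for position d
        (assumed to have one entry per chain position).
    budget: maximum number of nodes that can be verified in one pass.
    width: maximum number of side branches tried per position (hypothetical knob).
    """
    nodes = []
    parent = -1  # index of the spine node that the next position hangs from
    for depth, token in enumerate(chain):
        if len(nodes) >= budget:
            break
        nodes.append((token, parent))
        spine_index = len(nodes) - 1
        # Side branches are siblings of the spine token, so they share its parent.
        for alt in alternatives[depth][:width]:
            if len(nodes) >= budget:
                break
            if alt != token:  # skip duplicates of the spine token
                nodes.append((alt, parent))
        parent = spine_index  # only the spine is extended deeper
    return nodes

tree = build_spine_tree(
    chain=["the", "cat", "sat"],
    alternatives=[["a", "the"], ["dog"], ["ran", "slept"]],
    budget=6,
    width=2,
)
```

Only the spine token at each depth gets children, so depth grows along the reliable chain while the less reliable candidates cost one node each as fallbacks. A real implementation would also have to decide how to trade spine depth against branch width when the budget is tight, which this greedy sketch does not attempt.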
Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence
Authors: Dian Liu, Jie Feng, Di Li, Yuhui Zheng, Guanbin Li, Weisheng Dong, Guangming Shi
First: 2026-04-02T13:22:57+00:00 · Latest: 2026-04-02T13:22:57+00:00
Abstract
Synergistic spatial intelligence between UAVs and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated Unmanned Aerial Vehicle (UAV) videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS$^2$Bench, the first comprehensive benchmark designed to evaluate VLMs' wide-area, dynamic cross-view spatial intelligence. LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. Through an LMM-assisted pipeline and rigorous human annotation, we constructed 17.9k high-quality question-answer pairs comprising 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 representative VLMs reveal a substantial gap compared to human baselines, identifying accurate cross-view dynamic alignment as the critical bottleneck. To alleviate this, we design a Cross-View Alignment Adapter, demonstrating that explicit alignment significantly improves model performance. Furthermore, fine-tuning experiments underscore the potential of LinkS$^2$Bench in advancing VLM adaptation for complex spatial reasoning.
中文标题/摘要
标题:VLMs在天地之间迷失了吗?LinkS$^2$Bench用于无人机-卫星动态跨视角空间智能
无人机与卫星之间的协同空间智能对于应急响应和安全操作至关重要,因为它能够独特地结合宏观全球覆盖与动态实时的局部感知。然而,视觉-语言模型(VLMs)掌握这种复杂互动的能力仍然鲜有探索。这一差距主要因为现有基准局限于孤立的无人机视频或静态卫星图像,未能评估全面跨视角推理所需的动态局部到全局的空间映射。为弥补这一差距,我们引入了LinkS$^2$Bench,这是首个用于评估VLMs广泛区域动态跨视角空间智能的综合基准。LinkS$^2$Bench将1,022分钟的动态无人机视频与覆盖超过200平方公里的高分辨率卫星图像相连。通过LMM辅助管道和严格的真人注释,我们构建了17,900个高质量的问题-答案对,涵盖四个维度的12个细粒度任务:感知、定位、关系和推理。18个代表性VLMs的评估显示与人类基准相比存在显著差距,准确的跨视角动态对齐是关键瓶颈。为缓解这一问题,我们设计了跨视角对齐适配器,表明显式对齐显著提高了模型性能。此外,微调实验强调了LinkS$^2$Bench在推进VLM适应复杂空间推理方面的潜力。
Summary / 总结
This paper addresses the gap in evaluating Vision-Language Models (VLMs) for dynamic cross-view spatial intelligence between UAVs and satellites. It introduces LinkS$^2$Bench, a comprehensive benchmark that links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. The benchmark includes 17.9k high-quality question-answer pairs covering 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 VLMs show a substantial gap compared to human baselines and identify accurate cross-view dynamic alignment as the critical bottleneck. The authors design a Cross-View Alignment Adapter, showing that explicit alignment significantly improves model performance, and fine-tuning experiments underscore the benchmark's potential for adapting VLMs to complex spatial reasoning.
论文提出了LinkS$^2$Bench,这是一个用于评估Vision-Language模型(VLMs)在无人机和卫星之间动态跨视图空间智能的新基准。它通过链接1,022分钟的无人机视频和高分辨率卫星图像,覆盖超过200平方公里的区域,填补了现有基准的空白。对18个VLMs的评估显示,与人类基线相比,模型在跨视图动态对齐方面存在显著差距。作者提出了一种跨视图对齐适配器来提高模型性能,并强调LinkS$^2$Bench在复杂空间推理任务中推进VLMs的潜力。
Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation
Authors: Jie Feng, Fengze Li, Junpeng Zhang, Siyu Chen, Yuping Liang, Junying Chen, Ronghua Shang
First: 2026-04-02T13:15:05+00:00 · Latest: 2026-04-02T13:15:05+00:00
Abstract
Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its global-aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, failing to effectively delineate boundaries while risking the disruption of CLIP's semantic integrity. To address this limitation, we propose DR-Seg, a novel decouple-and-rectify framework in this paper. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state-of-the-art.
中文标题/摘要
标题:解耦与校正:开放词汇语义保留结构增强方法在遥感分割中的应用
遥感(RS)领域的开放词汇语义分割需要语言对齐的识别和精细的空间界定。尽管CLIP提供了强大的语义泛化能力,但其全局对齐的视觉表示在捕捉结构细节方面存在固有困难。最近的方法通过引入RS预训练的DINO特征来弥补这一不足。然而,这些方法将CLIP表示视为一个统一的语义空间,无法定位需要结构增强的地方,无法有效界定边界,同时可能破坏CLIP的语义完整性。为解决这一局限,本文提出了一种新颖的解耦与校正框架DR-Seg。我们的方法基于一个关键观察:CLIP特征通道表现出功能异质性,而不是形成一个统一的语义空间。基于这一洞察,DR-Seg将CLIP特征分解为语义主导子空间和结构主导子空间,通过DINO实现有针对性的结构增强,而不破坏语言对齐的语义。随后,一个先验驱动的图校正模块在DINO的引导下注入高保真结构先验,形成一个精炼分支,而一个基于不确定性自适应融合模块动态将该精炼分支与原始CLIP分支融合,以进行最终预测。在八个基准上的全面实验表明,DR-Seg建立了新的性能最佳水平。
Summary / 总结
The paper proposes DR-Seg, a decouple-and-rectify framework for open-vocabulary remote sensing segmentation. Motivated by the observation that CLIP feature channels have distinct functional heterogeneity, DR-Seg separates CLIP features into semantic and structural subspaces, allowing targeted structural enhancement by DINO while preserving semantic integrity. The method includes a graph rectification module that injects structural priors and an adaptive fusion module that integrates the refined branch with the original CLIP branch. Experiments show DR-Seg outperforms existing methods across eight benchmarks.
研究旨在通过解决CLIP在捕捉结构细节方面的局限性,改进遥感领域的开放词汇语义分割。DR-Seg 提出了一种解耦和校正框架,将CLIP特征分离为语义主导和结构主导子空间,允许在不破坏语义一致性的情况下,通过DINO特征进行有针对性的结构增强。实验表明,DR-Seg 在八个基准测试中超越了现有方法,达到了新的最佳水平。
Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models
Authors: Osher Rafaeli, Tal Svoray, Ariel Nahlieli
First: 2026-04-02T13:13:17+00:00 · Latest: 2026-04-02T13:13:17+00:00
Abstract
Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches primarily rely on spatial interpolation or on methods that assume spatial continuity and therefore fail when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, limiting their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights. Experiments demonstrate consistent improvements over interpolation-based methods, prior-based height rescaling approaches, and state-of-the-art monocular depth estimation (MDE) models. Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE, and further enables DSM updating and coupled RGB-DSM generation.
中文标题/摘要
标题:测试时自适应的高程完成方法:基于自我监督的ViT特征和单目基础模型
准确的数字表面模型(DSMs)对于许多地理空间应用至关重要,包括城市监测、环境分析、基础设施管理和变化检测。然而,大规模的DSMs经常包含不完整或过时的区域,这可能是由于获取限制、重建伪影或建成环境的变化。传统的高程补全方法主要依赖于空间插值或假设空间连续性的方法,因此在物体缺失时会失效。最近的基于学习的方法可以提高重建质量,但通常需要在特定传感器数据集上进行监督训练,这限制了它们在不同领域和传感条件下的泛化能力。我们提出了一种名为Prior2DSM的无需训练的框架,该框架完全在测试时运行,利用基础模型进行度量DSM补全。与之前需要特定任务训练的高程补全方法不同,所提出的方法结合了来自DINOv3的自监督Vision Transformer(ViT)特征和单目深度基础模型,通过语义特征空间对应关系,从不完整的高度先验中传播度量信息。测试时自适应(TTA)使用参数高效的低秩适应(LoRA)与轻量级多层感知机(MLP)一起进行,该MLP预测空间变化的尺度和偏移参数,将相对深度估计转换为度量高度。实验结果表明,该方法相对于基于插值的方法、基于先验的高度重缩放方法以及最先进的单目深度估计(MDE)模型均表现出一致的改进。Prior2DSM减少了重建误差,同时保持了结构保真度,与MDE的线性拟合相比,RMSE最多可减少46%,并进一步支持DSM更新和耦合RGB-DSM生成。
Summary / 总结
The research aims to address the issue of incomplete or outdated regions in large-scale digital surface models (DSMs) by proposing Prior2DSM, a training-free framework that leverages self-supervised ViT features and monocular depth foundation models for metric DSM completion at test time. The method uses parameter-efficient low-rank adaptation (LoRA) and a lightweight MLP to adapt relative depth estimates into metric heights, showing consistent improvements over interpolation and prior-based methods, with up to a 46% reduction in RMSE compared to linear fitting of monocular depth estimation models.
研究旨在通过提出Prior2DSM框架解决DSM中不完整或过时区域的问题,该框架利用自监督的Vision Transformer特征和单目深度基础模型在测试时进行度量DSM完成。该方法使用参数高效的低秩适应和轻量级MLP将相对深度估计转换为度量高度,实验结果表明,该方法在重建误差和结构保真度方面优于现有方法,与单目深度估计模型的线性拟合相比,RMSE最多可减少46%。
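The scale-and-shift conversion mentioned in this entry can be illustrated with its simplest form: one global scale and one global shift fitted by least squares on pixels where the height prior is known. This is a sketch of a global linear fit, which is one plausible reading of the "linear fitting of MDE" baseline; it is not Prior2DSM itself, which predicts spatially varying parameters. The function name `fit_scale_shift` and the toy numbers are hypothetical.

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~ scale * relative + shift over known samples."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r          # slope of the regression line
    shift = mean_m - scale * mean_r  # intercept
    return scale, shift

# Relative depth at pixels where the incomplete DSM still has valid heights.
relative_known = [0.1, 0.4, 0.7, 1.0]
heights_known = [2.0, 8.0, 14.0, 20.0]  # toy data lying on height = 20 * relative

scale, shift = fit_scale_shift(relative_known, heights_known)

# Apply the fitted transform to pixels whose heights are missing.
completed = [scale * r + shift for r in [0.25, 0.55]]
```

A single global pair works only when the relative depth map is consistent across the whole tile, which is the limitation that spatially varying scale and shift parameters are meant to address.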
Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
Authors: Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou, Jiwen Lu
First: 2026-04-02T12:51:07+00:00 · Latest: 2026-04-02T12:51:07+00:00
Abstract
Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
中文标题/摘要
标题:注意力静止则保持静止:打破视觉惯性以减轻认知幻觉
如同静止的物体保持静止,我们发现多模态大型语言模型(MLLMs)中的视觉注意力表现出明显的惯性,在早期解码步骤中一旦稳定下来就保持相对静止,无法支持认知推理所需的组合理解。现有的幻觉缓解方法主要针对与物体存在或属性相关的感知幻觉,但对于需要物体间关系推理的认知幻觉却无能为力。通过词元级别的注意力分析,我们发现这种视觉惯性是关键因素:对语义关键区域的注意力保持持续聚焦,无法动态支持关系推理。因此,我们提出了一种无需训练的惯性感知视觉激发(IVE)方法,通过将认知推理建模为视觉注意力的动态响应来打破这种惯性模式。具体而言,IVE 选择相对于历史注意力趋势动态出现的视觉词元,同时区分表现出惯性行为的词元。为了进一步促进组合推理,IVE 引入了一种惯性感知惩罚,以防止过度集中并限制注意力在局部区域的持久性。广泛的实验表明,IVE 在各种基础 MLLMs 和多个幻觉基准测试中都有效,特别是在认知幻觉方面。
Summary / 总结
The study addresses the issue of visual inertia in multimodal large language models (MLLMs), where attention remains static and fails to support compositional understanding needed for cognitive inference. By analyzing token-wise attention, the research identifies this inertia as a key factor in cognitive hallucinations. To mitigate this, the study proposes an Inertia-aware Visual Excitation (IVE) method that models cognitive inference as dynamic responsiveness of visual attention, selecting tokens that are dynamically emerging and discouraging over-concentration. Experiments show IVE is effective in various MLLMs and hallucination benchmarks, especially for cognitive hallucinations.
研究针对多模态大型语言模型(MLLMs)中视觉惯性问题,即注意力保持静态,无法支持所需的组成性推理。作者提出了一种惯性感知视觉激发(IVE)方法,通过动态建模视觉注意力来打破这种惯性。IVE 选择动态出现的令牌,并引入惯性感知惩罚以防止过度集中,从而增强组成性推理。实验表明,IVE 在不同 MLLMs 和基准测试中有效缓解了认知幻觉问题。
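The idea of picking tokens that are "dynamically emerging relative to historical attention trends" can be sketched with a running average. This is a hypothetical illustration only: the abstract does not say how IVE tracks the trend, so the exponential moving average, the `momentum` and `margin` parameters, and the function names below are assumptions made for the example.

```python
def update_trend(trend, attention, momentum=0.9):
    """Exponential moving average of per-token attention across decoding steps."""
    return [momentum * t + (1 - momentum) * a for t, a in zip(trend, attention)]

def emerging_tokens(trend, attention, margin=0.05):
    """Indices of visual tokens whose current attention exceeds the trend by `margin`."""
    return [i for i, (t, a) in enumerate(zip(trend, attention)) if a - t > margin]

# Toy attention over three visual tokens (hypothetical values).
trend = [0.5, 0.3, 0.2]        # historical average so far
attention = [0.45, 0.25, 0.30]  # attention at the current decoding step

rising = emerging_tokens(trend, attention)
trend = update_trend(trend, attention)
```

Tokens 0 and 1 sit slightly below their history and would count as inertial under this toy rule, while token 2 rises above its trend and would be selected.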
Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models
Authors: Antoine Saporta, Baptiste Callard, Corentin Dancette, Julien Khlaut, Charles Corbière, Leo Butsanets, Amaury Prat, Pierre Manceron
First: 2026-04-02T12:49:38+00:00 · Latest: 2026-04-02T12:49:38+00:00
Abstract
The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively with vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.
中文标题/摘要
标题:Curia-2:扩展自监督学习以优化放射学基础模型
医学影像的迅速增长推动了基础模型(FMs)的发展,以减轻放射学家日益增长且不可持续的工作负担。尽管最近的FMs展示了大规模预训练在CT和MRI分析中的强大能力,但这些模型从复杂放射学数据中学习的方式仍有很大的优化空间。基于Curia框架,这项工作引入了Curia-2,显著改进了原始的预训练策略和表示质量,更好地捕捉了放射学数据的特性。提出的方案使架构能够扩展到具有数十亿参数的视觉变换器,这是多模态CT和MRI FMs的首次。此外,我们通过扩展和重构CuriaBench,将其分为两个不同的赛道:一个针对切片视觉模型的2D赛道和一个用于体素基准测试的3D赛道。我们的结果显示,Curia-2在视觉任务上优于所有FMs,并在复杂的临床任务如检测方面与视觉语言模型竞争。权重将公开发布,以促进进一步的研究。
Summary / 总结
The paper introduces Curia-2, an enhanced version of the Curia framework for Foundation Models in radiology, which improves the pre-training strategy and representation quality. This allows scaling to billion-parameter Vision Transformers for CT and MRI analysis. The authors also formalize model evaluation by extending and restructuring CuriaBench into a 2D track for slice-based vision models and a 3D track for volumetric benchmarking. Curia-2 outperforms other FMs on vision-focused tasks and performs competitively with vision-language models on clinically complex tasks such as finding detection. The weights will be made publicly available to foster further research.
该论文介绍了Curia-2,这是Curia框架的增强版本,用于放射学中的基础模型,改进了预训练策略和表示质量。这使得可以将架构扩展到十亿参数的Vision Transformers,用于CT和MRI分析。作者还通过CuriaBench对模型进行了标准化评估,包括针对切片和体积任务的两个轨道。Curia-2在视觉任务中表现优于其他基础模型,并在检测等临床复杂任务中表现与视觉语言模型相当。权重将公开以促进进一步研究。
SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting
Authors: Di Wu, Liu Liu, Xueyu Yuan, Wenxiao Chen, Lijun Yue, Liuzhu Chen, Yiming Tang, Meng Wang
First: 2025-11-21T09:49:53+00:00 · Latest: 2026-04-02T12:37:26+00:00
Comments: 10 pages, 7 figures
Abstract
Articulated objects are ubiquitous in daily environments, and their 3D reconstruction holds great significance across various fields. However, existing articulated object reconstruction methods typically require costly inputs such as multi-stage and multi-view observations. To address the limitations, we propose a category-agnostic articulated object reconstruction framework via planar Gaussian Splatting, which only uses sparse-view RGB images from a single state. Specifically, we first introduce a Gaussian information field to perceive the optimal sparse viewpoints from candidate camera poses. To ensure precise geometric fidelity, we constrain traditional 3D Gaussians into planar primitives, facilitating accurate normal and depth estimation. The planar Gaussians are then optimized in a coarse-to-fine manner, regularized by depth smoothness and few-shot diffusion priors. Furthermore, we leverage a Vision-Language Model (VLM) via visual prompting to achieve open-vocabulary part segmentation and joint parameter estimation. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach significantly outperforms existing baselines, achieving superior part-level surface reconstruction fidelity. Code and data are provided in the supplementary material.
中文标题/摘要
标题:SPAGS:基于平面高斯泼溅的单状态稀疏视图铰接物体重建
铰接物体在日常环境中无处不在,它们的3D重建在多个领域具有重要意义。然而,现有的铰接物体重建方法通常需要多阶段和多视图观察等昂贵的输入。为了解决这些限制,我们提出了一种基于平面高斯泼溅的类别无关铰接物体重建框架,仅使用单状态下的稀疏视图RGB图像。具体而言,我们首先引入了一个高斯信息场,从候选相机姿态中感知最优稀疏视点。为了确保精确的几何保真度,我们将传统的3D高斯约束为平面基元,便于准确的法线和深度估计。然后,平面高斯以由粗到细的方式进行优化,并通过深度平滑和少样本扩散先验进行正则化。此外,我们通过视觉提示利用视觉-语言模型(VLM)实现开放词汇部件分割和关节参数估计。在合成和真实世界数据集上的广泛实验表明,我们的方法显著优于现有基线,实现了更优的部件级表面重建保真度。代码和数据在补充材料中提供。
Summary / 总结
The paper proposes SPAGS, a method for reconstructing articulated objects from a single state using sparse-view RGB images. It introduces a Gaussian information field to select optimal viewpoints and uses planar Gaussian splatting to estimate normals and depths. The approach is optimized in a coarse-to-fine manner and leverages a Vision-Language Model for part segmentation and joint parameter estimation. Experiments show that SPAGS outperforms existing methods in part-level surface reconstruction fidelity on both synthetic and real-world datasets.
论文旨在解决现有铰接物体重建方法需要多阶段和多视角观察的局限性,提出了SPAGS框架,仅使用单状态下的稀疏视图RGB图像,并基于平面高斯泼溅进行重建。该方法引入高斯信息场来选择最优视点,并将3D高斯约束为平面基元以实现准确的法线和深度估计。实验表明,SPAGS在合成和真实世界数据集上的部件级表面重建保真度优于现有方法。
Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts
Authors: Yifan Gao, Tao Zhou, Yi Zhou, Ke Zou, Yizhe Zhang, Huazhu Fu
First: 2026-04-02T11:31:30+00:00 · Latest: 2026-04-02T11:31:30+00:00
Comments: 10 pages, 6 figures
Abstract
Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks sufficient spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.
中文标题/摘要
标题:通过知识引导的空间提示增强医学视觉定位
医学视觉定位(MVG)旨在从自由文本放射学报告中识别出诊断相关的短语,并定位其在医学图像中的对应区域,为临床决策提供可解释的视觉证据。尽管最近的视觉-语言模型(VLMs)展示了有希望的多模态推理能力,但在依赖潜在嵌入时缺乏明确的定位先验,导致其定位的空间精度仍然不足。在本文中,我们从注意力机制的角度分析了这一局限性,并提出了一种名为KnowMVG的知识先验和全局-局部注意力增强框架,以在VLMs中增强MVG的空间意识。具体而言,我们提出了一种知识增强的提示策略,将与短语相关的医学知识编码为紧凑的嵌入,结合全局-局部注意力机制,共同利用粗略的全局信息和精细的局部线索来引导精确的区域定位。此设计在不引入额外文本推理开销的情况下,将高层次的语义理解与精细的视觉感知相结合。在四个MVG基准上的广泛实验表明,我们的KnowMVG在AP50和mIoU上分别比先前的最先进方法提高了3.0%和2.6%。进一步的定性和消融研究也验证了每个组件的有效性。
Summary / 总结
This paper addresses the limitation of spatial precision in Medical Visual Grounding (MVG) by proposing KnowMVG, a framework that enhances spatial awareness in Vision-Language Models (VLMs) through knowledge-guided spatial prompts and global-local attention. KnowMVG improves the localization of diagnostically relevant phrases in medical images, leading to better clinical decision support. Experiments on four MVG benchmarks show that KnowMVG outperforms existing methods by 3.0% in AP50 and 2.6% in mIoU.
该研究通过提出KnowMVG框架来增强视觉语言模型(VLMs)在医学视觉定位(MVG)中的空间精度。KnowMVG使用知识增强的提示策略将医学知识编码为嵌入,并使用全局-局部注意力机制引导精确的区域定位。实验表明,KnowMVG在四个MVG基准测试中分别在AP50和mIoU上比现有方法高出3.0%和2.6%。
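The two reported metrics both rest on box intersection-over-union (IoU). The sketch below shows the standard IoU computation, a mean IoU over samples, and the fraction of predictions with IoU of at least 0.5. The exact AP50 protocol used by KnowMVG is not given in the abstract, so the `hit_rate_50` quantity is a simplified stand-in, and the boxes are toy values.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero when boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# One predicted box and one ground-truth box per grounded phrase (toy values).
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
truth = [(0, 0, 10, 10), (5, 0, 15, 10)]

ious = [box_iou(p, t) for p, t in zip(preds, truth)]
mean_iou = sum(ious) / len(ious)
hit_rate_50 = sum(iou >= 0.5 for iou in ious) / len(ious)
```

The second pair overlaps on half of each box, giving an IoU of one third, so it counts toward the mean but not toward the 0.5 threshold.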
Robot Collapse: Supply Chain Backdoor Attacks Against VLM-based Robotic Manipulation
Authors: Xianlong Wang, Hewen Pan, Hangtao Zhang, Minghui Li, Shengshan Hu, Ziqi Zhou, Lulu Xue, Peijin Guo, Aishan Liu, Leo Yu Zhang, Xiaohua Jia
First: 2024-11-18T16:09:26+00:00 · Latest: 2026-04-02T10:50:53+00:00
Abstract
Robotic manipulation policies are increasingly empowered by large language models (LLMs) and vision-language models (VLMs), leveraging their understanding and perception capabilities. Recently, inference-time attacks against robotic manipulation have been extensively studied, yet backdoor attacks targeting model supply chain security in robotic policies remain largely unexplored. To fill this gap, we propose TrojanRobot, a backdoor injection framework for model supply chain attack scenarios, which embeds a malicious module into modular robotic policies via backdoor relationships to manipulate the LLM-to-VLM pathway and compromise the system. Our vanilla design instantiates this module as a backdoor-finetuned VLM. To further enhance attack performance, we propose a prime scheme by introducing the concept of LVLM-as-a-backdoor, which leverages in-context instruction learning (ICIL) to steer large vision-language model (LVLM) behavior through backdoored system prompts. Moreover, we develop three types of prime attacks, permutation, stagnation, and intentional, achieving flexible backdoor attack effects. Extensive physical-world and simulator experiments on 18 real-world manipulation tasks and 4 VLMs verify the superiority of the proposed TrojanRobot.
中文标题/摘要
标题:机器人坍塌:针对基于VLM的机器人操作的供应链后门攻击
机器人的操作策略越来越多地受到大型语言模型(LLMs)和视觉语言模型(VLMs)的赋能,利用它们的理解和感知能力。最近,针对机器人操作的推理时攻击得到了广泛研究,但针对模型供应链安全的后门攻击在机器人策略中仍然鲜有探索。为填补这一空白,我们提出了TrojanRobot,一种针对模型供应链攻击场景的后门注入框架,通过后门关系将恶意模块嵌入模块化机器人策略中,操控LLM到VLM路径并破坏系统。我们的基础设计将此模块实例化为后门微调的VLM。为进一步增强攻击性能,我们提出了进阶(prime)方案,引入LVLM-as-a-backdoor的概念,利用上下文内指令学习(ICIL),通过带后门的系统提示引导大型视觉语言模型(LVLM)的行为。此外,我们开发了三种类型的进阶攻击:排列、停滞和故意,实现了灵活的后门攻击效果。在18个真实世界的操作任务和4个VLM上的物理世界和模拟器实验验证了所提出的TrojanRobot的优越性。
Summary / 总结
This paper addresses the security vulnerability in robotic manipulation policies that rely on large language models (LLMs) and vision-language models (VLMs). It introduces TrojanRobot, a backdoor injection framework that embeds a malicious module into robotic policies to manipulate the LLM-to-VLM pathway. The study proposes a prime scheme using in-context instruction learning (ICIL) to steer LVLM behavior through backdoored system prompts, and demonstrates three types of prime attacks: permutation, stagnation, and intentional. Experiments on 18 real-world manipulation tasks and 4 VLMs show the effectiveness of the proposed framework in compromising robotic systems.
论文关注依赖大型语言模型(LLM)和视觉语言模型(VLM)的机器人操作策略的安全漏洞。它引入了TrojanRobot,这是一种后门注入框架,将恶意模块嵌入到机器人策略中,以操纵LLM到VLM的路径。其基础设计使用后门微调的VLM;进阶(prime)方案则引入LVLM-as-a-backdoor,利用上下文内指令学习(ICIL),通过带后门的系统提示引导LVLM行为。论文开发了三种类型的进阶攻击:置换、停滞和故意,以实现灵活的后门效果。在18个真实世界的操作任务和4个VLM上的实验验证了所提出的TrojanRobot框架的有效性。
One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Authors: Ezzeldin Shereen, Dan Ristea, Shae McFadden, Burak Hasircioglu, Vasilios Mavroudis, Chris Hicks
Venue: Transactions on Machine Learning Research (TMLR), 2026
First: 2025-04-02T21:08:33+00:00 · Latest: 2026-04-02T10:17:08+00:00
Comments: Published in Transactions on Machine Learning Research (03/2026)
Abstract
Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, yet is robust to black-box attacks in the universal setting.
中文标题/摘要
标题:仅需一张图片:通过单张图像对视觉文档增强生成进行投毒攻击
检索增强生成(RAG)通过使用事实知识库(KB)来抑制大型语言模型(LLMs)中的幻觉,起到了关键作用。尽管PDF文档是知识的重要来源,但基于文本的RAG管道无法有效捕捉其丰富的多模态信息。相比之下,视觉文档RAG(VD-RAG)使用文档页面的截图作为KB,已被证明能够达到最先进的效果。然而,通过引入图像模态,VD-RAG为对手提供了新的攻击途径,通过向KB注入恶意文档来破坏系统。在本文中,我们展示了VD-RAG在检索和生成方面都容易受到投毒攻击。我们定义了两种攻击目标,并证明只需向KB注入一张对抗性图像即可实现这两种目标。首先,我们介绍了一种针对一个或一组查询的定向攻击,其目标是传播有针对性的虚假信息。其次,我们提出了一种通用攻击,对于任何潜在的用户查询,都会影响响应,导致VD-RAG系统的服务中断。我们在白盒和黑盒假设下研究了这两种攻击目标,采用多目标梯度优化方法以及提示最先进的生成模型。使用两个视觉文档数据集、一组多样化的最先进的检索器(嵌入模型)和生成器(视觉语言模型),我们展示了VD-RAG在定向和通用设置下都容易受到投毒攻击,但在通用设置下对黑盒攻击具有鲁棒性。
Summary / 总结
This paper investigates the vulnerability of visual document retrieval-augmented generation (VD-RAG) systems to poisoning attacks. It demonstrates that a single adversarial image can be used to either spread targeted disinformation or cause a denial-of-service for any potential query. The study employs a multi-objective gradient-based optimization approach and shows that VD-RAG is susceptible to both targeted and universal attacks, though it remains robust to black-box attacks in the universal setting.
本文探讨了视觉文档检索增强生成(VD-RAG)系统对投毒攻击的脆弱性。研究表明,通过向知识库注入单一的恶意图像,攻击者可以传播针对性的虚假信息或导致服务中断。该研究采用多目标梯度优化方法,表明VD-RAG在目标攻击和通用攻击场景下都存在漏洞,但在通用攻击场景下对黑盒攻击具有鲁棒性。
Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance
Authors: Jason Qiu, Zachary Meurer, Xavier Thomas, Deepti Ghadiyaram
First: 2026-04-02T10:02:49+00:00 · Latest: 2026-04-02T10:02:49+00:00
Abstract
This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
中文标题/摘要
标题:语义丰富性还是几何推理?VLM视觉不变性的脆弱性
这项工作研究了最先进的视觉-语言模型(VLMs)在基本几何变换下的根本脆弱性。尽管现代VLMs在识别处于标准方向的对象和描述复杂场景等语义任务上表现出色,但在更基本的层面上,它们表现出系统性的失败:缺乏在简单旋转、缩放和恒等变换下可靠确定物体身份所需的稳健的空间不变性和协变性。我们通过在包括符号草图、自然照片和抽象艺术在内的多种视觉领域进行系统评估,展示了这一局限性。随着语义内容的稀疏,性能急剧下降,这种行为在不同架构、模型容量和提示策略中均被观察到。总体而言,我们的结果揭示了当前VLMs在语义理解和空间推理之间的系统性差距,强调了未来多模态系统中更强的几何基础的必要性。
Summary / 总结
This work examines the fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations, showing that while VLMs perform well on semantic tasks, they struggle with fundamental spatial invariance and equivariance required for object identity determination under simple rotations, scaling, and identity transformations. Performance drops significantly as semantic content decreases, observed across various visual domains and model architectures. This highlights the need for improved geometric reasoning in VLMs.
这项研究探讨了最先进的视觉-语言模型(VLMs)在基本几何变换下的脆弱性。尽管VLMs在语义任务上表现出色,但在基本的空间不变性和协变性方面,它们难以确定物体在简单变换下的身份,导致系统性失败。研究在多种视觉领域评估了VLMs,并发现当语义内容稀少时,性能显著下降,表明当前VLMs在语义理解和空间推理之间存在差距,未来需要更强的空间几何基础。
Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
Authors: Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin
First: 2026-04-02T09:53:20+00:00 · Latest: 2026-04-02T09:53:20+00:00
Abstract
While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate \textit{Token Visual Dependency}, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), which is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published on https://github.com/Yzk1114/PGPO.
中文标题/摘要
标题:并非所有token都被同等看待:面向大型视觉-语言模型的基于感知的策略优化
虽然可验证奖励强化学习(RLVR)在大型视觉-语言模型(LVLMs)中提升了推理能力,但现有框架存在根本性的方法论缺陷:通过向所有生成的token分配相同的优势,这些方法会稀释对于优化关键的、基于视觉的推理步骤至关重要的学习信号。为解决这一问题,我们提出了Token视觉依赖性(Token Visual Dependency),通过计算视觉条件下的预测分布与仅基于文本的预测分布之间的Kullback-Leibler(KL)散度来量化视觉输入的因果信息增益。我们发现这种依赖性高度稀疏且在语义上至关重要,据此引入了基于感知的策略优化(PGPO),这是一种新颖的细粒度信用分配框架,能够动态地在token级别重塑优势。通过一个阈值门控、质量守恒的机制,PGPO能够积极放大依赖视觉的token的学习信号,同时抑制语言先验带来的梯度噪声。基于Qwen2.5-VL系列在七个具有挑战性的多模态推理基准上的广泛实验表明,PGPO平均提升了模型18.7%。理论和实证分析均证实,PGPO有效减少了梯度方差,防止了训练崩溃,并作为稳健的、基于感知的多模态推理的有效正则化器。代码将在https://github.com/Yzk1114/PGPO上发布。
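The token-level credit assignment described in the PGPO abstract can be illustrated with a minimal sketch. It computes a per-token KL divergence between a visual-conditioned and a text-only next-token distribution, then reweights a shared advantage so that visually dependent tokens receive more of it while the total is conserved. The abstract does not give the actual gating rule, so the function names, the threshold, the boost factor, and the renormalization below are all assumptions made for illustration, not the authors' formulation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def token_visual_dependency(visual_dists, text_only_dists):
    """Per-token KL between visual-conditioned and text-only predictions."""
    return [kl_divergence(p, q) for p, q in zip(visual_dists, text_only_dists)]


def reshape_advantages(advantage, dependencies, threshold=0.1, boost=2.0):
    """Hypothetical threshold-gated, mass-conserving reweighting.

    Tokens whose dependency exceeds `threshold` get weight `boost`, the
    rest get weight 1; weights are rescaled to average 1 so the summed
    advantage over the sequence is unchanged.
    """
    raw = [boost if d > threshold else 1.0 for d in dependencies]
    scale = len(raw) / sum(raw)
    return [advantage * w * scale for w in raw]


if __name__ == "__main__":
    # Two tokens: the first changes a lot when the image is visible,
    # the second is predicted identically with or without it.
    visual = [[0.9, 0.1], [0.5, 0.5]]
    text_only = [[0.5, 0.5], [0.5, 0.5]]
    deps = token_visual_dependency(visual, text_only)
    print(deps)
    print(reshape_advantages(1.5, deps))
```

Rescaling the weights to average 1 is one simple way to read "mass-conserving": the sequence-level advantage stays the same and only its distribution over tokens changes.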
Efficient Reasoning with Balanced Thinking
Authors: Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian
Venue: ICLR 2026
First: 2026-03-12T18:48:07+00:00 · Latest: 2026-04-02T09:30:13+00:00
Comments: Accepted by ICLR 2026
Abstract
Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .
中文标题/摘要
标题:平衡思考实现高效推理
大型推理模型(LRMs)展示了出色的推理能力,但往往存在过度推理的问题,即在简单问题上浪费冗余计算步骤,或者存在欠推理的问题,即尽管具备内在能力,未能充分探索推理路径。这些问题导致了效率低下和潜在的不准确性,限制了其在资源受限环境中的实际部署。现有减少过度推理的方法,如抑制反思关键词或调整推理长度,可能会无意中导致欠推理,损害准确性。因此,我们提出了ReBalance,一种无需训练的框架,实现平衡思考下的高效推理。ReBalance 利用置信度作为推理动态的连续指标,通过高置信度波动识别过度推理,通过一致的过度自信识别欠推理。通过将小型数据集中的隐藏状态聚合为推理模式原型,我们计算一个引导向量来引导LRMs的推理轨迹。动态控制函数根据实时置信度调整该向量的强度和方向,在过度推理时修剪冗余,在欠推理时促进探索。在四个从0.5B到32B的模型以及九个涵盖数学推理、通用问答和编程任务的基准测试中进行的广泛实验表明,ReBalance 有效减少了输出冗余,提高了准确性,提供了一种通用、无需训练且即插即用的策略,用于高效和稳健的LRM部署。项目页面和代码可在 https://rebalance-ai.github.io 获取。
Summary / 总结
The paper addresses the inefficiencies of Large Reasoning Models (LRMs) due to overthinking or underthinking, proposing ReBalance, a training-free framework that uses confidence to balance reasoning dynamics. ReBalance identifies overthinking through high confidence variance and underthinking via consistent overconfidence, guiding LRMs to reduce redundancy and promote exploration. Experiments show that ReBalance improves accuracy while reducing output redundancy, making LRMs more efficient and robust for practical deployment.
论文提出了一种名为ReBalance的无训练框架,旨在平衡LRMs的过度思考和不足思考。ReBalance利用信心来引导LRMs,避免冗余步骤并促进在必要时进行探索。实验表明,ReBalance能够减少输出冗余并提高准确性,使LRMs在各种模型和任务中更高效和稳健,适用于实际部署。
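The ReBalance mechanism (prototypes aggregated from hidden states, a steering vector between them, and a confidence-driven control function) can be sketched roughly as follows. The abstract does not specify the control function or any thresholds, so `control_strength`, its threshold values, and the sign convention of the steering vector are hypothetical choices made only to show the shape of the idea.

```python
from statistics import mean, pvariance


def prototype(hidden_states):
    """Average a list of equal-length hidden-state vectors into one prototype."""
    return [mean(column) for column in zip(*hidden_states)]


def steering_vector(overthinking_states, underthinking_states):
    """Vector pointing from the overthinking prototype toward the underthinking one."""
    over = prototype(overthinking_states)
    under = prototype(underthinking_states)
    return [u - o for o, u in zip(over, under)]


def control_strength(recent_confidences, var_threshold=0.02, conf_threshold=0.9):
    """Hypothetical control rule driven by step-level confidence.

    High variance is treated as overthinking (steer away from it, +1);
    uniformly high confidence is treated as underthinking (steer back, -1);
    otherwise no steering is applied.
    """
    if pvariance(recent_confidences) > var_threshold:
        return 1.0
    if min(recent_confidences) > conf_threshold:
        return -1.0
    return 0.0


def steer(hidden, vector, strength):
    """Add the scaled steering vector to a hidden state."""
    return [h + strength * v for h, v in zip(hidden, vector)]


if __name__ == "__main__":
    vec = steering_vector([[1.0, 0.0]], [[0.0, 1.0]])
    strength = control_strength([0.2, 0.9, 0.3, 0.95])
    print(steer([0.0, 0.0], vec, strength))
```

A real control function would presumably vary strength continuously with confidence; the three-valued rule here is kept deliberately simple.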
Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning
Authors: Seyed Amir Kasaei, Arash Marioriyad, Mahbod Khaleti, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Venue: ICLR 2026 Workshop: From Human Cognition to AI Reasoning (HCAIR)
First: 2026-04-02T08:33:13+00:00 · Latest: 2026-04-02T08:33:13+00:00
Comments: Accepted at ICLR 2026 Workshop: From Human Cognition to AI Reasoning (HCAIR)
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.
中文标题/摘要
标题:隐含意义显而易见:RebusBench 用于评估认知视觉推理能力
大型视觉-语言模型(LVLMs)在显式的视觉识别方面取得了显著的成就,能够有效地描述图像中直接可见的内容。然而,当视觉输入仅作为线索而非答案时,一个关键的认知差距出现了。我们发现,当前的模型在解决需要复杂多步推理的问题时存在困难,这些问题中的信息并未明确呈现。成功解决谜语谜题需要一种独特的认知工作流程:模型必须提取视觉和文本属性,检索语言先验知识(如成语),并进行抽象映射,将这些元素综合成一种存在于像素空间之外的意义。为了评估这种神经符号能力,我们引入了RebusBench,这是一个包含1,164个谜题的基准测试,旨在测试这种感知与知识的特定整合。我们对最先进的模型(包括Qwen、InternVL和LLaVA)的评估显示,性能在精确匹配率低于10%和语义准确率低于20%时饱和,模型规模或上下文学习(ICL)均未观察到显著改进。这些发现表明,虽然模型具备必要的视觉和语言组件,但缺乏将它们连接起来的认知推理能力。项目页面可在https://amirkasaei.com/rebusbench/获取。
Summary / 总结
The research evaluates the cognitive visual reasoning capabilities of large vision-language models (LVLMs) by introducing RebusBench, a benchmark of 1,164 rebus puzzles. Models such as Qwen, InternVL, and LLaVA are tested on their ability to extract visual and textual attributes, retrieve linguistic prior knowledge, and synthesize these elements to solve problems whose answers are not explicitly depicted. Performance saturates below 10% Exact Match and below 20% semantic accuracy, with no significant improvement from model scaling or In-Context Learning (ICL), indicating that the models lack the cognitive reasoning needed to connect their visual and linguistic components.
研究旨在通过引入包含1,164个谜题的RebusBench基准来评估大型视觉-语言模型的认知视觉推理能力。方法是测试如Qwen、InternVL和LLaVA等最先进的模型在这些谜题上的表现,这些谜题需要结合视觉和文本信息进行复杂的多步推理。主要发现表明,这些模型表现不佳,精确匹配率低于10%,语义准确率低于20%,表明它们缺乏将视觉和语言信息整合起来的认知推理能力。
SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
Authors: Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, Yu Wang
First: 2026-01-23T07:28:53+00:00 · Latest: 2026-04-02T08:13:10+00:00
Abstract
Diffusion Transformers have demonstrated remarkable performance in video generation. However, their long input sequences incur substantial latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free approaches are limited to moderate sparsity and thus yield only modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation. In this work, we propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. Leveraging a Multi-level Static-Dynamic Scaling Strategy to balance the two branches, our method attains up to 90% sparsity and 1.52-2.03x inference speedup across different models and sequence lengths, while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples, fewer than 1,600 training steps, and no more than 30 GPU hours with a batch size of 8.
中文标题/摘要
标题:SALAD:通过高效线性注意力调优实现高稀疏度注意力以提高视频扩散变换器性能
扩散变换器在视频生成方面表现出色。然而,由于全注意力的二次复杂性,其长输入序列导致了显著的延迟。已经提出了各种稀疏注意力机制。无训练方法仅限于中等稀疏度,因此只能实现适度加速,而基于训练的方法可以达到更高的稀疏度,但需要大量的数据和计算。在本工作中,我们提出了SALAD,引入了一个轻量级的线性注意力分支与稀疏注意力并行。利用多级静态-动态缩放策略平衡两个分支,我们的方法在不同模型和序列长度上实现了高达90%的稀疏度和1.52-2.03倍的推理加速,同时保持与全注意力基线相当的生成质量。此外,我们的微调过程非常高效,只需要2,000个视频样本,少于1,600个训练步骤,且不超过30个GPU小时,批量大小为8。
Summary / 总结
The research addresses the latency of diffusion transformers for video generation by proposing SALAD, which introduces a lightweight linear attention branch in parallel with sparse attention. By balancing the two branches with a Multi-level Static-Dynamic Scaling Strategy, the method achieves up to 90% sparsity and a 1.52-2.03x inference speedup across different models and sequence lengths, while maintaining generation quality comparable to the full attention baseline. The finetuning process is efficient, requiring only 2,000 video samples, fewer than 1,600 training steps, and no more than 30 GPU hours with a batch size of 8.
研究旨在通过提出SALAD方法解决视频生成中扩散变换器的延迟问题,该方法引入了一个轻量级的线性注意力分支与稀疏注意力并行。这种方法使用多级静态-动态缩放策略来平衡两个分支,实现高达90%的稀疏度和1.52-2.03倍的推理加速,同时保持与全注意力基线相当的生成质量。微调过程非常高效,只需要2,000个视频样本和30个GPU小时的训练时间。
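The two-branch design in the SALAD entry (sparse softmax attention plus a parallel linear attention branch) can be sketched for a single query as below. This is an illustration under stated assumptions: the `elu(x) + 1` feature map is a common choice in linear attention, not something the abstract specifies, and the single scalar `gate` merely stands in for the Multi-level Static-Dynamic Scaling Strategy, whose details are not given.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention(q, keys, values, allowed):
    """Softmax attention restricted to the key indices listed in `allowed`."""
    scores = [dot(q, keys[j]) / math.sqrt(len(q)) for j in allowed]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * values[j][d] for w, j in zip(weights, allowed)) for d in range(dim)]


def feature_map(x):
    """elu(x) + 1: a positive feature map (assumed here, not from the source)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(q, keys, values):
    """Kernelized attention over all keys via a key-value summary."""
    phi_q = feature_map(q)
    phi_keys = [feature_map(k) for k in keys]
    dim = len(values[0])
    # Summaries over all keys: sum_j phi(k_j) v_j^T and sum_j phi(k_j).
    kv = [
        [sum(pk[i] * v[d] for pk, v in zip(phi_keys, values)) for d in range(dim)]
        for i in range(len(q))
    ]
    z = [sum(pk[i] for pk in phi_keys) for i in range(len(q))]
    denom = dot(phi_q, z)
    return [sum(phi_q[i] * kv[i][d] for i in range(len(q))) / denom for d in range(dim)]


def combined_attention(q, keys, values, allowed, gate):
    """Sparse branch plus a gated linear branch (gate is a hypothetical scalar)."""
    sparse_out = sparse_attention(q, keys, values, allowed)
    linear_out = linear_attention(q, keys, values)
    return [s + gate * l for s, l in zip(sparse_out, linear_out)]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(combined_attention([0.5, -0.5], keys, values, allowed=[0, 2], gate=0.5))
```

The linear branch sees every key at a cost that does not grow quadratically with sequence length, which is why it can compensate for the keys that the sparse mask drops.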
GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids
Authors: Mohamed Eltahir, Ahmed O. Ibrahim, Obada Siralkhatim, Tabarak Abdallah, Sondos Mohamed
First: 2026-03-26T14:08:41+00:00 · Latest: 2026-04-02T07:53:39+00:00
Abstract
Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11) and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at https://gridvad.github.io.
中文标题/摘要
标题:GridVAD:通过分层帧网格的空间推理实现开放集视频异常检测
视觉-语言模型(VLMs)是强大的开放集推理器,但在视频监控中直接用作异常检测器却很脆弱:没有校准的异常先验,它们会在漏检和幻觉式虚假警报之间交替。我们认为问题不在于VLM本身,而在于其使用方式。VLM应该作为异常提议者,生成开放集候选描述,然后由专门构建的空间和时间模块进行定位和跟踪。我们在GridVAD中实现了这一提议-定位-传播原则,这是一种无需训练的管道,能够在没有任何领域特定训练的情况下生成像素级异常掩码。VLM对视频片段的分层网格表示进行推理,生成自然语言异常提议。自我一致性聚合(SCC)通过仅保留多次独立采样中反复出现的提议来过滤幻觉。Grounding DINO将每个幸存的提议锚定到一个边界框,SAM2将其作为密集掩码在异常区间内传播。每个视频片段的VLM预算固定为M+1次调用,无论视频长度如何,M可以根据所需的提议数量进行设置。在UCSD Ped2上,GridVAD在所有比较方法中实现了最高的像素-AUROC(77.59),甚至超过了部分微调的TAO(75.11),在对象级RBDC上也比其他零样本方法高出5倍以上。消融实验表明,SCC提供了可控制的精确度-召回率权衡:过滤可以提高所有像素级别指标,同时在对象级别召回率上付出较小的代价。效率实验表明,GridVAD的调用效率是均匀逐帧VLM查询的2.7倍,同时还能生成密集分割掩码。代码和定性视频结果可在https://gridvad.github.io/获取。
Summary / 总结
GridVAD uses Vision-Language Models (VLMs) as anomaly proposers for open-set video anomaly detection, without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals, Self-Consistency Consolidation (SCC) keeps only proposals that recur across independent samplings, Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among the compared methods and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablations show that SCC provides a controllable precision-recall tradeoff, improving pixel-level metrics at a modest cost in object-level recall, and efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while also producing dense segmentation masks. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M is set according to the number of proposals needed.
GridVAD 提出了一种方法,利用 Vision-Language 模型(VLM)对视频中的开放集异常进行检测。它通过分层网格表示视频片段来生成自然语言的异常提案,使用自一致性汇聚(SCC)过滤这些提案,并将幸存的提案作为密集掩码传播到异常区间。在 UCSD Ped2 数据集上,GridVAD 达到了最高的像素 AUROC(77.59),并且比其他零样本方法高出超过 5 倍。消融实验表明,SCC 改善了精确率-召回率的权衡,而效率实验显示,GridVAD 比均匀的每帧 VLM 查询更高效,同时还能生成密集分割掩码。每段视频的 VLM 预算固定为 M+1 次调用,M 的值可以根据需要的提案数量进行调整。
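The Self-Consistency Consolidation step in the GridVAD entry (keep only proposals that recur across independent VLM samplings) can be sketched as a simple vote count. The abstract does not say how proposals are matched across samplings, so the exact-string matching after whitespace and case normalization, the function names, and the `min_votes` parameter are assumptions; a real system would likely need semantic matching.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def self_consistency_filter(samplings, min_votes=2):
    """Keep proposals that appear in at least `min_votes` independent samplings.

    `samplings` is a list of proposal lists, one list per VLM sampling.
    Each proposal is counted at most once per sampling.
    """
    votes = Counter()
    for proposals in samplings:
        for proposal in {normalize(p) for p in proposals}:
            votes[proposal] += 1
    return sorted(p for p, count in votes.items() if count >= min_votes)


if __name__ == "__main__":
    samplings = [
        ["Person riding a bike", "car on walkway"],
        ["person  riding a bike"],
        ["skateboarder", "Car on walkway", "person riding a bike"],
    ]
    print(self_consistency_filter(samplings, min_votes=2))
```

Raising `min_votes` trades recall for precision, which mirrors the controllable precision-recall tradeoff the abstract attributes to SCC.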