arXiv Paper Digest

Snapshot: 20260405_0344

Steerable Visual Representations

Authors: Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano

First: 2026-04-02T17:59:49+00:00 · Latest: 2026-04-02T17:59:49+00:00

Comments: preprint

Abstract

Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.

中文标题/摘要

标题：可引导的视觉表示

预训练的视觉变换器（ViTs）如DINOv2和MAE提供了通用的图像特征，可以应用于检索、分类和分割等多种下游任务。然而，这些表示往往集中在图像中最显眼的视觉线索上，没有方法可以将它们引导到感兴趣的不太显眼的概念上。相比之下，多模态LLMs可以通过文本提示进行引导，但生成的表示往往是语言中心的，对于通用的视觉任务效果不佳。为了解决这个问题，我们引入了可引导的视觉表示，这是一种新的视觉表示类别，其全局和局部特征可以通过自然语言进行引导。大多数视觉-语言模型（例如CLIP）在编码后将文本与视觉特征融合（晚期融合），而我们则通过轻量级的交叉注意力直接将文本注入视觉编码器的层中（早期融合）。我们引入了衡量表示可引导性的基准，并证明我们的可引导视觉特征可以在图像中聚焦于任何所需的对象，同时保持底层表示的质量。我们的方法在异常检测和个性化对象区分上也与专门的方法相当或更优，展示了对未见过任务的零样本泛化。

Summary / 总结

The paper introduces Steerable Visual Representations, which allow for the guidance of visual features with natural language, addressing the limitations of existing Vision Transformers that focus on salient visual cues and Multimodal LLMs that become language-centric. The method injects text directly into the visual encoder layers (early fusion) using lightweight cross-attention. Experimental results show that these steerable visual features can focus on any desired objects in an image while maintaining quality, and they match or outperform dedicated approaches in anomaly detection and personalized object discrimination, demonstrating zero-shot generalization to out-of-distribution tasks.

研究引入了可引导视觉表示，可以通过自然语言引导关注图像中的特定对象，同时保持整体表示质量。不同于在编码后融合文本和视觉特征的方法，该方法通过轻量级交叉注意力直接将文本注入视觉编码器层（早期融合）。实验表明，这些可引导的特征在异常检测和个人对象区分上优于或匹配专门方法，展示了对未见过的任务的零样本泛化能力。
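
The early-fusion idea described above (patch tokens inside the visual encoder attending to text tokens) can be illustrated with a minimal single-head cross-attention sketch. This is not the authors' implementation: it omits learned query/key/value projections, and the function name `cross_attention_update` and the `gate` parameter are hypothetical names chosen for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_update(patch_tokens, text_tokens, gate=1.0):
    """Residual cross-attention: each patch token (query) attends over the
    text tokens (used as both keys and values, an assumption of this sketch)."""
    dim = len(patch_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in patch_tokens:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_tokens]
        weights = softmax(scores)
        context = [sum(w * v[d] for w, v in zip(weights, text_tokens)) for d in range(dim)]
        # The text-conditioned context is added to the visual token residually.
        updated.append([qi + gate * ci for qi, ci in zip(q, context)])
    return updated
```

With `gate=0.0` the visual tokens pass through unchanged, which is one simple way a lightweight text branch could leave the underlying representation intact when no steering is wanted.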

Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning

Authors: Xueying Li, Feng Lyu, Hao Wu, Mingliu Liu, Jia-Nan Liu, Guozi Liu

First: 2026-04-02T17:58:08+00:00 · Latest: 2026-04-02T17:58:08+00:00

Comments: 10 pages, 6 figures

Abstract

Training-free Vision-Language Navigation (VLN) agents powered by foundation models can follow instructions and explore 3D environments. However, existing approaches rely on greedy frontier selection and passive spatial memory, leading to inefficient behaviors such as local oscillation and redundant revisiting. We argue that this stems from a lack of metacognitive capabilities: the agent cannot monitor its exploration progress, diagnose strategy failures, or adapt accordingly. To address this, we propose MetaNav, a metacognitive navigation agent integrating spatial memory, history-aware planning, and reflective correction. Spatial memory builds a persistent 3D semantic map. History-aware planning penalizes revisiting to improve efficiency. Reflective correction detects stagnation and uses an LLM to generate corrective rules that guide future frontier selection. Experiments on GOAT-Bench, HM3D-OVON, and A-EQA show that MetaNav achieves state-of-the-art performance while reducing VLM queries by 20.7%, demonstrating that metacognitive reasoning significantly improves robustness and efficiency.

中文标题/摘要

标题：停止徘徊：通过元认知推理实现高效的视觉-语言导航

基于基础模型的无需训练的视觉-语言导航（VLN）代理可以遵循指令并探索3D环境。然而，现有方法依赖于贪婪的前沿选择和被动的空间记忆，导致诸如局部振荡和重复访问等低效行为。我们认为这源于缺乏元认知能力：代理无法监控其探索进度，诊断策略失败，或相应地进行调整。为了解决这个问题，我们提出了MetaNav，这是一种结合了空间记忆、历史感知规划和反思性纠正的元认知导航代理。空间记忆构建了一个持久的3D语义地图。历史感知规划通过惩罚重复访问来提高效率。反思性纠正检测停滞，并使用LLM生成纠正规则以指导未来的前沿选择。在GOAT-Bench、HM3D-OVON和A-EQA上的实验表明，MetaNav在保持最佳性能的同时减少了20.7%的VLM查询，证明了元认知推理显著提高了稳健性和效率。

Summary / 总结

The research aims to enhance the efficiency of Vision-Language Navigation (VLN) agents by addressing their tendency to exhibit inefficient behaviors like local oscillation and redundant revisiting. MetaNav, a metacognitive navigation agent, integrates spatial memory, history-aware planning, and reflective correction to monitor exploration progress and adapt strategies. Experiments show that MetaNav outperforms existing methods while reducing VLM queries by 20.7%, indicating significant improvements in robustness and efficiency.

研究旨在通过解决视觉-语言导航（VLN）代理的局部振荡和冗余回访等低效行为，提高其效率。提出了一个元认知导航代理MetaNav，集成了空间记忆、历史感知规划和反思性纠正。该代理构建了一个持久的3D语义地图，通过惩罚回访来提高效率，并使用语言模型生成纠正规则以更好地选择前沿区域。在GOAT-Bench、HM3D-OVON和A-EQA上的实验结果显示，MetaNav在减少VLM查询20.7%的同时，实现了最先进的性能，表明元认知推理显著提高了VLN任务的鲁棒性和效率。
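
To make the contrast between greedy frontier selection and history-aware planning concrete, here is a small sketch, assuming each frontier has a scalar semantic score and that revisits are tracked per map cell. The linear penalty, the `revisit_penalty` weight, and the function name are illustrative assumptions, not MetaNav's actual scoring rule.

```python
from collections import Counter

def select_frontier(frontiers, visit_counts, revisit_penalty=0.5):
    """Return the cell of the frontier with the best penalized score.

    frontiers: list of (cell, semantic_score) pairs.
    visit_counts: Counter mapping cell -> number of previous visits.
    """
    def penalized(item):
        cell, score = item
        # Hypothetical linear penalty: each earlier visit lowers the score.
        return score - revisit_penalty * visit_counts[cell]
    return max(frontiers, key=penalized)[0]

visit_counts = Counter({(0, 1): 3})
frontiers = [((0, 1), 0.9), ((2, 2), 0.6)]
greedy_choice = max(frontiers, key=lambda f: f[1])[0]    # ignores history
history_choice = select_frontier(frontiers, visit_counts)
```

In this toy case the greedy rule keeps returning to the already-visited cell `(0, 1)`, while the penalized rule moves on to `(2, 2)`.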

VOID: Video Object and Interaction Deletion

Authors: Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng

First: 2026-04-02T17:36:53+00:00 · Latest: 2026-04-02T17:36:53+00:00

Abstract

Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.

中文标题/摘要

标题：VOID：视频对象和交互删除

现有的视频对象移除方法在修复内容“背后”的内容和纠正外观级别的伪影（如阴影和反射）方面表现出色。然而，当移除的对象具有更显著的交互，例如与其他对象的碰撞时，当前的模型无法纠正这些交互，从而产生不合理的结果。我们提出了VOID，一种视频对象移除框架，旨在在这些复杂场景中执行物理上合理的修复。为了训练模型，我们使用Kubric和HUMOTO生成了一个新的配对数据集，其中移除对象需要改变下游的物理交互。在推理过程中，一个视觉语言模型识别场景中受移除对象影响的区域。这些区域随后用于指导一个视频扩散模型，生成物理上一致的反事实结果。在合成和真实数据上的实验表明，与之前的视频对象移除方法相比，我们的方法在对象移除后更好地保持了场景动力学的一致性。我们希望这个框架能够揭示如何通过高层次的因果推理使视频编辑模型更好地模拟世界。

Summary / 总结

The research aims to address the limitations of existing video object removal methods, which struggle with complex interactions like collisions. VOID, a new framework, is introduced to handle these scenarios by ensuring physically plausible inpainting. It uses a vision-language model to identify affected regions and a video diffusion model to generate consistent outcomes. Experiments show that VOID better preserves scene dynamics after object removal compared to previous methods.

研究旨在解决现有视频对象移除方法在处理碰撞等复杂交互时的局限性。提出了VOID框架，通过确保物理上合理的修复来应对这些场景。该框架使用视觉-语言模型识别受影响区域，并使用视频扩散模型生成一致的结果。实验表明，VOID在对象移除后更好地保持了场景动力学，优于先前的方法。

Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models

Authors: Yaoteng Tan, Zikui Cai, M. Salman Asif

First: 2026-04-02T16:59:28+00:00 · Latest: 2026-04-02T16:59:28+00:00

Abstract

Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.

中文标题/摘要

标题：基于基础模型的模块化能量引导，实现安全的文本到图像生成

控制文本到图像生成模型的行为对于安全和实际部署至关重要。现有安全方法通常依赖于模型微调或精心策划的数据集，这可能会降低生成质量或限制可扩展性。我们提出了一种推理时转向框架，该框架利用冻结的基础模型的梯度反馈来引导生成过程，而不修改底层生成器。我们的主要观察是，视觉-语言基础模型编码了丰富的语义表示，可以在生成过程中作为现成的监督信号重新利用。通过在每次采样步骤中注入这种反馈，我们的方法将安全性转向建模为一种基于能量的采样问题。这种设计使安全性控制模块化、无需训练，并且与扩散和流匹配模型兼容，可以跨多种视觉概念泛化。实验表明，我们的方法在NSFW红队基准测试中具有最先进的鲁棒性，并且能够有效进行多目标转向，同时在良性非目标提示上保持高质量的生成。我们的框架提供了一种原理性的方法，利用基础模型作为语义能量估计器，使文本到图像生成的安全控制可靠且可扩展。

Summary / 总结

The research aims to enhance the safety of text-to-image generation by proposing a modular energy steering framework that uses gradient feedback from frozen pretrained models to guide the generation process without altering the underlying generator. This method leverages the rich semantic representations of vision-language foundation models to provide off-the-shelf supervisory signals during generation. Experiments show that the framework achieves state-of-the-art robustness against NSFW benchmarks and effective multi-target steering while maintaining high generation quality on benign prompts. This approach enables reliable and scalable safety control for text-to-image generation.

研究旨在提高文本到图像生成的安全性，提出了一种模块化能量引导框架，利用冻结的预训练基础模型的梯度反馈来引导生成过程，而不修改底层生成器。该方法利用视觉语言基础模型丰富的语义表示，在生成过程中提供现成的监督信号。实验表明，该框架在NSFW红队基准测试中实现了最先进的鲁棒性，并且能够有效进行多目标引导，同时保持对良性非目标提示的高质量生成。
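
The energy-based view of steering can be sketched with a toy example: an energy function scores the clean latent estimate, and the sampler nudges the latent down the energy gradient. Everything below is an assumption for illustration. The quadratic energy stands in for feedback from a frozen foundation model, the Jacobian of the clean estimate with respect to the latent is treated as the identity, and `steer_step` is a hypothetical name.

```python
def energy(x0_hat, concept):
    # Toy energy: large when the clean estimate aligns with an unwanted concept direction.
    dot = sum(a * b for a, b in zip(x0_hat, concept))
    return 0.5 * dot * dot

def energy_grad(x0_hat, concept):
    # Analytic gradient of the toy energy with respect to x0_hat.
    dot = sum(a * b for a, b in zip(x0_hat, concept))
    return [dot * c for c in concept]

def steer_step(latent, x0_hat, concept, step_size=0.5):
    """One steering update: move the latent against the energy gradient
    evaluated at the clean estimate (identity Jacobian assumed)."""
    grad = energy_grad(x0_hat, concept)
    return [z - step_size * g for z, g in zip(latent, grad)]
```

Because the generator itself is never updated, a step like this could in principle be slotted between sampling steps of a diffusion or flow-matching sampler.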

Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI

Authors: Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath

First: 2025-07-08T04:40:09+00:00 · Latest: 2026-04-02T16:59:25+00:00

Comments: Accepted at ACM CODASPY 2026

Abstract

Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus

中文标题/摘要

标题：Optimus：一种稳健的防御框架，用于在微调对话AI时减轻毒性

在不可信数据集上定制大型语言模型（LLMs）会严重增加注入毒性行为的风险。在本研究中，我们提出了Optimus，一种新颖的防御框架，旨在减轻微调危害同时保留对话实用性。与依赖精确毒性检测或严格过滤的现有防御不同，Optimus 通过确保即使在毒性分类器不准确或有偏见时也能实现稳健的缓解来解决关键挑战。Optimus 结合了一种无需训练的毒性分类方案，该方案重新利用了商品级LLM的安全对齐，并采用结合合成“治愈数据”与直接偏好优化（DPO）的双重策略对齐过程，以高效地引导模型向安全方向发展。广泛的评估表明，即使依赖于高度有偏见的分类器（召回率降低高达85%），Optimus 也能减轻毒性。Optimus 优于最先进的防御方法 StarDSS，并在面对自适应对抗攻击和越狱攻击时表现出很强的韧性。我们的源代码和数据集可在 https://github.com/secml-lab-vt/Optimus 获取
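
For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO loss can be written in a few lines. This sketch shows only the generic DPO objective, not how Optimus builds its preference pairs or healing data; the argument names and the default `beta` are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a response under the
    policy being trained or under the frozen reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    # Negative log-sigmoid of the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

The loss equals log 2 when the policy matches the reference and falls as the policy raises the chosen (for example, non-toxic) response relative to the rejected one.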

Scaling Video Pretraining for Surgical Foundation Models

Authors: Sicheng Lu, Zikai Xiao, Jianhui Wei, Danyu Sun, Qi Lu, Keli Hu, Yang Feng, Jian Wu, Zongxin Yang, Zuozhu Liu

First: 2026-03-31T16:31:25+00:00 · Latest: 2026-04-02T16:46:06+00:00

Abstract

Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.

中文标题/摘要

标题：面向手术基础模型的视频预训练规模化

手术视频理解对于计算机辅助干预至关重要，但现有的手术基础模型仍然受到数据规模有限、程序多样性不足以及评估不一致的限制，往往缺乏可重复的训练管道。我们提出了一种名为SurgRec的可扩展且可重复的手术视频理解预训练方案，包括两种变体：SurgRec-MAE和SurgRec-JEPA。我们整理了一个包含10,535个视频和2.145亿帧的大型多源数据集，涵盖内窥镜、腹腔镜、白内障和机器人手术。基于此数据集，我们开发了一个统一的预训练管道，采用平衡采样，并在16个下游数据集和四个临床领域中标准化了一个可重复的基准，数据集具有统一的数据分割。在与SSL基线和视觉-语言模型的广泛比较中，SurgRec在所有下游数据集上均表现出更优的性能。相比之下，视觉-语言模型在细粒度的时间识别上表现不稳定，存在性能差距和对提示措辞的敏感性。我们的工作为社区提供了一个可重复且可扩展的基础，以构建更通用的手术视频模型。所有代码、模型和数据将公开发布。

SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation

Authors: Naomi Kombol, Ivan Martinović, Siniša Šegvić, Giorgos Tolias

Venue: CVPR 2026

First: 2026-04-02T16:45:34+00:00 · Latest: 2026-04-02T16:45:34+00:00

Comments: Accepted to CVPR 2026

Abstract

Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR

中文标题/摘要

标题：SPAR：用于开放词汇分割的单次前向任意分辨率ViT

基础视觉变换器（ViTs）在需要精细空间理解的任务中效果有限，因为它们具有固定的预训练分辨率和固有的粗粒度的块级表示。这些挑战在密集预测场景中尤为明显，例如基于ViT的视觉-语言模型的开放词汇分割，其中高分辨率输入对于准确的像素级推理至关重要。现有方法通常使用滑动窗口策略在预训练分辨率下处理大分辨率图像。虽然这通过更精细的步幅提高了准确性，但会带来显著的计算成本。我们引入了SPAR：单次通过任意分辨率ViT，这是一种分辨率无关的密集特征提取器，旨在进行高效的高分辨率推理。我们通过特征回归损失将精细步幅的滑动窗口教师的空间推理能力提炼到单次通过的学生中，而无需进行架构更改或像素级监督。应用于开放词汇分割，SPAR将单次通过基线提高了最多10.5 mIoU，并且甚至超过了教师，证明了其在高效高分辨率推理中的有效性。代码：https://github.com/naomikombol/SPAR

Summary / 总结

SPAR is a resolution-agnostic ViT designed for efficient high-resolution inference in open-vocabulary segmentation tasks. It distills the spatial reasoning capabilities of a finely-strided teacher into a single-pass student using a feature regression loss. SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, showing effectiveness in efficient, high-resolution reasoning.

研究旨在解决固定分辨率预训练在Vision Transformers (ViTs)中对需要精细空间理解的任务（如开放词汇分割）的局限性。提出了SPAR，这是一种分辨率无关的密集特征提取器，以实现高效的高分辨率推理。通过将精细步幅教师的空间推理能力提炼到单次通过的学生模型中，SPAR在开放词汇分割任务中将单次通过基线提高了最多10.5 mIoU，并超越了教师模型。
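
The computational cost of the sliding-window strategy mentioned in the abstract can be estimated by counting forward passes. The sketch below assumes square windows at the pre-training resolution and a final window that is shifted to cover the image border; the image size and strides are hypothetical and not taken from the paper.

```python
import math

def num_windows(height, width, window, stride):
    """Number of sliding-window crops (forward passes) needed to cover an image."""
    rows = max(0, math.ceil((height - window) / stride)) + 1
    cols = max(0, math.ceil((width - window) / stride)) + 1
    return rows * cols

coarse = num_windows(896, 896, window=224, stride=224)  # non-overlapping windows
fine = num_windows(896, 896, window=224, stride=112)    # half-window stride
```

Halving the stride roughly triples the number of encoder passes in this example (16 versus 49), which is the overhead a single-pass student avoids.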

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna

First: 2026-01-15T17:27:44+00:00 · Latest: 2026-04-02T16:01:02+00:00

Comments: Updated first authors

Abstract

Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single-image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).

中文标题/摘要

标题：Molmo2：开放权重和数据的视觉-语言模型，具备视频理解与定位能力

当前最强的视频-语言模型（VLMs）仍为私有。最强的开放权重模型要么依赖于私有VLMs的合成数据，要么不披露其训练数据或方法。因此，开源社区缺乏改进当前最先进的视频（和图像）语言模型的基础。至关重要的是，许多下游应用不仅需要高层次的视频理解，还需要定位——无论是通过指针还是像素跟踪。即使私有模型也缺乏这种能力。我们提出了Molmo2，这是一种新的VLM家族，是开源模型中的最新技术，并展示了在单图像、多图像和视频任务中出色的基于指针的定位能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集，包括用于预训练的详细视频字幕数据集、自由形式的视频问答数据集、新的具有复杂查询的对象跟踪数据集以及创新的视频指针数据集，所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的训练方法，并展示了在视觉标记上使用双向注意和一种新颖的标记权重策略可以提高性能。我们最好的8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型，并在长视频方面具有竞争力。在视频定位方面，Molmo2显著优于现有开放权重模型如Qwen3-VL（视频计数准确率为35.5 vs 29.6）并在某些任务上超越了私有模型如Gemini 3 Pro（视频指针F1得分为38.4 vs 20.0，视频跟踪J&F得分为56.2 vs 41.1）。

Summary / 总结

The paper introduces Molmo2, a new family of open-source vision-language models that excel in point-driven grounding tasks. The models are trained on a collection of 9 new datasets, including video captions, Q&A, object tracking, and pointing datasets, all created without relying on proprietary models. The training method includes an efficient packing and message-tree encoding scheme, and the models show superior performance in tasks such as video counting, captioning, and video-grounding, outperforming both open-source and proprietary models in several benchmarks.

该论文介绍了Molmo2，这是一种开源的视觉-语言模型，擅长视频理解和定位任务。它利用了一个新的9个数据集集合，包括视频字幕、问答、物体跟踪和指针数据集，所有这些数据集都是在不依赖于专有模型的情况下创建的。该模型在视频计数、描述和视频定位等任务上表现出色，优于开源和专有模型的多个基准测试。

UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

Authors: Yongkang Li, Lijun Zhou, Sixu Yan, Bencheng Liao, Tianyi Yan, Kaixin Xiong, Long Chen, Hongwei Xie, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Haiyang Sun, Xinggang Wang

First: 2026-04-02T15:48:45+00:00 · Latest: 2026-04-02T15:48:45+00:00

Comments: code has been released at https://github.com/xiaomi-research/unidrivevla

Abstract

Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla

中文标题/摘要

标题：UniDriveVLA：统一理解、感知与行动规划的自动驾驶

视觉-语言-行动（VLA）模型最近在自动驾驶中崭露头角，有望利用丰富的世界知识提升驾驶系统的认知能力。然而，将这些模型适应驾驶任务目前面临一个关键困境：空间感知与语义推理之间的权衡。因此，现有的VLA系统被迫做出次优妥协：直接采用2D视觉-语言模型导致空间感知有限，而增强它们的3D空间表示往往损害了VLM的原生推理能力。我们认为，这一困境主要源于在共享模型参数内空间感知与语义推理的耦合优化。为克服这一问题，我们提出了基于混合变换器的UniDriveVLA统一驾驶视觉-语言-行动模型，通过专家解耦解决感知-推理冲突。具体而言，它包含三个专家，分别负责驾驶理解、场景感知和行动规划，通过掩蔽联合注意力协调。此外，我们结合稀疏感知范式和三阶段渐进式训练策略，以提高空间感知能力同时保持语义推理能力。大量实验表明，UniDriveVLA在nuScenes的开环评估和Bench2Drive的闭环评估中均达到最先进的性能。此外，它在包括3D检测、在线建图、运动预测和驾驶导向的VQA等一系列感知、预测和理解任务中表现出色，突显了其作为统一模型在自动驾驶领域的广泛应用潜力。代码和模型已发布于https://github.com/xiaomi-research/unidrivevla

Summary / 总结

The paper addresses the challenge of integrating spatial perception and semantic reasoning in Vision-Language-Action models for autonomous driving. It introduces UniDriveVLA, a unified model based on Mixture-of-Transformers, which decouples perception and reasoning through expert modules. UniDriveVLA shows superior performance in both open-loop and closed-loop evaluations on nuScenes and Bench2Drive, respectively, and excels in various tasks such as 3D detection, online mapping, and driving-oriented VQA. The model's design allows it to maintain strong performance across different aspects of autonomous driving tasks.

论文解决了在自动驾驶中将空间感知和语义推理集成的挑战。提出了基于Mixture-of-Transformers的UniDriveVLA模型，通过专家模块解耦感知和推理。UniDriveVLA在nuScenes和Bench2Drive上的开环和闭环评估中表现出色，并在3D检测、在线建图和驾驶导向的VQA等多种任务中表现出色。该模型的设计使其在不同自动驾驶任务方面保持了强大的性能。

CoRegOVCD: Consistency-Regularized Open-Vocabulary Change Detection

Authors: Weidong Tang, Hanbin Sun, Zihan Li, Yikai Wang, Feifan Zhang

First: 2026-04-02T15:28:29+00:00 · Latest: 2026-04-02T15:28:29+00:00

Abstract

Remote sensing change detection (CD) aims to identify where land-cover semantics change across time, but most existing methods still assume a fixed label space and therefore cannot answer arbitrary user-defined queries. Open-vocabulary change detection (OVCD) instead asks for the change mask of a queried concept. In the fully training-free setting, however, dense concept responses are difficult to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of many land-cover categories often produce noisy, fragmented, and semantically unreliable change evidence. We propose Consistency-Regularized Open-Vocabulary Change Detection (CoRegOVCD), a training-free dense inference framework that reformulates concept-specific change as calibrated posterior discrepancy. Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy, making semantic change evidence more comparable without explicit instance matching. Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) further suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus. Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently improves over the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.

中文标题/摘要

标题：CoRegOVCD: 一致性正则化开放词汇变化检测

遥感变化检测（CD）旨在识别不同时期土地覆盖语义的变化，但大多数现有方法仍然假设固定标签空间，因此无法回答任意用户定义的查询。开放词汇变化检测（OVCD）则要求提供查询概念的变化掩码。然而，在完全无训练设置中，密集的概念响应难以直接在不同日期之间进行比较：外观变化、弱跨概念竞争以及许多土地覆盖类别的空间连续性经常产生嘈杂、碎片化且语义不可靠的变化证据。我们提出了Consistency-Regularized Open-Vocabulary Change Detection（CoRegOVCD），这是一种完全无训练的密集推理框架，将概念特定的变化重新表述为校准后的后验差异。Competitive Posterior Calibration（CPC）和Semantic Posterior Delta（SPD）将原始概念响应转换为竞争意识的查询概念后验，并量化它们的跨时间差异，从而在无需显式实例匹配的情况下使语义变化证据更具可比性。Geometry-Token Consistency Gate（GeoGate）和Regional Consensus Discrepancy（RCD）进一步抑制不支持的响应并通过几何感知结构验证和区域共识提高空间一致性。在四个涵盖建筑导向和多类别的基准测试中，CoRegOVCD在最强的先前完全无训练基线基础上分别提高了2.24到4.98个F1_C点，并在SECOND上达到六类平均47.50%的F1_C。

Summary / 总结

CoRegOVCD is a training-free dense inference framework for open-vocabulary change detection that addresses the challenges of direct comparison of concept responses across dates. It uses Competitive Posterior Calibration and Semantic Posterior Delta to convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy. Geometry-Token Consistency Gate and Regional Consensus Discrepancy further refine the results. CoRegOVCD outperforms the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points across four benchmarks and achieves an average F1$_C$ of 47.50% on the six-class setting of SECOND.

CoRegOVCD 是一个无需训练的密集推理框架，用于开放词汇变化检测，旨在解决跨时间比较概念响应的挑战。它使用 Competitive Posterior Calibration 和 Semantic Posterior Delta 将原始概念响应转换为竞争感知的查询概念后验，并量化它们的跨时间差异。Geometry-Token Consistency Gate 和 Regional Consensus Discrepancy 进一步通过几何感知结构验证和区域共识来抑制不支持的响应并提高空间一致性。CoRegOVCD 在四个基准测试中比最强的无训练基线高出 2.24 到 4.98 个 F1$_C$ 点，并在 SECOND 上取得 47.50% 的六类平均 F1$_C$。
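
One way to read "competition-aware posteriors" and their "cross-temporal discrepancy" is as a softmax over concept responses at each pixel, followed by a difference of the queried concept's posterior between the two dates. The sketch below follows that reading only as an assumption; the paper's actual CPC and SPD definitions may differ, and the function names and `temperature` parameter are hypothetical.

```python
import math

def queried_posterior(responses, query, temperature=1.0):
    """Softmax over per-concept responses at one pixel; returns the
    posterior mass assigned to the queried concept."""
    m = max(responses.values())
    exps = {c: math.exp((r - m) / temperature) for c, r in responses.items()}
    return exps[query] / sum(exps.values())

def posterior_delta(resp_t1, resp_t2, query):
    # Cross-temporal discrepancy of the queried-concept posterior.
    return abs(queried_posterior(resp_t1, query) - queried_posterior(resp_t2, query))

# A uniform response offset (e.g. a global appearance shift) leaves the posterior unchanged.
same_scene = posterior_delta({"building": 2.0, "tree": 1.0},
                             {"building": 5.0, "tree": 4.0}, "building")
# A swap in which concept dominates produces a large discrepancy.
changed = posterior_delta({"building": 2.0, "tree": 1.0},
                          {"building": 1.0, "tree": 2.0}, "building")
```

The raw "building" response differs by 3.0 in the first pair even though nothing changed semantically, which is the kind of direct-comparison noise a calibrated posterior is meant to remove.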

Be Tangential to Manifold: Discovering Riemannian Metric for Diffusion Models

Authors: Shinnosuke Saito, Takashi Matsubara

First: 2025-10-07T01:54:47+00:00 · Latest: 2026-04-02T14:50:54+00:00

Abstract

Diffusion models are powerful deep generative models, but unlike classical models, they lack an explicit low-dimensional latent space that parameterizes the data manifold. This absence makes it difficult to perform manifold-aware operations, such as geometrically faithful interpolation or conditional guidance that respects the learned manifold. We propose a training-free Riemannian metric on the noise space, derived from the Jacobian of the score function. The key insight is that the spectral structure of this Jacobian separates tangent and normal directions of the data manifold; our metric leverages this separation to encourage paths to stay tangential to the manifold rather than drift toward high-density regions. To validate that our metric faithfully captures the manifold geometry, we examine it from two complementary angles. First, geodesics under our metric yield perceptually more natural interpolations than existing methods on synthetic, image, and video frame datasets. Second, the tangent-normal decomposition induced by our metric prevents classifier-free guidance from deviating off the manifold, improving generation quality while preserving text-image alignment.

中文标题/摘要

标题：与流形相切：发现用于扩散模型的黎曼度量

扩散模型是强大的深度生成模型，但与经典模型不同，它们缺乏一个显式的低维潜在空间来参数化数据流形。这种缺失使得执行流形感知操作变得困难，例如几何上忠实的插值或尊重学习到的流形的条件引导。我们提出了一种无需训练的噪声空间上的黎曼度量，该度量源自分数函数的雅可比矩阵。关键洞察是，该雅可比矩阵的谱结构将数据流形的切向和法向方向区分开来；我们的度量利用这种分离来鼓励路径保持在流形上，而不是向高密度区域漂移。为了验证我们的度量是否忠实捕捉了流形几何，我们从两个互补的角度进行了验证。首先，在合成、图像和视频帧数据集上，我们的度量下的测地线提供了感知上更自然的插值。其次，由我们的度量引起的切向-法向分解防止了无分类器引导偏离流形，从而提高了生成质量并保持了文本-图像对齐。

Summary / 总结

The paper addresses the challenge of performing manifold-aware operations in diffusion models by proposing a training-free Riemannian metric derived from the Jacobian of the score function. This metric encourages paths to stay tangential to the data manifold rather than drifting towards high-density regions. Experiments show that geodesics under this metric provide more natural interpolations and that the tangent-normal decomposition prevents classifier-free guidance from deviating off the manifold, thereby improving generation quality and preserving text-image alignment.

论文提出了一种基于得分函数雅可比的无训练Riemannian度量，以解决在扩散模型中执行流形感知操作的挑战。该度量鼓励路径保持在流形上，从而改善几何保真插值和无分类器引导。实验表明，在该度量下的测地线提供了更自然的插值，并且由该度量诱导的切空间-法空间分解防止了偏离流形，从而提高了生成质量并保持了文本-图像对齐。

FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition

Authors: Taichi Endo, Guoqing Hao, Kazuhiko Sumi

First: 2026-04-02T14:16:06+00:00 · Latest: 2026-04-02T14:16:06+00:00

Comments: HuggingFace Space: https://huggingface.co/spaces/dominoer/FlowSlider

Abstract

Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can reduce reliability under distribution shifts in edits or domains. We propose FlowSlider, a training-free method for continuous editing in Rectified Flow that requires no post-training. FlowSlider decomposes FlowEdit's update into (i) a fidelity term, which acts as a source-conditioned stabilizer that preserves identity and structure, and (ii) a steering term that drives semantic transition toward the target edit. Geometric analysis and empirical measurements show that these terms are approximately orthogonal, enabling stable strength control by scaling only the steering term while keeping the fidelity term unchanged. As a result, FlowSlider provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks.

中文标题/摘要

标题：FlowSlider：基于保真-导向分解的无需训练连续图像编辑

连续图像编辑旨在提供滑块式控制编辑强度的同时，保持源图像保真度并维持一致的编辑方向。现有的基于学习的滑块方法通常依赖于使用合成或代理监督训练的辅助模块。这引入了额外的训练开销，并将滑块行为与训练分布耦合，这在编辑或领域分布变化时可能会降低可靠性。我们提出了一种名为\textit{FlowSlider}的无需训练的连续编辑方法，该方法在修正流中不需要后训练。\textit{FlowSlider}将FlowEdit的更新分解为(i)保真度项，该项作为基于源条件的稳定器，保持身份和结构；(ii)导向项，驱动语义过渡以接近目标编辑。几何分析和实证测量表明，这些项几乎正交，使得通过仅缩放导向项而保持保真度项不变即可实现稳定的强度控制。因此，\textit{FlowSlider}在无需后训练的情况下提供了平滑且可靠的控制，从而提高了各种任务中的连续编辑质量。

Summary / 总结

FlowSlider is a training-free method for continuous image editing in Rectified Flow that decomposes FlowEdit's update into a fidelity term and a steering term. The fidelity term acts as a source-conditioned stabilizer that preserves identity and structure, while the steering term drives the semantic transition toward the target edit. Because the two terms are approximately orthogonal, edit strength can be controlled smoothly and reliably by scaling only the steering term, without additional training, improving the quality of continuous editing across diverse tasks.

FlowSlider 是一种无需训练的连续图像编辑方法，将 FlowEdit 的更新分解为保真度项和导向项。保真度项作为基于源条件的稳定器保持身份和结构，而导向项驱动语义向目标编辑过渡。由于两项近似正交，仅缩放导向项即可在无需后训练的情况下实现平滑且可靠的强度控制，从而在各种任务中提高连续编辑的质量。
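
The strength control described in the FlowSlider entry can be illustrated with a minimal sketch. It assumes the two terms of one update step are already available as plain vectors; the values, and the helper names `dot`, `cosine`, and `slider_update`, are hypothetical and are not taken from the paper or from FlowEdit. The sketch only shows why orthogonality matters: when the terms are orthogonal, scaling the steering term leaves the component of the update along the fidelity direction unchanged.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; 0 means the two vectors are orthogonal."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def slider_update(fidelity, steering, strength):
    """Combine the two terms, scaling only the steering term (assumed form)."""
    return [f + strength * s for f, s in zip(fidelity, steering)]


# Toy stand-ins for the two terms of a single update step (hypothetical values).
fidelity = [1.0, 0.0, 0.5]
steering = [0.0, 2.0, 0.0]

print("cosine(fidelity, steering) =", cosine(fidelity, steering))
for strength in (0.0, 0.5, 1.0, 1.5):
    update = slider_update(fidelity, steering, strength)
    # The projection onto the fidelity direction does not depend on `strength`
    # because the two toy terms are exactly orthogonal.
    print(strength, update, dot(update, fidelity))
```

In practice the abstract reports only approximate orthogonality, so a small amount of leakage between the two directions would be expected.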

Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection

Authors: Soo Won Seo, KyungChae Lee, Hyungchan Cho, Taein Son, Nam Ik Cho, Jun Won Choi

Venue: CVPR 2026

First: 2026-04-02T14:01:58+00:00 · Latest: 2026-04-02T14:01:58+00:00

Comments: Accepted to CVPR 2026. Code: https://github.com/nowuss/InCoM-Net


Abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.

中文标题/摘要

标题：挖掘以实例为中心的视觉-语言语境用于人-物交互检测

人-物交互（HOI）检测旨在从单张图像中定位人-物对并分类其交互，这需要强大的视觉理解能力和细腻的语境推理。最近的方法利用视觉-语言模型（VLM）引入语义先验，显著提高了HOI检测性能。然而，现有方法往往未能充分利用场景中分散的多样化语境线索。为克服这些限制，我们提出了一种实例中心语境挖掘网络（InCoM-Net）——一种新颖的框架，该框架有效结合了从VLM提取的丰富语义知识与对象检测器生成的实例特定特征。此设计通过建模不仅在每个检测实例内部的关系，还在实例之间及其周围场景语境中的关系，以实现更深入的交互推理。InCoM-Net 包含两个核心组件：实例中心语境精炼（ICR），该组件分别从VLM特征中提取实例内、实例间和全局语境线索，以及渐进式语境聚合（ProCA），该组件迭代地将这些多语境特征与实例级检测器特征融合，以支持高级HOI推理。在HICO-DET和V-COCO基准上的广泛实验表明，InCoM-Net 达到了最先进的性能，超越了之前的HOI检测方法。代码可在 https://github.com/nowuss/InCoM-Net 获取。

Summary / 总结

The research aims to enhance Human-Object Interaction (HOI) detection by integrating rich semantic knowledge from Vision-Language Models (VLMs) with instance-specific features. The proposed Instance-centric Context Mining Network (InCoM-Net) refines intra-instance, inter-instance, and global contextual cues, and progressively aggregates these with detector features to improve HOI reasoning. Experiments on HICO-DET and V-COCO benchmarks demonstrate that InCoM-Net outperforms existing methods, achieving state-of-the-art performance in HOI detection.

研究旨在通过利用Vision-Language模型（VLM）捕捉整个场景中的丰富上下文线索，来提升Human-Object Interaction (HOI)检测。提出的Instance-centric Context Mining Network (InCoM-Net)将VLM提取的语义知识与实例特定特征相结合，通过两个核心组件：Instance-centric Context Refinement (ICR)和Progressive Context Aggregation (ProCA)实现更深层次的交互推理。在HICO-DET和V-COCO基准上的实验表明，InCoM-Net在HOI检测方面超越了现有方法，达到了最先进的性能。

Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models

Authors: Issa Sugiura, Keito Sasagawa, Keisuke Nakao, Koki Maeda, Ziqi Yin, Zhishen Yang, Shuhei Kurita, Yusuke Oda, Ryoko Tokuhisa, Daisuke Kawahara, Naoaki Okazaki

First: 2026-04-02T13:48:43+00:00 · Latest: 2026-04-02T13:48:43+00:00

Comments: 18 pages, 7 figures


Abstract

Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.

中文标题/摘要

标题：Jagle：构建大规模日语多模态后训练数据集以支持视觉-语言模型

开发能够跨多种任务泛化的视觉-语言模型（VLMs）需要内容多样的大规模训练数据集。在英语中，这样的数据集通常通过聚合和整理大量现有的视觉问答（VQA）资源来构建。然而，这种策略难以直接扩展到其他语言，因为这些语言的VQA数据集在规模和领域覆盖方面都很有限，这成为构建高质量多语言和非英语VLMs的主要障碍。在本文中，我们介绍了迄今为止最大的日语多模态后训练数据集Jagle，包含约920万条跨多种任务的数据实例。我们没有依赖现有的VQA数据集，而是收集了异构的源数据，包括图像、图像-文本对和PDF文档，并通过基于VLM的问答生成、翻译和文本渲染等多种策略生成VQA对。实验表明，使用Jagle训练的2.2B（22亿参数）模型在日语任务上表现出色，在十个日语评估任务的平均得分上超过InternVL3.5-2B，并与Qwen3-VL-2B-Instruct的差距缩小到五分以内。此外，将Jagle与FineVision结合使用不会降低英语性能，反而比仅使用FineVision训练时的英语性能更高。为了促进可重复性和未来研究，我们发布了数据集、训练模型和代码。

Summary / 总结

The research aims to develop a large-scale Japanese multimodal dataset, Jagle, to enhance the generalization of vision-language models (VLMs) across diverse tasks. The dataset is created by collecting various sources like images, image-text pairs, and PDF documents, and generating VQA pairs using multiple strategies. Experiments show that a 2.2B model trained with Jagle outperforms InternVL3.5-2B and approaches Qwen3-VL-2B-Instruct on Japanese tasks. Additionally, combining Jagle with FineVision improves English performance compared to FineVision alone.

Jagle 是一个包含约 920 万个实例的大型日语多模态后训练数据集，涵盖了各种任务。Jagle 不依赖现有的 VQA 数据集，而是收集了多种来源的数据，如图像、图像-文本对和 PDF 文档，并通过基于 VLM 的问答生成、翻译和文本渲染等方法生成 VQA 对。一个 2.2B 模型使用 Jagle 训练后，在十个日语评估任务上的平均得分优于 InternVL3.5-2B，并接近 Qwen3-VL-2B-Instruct。此外，将 Jagle 与 FineVision 结合使用会提升英语性能，优于仅使用 FineVision 的情况。

Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding

Authors: Tao Jin, Phuong Minh Nguyen, Naoya Inoue

First: 2026-04-02T13:48:42+00:00 · Latest: 2026-04-02T13:48:42+00:00


Abstract

Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.

中文标题/摘要

标题：Goose：用于无需训练推测解码的各向异性推测树

推测解码通过起草多个候选令牌并在单次前向传递中验证它们来加速大型语言模型的推理。候选令牌组织成一棵树：更深的树每步接受更多令牌，但在固定验证预算下增加深度需要牺牲宽度（备用选项）。现有的无需训练方法从单一令牌源起草，并且在塑造其树时不区分不同来源的候选质量。我们观察到两种常见的无需训练令牌源（从输入上下文中复制的n-gram匹配，以及来自先前前向传递的统计预测）在接受率上存在巨大差异（中位数差距约6倍，在五个模型和五个基准上的范围为2到18倍）。我们证明，当存在这种质量差距时，最优树是各向异性的（不对称的）：可靠的令牌应形成一条深链，而不可靠的令牌则扩展为宽分支，从而突破平衡树的深度限制。我们在GOOSE中实现了这一结构，这是一种无需训练的框架，构建自适应脊柱树，即一条由高接受率的上下文匹配令牌组成的深链，并在每个节点处带有由低接受率替代选项组成的宽分支。我们证明，每步接受的令牌数量至少与单独使用任一来源时一样多。在五个LLM（7B-33B）和五个基准上，GOOSE实现了1.9-4.3倍的无损加速，在相同预算下比平衡树基线高出12-33%。

Summary / 总结

Goose is a training-free speculative decoding framework built around an anisotropic speculation tree. Reliable context-matched tokens form a deep chain while less reliable tokens spread as wide branches at each node, which increases the number of tokens accepted per step under a fixed verification budget. On five large language models ranging from 7B to 33B parameters and five benchmarks, Goose achieves a 1.9-4.3x lossless speedup and outperforms balanced-tree baselines by 12-33% under the same budget.

研究旨在利用两种常见令牌来源之间的质量差距，改进大型语言模型中的推测解码。方法是构建各向异性推测树，其中可靠的令牌形成一条深链，而不可靠的令牌则作为每个节点的宽分支展开。实验结果显示，GOOSE 在五个语言模型上实现了 1.9-4.3 倍的无损加速，在相同验证预算下比平衡树基线高出 12-33%。
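
The depth-versus-breadth trade-off in the Goose entry can be made concrete with a toy calculation. This is a sketch under simplifying assumptions that are not from the paper: every spine token is accepted independently with probability `p`, each fallback sibling is accepted independently with probability `q`, and fallbacks are only one token deep. The function names and the values `p = 0.9`, `q = 0.15` (a 6x gap, chosen to echo the reported median) are hypothetical.

```python
def expected_chain(p, depth):
    """Expected accepted tokens for a chain: token k counts only if all
    earlier tokens were accepted, each with probability p."""
    return sum(p ** k for k in range(1, depth + 1))


def expected_spine(p, q, depth, width):
    """Toy spine tree: a chain of `depth` reliable tokens plus `width`
    one-token fallback siblings at every spine position."""
    hit = 1.0 - (1.0 - q) ** width  # chance at least one fallback is accepted
    total = 0.0
    for k in range(depth):
        reach = p ** k                    # spine accepted up to position k
        total += reach * p                # next spine token accepted
        total += reach * (1.0 - p) * hit  # spine rejected, a fallback saves one token
    return total


p, q = 0.9, 0.15  # hypothetical acceptance rates for the two token sources

# Three ways to spend the same budget of 16 tree nodes.
print("reliable chain, depth 16:  ", expected_chain(p, 16))
print("unreliable chain, depth 16:", expected_chain(q, 16))
print("spine depth 8, width 1:    ", expected_spine(p, q, 8, 1))
print("spine depth 4, width 3:    ", expected_spine(p, q, 4, 3))
```

The toy model does not reproduce the paper's optimality proof; it only shows how expected acceptance responds when a fixed node budget is shifted between chain depth and per-node breadth.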

Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence

Authors: Dian Liu, Jie Feng, Di Li, Yuhui Zheng, Guanbin Li, Weisheng Dong, Guangming Shi

First: 2026-04-02T13:22:57+00:00 · Latest: 2026-04-02T13:22:57+00:00


Abstract

Synergistic spatial intelligence between UAVs and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated Unmanned Aerial Vehicle (UAV) videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS$^2$Bench, the first comprehensive benchmark designed to evaluate VLMs' wide-area, dynamic cross-view spatial intelligence. LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. Through an LMM-assisted pipeline and rigorous human annotation, we constructed 17.9k high-quality question-answer pairs comprising 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 representative VLMs reveal a substantial gap compared to human baselines, identifying accurate cross-view dynamic alignment as the critical bottleneck. To alleviate this, we design a Cross-View Alignment Adapter, demonstrating that explicit alignment significantly improves model performance. Furthermore, fine-tuning experiments underscore the potential of LinkS$^2$Bench in advancing VLM adaptation for complex spatial reasoning.

中文标题/摘要

标题：VLMs在天地之间迷失了吗？UAV-卫星动态跨视角空间智能LinkS$^2$Bench评测

无人机与卫星之间的协同空间智能对于应急响应和安全操作至关重要，因为它能够结合宏观全球覆盖与动态实时的局部感知。然而，视觉-语言模型（VLMs）掌握这种复杂交互的能力尚未得到充分探索。这一差距主要因为现有基准仅限于孤立的无人机视频或静态卫星图像，未能评估全面跨视角推理所需的动态局部到全局的空间映射。为弥补这一差距，我们引入了LinkS$^2$Bench，这是首个用于评估VLMs广泛区域动态跨视角空间智能的综合基准。LinkS$^2$Bench将1022分钟的动态无人机视频与覆盖超过200平方公里的高分辨率卫星图像相连。通过LMM辅助管道和严格的真人注释，我们构建了17900个高质量的问题-答案对，涵盖四个维度的12个细粒度任务：感知、定位、关系和推理。对18个代表性VLMs的评估显示，与人类基准相比存在显著差距，准确的跨视角动态对齐是关键瓶颈。为缓解这一问题，我们设计了跨视角对齐适配器，表明显式对齐显著提高了模型性能。此外，微调实验强调了LinkS$^2$Bench在推进VLM适应复杂空间推理方面的潜力。

Summary / 总结

The paper addresses the gap in evaluating Vision-Language Models (VLMs) for the dynamic cross-view spatial intelligence between UAVs and satellites, which is crucial for emergency response and security operations. To fill this gap, the authors introduce LinkS$^2$Bench, a comprehensive benchmark that links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. The benchmark includes 17,900 high-quality question-answer pairs covering 12 tasks across four dimensions. Evaluations of 18 representative VLMs show a substantial gap to human baselines and identify accurate cross-view dynamic alignment as the critical bottleneck. The authors design a Cross-View Alignment Adapter, showing that explicit alignment significantly improves model performance, and fine-tuning experiments underscore the benchmark's potential for adapting VLMs to complex spatial reasoning.

论文旨在评估Vision-Language模型（VLMs）在无人机和卫星之间动态跨视图空间智能方面的表现，这对于应急响应和安全操作至关重要。为了解决这一问题，作者引入了LinkS$^2$Bench，这是一个综合基准，将1,022分钟的动态无人机视频与覆盖超过200平方公里的高分辨率卫星图像链接起来。基准包括17,900个高质量的问题-答案对，涵盖四个维度下的12个任务。对18个代表性VLMs的评估显示，其表现与人类基线之间存在显著差距，准确的跨视图动态对齐是关键瓶颈。作者设计了一种跨视图对齐适配器，表明显式对齐能显著提高模型性能；微调实验则进一步显示了该基准在推进VLM适应复杂空间推理方面的潜力。

Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation

Authors: Jie Feng, Fengze Li, Junpeng Zhang, Siyu Chen, Yuping Liang, Junying Chen, Ronghua Shang

First: 2026-04-02T13:15:05+00:00 · Latest: 2026-04-02T13:15:05+00:00


Abstract

Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its global-aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, failing to effectively delineate boundaries while risking the disruption of CLIP's semantic integrity. To address this limitation, we propose DR-Seg, a novel decouple-and-rectify framework in this paper. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state-of-the-art.

中文标题/摘要

标题：解耦与校正：面向开放词汇遥感分割的语义保留结构增强

遥感(RS)领域的开放词汇语义分割需要语言对齐的识别和精细的空间界定。尽管CLIP提供了强大的语义泛化能力，但其全局对齐的视觉表示在捕捉结构细节方面存在固有困难。最近的方法通过引入RS预训练的DINO特征来弥补这一不足。然而，这些方法将CLIP表示视为一个统一的语义空间，无法定位结构增强所需的位置，从而无法有效界定边界，同时又可能破坏CLIP的语义完整性。为解决这一局限，本文提出了一种新颖的解耦与校正框架DR-Seg。我们的方法基于一个关键观察：CLIP特征通道表现出功能异质性，而不是形成一个统一的语义空间。基于这一洞察，DR-Seg将CLIP特征解耦为以语义为主导和以结构为主导的子空间，通过DINO实现有针对性的结构增强，而不破坏语言对齐的语义。随后，一个先验驱动的图校正模块在DINO的引导下注入高保真的结构先验，形成一个精炼分支，而一个基于不确定性自适应融合模块动态将该精炼分支与原始CLIP分支融合，以进行最终预测。在八个基准上的全面实验表明，DR-Seg建立了新的性能最佳水平。

Summary / 总结

The research aims to improve open-vocabulary semantic segmentation in remote sensing by addressing the limitations of CLIP's global-aligned visual representations in capturing structural details. The proposed DR-Seg framework decouples CLIP features into semantics-dominated and structure-dominated subspaces, allowing targeted structural enhancement by DINO while preserving semantic integrity. Experimental results across eight benchmarks show that DR-Seg outperforms existing methods, establishing a new state-of-the-art.

研究旨在通过解决CLIP在遥感中捕捉结构细节方面的局限性，提高开放词汇语义分割的性能。所提出的 DR-Seg 是一种解耦与校正框架，将 CLIP 特征分离为语义主导和结构主导子空间，允许 DINO 在不破坏语义完整性的情况下进行有针对性的结构增强。实验结果表明，DR-Seg 在八个基准测试中超越了现有方法，达到了新的最先进水平。

Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models

Authors: Osher Rafaeli, Tal Svoray, Ariel Nahlieli

First: 2026-04-02T13:13:17+00:00 · Latest: 2026-04-02T13:13:17+00:00


Abstract

Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches primarily rely on spatial interpolation, which assumes spatial continuity, and therefore fail when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, limiting their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights. Experiments demonstrate consistent improvements over interpolation-based methods, prior-based height rescaling approaches, and state-of-the-art monocular depth estimation models. Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE, and further enables DSM updating and coupled RGB-DSM generation.

中文标题/摘要

标题：基于自监督ViT特征和单目基础模型的测试时自适应高程补全

准确的数字表面模型（DSMs）对于许多地理空间应用至关重要，包括城市监测、环境分析、基础设施管理和变化检测。然而，大规模的DSMs经常包含不完整或过时的区域，这可能是由于获取限制、重建伪影或建成环境的变化。传统的高程补全方法主要依赖于空间插值，而空间插值假设空间连续性，因此在物体缺失时会失效。最近的基于学习的方法可以提高重建质量，但通常需要在特定传感器数据集上进行监督训练，这限制了它们在不同领域和传感条件下的泛化能力。我们提出了一种名为Prior2DSM的无需训练框架，该框架完全在测试时运行，通过利用基础模型来进行度量尺度的DSM补全。与之前需要特定任务训练的高程补全方法不同，所提出的方法结合了来自DINOv3的自监督Vision Transformer（ViT）特征和单目深度基础模型，通过语义特征空间对应关系从不完整的高程先验中传播度量信息。测试时自适应（TTA）使用参数高效的低秩适应（LoRA）与轻量级多层感知机（MLP）一起进行，预测空间变化的尺度和偏移参数，将相对深度估计转换为度量高程。实验结果表明，Prior2DSM相比基于插值的方法、基于先验的高度重缩放方法以及最先进的单目深度估计模型都表现出一致的改进。Prior2DSM减少了重建误差，同时保持了结构保真度，与MDE的线性拟合相比，RMSE降低了高达46%，并且进一步实现了DSM更新和耦合RGB-DSM生成。

Summary / 总结

The research aims to address the issue of incomplete or outdated regions in large-scale digital surface models (DSMs) by proposing Prior2DSM, a training-free framework that leverages self-supervised ViT features and monocular depth foundation models for test-time adaptation. The method uses parameter-efficient low-rank adaptation (LoRA) and a lightweight MLP to predict spatially varying scale and shift parameters, converting relative depth estimates into metric heights. Experiments show that Prior2DSM outperforms interpolation-based methods and prior-based rescaling approaches, reducing reconstruction error by up to 46% compared to linear fitting of monocular depth estimation models.

论文旨在通过提出Prior2DSM，一种无需训练的DSM（数字表面模型）完成框架，解决大规模DSM中存在的不完整或过时区域问题。该方法利用DINOv3的自监督Vision Transformer (ViT)特征和单目深度基础模型传播不完整高度先验的度量信息。该方法使用参数高效的低秩适应(LoRA)和轻量级多层感知机(MLP)进行测试时适应，将相对深度估计转换为度量高度。实验表明，Prior2DSM在减少重建误差方面优于基于插值的方法和最先进的单目深度估计模型，与单目深度估计(MDE)的线性拟合相比，误差降低高达46%。
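
The Prior2DSM entry compares against "linear fitting of MDE", that is, mapping relative depth to metric height with a single scale and shift. The sketch below illustrates that kind of global least-squares fit on toy data; it is not the paper's method, which predicts spatially varying scale and shift with LoRA and an MLP. The function name `fit_scale_shift`, the flat list layout, the use of `None` for missing cells, and all numbers are assumptions made for illustration.

```python
def fit_scale_shift(rel_depth, height_prior):
    """Least-squares scale and shift mapping relative depth to metric height,
    fitted only on cells where a height prior exists (not None)."""
    pairs = [(d, h) for d, h in zip(rel_depth, height_prior) if h is not None]
    n = len(pairs)
    mean_d = sum(d for d, _ in pairs) / n
    mean_h = sum(h for _, h in pairs) / n
    var_d = sum((d - mean_d) ** 2 for d, _ in pairs)
    cov = sum((d - mean_d) * (h - mean_h) for d, h in pairs)
    scale = cov / var_d
    shift = mean_h - scale * mean_d
    return scale, shift


# Toy data: relative depth per cell and an incomplete height prior (metres).
rel_depth = [0.1, 0.4, 0.5, 0.9, 0.7]
height_prior = [3.2, 3.8, None, 4.8, None]  # None marks missing DSM cells

scale, shift = fit_scale_shift(rel_depth, height_prior)

# Keep known heights, fill the gaps with the rescaled depth estimate.
completed = [
    h if h is not None else scale * d + shift
    for d, h in zip(rel_depth, height_prior)
]
print(scale, shift, completed)
```

A single global scale and shift cannot absorb errors that vary across the scene, which is the motivation the abstract gives for predicting spatially varying parameters instead.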

Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

Authors: Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou, Jiwen Lu

First: 2026-04-02T12:51:07+00:00 · Latest: 2026-04-02T12:51:07+00:00


Abstract

Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.

中文标题/摘要

标题：注意力静止则保持静止：打破视觉惯性以减轻认知幻觉

如同静止的物体保持静止，我们发现多模态大型语言模型（MLLMs）中的视觉注意力表现出明显的惯性，在早期解码步骤中一旦稳定下来就保持相对静止，无法支持认知推理所需的组合理解。现有的幻觉缓解方法主要针对与物体存在或属性相关的感知幻觉，但对于需要物体间关系推理的认知幻觉则显得力不从心。通过词元级别的注意力分析，我们发现这种视觉惯性是关键因素：对语义关键区域的注意力保持持续聚焦，无法动态支持关系推理。因此，我们提出了一种无需训练的惯性感知视觉激发（IVE）方法，通过将认知推理建模为视觉注意力的动态响应来打破这种惯性模式。具体而言，IVE 选择相对于历史注意力趋势动态出现的视觉词元，同时区分表现出惯性行为的词元。为了进一步促进组合推理，IVE 引入了一种惯性感知惩罚，以防止过度集中并限制注意力在局部区域内的持久性。广泛的实验表明，IVE 在各种基础 MLLMs 和多个幻觉基准测试中都表现出有效性，特别是在认知幻觉方面。

Summary / 总结

The paper addresses the issue of visual inertia in multimodal large language models (MLLMs), where attention remains static and fails to support compositional understanding needed for cognitive inference. It proposes an Inertia-aware Visual Excitation (IVE) method to mitigate cognitive hallucinations by dynamically adjusting visual attention. Experiments demonstrate that IVE effectively reduces cognitive hallucinations across different MLLMs and benchmarks.

论文针对多模态大型语言模型（MLLMs）中视觉惯性问题，即注意力保持静态无法支持所需的组成性推理。提出了一种惯性感知视觉激发（IVE）方法，通过将注意力建模为对历史趋势的动态响应来打破惯性模式。实验表明IVE在各种MLLMs和幻觉基准测试中有效，特别是在认知幻觉方面。
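
The idea of separating "dynamically emerging" visual tokens from inertial ones, as described in the IVE entry, can be sketched with a running average of past attention. Everything in this sketch is an assumption made for illustration: the exponential moving average, the `margin` threshold, the uniform-attention cutoff for calling a token inertial, and the names `update_history` and `split_tokens`. The abstract does not specify how IVE computes historical trends or its penalty.

```python
def update_history(history, attention, momentum=0.8):
    """Exponential moving average of per-token attention (assumed trend model)."""
    return [momentum * h + (1.0 - momentum) * a for h, a in zip(history, attention)]


def split_tokens(history, attention, margin=0.05):
    """Return (emerging, inertial) token indices for one decoding step.

    emerging: attention rose above its historical trend by more than `margin`.
    inertial: attention stayed within `margin` of a trend that is already
              above the uniform level 1 / num_tokens.
    """
    uniform = 1.0 / len(history)
    emerging, inertial = [], []
    for i, (h, a) in enumerate(zip(history, attention)):
        if a - h > margin:
            emerging.append(i)
        elif abs(a - h) <= margin and h > uniform:
            inertial.append(i)
    return emerging, inertial


# Toy attention over four visual tokens at three decoding steps (hypothetical).
steps = [
    [0.70, 0.10, 0.10, 0.10],
    [0.68, 0.12, 0.10, 0.10],
    [0.50, 0.10, 0.30, 0.10],
]

history = steps[0]
for attention in steps[1:]:
    emerging, inertial = split_tokens(history, attention)
    print("emerging:", emerging, "inertial:", inertial)
    history = update_history(history, attention)
```

A penalty in the spirit of the abstract would then down-weight the inertial indices and favor the emerging ones, but the exact form is left open here.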

Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models

Authors: Antoine Saporta, Baptiste Callard, Corentin Dancette, Julien Khlaut, Charles Corbière, Leo Butsanets, Amaury Prat, Pierre Manceron

First: 2026-04-02T12:49:38+00:00 · Latest: 2026-04-02T12:49:38+00:00


Abstract

The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively with vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.

中文标题/摘要

标题：Curia-2：扩展医学影像自监督学习的基础模型

医学影像的迅速增长推动了基础模型（FMs）的发展，以减轻放射科医生日益增长且不可持续的工作负担。尽管最近的FMs展示了大规模预训练在CT和MRI分析中的强大能力，但这些模型从复杂放射学数据中学习的方式仍有很大的优化空间。基于Curia框架，这项工作引入了Curia-2，显著改进了原始的预训练策略和表示质量，更好地捕捉了放射学数据的特异性。提出的方案使架构能够扩展到具有十亿参数的视觉变换器，这是多模态CT和MRI FMs的首次。此外，我们通过扩展和重构CuriaBench，将其分为两个不同的赛道：一个针对切片视觉模型的2D赛道和一个用于体素基准测试的3D赛道。我们的结果显示，Curia-2在视觉任务上优于所有FMs，并在复杂的临床任务如检测方面与视觉语言模型竞争。权重将公开发布以促进进一步研究。

Summary / 总结

This work aims to enhance the performance of Foundation Models (FMs) for radiology by improving the pre-training strategy and representation quality. Curia-2, an advancement of the Curia framework, scales up to billion-parameter Vision Transformers for multi-modal CT and MRI analysis. The authors restructure CuriaBench into two tracks for evaluating these models: a 2D track for slice-based vision models and a 3D track for volumetric benchmarking. Experimental results show that Curia-2 outperforms other FMs on vision-focused tasks and performs competitively with vision-language models on clinically complex tasks such as finding detection. The weights will be made publicly available for further research.

该论文介绍了Curia-2，它增强了放射学领域基础模型的预训练策略和表示质量，使能够使用十亿参数的Vision Transformers进行CT和MRI分析。作者正式化了评估过程，并展示了Curia-2在视觉任务上优于其他基础模型，在临床复杂任务上也表现得相当不错。模型权重将公开以促进进一步研究。

SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting

Authors: Di Wu, Liu Liu, Xueyu Yuan, Wenxiao Chen, Lijun Yue, Liuzhu Chen, Yiming Tang, Meng Wang

First: 2025-11-21T09:49:53+00:00 · Latest: 2026-04-02T12:37:26+00:00

Comments: 10 pages, 7 figures


Abstract

Articulated objects are ubiquitous in daily environments, and their 3D reconstruction holds great significance across various fields. However, existing articulated object reconstruction methods typically require costly inputs such as multi-stage and multi-view observations. To address the limitations, we propose a category-agnostic articulated object reconstruction framework via planar Gaussian Splatting, which only uses sparse-view RGB images from a single state. Specifically, we first introduce a Gaussian information field to perceive the optimal sparse viewpoints from candidate camera poses. To ensure precise geometric fidelity, we constrain traditional 3D Gaussians into planar primitives, facilitating accurate normal and depth estimation. The planar Gaussians are then optimized in a coarse-to-fine manner, regularized by depth smoothness and few-shot diffusion priors. Furthermore, we leverage a Vision-Language Model (VLM) via visual prompting to achieve open-vocabulary part segmentation and joint parameter estimation. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach significantly outperforms existing baselines, achieving superior part-level surface reconstruction fidelity. Code and data are provided in the supplementary material.

中文标题/摘要

标题：SPAGS：基于平面高斯泼溅的单状态稀疏视图铰接物体重建

铰接物体在日常环境中无处不在，其3D重建在多个领域具有重要意义。然而，现有的铰接物体重建方法通常需要多阶段和多视图观察等昂贵的输入。为了解决这些限制，我们提出了一种基于平面高斯泼溅的类别无关铰接物体重建框架，仅使用单一状态下的稀疏视图RGB图像。具体来说，我们首先引入高斯信息场，从候选相机姿态中感知最优的稀疏视点。为了确保精确的几何保真度，我们将传统的3D高斯约束为平面基元，便于准确的法线和深度估计。然后，平面高斯以由粗到细的方式进行优化，并通过深度平滑和少样本扩散先验进行正则化。此外，我们通过视觉提示利用视觉语言模型（VLM）实现开放词汇部件分割和关节参数估计。在合成和真实世界数据集上的广泛实验表明，我们的方法显著优于现有基线，实现了更优的部件级表面重建保真度。代码和数据在补充材料中提供。

Summary / 总结

The paper proposes SPAGS, a framework for reconstructing articulated objects from a single state using sparse-view RGB images. A Gaussian information field selects the optimal sparse viewpoints from candidate camera poses, and 3D Gaussians are constrained into planar primitives that are optimized in a coarse-to-fine manner, regularized by depth smoothness and few-shot diffusion priors. The approach leverages a Vision-Language Model for open-vocabulary part segmentation and joint parameter estimation. Experiments show that SPAGS outperforms existing methods in part-level surface reconstruction fidelity on both synthetic and real-world datasets.

研究旨在解决现有铰接物体重建方法需要多阶段、多视角观察的高成本问题。提出的 SPAGS 框架利用稀疏视图 RGB 图像从单个状态重建铰接物体。该方法通过高斯信息场选择最优视点，并将 3D 高斯约束为平面基元以实现准确的法线和深度估计。在合成和真实世界数据集上的广泛实验表明，该方法在部件级表面重建精度方面显著优于现有方法。

Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts

Authors: Yifan Gao, Tao Zhou, Yi Zhou, Ke Zou, Yizhe Zhang, Huazhu Fu

First: 2026-04-02T11:31:30+00:00 · Latest: 2026-04-02T11:31:30+00:00

Comments: 10 pages, 6 figures


Abstract

Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding still lacks spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.

中文标题/摘要

标题：通过知识引导的空间提示增强医学视觉定位

医学视觉定位（MVG）旨在从自由文本放射学报告中识别出诊断相关的短语，并定位其在医学图像中的对应区域，提供可解释的视觉证据以支持临床决策。尽管最近的视觉-语言模型（VLMs）展示了有希望的多模态推理能力，但它们的空间定位精度仍然不足，主要是由于在仅依赖潜在嵌入时缺乏明确的定位先验。在本文中，我们从注意力机制的角度分析了这一局限性，并提出了一种名为KnowMVG的知识先验和全局-局部注意力增强框架，以在VLMs中增强MVG的空间意识。具体而言，我们提出了一种知识增强的提示策略，将与短语相关的医学知识编码为紧凑的嵌入，同时结合全局-局部注意力机制，共同利用粗略的全局信息和精细的局部线索来引导精确的区域定位。此设计在不引入额外的文本推理开销的情况下，将高层次的语义理解和精细的视觉感知相结合。在四个MVG基准上的广泛实验表明，我们的KnowMVG在AP50和mIoU方面均优于现有方法，分别提高了3.0%和2.6%。进一步的定性和消融研究还验证了每个组件的有效性。

Summary / 总结

This paper addresses the challenge of Medical Visual Grounding (MVG) by proposing KnowMVG, a framework that enhances spatial precision in VLMs through knowledge-guided spatial prompts. KnowMVG incorporates medical knowledge into compact embeddings and uses global-local attention to guide precise region localization, improving both semantic understanding and visual perception. Experiments on four MVG benchmarks show that KnowMVG outperforms existing methods, achieving gains of 3.0% in AP50 and 2.6% in mIoU.

该研究通过提出KnowMVG框架来解决医学视觉定位（MVG）中空间精度不足的问题，该框架增强了视觉语言模型（VLMs）的空间意识。KnowMVG使用知识增强的提示策略将医学知识编码到嵌入中，并使用全局-局部注意力机制来引导精确的区域定位。实验表明，KnowMVG在四个MVG基准测试中分别在AP50和mIoU上优于现有方法3.0%和2.6%。

Robot Collapse: Supply Chain Backdoor Attacks Against VLM-based Robotic Manipulation

Authors: Xianlong Wang, Hewen Pan, Hangtao Zhang, Minghui Li, Shengshan Hu, Ziqi Zhou, Lulu Xue, Peijin Guo, Aishan Liu, Leo Yu Zhang, Xiaohua Jia

First: 2024-11-18T16:09:26+00:00 · Latest: 2026-04-02T10:50:53+00:00

Abstract

Robotic manipulation policies are increasingly empowered by large language models (LLMs) and vision-language models (VLMs), leveraging their understanding and perception capabilities. Recently, inference-time attacks against robotic manipulation have been extensively studied, yet backdoor attacks targeting model supply chain security in robotic policies remain largely unexplored. To fill this gap, we propose TrojanRobot, a backdoor injection framework for model supply chain attack scenarios, which embeds a malicious module into modular robotic policies via backdoor relationships to manipulate the LLM-to-VLM pathway and compromise the system. Our vanilla design instantiates this module as a backdoor-finetuned VLM. To further enhance attack performance, we propose a prime scheme by introducing the concept of LVLM-as-a-backdoor, which leverages in-context instruction learning (ICIL) to steer large vision-language model (LVLM) behavior through backdoored system prompts. Moreover, we develop three types of prime attacks, permutation, stagnation, and intentional, achieving flexible backdoor attack effects. Extensive physical-world and simulator experiments on 18 real-world manipulation tasks and 4 VLMs verify the superiority of the proposed TrojanRobot.

中文标题/摘要

标题：机器人坍塌：针对基于VLM的机器人操作的供应链后门攻击

机器人操作策略越来越多地借助大型语言模型（LLMs）和视觉语言模型（VLMs），利用它们的理解和感知能力。最近，针对机器人操作的推理时攻击得到了广泛研究，但面向模型供应链安全的机器人策略后门攻击仍鲜有探索。为填补这一空白，我们提出了TrojanRobot，这是一个面向模型供应链攻击场景的后门注入框架，它通过后门关系将恶意模块嵌入模块化机器人策略中，以操控LLM到VLM的路径并破坏系统。我们的基础（vanilla）设计将此模块实例化为后门微调的VLM。为进一步增强攻击性能，我们提出了prime方案，引入LVLM-as-a-backdoor的概念，利用上下文指令学习（ICIL），通过植入后门的系统提示引导大型视觉语言模型（LVLM）的行为。此外，我们开发了三种类型的prime攻击，即排列（permutation）、停滞（stagnation）和故意（intentional），实现了灵活的后门攻击效果。在18个真实世界的操作任务和4个VLM上的物理世界和模拟器实验验证了所提出的TrojanRobot的优越性。

Summary / 总结

This paper addresses the security vulnerability in robotic manipulation policies that rely on large language models (LLMs) and vision-language models (VLMs). It introduces TrojanRobot, a backdoor injection framework that embeds a malicious module into robotic policies to manipulate the LLM-to-VLM pathway. The framework uses a backdoor-finetuned VLM and an in-context instruction learning (ICIL) scheme to steer LVLM behavior. Three types of prime attacks are proposed, achieving flexible backdoor effects. Experiments on 18 real-world manipulation tasks and 4 VLMs demonstrate the effectiveness of the proposed method.

论文关注依赖大型语言模型（LLM）和视觉语言模型（VLM）的机器人操作策略的安全性。提出了TrojanRobot，这是一种后门注入框架，将恶意模块嵌入到机器人策略中，以操控LLM到VLM的路径。作者提出了一种基于LVLM-as-a-backdoor的prime方案，并开发了三种类型的prime攻击以增强攻击性能。在18个真实世界的操作任务和4个VLM上的实验结果表明了所提方法的有效性。

One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image

Authors: Ezzeldin Shereen, Dan Ristea, Shae McFadden, Burak Hasircioglu, Vasilios Mavroudis, Chris Hicks

Venue: Transactions on Machine Learning Research (TMLR), 2026

First: 2025-04-02T21:08:33+00:00 · Latest: 2026-04-02T10:17:08+00:00

Comments: Published in Transactions on Machine Learning Research (03/2026)

Abstract

Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show that VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, while demonstrating robustness to black-box attacks in the universal setting.

中文标题/摘要

标题：仅需一张图片：通过单张图像对视觉文档增强生成进行投毒攻击

检索增强生成（RAG）通过使用事实知识库（KB）来抑制大型语言模型（LLMs）中的幻觉，起到了关键作用。尽管PDF文档是知识的重要来源，但基于文本的RAG管道无法有效捕捉其丰富的多模态信息。相比之下，视觉文档RAG（VD-RAG）使用文档页面的截图作为KB，已被证明可达到最先进的效果。然而，通过引入图像模态，VD-RAG为对手提供了新的攻击途径，通过向KB注入恶意文档来破坏系统。在本文中，我们展示了VD-RAG在检索和生成方面都容易受到投毒攻击的脆弱性。我们定义了两种攻击目标，并证明只需向KB注入一张对抗性图像即可实现这两种目标。首先，我们介绍了一种针对一个或一组查询的定向攻击，其目标是传播有针对性的虚假信息。其次，我们提出了一种通用攻击，对于任何潜在的用户查询，都会影响响应以导致VD-RAG系统的拒绝服务。我们在白盒和黑盒假设下研究了这两种攻击目标，采用多目标梯度优化方法以及提示最先进的生成模型。使用两个视觉文档数据集、一组多样化的最先进的检索器（嵌入模型）和生成器（视觉语言模型），我们展示了VD-RAG在定向和通用设置下都容易受到投毒攻击，但在通用设置下对黑盒攻击具有鲁棒性。

Summary / 总结

This paper investigates the vulnerability of visual document retrieval-augmented generation (VD-RAG) systems to poisoning attacks. The study demonstrates that a single adversarial image can be used to either spread targeted disinformation or cause a denial-of-service for any potential query. The research employs a multi-objective gradient-based optimization approach and shows that VD-RAG is susceptible to both targeted and universal attacks, but remains robust against black-box attacks in the universal setting.

本文研究了视觉文档检索增强生成（VD-RAG）系统对投毒攻击的脆弱性。研究证明，只需一张恶意图像即可破坏检索和生成过程。定义了两种攻击目标：针对特定查询的定向攻击以传播虚假信息，以及通用攻击以导致系统拒绝服务。研究采用多目标梯度优化方法，并表明VD-RAG在定向攻击中容易受到攻击，但在通用攻击中对黑盒攻击具有鲁棒性。实验使用了多种数据集和最先进的检索与生成模型。

Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance

Authors: Jason Qiu, Zachary Meurer, Xavier Thomas, Deepti Ghadiyaram

First: 2026-04-02T10:02:49+00:00 · Latest: 2026-04-02T10:02:49+00:00

Abstract

This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.

中文标题/摘要

标题：语义丰富性还是几何推理？VLM视觉不变性的脆弱性

这项工作探讨了最先进的视觉-语言模型（VLMs）在基本几何变换下的根本脆弱性。尽管现代VLMs在识别以标准姿态出现的对象和描述复杂场景等语义任务上表现出色，但在更基本的层面上，它们表现出系统性的失败：缺乏在简单旋转、缩放和恒等变换下可靠确定物体身份所需的稳健的空间不变性和协变性。我们通过在包括符号草图、自然照片和抽象艺术在内的多种视觉领域进行系统评估，展示了这一局限性。随着语义内容的稀疏，性能急剧下降，这种行为在不同架构、模型容量和提示策略中均被观察到。总体而言，我们的结果揭示了当前VLMs在语义理解和空间推理之间的系统性差距，突显了未来多模态系统中需要更强的几何基础的重要性。

Summary / 总结

This work examines the fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations, showing that while VLMs perform well on semantic tasks, they struggle with fundamental spatial invariance and equivariance required for object recognition under simple transformations. Performance decreases significantly when semantic content is sparse, indicating a gap between semantic understanding and spatial reasoning in current VLMs, necessitating stronger geometric grounding in future models.

这项工作探讨了最先进的视觉-语言模型（VLMs）在基本几何变换下的根本局限性。尽管它们在语义任务上表现出色，但在处理稀疏语义内容时，VLMs的性能显著下降，表明它们在语义理解和空间推理能力之间存在差距。研究在各种视觉领域评估了VLMs，并发现其性能一致性下降，强调了未来多模态系统中需要增强的空间几何基础。

Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

Authors: Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin

First: 2026-04-02T09:53:20+00:00 · Latest: 2026-04-02T09:53:20+00:00

Abstract

While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate \textit{Token Visual Dependency}, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), which is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published on https://github.com/Yzk1114/PGPO.
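
The two ingredients named in the abstract, a per-token KL divergence and a threshold-gated, mass-conserving reweighting of advantages, can be illustrated with a small sketch. Everything beyond those two ideas is an assumption made for illustration: the function names, the `threshold` and `boost` values, and the specific choice of conserving mass by rescaling weights so they sum to the number of tokens. This is not the PGPO implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def token_visual_dependency(visual_dists, text_dists):
    """Per-token KL between image-conditioned and text-only predictions."""
    return [kl_divergence(p, q) for p, q in zip(visual_dists, text_dists)]


def reshape_advantages(advantage, deps, threshold=0.1, boost=2.0):
    """Spread one sequence-level advantage over tokens (hypothetical rule).

    Tokens whose dependency exceeds `threshold` get weight `boost`, the
    rest get 1.0. Weights are then rescaled to sum to len(deps), so the
    total advantage mass equals that of uniform assignment.
    """
    raw = [boost if d > threshold else 1.0 for d in deps]
    scale = len(raw) / sum(raw)
    return [advantage * w * scale for w in raw]
```

With two tokens where only the second depends on the image, the second token receives twice the advantage of the first while the sum stays at `2 * advantage`, which is the sense of "mass-conserving" assumed here.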

中文标题/摘要

标题：并非所有词元的“视觉”都相同：面向大型视觉-语言模型的感知驱动策略优化

虽然可验证奖励强化学习（RLVR）提升了大型视觉-语言模型（LVLMs）的推理能力，但现有框架存在根本性的方法论缺陷：由于向所有生成的词元分配相同的优势，这些方法会稀释对于优化多模态推理中关键的、依赖视觉的步骤至关重要的学习信号。为弥补这一差距，我们提出了词元视觉依赖性（Token Visual Dependency），通过视觉条件预测分布与纯文本预测分布之间的Kullback-Leibler（KL）散度来量化视觉输入的因果信息增益。我们发现这种依赖性高度稀疏且在语义上至关重要，据此引入了感知驱动策略优化（PGPO），这是一种新颖的细粒度信用分配框架，能够在词元级别动态重塑优势。通过阈值门控、质量守恒的机制，PGPO主动放大依赖视觉的词元的学习信号，同时抑制语言先验带来的梯度噪声。基于Qwen2.5-VL系列在七个具有挑战性的多模态推理基准上的广泛实验表明，PGPO使模型平均提升18.7%。理论和实证分析均证实，PGPO有效降低了梯度方差，防止了训练崩溃，并作为强大的正则化器促进了稳健的、基于感知的多模态推理。代码将在https://github.com/Yzk1114/PGPO上发布。

Efficient Reasoning with Balanced Thinking

Authors: Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian

Venue: ICLR 2026

First: 2026-03-12T18:48:07+00:00 · Latest: 2026-04-02T09:30:13+00:00

Comments: Accepted by ICLR 2026

Abstract

Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .
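
The steering mechanism described in the abstract (prototypes aggregated from hidden states, a steering vector, and a confidence-driven control function) might look roughly like the sketch below. The abstract does not give the formulas, so the details are assumptions: the vector is taken as the difference between the two prototypes, and the control rule uses hypothetical thresholds `var_thresh` and `conf_thresh` with a fixed magnitude `alpha`.

```python
from statistics import mean, pvariance


def prototype(hidden_states):
    """Element-wise mean of a list of equally sized hidden-state vectors."""
    dim = len(hidden_states[0])
    return [mean(h[i] for h in hidden_states) for i in range(dim)]


def steering_vector(over_states, under_states):
    """Direction pointing from the overthinking to the underthinking prototype."""
    p_over, p_under = prototype(over_states), prototype(under_states)
    return [u - o for o, u in zip(p_over, p_under)]


def control_strength(confidences, var_thresh=0.02, conf_thresh=0.9, alpha=1.0):
    """Signed steering strength from recent step confidences (assumed rule)."""
    if pvariance(confidences) > var_thresh:
        return alpha       # unstable confidence: treat as overthinking
    if min(confidences) > conf_thresh:
        return -alpha      # uniformly high confidence: treat as underthinking
    return 0.0             # otherwise leave the hidden state untouched


def steer(hidden, vector, strength):
    """Shift a hidden state along the steering vector."""
    return [h + strength * v for h, v in zip(hidden, vector)]
```

A positive strength moves the hidden state away from the overthinking prototype and a negative one moves it away from the underthinking prototype, which mirrors the pruning and exploration roles the abstract assigns to the dynamic control function.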

中文标题/摘要

标题：平衡思考实现高效推理

大型推理模型（LRMs）展示了出色的推理能力，但往往存在过度推理的问题，即在简单问题上浪费冗余计算步骤，或者存在欠推理的问题，即在具备推理能力的情况下未能充分探索推理路径。这些问题导致了效率低下和潜在的不准确性，限制了其在资源受限环境中的实际部署。现有减少过度推理的方法，如抑制反思关键词或调整推理长度，可能会无意中导致欠推理，从而影响准确性。因此，我们提出了ReBalance，这是一种无需训练的框架，实现了平衡思考下的高效推理。ReBalance 利用置信度作为推理动态的连续指标，通过高置信度波动识别过度推理，通过一致的高置信度识别欠推理。通过将小型数据集中的隐藏状态聚合为推理模式原型，我们计算出一个引导向量来引导LRMs的推理轨迹。动态控制函数根据实时置信度调整该向量的强度和方向，在过度推理时修剪冗余，在欠推理时促进探索。在四个从0.5B到32B的模型以及九个涉及数学推理、通用问答和编程任务的基准测试中进行的广泛实验表明，ReBalance 有效减少了输出冗余并提高了准确性，提供了一种通用、无需训练且即插即用的策略，用于高效和稳健的LRM部署。项目页面和代码可在https://rebalance-ai.github.io 获取。

Summary / 总结

The paper addresses the inefficiencies of Large Reasoning Models (LRMs) due to overthinking or underthinking, proposing ReBalance, a training-free framework that uses confidence to balance reasoning dynamics. ReBalance identifies overthinking through high confidence variance and underthinking via consistent overconfidence, guiding LRMs to prune redundancy and promote exploration. Experiments show that ReBalance reduces output redundancy and improves accuracy across various models and benchmarks, offering a general and plug-and-play strategy for efficient LRM deployment.

论文针对大型推理模型（LRMs）因过度推理或不足推理导致的效率问题，提出了一个无需训练的框架ReBalance。ReBalance利用信心来识别过度推理和不足推理，并通过计算引导LRMs推理轨迹的引导向量，该向量根据实时信心动态控制，以消除冗余并促进探索。实验表明，ReBalance在各种模型和基准测试中减少了输出冗余并提高了准确性，提供了一种通用且即插即用的策略，用于高效和稳健的LRM部署。

Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning

Authors: Seyed Amir Kasaei, Arash Marioriyad, Mahbod Khaleti, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Venue: ICLR 2026

First: 2026-04-02T08:33:13+00:00 · Latest: 2026-04-02T08:33:13+00:00

Comments: Accepted at ICLR 2026 Workshop: From Human Cognition to AI Reasoning (HCAIR)

Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.

中文标题/摘要

标题：显而易见中的隐藏含义：RebusBench 用于评估认知视觉推理能力

大型视觉-语言模型（LVLMs）在显性的视觉识别方面取得了显著的成就，能够有效地描述图像中直接可见的内容。然而，当视觉输入仅作为线索而非答案时，一个关键的认知差距出现了。我们发现，当前的模型在解决需要复杂多步推理的问题时存在困难，这些问题中的信息并未明确呈现。成功解决谜语谜题需要一种独特的认知工作流程：模型必须提取视觉和文本属性，检索语言先验知识（如成语），并进行抽象映射，将这些元素综合成一种存在于像素空间之外的意义。为了评估这种神经符号能力，我们引入了RebusBench，这是一个包含1,164个谜题的基准测试，旨在测试这种感知与知识的特定整合。我们对最先进的模型（包括Qwen、InternVL和LLaVA）的评估显示，性能在10%的精确匹配和20%的语义准确性以下饱和，模型规模或上下文学习（ICL）均未观察到显著改进。这些发现表明，虽然模型具备必要的视觉和语言组件，但缺乏将它们连接起来的认知推理机制。项目页面可在https://amirkasaei.com/rebusbench/访问。

Summary / 总结

The research evaluates the cognitive visual reasoning abilities of large vision-language models by introducing RebusBench, a benchmark of 1,164 rebus puzzles. Models such as Qwen, InternVL, and LLaVA are tested on their ability to extract visual and textual attributes, retrieve linguistic knowledge, and synthesize this information. These models perform poorly: performance saturates below 10% Exact Match and below 20% semantic accuracy, with no significant improvement from model scaling or in-context learning, indicating that they lack the cognitive reasoning needed to integrate visual and linguistic information effectively.

研究旨在通过引入包含1,164个谜题的RebusBench基准来评估大型视觉-语言模型的认知视觉推理能力。方法是测试Qwen、InternVL和LLaVA等模型解决这些谜题，这些谜题需要提取视觉和文本信息，运用语言知识，并将其综合成有意义的理解。关键发现表明，这些模型表现不佳，精确匹配率低于10%，语义准确率低于20%，表明它们缺乏将视觉和语言信息有效整合的认知推理能力。

SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

Authors: Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, Yu Wang

First: 2026-01-23T07:28:53+00:00 · Latest: 2026-04-02T08:13:10+00:00

Abstract

Diffusion Transformers have demonstrated remarkable performance in video generation. However, their long input sequences incur substantial latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free approaches are limited to moderate sparsity and thus yield only modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation. In this work, we propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. Leveraging a Multi-level Static-Dynamic Scaling Strategy to balance the two branches, our method attains up to 90% sparsity and 1.52-2.03x inference speedup across different models and sequence lengths, while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples, fewer than 1,600 training steps, and no more than 30 GPU hours with a batch size of 8.
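
The core structural idea in the abstract, a linear attention branch running in parallel with sparse attention, can be sketched in plain Python on tiny inputs. Several choices below are assumptions rather than details from the paper: the `elu(x) + 1` feature map (a common choice for linear attention), the `keep` index lists standing in for a sparsity pattern, and a single scalar `gate` standing in for the Multi-level Static-Dynamic Scaling Strategy, whose form the abstract does not specify.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def sparse_attention(Q, K, V, keep):
    """Softmax attention where query i attends only to key indices keep[i]."""
    out = []
    for i, q in enumerate(Q):
        idx = keep[i]
        w = softmax([dot(q, K[j]) / math.sqrt(len(q)) for j in idx])
        out.append([sum(wj * V[j][d] for wj, j in zip(w, idx))
                    for d in range(len(V[0]))])
    return out


def phi(x):
    """Positive feature map elu(x) + 1."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Linear-time attention: out_i = phi(q_i)^T S / (phi(q_i)^T z)."""
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]   # running sum of phi(k_j) v_j^T
    z = [0.0] * dk                        # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[a] * S[a][b] for a in range(dk)) / denom
                    for b in range(dv)])
    return out


def parallel_branches(Q, K, V, keep, gate=0.1):
    """Sparse branch plus a gated linear branch (gate is a stand-in scalar)."""
    sp = sparse_attention(Q, K, V, keep)
    li = linear_attention(Q, K, V)
    return [[s + gate * l for s, l in zip(rs, rl)] for rs, rl in zip(sp, li)]
```

The linear branch sees every key at a cost linear in sequence length, so it can supply the global context that the sparse branch drops when most key positions are masked out.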

中文标题/摘要

标题：SALAD：通过高效线性注意力调优实现高稀疏度注意力以提高视频扩散变换器性能

扩散变换器在视频生成方面表现出色。然而，由于全注意力的二次复杂性，其长输入序列导致了显著的延迟。已经提出了各种稀疏注意力机制。无需训练的方法仅能达到中等稀疏度，因此只能提供适度的加速，而基于训练的方法可以达到更高的稀疏度，但需要大量的数据和计算。在本工作中，我们提出了SALAD，引入了一个轻量级的线性注意力分支与稀疏注意力并行。通过多级静态-动态缩放策略平衡两个分支，我们的方法在不同模型和序列长度上实现了高达90%的稀疏度和1.52-2.03倍的推理加速，同时保持与全注意力基线相当的生成质量。此外，我们的微调过程非常高效，只需要2,000个视频样本，少于1,600个训练步骤，且不超过30个GPU小时，批量大小为8。

Summary / 总结

The research aims to address the latency issue in diffusion transformers for video generation by proposing SALAD, which combines a lightweight linear attention branch with sparse attention. The method uses a Multi-level Static-Dynamic Scaling Strategy to balance the two branches, achieving up to 90% sparsity and 1.52-2.03x inference speedup while maintaining comparable generation quality to full attention. The finetuning process is efficient, requiring only 2,000 video samples and 30 GPU hours.

研究旨在通过提出SALAD，即在Diffusion Transformers中引入轻量级线性注意力分支与稀疏注意力并行，解决视频生成中的延迟问题。该方法使用多级静态-动态缩放策略平衡两个分支，实现高达90%的稀疏度和1.52-2.03倍的推理加速，同时保持与全注意力基线相当的生成质量。微调过程高效，仅需2,000个视频样本、30个GPU小时和批量大小为8的1,600次训练步骤。

GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids

Authors: Mohamed Eltahir, Ahmed O. Ibrahim, Obada Siralkhatim, Tabarak Abdallah, Sondos Mohamed

First: 2026-03-26T14:08:41+00:00 · Latest: 2026-04-02T07:53:39+00:00

Abstract

Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this propose-ground-propagate principle in GridVAD, a training-free pipeline that produces pixel-level anomaly masks without any domain-specific training. A VLM reasons over stratified grid representations of video clips to generate natural-language anomaly proposals. Self-Consistency Consolidation (SCC) filters hallucinations by retaining only proposals that recur across multiple independent samplings. Grounding DINO anchors each surviving proposal to a bounding box, and SAM2 propagates it as a dense mask through the anomaly interval. The per-clip VLM budget is fixed at M+1 calls regardless of video length, where M can be set according to the proposals needed. On UCSD Ped2, GridVAD achieves the highest Pixel-AUROC (77.59) among all compared methods, surpassing even the partially fine-tuned TAO (75.11), and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablations reveal that SCC provides a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while additionally producing dense segmentation masks. Code and qualitative video results are available at https://gridvad.github.io.
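
The filtering step attributed to Self-Consistency Consolidation, keeping only proposals that recur across independent samplings, can be sketched as a simple voting rule. This is an illustration under stated assumptions: proposals are matched by exact string equality after whitespace and case normalization, and `min_votes` is a hypothetical parameter. The actual pipeline may match proposals semantically rather than textually.

```python
from collections import Counter


def normalize(proposal):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(proposal.lower().split())


def consolidate(samples, min_votes=2):
    """Keep proposals that appear in at least `min_votes` samplings.

    `samples` is a list of proposal lists, one list per independent VLM
    sampling. Each sampling contributes at most one vote per proposal.
    """
    votes = Counter()
    for sample in samples:
        votes.update({normalize(p) for p in sample})
    return sorted(p for p, n in votes.items() if n >= min_votes)
```

Raising `min_votes` discards more one-off proposals, which matches the precision-recall tradeoff the abstract reports: fewer hallucinated alarms survive, at the risk of dropping a true anomaly that was proposed only once.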

中文标题/摘要

标题：GridVAD：通过分层帧网格的空间推理实现开放集视频异常检测

视觉-语言模型（VLMs）是强大的开放集推理器，但在视频监控中直接用作异常检测器却很脆弱：没有校准的异常先验，它们会在漏检和虚假警报之间交替。我们认为问题不在于VLM本身，而在于其使用方式。VLMs应作为异常提案者，生成开放集候选描述，然后由专门构建的空间和时间模块进行定位和跟踪。我们通过GridVAD这一无需训练的管道实例化了这一提案-定位-传播原则，该管道在没有任何领域特定训练的情况下生成像素级异常掩码。VLM对视频片段的分层网格表示进行推理，生成自然语言异常提案。自我一致性聚合（SCC）通过仅保留跨多次独立采样中重复出现的提案来过滤虚假警报。DINO锚定每个幸存提案到一个边界框，SAM2将其作为密集掩码在异常区间内传播。每段视频的VLM预算固定为M+1次调用，无论视频长度如何，M可以根据需要进行设置。在UCSD Ped2上，GridVAD在所有比较方法中实现了最高的像素-AUROC（77.59），甚至超过了部分微调的TAO（75.11），在对象级RBDC上也比其他零样本方法高出5倍以上。消融实验表明，SCC提供了可控制的精确度-召回率权衡：过滤可以提高所有像素级别指标，同时在对象级别召回率上付出适度的代价。效率实验表明，GridVAD比均匀的每帧VLM查询更高效2.7倍，同时还能生成密集分割掩码。代码和定性视频结果可在https://gridvad.github.io/获取。

Summary / 总结

GridVAD uses Vision-Language Models (VLMs) as anomaly proposers for open-set video anomaly detection. A VLM reasons over stratified grid representations of video clips to generate anomaly proposals, Self-Consistency Consolidation (SCC) filters out hallucinated ones, and Grounding DINO and SAM2 then ground and propagate the survivors as dense masks. On the UCSD Ped2 dataset, GridVAD achieves the highest Pixel-AUROC (77.59) among compared methods and outperforms other zero-shot approaches on object-level RBDC by over 5x. Ablation studies show that SCC offers a controllable precision-recall tradeoff: filtering improves all pixel-level metrics at a modest cost in object-level recall. Efficiency experiments show that GridVAD is 2.7x more call-efficient than uniform per-frame VLM querying while also producing dense segmentation masks.

GridVAD 提出了一种方法，通过利用分层网格表示和自我一致性聚合来增强 Vision-Language 模型（VLM）在视频异常检测中的应用。该方法生成并过滤异常提案，然后使用空间和时间模块进行定位和传播。在 UCSD Ped2 数据集上，GridVAD 达到了最高的像素 AUROC 分数（77.59），并且在对象级别 RBDC 上比其他零样本方法高出 5 倍以上。消融研究显示，自我一致性聚合提供了可控的精确度-召回率权衡：过滤提高了所有像素级别指标，代价是对象级别召回率略有下降。效率实验表明，GridVAD 的调用效率是均匀每帧查询的 2.7 倍，同时还能提供密集的分割掩码。
