arXiv 论文速递

Snapshot: 20260317_0401

Visual-ERM: Reward Modeling for Visual Equivalence

Authors: Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang

First: 2026-03-13T17:58:14+00:00 · Latest: 2026-03-13T17:58:14+00:00

Comments: Project: https://github.com/InternLM/Visual-ERM

Abstract

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.

中文标题/摘要

标题：Visual-ERM：视觉等价性奖励建模

视觉到代码任务要求模型将图表、表格和SVG等结构化视觉输入重构为具有高视觉保真度的可执行或结构化表示。虽然最近的大型视觉语言模型（LVLM）通过监督微调取得了出色的结果，但由于奖励信号未对齐，强化学习仍然具有挑战性。现有的奖励要么依赖文本规则，要么依赖粗粒度的视觉嵌入相似度，两者都无法捕捉细粒度的视觉差异，并且容易受到奖励作弊（reward hacking）的影响。我们提出了视觉等价性奖励模型（Visual-ERM），这是一种多模态生成式奖励模型，能够直接在渲染后的视觉空间中提供细粒度、可解释且任务无关的反馈，以评估视觉到代码的质量。将Visual-ERM集成到强化学习后，Qwen3-VL-8B-Instruct在图表到代码任务上提升了8.4，在表格和SVG解析上也取得了稳定的提升（平均分别为+2.7和+4.1），并通过反思与修订进一步增强了测试时扩展。我们还引入了VisualCritic-RewardBench（VC-RewardBench），这是一个用于评判结构化视觉数据上细粒度图像间差异的基准；在该基准上，8B规模的Visual-ERM显著优于Qwen3-VL-235B-Instruct，并接近领先的闭源模型。我们的结果表明，无论任务特异性如何，细粒度的视觉奖励监督对于视觉到代码的强化学习都是必要且充分的。

Summary / 总结

Visual-ERM is a multimodal generative reward model that scores vision-to-code outputs directly in the rendered visual space, providing fine-grained, interpretable, and task-agnostic feedback. Used as the reward in reinforcement learning, it improves Qwen3-VL-8B-Instruct by 8.4 on chart-to-code and yields consistent average gains on table and SVG parsing (+2.7 and +4.1). On the VisualCritic-RewardBench benchmark, the 8B Visual-ERM outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models, which the authors take as evidence that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code reinforcement learning.

Visual-ERM 是一个多模态生成式奖励模型，旨在通过提供细粒度、可解释且任务无关的反馈来改进视觉到代码任务的强化学习。它使 Qwen3-VL-8B-Instruct 在图表到代码任务上提高了 8.4，且在表格和 SVG 解析上也取得了稳定的改进，平均改进幅度分别为 +2.7 和 +4.1。在 VisualCritic-RewardBench 基准测试中，8B 规模的 Visual-ERM 超越了 Qwen3-VL-235B-Instruct，并接近领先的闭源模型，表明细粒度的视觉奖励监督对于视觉到代码的强化学习既是必要的也是充分的。

Automatic In-Domain Exemplar Construction and LLM-Based Refinement of Multi-LLM Expansions for Query Expansion

Authors: Minghan Li, Ercong Nie, Siqi Zhao, Tongna Chen, Huiping Huang, Guodong Zhou

First: 2026-02-09T17:16:39+00:00 · Latest: 2026-03-13T17:55:59+00:00

Comments: Preprint. This paper is under consideration at Pattern Recognition Letters

Abstract

Query expansion with large language models is promising but often relies on hand-crafted prompts, manually chosen exemplars, or a single LLM, making it non-scalable and sensitive to domain shift. We present an automated, domain-adaptive QE framework that builds in-domain exemplar pools by harvesting pseudo-relevant passages using a BM25-MonoT5 pipeline. A training-free cluster-based strategy selects diverse demonstrations, yielding strong and stable in-context QE without supervision. To further exploit model complementarity, we introduce a two-LLM ensemble in which two heterogeneous LLMs independently generate expansions and a refinement LLM consolidates them into one coherent expansion. Across TREC DL20, DBPedia, and SciFact, the refined ensemble delivers consistent and statistically significant gains over BM25, Rocchio, zero-shot, and fixed few-shot baselines. The framework offers a reproducible testbed for exemplar selection and multi-LLM generation, and a practical, label-free solution for real-world QE.

中文标题/摘要

标题：自动领域内示例构建及基于LLM的多LLM扩展精炼用于查询扩展

使用大型语言模型进行查询扩展前景广阔，但通常依赖于手工制作的提示、手动选择的示例或单一的LLM，这使其难以扩展且对领域转移敏感。我们提出了一种自动的领域自适应查询扩展框架，通过使用BM25-MonoT5流水线收集伪相关段落来构建领域内示例池。一种无需训练的基于聚类的策略选择多样化的示例，从而在无监督的情况下获得强大且稳定的上下文查询扩展。为了进一步利用模型互补性，我们引入了一种两LLM集成方法，在该方法中，两个异构的LLM独立生成扩展，而精炼LLM将它们整合成一个连贯的扩展。在TREC DL20、DBPedia和SciFact上，精炼的集成体在BM25、Rocchio、零样本和固定少量样本基线之上提供了持续且统计上显著的改进。该框架提供了一个可重复的示例选择和多LLM生成测试平台，并为实际应用中的查询扩展提供了一种实用的、无需标注的解决方案。

Summary / 总结

The paper addresses query expansion methods that rely on hand-crafted prompts, manually chosen exemplars, or a single LLM, which makes them hard to scale and sensitive to domain shift. It proposes an automated framework that builds in-domain exemplar pools from pseudo-relevant passages harvested with a BM25-MonoT5 pipeline and selects diverse demonstrations with a training-free, cluster-based strategy. Two heterogeneous LLMs then generate expansions independently, and a refinement LLM consolidates them into one expansion. The refined ensemble shows consistent, statistically significant gains over BM25, Rocchio, zero-shot, and fixed few-shot baselines on TREC DL20, DBPedia, and SciFact.

论文针对依赖手工构造提示和单一LLM的方法存在的非扩展性和领域漂移敏感性问题，提出了一种自动化框架，通过BM25-MonoT5管道构建领域内示例池，并通过聚类策略选择多样化的示例。该框架然后使用两LLM组合来生成和精炼查询扩展，展示了在TREC DL20、DBPedia和SciFact数据集上相对于现有方法的一致和显著改进。
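
The cluster-based exemplar selection mentioned in this entry can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm or embedding model is used, so everything here is an assumption: plain k-means with a deterministic farthest-point initialisation, toy 2-D vectors standing in for passage embeddings, and the hypothetical helper names `kmeans` and `select_diverse_exemplars`.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation (assumed, not from the paper)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(sum(m[d] for m in members) / len(members) for d in range(dim)))
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids

def select_diverse_exemplars(pool, k):
    """pool: list of (text, embedding) pairs. Returns the text nearest to each of k centroids."""
    centroids = kmeans([tuple(emb) for _, emb in pool], k)
    chosen = []
    for c in centroids:
        text, _ = min(pool, key=lambda item: math.dist(item[1], c))
        if text not in chosen:
            chosen.append(text)
    return chosen

# Toy exemplar pool: three topical groups with made-up 2-D "embeddings".
pool = [
    ("flu: symptoms query + passage", (0.0, 0.1)),
    ("flu: treatment query + passage", (0.1, 0.0)),
    ("python: sort query + passage", (5.0, 5.1)),
    ("python: reverse query + passage", (5.1, 5.0)),
    ("loan: mortgage query + passage", (10.0, 0.0)),
    ("loan: interest query + passage", (10.1, 0.2)),
]
chosen = select_diverse_exemplars(pool, k=3)
print(chosen)
```

On this toy pool the selection returns one exemplar per topical group, which is the diversity property the abstract attributes to the cluster-based strategy. A real pipeline would replace the toy vectors with embeddings of the harvested pseudo-relevant passages.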

Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos

Authors: Rohith Peddi, Saurabh, Shravan Shanmugam, Likhitha Pallapothula, Yu Xiang, Parag Singla, Vibhav Gogate

First: 2026-03-13T17:18:03+00:00 · Latest: 2026-03-13T17:18:03+00:00

Comments: https://github.com/rohithpeddi/WorldSGG

Abstract

Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph), which implements object permanence via a zero-order feature buffer; MWAE (Masked World Auto-Encoder), which reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer), which replaces the static buffer with differentiable per-object temporal attention enriched by 3D motion and camera-pose features. We further design and evaluate the performance of strong open-source Vision-Language Models on the WSGG task via a suite of Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.

中文标题/摘要

标题：从单目视频生成时空世界场景图的方法

时空场景图提供了一种原理性的表示方法，用于建模不断变化的对象交互，但现有方法仍然主要基于帧：它们仅考虑当前可见的对象，遮挡时丢弃实体，并在二维空间中操作。为了解决这个问题，我们首先引入了ActionGenome4D数据集，该数据集通过前馈3D重建、面向世界框架的对象边界框以及密集的关系注释（包括由于遮挡或摄像机运动而暂时未观察到的对象）将Action Genome视频升级为4D场景。基于此数据，我们定义了世界场景图生成（WSGG）任务，即在每个时间戳构建一个包含场景中所有交互对象（包括已观察和未观察的对象）的世界场景图。然后，我们提出了三种互补的方法，每种方法探索不同的归纳偏置来处理未观察到的对象：PWG（持久世界图），通过零阶特征缓冲区实现对象持久性；MWAE（掩码世界自编码器），将未观察到的对象推理重新定义为掩码完成与跨视图关联检索；以及4DST（4D场景变换器），用具有3D运动和摄像机姿态特征的可微分的逐对象时空注意力替换静态缓冲区。我们进一步设计并评估了强大的开源视觉-语言模型在WSGG任务上的性能，通过一系列基于Graph RAG的方法建立了未定位关系预测的基线。因此，WSGG推动了视频场景理解向以世界为中心、时间持久和可解释的场景推理方向发展。

Summary / 总结

The paper addresses the limitations of existing spatio-temporal scene graph generation methods, which are frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. It introduces ActionGenome4D, a 4D dataset built from Action Genome videos, and formalizes the World Scene Graph Generation (WSGG) task. Three methods (PWG, MWAE, and 4DST) are proposed to reason about unobserved objects, and strong open-source Vision-Language Models are evaluated on WSGG through Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction.

论文通过引入包含3D重建和密集关系注释的ActionGenome4D数据集，解决了现有时空场景图生成方法的局限性。作者定义了世界场景图生成（WSGG）任务，并提出了三种方法：PWG、MWAE和4DST，每种方法对未观察到的对象有不同的推理方式。此外，论文还通过基于Graph RAG的方法评估了开源视觉-语言模型在WSGG任务上的表现，建立了未定位关系预测的基线。该工作旨在将视频场景理解推向以世界为中心、时间持久且可解释的场景推理。
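
The abstract describes PWG as implementing object permanence "via a zero-order feature buffer". One plausible reading, sketched below, is a zero-order hold: each object keeps the most recent feature it was observed with until it is seen again. The class name `PersistentFeatureBuffer`, the dictionary layout, and the `age` field are illustrative assumptions and not the authors' implementation.

```python
class PersistentFeatureBuffer:
    """Zero-order hold over per-object features (a hypothetical reading of PWG's buffer)."""

    def __init__(self):
        self._last = {}  # object id -> (feature, timestamp of last observation)

    def update(self, t, observed):
        """observed maps object id -> feature for objects visible at time t.

        Returns the world state at time t: every object seen so far, with the
        last known feature, a visibility flag, and frames since last observation.
        """
        for obj_id, feature in observed.items():
            self._last[obj_id] = (feature, t)
        world = {}
        for obj_id, (feature, seen_at) in self._last.items():
            world[obj_id] = {
                "feature": feature,
                "observed": obj_id in observed,
                "age": t - seen_at,
            }
        return world

# Toy video: the cup is occluded in the middle frame and reappears afterwards.
frames = [
    {"person": [0.2, 0.9], "cup": [1.0, 0.0]},
    {"person": [0.3, 0.9]},
    {"person": [0.4, 0.8], "cup": [1.5, 0.2]},
]
buffer = PersistentFeatureBuffer()
states = [buffer.update(t, obs) for t, obs in enumerate(frames)]
print(states[1]["cup"])
```

In the middle frame the cup stays in the world state with its last feature and `observed` set to False, so a downstream relationship predictor could still reason about it.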

SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling

Authors: Elisabetta Fedele, Francis Engelmann, Ian Huang, Or Litany, Marc Pollefeys, Leonidas Guibas

First: 2025-12-05T00:54:48+00:00 · Latest: 2026-03-13T17:13:29+00:00

Comments: Project page: https://spacecontrol3d.github.io/

Abstract

Generative methods for 3D assets have recently achieved remarkable progress, yet providing intuitive and precise control over the object geometry remains a key challenge. Existing approaches predominantly rely on text or image prompts, which often fall short in geometric specificity: language can be ambiguous, and images are difficult to manipulate. In this work, we introduce SpaceControl, a training-free test-time method for explicit spatial control of 3D asset generation. Our approach accepts a wide range of geometric inputs, from coarse primitives to detailed meshes, and integrates seamlessly with modern generative models without requiring any additional training. A control parameter lets users trade off between geometric fidelity and output realism. Extensive quantitative evaluation and user studies demonstrate that SpaceControl outperforms both training-based and optimization-based baselines in geometric faithfulness while preserving high visual quality. Finally, we present an interactive interface for real-time superquadric editing and direct 3D asset generation, enabling seamless use in creative workflows. Project page: https://spacecontrol3d.github.io/.

中文标题/摘要

标题：SpaceControl：在3D生成建模中引入测试时空间控制

近年来，用于3D资产的生成方法取得了显著进展，但如何对物体几何形状进行直观且精确的控制仍是一个关键挑战。现有方法主要依赖于文本或图像提示，这在几何精确性方面往往不够：语言可能含糊不清，而图像难以操作。在本工作中，我们引入了SpaceControl，这是一种无需训练的测试时方法，用于对3D资产生成进行显式空间控制。我们的方法接受从粗略的基本几何体到详细网格的各种几何输入，并能够无缝集成到现代生成模型中，无需额外训练。一个控制参数让用户可以在几何保真度和输出真实感之间进行权衡。广泛的定量评估和用户研究表明，SpaceControl在几何保真度方面优于基于训练和基于优化的基线，同时保持较高的视觉质量。最后，我们提供了一个交互式界面，用于实时超二次曲面（superquadric）编辑和直接3D资产生成，使其能够无缝地用于创意工作流程中。项目页面：https://spacecontrol3d.github.io/

Summary / 总结

SpaceControl introduces a training-free method for explicit spatial control in 3D generative modeling, allowing users to input geometric inputs ranging from primitives to meshes. The approach integrates with modern generative models and includes a control parameter to balance geometric fidelity and output realism. Experimental results show that SpaceControl outperforms both training-based and optimization-based methods in geometric faithfulness while maintaining high visual quality. An interactive interface for real-time superquadric editing and 3D asset generation is also provided, enhancing creative workflows.

SpaceControl 提出了一种无需训练的方法，用于 3D 生成建模中的显式空间控制，允许用户输入从基本形状到详细网格的各种几何输入。该方法可以与现代生成模型无缝集成，并包含一个控制参数来平衡几何保真度和输出的真实感。实验结果表明，SpaceControl 在几何保真度方面优于基于训练和基于优化的方法，同时保持较高的视觉质量。还提供了一个用于实时超二次曲面编辑和 3D 资产生成的交互界面，以增强创意工作流程。

RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation

Authors: Yash Jangir, Yidi Zhang, Pang-Chi Lo, Kashu Yamazaki, Chenyu Zhang, Kuan-Hsun Tu, Tsung-Wei Ke, Lei Ke, Yonatan Bisk, Katerina Fragkiadaki

First: 2025-10-27T17:41:38+00:00 · Latest: 2026-03-13T16:29:34+00:00

Comments: Website: https://robotarenainf.github.io

Abstract

The pursuit of robot generalists, agents capable of performing diverse tasks across diverse environments, demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. We introduce RobotArena Infinity, a new benchmarking framework that overcomes these challenges by shifting vision-language-action (VLA) evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated vision-language-model-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, including textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world-trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.

中文标题/摘要

标题：RobotArena $\infty$：通过实景到模拟转换实现可扩展的机器人基准测试

对机器人通才（即能够在多种环境中执行多种任务的智能体）的追求，需要严格且可扩展的评估。然而，机器人策略的真实世界测试仍然受到根本限制：它劳动密集、速度慢、大规模开展时不安全，并且难以复现。随着策略的范围和复杂性扩大，这些障碍只会加剧，因为机器人领域中“成功”的定义往往取决于人类对执行质量的细微判断。我们引入了RobotArena Infinity，这是一种新的基准测试框架，通过将视觉-语言-动作（VLA）评估转移到结合在线人类反馈的大规模模拟环境中来克服这些挑战。利用视觉-语言模型、2D到3D生成建模和可微渲染的最新进展，我们的方法自动将广泛使用的机器人数据集中的视频演示转换为对应的模拟场景。在这些数字孪生中，我们同时使用视觉-语言模型引导的自动评分和从众包工人收集的可扩展人类偏好判断来评估VLA策略，将人类的参与从繁琐的场景搭建、重置和安全监督转变为轻量级的偏好比较。为了衡量鲁棒性，我们沿多个维度（包括纹理和物体摆放）系统地扰动模拟环境，在受控变化下对策略泛化进行压力测试。最终得到一个持续演进、可复现且可扩展的基准，用于评估在真实世界中训练的机器人操作策略，填补了当今机器人领域缺失的一项关键能力。

Summary / 总结

RobotArena Infinity evaluates vision-language-action (VLA) robot policies in large-scale simulated environments augmented with online human feedback. It automatically converts video demonstrations from widely used robot datasets into simulated counterparts using vision-language models, 2D-to-3D generative modeling, and differentiable rendering. Policies are assessed with automated VLM-guided scoring and crowdsourced human preference judgments, and the simulated environments are systematically perturbed (for example, textures and object placements) to test robustness. The result is a continuously evolving, reproducible, and scalable benchmark for real-world-trained robot manipulation policies.

RobotArena Infinity 通过将真实世界的视频演示转换为具有在线人类反馈的模拟环境来评估机器人通用性。它利用视觉语言模型和可微渲染来自动化此过程，从而实现可扩展和稳健的策略评估。关键发现包括系统地扰动模拟环境以测试策略泛化能力，并创建一个持续演进、可重复且可扩展的机器人操作策略基准。

Geometry-Guided Camera Motion Understanding in VideoLLMs

Authors: Haoan Feng, Sri Harsha Musunuri, Guan-Ming Su

First: 2026-03-13T16:13:09+00:00 · Latest: 2026-03-13T16:13:09+00:00

Comments: 10 pages, 7 figures, supplementary included

Abstract

Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of benchmarking, diagnosis, and injection. We curate CameraMotionDataset, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark, CameraMotionVQA. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward a camera-aware VideoLLM and VLA system. The dataset and benchmark are publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.

中文标题/摘要

标题：视频LLMs中的几何引导摄像机运动理解

摄像机运动是塑造视觉感知和电影风格的基本几何信号，但当前具备视频能力的视觉语言模型（VideoLLMs）很少明确表示它，并且经常在精细的运动基元上出错。我们通过一个包括基准测试、诊断和注入的框架来解决这一差距。我们构建了一个名为CameraMotionDataset的大规模合成数据集，其中包含明确的摄像机控制，并将摄像机运动形式化为约束感知的多标签识别，构建了一个VQA基准CameraMotionVQA。在多种现成的VideoLLMs中，我们观察到在识别摄像机运动基元方面存在大量错误。对Qwen2.5-VL视觉编码器的探针实验表明，摄像机运动线索在视觉编码器中表示较弱，尤其是在更深层次的ViT块中，这有助于解释观察到的失败模式。为了在不进行昂贵的训练或微调的情况下弥合这一差距，我们提出了一种轻量级、模型无关的管道，从3D基础模型（3DFMs）中提取几何摄像机线索，使用时间分类器预测受约束的运动基元，并通过结构化提示将它们注入下游VideoLLM推理中。实验表明，运动识别得到了改善，模型的响应也更加关注摄像机，表明几何驱动的线索提取和结构化提示是迈向具有摄像机意识的VideoLLM和VLA系统的可行步骤。数据集和基准可以在https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark上公开获取。

Summary / 总结

This paper addresses the lack of explicit camera motion representation in current VideoLLMs by introducing a framework of benchmarking, diagnosis, and injection. It curates a large-scale synthetic dataset, CameraMotionDataset, and constructs a VQA benchmark, CameraMotionVQA, to evaluate camera motion recognition. The authors observe significant errors in recognizing camera motion primitives across various VideoLLMs and propose a lightweight, model-agnostic pipeline to inject geometric camera cues into VideoLLMs via structured prompting, improving motion recognition and model responses. The dataset and benchmark are publicly available.

论文通过引入基准测试、诊断和注入的框架，解决了当前VideoLLMs中缺乏对相机运动的显式表示的问题。它构建了一个大规模合成数据集CameraMotionDataset，并将相机运动形式化为约束感知的多标签识别，从而创建了CameraMotionVQA。实验发现，各种VideoLLMs在识别相机运动基本元素方面存在显著错误。作者提出了一种轻量级、模型无关的管道，从3D基础模型中提取几何相机线索，预测受限运动基本元素，并通过结构化提示注入到VideoLLM推理中，从而提高了运动识别和模型响应。数据集和基准测试已公开。
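
The last two stages of the pipeline in this entry (constrained primitive prediction, then injection by structured prompting) can be sketched as follows. The primitive names, the mutual-exclusion groups, the 0.5 threshold, and the prompt layout are all assumptions made for illustration; the abstract does not specify them, and the per-primitive scores would come from the paper's temporal classifier rather than being hard-coded.

```python
# Hypothetical primitives and exclusion groups: opposite motions cannot co-occur.
EXCLUSIVE_GROUPS = [
    ("pan-left", "pan-right"),
    ("tilt-up", "tilt-down"),
    ("zoom-in", "zoom-out"),
]

def decode_primitives(scores, threshold=0.5):
    """Multi-label decoding: threshold each score, then resolve conflicts inside
    each exclusive group by keeping only the highest-scoring member."""
    labels = {name for name, s in scores.items() if s >= threshold}
    for group in EXCLUSIVE_GROUPS:
        active = [name for name in group if name in labels]
        if len(active) > 1:
            keep = max(active, key=lambda name: scores[name])
            labels -= set(active) - {keep}
    return sorted(labels)

def build_prompt(question, primitives):
    """Prepend the detected motion primitives to the question as a structured block."""
    motion = ", ".join(primitives) if primitives else "static"
    return (
        "[Camera motion cues]\n"
        f"Detected primitives: {motion}\n"
        "[Question]\n"
        f"{question}"
    )

# Made-up classifier scores for one clip.
scores = {"pan-left": 0.8, "pan-right": 0.6, "zoom-in": 0.7, "tilt-up": 0.1}
primitives = decode_primitives(scores)
prompt = build_prompt("How does the camera move in this clip?", primitives)
print(prompt)
```

Because the cues travel in the prompt text, this kind of injection needs no change to the VideoLLM itself, which matches the training-free, model-agnostic framing in the abstract.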

Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences

Authors: Wenxi Wu, Jingjing Zhang, Martim Brandão

Venue: ICLR 2026 (First Workshop on Efficient Spatial Reasoning)

First: 2026-03-13T15:53:42+00:00 · Latest: 2026-03-13T15:53:42+00:00

Comments: Accepted to the First Workshop on Efficient Spatial Reasoning at ICLR 2026

Abstract

Understanding user instructions and object spatial relations in surrounding environments is crucial for intelligent robot systems to assist humans in various tasks. The natural language and spatial reasoning capabilities of Vision-Language Models (VLMs) have the potential to enhance the generalization of robot planners on new tasks, objects, and motion specifications. While foundation models have been applied to task planning, it is still unclear the degree to which they have the capability of spatial reasoning required to enforce user preferences or constraints on motion, such as desired distances from objects, topological properties, or motion style preferences. In this paper, we evaluate the capability of four state-of-the-art VLMs at spatial reasoning over robot motion, using four different querying methods. Our results show that, with the highest-performing querying method, Qwen2.5-VL achieves 71.4% accuracy zero-shot and 75% on a smaller model after fine-tuning, and GPT-4o leads to lower performance. We evaluate two types of motion preferences (object-proximity and path-style), and we also analyze the trade-off between accuracy and computation cost in number of tokens. This work shows some promise in the potential of VLM integration with robot motion planning pipelines.

中文标题/摘要

标题：评估VLMs在机器人运动空间推理中的能力：迈向具有运动偏好的机器人规划

理解用户的指令和周围环境中的物体空间关系对于智能机器人系统在各种任务中协助人类至关重要。视觉-语言模型（VLMs）的自然语言和空间推理能力有可能增强机器人规划者在新任务、新物体和运动规范上的泛化能力。虽然基础模型已被应用于任务规划，但尚不清楚它们在执行用户对运动的偏好或约束（如与物体的距离、拓扑属性或运动风格偏好）所需的空间推理能力方面的能力如何。在本文中，我们使用四种不同的查询方法评估了四种最先进的VLMs在机器人运动空间推理方面的能力。我们的结果显示，使用性能最高的查询方法，Qwen2.5-VL在零样本情况下达到71.4%的准确率，在较小的模型上微调后达到75%，而GPT-4o的性能较低。我们评估了两种类型的运动偏好（物体接近性和路径风格），并分析了准确性和计算成本（以令牌数量衡量）之间的权衡。这项工作展示了VLM与机器人运动规划管道集成的潜力。

Summary / 总结

This paper evaluates the spatial reasoning of Vision-Language Models (VLMs) over robot motion, focusing on whether they can enforce user preferences and constraints on motion. Four state-of-the-art VLMs were tested with four querying methods; with the best-performing method, Qwen2.5-VL reaches 71.4% zero-shot accuracy and a smaller model reaches 75% after fine-tuning, while GPT-4o performs worse. The study also examines two types of motion preferences (object-proximity and path-style) and the trade-off between accuracy and computation cost in tokens, indicating some promise for VLM integration in robot motion planning.

研究评估了视觉语言模型（VLMs）在机器人运动规划中的空间推理能力，重点在于它们理解用户偏好和约束的能力。使用四种查询方法测试了四种最先进的VLMs：在性能最高的查询方法下，Qwen2.5-VL在零样本情况下达到71.4%的准确率，较小的模型在微调后达到75%，而GPT-4o的表现较低。研究还考察了两种类型的运动偏好以及准确性和计算成本之间的权衡，表明VLMs在机器人运动规划管道中的潜在应用前景。

DriveMind: A Dual Visual Language Model-based Reinforcement Learning Framework for Autonomous Driving

Authors: Dawood Wasif, Terrence J. Moore, Chandan K. Reddy, Frederica Free-Nelson, Seunghyun Yoon, Hyuk Lim, Dan Dongseong Kim, Jin-Hee Cho

First: 2025-06-01T03:51:09+00:00 · Latest: 2026-03-13T15:16:53+00:00

Comments: Submitted to IEEE Transactions on Intelligent Vehicles (T-IV)

Abstract

End-to-end autonomous driving systems map sensor data directly to control commands, but remain opaque, lack interpretability, and offer no formal safety guarantees. While recent vision-language-guided reinforcement learning (RL) methods introduce semantic feedback, they often rely on static prompts and fixed objectives, limiting adaptability to dynamic driving scenes. We present DriveMind, a unified semantic reward framework that integrates: (i) a contrastive Vision-Language Model (VLM) encoder for stepwise semantic anchoring; (ii) a novelty-triggered VLM encoder-decoder, fine-tuned via chain-of-thought (CoT) distillation, for dynamic prompt generation upon semantic drift; (iii) a hierarchical safety module enforcing kinematic constraints (e.g., speed, lane centering, stability); and (iv) a compact predictive world model to reward alignment with anticipated ideal states. DriveMind achieves 19.4 +/- 2.3 km/h average speed, 0.98 +/- 0.03 route completion, and near-zero collisions in CARLA Town 2, outperforming baselines by over 4% in success rate. Its semantic reward generalizes zero-shot to real dash-cam data with minimal distributional shift, demonstrating robust cross-domain alignment and potential for real-world deployment.

中文标题/摘要

标题：DriveMind：基于双视觉语言模型的自主驾驶强化学习框架

端到端的自主驾驶系统将传感器数据直接映射到控制命令，但仍然不透明，缺乏可解释性，并且没有正式的安全保证。虽然最近的视觉-语言指导的强化学习（RL）方法引入了语义反馈，但它们通常依赖于静态提示和固定目标，限制了对动态驾驶场景的适应性。我们提出了DriveMind，这是一种统一的语义奖励框架，整合了：(i) 对比视觉-语言模型（VLM）编码器，用于逐步语义锚定；(ii) 一种新颖触发的VLM编码器-解码器，通过链式思考（CoT）蒸馏进行微调，用于在语义漂移时动态生成提示；(iii) 一种分层安全模块，强制执行运动约束（例如，速度、车道居中、稳定性）；以及(iv) 一种紧凑的预测世界模型，用于奖励与预期理想状态的对齐。DriveMind在CARLA Town 2中实现了19.4 +/- 2.3 km/h的平均速度，0.98 +/- 0.03的路线完成率，并且几乎零碰撞，成功率比基线高出超过4%。其语义奖励能够零样本泛化到真实仪表板摄像头数据，几乎没有分布偏移，展示了跨域对齐的鲁棒性和在实际部署中的潜力。

Summary / 总结

DriveMind is a reinforcement learning framework for autonomous driving that integrates a contrastive Vision-Language Model encoder for semantic anchoring, a novelty-triggered mechanism for dynamic prompt generation, a hierarchical safety module, and a compact predictive world model. In CARLA Town 2 it achieves 19.4 +/- 2.3 km/h average speed, 0.98 +/- 0.03 route completion, and near-zero collisions, outperforming baselines by over 4% in success rate. Its semantic reward also generalizes zero-shot to real dash-cam data with minimal distributional shift.

DriveMind 是一个结合对比视觉语言模型进行语义锚定、动态提示生成系统、层次安全模块和预测世界模型的自主驾驶强化学习框架。它在 CARLA Town 2 中实现了较高的平均速度、路线完成率和几乎零碰撞，并在成功率方面优于基线，展示了跨域的鲁棒性。

CORE: Context-Robust Remasking for Diffusion Language Models

Authors: Kevin Zhai, Sabbir Mollah, Zhenyi Wang, Mubarak Shah

First: 2026-02-04T00:12:30+00:00 · Latest: 2026-03-13T15:05:41+00:00

Comments: Project Page: https://ucf-crcv.github.io/core/

Abstract

Standard decoding in Masked Diffusion Models (MDMs) is hindered by context rigidity: tokens are retained based on transient high confidence, often ignoring that early predictions lack full context. This creates cascade effects where initial inconsistencies misguide the remaining generation. Existing revision strategies attempt to mitigate this by relying on static confidence scores, but these signals are inherently myopic; inconsistent tokens can appear confident to the model itself. We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. We formalize revision as a robust optimization objective over context shifts and efficiently approximate this objective to prioritize unstable tokens for revision. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.

中文标题/摘要

标题：CORE：面向扩散语言模型的上下文鲁棒重掩码

在掩码扩散模型（MDMs）中，标准解码受限于上下文刚性：基于短暂的高置信度保留令牌，往往忽略了早期预测缺乏完整上下文的情况。这会导致初始不一致性误导后续生成。现有的修订策略试图通过依赖静态置信度分数来缓解这一问题，但这些信号本质上是短视的；不一致的令牌对模型本身来说可能显得很有信心。我们提出了基于上下文的稳健重遮盖（CORE），这是一种无需训练的推理时修订框架。CORE 不依赖静态令牌概率，而是通过探测其对目标遮盖上下文扰动的敏感性来识别上下文脆弱的令牌。我们将修订形式化为上下文转换下的鲁棒优化目标，并高效地近似此目标以优先修订不稳定的令牌。在LLaDA-8B-Base上，CORE 在推理和代码基准测试中提供了持续改进，超越了计算匹配的基线，并将MBPP提高了高达9.2个百分点。

Summary / 总结

The paper addresses context rigidity in Masked Diffusion Models (MDMs), where tokens committed early on transient high confidence can misguide the rest of the generation. It introduces Context-Robust Remasking (CORE), a training-free framework for inference-time revision that identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations and prioritizes them for revision. On LLaDA-8B-Base, CORE yields consistent improvements on reasoning and code benchmarks, outperforms compute-matched baselines, and improves MBPP by up to 9.2 percentage points.

论文解决了Masked Diffusion Models (MDMs)中上下文刚性的问题，即早期预测可能会误导整个生成过程。论文提出了Context-Robust Remasking (CORE)框架，通过探查令牌对目标遮蔽上下文扰动的敏感性来识别上下文脆弱的令牌，并优先对这些令牌进行修订。在LLaDA-8B-Base上，CORE在推理和代码基准测试中表现出一致的改进，优于计算量匹配的基线，并将MBPP最多提高9.2个百分点。

Topo-R1: Detecting Topological Anomalies via Vision-Language Models

Authors: Meilong Xu, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Xin Yu, Weimin Lyu, Kehan Qi, Dimitris Samaras, Chao Chen

First: 2026-03-13T15:05:04+00:00 · Latest: 2026-03-13T15:05:04+00:00

Comments: 28 pages, 6 figures

Abstract

Topological correctness is crucial for tubular structures such as blood vessels, nerve fibers, and road networks. Existing topology-preserving methods rely on domain-specific ground truth, which is costly and rarely transfers across domains. When deployed to a new domain without annotations, a key question arises: how can we detect topological anomalies without ground-truth supervision? We reframe this as topological anomaly detection, a structured visual reasoning task requiring a model to locate and classify topological errors in predicted segmentation masks. Vision-Language Models (VLMs) are natural candidates; however, we find that state-of-the-art VLMs perform nearly at random, lacking the fine-grained, topology-aware perception needed to identify sparse connectivity errors in dense structures. To bridge this gap, we develop an automated data-curation pipeline that synthesizes diverse topological anomalies with verifiable annotations across progressively difficult levels, thereby constructing the first large-scale, multi-domain benchmark for this task. We then introduce Topo-R1, a framework that endows VLMs with topology-aware perception via two-stage training: supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO). Central to our approach is a topology-aware composite reward that integrates type-aware Hungarian matching for structured error classification, spatial localization scoring, and a centerline Dice (clDice) reward that directly penalizes connectivity disruptions, thereby jointly incentivizing semantic precision and structural fidelity. Extensive experiments demonstrate that Topo-R1 establishes a new paradigm for annotation-free topological quality assessment, consistently outperforming general-purpose VLMs and supervised baselines across all evaluation protocols.

中文标题/摘要

标题：Topo-R1：通过视觉语言模型检测拓扑异常

拓扑正确性对于血管、神经纤维和道路网络等管状结构至关重要。现有的拓扑保持方法依赖于特定领域的地面真值，这既昂贵又难以跨域转移。当部署到没有注释的新领域时，一个关键问题是：在没有地面真值监督的情况下，我们如何检测拓扑异常？我们将此问题重新定义为拓扑异常检测，这是一个结构化的视觉推理任务，要求模型在预测分割掩码中定位和分类拓扑错误。视觉语言模型（VLMs）是自然的候选者；然而，我们发现最先进的VLMs几乎随机表现，缺乏识别密集结构中稀疏连接错误所需的细粒度、拓扑感知能力。为了弥合这一差距，我们开发了一个自动数据整理管道，该管道综合了不同拓扑异常的多样化样本，并在逐渐困难的级别上提供了可验证的注释，从而构建了该任务的第一个大规模、多领域基准。然后，我们引入了Topo-R1框架，通过两阶段训练赋予VLMs拓扑感知能力：监督微调后，通过组相对策略优化（GRPO）进行强化学习。我们方法的核心是一个拓扑感知的复合奖励，它结合了类型感知的匈牙利匹配、空间定位评分和中心线Dice（clDice）奖励，直接惩罚连接中断，从而同时激励语义精确性和结构保真度。广泛的实验表明，Topo-R1建立了无注释拓扑质量评估的新范式，在所有评估协议中均优于通用VLMs和监督基线。

Summary / 总结

The paper addresses the challenge of detecting topological anomalies in tubular structures without ground-truth supervision, a critical issue for domains like blood vessels and road networks. It introduces Topo-R1, a framework that uses Vision-Language Models (VLMs) with a two-stage training process, including supervised fine-tuning and reinforcement learning with Group Relative Policy Optimization (GRPO). The method leverages a topology-aware composite reward to improve the model's ability to identify sparse connectivity errors. Experiments show that Topo-R1 outperforms general-purpose VLMs and supervised baselines across various evaluation protocols.

研究旨在通过视觉-语言模型（VLMs）检测管状结构中的拓扑异常，而不依赖于地面真实监督。为了解决当前VLMs的局限性，作者开发了一个自动数据整理管道，并引入了Topo-R1框架，该框架采用两阶段训练：监督微调后跟随基于组相对策略优化（GRPO）的强化学习，并使用一种拓扑感知的复合奖励。实验结果表明，Topo-R1在各种评估协议中均优于通用的VLMs和监督基线，在拓扑质量评估方面表现出色。
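
The "type-aware Hungarian matching" component of the composite reward can be illustrated with a small sketch: predicted and reference errors may be paired only when their types agree, and the pairing is chosen to maximise the number of matches and then minimise total distance. For brevity this sketch finds the optimum by brute-force enumeration, a stand-in for the Hungarian algorithm that is only practical for a handful of errors. The error type names, the `max_dist` gate, and the F1-style score are assumptions, not the paper's reward definition.

```python
import itertools
import math

def match_errors(pred, truth, max_dist=20.0):
    """pred/truth: lists of (error_type, (x, y)). Returns optimal (pred_idx, truth_idx) pairs."""
    def cost(i, j):
        p_type, p_xy = pred[i]
        t_type, t_xy = truth[j]
        d = math.dist(p_xy, t_xy)
        # A pair is admissible only if the types agree and the locations are close.
        return d if (p_type == t_type and d <= max_dist) else None

    n = len(pred)
    # Each prediction takes a distinct truth index or None (left unmatched).
    slots = list(range(len(truth))) + [None] * n
    best_pairs, best_key = [], (0, 0.0)
    for assign in itertools.permutations(slots, n):
        pairs, total = [], 0.0
        for i, j in enumerate(assign):
            c = cost(i, j) if j is not None else None
            if c is not None:
                pairs.append((i, j))
                total += c
        key = (-len(pairs), total)  # more matches first, then smaller distance
        if key < best_key:
            best_pairs, best_key = pairs, key
    return best_pairs

def f1_score(num_matched, num_pred, num_truth):
    """F1 over matched errors; defined as 1.0 when both sides are empty."""
    if num_pred + num_truth == 0:
        return 1.0
    return 2.0 * num_matched / (num_pred + num_truth)

# Hypothetical error types and pixel coordinates.
pred = [("broken", (10, 10)), ("merged", (50, 52)), ("broken", (90, 90))]
truth = [("broken", (12, 9)), ("merged", (50, 50)), ("hole", (90, 91))]
pairs = match_errors(pred, truth)
score = f1_score(len(pairs), len(pred), len(truth))
print(pairs, round(score, 3))
```

The third prediction sits next to a reference error but has the wrong type, so it stays unmatched. That is the behaviour that makes the matching "type-aware" rather than purely spatial.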

ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

Authors: Yanpeng Zhao, Wentao Ding, Hongtao Li, Baoxiong Jia, Zilong Zheng

First: 2026-03-13T14:43:00+00:00 · Latest: 2026-03-13T14:43:00+00:00

Abstract

A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.

中文标题/摘要

标题：ESPIRE：面向视觉语言模型具身空间推理的诊断基准

视觉语言模型（VLMs）最近的趋势是增强其空间认知能力以适应体感领域。尽管取得了进展，但现有的评估在范式和覆盖面方面都有限，阻碍了模型的快速迭代开发。为了解决这些限制，我们提出了ESPIRE，一个体感空间推理的诊断基准。ESPIRE提供了一个物理化的模拟世界，用于评估VLMs的空间推理能力，从而缩小了评估与实际部署之间的差距。为了使VLMs适应机器人任务，我们将每个任务分解为定位和执行，并将两者都作为生成问题来处理，这与主要依赖于干扰项的判别性评估（例如，通过视觉问答）形成了鲜明对比，后者会丢弃执行部分。这种分解还使我们能够从被动的空间推理进一步分析到推理以行动。我们系统地在指令和环境层面设计了ESPIRE，确保了空间推理场景的广泛覆盖。我们使用ESPIRE诊断了一系列前沿的VLMs，并对其空间推理行为进行了深入分析。

Summary / 总结

ESPIRE is a diagnostic benchmark designed to evaluate the embodied spatial reasoning capabilities of vision-language models (VLMs). It introduces a simulated world to physically ground VLMs and assess them on spatial-reasoning-centric robotic tasks, bridging the gap between evaluation and real-world deployment. By decomposing tasks into localization and execution, ESPIRE frames these as generative problems, offering a fine-grained analysis beyond passive spatial reasoning. The benchmark systematically covers various spatial reasoning scenarios and diagnoses a range of VLMs, providing insights into their spatial reasoning behaviors.

ESPIRE 是一个用于视觉-语言模型在体感空间推理的诊断基准，通过提供一个物理上将 VLMs 地化的模拟世界并评估它们在空间推理为中心的机器人任务上的表现，解决了现有评估的局限性。该基准将任务分解为定位和执行，并将两者都作为生成问题来处理，同时系统地设计基准以确保覆盖广泛的空间推理场景。研究诊断了一系列前沿的 VLMs，并对其空间推理行为进行了深入分析。

A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks

Authors: Tangzheng Lian, Guanyu Hu, Yijing Ren, Dimitrios Kollias, Oya Celiktutan

First: 2026-03-13T13:55:34+00:00 · Latest: 2026-03-13T13:55:34+00:00

Abs · PDF · Code1 · Code2

Abstract

While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a closed-form solution in the cross-modal space, achieving Pareto-optimal fairness with bounded utility losses. Our method is training-free, requires no annotated data, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and intersectional fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance.

Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs

Authors: Bumjun Kim, Dongjae Jeon, Moongyu Jeon, Albert No

First: 2026-03-13T13:52:02+00:00 · Latest: 2026-03-13T13:52:02+00:00

Abs · PDF · Code1 · Code2

Abstract

Parallel decoding for diffusion LLMs (dLLMs) is difficult because each denoising step provides only token-wise marginal distributions, while unmasking multiple tokens simultaneously requires accounting for inter-token dependencies. We propose Dependency-Aware Parallel Decoding (DAPD), a simple, training-free decoding method that uses self-attention to induce a conditional dependency graph over masked tokens. At each iteration, edges in this graph capture strong token interactions, while non-edges indicate weak dependence. Parallel decoding is then reduced to selecting an independent set on the graph and unmasking the selected tokens in parallel. This avoids co-updating strongly coupled tokens without auxiliary models or retraining. Experiments on LLaDA and Dream show that DAPD improves the accuracy-steps trade-off over existing methods and enables more globally distributed parallel updates that better exploit the any-order generation capability of dLLMs.

中文标题/摘要

标题：依赖感知并行解码以注意机制增强扩散大语言模型

对于扩散大语言模型（dLLMs），并行解码具有挑战性，因为每个去噪步骤仅提供词元级边缘分布，而同时揭露多个词元需要考虑词元间的依赖关系。我们提出了一种名为依赖感知并行解码（DAPD）的简单、无需训练的解码方法，该方法使用自注意力机制诱导一个覆盖遮蔽词元的条件依赖图。在每次迭代中，该图中的边捕捉强词元交互，而非边表示弱依赖。并行解码则被简化为在图上选择独立集并在并行中揭露选定的词元。这种方法避免了强耦合词元的共更新，无需辅助模型或重新训练。实验表明，DAPD在LLaDA和Dream上的准确度-步数权衡优于现有方法，并能够实现更广泛的并行更新，更好地利用dLLMs的任意顺序生成能力。

Summary / 总结

The paper addresses the challenge of parallel decoding in diffusion language models (dLLMs) by proposing Dependency-Aware Parallel Decoding (DAPD), which uses self-attention to induce a conditional dependency graph over masked tokens. At each iteration, DAPD selects an independent set on this graph to unmask tokens in parallel, avoiding co-updating strongly coupled tokens. Experiments show that DAPD improves the accuracy-steps trade-off compared to existing methods and enables more globally distributed parallel updates, leveraging the any-order generation capability of dLLMs.

论文提出了一种依赖感知并行解码方法（DAPD），通过自注意力机制诱导一个条件依赖图来处理掩码标记。在每次迭代中，DAPD 选择一组独立的标记并行解码，避免了强耦合标记的同时更新。实验表明，DAPD 在准确性和步骤之间提供了更好的权衡，并且能够更有效地利用 dLLMs 的任意顺序生成能力。
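The independent-set selection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the attention matrix is assumed to be already aggregated over heads and layers, the edge threshold `tau` is a hypothetical parameter, and the greedy ordering by confidence is one simple way to pick an independent set.

```python
def select_independent_set(attn, masked, confidence, tau=0.2):
    """Greedily pick masked positions that are not strongly coupled to each other.

    attn[i][j]  : attention weight from position i to position j (assumed aggregated)
    masked      : list of currently masked positions
    confidence  : dict mapping position -> model confidence in its top token
    tau         : hypothetical threshold above which two positions share an edge
    """
    def coupled(i, j):
        # Treat the graph as undirected: an edge exists if either direction is strong.
        return max(attn[i][j], attn[j][i]) >= tau

    chosen = []
    for pos in sorted(masked, key=lambda p: confidence[p], reverse=True):
        if all(not coupled(pos, other) for other in chosen):
            chosen.append(pos)
    return chosen


attn = [
    [0.0, 0.6, 0.05, 0.1],
    [0.5, 0.0, 0.1, 0.05],
    [0.05, 0.1, 0.0, 0.3],
    [0.1, 0.05, 0.25, 0.0],
]
masked = [0, 1, 2, 3]
confidence = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.6}
unmask_now = select_independent_set(attn, masked, confidence)
```

In this toy input, positions 0 and 1 attend strongly to each other, as do 2 and 3, so only one position from each pair is unmasked in the same step.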

Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs

Authors: Sanghwan Kim, Rui Xiao, Stephan Alaniz, Yongqin Xian, Zeynep Akata

First: 2025-10-01T09:20:51+00:00 · Latest: 2026-03-13T13:45:50+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal Large Language Models (MLLMs) often struggle with fine-grained perception, such as identifying small objects in high-resolution images or detecting key moments in long videos. Existing methods typically rely on complex, task-specific fine-tuning, which reduces generalizability and increases system complexity. In this work, we propose an effective, training-free framework that uses an MLLM's intrinsic uncertainty as proactive guidance. Our core insight is that a model's uncertainty decreases when provided with relevant visual information. We introduce a unified mechanism that scores candidate visual inputs by response uncertainty, enabling the model to autonomously focus on the most informative data. We apply this simple principle to three challenging visual tasks: Visual Search, Long Video Understanding, and Temporal Grounding, allowing off-the-shelf MLLMs to achieve performance competitive with specialized, fine-tuned systems. Our results demonstrate that leveraging intrinsic uncertainty is a powerful strategy for improving fine-grained multimodal performance.

中文标题/摘要

标题：无需训练的不确定性指导：MLLM在复杂视觉任务中的应用

多模态大型语言模型（MLLMs）在精细感知方面经常遇到困难，例如识别高分辨率图像中的小物体或检测长视频中的关键时刻。现有方法通常依赖于复杂的、针对特定任务的微调，这降低了模型的泛化能力和增加了系统复杂性。在本文中，我们提出了一种有效的、无需训练的框架，利用MLLM固有的不确定性作为主动指导。我们的核心见解是，当模型获得相关视觉信息时，其不确定性会降低。我们引入了一种统一机制，通过响应不确定性对候选视觉输入进行评分，使模型能够自主关注最具信息量的数据。我们将这一简单原则应用于三个具有挑战性的视觉任务：视觉搜索、长视频理解以及时间定位，使即用型MLLMs能够达到与专门微调系统相当的性能。我们的结果表明，利用固有的不确定性是提高多模态精细性能的强大策略。

Summary / 总结

The paper addresses the challenge of fine-grained perception in multimodal tasks using Multimodal Large Language Models (MLLMs). It proposes a training-free framework that utilizes the intrinsic uncertainty of MLLMs as guidance. By scoring visual inputs based on response uncertainty, the model can autonomously focus on the most informative data. The framework was applied to three tasks: Visual Search, Long Video Understanding, and Temporal Grounding, and achieved performance comparable to specialized, fine-tuned systems.

论文针对使用多模态大型语言模型（MLLMs）进行细粒度感知的挑战。提出了一种无需训练的框架，利用MLLMs的内在不确定性作为指导。通过基于响应不确定性对视觉输入进行评分，模型可以自主关注最相关信息。该框架应用于视觉搜索、长视频理解和时间定位三个任务，并达到了与专门微调系统相当的性能。
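As a rough illustration of scoring candidate visual inputs by response uncertainty, the sketch below uses Shannon entropy of the answer distribution as the uncertainty measure. Entropy is an assumption made here for concreteness; the abstract does not state which measure is used, and the candidate names and probabilities are made up.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_most_informative(candidates):
    """candidates: dict name -> answer distribution produced when given that visual input.

    Returns the candidate whose response is least uncertain.
    """
    return min(candidates, key=lambda name: entropy(candidates[name]))


# Hypothetical answer distributions for three image crops.
candidates = {
    "crop_top_left": [0.4, 0.3, 0.3],
    "crop_center": [0.9, 0.05, 0.05],
    "crop_bottom_right": [0.5, 0.25, 0.25],
}
best = pick_most_informative(candidates)
```

The same scoring loop could be applied to video segments instead of image crops, which is how one principle can serve search, long-video, and grounding tasks.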

Test-Time Attention Purification for Backdoored Large Vision Language Models

Authors: Zhifang Zhang, Bojun Yang, Shuo He, Weitong Chen, Wei Emma Zhang, Olaf Maennel, Lei Feng, Miao Xu

First: 2026-03-13T13:45:06+00:00 · Latest: 2026-03-13T13:45:06+00:00

Abs · PDF · Code1 · Code2

Abstract

Despite their strong multimodal performance, large vision-language models (LVLMs) are vulnerable to backdoor attacks during fine-tuning, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context, a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model's utility on both clean and poisoned samples.

中文标题/摘要

标题：测试时注意力净化以抵御中毒大型视觉语言模型

尽管大型视觉-语言模型（LVLMs）在多模态方面表现出色，但在微调过程中仍易受后门攻击的影响，攻击者会在训练数据中插入嵌入触发器的样本，以植入可在测试时恶意激活的行为。现有防御通常依赖于使用干净数据重新训练受后门影响的参数（例如适配器或LoRA模块），这在计算上非常昂贵，且往往会导致模型性能下降。在本文中，我们提供了LVLMs中后门行为的新机制理解：触发器不是通过低级视觉模式影响预测，而是通过异常的跨模态注意力再分配，其中携带触发器的视觉标记会从文本上下文中窃取注意力——我们称其为注意力窃取现象。受此启发，我们提出了CleanSight，这是一种无需训练、即插即用的防御方法，仅在测试时运行。CleanSight (i) 基于选定的跨模态融合层中的相对视觉-文本注意力比例检测中毒输入，(ii) 通过选择性修剪可疑的高注意力视觉标记来净化输入，以消除后门激活。广泛实验表明，CleanSight 在多种数据集和后门攻击类型上显著优于现有的基于像素的净化防御，同时在干净和中毒样本上均保持了模型的实用性。

Summary / 总结

This work addresses the vulnerability of large vision-language models (LVLMs) to backdoor attacks by proposing CleanSight, a training-free defense mechanism. CleanSight detects poisoned inputs based on the attention ratio between visual and textual information and purifies the input by pruning suspicious visual tokens. Experiments demonstrate that CleanSight outperforms existing pixel-based defenses across various datasets and attack types while maintaining model performance on clean and poisoned samples.

该研究提出了一种无需训练的防御机制CleanSight，以应对大型视觉-语言模型（LVLM）的后门攻击。CleanSight 通过异常的跨模态注意力重新分配来识别中毒输入，并通过修剪可疑的视觉令牌来净化输入。实验表明，CleanSight 在各种数据集和攻击类型上优于现有基于像素的防御机制，同时保持模型在干净和中毒样本上的性能。
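The detect-then-prune idea can be sketched on a single attention vector. Everything numeric here is a placeholder: the ratio threshold, the number of pruned tokens, and the attention values are hypothetical, and a real system would read attention from selected fusion layers rather than a hand-written list.

```python
def visual_text_ratio(attn, is_visual):
    """Ratio of total attention on visual tokens to total attention on text tokens."""
    visual = sum(a for a, v in zip(attn, is_visual) if v)
    text = sum(a for a, v in zip(attn, is_visual) if not v)
    return visual / max(text, 1e-12)


def purify(attn, is_visual, ratio_threshold=2.0, prune_k=1):
    """Return indices of tokens to keep; prune only when the input looks poisoned."""
    keep = list(range(len(attn)))
    if visual_text_ratio(attn, is_visual) <= ratio_threshold:
        return keep  # treated as clean, nothing is removed
    visual_idx = [i for i in keep if is_visual[i]]
    # Drop the visual tokens that absorb the most attention.
    suspicious = sorted(visual_idx, key=lambda i: attn[i], reverse=True)[:prune_k]
    return [i for i in keep if i not in suspicious]


is_visual = [True, True, True, False, False]
clean_attn = [0.15, 0.15, 0.10, 0.30, 0.30]
poisoned_attn = [0.05, 0.70, 0.05, 0.10, 0.10]
```

With these toy values, the clean input keeps all five tokens, while the poisoned input loses the visual token at index 1 that draws most of the attention away from the text.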

Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation

Authors: Jia Li, Xiaomeng Fu, Xurui Peng, Weifeng Chen, Youwei Zheng, Tianyu Zhao, Jiexi Wang, Fangmin Chen, Xing Wang, Hayden Kwok-Hay So

First: 2026-02-15T07:14:47+00:00 · Latest: 2026-03-13T12:59:04+00:00

Comments: 19 pages, 15 figures

Abs · PDF · Code1 · Code2 · Project1

Abstract

Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. We identify that this failure primarily stems from the spectral bias of 3D positional embeddings and the lack of dynamic priors in noise sampling. To address these issues, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference. FLEX introduces Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones to preserve multi-scale temporal discriminability. This is integrated with Antiphase Noise Sampling (ANS) to inject high-frequency dynamic priors and Inference-only Attention Sink to anchor global structure. Extensive evaluations on VBench demonstrate that FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s duration) and matches the performance of long-video fine-tuned baselines at 12x scale (60s duration). As a plug-and-play augmentation, FLEX seamlessly integrates into existing inference pipelines for horizon extension. It effectively pushes the generation limits of models such as LongLive, supporting consistent and dynamic video synthesis at a 4-minute scale. The project page is available at https://ga-lee.github.io/FLEX_demo.

中文标题/摘要

标题：短训练，长推理：无需训练的超长时间自回归视频生成

自回归视频扩散模型已成为长视频生成的可扩展范式。然而，它们通常会遭受严重的外推失败，即在超出训练范围时，快速的误差累积会导致显著的时间降解。我们发现，这种失败主要源于3D位置嵌入的频谱偏差以及噪声采样中缺乏动态先验。为了解决这些问题，我们提出了FLEX（频率感知长度扩展），这是一种无需训练的推理时框架，能够弥合短期训练与长期推理之间的差距。FLEX引入了频率感知RoPE调制，以适应性地插补未充分训练的低频成分，同时外推高频成分，以保持多尺度时间可区分性。这与反相噪声采样（ANS）结合使用，以注入高频动态先验，并与推理专用注意力汇合，以锚定全局结构。在VBench上的广泛评估表明，FLEX在6倍外推（30秒时长）时显著优于最先进的模型，并在12倍尺度（60秒时长）时与长视频微调基线相当。作为即插即用的增强方法，FLEX无缝集成到现有的推理管道中，有效推动了如LongLive等模型的生成极限，支持4分钟规模的一致和动态视频合成。项目页面可在https://ga-lee.github.io/FLEX_demo/获取。

Summary / 总结

The paper addresses extrapolation failure in autoregressive video diffusion models by proposing FLEX, a training-free framework for horizon extension. FLEX uses Frequency-aware RoPE Modulation to adaptively interpolate under-trained low-frequency components while extrapolating high-frequency ones, Antiphase Noise Sampling to inject high-frequency dynamic priors, and an Inference-only Attention Sink to anchor global structure. Experimental results on VBench show that FLEX significantly outperforms state-of-the-art models at 6x extrapolation (30s) and matches long-video fine-tuned baselines at 12x scale (60s).

论文针对自回归视频扩散模型在超出训练范围时出现的时间降级问题，提出了一种名为FLEX的训练免费框架，通过引入频率感知RoPE调制、反相噪声采样和推理仅注意力下陷来增强长期推理能力。VBench上的实验结果表明，FLEX在6倍扩展时显著提高了性能，并在12倍尺度上与长视频微调基线相当，展示了其在支持4分钟规模的一致和动态视频合成方面的有效性。

SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer

Authors: Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su

Venue: CVPR 2026

First: 2026-03-07T06:33:07+00:00 · Latest: 2026-03-13T12:32:01+00:00

Comments: 23 pages, CVPR 2026 accepted

Abs · PDF · Code1 · Code2 · Code3

Abstract

Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruning achieves a balance between acceleration and generation quality. However, existing methods typically employ fixed and heuristic schemes to configure caching and pruning strategies. While they roughly follow the overall sensitivity trend of generation models to acceleration, they fail to capture fine-grained and complex variations, inevitably skipping highly sensitive computations and leading to quality degradation. Furthermore, such manually designed strategies exhibit poor generalization. To address these issues, we propose SODA, a Sensitivity-Oriented Dynamic Acceleration method that adaptively performs caching and pruning based on fine-grained sensitivity. SODA builds an offline sensitivity error modeling framework across timesteps, layers, and modules to capture the sensitivity to different acceleration operations. The cache intervals are optimized via dynamic programming with sensitivity error as the cost function, minimizing the impact of caching on model sensitivity. During pruning and cache reuse, SODA adaptively determines the pruning timing and rate to preserve computations of highly sensitive tokens, significantly enhancing generation fidelity. Extensive experiments on DiT-XL/2, PixArt-α, and OpenSora demonstrate that SODA achieves state-of-the-art generation fidelity under controllable acceleration ratios. Our code is released publicly at: https://github.com/leaves162/SODA.

中文标题/摘要

标题：SODA：面向灵敏度的动态加速方法用于扩散变换器

扩散变换器已成为视觉生成的主要范式，但其低推理效率仍然是进一步发展的关键瓶颈。在常见的无训练技术中，缓存提供了高加速效率，但往往牺牲了保真度，而剪枝则相反。将缓存与剪枝结合可以平衡加速和生成质量。然而，现有方法通常采用固定和启发式的方案来配置缓存和剪枝策略。虽然它们大致遵循生成模型对加速的整体灵敏度趋势，但无法捕捉到细微和复杂的差异，不可避免地跳过了高度灵敏的计算，导致质量下降。此外，这些手动设计的策略表现出较差的泛化能力。为了解决这些问题，我们提出了一种面向灵敏度的动态加速方法SODA，该方法基于细粒度的灵敏度自适应地执行缓存和剪枝。SODA构建了一个跨时间步、层和模块的离线灵敏度误差建模框架，以捕捉不同加速操作的灵敏度。通过使用灵敏度误差作为成本函数的动态规划优化缓存间隔，最小化缓存对模型灵敏度的影响。在剪枝和缓存重用过程中，SODA自适应地确定剪枝时机和速率，以保留高度灵敏的令牌的计算，显著提高生成保真度。在DiT-XL/2、PixArt-α和OpenSora上的广泛实验表明，SODA在可控加速比下实现了最先进的生成保真度。我们的代码已公开发布在：https://github.com/leaves162/SODA。

Summary / 总结

SODA is a Sensitivity-Oriented Dynamic Acceleration method that improves the inference efficiency of Diffusion Transformers while preserving generation quality. It builds an offline sensitivity error model across timesteps, layers, and modules, optimizes cache intervals via dynamic programming with sensitivity error as the cost, and adaptively sets pruning timing and rate so that highly sensitive tokens are still computed. Experiments on DiT-XL/2, PixArt-α, and OpenSora show that SODA achieves state-of-the-art generation fidelity under controllable acceleration ratios.

SODA 是一种针对扩散变换器的灵敏度导向动态加速方法，旨在提高推理效率同时保持生成质量。它通过基于细粒度灵敏度的离线灵敏度误差建模框架来优化缓存间隔和剪枝时机，从而减少对模型灵敏度的影响。实验表明，SODA 在不同加速比下实现了比现有方法更高的生成保真度。

MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins

Authors: WenBo Xu, Liu Liu, Li Zhang, Dan Guo, RuoNan Liu

First: 2026-03-13T12:30:42+00:00 · Latest: 2026-03-13T12:30:42+00:00

Comments: 5 figures

Abs · PDF · Code1 · Code2

Abstract

Converting static 3D meshes into interactable articulated assets is crucial for embodied AI and robotic simulation. However, existing zero-shot pipelines struggle with complex assets due to a critical lack of physical grounding. Specifically, ungrounded Vision-Language Models (VLMs) frequently suffer from kinematic hallucinations, while unconstrained joint estimation inevitably leads to catastrophic mesh inter-penetration during physical simulation. To bridge this gap, we propose MotionAnymesh, an automated zero-shot framework that seamlessly transforms unstructured static meshes into simulation-ready digital twins. Our method features a kinematic-aware part segmentation module that grounds VLM reasoning with explicit SP4D physical priors, effectively eradicating kinematic hallucinations. Furthermore, we introduce a geometry-physics joint estimation pipeline that combines robust type-aware initialization with physics-constrained trajectory optimization to rigorously guarantee collision-free articulation. Extensive experiments demonstrate that MotionAnymesh significantly outperforms state-of-the-art baselines in both geometric precision and dynamic physical executability, providing highly reliable assets for downstream applications.

中文标题/摘要

标题：MotionAnymesh：基于物理的关节化模拟就绪数字孪生

将静态3D网格转换为可交互的关节化资产对于具身AI和机器人模拟至关重要。然而，现有的零样本管道由于缺乏物理基础，在处理复杂资产时存在困难。具体来说，未接地的视觉-语言模型（VLM）经常出现运动学幻觉，而未约束的关节估计最终会导致物理模拟过程中网格间的灾难性穿透。为了解决这一问题，我们提出了一种名为MotionAnymesh的自动化零样本框架，该框架能够无缝地将无结构的静态网格转换为模拟就绪的数字孪生。我们的方法包含一个运动学感知的部分分割模块，该模块通过明确的SP4D物理先验来接地VLM推理，有效消除了运动学幻觉。此外，我们还引入了一种几何-物理联合估计管道，该管道结合了鲁棒的类型感知初始化与物理约束的轨迹优化，以严格保证无碰撞的关节化。大量实验表明，MotionAnymesh在几何精度和动态物理执行性方面显著优于最先进的基线方法，为下游应用提供了高度可靠的资产。

Summary / 总结

The research aims to convert static 3D meshes into interactive articulated assets for embodied AI and robotic simulation, addressing the limitations of existing zero-shot pipelines that lack physical grounding. MotionAnymesh, the proposed method, includes a kinematic-aware part segmentation module and a geometry-physics joint estimation pipeline, which together eliminate kinematic hallucinations and ensure collision-free articulation. Experiments show that MotionAnymesh outperforms existing methods in geometric precision and dynamic physical executability, offering reliable assets for downstream applications.

研究旨在将静态3D网格转换为可用于体态AI和机器人模拟的可交互关节化资产。为解决现有零样本管道的局限性，如运动幻觉和网格穿透问题，作者提出了MotionAnymesh框架。该框架包括一个运动感知部件分割模块和一个几何-物理联合估计管道，共同确保精确且无碰撞的关节化。实验表明，MotionAnymesh在几何精度和动态物理执行性方面均优于现有方法，为下游应用提供可靠的资产。

Rethinking VLMs for Image Forgery Detection and Localization

Authors: Shaofeng Guo, Jiequan Cui, Richang Hong

First: 2026-03-13T12:21:31+00:00 · Latest: 2026-03-13T12:21:31+00:00

Comments: 8 pages

Abs · PDF · Code1 · Code2 · Code3

Abstract

With the rapid rise of Artificial Intelligence Generated Content (AIGC), image manipulation has become increasingly accessible, posing significant challenges for image forgery detection and localization (IFDL). In this paper, we study how to fully leverage vision-language models (VLMs) to assist the IFDL task. In particular, we observe that priors from VLMs hardly benefit the detection and localization performance and even have negative effects due to their inherent biases toward semantic plausibility rather than authenticity. Additionally, the location masks explicitly encode the forgery concepts, which can serve as extra priors for VLMs to ease their training optimization, thus enhancing the interpretability of detection and localization results. Building on these findings, we propose a new IFDL pipeline named IFDL-VLM. To demonstrate the effectiveness of our method, we conduct experiments on 9 popular benchmarks and assess the model performance under both in-domain and cross-dataset generalization settings. The experimental results show that we consistently achieve new state-of-the-art performance in detection, localization, and interpretability. Code is available at: https://github.com/sha0fengGuo/IFDL-VLM.

中文标题/摘要

标题：重新思考VLMs在图像伪造检测与定位中的应用

随着人工智能生成内容（AIGC）的迅速发展，图像篡改变得越来越容易，这给图像伪造检测与定位（IFDL）带来了重大挑战。本文研究了如何充分利用视觉语言模型（VLMs）来辅助IFDL任务。特别地，我们观察到，VLMs中的先验知识几乎不能提高检测和定位性能，甚至由于其对语义合理性而非真实性偏向的固有偏见，反而产生了负面影响。此外，位置掩码明确编码了伪造概念，可以作为额外的先验知识，帮助VLMs的训练优化，从而增强检测和定位结果的可解释性。基于这些发现，我们提出了一种新的IFDL管道，称为IFDL-VLM。为了证明我们方法的有效性，我们在9个流行的基准上进行了实验，并在同域和跨数据集泛化设置下评估了模型性能。实验结果表明，我们在检测、定位和可解释性方面始终取得了新的最佳性能。代码可在：https://github.com/sha0fengGuo/IFDL-VLM获取。

Summary / 总结

This paper addresses the challenges of image forgery detection and localization (IFDL) in the era of Artificial Intelligence Generated Content (AIGC). It finds that vision-language models (VLMs) do not effectively assist in IFDL due to their bias towards semantic plausibility. The authors propose IFDL-VLM, which uses location masks as additional priors to improve VLM training and enhance interpretability. Experiments on nine benchmarks show that IFDL-VLM achieves state-of-the-art performance in detection, localization, and interpretability.

本文探讨了在AIGC时代图像伪造检测与定位（IFDL）的挑战，提出了一种新的IFDL-VLM管道，通过使用位置掩码作为额外先验来利用视觉语言模型（VLMs），以提高可解释性和性能。在九个基准上的实验表明，IFDL-VLM在检测、定位和可解释性方面均实现了最先进的结果，无论是针对领域内还是跨数据集的情况。

GraphPilot: Grounded Scene Graph Conditioning for Language-Based Autonomous Driving

Authors: Fabian Schmidt, Markus Enzweiler, Abhinav Valada

First: 2025-11-14T12:57:39+00:00 · Latest: 2026-03-13T11:16:18+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-language models have recently emerged as promising planners for autonomous driving, where success hinges on topology-aware reasoning over spatial structure and dynamic interactions from multimodal input. However, existing models are typically trained without supervision that explicitly encodes these relational dependencies, limiting their ability to infer how agents and other traffic entities influence one another from raw sensor data. In this work, we bridge this gap with a novel model-agnostic method that conditions language-based driving models on structured relational context in the form of traffic scene graphs. We serialize scene graphs at various abstraction levels and formats, and incorporate them into models via structured prompt templates, enabling systematic analysis of when and how relational supervision is most beneficial and computationally efficient. Extensive evaluations on the LangAuto and Bench2Drive benchmarks show that scene graph conditioning yields large and persistent improvements. We observe a substantial performance increase in the Driving Score of our proposed approach versus competitive LMDrive, BEVDriver, and SimLingo baselines. These results indicate that diverse architectures can effectively internalize and ground relational priors through scene graph-conditioned training, even without requiring scene graph input at test-time. Code, fine-tuned models, and our scene graph dataset are publicly available at https://github.com/iis-esslingen/GraphPilot.

中文标题/摘要

标题：GraphPilot：基于场景图的语义自主驾驶条件化

视觉-语言模型最近在自主驾驶领域展现出潜力，其成功依赖于对多模态输入中的空间结构和动态交互的拓扑意识推理。然而，现有模型通常未在明确编码这些关系依赖性的监督下进行训练，限制了它们从原始传感器数据中推断出代理和其他交通实体如何相互影响的能力。在本文中，我们通过一种新颖的模型无关方法弥合了这一差距，该方法将基于语言的驾驶模型条件化在交通场景图的结构化关系上下文中。我们以不同抽象级别和格式序列化场景图，并通过结构化提示模板将它们整合到模型中，从而系统地分析关系监督在何时以及如何最为有益和计算效率高。在LangAuto和Bench2Drive基准上的广泛评估表明，场景图条件化带来了显著且持久的改进。我们观察到，与竞争性的LMDrive、BEVDriver和SimLingo基线相比，我们提出的方法在驾驶得分方面有显著的性能提升。这些结果表明，即使在测试时不使用场景图输入，不同的架构也能有效地通过场景图条件化训练内化和接地关系先验。代码、微调模型和我们的场景图数据集可在https://github.com/iis-esslingen/GraphPilot上公开获取。

Summary / 总结

This work addresses the challenge of using vision-language models for autonomous driving by introducing a method to condition language-based driving models on structured relational context in the form of traffic scene graphs. The method serializes scene graphs at various levels and incorporates them into models via structured prompt templates, enhancing the models' ability to reason about spatial structures and dynamic interactions. Extensive evaluations on benchmark datasets show significant improvements in the Driving Score compared to existing baselines like LMDrive, BEVDriver, and SimLingo, indicating that scene graph-conditioned training can effectively ground relational priors in diverse architectures without requiring scene graph input at test-time.

研究旨在通过引入交通场景图中的结构化关系上下文来提升基于语言的自动驾驶模型。方法包括在不同层次上序列化场景图，并通过结构化提示模板将其集成到模型中。在LangAuto和Bench2Drive基准上的实验结果表明，场景图条件化显著提高了驾驶性能，与LMDrive、BEVDriver和SimLingo等现有基线相比，提出的模型在驾驶得分上有明显提升。这表明，即使在测试时不使用场景图输入，场景图条件化训练也能有效增强关系推理能力。
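Serializing a scene graph into a structured prompt can be sketched as below. The triple vocabulary, the two output formats, and the prompt template are all hypothetical illustrations; they are not taken from the GraphPilot repository.

```python
import json

# Hypothetical prompt template; the wording is illustrative only.
PROMPT_TEMPLATE = (
    "Scene graph:\n{graph}\n"
    "Instruction: {instruction}\n"
    "Predict the next driving action."
)


def serialize_text(triples):
    """One relation per line, in a compact arrow notation."""
    return "\n".join(f"{s} --{r}--> {o}" for s, r, o in triples)


def serialize_json(triples):
    """The same relations as a JSON list of objects."""
    return json.dumps(
        [{"subject": s, "relation": r, "object": o} for s, r, o in triples]
    )


def build_prompt(triples, instruction, fmt="text"):
    graph = serialize_text(triples) if fmt == "text" else serialize_json(triples)
    return PROMPT_TEMPLATE.format(graph=graph, instruction=instruction)


# Made-up traffic scene expressed as (subject, relation, object) triples.
triples = [
    ("ego", "follows", "vehicle_1"),
    ("vehicle_1", "in_lane", "lane_2"),
    ("pedestrian_1", "near", "crosswalk_1"),
]
prompt = build_prompt(triples, "Turn right at the next intersection.")
```

Keeping the serializer separate from the template makes it straightforward to compare formats and abstraction levels while holding the rest of the prompt fixed.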

Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

Authors: David McAllister, Miika Aittala, Tero Karras, Janne Hellsten, Angjoo Kanazawa, Timo Aila, Samuli Laine

First: 2026-03-13T10:54:09+00:00 · Latest: 2026-03-13T10:54:09+00:00

Comments: Code available at https://github.com/NVlabs/finite-difference-flow-optimization

Abs · PDF · Code1 · Code2 · Code3

Abstract

Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.

中文标题/摘要

标题：有限差分流优化在文本到图像模型后训练中的RL方法

强化学习（RL）已成为后训练基于扩散的图像合成模型的标准技术，因为它能够通过学习奖励信号来明确提高诸如图像质量和提示对齐等期望方面。在本文中，我们提出了一种在线RL变体，通过采样配对轨迹并使流速度朝向更可取的图像方向来减少模型更新的方差。与现有方法将每次采样步骤视为单独的策略动作不同，我们将整个采样过程视为一个动作。我们使用高质量的视觉语言模型和现成的质量度量作为奖励，并使用一系列度量标准评估输出。我们的方法比以前的方法收敛更快，输出质量和提示对齐更高。

Summary / 总结

This paper proposes an online RL variant for post-training diffusion-based text-to-image models. It reduces variance in model updates by sampling paired trajectories and pulling the flow velocity toward the more favorable image, and, unlike previous methods, it treats the entire sampling process as a single action. Rewards come from both high-quality vision language models and off-the-shelf quality metrics, and outputs are evaluated with a broad set of metrics. The method converges faster and yields higher output quality and prompt alignment than previous approaches.

本文提出了一种在线RL方法，通过优化采样过程来提升文本到图像生成模型。该方法通过比较配对轨迹并调整流速度朝向更优图像来减少模型更新的方差。该方法在收敛速度和输出质量方面优于现有方法，实现了更好的提示对齐，并使用高质量的视觉语言模型和现成的质量度量作为奖励。

HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Authors: Jiayue Pu, Zhongxiang Sun, Zilu Zhang, Xiao Zhang, Jun Xu

First: 2026-03-12T14:25:44+00:00 · Latest: 2026-03-13T10:53:52+00:00

Abs · PDF · Code1 · Code2

Abstract

The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.

中文标题/摘要

标题：HomeSafe-Bench：评估视觉-语言模型在家庭场景中对危险动作检测的安全性

随着实体代理的迅速发展，家用机器人在现实环境中的部署速度加快。然而，与结构化的工业环境不同，家庭空间引入了不可预测的安全风险，系统限制如感知延迟和常识知识的缺乏可能导致危险错误。当前的安全评估通常局限于静态图像、文本或一般性危害，未能充分评估这些特定情境下的动态危险动作检测。为弥补这一差距，我们引入了HomeSafe-Bench，这是一个具有挑战性的基准，旨在评估视觉-语言模型（VLMs）在家庭场景中的危险动作检测能力。HomeSafe-Bench通过结合物理模拟和高级视频生成构建，包含六个功能区域的438个多样化案例，并具有精细的多维度注释。除了基准测试，我们还提出了家庭安全的分层双脑守护（HD-Guard），这是一种分层流式架构，用于实时安全监控。HD-Guard协调一个轻量级的FastBrain进行连续的高频筛查，并通过异步的大规模SlowBrain进行深度多模态推理，有效地平衡了推理效率与检测准确性。评估表明，HD-Guard在延迟和性能之间实现了更优的权衡，而我们的分析指出了当前基于VLM的安全检测中的关键瓶颈。

Summary / 总结

The paper introduces HomeSafe-Bench, a benchmark for evaluating Vision-Language Models on detecting unsafe actions in household scenarios, addressing the limitations of current safety evaluations. The benchmark uses a hybrid pipeline combining physical simulation and video generation, featuring 438 diverse cases with detailed annotations. Additionally, the study proposes HD-Guard, a hierarchical streaming architecture that balances inference efficiency and detection accuracy, demonstrating superior performance in real-time safety monitoring.

论文介绍了HomeSafe-Bench，这是一个用于评估视觉-语言模型在家庭场景中对危险动作检测的基准，解决了现有安全评估的局限性。该基准使用物理模拟和视频生成的混合管道，包含438个具有精细注释的多样化案例。此外，论文还提出了HD-Guard，这是一种用于实时安全监控的分层流式架构，通过使用轻量级的FastBrain和异步的SlowBrain进行深度多模态推理，平衡了延迟和性能。评估表明，HD-Guard在延迟和性能之间的权衡上优于现有方法。

NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval

Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao

First: 2026-03-13T09:24:23+00:00 · Latest: 2026-03-13T09:24:23+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query-document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text consistently outperforms ranking-based and contrastive alternatives, while requiring only pre-cached teacher query embeddings and no document processing during training. Furthermore, we identify cross-lingual transfer as the primary performance bottleneck, and resolve it cheaply by augmenting training data with machine-translated queries. The resulting NanoVDR-S-Multi (DistilBERT, 69M) retains 95.1% of teacher quality and outperforms DSE-Qwen2 (2B) on v2 and v3 with 32× fewer parameters and 50× lower CPU query latency, at a total training cost under 13 GPU-hours.

中文标题/摘要

标题：NanoVDR：将20亿参数的视觉语言检索器精简为7000万参数的纯文本编码器用于视觉文档检索

基于视觉语言模型（VLM）的检索器已将视觉文档检索（VDR）提升到了令人印象深刻的水平。它们需要相同的多十亿参数编码器来同时进行文档索引和查询编码，即使对于纯文本查询也会导致高延迟和对GPU的依赖。我们观察到这种设计是不必要的对称：文档视觉复杂且需要强大的视觉理解，而查询只是简短的文本字符串。NanoVDR 利用查询与文档之间的不对称性通过解耦两种编码路径：一个冻结的20亿参数VLM教师离线索引文档，而一个仅6900万参数的精简文本学生在推理时编码查询。关键的设计选择是蒸馏目标。通过在三个骨干网络和22个ViDoRe基准数据集上系统地比较六种目标，我们发现，针对查询文本的点对齐余弦相似度始终优于基于排名和对比的替代方案，同时只需要预缓存的教师查询嵌入，而在训练过程中无需处理文档。此外，我们发现跨语言迁移是主要的性能瓶颈，通过用机器翻译的查询扩充训练数据，可以廉价地解决这一问题。最终，NanoVDR-S-Multi（DistilBERT，6900万参数）保留了95.1%的教师质量，并在v2和v3上以32倍更少的参数和50倍更低的CPU查询延迟超过了DSE-Qwen2（20亿参数），总训练成本不到13个GPU小时。

Summary / 总结

NanoVDR aims to reduce the computational overhead of vision-language model-based visual document retrieval (VDR) by decoupling the document indexing and query encoding processes. It uses a large 2 billion parameter VLM to index documents offline and a small 69 million parameter text-only encoder to encode queries at inference. The key method is the pointwise cosine alignment distillation objective, which consistently outperforms ranking-based and contrastive alternatives. The model achieves 95.1% of the teacher model's quality while requiring 32 times fewer parameters and 50 times lower CPU query latency, with a total training cost under 13 GPU-hours.

NanoVDR旨在通过将文档索引和查询编码过程分离来减少基于视觉语言模型的视觉文档检索（VDR）的计算开销。它使用一个20亿参数的VLM来离线索引文档，并使用一个6900万参数的纯文本编码器在推理时编码查询。关键方法是逐点余弦对齐的蒸馏目标，其表现始终优于基于排名和对比式的替代方案。该模型的参数量仅为教师模型的32分之一，CPU查询延迟降低50倍，总训练成本不到13个GPU小时，同时保留了教师模型95.1%的质量。
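The pointwise cosine alignment objective named above can be illustrated with a minimal sketch. The abstract does not give the exact loss, so the form used here (the mean of 1 − cosine similarity between each student query embedding and its pre-cached teacher query embedding) is an assumption, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pointwise_cosine_loss(student_embs, teacher_embs):
    """Mean of (1 - cos) over matched query pairs.

    Assumed form of a pointwise alignment loss: each student embedding is
    compared only with the teacher embedding of the same query, so no
    documents and no negative samples are needed.
    """
    losses = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_embs, teacher_embs)
    ]
    return sum(losses) / len(losses)


# Toy data (hypothetical): two queries in a 2-D embedding space.
teacher = [[1.0, 0.0], [0.0, 2.0]]
student_good = [[2.0, 0.0], [0.0, 1.0]]  # same directions as the teacher
student_bad = [[0.0, 1.0], [1.0, 0.0]]   # orthogonal to the teacher

loss_good = pointwise_cosine_loss(student_good, teacher)
loss_bad = pointwise_cosine_loss(student_bad, teacher)
```

Because the loss depends only on direction, a student whose embeddings are scaled copies of the teacher's reaches zero loss, which is consistent with retrieval by cosine similarity.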

Adaptive Vision-Language Model Routing for Computer Use Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

First: 2026-03-13T09:21:25+00:00 · Latest: 2026-03-13T09:21:25+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For "warm" agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost-accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code: https://github.com/vllm-project/semantic-router.

中文标题/摘要

标题：自适应视觉-语言模型路由用于计算机使用代理

计算机使用代理（CUAs）通过视觉-语言模型（VLM）将自然语言指令转换为图形用户界面（GUI）操作，如点击、按键和滚动。然而，不同VLM的接地准确性差异巨大，而当前的CUA系统通常会将所有操作路由到单一固定模型，而不考虑难度。我们提出了自适应VLM路由（AVR），这是一种在CUA协调器和VLM池之间插入轻量级语义路由层的框架。对于每个工具调用，AVR从多模态嵌入中估计操作难度，探测小型VLM以测量置信度，并将操作路由到预测准确率满足目标可靠性阈值的最便宜模型。对于具有先前UI交互记忆的“热启动”代理，检索到的上下文进一步缩小了小型和大型模型的能力差距，使许多操作无需升级即可处理。我们将路由形式化为成本-准确性的权衡，推导出基于阈值的模型选择策略，并使用ScreenSpot-Pro接地数据和OpenClaw代理路由基准评估AVR。在这些设置中，AVR预计可将推理成本降低高达78%，同时与全大型模型基线的差距保持在2个百分点以内。当与视觉困惑副手护栏结合使用时，AVR还可以直接将高风险操作升级到可用的最强模型，在单一路由框架中统一了效率和安全性。模型、基准和代码：https://github.com/vllm-project/semantic-router。

Summary / 总结

The paper proposes Adaptive VLM Routing (AVR), a framework that improves the efficiency of Computer Use Agents (CUAs) by routing each action to the cheapest Vision-Language Model (VLM) expected to handle it reliably. AVR estimates action difficulty from multimodal embeddings, probes a small VLM for confidence, and selects the cheapest model whose predicted accuracy meets a target reliability threshold. Evaluations on ScreenSpot-Pro grounding data and the OpenClaw agent routing benchmark project inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline. Combined with the Visual Confused Deputy guardrail, AVR escalates high-risk actions directly to the strongest available model, so efficiency and safety are handled within a single routing framework.

论文提出了自适应VLM路由（AVR）框架，该框架根据动作难度和可靠性阈值动态地将动作路由到视觉语言模型（VLMs）。这种方法通过将推理成本最多降低78%，同时保持性能接近仅使用大型模型的基准，提高了计算机使用代理（CUAs）的效率。它还集成了视觉困惑副手护栏，以将高风险动作直接路由到最强可用模型，从而在单一路由框架中同时提高效率和安全性。评估使用了ScreenSpot-Pro接地数据和OpenClaw代理路由基准。
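The threshold-based selection rule described above can be sketched as follows. This is a simplified illustration rather than the AVR implementation: the per-action predicted accuracies are taken as given inputs (in the paper they come from difficulty estimation and small-model probing), "strongest" is assumed to mean highest predicted accuracy, and the `Model` class, `route` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost: float                # relative inference cost per action (hypothetical)
    predicted_accuracy: float  # predicted accuracy for the current action


def route(models, threshold, high_risk=False):
    """Pick the cheapest model whose predicted accuracy meets the threshold.

    High-risk actions, and actions for which no model clears the threshold,
    are escalated to the strongest model (assumed here to be the one with
    the highest predicted accuracy).
    """
    strongest = max(models, key=lambda m: m.predicted_accuracy)
    if high_risk:
        return strongest
    eligible = [m for m in models if m.predicted_accuracy >= threshold]
    if not eligible:
        return strongest
    return min(eligible, key=lambda m: m.cost)


# Hypothetical pool of three models for a single action.
pool = [
    Model("small", cost=1.0, predicted_accuracy=0.70),
    Model("medium", cost=4.0, predicted_accuracy=0.88),
    Model("large", cost=20.0, predicted_accuracy=0.95),
]

easy_choice = route(pool, threshold=0.50).name
typical_choice = route(pool, threshold=0.85).name
strict_choice = route(pool, threshold=0.99).name
risky_choice = route(pool, threshold=0.50, high_risk=True).name
```

Raising the threshold shifts more actions to larger models, which is the cost-accuracy trade-off the abstract formalizes.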

What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

Authors: Sen Nie, Jie Zhang, Zhongqi Wang, Zhaoyang Wei, Shiguang Shan, Xilin Chen

First: 2026-03-13T09:02:11+00:00 · Latest: 2026-03-13T09:02:11+00:00

Comments: 28 pages

Abs · PDF · Code1 · Code2 · Project1

Abstract

Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at https://summu77.github.io/R-Adapt.

中文标题/摘要

标题：什么是使VLMs稳健的因素？关于视觉-语言模型稳健性和准确性之间矛盾的统一

在视觉-语言模型（VLMs）中实现对抗性稳健性不可避免地会牺牲干净数据上的准确性，这一直是一个长期存在的挑战。在本文中，我们通过重新审视这一权衡，探讨了一个基本问题：是什么使VLMs稳健？通过对对抗性微调模型的详细分析，我们研究了稳健性机制的内部运作方式及其与干净准确性之间的相互作用。我们的分析揭示了稳健性在网络深度上的分布并不均匀。相反，出乎意料的是，它主要集中在浅层，由低频频谱偏差和输入无关的注意力模式驱动。同时，深层层的更新往往会削弱干净准确性和稳健泛化。基于这些见解，我们提出了对抗性稳健性适应（R-Adapt）框架，该框架冻结所有预训练权重，并在初始层中引入少量、基于洞察的适应。该设计在对抗性稳健性和干净准确性之间实现了卓越的平衡。R-Adapt 进一步支持无训练、模型引导和数据驱动的范式，为标准模型提供灵活的途径以无缝增强其稳健性。在18个数据集和多种任务上的广泛评估表明，在各种攻击下我们的性能处于最新水平。值得注意的是，R-Adapt 能够高效地推广到大型视觉-语言模型（例如，LLaVA和Qwen-VL）以增强其稳健性。我们的项目页面可在 https://summu77.github.io/R-Adapt/ 查看。

Summary / 总结

This work addresses the trade-off between adversarial robustness and clean accuracy in Vision-Language Models (VLMs) by investigating what makes VLMs robust. Through detailed analysis, it reveals that robustness is primarily localized in shallow layers, driven by low-frequency spectral bias and input-insensitive attention patterns. Motivated by these findings, the authors propose Adversarial Robustness Adaptation (R-Adapt), which freezes pre-trained weights and makes minimal adaptations in initial layers, achieving a balance between robustness and accuracy. Extensive evaluations show that R-Adapt performs well across various attacks and large models like LLaVA and Qwen-VL.

该研究探讨了视觉-语言模型（VLMs）在对抗鲁棒性和干净准确性的权衡问题。通过分析对抗性微调模型，研究发现鲁棒性主要集中在浅层，由低频谱偏置和输入无关的注意力模式驱动。基于这些发现，作者提出了对抗鲁棒性适应（R-Adapt）框架，该框架冻结预训练权重并在初始层进行最小化调整，从而在鲁棒性和准确性之间取得平衡。广泛评估表明，R-Adapt在各种数据集和大型模型（如LLaVA和Qwen-VL）上表现出色，增强了其鲁棒性而不显著牺牲干净准确性。

IROSA: Interactive Robot Skill Adaptation using Natural Language

Authors: Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp, João Silvério

Venue: IEEE Robotics and Automation Letters (RA-L), 2026

First: 2026-03-04T09:54:09+00:00 · Latest: 2026-03-13T09:00:04+00:00

Comments: Accepted IEEE Robotics and Automation Letters (RA-L) journal, 8 pages, 5 figures, 3 tables, 1 listing

Abs · PDF · Code1 · Code2

Abstract

Foundation models have demonstrated impressive capabilities across diverse domains, while imitation learning provides principled methods for robot skill adaptation from limited data. Combining these approaches holds significant promise for direct application to robotics, yet this combination has received limited attention, particularly for industrial deployment. We present a novel framework that enables open-vocabulary skill adaptation through a tool-based architecture, maintaining a protective abstraction layer between the language model and robot hardware. Our approach leverages pre-trained LLMs to select and parameterize specific tools for adapting robot skills without requiring fine-tuning or direct model-to-robot interaction. We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and obstacle avoidance while maintaining safety, transparency, and interpretability.

中文标题/摘要

标题：IROSA：使用自然语言的交互式机器人技能适应

基础模型在多个领域展示了令人印象深刻的性能，而模仿学习为从有限数据中进行机器人技能适应提供了原理性的方法。将这两种方法结合起来在直接应用于机器人技术方面具有巨大的潜力，但这种结合受到的关注有限，尤其是在工业部署方面。我们提出了一种新的框架，通过基于工具的架构实现开放词汇量的技能适应，保持语言模型与机器人硬件之间的保护性抽象层。我们的方法利用预训练的大规模语言模型来选择和参数化特定工具，以适应机器人技能，而无需进行微调或直接模型到机器人的交互。我们在一个7自由度扭矩控制机器人上演示了该框架，该机器人执行工业轴承环插入任务，通过自然语言命令成功实现了速度调整、轨迹校正和避障方面的技能适应，同时保持了安全性、透明性和可解释性。

Summary / 总结

The research aims to combine the capabilities of foundation models and imitation learning to enable robots to adapt skills using natural language commands. The method involves using a pre-trained language model to select and parameterize specific tools for skill adaptation without requiring fine-tuning or direct interaction with the robot. Key experimental findings show successful skill adaptation for a 7-DoF torque-controlled robot in an industrial bearing ring insertion task, with adjustments made through natural language for speed, trajectory, and obstacle avoidance, while ensuring safety and interpretability.

研究旨在结合基础模型和模仿学习的能力，使机器人能够通过自然语言命令来适应技能。方法是使用预训练的语言模型来选择和参数化特定工具进行技能适应，无需对模型进行微调或直接与机器人交互。实验结果表明，对于一个7自由度的扭矩控制机器人，在工业轴承环插入任务中，可以通过自然语言调整速度、轨迹和避障，同时确保安全性和可解释性。

Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning

Authors: Mengya Xu, Daiyun Shen, Jie Zhang, Hon Chi Yip, Yujia Gao, Cheng Chen, Dillan Imans, Yonghao Long, Yiru Ye, Yixiao Liu, Rongyun Mai, Kai Chen, Hongliang Ren, Yutong Ban, Guangsuo Wang, Francis Wong, Chi-Fai Ng, Kee Yuan Ngiam, Russell H. Taylor, Daguang Xu, Yueming Jin, Qi Dou

First: 2026-03-13T08:46:25+00:00 · Latest: 2026-03-13T08:46:25+00:00

Comments: 34 pages, 8 figures

Abs · PDF · Code1 · Code2

Abstract

Artificial intelligence, imaging, and large language models have the potential to transform surgical practice, training, and automation. Understanding and modeling of basic surgical actions (BSA), the fundamental unit of operation in any surgery, is important to drive the evolution of this field. In this paper, we present a BSA dataset comprising 10 basic actions across 6 surgical specialties with over 11,000 video clips, which is the largest to date. Based on the BSA dataset, we developed a new foundation model that conducts general-purpose recognition of basic actions. Our approach demonstrates robust cross-specialist performance in experiments validated on datasets from different procedural types and various body parts. Furthermore, we demonstrate downstream applications enabled by the BSA foundation model through surgical skill assessment in prostatectomy using domain-specific knowledge, and action planning in cholecystectomy and nephrectomy using large vision-language models. Multinational surgeons' evaluation of the explainable action-planning texts produced by the language model demonstrated clinical relevance. These findings indicate that basic surgical actions can be robustly recognized across scenarios, and an accurate BSA understanding model can essentially facilitate complex applications and speed up the realization of surgical superintelligence.

中文标题/摘要

标题：基本手术动作的通用识别使技能评估和基于视觉-语言模型的手术规划成为可能

人工智能、成像技术和大型语言模型有可能彻底改变外科实践、培训和自动化。理解并建模基本手术动作（BSA），即任何手术中的基本操作单位，对于推动该领域的发展至关重要。在本文中，我们介绍了一个包含10种基本动作、覆盖6个外科专科、超过11,000个视频片段的BSA数据集，这是迄今为止最大的数据集。基于BSA数据集，我们开发了一种新的基础模型，用于基本动作的通用识别。我们的方法在跨专科实验中表现出色，这些实验在不同手术类型和不同身体部位的数据集上进行了验证。此外，我们通过使用专科知识进行前列腺切除术的技能评估以及使用大型视觉-语言模型进行胆囊切除术和肾切除术的动作规划，展示了BSA基础模型的下游应用。来自多个国家的外科医生对语言模型生成的动作规划解释性文本的评估表明其临床相关性。这些发现表明，基本手术动作可以在各种场景中稳健识别，并且准确的BSA理解模型可以促进复杂应用并加速外科超级智能的实现。

Summary / 总结

This paper aims to enhance surgical practice through the recognition and modeling of basic surgical actions (BSA). The authors developed a large BSA dataset and a foundation model for recognizing BSA across different surgical specialties. The model showed robust performance in cross-specialist scenarios and enabled downstream applications such as surgical skill assessment and action planning using large vision-language models. Surgeons evaluated the explainable texts generated by the model, indicating clinical relevance.

本文旨在通过识别和建模基本手术动作（BSA）来提升手术实践。作者开发了一个大型BSA数据集和一个跨不同外科专科识别BSA的基础模型。该模型在跨专科场景中表现出色，并且能够实现下游应用，如前列腺切除术中的手术技能评估和胆囊切除术及肾切除术中的动作规划，使用大型视觉-语言模型。外科医生评估了模型生成的解释性文本，表明其临床相关性。

Empowering Semantic-Sensitive Underwater Image Enhancement with VLM

Authors: Guodong Fan, Shengning Zhou, Genji Yuan, Huiyu Li, Jingchun Zhou, Jinjiang Li

Venue: AAAI 2026 Oral presentation

First: 2026-03-13T08:17:06+00:00 · Latest: 2026-03-13T08:17:06+00:00

Comments: Accepted as an Oral presentation at AAAI 2026

Abs · PDF · Code1 · Code2

Abstract

In recent years, learning-based underwater image enhancement (UIE) techniques have rapidly evolved. However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream vision tasks, thereby limiting the adaptability of existing enhancement models. To address this challenge, this work proposes a new learning mechanism that leverages Vision-Language Models (VLMs) to empower UIE models with semantic-sensitive capabilities. To be concrete, our strategy first generates textual descriptions of key objects from a degraded image via VLMs. Subsequently, a text-image alignment model remaps these relevant descriptions back onto the image to produce a spatial semantic guidance map. This map then steers the UIE network through a dual-guidance mechanism, which combines cross-attention and an explicit alignment loss. This forces the network to focus its restorative power on semantic-sensitive regions during image reconstruction, rather than pursuing a globally uniform improvement, thereby ensuring the faithful restoration of key object features. Experiments confirm that when our strategy is applied to different UIE baselines, it significantly boosts their performance on perceptual quality metrics and enhances their performance on detection and segmentation tasks, validating its effectiveness and adaptability.

中文标题/摘要

标题：利用VLM实现语义敏感的水下图像增强

近年来，基于学习的水下图像增强（UIE）技术迅速发展。然而，高质量增强输出与自然图像之间的分布变化会妨碍下游视觉任务中的语义线索提取，从而限制现有增强模型的适应性。为解决这一挑战，本工作提出了一种新的学习机制，利用视觉语言模型（VLMs）赋予UIE模型语义敏感能力。具体而言，我们的策略首先通过VLMs从退化图像中生成关键对象的文本描述。随后，一个文本-图像对齐模型将这些相关描述重新映射回图像，生成一个空间语义指导图。该图通过双重指导机制引导UIE网络，结合交叉注意力和显式对齐损失。这迫使网络在图像重建过程中将恢复能力集中在语义敏感区域，而不是追求全局一致的改进，从而确保关键对象特征的忠实恢复。实验表明，当我们的策略应用于不同的UIE基线时，显著提升了其在感知质量指标上的性能，并增强了其在检测和分割任务上的表现，验证了其有效性和适应性。

Summary / 总结

This work addresses the challenge of semantic cue extraction in underwater image enhancement by proposing a new learning mechanism that integrates Vision-Language Models (VLMs). The method generates textual descriptions of key objects from degraded images and uses a text-image alignment model to produce a spatial semantic guidance map. This map guides the enhancement network to focus on semantic-sensitive regions, improving perceptual quality and downstream tasks like detection and segmentation. Experiments show significant performance boosts across different UIE baselines.

该研究通过引入Vision-Language模型（VLM）来增强水下图像增强的语义敏感性，解决分布偏移的问题。方法是从退化图像生成关键对象的文本描述，并使用文本-图像对齐模型创建空间语义引导图。该图通过双重引导机制指导图像增强网络，专注于恢复关键对象特征而非全局提升图像质量。实验表明，当应用于不同的增强基线时，这种方法显著提高了感知质量并增强了检测和分割任务的表现。

PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues

Authors: Yukun Qi, Pei Fu, Hang Li, Yuhan Liu, Chao Jiang, Bin Qin, Zhenbo Luo, Jian Luan

First: 2026-03-06T03:44:27+00:00 · Latest: 2026-03-13T08:10:46+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress on a wide range of challenging multimodal understanding and reasoning tasks. However, existing reasoning paradigms, such as the classical Chain-of-Thought (CoT), rely solely on textual information and often underutilize important visual cues. While prior work has incorporated pixel-level visual cues, these representations require precise spatial localization, introducing additional learning complexity. To address this, we propose PatchCue, a novel patch-based visual cue paradigm designed to significantly enhance the visual reasoning capabilities of VLMs. By partitioning images into patches and representing cues at the patch level, PatchCue aligns better with human perceptual habits and leverages the patch-tokenized input of modern VLMs. We train VLMs using a two-stage approach: cold-start supervised fine-tuning to output patch-level cues, followed by reinforcement learning with a process-supervised cue reward that guides intermediate visual reasoning steps. Extensive experiments on multiple VLMs and diverse benchmarks, including general visual question answering, complex reasoning, and document understanding, demonstrate that PatchCue consistently improves overall model performance. Our results show that patch-level cues outperform both pixel-level bounding boxes and point-based cues, providing a more effective and cognitively aligned visual reasoning paradigm.

中文标题/摘要

标题：PatchCue：通过基于块的视觉提示增强视觉语言模型的推理能力

视觉语言模型（VLMs）在多种具有挑战性的跨模态理解和推理任务中取得了显著进展。然而，现有的推理范式，如经典的思维链（CoT），仅依赖于文本信息，往往未能充分利用重要的视觉提示。尽管先前的工作已经整合了像素级的视觉提示，但这些表示需要精确的空间定位，增加了额外的学习复杂性。为了解决这个问题，我们提出了一种新的基于块的视觉提示范式PatchCue，旨在显著增强VLMs的视觉推理能力。通过将图像划分为块并在块级别表示提示，PatchCue更好地符合人类的知觉习惯，并利用现代VLMs的块标记输入。我们采用两阶段的方法进行训练：冷启动监督微调以输出块级提示，然后使用过程监督提示奖励进行强化学习，以引导中间的视觉推理步骤。在多个VLMs和多种基准测试上的广泛实验，包括通用视觉问答、复杂推理和文档理解，表明PatchCue能够一致地提高整体模型性能。我们的结果表明，块级提示优于像素级边界框和基于点的提示，提供了一种更有效且认知上更一致的视觉推理范式。

Summary / 总结

PatchCue is a novel patch-based visual cue paradigm that enhances the visual reasoning capabilities of Vision-Language Models (VLMs) by partitioning images into patches and representing cues at the patch level. It uses a two-stage training approach, first fine-tuning the models to output patch-level cues and then using reinforcement learning to guide intermediate visual reasoning steps. Experiments on various VLMs and benchmarks show that PatchCue improves overall model performance and outperforms pixel-level and point-based cues.

PatchCue 是一种新颖的基于补丁的视觉线索范式，旨在增强 Vision-Language 模型（VLM）的视觉推理能力。它将图像划分为补丁，并在补丁级别表示线索，更好地与人类的感知习惯对齐，并利用现代 VLM 的补丁分词输入。训练分为两个阶段：先通过冷启动监督微调使 VLM 输出补丁级线索，再通过带有过程监督线索奖励的强化学习引导中间的视觉推理步骤。实验表明，PatchCue 在各种基准测试中一致提高了模型性能，优于像素级边界框和基于点的线索。
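To make the idea of a patch-level cue concrete, the sketch below converts a pixel-level bounding box into the indices of the grid patches it overlaps. The abstract does not specify PatchCue's cue format, so the square patch grid, the row-major indexing, and the function name `box_to_patches` are all assumptions chosen for illustration.

```python
import math


def box_to_patches(box, image_size, patch_size):
    """Return row-major indices of all patches overlapped by a pixel box.

    box        -- (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive
    image_size -- (width, height) in pixels
    patch_size -- side length of one square patch in pixels
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)

    # Clamp the covered patch range to the grid.
    col_start = max(0, x0 // patch_size)
    col_end = min(cols - 1, (x1 - 1) // patch_size)
    row_start = max(0, y0 // patch_size)
    row_end = min(rows - 1, (y1 - 1) // patch_size)

    return [
        r * cols + c
        for r in range(row_start, row_end + 1)
        for c in range(col_start, col_end + 1)
    ]


# A 56x56 image with 14-pixel patches forms a 4x4 grid (indices 0..15).
wide_cue = box_to_patches((10, 10, 30, 20), (56, 56), 14)
single_cue = box_to_patches((14, 14, 28, 28), (56, 56), 14)
```

A cue expressed this way needs only a short list of patch indices rather than exact pixel coordinates, which is one plausible reading of why patch-level cues reduce the localization precision a model has to learn.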

DeCode: Decoupling Content and Delivery for Medical QA

Authors: Po-Jen Ko, Chen-Han Tsai, Yu-Shao Peng

First: 2026-01-05T13:54:38+00:00 · Latest: 2026-03-13T08:08:44+00:00

Comments: Preprint

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) exhibit strong medical knowledge and can generate factually accurate responses. However, existing models often fail to account for individual patient contexts, producing answers that are clinically correct yet poorly aligned with patients' needs. In this work, we introduce DeCode (Decoupling Content and Delivery), a training-free, model-agnostic framework that adapts existing LLMs to produce contextualized answers in clinical settings. We evaluate DeCode on OpenAI HealthBench, a comprehensive and challenging benchmark designed to assess clinical relevance and validity of LLM responses. DeCode boosts zero-shot performance from 28.4% to 49.8% and achieves a new state of the art compared to existing methods. Experimental results suggest the effectiveness of DeCode in improving clinical question answering of LLMs.

中文标题/摘要

标题：DeCode：解耦内容与交付以实现医疗QA

大型语言模型（LLMs）表现出强大的医学知识，并能生成事实准确的回答。然而，现有模型往往未能考虑个体患者的背景，导致答案在临床上正确但与患者需求严重脱节。在本研究中，我们引入了DeCode（解耦内容与交付），这是一种无需训练、模型通用的框架，能够将现有的LLMs适应到临床环境中生成上下文化的回答。我们使用OpenAI HealthBench（一个全面且具有挑战性的基准，旨在评估LLM回答的临床相关性和有效性）对DeCode进行了评估。DeCode将零样本性能从28.4%提升到49.8%，并实现了与现有方法相比的新最佳性能。实验结果表明，DeCode在提高LLMs的临床问题回答方面具有有效性。

Summary / 总结

DeCode is a training-free, model-agnostic framework that decouples content and delivery to improve the clinical relevance of large language models (LLMs). Evaluated on OpenAI HealthBench, DeCode significantly boosts zero-shot performance from 28.4% to 49.8%, setting a new state-of-the-art in clinical question answering for LLMs.

DeCode 是一个无需训练、适用于各种模型的框架，旨在通过内容和交付的解耦使现有大型语言模型适应临床环境。它在 OpenAI HealthBench 上进行评估，并将零样本性能从 28.4% 提高到 49.8%，在 LLM 的临床问题回答方面达到了新的最佳水平。
