arXiv 论文速递

Snapshot: 20260305_0348

Utonia: Toward One Encoder for All Point Clouds

Authors: Yujia Zhang, Xiaoyang Wu, Yunhan Yang, Xianzhe Fan, Han Li, Yuechen Zhang, Zehao Huang, Naiyan Wang, Hengshuang Zhao

First: 2026-03-03T18:59:58+00:00 · Latest: 2026-03-03T18:59:58+00:00

Comments: produced by Pointcept, project page: https://pointcept.github.io/Utonia

Abstract

We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.

中文标题/摘要

标题：Utonia：朝向通用点云编码器

我们梦想着一个未来，来自各个领域的点云能够汇聚在一起，共同塑造一个能够惠及所有领域的单一模型。为此，我们提出了Utonia，这是朝着在多样化的领域中训练单一的自监督点变换编码器迈出的第一步，这些领域包括遥感、户外LiDAR、室内RGB-D序列、对象中心的CAD模型以及从仅RGB视频中提取的点云。尽管它们具有不同的传感几何结构、密度和先验知识，Utonia仍然能够学习一个一致的表示空间，该空间可以在不同领域之间进行迁移。这种统一提高了感知能力，同时揭示了只有在联合训练领域时才会出现的有趣涌现行为。超越感知，我们观察到Utonia表示还可以为具身和多模态推理提供帮助：基于Utonia特征对视觉-语言-动作策略进行条件化可以提高机器人的操作能力，将它们整合到视觉-语言模型中可以提高空间推理能力。我们希望Utonia能够作为稀疏3D数据基础模型的一步，支持AR/VR、机器人技术和自动驾驶等下游应用。

Summary / 总结

Utonia aims to develop a single self-supervised point transformer encoder that can be trained across various domains including remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds from RGB-only videos. Despite the diverse characteristics of these domains, Utonia learns a consistent representation space that enhances perception and reveals interesting behaviors when domains are trained jointly. The model also benefits embodied and multimodal reasoning, improving robotic manipulation and vision-language models for spatial reasoning. Utonia is seen as a step toward foundation models for sparse 3D data applications in AR/VR, robotics, and autonomous driving.

研究旨在开发一种适用于不同领域点云的统一模型。Utonia是一种自监督点变换编码器，在遥感、户外LiDAR、室内RGB-D序列、物体中心CAD模型以及从仅RGB视频中提取的点云等多个领域上联合训练。该模型学习了一致的表示空间，提升了感知能力，并揭示了只有在多领域联合训练时才出现的涌现行为。此外，Utonia还增强了基于视觉-语言-动作策略的机器人操作和视觉-语言模型的空间推理能力。研究认为Utonia是迈向稀疏3D数据基础模型的一步，可支持AR/VR、机器人和自动驾驶等下游应用。

Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping

Authors: William Liang, Sam Wang, Hung-Ju Wang, Osbert Bastani, Yecheng Jason Ma, Dinesh Jayaraman

Venue: ICLR

First: 2026-03-03T18:59:07+00:00 · Latest: 2026-03-03T18:59:07+00:00

Comments: International Conference on Learning Representations (ICLR), 2026. Project website and code: https://tether-research.github.io

Abstract

The ability to conduct and learn from interaction and experience is a central challenge in robotics, offering a scalable alternative to labor-intensive human demonstrations. However, realizing such "play" requires (1) a policy robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful robot experience. To address these challenges, we introduce Tether, a method for autonomous functional play involving structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations (<=10) by anchoring them to semantic keypoint correspondences in the target scene. We show that this design is extremely data-efficient and robust even under significant spatial and semantic variations. Second, we deploy this policy for autonomous functional play in the real world via a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates diverse, high-quality datasets with minimal human intervention. In a household-like multi-object setup, our method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. This produces a stream of data that consistently improves the performance of closed-loop imitation policies over time, ultimately yielding over 1000 expert-level trajectories and training policies competitive with those learned from human-collected demonstrations.

中文标题/摘要

标题：Tether：基于对应驱动轨迹扭曲的自主功能性玩耍

能够进行交互和从经验中学习的能力是机器人技术中的一个核心挑战，提供了一种劳动密集型的人类示范的可扩展替代方案。然而，实现这种“玩耍”需要（1）一种对各种潜在分布外环境状态具有鲁棒性的策略，以及（2）一种能够持续生成有用机器人经验的程序。为了解决这些挑战，我们引入了Tether，一种涉及结构化、任务导向交互的自主功能性玩耍方法。首先，我们设计了一种新颖的开环策略，通过将动作锚定到目标场景中的语义关键点对应关系，对来自少量源示范（≤10个）的动作进行扭曲。我们展示了这种设计在数据效率和鲁棒性方面具有极大的优势，即使在显著的空间和语义变化下也是如此。其次，我们通过视觉理解能力引导的连续循环任务选择、执行、评估和改进，将此策略部署到现实世界中的自主功能性玩耍中。这种方法生成了大量高质量的数据集，同时减少了人类干预。在一个类似家庭的多对象设置中，我们的方法是第一个仅从少量示范开始，在现实世界中进行多小时的自主多任务玩耍的方法。这产生了一条持续改进闭环模仿策略性能的数据流，最终产生了超过1000条专家级轨迹，并训练出与人类收集示范学习的策略相当的策略。

Summary / 总结

Tether is a method for autonomous functional play in robotics, addressing the challenges of robust policy design and continuous experience generation. It uses a novel open-loop policy that warps actions from a few source demonstrations by anchoring them to semantic keypoint correspondences in the target scene, showing high data efficiency and robustness. Tether continuously selects tasks, executes them, evaluates the outcomes, and improves the policy, generating diverse and high-quality datasets with minimal human intervention. In a household-like setup, Tether performs many hours of autonomous multi-task play, producing over 1000 expert-level trajectories and training policies competitive with those from human demonstrations.

Tether 是一种用于自主功能玩耍的机器人方法，解决策略设计的鲁棒性和持续经验生成的挑战。它使用一种新颖的开环策略，通过在目标场景中对齐语义关键点来扭曲来自少量源演示的动作，显示出高度的数据效率和鲁棒性。Tether 通过视觉理解能力从视觉语言模型中不断选择任务、执行任务、评估结果并改进策略，生成高质量的多样化数据集，几乎无需人工干预。这种方法使机器人能够在类似家庭的多对象设置中自主进行多任务玩耍数小时，生成超过1000个专家级轨迹，并训练出与人类收集的演示学习的策略相竞争的策略。
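
The keypoint-anchored warping described in the Tether abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes planar (2D) keypoints and a single least-squares similarity transform (rotation, uniform scale, translation), and the names `fit_similarity` and `warp_trajectory` are hypothetical. The paper's method operates on semantic correspondences in real scenes and may warp differently.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def fit_similarity(src: List[Point], dst: List[Point]) -> Tuple[complex, complex]:
    """Least-squares 2D similarity z -> a*z + b mapping src keypoints to dst.

    Assumption for illustration: at least two distinct source keypoints,
    matched one-to-one with the destination keypoints.
    """
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    # Closed-form solution on centered points (complex least squares).
    num = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    a = num / den          # rotation and scale
    b = q_mean - a * p_mean  # translation
    return a, b


def warp_trajectory(traj: List[Point], a: complex, b: complex) -> List[Point]:
    """Apply the fitted transform to every waypoint of a demonstration."""
    warped = []
    for x, y in traj:
        z = a * complex(x, y) + b
        warped.append((z.real, z.imag))
    return warped


# Keypoints in the demonstration scene and their matches in the target scene
# (here the target is the source rotated by 90 degrees and shifted by (2, 3)).
demo_keypoints = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
target_keypoints = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]
a, b = fit_similarity(demo_keypoints, target_keypoints)
warped = warp_trajectory([(1.0, 1.0), (0.5, 0.0)], a, b)
```

A single global transform is the simplest possible choice; it cannot express non-rigid changes between scenes, which is one reason the sketch should be read only as intuition for "anchoring actions to correspondences".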

UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving

Authors: Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren

First: 2026-01-07T23:49:52+00:00 · Latest: 2026-03-03T18:40:54+00:00

Comments: Project Page: https://unidrive-wm.github.io/UniDrive-WM

Abstract

World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM .

中文标题/摘要

标题：UniDrive-WM：统一理解、规划和生成世界模型在自动驾驶中的应用

世界模型已成为自动驾驶的核心，准确的场景理解和未来预测对于安全控制至关重要。近期研究探索了使用视觉-语言模型（VLMs）进行规划，但现有方法通常将感知、预测和规划视为独立模块。我们提出了一种名为UniDrive-WM的统一VLM基世界模型，该模型在单一架构中联合执行驾驶场景理解、轨迹规划和基于轨迹的未来图像生成。UniDrive-WM的轨迹规划器预测未来轨迹，条件化VLM基图像生成器以生成合理的未来帧。这些预测提供了额外的监督信号，增强场景理解并逐步细化轨迹生成。我们进一步比较了离散和连续输出表示对未来图像预测的影响，分析其对下游驾驶性能的影响。在具有挑战性的Bench2Drive基准测试中，UniDrive-WM生成了高保真度的未来图像，并在L2轨迹误差和碰撞率方面分别提高了5.9%和9.2%，超过了之前的最佳方法。这些结果表明，将VLM驱动的推理、规划和生成世界建模紧密集成对于自动驾驶的优势。项目页面可在https://unidrive-wm.github.io/UniDrive-WM 查看。

Summary / 总结

UniDrive-WM is a unified VLM-based world model that integrates driving-scene understanding, trajectory planning, and future image generation. It uses a trajectory planner to predict future paths, which conditions a VLM to generate plausible future frames, enhancing scene understanding and trajectory generation. Experiments show that UniDrive-WM improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate compared to the previous best method on the Bench2Drive benchmark. This demonstrates the benefits of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving.

UniDrive-WM 是一个统一的世界模型，使用视觉语言模型整合场景理解、轨迹规划和未来图像生成。通过预测未来轨迹并条件化图像生成，它将 L2 轨迹误差降低了 5.9%，碰撞率降低了 9.2%，优于之前的最佳方法。该模型的紧密集成增强了场景理解并逐步细化轨迹预测。 Bench2Drive 基准测试结果表明其有效性。

UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Authors: Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen

First: 2026-03-03T18:36:16+00:00 · Latest: 2026-03-03T18:36:16+00:00

Abstract

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.

中文标题/摘要

标题：UniG2U-Bench：统一模型是否推进了多模态理解？

统一多模态模型最近展示了强大的生成能力，但生成是否以及何时提升理解仍不清楚。现有基准缺乏对生成促进理解的具体任务的系统探索。为此，我们引入了UniG2U-Bench，这是一个全面的基准，将生成到理解（G2U）评估分为7个阶段和30个子任务，需要不同程度的隐式或显式视觉转换。对超过30个模型的广泛评估揭示了三个核心发现：1）统一模型通常不如其基础视觉语言模型（VLM），生成后推理（GtA）通常会降低性能相对于直接推理。2）在空间智能、视觉错觉或多轮推理子任务中出现一致的增强，其中增强的空间和形状感知以及多步中间图像状态是有益的。3）具有相似推理结构的任务和共享架构的模型表现出相关行为，表明生成-理解耦合在任务、预训练数据和模型架构上诱导了类一致的归纳偏差。这些发现强调了需要更多样化的训练数据和新颖的范式来充分释放统一多模态建模的潜力。

Summary / 总结

The study introduces UniG2U-Bench, a comprehensive benchmark to evaluate the understanding capabilities of unified multimodal models through generation-to-understanding tasks. It evaluates over 30 models across 7 regimes and 30 subtasks, revealing that unified models generally underperform their base VLMs and that GtA inference degrades performance. The study finds consistent enhancements in spatial intelligence, visual illusions, and multi-round reasoning subtasks, and suggests that generation-understanding coupling induces class-consistent inductive biases, highlighting the need for diverse training data and new paradigms.

研究引入了UniG2U-Bench基准，评估统一多模态模型在7个范式和30个子任务中的生成到理解（G2U）能力。关键发现包括统一模型通常不如其基础视觉-语言模型，生成后推理通常会降低性能。在空间智能、视觉错觉和多轮推理子任务中则出现一致的提升，增强的空间和形状感知以及多步中间图像状态在其中发挥了作用；具有相似推理结构的任务和共享架构的模型表现出相关行为，这表明生成-理解耦合在任务、预训练数据和模型架构上产生了类一致的归纳偏差。这强调了需要更多样化的训练数据和新的范式来提升统一多模态模型。

MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

Authors: Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian

Venue: ICLR 2026

First: 2025-08-25T17:57:49+00:00 · Latest: 2026-03-03T17:59:41+00:00

Comments: Accepted by ICLR 2026. Code: https://github.com/Ironieser/mmtok , Project Homepage: https://project.ironieser.cc/mmtok

Abstract

Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual inputs to vision tokens. However, redundancy in vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them apply only unimodal information (i.e., vision/text) for pruning and ignore the inherent multimodal property of vision-language tasks. Moreover, there is no generic criterion that can be applied to different modalities. To mitigate this limitation, in this work, we propose to leverage both vision and text tokens to select informative vision tokens by the coverage criterion. We first formulate the subset selection problem as a maximum coverage problem. Afterwards, a subset of vision tokens is optimized to cover the text tokens and the original set of vision tokens, simultaneously. The proposed method MMTok is extensively evaluated on benchmark datasets with different VLMs. The comparison illustrates that vision and text information are complementary, and combining multimodal information can surpass the unimodal baseline by a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Finally, with only four vision tokens, 87.7% of the original performance is still preserved on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection. The code is available at https://github.com/Ironieser/mmtok

中文标题/摘要

标题：MMTok：多模态覆盖率最大化以提高VLMs高效推理

视觉-语言模型（VLMs）通过将视觉输入转换为视觉标记来理解带有语言指令的视觉内容，表现出令人印象深刻的性能。然而，视觉标记中的冗余性导致了VLMs推理效率的下降。虽然已经提出了许多算法来减少视觉标记的数量，但大多数算法仅使用单模态信息（即视觉/文本）进行剪枝，忽略了视觉-语言任务的固有多模态特性。此外，缺乏一个适用于不同模态的通用标准。为了解决这一局限性，本文提出利用视觉和文本标记来通过覆盖率标准选择信息性的视觉标记。我们首先将子集选择问题形式化为最大覆盖问题。之后，优化一个视觉标记子集以同时覆盖文本标记和原始的视觉标记集。所提出的方法MMTok在不同的基准数据集和VLMs上进行了广泛评估。比较结果表明，视觉和文本信息是互补的，结合多模态信息可以明显超越单模态基线。此外，在POPE数据集上的最大覆盖标准下，我们的方法在LLaVA-NeXT-13B上实现了1.87倍的速度提升，同时保持了98.7%的原始性能。最后，仅使用四个视觉标记，LLaVA-1.5-7B的原始性能仍保留了87.7%。这些结果突显了覆盖率在标记选择中的有效性。代码可在https://github.com/Ironieser/mmtok 获取。

Summary / 总结

The research aims to improve the inference efficiency of Vision-Language Models (VLMs) by reducing redundant vision tokens while preserving performance. The method, MMTok, leverages both vision and text tokens to select informative vision tokens based on a coverage criterion. Experiments on benchmark datasets show that combining multimodal information outperforms unimodal baselines, achieving a 1.87x speedup on LLaVA-NeXT-13B while maintaining 98.7% of the original performance. With only four vision tokens, 87.7% of the original performance is preserved on LLaVA-1.5-7B.

研究旨在通过减少冗余的视觉标记来提高视觉语言模型（VLMs）的推理效率。MMTok 利用视觉和文本标记根据覆盖准则选择信息性的视觉标记，将其形式化为最大覆盖问题。实验表明，结合多模态信息优于单模态基线，实现了1.87倍的速度提升，同时保持了LLaVA-NeXT-13B的98.7%性能，并且仅使用四个视觉标记仍能保持LLaVA-1.5-7B的87.7%性能。
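
The coverage idea in the MMTok abstract can be sketched as a greedy selection. The code below is an illustration under assumptions, not the released implementation: coverage is modelled as the summed best cosine similarity between selected vision tokens and every target (text tokens plus all vision tokens), and the function name `select_vision_tokens` and the mixing weight `alpha` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def select_vision_tokens(vision: List[Sequence[float]],
                         text: List[Sequence[float]],
                         k: int,
                         alpha: float = 0.5) -> List[int]:
    """Greedily pick k vision tokens that best cover text and vision targets.

    alpha weights text targets, (1 - alpha) weights vision targets
    (an assumed weighting, chosen only for this sketch).
    """
    targets = [(t, alpha) for t in text] + [(v, 1.0 - alpha) for v in vision]
    # sim[i][j]: weighted, non-negative similarity of vision token i to target j.
    sim = [[w * max(cosine(v, t), 0.0) for t, w in targets] for v in vision]
    covered = [0.0] * len(targets)  # best coverage achieved so far per target
    selected: List[int] = []
    for _ in range(min(k, len(vision))):
        best_i, best_gain = -1, -1.0
        for i in range(len(vision)):
            if i in selected:
                continue
            # Marginal gain: how much token i would raise total coverage.
            gain = sum(max(s - c, 0.0) for s, c in zip(sim[i], covered))
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        covered = [max(c, s) for c, s in zip(covered, sim[best_i])]
    return selected


# Tokens 0 and 1 are near-duplicates; token 2 matches the text query.
vision_tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
text_tokens = [[0.0, 1.0]]
selected = select_vision_tokens(vision_tokens, text_tokens, k=2)
```

In this toy input the second pick skips the near-duplicate token because it adds almost no new coverage, which is the behaviour the coverage criterion is meant to encourage.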

MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos

Authors: Yangyi Cao, Yuanhang Li, Lan Chen, Qi Mao

First: 2026-02-02T14:07:00+00:00 · Latest: 2026-03-03T16:46:06+00:00

Abstract

We propose MLV-Edit, a training-free, flow-based framework that addresses the unique challenges of minute-level video editing. While existing techniques excel in short-form video manipulation, scaling them to long-duration videos remains challenging due to prohibitive computational overhead and the difficulty of maintaining global temporal consistency across thousands of frames. To address this, MLV-Edit employs a divide-and-conquer strategy for segment-wise editing, facilitated by two core modules: Velocity Blend rectifies motion inconsistencies at segment boundaries by aligning the flow fields of adjacent chunks, eliminating flickering and boundary artifacts commonly observed in fragmented video processing; and Attention Sink anchors local segment features to global reference frames, effectively suppressing cumulative structural drift. Extensive quantitative and qualitative experiments demonstrate that MLV-Edit consistently outperforms state-of-the-art methods in terms of temporal stability and semantic fidelity.

中文标题/摘要

标题：MLV-Edit：针对分钟级视频编辑的一致且高效编辑方法

我们提出了一种无需训练、基于流的框架MLV-Edit，以应对分钟级视频编辑的独特挑战。尽管现有技术在短格式视频操作方面表现出色，但将它们扩展到长时间视频仍然具有挑战性，因为计算开销巨大且难以在数千帧中保持全局时间一致性。为了解决这个问题，MLV-Edit 采用了一种分而治之的策略进行段落级编辑，通过两个核心模块实现：Velocity Blend 通过对齐相邻块的流场来纠正段落边界处的运动不一致性，消除片段视频处理中常见的闪烁和边界伪影；Attention Sink 将局部段落特征锚定到全局参考帧，有效抑制累积结构漂移。大量定量和定性实验表明，MLV-Edit 在时间稳定性和语义保真度方面始终优于现有最先进的方法。

Summary / 总结

MLV-Edit is a training-free, flow-based framework designed for editing minute-level videos, addressing the challenges of maintaining global temporal consistency and reducing computational overhead. It uses a divide-and-conquer strategy with two core modules: Velocity Blend aligns flow fields to eliminate flickering and artifacts, and Attention Sink anchors local features to global frames to suppress structural drift. Experiments show MLV-Edit outperforms existing methods in terms of temporal stability and semantic fidelity.

MLV-Edit 是一个无需训练、基于流的框架，旨在解决分钟级视频编辑的挑战。它采用分而治之的策略，并包含两个核心模块：Velocity Blend 用于纠正段落边界处的运动不一致性，而 Attention Sink 则将局部特征锚定到全局参考帧以防止结构漂移。实验表明，MLV-Edit 在时间稳定性和语义保真度方面优于现有方法。
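
To make the boundary-blending idea concrete, here is a minimal sketch. The abstract does not give the Velocity Blend formula, so this assumes a plain linear cross-fade of per-frame velocity vectors over the frames shared by two adjacent segments; the function name `blend_overlap` is hypothetical.

```python
from typing import List

Vector = List[float]


def blend_overlap(prev_tail: List[Vector], next_head: List[Vector]) -> List[Vector]:
    """Cross-fade velocities predicted by two adjacent segments.

    prev_tail: velocities from the earlier segment for the overlapping frames.
    next_head: velocities from the later segment for the same frames.
    The weight ramps linearly so early overlap frames follow the earlier
    segment and late overlap frames follow the later one.
    """
    if len(prev_tail) != len(next_head):
        raise ValueError("overlap regions must have the same number of frames")
    n = len(prev_tail)
    blended = []
    for i, (v_prev, v_next) in enumerate(zip(prev_tail, next_head)):
        w = (i + 1) / (n + 1)  # strictly between 0 and 1
        blended.append([(1.0 - w) * a + w * b for a, b in zip(v_prev, v_next)])
    return blended


# Three overlapping frames, one-dimensional velocities for readability.
blended = blend_overlap([[0.0], [0.0], [0.0]], [[4.0], [4.0], [4.0]])
```

The ramp avoids an abrupt switch between the two predictions at the segment boundary, which is the flicker source the abstract refers to.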

Know When to Abstain: Optimal Selective Classification with Likelihood Ratios

Authors: Alvin Heng, Harold Soh

First: 2025-05-21T01:26:21+00:00 · Latest: 2026-03-03T13:50:16+00:00

Abstract

Selective classification enhances the reliability of predictive models by allowing them to abstain from making uncertain predictions. In this work, we revisit the design of optimal selection functions through the lens of the Neyman-Pearson lemma, a classical result in statistics that characterizes the optimal rejection rule as a likelihood ratio test. We show that this perspective not only unifies the behavior of several post-hoc selection baselines, but also motivates new approaches to selective classification which we propose here. A central focus of our work is the setting of covariate shift, where the input distribution at test time differs from that at training. This realistic and challenging scenario remains relatively underexplored in the context of selective classification. We evaluate our proposed methods across a range of vision and language tasks, including both supervised learning and vision-language models. Our experiments demonstrate that our Neyman-Pearson-informed methods consistently outperform existing baselines, indicating that likelihood ratio-based selection offers a robust mechanism for improving selective classification under covariate shifts. Our code is publicly available at https://github.com/clear-nus/sc-likelihood-ratios.

中文标题/摘要

标题：知道何时弃权：基于似然比的最优选择性分类

选择性分类通过允许模型在不确定时避免做出预测，从而提高预测模型的可靠性。在本文中，我们通过 Neyman-Pearson 引理的视角重新审视了最优选择函数的设计，该引理是统计学中的一个经典结果，描述了最优拒绝规则为似然比检验。我们表明，这种视角不仅统一了多种后验选择基线的行为，还激发了新的选择性分类方法，我们在此提出这些方法。我们工作的核心关注点是协变量偏移的设置，在此设置下，测试时的输入分布与训练时不同。这是一个现实且具有挑战性的场景，在选择性分类的背景下尚未得到充分探索。我们在包括监督学习和视觉语言模型在内的多种视觉和语言任务中评估了我们提出的方法。我们的实验表明，我们的 Neyman-Pearson 指导方法在协变量偏移下始终优于现有基线，表明基于似然比的选择提供了改进选择性分类的稳健机制。我们的代码可在 https://github.com/clear-nus/sc-likelihood-ratios 公开获取。

Summary / 总结

This paper revisits the design of optimal selection functions for selective classification using the Neyman-Pearson lemma, which characterizes the optimal rejection rule as a likelihood ratio test. The authors propose new approaches to selective classification, particularly in the challenging scenario of covariate shift, where the test-time input distribution differs from the training distribution. Experiments across various vision and language tasks show that their likelihood ratio-based methods outperform existing baselines, indicating robust performance under covariate shifts.

本文旨在通过允许模型在不确定时放弃预测来提高预测模型的可靠性。作者重新审视了使用 Neyman-Pearson 引理设计最优选择函数的方法，该引理将最优拒绝规则描述为似然比检验。他们展示了这种方法如何统一了几种后处理选择基线的行为，并提出了新的选择分类方法。实验表明，他们的 Neyman-Pearson 方法在各种视觉和语言任务中优于现有基线，特别是在协变量偏移的情况下，表明基于似然比的选择机制在提高选择分类的鲁棒性方面是有效的。
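
A likelihood ratio selector in the Neyman-Pearson spirit can be sketched in a few lines. This is an illustrative toy, not the paper's method: it assumes a scalar confidence score per prediction, fits one Gaussian to scores of correct predictions and one to scores of incorrect predictions on held-out data, and accepts a prediction when the log-likelihood ratio reaches a threshold `tau`. All function names are hypothetical.

```python
import math
from typing import Callable, List, Tuple


def fit_gaussian(xs: List[float]) -> Tuple[float, float]:
    """Return (mean, std) with a small floor on std to avoid division by zero."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    std = math.sqrt(var) if var > 0 else 1e-6
    return mean, std


def gaussian_logpdf(x: float, mean: float, std: float) -> float:
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def make_selector(correct_scores: List[float],
                  wrong_scores: List[float],
                  tau: float = 0.0) -> Callable[[float], bool]:
    """Build accept(score): True means predict, False means abstain."""
    mean_c, std_c = fit_gaussian(correct_scores)
    mean_w, std_w = fit_gaussian(wrong_scores)

    def accept(score: float) -> bool:
        # Log-likelihood ratio of "correct" versus "incorrect" for this score.
        llr = gaussian_logpdf(score, mean_c, std_c) - gaussian_logpdf(score, mean_w, std_w)
        return llr >= tau

    return accept


# Held-out confidence scores, split by whether the prediction was right.
accept = make_selector([0.9, 0.8, 0.95, 0.85], [0.4, 0.5, 0.3, 0.45])
```

Raising `tau` makes the selector abstain more often; the threshold trades coverage against the error rate on accepted inputs.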

TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation

Authors: Jiaxing Liu, Zexi Zhang, Xiaoyan Li, Boyue Wang, Yongli Hu, Baocai Yin

First: 2026-03-03T13:28:07+00:00 · Latest: 2026-03-03T13:28:07+00:00

Abstract

Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM's self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code will be released upon publication. Project page: https://apex-bjut.github.io/Taga-VLM

中文标题/摘要

标题：TagaVLM：拓扑感知全局动作推理在视觉语言导航中的应用

视觉语言导航（VLN）对大型视觉语言模型（VLMs）构成了独特的挑战，因为它们的架构不匹配：VLMs主要在静态、无体感的视觉语言任务上进行预训练，这与导航的动态、体感和空间结构化本质相冲突。现有的基于大型模型的方法通常将丰富的视觉和空间信息转换为文本，迫使模型隐式推断复杂的视觉拓扑关系或限制其全局动作能力。为了解决这一差距，我们提出了TagaVLM（拓扑感知全局动作推理），这是一种端到端框架，明确地将拓扑结构注入到VLM主干中。为了引入拓扑边信息，Spatial Topology Aware Residual Attention (STAR-Att) 直接将其整合到VLM的自注意力机制中，从而实现内在的空间推理，同时保留预训练知识。为了增强拓扑节点信息，Interleaved Navigation Prompt 加强了节点级的视觉-文本对齐。最后，通过嵌入的拓扑图，模型能够进行全局动作推理，从而实现稳健的路径校正。在R2R基准测试中，TagaVLM在基于大模型的方法中取得了最先进的性能，在未见过的环境中成功率（SR）为51.09%，SPL为47.18，SR和SPL分别比先前工作高出3.39%和9.08。这表明，对于体感空间推理，对较小的开源VLM进行有针对性的增强可能比简单的模型扩展更有效。代码将在发表后发布。项目页面：https://apex-bjut.github.io/Taga-VLM

Summary / 总结

TagaVLM is an end-to-end framework that integrates topological structures into the VLM backbone to address the mismatch between VLMs and the dynamic, spatially-structured nature of navigation. It uses STAR-Att to integrate topological edge information into the self-attention mechanism and an Interleaved Navigation Prompt to enhance node-level visual-text alignment. On the R2R benchmark, TagaVLM achieves a Success Rate of 51.09% and SPL of 47.18, outperforming previous methods by 3.39% in SR and 9.08 in SPL in unseen environments.

TagaVLM 是一个端到端框架，将拓扑结构嵌入到 VLM 主干中，以解决 VLMs 和导航中动态的空间结构化性质之间的不匹配问题。它使用 STAR-Att 将拓扑边信息直接集成到自注意力机制中，并使用交错导航提示增强节点级的视觉-文本对齐。在 R2R 基准测试中，TagaVLM 在未见过的环境中实现了 51.09% 的成功率和 47.18 的 SPL，SR 和 SPL 分别比之前的方法高出 3.39% 和 9.08。
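
One common way to inject graph structure into self-attention is an additive bias on the attention logits. The sketch below illustrates only that general idea; it assumes a binary adjacency matrix over node tokens and a fixed scalar `edge_bias`, whereas the actual STAR-Att design (for example how edges are encoded or whether the bias is learned) is defined in the paper, not here. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def topology_biased_attention(scores: Matrix,
                              adjacency: Matrix,
                              edge_bias: float = 1.0) -> Matrix:
    """Add a residual bias to logits of connected node pairs, then normalize.

    scores[i][j]: raw attention logit from node token i to node token j.
    adjacency[i][j]: 1.0 if nodes i and j share an edge, else 0.0.
    """
    weights = []
    for i, row in enumerate(scores):
        biased = [s + edge_bias * adjacency[i][j] for j, s in enumerate(row)]
        weights.append(softmax(biased))
    return weights


# A three-node path graph 0 - 1 - 2 with uniform raw logits.
raw_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
path_graph = [[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]]
attn = topology_biased_attention(raw_scores, path_graph)
```

With `edge_bias` set to 0 the function reduces to ordinary softmax attention, which mirrors the "residual" framing: the pretrained attention pattern is the starting point and topology only shifts it.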

Semi-Supervised Few-Shot Adaptation of Vision-Language Models

Authors: Julio Silva-Rodríguez, Ender Konukoglu

First: 2026-03-03T13:11:47+00:00 · Latest: 2026-03-03T13:11:47+00:00

Comments: Code: https://github.com/jusiro/SS-Text-U

Abstract

Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few-shot adaptation, where only a handful of annotated examples are available to adapt the model through multi-modal linear probes. In medical imaging, specialized VLMs have shown promising performance in zero- and few-shot image classification, which is valuable for mitigating the high cost of expert annotations. However, challenges remain in extremely low-shot regimes: the inherent class imbalances in medical tasks often lead to underrepresented categories, penalizing overall model performance. To address this limitation, we propose leveraging unlabeled data by introducing an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation. The proposed method enables lower-budget annotation pipelines for adapting VLMs, reducing labeling effort by >50% in low-shot regimes.

中文标题/摘要

标题：半监督少量样本适应的视觉-语言模型

在大规模异构数据源上预训练的视觉-语言模型（VLMs）变得越来越流行，提供了丰富的多模态嵌入，使得模型能够高效地转移到新任务。一个特别相关的应用是少量样本适应，在这种情况下，只有少量标注的示例可用于通过多模态线性探针适应模型。在医学成像中，专门的VLMs在零样本和少量样本图像分类中表现出有希望的性能，这对于减轻专家注释的高成本具有重要意义。然而，在极少量样本的情况下仍然存在挑战：医学任务中的固有类别不平衡往往导致代表性不足的类别，从而惩罚整体模型性能。为了解决这一限制，我们提出了一种利用未标注数据的方法，通过在少量样本适应过程中引入高效的半监督求解器传播带有文本信息的伪标签。所提出的方法使适应VLMs的注释管道更加经济，低样本情况下可减少超过50%的标注努力。

Summary / 总结

This paper addresses the challenge of few-shot adaptation in medical imaging using vision-language models (VLMs). It proposes a semi-supervised solver that uses unlabeled data to propagate text-informed pseudo-labels, enabling efficient adaptation with fewer annotated examples. The method reduces labeling effort by over 50% in low-shot regimes, making it suitable for scenarios where expert annotations are costly.

研究旨在提高视觉-语言模型(VLMs)在少量样本适应场景中的性能，特别是在医学成像领域，专家标注成本高昂。方法引入了一种半监督求解器，利用未标注数据传播文本指导的伪标签，增强模型适应性。关键发现表明，这种方法在少量样本条件下将标注工作量减少了超过50%，使VLMs在医学应用中更具实用性。
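
The notion of text-informed pseudo-labels can be illustrated with a nearest-text-prototype rule. This is a deliberately simplified sketch, not the solver proposed in the paper: it assumes image and class-text embeddings live in a shared space, assigns each unlabeled image to its most similar class text embedding, and leaves it unlabeled when the similarity falls below an assumed `threshold`. The function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def pseudo_label(unlabeled: List[Sequence[float]],
                 text_prototypes: List[Sequence[float]],
                 threshold: float = 0.5) -> List[Optional[int]]:
    """Return a class index per unlabeled embedding, or None if not confident."""
    labels: List[Optional[int]] = []
    for x in unlabeled:
        sims = [cosine(x, t) for t in text_prototypes]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(best if sims[best] >= threshold else None)
    return labels


# Two classes with toy text embeddings and three unlabeled image embeddings.
class_text = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.9, 0.1], [0.2, 0.8], [-1.0, -1.0]]
labels = pseudo_label(images, class_text)
```

The confident pseudo-labels could then be added to the few labeled shots when fitting a linear probe, which is where rare classes stand to gain the most.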

Training-Free Multi-Concept Image Editing

Authors: Niki Foteinopoulou, Ignas Budvytis, Stephan Liwicki

First: 2026-02-24T12:27:51+00:00 · Latest: 2026-03-03T13:06:40+00:00

Comments: 17 pages, 13 figures

Abstract

Editing images with diffusion models under strict training-free constraints remains a significant challenge. While recent optimisation-based methods achieve strong zero-shot edits from text, they struggle to preserve identity and capture intricate details, such as facial structure, material texture, or object-specific geometry, that exist below the level of linguistic abstraction. To address this fundamental gap, we propose Concept Distillation Sampling (CDS). To the best of our knowledge, we are the first to introduce a unified, training-free framework for target-less, multi-concept image editing. CDS overcomes the linguistic bottleneck of previous methods by integrating a highly stable distillation backbone (featuring ordered timesteps, regularisation, and negative-prompt guidance), with a dynamic weighting mechanism. This approach enables the seamless composition and control of multiple visual concepts directly within the diffusion process, utilising spatially-aware priors from pretrained LoRA adapters without spatial interference. Our method preserves instance fidelity without requiring reference samples of the desired edit. Extensive quantitative and qualitative evaluations demonstrate consistent state-of-the-art performance over existing training-free editing and multi-LoRA composition methods on the InstructPix2Pix and ComposLoRA benchmarks. Code will be made publicly available.

中文标题/摘要

标题：无训练多概念图像编辑

在严格无训练约束下使用扩散模型编辑图像仍然是一个重大挑战。尽管最近基于优化的方法能够实现强大的零样本编辑，但它们在保留身份和捕捉细微特征（如面部结构、材料纹理或特定对象几何形状）方面存在困难，这些特征存在于语言抽象之下。为解决这一根本性差距，我们提出了概念蒸馏采样（CDS）。据我们所知，我们是第一个提出一种统一的、无训练的框架来进行无目标的多概念图像编辑。CDS通过结合一个高度稳定的蒸馏骨干（包括有序的时间步、正则化和负提示引导），以及动态加权机制，克服了先前方法的语言瓶颈。这种方法能够在扩散过程中无缝组合和控制多个视觉概念，利用预训练LoRA适配器的空间感知先验，而不会受到空间干扰。我们的方法在不需要参考样本的情况下保持实例保真度。广泛的定量和定性评估表明，我们的方法在InstructPix2Pix和ComposLoRA基准上的现有无训练编辑和多LoRA组合方法中表现出一致的最优性能。代码将公开发布。

Summary / 总结

The paper addresses the challenge of training-free multi-concept image editing using diffusion models. It introduces Concept Distillation Sampling (CDS), a unified framework that integrates a stable distillation backbone with dynamic weighting to enable the seamless composition of multiple visual concepts directly within the diffusion process. The method preserves instance fidelity without needing reference samples and outperforms existing training-free editing and multi-LoRA composition methods on InstructPix2Pix and ComposLoRA benchmarks.

论文解决了使用扩散模型进行无训练多概念图像编辑的挑战。它提出了概念蒸馏采样（CDS），这是一种统一框架，将稳定的蒸馏骨干与动态加权相结合，使多个视觉概念可以在扩散过程中无缝组合和控制。该方法无需参考样本即可保持实例保真度，并在InstructPix2Pix和ComposLoRA基准上优于现有的无训练编辑和多LoRA组合方法。

TTT3R: 3D Reconstruction as Test-Time Training

Authors: Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen

Venue: ICLR

First: 2025-09-30T17:59:51+00:00 · Latest: 2026-03-03T12:58:09+00:00

Comments: Page: https://rover-xingyu.github.io/TTT3R/ Code: https://github.com/Inception3D/TTT3R

Abstract

Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a 2× improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code is available at https://rover-xingyu.github.io/TTT3R

中文标题/摘要

标题：TTT3R：测试时训练的3D重建

现代循环神经网络因其线性时间复杂性已成为3D重建的竞争性架构。然而，当应用于超出训练上下文长度的场景时，其性能显著下降，显示出有限的长度泛化能力。在本文中，我们从测试时训练的角度重新审视3D重建的基础模型，将其设计框架化为在线学习问题。基于这一视角，我们利用记忆状态与新观测之间的对齐置信度来推导出记忆更新的闭式学习率，以平衡保留历史信息和适应新观测之间的关系。这种无需训练的干预措施，称为TTT3R，显著提高了长度泛化能力，在全局姿态估计方面比基线提高了2倍，同时以每秒20帧的速度运行，仅使用6 GB的GPU内存处理数千张图像。代码可在https://rover-xingyu.github.io/TTT3R/ 获取
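
The trade-off between retaining history and adapting to new observations can be pictured as a gated memory update. The sketch below is an assumption-laden illustration, not the closed-form rule derived in the paper: it takes the dot product between the memory state and the incoming observation as the alignment score and passes it through a sigmoid to obtain a per-step learning rate. The function name `gated_update` is hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    # Adequate for the moderate values used here; very negative z would overflow.
    return 1.0 / (1.0 + math.exp(-z))


def gated_update(state: List[float], observation: List[float]) -> Tuple[List[float], float]:
    """Move the memory state toward the observation by a confidence-based step.

    Returns the new state and the learning rate that was used.
    """
    alignment = sum(s * o for s, o in zip(state, observation))
    lr = sigmoid(alignment)  # in (0, 1): higher when state and observation agree
    new_state = [s + lr * (o - s) for s, o in zip(state, observation)]
    return new_state, lr


memory = [1.0, 0.0]
agree_state, agree_lr = gated_update(memory, [1.0, 0.0])       # consistent observation
conflict_state, conflict_lr = gated_update(memory, [-1.0, 0.0])  # conflicting observation
```

In this toy version a conflicting observation gets a smaller step, so a single outlier frame overwrites less of the accumulated memory; no gradient-based training is involved, consistent with the training-free framing in the abstract.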

Are VLMs Ready for Lane Topology Awareness in Autonomous Driving?

Authors: Xin Chen, Jia He, Maozheng Li, Dongliang Xu, Tianyu Wang, Yixiao Chen, Zhixin Lin, Yue Yao

First: 2025-09-20T12:02:39+00:00 · Latest: 2026-03-03T12:52:59+00:00

Comments: 5 pages, 5 figures

Abstract

Vision-Language Models (VLMs) have recently shown remarkable progress in multimodal reasoning, yet their applications in autonomous driving remain limited. In particular, the ability to understand road topology, a key requirement for safe navigation, has received relatively little attention. While some recent works have begun to explore VLMs in driving contexts, their performance on topology reasoning is far from satisfactory. In this work, we systematically evaluate VLMs' capabilities in road topology understanding. Specifically, multi-view images are projected into a unified ground-plane coordinate system and fused into bird's-eye-view (BEV) lanes. Based on these BEV lanes, we formulate four topology-related diagnostic VQA tasks, which together capture essential components of spatial topology reasoning. Through extensive evaluation, we find that while frontier closed-source models (e.g., GPT-4o) achieve relatively high accuracy in some tasks, they still fail on some spatial questions that humans can answer (e.g., GPT-4o achieves only 67.8% on the vector task, a two-class classification problem). Furthermore, we find that open-source VLMs, even at 30B scale, struggle significantly. These results indicate that spatial reasoning remains a fundamental bottleneck for current VLMs. We also find that a model's capability is positively correlated with model size, the length of reasoning tokens, and the number of shots provided as examples, suggesting directions for future research.

中文标题/摘要

标题：VLM 是否已具备自动驾驶所需的车道拓扑感知能力？

视觉-语言模型（VLMs）在多模态推理方面取得了显著进展，但在自动驾驶领域的应用仍然有限。特别是理解道路拓扑的能力，这是安全导航的关键要求，却受到了相对较少的关注。尽管一些最近的研究开始探索VLMs在驾驶环境中的应用，但它们在拓扑推理上的表现远未令人满意。在本研究中，我们系统地评估了VLMs在道路拓扑理解方面的能力。具体来说，多视角图像被投影到统一的地平面坐标系中并融合成鸟瞰图（BEV）车道。基于这些BEV车道，我们制定了四个与拓扑相关的诊断VQA任务，这些任务共同捕捉了空间拓扑推理的关键组成部分。通过广泛的评估，我们发现，尽管前沿的闭源模型（如GPT-4o）在某些任务中实现了较高的准确性，但在一些人类能够回答的空间问题上仍然失败（例如，GPT-4o在向量任务这一二分类问题上的准确率仅为67.8%）。此外，我们发现即使是30B规模的开源VLMs也表现得相当吃力。这些结果表明，空间推理仍然是当前VLMs的基本瓶颈。我们还发现，模型的能力与其规模、推理令牌的长度以及提供的示例数量呈正相关，这为未来的研究指明了方向。

Summary / 总结

This study evaluates the capability of Vision-Language Models (VLMs) in understanding road topology, a critical aspect of autonomous driving. By projecting multi-view images into a unified coordinate system and formulating four topology-related diagnostic VQA tasks, the research reveals that while some closed-source models perform reasonably well, they still struggle with spatial reasoning tasks that humans can easily solve. Open-source VLMs, even at 30B scale, show significant limitations. The study indicates that spatial reasoning is a fundamental bottleneck for current VLMs and suggests that model size, reasoning token length, and example shots positively correlate with performance.

该研究评估了视觉-语言模型（VLMs）在理解道路拓扑方面的能力，这是自动驾驶的关键方面。通过将多视角图像投影到统一的坐标系统中，并制定四个拓扑相关的诊断VQA任务，研究发现即使是GPT-4o等前沿闭源模型，在人类可以轻松解决的空间推理问题上仍会出错，例如在二分类的向量任务中仅达到67.8%的准确率。开源VLMs，即使达到30B规模，也表现不佳。结果表明，空间推理是当前VLMs的一个基本瓶颈，模型性能与规模、推理令牌长度以及示例数量呈正相关。

HDINO: A Concise and Efficient Open-Vocabulary Detector

Authors: Hao Zhang, Yiqun Wang, Qinran Lin, Runze Fan, Yong Li

First: 2026-03-03T12:29:19+00:00 · Latest: 2026-03-03T12:29:19+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or use of grounding data, surpassing Grounding DINO-T and T-Rex2, which are trained on 5.4M and 6.5M images, by 0.8 mAP and 2.8 mAP, respectively. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, highlighting the effectiveness and scalability of our approach. Code and models are available at https://github.com/HaoZ416/HDINO.

中文标题/摘要

标题：HDINO：简洁高效的开放词汇检测器

尽管近年来对开放词汇目标检测的兴趣日益增加，但大多数现有方法仍然严重依赖人工整理的细粒度训练数据集以及资源密集型的逐层跨模态特征提取。在本文中，我们提出了一种简洁高效的开放词汇目标检测器HDINO，该检测器消除了对这些组件的依赖。具体而言，我们在基于Transformer的DINO模型之上提出了一种两阶段训练策略。在第一阶段，将噪声样本视为额外的正样本目标实例，以构建视觉和文本模态之间的一对多语义对齐机制（O2M），从而促进语义对齐。我们还基于初始检测难度设计了难度加权分类损失（DWCL），以挖掘困难样本并进一步提高模型性能。在第二阶段，对已对齐的表示应用一个轻量级特征融合模块，以增强对语言语义的敏感性。在Swin Transformer-T设置下，HDINO-T使用来自两个公开可用检测数据集的2.2M训练图像，在COCO上达到了49.2 mAP，无需任何人工数据整理，也不使用grounding数据，分别比Grounding DINO-T和T-Rex2高出0.8 mAP和2.8 mAP，而后两者分别在5.4M和6.5M图像上进行了训练。经过COCO微调后，HDINO-T和HDINO-L分别达到了56.4 mAP和59.2 mAP，突显了我们方法的有效性和可扩展性。代码和模型可在https://github.com/HaoZ416/HDINO获取。

Summary / 总结

HDINO is a concise and efficient open-vocabulary object detector that avoids the need for manually curated datasets and resource-intensive feature extraction. It employs a two-stage training strategy with a One-to-Many Semantic Alignment Mechanism and a Difficulty Weighted Classification Loss to improve model performance. HDINO-T achieves 49.2 mAP on COCO using only 2.2M training images, surpassing Grounding DINO-T and T-Rex2 by 0.8 mAP and 2.8 mAP, respectively. Fine-tuning on COCO further improves its performance to 56.4 mAP and 59.2 mAP for HDINO-T and HDINO-L, respectively.

HDINO 是一种开放词汇的目标检测器，通过消除对手动标注数据集和资源密集型特征提取的依赖来提高效率。它采用基于 DINO 模型的两阶段训练策略。第一阶段通过对齐视觉和文本模态来增强语义对齐，并应用难度加权分类损失以提高模型性能。第二阶段使用轻量级特征融合模块来增强对语言语义的敏感性。HDINO-T 在使用 2.2M 训练图像的 COCO 上达到 49.2 mAP，分别超越 Grounding DINO-T 和 T-Rex2 0.8 mAP 和 2.8 mAP。经过微调后，HDINO-T 和 HDINO-L 达到 56.4 mAP 和 59.2 mAP，展示了该方法的有效性和可扩展性。

Spilled Energy in Large Language Models

Authors: Adrian Robert Minut, Hazem Dewidar, Iacopo Masi

First: 2026-02-21T00:38:47+00:00 · Latest: 2026-03-03T12:23:38+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track "energy spills" during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. Crucially, however, we achieve this without requiring trained probe classifiers or activation ablations. Instead, we introduce two completely training-free metrics derived directly from output logits: spilled energy, which captures the discrepancy between energy values across consecutive generation steps that should theoretically match, and marginalized energy, which is measurable at a single step. Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalization. Notably, these results hold for both pretrained and instruction-tuned variants without introducing any training overhead. Code available at: github.com/OmnAI-Lab/spilled-energy

中文标题/摘要

标题：大型语言模型中的溢出能量

我们将大型语言模型（LLM）的最终softmax分类器重新解释为基于能量的模型（EBM），在推理过程中将序列到序列的概率链分解为多个相互作用的EBM。这种有原则的方法使我们能够追踪解码过程中的“能量溢出”，我们通过实验表明这些能量溢出与事实错误、偏见和失败相关。类似于Orgad等人（2025），我们的方法先定位确切的答案标记，然后检测幻觉。关键的是，我们无需训练探针分类器或进行激活消融即可做到这一点。相反，我们引入了两个完全无需训练的度量标准，直接从输出logits中得出：溢出能量，它捕捉了理论上应匹配的能量值在连续生成步骤之间的差异；以及边缘化能量，它可以在单个步骤中进行测量。我们在九个基准测试上对最先进的LLM（包括LLaMA、Mistral和Gemma）进行了评估，并在合成代数运算（Qwen3）上进行了评估，结果表明我们的方法具有稳健且有竞争力的幻觉检测和跨任务泛化能力。值得注意的是，这些结果对于预训练和指令微调变体都适用，且不引入任何训练开销。代码可在github.com/OmnAI-Lab/spilled-energy获取

Summary / 总结

The research aims to identify and quantify energy discrepancies in the decoding process of Large Language Models (LLMs) by reinterpreting the softmax classifier as an Energy-Based Model (EBM). The method decomposes the sequence-to-sequence probability into multiple interacting EBMs, enabling the tracking of 'energy spills' that correlate with factual errors and biases. The study introduces two training-free metrics, 'spilled energy' and 'marginalized energy,' which effectively detect hallucinations across various benchmarks and synthetic tasks without additional training overhead. This approach demonstrates robust performance and generalization capabilities across different LLMs and tasks.

研究将大型语言模型（LLM）的softmax分类器重新解释为基于能量的模型（EBM），以追踪解码过程中的“能量溢出”，这些溢出与事实错误、偏见和失败相关。该方法引入了两个无需训练的度量标准，即溢出能量和边缘化能量，用于检测幻觉，无需使用探针分类器或激活消融。在九个基准测试和合成代数运算上的评估表明，该方法在预训练和指令微调的LLM中表现出稳健的幻觉检测和跨任务泛化能力，且无需额外的训练开销。
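
To make the two metrics concrete, here is a minimal sketch of one plausible reading of the abstract. It assumes that the energy of a chosen token is its negative logit, and that the energy of the same prefix, re-measured one step later, is the negative log-sum-exp of the next step's logits. The function names and these exact definitions are assumptions; the paper's formulation may differ.

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def marginalized_energy(logits_t):
    """Single-step quantity: negative log-partition of one logit vector."""
    return -logsumexp(logits_t)

def spilled_energy(logits_t, token_t, logits_next):
    """Gap between two energies that are assumed to match in theory."""
    energy_of_token = -logits_t[token_t]                   # measured at step t
    energy_next_step = marginalized_energy(logits_next)    # measured at step t+1
    return abs(energy_of_token - energy_next_step)

logits_t = np.array([2.0, 0.0])
# Next-step logits whose log-sum-exp equals 2.0: the two energies agree.
consistent = spilled_energy(logits_t, 0, np.full(2, 2.0 - np.log(2.0)))
# Next-step logits whose log-sum-exp is log(2): energy has "spilled".
inconsistent = spilled_energy(logits_t, 0, np.zeros(2))
```

Under this reading both quantities come straight from output logits, so no probe training or activation access is needed, which matches the training-free claim in the abstract.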

Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models

Authors: Jialiang Zhang, Junlong Tong, Junyan Lin, Hao Wu, Yirong Sun, Yunpu Ma, Xiaoyu Shen

First: 2026-03-03T11:24:55+00:00 · Latest: 2026-03-03T11:24:55+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Vision Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where information arrives sequentially. Motivated by the streaming nature of video data, we investigate two streaming reasoning paradigms for LVLMs. The first, an interleaved paradigm, alternates between receiving frames and producing partial reasoning but remains constrained by strictly ordered cache updates. To better match streaming inputs, we propose Think-as-You-See (TaYS), a unified framework enabling true concurrent reasoning. TaYS integrates parallelized CoT generation, stream-constrained training, and stream-parallel inference. It further employs temporally aligned reasoning units, streaming attention masks and positional encodings, and a dual KV-cache that decouples visual encoding from textual reasoning. We evaluate all paradigms on the Qwen2.5-VL family across representative video CoT tasks, including event dynamics analysis, causal reasoning, and thematic understanding. Experiments show that TaYS consistently outperforms both batch and interleaved baselines, improving reasoning performance while substantially reducing time-to-first-token (TTFT) and overall reasoning delay. These results demonstrate the effectiveness of data-aligned streaming reasoning in enabling efficient and responsive video understanding for LVLMs. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS.

中文标题/摘要

标题：Think-as-You-See：大型视觉语言模型的流式思维链推理

大型视觉语言模型（LVLMs）表现出强大的思维链（CoT）推理能力，但大多数现有范式假设在推理前整段视频已全部可用，这是一种批处理式的过程，与实际视频流中信息按序到达的情况不匹配。受视频数据流式特性的启发，我们研究了两种面向LVLM的流式推理范式。第一种是交错范式，在接收帧和生成部分推理之间交替进行，但仍然受限于严格有序的缓存更新。为了更好地匹配流式输入，我们提出了Think-as-You-See（TaYS），这是一种统一框架，能够实现真正的并发推理。TaYS 结合了并行化的 CoT 生成、流式约束训练和流式并行推理。它还采用了时间对齐的推理单元、流式注意力掩码和位置编码，以及一个将视觉编码与文本推理解耦的双 KV 缓存。我们在 Qwen2.5-VL 系列模型上对所有范式进行了评估，涵盖事件动态分析、因果推理和主题理解等代表性视频 CoT 任务。实验表明，TaYS 始终优于批处理和交错两种基线，在提高推理性能的同时显著缩短了首个令牌时间（TTFT）和整体推理延迟。这些结果证明了与数据对齐的流式推理能够有效帮助 LVLM 实现高效且响应迅速的视频理解。我们的代码发布于 https://github.com/EIT-NLP/StreamingLLM/tree/main/TaYS

Summary / 总结

This paper addresses the limitations of existing batch-style inference for large vision-language models (LVLMs) by proposing a streaming reasoning paradigm called Think-as-You-See (TaYS). Motivated by the sequential nature of video data, TaYS integrates parallelized chain-of-thought generation, stream-constrained training, and stream-parallel inference, and employs temporally aligned reasoning units and dual KV-caches. Experiments show that TaYS outperforms both batch and interleaved baselines, improving reasoning performance and reducing time-to-first-token and overall reasoning delay. This demonstrates the effectiveness of data-aligned streaming reasoning for efficient video understanding in LVLMs.

研究旨在解决视频数据的流式处理问题，而不是通常用于大型视觉语言模型（LVLMs）的批量处理方式。研究引入了Think-as-You-See（TaYS）统一框架，通过集成并行化的链式思考生成、流式约束训练和流式并行推理，实现真正的并发推理。TaYS在事件动态分析、因果推理和主题理解等视频链式思考任务中显著优于批量和交错基线，提高了推理性能并减少了首个词出现时间和整体推理延迟。
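
A streaming attention mask of the kind mentioned in the abstract can be illustrated with a toy example. The sketch below assumes each frame token carries an arrival time and each reasoning token an emission time, and lets a reasoning token attend only to frames that have already arrived. The function name and this timestamp formulation are hypothetical and are not taken from the TaYS code.

```python
import numpy as np

def streaming_attention_mask(frame_times, reasoning_times):
    """Boolean mask of shape (num_reasoning_tokens, num_frame_tokens).

    mask[i, j] is True when reasoning token i may attend to frame token j,
    i.e. the frame arrived no later than the reasoning token was emitted.
    """
    frame_times = np.asarray(frame_times)
    reasoning_times = np.asarray(reasoning_times)
    return reasoning_times[:, None] >= frame_times[None, :]

# Three frames arrive at t = 0, 1, 2; reasoning tokens are emitted at t = 0 and t = 2.
mask = streaming_attention_mask([0, 1, 2], [0, 2])
```

The first reasoning token sees only the first frame, while the second sees all three, which is the difference from batch-style inference where every token sees the whole video.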

Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models

Authors: Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, Kentaro Inui

First: 2026-03-03T11:17:31+00:00 · Latest: 2026-03-03T11:17:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representation of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens in the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which require more abstract, compositionally integrated processes.

中文标题/摘要

标题：节点先行，边线滞后：探究大型视觉-语言模型中的图表表示

大型视觉-语言模型（LVLMs）在图表理解基准测试中表现出色，但仍难以理解元素之间的关系，尤其是那些由节点和有向边（例如箭头和线条）表示的关系。为了探究这一局限性的原因，我们使用基于有向图构建的精心构造的合成图表数据集来探测LVLMs的内部表示。我们的探测实验表明，在视觉编码器中，边信息不是线性可分的，仅在语言模型中的文本标记中才线性编码。相比之下，节点信息和全局结构特征已经在视觉编码器的单个隐藏状态中线性编码。这些发现表明，形成线性可分表示的阶段取决于视觉信息的类型。特别是边表示的延迟出现可能有助于解释为什么LVLMs在关系理解方面存在困难，例如解释边的方向，这需要更抽象的、组合性的过程。

Summary / 总结

This study investigates why large vision-language models (LVLMs) struggle with understanding relationships between elements in diagrams, particularly nodes and directed edges. By probing LVLMs with a synthetic diagram dataset, the research finds that edge information is not linearly separable in the vision encoder but becomes separable in text tokens. In contrast, node information and global structural features are already linearly encoded in the vision encoder. This suggests that the formation of linearly separable representations varies depending on the type of visual information, and the delayed emergence of edge representations may explain LVLMs' difficulty in relational understanding.

研究探讨了大型视觉-语言模型（LVLMs）在理解图中元素之间的关系时，尤其是节点和有向边时，为何存在困难。通过使用合成的图数据集对LVLMs进行探针实验，研究发现边信息在视觉编码器中不是线性可分的，但在文本标记中才变得线性可分。相比之下，节点信息和全局结构特征已经在视觉编码器的隐藏状态中线性可分。这表明线性可分表示的形成取决于视觉信息的类型，而边表示的延迟出现可能解释了LVLMs在关系理解上的困难。
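
What "linearly separable" means for a probe can be shown on toy data. The sketch below fits a least-squares linear probe to four 2D points under two labelings: one that a linear read-out can recover and an XOR-like one that it cannot. The features, labels, and function name are illustrative assumptions, not the paper's data or probing setup, and the probe is scored on its own training points for brevity.

```python
import numpy as np

def linear_probe_accuracy(features, labels):
    """Fit a least-squares linear probe and report its training accuracy."""
    X = np.hstack([features, np.ones((len(features), 1))])  # append a bias column
    y = 2.0 * labels - 1.0                                  # map {0, 1} to {-1, +1}
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    pred = (X @ w > 0).astype(int)
    return float(np.mean(pred == labels))

feats = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
separable = np.array([0, 0, 1, 1])  # label depends on the first coordinate only
xor_like = np.array([0, 1, 1, 0])   # no single line separates the classes

acc_sep = linear_probe_accuracy(feats, separable)
acc_xor = linear_probe_accuracy(feats, xor_like)
```

The first labeling is read out perfectly, while no linear probe can classify more than three of the four XOR-like points. The information is present in both cases, but only in the first is it linearly encoded.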

CoFL: Continuous Flow Fields for Language-Conditioned Navigation

Authors: Haokun Liu, Zhaoqi Ma, Yicheng Chen, Masaki Kitagawa, Wentao Zhang, Jinjie Li, Moju Zhao

First: 2026-03-03T11:02:55+00:00 · Latest: 2026-03-03T11:02:55+00:00

Comments: 20 pages, 11 figures

Abs · PDF · Code1 · Code2

Abstract

Language-conditioned navigation pipelines often rely on brittle modular components or costly action-sequence generation. To address these limitations, we present CoFL, an end-to-end policy that directly maps a bird's-eye view (BEV) observation and a language instruction to a continuous flow field for navigation. Instead of predicting discrete action tokens or sampling action chunks via iterative denoising, CoFL outputs instantaneous velocities that can be queried at arbitrary 2D projected locations. Trajectories are obtained by numerical integration of the predicted field, producing smooth motion that remains reactive under closed-loop execution. To enable large-scale training, we build a dataset of over 500k BEV image-instruction pairs, each procedurally annotated with a flow field and a trajectory derived from BEV semantic maps built on Matterport3D and ScanNet. By training on a mixed distribution, CoFL significantly outperforms modular Vision-Language Model (VLM)-based planners and generative policy baselines on strictly unseen scenes. Finally, we deploy CoFL zero-shot in real-world experiments with overhead BEV observations across multiple layouts, maintaining reliable closed-loop control and a high success rate.

中文标题/摘要

标题：CoFL：语言条件导航的连续流场

语言条件导航管道通常依赖于脆弱的模块化组件或昂贵的动作序列生成。为了解决这些限制，我们提出了CoFL，这是一种端到端的策略，可以直接将鸟瞰图（BEV）观察和语言指令映射到一个连续的流场以进行导航。CoFL 不是预测离散的动作令牌或通过迭代去噪采样动作片段，而是输出可以在任意2D投影位置查询的瞬时速度。轨迹通过预测场的数值积分获得，产生平滑的运动，并在闭环执行下保持反应性。为了实现大规模训练，我们构建了一个包含超过50万张BEV图像-指令对的数据集，每个数据对都通过Matterport3D和ScanNet的BEV语义图程序化注释了流场和轨迹。通过在混合分布上训练，CoFL 在严格未见过的场景上显著优于基于模块化视觉-语言模型（VLM）的规划者和生成性策略基线。最后，我们在多个布局的现实世界实验中零样本部署了CoFL，保持了可靠的闭环控制和高成功率。

Summary / 总结

CoFL is an end-to-end policy that maps bird's-eye view observations and language instructions to a continuous flow field for navigation, addressing limitations of modular components and action-sequence generation. It outputs instantaneous velocities for numerical integration, producing smooth and reactive trajectories. CoFL was trained on a large dataset of 500k BEV image-instruction pairs and outperformed modular VLM-based planners and generative policy baselines on unseen scenes, demonstrating reliable closed-loop control in real-world experiments.

CoFL 是一个端到端的策略，将鸟瞰图观察和语言指令映射到连续的流场以实现导航，解决了模块化组件和动作序列生成的局限性。它输出瞬时速度进行数值积分，生成平滑且反应式的轨迹。CoFL 在包含 50 万 BEV 图像-指令对的大数据集上进行训练，并在未见过的场景中显著优于基于视觉-语言模型的模块化规划器和生成性策略基线。在现实世界的实验中，CoFL 维持了可靠的闭环控制和高成功率。
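
The step from a predicted flow field to a trajectory can be sketched with forward-Euler integration. In this minimal example a hand-written field that points toward a goal stands in for the learned model. The function names, step size, and toy field are assumptions for illustration; the abstract does not state which integrator CoFL uses.

```python
import numpy as np

def integrate_flow(velocity_fn, start, step=0.1, num_steps=100):
    """Roll out a 2D trajectory by forward-Euler integration of a velocity field."""
    pos = np.asarray(start, dtype=float)
    path = [pos.copy()]
    for _ in range(num_steps):
        # Query the field at the current location, then take a small step.
        pos = pos + step * velocity_fn(pos)
        path.append(pos.copy())
    return np.stack(path)

goal = np.array([2.0, 1.0])

def toy_field(p):
    """Stand-in for the learned field: velocity points straight at the goal."""
    return goal - p

trajectory = integrate_flow(toy_field, start=[0.0, 0.0])
```

Because the field is queried again at every step, the rollout can react to wherever the agent actually is, which is one way to picture the closed-loop behaviour described in the abstract.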

Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

Authors: Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, Sung Ju Hwang

Venue: ICLR 2026

First: 2025-06-08T14:54:41+00:00 · Latest: 2026-03-03T10:51:04+00:00

Comments: ICLR 2026. Project page: https://frame-guidance-video.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance for controllable video generation based on frame-level signals, such as keyframes, style reference images, sketches, or depth maps. For practical training-free guidance, we propose a simple latent processing method that dramatically reduces memory usage, and apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, compatible with any video models. Experimental results show that Frame Guidance can produce high-quality controlled videos for a wide range of tasks and input signals.

中文标题/摘要

标题：帧指导：视频扩散模型中帧级控制的无训练指导

扩散模型的进步显著提高了视频质量，引起了对精细粒度可控性的关注。然而，许多现有方法依赖于对大规模视频模型进行特定任务的微调，随着模型规模的不断增大，这种方法变得越来越不实际。在本文中，我们提出了帧指导，这是一种基于帧级信号（如关键帧、风格参考图像、素描或深度图）的无训练控制生成方法。为了实现实用的无训练指导，我们提出了一种简单的潜在处理方法，大幅减少了内存使用，并应用了一种专为全局一致视频生成设计的潜在优化策略。帧指导能够在不进行任何训练的情况下，有效控制各种任务，包括关键帧指导、风格化和循环，兼容任何视频模型。实验结果表明，帧指导能够生成广泛任务和输入信号的高质量控制视频。

Summary / 总结

The research aims to enhance fine-grained control in video generation using diffusion models without requiring extensive fine-tuning. Frame Guidance, a training-free method, utilizes frame-level signals like keyframes or sketches to guide video generation. It employs a simple latent processing method and a novel latent optimization strategy to achieve globally coherent videos. The method successfully controls various tasks such as keyframe guidance, stylization, and looping, producing high-quality results across different input signals.

Frame Guidance 是一种无需训练的方法，通过关键帧或草图等帧级信号来控制视频生成。它使用一种简单的潜在处理方法来减少内存使用，并采用一种新的潜在优化策略以实现视频的全局一致性。实验结果表明，Frame Guidance 可以有效地控制各种任务，如关键帧指导、风格化和循环生成，无需训练即可生成高质量的视频。

Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

Authors: Minji Kim, Taekyung Kim, Bohyung Han

Venue: ICLR 2026

First: 2025-10-15T07:59:06+00:00 · Latest: 2026-03-03T09:13:20+00:00

Comments: ICLR 2026, 32 pages, 39 figures, 8 tables

Abs · PDF · Code1 · Code2 · Project1

Abstract

Video Large Language Models (VideoLLMs) extend the capabilities of vision-language models to spatiotemporal inputs, enabling tasks such as video question answering (VideoQA). Despite recent advances in VideoLLMs, their internal mechanisms on where and how they extract and propagate video and textual information remain less explored. In this study, we investigate the internal information flow of VideoLLMs using mechanistic interpretability techniques. Our analysis reveals consistent patterns across diverse VideoQA tasks: (1) temporal reasoning in VideoLLMs initiates with active cross-frame interactions in early-to-middle layers, (2) followed by progressive video-language integration in middle layers. This is facilitated by alignment between video representations and linguistic embeddings containing temporal concepts. (3) Upon completion of this integration, the model is ready to generate correct answers in middle-to-late layers. (4) Based on our analysis, we show that VideoLLMs retain their VideoQA performance by selecting these effective information pathways while suppressing a substantial amount of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT. These findings provide a blueprint for how VideoLLMs perform temporal reasoning and offer practical insights for improving model interpretability and downstream generalization. Our project page with the source code is available at https://map-the-flow.github.io

中文标题/摘要

标题：映射流程：揭示视频LLMs中隐藏的信息路径

视频大型语言模型（VideoLLMs）扩展了视觉语言模型的能力，使其能够处理时空输入，支持视频问答（VideoQA）等任务。尽管最近在VideoLLMs方面取得了进展，但它们在何处以及如何提取和传播视频和文本信息的内部机制仍较少被探索。在本研究中，我们使用机制可解释性技术研究了VideoLLMs的内部信息流。我们的分析揭示了跨多种VideoQA任务的一致模式：(1) 视频LLMs中的时间推理始于早期到中期层的跨帧交互，(2) 接着是中间层逐步的视频-语言整合。这得益于视频表示与包含时间概念的语言嵌入之间的对齐。 (3) 完成这种整合后，模型在中期到晚期层准备好生成正确答案。 (4) 根据我们的分析，我们展示了VideoLLMs通过选择这些有效信息路径并抑制大量注意力边（例如，在LLaVA-NeXT-7B-Video-FT中为58%）来保持其VideoQA性能。这些发现为理解VideoLLMs的时间推理提供了蓝图，并为提高模型可解释性和下游泛化提供了实用见解。我们的项目页面和源代码可在https://map-the-flow.github.io 获取

Summary / 总结

This study investigates the internal information flow of VideoLLMs using mechanistic interpretability techniques. It reveals that temporal reasoning starts with active cross-frame interactions in early-to-middle layers, followed by progressive video-language integration in middle layers, which is facilitated by alignment between video representations and linguistic embeddings. The model generates correct answers in middle-to-late layers after this integration. VideoLLMs retain their VideoQA performance by selecting effective information pathways while suppressing a substantial amount of attention edges, such as 58% in LLaVA-NeXT-7B-Video-FT.

本研究使用机制可解释性技术探讨了VideoLLMs的内部信息流。研究发现，时间推理始于早期到中期层的跨帧交互，随后在中期层进行逐步的视频-语言整合，这得益于视频表示与语言嵌入之间的对齐。完成这一整合后，模型在中期到晚期层即可生成正确答案。VideoLLMs通过选择有效信息路径并抑制大量注意力边，如在LLaVA-NeXT-7B-Video-FT中抑制了58%的边，保持了其性能。这些发现提供了VideoLLMs如何进行时间推理的蓝图，并为提高模型可解释性和下游泛化提供了实用见解。

Seeing Clearly without Training: Mitigating Hallucinations in Multimodal LLMs for Remote Sensing

Authors: Yi Liu, Jing Zhang, Di Wang, Xiaoyu Tian, Haonan Guo, Bo Du

First: 2026-03-03T08:53:20+00:00 · Latest: 2026-03-03T08:53:20+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained small targets. To systematically analyze these issues, we introduce RSHBench, a protocol-based benchmark for fine-grained diagnosis of factual and logical hallucinations. To mitigate grounding-induced factual hallucinations, we further propose Relative Attention-Driven Actively Reasoning (RADAR), a training-free inference method that leverages intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning at test time. Extensive experiments across diverse MLLMs demonstrate that RADAR consistently improves RS-VQA performance and reduces both factual and logical hallucinations. Code and data will be publicly available at: https://github.com/MiliLab/RADAR

中文标题/摘要

标题：无需训练即可清晰看见：缓解多模态大语言模型在遥感视觉问答中的幻觉

多模态大语言模型（MLLMs）在遥感视觉问答（RS-VQA）中遭受显著的幻觉问题，主要由大规模场景中的视觉定位失败或对细粒度小目标的误解释引起。为了系统地分析这些问题，我们引入了RSHBench，这是一种基于协议的基准测试，用于细粒度诊断事实和逻辑幻觉。为了缓解由定位引起的事实幻觉，我们进一步提出了相对注意驱动的主动推理（RADAR），这是一种无需训练的推理方法，利用MLLMs中的内在注意力在测试时引导渐进定位和细粒度局部推理。广泛的实验表明，RADAR在多种MLLMs上一致地提高了RS-VQA性能并减少了事实和逻辑幻觉。代码和数据将在以下地址公开：https://github.com/MiliLab/RADAR

Summary / 总结

This study addresses the issue of hallucinations in multimodal large language models (MLLMs) for remote sensing visual question-answering (RS-VQA), focusing on visual grounding failures and misinterpretation of small targets. To tackle these problems, the authors introduce RSHBench, a benchmark for diagnosing factual and logical hallucinations. They also propose RADAR, a training-free inference method that uses intrinsic attention in MLLMs to guide progressive localization and fine-grained local reasoning, which improves RS-VQA performance and reduces hallucinations across various MLLMs.

论文针对多模态大型语言模型（MLLMs）在遥感视觉问答（RS-VQA）中的幻觉问题，重点关注视觉定位失败和小目标的误解释。引入了RSHBench，用于诊断事实和逻辑幻觉的基准，并提出了RADAR，这是一种无需训练的推理方法，利用内在注意力来引导定位和推理。实验表明，RADAR在不同MLLMs上提高了RS-VQA性能并减少了幻觉。

iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding

Authors: Hanpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He

First: 2026-03-03T08:49:41+00:00 · Latest: 2026-03-03T08:49:41+00:00

Abs · PDF · Code1 · Code2

Abstract

Despite the success of Large Vision-Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.

中文标题/摘要

标题：iGVLM：动态指令引导视觉编码以实现问题感知的多模态理解

尽管大型视觉-语言模型（LVLMs）取得了成功，但大多数现有架构仍存在表示瓶颈：它们依赖于静态的、与指令无关的视觉编码器，其视觉表示在不同文本任务中以不变的方式被使用。这种僵化阻碍了细粒度推理，而在这类推理中特定任务的视觉线索至关重要。为了解决这一问题，我们提出了一种指令引导视觉调制的通用框架iGVLM。iGVLM引入了一个解耦的双分支架构：一个冻结表示分支，保留预训练期间学习到的任务无关视觉表示，以及一个动态条件分支，通过自适应层归一化（AdaLN）执行仿射特征调制。这种设计使从通用感知到指令感知推理的平滑过渡成为可能，同时保持预训练视觉先验的结构完整性和稳定性。除了标准基准之外，我们还引入了MM4，这是一种受控诊断探针，用于在多查询、多指令设置下量化逻辑一致性。大量实验结果表明，iGVLM在各种语言骨干网络上一致地增强了指令敏感性，提供了一种即插即用的范式，用于连接被动感知和主动推理。

Summary / 总结

The research aims to improve the flexibility of visual representations in multimodal models by addressing the limitations of static, instruction-agnostic vision encoders. The proposed iGVLM framework introduces a decoupled dual-branch architecture: a frozen representation branch that preserves pre-trained visual representations, and a dynamic conditioning branch that modulates features via AdaLN. The authors also introduce MM4, a controlled diagnostic probe for logical consistency under multi-query, multi-instruction settings, and report that iGVLM consistently enhances instruction sensitivity across diverse language backbones.

iGVLM 提出了一种解决现有大型视觉-语言模型 (LVLM) 表示瓶颈的方法，通过引入解耦的双分支架构实现动态指令引导的视觉调制。该模型使用冻结表示分支来保持任务无关的视觉表示，并使用动态调节分支通过自适应层归一化（AdaLN）进行仿射特征调制。实验结果表明，iGVLM 在各种语言骨干网络上增强了指令敏感性，并提供了一种插拔式解决方案以提高多模态理解能力。
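
Affine feature modulation via AdaLN can be sketched in a few lines. The example below normalizes visual tokens and then applies a scale and shift predicted from an instruction embedding. The `(1 + scale)` parameterization, the zero-initialized projection matrices, and all names and shapes are assumptions borrowed from common AdaLN practice, not details confirmed by the abstract.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its feature dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln_modulate(visual_tokens, instruction_emb, w_scale, w_shift):
    """Instruction-conditioned affine modulation of normalized visual tokens."""
    scale = instruction_emb @ w_scale  # (dim,) predicted from the instruction
    shift = instruction_emb @ w_shift  # (dim,)
    return layer_norm(visual_tokens) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))  # 4 visual tokens, 8 features (hypothetical sizes)
instr = rng.normal(size=(16,))    # instruction embedding (hypothetical size)
zero = np.zeros((16, 8))          # assumed zero initialization of both projections
out = adaln_modulate(tokens, instr, zero, zero)
```

With zero-initialized projections the modulated output equals the plain normalized features. That is one way such a branch could start from the pre-trained representation and move toward instruction-aware features only as the projections are learned.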

InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

Authors: Shuai Yang, Hao Li, Bin Wang, Yilun Chen, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, Jiangmiao Pang

First: 2025-07-23T13:57:06+00:00 · Latest: 2026-03-03T08:20:15+00:00

Comments: 48 pages

Abs · PDF · Code1 · Code2

Abstract

To operate effectively in the real world, robots should integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance with the help of embodied reasoning. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize embodied reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves 33% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, where it outperforms a fine-tuned OpenVLA by 96% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.

中文标题/摘要

标题：InstructVLA：从理解到操作的视觉-语言-行动指令调优

为了在现实世界中有效操作，机器人应该整合多模态推理与精确的动作生成。然而，现有的视觉-语言-行动（VLA）模型往往顾此失彼，将能力局限于特定任务的操作数据，并且对预训练的视觉-语言能力产生灾难性遗忘。为了解决这一问题，我们引入了InstructVLA，这是一种端到端的VLA模型，它保留了大型视觉-语言模型（VLM）的灵活推理能力，同时借助具身推理实现了领先的操作性能。InstructVLA引入了一种新的训练范式，即视觉-语言-行动指令调优（VLA-IT），该范式采用多模态训练和混合专家适应，在标准VLM语料库和一个精心整理的65万样本VLA-IT数据集上联合优化具身推理和动作生成。在同域的SimplerEnv任务中，InstructVLA比SpatialVLA提高了33%。为了评估泛化能力，我们引入了SimplerEnv-Instruct，这是一个包含80个任务的基准测试，要求闭环控制和高层次指令理解，其中它比微调的OpenVLA高出96%，比GPT-4o辅助的动作专家高出29%。此外，InstructVLA在多模态任务中超过了基线VLM，并通过利用文本推理在模拟和现实世界环境中提高操作性能，展现出推理时扩展能力。这些结果表明InstructVLA具有将直观、可操控的人机交互与高效策略学习衔接起来的潜力。

Summary / 总结

InstructVLA is an end-to-end VLA model that combines the flexibility of large vision-language models with strong manipulation capabilities. It uses a novel training paradigm, VLA-IT, which optimizes embodied reasoning and action generation on both standard VLM corpora and a curated dataset. InstructVLA shows significant improvements over existing models on in-domain tasks and outperforms other models on a new benchmark, demonstrating its potential for efficient policy learning in human-robot interaction.

InstructVLA 是一种结合了大型视觉-语言模型灵活性和强大操作能力的端到端 VLA 模型。它使用了一种新的训练范式 VLA-IT，同时优化了具身推理和动作生成。InstructVLA 在领域内任务和新基准 SimplerEnv-Instruct 上表现出色，能够处理复杂的指令和闭环控制。此外，它在多模态任务上超过了基线 VLM，并展现出推理时扩展能力。

OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning

Authors: Zhengwei Yang, Andi Long, Hao Li, Zechao Hu, Kui Jiang, Zheng Wang

First: 2026-03-03T06:48:27+00:00 · Latest: 2026-03-03T06:48:27+00:00

Comments: 12 pages, 8 figures

Abs · PDF · Code1 · Code2

Abstract

Fashion intelligence spans multiple tasks, i.e., retrieval, recommendation, recognition, and dialogue, yet remains hindered by fragmented supervision and incomplete fashion annotations. These limitations jointly restrict the formation of consistent visual-semantic structures, preventing recent vision-language models (VLMs) from serving as a generalist fashion brain that unifies understanding and reasoning across tasks. Therefore, we construct FashionX, a million-scale dataset that exhaustively annotates visible fashion items within an outfit and organizes attributes from global to part-level. Built upon this foundation, we propose OmniFashion, a unified vision-language framework that bridges diverse fashion tasks under a unified fashion dialogue paradigm, enabling both multi-task reasoning and interactive dialogue. Experiments on multi-subtasks and retrieval benchmarks show that OmniFashion achieves strong task-level accuracy and cross-task generalization, highlighting its offering of a scalable path toward universal, dialogue-oriented fashion intelligence.

中文标题/摘要

标题：OmniFashion：通过多任务视觉语言学习迈向通用时尚智能

时尚智能涵盖了多个任务，即检索、推荐、识别和对话，但仍然受到碎片化监督和不完整时尚标注的限制。这些限制共同阻碍了一致的视觉-语义结构的形成，使最近的视觉语言模型（VLMs）无法成为能够统一跨任务理解与推理的通用时尚大脑。因此，我们构建了FashionX，这是一个百万规模的数据集，对一套穿搭中可见的时尚单品进行了详尽标注，并按从全局到部件的层级组织属性。在此基础上，我们提出了OmniFashion，这是一种统一的视觉语言框架，能够在统一的时尚对话范式下连接各种时尚任务，实现多任务推理和交互对话。在多子任务和检索基准上的实验表明，OmniFashion在任务级准确性和跨任务泛化方面表现出色，表明它为实现通用、对话导向的时尚智能提供了一条可扩展的路径。

Summary / 总结

The research aims to address the fragmented supervision and incomplete fashion annotations that hinder the development of comprehensive fashion intelligence. To overcome these challenges, the authors created FashionX, a large-scale dataset with detailed annotations of fashion items in outfits. They then developed OmniFashion, a unified vision-language framework that integrates various fashion tasks into a dialogue paradigm, enhancing multi-task reasoning and cross-task generalization. The experiments demonstrate that OmniFashion performs well on multiple subtasks and retrieval benchmarks, showcasing its potential for scalable, dialogue-oriented fashion intelligence.

该研究旨在解决碎片化监督和不完整时尚标注阻碍全面时尚智能发展的问题。为此，作者创建了FashionX，一个对可见时尚物品及其属性进行详细标注的大规模数据集。然后，他们提出了OmniFashion，一个统一的视觉语言框架，将各种时尚任务整合到对话范式中，实现多任务推理和交互对话。实验表明，OmniFashion在多个子任务和检索基准上表现出色，显示出其实现可扩展的、对话导向的时尚智能的潜力。

Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects

Authors: Jiawei Wang, Dingyou Wang, Jiaming Hu, Qixuan Zhang, Jingyi Yu, Lan Xu

First: 2025-11-03T07:21:42+00:00 · Latest: 2026-03-03T06:27:33+00:00

Comments: Project Page: https://sites.google.com/deemos.com/kinematify

Abs · PDF · Code1 · Code2 · Project1

Abstract

A deep understanding of kinematic structures and movable components is essential for enabling robots to manipulate objects and model their own articulated forms. Such understanding is captured through articulated objects, which are essential for tasks such as physical simulation, motion planning, and policy learning. However, creating these models, particularly for objects with high degrees of freedom (DoF), remains a significant challenge. Existing methods typically rely on motion sequences or strong assumptions from hand-curated datasets, which hinders scalability. In this paper, we introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry. To achieve this, we combine MCTS search for structural inference with geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid descriptions. We evaluate Kinematify on diverse inputs from both synthetic and real-world environments, demonstrating improvements in registration and kinematic topology accuracy over prior work.

中文标题/摘要

标题：Kinematify：高自由度可动物体的开放词汇合成

对运动结构和可动部件的深刻理解对于使机器人能够操作物体并建模其自身的可动形态至关重要。这种理解通过可动物体来捕捉，这些物体对于物理模拟、运动规划和策略学习等任务至关重要。然而，特别是对于具有高自由度（DoF）的物体，创建这些模型仍然是一个重大挑战。现有方法通常依赖于运动序列或来自手工制作数据集的强假设，这阻碍了可扩展性。在本文中，我们介绍了Kinematify，这是一种自动框架，可以直接从任意RGB图像或文本描述中合成可动物体。我们的方法解决了两个核心挑战：（i）推断高自由度物体的运动结构拓扑，（ii）从静态几何中估计关节参数。为此，我们结合了基于MCTS搜索的结构推理和基于几何的优化来推断关节参数，从而产生物理上一致且功能上有效的描述。我们在来自合成和真实环境的多种输入上评估了Kinematify，展示了与先前工作相比在配准和运动结构准确性方面的改进。

Summary / 总结

Kinematify is an automated framework that synthesizes articulated objects from RGB images or textual descriptions, addressing the challenge of modeling objects with high degrees of freedom (DoF). It uses Monte Carlo tree search (MCTS) for structural inference and geometry-driven optimization for joint parameter estimation, producing physically consistent and functionally valid descriptions. Experiments show improvements in registration and kinematic topology accuracy over prior work.

Kinematify 是一个自动化框架，可以从 RGB 图像或文本描述中合成可动物体，解决高自由度物体建模的挑战。它使用 MCTS 搜索进行结构推理，并使用几何驱动优化进行关节参数估计，生成物理上一致且功能有效的描述。实验结果显示，其在配准和运动学拓扑准确性方面优于先前的方法。

Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery

Authors: Meng Xin, Sweta Priyadarshi, Jingyu Xin, Bilal Kartal, Aditya Vavre, Asma Kuriparambil Thekkumpate, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Ido Shahaf, Akhiad Bercovich, Kinjal Patel, Suguna Varshini Velury, Chenjie Luo, Zhiyu Cheng, Jenny Chen, Chen-Han Yu, Wei Ping, Oleg Rybakov, Nima Tajbakhsh, Oluwatobi Olabiyi, Dusan Stosic, Di Wu, Song Han, Eric Chung, Sharath Turuvekere Sreenivas, Bryan Catanzaro, Yoshi Suhara, Tijmen Blankevoort, Huizi Mao

First: 2026-01-27T22:14:47+00:00 · Latest: 2026-03-03T06:03:52+00:00

Abs · PDF · Code1 · Code2

Abstract

This technical report presents quantization-aware distillation (QAD) and our best practices for recovering accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe key advantages of QAD for today's LLMs: 1. It shows remarkable effectiveness and stability for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; 2. It is robust to data quality and coverage, enabling accuracy recovery without full training data. We evaluate QAD across multiple post-trained models including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (VLM), and Llama Nemotron Super v1, showing consistent recovery to near-BF16 accuracy.

中文标题/摘要

标题：面向NVFP4推理准确性恢复的量化感知蒸馏

本技术报告介绍了量化感知蒸馏（QAD）及其在恢复NVFP4量化大型语言模型（LLMs）和视觉-语言模型（VLMs）准确性的最佳实践。QAD使用KL散度损失将全精度教师模型蒸馏到量化学生模型中。虽然将蒸馏应用于量化模型不是新想法，但我们观察到QAD对当今的LLMs具有关键优势：1. 对于通过多阶段后训练管道训练的模型，包括监督微调（SFT）、强化学习（RL）和模型合并，它显示出显著的有效性和稳定性，而传统的量化感知训练（QAT）则因工程复杂性和训练不稳定性而受到影响；2. 它对数据质量和覆盖面具有鲁棒性，能够在无需完整训练数据的情况下实现准确性的恢复。我们在包括AceReason Nemotron、Nemotron 3 Nano、Nemotron Nano V2、Nemotron Nano V2 VL（VLM）和Llama Nemotron Super v1的多个后训练模型上评估了QAD，展示了其恢复到接近BF16准确性的一致性。

Summary / 总结

This technical report introduces quantization-aware distillation (QAD) for recovering the inference accuracy of NVFP4-quantized large language models and vision-language models. QAD distills a full-precision teacher model into a quantized student model using KL divergence loss. The method is particularly effective for models trained through multi-stage post-training pipelines, such as supervised fine-tuning, reinforcement learning, and model merging, where traditional quantization-aware training faces challenges. QAD also shows robustness to data quality and coverage, enabling accuracy recovery without full training data. Experiments across various models demonstrate consistent recovery to near-BF16 accuracy.

研究旨在通过量化感知蒸馏（QAD）提高NVFP4量化的大语言模型和视觉语言模型的推理准确性。QAD使用KL散度损失将全精度教师模型蒸馏到量化学生模型中。该方法在通过多阶段后训练管道训练的模型中表现出显著的有效性和稳定性，并且对数据质量和覆盖率具有鲁棒性，能够在没有完整训练数据的情况下实现准确性恢复。对多种后训练模型的评估显示，其能够一致地恢复到接近BF16的准确性。
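
As a rough illustration of the KL-divergence objective named in the abstract, the sketch below computes KL(teacher || student) from raw logits in plain Python and averages it over token positions. The function names, the `temperature` parameter, the direction of the divergence, and the toy logits are all assumptions made for illustration; this digest does not state the report's exact loss configuration, and no NVFP4 quantization is simulated here.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one vector of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]

def kl_teacher_student(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single token position (assumed direction)."""
    log_p = log_softmax(teacher_logits, temperature)  # full-precision teacher
    log_q = log_softmax(student_logits, temperature)  # quantized student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def distillation_loss(teacher_batch, student_batch, temperature=1.0):
    """Mean KL over all token positions in a batch of logit vectors."""
    per_position = [
        kl_teacher_student(t, s, temperature)
        for t, s in zip(teacher_batch, student_batch)
    ]
    return sum(per_position) / len(per_position)

# Hypothetical logits for two token positions over a three-word vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.5, 0.7, -0.8], [0.0, 0.4, 2.5]]
loss = distillation_loss(teacher, student)
print(f"distillation loss: {loss:.6f}")
```

In a real training loop the student logits would come from the quantized forward pass and the loss would be backpropagated into the student's weights; the snippet only shows how the scalar objective is formed.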

SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

Authors: Zixuan Xu, Tiancheng He, Huahui Yi, Kun Wang, Xi Chen, Gongli Xi, Qiankun Li, Kang Li, Yang Liu, Zhigang Zeng

First: 2026-03-03T06:03:38+00:00 · Latest: 2026-03-03T06:03:38+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-language models remain susceptible to multimodal jailbreaks and over-refusal because safety hinges on both visual evidence and user intent, while many alignment pipelines supervise only the final response. To address this, we present SaFeR-ToolKit, which formalizes safety decision-making as a checkable protocol. Concretely, a planner specifies a persona, a Perception → Reasoning → Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. To make the protocol reliably followed in practice, we train a single policy with a three-stage curriculum (SFT → DPO → GRPO), where GRPO directly supervises tool usage beyond answer-level feedback. Our contributions are two-fold: I. Dataset. The first tool-based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus 1k held-out evaluation. II. Experiments. On Qwen2.5-VL, SaFeR-ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor on 3B (29.39/45.04/4.98 → 84.40/71.13/78.87) and 7B (53.21/52.92/19.26 → 86.34/80.79/85.34), while preserving general capabilities (3B: 58.67 → 59.21; 7B: 66.39 → 66.81). Codes are available at https://github.com/Duebassx/SaFeR_ToolKit.

中文标题/摘要

标题：SaFeR-ToolKit：通过虚拟工具调用进行结构化推理以实现多模态安全

视觉-语言模型仍然容易受到多模态越狱和过度拒绝的影响，因为安全性取决于视觉证据和用户意图，而许多对齐管道仅监督最终的回答。为了解决这个问题，我们提出了SaFeR-ToolKit，将安全性决策形式化为可验证的协议。具体来说，规划者指定一个角色、一个感知→推理→决策工具集以及一个受限的转换图，而响应者在最终答案之前输出一个类型化的键值工具轨迹。为了使该协议在实践中可靠地遵循，我们使用三阶段课程（SFT→DPO→GRPO）训练了一个单一策略，其中GRPO直接监督工具使用，而不仅仅是答案级别的反馈。我们的贡献有两个方面：I. 数据集。第一个基于工具的安全推理数据集，包含31,654个示例（SFT 6k，DPO 18.6k，GRPO 6k）以及1k个留出的评估样本。II. 实验。在Qwen2.5-VL上，SaFeR-ToolKit显著提高了安全性/帮助性/推理严谨性（3B：29.39/45.04/4.98→84.40/71.13/78.87；7B：53.21/52.92/19.26→86.34/80.79/85.34），同时保留了通用能力（3B：58.67→59.21；7B：66.39→66.81）。代码可在https://github.com/Duebassx/SaFeR_ToolKit/ 获取。

Summary / 总结

SaFeR-ToolKit addresses the susceptibility of vision-language models to multimodal jailbreaks and over-refusal by formalizing safety decision-making as a checkable protocol. A planner specifies a persona, a Perception → Reasoning → Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. A single policy is trained with a three-stage curriculum (SFT → DPO → GRPO) so that the protocol is followed reliably. Experiments on Qwen2.5-VL show significant improvements in Safety/Helpfulness/Reasoning Rigor for the 3B and 7B models, while general capabilities are preserved.

该研究旨在解决视觉语言模型易受多模态越狱攻击和过度拒绝的问题，其根源在于安全性同时取决于视觉证据和用户意图。SaFeR-ToolKit将安全决策过程形式化为可验证的协议，规划者指定角色、感知→推理→决策工具集以及受限转换图，响应者在最终答案前输出类型化的键值工具轨迹。通过三阶段训练课程（SFT→DPO→GRPO）确保工具使用的可靠性。实验结果显示，在Qwen2.5-VL上，安全性、有用性和推理严谨性显著提高，同时保留了通用能力。
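
To make the idea of a "checkable protocol" concrete, here is a minimal sketch of validating a typed key-value tool trace against a constrained transition graph. Only the three stage names (Perception, Reasoning, Decision) come from the abstract; the transition table, the required keys, and the tool names in the example trace are hypothetical and are not taken from the paper.

```python
# Hypothetical stage-level transition graph: each state maps to the set of
# stages that may follow it. "START" and "END" are bookkeeping states.
ALLOWED_TRANSITIONS = {
    "START": {"Perception"},
    "Perception": {"Perception", "Reasoning"},
    "Reasoning": {"Reasoning", "Decision"},
    "Decision": {"END"},
}

REQUIRED_KEYS = {"stage", "tool", "output"}  # assumed trace schema

def validate_trace(trace):
    """Return (ok, message) for a list of key-value tool-call records."""
    current = "START"
    for index, step in enumerate(trace):
        missing = REQUIRED_KEYS - step.keys()
        if missing:
            return False, f"step {index} is missing keys: {sorted(missing)}"
        stage = step["stage"]
        if stage not in ALLOWED_TRANSITIONS.get(current, set()):
            return False, f"step {index}: transition {current} -> {stage} is not allowed"
        current = stage
    if "END" not in ALLOWED_TRANSITIONS.get(current, set()):
        return False, f"trace stops at {current}, before a Decision was reached"
    return True, "ok"

# Example trace with made-up tool names.
example_trace = [
    {"stage": "Perception", "tool": "describe_image", "output": "a kitchen knife on a table"},
    {"stage": "Reasoning", "tool": "assess_intent", "output": "cooking question, benign"},
    {"stage": "Decision", "tool": "choose_action", "output": "answer"},
]
print(validate_trace(example_trace))
```

A checker of this kind could in principle supply a trace-level signal in addition to answer-level feedback, which is the role the abstract attributes to the GRPO stage; how the paper actually scores traces is not described in this digest.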

3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection

Authors: Haowen Zhu, Ning Yin, Xiaogen Zhou

First: 2026-02-27T03:37:55+00:00 · Latest: 2026-03-03T05:58:36+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-language models (VLMs) show strong potential for complex diagnostic tasks in medical imaging. However, applying VLMs to multi-organ medical imaging introduces two principal challenges: (1) modality-specific vision-language alignment and (2) cross-modal feature fusion. In this work, we propose MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language representation learning in 3D MRI. MedMAP comprises a modality-aware vision-language alignment stage and a fine-tuning stage for multi-organ abnormality detection. During the pre-training stage, the modality-aware encoders implicitly capture the joint modality distribution and improve alignment between visual and textual representations. We then fine-tune the pre-trained vision encoders (while keeping the text encoder frozen) for downstream tasks. To this end, we curated MedMoM-MRI3D, comprising 7,392 3D MRI volume-report pairs spanning twelve MRI modalities and nine abnormalities tailored for various 3D medical analysis tasks. Extensive experiments on MedMoM-MRI3D demonstrate that MedMAP significantly outperforms existing VLMs in 3D MRI-based multi-organ abnormality detection. Our code is available at https://github.com/RomantiDr/MedMAP.

中文标题/摘要

标题：面向MRI多器官异常检测视觉-语言模型的3D模态感知预训练

视觉-语言模型（VLMs）在医学影像复杂诊断任务中显示出强大的潜力。然而，将VLMs应用于多器官医学影像引入了两个主要挑战：（1）模态特定的视觉-语言对齐和（2）跨模态特征融合。在本文中，我们提出了一种名为MedMAP的医学模态感知预训练框架，以增强3D MRI中的视觉-语言表示学习。MedMAP包括一个模态感知的视觉-语言对齐阶段和一个多器官异常检测的微调阶段。在预训练阶段，模态感知编码器隐式捕获联合模态分布并改善视觉和文本表示之间的对齐。然后，我们微调预训练的视觉编码器（同时冻结文本编码器）以执行下游任务。为此，我们构建了MedMoM-MRI3D，包含7,392个3D MRI体积-报告对，涵盖十二种MRI模态和九种异常，适用于各种3D医学分析任务。在MedMoM-MRI3D上的广泛实验表明，MedMAP在基于3D MRI的多器官异常检测中显著优于现有VLMs。我们的代码可在https://github.com/RomantiDr/MedMAP获取。

Summary / 总结

The research aims to enhance vision-language models for medical imaging tasks, particularly in 3D MRI multi-organ abnormality detection. MedMAP, a Medical Modality-Aware Pretraining framework, addresses modality-specific alignment and cross-modal feature fusion. It uses modality-aware encoders to improve visual and textual representation alignment during pre-training and fine-tunes the vision encoders for specific tasks. Experiments show that MedMAP outperforms existing vision-language models in 3D MRI-based multi-organ abnormality detection.

研究旨在提升视觉语言模型在医学成像任务中的表现，特别是在基于3D MRI的多器官异常检测。MedMAP是一种医疗模态感知预训练框架，解决了模态特定的对齐和跨模态特征融合问题。该框架使用模态感知编码器在预训练阶段改善视觉和文本表示的对齐，并在下游任务中微调视觉编码器。实验表明，MedMAP在基于3D MRI的多器官异常检测中优于现有视觉语言模型。

Automated Data Enrichment using Confidence-Aware Fine-Grained Debate among Open-Source LLMs for Mental Health and Online Safety

Authors: Junyu Mao, Anthony Hills, Talia Tseriotou, Maria Liakata, Aya Shamir, Dan Sayda, Dana Atzil-Slonim, Natalie Djohari, Arpan Mandal, Silke Roth, Pamela Ugwudike, Mahesan Niranjan, Stuart E. Middleton

First: 2025-12-06T00:21:29+00:00 · Latest: 2026-03-03T05:45:32+00:00

Abs · PDF · Code1 · Code2

Abstract

Real-world indicators play an important role in many natural language processing (NLP) applications, such as life-event for mental health analysis and risky behaviour for online safety, yet labelling such information in training datasets is often costly and/or difficult due to their dynamic nature. Large language models (LLMs) show promising potential for automated annotation, yet multi-label prediction remains challenging. In this work, we propose a Confidence-Aware Fine-Grained Debate (CFD) framework that simulates collaborative annotation using fine-grained information to better support automated multi-label enrichment. We introduce two new expert-annotated resources: a mental health Reddit well-being dataset and an online safety Facebook sharenting risk dataset. Experiments show that CFD achieves the most robust enrichment performance compared to a range of baseline approaches. We further evaluate various training-free enrichment incorporation strategies and demonstrate that LLM-enriched indicators consistently improve our downstream tasks. Enriched features incorporated via debate transcripts yield the largest gains, outperforming the non-enriched baseline by 9.9% on the online safety task.

中文标题/摘要

标题：基于开源LLM间置信度感知细粒度辩论的自动化数据丰富：面向心理健康与在线安全

现实世界的指标在许多自然语言处理（NLP）应用中扮演着重要角色，例如心理健康分析中的生活事件和在线安全中的危险行为，但由于这些信息的动态性质，对其进行标注往往成本高且/或困难。大型语言模型（LLMs）在自动化标注方面显示出有希望的潜力，但多标签预测仍然具有挑战性。在本文中，我们提出了一种置信度感知细粒度辩论（CFD）框架，利用细粒度信息模拟协作标注，以更好地支持自动化多标签丰富。我们引入了两个新的专家标注资源：心理健康Reddit幸福感数据集和在线安全Facebook晒娃（sharenting）风险数据集。实验表明，与一系列基线方法相比，CFD实现了最稳健的丰富性能。我们进一步评估了各种无需训练的丰富特征整合策略，并证明了LLM丰富的指标始终能改进我们的下游任务。通过辩论记录整合的丰富特征带来了最大的收益，在在线安全任务上比未丰富的基线高出9.9%。

Summary / 总结

This work addresses the challenge of automatically enriching datasets with real-world indicators for mental health and online safety by proposing a Confidence-Aware Fine-Grained Debate (CFD) framework. The method simulates collaborative annotation among open-source LLMs to handle multi-label prediction. CFD achieves more robust enrichment performance than a range of baseline approaches, and incorporating enriched features via debate transcripts improves downstream tasks, with a 9.9% gain over the non-enriched baseline on the online safety task.

该研究提出了一种名为Confidence-Aware Fine-Grained Debate (CFD) 的框架，以自动丰富与心理健康和在线安全相关的数据集中的现实指标。方法通过模拟开源大语言模型之间的协作注释来处理多标签预测问题。实验结果表明，CFD 在稳健的丰富性能方面优于基线方法，并且通过辩论记录集成丰富特征可以显著提高下游任务的表现，特别是在在线安全任务上，相比未丰富基线提高了9.9%。

Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs

Authors: Zhikang Xu, Qianqian Xu, Zitai Wang, Cong Hua, Sicong Li, Zhiyong Yang, Qingming Huang

Venue: CVPR 2026

First: 2026-03-03T05:44:47+00:00 · Latest: 2026-03-03T05:44:47+00:00

Comments: Accepted by the main track of CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency against the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.

中文标题/摘要

标题：注意你选择负文本的方式：在基于VLM的OOD检测中追求距离一致性

分布外（OOD）检测旨在识别来自未知类别的样本，这是在开放世界场景中部署机器学习模型的关键能力。最近的研究表明，视觉-语言模型（VLMs）能够有效利用其多模态表示进行OOD检测。然而，当前的方法通常在OOD检测中引入了同模态距离，例如将负文本与ID标签进行比较，或将测试图像与图像代理进行比较。这种设计范式与CLIP等VLMs所优化的跨模态距离存在固有的不一致性，可能导致性能不佳。为了解决这一局限性，我们提出了一种简单而有效的框架InterNeg，系统地从文本和视觉两个视角利用一致的跨模态距离增强。从文本视角出发，我们设计了一种跨模态标准来选择负文本。从视觉视角出发，我们动态识别高置信度的OOD图像，并将其反转到文本空间，生成由跨模态距离引导的额外负文本嵌入。在多个基准上的广泛实验表明，我们的方法具有优越性。值得注意的是，我们的InterNeg在大规模ImageNet基准上实现了最先进的性能，FPR95降低了3.47%，在具有挑战性的Near-OOD基准上AUROC提高了5.50%。

Summary / 总结

The paper addresses the limitation of current out-of-distribution (OOD) detection methods that use intra-modal distance, which can lead to suboptimal performance due to inconsistency with the inter-modal distance optimized by VLMs. It introduces InterNeg, a framework that enhances inter-modal distance from both textual and visual perspectives. InterNeg selects negative texts based on an inter-modal criterion and generates extra negative text embeddings by inverting high-confidence OOD images. Experiments show that InterNeg outperforms existing methods, achieving a 3.47% reduction in FPR95 on ImageNet and a 5.50% improvement in AUROC on the Near-OOD benchmark.

论文针对当前OOD检测方法使用同模态距离、从而可能与CLIP等VLMs所优化的跨模态距离不一致的问题，提出了InterNeg框架：基于跨模态标准选择负文本，并将高置信度的OOD图像反转到文本空间以生成额外的负文本嵌入，从而增强跨模态距离的一致性。实验结果显示，InterNeg在ImageNet上的FPR95降低了3.47%，在具有挑战性的Near-OOD基准上的AUROC提高了5.50%。
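
For readers unfamiliar with negative-text OOD scoring, the sketch below shows one generic way to use inter-modal (image-to-text) similarity only: an image embedding is compared against ID label embeddings and negative text embeddings, and the share of softmax mass on the ID labels serves as the ID score. This is a generic scheme of the kind the abstract's "negative texts" refer to, not InterNeg's actual selection criterion or score, neither of which is specified in this digest. The function names, the temperature value, and the 2-D toy embeddings are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def id_score(image_emb, id_text_embs, neg_text_embs, temperature=0.01):
    """Fraction of softmax mass that image-to-text similarity puts on ID labels."""
    id_logits = [cosine(image_emb, t) / temperature for t in id_text_embs]
    neg_logits = [cosine(image_emb, t) / temperature for t in neg_text_embs]
    peak = max(id_logits + neg_logits)  # subtract the max for numerical stability
    id_mass = sum(math.exp(x - peak) for x in id_logits)
    neg_mass = sum(math.exp(x - peak) for x in neg_logits)
    return id_mass / (id_mass + neg_mass)

# Toy 2-D embeddings standing in for CLIP-like text and image features.
id_texts = [[1.0, 0.0], [0.9, 0.1]]
negative_texts = [[0.0, 1.0]]
in_dist_image = [1.0, 0.05]
out_dist_image = [0.05, 1.0]

print(id_score(in_dist_image, id_texts, negative_texts))
print(id_score(out_dist_image, id_texts, negative_texts))
```

An image would be flagged as OOD when its score falls below a threshold chosen on validation data. Note that every distance used here is between an image and a text, which is the consistency property the paper argues for.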

VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction

Authors: A. Enes Doruk, Hasan F. Ates

First: 2026-03-03T05:22:28+00:00 · Latest: 2026-03-03T05:22:28+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. Current voxel-based occupancy models often struggle with semantic ambiguity in sparse geometric grids and performance degradation under adverse weather conditions. To address these challenges, we leverage the rich linguistic priors of Vision-Language Models (VLMs) to anchor ambiguous voxel features to stable semantic concepts. Our framework initiates with a dual-branch feature extraction pipeline that projects multi-view images and LiDAR point clouds into a unified voxel space. We propose Instance-driven VLM Attention (InstVLM), which utilizes gated cross-attention and LoRA-adapted CLIP embeddings to inject high-level semantic and geographic priors directly into the 3D voxels. Furthermore, we introduce Weather-Aware Adaptive Fusion (WeathFusion), a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions based on real-time environmental reliability. To ensure structural consistency, a Depth-Aware Geometric Alignment (DAGA) loss is employed to align dense camera-derived geometry with sparse, spatially accurate LiDAR returns. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our plug-and-play modules consistently enhance the performance of state-of-the-art voxel-based baselines. Notably, our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.

中文标题/摘要

标题：VLMFusionOcc3D：VLM辅助的多模态3D语义占用预测

本文介绍了VLMFusionOcc3D，这是一种用于自动驾驶中密集3D语义占用预测的鲁棒多模态框架。当前基于体素的占用模型在稀疏几何网格中处理语义模糊性方面常常存在困难，并且在恶劣天气条件下性能下降。为了解决这些挑战，我们利用视觉语言模型（VLM）丰富的语言先验知识，将模糊的体素特征锚定到稳定的语义概念上。我们的框架采用了一种双分支特征提取管道，将多视角图像和LiDAR点云投影到统一的体素空间中。我们提出了实例驱动的VLM注意力（InstVLM），利用门控交叉注意力和LoRA调整后的CLIP嵌入直接将高层语义和地理先验注入3D体素中。此外，我们引入了天气感知自适应融合（WeathFusion），这是一种动态门控机制，利用车辆元数据和天气条件提示，根据实时环境可靠性重新加权传感器贡献。为了确保结构一致性，我们采用了深度感知几何对齐（DAGA）损失，将密集的相机衍生几何与稀疏的、空间上准确的LiDAR回波对齐。在nuScenes和SemanticKITTI数据集上的广泛实验表明，我们的即插即用模块可以一致地增强最先进的基于体素的基线的性能。值得注意的是，我们的方法在恶劣天气场景中实现了显著的性能提升，为复杂城市导航提供了一种可扩展且鲁棒的解决方案。

Summary / 总结

VLMFusionOcc3D is a multimodal framework for 3D semantic occupancy prediction in autonomous driving, addressing the limitations of voxel-based models in handling semantic ambiguity and performance degradation under adverse weather conditions. It uses Vision-Language Models to inject semantic and geographic priors into 3D voxels and introduces a dynamic gating mechanism for sensor fusion based on real-time environmental conditions. Experiments show consistent performance improvements over state-of-the-art voxel-based models, particularly in challenging weather scenarios.

VLMFusionOcc3D 是一种多模态框架，用于自动驾驶中的密集 3D 语义占用预测。该框架利用视觉语言模型（VLMs）解决语义模糊和恶劣天气下的性能问题。框架包括双分支特征提取管道、用于语义注入的实例驱动 VLM 注意力（InstVLM）、基于实时环境可靠性的动态传感器加权机制（Weather-Aware Adaptive Fusion, WeathFusion）以及用于结构一致性的深度感知几何对齐（Depth-Aware Geometric Alignment, DAGA）损失。实验表明，该方法能够一致地提升最先进的基于体素的模型的性能，在恶劣天气条件下提升尤为显著。
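
The abstract describes WeathFusion as a dynamic gate that re-weights sensor contributions according to environmental reliability. A minimal sketch of that general idea, under stated assumptions, is a softmax gate over per-sensor reliability logits followed by a weighted sum of camera and LiDAR voxel features. In the paper the gate is driven by vehicle metadata and weather-conditioned prompts; here the logits are simply given as inputs, and all names and numbers are hypothetical.

```python
import math

def gate_weights(reliability_logits):
    """Softmax over per-sensor reliability logits; the weights sum to 1."""
    peak = max(reliability_logits)
    exps = [math.exp(x - peak) for x in reliability_logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_voxel_features(camera_feat, lidar_feat, reliability_logits):
    """Convex combination of two feature vectors for a single voxel."""
    w_camera, w_lidar = gate_weights(reliability_logits)
    return [w_camera * c + w_lidar * l for c, l in zip(camera_feat, lidar_feat)]

# Hypothetical per-voxel features from the two branches.
camera_feat = [0.2, 0.8, 0.5]
lidar_feat = [0.6, 0.4, 0.1]

# Clear weather: both sensors are treated as equally reliable.
clear = fuse_voxel_features(camera_feat, lidar_feat, [0.0, 0.0])
# Made-up fog condition: the gate is assumed to trust LiDAR far more.
fog = fuse_voxel_features(camera_feat, lidar_feat, [-4.0, 4.0])
print(clear, fog)
```

Because the weights form a convex combination, the fused feature always lies between the two sensor features, so a degraded sensor can be down-weighted without changing the feature scale seen by later layers.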
