BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion
Authors: Xinyu Gao, Gang Chen, Javier Alonso-Mora
First: 2026-03-10T17:56:16+00:00 · Latest: 2026-03-10T17:56:16+00:00
Comments: 8 pages. Project page: https://xin-yu-gao.github.io/beacon
Abstract
Language-conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open-vocabulary, relational instruction. Existing vision-language spatial grounding methods usually rely on vision-language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans. To address this issue, we propose BEACON, which predicts an ego-centric Bird's-Eye View (BEV) affordance heatmap over a bounded local region including occluded areas. Given an instruction and surround-view RGB-D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM's output with depth-derived BEV features. Using an occlusion-aware dataset built in the Habitat simulator, we conduct detailed experimental analysis to validate both our BEV space formulation and the design choices of each module. Our method improves the accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline on the validation subset with occluded target locations. Our project page is: https://xin-yu-gao.github.io/beacon.
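The headline metric, accuracy averaged across geodesic thresholds, can be illustrated with a minimal sketch. The abstract does not list the thresholds, so the values below, the function names, and the sample errors are all hypothetical; a prediction is assumed to count as correct when its geodesic distance to the target is within the threshold.

```python
from typing import Sequence


def accuracy_at_threshold(geodesic_errors: Sequence[float], threshold: float) -> float:
    """Fraction of predictions whose geodesic error is within `threshold` (metres)."""
    if not geodesic_errors:
        raise ValueError("need at least one prediction")
    hits = sum(1 for err in geodesic_errors if err <= threshold)
    return hits / len(geodesic_errors)


def mean_accuracy(geodesic_errors: Sequence[float], thresholds: Sequence[float]) -> float:
    """Average the per-threshold accuracies into a single score."""
    per_threshold = [accuracy_at_threshold(geodesic_errors, t) for t in thresholds]
    return sum(per_threshold) / len(per_threshold)


# Hypothetical geodesic errors (metres) for four predicted target locations.
errors = [0.2, 0.6, 1.4, 3.0]
# Hypothetical thresholds; the paper's actual values are not given in the abstract.
thresholds = [0.5, 1.0, 2.0]
score = mean_accuracy(errors, thresholds)
print(f"mean accuracy across thresholds: {score:.2f}")
```

A gain reported in percentage points, as in the abstract, would be the difference between two such scores multiplied by 100.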
中文标题/摘要
标题:BEACON:基于语言的遮挡下局部导航可用性预测
基于语言的局部导航要求机器人从其当前观察和开放词汇关系指令中推断出附近的可通行目标位置。现有的视觉-语言空间对齐方法通常依赖视觉-语言模型(VLM)在图像空间中进行推理,产生与可见像素相关的二维预测。因此,它们在遮挡区域(通常由家具或移动的人类引起)推断目标位置时遇到困难。为了解决这个问题,我们提出了BEACON,它预测了一个以自我为中心的鸟瞰图(BEV)可用性热力图,覆盖了一个包括遮挡区域的局部区域。给定一个指令和机器人周围四个方向的环绕视图RGB-D观察结果,BEACON通过将空间线索注入VLM并将VLM的输出与深度衍生的BEV特征融合来预测BEV热力图。使用在Habitat模拟器中构建的具有遮挡感知的数据集,我们进行了详细的实验分析,以验证我们的BEV空间表示和每个模块的设计选择。我们的方法在验证子集上遮挡目标位置的平均测地距离阈值精度上比最先进的图像空间基线提高了22.74个百分点。我们的项目页面是:https://xin-yu-gao.github.io/beacon.
Summary / 总结
BEACON predicts an ego-centric bird's-eye-view (BEV) affordance heatmap over a bounded local region, including occluded areas, so that target locations hidden from image-space methods can still be inferred. Given an instruction and surround-view RGB-D observations from four directions, it injects spatial cues into a vision-language model and fuses the model's output with depth-derived BEV features. On the validation subset with occluded target locations, accuracy averaged across geodesic thresholds improves by 22.74 percentage points over the state-of-the-art image-space baseline.
BEACON旨在解决语言条件导航中在遮挡区域预测目标位置的挑战。它通过向视觉语言模型注入空间线索并将其输出与深度衍生的BEV特征融合,预测一个局部区域的鸟瞰图(BEV)可操作性热图。实验结果显示,BEACON在包含遮挡目标的验证子集上的准确率比现有方法提高了22.74个百分点。
MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems
Authors: Yunhang Qian, Xiaobin Hu, Jiaquan Yu, Siyang Xin, Xiaokun Chen, Jiangning Zhang, Peng-Tao Jiang, Jiawei Liu, Hongwei Bran Li
First: 2026-03-10T17:03:11+00:00 · Latest: 2026-03-10T17:03:11+00:00
Abstract
While Multi-Agent Systems (MAS) show potential for complex clinical decision support, the field remains hindered by architectural fragmentation and the lack of standardized multimodal integration. Current medical MAS research suffers from non-uniform data ingestion pipelines, inconsistent visual-reasoning evaluation, and a lack of cross-specialty benchmarking. To address these challenges, we present MedMASLab, a unified framework and benchmarking platform for multimodal medical multi-agent systems. MedMASLab introduces: (1) A standardized multimodal agent communication protocol that enables seamless integration of 11 heterogeneous MAS architectures across 24 medical modalities. (2) An automated clinical reasoning evaluator, a zero-shot semantic evaluation paradigm that overcomes the limitations of lexical string-matching by leveraging large vision-language models to verify diagnostic logic and visual grounding. (3) The most extensive benchmark to date, spanning 11 organ systems and 473 diseases, standardizing data from 11 clinical benchmarks. Our systematic evaluation reveals a critical domain-specific performance gap: while MAS improves reasoning depth, current architectures exhibit significant fragility when transitioning between specialized medical sub-domains. We provide a rigorous ablation of interaction mechanisms and cost-performance trade-offs, establishing a new technical baseline for future autonomous clinical systems. The source code and data are publicly available at: https://github.com/NUS-Project/MedMASLab/
中文标题/摘要
标题:MedMASLab: 多模态医疗多智能体系统统一编排框架
尽管多智能体系统(MAS)在复杂临床决策支持方面显示出潜力,但该领域仍受到架构碎片化和标准化多模态集成缺乏的阻碍。当前的医疗MAS研究遭受非统一数据摄入管道、不一致的视觉推理评估以及跨专科基准测试不足的困扰。为解决这些挑战,我们提出了MedMASLab,一个统一的多模态医疗多智能体系统框架和基准平台。MedMASLab 引入了:(1)一个标准化的多模态智能体通信协议,使11种异构MAS架构在24种医疗模态下无缝集成。(2)一个自动临床推理评估器,这是一种零样本语义评估范式,通过利用大型视觉语言模型来验证诊断逻辑和视觉定位,克服了基于词串匹配的局限性。(3)迄今为止最广泛的基准测试,覆盖11个器官系统和473种疾病,标准化了11个临床基准的数据。我们的系统评估揭示了一个关键的专业领域性能差距:尽管MAS提高了推理深度,但当前架构在从专门的医学亚领域过渡时表现出显著的脆弱性。我们对交互机制进行了严格的消融分析,并建立了未来自主临床系统的新的技术基线。源代码和数据可在:https://github.com/NUS-Project/MedMASLab/ 公开获取。
Summary / 总结
MedMASLab is a unified framework and benchmarking platform for multimodal medical multi-agent systems, addressing the challenges of architectural fragmentation and inconsistent evaluation. It introduces a standardized multimodal agent communication protocol, an automated clinical reasoning evaluator, and the most extensive benchmark to date. The evaluation highlights a significant domain-specific performance gap and provides insights into interaction mechanisms and cost-performance trade-offs, setting a new technical baseline for future clinical systems.
MedMASLab 是一个统一框架,旨在评估多模态医疗多智能体系统,解决架构碎片化和缺乏标准化的问题。它引入了一个标准化的多模态智能体通信协议、一个自动临床推理评估器以及迄今为止最广泛的基准测试,涵盖了11个器官系统和473种疾病。评估结果显示,当前架构在不同专业医学子领域之间切换时表现出显著的脆弱性,存在明显的领域性能差距。该框架为未来的自主临床系统提供了技术基准。
Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports
Authors: Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong
First: 2026-03-10T16:50:32+00:00 · Latest: 2026-03-10T16:50:32+00:00
Abstract
Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.
中文标题/摘要
标题:让VLM走上球场:体育场景中的空间智能基准测试
体育运动长期以来一直吸引着广泛的关注,因为它们推动了人类身体和认知能力的极限。随着对视觉语言模型(VLMs)的空间智能兴趣日益增长,体育运动为理解高强度的人体运动和动态物体交互提供了一个自然的测试平台。为此,我们提出了CourtSI,这是第一个针对体育场景的空间智能大规模数据集。CourtSI包含超过100万对问答,按照全面的分类体系系统地涵盖了空间计数、距离测量、定位和关系推理,覆盖了具有代表性的隔网运动,包括羽毛球、网球和乒乓球。利用明确的场地几何作为度量锚点,我们开发了一种半自动数据引擎来重建体育场景,从而实现CourtSI的可扩展构建。此外,我们引入了CourtSI-Bench,这是一个高质量的评估基准,包含3,686对经过严格人工验证的问答对。我们在CourtSI-Bench上评估了25个专有和开源的VLMs,揭示了人类与AI之间仍然存在的性能差距,以及现有空间智能基准向该场景泛化的能力有限。这些发现表明,体育场景揭示了现有基准所捕捉的空间智能能力的局限性。进一步地,在CourtSI上对Qwen3-VL-8B进行微调后,其在CourtSI-Bench上的准确性提高了23.5个百分点。调整后的模型还能够有效泛化到基于类似但未见过的运动构建的CourtSI-Ext评估集,并展示了增强的空间感知评论生成能力。这些发现共同表明,CourtSI为推动VLMs在体育中的空间智能提供了可扩展的途径。
Summary / 总结
The paper introduces CourtSI, a large-scale spatial intelligence dataset for sports with over 1M QA pairs covering spatial counting, distance measurement, localization, and relational reasoning across badminton, tennis, and table tennis. Evaluating 25 VLMs on CourtSI-Bench reveals a remaining human-AI performance gap. Fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points, and the adapted model generalizes to CourtSI-Ext, built on a similar but unseen sport, while producing better spatial-aware commentary.
论文介绍了CourtSI,这是一个用于体育领域空间智能的大规模数据集,包含超过100万的问答对,覆盖羽毛球、网球和乒乓球等隔网运动。利用明确的场地几何结构,开发了一种半自动数据引擎来重建体育场景,并构建了CourtSI-Bench这一严格的评估基准,用于25个VLMs的评估。评估结果显示了人类和AI之间的性能差距以及现有基准的有限泛化能力。通过在CourtSI上微调Qwen3-VL-8B,准确率提高了23.5个百分点,并且能够很好地泛化到未见过的运动场景,突显了体育场景在提升VLMs空间智能能力方面的重要性。
Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis
Authors: Zijian Gu, Yuxi Liu, Zhenhao Zhang, Song Wang
First: 2025-12-03T06:09:14+00:00 · Latest: 2026-03-10T15:59:28+00:00
Comments: AMIA 2026 Amplify Informatics Conference (Poster), Denver, CO, May 18-21, 2026. 10 pages, 3 tables
Abstract
Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms. Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.
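The two mechanisms named above can be sketched in plain Python. The abstract does not give the MaxAccGap formula, so this is only one plausible smooth surrogate: per-group accuracy is approximated by the mean probability assigned to the correct label, and the max and min over groups are replaced by log-sum-exp so the gap stays smooth. The function names, the temperature, and the toy data are assumptions, as is the common N / (G * n_g) form of inverse frequency weighting.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Sequence


def soft_group_accuracy(probs_true: Sequence[float], groups: Sequence[str]) -> Dict[str, float]:
    """Mean probability given to the correct label, per demographic group."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Counter = Counter()
    for p, g in zip(probs_true, groups):
        sums[g] += p
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in counts}


def smooth_max_acc_gap(probs_true: Sequence[float], groups: Sequence[str],
                       temperature: float = 0.05) -> float:
    """Smooth (log-sum-exp) surrogate for max group accuracy minus min group accuracy."""
    acc = list(soft_group_accuracy(probs_true, groups).values())
    smooth_max = temperature * math.log(sum(math.exp(a / temperature) for a in acc))
    smooth_min = -temperature * math.log(sum(math.exp(-a / temperature) for a in acc))
    return smooth_max - smooth_min


def inverse_frequency_weights(groups: Sequence[str]) -> Dict[str, float]:
    """Weight each group by N / (G * n_g), so smaller groups weigh more."""
    counts = Counter(groups)
    total, num_groups = len(groups), len(counts)
    return {g: total / (num_groups * n) for g, n in counts.items()}


# Toy batch: probability of the true label and a hypothetical group tag per sample.
probs = [0.9, 0.8, 0.7, 0.4]
tags = ["a", "a", "a", "b"]
gap = smooth_max_acc_gap(probs, tags)
weights = inverse_frequency_weights(tags)
print(f"smooth gap: {gap:.4f}, weights: {weights}")
```

In a training loop the gap term would be added to the task loss with a regularization coefficient, and the weights would scale each sample's loss; both steps need an autodiff framework and are left out here.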
中文标题/摘要
标题:公平导向的视觉-语言模型医疗青光眼诊断微调
视觉-语言模型在医疗影像任务上达到专家级表现,但在不同人口群体中表现出显著的诊断准确率差异。我们引入了公平导向的低秩适应方法,结合参数效率与显式的公平优化。我们的主要算法贡献是一种可微分的MaxAccGap损失,能够实现端到端地优化不同人口群体之间的准确率一致性。我们提出了三种方法:FR-LoRA将MaxAccGap正则化整合到训练目标中,GR-LoRA应用逆频率加权以平衡梯度贡献,Hybrid-LoRA则结合了这两种机制。在10,000张青光眼底片图像上评估,GR-LoRA将诊断准确率差异降低了69%,同时保持53.15%的整体准确率。消融研究显示,较强的正则化强度可以实现最佳公平性,同时最小化准确率损失,而种族特定优化可实现60%的差异减少。我们的方法只需要0.24%的可训练参数,使得在资源受限的医疗保健环境中实现公平的医疗AI具有可行性。
Summary / 总结
This paper addresses diagnostic accuracy disparities across demographic groups when vision-language models are used for glaucoma diagnosis. It introduces fairness-aware Low-Rank Adaptation in three variants: FR-LoRA, GR-LoRA, and Hybrid-LoRA. GR-LoRA, which applies inverse frequency weighting, reduces accuracy disparities by 69% while maintaining 53.15% overall accuracy. The approach trains only 0.24% of the parameters, making it suitable for resource-constrained healthcare settings.
该研究针对医学青光眼诊断中视觉语言模型在不同人口群体间诊断准确性差异的问题,引入了一种公平性意识下的低秩适应方法,包括FR-LoRA、GR-LoRA和Hybrid-LoRA三种方法。GR-LoRA通过应用逆频率加权,显著减少了69%的诊断准确性差异,同时保持了53.15%的整体准确性。该方法所需的可训练参数极少,适用于资源受限的医疗保健环境。
VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
Authors: Shuhao Kang, Youqi Liao, Peijie Wang, Wenlong Liao, Qilin Zhang, Benjamin Busam, Xieyuanli Chen, Yun Liu
Venue: CVPR 2026
First: 2026-03-10T15:48:25+00:00 · Latest: 2026-03-10T15:48:25+00:00
Comments: CVPR 2026
Abstract
Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird's-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate that VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset are available at https://github.com/MCG-NKU/nku-3d-vision.
中文标题/摘要
标题:VLM-Loc:通过视觉语言模型在点云地图中的定位
文本到点云(T2P)定位旨在从自然语言描述中推断出3D点云地图中的精确空间位置,反映了人类通过语言感知和传达空间布局的方式。然而,现有方法大多依赖于浅层的文本-点云对应关系,缺乏有效的空间推理,限制了其在复杂环境中的准确性。为解决这一局限,我们提出了一种VLM-Loc框架,利用大型视觉语言模型(VLMs)的空间推理能力进行T2P定位。具体而言,我们将点云转换为鸟瞰图(BEV)图像和场景图,联合编码几何和语义上下文,为VLM提供结构化的输入,学习语言和空间语义之间的跨模态表示。在此基础上,我们引入了一种部分节点分配机制,明确将文本提示与场景图节点关联起来,实现可解释的空间推理以实现准确的定位。为了系统地评估不同场景下的细粒度T2P定位,我们提出了CityLoc基准,该基准基于多源点云构建。CityLoc上的实验表明,VLM-Loc在准确性和鲁棒性方面优于现有最先进的方法。我们的代码、模型和数据集可在 https://github.com/MCG-NKU/nku-3d-vision 获取。
Summary / 总结
VLM-Loc is a framework that uses vision-language models to perform text-to-point-cloud localization by transforming point clouds into bird's-eye-view images and scene graphs, which are then used to learn cross-modal representations. This approach introduces a partial node assignment mechanism to explicitly link textual cues with scene graph nodes, enhancing spatial reasoning. Experiments on the CityLoc benchmark show that VLM-Loc outperforms existing methods in terms of accuracy and robustness in complex environments.
VLM-Loc 通过利用大型视觉语言模型的空间推理能力来解决 T2P 定位中浅层文本-点云对应关系的局限性。它将点云转换为鸟瞰图图像和场景图,并引入部分节点分配机制,将文本线索与场景图节点关联起来。CityLoc 上的实验表明,VLM-Loc 在准确性和鲁棒性方面优于现有方法。
World2Mind: Cognition Toolkit for Allocentric Spatial Reasoning in Foundation Models
Authors: Shouwei Ruan, Bin Wang, Zhenyu Wu, Qihui Zhu, Yuxiang Zhang, Hang Su, Yubin Wang
First: 2026-03-10T15:12:14+00:00 · Latest: 2026-03-10T15:12:14+00:00
Abstract
Achieving robust spatial reasoning remains a fundamental challenge for current Multimodal Foundation Models (MFMs). Existing methods either overfit statistical shortcuts via 3D grounding data or remain confined to 2D visual perception, limiting both spatial reasoning accuracy and generalization in unseen scenarios. Inspired by the spatial cognitive mapping mechanisms of biological intelligence, we propose World2Mind, a training-free spatial intelligence toolkit. At its core, World2Mind leverages 3D reconstruction and instance segmentation models to construct structured spatial cognitive maps, empowering MFMs to proactively acquire targeted spatial knowledge regarding landmarks and routes of interest. To provide robust geometric-topological priors, World2Mind synthesizes an Allocentric-Spatial Tree (AST) that uses elliptical parameters to model the top-down layout of landmarks accurately. To mitigate the inherent inaccuracies of 3D reconstruction, we introduce a three-stage reasoning chain comprising tool invocation assessment, modality-decoupled cue collection, and geometry-semantics interwoven reasoning. Extensive experiments demonstrate that World2Mind boosts the performance of frontier models, such as GPT-5.2, by 5% to 18%. Astonishingly, relying solely on the AST-structured text, text-only foundation models can perform complex 3D spatial reasoning, achieving performance approaching that of advanced multimodal models.
中文标题/摘要
标题:World2Mind:面向基础模型以环境为中心(allocentric)空间推理的认知工具包
实现稳健的空间推理仍然是当前多模态基础模型(MFMs)的基本挑战。现有方法要么通过3D定位数据过度拟合统计捷径,要么仍然局限于2D视觉感知,限制了空间推理的准确性和在未见场景中的泛化能力。受生物智能空间认知映射机制的启发,我们提出了World2Mind,一种无需训练的空间智能工具包。其核心在于,World2Mind 利用3D重建和实例分割模型构建结构化空间认知地图,使MFMs能够主动获取关于感兴趣地标和路线的目标空间知识。为了提供稳健的几何-拓扑先验,World2Mind 合成了一种使用椭圆参数精确建模地标俯视布局的以环境为中心的空间树(Allocentric-Spatial Tree,AST)。为缓解3D重建固有的不准确性,我们引入了一个三阶段推理链,包括工具调用评估、模态解耦线索收集和几何-语义交织推理。大量实验表明,World2Mind 可以将前沿模型,如GPT-5.2,的性能提升5%~18%。令人惊讶的是,仅依赖于AST结构化的文本,纯文本基础模型就能进行复杂的3D空间推理,其性能接近高级多模态模型。
Summary / 总结
World2Mind is a training-free toolkit designed to enhance the allocentric spatial reasoning capabilities of Multimodal Foundation Models (MFMs). It constructs structured spatial cognitive maps using 3D reconstruction and instance segmentation, and introduces an Allocentric-Spatial Tree (AST) to provide geometric-topological priors. Experiments show that World2Mind improves the performance of models like GPT-5.2 by 5% to 18%, and even purely text-based models can perform complex 3D spatial reasoning using the AST-structured text, approaching the performance of advanced multimodal models.
World2Mind 是一个无需训练的工具包,旨在增强多模态基础模型(MFMs)以环境为中心(allocentric)的空间推理能力。它使用 3D 重建和实例分割构建结构化的空间认知地图,并引入了以环境为中心的空间树(Allocentric-Spatial Tree,AST)来提供稳健的几何-拓扑先验。实验表明,World2Mind 可以将 GPT-5.2 等模型的性能提升 5% 至 18%,并且仅依赖 AST 结构化文本的纯文本基础模型可以执行复杂的 3D 空间推理,接近高级多模态模型的性能。
Ego: Embedding-Guided Personalization of Vision-Language Models
Authors: Soroush Seifi, Simon Gardier, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi
First: 2026-03-10T15:10:41+00:00 · Latest: 2026-03-10T15:10:41+00:00
Abstract
AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model's inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model's internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and SOTA methods across various personalization settings including single-concept, multi-concept, and video personalization, demonstrating strong performance gains with minimal personalization overhead.
中文标题/摘要
标题:Ego:嵌入引导的视觉语言模型个性化
支持人类日常生活的AI助手正变得越来越可行,这得益于多模态语言模型的迅速发展。一个关键挑战在于克服这些模型的通用性,以提供个性化的体验。现有方法在个性化大型视觉语言模型时往往依赖额外的训练阶段,这限制了通用性和可扩展性,或者依赖于具有外部预训练模块的工程化管道,这阻碍了部署效率。在本工作中,我们提出了一种高效的个性化方法,利用模型内在捕捉个性化概念的能力。具体来说,我们通过利用模型内部的注意力机制提取主要代表目标概念的视觉标记。这些标记作为该特定概念的记忆,使模型能够在测试图像中出现时回忆和描述它。我们对我们的方法和当前最佳方法进行了全面和统一的评估,涵盖了单概念、多概念和个人化视频等各种个性化设置,展示了在最小个性化开销下显著的性能提升。
Summary / 总结
This paper addresses the challenge of personalizing large vision-language models to provide more tailored experiences. It proposes a method that leverages the model's internal attention mechanisms to extract visual tokens representing specific concepts, which are then used to personalize the model. The method is evaluated across different personalization settings and shows strong performance gains with minimal overhead compared to state-of-the-art approaches.
本文旨在解决大型视觉-语言模型个性化的问题,以提供更个性化的体验。提出了一种方法,利用模型内部的注意力机制提取特定概念的视觉令牌,然后用于个性化模型。该方法在不同的个性化设置下进行了评估,并显示出与最新方法相比具有较强的性能提升和较小的个性化开销。
LogoDiffuser: Training-Free Multilingual Logo Generation and Stylization via Letter-Aware Attention Control
Authors: Mingyu Kang, Hyein Seo, Yuna Jeong, Junhyeong Park, Yong Suk Choi
First: 2026-03-10T14:57:46+00:00 · Latest: 2026-03-10T14:57:46+00:00
Abstract
Recent advances in text-to-image generation have been remarkable, but generating multilingual design logos that harmoniously integrate visual and textual elements remains a challenging task. Existing methods often distort character geometry when applying creative styles and struggle to support multilingual text generation without additional training. To address these challenges, we propose LogoDiffuser, a training-free method that synthesizes multilingual logo designs using the multimodal diffusion transformer. Instead of using textual prompts, we input the target characters as images, enabling robust character structure control regardless of language. We first analyze the joint attention mechanism to identify core tokens, which are tokens that strongly respond to textual structures. With this observation, our method integrates character structure and visual design by injecting the most informative attention maps. Furthermore, we perform layer-wise aggregation of attention maps to mitigate attention shifts across layers and obtain consistent core tokens. Extensive experiments and user studies demonstrate that our method achieves state-of-the-art performance in multilingual logo generation.
中文标题/摘要
标题:LogoDiffuser:基于字母感知注意力控制的免训练多语言Logo生成与风格化
近年来,文本到图像生成取得了显著进展,但生成能够和谐整合视觉和文本元素的多语言设计Logo仍然是一个具有挑战性的任务。现有方法在应用创意风格时往往会扭曲字符几何形状,并且难以在无需额外训练的情况下支持多语言文本生成。为了解决这些挑战,我们提出了一种无需训练的方法LogoDiffuser,该方法使用多模态扩散变换器合成多语言Logo设计。我们不使用文本提示,而是将目标字符作为图像输入,从而无论语言如何都能实现稳健的字符结构控制。我们首先分析联合注意力机制以识别核心令牌,这些令牌强烈响应文本结构。基于这一观察,我们的方法通过注入最具信息量的注意力图来整合字符结构和视觉设计。此外,我们逐层聚合注意力图以减轻层间注意力偏移并获得一致的核心令牌。广泛的实验和用户研究证明,我们的方法在多语言Logo生成方面达到了最先进的性能。
Summary / 总结
LogoDiffuser is a training-free method for generating multilingual logo designs that harmoniously integrate visual and textual elements. It uses a multimodal diffusion transformer and takes the target characters as images rather than text prompts, which gives robust control of character structure regardless of language. The method identifies core tokens that respond strongly to textual structures, injects the most informative attention maps, and aggregates attention maps layer-wise to mitigate attention shifts across layers. Experiments and user studies report state-of-the-art performance in multilingual logo generation.
LogoDiffuser 是一种无需训练的方法,用于生成能够和谐结合视觉和文本元素的多语言logo设计。它使用多模态扩散变换器,并以图像形式输入目标字符,以在不同语言中稳健地控制字符结构。通过分析联合注意力机制并整合信息丰富的注意力图,LogoDiffuser 减轻了注意力转移并获得了一致的核心令牌,从而在多语言logo生成方面优于现有方法。
LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos
Authors: Lei Shi, Victor Aregbede, Andreas Persson, Martin Längkvist, Amy Loutfi, Stephanie Lowry
First: 2026-03-10T14:48:24+00:00 · Latest: 2026-03-10T14:48:24+00:00
Abstract
Procedure planning in instructional videos requires a model to predict a sequence of actions that transform a start visual observation into a goal. While most existing methods rely primarily on visual observations as input, they often struggle with the inherent ambiguity where different actions can appear visually similar. In this work, we argue that language descriptions offer a more distinctive representation in the latent space for procedure planning. We introduce Language-Aware Planning (LAP), a novel method that leverages the expressiveness of language to bridge visual observation and planning. LAP uses a fine-tuned Vision Language Model (VLM) to translate visual observations into text descriptions, predict actions, and extract text embeddings. These text embeddings are more distinctive than visual embeddings and are used in a diffusion model for planning action sequences. We evaluate LAP on three procedure planning benchmarks: CrossTask, COIN, and NIV. LAP achieves new state-of-the-art performance across multiple metrics and time horizons by a large margin, demonstrating the significant advantage of language-aware planning.
中文标题/摘要
标题:LAP:一种语言感知规划模型,用于教学视频中的程序规划
程序规划需要一种模型来预测将起始视觉观察转换为目标的一系列动作。虽然大多数现有方法主要依赖视觉观察作为输入,但它们往往难以处理不同动作在视觉上相似的固有歧义性。在本工作中,我们主张语言描述在潜在空间中提供了更具有区别的表示形式,有利于程序规划。我们引入了语言感知规划(LAP),这是一种新颖的方法,利用语言的表达能力连接视觉观察和规划。LAP 使用微调的视觉语言模型(VLM)将视觉观察翻译成文本描述,并预测动作和提取文本嵌入。这些文本嵌入比视觉嵌入更具区分性,并在规划动作序列的扩散模型中使用。我们在三个程序规划基准测试:CrossTask、Coin 和 NIV 上评估了 LAP。LAP 在多个指标和时间范围上取得了显著的新最佳性能,证明了语言感知规划的显著优势。
Summary / 总结
The research aims to improve procedure planning in instructional videos by leveraging language descriptions to address the ambiguity in visual observations. The Language-Aware Planning (LAP) model uses a fine-tuned Vision Language Model to convert visual observations into text descriptions and predict actions. LAP outperforms existing methods on three benchmarks, achieving state-of-the-art performance and demonstrating the benefits of language-aware planning in procedure planning.
研究旨在通过利用语言描述来解决视觉观察中的歧义问题,从而改进教学视频中的程序规划。Language-Aware Planning (LAP) 方法使用微调的视觉语言模型将视觉观察转换为文本描述,然后预测动作并提取具有区别的文本嵌入。LAP 在三个基准测试上表现出色,多项指标和时间范围内的性能均达到最新最佳水平,突显了语言感知规划的优势。
Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT
Authors: Peng Sun, Huawen Shen, Yi Ban, Tianfan Fu, Yanbo Wang, Yuqiang Li
First: 2026-03-10T14:23:38+00:00 · Latest: 2026-03-10T14:23:38+00:00
Abstract
Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.
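The selection rule described above can be sketched as follows. The sketch assumes the frozen evaluator has already produced, for each sample, a validity score for the answer given the image with and without the question; how CVS computes those scores and how it filters semantic-conflict noise is not specified in the abstract, so the absolute-difference scoring, the field names, and the toy values are hypothetical.

```python
from typing import Dict, List


def question_sensitivity(validity_with_q: float, validity_without_q: float) -> float:
    """How much conditioning on the question changes the answer-validity score."""
    return abs(validity_with_q - validity_without_q)


def select_top_fraction(samples: List[Dict], fraction: float) -> List[Dict]:
    """Keep the `fraction` of samples whose validity shifts most when the question is added."""
    ranked = sorted(
        samples,
        key=lambda s: question_sensitivity(s["with_q"], s["without_q"]),
        reverse=True,
    )
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical evaluator outputs: validity of the answer given (image, question)
# versus validity given the image alone.
pool = [
    {"id": "s1", "with_q": 0.90, "without_q": 0.85},  # answer is obvious without the question
    {"id": "s2", "with_q": 0.80, "without_q": 0.20},
    {"id": "s3", "with_q": 0.50, "without_q": 0.50},
    {"id": "s4", "with_q": 0.95, "without_q": 0.30},
    {"id": "s5", "with_q": 0.40, "without_q": 0.30},
]
selected = select_top_fraction(pool, fraction=0.4)
print([s["id"] for s in selected])
```

Samples such as s1 and s3, where the question barely moves the score, are the ones a method of this kind would treat as solvable through shortcuts and leave out.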
中文标题/摘要
标题:问题真的重要吗?无需训练的数据选择方法用于视觉-语言微调
视觉指令调优对于提高视觉-语言大型模型(VLLMs)至关重要。然而,许多样本可以通过语言模式或常识捷径解决,而无需真正的跨模态推理,限制了多模态学习的有效性。先前的数据选择方法通常依赖于昂贵的代理模型训练,并专注于难度或多样性,未能捕捉样本对视觉-语言联合推理的真实贡献。在本文中,我们提出了一种无需训练的数据选择方法CVS,基于这样的洞察:对于高质量的多模态样本,引入问题应该显著改变模型在给定图像情况下对答案有效性的评估。CVS 利用一个冻结的VLLM作为评估器,并测量在有和无问题条件下的答案有效性差异,从而识别需要视觉-语言联合推理的样本,同时过滤掉语义冲突噪声。在Vision-Flan和The Cauldron上的实验表明,CVS 在各个数据集上都取得了良好的性能。在Vision-Flan上,CVS 分别使用10%和15%的数据比全数据训练高出3.5%和4.8%,并且在高度异质的Cauldron数据集上仍然保持稳健。此外,与COINCIDE和XMAS相比,CVS 将计算成本分别降低了17.3%和44.4%。
Summary / 总结
The paper proposes CVS, a training-free data selection method for visual instruction tuning of vision-language large models. CVS uses a frozen VLLM to measure how much conditioning on the question changes the assessed validity of an answer given an image, which identifies samples that require genuine cross-modal reasoning. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using 10% and 15% of the data, respectively, and it reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.
本文提出了一种名为CVS的无需训练的数据选择方法,通过评估引入问题对模型答案有效性判断的影响,识别需要视觉-语言联合推理的样本,从而筛除视觉指令调优中无需跨模态推理即可解决的样本。实验表明,CVS在使用Vision-Flan数据集的10%和15%时分别比全数据训练高出3.5%和4.8%,并且在高度异质的Cauldron数据集上保持稳健,同时与COINCIDE和XMAS相比,计算成本分别降低了17.3%和44.4%。
MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models
Authors: Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu, Hung-yi Lee
First: 2026-03-10T14:22:22+00:00 · Latest: 2026-03-10T14:22:22+00:00
Comments: 6 pages, 3 figures, 3 tables. Dataset: https://huggingface.co/Multi-Audio-Grounding
Abstract
While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought raises the gain further, to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
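A minimal sketch of the permutation idea: present the same audio candidates in different orders, map each answer back to the original candidate, and take a majority vote. The abstract does not say how many orderings are used or how votes are aggregated, so enumerating every permutation (only practical for a handful of candidates) and the simple majority vote are assumptions. `biased_model` is a hypothetical stand-in for a LALM with a position bias, not a real API.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, List, Sequence


def permutational_self_consistency(
    candidates: Sequence[str],
    ask_model: Callable[[List[str]], int],
) -> int:
    """Return the original index of the candidate chosen most often across orderings."""
    votes: Counter = Counter()
    for order in permutations(range(len(candidates))):
        shuffled = [candidates[i] for i in order]
        picked_position = ask_model(shuffled)  # position within the shuffled list
        votes[order[picked_position]] += 1     # map back to the original index
    return votes.most_common(1)[0][0]


def biased_model(clips: List[str]) -> int:
    """Toy model: finds the piano clip unless it is last, then wrongly picks position 0."""
    position = clips.index("piano")
    return position if position != len(clips) - 1 else 0


clips = ["dog_bark", "piano", "speech"]
winner = permutational_self_consistency(clips, biased_model)
print(clips[winner])
```

With three candidates the toy model answers wrongly in two of the six orderings, yet the vote still recovers the piano clip, which is the kind of robustness to ordering that the strategy aims for.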
中文标题/摘要
标题:MUGEN:评估和改进大型音频语言模型的多音频理解
尽管多音频理解对于大型音频语言模型(LALMs)至关重要,但这一领域仍处于探索阶段。我们引入了MUGEN,这是一个全面的基准测试,评估了其在语音、通用音频和音乐方面的多音频理解能力。我们的实验揭示了多音频设置中的一致性弱点,并发现随着同时输入音频数量的增加,性能急剧下降,表明输入缩放是一个基本瓶颈。我们进一步研究了无需训练的策略,并观察到音频排列自一致性,这种策略通过多样化音频候选的顺序,有助于模型形成更稳健的综合预测,可获得高达6.28%的准确率提升。将此排列策略与思维链结合使用,进一步提高了性能至6.74%。这些结果揭示了当前LALMs中的盲点,并为评估复杂的听觉理解提供了基础。
Summary / 总结
The research aims to evaluate and enhance the multi-audio understanding of large audio-language models (LALMs) by introducing MUGEN, a comprehensive benchmark. Experiments show that performance significantly drops as the number of concurrent audio inputs increases, indicating a fundamental bottleneck. The study finds that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, can improve accuracy by up to 6.28%, and combining this with Chain-of-Thought further enhances performance. These findings highlight the limitations of current LALMs and provide insights for future development.
研究旨在通过引入MUGEN基准来评估和提升大型音频语言模型(LALMs)的多音频理解能力。实验表明,随着同时输入音频数量的增加,性能显著下降,表明存在一个基本瓶颈。研究发现,通过多样化音频候选的顺序来实现的Audio-Permutational Self-Consistency可以将准确率提高多达6.28%,而将其与Chain-of-Thought结合使用则进一步提升了性能。这些发现揭示了当前LALMs的局限性,并为未来的发展提供了参考。
Mitigating Long-Tail Bias in HOI Detection via Adaptive Diversity Cache
Authors: Yuqiu Jiang, Xiaozhen Qiao, Yifan Chen, Ye Zheng, Zhe Sun, Xuelong Li
First: 2025-11-24T06:30:08+00:00 · Latest: 2026-03-10T13:53:03+00:00
Abstract
Human-Object Interaction (HOI) detection is a fundamental task in computer vision, empowering machines to comprehend human-object relationships in diverse real-world scenarios. Recent advances in VLMs have significantly improved HOI detection by leveraging rich cross-modal representations. However, most existing VLM-based approaches rely heavily on additional training or prompt tuning, resulting in substantial computational overhead and limited scalability, particularly in long-tailed scenarios where rare interactions are severely underrepresented. In this paper, we propose the Adaptive Diversity Cache (ADC) module, a novel training-free and plug-and-play mechanism designed to mitigate long-tail bias in HOI detection. ADC constructs class-specific caches that accumulate high-confidence and diverse feature representations during inference. The method incorporates adaptive capacity allocation favoring rare categories and dynamic feature augmentation to enable robust prediction calibration without requiring additional training or fine-tuning. Extensive experiments on HICO-DET and V-COCO datasets show that ADC consistently improves existing HOI detectors, particularly enhancing rare category detection while preserving overall performance. These findings confirm the effectiveness of ADC as a training-free, plug-and-play solution for long-tail bias mitigation.
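The cache mechanism can be sketched roughly as below. This covers only two of the ingredients named above: capacity allocation that favors rare categories and retention of high-confidence features per class. The diversity criterion and the dynamic feature augmentation are omitted because the abstract does not describe them, and the inverse-frequency allocation rule, the class names, and all identifiers are hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple


def allocate_capacity(class_counts: Dict[str, int], total_slots: int,
                      min_slots: int = 1) -> Dict[str, int]:
    """Split cache slots in proportion to inverse class frequency (rare classes get more)."""
    inverse = {c: 1.0 / n for c, n in class_counts.items()}
    norm = sum(inverse.values())
    return {c: max(min_slots, round(total_slots * w / norm)) for c, w in inverse.items()}


class ClassCache:
    """Per-class cache that keeps the highest-confidence features seen at inference."""

    def __init__(self, capacity: Dict[str, int]) -> None:
        self.capacity = capacity
        self._heaps: Dict[str, List[Tuple[float, int, str]]] = defaultdict(list)
        self._counter = 0  # tie-breaker so features themselves are never compared

    def add(self, label: str, confidence: float, feature: str) -> None:
        heap = self._heaps[label]
        self._counter += 1
        entry = (confidence, self._counter, feature)
        if len(heap) < self.capacity[label]:
            heapq.heappush(heap, entry)
        elif confidence > heap[0][0]:
            heapq.heapreplace(heap, entry)  # evict the least confident entry

    def features(self, label: str) -> List[str]:
        """Cached features for a class, most confident first."""
        return [f for _, _, f in sorted(self._heaps[label], reverse=True)]


# Hypothetical interaction classes with long-tailed training frequencies.
capacity = allocate_capacity({"ride_bike": 900, "wash_cat": 100}, total_slots=10)
cache = ClassCache(capacity)
for conf, feat in [(0.6, "f1"), (0.9, "f2"), (0.7, "f3")]:
    cache.add("ride_bike", conf, feat)
print(capacity, cache.features("ride_bike"))
```

Features are plain strings here for brevity; in a detector they would be embedding vectors, and cached entries would be compared with a new sample's features to calibrate its prediction.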
中文标题/摘要
标题:通过自适应多样性缓存缓解HOI检测中的长尾偏差
人类-物体交互(HOI)检测是计算机视觉中的一个基本任务,使机器能够理解各种现实场景中的人物-物体关系。基于VLM的最新进展通过丰富的跨模态表示显著提高了HOI检测的性能。然而,大多数现有的VLM基方法严重依赖额外的训练或提示调优,导致大量的计算开销和有限的可扩展性,特别是在长尾场景中,稀有交互严重不足。在本文中,我们提出了一种自适应多样性缓存(ADC)模块,这是一种无需训练且即插即用的机制,旨在缓解HOI检测中的长尾偏差。ADC在推理过程中构建特定类别的缓存,累积高置信度和多样性的特征表示。该方法通过自适应容量分配优先考虑稀有类别,并动态特征增强,以实现稳健的预测校准,而无需额外的训练或微调。在HICO-DET和V-COCO数据集上的广泛实验表明,ADC可以一致地提高现有的HOI检测器,特别是在增强稀有类别检测方面,同时保持整体性能。这些发现证实了ADC作为无需训练、即插即用的解决方案,对于缓解长尾偏差的有效性。
Summary / 总结
The paper addresses the challenge of long-tail bias in HOI detection by proposing the Adaptive Diversity Cache (ADC) module. ADC constructs class-specific caches to accumulate high-confidence and diverse feature representations during inference, and incorporates adaptive capacity allocation and dynamic feature augmentation to improve rare category detection. Experiments on HICO-DET and V-COCO datasets demonstrate that ADC enhances rare category performance while maintaining overall detection accuracy.
论文提出了Adaptive Diversity Cache (ADC) 模块来解决HOI检测中的长尾偏见问题。ADC在推理过程中构建类别特定的缓存,积累高置信度和多样性的特征表示,并通过自适应容量分配和动态特征增强来提升罕见类别检测。实验结果表明,ADC在HICO-DET和V-COCO数据集上能够改进现有HOI检测器,特别是在罕见类别检测方面表现出色,且无需额外的训练或微调。
When to Lock Attention: Training-Free KV Control in Video Diffusion
Authors: Tianyi Zeng, Jincheng Gao, Tianyi Wang, Zijie Meng, Miao Zhang, Jun Yin, Haoyuan Sun, Junfeng Jiao, Christian Claudel, Junbo Tan, Xueqian Wang
First: 2026-03-10T13:31:38+00:00 · Latest: 2026-03-10T13:31:38+00:00
Comments: 18 pages, 9 figures, 3 tables
Abstract
Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model's capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building upon this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based model. Extensive experiments validate that our method outperforms existing approaches, improving foreground quality while maintaining high background fidelity across various video editing tasks.
中文标题/摘要
标题:何时锁定注意力:基于DiT的视频扩散模型的无训练KV控制
在保持背景一致性的同时提升前景质量仍然是视频编辑中的核心挑战。注入全图信息往往会引入背景伪影,而刚性背景锁定则严重限制了模型在前景生成方面的容量。为了解决这一问题,我们提出了一种名为KV-Lock的无训练框架,专门针对基于DiT的视频扩散模型。我们的核心见解是,幻觉度量(去噪预测的方差)直接量化了生成多样性,这与无分类引导(CFG)尺度内在相关。基于此,KV-Lock利用扩散幻觉检测动态调度两个关键组件:缓存背景键值(KVs)与新生成KVs的融合比例,以及CFG尺度。当检测到幻觉风险时,KV-Lock会加强背景KV锁定,并同时增强对前景生成的条件引导,从而减轻伪影并提高生成保真度。作为一种无训练、即插即用模块,KV-Lock可以轻松集成到任何预训练的DiT模型中。大量实验验证了我们的方法在各种视频编辑任务中,在保持高背景保真度的同时提升了前景质量。
Summary / 总结
The paper addresses the challenge of maintaining background consistency while enhancing foreground quality in video editing. It introduces KV-Lock, a training-free framework for DiT-based video diffusion models. KV-Lock dynamically adjusts the fusion ratio between cached background key-values and newly generated key-values, and the classifier-free guidance scale based on hallucination detection. This approach effectively reduces background artifacts and improves foreground generation fidelity, outperforming existing methods in various video editing tasks.
论文旨在解决在视频编辑中保持背景一致性的同时提升前景质量的问题。提出了一种名为KV-Lock的训练-free框架,适用于DiT基于的视频扩散模型。KV-Lock根据扩散幻觉检测动态调整缓存背景关键值和新生成关键值的融合比例,以及分类器自由引导比例。这种方法可以减少伪影并提高生成保真度,在各种视频编辑任务中优于现有方法。
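The scheduling idea in the entry above (a variance-based hallucination signal driving both the KV fusion ratio and the CFG scale) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the linear mapping, the ranges, the `variance_cap` normalizer, and the function names do not come from the paper.

```python
from statistics import pvariance


def hallucination_risk(predictions, variance_cap=1.0):
    """Variance of (scalar) denoising predictions, squashed into [0, 1].

    `variance_cap` is a hypothetical normalizer, not a value from the paper.
    """
    if variance_cap <= 0:
        raise ValueError("variance_cap must be positive")
    return min(pvariance(predictions) / variance_cap, 1.0)


def schedule(risk, lock_range=(0.2, 0.8), cfg_range=(4.0, 8.0)):
    """Assumed linear schedule: higher risk gives stronger background
    locking and a larger CFG scale."""
    lock = lock_range[0] + risk * (lock_range[1] - lock_range[0])
    cfg = cfg_range[0] + risk * (cfg_range[1] - cfg_range[0])
    return lock, cfg


def fuse_kv(cached, fresh, lock):
    """Blend cached background key/value entries with newly generated ones."""
    return [lock * c + (1.0 - lock) * f for c, f in zip(cached, fresh)]


risk = hallucination_risk([0.0, 2.0], variance_cap=2.0)
lock, cfg = schedule(risk)
fused = fuse_kv([1.0, 1.0], [0.0, 0.0], lock)
```

In a real DiT the blend would apply only to background tokens and to full key/value tensors; flat lists are used here to keep the sketch dependency-free.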
OTPL-VIO: Robust Visual-Inertial Odometry with Optimal Transport Line Association and Adaptive Uncertainty
Authors: Zikun Chen, Wentao Zhao, Yihe Niu, Tianchen Deng, Jingchuan Wang
First: 2026-03-10T13:30:05+00:00 · Latest: 2026-03-10T13:30:05+00:00
Abstract
Robust stereo visual-inertial odometry (VIO) remains challenging in low-texture scenes and under abrupt illumination changes, where point features become sparse and unstable, leading to ambiguous association and under-constrained estimation. Line structures offer complementary geometric cues, yet many efficient point-line systems still rely on point-guided line association, which can break down when point support is weak and may lead to biased constraints. We present a stereo point-line VIO system in which line segments are equipped with dedicated deep descriptors and matched using an entropy-regularized optimal transport formulation, enabling globally consistent correspondences under ambiguity, outliers, and partial observations. The proposed descriptor is training-free and is computed by sampling and pooling network feature maps. To improve estimation stability, we analyze the impact of line measurement noise and introduce reliability-adaptive weighting to regulate the influence of line constraints during optimization. Experiments on EuRoC and UMA-VI, together with real-world deployments in low-texture and illumination-challenging environments, demonstrate improved accuracy and robustness over representative baselines while maintaining real-time performance.
中文标题/摘要
标题:OTPL-VIO:具有最优传输线关联和自适应不确定性稳健的视觉惯性里程计
在低纹理场景和突然光照变化下,稳健的立体视觉惯性里程计(VIO)仍然具有挑战性,此时点特征变得稀疏且不稳定,导致关联模糊和欠约束估计。线结构提供了补充的几何线索,但许多高效的点线系统仍然依赖于点导向的线关联,当点支持较弱时,这可能会失效并导致有偏的约束。我们提出了一种立体点线VIO系统,在该系统中,线段配备了专用的深度描述符,并使用熵正则化的最优传输公式进行匹配,从而在存在歧义、离群值和部分观测的情况下实现全局一致的对应关系。所提出的描述符是训练免费的,并通过采样和聚合网络特征图进行计算。为了提高估计稳定性,我们分析了线测量噪声的影响,并引入可靠性自适应加权来调节优化过程中线约束的影响。在EuRoC和UMA-VI上的实验,以及在低纹理和光照挑战性环境中的实际部署,表明与代表性基线相比,该方法在保持实时性能的同时具有更高的准确性和鲁棒性。
Summary / 总结
The research addresses the challenge of robust stereo visual-inertial odometry in low-texture and illumination-challenging environments. It proposes a system that uses line segments with dedicated deep descriptors and matches them using an entropy-regularized optimal transport formulation, which provides globally consistent correspondences. The system also includes reliability-adaptive weighting to enhance estimation stability. Experiments show improved accuracy and robustness compared to existing methods while maintaining real-time performance.
研究针对低纹理和光照变化环境下的鲁棒立体视觉惯性里程计问题。提出了一种系统,使用带有专用深度描述符的线段,并采用熵正则化的最优传输公式进行匹配,从而提供全局一致的对应关系。该系统还包含可靠性自适应加权,以增强估计稳定性。实验表明,与现有方法相比,该系统在保持实时性能的同时具有更高的准确性和鲁棒性。
X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models
Authors: Yueen Ma, Irwin King
First: 2026-03-10T13:10:18+00:00 · Latest: 2026-03-10T13:10:18+00:00
Abstract
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods are isolated, focusing on specific domains such as online SLAM, semantic enrichment, or 3DGS for unposed images. In this paper, we introduce X-GS, an extensible open framework that unifies a broad range of techniques to enable real-time 3DGS-based online SLAM enriched with semantics, bridging the gap to downstream multimodal models. At the core of X-GS is a highly efficient pipeline called X-GS-Perceiver, capable of taking unposed RGB (or optionally RGB-D) video streams as input to co-optimize geometry and poses, and distill high-dimensional semantic features from vision foundation models into the 3D Gaussians. We achieve real-time performance through a novel online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design. The semantic 3D Gaussians can then be utilized by vision-language models within the X-GS-Thinker component, enabling downstream tasks such as object detection, zero-shot caption generation, and potentially embodied tasks. Experimental results on real-world datasets showcase the efficacy, efficiency, and newly unlocked multimodal capabilities of the X-GS framework.
中文标题/摘要
标题:X-GS:一种统一3DGS架构与下游多模态模型的可扩展开源框架
3D高斯斑点化(3DGS)已成为新型视图合成的强大技术,并进一步扩展到众多空间AI应用中。然而,大多数现有的3DGS方法都是孤立的,专注于特定领域,如在线SLAM、语义增强或未摆拍图像的3DGS。本文介绍了一种可扩展的开源框架X-GS,该框架统一了多种技术,以实现基于3DGS的实时在线SLAM,并结合语义信息,填补了与下游多模态模型之间的差距。X-GS的核心是一个高效的名为X-GS-Perceiver的流水线,能够接受未摆拍的RGB(或可选的RGB-D)视频流作为输入,同时优化几何形状和姿态,并将来自视觉基础模型的高维语义特征提炼到3D高斯分布中。我们通过一种新颖的在线向量量化(VQ)模块、GPU加速的网格采样方案以及高度并行化的流水线设计实现了实时性能。语义3D高斯分布可以由X-GS-Thinker组件中的视觉语言模型利用,从而实现下游任务,如物体检测、零样本描述生成,甚至可能实现具身任务。在真实数据集上的实验结果展示了X-GS框架的有效性、效率以及新解锁的多模态能力。
Summary / 总结
X-GS is an extensible open framework that unifies various 3D Gaussian Splatting (3DGS) techniques for real-time online SLAM with semantic enrichment, bridging to downstream multimodal models. It uses an efficient X-GS-Perceiver pipeline to co-optimize geometry and poses from unposed RGB (or RGB-D) video streams, and a novel online Vector Quantization module to distill high-dimensional semantic features into 3D Gaussians for real-time performance. The framework demonstrates efficacy and new multimodal capabilities on real-world datasets.
X-GS 是一个可扩展的开放框架,统一了多种 3D 高斯泼溅 (3DGS) 技术,实现实时在线 SLAM 并带有语义增强,连接到下游多模态模型。它包含一个高效的 X-GS-Perceiver 管道,处理无位姿的 RGB(或 RGB-D)视频流以协同优化几何和姿态,并提取高维语义特征。通过新型在线向量量化模块、GPU 加速的网格采样和并行化管道设计实现实时性能。该框架在真实世界数据集上的实验结果展示了其有效性和新的多模态能力。
Making Training-Free Diffusion Segmentors Scale with the Generative Power
Authors: Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang
Venue: CVPR 2026
First: 2026-03-06T11:35:37+00:00 · Latest: 2026-03-10T12:51:00+00:00
Comments: Accepted to CVPR 2026
Abstract
As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to so-called training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation; (ii) even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Code is available at https://github.com/Darkbblue/goca.
中文标题/摘要
标题:利用生成能力使无训练分割器扩展
作为强大的生成模型,文本到图像的扩散模型最近被探索用于判别任务。一系列研究致力于在无需进一步训练的情况下,将预训练的扩散模型适应于语义分割,从而产生了无训练的扩散分割器。这些方法通常依赖于模型注意力层的交叉注意力图,这些图被认为捕捉了图像像素和文本标记之间的语义关系。理想情况下,此类方法应受益于更强大的扩散模型,即更强的生成能力应导致更好的分割。然而,我们观察到现有方法往往无法相应地扩展。为了理解这一问题,我们识别了两个潜在的差距:(i) 交叉注意力是在多个头和层之间计算的,但这些单独的注意力图与统一的全局表示之间存在差异。(ii) 即使有全局图,它也不直接转化为准确的语义相关性,因为不同文本标记之间的评分不平衡。为了弥合这些差距,我们提出了两种技术:自动聚合和逐像素重新缩放,这两者共同使无训练分割能够更好地利用生成能力。我们在标准语义分割基准上评估了我们的方法,并进一步将其集成到生成技术中,展示了更好的性能和更广泛的适用性。代码在 https://github.com/Darkbblue/goca.
Summary / 总结
This paper addresses the challenge of scaling training-free diffusion segmentors with the generative power of diffusion models. The authors identify two key issues: discrepancies between individual attention maps and a unified global representation, and score imbalances among text tokens. To address these, they propose auto aggregation and per-pixel rescaling techniques. Evaluations on standard benchmarks show improved performance, and the method is also integrated into a generative technique, enhancing its broad applicability.
本文解决了训练-free 扩散分割器随扩散模型生成能力增强而难以扩展的问题。研究识别了两个关键问题:个体注意力图与统一全局表示之间的差异,以及文本标记之间的评分不平衡。为此,作者提出了自动聚合和逐像素重新缩放两种技术。在标准分割基准上的评估显示了性能的提升,并将该方法集成到生成技术中展示了广泛的应用性。
PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters
Authors: Yinghong Yu, Guangyuan Li, Jiancheng Yang
First: 2026-03-04T15:23:30+00:00 · Latest: 2026-03-10T12:36:39+00:00
Abstract
Large-scale 2D foundation models exhibit strong transferable representations, yet extending them to 3D volumetric data typically requires retraining, adapters, or architectural redesign. We introduce PlaneCycle, a training-free, adapter-free operator for architecture-agnostic 2D-to-3D lifting of foundation models. PlaneCycle reuses the original pretrained 2D backbone by cyclically distributing spatial aggregation across orthogonal HW, DW, and DH planes throughout network depth, enabling progressive 3D fusion while preserving pretrained inductive biases. The method introduces no additional parameters and is applicable to arbitrary 2D networks. Using pretrained DINOv3 models, we evaluate PlaneCycle on six 3D classification and three 3D segmentation benchmarks. Without any training, the lifted models exhibit intrinsic 3D fusion capability and, under linear probing, outperform slice-wise 2D baselines and strong 3D counterparts, approaching the performance of fully trained models. With full fine-tuning, PlaneCycle matches standard 3D architectures, highlighting its potential as a seamless and practical 2D-to-3D lifting operator. These results demonstrate that 3D capability can be unlocked from pretrained 2D foundation models without structural modification or retraining. Code is available at https://github.com/HINTLab/PlaneCycle.
中文标题/摘要
标题:PlaneCycle:无需训练的2D到3D基础模型提升操作
大规模的2D基础模型表现出强大的可迁移表示,但将其扩展到3D体数据通常需要重新训练、适配器或架构重设计。我们引入了PlaneCycle,这是一种无需训练、无需适配器的操作符,用于基础模型的架构无关的2D到3D提升。PlaneCycle 通过在网络深度中周期性地在正交的HW、DW和DH平面间分配空间聚合,重用了原始预训练的2D主干,从而实现渐进的3D融合并保留预训练的归纳偏差。该方法不引入额外参数,并适用于任意2D网络。使用预训练的DINOv3模型,我们在六个3D分类和三个3D分割基准上评估了PlaneCycle。在无需训练的情况下,提升后的模型展示了内在的3D融合能力,并在线性探测中优于切片式的2D基线和强大的3D对应物,接近完全训练模型的性能。在完全微调后,PlaneCycle 达到了标准3D架构的性能,突显了其作为无缝且实用的2D到3D提升操作符的潜力。这些结果表明,3D能力可以从预训练的2D基础模型中解锁,无需结构修改或重新训练。代码可在 https://github.com/HINTLab/PlaneCycle 获取。
Summary / 总结
PlaneCycle is a training-free method for lifting 2D foundation models to 3D volumetric data without adapters or architectural redesign. It cyclically distributes spatial aggregation across the orthogonal HW, DW, and DH planes throughout network depth, enabling progressive 3D fusion while preserving pretrained inductive biases. On six 3D classification and three 3D segmentation benchmarks, the lifted models outperform slice-wise 2D baselines and strong 3D counterparts under linear probing, approaching the performance of fully trained models, and match standard 3D architectures with full fine-tuning.
PlaneCycle 是一种无需训练且无需使用适配器或重新设计架构的方法,用于将 2D 基础模型转换为 3D 模型。它通过在正交平面间周期性地分布空间聚合,实现 3D 融合并保留预训练的归纳偏差。在六个 3D 分类和三个 3D 分割基准上的评估表明,PlaneCycle 在线性探针下优于切片式 2D 基线,并接近完全训练的 3D 模型的性能,即使进行全微调也能证明其在利用预训练 2D 模型进行 3D 任务方面的有效性。
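The plane-cycling idea above can be made concrete with a toy example. In the sketch below, a slice-wise mean stands in for a pretrained 2D layer's spatial aggregation (an assumption made purely for illustration; the actual method runs the original 2D backbone on the slices), and the volume is a nested list indexed `[d][h][w]`.

```python
PLANES = ("HW", "DW", "DH")
# Axis held fixed while a 2D operator runs over the other two.
FIXED_AXIS = {"HW": 0, "DW": 1, "DH": 2}


def aggregate_in_plane(volume, plane):
    """Replace each voxel by the mean of the 2D slice it lies in."""
    axis = FIXED_AXIS[plane]
    D, H, W = len(volume), len(volume[0]), len(volume[0][0])
    sums, counts = {}, {}
    for d in range(D):
        for h in range(H):
            for w in range(W):
                key = (d, h, w)[axis]
                sums[key] = sums.get(key, 0.0) + volume[d][h][w]
                counts[key] = counts.get(key, 0) + 1
    return [[[sums[(d, h, w)[axis]] / counts[(d, h, w)[axis]]
              for w in range(W)] for h in range(H)] for d in range(D)]


def lift(volume, num_layers):
    """Apply the 2D operator layer by layer, cycling HW -> DW -> DH."""
    for layer in range(num_layers):
        volume = aggregate_in_plane(volume, PLANES[layer % len(PLANES)])
    return volume


# A 2x2x2 volume with a single non-zero voxel.
volume = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]
volume[0][0][0] = 8.0
after_one = lift(volume, 1)  # information stays inside the d = 0 slice
after_two = lift(volume, 2)  # information has reached every voxel
```

After one HW layer the signal is still confined to its original depth slice; once a second plane is used, it has propagated through the whole volume, which illustrates how purely 2D operators can achieve 3D fusion when their planes are cycled across depth.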
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
Authors: Weijia Fan, Ruiping Liu, Jiale Wei, Yufan Chen, Junwei Zheng, Zichao Zeng, Jiaming Zhang, Qiufu Li, Linlin Shen, Rainer Stiefelhagen
Venue: CVPR 2026
First: 2026-03-10T12:19:50+00:00 · Latest: 2026-03-10T12:19:50+00:00
Comments: Accepted by CVPR 2026. Project page: https://github.com/InSAI-Lab/PanoVQA
Abstract
Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM) paradigm, a unified $360^\circ$ vision-language reasoning that is more than the sum of its pinhole counterparts. Besides, we present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.
中文标题/摘要
标题:超越总和:全景语言模型在全视角场景中的应用
现有的视觉-语言模型(VLMs)针对的是针孔成像,通过拼接多个窄视角输入来构建完整的全视角理解。然而,这种多视角感知忽略了全景图本身固有的整体空间和上下文关系。在此项工作中,我们提出了全景语言模型(PLM)范式,这是一种统一的360°视觉-语言推理,超越了其针孔成像的总和。此外,我们还介绍了PanoVQA,这是一个包含不良全视角场景的大规模全景问答数据集,能够支持在物体遮挡和交通事故下的全面推理。为了建立PLM的基础,我们开发了一个即插即用的全景稀疏注意力模块,使现有的针孔基视觉-语言模型能够处理球面全景图而无需重新训练。广泛的实验表明,我们的PLM在具有挑战性的全视角场景中实现了更优的鲁棒性和整体推理,从而获得了超越其狭窄部分总和的理解。项目页面:https://github.com/InSAI-Lab/PanoVQA.
Summary / 总结
This work introduces the Panorama-Language Modeling (PLM) paradigm, which enhances vision-language models (VLMs) for understanding 360-degree omni-scenes, overcoming the limitations of pinhole imagery. The authors present PanoVQA, a large-scale dataset for panoramic VQA involving adverse omni-scenes. They develop a plug-and-play panoramic sparse attention module to enable existing VLMs to process equirectangular panoramas. Experiments show that PLM outperforms traditional VLMs in robustness and holistic reasoning under challenging omni-scenes, demonstrating superior performance.
该研究引入了全景语言建模(PLM)范式,旨在增强视觉语言模型(VLMs)以理解360度全景场景,克服了针孔成像的局限性。作者提出了PanoVQA,一个涉及不良全景场景的大规模全景问答数据集。他们开发了一个插件式全景稀疏注意力模块,使现有的VLMs能够处理球形全景图而无需重新训练。实验表明,PLM在具有挑战性的全景场景中表现出更强的鲁棒性和整体推理能力,性能更优。
GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision
Authors: Lang Sun, Ronghao Fu, Zhuoran Duan, Haoran Liu, Xueyan Liu, Bo Yang
First: 2026-03-10T11:59:05+00:00 · Latest: 2026-03-10T11:59:05+00:00
Abstract
While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.
中文标题/摘要
标题:GeoSolver:通过精细粒度的过程监督在遥感中扩展测试时推理
尽管视觉-语言模型(VLMs)在遥感解释方面取得了显著进展,但使它们能够执行复杂的、逐步的推理仍然极具挑战性。最近将链式思考(CoT)推理引入该领域的努力显示出前景,但确保这些中间步骤的视觉真实性仍然是一个关键瓶颈。为了解决这个问题,我们引入了GeoSolver,这是一种新颖的框架,将遥感推理转向可验证的过程监督强化学习。我们首先构建了Geo-PRM-2M,这是一个通过熵引导的蒙特卡洛树搜索(MCTS)和目标视觉幻觉注入合成的大规模、标记级过程监督数据集。在此数据集的基础上,我们训练了GeoPRM,这是一种标记级过程奖励模型(PRM),提供详细的忠实度反馈。为了有效利用这些验证信号,我们提出了过程感知树-GRPO,这是一种结合树结构探索和忠实度加权奖励机制的强化学习算法,以精确分配中间步骤的信用。广泛的实验表明,我们的模型GeoSolver-9B在各种遥感基准测试中达到了最先进的性能。关键的是,GeoPRM解锁了稳健的测试时扩展(TTS)。作为通用地理空间验证器,它无缝地扩展了GeoSolver-9B的性能,并直接增强了通用VLMs,突显了其出色的跨模型泛化能力。
Summary / 总结
GeoSolver is a novel framework designed to enhance the test-time reasoning in remote sensing by introducing fine-grained process supervision. It constructs a large-scale dataset, Geo-PRM-2M, and trains a token-level process reward model (GeoPRM) to provide detailed feedback on visual faithfulness. GeoSolver then uses Process-Aware Tree-GRPO, a reinforcement learning algorithm, to integrate tree-structured exploration with a faithfulness-weighted reward mechanism. The resulting GeoSolver-9B model achieves state-of-the-art performance across various remote sensing benchmarks and demonstrates robust Test-Time Scaling (TTS) through GeoPRM, which enhances the performance of general-purpose VLMs.
GeoSolver 是一个通过引入细粒度过程监督来增强遥感测试时推理的框架。它构建了一个大规模数据集 Geo-PRM-2M,并训练了一个标记级过程奖励模型(GeoPRM)以提供关于视觉忠实度的详细反馈。GeoSolver 然后使用 Process-Aware Tree-GRPO,这是一种强化学习算法,通过树结构探索和忠实度加权奖励机制来精确分配中间步骤的信用。最终,GeoSolver-9B 模型实现了最先进的性能和稳健的测试时扩展,有效增强了通用视觉语言模型。
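One common way a process reward model enables test-time scaling is best-of-N selection: sample several reasoning traces and keep the one the verifier rates highest. The sketch below assumes that setup; the abstract does not specify the selection rule, so the `min` aggregation, the function names, and the example scores are all hypothetical.

```python
def response_score(token_scores):
    """Aggregate token-level faithfulness scores into one number.

    Taking the minimum (an assumed choice) penalizes a trace for its
    single least faithful step.
    """
    return min(token_scores)


def best_of_n(candidates, scorer):
    """Return the candidate whose verifier scores aggregate highest.

    `scorer` is a hypothetical stand-in for a token-level PRM: it maps a
    candidate response to a list of per-token scores in [0, 1].
    """
    return max(candidates, key=lambda c: response_score(scorer(c)))


# Made-up verifier outputs for two sampled reasoning traces.
fake_scores = {
    "trace_a": [0.99, 0.10, 0.99],  # fluent, but one hallucinated step
    "trace_b": [0.70, 0.75, 0.72],  # consistently grounded
}
chosen = best_of_n(list(fake_scores), fake_scores.get)
```

Sampling more candidates gives the verifier more options to choose from, which is how additional inference compute can translate into accuracy without retraining the generator.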
Memory-Guided View Refinement for Dynamic Human-in-the-loop EQA
Authors: Xin Lu, Rui Li, Xun Huang, Weixin Li, Chuanqing Zhuang, Jiayuan Li, Zhengda Lu, Jun Xiao, Yunhong Wang
First: 2026-03-10T11:51:54+00:00 · Latest: 2026-03-10T11:51:54+00:00
Abstract
Embodied Question Answering (EQA) has traditionally been evaluated in temporally stable environments where visual evidence can be accumulated reliably. However, in dynamic, human-populated scenes, human activities and occlusions introduce significant perceptual non-stationarity: task-relevant cues are transient and view-dependent, while a store-then-retrieve strategy over-accumulates redundant evidence and increases inference cost. This setting exposes two practical challenges for EQA agents: resolving ambiguity caused by viewpoint-dependent occlusions, and maintaining compact yet up-to-date evidence for efficient inference. To enable systematic study of this setting, we introduce DynHiL-EQA, a human-in-the-loop EQA dataset with two subsets: a Dynamic subset featuring human activities and temporal changes, and a Static subset with temporally stable observations. To address the above challenges, we present DIVRR (Dynamic-Informed View Refinement and Relevance-guided Adaptive Memory Selection), a training-free framework that couples relevance-guided view refinement with selective memory admission. By verifying ambiguous observations before committing them and retaining only informative evidence, DIVRR improves robustness under occlusions while preserving fast inference with compact memory. Extensive experiments on DynHiL-EQA and the established HM-EQA dataset demonstrate that DIVRR consistently improves over existing baselines in both dynamic and static settings while maintaining high inference efficiency.
中文标题/摘要
标题:基于记忆的视角精炼以实现动态人类在环EQA
传统的EQA(具身问答)评估通常在视觉证据可以可靠积累的时序稳定环境中进行。然而,在动态的人类聚集场景中,人类活动和遮挡引入了显著的感知非稳态性:任务相关线索是瞬时且视角依赖的,而存储然后检索的策略会导致冗余证据的过度积累并增加推理成本。这种设置暴露了EQA代理的两个实际挑战:解决由视角依赖遮挡引起的歧义,并保持紧凑且最新的证据以实现高效的推理。为了系统地研究这一设置,我们引入了DynHiL-EQA数据集,其中包括两个子集:动态子集展示了人类活动和时间变化,静态子集则包含时序稳定的观察。为了解决上述挑战,我们提出了DIVRR(动态导向的视角精炼和相关性导向的适应性记忆选择)框架,该框架结合了相关性导向的视角精炼和选择性记忆准入。通过在提交之前验证模糊的观察并仅保留有信息性的证据,DIVRR在遮挡下提高了鲁棒性,同时保持了快速推理和紧凑的记忆。在DynHiL-EQA和现有的HM-EQA数据集上的广泛实验表明,DIVRR在动态和静态设置中都优于现有基线,同时保持了高推理效率。
Summary / 总结
The research addresses the challenges of embodied question answering (EQA) in dynamic, human-populated scenes, where visual evidence is transient and view-dependent. The method, DIVRR, couples relevance-guided view refinement with selective memory admission to resolve occlusion ambiguity and maintain compact, up-to-date evidence. Experiments show that DIVRR improves robustness under occlusions and maintains high inference efficiency in both dynamic and static settings compared to existing baselines.
论文探讨了在动态、有人类活动的场景中进行体态问答(EQA)的挑战,视觉证据是瞬时且视角依赖的。它引入了DIVRR框架,结合了相关性导向的视角修正和选择性记忆准入,以应对遮挡并保持高效的推理。实验结果表明,DIVRR在动态和静态场景中都优于现有基线,同时保持了高推理效率。
Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
Authors: Ming Nie, Chunwei Wang, Jianhua Han, Hang Xu, Li Zhang
First: 2026-03-10T11:49:20+00:00 · Latest: 2026-03-10T11:49:20+00:00
Abstract
Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
中文标题/摘要
标题:通过组相对策略优化实现统一多模态交织生成
统一的视觉语言模型在多模态理解和生成方面取得了显著进展,但在生成多模态交织输出方面仍存在不足,这对于视觉叙事和逐步视觉推理等任务至关重要。在本文中,我们提出了一种基于强化学习的后训练策略,以在现有统一模型中解锁这种能力,而无需依赖大规模的多模态交织数据集。我们从一个混合数据集开始,该数据集包含精心策划的交织序列和少量用于多模态理解和图文生成的数据,使模型接触到交织生成模式,同时保留其预训练能力。为了进一步细化交织生成,我们提出了一种统一的策略优化框架,将组相对策略优化(GRPO)扩展到多模态设置。我们的方法在单一解码轨迹中联合建模文本和图像生成,并使用我们新颖的混合奖励进行优化,这些奖励涵盖了文本相关性、视觉-文本对齐和结构保真度。此外,我们还引入了过程级奖励,以提供逐步指导,提高复杂多模态任务的训练效率。在MMIE和InterleavedBench上的实验表明,我们的方法显著提高了多模态交织生成的质量和连贯性。
Summary / 总结
This work addresses the limitation of unified vision-language models in generating multimodal interleaved outputs, crucial for tasks like visual storytelling. It proposes a reinforcement learning-based post-training strategy using a hybrid dataset and a unified policy optimization framework, extending Group Relative Policy Optimization (GRPO) to multimodal settings. The approach optimizes both text and image generation within a single trajectory, using hybrid rewards for textual relevance, visual-text alignment, and structural fidelity, and incorporating process-level rewards for step-wise guidance. Experiments show significant improvements in the quality and coherence of multimodal interleaved generation.
本文针对统一的视觉-语言模型在生成多模态交织输出方面的局限性,提出了一个基于强化学习的后训练策略,使用混合数据集和统一的策略优化框架,将组相对策略优化(GRPO)扩展到多模态设置。该方法将文本和图像生成一起优化,并使用混合奖励关注文本相关性、视觉-文本对齐和结构保真度,从而提高了多模态交织生成的质量和连贯性。
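The core of Group Relative Policy Optimization is that each sampled output is scored relative to the other outputs drawn for the same prompt, so no learned value function is needed. A minimal sketch of that normalization follows, with a weighted-sum hybrid reward whose weights and component names are assumptions for illustration (the abstract names the three reward aspects but not how they are combined).

```python
from statistics import mean, pstdev


def hybrid_reward(textual_relevance, visual_text_alignment,
                  structural_fidelity, weights=(1.0, 1.0, 1.0)):
    """Assumed combination: a weighted sum of the three reward aspects."""
    parts = (textual_relevance, visual_text_alignment, structural_fidelity)
    return sum(w * p for w, p in zip(weights, parts))


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three interleaved text-image trajectories sampled for one prompt.
rewards = [
    hybrid_reward(0.25, 0.25, 0.5),
    hybrid_reward(0.5, 0.5, 1.0),
    hybrid_reward(1.0, 1.0, 1.0),
]
advantages = group_relative_advantages(rewards)
```

Trajectories that score above the group mean receive positive advantages and are reinforced; those below are discouraged. If all samples in a group score identically, every advantage is zero and the group contributes no gradient signal.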
AVGGT: Rethinking Global Attention for Accelerating VGGT
Authors: Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang, Liqing Zhang, He Wang, Jianfu Zhang
First: 2025-12-02T09:08:18+00:00 · Latest: 2026-03-10T11:49:12+00:00
Abstract
Models such as VGGT and $π^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $π^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $π^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves substantial inference acceleration across different context lengths, yielding about $2\times$ speedup at 100 frames, $4$--$5\times$ at 300 frames, and $8$--$10\times$ at 800 frames, while matching or slightly improving the accuracy of the original models and remaining robust in extremely dense multi-view settings where prior sparse-attention baselines fail.
中文标题/摘要
标题:AVGGT:重新思考VGGT中的全局注意力加速
模型如VGGT和$π^3$在多视图3D性能方面表现出色,但它们对全局自注意力的重度依赖导致了高计算成本。现有的稀疏注意力变体提供了一定的加速,但缺乏对全局注意力如何促进多视图推理的系统分析。在本文中,我们首先对VGGT和$π^3$中的全局注意力模块进行了深入研究,以更好地理解它们的作用。我们的分析揭示了交替全局帧架构中角色的明确分工:早期的全局层不形成有意义的对应关系,中间层执行跨视图对齐,而最后的层仅提供轻微的改进。受这些发现的启发,我们提出了一种无需训练的两步加速方案:(1) 将早期的全局层转换为帧注意力,(2) 通过在补丁标记上进行K/V下采样并保留对角线和均值填充组件来减少全局注意力。我们在VGGT和$π^3$上实例化了这一策略,并在标准姿态和点图基准上进行了评估。我们的方法在不同上下文长度下实现了显著的推理加速,100帧时约$2\times$加速,300帧时$4$--$5\times$加速,800帧时$8$--$10\times$加速,同时保持或略微提高了原始模型的准确性,并在极密集的多视图设置中保持稳健,而在此之前,基于稀疏注意力的基线模型会失效。
Summary / 总结
This paper addresses the high computational cost of global self-attention in VGGT and $π^3$, two models known for strong multi-view 3D performance. After analyzing the role of each global attention layer, the authors propose a training-free, two-step acceleration scheme: early global layers are converted to frame attention, and the remaining global attention is subsampled by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. The method yields about 2x speedup at 100 frames, 4-5x at 300 frames, and 8-10x at 800 frames, while matching or slightly improving the original accuracy and remaining robust in extremely dense multi-view settings.
本文旨在通过提出两步加速方案来解决VGGT和$π^3$模型中全局自注意力的高计算成本问题。作者首先分析了这些模型中全局注意力层的作用,并发现早期层不形成有意义的对应关系,中间层执行跨视图对齐,最后层仅提供轻微的细化。基于这一分析,他们将早期的全局层转换为帧注意力,并通过在补丁标记上采样关键/值标记并保留对角线和均值填充组件来子采样全局注意力。该方法实现了显著的推理加速,100帧时约2倍加速,300帧时4-5倍加速,800帧时8-10倍加速,同时保持或略微提高原始模型的准确性和在密集多视图设置中的鲁棒性。
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
Authors: Zihao Zheng, Zhihao Mao, Xingyue Zhou, Jiayu Chen, Maoliang Li, Xinhao Sun, Hailong Zou, Zhaobo Zhang, Xuanzhe Liu, Donggang Cao, Hong Mei, Xiang Chen
First: 2026-03-07T07:30:35+00:00 · Latest: 2026-03-10T11:17:09+00:00
Abstract
Vision-and-Language Navigation (VLN) increasingly relies on large vision-language models, but their inference cost conflicts with real-time deployment. Token caching is a promising training-free strategy that avoids redundant computation by reusing stable visual tokens across frames. However, existing methods assume a static camera and fixed semantic focus, assumptions that VLN fundamentally violates. We identify two failure modes: (1) visual dynamics, where viewpoint shift displaces token positions across frames, causing position-wise matching to pair misaligned content; (2) semantic dynamics, where token relevance shifts across task stages as navigation progresses, making cached states stale. We propose VLN-Cache, a visual-dynamic-aware and semantic-dynamic-aware caching framework that introduces view-aligned remapping to recover geometric correspondences and a task-relevance saliency filter to veto reuse at semantic transitions. A layer-adaptive entropy policy further balances the per-layer reuse budget. Experiments on the R2R-CE simulation benchmark show up to 1.52x speedup while maintaining competitive navigation success rates.
中文标题/摘要
标题:VLN-Cache:具有视觉/语义动态意识的标记缓存启用技术
视觉-语言导航(VLN)越来越多地依赖于大型视觉-语言模型,但其推理成本与实时部署相冲突。标记缓存是一种有前景的无需训练的策略,通过在帧间重用稳定的视觉标记来避免冗余计算。然而,现有方法假设静态相机和固定语义焦点,这是VLN根本违反的假设。我们识别了两种失败模式:(1)视觉动态,其中视角变化导致标记位置在帧间错位,使位置匹配对齐错配内容;(2)语义动态,其中随着导航进程任务阶段的变化,标记的相关性发生变化,使缓存状态过时。我们提出了VLN-Cache,一种具有视觉动态意识和语义动态意识的缓存框架,引入视图对齐重映射以恢复几何对应关系,并引入任务相关性显著性滤波器以在语义过渡时阻止重用。逐层自适应熵策略进一步平衡每层的重用预算。在R2R-CE模拟基准测试上进行的实验显示,在保持竞争力的导航成功率的同时,可实现高达1.52倍的加速。
Summary / 总结
The research aims to address the high inference cost of large vision-language models in Vision-and-Language Navigation (VLN) by proposing VLN-Cache, a caching framework that accounts for visual and semantic dynamics. The method introduces view-aligned remapping to handle viewpoint shifts and a task-relevance saliency filter to manage semantic transitions, along with a layer-adaptive entropy policy to balance reuse. Experiments demonstrate up to 1.52x speedup with competitive navigation success rates on the R2R-CE benchmark.
论文旨在降低VLN模型的推理成本,以实现实时部署。提出了一种考虑视觉和语义动态的缓存框架VLN-Cache。VLN-Cache通过视图对齐重新映射来处理视角变化,并通过任务相关性显著性滤波器来管理语义过渡,实现了最多1.52倍的加速,同时保持了竞争力的导航成功率。
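The reuse decision described in the VLN-Cache abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `reuse_mask`, the `remap` list (standing in for view-aligned remapping), the `relevance_shift` scores (standing in for the task-relevance saliency filter), and both thresholds are hypothetical names and values chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def reuse_mask(prev_tokens, curr_tokens, remap, relevance_shift,
               sim_thresh=0.95, shift_thresh=0.5):
    """Return one boolean per current token; True means the cached state may be reused.

    remap[i] is the (assumed) index of the previous-frame token that corresponds
    to current token i after viewpoint alignment, or None if there is no match.
    relevance_shift[i] is an (assumed) score of how much the token's task
    relevance changed; a large value vetoes reuse.
    """
    mask = []
    for i, tok in enumerate(curr_tokens):
        j = remap[i]
        if j is None:
            mask.append(False)  # no geometric correspondence, so recompute
            continue
        stable = cosine(prev_tokens[j], tok) >= sim_thresh
        vetoed = relevance_shift[i] > shift_thresh
        mask.append(stable and not vetoed)
    return mask
```

Comparing each token with its remapped counterpart, rather than with the token at the same position, is what keeps a viewpoint shift from pairing misaligned content.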
Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning
Authors: Chun-Peng Chang, Chen-Yu Wang, Holger Caesar, Alain Pagani
First: 2026-03-10T11:12:28+00:00 · Latest: 2026-03-10T11:12:28+00:00
Abstract
A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting: response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing, and limited temporal reasoning, in which models fail to reason and align sequential events from current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.
中文标题/摘要
标题:探究驾驶VLMs的可靠性:从不一致的响应到基于时间的推理
可靠的驾驶助手应基于观察到的信息进行时间上合理的推理并提供一致的响应。在本文中,我们探讨了当将视觉语言模型(VLMs)应用于驾驶助手时,它们是否能够提供一致的响应,并理解当前观察如何影响未来结果,还是其输出仅反映训练期间记忆的模式而缺乏时间上的推理。尽管最近的努力将VLMs集成到自动驾驶中,但先前的研究通常强调场景理解和指令生成,隐含地假设强大的视觉解释自然能够实现一致的未来推理,从而确保可靠的决策,这是一个我们批判性地质疑的主张。我们重点关注限制VLM在这种情境下可靠性的两个主要挑战:响应不一致,其中微小的输入扰动导致不同的答案,甚至在某些情况下,响应退化为近乎随机猜测,以及有限的时间推理,其中模型无法从当前观察中推理和对齐序列事件,经常导致错误甚至矛盾的响应。此外,我们发现具有强大视觉理解的模型在需要时间推理的任务上并不一定表现最佳,表明它们倾向于过度依赖预训练模式而不是建模时间动态。为了解决这些问题,我们采用了现有的评估方法,并引入了FutureVQA,这是一个由人类注释的基准数据集,专门用于评估未来场景推理。此外,我们提出了一种简单而有效的自监督调优方法,结合了链式推理,该方法在不需要时间标签的情况下提高了响应一致性和时间推理能力。
Summary / 总结
This study investigates the reliability of Vision-Language Models (VLMs) as driving assistants, focusing on response consistency and temporal reasoning. The research identifies two major challenges: response inconsistency and limited temporal reasoning. Key findings show that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, suggesting an over-reliance on pretrained patterns rather than modeling of temporal dynamics. To address these issues, the study introduces FutureVQA, a human-annotated benchmark dataset for future scene reasoning, and a self-supervised tuning approach with chain-of-thought reasoning that improves consistency and temporal reasoning without requiring temporal labels.
研究探讨了视觉语言模型(VLMs)作为驾驶助手的可靠性,通过考察其响应的一致性和时间推理能力。研究发现,VLMs 对输入微小变化的响应不一致,并且在时间推理方面存在困难,导致输出错误或矛盾。为解决这些问题,作者引入了FutureVQA基准数据集和一种无需时间标签的简单有效自监督调优方法,以提高一致性和时间推理能力。
SlowBA: An efficiency backdoor attack towards VLM-based GUI agents
Authors: Junxian Li, Tu Lan, Haozhen Tan, Yan Meng, Haojin Zhu
First: 2026-03-09T12:38:28+00:00 · Latest: 2026-03-10T11:10:35+00:00
Comments: 25 pages
Abstract
Modern vision-language-model (VLM) based graphical user interface (GUI) agents are expected not only to execute actions accurately but also to respond to user instructions with low latency. While existing research on GUI-agent security mainly focuses on manipulating action correctness, the security risks related to response efficiency remain largely unexplored. In this paper, we introduce SlowBA, a novel backdoor attack that targets the responsiveness of VLM-based GUI agents. The key idea is to manipulate response latency by inducing excessively long reasoning chains under specific trigger patterns. To achieve this, we propose a two-stage reward-level backdoor injection (RBI) strategy that first aligns the long-response format and then learns trigger-aware activation through reinforcement learning. In addition, we design realistic pop-up windows as triggers that naturally appear in GUI environments, improving the stealthiness of the attack. Extensive experiments across multiple datasets and baselines demonstrate that SlowBA can significantly increase response length and latency while largely preserving task accuracy. The attack remains effective even with a small poisoning ratio and under several defense settings. These findings reveal a previously overlooked security vulnerability in GUI agents and highlight the need for defenses that consider both action correctness and response efficiency. Code can be found in https://github.com/tu-tuing/SlowBA.
中文标题/摘要
标题:SlowBA:一种针对基于VLM的GUI代理的效率后门攻击
现代基于视觉语言模型(VLM)的图形用户界面(GUI)代理不仅期望能够准确执行操作,还期望能够以低延迟响应用户指令。虽然现有关于GUI代理安全性的研究主要集中在操控操作的正确性上,但与响应效率相关的安全风险却很少被探索。在本文中,我们介绍了SlowBA,这是一种针对基于VLM的GUI代理响应性的新型后门攻击。关键思想是通过在特定触发模式下诱导过长的推理链来操纵响应延迟。为了实现这一点,我们提出了一种两阶段奖励级后门注入(RBI)策略,首先对齐长响应格式,然后通过强化学习学习触发感知激活。此外,我们设计了现实的弹出窗口作为触发器,这些触发器自然出现在GUI环境中,提高了攻击的隐蔽性。在多个数据集和基线上的广泛实验表明,SlowBA可以显著增加响应长度和延迟,同时在很大程度上保持任务准确性。即使在小污染比例和多种防御设置下,攻击仍然有效。这些发现揭示了GUI代理中一个之前未被注意到的安全漏洞,并强调了需要同时考虑操作正确性和响应效率的防御措施。代码可以在https://github.com/tu-tuing/SlowBA/找到。
Summary / 总结
The research motivation is to address the security risks related to response efficiency in VLM-based GUI agents, which have been largely unexplored. The main method involves a two-stage reward-level backdoor injection strategy to manipulate response latency by inducing long reasoning chains under specific trigger patterns. Key experimental findings show that SlowBA can significantly increase response length and latency while maintaining task accuracy, even with a small poisoning ratio and under various defense settings.
研究动机是解决基于VLM的GUI代理在响应效率方面的安全风险,这些风险此前尚未受到广泛关注。主要方法是采用两阶段奖励级后门注入策略,通过在特定触发模式下诱导长推理链来操纵响应延迟。关键实验发现表明,SlowBA可以显著增加响应长度和延迟,同时保持任务准确性,即使在小污染比例和多种防御设置下仍然有效。
Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition
Authors: Jiahang Cao, Yize Huang, Hanzhong Guo, Rui Zhang, Mu Nan, Weijian Mai, Jiaxu Wang, Hao Cheng, Jingkai Sun, Gang Han, Wen Zhao, Qiang Zhang, Yijie Guo, Qihao Zheng, Chunfeng Song, Xiao Li, Ping Luo, Andrew F. Luo
Venue: ICLR 2026
First: 2025-10-01T16:05:53+00:00 · Latest: 2026-03-10T10:57:47+00:00
Comments: Accepted to ICLR 2026. Project Page: https://sagecao1125.github.io/GPC-Site/
Abstract
Diffusion-based models for robotic control, including vision-language-action (VLA) and vision-action (VA) policies, have demonstrated significant capabilities. Yet their advancement is constrained by the high cost of acquiring large-scale interaction datasets. This work introduces an alternative paradigm for enhancing policy performance without additional model training. Perhaps surprisingly, we demonstrate that the composed policies can exceed the performance of either parent policy. Our contribution is threefold. First, we establish a theoretical foundation showing that the convex composition of distributional scores from multiple diffusion models can yield a superior one-step functional objective compared to any individual score. A Grönwall-type bound is then used to show that this single-step improvement propagates through entire generation trajectories, leading to systemic performance gains. Second, motivated by these results, we propose General Policy Composition (GPC), a training-free method that enhances performance by combining the distributional scores of multiple pre-trained policies via a convex combination and test-time search. GPC is versatile, allowing for the plug-and-play composition of heterogeneous policies, including VA and VLA models, as well as those based on diffusion or flow-matching, irrespective of their input visual modalities. Third, we provide extensive empirical validation. Experiments on Robomimic, PushT, and RoboTwin benchmarks, alongside real-world robotic evaluations, confirm that GPC consistently improves performance and adaptability across a diverse set of tasks. Further analysis of alternative composition operators and weighting strategies offers insights into the mechanisms underlying the success of GPC. These results establish GPC as a simple yet effective method for improving control performance by leveraging existing policies.
中文标题/摘要
标题:组合你的策略!通过测试时分布级组合改进基于扩散或流的机器人策略
基于扩散的机器人控制模型,包括视觉-语言-动作(VLA)和视觉-动作(VA)策略,已经展示了显著的能力。然而,它们的进步受到大规模交互数据集获取成本高的限制。本研究提出了一种无需额外模型训练的增强策略性能的新范式。令人惊讶的是,我们证明组合策略的性能可以超过任何一个父策略。我们的贡献包括三个方面。首先,我们建立了理论基础,证明了来自多个扩散模型的分布得分的凸组合可以产生优于任何单一得分的一步功能目标。然后使用Grönwall型界来证明这种一步改进会贯穿整个生成轨迹,从而带来系统性的性能提升。其次,受这些结果的启发,我们提出了通用策略组合(GPC),这是一种无需训练的方法,通过凸组合和测试时搜索结合多个预训练策略的分布得分来增强性能。GPC具有灵活性,允许异构策略的即插即用组合,包括VA和VLA模型,以及基于扩散或流匹配的策略,无论其输入视觉模态如何。第三,我们提供了广泛的实证验证。在Robomimic、PushT和RoboTwin基准测试以及实际机器人评估中,GPC在一系列任务中一致提高了性能和适应性。对替代组合操作符和加权策略的进一步分析提供了GPC成功机制的见解。这些结果确立了GPC作为一种简单而有效的方法,通过利用现有策略来提高控制性能。
Summary / 总结
This work addresses the challenge of improving diffusion-based or flow-based robot policies without additional model training. It introduces General Policy Composition (GPC), a method that combines the distributional scores of multiple pre-trained policies through a convex combination and test-time search. Theoretical analysis shows that the convex composition can yield a better one-step objective than any individual score, and that this improvement propagates through the whole generation trajectory. Empirical results on the Robomimic, PushT, and RoboTwin benchmarks and in real-world robotic evaluations show that GPC consistently improves performance and adaptability across diverse tasks.
该研究旨在提高基于扩散或流的机器人控制策略性能,而无需额外训练。提出了一种名为General Policy Composition (GPC)的方法,通过凸组合和测试时搜索将多个预训练策略的分布得分进行组合。理论分析表明,这种方法可以通过在整个生成轨迹中传播单步改进来提升性能。在多种基准测试和实际机器人任务上的实验证明,GPC能够一致地提高性能和适应性,适用于不同任务。
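The convex combination and test-time search mentioned in the GPC abstract can be sketched as follows. This is a simplified illustration, not the paper's algorithm: scores are plain lists of floats, and `search_weight`, its candidate grid, and the `evaluate` callback are hypothetical stand-ins for whatever test-time criterion is actually used.

```python
def compose_scores(score_a, score_b, w):
    """Convex combination w * score_a + (1 - w) * score_b, element-wise."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1] for a convex combination")
    if len(score_a) != len(score_b):
        raise ValueError("scores must have the same length")
    return [w * a + (1.0 - w) * b for a, b in zip(score_a, score_b)]


def search_weight(score_a, score_b, evaluate,
                  candidates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick the candidate weight whose composed score maximizes `evaluate`.

    `evaluate` is an assumed user-supplied function that maps a composed
    score to a scalar quality estimate (higher is better).
    """
    return max(candidates,
               key=lambda w: evaluate(compose_scores(score_a, score_b, w)))
```

Because the candidate grid includes 0.0 and 1.0, the search can always fall back to either parent policy, so under the chosen `evaluate` criterion the composed result is never worse than the better parent.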
Evolving Prompt Adaptation for Vision-Language Models
Authors: Enming Zhang, Jiayang Li, Yanru Wu, Zhenyu Liu, Yang Li
First: 2026-03-10T10:53:01+00:00 · Latest: 2026-03-10T10:53:01+00:00
Abstract
The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they often suffer from catastrophic forgetting of pre-trained knowledge. Toward addressing this limitation, our work is grounded in the insight that governing the evolutionary path of prompts is essential for forgetting-free adaptation. To this end, we propose EvoPrompt, a novel framework designed to explicitly steer the prompt trajectory for stable, knowledge-preserving fine-tuning. Specifically, our approach employs a Modality-Shared Prompt Projector (MPP) to generate hierarchical prompts from a unified embedding space. Critically, an evolutionary training strategy decouples low-rank updates into directional and magnitude components, preserving early-learned semantic directions while only adapting their magnitude, thus enabling prompts to evolve without discarding foundational knowledge. This process is further stabilized by Feature Geometric Regularization (FGR), which enforces feature decorrelation to prevent representation collapse. Extensive experiments demonstrate that EvoPrompt achieves state-of-the-art performance in few-shot learning while robustly preserving the original zero-shot capabilities of pre-trained VLMs.
中文标题/摘要
标题:视觉语言模型的进化提示适应
大规模视觉语言模型(VLMs)在下游任务中的适应性改进,尤其是在有限标注数据的情况下,仍然是一个重大挑战。尽管参数高效提示学习方法提供了有希望的途径,但它们往往会导致预训练知识的灾难性遗忘。为了解决这一限制,我们的工作基于这样一个见解:控制提示的进化路径对于无遗忘适应至关重要。为此,我们提出了一种名为EvoPrompt的新框架,旨在明确引导提示轨迹,实现稳定且保留知识的微调。具体而言,我们的方法使用模态共享提示投影器(MPP)从统一嵌入空间生成层次提示。关键的是,进化训练策略将低秩更新分解为方向和幅度两个部分,保留早期学习的语义方向,仅调整其幅度,从而允许提示在不丢弃基础知识的情况下进化。此外,特征几何正则化(FGR)通过强制特征去相关来防止表示坍塌,进一步稳定了这一过程。广泛的实验表明,EvoPrompt在少样本学习中达到了最先进的性能,同时稳健地保留了预训练VLMs的零样本能力。
Summary / 总结
The paper addresses the challenge of adapting large vision-language models to downstream tasks with limited labeled data, focusing on the issue of catastrophic forgetting. It introduces EvoPrompt, a framework that uses a Modality-Shared Prompt Projector to generate hierarchical prompts and an evolutionary training strategy that decouples low-rank updates into directional and magnitude components, preserving early-learned semantic directions while adapting only their magnitude. This approach, combined with Feature Geometric Regularization, ensures stable and knowledge-preserving fine-tuning. Experiments show that EvoPrompt achieves state-of-the-art few-shot performance while preserving the zero-shot capabilities of the pre-trained VLMs.
论文针对大规模视觉-语言模型在有限标注数据下游任务中的适应性挑战,特别是灾难性遗忘的问题。提出了EvoPrompt框架,利用模态共享提示投影器生成层次提示,并采用进化训练策略解耦低秩更新,保留早期学习的语义方向。通过特征几何正则化防止表示坍塌。实验表明,EvoPrompt在少样本学习中表现出色,同时保持了预训练模型的零样本能力。
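The direction/magnitude decoupling described for EvoPrompt can be illustrated on a single vector. This is a deliberately reduced sketch: the paper applies the idea to low-rank prompt updates, whereas here `decompose` and `rescale` are hypothetical helpers operating on a plain list of floats.

```python
import math


def decompose(v):
    """Split a non-zero vector into a unit direction and a scalar magnitude."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot take the direction of a zero vector")
    return [x / norm for x in v], norm


def rescale(direction, magnitude):
    """Rebuild a vector from a frozen direction and a (possibly updated) magnitude."""
    return [magnitude * x for x in direction]


# Freeze the early-learned direction and adapt only the magnitude.
direction, magnitude = decompose([3.0, 4.0])
updated = rescale(direction, magnitude * 2.0)
```

Only the scalar changes in `updated`, so its orientation matches the original vector, which is the sense in which early-learned semantic directions are preserved.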
Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity
Authors: Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, Wenjie Pei
First: 2026-03-10T10:31:58+00:00 · Latest: 2026-03-10T10:31:58+00:00
Comments: accepted by ICLR2026
Abstract
Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8$\times$ faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.
中文标题/摘要
标题:去冗存精,协同重要性与多样性:VLMs中的视觉标记压缩
视觉语言模型(VLMs)因视觉标记过度生成面临显著的计算效率问题。尽管先前工作表明大量视觉标记是冗余的,但现有压缩方法难以在重要性保存和信息多样性之间取得平衡。为解决这一问题,我们提出了一种名为PruneSID的无训练方法,该方法采用协同重要性与多样性的两阶段管道:(1)主语义成分分析(PSCA)用于将标记聚类成语义一致的组,确保概念覆盖的全面性;(2)组内非最大抑制(NMS)用于去除冗余标记同时保留每个组内的关键代表性标记。此外,PruneSID还引入了一种基于图像复杂性的信息感知动态压缩比机制,根据图像复杂性优化标记压缩率,从而在不同场景中实现更有效的平均信息保存。大量实验表明,PruneSID在LLaVA-1.5上达到96.3%的准确率,仅保留11.1%的标记;在LLaVA-NeXT上以极端压缩率(5.6%)达到92.8%的准确率,相比先前方法性能提升2.5%,且预填充速度比原模型快7.8倍。我们的框架适用于多种VLMs和图像、视频模态,展示了强大的跨模态通用性。代码可在https://github.com/ZhengyaoFang/PruneSID获取。
Summary / 总结
The paper addresses the computational inefficiencies in vision-language models (VLMs) due to redundant visual tokens. It introduces PruneSID, a training-free method that combines Principal Semantic Components Analysis (PSCA) and Intra-group Non-Maximum Suppression (NMS) to compress tokens while preserving important information and diversity. Experiments show that PruneSID achieves high accuracy even at extreme compression rates, outperforming previous methods with faster prefilling speed and broad applicability across different VLMs and modalities.
论文针对视觉语言模型(VLMs)因冗余视觉标记导致的计算效率低下问题,提出了一种名为PruneSID的无训练方法,结合了主语义成分分析(PSCA)进行标记聚类和组内非最大抑制(NMS)进行冗余标记的修剪。该方法确保了概念的全面覆盖,同时保留了关键的代表性标记。实验结果显示,PruneSID在LLaVA-1.5上实现了96.3%的准确率,仅保留11.1%的标记,在LLaVA-NeXT上在极端压缩率(5.6%)下实现了92.8%的准确率,优于先前的方法,并且预填充速度更快。
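The second stage of the PruneSID pipeline, intra-group non-maximum suppression, can be sketched as a greedy filter. The sketch assumes the grouping is already given (it does not implement PSCA), and the importance `scores`, the cosine-similarity criterion, and the `sim_thresh` value are illustrative assumptions rather than the paper's definitions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def intra_group_nms(tokens, scores, groups, sim_thresh=0.9):
    """Return sorted indices of the tokens kept after per-group greedy NMS.

    Within each group, tokens are visited from highest to lowest score, and a
    token is dropped if it is too similar to one already kept in that group.
    """
    kept = []
    for g in sorted(set(groups)):
        members = sorted((i for i, gi in enumerate(groups) if gi == g),
                         key=lambda i: -scores[i])
        group_kept = []
        for i in members:
            if all(cosine(tokens[i], tokens[j]) < sim_thresh for j in group_kept):
                group_kept.append(i)
        kept.extend(group_kept)
    return sorted(kept)
```

Running suppression separately per group means every group keeps at least its top-scoring token, which is how redundancy is pruned without losing coverage of any semantic cluster.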
OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks
Authors: Ronghao Fu, Haoran Liu, Weijie Zhang, Zhiwen Lin, Xiao Yang, Peng Zhang, Bo Yang
First: 2026-03-10T10:22:01+00:00 · Latest: 2026-03-10T10:22:01+00:00
Abstract
Vision-Language Models (VLMs) have demonstrated effective perception and reasoning capabilities on general-domain tasks, leading to growing interest in their application to Earth observation. However, a systematic benchmark for comprehensively evaluating remote sensing vision-language models (RSVLMs) remains lacking. To address this gap, we introduce OmniEarth, a benchmark for evaluating RSVLMs under realistic Earth observation scenarios. OmniEarth organizes tasks along three capability dimensions: perception, reasoning, and robustness. It defines 28 fine-grained tasks covering multi-source sensing data and diverse geospatial contexts. The benchmark supports two task formulations: multiple-choice VQA and open-ended VQA. The latter includes pure text outputs for captioning tasks, bounding box outputs for visual grounding tasks, and mask outputs for segmentation tasks. To reduce linguistic bias and examine whether model predictions rely on visual evidence, OmniEarth adopts a blind test protocol and a quintuple semantic consistency requirement. OmniEarth includes 9,275 carefully quality-controlled images, including proprietary satellite imagery from Jilin-1 (JL-1), along with 44,210 manually verified instructions. We conduct a systematic evaluation of contrastive learning-based models, general closed-source and open-source VLMs, as well as RSVLMs. Results show that existing VLMs still struggle with geospatially complex tasks, revealing clear gaps that need to be addressed for remote sensing applications. OmniEarth is publicly available at https://huggingface.co/datasets/sjeeudd/OmniEarth.
中文标题/摘要
标题:OmniEarth:用于评估地理空间任务中视觉-语言模型的基准
视觉-语言模型(VLMs)在通用领域任务中展示了有效的感知和推理能力,引起了将其应用于地球观测的兴趣。然而,缺乏系统性的基准来全面评估遥感视觉-语言模型(RSVLMs)。为解决这一问题,我们引入了OmniEarth,一个在现实地球观测场景中评估RSVLMs的基准。OmniEarth 按照感知、推理和鲁棒性三个能力维度组织任务。它定义了28个细粒度任务,涵盖多源传感数据和多种地理空间环境。基准支持两种任务形式:多项选择VQA和开放式VQA。后者包括纯文本输出的描述任务、边界框输出的视觉定位任务和掩码输出的分割任务。为了减少语言偏见并检查模型预测是否依赖于视觉证据,OmniEarth 采用盲测协议和五重语义一致性要求。OmniEarth 包含9,275张精心质量控制的图像,包括来自吉林一号(JL-1)的专有卫星图像,以及44,210条手动验证的指令。我们系统地评估了对比学习模型、通用闭源和开源VLMs以及RSVLMs。结果表明,现有VLMs在地理空间复杂任务中仍然存在困难,揭示了需要解决的明显差距,以适应遥感应用。OmniEarth 公开可用于 https://huggingface.co/datasets/sjeeudd/OmniEarth。
Summary / 总结
OmniEarth is a benchmark designed to evaluate remote sensing vision-language models (RSVLMs) in realistic Earth observation scenarios. It includes 28 tasks covering perception, reasoning, and robustness, and supports multiple-choice and open-ended question formats. The benchmark features 9,275 quality-controlled images and 44,210 instructions, and evaluates existing models, showing gaps in handling geospatially complex tasks. OmniEarth is publicly available for research use.
OmniEarth 是一个基准,用于评估遥感视觉-语言模型(RSVLM)在现实地球观测场景中的表现。它包含28个任务,涵盖感知、推理和鲁棒性,并支持多项选择和开放式问答格式。该基准包括9,275张经过质量控制的图像和44,210条指令,并评估了各种 VLM,结果显示现有模型在处理地理空间复杂任务方面存在明显差距,需要加以改进才能适应遥感应用。
A Guideline-Aware AI Agent for Zero-Shot Target Volume Auto-Delineation
Authors: Yoon Jo Kim, Wonyoung Cho, Jongmin Lee, Han Joo Chae, Hyunki Park, Sang Hoon Seo, Noh Jae Myung, Kyungmi Yang, Dongryul Oh, Jin Sung Kim
Venue: MICCAI 2026
First: 2026-03-10T10:00:01+00:00 · Latest: 2026-03-10T10:00:01+00:00
Comments: Submitted to MICCAI 2026
Abstract
Delineating the clinical target volume (CTV) in radiotherapy involves complex margins constrained by tumor location and anatomical barriers. While deep learning models automate this process, their rigid reliance on expert-annotated data requires costly retraining whenever clinical guidelines update. To overcome this limitation, we introduce OncoAgent, a novel guideline-aware AI agent framework that seamlessly converts textual clinical guidelines into three-dimensional target contours in a training-free manner. Evaluated on esophageal cancer cases, the agent achieves a zero-shot Dice similarity coefficient of 0.842 for the CTV and 0.880 for the planning target volume, demonstrating performance highly comparable to a fully supervised nnU-Net baseline. Notably, in a blinded clinical evaluation, physicians strongly preferred OncoAgent over the supervised baseline, rating it higher in guideline compliance, modification effort, and clinical acceptability. Furthermore, the framework generalizes zero-shot to alternative esophageal guidelines and other anatomical sites (e.g., prostate) without any retraining. Beyond mere volumetric overlap, our agent-based paradigm offers near-instantaneous adaptability to alternative guidelines, providing a scalable and transparent pathway toward interpretability in radiotherapy treatment planning.
中文标题/摘要
标题:一种基于指南的AI代理用于零样本目标体积自动勾画
在放射治疗中勾画临床靶体积(CTV)涉及由肿瘤位置和解剖屏障制约的复杂边界。虽然深度学习模型可以自动化这一过程,但它们对专家标注数据的严格依赖要求在临床指南更新时进行昂贵的重新训练。为克服这一限制,我们引入了OncoAgent,这是一种新颖的基于指南的AI代理框架,能够无缝地将文本临床指南转换为三维目标轮廓,无需训练。在食管癌病例上评估,该代理实现了CTV的零样本Dice相似系数为0.842,规划靶体积为0.880,性能与完全监督的nnU-Net基线相当。值得注意的是,在盲法临床评估中,医生更偏好OncoAgent,认为其在指南遵守性、修改努力和临床可接受性方面优于监督基线。此外,该框架在无需重新训练的情况下,零样本推广到其他食管指南和其他解剖部位(如前列腺)。超越单纯的体积重叠,我们的基于代理的范式提供了近乎即时的对不同指南的适应性,为放射治疗计划的可解释性提供了一种可扩展和透明的途径。
Summary / 总结
The research aims to address the challenge of updating deep learning models for radiotherapy target volume delineation when clinical guidelines change. OncoAgent, a guideline-aware AI agent, converts textual guidelines into 3D target contours without retraining. It achieves a Dice similarity coefficient of 0.842 for CTV and 0.880 for the planning target volume, comparable to a fully supervised nnU-Net baseline. Physicians preferred OncoAgent for its guideline compliance and clinical acceptability, and the framework generalizes to other anatomical sites without retraining, offering near-instantaneous adaptability to new guidelines.
研究旨在解决在临床指南更新时,如何更新用于放疗靶区勾画的深度学习模型的问题。研究引入了OncoAgent,这是一种能够将文本形式的临床指南直接转换为三维目标轮廓的AI代理框架,无需重新训练。OncoAgent在CTV和规划靶区上的Dice相似性系数分别达到0.842和0.880,与完全监督的nnU-Net基线相当。在临床评估中,医生更偏好OncoAgent,认为其在指南遵循性和临床接受性方面表现更佳。此外,该框架还能在不同解剖部位上实现零样本泛化,无需重新训练,提供了一种快速适应新指南的可扩展和透明路径。
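The Dice similarity coefficient reported for OncoAgent is the standard overlap measure $2|A \cap B| / (|A| + |B|)$ between a predicted and a reference volume. A minimal sketch follows, assuming each volume is represented as a collection of voxel index tuples; the function name `dice` and the convention of returning 1.0 for two empty volumes are choices made for this example.

```python
def dice(pred_voxels, ref_voxels):
    """Dice similarity coefficient between two voxel sets: 2|A & B| / (|A| + |B|)."""
    a, b = set(pred_voxels), set(ref_voxels)
    if not a and not b:
        return 1.0  # convention chosen here: two empty volumes agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))


# Two three-voxel volumes that share two voxels.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
ref = {(0, 0, 1), (0, 1, 0), (1, 1, 1)}
score = dice(pred, ref)
```

The coefficient ranges from 0 (no overlap) to 1 (identical volumes), so the reported 0.842 and 0.880 indicate substantial but not perfect agreement with the reference contours.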