arXiv Paper Digest

2025-12-22 03:28
Snapshot: 20251222_0328
MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
Authors: Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath
First: 2025-12-18T18:59:03+00:00 · Latest: 2025-12-18T18:59:03+00:00
Comments: 25 pages, 10 figures. Project page: https://hybridrobotics.github.io/MomaGraph/
Abstract
Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
Summary
MomaGraph is a unified scene-graph representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements, addressing prior scene graphs that separate spatial from functional relations and omit object states. The authors contribute MomaGraph-Scenes, a large-scale dataset of richly annotated, task-driven scene graphs in household environments, and MomaGraph-Bench, an evaluation suite spanning six reasoning capabilities. MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes, predicts task-oriented scene graphs and serves as a zero-shot task planner. It reaches 71.6% accuracy on the benchmark (+11.4% over the best baseline among open-source models), while generalizing across public benchmarks and transferring to real-robot experiments.
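As a rough illustration of what a state-aware scene graph with both spatial and functional edges and part-level nodes might look like, here is a minimal Python sketch. The abstract does not give the actual MomaGraph schema, so every class, field, and relation name below (`Node`, `SceneGraph`, `part_of`, `next_to`, `opens`) is a hypothetical choice made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """An object or an actionable part; every field name here is hypothetical."""
    name: str
    state: str = "unknown"         # object state, e.g. "open" or "closed"
    part_of: Optional[str] = None  # parent object for part-level nodes


@dataclass
class SceneGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    # Each edge is (source, relation, target, kind); kind is "spatial" or "functional".
    edges: List[Tuple[str, str, str, str]] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node

    def relate(self, src: str, relation: str, dst: str, kind: str) -> None:
        if kind not in ("spatial", "functional"):
            raise ValueError(f"unknown edge kind: {kind}")
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before relating them")
        self.edges.append((src, relation, dst, kind))

    def update_state(self, name: str, state: str) -> None:
        """Temporal update: overwrite the stored state after a new observation."""
        self.nodes[name].state = state

    def relations(self, name: str, kind: str) -> List[Tuple[str, str]]:
        """Return (relation, target) pairs of one kind leaving a node."""
        return [(r, d) for s, r, d, k in self.edges if s == name and k == kind]


graph = SceneGraph()
graph.add_node(Node("fridge", state="closed"))
graph.add_node(Node("fridge_handle", part_of="fridge"))
graph.add_node(Node("counter"))
graph.relate("fridge", "next_to", "counter", "spatial")
graph.relate("fridge_handle", "opens", "fridge", "functional")
graph.update_state("fridge", "open")  # state changes after the robot acts
```

Keeping the edge kind on each edge, rather than in two separate graphs, is one simple way to hold spatial and functional relations in a single structure; the paper's own representation may differ.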
SceneDiff: A Benchmark and Method for Multiview Object Change Detection
Authors: Yuqun Wu, Chih-hao Lin, Henry Che, Aditi Tiwari, Chuhang Zou, Shenlong Wang, Derek Hoiem
First: 2025-12-18T18:59:02+00:00 · Latest: 2025-12-18T18:59:02+00:00
Abstract
We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.
Summary
The work detects objects that have been added, removed, or moved between two captures (images or videos) of the same scene taken at different times, a capability relevant to robotic tidying and construction monitoring. Because viewpoint changes can make unchanged objects appear changed, the authors introduce the SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations (350 video pairs), and the SceneDiff method, a training-free approach that aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features using pretrained 3D, segmentation, and image encoding models. On multi-view and two-view benchmarks, the method reports relative AP improvements of 94% and 37.4% over existing approaches.
Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation
Authors: Valay Bundele, Mehran Hosseinzadeh, Hendrik P. A. Lensch
First: 2025-12-18T18:49:33+00:00 · Latest: 2025-12-18T18:49:33+00:00
Comments: Under Review
Abstract
Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: https://valaybundele.github.io/remedi-sam3/.
Summary
The work targets surgical instrument segmentation in endoscopic videos, which is crucial for computer-assisted interventions but hindered by frequent occlusions, rapid motion, and long-term instrument re-entry. ReMeDI-SAM3 is a training-free extension of SAM3 that adds relevance-aware memory filtering with a dedicated occlusion-aware memory, a piecewise interpolation scheme that expands the effective memory capacity, and a feature-based re-identification module with temporal voting. These components address indiscriminate memory updates and weak identity recovery after occlusions. In a zero-shot setting, absolute mcIoU improves by around 7% on EndoVis17 and 16% on EndoVis18 over vanilla SAM3, outperforming even prior training-based approaches.
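The abstract mentions temporal voting for post-occlusion identity disambiguation but does not describe how the votes are combined. One plausible reading, sketched below purely as an assumption, is a strict majority vote over a sliding window of per-frame identity candidates; the class name `TemporalVoter` and the window size are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Optional


class TemporalVoter:
    """Commit to an identity only after it wins a majority over a full window.

    This is an illustrative majority-vote scheme, not the authors' module.
    """

    def __init__(self, window: int = 5) -> None:
        self.votes: Deque[str] = deque(maxlen=window)

    def update(self, candidate_id: str) -> Optional[str]:
        # Record the per-frame guess; the deque drops the oldest vote when full.
        self.votes.append(candidate_id)
        if len(self.votes) < self.votes.maxlen:
            return None  # not enough evidence yet
        label, count = Counter(self.votes).most_common(1)[0]
        # Require a strict majority so a single noisy frame cannot flip the identity.
        return label if count > len(self.votes) // 2 else None


voter = TemporalVoter(window=3)
first = voter.update("grasper")
second = voter.update("scissors")
third = voter.update("grasper")
```

Delaying the decision until the window is full trades a few frames of latency for fewer identity switches right after an instrument re-enters the view.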
RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
Authors: Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia
First: 2025-12-18T18:34:23+00:00 · Latest: 2025-12-18T18:34:23+00:00
Comments: Precise region control and planning for instruction-based image editing. Our project page: https://replan-iv-edit.github.io
Abstract
Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io
Summary
RePlan is a plan-then-execute framework for instruction-based image editing that targets Instruction-Visual Complexity (IV-Complexity). A vision-language planner decomposes instructions through step-by-step reasoning and explicitly grounds them to target regions; a diffusion editor then applies the changes through a training-free attention-region injection mechanism, enabling parallel multi-region edits without iterative inpainting. The planner is strengthened with GRPO-based reinforcement learning on 1K instruction-only examples, which improves reasoning fidelity and format reliability. The authors also present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan outperforms strong baselines trained on far larger datasets in regional precision and overall fidelity.
CitySeeker: How Do VLMS Explore Embodied Urban Navigation With Implicit Human Needs?
Authors: Siqi Wang, Chao Liang, Yunfan Gao, Erxin Yu, Sen Li, Yushi Li, Jing Li, Haofen Wang
First: 2025-12-18T16:53:12+00:00 · Latest: 2025-12-18T16:53:12+00:00
Abstract
Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities for exploring embodied urban navigation to address implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We find key bottlenecks in error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To further analyze them, we investigate a series of exploratory strategies, namely Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with robust spatial intelligence required for tackling "last-mile" navigation challenges.
Summary
CitySeeker is a benchmark that evaluates how well VLMs navigate urban environments to satisfy implicit human needs (e.g., "I am thirsty"), with 6,440 trajectories across 8 cities and 7 goal-driven scenarios. Even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion; the key bottlenecks are error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To analyze these bottlenecks, the study investigates three exploratory strategies: Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR).
Hierarchical Schedule Optimization for Fast and Robust Diffusion Model Sampling
Authors: Aihua Zhu, Rui Su, Qinglin Zhao, Li Feng, Meng Shen, Shibo He
Venue: AAAI 2026
First: 2025-11-12T08:57:46+00:00 · Latest: 2025-12-18T14:23:34+00:00
Comments: Preprint, accepted to AAAI 2026
Abstract
Diffusion probabilistic models have set a new standard for generative fidelity but are hindered by a slow iterative sampling process. A powerful training-free strategy to accelerate this process is Schedule Optimization, which aims to find an optimal distribution of timesteps for a fixed and small Number of Function Evaluations (NFE) to maximize sample quality. To this end, a successful schedule optimization method must adhere to four core principles: effectiveness, adaptivity, practical robustness, and computational efficiency. However, existing paradigms struggle to satisfy these principles simultaneously, motivating the need for a more advanced solution. To overcome these limitations, we propose the Hierarchical-Schedule-Optimizer (HSO), a novel and efficient bi-level optimization framework. HSO reframes the search for a globally optimal schedule into a more tractable problem by iteratively alternating between two synergistic levels: an upper-level global search for an optimal initialization strategy and a lower-level local optimization for schedule refinement. This process is guided by two key innovations: the Midpoint Error Proxy (MEP), a solver-agnostic and numerically stable objective for effective local optimization, and the Spacing-Penalized Fitness (SPF) function, which ensures practical robustness by penalizing pathologically close timesteps. Extensive experiments show that HSO sets a new state-of-the-art for training-free sampling in the extremely low-NFE regime. For instance, with an NFE of just 5, HSO achieves a remarkable FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1. Crucially, this level of performance is attained not through costly retraining, but with a one-time optimization cost of less than 8 seconds, presenting a highly practical and efficient paradigm for diffusion model acceleration.
Summary
The work addresses the slow sampling of diffusion probabilistic models with the Hierarchical-Schedule-Optimizer (HSO), a training-free method that optimizes the distribution of timesteps for a fixed, small number of function evaluations (NFE). HSO is a bi-level framework that alternates between an upper-level global search for an initialization strategy and a lower-level local optimization that refines the schedule. Two components guide the process: the Midpoint Error Proxy (MEP), a solver-agnostic and numerically stable objective for local optimization, and the Spacing-Penalized Fitness (SPF) function, which penalizes pathologically close timesteps. With an NFE of 5, HSO reaches an FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1, at a one-time optimization cost of less than 8 seconds.
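The abstract says only that the Spacing-Penalized Fitness penalizes pathologically close timesteps; its exact form is not given. The sketch below shows one generic way such a penalty could be written, as a hinge on the gap between adjacent timesteps added to some error proxy. The function names, the `min_gap` and `weight` parameters, and the hinge form are all assumptions, not the paper's definition.

```python
from typing import Sequence


def spacing_penalty(timesteps: Sequence[float], min_gap: float) -> float:
    """Sum of hinge penalties for adjacent timesteps closer than min_gap."""
    ordered = sorted(timesteps, reverse=True)  # sampling runs from high t to low t
    gaps = [a - b for a, b in zip(ordered, ordered[1:])]
    return sum(max(0.0, min_gap - gap) for gap in gaps)


def penalized_fitness(
    proxy_error: float,
    timesteps: Sequence[float],
    min_gap: float = 0.02,   # hypothetical threshold
    weight: float = 10.0,    # hypothetical penalty weight
) -> float:
    """Lower is better: an error proxy plus a weighted spacing penalty."""
    return proxy_error + weight * spacing_penalty(timesteps, min_gap)


well_spaced = penalized_fitness(1.0, [1.0, 0.75, 0.5, 0.25, 0.0])
crowded = penalized_fitness(1.0, [1.0, 0.99, 0.5, 0.25, 0.0])
```

With equal proxy error, the crowded schedule scores worse because its first two timesteps sit closer than `min_gap`, which is the behavior the abstract attributes to SPF.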
N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
Authors: Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, Dong Yu
First: 2025-12-18T14:03:44+00:00 · Latest: 2025-12-18T14:03:44+00:00
Comments: Project Page: https://n3d-vlm.github.io
Abstract
While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage of 3D object grounding data and yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning for vision-language models.
Summary
N3D-VLM is a unified framework that integrates native 3D object perception with 3D-aware visual reasoning, so the model can localize objects in 3D space from textual descriptions and then reason explicitly in 3D. A scalable data construction pipeline uses depth estimation to lift large-scale 2D annotations into 3D, producing 3D object grounding data more than six times larger than the largest existing single-image 3D detection dataset, along with spatial question-answering data that targets chain-of-thought (CoT) reasoning. Experiments show that N3D-VLM achieves state-of-the-art performance on 3D grounding and consistently surpasses existing methods in 3D spatial reasoning.
Scaling Laws for Energy Efficiency of Local LLMs
Authors: Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Samuel Mugel, Román Orús
First: 2025-12-18T13:40:33+00:00 · Latest: 2025-12-18T13:40:33+00:00
Abstract
Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware--including laptops, desktops, industrial controllers, and embedded systems--relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.
Summary
The study benchmarks large language and vision-language models for CPU-only local inference on two tiers: a MacBook Pro M2 and a Raspberry Pi 5. Processor and memory usage are sampled continuously and integrated as area under the curve. Two empirical scaling laws emerge: language-model compute scales approximately linearly with token length, and vision-language models show a preprocessing-driven "resolution knee", where compute stays constant above an internal resolution clamp and decreases sharply below it. In addition, quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62% while preserving or improving semantic accuracy.
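The measurement approach described above, continuous sampling followed by area-under-curve integration, can be sketched with the trapezoidal rule. This is a minimal illustration under the assumption that samples are (timestamp, usage) pairs; the function name and the sample values are made up, and the authors' actual tooling is not described in the abstract.

```python
from typing import Sequence


def area_under_curve(times_s: Sequence[float], values: Sequence[float]) -> float:
    """Trapezoidal integral of sampled usage over time.

    With CPU usage in percent and time in seconds, the result is in
    percent-seconds, a simple proxy for cumulative computational load.
    """
    if len(times_s) != len(values):
        raise ValueError("times_s and values must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (values[i] + values[i - 1]) * dt
    return total


# Hypothetical samples: CPU usage (%) recorded once per second during one inference.
timestamps = [0.0, 1.0, 2.0, 3.0, 4.0]
cpu_percent = [10.0, 80.0, 90.0, 85.0, 15.0]
load = area_under_curve(timestamps, cpu_percent)
```

Integrating rather than averaging keeps the run duration in the result, so a longer prompt that holds the processor busy for more seconds yields a proportionally larger value.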
TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
Authors: Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li
First: 2025-12-18T13:34:14+00:00 · Latest: 2025-12-18T13:34:14+00:00
Abstract
Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
Summary
The paper addresses the vulnerability of Vision-Language Models (VLMs) such as CLIP to adversarial perturbations, a risk in safety-critical scenarios. Test-Time Padding (TTP) is a lightweight defense that first detects adversarial inputs and then adapts only to those inputs at inference. Detection uses the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding; for detected adversarial inputs, trainable padding restores disrupted attention patterns and a similarity-aware ensemble produces the final prediction, while clean inputs are left unchanged by default. Experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP surpasses state-of-the-art test-time defenses in adversarial robustness without compromising clean accuracy.
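The detection step can be illustrated with a small sketch: embed the image before and after padding, measure how far the cosine similarity drops, and flag the input when the drop exceeds a threshold. The embeddings below are plain lists standing in for CLIP features, and the threshold value, the function names, and the assumption that adversarial inputs produce the larger shift are all illustrative guesses rather than details confirmed by the abstract.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def similarity_shift(emb_original: Sequence[float], emb_padded: Sequence[float]) -> float:
    """How much the embedding moved after padding (0 means unchanged direction)."""
    return 1.0 - cosine_similarity(emb_original, emb_padded)


def is_adversarial(
    emb_original: Sequence[float],
    emb_padded: Sequence[float],
    threshold: float = 0.1,  # hypothetical value; the paper reports a universal threshold
) -> bool:
    # Assumption: adversarial perturbations are fragile, so padding shifts
    # their embeddings more than it shifts those of clean images.
    return similarity_shift(emb_original, emb_padded) > threshold


stable = is_adversarial([1.0, 0.0], [1.0, 0.0])
shifted = is_adversarial([1.0, 0.0], [0.0, 1.0])
```

In a real pipeline the two vectors would come from the same CLIP image encoder applied to the original and the padded image; only inputs flagged by the check would go through the adaptation stage.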
SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning
Authors: Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
First: 2025-12-18T12:27:06+00:00 · Latest: 2025-12-18T12:27:06+00:00
Abstract
Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing the global reference alignment, and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.
Summary
SNOW is a training-free, backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. It processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded with Spatio-Temporal Tokenized Patch Encoding (STEP) into multimodal tokens that capture localized semantic, geometric, and temporal attributes, and the tokens are incrementally integrated into a 4D Scene Graph (4DSG) anchored spatially by a lightweight SLAM backend. Experiments show that SNOW enables precise 4D scene understanding and spatially grounded inference, setting new state-of-the-art performance in several settings.
E-SDS: Environment-aware See it, Do it, Sorted - Automated Environment-Aware Reinforcement Learning for Humanoid Locomotion
Authors: Enis Yalcin, Joshua O'Hara, Maria Stamatopoulou, Chengxu Zhou, Dimitrios Kanoulas
Venue: RiTA 2025 (Springer LNNS)
First: 2025-12-18T12:08:24+00:00 · Latest: 2025-12-18T12:08:24+00:00
Comments: 12 pages, 3 figures, 4 tables. Accepted at RiTA 2025 (Springer LNNS)
Abstract
Vision-language models (VLMs) show promise in automating reward design in humanoid locomotion, which could eliminate the need for tedious manual engineering. However, current VLM-based methods are essentially "blind", as they lack the environmental perception required to navigate complex terrain. We present E-SDS (Environment-aware See it, Do it, Sorted), a framework that closes this perception gap. E-SDS integrates VLMs with real-time terrain sensor analysis to automatically generate reward functions that facilitate training of robust perceptive locomotion policies, grounded by example videos. Evaluated on a Unitree G1 humanoid across four distinct terrains (simple, gaps, obstacles, stairs), E-SDS uniquely enabled successful stair descent, while policies trained with manually-designed rewards or a non-perceptive automated baseline were unable to complete the task. In all terrains, E-SDS also reduced velocity tracking error by 51.9-82.6%. Our framework reduces the human effort of reward design from days to less than two hours while simultaneously producing more robust and capable locomotion policies.
中文标题/摘要
标题:E-SDS:环境感知的看见它、做到它、整理好——面向类人行走的环境感知强化学习自动化
视觉语言模型(VLMs)在自动化类人行走中的奖励设计方面显示出潜力,这可能消除繁琐的手动工程需求。然而,当前基于VLM的方法本质上是“盲目的”,因为它们缺乏导航复杂地形所需的环境感知能力。我们提出了E-SDS(环境感知的看见它、做到它、整理好)框架,以弥补这一感知缺口。E-SDS将VLM与实时地形传感器分析集成,以自动生成促进稳健感知行走策略训练的奖励函数,这些策略以示例视频为基础。在对Unitree G1类人机器人在四种不同地形(简单地形、缺口、障碍物、楼梯)上进行评估时,E-SDS唯一实现了成功的楼梯下降,而使用手动设计的奖励或非感知自动化基线训练的策略无法完成任务。在所有地形中,E-SDS还将速度跟踪误差降低了51.9%-82.6%。我们的框架将奖励设计的人力投入从几天减少到不到两个小时,同时生成了更稳健和能力更强的行走策略。
Summary / 总结
E-SDS is a framework that integrates vision-language models with real-time terrain sensor analysis to automatically generate reward functions for humanoid locomotion, addressing the lack of environmental perception in current VLM-based methods. Evaluated on a Unitree G1 humanoid across four terrains (simple, gaps, obstacles, stairs), E-SDS was the only approach to achieve stair descent; policies trained with manually designed rewards or a non-perceptive automated baseline could not complete the task. It also reduced velocity tracking error by 51.9-82.6% on all terrains and cut reward-design effort from days to under two hours.
E-SDS 是一个框架,将视觉语言模型与实时地形传感器分析相结合,生成用于人形机器人行走的奖励函数,解决了当前方法中缺乏环境感知的问题。在四种地形上的测试表明,E-SDS 使楼梯下降成为可能,并将速度跟踪误差降低了51.9-82.6%,优于手动设计的奖励或非感知的基线方法。
Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt
Authors: Shangxun Li, Youngjung Uh
First: 2025-12-18T11:55:06+00:00 · Latest: 2025-12-18T11:55:06+00:00
Abstract
Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.
中文标题/摘要
标题:面向单提示主题一致文本到图像生成的文本嵌入几何解缠
文本到图像的扩散模型在从自然语言描述生成高质量图像方面表现出色,但在多个输出中保持主题一致性方面常常失败,限制了其在视觉叙事中的应用。现有方法依赖于模型微调或图像条件化,这在计算上非常昂贵,并且需要针对每个主题进行优化。1Prompt1Story 是一种无需训练的方法,将所有场景描述连接成一个提示并重新缩放标记嵌入,但它遭受语义泄露的问题,即帧间嵌入变得纠缠,导致文本对齐错误。在本文中,我们提出了一种简单而有效的无需训练的方法,从几何学的角度出发,通过细化文本嵌入来抑制不需要的语义,从而解决语义纠缠问题。广泛的实验表明,我们的方法在主题一致性和文本对齐方面显著优于现有基线。
Summary / 总结
The research aims to improve subject consistency in text-to-image generation by addressing semantic entanglement in text embeddings. The method refines text embeddings to suppress unwanted semantics from a geometric perspective, without requiring model fine-tuning or per-subject optimization. Experimental results show that this approach significantly enhances subject consistency and text alignment compared to existing methods.
研究旨在通过解决文本嵌入中的语义纠缠问题,提高文本到图像生成中的主题一致性。方法从几何角度精炼文本嵌入以抑制不必要的语义,无需进行模型微调或针对每个主题进行优化。实验表明,该方法在主题一致性和文本对齐方面优于现有方法。
CountZES: Counting via Zero-Shot Exemplar Selection
Authors: Muhammad Ibraheem Siddiqui, Muhammad Haris Khan
First: 2025-12-18T11:12:50+00:00 · Latest: 2025-12-18T11:12:50+00:00
Abstract
Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES's superior performance among ZOC methods while generalizing effectively across natural, aerial, and medical domains.
中文标题/摘要
标题:CountZES:通过零样本示例选择进行计数
在复杂场景中的物体计数仍然具有挑战性,特别是在零样本设置中,目标是计数仅通过类别名称指定的未见类别的实例。现有的零样本物体计数(ZOC)方法通过文本推断示例,要么依赖于开放词汇检测器,这通常会产生多实例候选,要么依赖于随机补丁采样,这无法准确划分物体实例。为了解决这个问题,我们提出了一种无需训练的CountZES框架,用于通过零样本示例选择进行物体计数。CountZES通过三个协同阶段逐步发现多样化的示例:检测锚定示例(DAE)、密度引导示例(DGE)和特征共识示例(FCE)。DAE细化开放词汇检测以隔离精确的单实例示例。DGE引入了一种基于密度的自我监督范式,以识别统计上一致且语义紧凑的示例,而FCE通过特征空间聚类增强视觉一致性。这些阶段共同产生一个多样且互补的示例集,平衡了文本基础、计数一致性以及特征表示性。在多种数据集上的实验表明,CountZES在ZOC方法中表现出优越的性能,并且在自然、航空和医疗领域中具有良好的泛化能力。
Summary / 总结
CountZES is a training-free framework for zero-shot object counting in complex scenes. It progressively discovers diverse exemplars through three stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars, DGE identifies statistically consistent and semantically compact exemplars, and FCE reinforces visual coherence through feature-space clustering. Experiments show CountZES outperforms other zero-shot object counting methods across various domains, including natural, aerial, and medical scenes.
CountZES 是一种无需训练的框架,用于复杂场景下的零样本物体计数。它通过三个阶段逐步发现多样化的示例:检测锚定示例(DAE)、密度引导示例(DGE)和特征共识示例(FCE)。该方法细化开放词汇检测,识别统计上一致的示例,并增强视觉一致性,从而在各种数据集和领域中表现出色,优于其他零样本物体计数方法。
Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
Authors: Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, Ramachandran Ramjee
First: 2025-12-18T10:37:14+00:00 · Latest: 2025-12-18T10:37:14+00:00
Comments: 11 pages, 8 figures, 3 tables and 1 algorithm
Abstract
Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascade, a training-free sparse attention method that leverages known observations such as 1) post-softmax attention is intrinsically sparse, and 2) the identity of high-weight keys is stable across nearby layers. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. The anchor layers are selected algorithmically, via a dynamic-programming objective that maximizes cross-layer similarity over a development set, allowing easy deployment across models. The method incorporates efficient implementation constraints (e.g. tile-level operations), across both prefill and decode attention. The Top-k selection and reuse in Kascade is head-aware and we show in our experiments that this is critical for high accuracy. Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over FlashAttention-3 baseline on H100 GPUs while closely matching dense attention accuracy on long-context benchmarks such as LongBench and AIME-24.
中文标题/摘要
标题:Kascade:一种实用的稀疏注意力方法用于长上下文LLM推理
注意力是长上下文LLM推理中延迟的主要来源,这是随着推理模型和RAG越来越受欢迎的工作负载。我们提出了Kascade,一种无需训练的稀疏注意力方法,利用已知观察,例如1)后softmax注意力本质上是稀疏的,2)高权重键的身份在相邻层中是稳定的。Kascade在一组锚定层中精确计算Top-k索引,然后在中间重用层中重用这些索引。锚定层是通过动态规划目标算法选择的,该目标最大化开发集上的跨层相似性,从而实现模型之间的轻松部署。该方法考虑了高效的实现约束(例如,块级操作),适用于预填充和解码注意力。Kascade的Top-k选择和重用是头感知的,我们在实验中展示了这一点对于高准确率至关重要。Kascade在H100 GPU上相对于FlashAttention-3基线在解码注意力中实现了高达4.1倍的加速,在预填充注意力中实现了2.2倍的加速,同时在长上下文基准测试(如LongBench和AIME-24)上接近密集注意力的准确率。
Summary / 总结
Kascade is a training-free sparse attention method that enhances the efficiency of long-context LLM inference by leveraging the intrinsic sparsity of post-softmax attention and the stability of high-weight keys across layers. It computes exact Top-k indices in anchor layers and reuses them in intermediate layers, achieving up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention on H100 GPUs while maintaining high accuracy on long-context benchmarks.
Kascade 是一种无需训练的稀疏注意力方法,通过利用后softmax注意力的固有稀疏性和高权重键在相邻层中的稳定性来提高长上下文 LLM 推断的效率。它在锚层中计算精确的 Top-k 索引并在中间层中重用这些索引,从而在 H100 GPU 上分别实现高达 4.1 倍的解码注意力加速和 2.2 倍的预填充注意力加速,同时在长上下文基准测试中保持高准确性。
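To make the index-reuse idea in the Kascade entry above concrete, here is a minimal pure-Python sketch: exact Top-k key indices are computed once in an "anchor" layer and then reused to restrict attention in a later layer. The function names, the toy vectors, and the single-head, single-query setup are illustrative assumptions; the head-aware selection, tile-level kernels, and dynamic-programming anchor choice described in the abstract are not modeled.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def topk_indices(query, keys, k):
    """Exact Top-k key indices by attention score (done in the anchor layer)."""
    scores = [dot(query, key) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def sparse_attention(query, keys, values, indices):
    """Attend only over the given key indices (done in a reuse layer)."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in indices])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices))
            for d in range(dim)]

# Toy data (hypothetical): 4 keys, 2-dimensional head.
anchor_q = [1.0, 0.0]
anchor_keys = [[5.0, 0.0], [0.1, 0.0], [4.0, 0.0], [-3.0, 0.0]]
reuse_q = [0.9, 0.1]
reuse_keys = [[4.5, 0.2], [0.0, 0.1], [4.2, -0.1], [-2.5, 0.0]]
reuse_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [9.0, 9.0]]

# Indices are computed once in the anchor layer ...
shared = topk_indices(anchor_q, anchor_keys, k=2)
# ... and reused in the later layer without recomputing Top-k.
out = sparse_attention(reuse_q, reuse_keys, reuse_values, shared)
```

The reuse layer only scores the shared indices, so its cost scales with k rather than with the full context length. The sketch only pays off under the abstract's observation that high-weight keys are stable across nearby layers.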
Unified Semantic Transformer for 3D Scene Understanding
Authors: Sebastian Koch, Johanna Wald, Hidenobu Matsuki, Pedro Hermosilla, Timo Ropinski, Federico Tombari
First: 2025-12-16T12:49:35+00:00 · Latest: 2025-12-18T10:28:42+00:00
Comments: Project page: https://unite-page.github.io/
Abstract
Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been task-specific. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and takes only a few seconds to infer the full 3D semantic geometry. Our approach directly predicts multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, and affordances and articulations, solely from RGB images. The method is trained with 2D distillation, relies heavily on self-supervision, and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models, in many cases surpassing methods that operate on ground-truth 3D geometry. See the project website at unite-page.github.io
中文标题/摘要
标题:统一语义变换器用于3D场景理解
整体3D场景理解涉及捕捉和解析未结构化的3D环境。由于现实世界的固有复杂性,现有的模型大多被开发出来并局限于特定任务。我们引入了UNITE,一种用于3D场景理解的统一语义变换器,这是一种新颖的前馈神经网络,能够在一个模型中统一多种3D语义任务。我们的模型以端到端的方式处理未见过的场景,并且只需几秒钟即可推断出完整的3D语义几何结构。我们的方法能够直接预测多个语义属性,包括3D场景分割、实例嵌入、开放词汇特征,以及可操作性和关节,仅从RGB图像中。该方法通过结合2D蒸馏训练,高度依赖于自我监督,并利用了设计用于确保3D视图一致性的新型多视图损失。我们证明,UNITE在多个不同的语义任务上达到了最先进的性能,并且在许多情况下甚至超过了特定任务的模型,甚至在某些情况下超越了在真实3D几何上操作的方法。请参见项目网站:unite-page.github.io
Summary / 总结
The research aims to develop a unified model for understanding 3D scenes, addressing the limitations of task-specific models. UNITE, a Unified Semantic Transformer, is introduced as a novel feed-forward neural network that can predict multiple semantic attributes from RGB images, including 3D scene segmentation and instance embeddings. The model is trained using 2D distillation and self-supervision, achieving state-of-the-art performance on various semantic tasks and outperforming task-specific models in many cases.
研究旨在开发一个统一模型来处理3D场景理解的各种任务,如3D场景分割和实例嵌入。引入了UNITE,这是一种新颖的前馈神经网络,能够从RGB图像中以端到端的方式预测多种语义属性。该模型在不同语义任务上表现出色,并在许多情况下超越了特定任务的模型,即使是在处理未见过的场景时也是如此。它通过2D蒸馏和自我监督以及新颖的多视图损失进行训练,以确保3D视图一致性。
Collaborative Edge-to-Server Inference for Vision-Language Models
Authors: Soochang Song, Yongjune Kim
First: 2025-12-18T09:38:18+00:00 · Latest: 2025-12-18T09:38:18+00:00
Comments: 13 pages, 12 figures
Abstract
We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder's input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM's internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.
中文标题/摘要
标题:边缘到服务器协作推理在视觉语言模型中的应用
我们提出了一种视觉语言模型(VLM)的协作边缘到服务器推理框架,该框架在保持推理准确性的前提下减少了通信成本。在典型的部署中,边缘设备(客户端)捕获的视觉数据被传输到服务器进行VLM推理。然而,将原始图像(全局图像)调整到视觉编码器的输入分辨率往往会丢弃细粒度的细节,导致准确度下降。为克服这一限制,我们设计了一个两阶段框架。在第一阶段,服务器对全局图像进行推理,并使用VLM的内部注意力识别感兴趣区域(RoI)。然后计算输出标记的最小熵作为置信度度量,以确定是否需要重新传输。如果最小熵超过预定义的阈值,服务器将请求边缘设备发送RoI的细节保留局部图像。服务器然后通过联合利用全局和局部图像来细化其推理。这种选择性重新传输策略确保仅传输必要的视觉内容。在多个VLM架构上的实验表明,所提出的框架在保持推理准确性的前提下显著减少了通信成本。
Summary / 总结
The paper proposes a collaborative edge-to-server inference framework for vision-language models to reduce communication costs while preserving inference accuracy. It addresses the issue of accuracy degradation caused by resizing the original image to match the vision encoder's input resolution. The framework uses a two-stage process: the server first performs inference on the global image and identifies a region of interest, then requests the edge device to send a detail-preserved local image of the RoI if the min-entropy of the output tokens exceeds a threshold. This selective retransmission ensures only essential visual content is transmitted, leading to significant communication cost reduction without compromising accuracy.
研究提出了一种协作边缘到服务器的视觉语言模型推理框架,以减少通信成本同时保持推理准确性。该框架通过设计两阶段过程解决了由于图像缩放导致的准确性下降问题。第一阶段,服务器对全图进行推理并使用模型的注意力机制识别感兴趣区域。如果置信度低,服务器会请求边缘设备发送该区域的详细局部图像。第二阶段,服务器利用全局和局部图像进行推理的细化。实验结果表明,该方法在不同视觉语言模型架构下显著减少了通信成本并保持了准确性。
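The retransmission rule in the edge-to-server entry above can be sketched in a few lines of Python. Min-entropy is the standard quantity -log2(max p) for a probability distribution. How the paper aggregates it across output tokens is not stated in the abstract, so the averaging step, the threshold value, and all names below are assumptions made for illustration.

```python
import math

def min_entropy(probs):
    """Min-entropy in bits: -log2 of the most likely outcome's probability."""
    return -math.log2(max(probs))

def needs_retransmission(token_distributions, threshold):
    """Request the RoI crop when the mean per-token min-entropy exceeds threshold.

    Averaging over tokens is an assumption of this sketch.
    """
    per_token = [min_entropy(p) for p in token_distributions]
    return sum(per_token) / len(per_token) > threshold

# Hypothetical next-token distributions for two server-side answers.
confident = [[0.97, 0.02, 0.01], [0.90, 0.05, 0.05]]
uncertain = [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20]]
THRESHOLD_BITS = 0.5  # hypothetical threshold

request_for_confident = needs_retransmission(confident, THRESHOLD_BITS)
request_for_uncertain = needs_retransmission(uncertain, THRESHOLD_BITS)
```

A peaked distribution has min-entropy near zero, so a confident first-stage answer skips the second transmission and only uncertain answers trigger the request for the detail-preserved local image.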
MoAPT: Mixture of Adversarial Prompt Tuning for Vision-Language Models
Authors: Shiji Zhao, Qihui Zhu, Shukun Xiong, Shouwei Ruan, Maoxun Yuan, Jialing Tao, Jiexi Liu, Ranjie Duan, Jie Zhang, Jie Zhang, Xingxing Wei
First: 2025-05-23T06:04:15+00:00 · Latest: 2025-12-18T09:15:05+00:00
Abstract
Large pre-trained Vision Language Models (VLMs) demonstrate excellent generalization capabilities but remain highly susceptible to adversarial examples, posing potential security risks. To improve the robustness of VLMs against adversarial examples, adversarial prompt tuning methods are proposed to align the text feature with the adversarial image feature without changing model parameters. However, when facing various adversarial attacks, a single learnable text prompt has insufficient generalization to align well with all adversarial image features, which ultimately results in overfitting. To address the above challenge, in this paper, we empirically find that increasing the number of learned prompts yields greater robustness improvements than simply extending the length of a single prompt. Building on this observation, we propose an adversarial tuning method named Mixture of Adversarial Prompt Tuning (MoAPT) to enhance the generalization against various adversarial attacks for VLMs. MoAPT aims to learn mixture text prompts to obtain more robust text features. To further enhance the adaptability, we propose a conditional weight router based on the adversarial images to predict the mixture weights of multiple learned prompts, which helps obtain sample-specific mixture text features aligning with different adversarial image features. Extensive experiments across 11 datasets under different settings show that our method can achieve better adversarial robustness than state-of-the-art approaches.
中文标题/摘要
标题:MoAPT:视觉语言模型的混合对抗提示调优
大型预训练视觉语言模型(VLMs)表现出色的泛化能力,但仍然高度易受对抗样本的影响,存在潜在的安全风险。为了提高VLMs对抗对抗样本的鲁棒性,提出了对抗提示调优方法,以在不改变模型参数的情况下使文本特征与对抗图像特征对齐。然而,当面对各种对抗攻击时,单一可学习的文本提示在泛化以与所有对抗图像特征对齐方面不足,最终导致过拟合。为了解决上述挑战,本文通过实验证明增加学习提示的数量比简单地延长单一提示的长度能获得更大的鲁棒性改进。基于这一观察,我们提出了一种名为混合对抗提示调优(MoAPT)的对抗调优方法,以增强VLMs对各种对抗攻击的泛化能力。MoAPT旨在学习混合文本提示以获得更鲁棒的文本特征。为了进一步增强适应性,我们基于对抗图像提出了一种条件权重路由器,以预测多个学习提示的混合权重,这有助于获得样本特定的混合文本特征,使其与不同的对抗图像特征对齐。在不同设置下的11个数据集上进行的广泛实验表明,我们的方法可以比最先进的方法实现更好的对抗鲁棒性。
Summary / 总结
The research aims to enhance the robustness of Vision Language Models (VLMs) against adversarial examples with Mixture of Adversarial Prompt Tuning (MoAPT). MoAPT learns multiple text prompts rather than a single longer one, and a conditional weight router predicts per-image mixture weights so that the mixed text features align with different adversarial image features. Experiments across 11 datasets show that MoAPT achieves better adversarial robustness than state-of-the-art approaches.
本文通过提出混合对抗提示调优(MoAPT)方法来解决视觉语言模型(VLMs)对对抗样本的脆弱性问题。MoAPT 引入多个学习文本提示以提高模型对各种对抗攻击的鲁棒性,优于单一提示方法。该方法还包括一个基于对抗图像的条件权重路由器,以适应性地结合这些提示,增强模型与不同对抗特征的对齐。实验结果显示,MoAPT 在 11 个数据集上的鲁棒性优于现有方法。
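The conditional weight router in the MoAPT entry above amounts to a sample-specific convex combination of prompt features. The sketch below shows that combination in plain Python. The linear router, the feature dimensions, and every name and number are hypothetical; the abstract does not specify the router architecture.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_weights(image_feature, router_rows):
    """One logit per learned prompt from a linear router (an assumption), then softmax."""
    logits = [sum(w * x for w, x in zip(row, image_feature)) for row in router_rows]
    return softmax(logits)

def mix_text_features(prompt_features, weights):
    """Weighted sum of the text features produced by each learned prompt."""
    dim = len(prompt_features[0])
    return [sum(w * f[d] for w, f in zip(weights, prompt_features))
            for d in range(dim)]

# Hypothetical text features from three learned prompts (2-dimensional for brevity).
prompt_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical router parameters: one row per prompt.
router_rows = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
# Hypothetical (adversarial) image feature.
image_feature = [1.0, 0.0]

weights = route_weights(image_feature, router_rows)
mixed = mix_text_features(prompt_features, weights)
```

Because the weights depend on the image feature, two different adversarial images can yield two different mixed text features from the same set of learned prompts.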
In-Context Probing for Membership Inference in Fine-Tuned Language Models
Authors: Zhexi Lu, Hongliang Chi, Nathalie Baracaldo, Swanand Ravindra Kadhe, Yuseok Jeon, Lei Yu
First: 2025-12-18T08:26:26+00:00 · Latest: 2025-12-18T08:26:26+00:00
Abstract
Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs), especially when models are adapted to domain-specific tasks using sensitive data. While prior black-box MIA techniques rely on confidence scores or token likelihoods, these signals are often entangled with a sample's intrinsic properties - such as content difficulty or rarity - leading to poor generalization and low signal-to-noise ratios. In this paper, we propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics, particularly the phenomenon of diminishing returns during optimization. We introduce the Optimization Gap as a fundamental signal of membership: at convergence, member samples exhibit minimal remaining loss-reduction potential, while non-members retain significant potential for further optimization. To estimate this gap in a black-box setting, we propose In-Context Probing (ICP), a training-free method that simulates fine-tuning-like behavior via strategically constructed input contexts. We propose two probing strategies: reference-data-based (using semantically similar public samples) and self-perturbation (via masking or generation). Experiments on three tasks and multiple LLMs show that ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates. We further analyze how reference data alignment, model type, PEFT configurations, and training schedules affect attack effectiveness. Our findings establish ICP-MIA as a practical and theoretically grounded framework for auditing privacy risks in deployed LLMs.
中文标题/摘要
标题:上下文探查在微调语言模型成员推断中的应用
成员推断攻击(MIAs)对微调大型语言模型(LLMs)构成了严重的隐私威胁,尤其是在使用敏感数据将模型适应特定领域任务时。尽管先前的黑盒MIAs技术依赖于置信分数或标记可能性,但这些信号往往与样本的固有属性(如内容难度或稀有性)交织在一起,导致泛化能力差和信噪比低。在本文中,我们提出了一种新的基于训练动力学理论的MIAs框架ICP-MIA,特别关注优化过程中的递减回报现象。我们引入了优化差距作为基本的成员信号:在收敛时,成员样本表现出最小的剩余损失减少潜力,而非成员样本则保留了进一步优化的显著潜力。为了在黑盒设置中估计这一差距,我们提出了一种无需训练的上下文探查(ICP)方法,通过战略性构建输入上下文来模拟微调行为。我们提出了两种探查策略:参考数据基于(使用语义相似的公共样本)和自我扰动(通过掩码或生成)。在三个任务和多个LLM上的实验表明,ICP-MIA在低误报率下显著优于先前的黑盒MIAs。我们进一步分析了参考数据对齐、模型类型、PEFT配置和训练计划如何影响攻击效果。我们的研究结果确立了ICP-MIA作为一种实用且理论基础的框架,用于审计部署中LLMs的隐私风险。
Summary / 总结
This paper addresses the privacy threat of membership inference attacks on fine-tuned language models. It introduces ICP-MIA, a framework that uses the Optimization Gap, the remaining loss-reduction potential of a sample at convergence, to distinguish members from non-members. In-Context Probing estimates this gap in a black-box setting without any training, using either semantically similar reference data or self-perturbation, and outperforms prior black-box attacks, especially at low false positive rates.
本文针对细调语言模型中的成员推理攻击(MIAs)的隐私威胁,提出了一种基于优化差距的新框架ICP-MIA。它引入了In-Context Probing(ICP)来在不进行训练的情况下估计这一差距,使用参考数据或自我扰动。实验表明ICP-MIA在低误报率下显著优于之前的黑盒MIAs,并研究了影响攻击效果的因素。
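One way to read the Optimization Gap in the ICP-MIA entry above is as the loss reduction a sample still gains when helpful context is prepended: members should gain little, non-members more. The sketch below encodes only that reading. The scoring rule, the threshold, and the per-token log-probabilities are assumptions for illustration, not the paper's exact estimator.

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood from per-token log-probabilities."""
    return -sum(token_logprobs) / len(token_logprobs)

def optimization_gap(logprobs_plain, logprobs_with_context):
    """Loss reduction obtained by adding probe context (assumed estimator)."""
    return mean_nll(logprobs_plain) - mean_nll(logprobs_with_context)

def predict_member(gap, threshold):
    """A small remaining gap is taken as evidence of membership."""
    return gap < threshold

# Hypothetical log-probs returned by a black-box model for the target sample,
# first queried alone and then queried with probe context prepended.
member_gap = optimization_gap([-0.20, -0.10, -0.30], [-0.18, -0.10, -0.28])
non_member_gap = optimization_gap([-1.20, -0.90, -1.50], [-0.60, -0.50, -0.70])

THRESHOLD = 0.1  # hypothetical decision threshold
```

In practice the threshold would be calibrated on held-out data, for example to fix a target false positive rate, which is the regime the abstract highlights.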
Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis
Authors: Zhi Helu, Huang Jingjing, Xu Wang, Xu Yangbin, Zhang Wanyue, Jiang Baoyang, Deng Shirui, Zhu Liang, Li Fangfang, Zhao Tiejun, Lin Yankai, Yao Yuan
First: 2025-12-18T06:30:08+00:00 · Latest: 2025-12-18T06:30:08+00:00
Abstract
Embodied intelligence, a grand challenge in artificial intelligence, is fundamentally constrained by the limited spatial understanding and reasoning capabilities of current models. Prevailing efforts to address this through enhancing Vision-Language Models (VLMs) are trapped in a dilemma: template-based datasets are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable and, critically, computationally imprecise. We introduce SPRITE, a novel framework that overcomes this dilemma by leveraging simulators and large models to programmatically synthesize scalable, diverse, and high-quality spatial reasoning data. The core innovation of SPRITE is to reframe ground-truth generation as a code-generation task. We utilize LLMs to compile complex spatial questions into executable programs, which are then verified against high-precision scene meta-information extracted from simulators. This ensures our ground truth is both computationally precise and verifiable, while the generative power of LLMs provides vast linguistic diversity. Leveraging this pipeline, we have curated a dataset encompassing 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs. We demonstrate that a VLM trained on our data achieves significant performance gains on multiple spatial benchmarks and outperforms other open-source datasets of equivalent size. Furthermore, a scalability analysis confirms our hypothesis that overcoming the low-diversity nature of traditional template methods is essential for building robust, generalizable spatial intelligence. We will make the SPRITE framework code and the full 300k+ dataset publicly available to facilitate future research in spatial intelligence.
中文标题/摘要
标题:通过程序化数据合成在MLLM中扩展空间推理能力
具身智能,人工智能领域的重大挑战,从根本上受限于当前模型的空间理解和推理能力有限。通过增强视觉-语言模型(VLMs)来解决这一问题的努力陷入了困境:基于模板的数据集虽然可扩展但结构僵化,而人工标注虽然语言多样但不可扩展且计算上不精确。我们提出了SPRITE,一种新颖的框架,通过利用模拟器和大型模型程序化合成可扩展、多样且高质量的空间推理数据来克服这一困境。SPRITE的核心创新是将地面真值生成重新定义为代码生成任务。我们利用LLMs将复杂的空间问题编译成可执行程序,然后通过模拟器提取的高精度场景元信息验证这些程序。这确保了我们的地面真值既计算上精确又可验证,而LLMs的生成能力提供了广泛的语言多样性。利用这一管道,我们构建了一个包含3个模拟器、11000多个场景和300000多张/视频指令调优对的数据集。我们证明,基于我们数据训练的VLM在多个空间基准测试中取得了显著的性能提升,并优于其他等量规模的开源数据集。此外,可扩展性分析证实了我们的假设,即克服传统模板方法的低多样性对于构建稳健、泛化的空间智能至关重要。我们将使SPRITE框架代码和完整的300000+数据集公开,以促进未来在空间智能方面的研究。
Summary / 总结
The research aims to enhance the spatial reasoning capabilities of Vision-Language Models (VLMs) by addressing the limitations of existing datasets. SPRITE, a novel framework, synthesizes scalable, diverse, and high-quality spatial reasoning data using simulators and large language models. The method reframes ground-truth generation as a code-generation task: LLMs compile spatial questions into executable programs that are verified against scene meta-information from the simulators, which gives computational precision alongside linguistic diversity. A VLM trained on the resulting data achieves significant gains on multiple spatial benchmarks and outperforms models trained on other open-source datasets of equivalent size, supporting the claim that overcoming the low diversity of traditional template methods is important for building robust spatial intelligence.
论文通过引入SPRITE框架,利用程序化生成方法合成大规模、多样性和高质量的空间推理数据,以解决当前视觉-语言模型在空间理解和推理方面的局限性。SPRITE将地面真值生成重新定义为代码生成任务,使用大语言模型将复杂的空间问题编译成可执行程序,并通过模拟器中的高精度场景信息进行验证。该数据集包含3个模拟器、11,000多个场景和300,000多张/视频指令调优对。基于此数据集训练的视觉-语言模型在空间基准测试中表现出显著的性能提升,并优于其他等量级的开源数据集,验证了克服传统模板方法的低多样性对于构建稳健的空间智能的重要性。
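The "ground truth as code generation" idea in the SPRITE entry above can be illustrated with a toy example: a spatial question is expressed as a small program and executed against scene metadata, so the answer is computed rather than annotated. The metadata schema, object names, and question below are hypothetical and are not taken from the SPRITE dataset.

```python
import math

# Hypothetical scene meta-information as a simulator might export it (metres).
scene = {
    "camera": (0.0, 0.0, 0.0),
    "objects": {
        "chair": (1.0, 0.0, 2.0),
        "table": (3.0, 0.0, 4.0),
        "lamp": (-1.0, 1.5, 1.0),
    },
}

def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.dist(a, b)

def closer_to_camera(scene, name_a, name_b):
    """Program for the question: 'Which is closer to the camera, A or B?'"""
    cam = scene["camera"]
    objs = scene["objects"]
    if distance(objs[name_a], cam) < distance(objs[name_b], cam):
        return name_a
    return name_b

# Executing the program yields a verifiable ground-truth answer.
answer = closer_to_camera(scene, "chair", "table")
```

An LLM can phrase the same question in many ways while the program stays fixed, which is how precision and linguistic diversity can coexist in such a pipeline.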
From Frames to Clips: Training-free Adaptive Key Clip Selection for Long-Form Video Understanding
Authors: Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, Garin Kessler
First: 2025-10-02T17:43:01+00:00 · Latest: 2025-12-18T06:01:41+00:00
Abstract
Video Large Language Models (VLMs) have achieved strong performance on various vision-language tasks, yet their practical use is limited by the massive number of visual tokens produced from raw video frames, which quickly exhausts the model's context window. Existing solutions mitigate this issue by selecting a sparse set of frames, but such frame-wise selection discards essential temporal dynamics in long-form videos, leading to suboptimal reasoning about motion and event continuity. In this work, we systematically examine the role of temporal information and show that extending selection from isolated key frames to temporally coherent key clips improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we introduce frame resolution as a controllable factor in frame selection, enabling a trade-off between spatial resolution and clip length. Building on this idea, we propose an adaptive clip length module that dynamically balances these factors to ensure a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on Video-MME, LongVideoBench, and MLVU, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling VLMs to real-world video understanding applications. Project webpage is available at https://guangyusun.com/f2c .
中文标题/摘要
标题:从帧到片段:无需训练的自适应关键片段选择以理解长视频
视频大型语言模型(VLMs)在各种视觉语言任务上取得了强大的性能,但它们的实际应用受到从原始视频帧中生成的大量视觉标记的限制,这迅速耗尽了模型的上下文窗口。现有解决方案通过选择稀疏帧集来缓解这一问题,但这种基于帧的选择会丢弃长视频中的重要时间动态,导致对运动和事件连续性的推理效果不佳。在本文中,我们系统地探讨了时间信息的作用,并表明将选择从孤立的关键帧扩展到时间上连贯的关键片段可以提高视频理解。为了在保持固定计算预算的同时适应片段更大的标记占用空间,我们引入了帧分辨率作为帧选择的可控因素,从而在空间分辨率和片段长度之间实现权衡。在此基础上,我们提出了一种自适应片段长度模块,动态平衡这些因素以确保每个视频的标记计数恒定。在三个长视频基准上的实验表明,我们的无需训练的方法F2C在Video-MME、LongVideoBench和MLVU上的表现分别优于均匀采样8.1%、5.6%和10.3%。这些结果突显了在帧选择中保持时间连贯性的重要性,并为将VLMs扩展到实际视频理解应用提供了实用途径。项目网页可访问 https://guangyusun.com/f2c 。
Summary / 总结
This work addresses the challenge of using Video Large Language Models (VLMs) for long-form video understanding by proposing a training-free adaptive key clip selection method. It extends the selection from isolated key frames to temporally coherent key clips, balancing spatial resolution and clip length to maintain a fixed computational budget. Experiments show that the proposed approach, F2C, outperforms uniform sampling on three long-form video benchmarks by up to 10.3%. This highlights the importance of preserving temporal coherence in frame selection for better video understanding.
该研究提出了一种无需训练的自适应关键片段选择方法,以解决使用视频大型语言模型(VLMs)进行长视频理解的挑战。该方法将选择从孤立的关键帧扩展到具有时间连贯性的关键片段,同时平衡空间分辨率和片段长度,以保持固定的计算预算。实验表明,所提出的F2C方法在三个长视频基准上比均匀采样最多提高了10.3%,这突显了在帧选择中保留时间连贯性的重要性,以实现更好的视频理解。
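The resolution versus clip-length trade-off in the F2C entry above is essentially token arithmetic. The sketch below assumes a ViT-style encoder that emits one token per patch, so tokens per frame scale with pixel area. The patch size, resolutions, and budget are hypothetical, and the paper's adaptive clip length module is not reproduced here.

```python
def tokens_per_frame(height, width, patch=14):
    """Visual tokens for one frame under a patch grid (patch size is an assumption)."""
    return (height // patch) * (width // patch)

def max_clip_length(budget, num_clips, height, width, patch=14):
    """Longest clip length (frames per clip) that keeps all clips within the budget."""
    return budget // (num_clips * tokens_per_frame(height, width, patch))

# Hypothetical budget: 8 isolated key frames at 448x448.
budget = 8 * tokens_per_frame(448, 448)

# At full resolution each of the 8 selections can only be a single frame ...
full_res = max_clip_length(budget, 8, 448, 448)
# ... while halving the side length frees tokens for longer clips.
half_res = max_clip_length(budget, 8, 224, 224)
```

Under these assumptions, halving the frame side length quarters the per-frame token count, so each key frame can grow into a four-frame clip at the same total budget.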
Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation
Authors: Sarosij Bose, Ravi K. Rajendran, Biplob Debnath, Konstantinos Karydis, Amit K. Roy-Chowdhury, Srimat Chakradhar
First: 2025-12-18T05:48:21+00:00 · Latest: 2025-12-18T05:48:21+00:00
Abstract
Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
中文标题/摘要
标题:医学视觉-语言模型的视觉对齐以实现基于图像的放射学报告生成
放射学报告生成(RRG)是实现自动化医疗工作流程、促进准确的患者评估和减轻医疗专业人员工作负担的关键步骤。尽管在大型医学视觉-语言模型(Med-VLMs)方面取得了进展,但生成既视觉接地又临床准确的放射学报告仍然是一个重大挑战。现有方法通常依赖于大型标注语料库进行预训练、昂贵的任务特定偏好数据或基于检索的方法。然而,这些策略未能充分缓解由于视觉和语言表示之间跨模态对齐不良而产生的幻觉。为了解决这些限制,我们提出了VALOR:医学视觉-语言模型的视觉对齐以实现基于图像的放射学报告生成。该方法引入了一种基于强化学习的后对齐框架,利用组相对近邻优化(GRPO)。训练分为两个阶段:(1)通过文本奖励改进Med-VLM,以鼓励临床精确术语;(2)将文本接地模型的视觉投影模块与疾病发现对齐,从而引导注意力集中在与诊断任务最相关的图像区域。在多个基准上的广泛实验表明,VALOR在事实准确性和视觉对齐方面显著提高,实现了对最先进的报告生成方法的重大性能提升。
Summary / 总结
The research aims to improve the accuracy and visual grounding of radiology reports generated by large medical vision-language models. The proposed method, VALOR, uses a reinforcement learning-based post-alignment framework with Group-Relative Proximal Optimization (GRPO) to enhance the model's clinical precision and visual alignment. Experiments show that VALOR significantly improves factual accuracy and visual grounding compared to existing state-of-the-art methods.
研究旨在提高大型医疗视觉语言模型生成的放射学报告的准确性和视觉定位。方法VALOR使用基于强化学习的后对齐框架和组相对近端优化(GRPO)来增强模型的临床精确度并使其视觉和语言表示相匹配。实验表明,与现有最先进的方法相比,VALOR在事实准确性和视觉定位方面取得了显著改进。
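For readers unfamiliar with group-relative methods such as the GRPO step named in the VALOR entry above, the sketch below shows the advantage normalization commonly associated with them: each sampled output is scored relative to the mean and standard deviation of its own group. Whether VALOR uses exactly this form is an assumption, and the reward values are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical textual rewards for four reports sampled for the same image.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Reports scoring above the group mean receive positive advantages and are reinforced, while those below are discouraged, so no separate value network is needed in this formulation.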
UniVCD: A New Method for Unsupervised Change Detection in the Open-Vocabulary Era
Authors: Ziqiang Zhu, Bowei Yang
First: 2025-12-15T08:42:23+00:00 · Latest: 2025-12-18T05:14:28+00:00
Comments: 10 pages, 6 figures
Abstract
Change detection (CD) identifies scene changes from multi-temporal observations and is widely used in urban development and environmental monitoring. Most existing CD methods rely on supervised learning, making performance strongly dataset-dependent and incurring high annotation costs; they typically focus on a few predefined categories and generalize poorly to diverse scenes. With the rise of vision foundation models such as SAM2 and CLIP, new opportunities have emerged to relax these constraints. We propose Unified Open-Vocabulary Change Detection (UniVCD), an unsupervised, open-vocabulary change detection method built on frozen SAM2 and CLIP. UniVCD detects category-agnostic changes across diverse scenes and imaging geometries without any labeled data or paired change images. A lightweight feature alignment module is introduced to bridge the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation while keeping the number of trainable parameters small. On top of this, a streamlined post-processing pipeline is further introduced to suppress noise and pseudo-changes, improving the detection accuracy for objects with well-defined boundaries. Experiments on several public BCD (Binary Change Detection) and SCD (Semantic Change Detection) benchmarks show that UniVCD achieves consistently strong performance and matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU. The results demonstrate that unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD. Code and pretrained models will be released at https://github.com/Die-Xie/UniVCD.
中文标题/摘要
标题:UniVCD:开放词汇时代的无监督变化检测新方法
变化检测(CD)通过多时相观测识别场景变化,在城市开发和环境监测中广泛应用。现有大多数CD方法依赖于监督学习,性能高度依赖于数据集,导致标注成本高昂;它们通常专注于少数预定义类别,难以泛化到多样化的场景。随着SAM2和CLIP等视觉基础模型的兴起,出现了放松这些限制的新机会。我们提出了统一开放词汇变化检测(UniVCD),这是一种基于冻结的SAM2和CLIP构建的无监督、开放词汇变化检测方法。UniVCD能够在没有任何标注数据或配对变化图像的情况下,检测跨多样场景和成像几何的变化。引入了一个轻量级特征对齐模块,将SAM2的空间详细表示与CLIP的语义先验相结合,实现高分辨率、语义感知的变化估计,同时保持可训练参数数量较少。在此基础上,引入了一条简化的后处理流水线,以抑制噪声和伪变化,提高具有明确边界对象的检测准确性。在几个公开的二值变化检测(BCD)和语义变化检测(SCD)基准测试上进行的实验表明,UniVCD在F1和IoU等关键指标上表现出一致的强性能,并且在某些情况下超越了现有的开放词汇变化检测方法。结果表明,使用冻结的视觉基础模型和轻量级多模态对齐的无监督变化检测是一种实用且有效的开放词汇变化检测范式。代码和预训练模型将在https://github.com/Die-Xie/UniVCD上发布。
Summary / 总结
UniVCD is an unsupervised, open-vocabulary change detection method that leverages frozen SAM2 and CLIP to detect category-agnostic changes across diverse scenes without labeled data or paired change images. It introduces a lightweight feature alignment module that combines spatially detailed representations from SAM2 with semantic priors from CLIP, enabling high-resolution, semantically aware change estimation, and a streamlined post-processing pipeline that suppresses noise and pseudo-changes. Experiments on several public BCD and SCD benchmarks show that UniVCD matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU, supporting frozen vision foundation models with lightweight multi-modal alignment as a practical paradigm for open-vocabulary change detection.
UniVCD 是一种无需标注数据的无监督变化检测方法,利用冻结的 SAM2 和 CLIP 来检测各种场景中的类别无关变化。它引入了一个轻量级的特征对齐模块,将 SAM2 的空间详细表示和 CLIP 的语义先验结合起来,实现高分辨率、语义感知的变化估计。实验结果显示,UniVCD 在多个公开的二值变化检测和语义变化检测基准上表现出色,F1 和 IoU 等关键指标上超越了现有方法。该方法展示了使用冻结的视觉基础模型进行无监督变化检测的有效性。
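For readers unfamiliar with the F1 and IoU metrics cited in the UniVCD entry above, the following is a minimal sketch of how both are commonly computed for the "changed" class of a binary change mask. The function name `change_metrics` and the toy masks are illustrative assumptions, not taken from the UniVCD code base.

```python
from typing import Dict, Sequence


def change_metrics(pred: Sequence[Sequence[int]],
                   truth: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Precision, recall, F1 and IoU for the 'changed' class (pixel value 1)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1          # change correctly detected
            elif p == 1 and t == 0:
                fp += 1          # pseudo-change (false alarm)
            elif p == 0 and t == 1:
                fn += 1          # missed change
    denom = tp + fp + fn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if denom else 0.0,
        "iou": tp / denom if denom else 0.0,
    }


# Toy 2x3 masks (hypothetical data): 1 = changed, 0 = unchanged.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]

metrics = change_metrics(pred_mask, true_mask)
print(metrics)
```

Because IoU divides by the union of predicted and true change pixels, it is never larger than F1 on the same masks, which is why the two are usually reported together.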
C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation
Authors: Chao Li, Dasha Hu, Chengyang Li, Yuming Jiang, Yuncheng Shen
First: 2025-12-18T04:30:53+00:00 · Latest: 2025-12-18T04:30:53+00:00
Abstract
Unsupervised Domain Adaptation (UDA) transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the significant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distributions but neglect conditional distribution discrepancies, leading to critical issues such as class prototype misalignment and degraded semantic discriminability. To address these limitations, the work proposes C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation. C-DGPA synergistically optimizes marginal distribution alignment and conditional distribution alignment through a novel dual-branch architecture. The marginal distribution alignment branch employs a dynamic adversarial training framework to bridge marginal distribution discrepancies. Simultaneously, the conditional distribution alignment branch introduces a Class Mapping Mechanism (CMM) to align conditional distribution discrepancies by standardizing semantic prompt understanding and preventing source domain over-reliance. This dual alignment strategy effectively integrates domain knowledge into prompt learning via synergistic optimization, ensuring domain-invariant and semantically discriminative representations. Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate the superiority of C-DGPA. It achieves new state-of-the-art results on all benchmarks.
中文标题/摘要
标题:C-DGPA: 以类为中心的双重对齐生成提示适应
无监督领域适应将已标记的源领域知识转移到未标记的目标领域。直接在下游UDA任务中部署带有提示调优的视觉-语言模型(VLMs)面临显著挑战,即缓解领域差异。现有提示调优策略主要对齐边缘分布,但忽视了条件分布差异,导致诸如类原型对齐错误和语义可区分性下降等关键问题。为解决这些局限性,该工作提出了C-DGPA:以类为中心的双重对齐生成提示适应。C-DGPA通过一种新颖的双重分支架构协同优化边缘分布对齐和条件分布对齐。边缘分布对齐分支采用动态对抗训练框架来弥合边缘分布差异。同时,条件分布对齐分支引入类映射机制(CMM)通过标准化语义提示理解来对齐条件分布差异,防止对源领域过度依赖。这种双重对齐策略通过协同优化有效地将领域知识整合到提示学习中,确保领域不变和语义可区分的表示。在OfficeHome、Office31和VisDA-2017上的广泛实验验证了C-DGPA的优越性。它在所有基准上均实现了新的最佳结果。
Summary / 总结
C-DGPA addresses the challenge of domain discrepancies in unsupervised domain adaptation by proposing a class-centric dual-alignment generative prompt adaptation method. It optimizes both marginal and conditional distribution alignments through a dual-branch architecture, using a dynamic adversarial training framework and a Class Mapping Mechanism to prevent over-reliance on the source domain. Experiments on OfficeHome, Office31, and VisDA-2017 demonstrate that C-DGPA outperforms existing methods, achieving new state-of-the-art results.
C-DGPA 通过提出一种基于类别的双对齐生成提示适应方法,旨在解决无监督领域适应中的领域差异问题。它使用双分支架构同时对齐边缘分布和条件分布,采用动态对抗训练框架和类映射机制(CMM)来缓解类原型错位和语义可区分性问题。在 OfficeHome、Office31 和 VisDA-2017 上的实验表明,C-DGPA 的性能优于现有方法,取得了新的最佳结果。
MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation
Authors: Pengyu Wang, Shuchang Ye, Usman Naseem, Jinman Kim
First: 2025-12-18T03:57:55+00:00 · Latest: 2025-12-18T03:57:55+00:00
Comments: 12 pages
Abstract
Medical report generation (MRG) aims to automatically derive radiology-style reports from medical images to aid in clinical decision-making. However, existing methods often generate text that mimics the linguistic style of radiologists but fails to guarantee clinical correctness, because they are trained on token-level objectives that focus on word choice and sentence structure rather than actual medical accuracy. We propose a semantic-driven reinforcement learning (SRL) method for medical report generation, applied to a large vision-language model (LVLM). SRL adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning beyond imitation of language style. Specifically, we optimise a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, thereby directly aligning clinical-label agreement and improving semantic correctness. A lightweight reasoning format constraint further guides the model to generate structured "thinking report" outputs. We evaluate Medical Report Generation with Semantic-driven Reinforcement Learning (MRG-R1) on two datasets, IU X-Ray and MIMIC-CXR, using clinical efficacy (CE) metrics. MRG-R1 achieves state-of-the-art performance with CE-F1 51.88 on IU X-Ray and 40.39 on MIMIC-CXR. We find that label-semantic reinforcement outperforms conventional token-level supervision. These results indicate that optimizing a clinically grounded, report-level reward, rather than token overlap, meaningfully improves clinical correctness. This work is an early exploration of semantic reinforcement for supervising medical correctness in medical large vision-language model (Med-LVLM) training.
中文标题/摘要
标题:MRG-R1:临床对齐的医学报告生成强化学习
医学报告生成(MRG)旨在从医学图像中自动提取放射学风格的报告,以辅助临床决策。然而,现有方法生成的文本虽然模仿了放射科医生的语言风格,但无法保证临床正确性,因为它们是基于词元级目标进行训练的,这些目标侧重于词汇选择和句子结构,而不是实际的医学准确性。我们提出了一种基于语义的强化学习(SRL)方法用于医学报告生成,采用了一个大型视觉语言模型(LVLM)。SRL采用组相对策略优化(GRPO)来鼓励临床正确性引导的学习,而不仅仅是语言风格的模仿。具体来说,我们优化了一个报告级奖励:生成报告和参考报告中提取的关键放射学发现之间的余弦相似度的边际计算(MCCS),从而直接对齐临床标签一致性和提高语义正确性。一种轻量级的推理格式约束进一步引导模型生成结构化的“思考报告”输出。我们使用临床效用(CE)指标在两个数据集:IU X-Ray和MIMIC-CXR上评估了基于语义驱动的强化学习的医学报告生成(MRG-R1)。MRG-R1在IU X-Ray上实现了最先进的性能,CE-F1为51.88,在MIMIC-CXR上为40.39。我们发现标签语义强化比传统的词元级监督效果更好。这些结果表明,优化一个基于临床的报告级奖励而不是词元重叠,显著提高了临床正确性。这项工作是探索在训练医学大型视觉语言模型(Med-LVLM)时监督医学正确性的语义强化的一个先驱。
Summary / 总结
The paper proposes MRG-R1, a semantic-driven reinforcement learning method for medical report generation that aims to improve clinical correctness by optimizing a report-level reward rather than token-level objectives. The method trains a large vision-language model with Group Relative Policy Optimization (GRPO), using as reward a margin-based cosine similarity between key radiological findings extracted from generated and reference reports. MRG-R1 achieves state-of-the-art performance with CE-F1 scores of 51.88 on IU X-Ray and 40.39 on MIMIC-CXR, indicating better clinical correctness than conventional token-level supervision.
研究旨在提高机器学习模型生成的医疗报告的临床正确性。提出了一种基于语义的强化学习方法MRG-R1,使用大型视觉语言模型和基于余弦相似度的报告级奖励来增强临床准确性。该方法在两个数据集IU X-Ray和MIMIC-CXR上分别取得了CE-F1分数51.88和40.39的优异成绩,表明与临床标签有更好的对齐和语义正确性提升。
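To make the reward idea in the MRG-R1 entry above more concrete, here is a minimal sketch of a margin-based cosine-similarity reward over finding labels, followed by the group-relative reward normalisation that GRPO is known for. The abstract does not give the exact MCCS formula, so the margin rule in `margin_reward`, the binary label vectors, and the margin value 0.5 are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def margin_reward(gen_labels: Sequence[float],
                  ref_labels: Sequence[float],
                  margin: float = 0.5) -> float:
    """ASSUMED margin rule (not from the paper): similarity at or below the
    margin earns zero; similarity above it is rescaled into (0, 1]."""
    sim = cosine(gen_labels, ref_labels)
    return max(0.0, sim - margin) / (1.0 - margin)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalise each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical finding labels: one slot per finding, 1 = present, 0 = absent.
reference = [1, 0, 1, 0]
sampled_reports = [
    [1, 0, 1, 0],  # all findings agree with the reference
    [1, 0, 0, 0],  # one finding missed
    [0, 1, 0, 0],  # wrong finding reported
]

rewards = [margin_reward(labels, reference) for labels in sampled_reports]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

The point of the sketch is that the reward depends only on which findings a report asserts, not on its wording, so two differently phrased reports with the same findings score the same.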
Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation
Authors: Seogkyu Jeon, Kibeom Hong, Hyeran Byun
Venue: ICCV 2025 poster
First: 2025-12-03T06:58:38+00:00 · Latest: 2025-12-18T03:34:53+00:00
Comments: ICCV 2025 (poster)
Abstract
Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision-Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises from the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, namely Domain-aware Prompt-driven Masked Transformer (DPMFormer). First, we introduce domain-aware prompt learning to facilitate semantic alignment between visual and textual cues. Second, to capture various domain-specific properties with a single source dataset, we propose domain-aware contrastive learning along with texture perturbation that diversifies the observable domains. Lastly, to establish a framework resilient to diverse environmental changes, we propose domain-robust consistency learning, which guides the model to minimize discrepancies between predictions on the original and augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state-of-the-art on various DGSS benchmarks. The code is available at https://github.com/jone1222/DPMFormer.
中文标题/摘要
标题:利用领域属性的语言驱动领域泛化在语义分割中的应用
近期的领域泛化语义分割(DGSS)研究通过从视觉语言模型(VLMs)中提炼语义知识取得了显著进步。然而,它们忽视了由于固定上下文提示在单一源领域学习而导致的视觉和文本上下文之间的语义不一致。为此,我们提出了一种新的语义分割领域泛化框架,即领域感知提示驱动的掩码变换器(DPMFormer)。首先,我们引入了领域感知提示学习,以促进视觉和文本线索之间的语义对齐。为了用单一源数据集捕捉各种领域特定属性,我们提出了领域感知对比学习以及纹理扰动,以多样化可观察的领域。最后,为了建立一个对多种环境变化具有鲁棒性的框架,我们提出了领域鲁棒一致性学习,以引导模型最小化原始图像和增强图像预测之间的差异。通过实验和分析,我们展示了所提出框架的优越性,该框架在各种DGSS基准上建立了新的最先进水平。代码可在https://github.com/jone1222/DPMFormer/ 获取。
Summary / 总结
This paper addresses the issue of semantic misalignment in domain generalized semantic segmentation by proposing DPMFormer, which includes domain-aware prompt learning, domain-aware contrastive learning with texture perturbation, and domain-robust consistency learning. Experiments show that DPMFormer outperforms existing methods on various DGSS benchmarks, setting a new state-of-the-art. The code is available on GitHub.
研究旨在通过解决视觉和文本上下文之间的语义不匹配来提高领域泛化的语义分割。方法DPMFormer引入了领域感知提示学习、带有纹理扰动的领域感知对比学习以及领域鲁棒一致性学习。实验表明,DPMFormer在各种DGSS基准上优于现有方法,达到了新的最佳水平。代码可在GitHub上获得。
Auto-Vocabulary 3D Object Detection
Authors: Haomeng Zhang, Kuan-Chuan Peng, Suhas Lohit, Raymond A. Yeh
First: 2025-12-18T01:53:40+00:00 · Latest: 2025-12-18T01:53:40+00:00
Comments: technical report
Abstract
Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves the state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.
中文标题/摘要
标题:Auto-Vocabulary 3D物体检测
开放词汇3D物体检测方法能够在训练期间未见过的类别中定位3D框。尽管名称如此,现有方法在训练和推理期间都依赖于用户指定的类别。我们提出研究Auto-Vocabulary 3D物体检测(AV3DOD),其中检测到的物体的类别是自动生成的,无需任何用户输入。为此,我们引入语义得分(SS)来评估生成的类别名称的质量。然后,我们开发了一个新的框架AV3DOD,该框架利用2D视觉-语言模型(VLMs)通过图像描述、伪3D框生成和特征空间语义扩展来生成丰富的语义候选。AV3DOD在ScanNetV2和SUNRGB-D数据集上的定位(mAP)和语义质量(SS)上均达到最新技术水平。值得注意的是,它在整体mAP上超越了最新技术水平CoDA 3.48,并在ScanNetV2上的SS上实现了24.5%的相对改进。
Summary / 总结
The research studies Auto-Vocabulary 3D Object Detection, in which class names for detected objects are generated automatically without user input, unlike open-vocabulary methods that still rely on user-specified classes. It introduces a Semantic Score (SS) to evaluate the quality of generated class names, and the AV3DOD framework, which uses 2D vision-language models to generate semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves state-of-the-art performance in both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets, surpassing the previous best method, CoDA, by 3.48 overall mAP with a 24.5% relative improvement in SS on ScanNetV2.
研究旨在开发一种无需用户输入即可自动为检测到的对象生成类别的开放词汇3D目标检测方法。提出的AV3DOD框架使用2D视觉语言模型生成语义候选,并扩展特征空间语义。该方法在ScanNetV2和SUNRGB-D数据集上的定位(mAP)和语义质量(SS)上达到了最先进的性能,超越了之前的最佳方法CoDA,整体mAP提高了3.48,SS相对提高了24.5%。
Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Authors: Earl Ranario, Mason J. Earles
First: 2025-12-17T21:22:44+00:00 · Latest: 2025-12-17T21:22:44+00:00
Comments: Draft version
Abstract
Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, spanning 162 classes across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (for example, from 21% to 30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
中文标题/摘要
标题:视觉语言模型在农业中是否准备好零样本替代监督分类模型?
视觉语言模型(VLMs)越来越多地被提议作为视觉识别任务的一般解决方案,但它们在农业决策支持中的可靠性仍然知之甚少。我们对来自AgML集合的27个农业分类数据集进行了基准测试,这些数据集涵盖了植物病害、害虫和损伤、以及植物和杂草物种识别的162个类别。在所有任务中,零样本VLMs的表现显著低于监督任务特定基线(YOLO11),后者始终比任何基础模型获得更高的准确率。在多项选择提示下,表现最佳的VLM(Gemini-3 Pro)的平均准确率为约62%,而开放式提示则导致性能大幅下降,通常准确率低于25%。基于LLM的语义评估提高了开放式提示的准确率(例如,顶级模型从21%提高到30%),并改变了模型排名,表明评估方法对报告结论有实质性影响。在开源模型中,Qwen-VL-72B表现最佳,在受限提示下接近闭源性能,但仍落后于顶级专有系统。任务级分析表明,植物和杂草物种分类始终比害虫和损伤识别更容易,后者是所有模型中最具有挑战性的类别。总体而言,这些结果表明,当前的即用型VLMs尚不适合作为独立的农业诊断系统,但在与受限界面、明确标签本体和领域意识评估策略配对时,可以作为辅助组件发挥作用。
Summary / 总结
The study benchmarks open-source and closed-source vision-language models (VLMs) on 27 agricultural classification datasets from the AgML collection, finding that zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11). The best-performing VLM under multiple-choice prompting (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting typically yields raw accuracies below 25%. Applying LLM-based semantic judging improves open-ended accuracy and alters model rankings. The research suggests that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems but can serve as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
研究评估了视觉语言模型(VLMs)在27个农业分类数据集上的表现,发现零样本VLMs的表现低于监督任务特定基线。在多项选择提示下,表现最好的VLM(Gemini-3 Pro)达到约62%的准确率,而开放提示下的表现较低。应用基于LLM的语义判断可以提高开放提示的准确率并改变模型排名。在开源模型中,Qwen-VL-72B表现最佳,但仍落后于顶级专有系统。研究建议当前的VLMs尚不适合作为独立诊断工具,但在适当约束和评估方法下可以辅助农业决策。
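The gap between raw and judged open-ended accuracy reported in the agriculture entry above comes down to how a free-text answer is matched to a label. The sketch below contrasts strict exact-match scoring with a lenient scorer. The paper uses an LLM as the semantic judge; here a hand-written synonym table stands in for it, and the labels, predictions, and the names `exact_match_accuracy` and `judged_accuracy` are hypothetical.

```python
from typing import Dict, List, Set


def exact_match_accuracy(preds: List[str], labels: List[str]) -> float:
    """Strict scoring: a prediction counts only if it equals the label text."""
    hits = sum(p.strip().lower() == l.strip().lower()
               for p, l in zip(preds, labels))
    return hits / len(labels)


def judged_accuracy(preds: List[str], labels: List[str],
                    synonyms: Dict[str, Set[str]]) -> float:
    """Lenient scoring: also accept answers the 'judge' deems equivalent.
    The synonym table is a stand-in for an LLM-based semantic judge."""
    def accepted(pred: str, label: str) -> bool:
        p, l = pred.strip().lower(), label.strip().lower()
        return p == l or p in synonyms.get(l, set())

    hits = sum(accepted(p, l) for p, l in zip(preds, labels))
    return hits / len(labels)


# Hypothetical ground-truth labels and open-ended model answers.
labels = ["tomato late blight", "aphid", "palmer amaranth", "corn rust"]
preds = ["late blight", "Aphid", "Amaranthus palmeri", "leaf spot"]

# Hypothetical equivalences the judge would accept for each label.
synonyms = {
    "tomato late blight": {"late blight", "phytophthora infestans"},
    "palmer amaranth": {"amaranthus palmeri"},
}

raw = exact_match_accuracy(preds, labels)
judged = judged_accuracy(preds, labels, synonyms)
print(raw, judged)
```

On this toy set the same four answers score 0.25 under exact match and 0.75 under the lenient scorer, which illustrates how the evaluation method alone can shift reported accuracy.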
From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection
Authors: Manuel Nkegoum, Minh-Tan Pham, Élisa Fromont, Bruno Avignon, Sébastien Lefèvre
First: 2025-12-17T21:06:36+00:00 · Latest: 2025-12-17T21:06:36+00:00
Abstract
Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, where robust perception under diverse illumination conditions is essential. However, the limited availability of annotated multispectral data severely restricts the training of deep detectors. In such data-scarce scenarios, textual class information can serve as a valuable source of semantic supervision. Motivated by the recent success of Vision-Language Models (VLMs) in computer vision, we explore their potential for few-shot multispectral object detection. Specifically, we adapt two representative VLM-based detectors, Grounding DINO and YOLO-World, to handle multispectral inputs and propose an effective mechanism to integrate text, visual and thermal modalities. Through extensive experiments on two popular multispectral image benchmarks, FLIR and M3FD, we demonstrate that VLM-based detectors not only excel in few-shot regimes, significantly outperforming specialized multispectral models trained with comparable data, but also achieve competitive or superior results under fully supervised settings. Our findings reveal that the semantic priors learned by large-scale VLMs effectively transfer to unseen spectral modalities, offering a powerful pathway toward data-efficient multispectral perception.
中文标题/摘要
标题:从文字到波长:基于VLM的少量样本多光谱目标检测
多光谱目标检测对于自动驾驶和监控等安全敏感应用至关重要,其中在不同光照条件下进行稳健感知是必不可少的。然而,标注的多光谱数据的有限可用性严重限制了深度检测器的训练。在这种数据稀缺的情景下,文本类信息可以作为有价值的语义监督来源。受近期视觉语言模型(VLMs)在计算机视觉领域取得成功的影响,我们探索了它们在少量样本多光谱目标检测中的潜力。具体而言,我们对两种代表性的VLM基检测器Grounding DINO和YOLO-World进行了适应,使其能够处理多光谱输入,并提出了一种有效机制来整合文本、视觉和热成像模态。通过在两个流行的多光谱图像基准FLIR和M3FD上进行广泛的实验,我们证明基于VLM的检测器不仅在少量样本场景中表现出色,显著优于使用相似数据训练的专业多光谱模型,而且在完全监督设置下也能取得具有竞争力或更优的结果。我们的研究结果表明,大规模VLM学习到的语义先验能够有效转移到未见过的光谱模态中,为数据高效多光谱感知提供了一条强大的途径。
Summary / 总结
The paper explores the use of Vision-Language Models (VLMs) for few-shot multispectral object detection, motivated by the scarcity of annotated multispectral data. By adapting VLM-based detectors like Grounding DINO and YOLO-World to handle multispectral inputs and integrating text, visual, and thermal modalities, the authors demonstrate that these models outperform specialized multispectral models in few-shot settings and achieve competitive results in fully supervised settings. The findings suggest that semantic priors learned by VLMs can effectively transfer to unseen spectral modalities, offering a data-efficient approach to multispectral perception.
该论文探索了使用Vision-Language模型(VLMs)进行少量样本多光谱目标检测的方法,动机在于在安全敏感应用中需要在多种照明条件下实现稳健的感知。作者将两种基于VLM的检测器Grounding DINO和YOLO-World适应处理多光谱输入,并整合了文本、视觉和热成像模态。在FLIR和M3FD基准数据集上的实验表明,基于VLM的检测器在少量样本设置中表现出色,超越了专门针对多光谱数据训练的模型,并且在全监督设置中也取得了竞争力或更优的结果,表明VLM学到的语义先验可以有效转移到未见过的光谱模态中。
Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models
Authors: Utsav Panchal, Yuchen Liu, Luigi Palmieri, Ilche Georgievski, Marco Aiello
Venue: WACV
First: 2025-12-17T20:44:32+00:00 · Latest: 2025-12-17T20:44:32+00:00
Comments: Accepted at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
Abstract
Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of human-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we fine-tune CAMP-VLM with synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.
中文标题/摘要
标题:有图有真相(和预测):基于视觉语言模型的上下文感知多人体行为预测
准确预测人体行为对于在人群环境中操作移动机器人至关重要。尽管先前的研究主要集中在从第一人称视角预测单人体的行为,但许多机器人应用需要从第三人称视角理解多人体的行为。为此,我们提出了CAMP-VLM(上下文感知多人体行为预测):一种基于视觉语言模型(VLM)的框架,该框架结合了视觉输入中的上下文特征和场景图中的空间意识,以增强对人类-场景交互的预测。由于缺乏适用于第三人称视角多人体行为预测的合适数据集,我们使用逼真的模拟器生成的人体行为数据对CAMP-VLM进行微调,并在合成和真实世界序列上评估模型,以评估其泛化能力。利用监督微调(SFT)和直接偏好优化(DPO),CAMP-VLM在预测准确性上比表现最佳的基线高出66.9%。
Summary / 总结
The research aims to improve the prediction of human behaviors in environments with multiple people, observed from a third-person view rather than in single-person egocentric scenarios. It introduces CAMP-VLM, a Vision Language Model framework that integrates visual context and spatial awareness from scene graphs to predict human interactions with the scene. The model is fine-tuned on synthetic data with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) and evaluated on both synthetic and real-world sequences, outperforming the best-performing baseline by up to 66.9% in prediction accuracy.
研究旨在提高在多人环境中预测人类行为的能力,超越单一人的场景。引入了CAMP-VLM框架,该框架结合了视觉上下文和场景图中的空间意识来预测人类与环境的互动。该模型使用合成数据进行微调,并在合成和真实世界序列上进行评估,显示出比现有方法高达66.9%的预测准确性提升。
R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
Authors: Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
First: 2025-12-17T20:08:32+00:00 · Latest: 2025-12-17T20:08:32+00:00
Abstract
Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
中文标题/摘要
标题:R4:四维时空中的检索增强推理方法在视觉语言模型中的应用
人类通过构建持久的、结构化的内部表示来感知和推理其周围的四维环境,这些表示编码了语义意义、空间布局和时间动态。这些多模态记忆使他们能够回忆过去的事件、推断未观察到的状态,并将新信息整合到上下文相关的推理中。受此能力的启发,我们提出了R4,一种无需训练的四维时空中的检索增强推理框架,为视觉语言模型(VLMs)提供了结构化的、终身记忆。R4通过在度量空间和时间中锚定对象级语义描述,不断构建一个四维知识数据库,生成一个持久的世界模型,可以在不同代理之间共享。在推理时,自然语言查询被分解为语义、空间和时间键以检索相关观察,这些观察被整合到VLM的推理中。与传统的检索增强生成方法不同,R4中的检索直接在四维空间中进行,这使得它能够进行事件性和协作性推理而无需训练。在基于体感问答和导航基准的实验中,R4在时空信息检索和推理方面显著优于基线方法,推动了动态环境中四维推理的新范式。
Summary / 总结
R4 is a training-free framework for vision-language models that enhances their reasoning capabilities in a 4D spatio-temporal space by building a structured, lifelong memory. It constructs a 4D knowledge database by anchoring object-level semantic descriptions in space and time, allowing agents to retrieve and integrate relevant observations for context-dependent reasoning. Experiments show that R4 significantly improves retrieval and reasoning over spatio-temporal information compared to baseline methods, advancing embodied 4D reasoning in dynamic environments.
研究旨在通过使视觉语言模型能够在4D时空空间中进行推理,类似于人类的感知。方法是引入一种名为R4的检索增强推理框架,该框架通过在空间和时间中锚定对象级描述来构建持久的4D知识数据库。关键发现表明,R4在体感问答和导航基准测试中显著提高了对时空信息的检索和推理能力,推动了动态环境中的4D推理新范式的进步。
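The retrieval step described in the R4 entry above, where a query is split into semantic, spatial, and temporal keys, can be sketched roughly as follows. This is not the R4 implementation: the `Observation` record, the `retrieve` function, and the use of substring matching in place of a learned semantic match are all simplifying assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    """One memory entry: an object description anchored in space and time."""
    description: str
    position: Tuple[float, float, float]  # metric coordinates (x, y, z)
    timestamp: float                      # seconds since start of operation


def retrieve(memory: List[Observation], keyword: str,
             center: Tuple[float, float, float], radius: float,
             t_start: float, t_end: float) -> List[Observation]:
    """Return entries matching all three keys, newest first."""
    hits = []
    for obs in memory:
        # Semantic key: substring match stands in for embedding similarity.
        semantic_ok = keyword.lower() in obs.description.lower()
        # Spatial key: within `radius` of the query location.
        spatial_ok = math.dist(obs.position, center) <= radius
        # Temporal key: inside the queried time window.
        temporal_ok = t_start <= obs.timestamp <= t_end
        if semantic_ok and spatial_ok and temporal_ok:
            hits.append(obs)
    return sorted(hits, key=lambda o: o.timestamp, reverse=True)


# Hypothetical memory built up while an agent moves through a building.
memory = [
    Observation("red mug on kitchen table", (1.0, 2.0, 0.8), 10.0),
    Observation("red mug in sink", (1.5, 2.5, 0.9), 50.0),
    Observation("red mug on office desk", (9.0, 9.0, 0.7), 60.0),
    Observation("blue chair near window", (1.2, 2.1, 0.0), 55.0),
]

# "Where was the mug in the kitchen during the first 100 seconds?"
results = retrieve(memory, "mug", center=(1.0, 2.0, 1.0), radius=2.0,
                   t_start=0.0, t_end=100.0)
for obs in results:
    print(obs.timestamp, obs.description)
```

The retrieved observations would then be passed to the VLM as context. Filtering on position and time before any language-level matching is what distinguishes this style of lookup from text-only retrieval-augmented generation.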