arXiv Paper Digest

2025-12-20 03:30
Snapshot: 20251220_0330
MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
Authors: Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath
First: 2025-12-18T18:59:03+00:00 · Latest: 2025-12-18T18:59:03+00:00
Comments: 25 pages, 10 figures. Project page: https://hybridrobotics.github.io/MomaGraph/
Abstract
Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
中文标题/摘要
标题:MomaGraph:基于视觉语言模型的统一场景图及其在体感任务规划中的状态感知
家庭中的移动机械臂必须同时导航和操作。这需要一种紧凑且语义丰富的场景表示,能够捕捉物体的位置、功能以及哪些部分可以操作。场景图是一个自然的选择,但先前的工作往往将空间关系和功能关系分开处理,将场景视为静态快照,不包含物体状态或时间更新,并且忽略了与当前任务相关的最重要信息。为了解决这些限制,我们引入了MomaGraph,这是一种将空间功能关系和部分级交互元素整合在一起的统一场景表示,适用于体感代理。然而,推进这种表示需要合适的数据和严格的评估,这些方面目前仍然不足。因此,我们贡献了MomaGraph-Scenes,这是第一个包含丰富注释、任务驱动的场景图的大规模数据集,以及MomaGraph-Bench,一个涵盖从高层规划到细粒度场景理解六个推理能力的系统评估套件。在此基础上,我们进一步开发了MomaGraph-R1,这是一种7B参数的视觉语言模型,通过强化学习在MomaGraph-Scenes上进行训练。MomaGraph-R1预测任务导向的场景图,并在Graph-then-Plan框架下作为零样本任务规划器。广泛的实验表明,我们的模型在开源模型中达到了最先进的结果,准确率达到71.6%(比最佳基线高11.4%),并且在公共基准测试中具有良好的泛化能力,并且能够有效地转移到真实机器人实验。
Summary / 总结
MomaGraph addresses the limitations of previous scene graph representations by integrating spatial-functional relationships and part-level interactive elements in a single representation. It introduces MomaGraph-Scenes, a large-scale dataset of richly annotated, task-driven scene graphs in household environments, and MomaGraph-Bench, an evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. MomaGraph-R1, a 7B vision-language model trained with reinforcement learning, predicts task-oriented scene graphs and serves as a zero-shot task planner, reaching 71.6% accuracy on the benchmark, 11.4% above the best baseline.
MomaGraph通过整合空间-功能关系和部分级交互元素解决了先前场景图表示的局限性,并引入了MomaGraph-Scenes,这是一个包含丰富注释和任务驱动的场景图的大规模数据集,适用于家庭环境。MomaGraph-R1是一个7B的视觉-语言模型,通过强化学习训练,可以预测任务导向的场景图,并作为零样本任务规划器使用,其在基准测试中的准确率为71.6%,并且在公共基准测试和真实机器人实验中表现出良好的泛化能力。
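To make the idea of a unified, state-aware scene graph more concrete, here is a minimal Python sketch. It is not the MomaGraph schema: the class names (`Node`, `Edge`, `SceneGraph`), the relation labels, and the state strings are all hypothetical and chosen only to illustrate how object states, actionable parts, and spatial versus functional relations could live in one structure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An object in the scene (hypothetical schema, for illustration only)."""
    name: str
    state: str = "unknown"                      # e.g. "open", "closed", "on", "off"
    parts: list = field(default_factory=list)   # actionable parts, e.g. "handle"


@dataclass
class Edge:
    """A directed relation between two objects."""
    source: str
    target: str
    relation: str   # e.g. "inside", "controls"
    kind: str       # "spatial" or "functional"


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_edge(self, edge):
        if edge.kind not in ("spatial", "functional"):
            raise ValueError("kind must be 'spatial' or 'functional'")
        self.edges.append(edge)

    def update_state(self, name, state):
        # A temporal update: the graph is edited in place, not rebuilt.
        self.nodes[name].state = state

    def relations(self, name, kind):
        # All edges of the given kind that touch the named object.
        return [e for e in self.edges
                if e.kind == kind and name in (e.source, e.target)]


graph = SceneGraph()
graph.add_node(Node("cabinet", state="closed", parts=["handle"]))
graph.add_node(Node("mug"))
graph.add_node(Node("light_switch", state="off"))
graph.add_node(Node("lamp", state="off"))
graph.add_edge(Edge("mug", "cabinet", "inside", "spatial"))
graph.add_edge(Edge("light_switch", "lamp", "controls", "functional"))
graph.update_state("cabinet", "open")   # the robot opened the cabinet
```

Keeping both edge kinds in one graph lets a planner ask "where is the mug" and "what controls the lamp" against the same structure, which is the separation the abstract says prior work tends to impose.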
SceneDiff: A Benchmark and Method for Multiview Object Change Detection
Authors: Yuqun Wu, Chih-hao Lin, Henry Che, Aditi Tiwari, Chuhang Zou, Shenlong Wang, Derek Hoiem
First: 2025-12-18T18:59:02+00:00 · Latest: 2025-12-18T18:59:02+00:00
Abstract
We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.
中文标题/摘要
标题:SceneDiff:多视角物体变化检测的基准与方法
我们研究了在不同时间同一场景的两组捕获(图像或视频)之间识别已添加、移除或移动的物体的问题。检测此类变化对于许多应用非常重要,例如机器人整理或建筑进度和安全监控。主要挑战在于不同视角的变化可能导致物体错误地被检测为变化。我们引入了SceneDiff基准,这是第一个包含物体实例注释的多视角变化检测基准,包含350个多样化的视频对,数千个变化的物体。我们还引入了SceneDiff方法,这是一种新的无需训练的多视角物体变化检测方法,利用预训练的3D、分割和图像编码模型来稳健地预测多个基准。该方法在3D中对齐捕获,提取物体区域,并比较空间和语义区域特征以检测变化。在多视角和两视角基准上的实验表明,我们的方法在现有方法的基础上取得了显著的性能提升(相对AP改进94%和37.4%)。基准和代码将公开发布。
Summary / 总结
The research aims to detect changes in objects between two captures of the same scene taken at different times, which is crucial for applications like robotic tidying and construction monitoring. The SceneDiff method uses pretrained 3D, segmentation, and image encoding models to align captures in 3D, extract object regions, and compare spatial and semantic features to detect changes. The method outperforms existing approaches by large margins on both multi-view and two-view benchmarks, with relative AP improvements of 94% and 37.4%, respectively. The SceneDiff Benchmark, which includes 350 diverse video pairs with thousands of changed objects, is also introduced as the first multiview change detection benchmark with object instance annotations.
该研究解决了同一场景在不同时间点的两组捕获中物体变化的检测问题,这对于机器人整理和建筑监控等应用至关重要。为了应对视角变化导致的误检测问题,作者引入了SceneDiff基准,这是一个包含物体实例注释的多视角变化检测基准,并提出了SceneDiff方法,这是一种无需训练的检测方法,利用预训练的3D、分割和图像编码模型对捕获进行对齐,提取物体区域,并比较空间和语义特征。该方法在多视角和两视角基准上的表现显著优于现有方法,相对AP改进分别为94%和37.4%。
Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation
Authors: Valay Bundele, Mehran Hosseinzadeh, Hendrik P. A. Lensch
First: 2025-12-18T18:49:33+00:00 · Latest: 2025-12-18T18:49:33+00:00
Comments: Under Review
Abstract
Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: https://valaybundele.github.io/remedi-sam3/.
中文标题/摘要
标题:增强记忆的SAM3用于遮挡鲁棒的手术器械分割
内窥镜视频中手术器械的准确分割对于计算机辅助干预至关重要,但由于频繁的遮挡、快速运动、镜面伪影以及长期器械再进入,这一任务仍然具有挑战性。尽管SAM3提供了一种强大的时空框架用于视频对象分割,但在手术场景中的性能受限于不加区别的记忆更新、固定的记忆容量以及遮挡后的弱身份恢复。我们提出了一种无需训练的记忆增强扩展ReMeDI-SAM3,通过三个组件解决了这些限制:(i) 与专用的遮挡感知记忆相结合的关联性记忆过滤,用于存储遮挡前的帧;(ii) 一段式插值方案,扩展有效记忆容量;(iii) 基于特征的重新识别模块,结合时间投票,实现可靠的遮挡后身份消歧。这些组件共同减轻了错误累积,并在遮挡后实现了可靠的恢复。在零样本设置下,EndoVis17和EndoVis18上的绝对mcIoU改进分别约为7%和16%,超过了原始的SAM3,甚至优于先前的训练基线方法。项目页面:https://valaybundele.github.io/remedi-sam3/
Summary / 总结
The research aims to improve surgical instrument segmentation in endoscopic videos by addressing challenges such as occlusions and rapid motion. The method, ReMeDI-SAM3, is a training-free extension of SAM3 that adds relevance-aware memory filtering with a dedicated occlusion-aware memory, a piecewise interpolation scheme that expands effective memory capacity, and a feature-based re-identification module with temporal voting. In a zero-shot setting, it improves mean class IoU (mcIoU) over vanilla SAM3 by around 7% and 16% absolute on EndoVis17 and EndoVis18, respectively, outperforming even prior training-based approaches.
研究旨在提高内窥镜视频中的手术器械分割,这对于计算机辅助干预至关重要,但因遮挡和快速运动而具有挑战性。方法ReMeDI-SAM3通过引入相关性感知的记忆过滤器、分段插值方案以及基于特征的重新识别模块来增强SAM3。这种方法显著提高了遮挡鲁棒性,在EndoVis17和EndoVis18上分别实现了约7%和16%的绝对mcIoU改进,超过了先前的方法。
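The third component, feature-based re-identification with temporal voting, can be sketched roughly as follows. This is an assumption-laden illustration, not the ReMeDI-SAM3 implementation: the function names, the use of plain cosine similarity, and the simple majority vote are all hypothetical choices, and the feature vectors are toy values.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reidentify(track_features, gallery):
    """Assign an identity to a re-appearing instrument.

    track_features: feature vectors from consecutive frames after re-entry.
    gallery: dict mapping identity name -> reference feature stored
             before the occlusion (hypothetical storage format).
    Each frame votes for its most similar identity; the majority wins.
    """
    votes = Counter()
    for feat in track_features:
        best = max(gallery, key=lambda name: cosine(feat, gallery[name]))
        votes[best] += 1
    return votes.most_common(1)[0][0]


gallery = {"grasper": [1.0, 0.0], "scissors": [0.0, 1.0]}
frames = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]   # the middle frame is ambiguous
identity = reidentify(frames, gallery)
```

Voting over several frames means a single noisy frame (the second one here) does not flip the assigned identity.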
RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
Authors: Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia
First: 2025-12-18T18:34:23+00:00 · Latest: 2025-12-18T18:34:23+00:00
Comments: Precise region control and planning for instruction-based image editing. Our project page: https://replan-iv-edit.github.io
Abstract
Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io
中文标题/摘要
标题:RePlan:基于推理的区域规划方法用于复杂指令驱动的图像编辑
指令驱动的图像编辑允许通过自然语言控制视觉修改,但现有模型在指令视觉复杂性(IV-复杂性)场景下表现不佳,即复杂的指令与杂乱或模糊的场景相遇时。我们提出了RePlan(区域对齐规划),这是一种计划-执行框架,结合了视觉语言规划器和扩散编辑器。规划器通过逐步推理将指令分解并明确地与目标区域关联;编辑器随后使用无训练注意力区域注入机制应用更改,从而实现精确的、并行的多区域编辑,无需迭代修复。为了增强规划,我们使用基于GRPO的强化学习应用1000个仅指令示例,显著提高了推理准确性和格式可靠性。我们还提出了IV-Edit基准,专注于精细的区域定位和知识密集型编辑。在IV-复杂设置中,RePlan始终优于大型数据集训练的强大基线,提高了区域精度和整体保真度。我们的项目页面:https://replan-iv-edit.github.io
Summary / 总结
RePlan is a plan-then-execute framework for instruction-based image editing that addresses the challenge of Instruction-Visual Complexity. It uses a vision-language planner to decompose instructions and ground them to target regions, followed by a diffusion editor that applies changes without iterative inpainting. RePlan leverages GRPO-based reinforcement learning to enhance planning and outperforms strong baselines in regional precision and overall fidelity across complex settings.
RePlan 是一种用于指令驱动图像编辑的计划-执行框架,通过视觉-语言规划器分解指令并将其明确地与目标区域关联,随后由扩散编辑器应用更改而无需迭代修复。规划器使用基于GRPO的强化学习来提高推理准确性和格式可靠性。RePlan 在复杂场景下表现出色,优于强大的基线模型,在区域精度和整体保真度方面表现出色,适用于精细和知识密集型编辑。
CitySeeker: How Do VLMS Explore Embodied Urban Navigation With Implicit Human Needs?
Authors: Siqi Wang, Chao Liang, Yunfan Gao, Erxin Yu, Sen Li, Yushi Li, Jing Li, Haofen Wang
First: 2025-12-18T16:53:12+00:00 · Latest: 2025-12-18T16:53:12+00:00
Abstract
Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities for exploring embodied urban navigation to address implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We find key bottlenecks in error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To further analyze them, we investigate a series of exploratory strategies, namely Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with robust spatial intelligence required for tackling "last-mile" navigation challenges.
中文标题/摘要
标题:CitySeeker:VLMS 如何探索具有隐含人类需求的实体城市导航?
视觉-语言模型(VLMs)在基于明确指令的导航方面取得了显著进展;然而,它们在动态城市环境中解释隐含的人类需求(例如,“我渴了”)的能力仍然未被充分探索。本文介绍了CitySeeker,这是一种新型基准,旨在评估VLMs的空间推理和决策能力,以探索具有隐含需求的实体城市导航。CitySeeker 包含了8个城市中的6,440条轨迹,涵盖了7个目标驱动场景中的多样视觉特征和隐含需求。广泛的实验表明,即使是表现最好的模型(例如,Qwen2.5-VL-32B-Instruct)也只能完成21.1%的任务。我们发现长期推理中的错误累积、空间认知不足和经验回忆不足是关键瓶颈。为了进一步分析这些问题,我们研究了一系列探索性策略——回溯机制、增强空间认知和基于记忆的检索(BCR),这些策略受到人类认知地图强调的迭代观察-推理循环和适应性路径优化的启发。我们的分析为开发能够应对“最后一公里”导航挑战所需的稳健空间智能的VLMs提供了可操作的见解。
Summary / 总结
CitySeeker evaluates VLMs' ability to navigate urban environments based on implicit human needs, introducing a benchmark with 6,440 trajectories across 8 cities and 7 goal-driven scenarios. Experiments show that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion, with key bottlenecks in error accumulation over long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To analyze these bottlenecks, the study investigates three exploratory strategies: backtracking mechanisms, enriched spatial cognition, and memory-based retrieval (BCR).
CitySeeker 是一个基准,用于评估 VLMs 在基于隐含人类需求的城市导航中的能力。它包含 6,440 条轨迹,覆盖 8 个城市,捕捉了多样化的视觉特征和隐含需求。实验显示,即使顶级模型也只能完成 21.1% 的任务,突显了长时推理、空间认知和经验回忆方面的问题。研究提出了回溯机制、增强空间认知和基于记忆的检索等策略来改进这些能力。
Hierarchical Schedule Optimization for Fast and Robust Diffusion Model Sampling
Authors: Aihua Zhu, Rui Su, Qinglin Zhao, Li Feng, Meng Shen, Shibo He
Venue: AAAI 2026
First: 2025-11-12T08:57:46+00:00 · Latest: 2025-12-18T14:23:34+00:00
Comments: Preprint, accepted to AAAI 2026
Abstract
Diffusion probabilistic models have set a new standard for generative fidelity but are hindered by a slow iterative sampling process. A powerful training-free strategy to accelerate this process is Schedule Optimization, which aims to find an optimal distribution of timesteps for a fixed and small Number of Function Evaluations (NFE) to maximize sample quality. To this end, a successful schedule optimization method must adhere to four core principles: effectiveness, adaptivity, practical robustness, and computational efficiency. However, existing paradigms struggle to satisfy these principles simultaneously, motivating the need for a more advanced solution. To overcome these limitations, we propose the Hierarchical-Schedule-Optimizer (HSO), a novel and efficient bi-level optimization framework. HSO reframes the search for a globally optimal schedule into a more tractable problem by iteratively alternating between two synergistic levels: an upper-level global search for an optimal initialization strategy and a lower-level local optimization for schedule refinement. This process is guided by two key innovations: the Midpoint Error Proxy (MEP), a solver-agnostic and numerically stable objective for effective local optimization, and the Spacing-Penalized Fitness (SPF) function, which ensures practical robustness by penalizing pathologically close timesteps. Extensive experiments show that HSO sets a new state-of-the-art for training-free sampling in the extremely low-NFE regime. For instance, with an NFE of just 5, HSO achieves a remarkable FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1. Crucially, this level of performance is attained not through costly retraining, but with a one-time optimization cost of less than 8 seconds, presenting a highly practical and efficient paradigm for diffusion model acceleration.
中文标题/摘要
标题:层次化时间表优化以实现快速稳健的扩散模型采样
扩散概率模型在生成保真度方面树立了新标准,但受限于缓慢的迭代采样过程。一种强大的无训练策略是时间表优化,其目标是在固定且较小的函数评估次数(NFE)下找到最优的时间步分布,以最大化样本质量。为此,成功的时间表优化方法必须遵循四个核心原则:有效性、适应性、实用鲁棒性和计算效率。然而,现有的范式难以同时满足这些原则,因此需要更先进的解决方案。为克服这些限制,我们提出了层次化时间表优化器(HSO),这是一种新颖且高效的双层优化框架。HSO通过交替进行两个协同工作的层次来重新定义全局最优时间表的搜索,即上层进行全局搜索以找到最优初始化策略,下层进行时间表细化的局部优化。这一过程由两个关键创新引导:中间点误差代理(MEP),一种与求解器无关且数值稳定的局部优化目标,以及间距惩罚适应度(SPF)函数,该函数通过惩罚病态接近的时间步来确保实用鲁棒性。大量实验表明,HSO在极低NFE区间内无训练采样的新标准。例如,使用NFE仅为5时,HSO在Stable Diffusion v2.1上的LAION-Aesthetics数据集上实现了令人瞩目的FID值11.94。至关重要的是,这种性能水平并非通过昂贵的重新训练获得,而是一次优化成本不到8秒,这为扩散模型加速提供了一种高度实用和高效的范式。
Summary / 总结
The paper addresses the challenge of slow sampling in diffusion probabilistic models by proposing the Hierarchical-Schedule-Optimizer (HSO), a bi-level optimization framework. HSO aims to find an optimal distribution of timesteps for a fixed number of function evaluations (NFE) to maximize sample quality. Key innovations include the Midpoint Error Proxy (MEP) for effective local optimization and the Spacing-Penalized Fitness (SPF) function to ensure practical robustness. Experiments demonstrate that HSO achieves state-of-the-art performance, with an NFE of 5 yielding an FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1, and a one-time optimization cost of less than 8 seconds.
论文通过提出层次化时间表优化器(HSO)来解决扩散概率模型的缓慢采样问题,旨在为固定的功能评估次数(NFE)找到最优的时间步分布以最大化样本质量。HSO 使用两层优化框架:上层进行全局最优初始化策略搜索,下层进行时间表细化优化。关键创新包括用于有效局部优化的中间点误差代理(MEP)和用于确保实用鲁棒性的间距惩罚适应度(SPF)函数。实验表明,HSO 达到了最先进的结果,使用 NFE 为 5 时在 LAION-Aesthetics 上的 FID 为 11.94,且优化成本不到 8 秒。
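The abstract describes the Spacing-Penalized Fitness (SPF) function only as penalizing pathologically close timesteps, so the exact formula is not given here. The sketch below shows one plausible shape for such a penalty, purely as an assumption: a hinge penalty on gaps smaller than a hypothetical `min_gap`, added to some proxy error with a hypothetical `weight`. None of these names or constants come from the paper.

```python
def spacing_penalty(timesteps, min_gap):
    """Sum of hinge penalties for adjacent timesteps closer than min_gap."""
    ordered = sorted(timesteps, reverse=True)   # sampling runs from t=1 down to t=0
    gaps = [a - b for a, b in zip(ordered, ordered[1:])]
    return sum(max(0.0, min_gap - gap) for gap in gaps)


def penalized_fitness(proxy_error, timesteps, min_gap=0.05, weight=10.0):
    """Lower is better: proxy error plus a weighted spacing penalty.

    proxy_error stands in for whatever local objective is being minimized;
    min_gap and weight are illustrative values, not values from the paper.
    """
    return proxy_error + weight * spacing_penalty(timesteps, min_gap)


even = [1.0, 0.75, 0.5, 0.25, 0.0]        # evenly spaced schedule, 5 timesteps
clustered = [1.0, 0.99, 0.5, 0.25, 0.0]   # two timesteps nearly coincide

fitness_even = penalized_fitness(1.0, even)
fitness_clustered = penalized_fitness(1.0, clustered)
```

With equal proxy error, the clustered schedule scores worse, which is the behaviour a spacing penalty is meant to produce during a global search over candidate schedules.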
N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
Authors: Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, Dong Yu
First: 2025-12-18T14:03:44+00:00 · Latest: 2025-12-18T14:03:44+00:00
Comments: Project Page: https://n3d-vlm.github.io
Abstract
While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage for 3D object grounding data, yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning in vision-language models.
中文标题/摘要
标题:N3D-VLM:原生3D定位使视觉语言模型在空间推理中获得准确的空间理解
当前的多模态模型虽然可以根据2D图像回答问题,但缺乏内在的3D物体感知能力,限制了它们对3D场景中的空间关系和深度线索的理解能力。在本文中,我们提出了一种名为N3D-VLM的新型统一框架,该框架无缝地将原生3D物体感知与3D感知视觉推理相结合,从而实现精确的3D定位和可解释的空间理解。与传统的端到端模型直接从RGB/RGB-D输入预测答案不同,我们的方法赋予模型原生的3D物体感知能力,使其能够根据文本描述直接在3D空间中定位物体。基于准确的3D物体定位,模型进一步在3D中进行显式的推理,从而实现更可解释和结构化的空间理解。为了支持这些能力的稳健训练,我们开发了一种可扩展的数据构建管道,该管道利用深度估计将大规模的2D注释提升到3D空间,显著增加了3D物体定位数据的多样性和覆盖范围,比现有最大的单张图像3D检测数据集大六倍以上。此外,该管道生成了空间问答数据集,旨在针对3D中的链式推理(CoT)进行训练,从而促进3D物体定位和3D空间推理的联合训练。实验结果表明,我们的统一框架不仅在3D定位任务上达到了最先进的性能,还在视觉语言模型中的3D空间推理方面也始终优于现有方法。
Summary / 总结
This work addresses the limitation of current multimodal models in understanding 3D spatial relationships by proposing N3D-VLM, a unified framework that integrates native 3D object perception with 3D-aware visual reasoning. The model can localize objects in 3D space based on textual descriptions and perform explicit 3D reasoning, leading to more interpretable and structured spatial understanding. The authors develop a scalable data construction pipeline that uses depth estimation to lift large-scale 2D annotations into 3D, producing 3D object grounding data over six times larger than the largest existing single-image 3D detection dataset, along with spatial question-answering data for chain-of-thought reasoning. Experimental results show that N3D-VLM outperforms existing methods in both 3D grounding and spatial reasoning tasks.
研究旨在通过整合原生的3D物体感知和3D感知视觉推理来增强视觉语言模型,解决基于2D模型在理解空间关系方面的局限性。提出的N3D-VLM框架实现了精确的3D定位和可解释的空间理解。使用一个可扩展的数据构建管道生成大规模的3D物体定位数据和空间问答数据集,显著提高了模型在3D定位和空间推理任务上的性能,超越了现有方法。
Scaling Laws for Energy Efficiency of Local LLMs
Authors: Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Samuel Mugel, Román Orús
First: 2025-12-18T13:40:33+00:00 · Latest: 2025-12-18T13:40:33+00:00
Abstract
Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware, including laptops, desktops, industrial controllers, and embedded systems, relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.
中文标题/摘要
标题:局部LLM能效的标度律
在边缘设备上部署局部大型语言模型和视觉-语言模型需要在准确性与受限的计算和能源预算之间进行权衡。尽管图形处理器主导了现代人工智能部署,但大多数消费级硬件(包括笔记本电脑、台式机、工业控制器和嵌入式系统)仍依赖于中央处理器。尽管如此,仅中央处理器的推理计算法则对局部语言和视觉-语言工作负载的研究仍相对较少。我们系统地在两个广泛用于局部推理的中央处理器级别上对大型语言和视觉-语言模型进行了基准测试:一台搭载M2芯片的MacBook Pro,代表主流笔记本电脑级别的部署;以及一个Raspberry Pi 5,代表受限的、低功耗嵌入式设置。我们采用基于连续采样处理器和内存使用情况并结合面积-曲线积分的统一方法,表征了计算负载随输入文本长度的变化规律(对于语言模型)和图像分辨率的变化规律(对于视觉-语言模型)。我们发现了两条经验标度律:(1)语言模型推理的计算成本大约与标记长度成线性关系;(2)视觉-语言模型表现出预处理驱动的“分辨率拐点”,其中计算在内部分辨率限制以上保持恒定,在以下则急剧下降。除了这些规律,我们还表明,基于量子启发的压缩可将处理器和内存使用量最多减少71.9%,能耗最多减少62%,同时保持或提高语义准确性。这些结果提供了对局部语言和视觉-语言工作负载的多模态中央处理器仅计算法则的系统量化,并指出了模型压缩和输入分辨率预处理作为有效、低成本的杠杆,以实现可持续的边缘推理。
Summary / 总结
This study explores the energy efficiency of deploying large language models and vision-language models on edge devices, focusing on central processing units. By benchmarking these models on a MacBook Pro M2 and a Raspberry Pi 5, the researchers discovered two empirical scaling laws: the computational cost for language models scales approximately linearly with token length, and vision-language models show a preprocessing-driven 'resolution knee' where compute remains constant above an internal resolution clamp and decreases sharply below it. Additionally, they found that quantum-inspired compression can reduce processor and memory usage by up to 71.9% and energy consumption by up to 62%, while maintaining or improving semantic accuracy.
研究探讨了在边缘设备上部署大型语言模型和视觉-语言模型时中央处理器的能量效率。通过在MacBook Pro M2和Raspberry Pi 5上进行基准测试,研究揭示了两个缩放定律:语言模型的计算成本随词元长度线性增加,而视觉-语言模型则表现出一个预处理驱动的“分辨率拐点”,即在某一分辨率以上,计算量保持不变,在此之下则急剧下降。此外,研究还表明,基于量子的压缩技术可以将处理器和内存使用量最多减少71.9%,能量消耗最多减少62%,同时保持或提高语义准确性。
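The measurement methodology (continuous sampling plus area-under-curve integration) and the first scaling law (approximately linear cost in token length) can be illustrated with a small sketch. The trapezoidal rule and the ordinary least-squares fit are assumptions about how one might implement these steps, and all sample values below are made up for illustration; they are not measurements from the paper.

```python
def area_under_curve(timestamps, values):
    """Trapezoidal integral of sampled usage (e.g. CPU %) over time."""
    if len(timestamps) != len(values):
        raise ValueError("timestamps and values must have the same length")
    total = 0.0
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        total += 0.5 * (values[i] + values[i - 1]) * dt
    return total


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


# Made-up CPU usage samples (percent) taken once per second during one run.
load = area_under_curve([0, 1, 2, 3], [10, 30, 30, 10])

# Made-up integrated loads for prompts of increasing token length.
slope, intercept = fit_line([100, 200, 400], [10.0, 20.0, 40.0])
```

Integrating usage over the run gives one scalar cost per input, and fitting those costs against token length is a simple way to check whether the relationship looks linear.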
TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
Authors: Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li
First: 2025-12-18T13:34:14+00:00 · Latest: 2025-12-18T13:34:14+00:00
Abstract
Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
中文标题/摘要
标题:TTP:测试时填充用于视觉-语言模型的对抗检测和鲁棒适应
视觉-语言模型(VLMs),如CLIP,已实现令人印象深刻的零样本识别性能,但仍然高度易受对抗性扰动的影响,在安全关键场景中存在重大风险。以往的训练时防御依赖于对抗性微调,这需要标记数据和昂贵的重新训练,而现有的测试时策略无法可靠地区分干净和对抗性输入,从而无法同时达到对抗鲁棒性和干净准确性的最佳效果。为了解决这些限制,我们提出了测试时填充(TTP),这是一种轻量级的防御框架,在推理时执行对抗检测并随后进行目标化适应。TTP 通过计算CLIP特征嵌入在空间填充前后余弦相似度的变化来识别对抗性输入,从而获得适用于不同架构和数据集的通用阈值以实现可靠的检测。对于检测到的对抗性情况,TTP 使用可训练的填充来恢复被破坏的注意力模式,并结合相似性感知的集成策略以实现更鲁棒的最终预测。对于干净输入,TTP 默认不进行更改,或可选地结合现有的测试时适应技术以进一步提高准确性。在多种CLIP后端和细粒度基准上的全面实验表明,TTP 一致地超越了最先进的测试时防御,能够在不牺牲干净准确性的情况下显著提高对抗鲁棒性。该论文的代码将很快发布。
Summary / 总结
The paper introduces Test-Time Padding (TTP), a lightweight defense framework for Vision-Language Models (VLMs) like CLIP, which addresses the vulnerability to adversarial perturbations. TTP detects adversarial inputs using the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding. For detected adversarial inputs, it applies trainable padding to restore disrupted attention patterns, combined with a similarity-aware ensemble for the final prediction, while leaving clean inputs unchanged by default. Experiments show that TTP outperforms existing test-time defenses in terms of adversarial robustness without sacrificing clean accuracy.
论文提出了一种轻量级防御框架Test-Time Padding (TTP),用于Vision-Language Models(VLMs)如CLIP,以应对对抗性扰动的敏感性问题。TTP通过余弦相似度偏移检测对抗性输入,并应用目标填充以恢复被破坏的注意力模式,从而增强鲁棒性而不牺牲干净准确率。实验表明,TTP在各种CLIP骨干网络和细粒度基准上优于现有测试时防御方法。
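The detection step can be sketched as below. This is a toy illustration under several assumptions: the `encode` argument stands in for a CLIP image encoder, images are plain nested lists, padding is zero-filled, and the threshold of 0.1 is arbitrary. The sketch also assumes that a larger similarity shift indicates an adversarial input, which the abstract does not state explicitly.

```python
import math


def pad_image(image, pad, fill=0.0):
    """Surround a 2D image (list of rows) with `pad` pixels of `fill`."""
    width = len(image[0]) + 2 * pad
    top = [[fill] * width for _ in range(pad)]
    body = [[fill] * pad + list(row) + [fill] * pad for row in image]
    bottom = [[fill] * width for _ in range(pad)]
    return top + body + bottom


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect(image, encode, pad=1, threshold=0.1):
    """Return (flagged, shift), where shift = 1 - cosine(before, after).

    `encode` is any callable mapping an image to a feature vector; the
    threshold and the direction of the test are illustrative assumptions.
    """
    shift = 1.0 - cosine(encode(image), encode(pad_image(image, pad)))
    return shift > threshold, shift


def stable_encoder(img):
    # Toy encoder whose features do not change under zero padding.
    flat = [p for row in img for p in row]
    return [sum(flat), max(flat)]


def fragile_encoder(img):
    # Toy encoder whose first feature (mean pixel) changes under padding.
    flat = [p for row in img for p in row]
    return [sum(flat) / len(flat), 1.0]


image = [[1.0, 1.0], [1.0, 1.0]]
flag_stable, shift_stable = detect(image, stable_encoder)
flag_fragile, shift_fragile = detect(image, fragile_encoder)
```

The two toy encoders only demonstrate the mechanics: the same padding yields almost no shift for padding-invariant features and a clear shift for padding-sensitive ones, and the threshold separates the two cases.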
SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning
Authors: Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
First: 2025-12-18T12:27:06+00:00 · Latest: 2025-12-18T12:27:06+00:00
Abstract
Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing the global reference alignment, and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.
中文标题/摘要
标题:SNOW:基于世界知识的时空场景理解
自主机器人系统需要对动态环境进行时空理解,以确保可靠的导航和交互。视觉-语言模型(VLMs)提供了开放世界的语义先验,但缺乏3D几何和时间动态的定位。相反,几何感知捕捉结构和运动,但语义稀疏。我们提出了SNOW(基于开放世界知识的场景理解),这是一种无需训练且不依赖于骨干网络的框架,用于统一的4D场景理解,将VLM提取的语义与点云几何和时间一致性相结合。SNOW处理同步的RGB图像和3D点云,使用HDBSCAN聚类生成对象级提案,指导SAM2基的分割。每个分割区域通过我们提出的时空分块编码(STEP)进行编码,生成多模态令牌,捕捉局部语义、几何和时间属性。这些令牌逐步整合到4D场景图(4DSG)中,作为下游推理的4D先验。轻量级的SLAM后端在环境中将所有STEP令牌空间定位,提供全局参考对齐,并确保时间上的空间定位无歧义。生成的4DSG形成一个可查询的统一世界模型,通过该模型VLM可以直接解释空间场景结构和时间动态。在一系列基准测试上的实验表明,SNOW能够实现精确的4D场景理解和空间定位推理,从而在多个场景中达到新的最佳性能,突显了结构化4D先验对于体态推理和自主机器人的重要性。
Summary / 总结
SNOW is a training-free and backbone-agnostic framework for 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. It processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through Spatio-Temporal Tokenized Patch Encoding (STEP) to produce multimodal tokens that capture semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG) and anchored spatially by a lightweight SLAM backend. Experiments show that SNOW enables precise 4D scene understanding and spatially grounded inference, setting new state-of-the-art performance in several settings.
SNOW 是一个将 VLM 提取的语义信息与 3D 几何和时间一致性相结合的 4D 场景理解框架。它处理同步的 RGB 图像和 3D 点云,使用 HDBSCAN 聚类和 SAM2 基础的分割生成对象级提案。这些提案通过时空分块编码(STEP)来捕捉局部语义、几何和时间属性,并集成到 4D 场景图(4DSG)中。实验表明,SNOW 实现了精确的 4D 场景理解和空间定位推理,并在多个设置中达到了新的最先进的性能。
E-SDS: Environment-aware See it, Do it, Sorted - Automated Environment-Aware Reinforcement Learning for Humanoid Locomotion
Authors: Enis Yalcin, Joshua O'Hara, Maria Stamatopoulou, Chengxu Zhou, Dimitrios Kanoulas
Venue: RiTA 2025 (Springer LNNS)
First: 2025-12-18T12:08:24+00:00 · Latest: 2025-12-18T12:08:24+00:00
Comments: 12 pages, 3 figures, 4 tables. Accepted at RiTA 2025 (Springer LNNS)
Abstract
Vision-language models (VLMs) show promise in automating reward design in humanoid locomotion, which could eliminate the need for tedious manual engineering. However, current VLM-based methods are essentially "blind", as they lack the environmental perception required to navigate complex terrain. We present E-SDS (Environment-aware See it, Do it, Sorted), a framework that closes this perception gap. E-SDS integrates VLMs with real-time terrain sensor analysis to automatically generate reward functions that facilitate training of robust perceptive locomotion policies, grounded by example videos. Evaluated on a Unitree G1 humanoid across four distinct terrains (simple, gaps, obstacles, stairs), E-SDS uniquely enabled successful stair descent, while policies trained with manually-designed rewards or a non-perceptive automated baseline were unable to complete the task. In all terrains, E-SDS also reduced velocity tracking error by 51.9-82.6%. Our framework reduces the human effort of reward design from days to less than two hours while simultaneously producing more robust and capable locomotion policies.
中文标题/摘要
标题:E-SDS:环境感知的看见它、做到它、整理好——面向类人行走的环境感知强化学习自动化
视觉语言模型(VLMs)在自动化类人行走的奖励设计方面显示出潜力,这可能消除繁琐的手动工程需求。然而,当前基于VLM的方法本质上是“盲目的”,因为它们缺乏导航复杂地形所需的环境感知能力。我们提出了E-SDS(环境感知的看见它、做到它、整理好),一种填补这一感知缺口的框架。E-SDS将VLM与实时地形传感器分析集成,以自动生成促进稳健感知行走策略训练的奖励函数,这些策略由示例视频支持。在对Unitree G1类人机器人在四种不同地形(简单地形、缺口、障碍物、楼梯)上进行评估时,E-SDS唯一实现了成功的楼梯下降,而使用手动设计的奖励或非感知自动化基线训练的策略无法完成任务。在所有地形中,E-SDS还将速度跟踪误差降低了51.9%-82.6%。我们的框架将奖励设计的人力投入从几天减少到不到两小时,同时生成了更稳健和更强大的行走策略。
Summary / 总结
E-SDS is a framework that integrates vision-language models with real-time terrain sensor analysis to automate reward design for humanoid locomotion, addressing the lack of environmental perception in current methods. It enables successful stair descent and reduces velocity tracking error by 51.9-82.6% across various terrains, compared to manually-designed rewards or a non-perceptive automated baseline.
E-SDS 是一个框架,将视觉语言模型与实时地形传感器分析结合,为类人机器人行走生成奖励函数,解决了当前方法中缺乏环境感知的问题。在四种地形上的评估表明,E-SDS 使楼梯下降成为可能,并将速度跟踪误差降低了 51.9-82.6%,优于手动设计的奖励或非感知的基线。
Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt
Authors: Shangxun Li, Youngjung Uh
First: 2025-12-18T11:55:06+00:00 · Latest: 2025-12-18T11:55:06+00:00
Abstract
Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.
中文标题/摘要
标题:文本嵌入的几何解缠:基于单个提示的主题一致文本到图像生成
文本到图像的扩散模型在从自然语言描述生成高质量图像方面表现出色,但在多个输出中保持主题一致性方面经常失败,限制了其在视觉叙事中的应用。现有方法依赖于模型微调或图像条件化,这在计算上昂贵且需要针对每个主题进行优化。1Prompt1Story 是一种无需训练的方法,将所有场景描述连接成一个提示并重新缩放标记嵌入,但它遭受语义泄露的问题,即帧间嵌入变得纠缠,导致文本对齐不良。在本文中,我们提出了一种简单而有效的无需训练的方法,从几何学角度解决语义纠缠问题,通过细化文本嵌入来抑制不需要的语义。大量实验表明,我们的方法在主题一致性和文本对齐方面显著优于现有基线。
Summary / 总结
This paper tackles subject inconsistency in text-to-image generation when all scene descriptions are concatenated into a single prompt, as in 1Prompt1Story, where embeddings across frames become entangled and cause text misalignment. It proposes a training-free method that treats this semantic entanglement from a geometric perspective, refining text embeddings to suppress unwanted semantics. Experiments show improved subject consistency and text alignment over existing baselines.
论文提出了一种无需训练的方法,通过精炼文本嵌入来抑制不必要的语义,从几何角度解决文本到图像生成中的主题不一致性问题。实验表明,该方法在多个输出中显著提高了主题一致性和文本描述与生成图像的对齐程度,优于现有基线方法。
CountZES: Counting via Zero-Shot Exemplar Selection
Authors: Muhammad Ibraheem Siddiqui, Muhammad Haris Khan
First: 2025-12-18T11:12:50+00:00 · Latest: 2025-12-18T11:12:50+00:00
Abstract
Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES's superior performance among ZOC methods while generalizing effectively across natural, aerial and medical domains.
中文标题/摘要
标题:CountZES:通过零样本示例选择进行计数
在复杂场景中的物体计数仍然具有挑战性,特别是在零样本设置中,目标是计数仅通过类别名称指定的未见类别的实例。现有的零样本物体计数(ZOC)方法通过文本推断示例,要么依赖于开放词汇检测器,这通常会产生多实例候选,要么依赖于随机补丁采样,这无法准确划分物体实例。为了解决这个问题,我们提出了一种无需训练的CountZES框架,用于通过零样本示例选择进行物体计数。CountZES通过三个协同阶段逐步发现多样化的示例:检测锚定示例(DAE)、密度引导示例(DGE)和特征共识示例(FCE)。DAE细化开放词汇检测以隔离精确的单实例示例。DGE引入了一种基于密度的自我监督范式,以识别统计上一致且语义紧凑的示例,而FCE通过特征空间聚类增强视觉一致性。这些阶段共同产生一个多样化、互补的示例集,平衡了文本基础、计数一致性和特征代表性。在多种数据集上的实验表明,CountZES在ZOC方法中表现出优越的性能,并且在自然、航空和医疗领域中具有良好的泛化能力。
Summary / 总结
CountZES is a training-free framework for zero-shot object counting in complex scenes. It addresses the challenge of counting unseen categories by progressively discovering diverse exemplars through three stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections, DGE identifies statistically consistent exemplars, and FCE reinforces visual coherence. Experiments show CountZES outperforms other zero-shot object counting methods across various domains.
CountZES 是一个无需训练的框架,用于在复杂场景中对未见过的类别进行零样本计数。它通过三个阶段—检测锚定的示例 (DAE)、密度引导的示例 (DGE) 和特征共识的示例 (FCE)—逐步发现多样且准确的示例。该方法在多种数据集上优于现有零样本计数方法,并且在自然、航空和医疗图像等不同领域中表现出良好的泛化能力。
Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
Authors: Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, Ramachandran Ramjee
First: 2025-12-18T10:37:14+00:00 · Latest: 2025-12-18T10:37:14+00:00
Comments: 11 pages, 8 figures, 3 tables and 1 algorithm
Abstract
Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascade, a training-free sparse attention method that leverages known observations such as 1) post-softmax attention is intrinsically sparse, and 2) the identity of high-weight keys is stable across nearby layers. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. The anchor layers are selected algorithmically, via a dynamic-programming objective that maximizes cross-layer similarity over a development set, allowing easy deployment across models. The method incorporates efficient implementation constraints (e.g. tile-level operations), across both prefill and decode attention. The Top-k selection and reuse in Kascade is head-aware and we show in our experiments that this is critical for high accuracy. Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over FlashAttention-3 baseline on H100 GPUs while closely matching dense attention accuracy on long-context benchmarks such as LongBench and AIME-24.
中文标题/摘要
标题:Kascade:一种实用的稀疏注意方法用于长上下文LLM推理
注意力是长上下文LLM推理中延迟的主要来源,随着推理模型和RAG的流行,这是一个越来越受欢迎的工作负载。我们提出了Kascade,一种无需训练的稀疏注意方法,利用已知观察,例如1)后softmax注意力本质上是稀疏的,2)高权重键的身份在相邻层中是稳定的。Kascade在一组锚定层中精确计算Top-k索引,然后在中间重用层中重用这些索引。锚定层是通过动态规划目标算法选择的,该目标最大化开发集上的跨层相似性,从而实现模型之间的轻松部署。该方法考虑了高效的实现约束(例如,tile级操作),适用于预填充和解码注意力。Kascade的Top-k选择和重用是头感知的,我们在实验中展示了这一点对于高准确率至关重要。Kascade在H100 GPU上将解码注意力的加速比提高到4.1倍,预填充注意力的加速比提高到2.2倍,同时在长上下文基准测试(如LongBench和AIME-24)上接近密集注意力的准确性。
Summary / 总结
Kascade is a training-free sparse attention method that improves the efficiency of long-context LLM inference by leveraging the intrinsic sparsity of post-softmax attention and the stability of high-weight keys across layers. It computes exact Top-k indices in anchor layers and reuses them in intermediate layers, achieving up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention on H100 GPUs while maintaining high accuracy on long-context benchmarks.
Kascade 是一种无需训练的稀疏注意力方法,通过利用后softmax注意力的固有稀疏性和高权重键在相邻层中的稳定性来提高长上下文 LLM 推断的效率。它在锚层中计算精确的 Top-k 索引并在中间层中重用这些索引,从而在 H100 GPU 上实现高达 4.1 倍的解码注意力加速和 2.2 倍的预填充注意力加速,同时在长上下文基准测试中保持高准确性。
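The anchor-and-reuse idea in the Kascade abstract can be illustrated with a minimal sketch. It assumes a single attention head and a single decode-step query per layer; the function names (`topk_indices`, `sparse_attention`, `run_layers`) are hypothetical and not taken from the paper, and the dynamic-programming anchor selection and tile-level kernels are not modeled here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def topk_indices(query, keys, k):
    """Exact Top-k key positions by attention score (computed in an anchor layer)."""
    scores = [dot(query, key) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def sparse_attention(query, keys, values, indices):
    """Scaled dot-product attention restricted to the selected positions."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in indices])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices)) for d in range(dim)]

def run_layers(layers, anchor_layers, k):
    """layers: list of (query, keys, values) for one head.
    anchor_layers: set of layer ids that recompute Top-k.
    If layer 0 is not listed as an anchor, it is treated as one."""
    outputs, cached = [], None
    for layer_id, (q, ks, vs) in enumerate(layers):
        if layer_id in anchor_layers or cached is None:
            cached = topk_indices(q, ks, k)  # anchor layer: exact Top-k
        # reuse layers skip the Top-k search and attend over cached indices
        outputs.append(sparse_attention(q, ks, vs, cached))
    return outputs
```

Because the cached indices are kept per head in this sketch, running it once per head would mirror the head-aware selection the abstract describes as critical for accuracy.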
Unified Semantic Transformer for 3D Scene Understanding
Authors: Sebastian Koch, Johanna Wald, Hidenobu Matsuki, Pedro Hermosilla, Timo Ropinski, Federico Tombari
First: 2025-12-16T12:49:35+00:00 · Latest: 2025-12-18T10:28:42+00:00
Comments: Project page: https://unite-page.github.io/
Abstract
Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been developed and limited to be task-specific. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, as well as affordance and articulations, solely from RGB images. The method is trained using a combination of 2D distillation, heavily relying on self-supervision and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models, in many cases, surpassing methods that operate on ground truth 3D geometry. See the project website at unite-page.github.io
中文标题/摘要
标题:统一语义变换器用于3D场景理解
整体3D场景理解涉及捕捉和解析未结构化的3D环境。由于现实世界的固有复杂性,现有模型主要被开发并局限于特定任务。我们引入了UNITE,一种用于3D场景理解的统一语义变换器,这是一种新颖的前馈神经网络,能够在一个模型中统一多种3D语义任务。我们的模型以端到端的方式处理未见过的场景,并且只需几秒钟即可推断出完整的3D语义几何结构。我们的方法能够直接预测多个语义属性,包括3D场景分割、实例嵌入、开放词汇特征,以及用途和关节,仅从RGB图像中。该方法通过结合2D蒸馏训练,高度依赖于自我监督,并利用了设计用于确保3D视图一致性的新型多视图损失。我们证明,UNITE在多个不同的语义任务上达到了最先进的性能,并且在许多情况下甚至超过了特定任务的模型,甚至在某些情况下超越了在真实3D几何上操作的方法。请参见项目网站:unite-page.github.io
Summary / 总结
UNITE is a Unified Semantic Transformer for 3D scene understanding, a feed-forward network that handles a diverse set of 3D semantic tasks in a single model. It processes unseen scenes end-to-end and predicts multiple semantic attributes solely from RGB images, including 3D scene segmentation, instance embeddings, open-vocabulary features, affordances, and articulations. Training combines 2D distillation and self-supervision with novel multi-view losses that enforce 3D view consistency. UNITE achieves state-of-the-art performance on several semantic tasks, outperforming task-specific models and, in many cases, methods that operate on ground truth 3D geometry.
UNITE 是一个统一的语义变换器,用于3D场景理解,能够在单一模型中处理多种3D语义任务。它从RGB图像直接预测多个语义属性,包括3D场景分割和实例嵌入。UNITE 使用2D蒸馏和自我监督,结合新颖的多视图损失,确保3D视图一致性,并在不同任务上达到最先进的性能,优于专门任务模型,在许多情况下甚至超越了基于真实3D几何运行的方法。
Collaborative Edge-to-Server Inference for Vision-Language Models
Authors: Soochang Song, Yongjune Kim
First: 2025-12-18T09:38:18+00:00 · Latest: 2025-12-18T09:38:18+00:00
Comments: 13 pages, 12 figures
Abstract
We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder's input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM's internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.
中文标题/摘要
标题:边缘到服务器协作推理在视觉语言模型中的应用
我们提出了一种视觉语言模型(VLM)的协作边缘到服务器推理框架,该框架在保持推理准确性的前提下减少了通信成本。在典型部署中,边缘设备(客户端)捕获的视觉数据被传输到服务器进行VLM推理。然而,将原始图像(全局图像)调整到视觉编码器的输入分辨率往往会丢弃细粒度的细节,导致准确度下降。为克服这一限制,我们设计了一个两阶段框架。在第一阶段,服务器对全局图像进行推理,并使用VLM的内部注意力识别感兴趣区域(RoI)。然后计算输出标记的最小熵作为置信度度量,以确定是否需要重新传输。如果最小熵超过预定义的阈值,服务器将请求边缘设备发送RoI的细节保留局部图像。服务器然后通过联合利用全局和局部图像来细化其推理。这种选择性重新传输策略确保仅传输必要的视觉内容。在多个VLM架构上的实验表明,所提出的框架在保持推理准确性的前提下显著减少了通信成本。
Summary / 总结
The paper proposes a collaborative edge-to-server inference framework for vision-language models to reduce communication costs while preserving inference accuracy. It addresses the issue of accuracy degradation caused by resizing the original image to the vision encoder's input resolution. The framework uses a two-stage process: the server first performs inference on the full image and identifies a region of interest, then requests the edge device to send a detail-preserved local image of this region if necessary. Experiments show that this approach significantly reduces communication cost without compromising inference accuracy across various VLM architectures.
论文提出了一种协作边缘到服务器的视觉-语言模型推理框架,以减少通信成本并保持推理准确性。该框架解决了将原始图像缩小以匹配视觉编码器输入分辨率而导致准确度下降的问题。框架分为两个阶段:服务器首先在全局图像上进行初始推理并识别感兴趣区域,然后如果必要,请求边缘设备发送该区域的保细节局部图像。实验表明,这种方法可以显著减少通信成本而不影响推理准确性。
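The retransmission decision in the edge-to-server abstract can be sketched as follows. Min-entropy of a distribution is computed here as the negative base-2 log of its largest probability; averaging it over the output tokens and the particular threshold value are assumptions made for illustration, since the abstract does not specify how per-token values are aggregated.

```python
import math

def min_entropy(probs):
    """Min-entropy in bits: -log2 of the most likely outcome's probability."""
    return -math.log2(max(probs))

def needs_retransmission(token_distributions, threshold):
    """Return (decision, score). The score is the mean min-entropy over output
    tokens (an assumed aggregation); a high score means low confidence."""
    score = sum(min_entropy(p) for p in token_distributions) / len(token_distributions)
    return score > threshold, score
```

When the decision is true, the server would ask the edge device for the detail-preserved local image of the RoI; otherwise the first-stage answer is kept and no extra data is sent.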
MoAPT: Mixture of Adversarial Prompt Tuning for Vision-Language Models
Authors: Shiji Zhao, Qihui Zhu, Shukun Xiong, Shouwei Ruan, Maoxun Yuan, Jialing Tao, Jiexi Liu, Ranjie Duan, Jie Zhang, Jie Zhang, Xingxing Wei
First: 2025-05-23T06:04:15+00:00 · Latest: 2025-12-18T09:15:05+00:00
Abstract
Large pre-trained Vision Language Models (VLMs) demonstrate excellent generalization capabilities but remain highly susceptible to adversarial examples, posing potential security risks. To improve the robustness of VLMs against adversarial examples, adversarial prompt tuning methods are proposed to align the text feature with the adversarial image feature without changing model parameters. However, when facing various adversarial attacks, a single learnable text prompt has insufficient generalization to align well with all adversarial image features, which ultimately results in overfitting. To address the above challenge, in this paper, we empirically find that increasing the number of learned prompts yields greater robustness improvements than simply extending the length of a single prompt. Building on this observation, we propose an adversarial tuning method named Mixture of Adversarial Prompt Tuning (MoAPT) to enhance the generalization against various adversarial attacks for VLMs. MoAPT aims to learn mixture text prompts to obtain more robust text features. To further enhance the adaptability, we propose a conditional weight router based on the adversarial images to predict the mixture weights of multiple learned prompts, which helps obtain sample-specific mixture text features aligning with different adversarial image features. Extensive experiments across 11 datasets under different settings show that our method can achieve better adversarial robustness than state-of-the-art approaches.
中文标题/摘要
标题:MoAPT:视觉语言模型的混合对抗提示调优
大型预训练视觉语言模型(VLMs)表现出色的泛化能力,但仍然高度易受对抗样本的影响,存在潜在的安全风险。为了提高VLMs对抗对抗样本的鲁棒性,提出了对抗提示调优方法,以调整文本特征与对抗图像特征对齐,而不改变模型参数。然而,当面对各种对抗攻击时,单一可学习的文本提示在泛化以与所有对抗图像特征对齐方面不足,最终导致过拟合。为了解决上述挑战,本文通过实验证明增加学习提示的数量比简单地延长单一提示的长度能获得更大的鲁棒性改进。基于这一观察,我们提出了一种名为**混合对抗提示调优(MoAPT)**的对抗调优方法,以增强VLMs对各种对抗攻击的泛化能力。MoAPT旨在学习混合文本提示以获得更鲁棒的文本特征。为了进一步增强适应性,我们提出了一种基于对抗图像的条件权重路由器,以预测多个学习提示的混合权重,这有助于获得样本特定的混合文本特征,与不同的对抗图像特征对齐。在不同设置下的11个数据集上进行的广泛实验表明,我们的方法可以比现有最佳方法实现更好的对抗鲁棒性。
Summary / 总结
The research aims to make Vision Language Models (VLMs) more robust to adversarial examples with Mixture of Adversarial Prompt Tuning (MoAPT). Motivated by the finding that adding learned prompts helps more than lengthening a single prompt, MoAPT learns multiple text prompts and uses a conditional weight router, driven by the adversarial image, to predict sample-specific mixture weights. Experiments across 11 datasets show better adversarial robustness than state-of-the-art approaches.
本文提出了一种称为Mixture of Adversarial Prompt Tuning (MoAPT)的方法,通过学习多个文本提示来提高大型预训练视觉语言模型(VLMs)对各种对抗攻击的鲁棒性。实验结果表明,MoAPT在11个数据集上的对抗鲁棒性优于现有方法。
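The conditional weight router described in the MoAPT abstract can be pictured with a small sketch. The linear-plus-softmax router and the names `route_weights` and `mix_text_features` are assumptions for illustration; the abstract only states that mixture weights are predicted from the adversarial image.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_weights(image_feature, router_matrix):
    """One logit per learned prompt from a linear map of the image feature
    (assumed router form), normalized into mixture weights."""
    logits = [sum(w * x for w, x in zip(row, image_feature)) for row in router_matrix]
    return softmax(logits)

def mix_text_features(prompt_features, weights):
    """Sample-specific text feature: weighted sum of per-prompt text features."""
    dim = len(prompt_features[0])
    return [sum(w * f[d] for w, f in zip(weights, prompt_features)) for d in range(dim)]
```

With an all-zero router the weights are uniform and the mixture reduces to a plain average of the prompts, which makes the router's contribution easy to isolate in an ablation.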
In-Context Probing for Membership Inference in Fine-Tuned Language Models
Authors: Zhexi Lu, Hongliang Chi, Nathalie Baracaldo, Swanand Ravindra Kadhe, Yuseok Jeon, Lei Yu
First: 2025-12-18T08:26:26+00:00 · Latest: 2025-12-18T08:26:26+00:00
Abstract
Membership inference attacks (MIAs) pose a critical privacy threat to fine-tuned large language models (LLMs), especially when models are adapted to domain-specific tasks using sensitive data. While prior black-box MIA techniques rely on confidence scores or token likelihoods, these signals are often entangled with a sample's intrinsic properties - such as content difficulty or rarity - leading to poor generalization and low signal-to-noise ratios. In this paper, we propose ICP-MIA, a novel MIA framework grounded in the theory of training dynamics, particularly the phenomenon of diminishing returns during optimization. We introduce the Optimization Gap as a fundamental signal of membership: at convergence, member samples exhibit minimal remaining loss-reduction potential, while non-members retain significant potential for further optimization. To estimate this gap in a black-box setting, we propose In-Context Probing (ICP), a training-free method that simulates fine-tuning-like behavior via strategically constructed input contexts. We propose two probing strategies: reference-data-based (using semantically similar public samples) and self-perturbation (via masking or generation). Experiments on three tasks and multiple LLMs show that ICP-MIA significantly outperforms prior black-box MIAs, particularly at low false positive rates. We further analyze how reference data alignment, model type, PEFT configurations, and training schedules affect attack effectiveness. Our findings establish ICP-MIA as a practical and theoretically grounded framework for auditing privacy risks in deployed LLMs.
中文标题/摘要
标题:上下文探查在微调语言模型成员推断中的应用
成员推断攻击(MIAs)对微调大型语言模型(LLMs)构成了严重的隐私威胁,尤其是在使用敏感数据将模型适应特定领域任务时。尽管先前的黑盒MIA技术依赖于置信分数或标记概率,但这些信号往往与样本的固有属性(如内容难度或稀有性)交织在一起,导致泛化能力差和信噪比低。在本文中,我们提出了一种新的MIA框架ICP-MIA,该框架基于训练动力学理论,特别是优化过程中收益递减的现象。我们引入了优化差距作为基本的成员信号:在收敛时,成员样本表现出最小的剩余损失减少潜力,而非成员样本则保留了进一步优化的显著潜力。为了在黑盒设置中估计这一差距,我们提出了一种无需训练的上下文探查(ICP)方法,通过战略性构建输入上下文来模拟微调行为。我们提出了两种探查策略:参考数据基于(使用语义相似的公共样本)和自我扰动(通过掩码或生成)。在三个任务和多个LLM上的实验表明,ICP-MIA在低误报率下显著优于先前的黑盒MIAs。我们进一步分析了参考数据对齐、模型类型、PEFT配置和训练计划如何影响攻击效果。我们的研究结果确立了ICP-MIA作为一种实用且理论基础的框架,用于审计部署中LLM的隐私风险。
Summary / 总结
This paper addresses the privacy threat of membership inference attacks on fine-tuned language models, proposing ICP-MIA, a novel framework based on the optimization gap. The method uses In-Context Probing to estimate this gap without training, by simulating fine-tuning-like behavior through strategic input contexts. Experiments show that ICP-MIA outperforms previous black-box MIAs, especially at low false positive rates, and the study explores factors affecting attack effectiveness.
本文针对细调的语言模型面临的成员推理攻击隐私威胁,提出了一种基于优化差距理论的新框架ICP-MIA。它引入了In-Context Probing (ICP)来无需训练即可估计这一差距,使用参考数据或自我扰动。实验表明ICP-MIA在低误报率下显著优于之前的黑盒方法,并提供了影响攻击效果的因素分析。
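The Optimization Gap signal from the ICP-MIA abstract can be sketched under stated assumptions. Here `loss_fn(context, sample)` is a hypothetical stand-in for a black-box query that returns the target model's loss on `sample` given a probing context; the averaging over probes and the fixed threshold are illustrative choices, not the paper's procedure.

```python
from typing import Callable, Sequence

def optimization_gap(loss_fn: Callable[[str, str], float],
                     sample: str,
                     probe_contexts: Sequence[str]) -> float:
    """Loss reduction obtained by prepending probing contexts.
    A small gap suggests little remaining loss-reduction potential."""
    base = loss_fn("", sample)
    probed = [loss_fn(ctx, sample) for ctx in probe_contexts]
    return base - sum(probed) / len(probed)

def predict_member(loss_fn: Callable[[str, str], float],
                   sample: str,
                   probe_contexts: Sequence[str],
                   threshold: float) -> bool:
    """Flag the sample as a likely training member when the gap is below threshold."""
    return optimization_gap(loss_fn, sample, probe_contexts) < threshold
```

The probe contexts would come from one of the two strategies named in the abstract: semantically similar public reference samples, or self-perturbations of the sample produced by masking or generation.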
Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis
Authors: Zhi Helu, Huang Jingjing, Xu Wang, Xu Yangbin, Zhang Wanyue, Jiang Baoyang, Deng Shirui, Zhu Liang, Li Fangfang, Zhao Tiejun, Lin Yankai, Yao Yuan
First: 2025-12-18T06:30:08+00:00 · Latest: 2025-12-18T06:30:08+00:00
Abstract
Embodied intelligence, a grand challenge in artificial intelligence, is fundamentally constrained by the limited spatial understanding and reasoning capabilities of current models. Prevailing efforts to address this through enhancing Vision-Language Models (VLMs) are trapped in a dilemma: template-based datasets are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable and, critically, computationally imprecise. We introduce SPRITE, a novel framework that overcomes this dilemma by leveraging simulators and large models to programmatically synthesize scalable, diverse, and high-quality spatial reasoning data. The core innovation of SPRITE is to reframe ground-truth generation as a code-generation task. We utilize LLMs to compile complex spatial questions into executable programs, which are then verified against high-precision scene meta-information extracted from simulators. This ensures our ground truth is both computationally precise and verifiable, while the generative power of LLMs provides vast linguistic diversity. Leveraging this pipeline, we have curated a dataset encompassing 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs. We demonstrate that a VLM trained on our data achieves significant performance gains on multiple spatial benchmarks and outperforms other open-source datasets of equivalent size. Furthermore, a scalability analysis confirms our hypothesis that overcoming the low-diversity nature of traditional template methods is essential for building robust, generalizable spatial intelligence. We will make the SPRITE framework code and the full 300k+ dataset publicly available to facilitate future research in spatial intelligence.
中文标题/摘要
标题:通过程序化数据合成在MLLM中扩展空间推理能力
具身智能,人工智能领域的重大挑战,从根本上受限于当前模型的空间理解和推理能力有限。通过增强视觉-语言模型(VLMs)来解决这一问题的努力陷入了困境:基于模板的数据集虽然可扩展但结构僵化,而人工标注虽然语言多样但不可扩展且计算上不精确。我们提出了SPRITE,一种新颖的框架,通过利用模拟器和大型模型程序化合成可扩展、多样且高质量的空间推理数据来克服这一困境。SPRITE的核心创新在于将地面真值生成重新构想为代码生成任务。我们利用LLMs将复杂的空间问题编译成可执行程序,然后验证这些程序与模拟器中提取的高精度场景元信息的一致性。这确保了我们的地面真值既计算上精确又可验证,而LLMs的生成能力提供了广泛的语言多样性。利用这一管道,我们构建了一个包含3个模拟器、11000多个场景和300000多张/视频指令调优对的数据集。我们证明,基于我们数据训练的VLM在多个空间基准测试中取得了显著的性能提升,并优于其他等量规模的开源数据集。此外,可扩展性分析证实了我们的假设,即克服传统模板方法的低多样性对于构建稳健、泛化的空间智能至关重要。我们将使SPRITE框架代码和完整的300000+数据集公开,以促进未来在空间智能方面的研究。
Summary / 总结
The paper addresses the limited spatial understanding and reasoning of current models by introducing SPRITE, a framework that uses simulators and large models to programmatically synthesize scalable, diverse, and high-quality spatial reasoning data. SPRITE reframes ground-truth generation as a code-generation task: LLMs compile complex spatial questions into executable programs, which are verified against high-precision scene meta-information extracted from simulators. The resulting dataset covers 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs. A VLM trained on this data achieves significant gains on multiple spatial benchmarks and outperforms training on other open-source datasets of equivalent size; a scalability analysis supports the importance of data diversity for robust spatial intelligence.
论文通过引入SPRITE框架,利用模拟器和大语言模型生成大规模、多样性和高质量的空间推理数据,来解决当前模型在空间理解和推理方面的局限性。SPRITE将地面真值生成重新定义为代码生成任务,能够创建复杂的空间问题并验证其与高精度场景信息的一致性。生成的数据集包括3个模拟器、11,000多个场景和300,000多对图像/视频指令调优对。通过这种方法训练的视觉-语言模型在空间基准测试上的表现显著优于其他等量级的开源数据集,验证了高多样性对于构建稳健的空间智能的重要性。
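The "ground truth as code generation" idea in the SPRITE abstract can be illustrated with a toy example. The scene dictionary, the question, and the program string below are all hypothetical: the string stands in for what an LLM might emit for "Which is closer to the camera, the chair or the table?", and the metadata stands in for what a simulator would export.

```python
import math

# Hypothetical scene meta-information: object name -> (x, y, z) in meters.
SCENE = {
    "camera": (0.0, 0.0, 0.0),
    "chair": (1.0, 0.0, 2.0),
    "table": (4.0, 0.0, 6.0),
}

# Hand-written stand-in for an LLM-generated program.
GENERATED_PROGRAM = '''
def answer(scene):
    cam = scene["camera"]
    d_chair = math.dist(cam, scene["chair"])
    d_table = math.dist(cam, scene["table"])
    return "chair" if d_chair < d_table else "table"
'''

def run_program(source, scene):
    """Execute the generated program against scene metadata to get the answer."""
    namespace = {"math": math}
    exec(source, namespace)  # a real pipeline would sandbox untrusted code
    return namespace["answer"](scene)
```

Because the answer is computed from exact coordinates rather than written by an annotator, it can be re-derived and checked at any time, which is the precision and verifiability property the abstract emphasizes.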
From Frames to Clips: Training-free Adaptive Key Clip Selection for Long-Form Video Understanding
Authors: Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, Garin Kessler
First: 2025-10-02T17:43:01+00:00 · Latest: 2025-12-18T06:01:41+00:00
Abstract
Video Large Language Models (VLMs) have achieved strong performance on various vision-language tasks, yet their practical use is limited by the massive number of visual tokens produced from raw video frames, which quickly exhausts the model's context window. Existing solutions mitigate this issue by selecting a sparse set of frames, but such frame-wise selection discards essential temporal dynamics in long-form videos, leading to suboptimal reasoning about motion and event continuity. In this work, we systematically examine the role of temporal information and show that extending selection from isolated key frames to temporally coherent key clips improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we introduce frame resolution as a controllable factor in frame selection, enabling a trade-off between spatial resolution and clip length. Building on this idea, we propose an adaptive clip length module that dynamically balances these factors to ensure a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on Video-MME, LongVideoBench, and MLVU, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling VLMs to real-world video understanding applications. Project webpage is available at https://guangyusun.com/f2c .
中文标题/摘要
标题:从帧到片段:无需训练的自适应关键片段选择以适应长视频理解
视频大型语言模型(VLMs)在各种视觉语言任务上取得了强大的性能,但由于从原始视频帧中生成的大量视觉标记迅速耗尽了模型的上下文窗口,其实际应用受到限制。现有解决方案通过选择稀疏帧集来缓解这一问题,但这种帧级选择会丢弃长视频中的重要时间动态,导致对运动和事件连续性的推理效果不佳。在本文中,我们系统地探讨了时间信息的作用,并表明将选择从孤立的关键帧扩展到时间上连贯的关键片段可以提高视频理解。为了在保持固定计算预算的同时适应片段更大的标记占用空间,我们引入了帧分辨率作为帧选择的可控因素,从而在空间分辨率和片段长度之间实现权衡。在此基础上,我们提出了一种自适应片段长度模块,动态平衡这些因素以确保每个视频的标记计数恒定。在三个长视频基准上的实验表明,我们的无需训练方法F2C在Video-MME、LongVideoBench和MLVU上的表现分别优于均匀采样8.1%、5.6%和10.3%。这些结果突显了在帧选择中保持时间连贯性的重要性,并为将VLMs扩展到实际视频理解应用提供了实用途径。项目网页可在https://guangyusun.com/f2c 查看。
Summary / 总结
This work addresses long-form video understanding with F2C, a training-free adaptive key clip selection method that selects temporally coherent key clips instead of isolated key frames. To keep the token count per video constant, it treats frame resolution as a controllable factor and trades spatial resolution against clip length. F2C outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on Video-MME, LongVideoBench, and MLVU, respectively.
该研究提出了一种无需训练的自适应关键片段选择方法,以解决使用视频大型语言模型(VLMs)进行长视频理解的问题。该方法将选择从孤立的关键帧扩展到具有时间连贯性的关键片段,平衡空间分辨率和片段长度,以保持固定的计算预算。实验结果显示,提出的F2C方法在三个长视频基准上的表现优于均匀采样,最高改善幅度达到10.3%(MLVU),强调了在帧选择中保持时间连贯性的重要性。
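The resolution versus clip-length trade-off described for F2C can be made concrete with a little arithmetic. The sketch assumes square frames, a ViT-style encoder that emits one token per patch, and a patch size of 14; none of these values come from the abstract, and the function names are hypothetical.

```python
def tokens_per_frame(resolution, patch_size=14):
    """Visual tokens for one square frame, one token per patch (assumed encoder)."""
    side = resolution // patch_size
    return side * side

def frames_per_clip(token_budget, num_clips, resolution, patch_size=14):
    """Longest clip length (in frames) that keeps the total token count within budget."""
    per_frame = tokens_per_frame(resolution, patch_size)
    return token_budget // (num_clips * per_frame)
```

Under these assumptions, halving the side length from 448 to 224 pixels cuts tokens per frame from 1024 to 256, so each of 8 clips can grow from 4 to 16 frames within the same 32768-token budget.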
Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation
Authors: Sarosij Bose, Ravi K. Rajendran, Biplob Debnath, Konstantinos Karydis, Amit K. Roy-Chowdhury, Srimat Chakradhar
First: 2025-12-18T05:48:21+00:00 · Latest: 2025-12-18T05:48:21+00:00
Abstract
Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
中文标题/摘要
标题:医学视觉语言模型的视觉对齐以实现基于图像的放射学报告生成
放射学报告生成(RRG)是实现自动化医疗工作流程、促进准确的患者评估并减轻医疗专业人员工作负担的关键步骤。尽管在大型医学视觉语言模型(Med-VLM)方面取得了进展,但生成既视觉接地又临床准确的放射学报告仍然是一个重大挑战。现有方法通常依赖于大型标注语料库进行预训练、昂贵的任务特定偏好数据或基于检索的方法。然而,这些策略未能充分缓解由于视觉和语言表示之间跨模态对齐不良而产生的幻觉。为了解决这些限制,我们提出了一种名为VALOR的方法:医学视觉语言模型的视觉对齐以实现基于图像的放射学报告生成。该方法引入了一种基于强化学习的后对齐框架,利用组相对近邻优化(GRPO)。训练分为两个阶段:(1)通过文本奖励改进Med-VLM,以鼓励使用临床精确的术语;(2)将文本接地模型的视觉投影模块与疾病发现对齐,从而引导注意力集中在与诊断任务最相关的图像区域。在多个基准上的广泛实验表明,VALOR在事实准确性和视觉对齐方面显著提高,实现了对最先进的报告生成方法的重大性能提升。
Summary / 总结
The research aims to improve the accuracy and visual grounding of radiology reports generated by large medical vision-language models. The method, VALOR, uses a reinforcement learning-based post-alignment framework with Group-Relative Proximal Optimization (GRPO) to enhance the model's clinical precision and align the vision projection module with disease findings. Experiments show that VALOR significantly improves factual accuracy and visual grounding compared to existing methods.
研究旨在提高大型医疗视觉语言模型生成的放射学报告的准确性和视觉定位。提出的VALOR方法使用基于强化学习的后对齐框架和组相对近端优化(GRPO)来增强模型的临床精确度,并使其视觉投影模块与疾病发现对齐。实验表明,VALOR在事实准确性和视觉定位方面显著优于现有最先进的方法。
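The group-relative part of GRPO, as named in the VALOR abstract, is commonly described as normalizing each sampled response's reward against the other responses for the same input. The sketch below shows only that normalization step, using the population standard deviation and a small epsilon as assumptions; VALOR's reward definitions and its policy update are not modeled.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled reports for the same image:
    advantage_i = (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]
```

A report scoring above its group's mean gets a positive advantage and one below gets a negative advantage, so no separate value network is needed to set the baseline.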
UniVCD: A New Method for Unsupervised Change Detection in the Open-Vocabulary Era
Authors: Ziqiang Zhu, Bowei Yang
First: 2025-12-15T08:42:23+00:00 · Latest: 2025-12-18T05:14:28+00:00
Comments: 10 pages, 6 figures
Abstract
Change detection (CD) identifies scene changes from multi-temporal observations and is widely used in urban development and environmental monitoring. Most existing CD methods rely on supervised learning, making performance strongly dataset-dependent and incurring high annotation costs; they typically focus on a few predefined categories and generalize poorly to diverse scenes. With the rise of vision foundation models such as SAM2 and CLIP, new opportunities have emerged to relax these constraints. We propose Unified Open-Vocabulary Change Detection (UniVCD), an unsupervised, open-vocabulary change detection method built on frozen SAM2 and CLIP. UniVCD detects category-agnostic changes across diverse scenes and imaging geometries without any labeled data or paired change images. A lightweight feature alignment module is introduced to bridge the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation while keeping the number of trainable parameters small. On top of this, a streamlined post-processing pipeline is further introduced to suppress noise and pseudo-changes, improving the detection accuracy for objects with well-defined boundaries. Experiments on several public BCD (Binary Change Detection) and SCD (Semantic Change Detection) benchmarks show that UniVCD achieves consistently strong performance and matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU. The results demonstrate that unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD. Code and pretrained models will be released at https://github.com/Die-Xie/UniVCD.
中文标题/摘要
标题:UniVCD:开放词汇时代的无监督变化检测新方法
变化检测(CD)通过多时相观测识别场景变化,在城市开发和环境监测中广泛应用。现有大多数CD方法依赖于监督学习,导致性能高度依赖于数据集且注释成本高昂;它们通常专注于少数预定义类别,难以泛化到多样化的场景。随着SAM2和CLIP等视觉基础模型的兴起,出现了放松这些限制的新机会。我们提出了统一开放词汇变化检测(UniVCD),这是一种基于冻结的SAM2和CLIP构建的无监督、开放词汇变化检测方法。UniVCD在没有任何标注数据或配对变化图像的情况下,能够检测跨多种场景和成像几何的变化。引入了一个轻量级特征对齐模块,将SAM2的空间详细表示与CLIP的语义先验相结合,实现高分辨率、语义感知的变化估计,同时保持可训练参数数量较少。在此基础上,进一步引入了一条简化的后处理流水线,以抑制噪声和伪变化,提高具有明确边界对象的检测准确性。在几个公开的二元变化检测(BCD)和语义变化检测(SCD)基准测试上进行的实验表明,UniVCD在关键指标如F1和IoU上表现出一致的强性能,并且在某些方面超越了现有的开放词汇变化检测方法。结果表明,使用冻结的视觉基础模型和轻量级多模态对齐的无监督变化检测是一种实用且有效的开放词汇变化检测范式。代码和预训练模型将在https://github.com/Die-Xie/UniVCD上发布。
Summary / 总结
UniVCD is an unsupervised, open-vocabulary change detection method that builds on frozen SAM2 and CLIP to detect category-agnostic changes across diverse scenes without labeled data or paired change images. A lightweight feature alignment module bridges the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation, and a streamlined post-processing pipeline suppresses noise and pseudo-changes. Experiments on several public BCD and SCD benchmarks show that UniVCD matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU.
UniVCD 是一种无需标注数据的无监督变化检测方法,利用冻结的 SAM2 和 CLIP 来检测跨多种场景的无类别变化。它引入了一个轻量级的特征对齐模块,将 SAM2 的空间详细表示与 CLIP 的语义先验相结合,实现高分辨率、语义感知的变化估计。在各种基准测试上的实验表明,UniVCD 在 F1 和 IoU 等关键指标上表现出强劲性能,超越或匹配现有开放词汇变化检测方法,证明了冻结视觉基础模型和轻量级多模态对齐在开放词汇变化检测中的实用性和有效性。
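The abstract does not spell out how UniVCD turns aligned features into a change map, so the following is only a minimal sketch of one common approach: compare per-pixel feature vectors from the two dates by cosine distance, threshold the result, and score the mask with the F1 and IoU metrics the abstract mentions. The function names, the threshold of 0.5, and the toy feature grids are all assumptions for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def change_mask(feats_t1, feats_t2, threshold=0.5):
    """Mark a pixel as changed (1) when its cosine distance exceeds `threshold`.

    `feats_t1` and `feats_t2` are H x W grids of per-pixel feature vectors.
    The threshold value is an illustrative assumption.
    """
    return [
        [1 if 1.0 - cosine_similarity(f1, f2) > threshold else 0
         for f1, f2 in zip(row1, row2)]
        for row1, row2 in zip(feats_t1, feats_t2)
    ]


def f1_and_iou(pred, target):
    """Pixel-level F1 and IoU for binary masks given as H x W grids of 0/1."""
    tp = fp = fn = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    if tp + fp + fn == 0:
        return 0.0, 0.0  # no positives anywhere; defined as 0.0 in this sketch
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return f1, iou


# Toy 1 x 2 image: the second pixel's feature vector flips between dates.
before = [[[1.0, 0.0], [0.0, 1.0]]]
after = [[[1.0, 0.0], [1.0, 0.0]]]
mask = change_mask(before, after)
print(mask, f1_and_iou(mask, [[0, 1]]))
```

In practice the feature grids would come from the frozen backbones and the mask would pass through the post-processing stage described in the abstract; neither is modeled here.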
C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation
Authors: Chao Li, Dasha Hu, Chengyang Li, Yuming Jiang, Yuncheng Shen
First: 2025-12-18T04:30:53+00:00 · Latest: 2025-12-18T04:30:53+00:00
Abstract
Unsupervised Domain Adaptation transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the significant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distribution, but neglect conditional distribution discrepancies, leading to critical issues such as class prototype misalignment and degraded semantic discriminability. To address these limitations, the work proposes C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation. C-DGPA synergistically optimizes marginal distribution alignment and conditional distribution alignment through a novel dual-branch architecture. The marginal distribution alignment branch employs a dynamic adversarial training framework to bridge marginal distribution discrepancies. Simultaneously, the conditional distribution alignment branch introduces a Class Mapping Mechanism (CMM) to align conditional distribution discrepancies by standardizing semantic prompt understanding and preventing source domain over-reliance. This dual alignment strategy effectively integrates domain knowledge into prompt learning via synergistic optimization, ensuring domain-invariant and semantically discriminative representations. Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate the superiority of C-DGPA. It achieves new state-of-the-art results on all benchmarks.
中文标题/摘要
标题:C-DGPA:以类别为中心的双重对齐生成提示适应
无监督领域适应将已标记的源领域知识转移到未标记的目标领域。直接在下游UDA任务中部署带有提示调优的视觉-语言模型面临显著挑战,即缓解领域差异。现有提示调优策略主要对齐边缘分布,但忽视条件分布差异,导致诸如类别原型错位和语义可区分性下降等关键问题。为解决这些局限性,该工作提出了C-DGPA:以类别为中心的双重对齐生成提示适应。C-DGPA通过一种新颖的双分支架构协同优化边缘分布对齐和条件分布对齐。边缘分布对齐分支采用动态对抗训练框架来弥合边缘分布差异。同时,条件分布对齐分支引入类别映射机制(CMM),通过标准化语义提示理解并防止对源领域过度依赖来对齐条件分布差异。这种双重对齐策略通过协同优化有效地将领域知识整合到提示学习中,确保领域不变和语义可区分的表示。在OfficeHome、Office31和VisDA-2017上的广泛实验验证了C-DGPA的优越性。它在所有基准上都取得了新的最佳结果。
Summary / 总结
C-DGPA addresses the challenge of domain discrepancies in unsupervised domain adaptation by proposing a class-centric dual-alignment generative prompt adaptation method. It synergistically optimizes marginal and conditional distribution alignments through a dual-branch architecture, using a dynamic adversarial training framework and a Class Mapping Mechanism. Experiments on OfficeHome, Office31, and VisDA-2017 demonstrate that C-DGPA outperforms existing methods, achieving new state-of-the-art results.
C-DGPA 通过同时优化边缘分布和条件分布对齐来解决现有提示调优策略在无监督领域适应中的局限性。它使用双分支架构,包含动态对抗训练框架进行边缘分布对齐和类映射机制进行条件分布对齐。在 OfficeHome、Office31 和 VisDA-2017 上的实验表明,C-DGPA 超过了现有方法并取得了新的最佳结果。
MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation
Authors: Pengyu Wang, Shuchang Ye, Usman Naseem, Jinman Kim
First: 2025-12-18T03:57:55+00:00 · Latest: 2025-12-18T03:57:55+00:00
Comments: 12 pages
Abstract
Medical report generation (MRG) aims to automatically derive radiology-style reports from medical images to aid in clinical decision-making. However, existing methods often generate text that mimics the linguistic style of radiologists but fails to guarantee clinical correctness, because they are trained on token-level objectives that focus on word choice and sentence structure rather than actual medical accuracy. We propose a semantic-driven reinforcement learning (SRL) method for medical report generation, applied to a large vision-language model (LVLM). SRL adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning beyond imitation of language style. Specifically, we optimise a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, thereby directly aligning clinical-label agreement and improving semantic correctness. A lightweight reasoning format constraint further guides the model to generate structured "thinking report" outputs. We evaluate Medical Report Generation with Semantic-driven Reinforcement Learning (MRG-R1) on two datasets, IU X-Ray and MIMIC-CXR, using clinical efficacy (CE) metrics. MRG-R1 achieves state-of-the-art performance with CE-F1 51.88 on IU X-Ray and 40.39 on MIMIC-CXR. We found that label-semantic reinforcement is better than conventional token-level supervision. These results indicate that optimizing a clinically grounded, report-level reward rather than token overlap meaningfully improves clinical correctness. This work is an early exploration of semantic reinforcement for supervising medical correctness in medical large vision-language model (Med-LVLM) training.
中文标题/摘要
标题:MRG-R1:临床对齐的医学报告生成强化学习
医学报告生成(MRG)旨在从医学图像中自动提取放射学风格的报告,以辅助临床决策。然而,现有方法生成的文本虽然模仿了放射科医生的语言风格,但无法保证临床正确性,因为它们是基于词元级目标进行训练的,这些目标侧重于词汇选择和句子结构,而不是实际的医学准确性。我们提出了一种基于语义的强化学习(SRL)方法用于医学报告生成,采用了一个大型视觉语言模型(LVLM)。SRL采用组相对策略优化(GRPO)来鼓励临床正确性引导的学习,而不仅仅是语言风格的模仿。具体来说,我们优化了一个报告级奖励:生成报告和参考报告中提取的关键放射学发现之间的余弦相似度的边际计算(MCCS),从而直接对齐临床标签一致性和提高语义正确性。一种轻量级的推理格式约束进一步引导模型生成结构化的“思考报告”输出。我们使用临床效用(CE)指标在两个数据集:IU X-Ray和MIMIC-CXR上评估了基于语义驱动的强化学习的医学报告生成(MRG-R1)。MRG-R1在IU X-Ray上实现了最先进的性能,CE-F1为51.88,在MIMIC-CXR上为40.39。我们发现标签语义强化比传统的词元级监督效果更好。这些结果表明,优化一个基于临床的报告级奖励而不是词元重叠,显著提高了临床正确性。这项工作是探索在医学大型视觉语言模型(Med-LVLM)训练中监督医学正确性的语义强化的一个先驱。
Summary / 总结
The research aims to improve the clinical correctness of medical reports generated by machine learning models. It proposes a semantic-driven reinforcement learning method, MRG-R1, which uses a large vision-language model and optimizes a report-level reward based on clinical label agreement. This method outperforms existing token-level supervised approaches, achieving state-of-the-art clinical efficacy metrics on IU X-Ray and MIMIC-CXR datasets.
研究旨在提高机器学习模型生成的医疗报告的临床正确性,这些模型通常关注语言风格而非医学准确性。方法采用基于语义的强化学习,具体使用Group Relative Policy Optimization (GRPO) 来优化基于临床标签一致性的报告级奖励。模型MRG-R1在IU X-Ray和MIMIC-CXR数据集上的CE-F1分数分别为51.88和40.39,表明其临床正确性优于传统的基于令牌级别的监督。
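To make the reward idea above concrete, here is a hedged sketch of a margin-based cosine reward over finding-label vectors, followed by the group-relative advantage normalization that GRPO is generally described as using. The abstract does not give the exact MCCS formula, so the `max(0, similarity - margin)` form, the margin of 0.2, and the binary label vectors are assumptions made purely for illustration.

```python
import math
import statistics


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def margin_cosine_reward(generated_labels, reference_labels, margin=0.2):
    """Hypothetical report-level reward: cosine similarity above a margin, else zero.

    Each argument is a vector of finding labels (for example 1 = finding present),
    assumed to be extracted from a report by some external labeler.
    """
    similarity = cosine_similarity(generated_labels, reference_labels)
    return max(0.0, similarity - margin)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, several sampled reports, each reduced to a finding-label vector.
reference = [1, 0, 1, 0]
samples = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1]]
rewards = [margin_cosine_reward(s, reference) for s in samples]
print(rewards)
print(group_relative_advantages(rewards))
```

Samples whose extracted findings agree with the reference receive positive advantages and the others negative ones, which is the sense in which the reward targets clinical-label agreement rather than token overlap.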
Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation
Authors: Seogkyu Jeon, Kibeom Hong, Hyeran Byun
Venue: ICCV 2025 poster
First: 2025-12-03T06:58:38+00:00 · Latest: 2025-12-18T03:34:53+00:00
Comments: ICCV 2025 (poster)
Abstract
Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision-Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises from the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, namely Domain-aware Prompt-driven Masked Transformer (DPMFormer). First, we introduce domain-aware prompt learning to facilitate semantic alignment between visual and textual cues. Second, to capture various domain-specific properties with a single source dataset, we propose domain-aware contrastive learning along with a texture perturbation that diversifies the observable domains. Lastly, to establish a framework resilient to diverse environmental changes, we propose domain-robust consistency learning, which guides the model to minimize the discrepancy between its predictions on the original and augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state of the art on various DGSS benchmarks. The code is available at https://github.com/jone1222/DPMFormer.
中文标题/摘要
标题:利用领域属性的语言驱动领域泛化在语义分割中的应用
近期的领域泛化语义分割(DGSS)研究通过从视觉语言模型(VLMs)中提炼语义知识取得了显著进步。然而,它们忽视了由于固定上下文提示在单一源领域学习而导致的视觉和文本上下文之间的语义不一致。为了解决这一问题,我们提出了一种新的语义分割领域泛化框架,即领域感知提示驱动的掩码变换器(DPMFormer)。首先,我们引入了领域感知提示学习,以促进视觉和文本线索之间的语义对齐。为了用单一源数据集捕捉各种领域特定属性,我们提出了领域感知对比学习以及纹理扰动,以多样化可观察的领域。最后,为了建立一个对多种环境变化具有鲁棒性的框架,我们提出了领域鲁棒一致性学习,以引导模型最小化原始图像和增强图像预测之间的差异。通过实验和分析,我们展示了所提出框架的优越性,该框架在多种DGSS基准上建立了新的最先进水平。代码可在https://github.com/jone1222/DPMFormer/获取。
Summary / 总结
This paper addresses the issue of semantic misalignment in domain generalized semantic segmentation by proposing DPMFormer, which includes domain-aware prompt learning, domain-aware contrastive learning, and domain-robust consistency learning. The framework demonstrates superior performance, setting a new state-of-the-art on various DGSS benchmarks.
本文提出DPMFormer框架来解决领域泛化语义分割中的语义对齐问题,该框架包含领域感知提示学习、领域感知对比学习和领域鲁棒一致性学习。实验表明该框架性能优越,达到了各种DGSS基准的新最先进水平。
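The consistency objective described above can be illustrated with a small sketch: compute per-pixel class distributions for the original and the augmented image, then average a divergence between them. The abstract does not state which discrepancy measure DPMFormer uses, so the KL divergence, the function names, and the toy logits below are assumptions chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(logits_orig, logits_aug):
    """Mean per-pixel KL between predictions on original and augmented images.

    Both arguments are lists of per-pixel logit vectors in the same pixel order.
    """
    losses = [
        kl_divergence(softmax(lo), softmax(la))
        for lo, la in zip(logits_orig, logits_aug)
    ]
    return sum(losses) / len(losses)


# Two pixels, three classes; the augmented view shifts the second pixel's prediction.
original = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
augmented = [[2.0, 0.5, 0.1], [1.0, 1.0, 0.3]]
print(consistency_loss(original, original))
print(consistency_loss(original, augmented))
```

The loss is zero when both views yield identical predictions and grows as they diverge, which is the behavior a consistency term of this kind is meant to penalize.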
Auto-Vocabulary 3D Object Detection
Authors: Haomeng Zhang, Kuan-Chuan Peng, Suhas Lohit, Raymond A. Yeh
First: 2025-12-18T01:53:40+00:00 · Latest: 2025-12-18T01:53:40+00:00
Comments: technical report
Abstract
Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves the state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.
中文标题/摘要
标题:自动词汇3D物体检测
开放词汇的3D物体检测方法能够在未见过的类别上进行3D框的定位。尽管名称如此,现有方法在训练和推理时都依赖于用户指定的类别。我们提出研究自动词汇3D物体检测(AV3DOD),其中检测到的物体类别是自动生成的,无需任何用户输入。为此,我们引入语义得分(SS)来评估生成的类别名称的质量。然后,我们开发了一个新的框架AV3DOD,该框架利用2D视觉-语言模型(VLMs)通过图像描述、伪3D框生成和特征空间语义扩展来生成丰富的语义候选。AV3DOD在ScanNetV2和SUNRGB-D数据集上的定位(mAP)和语义质量(SS)上均达到了最先进的性能。值得注意的是,它在整体mAP上超过了最先进的CoDA 3.48,并在ScanNetV2上的SS上实现了24.5%的相对改进。
Summary / 总结
The research studies auto-vocabulary 3D object detection, where class names for detected objects are generated automatically without user input. The proposed AV3DOD framework uses 2D vision-language models to generate semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. The method achieves state-of-the-art performance in both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the previous best method, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.
研究旨在开发一种无需用户输入即可自动为检测到的对象生成类别的开放词汇3D目标检测方法。提出的Auto-Vocabulary 3D Object Detection (AV3DOD)框架使用2D视觉语言模型生成语义候选并扩展特征空间语义。AV3DOD在ScanNetV2和SUNRGB-D数据集上的整体mAP比最先进的CoDA方法高出3.48,并且在语义质量(SS)上实现了24.5%的相对改进。
Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Authors: Earl Ranario, Mason J. Earles
First: 2025-12-17T21:22:44+00:00 · Latest: 2025-12-17T21:22:44+00:00
Comments: Draft version
Abstract
Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, spanning 162 classes across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (for example, from 21% to 30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
中文标题/摘要
标题:视觉语言模型在农业中零样本替代监督分类模型是否准备就绪?
视觉语言模型(VLMs)越来越多地被提议作为视觉识别任务的一般解决方案,但它们在农业决策支持中的可靠性仍不清楚。我们对来自AgML集合的27个农业分类数据集中的多种开源和闭源VLM进行了基准测试,这些数据集涵盖了162个类别,包括植物病害、害虫和损伤以及植物和杂草物种识别。在所有任务中,零样本VLMs的表现显著低于监督任务特定基线(YOLO11),后者始终比任何基础模型获得更高的准确率。在多项选择提示下,表现最佳的VLM(Gemini-3 Pro)的平均准确率为约62%,而开放式提示则导致性能大幅下降,通常准确率低于25%。基于LLM的语义评估提高了开放式提示的准确率(例如,顶级模型从21%提高到30%),并改变了模型排名,表明评估方法对报告结论有实质性影响。在开源模型中,Qwen-VL-72B表现最佳,在受限提示下接近闭源性能,但仍落后于顶级专有系统。任务级分析表明,植物和杂草物种分类始终比害虫和损伤识别更容易,后者是所有模型中最具有挑战性的类别。总体而言,这些结果表明,当前的即用型VLM尚不适合作为独立的农业诊断系统,但在与受限界面、明确标签本体和领域意识评估策略配对时,可以作为辅助组件发挥作用。
Summary / 总结
The study benchmarks vision-language models (VLMs) on 27 agricultural classification datasets, finding that zero-shot VLMs underperform a supervised task-specific baseline. The best-performing VLM, Gemini-3 Pro, achieves around 62% accuracy under multiple-choice prompting, while open-ended prompting yields much lower performance. The research highlights that current VLMs are not yet suitable as standalone diagnostic systems but can assist when paired with specific interfaces and evaluation strategies.
研究对27个农业分类数据集进行了视觉语言模型(VLMs)的基准测试,发现零样本VLMs的表现低于监督任务特定基线。在多项选择提示下,最佳VLM(Gemini-3 Pro)的准确率为约62%,而开放式提示下的表现较低。应用基于LLM的语义判断可以提高开放式提示的准确性并改变模型排名。在开源模型中,Qwen-VL-72B表现最佳,但仍落后于顶级专有系统。研究结果表明,当前的VLMs尚不适合作为独立诊断工具,但在与受限界面和领域导向评估策略结合使用时可以发挥作用。
From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection
Authors: Manuel Nkegoum, Minh-Tan Pham, Élisa Fromont, Bruno Avignon, Sébastien Lefèvre
First: 2025-12-17T21:06:36+00:00 · Latest: 2025-12-17T21:06:36+00:00
Abstract
Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, where robust perception under diverse illumination conditions is essential. However, the limited availability of annotated multispectral data severely restricts the training of deep detectors. In such data-scarce scenarios, textual class information can serve as a valuable source of semantic supervision. Motivated by the recent success of Vision-Language Models (VLMs) in computer vision, we explore their potential for few-shot multispectral object detection. Specifically, we adapt two representative VLM-based detectors, Grounding DINO and YOLO-World, to handle multispectral inputs and propose an effective mechanism to integrate text, visual and thermal modalities. Through extensive experiments on two popular multispectral image benchmarks, FLIR and M3FD, we demonstrate that VLM-based detectors not only excel in few-shot regimes, significantly outperforming specialized multispectral models trained with comparable data, but also achieve competitive or superior results under fully supervised settings. Our findings reveal that the semantic priors learned by large-scale VLMs effectively transfer to unseen spectral modalities, offering a powerful pathway toward data-efficient multispectral perception.
中文标题/摘要
标题:从文字到波长:基于VLM的少样本多光谱目标检测
多光谱目标检测对于自动驾驶和监控等安全敏感应用至关重要,其中在不同光照条件下进行稳健感知是必不可少的。然而,标注的多光谱数据的有限可用性严重限制了深度检测器的训练。在这种数据稀缺的情况下,文本类信息可以作为有价值的语义监督来源。受最近视觉-语言模型(VLMs)在计算机视觉中取得成功的影响,我们探索了它们在少样本多光谱目标检测中的潜力。具体而言,我们调整了两个代表性的VLM基检测器,Grounding DINO和YOLO-World,以处理多光谱输入,并提出了一种有效机制来整合文本、视觉和热成像模态。通过在两个流行的多光谱图像基准FLIR和M3FD上进行广泛的实验,我们证明基于VLM的检测器不仅在少样本场景中表现出色,显著优于使用相似数据训练的专业多光谱模型,而且在完全监督设置下也能取得具有竞争力或更优的结果。我们的研究结果表明,大规模VLM学习到的语义先验能够有效转移到未见过的光谱模态中,为数据高效多光谱感知提供了强大的途径。
Summary / 总结
The paper explores the use of Vision-Language Models (VLMs) for few-shot multispectral object detection, motivated by the need for robust perception under diverse lighting conditions in safety-sensitive applications. By adapting VLM-based detectors like Grounding DINO and YOLO-World to handle multispectral inputs and integrating text, visual, and thermal modalities, the study demonstrates that these models outperform specialized multispectral models in few-shot settings and achieve competitive results in fully supervised settings. The experiments on FLIR and M3FD benchmarks show that VLMs can effectively transfer semantic priors to unseen spectral modalities, offering a data-efficient approach to multispectral perception.
论文旨在解决在自主驾驶和监控等安全敏感应用中,由于缺乏标注的多光谱数据,导致的少量样本多光谱目标检测难题。通过利用视觉语言模型(VLMs)整合文本类别信息,对Grounding DINO和YOLO-World检测器进行多光谱输入的适应。在FLIR和M3FD基准数据集上的实验表明,基于VLMs的检测器在少量样本场景中表现出色,不仅超越了专门训练的多光谱模型,还在完全监督条件下取得了竞争力或更优的结果,表明了大规模VLMs学到的语义先验能够有效转移到未见过的光谱模态中。
Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models
Authors: Utsav Panchal, Yuchen Liu, Luigi Palmieri, Ilche Georgievski, Marco Aiello
Venue: WACV
First: 2025-12-17T20:44:32+00:00 · Latest: 2025-12-17T20:44:32+00:00
Comments: Accepted at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
Abstract
Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of human-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we fine-tune CAMP-VLM with synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.
中文标题/摘要
标题:有图有真相(和预测):基于视觉语言模型的上下文感知多人体行为预测
准确预测人体行为对于在人群环境中操作移动机器人至关重要。尽管先前的研究主要集中在从第一人称视角预测单人体的行为,但许多机器人应用需要从第三人称视角理解多人体的行为。为此,我们提出了CAMP-VLM(上下文感知多人体行为预测):一种基于视觉语言模型(VLM)的框架,该框架结合了视觉输入中的上下文特征和场景图中的空间意识,以增强对人类-场景交互的预测。由于缺乏适用于第三人称视角多人体行为预测的合适数据集,我们使用逼真的模拟器生成的人体行为数据对CAMP-VLM进行了微调,并在合成和真实序列上评估了模型的泛化能力。利用监督微调(SFT)和直接偏好优化(DPO),CAMP-VLM在预测准确性上比最佳基线高出66.9%。
Summary / 总结
The research aims to improve the prediction of human behaviors in environments with multiple people, which is essential for mobile robots. It introduces CAMP-VLM, a framework that uses Vision Language Models to incorporate contextual visual information and spatial awareness from scene graphs. Despite the absence of suitable datasets, CAMP-VLM was fine-tuned with synthetic data and showed up to 66.9% better prediction accuracy than the best baseline when evaluated on both synthetic and real-world sequences.
研究旨在利用结合视觉上下文特征和场景图中空间意识的视觉语言模型(VLM),提高多人环境中的行为预测能力。CAMP-VLM 框架通过从拟真模拟器生成的合成数据进行微调,并在合成和真实序列上进行评估,结果显示其预测准确性比现有方法高出高达66.9%。
R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
Authors: Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax
First: 2025-12-17T20:08:32+00:00 · Latest: 2025-12-17T20:08:32+00:00
Abstract
Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
中文标题/摘要
标题:R4:四维时空中的检索增强推理方法
人类通过构建持久的、结构化的内部表示来感知和推理其周围的环境,这些表示编码了语义意义、空间布局和时间动态。这些多模态记忆使他们能够回忆过去的事件、推断未观察到的状态,并将新信息整合到上下文相关的推理中。受此能力的启发,我们提出了R4,这是一种无需训练的检索增强推理框架,为视觉语言模型(VLMs)提供了结构化的终身记忆。R4通过在度量空间和时间中锚定对象级语义描述,不断构建一个四维知识数据库,从而生成一个持久的世界模型,该模型可以在不同代理之间共享。在推理时,自然语言查询被分解为语义、空间和时间键,以检索相关观察结果,这些观察结果被整合到VLM的推理中。与传统的检索增强生成方法不同,R4中的检索直接在四维空间中进行,这使得它能够进行事件性和协作性推理而无需训练。在基于体感问答和导航基准上的实验表明,与基线相比,R4在时空信息检索和推理方面有了显著改进,推动了动态环境中的四维体感推理的新范式。
Summary / 总结
R4 is a training-free framework that enhances vision-language models with a structured, lifelong memory in a 4D spatio-temporal space. By anchoring object-level semantic descriptions in metric space and time, R4 constructs a persistent world model that can be shared among agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Experiments show that R4 significantly improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
研究旨在通过使视觉语言模型能够在4D时空空间中进行推理,类似于人类的感知。方法是引入一种名为R4的检索增强推理框架,该框架通过在空间和时间中锚定对象级别的描述来构建持久的4D知识数据库。关键实验发现表明,R4在体感问答和导航基准测试中显著提高了时空信息的检索和推理能力,相比基线方法,推进了4D推理的新范式在动态环境中的应用。
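The retrieval step described above, where a query is split into semantic, spatial, and temporal keys, can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: the `Observation` and `Memory4D` names are hypothetical, keyword overlap stands in for whatever semantic matching R4 actually uses, and positions are assumed to be metric (x, y, z) coordinates with timestamps in seconds.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    description: str   # object-level semantic description
    position: tuple    # (x, y, z) in a shared metric frame (assumed metres)
    timestamp: float   # observation time (assumed seconds)


class Memory4D:
    """Toy store of observations anchored in space and time."""

    def __init__(self):
        self.observations = []

    def add(self, observation):
        self.observations.append(observation)

    def retrieve(self, keywords, center, radius, t_start, t_end):
        """Return observations matching all three keys, newest first."""
        wanted = {k.lower() for k in keywords}
        hits = []
        for obs in self.observations:
            words = set(obs.description.lower().split())
            if not words & wanted:
                continue  # semantic key: no keyword overlap
            if math.dist(obs.position, center) > radius:
                continue  # spatial key: outside the query sphere
            if not (t_start <= obs.timestamp <= t_end):
                continue  # temporal key: outside the time window
            hits.append(obs)
        return sorted(hits, key=lambda o: o.timestamp, reverse=True)


memory = Memory4D()
memory.add(Observation("red mug on table", (1.0, 2.0, 0.8), 10.0))
memory.add(Observation("blue chair", (1.0, 2.0, 0.0), 12.0))
memory.add(Observation("red mug in sink", (5.0, 1.0, 0.9), 50.0))

# "Where was the mug near the table during the first 20 seconds?"
for obs in memory.retrieve({"mug"}, center=(1.0, 2.0, 1.0), radius=1.0,
                           t_start=0.0, t_end=20.0):
    print(obs)
```

The retrieved observations would then be passed to the VLM as context; because the store is plain data, several agents could in principle append to and query the same memory, which matches the shared world model the abstract describes.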