BachVid: Training-Free Video Generation with Consistent Background and Character
Authors: Han Yan, Xibin Song, Yifu Wang, Hongdong Li, Pan Ji, Chao Ma
First: 2025-10-24T17:56:37+00:00 · Latest: 2025-10-24T17:56:37+00:00
Comments: Project page: https://wolfball.github.io/bachvid
Abstract
Diffusion Transformers (DiTs) have recently driven significant progress in
text-to-video (T2V) generation. However, generating multiple videos with
consistent characters and backgrounds remains a significant challenge. Existing
methods typically rely on reference images or extensive training, and often
only address character consistency, leaving background consistency to
image-to-video models. We introduce BachVid, the first training-free method
that achieves consistent video generation without needing any reference images.
Our approach is based on a systematic analysis of DiT's attention mechanism and
intermediate features, revealing its ability to extract foreground masks and
identify matching points during the denoising process. Our method leverages
this finding by first generating an identity video and caching its intermediate
variables, then injecting these cached variables into corresponding positions
in newly generated videos, ensuring both foreground and background consistency
across multiple videos. Experimental results demonstrate that BachVid achieves
robust consistency in generated videos, offering a novel and efficient solution
for consistent video generation without relying on reference images or
additional training.
中文标题/摘要
标题:BachVid:无需训练的具有一致背景和角色的视频生成
扩散变换器(DiTs)最近在文本到视频(T2V)生成方面取得了显著进展。然而,生成具有一致角色和背景的多个视频仍然是一个重大挑战。现有方法通常依赖参考图像或大量训练,且往往仅解决角色一致性问题,而将背景一致性留给图像到视频模型处理。我们提出了BachVid,这是第一个无需训练的方法,能够在无需任何参考图像的情况下实现一致的视频生成。我们的方法基于对DiT注意力机制和中间特征的系统分析,揭示了其在去噪过程中提取前景掩模和识别匹配点的能力。我们的方法利用这一发现,首先生成一个身份视频并缓存中间变量,然后将这些缓存的变量注入到新生成视频的相应位置,确保多个视频中前景和背景的一致性。实验结果表明,BachVid能够在无需额外训练的情况下生成具有稳健一致性的视频,提供了一种新颖且高效的解决方案,无需依赖参考图像或额外训练。
Summary / 总结
BachVid is a training-free method for generating multiple videos with consistent characters and backgrounds. It builds on an analysis of the attention mechanism and intermediate features of Diffusion Transformers, which can extract foreground masks and identify matching points during the denoising process. BachVid first generates an identity video and caches its intermediate variables, then injects these variables into corresponding positions in newly generated videos, ensuring foreground and background consistency across videos. The method achieves robust consistency without additional training or reference images.
BachVid 是一种无需训练的方法,用于生成具有角色和背景一致性的视频。它利用扩散变换器中的注意力机制和中间特征来提取前景掩码并在去噪过程中识别匹配点。通过生成身份视频并缓存中间变量,BachVid 将这些变量注入到新生成的视频中,确保多视频的一致性。该方法在无需额外训练或参考图像的情况下展示了生成视频的一致性。
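The cache-then-inject idea in the BachVid abstract can be pictured with a toy sketch. The abstract gives no implementation details, so everything below is an assumption for illustration: features are flat Python lists indexed by token position, and `matches` is a hypothetical dictionary mapping a position in the new video to its matching position in the cached identity video. This is not the authors' code.

```python
def inject_cached(new_features, cached_features, matches):
    """Return a copy of new_features with matched positions overwritten.

    matches maps {position in new video: position in cached identity video}.
    All names are hypothetical; real DiT features are tensors, not lists.
    """
    out = list(new_features)
    for dst, src in matches.items():
        out[dst] = cached_features[src]
    return out


# Identity video: features cached during its denoising pass (toy values).
cache = [0.9, 0.8, 0.7, 0.6]
# New video: positions 0 and 3 were matched to cached positions 1 and 2.
new = [0.1, 0.2, 0.3, 0.4]
mixed = inject_cached(new, cache, {0: 1, 3: 2})
```

Unmatched positions keep the new video's own features, which is what would let the new prompt still change the motion or action in this toy picture.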
A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection
Authors: Gaku Morio, Harri Rowlands, Dominik Stammbach, Christopher D. Manning, Peter Henderson
Venue: NeurIPS 2025
First: 2025-10-24T17:34:28+00:00 · Latest: 2025-10-24T17:34:28+00:00
Comments: Forthcoming in NeurIPS 2025 Datasets and Benchmarks Track
Abstract
Companies spend large amounts of money on public relations campaigns to
project a positive brand image. However, sometimes there is a mismatch between
what they say and what they do. Oil & gas companies, for example, are accused
of "greenwashing" with imagery of climate-friendly initiatives. Understanding
the framing, and changes in framing, at scale can help better understand the
goals and nature of public relations campaigns. To address this, we introduce a
benchmark dataset of expert-annotated video ads obtained from Facebook and
YouTube. The dataset provides annotations for 13 framing types for more than 50
companies or advocacy groups across 20 countries. Our dataset is especially
designed for the evaluation of vision-language models (VLMs), distinguishing it
from past text-only framing datasets. Baseline experiments show some promising
results, while leaving room for improvement for future work: GPT-4.1 can detect
environmental messages with 79% F1 score, while our best model only achieves
46% F1 score on identifying framing around green innovation. We also identify
challenges that VLMs must address, such as implicit framing, handling videos of
various lengths, or implicit cultural backgrounds. Our dataset contributes to
research in multimodal analysis of strategic communication in the energy
sector.
中文标题/摘要
标题:一种多模态基准,用于石油与天然气广告框架和潜在绿色漂洗检测
公司花费大量资金进行公共关系活动以塑造积极的品牌形象。然而,有时他们的言行并不一致。例如,石油和天然气公司因使用气候友好型项目的图像而被指责进行“绿色漂洗”。理解大规模的框架及其变化有助于更好地了解公共关系活动的目标和性质。为了解决这一问题,我们引入了一个由专家注释的视频广告基准数据集,这些数据来自Facebook和YouTube。该数据集为来自20个国家的50多家公司或倡导组织提供了13种框架类型的注释。我们的数据集特别设计用于评估视觉-语言模型(VLMs),使其不同于以往仅基于文本的框架数据集。基线实验显示了一些有希望的结果,但也为未来的工作留下了改进的空间:GPT-4.1可以以79%的F1分数检测环境信息,而我们最好的模型在识别围绕绿色创新的框架方面只能达到46%的F1分数。我们还指出了视觉-语言模型必须解决的挑战,如隐含的框架、处理不同长度的视频或隐含的文化背景。我们的数据集为能源领域战略沟通的多模态分析研究做出了贡献。
Summary / 总结
The research aims to understand framing and potential greenwashing in oil and gas advertising through a multimodal benchmark dataset. The dataset contains expert-annotated video ads from Facebook and YouTube, with annotations for 13 framing types covering more than 50 companies or advocacy groups across 20 countries. Baseline experiments show that GPT-4.1 can detect environmental messages with a 79% F1 score, whereas the authors' best model reaches only 46% F1 on identifying framing around green innovation. The study highlights challenges that VLMs must address, such as implicit framing, videos of varying lengths, and implicit cultural backgrounds.
研究旨在通过分析视频广告来理解石油和天然气广告中的框架和潜在的绿色漂洗问题。研究引入了一个包含13种框架类型的专家标注基准数据集,覆盖了来自20个国家的50多家公司或倡导组织。基线实验显示,虽然GPT-4.1可以以79%的F1分数检测环境信息,但在识别绿色创新框架方面仅达到46%的F1分数。数据集强调了VLMs需要解决隐含框架问题,并处理不同长度的视频和文化背景差异的挑战。
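Both headline numbers in this entry are F1 scores. As a reminder of what that metric measures, here is a minimal sketch of binary F1 for a single framing label. The label lists are made-up toy data, not taken from the benchmark, and the benchmark's exact averaging scheme is not stated in the abstract.

```python
def f1_score(y_true, y_pred):
    """Binary F1: the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: does each of five ads carry a given framing label?
gold = [1, 1, 1, 0, 0]
pred = [1, 1, 0, 1, 0]
score = f1_score(gold, pred)  # precision = recall = 2/3
```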
MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents
Authors: Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, Yuxiao Dong
First: 2025-09-10T13:09:27+00:00 · Latest: 2025-10-24T17:13:05+00:00
Abstract
Building general-purpose graphical user interface (GUI) agents has become
increasingly promising with the progress in vision language models. However,
developing effective mobile GUI agents with reinforcement learning (RL) remains
challenging due to the heavy-tailed distribution of task difficulty and the
inefficiency of large-scale environment sampling. We present an online agentic
reinforcement learning framework MobileRL to enhance GUI agents in mobile
environments. Its core component is the Difficulty-ADAptive GRPO (ADAGRPO)
algorithm. In ADAGRPO, we design difficulty-adaptive positive replay and
failure curriculum filtering to adapt the model to different task difficulties.
We introduce the shortest-path reward adjustment strategy to reshape rewards
concerning the task length in multi-turn agentic tasks. These strategies
jointly stabilize RL training, improve sample efficiency, and yield strong
performance across diverse mobile apps and tasks. We apply MobileRL to two open
models (Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base). The resulting MobileRL-9B
model achieves state-of-the-art results in terms of success rates on both
AndroidWorld (80.2%) and AndroidLab (53.6%). The MobileRL framework is
open-sourced at: https://github.com/THUDM/MobileRL.
中文标题/摘要
标题:MobileRL:移动GUI代理的在线代理强化学习
随着视觉语言模型的进步,构建通用图形用户界面(GUI)代理变得越来越有前景。然而,由于任务难度的重尾分布和大规模环境采样的低效性,使用强化学习(RL)开发有效的移动GUI代理仍然具有挑战性。我们提出了一种在线代理强化学习框架MobileRL,以增强移动环境中的GUI代理。其核心组件是难度自适应GRPO(ADAGRPO)算法。在ADAGRPO中,我们设计了难度自适应正重播和失败课程筛选,以使模型适应不同的任务难度。我们引入了最短路径奖励调整策略,以在多轮代理任务中重新塑造与任务长度相关的奖励。这些策略共同稳定了RL训练,提高了样本效率,并在各种移动应用和任务中生成了强大的性能。我们将MOBILERL应用于两个开源模型(Qwen2.5-VL-7B-Instruct和GLM-4.1V-9B-Base)。MOBILERL-9B模型在AndroidWorld(80.2%)和AndroidLab(53.6%)的成功率方面均达到了最先进的结果。MOBILERL框架已开源:https://github.com/THUDM/MobileRL。
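ADAGRPO builds on GRPO, whose core step is normalizing rewards within a group of rollouts for the same task. The sketch below shows only that standard group-relative normalization, assuming binary success rewards. The difficulty-adaptive positive replay, failure curriculum filtering, and shortest-path reward adjustment are not specified in the abstract and are not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group.

    rewards: rewards of several rollouts sampled for the same task.
    eps guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task: two succeed (1.0), two fail (0.0).
adv_mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A task that is too hard: every rollout fails.
adv_hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When every rollout in a group fails (or every one succeeds), all advantages are zero and the group contributes no learning signal, which is one reason the distribution of task difficulty matters for this family of methods.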
The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models
Authors: Alessandro Serra, Francesco Ortu, Emanuele Panizon, Lucrezia Valeriani, Lorenzo Basile, Alessio Ansuini, Diego Doimo, Alberto Cazzaniga
First: 2024-12-09T16:39:40+00:00 · Latest: 2025-10-24T16:24:17+00:00
Abstract
Recent advances in multimodal training have significantly improved the
integration of image understanding and generation within a unified model. This
study investigates how vision-language models (VLMs) handle image-understanding
tasks, focusing on how visual information is processed and transferred to the
textual domain. We compare native multimodal VLMs, models trained from scratch
on multimodal data to generate both text and images, and non-native multimodal
VLMs, models adapted from pre-trained large language models or capable of
generating only text, highlighting key differences in information flow. We find
that in native multimodal VLMs, image and text embeddings are more separated
within the residual stream. Moreover, VLMs differ in how visual information
reaches text: non-native multimodal VLMs exhibit a distributed communication
pattern, where information is exchanged through multiple image tokens, whereas
models trained natively for joint image and text generation tend to rely on a
single post-image token that acts as a narrow gate for visual information. We
show that ablating this single token significantly deteriorates
image-understanding performance, whereas targeted, token-level interventions
reliably steer image semantics and downstream text with fine-grained control.
中文标题/摘要
标题:窄门:原生多模态模型中的局部化图像-文本通信
近期在多模态训练方面的进展显著提高了图像理解和生成在统一模型中的整合。本研究探讨了视觉语言模型(VLMs)如何处理图像理解任务,重点关注视觉信息如何被处理并转移到文本域。我们比较了原生多模态VLMs,从头开始在多模态数据上训练以生成文本和图像的模型,以及非原生多模态VLMs,这些模型是从预训练的大语言模型改编而来或仅能生成文本,突出了信息流中的关键差异。我们发现,在原生多模态VLMs中,图像和文本嵌入在残差流中更为分离。此外,VLMs在视觉信息如何到达文本方面存在差异:非原生多模态VLMs表现出分布式通信模式,信息通过多个图像令牌交换,而专门为联合图像和文本生成训练的模型则倾向于依赖一个后图像令牌作为视觉信息的窄门。我们展示了删除这个单一令牌会显著降低图像理解性能,而针对的、令牌级别的干预可以可靠地以精细控制的方式引导图像语义和下游文本。
Summary / 总结
This study examines how vision-language models (VLMs) process and transfer visual information to the textual domain, focusing on the differences between native and non-native multimodal VLMs. The research finds that native VLMs have more separated image and text embeddings within the residual stream. Non-native VLMs use a distributed communication pattern in which information is exchanged through multiple image tokens, whereas native VLMs tend to route visual information through a single post-image token that acts as a narrow gate. Ablating this token significantly reduces image-understanding performance, indicating its crucial role in information flow.
研究探讨了视觉语言模型(VLMs)如何处理和传输视觉信息到文本域,重点关注原生和非原生多模态VLMs之间的差异。研究发现,原生VLMs的图像和文本嵌入在残差流中更为分离,而非原生VLMs则采用分布式通信模式。删除原生VLMs中的单个后图像标记会显著降低图像理解性能,表明其在信息流中的关键作用。
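One common way to ablate a token's contribution is to block attention to it, so that later positions can no longer read from it. The abstract does not say exactly how the ablation was done, so the sketch below is only an illustration of that general idea: a single query attends over a few keys with scaled dot-product attention, and `blocked` (a hypothetical parameter) lists key positions whose scores are set to negative infinity.

```python
import math


def attention_weights(query, keys, blocked=()):
    """Softmax attention weights of one query over keys.

    Positions in `blocked` get a score of -inf and thus zero weight.
    At least one position must remain unblocked.
    """
    scale = math.sqrt(len(query))
    scores = []
    for i, key in enumerate(keys):
        if i in blocked:
            scores.append(float("-inf"))
        else:
            scores.append(sum(q * k for q, k in zip(query, key)) / scale)
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]  # pretend index 2 is the gate token
full = attention_weights(query, keys)
ablated = attention_weights(query, keys, blocked={2})
```

In the unablated case the query puts most of its weight on index 2; after blocking, that weight is redistributed over the remaining positions.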
Modest-Align: Data-Efficient Alignment for Vision-Language Models
Authors: Jiaxiang Liu, Yuan Wang, Jiawei Du, Joey Tianyi Zhou, Mingkun Xu, Zuozhu Liu
First: 2025-10-24T16:11:10+00:00 · Latest: 2025-10-24T16:11:10+00:00
Abstract
Cross-modal alignment aims to map heterogeneous modalities into a shared
latent space, as exemplified by models like CLIP, which benefit from
large-scale image-text pretraining for strong recognition capabilities.
However, when operating in resource-constrained settings with limited or
low-quality data, these models often suffer from overconfidence and degraded
performance due to the prevalence of ambiguous or weakly correlated image-text
pairs. Current contrastive learning approaches, which rely on single positive
pairs, further exacerbate this issue by reinforcing overconfidence on uncertain
samples. To address these challenges, we propose Modest-Align, a lightweight
alignment framework designed for robustness and efficiency. Our approach
leverages two complementary strategies -- Random Perturbation, which introduces
controlled noise to simulate uncertainty, and Embedding Smoothing, which
calibrates similarity distributions in the embedding space. These mechanisms
collectively reduce overconfidence and improve performance on noisy or weakly
aligned samples. Extensive experiments across multiple benchmark datasets
demonstrate that Modest-Align outperforms state-of-the-art methods in retrieval
tasks, achieving competitive results with over 100x less training data and 600x
less GPU time than CLIP. Our method offers a practical and scalable solution
for cross-modal alignment in real-world, low-resource scenarios.
中文标题/摘要
标题:谦逊对齐:数据高效视觉语言模型对齐
跨模态对齐旨在将异构模态映射到共享的潜在空间,如CLIP等模型通过大规模图像文本预训练获得强大的识别能力。然而,在资源受限的环境中,由于存在模糊或弱相关图像文本对,这些模型往往因过度自信而导致性能下降。当前依赖单一正样本对的对比学习方法进一步加剧了这一问题,强化了对不确定样本的过度自信。为解决这些问题,我们提出谦逊对齐,这是一种轻量级的对齐框架,旨在提高鲁棒性和效率。我们的方法利用了两种互补策略——随机扰动,通过引入受控噪声模拟不确定性;嵌入平滑,校准嵌入空间中的相似性分布。这些机制共同减少了过度自信,提高了对噪声或弱对齐样本的性能。在多个基准数据集上的广泛实验表明,谦逊对齐在检索任务中优于最先进的方法,使用超过100倍少的训练数据和600倍少的GPU时间即可达到与CLIP相当的结果。我们的方法为真实世界低资源环境下的跨模态对齐提供了一种实用且可扩展的解决方案。
Summary / 总结
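Modest-Align is a lightweight alignment framework designed to improve cross-modal alignment in resource-constrained settings. It combines Random Perturbation, which injects controlled noise to simulate uncertainty, with Embedding Smoothing, which calibrates similarity distributions in the embedding space, to reduce overconfidence on noisy or weakly aligned samples. Experiments show that Modest-Align outperforms state-of-the-art methods in retrieval tasks, achieving competitive results with over 100x less training data and 600x less GPU time than CLIP.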
Modest-Align 是一种轻量级的对齐框架,旨在改进资源受限环境下的跨模态对齐。它使用随机扰动和嵌入平滑来减少过度自信并提高对模糊数据的性能。实验表明,Modest-Align 在显著减少训练数据和 GPU 时间的情况下,优于最先进的方法。
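The abstract names the two strategies but gives no formulas, so the following is a generic sketch of the underlying ideas rather than the authors' method. It assumes that perturbation means adding Gaussian noise to an embedding and that smoothing behaves like label smoothing on contrastive targets; `sigma` and `alpha` are hypothetical hyperparameters.

```python
import math
import random


def perturb(embedding, sigma, rng):
    """Add zero-mean Gaussian noise to each embedding coordinate."""
    return [x + rng.gauss(0.0, sigma) for x in embedding]


def smoothed_targets(n, index, alpha):
    """Soft target over n candidates: 1 - alpha on the positive pair,
    alpha spread evenly over the n - 1 negatives (requires n >= 2)."""
    off = alpha / (n - 1)
    return [1.0 - alpha if j == index else off for j in range(n)]


def soft_cross_entropy(logits, targets):
    """Cross-entropy between a soft target and softmax(logits)."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return -sum(t * (l - log_z) for t, l in zip(targets, logits))


rng = random.Random(0)
noisy = perturb([0.5, -0.5, 1.0], sigma=0.05, rng=rng)
targets = smoothed_targets(4, index=1, alpha=0.1)
loss = soft_cross_entropy([0.0, 0.0, 0.0, 0.0], targets)
```

With a hard one-hot target the loss keeps rewarding ever larger similarity on the single positive pair; the softened target caps that incentive, which is the usual intuition for why smoothing reduces overconfidence.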
Head Pursuit: Probing Attention Specialization in Multimodal Transformers
Authors: Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga
Venue: NeurIPS 2025 spotlight
First: 2025-10-24T14:41:47+00:00 · Latest: 2025-10-24T14:41:47+00:00
Comments: Accepted at NeurIPS 2025 (spotlight)
Abstract
Language and vision-language models have shown impressive performance across
a wide range of tasks, but their internal mechanisms remain only partly
understood. In this work, we study how individual attention heads in
text-generative models specialize in specific semantic or visual attributes.
Building on an established interpretability method, we reinterpret the practice
of probing intermediate activations with the final decoding layer through the
lens of signal processing. This lets us analyze multiple samples in a
principled way and rank attention heads based on their relevance to target
concepts. Our results show consistent patterns of specialization at the head
level across both unimodal and multimodal transformers. Remarkably, we find
that editing as few as 1% of the heads, selected using our method, can reliably
suppress or enhance targeted concepts in the model output. We validate our
approach on language tasks such as question answering and toxicity mitigation,
as well as vision-language tasks including image classification and captioning.
Our findings highlight an interpretable and controllable structure within
attention layers, offering simple tools for understanding and editing
large-scale generative models.
中文标题/摘要
标题:头部追踪:探究多模态变换器中的注意力专业化
语言模型和视觉-语言模型在广泛的任务中表现出色,但其内部机制仍部分未被理解。在本研究中,我们研究了文本生成模型中的个体注意力头如何专门化于特定的语义或视觉属性。基于已建立的可解释性方法,我们将通过最终解码层探测中间激活的过程重新解释为信号处理的视角。这使我们能够系统地分析多个样本,并根据其与目标概念的相关性对注意力头进行排名。我们的结果显示,在单模态和多模态变换器中,头部级别的专业化模式是一致的。令人惊讶的是,我们发现使用我们的方法编辑少至1%的头部,可以可靠地抑制或增强模型输出中的目标概念。我们在诸如问答和毒性缓解的语言任务,以及包括图像分类和字幕生成的视觉-语言任务上验证了我们的方法。我们的发现突显了注意力层中的可解释和可控结构,提供了理解并编辑大规模生成模型的简单工具。
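The ranking step can be pictured as scoring each head by how strongly its output aligns with a target concept. The sketch below is a simplified stand-in: it assumes each head's output and the concept are plain vectors and uses a mean dot product over samples as the relevance score. The paper's actual signal-processing-based procedure is not detailed in the abstract, and `head_outputs`, `concept_direction`, and `select_top` are hypothetical names.

```python
def rank_heads(head_outputs, concept_direction):
    """Rank heads by mean projection of their outputs onto a concept.

    head_outputs: {head name: list of output vectors, one per sample}.
    Returns head names sorted from most to least aligned.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = {
        head: sum(dot(v, concept_direction) for v in vectors) / len(vectors)
        for head, vectors in head_outputs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def select_top(ranked, fraction):
    """Keep the top fraction of heads, but always at least one."""
    return ranked[: max(1, int(len(ranked) * fraction))]


outputs = {
    "L0H0": [[1.0, 0.0], [0.8, 0.1]],
    "L0H1": [[0.0, 1.0]],
    "L1H0": [[-1.0, 0.0]],
}
ranked = rank_heads(outputs, concept_direction=[1.0, 0.0])
to_edit = select_top(ranked, fraction=0.01)
```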
MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection
Authors: Shengtian Yang, Yue Feng, Yingshi Liu, Jingrou Zhang, Jie Qin
Venue: NeurIPS 2025
First: 2025-10-24T13:28:29+00:00 · Latest: 2025-10-24T13:28:29+00:00
Comments: Accepted to NeurIPS 2025. The first two authors contributed equally.
Abstract
Video Anomaly Detection (VAD) aims to locate unusual activities or behaviors
within videos. Recently, offline VAD has garnered substantial research
attention, which has been invigorated by the progress in large language models
(LLMs) and vision-language models (VLMs), offering the potential for a more
nuanced understanding of anomalies. However, online VAD has seldom received
attention due to real-time constraints and computational intensity. In this
paper, we introduce a novel Memory-based online scoring queue scheme for
Training-free VAD (MoniTor), to address the inherent complexities in online
VAD. Specifically, MoniTor applies a streaming input to VLMs, leveraging the
capabilities of pre-trained large-scale models. To capture temporal
dependencies more effectively, we incorporate a novel prediction mechanism
inspired by Long Short-Term Memory (LSTM) networks. This lets the model
retain past states and leverage previous predictions when identifying
anomalous behaviors, so that it better understands the current frame.
Moreover, we design a scoring queue and an anomaly prior to dynamically store
recent scores and cover all anomalies in the monitoring scenario, providing
guidance for LLMs to distinguish between normal and abnormal behaviors over
time. We evaluate MoniTor on two large datasets (i.e., UCF-Crime and
XD-Violence) containing various surveillance and real-world scenarios. The
results demonstrate that MoniTor outperforms state-of-the-art methods and is
competitive with weakly supervised methods without training. Code is available
at https://github.com/YsTvT/MoniTor.
中文标题/摘要
标题:MoniTor:利用指令驱动大型语言模型进行在线视频异常检测
视频异常检测(VAD)旨在定位视频中的异常活动或行为。近年来,离线VAD获得了大量研究关注,这得益于大型语言模型(LLMs)和视觉语言模型(VLMs)的进步,提供了更深入理解异常的可能性。然而,由于实时约束和计算强度,在线VAD很少受到关注。本文介绍了一种新的基于记忆的在线评分队列方案,用于无需训练的VAD(MoniTor),以解决在线VAD固有的复杂性。具体而言,MoniTor将流式输入应用于VLMs,利用预训练大规模模型的能力。为了更有效地捕捉时间依赖性,我们引入了一种受长短期记忆(LSTM)网络启发的新型预测机制。这确保了模型能够有效建模过去状态,并利用先前的预测来识别异常行为,从而更好地理解当前帧。此外,我们设计了一个评分队列和一个异常先验,以动态存储最近的评分并覆盖监控场景中的所有异常,为LLMs提供指导,以区分正常和异常行为。我们在两个大型数据集(即UCF-Crime和XD-Violence)上评估了MoniTor,这些数据集包含各种监控和现实场景。结果表明,MoniTor优于最先进的方法,并且在无需训练的情况下与弱监督方法竞争。代码可在https://github.com/YsTvT/MoniTor/获取。
Summary / 总结
MoniTor is a training-free approach to online video anomaly detection that feeds streaming input to pre-trained vision-language models and adds an LSTM-inspired prediction mechanism to capture temporal dependencies. A scoring queue dynamically stores recent scores, and an anomaly prior covers the anomalies expected in the monitoring scenario; together they guide LLMs in distinguishing normal from abnormal behavior over time. Experimental results on the UCF-Crime and XD-Violence datasets show that MoniTor outperforms state-of-the-art methods and is competitive with weakly supervised methods without training.
MoniTor通过引入基于记忆的评分队列和受LSTM网络启发的预测机制,解决在线视频异常检测(VAD)的挑战,采用无训练方法。该方法利用预训练的大规模语言模型捕捉时间依赖性,并动态存储最近的评分以指导模型区分正常和异常行为。实验结果表明,MoniTor在UCF-Crime和XD-Violence数据集上的表现优于现有方法,并且在无需训练的情况下与弱监督方法竞争。
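A scoring queue that "dynamically stores recent scores" maps naturally onto a fixed-length buffer. The sketch below is one plausible reading, assuming per-frame anomaly scores in [0, 1] and using the running mean as the context passed along; the class name, queue length, and the choice of a mean are all assumptions, not details from the paper.

```python
from collections import deque


class ScoringQueue:
    """Fixed-length buffer of the most recent anomaly scores."""

    def __init__(self, maxlen=8):
        # deque drops the oldest item automatically once maxlen is reached
        self.scores = deque(maxlen=maxlen)

    def push(self, score):
        self.scores.append(score)

    def context(self):
        """Mean of the stored scores, or 0.0 when the queue is empty."""
        if not self.scores:
            return 0.0
        return sum(self.scores) / len(self.scores)


queue = ScoringQueue(maxlen=3)
for frame_score in [0.1, 0.2, 0.9, 0.8]:
    queue.push(frame_score)
recent = list(queue.scores)   # the oldest score (0.1) has been evicted
summary = queue.context()
```

Because the buffer has constant size, memory and per-frame cost stay bounded no matter how long the stream runs, which fits the online setting.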
OpenHype: Hyperbolic Embeddings for Hierarchical Open-Vocabulary Radiance Fields
Authors: Lisa Weijler, Sebastian Koch, Fabio Poiesi, Timo Ropinski, Pedro Hermosilla
Venue: NeurIPS 2025
First: 2025-10-24T13:17:56+00:00 · Latest: 2025-10-24T13:17:56+00:00
Abstract
Modeling the inherent hierarchical structure of 3D objects and 3D scenes is
highly desirable, as it enables a more holistic understanding of environments
for autonomous agents. Accomplishing this with implicit representations, such
as Neural Radiance Fields, remains an unexplored challenge. Existing methods
that explicitly model hierarchical structures often face significant
limitations: they either require multiple rendering passes to capture
embeddings at different levels of granularity, significantly increasing
inference time, or rely on predefined, closed-set discrete hierarchies that
generalize poorly to the diverse and nuanced structures encountered by agents
in the real world. To address these challenges, we propose OpenHype, a novel
approach that represents scene hierarchies using a continuous hyperbolic latent
space. By leveraging the properties of hyperbolic geometry, OpenHype naturally
encodes multi-scale relationships and enables smooth traversal of hierarchies
through geodesic paths in latent space. Our method outperforms state-of-the-art
approaches on standard benchmarks, demonstrating superior efficiency and
adaptability in 3D scene understanding.
中文标题/摘要
标题:OpenHype:用于层次化开放词汇辐射场的双曲嵌入
对3D对象和3D场景固有的层次结构进行建模是极其必要的,因为它能够为自主代理提供更全面的环境理解。使用隐式表示,如神经辐射场来实现这一点仍然是一个未探索的挑战。现有的显式建模层次结构的方法往往面临重大限制:它们要么需要多次渲染以捕捉不同粒度级别的嵌入,显著增加推理时间,要么依赖于预定义的封闭集离散层次结构,这些层次结构在代理在现实世界中遇到的多样和复杂的结构上泛化能力较差。为了解决这些挑战,我们提出了一种名为OpenHype的新方法,该方法使用连续的双曲潜在空间来表示场景层次结构。通过利用双曲几何的性质,OpenHype自然地编码多尺度关系,并通过潜在空间中的测地线路径平滑地遍历层次结构。我们的方法在标准基准测试中优于现有方法,展示了在3D场景理解方面的优越效率和适应性。
Summary / 总结
OpenHype is a novel approach that uses a continuous hyperbolic latent space to represent the hierarchical structure of 3D scenes, addressing the limitations of existing methods that require multiple rendering passes or rely on predefined hierarchies. OpenHype enables efficient and smooth traversal of hierarchies through geodesic paths in latent space, and it outperforms state-of-the-art methods on standard benchmarks, showing superior efficiency and adaptability in 3D scene understanding.
OpenHype 是一种新颖的方法,使用连续的双曲隐空间来表示 3D 场景的层次结构,解决了现有方法需要多次渲染或依赖预定义层次结构的局限性。OpenHype 通过隐空间中的测地线路径实现高效且平滑的层次结构遍历,并在标准基准测试中优于最先进的方法,展示了其在 3D 场景理解方面的高效性和适应性。
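Hyperbolic space suits hierarchies because distances grow rapidly toward the boundary, leaving room for many fine-grained leaves far from a coarse root. The abstract does not say which model of hyperbolic space OpenHype uses; the sketch below assumes the Poincaré ball and implements its standard distance formula, d(u, v) = arcosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2))).

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    def sq_norm(x):
        return sum(a * a for a in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


origin = [0.0, 0.0]
mid = [0.5, 0.0]
near_edge = [0.95, 0.0]
d_mid = poincare_distance(origin, mid)         # equals ln(3)
d_edge = poincare_distance(origin, near_edge)  # much larger than d_mid
```

Equal Euclidean steps cost more hyperbolic distance near the boundary, which is the property that lets a continuous latent space encode levels of a hierarchy by radius.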
Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings
Authors: Abderrazek Abid, Thanh-Cong Ho, Fakhri Karray
First: 2025-10-24T13:04:13+00:00 · Latest: 2025-10-24T13:04:13+00:00
Abstract
As generative AI continues to evolve, Vision Language Models (VLMs) have
emerged as promising tools in various healthcare applications. One area that
remains relatively underexplored is their use in human activity recognition
(HAR) for remote health monitoring. VLMs offer notable strengths, including
greater flexibility and the ability to overcome some of the constraints of
traditional deep learning models. However, a key challenge in applying VLMs to
HAR lies in the difficulty of evaluating their dynamic and often
non-deterministic outputs. To address this gap, we introduce a descriptive
caption dataset and propose comprehensive methods for evaluating VLMs
in HAR. Through comparative experiments with state-of-the-art deep learning
models, our findings demonstrate that VLMs achieve comparable performance and,
in some cases, even surpass conventional approaches in terms of accuracy. This
work contributes a strong benchmark and opens new possibilities for the
integration of VLMs into intelligent healthcare systems.
中文标题/摘要
标题:视觉语言模型在医疗环境中动态人类活动识别中的应用
随着生成式AI的不断演进,视觉语言模型(VLMs)在各种医疗应用中展现出巨大的潜力。一个相对未被充分探索的领域是它们在远程健康监测中的人类活动识别(HAR)应用。VLMs的优势包括更大的灵活性以及能够克服传统深度学习模型的一些限制。然而,将VLMs应用于HAR的一个关键挑战在于评估其动态且往往非确定性的输出难度较大。为解决这一问题,我们引入了一组描述性标题数据集,并提出了全面的评估方法来评估VLMs在HAR中的表现。通过与最先进的深度学习模型进行对比实验,我们的研究结果表明,VLMs在准确率方面达到了可比的水平,在某些情况下甚至超过了传统方法。这项工作提供了一个强有力的基准,并为VLMs在智能医疗系统中的集成开辟了新的可能性。
Summary / 总结
The research aims to explore the use of Vision Language Models (VLMs) for human activity recognition (HAR) in healthcare settings, addressing the challenge of evaluating their dynamic and non-deterministic outputs. The study introduces a descriptive caption dataset and proposes evaluation methods, showing that VLMs achieve comparable performance and sometimes outperform traditional deep learning models in terms of accuracy.
研究旨在探索在医疗保健环境中使用Vision Language Models (VLMs)进行动态人体活动识别(HAR),解决评估VLMs非确定性输出的难题。研究引入了描述性标注数据集,并提出了评估方法,结果显示VLMs在准确率方面与传统深度学习模型相当,有时甚至更优。
Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models
Authors: Ling Li, Yao Zhou, Yuxuan Liang, Fugee Tsung, Jiaheng Wei
Venue: NeurIPS 2025
First: 2025-06-17T16:07:58+00:00 · Latest: 2025-10-24T13:02:36+00:00
Comments: NeurIPS 2025
Abstract
Previous methods for image geo-localization have typically treated the task
as either classification or retrieval, often relying on black-box decisions
that lack interpretability. The rise of large vision-language models (LVLMs)
has enabled a rethinking of geo-localization as a reasoning-driven task
grounded in visual cues. However, two major challenges persist. On the data
side, existing reasoning-focused datasets are primarily based on street-view
imagery, offering limited scene diversity and constrained viewpoints. On the
modeling side, current approaches predominantly rely on supervised fine-tuning,
which yields only marginal improvements in reasoning capabilities. To address
these challenges, we propose a novel pipeline that constructs a
reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social
media images. We introduce GLOBE, Group-relative policy optimization for
Localizability assessment and Optimized visual-cue reasoning, yielding
Bi-objective geo-Enhancement for the VLM in recognition and reasoning. GLOBE
incorporates task-specific rewards that jointly enhance localizability
assessment, visual-cue reasoning, and geolocation accuracy. Both qualitative
and quantitative results demonstrate that GLOBE outperforms state-of-the-art
open-source LVLMs on geo-localization tasks, particularly in diverse visual
scenes, while also generating more insightful and interpretable reasoning
trajectories. The data and code are available at
https://github.com/lingli1996/GLOBE.
中文标题/摘要
标题:通过推理实现识别:使用大型视觉-语言模型强化图像地理定位
以往的图像地理定位方法通常将任务视为分类或检索,往往依赖于缺乏可解释性的黑盒决策。大型视觉-语言模型(LVLM)的兴起使地理定位重新思考为基于视觉线索的推理驱动任务。然而,两个主要挑战仍然存在。在数据方面,现有的推理导向数据集主要基于街景图像,提供的场景多样性有限且视角受限。在建模方面,当前方法主要依赖于监督微调,这仅在推理能力上带来微小的改进。为了解决这些挑战,我们提出了一种新的管道,使用多样化的社交媒体图像构建了一个推理导向的地理定位数据集MP16-Reason。我们引入了GLOBE,一种基于局部化评估和优化视觉线索推理的组相对策略优化方法,产生VLM在识别和推理中的双目标地理增强。GLOBE结合了任务特定的奖励,共同增强了局部化评估、视觉线索推理和地理定位准确性。定性和定量结果表明,GLOBE在地理定位任务中优于最先进的开源LVLM,特别是在多样化的视觉场景中,同时生成了更具洞察力和可解释性的推理轨迹。数据和代码可在https://github.com/lingli1996/GLOBE/获取。
Summary / 总结
This paper addresses the limitations of previous image geo-localization methods by proposing a new pipeline that leverages diverse social media images to construct a reasoning-oriented dataset, MP16-Reason. The method, GLOBE, uses group-relative policy optimization to enhance localizability assessment, visual-cue reasoning, and geolocation accuracy. Experiments show that GLOBE outperforms state-of-the-art open-source large vision-language models in diverse visual scenes and generates more interpretable reasoning trajectories.
该论文通过提出一种新的推理导向的管道来解决之前图像地理定位方法的局限性。它构建了一个多样化的数据集MP16-Reason,并引入了GLOBE方法,该方法增强了定位评估、视觉线索推理和地理定位准确性。结果显示,GLOBE在多样化的视觉场景中优于最先进的开源视觉-语言模型,并生成了更具解释性的推理轨迹。
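Geolocation accuracy is commonly measured as the great-circle distance between predicted and true coordinates. The sketch below implements the standard haversine formula with a mean Earth radius of 6371 km. How GLOBE turns such a distance into a reward is not described in the abstract, so no reward shaping is shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points
    given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the way around the equator.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
same_place = haversine_km(48.85, 2.35, 48.85, 2.35)
```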
Bridging the gap to real-world language-grounded visual concept learning
Authors: Whie Jung, Semin Kim, Junee Kim, Seunghoon Hong
First: 2025-10-24T12:54:13+00:00 · Latest: 2025-10-24T12:54:13+00:00
Abstract
Human intelligence effortlessly interprets visual scenes along a rich
spectrum of semantic dimensions. However, existing approaches to
language-grounded visual concept learning are limited to a few predefined
primitive axes, such as color and shape, and are typically explored in
synthetic datasets. In this work, we propose a scalable framework that
adaptively identifies image-related concept axes and grounds visual concepts
along these axes in real-world scenes. Leveraging a pretrained vision-language
model and our universal prompting strategy, our framework identifies diverse
image-related axes without any prior knowledge. Our universal concept encoder
adaptively binds visual features to the discovered axes without introducing
additional model parameters for each concept. To ground visual concepts along
the discovered axes, we optimize a compositional anchoring objective, which
ensures that each axis can be independently manipulated without affecting
others. We demonstrate the effectiveness of our framework on subsets of
ImageNet, CelebA-HQ, and AFHQ, showcasing superior editing capabilities across
diverse real-world concepts that are too varied to be manually predefined. Our
method also exhibits strong compositional generalization, outperforming
existing visual concept learning and text-based editing methods. The code is
available at https://github.com/whieya/Language-grounded-VCL.
中文标题/摘要
标题:跨越到现实世界语言导向视觉概念学习的鸿沟
人类智能能够轻松地在丰富的语义维度上解释视觉场景。然而,现有的语言导向视觉概念学习方法仅限于少数预定义的基本轴,如颜色和形状,并且通常在合成数据集中进行探索。在本工作中,我们提出了一种可扩展的框架,该框架能够自适应地识别与图像相关的概念轴,并在现实世界的场景中将视觉概念沿这些轴进行定位。利用预训练的跨模态模型和我们的通用提示策略,我们的框架能够在没有任何先验知识的情况下识别出多样化的与图像相关的轴。我们的通用概念编码器能够自适应地将视觉特征绑定到发现的轴上,而无需为每个概念引入额外的模型参数。为了沿发现的轴定位视觉概念,我们优化了一个组合锚定目标,该目标确保每个轴可以独立操作而不影响其他轴。我们在ImageNet、CelebA-HQ和AFHQ的子集上展示了我们框架的有效性,展示了在多样化的现实世界概念上具有优越的编辑能力,这些概念过于多样化而无法手动预定义。我们的方法还表现出强大的组合泛化能力,优于现有的视觉概念学习和基于文本的编辑方法。代码可在https://github.com/whieya/Language-grounded-VCL/ 获取。
Summary / 总结
This work addresses the gap in language-grounded visual concept learning by proposing a scalable framework that identifies diverse image-related concept axes in real-world scenes. Using a pretrained vision-language model and a universal prompting strategy, the framework discovers these axes without prior knowledge. It then optimizes a compositional anchoring objective to ground visual concepts along the discovered axes so that each axis can be manipulated independently. On subsets of ImageNet, CelebA-HQ, and AFHQ, the method shows superior editing capabilities and strong compositional generalization compared to existing methods.
本文提出了一种可扩展的框架,该框架能够在现实世界的场景中识别出多种与图像相关的概念轴线,解决了现有方法在语言指导视觉概念学习方面的局限性。该框架利用预训练的视觉-语言模型和通用提示策略,在无需先验知识的情况下发现这些轴线,并采用自适应的概念编码器将视觉特征绑定到发现的轴线上。同时,通过优化组成锚定目标确保每个轴线可以独立操作而不影响其他轴线。该方法在多种现实世界数据集上展示了优越的编辑能力和强大的组成泛化能力,优于现有方法。
BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning
Authors: Hongyi Zhou, Weiran Liao, Xi Huang, Yucheng Tang, Fabian Otto, Xiaogang Jia, Xinkai Jiang, Simon Hilber, Ge Li, Qian Wang, Ömer Erdinç Yağmurlu, Nils Blank, Moritz Reuss, Rudolf Lioutikov
Venue: NeurIPS 2025
First: 2025-06-06T13:26:16+00:00 · Latest: 2025-10-24T12:20:29+00:00
Comments: Accepted by NeurIPS 2025
Abstract
We present the B-spline Encoded Action Sequence Tokenizer (BEAST), a novel
action tokenizer that encodes action sequences into compact discrete or
continuous tokens using B-splines. In contrast to existing action tokenizers
based on vector quantization or byte pair encoding, BEAST requires no separate
tokenizer training and consistently produces tokens of uniform length, enabling
fast action sequence generation via parallel decoding. Leveraging our B-spline
formulation, BEAST inherently generates smooth trajectories without
discontinuities between adjacent segments. We extensively evaluate BEAST by
integrating it with three distinct model architectures: a Variational
Autoencoder (VAE) with continuous tokens, a decoder-only Transformer with
discrete tokens, and Florence-2, a pretrained Vision-Language Model with an
encoder-decoder architecture, demonstrating BEAST's compatibility and
scalability with large pretrained models. We evaluate BEAST across three
established benchmarks consisting of 166 simulated tasks and on three distinct
robot settings with a total of 8 real-world tasks. Experimental results
demonstrate that BEAST (i) significantly reduces both training and inference
computational costs, (ii) consistently generates smooth, high-frequency
control signals suitable for continuous control tasks, and (iii) reliably
achieves competitive task success rates compared to state-of-the-art methods.
中文标题/摘要
标题:BEAST:面向模仿学习的B样条编码动作序列高效分词
我们提出了B样条编码动作序列分词器(BEAST),这是一种新颖的动作分词器,使用B样条将动作序列编码为紧凑的离散或连续分词。与基于向量量化或字节对编码的动作分词器不同,BEAST 不需要单独的分词器训练,并且始终生成长度一致的分词,从而通过并行解码快速生成动作序列。利用我们的B样条公式,BEAST 本身确保生成平滑的轨迹,相邻段之间没有不连续性。我们通过将BEAST与三种不同的模型架构集成来广泛评估BEAST:使用连续分词的变分自编码器(VAE),使用离散分词的仅解码器Transformer,以及具有编码器-解码器架构的预训练视觉-语言模型Florence-2,展示了BEAST与大型预训练模型的兼容性和可扩展性。我们在包含166个模拟任务的三个成熟基准上评估了BEAST,并在三种不同的机器人设置上评估了共8个真实世界任务。实验结果表明,BEAST (i) 显著降低了训练和推理计算成本,(ii) 一致生成适合连续控制任务的平滑、高频控制信号,(iii) 可靠地实现了与最先进的方法相当的任务成功率。
Summary / 总结
BEAST is a novel action tokenizer that encodes action sequences into B-spline tokens, enabling uniform-length tokens for fast parallel decoding and smooth trajectory generation. It is evaluated with three model architectures and across multiple benchmarks, showing significant reduction in computational costs and competitive task success rates for continuous control tasks.
BEAST 是一种新型的动作分词器,将动作序列编码为 B-样条,实现高效且平滑的动作序列生成。它在三种模型架构中进行了评估,并在多种基准测试中表现出色,降低了计算成本并生成了适用于连续控制任务的高频控制信号,成功率与现有最佳方法相当。
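The smoothness claim follows from how B-splines work: a fixed number of control points defines a curve through smooth basis functions, so a trajectory decoded from those points is smooth by construction. The sketch below shows only the decoding direction for a one-dimensional action, using the standard Cox-de Boor recursion with a clamped uniform knot vector. Treating the control points as the "tokens", the knot layout, and the function names are assumptions for illustration, not BEAST's actual implementation.

```python
def bspline_basis(i, degree, t, knots):
    """Cox-de Boor recursion for the i-th basis function at parameter t."""
    if degree == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    span_left = knots[i + degree] - knots[i]
    if span_left > 0:
        left = (t - knots[i]) / span_left * bspline_basis(i, degree - 1, t, knots)
    span_right = knots[i + degree + 1] - knots[i + 1]
    if span_right > 0:
        right = ((knots[i + degree + 1] - t) / span_right
                 * bspline_basis(i + 1, degree - 1, t, knots))
    return left + right


def clamped_knots(n_ctrl, degree):
    """Clamped uniform knot vector on [0, 1]; needs n_ctrl >= degree + 1."""
    inner = n_ctrl - degree - 1
    middle = [(j + 1) / (inner + 1) for j in range(inner)]
    return [0.0] * (degree + 1) + middle + [1.0] * (degree + 1)


def decode_bspline(control_points, degree, n_steps):
    """Sample the curve at n_steps parameters in [0, 1)."""
    knots = clamped_knots(len(control_points), degree)
    return [
        sum(c * bspline_basis(i, degree, s / n_steps, knots)
            for i, c in enumerate(control_points))
        for s in range(n_steps)
    ]


# Five control points always decode to the requested number of steps.
trajectory = decode_bspline([1.0, 5.0, 3.0, 4.0, 2.0], degree=3, n_steps=8)
```

The number of control points is fixed regardless of how many steps are decoded, which matches the abstract's point about uniform token length. Encoding a recorded trajectory into control points would need a least-squares fit, which is omitted here.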
Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling
Authors: Bryan Wong, Jong Woo Kim, Huazhu Fu, Mun Yong Yi
Venue: NeurIPS 2025
First: 2025-05-23T14:48:32+00:00 · Latest: 2025-10-24T11:54:03+00:00
Comments: Accepted at NeurIPS 2025
Abstract
Vision-language models (VLMs) have recently been integrated into multiple
instance learning (MIL) frameworks to address the challenge of few-shot, weakly
supervised classification of whole slide images (WSIs). A key trend involves
leveraging multi-scale information to better represent hierarchical tissue
structures. However, existing methods often face two key limitations: (1)
insufficient modeling of interactions within the same modalities across scales
(e.g., 5x and 20x) and (2) inadequate alignment between visual and textual
modalities on the same scale. To address these gaps, we propose HiVE-MIL, a
hierarchical vision-language framework that constructs a unified graph
consisting of (1) parent-child links between coarse (5x) and fine (20x)
visual/textual nodes to capture hierarchical relationships, and (2)
heterogeneous intra-scale edges linking visual and textual nodes on the same
scale. To further enhance semantic consistency, HiVE-MIL incorporates a
two-stage, text-guided dynamic filtering mechanism that removes weakly
correlated patch-text pairs, and introduces a hierarchical contrastive loss to
align textual semantics across scales. Extensive experiments on TCGA breast,
lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently
outperforms both traditional MIL and recent VLM-based MIL approaches, achieving
gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate
the value of jointly modeling hierarchical structure and multimodal alignment
for efficient and scalable learning from limited pathology data. The code is
available at https://github.com/bryanwong17/HiVE-MIL.
中文标题/摘要
标题:通过层次视觉-语言对齐与建模从吉格像素图像进行少量样本学习
视觉-语言模型(VLMs)已集成到多个实例学习(MIL)框架中,以解决全切片图像(WSIs)少量样本、弱监督分类的挑战。一个关键趋势是利用多尺度信息更好地表示层次组织结构。然而,现有方法通常面临两个关键限制:(1)在同一模态(如5x和20x)跨尺度的交互建模不足,(2)同一尺度上视觉和文本模态之间的对齐不足。为了解决这些差距,我们提出了HiVE-MIL,这是一种层次视觉-语言框架,构建了一个统一图,包括(1)粗(5x)和细(20x)视觉/文本节点之间的父节点-子节点链接,以捕捉层次关系,以及(2)同一尺度上视觉和文本节点之间的异质内尺度边。为了进一步增强语义一致性,HiVE-MIL引入了两阶段、文本引导的动态过滤机制,去除弱相关的小块-文本对,并引入了层次对比损失以在不同尺度上对齐文本语义。在TCGA乳腺、肺和肾癌数据集上的广泛实验表明,HiVE-MIL在16样本设置下的一致性宏F1得分上优于传统MIL和最近的基于VLM的MIL方法,提高了高达4.1%。我们的结果表明,联合建模层次结构和多模态对齐对于从有限的病理数据中高效、可扩展地学习具有价值。代码可在https://github.com/bryanwong17/HiVE-MIL/ 获取。
Summary / 总结
The research aims to improve few-shot classification of whole slide images (WSIs) using vision-language models (VLMs) integrated into multiple instance learning (MIL) frameworks. HiVE-MIL, a hierarchical vision-language framework, addresses limitations in modeling interactions across scales and aligning visual and textual modalities. It constructs a unified graph with parent-child links and heterogeneous intra-scale edges, and incorporates a two-stage, text-guided dynamic filtering mechanism and hierarchical contrastive loss. Experiments on TCGA datasets show that HiVE-MIL outperforms traditional MIL and recent VLM-based MIL approaches, achieving up to 4.1% gains in macro F1 under 16-shot settings.
研究旨在通过层级视觉-语言对齐提高全切片图像(WSIs)的少样本学习。HiVE-MIL 构建了一个统一图,包含父节点与子节点链接和同尺度内节点链接,以建模层级和多模态关系。实验表明,HiVE-MIL 在 16-shot 设置下比传统方法和最近的基于 VLM 的方法表现更好,宏 F1 得分提高了最多 4.1%。
Surfer 2: The Next Generation of Cross-Platform Computer Use Agents
Authors: Mathieu Andreux, Märt Bakler, Yanael Barbier, Hamza Benchekroun, Emilien Biré, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Aleix Cambray, Pierre-Louis Cedoz, Antoine Chassang, Gautier Cloix, Ethan Connelly, Alexandra Constantinou, Ramzi De Coster, Hubert de la Jonquiere, Aurélien Delfosse, Maxime Delpit, Alexis Deprez, Augustin Derupti, Mathieu Diaz, Shannon D'Souza, Julie Dujardin, Abai Edmund, Michael Eickenberg, Armand Fatalot, Wissem Felissi, Isaac Herring, Xavier Koegler, Erwan Le Jumeau de Kergaradec, Aurélien Lac, Maxime Langevin, Corentin Lauverjat, Antonio Loison, Avshalom Manevich, Axel Moyal, Axel Nguyen Kerbel, Marinela Parovic, Julien Revelle, Guillaume Richard, Mats Richter, Ronan Riochet, María Santos, Romain Savidan, Laurent Sifre, Maxime Theillard, Marc Thibault, Ivan Valentini, Tony Wu, Laura Yie, Kai Yuan, Jevgenij Zubovskij
First: 2025-10-22T18:21:52+00:00 · Latest: 2025-10-24T11:52:29+00:00
Comments: 21 pages, 9 figures, 2 tables
Abstract
Building agents that generalize across web, desktop, and mobile environments
remains an open challenge, as prior systems rely on environment-specific
interfaces that limit cross-platform deployment. We introduce Surfer 2, a
unified architecture operating purely from visual observations that achieves
state-of-the-art performance across all three environments. Surfer 2 integrates
hierarchical context management, decoupled planning and execution, and
self-verification with adaptive recovery, enabling reliable operation over long
task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on
WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior
systems without task-specific fine-tuning. With multiple attempts, Surfer 2
exceeds human performance on all benchmarks. These results demonstrate that
systematic orchestration amplifies foundation model capabilities and enables
general-purpose computer control through visual interaction alone, while
calling for a next-generation vision language model to achieve Pareto-optimal
cost-efficiency.
中文标题/摘要
标题:冲浪者2:跨平台计算机使用代理的新一代
构建能够在网络、桌面和移动环境中泛化的代理仍然是一个开放的挑战,因为先前的系统依赖于环境特定的接口,限制了跨平台部署。我们引入了冲浪者2,这是一种仅基于视觉观察的统一架构,实现了所有三个环境中的最佳性能。冲浪者2集成了分层上下文管理、解耦计划和执行以及自我验证与自适应恢复,使其能够在长时间任务中可靠运行。我们的系统在WebVoyager上达到了97.1%的准确性,在WebArena上达到了69.6%的准确性,在OSWorld上达到了60.1%的准确性,在AndroidWorld上达到了87.1%的准确性,超过了所有先前的系统,而无需针对特定任务进行微调。通过多次尝试,冲浪者2在所有基准测试中都超过了人类的表现。这些结果表明,系统化的协调可以放大基础模型的能力,并通过仅通过视觉交互实现通用的计算机控制,同时呼吁下一代视觉语言模型以实现帕累托最优的成本效率。
VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set
Authors: Shufan Shen, Junshu Sun, Qingming Huang, Shuhui Wang
Venue: NeurIPS 2025
First: 2025-10-24T10:29:31+00:00 · Latest: 2025-10-24T10:29:31+00:00
Comments: Accepted by NeurIPS 2025
Abstract
The alignment of vision-language representations endows current
Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities.
However, the interpretability of the alignment component remains uninvestigated
due to the difficulty in mapping the semantics of multi-modal representations
into a unified concept set. To address this problem, we propose VL-SAE, a
sparse autoencoder that encodes vision-language representations into its hidden
activations. Each neuron in its hidden layer correlates to a concept
represented by semantically similar images and texts, thereby interpreting
these representations with a unified concept set. To establish the
neuron-concept correlation, we encourage semantically similar representations
to exhibit consistent neuron activations during self-supervised training.
First, to measure the semantic similarity of multi-modal representations, we
perform their alignment in an explicit form based on cosine similarity. Second,
we construct the VL-SAE with a distance-based encoder and two modality-specific
decoders to ensure the activation consistency of semantically similar
representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA)
demonstrate the superior capability of VL-SAE in interpreting and enhancing the
vision-language alignment. For interpretation, the alignment between vision and
language representations can be understood by comparing their semantics with
concepts. For enhancement, the alignment can be strengthened by aligning
vision-language representations at the concept level, contributing to
performance improvements in downstream tasks, including zero-shot image
classification and hallucination elimination. Codes are available at
https://github.com/ssfgunner/VL-SAE.
中文标题/摘要
标题:VL-SAE:使用统一概念集解释和增强视觉-语言对齐
视觉-语言表示的对齐赋予了当前的视觉-语言模型强大的多模态推理能力。然而,由于难以将多模态表示的语义映射到统一的概念集中,对齐组件的可解释性尚未得到研究。为了解决这一问题,我们提出了一种稀疏自编码器VL-SAE,它将视觉-语言表示编码为其隐藏激活。其隐藏层中的每个神经元与一组语义相似的图像和文本表示的概念相关联,从而使用统一的概念集解释这些表示。为了建立神经元-概念相关性,我们在自监督训练中鼓励语义相似的表示表现出一致的神经元激活。首先,为了衡量多模态表示的语义相似性,我们基于余弦相似性显式地对它们进行对齐。其次,我们使用基于距离的编码器和两个模态特定解码器构建VL-SAE,以确保语义相似表示的激活一致性。跨多个视觉-语言模型(例如,CLIP,LLaVA)的实验表明,VL-SAE在解释和增强视觉-语言对齐方面具有优越的能力。对于解释,可以通过将视觉和语言表示的语义与概念进行比较来理解它们之间的对齐。对于增强,可以通过在概念层面对齐视觉-语言表示来加强对齐,从而在下游任务(包括零样本图像分类和幻觉消除)中提高性能。代码可在https://github.com/ssfgunner/VL-SAE/ 获取。
Summary / 总结
VL-SAE is a sparse autoencoder that interprets and enhances vision-language alignment by encoding representations into hidden activations correlated with unified concepts. It uses self-supervised training to ensure consistent activations for semantically similar representations. Experiments on VLMs like CLIP and LLaVA show that VL-SAE improves interpretability and performance in tasks such as zero-shot image classification and hallucination elimination.
VL-SAE 是一种稀疏自编码器,通过将表示编码为与统一概念相关的隐藏激活来解释和增强视觉-语言对齐。它通过自监督训练确保具有语义相似性的表示具有一致的神经元激活,这些激活通过余弦相似性进行测量。实验表明,VL-SAE 提高了视觉-语言模型的可解释性和性能,特别是在零样本图像分类和幻觉消除方面。
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Authors: Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, Wenhu Chen
First: 2025-05-21T19:35:08+00:00 · Latest: 2025-10-24T09:35:22+00:00
Comments: Project Page: https://tiger-ai-lab.github.io/Pixel-Reasoner/ · Hands-on Demo: https://huggingface.co/spaces/TIGER-Lab/Pixel-Reasoner
Abstract
Chain-of-thought reasoning has significantly improved the performance of
Large Language Models (LLMs) across various domains. However, this reasoning
process has been confined exclusively to textual space, limiting its
effectiveness in visually intensive tasks. To address this limitation, we
introduce the concept of reasoning in the pixel-space. Within this novel
framework, Vision-Language Models (VLMs) are equipped with a suite of visual
reasoning operations, such as zoom-in and select-frame. These operations enable
VLMs to directly inspect, interrogate, and infer from visual evidence, thereby
enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space
reasoning capabilities in VLMs presents notable challenges, including the
model's initially imbalanced competence and its reluctance to adopt the newly
introduced pixel-space operations. We address these challenges through a
two-phase training approach. The first phase employs instruction tuning on
synthesized reasoning traces to familiarize the model with the novel visual
operations. Following this, a reinforcement learning (RL) phase leverages a
curiosity-driven reward scheme to balance exploration between pixel-space
reasoning and textual reasoning. With these visual operations, VLMs can
interact with complex visual inputs, such as information-rich images or videos,
to proactively gather necessary information. We demonstrate that this approach
significantly improves VLM performance across diverse visual reasoning
benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on
TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy
achieved by any open-source model to date. These results highlight the
importance of pixel-space reasoning and the effectiveness of our framework.
Summary / 总结
The research aims to enhance the reasoning capabilities of Vision-Language Models (VLMs) by introducing pixel-space reasoning, since chain-of-thought reasoning has so far been confined to textual space. The method uses a two-phase training approach: first, instruction tuning on synthesized reasoning traces familiarizes the model with visual operations such as zoom-in and select-frame; second, a reinforcement learning phase uses a curiosity-driven reward scheme to balance exploration between pixel-space and textual reasoning. The approach significantly improves VLM performance across diverse visual reasoning benchmarks: the 7B model achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, the highest accuracy reported for any open-source model to date.
研究旨在通过引入像素空间推理来增强视觉语言模型(VLMs)的视觉推理能力,此前这种推理主要局限于文本空间。方法采用两阶段训练:第一阶段通过合成推理轨迹进行指令调优,使模型熟悉视觉操作;第二阶段使用好奇心驱动的强化学习奖励方案来平衡像素空间和文本空间的推理。7B 模型 Pixel Reasoner 在视觉推理基准测试中表现出显著提升,分别在V*、TallyQA-Complex和InfographicsVQA上达到84%、74%和84%,超过了之前的开源模型。
CLASP: Adaptive Spectral Clustering for Unsupervised Per-Image Segmentation
Authors: Max Curie, Paulo da Costa
First: 2025-09-29T16:41:30+00:00 · Latest: 2025-10-24T09:11:35+00:00
Abstract
We introduce CLASP (Clustering via Adaptive Spectral Processing), a
lightweight framework for unsupervised image segmentation that operates without
any labeled data or finetuning. CLASP first extracts per-patch features using a
self-supervised ViT encoder (DINO); then, it builds an affinity matrix and
applies spectral clustering. To avoid manual tuning, we select the segment
count automatically with an eigengap-silhouette search, and we sharpen the
boundaries with a fully connected DenseCRF. Despite its simplicity and
training-free nature, CLASP attains competitive mIoU and pixel accuracy on COCO
Stuff and ADE20K, matching recent unsupervised baselines. The zero-training
design makes CLASP a strong, easily reproducible baseline for large unannotated
corpora, which are especially common in digital advertising and marketing
workflows such as brand safety screening, creative asset curation, and social
media content moderation.
中文标题/摘要
标题:CLASP:自适应光谱聚类的无监督单图像分割
我们引入了CLASP(自适应光谱处理聚类),这是一种无需任何标注数据或微调的轻量级无监督图像分割框架。CLASP 首先使用自监督 ViT 编码器(DINO)提取每个补丁的特征;然后,它构建一个亲和矩阵并应用谱聚类。为了避免手动调参,我们使用特征间隙轮廓搜索自动选择分割数量,并使用全连接密集 CRF 锐化边界。尽管其简单且无需训练,CLASP 在 COCO Stuff 和 ADE20K 上仍能达到具有竞争力的 mIoU 和像素精度,与最近的无监督基线相当。零训练设计使 CLASP 成为大型未标注数据集的强大且易于复现的基线,特别是在数字广告和营销工作流程中常见的品牌安全筛查、创意资产筛选和社交媒体内容审核等场景中
Summary / 总结
CLASP (Clustering via Adaptive Spectral Processing) is a lightweight unsupervised image segmentation framework that uses a self-supervised ViT encoder to extract patch features and applies spectral clustering. It automatically selects the number of segments using eigengap silhouette search and sharpens boundaries with a DenseCRF. Despite its simplicity and lack of training, CLASP achieves competitive mIoU and pixel accuracy on COCO Stuff and ADE20K, matching recent unsupervised baselines.
CLASP(Clustering via Adaptive Spectral Processing)是一种轻量级的无监督图像分割框架,使用自监督的ViT编码器提取局部特征,并应用谱聚类。它使用eigengap轮廓搜索自动选择分割数量,并使用DenseCRF细化边界。尽管其简单且无需训练,CLASP在COCO Stuff和ADE20K上的mIoU和像素精度与最近的无监督基准相当。
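Illustrative sketch (not the authors' code): automatic selection of the segment count can be pictured with the classic eigengap heuristic. The example assumes the eigenvalues of a normalized graph Laplacian have already been computed from the patch-affinity matrix, and it covers only the eigengap part; how CLASP combines this with a silhouette score is not described in the digest. The function name `choose_k_by_eigengap` and its parameters are hypothetical.

```python
def choose_k_by_eigengap(eigenvalues, k_min=2, k_max=10):
    """Pick the cluster count k with the largest gap between the k-th and
    (k+1)-th smallest Laplacian eigenvalues."""
    vals = sorted(eigenvalues)
    k_max = min(k_max, len(vals) - 1)  # the gap after k needs a (k+1)-th value
    best_k, best_gap = k_min, float("-inf")
    for k in range(k_min, k_max + 1):
        gap = vals[k] - vals[k - 1]  # 0-indexed: (k+1)-th minus k-th eigenvalue
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k
```

With a spectrum such as `[0.0, 0.01, 0.02, 0.9, 0.95, 1.0]`, three eigenvalues sit near zero before a large jump, so the heuristic selects three segments.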
InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation
Authors: Wenjie Zhuo, Fan Ma, Hehe Fan
First: 2024-11-27T12:51:02+00:00 · Latest: 2025-10-24T08:06:50+00:00
Abstract
We present InfiniDreamer, a novel framework for arbitrarily long human motion
generation. InfiniDreamer addresses the limitations of current motion
generation methods, which are typically restricted to short sequences due to
the lack of long motion training data. To achieve this, we first generate
sub-motions corresponding to each textual description and then assemble them
into a coarse, extended sequence using randomly initialized transition
segments. We then introduce an optimization-based method called Segment Score
Distillation (SSD) to refine the entire long motion sequence. SSD is designed
to utilize an existing motion prior, which is trained only on short clips, in a
training-free manner. Specifically, SSD iteratively refines overlapping short
segments sampled from the coarsely extended long motion sequence, progressively
aligning them with the pre-trained motion diffusion prior. This process ensures
local coherence within each segment, while the refined transitions between
segments maintain global consistency across the entire sequence. Extensive
qualitative and quantitative experiments validate the superiority of our
framework, showcasing its ability to generate coherent, contextually aware
motion sequences of arbitrary length.
中文标题/摘要
标题:InfiniDreamer:通过段得分精炼生成任意长度的人体动作
我们提出了InfiniDreamer,一种新颖的任意长度人体动作生成框架。InfiniDreamer解决了当前动作生成方法的限制,这些方法通常由于缺乏长动作训练数据而只能生成短序列。为此,我们首先根据每个文本描述生成子动作,然后使用随机初始化的过渡段将它们组装成一个粗略的扩展序列。我们随后引入了一种基于优化的方法,称为段得分精炼(SSD),以细化整个长动作序列。SSD旨在以无需训练的方式利用仅在短片段上训练的运动先验。具体而言,SSD迭代地细化从粗略扩展的长动作序列中采样的重叠短片段,逐步将它们与预训练的运动扩散先验对齐。这一过程确保了每个片段内的局部一致性,而细化后的过渡则在整个序列中保持全局一致性。广泛的定性和定量实验验证了我们框架的优越性,展示了其生成任意长度的连贯且上下文相关的动作序列的能力。
Summary / 总结
InfiniDreamer is a novel framework for generating arbitrarily long human motion sequences by first creating sub-motions and then assembling them into a coarse sequence using randomly initialized transition segments. It employs Segment Score Distillation (SSD) to refine the entire sequence, ensuring local coherence within segments and global consistency across the sequence. Experiments show that InfiniDreamer can generate coherent and contextually aware motion sequences of arbitrary length, overcoming the limitations of existing methods that are typically restricted to short sequences due to the lack of long motion training data.
InfiniDreamer 是一种通过首先创建子动作并使用随机初始化的过渡段将其组装成粗略序列来生成任意长度的人类运动序列的新框架。它使用 Segment Score Distillation (SSD) 对整个序列进行细化,确保每个段内的局部一致性以及整个序列之间的全局一致性。实验表明,InfiniDreamer 能够生成连贯且上下文相关的运动序列,克服了现有方法通常受限于短序列的限制,因为缺乏长运动训练数据。
Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs
Authors: Insu Lee, Wooje Park, Jaeyun Jang, Minyoung Noh, Kyuhong Shim, Byonghyo Shim
Venue: NeurIPS 2025 Spotlight
First: 2025-05-28T04:09:42+00:00 · Latest: 2025-10-24T07:43:13+00:00
Comments: Accepted to NeurIPS 2025 (Spotlight)
Abstract
Large vision-language models (LVLMs) are increasingly deployed in interactive
applications such as virtual and augmented reality, where a first-person
(egocentric) view captured by head-mounted cameras serves as key input. While
this view offers fine-grained cues about user attention and hand-object
interactions, its narrow field of view and lack of global context often lead to
failures on spatially or contextually demanding queries. To address this, we
introduce a framework that augments egocentric inputs with third-person
(exocentric) views, providing complementary information such as global scene
layout and object visibility to LVLMs. We present E3VQA, the first benchmark
for multi-view question answering with 4K high-quality question-answer pairs
grounded in synchronized ego-exo image pairs. Additionally, we propose M3CoT, a
training-free prompting technique that constructs a unified scene
representation by integrating scene graphs from three complementary
perspectives. M3CoT enables LVLMs to reason more effectively across views,
yielding consistent performance gains (4.84% for GPT-4o and 5.94% for Gemini
2.0 Flash) over a recent CoT baseline. Our extensive evaluation reveals key
strengths and limitations of LVLMs in multi-view reasoning and highlights the
value of leveraging both egocentric and exocentric inputs. The dataset and
source code are available at
https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding.
中文标题/摘要
标题:全面理解场景:结合第一人称和第三人称视角的LVLM
大型视觉语言模型(LVLMs)在虚拟和增强现实等交互式应用中越来越广泛部署,其中由头戴式相机捕捉的第一人称(自视点)视角是关键输入。虽然这种视角提供了关于用户注意力和手物交互的精细线索,但由于其狭窄的视野和缺乏全局上下文,常常在空间或上下文要求较高的查询中失败。为了解决这一问题,我们提出了一种框架,将自视点输入与第三人称(他视点)视角相结合,为LVLMs提供补充信息,如全局场景布局和物体可见性。我们提出了E3VQA,这是第一个多视角问答基准,包含4K高质量的问题-答案对,基于同步的自视-他视图像对。此外,我们提出了M3CoT,这是一种无需训练的提示技术,通过整合三个互补视角的场景图构建统一的场景表示。M3CoT使LVLMs能够在不同视角之间更有效地推理,相对于最近的CoT基线,GPT-4o和Gemini 2.0 Flash分别获得了4.84%和5.94%的一致性能提升。我们的广泛评估揭示了LVLMs在多视角推理中的关键优势和局限性,并强调了利用自视点和他视点输入的价值。数据集和源代码可在https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding/ 获取。
Summary / 总结
This paper addresses the limitations of large vision-language models (LVLMs) in interactive applications by integrating first-person (egocentric) and third-person (exocentric) views. The authors introduce E3VQA, a benchmark for multi-view question answering, and propose M3CoT, a training-free prompting technique that combines scene graphs from three perspectives to enhance LVLMs' reasoning across views. The results show consistent performance gains for GPT-4o and Gemini 2.0 Flash over a recent CoT baseline, demonstrating the benefits of using both egocentric and exocentric inputs for LVLMs in multi-view reasoning.
本文旨在通过结合第一人称(自视点)和第三人称(他视点)视角来提高大型视觉语言模型(LVLMs)在交互式应用中的性能。作者引入了E3VQA,一个用于多视角问答的基准,并提出了M3CoT,一种无需训练的提示技术,通过从三个视角结合场景图来增强LVLMs在跨视角推理中的能力。结果表明,M3CoT在GPT-4o和Gemini 2.0 Flash上分别比最近的CoT基线提高了4.84%和5.94%的性能。
SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
Authors: Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang
Venue: NeurIPS 2025
First: 2025-05-18T14:40:16+00:00 · Latest: 2025-10-24T07:26:20+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Despite impressive advancements in Visual-Language Models (VLMs) for
multi-modal tasks, their reliance on RGB inputs limits precise spatial
understanding. Existing methods for integrating spatial cues, such as point
clouds or depth, either require specialized sensors or fail to effectively
exploit depth information for higher-order reasoning. To this end, we propose
Spatial Sense and Reasoning (SSR), a novel framework that
transforms raw depth data into structured, interpretable textual rationales.
These textual rationales serve as meaningful intermediate representations to
significantly enhance spatial reasoning capabilities. Additionally, we leverage
knowledge distillation to compress the generated rationales into compact latent
embeddings, which facilitate resource-efficient and plug-and-play integration
into existing VLMs without retraining. To enable comprehensive evaluation, we
introduce a new dataset named SSR-CoT, a million-scale visual-language
reasoning dataset enriched with intermediate spatial reasoning annotations, and
present SSRBench, a comprehensive multi-task benchmark. Extensive experiments
on multiple benchmarks demonstrate SSR substantially improves depth utilization
and enhances spatial reasoning, thereby advancing VLMs toward more human-like
multi-modal understanding. Project page: https://yliu-cs.github.io/SSR.
中文标题/摘要
标题:SSR:通过推理引导的空间推理增强视觉语言模型的深度感知
尽管视觉语言模型(VLMs)在多模态任务方面取得了令人印象深刻的进展,但它们依赖于RGB输入,限制了精确的空间理解。现有方法将空间线索(如点云或深度)集成进来,要么需要专门的传感器,要么无法有效利用深度信息进行高层次推理。为此,我们提出了一种新颖的空间感知与推理方法,称为SSR,这是一种新颖的框架,将原始深度数据转换为结构化、可解释的文本推理。这些文本推理作为有意义的中间表示,显著增强了空间推理能力。此外,我们利用知识蒸馏将生成的推理压缩为紧凑的潜在嵌入,这有助于资源高效且即插即用地集成到现有的VLMs中,无需重新训练。为了进行全面评估,我们引入了一个新的数据集SSR-CoT,这是一个包含大量中间空间推理注释的百万级视觉语言推理数据集,并提出了SSRBench,一个全面的多任务基准。在多个基准上的广泛实验表明,SSR 显著提高了深度利用并增强了空间推理,从而推动了VLMs向更接近人类的多模态理解发展。项目页面:https://yliu-cs.github.io/SSR。
Summary / 总结
The research aims to enhance depth perception in Vision-Language Models (VLMs) by integrating spatial reasoning. SSR, a novel framework, transforms raw depth data into textual rationales, which serve as structured intermediate representations. Experiments show that SSR significantly improves depth utilization and spatial reasoning, advancing VLMs towards more human-like multi-modal understanding.
研究旨在通过整合空间线索来增强视觉语言模型(VLMs)的深度感知。SSR是一种新颖的框架,将原始深度数据转换为可解释的文字推理,作为中间表示以提高空间推理能力。实验表明,SSR显著提升了深度利用和空间推理,使VLMs向更接近人类的多模态理解迈进。
SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes
Authors: Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
First: 2025-06-26T04:07:14+00:00 · Latest: 2025-10-24T06:22:59+00:00
Abstract
Fine-tuning vision language models (VLMs) has achieved remarkable performance
across various downstream tasks; yet, it requires access to model gradients
through backpropagation (BP), making it unsuitable for memory-constrained,
inference-only edge devices. To address this limitation, previous work has
explored various BP-free fine-tuning methods. However, these approaches often
rely on high-variance evolutionary strategies (ES) or zeroth-order (ZO)
optimization, and often fail to achieve satisfactory performance. In this
paper, we propose a hybrid Sharpness-aware Zeroth-order optimization (SharpZO)
approach, specifically designed to enhance the performance of ZO VLM
fine-tuning via a sharpness-aware warm-up training. SharpZO features a
two-stage optimization process: a sharpness-aware ES stage that globally
explores and smooths the loss landscape to construct a strong initialization,
followed by a fine-grained local search via sparse ZO optimization. The entire
optimization relies solely on forward passes. Detailed theoretical analysis and
extensive experiments on CLIP models demonstrate that SharpZO significantly
improves accuracy and convergence speed, achieving up to 7% average gain over
state-of-the-art forward-only methods.
中文标题/摘要
标题:SharpZO:通过前向传递的混合锐化感知视觉语言模型提示调优
视觉语言模型(VLMs)的微调已在各种下游任务中取得了显著的性能;然而,这需要通过反向传播(BP)访问模型梯度,使其不适合内存受限的、仅用于推理的边缘设备。为了解决这一限制,先前的工作探索了各种无BP微调方法。然而,这些方法通常依赖于高方差的进化策略(ES)或零阶(ZO)优化,往往无法达到令人满意的效果。在本文中,我们提出了一种混合锐化感知零阶优化(SharpZO)方法,专门设计用于通过锐化感知预热训练增强ZO VLM微调的性能。SharpZO 特征为两阶段优化过程:一个锐化感知的ES阶段,全局探索和平滑损失景观以构建强大的初始化,随后通过稀疏ZO优化进行精细的局部搜索。整个优化仅依赖于前向传递。详细的理论分析和在CLIP模型上的广泛实验表明,SharpZO 显著提高了准确性和收敛速度,相对于最先进的前向传递方法,平均提高了7%的性能。
Summary / 总结
The research aims to improve fine-tuning of vision language models (VLMs) for edge devices with limited memory by proposing a hybrid Sharpness-aware Zeroth-order optimization (SharpZO) method. This method combines a sharpness-aware evolutionary strategy for global exploration with sparse zeroth-order optimization for local refinement, both using only forward passes. Experiments on CLIP models show that SharpZO enhances accuracy and convergence speed, achieving up to a 7% average improvement over existing forward-only methods.
研究旨在通过提出混合Sharpness-aware Zeroth-order优化(SharpZO)方法,改善视觉语言模型(VLMs)在内存受限的边缘设备上的微调。该方法结合了全局探索的尖锐度感知进化策略和局部优化的稀疏零阶优化,两者均仅使用前向传递。实验表明,SharpZO提高了准确性和收敛速度,相对于现有前向方法,平均提高了7%的性能。
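Illustrative sketch (not the SharpZO algorithm itself): the phrase "relies solely on forward passes" refers to estimating gradients from loss evaluations alone. The example shows a generic two-point zeroth-order estimator with Gaussian perturbations on a toy objective; it omits the sharpness-aware ES warm-up and the sparsity that the abstract describes. The names `zo_gradient` and `zo_minimize` and all hyperparameters are assumptions.

```python
import random


def zo_gradient(loss_fn, params, mu=1e-3, num_samples=8, rng=random):
    """Estimate the gradient from paired loss evaluations only (no backprop)."""
    grad = [0.0] * len(params)
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in params]  # random probe direction
        plus = [p + mu * ui for p, ui in zip(params, u)]
        minus = [p - mu * ui for p, ui in zip(params, u)]
        # Central difference approximates the directional derivative along u.
        slope = (loss_fn(plus) - loss_fn(minus)) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_samples
    return grad


def zo_minimize(loss_fn, params, lr=0.05, steps=200, seed=0):
    """Plain gradient descent driven by the zeroth-order estimate."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(steps):
        grad = zo_gradient(loss_fn, params, rng=rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params
```

Each step costs `2 * num_samples` forward evaluations and stores no activations for a backward pass, which is the appeal on inference-only hardware; the price is a noisier gradient, the variance problem the abstract mentions.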
Memory-Free Continual Learning with Null Space Adaptation for Zero-Shot Vision-Language Models
Authors: Yujin Jo, Taesup Kim
First: 2025-10-24T05:53:32+00:00 · Latest: 2025-10-24T05:53:32+00:00
Abstract
Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated
remarkable zero-shot generalization, enabling deployment in a wide range of
real-world tasks without additional task-specific training. However, in real
deployment scenarios with evolving environments or emerging classes, these
models inevitably face distributional shifts and novel tasks. In such contexts,
static zero-shot capabilities are insufficient, and there is a growing need for
continual learning methods that allow models to adapt over time while avoiding
catastrophic forgetting. We introduce NuSA-CL (Null Space Adaptation for
Continual Learning), a lightweight memory-free continual learning framework
designed to address this challenge. NuSA-CL employs low-rank adaptation and
constrains task-specific weight updates to lie within an approximate null space
of the model's current parameters. This strategy minimizes interference with
previously acquired knowledge, effectively preserving the zero-shot
capabilities of the original model. Unlike methods relying on replay buffers or
costly distillation, NuSA-CL imposes minimal computational and memory overhead,
making it practical for deployment in resource-constrained, real-world
continual learning environments. Experiments show that our framework not only
effectively preserves zero-shot transfer capabilities but also achieves highly
competitive performance on continual learning benchmarks. These results
position NuSA-CL as a practical and scalable solution for continually evolving
zero-shot VLMs in real-world applications.
中文标题/摘要
标题:无记忆连续学习:基于零空间适应的零样本视觉-语言模型
预训练的视觉-语言模型(VLMs),如CLIP,展示了出色的零样本泛化能力,能够在广泛的实际任务中部署而无需额外的任务特定训练。然而,在实际部署场景中,随着环境的演变或新类别的出现,这些模型不可避免地会面临分布偏移和新任务。在这种情况下,静态的零样本能力是不够的,越来越需要能让模型随时间适应、同时避免灾难性遗忘的连续学习方法。我们提出了NuSA-CL(用于连续学习的零空间适应),这是一种轻量级的无记忆连续学习框架,旨在解决这一挑战。NuSA-CL 使用低秩适应,并将任务特定的权重更新限制在模型当前参数的近似零空间内。这种策略最小化了对先前获得知识的干扰,有效地保留了原始模型的零样本能力。与依赖重播缓冲区或昂贵的蒸馏的方法不同,NuSA-CL 对计算和内存开销的影响最小,使其在资源受限的实际连续学习环境中具有可行性。实验表明,我们的框架不仅有效地保留了零样本迁移能力,还在连续学习基准测试中取得了非常有竞争力的性能。这些结果将NuSA-CL 定位为在实际应用中连续演化的零样本VLMs 的实用且可扩展的解决方案。
Summary / 总结
The paper introduces NuSA-CL (Null Space Adaptation for Continual Learning), a lightweight memory-free framework for continual learning in vision-language models. It employs low-rank adaptation and constrains task-specific weight updates to an approximate null space, minimizing interference with previously acquired knowledge. Experiments show that NuSA-CL effectively preserves zero-shot transfer capabilities and achieves competitive performance on continual learning benchmarks, making it a practical solution for real-world applications.
论文提出了NuSA-CL,这是一种无内存的持续学习框架,通过低秩适应并约束任务特定权重更新到模型当前参数的近似零空间,使预训练的视觉-语言模型能够在不忘记之前学习的知识的情况下适应新任务。实验表明,NuSA-CL不仅有效地保持了零样本迁移能力,还在持续学习基准上取得了竞争力的表现,使其成为适用于实际应用的实用解决方案。
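Illustrative sketch (not the authors' code): constraining an update to a null space amounts to removing its components along directions that must stay untouched. The example assumes those "protected" directions are given as plain vectors; how NuSA-CL derives them from the current weights, and how this interacts with the low-rank adapters, is not specified in the digest. The names `orthonormalize` and `project_to_null_space` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # drop directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(update, protected):
    """Remove from `update` every component lying in span(protected)."""
    out = list(update)
    for q in orthonormalize(protected):
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out
```

For instance, with protected directions `[1, 0, 0]` and `[1, 1, 0]`, the update `[3, 4, 5]` is reduced to `[0, 0, 5]`: only the part orthogonal to the protected subspace survives, so behaviour along those directions is left unchanged.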
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
Authors: Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang
Venue: NeurIPS 2025
First: 2025-06-18T02:22:14+00:00 · Latest: 2025-10-24T05:39:03+00:00
Comments: NeurIPS 2025
Abstract
Modern multimodal large language models (MLLMs) can reason over hour-long
video, yet their key-value (KV) cache grows linearly with time, quickly
exceeding the fixed memory of phones, AR glasses, and edge robots. Prior
compression schemes either assume the whole video and user query are available
offline or must first build the full cache, so memory still scales with stream
length. InfiniPot-V is the first training-free, query-agnostic framework that
enforces a hard, length-independent memory cap for streaming video
understanding. During video encoding it monitors the cache and, once a user-set
threshold is reached, runs a lightweight compression pass that (i) removes
temporally redundant tokens via the Temporal-axis Redundancy (TaR) metric and (ii)
keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four
open-source MLLMs and four long-video and streaming-video benchmarks,
InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation,
and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By
dissolving the KV cache bottleneck without retraining or query knowledge,
InfiniPot-V closes the gap for on-device streaming video assistants.
中文标题/摘要
标题:InfiniPot-V:流式视频理解中的内存受限KV缓存压缩
现代多模态大型语言模型(MLLMs)可以推理长达一小时的视频,但其键值(KV)缓存会线性增长,很快超过手机、AR眼镜和边缘机器人的固定内存。之前的压缩方案要么假设整个视频和用户查询可以在离线状态下获得,要么必须先构建完整的缓存,因此内存仍然与流媒体长度成比例。InfiniPot-V 是第一个无需训练、不依赖查询的框架,它为流式视频理解强制执行一个固定长度的内存上限。在视频编码过程中,它监控缓存,一旦达到用户设置的阈值,就会运行一个轻量级的压缩过程,(i)通过时间轴冗余(TaR)指标移除时间上冗余的标记,(ii)通过值范数(VaN)排名保留语义上重要的标记。在四个开源 MLLMs 和四个长视频及流式视频基准测试中,InfiniPot-V 将峰值 GPU 内存减少了高达 94%,保持实时生成,并且在多轮对话中甚至超过了完整缓存的准确性。通过消除 KV 缓存瓶颈而不需重新训练或查询知识,InfiniPot-V 为设备上的流式视频助手填补了空白。
Summary / 总结
InfiniPot-V is a training-free framework that compresses the key-value cache during video encoding to enforce a fixed memory cap, reducing peak GPU memory usage by up to 94% while maintaining real-time generation and accuracy in multi-turn dialogues across various video benchmarks. This method addresses the memory scalability issue for streaming video understanding on devices with limited memory.
InfiniPot-V 是一个无需训练的框架,在视频编码过程中压缩 key-value 缓存以强制执行固定的内存上限,将 GPU 内存峰值使用量最多减少 94%,同时保持实时生成和多轮对话中的准确性,适用于多种视频基准测试。该方法解决了在内存有限的设备上进行流式视频理解时的内存可扩展性问题。
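Illustrative sketch (not the authors' code): the hard memory cap can be pictured as a streaming loop that compresses the cache whenever it reaches a threshold. The example keeps the entries whose value vectors have the largest L2 norm, in the spirit of the Value-Norm (VaN) ranking named in the abstract; the TaR redundancy step is omitted because the digest does not define it. The cache layout (a list of `(key, value)` pairs) and the names `compress_cache` and `stream` are assumptions.

```python
import math


def value_norm(value):
    return math.sqrt(sum(x * x for x in value))


def compress_cache(cache, budget):
    """Keep the `budget` entries with the largest value norm, in original order."""
    if len(cache) <= budget:
        return list(cache)
    ranked = sorted(range(len(cache)),
                    key=lambda i: value_norm(cache[i][1]), reverse=True)
    keep = sorted(ranked[:budget])  # restore temporal order
    return [cache[i] for i in keep]


def stream(tokens, threshold, budget):
    """Append (key, value) pairs one by one; compress when the threshold is hit."""
    cache, peak = [], 0
    for kv in tokens:
        cache.append(kv)
        peak = max(peak, len(cache))
        if len(cache) >= threshold:
            cache = compress_cache(cache, budget)
    return cache, peak
```

Because compression is triggered by cache size rather than by stream length, `peak` never exceeds `threshold` however many tokens arrive, which is the length-independent property the abstract emphasizes.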
AgentSense: LLMs Empower Generalizable and Explainable Web-Based Participatory Urban Sensing
Authors: Xusen Guo, Mingxing Peng, Xixuan Hao, Xingchen Zou, Qiongyan Wang, Sijie Ruan, Yuxuan Liang
First: 2025-10-22T15:06:26+00:00 · Latest: 2025-10-24T05:16:51+00:00
Comments: 13 pages, 10 pages
Abstract
Web-based participatory urban sensing has emerged as a vital approach for
modern urban management by leveraging mobile individuals as distributed
sensors. However, existing urban sensing systems struggle with limited
generalization across diverse urban scenarios and poor interpretability in
decision-making. In this work, we introduce AgentSense, a hybrid, training-free
framework that integrates large language models (LLMs) into participatory urban
sensing through a multi-agent evolution system. AgentSense initially employs
a classical planner to generate baseline solutions and then iteratively refines
them to adapt sensing task assignments to dynamic urban conditions and
heterogeneous worker preferences, while producing natural language explanations
that enhance transparency and trust. Extensive experiments across two
large-scale mobility datasets and seven types of dynamic disturbances
demonstrate that AgentSense offers distinct advantages in adaptivity and
explainability over traditional methods. Furthermore, compared to single-agent
LLM baselines, our approach achieves better performance and robustness
while delivering more reasonable and transparent explanations. These results
position AgentSense as a significant advancement towards deploying adaptive and
explainable urban sensing systems on the web.
中文标题/摘要
标题:AgentSense:大规模语言模型赋能的通用化和可解释的基于网络的参与式城市感知
基于网络的参与式城市感知已成为现代城市管理的重要方法,通过利用移动个体作为分布式传感器。然而,现有的城市感知系统在不同城市场景下的泛化能力有限,并且在决策中的可解释性较差。在本工作中,我们引入了AgentSense,这是一种无需训练的混合框架,通过多智能体演化系统将大规模语言模型(LLMs)集成到参与式城市感知中。AgentSense 初始使用经典规划器生成基线解决方案,然后迭代优化它们以适应动态城市条件和异质工作者偏好,并生成自然语言解释以增强透明度和信任。在两个大规模移动数据集和七种动态干扰类型上的广泛实验表明,AgentSense 在适应性和可解释性方面优于传统方法。此外,与单智能体LLM基线相比,我们的方法在性能和鲁棒性方面表现更优,同时提供更合理和透明的解释。这些结果使AgentSense 成为部署适应性和可解释的城市感知系统的重大进展。
Summary / 总结
AgentSense is a hybrid framework that integrates large language models into participatory urban sensing through a multi-agent evolution system. It uses classical planners to generate initial solutions and iteratively refines them to adapt to dynamic urban conditions and worker preferences, providing natural language explanations for increased transparency and trust. Experiments show that AgentSense outperforms traditional methods in adaptivity and explainability, and it is more robust and transparent than single-agent LLM baselines.
AgentSense 是一个将大型语言模型(LLMs)集成到参与式城市感知中的混合框架,旨在增强适应性和可解释性。它先使用经典规划器生成初始方案,再通过多智能体演化系统根据动态城市条件和工作者偏好迭代优化感知任务分配,并提供自然语言解释。实验表明,AgentSense 在适应性和可解释性方面优于传统方法,并且与单智能体 LLM 基线相比更加鲁棒,解释也更透明、合理。
Generalizable Hierarchical Skill Learning via Object-Centric Representation
Authors: Haibo Zhao, Yu Qi, Boce Hu, Yizhe Zhu, Ziyan Chen, Heng Tian, Xupeng Zhu, Owen Howell, Haojie Huang, Robin Walters, Dian Wang, Robert Platt
First: 2025-10-24T03:21:42+00:00 · Latest: 2025-10-24T03:21:42+00:00
Abstract
We present Generalizable Hierarchical Skill Learning (GSL), a novel framework
for hierarchical policy learning that significantly improves policy
generalization and sample efficiency in robot manipulation. One core idea of
GSL is to use object-centric skills as an interface that bridges the high-level
vision-language model and the low-level visual-motor policy. Specifically, GSL
decomposes demonstrations into transferable and object-canonicalized skill
primitives using foundation models, ensuring efficient low-level skill learning
in the object frame. At test time, the skill-object pairs predicted by the
high-level agent are fed to the low-level module, where the inferred canonical
actions are mapped back to the world frame for execution. This structured yet
flexible design leads to substantial improvements in sample efficiency and
generalization of our method across unseen spatial arrangements, object
appearances, and task compositions. In simulation, GSL trained with only 3
demonstrations per task outperforms baselines trained with 30 times more data
by 15.5 percent on unseen tasks. In real-world experiments, GSL also surpasses
the baseline trained with 10 times more data.
中文标题/摘要
标题:基于对象中心表示的可泛化分层技能学习
我们提出了可泛化分层技能学习(GSL),这是一种新颖的分层策略学习框架,显著提高了机器人操作中的策略泛化能力和样本效率。GSL 的一个核心思想是使用对象中心的技能作为连接高层视觉-语言模型和低层视觉-运动策略的接口。具体来说,GSL 使用基础模型将演示分解为可转移和对象标准化的技能基元,确保在对象框架中高效学习低层技能。在测试时,高层代理预测的技能-对象对被输入到低层模块中,其中推断出的标准化动作被映射回世界框架以执行。这种结构化但灵活的设计在空间布局、对象外观和任务组合方面显著提高了我们方法的样本效率和泛化能力。在仿真中,GSL 使用每个任务仅 3 个演示训练,其在未见过的任务上的表现优于使用 30 倍数据训练的基线,高出 15.5 个百分点。在真实世界实验中,GSL 也超过了使用 10 倍数据训练的基线。
Summary / 总结
The paper introduces Generalizable Hierarchical Skill Learning (GSL), a framework for improving policy generalization and sample efficiency in robot manipulation. GSL uses object-centric skills as an interface between high-level vision-language models and low-level visual-motor policies, decomposing demonstrations into transferable skill primitives. Experiments show that GSL, trained with only 3 demonstrations per task, outperforms baselines trained with 30 times more data by 15.5 percent on unseen tasks in simulation and surpasses the baseline trained with 10 times more data in real-world experiments.
论文提出了通用层次技能学习(GSL)框架,该框架在机器人操作中的层次策略学习中提高了策略的泛化能力和样本效率。GSL 使用对象中心的技能作为高层视觉语言模型和低层视觉运动策略之间的接口,将演示分解为可转移和对象标准化的技能基元。实验结果显示,GSL 使用每任务仅 3 个演示训练,其性能比使用 30 倍更多数据训练的基线高出 15.5 个百分点,在未见过的任务上表现更好;在真实世界实验中,GSL 也超过了使用 10 倍更多数据训练的基线。
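The step in which canonical actions are "mapped back to the world frame" amounts to a rigid-body transform by the object's pose. The sketch below illustrates this in 2D under stated assumptions: the pose is a hypothetical `(x, y, theta)` tuple and an action is a single 2D point. The paper's actual action representation is not specified in the abstract.

```python
import math

def canonical_to_world(obj_pose, action_xy):
    """Map a point expressed in the object frame to the world frame.
    obj_pose = (x, y, theta): object position and heading in the world."""
    ox, oy, theta = obj_pose
    ax, ay = action_xy
    c, s = math.cos(theta), math.sin(theta)
    return (ox + c * ax - s * ay, oy + s * ax + c * ay)

def world_to_canonical(obj_pose, point_xy):
    """Inverse transform: express a world-frame point in the object frame."""
    ox, oy, theta = obj_pose
    dx, dy = point_xy[0] - ox, point_xy[1] - oy
    c, s = math.cos(theta), math.sin(theta)
    return (c * dx + s * dy, -s * dx + c * dy)
```

Because the skill is learned in the object frame, the same canonical action can be reused when the object appears at a new position or orientation; only `obj_pose` changes.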
SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation
Authors: Alec Helbling, Shruti Palaskar, Kundan Krishna, Polo Chau, Leon Gatys, Joseph Yitan Cheng
First: 2025-10-24T03:19:48+00:00 · Latest: 2025-10-24T03:19:48+00:00
Abstract
What exactly makes a particular image unsafe? Systematically differentiating
between benign and problematic images is a challenging problem, as subtle
changes to an image, such as an insulting gesture or symbol, can drastically
alter its safety implications. However, existing image safety datasets are
coarse and ambiguous, offering only broad safety labels without isolating the
specific features that drive these differences. We introduce SafetyPairs, a
scalable framework for generating counterfactual pairs of images that differ
only in the features relevant to the given safety policy, thus flipping their
safety label. By leveraging image editing models, we make targeted changes to
images that alter their safety labels while leaving safety-irrelevant details
unchanged. Using SafetyPairs, we construct a new safety benchmark, which serves
as a powerful source of evaluation data that highlights weaknesses in
vision-language models' abilities to distinguish between subtly different
images. Beyond evaluation, we find our pipeline serves as an effective data
augmentation strategy that improves the sample efficiency of training
lightweight guard models. We release a benchmark containing over 3,020
SafetyPair images spanning a diverse taxonomy of 9 safety categories, providing
the first systematic resource for studying fine-grained image safety
distinctions.
中文标题/摘要
标题:SafetyPairs:通过反事实图像生成隔离安全关键图像特征
什么样的具体图像被认为是不安全的?系统地区分良性图像和问题图像是一个具有挑战性的问题,因为图像中的细微变化,如侮辱性手势或符号,可能会极大地改变其安全含义。然而,现有的图像安全性数据集是粗糙且模糊的,仅提供宽泛的安全标签,而没有隔离驱动这些差异的具体特征。我们引入了SafetyPairs,这是一种可扩展的框架,用于生成仅在与给定安全政策相关的特征上不同的图像反事实对,从而翻转它们的安全标签。通过利用图像编辑模型,我们对图像进行有针对性的修改,改变其安全标签,同时保留与安全性无关的细节不变。使用SafetyPairs,我们构建了一个新的安全性基准,作为评估数据的强大来源,突显了视觉语言模型在区分细微不同图像方面的能力缺陷。除了评估之外,我们发现我们的管道作为有效的数据增强策略,提高了训练轻量级防护模型的样本效率。我们发布了包含超过3,020个SafetyPair图像的基准,涵盖了9个安全类别的多样化分类,提供了研究细粒度图像安全性差异的第一个系统资源。
Summary / 总结
The research aims to identify the specific features that make an image unsafe by generating counterfactual pairs that differ only in safety-critical features. The method uses image editing models to create pairs of images whose safety labels flip while other details stay unchanged. The resulting benchmark, with over 3,020 SafetyPair images across 9 safety categories, exposes the limitations of vision-language models in distinguishing subtly different images. The same pipeline also serves as an effective data augmentation strategy for training lightweight guard models.
SafetyPairs 是一个框架,通过生成反事实图像对来隔离安全关键特征,以实现对图像安全性的详细分析。通过使用图像编辑模型,它创建了仅在与安全政策相关的特征上有所不同的图像对,从而改变它们的安全标签。这种方法有助于构建一个新的安全基准,揭示视觉-语言模型在区分细微差异方面的局限性。此外,SafetyPairs 还作为有效的数据增强策略,提高了轻量级防护模型的训练效率。基准数据集包含超过 3,020 个 SafetyPair 图像,覆盖 9 个安全类别。
MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning
Authors: Siyong Chen, Jinbo Wen, Jiawen Kang, Tenghui Huang, Xumin Huang, Yuanjia Su, Hudan Pan, Zishao Zhong, Dusit Niyato, Shengli Xie, Dong In Kim
First: 2025-10-24T02:11:05+00:00 · Latest: 2025-10-24T02:11:05+00:00
Abstract
Recently, large models have shown significant potential for smart healthcare.
However, the deployment of Large Vision-Language Models (LVLMs) for clinical
services is currently hindered by three critical challenges: a tendency to
hallucinate answers not grounded in visual evidence, the inefficiency of
fixed-depth reasoning, and the difficulty of multi-institutional collaboration.
To address these challenges, in this paper, we develop MedAlign, a novel
framework to ensure visually accurate LVLM responses for Medical Visual
Question Answering (Med-VQA). Specifically, we first propose a multimodal
Direct Preference Optimization (mDPO) objective to explicitly align preference
learning with visual context. We then design a Retrieval-Aware
Mixture-of-Experts (RA-MoE) architecture that utilizes image and text
similarity to route queries to a specialized and context-augmented LVLM (i.e.,
an expert), thereby mitigating hallucinations in LVLMs. To achieve adaptive
reasoning and facilitate multi-institutional collaboration, we propose a
federated governance mechanism, where the selected expert, fine-tuned on
clinical datasets based on mDPO, locally performs iterative Chain-of-Thought
(CoT) reasoning via the local meta-cognitive uncertainty estimator. Extensive
experiments on three representative Med-VQA datasets demonstrate that MedAlign
achieves state-of-the-art performance, outperforming strong retrieval-augmented
baselines by up to $11.85\%$ in F1-score, and simultaneously reducing the
average reasoning length by $51.60\%$ compared with fixed-depth CoT approaches.
中文标题/摘要
标题:MedAlign:多模态偏好优化和联邦元认知推理的协同框架
近年来,大型模型在智能医疗方面展现了显著潜力。
然而,将大型视觉-语言模型(LVLM)部署到临床服务中目前受到三个关键挑战的阻碍:倾向于生成缺乏视觉证据支持的答案,固定深度推理的低效性,以及多机构协作的难度。
为应对这些挑战,本文开发了MedAlign,一种确保视觉准确的LVLM响应的新框架,用于医学视觉问答(Med-VQA)。具体而言,我们首先提出了一种多模态直接偏好优化(mDPO)目标,以明确将偏好学习与视觉上下文对齐。然后,我们设计了一种检索感知混合专家(RA-MoE)架构,利用图像和文本相似性将查询路由到一个专门且上下文增强的LVLM(即,专家),从而减轻LVLM中的幻觉现象。为了实现适应性推理并促进多机构协作,我们提出了一种联邦治理机制,其中所选专家基于mDPO在临床数据集上进行微调,并通过本地元认知不确定性估计器进行迭代的链式思考(CoT)推理。在三个代表性Med-VQA数据集上的广泛实验表明,MedAlign实现了最先进的性能,与强大的检索增强基线相比,在F1分数上提高了最多11.85%,同时将平均推理长度减少了51.60%,与固定深度CoT方法相比。
Summary / 总结
MedAlign is a framework designed to improve the accuracy and efficiency of Large Vision-Language Models (LVLMs) in medical applications. It addresses challenges such as hallucination, fixed-depth reasoning, and multi-institutional collaboration by introducing a multimodal Direct Preference Optimization (mDPO) and a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture. MedAlign also employs a federated governance mechanism for adaptive reasoning and multi-institutional collaboration. Experimental results show that MedAlign outperforms existing retrieval-augmented baselines by up to 11.85% in F1-score and reduces the average reasoning length by 51.60% compared to fixed-depth Chain-of-Thought approaches.
MedAlign 是一个旨在提高大型视觉语言模型(LVLM)在医疗应用中准确性和效率的框架。它通过提出多模态直接偏好优化(mDPO)和检索感知混合专家(RA-MoE)架构来解决幻觉、固定深度推理和多机构协作等挑战。MedAlign 还引入了联邦治理机制以实现适应性推理和多机构协作。实验表明,MedAlign 在 F1 分数上比强基线高出最多 11.85%,并且与固定深度推理方法相比,推理长度减少了 51.60%。
Knowledge-Driven Vision-Language Model for Plexus Detection in Hirschsprung's Disease
Authors: Youssef Megahed, Atallah Madi, Dina El Demellawy, Adrian D. C. Chan
First: 2025-10-24T01:42:57+00:00 · Latest: 2025-10-24T01:42:57+00:00
Comments: Accepted into the ICAAI 2025 - The 9th International Conference on
Advances in Artificial Intelligence
Abstract
Hirschsprung's disease is defined as the congenital absence of ganglion cells
in some segment(s) of the colon. The muscle cannot make coordinated movements
to propel stool in that section, most commonly leading to obstruction. The
diagnosis and treatment for this disease require a clear identification of
different region(s) of the myenteric plexus, where ganglion cells should be
present, on the microscopic view of the tissue slide. While deep learning
approaches, such as Convolutional Neural Networks, have performed very well in
this task, they are often treated as black boxes, with minimal understanding
gained from them, and may not conform to how a physician makes decisions. In
this study, we propose a novel framework that integrates expert-derived textual
concepts into a Contrastive Language-Image Pre-training-based vision-language
model to guide plexus classification. Using prompts derived from expert sources
(e.g., medical textbooks and papers) generated by large language models and
reviewed by our team before being encoded with QuiltNet, our approach aligns
clinically relevant semantic cues with visual features. Experimental results
show that the proposed model demonstrates superior discriminative capability
across different classification metrics, outperforming CNN-based models
including VGG-19, ResNet-18, and ResNet-50 and achieving an accuracy of 83.9%, a
precision of 86.6%, and a specificity of 87.6%. These findings highlight the
potential of multi-modal learning in histopathology and underscore the value of
incorporating expert knowledge for more clinically relevant model outputs.
中文标题/摘要
标题:知识驱动的视觉-语言模型在先天性巨结肠(Hirschsprung病)肠肌间神经丛检测中的应用
先天性巨结肠(Hirschsprung病)是指结肠某一段或某些段中先天缺乏神经节细胞。该段肌肉无法协调运动以推动粪便,最常导致梗阻。该病的诊断和治疗需要在组织切片的显微镜视野下清晰识别肠肌间神经丛的不同区域,神经节细胞本应存在于这些区域。虽然深度学习方法(如卷积神经网络)在此任务中表现非常出色,但它们通常被视为黑盒模型,从中获得的理解有限,可能不符合医生的决策过程。在本研究中,我们提出了一种新的框架,将专家提取的文本概念整合到基于对比语言-图像预训练的视觉-语言模型中,以指导神经丛分类。提示源自专家资料(例如医学教科书和论文),由大型语言模型生成并经我们团队审阅后使用 QuiltNet 编码,我们的方法由此将临床相关的语义线索与视觉特征对齐。实验结果表明,所提出的模型在不同分类指标上的区分能力优于基于卷积神经网络的模型,包括VGG-19、ResNet-18和ResNet-50;准确率为83.9%,精确率为86.6%,特异性为87.6%。这些发现突显了多模态学习在组织病理学中的潜力,并强调了结合专家知识以获得更临床相关模型输出的价值。
Summary / 总结
This study addresses the challenge of detecting the myenteric plexus in Hirschsprung's disease by proposing a knowledge-driven vision-language model. The model integrates expert-derived textual concepts into a Contrastive Language-Image Pre-training framework to guide plexus classification. Experimental results show that the proposed model outperformed CNN-based models, achieving an accuracy of 83.9%, precision of 86.6%, and specificity of 87.6%. This demonstrates the potential of multi-modal learning in histopathology and the importance of incorporating expert knowledge for more clinically relevant outputs.
该研究提出了一种知识驱动的视觉-语言模型,以解决Hirschsprung病中肠肌间神经丛检测的挑战。该模型将专家提取的文本概念整合到对比语言-图像预训练框架中,以指导神经丛分类。实验结果显示,所提出的模型在准确率83.9%、精确率86.6%和特异性87.6%方面优于基于CNN的模型。这表明多模态学习在组织病理学中的潜力,并强调了整合专家知识对于更临床相关输出的重要性。
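For readers less familiar with the three reported metrics, the sketch below shows how accuracy, precision, and specificity are computed from a binary confusion matrix. The counts in the usage note are made up for illustration and are not the paper's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, and specificity from confusion-matrix counts.
    tp/fp/tn/fn: true positives, false positives, true negatives, false negatives."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,     # fraction of all predictions that are correct
        "precision": tp / (tp + fp),       # fraction of predicted positives that are correct
        "specificity": tn / (tn + fp),     # fraction of actual negatives correctly rejected
    }
```

For example, `classification_metrics(tp=40, fp=10, tn=45, fn=5)` gives an accuracy of 0.85 and a precision of 0.8. Note that precision and specificity both penalize false positives, so a model can score well on both while still missing positives, which only recall would reveal.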
Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning
Authors: Pengxiang Li, Zhi Gao, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
First: 2025-04-30T12:01:27+00:00 · Latest: 2025-10-24T01:09:11+00:00
Comments: 24 pages
Abstract
Multimodal agents, which integrate a controller (e.g., a vision language
model) with external tools, have demonstrated remarkable capabilities in
tackling complex multimodal tasks. Existing approaches for training these
agents, both supervised fine-tuning and reinforcement learning, depend on
extensive human-annotated task-answer pairs and tool trajectories. However, for
complex multimodal tasks, such annotations are prohibitively expensive or
impractical to obtain. In this paper, we propose an iterative tool usage
exploration method for multimodal agents without any pre-collected data, namely
SPORT, via step-wise preference optimization to refine the trajectories of tool
usage. Our method enables multimodal agents to autonomously discover effective
tool usage strategies through self-exploration and optimization, eliminating
the bottleneck of human annotation. SPORT has four iterative components: task
synthesis, step sampling, step verification, and preference tuning. We first
synthesize multimodal tasks using language models. Then, we introduce a novel
trajectory exploration scheme, where step sampling and step verification are
executed alternately to solve synthesized tasks. In step sampling, the agent
tries different tools and obtains corresponding results. In step verification,
we employ a verifier to provide AI feedback to construct step-wise preference
data. The data is subsequently used to update the controller for tool usage
through preference tuning, producing a SPORT agent. By interacting with real
environments, the SPORT agent gradually evolves into a more refined and capable
system. Evaluation on the GTA and GAIA benchmarks shows that the SPORT agent
achieves 6.41% and 3.64% improvements, respectively, underscoring the generalization and
effectiveness introduced by our method. The project page is
https://SPORT-Agents.github.io.
中文标题/摘要
标题:通过逐步偏好调整探索多模态代理的工具使用
多模态代理将控制器(例如,视觉语言模型)与外部工具结合,展示了在处理复杂多模态任务方面的出色能力。现有训练这些代理的方法,包括监督微调和强化学习,依赖于大量的人标注的任务-答案对和工具轨迹。然而,对于复杂的多模态任务,获取这些标注是极其昂贵或不切实际的。本文提出了一种无需预先收集数据的多模态代理逐步工具使用探索方法,即SPORT,通过逐步偏好优化来细化工具使用的轨迹。该方法使多模态代理能够通过自我探索和优化自主发现有效的工具使用策略,消除了人工标注的瓶颈。SPORT 包含四个迭代组件:任务合成、步骤采样、步骤验证和偏好调整。我们首先使用语言模型合成多模态任务。然后,我们引入了一种新的轨迹探索方案,其中步骤采样和步骤验证交替执行以解决合成的任务。在步骤采样中,代理尝试不同的工具并获得相应的结果。在步骤验证中,我们使用验证器提供AI反馈以构建逐步偏好数据。数据随后用于通过偏好调整更新控制器以优化工具使用,生成一个SPORT代理。通过与真实环境交互,SPORT代理逐渐进化成一个更精细和强大的系统。在GTA和GAIA基准测试中的评估表明,SPORT代理分别实现了6.41%和3.64%的改进,突显了我们方法引入的泛化能力和有效性。项目页面为https://SPORT-Agents.github.io。
Summary / 总结
This paper introduces SPORT, an iterative method for multimodal agents to explore tool usage without pre-collected data. SPORT uses step-wise preference tuning to refine tool usage trajectories through task synthesis, step sampling, step verification, and preference tuning. Evaluation on GTA and GAIA benchmarks shows a 6.41% and 3.64% improvement respectively, demonstrating the method's effectiveness in generalization and performance enhancement.
本文提出了一种名为SPORT的方法,用于无需预先收集数据的多模态代理工具使用探索。SPORT通过步骤偏好调优逐步优化工具使用轨迹。该方法包括任务合成、步骤采样、步骤验证和偏好调优四个步骤。在GTA和GAIA基准上的评估表明,分别取得了6.41%和3.64%的改进,证明了其有效性和泛化能力。
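One way to picture the step verification stage is as turning several sampled candidate steps into a (chosen, rejected) pair. The sketch below is an assumption-laden illustration: the abstract only says a verifier provides AI feedback to build step-wise preference data, so the best-versus-worst pairing rule, the function `build_step_preference`, and the scalar `verifier` score are all hypothetical.

```python
def build_step_preference(candidates, verifier):
    """Score each sampled candidate step and pair the best with the worst.
    `verifier` is any callable returning a numeric score for a candidate.
    Returns None when no informative pair can be formed."""
    if len(candidates) < 2:
        return None
    scored = sorted(candidates, key=verifier)  # ascending by score
    rejected, chosen = scored[0], scored[-1]
    if verifier(chosen) == verifier(rejected):
        return None  # all candidates tie, so there is no preference signal
    return {"chosen": chosen, "rejected": rejected}
```

Pairs of this shape are the usual input format for preference-tuning objectives, which is why a per-step pairing fits the "preference tuning" component named in the abstract.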
ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models
Authors: Pranav Saxena, Jimmy Chiun
First: 2025-10-24T00:52:33+00:00 · Latest: 2025-10-24T00:52:33+00:00
Abstract
Understanding and reasoning about complex 3D environments requires structured
scene representations that capture not only objects but also their semantic and
spatial relationships. While recent works on 3D scene graph generation have
leveraged pretrained VLMs without task-specific fine-tuning, they are largely
confined to single-view settings, fail to support incremental updates as new
observations arrive, and lack explicit geometric grounding in 3D space, all of
which are essential for embodied scenarios. In this paper, we propose ZING-3D,
a framework that leverages the vast knowledge of pretrained foundation models
to enable open-vocabulary recognition and generate a rich semantic
representation of the scene in a zero-shot manner while also enabling
incremental updates and geometric grounding in 3D space, making it suitable for
downstream robotics applications. Our approach leverages VLM reasoning to
generate a rich 2D scene graph, which is grounded in 3D using depth
information. Nodes represent open-vocabulary objects with features, 3D
locations, and semantic context, while edges capture spatial and semantic
relations with inter-object distances. Our experiments on scenes from the
Replica and HM3D datasets show that ZING-3D is effective at capturing spatial
and relational knowledge without the need for task-specific training.
中文标题/摘要
标题:ZING-3D:通过视觉语言模型实现零样本增量三维场景图
理解和推理复杂的三维环境需要结构化的场景表示,不仅捕捉物体,还要捕捉它们的语义和空间关系。虽然最近关于三维场景图生成的工作利用了预训练的VLMs,但没有进行特定任务的微调,它们主要局限于单视角设置,无法在新观察到达时支持增量更新,缺乏在三维空间中的几何定位,所有这些对于具身场景都是必不可少的。在本文中,我们提出了一种名为ZING-3D的框架,该框架利用预训练基础模型的大量知识,以零样本方式实现开放词汇识别并生成丰富的语义场景表示,同时支持增量更新和三维空间中的几何定位,使其适用于下游机器人应用。我们的方法利用VLM推理生成丰富的二维场景图,通过深度信息在三维中进行定位。节点表示具有特征、三维位置和语义上下文的开放词汇对象,边捕捉空间和语义关系,包括物体间的距离。我们在Replica和HM3D数据集上的实验表明,ZING-3D在不需要特定任务训练的情况下能够有效地捕捉空间和关系知识。
Summary / 总结
ZING-3D is designed to address the limitations of existing 3D scene graph generation methods by enabling zero-shot incremental updates and geometric grounding in 3D space using pretrained vision-language models. It generates a rich semantic representation of the scene through VLM reasoning and depth information, capturing spatial and relational knowledge without task-specific training. Experiments on the Replica and HM3D datasets demonstrate its effectiveness in representing complex 3D environments.
ZING-3D 是一个框架,利用预训练的视觉-语言模型生成零样本增量的 3D 场景图,实现开放词汇的识别和几何定位。它在无需特定任务微调的情况下捕捉 3D 环境中的语义和空间关系。在 Replica 和 HM3D 数据集上的实验表明,ZING-3D 能够有效地捕捉 3D 场景中的空间和关系知识。
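Grounding a 2D detection in 3D "using depth information" is commonly done by back-projecting a pixel through a pinhole camera model. The sketch below assumes that model and hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`); the abstract does not state which camera model or distance definition ZING-3D uses.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame,
    assuming a pinhole camera with focal lengths fx, fy and principal point cx, cy."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def inter_object_distance(p, q):
    """Euclidean distance between two 3D points, e.g. two object centroids."""
    return math.dist(p, q)
```

An edge between two nodes of the scene graph could then carry `inter_object_distance(p, q)` alongside the semantic relation predicted by the VLM.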