arXiv 论文速递

Open-ended Hierarchical Streaming Video Understanding with Vision Language Models

Authors: Hyolim Kang, Yunsu Park, Youngbeom Yoo, Yeeun Choi, Seon Joo Kim

First: 2025-09-15T17:11:06+00:00 · Latest: 2025-09-15T17:11:06+00:00

Comments: 17 pages

Abstract

We introduce Hierarchical Streaming Video Understanding, a task that combines online temporal action localization with free-form description generation. Given the scarcity of datasets with hierarchical and fine-grained temporal annotations, we demonstrate that LLMs can effectively group atomic actions into higher-level events, enriching existing datasets. We then propose OpenHOUSE (Open-ended Hierarchical Online Understanding System for Events), which extends streaming action perception beyond action classification. OpenHOUSE features a specialized streaming module that accurately detects boundaries between closely adjacent actions, nearly doubling the performance of direct extensions of existing methods. We envision the future of streaming action perception in the integration of powerful generative models, with OpenHOUSE representing a key step in that direction.

中文标题/摘要

标题：使用视觉语言模型的开放性分层流式视频理解

我们介绍了分层流式视频理解这一任务，它结合了在线时间动作定位与自由形式描述生成。鉴于缺乏具有分层和细粒度时间注释的数据集，我们展示了LLMs能够有效将原子动作分组为更高层次的事件，从而丰富现有数据集。随后，我们提出了OpenHOUSE（开放性分层在线事件理解系统），它将流式动作感知扩展到动作分类之外。OpenHOUSE 特设了一个专门的流式模块，能够准确检测紧密相邻动作之间的边界，几乎将现有方法直接扩展的性能翻倍。我们设想流式动作感知的未来在于与强大生成模型的整合，而OpenHOUSE正是这一方向的关键一步。

Summary / 总结

The paper introduces Hierarchical Streaming Video Understanding, which combines online temporal action localization with free-form description generation. To address the lack of datasets with hierarchical and fine-grained temporal annotations, the authors demonstrate that large language models can effectively group atomic actions into higher-level events. They propose OpenHOUSE, which extends streaming action perception beyond simple classification by accurately detecting boundaries between closely adjacent actions, nearly doubling the performance of direct extensions of existing methods. The authors see this as a key step toward integrating powerful generative models into streaming action perception systems.

论文介绍了层级流式视频理解任务，该任务结合了在线时间动作定位与自由形式描述生成。为了解决缺乏具有层次和细粒度时间注释的数据集的问题，作者展示了大型语言模型可以有效地将原子动作分组为更高层次的事件。他们提出了OpenHOUSE，该系统超越了简单的分类，通过准确检测紧密相邻动作之间的边界，性能接近现有方法直接扩展版本的两倍。作者认为这是将强大的生成模型集成到流式动作感知系统的关键一步。

Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

Authors: Pu Jian, Junhong Wu, Wei Sun, Chen Wang, Shuo Ren, Jiajun Zhang

First: 2025-09-15T16:57:25+00:00 · Latest: 2025-09-15T16:57:25+00:00

Comments: EMNLP2025 Main

Abstract

Recent advances in text-only "slow-thinking" reasoning have prompted efforts to transfer this capability to vision-language models (VLMs), for training visual reasoning models (VRMs). However, such transfer faces critical challenges: Effective "slow thinking" in VRMs requires visual reflection, the ability to check the reasoning process based on visual information. Through quantitative analysis, we observe that current VRMs exhibit limited visual reflection, as their attention to visual information diminishes rapidly with longer generated responses. To address this challenge, we propose a new VRM, Reflection-V, which enhances visual reflection based on reasoning data construction for cold-start and reward design for reinforcement learning (RL). Firstly, we construct vision-centered reasoning data by leveraging an agent that interacts between VLMs and reasoning LLMs, enabling cold-start learning of visual reflection patterns. Secondly, a visual attention based reward model is employed during RL to encourage reasoning based on visual information. Therefore, Reflection-V demonstrates significant improvements across multiple visual reasoning benchmarks. Furthermore, Reflection-V maintains a stronger and more consistent reliance on visual information during visual reasoning, indicating effective enhancement in visual reflection capabilities.

中文标题/摘要

标题：再审视，慢思考：增强视觉语言模型的视觉反思

近期纯文本“慢思考”推理的进展促使人们努力将这种能力迁移到视觉语言模型（VLMs）中，以训练视觉推理模型（VRMs）。然而，这种迁移面临关键挑战：VRMs中有效的“慢思考”需要视觉反思，即基于视觉信息检查推理过程的能力。通过定量分析，我们发现当前VRMs的视觉反思能力有限，因为它们对视觉信息的关注随生成响应的延长而迅速减弱。为应对这一挑战，我们提出了一种新的VRM：Reflection-V，它通过面向冷启动的推理数据构建和面向强化学习（RL）的奖励设计来增强视觉反思。首先，我们利用一个在VLMs和推理LLMs之间交互的代理来构建以视觉为中心的推理数据，从而实现视觉反思模式的冷启动学习。其次，在RL过程中使用基于视觉注意力的奖励模型来鼓励基于视觉信息的推理。因此，Reflection-V在多个视觉推理基准测试中表现出显著改进。此外，Reflection-V在视觉推理过程中对视觉信息的依赖更强且更一致，表明视觉反思能力得到了有效增强。

Summary / 总结

The paper addresses the limited visual reflection of current visual reasoning models (VRMs), which are built on vision-language models (VLMs). It proposes a new VRM called Reflection-V, which improves visual reflection through reasoning data construction for cold-start and a visual attention-based reward model during reinforcement learning. Reflection-V shows significant improvements in multiple visual reasoning benchmarks and maintains a stronger and more consistent reliance on visual information during reasoning, indicating effective enhancement in visual reflection capabilities.

论文旨在解决当前视觉推理模型（VRM）视觉反思能力有限的问题。提出了一种名为Reflection-V的新VRM，通过面向冷启动的推理数据构建和强化学习中基于视觉注意力的奖励模型来改进视觉反思。Reflection-V在多个视觉推理基准测试中表现出显著改进，并且在视觉推理过程中对视觉信息的依赖更强且更一致，表明视觉反思能力得到了有效增强。

Social Perception of Faces in a Vision-Language Model

Authors: Carina I. Hausladen, Manuel Knott, Colin F. Camerer, Pietro Perona

Venue: Published in the Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT 2025)

First: 2024-08-26T17:21:54+00:00 · Latest: 2025-09-15T16:27:04+00:00

Abstract

We explore social perception of human faces in CLIP, a widely used open-source vision-language model. To this end, we compare the similarity in CLIP embeddings between different textual prompts and a set of face images. Our textual prompts are constructed from well-validated social psychology terms denoting social perception. The face images are synthetic and are systematically and independently varied along six dimensions: the legally protected attributes of age, gender, and race, as well as facial expression, lighting, and pose. Independently and systematically manipulating face attributes allows us to study the effect of each on social perception and avoids confounds that can occur in wild-collected data due to uncontrolled systematic correlations between attributes. Thus, our findings are experimental rather than observational. Our main findings are three. First, while CLIP is trained on the widest variety of images and texts, it is able to make fine-grained human-like social judgments on face images. Second, age, gender, and race do systematically impact CLIP's social perception of faces, suggesting an undesirable bias in CLIP vis-a-vis legally protected attributes. Most strikingly, we find a strong pattern of bias concerning the faces of Black women, where CLIP produces extreme values of social perception across different ages and facial expressions. Third, facial expression impacts social perception more than age does, and lighting impacts it as much as age does. The last finding predicts that studies that do not control for unprotected visual attributes may reach the wrong conclusions on bias. Our novel method of investigation, which is founded on the social psychology literature and on the experiments involving the manipulation of individual attributes, yields sharper and more reliable observations than previous observational methods and may be applied to study biases in any vision-language model.

中文标题/摘要

标题：视觉语言模型中面部的社会感知

我们探讨了CLIP（一个广泛使用的开源视觉语言模型）中人类面部的社会感知。为此，我们比较了不同文本提示与一组面部图像的CLIP嵌入之间的相似度。我们的文本提示由经过充分验证的、表示社会感知的社会心理学术语构建而成。面部图像为合成图像，并沿六个维度系统且独立地变化：年龄、性别和种族这三个受法律保护的属性，以及面部表情、照明和姿态。独立且系统地操纵面部属性使我们能够研究每个属性对社会感知的影响，并避免自然采集的数据中因属性之间未受控制的系统性相关而产生的混淆。因此，我们的发现是实验性的而非观察性的。我们的主要发现有三点。首先，尽管CLIP是在种类极其广泛的图像和文本上训练的，它仍能对面部图像做出细粒度的、类似人类的社会判断。其次，年龄、性别和种族确实系统性地影响CLIP对面部的社会感知，表明CLIP在受法律保护的属性方面存在不应有的偏见。最引人注目的是，我们发现了一种针对黑人女性面部的强烈偏见模式：在不同年龄和面部表情下，CLIP都会给出极端的社会感知值。第三，面部表情对社会感知的影响大于年龄，而照明的影响与年龄相当。最后这一发现预示，不控制非受保护视觉属性的研究可能会在偏见问题上得出错误结论。我们的新研究方法以社会心理学文献和对单个属性进行操纵的实验为基础，比以往的观察性方法得到更清晰、更可靠的观察结果，并可用于研究任何视觉语言模型中的偏见。

Summary / 总结

This study investigates how a vision-language model, CLIP, perceives social aspects of human faces. By systematically varying face attributes such as age, gender, race, facial expression, lighting, and pose, the researchers found that CLIP can make fine-grained social judgments similar to humans. However, the model exhibits biases, most strikingly concerning the faces of Black women, for which CLIP produces extreme social-perception values across different ages and facial expressions. The study's method, based on social psychology, provides more reliable insights into model biases than previous observational methods and suggests that studies not controlling for unprotected visual attributes such as facial expression and lighting may reach wrong conclusions about bias.

该研究探讨了视觉-语言模型CLIP如何感知人类面部的社会属性。通过系统地改变年龄、性别、种族、面部表情、照明和姿势等面部属性，研究者发现CLIP能够做出与人类相似的细粒度社会判断。然而，模型表现出偏见，尤其是在黑人女性面部上：在不同年龄和面部表情下，CLIP都会给出极端的社会感知值。基于社会心理学的研究方法比以往的观察性方法提供了更可靠的结论，并表明未控制面部表情、照明等非受保护视觉属性的研究可能会误判偏见。
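
The comparison described in the abstract, scoring text prompts against a face image by embedding similarity, can be sketched as follows. This is a minimal illustration and not the authors' code: the 3-dimensional vectors are made-up stand-ins for real CLIP embeddings, the two prompts are hypothetical examples, and `cosine_similarity` and `rank_prompts` are helper names introduced here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_prompts(image_embedding, prompt_embeddings):
    """Return (prompt, similarity) pairs, most similar first."""
    scores = {
        prompt: cosine_similarity(image_embedding, emb)
        for prompt, emb in prompt_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy vectors standing in for embeddings (assumed values, not model output).
image = [1.0, 0.0, 0.0]
prompts = {
    "a photo of a trustworthy person": [0.6, 0.8, 0.0],
    "a photo of a dominant person": [0.0, 1.0, 0.0],
}
ranking = rank_prompts(image, prompts)
```

In an experimental design like the one described, the same scoring would be repeated over images that differ in exactly one attribute, so that a shift in similarity can be attributed to that attribute.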

RISE: Enhancing VLM Image Annotation with Self-Supervised Reasoning

Authors: Suhang Hu, Wei Hu, Yuhang Su, Fan Zhang

First: 2025-08-17T17:24:35+00:00 · Latest: 2025-09-15T16:19:34+00:00

Abstract

Vision-Language Models (VLMs) struggle with complex image annotation tasks, such as emotion classification and context-driven object detection, which demand sophisticated reasoning. Standard Supervised Fine-Tuning (SFT) focuses solely on annotation outcomes, ignoring underlying rationales, while Visual Reinforcement Fine-Tuning (Visual-RFT) produces inconsistent Chains of Thought (CoTs) due to the absence of high-quality, verified CoTs during pre-training. We introduce RISE (Reason-Inspire-Strengthen-Expertise), a two-stage framework to overcome these limitations. In the Reason stage (RISE-CoT), a reinforcement learning-driven "annotation-reasoning-annotation" closed-loop generates visually grounded, logically consistent CoTs by verifying their ability to reconstruct original annotations without direct leakage. The Inspire and Strengthen stage (RISE-R1) leverages a high-quality CoT subset, filtered by RISE-CoT rewards, for supervised fine-tuning, followed by reinforcement fine-tuning to produce interpretable reasoning and accurate annotations, achieving Expertise in complex visual tasks. Evaluated on complex and simple image annotation tasks, RISE-trained Qwen2-VL-2B outperforms SFT and Visual-RFT, achieving robust performance and enhanced explainability. RISE offers a self-supervised solution for advancing VLM reasoning without requiring manually annotated CoTs. Code and resources are available at: https://github.com/HSH55/RISE.

中文标题/摘要

标题：RISE：通过自我监督推理增强VLM图像标注

视觉-语言模型（VLMs）在处理复杂图像标注任务，如情绪分类和基于上下文的对象检测时遇到困难，这些任务需要复杂的推理。标准监督微调（SFT）仅关注标注结果，忽略了背后的推理过程，而视觉强化微调（Visual-RFT）由于缺乏高质量的验证推理链（CoTs）在预训练阶段，会产生不一致的推理链。我们提出了RISE（Reason-Inspire-Strengthen-Expertise）框架，以克服这些限制。在Reason阶段（RISE-CoT），通过强化学习驱动的“标注-推理-标注”闭环生成视觉接地、逻辑一致的推理链，并通过验证其重建原始标注的能力来确保推理链的正确性，避免直接泄露。在Inspire和Strengthen阶段（RISE-R1），利用RISE-CoT奖励筛选出的高质量推理链子集进行监督微调，然后进行强化微调以生成可解释的推理和准确的标注，从而在复杂视觉任务中达到专家水平。RISE在复杂和简单图像标注任务上的评估表明，RISE训练的Qwen2-VL-2B优于SFT和Visual-RFT，实现了稳健的性能和增强的可解释性。RISE提供了一种无需手动标注推理链的自我监督解决方案，以促进VLM推理能力的提升。代码和资源可在以下链接获取：https://github.com/HSH55/RISE。

Summary / 总结

RISE is a two-stage framework designed to enhance VLM image annotation by addressing the limitations of standard supervised fine-tuning and visual reinforcement fine-tuning. The Reason stage generates logically consistent Chains of Thought (CoTs) through a reinforcement learning-driven closed-loop process, while the Inspire and Strengthen stage uses these CoTs for supervised and reinforcement fine-tuning to produce accurate and interpretable annotations. RISE outperforms standard supervised fine-tuning and visual reinforcement fine-tuning on both complex and simple image annotation tasks, demonstrating robust performance and enhanced explainability.

RISE 是一个两阶段框架，旨在通过解决标准监督微调和视觉强化微调的局限性来提升 VLM 的图像注释能力。在 Reason 阶段，通过强化学习驱动的闭环生成逻辑上一致的推理链（CoTs），能够重建原始注释。Inspire 和 Strengthen 阶段使用 Reason 阶段筛选出的高质量 CoT 子集进行监督和强化微调，从而提高推理和注释的质量。RISE 训练的 Qwen2-VL-2B 在复杂和简单的图像注释任务上均优于 SFT 和 Visual-RFT，展示了稳健的性能和增强的可解释性。

Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing

Authors: Bingyu Li, Haocheng Dong, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

First: 2025-09-15T15:24:49+00:00 · Latest: 2025-09-15T15:24:49+00:00

Abstract

Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), an emerging task that adapts Open-Vocabulary Segmentation (OVS) to the remote sensing (RS) domain, remains underexplored due to the absence of a unified evaluation benchmark and the domain gap between natural and RS images. To bridge these gaps, we first establish a standardized OVRSIS benchmark (OVRSISBench) based on widely-used RS segmentation datasets, enabling consistent evaluation across methods. Using this benchmark, we comprehensively evaluate several representative OVS/OVRSIS models and reveal their limitations when directly applied to remote sensing scenarios. Building on these insights, we propose RSKT-Seg, a novel open-vocabulary segmentation framework tailored for remote sensing. RSKT-Seg integrates three key components: (1) a Multi-Directional Cost Map Aggregation (RS-CMA) module that captures rotation-invariant visual cues by computing vision-language cosine similarities across multiple directions; (2) an Efficient Cost Map Fusion (RS-Fusion) transformer, which jointly models spatial and semantic dependencies with a lightweight dimensionality reduction strategy; and (3) a Remote Sensing Knowledge Transfer (RS-Transfer) module that injects pre-trained knowledge and facilitates domain adaptation via enhanced upsampling. Extensive experiments on the benchmark show that RSKT-Seg consistently outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2x faster inference through efficient aggregation. Our code is available at https://github.com/LiBingyu01/RSKT-Seg.

Summary / 总结

This research addresses the underexplored task of Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS) by establishing a standardized benchmark (OVRSISBench) and evaluating existing models on it. The study proposes RSKT-Seg, a novel framework that includes a Multi-Directional Cost Map Aggregation module, an Efficient Cost Map Fusion transformer, and a Remote Sensing Knowledge Transfer module. Experiments show that RSKT-Seg outperforms strong baselines by 3.8 mIoU and 5.9 mACC, while achieving 2x faster inference.

研究旨在通过建立标准化基准（OVRSISBench）并评估现有模型，填补开放词汇遥感图像分割（OVRSIS）研究的空白。研究提出了RSKT-Seg框架，包括多方向成本图聚合模块、高效成本图融合Transformer和遥感知识迁移模块。实验表明，RSKT-Seg在该基准上优于强基线，mIoU和mACC分别提高3.8和5.9，同时推理速度提升至2倍。

Lost in Embeddings: Information Loss in Vision-Language Models

Authors: Wenyan Li, Raphael Tang, Chengzu Li, Caiqi Zhang, Ivan Vulić, Anders Søgaard

First: 2025-09-15T14:38:06+00:00 · Latest: 2025-09-15T14:38:06+00:00

Abstract

Vision-language models (VLMs) often process visual inputs through a pretrained vision encoder, followed by a projection into the language model's embedding space via a connector component. While crucial for modality fusion, the potential information loss induced by this projection step and its direct impact on model capabilities remain understudied. We introduce two complementary approaches to examine and quantify this loss by analyzing the latent representation space. First, we evaluate semantic information preservation by analyzing changes in k-nearest neighbor relationships between image representations, before and after projection. Second, we directly measure information loss by reconstructing visual embeddings from the projected representation, localizing loss at an image patch level. Experiments reveal that connectors substantially distort the local geometry of visual representations, with k-nearest neighbors diverging by 40-60% post-projection, correlating with degradation in retrieval performance. The patch-level embedding reconstruction provides interpretable insights for model behavior on visually grounded question-answering tasks, finding that areas of high information loss reliably predict instances where models struggle.

中文标题/摘要

标题：迷失在嵌入中：视觉-语言模型中的信息损失

视觉-语言模型（VLMs）通常通过预训练的视觉编码器处理视觉输入，然后通过连接组件将这些输入投影到语言模型的嵌入空间中。虽然这对于模态融合至关重要，但这一投影步骤可能引起的信息损失及其对模型能力的直接影响仍研究不足。我们引入了两种互补的方法来研究和量化这种损失，通过分析潜在表示空间。首先，我们通过分析图像表示在投影前后k近邻关系的变化来评估语义信息的保留情况。其次，我们直接通过从投影表示中重建视觉嵌入来测量信息损失，并在图像块级别定位损失。实验表明，连接器显著地扭曲了视觉表示的局部几何结构，投影后k近邻关系的差异高达40-60%，与检索性能的下降相关。图像块级别的嵌入重建为模型在视觉接地问答任务中的行为提供了可解释的见解，发现信息损失高的区域可靠地预测了模型遇到困难的实例。

Summary / 总结

This study investigates the information loss in vision-language models (VLMs) during the projection of visual inputs into the language model's embedding space. Two methods were employed: analyzing changes in k-nearest neighbor relationships and reconstructing visual embeddings at the patch level. The results show that connectors significantly distort the local geometry of visual representations, with a 40-60% divergence in k-nearest neighbors post-projection, which correlates with a decline in retrieval performance. Patch-level embedding reconstruction also reveals that areas with high information loss are associated with model difficulties in visually grounded question-answering tasks.

研究探讨了视觉-语言模型（VLMs）在将视觉输入投影到语言模型的嵌入空间时的信息损失。采用了两种方法：分析k近邻关系的变化，以及在图像块级别重建视觉嵌入。结果显示，连接器显著扭曲了视觉表示的局部几何结构，投影后k近邻关系偏离40-60%，这与检索性能下降相关。图像块级别的嵌入重建还表明，信息损失高的区域与模型在视觉接地问答任务中表现不佳的情况相关。
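
The first analysis, comparing k-nearest-neighbor relationships before and after projection, can be illustrated with a small sketch. It is not the authors' implementation: the Euclidean metric, the overlap-ratio definition, the toy 2-d points, and the names `knn_indices` and `mean_knn_overlap` are all assumptions made here for illustration.

```python
import math

def knn_indices(vectors, query_idx, k):
    """Indices of the k nearest neighbors of vectors[query_idx], excluding itself."""
    query = vectors[query_idx]
    dists = [
        (math.dist(query, v), i)
        for i, v in enumerate(vectors)
        if i != query_idx
    ]
    dists.sort()
    return {i for _, i in dists[:k]}

def mean_knn_overlap(before, after, k):
    """Average fraction of each item's k nearest neighbors kept in both spaces."""
    total = 0.0
    for i in range(len(before)):
        shared = knn_indices(before, i, k) & knn_indices(after, i, k)
        total += len(shared) / k
    return total / len(before)

# Toy representations of four images before and after a made-up "projection".
before = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
after = [[0.0, 0.0], [3.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

overlap = mean_knn_overlap(before, after, k=1)
divergence = 1.0 - overlap
```

With these toy points, two of the four items keep their nearest neighbor, so the overlap is 0.5 and the divergence is 50%, which is the kind of quantity the reported 40-60% figure refers to.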

SAM-TTT: Segment Anything Model via Reverse Parameter Configuration and Test-Time Training for Camouflaged Object Detection

Authors: Zhenni Yu, Li Zhao, Guobao Xiao, Xiaoqin Zhang

Venue: ACM MM

First: 2025-09-15T13:02:27+00:00 · Latest: 2025-09-15T13:02:27+00:00

Comments: accepted by ACM MM 25

Abstract

This paper introduces a new Segment Anything Model (SAM) that leverages reverse parameter configuration and test-time training to enhance its performance on Camouflaged Object Detection (COD), named SAM-TTT. While most existing SAM-based COD models primarily focus on enhancing SAM by extracting favorable features and amplifying its advantageous parameters, a crucial gap is identified: insufficient attention to adverse parameters that impair SAM's semantic understanding in downstream tasks. To tackle this issue, the Reverse SAM Parameter Configuration Module is proposed to effectively mitigate the influence of adverse parameters in a train-free manner by configuring SAM's parameters. Building on this foundation, the T-Visioner Module is unveiled to strengthen advantageous parameters by integrating Test-Time Training layers, originally developed for language tasks, into vision tasks. Test-Time Training layers represent a new class of sequence modeling layers characterized by linear complexity and an expressive hidden state. By integrating the two modules, SAM-TTT simultaneously suppresses adverse parameters while reinforcing advantageous ones, significantly improving SAM's semantic understanding in the COD task. Our experimental results on various COD benchmarks demonstrate that the proposed approach achieves state-of-the-art performance, setting a new benchmark in the field. The code will be available at https://github.com/guobaoxiao/SAM-TTT.

中文标题/摘要

标题：SAM-TTT：通过反向参数配置和测试时训练增强伪装目标检测的分割一切模型

本文介绍了一种新的分割一切模型（SAM），该模型利用反向参数配置和测试时训练来增强其在伪装目标检测（COD）任务中的性能，命名为SAM-TTT。虽然现有的大多数基于SAM的COD模型主要集中在通过提取有利特征和放大其优势参数来增强SAM，但本文识别了一个关键差距：对影响SAM在下游任务中语义理解的不良参数关注不足。为了解决这一问题，提出了反向SAM参数配置模块，以在无需训练的情况下有效减轻不良参数的影响。在此基础上，揭示了T-Visioner模块，通过将原本为语言任务开发的测试时训练层整合到视觉任务中，加强有利参数。通过整合两个模块，SAM-TTT同时抑制不良参数并强化有利参数，显著提高了SAM在COD任务中的语义理解能力。我们在各种COD基准上的实验结果表明，所提出的方法达到了最先进的性能，为该领域设定了新的基准。代码将在https://github.com/guobaoxiao/SAM-TTT上提供。

Summary / 总结

The paper introduces SAM-TTT, which enhances Segment Anything Model (SAM) for Camouflaged Object Detection (COD) by addressing the issue of adverse parameters that impair SAM's semantic understanding. It proposes a Reverse SAM Parameter Configuration Module to mitigate the influence of these parameters and a T-Visioner Module that integrates Test-Time Training layers to reinforce advantageous parameters. Experimental results show that SAM-TTT outperforms existing methods on various COD benchmarks, setting a new benchmark in the field.

该论文提出了SAM-TTT，通过解决影响SAM语义理解的不良参数问题，增强其在伪装目标检测（COD）中的性能。它提出了一种反向SAM参数配置模块来减轻这些参数的影响，并引入了一种T-Visioner模块，通过将测试时训练层集成到视觉任务中来强化有利参数。实验结果表明，SAM-TTT在各种COD基准测试中优于现有方法，为该领域设定了新的基准。

Bridging Vision Language Models and Symbolic Grounding for Video Question Answering

Authors: Haodi Ma, Vyom Pathak, Daisy Zhe Wang

First: 2025-09-15T12:35:56+00:00 · Latest: 2025-09-15T12:35:56+00:00

Abstract

Video Question Answering (VQA) requires models to reason over spatial, temporal, and causal cues in videos. Recent vision language models (VLMs) achieve strong results but often rely on shallow correlations, leading to weak temporal grounding and limited interpretability. We study symbolic scene graphs (SGs) as intermediate grounding signals for VQA. SGs provide structured object-relation representations that complement VLMs' holistic reasoning. We introduce SG-VLM, a modular framework that integrates frozen VLMs with scene graph grounding via prompting and visual localization. Across three benchmarks (NExT-QA, iVQA, ActivityNet-QA) and multiple VLMs (QwenVL, InternVL), SG-VLM improves causal and temporal reasoning and outperforms prior baselines, though gains over strong VLMs are limited. These findings highlight both the promise and current limitations of symbolic grounding, and offer guidance for future hybrid VLM-symbolic approaches in video understanding.

Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation

Authors: Tim Lebailly, Vijay Veerabadran, Satwik Kottur, Karl Ridgeway, Michael Louis Iuzzolino

Venue: ICCV 2025

First: 2025-09-15T12:26:47+00:00 · Latest: 2025-09-15T12:26:47+00:00

Comments: ICCV 2025 CDEL Workshop

Abstract

Generative vision-language models (VLMs) exhibit strong high-level image understanding but lack spatially dense alignment between vision and language modalities, as our findings indicate. Orthogonal to advancements in generative VLMs, another line of research has focused on representation learning for vision-language alignment, targeting zero-shot inference for dense tasks like segmentation. In this work, we bridge these two directions by densely aligning images with synthetic descriptions generated by VLMs. Synthetic captions are inexpensive, scalable, and easy to generate, making them an excellent source of high-level semantic understanding for dense alignment methods. Empirically, our approach outperforms prior work on standard zero-shot open-vocabulary segmentation benchmarks/datasets, while also being more data-efficient.

中文标题/摘要

标题：开放词汇零样本分割的合成描述

生成式视觉-语言模型（VLMs）在高层次图像理解方面表现出色，但在视觉和语言模态之间的空间密集对齐方面存在不足，正如我们的研究结果所示。与生成式VLMs的发展方向不同，另一条研究路线专注于视觉-语言对齐的表示学习，旨在实现密集任务如分割的零样本推理。在本研究中，我们通过将图像与由VLMs生成的合成描述进行密集对齐，将这两条路线结合起来。合成描述成本低廉、易于扩展和生成，是密集对齐方法中高水平语义理解的优秀来源。实验证明，我们的方法在标准的零样本开放词汇分割基准/数据集上优于先前的工作，同时更具数据效率。

Summary / 总结

This work addresses the lack of spatially dense alignment between vision and language by densely aligning images with synthetic captions generated by generative vision-language models (VLMs). The approach outperforms prior work on standard zero-shot open-vocabulary segmentation benchmarks while being more data-efficient.

该研究通过将图像与生成式视觉语言模型（VLM）生成的合成描述进行密集对齐，弥补视觉和语言模态之间空间对齐的不足，从而提升零样本分割的效果。该方法在标准的零样本开放词汇分割基准测试中优于先前的方法，并且更具数据效率。

Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts

Authors: Haodi Ma, Dzmitry Kasinets, Daisy Zhe Wang

First: 2025-01-26T22:23:14+00:00 · Latest: 2025-09-15T11:55:10+00:00

Abstract

Multimodal knowledge graph completion (MMKGC) aims to predict missing links in multimodal knowledge graphs (MMKGs) by leveraging information from various modalities alongside structural data. Existing MMKGC approaches primarily extend traditional knowledge graph embedding (KGE) models, which often require creating an embedding for every entity. This results in large model sizes and inefficiencies in integrating multimodal information, particularly for real-world graphs. Meanwhile, Transformer-based models have demonstrated competitive performance in knowledge graph completion (KGC). However, their focus on single-modal knowledge limits their capacity to utilize cross-modal information. Recently, large vision-language models (VLMs) have shown potential in cross-modal tasks but are constrained by the high cost of training. In this work, we propose a novel approach that integrates Transformer-based KGE models with cross-modal context generated by pre-trained VLMs, thereby extending their applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform relevant visual information from entities and their neighbors into textual sequences. We then frame KGC as a sequence-to-sequence task, fine-tuning the model with the generated cross-modal context. This simple yet effective method significantly reduces model size compared to traditional KGE approaches while achieving competitive performance across multiple large-scale datasets with minimal hyperparameter tuning.

Summary / 总结

The research aims to improve multimodal knowledge graph completion by integrating Transformer-based knowledge graph embedding models with cross-modal contexts generated by pre-trained vision-language models. The method uses a pre-trained VLM to convert visual information into textual sequences, treating KGC as a sequence-to-sequence task. This approach significantly reduces model size and achieves competitive performance across multiple datasets with minimal tuning.

该研究通过将基于Transformer的知识图嵌入模型与预训练的视觉-语言模型生成的跨模态上下文相结合，解决了多模态知识图完成的挑战。方法将视觉信息转换为文本序列，并将任务建模为序列到序列问题，从而在多个数据集上实现了较小的模型规模和竞争力的表现，且需要最少的超参数调整。

SpecVLM: Fast Speculative Decoding in Vision-Language Models

Authors: Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum

First: 2025-09-15T11:53:56+00:00 · Latest: 2025-09-15T11:53:56+00:00

Abstract

Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens whose count scales with image resolution and video length, inflating both compute and memory, especially the key-value (KV) cache. We study speculative decoding for VLMs and introduce SpecVLM, a practical system that (1) establishes a strong EAGLE-2-style baseline, EagleVLM, delivering 1.5-2.3x end-to-end speedups over full autoregressive inference, and (2) further accelerates VLM inference with an elastic visual compressor that adaptively selects among pruning, pooling, convolution, and resampler primitives to balance FLOPs/parameters and accuracy per input. To avoid costly offline distillation corpora, we propose an online-logit distillation protocol that trains the draft model with on-the-fly teacher logits and penultimate features using a combined cross-entropy and Smooth L1 objective, eliminating storage and preprocessing while remaining compute-efficient. This protocol reveals a training-time scaling effect: longer online training monotonically increases the draft model's average accepted length, improving speculative efficiency. Empirically, SpecVLM achieves additional acceleration, culminating in 2.5-2.9x end-to-end speedups within 5 epochs across LLaVA and MMMU, consistently over resolutions and task difficulties, while preserving the target model's output distribution (lossless decoding). Our code is available at https://github.com/haiduo/SpecVLM.

中文标题/摘要

标题：SpecVLM：视觉语言模型中的快速推测性解码

推测性解码是加速自回归大型语言模型（LLMs）的强大方法，但直接将其移植到视觉语言模型（VLMs）中面临独特的系统约束：预填充阶段主要由视觉标记占据，其数量随图像分辨率和视频长度增加，导致计算和内存成本上升，尤其是键值（KV）缓存。我们研究了VLMs中的推测性解码，并引入了SpecVLM，这是一种实用系统，（1）建立了强大的EAGLE-2风格基线EagleVLM，相比完整的自回归推理，端到端加速1.5-2.3倍，（2）进一步通过弹性视觉压缩器加速VLM推理，该压缩器根据输入自适应地选择剪枝、池化、卷积和重采样原语，平衡FLOPs/参数和准确性。为了避免昂贵的离线蒸馏数据集，我们提出了一种在线logit蒸馏协议，结合交叉熵和Smooth L1目标，使用即时生成的教师logits和倒数第二层特征训练草稿模型，消除存储和预处理，同时保持计算效率。该协议揭示了训练时间的缩放效应：更长的在线训练单调地增加了草稿模型的平均接受长度，提高推测效率。实验中，SpecVLM实现了额外的加速，最终在5个epoch内，在LLaVA和MMMU上达到2.5-2.9倍的端到端加速，在不同分辨率和任务难度下表现一致，同时保持目标模型的输出分布（无损解码）。我们的代码可在https://github.com/haiduo/SpecVLM获取。

Summary / 总结

Speculative decoding is explored for vision-language models (VLMs) to accelerate inference, addressing the challenge of visual token counts scaling with image resolution and video length. SpecVLM first establishes an EAGLE-2-style baseline, EagleVLM, with 1.5-2.3x end-to-end speedups over full autoregressive inference, then adds an elastic visual compressor and an online-logit distillation protocol, reaching 2.5-2.9x speedups within 5 epochs across LLaVA and MMMU, with lossless decoding.

推测性解码被用于加速视觉语言模型（VLMs）的推理，以应对视觉标记带来的高计算和内存成本。SpecVLM引入了弹性视觉压缩器和在线logit蒸馏协议，以平衡效率和准确性。其EagleVLM基线相比完整自回归推理实现了1.5-2.3倍的端到端加速，完整的SpecVLM在5个epoch内进一步将加速提升至2.5-2.9倍，同时保持目标模型的输出分布。
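
The abstract mentions a combined cross-entropy and Smooth L1 objective for training the draft model from teacher logits and penultimate features. A minimal sketch of such a combined loss follows. The pairing used here (cross-entropy on logits against the teacher's soft distribution, Smooth L1 on features) and the `feat_weight` parameter are assumptions for illustration; the abstract does not specify them, and this is not the SpecVLM code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student distribution against teacher soft targets."""
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))

def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss: quadratic below beta, linear above it."""
    total = 0.0
    for x, y in zip(pred, target):
        diff = abs(x - y)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)

def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat, feat_weight=1.0):
    """Hypothetical combined objective: logit term plus weighted feature term."""
    logit_term = soft_cross_entropy(student_logits, teacher_logits)
    feat_term = smooth_l1(student_feat, teacher_feat)
    return logit_term + feat_weight * feat_term

# Toy values for one token position (assumed, not model output).
loss = distillation_loss([2.0, 0.0], [0.0, 2.0], [0.0, 3.0], [0.5, 1.0])
```

In an online setting, the teacher logits and features would be computed in the same training step rather than loaded from a stored corpus, which is the storage saving the abstract refers to.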

FineQuest: Adaptive Knowledge-Assisted Sports Video Understanding via Agent-of-Thoughts Reasoning

Authors: Haodong Chen, Haojian Huang, XinXiang Yin, Dian Shao

Venue: ACM MM 2025

First: 2025-09-15T11:27:23+00:00 · Latest: 2025-09-15T11:27:23+00:00

Comments: ACM MM 2025

Abstract

Video Question Answering (VideoQA) based on Large Language Models (LLMs) has shown potential in general video understanding but faces significant challenges when applied to the inherently complex domain of sports videos. In this work, we propose FineQuest, the first training-free framework that leverages dual-mode reasoning inspired by cognitive science: i) Reactive Reasoning for straightforward sports queries and ii) Deliberative Reasoning for more complex ones. To bridge the knowledge gap between general-purpose models and domain-specific sports understanding, FineQuest incorporates SSGraph, a multimodal sports knowledge scene graph spanning nine sports, which encodes both visual instances and domain-specific terminology to enhance reasoning accuracy. Furthermore, we introduce two new sports VideoQA benchmarks, Gym-QA and Diving-QA, derived from the FineGym and FineDiving datasets, enabling diverse and comprehensive evaluation. FineQuest achieves state-of-the-art performance on these benchmarks as well as the existing SPORTU dataset, while maintaining strong general VideoQA capabilities.

中文标题/摘要

标题：FineQuest：基于Agent-of-Thoughts推理的自适应知识辅助体育视频理解

基于大型语言模型（LLMs）的视频问答（VideoQA）在通用视频理解方面显示出潜力，但在应用于体育视频这一固有复杂领域时面临重大挑战。本文提出FineQuest，这是首个无需训练的框架，利用受认知科学启发的双模式推理：i) 反应性推理处理直接的体育查询；ii) 审慎推理处理更复杂的查询。为弥合通用模型与特定领域体育理解之间的知识差距，FineQuest 结合了SSGraph，这是一个涵盖九种体育项目的多模态体育知识场景图，编码了视觉实例和领域特定术语，以提高推理准确性。此外，我们引入了两个新的体育视频问答基准，Gym-QA 和 Diving-QA，分别源自FineGym 和 FineDiving 数据集，使评估更加多样和全面。FineQuest 在这些基准以及现有的SPORTU数据集上均实现了最先进的性能，同时保持了强大的通用视频问答能力。

Summary / 总结

FineQuest is a training-free framework that uses dual-mode reasoning to address the challenges of understanding sports videos with large language models. It combines reactive reasoning for simple queries and deliberative reasoning for complex ones, leveraging SSGraph, a multimodal sports knowledge scene graph, to enhance reasoning accuracy. FineQuest demonstrates state-of-the-art performance on new benchmarks and existing datasets while maintaining general VideoQA capabilities.

FineQuest 是一个无需训练的框架，采用双模式推理来解决体育视频中视频问答的挑战。它结合了用于简单查询的反应推理和用于复杂查询的深思推理，并利用了包含多种体育知识的多模态场景图 SSGraph。该框架在新的体育视频问答基准和现有数据集上达到了最先进的性能，同时保持了一般视频问答的能力。

EMeRALDS: Electronic Medical Record Driven Automated Lung Nodule Detection and Classification in Thoracic CT Images

Authors: Hafza Eman, Furqan Shaukat, Muhammad Hamza Zafar, Syed Muhammad Anwar

First: 2025-09-15T09:11:17+00:00 · Latest: 2025-09-15T09:11:17+00:00

Abstract

Objective: Lung cancer is a leading cause of cancer-related mortality worldwide, primarily due to delayed diagnosis and poor early detection. This study aims to develop a computer-aided diagnosis (CAD) system that leverages large vision-language models (VLMs) for the accurate detection and classification of pulmonary nodules in computed tomography (CT) scans. Methods: We propose an end-to-end CAD pipeline consisting of two modules: (i) a detection module (CADe) based on the Segment Anything Model 2 (SAM2), in which the standard visual prompt is replaced with a text prompt encoded by CLIP (Contrastive Language-Image Pretraining), and (ii) a diagnosis module (CADx) that calculates similarity scores between segmented nodules and radiomic features. To add clinical context, synthetic electronic medical records (EMRs) were generated using radiomic assessments by expert radiologists and combined with similarity scores for final classification. The method was tested on the publicly available LIDC-IDRI dataset (1,018 CT scans). Results: The proposed approach demonstrated strong performance in zero-shot lung nodule analysis. The CADe module achieved a Dice score of 0.92 and an IoU of 0.85 for nodule segmentation. The CADx module attained a specificity of 0.97 for malignancy classification, surpassing existing fully supervised methods. Conclusions: The integration of VLMs with radiomics and synthetic EMRs allows for accurate and clinically relevant CAD of pulmonary nodules in CT scans. The proposed system shows strong potential to enhance early lung cancer detection, increase diagnostic confidence, and improve patient management in routine clinical workflows.
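
The Dice and IoU figures reported above are overlap metrics between a predicted mask and a reference mask. A minimal sketch in plain Python shows how the two are computed and how they relate; the function names and the toy masks are illustrative assumptions, not code or data from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally sized binary masks given as flat lists of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    if union == 0:
        # Both masks empty: treated as perfect agreement (a convention, not a rule).
        return 1.0, 1.0
    dice = 2 * inter / (pred_area + target_area)
    iou = inter / union
    return dice, iou


def dice_from_iou(iou):
    """For a single pair of masks, Dice = 2 * IoU / (1 + IoU)."""
    return 2 * iou / (1 + iou)


# Toy masks (illustrative values only).
pred = [1, 1, 1, 0, 0, 1]
target = [1, 1, 0, 0, 1, 1]
dice, iou = dice_and_iou(pred, target)
print(dice, iou, dice_from_iou(iou))
```

For a single mask pair, the identity maps an IoU of 0.85 to a Dice of about 0.92, which is consistent with the pair of numbers quoted in the abstract; scores averaged over many scans need not satisfy it exactly.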

中文标题/摘要

标题：EMeRALDS: 电子病历驱动的胸部CT图像肺结节自动检测与分类

目的：肺癌是全球癌症相关死亡的主要原因，主要是由于诊断延迟和早期检测不足。本研究旨在开发一种基于大型视觉语言模型（VLMs）的计算机辅助诊断（CAD）系统，用于准确检测和分类计算机断层扫描（CT）图像中的肺结节。方法：我们提出了一种端到端的CAD流水线，包括两个模块：（i）基于Segment Anything Model 2（SAM2）的检测模块（CADe），其中标准视觉提示被CLIP（对比语言-图像预训练）编码的文本提示所取代；（ii）诊断模块（CADx），该模块计算分割结节与放射组学特征之间的相似性分数。为了增加临床背景，使用专家放射科医生的放射组学评估生成合成电子病历（EMRs），并与相似性分数结合用于最终分类。该方法在公开可用的LIDC-IDRI数据集（1,018例CT扫描）上进行了测试。结果：所提出的方法在零样本肺结节分析中表现出色。CADe模块的结节分割Dice得分为0.92，IoU为0.85。CADx模块的恶性分类特异性为0.97，超过了现有的完全监督方法。结论：将VLMs与放射组学和合成电子病历结合使用，可以实现CT扫描中肺结节的准确和临床相关CAD。所提出的系统显示出增强早期肺癌检测、提高诊断信心和改善常规临床工作流程中患者管理的强大潜力。

Summary / 总结

The study develops a CAD system for early lung cancer detection using large vision-language models. It proposes an end-to-end pipeline with a detection module (CADe) based on SAM2, prompted by CLIP-encoded text, and a diagnosis module (CADx) that scores the similarity between segmented nodules and radiomic features. On the LIDC-IDRI dataset, the system achieved a Dice score of 0.92 and an IoU of 0.85 for nodule segmentation and a specificity of 0.97 for malignancy classification, which the authors report surpasses existing fully supervised methods.

该研究旨在利用大型视觉语言模型开发一种肺癌检测和分类的计算机辅助诊断系统。该系统包括基于SAM2的检测模块（CADe），使用CLIP生成的文本提示，以及一个诊断模块（CADx），该模块计算结节与影像组学特征之间的相似性得分。该方法在结节分割上的Dice得分为0.92，IoU为0.85，并且在恶性分类上的特异性为0.97，超过了现有方法。

Towards Understanding Visual Grounding in Visual Language Models

Authors: Georgios Pantazopoulos, Eda B. Özyiğit

First: 2025-09-12T15:33:49+00:00 · Latest: 2025-09-15T08:46:29+00:00

Abstract

Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in various domains, including referring expression comprehension, answering questions pertinent to fine-grained details in images or videos, captioning visual context by explicitly referring to entities, as well as low- and high-level control in simulated and real environments. In this survey paper, we review representative works across the key areas of research on modern general-purpose vision language models (VLMs). We first outline the importance of grounding in VLMs, then delineate the core components of the contemporary paradigm for developing grounded models, and examine their practical applications, including benchmarks and evaluation metrics for grounded multimodal generation. We also discuss the multifaceted interrelations among visual grounding, multimodal chain-of-thought, and reasoning in VLMs. Finally, we analyse the challenges inherent to visual grounding and suggest promising directions for future research.

中文标题/摘要

标题：理解视觉语言模型中的视觉定位

视觉定位是指模型识别视觉输入中与文本描述匹配的区域的能力。因此，具备视觉定位能力的模型可以面向各个领域的多种应用，包括指代表达理解、回答与图像或视频中细粒度细节相关的问题、通过明确指代实体来描述视觉上下文，以及在模拟和真实环境中进行低级和高级控制。在这篇综述中，我们回顾了现代通用视觉语言模型（VLMs）各关键研究领域的代表性工作。我们首先概述了视觉定位在VLMs中的重要性，然后阐述了当前开发定位模型的范式的核心组件，并探讨了它们的实际应用，包括定位多模态生成的基准和评估指标。我们还讨论了视觉定位、多模态思维链和VLMs推理之间的多方面关系。最后，我们分析了视觉定位固有的挑战，并提出了未来研究的有希望的方向。

Summary / 总结

The paper surveys visual grounding in vision language models: the ability to identify a region in a visual input that matches a textual description. This capability supports applications such as referring expression comprehension and grounded captioning. The survey reviews key areas of research, outlines the importance of grounding, and examines practical applications, benchmarks, and evaluation metrics. It also discusses the challenges of visual grounding and suggests directions for future research.

论文综述了视觉语言模型中的视觉定位，即识别视觉输入中与文本描述匹配的区域的能力。这一能力可支持指代表达理解和描述生成等多种应用。研究回顾了关键研究领域，概述了视觉定位的重要性，并探讨了实际应用、基准和评估指标。还讨论了视觉定位中的挑战，并提出了未来研究方向的建议。

Remote Sensing SpatioTemporal Vision-Language Models: A Comprehensive Survey

Authors: Chenyang Liu, Jiafan Zhang, Keyan Chen, Man Wang, Zhengxia Zou, Zhenwei Shi

Venue: IEEE Geoscience and Remote Sensing Magazine, 2025

First: 2024-12-03T16:56:10+00:00 · Latest: 2025-09-15T07:36:59+00:00

Comments: Published in IEEE Geoscience and Remote Sensing Magazine

Abstract

The interpretation of multi-temporal remote sensing imagery is critical for monitoring Earth's dynamic processes, yet previous change detection methods, which produce binary or semantic masks, fall short of providing human-readable insights into changes. Recent advances in Vision-Language Models (VLMs) have opened a new frontier by fusing visual and linguistic modalities, enabling spatio-temporal vision-language understanding: models that not only capture spatial and temporal dependencies to recognize changes but also provide a richer interactive semantic analysis of temporal images (e.g., generate descriptive captions and answer natural-language queries). In this survey, we present the first comprehensive review of remote sensing spatio-temporal VLMs (RS-STVLMs). The survey covers the evolution of models from early task-specific models to recent general foundation models that leverage powerful large language models. We discuss progress in representative tasks, such as change captioning, change question answering, and change grounding. Moreover, we systematically dissect the fundamental components and key technologies underlying these models, and review the datasets and evaluation metrics that have driven the field. By synthesizing task-level insights with a deep dive into shared architectural patterns, we aim to illuminate current achievements and chart promising directions for future research in spatio-temporal vision-language understanding for remote sensing. We will keep tracing related works at https://github.com/Chen-Yang-Liu/Awesome-RS-SpatioTemporal-VLMs

中文标题/摘要

标题：遥感时空视觉语言模型：全面综述

多时相遥感影像的解释对于监测地球动态过程至关重要，但以往的变化检测方法生成的二元或语义掩码无法提供易于理解的变化洞察。近期视觉语言模型（VLMs）的进步通过融合视觉和语言模态，开辟了新的前沿，使时空视觉语言理解成为可能：这些模型不仅捕捉空间和时间依赖性以识别变化，还能提供丰富的交互式语义分析（例如生成描述性标题和回答自然语言查询）。在本文综述中，我们首次全面回顾了RS-STVLMs。综述涵盖了从早期特定任务模型到最近利用强大大型语言模型的通用基础模型的发展。我们讨论了代表性任务的进步，如变化标题生成、变化问答和变化定位。此外，我们系统地剖析了这些模型的基本组件和关键技术，并回顾了推动该领域的数据集和评估指标。通过综合任务级见解并深入探讨共享的架构模式，我们旨在阐明当前的成就，并为遥感时空视觉语言理解的未来研究指明有希望的方向。我们将在https://github.com/Chen-Yang-Liu/Awesome-RS-SpatioTemporal-VLMs持续追踪相关工作

IS-Diff: Improving Diffusion-Based Inpainting with Better Initial Seed

Authors: Yongzhe Lyu, Yu Wu, Yutian Lin, Bo Du

First: 2025-09-15T07:16:03+00:00 · Latest: 2025-09-15T07:16:03+00:00

Abstract

Diffusion models have shown promising results in free-form inpainting. Recent studies based on refined diffusion samplers or novel architectural designs have led to realistic results and high data consistency. However, the random initialization seed (noise) adopted in the vanilla diffusion process may introduce mismatched semantic information in masked regions, leading to biased inpainting results, e.g., low consistency and low coherence with the unmasked area. To address this issue, we propose the Initial Seed refined Diffusion Model (IS-Diff), a completely training-free approach incorporating distributionally harmonious seeds to produce harmonious results. Specifically, IS-Diff employs initial seeds sampled from unmasked areas to imitate the masked data distribution, thereby setting a promising direction for the diffusion procedure. Moreover, a dynamic selective refinement mechanism is proposed to detect severely unharmonious inpaintings in the intermediate latent and dynamically adjust the strength of our initialization prior. We validate our method on both standard and large-mask inpainting tasks using the CelebA-HQ, ImageNet, and Places2 datasets, demonstrating its effectiveness across all metrics compared to state-of-the-art inpainting methods.

中文标题/摘要

标题：IS-Diff：通过更好的初始种子提高基于扩散的修复效果

扩散模型在自由形式的修复任务中展现了令人鼓舞的结果。基于改进的扩散采样器或新颖的架构设计的研究取得了真实且数据一致性高的结果。然而，传统的扩散过程中采用的随机初始化种子（噪声）可能会在遮罩区域引入不匹配的语义信息，导致偏颇的修复结果，例如不一致性和与其他未遮罩区域缺乏连贯性。为了解决这一问题，我们提出了初始种子改进扩散模型（IS-Diff），这是一种完全无需训练的方法，结合了分布和谐的种子以产生和谐的结果。具体而言，IS-Diff 使用来自未遮罩区域的初始种子来模仿遮罩数据的分布，从而为扩散过程设定一个有希望的方向。此外，我们提出了一种动态选择性精炼机制，用于检测中间潜在空间中的严重不和谐修复，并动态调整我们的初始化先验的强度。我们在CelebA-HQ、ImageNet和Places2数据集上的标准修复和大遮罩修复任务中验证了该方法，证明了其在所有指标上都优于最先进的修复方法。

Summary / 总结

The paper addresses the issue of biased inpainting results due to random initialization seeds in diffusion models. It proposes IS-Diff, which uses initial seeds sampled from unmasked areas to better match the masked data distribution. The method also includes a dynamic selective refinement mechanism to adjust the initialization strength. Experiments on CelebA-HQ, ImageNet, and Places2 datasets show that IS-Diff outperforms state-of-the-art methods across various metrics for both standard and large-mask inpainting tasks.

论文针对随机初始化种子导致的扩散模型 inpainting 结果偏差问题，提出了 IS-Diff 方法，该方法使用来自未遮罩区域的初始种子来更好地匹配遮罩数据分布。此外，IS-Diff 还包含一个动态选择性精炼机制，以动态调整初始化先验的强度。实验结果表明，IS-Diff 在 CelebA-HQ、ImageNet 和 Places2 数据集上的各种指标中均优于现有最先进的 inpainting 方法，适用于标准和大遮罩 inpainting 任务。

LATTE: Learning to Think with Vision Specialists

Authors: Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, Silvio Savarese

Venue: EMNLP 2025

First: 2024-12-07T00:42:04+00:00 · Latest: 2025-09-15T07:14:08+00:00

Abstract

While open-source vision-language models perform well on simple question-answering, they still struggle with complex questions that require both perceptual and reasoning capabilities. We propose LATTE, a family of vision-language models that have LeArned to Think wiTh vision spEcialists. By offloading perception to state-of-the-art vision models, our approach enables vision-language models to focus solely on reasoning over high-quality perceptual information. To train LATTE, we synthesize and filter a large dataset of 293K multi-modal reasoning traces over perceptual outputs of vision specialists. LATTE trained on this data achieves significant 4-5% gains over baselines across 6 benchmarks covering both perception and reasoning abilities. Ablation studies reveal that the effectiveness of multi-modal reasoning traces depends on the data sources, formats, and quality of thoughts.

中文标题/摘要

标题：LATTE：学习与视觉专家一起思考

尽管开源的视觉-语言模型在简单的问答任务上表现良好，但在处理需要感知和推理能力的复杂问题时仍然存在困难。我们提出了LATTE，这是一种视觉-语言模型家族，它们已经学会了与视觉专家一起思考。通过将感知任务卸载到最先进的视觉模型上，我们的方法使视觉-语言模型能够专注于高质量感知信息上的推理。为了训练LATTE，我们合成并过滤了一个包含293K个多模态推理痕迹的数据集，这些痕迹覆盖了视觉专家的感知输出。使用此数据集训练的LATTE在6个涵盖感知和推理能力的基准测试中实现了显著的4-5%的性能提升。消融研究显示，多模态推理痕迹的有效性取决于数据来源、格式和思维的质量。

Summary / 总结

The research aims to enhance vision-language models' ability to handle complex questions by integrating perceptual and reasoning capabilities. LATTE, a family of vision-language models, offloads perception to advanced vision models, allowing the models to focus on reasoning. The model was trained using a large dataset of 293K multi-modal reasoning traces, which resulted in significant 4-5% improvements over baselines across six benchmarks. Ablation studies indicate that the effectiveness of these reasoning traces is influenced by the data sources, formats, and quality of thoughts.

研究旨在通过整合感知和推理能力来提升视觉-语言模型处理复杂问题的能力。LATTE 是一系列视觉-语言模型，将感知任务卸载给先进的视觉模型，使其专注于推理。该模型通过包含293K多模态推理痕迹的大规模数据集进行训练，结果在六个基准测试中实现了4-5%的显著改进。消融研究显示，这些推理痕迹的有效性受到数据来源、格式和思想质量的影响。

Scalp Diagnostic System With Label-Free Segmentation and Training-Free Image Translation

Authors: Youngmin Kim, Saejin Kim, Hoyeon Moon, Youngjae Yu, Junhyug Noh

Venue: MICCAI 2025

First: 2024-06-25T03:42:29+00:00 · Latest: 2025-09-15T07:11:27+00:00

Comments: Accepted to MICCAI 2025 (https://papers.miccai.org/miccai-2025/0806-Paper5080.html), Project page: https://0110tpwls.github.io/scalpvision25/

Abstract

Scalp disorders are highly prevalent worldwide, yet remain underdiagnosed due to limited access to expert evaluation and the high cost of annotation. Although AI-based approaches hold great promise, their practical deployment is hindered by challenges such as severe data imbalance and the absence of pixel-level segmentation labels. To address these issues, we propose ScalpVision, an AI-driven system for the holistic diagnosis of scalp diseases. In ScalpVision, effective hair segmentation is achieved using pseudo image-label pairs and an innovative prompting method in the absence of traditional hair masking labels. Additionally, ScalpVision introduces DiffuseIT-M, a generative model adopted for dataset augmentation while maintaining hair information, facilitating improved predictions of scalp disease severity. Our experimental results affirm ScalpVision's efficiency in diagnosing a variety of scalp conditions, showcasing its potential as a valuable tool in dermatological care. Our code is available at https://github.com/winston1214/ScalpVision.

中文标题/摘要

标题：头皮诊断系统：无标签分割与无需训练的图像翻译

头皮疾病在全球范围内非常普遍，但由于难以获得专家评估且标注成本高昂，这些疾病仍然诊断不足。尽管基于AI的方法前景广阔，但其实际部署受到数据不平衡严重和缺乏像素级分割标签的挑战。为了解决这些问题，我们提出了ScalpVision，这是一种用于头皮疾病全面诊断的AI驱动系统。在ScalpVision中，通过伪图像标签对和一种创新的提示方法实现了有效的头发分割，而无需传统的头发遮罩标签。此外，ScalpVision引入了DiffuseIT-M，这是一种用于数据集增强的生成模型，同时保留头发信息，有助于提高头皮疾病严重程度的预测。我们的实验结果证实了ScalpVision在诊断各种头皮状况方面的效率，展示了其在皮肤科护理中的潜在价值。我们的代码可在https://github.com/winston1214/ScalpVision 获取。

Summary / 总结

ScalpVision is an AI-driven system designed to diagnose scalp diseases, addressing the challenges of limited expert evaluation and high annotation costs. It uses pseudo image-label pairs and an innovative prompting method for effective hair segmentation without traditional hair masking labels. Additionally, ScalpVision employs DiffuseIT-M, a generative model for dataset augmentation, which helps in predicting scalp disease severity more accurately. The system demonstrates high efficiency in diagnosing various scalp conditions, making it a promising tool in dermatological care.

ScalpVision 是一个基于 AI 的系统，旨在通过解决数据不平衡和缺乏像素级分割标签的问题来诊断头皮疾病。它使用伪图像标签对和一种创新的提示方法进行头发分割，并引入了 DiffuseIT-M 用于数据集增强，同时保留头发信息。实验结果表明，它在诊断各种头皮状况方面非常有效，是一个有价值的皮肤科护理工具。

Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation

Authors: Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang

Venue: EMNLP 2025

First: 2025-05-22T02:45:45+00:00 · Latest: 2025-09-15T07:02:17+00:00

Comments: Accepted to Findings of EMNLP 2025

Abstract

Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with faithfulness or hallucination, extracting more precise and disentangled hallucination-related representations. Our analysis demonstrates that interventions along the identified faithful direction can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a plug-and-play method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead. The code is available at https://github.com/huazhenglin2003/SSL.
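
Interventions "along a direction" in representation space are commonly realized by adding a scaled unit vector to a hidden state. The sketch below shows that generic operation in plain Python. It is not the authors' SSL implementation: `steer`, `projection`, `alpha`, and the toy vectors are all assumptions introduced for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction)."""
    unit = normalize(direction)
    return [h + alpha * u for h, u in zip(hidden, unit)]


def projection(hidden, direction):
    """Scalar component of `hidden` along `direction`."""
    unit = normalize(direction)
    return sum(h * u for h, u in zip(hidden, unit))


# Toy hidden state and a hypothetical "faithful" direction (illustrative values).
hidden = [0.5, -1.0, 2.0]
faithful_dir = [3.0, 0.0, 4.0]
steered = steer(hidden, faithful_dir, alpha=2.0)
print(projection(hidden, faithful_dir), projection(steered, faithful_dir))
```

Steering with a positive `alpha` raises the component along the chosen direction by exactly `alpha` and leaves orthogonal components unchanged; a negative `alpha` pushes the state the other way.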

中文标题/摘要

标题：通过稀疏自编码器引导LVLMs以减轻幻觉

大型视觉-语言模型（LVLMs）在多模态任务上取得了显著的性能。然而，它们仍然遭受幻觉问题，生成与视觉输入不一致的文本，这在实际应用中带来了重大风险。现有方法通过引入外部知识库、对齐训练或解码策略来解决这一问题，但这些方法都需要大量的计算成本和时间。最近的研究尝试通过调整LVLMs的内部表示来探索更高效的替代方案。尽管这些方法很有前景，但它们可能会导致幻觉无法充分抑制，或者导致过度干预，从而负面影响正常的语义。在本工作中，我们利用稀疏自编码器（SAEs）来识别与忠实性或幻觉紧密相关的语义方向，提取更精确和解耦的幻觉相关表示。我们的分析表明，沿着识别出的忠实方向进行干预可以减轻幻觉，而沿着幻觉方向进行干预则会加剧幻觉。基于这些见解，我们提出了Steering LVLMs via SAE Latent Directions（SSL），这是一种基于SAE潜在方向的即插即用方法，用于减轻LVLMs中的幻觉。大量实验表明，SSL在减轻幻觉方面显著优于现有的解码方法，同时在不同模型架构之间保持可迁移性，且额外时间开销可以忽略不计。代码可在 https://github.com/huazhenglin2003/SSL 获取。

Summary / 总结

This work addresses hallucinations in large vision-language models (LVLMs) by using sparse autoencoders (SAEs) to identify semantic directions associated with faithfulness or hallucination. The analysis shows that intervening along the faithful direction mitigates hallucinations, whereas intervening along the hallucinatory direction exacerbates them. The resulting plug-and-play method, Steering LVLMs via SAE Latent Directions (SSL), outperforms existing decoding approaches at mitigating hallucinations, adds negligible time overhead, and transfers across model architectures.

本文通过使用稀疏自编码器（SAE）来识别与忠实性或幻觉相关的语义方向，提出了一种方法以减轻大型视觉-语言模型（LVLM）中的幻觉问题。实验表明，提出的Steering LVLMs via SAE Latent Directions（SSL）方法能够有效减少幻觉，同时不会显著增加计算成本或负面影响正常语义。该方法适用于不同模型架构，并且可以作为插件解决方案实现。

Multi-View Slot Attention Using Paraphrased Texts for Face Anti-Spoofing

Authors: Jeongmin Yu, Susang Kim, Kisu Lee, Taekyoung Kwon, Won-Yong Shin, Ha Young Kim

Venue: ICCV 2025

First: 2025-09-08T04:53:46+00:00 · Latest: 2025-09-15T05:55:24+00:00

Comments: Accepted to ICCV 2025

Abstract

Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP's patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., 'live' or 'fake'), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets. Code: https://github.com/Elune001/MVP-FAS.

中文标题/摘要

标题：基于改写文本的多视图槽注意力人脸防伪

近期的人脸防伪（FAS）方法通过使用像 CLIP 这样的视觉-语言模型展示了出色的跨域性能。然而，现有的基于 CLIP 的 FAS 模型未能充分利用 CLIP 的图像块嵌入标记，因而无法检测到关键的欺骗线索。此外，这些模型依赖于每个类别单一的文本提示（例如 'live' 或 'fake'），这限制了泛化能力。为了解决这些问题，我们提出了一种名为 MVP-FAS 的新型框架，该框架结合了两个关键模块：多视图槽注意力（MVS）和多文本图像块对齐（MTPA）。这两个模块利用多条改写文本生成通用特征，减少对特定领域文本的依赖。MVS 通过利用多种视角的多样化文本，从图像块嵌入中提取局部细节空间特征和全局上下文。MTPA 将图像块与多个文本表示对齐，以提高语义鲁棒性。大量实验表明，MVP-FAS 达到了优越的泛化性能，在跨域数据集上超越了先前的最先进方法。代码：https://github.com/Elune001/MVP-FAS。

Summary / 总结

The research aims to improve the cross-domain performance of face anti-spoofing (FAS) by utilizing vision-language models like CLIP. The proposed MVP-FAS framework introduces Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA) modules, which use multiple paraphrased texts to generate generalized features and reduce domain-specific text reliance. Experimental results show that MVP-FAS outperforms previous state-of-the-art methods in cross-domain datasets, demonstrating superior generalization performance.

研究旨在通过利用CLIP等视觉语言模型来提高人脸防伪（FAS）的跨域性能。提出的MVP-FAS框架引入了多视图槽注意力（MVS）和多文本图像块对齐（MTPA）两个模块，利用多条改写文本生成通用特征，并减少对特定领域文本的依赖。实验表明，MVP-FAS在跨域数据集上的泛化性能优于此前的最先进方法。

First RAG, Second SEG: A Training-Free Paradigm for Camouflaged Object Detection

Authors: Wutao Liu, YiDan Wang, Pan Gao

First: 2025-08-21T07:14:18+00:00 · Latest: 2025-09-15T05:21:07+00:00

Abstract

Camouflaged object detection (COD) poses a significant challenge in computer vision due to the high similarity between objects and their backgrounds. Existing approaches often rely on heavy training and large computational resources. While foundation models such as the Segment Anything Model (SAM) offer strong generalization, they still struggle to handle COD tasks without fine-tuning and require high-quality prompts to yield good performance. However, generating such prompts manually is costly and inefficient. To address these challenges, we propose First RAG, Second SEG (RAG-SEG), a training-free paradigm that decouples COD into two stages: Retrieval-Augmented Generation (RAG) for generating coarse masks as prompts, followed by SAM-based segmentation (SEG) for refinement. RAG-SEG constructs a compact retrieval database via unsupervised clustering, enabling fast and effective feature retrieval. During inference, the retrieved features produce pseudo-labels that guide precise mask generation using SAM2. Our method eliminates the need for conventional training while maintaining competitive performance. Extensive experiments on benchmark COD datasets demonstrate that RAG-SEG performs on par with or surpasses state-of-the-art methods. Notably, all experiments are conducted on a personal laptop, highlighting the computational efficiency and practicality of our approach. We present further analysis in the Appendix, covering limitations, salient object detection extension, and possible improvements. Code: https://github.com/Lwt-diamond/RAG-SEG.

中文标题/摘要

标题：First RAG, Second SEG：一种无需训练的伪装目标检测范式

伪装目标检测（COD）在计算机视觉中面临重大挑战，因为目标与背景高度相似。现有方法通常依赖于大量的训练和计算资源。虽然基础模型如Segment Anything Model (SAM) 提供了强大的泛化能力，但在不经微调的情况下仍难以处理COD任务，并且需要高质量的提示才能获得良好的性能。然而，手动生成这些提示成本高且效率低。为了解决这些问题，我们提出了一种无需训练的范式 First RAG, Second SEG (RAG-SEG)，将COD分为两个阶段：检索增强生成（RAG）用于生成粗略的掩码作为提示，随后进行基于SAM的分割（SEG）以细化结果。RAG-SEG通过无监督聚类构建紧凑的检索数据库，实现快速有效的特征检索。在推理过程中，检索到的特征生成伪标签，指导使用SAM2生成精确的掩码。我们的方法消除了传统训练的需要，同时保持了有竞争力的性能。基准COD数据集上的大量实验表明，RAG-SEG的性能与最先进的方法相当甚至更优。值得注意的是，所有实验均在一台个人笔记本电脑上进行，突显了我们方法的计算效率和实用性。我们在附录中给出了进一步的分析，涵盖局限性、向显著目标检测的扩展以及可能的改进。代码：https://github.com/Lwt-diamond/RAG-SEG。

Summary / 总结

The paper addresses the challenge of camouflaged object detection (COD) by proposing a training-free paradigm called First RAG, Second SEG (RAG-SEG). This method decouples COD into two stages: Retrieval-Augmented Generation (RAG) for generating coarse masks, and SAM-based Segmentation (SEG) for refinement. RAG-SEG constructs a compact retrieval database through unsupervised clustering, enabling efficient feature retrieval. Experiments show that RAG-SEG performs competitively with state-of-the-art methods and can be run on a personal laptop, highlighting its computational efficiency and practicality.

论文提出了一种无需训练的范式 First RAG, Second SEG (RAG-SEG)，将伪装目标检测（COD）分为两个阶段：先由检索增强生成（RAG）生成粗略掩码，再由基于SAM的分割（SEG）进行细化。RAG-SEG通过无监督聚类构建紧凑的检索数据库，实现快速有效的特征检索。实验表明，RAG-SEG在基准COD数据集上的性能与最先进的方法相当，并且可以在个人笔记本电脑上运行，突显了其计算效率和实用性。

How Auxiliary Reasoning Unleashes GUI Grounding in VLMs

Authors: Weiming Li, Yan Shao, Jing Yang, Yujing Lu, Ling Zhong, Yuhan Wang, Manni Duan

First: 2025-09-15T03:28:29+00:00 · Latest: 2025-09-15T03:28:29+00:00

Abstract

Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We identify a key gap in this paper: while VLMs exhibit significant latent grounding potential, as demonstrated by their performance measured by Pointing Game, they underperform when tasked with outputting explicit coordinates. To address this discrepancy, and bypass the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By providing explicit spatial cues such as axes, grids and labeled intersections as part of the input image, these methods enable VLMs to articulate their implicit spatial understanding capabilities. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs. The evaluation results demonstrate that the proposed methods substantially improve the performance of GUI grounding.
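
One way to picture the "grid with labeled intersections" cue is as a mapping between short labels and pixel coordinates: the labels are drawn on the screenshot, the model names a label, and the label is converted back to a coordinate. The sketch below covers only that mapping. The labelling scheme (letters for columns, numbers for rows), the function names, and the 100-pixel step are hypothetical choices, not details taken from the paper.

```python
def build_grid(width, height, step):
    """Map labels like 'C2' to (x, y) pixel coordinates of grid intersections.

    Columns are lettered A, B, ... from left to right; rows are numbered 1, 2, ...
    from top to bottom.
    """
    xs = list(range(0, width + 1, step))
    ys = list(range(0, height + 1, step))
    if len(xs) > 26:
        raise ValueError("too many columns for single-letter labels")
    grid = {}
    for col_index, x in enumerate(xs):
        for row_index, y in enumerate(ys):
            label = f"{chr(ord('A') + col_index)}{row_index + 1}"
            grid[label] = (x, y)
    return grid


def nearest_label(grid, point):
    """Return the label of the intersection closest to `point` (squared distance)."""
    px, py = point
    return min(grid, key=lambda k: (grid[k][0] - px) ** 2 + (grid[k][1] - py) ** 2)


# A 400x300 screenshot with intersections every 100 pixels (illustrative sizes).
grid = build_grid(width=400, height=300, step=100)
print(grid["C2"], nearest_label(grid, (210, 140)))
```

A coarser step gives fewer, easier-to-read labels at the cost of localization precision, so the step size is a trade-off that would need tuning per benchmark.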

中文标题/摘要

标题：辅助推理如何释放VLMs中的GUI定位能力

图形用户界面（GUI）定位是构建GUI代理的基本任务。然而，由于缺乏特定优化，通用的视觉-语言模型（VLMs）在这一任务上表现不佳。本文指出了一个关键差距：尽管以 Pointing Game 衡量的表现显示VLMs具有显著的潜在定位能力，但在需要输出明确坐标时表现不佳。为解决这一差异，并绕过当前微调方法的高数据和标注成本，我们提出了三种零样本辅助推理方法。通过在输入图像中提供明确的空间线索，如坐标轴、网格和带标签的交点，这些方法使VLMs能够表达其隐含的空间理解能力。我们在四个GUI定位基准上对这些方法进行了评估，涉及七个开源和专有的VLMs。评估结果表明，所提出的方法显著提高了GUI定位的性能。

Summary / 总结

This paper addresses the challenge of graphical user interface (GUI) grounding in vision-language models (VLMs) by identifying a gap in their ability to output explicit coordinates despite strong latent grounding potential. To overcome this, the authors propose three zero-shot auxiliary reasoning methods that provide spatial cues, enabling VLMs to better articulate their spatial understanding. Evaluations on four GUI grounding benchmarks across seven VLMs show significant improvements in performance.

本文研究视觉语言模型（VLMs）的图形用户界面（GUI）定位问题，指出了一个能力差距：尽管VLMs具有较强的潜在定位能力，但在输出明确坐标方面表现不佳。为此，作者提出了三种零样本辅助推理方法，这些方法为VLMs提供明确的空间线索。在四个GUI定位基准上的评估显示，这些方法在七个VLMs上显著提高了性能。

CEMTM: Contextual Embedding-based Multimodal Topic Modeling

Authors: Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini

Venue: EMNLP 2025

First: 2025-09-14T23:07:46+00:00 · Latest: 2025-09-14T23:07:46+00:00

Comments: EMNLP 2025

Abstract

We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language models (LVLMs) to obtain contextualized embeddings, and employs a distributional attention mechanism to weight token-level contributions to topic inference. A reconstruction objective aligns topic-based representations with the document embedding, encouraging semantic consistency across modalities. Unlike existing approaches, CEMTM can process multiple images per document without repeated encoding and maintains interpretability through explicit word-topic and document-topic distributions. Extensive experiments on six multimodal benchmarks show that CEMTM consistently outperforms unimodal and multimodal baselines, achieving a remarkable average LLM score of 2.61. Further analysis shows its effectiveness in downstream few-shot retrieval and its ability to capture visually grounded semantics in complex domains such as scientific articles.

中文标题/摘要

标题：CEMTM：基于上下文的多模态主题建模

我们介绍了CEMTM，一种增强上下文的多模态主题模型，旨在从包含文本和图像的短文档和长文档中推断出连贯且可解释的主题结构。CEMTM 基于微调的大规模视觉语言模型（LVLMs）获得上下文化嵌入，并采用分布注意力机制来加权词级对主题推断的贡献。重建目标使基于主题的表示与文档嵌入对齐，鼓励跨模态的一致性。与现有方法不同，CEMTM 可以处理每个文档中的多个图像而无需重复编码，并通过显式的词-主题和文档-主题分布保持可解释性。在六个多模态基准上的广泛实验表明，CEMTM 一致地优于单模态和多模态基线，平均LLM得分为2.61。进一步的分析显示了其在下游少样本检索中的有效性，并且能够捕捉复杂领域（如科学文章）中的视觉基础语义。

Summary / 总结

CEMTM is a context-enhanced multimodal topic model that uses fine-tuned large vision-language models to generate contextualized embeddings and employs a distributional attention mechanism to infer topics from both text and images. It outperforms existing unimodal and multimodal baselines on six benchmarks, achieving an average LLM score of 2.61 and demonstrating effectiveness in few-shot retrieval and capturing visually grounded semantics in complex domains like scientific articles.

CEMTM 是一种增强上下文的多模态主题模型，利用微调的大规模视觉-语言模型生成上下文嵌入，并采用分布注意力机制进行主题推断。它能够高效处理每份文档中的多张图片，并通过显式的词-主题和文档-主题分布保持可解释性。在六个多模态基准上的实验表明，CEMTM 在性能上优于单模态和多模态基线，平均 LLM 得分为 2.61，并且在下游检索和复杂领域（如科学文章）中捕捉到视觉基础的语义方面表现出有效性。

Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations

Authors: Shresth Grover, Akshay Gopalkrishnan, Bo Ai, Henrik I. Christensen, Hao Su, Xuanlin Li

First: 2025-09-14T20:08:56+00:00 · Latest: 2025-09-14T20:08:56+00:00

Comments: Project Page: https://gen-vla.github.io/

Abstract

Vision-language-action (VLA) models finetuned from vision-language models (VLMs) hold the promise of leveraging rich pretrained representations to build generalist robots across diverse tasks and environments. However, direct fine-tuning on robot data often disrupts these representations and limits generalization. We present a framework that better preserves pretrained features while adapting them for robot manipulation. Our approach introduces three components: (i) a dual-encoder design with one frozen vision encoder to retain pretrained features and another trainable for task adaptation, (ii) a string-based action tokenizer that casts continuous actions into character sequences aligned with the model's pretraining domain, and (iii) a co-training strategy that combines robot demonstrations with vision-language datasets emphasizing spatial reasoning and affordances. Evaluations in simulation and on real robots show that our method improves robustness to visual perturbations, generalization to novel instructions and environments, and overall task success compared to baselines.
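
A string-based action tokenizer can be pictured as serializing each continuous action vector to text that an ordinary text tokenizer already handles, then parsing the text back at execution time. The format below (fixed decimals, space-separated values) and the function names are assumptions for illustration; the abstract does not specify the paper's actual encoding.

```python
def encode_action(action, decimals=2):
    """Serialize a continuous action vector as a space-separated string."""
    return " ".join(f"{value:.{decimals}f}" for value in action)


def decode_action(text):
    """Parse the string form back into a list of floats."""
    return [float(token) for token in text.split()]


# Hypothetical 3-dimensional action (illustrative values).
action = [0.127, -0.5, 1.0]
text = encode_action(action)
print(text)
print(decode_action(text))
```

The round trip is lossy by design: with two decimals each component is recovered only to within 0.005, so the number of decimals trades sequence length against control precision.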

中文标题/摘要

标题：通过保留预训练表示来增强视觉-语言-动作模型的泛化能力

从视觉-语言模型（VLMs）微调而来的视觉-语言-动作（VLA）模型有望利用丰富的预训练表示来构建跨多种任务和环境的通用机器人。然而，直接在机器人数据上进行微调往往会破坏这些表示并限制泛化。我们提出了一种框架，在使预训练特征适应机器人操作的同时更好地保留这些特征。我们的方法引入了三个组件：（i）一种双编码器设计，其中一个冻结的视觉编码器保留预训练特征，另一个可训练的编码器用于任务适应；（ii）一种基于字符串的动作分词器，将连续动作转换为与模型预训练领域对齐的字符序列；（iii）一种联合训练策略，结合了机器人演示和强调空间推理和可供性的视觉-语言数据集。在模拟和真实机器人上的评估表明，与基线方法相比，我们的方法提高了对视觉扰动的鲁棒性、对新指令和环境的泛化能力以及整体任务成功率。

Summary / 总结

The research aims to enhance the generalization capabilities of vision-language-action (VLA) models by preserving their pretrained representations while adapting to robotic tasks. The method includes a dual-encoder design with a frozen vision encoder to retain pretrained features and a trainable encoder for task adaptation, a string-based action tokenizer to align continuous actions with the model's pretraining domain, and a co-training strategy combining robot demonstrations with vision-language datasets. Experimental results demonstrate improved robustness to visual perturbations, better generalization to novel instructions and environments, and higher task success compared to baseline methods.

研究旨在通过保留预训练表示来增强视觉-语言-动作（VLA）模型的泛化能力，同时适应机器人任务。方法包括一个双编码器设计，其中冻结的视觉编码器保留预训练特征，可训练的编码器用于任务适应；一个基于字符串的动作分词器，将连续动作与模型的预训练领域对齐；以及一种结合机器人演示和视觉-语言数据集的协同训练策略，强调空间推理和可用性。实验结果表明，该方法在视觉扰动下的鲁棒性更强，对新指令和环境的泛化能力更好，任务成功率也更高，优于基线方法。

Offline RLAIF: Piloting VLM Feedback for RL via SFO

Authors: Jacob Beck

Venue: Published at The RLC 2025 Workshop on Reinforcement Learning Beyond Rewards: Ingredients for Developing Generalist Agents

First: 2025-03-02T23:52:46+00:00 · Latest: 2025-09-14T19:13:37+00:00

Comments: Code is provided at https://github.com/jacooba/OfflineRLAIF

Abstract

While internet-scale image and textual data have enabled strong generalization in Vision-Language Models (VLMs), the absence of internet-scale control data has impeded the development of similar generalization in standard reinforcement learning (RL) agents. Although VLMs are fundamentally limited in their ability to solve control tasks due to their lack of action-conditioned training data, their capacity for image understanding allows them to provide valuable feedback in RL tasks by recognizing successful outcomes. A key challenge in Reinforcement Learning from AI Feedback (RLAIF) is determining how best to integrate VLM-derived signals into the learning process. We explore this question in the context of offline RL and introduce a class of methods called Sub-Trajectory Filtered Optimization (SFO). We identify three key insights. First, trajectory length plays a crucial role in offline RL, as full-trajectory preference learning exacerbates the stitching problem, necessitating the use of sub-trajectories. Second, even in Markovian environments, a non-Markovian reward signal from a sequence of images is required to assess trajectory improvement, as VLMs do not interpret control actions and must rely on visual cues over time. Third, a simple yet effective approach, filtered and weighted behavior cloning, consistently outperforms more complex RLHF-based methods. We propose Sub-Trajectory Filtered Behavior Cloning (SFBC), a method that leverages VLM feedback on sub-trajectories while incorporating a retrospective filtering mechanism that removes sub-trajectories preceding failures to improve robustness and prevent turbulence. Please enjoy our airport puns.

中文标题/摘要

标题：离线RLAIF：通过SFO引导VLM反馈

尽管互联网规模的图像和文本数据使视觉语言模型（VLMs）在泛化方面表现出色，但由于缺乏互联网规模的控制数据，标准强化学习（RL）智能体的泛化发展受到了阻碍。尽管VLMs由于缺乏以动作为条件的训练数据而在解决控制任务方面受到根本限制，但它们对图像的理解能力使它们能够在RL任务中通过识别成功结果提供有价值的反馈。在从AI反馈进行强化学习（RLAIF）中，一个关键挑战是如何最好地将VLM衍生的信号整合到学习过程中。我们在离线RL的背景下探讨了这一问题，并引入了一类称为子轨迹过滤优化（SFO）的方法。我们得出了三个关键见解。首先，轨迹长度在离线RL中起着关键作用，因为完整轨迹的偏好学习加剧了拼接问题，因此需要使用子轨迹。其次，即使在马尔可夫环境中，评估轨迹改进也需要来自一系列图像的非马尔可夫奖励信号，因为VLMs无法解释控制动作，必须依赖随时间变化的视觉线索。第三，一种简单而有效的方法（过滤和加权行为克隆）在性能上始终优于更复杂的基于RLHF的方法。我们提出了子轨迹过滤行为克隆（SFBC）方法，该方法利用VLM在子轨迹上的反馈，并结合了一种回顾性过滤机制，去除失败之前的子轨迹，以提高鲁棒性并防止湍流。请欣赏我们的机场双关语。

Summary / 总结

The research aims to enhance reinforcement learning (RL) agents by integrating Vision-Language Models (VLMs) for feedback, addressing the lack of control data. The study introduces Sub-Trajectory Filtered Optimization (SFO) methods, focusing on the importance of sub-trajectories in offline RL. Key findings include the necessity of using sub-trajectories to mitigate the stitching problem, the requirement of a non-Markovian reward signal from sequential images, and the effectiveness of a simple filtered and weighted behavior cloning approach over more complex RLHF methods.

研究旨在通过整合视觉语言模型（VLMs）提供反馈来提升强化学习（RL）智能体，解决控制数据不足的问题。研究引入了子轨迹过滤优化（SFO）方法，强调在离线RL中使用子轨迹的重要性。关键发现包括使用子轨迹以缓解拼接问题的必要性，需要从连续图像序列中获得非马尔可夫奖励信号，以及简单的过滤和加权行为克隆方法优于更复杂的RLHF方法。
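
The filtering idea behind SFBC can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the author's implementation: the fixed chunk length, the score threshold, the `lookback` window, the use of the score as the cloning weight, and the names `SubTrajectory`, `split`, and `select_for_bc` are all hypothetical, and the VLM scores are supplied as plain numbers rather than produced by a model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SubTrajectory:
    steps: List[Tuple[int, int]]  # (observation, action) pairs; toy integers here
    vlm_score: float              # hypothetical VLM feedback in [0, 1]


def split(trajectory: List[Tuple[int, int]], length: int) -> List[List[Tuple[int, int]]]:
    """Cut a trajectory into consecutive chunks of at most `length` steps."""
    return [trajectory[i:i + length] for i in range(0, len(trajectory), length)]


def select_for_bc(subs: List[SubTrajectory], failed: bool,
                  threshold: float = 0.5, lookback: int = 1):
    """Return (sub-trajectory, weight) pairs for weighted behavior cloning.

    Retrospective filtering (assumed form): if the episode ended in failure,
    drop the last `lookback` sub-trajectories, since they preceded the failure.
    Remaining chunks are kept only if their score clears `threshold`,
    and the score doubles as the cloning weight.
    """
    if failed and lookback > 0:
        subs = subs[:-lookback]
    return [(s, s.vlm_score) for s in subs if s.vlm_score >= threshold]


if __name__ == "__main__":
    chunks = split([(t, t % 2) for t in range(6)], length=2)
    subs = [SubTrajectory(c, s) for c, s in zip(chunks, [0.9, 0.2, 0.8])]
    print([w for _, w in select_for_bc(subs, failed=True)])
    print([w for _, w in select_for_bc(subs, failed=False)])
```

The selected pairs would then feed a behavior cloning loss in which each sub-trajectory's log-likelihood term is multiplied by its weight.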

Leveraging Geometric Priors for Unaligned Scene Change Detection

Authors: Ziling Liu, Ziwei Chen, Mingqi Gao, Jinyu Yang, Feng Zheng

First: 2025-09-14T14:31:08+00:00 · Latest: 2025-09-14T14:31:08+00:00

Abstract

Unaligned Scene Change Detection aims to detect scene changes between image pairs captured at different times without assuming viewpoint alignment. To handle viewpoint variations, current methods rely solely on 2D visual cues to establish cross-image correspondence to assist change detection. However, large viewpoint changes can alter visual observations, causing appearance-based matching to drift or fail. Additionally, supervision limited to 2D change masks from small-scale SCD datasets restricts the learning of generalizable multi-view knowledge, making it difficult to reliably identify visual overlaps and handle occlusions. This lack of explicit geometric reasoning represents a critical yet overlooked limitation. In this work, we are the first to leverage geometric priors from a Geometric Foundation Model to address the core challenges of unaligned SCD, including reliable identification of visual overlaps, robust correspondence establishment, and explicit occlusion detection. Building on these priors, we propose a training-free framework that integrates them with the powerful representations of a visual foundation model to enable reliable change detection under viewpoint misalignment. Through extensive evaluation on the PSCD, ChangeSim, and PASLCD datasets, we demonstrate that our approach achieves superior and robust performance. Our code will be released at https://github.com/ZilingLiu/GeoSCD.

Summary / 总结

The research aims to detect scene changes between images captured at different times without assuming viewpoint alignment, addressing the limitations of current methods that rely solely on 2D visual cues. The proposed method leverages geometric priors from a Geometric Foundation Model to establish robust correspondences and handle occlusions, leading to superior and robust performance on the PSCD, ChangeSim, and PASLCD datasets. The framework integrates these geometric priors with a visual foundation model for reliable change detection under viewpoint misalignment.

研究旨在解决在不同时间拍摄的图像之间检测场景变化的问题，而不假设视点对齐。为克服仅依赖2D视觉线索的局限性，作者提出利用几何基础模型中的几何先验。这种方法能够可靠地识别视觉重叠、建立稳健的对应关系，并明确检测遮挡。实验结果表明，该方法在PSCD、ChangeSim和PASLCD数据集上的表现优于现有技术，能够稳健地处理视点偏差并进行变化检测。

Mitigating Hallucinations in Large Vision-Language Models by Self-Injecting Hallucinations

Authors: Yifan Lu, Ziqi Zhang, Chunfeng Yuan, Jun Gao, Congxuan Zhang, Xiaojuan Qi, Bing Li, Weiming Hu

Venue: EMNLP 2025

First: 2025-09-14T14:26:53+00:00 · Latest: 2025-09-14T14:26:53+00:00

Comments: Accepted at EMNLP 2025

Abstract

Large Vision-Language Models (LVLMs) suffer from serious hallucination problems, where the model-generated responses are inconsistent with the visual inputs. Existing hallucination mitigation methods are mainly based on preference alignment and require external human annotations or auxiliary models for preference data collection, which increase costs and limit sustainable improvement. To tackle these challenges, we propose Autonomous Preference Alignment via Self-Injection (APASI), a novel and generalizable method that mitigates hallucinations without external dependencies. APASI leverages the target LVLM to self-inject hallucinations into a generated response, creating a pair of responses with varying preference levels. During the self-injection process, the dis-preferred response is generated based on three key observations of hallucinations, ensuring it simulates real hallucination patterns. This fidelity offers an accurate learning signal for hallucination mitigation. Moreover, APASI incorporates an iterative alignment training strategy combined with curriculum learning to periodically update the preference data with increasing challenge, enabling stable and continuous enhancement of the LVLM. Extensive experiments across six benchmarks show that APASI not only effectively mitigates hallucinations for three baseline models but also achieves comparable or even superior performance to alignment-based methods with external dependency, thereby demonstrating its effectiveness and generalization capability. The code is available at https://github.com/davidluciolu/APASI.

中文标题/摘要

标题：通过自我注入幻觉缓解大型视觉-语言模型中的幻觉问题

大型视觉-语言模型（LVLMs）存在严重的幻觉问题，即模型生成的响应与视觉输入不一致。现有的幻觉缓解方法主要基于偏好对齐，需要外部人工标注或辅助模型来收集偏好数据，这增加了成本并限制了可持续改进。为了解决这些挑战，我们提出了基于自我注入的自主偏好对齐（APASI）方法，这是一种新颖且可泛化的方法，无需外部依赖即可缓解幻觉。APASI 利用目标 LVLM 将幻觉自我注入到生成的响应中，从而创建一对偏好程度不同的响应。在自我注入过程中，不被偏好的响应是根据对幻觉的三个关键观察生成的，确保其模拟真实的幻觉模式。这种保真度为幻觉缓解提供了准确的学习信号。此外，APASI 结合迭代对齐训练策略和课程学习，定期以逐步增加的难度更新偏好数据，使 LVLM 能够稳定且持续地提升。在六个基准测试上的广泛实验表明，APASI 不仅有效地缓解了三种基线模型的幻觉，还取得了与依赖外部资源的对齐方法相当甚至更优的性能，从而证明了其有效性和泛化能力。代码可在 https://github.com/davidluciolu/APASI 获取。

Summary / 总结

The paper addresses the issue of hallucinations in Large Vision-Language Models (LVLMs) by proposing APASI, a method that uses the LVLM itself to inject hallucinations into generated responses, yielding preference pairs for alignment. This approach avoids the need for external human annotations or auxiliary models, making it more cost-effective and sustainable. APASI incorporates an iterative alignment training strategy with curriculum learning to continuously improve the model's performance. Experiments across six benchmarks demonstrate that APASI effectively mitigates hallucinations for three baseline models and achieves comparable or even superior performance to alignment-based methods with external dependencies, showcasing its effectiveness and generalization capability.

论文针对大型视觉-语言模型中存在的幻觉问题，即生成的响应与视觉输入不一致。提出了一种名为APASI的方法，利用模型自身向生成的响应中注入幻觉，创建具有不同偏好水平的响应对。该方法结合迭代对齐训练和课程学习，有效地解决了幻觉问题，无需依赖外部的人工标注或辅助模型。六项基准测试结果表明，APASI不仅有效缓解了幻觉问题，而且在某些情况下甚至超过了依赖外部数据的方法，展示了其有效性和泛化能力。

Fighting Fire with Fire (F3): A Training-free and Efficient Visual Adversarial Example Purification Method in LVLMs

Authors: Yudong Zhang, Ruobing Xie, Yiqing Huang, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, Di Wang, Yu Wang

First: 2025-06-01T16:07:30+00:00 · Latest: 2025-09-14T09:51:48+00:00

Comments: Accepted by ACM Multimedia 2025 BNI track (Oral)

Abstract

Recent advances in large vision-language models (LVLMs) have showcased their remarkable capabilities across a wide range of multimodal vision-language tasks. However, these models remain vulnerable to visual adversarial attacks, which can substantially compromise their performance. In this paper, we introduce F3, a novel adversarial purification framework that employs a counterintuitive "fighting fire with fire" strategy: intentionally introducing simple perturbations to adversarial examples to mitigate their harmful effects. Specifically, F3 leverages cross-modal attentions derived from randomly perturbed adversary examples as reference targets. By injecting noise into these adversarial examples, F3 effectively refines their attention, resulting in cleaner and more reliable model outputs. Remarkably, this seemingly paradoxical approach of employing noise to counteract adversarial attacks yields impressive purification results. Furthermore, F3 offers several distinct advantages: it is training-free and straightforward to implement, and exhibits significant computational efficiency improvements compared to existing purification methods. These attributes render F3 particularly suitable for large-scale industrial applications where both robust performance and operational efficiency are critical priorities. The code is available at https://github.com/btzyd/F3.

中文标题/摘要

标题：以火灭火（F3）：一种在大视觉语言模型中的无训练高效视觉对抗样本净化方法

近年来，大视觉语言模型（LVLMs）在多种跨模态视觉语言任务中展现了其卓越的能力。然而，这些模型仍然容易受到视觉对抗攻击的影响，这会显著削弱其性能。本文介绍了一种名为F3的新型对抗净化框架，该框架采用了一种反直觉的“以火灭火”策略：故意向对抗样本中引入简单的扰动以减轻其有害影响。具体而言，F3利用从随机扰动的对手样本中提取的跨模态注意力作为参考目标。通过向这些对抗样本中注入噪声，F3有效地改进了它们的注意力，从而产生更清洁、更可靠的模型输出。令人惊讶的是，这种看似矛盾的方法——利用噪声来对抗对抗攻击——取得了令人印象深刻的净化效果。此外，F3还具有几个显著的优势：它是无训练的，易于实现，并且在计算效率方面比现有净化方法有显著改进。这些特性使F3特别适合大规模工业应用，其中稳健的性能和操作效率是关键优先事项。代码可在https://github.com/btzyd/F3获取。
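
The "simple perturbations" that F3 starts from can be pictured with the toy sketch below, which only adds bounded random noise to an image. It is a minimal illustration under stated assumptions: the uniform noise distribution, the `epsilon` bound, the flat list of pixel values in [0, 1], and the function name `randomly_perturb` are all hypothetical, and the attention-based reference targets described in the abstract are not modeled here.

```python
import random
from typing import List


def randomly_perturb(pixels: List[float], epsilon: float, seed: int = 0) -> List[float]:
    """Add uniform noise in [-epsilon, epsilon] to each pixel and clip to [0, 1]."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [min(1.0, max(0.0, p + rng.uniform(-epsilon, epsilon))) for p in pixels]


if __name__ == "__main__":
    adversarial = [0.0, 0.25, 0.5, 0.75, 1.0]  # stand-in for an adversarial image
    print(randomly_perturb(adversarial, epsilon=0.05))
```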

Seeing the Undefined: Chain-of-Action for Generative Semantic Labels

Authors: Meng Wei, Zhongnian Li, Peng Ying, Xinzheng Xu

First: 2024-11-26T13:09:14+00:00 · Latest: 2025-09-14T08:49:44+00:00

Comments: 15 pages, 8 figures

Abstract

Recent advances in vision-language models (VLMs) have demonstrated remarkable capabilities in image classification by leveraging predefined sets of labels to construct text prompts for zero-shot reasoning. However, these approaches face significant limitations in undefined domains, where the label space is vocabulary-unknown and composite. We thus introduce Generative Semantic Labels (GSLs), a novel task that aims to predict a comprehensive set of semantic labels for an image without being constrained by a predefined label set. Unlike traditional zero-shot classification, GSLs generates multiple semantic-level labels, encompassing objects, scenes, attributes, and relationships, thereby providing a richer and more accurate representation of image content. In this paper, we propose Chain-of-Action (CoA), an innovative method designed to tackle the GSLs task. CoA is motivated by the observation that enriched contextual information significantly improves generative performance during inference. Specifically, CoA decomposes the GSLs task into a sequence of detailed actions. Each action extracts and merges key information from the previous step, passing enriched context to the next, ultimately guiding the VLM to generate comprehensive and accurate semantic labels. We evaluate the effectiveness of CoA through extensive experiments on widely-used benchmark datasets. The results demonstrate significant improvements across key performance metrics, validating the capability of CoA to generate accurate and contextually rich semantic labels. Our work not only advances the state-of-the-art in generative semantic labels but also opens new avenues for applying VLMs in open-ended and dynamic real-world scenarios.

中文标题/摘要

标题：洞察未定义：生成语义标签的链式操作

近期视觉-语言模型（VLMs）在利用预定义标签集构建文本提示进行零样本推理方面展示了令人瞩目的图像分类能力。然而，这些方法在未定义领域面临重大限制，这些领域的标签空间是词汇未知且复合的。因此，我们提出了生成语义标签（GSLs），这是一种旨在预测图像全面语义标签的新任务，而不受预定义标签集的约束。与传统的零样本分类不同，GSLs 生成多个语义层次的标签，涵盖对象、场景、属性和关系，从而提供更丰富和准确的图像内容表示。在本文中，我们提出了链式操作（CoA），这是一种设计用于解决GSLs任务的创新方法。CoA 的动机是观察到丰富的上下文信息在推理过程中显著提高了生成性能。具体来说，CoA 将GSLs任务分解为一系列详细的步骤。每个步骤提取并合并前一步的关键信息，将丰富的上下文传递给下一步，最终引导VLM生成全面和准确的语义标签。我们通过在广泛使用的基准数据集上进行大量实验评估了CoA的有效性。结果表明，在关键性能指标上取得了显著改进，验证了CoA生成准确且上下文丰富的语义标签的能力。我们的工作不仅推进了生成语义标签的前沿技术，还为在开放性和动态现实场景中应用VLMs开辟了新的途径。

Summary / 总结

This paper addresses the limitations of vision-language models (VLMs) in undefined domains by introducing Generative Semantic Labels (GSLs), which predict a comprehensive set of semantic labels for images without predefined labels. To tackle this task, the authors propose Chain-of-Action (CoA), a method that decomposes the task into a sequence of actions, each enriching the context passed to the next step. Experiments on benchmark datasets show that CoA significantly improves the accuracy and context richness of generated semantic labels.

本文通过引入生成语义标签（GSLs），解决视觉语言模型（VLMs）在未定义领域中的局限性，GSLs任务旨在在没有预定义标签的情况下为图像预测全面的语义标签。为应对GSLs任务，作者提出了链式操作（CoA），将任务分解为一系列操作，每个操作都会在上一步的基础上丰富上下文信息，引导VLM生成准确且丰富的语义标签。在基准数据集上的实验显示，在关键性能指标上取得了显著改进，验证了CoA的有效性。
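
The step-by-step context passing described for CoA can be sketched as a simple loop. This is a hedged illustration, not the paper's pipeline: the prompt wording, the way answers are merged into the running context, the `vlm` callable signature, and the name `chain_of_action` are assumptions, and any real VLM client would have to be wrapped to match that signature.

```python
from typing import Callable, List


def chain_of_action(image_ref: str, actions: List[str],
                    vlm: Callable[[str, str], str]) -> str:
    """Run the action prompts in order, giving each step the context so far.

    Returns the answer of the final action, which is meant to produce the labels.
    """
    context = ""
    answer = ""
    for instruction in actions:
        # The first action sees only its instruction; later ones also see context.
        prompt = instruction if not context else f"{instruction}\nKnown so far: {context}"
        answer = vlm(image_ref, prompt)
        # Merge the new answer into the running context for the next action.
        context = answer if not context else f"{context}; {answer}"
    return answer


if __name__ == "__main__":
    def echo_vlm(image_ref: str, prompt: str) -> str:
        """Stand-in model that reports how many lines of prompt it received."""
        return f"saw {len(prompt.splitlines())} line(s)"

    steps = ["List the objects.", "Describe the scene.", "Output semantic labels."]
    print(chain_of_action("image.jpg", steps, echo_vlm))
```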

The System Description of CPS Team for Track on Driving with Language of CVPR 2024 Autonomous Grand Challenge

Authors: Jinghan Peng, Jingwen Wang, Xing Yu, Dehui Du

First: 2025-09-14T03:37:17+00:00 · Latest: 2025-09-14T03:37:17+00:00

Abstract

This report outlines our approach using vision language model systems for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. We have exclusively utilized the DriveLM-nuScenes dataset for training our models. Our systems are built on the LLaVA models, which we enhanced through fine-tuning with the LoRA and DoRA methods. Additionally, we have integrated depth information from open-source depth estimation models to enrich the training and inference processes. For inference, particularly with multiple-choice and yes/no questions, we adopted a Chain-of-Thought reasoning approach to improve the accuracy of the results. This comprehensive methodology enabled us to achieve a top score of 0.7799 on the validation set leaderboard, ranking 1st on the leaderboard.

中文标题/摘要

标题：CVPR 2024自主挑战赛Driving with Language赛道的CPS团队系统描述

本报告概述了我们使用视觉语言模型系统参加CVPR 2024自主挑战赛Driving with Language赛道的方法。我们仅使用DriveLM-nuScenes数据集进行模型训练。我们的系统基于LLaVA模型，并通过LoRA和DoRA方法进行了微调。此外，我们还整合了开源深度估计模型的深度信息，以丰富训练和推理过程。在推理过程中，特别是对于多项选择和是/否问题，我们采用了链式思考推理方法以提高结果的准确性。这种全面的方法使我们在验证集排行榜上获得了0.7799的最高分，排名第一。

Summary / 总结

The research aimed to develop a system for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge using vision-language models. The team trained exclusively on the DriveLM-nuScenes dataset and enhanced LLaVA models with LoRA and DoRA fine-tuning. They also integrated depth information from open-source depth estimation models into training and inference. For inference, particularly on multiple-choice and yes/no questions, they used Chain-of-Thought reasoning, achieving a top score of 0.7799 and first place on the validation set leaderboard.

研究旨在为CVPR 2024自主挑战赛中的Driving with Language赛道开发一个系统，使用了vision-language模型。他们使用DriveLM-nuScenes数据集进行训练，并通过LoRA和DoRA微调方法增强了LLaVA模型。此外，他们还从开源模型中整合了深度信息。在推理时，他们采用了Chain-of-Thought推理方法，最终在验证集排行榜上获得了0.7799的最高分，排名第一。
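
For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update it adds to a frozen weight matrix, W + (alpha / r) · B · A. The tiny matrices, the `alpha` value, and the helper names `matmul` and `lora_effective_weight` are illustrative assumptions; the report does not state its ranks or scaling, and DoRA's additional magnitude and direction decomposition is not shown.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the rank (rows of A)."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update with the same shape as W
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (2 x 2)
    b = [[1.0], [2.0]]            # trainable factor B (2 x 1), rank r = 1
    a = [[3.0, 4.0]]              # trainable factor A (1 x 2)
    print(lora_effective_weight(w, b, a, alpha=2.0))
```

Only B and A are trained, so the number of trainable parameters grows with the rank r rather than with the full size of W.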