arXiv 论文速递

Cache-of-Thought: Master-Apprentice Framework for Cost-Effective Vision Language Model Reasoning

Authors: Mingyuan Wu, Jize Jiang, Haozhen Zheng, Meitang Li, Zhaoheng Li, Beitong Tian, Bo Chen, Yongjoo Park, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt

Venue: EMNLP 2025

First: 2025-02-27T23:09:20+00:00 · Latest: 2025-09-19T17:18:26+00:00

Comments: EMNLP 2025 Main Conference. Mingyuan, Jize, and Haozhen contributed equally, while Minjia, Chengxiang, and Klara advised equally


Abstract

Vision Language Models (VLMs) have achieved remarkable success in a wide range of vision applications of increasing complexity and scale, yet choosing the right VLM size involves a trade-off between response quality and cost. While smaller VLMs are cheaper to run, they typically produce responses only marginally better than random guessing on benchmarks such as MMMU. In this paper, we propose Cache of Thought (CoT), a master-apprentice framework for collaborative inference between large and small VLMs. CoT stores high-quality query results from large VLMs (the master) in a cache; these results are then selected via a novel multi-modal retrieval and in-context learning scheme to improve the performance of small VLMs (the apprentice). We extensively evaluate CoT on widely recognized and challenging general reasoning benchmarks, and show that CoT increases overall reasoning performance by up to 7.7% under the same budget, and specifically boosts the performance of apprentice VLMs by up to 36.6%. Our code is available at https://github.com/UIUC-MONET/Cache-of-Thoughts

中文标题/摘要

标题：Cache-of-Thought: 大小模型协作框架以实现成本效益的视觉语言模型推理

视觉语言模型（VLMs）在复杂性和规模不断增加的视觉应用中取得了显著的成功，但在选择合适的VLM模型大小时，响应质量和成本之间存在权衡。虽然较小的VLMs运行成本较低，但在MMMU等基准测试中，它们的响应质量通常仅略优于随机猜测。在本文中，我们提出了一种名为Cache of Thought (CoT)的框架，该框架通过大型和小型VLMs之间的协作推理来管理高质量查询结果。CoT通过多模态检索和上下文学习选择这些缓存中的高质量查询结果，以辅助小型VLMs（学徒）的性能。我们在各种广泛认可和具有挑战性的通用推理基准上对CoT进行了广泛评估，并展示了在相同预算下，CoT的整体推理性能提高了最多7.7%，并且特别提高了学徒VLMs的性能，最多提高了36.6%。我们的代码可在https://github.com/UIUC-MONET/Cache-of-Thoughts获取

Summary / 总结

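The paper proposes Cache-of-Thought (CoT), a framework that uses results from large vision language models (VLMs) to improve smaller ones. CoT caches high-quality query results from a large VLM (the master) and uses multi-modal retrieval and in-context learning to help a small VLM (the apprentice). Across several benchmarks, it improves apprentice VLM performance by up to 36.6% and overall reasoning performance by up to 7.7% under the same budget.

论文提出了Cache-of-Thought (CoT)框架，通过利用大型视觉语言模型（VLMs）的结果来增强较小的VLMs的性能。CoT将大型VLMs（导师）的高质量查询结果缓存起来，并通过多模态检索和上下文学习来帮助较小的VLMs（学徒）的表现。该框架在各种基准测试上显著提高了学徒VLMs的性能，最多可达36.6%，以及整体推理性能，最多可达7.7%，同时保持相同的预算。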
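
To make the cache-then-retrieve idea concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes query embeddings are already available as plain lists of floats (a real system would use a multi-modal encoder), and the names `ThoughtCache`, `retrieve`, and `build_prompt` are hypothetical. The cached master answers closest to a new query are pasted into the apprentice's prompt as in-context examples.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class ThoughtCache:
    # Each entry is (embedding, question, master_answer); layout is an assumption.
    entries: list = field(default_factory=list)

    def add(self, embedding, question, master_answer):
        self.entries.append((embedding, question, master_answer))

    def retrieve(self, query_embedding, k=2):
        """Return the k cached entries most similar to the query."""
        ranked = sorted(
            self.entries,
            key=lambda e: cosine(query_embedding, e[0]),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(cache, query_embedding, question, k=2):
    """Prepend retrieved master answers as in-context examples."""
    parts = [f"Q: {q}\nA: {a}" for _, q, a in cache.retrieve(query_embedding, k)]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


cache = ThoughtCache()
cache.add([1.0, 0.0], "q1", "a1")
cache.add([0.0, 1.0], "q2", "a2")
prompt = build_prompt(cache, [0.9, 0.1], "new?", k=1)
```

The resulting `prompt` string would then be sent to the small VLM together with the query image; how the cache is populated and budgeted is left out of this sketch.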

Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks

Authors: Het Patel, Muzammil Allie, Qian Zhang, Jia Chen, Evangelos E. Papalexakis

First: 2025-09-19T17:16:32+00:00 · Latest: 2025-09-19T17:16:32+00:00

Comments: To be presented as a poster at the Workshop on Safe and Trustworthy Multimodal AI Systems (SafeMM-AI), 2025


Abstract

Vision language models (VLMs) excel in multimodal understanding but are prone to adversarial attacks. Existing defenses often demand costly retraining or significant architecture changes. We introduce a lightweight defense using tensor decomposition suitable for any pre-trained VLM, requiring no retraining. By decomposing and reconstructing vision encoder representations, it filters adversarial noise while preserving meaning. Experiments with CLIP on COCO and Flickr30K show improved robustness. On Flickr30K, it restores 12.3 percentage points of the performance lost to attacks, raising Recall@1 accuracy from 7.5% to 19.8%. On COCO, it recovers 8.1 percentage points, improving accuracy from 3.8% to 11.9%. Analysis shows that Tensor Train decomposition with low rank (8 to 32) and low residual strength (α = 0.1 to 0.2) works best. This method is a practical, plug-and-play solution with minimal overhead for existing VLMs.

中文标题/摘要

标题：通过张量分解实现稳健的视觉语言模型：对抗攻击的防御

视觉语言模型（VLMs）在多模态理解方面表现出色，但容易受到对抗攻击的影响。现有防御措施往往需要昂贵的重新训练或显著的架构更改。我们提出了一种轻量级的防御方法，使用张量分解适用于任何预训练的VLM，无需重新训练。通过分解和重构视觉编码器表示，它能够过滤掉对抗噪声，同时保留语义。在CLIP上对COCO和Flickr30K进行的实验显示了增强的鲁棒性。在Flickr30K上，它恢复了12.3%因攻击而损失的性能，将Recall@1准确率从7.5%提高到19.8%。在COCO上，它恢复了8.1%的性能，将准确率从3.8%提高到11.9%。分析表明，最优的张量火车分解具有低秩（8-32）和低残差强度（α=0.1-0.2）。该方法是一种实用的即插即用解决方案，对现有VLMs的开销最小。

Summary / 总结

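The paper tackles the vulnerability of vision language models (VLMs) to adversarial attacks with a lightweight, retraining-free defense based on tensor decomposition. The method decomposes and reconstructs vision encoder representations to filter adversarial noise while preserving meaning. With CLIP, it recovers 12.3 percentage points of lost performance on Flickr30K and 8.1 on COCO; Tensor Train decomposition with low rank and low residual strength works best.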

本文通过引入基于张量分解的轻量级防御方法，解决了视觉-语言模型（VLMs）对对抗攻击的脆弱性问题。该方法通过分解和重构视觉编码器表示来过滤掉对抗噪声，同时保持原始意义。实验结果表明，该方法在Flickr30K和COCO数据集上显著提高了鲁棒性，分别恢复了12.3%和8.1%的性能损失，最优的张量分解参数为张量火车（Tensor Train），具有较低的秩和残差强度。
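
The decompose-reconstruct-blend pattern can be sketched in a few lines of Python. Two loud assumptions: a rank-1 matrix approximation (via power iteration) stands in for the Tensor Train decomposition used in the paper, and the blend `(1 - alpha) * reconstruction + alpha * original` is only one plausible reading of "residual strength"; the abstract does not give the formula. The function names are hypothetical.

```python
import math


def rank1_approx(matrix, iters=50):
    """Approximate rank-1 reconstruction of a matrix via power iteration.

    Stand-in for a Tensor Train decomposition (assumption, not the paper's method).
    Assumes the fixed start vector is not orthogonal to the top singular vector.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [[0.0] * cols for _ in range(rows)]
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return [[u[i] * v[j] for j in range(cols)] for i in range(rows)]


def purify(features, alpha=0.1):
    """Blend the low-rank reconstruction with a small share of the original."""
    recon = rank1_approx(features)
    return [
        [(1.0 - alpha) * r + alpha * f for r, f in zip(recon_row, feat_row)]
        for recon_row, feat_row in zip(recon, features)
    ]


# A rank-1 "clean" feature matrix passes through essentially unchanged.
clean = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
purified = purify(clean, alpha=0.1)
```

In this reading, a small `alpha` keeps mostly the low-rank structure and discards most of the high-rank component, which is where adversarial noise is hoped to live.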

AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models

Authors: Vatsal Malaviya, Agneet Chatterjee, Maitreya Patel, Yezhou Yang, Chitta Baral

First: 2025-09-19T16:41:39+00:00 · Latest: 2025-09-19T16:41:39+00:00

Comments: Project Page: https://vatsal-malaviya.github.io/AcT2I/


Abstract

Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges still persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture nuanced and often implicit attributes inherent in action depiction, leading to generating images that lack key contextual details. To enable systematic evaluation, we introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of the inherent attributes and contextual dependencies in the training corpora of existing T2I models. We build upon this by developing a training-free, knowledge distillation technique utilizing Large Language Models to address this limitation. Specifically, we enhance prompts by incorporating dense information across three dimensions, observing that injecting prompts with temporal details significantly improves image generation accuracy, with our best model achieving an increase of 72%. Our findings highlight the limitations of current T2I methods in generating images that require complex reasoning and demonstrate that integrating linguistic knowledge in a systematic way can notably advance the generation of nuanced and contextually accurate images.

中文标题/摘要

标题：AcT2I：评估和改进文本到图像模型中的动作描绘

文本到图像（T2I）模型在从文本描述生成图像方面取得了显著成功。然而，在准确渲染以动作和互动为主要语义焦点的复杂场景方面仍然存在挑战。我们在这项工作中观察到的关键问题是，T2I模型经常难以捕捉动作描绘中固有的细微和常常是隐含的属性，导致生成的图像缺乏关键的上下文细节。为了进行系统性评估，我们引入了AcT2I基准，旨在评估T2I模型在从以动作为中心的提示生成图像方面的性能。实验证明，领先的T2I模型在AcT2I上表现不佳。我们进一步假设这种不足源于现有T2I模型训练语料库中固有属性和上下文依赖性的不完整表示。我们在此基础上开发了一种无需训练的知识蒸馏技术，利用大型语言模型来解决这一局限。具体来说，我们通过在三个维度上引入密集信息来增强提示，观察到注入时间细节显著提高了图像生成的准确性，我们的最佳模型实现了72%的提升。我们的研究结果突显了当前T2I方法在生成需要复杂推理的图像方面的局限性，并表明系统地整合语言知识可以显著提高生成细腻且上下文准确的图像的能力。

Summary / 总结

The research aims to address the challenge of accurately rendering complex scenes with actions and interactions in Text-to-Image models. The study introduces AcT2I, a benchmark for evaluating T2I models on action-centric prompts, and finds that leading models perform poorly. The authors propose a knowledge distillation technique using Large Language Models to enhance prompts, leading to a 72% improvement in image generation accuracy for complex actions.

研究旨在解决Text-to-Image (T2I)模型在渲染以动作和互动为重点的复杂场景时的局限性。研究引入了AcT2I基准，用于评估T2I模型在动作导向提示上的表现，并发现领先T2I模型表现不佳。作者提出了一种无需训练的方法，利用大型语言模型增强提示，通过注入时间细节显著提高了图像生成准确性，其最佳模型提高了72%。这突显了当前T2I方法在生成需要复杂推理的图像方面的局限性，表明系统地整合语言知识可以显著提高生成细腻且上下文准确的图像的能力。

LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

Authors: Yang Zhou, Shiyu Zhao, Yuxiao Chen, Zhenting Wang, Can Jin, Dimitris N. Metaxas

First: 2025-03-18T00:50:40+00:00 · Latest: 2025-09-19T15:35:46+00:00


Abstract

Large foundation models trained on large-scale vision-language data can boost Open-Vocabulary Object Detection (OVD) via synthetic training data, yet the hand-crafted pipelines often introduce bias and overfit to specific prompts. We sidestep this issue by directly fusing hidden states from Large Language Models (LLMs) into detectors, an avenue that is surprisingly under-explored. This paper presents a systematic method to enhance visual grounding by utilizing the decoder layers of the LLM inside a multimodal LLM (MLLM). We introduce a zero-initialized cross-attention adapter to enable efficient knowledge fusion from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We find that intermediate LLM layers already encode rich spatial semantics; adapting only the early layers yields most of the gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED lifts GroundingDINO by 3.82% on OmniLabel at just 8.7% extra GFLOPs, and a larger vision backbone pushes the improvement to 6.22%. Extensive ablations on adapter variants, LLM scales and fusion depths further corroborate our design.

中文标题/摘要

标题：LED：大语言模型增强的开放词汇对象检测

大规模视觉-语言数据训练的基础模型可以通过合成训练数据提升开放词汇对象检测（OVD），但手工设计的流水线往往会引入偏差并过度拟合特定提示。我们通过直接将大语言模型（LLM）的隐藏状态融合到检测器中绕过了这个问题，而这一途径令人惊讶地尚未被充分探索。本文提出了一种系统方法，通过利用多模态大语言模型（MLLM）中LLM的解码器层来增强视觉定位。我们引入了一个零初始化的交叉注意力适配器，以实现从LLM到对象检测器的高效知识融合，这是一种新的方法，称为LED（大语言模型增强的开放词汇对象检测）。我们发现中间的LLM层已经编码了丰富的空间语义；仅调整早期层就能获得大部分收益。使用Swin-T作为视觉编码器，Qwen2-0.5B + LED在OmniLabel上将GroundingDINO的性能提升了3.82%，仅增加了8.7%的额外GFLOPs，而更大的视觉骨干则将改进提升至6.22%。广泛的适配器变体、LLM规模和融合深度的消融实验进一步证实了我们的设计。

Summary / 总结

This paper addresses the challenge of Open-Vocabulary Object Detection (OVD) by leveraging Large Language Models (LLMs) to enhance visual grounding without the need for human-curated synthetic data. The method, called LED, directly fuses hidden states from LLMs into detectors using a zero-initialized cross-attention adapter. The study finds that adapting only early layers of the LLM yields significant performance gains. Using Swin-T as the vision encoder, Qwen2-0.5B + LED improves GroundingDINO by 3.82% on OmniLabel with minimal computational overhead, and further improvements are observed with larger vision backbones.

该论文通过利用大型语言模型（LLM）来增强视觉定位，解决了开放词汇对象检测（OVD）的挑战，而无需人工标注合成数据。方法称为LED，通过零初始化的交叉注意力适配器直接将LLM的隐藏状态融合到检测器中。研究发现，仅调整LLM的早期层即可获得显著的性能提升。使用Swin-T作为视觉编码器，Qwen2-0.5B + LED在OmniLabel上将GroundingDINO的性能提高了3.82%，且计算开销较小，进一步使用更大的视觉骨干网络可获得更大的改进。
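
Why zero initialization matters can be shown with a toy cross-attention adapter in plain Python. This is only a sketch under stated assumptions: the query/key/value projections are left as the identity, and "zero-initialized" is modeled as a scalar `gate` that starts at 0 (the paper may instead zero-initialize a projection layer). The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention with identity projections."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(dim)]
        )
    return out


def zero_init_adapter(det_feats, llm_states, gate=0.0):
    """Residual fusion: detector features plus gated attention over LLM states."""
    attended = cross_attention(det_feats, llm_states)
    return [
        [d + gate * a for d, a in zip(d_row, a_row)]
        for d_row, a_row in zip(det_feats, attended)
    ]


det = [[1.0, 0.0], [0.0, 1.0]]    # toy detector features
llm = [[0.5, 0.5], [1.0, -1.0]]   # toy LLM hidden states
at_init = zero_init_adapter(det, llm)            # gate = 0: detector unchanged
after_training = zero_init_adapter(det, llm, gate=1.0)
```

With the gate at zero the pre-trained detector behaves exactly as before, so training starts from its original performance and the LLM signal is blended in gradually as the gate is learned.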

LLMs Can Compensate for Deficiencies in Visual Representations

Authors: Sho Takishita, Jay Gala, Abdelrahman Mohamed, Kentaro Inui, Yova Kementchedjhieva

Venue: EMNLP 2025

First: 2025-06-05T12:04:59+00:00 · Latest: 2025-09-19T15:33:50+00:00

Comments: EMNLP 2025 Findings


Abstract

Many vision-language models (VLMs) that prove very effective at a range of multimodal tasks build on CLIP-based vision encoders, which are known to have various limitations. We investigate the hypothesis that the strong language backbone in VLMs compensates for possibly weak visual features by contextualizing or enriching them. Using three CLIP-based VLMs, we perform controlled self-attention ablations on a carefully designed probing task. Our findings show that despite known limitations, CLIP visual representations offer ready-to-read semantic information to the language decoder. However, in scenarios of reduced contextualization in the visual representations, the language decoder can largely compensate for the deficiency and recover performance. This suggests a dynamic division of labor in VLMs and motivates future architectures that offload more visual processing to the language decoder.

中文标题/摘要

标题：LLMs可以弥补视觉表示的不足

许多在多种多模态任务中表现出色的视觉-语言模型（VLMs）基于CLIP的视觉编码器，而这些编码器已知存在各种局限性。我们研究了VLMs中的强大语言骨干是否通过上下文化或丰富视觉特征来补偿可能较弱的视觉特征。使用三个基于CLIP的VLMs，我们在精心设计的探针任务上进行了控制的自注意力消融实验。我们的发现表明，尽管存在已知局限性，但CLIP的视觉表示仍能为语言解码器提供易于阅读的语义信息。然而，在视觉表示减少上下文化的情况下，语言解码器可以大量补偿这一缺陷并恢复性能。这表明VLMs中存在动态分工，并激励未来将更多视觉处理卸载到语言解码器的架构。

Summary / 总结

The study investigates how language models in vision-language models (VLMs) compensate for weak visual features in CLIP-based encoders. By performing controlled self-attention ablations on three CLIP-based VLMs, the researchers found that the language backbone can contextualize or enrich visual representations, allowing the models to maintain performance despite known limitations in visual features. This suggests a dynamic division of labor between visual and language processing in VLMs, which could inform future model designs that rely more on language for visual tasks.

研究探讨了语言模型在视觉语言模型（VLMs）中如何补偿CLIP基视觉编码器中的弱视觉特征。研究人员使用了三个CLIP基VLMs进行了控制的自注意力消融实验，并发现语言骨干可以对视觉特征进行上下文化或丰富，即使这些特征本身较弱。研究结果表明，语言解码器可以很大程度上补偿视觉表示的缺陷，这表明VLMs中存在动态分工，并建议未来架构将更多的视觉处理任务卸载到语言解码器上。

Randomized Smoothing Meets Vision-Language Models

Authors: Emmanouil Seferis, Changshun Wu, Stefanos Kollias, Saddek Bensalem, Chih-Hong Cheng

Venue: EMNLP 2025

First: 2025-09-19T15:33:22+00:00 · Latest: 2025-09-19T15:33:22+00:00

Comments: EMNLP'25 full version, including appendix (proofs, additional experiments)


Abstract

Randomized smoothing (RS) is one of the prominent techniques to ensure the correctness of machine learning models, where point-wise robustness certificates can be derived analytically. While RS is well understood for classification, its application to generative models is unclear, since their outputs are sequences rather than labels. We resolve this by connecting generative outputs to an oracle classification task and showing that RS can still be enabled: the final response can be classified as a discrete action (e.g., service-robot commands in VLAs), as harmful vs. harmless (content moderation or toxicity detection in VLMs), or even applying oracles to cluster answers into semantically equivalent ones. Provided that the error rate for the oracle classifier comparison is bounded, we develop the theory that associates the number of samples with the corresponding robustness radius. We further derive improved scaling laws analytically relating the certified radius and accuracy to the number of samples, showing that the earlier result of 2 to 3 orders of magnitude fewer samples sufficing with minimal loss remains valid even under weaker assumptions. Together, these advances make robustness certification both well-defined and computationally feasible for state-of-the-art VLMs, as validated against recent jailbreak-style adversarial attacks.

中文标题/摘要

标题：随机平滑技术与视觉语言模型

随机平滑（RS）是确保机器学习模型正确性的主要技术之一，其中可以分析地推导出逐点鲁棒性证书。虽然RS在分类任务中已经得到了很好的理解，但将其应用于生成模型却不清楚，因为生成模型的输出是序列而不是标签。我们通过将生成输出连接到一个预言机（oracle）分类任务来解决这一问题，并展示了RS仍然可以启用：最终响应可以被分类为离散动作（例如，VLAs中的服务机器人命令）、有害 vs. 无害（VLM中的内容审核或毒性检测），甚至可以应用预言机将答案聚类为语义等价的类别。只要预言机分类器比较的错误率是有界的，我们发展了理论，将样本数量与相应的鲁棒性半径关联起来。进一步地，我们通过分析推导出改进的缩放定律，将认证半径和准确性与样本数量联系起来，表明即使在较弱的假设下，先前关于样本数量减少2到3个数量级且损失最小的结论仍然有效。这些进展使得对于最先进的VLMs来说，鲁棒性认证既定义明确又具有计算可行性，并通过最近的越狱式对抗攻击得到了验证。

Summary / 总结

The paper extends randomized smoothing (RS) to generative vision-language models (VLMs), whose outputs are sequences rather than labels. It maps generative outputs to an oracle classification task, which makes robustness certification well-defined. The authors develop theory linking the number of samples to the robustness radius and derive improved scaling laws, showing that 2 to 3 orders of magnitude fewer samples suffice with minimal loss even under weaker assumptions. This makes certification computationally feasible for state-of-the-art VLMs, as validated against jailbreak-style adversarial attacks.

论文探讨了将随机化平滑（RS）应用于视觉语言模型（VLMs）中的生成模型。通过将生成输出连接到一个预言机（oracle）分类任务，RS可以用于命令分类、有害内容检测或聚类语义等价答案等任务。作者发展了将样本数量与鲁棒性半径关联的理论，并推导出改进的缩放定律，表明即使在较弱的假设下，RS也能在最小损失的情况下实现鲁棒性认证。这使得最先进的VLMs的鲁棒性认证在计算上是可行的，并在越狱式对抗攻击上得到了验证。
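
As background for the sample-count versus radius trade-off, the following Python sketch shows the standard randomized smoothing certificate for a binary decision (in the style of Cohen et al.), not the improved bounds derived in this paper. The `classify` callable is a hypothetical stand-in for "run the generative model, then apply the oracle classifier". For simplicity a one-sided Hoeffding bound replaces the usual Clopper-Pearson interval; that swap is an assumption of this sketch and gives a looser bound.

```python
import math
import random
from collections import Counter
from statistics import NormalDist


def smoothed_predict_and_radius(classify, x, sigma, n, delta=0.001, seed=0):
    """Majority vote under Gaussian noise plus a certified L2 radius.

    Returns (label, radius), or (None, 0.0) when the vote is too close to certify.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[classify(noisy)] += 1
    top_label, top_count = votes.most_common(1)[0]
    # Hoeffding lower bound on the top-class probability, valid with prob. >= 1 - delta.
    p_lower = top_count / n - math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0  # abstain: cannot certify
    # Certified radius R = sigma * Phi^{-1}(p_lower).
    return top_label, sigma * NormalDist().inv_cdf(p_lower)


# Toy oracle: the "response" is harmful whenever the first input coordinate is positive.
toy_oracle = lambda z: "harmful" if z[0] > 0 else "harmless"
label, radius = smoothed_predict_and_radius(toy_oracle, [3.0], sigma=0.5, n=1000)
```

The Hoeffding term shrinks like `1 / sqrt(n)`, which is why the certified radius grows only slowly with the number of samples and why sample-efficient bounds matter for expensive VLM calls.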

See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model

Authors: Pengteng Li, Pinhao Song, Wuyang Li, Weiyu Guo, Huizai Yao, Yijie Xu, Dugang Liu, Hui Xiong

Venue: NeurIPS 2025

First: 2025-09-19T15:30:26+00:00 · Latest: 2025-09-19T15:30:26+00:00

Comments: Accepted by NeurIPS 2025


Abstract

We introduce SEE&TREK, the first training-free prompting framework tailored to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) under vision-only constraints. While prior efforts have incorporated modalities like depth or point clouds to improve spatial reasoning, purely visual-spatial understanding remains underexplored. SEE&TREK addresses this gap by focusing on two core principles: increasing visual diversity and motion reconstruction. For visual diversity, we conduct Maximum Semantic Richness Sampling, which employs an off-the-shelf perception model to extract semantically rich keyframes that capture scene structure. For motion reconstruction, we simulate visual trajectories and encode relative spatial positions into keyframes to preserve both spatial relations and temporal coherence. Our method is training- and GPU-free, requiring only a single forward pass, and can be seamlessly integrated into existing MLLMs. Extensive experiments on VSI-Bench and STI-Bench show that SEE&TREK consistently boosts the performance of various MLLMs across diverse spatial reasoning tasks, by up to +3.5%, offering a promising path toward stronger spatial intelligence.

中文标题/摘要

标题：See&Trek: 无需训练的空间提示框架以增强多模态大型语言模型的空间理解

我们介绍了SEE&TREK，这是首个无需训练的空间提示框架，旨在在仅视觉输入的约束下增强多模态大型语言模型（MLLMs）的空间理解能力。尽管先前的努力已经将深度或点云等模态纳入以提高空间推理能力，但纯粹的视觉空间理解仍然未被充分探索。SEE&TREK通过关注两个核心原则来填补这一空白：增加视觉多样性以及运动重建。对于视觉多样性，我们进行了最大语义丰富度采样，利用现成的感知模型提取语义丰富的关键帧以捕捉场景结构。对于运动重建，我们模拟了视觉轨迹并将相对空间位置编码到关键帧中，以保持空间关系和时间连贯性。我们的方法无需训练和GPU，只需一次前向传播，即可无缝集成到现有的MLLM中。在VSI-Bench和STI-Bench上的广泛实验表明，SEE&TREK在各种空间推理任务中持续提升了MLLMs的表现，最高提升达+3.5%，为更强的空间智能提供了有希望的途径。

Summary / 总结

SEE&TREK is a training-free prompting framework designed to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) using only visual data. It increases visual diversity through Maximum Semantic Richness Sampling and reconstructs motion to preserve spatial and temporal coherence. Experiments show a consistent improvement in MLLM performance across various spatial reasoning tasks, with the highest gain of +3.5%.

SEE&TREK 是一个无需训练的提示框架，旨在通过纯视觉输入增强多模态大型语言模型（MLLMs）的空间理解能力。它通过最大语义丰富性采样增加视觉多样性，并通过模拟视觉轨迹进行运动重建。实验表明，SEE&TREK 可以提高 MLLM 在各种空间推理任务中的表现，最多提升 3.5%，为增强 MLLM 的空间智能提供了有希望的方法。

Compose by Focus: Scene Graph-based Atomic Skills

Authors: Han Qi, Changhe Chen, Heng Yang

First: 2025-09-19T15:03:18+00:00 · Latest: 2025-09-19T15:03:18+00:00


Abstract

A key requirement for generalist robots is compositional generalization - the ability to combine atomic skills to solve complex, long-horizon tasks. While prior work has primarily focused on synthesizing a planner that sequences pre-learned skills, robust execution of the individual skills themselves remains challenging, as visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, thereby mitigating sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and further combine "focused" scene-graph skills with a vision-language model (VLM) based task planner. Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.

中文标题/摘要

标题：聚焦创作：基于场景图的原子技能

通用机器人的一项关键要求是组合通用化能力——将原子技能组合起来解决复杂的、长期的任务。尽管先前的工作主要集中在合成一个规划器来顺序使用预学习的技能，但个体技能的稳健执行仍然具有挑战性，因为场景组合引起的分布变化往往会导致视觉运动策略失效。为了解决这个问题，我们引入了一种基于场景图的表示方法，该方法专注于与任务相关的对象和关系，从而减轻了对无关变化的敏感性。在此基础上，我们开发了一种场景图技能学习框架，该框架结合了图神经网络和基于扩散的模仿学习，并进一步将“聚焦”的场景图技能与基于视觉语言模型的任务规划器结合在一起。在模拟和真实世界操作任务中的实验表明，与最先进的基线相比，成功率显著提高，突显了在长期任务中改进的鲁棒性和组合通用化能力。

Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study

Authors: DongGeon Lee, Joonwon Jang, Jihae Jeong, Hwanjo Yu

Venue: EMNLP 2025

First: 2025-05-21T11:26:40+00:00 · Latest: 2025-09-19T13:54:09+00:00

Comments: Accepted to EMNLP 2025


Abstract

Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms. MemeSafetyBench is publicly available at https://github.com/oneonlee/Meme-Safety-Bench.

中文标题/摘要

标题：视觉-语言模型在野外安全吗？基于梗的基准研究

视觉-语言模型（VLMs）的快速部署放大了安全风险，但大多数评估依赖于人工图像。本研究提出的问题是：当面对普通用户分享的梗图像时，当前的VLMs有多安全？为了探讨这一问题，我们引入了MemeSafetyBench基准，包含50,430个实例，将真实的梗图像与有害和无害的指令配对。利用全面的安全分类和基于LLM的指令生成，我们评估了多个VLMs在单轮和多轮交互中的表现。我们研究了真实世界的梗如何影响有害输出，对话背景的缓解效果，以及模型规模与安全指标之间的关系。研究结果表明，VLMs对基于梗的有害提示比对合成或文本图像更脆弱。梗显著增加了有害响应并减少了拒绝率。尽管多轮交互提供了一定的缓解，但脆弱性仍然存在。这些结果强调了生态有效评估和更强的安全机制的必要性。MemeSafetyBench可在https://github.com/oneonlee/Meme-Safety-Bench获取。

Summary / 总结

This study evaluates the safety of vision-language models (VLMs) using a new benchmark called MemeSafetyBench, which pairs real meme images with both harmful and benign instructions. The research finds that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images, with memes significantly increasing harmful responses and decreasing refusals. Multi-turn interactions provide some mitigation but do not fully resolve the elevated vulnerability. The study emphasizes the need for ecologically valid evaluations and stronger safety mechanisms.

该研究使用包含50,430个真实表情包图像的新基准MemeSafetyBench，这些图像与有害和良性指令配对，评估视觉-语言模型（VLMs）的安全性。研究发现，VLMs对基于表情包的有害提示比对合成或文本图像更脆弱，表情包会增加有害响应并减少拒绝。虽然多轮交互提供了一定的缓解，但VLMs对基于表情包的输入的高脆弱性并未完全解决。

CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation

Authors: Marc Lafon, Gustavo Adolfo Vargas Hakim, Clément Rambour, Christian Desrosier, Nicolas Thome

Venue: NeurIPS 2025

First: 2025-07-18T18:32:17+00:00 · Latest: 2025-09-19T12:52:46+00:00


Abstract

Vision-language models (VLMs) like CLIP exhibit strong zero-shot capabilities but often fail to generalize under distribution shifts. Test-time adaptation (TTA) allows models to update at inference time without labeled data, typically via entropy minimization. However, this objective is fundamentally misaligned with the contrastive image-text training of VLMs, limiting adaptation performance and introducing failure modes such as pseudo-label drift and class collapse. We propose CLIPTTA, a new gradient-based TTA method for vision-language models that leverages a soft contrastive loss aligned with CLIP's pre-training objective. We provide a theoretical analysis of CLIPTTA's gradients, showing how its batch-aware design mitigates the risk of collapse. We further extend CLIPTTA to the open-set setting, where both in-distribution (ID) and out-of-distribution (OOD) samples are encountered, using an Outlier Contrastive Exposure (OCE) loss to improve OOD detection. Evaluated on 75 datasets spanning diverse distribution shifts, CLIPTTA consistently outperforms entropy-based objectives and is highly competitive with state-of-the-art TTA methods, outperforming them on a large number of datasets and exhibiting more stable performance across diverse shifts.

中文标题/摘要

标题：CLIPTTA：稳健的对比视觉-语言测试时适应

视觉-语言模型（VLMs）如CLIP表现出强大的零样本能力，但在分布偏移下往往无法泛化。测试时适应（TTA）允许模型在推理时无需标注数据即可更新，通常通过熵最小化实现。然而，这一目标与VLMs的对比图像-文本预训练目标从根本上不一致，限制了适应性能并引入了伪标签漂移和类别坍缩等失败模式。我们提出CLIPTTA，这是一种新的基于梯度的TTA方法，利用与CLIP预训练目标对齐的软对比损失。我们对CLIPTTA的梯度进行了理论分析，展示了其批处理感知设计如何减轻坍缩的风险。我们进一步将CLIPTTA扩展到开放集设置，其中同时遇到分布内（ID）和分布外（OOD）样本，使用异常对比暴露（OCE）损失以提高OOD检测。在涵盖多种分布偏移的75个数据集上评估，CLIPTTA始终优于基于熵的目标，并与最先进的TTA方法相比极具竞争力，在大量数据集上超过它们，且在多种偏移下表现出更稳定的性能。

Summary / 总结

The research aims to improve the generalization of vision-language models like CLIP under distribution shifts by proposing CLIPTTA, a new gradient-based test-time adaptation method. CLIPTTA uses a soft contrastive loss aligned with CLIP's pre-training objective, which mitigates the risk of collapse and pseudo-label drift. It also introduces an Outlier Contrastive Exposure (OCE) loss for open-set settings, enhancing OOD detection. Experiments on 75 diverse datasets show that CLIPTTA outperforms entropy-based methods and is highly competitive with state-of-the-art TTA methods, demonstrating more stable performance across various distribution shifts.

研究旨在提高如CLIP这类视觉-语言模型在分布变化下的泛化能力。提出了一种新的基于梯度的测试时自适应方法CLIPTTA，该方法使用与CLIP预训练目标对齐的软对比损失。CLIPTTA在各种分布变化下优于基于熵的目标，并与最先进的测试时自适应方法竞争，特别是在许多数据集上表现出色，并且在不同变化中表现出更稳定的性能。

Sparse Multiview Open-Vocabulary 3D Detection

Authors: Olivier Moliner, Viktor Larsson, Kalle Åström

Venue: ICCV 2025

First: 2025-09-19T12:22:24+00:00 · Latest: 2025-09-19T12:22:24+00:00

Comments: ICCV 2025; OpenSUN3D Workshop; Camera ready version


Abstract

The ability to interpret and comprehend a 3D scene is essential for many vision and robotics systems. In numerous applications, this involves 3D object detection, i.e., identifying the location and dimensions of objects belonging to a specific category, typically represented as bounding boxes. This has traditionally been solved by training to detect a fixed set of categories, which limits its use. In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting, where only a limited number of posed RGB images are available as input. Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion or requiring 3D-specific learning. By lifting 2D detections and directly optimizing 3D proposals for featuremetric consistency across views, we fully leverage the extensive training data available in 2D compared to 3D. Through standard benchmarks, we demonstrate that this simple pipeline establishes a powerful baseline, performing competitively with state-of-the-art techniques in densely sampled scenarios while significantly outperforming them in the sparse-view setting.

中文标题/摘要

标题：稀疏多视图开放词汇3D检测

理解和解释3D场景的能力对于许多视觉和机器人系统至关重要。在许多应用中，这涉及3D物体检测，即识别特定类别中物体的位置和尺寸，通常用边界框表示。这传统上通过训练来检测固定类别集来解决，这限制了其应用范围。在本文中，我们研究了在只有有限数量的摆姿RGB图像可用的挑战性但实用的稀疏视图设置下的开放词汇3D物体检测。我们的方法是无需训练的，依赖于预训练的现成2D基础模型，而不是使用计算昂贵的3D特征融合或需要3D特定学习。通过提升2D检测结果并直接优化视图间特征度量一致性来优化3D提案，我们充分利用了2D中可用的大量训练数据，而3D中则较少。通过标准基准测试，我们证明了这个简单的流水线建立了强大的基线，在密集采样场景中与最先进的技术竞争，而在稀疏视图设置中显著优于它们。

Summary / 总结

This work addresses the challenge of 3D object detection in a sparse-view setting with limited posed RGB images. The approach leverages pre-trained 2D models to detect objects and optimize 3D proposals for feature consistency across views, avoiding expensive 3D feature fusion. Experiments show that the method performs competitively in dense scenarios and outperforms existing techniques in sparse-view settings.

该研究针对稀疏视角下仅有限数量的摆姿RGB图像进行3D物体检测的挑战，关注开放词汇检测而不进行训练。方法利用预训练的2D模型提升2D检测结果并优化3D提案以实现特征度量一致性。实验结果显示，在密集场景中表现与最先进的技术相当，在稀疏视角设置中则显著优于它们。

ViLU: Learning Vision-Language Uncertainties for Failure Prediction

Authors: Marc Lafon, Yannis Karmim, Julio Silva-Rodríguez, Paul Couairon, Clément Rambour, Raphaël Fournier-Sniehotta, Ismail Ben Ayed, Jose Dolz, Nicolas Thome

Venue: ICCV 2025

First: 2025-07-10T10:41:13+00:00 · Latest: 2025-09-19T12:00:01+00:00


Abstract

Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: https://github.com/ykrmm/ViLU.

中文标题/摘要

标题：ViLU：学习视觉-语言不确定性以进行故障预测

可靠的不确定性量化（UQ）和故障预测仍然是视觉-语言模型（VLMs）的开放挑战。我们引入了ViLU，这是一种新的视觉-语言不确定性量化框架，通过利用所有与任务相关的文本表示来上下文化不确定性估计。ViLU 通过交叉注意力将视觉嵌入、预测的文本嵌入和图像条件下的文本表示整合，构建一个具有不确定性的多模态表示。与基于损失预测的传统UQ方法不同，ViLU 通过加权二元交叉熵损失训练一个不确定性预测器，以二元分类器的形式区分正确和错误的预测，使其与损失无关。特别是，我们提出的方法适用于后验场景，在这种场景中，仅可用视觉和文本嵌入，而无法直接访问模型本身。在多种数据集上的广泛实验表明，与最先进的故障预测方法相比，我们的方法具有显著优势。我们将该方法应用于标准分类数据集，如ImageNet-1k，以及大规模图像-描述数据集，如CC12M和LAION-400M。消融研究强调了我们架构和训练在实现有效不确定性量化中的关键作用。我们的代码已公开，并可在此找到：https://github.com/ykrmm/ViLU。

Summary / 总结

The paper introduces ViLU, a new framework for quantifying uncertainties in Vision-Language Models (VLMs) by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation using cross-attention and trains an uncertainty predictor as a binary classifier. Experiments on various datasets demonstrate that ViLU outperforms existing methods in failure prediction, making it particularly useful in post-hoc settings where only vision and text embeddings are available. Ablation studies confirm the effectiveness of the proposed architecture and training method.

论文提出了ViLU框架，通过利用所有任务相关的文本表示来量化Vision-Language模型中的不确定性。ViLU使用交叉注意力构建了一个不确定性感知的多模态表示，并训练一个二元分类器来预测不确定性。在多种数据集上的实验表明，ViLU在失败预测方面优于现有方法，特别是在只有视觉和文本嵌入可用的后验场景中特别有用。消融研究证实了所提出架构和训练方法的有效性。
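
The loss named in the abstract, a weighted binary cross-entropy for a correct-versus-incorrect classifier, can be written out directly. This Python sketch assumes label 1 means "the VLM's prediction was wrong" and uses a single hypothetical `pos_weight` to up-weight those failure cases; the exact weighting scheme used by ViLU is not stated in the abstract.

```python
import math


def weighted_bce(probs, labels, pos_weight=1.0):
    """Mean weighted binary cross-entropy.

    probs:  predicted failure probabilities in [0, 1]
    labels: 1 if the VLM prediction was incorrect, else 0 (assumed convention)
    pos_weight: extra weight on the failure class (hypothetical parameter)
    """
    eps = 1e-12  # clamp to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Two uncertain predictions, one failure and one success, failures weighted 3x.
loss = weighted_bce([0.5, 0.5], [1, 0], pos_weight=3.0)
```

Because the target is simply whether the prediction was right, the uncertainty head does not need to know which loss trained the underlying VLM, which is the sense in which the abstract calls the approach loss-agnostic.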

cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning

Authors: Maksim Kolodiazhnyi, Denis Tarasov, Dmitrii Zhemchuzhnikov, Alexander Nikulin, Ilya Zisman, Anna Vorontsova, Anton Konushin, Vladislav Kurenkov, Danila Rukhovich

First: 2025-05-28T22:32:31+00:00 · Latest: 2025-09-19T09:57:39+00:00


Abstract

Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLM), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback obtained programmatically. Furthermore, we are the first to explore RL fine-tuning of LLMs for CAD tasks, demonstrating that online RL algorithms such as Group Relative Policy Optimization (GRPO) outperform offline alternatives. On the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets new state-of-the-art on three challenging datasets, including a real-world one.

中文标题/摘要

标题：cadrille：多模态CAD重建与在线强化学习

计算机辅助设计（CAD）在工程和制造中发挥着核心作用，使创建精确且可编辑的3D模型成为可能。使用各种传感器或用户提供的数据作为CAD重建的输入可以促进设计应用程序的普及。然而，现有方法通常仅专注于单一输入模态，如点云、图像或文本，这限制了它们的通用性和鲁棒性。利用视觉语言模型（VLM）的最新进展，我们提出了一种多模态CAD重建模型，同时处理所有三种输入模态。受大型语言模型（LLM）训练范式的启发，我们采用两阶段管道：在大规模程序生成数据上进行监督微调（SFT），然后使用通过程序化方式获得的在线反馈进行强化学习（RL）微调。此外，我们首次探索了针对CAD任务对LLMs进行RL微调，表明在线RL算法（如组相对策略优化，GRPO）优于离线替代方案。在DeepCAD基准测试中，我们的SFT模型在所有三种输入模态上均优于现有单模态方法。更重要的是，经过RL微调后，cadrille在三个具有挑战性的数据集中均达到了新的最佳性能，包括一个真实世界的数据集。

Summary / 总结

The research aims to improve CAD reconstruction by utilizing multi-modal inputs and online reinforcement learning. The method involves a two-stage pipeline: supervised fine-tuning on large-scale procedurally generated data, followed by reinforcement learning fine-tuning using online feedback. Key experimental findings show that the proposed model outperforms existing single-modal approaches in the DeepCAD benchmark and sets new state-of-the-art on three challenging datasets after RL fine-tuning, including a real-world dataset.

研究旨在通过利用多模态输入和在线强化学习来改进CAD重建。方法包括两阶段管道：在大规模程序生成数据上进行监督微调，然后使用在线反馈进行强化学习微调。实验结果表明，提出的模型在DeepCAD基准测试中优于现有单模态方法，并在三个具有挑战性的数据集中（包括一个真实世界的数据集）达到了新的最佳性能。
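
The cadrille abstract above names Group Relative Policy Optimization (GRPO) as its online RL algorithm. As a rough illustration of the group-relative idea only, the sketch below normalizes each sampled output's reward against the other samples drawn for the same input. The function name, the reward values, and the use of the population standard deviation are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sample relative to its own sampling group.

    One common GRPO-style formulation: subtract the group mean and
    divide by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four CAD programs sampled for the same input.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Samples that beat the group average receive a positive advantage and are reinforced; samples below it are discouraged, without needing a separate learned value function.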

RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding

Authors: Tianchen Fang, Guiru Liu

First: 2025-08-07T10:32:03+00:00 · Latest: 2025-09-19T09:46:41+00:00

Comments: Upon further review, we identified that our dataset requires optimization to ensure research reliability and accuracy. Additionally, considering the target journal's latest submission policies, we believe comprehensive manuscript revisions are necessary

Abs · PDF · Code1 · Code2

Abstract

Medical image understanding plays a crucial role in enabling automated diagnosis and data-driven clinical decision support. However, its progress is impeded by two primary challenges: the limited availability of high-quality annotated medical data and an overreliance on global image features, which often miss subtle but clinically significant pathological regions. To address these issues, we introduce RegionMed-CLIP, a region-aware multimodal contrastive learning framework that explicitly incorporates localized pathological signals along with holistic semantic representations. The core of our method is an innovative region-of-interest (ROI) processor that adaptively integrates fine-grained regional features with the global context, supported by a progressive training strategy that enhances hierarchical multimodal alignment. To enable large-scale region-level representation learning, we construct MedRegion-500k, a comprehensive medical image-text corpus that features extensive regional annotations and multilevel clinical descriptions. Extensive experiments on image-text retrieval, zero-shot classification, and visual question answering tasks demonstrate that RegionMed-CLIP consistently exceeds state-of-the-art vision language models by a wide margin. Our results highlight the critical importance of region-aware contrastive pre-training and position RegionMed-CLIP as a robust foundation for advancing multimodal medical image understanding.

中文标题/摘要

标题：RegionMed-CLIP：一种区域感知多模态对比学习预训练模型，用于医学图像理解

医学图像理解在实现自动化诊断和数据驱动的临床决策支持中起着关键作用。然而，其进展受到两个主要挑战的阻碍：高质量标注医学数据的有限可用性和对全局图像特征的过度依赖，这往往忽略了细微但具有临床意义的病理区域。为了解决这些问题，我们引入了RegionMed-CLIP，这是一种区域感知的多模态对比学习框架，明确地结合了局部病理信号与整体语义表示。我们方法的核心是一种创新的感兴趣区域（ROI）处理器，它可以自适应地将细粒度的区域特征与全局上下文结合起来，并通过逐步训练策略增强层次多模态对齐。为了实现大规模区域级表示学习，我们构建了MedRegion-500k，这是一个包含广泛区域注释和多层次临床描述的医学图像-文本语料库。在图像-文本检索、零样本分类和视觉问答任务上的广泛实验表明，RegionMed-CLIP在所有方面都远远超过了最先进的视觉语言模型。我们的结果强调了区域感知对比预训练的重要性，并将RegionMed-CLIP定位为多模态医学图像理解进步的坚实基础。

CIDER: A Causal Cure for Brand-Obsessed Text-to-Image Models

Authors: Fangjian Shen, Zifeng Liang, Chao Wang, Wushao Wen

First: 2025-09-19T09:30:37+00:00 · Latest: 2025-09-19T09:30:37+00:00

Comments: 5 pages, 7 figures, submitted to ICASSP2026

Abs · PDF · Code1 · Code2

Abstract

Text-to-image (T2I) models exhibit a significant yet under-explored "brand bias", a tendency to generate contents featuring dominant commercial brands from generic prompts, posing ethical and legal risks. We propose CIDER, a novel, model-agnostic framework to mitigate bias at inference-time through prompt refinement to avoid costly retraining. CIDER uses a lightweight detector to identify branded content and a Vision-Language Model (VLM) to generate stylistically divergent alternatives. We introduce the Brand Neutrality Score (BNS) to quantify this issue and perform extensive experiments on leading T2I models. Results show CIDER significantly reduces both explicit and implicit biases while maintaining image quality and aesthetic appeal. Our work offers a practical solution for more original and equitable content, contributing to the development of trustworthy generative AI.

中文标题/摘要

标题：CIDER：因果纠偏，应对品牌痴迷的文本到图像模型

文本到图像（T2I）模型表现出一种显著但尚未充分探索的“品牌偏见”，即从通用提示生成内容时倾向于包含主导的商业品牌，这带来了伦理和法律风险。我们提出了一种名为CIDER的新型、模型无关框架，在推理时通过提示优化来减轻偏见，避免重新训练的高昂成本。CIDER使用一个轻量级检测器识别品牌内容，并使用视觉-语言模型（VLM）生成风格迥异的替代方案。我们引入了品牌中立度评分（BNS）来量化这一问题，并在领先T2I模型上进行了大量实验。结果表明，CIDER显著减少了显性和隐性的偏见，同时保持了图像质量和审美吸引力。我们的工作提供了一种实用的解决方案，以生成更原创和公平的内容，促进可信生成AI的发展。

Summary / 总结

The paper addresses the issue of brand bias in text-to-image models, which tend to generate images featuring dominant commercial brands from generic prompts. To mitigate this, the authors propose CIDER, a model-agnostic framework that uses a lightweight detector to identify branded content and a Vision-Language Model to generate alternative images. Experiments show that CIDER effectively reduces both explicit and implicit biases while maintaining image quality and aesthetic appeal. This work provides a practical solution for generating more original and equitable content, contributing to the development of trustworthy generative AI.

论文针对文本到图像模型中存在的品牌偏见问题，这类模型倾向于从通用提示中生成包含主导商业品牌的图像。为解决这一问题，作者提出了一种名为CIDER的模型无关框架，该框架利用轻量级检测器识别品牌内容，并使用视觉语言模型生成风格不同的替代图像。实验结果显示，CIDER能够有效减少显性和隐性的偏见，同时保持图像质量和视觉吸引力。这项工作提供了一种生成更具原创性和公平性的内容的实用解决方案，有助于推动可信生成AI的发展。

Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation

Authors: Weimin Bai, Yubo Li, Weijian Luo, Wenzheng Chen, He Sun

First: 2025-09-19T08:54:52+00:00 · Latest: 2025-09-19T08:54:52+00:00

Abs · PDF · Code1 · Code2

Abstract

Score Distillation Sampling (SDS) enables high-quality text-to-3D generation by supervising 3D models through the denoising of multi-view 2D renderings, using a pretrained text-to-image diffusion model to align with the input prompt and ensure 3D consistency. However, existing SDS-based methods face two fundamental limitations: (1) their reliance on CLIP-style text encoders leads to coarse semantic alignment and struggles with fine-grained prompts; and (2) 2D diffusion priors lack explicit 3D spatial constraints, resulting in geometric inconsistencies and inaccurate object relationships in multi-object scenes. To address these challenges, we propose VLM3D, a novel text-to-3D generation framework that integrates large vision-language models (VLMs) into the SDS pipeline as differentiable semantic and spatial priors. Unlike standard text-to-image diffusion priors, VLMs leverage rich language-grounded supervision that enables fine-grained prompt alignment. Moreover, their inherent vision language modeling provides strong spatial understanding, which significantly enhances 3D consistency for single-object generation and improves relational reasoning in multi-object scenes. We instantiate VLM3D based on the open-source Qwen2.5-VL model and evaluate it on the GPTeval3D benchmark. Experiments across diverse objects and complex scenes show that VLM3D significantly outperforms prior SDS-based methods in semantic fidelity, geometric coherence, and spatial correctness.

中文标题/摘要

标题：视觉-语言模型作为可微语义和空间奖励用于文本到3D生成

Score Distillation Sampling (SDS) 通过监督3D模型对多视角2D渲染的去噪，使文本到3D生成达到高质量。使用预训练的文本到图像扩散模型与输入提示对齐，确保3D一致性。然而，现有的SDS方法面临两个根本性限制：（1）依赖CLIP风格的文本编码器导致粗略的语义对齐，并且难以处理精细的提示；（2）2D扩散先验缺乏明确的空间约束，导致几何不一致和多对象场景中物体关系的不准确。为了解决这些挑战，我们提出VLM3D，这是一种新颖的文本到3D生成框架，将大型视觉-语言模型（VLMs）整合到SDS管道中，作为可微语义和空间先验。与标准的文本到图像扩散先验不同，VLMs利用丰富的语言导向监督，实现精细的提示对齐。此外，其固有的视觉语言建模提供了强大的空间理解，显著提高了单个对象生成的3D一致性，并改善了多对象场景中的关系推理。我们基于开源的Qwen2.5-VL模型实例化VLM3D，并在GPTeval3D基准上进行评估。实验表明，VLM3D在语义保真度、几何连贯性和空间正确性方面显著优于先前的SDS方法。

Summary / 总结

The paper introduces VLM3D, a text-to-3D generation framework that integrates large vision-language models (VLMs) into the Score Distillation Sampling (SDS) pipeline to enhance semantic and spatial alignment. VLM3D addresses the limitations of existing SDS methods by providing fine-grained prompt alignment and strong spatial understanding, leading to improved semantic fidelity, geometric coherence, and spatial correctness in 3D generation tasks.

论文针对现有Score Distillation Sampling (SDS)方法在文本到3D生成中的局限性，如粗略的语义对齐和几何不一致问题，提出了VLM3D框架，将大型视觉语言模型作为可微分的语义和空间先验，增强细粒度提示对齐和3D一致性。实验表明，VLM3D在语义保真度、几何连贯性和空间正确性方面优于先前的方法。

GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning

Authors: Musen Lin, Minghao Liu, Taoran Lu, Lichen Yuan, Yiwei Liu, Haonan Xu, Yu Miao, Yuhao Chao, Zhaojian Li

First: 2025-09-19T08:09:18+00:00 · Latest: 2025-09-19T08:09:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Graphical User Interface (GUI) Agents, powered by large language and vision-language models, hold promise for enabling end-to-end automation in digital environments. However, their progress is fundamentally constrained by the scarcity of scalable, high-quality trajectory data. Existing data collection strategies either rely on costly and inconsistent manual annotations or on synthetic generation methods that trade off between diversity and meaningful task coverage. To bridge this gap, we present GUI-ReWalk: a reasoning-enhanced, multi-stage framework for synthesizing realistic and diverse GUI trajectories. GUI-ReWalk begins with a stochastic exploration phase that emulates human trial-and-error behaviors, and progressively transitions into a reasoning-guided phase where inferred goals drive coherent and purposeful interactions. Moreover, it supports multi-stride task generation, enabling the construction of long-horizon workflows across multiple applications. By combining randomness for diversity with goal-aware reasoning for structure, GUI-ReWalk produces data that better reflects the intent-aware, adaptive nature of human-computer interaction. We further train Qwen2.5-VL-7B on the GUI-ReWalk dataset and evaluate it across multiple benchmarks, including Screenspot-Pro, OSWorld-G, UI-Vision, AndroidControl, and GUI-Odyssey. Results demonstrate that GUI-ReWalk enables superior coverage of diverse interaction flows, higher trajectory entropy, and more realistic user intent. These findings establish GUI-ReWalk as a scalable and data-efficient framework for advancing GUI agent research and enabling robust real-world automation.

中文标题/摘要

标题：GUI-ReWalk：通过随机探索和意图驱动推理生成GUI代理的大量数据

图形用户界面（GUI）代理，由大规模语言和视觉语言模型驱动，有望在数字环境中实现端到端的自动化。然而，它们的进步从根本上受到可扩展的高质量轨迹数据稀缺性的限制。现有的数据收集策略要么依赖于昂贵且不一致的手动注释，要么依赖于在多样性和有意义的任务覆盖之间权衡的合成生成方法。为了解决这一差距，我们提出了GUI-ReWalk：一种增强推理的多阶段框架，用于合成现实且多样的GUI轨迹。GUI-ReWalk从一个随机探索阶段开始，模拟人类的试错行为，并逐步过渡到一个由推断出的目标引导的阶段，其中推断出的目标驱动连贯且有目的的交互。此外，它支持多步任务生成，使跨多个应用程序构建长时序工作流成为可能。通过结合随机性以实现多样性，并结合目标驱动的推理以实现结构，GUI-ReWalk生成的数据更好地反映了人类计算机交互的意图驱动和适应性。我们进一步在GUI-ReWalk数据集上训练Qwen2.5-VL-7B，并在Screenspot-Pro、OSWorld-G、UI-Vision、AndroidControl和GUI-Odyssey等多个基准上对其进行评估。结果表明，GUI-ReWalk能够实现更广泛的交互流程覆盖、更高的轨迹熵和更真实的用户意图。这些发现确立了GUI-ReWalk作为可扩展和数据高效框架，用于推进GUI代理研究和实现稳健的现实世界自动化。

Summary / 总结

The research aims to address the scarcity of high-quality trajectory data for GUI agents by proposing GUI-ReWalk, a multi-stage framework that combines stochastic exploration and intent-aware reasoning. The method involves an initial exploration phase followed by a reasoning-guided phase, enabling the generation of diverse and coherent GUI trajectories. Experimental results show that GUI-ReWalk improves coverage of interaction flows, trajectory entropy, and user intent realism, making it a scalable and data-efficient framework for advancing GUI agent research.

研究旨在解决GUI代理缺乏高质量轨迹数据的问题，这些代理由大型语言和视觉语言模型驱动。方法是采用名为GUI-ReWalk的多阶段框架，结合随机探索和意图导向推理来合成多样且逼真的GUI轨迹。关键实验发现表明，GUI-ReWalk在覆盖多样交互流程、增加轨迹熵以及生成更真实的用户意图方面优于现有方法。

Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models

Authors: Dilxat Muhtar, Enzhuo Zhang, Zhenshi Li, Feng Gu, Yanglangxing He, Pengfeng Xiao, Xueliang Zhang

First: 2025-03-02T05:44:56+00:00 · Latest: 2025-09-19T07:38:49+00:00

Comments: 39 pages, 13 figures. Accepted for NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) have demonstrated great potential in interpreting remote sensing (RS) images through language-guided semantics. However, the effectiveness of these VLMs critically depends on high-quality image-text training data that captures rich semantic relationships between visual content and language descriptions. Unlike natural images, RS lacks large-scale interleaved image-text pairs from web data, making data collection challenging. While current approaches rely primarily on rule-based methods or flagship VLMs for data synthesis, a systematic framework for automated quality assessment of such synthetically generated RS vision-language data is notably absent. To fill this gap, we propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs (e.g., Qwen2-VL) with the top 30% of data ranked by our score model achieves superior accuracy compared to both full-data fine-tuning and CLIP-score-based ranking approaches. Furthermore, we demonstrate applications of our scoring model for reinforcement learning (RL) training and best-of-N (BoN) test-time scaling, enabling significant improvements in VLM performance for RS tasks. Our code, model, and dataset are publicly available.

中文标题/摘要

标题：基于质量驱动的遥感视觉-语言数据精选通过学习评分模型

视觉-语言模型（VLMs）在通过语言引导的语义解释遥感（RS）图像方面展现了巨大的潜力。然而，这些VLMs的有效性高度依赖于能够捕捉视觉内容和语言描述之间丰富语义关系的高质量图像-文本训练数据。与自然图像不同，RS缺乏大规模的图像-文本配对数据，使得数据收集变得困难。尽管当前的方法主要依赖于基于规则的方法或旗舰VLMs进行数据合成，但系统性的框架来自动评估这种合成的RS视觉-语言数据的质量却明显缺失。为了填补这一空白，我们提出了一种新型评分模型，该模型基于大规模的RS视觉-语言偏好数据进行训练，用于自动质量评估。我们的实验证明，使用我们评分模型排名前30%的数据微调CLIP或先进的VLMs（例如Qwen2-VL）相比全数据微调和基于CLIP评分的排名方法，能够实现更高的准确性。此外，我们展示了我们评分模型在强化学习（RL）训练和最佳N（BoN）测试时缩放中的应用，这显著提高了VLM在RS任务中的性能。我们的代码、模型和数据集已公开

Summary / 总结

This paper addresses the challenge of obtaining high-quality image-text pairs for training vision-language models (VLMs) on remote sensing (RS) imagery. It proposes a scoring model, trained on large-scale RS vision-language preference data, that automates quality assessment of synthetically generated data. The study shows that fine-tuning CLIP or advanced VLMs with the top 30% of data ranked by the score model outperforms both full-data fine-tuning and CLIP-score-based ranking approaches. The scoring model also improves VLM performance on RS tasks when used for RL training and best-of-N (BoN) test-time scaling.

本文针对在遥感（RS）图像中收集高质量图像-文本对的挑战，提出了一种新型评分模型，该模型基于大规模RS视觉-语言偏好数据进行自动质量评估。研究表明，使用此评分模型排名的前30%的数据对CLIP或高级VLM进行微调，其性能优于全数据微调和基于CLIP评分的排名方法。此外，该评分模型还应用于强化学习和最佳的N次测试时缩放，提升了VLM在RS任务中的性能。
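
The entry above describes two uses of a learned score: keeping the top 30% of training pairs and picking the best of N candidate outputs at test time. The sketch below shows both selections under simple assumptions. The helper names and the toy score table are hypothetical stand-ins; the paper's actual scoring model is a trained network that is not reproduced here.

```python
def select_top_percent(samples, score_fn, top_percent=30):
    """Keep the highest-scoring top_percent of samples (at least one)."""
    ranked = sorted(samples, key=score_fn, reverse=True)
    keep = max(1, len(ranked) * top_percent // 100)
    return ranked[:keep]


def best_of_n(candidates, score_fn):
    """Return the candidate that the scoring function rates highest."""
    return max(candidates, key=score_fn)


# Hypothetical quality scores for ten image-text pairs.
toy_scores = {f"pair_{i}": i / 10 for i in range(10)}

curated = select_top_percent(list(toy_scores), toy_scores.get)
best = best_of_n(["pair_2", "pair_7", "pair_4"], toy_scores.get)
```

The same scoring function drives both steps, which is why a single learned scorer can serve data curation and test-time selection alike.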

Training-Free Pyramid Token Pruning for Efficient Large Vision-Language Models via Region, Token, and Instruction-Guided Importance

Authors: Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue

First: 2025-09-19T07:28:17+00:00 · Latest: 2025-09-19T07:28:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Vision-Language Models (LVLMs) have significantly advanced multimodal understanding but still struggle with efficiently processing high-resolution images. Recent approaches partition high-resolution images into multiple sub-images, dramatically increasing the number of visual tokens and causing exponential computational overhead during inference. To address these limitations, we propose a training-free token pruning strategy, Pyramid Token Pruning (PTP), that integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided importance. Inspired by human visual attention mechanisms, PTP selectively retains more tokens from visually salient regions and further leverages textual instructions to pinpoint tokens most relevant to specific multimodal tasks. Extensive experiments across 13 diverse benchmarks demonstrate that our method substantially reduces computational overhead and inference latency with minimal performance loss.

中文标题/摘要

标题：无需训练的金字塔令牌剪枝：通过区域、令牌和指令引导的重要性进行高效大型视觉-语言模型剪枝

大型视觉-语言模型（LVLMs）在多模态理解方面取得了显著进展，但仍难以高效处理高分辨率图像。最近的方法将高分辨率图像划分为多个子图像，大幅增加了视觉令牌的数量，并在推理过程中导致了指数级的计算开销。为解决这些限制，我们提出了一种无需训练的令牌剪枝策略——金字塔令牌剪枝（PTP），该策略结合了自下而上的视觉显著性（在区域和令牌级别）和自上而下的指令引导的重要性。受人类视觉注意机制的启发，PTP 选择性地保留来自视觉显著区域的更多令牌，并进一步利用文本指令来确定与特定多模态任务最相关的令牌。在13个不同基准上的广泛实验表明，我们的方法在几乎不损失性能的情况下显著减少了计算开销和推理延迟。
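
The abstract above combines region-level saliency, token-level saliency, and instruction-guided importance to decide which visual tokens to keep. The sketch below is one plausible reading of that idea, not the authors' algorithm: the budget split proportional to region saliency, the weighted sum controlled by `alpha`, and the data layout are all assumptions made for illustration.

```python
def pyramid_prune(regions, keep_ratio=0.5, alpha=0.5):
    """Return the ids of the tokens to keep.

    regions: list of dicts with
      "saliency": float, region-level visual saliency
      "tokens":   list of (token_id, token_saliency, instruction_relevance)
    """
    total_tokens = sum(len(r["tokens"]) for r in regions)
    budget = max(1, int(total_tokens * keep_ratio))
    total_saliency = sum(r["saliency"] for r in regions)

    kept = []
    for region in regions:
        # Region level: more salient regions get a larger share of the budget.
        share = region["saliency"] / total_saliency
        k = min(len(region["tokens"]), max(1, round(budget * share)))
        # Token level: mix bottom-up saliency with top-down instruction relevance.
        ranked = sorted(
            region["tokens"],
            key=lambda t: alpha * t[1] + (1 - alpha) * t[2],
            reverse=True,
        )
        kept.extend(token_id for token_id, _, _ in ranked[:k])
    return kept


# Hypothetical scores for two sub-image regions with four tokens each.
regions = [
    {"saliency": 0.75, "tokens": [(0, 0.9, 0.1), (1, 0.2, 0.9), (2, 0.4, 0.4), (3, 0.1, 0.1)]},
    {"saliency": 0.25, "tokens": [(4, 0.3, 0.2), (5, 0.8, 0.7), (6, 0.1, 0.1), (7, 0.2, 0.3)]},
]
kept_ids = pyramid_prune(regions)
```

Because the procedure only ranks and drops tokens, it needs no gradient updates, which matches the training-free setting the abstract describes.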

ORIC: Benchmarking Object Recognition in Incongruous Context for Large Vision-Language Models

Authors: Zhaoyang Li, Zhan Ling, Yuchen Zhou, Hao Su

First: 2025-09-19T07:14:29+00:00 · Latest: 2025-09-19T07:14:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Vision-Language Models (LVLMs) have made significant strides in image captioning, visual question answering, and robotics by integrating visual and textual information. However, they remain prone to errors in incongruous contexts, where objects appear unexpectedly or are absent when contextually expected. This leads to two key recognition failures: object misidentification and hallucination. To systematically examine this issue, we introduce the Object Recognition in Incongruous Context Benchmark (ORIC), a novel benchmark that evaluates LVLMs in scenarios where object-context relationships deviate from expectations. ORIC employs two key strategies: (1) LLM-guided sampling, which identifies objects that are present but contextually incongruous, and (2) CLIP-guided sampling, which detects plausible yet nonexistent objects that are likely to be hallucinated, thereby creating an incongruous context. Evaluating 18 LVLMs and two open-vocabulary detection models, our results reveal significant recognition gaps, underscoring the challenges posed by contextual incongruity. This work provides critical insights into LVLMs' limitations and encourages further research on context-aware object recognition.

中文标题/摘要

标题：ORIC：大型视觉语言模型在不协调上下文中的物体识别基准测试

大型视觉语言模型（LVLMs）在图像字幕、视觉问答和机器人技术中通过整合视觉和文本信息取得了显著进展。然而，它们在不协调的上下文中仍然容易出错，即物体意外出现，或在上下文中本应出现时却缺失。这导致了两种关键的识别失败：物体误识别和幻觉。为了系统地研究这一问题，我们引入了不协调上下文中物体识别基准测试（ORIC），这是一个新颖的基准测试，用于评估LVLMs在物体-上下文关系偏离预期的场景中的表现。ORIC采用了两种关键策略：（1）LLM引导采样，识别出虽然存在但上下文不协调的物体；（2）CLIP引导采样，检测出看似合理但实际并不存在、容易被幻觉出的物体，从而创建不协调的上下文。评估了18个LVLMs和两个开放词汇检测模型，我们的结果揭示了显著的识别差距，突显了上下文不协调带来的挑战。这项工作为LVLMs的局限性提供了关键见解，并鼓励进一步研究上下文感知的物体识别。

Summary / 总结

The paper introduces ORIC, a benchmark for evaluating large vision-language models (LVLMs) in incongruous contexts, where objects appear unexpectedly or are absent when contextually expected. It uses LLM-guided sampling to find objects that are present but incongruous, and CLIP-guided sampling to find plausible but nonexistent objects that models are likely to hallucinate. Evaluations of 18 LVLMs and two open-vocabulary detection models reveal significant recognition gaps, indicating the need for better context-aware object recognition.

研究引入了ORIC基准，用于评估大型视觉-语言模型（LVLMs）在不协调的上下文中对物体的识别能力，即物体意外出现或在预期出现时缺失的情况。该基准使用LLM引导和CLIP引导的采样方法来识别不协调的物体和容易被幻觉出的物体。评估18个LVLM和两个开放词汇检测模型后，研究揭示了显著的识别差距，表明需要更好地实现上下文感知的物体识别。
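
The ORIC entry above mentions CLIP-guided sampling of plausible yet nonexistent objects. One way to picture this is to rank object labels that are absent from an image by how similar their text embedding is to the image embedding, then keep the closest ones. The sketch below assumes that reading; the function names and the tiny two-dimensional vectors are hypothetical, and real CLIP embeddings would come from a pretrained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def plausible_absent_objects(image_emb, label_embs, present, k=1):
    """Rank labels NOT in the image by similarity to the image embedding."""
    absent = [name for name in label_embs if name not in present]
    ranked = sorted(absent, key=lambda n: cosine(image_emb, label_embs[n]), reverse=True)
    return ranked[:k]


# Hypothetical embeddings for a kitchen scene that contains only a sink.
image_emb = [1.0, 0.0]
label_embs = {"sink": [1.0, 0.0], "oven": [0.9, 0.1], "surfboard": [0.1, 0.9]}
hard_negatives = plausible_absent_objects(image_emb, label_embs, present={"sink"})
```

Asking a model whether the top-ranked absent object is in the image then probes hallucination, since the scene context makes a "yes" answer tempting.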

SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and Progressive Transfer Learning

Authors: Yiguo He, Xinjun Cheng, Junjie Zhu, Chunping Qiu, Jun Wang, Xichuan Zhang, Qiangjuan Huang, Ke Yang

First: 2025-07-24T18:45:30+00:00 · Latest: 2025-09-19T07:05:22+00:00

Comments: IEEE Submission

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision Language Models (VLMs) have achieved remarkable breakthroughs in the field of remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather capability, is essential in remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understanding. In this paper, we construct SAR-TEXT, a large-scale and high-quality dataset consisting of over 130,000 SAR image-text pairs. To construct the SAR-TEXT dataset, we design the SAR-Narrator framework, which generates textual descriptions for SAR images through a multi-stage strategy. To verify the effectiveness of the SAR-TEXT dataset, we conduct experiments on three typical vision-language tasks: image-text retrieval, image captioning, and visual question answering (VQA). Specifically, we construct three representative models on SAR-TEXT: SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT. SAR-RS-CLIP achieves notable improvements in retrieval performance, boosting average recall by 12.97% and 10.0% on the OSdataset_512 and HRSID test sets, respectively. In the captioning task, SAR-RS-CoCa achieves significant improvements over the original CoCa models in terms of BLEU-4, SPICE, and CIDEr scores. In the VQA task, SAR-GPT outperforms baseline and single-stage models on multiple SAR-VQA datasets, demonstrating stronger semantic understanding and reasoning ability, as further confirmed by qualitative results. It is worth noting that, as a flexible captioning tool, SAR-Narrator can be readily adopted by the community to construct larger-scale SAR image-text datasets. All code, pretrained models, and the SAR-Text dataset are publicly available at: https://github.com/YiguoHe/SAR-TEXT.

中文标题/摘要

标题：SAR-TEXT：使用SAR-Narrator和渐进式迁移学习构建的大规模SAR图像-文本数据集

视觉语言模型（VLMs）近年来在遥感领域取得了显著突破。合成孔径雷达（SAR）图像因其全天候能力在遥感中至关重要，但由于缺乏大规模、高质量的SAR图像-文本数据集，其语义理解受到了限制。本文构建了SAR-TEXT数据集，包含超过130,000个SAR图像-文本对。为了构建SAR-TEXT数据集，我们设计了SAR-Narrator框架，通过多阶段策略生成SAR图像的文本描述。为了验证SAR-TEXT数据集的有效性，我们在三个典型的视觉-语言任务上进行了实验：图像-文本检索、图像字幕生成和视觉问答（VQA）。具体来说，我们在SAR-TEXT上构建了三个代表性模型：SAR-RS-CLIP、SAR-RS-CoCa和SAR-GPT。SAR-RS-CLIP在检索性能上取得了显著提升，在OSdataset_512和HRSID测试集上分别将平均召回率提高了12.97%和10.0%。在字幕生成任务中，SAR-RS-CoCa在BLEU-4、SPICE和CIDEr分数上显著优于原始CoCa模型。在VQA任务中，SAR-GPT在多个SAR-VQA数据集上优于基线和单阶段模型，展示了更强的语义理解和推理能力，进一步的定性结果也证实了这一点。值得注意的是，作为灵活的字幕工具，SAR-Narrator可以方便地被社区采用以构建更大规模的SAR图像-文本数据集。所有代码、预训练模型和SAR-Text数据集均可在以下链接获取：https://github.com/YiguoHe/SAR-TEXT。

Summary / 总结

The research aims to enhance the semantic understanding of Synthetic Aperture Radar (SAR) imagery through the construction of a large-scale SAR image-text dataset, SAR-TEXT. The dataset is created using the SAR-Narrator framework, which generates textual descriptions for SAR images. Experiments on image-text retrieval, captioning, and VQA tasks show that models trained on SAR-TEXT, such as SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT, achieve significant improvements in performance metrics like recall, BLEU-4, SPICE, CIDEr, and VQA accuracy. The SAR-Narrator framework also enables the creation of larger-scale SAR image-text datasets. All resources are publicly available.

研究旨在通过使用SAR-Narrator框架构建包含超过130,000个SAR图像-文本对的大规模SAR图像-文本数据集SAR-TEXT，以增强SAR图像的语义理解。该数据集在图像-文本检索、图像描述和视觉问答三个视觉-语言任务上进行了评估。基于SAR-TEXT训练的SAR-RS-CLIP在OSdataset_512和HRSID测试集上的平均召回率分别提高了12.97%和10.0%。SAR-RS-CoCa在描述任务中获得了更好的BLEU-4、SPICE和CIDEr分数，而SAR-GPT在问答任务中优于基线模型，展示了更强的语义理解和推理能力。

VLA-Mark: A cross modal watermark for large vision-language alignment model

Authors: Shuliang Liu, Qi Zheng, Jesse Jiaxi Xu, Yibo Yan, Junyan Zhang, He Geng, Aiwei Liu, Peijie Jiang, Jia Liu, Yik-Cheung Tam, Xuming Hu

Venue: EMNLP 2025

First: 2025-07-18T16:44:41+00:00 · Latest: 2025-09-19T06:54:08+00:00

Comments: Accepted by the main conference, EMNLP 2025

Abs · PDF · Code1 · Code2

Abstract

Vision-language models demand watermarking solutions that protect intellectual property without compromising multimodal coherence. Existing text watermarking methods disrupt visual-textual alignment through biased token selection and static strategies, leaving semantic-critical concepts vulnerable. We propose VLA-Mark, a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns, to guide watermark injection without model retraining. An entropy-sensitive mechanism dynamically balances watermark strength and semantic preservation, prioritizing visual grounding during low-uncertainty generation phases. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection (98.8% AUC). The framework demonstrates 96.1% attack resilience against attacks such as paraphrasing and synonym substitution, while maintaining text-visual consistency, establishing new standards for quality-preserving multimodal watermarking.

中文标题/摘要

标题：VLA-Mark：一种用于大型视觉语言对齐模型的跨模态水印

视觉语言模型需要能够在保护知识产权的同时不损害多模态一致性的水印解决方案。现有的文本水印方法通过有偏的词元选择和静态策略破坏了视觉-文本对齐，使语义关键概念处于风险之中。我们提出了VLA-Mark，这是一种视觉对齐框架，能够在保持语义保真度的同时通过跨模态协调嵌入可检测的水印。我们的方法结合了多尺度视觉-文本对齐度量，包括局部补丁亲和性、全局语义一致性以及上下文注意力模式，以指导水印注入而无需重新训练模型。一种熵敏感机制动态平衡水印强度和语义保真度，在低不确定性生成阶段优先考虑视觉接地。实验结果显示，与传统方法相比，PPL低7.4%，BLEU高26.6%，检测接近完美（AUC 98.8%）。该框架对诸如改写和同义词替换等攻击具有96.1%的攻击抵抗力，同时保持了文本-视觉一致性，建立了高质量保留的多模态水印的新标准。

Summary / 总结

VLA-Mark is a vision-aligned watermarking framework designed to protect intellectual property in vision-language models without disrupting multimodal coherence. It uses cross-modal coordination to embed detectable watermarks while preserving semantic fidelity. The method integrates multiscale alignment metrics and an entropy-sensitive mechanism to balance watermark strength and semantic preservation. Experiments show that VLA-Mark achieves 7.4% lower perplexity and 26.6% higher BLEU scores compared to conventional methods, with near-perfect detection accuracy and high attack resilience against paraphrasing and synonym substitution attacks.

VLA-Mark 是一种针对视觉-语言模型的对齐水印框架，旨在保护知识产权同时不破坏多模态一致性。该方法通过跨模态协调嵌入可检测的水印，并保持语义保真度。该方法结合了多尺度对齐指标和熵敏感机制来引导水印注入。实验结果显示，VLA-Mark 较传统方法实现了 7.4% 的更低困惑度和 26.6% 的更高 BLEU，具有近乎完美的检测准确性和对改写和同义替换等攻击的强大抵抗力。
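
The VLA-Mark entry above describes an entropy-sensitive mechanism that weakens the watermark when the model is confident, so visually grounded tokens are left alone. The sketch below illustrates that idea with a linear scaling of a watermark bias by normalized next-token entropy. The linear rule, the `max_delta` parameter, and the function names are assumptions for illustration; the paper's actual schedule is not given in the abstract.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def watermark_strength(probs, max_delta=2.0):
    """Scale the watermark bias by normalized entropy in [0, 1].

    Low entropy (confident prediction) yields a weak watermark;
    high entropy (many acceptable tokens) yields a strong one.
    """
    if len(probs) < 2:
        return 0.0
    normalized = shannon_entropy(probs) / math.log(len(probs))
    return max(0.0, max_delta * normalized)


# Hypothetical next-token distributions over a four-token vocabulary.
confident = watermark_strength([1.0, 0.0, 0.0, 0.0])
uncertain = watermark_strength([0.25, 0.25, 0.25, 0.25])
```

With these inputs the confident step gets no watermark bias and the uniform step gets the full `max_delta`, which reflects the trade-off between detectability and semantic preservation.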

MMAPG: A Training-Free Framework for Multimodal Multi-hop Question Answering via Adaptive Planning Graphs

Authors: Yiheng Hu, Xiaoyang Wang, Qing Liu, Xiwei Xu, Qian Fu, Wenjie Zhang, Liming Zhu

First: 2025-08-22T02:57:52+00:00 · Latest: 2025-09-19T06:41:50+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal Multi-hop question answering requires integrating information from diverse sources, such as images and texts, to derive answers. Existing methods typically rely on sequential retrieval and reasoning, where each step builds on the previous output. However, this single-path paradigm makes them vulnerable to errors due to misleading intermediate steps. Moreover, developing multimodal models can be computationally expensive, often requiring extensive training. To address these limitations, we propose a training-free framework guided by an Adaptive Planning Graph, which consists of planning, retrieval and reasoning modules. The planning module analyzes the current state of the Adaptive Planning Graph, determines the next action and where to expand the graph, which enables dynamic and flexible exploration of reasoning paths. To handle retrieval of text to unspecified target modalities, we devise modality-specific strategies that dynamically adapt to distinct data types. Our approach preserves the characteristics of multimodal information without costly task-specific training, enabling seamless integration with up-to-date models. Finally, the experiments on MultimodalQA and WebQA show that our approach matches or outperforms existing models that rely on training.

中文标题/摘要

标题：MMAPG：基于自适应规划图的无需训练框架用于多模态多跳问答

多模态多跳问答需要从图像和文本等多种来源整合信息以得出答案。现有方法通常依赖于顺序检索和推理，每一步都基于上一步的输出。然而，这种单路径范式使它们容易受到误导性中间步骤的错误影响。此外，开发多模态模型可能计算成本高昂，通常需要大量训练。为了解决这些限制，我们提出了一种由自适应规划图引导的无需训练框架，该框架包括规划、检索和推理模块。规划模块分析自适应规划图的当前状态，确定下一步行动和扩展图的位置，从而实现动态和灵活的推理路径探索。为了处理文本到未指定目标模态的检索，我们设计了特定模态策略，能够动态适应不同数据类型。我们的方法保留了多模态信息的特性，无需昂贵的任务特定训练，能够无缝集成到最新的模型中。最后，对MultimodalQA和WebQA的实验表明，我们的方法与依赖训练的现有模型相当或更优。

Summary / 总结

The paper proposes MMAPG, a training-free framework for multimodal multi-hop question answering using Adaptive Planning Graphs. It addresses the limitations of single-path retrieval and reasoning by enabling dynamic exploration of reasoning paths. The framework includes modality-specific strategies for handling text retrieval to unspecified target modalities. Experiments on MultimodalQA and WebQA demonstrate that MMAPG matches or outperforms existing training-dependent models.

论文针对现有顺序检索和推理方法在多模态多跳问答中的局限性，如易出错和需要大量训练。提出了一种无需训练的框架MMAPG，通过适应性规划图指导，包含规划、检索和推理模块。该框架能够动态探索推理路径并适应不同数据类型，在MultimodalQA和WebQA基准测试中表现出色。

SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models

Authors: Qiaolin Wang, Xilin Jiang, Linyang He, Junkai Wu, Nima Mesgarani

First: 2025-09-19T06:39:39+00:00 · Latest: 2025-09-19T06:39:39+00:00

Abs · PDF · Code1 · Code2

Abstract

While large audio-language models (LALMs) have demonstrated state-of-the-art audio understanding, their reasoning capability in complex soundscapes still falls behind large vision-language models (LVLMs). Compared to the visual domain, one bottleneck is the lack of large-scale chain-of-thought audio data to teach LALM stepwise reasoning. To circumvent this data and modality gap, we present SightSound-R1, a cross-modal distillation framework that transfers advanced reasoning from a stronger LVLM teacher to a weaker LALM student on the same audio-visual question answering (AVQA) dataset. SightSound-R1 consists of three core steps: (i) test-time scaling to generate audio-focused chains of thought (CoT) from an LVLM teacher, (ii) audio-grounded validation to filter hallucinations, and (iii) a distillation pipeline with supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) for the LALM student. Results show that SightSound-R1 improves LALM reasoning performance both in the in-domain AVQA test set as well as in unseen auditory scenes and questions, outperforming both pretrained and label-only distilled baselines. Thus, we conclude that vision reasoning can be effectively transferred to audio models and scaled with abundant audio-visual data.

中文标题/摘要

标题：SightSound-R1: 视觉到音频语言模型的跨模态推理精炼

虽然大型音频-语言模型（LALMs）在音频理解方面表现出最先进的水平，但在复杂声景中的推理能力仍落后于大型视觉-语言模型（LVLMs）。与视觉领域相比，一个瓶颈是缺乏大规模的链式思考音频数据来教LALM逐步推理。为克服这一数据和模态差距，我们提出了SightSound-R1，这是一种跨模态精炼框架，该框架将更强的LVLM教师的高级推理转移到同一音频-视觉问答（AVQA）数据集上的较弱LALM学生上。SightSound-R1 包含三个核心步骤：(i) 测试时缩放以从LVLM教师生成音频聚焦的链式思考（CoT），(ii) 音频定位验证以过滤幻觉，(iii) 一个精炼管道，包括监督微调（SFT）后跟随组相对策略优化（GRPO）以对LALM学生进行优化。结果显示，SightSound-R1 在领域内和未见过的听觉场景及问题上均提高了LALM的推理性能，优于预训练和仅标签精炼的基线。因此，我们得出结论，视觉推理可以有效地转移到音频模型，并通过丰富的音频-视觉数据进行扩展。

Summary / 总结

SightSound-R1 is a cross-modal distillation framework that transfers reasoning from a strong vision-language model (LVLM) to a weaker audio-language model (LALM) for audio-visual question answering. It involves test-time scaling to generate audio-focused chains of thought, audio-grounded validation to filter hallucinations, and a distillation pipeline with supervised fine-tuning and Group Relative Policy Optimization. The framework improves LALM reasoning performance in both in-domain and unseen auditory scenarios, outperforming pretrained and label-only distilled baselines.

研究旨在通过利用视觉语言模型（LVLM）的高级推理能力来增强大型音频语言模型（LALM）的推理能力。方法包括一个跨模态蒸馏框架SightSound-R1，该框架包含测试时缩放以生成音频焦点的推理链、基于音频的验证以过滤幻觉，以及一个包含监督微调和组相对策略优化的蒸馏管道。关键发现表明，SightSound-R1在领域内AVQA测试集以及未见过的听觉场景和问题中均提高了LALM的推理性能，优于预训练和仅标签蒸馏的基线模型。

GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains

Authors: Chun Wang, Xiaoran Pan, Zihao Pan, Haofan Wang, Yiren Song

First: 2025-05-24T13:48:57+00:00 · Latest: 2025-09-19T06:24:54+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advances in Visual Language Models (VLMs) have demonstrated exceptional performance in visual reasoning tasks. However, geo-localization presents unique challenges, requiring the extraction of multigranular visual cues from images and their integration with external world knowledge for systematic reasoning. Current approaches to geo-localization tasks often lack robust reasoning mechanisms and explainability, limiting their effectiveness. To address these limitations, we propose the Geo Reason Enhancement (GRE) Suite, a novel framework that augments VLMs with structured reasoning chains for accurate and interpretable location inference. The GRE Suite is systematically developed across three key dimensions: dataset, model, and benchmark. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision. Finally, we construct the Geo Reason Evaluation Benchmark (GREval-Bench), a comprehensive evaluation framework that assesses VLMs across diverse urban, natural, and landmark scenes to measure both coarse-grained (e.g., country, continent) and fine-grained (e.g., city, street) localization performance. Experimental results demonstrate that GRE significantly outperforms existing methods across all granularities of geo-localization tasks, underscoring the efficacy of reasoning-augmented VLMs in complex geographic inference. Code and data will be released at https://github.com/Thorin215/GRE.

Summary / 总结

The GRE Suite is a novel framework that enhances Visual Language Models (VLMs) with structured reasoning chains for accurate geo-localization. It includes GRE30K, a high-quality dataset for fine-grained visual and contextual analysis, the GRE model, which uses a multi-stage reasoning strategy, and the GREval-Bench, a comprehensive evaluation framework. Experiments show that GRE outperforms existing methods in all granularities of geo-localization tasks, highlighting the effectiveness of reasoning-augmented VLMs.

GRE Suite 是一种增强视觉语言模型 (VLM) 的新型框架，通过结构化的推理链实现精确的地理定位。它包括 GRE30K，一个高质量的数据集，用于精细的视觉和上下文分析，GRE 模型，采用多阶段推理策略，以及 GREval-Bench，一个全面的评估框架。实验结果表明，GRE 在所有地理定位任务的粒度上都优于现有方法。
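
The coarse- versus fine-grained evaluation mentioned in the abstract can be illustrated with a metric widely used in geo-localization work: the fraction of predictions that fall within a distance threshold of the ground truth. The sketch below is a minimal illustration and not GREval-Bench's actual protocol. The function names and the thresholds (1 km for street, 25 km for city, 750 km for country, 2500 km for continent) are assumptions borrowed from common practice in the field.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Hypothetical thresholds (km); GREval-Bench may define granularity differently.
THRESHOLDS_KM = {"street": 1, "city": 25, "country": 750, "continent": 2500}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_thresholds(preds, truths, thresholds=THRESHOLDS_KM):
    """Fraction of predictions within each distance threshold of the ground truth."""
    dists = [haversine_km(*p, *t) for p, t in zip(preds, truths)]
    return {name: sum(d <= km for d in dists) / len(dists)
            for name, km in thresholds.items()}


# One prediction a few km off (central Paris vs. Eiffel Tower), one on the wrong continent.
preds = [(48.8566, 2.3522), (40.7128, -74.0060)]
truths = [(48.8584, 2.2945), (51.5074, -0.1278)]
print(accuracy_at_thresholds(preds, truths))
```

A prediction that is a few kilometres off counts as correct at city level and coarser but not at street level, which is why reporting several thresholds separates coarse from fine localization ability.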

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

Authors: Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, Liqiang Nie

Venue: NeurIPS 2025 Spotlight

First: 2025-06-04T07:36:33+00:00 · Latest: 2025-09-19T05:48:14+00:00

Comments: Accepted by NeurIPS 2025 as a Spotlight

Abs · PDF · Code1 · Code2

Abstract

Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial uncertainty and data scarcity, limiting the 3D spatial reasoning capability of pre-trained vision-language models (VLMs). To address these challenges, we present a unified framework for enhancing 3D spatial reasoning in pre-trained VLMs without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes through an automated construction process designed for fine-tuning. Extensive experiments across multiple benchmarks demonstrate the individual and combined effectiveness of our prompting and fine-tuning strategies, and yield insights that may inspire future research on visual-spatial understanding.

中文标题/摘要

标题：从视频中理解空间：结构化提示与模拟数据

视觉-空间理解，即从视觉输入中推断物体关系和布局的能力，是下游任务如机器人导航和具身交互的基础。然而，现有方法面临空间不确定性与数据稀缺性，限制了预训练视觉-语言模型（VLM）的三维空间推理能力。为应对这些挑战，我们提出了一种无需修改架构即可增强预训练VLM三维空间推理能力的统一框架。该框架结合了SpatialMind，一种结构化提示策略，将复杂场景和问题分解为可解释的推理步骤，以及ScanForgeQA，一种通过自动化构建过程从多样化的3D模拟场景中构建的可扩展问答数据集，旨在用于微调。在多个基准测试中的广泛实验表明，我们的提示和微调策略的单独和联合有效性，并提供了可能启发未来视觉-空间理解研究的见解。

Summary / 总结

The research aims to enhance 3D spatial reasoning in pre-trained vision-language models by addressing spatial uncertainty and data scarcity. It introduces a unified framework combining SpatialMind, a structured prompting strategy that breaks down complex scenes into interpretable reasoning steps, and ScanForgeQA, a scalable dataset built from diverse 3D simulation scenes. Experiments across multiple benchmarks show the effectiveness of these strategies in improving visual-spatial understanding.

研究旨在通过解决空间不确定性与数据稀缺性问题，增强预训练视觉-语言模型的3D空间推理能力。该研究提出了一种统一框架，结合了SpatialMind，这是一种将复杂场景分解为可解释推理步骤的结构化提示策略，以及ScanForgeQA，这是一个从多样化的3D模拟场景中构建的可扩展问答数据集。跨多个基准的实验表明，这些策略在提高视觉-空间理解方面是有效的。

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Authors: Zhaoyang Liu, Jingjing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Xuan Dong, Yue Yu, Chenyu Lu, YunXiang Mo, Yao Yan, Zeyue Tian, Xiao Zhang, Yuan Huang, Yiqian Liu, Weijie Su, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang

First: 2025-09-18T17:59:22+00:00 · Latest: 2025-09-19T05:29:03+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.

中文标题/摘要

标题：ScaleCUA：跨平台数据扩展开源计算机使用代理

视觉-语言模型（VLMs）使计算机使用代理（CUAs）能够自主操作GUI，展现出巨大的潜力，但进展受限于缺乏大规模、开源的计算机使用数据和基础模型。在本项工作中，我们介绍了ScaleCUA，这是迈向扩展开源CUA的一个步骤。它提供了一个跨越6个操作系统和3个任务领域的大型数据集，通过将自动化代理与人类专家结合的闭环管道构建而成。在这些扩展的数据上训练后，ScaleCUA可以在不同平台之间无缝操作。具体而言，它相比基线取得了显著提升（WebArena-Lite-v2上+26.6，ScreenSpot-Pro上+10.7），并在MMBench-GUI L1-Hard、OSWorld-G和WebArena-Lite-v2上分别达到了94.4%、60.6%和47.4%的新最佳结果。这些发现强调了数据驱动扩展对通用计算机使用代理的强大作用。我们将发布数据、模型和代码以促进未来研究：https://github.com/OpenGVLab/ScaleCUA。

Summary / 总结

ScaleCUA aims to scale open-source computer use agents by providing a large-scale dataset across multiple operating systems and task domains. It uses a closed-loop pipeline combining automated agents and human experts. ScaleCUA outperforms baselines and sets new state-of-the-art results on MMBench-GUI L1-Hard, OSWorld-G, and WebArena-Lite-v2, demonstrating the effectiveness of data-driven scaling for general-purpose computer use agents. The authors state that they will release the data, models, and code for future research.

ScaleCUA旨在通过提供跨多个操作系统和任务领域的大型数据集来扩展开源计算机使用代理。它使用结合了自动化代理和人类专家的闭环管道。ScaleCUA在MMBench-GUI L1-Hard、OSWorld-G和WebArena-Lite-v2上超越了基线，并设定了新的最先进结果，证明了数据驱动扩展对通用计算机使用代理的有效性。作者表示将发布数据、模型和代码以促进未来研究。

Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track

Authors: Ran Hong, Feng Lu, Leilei Cao, An Yan, Youhai Jiang, Fengjie Zhu

First: 2025-09-19T03:01:27+00:00 · Latest: 2025-09-19T03:01:27+00:00

Comments: 6 pages, 2 figures

Abs · PDF · Code1 · Code2

Abstract

Referential Video Object Segmentation (RVOS) aims to segment all objects in a video that match a given natural language description, bridging the gap between vision and language understanding. Recent work, such as Sa2VA, combines Large Language Models (LLMs) with SAM 2, leveraging the strong video reasoning capability of LLMs to guide video segmentation. In this work, we present a training-free framework that substantially improves Sa2VA's performance on the RVOS task. Our method introduces two key components: (1) a Video-Language Checker that explicitly verifies whether the subject and action described in the query actually appear in the video, thereby reducing false positives; and (2) a Key-Frame Sampler that adaptively selects informative frames to better capture both early object appearances and long-range temporal context. Without any additional training, our approach achieves a J&F score of 64.14% on the MeViS test set, ranking 2nd place in the RVOS track of the 7th LSVOS Challenge at ICCV 2025.

中文标题/摘要

标题：增强Sa2VA以提高引用视频对象分割性能：第7届LSVOS RVOS赛道第2名方案

引用视频对象分割（RVOS）旨在将视频中的所有与给定自然语言描述匹配的对象进行分割，从而弥合视觉理解和语言理解之间的差距。最近的工作，如Sa2VA，结合了大型语言模型（LLMs）与SAM 2，利用LLMs强大的视频推理能力来指导视频分割。在本文中，我们提出了一种无需训练的框架，显著提高了Sa2VA在RVOS任务上的性能。我们的方法引入了两个关键组件：（1）视频-语言检查器，明确验证查询中描述的主题和动作是否出现在视频中，从而减少误报；（2）关键帧采样器，自适应地选择信息性更强的帧，以更好地捕捉早期对象出现和长时间的时空上下文。在没有任何额外训练的情况下，我们的方法在MeViS测试集上达到了64.14%的J&F分数，在2025年ICCV第7届LSVOS挑战赛的RVOS赛道中排名第二。

Summary / 总结

This work aims to enhance Sa2VA for RVOS by introducing a training-free framework that includes a Video-Language Checker and a Key-Frame Sampler. The Video-Language Checker verifies the query's subject and action in the video to reduce false positives, while the Key-Frame Sampler selects informative frames for better temporal context. Without additional training, the approach achieves a J&F score of 64.14% on the MeViS test set, ranking 2nd place in the RVOS track of the 7th LSVOS Challenge at ICCV 2025.

该研究旨在通过引入免训练框架，结合Video-Language Checker和Key-Frame Sampler来增强Sa2VA的RVOS性能。Video-Language Checker验证查询中的主体和动作是否出现在视频中，从而减少误报，而Key-Frame Sampler选择关键帧以捕捉早期对象出现和长时间的时空上下文。在没有额外训练的情况下，该方法在MeViS测试集上达到64.14%的J&F分数，在ICCV 2025第7届LSVOS挑战赛的RVOS赛道上排名第二。
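
For readers unfamiliar with the J&F score reported above: in video object segmentation benchmarks it is conventionally the average of region similarity J (mask intersection-over-union) and contour accuracy F (a boundary F-measure). The sketch below computes J from binary masks and combines it with F values that are assumed to be computed elsewhere; it is an illustration of the metric's structure, not the challenge's official evaluation code, and the function names are hypothetical.

```python
def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_scores, f_scores):
    """J&F: mean of the average J and the average F over all evaluated objects."""
    j_mean = sum(j_scores) / len(j_scores)
    f_mean = sum(f_scores) / len(f_scores)
    return (j_mean + f_mean) / 2


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [1, 0]]
print(region_similarity(pred, gt))  # one shared pixel out of three covered pixels
```

Because J rewards overlap and F rewards boundary quality, a method that reduces false-positive masks (as the Video-Language Checker is described as doing) can improve both terms at once.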

TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning

Authors: Ziyuan Chen, Zhenghui Zhao, Zhangye Han, Miancan Liu, Xianhang Ye, Yiqing Li, Hongbo Min, Jinkui Ren, Xiantao Zhang, Guitao Cao

First: 2025-09-17T16:58:44+00:00 · Latest: 2025-09-19T02:13:09+00:00

Abs · PDF · Code1 · Code2

Abstract

With the rapid advancement of large language models and vision-language models, employing large models as Web Agents has become essential for automated web interaction. However, training Web Agents with reinforcement learning faces critical challenges including credit assignment misallocation, prohibitively high annotation costs, and reward sparsity. To address these issues, we propose Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework that introduces a tree-structured trajectory representation merging semantically identical states across trajectories to eliminate label conflicts. Our framework incorporates a Process Reward Model that automatically generates fine-grained rewards through subgoal progress, redundancy detection, and action verification. Additionally, a dynamic weighting mechanism prioritizes high-impact decision points during training. Experiments on Online-Mind2Web and our self-constructed C-WebShop datasets demonstrate that TGPO significantly outperforms existing methods, achieving higher success rates with fewer redundant steps.

中文标题/摘要

标题：TGPO：基于树引导的偏好优化以实现鲁棒的网络代理强化学习

随着大型语言模型和视觉-语言模型的迅速发展，使用大型模型作为网络代理已成为自动网络交互的必要手段。然而，使用强化学习训练网络代理面临着关键挑战，包括信用分配不当、标注成本高昂以及奖励稀疏性。为了解决这些问题，我们提出了基于树引导的偏好优化（TGPO），这是一种离线强化学习框架，通过树结构轨迹表示将轨迹中语义相同的状态合并，消除标签冲突。该框架结合了过程奖励模型，该模型能够通过子目标进展、冗余检测和动作验证自动生成细粒度奖励。此外，动态加权机制在训练过程中优先处理高影响决策点。在Online-Mind2Web和我们自构建的C-WebShop数据集上的实验表明，TGPO显著优于现有方法，能够在较少冗余步骤的情况下实现更高的成功率。

Summary / 总结

The paper proposes Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework designed to address challenges in training Web Agents, such as credit assignment misallocation and reward sparsity. TGPO uses a tree-structured trajectory representation to merge semantically identical states and a Process Reward Model that generates rewards based on subgoal progress, redundancy detection, and action verification. The framework also includes a dynamic weighting mechanism that prioritizes high-impact decision points. Experiments show that TGPO outperforms existing methods by achieving higher success rates with fewer redundant steps on Online-Mind2Web and C-WebShop datasets.

论文针对强化学习训练Web代理时面临的信用分配错误、高标注成本和稀疏奖励等问题，提出了Tree-Guided Preference Optimization (TGPO)框架。该框架采用树结构轨迹表示法，合并语义相同的轨迹状态以消除标签冲突。TGPO还包括一个过程奖励模型，该模型基于子目标进度、冗余检测和动作验证自动生成细粒度奖励，并采用动态加权机制优先处理高影响决策点。实验结果表明，TGPO在Online-Mind2Web和C-WebShop数据集上优于现有方法，实现了更高的成功率和更少的冗余步骤。
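
The idea of merging identical states across trajectories into a tree and reading preferences off the branches can be sketched as follows. This is a simplified illustration under stated assumptions, not TGPO's actual algorithm: states are identified by their action prefix from a shared start state (which assumes a deterministic environment), branch quality is the raw success rate rather than a Process Reward Model score, and the names `Node`, `insert`, and `preference_pairs` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    visits: int = 0
    successes: int = 0
    children: dict = field(default_factory=dict)  # action -> Node


def insert(root, actions, success):
    """Merge one trajectory into the tree; shared action prefixes share nodes."""
    node = root
    node.visits += 1
    node.successes += int(success)
    for action in actions:
        node = node.children.setdefault(action, Node())
        node.visits += 1
        node.successes += int(success)


def preference_pairs(node, path=()):
    """At each merged state, prefer action a over b if a's branch succeeds more often."""
    pairs = []
    items = list(node.children.items())
    for a, child_a in items:
        for b, child_b in items:
            if child_a.successes / child_a.visits > child_b.successes / child_b.visits:
                pairs.append((path, a, b))
    for a, child in items:
        pairs.extend(preference_pairs(child, path + (a,)))
    return pairs


root = Node()
insert(root, ["search", "click_a", "buy"], True)
insert(root, ["search", "click_a", "back"], False)
insert(root, ["search", "click_b"], False)
print(preference_pairs(root))
```

Because all three trajectories share the `search` node, that action receives a single aggregated record instead of conflicting per-trajectory labels, and preference pairs arise only where branches actually diverge.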

SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters

Authors: Abdarahmane Traore, Éric Hervet, Andy Couturier

First: 2025-09-18T23:55:51+00:00 · Latest: 2025-09-18T23:55:51+00:00

Comments: 9 pages, 3 figures, IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advances in vision-language models (VLMs) have enabled powerful multimodal reasoning, but state-of-the-art approaches typically rely on extremely large models with prohibitive computational and memory requirements. This makes their deployment challenging in resource-constrained environments such as warehouses, robotics, and industrial applications, where both efficiency and robust spatial understanding are critical. In this work, we present SmolRGPT, a compact vision-language architecture that explicitly incorporates region-level spatial reasoning by integrating both RGB and depth cues. SmolRGPT employs a three-stage curriculum that progressively aligns visual and language features, enables spatial relationship understanding, and adapts to task-specific datasets. We demonstrate that with only 600M parameters, SmolRGPT achieves competitive results on challenging warehouse spatial reasoning benchmarks, matching or exceeding the performance of much larger alternatives. These findings highlight the potential for efficient, deployable multimodal intelligence in real-world settings without sacrificing core spatial reasoning capabilities. The code of the experimentation will be available at: https://github.com/abtraore/SmolRGPT

Summary / 总结

SmolRGPT is a compact vision-language model designed for efficient spatial reasoning in resource-constrained environments like warehouses. It integrates RGB and depth cues for region-level spatial reasoning and is trained with a three-stage curriculum that aligns visual and language features, builds spatial relationship understanding, and adapts to task-specific datasets. With only 600M parameters, SmolRGPT matches or exceeds the performance of much larger models on warehouse spatial reasoning benchmarks, demonstrating the potential for efficient multimodal intelligence without sacrificing spatial reasoning capabilities.

SmolRGPT是一种紧凑的视觉-语言模型，面向仓库等资源受限环境中的高效空间推理。它融合RGB和深度线索进行区域级空间推理，并采用三阶段课程依次对齐视觉与语言特征、建立空间关系理解并适配特定任务数据集。仅用600M参数，SmolRGPT在仓库空间推理基准测试中达到或超过了更大模型的性能。