arXiv Paper Digest

GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation

Authors: Hang Yin, Haoyu Wei, Xiuwei Xu, Wenxuan Guo, Jie Zhou, Jiwen Lu

Venue: CoRL 2025

First: 2025-09-12T17:59:58+00:00 · Latest: 2025-09-12T17:59:58+00:00

Comments: Accepted to CoRL 2025. Project page: https://bagh2178.github.io/GC-VLN/

Abstract

In this paper, we propose a training-free framework for vision-and-language navigation (VLN). Existing zero-shot VLN methods are mainly designed for discrete environments or involve unsupervised training in continuous simulator environments, which makes it challenging to generalize and deploy them in real-world scenarios. To achieve a training-free framework in continuous environments, our framework formulates navigation guidance as graph constraint optimization by decomposing instructions into explicit spatial constraints. The constraint-driven paradigm decodes spatial semantics through constraint solving, enabling zero-shot adaptation to unseen environments. Specifically, we construct a spatial constraint library covering all types of spatial relationships mentioned in VLN instructions. The human instruction is decomposed into a directed acyclic graph with waypoint nodes, object nodes and edges, which are used as queries to retrieve entries from the library and build the graph constraints. The graph constraint optimization is solved by the constraint solver to determine the positions of waypoints, obtaining the robot's navigation path and final goal. To handle cases of no solution or multiple solutions, we construct a navigation tree and a backtracking mechanism. Extensive experiments on standard benchmarks demonstrate significant improvements in success rate and navigation efficiency compared to state-of-the-art zero-shot VLN methods. We further conduct real-world experiments to show that our framework can effectively generalize to new environments and instruction sets, paving the way for a more robust and autonomous navigation framework.

中文标题/摘要

标题：GC-VLN：基于图约束的无训练视觉-语言导航

在本文中，我们提出了一种无训练框架，用于视觉-语言导航（VLN）。现有的零样本VLN方法主要针对离散环境设计，或者在连续模拟环境中涉及无监督训练，这使得它们在现实世界场景中的泛化和部署变得具有挑战性。为了在连续环境中实现无训练框架，我们的框架通过将指令分解为显式的空间约束，将导航指导形式化为图约束优化。约束驱动的范式通过约束求解来解码空间语义，从而实现对未见过环境的零样本适应。具体而言，我们构建了一个涵盖VLN指令中提到的所有类型空间关系的空间约束库。人类指令被分解为有向无环图，包含航点节点、对象节点和边，这些节点和边用作查询以检索库来构建图约束。图约束优化通过约束求解器求解，以确定航点的位置，从而获得机器人的导航路径和最终目标。为了处理无解或多个解的情况，我们构建了一个导航树和回溯机制。在标准基准上的广泛实验表明，与最先进的零样本VLN方法相比，我们的方法在成功率和导航效率方面取得了显著提高。我们进一步进行了现实世界实验，展示了我们的框架可以有效泛化到新环境和指令集，为更稳健和自主的导航框架铺平了道路。

Summary / 总结

The paper proposes a training-free framework for vision-and-language navigation (VLN) that formulates navigation guidance as graph constraint optimization. By decomposing instructions into spatial constraints and using a spatial constraint library, the framework enables zero-shot adaptation to unseen environments. Experiments show significant improvements in success rate and navigation efficiency compared to existing zero-shot VLN methods, and real-world experiments demonstrate its effectiveness in new environments.

论文提出了一种无需训练的视觉-语言导航（VLN）框架，将导航指导形式化为图约束优化。它将指令分解为空间约束，并使用空间约束库构建图约束，然后通过约束求解器确定机器人的导航路径。实验结果显示，与现有的零样本VLN方法相比，该框架在成功率和导航效率方面有显著提高，并且在现实世界实验中能够有效泛化到新环境和指令集。
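
The constraint-solving step described in the GC-VLN abstract can be illustrated with a toy sketch. This is not the authors' solver: the relation names (`near`, `left_of`), their cost functions, the fixed world frame and the grid search are all assumptions made for illustration. It only shows the general idea of treating instruction edges as constraints and choosing the waypoint position that violates them least.

```python
import math
from itertools import product

# Hypothetical constraint library: each relation maps a candidate waypoint
# (x, y) and an object position to a non-negative violation cost (0 = satisfied).
def near(p, obj, radius=1.0):
    return max(0.0, math.dist(p, obj) - radius)

def left_of(p, obj, margin=0.5):
    # "Left" is taken as smaller x in a fixed world frame (an assumption).
    return max(0.0, p[0] - (obj[0] - margin))

LIBRARY = {"near": near, "left_of": left_of}

def solve_waypoint(constraints, objects, extent=5.0, step=0.25):
    """Grid-search the position with the lowest total violation.

    constraints: list of (relation_name, object_name) edges for one waypoint.
    Returns (best_position, best_cost). A positive cost means no grid point
    satisfies every constraint, which a caller could treat as "no solution".
    """
    n = int(extent / step)
    best, best_cost = None, float("inf")
    for i, j in product(range(-n, n + 1), repeat=2):
        p = (i * step, j * step)
        cost = sum(LIBRARY[rel](p, objects[name]) for rel, name in constraints)
        if cost < best_cost:
            best, best_cost = p, cost
    return best, best_cost

# Toy instruction: "stop near the sofa, to the left of the table".
objects = {"sofa": (2.0, 1.0), "table": (3.0, 1.0)}
pos, cost = solve_waypoint([("near", "sofa"), ("left_of", "table")], objects)
```

In this sketch a zero cost means every constraint is satisfied, while a positive cost signals the no-solution case that the abstract handles with a navigation tree and backtracking; a real system would also need to resolve multiple zero-cost candidates.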

JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Authors: Muyao Li, Zihao Wang, Kaichen He, Xiaojian Ma, Yitao Liang

Venue: ACL 2025

First: 2025-03-20T17:21:58+00:00 · Latest: 2025-09-12T17:56:19+00:00

Comments: Accepted by ACL 2025

Abstract

Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundational model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models' capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments. Following the above post-training paradigms, we obtain the first VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, we demonstrate that our approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found at https://craftjarvis.github.io/JarvisVLA.

中文标题/摘要

标题：JARVIS-VLA：训练后大规模视觉语言模型通过键盘和鼠标玩视觉游戏

最近，开放世界环境中的基于动作的决策制定引起了广泛关注。视觉语言行动（VLA）模型，预训练于大规模网络数据集上，在决策任务中显示出潜力。然而，先前的工作主要集中在动作的训练后阶段，往往忽视了对基础模型本身的改进。为此，我们提出了一种新颖的方法，即视觉语言训练后行动，通过视觉和语言指导以自监督方式改进视觉语言模型（VLMs）。这种方法提高了模型在开放世界环境中的世界知识、视觉识别和空间定位能力。遵循上述训练后范式，我们在Minecraft中获得了第一个可以遵循人类指令完成超过1000个不同原子任务的VLA模型，包括制作、冶炼、烹饪、采矿和杀敌。我们的实验表明，训练后在非轨迹任务上的改进在多种原子任务上比最佳代理基线提高了40%。此外，我们展示了我们的方法在Minecraft中超越了传统的模仿学习策略，达到了最先进的性能。我们已开源了代码、模型和数据集，以促进进一步的研究。项目页面可以在https://craftjarvis.github.io/JarvisVLA/找到。

Ordinality of Visible-Thermal Image Intensities for Intrinsic Image Decomposition

Authors: Zeqing Leo Yuan, Mani Ramanagopal, Aswin C. Sankaranarayanan, Srinivasa G. Narasimhan

First: 2025-09-12T16:29:02+00:00 · Latest: 2025-09-12T16:29:02+00:00

Abstract

Decomposing an image into its intrinsic photometric factors--shading and reflectance--is a long-standing challenge due to the lack of extensive ground-truth data for real-world scenes. Recent methods rely on synthetic data or sparse annotations for limited indoor and even fewer outdoor scenes. We introduce a novel training-free approach for intrinsic image decomposition using only a pair of visible and thermal images. We leverage the principle that light not reflected from an opaque surface is absorbed and detected as heat by a thermal camera. This allows us to relate the ordinalities between visible and thermal image intensities to the ordinalities of shading and reflectance, which can densely self-supervise an optimizing neural network to recover shading and reflectance. We perform quantitative evaluations with known reflectance and shading under natural and artificial lighting, and qualitative experiments across diverse outdoor scenes. The results demonstrate superior performance over recent learning-based models and point toward a scalable path to curating real-world ordinal supervision, previously infeasible via manual labeling.

中文标题/摘要

标题：可见-热图像强度的序数性在固有图像分解中的应用

将图像分解为其固有光度学因素——阴影和反射率——一直是一个长期的挑战，由于缺乏现实场景的大量真实数据。最近的方法依赖于合成数据或有限的室内稀疏注释，甚至更少的室外场景注释。我们提出了一种无需训练的新方法，仅使用可见光和热图像对进行固有图像分解。我们利用光被不透明表面吸收并由热像仪检测为热量的原理。这使我们能够将可见光和热图像强度的序数性与阴影和反射率的序数性联系起来，从而密集地自我监督优化神经网络以恢复阴影和反射率。我们使用已知反射率和阴影在自然光和人工光下进行定量评估，并在多种室外场景中进行定性实验。结果表明，该方法在性能上优于最近的基于学习的模型，并指出了通过手动标注难以实现的现实场景序数监督的可扩展路径。

Summary / 总结

The paper addresses the challenge of intrinsic image decomposition by proposing a training-free method using visible and thermal images. It leverages the principle that light not reflected from an opaque surface is absorbed and detected as heat, allowing the ordinalities between visible and thermal image intensities to be related to the ordinalities of shading and reflectance. This self-supervision enables a neural network to recover shading and reflectance. Experiments show that this approach outperforms recent learning-based models and suggests a scalable method for obtaining real-world ordinal supervision without manual labeling.

论文提出了一种无需训练的方法，利用可见光和热成像来分解图像的固有光度因素。该方法基于光未被不透明表面反射会被吸收并以热的形式被热成像仪检测到的原理，使得可见光和热成像强度的顺序关系可以与阴影和反射率的顺序关系联系起来，从而自监督优化神经网络以恢复阴影和反射率。实验结果表明，该方法优于现有的基于学习的模型，并提出了一种无需人工标注即可获得真实世界顺序监督的可扩展方法。
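
The ordinal argument in the abstract above can be made concrete with a small sketch. It rests on a deliberately simplified model that is an assumption here, not the paper's formulation: visible intensity V = rho * S and thermal intensity T = k * (1 - rho) * S, with reflectance rho in (0, 1), shading S > 0 and an unknown scale k > 0. Under that model S = V + T / k and rho = V / (V + T / k), so the signs of the visible and thermal differences between two pixels decide which factor changed.

```python
def render(rho, shading, k=1.0):
    """Toy forward model (assumed): returns (visible, thermal) intensities."""
    return rho * shading, k * (1.0 - rho) * shading

def ordinal_cue(vis_a, thm_a, vis_b, thm_b, eps=1e-6):
    """Compare pixel a against pixel b using only the signs of the differences.

    Both brighter            -> a receives more shading (S = V + T / k grows).
    Both darker              -> a receives less shading.
    Visible up, thermal down -> a has higher reflectance (T / V shrinks).
    Visible down, thermal up -> a has lower reflectance.
    """
    dv, dt = vis_a - vis_b, thm_a - thm_b
    if abs(dv) <= eps or abs(dt) <= eps:
        return "ambiguous"
    if dv > 0 and dt > 0:
        return "shading_a > shading_b"
    if dv < 0 and dt < 0:
        return "shading_a < shading_b"
    if dv > 0 and dt < 0:
        return "reflectance_a > reflectance_b"
    return "reflectance_a < reflectance_b"

# Same shading, different reflectance.
cue_reflectance = ordinal_cue(*render(0.8, 1.0), *render(0.3, 1.0))
# Same reflectance, different shading.
cue_shading = ordinal_cue(*render(0.5, 2.0), *render(0.5, 1.0))
```

Because only the signs of the differences are used, the unknown scale k never has to be estimated, which is one plausible reason ordinal relations are attractive as dense supervision.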

Towards Understanding Visual Grounding in Visual Language Models

Authors: Georgios Pantazopoulos, Eda B. Özyiğit

First: 2025-09-12T15:33:49+00:00 · Latest: 2025-09-12T15:33:49+00:00

Abstract

Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in various domains, including referring expression comprehension, answering questions pertinent to fine-grained details in images or videos, captioning visual context by explicitly referring to entities, as well as low- and high-level control in simulated and real environments. In this survey paper, we review representative works across the key areas of research on modern general-purpose vision language models (VLMs). We first outline the importance of grounding in VLMs, then delineate the core components of the contemporary paradigm for developing grounded models, and examine their practical applications, including benchmarks and evaluation metrics for grounded multimodal generation. We also discuss the multifaceted interrelations among visual grounding, multimodal chain-of-thought, and reasoning in VLMs. Finally, we analyse the challenges inherent to visual grounding and suggest promising directions for future research.

中文标题/摘要

标题：理解视觉语言模型中的视觉定位

视觉定位是指模型识别视觉输入中与文本描述匹配的区域的能力。因此，具备视觉定位能力的模型可以应用于各种领域的广泛应用，包括指示表达理解、回答与图像或视频中的细粒度细节相关的问题、通过明确指代实体描述视觉上下文，以及在模拟和真实环境中进行低级和高级控制。在本文综述中，我们回顾了现代通用视觉语言模型（VLMs）研究领域的代表性工作。我们首先概述了视觉定位在VLMs中的重要性，然后阐述了当前开发定位模型的核心组件，并探讨了它们的实际应用，包括定位多模态生成的基准和评估指标。我们还讨论了视觉定位、多模态推理链和VLMs推理之间的多方面关系。最后，我们分析了视觉定位固有的挑战，并提出了未来研究的有希望的方向。

Summary / 总结

This paper explores the concept of visual grounding in visual language models, which involves identifying regions in visual inputs that match textual descriptions. The research highlights the importance of visual grounding for various applications such as referring expression comprehension and multimodal generation. Key findings include the review of core components and practical applications of grounded models, as well as the challenges and future research directions in this field.

本文探讨了视觉语言模型中的视觉定位概念，即识别视觉输入中与文本描述匹配的区域。研究强调了视觉定位对于各种应用的重要性，如指示表达理解和多模态生成。关键发现包括对定位模型核心组件和实际应用的回顾，以及该领域面临的挑战和未来研究方向。

Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving

Authors: Runwei Guan, Jianan Liu, Ningwei Ouyang, Shaofeng Liang, Daizong Liu, Xiaolou Sun, Lianqing Zheng, Ming Xu, Yutao Yue, Guoqiang Mao, Hui Xiong

First: 2025-03-11T11:48:27+00:00 · Latest: 2025-09-12T15:05:09+00:00

Comments: 13 pages, 12 figures

Abstract

Embodied outdoor scene understanding forms the foundation for autonomous agents to perceive, analyze, and react to dynamic driving environments. However, existing 3D understanding is predominantly based on 2D Vision-Language Models (VLMs), which collect and process limited scene-aware contexts. In contrast to 2D planar visual information, point cloud sensors such as LiDAR provide rich depth and fine-grained 3D representations of objects. Even better, the emerging 4D millimeter-wave radar detects the motion trend, velocity, and reflection intensity of each object. The integration of these two modalities provides more flexible querying conditions for natural language, thereby supporting more accurate 3D visual grounding. To this end, we propose a novel method called TPCNet, the first outdoor 3D visual grounding model built on the paradigm of prompt-guided point cloud sensor combination, including both LiDAR and radar sensors. To optimally combine the features of these two sensors required by the prompt, we design a multi-fusion paradigm called Two-Stage Heterogeneous Modal Adaptive Fusion. Specifically, this paradigm initially employs Bidirectional Agent Cross-Attention (BACA), which feeds both-sensor features, characterized by global receptive fields, to the text features for querying. Moreover, we design a Dynamic Gated Graph Fusion (DGGF) module to locate the regions of interest identified by the queries. To further enhance accuracy, we devise a C3D-RECHead, based on the nearest object edge to the ego-vehicle. Experimental results demonstrate that our TPCNet, along with its individual modules, achieves state-of-the-art performance on both the Talk2Radar and Talk2Car datasets. We release the code at https://github.com/GuanRunwei/TPCNet.

中文标题/摘要

标题：Talk2PC：通过LiDAR和雷达点云融合增强自主驾驶的3D视觉定位

具身户外场景理解是自主代理感知、分析和应对动态驾驶环境的基础。然而，现有的3D理解主要基于2D视觉语言模型（VLMs），收集和处理的场景感知上下文有限。相比之下，与2D平面视觉信息相比，点云传感器如LiDAR提供了丰富的深度和精细的3D对象表示。更进一步，新兴的4D毫米波雷达检测每个对象的运动趋势、速度和反射强度。将这两种模态结合起来，为自然语言提供了更灵活的查询条件，从而支持更准确的3D视觉定位。为此，我们提出了一种名为TPCNet的新方法，这是第一个基于提示引导点云传感器组合的户外3D视觉定位模型，包括LiDAR和雷达传感器。为了优化结合这两种传感器所需的特征，我们设计了一种多模态自适应融合的双阶段异构模态融合范式。具体而言，该范式最初使用双向代理交叉注意力（BACA），将两种传感器特征，由全局感受野表征，输入到文本特征中进行查询。此外，我们设计了一个动态门控图融合（DGGF）模块，以定位由查询识别的感兴趣区域。为了进一步提高准确性，我们基于与自主车辆最近的物体边缘设计了一个C3D-RECHead。实验结果表明，我们的TPCNet及其各个模块在Talk2Radar和Talk2Car数据集上均达到了最先进的性能。我们已在https://github.com/GuanRunwei/TPCNet/发布了代码。

Summary / 总结

The research aims to enhance 3D visual grounding for autonomous driving by integrating LiDAR and radar point clouds. The proposed TPCNet model uses a Two-Stage Heterogeneous Modal Adaptive Fusion approach, which includes Bidirectional Agent Cross-Attention and Dynamic Gated Graph Fusion, to combine LiDAR and radar features. The model achieves state-of-the-art performance on the Talk2Radar and Talk2Car datasets, demonstrating improved accuracy in 3D visual grounding.

研究旨在通过融合LiDAR和雷达点云来提升自主驾驶中的3D视觉定位。提出的TPCNet模型采用两阶段异模自适应融合方法，包括双向代理交叉注意力和动态门控图融合，以结合LiDAR和雷达特征。该模型在Talk2Radar和Talk2Car数据集上达到了最先进的性能，展示了3D视觉定位的改进准确性。

Detecting Text Manipulation in Images using Vision Language Models

Authors: Vidit Vidit, Pavel Korshunov, Amir Mohammadi, Christophe Ecabert, Ketan Kotwal, Sébastien Marcel

Venue: Synthetic Realities and Biometric Security Workshop, BMVC 2025

First: 2025-09-12T14:20:29+00:00 · Latest: 2025-09-12T14:20:29+00:00

Comments: Accepted in Synthetic Realities and Biometric Security Workshop BMVC-2025. For paper page see https://www.idiap.ch/paper/textvlmdet/

Abstract

Recent works have shown the effectiveness of Large Vision Language Models (VLMs or LVLMs) in image manipulation detection. However, text manipulation detection is largely missing in these studies. We bridge this knowledge gap by analyzing closed- and open-source VLMs on different text manipulation datasets. Our results suggest that open-source models are getting closer, but are still behind closed-source ones like GPT-4o. Additionally, we benchmark image manipulation detection-specific VLMs for text manipulation detection and show that they suffer from the generalization problem. We benchmark VLMs for manipulations done on in-the-wild scene texts and on fantasy ID cards, where the latter mimic a challenging real-world misuse.

中文标题/摘要

标题：使用视觉语言模型检测图像中的文本操纵

近期研究表明，大型视觉语言模型（VLMs或LVLMs）在图像操纵检测方面非常有效。然而，这些研究中几乎没有涉及文本操纵检测。我们通过分析不同文本操纵数据集上的闭源和开源VLMs来填补这一知识空白。我们的结果显示，开源模型正在接近，但仍落后于闭源模型如GPT-4o。此外，我们还对专门用于文本操纵检测的图像操纵检测VLMs进行了基准测试，并表明它们存在泛化问题。我们对在自然场景文本和幻想身份卡上进行的操纵进行了基准测试，后者模仿了现实世界中的复杂误用。

MagicMirror: A Large-Scale Dataset and Benchmark for Fine-Grained Artifacts Assessment in Text-to-Image Generation

Authors: Jia Wang, Jie Hu, Xiaoqi Ma, Hanghang Ma, Yanbing Zeng, Xiaoming Wei

First: 2025-09-12T14:03:00+00:00 · Latest: 2025-09-12T14:03:00+00:00

Abstract

Text-to-image (T2I) generation has achieved remarkable progress in instruction following and aesthetics. However, a persistent challenge is the prevalence of physical artifacts, such as anatomical and structural flaws, which severely degrade perceptual quality and limit application. Given the diversity and complexity of these artifacts, a systematic and fine-grained evaluation framework is required, which is lacking in current benchmarks. To fill this gap, we introduce MagicMirror, a comprehensive framework for artifacts assessment. We first establish a detailed taxonomy of generated image artifacts. Guided by this taxonomy, we manually annotate MagicData340K, the first human-annotated large-scale dataset of 340K generated images with fine-grained artifact labels. Building on this dataset, we train MagicAssessor, a Vision-Language Model (VLM) that provides detailed assessments and corresponding labels. To overcome challenges like class imbalance and reward hacking, we design a novel data sampling strategy and a multi-level reward system for Group Relative Policy Optimization (GRPO). Finally, we leverage MagicAssessor to construct MagicBench, an automated benchmark for evaluating the image artifacts of current T2I models. Our evaluation with MagicBench reveals that despite their widespread adoption, even top-tier models like GPT-image-1 are consistently plagued by significant artifacts, highlighting artifact reduction as a critical frontier for future T2I development. Project page: https://wj-inf.github.io/MagicMirror-page/.

中文标题/摘要

标题：MagicMirror：大规模数据集和基准评估文本到图像生成中的细粒度缺陷

文本到图像（T2I）生成在指令跟随和美学方面取得了显著进展。然而，持续存在的挑战是物理缺陷的普遍存在，如解剖和结构缺陷，这些缺陷严重降低了感知质量并限制了应用。鉴于这些缺陷的多样性和复杂性，需要一个系统和细粒度的评估框架，而当前基准中缺乏这种框架。为填补这一空白，我们引入了MagicMirror，一个全面的缺陷评估框架。我们首先建立了一个详细的生成图像缺陷分类体系。受此分类体系的指导，我们手动标注了MagicData340K，这是第一个包含340K生成图像和细粒度缺陷标签的人工标注大规模数据集。基于此数据集，我们训练了MagicAssessor，这是一个视觉-语言模型（VLM），提供详细的评估和相应的标签。为克服类不平衡和奖励作弊等挑战，我们设计了一种新的数据采样策略和多级奖励系统，用于组相对策略优化（GRPO）。最后，我们利用MagicAssessor构建了MagicBench，这是一个自动基准，用于评估当前T2I模型的图像缺陷。我们的MagicBench评估显示，尽管这些模型被广泛采用，即使是顶级模型如GPT-image-1也持续受到显著缺陷的困扰，突显了减少缺陷是未来T2I开发的关键前沿领域。项目页面：https://wj-inf.github.io/MagicMirror-page/

Summary / 总结

MagicMirror is a large-scale dataset and benchmark for fine-grained assessment of artifacts in text-to-image generation. It introduces a detailed taxonomy of image artifacts and manually annotates 340K generated images with fine-grained labels. Using this dataset, a Vision-Language Model (VLM) named MagicAssessor is trained to provide detailed assessments. The evaluation with MagicBench shows that even top-tier models like GPT-image-1 still suffer from significant artifacts, emphasizing the need for artifact reduction in T2I models.

MagicMirror 是一个用于评估文本到图像生成中细粒度缺陷的数据集和基准。它引入了一个详细的图像缺陷分类体系，并手动标注了包含340K张图像的MagicData340K大型数据集。使用该数据集训练了一个视觉-语言模型MagicAssessor，以提供详细的评估。研究使用了一种新颖的数据采样策略和多层次奖励系统来解决类别不平衡和奖励作弊的问题。MagicBench 基准测试了当前的 T2I 模型，并揭示即使是顶级模型也存在显著的缺陷，表明未来 T2I 发展中需要减少缺陷。

VARCO-VISION-2.0 Technical Report

Authors: Young-rok Cha, Jeongho Ju, SunYoung Park, Jong-Hyeon Lee, Younghyun Yu, Youngjune Kim

First: 2025-09-12T09:55:56+00:00 · Latest: 2025-09-12T09:55:56+00:00

Comments: 19 pages, 1 figure, 14 tables. Technical report for VARCO-VISION-2.0, a Korean-English bilingual VLM in 14B and 1.7B variants. Key features: multi-image understanding, OCR with text localization, improved Korean capabilities

Abstract

We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layout-aware OCR by predicting both textual content and its spatial location. Trained with a four-stage curriculum with memory-efficient techniques, the model achieves enhanced multimodal alignment, while preserving core language abilities and improving safety via preference optimization. Extensive benchmark evaluations demonstrate strong spatial grounding and competitive results for both languages, with the 14B model achieving 8th place on the OpenCompass VLM leaderboard among models of comparable scale. Alongside the 14B-scale model, we release a 1.7B version optimized for on-device deployment. We believe these models advance the development of bilingual VLMs and their practical applications. Two variants of VARCO-VISION-2.0 are available at Hugging Face: a full-scale 14B model and a lightweight 1.7B model.

中文标题/摘要

标题：VARCO-VISION-2.0 技术报告

我们介绍了VARCO-VISION-2.0，这是一种改进了功能的开放重量双语视觉语言模型（VLM），支持韩语和英语。该模型能够理解复杂输入如文档、图表和表格，并通过预测文本内容及其空间位置提供布局感知的OCR。通过使用四阶段课程训练和高效内存技术，该模型实现了增强的多模态对齐，同时保留了核心语言能力并提高了安全性。广泛的基准测试表明，该模型在空间定位方面表现出色，并且在两种语言上都取得了竞争力的结果，14B模型在OpenCompass VLM排行榜上排名第8。除了14B规模的模型，我们还发布了1.7B规模的版本，优化了设备端部署。我们相信这些模型推动了双语VLM的发展及其实际应用。VARCO-VISION-2.0在Hugging Face上提供了两种变体：14B全规模模型和1.7B轻量级模型。

Summary / 总结

VARCO-VISION-2.0 is an open-weight bilingual vision-language model for Korean and English, enhancing previous capabilities with improved multimodal alignment and layout-aware OCR. Trained using a four-stage curriculum and memory-efficient techniques, it supports multi-image understanding and achieves strong spatial grounding, ranking 8th on the OpenCompass VLM leaderboard. Two variants are available: a full-scale 14B model and a lightweight 1.7B model optimized for on-device deployment.

VARCO-VISION-2.0 是一个双语视觉语言模型，支持韩语和英语，通过四阶段课程和高效训练技术增强了前一代的能力。它支持多图像理解及带有文本定位的OCR，基准测试结果显示其具有强大的空间定位能力和竞争力。14B模型在OpenCompass VLM排行榜上排名第8，而1.7B轻量级变体则优化了设备端部署。这些模型促进了双语VLM的发展及其实际应用。

GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

Authors: Jieming Cui, Tengyu Liu, Ziyu Meng, Jiale Yu, Ran Song, Wei Zhang, Yixin Zhu, Siyuan Huang

First: 2025-04-05T14:44:47+00:00 · Latest: 2025-09-12T09:55:20+00:00

Abstract

Learning open-vocabulary physical skills for simulated agents presents a significant challenge in artificial intelligence. Current reinforcement learning approaches face critical limitations: manually designed rewards lack scalability across diverse tasks, while demonstration-based methods struggle to generalize beyond their training distribution. We introduce GROVE, a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations. Our key insight is that Large Language Models (LLMs) and Vision Language Models (VLMs) provide complementary guidance: LLMs generate precise physical constraints capturing task requirements, while VLMs evaluate motion semantics and naturalness. Through an iterative design process, VLM-based feedback continuously refines LLM-generated constraints, creating a self-improving reward system. To bridge the domain gap between simulation and natural images, we develop Pose2CLIP, a lightweight mapper that efficiently projects agent poses directly into semantic feature space without computationally expensive rendering. Extensive experiments across diverse embodiments and learning paradigms demonstrate GROVE's effectiveness, achieving 22.2% higher motion naturalness and 25.7% better task completion scores while training 8.4x faster than previous methods. These results establish a new foundation for scalable physical skill acquisition in simulated environments.

中文标题/摘要

标题：GROVE：一种通用奖励，用于学习开放词汇物理技能

为模拟代理学习开放词汇物理技能在人工智能中提出了重大挑战。当前的强化学习方法面临关键限制：手动设计的奖励缺乏在多种任务中的可扩展性，而基于演示的方法则难以泛化到训练分布之外。我们引入了GROVE，一种通用奖励框架，使开放词汇物理技能学习无需手动工程或特定任务的演示。我们的核心见解是，大型语言模型（LLMs）和视觉语言模型（VLMs）提供了互补的指导——LLMs生成精确的物理约束，捕捉任务要求，而VLMs评估运动语义和自然性。通过迭代设计过程，基于VLM的反馈不断细化LLM生成的约束，形成一个自我改进的奖励系统。为了弥合模拟与自然图像之间的领域差距，我们开发了Pose2CLIP，这是一种轻量级映射器，可以高效地将代理姿态直接投影到语义特征空间，而无需昂贵的渲染。在多种体态和学习范式的广泛实验中，GROVE的有效性得到了验证，实现了22.2%更高的运动自然性和25.7%更好的任务完成分数，同时训练速度比以前的方法快8.4倍。这些结果为模拟环境中可扩展的物理技能获取奠定了新的基础。

Summary / 总结

GROVE is a generalized reward framework for learning open-vocabulary physical skills in simulated agents. It leverages Large Language Models (LLMs) to generate precise physical constraints and Vision Language Models (VLMs) to evaluate motion semantics and naturalness, creating a self-improving reward system. Experiments show GROVE achieves higher motion naturalness and better task completion scores, training 8.4 times faster than previous methods.

GROVE 是一种用于模拟代理学习开放词汇物理技能的通用奖励框架，解决了手动设计奖励和基于演示的方法的局限性。它利用大型语言模型（LLMs）生成精确的物理约束，并利用视觉语言模型（VLMs）评估运动语义和自然性，创建了一个自我改进的奖励系统。实验表明，GROVE 在运动自然性和任务完成度方面表现更好，训练速度比之前的方法快 8.4 倍。
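
The complementary-guidance idea in the GROVE abstract (constraints from an LLM plus a semantic score from a VLM) could be combined into a single scalar reward roughly as sketched below. Everything here is a hypothetical stand-in: the predicate-style constraints, the feature vectors, the weights and the function names are assumptions for illustration and do not reproduce GROVE or Pose2CLIP.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def constraint_score(state, constraints):
    """Fraction of constraints (boolean predicates over the state) that hold."""
    if not constraints:
        return 1.0
    return sum(1.0 for check in constraints if check(state)) / len(constraints)

def combined_reward(state, pose_feature, text_feature, constraints,
                    w_constraint=0.5, w_semantic=0.5):
    """Weighted sum of constraint satisfaction and pose/text similarity.

    pose_feature stands in for the output of a pose-to-feature mapper and
    text_feature for an embedding of the instruction (both assumed inputs).
    """
    return (w_constraint * constraint_score(state, constraints)
            + w_semantic * cosine(pose_feature, text_feature))

# Hypothetical task: "raise the right hand while standing".
constraints = [
    lambda s: s["right_hand_z"] > s["head_z"],
    lambda s: s["root_z"] > 0.7,
]
state = {"right_hand_z": 1.8, "head_z": 1.6, "root_z": 0.9}
reward_aligned = combined_reward(state, [1.0, 0.0], [1.0, 0.0], constraints)
reward_orthogonal = combined_reward(state, [1.0, 0.0], [0.0, 1.0], constraints)
```

In this toy setup the constraint term checks task requirements while the similarity term scores how well the pose matches the instruction, mirroring the division of labor the abstract describes; the iterative refinement of constraints by VLM feedback is not modeled.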

MedM-VL: What Makes a Good Medical LVLM?

Authors: Yiming Shi, Shaoshuai Yang, Xun Zhu, Haoyu Wang, Xiangling Fu, Miao Li, Ji Wu

First: 2025-04-06T01:44:46+00:00 · Latest: 2025-09-12T09:13:18+00:00

Abstract

Medical image analysis is essential in modern healthcare. Deep learning has redirected research focus toward complex medical multimodal tasks, including report generation and visual question answering. Traditional task-specific models often fall short in handling these challenges. Large vision-language models (LVLMs) offer new solutions for solving such tasks. In this study, we build on the popular LLaVA framework to systematically explore model architectures and training strategies for both 2D and 3D medical LVLMs. We present extensive empirical findings and practical guidance. To support reproducibility and future research, we release a modular codebase, MedM-VL, and two pre-trained models: MedM-VL-2D for 2D medical image analysis and MedM-VL-CT-Chest for 3D CT-based applications. The code is available at: https://github.com/MSIIP/MedM-VL

中文标题/摘要

标题：MedM-VL：什么是好的医学LVLM？

医学图像分析是现代医疗保健中的重要组成部分。深度学习已将研究重点转向复杂的医学多模态任务，包括报告生成和视觉问答。传统的任务特定模型在处理这些挑战时往往不够理想。大型视觉语言模型（LVLM）为解决此类任务提供了新的解决方案。在本研究中，我们基于流行的LLaVA框架，系统地探索了适用于2D和3D医学LVLM的模型架构和训练策略。我们提供了详尽的经验研究结果和实用指导。为了支持可重复性和未来研究，我们发布了模块化代码库MedM-VL，并提供了两个预训练模型：MedM-VL-2D用于2D医学图像分析，MedM-VL-CT-Chest用于基于3D CT的应用。代码可在以下链接获取：https://github.com/MSIIP/MedM-VL

When and How Does CLIP Enable Domain and Compositional Generalization?

Authors: Elias Kempf, Simon Schrodi, Max Argus, Thomas Brox

Venue: ICML 2025 Spotlight

First: 2025-02-13T17:21:37+00:00 · Latest: 2025-09-12T08:50:44+00:00

Comments: ICML 2025 (Spotlight)

Abstract

The remarkable generalization performance of contrastive vision-language models like CLIP is often attributed to the diversity of their training distributions. However, key questions remain unanswered: Can CLIP generalize to an entirely unseen domain when trained on a diverse mixture of domains (domain generalization)? Can it generalize to unseen classes within partially seen domains (compositional generalization)? What factors affect such generalization? To answer these questions, we trained CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure. Our experiments show that domain diversity is essential for both domain and compositional generalization, yet compositional generalization can be surprisingly weaker than domain generalization when the training distribution contains a suboptimal subset of the test domain. Through data-centric and mechanistic analyses, we find that successful generalization requires the learning of sufficiently shared representations in intermediate layers and circuits.

中文标题/摘要

标题：CLIP在何时和如何实现领域和组分泛化？

对比视觉-语言模型如CLIP的出色泛化性能通常归因于其训练分布的多样性。然而，关键问题仍未解答：当CLIP在多样化的领域混合中进行训练时（领域泛化），它能否在完全未见过的领域中泛化？在部分已见过的领域中，它能否在未见过的类别上泛化（组分泛化）？哪些因素会影响这种泛化？为了回答这些问题，我们在系统构建的具有控制领域多样性和对象类别暴露的训练分布上训练了CLIP模型。我们的实验表明，领域多样性对于领域泛化和组分泛化都是必不可少的，但在训练分布包含测试领域中次优子集时，组分泛化可能会出人意料地弱于领域泛化。通过数据为中心和机制分析，我们发现成功的泛化需要在中间层和电路中学习足够共享的表示。

Summary / 总结

This study investigates the conditions under which CLIP models can generalize to unseen domains and classes. By controlling the diversity of training data, the researchers found that domain diversity is crucial for both domain and compositional generalization, but compositional generalization can be weaker if the training data lacks certain elements of the test domain. Successful generalization depends on learning shared representations in intermediate layers and circuits.

研究探讨了CLIP在何种情况下能够泛化到未见过的领域和类别。通过控制训练数据的多样性，研究人员发现领域多样性对于领域泛化和组合泛化都至关重要，但若训练数据缺乏测试领域的一部分，则组合泛化可能会较弱。成功的泛化依赖于在中间层和回路中学习到足够共享的表示。

Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration

Authors: Yue Zhou, Litong Feng, Mengcheng Lan, Xue Yang, Qingyun Li, Yiping Ke, Xue Jiang, Wayne Zhang

First: 2025-09-12T08:46:49+00:00 · Latest: 2025-09-12T08:46:49+00:00

Comments: 17 pages, 16 figures

Abstract

Mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimations, and spatial analysis in unmanned aerial vehicle (UAV) based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing UAV-based trustworthy VLMs in real-world applications. The code and datasets will be released at https://github.com/VisionXLab/avi-math

中文标题/摘要

标题：嵌入航空器图像中的多模态数学推理：基准测试、分析与探索

数学推理对于无人机（UAV）基于遥感的任务至关重要，如精确的距离和面积计算、轨迹估计和空间分析，但当前的视觉-语言模型（VLMs）尚未在这一领域得到充分测试。为解决这一问题，我们引入了AVI-Math，这是首个严格评估航空器图像中多模态数学推理的基准，超越了简单的计数任务，包括几何学、逻辑学和代数等领域的专业知识。数据集包含3,773个高质量的与车辆相关的问答题，覆盖6个数学学科和20个主题。数据在不同高度和多个无人机视角下收集，反映了真实的无人机场景，确保了构建的数学问题的多样性和复杂性。在本文中，我们通过全面评估对14个主要的VLMs进行了基准测试，并展示了尽管这些模型在之前的多模态基准测试中取得了成功，但在AVI-Math中的推理任务上却表现不佳。我们的详细分析突显了当前VLMs在数学推理能力方面的显著局限性，并提出了未来研究的方向。此外，我们还探讨了使用链式思考提示和微调技术，这些技术在解决AVI-Math中的推理挑战方面显示出潜力。我们的研究不仅揭示了VLMs在数学推理方面的局限性，还为在实际应用中推进基于无人机的可信VLMs提供了宝贵的见解。代码和数据集将在https://github.com/VisionXLab/avi-math发布。

Summary / 总结

This paper introduces AVI-Math, a benchmark for evaluating multimodal mathematical reasoning in aerial vehicle imagery, which includes geometry, logic, and algebra tasks. The dataset consists of 3,773 high-quality questions covering six mathematical subjects and 20 topics. The benchmark tests 14 prominent vision-language models, revealing their limitations in mathematical reasoning. The study also explores Chain-of-Thought prompting and fine-tuning techniques to improve reasoning capabilities. The findings highlight the need for better mathematical reasoning in VLMs for UAV-based applications.

本文介绍了AVI-Math基准，用于评估航空器图像中的多模态数学推理能力，包括几何、逻辑和代数任务。数据集包含3,773个高质量问题，覆盖六个数学主题和20个主题。该基准测试了14个主流的视觉语言模型，揭示了它们在数学推理方面的局限性。研究还探索了使用链式思考提示和微调技术来提高推理能力。研究结果强调了在航空器应用中需要改进视觉语言模型的数学推理能力。

Color Me Correctly: Bridging Perceptual Color Spaces and Text Embeddings for Improved Diffusion Generation

Authors: Sung-Lin Tsai, Bo-Lun Huang, Yu Ting Shen, Cheng Yu Yeo, Chiang Tseng, Bo-Kai Ruan, Wen-Sheng Lien, Hong-Han Shuai

Venue: ACM Multimedia 2025 (MM '25)

First: 2025-09-12T08:44:22+00:00 · Latest: 2025-09-12T08:44:22+00:00

Comments: Accepted to ACM Multimedia 2025 (MM '25)

Abstract

Accurate color alignment in text-to-image (T2I) generation is critical for applications such as fashion, product visualization, and interior design, yet current diffusion models struggle with nuanced and compound color terms (e.g., Tiffany blue, lime green, hot pink), often producing images that are misaligned with human intent. Existing approaches rely on cross-attention manipulation, reference images, or fine-tuning but fail to systematically resolve ambiguous color descriptions. To precisely render colors under prompt ambiguity, we propose a training-free framework that enhances color fidelity by leveraging a large language model (LLM) to disambiguate color-related prompts and guiding color blending operations directly in the text embedding space. Our method first employs a large language model (LLM) to resolve ambiguous color terms in the text prompt, and then refines the text embeddings based on the spatial relationships of the resulting color terms in the CIELAB color space. Unlike prior methods, our approach improves color accuracy without requiring additional training or external reference images. Experimental results demonstrate that our framework improves color alignment without compromising image quality, bridging the gap between text semantics and visual generation.

中文标题/摘要

标题：正确着色：通过知觉色彩空间和文本嵌入连接以改进扩散生成

在文本到图像(T2I)生成中准确的颜色对齐对于时尚、产品可视化和室内设计等应用至关重要，但当前的扩散模型在处理复杂的色彩描述（如蒂芙尼蓝、青柠绿、热粉红）时常常产生与人类意图不符的图像。现有方法依赖于交叉注意力操作、参考图像或微调，但无法系统地解决模糊的色彩描述。为了在提示模糊的情况下精确渲染颜色，我们提出了一种无需训练的框架，通过大型语言模型（LLM）来澄清与色彩相关的提示，并直接在文本嵌入空间中引导色彩混合操作。该方法首先使用大型语言模型（LLM）解决文本提示中的模糊色彩术语，然后根据CIELAB色彩空间中结果色彩术语的空间关系细化文本嵌入。与先前的方法不同，我们的方法在无需额外训练或外部参考图像的情况下提高了色彩准确性。实验结果表明，我们的框架在不牺牲图像质量的情况下提高了颜色对齐，填补了文本语义与视觉生成之间的差距。

Summary / 总结

This paper addresses the challenge of accurate color alignment in text-to-image generation, particularly for nuanced and compound color terms. It proposes a training-free framework that uses a large language model to disambiguate color-related prompts and guides color blending operations in the text embedding space, based on the CIELAB color space. Experiments show that this method enhances color fidelity without degrading image quality, effectively bridging the gap between text semantics and visual generation.

研究旨在提高文本生成图像中的颜色准确性，特别是对于复杂的颜色术语。提出了一种无需额外训练的框架，利用大型语言模型来解析颜色相关的提示，并在CIELAB颜色空间中基于颜色术语的空间关系指导颜色混合操作。实验表明，该方法能够提高颜色准确性而不损害图像质量，有效地弥合了文本语义与视觉生成之间的差距。

LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA

Authors: Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Jianshu Li

First: 2025-09-12T07:45:44+00:00 · Latest: 2025-09-12T07:45:44+00:00

Comments: 12 Pages, 12 Figures, 2 Tables

Abstract

As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2× larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at https://github.com/HJNVR/LaV-CoT

中文标题/摘要

标题：LaV-CoT：面向真实世界多语言VQA的语言感知视觉CoT与多方面奖励优化

随着大型视觉语言模型（VLMs）的发展，它们在多语言视觉问答（mVQA）方面的能力显著提高。链式思考（CoT）推理已被证明可以增强可解释性和复杂推理。然而，大多数现有方法主要依赖于文本CoT，对多语言多模态推理的支持有限，限制了它们在实际应用中的部署。为了弥补这一差距，我们提出了LaV-CoT，这是第一个具有多方面奖励优化的语言感知视觉CoT框架。LaV-CoT结合了一个可解释的多阶段推理管道，包括带有边界框的文本摘要、语言识别、空间对象级描述和逐步逻辑推理。遵循这一推理管道，我们设计了一种自动数据整理方法，通过迭代生成、修正和精炼生成多语言CoT注释，从而实现可扩展和高质量的训练数据。为了提高推理能力和泛化能力，LaV-CoT采用了一种结合监督微调（SFT）和语言感知组相对策略优化（GRPO）的两阶段训练范式，由可验证的多方面奖励（包括语言一致性、结构准确性和语义对齐）引导。在包括MMMB、多语言MMBench和MTVQA的公共数据集上的广泛评估表明，LaV-CoT相比规模相近的开源基线模型实现了高达约9.5%的准确率提升，甚至比规模大两倍的模型高出约2.6%。此外，LaV-CoT的表现优于GPT-4o-0513和Gemini-2.5-flash等先进专有模型。我们还进行了在线A/B测试，在真实数据上验证了该方法，突显了其在工业部署中的有效性。我们的代码可在以下链接获取：https://github.com/HJNVR/LaV-CoT

MoPD: Mixture-of-Prompts Distillation for Vision-Language Models

Authors: Yang Chen, Shuai Fu, Yu Zhang

First: 2024-12-26T06:57:04+00:00 · Latest: 2025-09-12T05:49:30+00:00

Abstract

Soft prompt learning methods are effective for adapting vision-language models (VLMs) to downstream tasks. Nevertheless, empirical evidence shows that existing methods tend to overfit seen classes and perform worse on unseen classes. This limitation is due to the inherent bias in the training data towards the seen classes. To address this issue, we propose a novel soft prompt learning method, named Mixture-of-Prompts Distillation (MoPD), which can effectively transfer useful knowledge from manually hand-crafted hard prompts (a.k.a. teacher prompts) to the learnable soft prompt (a.k.a. student prompt), thereby enhancing the generalization ability of soft prompts on unseen classes. Moreover, the proposed MoPD method utilizes a gating network that learns to select hard prompts used for prompt distillation. Extensive experiments demonstrate that the proposed MoPD method outperforms state-of-the-art baselines, especially on unseen classes.

中文标题/摘要

标题：MoPD：混合提示蒸馏用于视觉-语言模型

软提示学习方法对于将视觉-语言模型（VLMs）适应下游任务是有效的。然而，实证证据表明，现有方法倾向于过拟合已见过的类别，并在未见过的类别上表现出较差的性能。这一限制是由于训练数据对已见过的类别的固有偏差。为了解决这一问题，我们提出了一种新的软提示学习方法，称为混合提示蒸馏（MoPD），它可以有效地将硬提示（即教师提示）中手工构建的有用知识转移到可学习的软提示（即学生提示）中，从而增强软提示在未见过的类别的泛化能力。此外，所提出的MoPD方法利用了一个门控网络，该网络学习选择用于提示蒸馏的硬提示。广泛的实验表明，所提出的MoPD方法在未见过的类别上优于最先进的基线方法。

Summary / 总结

The research aims to improve the generalization ability of soft prompts in vision-language models by addressing the overfitting issue to seen classes. The proposed MoPD method uses a gating network to select hard prompts for distilling knowledge to soft prompts, enhancing performance on unseen classes. Experiments show that MoPD outperforms existing methods, particularly on unseen classes.

研究旨在通过解决对已见类别的过度拟合问题，提高软提示在视觉-语言模型中的泛化能力。提出的MoPD方法使用门控网络选择硬提示进行知识蒸馏到软提示，从而在未见类别上提升性能。实验表明，MoPD在未见类别上优于现有方法。

Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation

Authors: Hongji Yang, Yucheng Zhou, Wencheng Han, Jianbing Shen

First: 2025-05-22T15:05:07+00:00 · Latest: 2025-09-12T05:22:32+00:00

Comments: Accepted by ACL2025 Findings

Abstract

Text-to-image models are powerful for producing high-quality images based on given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework, designed to rephrase a simple user prompt into a sophisticated prompt to a text-to-image model. Specifically, we employ the large vision language models (LVLMs) as the solver to rewrite the user prompt, and concurrently, employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. Simultaneously, the solver and the reward model are unified into one model and iterated in reinforcement learning to achieve self-improvement by giving a solution and judging itself. Results on two popular datasets demonstrate that our method outperforms other strong competitors.

中文标题/摘要

标题：自我奖励大型视觉语言模型用于优化文本到图像生成中的提示

文本到图像模型可以根据给定的文本提示生成高质量的图像，但这些提示的创作往往需要专门的词汇。为了解决这个问题，现有方法通过大量手动标注数据和训练好的美学评估模型的监督来训练重写模型。为了减轻对数据规模的依赖以及训练模型引入的偏见，我们提出了一种新颖的提示优化框架，旨在将简单的用户提示重新表述为复杂的提示以供文本到图像模型使用。具体而言，我们使用大型视觉语言模型（LVLMs）作为求解器来重写用户提示，并同时使用LVLMs作为奖励模型来评估优化提示生成的图像的美学和对齐程度。我们利用LVLM的先验知识提供奖励，即AI反馈，而不是繁琐的人工反馈。同时，求解器和奖励模型被统一为一个模型，并通过强化学习迭代，在给出解答并自我评判的过程中实现自我改进。在两个流行数据集上的结果表明，我们的方法优于其他强竞争对手。

Summary / 总结

This paper addresses the challenge of crafting effective text prompts for text-to-image generation by proposing a novel framework that uses large vision-language models (LVLMs) for both prompt optimization and aesthetic scoring. The method iteratively refines user prompts and evaluates the generated images using AI feedback, eliminating the need for manual annotation. Experiments on two datasets show that this approach outperforms other methods.

论文提出了一种新的文本到图像生成的提示优化框架，使用大型视觉语言模型（LVLMs）既作为解题器重新表述用户提示，也作为奖励模型评估生成的图像。这种方法避免了大量人工反馈的需求，并减少了模型偏见。在两个流行数据集上的实验表明，该方法优于其他强竞争对手。

Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge

Authors: Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, Sami Muhaidat

First: 2025-09-12T04:11:59+00:00 · Latest: 2025-09-12T04:11:59+00:00

Comments: Submitted to IEEE Journals

Abstract

Large-scale transformers are central to modern semantic communication, yet their high computational and communication costs hinder deployment on resource-constrained edge devices. This paper introduces a training-free framework for adaptive token merging, a novel mechanism that compresses transformer representations at runtime by selectively merging semantically redundant tokens under per-layer similarity thresholds. Unlike prior fixed-ratio reduction, our approach couples merging directly to input redundancy, enabling data-dependent adaptation that balances efficiency and task relevance without retraining. We cast the discovery of merging strategies as a multi-objective optimization problem and leverage Bayesian optimization to obtain Pareto-optimal trade-offs between accuracy, inference cost, and communication cost. On ImageNet classification, we match the accuracy of the unmodified transformer with 30% fewer floating-point operations per second and under 20% of the original communication cost, while for visual question answering our method achieves performance competitive with the full LLaVA model at less than one-third of the compute and one-tenth of the bandwidth. Finally, we show that our adaptive merging is robust across varying channel conditions and provides inherent privacy benefits, substantially degrading the efficacy of model inversion attacks. Our framework provides a practical and versatile solution for deploying powerful transformer models in resource-limited edge intelligence scenarios.

中文标题/摘要

标题：边缘设备上高效Transformer语义通信的自适应令牌合并

大规模Transformer是现代语义通信的核心，但其高计算和通信成本阻碍了在资源受限的边缘设备上的部署。本文介绍了一种无需训练的自适应令牌合并框架，这是一种新颖的机制，在每层相似度阈值下通过选择性地合并语义冗余令牌，在运行时压缩Transformer表示。与之前的固定比例缩减不同，我们的方法将合并直接与输入冗余相关联，实现依赖于数据的自适应，在不重新训练的情况下平衡效率和任务相关性。我们将合并策略的发现视为一个多目标优化问题，并利用贝叶斯优化来获得准确度、推理成本和通信成本之间的帕累托最优权衡。在ImageNet分类上，我们以减少30%的每秒浮点运算量达到了未修改Transformer的准确度，并且通信成本不到原始成本的20%。对于视觉问答，我们的方法在不到完整LLaVA模型三分之一的计算量和十分之一的带宽下实现了有竞争力的性能。最后，我们展示了我们的自适应合并在不同信道条件下具有鲁棒性，并提供了固有的隐私保护，显著降低了模型反演攻击的效果。我们的框架为在资源受限的边缘智能场景中部署强大的Transformer模型提供了实用且灵活的解决方案。

Summary / 总结

This paper addresses the challenge of deploying large-scale transformers on resource-constrained edge devices by introducing a training-free adaptive token merging framework. This method compresses transformer representations at runtime by merging semantically redundant tokens based on per-layer similarity thresholds, without retraining. The approach balances efficiency and task relevance, achieving comparable accuracy to the unmodified transformer with significantly reduced computational and communication costs. For visual question answering, the method matches the performance of the full LLaVA model with much lower compute and bandwidth requirements.

本文提出了一种无需训练的自适应令牌合并框架，旨在解决在资源受限的边缘设备上部署大规模Transformer的挑战。该框架通过基于每层相似度阈值合并语义冗余令牌，在运行时压缩Transformer表示，而不需重新训练。该方法在兼顾效率和任务相关性的同时，实现了与未修改的Transformer相当的准确度，并且显著减少了计算和通信成本。对于视觉问答任务，该方法在使用更少的计算资源和带宽的情况下，与完整的LLaVA模型性能相当。此外，自适应合并方法在不同信道条件下表现出鲁棒性，并提供了一定的隐私保护，显著降低了模型反演攻击的效果。
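
As a rough illustration of threshold-based token merging, the Python sketch below greedily averages consecutive tokens whose cosine similarity to the running group mean meets a threshold. The function name `merge_similar_tokens`, the greedy consecutive grouping, and the single hand-picked threshold are assumptions made for this example. The abstract does not specify the paper's merging rule, and its per-layer thresholds are found by Bayesian optimization rather than set by hand.

```python
import numpy as np


def merge_similar_tokens(tokens: np.ndarray, threshold: float) -> np.ndarray:
    """Merge runs of consecutive tokens that are similar to each other.

    tokens: array of shape (num_tokens, dim).
    threshold: minimum cosine similarity (hypothetical per-layer value)
               between a token and the current group's mean for a merge.
    Returns an array of shape (num_groups, dim) holding each group's mean.
    """
    if len(tokens) == 0:
        return tokens.copy()

    groups = [[tokens[0]]]
    for tok in tokens[1:]:
        centroid = np.mean(groups[-1], axis=0)
        denom = np.linalg.norm(centroid) * np.linalg.norm(tok)
        # Treat zero vectors as dissimilar to avoid dividing by zero.
        sim = float(centroid @ tok / denom) if denom > 0 else 0.0
        if sim >= threshold:
            groups[-1].append(tok)   # redundant token: fold into the group
        else:
            groups.append([tok])     # distinct token: start a new group
    return np.stack([np.mean(g, axis=0) for g in groups])


# Toy input: the first two tokens are nearly parallel, the third is orthogonal.
example_tokens = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
merged = merge_similar_tokens(example_tokens, threshold=0.9)
print(merged.shape)
```

Because the number of merged tokens depends on how redundant the input is, a highly repetitive input shrinks more than a diverse one under the same threshold, which is the data-dependent behavior the abstract contrasts with fixed-ratio reduction.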

Data Matters Most: Auditing Social Bias in Contrastive Vision Language Models

Authors: Zahraa Al Sahili, Ioannis Patras, Matthew Purver

First: 2025-01-22T21:08:30+00:00 · Latest: 2025-09-11T20:42:53+00:00

Abstract

Vision-language models (VLMs) deliver strong zero-shot recognition but frequently inherit social biases from their training data. We systematically disentangle three design factors -- model size, training-data scale, and training-data source -- by comparing CLIP and OpenCLIP, two models that share an identical contrastive objective yet differ in encoder width and in the image-text corpora on which they are pre-trained (400M proprietary pairs vs. 400M/2B LAION). Across balanced face-analysis benchmarks, enlarging the encoder reduces gender skew in CLIP but amplifies both gender and racial skew in OpenCLIP; increasing the LAION corpus from 400M to 2B further increases OpenCLIP bias. At matched model and data budgets, substituting proprietary data with LAION improves gender fairness while increasing racial skew, underscoring data source as the primary driver of bias patterns. We also evaluate three post-hoc, test-time debiasing strategies -- Bias Prompts, Prompt Array, and SANER. Debiasing reduces but does not eliminate harm, and its effectiveness is source- and size-dependent: Bias Prompts most effectively reduce gender skew in CLIP at smaller model sizes, whereas Prompt Array and SANER more reliably reduce racial skew in OpenCLIP; scaling LAION reconfigures which method is most fair. Taken together, these findings challenge the assumption that bigger models or datasets are automatically fairer and foreground training data source as the key determinant of both bias and mitigation efficacy. We release code and evaluation scripts to enable transparent, reproducible auditing of future VLMs.

中文标题/摘要

标题：数据最重要：审计对比视觉语言模型中的社会偏见

视觉语言模型（VLMs）在零样本识别方面表现出色，但经常从训练数据中继承社会偏见。我们系统地拆分了三个设计因素——模型大小、训练数据规模和训练数据来源，通过比较CLIP和OpenCLIP两种模型，这两种模型具有相同的对比目标，但在编码器宽度和预训练图像-文本语料库方面有所不同（4亿私有配对 vs. 4亿/20亿LAION）。在平衡的人脸分析基准测试中，增大编码器减少了CLIP中的性别偏差，但增加了OpenCLIP中的性别和种族偏差；将LAION语料库从4亿增加到20亿进一步增加了OpenCLIP的偏见。在匹配的模型和数据预算下，用LAION替换私有数据提高了性别公平性，但增加了种族偏差，突显了数据来源是偏见模式的主要驱动因素。我们还评估了三种事后测试时去偏策略——偏见提示、提示阵列和SANER。去偏减少了但并未消除伤害，其有效性取决于数据来源和模型规模：偏见提示在较小的模型规模下最有效地减少了CLIP中的性别偏差，而提示阵列和SANER更可靠地减少了OpenCLIP中的种族偏差；扩大LAION重新配置了哪种方法最公平。这些发现共同挑战了更大的模型或数据集自动更公平的假设，并将训练数据来源置于偏见和缓解效果的关键决定因素的前沿。我们发布了代码和评估脚本，以实现未来VLMs的透明、可重复审计。

Summary / 总结

The study aims to understand the impact of model size, training data scale, and data source on social bias in vision-language models. By comparing CLIP and OpenCLIP, which share the same contrastive objective but differ in encoder width and training data, the research finds that enlarging the encoder reduces gender skew in CLIP but amplifies biases in OpenCLIP. Increasing the LAION corpus from 400M to 2B further increases OpenCLIP bias. Substituting proprietary data with LAION improves gender fairness but increases racial skew, highlighting the importance of data source. Post-hoc debiasing strategies show varying effectiveness depending on the model and data source, challenging the notion that larger models or datasets are inherently fairer. The study underscores the critical role of training data in determining bias patterns and mitigation efficacy.

研究旨在理解模型大小、训练数据规模和数据来源对视觉语言模型中社会偏见的影响。通过比较CLIP和OpenCLIP，这两种模型具有相同的对比目标但编码器宽度和训练数据不同，研究发现增大编码器可以减少CLIP中的性别偏见，但在OpenCLIP中却加剧了性别和种族偏见。将LAION数据集从400M增加到2B进一步增加了OpenCLIP的偏见。用LAION数据替换专有数据可以提高性别公平性但增加种族偏见，突显了数据来源的重要性。后处理去偏策略的效果取决于模型和数据来源，挑战了更大模型或数据集自动更公平的假设。研究强调了训练数据在确定偏见模式和缓解效果中的关键作用。

Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models

Authors: Zahraa Al Sahili, Ioannis Patras, Matthew Purver

First: 2025-05-20T10:14:00+00:00 · Latest: 2025-09-11T20:26:08+00:00

Abstract

Multilingual vision-language models (VLMs) promise universal image-text retrieval, yet their social biases remain underexplored. We perform the first systematic audit of four public multilingual CLIP variants: M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, and the debiased SigLIP-2, covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, every model exhibits stronger gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP-2 reduces agency and communion skews, it inherits -- and in caption-sparse contexts (e.g., Xhosa) amplifies -- the English anchor's crime associations. Highly gendered languages consistently magnify all bias types, yet gender-neutral languages remain vulnerable whenever cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language-specific hot spots, underscoring the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.

中文标题/摘要

标题：打破语言障碍还是强化偏见？多语言对比视觉语言模型中的性别和种族差异研究

多语言视觉-语言模型（VLMs）承诺实现通用的图像-文本检索，但其社会偏见仍被忽视。我们首次系统地审计了四种公开的多语言CLIP变体：M-CLIP、NLLB-CLIP、CAPIVARA-CLIP和去偏见的SigLIP-2，涵盖了十种在资源可用性和形态性别标记方面不同的语言。使用平衡的FairFace子集和PATA刻板印象套件，在零样本设置下，我们量化了种族和性别偏见，并测量了刻板印象的放大。与多语言性会减轻偏见的直觉相反，每种模型都比其仅英语基线表现出更强的性别偏差。CAPIVARA-CLIP在其目标的低资源语言中显示出最大的偏差，而NLLB-CLIP和SigLIP-2的共享编码器将英语性别刻板印象转移到了性别中立的语言中；松散耦合的编码器则避免了这种泄漏。尽管SigLIP-2减少了行动性和共情性的偏差，但在标题稀疏的上下文中（例如，Xhosa），它放大了英语锚点的犯罪关联。高度性别化的语言始终放大了所有类型的偏见，而性别中立的语言在跨语言权重共享引入外来刻板印象时仍然脆弱。因此，汇总的指标掩盖了语言特定的热点，强调了未来多语言VLM研究中需要细致的语言意识偏见评估的必要性。

Summary / 总结

This study examines the social biases in multilingual contrastive vision-language models (VLMs) by auditing four public CLIP variants across ten languages. Using balanced datasets and a zero-shot setting, the research quantifies race and gender bias and measures stereotype amplification. Contrary to expectations, every model shows stronger gender bias than its English-only counterpart, with CAPIVARA-CLIP exhibiting the largest biases in low-resource languages and NLLB-CLIP and SigLIP-2 transferring English stereotypes into gender-neutral languages. While SigLIP-2 reduces some biases, it amplifies others, especially in caption-sparse contexts. The study highlights the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.

这项研究通过审计四个公共CLIP变体在十种语言上的表现，检查了多语言对比视觉-语言模型（VLMs）中的社会偏见。使用平衡的数据集和零样本设置，研究量化了种族和性别偏见，并测量了刻板印象的放大。与预期相反，每种模型都比其英语单一版本表现出更强的性别偏见，其中CAPIVARA-CLIP在低资源语言中表现出最大的偏见，而NLLB-CLIP和SigLIP-2将英语刻板印象转移到性别中立的语言中。虽然SigLIP-2减少了某些偏见，但在标题稀疏的上下文中（如Xhosa），它会放大英语锚点的犯罪关联。该研究强调了未来多语言VLM研究中需要进行细粒度的语言意识偏见评估的必要性。

ANTS: Shaping the Adaptive Negative Textual Space by MLLM for OOD Detection

Authors: Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang

First: 2025-09-04T07:26:20+00:00 · Latest: 2025-09-11T19:44:24+00:00

Abstract

The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. In addition, the presence of false negative labels significantly degrades their near-OOD performance. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we identify images likely to be OOD samples as negative images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we first identify the subset of ID classes that are visually similar to negative images and then leverage the reasoning capability of MLLMs to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD) without relying on task-specific prior knowledge, making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 4.2%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.

中文标题/摘要

标题：ANTS：通过MLLM塑造自适应负文本空间以进行OOD检测

引入负标签（NLs）已被证明能有效提升Out-of-Distribution (OOD)检测。然而，现有方法往往缺乏对OOD图像的理解，难以构建准确的负空间。此外，假负标签的存在显著降低了其近OOD性能。为解决这些问题，我们提出利用多模态大语言模型（MLLM）的理解和推理能力，塑造自适应负文本空间（ANTS）。具体而言，我们识别出可能为OOD样本的图像作为负图像，并提示MLLM描述这些图像，生成能够精确刻画OOD分布的表达性负句子，从而增强远OOD检测。对于近OOD设置，其中OOD样本与分布内（ID）子集相似，我们首先识别出与负图像视觉相似的ID类子集，然后利用MLLM的推理能力生成针对该子集的视觉相似负标签，有效减少假负标签并提高近OOD检测。为了平衡这两种类型的负文本空间，我们设计了一种自适应加权评分，使方法能够在无需依赖特定任务先验知识的情况下处理不同的OOD任务设置（近OOD和远OOD），使其在开放环境中具有高度适应性。在ImageNet基准测试中，我们的ANTS显著降低了FPR95，建立了新的最佳水平。此外，我们的方法无需训练且为零样本，具有高可扩展性。

How well can LLMs provide planning feedback in grounded environments?

Authors: Yuxuan Li, Victor Zhong

First: 2025-09-11T18:51:26+00:00 · Latest: 2025-09-11T18:51:26+00:00

Abstract

Learning to plan in grounded environments typically requires carefully designed reward functions or high-quality annotated demonstrations. Recent works show that pretrained foundation models, such as large language models (LLMs) and vision language models (VLMs), capture background knowledge helpful for planning, which reduces the amount of reward design and demonstrations needed for policy learning. We evaluate how well LLMs and VLMs provide feedback across symbolic, language, and continuous control environments. We consider prominent types of feedback for planning including binary feedback, preference feedback, action advising, goal advising, and delta action feedback. We also consider inference methods that impact feedback performance, including in-context learning, chain-of-thought, and access to environment dynamics. We find that foundation models can provide diverse high-quality feedback across domains. Moreover, larger and reasoning models consistently provide more accurate feedback, exhibit less bias, and benefit more from enhanced inference methods. Finally, feedback quality degrades for environments with complex dynamics or continuous state spaces and action spaces.

中文标题/摘要

标题：大语言模型在具体环境（grounded environments）中提供规划反馈的能力如何？

在具体环境（grounded environments）中学习规划通常需要精心设计的奖励函数或高质量的标注示范。近期研究表明，预训练的基础模型，如大型语言模型（LLMs）和视觉语言模型（VLMs），能够捕捉到有助于规划的背景知识，从而减少策略学习所需的奖励设计和示范数量。我们评估了LLMs和VLMs在符号、语言和连续控制环境中的反馈能力。我们考虑了包括二元反馈、偏好反馈、动作建议、目标建议和动作增量反馈在内的主要规划反馈类型。我们还考虑了影响反馈性能的推理方法，包括上下文学习、思维链和对环境动力学的访问。我们发现基础模型能够在不同领域提供多样且高质量的反馈。此外，更大的模型和具有推理能力的模型始终能提供更准确的反馈，表现出更少的偏见，并且更受益于增强的推理方法。最后，对于具有复杂动力学或连续状态空间和动作空间的环境，反馈质量会下降。

Summary / 总结

This study evaluates the effectiveness of large language models (LLMs) and vision language models (VLMs) in providing planning feedback across various environments, including symbolic, language, and continuous control settings. The research finds that foundation models can offer high-quality feedback across different domains, with larger and reasoning models providing more accurate and less biased feedback. Enhanced inference methods also improve feedback performance. However, feedback quality decreases in environments with complex dynamics or continuous state and action spaces.

研究评估了大型语言模型（LLMs）和视觉语言模型（VLMs）在符号、语言和连续控制等多种环境中的规划反馈效果。研究发现，基础模型可以在不同领域提供高质量的反馈，较大的和具有推理能力的模型提供更准确且更少偏见的反馈。增强的推理方法也能改善反馈性能。然而，在具有复杂动力学或连续状态和动作空间的环境中，反馈质量会下降。

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Authors: Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li

First: 2025-09-11T17:59:59+00:00 · Latest: 2025-09-11T17:59:59+00:00

Comments: Project page: https://flux-reason-6m.github.io/

Abstract

The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design an explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The whole data curation takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/

中文标题/摘要

标题：FLUX-Reason-6M & PRISM-Bench：百万规模的图文推理数据集及全面基准测试

开源图文生成（T2I）模型的发展受限于缺乏大规模、注重推理的数据集和全面的评估基准，导致其性能与领先封闭源系统存在差距。为解决这一挑战，我们引入了FLUX-Reason-6M和PRISM-Bench（精确且稳健的图像合成测量基准）。FLUX-Reason-6M是一个包含600万高质量FLUX生成图像和2000万双语（英语和中文）描述的庞大数据集，专门用于教授复杂推理。图像根据六个关键特征组织：想象力、实体、文本呈现、风格、情感和构图，并设计明确的生成链式思维（GCoT）以提供详细的图像生成步骤分解。整个数据整理耗时15000个A100 GPU天，为社区提供了以往只有大型工业实验室才能获得的资源。PRISM-Bench提供了一种新颖的评估标准，包括七个不同的赛道，其中包括使用GCoT的艰巨长文本挑战。通过精心设计的提示，它利用先进的视觉-语言模型进行细腻的人类对齐评估和图像美学评估。我们在PRISM-Bench上对19个领先模型进行了全面评估，揭示了关键性能差距并指出了需要改进的具体领域。我们的数据集、基准测试和评估代码已发布，以推动下一代注重推理的T2I生成。项目页面：https://flux-reason-6m.github.io/

Summary / 总结

The paper introduces FLUX-Reason-6M, a large-scale dataset with 6 million images and 20 million bilingual descriptions, designed to enhance reasoning in text-to-image models. It also presents PRISM-Bench, a comprehensive benchmark with seven tracks, including a Long Text challenge, to evaluate these models. The evaluation of 19 leading models on PRISM-Bench highlights significant performance gaps and areas for improvement in reasoning capabilities. The dataset and benchmark are publicly available to advance the field of reasoning-oriented text-to-image generation.

论文介绍了FLUX-Reason-6M，这是一个包含600万张图像和2000万条双语描述的大规模数据集，旨在提升文本到图像模型的推理能力。同时，还提出了PRISM-Bench，这是一个包含七个赛道的综合基准，包括长文本挑战，用于评估这些模型。对19个领先模型在PRISM-Bench上的评估揭示了显著的性能差距，并指出了需要改进的具体领域。该数据集和基准已公开发布，以促进推理导向的文本到图像生成的发展。

Locality in Image Diffusion Models Emerges from Data Statistics

Authors: Artem Lukoianov, Chenyang Yuan, Justin Solomon, Vincent Sitzmann

First: 2025-09-11T17:59:08+00:00 · Latest: 2025-09-11T17:59:08+00:00

Comments: 30 pages, 18 figures, 6 tables

Abstract

Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.

中文标题/摘要

标题：图像扩散模型中的局部性源自数据统计

在生成模型中，扩散模型因其训练目标的闭式最优解而独具魅力，通常被称为最优去噪器。然而，使用该最优去噪器的扩散仅能复现训练集中的图像，无法捕捉深层扩散模型的行为。近期工作试图描述这种最优去噪器与深层扩散模型之间的差距，提出了无需训练的分析模型，能够生成类似于训练UNet生成的图像。表现最佳的方法假设卷积神经网络的平移等变性和局部性先验是性能差距的原因，因此将其纳入分析模型中。本文中，我们提供了证据表明，深层扩散模型中的局部性是一种统计性质，而非卷积神经网络的归纳偏置所致。具体而言，我们证明了最优参数线性去噪器表现出与深层神经去噪器相似的局部性特征。我们还通过理论和实验表明，这种局部性直接来源于自然图像数据集中像素间的相关性。最后，我们利用这些见解构建了一个分析去噪器，其预测得分比之前的专家构建的替代方案更接近深层扩散模型的预测。

Summary / 总结

This paper investigates why deep diffusion models exhibit locality, a property not present in the optimal denoiser. The authors find that the locality in deep diffusion models arises from the statistical properties of the image dataset rather than the convolutional neural network's inductive biases. They demonstrate that an optimal parametric linear denoiser also exhibits similar locality properties and that this locality is due to pixel correlations in natural images. The study leads to an analytical denoiser that better matches the scores predicted by deep diffusion models.

研究探讨了为什么深度扩散模型具有局部性，而最优去噪器不具备这一特性。研究显示，深度扩散模型中的局部性来源于图像数据集的统计特性，而非卷积神经网络的归纳偏置。关键发现包括最优参数线性去噪器和深度神经去噪器在局部性方面的相似性，以及局部性是由于自然图像中的像素相关性。研究还提出了一种分析性去噪器，其预测结果更接近深度扩散模型的得分。
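
To make the link between pixel correlations and locality concrete, here is a minimal Python sketch of a linear denoiser on a toy 1-D "image". It uses the standard Wiener-filter form W = Σ(Σ + σ²I)⁻¹ for zero-mean data with covariance Σ and Gaussian noise of standard deviation σ. The exponentially decaying toy covariance, the helper names, and the chosen σ are assumptions for illustration only; the paper's exact parameterization and datasets may differ.

```python
import numpy as np


def toy_covariance(num_pixels: int, rho: float) -> np.ndarray:
    """Hypothetical covariance where correlation decays as rho**distance."""
    idx = np.arange(num_pixels)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def linear_denoiser(cov: np.ndarray, sigma: float) -> np.ndarray:
    """Wiener filter for zero-mean data: W = cov @ inv(cov + sigma^2 I).

    cov and (cov + sigma^2 I) commute, so solving the linear system
    (cov + sigma^2 I) W = cov yields the same matrix without an explicit inverse.
    """
    num_pixels = cov.shape[0]
    return np.linalg.solve(cov + sigma**2 * np.eye(num_pixels), cov)


cov = toy_covariance(num_pixels=32, rho=0.9)
W = linear_denoiser(cov, sigma=0.5)

# Row 16 holds the weights used to denoise pixel 16. Nearby pixels receive
# much larger weights than distant ones, although nothing convolutional or
# local was built into the denoiser itself.
print(np.round(W[16, 12:21], 3))
```

In this toy setting the denoiser's effective receptive field is inherited entirely from the data covariance, which mirrors the abstract's claim that locality can arise from dataset statistics.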

Improved GUI Grounding via Iterative Narrowing

Authors: Anthony Nguyen

First: 2024-11-18T05:47:12+00:00 · Latest: 2025-09-11T16:37:00+00:00

Comments: Code available at https://github.com/ant-8/GUI-Grounding-via-Iterative-Narrowing

Abstract

Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for zero-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework that employs an iterative narrowing mechanism to further improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.

中文标题/摘要

标题：通过迭代缩小改进GUI定位

图形用户界面（GUI）定位在增强视觉语言模型（VLM）代理的能力方面起着关键作用。虽然通用的VLM，如GPT-4V，在各种任务中表现出色，但在GUI定位方面的熟练程度仍然不足。最近的研究集中在对这些模型进行微调，以实现零样本GUI定位，从而在基线性能上取得了显著改进。我们提出了一种视觉提示框架，采用迭代缩小机制，进一步提高通用模型和微调模型在GUI定位中的性能。为了评估，我们在包含各种UI平台的综合基准上测试了我们的方法，并提供了可重现我们结果的代码。

Summary / 总结

The research aims to enhance the GUI grounding capabilities of Vision-Language Models (VLMs) by introducing an iterative narrowing mechanism. This method improves the performance of both general VLMs and fine-tuned models in GUI grounding tasks. Key experimental findings show significant improvements over baseline performance on a comprehensive benchmark of various UI platforms.

研究旨在通过引入迭代缩小机制来提升视觉语言模型（VLM）在GUI定位方面的能力。该方法提高了通用VLM和微调模型在GUI定位任务中的性能。实验结果表明，在多种UI平台的综合基准测试中，与基线性能相比有显著提升。
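
One plausible reading of iterative narrowing is sketched below in Python: a model predicts a point inside the current crop, the crop shrinks around that point, and the process repeats. This is not the author's implementation. The `predict` callable stands in for a VLM call, and `make_coarse_predictor`, the `shrink` factor, and the iteration count are hypothetical choices made so the example runs without a model.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # left, top, right, bottom in pixels


def iterative_narrowing(
    predict: Callable[[Box], Tuple[float, float]],
    width: float,
    height: float,
    iterations: int = 3,
    shrink: float = 0.5,
) -> Tuple[float, float]:
    """Refine a predicted point by repeatedly cropping around it.

    predict(box) returns a point in normalized [0, 1] coordinates
    relative to the given crop box.
    """
    box: Box = (0.0, 0.0, float(width), float(height))
    x = y = 0.0
    for _ in range(iterations):
        u, v = predict(box)
        left, top, right, bottom = box
        # Map the crop-relative prediction back to full-image pixels.
        x = left + u * (right - left)
        y = top + v * (bottom - top)
        # Build a smaller crop centered on the prediction, kept inside the image.
        new_w = (right - left) * shrink
        new_h = (bottom - top) * shrink
        new_left = min(max(x - new_w / 2, 0.0), width - new_w)
        new_top = min(max(y - new_h / 2, 0.0), height - new_h)
        box = (new_left, new_top, new_left + new_w, new_top + new_h)
    return x, y


def make_coarse_predictor(target: Tuple[float, float], cells: int = 10):
    """Stand-in for a model that only localizes to a coarse grid cell."""
    tx, ty = target

    def predict(box: Box) -> Tuple[float, float]:
        left, top, right, bottom = box
        u = (tx - left) / (right - left)
        v = (ty - top) / (bottom - top)
        # Snap to the center of the containing grid cell.
        cu = min(max(int(u * cells), 0), cells - 1)
        cv = min(max(int(v * cells), 0), cells - 1)
        return (cu + 0.5) / cells, (cv + 0.5) / cells

    return predict


predictor = make_coarse_predictor(target=(1234.0, 600.0))
print(iterative_narrowing(predictor, width=1920, height=1080, iterations=3))
```

With a predictor whose error is proportional to the crop size, each narrowing step reduces the worst-case pixel error by the shrink factor, which is the intuition behind cropping before asking the model again.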

Compositional Concept Generalization with Variational Quantum Circuits

Authors: Hala Hawashin, Mina Abbaszadeh, Nicholas Joseph, Beth Pearson, Martha Lewis, Mehrnoosh Sadrzadeh

First: 2025-09-11T15:34:33+00:00 · Latest: 2025-09-11T15:34:33+00:00

Comments: Accepted to: 2025 IEEE International Conference on Quantum Artificial Intelligence (QAI), Naples, Italy, Nov 2-5, 2025. This is the authors' accepted manuscript (AAM). An IEEE copyright notice appears on page 1. The final published version will appear in IEEE Xplore; DOI to be added when available

Abstract

Compositional generalization is a key facet of human cognition, but lacking in current AI tools such as vision-language models. Previous work examined whether a compositional tensor-based sentence semantics can overcome the challenge, but led to negative results. We conjecture that the increased training efficiency of quantum models will improve performance in these tasks. We interpret the representations of compositional tensor-based models in Hilbert spaces and train Variational Quantum Circuits to learn these representations on an image captioning task requiring compositional generalization. We used two image encoding techniques: a multi-hot encoding (MHE) on binary image vectors and an angle/amplitude encoding on image vectors taken from the vision-language model CLIP. We achieve good proof-of-concept results using noisy MHE encodings. Performance on CLIP image vectors was more mixed, but still outperformed classical compositional models.

中文标题/摘要

标题：基于变分量子电路的组合式概念泛化

组合泛化是人类认知的一个关键方面，但当前的AI工具（如视觉-语言模型）缺乏这一能力。先前的工作考察了基于张量的组合式句子语义能否克服这一挑战，但得到了否定的结果。我们推测，量子模型更高的训练效率将改善这些任务上的表现。我们在希尔伯特空间中解释基于张量的组合式模型的表示，并训练变分量子电路，在一个需要组合泛化的图像描述任务上学习这些表示。我们使用了两种图像编码技术：作用于二值图像向量的多热编码（MHE），以及作用于取自视觉-语言模型CLIP的图像向量的角度/振幅编码。使用含噪MHE编码，我们取得了良好的概念验证结果。在CLIP图像向量上的表现则好坏参半，但仍优于经典组合式模型。

Summary / 总结

The paper targets compositional generalization, a key facet of human cognition that current AI tools such as vision-language models lack. It trains Variational Quantum Circuits to learn the representations of compositional tensor-based models on an image captioning task, using two image encoding techniques: multi-hot encoding (MHE) on binary image vectors and angle/amplitude encoding on image vectors taken from the vision-language model CLIP. The study reports good proof-of-concept results with noisy MHE encodings; results on CLIP image vectors were more mixed but still outperformed classical compositional models.

论文针对组合泛化展开研究，这是人类认知的关键方面，而当前的AI工具缺乏这一能力。研究训练变分量子电路来学习基于张量的组合式模型的表示，并使用两种图像编码技术：二值图像向量上的多热编码，以及取自视觉语言模型CLIP的图像向量上的角度/振幅编码。结果显示，使用含噪多热编码取得了良好的概念验证结果；在CLIP图像向量上的表现好坏参半，但仍优于经典组合式模型。
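
For readers unfamiliar with the two encodings named for CLIP vectors, the following is a minimal sketch of their textbook definitions, not the paper's circuits. The function names `angle_encode` and `amplitude_encode` are hypothetical, and padding to a power of two is an assumption made so that the vector length matches a whole number of qubits.

```python
import math
from typing import List


def angle_encode(features: List[float]) -> List[List[float]]:
    """One qubit per feature: RY(x)|0> has amplitudes [cos(x/2), sin(x/2)]."""
    return [[math.cos(x / 2), math.sin(x / 2)] for x in features]


def amplitude_encode(vector: List[float]) -> List[float]:
    """Pad to the next power of two and L2-normalize, so the entries can
    serve as the amplitudes of a log2(len)-qubit state."""
    size = 1
    while size < len(vector):
        size *= 2
    padded = list(vector) + [0.0] * (size - len(vector))
    norm = math.sqrt(sum(a * a for a in padded))
    if norm == 0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return [a / norm for a in padded]


state = amplitude_encode([3.0, 4.0, 0.0])   # three features fit in two qubits
qubits = angle_encode([0.0, math.pi])       # two features need two qubits
```

The trade-off the sketch makes visible is qubit count: angle encoding uses one qubit per feature, while amplitude encoding packs n features into roughly log2(n) qubits at the cost of a more involved state preparation.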

Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

First: 2025-09-08T09:20:04+00:00 · Latest: 2025-09-11T15:24:22+00:00

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs' attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity. (3) Theoretically, we prove that the contrast of attention maps between general queries and task-specific queries enables the decomposition of visual signal into semantic signals and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.

中文标题/摘要

标题：基于对比注意力聚焦：增强VLMs的视觉推理

视觉-语言模型（VLMs）在多种视觉任务中表现出色，但在复杂视觉环境中性能下降。现有增强方法需要额外训练、依赖外部分割工具或在粗粒度级别操作，忽视了VLMs内部的能力。为解决这一问题，我们研究了VLMs的注意力模式，发现：（1）视觉复杂性与注意力熵呈强相关性，负面影响推理性能；（2）注意力从浅层的全局扫描逐渐聚焦到深层的集中收敛，收敛程度由视觉复杂性决定；（3）理论上，我们证明了通用查询与任务特定查询之间的注意力图对比能够将视觉信号分解为语义信号和视觉噪声成分。基于这些见解，我们提出了基于像素级注意力对比的视觉增强对比注意力精炼（CARVE）方法，这是一种无需训练的方法，通过注意力对比提取任务相关的视觉信号。大量实验表明，CARVE能够一致地提升性能，开源模型上可实现高达75%的提升。我们的工作为理解视觉复杂性和注意力机制之间的相互作用提供了关键见解，为通过对比注意力改进视觉推理提供了高效途径。

Summary / 总结

This paper addresses the performance degradation of Vision-Language Models (VLMs) in complex visual environments. By analyzing VLMs' attention patterns, the authors find that visual complexity negatively impacts reasoning performance and that attention progressively refines from global scanning to focused convergence. They propose CARVE, a training-free method that enhances VLMs through pixel-level attention contrasting, which decomposes visual signals into semantic and noise components. Experiments show that CARVE significantly improves performance, achieving up to 75% enhancement on open-source models.

本文针对视觉语言模型（VLMs）在复杂视觉环境中的性能下降问题，通过分析注意力模式发现，视觉复杂性会负面影响推理性能，并且注意力会从全局扫描逐渐精炼到聚焦收敛。作者提出了一种无需训练的方法——像素级对比注意力精炼以视觉增强（CARVE），通过对比注意力图来增强VLMs的视觉推理能力。实验表明，CARVE可以将开源模型的性能提高多达75%。
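
Two quantities from the abstract, attention entropy and the contrast between task-specific and general attention maps, can be sketched on flattened attention vectors. The abstract does not give CARVE's exact formulas, so the ratio-and-renormalize rule in `contrast` and the `eps` smoothing term are assumptions made for illustration; only the Shannon entropy definition is standard.

```python
import math
from typing import List


def attention_entropy(attn: List[float]) -> float:
    """Shannon entropy (in nats) of an attention map normalized to sum to 1.
    Diffuse attention gives high entropy; focused attention gives low entropy."""
    total = sum(attn)
    probs = [a / total for a in attn]
    return -sum(p * math.log(p) for p in probs if p > 0)


def contrast(task_attn: List[float], general_attn: List[float],
             eps: float = 1e-6) -> List[float]:
    """Assumed contrast rule: divide task-specific attention by general
    attention so that regions attended regardless of the question are
    down-weighted, then renormalize to sum to 1."""
    ratio = [t / (g + eps) for t, g in zip(task_attn, general_attn)]
    total = sum(ratio)
    return [r / total for r in ratio]


general = [0.5, 0.3, 0.1, 0.1]      # attention under a generic query
task = [0.5, 0.3, 0.15, 0.05]       # attention under the actual question
refined = contrast(task, general)
focus = max(range(len(refined)), key=refined.__getitem__)
```

In this toy example the first two regions dominate both maps, so the contrast suppresses them and highlights the third region, the only one that gains attention under the task-specific query.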

Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics

Authors: Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi

First: 2025-09-10T10:07:27+00:00 · Latest: 2025-09-11T13:03:04+00:00

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMa 3.2, to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in the NOvA and DUNE experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events. Our evaluation considers both the classification performance and interpretability of the model predictions. We find that VLMs can outperform CNNs, while also providing greater flexibility in integrating auxiliary textual or semantic information and offering more interpretable, reasoning-based predictions. This work highlights the potential of VLMs as a general-purpose backbone for physics event classification, due to their high performance, interpretability, and generalizability, which opens new avenues for integrating multimodal reasoning in experimental neutrino physics.

中文标题/摘要

标题：将视觉语言模型适应于高能物理中的中微子事件分类

大型语言模型（LLMs）的最新进展表明，它们在处理和推理自然语言之外的结构化与非结构化数据模态方面具有出色的能力。在本文中，我们探讨了视觉语言模型（VLMs），特别是LLaMa 3.2的微调变体，在识别高能物理（HEP）实验像素化探测器数据中的中微子相互作用这一任务上的应用。我们将该模型与一种最先进的卷积神经网络（CNN）架构进行了基准比较，该架构类似于NOvA和DUNE实验中使用的架构，后者在电子中微子和μ子中微子事件分类中实现了高效率和高纯度。我们的评估同时考虑了分类性能和模型预测的可解释性。我们发现VLMs可以超越CNNs，同时在整合辅助文本或语义信息方面提供更大的灵活性，并给出更可解释的、基于推理的预测。本文强调了VLMs作为物理事件分类通用骨干模型的潜力，这得益于其高性能、可解释性和泛化能力，并为在实验中微子物理中整合多模态推理开辟了新的途径。

Summary / 总结

This study investigates the use of Vision Language Models (VLMs) for classifying neutrino interactions in high-energy physics experiments, comparing them to state-of-the-art convolutional neural networks (CNNs). The VLMs, fine-tuned from LLaMa 3.2, outperformed CNNs in classification tasks while offering greater flexibility and interpretability by integrating textual or semantic information. The results suggest that VLMs could serve as a versatile backbone for physics event classification, enhancing multimodal reasoning capabilities.

本研究探讨了使用视觉语言模型（VLMs）来识别高能物理实验中的中微子相互作用，将其与最先进的卷积神经网络（CNNs）进行了比较。经过LLaMa 3.2微调的VLMs在分类任务中表现优于CNNs，同时通过整合文本或语义信息提供了更大的灵活性和可解释性。研究结果表明，VLMs可以作为物理事件分类的通用基础模型，增强多模态推理能力。
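
The abstract refers to the efficiency and purity of event classification. As a minimal sketch, assuming the usual definitions (efficiency is the fraction of true signal events that are selected, purity is the fraction of selected events that are true signal), the two can be computed as follows. The function name and the label strings `"nue"` and `"numu"` are hypothetical.

```python
from typing import List, Tuple


def efficiency_and_purity(true_labels: List[str], predicted_labels: List[str],
                          signal: str) -> Tuple[float, float]:
    """Efficiency = selected true signal / all true signal (recall).
    Purity = selected true signal / all selected events (precision)."""
    # True labels of every event the classifier selected as signal.
    selected = [t for t, p in zip(true_labels, predicted_labels) if p == signal]
    true_positive = sum(1 for t in selected if t == signal)
    n_signal = sum(1 for t in true_labels if t == signal)
    efficiency = true_positive / n_signal if n_signal else 0.0
    purity = true_positive / len(selected) if selected else 0.0
    return efficiency, purity


truth = ["nue", "nue", "numu", "nue", "numu"]
preds = ["nue", "numu", "nue", "nue", "numu"]
eff, pur = efficiency_and_purity(truth, preds, signal="nue")
```

Here three events are selected as electron-neutrino candidates, two of them correctly, out of three true electron-neutrino events, so both efficiency and purity come out to 2/3.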

Shaken, Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks

Authors: Lukáš Gajdošech, Hassan Ali, Jan-Gerrit Habekost, Martin Madaras, Matthias Kerzel, Stefan Wermter

Venue: IROS

First: 2025-03-06T10:51:04+00:00 · Latest: 2025-09-11T12:49:34+00:00

Comments: Submitted and Accepted for Presentation at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025

Abstract

Datasets for object detection often do not account for enough variety of glasses, due to their transparent and reflective properties. Specifically, open-vocabulary object detectors, widely used in embodied robotic agents, fail to distinguish subclasses of glasses. This scientific gap poses an issue for robotic applications that suffer from accumulating errors between detection, planning, and action execution. This paper introduces a novel method for acquiring real-world data from RGB-D sensors that minimizes human effort. We propose an auto-labeling pipeline that generates labels for all the acquired frames based on the depth measurements. We provide a novel real-world glass object dataset GlassNICOLDataset that was collected on the Neuro-Inspired COLlaborator (NICOL), a humanoid robot platform. The dataset consists of 7850 images recorded from five different cameras. We show that our trained baseline model outperforms state-of-the-art open-vocabulary approaches. In addition, we deploy our baseline model in an embodied agent approach to the NICOL platform, on which it achieves a success rate of 81% in a human-robot bartending scenario.

中文标题/摘要

标题：摇匀，不要搅拌：用于人机调酒任务中玻璃杯视觉理解的新数据集

由于玻璃杯具有透明和反光的特性，物体检测数据集往往未能涵盖足够多样的玻璃杯。具体来说，广泛应用于具身机器人智能体的开放词汇物体检测器无法区分玻璃杯的子类别。这一科学空白给机器人应用带来了问题，因为这类应用会受到检测、规划和动作执行之间误差累积的影响。本文介绍了一种从RGB-D传感器获取真实世界数据的新方法，可将人工投入降至最低。我们提出了一种自动标注流水线，根据深度测量为所有采集的帧生成标签。我们提供了一个新的真实世界玻璃物体数据集GlassNICOLDataset，该数据集是在人形机器人平台Neuro-Inspired COLlaborator（NICOL）上采集的。该数据集包含由五个不同摄像头记录的7850张图像。我们表明，我们训练的基线模型优于最先进的开放词汇方法。此外，我们以具身智能体的方式将基线模型部署到NICOL平台上，其在人机调酒场景中达到了81%的成功率。

Summary / 总结

This paper addresses the limited variety of glasses in object detection datasets, a gap attributed to the transparent and reflective properties of glass. It introduces a method for acquiring real-world data from RGB-D sensors with minimal human effort, together with an auto-labeling pipeline based on depth measurements. The resulting GlassNICOLDataset contains 7850 images recorded from five cameras, and a baseline model trained on it outperforms state-of-the-art open-vocabulary approaches. Deployed on the NICOL humanoid robot, the model achieved an 81% success rate in a human-robot bartending scenario.

本文针对物体检测数据集中玻璃杯种类不够多样的问题展开研究，这一不足源于玻璃的透明和反射特性。作者提出了一种以最少人工投入从RGB-D传感器获取真实世界数据的方法，以及基于深度测量的自动标注流水线，并发布了在人形机器人平台NICOL上采集的新数据集GlassNICOLDataset。该数据集包含来自五个摄像头的7850张图像；在其上训练的基线模型优于最先进的开放词汇方法，部署到NICOL平台后，在人机调酒场景中达到了81%的成功率。

Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift

Authors: Umaima Rahman, Raza Imam, Mohammad Yaqub, Dwarikanath Mahapatra

First: 2025-09-11T12:26:57+00:00 · Latest: 2025-09-11T12:26:57+00:00

Abstract

Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at https://github.com/rumaima/DRiFt.

中文标题/摘要

标题：解耦临床特征与类别无关特征，实现分布偏移下可靠的少样本适应

医学视觉-语言模型（VLMs）为临床决策支持提供了希望，但在分布变化下的可靠性仍然是安全部署的主要关切。这些模型由于成像协议和自由文本报告的差异性，往往会学习到任务无关的相关性，这限制了它们的泛化能力，并增加了在实际场景中失败的风险。我们提出了DRiFt，这是一种结构化的特征解耦框架，通过参数高效调优（LoRA）和可学习的提示标记，明确地将临床相关信号与任务无关的噪声分离。为了增强跨模态对齐并减少不确定性，我们通过为多样化的医学数据集生成描述来精心策划高质量的临床相关图像-文本对。我们的方法在分布内性能上比之前的基于提示的方法提高了11.4%的Top-1准确率和3.3%的宏F1分数，同时在未见数据集上保持了强大的鲁棒性。消融研究显示，分离任务相关特征和精细对齐显著增强了模型的泛化能力和减少了领域变化下的不可预测行为。这些见解有助于构建更安全、更值得信赖的VLMs用于临床应用。代码可在https://github.com/rumaima/DRiFt获取。

Summary / 总结

The research aims to improve the reliability of medical vision-language models (VLMs) under distribution shifts for clinical decision support. DRiFt, a structured feature decoupling framework, separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning and learnable prompt tokens. This approach enhances in-distribution performance by 11.4% in Top-1 accuracy and 3.3% in Macro-F1 over prior prompt-based methods, while maintaining robustness across unseen datasets. Ablation studies show that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift.

论文提出了一种特征解耦框架DRiFt，以确保医疗视觉-语言模型（VLMs）在分布变化下的可靠性。DRiFt 使用参数高效调优（LoRA）和可学习提示标记来分离临床相关信号和任务无关噪声，使在分布内的性能提高了11.4%的Top-1准确率和3.3%的宏F1分数，同时在未见过的数据集上保持了鲁棒性。
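
The abstract names LoRA as the parameter-efficient tuning method. The sketch below shows the generic LoRA forward pass, in which a frozen weight matrix `W` is augmented by a trainable low-rank product `B A` scaled by `alpha / rank`. It is not DRiFt's actual configuration: the helper names, matrix shapes, and the `alpha` value are assumptions chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(x: Vector, W: Matrix, A: Matrix, B: Matrix,
                 alpha: float) -> Vector:
    """Compute W x + (alpha / rank) * B (A x).
    W (d_out x d_in) stays frozen; only A (rank x d_in) and
    B (d_out x rank) would be trained."""
    rank = len(A)
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, low_rank)]


W = [[1.0, 0.0], [0.0, 1.0]]        # frozen pretrained weight
A = [[1.0, 1.0]]                    # rank-1 down-projection
B_init = [[0.0], [0.0]]             # zero init: the adapter starts as a no-op
B_tuned = [[1.0], [2.0]]            # hypothetical values after tuning

before = lora_forward([1.0, 2.0], W, A, B_init, alpha=2.0)
after = lora_forward([1.0, 2.0], W, A, B_tuned, alpha=2.0)
```

Initializing `B` to zero means the adapted layer reproduces the pretrained layer exactly at the start of tuning, which is why the first call returns the input unchanged under the identity `W`.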

Curriculum-Based Multi-Tier Semantic Exploration via Deep Reinforcement Learning

Authors: Abdel Hakim Drid, Vincenzo Suriani, Daniele Nardi, Abderrezzak Debilou

First: 2025-09-11T11:10:08+00:00 · Latest: 2025-09-11T11:10:08+00:00

Comments: The 19th International Conference on Intelligent Autonomous Systems (IAS 19), 2025, Genoa

Abstract

Navigating and understanding complex and unknown environments autonomously demands more than just basic perception and movement from embodied agents. Truly effective exploration requires agents to possess higher-level cognitive abilities, the ability to reason about their surroundings, and make more informed decisions regarding exploration strategies. However, traditional RL approaches struggle to balance efficient exploration and semantic understanding due to limited cognitive capabilities embedded in the small policies for the agents, leading often to human drivers when dealing with semantic exploration. In this paper, we address this challenge by presenting a novel Deep Reinforcement Learning (DRL) architecture that is specifically designed for resource efficient semantic exploration. A key methodological contribution is the integration of a Vision-Language Model (VLM) common-sense through a layered reward function. The VLM query is modeled as a dedicated action, allowing the agent to strategically query the VLM only when deemed necessary for gaining external guidance, thereby conserving resources. This mechanism is combined with a curriculum learning strategy designed to guide learning at different levels of complexity to ensure robust and stable learning. Our experimental evaluation results convincingly demonstrate that our agent achieves significantly enhanced object discovery rates and develops a learned capability to effectively navigate towards semantically rich regions. Furthermore, it also shows a strategic mastery of when to prompt for external environmental information. By demonstrating a practical and scalable method for embedding common-sense semantic reasoning with autonomous agents, this research provides a novel approach to pursuing a fully intelligent and self-guided exploration in robotics.

中文标题/摘要

标题：基于深度强化学习的课程式多层级语义探索

自主导航和理解复杂未知环境不仅需要基本的感知和移动，还需要具备高级认知能力，能够推理周围环境并做出更明智的探索策略选择。然而，传统强化学习方法由于代理嵌入的认知能力有限，难以在高效探索和语义理解之间取得平衡，导致在处理语义探索时需要人工干预。本文提出了一种新的深度强化学习（DRL）架构，专门设计用于资源高效语义探索。一个关键的方法贡献是通过分层奖励函数集成视觉语言模型（VLM）常识。VLM 查询被建模为专用动作，使代理仅在必要时战略性地查询 VLM 以获取外部指导，从而节省资源。该机制结合了一种课程学习策略，以指导不同复杂度水平的学习，确保稳健和稳定的训练。实验评估结果表明，我们的代理在物体发现率方面显著提高，并发展了有效导航至语义丰富区域的能力。此外，还展示了何时请求外部环境信息的战略掌握。通过展示一种实用且可扩展的方法，将常识语义推理嵌入自主代理，这项研究为追求完全智能和自我引导的机器人探索提供了一种新方法。

Summary / 总结

This paper addresses the challenge of autonomous exploration in complex environments by proposing a DRL architecture that integrates a Vision-Language Model (VLM) through a layered reward function and curriculum learning. The method allows the agent to query the VLM only when necessary, conserving resources while enhancing semantic understanding. Experimental results show significant improvements in object discovery rates and the agent's ability to navigate towards semantically rich regions, demonstrating strategic use of external information.

本文提出了一种结合Vision-Language模型（VLM）并通过分层奖励函数和课程学习的DRL架构，以解决复杂环境中的自主探索挑战。该方法允许代理仅在必要时查询VLM，从而节省资源并增强语义理解。实验结果表明，该代理在物体发现率和导航至语义丰富区域方面取得了显著改进，并展示了对外部环境信息的策略性使用。
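
The idea of modeling the VLM query as a dedicated action inside a layered reward can be sketched as follows. The abstract does not specify the action set, the reward tiers, or their magnitudes, so the `Action` members, the `layered_reward` function, and every numeric weight below are hypothetical.

```python
from enum import Enum


class Action(Enum):
    FORWARD = 0
    TURN_LEFT = 1
    TURN_RIGHT = 2
    QUERY_VLM = 3   # asking the VLM is an action the policy must choose


def layered_reward(action: Action, new_objects: int, followed_hint: bool,
                   discovery_bonus: float = 1.0, hint_bonus: float = 0.3,
                   query_cost: float = 0.2) -> float:
    """Assumed three-tier reward: a base tier for newly discovered objects,
    a bonus for moving in line with the last VLM hint, and a cost charged
    whenever the agent spends a step querying the VLM."""
    reward = discovery_bonus * new_objects
    if action is Action.QUERY_VLM:
        reward -= query_cost
    elif followed_hint:
        reward += hint_bonus
    return reward


r_query = layered_reward(Action.QUERY_VLM, new_objects=0, followed_hint=False)
r_move = layered_reward(Action.FORWARD, new_objects=2, followed_hint=True)
```

Charging a cost for the query action is one simple way to make the policy learn when external guidance is worth its price, which matches the resource-conserving behavior the abstract describes.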