arXiv Paper Digest

OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

Authors: Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, Xinglong Wu

First: 2025-08-28T17:59:46+00:00 · Latest: 2025-08-28T17:59:46+00:00

Comments: project url: https://one-reward.github.io

Abstract

In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances the model's generative capabilities across multiple tasks under different evaluation criteria using only one reward model. By employing a single vision-language model (VLM) as the generative reward model, which can distinguish the winner and loser for a given task and a given evaluation criterion, it can be effectively applied to multi-task generation models, particularly in contexts with varied data and diverse task objectives. We utilize OneReward for mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, involving a binary mask as the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io

Summary

OneReward is a unified reinforcement learning framework that uses a single reward model to enhance generative capabilities across multiple tasks. It employs a vision-language model to distinguish winners and losers for different tasks and criteria, enabling multi-task generation without task-specific supervised fine-tuning. Experiments show that Seedream 3.0 Fill, trained using OneReward, outperforms commercial and open-source competitors in various evaluation metrics for mask-guided image generation tasks such as image fill, extend, object removal, and text rendering.
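
The idea of one judge serving every task and criterion can be sketched as a single callable that receives the task and the criterion as inputs, instead of one reward model per task. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_judge`, `win_rate`, and the criterion labels are hypothetical, and the toy judge simply prefers the longer string where the paper would query a VLM.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (task, criterion, candidate_a, candidate_b) -> True if a wins.
Judge = Callable[[str, str, str, str], bool]


def toy_judge(task: str, criterion: str, a: str, b: str) -> bool:
    """Stand-in for a VLM call; it simply prefers the longer candidate."""
    return len(a) > len(b)


def win_rate(judge: Judge, task: str, criteria: Iterable[str],
             pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (criterion, pair) comparisons in which the first candidate wins."""
    pairs = list(pairs)
    wins = 0
    total = 0
    for criterion in criteria:
        for a, b in pairs:
            wins += int(judge(task, criterion, a, b))
            total += 1
    return wins / total


if __name__ == "__main__":
    # Candidates are plain strings here; in practice they would be generated images.
    pairs = [("candidate-long", "short"), ("x", "candidate")]
    print(win_rate(toy_judge, "object removal", ["criterion_a", "criterion_b"], pairs))
```

Because the task and criterion are arguments rather than separate models, adding a new sub-task only changes the inputs to the same judge.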

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Authors: Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie

First: 2025-08-28T17:50:58+00:00 · Latest: 2025-08-28T17:50:58+00:00

Comments: 23 pages, 8 figures, Project Page: https://jiutian-vl.github.io/CogVLA-page

Abstract

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.

Summary

CogVLA is a cognition-aligned VLA framework that uses instruction-driven routing and sparsification to enhance efficiency and performance. It consists of three stages: EFA-Routing for instruction-aware visual token aggregation, LFP-Routing for token-level sparsity through pruning, and CAtten for ensuring accurate action generation. Experiments show CogVLA outperforms OpenVLA with higher success rates and reduced training and inference costs on LIBERO and real-world tasks.
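
The two routing stages both rest on FiLM-style conditioning followed by token selection. The sketch below shows the generic feature-wise linear modulation (a per-channel scale and shift) and a top-k pruning step. It is illustrative only: the abstract does not say how CogVLA derives the scale, shift, or relevance scores from the instruction, so `gamma`, `beta`, and `scores` are supplied by hand, and `film` and `prune_top_k` are hypothetical names.

```python
from typing import List

Vector = List[float]


def film(tokens: List[Vector], gamma: Vector, beta: Vector) -> List[Vector]:
    """Apply a per-channel scale (gamma) and shift (beta) to every token."""
    return [[g * x + b for x, g, b in zip(token, gamma, beta)] for token in tokens]


def prune_top_k(tokens: List[Vector], scores: List[float], k: int) -> List[Vector]:
    """Keep the k highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(ranked)]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
    # In a real model gamma and beta would be predicted from the instruction embedding.
    modulated = film(tokens, gamma=[2.0, 0.5], beta=[0.0, 1.0])
    # Assumed relevance scores; token-level sparsity comes from dropping the rest.
    print(prune_top_k(modulated, scores=[0.9, 0.1, 0.7], k=2))
```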

MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs

Authors: Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, Shanghang Zhang

First: 2025-08-28T17:50:03+00:00 · Latest: 2025-08-28T17:50:03+00:00

Comments: 10 pages, 3 figures

Abstract

Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency through visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they treat video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance, while removing 75% of the visual tokens and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.

Summary

MMG-Vid is a training-free visual token pruning framework that enhances the efficiency of Video Large Language Models (VLLMs) by maximizing marginal gains at both segment-level and token-level. It divides videos into segments based on frame similarity and dynamically allocates token budgets to maximize the marginal gain of each segment. Additionally, it uses a temporal-guided DPC algorithm to model inter-frame uniqueness and intra-frame diversity, further optimizing token usage. Experiments show that MMG-Vid maintains over 99.5% of the original performance while reducing visual tokens by 75% and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B.
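
The segment-level stage can be sketched as two steps: cut the video wherever consecutive frames stop looking alike, then split a fixed token budget across the resulting segments. This is a rough sketch under assumptions, not the paper's method: the cosine threshold is arbitrary, and the allocation rule here (proportional to segment length) is a placeholder for the marginal-gain criterion, which the abstract does not spell out. Function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def split_segments(frames: List[Vector], threshold: float = 0.9) -> List[List[int]]:
    """Start a new segment whenever a frame differs enough from the previous one."""
    segments = [[0]]
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) < threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments


def allocate_budget(segments: List[List[int]], total_tokens: int) -> List[int]:
    """Placeholder rule: share the budget in proportion to segment length."""
    num_frames = sum(len(s) for s in segments)
    budgets = [total_tokens * len(s) // num_frames for s in segments]
    leftover = total_tokens - sum(budgets)
    for i in range(leftover):  # hand out tokens lost to integer division
        budgets[i % len(budgets)] += 1
    return budgets


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
    segments = split_segments(frames)
    # Keeping 25% of the tokens corresponds to the 75% reduction quoted above.
    print(segments, allocate_budget(segments, total_tokens=11))
```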

Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets

Authors: Dale Decatur, Thibault Groueix, Wang Yifan, Rana Hanocka, Vladimir Kim, Matheus Gadelha

Venue: ICCV 2025

First: 2025-08-28T17:35:03+00:00 · Latest: 2025-08-28T17:35:03+00:00

Comments: ICCV 2025. Project page: https://ddecatur.github.io/hierarchical-diffusion/

Abstract

Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free approach that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip's text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: https://ddecatur.github.io/hierarchical-diffusion/

Summary

The research aims to improve the efficiency of text-to-image generation by reducing computational redundancy across similar prompts. The method clusters prompts based on semantic similarity and shares computation in early diffusion steps. For models trained conditioned on image embeddings, this significantly reduces compute cost while improving image quality. The approach integrates with existing pipelines and scales with prompt sets, reducing both the environmental and financial burden of large-scale text-to-image generation.
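
To see where the savings come from, it helps to count denoising steps. The sketch below assumes the simplest case: one cluster of prompts that shares a single early trajectory and then branches once. The paper's actual step allocation is not given in the abstract, so the numbers and the function name `denoising_steps` are illustrative assumptions.

```python
from typing import Tuple


def denoising_steps(num_prompts: int, total_steps: int,
                    shared_steps: int) -> Tuple[int, int, float]:
    """Return (independent cost, shared cost, fraction of steps saved)."""
    independent = num_prompts * total_steps
    # Shared early steps run once; the remaining steps run once per prompt.
    shared = shared_steps + num_prompts * (total_steps - shared_steps)
    return independent, shared, 1 - shared / independent


if __name__ == "__main__":
    independent, shared, saved = denoising_steps(num_prompts=4, total_steps=50, shared_steps=20)
    print(independent, shared, round(saved, 3))
```

With 4 prompts, 50 steps, and 20 shared steps, the cluster needs 140 step evaluations instead of 200. The saving grows with both the cluster size and the number of steps that can be shared.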

DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes

Authors: Yajiao Xiong, Xiaoyu Zhou, Yongtao Wan, Deqing Sun, Ming-Hsuan Yang

First: 2025-08-28T16:22:54+00:00 · Latest: 2025-08-28T16:22:54+00:00

Abstract

We present DrivingGaussian++, an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes. DrivingGaussian++ models the static background using incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. DrivingGaussian++ supports training-free controllable editing for dynamic driving scenes, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. By integrating large language models (LLMs) and controllable editing, our method can automatically generate dynamic object motion trajectories and enhance their realism during the optimization process. DrivingGaussian++ demonstrates consistent and realistic editing results and generates dynamic multi-view driving scenarios, while significantly enhancing scene diversity. More results and code can be found at the project site: https://xiong-creator.github.io/DrivingGaussian_plus.github.io

Summary

DrivingGaussian++ is a framework designed for realistic reconstruction and controllable editing of dynamic driving scenes. It models static backgrounds using incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods. The framework supports training-free controllable editing, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. It can automatically generate dynamic object motion trajectories and enhance their realism, demonstrating consistent and realistic editing results and generating dynamic multi-view driving scenarios.

Understanding and evaluating computer vision models through the lens of counterfactuals

Authors: Pushkar Shukla

First: 2025-08-28T15:11:49+00:00 · Latest: 2025-08-28T15:11:49+00:00

Abstract

Counterfactual reasoning -- the practice of asking "what if" by varying inputs and observing changes in model behavior -- has become central to interpretable and fair AI. This thesis develops frameworks that use counterfactuals to explain, audit, and mitigate bias in vision classifiers and generative models. By systematically altering semantically meaningful attributes while holding others fixed, these methods uncover spurious correlations, probe causal dependencies, and help build more robust systems. The first part addresses vision classifiers. CAVLI integrates attribution (LIME) with concept-level analysis (TCAV) to quantify how strongly decisions rely on human-interpretable concepts. With localized heatmaps and a Concept Dependency Score, CAVLI shows when models depend on irrelevant cues like backgrounds. Extending this, ASAC introduces adversarial counterfactuals that perturb protected attributes while preserving semantics. Through curriculum learning, ASAC fine-tunes biased models for improved fairness and accuracy while avoiding stereotype-laden artifacts. The second part targets generative Text-to-Image (TTI) models. TIBET provides a scalable pipeline for evaluating prompt-sensitive biases by varying identity-related terms, enabling causal auditing of how race, gender, and age affect image generation. To capture interactions, BiasConnect builds causal graphs diagnosing intersectional biases. Finally, InterMit offers a modular, training-free algorithm that mitigates intersectional bias via causal sensitivity scores and user-defined fairness goals. Together, these contributions show counterfactuals as a unifying lens for interpretability, fairness, and causality in both discriminative and generative models, establishing principled, scalable methods for socially responsible bias evaluation and mitigation.

Summary

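This thesis uses counterfactual reasoning, that is, asking "what if" by varying inputs and observing changes in model behavior, to explain, audit, and mitigate bias in vision classifiers and generative models. For classifiers, CAVLI combines LIME attribution with TCAV concept-level analysis to quantify how strongly decisions rely on human-interpretable concepts, and ASAC fine-tunes biased models with adversarial counterfactuals and curriculum learning. For Text-to-Image models, TIBET evaluates prompt-sensitive biases by varying identity-related terms, BiasConnect builds causal graphs to diagnose intersectional biases, and InterMit mitigates them with a training-free algorithm driven by causal sensitivity scores and user-defined fairness goals.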

Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu

First: 2025-03-14T15:42:42+00:00 · Latest: 2025-08-28T14:55:38+00:00

Comments: This paper is accepted by IJCAI2025 Workshop on Deepfake Detection, Localization, and Interpretability

Abstract

Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.

Summary

This paper investigates the security threats posed by typographic visual prompts in Cross-Modality Generation Models. It introduces a dataset to evaluate the impact of typographic visual prompt injection (TVPI) on various large vision language models (LVLMs) and image-to-image (I2I) generation models. The study reveals that visual prompts can significantly influence model outputs, leading to semantically aligned but disruptive results, highlighting the need for better security measures in cross-vision tasks.

Learning Primitive Embodied World Models: Towards Scalable Robotic Learning

Authors: Qiao Sun, Liujia Yang, Wei Tang, Wei Huang, Kaixin Xu, Yongchao Chen, Mingyu Liu, Jiange Yang, Haoyi Zhu, Yating Wang, Tong He, Yilun Chen, Xili Dai, Nanyang Ye, Qinying Gu

First: 2025-08-28T14:31:48+00:00 · Latest: 2025-08-28T14:31:48+00:00

Abstract

While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.

Summary

This paper addresses the challenge of generating embodied world models through video generation, emphasizing the limitations posed by the scarcity and high dimensionality of embodied interaction data. To overcome these issues, the authors propose Primitive Embodied World Models (PEWM), which focus on short horizons to achieve fine-grained alignment between language and actions, reduce learning complexity, and improve data efficiency. PEWM incorporates a modular Vision-Language Model planner and a Start-Goal heatmap Guidance mechanism, enabling flexible closed-loop control and compositional generalization of primitive-level policies for complex tasks.

Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation

Authors: Krit Duangprom, Tryphon Lambrou, Binod Bhattarai

Venue: MICCAI 2025

First: 2025-08-28T14:25:32+00:00 · Latest: 2025-08-28T14:25:32+00:00

Comments: Accepted to MICCAI 2025

Abstract

This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision Language Models (VLMs) fine-tuned using Low-Rank Adaptation (LoRA). Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting in small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work in 3D surgical hands and tools pose estimation.

Summary

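This paper presents a pipeline for 2D keypoint estimation of surgical tools that fine-tunes Vision Language Models (VLMs) with Low-Rank Adaptation (LoRA). Prompts are designed to build an instruction-tuning dataset that aligns visual features with semantic keypoint descriptions. With only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, which the authors take as evidence of LoRA's effectiveness in low-resource scenarios.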
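
For readers unfamiliar with LoRA, the core update is small enough to write out. The sketch below uses the standard formulation, in which a frozen weight matrix W is augmented by a scaled low-rank product, W + (alpha / r) * B A, and only B and A are trained. The matrix sizes and values are toy assumptions and say nothing about which layers or ranks the paper uses.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Effective weight W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(a)  # A has shape (r, d_in), B has shape (d_out, r)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the adapter, versus d_out * d_in for full fine-tuning."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weight (toy 2x2)
    b = [[1.0], [2.0]]             # trainable, shape (2, 1)
    a = [[3.0, 4.0]]               # trainable, shape (1, 2)
    print(lora_weight(w, b, a, alpha=2.0))
    print(trainable_params(4096, 4096, rank=8), "vs", 4096 * 4096)
```

The parameter count is what makes the approach attractive on small datasets: the adapter grows linearly with the layer width, while full fine-tuning grows quadratically.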

Improving Fine-Grained Control via Aggregation of Multiple Diffusion Models

Authors: Conghan Yue, Zhengwei Peng, Shiyan Du, Zhi Ji, Chuangjian Cai, Le Wan, Dongyu Zhang

First: 2024-10-02T06:16:06+00:00 · Latest: 2025-08-28T14:03:26+00:00

Abstract

While many diffusion models perform well when controlling particular aspects such as style, character, and interaction, they struggle with fine-grained control due to dataset limitations and intricate model architecture design. This paper introduces a novel training-free algorithm, independent of denoising network architectures, for fine-grained generation, called Aggregation of Multiple Diffusion Models (AMDM). The algorithm integrates features from multiple diffusion models into a specified model to activate particular features and enable fine-grained control. Experimental results demonstrate that AMDM significantly improves fine-grained control without training, validating its effectiveness. Additionally, it reveals that diffusion models initially focus on features such as position, attributes, and style, with later stages improving generation quality and consistency. AMDM offers a new perspective for tackling the challenges of fine-grained conditional generation in diffusion models. Specifically, it allows us to fully utilize existing or develop new conditional diffusion models that control specific aspects, and then aggregate them using the AMDM algorithm. This eliminates the need for constructing complex datasets, designing intricate model architectures, and incurring high training costs. Code is available at: https://github.com/Hammour-steak/AMDM.

Evaluating Compositional Generalisation in VLMs and Diffusion Models

Authors: Beth Pearson, Bilal Boulbarss, Michael Wray, Martha Lewis

First: 2025-08-28T13:45:04+00:00 · Latest: 2025-08-28T13:45:04+00:00

Comments: 11 pages including references, 6 figures. Accepted at IWCS 2025

Abstract

A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a "bag-of-words" and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. In this work we explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models -- Diffusion Classifier, CLIP, and ViLT -- on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. Code and dataset are available at: https://github.com/otmive/diffusion_classifier_clip

Summary

This work evaluates the compositional generalization capabilities of Vision-language models (VLMs) and diffusion models by assessing their ability to bind objects with attributes and relations. The study compares the Diffusion Classifier, CLIP, and ViLT in zero-shot learning and generalized zero-shot learning settings. Results indicate that while the Diffusion Classifier and ViLT perform well in concept binding tasks, all models struggle with relational reasoning, highlighting the challenges VLMs face in relational tasks. Analysis of CLIP embeddings suggests that the difficulty might arise from similar representations of relational concepts such as left and right.
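
The "bag-of-words" failure mode from the abstract is easy to reproduce in miniature. The toy below is not CLIP and makes no claim about its internals: it only shows that a representation which counts words cannot tell "red cube and blue cylinder" from "blue cube and red cylinder", while one that binds each attribute to its object can. Both function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def bag_of_words(caption: str) -> Counter:
    """Order-free representation: just word counts."""
    return Counter(caption.lower().split())


def attribute_object_pairs(caption: str) -> List[Tuple[str, str]]:
    """Bind each attribute to the object that follows it, e.g. ('red', 'cube')."""
    pairs = []
    for phrase in caption.lower().split(" and "):
        attribute, obj = phrase.split()
        pairs.append((attribute, obj))
    return sorted(pairs)


if __name__ == "__main__":
    correct = "red cube and blue cylinder"
    swapped = "blue cube and red cylinder"
    # Same words, so the order-free view sees no difference.
    print(bag_of_words(correct) == bag_of_words(swapped))
    # Binding attributes to objects separates the two captions.
    print(attribute_object_pairs(correct) == attribute_object_pairs(swapped))
```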

Occlusion Robustness of CLIP for Military Vehicle Classification

Authors: Jan Erik van Woerden, Gertjan Burghouts, Lotte Nijskens, Alma M. Liezenga, Sabina van Rooij, Frank Ruis, Hugo J. Kuijf

First: 2025-08-28T13:16:55+00:00 · Latest: 2025-08-28T13:16:55+00:00

Comments: To be presented at SPIE: Sensors + Imaging, Artificial Intelligence for Security and Defence Applications II

Abstract

Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP's robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants' robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models sharply drops at around 35% occlusion, (4) by finetuning the model's backbone, this performance drop occurs at more than 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.

Summary

The study investigates the robustness of CLIP models to occlusions in military vehicle classification, using a custom dataset of 18 classes. Key findings include the superior performance of transformer-based CLIP models over CNNs, the greater impact of fine-grained, dispersed occlusions compared with larger contiguous ones, and a sharp drop in the performance of linear-probed models at around 35% occlusion, which fine-tuning the backbone pushes back to more than 60% occlusion. These results highlight the need for occlusion-specific training and further research into architectural resilience for real-world applications.
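
A normalized area under the accuracy-versus-occlusion curve can be computed with the trapezoidal rule. The exact normalization used in the paper is not given in the abstract; the sketch below assumes the area is divided by the width of the occlusion range, so that a model with perfect accuracy at every level scores 1. The accuracy values are made up for illustration.

```python
from typing import Sequence


def normalized_auc(occlusion: Sequence[float], accuracy: Sequence[float]) -> float:
    """Trapezoidal area under accuracy(occlusion), divided by the occlusion range."""
    area = 0.0
    for i in range(1, len(occlusion)):
        width = occlusion[i] - occlusion[i - 1]
        area += width * (accuracy[i] + accuracy[i - 1]) / 2
    return area / (occlusion[-1] - occlusion[0])


if __name__ == "__main__":
    occlusion = [0.0, 0.2, 0.4, 0.6, 0.8]        # fraction of the image occluded
    accuracy = [0.95, 0.90, 0.60, 0.30, 0.10]    # hypothetical accuracies
    print(round(normalized_auc(occlusion, accuracy), 4))
```

Summarizing the whole curve in one number makes models comparable even when their accuracy collapses at different occlusion levels.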

研究使用自定义数据集评估了CLIP变体在军事车辆分类中的遮挡鲁棒性。关键发现包括：基于Transformer的模型在性能上优于CNN，细粒度、分散的遮挡导致更大的性能下降，线性探测模型在约35%遮挡时性能急剧下降，而微调模型骨干网络可将这一性能下降推迟到60%以上的遮挡。这些结果强调了在实际应用中需要遮挡特定的训练增强，并需进一步研究架构的鲁棒性。
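The entry above scores robustness with a Normalized Area Under the Curve (NAUC) across occlusion percentages. The digest does not give the exact formula, so the following is a minimal sketch under one plausible assumption: NAUC is the trapezoidal area under the accuracy-versus-occlusion curve divided by the occlusion range, so a model that keeps accuracy 1.0 at every level scores 1.0. The function name `normalized_auc` is illustrative, not taken from the paper.

```python
def normalized_auc(occlusion_levels, accuracies):
    """Trapezoidal area under accuracy vs. occlusion, divided by the occlusion range.

    Assumption: this normalization is a guess at what "NAUC" means in the
    paper; the digest itself does not define it.
    """
    if len(occlusion_levels) != len(accuracies) or len(occlusion_levels) < 2:
        raise ValueError("need at least two matching (occlusion, accuracy) points")
    area = 0.0
    for i in range(1, len(occlusion_levels)):
        width = occlusion_levels[i] - occlusion_levels[i - 1]
        if width <= 0:
            raise ValueError("occlusion levels must be strictly increasing")
        # Trapezoid between two neighbouring occlusion levels.
        area += width * (accuracies[i] + accuracies[i - 1]) / 2.0
    return area / (occlusion_levels[-1] - occlusion_levels[0])


if __name__ == "__main__":
    # Hypothetical accuracies, not results from the paper.
    levels = [0, 20, 40, 60, 80]
    acc = [0.90, 0.85, 0.60, 0.30, 0.10]
    print(round(normalized_auc(levels, acc), 4))
```

A single scalar like this makes it easy to compare models whose accuracy curves cross, at the cost of hiding where along the occlusion axis the drop happens.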

NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks

Authors: Aritra Dutta, Swapnanil Mukherjee, Deepanway Ghosal, Somak Aditya

First: 2025-08-27T09:34:28+00:00 · Latest: 2025-08-28T12:05:33+00:00

Abstract

Commonsense visual-question answering often hinges on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present an end-to-end framework (NLKI) that (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs respectively across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved using a fine-tuned ColBERTv2 and an object information-enriched prompt yield explanations that largely cut down hallucinations, while lifting the end-to-end answer accuracy by up to 7% (across 3 datasets), making FLAVA and other models in NLKI match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. As these benchmarks contain 10-25% label noise, additional finetuning using noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% in CRIC, and 5.5% in AOKVQA. Our findings expose when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models in the context of external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M models.

Summary / 总结

The research aims to enhance small vision-language models (sVLMs) in commonsense visual-question answering tasks by integrating natural language knowledge. The NLKI framework retrieves natural language facts and prompts an LLM to generate explanations, which are then fed into sVLMs. This approach improves end-to-end answer accuracy by up to 7% across three datasets, making models like FLAVA match or surpass medium-sized VLMs. Additional fine-tuning with noise-robust losses further enhances performance, especially in datasets with label noise.

研究旨在通过整合自然语言知识来提升小型视觉语言模型（sVLMs）在常识视觉问答任务中的表现。NLKI框架检索自然语言事实并促使LLM生成解释，然后将这些信息输入sVLMs。这种方法在三个数据集上将端到端的答案准确性提高了最多7%，使FLAVA等模型能够匹配或超越中型视觉语言模型。通过使用抗噪声损失（如对称交叉熵和广义交叉熵）进行进一步微调，特别是在包含标签噪声的数据集上，性能得到了进一步提升。
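The NLKI entry names symmetric cross entropy and generalised cross entropy as its noise-robust losses. The sketch below computes both for a single example from already-normalized class probabilities, following the commonly published definitions (reverse cross entropy with log 0 clipped to a constant, and the (1 - p^q)/q form). The hyperparameter defaults (`alpha`, `beta`, `log_zero`, `q`) are illustrative assumptions, not values reported for NLKI.

```python
import math


def cross_entropy(probs, label):
    """Standard cross entropy for one example with a hard label."""
    return -math.log(max(probs[label], 1e-12))


def reverse_cross_entropy(probs, label, log_zero=-4.0):
    """Reverse CE: -sum_k p_k * log q_k, where q is the one-hot label.

    log(1) = 0 for the labelled class, and log(0) is replaced by the
    constant `log_zero` for every other class.
    """
    return -sum(p * log_zero for k, p in enumerate(probs) if k != label)


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """Weighted sum of cross entropy and reverse cross entropy."""
    return (alpha * cross_entropy(probs, label)
            + beta * reverse_cross_entropy(probs, label, log_zero))


def generalized_cross_entropy(probs, label, q=0.7):
    """(1 - p_y^q) / q; approaches CE as q -> 0 and equals 1 - p_y at q = 1."""
    return (1.0 - probs[label] ** q) / q


if __name__ == "__main__":
    probs = [0.7, 0.2, 0.1]  # hypothetical model output
    print(symmetric_cross_entropy(probs, 0), generalized_cross_entropy(probs, 0))
```

Both losses bound the penalty for confidently "wrong" predictions, which is why they help when a fraction of the labels is itself wrong.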

"Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection

Authors: Anastasios Skoularikis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos, Panagiotis C. Petrantonakis

First: 2025-08-28T11:22:15+00:00 · Latest: 2025-08-28T11:22:15+00:00

Abstract

Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to "in the wild" content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.

中文标题/摘要

标题："幽默、艺术还是误导信息？": 一种面向意图的合成图像检测多模态数据集

近年来，多模态AI的进步推动了对合成和脱离上下文内容检测的进展。然而，现有努力大多忽略了AI生成图像背后的意图。为填补这一空白，我们引入了S-HArM，这是一个面向意图分类的多模态数据集，包含来自Twitter/X和Reddit的9,576个“野生”图像-文本对，并标记为幽默/讽刺、艺术或误导信息。此外，我们探索了三种提示策略（图像导向、描述导向和多模态导向）来构建大规模合成训练数据集，使用Stable Diffusion。我们进行了广泛的比较研究，包括模态融合、对比学习、重建网络、注意力机制和大型视觉-语言模型。我们的结果显示，基于图像和多模态导向数据训练的模型在“野生”内容上的泛化能力更强，因为保留了视觉上下文。然而，总体性能仍然有限，突显了推断意图的复杂性以及需要专门架构的需求。

Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music

Authors: Hongju Su, Ke Li, Lan Yang, Honggang Zhang, Yi-Zhe Song

First: 2025-08-28T11:15:44+00:00 · Latest: 2025-08-28T11:15:44+00:00

Comments: Under review

Abstract

Existing state-of-the-art symbolic music generation models predominantly adopt autoregressive or hierarchical autoregressive architectures, modelling symbolic music as a sequence of attribute tokens with unidirectional temporal dependencies, under the assumption of a fixed, strict dependency structure among these attributes. However, we observe that using different attributes as the initial token in these models leads to comparable performance. This suggests that the attributes of a musical note are, in essence, a concurrent and unordered set, rather than a temporally dependent sequence. Based on this insight, we introduce Amadeus, a novel symbolic music generation framework. Amadeus adopts a two-level architecture: an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. To enhance performance, we propose the Music Latent Space Discriminability Enhancement Strategy (MLSDES), incorporating contrastive learning constraints that amplify discriminability of intermediate music representations. The Conditional Information Enhancement Module (CIEM) simultaneously strengthens note latent vector representation via attention mechanisms, enabling more precise note decoding. We conduct extensive experiments on unconditional and text-conditioned generation tasks. Amadeus significantly outperforms SOTA models across multiple metrics while achieving at least 4$\times$ speed-up. Furthermore, we demonstrate the feasibility of training-free, fine-grained note attribute control using our model. To explore the upper performance bound of the Amadeus architecture, we compile the largest open-source symbolic music dataset to date, AMD (Amadeus MIDI Dataset), supporting both pre-training and fine-tuning.

中文标题/摘要

标题：阿玛迪乌斯：双向属性建模的自回归音乐生成模型

现有的最先进的符号音乐生成模型主要采用自回归或分层自回归架构，将符号音乐建模为具有单向时间依赖性的属性令牌序列，假设这些属性之间存在固定且严格的依赖结构。然而，我们观察到，在这些模型中使用不同的属性作为初始令牌会导致相当的性能。这表明，音乐音符的属性本质上是一个并发且无序的集合，而不是一个时间依赖的序列。基于这一洞察，我们引入了阿玛迪乌斯，一种新颖的符号音乐生成框架。阿玛迪乌斯采用两层架构：用于音符序列的自回归模型和用于属性的双向离散扩散模型。为了提高性能，我们提出了音乐潜在空间判别性增强策略(MLSDES)，结合对比学习约束以增强中间音乐表示的判别性。条件信息增强模块(CIEM)同时通过注意力机制增强音符潜在向量表示，使音符解码更加精确。我们在无条件和文本条件生成任务上进行了广泛的实验。阿玛迪乌斯在多个指标上显著优于当前最佳模型，同时实现至少4倍的速度提升。此外，我们展示了使用我们的模型实现无训练的细粒度音符属性控制的可行性。为了探索阿玛迪乌斯架构的性能上限，我们编译了迄今为止最大的开源符号音乐数据集AMD（阿玛迪乌斯MIDI数据集），支持预训练和微调。

Summary / 总结

Amadeus is a novel symbolic music generation framework that addresses the limitations of existing autoregressive models by introducing a bidirectional discrete diffusion model for attributes. It outperforms state-of-the-art models across multiple metrics and achieves at least 4 times speed-up. The framework includes MLSDES and CIEM to enhance discriminability and note representation, respectively. Amadeus also enables fine-grained control over note attributes without training.

Amadeus 是一种新颖的符号音乐生成框架，通过引入属性的双向离散扩散模型来解决现有自回归模型的限制。它在多个指标上显著优于现有最佳模型，并且至少实现了 4 倍的速度提升。该框架包括 MLSDES 和 CIEM，分别增强中间音乐表示的可区分性和音符的表示。Amadeus 还能够在不进行训练的情况下实现对音符属性的细粒度控制。

Enhancing Document VQA Models via Retrieval-Augmented Generation

Authors: Eric López, Artemis Llabrés, Ernest Valveny

First: 2025-08-26T12:32:55+00:00 · Latest: 2025-08-28T10:31:44+00:00

Comments: Accepted at Workshop on Machine Learning in Document Analysis and Recognition (ICDAR WML 2025), Wuhan, China

Abstract

Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.

中文标题/摘要

标题：通过检索增强生成提升文档VQA模型

文档视觉问答（Document VQA）必须应对跨数十页的文档，但领先系统仍会将每页连接起来或依赖非常大的视觉语言模型，这两种方法都消耗大量内存。检索增强生成（RAG）提供了一种有吸引力的替代方案，首先检索一组相关的简要段落，然后从这些选定的证据中生成答案。在本文中，我们系统地评估了将RAG整合到文档VQA中的影响，通过不同的检索变体——使用OCR标记的文本检索和完全基于视觉的检索——在多个模型和基准上进行了评估。在多页数据集MP-DocVQA、DUDE和InfographicVQA上进行评估，以文本为中心的变体将“连接所有页面”的基线提高了最多+22.5 ANLS，而视觉变体在无需任何文本提取的情况下实现了+5.0 ANLS的改进。消融实验表明，检索和重新排序组件是主要的改进来源，而布局引导的分块策略——在几篇最近的工作中提出，旨在利用页面结构——在这些数据集上未能提供帮助。我们的实验表明，仔细选择证据在多个模型大小和多页基准上始终提高了准确性，突显了其实用价值。

Summary / 总结

This paper explores the use of Retrieval-Augmented Generation (RAG) in enhancing Document Visual Question Answering (Document VQA) systems. It evaluates different retrieval methods, including text-based and purely visual, across various models and benchmarks. The text-centric RAG variant improves the baseline by up to 22.5 ANLS, while the visual variant achieves a 5.0 ANLS improvement without OCR. The experiments show that careful evidence selection consistently boosts accuracy, highlighting the practical value of RAG for Document VQA.

本文探讨了将检索增强生成（RAG）集成到文档视觉问答（Document VQA）模型中，以解决处理多页文档的内存挑战。通过在MP-DocVQA、DUDE和InfographicVQA上的系统评估，文本中心的RAG变体将基线提高了最多22.5 ANLS，而视觉RAG变体在无需OCR的情况下实现了5.0 ANLS的改进。研究确认检索和重排序对于性能提升至关重要，而基于布局的分块策略在这些数据集上并未显著受益。实验结果强调了RAG在多种模型大小和多页基准上的准确性的提升，突显了其实用价值。
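The gains in the entry above are quoted in ANLS (Average Normalized Levenshtein Similarity). A minimal sketch of the metric follows, assuming the commonly used definition: per question, take the best score over all reference answers, where the score is 1 minus the normalized edit distance and is set to 0 once that distance reaches a threshold of 0.5. The lower-casing and whitespace stripping are assumptions, and papers usually report the value multiplied by 100.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best thresholded similarity to any reference."""
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical answers: one near-miss, one complete miss.
    print(anls(["pariz", "london"], [["Paris"], ["Paris"]]))
```

The threshold is what separates an OCR-style near-miss, which still earns partial credit, from a wrong answer, which earns none.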

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Authors: Haodong Duan, Xinyu Fang, Junming Yang, Xiangyu Zhao, Yuxuan Qiao, Mo Li, Amit Agarwal, Zhe Chen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Jixuan Chen, Enxin Song, Song Mao, Shengyuan Ding, Tianhao Liang, Zicheng Zhang, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen

First: 2024-07-16T13:06:15+00:00 · Latest: 2025-08-28T09:40:49+00:00

Comments: Updated on 2025.08.28, data cut down to 2025.06.30

Abstract

We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement more than 200 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 80 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released on https://github.com/open-compass/VLMEvalKit and is actively maintained.

中文标题/摘要

标题：VLMEvalKit：一个基于PyTorch的开源多模态模型评估工具包

我们介绍了VLMEvalKit：一个基于PyTorch的开源多模态模型评估工具包。该工具包旨在为研究人员和开发人员提供一个用户友好且全面的框架，用于评估现有的多模态模型并发布可重复的评估结果。在VLMEvalKit中，我们实现了超过200种不同的大型多模态模型，包括专有API和开源模型，以及超过80种不同的多模态基准。通过实现单一接口，新模型可以轻松添加到工具包中，而工具包会自动处理其余的工作负载，包括数据准备、分布式推理、预测后处理和指标计算。尽管该工具包目前主要用于评估大型视觉-语言模型，但其设计兼容未来可能增加其他模态（如音频和视频）的更新。基于使用该工具包获得的评估结果，我们托管了OpenVLM排行榜，这是一个全面的排行榜，用于跟踪多模态学习研究的进展。该工具包发布在https://github.com/open-compass/VLMEvalKit，并且正在积极维护。

Summary / 总结

VLMEvalKit is an open-source toolkit for evaluating large multi-modality models using PyTorch. It provides a user-friendly framework for researchers to evaluate models and publish reproducible results. The toolkit includes over 200 multi-modality models and more than 80 benchmarks, automatically handling data preparation, distributed inference, and metric calculation. It is designed to be compatible with future updates to include additional modalities like audio and video, and hosts the OpenVLM Leaderboard to track research progress. The toolkit is available on GitHub and actively maintained.

VLMEvalKit 是一个基于 PyTorch 的开源工具包，用于评估大型多模态模型。它为研究人员提供了一个用户友好的框架，用于评估模型并发布可重复的结果。该工具包包括超过 200 个多模态模型和超过 80 个基准，自动处理数据准备、分布式推理和指标计算。它设计为未来可以扩展以包含其他模态，如音频和视频，并托管 OpenVLM 排行榜以跟踪研究进展。该工具包可在 GitHub 上获取并积极维护。

Towards Mechanistic Defenses Against Typographic Attacks in CLIP

Authors: Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek

First: 2025-08-28T09:08:30+00:00 · Latest: 2025-08-28T09:08:30+00:00

Abstract

Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.

中文标题/摘要

标题：面向CLIP中 typographic 攻击的机制性防御

typographic 攻击通过向图像中注入文本来利用多模态系统，导致目标误分类、恶意内容生成，甚至视觉语言模型的越狱。在本研究中，我们分析了CLIP视觉编码器在 typographic 攻击下的行为，发现模型后半部分层中的特定注意力头因果性地提取并传递 typographic 信息至 cls 令牌。基于这些见解，我们提出了一种通过选择性地消除 typographic 电路（由注意力头组成）来防御 CLIP 模型的 typographic 攻击的方法。无需微调，我们的方法在 typographic 变体的 ImageNet-100 上性能提升高达 19.6%，同时 ImageNet-100 准确率下降不到 1%。值得注意的是，我们的无需训练的方法在与依赖微调的当前最先进的 typographic 防御方法的竞争中保持竞争力。为此，我们发布了具有显著更强 typographic 攻击鲁棒性的 dyslexic CLIP 模型，这些模型适合作为广泛的安全关键应用的即插即用替代品，其中文本操纵的风险超过了文本识别的实用性。

Summary / 总结

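The paper analyzes how CLIP vision encoders process typographic information under typographic attacks. The authors locate specialized attention heads in the latter half of the model's layers that transmit typographic information to the cls token, and defend against such attacks by selectively ablating these heads without finetuning. The method improves performance by up to 19.6% on a typographic variant of ImageNet-100 while reducing standard ImageNet-100 accuracy by less than 1%, and remains competitive with state-of-the-art defenses that rely on finetuning. The authors also release a family of dyslexic CLIP models as drop-in replacements for safety-critical applications.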

该研究针对CLIP模型的字形攻击，通过分析CLIP视觉编码器如何处理字形信息来应对这类攻击。作者识别出特定的注意力头将字形数据传输到cls标记，并提出通过选择性地去除这些头来防御此类攻击的方法。该方法在字形变体的ImageNet-100上提高了高达19.6%的性能，同时标准ImageNet-100准确率下降不到1%，且与依赖微调的最先进防御方法相比仍具竞争力。研究还发布了对字形攻击更为鲁棒的 dyslexic CLIP 模型，这些模型可以作为安全关键应用中的即插即用替代品。
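The defense in the entry above removes selected attention heads from the forward pass. In a standard transformer layer, the attention output written to the residual stream is a sum of per-head contributions, so ablating a head amounts to dropping its term. The sketch below illustrates this with zero-ablation on plain Python lists; whether the paper uses zero-ablation or another variant is not stated in the digest, and the function name and toy values are hypothetical.

```python
def attention_output(head_contributions, ablated_heads=frozenset()):
    """Sum per-head contributions for one token, skipping ablated heads.

    head_contributions[h] is head h's already-projected output vector for
    the token (for example the cls token). Zero-ablation is an assumption.
    """
    dim = len(head_contributions[0])
    total = [0.0] * dim
    for h, contribution in enumerate(head_contributions):
        if h in ablated_heads:
            continue  # the ablated head writes nothing to the residual stream
        for d in range(dim):
            total[d] += contribution[d]
    return total


if __name__ == "__main__":
    # Three toy heads; pretend head 2 carries the typographic signal.
    heads = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
    print(attention_output(heads))                     # all heads active
    print(attention_output(heads, ablated_heads={2}))  # head 2 removed
```

Because only a fixed set of heads is masked at inference time, no weights are updated, which matches the training-free framing of the abstract.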

MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning

Authors: Weihai Zhi, Jiayan Guo, Shangyang Li

First: 2025-08-28T08:41:32+00:00 · Latest: 2025-08-28T08:41:32+00:00

Comments: 8 pages, 5 figures

Abstract

The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks, while Reinforcement Learning (RL), a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To break this impasse, we introduce Generative Reward Learning for Medical Reasoning (MedGR$^2$), a novel framework that creates a self-improving virtuous cycle. MedGR$^2$ co-develops a data generator and a reward model, enabling the automated, continuous creation of high-quality, multi-modal medical data that serves as both a superior training source for SFT and RL. Our experiments demonstrate that SFT with MedGR$^2$-produced data already surpasses baselines trained on large-scale, human-curated datasets. Crucially, when leveraging this data for RL via Group Relative Policy Optimization (GRPO), our model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized RL-based methods. Furthermore, our compact model, empowered by MedGR$^2$, achieves performance competitive with foundation models possessing over 10 times more parameters. MedGR$^2$ presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking the full potential of RL for building truly generalizable medical AI.

Summary / 总结

The scarcity of high-quality medical data hampers the application of Vision-Language Models in medicine. To address this, MedGR$^2$ is introduced, a framework that co-develops a data generator and a reward model to create high-quality, multi-modal medical data. This data is used for Supervised Fine-Tuning and Reinforcement Learning, leading to superior performance compared to existing methods. Specifically, MedGR$^2$-generated data improves SFT and, when used with RL, achieves state-of-the-art generalization across modalities and tasks, outperforming specialized RL methods.

MedGR$^2$通过引入生成奖励学习框架来解决医学数据稀缺问题。该框架同时开发数据生成器和奖励模型，能够生成高质量的多模态医学数据，用于监督微调和强化学习。实验表明，MedGR$^2$生成的数据在监督微调中优于大规模的人工标注数据集，并在强化学习中实现了最先进的泛化性能，超越了专门的强化学习方法。此外，使用MedGR$^2$的紧凑模型与更大规模的模型相比表现相当。
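The MedGR$^2$ entry applies Group Relative Policy Optimization (GRPO). The core idea of GRPO is to score each sampled response relative to the other responses drawn for the same prompt, instead of using a learned value baseline. A minimal sketch of that advantage computation follows; the use of the population standard deviation and the `eps` constant are assumptions, and the rewards are made-up numbers rather than outputs of the paper's reward model.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled answers to one medical question.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses above the group mean receive positive advantages and are reinforced; a group with identical rewards yields all-zero advantages and contributes no gradient signal.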

Language-to-Space Programming for Training-Free 3D Visual Grounding

Authors: Boyu Mi, Hanqing Wang, Tai Wang, Yilun Chen, Jiangmiao Pang

First: 2025-02-03T14:32:36+00:00 · Latest: 2025-08-28T07:57:55+00:00

Abstract

3D visual grounding (3DVG) is challenging due to the need to understand 3D spatial relations. While supervised approaches have achieved superior performance, they are constrained by the scarcity and high annotation costs of 3D vision-language datasets. Training-free approaches based on LLMs/VLMs eliminate the need for large-scale training data, but they either incur prohibitive grounding time and token costs or have unsatisfactory accuracy. To address the challenges, we introduce a novel method for training-free 3D visual grounding, namely Language-to-Space Programming (LaSP). LaSP introduces LLM-generated codes to analyze 3D spatial relations among objects, along with a pipeline that evaluates and optimizes the codes automatically. Experimental results demonstrate that LaSP achieves 52.9% accuracy on the Nr3D benchmark, ranking among the best training-free methods. Moreover, it substantially reduces the grounding time and token costs, offering a balanced trade-off between performance and efficiency.

中文标题/摘要

标题：语言到空间编程用于无训练3D视觉定位

3D视觉定位（3DVG）由于需要理解3D空间关系而具有挑战性。虽然监督方法取得了优异的性能，但它们受限于3D视觉语言数据集的稀缺性和高注释成本。基于LLM/VLM的无训练方法消除了大规模训练数据的需求，但它们要么导致高昂的定位时间和标记成本，要么准确率不令人满意。为了解决这些挑战，我们提出了一种新的无训练3D视觉定位方法，即语言到空间编程（LaSP）。LaSP引入了LLM生成的代码来分析对象之间的3D空间关系，并且包含一个自动评估和优化代码的流水线。实验结果表明，LaSP在Nr3D基准测试中达到了52.9%的准确率，排名在最好的无训练方法之中。此外，它显著减少了定位时间和标记成本，提供了性能和效率之间的平衡折衷。

Summary / 总结

The paper introduces Language-to-Space Programming (LaSP) for training-free 3D visual grounding, addressing the challenges of high annotation costs and the need for large-scale training data. LaSP uses LLM-generated codes to analyze 3D spatial relations and automatically evaluates and optimizes these codes. The method achieves 52.9% accuracy on the Nr3D benchmark and significantly reduces grounding time and token costs, offering a balanced trade-off between performance and efficiency.

论文提出了用于无训练3D视觉定位的Language-to-Space Programming (LaSP)方法，解决了监督方法和现有无训练方法的局限性。LaSP使用LLM生成的代码来分析3D空间关系，并自动评估和优化这些代码。该方法在Nr3D基准测试中达到了52.9%的准确率，在无训练方法中名列前茅，并显著减少了定位时间和令牌成本。
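LaSP, as described above, relies on LLM-generated code that analyzes 3D spatial relations, plus a pipeline that evaluates that code automatically. The sketch below shows what one such relation function and a simple evaluation step could look like. Everything here is hypothetical: the `Box3D` type, the `above` rule, and the `accuracy` helper are illustrations, not code or interfaces from the paper.

```python
from dataclasses import dataclass


@dataclass
class Box3D:
    center: tuple  # (x, y, z) with z pointing up
    size: tuple    # (dx, dy, dz) full extents


def above(a, b, tol=1e-6):
    """Return 1.0 if box a sits over box b with horizontal overlap, else 0.0."""
    a_bottom = a.center[2] - a.size[2] / 2
    b_top = b.center[2] + b.size[2] / 2
    overlap_x = abs(a.center[0] - b.center[0]) <= (a.size[0] + b.size[0]) / 2
    overlap_y = abs(a.center[1] - b.center[1]) <= (a.size[1] + b.size[1]) / 2
    return 1.0 if (a_bottom >= b_top - tol and overlap_x and overlap_y) else 0.0


def accuracy(relation, examples):
    """Fraction of (a, b, expected) triples that the relation gets right."""
    hits = sum(1 for a, b, expected in examples
               if (relation(a, b) >= 0.5) == expected)
    return hits / len(examples)


if __name__ == "__main__":
    lamp = Box3D(center=(0.0, 0.0, 1.5), size=(0.2, 0.2, 1.0))
    table = Box3D(center=(0.0, 0.0, 0.5), size=(1.0, 1.0, 1.0))
    examples = [(lamp, table, True), (table, lamp, False)]
    print(accuracy(above, examples))
```

Scoring candidate functions against a few labelled pairs is one simple way a pipeline could choose among several generated implementations of the same relation.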

SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

Authors: Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li

Venue: EMNLP 2025

First: 2025-08-22T08:23:09+00:00 · Latest: 2025-08-28T06:44:28+00:00

Comments: Accepted at EMNLP 2025 Main

Abstract

Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model's speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.

中文标题/摘要

标题：SpecVLM：通过验证器引导的令牌剪枝增强视频LLM的推测性解码

视频大型语言模型（Vid-LLMs）在理解视频内容方面表现出强大的能力。然而，它们对密集视频令牌表示的依赖性在预填充和解码阶段引入了巨大的内存和计算开销。为了减轻最近视频令牌减少方法的信息损失并以无损方式加速Vid-LLMs的解码阶段，我们提出了SpecVLM，这是一种针对Vid-LLMs的无需训练的推测性解码（SD）框架，结合了分阶段的视频令牌剪枝。基于我们的一项新发现，草稿模型的推测对视频令牌剪枝的敏感性较低，SpecVLM 剪枝高达90%的视频令牌，以实现高效的推测而不牺牲准确性。为此，我们进行了两阶段的剪枝过程：第一阶段根据验证器（目标模型）的注意力信号选择高度信息性的令牌，第二阶段以空间均匀的方式剪枝剩余的冗余令牌。在四个视频理解基准上的广泛实验表明，SpecVLM 的有效性和鲁棒性，它分别实现了LLaVA-OneVision-72B和Qwen2.5-VL-32B高达2.68倍和2.11倍的解码加速。代码可在https://github.com/zju-jiyicheng/SpecVLM 获取。

Summary / 总结

SpecVLM is a speculative decoding framework for video large language models (Vid-LLMs) that prunes up to 90% of video tokens to reduce memory and computational overhead while maintaining accuracy. It uses a verifier-guided two-stage pruning process to select and remove redundant tokens, achieving up to 2.68x and 2.11x decoding speedup for LLaVA-OneVision-72B and Qwen2.5-VL-32B respectively on four video understanding benchmarks.

SpecVLM 是一种针对视频大型语言模型（Vid-LLMs）的推测性解码框架，通过剪枝最多90%的视频令牌来减少内存和计算开销，同时保持准确性。它使用两阶段剪枝过程，由验证器的注意力信号引导来选择和剪枝令牌。实验表明，SpecVLM 分别为 LLaVA-OneVision-72B 和 Qwen2.5-VL-32B 实现了高达2.68倍和2.11倍的解码加速。
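The two-stage pruning described above can be sketched on a flat list of video tokens. Stage I keeps the tokens with the highest verifier attention scores; Stage II fills the rest of the budget by sampling the remaining tokens at an even stride. This is a simplified illustration under stated assumptions: the budget split (`stage1_fraction`), the 1D stride standing in for spatially uniform sampling, and the function name are all hypothetical and not taken from the SpecVLM code.

```python
def two_stage_prune(attention_scores, keep_ratio=0.1, stage1_fraction=0.5):
    """Return the sorted indices of the video tokens to keep.

    keep_ratio=0.1 corresponds to pruning 90% of the tokens.
    """
    n = len(attention_scores)
    budget = max(1, round(n * keep_ratio))
    k1 = min(budget, max(1, round(budget * stage1_fraction)))

    # Stage I: keep the k1 tokens the verifier attends to most.
    ranked = sorted(range(n), key=lambda i: attention_scores[i], reverse=True)
    stage1 = set(ranked[:k1])

    # Stage II: spend the remaining budget evenly over the leftover tokens.
    remaining = [i for i in range(n) if i not in stage1]
    k2 = budget - k1
    stage2 = set()
    if k2 > 0 and remaining:
        step = len(remaining) / k2
        stage2 = {remaining[int(j * step)] for j in range(k2)}

    return sorted(stage1 | stage2)


if __name__ == "__main__":
    # Hypothetical attention scores for 20 tokens; tokens 3 and 17 stand out.
    scores = [0.0] * 20
    scores[3], scores[17] = 0.9, 0.8
    print(two_stage_prune(scores, keep_ratio=0.2))
```

Only the draft model sees the pruned token set; in speculative decoding the target model still verifies every drafted token, which is why the output quality is unchanged.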

Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models

Authors: Diogo Freitas, Brigt Håvardstun, Cèsar Ferri, Darío Garigliotti, Jan Arne Telle, José Hernández-Orallo

First: 2025-05-14T09:41:38+00:00 · Latest: 2025-08-28T06:16:17+00:00

Comments: 54 pages (42 pages of appendix). Accepted for publication at the ECAI 2025 conference

Abstract

Large language models have become multimodal, and many of them are said to integrate their modalities using common representations. If this were true, a drawing of a car as an image, for instance, should map to a similar area in the latent space as a textual description of the strokes that form the drawing. To explore this in a black-box access regime to these models, we propose the use of machine teaching, a theory that studies the minimal set of examples a teacher needs to choose so that the learner captures the concept. In this paper, we evaluate the complexity of teaching vision-language models a subset of objects in the Quick, Draw! dataset using two presentations: raw images as bitmaps and trace coordinates in TikZ format. The results indicate that image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based representations. But, surprisingly, the teaching size usually ranks concepts similarly across both modalities, even when controlling for (a human proxy of) concept priors, suggesting that the simplicity of concepts may be an inherent property that transcends modality representations.

Summary / 总结

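The study uses machine teaching to measure how hard it is to teach vision-language models a subset of objects from the Quick, Draw! dataset under two presentations: raw bitmap images and trace coordinates in TikZ format. Image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based ones. Surprisingly, teaching size usually ranks concepts similarly across both modalities, suggesting that concept simplicity may be an inherent property that transcends modality representations.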

研究探讨了使用原始图像和轨迹坐标两种方式教授视觉-语言模型识别来自Quick, Draw!数据集的物体子集的复杂性。研究发现，基于图像的表示更有效，需要更少的片段并实现更高的准确性，而基于坐标的表示则不然。然而，令人惊讶的是，两种表示方式的教学规模对概念的排序相似，表明概念的简单性可能是独立于表示方式的固有属性。

Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models

Authors: Xin Huang, Ruibin Li, Tong Jia, Wei Zheng, Ya Wang

Venue: IJCAI 2025

First: 2025-05-21T14:28:43+00:00 · Latest: 2025-08-28T04:15:35+00:00

Comments: Accepted at the International Joint Conference on Artificial Intelligence (IJCAI 2025)

Abstract

Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model's discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs' performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.

中文标题/摘要

标题：视觉扰动与自适应难负样本对比学习在视觉语言模型中成分推理的应用

视觉语言模型（VLMs）对于多模态任务至关重要，尤其是成分推理（CR）任务，这些任务需要区分视觉和文本嵌入之间的细微语义差异。然而，现有方法主要通过生成基于文本的难负样本对模型进行微调，忽视了基于图像的负样本的重要性，导致视觉编码器训练不足，最终影响模型的整体性能。此外，负样本通常被均匀处理，没有考虑其难度级别，正样本对的对齐也不充分，这导致了难以对齐的样本对的对齐挑战。为了解决这些问题，我们提出了自适应难负样本扰动学习（AHNPL）。AHNPL将基于文本的难负样本转换到视觉域，生成语义上被干扰的图像负样本进行模型训练，从而提高其整体性能。AHNPL还引入了一种使用多模态难负样本损失的对比学习方法，以提高模型在每个模态内区分难负样本的能力，并引入了一种动态边际损失，根据样本难度调整对比边际，以增强困难样本对的区分能力。在三个公开数据集上的实验表明，我们的方法有效地提升了VLMs在复杂CR任务上的性能。源代码可在https://github.com/nynu-BDAI/AHNPL获取。

Summary / 总结

The paper addresses the limitations of existing methods in Vision-Language Models (VLMs) for compositional reasoning tasks by proposing Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL generates image-based negative samples to improve the visual encoder's training and introduces a contrastive learning approach with a multimodal hard negative loss and a dynamic margin loss to enhance the model's discrimination. Experiments show that AHNPL significantly improves VLMs' performance on complex compositional reasoning tasks.

研究旨在通过解决现有方法主要依赖文本硬负样本而忽视图像硬负样本的问题，提升视觉语言模型（VLM）的组合理解能力。提出的自适应硬负样本扰动学习（AHNPL）方法将文本硬负样本转换为视觉域中的语义干扰图像负样本，并引入多模态硬负样本损失和动态边际损失，以增强模型的区分能力和样本对的区分。实验表明，AHNPL 显著提升了 VLM 在复杂组合理解任务上的性能。

MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models

Authors: Xiao Li, Yanfan Zhu, Ruining Deng, Wei-Qi Wei, Yu Wang, Shilin Zhao, Yaohong Wang, Haichun Yang, Yuankai Huo

First: 2025-08-28T01:39:16+00:00 · Latest: 2025-08-28T01:39:16+00:00

Abstract

Recent advances in medical vision-language models (VLMs) open up remarkable opportunities for clinical applications such as automated report generation, copilots for physicians, and uncertainty quantification. However, despite their promise, medical VLMs introduce serious security concerns, most notably risks of Protected Health Information (PHI) exposure, data leakage, and vulnerability to cyberthreats - which are especially critical in hospital environments. Even when adopted for research or non-clinical purposes, healthcare organizations must exercise caution and implement safeguards. To address these challenges, we present MedFoundationHub, a graphical user interface (GUI) toolkit that: (1) enables physicians to manually select and use different models without programming expertise, (2) supports engineers in efficiently deploying medical VLMs in a plug-and-play fashion, with seamless integration of Hugging Face open-source models, and (3) ensures privacy-preserving inference through Docker-orchestrated, operating system agnostic deployment. MedFoundationHub requires only an offline local workstation equipped with a single NVIDIA A6000 GPU, making it both secure and accessible within the typical resources of academic research labs. To evaluate current capabilities, we engaged board-certified pathologists to deploy and assess five state-of-the-art VLMs (Google-MedGemma3-4B, Qwen2-VL-7B-Instruct, Qwen2.5-VL-7B-Instruct, and LLaVA-1.5-7B/13B). Expert evaluation covered colon cases and renal cases, yielding 1015 clinician-model scoring events. These assessments revealed recurring limitations, including off-target answers, vague reasoning, and inconsistent pathology terminology.

中文标题/摘要

标题：MedFoundationHub：一种轻量级安全的医疗视觉语言基础模型部署工具包

近期医疗视觉语言模型（VLMs）的进展为临床应用如自动报告生成、医生副驾和不确定性量化带来了巨大机会。然而，尽管这些模型具有潜力，但它们也带来了严重的安全问题，尤其是保护健康信息（PHI）暴露、数据泄露和网络威胁风险，特别是在医院环境中尤为重要。即使用于研究或非临床目的，医疗保健组织也必须谨慎并采取保护措施。为应对这些挑战，我们提出了MedFoundationHub，这是一种图形用户界面（GUI）工具包，旨在：（1）使医生能够无需编程知识即可手动选择和使用不同的模型，（2）支持工程师以插拔式方式高效部署医疗VLMs，无缝集成Hugging Face开源模型，（3）通过Docker编排实现操作系统无关的部署，确保隐私保护推理。MedFoundationHub仅需配备单个NVIDIA A6000 GPU的离线本地工作站，使其在学术研究实验室的典型资源范围内既安全又易于访问。为了评估当前能力，我们邀请了认证病理学家部署并评估了五种最先进的VLMs（Google-MedGemma3-4B、Qwen2-VL-7B-Instruct、Qwen2.5-VL-7B-Instruct和LLaVA-1.5-7B/13B）。专家评估涵盖了结肠病例和肾病例，产生了1015个临床模型评分事件。这些评估揭示了反复出现的局限性，包括偏离目标的答案、模糊的推理和不一致的病理术语。

Summary / 总结

MedFoundationHub is a GUI toolkit designed to facilitate the deployment of medical vision-language models (VLMs) for clinical applications, addressing security concerns such as PHI exposure and data leakage. It enables non-programming users to select and use models, supports efficient deployment, and ensures privacy through Docker-orchestrated, OS-agnostic deployment. Evaluations by pathologists on five state-of-the-art VLMs showed recurring issues like off-target answers and inconsistent terminology, highlighting limitations in model accuracy and reliability.

MedFoundationHub 是一个图形用户界面工具包，旨在促进医疗视觉语言模型（VLMs）在临床应用中的部署，解决诸如 PHI 暴露和数据泄漏等安全问题。它使非编程用户能够选择和使用模型，支持高效的部署，并通过 Docker-orchestrated、操作系统无关的部署确保隐私。病理学家对五个最先进的 VLMs 的评估显示，存在诸如偏离目标答案和术语不一致等反复出现的问题，突显了模型准确性和可靠性的局限性。

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Authors: Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang

First: 2025-08-28T00:07:10+00:00 · Latest: 2025-08-28T00:07:10+00:00

Comments: 54 pages

Abstract

As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of "jailbreaks" into diagnostics, in a component named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We have empirically validated the effectiveness of GUARD on seven LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.

Summary / 总结

GUARD is a testing method designed to operationalize ethics guidelines into specific guideline-violating questions for Large Language Models (LLMs). It uses automated generation of such questions based on government-issued guidelines to test LLM compliance and reports inconsistencies. GUARD-JD, a jailbreak diagnostic component, creates scenarios to provoke unethical responses, identifying potential bypasses of safety mechanisms. Empirical validation on seven LLMs under three government guidelines and jailbreak diagnostics shows GUARD's effectiveness in promoting trustworthy AI.

GUARD 是一种测试方法，旨在将伦理准则具体化为特定问题以验证大型语言模型（LLMs）的合规性。它使用自动化生成违背准则的问题，并结合越狱诊断（GUARD-JD）来识别可能绕过安全机制的潜在场景。GUARD 在七种 LLMs 上进行了实证验证，包括 Vicuna-13B、LongChat-7B 和 Llama2-7B，并在三种政府发布的准则下进行了测试，展示了其在促进可靠的基于 LLM 的应用中的作用。

Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation

Authors: Zhixiang Chi, Yanan Wu, Li Gu, Huan Liu, Ziqiang Wang, Yang Zhang, Yang Wang, Konstantinos N. Plataniotis

Venue: ICCV 2025

First: 2025-08-27T20:47:03+00:00 · Latest: 2025-08-27T20:47:03+00:00

Comments: ICCV 2025, code: https://github.com/chi-chi-zx/FSA

Abstract

CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention. However, this coherence is not consistently propagated to the final output because of subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations; this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.

中文标题/摘要

标题：插件反馈自适应注意力在CLIP中的应用以实现无需训练的开放词汇分割

CLIP 在视觉-文本对齐方面表现出色，但在开放词汇分割方面由于定位能力差而遇到困难。先前的方法通过修改中间注意力来增强空间一致性，但由于后续操作如投影，这种一致性不能稳定地传递到最终输出。此外，中间注意力与文本表示缺乏直接交互，这种语义差异限制了CLIP的全部潜力。在本文中，我们提出了一种无需训练、反馈驱动的自适应框架，将基于输出的块级对应关系反向调整到中间注意力。输出预测是模型处理的综合结果，包含了每个块最全面的视觉和文本语义。我们的方法通过利用模型输出作为更强的空间一致性先验，增强内部表示和最终预测之间的语义一致性。我们设计了关键模块，包括注意力隔离、基于置信度的稀疏适应修剪和适应集成，以有效反馈输出的一致性线索。我们的方法作为插件模块，无缝集成到四种最先进的方法中，使用三种骨干网络（ViT-B、ViT-L、ViT-H）。我们进一步在多种注意力类型（Q-K、self-self，以及结合MAE、SAM和DINO增强的Proxy）上验证了我们的框架。我们的方法在八个基准上一致地提升了这些方法的性能。

Summary / 总结

This work addresses the challenge of open-vocabulary segmentation by proposing a training-free framework that enhances spatial coherence in CLIP models. The method adapts output-based patch-level correspondences back to intermediate attention, leveraging the model's final predictions to improve semantic consistency. Experiments show consistent performance improvements across various attention types and benchmarks when integrated into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H).

该研究通过提出一种无需训练的框架，增强CLIP在开放词汇分割中的空间一致性，该框架通过反馈驱动的自适应注意力机制，将输出级的补丁级对应关系反馈到中间注意力中，从而提高语义一致性。实验结果显示，该方法在四种最先进的方法和三种骨干模型上，多种注意力类型和多个基准测试中均表现出一致的性能提升。

A Novel Framework for Automated Explain Vision Model Using Vision-Language Models

Authors: Phu-Vinh Nguyen, Tan-Hanh Pham, Chris Ngo, Truong Son Hy

First: 2025-08-27T19:16:40+00:00 · Latest: 2025-08-27T19:16:40+00:00

Abstract

The development of many vision models mainly focuses on improving their performance using metrics such as accuracy, IoU, and mAP, with less attention to explainability due to the complexity of applying xAI methods to provide a meaningful explanation of trained models. Although many existing xAI methods aim to explain vision models sample-by-sample, methods explaining the general behavior of vision models, which can only be captured after running on a large dataset, are still underexplored. Furthermore, understanding the behavior of vision models on general images can be very important to prevent biased judgments and help identify the model's trends and patterns. With the application of Vision-Language Models, this paper proposes a pipeline to explain vision models at both the sample and dataset levels. The proposed pipeline can be used to discover failure cases and gain insights into vision models with minimal effort, thereby integrating vision model development with xAI analysis to advance image analysis.

中文标题/摘要

标题：一种利用视觉语言模型自动解释视觉模型的新框架

许多视觉模型的发展主要集中在使用准确率、IoU和mAP等指标提高性能上，而较少关注可解释性，因为将xAI方法应用于提供有意义的解释非常复杂。尽管许多现有的xAI方法旨在逐样本解释视觉模型，但只能在运行大量数据集后捕捉到的视觉模型的总体行为解释方法仍然未被充分探索。此外，理解视觉模型在一般图像上的行为对于防止偏见判断和帮助识别模型的趋势和模式非常重要。借助视觉语言模型的应用，本文提出了一种管道，可以在样本和数据集两个级别上解释视觉模型。所提出的管道可以用于发现失败案例并以最小的努力获得对视觉模型的见解，从而将视觉模型开发与xAI分析结合起来，推动图像分析的发展。

Summary / 总结

This paper addresses the limited explainability of vision models by proposing a pipeline that uses Vision-Language Models to explain both individual samples and dataset-level behavior. The pipeline is intended to surface failure cases and reveal a model's trends and patterns with minimal effort. The authors position it as a way to integrate xAI analysis into vision model development.

本文提出了一种新的框架，利用视觉语言模型在样本和数据集两个层面解释视觉模型，以解决视觉模型缺乏可解释性的问题。该管道旨在以最小的努力发现失败案例，并揭示模型的趋势和模式。作者将其定位为把 xAI 分析融入视觉模型开发的一种途径。

OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations

Authors: Peng-Hao Hsu, Ke Zhang, Fu-En Wang, Tao Tu, Ming-Feng Li, Yu-Lun Liu, Albert Y. C. Chen, Min Sun, Cheng-Hao Kuo

First: 2025-08-27T17:17:00+00:00 · Latest: 2025-08-27T17:17:00+00:00

Comments: ICCV 2025

Abstract

Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET where posed RGB-D images are given but no human annotations of 3D boxes or classes are available. We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo-boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from 2D segments associated with each coherent 3D structure to align with the corresponding voxel feature. Training a highly accurate single-stage detector requires both losses to be learned toward high-quality targets. At inference, OpenM3D, a highly efficient detector, requires only multi-view images for input and demonstrates superior accuracy and speed (0.3 sec. per scene) on ScanNet200 and ARKitScenes indoor benchmarks compared to existing methods. We outperform a strong two-stage method that leverages our class-agnostic detector with a ViT CLIP-based OV classifier and a baseline incorporating a multi-view depth estimator on both accuracy and speed.

Summary / 总结

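OpenM3D is an open-vocabulary, multi-view indoor 3D object detector trained without human annotations. It is a single-stage detector that adapts 2D-induced voxel features from ImGeoNet and is trained jointly with two losses: a class-agnostic 3D localization loss on pseudo boxes, which a graph-embedding method builds by combining 2D segments into coherent 3D structures, and a voxel-semantic alignment loss on diverse CLIP features sampled from those segments. At inference it needs only multi-view images and reports higher accuracy and speed (0.3 seconds per scene) than existing methods on ScanNet200 and ARKitScenes, including a strong two-stage baseline.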

OpenM3D 是一种无需人工标注的开放词汇多视图室内 3D 目标检测器，采用单阶段方法并使用 2D 引导的体素特征。该方法联合训练类别无关的 3D 定位损失和体素语义对齐损失：前者依赖由图嵌入技术将 2D 分割片段组合而成的高质量 3D 伪框，后者依赖多样的预训练 CLIP 特征。在 ScanNet200 和 ARKitScenes 基准测试中，OpenM3D 在推理时仅需输入多视图图像，每个场景处理时间为 0.3 秒，在准确性和速度上均优于现有方法以及一种强大的两阶段方法。

Segmentation Assisted Incremental Test Time Adaptation in an Open World

Authors: Manogna Sreenivas, Soma Biswas

First: 2025-08-27T16:33:32+00:00 · Latest: 2025-08-27T16:33:32+00:00

Comments: Accepted at BMVC 2025

Abstract

In dynamic environments, unfamiliar objects and distribution shifts are often encountered, which challenge the generalization abilities of deployed trained models. This work addresses Incremental Test Time Adaptation (ITTA) of Vision Language Models (VLMs), tackling scenarios where unseen classes and unseen domains continuously appear during testing. Unlike traditional Test Time Adaptation (TTA) approaches, where the test stream comes only from a predefined set of classes, our framework allows models to adapt simultaneously to both covariate and label shifts, actively incorporating new classes as they emerge. Towards this goal, we establish a new benchmark for ITTA, integrating single-image TTA methods for VLMs with active labeling techniques that query an oracle for samples potentially representing unseen classes during test time. We propose a segmentation-assisted active labeling module, termed SegAssist, which is training-free and repurposes the segmentation capabilities of VLMs to refine active sample selection, prioritizing samples likely to belong to unseen classes. Extensive experiments on several benchmark datasets demonstrate the potential of SegAssist to enhance the performance of VLMs in real-world scenarios, where continuous adaptation to emerging data is essential. Project page: https://manogna-s.github.io/segassist/

中文标题/摘要

标题：开放世界中的分割辅助增量测试时适应

在动态环境中，经常遇到不熟悉的对象和分布变化，这挑战了已部署训练模型的泛化能力。本研究针对视觉语言模型的增量测试时适应问题，处理测试过程中不断出现的未见类别和未见领域的情况。与传统的测试时适应方法不同，后者仅从预定义的类别集中获取测试流，我们的框架允许模型同时适应协变量和标签的变化，积极地将新类别纳入其中。为此，我们为增量测试时适应建立了新的基准，将单张图像的测试时适应方法与主动标注技术结合，后者在测试时向 oracle 查询可能代表未见类别的样本。我们提出了一种分割辅助主动标注模块，称为SegAssist，该模块无需训练，并利用视觉语言模型的分割能力来精炼主动样本选择，优先选择可能属于未见类别的样本。在多个基准数据集上的广泛实验表明，SegAssist能够增强视觉语言模型在现实世界场景中的性能，其中持续适应新兴数据至关重要。项目页面：https://manogna-s.github.io/segassist/

Summary / 总结

This work addresses Incremental Test Time Adaptation (ITTA) for Vision Language Models (VLMs) in dynamic environments where unseen classes and domains continuously appear. It establishes an ITTA benchmark that combines single-image TTA methods with active labeling, and introduces SegAssist, a training-free, segmentation-assisted active labeling module that repurposes the segmentation capabilities of VLMs to decide which samples to send to an oracle, prioritizing those likely to belong to unseen classes. Experiments on several benchmark datasets indicate that SegAssist can improve VLM performance in real-world scenarios that require continuous adaptation to emerging data.

该研究针对动态环境中持续出现的未见类别和领域，解决Vision Language Models的增量测试时适应问题。引入了SegAssist模块，该模块利用分割能力帮助模型适应新类别而无需重新训练。实验表明，SegAssist在需要持续适应新数据的现实场景中提高了VLM的性能。

SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control

Authors: Quanfeng Lu, Zhantao Ma, Shuai Zhong, Jin Wang, Dahai Yu, Michael K. Ng, Ping Luo

First: 2025-08-27T16:27:19+00:00 · Latest: 2025-08-27T16:27:19+00:00

Comments: 28 pages, 12 figures

Abstract

The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.

中文标题/摘要

标题：SWIRL：移动GUI控制中交错强化学习的分阶段工作流

大型视觉语言模型（LVLM）和代理系统的迅速发展引发了对能够可靠地将自然语言转换为界面操作的移动GUI代理的兴趣。然而，现有的单智能体方法仍然受到结构限制的限制。尽管多智能体系统自然地解耦了不同的能力，但最近多智能体强化学习（MARL）的进步往往受到效率低下的阻碍，并且与当前的LVLM架构不兼容。为了解决这些挑战，我们提出了SWIRL，这是一种为多智能体系统设计的交错强化学习的分阶段工作流。SWIRL将MARL重新表述为一系列单智能体强化学习任务，每次更新一个智能体，同时保持其他智能体不变。这种表述形式使训练更加稳定，并促进了智能体之间的高效协调。理论上，我们提供了逐步安全界、跨轮次单调改进定理以及回报收敛保证，确保了稳健和原则性的优化。在移动GUI控制的应用中，SWIRL实现了一个导航器，将语言和屏幕上下文转换为结构化计划，以及一个执行器，将这些计划转化为可执行的原子动作。广泛的实验表明，SWIRL在高阶和低阶GUI基准测试中均表现出色。超越GUI任务，SWIRL还在多智能体数学推理方面展示了强大的能力，突显了其作为开发高效和稳健的多智能体系统的一般框架的潜力。

Summary / 总结

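SWIRL is a staged workflow for interleaved reinforcement learning in multi-agent systems, aimed at the structural limits of single-agent mobile GUI agents and the inefficiency of multi-agent reinforcement learning (MARL). It reformulates MARL as a sequence of single-agent reinforcement learning tasks, updating one agent at a time while the others stay fixed, which enables stable training and efficient coordination. The authors provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return. For mobile GUI control, SWIRL pairs a Navigator that turns language and screen context into structured plans with an Interactor that grounds those plans into executable atomic actions. Experiments report superior performance on high-level and low-level GUI benchmarks and strong results in multi-agent mathematical reasoning.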

SWIRL 是一个多代理系统中交错强化学习的阶段化工作流，旨在解决单代理方法的局限性和多代理强化学习（MARL）的低效问题。通过将 MARL 重新表述为一系列单代理任务，SWIRL 实现了稳定的训练和高效的代理间协调。理论保证包括逐步安全性边界、跨轮次单调改进和回报收敛。实验表明，SWIRL 在高阶和低阶 GUI 基准测试中均表现出色，并在多代理数学推理方面表现出强大能力，表明其作为多代理系统通用框架的潜力。
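
The interleaved update idea described for SWIRL (update one agent at a time while the others stay fixed) can be illustrated with a toy sketch. This is not the authors' algorithm or code: the abstract gives no implementation details, so everything here is an assumption made for illustration. Each "agent" is reduced to a single scalar parameter, the shared return `toy_joint_return` is a made-up concave function, and the update is a finite-difference gradient step rather than a reinforcement learning update. The names `interleaved_updates`, `toy_joint_return`, `navigator`, and `interactor` are hypothetical.

```python
from typing import Callable, List


def toy_joint_return(params: List[float]) -> float:
    """Hypothetical shared return for two coupled agents (concave quadratic)."""
    navigator, interactor = params
    return -(navigator - 1.0) ** 2 - (interactor + 2.0) ** 2 - 0.5 * navigator * interactor


def interleaved_updates(
    params: List[float],
    joint_return: Callable[[List[float]], float],
    rounds: int = 200,
    step: float = 0.1,
    eps: float = 1e-4,
) -> List[float]:
    """Round-robin training: in each stage one agent moves, all others are frozen."""
    params = list(params)
    for _ in range(rounds):
        for i in range(len(params)):  # index of the agent trained in this stage
            base = joint_return(params)
            probe = list(params)
            probe[i] += eps
            # Finite-difference gradient with respect to agent i only.
            grad_i = (joint_return(probe) - base) / eps
            candidate = list(params)
            candidate[i] += step * grad_i
            # Keep the step only if the shared return does not get worse.
            if joint_return(candidate) >= base:
                params = candidate
    return params


if __name__ == "__main__":
    final = interleaved_updates([0.0, 0.0], toy_joint_return)
    print(final, toy_joint_return(final))
```

Because every stage changes a single parameter, the sketch only ever solves a single-agent problem, which is the structural point the abstract makes. The accept-if-not-worse check is a simplification added here to keep the toy return from decreasing; it should not be read as the paper's monotonic improvement theorem.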