arXiv Paper Digest

OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

Authors: Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, Xinglong Wu

First: 2025-08-28T17:59:46+00:00 · Latest: 2025-08-28T17:59:46+00:00

Comments: project url: https://one-reward.github.io

Abstract

In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances the model's generative capabilities across multiple tasks under different evaluation criteria using only one reward model. By employing a single vision-language model (VLM) as the generative reward model, which can distinguish the winner and loser for a given task and a given evaluation criterion, it can be effectively applied to multi-task generation models, particularly in contexts with varied data and diverse task objectives. We utilize OneReward for mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, involving a binary mask as the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io

中文标题/摘要

标题：OneReward：统一的多任务人类偏好学习引导图像生成

在本文中，我们介绍了OneReward，这是一种统一的强化学习框架，仅使用一个‘单一奖励’模型即可在多种任务和不同评估标准下增强模型的生成能力。通过使用单一的视觉语言模型（VLM）作为生成奖励模型，该模型能够区分给定任务和评估标准下的胜者和败者，因此可以有效地应用于多任务生成模型，特别是在数据多样和任务目标多样的情况下。我们使用OneReward进行掩码引导的图像生成，这可以进一步细分为图像填充、图像扩展、对象移除和文本渲染等子任务，涉及一个二元掩码作为编辑区域。尽管这些特定领域的任务共享相同的条件范式，但它们在底层数据分布和评估指标上存在显著差异。现有方法通常依赖于特定任务的监督微调（SFT），这限制了泛化能力和训练效率。基于OneReward，我们开发了Seedream 3.0 Fill，这是一种通过直接在预训练基模型上进行多任务强化学习训练的掩码引导生成模型，消除了对特定任务SFT的需求。实验结果表明，我们的统一编辑模型在多个评估维度上均优于商业和开源竞争对手，如Ideogram、Adobe Photoshop和FLUX Fill [Pro]。代码和模型可在：https://one-reward.github.io 获取。

Summary / 总结

OneReward is a unified reinforcement learning framework that improves generative capabilities across multiple tasks using a single reward model. It employs a vision-language model to distinguish task winners and losers, enabling multi-task generation in varied contexts. The model, Seedream 3.0 Fill, trained via multi-task reinforcement learning, outperforms commercial and open-source competitors in mask-guided image generation tasks such as image fill, object removal, and text rendering, across multiple evaluation dimensions.

OneReward 是一个统一的强化学习框架，使用单一奖励模型提升多任务生成能力。它通过视觉语言模型区分任务优胜者和失败者，适用于多种上下文的多任务生成。通过多任务强化学习训练的模型 Seedream 3.0 Fill 在图像填充、对象移除和文本渲染等任务中表现出色，优于商业和开源竞争对手，涵盖了多个评估维度。

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Authors: Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie

First: 2025-08-28T17:50:58+00:00 · Latest: 2025-08-28T17:50:58+00:00

Comments: 23 pages, 8 figures, Project Page: https://jiutian-vl.github.io/CogVLA-page

Abstract

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
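The two routing stages above are named after FiLM (feature-wise linear modulation), which scales and shifts each feature channel with parameters predicted from a conditioning signal. The sketch below shows only that basic operation in plain Python. The linear predictors, the shapes, and all numeric values are hypothetical illustrations and are not taken from CogVLA.

```python
def linear(weights, bias, x):
    """Dense layer: weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def film(features, instruction, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM per token: out[c] = gamma[c] * token[c] + beta[c]."""
    gamma = linear(w_gamma, b_gamma, instruction)  # per-channel scale
    beta = linear(w_beta, b_beta, instruction)     # per-channel shift
    return [[g * f + b for g, f, b in zip(gamma, token, beta)] for token in features]

# Hypothetical toy values: 2 visual tokens with 2 channels, 2-d instruction embedding.
tokens = [[1.0, 2.0], [3.0, 4.0]]
instr = [1.0, 0.0]
w_gamma = [[2.0, 0.0], [0.0, 1.0]]
b_gamma = [0.0, 1.0]                 # gamma evaluates to [2.0, 1.0]
w_beta = [[0.5, 0.0], [0.0, 0.0]]
b_beta = [0.0, 0.0]                  # beta evaluates to [0.5, 0.0]
modulated = film(tokens, instr, w_gamma, b_gamma, w_beta, b_beta)
```

Because gamma and beta depend only on the instruction, the same visual tokens are modulated differently for different instructions, which is what makes the resulting representation instruction-aware.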

MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs

Authors: Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, Shanghang Zhang

First: 2025-08-28T17:50:03+00:00 · Latest: 2025-08-28T17:50:03+00:00

Comments: 10 pages, 3 figures

Abstract

Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency by visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance, while effectively reducing 75% visual tokens and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.

Summary / 总结

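MMG-Vid is a training-free visual token pruning framework that improves the efficiency of Video Large Language Models (VLLMs) by maximizing marginal gains at both the segment level and the token level. It divides the video into segments based on frame similarity and dynamically allocates the token budget to maximize the marginal gain of each segment. It then applies a temporal-guided DPC algorithm that models inter-frame uniqueness and intra-frame diversity to maximize the marginal gain of each token. Experiments show that MMG-Vid maintains over 99.5% of the original performance while removing 75% of visual tokens and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B.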

MMG-Vid 是一种无需训练的视觉标记剪枝框架，通过在段级和标记级最大化边际收益来提高视频大型语言模型（VLLMs）的效率。它根据帧相似性将视频划分为段，并动态分配标记预算以最大化每个段的边际收益。此外，它使用时间引导的DPC算法来建模帧间唯一性和帧内多样性，进一步最大化每个标记的边际收益。实验表明，MMG-Vid 可以保持超过 99.5% 的原始性能，同时减少 75% 的视觉标记，并将预填充阶段加速 3.9 倍，适用于 LLaVA-OneVision-7B。
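The segment-level stage described above can be pictured with a small sketch: split the frame sequence wherever consecutive frames stop being similar, then hand out the token budget greedily by marginal gain. This is an illustration under stated assumptions, not MMG-Vid's algorithm. The cosine threshold, the gain model `weight / (tokens + 1)`, and the toy frame features are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def split_segments(frames, threshold):
    """Start a new segment whenever consecutive frames are less similar than threshold."""
    segments = [[0]]
    for i in range(1, len(frames)):
        if cosine(frames[i - 1], frames[i]) < threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments

def allocate_budget(weights, budget):
    """Greedy allocation: each token goes to the segment with the largest marginal gain.
    The gain weight / (tokens + 1) is a hypothetical diminishing-returns proxy."""
    alloc = [0] * len(weights)
    for _ in range(budget):
        best = max(range(len(weights)), key=lambda s: weights[s] / (alloc[s] + 1))
        alloc[best] += 1
    return alloc

# Hypothetical 2-d frame features: two near-duplicate pairs.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segments = split_segments(frames, threshold=0.8)
budget = allocate_budget([len(s) for s in segments], budget=4)
```

With a diminishing-returns gain, the greedy loop spreads tokens across segments instead of spending the whole budget on one of them.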

Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets

Authors: Dale Decatur, Thibault Groueix, Wang Yifan, Rana Hanocka, Vladimir Kim, Matheus Gadelha

Venue: ICCV 2025

First: 2025-08-28T17:35:03+00:00 · Latest: 2025-08-28T17:35:03+00:00

Comments: ICCV 2025. Project page: https://ddecatur.github.io/hierarchical-diffusion/

Abstract

Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free approach that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip's text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: https://ddecatur.github.io/hierarchical-diffusion/

中文标题/摘要

标题：文本到图像扩散中的计算复用以高效生成图像集

文本到图像的扩散模型能够生成高质量的图像，但计算成本高昂。虽然先前的工作优化了每次推理的效率，我们探索了一种不同的方法：减少相关提示之间的冗余。我们的方法利用了扩散模型从粗到细的特性，在早期去噪步骤中捕获相似提示之间的共享结构。我们提出了一种无需训练的方法，根据语义相似性对提示进行聚类，并在早期扩散步骤中共享计算。实验表明，对于基于图像嵌入训练的模型，我们的方法显著降低了计算成本并提高了图像质量。通过利用UnClip的文本到图像先验，我们增强了扩散步骤的分配，以提高效率。我们的方法可以无缝集成到现有的管道中，适用于不同的提示集，并减少了大规模文本到图像生成的环境和财务负担。项目页面：https://ddecatur.github.io/hierarchical-diffusion/

Summary / 总结

This paper addresses the computational cost of text-to-image diffusion models with a training-free method that reduces redundancy across correlated prompts. The approach clusters prompts by semantic similarity and exploits the coarse-to-fine nature of diffusion models to share computation in the early denoising steps. For models trained conditioned on image embeddings, experiments show that it significantly reduces compute cost while improving image quality, and UnClip's text-to-image prior is used to improve diffusion step allocation. The method integrates with existing pipelines and scales with prompt sets, reducing the environmental and financial burden of large-scale text-to-image generation.

本文通过减少相似提示之间的冗余来解决文本到图像扩散模型的计算效率问题，提出了一种方法。该方法利用扩散模型的粗到细特性，在早期步骤中共享计算。实验表明，这种方法在使用UnClip的文本到图像先验时，能够显著降低计算成本并提高图像质量。该方法能够无缝集成到现有管道中，随着提示集的增加而扩展，从而减少大规模文本到图像生成的环境和财务负担。
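A rough way to see where the savings come from is to count denoiser evaluations. The sketch below assumes a single level of sharing, in which each cluster runs its first `shared_steps` once and every prompt then finishes the remaining steps independently. The cluster sizes and step counts are hypothetical, and the paper's actual step allocation may differ.

```python
def denoiser_calls(cluster_sizes, total_steps, shared_steps):
    """Return (independent, shared) denoiser evaluation counts.

    independent: every prompt runs all steps on its own.
    shared: each cluster runs the first shared_steps once, then each of its
            prompts runs the remaining steps separately.
    """
    independent = sum(cluster_sizes) * total_steps
    shared = sum(shared_steps + k * (total_steps - shared_steps) for k in cluster_sizes)
    return independent, shared

# Hypothetical setting: 10 prompts in three clusters, 50 steps, first 20 shared.
independent, shared = denoiser_calls([4, 4, 2], total_steps=50, shared_steps=20)
saving = 1 - shared / independent
```

In this toy setting the count drops from 500 to 360 evaluations, and the saving grows with larger clusters and with a longer shared prefix.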

DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes

Authors: Yajiao Xiong, Xiaoyu Zhou, Yongtao Wan, Deqing Sun, Ming-Hsuan Yang

First: 2025-08-28T16:22:54+00:00 · Latest: 2025-08-28T16:22:54+00:00

Abstract

We present DrivingGaussian++, an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes. DrivingGaussian++ models the static background using incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. DrivingGaussian++ supports training-free controllable editing for dynamic driving scenes, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. By integrating large language models (LLMs) and controllable editing, our method can automatically generate dynamic object motion trajectories and enhance their realism during the optimization process. DrivingGaussian++ demonstrates consistent and realistic editing results and generates dynamic multi-view driving scenarios, while significantly enhancing scene diversity. More results and code can be found at the project site: https://xiong-creator.github.io/DrivingGaussian_plus.github.io

中文标题/摘要

标题：DrivingGaussian++：面向现实重建和可编辑模拟的周围动态驾驶场景

我们提出DrivingGaussian++，一种高效且有效的框架，用于现实重建和可控编辑周围动态自主驾驶场景。DrivingGaussian++使用增量3D高斯模型静态背景，并用复合动态高斯图重建移动对象，确保准确的位置和遮挡。通过整合LiDAR先验，它实现了详细的且一致的场景重建，优于现有方法在动态场景重建和照片写实的全景视图合成方面的表现。DrivingGaussian++支持无需训练的动态驾驶场景可控编辑，包括纹理修改、天气模拟和对象操作，利用多视图图像和深度先验。通过整合大型语言模型（LLMs）和可控编辑，我们的方法可以在优化过程中自动生成动态对象运动轨迹并增强其现实感。DrivingGaussian++展示了持续且现实的编辑结果，并生成动态多视图驾驶场景，显著增强场景多样性。更多结果和代码可在项目站点找到：https://xiong-creator.github.io/DrivingGaussian_plus.github.io

Summary / 总结

DrivingGaussian++ is a framework designed for realistic reconstruction and controllable editing of dynamic driving scenes. It models static backgrounds using incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. The method integrates a LiDAR prior to achieve detailed and consistent scene reconstruction, outperforming existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. It supports training-free controllable editing, including texture modification, weather simulation, and object manipulation, by leveraging multi-view images and depth priors. The framework can automatically generate dynamic object motion trajectories and enhance their realism during optimization, demonstrating consistent and realistic editing results and generating dynamic multi-view driving scenarios with enhanced scene diversity.

DrivingGaussian++ 是一个用于动态自主驾驶场景的现实重建和可控编辑框架。它使用增量 3D 高斯模型来表示静态背景，并使用复合动态高斯图来重建移动对象，确保准确的位置和遮挡。通过集成 LiDAR 先验，它实现了详细且一致的场景重建，优于现有方法在动态场景重建和高保真全景合成方面的表现。该方法支持无需训练的可控编辑，包括纹理修改、天气模拟和对象操作，通过利用多视图图像和深度先验。它可以在优化过程中自动生成动态对象运动轨迹并增强其现实性，展示出一致且现实的编辑结果，并生成动态多视图驾驶场景，显著增强了场景多样性。

Understanding and evaluating computer vision models through the lens of counterfactuals

Authors: Pushkar Shukla

First: 2025-08-28T15:11:49+00:00 · Latest: 2025-08-28T15:11:49+00:00

Abstract

Counterfactual reasoning -- the practice of asking "what if" by varying inputs and observing changes in model behavior -- has become central to interpretable and fair AI. This thesis develops frameworks that use counterfactuals to explain, audit, and mitigate bias in vision classifiers and generative models. By systematically altering semantically meaningful attributes while holding others fixed, these methods uncover spurious correlations, probe causal dependencies, and help build more robust systems. The first part addresses vision classifiers. CAVLI integrates attribution (LIME) with concept-level analysis (TCAV) to quantify how strongly decisions rely on human-interpretable concepts. With localized heatmaps and a Concept Dependency Score, CAVLI shows when models depend on irrelevant cues like backgrounds. Extending this, ASAC introduces adversarial counterfactuals that perturb protected attributes while preserving semantics. Through curriculum learning, ASAC fine-tunes biased models for improved fairness and accuracy while avoiding stereotype-laden artifacts. The second part targets generative Text-to-Image (TTI) models. TIBET provides a scalable pipeline for evaluating prompt-sensitive biases by varying identity-related terms, enabling causal auditing of how race, gender, and age affect image generation. To capture interactions, BiasConnect builds causal graphs diagnosing intersectional biases. Finally, InterMit offers a modular, training-free algorithm that mitigates intersectional bias via causal sensitivity scores and user-defined fairness goals. Together, these contributions show counterfactuals as a unifying lens for interpretability, fairness, and causality in both discriminative and generative models, establishing principled, scalable methods for socially responsible bias evaluation and mitigation.

Summary / 总结

This thesis develops frameworks using counterfactual reasoning to explain, audit, and mitigate bias in vision classifiers and generative models. CAVLI integrates attribution and concept-level analysis to quantify how decisions rely on human-interpretable concepts, revealing irrelevant cues. ASAC introduces adversarial counterfactuals to fine-tune biased models for improved fairness and accuracy. TIBET and BiasConnect provide scalable methods for evaluating and diagnosing prompt-sensitive biases in generative Text-to-Image models, while InterMit mitigates intersectional bias via causal sensitivity scores. These methods collectively demonstrate counterfactuals as a unifying lens for interpretability, fairness, and causality in both discriminative and generative models.

该论文开发了使用反事实推理来解释、审计和缓解视觉分类器和生成模型中的偏见的框架。CAVLI 结合了归因和概念级分析，量化决策依赖于可解释概念的程度，揭示无关线索。ASAC 引入了对抗反事实，通过课程学习微调有偏模型，提高公平性和准确性。TIBET 和 BiasConnect 提供了评估和诊断生成文本到图像模型中提示敏感偏见的可扩展方法，而 InterMit 通过因果敏感分数和用户定义的公平目标来缓解交集偏见。这些方法共同展示了反事实作为解释性、公平性和因果关系统一视角在判别性和生成性模型中的应用。

Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu

First: 2025-03-14T15:42:42+00:00 · Latest: 2025-08-28T14:55:38+00:00

Comments: This paper is accepted by IJCAI2025 Workshop on Deepfake Detection, Localization, and Interpretability

Abstract

Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.

中文标题/摘要

标题：探索跨模态生成模型中的字体视觉提示注入威胁

当前的跨模态生成模型（GMs）在各种生成任务中表现出色。鉴于现实世界场景中视觉模态输入的普遍性和信息丰富性，包括视觉语言感知（VLP）和图像到图像（I2I）在内的跨视觉任务引起了广泛关注。大型视觉语言模型（LVLMs）和I2I生成模型（GMs）分别用于处理VLP和I2I任务。先前的研究表明，在输入图像中印刷字体文字会显著诱导LVLMs和I2I GMs生成与这些文字语义一致的破坏性输出。此外，作为更复杂的字体形式，视觉提示也被发现对跨视觉任务的各种应用构成了安全风险。然而，视觉提示所造成的威胁的具体特征仍待进一步探索。在本文中，为了全面调查字体视觉提示注入（TVPI）对各种LVLMs和I2I GMs性能影响，我们提出了字体视觉提示注入数据集，并在具有不同目标语义的视觉提示下对各种开源和闭源LVLMs和I2I GMs进行了彻底的安全风险评估，加深了对TVPI威胁的理解。

Summary / 总结

This paper investigates the security risks posed by typographic visual prompts in Cross-Modality Generation Models. It proposes a dataset to evaluate the impact of these prompts on various models, revealing that they can induce disruptive outputs aligned with the prompts. The study deepens understanding of the specific threats posed by visual prompts in cross-vision tasks.

本文研究了图文提示在跨模态生成模型中的安全威胁。提出了一个数据集来评估这些提示对各种模型的影响，发现它们会导致与提示内容一致的破坏性输出。研究加深了对这些威胁在开源和闭源模型中具体特征的理解。

Learning Primitive Embodied World Models: Towards Scalable Robotic Learning

Authors: Qiao Sun, Liujia Yang, Wei Tang, Wei Huang, Kaixin Xu, Yongchao Chen, Mingyu Liu, Jiange Yang, Haoyi Zhu, Yating Wang, Tong He, Yilun Chen, Xili Dai, Nanyang Ye, Qinying Gu

First: 2025-08-28T14:31:48+00:00 · Latest: 2025-08-28T14:31:48+00:00

Abstract

While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.

中文标题/摘要

标题：学习原始具身世界模型：迈向可扩展的机器人学习

尽管基于视频生成的具身世界模型逐渐受到关注，但它们对大规模具身交互数据的依赖仍然是一个关键瓶颈。具身数据的稀缺性、收集难度和高维度性从根本上限制了语言与动作之间的对齐精度，并加剧了长时序视频生成的挑战——阻碍生成模型在具身领域实现“GPT时刻”。一个简单的观察是：具身数据的多样性远超过可能的原始动作的小空间。基于这一洞察，我们提出了一种新的世界建模范式——原始具身世界模型（PEWM）。通过限制视频生成到固定短时序，我们的方法1) 使语言概念与机器人动作的视觉表示之间的对齐更加精细，2) 减少学习复杂性，3) 改进具身数据收集的数据效率，4) 减少推理延迟。通过配备模块化视觉-语言模型（VLM）规划器和起始-目标热图引导机制（SGG），PEWM 进一步实现了灵活的闭环控制，并支持在扩展和复杂任务中对原始级策略的组合泛化。我们的框架利用视频模型中的时空视觉先验和 VLM 的语义意识，弥合了精细物理交互与高层推理之间的差距，为可扩展、可解释和通用的具身智能铺平了道路。

Summary / 总结

This paper addresses the reliance of video-generation-based embodied world models on large-scale embodied interaction data. It introduces Primitive Embodied World Models (PEWM), which restrict video generation to fixed short horizons, enabling finer alignment between language and actions, reducing learning complexity, improving data efficiency, and decreasing inference latency. PEWM uses a modular Vision-Language Model planner and Start-Goal heatmap Guidance to support flexible closed-loop control and compositional generalization of primitive-level policies for complex tasks.

本文解决了通过视频生成生成体态世界模型的挑战，需要大量的交互数据。它提出了Primitive Embodied World Models (PEWM)，专注于短时间范围，以实现语言和动作之间的更精细对齐，减少学习复杂性，提高数据效率并降低推理延迟。PEWM 使用模块化的视觉-语言模型规划器和起始-目标热图引导，支持对复杂任务中原始级别策略的灵活闭环控制和组合泛化。

Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation

Authors: Krit Duangprom, Tryphon Lambrou, Binod Bhattarai

Venue: MICCAI 2025

First: 2025-08-28T14:25:32+00:00 · Latest: 2025-08-28T14:25:32+00:00

Comments: Accepted to MICCAI 2025

Abstract

This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision Language Models (VLMs) fine-tuned using the low-rank adaptation (LoRA) technique. Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting in small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work in 3D pose estimation of surgical hands and tools.

中文标题/摘要

标题：使用低秩适应的视觉语言模型估计手术工具的2D关键点

本文提出了一种利用视觉语言模型（VLMs）结合低秩调整（LoRA）技术进行2D关键点估计的新管道。与传统卷积神经网络（CNN）或基于变换器的方法相比，后者在小型医学数据集中经常出现过拟合问题，我们的方法利用了预训练VLMs的泛化能力。我们精心设计了提示以创建指令调优数据集，并使用它们将视觉特征与语义关键点描述对齐。实验结果表明，仅经过两轮微调，适应后的VLM就优于基线模型，证明了LoRA在资源有限场景中的有效性。该方法不仅提高了关键点检测性能，还为未来3D手术手和工具姿态估计的研究铺平了道路。
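For readers unfamiliar with LoRA, the core idea is to freeze a pre-trained weight matrix W and learn only a low-rank update, so the effective weight is W + (alpha / r) * B A with rank r. The sketch below illustrates that formula in plain Python. The matrix sizes and values are hypothetical, and it says nothing about which layers or ranks the paper uses.

```python
def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_weight(w, b, a, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)                      # A has shape r x k, so its row count is the rank
    delta = matmul(b, a)            # low-rank update with the same shape as W
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Hypothetical toy values with rank r = 1.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3x3 weight
b = [[1.0], [0.0], [2.0]]                                  # 3x1, trainable
a = [[0.5, 0.0, 1.0]]                                      # 1x3, trainable
w_eff = lora_weight(w, b, a, alpha=2.0)
trainable = len(b) * len(b[0]) + len(a) * len(a[0])        # 6 adapter values vs 9 in W
```

The parameter saving is small in this 3x3 toy case but grows quickly with layer width, which is why the technique suits small datasets and short fine-tuning runs.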

Improving Fine-Grained Control via Aggregation of Multiple Diffusion Models

Authors: Conghan Yue, Zhengwei Peng, Shiyan Du, Zhi Ji, Chuangjian Cai, Le Wan, Dongyu Zhang

First: 2024-10-02T06:16:06+00:00 · Latest: 2025-08-28T14:03:26+00:00

Abstract

While many diffusion models perform well when controlling particular aspects such as style, character, and interaction, they struggle with fine-grained control due to dataset limitations and intricate model architecture design. This paper introduces a novel training-free algorithm, independent of denoising network architectures, for fine-grained generation, called Aggregation of Multiple Diffusion Models (AMDM). The algorithm integrates features from multiple diffusion models into a specified model to activate particular features and enable fine-grained control. Experimental results demonstrate that AMDM significantly improves fine-grained control without training, validating its effectiveness. Additionally, it reveals that diffusion models initially focus on features such as position, attributes, and style, with later stages improving generation quality and consistency. AMDM offers a new perspective for tackling the challenges of fine-grained conditional generation in diffusion models. Specifically, it allows us to fully utilize existing or develop new conditional diffusion models that control specific aspects, and then aggregate them using the AMDM algorithm. This eliminates the need for constructing complex datasets, designing intricate model architectures, and incurring high training costs. Code is available at: https://github.com/Hammour-steak/AMDM.

中文标题/摘要

标题：通过多扩散模型聚合提高细粒度控制

尽管许多扩散模型在控制特定方面如风格、角色和交互时表现良好，但在细粒度控制方面由于数据集限制和复杂的模型架构设计，它们表现不佳。本文介绍了一种无需训练的新型算法，该算法独立于去噪网络架构，称为多扩散模型聚合（AMDM），用于细粒度生成。该算法将多个扩散模型的特征整合到指定模型中，以激活特定特征并实现细粒度控制。实验结果表明，AMDM在无需训练的情况下显著提高了细粒度控制能力，验证了其有效性。此外，它揭示了扩散模型最初关注位置、属性和风格等特征，后期阶段则提高生成质量和一致性。AMDM为解决扩散模型中的细粒度条件生成挑战提供了新的视角。具体而言，它允许我们充分利用现有或开发新的控制特定方面的条件扩散模型，并通过AMDM算法将它们聚合起来。这消除了构建复杂数据集、设计复杂模型架构和高训练成本的需要。代码可在：https://github.com/Hammour-steak/AMDM 获取。

Evaluating Compositional Generalisation in VLMs and Diffusion Models

Authors: Beth Pearson, Bilal Boulbarss, Michael Wray, Martha Lewis

First: 2025-08-28T13:45:04+00:00 · Latest: 2025-08-28T13:45:04+00:00

Comments: 11 pages including references, 6 figures. Accepted at IWCS 2025

Abstract

A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a 'bag-of-words' and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. In this work we explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models -- Diffusion Classifier, CLIP, and ViLT -- on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. Code and dataset are available at: https://github.com/otmive/diffusion_classifier_clip

Summary / 总结

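This work evaluates whether the generative Diffusion Classifier generalises compositionally better than discriminative models. Diffusion Classifier, CLIP, and ViLT are assessed on binding objects with attributes and relations in zero-shot (ZSL) and generalised zero-shot (GZSL) settings. Diffusion Classifier and ViLT perform well at concept binding, but all models struggle significantly with the relational GZSL task. Analysis of CLIP embeddings suggests that relational concepts such as left and right have overly similar representations.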

Occlusion Robustness of CLIP for Military Vehicle Classification

Authors: Jan Erik van Woerden, Gertjan Burghouts, Lotte Nijskens, Alma M. Liezenga, Sabina van Rooij, Frank Ruis, Hugo J. Kuijf

First: 2025-08-28T13:16:55+00:00 · Latest: 2025-08-28T13:16:55+00:00

Comments: To be presented at SPIE: Sensors + Imaging, Artificial Intelligence for Security and Defence Applications II

Abstract

Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP's robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants' robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models sharply drops at around 35% occlusion, (4) by finetuning the model's backbone, this performance drop occurs at more than 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.

中文标题/摘要

标题：CLIP在军事车辆分类中的遮挡鲁棒性

视觉-语言模型（VLMs）如CLIP通过在共享嵌入空间中对齐图像和文本实现零样本分类，为缺乏标注数据的防御应用提供了优势。然而，CLIP在具有部分遮挡和信噪比（SNR）降级的挑战性军事环境中的鲁棒性尚未得到充分探索。我们使用包含18类军事车辆的自定义数据集研究了CLIP变体对遮挡的鲁棒性，并使用归一化曲线下面积（NAUC）在不同遮挡百分比下进行评估。研究结果得出四个关键见解：（1）基于Transformer的CLIP模型始终优于CNN，（2）细粒度、分散的遮挡比大面积连续遮挡对性能影响更大，（3）尽管准确率有所提高，但在约35%遮挡时，线性探查模型的性能急剧下降，（4）通过微调模型的骨干网络，性能下降发生在超过60%遮挡时。这些结果强调了在训练过程中使用遮挡特定增强的重要性，并指出了需要进一步探索像素级敏感性和架构鲁棒性以实现CLIP在实际部署中的应用。

Summary / 总结

The study investigates the occlusion robustness of CLIP models for military vehicle classification, using a custom dataset of 18 classes. Key findings include the superior performance of transformer-based CLIP models over CNNs, the greater impact of fine-grained, dispersed occlusions compared with larger contiguous ones, and the sharp drop in performance of linear-probed models at around 35% occlusion, which backbone fine-tuning defers to more than 60% occlusion. These results highlight the need for occlusion-specific augmentations during training and further research into patch-level sensitivity and architectural resilience.

研究考察了CLIP变体在军事车辆分类中的遮挡鲁棒性，使用了包含18个类别的自定义数据集。主要发现包括基于Transformer的CLIP模型优于CNN模型，细粒度的遮挡影响更大，线性探针模型在约35%遮挡时性能急剧下降，通过进一步微调模型骨干可以在超过60%遮挡时改善性能。
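The abstract does not spell out how NAUC is normalized. One plausible reading, assumed here, is the trapezoidal area under the accuracy-versus-occlusion curve divided by the occlusion range, so that a model with perfect accuracy at every occlusion level scores 1. The occlusion levels and accuracies below are hypothetical, not results from the paper.

```python
def nauc(occlusion, accuracy):
    """Trapezoidal area under accuracy vs. occlusion, divided by the occlusion range.
    This normalization is an assumption, not the paper's stated definition."""
    area = 0.0
    for i in range(1, len(occlusion)):
        width = occlusion[i] - occlusion[i - 1]
        area += width * (accuracy[i] + accuracy[i - 1]) / 2
    return area / (occlusion[-1] - occlusion[0])

occlusion = [0.0, 0.25, 0.5, 0.75, 1.0]   # hypothetical fractions of the image hidden
accuracy = [1.0, 0.8, 0.6, 0.2, 0.0]      # hypothetical accuracy at each level
score = nauc(occlusion, accuracy)
```

A single number of this kind makes it possible to compare models whose accuracy collapses at different occlusion levels, such as the roughly 35% and 60% thresholds reported above.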

NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks

Authors: Aritra Dutta, Swapnanil Mukherjee, Deepanway Ghosal, Somak Aditya

First: 2025-08-27T09:34:28+00:00 · Latest: 2025-08-28T12:05:33+00:00

Abstract

Commonsense visual-question answering often hinges on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present an end-to-end framework (NLKI) that (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs respectively across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved using a fine-tuned ColBERTv2 and an object information-enriched prompt yield explanations that largely cut down hallucinations, while lifting the end-to-end answer accuracy by up to 7% (across 3 datasets), making FLAVA and other models in NLKI match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. As these benchmarks contain 10-25% label noise, additional finetuning using noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% in CRIC, and 5.5% in AOKVQA. Our findings expose when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models in the context of external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M models.
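The two noise-robust losses named in the abstract have simple closed forms for a single example. The sketch below follows the commonly cited definitions: generalised cross entropy is (1 - p_y^q) / q, and symmetric cross entropy adds a reverse cross-entropy term in which log(0) on the one-hot label is clipped to a constant. The default values of `q`, `alpha`, `beta`, and `log_zero`, and the example probabilities, are illustrative assumptions rather than the settings used in NLKI.

```python
import math

def generalized_ce(probs, label, q=0.7):
    """Generalised cross entropy: (1 - p_y**q) / q.
    Approaches standard CE as q -> 0 and equals 1 - p_y at q = 1."""
    return (1.0 - probs[label] ** q) / q

def symmetric_ce(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """alpha * CE + beta * reverse CE, with log(0) on the one-hot label set to log_zero."""
    ce = -math.log(probs[label])
    # Reverse CE: -sum_k p_k * log(onehot_k); the true class contributes log(1) = 0.
    rce = -sum(p * log_zero for k, p in enumerate(probs) if k != label)
    return alpha * ce + beta * rce

probs = [0.7, 0.2, 0.1]                        # hypothetical softmax output
gce = generalized_ce(probs, label=0, q=1.0)    # q = 1 reduces to 1 - p_y
sce = symmetric_ce(probs, label=0)
```

Both losses bound the penalty a confidently wrong prediction can incur, which limits how much a mislabeled example can pull on the model. That property matters when 10-25% of the labels are noisy.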

"Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection

Authors: Anastasios Skoularikis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos, Panagiotis C. Petrantonakis

First: 2025-08-28T11:22:15+00:00 · Latest: 2025-08-28T11:22:15+00:00

Abs · PDF

Abstract

Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to "in the wild" content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.

中文标题/摘要

标题："幽默、艺术还是误导信息？": 一种面向意图的合成图像检测多模态数据集

近期多模态AI技术的进步促进了对合成和脱离上下文内容检测的进展。然而，现有努力大多忽略了AI生成图像背后的意图。为填补这一空白，我们引入了S-HArM，这是一个面向意图分类的多模态数据集，包含来自Twitter/X和Reddit的9,576个“野生”图像-文本对，标记为幽默/讽刺、艺术或误导信息。此外，我们探索了三种提示策略（图像导向、描述导向和多模态导向）来构建大规模合成训练数据集，使用Stable Diffusion。我们进行了广泛的比较研究，包括模态融合、对比学习、重建网络、注意力机制和大型视觉-语言模型。我们的结果显示，基于图像和多模态导向数据训练的模型在“野生”内容上的泛化能力更强，由于保留了视觉上下文。然而，总体性能仍然有限，突显了推断意图的复杂性以及需要专门架构的需求。

Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music

Authors: Hongju Su, Ke Li, Lan Yang, Honggang Zhang, Yi-Zhe Song

First: 2025-08-28T11:15:44+00:00 · Latest: 2025-08-28T11:15:44+00:00

Comments: Under review

Abs · PDF

Abstract

Existing state-of-the-art symbolic music generation models predominantly adopt autoregressive or hierarchical autoregressive architectures, modelling symbolic music as a sequence of attribute tokens with unidirectional temporal dependencies, under the assumption of a fixed, strict dependency structure among these attributes. However, we observe that using different attributes as the initial token in these models leads to comparable performance. This suggests that the attributes of a musical note are, in essence, a concurrent and unordered set, rather than a temporally dependent sequence. Based on this insight, we introduce Amadeus, a novel symbolic music generation framework. Amadeus adopts a two-level architecture: an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. To enhance performance, we propose Music Latent Space Discriminability Enhancement Strategy (MLSDES), incorporating contrastive learning constraints that amplify discriminability of intermediate music representations. The Conditional Information Enhancement Module (CIEM) simultaneously strengthens note latent vector representation via attention mechanisms, enabling more precise note decoding. We conduct extensive experiments on unconditional and text-conditioned generation tasks. Amadeus significantly outperforms SOTA models across multiple metrics while achieving at least 4$\times$ speed-up. Furthermore, we demonstrate training-free, fine-grained note attribute control feasibility using our model. To explore the upper performance bound of the Amadeus architecture, we compile the largest open-source symbolic music dataset to date, AMD (Amadeus MIDI Dataset), supporting both pre-training and fine-tuning.

中文标题/摘要

标题：阿玛迪乌斯：双向属性建模的自回归音乐生成模型

现有的最先进的符号音乐生成模型主要采用自回归或分层自回归架构，将符号音乐建模为具有单向时间依赖性的属性标记序列，假设这些属性之间存在固定的严格依赖结构。然而，我们观察到，在这些模型中使用不同的属性作为初始标记会导致相当的性能。这表明，音乐音符的属性本质上是一个并发且无序的集合，而不是一个时间依赖序列。基于这一洞察，我们引入了阿玛迪乌斯，一种新颖的符号音乐生成框架。阿玛迪乌斯采用两层架构：自回归模型用于音符序列，双向离散扩散模型用于属性。为了提高性能，我们提出了音乐潜在空间判别性增强策略(MLSDES)，结合对比学习约束以增强中间音乐表示的判别性。条件信息增强模块(CIEM)通过注意力机制同时增强音符潜在向量表示，使音符解码更加精确。我们在无条件和文本条件生成任务上进行了广泛的实验。阿玛迪乌斯在多个指标上显著优于现有最佳模型，同时实现至少4倍的速度提升。此外，我们展示了使用我们的模型实现无训练的细粒度音符属性控制的可行性。为了探索阿玛迪乌斯架构的性能上限，我们编译了迄今为止最大的开源符号音乐数据集AMD（阿玛迪乌斯MIDI数据集），支持预训练和微调。

Summary / 总结

Amadeus 是一种新颖的符号音乐生成框架，通过引入属性的双向离散扩散模型来解决现有自回归模型的局限性。它在多个指标上显著优于最先进的模型，并且至少实现了 4 倍的速度提升。关键组件包括 MLSDES 以增强可区分性，以及 CIEM 以实现更精确的音符解码。

Enhancing Document VQA Models via Retrieval-Augmented Generation

Authors: Eric López, Artemis Llabrés, Ernest Valveny

First: 2025-08-26T12:32:55+00:00 · Latest: 2025-08-28T10:31:44+00:00

Comments: Accepted at Workshop on Machine Learning in Document Analysis and Recognition (ICDAR WML 2025), Wuhan, China

Abs · PDF

Abstract

Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.

Summary / 总结

This paper explores enhancing Document VQA models using Retrieval-Augmented Generation (RAG) to address the memory challenges of processing multi-page documents. It evaluates text-based and purely visual retrieval methods across various models and benchmarks, showing that the text-centric variant improves the baseline by up to 22.5 ANLS, while the visual variant achieves a 5.0 ANLS improvement without OCR. The study confirms that retrieval and reranking are crucial for gains, while layout-guided chunking does not significantly help. These findings highlight the practical benefits of RAG for Document VQA.

本文探讨了使用检索增强生成（RAG）来增强文档VQA模型，以解决处理多页文档的内存挑战。它在多种模型和基准上评估了基于文本和纯视觉的检索方法，结果显示基于文本的变体将基线提高了最多22.5 ANLS，而纯视觉变体在无需OCR的情况下实现了5.0 ANLS的改进。研究证实，检索和重排序对于提高性能至关重要，而基于布局的分块策略在这些数据集上并未显著帮助。这些发现强调了RAG在实际文档VQA中的实用价值。
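
The gains above are reported in ANLS (Average Normalized Levenshtein Similarity), the usual Document VQA metric. A minimal sketch of how it is commonly computed is shown below; the lowercasing/stripping step and the threshold `tau=0.5` follow common practice and are assumptions here, not details taken from this paper. Scores are in [0, 1]; papers often report them multiplied by 100.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def anls(predictions, references, tau=0.5):
    """Mean over questions of the best similarity to any reference answer.

    A similarity is 1 - normalized edit distance, and is zeroed when the
    normalized distance reaches the threshold `tau`.
    """
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)

print(anls(["invoice 42", "abcd"], [["Invoice 42"], ["abcx", "zzzz"]]))
```

The thresholding means a near-miss (for example an OCR slip in one character) still earns partial credit, while an unrelated answer scores zero.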

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Authors: Haodong Duan, Xinyu Fang, Junming Yang, Xiangyu Zhao, Yuxuan Qiao, Mo Li, Amit Agarwal, Zhe Chen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Jixuan Chen, Enxin Song, Song Mao, Shengyuan Ding, Tianhao Liang, Zicheng Zhang, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen

First: 2024-07-16T13:06:15+00:00 · Latest: 2025-08-28T09:40:49+00:00

Comments: Updated on 2025.08.28, data cut down to 2025.06.30

Abs · PDF · Code1

Abstract

We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement more than 200 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 80 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released on https://github.com/open-compass/VLMEvalKit and is actively maintained.

中文标题/摘要

标题：VLMEvalKit：一个基于PyTorch的开源多模态模型评估工具包

我们介绍了VLMEvalKit：一个基于PyTorch的开源多模态模型评估工具包。该工具包旨在为研究人员和开发人员提供一个用户友好且全面的框架，用于评估现有的多模态模型并发布可重复的评估结果。在VLMEvalKit中，我们实现了超过200种不同的大型多模态模型，包括专有API和开源模型，以及超过80种不同的多模态基准。通过实现单一接口，新模型可以轻松添加到工具包中，而工具包会自动处理其余的工作负载，包括数据准备、分布式推理、预测后处理和指标计算。尽管该工具包目前主要用于评估大型视觉-语言模型，但其设计与未来可能包含其他模态（如音频和视频）的更新兼容。基于使用该工具包获得的评估结果，我们托管了OpenVLM排行榜，这是一个全面的排行榜，用于跟踪多模态学习研究的进展。该工具包发布在https://github.com/open-compass/VLMEvalKit，并且正在积极维护。

Towards Mechanistic Defenses Against Typographic Attacks in CLIP

Authors: Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek

First: 2025-08-28T09:08:30+00:00 · Latest: 2025-08-28T09:08:30+00:00

Abs · PDF

Abstract

Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.

MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning

Authors: Weihai Zhi, Jiayan Guo, Shangyang Li

First: 2025-08-28T08:41:32+00:00 · Latest: 2025-08-28T08:41:32+00:00

Comments: 8 pages, 5 figures

Abs · PDF

Abstract

The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks, while Reinforcement Learning (RL), a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To break this impasse, we introduce Generative Reward Learning for Medical Reasoning (MedGR$^2$), a novel framework that creates a self-improving virtuous cycle. MedGR$^2$ co-develops a data generator and a reward model, enabling the automated, continuous creation of high-quality, multi-modal medical data that serves as both a superior training source for SFT and RL. Our experiments demonstrate that SFT with MedGR$^2$-produced data already surpasses baselines trained on large-scale, human-curated datasets. Crucially, when leveraging this data for RL via Group Relative Policy Optimization (GRPO), our model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized RL-based methods. Furthermore, our compact model, empowered by MedGR$^2$, achieves performance competitive with foundation models possessing over 10 times more parameters. MedGR$^2$ presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking the full potential of RL for building truly generalizable medical AI.

Summary / 总结

The paper addresses the challenge of limited high-quality medical data for training Vision-Language Models (VLMs) in medicine. It introduces MedGR$^2$, a framework that uses Generative Reward Learning to create high-quality, multi-modal medical data. This data is used for both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), leading to better performance than existing methods. Specifically, MedGR$^2$-generated data improves SFT results and, when used with RL, achieves state-of-the-art cross-modality and cross-task generalization, outperforming specialized RL methods. Additionally, the compact model using MedGR$^2$ performs competitively with much larger models.

论文解决了医学领域中高质量数据稀缺的问题，提出了MedGR$^2$框架，利用生成奖励学习来生成高质量的多模态医学数据。这些数据用于监督微调（SFT）和强化学习（RL），取得了优于现有方法的效果。具体来说，MedGR$^2$生成的数据提高了SFT的效果，而在使用RL时，实现了跨模态和跨任务的最先进的泛化能力，超越了专门的RL方法。此外，使用MedGR$^2$的紧凑模型与更大规模的模型相比表现相当。
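
The core step of Group Relative Policy Optimization (GRPO) mentioned above can be sketched as follows: several responses are sampled for the same prompt, and each response's reward is normalized against its own group. This is a minimal illustration only; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions, and the reward values are hypothetical rather than taken from MedGR$^2$.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and spread of its sampling group.

    Responses that beat the group average get a positive advantage,
    the rest get a negative one; no learned value function is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one medical question
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason a reliable, discriminative reward model matters.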

Language-to-Space Programming for Training-Free 3D Visual Grounding

Authors: Boyu Mi, Hanqing Wang, Tai Wang, Yilun Chen, Jiangmiao Pang

First: 2025-02-03T14:32:36+00:00 · Latest: 2025-08-28T07:57:55+00:00

Abs · PDF

Abstract

3D visual grounding (3DVG) is challenging due to the need to understand 3D spatial relations. While supervised approaches have achieved superior performance, they are constrained by the scarcity and high annotation costs of 3D vision-language datasets. Training-free approaches based on LLMs/VLMs eliminate the need for large-scale training data, but they either incur prohibitive grounding time and token costs or have unsatisfactory accuracy. To address the challenges, we introduce a novel method for training-free 3D visual grounding, namely Language-to-Space Programming (LaSP). LaSP introduces LLM-generated codes to analyze 3D spatial relations among objects, along with a pipeline that evaluates and optimizes the codes automatically. Experimental results demonstrate that LaSP achieves 52.9% accuracy on the Nr3D benchmark, ranking among the best training-free methods. Moreover, it substantially reduces the grounding time and token costs, offering a balanced trade-off between performance and efficiency.

中文标题/摘要

标题：语言到空间编程用于无训练3D视觉定位

3D视觉定位（3DVG）由于需要理解3D空间关系而具有挑战性。虽然监督方法取得了优异的性能，但它们受限于3D视觉语言数据集的稀缺性和高标注成本。基于LLM/VLM的无训练方法消除了大规模训练数据的需求，但它们要么导致高昂的定位时间和标记成本，要么准确率不令人满意。为应对这些挑战，我们提出了一种新的无训练3D视觉定位方法，即语言到空间编程（LaSP）。LaSP引入了LLM生成的代码来分析物体之间的3D空间关系，并且包含一个自动评估和优化代码的流水线。实验结果表明，LaSP在Nr3D基准测试中达到了52.9%的准确率，排名在最好的无训练方法之中。此外，它显著减少了定位时间和标记成本，提供了性能和效率之间的平衡折衷。

Summary / 总结

The paper addresses the challenge of 3D visual grounding by introducing Language-to-Space Programming (LaSP), which uses LLM-generated codes to analyze 3D spatial relations among objects. The method evaluates and optimizes these codes automatically, leading to a 52.9% accuracy on the Nr3D benchmark and significantly reducing grounding time and token costs compared to other training-free approaches.

该论文通过引入Language-to-Space Programming (LaSP) 方法解决了3D视觉定位的挑战，该方法使用LLM生成的代码来分析3D空间关系，并自动评估和优化这些代码。LaSP在Nr3D基准测试中实现了52.9%的准确率，同时减少了定位时间和标记成本，相比其他无训练方法具有更好的性能和效率平衡。
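
To make the idea of code that analyzes 3D spatial relations concrete, here is a hypothetical example of the kind of small program an LLM might emit for a query such as "the chair closest to the table". It is an illustrative sketch only: `SceneObject`, `closeness`, and `ground_closest` are invented names, and the actual code generated and optimized by LaSP may look quite different.

```python
import math
from dataclasses import dataclass

@dataclass
class SceneObject:
    label: str
    center: tuple  # (x, y, z) in hypothetical scene coordinates

def closeness(a, b):
    """Soft score in (0, 1]: 1 when centers coincide, lower when far apart."""
    return 1.0 / (1.0 + math.dist(a.center, b.center))

def ground_closest(objects, target_label, anchor_label):
    """Return the target-class object that best satisfies 'closest to anchor'."""
    anchors = [o for o in objects if o.label == anchor_label]
    candidates = [o for o in objects if o.label == target_label]
    if not anchors or not candidates:
        return None
    return max(candidates, key=lambda c: max(closeness(c, a) for a in anchors))

scene = [
    SceneObject("chair", (0.5, 0.0, 0.0)),
    SceneObject("chair", (4.0, 1.0, 0.0)),
    SceneObject("table", (1.0, 0.0, 0.0)),
]
print(ground_closest(scene, "chair", "table"))
```

Because the relation is evaluated by ordinary code rather than by repeated model calls, grounding a query of this kind costs no additional tokens at inference time, which is consistent with the efficiency argument in the abstract.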

SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

Authors: Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li

Venue: EMNLP 2025

First: 2025-08-22T08:23:09+00:00 · Latest: 2025-08-28T06:44:28+00:00

Comments: Accepted at EMNLP 2025 Main

Abs · PDF · Code1

Abstract

Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model's speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.

Summary / 总结

SpecVLM is a speculative decoding framework for video large language models (Vid-LLMs) that reduces video token representations by up to 90% through a two-stage pruning process, enabling efficient speculation without accuracy loss. This method leverages the low sensitivity of draft models to token pruning and achieves up to 2.68x and 2.11x decoding speedup for LLaVA-OneVision-72B and Qwen2.5-VL-32B, respectively, on four video understanding benchmarks.

SpecVLM 是一种针对视频大型语言模型 (Vid-LLMs) 的推测性解码框架，通过两阶段剪枝过程最多剪除 90% 的视频令牌，实现高效推测而不牺牲准确性。第一阶段根据验证器的注意力信号选择信息丰富的令牌，第二阶段均匀剪枝剩余冗余令牌。SpecVLM 在四个视频理解基准测试上分别实现了 LLaVA-OneVision-72B 和 Qwen2.5-VL-32B 最高达 2.68 倍和 2.11 倍的解码加速。
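
The two-stage pruning described above can be sketched on a flattened list of video tokens. This is a simplified illustration under stated assumptions, not the released SpecVLM code: `two_stage_prune`, `keep_ratio`, and `stage1_share` are hypothetical names, the attention scores are assumed to be given per token, and "spatially uniform" is approximated by evenly spaced sampling over the remaining token positions.

```python
def two_stage_prune(attn_scores, keep_ratio=0.1, stage1_share=0.5):
    """Return the sorted indices of video tokens kept for the draft model.

    Stage I keeps the tokens with the highest verifier attention.
    Stage II fills the rest of the budget by sampling the remaining
    positions at a uniform stride.
    """
    n = len(attn_scores)
    budget = max(1, round(n * keep_ratio))
    k1 = min(budget, max(1, round(budget * stage1_share)))

    # Stage I: attention-guided selection of the most informative tokens
    ranked = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    stage1 = set(ranked[:k1])

    # Stage II: uniform subsampling of what is left
    remaining = [i for i in range(n) if i not in stage1]
    k2 = budget - k1
    stage2 = []
    if k2 > 0 and remaining:
        step = len(remaining) / k2
        stage2 = [remaining[int(j * step)] for j in range(k2)]

    return sorted(stage1 | set(stage2))

# Hypothetical scores: later tokens receive more attention
print(two_stage_prune(list(range(100))))
```

With `keep_ratio=0.1` the draft model sees 10% of the video tokens, matching the "up to 90%" pruning figure; the verifier still processes the full input, which is what keeps speculative decoding lossless.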

Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models

Authors: Diogo Freitas, Brigt Håvardstun, Cèsar Ferri, Darío Garigliotti, Jan Arne Telle, José Hernández-Orallo

First: 2025-05-14T09:41:38+00:00 · Latest: 2025-08-28T06:16:17+00:00

Comments: 54 pages (42 pages of appendix). Accepted for publication at the ECAI 2025 conference

Abs · PDF

Abstract

Large language models have become multimodal, and many of them are said to integrate their modalities using common representations. If this were true, a drawing of a car as an image, for instance, should map to a similar area in the latent space as a textual description of the strokes that form the drawing. To explore this in a black-box access regime to these models, we propose the use of machine teaching, a theory that studies the minimal set of examples a teacher needs to choose so that the learner captures the concept. In this paper, we evaluate the complexity of teaching vision-language models a subset of objects in the Quick, Draw! dataset using two presentations: raw images as bitmaps and trace coordinates in TikZ format. The results indicate that image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based representations. But, surprisingly, the teaching size usually ranks concepts similarly across both modalities, even when controlling for (a human proxy of) concept priors, suggesting that the simplicity of concepts may be an inherent property that transcends modality representations.

中文标题/摘要

标题：视觉-语言模型中相对绘图识别复杂性不随模态变化

大型语言模型已经变得多模态，并且其中许多模型据说使用共同表示来整合其模态。如果这是真的，例如，一幅作为图像的汽车的草图应该在潜在空间中映射到与其构成该草图的笔画的文本描述相似的区域。为了在这种黑盒访问模式下探索这一点，我们提出使用机器教学，这是一种研究教师需要选择的最小示例集以使学习者掌握概念的理论。在本文中，我们使用两种呈现方式评估视觉-语言模型对Quick, Draw!数据集中一部分对象的教学复杂性：原始图像作为位图和TikZ格式的轨迹坐标。结果表明，基于图像的表示通常需要更少的片段并实现更高的准确性，但令人惊讶的是，教学规模通常在两种模态中对概念的排名相似，即使控制了（人类概念先验的代理），这表明概念的简单性可能是超越模态表示的固有属性。

Summary / 总结

This study investigates the relative drawing identification complexity in vision-language models across different modalities. Using machine teaching, the researchers evaluated the complexity of teaching models to recognize objects from the Quick, Draw! dataset in two forms: raw images and trace coordinates in TikZ format. The findings show that image-based representations are more efficient, requiring fewer segments and achieving higher accuracy. Surprisingly, however, the teaching size usually ranked concepts similarly across both modalities, indicating that the simplicity of concepts might be an inherent property that is not heavily modality-dependent.

研究通过使用机器教学方法，评估了向视觉-语言模型教授Quick, Draw!数据集中部分对象识别任务时，使用原始图像或轨迹坐标所需的概念数量和准确性。结果显示，基于图像的表示需要更少的片段并实现更高的准确性，但令人惊讶的是，两种表示方式的教学规模对概念的排序相似，表明概念的简单性是独立于表示方式的固有属性。

Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models

Authors: Xin Huang, Ruibin Li, Tong Jia, Wei Zheng, Ya Wang

Venue: IJCAI 2025

First: 2025-05-21T14:28:43+00:00 · Latest: 2025-08-28T04:15:35+00:00

Comments: Accepted at the International Joint Conference on Artificial Intelligence (IJCAI 2025)

Abs · PDF · Code1

Abstract

Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model's discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs' performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.

中文标题/摘要

标题：视觉扰动与自适应难负样本对比学习在视觉语言模型组成推理中的应用

视觉语言模型（VLMs）对于多模态任务至关重要，尤其是组成推理（CR）任务，这类任务需要区分视觉和文本嵌入之间的细微语义差异。然而，现有方法主要通过生成基于文本的难负样本对模型进行微调，忽视了基于图像的负样本的重要性，导致视觉编码器训练不足，最终影响模型的整体性能。此外，负样本通常被均匀处理，没有考虑其难度级别，正样本对的对齐也不充分，这导致了难以对齐的样本对的挑战。为了解决这些问题，我们提出了自适应难负样本扰动学习（AHNPL）。AHNPL将基于文本的难负样本转换到视觉域，生成语义上被扰乱的图像负样本，从而提高模型的整体性能。AHNPL还引入了一种使用多模态难负样本损失的对比学习方法，以提高模型在每个模态内区分难负样本的能力，并引入了一种动态边际损失，根据样本难度调整对比边际，以增强困难样本对的区分能力。在三个公开数据集上的实验表明，我们的方法有效地提升了VLMs在复杂CR任务上的性能。源代码可在https://github.com/nynu-BDAI/AHNPL获取。

Summary / 总结

The paper addresses the limitations of existing Vision-Language Models (VLMs) in compositional reasoning tasks by proposing Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL generates image-based negative samples from text-based hard negatives, improving the visual encoder's training. It also uses a multimodal hard negative loss to sharpen the model's discrimination of hard negatives, and a dynamic margin loss that adapts to sample difficulty to better separate challenging sample pairs. Experiments on three public datasets show that the method effectively improves performance on complex compositional reasoning tasks.

研究旨在通过解决现有方法主要依赖文本硬负样本而忽视图像负样本的问题，提高视觉语言模型（VLMs）的组合推理能力。提出的自适应硬负样本扰动学习（AHNPL）生成语义上被扰乱的图像负样本，并使用多模态硬负样本损失和动态边际损失来增强模型的区分能力和对样本对的对齐。在三个公开数据集上的实验表明，AHNPL 有效提升了 VLMs 在复杂组合推理任务中的性能。

MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models

Authors: Xiao Li, Yanfan Zhu, Ruining Deng, Wei-Qi Wei, Yu Wang, Shilin Zhao, Yaohong Wang, Haichun Yang, Yuankai Huo

First: 2025-08-28T01:39:16+00:00 · Latest: 2025-08-28T01:39:16+00:00

Abs · PDF

Abstract

Recent advances in medical vision-language models (VLMs) open up remarkable opportunities for clinical applications such as automated report generation, copilots for physicians, and uncertainty quantification. However, despite their promise, medical VLMs introduce serious security concerns, most notably risks of Protected Health Information (PHI) exposure, data leakage, and vulnerability to cyberthreats - which are especially critical in hospital environments. Even when adopted for research or non-clinical purposes, healthcare organizations must exercise caution and implement safeguards. To address these challenges, we present MedFoundationHub, a graphical user interface (GUI) toolkit that: (1) enables physicians to manually select and use different models without programming expertise, (2) supports engineers in efficiently deploying medical VLMs in a plug-and-play fashion, with seamless integration of Hugging Face open-source models, and (3) ensures privacy-preserving inference through Docker-orchestrated, operating system agnostic deployment. MedFoundationHub requires only an offline local workstation equipped with a single NVIDIA A6000 GPU, making it both secure and accessible within the typical resources of academic research labs. To evaluate current capabilities, we engaged board-certified pathologists to deploy and assess five state-of-the-art VLMs (Google-MedGemma3-4B, Qwen2-VL-7B-Instruct, Qwen2.5-VL-7B-Instruct, and LLaVA-1.5-7B/13B). Expert evaluation covered colon cases and renal cases, yielding 1015 clinician-model scoring events. These assessments revealed recurring limitations, including off-target answers, vague reasoning, and inconsistent pathology terminology.

Summary / 总结

MedFoundationHub is a toolkit designed to address the security and usability challenges of deploying medical vision-language models (VLMs) in clinical settings. It provides a graphical user interface for physicians to use VLMs without programming knowledge and supports efficient deployment of models through Docker, ensuring privacy-preserving inference. Evaluation by pathologists on five state-of-the-art VLMs showed recurring limitations such as off-target answers and inconsistent terminology, indicating areas for improvement in model accuracy and reliability.

MedFoundationHub 是一个工具包，旨在解决医疗视觉语言模型（VLMs）部署中的安全性和易用性挑战。它提供了一个图形用户界面，供医生在无需编程知识的情况下使用模型，并支持工程师以插拔式方式高效部署这些模型。专家评估五种最先进的 VLMs 的结果显示，存在诸如偏离目标答案和病理术语不一致等反复出现的问题。

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Authors: Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang

First: 2025-08-28T00:07:10+00:00 · Latest: 2025-08-28T00:07:10+00:00

Comments: 54 pages

Abs · PDF

Abstract

As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of "jailbreaks" into diagnostics, named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We have empirically validated the effectiveness of GUARD on seven LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.

中文标题/摘要

标题：GUARD：通过适应性角色扮演和越狱诊断实现准则维护测试

随着大型语言模型在各个领域中的作用越来越重要，它们生成有害响应的潜在风险引起了社会和监管机构的广泛关注。为此，政府发布了伦理准则以促进可信赖的人工智能的发展。然而，这些准则通常是对开发者和测试者的高层次要求，缺乏将这些要求转化为可操作的测试问题以验证LLM合规性的具体方法。为应对这一挑战，我们提出了GUARD（通过适应性角色扮演和越狱诊断实现准则维护测试）这一测试方法，旨在将准则具体化为特定的准则违反问题，以评估LLM的合规性。GUARD通过基于政府发布的准则自动生成准则违反问题来实施这一方法，从而测试响应是否符合这些准则。当响应直接违反准则时，GUARD会报告不一致之处。此外，对于不直接违反准则的响应，GUARD结合“越狱”概念进行诊断，命名为GUARD-JD，通过创建可能引发不道德或准则违反响应的情景，有效识别可能绕过内置安全机制的潜在场景。最后，我们的方法总结为一份合规报告，详细说明了合规程度并指出了任何违反情况。我们通过在七个LLM（包括Vicuna-13B、LongChat-7B、Llama2-7B、Llama-3-8B、GPT-3.5、GPT-4、GPT-4o和Claude-3.7）下测试合规性以及进行越狱诊断，实证验证了GUARD的有效性。此外，GUARD-JD还可以将越狱诊断应用于视觉语言模型，展示了其在促进可靠LLM应用方面的使用。

Summary / 总结

GUARD is a testing method designed to operationalize ethics guidelines into specific questions that verify the compliance of Large Language Models (LLMs). It automatically generates guideline-violating questions and integrates the concept of "jailbreaks" to diagnose scenarios that could bypass built-in safety mechanisms. GUARD was empirically validated on seven LLMs, including Vicuna-13B and GPT-4, under three government-issued guidelines, and its jailbreak diagnostics were also shown to transfer to vision-language models.

GUARD 是一种测试方法，旨在将伦理准则具体化为特定问题以验证大型语言模型（LLMs）的合规性。它使用了生成违背准则问题的自动化方法，并结合了‘越狱’的概念来识别可能绕过安全机制的场景。GUARD 在包括 Vicuna-13B、LongChat-7B 和 Llama2-7B 在内的七种 LLM 上进行了实证验证，并在三种政府发布的准则下进行了测试，展示了其在促进可靠 LLM 应用程序中的使用情况。

Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation

Authors: Zhixiang Chi, Yanan Wu, Li Gu, Huan Liu, Ziqiang Wang, Yang Zhang, Yang Wang, Konstantinos N. Plataniotis

Venue: ICCV 2025

First: 2025-08-27T20:47:03+00:00 · Latest: 2025-08-27T20:47:03+00:00

Comments: ICCV 2025, code: https://github.com/chi-chi-zx/FSA

Abs · PDF · Code1

Abstract

CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention. However, this coherence isn't consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations; this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.

Summary / 总结

This work addresses the challenge of open-vocabulary segmentation using CLIP by proposing a training-free framework that enhances spatial coherence in intermediate attention through feedback from final output predictions. The method includes modules for attention isolation, confidence-based pruning, and adaptation ensemble to improve semantic consistency. Experiments show consistent performance improvements across various attention types and benchmarks when integrated into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H).

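This work proposes a training-free framework for improving the spatial coherence of CLIP in open-vocabulary segmentation. The method feeds output-level patch correspondences back into the intermediate attention, using the model's final predictions to improve semantic consistency. Experiments show consistent gains across multiple attention types and benchmarks, and the method integrates seamlessly into several state-of-the-art approaches with different backbones.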
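
To make the feedback idea concrete, here is a minimal sketch of how per-patch output predictions could be turned into a sparse patch-to-patch attention prior. It is not the authors' implementation: the function name `feedback_attention`, the dot-product affinity, and the `conf_threshold` value are all assumptions chosen for illustration, and the abstract does not specify how attention isolation or the adaptation ensemble work.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def feedback_attention(patch_logits, conf_threshold=0.6):
    """Build a row-normalized patch-to-patch prior from output predictions.

    patch_logits: one list of class logits per patch (hypothetical input).
    Patches whose top class probability falls below conf_threshold are
    pruned as keys, which is an assumed stand-in for confidence-based pruning.
    """
    probs = [softmax(row) for row in patch_logits]
    confident = [max(p) >= conf_threshold for p in probs]
    n = len(probs)
    attn = []
    for i in range(n):
        scores = []
        for j in range(n):
            if confident[j] or j == i:
                # Affinity: agreement between the two patches' class distributions.
                scores.append(sum(a * b for a, b in zip(probs[i], probs[j])))
            else:
                scores.append(0.0)
        total = sum(scores)  # always > 0 because the patch attends to itself
        attn.append([s / total for s in scores])
    return attn


# Two patches that agree on class 0, one on class 1, and one ambiguous patch.
logits = [[4.0, 0.0], [3.5, 0.0], [0.0, 4.0], [0.1, 0.0]]
prior = feedback_attention(logits)
```

In this toy input, the first patch attends mostly to itself and to the second patch, which shares its predicted class, while the ambiguous fourth patch receives no attention from the others.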

A Novel Framework for Automated Explain Vision Model Using Vision-Language Models

Authors: Phu-Vinh Nguyen, Tan-Hanh Pham, Chris Ngo, Truong Son Hy

First: 2025-08-27T19:16:40+00:00 · Latest: 2025-08-27T19:16:40+00:00

Abstract

The development of many vision models mainly focuses on improving their performance using metrics such as accuracy, IoU, and mAP, with less attention to explainability due to the complexity of applying xAI methods to provide a meaningful explanation of trained models. Although many existing xAI methods aim to explain vision models sample-by-sample, methods explaining the general behavior of vision models, which can only be captured after running on a large dataset, are still underexplored. Furthermore, understanding the behavior of vision models on general images can be very important to prevent biased judgments and help identify the model's trends and patterns. With the application of Vision-Language Models, this paper proposes a pipeline to explain vision models at both the sample and dataset levels. The proposed pipeline can be used to discover failure cases and gain insights into vision models with minimal effort, thereby integrating vision model development with xAI analysis to advance image analysis.

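Title / Abstract (translated from Chinese)

Title: A Novel Framework for Automatically Explaining Vision Models Using Vision-Language Models

The development of many vision models focuses mainly on improving performance on metrics such as accuracy, IoU, and mAP, with less attention paid to explainability, because applying xAI methods to produce meaningful explanations is complex. Although many existing xAI methods aim to explain vision models sample by sample, methods that explain a model's general behavior, which can only be captured after running on a large dataset, remain underexplored. Understanding how vision models behave on general images is also important for preventing biased judgments and for identifying a model's trends and patterns. Using vision-language models, this paper proposes a pipeline that explains vision models at both the sample level and the dataset level. The pipeline can be used to discover failure cases and gain insight into vision models with minimal effort, combining vision model development with xAI analysis to advance image analysis.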

OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations

Authors: Peng-Hao Hsu, Ke Zhang, Fu-En Wang, Tao Tu, Ming-Feng Li, Yu-Lun Liu, Albert Y. C. Chen, Min Sun, Cheng-Hao Kuo

First: 2025-08-27T17:17:00+00:00 · Latest: 2025-08-27T17:17:00+00:00

Comments: ICCV 2025

Abstract

Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET where posed RGB-D images are given but no human annotations of 3D boxes or classes are available. We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo-boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from 2D segments associated with each coherent 3D structure to align with the corresponding voxel feature. Training a highly accurate single-stage detector requires both losses to be learned toward high-quality targets. At inference, OpenM3D, a highly efficient detector, requires only multi-view images for input and demonstrates superior accuracy and speed (0.3 sec. per scene) on ScanNet200 and ARKitScenes indoor benchmarks compared to existing methods. We outperform a strong two-stage method that leverages our class-agnostic detector with a ViT CLIP-based OV classifier and a baseline incorporating a multi-view depth estimator on both accuracy and speed.

Summary / 总结

OpenM3D is an open-vocabulary multi-view indoor 3D object detector trained without human annotations. It is a single-stage detector built on 2D-induced voxel features and is jointly trained with a class-agnostic 3D localization loss and a voxel-semantic alignment loss. The localization loss is supervised by 3D pseudo boxes, generated by combining 2D segments into coherent 3D structures with a graph embedding technique, and the alignment loss matches voxel features to diverse CLIP features sampled from the associated 2D segments. The pseudo boxes achieve higher precision and recall than those of other methods, including OV-3DET. On the ScanNet200 and ARKitScenes benchmarks, OpenM3D demonstrates superior accuracy and speed compared to existing methods, taking 0.3 seconds per scene.

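OpenM3D is a multi-view indoor 3D object detector that requires no human annotations. It takes a single-stage approach using 2D-induced voxel features and is jointly trained with a class-agnostic 3D localization loss, supervised by high-quality 3D pseudo boxes, and a voxel-semantic alignment loss that aligns voxel features with diverse CLIP features. On the ScanNet200 and ARKitScenes benchmarks, OpenM3D outperforms existing methods in accuracy and speed, with a processing time of only 0.3 seconds per scene.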
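
The pseudo-box comparison above is stated in terms of precision and recall. As a hedged illustration of how such numbers are commonly computed for 3D boxes, the sketch below matches pseudo boxes to reference boxes by 3D IoU. The axis-aligned box format `(xmin, ymin, zmin, xmax, ymax, zmax)`, the greedy one-to-one matching, and the IoU threshold of 0.25 are assumptions for illustration; the abstract does not state the evaluation protocol.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for k in range(3):
        lo = max(a[k], b[k])
        hi = min(a[k + 3], b[k + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def precision_recall(pseudo_boxes, reference_boxes, iou_threshold=0.25):
    """Greedily match each pseudo box to its best unmatched reference box."""
    matched = set()
    true_positives = 0
    for p in pseudo_boxes:
        best_j, best_iou = None, iou_threshold
        for j, g in enumerate(reference_boxes):
            if j in matched:
                continue
            value = iou_3d(p, g)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(pseudo_boxes) if pseudo_boxes else 0.0
    recall = true_positives / len(reference_boxes) if reference_boxes else 0.0
    return precision, recall


reference = [(0, 0, 0, 1, 1, 1), (2, 2, 2, 3, 3, 3)]
pseudo = [(0, 0, 0, 1, 1, 0.5), (5, 5, 5, 6, 6, 6)]
result = precision_recall(pseudo, reference)
```

Here one of the two pseudo boxes overlaps a reference box with IoU 0.5 and the other overlaps nothing, so one of two pseudo boxes is correct and one of two reference boxes is recovered.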

Segmentation Assisted Incremental Test Time Adaptation in an Open World

Authors: Manogna Sreenivas, Soma Biswas

First: 2025-08-27T16:33:32+00:00 · Latest: 2025-08-27T16:33:32+00:00

Comments: Accepted at BMVC 2025

Abstract

In dynamic environments, unfamiliar objects and distribution shifts are often encountered, which challenge the generalization abilities of deployed models. This work addresses Incremental Test Time Adaptation (ITTA) of Vision Language Models (VLMs), tackling scenarios where unseen classes and unseen domains continuously appear during testing. Unlike traditional Test Time Adaptation (TTA) approaches, where the test stream comes only from a predefined set of classes, our framework allows models to adapt simultaneously to both covariate and label shifts, actively incorporating new classes as they emerge. Towards this goal, we establish a new benchmark for ITTA, integrating single-image TTA methods for VLMs with active labeling techniques that query an oracle for samples potentially representing unseen classes during test time. We propose a segmentation-assisted active labeling module, termed SegAssist, which is training-free and repurposes the segmentation capabilities of VLMs to refine active sample selection, prioritizing samples likely to belong to unseen classes. Extensive experiments on several benchmark datasets demonstrate the potential of SegAssist to enhance the performance of VLMs in real-world scenarios, where continuous adaptation to emerging data is essential. Project page: https://manogna-s.github.io/segassist/

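Title / Abstract (translated from Chinese)

Title: Segmentation Assisted Incremental Test Time Adaptation in an Open World

In dynamic environments, unfamiliar objects and distribution shifts are often encountered, challenging the generalization ability of deployed models. This work addresses Incremental Test Time Adaptation (ITTA) of vision-language models, covering scenarios where unseen classes and unseen domains appear continuously during testing. Unlike traditional test-time adaptation methods, where the test stream comes only from a predefined set of classes, the framework lets models adapt to covariate and label shifts at the same time, actively incorporating new classes as they emerge. To this end, the authors establish a new ITTA benchmark that combines single-image test-time adaptation methods with active labeling techniques, which query an oracle at test time for samples that may represent unseen classes. They propose a segmentation-assisted active labeling module, SegAssist, which is training-free and uses the segmentation capabilities of vision-language models to refine active sample selection, prioritizing samples likely to belong to unseen classes. Extensive experiments on several benchmark datasets show that SegAssist can improve the performance of vision-language models in real-world scenarios where continuous adaptation to emerging data is essential. Project page: https://manogna-s.github.io/segassist/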

Summary / 总结

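This work addresses Incremental Test Time Adaptation (ITTA) of vision-language models, where unseen classes and unseen domains appear continuously during testing. It establishes an ITTA benchmark that pairs single-image TTA methods with active labeling, and proposes SegAssist, a training-free module that uses the segmentation capabilities of VLMs to prioritize samples likely to belong to unseen classes when querying an oracle. Experiments on several benchmark datasets indicate that SegAssist improves VLM performance under continuous adaptation.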

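This study targets incremental test-time adaptation (ITTA) of vision-language models (VLMs) in dynamic environments where new classes and domains keep appearing. It proposes SegAssist, a segmentation-based active labeling module that queries samples likely to belong to new classes to help the model adapt to unseen classes and label shifts. Experimental results show that SegAssist improves VLM performance in real-world scenarios that require continuous adaptation to new data.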

SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control

Authors: Quanfeng Lu, Zhantao Ma, Shuai Zhong, Jin Wang, Dahai Yu, Michael K. Ng, Ping Luo

First: 2025-08-27T16:27:19+00:00 · Latest: 2025-08-27T16:27:19+00:00

Comments: 28 pages, 12 figures

Abstract

The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.

Summary / 总结

SWIRL is a staged workflow for interleaved reinforcement learning in multi-agent systems, designed to address the limitations of single-agent approaches and the inefficiencies in multi-agent reinforcement learning. By reformulating MARL into a sequence of single-agent tasks, SWIRL enables stable training and efficient coordination. Theoretical guarantees include stepwise safety bounds, cross-round monotonic improvement, and convergence on return. Experimental results show superior performance in both high-level and low-level mobile GUI control benchmarks, as well as strong capabilities in multi-agent mathematical reasoning, indicating its potential as a general framework for multi-agent systems.

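SWIRL is a staged workflow for interleaved reinforcement learning in multi-agent systems, intended to address the limitations of single-agent methods and the inefficiency of multi-agent reinforcement learning. By reformulating multi-agent reinforcement learning as a sequence of single-agent tasks, SWIRL achieves stable training and efficient coordination between agents. Its theoretical guarantees include a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence of the return. Experimental results show that SWIRL performs strongly on both high-level and low-level mobile GUI control benchmarks and is also highly capable in multi-agent mathematical reasoning, which suggests its potential as a general framework for developing multi-agent systems.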
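
The interleaving scheme described above, updating one agent at a time while the others stay fixed, can be illustrated on a toy problem. The sketch below is not SWIRL itself: it replaces reinforcement learning with plain gradient ascent on an assumed concave quadratic `shared_return`, and the scalar parameters `nav` and `inter` are hypothetical stand-ins for the Navigator and Interactor policies. It only shows the alternating structure and the round-over-round improvement it produces in this simple setting.

```python
def shared_return(nav, inter):
    """Assumed toy objective shared by both agents (concave quadratic)."""
    return -(nav - 1.0) ** 2 - (inter - 2.0) ** 2 - 0.5 * (nav - inter) ** 2


def grad_nav(nav, inter):
    """Partial derivative of shared_return with respect to nav."""
    return -3.0 * nav + inter + 2.0


def grad_inter(nav, inter):
    """Partial derivative of shared_return with respect to inter."""
    return -3.0 * inter + nav + 4.0


def interleaved_training(rounds=40, steps_per_round=5, lr=0.1):
    """Alternate rounds: even rounds update nav only, odd rounds update inter only."""
    nav, inter = 0.0, 0.0
    history = [shared_return(nav, inter)]
    for r in range(rounds):
        for _ in range(steps_per_round):
            if r % 2 == 0:
                nav += lr * grad_nav(nav, inter)      # inter is held fixed
            else:
                inter += lr * grad_inter(nav, inter)  # nav is held fixed
        history.append(shared_return(nav, inter))
    return nav, inter, history


nav, inter, history = interleaved_training()
```

Because each round is an ordinary single-variable ascent with a small step size, the shared return does not decrease from one round to the next, and the two parameters settle at the joint optimum of the toy objective (`nav = 1.25`, `inter = 1.75`).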