arXiv Paper Digest

2025-08-30 03:56
Snapshot: 20250830_0356
OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning
Authors: Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, Xinglong Wu
First: 2025-08-28T17:59:46+00:00 · Latest: 2025-08-28T17:59:46+00:00
Comments: project url: https://one-reward.github.io
Abstract
In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances the model's generative capabilities across multiple tasks under different evaluation criteria using only One Reward model. By employing a single vision-language model (VLM) as the generative reward model, which can distinguish the winner and loser for a given task and a given evaluation criterion, it can be effectively applied to multi-task generation models, particularly in contexts with varied data and diverse task objectives. We utilize OneReward for mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, involving a binary mask as the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io
中文标题/摘要
标题:OneReward:统一的多任务人类偏好学习引导图像生成
在本文中,我们介绍了OneReward,这是一种统一的强化学习框架,通过单一的奖励模型在多种任务和不同评估标准下增强模型的生成能力。通过使用单一的视觉语言模型(VLM)作为生成奖励模型,该模型可以区分给定任务和评估标准下的胜者和败者,从而可以有效地应用于多任务生成模型,特别是在数据和任务目标多样化的背景下。我们使用OneReward进行掩码引导的图像生成,这可以进一步细分为图像填充、图像扩展、对象删除和文本渲染等子任务,涉及一个二元掩码作为编辑区域。尽管这些特定领域的任务共享相同的条件范式,但它们在底层数据分布和评估指标上存在显著差异。现有方法通常依赖于特定任务的监督微调(SFT),这限制了泛化能力和训练效率。基于OneReward,我们开发了Seedream 3.0 Fill,这是一种通过多任务强化学习直接在预训练基模型上训练的掩码引导生成模型,消除了对特定任务SFT的需求。实验结果表明,我们的统一编辑模型在多个评估维度上均优于商业和开源竞争对手,如Ideogram、Adobe Photoshop和FLUX Fill [Pro]。代码和模型可在:https://one-reward.github.io 获取。
Summary / 总结
OneReward is a unified reinforcement learning framework that improves generative capabilities across multiple tasks using a single reward model. It employs a vision-language model to distinguish task winners and losers, enabling multi-task generation without task-specific supervised fine-tuning. Experiments show that OneReward's unified edit model outperforms commercial and open-source competitors in various evaluation metrics for mask-guided image generation tasks such as image fill, object removal, and text rendering. Code and model are available at https://one-reward.github.io.
OneReward 是一个统一的强化学习框架,使用单一奖励模型在多个任务中提升生成能力。它利用视觉-语言模型来区分任务的优胜者和失败者,从而在无需特定任务监督微调的情况下实现多任务生成。实验结果表明,OneReward 的统一编辑模型在图像填充、对象移除和文本渲染等任务的多个评估维度上均优于商业和开源竞争对手。代码和模型可在 https://one-reward.github.io 获取。
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
Authors: Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie
First: 2025-08-28T17:50:58+00:00 · Latest: 2025-08-28T17:50:58+00:00
Comments: 23 pages, 8 figures, Project Page: https://jiutian-vl.github.io/CogVLA-page
Abstract
Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
Summary / 总结
Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance.
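The "FiLM" in EFA-Routing and LFP-Routing presumably refers to feature-wise linear modulation, in which a conditioning signal supplies a per-channel scale and shift. The abstract does not say how CogVLA wires this into its encoder or language model, so the following is only a generic sketch of the FiLM operation itself. The names `film`, `gamma`, and `beta` are illustrative, and the fixed modulation vectors stand in for values that a learned layer would predict from the instruction embedding.

```python
def film(tokens, gamma, beta):
    """Apply a per-channel scale (gamma) and shift (beta) to every token.

    tokens: list of feature vectors, one per visual token.
    gamma, beta: per-channel modulation vectors (hypothetical constants here;
    a real model would predict them from the instruction embedding).
    """
    if len(gamma) != len(beta):
        raise ValueError("gamma and beta must have the same length")
    modulated = []
    for tok in tokens:
        if len(tok) != len(gamma):
            raise ValueError("token width must match gamma/beta")
        modulated.append([g * f + b for f, g, b in zip(tok, gamma, beta)])
    return modulated


visual_tokens = [[1.0, 2.0], [3.0, 4.0]]
print(film(visual_tokens, gamma=[2.0, 0.5], beta=[0.0, 1.0]))
# → [[2.0, 2.0], [6.0, 3.0]]
```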
MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs
Authors: Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, Shanghang Zhang
First: 2025-08-28T17:50:03+00:00 · Latest: 2025-08-28T17:50:03+00:00
Comments: 10 pages, 3 figures
Abstract
Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency through visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance, while reducing visual tokens by 75% and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.
中文标题/摘要
标题:MMG-Vid:在段级和令牌级最大化边际收益以提高高效视频LLMs
视频大型语言模型(VLLMs)在视频理解方面表现出色,但其过多的视觉令牌对实际应用构成了显著的计算挑战。当前的方法通过视觉令牌剪枝来提高推理效率,但它们没有考虑到视频帧的动态特性和时间依赖性,因为它们将视频理解视为多帧任务。为了解决这些挑战,我们提出了一种名为MMG-Vid的新型无训练视觉令牌剪枝框架,通过在段级和令牌级最大化边际收益来去除冗余。具体而言,我们首先根据帧相似性将视频划分为段,然后为每个段动态分配令牌预算,以最大化每个段的边际收益。随后,我们提出了一种基于时间引导的DPC算法,该算法同时建模了帧间唯一性和帧内多样性,从而最大化每个令牌的边际收益。通过结合这两个阶段,MMG-Vid可以最大化有限的令牌预算的利用,显著提高效率同时保持强大的性能。广泛的实验表明,MMG-Vid可以保持超过99.5%的原始性能,同时有效减少75%的视觉令牌,并在LLaVA-OneVision-7B的预填充阶段加速3.9倍。代码将很快发布。
Summary / 总结
MMG-Vid is a training-free visual token pruning framework that enhances the efficiency of Video Large Language Models (VLLMs) by maximizing marginal gains at both segment-level and token-level. It divides videos into segments based on frame similarity and dynamically allocates token budgets to maximize the marginal gain of each segment. Additionally, it uses a temporal-guided DPC algorithm to model inter-frame uniqueness and intra-frame diversity, further optimizing token usage. Experiments show that MMG-Vid maintains over 99.5% of the original performance while reducing visual tokens by 75% and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B.
MMG-Vid 是一种无需训练的视觉标记剪枝框架,通过在段级和标记级最大化边际收益来提升视频大型语言模型(VLLMs)的效率。它根据帧相似性将视频划分为段,并动态分配标记预算以最大化每个段的边际收益。此外,它使用时间引导的DPC算法来建模帧间独特性和帧内多样性,进一步优化标记使用。实验表明,MMG-Vid 在保持超过 99.5% 的原始性能的同时,减少了 75% 的视觉标记,并将预填充阶段加速了 3.9 倍,适用于 LLaVA-OneVision-7B。
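The first stage described in the MMG-Vid abstract (split the video by frame similarity, then give each segment a share of the token budget) can be sketched roughly as follows. This is not the authors' implementation: the cosine threshold, the function names, and above all the proportional-to-length budget rule are assumptions standing in for the marginal-gain criterion, which the abstract does not specify.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def split_into_segments(frame_feats, threshold=0.9):
    """Start a new segment whenever consecutive frames fall below the threshold."""
    if not frame_feats:
        return []
    segments = [[0]]
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i - 1], frame_feats[i]) < threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments


def allocate_budget(segments, total_tokens):
    """Placeholder rule (assumption): budget proportional to segment length,
    with leftover tokens assigned by largest remainder."""
    n_frames = sum(len(s) for s in segments)
    raw = [total_tokens * len(s) / n_frames for s in segments]
    budget = [int(r) for r in raw]
    leftovers = total_tokens - sum(budget)
    order = sorted(range(len(segments)), key=lambda i: raw[i] - budget[i], reverse=True)
    for i in order[:leftovers]:
        budget[i] += 1
    return budget


# Toy per-frame features: two near-identical frames, then a scene change.
feats = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
segs = split_into_segments(feats)
print(segs, allocate_budget(segs, total_tokens=10))
```

Keeping 25% of the visual tokens, as in the reported 75% reduction, would correspond to setting `total_tokens` to a quarter of the original token count.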
Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets
Authors: Dale Decatur, Thibault Groueix, Wang Yifan, Rana Hanocka, Vladimir Kim, Matheus Gadelha
Venue: ICCV 2025
First: 2025-08-28T17:35:03+00:00 · Latest: 2025-08-28T17:35:03+00:00
Comments: ICCV 2025. Project page: https://ddecatur.github.io/hierarchical-diffusion/
Abstract
Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free approach that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip's text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: https://ddecatur.github.io/hierarchical-diffusion/
中文标题/摘要
标题:文本到图像扩散中的计算重用以高效生成图像集
文本到图像扩散模型能够生成高质量的图像,但计算成本高昂。尽管先前的工作优化了每轮推理的效率,我们探索了一种不同的方法:减少相关提示之间的冗余。我们的方法利用了扩散模型从粗到细的特性,在早期去噪步骤中捕获相似提示之间的共享结构。我们提出了一种无需训练的方法,根据语义相似性对提示进行聚类,并在早期扩散步骤中共享计算。实验表明,对于基于图像嵌入训练的模型,我们的方法显著降低了计算成本并提高了图像质量。通过利用UnClip的文本到图像先验,我们增强了扩散步骤的分配以提高效率。我们的方法可以无缝集成到现有的管道中,适用于不同的提示集,并减少了大规模文本到图像生成的环境和财务负担。项目页面:https://ddecatur.github.io/hierarchical-diffusion/
Summary / 总结
This paper addresses the computational cost of text-to-image diffusion models by reducing redundancy across similar prompts. The training-free approach clusters prompts by semantic similarity and exploits the coarse-to-fine nature of diffusion to share computation in early denoising steps. Experiments show that, for models trained conditioned on image embeddings, the method significantly reduces compute cost while improving image quality; UnClip's text-to-image prior is used to improve diffusion step allocation. The method integrates with existing pipelines and scales with prompt sets, reducing the environmental and financial burden of large-scale text-to-image generation.
研究旨在通过在相似提示之间重用计算来降低文本到图像生成的计算成本。方法基于语义相似性聚类提示,并在早期扩散步骤中共享计算,特别适用于基于图像嵌入训练的模型。这种方法显著降低了计算成本并提高了图像质量。该方法与现有管道无缝集成,并且随着提示集的扩展而扩展,从而减少了大规模文本到图像生成的环境和财务负担。
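To make the saving from shared early denoising steps concrete, the sketch below counts denoiser calls with and without sharing. It assumes a single level of sharing per cluster and a fixed number of shared steps, which is a simplification; the abstract gives neither the clustering procedure nor the step allocation, and the function name and example numbers are hypothetical.

```python
def denoising_calls(cluster_sizes, total_steps, shared_steps):
    """Return (independent, shared) denoiser-call counts.

    Independent: every prompt runs all `total_steps` on its own.
    Shared: each cluster runs the first `shared_steps` once, then every
    prompt in the cluster runs the remaining steps separately.
    """
    if not 0 <= shared_steps <= total_steps:
        raise ValueError("shared_steps must lie in [0, total_steps]")
    independent = sum(cluster_sizes) * total_steps
    shared = sum(
        shared_steps + n * (total_steps - shared_steps) for n in cluster_sizes
    )
    return independent, shared


# Hypothetical example: 6 prompts in two clusters, 50 steps, first 20 shared.
independent, shared = denoising_calls([4, 2], total_steps=50, shared_steps=20)
print(independent, shared)  # → 300 220
```

The saving grows with cluster size and with the fraction of steps that can be shared, which is consistent with the claim that the method scales with prompt sets.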
DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes
Authors: Yajiao Xiong, Xiaoyu Zhou, Yongtao Wan, Deqing Sun, Ming-Hsuan Yang
First: 2025-08-28T16:22:54+00:00 · Latest: 2025-08-28T16:22:54+00:00
Abstract
We present DrivingGaussian++, an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes. DrivingGaussian++ models the static background using incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. DrivingGaussian++ supports training-free controllable editing for dynamic driving scenes, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. By integrating large language models (LLMs) and controllable editing, our method can automatically generate dynamic object motion trajectories and enhance their realism during the optimization process. DrivingGaussian++ demonstrates consistent and realistic editing results and generates dynamic multi-view driving scenarios, while significantly enhancing scene diversity. More results and code can be found at the project site: https://xiong-creator.github.io/DrivingGaussian_plus.github.io
中文标题/摘要
标题:DrivingGaussian++:面向现实重建与可编辑模拟的周围动态驾驶场景
我们提出了DrivingGaussian++,一种高效且有效的框架,用于现实重建和可控编辑周围动态自主驾驶场景。DrivingGaussian++使用增量3D高斯模型静态背景,并用复合动态高斯图重建移动对象,确保准确的位置和遮挡。通过整合LiDAR先验,它实现了详细的且一致的场景重建,优于现有方法在动态场景重建和照片写实的全景视图合成方面的表现。DrivingGaussian++支持无需训练的动态驾驶场景可控编辑,包括纹理修改、天气模拟和对象操作,利用多视图图像和深度先验。通过整合大型语言模型(LLMs)和可控编辑,我们的方法可以在优化过程中自动生成动态对象运动轨迹并增强其现实感。DrivingGaussian++展示了一致且现实的编辑结果,并生成动态多视图驾驶场景,显著增强了场景多样性。更多结果和代码可在项目网站上找到:https://xiong-creator.github.io/DrivingGaussian_plus.github.io
Understanding and evaluating computer vision models through the lens of counterfactuals
Authors: Pushkar Shukla
First: 2025-08-28T15:11:49+00:00 · Latest: 2025-08-28T15:11:49+00:00
Abstract
Counterfactual reasoning -- the practice of asking "what if" by varying inputs and observing changes in model behavior -- has become central to interpretable and fair AI. This thesis develops frameworks that use counterfactuals to explain, audit, and mitigate bias in vision classifiers and generative models. By systematically altering semantically meaningful attributes while holding others fixed, these methods uncover spurious correlations, probe causal dependencies, and help build more robust systems. The first part addresses vision classifiers. CAVLI integrates attribution (LIME) with concept-level analysis (TCAV) to quantify how strongly decisions rely on human-interpretable concepts. With localized heatmaps and a Concept Dependency Score, CAVLI shows when models depend on irrelevant cues like backgrounds. Extending this, ASAC introduces adversarial counterfactuals that perturb protected attributes while preserving semantics. Through curriculum learning, ASAC fine-tunes biased models for improved fairness and accuracy while avoiding stereotype-laden artifacts. The second part targets generative Text-to-Image (TTI) models. TIBET provides a scalable pipeline for evaluating prompt-sensitive biases by varying identity-related terms, enabling causal auditing of how race, gender, and age affect image generation. To capture interactions, BiasConnect builds causal graphs diagnosing intersectional biases. Finally, InterMit offers a modular, training-free algorithm that mitigates intersectional bias via causal sensitivity scores and user-defined fairness goals. Together, these contributions show counterfactuals as a unifying lens for interpretability, fairness, and causality in both discriminative and generative models, establishing principled, scalable methods for socially responsible bias evaluation and mitigation.
Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models
Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu
First: 2025-03-14T15:42:42+00:00 · Latest: 2025-08-28T14:55:38+00:00
Comments: This paper is accepted by IJCAI2025 Workshop on Deepfake Detection, Localization, and Interpretability
Abstract
Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.
中文标题/摘要
标题:探索跨模态生成模型中的图形排版视觉提示注入威胁
当前的跨模态生成模型(GMs)在各种生成任务中表现出显著的能力。鉴于现实世界场景中视觉模态输入的普遍性和信息丰富性,包括视觉语言感知(VLP)和图像到图像(I2I)在内的跨视觉任务引起了广泛关注。大型视觉语言模型(LVLMs)和I2I生成模型(GMs)分别用于处理VLP和I2I任务。先前的研究表明,在输入图像中印刷图形排版文字会显著诱导LVLMs和I2I GMs生成与这些文字语义一致的破坏性输出。此外,作为更复杂的图形排版形式,视觉提示也被发现对跨视觉任务的各种应用构成了安全风险。然而,视觉提示所造成的威胁的具体特征仍待进一步探索。在本文中,为了全面调查图形排版视觉提示注入(TVPI)对各种LVLMs和I2I GMs性能影响,我们提出了图形排版视觉提示注入数据集,并在具有不同目标语义的视觉提示下对各种开源和闭源LVLMs和I2I GMs进行了彻底的安全风险评估,加深了对TVPI威胁的理解。
Summary / 总结
This paper investigates the security threats posed by typographic visual prompts in cross-modality generation models. It proposes a dataset to evaluate the impact of typographic visual prompt injection (TVPI) on various large vision language models (LVLMs) and image-to-image (I2I) generation models. The study reveals that visual prompts can induce these models to produce semantically aligned but disruptive outputs, highlighting the need for better security measures in cross-vision tasks.
本文研究了图文提示注入(TVPI)在跨模态生成模型中的安全威胁。提出了一组数据集来评估图文提示注入对各种大型视觉语言模型(LVLM)和图像到图像(I2I)生成模型的影响。研究发现,图文提示可以促使这些模型生成与提示语义一致但具有破坏性的输出,强调了跨视觉任务中需要更好的安全措施。
Learning Primitive Embodied World Models: Towards Scalable Robotic Learning
Authors: Qiao Sun, Liujia Yang, Wei Tang, Wei Huang, Kaixin Xu, Yongchao Chen, Mingyu Liu, Jiange Yang, Haoyi Zhu, Yating Wang, Tong He, Yilun Chen, Xili Dai, Nanyang Ye, Qinying Gu
First: 2025-08-28T14:31:48+00:00 · Latest: 2025-08-28T14:31:48+00:00
Abstract
While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation--hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling--Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
中文标题/摘要
标题:学习原始具身世界模型:迈向可扩展的机器人学习
尽管基于视频生成的具身世界模型逐渐受到关注,但它们对大规模具身交互数据的依赖仍然是一个关键瓶颈。具身数据的稀缺性、收集难度和高维度性从根本上限制了语言与动作之间的对齐精度,并加剧了长时序视频生成的挑战——阻碍生成模型在具身领域实现“GPT时刻”。一个简单的观察是:具身数据的多样性远超过可能的原始动作的小空间。基于这一洞察,我们提出了一种新的世界建模范式——原始具身世界模型(PEWM)。通过限制视频生成到固定短时序,我们的方法1) 使语言概念与机器人动作的视觉表示之间的对齐更加精细,2) 减少学习复杂性,3) 改进具身数据收集的数据效率,4) 减少推理延迟。通过配备模块化视觉-语言模型(VLM)规划器和起始-目标热图引导机制(SGG),PEWM 进一步实现了灵活的闭环控制,并支持在扩展和复杂任务中对原始级策略的组合泛化。我们的框架利用视频模型中的时空视觉先验和 VLM 的语义意识,弥合了精细物理交互与高层推理之间的差距,为可扩展、可解释和通用的具身智能铺平了道路。
Summary / 总结
This paper addresses the challenge of aligning language and actions in embodied learning by proposing Primitive Embodied World Models (PEWM). PEWM focuses on short horizons to achieve fine-grained alignment, reduce learning complexity, and improve data efficiency. The approach uses a modular Vision-Language Model planner and Start-Goal heatmap Guidance to support flexible closed-loop control and compositional generalization of primitive-level policies for complex tasks, thereby enhancing interpretability and scalability in embodied intelligence.
本文提出了Primitive Embodied World Models (PEWM) 来解决语言与动作在体态学习中的对齐问题。PEWM 专注于短时间范围内的生成,以实现细粒度对齐、降低学习复杂度和提高数据效率。该方法利用模块化的视觉-语言模型规划器和起始-目标热图引导机制来支持对复杂任务中原始级别策略的灵活闭环控制和组合泛化,从而增强体态智能的可解释性和可扩展性。
Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation
Authors: Krit Duangprom, Tryphon Lambrou, Binod Bhattarai
Venue: MICCAI 2025
First: 2025-08-28T14:25:32+00:00 · Latest: 2025-08-28T14:25:32+00:00
Comments: Accepted to MICCAI 2025
Abstract
This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision Language Models (VLMs) fine-tuned using Low-Rank Adaptation (LoRA). Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting in small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work in 3D surgical hands and tools pose estimation.
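For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update in plain Python: the frozen weight W is left untouched and a trainable product B·A, scaled by alpha/rank, is added to its output. It illustrates LoRA in general, not the pipeline of this paper; the matrix shapes, values, and helper names are made up for the example.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha, rank):
    """h = W x + (alpha / rank) * B (A x), with W frozen and A, B trainable.

    W: d_out x d_in, A: rank x d_in, B: d_out x rank.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. LoRA for one weight matrix."""
    return d_out * d_in, rank * (d_out + d_in)


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]          # rank 1
B_zero = [[0.0], [0.0]]   # B starts at zero, so the adapted layer equals the frozen one
print(lora_forward([3.0, 4.0], W, A, B_zero, alpha=2.0, rank=1))  # → [3.0, 4.0]
print(lora_param_counts(4096, 4096, 8))  # → (16777216, 65536)
```

The small trainable-parameter count is one plausible reason the approach suits small medical datasets, though the abstract itself attributes the gains to the generalization of the pre-trained VLM.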
Improving Fine-Grained Control via Aggregation of Multiple Diffusion Models
Authors: Conghan Yue, Zhengwei Peng, Shiyan Du, Zhi Ji, Chuangjian Cai, Le Wan, Dongyu Zhang
First: 2024-10-02T06:16:06+00:00 · Latest: 2025-08-28T14:03:26+00:00
Abstract
While many diffusion models perform well when controlling particular aspects such as style, character, and interaction, they struggle with fine-grained control due to dataset limitations and intricate model architecture design. This paper introduces a novel training-free algorithm, independent of denoising network architectures, for fine-grained generation, called Aggregation of Multiple Diffusion Models (AMDM). The algorithm integrates features from multiple diffusion models into a specified model to activate particular features and enable fine-grained control. Experimental results demonstrate that AMDM significantly improves fine-grained control without training, validating its effectiveness. Additionally, it reveals that diffusion models initially focus on features such as position, attributes, and style, with later stages improving generation quality and consistency. AMDM offers a new perspective for tackling the challenges of fine-grained conditional generation in diffusion models. Specifically, it allows us to fully utilize existing or develop new conditional diffusion models that control specific aspects, and then aggregate them using the AMDM algorithm. This eliminates the need for constructing complex datasets, designing intricate model architectures, and incurring high training costs. Code is available at: https://github.com/Hammour-steak/AMDM.
中文标题/摘要
标题:通过多扩散模型聚合提高细粒度控制
尽管许多扩散模型在控制特定方面如风格、角色和交互时表现良好,但在细粒度控制方面由于数据集限制和复杂的模型架构设计,它们面临挑战。本文介绍了一种无需训练的新型算法,该算法独立于去噪网络架构,称为多扩散模型聚合(AMDM)。该算法将多个扩散模型的特征整合到指定模型中,以激活特定特征并实现细粒度控制。实验结果表明,AMDM在无需训练的情况下显著提高了细粒度控制能力,验证了其有效性。此外,它揭示了扩散模型最初关注位置、属性和风格等特征,后期阶段则提高生成质量和一致性。AMDM为解决扩散模型中的细粒度条件生成挑战提供了新视角。具体而言,它允许我们充分利用现有或开发新的控制特定方面的条件扩散模型,并通过AMDM算法将它们聚合起来。这消除了构建复杂数据集、设计复杂模型架构和高训练成本的需要。代码可在:https://github.com/Hammour-steak/AMDM 获取。
Evaluating Compositional Generalisation in VLMs and Diffusion Models
Authors: Beth Pearson, Bilal Boulbarss, Michael Wray, Martha Lewis
First: 2025-08-28T13:45:04+00:00 · Latest: 2025-08-28T13:45:04+00:00
Comments: 11 pages including references, 6 figures. Accepted at IWCS 2025
Abstract
A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a 'bag-of-words' and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. In this work we explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models -- Diffusion Classifier, CLIP, and ViLT -- on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. Code and dataset are available at: https://github.com/otmive/diffusion_classifier_clip
Summary / 总结
This study evaluates the compositional generalization capabilities of Vision-Language Models (VLMs) and Diffusion Models. It compares a Diffusion Classifier, CLIP, and ViLT on their ability to bind objects with attributes and relations in zero-shot and generalized zero-shot learning settings. The results indicate that while the Diffusion Classifier and ViLT perform well in concept binding tasks, all models struggle significantly with relational reasoning, highlighting the challenges VLMs face in relational understanding. Analysis of CLIP embeddings suggests that the difficulty might arise from similar representations of relational concepts like left and right.
该研究评估了视觉语言模型(VLMs)和扩散模型在组成性泛化能力上的表现。它比较了扩散分类器、CLIP 和 ViLT 在零样本学习和泛化零样本学习设置中将对象与其属性和关系绑定的能力。结果表明,尽管扩散分类器和 ViLT 在概念绑定任务中表现良好,但所有模型在关系推理任务中都遇到了显著困难,突显了 VLMs 在关系理解方面面临的挑战。CLIP嵌入的分析表明,这种困难可能源于对左右等关系概念的表示过于相似。
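The 'bag-of-words' failure mentioned in the abstract can be illustrated with a toy representation that only counts words. This is not CLIP and says nothing about CLIP's actual embeddings; it only shows why any order-insensitive representation cannot tell which attribute is bound to which object. The captions and function name are made up for the example.

```python
from collections import Counter


def bag_of_words(caption):
    """Order-insensitive representation: word counts only."""
    return Counter(caption.lower().split())


correct = bag_of_words("red cube and blue cylinder")
swapped = bag_of_words("red cylinder and blue cube")

# Same multiset of words, so the two captions are indistinguishable
# even though the attribute-object bindings differ.
print(correct == swapped)  # → True
```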
Occlusion Robustness of CLIP for Military Vehicle Classification
Authors: Jan Erik van Woerden, Gertjan Burghouts, Lotte Nijskens, Alma M. Liezenga, Sabina van Rooij, Frank Ruis, Hugo J. Kuijf
First: 2025-08-28T13:16:55+00:00 · Latest: 2025-08-28T13:16:55+00:00
Comments: To be presented at SPIE: Sensors + Imaging, Artificial Intelligence for Security and Defence Applications II
Abstract
Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP's robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants' robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models sharply drops at around 35% occlusion, (4) by finetuning the model's backbone, this performance drop occurs at more than 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.
中文标题/摘要
标题:CLIP在军事车辆分类中的遮挡鲁棒性
视觉-语言模型(VLMs)如CLIP通过在共享嵌入空间中对齐图像和文本实现零样本分类,为缺乏标注数据的防御应用提供了优势。然而,CLIP在具有部分遮挡和信噪比(SNR)降级的挑战性军事环境中的鲁棒性尚未得到充分探索。我们使用包含18类军事车辆的自定义数据集研究了CLIP变体对遮挡的鲁棒性,并使用归一化曲线下面积(NAUC)在不同遮挡百分比下进行评估。研究结果得出四个关键见解:(1)基于Transformer的CLIP模型始终优于CNN,(2)细粒度、分散的遮挡比大面积连续遮挡对性能影响更大,(3)尽管准确率有所提高,但在约35%遮挡时,线性探查模型的性能急剧下降,(4)通过微调模型的骨干网络,性能下降发生在超过60%遮挡时。这些结果强调了在训练过程中使用遮挡特定增强的重要性,并指出了需要进一步探索像素级敏感性和架构鲁棒性以实现CLIP在实际部署中的应用。
Summary / 总结
This study evaluates the occlusion robustness of CLIP models for military vehicle classification using a custom dataset of 18 vehicle classes. The research finds that transformer-based CLIP models outperform CNNs, with fine-grained and dispersed occlusions causing more performance degradation than larger contiguous occlusions. The performance of linear-probed models drops sharply at around 35% occlusion, while finetuning the model's backbone delays this drop to more than 60% occlusion. These findings highlight the need for occlusion-specific augmentations and further exploration into patch-level sensitivity and architectural resilience for real-world deployment.
研究通过使用包含18类军事车辆的自定义数据集,评估了CLIP模型在遮挡情况下的鲁棒性。主要发现包括基于Transformer的CLIP模型优于CNN模型,细粒度的遮挡对性能影响更大,线性探针模型在约35%遮挡时性能急剧下降,通过微调模型的骨干网络,这一性能下降可以在超过60%遮挡时得到缓解。
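The abstract names Normalized Area Under the Curve (NAUC) across occlusion percentages but does not define the normalization. One plausible reading, assumed here, is the trapezoidal area under the accuracy-versus-occlusion curve divided by the occlusion range, so that a model with perfect accuracy at every occlusion level scores 1.0. The function name and sample numbers are hypothetical.

```python
def normalized_auc(occlusion, accuracy):
    """Trapezoidal area under accuracy vs. occlusion, divided by the x-range.

    Assumption: this normalization is a guess at what NAUC means in the paper.
    occlusion: increasing occlusion levels (e.g. percentages).
    accuracy: accuracy in [0, 1] measured at each level.
    """
    if len(occlusion) != len(accuracy) or len(occlusion) < 2:
        raise ValueError("need at least two matching (occlusion, accuracy) points")
    area = 0.0
    for i in range(1, len(occlusion)):
        width = occlusion[i] - occlusion[i - 1]
        if width <= 0:
            raise ValueError("occlusion levels must be strictly increasing")
        area += width * (accuracy[i] + accuracy[i - 1]) / 2.0
    return area / (occlusion[-1] - occlusion[0])


# Hypothetical curve: accuracy falls linearly from 1.0 to 0.0.
print(normalized_auc([0, 50, 100], [1.0, 0.5, 0.0]))  # → 0.5
```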
NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks
Authors: Aritra Dutta, Swapnanil Mukherjee, Deepanway Ghosal, Somak Aditya
First: 2025-08-27T09:34:28+00:00 · Latest: 2025-08-28T12:05:33+00:00
Abstract
Commonsense visual-question answering often hinges on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present an end-to-end framework (NLKI) that (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs respectively across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved using a fine-tuned ColBERTv2 and an object information-enriched prompt yield explanations that largely cut down hallucinations, while lifting the end-to-end answer accuracy by up to 7% (across 3 datasets), making FLAVA and other models in NLKI match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. As these benchmarks contain 10-25% label noise, additional finetuning using noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% in CRIC, and 5.5% in AOKVQA. Our findings expose when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models in the context of external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M models.
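Symmetric cross entropy, one of the noise-robust losses named above, adds a reverse cross-entropy term to the usual cross entropy; the reverse term is bounded, which limits the penalty from a mislabeled example. The sketch below computes it for a single example with a one-hot label. The weights `alpha` and `beta` and the clipped value `log_zero` used in place of log 0 are illustrative defaults, not the settings used in the paper.

```python
import math


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """alpha * CE + beta * RCE for one example with a one-hot label.

    probs: predicted class probabilities.
    label: index of the (possibly noisy) target class.
    log_zero: finite value substituted for log(0) in the reverse term.
    """
    eps = 1e-12
    # Standard cross entropy: -log p(label).
    ce = -math.log(max(probs[label], eps))
    # Reverse cross entropy: -sum_k p_k * log q_k, with q the one-hot label,
    # so log q_k is 0 for the labeled class and log_zero elsewhere.
    rce = -sum(p * (0.0 if k == label else log_zero) for k, p in enumerate(probs))
    return alpha * ce + beta * rce


print(round(symmetric_cross_entropy([0.5, 0.5], label=0), 4))  # → 2.6931
```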
"Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection
Authors: Anastasios Skoularikis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos, Panagiotis C. Petrantonakis
First: 2025-08-28T11:22:15+00:00 · Latest: 2025-08-28T11:22:15+00:00
Abstract
Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to "in the wild" content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.
中文标题/摘要
标题:"幽默、艺术还是误导信息?": 一种面向意图的合成图像检测多模态数据集
近年来,多模态AI的进步推动了对合成和脱离上下文内容检测的进展。然而,现有努力大多忽略了AI生成图像背后的意图。为填补这一空白,我们引入了S-HArM,这是一个面向意图的多模态数据集,包含来自Twitter/X和Reddit的9,576个“野生”图像-文本对,并标记为幽默/讽刺、艺术或误导信息。此外,我们探索了三种提示策略(图像导向、描述导向和多模态导向)来构建大规模合成训练数据集,使用Stable Diffusion。我们进行了广泛的比较研究,包括模态融合、对比学习、重建网络、注意力机制和大型视觉-语言模型。我们的结果显示,基于图像和多模态导向数据训练的模型在“野生”内容上的泛化能力更强,因为保留了视觉上下文。然而,总体性能仍然有限,突显了推断意图的复杂性以及需要专门架构的需求。
Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music
Authors: Hongju Su, Ke Li, Lan Yang, Honggang Zhang, Yi-Zhe Song
First: 2025-08-28T11:15:44+00:00 · Latest: 2025-08-28T11:15:44+00:00
Comments: Under review
Abstract
Existing state-of-the-art symbolic music generation models predominantly adopt autoregressive or hierarchical autoregressive architectures, modelling symbolic music as a sequence of attribute tokens with unidirectional temporal dependencies, under the assumption of a fixed, strict dependency structure among these attributes. However, we observe that using different attributes as the initial token in these models leads to comparable performance. This suggests that the attributes of a musical note are, in essence, a concurrent and unordered set, rather than a temporally dependent sequence. Based on this insight, we introduce Amadeus, a novel symbolic music generation framework. Amadeus adopts a two-level architecture: an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. To enhance performance, we propose the Music Latent Space Discriminability Enhancement Strategy (MLSDES), incorporating contrastive learning constraints that amplify the discriminability of intermediate music representations. The Conditional Information Enhancement Module (CIEM) simultaneously strengthens note latent vector representation via attention mechanisms, enabling more precise note decoding. We conduct extensive experiments on unconditional and text-conditioned generation tasks. Amadeus significantly outperforms SOTA models across multiple metrics while achieving at least 4$\times$ speed-up. Furthermore, we demonstrate the feasibility of training-free, fine-grained note attribute control using our model. To explore the upper performance bound of the Amadeus architecture, we compile the largest open-source symbolic music dataset to date, AMD (Amadeus MIDI Dataset), supporting both pre-training and fine-tuning.
中文标题/摘要
标题:阿玛迪乌斯:双向属性建模的自回归音乐生成模型
现有的最先进的符号音乐生成模型主要采用自回归或分层自回归架构,将符号音乐建模为具有单向时间依赖性的属性令牌序列,假设这些属性之间存在固定且严格的依赖结构。然而,我们观察到,在这些模型中使用不同的属性作为初始令牌会导致相当的性能。这表明,音乐音符的属性本质上是一个并发且无序的集合,而不是一个时间依赖的序列。基于这一洞察,我们引入了阿玛迪乌斯,一种新颖的符号音乐生成框架。阿玛迪乌斯采用两层架构:用于音符序列的自回归模型和用于属性的双向离散扩散模型。为了提高性能,我们提出了音乐潜在空间判别性增强策略(MLSDES),结合对比学习约束以增强中间音乐表示的判别性。条件信息增强模块(CIEM)同时通过注意力机制增强音符潜在向量表示,使音符解码更加精确。我们在无条件和文本条件生成任务上进行了广泛的实验。阿玛迪乌斯在多个指标上显著优于当前最佳模型,同时实现至少4倍的速度提升。此外,我们展示了使用我们的模型实现无训练的细粒度音符属性控制的可行性。为了探索阿玛迪乌斯架构的性能上限,我们编译了迄今为止最大的开源符号音乐数据集AMD(阿玛迪乌斯MIDI数据集),支持预训练和微调。
Summary / 总结
Amadeus is a novel symbolic music generation framework that addresses the limitations of existing autoregressive models by introducing a two-level architecture with an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. It also includes MLSDES and CIEM to enhance performance. Amadeus outperforms state-of-the-art models across multiple metrics and achieves at least 4 times speed-up. Additionally, it demonstrates the feasibility of fine-grained note attribute control without training.
Amadeus 是一种新颖的符号音乐生成框架,通过引入属性的双向离散扩散模型来解决现有自回归模型的局限性。该模型采用两层架构,并包含如 MLSDES 和 CIEM 等策略以提升性能。实验表明,Amadeus 在多个指标上显著优于最先进的模型,并且至少快 4 倍。此外,它还能够在无需额外训练的情况下实现对音符属性的精细控制。
Enhancing Document VQA Models via Retrieval-Augmented Generation
Authors: Eric López, Artemis Llabrés, Ernest Valveny
First: 2025-08-26T12:32:55+00:00 · Latest: 2025-08-28T10:31:44+00:00
Comments: Accepted at Workshop on Machine Learning in Document Analysis and Recognition (ICDAR WML 2025), Wuhan, China
Abstract
Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.
Summary / 总结
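This paper systematically evaluates adding Retrieval-Augmented Generation (RAG) to Document VQA, comparing text-based retrieval over OCR tokens with purely visual retrieval that needs no OCR. On the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS and the visual variant by +5.0 ANLS. An ablation attributes most of the gain to the retrieval and reranking components, while layout-guided chunking does not help on these datasets.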
本文探讨了将检索增强生成(RAG)集成到文档视觉问答(Document VQA)模型中,以解决处理多页文档的内存挑战。通过在不同检索方法和多个基准上的系统评估,文本中心的RAG变体将基线提高了最多22.5 ANLS,而纯视觉变体在无需OCR的情况下实现了5.0 ANLS的改进。研究证实,检索和重排序对于性能提升至关重要,而布局导向的分块策略在这类数据集上并未显著受益。实验结果强调了在多页基准上的文档视觉问答系统中仔细选择证据的重要性。
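The gains in this entry are reported in ANLS (Average Normalized Levenshtein Similarity). A minimal sketch of the metric as it is commonly defined is given below; the 0.5 threshold and the lowercase/strip normalisation are the usual conventions and are assumptions here, not details taken from the paper. The function returns a value in [0, 1], whereas reported figures are typically scaled by 100.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best thresholded similarity against any reference answer."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Answers that are too far from the reference score zero.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions) if predictions else 0.0
```

The threshold keeps near-miss OCR errors partially rewarded while giving no credit to unrelated answers.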
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Authors: Haodong Duan, Xinyu Fang, Junming Yang, Xiangyu Zhao, Yuxuan Qiao, Mo Li, Amit Agarwal, Zhe Chen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Jixuan Chen, Enxin Song, Song Mao, Shengyuan Ding, Tianhao Liang, Zicheng Zhang, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen
First: 2024-07-16T13:06:15+00:00 · Latest: 2025-08-28T09:40:49+00:00
Comments: Updated on 2025.08.28, data cut down to 2025.06.30
Abstract
We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement more than 200 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 80 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released on https://github.com/open-compass/VLMEvalKit and is actively maintained.
中文标题/摘要
标题:VLMEvalKit:一个基于PyTorch的开源多模态模型评估工具包
我们介绍了VLMEvalKit:一个基于PyTorch的开源多模态模型评估工具包。该工具包旨在为研究人员和开发人员提供一个用户友好且全面的框架,用于评估现有的多模态模型并发布可重复的评估结果。在VLMEvalKit中,我们实现了超过200种不同的大型多模态模型,包括专有API和开源模型,以及超过80种不同的多模态基准。通过实现单一接口,新模型可以轻松添加到工具包中,而工具包会自动处理其余的工作负载,包括数据准备、分布式推理、预测后处理和指标计算。尽管该工具包目前主要用于评估大型视觉-语言模型,但其设计兼容未来可能增加其他模态(如音频和视频)的更新。基于使用该工具包获得的评估结果,我们托管了OpenVLM排行榜,这是一个全面的排行榜,用于跟踪多模态学习研究的进展。该工具包发布在https://github.com/open-compass/VLMEvalKit,并且正在积极维护。
Towards Mechanistic Defenses Against Typographic Attacks in CLIP
Authors: Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
First: 2025-08-28T09:08:30+00:00 · Latest: 2025-08-28T09:08:30+00:00
Abstract
Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
中文标题/摘要
标题:面向CLIP的机械防御机制对抗 typographic 攻击
typographic 攻击通过向图像中注入文本来利用多模态系统,导致目标误分类、恶意内容生成,甚至视觉语言模型的越狱。在本研究中,我们分析了CLIP视觉编码器在 typographic 攻击下的行为,发现模型后半部分层中的特定注意力头因果性地提取并传递 typographic 信息至 cls 令牌。基于这些见解,我们提出了一种通过选择性地消除 typographic 电路(由注意力头组成)来防御 CLIP 模型的 typographic 攻击的方法。无需微调,我们的方法在 typographic 变体的 ImageNet-100 上性能提升高达 19.6%,同时 ImageNet-100 准确率下降不到 1%。值得注意的是,我们的无需训练的方法在与依赖微调的当前最先进的 typographic 防御方法的竞争中保持竞争力。为此,我们发布了具有显著更强 typographic 攻击鲁棒性的 dyslexic CLIP 模型,这些模型适合作为广泛的安全关键应用的即插即用替代品,其中文本操纵的风险超过了文本识别的实用性。
Summary / 总结
This work investigates typographic attacks on CLIP models and identifies specialized attention heads in the latter half of the vision encoder that transmit typographic information to the cls token. The authors propose a defense that selectively ablates these attention heads, improving performance by up to 19.6% on a typographic variant of ImageNet-100 while reducing standard ImageNet-100 accuracy by less than 1%. The approach is training-free and remains competitive with state-of-the-art defenses that rely on finetuning. Dyslexic CLIP models are also released, which are more robust against typographic attacks and suitable for safety-critical applications.
该研究通过分析CLIP视觉编码器如何处理 typographic 信息并识别负责传输这些信息的具体注意力头,来应对CLIP模型的 typographic 攻击。作者提出了一种通过选择性消除这些注意力头的方法来防御 typographic 攻击,这在 typographic 变体的 ImageNet-100 上可提高性能高达 19.6%,且对标准 ImageNet-100 准确性的影响很小。该方法无需训练且优于当前需要 finetuning 的最先进的防御方法。此外,该研究还引入了更抗 typographic 攻击的 dyslexic CLIP 模型,适用于存在文本操纵风险的安全关键应用。
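The idea of ablating a circuit of attention heads can be illustrated with a toy sketch. It relies on the standard fact that a multi-head attention layer's output is a sum of per-head contributions, so removing a head amounts to dropping its term. Zero-ablation, the function name, and the data layout are assumptions made for illustration; the abstract does not say which ablation variant the authors use or how the heads are selected.

```python
def ablate_heads(per_head_cls, ablated):
    """Sum per-head contributions to the cls token, skipping the ablated heads.

    per_head_cls: list of vectors, one per attention head (hypothetical layout).
    ablated: set of head indices treated as the typographic circuit.
    """
    dim = len(per_head_cls[0])
    out = [0.0] * dim
    for head, contribution in enumerate(per_head_cls):
        if head in ablated:
            continue  # zero-ablation: this head writes nothing to the cls token
        for d in range(dim):
            out[d] += contribution[d]
    return out
```

Because the change only masks existing terms, a defense of this shape needs no gradient updates, which matches the training-free claim in the abstract.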
MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning
Authors: Weihai Zhi, Jiayan Guo, Shangyang Li
First: 2025-08-28T08:41:32+00:00 · Latest: 2025-08-28T08:41:32+00:00
Comments: 8 pages, 5 figures
Abstract
The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks, while Reinforcement Learning (RL), a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To break this impasse, we introduce Generative Reward Learning for Medical Reasoning (MedGR$^2$), a novel framework that creates a self-improving virtuous cycle. MedGR$^2$ co-develops a data generator and a reward model, enabling the automated, continuous creation of high-quality, multi-modal medical data that serves as both a superior training source for SFT and RL. Our experiments demonstrate that SFT with MedGR$^2$-produced data already surpasses baselines trained on large-scale, human-curated datasets. Crucially, when leveraging this data for RL via Group Relative Policy Optimization (GRPO), our model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized RL-based methods. Furthermore, our compact model, empowered by MedGR$^2$, achieves performance competitive with foundation models possessing over 10 times more parameters. MedGR$^2$ presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking the full potential of RL for building truly generalizable medical AI.
中文标题/摘要
标题:MedGR$^2$:通过生成奖励学习打破医学推理的数据障碍
医学中视觉-语言模型(VLMs)的应用受到高质量、专家标注数据稀缺的严重限制。现有数据集上的监督微调(SFT)往往在未见过的模态和任务上表现不佳,而强化学习(RL)作为一种有前景的替代方案,由于数据稀缺领域缺乏可靠的奖励信号而受阻。为了解决这一困境,我们提出了医学推理的生成奖励学习(MedGR$^2$),这是一种新颖的框架,能够创建一个自我改进的良性循环。MedGR$^2$ 同时开发了一个数据生成器和一个奖励模型,使自动、持续地生成高质量的多模态医学数据成为可能,这些数据既可作为SFT和RL的优质训练源。我们的实验表明,使用MedGR$^2$生成的数据进行SFT已经超越了基于大规模、人工标注数据集的基线。关键的是,通过组相对策略优化(GRPO)利用这些数据进行RL时,我们的模型在跨模态和跨任务泛化方面达到了最先进的水平,显著优于专门的基于RL的方法。此外,我们的紧凑型模型,得益于MedGR$^2$,在性能上与参数量超过其10倍的预训练模型相当。MedGR$^2$ 为高风险领域中的数据高效学习提供了一个新范式,将问题从数据稀缺转变为数据生成,并解锁了RL在构建真正泛化医学AI方面的全部潜力。
Summary / 总结
MedGR$^2$ addresses the scarcity of high-quality medical data by introducing a Generative Reward Learning framework. This framework co-develops a data generator and a reward model, creating high-quality, multi-modal medical data for both supervised fine-tuning and reinforcement learning. Experiments show that models trained with MedGR$^2$-generated data outperform baselines trained on large-scale, human-curated datasets. Notably, when used for reinforcement learning, MedGR$^2$ achieves state-of-the-art generalization across modalities and tasks, outperforming specialized RL methods. Additionally, the compact model using MedGR$^2$ matches the performance of much larger foundation models.
MedGR$^2$通过引入生成奖励学习框架解决了高质量医学数据稀缺的问题。该框架同时开发数据生成器和奖励模型,能够生成高质量的多模态医学数据,用于监督微调和强化学习。实验表明,MedGR$^2$生成的数据优于基于大规模、人工标注数据的基线模型,且在强化学习中实现了跨模态和跨任务的最优泛化能力,超越了专门的RL方法。
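The abstract uses Group Relative Policy Optimization (GRPO) for the RL stage. The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that normalisation step; the use of the population standard deviation and the `eps` constant are assumptions, since implementations differ, and nothing here is taken from the MedGR$^2$ code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption, variants exist
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

A group in which every response receives the same reward yields all-zero advantages and so contributes no learning signal, which is one reason a reliable reward model matters in this setting.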
Language-to-Space Programming for Training-Free 3D Visual Grounding
Authors: Boyu Mi, Hanqing Wang, Tai Wang, Yilun Chen, Jiangmiao Pang
First: 2025-02-03T14:32:36+00:00 · Latest: 2025-08-28T07:57:55+00:00
Abstract
3D visual grounding (3DVG) is challenging due to the need to understand 3D spatial relations. While supervised approaches have achieved superior performance, they are constrained by the scarcity and high annotation costs of 3D vision-language datasets. Training-free approaches based on LLMs/VLMs eliminate the need for large-scale training data, but they either incur prohibitive grounding time and token costs or have unsatisfactory accuracy. To address the challenges, we introduce a novel method for training-free 3D visual grounding, namely Language-to-Space Programming (LaSP). LaSP introduces LLM-generated codes to analyze 3D spatial relations among objects, along with a pipeline that evaluates and optimizes the codes automatically. Experimental results demonstrate that LaSP achieves 52.9% accuracy on the Nr3D benchmark, ranking among the best training-free methods. Moreover, it substantially reduces the grounding time and token costs, offering a balanced trade-off between performance and efficiency.
中文标题/摘要
标题:语言到空间编程用于无训练3D视觉定位
3D视觉定位(3DVG)由于需要理解3D空间关系而具有挑战性。虽然监督方法取得了优异的性能,但它们受限于3D视觉语言数据集的稀缺性和高注释成本。基于LLM/VLM的无训练方法消除了大规模训练数据的需求,但它们要么导致高昂的定位时间和标记成本,要么准确率不令人满意。为了解决这些挑战,我们提出了一种新的无训练3D视觉定位方法,即语言到空间编程(LaSP)。LaSP引入了LLM生成的代码来分析对象之间的3D空间关系,并且包含一个自动评估和优化代码的流水线。实验结果表明,LaSP在Nr3D基准测试中达到了52.9%的准确率,排名在最好的无训练方法之中。此外,它显著减少了定位时间和标记成本,提供了性能和效率之间的平衡折衷。
Summary / 总结
The paper addresses the challenge of 3D visual grounding (3DVG) by introducing Language-to-Space Programming (LaSP), a training-free method that leverages LLM-generated codes to analyze 3D spatial relations. LaSP evaluates and optimizes these codes automatically, achieving 52.9% accuracy on the Nr3D benchmark and significantly reducing grounding time and token costs.
论文通过引入基于LLM生成代码分析3D空间关系的Language-to-Space Programming (LaSP)方法来解决3D视觉定位(3DVG)的挑战。LaSP自动评估和优化这些代码,使其在Nr3D基准测试中达到52.9%的准确率,并显著减少了定位时间和令牌成本。
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Authors: Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li
Venue: EMNLP 2025
First: 2025-08-22T08:23:09+00:00 · Latest: 2025-08-28T06:44:28+00:00
Comments: Accepted at EMNLP 2025 Main
Abstract
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model's speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.
Summary / 总结
SpecVLM is a speculative decoding framework for video large language models (Vid-LLMs) that prunes up to 90% of video tokens without losing accuracy. It uses a two-stage process: the first stage selects informative tokens guided by the verifier’s attention signals, and the second stage prunes redundant tokens uniformly. This approach achieves up to 2.68x decoding speedup for LLaVA-OneVision-72B and 2.11x for Qwen2.5-VL-32B on four video understanding benchmarks.
SpecVLM 是一种针对视频大型语言模型(Vid-LLMs)的推测性解码框架,通过减少视频令牌表示来实现高效的解码,同时不牺牲准确性。它使用两阶段的剪枝过程,由验证器的注意力信号引导,可以剪枝多达90%的视频令牌,分别在LLaVA-OneVision-72B和Qwen2.5-VL-32B上实现高达2.68倍和2.11倍的解码加速,这在四个视频理解基准上得到了验证。
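The two-stage pruning described in this entry can be sketched roughly as follows: first keep the tokens the verifier attends to most, then fill the rest of the budget by sampling the remaining tokens at even intervals. This is a simplified illustration, not the released implementation. The function name, the `stage1_fraction` split, and the use of 1-D token order as a stand-in for spatial uniformity are all assumptions; `keep_ratio=0.1` mirrors the "up to 90%" pruning figure.

```python
def two_stage_prune(attn_scores, keep_ratio=0.1, stage1_fraction=0.5):
    """Return the sorted indices of video tokens kept for the draft model."""
    n = len(attn_scores)
    budget = max(1, round(n * keep_ratio))
    k1 = min(budget, max(1, round(budget * stage1_fraction)))

    # Stage I: keep the k1 tokens with the highest verifier attention.
    ranked = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    stage1 = set(ranked[:k1])

    # Stage II: spend the remaining budget on evenly spaced leftover tokens.
    remaining = [i for i in range(n) if i not in stage1]
    k2 = budget - k1
    stage2 = []
    if k2 > 0 and remaining:
        step = len(remaining) / k2
        stage2 = [remaining[int(j * step)] for j in range(k2)]

    return sorted(stage1 | set(stage2))
```

Only the draft model sees the pruned sequence in this kind of scheme; the verifier still checks drafted tokens against the full input, which is what keeps speculative decoding lossless.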
Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models
Authors: Diogo Freitas, Brigt Håvardstun, Cèsar Ferri, Darío Garigliotti, Jan Arne Telle, José Hernández-Orallo
First: 2025-05-14T09:41:38+00:00 · Latest: 2025-08-28T06:16:17+00:00
Comments: 54 pages (42 pages of appendix). Accepted for publication at the ECAI 2025 conference
Abstract
Large language models have become multimodal, and many of them are said to integrate their modalities using common representations. If this were true, a drawing of a car as an image, for instance, should map to a similar area in the latent space as a textual description of the strokes that form the drawing. To explore this in a black-box access regime to these models, we propose the use of machine teaching, a theory that studies the minimal set of examples a teacher needs to choose so that the learner captures the concept. In this paper, we evaluate the complexity of teaching vision-language models a subset of objects in the Quick, Draw! dataset using two presentations: raw images as bitmaps and trace coordinates in TikZ format. The results indicate that image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based representations. But, surprisingly, the teaching size usually ranks concepts similarly across both modalities, even when controlling for (a human proxy of) concept priors, suggesting that the simplicity of concepts may be an inherent property that transcends modality representations.
Summary / 总结
This study investigates the complexity of teaching vision-language models to recognize a subset of objects from the Quick, Draw! dataset using either raw images or trace coordinates in TikZ format. By employing machine teaching, the researchers found that image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based representations. Surprisingly, however, teaching size usually ranks concepts similarly across both modalities, even when controlling for a human proxy of concept priors, indicating that the simplicity of concepts might be an inherent property that transcends modality.
该研究探讨了视觉-语言模型在不同模态下识别绘画的相对复杂性。通过使用机器教学,研究人员评估了使用原始图像和轨迹坐标两种方式教授模型识别来自Quick, Draw!数据集中的对象的复杂性。研究发现,基于图像的表示更有效,需要更少的片段并实现更高的准确性。然而,令人惊讶的是,两种模态下的教学规模相似,表明概念的简单性可能是独立于模态表示的固有属性。
Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models
Authors: Xin Huang, Ruibin Li, Tong Jia, Wei Zheng, Ya Wang
Venue: IJCAI 2025
First: 2025-05-21T14:28:43+00:00 · Latest: 2025-08-28T04:15:35+00:00
Comments: Accepted at the International Joint Conference on Artificial Intelligence (IJCAI 2025)
Abstract
Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model's discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs' performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.
中文标题/摘要
标题:视觉扰动与自适应难负样本对比学习在视觉语言模型中成分推理的应用
视觉语言模型(VLMs)对于多模态任务至关重要,尤其是成分推理(CR)任务,这些任务需要区分视觉和文本嵌入之间的细微语义差异。然而,现有方法主要通过生成基于文本的难负样本对模型进行微调,忽视了基于图像的负样本的重要性,导致视觉编码器训练不足,最终影响模型的整体性能。此外,负样本通常被均匀处理,没有考虑其难度级别,正样本对的对齐也不充分,这导致了难以对齐的样本对的对齐挑战。为了解决这些问题,我们提出了自适应难负样本扰动学习(AHNPL)。AHNPL将基于文本的难负样本转换到视觉域,生成语义上被干扰的图像负样本进行模型训练,从而提高其整体性能。AHNPL还引入了一种使用多模态难负样本损失的对比学习方法,以提高模型在每个模态内区分难负样本的能力,并引入了一种动态边际损失,根据样本难度调整对比边际,以增强困难样本对的区分能力。在三个公开数据集上的实验表明,我们的方法有效地提升了VLMs在复杂CR任务上的性能。源代码可在https://github.com/nynu-BDAI/AHNPL获取。
Summary / 总结
The research addresses the limitations of existing Vision-Language Models (VLMs) in compositional reasoning tasks, particularly the insufficient training of the visual encoder due to the lack of image-based negative samples and the uniform treatment of negative samples. The proposed Adaptive Hard Negative Perturbation Learning (AHNPL) translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives, and introduces a multimodal hard negative loss and a dynamic margin loss to improve model performance. Experiments on three public datasets show that AHNPL effectively enhances VLMs' performance on complex compositional reasoning tasks.
研究旨在通过解决现有方法主要依赖文本硬负样本的局限性,提高视觉语言模型(VLM)的组合推理能力。提出的自适应硬负样本扰动学习(AHNPL)方法将文本硬负样本转换为视觉域,生成语义上被干扰的图像负样本,并引入多模态硬负样本损失和动态边际损失,以增强模型的区分能力和样本对的区分。在三个公开数据集上的实验表明,AHNPL有效提升了VLM在复杂组合推理任务中的性能。
MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models
Authors: Xiao Li, Yanfan Zhu, Ruining Deng, Wei-Qi Wei, Yu Wang, Shilin Zhao, Yaohong Wang, Haichun Yang, Yuankai Huo
First: 2025-08-28T01:39:16+00:00 · Latest: 2025-08-28T01:39:16+00:00
Abstract
Recent advances in medical vision-language models (VLMs) open up remarkable opportunities for clinical applications such as automated report generation, copilots for physicians, and uncertainty quantification. However, despite their promise, medical VLMs introduce serious security concerns, most notably risks of Protected Health Information (PHI) exposure, data leakage, and vulnerability to cyberthreats - which are especially critical in hospital environments. Even when adopted for research or non-clinical purposes, healthcare organizations must exercise caution and implement safeguards. To address these challenges, we present MedFoundationHub, a graphical user interface (GUI) toolkit that: (1) enables physicians to manually select and use different models without programming expertise, (2) supports engineers in efficiently deploying medical VLMs in a plug-and-play fashion, with seamless integration of Hugging Face open-source models, and (3) ensures privacy-preserving inference through Docker-orchestrated, operating system agnostic deployment. MedFoundationHub requires only an offline local workstation equipped with a single NVIDIA A6000 GPU, making it both secure and accessible within the typical resources of academic research labs. To evaluate current capabilities, we engaged board-certified pathologists to deploy and assess five state-of-the-art VLMs (Google-MedGemma3-4B, Qwen2-VL-7B-Instruct, Qwen2.5-VL-7B-Instruct, and LLaVA-1.5-7B/13B). Expert evaluation covered colon cases and renal cases, yielding 1015 clinician-model scoring events. These assessments revealed recurring limitations, including off-target answers, vague reasoning, and inconsistent pathology terminology.
Summary / 总结
MedFoundationHub is a toolkit designed to address the security and usability challenges of deploying medical vision-language models (VLMs). It provides a graphical user interface for physicians to use VLMs without programming knowledge and supports efficient deployment of models through Docker, ensuring privacy-preserving inference. The evaluation involved board-certified pathologists deploying and assessing five state-of-the-art VLMs, revealing recurring limitations such as off-target answers and inconsistent terminology in clinical assessments.
MedFoundationHub 是一个工具包,旨在解决与医疗视觉语言模型(VLMs)相关的安全性和易用性挑战。它提供了一个图形用户界面,使医生无需编程知识即可使用 VLMs,并支持工程师以插件方式高效部署这些模型。该工具包通过 Docker 统一部署来确保隐私保护推理,仅需少量硬件资源。病理学家对五个最先进的 VLMs 的专家评估揭示了诸如偏离目标答案和术语不一致等反复出现的问题,突显了模型准确性和可靠性的改进空间。
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
Authors: Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Yang Zhang, Haohan Wang
First: 2025-08-28T00:07:10+00:00 · Latest: 2025-08-28T00:07:10+00:00
Comments: 54 pages
Abstract
As Large Language Models become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of "jailbreaks" into its diagnostics, in a component named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method finally culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We have empirically validated the effectiveness of GUARD on seven LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models, demonstrating its usage in promoting reliable LLM-based applications.
Summary / 总结
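GUARD is a testing method that turns high-level, government-issued ethics guidelines into specific guideline-violating questions for checking LLM compliance. When a response does not directly violate a guideline, its jailbreak-diagnostics component, GUARD-JD, builds scenarios that try to provoke a violating response and so expose ways around built-in safety mechanisms. The method produces a compliance report and was validated on LLMs including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7 under three government-issued guidelines. GUARD-JD also transfers to vision-language models.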
GUARD 是一种测试方法,旨在将伦理准则转化为具体问题以验证大型语言模型(LLM)的合规性。它使用自动化生成违背准则的问题,并结合监狱逃脱诊断来识别可能绕过安全机制的场景。GUARD 在七个 LLM 上进行了实证验证,包括 Vicuna-13B、LongChat-7B 和 Llama2-7B,并展示了其在促进可靠 LLM 应用程序中的使用。
Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation
Authors: Zhixiang Chi, Yanan Wu, Li Gu, Huan Liu, Ziqiang Wang, Yang Zhang, Yang Wang, Konstantinos N. Plataniotis
Venue: ICCV 2025
First: 2025-08-27T20:47:03+00:00 · Latest: 2025-08-27T20:47:03+00:00
Comments: ICCV 2025, code:https://github.com/chi-chi-zx/FSA
Abstract
CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention. However, this coherence is not consistently propagated to the final output because of subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations; this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.
Summary / 总结
This work addresses the challenge of open-vocabulary segmentation by proposing a training-free framework that enhances spatial coherence in CLIP models. The method adapts output-based patch-level correspondences to intermediate attention, leveraging the model's final predictions to improve semantic consistency. Experimental results show consistent performance improvements across various state-of-the-art approaches and attention types on multiple benchmarks.
该研究提出了一种无需训练的框架,以增强CLIP模型在开放词汇分割任务中的空间一致性。方法将基于输出的图像块级对应关系反馈到中间注意力,利用模型的最终预测来提升内部表示与最终预测之间的语义一致性。实验结果显示,该方法可作为插件集成到四种最先进的方法和三种骨干网络(ViT-B、ViT-L、ViT-H)中,并在多种注意力类型(Q-K、self-self,以及结合MAE、SAM和DINO的Proxy注意力)和八个基准测试上均能实现性能提升。
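The feedback idea in this entry (using output predictions as a spatial coherence prior for intermediate attention) can be illustrated with a toy sketch. This is not the authors' implementation: the function name `feedback_attention`, the hard same-class affinity, the `conf_threshold` and `mix` parameters, and the tiny 3-patch example are all assumptions made for illustration. It only shows the general shape of the idea: derive patch-to-patch correspondences from per-patch output predictions, skip low-confidence patches (a stand-in for confidence-based pruning), and blend the result into an existing attention map.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def feedback_attention(attn, patch_logits, conf_threshold=0.6, mix=0.5):
    """Blend an output-derived patch affinity into intermediate attention.

    attn:          N x N row-stochastic attention map (list of lists).
    patch_logits:  N x C patch-to-text scores taken from the model output.
    conf_threshold, mix: hypothetical knobs, not taken from the paper.
    """
    probs = [softmax(row) for row in patch_logits]
    labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
    confident = [max(p) >= conf_threshold for p in probs]

    adapted = []
    for i, row in enumerate(attn):
        if not confident[i]:
            # Sparse adaptation: low-confidence patches keep their attention.
            adapted.append(list(row))
            continue
        # Prior: uniform over confident patches predicted as the same class.
        prior = [1.0 if confident[j] and labels[j] == labels[i] else 0.0
                 for j in range(len(row))]
        total = sum(prior)  # >= 1 because patch i matches itself
        prior = [p / total for p in prior]
        adapted.append([(1 - mix) * a + mix * p for a, p in zip(row, prior)])
    return adapted


# Toy example: 3 patches, 2 classes, uniform starting attention.
attn = [[1 / 3, 1 / 3, 1 / 3] for _ in range(3)]
patch_logits = [[4.0, 0.0], [3.0, 0.0], [0.0, 0.2]]
adapted = feedback_attention(attn, patch_logits)
```

In the toy example, patches 0 and 1 are confidently assigned to the same class, so their mutual attention increases, while the low-confidence patch 2 is left untouched. A real system would operate on tensors inside the CLIP vision encoder and would likely use soft similarities rather than hard class agreement.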
A Novel Framework for Automated Explain Vision Model Using Vision-Language Models
Authors: Phu-Vinh Nguyen, Tan-Hanh Pham, Chris Ngo, Truong Son Hy
First: 2025-08-27T19:16:40+00:00 · Latest: 2025-08-27T19:16:40+00:00
Abstract
The development of many vision models mainly focuses on improving their performance using metrics such as accuracy, IoU, and mAP, with less attention to explainability due to the complexity of applying xAI methods to provide a meaningful explanation of trained models. Although many existing xAI methods aim to explain vision models sample-by-sample, methods explaining the general behavior of vision models, which can only be captured after running on a large dataset, are still underexplored. Furthermore, understanding the behavior of vision models on general images can be very important to prevent biased judgments and help identify the model's trends and patterns. With the application of Vision-Language Models, this paper proposes a pipeline to explain vision models at both the sample and dataset levels. The proposed pipeline can be used to discover failure cases and gain insights into vision models with minimal effort, thereby integrating vision model development with xAI analysis to advance image analysis.
中文标题/摘要
标题:一种基于视觉语言模型的自动化解释视觉模型的新框架
许多视觉模型的开发主要着眼于利用准确率、IoU和mAP等指标提升性能,而较少关注可解释性,因为应用xAI方法为训练好的模型提供有意义的解释较为复杂。尽管许多现有的xAI方法旨在逐样本解释视觉模型,但针对视觉模型总体行为(只有在大规模数据集上运行后才能捕捉)的解释方法仍未得到充分探索。此外,理解视觉模型在一般图像上的行为,对于防止有偏见的判断以及识别模型的趋势和模式非常重要。借助视觉语言模型,本文提出了一种可在样本和数据集两个层面解释视觉模型的流程。该流程能够以极小的代价发现失败案例并获得对视觉模型的洞察,从而将视觉模型开发与xAI分析结合起来,推动图像分析的发展。
Summary / 总结
Most vision model development focuses on performance metrics such as accuracy, IoU, and mAP, while explaining a model's general behavior across a large dataset remains underexplored. This paper proposes a Vision-Language Model based pipeline that explains vision models at both the sample and dataset levels, helping to discover failure cases and gain insight into model behavior with minimal effort.
OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations
Authors: Peng-Hao Hsu, Ke Zhang, Fu-En Wang, Tao Tu, Ming-Feng Li, Yu-Lun Liu, Albert Y. C. Chen, Min Sun, Cheng-Hao Kuo
First: 2025-08-27T17:17:00+00:00 · Latest: 2025-08-27T17:17:00+00:00
Comments: ICCV 2025
Abstract
Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET where posed RGB-D images are given but no human annotations of 3D boxes or classes are available. We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo-boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from 2D segments associated with each coherent 3D structure to align with the corresponding voxel feature. Training a highly accurate single-stage detector requires both losses to be learned toward high-quality targets. At inference, OpenM3D, a highly efficient detector, requires only multi-view images as input and demonstrates superior accuracy and speed (0.3 sec. per scene) on the ScanNet200 and ARKitScenes indoor benchmarks compared to existing methods. We outperform both a strong two-stage method that combines our class-agnostic detector with a ViT CLIP-based OV classifier and a baseline incorporating a multi-view depth estimator, in accuracy and in speed.
Segmentation Assisted Incremental Test Time Adaptation in an Open World
Authors: Manogna Sreenivas, Soma Biswas
First: 2025-08-27T16:33:32+00:00 · Latest: 2025-08-27T16:33:32+00:00
Comments: Accepted at BMVC 2025
Abstract
In dynamic environments, unfamiliar objects and distribution shifts are often encountered, which challenge the generalization abilities of deployed models. This work addresses Incremental Test Time Adaptation (ITTA) of Vision Language Models (VLMs), tackling scenarios where unseen classes and unseen domains continuously appear during testing. Unlike traditional Test Time Adaptation (TTA) approaches, where the test stream comes only from a predefined set of classes, our framework allows models to adapt simultaneously to both covariate and label shifts, actively incorporating new classes as they emerge. Towards this goal, we establish a new benchmark for ITTA, integrating single-image TTA methods for VLMs with active labeling techniques that query an oracle for samples potentially representing unseen classes during test time. We propose a segmentation-assisted active labeling module, termed SegAssist, which is training-free and repurposes the segmentation capabilities of VLMs to refine active sample selection, prioritizing samples likely to belong to unseen classes. Extensive experiments on several benchmark datasets demonstrate the potential of SegAssist to enhance the performance of VLMs in real-world scenarios, where continuous adaptation to emerging data is essential. Project page: https://manogna-s.github.io/segassist/
中文标题/摘要
标题:开放世界中的分割辅助增量测试时适应
在动态环境中,经常遇到不熟悉的对象和分布变化,这挑战了部署模型的泛化能力。本研究针对视觉语言模型的增量测试时适应问题,处理测试过程中不断出现的未见类别和未见领域的情况。与传统的测试时适应方法不同,后者仅从预定义的类别集合中获取测试流,我们的框架允许模型同时适应协变量和标签的变化,积极地将新类别纳入其中。为此,我们为增量测试时适应(ITTA)建立了新的基准,将面向视觉语言模型的单张图像测试时适应方法与主动标注技术相结合,后者在测试时就可能代表未见类别的样本向 oracle 查询。我们提出了一种分割辅助主动标注模块,称为SegAssist,该模块无需训练,并利用视觉语言模型的分割能力来精炼主动样本选择,优先选择可能属于未见类别的样本。在多个基准数据集上的广泛实验表明,SegAssist能够增强视觉语言模型在现实世界场景中的性能,其中持续适应新兴数据至关重要。项目页面:https://manogna-s.github.io/segassist/
Summary / 总结
This work addresses Incremental Test Time Adaptation (ITTA) for Vision Language Models (VLMs) in dynamic environments where unseen classes and domains continuously appear. The proposed framework allows models to adapt to both covariate and label shifts by incorporating new classes as they emerge. Its training-free, segmentation-assisted active labeling module, SegAssist, prioritizes samples likely to belong to unseen classes when querying an oracle, enhancing VLM performance in real-world scenarios that require continuous adaptation. Experiments on several benchmark datasets show the potential of SegAssist to improve VLM performance in such settings.
该研究针对动态环境中不断出现的未见类别和未见领域,解决视觉语言模型(VLMs)的增量测试时适应问题。提出了无需训练的分割辅助主动标注模块SegAssist,优先选择可能属于未见类别的样本向 oracle 查询,帮助VLMs同时适应协变量偏移和标签偏移。实验表明,SegAssist能够提升VLMs在需要持续适应新数据的现实场景中的性能。
SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control
Authors: Quanfeng Lu, Zhantao Ma, Shuai Zhong, Jin Wang, Dahai Yu, Michael K. Ng, Ping Luo
First: 2025-08-27T16:27:19+00:00 · Latest: 2025-08-27T16:27:19+00:00
Comments: 28 pages, 12 figures
Abstract
The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.
中文标题/摘要
标题:SWIRL:面向移动GUI控制的交错强化学习分阶段工作流
大型视觉语言模型(LVLM)和智能体系统的迅速发展,提升了人们对能够可靠地将自然语言转换为界面操作的移动GUI智能体的兴趣。然而,现有的单智能体方法仍然受到结构性约束的限制。尽管多智能体系统能够自然地解耦不同的能力,但多智能体强化学习(MARL)的最新进展往往受到效率低下的阻碍,并且与当前的LVLM架构不兼容。为了解决这些挑战,我们提出了SWIRL,这是一种为多智能体系统设计的交错强化学习分阶段工作流。SWIRL将MARL重新表述为一系列单智能体强化学习任务,每次更新一个智能体,同时保持其他智能体不变。这种表述形式使训练更加稳定,并促进了智能体之间的高效协调。理论上,我们提供了逐步安全性边界、跨轮次单调改进定理以及回报收敛保证,确保了稳健且有原则的优化。在移动GUI控制的应用中,SWIRL 实例化了一个导航器(Navigator),将语言和屏幕上下文转换为结构化计划,以及一个交互器(Interactor),将这些计划落实为可执行的原子动作。广泛的实验表明,SWIRL 在高层和低层GUI基准测试中均表现出优越的性能。除了GUI任务,SWIRL 还展示了强大的多智能体数学推理能力,突显了其作为开发高效、稳健的多智能体系统的通用框架的潜力。
Summary / 总结
SWIRL is a staged workflow for interleaved reinforcement learning in multi-agent systems, addressing the limitations of single-agent approaches and the inefficiencies in multi-agent reinforcement learning. By reformulating MARL into a sequence of single-agent tasks, SWIRL enables stable training and efficient coordination. Experiments show superior performance on GUI control tasks and strong capabilities in multi-agent mathematical reasoning, indicating its potential as a general framework for multi-agent systems.
SWIRL 是一种面向多智能体系统的交错强化学习分阶段工作流,旨在解决单智能体方法的局限性和多智能体强化学习的低效问题。通过将 MARL 重新表述为一系列单智能体任务,SWIRL 实现了稳定训练和高效协调。理论保证包括逐步安全性边界、跨轮次单调改进和回报收敛。实验结果表明,SWIRL 在高层和低层 GUI 基准测试中表现出色,并且在多智能体数学推理方面表现出强大的能力,表明其作为多智能体系统通用框架的潜力。
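The interleaved scheme described in this entry (update one agent at a time while the others stay fixed) can be illustrated with a toy sketch. This is not SWIRL itself, which trains LVLM-based agents with reinforcement learning. Here each "agent" is a single scalar parameter, the joint return is a hypothetical concave function, and each single-agent update is a finite-difference gradient step that is accepted only if the joint return does not decrease. The names `joint_return`, `improve_one`, and `interleaved_training` are illustrative assumptions.

```python
def joint_return(nav, act):
    """Hypothetical joint return of a two-agent system (concave, coupled)."""
    return -(nav - 1.0) ** 2 - (act - 2.0) ** 2 - 0.5 * (nav - act) ** 2


def improve_one(params, name, step=0.1, eps=1e-4):
    """One single-agent update: change params[name] only, others frozen.

    The candidate is accepted only if the joint return does not decrease,
    which makes the training curve monotone by construction.
    """
    base = joint_return(**params)
    bumped = dict(params)
    bumped[name] += eps
    grad = (joint_return(**bumped) - base) / eps  # finite-difference gradient
    candidate = dict(params)
    candidate[name] += step * grad
    return candidate if joint_return(**candidate) >= base else params


def interleaved_training(params, rounds=50, steps_per_stage=5):
    """Alternate stages: train 'nav' with 'act' fixed, then the reverse."""
    history = [joint_return(**params)]
    for _ in range(rounds):
        for name in ("nav", "act"):
            for _ in range(steps_per_stage):
                params = improve_one(params, name)
            history.append(joint_return(**params))
    return params, history


params, history = interleaved_training({"nav": 0.0, "act": 0.0})
```

Because only one parameter moves in each stage, every stage is an ordinary single-agent optimization problem, and the recorded return never decreases from one stage to the next. For this toy objective the procedure settles near `nav = 1.25`, `act = 1.75`, which is the joint maximizer.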
History