arXiv Paper Digest

VoCap: Video Object Captioning and Segmentation from Any Prompt

Authors: Jasper Uijlings, Xingyi Zhou, Xiuye Gu, Arsha Nagrani, Anurag Arnab, Alireza Fathi, David Ross, Cordelia Schmid

First: 2025-08-29T17:43:58+00:00 · Latest: 2025-08-29T17:43:58+00:00

Abs · PDF · Code1

Abstract

Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such, our model simultaneously addresses the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feeding the result to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google-deepmind/vocap.

Tree-Guided Diffusion Planner

Authors: Hyeonseong Jeon, Cheolhong Min, Jaesik Park

First: 2025-08-29T17:27:44+00:00 · Latest: 2025-08-29T17:27:44+00:00

Comments: 20 pages, 11 figures, 14 tables (main paper + appendix) / under review / project page will be available after the paper becomes public on arXiv

Abs · PDF

Abstract

Planning with pretrained diffusion models has emerged as a promising approach for solving test-time guided control problems. However, standard gradient guidance typically performs optimally under convex and differentiable reward landscapes, showing substantially reduced effectiveness in real-world scenarios involving non-convex objectives, non-differentiable constraints, and multi-reward structures. Furthermore, recent supervised planning approaches require task-specific training or value estimators, which limits test-time flexibility and zero-shot generalization. We propose a Tree-guided Diffusion Planner (TDP), a zero-shot test-time planning framework that balances exploration and exploitation through structured trajectory generation. We frame test-time planning as a tree search problem using a bi-level sampling process: (1) diverse parent trajectories are produced via training-free particle guidance to encourage broad exploration, and (2) sub-trajectories are refined through fast conditional denoising guided by task objectives. TDP addresses the limitations of gradient guidance by exploring diverse trajectory regions and harnessing gradient information across this expanded solution space using only pretrained models and test-time reward signals. We evaluate TDP on three diverse tasks: maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration. TDP consistently outperforms state-of-the-art approaches on all tasks. The project page can be found at: tree-diffusion-planner.github.io.

Summary

The Tree-guided Diffusion Planner (TDP) frames test-time planning as a tree search to address the limitations of gradient guidance under non-convex objectives and non-differentiable constraints. Its bi-level sampling process produces diverse parent trajectories via training-free particle guidance for broad exploration, then refines sub-trajectories through fast conditional denoising guided by task objectives. TDP outperforms state-of-the-art approaches on maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration.

PiCSAR: Probabilistic Confidence Selection And Ranking

Authors: Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen

First: 2025-08-29T17:03:47+00:00 · Latest: 2025-08-29T17:03:47+00:00

Abs · PDF

Abstract

Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
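The scoring rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each candidate already comes with per-token log-probabilities for its reasoning and for its final answer, and the names `joint_log_likelihood`, `select_best`, `reasoning_logprobs`, and `answer_logprobs` are hypothetical. No length normalisation is applied here; the paper may handle that differently.

```python
from typing import Dict, List, Tuple


def joint_log_likelihood(
    reasoning_logprobs: List[float], answer_logprobs: List[float]
) -> Tuple[float, float, float]:
    """Return (joint score, reasoning confidence, answer confidence).

    By the chain rule, log p(reasoning, answer | prompt) equals
    log p(reasoning | prompt) + log p(answer | reasoning, prompt),
    so the joint score is the sum of the two token-level sums.
    """
    reasoning_conf = sum(reasoning_logprobs)
    answer_conf = sum(answer_logprobs)
    return reasoning_conf + answer_conf, reasoning_conf, answer_conf


def select_best(candidates: List[Dict[str, List[float]]]) -> int:
    """Return the index of the candidate with the highest joint score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(
        range(len(candidates)),
        key=lambda i: joint_log_likelihood(
            candidates[i]["reasoning_logprobs"],
            candidates[i]["answer_logprobs"],
        )[0],
    )
```

In practice the per-token log-probabilities would come from the same model that generated the candidates, which is what keeps the selection step training-free.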

Summary

PiCSAR is a training-free best-of-n selection method that scores each candidate by the joint log-likelihood of its reasoning and final answer, without access to ground-truth answers. It reports gains of +10.18 on MATH500 and +9.81 on AIME2025, and outperforms baselines with at least 2x fewer samples in 16 out of 20 comparisons.

CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models

Authors: João Valente, Atabak Dehban, Rodrigo Ventura

First: 2025-08-29T15:57:43+00:00 · Latest: 2025-08-29T15:57:43+00:00

Abs · PDF

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities across various multimodal tasks. They continue, however, to struggle with trivial scenarios such as reading values from Digital Measurement Devices (DMDs), particularly in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur, which are common in head-mounted cameras and Augmented Reality (AR) applications. Motivated by these limitations, this work introduces CAD2DMD-SET, a synthetic data generation tool designed to support visual question answering (VQA) tasks involving DMDs. By leveraging 3D CAD models, advanced rendering, and high-fidelity image composition, our tool produces diverse, VQA-labelled synthetic DMD datasets suitable for fine-tuning LVLMs. Additionally, we present DMDBench, a curated validation set of 1,000 annotated real-world images designed to evaluate model performance under practical constraints. Benchmarking three state-of-the-art LVLMs using Average Normalised Levenshtein Similarity (ANLS), and then fine-tuning LoRAs of these models with the dataset generated by CAD2DMD-SET, yielded substantial improvements, with InternVL showing a score increase of 200% without degrading on other tasks. This demonstrates that the CAD2DMD-SET training dataset substantially improves the robustness and performance of LVLMs when operating under the previously stated challenging conditions. The CAD2DMD-SET tool is expected to be released as open-source once the final version of this manuscript is prepared, allowing the community to add different measurement devices and generate their own datasets.
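For readers unfamiliar with the metric named above, the following is a minimal sketch of ANLS as it is commonly defined for text-based VQA benchmarks. The threshold of 0.5, the lower-casing and whitespace stripping, and the function names `levenshtein` and `anls` are assumptions of this sketch; the abstract does not state which variant the authors use.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions: List[str], references: List[List[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity to any reference."""
    scores = []
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            # Answers that are too far from the reference score zero.
            sim = 1.0 - nl if nl < threshold else 0.0
            best = max(best, sim)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

A metric of this kind suits DMD readings because a prediction such as "12.6" against a reference of "12.5" receives partial credit rather than being counted as a complete miss.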

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

Authors: Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim

First: 2025-08-29T15:36:06+00:00 · Latest: 2025-08-29T15:36:06+00:00

Abs · PDF

Abstract

We present a novel training-free framework, PosterForest, for automated scientific poster generation. Unlike prior approaches, which largely neglect the hierarchical structure of scientific documents and the semantic integration of textual and visual elements, our method addresses both challenges directly. We introduce the Poster Tree, a hierarchical intermediate representation that jointly encodes document structure and visual-textual relationships at multiple levels. Our framework employs a multi-agent collaboration strategy, where agents specializing in content summarization and layout planning iteratively coordinate and provide mutual feedback. This approach enables the joint optimization of logical consistency, content fidelity, and visual coherence. Extensive experiments on multiple academic domains show that our method outperforms existing baselines in both qualitative and quantitative evaluations. The resulting posters achieve quality closest to expert-designed ground truth and deliver superior information preservation, structural clarity, and user preference.

Summary

PosterForest is a training-free framework for automated scientific poster generation that accounts for the hierarchical structure of scientific documents and the semantic integration of textual and visual elements. It introduces the Poster Tree, a hierarchical intermediate representation, and uses multi-agent collaboration in which content summarization and layout planning agents coordinate iteratively. The method outperforms existing baselines in qualitative and quantitative evaluations, producing posters closest in quality to expert-designed ones, with better information preservation, structural clarity, and user preference.

Guiding a diffusion model using sliding windows

Authors: Nikolas Adaloglou, Tim Kaiser, Damir Iagudin, Markus Kollmann

First: 2024-11-15T15:04:04+00:00 · Latest: 2025-08-29T13:10:29+00:00

Comments: Accepted at BMVC 2025. 30 pages, 16 figures in total, including appendix

Abs · PDF · Code1

Abstract

Guidance is a widely used technique for diffusion models to enhance sample quality. Technically, guidance is realised by using an auxiliary model that generalises more broadly than the primary model. Using a 2D toy example, we first show that it is highly beneficial when the auxiliary model exhibits similar but stronger generalisation errors than the primary model. Based on this insight, we introduce masked sliding window guidance (M-SWG), a novel, training-free method. M-SWG upweights long-range spatial dependencies by guiding the primary model with itself while selectively restricting its receptive field. M-SWG requires neither access to model weights from previous iterations, additional training, nor class conditioning. M-SWG achieves a superior Inception score (IS) compared to previous state-of-the-art training-free approaches, without introducing sample oversaturation. In conjunction with existing guidance methods, M-SWG reaches state-of-the-art Fréchet DINOv2 distance on ImageNet using EDM2-XXL and DiT-XL. The code is available at https://github.com/HHU-MMBS/swg_bmvc2025_official.
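To make the role of the auxiliary model concrete, the generic extrapolation form shared by guidance methods such as classifier-free guidance and autoguidance can be written as below. The notation ($D$ for the primary denoiser, $D_{\text{aux}}$ for the auxiliary one, $w$ for the guidance weight) is assumed for this sketch, and the exact formulation used for M-SWG may differ.

```latex
% Generic guidance: extrapolate from the auxiliary prediction
% through the primary prediction. w = 1 recovers the primary model;
% w > 1 pushes samples away from the auxiliary model's errors.
\[
  D_w(x; \sigma) \;=\; D_{\text{aux}}(x; \sigma)
  \;+\; w \,\bigl( D(x; \sigma) - D_{\text{aux}}(x; \sigma) \bigr),
  \qquad w > 1 .
\]
```

Under this reading, M-SWG would obtain $D_{\text{aux}}$ from the primary model itself with a restricted receptive field, so the difference term emphasises what only the full receptive field can capture.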

Summary

This paper introduces masked sliding window guidance (M-SWG), a training-free guidance method for diffusion models. Motivated by the finding that guidance works best when the auxiliary model exhibits similar but stronger generalisation errors than the primary model, M-SWG guides the primary model with itself while selectively restricting its receptive field, which upweights long-range spatial dependencies. It achieves a superior Inception score (IS) compared to previous state-of-the-art training-free approaches without introducing sample oversaturation and, combined with existing guidance methods, reaches state-of-the-art Fréchet DINOv2 distance on ImageNet using EDM2-XXL and DiT-XL.

How Well Do Vision-Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images

Authors: Juneyoung Ro, Namwoo Kim, Yoonjin Yoon

Venue: ICCV

First: 2025-08-29T12:21:57+00:00 · Latest: 2025-08-29T12:21:57+00:00

Comments: Accepted to ICCV Workshop 2025

Abs · PDF

Abstract

Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to the urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct this dataset from segmentation, depth, and object detection predictions of street-view images, pairing each question with LLM-generated Chain-of-Thought (CoT) answers for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.

Summary

This study compares three off-the-shelf vision-language models (BLIP-2, InstructBLIP, and LLaVA-1.5) on urban spatial reasoning, both zero-shot and after fine-tuning on a synthetic VQA dataset. The dataset is built from segmentation, depth, and object detection predictions of street-view images, with each question paired with an LLM-generated Chain-of-Thought answer. The models perform reasonably well zero-shot, but fine-tuning substantially boosts performance, especially on challenging question types such as negation and counterfactuals.

HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones

Authors: Hao Ruan, Jinliang Lin, Yingxin Lai, Zhiming Luo, Shaozi Li

Venue: ACM MM

First: 2025-08-29T11:50:24+00:00 · Latest: 2025-08-29T11:50:24+00:00

Comments: Accepted by ACM MM'25

Abs · PDF

Abstract

Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.
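The Recall@1 figures quoted above follow the usual retrieval protocol, which can be sketched as follows. This is a generic illustration rather than the authors' evaluation code: the similarity matrix layout (one row per query, one column per gallery item), the `ground_truth` format, and the function name `recall_at_k` are assumptions of this sketch.

```python
from typing import List, Set


def recall_at_k(similarity: List[List[float]],
                ground_truth: List[Set[int]],
                k: int = 1) -> float:
    """Fraction of queries with at least one correct item among the top-k results.

    similarity[q][j] is the score between query q and gallery item j;
    ground_truth[q] is the set of gallery indices that are correct for query q.
    """
    if not similarity:
        return 0.0
    hits = 0
    for q, row in enumerate(similarity):
        # Rank gallery items by descending similarity and keep the first k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if any(j in ground_truth[q] for j in top_k):
            hits += 1
    return hits / len(similarity)
```

For image retrieval the queries would be text descriptions and the gallery drone images; for text retrieval the roles are swapped.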

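Summary

HCCM is a hierarchical cross-granularity contrastive and matching framework for Natural Language-Guided Drones (NLGD). It combines Region-Global Image-Text Contrastive Learning (RG-ITC) and Region-Global Image-Text Matching (RG-ITM), which avoid precise scene partitioning and rigid containment constraints, with a Momentum Contrast and Distillation (MCD) mechanism for robustness to incomplete or ambiguous text. On GeoText-1652 it reaches Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval), and on the unseen ERA dataset it achieves 39.93% mean recall (mR) zero-shot, outperforming fine-tuned baselines.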

PlantVillageVQA: A Visual Question Answering Dataset for Benchmarking Vision-Language Models in Plant Science

Authors: Syed Nazmus Sakib, Nafiul Haque, Mohammad Zabed Hossain, Shifat E. Arman

First: 2025-08-23T19:04:57+00:00 · Latest: 2025-08-28T21:35:42+00:00

Comments: 17 pages, 15 figures; submitted to Nature Scientific Data

Abs · PDF · Code1

Abstract

PlantVillageVQA is a large-scale visual question answering (VQA) dataset derived from the widely used PlantVillage image corpus. It was designed to advance the development and evaluation of vision-language models for agricultural decision-making and analysis. The PlantVillageVQA dataset comprises 193,609 high-quality question-answer (QA) pairs grounded over 55,448 images spanning 14 crop species and 38 disease conditions. Questions are organised into 3 levels of cognitive complexity and 9 distinct categories. Each question category was phrased manually following expert guidance and generated via an automated two-stage pipeline: (1) template-based QA synthesis from image metadata and (2) multi-stage linguistic re-engineering. The dataset was iteratively reviewed by domain experts for scientific accuracy and relevancy. The final dataset was evaluated using three state-of-the-art models for quality assessment. Our objective remains to provide a publicly available, standardised and expert-verified database to enhance diagnostic accuracy for plant disease identifications and advance scientific research in the agricultural domain. Our dataset will be open-sourced at https://huggingface.co/datasets/SyedNazmusSakib/PlantVillageVQA.

Summary

PlantVillageVQA is a large-scale VQA dataset derived from the PlantVillage image corpus to advance vision-language models in plant science. It contains 193,609 QA pairs over 55,448 images spanning 14 crop species and 38 disease conditions, with questions organised into 3 levels of cognitive complexity and 9 categories. The questions were generated by an automated two-stage pipeline, reviewed by domain experts, and evaluated with three state-of-the-art models. The dataset will be open-sourced at https://huggingface.co/datasets/SyedNazmusSakib/PlantVillageVQA.

OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

Authors: Yuan Gong, Xionghui Wang, Jie Wu, Shiyin Wang, Yitong Wang, Xinglong Wu

First: 2025-08-28T17:59:46+00:00 · Latest: 2025-08-28T17:59:46+00:00

Comments: project url: https://one-reward.github.io

Abs · PDF · Project1

Abstract

In this paper, we introduce OneReward, a unified reinforcement learning framework that enhances the model's generative capabilities across multiple tasks under different evaluation criteria using only one reward model. By employing a single vision-language model (VLM) as the generative reward model, which can distinguish the winner and loser for a given task and a given evaluation criterion, it can be effectively applied to multi-task generation models, particularly in contexts with varied data and diverse task objectives. We utilize OneReward for mask-guided image generation, which can be further divided into several sub-tasks such as image fill, image extend, object removal, and text rendering, involving a binary mask as the edit area. Although these domain-specific tasks share the same conditioning paradigm, they differ significantly in underlying data distributions and evaluation metrics. Existing methods often rely on task-specific supervised fine-tuning (SFT), which limits generalization and training efficiency. Building on OneReward, we develop Seedream 3.0 Fill, a mask-guided generation model trained via multi-task reinforcement learning directly on a pre-trained base model, eliminating the need for task-specific SFT. Experimental results demonstrate that our unified edit model consistently outperforms both commercial and open-source competitors, such as Ideogram, Adobe Photoshop, and FLUX Fill [Pro], across multiple evaluation dimensions. Code and model are available at: https://one-reward.github.io

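Summary

OneReward is a unified reinforcement learning framework that uses a single reward model to improve generative capability across multiple tasks. It relies on a vision-language model to distinguish winners from losers under different tasks and evaluation criteria, which makes it suitable for varied data and objectives. Experiments show that Seedream 3.0 Fill, trained via multi-task reinforcement learning, outperforms commercial and open-source competitors across multiple evaluation dimensions on tasks such as image fill, object removal, and text rendering. Code and model are available at https://one-reward.github.io.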

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Authors: Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie

First: 2025-08-28T17:50:58+00:00 · Latest: 2025-08-28T17:50:58+00:00

Comments: 23 pages, 8 figures, Project Page: https://jiutian-vl.github.io/CogVLA-page

Abs · PDF · Code1 · Project1

Abstract

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.

MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs

Authors: Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, Shanghang Zhang

First: 2025-08-28T17:50:03+00:00 · Latest: 2025-08-28T17:50:03+00:00

Comments: 10 pages, 3 figures

Abs · PDF

Abstract

Video Large Language Models (VLLMs) excel in video understanding, but their excessive visual tokens pose a significant computational challenge for real-world applications. Current methods aim to enhance inference efficiency through visual token pruning. However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. Specifically, we first divide the video into segments based on frame similarity, and then dynamically allocate the token budget for each segment to maximize the marginal gain of each segment. Subsequently, we propose a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity, thereby maximizing the marginal gain of each token. By combining both stages, MMG-Vid can maximize the utilization of the limited token budget, significantly improving efficiency while maintaining strong performance. Extensive experiments demonstrate that MMG-Vid can maintain over 99.5% of the original performance, while reducing visual tokens by 75% and accelerating the prefilling stage by 3.9x on LLaVA-OneVision-7B. Code will be released soon.
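The first stage described above, dividing a video into segments based on frame similarity, can be illustrated with a small sketch. It assumes each frame is already represented by a feature vector, uses cosine similarity between consecutive frames, and starts a new segment whenever that similarity drops below a threshold. The threshold value and the names `cosine` and `split_into_segments` are hypothetical; the abstract does not specify the similarity measure or the cut rule, and the budget allocation and DPC stages are not covered here.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_into_segments(frame_features: List[Sequence[float]],
                        threshold: float = 0.8) -> List[List[int]]:
    """Group consecutive frame indices; cut where adjacent similarity < threshold."""
    if not frame_features:
        return []
    segments = [[0]]
    for i in range(1, len(frame_features)):
        if cosine(frame_features[i - 1], frame_features[i]) < threshold:
            segments.append([i])    # scene changed: start a new segment
        else:
            segments[-1].append(i)  # similar to previous frame: same segment
    return segments
```

Each resulting segment would then receive its own share of the token budget, so that static stretches of video do not consume tokens that more dynamic stretches need.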

Summary

MMG-Vid is a training-free visual token pruning framework that maximizes marginal gains at both segment-level and token-level to improve the efficiency of video LLMs. It divides videos into segments based on frame similarity and dynamically allocates a token budget to each segment, then applies a temporal-guided DPC algorithm that jointly models inter-frame uniqueness and intra-frame diversity. On LLaVA-OneVision-7B, MMG-Vid maintains over 99.5% of the original performance while reducing visual tokens by 75% and accelerating the prefilling stage by 3.9x.

Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets

Authors: Dale Decatur, Thibault Groueix, Wang Yifan, Rana Hanocka, Vladimir Kim, Matheus Gadelha

Venue: ICCV 2025

First: 2025-08-28T17:35:03+00:00 · Latest: 2025-08-28T17:35:03+00:00

Comments: ICCV 2025. Project page: https://ddecatur.github.io/hierarchical-diffusion/

Abs · PDF · Project1

Abstract

Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts. Our method leverages the coarse-to-fine nature of diffusion models, where early denoising steps capture shared structures among similar prompts. We propose a training-free approach that clusters prompts based on semantic similarity and shares computation in early diffusion steps. Experiments show that for models trained conditioned on image embeddings, our approach significantly reduces compute cost while improving image quality. By leveraging UnClip's text-to-image prior, we enhance diffusion step allocation for greater efficiency. Our method seamlessly integrates with existing pipelines, scales with prompt sets, and reduces the environmental and financial burden of large-scale text-to-image generation. Project page: https://ddecatur.github.io/hierarchical-diffusion/
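A back-of-the-envelope calculation shows why sharing early denoising steps across a cluster of similar prompts saves compute. The sketch below counts denoiser evaluations under a simplified, single-level sharing scheme in which every cluster shares its first `shared_steps` steps and then branches per prompt. This scheme and the name `denoiser_calls` are assumptions for illustration only; the paper's actual allocation of steps may be more elaborate.

```python
from typing import List, Tuple


def denoiser_calls(cluster_sizes: List[int], total_steps: int,
                   shared_steps: int) -> Tuple[int, int]:
    """Return (calls without sharing, calls with shared early steps).

    Without sharing, each prompt runs all total_steps on its own.
    With sharing, each cluster runs shared_steps once, and then every
    prompt in the cluster runs the remaining steps independently.
    """
    if not 0 <= shared_steps <= total_steps:
        raise ValueError("shared_steps must lie between 0 and total_steps")
    independent = sum(cluster_sizes) * total_steps
    shared = sum(shared_steps + n * (total_steps - shared_steps)
                 for n in cluster_sizes)
    return independent, shared
```

For example, two clusters of 4 and 2 prompts with 50 steps, 20 of them shared, need 220 denoiser calls instead of 300. The saving grows with cluster size, which is consistent with the claim that the approach scales with prompt sets.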

Summary

This paper reduces the cost of text-to-image diffusion for sets of correlated prompts by removing redundant computation across them. The training-free approach clusters prompts by semantic similarity and, exploiting the coarse-to-fine nature of diffusion, shares computation in the early denoising steps. For models trained conditioned on image embeddings, it significantly reduces compute cost while improving image quality, and UnClip's text-to-image prior is used to improve diffusion step allocation. The method integrates with existing pipelines and scales with prompt sets.

DrivingGaussian++: Towards Realistic Reconstruction and Editable Simulation for Surrounding Dynamic Driving Scenes

Authors: Yajiao Xiong, Xiaoyu Zhou, Yongtao Wan, Deqing Sun, Ming-Hsuan Yang

First: 2025-08-28T16:22:54+00:00 · Latest: 2025-08-28T16:22:54+00:00

Abs · PDF · Project1

Abstract

We present DrivingGaussian++, an efficient and effective framework for realistic reconstruction and controllable editing of surrounding dynamic autonomous driving scenes. DrivingGaussian++ models the static background using incremental 3D Gaussians and reconstructs moving objects with a composite dynamic Gaussian graph, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods in dynamic scene reconstruction and photorealistic surround-view synthesis. DrivingGaussian++ supports training-free controllable editing for dynamic driving scenes, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. By integrating large language models (LLMs) and controllable editing, our method can automatically generate dynamic object motion trajectories and enhance their realism during the optimization process. DrivingGaussian++ demonstrates consistent and realistic editing results and generates dynamic multi-view driving scenarios, while significantly enhancing scene diversity. More results and code can be found at the project site: https://xiong-creator.github.io/DrivingGaussian_plus.github.io

中文标题/摘要

标题：DrivingGaussian++：朝向现实的重建和可编辑的周围动态驾驶场景模拟

我们提出了DrivingGaussian++，一种高效且有效的框架，用于现实的周围动态自主驾驶场景重建和可控编辑。DrivingGaussian++使用增量3D高斯模型静态背景，并用复合动态高斯图重建移动对象，确保准确的位置和遮挡。通过整合LiDAR先验，它实现了详细的且一致的场景重建，在动态场景重建和照片现实的全景视图合成方面优于现有方法。DrivingGaussian++支持无需训练的动态驾驶场景可控编辑，包括纹理修改、天气模拟和对象操作，利用多视角图像和深度先验。通过整合大型语言模型（LLMs）和可控编辑，我们的方法可以在优化过程中自动生成动态对象运动轨迹并增强其现实性。DrivingGaussian++展示了持续且现实的编辑结果，并生成动态多视角驾驶场景，显著增强了场景多样性。更多结果和代码可在项目网站上找到：https://xiong-creator.github.io/DrivingGaussian_plus.github.io

Summary / 总结

DrivingGaussian++ is a framework for realistic reconstruction and controllable editing of dynamic driving scenes. It uses incremental 3D Gaussians for static background modeling and a composite dynamic Gaussian graph for moving objects, ensuring accurate positions and occlusions. By integrating a LiDAR prior, it achieves detailed and consistent scene reconstruction, outperforming existing methods. The framework supports training-free controllable editing, including texture modification, weather simulation, and object manipulation, leveraging multi-view images and depth priors. It can automatically generate dynamic object motion trajectories and enhance their realism, demonstrating consistent and realistic editing results and generating dynamic multi-view driving scenarios with enhanced scene diversity.

DrivingGaussian++ 是一个用于动态驾驶场景的现实重建和可控编辑框架。它使用增量 3D 高斯模型静态背景，并使用复合动态高斯图来表示移动对象，确保准确的位置和遮挡。通过集成 LiDAR 先验，它实现了详细且一致的场景重建，优于现有方法。该框架支持无需训练的可控编辑，包括纹理修改、天气模拟和对象操作，利用多视图图像和深度先验。它可以自动生成动态对象运动轨迹并增强其现实性，展示了一致且现实的编辑结果，并生成动态多视图驾驶场景，显著增强了场景多样性。

Understanding and evaluating computer vision models through the lens of counterfactuals

Authors: Pushkar Shukla

First: 2025-08-28T15:11:49+00:00 · Latest: 2025-08-28T15:11:49+00:00

Abs · PDF

Abstract

Counterfactual reasoning -- the practice of asking ``what if'' by varying inputs and observing changes in model behavior -- has become central to interpretable and fair AI. This thesis develops frameworks that use counterfactuals to explain, audit, and mitigate bias in vision classifiers and generative models. By systematically altering semantically meaningful attributes while holding others fixed, these methods uncover spurious correlations, probe causal dependencies, and help build more robust systems. The first part addresses vision classifiers. CAVLI integrates attribution (LIME) with concept-level analysis (TCAV) to quantify how strongly decisions rely on human-interpretable concepts. With localized heatmaps and a Concept Dependency Score, CAVLI shows when models depend on irrelevant cues like backgrounds. Extending this, ASAC introduces adversarial counterfactuals that perturb protected attributes while preserving semantics. Through curriculum learning, ASAC fine-tunes biased models for improved fairness and accuracy while avoiding stereotype-laden artifacts. The second part targets generative Text-to-Image (TTI) models. TIBET provides a scalable pipeline for evaluating prompt-sensitive biases by varying identity-related terms, enabling causal auditing of how race, gender, and age affect image generation. To capture interactions, BiasConnect builds causal graphs diagnosing intersectional biases. Finally, InterMit offers a modular, training-free algorithm that mitigates intersectional bias via causal sensitivity scores and user-defined fairness goals. Together, these contributions show counterfactuals as a unifying lens for interpretability, fairness, and causality in both discriminative and generative models, establishing principled, scalable methods for socially responsible bias evaluation and mitigation.

Summary / 总结

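This thesis develops counterfactual-based frameworks to explain, audit, and mitigate bias in vision classifiers and text-to-image generative models. For classifiers, CAVLI combines LIME attribution with TCAV concept analysis to quantify how strongly decisions rely on human-interpretable concepts, and ASAC uses adversarial counterfactuals with curriculum learning to fine-tune biased models for improved fairness and accuracy. For generative models, TIBET evaluates prompt-sensitive biases by varying identity-related terms, BiasConnect builds causal graphs to diagnose intersectional biases, and InterMit mitigates them with a modular, training-free algorithm.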

Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu

First: 2025-03-14T15:42:42+00:00 · Latest: 2025-08-28T14:55:38+00:00

Comments: This paper is accepted by IJCAI2025 Workshop on Deepfake Detection, Localization, and Interpretability

Abs · PDF

Abstract

Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.

Summary / 总结

This paper explores the security risks posed by typographic visual prompts in Cross-Modality Generation Models. It introduces a dataset to evaluate the impact of Typographic Visual Prompt Injection (TVPI) on various models, revealing that visual prompts can induce disruptive outputs aligned with the prompts. The study deepens the understanding of TVPI threats in both Large Vision Language Models and Image-to-Image Generation Models.

本文探讨了图文提示在跨模态生成模型中的安全威胁。作者提出了一组数据集来评估这些提示对各种模型的影响，发现它们可以诱导与提示内容一致的破坏性输出。该研究加深了对LVLMs和I2I GMs中图文提示威胁的理解。

Learning Primitive Embodied World Models: Towards Scalable Robotic Learning

Authors: Qiao Sun, Liujia Yang, Wei Tang, Wei Huang, Kaixin Xu, Yongchao Chen, Mingyu Liu, Jiange Yang, Haoyi Zhu, Yating Wang, Tong He, Yilun Chen, Xili Dai, Nanyang Ye, Qinying Gu

First: 2025-08-28T14:31:48+00:00 · Latest: 2025-08-28T14:31:48+00:00

Abs · PDF

Abstract

While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling, Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.

中文标题/摘要

标题：学习原始具身世界模型：迈向具身学习的可扩展性

尽管基于视频生成的具身世界模型受到了越来越多的关注，但它们对大规模具身交互数据的依赖仍然是一个关键瓶颈。具身数据的稀缺性、收集难度和高维度性从根本上限制了语言与动作之间的对齐精度，并加剧了长时序视频生成的挑战——阻碍生成模型在具身领域实现“GPT时刻”。一个简单的观察是：具身数据的多样性远远超过了可能的基本运动的小空间。基于这一洞察，我们提出了一种新的世界建模范式——原始具身世界模型（PEWM）。通过将视频生成限制在固定的小时序内，我们的方法1) 使语言概念与机器人动作的视觉表示之间的对齐更加精细，2) 减少学习复杂性，3) 提高了具身数据收集的数据效率，4) 减少了推理延迟。通过配备模块化视觉语言模型（VLM）规划器和起始-目标热图引导机制（SGG），PEWM 进一步实现了灵活的闭环控制，并支持在扩展和复杂任务中对基本级策略的组合泛化。我们的框架利用视频模型中的时空视觉先验和 VLM 的语义意识，弥合了精细物理交互与高层次推理之间的差距，为可扩展、可解释和通用的具身智能铺平了道路。

Summary / 总结

This paper tackles a key bottleneck of video-generation-based embodied world models: their reliance on large-scale embodied interaction data. It introduces Primitive Embodied World Models (PEWM), which restrict video generation to fixed short horizons, enabling finer alignment between language and actions, reducing learning complexity, improving data efficiency, and decreasing inference latency. PEWM uses a modular Vision-Language Model planner and a Start-Goal heatmap Guidance mechanism to support flexible closed-loop control and compositional generalization of primitive-level policies over complex tasks.

本文提出了一种称为Primitive Embodied World Models (PEWM)的方法，通过限制视频生成的短时间范围，实现语言概念与机器人动作的精细对齐，降低学习复杂度并提高数据效率。该方法使用模块化的视觉-语言模型规划器和起始-目标热图引导机制，支持对复杂任务中原始级策略的灵活闭环控制和组合泛化。

Estimating 2D Keypoints of Surgical Tools Using Vision-Language Models with Low-Rank Adaptation

Authors: Krit Duangprom, Tryphon Lambrou, Binod Bhattarai

Venue: MICCAI 2025

First: 2025-08-28T14:25:32+00:00 · Latest: 2025-08-28T14:25:32+00:00

Comments: Accepted to MICCAI 2025

Abs · PDF

Abstract

This paper presents a novel pipeline for 2D keypoint estimation of surgical tools by leveraging Vision Language Models (VLMs) fine-tuned using a low-rank adaptation (LoRA) technique. Unlike traditional Convolutional Neural Network (CNN) or Transformer-based approaches, which often suffer from overfitting in small-scale medical datasets, our method harnesses the generalization capabilities of pre-trained VLMs. We carefully design prompts to create an instruction-tuning dataset and use them to align visual features with semantic keypoint descriptions. Experimental results show that with only two epochs of fine-tuning, the adapted VLM outperforms the baseline models, demonstrating the effectiveness of LoRA in low-resource scenarios. This approach not only improves keypoint detection performance, but also paves the way for future work in 3D surgical hands and tools pose estimation.

中文标题/摘要

标题：使用低秩适应的视觉语言模型估计手术工具的2D关键点

本文提出了一种利用视觉语言模型（VLMs）结合低秩调整（LoRA）技术进行2D关键点估计的新管道。与传统卷积神经网络（CNN）或基于变换器的方法相比，后者在小型医学数据集上常常容易过拟合，我们的方法利用了预训练VLMs的泛化能力。我们精心设计了提示以创建指令调优数据集，并使用它们将视觉特征与语义关键点描述对齐。实验结果表明，仅经过两轮微调，适应后的VLM就优于基线模型，证明了LoRA在资源有限场景中的有效性。该方法不仅提高了关键点检测性能，还为未来3D手术手和工具姿态估计的研究铺平了道路。
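
As background on why LoRA suits low-resource fine-tuning, here is a minimal sketch of the standard low-rank update, assuming the usual formulation W' = W + (alpha / r) · B · A with a frozen weight W. It is not the paper's training code; the abstract does not state which VLM, rank, or scaling the authors used, so the helper names and all values below are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def lora_effective_weight(w, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A.

    w:      frozen weight, shape (d_out, d_in)
    lora_a: trainable matrix A, shape (r, d_in)
    lora_b: trainable matrix B, shape (d_out, r)
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_trainable_params(d_out, d_in, rank):
    """Parameters trained by LoRA for one weight matrix: A plus B."""
    return rank * (d_in + d_out)
```

For a hypothetical 4096 x 4096 projection with rank 8, `lora_trainable_params(4096, 4096, 8)` is 65,536, compared with 16,777,216 parameters in the full matrix. That reduction is the reason a small dataset can be enough to adapt a large pre-trained model.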

Improving Fine-Grained Control via Aggregation of Multiple Diffusion Models

Authors: Conghan Yue, Zhengwei Peng, Shiyan Du, Zhi Ji, Chuangjian Cai, Le Wan, Dongyu Zhang

First: 2024-10-02T06:16:06+00:00 · Latest: 2025-08-28T14:03:26+00:00

Abs · PDF · Code1

Abstract

While many diffusion models perform well when controlling particular aspects such as style, character, and interaction, they struggle with fine-grained control due to dataset limitations and intricate model architecture design. This paper introduces a novel training-free algorithm, independent of denoising network architectures, for fine-grained generation, called Aggregation of Multiple Diffusion Models (AMDM). The algorithm integrates features from multiple diffusion models into a specified model to activate particular features and enable fine-grained control. Experimental results demonstrate that AMDM significantly improves fine-grained control without training, validating its effectiveness. Additionally, it reveals that diffusion models initially focus on features such as position, attributes, and style, with later stages improving generation quality and consistency. AMDM offers a new perspective for tackling the challenges of fine-grained conditional generation in diffusion models. Specifically, it allows us to fully utilize existing or develop new conditional diffusion models that control specific aspects, and then aggregate them using the AMDM algorithm. This eliminates the need for constructing complex datasets, designing intricate model architectures, and incurring high training costs. Code is available at: https://github.com/Hammour-steak/AMDM.

中文标题/摘要

标题：通过多扩散模型聚合提高细粒度控制

虽然许多扩散模型在控制特定方面如风格、角色和交互时表现良好，但在细粒度控制方面由于数据集限制和复杂的模型架构设计，它们面临挑战。本文介绍了一种无需训练的新型算法，该算法独立于去噪网络架构，称为多扩散模型聚合（AMDM）。该算法将多个扩散模型的特征整合到指定模型中，以激活特定特征并实现细粒度控制。实验结果表明，AMDM在无需训练的情况下显著提高了细粒度控制能力，验证了其有效性。此外，它揭示了扩散模型最初关注位置、属性和风格等特征，后期阶段则提高生成质量和一致性。AMDM为解决扩散模型中的细粒度条件生成挑战提供了新视角。具体而言，它允许我们充分利用现有或开发新的控制特定方面的条件扩散模型，并使用AMDM算法进行聚合。这消除了构建复杂数据集、设计复杂模型架构和高训练成本的需要。代码可在：https://github.com/Hammour-steak/AMDM 获取。

Summary / 总结

This paper addresses the challenge of fine-grained control in diffusion models by introducing AMDM, an algorithm that aggregates features from multiple diffusion models to enhance fine-grained generation without requiring training. Experimental results show that AMDM significantly improves fine-grained control, offering a new approach to tackle the limitations of existing models in controlling specific aspects like position, attributes, and style. This method reduces the need for complex datasets and intricate model architectures, thus lowering training costs.

本文通过引入AMDM算法，该算法将多个扩散模型的特征进行聚合，以增强细粒度生成，无需训练即可实现细粒度控制。实验结果表明，AMDM显著提高了细粒度控制能力，提供了一种解决现有模型在控制位置、属性和风格等特定方面限制的新方法。这种方法减少了复杂数据集和复杂模型架构的需求，从而降低了训练成本。

Evaluating Compositional Generalisation in VLMs and Diffusion Models

Authors: Beth Pearson, Bilal Boulbarss, Michael Wray, Martha Lewis

First: 2025-08-28T13:45:04+00:00 · Latest: 2025-08-28T13:45:04+00:00

Comments: 11 pages including references, 6 figures. Accepted at IWCS 2025

Abs · PDF · Code1

Abstract

A fundamental aspect of the semantics of natural language is that novel meanings can be formed from the composition of previously known parts. Vision-language models (VLMs) have made significant progress in recent years; however, there is evidence that they are unable to perform this kind of composition. For example, given an image of a red cube and a blue cylinder, a VLM such as CLIP is likely to incorrectly label the image as a red cylinder or a blue cube, indicating it represents the image as a `bag-of-words' and fails to capture compositional semantics. Diffusion models have recently gained significant attention for their impressive generative abilities, and zero-shot classifiers based on diffusion models have been shown to perform competitively with CLIP in certain compositional tasks. In this work we explore whether the generative Diffusion Classifier has improved compositional generalisation abilities compared to discriminative models. We assess three models -- Diffusion Classifier, CLIP, and ViLT -- on their ability to bind objects with attributes and relations in both zero-shot learning (ZSL) and generalised zero-shot learning (GZSL) settings. Our results show that the Diffusion Classifier and ViLT perform well at concept binding tasks, but that all models struggle significantly with the relational GZSL task, underscoring the broader challenges VLMs face with relational reasoning. Analysis of CLIP embeddings suggests that the difficulty may stem from overly similar representations of relational concepts such as left and right. Code and dataset are available at: https://github.com/otmive/diffusion_classifier_clip

Summary / 总结

This study evaluates the ability of Vision-Language Models (VLMs) and diffusion models to perform compositional generalization. The research compares the Diffusion Classifier, CLIP, and ViLT on tasks involving object binding with attributes and relations in zero-shot and generalized zero-shot settings. The results indicate that while the Diffusion Classifier and ViLT perform well in concept binding, all models struggle with relational tasks, highlighting the challenges VLMs face in relational reasoning.

该研究评估了视觉-语言模型（VLMs）和扩散模型在进行组合泛化方面的能力。研究比较了扩散分类器、CLIP 和 ViLT 在零样本和泛化零样本设置下的任务，涉及对象与属性和关系的绑定。结果表明，虽然扩散分类器和 ViLT 在概念绑定方面表现良好，但所有模型在关系任务上都面临重大挑战，突显了 VLMs 在关系推理方面的困难。

Occlusion Robustness of CLIP for Military Vehicle Classification

Authors: Jan Erik van Woerden, Gertjan Burghouts, Lotte Nijskens, Alma M. Liezenga, Sabina van Rooij, Frank Ruis, Hugo J. Kuijf

First: 2025-08-28T13:16:55+00:00 · Latest: 2025-08-28T13:16:55+00:00

Comments: To be presented at SPIE: Sensors + Imaging, Artificial Intelligence for Security and Defence Applications II

Abs · PDF

Abstract

Vision-language models (VLMs) like CLIP enable zero-shot classification by aligning images and text in a shared embedding space, offering advantages for defense applications with scarce labeled data. However, CLIP's robustness in challenging military environments, with partial occlusion and degraded signal-to-noise ratio (SNR), remains underexplored. We investigate CLIP variants' robustness to occlusion using a custom dataset of 18 military vehicle classes and evaluate using Normalized Area Under the Curve (NAUC) across occlusion percentages. Four key insights emerge: (1) Transformer-based CLIP models consistently outperform CNNs, (2) fine-grained, dispersed occlusions degrade performance more than larger contiguous occlusions, (3) despite improved accuracy, performance of linear-probed models sharply drops at around 35% occlusion, (4) by finetuning the model's backbone, this performance drop occurs at more than 60% occlusion. These results underscore the importance of occlusion-specific augmentations during training and the need for further exploration into patch-level sensitivity and architectural resilience for real-world deployment of CLIP.

中文标题/摘要

标题：CLIP在军事车辆分类中的遮挡鲁棒性

视觉-语言模型（VLMs）如CLIP通过在共享嵌入空间中对齐图像和文本实现零样本分类，为缺乏标注数据的防御应用提供了优势。然而，CLIP在具有部分遮挡和降级信噪比（SNR）的挑战性军事环境中的鲁棒性尚未得到充分探索。我们使用包含18类军事车辆的自定义数据集研究了CLIP变体在遮挡下的鲁棒性，并使用归一化曲线下面积（NAUC）在不同遮挡百分比下进行评估。研究结果得出四个关键见解：（1）基于Transformer的CLIP模型始终优于CNN，（2）细粒度、分散的遮挡比大面积连续遮挡对性能影响更大，（3）尽管准确率有所提高，但在约35%遮挡时，线性探查模型的性能急剧下降，（4）通过微调模型的骨干网络，性能下降发生在超过60%遮挡时。这些结果强调了在训练过程中使用遮挡特定增强的重要性，并指出了需要进一步探索像素级敏感性和架构鲁棒性以实现CLIP在实际部署中的应用。

Summary / 总结

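This study examines how robust CLIP variants are to occlusion, using a custom dataset of 18 military vehicle classes and the Normalized Area Under the Curve (NAUC) across occlusion percentages. Transformer-based CLIP models consistently outperform CNNs, and fine-grained, dispersed occlusions degrade performance more than larger contiguous ones. Linear-probed models drop sharply at around 35% occlusion, whereas fine-tuning the backbone delays this drop to more than 60% occlusion.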
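
The abstract does not define NAUC precisely, so the following is only one plausible reading: the area under the accuracy-versus-occlusion curve, computed with the trapezoidal rule and divided by the width of the occlusion range so that a model with constant accuracy `a` scores `a`. The function name and the normalization choice are assumptions, not the authors' definition.

```python
def normalized_auc(occlusion_levels, accuracies):
    """Trapezoidal area under accuracy vs. occlusion, divided by the x-range.

    occlusion_levels: occlusion fractions or percentages (any order)
    accuracies:       accuracy measured at each occlusion level
    """
    if len(occlusion_levels) != len(accuracies) or len(occlusion_levels) < 2:
        raise ValueError("need at least two (occlusion, accuracy) pairs")
    points = sorted(zip(occlusion_levels, accuracies))
    span = points[-1][0] - points[0][0]
    if span == 0:
        raise ValueError("occlusion levels must not all be equal")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / span
```

Under this reading, a model whose accuracy falls linearly from 1.0 at no occlusion to 0.0 at full occlusion scores 0.5, while a model that holds its accuracy until high occlusion scores closer to its clean accuracy.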

NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks

Authors: Aritra Dutta, Swapnanil Mukherjee, Deepanway Ghosal, Somak Aditya

First: 2025-08-27T09:34:28+00:00 · Latest: 2025-08-28T12:05:33+00:00

Abs · PDF

Abstract

Commonsense visual-question answering often hinges on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present an end-to-end framework (NLKI) that (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs respectively across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved using a fine-tuned ColBERTv2 and an object information-enriched prompt yield explanations that largely cut down hallucinations, while lifting the end-to-end answer accuracy by up to 7% (across 3 datasets), making FLAVA and other models in NLKI match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. As these benchmarks contain 10-25% label noise, additional finetuning using noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% in CRIC, and 5.5% in AOKVQA. Our findings expose when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models in the context of external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M models.

中文标题/摘要

标题：NLKI：一种轻量级自然语言知识整合框架，用于提高小型视觉语言模型在常识视觉问答任务中的表现

常识视觉问答往往依赖于图像或问题中缺失的知识。小型视觉语言模型（sVLMs）如ViLT、VisualBERT和FLAVA因此落后于其较大的生成型对应模型。为了研究仔细整合常识知识对sVLMs的影响，我们提出了一种端到端框架（NLKI），该框架（i）检索自然语言事实，（ii）提示LLM生成自然语言解释，（iii）将两种信号分别输入两个常识视觉问答数据集（CRIC、AOKVQA）和一个视觉蕴含数据集（e-SNLI-VE）。使用微调后的ColBERTv2和对象信息增强的提示检索的事实生成的解释大大减少了幻觉，同时使端到端答案准确性提高多达7%（在3个数据集中），使FLAVA和其他模型在NLKI中达到或超过中型视觉语言模型Qwen-2 VL-2B和SmolVLM-2.5B。由于这些基准数据集包含10-25%的标签噪声，使用抗噪声损失（如对称交叉熵和广义交叉熵）的额外微调在CRIC中增加了2.5%，在AOKVQA中增加了5.5%。我们的研究结果揭示了基于LLM的常识知识何时能胜过从常识知识库检索，如何在外部知识增强的背景下噪声感知训练稳定小型模型，以及为什么参数高效的常识推理现在对于2.5亿参数模型来说是可行的。

Summary / 总结

The research aims to enhance the performance of small vision-language models (sVLMs) in commonsense visual-question answering tasks by integrating natural language knowledge. The proposed NLKI framework retrieves natural language facts, prompts an LLM to generate explanations, and feeds both to sVLMs. This approach improves end-to-end answer accuracy by up to 7% across three datasets, making models like FLAVA match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. Additional fine-tuning with noise-robust losses adds another 2.5% on CRIC and 5.5% on AOKVQA.

研究旨在通过整合自然语言知识来提升小型视觉语言模型（sVLMs）在常识视觉问答任务中的表现。提出的NLKI框架检索自然语言事实，促使LLM生成解释，并将这些信息输入sVLMs。这种方法在三个数据集上将端到端的答案准确性最多提高了7%，使FLAVA等模型达到或超过了Qwen-2 VL-2B和SmolVLM-2.5B等中型模型的水平。使用抗噪声损失进行额外微调可进一步提升准确率，在CRIC和AOKVQA数据集上分别提高了2.5%和5.5%。
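
The two noise-robust losses named in the abstract can be sketched per example as follows. This assumes the commonly used formulations: generalised cross entropy as (1 - p_y^q) / q, and symmetric cross entropy as alpha · CE + beta · reverse CE with log(0) replaced by a negative constant. The default values of `q`, `alpha`, `beta`, and `log_zero` are illustrative assumptions, not the settings used in NLKI.

```python
import math


def cross_entropy(probs, label):
    """Standard cross entropy for one example with a hard label."""
    return -math.log(max(probs[label], 1e-12))


def generalized_cross_entropy(probs, label, q=0.7):
    """GCE: (1 - p_y^q) / q, with 0 < q <= 1.

    q = 1 gives 1 - p_y (an MAE-like loss); q -> 0 approaches cross entropy.
    """
    return (1.0 - probs[label] ** q) / q


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """SCE: alpha * CE + beta * reverse CE.

    Reverse CE swaps prediction and label. With a one-hot label, log(1) = 0
    for the true class and log(0) is replaced by `log_zero`, so the term
    reduces to -log_zero * (1 - p_y) when `probs` sums to 1.
    """
    ce = cross_entropy(probs, label)
    rce = -log_zero * (1.0 - probs[label])
    return alpha * ce + beta * rce
```

Both losses grow more slowly than plain cross entropy when the model assigns low probability to the given label. This limits the influence of mislabeled examples, which matters on benchmarks that the abstract reports as containing 10-25% label noise.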

"Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection

Authors: Anastasios Skoularikis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos, Panagiotis C. Petrantonakis

First: 2025-08-28T11:22:15+00:00 · Latest: 2025-08-28T11:22:15+00:00

Abs · PDF

Abstract

Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to "in the wild" content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.

中文标题/摘要

标题："幽默、艺术还是误导信息？": 一种面向意图的合成图像检测多模态数据集

近年来，多模态AI的进步推动了对合成和脱离上下文内容检测的进展。然而，现有努力大多忽略了AI生成图像背后的意图。为填补这一空白，我们引入了S-HArM，这是一个面向意图的多模态数据集，包含来自Twitter/X和Reddit的9,576个“野生”图像-文本对，并标记为幽默/讽刺、艺术或误导信息。此外，我们探索了三种提示策略（图像导向、描述导向和多模态导向）来构建大规模合成训练数据集，使用Stable Diffusion。我们进行了广泛的比较研究，包括模态融合、对比学习、重建网络、注意力机制和大型视觉-语言模型。我们的结果显示，基于图像和多模态导向数据训练的模型在“野生”内容上的泛化能力更强，因为保留了视觉上下文。然而，总体性能仍然有限，突显了推断意图的复杂性以及需要专门架构的需求。

Amadeus: Autoregressive Model with Bidirectional Attribute Modelling for Symbolic Music

Authors: Hongju Su, Ke Li, Lan Yang, Honggang Zhang, Yi-Zhe Song

First: 2025-08-28T11:15:44+00:00 · Latest: 2025-08-28T11:15:44+00:00

Comments: Under review

Abs · PDF

Abstract

Existing state-of-the-art symbolic music generation models predominantly adopt autoregressive or hierarchical autoregressive architectures, modelling symbolic music as a sequence of attribute tokens with unidirectional temporal dependencies, under the assumption of a fixed, strict dependency structure among these attributes. However, we observe that using different attributes as the initial token in these models leads to comparable performance. This suggests that the attributes of a musical note are, in essence, a concurrent and unordered set, rather than a temporally dependent sequence. Based on this insight, we introduce Amadeus, a novel symbolic music generation framework. Amadeus adopts a two-level architecture: an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. To enhance performance, we propose Music Latent Space Discriminability Enhancement Strategy (MLSDES), incorporating contrastive learning constraints that amplify discriminability of intermediate music representations. The Conditional Information Enhancement Module (CIEM) simultaneously strengthens note latent vector representation via attention mechanisms, enabling more precise note decoding. We conduct extensive experiments on unconditional and text-conditioned generation tasks. Amadeus significantly outperforms SOTA models across multiple metrics while achieving at least 4$\times$ speed-up. Furthermore, we demonstrate training-free, fine-grained note attribute control feasibility using our model. To explore the upper performance bound of the Amadeus architecture, we compile the largest open-source symbolic music dataset to date, AMD (Amadeus MIDI Dataset), supporting both pre-training and fine-tuning.

中文标题/摘要

标题：阿玛迪乌斯：双向属性建模的自回归音乐生成模型

现有的最先进的符号音乐生成模型主要采用自回归或分层自回归架构，将符号音乐建模为具有单向时间依赖性的属性令牌序列，假设这些属性之间存在固定且严格的依赖结构。然而，我们观察到，在这些模型中使用不同的属性作为初始令牌会导致相当的性能。这表明，音乐音符的属性本质上是一个并发且无序的集合，而不是一个时间依赖序列。基于这一洞察，我们引入了阿玛迪乌斯，一种新颖的符号音乐生成框架。阿玛迪乌斯采用两层架构：用于音符序列的自回归模型和用于属性的双向离散扩散模型。为了提高性能，我们提出了音乐潜在空间判别性增强策略(MLSDES)，结合对比学习约束以增强中间音乐表示的判别性。条件信息增强模块(CIEM)同时通过注意力机制加强音符潜在向量表示，使音符解码更加精确。我们在无条件和文本条件生成任务上进行了广泛的实验。阿玛迪乌斯在多个指标上显著优于当前最佳模型，同时实现至少4倍的速度提升。此外，我们展示了使用我们的模型实现无训练的细粒度音符属性控制的可行性。为了探索阿玛迪乌斯架构的性能上限，我们编译了迄今为止最大的开源符号音乐数据集AMD（阿玛迪乌斯MIDI数据集），支持预训练和微调。

Summary / 总结

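Amadeus is a novel symbolic music generation framework that addresses the limitations of existing autoregressive models with a two-level architecture: an autoregressive model for note sequences and a bidirectional discrete diffusion model for attributes. It also incorporates MLSDES and CIEM to improve performance. Experiments show that Amadeus significantly outperforms state-of-the-art models across multiple metrics while being at least 4 times faster. It also enables training-free, fine-grained control of note attributes.

Amadeus 是一种新颖的符号音乐生成框架，通过引入两层架构——自回归模型用于音符序列和双向离散扩散模型用于属性，解决了现有自回归模型的局限性。它还包含 MLSDES 和 CIEM 来提升性能。实验表明，Amadeus 在多个指标上显著优于现有最佳模型，并且至少快 4 倍。此外，它还能够在无需训练的情况下实现对音符属性的精细控制。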

Enhancing Document VQA Models via Retrieval-Augmented Generation

Authors: Eric López, Artemis Llabrés, Ernest Valveny

First: 2025-08-26T12:32:55+00:00 · Latest: 2025-08-28T10:31:44+00:00

Comments: Accepted at Workshop on Machine Learning in Document Analysis and Recognition (ICDAR WML 2025), Wuhan, China

Abs · PDF

Abstract

Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.

Summary / 总结

This paper explores the integration of Retrieval-Augmented Generation (RAG) into Document VQA models to address the memory challenges of processing multi-page documents. It evaluates text-based and purely visual retrieval methods across various models and benchmarks, showing that the text-centric variant improves the baseline by up to 22.5 ANLS, while the visual variant achieves a 5.0 ANLS improvement without text extraction. The study confirms that retrieval and reranking are key contributors to the gains, and layout-guided chunking does not significantly help on these datasets.

本文探讨了使用检索增强生成（RAG）方法在文档视觉问答（Document VQA）中的应用，以解决将所有页面串联或使用大型视觉语言模型的内存限制问题。研究在多个模型和基准上评估了基于文本和纯视觉的检索方法，显示了文本中心检索最多可提高22.5 ANLS，而纯视觉检索可提高5.0 ANLS。研究确认检索和重排序对于性能提升至关重要，而布局引导的分块策略在这些数据集上并未显著帮助。
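
Since the gains above are reported in ANLS (Average Normalized Levenshtein Similarity), a short reference sketch of the metric may help. It assumes the commonly used definition with a 0.5 threshold and case-insensitive comparison, and returns a value in [0, 1]; the figures in the abstract appear to be on a 0-100 scale. The function names are hypothetical and this is not the benchmarks' official scorer.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best per-answer similarity.

    predictions: one predicted string per question
    references:  one list of accepted answer strings per question
    Similarity is 1 - normalized edit distance, or 0 when the normalized
    distance reaches the threshold.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(p), len(a))
            nl = levenshtein(p, a) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

The threshold means that an answer with small OCR-style errors still earns partial credit, while an answer that is mostly wrong earns none.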

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Authors: Haodong Duan, Xinyu Fang, Junming Yang, Xiangyu Zhao, Yuxuan Qiao, Mo Li, Amit Agarwal, Zhe Chen, Lin Chen, Yuan Liu, Yubo Ma, Hailong Sun, Yifan Zhang, Shiyin Lu, Tack Hwa Wong, Weiyun Wang, Peiheng Zhou, Xiaozhe Li, Chaoyou Fu, Junbo Cui, Jixuan Chen, Enxin Song, Song Mao, Shengyuan Ding, Tianhao Liang, Zicheng Zhang, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen

First: 2024-07-16T13:06:15+00:00 · Latest: 2025-08-28T09:40:49+00:00

Comments: Updated on 2025.08.28, data cut down to 2025.06.30

Abs · PDF · Code1

Abstract

We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement over 200 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 80 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released on https://github.com/open-compass/VLMEvalKit and is actively maintained.

Towards Mechanistic Defenses Against Typographic Attacks in CLIP

Authors: Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek

First: 2025-08-28T09:08:30+00:00 · Latest: 2025-08-28T09:08:30+00:00

Abs · PDF

Abstract

Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, our method improves performance by up to 19.6% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.

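Title and abstract (translated from Chinese)

Title: Towards Mechanistic Defenses Against Typographic Attacks in CLIP

Typographic attacks exploit multi-modal systems by injecting text into images, causing targeted misclassification, malicious content generation, and even jailbreaks of vision-language models. In this work, we analyze how CLIP vision encoders behave under typographic attacks and find that specific attention heads in the second half of the model's layers causally extract typographic information and pass it to the cls token. Building on these insights, we propose defending CLIP models against typographic attacks by selectively ablating a typographic circuit made up of attention heads. Without any finetuning, our method improves performance on a typographic variant of ImageNet-100 by up to 19.6%, while accuracy on standard ImageNet-100 drops by less than 1%. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. We release a family of dyslexic CLIP models that are significantly more robust to typographic attacks. These models are suitable drop-in replacements in a broad range of safety-critical applications where the risk of text-based manipulation outweighs the usefulness of text recognition.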
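The abstract describes the defense as ablating selected attention heads without finetuning. A minimal sketch of what head ablation means is given below, assuming zero-ablation applied to per-head outputs before they are concatenated. The function name `ablate_heads` and the toy vectors are hypothetical, and the paper may use a different ablation scheme (for example, replacing a head's output with its mean activation).

```python
def ablate_heads(head_outputs, heads_to_ablate):
    """Concatenate per-head output vectors, zeroing the selected heads.

    head_outputs: list of per-head vectors (lists of floats), one per head.
    heads_to_ablate: set of head indices whose contribution is removed.
    Zero-ablation is an assumption made for this sketch.
    """
    concatenated = []
    for idx, vec in enumerate(head_outputs):
        if idx in heads_to_ablate:
            # The ablated head contributes nothing to the concatenated output.
            concatenated.extend([0.0] * len(vec))
        else:
            concatenated.extend(vec)
    return concatenated


# Toy layer with three heads of width two; head 1 plays the "typographic" head.
heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(ablate_heads(heads, {1}))
```

In a real vision transformer the concatenated vector would then pass through the attention output projection, so zeroing a head removes only that head's contribution to the residual stream.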

MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning

Authors: Weihai Zhi, Jiayan Guo, Shangyang Li

First: 2025-08-28T08:41:32+00:00 · Latest: 2025-08-28T08:41:32+00:00

Comments: 8 pages, 5 figures

Abs · PDF

Abstract

The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks, while Reinforcement Learning (RL), a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To break this impasse, we introduce Generative Reward Learning for Medical Reasoning (MedGR$^2$), a novel framework that creates a self-improving virtuous cycle. MedGR$^2$ co-develops a data generator and a reward model, enabling the automated, continuous creation of high-quality, multi-modal medical data that serves as a superior training source for both SFT and RL. Our experiments demonstrate that SFT with MedGR$^2$-produced data already surpasses baselines trained on large-scale, human-curated datasets. Crucially, when leveraging this data for RL via Group Relative Policy Optimization (GRPO), our model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized RL-based methods. Furthermore, our compact model, empowered by MedGR$^2$, achieves performance competitive with foundation models possessing over 10 times more parameters. MedGR$^2$ presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking the full potential of RL for building truly generalizable medical AI.

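Summary

The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data.

MedGR$^2$ tackles this scarcity with a generative reward learning framework. The framework co-develops a data generator and a reward model to produce high-quality multi-modal medical data for both supervised fine-tuning and reinforcement learning. In experiments, models trained on MedGR$^2$-generated data outperform baselines trained on large-scale, human-curated datasets and surpass specialized RL-based methods in cross-modality and cross-task generalization. A compact model trained with MedGR$^2$ also performs competitively with foundation models that have over 10 times more parameters.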
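The abstract names Group Relative Policy Optimization (GRPO) as the RL algorithm. Its central idea is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that normalization in plain Python. It is an illustration under common assumptions (population standard deviation, a small `eps` for numerical stability), not the MedGR$^2$ implementation, and the function name `group_relative_advantages` is hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population std and this eps value are assumptions of the sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored by a reward model (toy values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed, and a group in which every response receives the same reward yields zero advantage for all of them.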

Language-to-Space Programming for Training-Free 3D Visual Grounding

Authors: Boyu Mi, Hanqing Wang, Tai Wang, Yilun Chen, Jiangmiao Pang

First: 2025-02-03T14:32:36+00:00 · Latest: 2025-08-28T07:57:55+00:00

Abs · PDF

Abstract

3D visual grounding (3DVG) is challenging due to the need to understand 3D spatial relations. While supervised approaches have achieved superior performance, they are constrained by the scarcity and high annotation costs of 3D vision-language datasets. Training-free approaches based on LLMs/VLMs eliminate the need for large-scale training data, but they either incur prohibitive grounding time and token costs or have unsatisfactory accuracy. To address these challenges, we introduce a novel method for training-free 3D visual grounding, namely Language-to-Space Programming (LaSP). LaSP introduces LLM-generated code to analyze 3D spatial relations among objects, along with a pipeline that evaluates and optimizes the code automatically. Experimental results demonstrate that LaSP achieves 52.9% accuracy on the Nr3D benchmark, ranking among the best training-free methods. Moreover, it substantially reduces the grounding time and token costs, offering a balanced trade-off between performance and efficiency.

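Title and abstract (translated from Chinese)

Title: Language-to-Space Programming for Training-Free 3D Visual Grounding

3D visual grounding (3DVG) is challenging because it requires understanding 3D spatial relations. Supervised methods achieve strong performance but are limited by the scarcity and high annotation cost of 3D vision-language datasets. Training-free methods based on LLMs/VLMs remove the need for large-scale training data, but they either incur high grounding time and token costs or deliver unsatisfactory accuracy. To address these challenges, we propose a new training-free 3D visual grounding method, Language-to-Space Programming (LaSP). LaSP uses LLM-generated code to analyze 3D spatial relations among objects and includes a pipeline that automatically evaluates and optimizes that code. Experiments show that LaSP reaches 52.9% accuracy on the Nr3D benchmark, placing it among the best training-free methods. It also substantially reduces grounding time and token costs, offering a balanced trade-off between performance and efficiency.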

Summary

The paper addresses 3D visual grounding by introducing Language-to-Space Programming (LaSP), which uses LLM-generated code to analyze 3D spatial relations among objects. The method evaluates and optimizes this code automatically, reaching 52.9% accuracy on the Nr3D benchmark while substantially reducing grounding time and token costs relative to other training-free approaches based on LLMs/VLMs.

The paper proposes Language-to-Space Programming (LaSP), a training-free method for 3D visual grounding that addresses the limitations of both supervised methods and existing LLM/VLM-based methods. LaSP uses LLM-generated code to analyze 3D spatial relations and automatically evaluates and optimizes that code. It reaches 52.9% accuracy on the Nr3D benchmark while reducing grounding time and token costs.
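To make "code that analyzes 3D spatial relations" concrete, the sketch below shows the kind of small relation function an LLM could plausibly generate for a query such as "the chair near the table". It is purely illustrative: the names `near_score` and `ground`, the exponential scoring, and the use of object centers are assumptions of this sketch, not LaSP's actual generated code or pipeline.

```python
import math


def near_score(center_a, center_b, scale=1.0):
    """Score in (0, 1] that decays with the distance between two 3D centers.

    The exponential form and the `scale` parameter are assumptions.
    """
    return math.exp(-math.dist(center_a, center_b) / scale)


def ground(candidates, anchor_center):
    """Return the candidate name whose center is nearest to the anchor.

    candidates: dict mapping object name -> (x, y, z) center.
    """
    return max(candidates, key=lambda name: near_score(candidates[name], anchor_center))


# Toy scene: two chairs and a table (the anchor object) at (2.5, 0, 0).
chairs = {"chair_1": (0.0, 0.0, 0.0), "chair_2": (3.0, 0.0, 0.0)}
print(ground(chairs, (2.5, 0.0, 0.0)))
```

Evaluating relations with executable code rather than repeated model calls is one way such a method can keep grounding time and token costs low.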

SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

Authors: Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li

Venue: EMNLP 2025

First: 2025-08-22T08:23:09+00:00 · Latest: 2025-08-28T06:44:28+00:00

Comments: Accepted at EMNLP 2025 Main

Abs · PDF · Code1

Abstract

Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model's speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.

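Title and abstract (translated from Chinese)

Title: SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning

Video large language models (Vid-LLMs) show strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and to accelerate the decoding stage of Vid-LLMs losslessly, we propose SpecVLM, a training-free speculative decoding (SD) framework tailored to Vid-LLMs that incorporates staged video token pruning. Building on our finding that the draft model's speculation is largely insensitive to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. It does so with a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), and Stage II prunes the remaining redundant tokens in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and up to 2.11$\times$ for Qwen2.5-VL-32B. Code is available at https://github.com/zju-jiyicheng/SpecVLM.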

Summary

SpecVLM is a speculative decoding framework for video large language models (Vid-LLMs) that prunes up to 90% of video tokens through a two-stage process, enabling efficient decoding without accuracy loss. Tokens are first selected using attention signals from the verifier, and the rest are then pruned in a spatially uniform manner, achieving up to 2.68x and 2.11x decoding speedup for LLaVA-OneVision-72B and Qwen2.5-VL-32B, respectively.

SpecVLM is a speculative decoding framework for video large language models (Vid-LLMs) that uses two-stage pruning guided by the verifier's attention signals to remove up to 90% of video tokens while preserving accuracy. It achieves up to 2.68x and 2.11x decoding speedup for LLaVA-OneVision-72B and Qwen2.5-VL-32B, respectively. The method is training-free and demonstrates its effectiveness and robustness on four video understanding benchmarks.
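The two-stage pruning described above can be sketched in a few lines. The version below is a simplified illustration, not the released SpecVLM code: it treats the video tokens as a flat 1D sequence, so "spatially uniform" becomes evenly spaced sampling over positions. The function name `two_stage_prune` and the parameters `keep_ratio` and `stage1_share` are hypothetical.

```python
def two_stage_prune(attn_scores, keep_ratio=0.1, stage1_share=0.5):
    """Return sorted indices of the video tokens to keep.

    attn_scores: one verifier attention score per video token (assumed given).
    keep_ratio: fraction of tokens kept overall (0.1 means pruning 90%).
    stage1_share: fraction of the kept budget filled by Stage I (assumption).
    """
    n = len(attn_scores)
    budget = max(1, round(n * keep_ratio))
    k1 = min(budget, max(1, round(budget * stage1_share)))

    # Stage I: keep the tokens with the highest verifier attention.
    by_score = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    stage1 = set(by_score[:k1])

    # Stage II: sample the remaining budget at evenly spaced positions
    # among the tokens that Stage I did not select.
    rest = [i for i in range(n) if i not in stage1]
    k2 = budget - k1
    stage2 = []
    if k2 > 0 and rest:
        step = len(rest) / k2
        stage2 = [rest[int(j * step)] for j in range(k2)]

    return sorted(stage1 | set(stage2))


# Toy input: 100 tokens, two of which receive high attention.
scores = [0.0] * 100
scores[7], scores[42] = 0.9, 0.8
kept = two_stage_prune(scores, keep_ratio=0.1, stage1_share=0.2)
print(len(kept), kept)
```

In speculative decoding only the draft model would see the pruned token set, while the verifier still checks the drafted tokens against the full input. That separation is what allows aggressive pruning without changing the final output.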