arXiv Paper Digest

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Authors: Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li

First: 2025-09-11T17:59:59+00:00 · Latest: 2025-09-11T17:59:59+00:00

Comments: Project page: https://flux-reason-6m.github.io/

Abstract

The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design an explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The entire data curation process takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced, human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/.

中文标题/摘要

标题：FLUX-Reason-6M & PRISM-Bench：百万规模的图文推理数据集及全面基准测试

开源图文生成（T2I）模型的发展受限于缺乏大规模、注重推理的数据集和全面的评估基准，导致其性能与领先封闭源系统存在差距。为解决这一挑战，我们引入了FLUX-Reason-6M和PRISM-Bench（精确且稳健的图像合成测量基准）。FLUX-Reason-6M是一个包含600万高质量FLUX生成图像和2000万双语（英语和中文）描述的庞大数据集，专门设计用于教授复杂推理。图像根据六个关键特征组织：想象力、实体、文本呈现、风格、情感和构图，并设计明确的生成链式思维（GCoT）以提供详细的图像生成步骤分解。整个数据整理耗时15000个A100 GPU天，为社区提供了以往只有大型工业实验室才能获得的资源。PRISM-Bench提供了一种新颖的评估标准，包括七个不同的赛道，其中包括使用GCoT的严峻长文本挑战。通过精心设计的提示，它利用先进的视觉-语言模型进行细致的人类对齐评估和图像美学评估。我们在PRISM-Bench上对19个领先模型的广泛评估揭示了关键性能差距，并指出了需要改进的具体领域。我们的数据集、基准测试和评估代码已发布，以推动下一代注重推理的T2I生成。项目页面：https://flux-reason-6m.github.io/

Summary / 总结

The research addresses the lack of large-scale reasoning-focused datasets and comprehensive benchmarks for text-to-image models, introducing FLUX-Reason-6M, a dataset of 6 million images and 20 million bilingual descriptions, and PRISM-Bench, a benchmark with seven tracks, including a Long Text challenge. The evaluation of 19 leading models on PRISM-Bench highlights significant performance gaps and areas for improvement in reasoning-oriented T2I generation. The dataset and benchmark are publicly released to advance the field.

研究针对文本到图像模型缺乏大规模的推理数据集和全面的评估基准的问题，引入了包含600万张图像和2000万对双语描述的FLUX-Reason-6M数据集，以及包含七个赛道的PRISM-Bench基准，其中包括长文本挑战。对19个领先模型在PRISM-Bench上的评估揭示了显著的性能差距，并指出了需要改进的具体领域。数据集和基准已公开发布，以推动该领域的发展。

Locality in Image Diffusion Models Emerges from Data Statistics

Authors: Artem Lukoianov, Chenyang Yuan, Justin Solomon, Vincent Sitzmann

First: 2025-09-11T17:59:08+00:00 · Latest: 2025-09-11T17:59:08+00:00

Comments: 30 pages, 18 figures, 6 tables

Abstract

Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.
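
The two denoisers named in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the noising model `x_noisy = x0 + sigma * noise`, uses the standard closed-form posterior mean over a finite training set, and uses the standard optimal affine (Wiener) denoiser. The AR(1) toy signals are a hypothetical stand-in for natural images with correlated neighbouring pixels.

```python
import numpy as np


def empirical_optimal_denoiser(x_noisy, train, sigma):
    """Posterior mean E[x0 | x_noisy] for x_noisy = x0 + sigma * noise,
    with x0 drawn uniformly from the rows of `train`."""
    sq_dist = np.sum((train - x_noisy) ** 2, axis=1)
    logits = -sq_dist / (2.0 * sigma ** 2)
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ train  # a weighted average of training samples


def linear_optimal_denoiser(train, sigma):
    """Best affine denoiser x -> mu + W (x - mu) in mean squared error."""
    mu = train.mean(axis=0)
    cov = np.cov(train, rowvar=False, bias=True)
    dim = cov.shape[0]
    W = cov @ np.linalg.inv(cov + sigma ** 2 * np.eye(dim))
    return W, mu


def toy_correlated_signals(n=5000, dim=16, rho=0.9, seed=0):
    """Hypothetical 1-D 'images' with correlated neighbouring pixels (AR(1))."""
    rng = np.random.default_rng(seed)
    x = np.empty((n, dim))
    x[:, 0] = rng.standard_normal(n)
    for i in range(1, dim):
        noise = rng.standard_normal(n)
        x[:, i] = rho * x[:, i - 1] + np.sqrt(1 - rho ** 2) * noise
    return x


W, mu = linear_optimal_denoiser(toy_correlated_signals(), sigma=0.5)
print(np.round(W[8], 2))  # weights the linear denoiser uses for pixel 8
```

In this toy setting the row of `W` for a given pixel puts most of its weight on nearby pixels, even though no convolutional structure is involved; the locality comes only from the covariance of the data.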

中文标题/摘要

标题：图像扩散模型中的局部性源自数据统计

在生成模型中，扩散模型因其训练目标的闭形式最优解而独具魅力，通常称为最优去噪器。然而，使用这种最优去噪器的扩散仅能复现训练集中的图像，因而无法捕捉深度扩散模型的行为。近期工作试图描述这种最优去噪器与深度扩散模型之间的差距，提出了无需训练的分析模型，能够生成类似于训练好的UNet生成的图像。表现最佳的方法假设卷积神经网络的平移等变性和局部性归纳偏差是性能差距的原因，因此将其假设纳入其分析模型中。在本文中，我们提供了证据表明，深度扩散模型中的局部性是图像数据集的统计属性，而非卷积神经网络的归纳偏差所致。具体而言，我们证明了最优参数线性去噪器表现出与深度神经去噪器相似的局部性特征。我们还通过理论和实验表明，这种局部性直接来源于自然图像数据集中像素间的相关性。最后，我们利用这些见解构建了一个分析去噪器，其预测分数与深度扩散模型更匹配，优于之前的专家设计的替代方案。

Summary / 总结

This study investigates why deep diffusion models exhibit locality, which is not present in the optimal denoiser. The research demonstrates that the locality in deep diffusion models is a statistical property of the image dataset rather than an inductive bias of convolutional neural networks. Key findings include that an optimal parametric linear denoiser also exhibits similar locality properties, and that this locality arises from pixel correlations in natural images. The study proposes a new analytical denoiser that better matches scores predicted by deep diffusion models.

本文研究了为什么深度扩散模型具有局部性这一特性，而这种局部性并非源自卷积神经网络的归纳偏置，而是数据集的统计特性。研究显示，最优参数线性去噪器也表现出类似的局部性，并且这种局部性源于自然图像中的像素相关性。此外，研究还开发了一种分析性去噪器，其预测结果更接近于深度扩散模型的预测分数。

Improved GUI Grounding via Iterative Narrowing

Authors: Anthony Nguyen

First: 2024-11-18T05:47:12+00:00 · Latest: 2025-09-11T16:37:00+00:00

Comments: Code available at https://github.com/ant-8/GUI-Grounding-via-Iterative-Narrowing

Abstract

Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for zero-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework that employs an iterative narrowing mechanism to further improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
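
A rough sketch of what an iterative narrowing loop could look like follows. The abstract does not specify the procedure, so everything here is an assumption: `predict` is a hypothetical stand-in for a VLM call that returns a normalized point inside the current crop, the crop shrinks by a fixed factor each step, and the toy predictor simply simulates a model with coarse spatial resolution.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]


def narrow(box: Box, center: Point, shrink: float) -> Box:
    """Return a smaller box around `center`, kept inside the old box."""
    left, top, right, bottom = box
    w, h = (right - left) * shrink, (bottom - top) * shrink
    new_left = min(max(center[0] - w / 2, left), right - w)
    new_top = min(max(center[1] - h / 2, top), bottom - h)
    return (new_left, new_top, new_left + w, new_top + h)


def iterative_narrowing(predict: Callable[[Box], Point], width: int,
                        height: int, steps: int = 3,
                        shrink: float = 0.5) -> Point:
    """Repeatedly query the predictor on a shrinking crop of the screenshot."""
    box: Box = (0.0, 0.0, float(width), float(height))
    point: Point = (width / 2.0, height / 2.0)
    for _ in range(steps):
        u, v = predict(box)  # normalized coordinates inside the crop
        point = (box[0] + u * (box[2] - box[0]),
                 box[1] + v * (box[3] - box[1]))
        box = narrow(box, point, shrink)
    return point


def make_toy_predictor(target: Point) -> Callable[[Box], Point]:
    """Hypothetical model that only resolves a 10 x 10 grid per crop."""
    def clamp(t: float) -> float:
        return min(max(t, 0.0), 1.0)

    def predict(box: Box) -> Point:
        left, top, right, bottom = box
        u = clamp((target[0] - left) / (right - left))
        v = clamp((target[1] - top) / (bottom - top))
        return (round(u * 10) / 10, round(v * 10) / 10)

    return predict


predict = make_toy_predictor((812.0, 430.0))
print(iterative_narrowing(predict, 1920, 1080, steps=1))
print(iterative_narrowing(predict, 1920, 1080, steps=3))
```

With the toy predictor, the estimate after three steps lands closer to the target than the single-shot estimate, because the same coarse grid covers a smaller region at each step.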

中文标题/摘要

标题：通过迭代细化改进的GUI接地

图形用户界面（GUI）接地在增强视觉语言模型（VLM）代理的能力方面起着关键作用。虽然通用的VLM，如GPT-4V，在各种任务中表现出色，但在GUI接地方面的熟练程度仍然不足。最近的研究集中在针对零样本GUI接地微调这些模型，显著提高了基线性能。我们介绍了一种视觉提示框架，该框架采用迭代细化机制，进一步提高了通用和微调模型在GUI接地中的性能。为了评估，我们在一个包含各种UI平台的综合基准上测试了我们的方法，并提供了可重现我们结果的代码。

Summary / 总结

The research aims to enhance the GUI grounding capabilities of Vision-Language Models (VLMs) by addressing their suboptimal performance in this area. The method introduces a visual prompting framework with an iterative narrowing mechanism to improve both general and fine-tuned VLMs. Key findings show significant performance improvements over baseline models across various UI platforms.

研究旨在通过改进视觉提示框架和迭代缩小机制来提升Vision-Language模型在GUI接地方面的性能，解决其在这一领域的不足。关键发现表明，该方法在各种UI平台上的性能显著优于基线模型。

Compositional Concept Generalization with Variational Quantum Circuits

Authors: Hala Hawashin, Mina Abbaszadeh, Nicholas Joseph, Beth Pearson, Martha Lewis, Mehrnoosh Sadrzadeh

First: 2025-09-11T15:34:33+00:00 · Latest: 2025-09-11T15:34:33+00:00

Comments: Accepted to: 2025 IEEE International Conference on Quantum Artificial Intelligence (QAI), Naples, Italy, Nov 2-5, 2025. This is the authors' accepted manuscript (AAM). An IEEE copyright notice appears on page 1. The final published version will appear in IEEE Xplore; DOI to be added when available

Abstract

Compositional generalization is a key facet of human cognition but is lacking in current AI tools such as vision-language models. Previous work examined whether compositional tensor-based sentence semantics can overcome the challenge, but the results were negative. We conjecture that the increased training efficiency of quantum models will improve performance in these tasks. We interpret the representations of compositional tensor-based models in Hilbert spaces and train Variational Quantum Circuits to learn these representations on an image captioning task requiring compositional generalization. We used two image encoding techniques: a multi-hot encoding (MHE) on binary image vectors and an angle/amplitude encoding on image vectors taken from the vision-language model CLIP. We achieve good proof-of-concept results using noisy MHE encodings. Performance on CLIP image vectors was more mixed, but still outperformed classical compositional models.
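
For readers unfamiliar with the encodings mentioned above, the sketch below shows the textbook forms of amplitude and angle encoding as plain state vectors. It is an illustration only: the abstract does not say which rotation gate or padding scheme the authors used, so the RY-style rotation and the zero padding here are assumptions.

```python
import numpy as np


def amplitude_encode(x):
    """Store a vector in the amplitudes of a quantum state.
    The vector is zero-padded to a power-of-two length and normalized."""
    x = np.asarray(x, dtype=float)
    size = 1 << (len(x) - 1).bit_length()  # next power of two
    padded = np.zeros(size)
    padded[: len(x)] = x
    norm = np.linalg.norm(padded)
    if norm == 0:
        raise ValueError("cannot encode the zero vector")
    return padded / norm


def angle_encode(x):
    """Use one qubit per feature; feature theta becomes the single-qubit
    state [cos(theta/2), sin(theta/2)] (an RY rotation applied to |0>)."""
    state = np.array([1.0])
    for theta in x:
        qubit = np.array([np.cos(theta / 2.0), np.sin(theta / 2.0)])
        state = np.kron(state, qubit)  # tensor product of the qubits
    return state


print(amplitude_encode([3.0, 4.0, 0.0]))
print(np.round(angle_encode([0.0, np.pi]), 6))
```

Amplitude encoding needs only about log2(n) qubits for n features, while angle encoding needs n qubits but is simpler to prepare; this trade-off is a general property of the two schemes rather than a finding of the paper.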

中文标题/摘要

标题：使用变分量子电路的组合理构概念泛化

组合理构泛化是人类认知的关键方面，但在当前的AI工具如视觉-语言模型中缺失。先前的工作研究了是否可以通过组合理论张量句法克服这一挑战，但结果为负。我们推测，量子模型的训练效率提高将改善这些任务的性能。我们解释了组合理论张量模型在希尔伯特空间中的表示，并训练变分量子电路在需要组合理构泛化的图像字幕任务中学习这些表示。我们使用了两种图像编码技术：二值图像向量的多热编码（MHE）和从视觉-语言模型CLIP获取的图像向量的角度/振幅编码。我们使用嘈杂的MHE编码取得了良好的概念验证结果。在CLIP图像向量上的表现则更为混合，但仍优于经典组合理构模型。

Summary / 总结

This study aims to enhance compositional generalization, a critical aspect of human cognition, by leveraging Variational Quantum Circuits. The researchers interpret compositional tensor-based models in Hilbert spaces and train VQCs on an image captioning task. Using noisy multi-hot encoding on binary image vectors, they achieve promising results. Performance with CLIP image vectors was more variable but still outperformed classical compositional models.

该研究旨在通过利用变量子电路来增强组成性泛化，这是人类认知的一个关键方面。研究人员将组成性张量模型解释在希尔伯特空间中，并在图像描述任务上训练VQC。使用二值图像向量的噪声多热编码，他们取得了令人鼓舞的结果。使用CLIP图像向量时，性能更加波动，但仍优于经典的组成性模型。

Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

First: 2025-09-08T09:20:04+00:00 · Latest: 2025-09-11T15:24:22+00:00

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs' attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with the degree of convergence determined by visual complexity; and (3) theoretically, we prove that the contrast of attention maps between general queries and task-specific queries enables the decomposition of the visual signal into semantic signal and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.
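
The notion of attention entropy used above can be made concrete with a short sketch. The entropy function is the standard Shannon entropy of a normalized attention map. The `contrast_map` function is only a hypothetical illustration of contrasting a task-specific map against a general one; it is an assumed ratio form and should not be read as the CARVE formula, which the abstract does not give.

```python
import numpy as np


def normalize(attn):
    """Scale a non-negative attention map so that it sums to one."""
    attn = np.asarray(attn, dtype=float)
    return attn / attn.sum()


def attention_entropy(attn):
    """Shannon entropy (in nats) of an attention map; higher = more diffuse."""
    p = normalize(attn).ravel()
    p = p[p > 0]  # 0 * log(0) is treated as 0
    return float(-(p * np.log(p)).sum())


def contrast_map(task_attn, general_attn, eps=1e-8):
    """Hypothetical contrast: task attention relative to general attention,
    rescaled to [0, 1]. An assumed form, not the paper's definition."""
    ratio = normalize(task_attn) / (normalize(general_attn) + eps)
    return ratio / ratio.max()


uniform = np.ones((4, 4))        # diffuse attention over a 4 x 4 patch grid
peaked = np.full((4, 4), 0.01)   # attention focused on one patch
peaked[1, 2] = 1.0

print(attention_entropy(uniform), attention_entropy(peaked))
print(np.unravel_index(contrast_map(peaked, uniform).argmax(), (4, 4)))
```

A diffuse map has entropy close to the log of the number of patches, while a focused map has much lower entropy, which is the quantity the abstract relates to visual complexity.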

中文标题/摘要

标题：基于对比注意力聚焦：增强VLMs的视觉推理

视觉-语言模型（VLMs）在多种视觉任务中取得了显著的成功，但在复杂视觉环境中其性能会下降。尽管现有的增强方法需要额外的训练、依赖外部分割工具或在粗粒度级别上操作，但它们忽视了VLMs内部固有的能力。为了弥合这一差距，我们研究了VLMs的注意力模式，并发现：（1）视觉复杂性与注意力熵呈强相关性，负面影响了推理性能；（2）注意力从浅层的全局扫描逐渐精炼到深层的聚焦收敛，收敛程度由视觉复杂性决定；（3）理论上，我们证明了通用查询与任务特定查询之间的注意力图对比能够将视觉信号分解为语义信号和视觉噪声成分。基于这些见解，我们提出了对比注意力精炼以增强视觉（CARVE），这是一种无需训练的方法，通过像素级的注意力对比提取任务相关的视觉信号。广泛的实验表明，CARVE能够一致地提升性能，开源模型的性能提升高达75%。我们的工作为理解视觉复杂性和注意力机制之间的相互作用提供了关键见解，为通过对比注意力改进视觉推理提供了高效途径。

Summary / 总结

This study addresses the performance degradation of Vision-Language Models (VLMs) in complex visual environments. By analyzing VLMs' attention patterns, the authors found that visual complexity negatively impacts reasoning performance and that attention progressively refines from global scanning to focused convergence. They propose CARVE, a training-free method that enhances VLMs by contrasting attention maps at the pixel level, leading to up to 75% performance improvement on open-source models.

该论文针对视觉语言模型（VLMs）在复杂视觉环境中的性能下降问题，通过分析VLMs的注意力模式，发现视觉复杂性会负面影响推理性能，并且注意力会从全局扫描逐渐精炼为聚焦收敛。他们提出了一种名为CARVE的无需训练的方法，通过像素级别的注意力对比提取任务相关的视觉信号，实验结果显示该方法可以显著提升性能，最高可达75%的改进。

Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics

Authors: Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi

First: 2025-09-10T10:07:27+00:00 · Latest: 2025-09-11T13:03:04+00:00

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMa 3.2, to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in the NOvA and DUNE experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events. Our evaluation considers both the classification performance and interpretability of the model predictions. We find that VLMs can outperform CNNs, while also providing greater flexibility in integrating auxiliary textual or semantic information and offering more interpretable, reasoning-based predictions. This work highlights the potential of VLMs as a general-purpose backbone for physics event classification, due to their high performance, interpretability, and generalizability, which opens new avenues for integrating multimodal reasoning in experimental neutrino physics.
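
Efficiency and purity, the two metrics mentioned above, are the high-energy-physics names for recall and precision of the signal class. The sketch below computes them from label lists; the label strings and the toy events are hypothetical and are not data from the paper.

```python
from typing import Sequence, Tuple


def efficiency_and_purity(true_labels: Sequence[str],
                          predicted_labels: Sequence[str],
                          signal: str) -> Tuple[float, float]:
    """Efficiency = selected true signal / all true signal (recall).
    Purity = selected true signal / all selected events (precision)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == signal and p == signal)
    fn = sum(1 for t, p in pairs if t == signal and p != signal)
    fp = sum(1 for t, p in pairs if t != signal and p == signal)
    efficiency = tp / (tp + fn) if (tp + fn) else 0.0
    purity = tp / (tp + fp) if (tp + fp) else 0.0
    return efficiency, purity


# Hypothetical toy events: electron-neutrino ("nue") versus muon-neutrino ("numu").
truth = ["nue", "nue", "numu", "nue", "numu", "numu"]
preds = ["nue", "numu", "nue", "nue", "numu", "nue"]
print(efficiency_and_purity(truth, preds, signal="nue"))
```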

中文标题/摘要

标题：将视觉语言模型适应于高能物理中的中微子事件分类

近年来，大型语言模型（LLMs）在处理和推理结构化和非结构化数据方面的能力已经显示出其显著优势，远超自然语言之外的数据模态。在本文中，我们探讨了视觉语言模型（VLMs），特别是LLaMa 3.2的微调变体，应用于高能物理（HEP）实验中像素化探测器数据中的中微子相互作用识别任务。我们将该模型与NOvA和DUNE实验中使用的类似卷积神经网络（CNN）架构进行基准测试，这些架构在分类电子和Muon中微子事件方面已达到高效率和纯度。我们的评估考虑了模型分类性能和预测的可解释性。我们发现VLMs可以超越CNNs，同时提供更大的灵活性以整合辅助文本或语义信息，并提供更可解释、基于推理的预测。本文强调了VLMs作为物理事件分类的一般用途骨干的潜力，由于其高性能、可解释性和泛化能力，这为在实验中微子物理中整合多模态推理开辟了新的途径。

Summary / 总结

This study explores the application of Vision Language Models (VLMs) in classifying neutrino interactions in high-energy physics experiments, comparing them to state-of-the-art convolutional neural networks (CNNs). The VLMs, fine-tuned from LLaMa 3.2, outperform CNNs in classification performance while offering greater interpretability and flexibility in integrating textual information. The evaluation covers both classification performance and the interpretability of model predictions, and the results highlight VLMs' potential as a general-purpose backbone for physics event classification and multimodal reasoning in experimental neutrino physics.

本研究探索了使用视觉语言模型（VLMs）来分类高能物理实验中的中微子相互作用，将其与最先进的卷积神经网络（CNNs）进行了比较。经过LLaMa 3.2微调的VLMs在分类性能上优于CNNs，并且在整合文本信息方面更具灵活性和可解释性。研究结果表明，VLMs可能因其高性能和可解释性而在实验中微子物理中作为多模态推理的有价值工具。

Shaken, Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks

Authors: Lukáš Gajdošech, Hassan Ali, Jan-Gerrit Habekost, Martin Madaras, Matthias Kerzel, Stefan Wermter

Venue: IROS

First: 2025-03-06T10:51:04+00:00 · Latest: 2025-09-11T12:49:34+00:00

Comments: Submitted and Accepted for Presentation at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025

Abstract

Datasets for object detection often do not account for enough variety of glasses, due to their transparent and reflective properties. Specifically, open-vocabulary object detectors, widely used in embodied robotic agents, fail to distinguish subclasses of glasses. This scientific gap poses an issue for robotic applications that suffer from accumulating errors between detection, planning, and action execution. This paper introduces a novel method for acquiring real-world data from RGB-D sensors that minimizes human effort. We propose an auto-labeling pipeline that generates labels for all the acquired frames based on the depth measurements. We provide a novel real-world glass object dataset GlassNICOLDataset that was collected on the Neuro-Inspired COLlaborator (NICOL), a humanoid robot platform. The dataset consists of 7850 images recorded from five different cameras. We show that our trained baseline model outperforms state-of-the-art open-vocabulary approaches. In addition, we deploy our baseline model in an embodied agent approach to the NICOL platform, on which it achieves a success rate of 81% in a human-robot bartending scenario.

中文标题/摘要

标题：摇而不搅：一种用于人类-机器人调酒任务中玻璃视觉理解的新数据集

用于物体检测的数据集往往没有考虑到足够多的玻璃种类，由于玻璃的透明和反射特性。特别是，广泛应用于具身机器人代理的开放词汇物体检测器无法区分不同类别的玻璃。这一科学空白对因检测、规划和动作执行之间的累积错误而受到影响的机器人应用构成了问题。本文介绍了一种新的方法，用于从RGB-D传感器获取真实世界数据，从而最小化人力投入。我们提出了一种自动标注流水线，根据深度测量生成所有获取帧的标签。我们提供了一个新的真实世界玻璃对象数据集GlassNICOLDataset，该数据集是在神经启发式协作者（NICOL）人形机器人平台上收集的。数据集包含从五个不同摄像头记录的7850张图像。我们展示了我们训练的基本模型优于最先进的开放词汇方法。此外，我们在NICOL平台上部署了我们的基本模型，该模型在人类-机器人调酒场景中达到了81%的成功率。

Summary / 总结

This paper addresses the lack of variety in existing datasets for object detection, particularly for glasses, which are transparent and reflective. It introduces a novel dataset, GlassNICOLDataset, collected using RGB-D sensors on a humanoid robot platform. The dataset includes 7850 images from five cameras and an auto-labeling pipeline for generating labels. The authors' baseline model outperforms state-of-the-art open-vocabulary approaches and achieves an 81% success rate in a human-robot bartending scenario on the NICOL platform.

本文解决了现有物体检测数据集中对眼镜这类透明且反射性强的物体缺乏多样性的问题。该文提出了一种新型数据集GlassNICOLDataset，通过在人形机器人平台上使用RGB-D传感器收集了7850张来自五个摄像头的图像，并提出了一种基于深度测量的自动标注流水线。作者的基线模型在开放词汇表方法中表现出色，并在NICOL平台上的人机调酒场景中实现了81%的成功率。

Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift

Authors: Umaima Rahman, Raza Imam, Mohammad Yaqub, Dwarikanath Mahapatra

First: 2025-09-11T12:26:57+00:00 · Latest: 2025-09-11T12:26:57+00:00

Abstract

Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at https://github.com/rumaima/DRiFt.
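
The parameter-efficient tuning mentioned above (LoRA) can be illustrated with a small NumPy sketch of a single adapted linear layer. This shows only the generic LoRA idea of a frozen weight plus a trainable low-rank update; the class name, rank, scaling, and initialization are assumptions for illustration and do not reflect how DRiFt configures or places its adapters.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                 # frozen
        self.A = rng.standard_normal((rank, in_dim)) * 0.01  # trainable
        self.B = np.zeros((out_dim, rank))                   # trainable, starts at 0
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_parameters(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.standard_normal((64, 128))
layer = LoRALinear(W, rank=4)
x = rng.standard_normal((3, 128))
print(layer(x).shape, layer.trainable_parameters(), W.size)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the two small matrices would be updated during tuning.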

中文标题/摘要

标题：解耦临床和类别无关特征以实现可靠的少量样本适应性调整

医学视觉-语言模型（VLMs）为临床决策支持提供了希望，但在分布变化下的可靠性仍然是安全部署的主要关切。这些模型由于成像协议和自由文本报告的差异性，往往学习到任务无关的关联性，限制了其泛化能力并增加了在实际场景中失败的风险。我们提出了一种名为DRiFt的结构化特征解耦框架，该框架通过参数高效调优（LoRA）和可学习的提示令牌显式地将临床相关信号与任务无关的噪声分离。为了增强跨模态对齐并减少不确定性，我们通过为多样化的医学数据集生成描述来精心策划高质量的临床相关图像-文本对。我们的方法在分布内性能上比之前的基于提示的方法提高了11.4%的Top-1准确率和3.3%的宏F1分数，同时在未见数据集上保持了强大的鲁棒性。消融研究显示，分离任务相关特征和精细对齐显著增强了模型的泛化能力和减少了领域变化下的不可预测行为。这些见解有助于构建更安全、更值得信赖的VLMs用于临床应用。代码可在https://github.com/rumaima/DRiFt获取。

Summary / 总结

The research aims to improve the reliability of medical vision-language models (VLMs) under distribution shifts by decoupling clinical and task-agnostic features. DRiFt uses parameter-efficient tuning (LoRA) and learnable prompt tokens to separate clinically relevant signals from noise. This approach enhances in-distribution performance by 11.4% in Top-1 accuracy and 3.3% in Macro-F1, while maintaining robustness across unseen datasets. Ablation studies show that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift.

本文提出了一种结构化的特征解耦框架DRiFt，以解决医疗视觉-语言模型（VLMs）在分布变化下的可靠性问题。DRiFt 使用参数高效的调优（LoRA）和可学习的提示标记来分离临床相关的信号和任务无关的噪声。该方法在 Top-1 准确率上提高了 11.4%，在宏 F1 上提高了 3.3%，超过了之前的基于提示的方法，同时在未见过的数据集上保持了鲁棒性。消融研究表明，分离任务相关特征和仔细对齐可以增强模型的泛化能力并减少域变化下的不可预测行为。

Curriculum-Based Multi-Tier Semantic Exploration via Deep Reinforcement Learning

Authors: Abdel Hakim Drid, Vincenzo Suriani, Daniele Nardi, Abderrezzak Debilou

First: 2025-09-11T11:10:08+00:00 · Latest: 2025-09-11T11:10:08+00:00

Comments: The 19th International Conference on Intelligent Autonomous Systems (IAS 19), 2025, Genoa

Abstract

Navigating and understanding complex and unknown environments autonomously demands more than just basic perception and movement from embodied agents. Truly effective exploration requires agents to possess higher-level cognitive abilities: the ability to reason about their surroundings and to make more informed decisions regarding exploration strategies. However, traditional RL approaches struggle to balance efficient exploration and semantic understanding because of the limited cognitive capabilities embedded in the agents' small policies, which often leaves semantic exploration dependent on human operators. In this paper, we address this challenge by presenting a novel Deep Reinforcement Learning (DRL) architecture that is specifically designed for resource-efficient semantic exploration. A key methodological contribution is the integration of Vision-Language Model (VLM) common sense through a layered reward function. The VLM query is modeled as a dedicated action, allowing the agent to strategically query the VLM only when deemed necessary for gaining external guidance, thereby conserving resources. This mechanism is combined with a curriculum learning strategy designed to guide learning at different levels of complexity to ensure robust and stable learning. Our experimental results demonstrate that our agent achieves significantly enhanced object discovery rates and develops a learned capability to effectively navigate towards semantically rich regions. Furthermore, it also shows a strategic mastery of when to prompt for external environmental information. By demonstrating a practical and scalable method for embedding common-sense semantic reasoning in autonomous agents, this research provides a novel approach to pursuing fully intelligent and self-guided exploration in robotics.

中文标题/摘要

标题：基于 Curriculum 的多级语义探索深度强化学习

自主导航和理解复杂未知环境不仅需要基本的感知和移动，还需要具备高级认知能力，能够推理周围环境并做出更明智的探索策略选择。然而，传统强化学习方法由于代理嵌入的认知能力有限，难以在高效探索和语义理解之间取得平衡，导致在处理语义探索时需要人类干预。本文提出了一种新的深度强化学习（DRL）架构，专门设计用于资源高效语义探索。一个关键的方法贡献是通过分层奖励函数整合了视觉语言模型（VLM）的常识。VLM 查询被建模为专用动作，使代理仅在必要时战略性地查询 VLM 以获取外部指导，从而节省资源。该机制结合了一种课程学习策略，以引导不同复杂度水平的学习，确保稳健和稳定的训练。实验评估结果表明，我们的代理在物体发现率方面显著提高，并发展了有效导航至语义丰富区域的能力。此外，还展示了何时请求外部环境信息的战略掌握。通过展示一种实用且可扩展的方法，将常识语义推理嵌入自主代理，这项研究为追求完全智能和自我引导的机器人探索提供了一种新的方法。

Summary / 总结

This paper addresses the challenge of autonomous embodied agents in exploring complex and unknown environments by proposing a novel Deep Reinforcement Learning (DRL) architecture that integrates a Vision-Language Model (VLM) through a layered reward function. The method combines curriculum learning to guide learning at different levels of complexity and strategic querying of the VLM only when necessary. Experimental results show that the agent achieves significantly enhanced object discovery rates and effectively navigates towards semantically rich regions, demonstrating strategic mastery of when to seek external information.

本文通过提出一种新颖的深度强化学习（DRL）架构，将视觉-语言模型（VLM）通过分层奖励函数集成进来，解决自主语义探索的挑战。该代理仅在必要时战略性地查询VLM，从而节省资源并提高物体发现率和向语义丰富区域导航的能力。课程学习策略确保了学习的稳健性和稳定性，实验结果表明在探索效率和外部环境信息提示策略方面取得了显著改进。

S$^2$-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models

Authors: Chubin Chen, Jiashu Zhu, Xiaokun Feng, Nisha Huang, Meiqi Wu, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Xiu Li

First: 2025-08-18T12:31:20+00:00 · Latest: 2025-09-11T10:04:07+00:00

Abstract

Classifier-free Guidance (CFG) is a widely used technique in modern diffusion models for enhancing sample quality and prompt adherence. However, through an empirical analysis on Gaussian mixture modeling with a closed-form solution, we observe a discrepancy between the suboptimal results produced by CFG and the ground truth. The model's excessive reliance on these suboptimal predictions often leads to semantic incoherence and low-quality outputs. To address this issue, we first empirically demonstrate that the model's suboptimal predictions can be effectively refined using sub-networks of the model itself. Building on this insight, we propose S$^2$-Guidance, a novel method that leverages stochastic block-dropping during the forward process to construct stochastic sub-networks, effectively guiding the model away from potential low-quality predictions and toward high-quality outputs. Extensive qualitative and quantitative experiments on text-to-image and text-to-video generation tasks demonstrate that S$^2$-Guidance delivers superior performance, consistently surpassing CFG and other advanced guidance strategies. Our code will be released.
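
As background for the abstract above, the standard CFG update combines an unconditional and a conditional noise prediction. The sketch below shows that baseline combination only, in the common convention where a scale of 1 recovers the conditional prediction (some papers write the scale as 1 + w instead). It does not attempt to reproduce S$^2$-Guidance, whose update rule is not given in the abstract.

```python
import numpy as np


def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one.
    guidance_scale = 0 gives the unconditional prediction, 1 the conditional one,
    and values above 1 push further in the conditional direction."""
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


eps_u = np.array([0.0, 0.0])  # toy unconditional noise prediction
eps_c = np.array([1.0, 2.0])  # toy conditional noise prediction
print(classifier_free_guidance(eps_u, eps_c, guidance_scale=2.0))
```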

中文标题/摘要

标题：S$^2$-指导：基于随机自我引导的训练无监督增强扩散模型

无分类器指导（CFG）是现代扩散模型中广泛使用的一种技术，用于提高样本质量和指令一致性。然而，通过对具有解析解的高斯混合模型进行经验分析，我们观察到CFG产生的次优结果与真实值之间存在差异。模型对这些次优预测的过度依赖往往导致语义不一致和低质量输出。为了解决这一问题，我们首先通过经验表明，可以使用模型自身的子网络有效精炼模型的次优预测。在此基础上，我们提出了一种新颖的方法S$^2$-指导，该方法利用前向过程中随机块丢弃来构建随机子网络，有效地引导模型远离潜在的低质量预测，趋向高质量输出。在文本到图像和文本到视频生成任务上的广泛定性和定量实验表明，S$^2$-指导提供了优越的性能，始终优于CFG和其他先进的指导策略。我们的代码将被发布。

Summary / 总结

The paper addresses the issue of suboptimal results produced by Classifier-free Guidance (CFG) in diffusion models, which often lead to semantic incoherence and low-quality outputs. To improve this, the authors propose S$^2$-Guidance, a method that uses stochastic block-dropping to construct sub-networks, guiding the model towards high-quality predictions. Experiments show that S$^2$-Guidance outperforms CFG and other advanced guidance strategies in text-to-image and text-to-video generation tasks.

论文针对分类器自由引导(CFG)在扩散模型中产生的次优结果导致语义不一致和低质量输出的问题，提出了一种名为S$^2$-引导的方法，该方法通过在前向过程中使用随机块丢弃来构建子网络，引导模型产生高质量的预测。实验表明，S$^2$-引导在文本到图像和文本到视频生成任务中优于CFG和其他先进引导策略。

Image Recognition with Vision and Language Embeddings of VLMs

Authors: Illia Volkov, Nikita Kisel, Klara Janouskova, Jiri Matas

First: 2025-09-11T09:54:25+00:00 · Latest: 2025-09-11T09:54:25+00:00

Abstract

Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
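
The two classification modes and a precision-based fusion can be sketched as follows. The zero-shot and k-NN classifiers follow their usual definitions over cosine similarity. The fusion rule (trust whichever classifier has the higher held-out precision for the class it predicts, with ties going to the text classifier) is an assumed reading of "fusion based on per-class precision", not the authors' published procedure; see their repository for the actual method. All function names here are hypothetical.

```python
import numpy as np


def _unit(x):
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_predict(image_emb, text_emb):
    """Language-guided: pick the class whose text embedding is most similar."""
    return (_unit(image_emb) @ _unit(text_emb).T).argmax(axis=1)


def knn_predict(image_emb, ref_emb, ref_labels, k, n_classes):
    """Vision-only: majority vote among the k most similar reference images."""
    sims = _unit(image_emb) @ _unit(ref_emb).T
    nearest = np.argsort(-sims, axis=1)[:, :k]
    votes = np.asarray(ref_labels)[nearest]
    return np.array([np.bincount(v, minlength=n_classes).argmax() for v in votes])


def per_class_precision(preds, labels, n_classes):
    """Precision of each predicted class on a held-out set (0 if never predicted)."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    out = np.zeros(n_classes)
    for c in range(n_classes):
        picked = preds == c
        if picked.any():
            out[c] = (labels[picked] == c).mean()
    return out


def fuse(text_preds, knn_preds, text_precision, knn_precision):
    """Assumed rule: keep the prediction whose class precision is higher."""
    text_preds, knn_preds = np.asarray(text_preds), np.asarray(knn_preds)
    use_text = text_precision[text_preds] >= knn_precision[knn_preds]
    return np.where(use_text, text_preds, knn_preds)


# Hypothetical held-out predictions for a two-class problem.
val_labels = [0, 0, 1, 1]
p_text = per_class_precision([0, 0, 0, 1], val_labels, n_classes=2)
p_knn = per_class_precision([0, 0, 1, 1], val_labels, n_classes=2)
print(fuse([0, 1], [1, 0], p_text, p_knn))
```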

中文标题/摘要

标题：使用VLMs的视觉和语言嵌入的图像识别

视觉语言模型（VLMs）通过图像文本对齐实现了强大的零样本分类。然而，它们的纯视觉推理能力尚未得到充分探索。在本文中，我们使用一组多样化的双编码器VLMs，包括一些成熟的模型和最近的模型（如SigLIP 2和RADIOv2.5），对语言引导和纯视觉图像分类进行了全面评估。性能在标准设置下使用ImageNet-1k验证集及其标签修正变体进行比较。分析了影响准确性的关键因素，包括提示设计、类别多样性、k-NN中的邻居数量以及参考集大小。我们展示了语言和视觉提供了互补的优势，某些类别更倾向于文本提示，而另一些类别则更适合通过视觉相似性处理。为了利用这种互补性，我们提出了一种基于类别的精确度的简单无学习融合方法，该方法提高了分类性能。代码可在：https://github.com/gonikisgo/bmvc2025-vlm-image-recognition 获取。

Summary / 总结

This study evaluates the performance of vision-language models (VLMs) in image classification, both with and without language guidance, using a variety of dual-encoder VLMs like SigLIP 2 and RADIOv2.5. The evaluation is conducted on the ImageNet-1k validation set and its corrected variant. Key factors such as prompt design, class diversity, and reference set size are analyzed to understand their impact on accuracy. The research demonstrates that language and vision provide complementary strengths, and a simple fusion method based on per-class precision improves classification performance.

该研究评估了视觉-语言模型（VLMs）在有和无语言指导下的图像分类性能，使用了多种双编码器模型。研究考察了提示设计、类别多样性以及最近邻的数量等因素，显示语言和视觉提供了互补的优势。引入了一种基于类别精度的简单融合方法，以提高分类性能，展示了其相对于基线方法的改进效果。

Visual Programmability: A Guide for Code-as-Thought in Chart Understanding

Authors: Bohao Tang, Yan Ma, Fei Zhang, Jiadi Su, Ethan Chern, Zhulin Hu, Zhixin Wang, Pengfei Liu, Ya Zhang

First: 2025-09-11T09:22:16+00:00 · Latest: 2025-09-11T09:22:16+00:00

Abstract

Chart understanding presents a critical test to the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches face critical limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that often adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach to represent the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where symbolic representation is unsuitable. This finding leads us to introduce Visual Programmability: a learnable property that determines if a chart-question pair is better solved with code or direct visual analysis. We implement this concept in an adaptive framework where a VLM learns to choose between the CaT pathway and a direct visual reasoning pathway. The selection policy of the model is trained with reinforcement learning using a novel dual-reward system. This system combines a data-accuracy reward to ground the model in facts and prevent numerical hallucination, with a decision reward that teaches the model when to use each strategy, preventing it from defaulting to a single reasoning mode. Experiments demonstrate strong and robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only to reason but also how to reason, dynamically selecting the optimal reasoning pathway for each task.

中文标题/摘要

标题：视觉编程性：图表理解中的代码即思维指南

图表理解是对视觉语言模型（VLMs）推理能力的一个关键考验。先前的方法面临重大限制：一些方法依赖外部工具，使其脆弱且受限于预定义的工具集，而另一些则微调专门模型，通常采用单一的推理策略，如基于文本的思维链（CoT）。基于文本的推理中间步骤难以验证，这使得使用奖励事实准确性的强化学习信号变得复杂。为解决这一问题，我们提出了一种代码即思维（CaT）方法，以可验证的符号格式表示图表的视觉信息。我们的关键见解是，这种策略必须是适应性的：固定且仅包含代码的实现方式在复杂图表上始终失败，因为符号表示在此类图表上不合适。这一发现促使我们引入视觉编程性：一种可学习的属性，决定图表-问题对更适合用代码还是直接视觉分析来解决。我们在一个适应性框架中实现这一概念，其中VLM学习在CaT路径和直接视觉推理路径之间进行选择。模型的选择策略通过一个新颖的双重奖励系统进行强化学习训练。该系统结合了数据准确性奖励，以使模型扎根于事实，防止数值幻觉，以及一个决策奖励，教导模型何时使用每种策略，防止其默认使用单一推理模式。实验表明，该方法在多种图表理解基准测试中表现出强大且稳健的性能。我们的工作表明，VLMs不仅可以被教导进行推理，还可以被教导如何推理，动态选择每个任务的最佳推理路径。

Summary / 总结

This paper addresses the limitations of existing approaches in chart understanding by proposing a Code-as-Thought (CaT) approach and introducing Visual Programmability. The CaT approach represents visual information in a verifiable, symbolic format, while Visual Programmability determines whether a chart-question pair should be solved with code or direct visual analysis. The model learns to choose between these pathways using a reinforcement learning system with dual rewards for data accuracy and decision-making. Experiments show strong and robust performance across various chart-understanding benchmarks.

论文通过提出Code-as-Thought (CaT) 方法来表示视觉信息，以解决现有方法在图表理解中的局限性。作者引入了视觉编程性，这是一种可学习的特性，允许视觉语言模型动态选择基于代码的推理或直接视觉分析。实验表明，该方法在各种图表理解基准测试中表现出强大的鲁棒性，表明视觉语言模型可以被教导如何根据不同的任务选择最优的推理路径。
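
The dual-reward idea described in the abstract above can be illustrated with a small sketch. The abstract does not give the reward formulas, so everything here is an assumption for illustration: the function names, the relative tolerance, the equal weights, and the `code_suitable` label are all hypothetical.

```python
from typing import Sequence


def data_accuracy_reward(predicted: Sequence[float],
                         reference: Sequence[float],
                         tol: float = 0.05) -> float:
    """Fraction of extracted chart values within a relative tolerance (assumed metric)."""
    if not reference or len(predicted) != len(reference):
        return 0.0
    hits = 0
    for p, r in zip(predicted, reference):
        scale = max(abs(r), 1e-9)  # avoid division by zero for zero-valued references
        if abs(p - r) / scale <= tol:
            hits += 1
    return hits / len(reference)


def decision_reward(chose_code: bool, code_suitable: bool) -> float:
    """1.0 when the chosen pathway matches the (hypothetical) suitability label."""
    return 1.0 if chose_code == code_suitable else 0.0


def dual_reward(predicted, reference, chose_code, code_suitable,
                w_data: float = 0.5, w_decision: float = 0.5) -> float:
    """Weighted sum of both terms; the data term only applies on the code pathway."""
    data_term = data_accuracy_reward(predicted, reference) if chose_code else 0.0
    return w_data * data_term + w_decision * decision_reward(chose_code, code_suitable)
```

In this sketch, a model that always picks the code pathway loses the decision term on charts labelled unsuitable for code, which is one simple way to discourage defaulting to a single reasoning mode.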

Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis

Authors: Jing Hao, Yuxuan Fan, Yanpeng Sun, Kaixin Guo, Lizhuo Lin, Jinrong Yang, Qi Yong H. Ai, Lun M. Wong, Hao Tang, Kuo Feng Hung

First: 2025-09-11T08:39:08+00:00 · Latest: 2025-09-11T08:39:08+00:00

Comments: 40 pages, 26 figures, 9 tables

Abs · PDF · Code1

Abstract

Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.

中文标题/摘要

标题：迈向更好的牙科AI：针对全景X光分析的多模态基准和指令数据集

近年来，大型视觉-语言模型（LVLM）在通用医疗任务上表现出色。然而，它们在牙科等专门领域的有效性尚未得到充分探索。特别是，全景X光片是口腔放射学中广泛使用的成像技术，由于其密集的解剖结构和细微的病理线索，现有的医疗基准或指令数据集无法捕捉到这些特征，从而带来解读上的挑战。为此，我们引入了MMOral，这是首个针对全景X光片解读的大型多模态指令数据集和基准。MMOral包含20,563张标注图像，配以130万条指令遵循实例，涵盖了包括属性提取、报告生成、视觉问答和图像基础对话在内的多种任务类型。此外，我们还提出了MMOral-Bench，这是一个全面的评估套件，涵盖了牙科中的五个关键诊断维度。我们在MMOral-Bench上评估了64个LVLM，发现即使表现最好的模型GPT-4o也只能达到41.45%的准确率，揭示了当前模型在该领域的显著局限性。为了促进该特定领域的进展，我们还提出了OralGPT，它基于我们精心整理的MMOral指令数据集对Qwen2.5-VL-7B进行了监督微调（SFT）。令人惊讶的是，单个SFT周期就为LVLM带来了显著的性能提升，例如OralGPT展示了24.73%的改进。MMOral和OralGPT在智能牙科领域具有重要潜力，能够促进牙科领域的更多临床影响的多模态AI系统的发展。数据集、模型、基准和评估套件可在https://github.com/isbrycee/OralGPT获取。

Summary / 总结

This study aims to improve AI for dental diagnostics by introducing MMOral, a multimodal instruction dataset and benchmark for panoramic X-ray analysis. The dataset includes 20,563 annotated images and 1.3 million instruction-following instances. Of the 64 large vision-language models evaluated on MMOral-Bench, the best (GPT-4o) reached only 41.45% accuracy, highlighting the limitations of current models. The study also proposes OralGPT, obtained by supervised fine-tuning of Qwen2.5-VL-7B on the MMOral dataset, which shows a 24.73% improvement after a single epoch. The authors present MMOral and OralGPT as a foundation for intelligent dentistry.

该论文介绍了MMOral，这是一个面向牙科全景X光片解释的大规模多模态指令数据集和基准。它包含20,563张标注图像，配对了130万条指令遵循实例，涵盖多种任务。作者在MMOral-Bench上评估了64个大型视觉语言模型，并发现即使最好的模型GPT-4o也只能达到41.45%的准确率。他们还提出了OralGPT，在MMOral数据集上进行一次监督微调后，显示出24.73%的性能提升。

IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves

Authors: Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, Yu-Gang Jiang

First: 2024-10-29T07:15:56+00:00 · Latest: 2025-09-11T06:44:05+00:00

Abs · PDF · Project1

Abstract

As large Vision-Language Models (VLMs) gain prominence, ensuring their safe deployment has become critical. Recent studies have explored VLM robustness against jailbreak attacks, techniques that exploit model vulnerabilities to elicit harmful outputs. However, the limited availability of diverse multimodal data has constrained current approaches to rely heavily on adversarial or manually crafted images derived from harmful text datasets, which often lack effectiveness and diversity across different contexts. In this paper, we propose IDEATOR, a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR is grounded in the insight that VLMs themselves could serve as powerful red team models for generating multimodal jailbreak prompts. Specifically, IDEATOR leverages a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. Extensive experiments demonstrate IDEATOR's high effectiveness and transferability, achieving a 94% attack success rate (ASR) in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high ASRs of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Chameleon, respectively. Building on IDEATOR's strong transferability and automated process, we introduce the VLJailbreakBench, a safety benchmark comprising 3,654 multimodal jailbreak samples. Our benchmark results on 11 recently released VLMs reveal significant gaps in safety alignment. For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet, underscoring the urgent need for stronger defenses. VLJailbreakBench is publicly available at https://roywang021.github.io/VLJailbreakBench.

中文标题/摘要

标题：IDEATOR：利用自身进行视觉-语言模型的越狱和基准测试

随着大型视觉-语言模型（VLMs）的重要性日益凸显，确保其安全部署变得至关重要。近期研究探索了VLM对越狱攻击的鲁棒性——利用模型漏洞产生有害输出的技术。然而，多样化的多模态数据的有限可用性限制了当前方法主要依赖于来自有害文本数据集的对抗性或手工制作的图像，这些图像往往在不同上下文中的有效性和多样性不足。在本文中，我们提出了一种名为IDEATOR的新颖越狱方法，该方法能够自主生成用于黑盒越狱攻击的恶意图像-文本对。IDEATOR基于这样一个洞察：VLMs本身可以作为强大的红队模型，用于生成多模态越狱提示。具体而言，IDEATOR利用VLM生成针对性的越狱文本，并与最先进的扩散模型生成的越狱图像配对。广泛的实验表明，IDEATOR具有高效率和可移植性，在对MiniGPT-4进行越狱攻击时，成功率为94%，平均只需5.34次查询，分别在LLaVA、InstructBLIP和Chameleon上的成功率分别为82%、88%和75%。基于IDEATOR强大的可移植性和自动化过程，我们引入了VLJailbreakBench，这是一个包含3,654个多模态越狱样本的安全基准。我们的基准测试结果表明，11个最近发布的VLMs在安全性对齐方面存在显著差距。例如，我们的挑战集在GPT-4o上的成功率为46.31%，在Claude-3.5-Sonnet上的成功率为19.65%，突显了加强防御的迫切需求。VLJailbreakBench已公开发布于https://roywang021.github.io/VLJailbreakBench。

Summary / 总结

The paper introduces IDEATOR, a method that generates malicious image-text pairs for black-box jailbreak attacks on Vision-Language Models (VLMs). IDEATOR uses a VLM as a red-team model to write targeted jailbreak texts and pairs them with images generated by a diffusion model. It achieves a 94% attack success rate (ASR) on MiniGPT-4 and ASRs of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Chameleon. The authors also present VLJailbreakBench, a safety benchmark of 3,654 multimodal jailbreak samples, which reveals significant safety-alignment gaps across 11 recently released VLMs.

本文提出了IDEATOR，一种用于对视觉-语言模型（VLMs）进行黑盒越狱攻击的新方法。通过利用VLMs自身生成针对性的越狱文本和图像，IDEATOR实现了高效率和可移植性，在MiniGPT-4上的攻击成功率高达94%，并在其他模型上也取得了显著的成功率。作者还提出了VLJailbreakBench，一个安全基准，揭示了VLM在安全性对齐方面的显著差距。

Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios

Authors: Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun, Yunjian Zhang, Xiangyang Ji, Yao Zhu

First: 2025-09-11T06:15:52+00:00 · Latest: 2025-09-11T06:15:52+00:00

Comments: ICCV2025

Abs · PDF

Abstract

With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization: RRDataset encompasses high-quality images from seven major scenarios (War and Conflict, Disasters and Accidents, Political and Social Events, Medical and Public Health, Culture and Religion, Labor and Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness: examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness: assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms.

中文标题/摘要

标题：在理想与现实之间架起桥梁：在挑战性场景中评估AI生成图像检测基准

随着生成模型的迅速发展，高度逼真的图像合成给数字安全和媒体可信度带来了新的挑战。尽管已经部分解决了AI生成图像检测方法的问题，但在复杂现实世界条件下的评估研究仍存在较大差距。本文介绍了Real-World Robustness Dataset (RRDataset)，用于从三个维度全面评估检测模型：1) 场景泛化：RRDataset 包含来自七个主要场景的高质量图像（战争与冲突、灾难与事故、政治与社会事件、医疗与公共卫生、文化与宗教、劳动与生产、日常生活），从内容角度填补了现有数据集的空白。2) 互联网传输鲁棒性：考察检测器在经过多个社交媒体平台多次分享后的图像上的性能。3) 再数字化鲁棒性：评估模型在经过四种不同再数字化方法修改后的图像上的效果。我们在RRDataset 上对17种检测器和10种视觉-语言模型（VLMs）进行了基准测试，并进行了涉及192名参与者的大型人类研究，以调查人类在检测AI生成图像方面的少量样本学习能力。基准测试结果揭示了当前AI检测方法在现实世界条件下的局限性，并强调了借鉴人类适应性以开发更鲁棒检测算法的重要性。

Summary / 总结

This paper addresses the challenge of evaluating AI-generated image detection methods under complex real-world conditions. It introduces the Real-World Robustness Dataset (RRDataset) that covers seven major scenarios, examines detector performance after internet transmission, and assesses re-digitization robustness. The study benchmarks 17 detectors and 10 vision-language models on RRDataset and includes a human study involving 192 participants, revealing the limitations of current AI detection methods and emphasizing the need for human adaptability in developing more robust algorithms.

本文旨在解决在复杂现实条件下评估AI生成图像检测方法的挑战。引入了涵盖七大场景的Real-World Robustness Dataset (RRDataset)，考察了图像在社交媒体上传播后的检测性能，并评估了图像经过不同重数字化处理后的模型效果。研究在RRDataset上对17种检测器和10种视觉-语言模型进行了基准测试，并通过192名参与者的实验研究了人类在少量样本下的学习能力，揭示了当前AI检测方法的局限性，并强调了借鉴人类适应性的重要性，以开发更鲁棒的检测算法。

Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication

Authors: Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis

First: 2025-09-11T06:05:35+00:00 · Latest: 2025-09-11T06:05:35+00:00

Comments: To appear in IEEE Globecom 2025

Abs · PDF

Abstract

Large-scale transformer models have emerged as a powerful tool for semantic communication systems, enabling edge devices to extract rich representations for robust inference across noisy wireless channels. However, their substantial computational demands remain a major barrier to practical deployment in resource-constrained 6G networks. In this paper, we present a training-free framework for adaptive token merging in pretrained vision transformers to jointly reduce inference time and transmission resource usage. We formulate the selection of per-layer merging proportions as a multi-objective optimization problem to balance accuracy and computational cost. We employ Gaussian process-based Bayesian optimization to construct a Pareto frontier of optimal configurations, enabling flexible runtime adaptation to dynamic application requirements and channel conditions. Extensive experiments demonstrate that our method consistently outperforms other baselines and achieves significant reductions in floating-point operations while maintaining competitive accuracy across a wide range of signal-to-noise ratio (SNR) conditions. Additional results highlight the effectiveness of adaptive policies that adjust merging aggressiveness in response to channel quality, providing a practical mechanism to trade off latency and semantic fidelity on demand. These findings establish a scalable and efficient approach for deploying transformer-based semantic communication in future edge intelligence systems.

中文标题/摘要

标题：边缘变换器模型在语义通信中的自适应帕累托最优标记合并

大规模变换器模型已成为语义通信系统中的强大工具，使边缘设备能够在噪声无线信道中提取丰富的表示以进行稳健的推理。然而，它们巨大的计算需求仍然是在资源受限的6G网络中实际部署的主要障碍。本文提出了一种无需训练的框架，用于在预训练的视觉变换器中自适应标记合并，以同时减少推理时间和传输资源使用。我们将每层合并比例的选择形式化为一个多目标优化问题，以平衡准确性和计算成本。我们采用基于高斯过程的贝叶斯优化来构建最优配置的帕累托前沿，从而在动态应用需求和信道条件下实现灵活的运行时自适应。广泛实验表明，我们的方法在各种信噪比（SNR）条件下始终优于其他基线，并在保持竞争力的同时显著减少了浮点运算。附加结果强调了自适应策略的有效性，这些策略根据信道质量调整合并的激进程度，提供了一种按需权衡延迟和语义保真的实用机制。这些发现确立了一种可扩展且高效的基于变换器的语义通信部署方法，适用于未来的边缘智能系统。

Summary / 总结

This paper addresses the computational demands of large-scale transformer models in semantic communication systems by proposing an adaptive token merging framework. The method formulates token merging as a multi-objective optimization problem and uses Gaussian process-based Bayesian optimization to find the Pareto frontier of optimal configurations. Experiments show that the proposed method reduces floating-point operations while maintaining competitive accuracy across various SNR conditions, and adaptive policies further improve performance based on channel quality.

本文提出了一种自适应token合并框架，以解决大规模Transformer模型在语义通信系统中的计算需求问题。该方法将合并问题形式化为一个多目标优化问题，并使用基于高斯过程的贝叶斯优化来找到最优配置的帕累托前沿。实验表明，该方法在各种信噪比条件下减少了浮点运算，同时保持了竞争力的准确性，并且基于信道质量的自适应策略提高了性能。
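
To make the Pareto-frontier idea from the abstract above concrete, here is a minimal sketch. It covers only the final filtering step over configurations that have already been evaluated. The Gaussian-process Bayesian optimization that proposes those configurations is not shown. The `Config` fields, the GFLOPs budget, and the selection rule are assumptions for illustration, not the authors' implementation.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float  # higher is better
    gflops: float    # lower is better


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.gflops <= b.gflops
    strictly_better = a.accuracy > b.accuracy or a.gflops < b.gflops
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only non-dominated configurations, sorted from cheapest to most expensive."""
    front = [c for c in configs if not any(dominates(o, c) for o in configs)]
    return sorted(front, key=lambda c: c.gflops)


def pick_config(front: List[Config], gflops_budget: float) -> Config:
    """Most accurate configuration within the budget; falls back to the cheapest one."""
    feasible = [c for c in front if c.gflops <= gflops_budget]
    return max(feasible, key=lambda c: c.accuracy) if feasible else front[0]
```

At runtime, a device could tighten `gflops_budget` when the channel or latency requirement degrades and look up the matching point on the precomputed front, which is the kind of on-demand trade-off the abstract describes.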

Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization

Authors: Anas Anwarul Haq Khan, Utkarsh Verma, Ganesh Ramakrishnan

First: 2025-04-30T17:37:55+00:00 · Latest: 2025-09-11T05:48:04+00:00

Abs · PDF

Abstract

We introduce DEEVISum (Distilled Early Exit Vision language model for Summarization), a lightweight, efficient, and scalable vision-language model designed for segment-wise video summarization. Leveraging multimodal prompts that combine textual and audio-derived signals, DEEVISum incorporates Multi Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement over baseline distillation (0.5%), while EE reduces inference time by approximately 21% with a 1.3 point drop in F1. Evaluated on the TVSum dataset, our best model PaLI Gemma2 3B + MSKD achieves an F1 score of 61.1, rivaling the performance of significantly larger models while maintaining a lower computational footprint. We publicly release our code and processed dataset to support further research.

中文标题/摘要

标题：VLMs中用于视频摘要的早期退出和多阶段知识蒸馏

我们介绍了DEEVISum（Distilled Early Exit Vision语言模型 for 摘要），这是一种轻量级、高效且可扩展的视觉语言模型，专为段落级视频摘要设计。通过结合文本和音频提取的多模态提示，DEEVISum结合了多阶段知识蒸馏（MSKD）和早期退出（EE），在性能和效率之间取得平衡。MSKD相比基线蒸馏提供了1.33%的绝对F1改进（0.5%），而EE将推理时间减少了约21%，F1下降1.3点。在TVSum数据集上评估，我们最好的模型PaLI Gemma2 3B + MSKD的F1得分为61.1，与显著更大的模型竞争性能，同时保持较低的计算开销。我们公开发布了我们的代码和处理过的数据集，以支持进一步的研究。
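
The Early Exit mechanism mentioned in the abstract above can be sketched generically as follows. The abstract does not describe DEEVISum's exit criterion, so this sketch assumes a common confidence-threshold rule. `layer_probs` and `threshold` are hypothetical names, and the probabilities would come from intermediate classifier heads in a real model.

```python
from typing import List, Tuple


def early_exit_predict(layer_probs: List[List[float]],
                       threshold: float = 0.9) -> Tuple[int, int]:
    """Return (predicted_class, exit_depth).

    layer_probs holds one class-probability list per exit head, ordered from
    shallow to deep, and must be non-empty. Inference stops at the first head
    whose top probability reaches the threshold.
    """
    for depth, probs in enumerate(layer_probs, start=1):
        confidence = max(probs)
        if confidence >= threshold:
            return probs.index(confidence), depth
    # No head was confident enough: use the deepest head's prediction.
    final = layer_probs[-1]
    return final.index(max(final)), len(layer_probs)
```

Lowering `threshold` lets more inputs leave at shallow heads, which saves compute at some cost in accuracy. This matches the direction of the trade-off reported in the abstract (faster inference with a small F1 drop).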

A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering

Authors: Zhiyue Liu, Sihang Liu, Jinyuan Liu, Xinru Zhang

Venue: ICME 2025 oral presentation

First: 2025-09-11T05:40:26+00:00 · Latest: 2025-09-11T05:40:26+00:00

Comments: Accepted by the IEEE International Conference on Multimedia and Expo (ICME 2025) for oral presentation. © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

Abs · PDF

Abstract

Knowledge-based visual question answering (KB-VQA) requires a model to understand images and utilize external knowledge to provide accurate answers. Existing approaches often directly augment models with retrieved information from knowledge sources while ignoring substantial knowledge redundancy, which introduces noise into the answering process. To address this, we propose a training-free framework with knowledge focusing for KB-VQA that mitigates the impact of noise by enhancing knowledge relevance and reducing redundancy. First, for knowledge retrieval, our framework distills the essential parts of each image-question pair, creating low-noise queries that enhance the retrieval of highly relevant knowledge. Considering that redundancy still persists in the retrieved knowledge, we then prompt large models to identify and extract answer-beneficial segments from knowledge. In addition, we introduce a selective knowledge integration strategy, allowing the model to incorporate knowledge only when it lacks confidence in answering the question, thereby mitigating the influence of redundant information. Our framework enables the acquisition of accurate and critical knowledge, and extensive experiments demonstrate that it outperforms state-of-the-art methods.

中文标题/摘要

标题：一种知识噪声缓解框架：基于知识的视觉问答

基于知识的视觉问答（KB-VQA）要求模型理解图像并利用外部知识提供准确的答案。现有方法通常直接将检索到的信息增强到模型中，而忽略知识冗余，这引入了噪声。为了解决这一问题，我们提出了一种无需训练的基于知识聚焦的KB-VQA框架，通过增强知识相关性和减少冗余来缓解噪声的影响。首先，对于知识检索，我们的框架从图像-问题对中得出关键部分，创建低噪声查询以增强相关知识的检索。考虑到检索到的知识中仍然存在冗余，我们随后提示大型模型识别和提取有益于回答的知识片段。此外，我们引入了一种选择性知识集成策略，允许模型仅在对回答问题缺乏信心时才整合知识，从而减轻冗余信息的影响。我们的框架能够获取准确且关键的知识，广泛实验表明其优于现有方法。

Summary / 总结

The paper proposes a training-free framework for mitigating knowledge noise in knowledge-based visual question answering (KB-VQA). It focuses on enhancing knowledge relevance and reducing redundancy by creating low-noise queries and prompting large models to extract answer-beneficial segments. Experiments show that this framework outperforms existing state-of-the-art methods in KB-VQA tasks.

论文提出了一种无需训练的框架，旨在解决知识导向的视觉问答（KB-VQA）中的噪声问题，通过增强检索知识的相关性和减少冗余来提高答案的准确性。实验表明，该框架在KB-VQA任务中优于现有方法。
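
The selective knowledge integration strategy described in the abstract above amounts to a confidence gate. The sketch below is one possible reading, not the authors' code. `answer_fn`, `retrieve_fn`, and `min_confidence` are hypothetical placeholders for a model call, a retrieval step, and a tuned threshold.

```python
from typing import Callable, Optional, Tuple

# answer_fn(question, knowledge_or_None) -> (answer, confidence in [0, 1])
AnswerFn = Callable[[str, Optional[str]], Tuple[str, float]]


def answer_selectively(question: str,
                       answer_fn: AnswerFn,
                       retrieve_fn: Callable[[str], str],
                       min_confidence: float = 0.7) -> Tuple[str, bool]:
    """Return (answer, used_knowledge).

    The model first answers without external knowledge. Retrieval is only
    triggered when its confidence falls below min_confidence.
    """
    answer, confidence = answer_fn(question, None)
    if confidence >= min_confidence:
        return answer, False
    knowledge = retrieve_fn(question)
    answer, _ = answer_fn(question, knowledge)
    return answer, True
```

Skipping retrieval for questions the model already answers confidently keeps redundant passages out of the prompt, which is the noise-reduction effect the abstract attributes to this step.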

Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models

Authors: Seungjae Lee, Daniel Ekpo, Haowen Liu, Furong Huang, Abhinav Shrivastava, Jia-Bin Huang

First: 2025-05-12T17:59:11+00:00 · Latest: 2025-09-11T03:49:17+00:00

Comments: Project webpage: https://ive-robot.github.io/

Abs · PDF · Project1

Abstract

Exploration is essential for general-purpose robotic learning, especially in open-ended environments where dense rewards, explicit goals, or task-specific supervision are scarce. Vision-language models (VLMs), with their semantic reasoning over objects, spatial relations, and potential outcomes, present a compelling foundation for generating high-level exploratory behaviors. However, their outputs are often ungrounded, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration is often driven by the desire to discover novel scene configurations and to deepen understanding of the environment. Similarly, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE enables more diverse and meaningful exploration than RL baselines, as evidenced by a 4.1 to 7.8x increase in the entropy of visited states. Moreover, the collected experience supports downstream learning, producing policies that closely match or exceed the performance of those trained on human-collected demonstrations.

中文标题/摘要

标题：想象、验证、执行：基于视觉语言模型的记忆引导代理探索

探索对于通用机器人学习至关重要，尤其是在稀疏密集奖励、明确目标或特定任务监督较少的开放环境中。视觉语言模型（VLMs）凭借其对物体、空间关系和潜在结果的语义推理，为生成高层次探索行为提供了有力的基础。然而，它们的输出往往是不具象的，难以判断想象中的转换是否物理可行或具有信息价值。为了弥合想象与执行之间的差距，我们提出了IVE（想象、验证、执行）框架，该框架受到人类好奇心的启发。人类探索往往由发现新颖场景配置和加深对环境理解的欲望驱动。同样，IVE 利用 VLMs 将 RGB-D 观察抽象为语义场景图，想象新的场景，预测其物理可行性，并通过动作工具生成可执行技能序列。我们在模拟和真实世界的桌面环境中评估了IVE。结果显示，与基于强化学习的基线相比，IVE 能够实现更多样化和有意义的探索，熵的增加幅度为 4.1 至 7.8 倍。此外，收集的经验支持下游学习，生成的策略与或超过基于人类收集的演示训练的性能。

Summary / 总结

The paper introduces IVE (Imagine, Verify, Execute), a framework for agentic exploration in robotics, which leverages vision-language models to generate and verify high-level exploratory behaviors. IVE converts RGB-D observations into semantic scene graphs, imagines novel scenes, verifies their physical plausibility, and generates executable skill sequences. Experimental results demonstrate that IVE outperforms reinforcement learning baselines, increasing the entropy of visited states by 4.1 to 7.8 times and supporting downstream learning to match or exceed human-collected demonstrations.

论文提出了IVE（Imagine, Verify, Execute）框架，利用视觉语言模型生成和验证高层次的探索行为。IVE 将 RGB-D 观察转换为语义场景图，想象新的场景，验证其物理可行性，并生成可执行的技能序列。实验结果表明，IVE 在探索多样性方面优于强化学习基线，将访问状态的熵提高 4.1 到 7.8 倍，并支持下游学习以匹配或超越人类收集的演示效果。
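
The exploration metric quoted above, the entropy of visited states, can be illustrated with plain Shannon entropy over visit counts. The abstract does not specify how IVE discretizes states or estimates entropy, so this sketch assumes states are represented as hashable labels.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def state_entropy(visited_states: Iterable[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution over visited states."""
    counts = Counter(visited_states)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

An agent that keeps revisiting one scene configuration scores 0 bits, while one that spreads its visits evenly over four configurations scores 2 bits. Higher entropy therefore indicates more diverse exploration.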

Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention

Authors: Junhao Xing, Ryohei Miyakawa, Yang Yang, Xinpeng Liu, Risa Shinoda, Hiroaki Santo, Yosuke Toda, Fumio Okura

Venue: WACV 2026

First: 2025-09-11T02:53:58+00:00 · Latest: 2025-09-11T02:53:58+00:00

Comments: WACV 2026 accepted

Abs · PDF · Code1

Abstract

Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals with each consisting of multiple overlapping leaves remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants' structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.

中文标题/摘要

标题：基于基础分割模型和文本到图像注意力的零样本分层植物分割

基础分割模型在无需训练的情况下（即零样本）可以从顶部视角的作物图像中合理地提取叶实例。然而，分割由多个重叠叶子组成的整个植物个体仍然具有挑战性。这个问题被称为分层分割任务，通常需要带有注释的训练数据集，这些数据集往往是特定于物种的，并且需要大量的人工劳动。为了解决这个问题，我们引入了ZeroPlantSeg，这是一种从顶部视角图像中对呈放射状的植物个体进行零样本分割的方法。我们结合了基础分割模型，提取叶实例，以及一个视觉语言模型，通过推理植物结构来提取植物个体，无需额外训练。在包含多种植物物种、生长阶段和拍摄环境的数据集上的评估表明，我们的方法超越了现有的零样本方法，并且在跨域性能上优于监督方法。实现代码可在https://github.com/JunhaoXing/ZeroPlantSeg获取。

Summary / 总结

This research aims to address the challenge of hierarchical plant segmentation, particularly for rosette-shaped plants, using zero-shot methods. The method combines a foundation segmentation model for leaf instance extraction and a vision-language model for plant structure reasoning. Experiments on diverse datasets show that the proposed ZeroPlantSeg method outperforms existing zero-shot approaches and achieves better cross-domain performance than supervised methods.

研究旨在解决从包含多片重叠叶子的顶部视角图像中提取整个植物个体的层次分割难题。为了实现零样本性能，研究将基础分割模型用于叶子实例提取，并结合视觉语言模型进行植物结构推理。实验结果表明，提出的ZeroPlantSeg方法在多种植物种类、生长阶段和拍摄环境的多样数据集上，优于现有零样本方法，并且在跨域性能上优于监督方法。

SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models

Authors: Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, Huanrui Yang

First: 2025-09-11T01:52:25+00:00 · Latest: 2025-09-11T01:52:25+00:00

Comments: 12 pages, 9 figures

Abs · PDF

Abstract

Vision-Language-Action (VLA) models exhibit unprecedented capabilities for embodied intelligence. However, their extensive computational and memory costs hinder their practical deployment. Existing VLA compression and acceleration approaches conduct quantization or token pruning in an ad-hoc manner but fail to enable both for a holistic efficiency improvement due to an observed incompatibility. This work introduces SQAP-VLA, the first structured, training-free VLA inference acceleration framework that simultaneously enables state-of-the-art quantization and token pruning. We overcome the incompatibility by co-designing the quantization and token pruning pipeline, where we propose new quantization-aware token pruning criteria that work on an aggressively quantized model while improving the quantizer design to enhance pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields significant gains in computational efficiency and inference speed while successfully preserving core model performance, achieving a 1.93× speedup and up to a 4.5% average success rate enhancement compared to the original model.

中文标题/摘要

标题：SQAP-VLA：一种高效率视觉-语言-行动模型的协同量化感知剪枝框架

视觉-语言-行动（VLA）模型展示了前所未有的体现智能能力。然而，它们广泛的计算和内存成本阻碍了它们的实际部署。现有的VLA压缩和加速方法以非系统的方式进行量化或标记剪枝，但未能同时进行这两种操作以实现整体效率的提升，因为观察到了不兼容性。这项工作引入了SQAP-VLA，这是第一个无需训练的VLA推理加速框架，能够同时实现最先进的量化和标记剪枝。我们通过协同设计量化和标记剪枝管道来克服不兼容性，其中我们提出了新的量化感知标记剪枝标准，可以在高度量化模型上工作并改进量化器设计以增强剪枝效果。当应用于标准VLA模型时，SQAP-VLA在计算效率和推理速度上取得了显著提升，同时成功保留了核心模型性能，实现了1.93倍的速度提升和最高4.5%的平均成功率提升，与原始模型相比。

Summary / 总结

SQAP-VLA is a framework designed to enhance the efficiency of Vision-Language-Action models by simultaneously enabling quantization and token pruning. It addresses the incompatibility between these two techniques by co-designing the quantization and token pruning pipeline, resulting in a significant computational efficiency improvement and faster inference speed. Compared to the original model, SQAP-VLA achieves a 1.93 times speedup and up to a 4.5% average success rate enhancement.

SQAP-VLA 是一种框架，旨在通过同时实现量化和 token 剪枝来提升 Vision-Language-Action 模型的效率。通过共同设计量化和 token 剪枝管道，它克服了两者之间的不兼容性，从而实现了显著的计算效率和速度提升，同时保持了模型性能。与原始模型相比，SQAP-VLA 实现了 1.93 倍的加速，并且平均成功率提高了最多 4.5%。
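
For readers unfamiliar with the two techniques that SQAP-VLA combines, the sketch below shows naive versions of each: uniform symmetric quantization and top-k token pruning by score. It is not the paper's quantization-aware pruning criterion, which the abstract does not detail. The function names and the `keep_ratio` parameter are assumptions for illustration.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Map floats to signed integers in [-qmax, qmax] with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # all-zero input falls back to 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def prune_tokens(token_scores: List[float], keep_ratio: float = 0.5) -> List[int]:
    """Return the indices of the highest-scoring tokens, in their original order."""
    keep = max(1, int(len(token_scores) * keep_ratio))
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:keep])
```

Applying the two steps independently is the ad-hoc combination the abstract argues against. Coarse quantization can distort the scores that pruning relies on, which is the incompatibility the co-designed pipeline is meant to address.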

COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation

Authors: Umair Hassan

First: 2025-09-10T21:17:32+00:00 · Latest: 2025-09-10T21:17:32+00:00

Comments: 17 pages, 3 figures, 3 tables. Dataset available at https://huggingface.co/datasets/umairhassan02/urdu-translated-coco-captions-subset. Scripts and notebooks to reproduce results available at https://github.com/umair-hassan2/COCO-Urdu

Abs · PDF · Code1 · Code2

Abstract

Urdu, spoken by over 250 million people, remains critically under-served in multimodal and vision-language research. The absence of large-scale, high-quality datasets has limited the development of Urdu-capable systems and reinforced biases in multilingual vision-language models trained primarily on high-resource languages. To address this gap, we present COCO-Urdu, a large-scale image-caption dataset derived from MS COCO, containing 59,000 images and 319,000 Urdu captions selected through stratified sampling to preserve the original distribution. Captions were translated using SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency; low-scoring captions were iteratively refined using open-source large language models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting consistently strong results. To the best of our knowledge, COCO-Urdu is the largest publicly available Urdu captioning dataset. By releasing both the dataset and the quality estimation pipeline, we aim to reduce language bias in multimodal research and establish a foundation for inclusive vision-language systems.

中文标题/摘要

标题：COCO-Urdu：大规模乌尔都语图像-描述数据集及多模态质量评估

乌尔都语被超过2.5亿人使用，但在多模态和视觉语言研究中仍严重不足。缺乏大规模、高质量的数据集限制了乌尔都语系统的开发，并强化了主要基于高资源语言训练的多语言视觉语言模型中的偏见。为解决这一问题，我们提出了COCO-Urdu，一个源自MS COCO的大规模图像-描述数据集，包含59,000张图像和319,000个乌尔都语描述，通过分层抽样保留了原始分布。描述使用SeamlessM4T v2进行翻译，并通过结合COMET-Kiwi进行翻译质量评估、CLIP进行视觉定位以及BERTScore和回译进行语义一致性评估的混合多模态质量评估框架进行验证；得分低的描述通过开源大型语言模型迭代优化。我们还在BLEU、SacreBLEU和chrF上对COCO-Urdu进行了基准测试，报告了一致的强结果。据我们所知，COCO-Urdu是最大的公开可用的乌尔都语描述数据集。通过发布数据集和质量评估管道，我们旨在减少多模态研究中的语言偏见，并建立包容性视觉语言系统的基础。
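
The hybrid quality-estimation step described above combines three signals per caption: translation quality, visual grounding, and semantic consistency. As a rough sketch, assuming each signal has already been computed and normalized to [0, 1], a weighted mean with a refinement threshold might look like this. The weights, the threshold, and the function names are hypothetical. No COMET-Kiwi, CLIP, or BERTScore calls are made here.

```python
from typing import Dict, List, Sequence, Tuple


def hybrid_quality(translation_qe: float, visual_sim: float, semantic_sim: float,
                   weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted mean of the three per-caption scores (weights are an assumption)."""
    scores = (translation_qe, visual_sim, semantic_sim)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def flag_for_refinement(captions: Dict[str, Tuple[float, float, float]],
                        threshold: float = 0.6) -> List[str]:
    """Return the IDs of captions whose combined score falls below the threshold."""
    return [cid for cid, scores in captions.items()
            if hybrid_quality(*scores) < threshold]
```

Captions returned by `flag_for_refinement` correspond to the low-scoring ones that the abstract says were iteratively refined with open-source language models.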

Can Vision-Language Models Solve Visual Math Equations?

Authors: Monjoy Narayan Choudhury, Junling Wang, Yifan Hou, Mrinmaya Sachan

Venue: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

First: 2025-09-10T21:16:11+00:00 · Latest: 2025-09-10T21:16:11+00:00

Comments: Monjoy Narayan Choudhury and Junling Wang contributed equally to this work. Accepted at EMNLP2025 main. Code and datasets are open-sourced with links in the paper

Abs · PDF

Abstract

Despite strong performance in visual understanding and language-based reasoning, Vision-Language Models (VLMs) struggle with tasks requiring integrated perception and symbolic computation. We study this limitation through visual equation solving, where mathematical equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. While VLMs perform well on textual equations, they fail on visually grounded counterparts. To understand this gap, we decompose the task into coefficient counting and variable recognition, and find that counting is the primary bottleneck, even when recognition is accurate. We also observe that composing recognition and reasoning introduces additional errors, highlighting challenges in multi-step visual reasoning. Finally, as equation complexity increases, symbolic reasoning itself becomes a limiting factor. These findings reveal key weaknesses in current VLMs and point toward future improvements in visually grounded mathematical reasoning.

中文标题/摘要

标题：视觉语言模型能否解决视觉数学方程？

尽管在视觉理解和语言推理方面表现出色，视觉语言模型（VLMs）在需要整合感知和符号计算的任务中表现不佳。我们通过视觉方程求解任务研究了这一局限性，其中数学方程嵌入在图像中，变量由对象图标表示，系数必须通过计数来推断。虽然VLMs在文本方程上表现良好，但在视觉接地的对应任务上却失败了。为了理解这一差距，我们将任务分解为系数计数和变量识别，并发现即使识别准确，计数仍然是主要瓶颈。我们还观察到，组合识别和推理会引入额外的错误，突显了多步视觉推理的挑战。最后，随着方程复杂性的增加，符号推理本身也成为限制因素。这些发现揭示了当前VLMs的关键弱点，并指出了未来改进视觉接地数学推理的方向。

Summary / 总结

The study investigates the limitations of Vision-Language Models (VLMs) in solving visual math equations, where equations are embedded in images and variables are represented by object icons. VLMs perform poorly on these tasks compared to textual equations. The research decomposes the task into coefficient counting and variable recognition, identifying counting as the main bottleneck. It also highlights that combining recognition and reasoning introduces additional errors, and that as equation complexity increases, symbolic reasoning becomes a limiting factor. These findings suggest key weaknesses in current VLMs and point to future improvements in visually grounded mathematical reasoning.

研究探讨了视觉语言模型（VLMs）在解决嵌入图像中的数学方程问题时的局限性，其中方程中的变量由物体图标表示。尽管VLMs在视觉理解和语言推理方面表现出色，但在需要综合感知和符号计算的任务上却表现不佳。研究将任务分解为系数计数和变量识别，发现计数是主要瓶颈。同时，研究还指出了多步视觉推理的挑战以及随着方程复杂性的增加，符号推理本身的局限性，这为未来VLMs的发展指明了方向。
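
The visual equation task described above decomposes into counting icons to obtain coefficients and then solving the resulting system symbolically. A minimal sketch of that decomposition for two variables follows. The icon names and the list-of-labels input format are assumptions, standing in for whatever a perception module would output.

```python
from collections import Counter
from fractions import Fraction
from typing import List, Sequence, Tuple


def count_coefficients(icons: Sequence[str], variables: Sequence[str]) -> List[int]:
    """Coefficient of each variable = number of times its icon appears."""
    counts = Counter(icons)
    return [counts[v] for v in variables]


def solve_2x2(a: Sequence[Sequence[int]], b: Sequence[int]) -> Tuple[Fraction, Fraction]:
    """Solve a 2x2 linear system exactly with Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("system has no unique solution")
    x = Fraction(b[0] * a22 - a12 * b[1], det)
    y = Fraction(a11 * b[1] - b[0] * a21, det)
    return x, y


# Example: "apple apple banana = 7" and "apple banana banana banana = 11".
variables = ["apple", "banana"]
row1 = count_coefficients(["apple", "apple", "banana"], variables)
row2 = count_coefficients(["apple", "banana", "banana", "banana"], variables)
solution = solve_2x2([row1, row2], [7, 11])  # apple = 2, banana = 3
```

The symbolic step is exact once the counts are right, so a single miscounted icon changes the whole answer. This is consistent with the paper's finding that counting is the primary bottleneck.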

Recurrence Meets Transformers for Universal Multimodal Retrieval

Authors: Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

First: 2025-09-10T18:00:29+00:00 · Latest: 2025-09-10T18:00:29+00:00

Abs · PDF · Code

Abstract

With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2

中文标题/摘要

标题：循环神经网络与变换器结合用于通用多模态检索

随着多模态检索及其在LLMs和多模态LLMs中的应用快速发展，越来越多复杂的检索任务涌现出来。现有方法主要依赖于视觉语言模型的特定任务微调，并且局限于单模态查询或文档。在本文中，我们提出了一种统一的检索模型ReT-2，支持包含图像和文本的多模态查询，并在包含文本和图像的多模态文档集合中进行跨模态搜索。ReT-2 利用多层表示和具有LSTM启发式门机制的递归变换器架构，动态地在层间和模态间整合信息，捕捉细微的视觉和文本细节。我们在具有不同检索配置的M2KR和M-BEIR基准上评估了ReT-2。结果表明，ReT-2 在各种场景中都实现了最先进的性能，同时相比先前方法具有更快的推理速度和更低的内存使用。当集成到检索增强生成管道中时，ReT-2 还在Encyclopedic-VQA和InfoSeek数据集上提高了下游性能。我们的源代码和训练模型已公开发布于：https://github.com/aimagelab/ReT-2

Summary / 总结

The paper addresses the need for more versatile multimodal retrieval systems capable of handling complex tasks and multimodal queries. It introduces ReT-2, a unified model using a recurrent Transformer architecture with LSTM-like gating mechanisms to integrate visual and textual information dynamically. Experiments on M2KR and M-BEIR benchmarks show that ReT-2 outperforms existing methods in various settings, with faster inference and lower memory usage. Additionally, integrating ReT-2 into generation pipelines enhances performance on Encyclopedic-VQA and InfoSeek datasets.

论文旨在解决更灵活的多模态检索方法的需求，能够处理复杂的任务和多模态查询。ReT-2 使用递归 Transformer 架构，有效地整合视觉和文本信息。在 M2KR 和 M-BEIR 上的实验表明，ReT-2 在各种设置中优于现有方法，具有更快的推理速度和更低的内存使用量。此外，它还增强了检索增强生成任务的下游性能。
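
The idea of LSTM-inspired gating across layers can be sketched as a recurrent update that blends a running state with each layer's features. This is a minimal illustration under stated assumptions, not ReT-2's actual cell: the element-wise gates, the gate inputs, and the names `gated_merge`, `w_forget`, and `w_input` are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_merge(state, layer_feat, w_forget, w_input):
    """One recurrent step: blend the running state with one layer's features.

    Gates are element-wise; w_forget and w_input are hypothetical
    per-dimension gate weights.
    """
    new_state = []
    for s, x, wf, wi in zip(state, layer_feat, w_forget, w_input):
        f = sigmoid(wf * (s + x))  # how much of the old state to keep
        i = sigmoid(wi * (s + x))  # how much of the new feature to admit
        new_state.append(f * s + i * x)
    return new_state

def merge_layers(layer_feats, w_forget, w_input):
    """Fold features from successive layers into a single vector."""
    state = [0.0] * len(layer_feats[0])
    for feat in layer_feats:
        state = gated_merge(state, feat, w_forget, w_input)
    return state

# Three layers, two feature dimensions. Zero gate weights make every gate 0.5.
layer_feats = [[1.0, -1.0], [0.5, 2.0], [0.0, 1.0]]
merged = merge_layers(layer_feats, w_forget=[0.0, 0.0], w_input=[0.0, 0.0])
print(merged)
```

With learned gate weights, the forget gate decides how much earlier-layer information survives and the input gate decides how much each new layer contributes.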

RewardDance: Reward Scaling in Visual Generation

Authors: Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, Yan Zeng, Weilin Huang

First: 2025-09-10T17:59:31+00:00 · Latest: 2025-09-10T17:59:31+00:00

Comments: Bytedance Seed Technical Report

Abs · PDF

Abstract

Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by the reward hacking issue, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model's probability of predicting a "yes" token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: Integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of "reward hacking": our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and ability to produce diverse, high-quality outputs. This greatly relieves the mode collapse problem that plagues smaller models.

中文标题/摘要

标题：RewardDance：视觉生成中的奖励缩放

奖励模型（RMs）对于通过强化学习（RL）改进生成模型至关重要，但在视觉生成中的RM缩放范式尚未得到充分探索。这主要是由于现有方法的基本限制：基于CLIP的RMs受到架构和输入模态的限制，而常见的Bradley-Terry损失与视觉语言模型（VLM）的下一个词预测机制根本不对齐，阻碍了有效的缩放。更关键的是，RLHF优化过程受到奖励作弊问题的困扰，模型利用奖励信号中的缺陷而不提高真实质量。为了解决这些挑战，我们提出了RewardDance，这是一种通过新颖的生成奖励范式克服这些障碍的可扩展奖励建模框架。通过将奖励分数重新表述为模型预测“是”标记的概率，表明生成的图像根据特定标准优于参考图像，RewardDance内在地将奖励目标与VLM架构对齐。这种对齐在两个维度上解锁了缩放：（1）模型缩放：系统地将RMs扩展至260亿参数；（2）上下文缩放：整合任务特定指令、参考示例和链式推理（CoT）。大量实验表明，RewardDance在文本到图像、文本到视频和图像到视频生成方面显著超越了现有最先进的方法。最关键的是，我们解决了持续存在的“奖励作弊”问题：我们的大规模RMs在RL微调过程中表现出并维持了高奖励方差，证明了它们的抗作弊能力和产生多样、高质量输出的能力。这极大地缓解了困扰较小模型的模式崩溃问题。

Summary / 总结

RewardDance is a scalable reward modeling framework for improving visual generation via Reinforcement Learning. It reformulates the reward score as the model's probability of predicting a "yes" token, meaning the generated image beats a reference image under specific criteria, which aligns the reward objective with the next-token prediction of Vision-Language Models. This enables scaling reward models up to 26 billion parameters and adding task-specific instructions, reference examples, and chain-of-thought reasoning to the context. RewardDance outperforms state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Its large reward models also maintain high reward variance during RL fine-tuning, which the authors take as evidence of resistance to reward hacking and which greatly relieves the mode collapse seen with smaller models.

RewardDance 是一种可扩展的奖励建模框架，旨在通过强化学习改进视觉生成模型。它通过重新定义奖励分数并与其视觉语言模型架构对齐来解决现有方法的局限性。关键发现包括将奖励模型系统地扩展至260亿参数，并整合了任务特定指令和参考示例。大量实验表明，RewardDance 在文本到图像、文本到视频和图像到视频生成方面均优于现有最佳方法，并有效解决了奖励作弊问题，生成多样且高质量的输出，避免了小模型中的模式崩溃。
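
The generative reward in this entry, where the score is the probability of a "yes" token, can be sketched as a softmax read-out at the answer position. This is a minimal illustration, not the authors' implementation: the three-token vocabulary, the logits, and the function name `yes_probability` are hypothetical, and normalizing over the full vocabulary (rather than over "yes"/"no" only) is an assumption.

```python
import math

def yes_probability(logits, yes_id):
    """Reward = softmax probability assigned to the 'yes' token."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return exps[yes_id] / sum(exps)

# Toy vocabulary and next-token logits at the answer position (hypothetical).
vocab = {"yes": 0, "no": 1, "maybe": 2}
logits = [2.0, 0.0, -1.0]

# The prompt would ask whether the generated image beats the reference image;
# the reward is how strongly the model answers "yes".
reward = yes_probability(logits, vocab["yes"])
print(round(reward, 4))
```

Because the reward is read from the same next-token distribution the VLM was trained to produce, no separate regression head is needed in this formulation.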

SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

Authors: Michael J. Munje, Chen Tang, Shuijing Liu, Zichao Hu, Yifeng Zhu, Jiaxun Cui, Garrett Warnell, Joydeep Biswas, Peter Stone

Venue: CoRL

First: 2025-09-10T16:47:00+00:00 · Latest: 2025-09-10T16:47:00+00:00

Comments: Conference on Robot Learning (CoRL) 2025. Project site: https://larg.github.io/socialnav-sub

Abs · PDF · Project

Abstract

Robot navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities such as object recognition, common-sense reasoning, and contextual understanding, capabilities that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (e.g., inferring the spatial-temporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. While some recent works have explored the use of VLMs in social robot navigation, no existing work systematically evaluates their ability to meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. An overview of this paper along with the code and data can be found at https://larg.github.io/socialnav-sub .

中文标题/摘要

标题：SocialNav-SUB：评估社会机器人导航场景理解的VLM基准

在动态、以人类为中心的环境中，机器人的导航需要基于稳健场景理解的合乎社会规范的决策。近期的视觉-语言模型（VLMs）展示了诸如物体识别、常识推理和上下文理解等有前景的能力，这些能力与社会机器人导航的复杂需求相契合。然而，尚不清楚VLMs是否能够准确理解复杂的社交导航场景（例如，推断代理和人类意图的空间-时间关系），这对于安全和合乎社会规范的机器人导航至关重要。尽管一些近期的研究探索了在社会机器人导航中使用VLMs，但目前尚无工作系统地评估它们是否能够满足这些必要条件。在本文中，我们介绍了社会导航场景理解基准（SocialNav-SUB），这是一个视觉问答（VQA）数据集和基准，旨在评估VLMs在真实世界社会机器人导航场景中的场景理解能力。SocialNav-SUB提供了一个统一框架，用于评估VLMs在涉及社会机器人导航的空间、空间-时间和社会推理的VQA任务中与基于视觉问答的人类和基于规则的基线的对比。通过使用最先进的VLMs进行实验，我们发现尽管表现最佳的VLM在与人类答案一致的概率上取得了令人鼓舞的结果，但它仍然不如简单的基于规则的方法和人类共识基线表现良好，表明当前VLMs在社会场景理解方面存在关键差距。我们的基准为社会机器人导航的基础模型研究奠定了基础，提供了一个框架来探索如何将VLMs定制以满足现实世界的社会机器人导航需求。有关本文的概述以及代码和数据可以在https://larg.github.io/socialnav-sub 查看。

Summary / 总结

This paper introduces SocialNav-SUB, a benchmark for evaluating Vision-Language Models (VLMs) in understanding complex social navigation scenes. The study aims to assess whether VLMs can accurately infer spatial-temporal relations and human intentions, essential for safe and socially compliant robot navigation. Experiments with state-of-the-art VLMs show that while the best-performing model achieves an encouraging probability of agreeing with human answers, it still underperforms simpler rule-based approaches and human consensus baselines, highlighting the need for improved social scene understanding in VLMs.

本文介绍了SocialNav-SUB，这是一个用于评估视觉语言模型（VLMs）在理解复杂社交导航场景能力的基准。研究旨在评估VLMs是否能够准确推断空间-时间关系和人类意图，这对于安全和社交合规的机器人导航至关重要。实验表明，尽管表现最佳的模型与人类答案一致，但其性能仍不及简单的基于规则的方法和人类共识基准，这表明当前VLMs在社交场景理解方面存在关键差距。
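
One way to read the "probability of agreeing with human answers" reported in this entry is as the chance that a randomly chosen human annotator gave the same answer as the model, averaged over questions. The sketch below uses that reading. It is an assumed definition for illustration only, not necessarily the benchmark's exact metric, and the questions, answers, and the name `agreement_probability` are hypothetical.

```python
def agreement_probability(model_answers, human_answers):
    """Mean, over questions, of the fraction of annotators matching the model."""
    per_question = []
    for qid, answer in model_answers.items():
        votes = human_answers[qid]
        per_question.append(sum(v == answer for v in votes) / len(votes))
    return sum(per_question) / len(per_question)

# Hypothetical annotator votes for two VQA questions.
human = {
    "q1": ["left", "left", "right"],
    "q2": ["yes", "yes", "yes", "no"],
}
vlm = {"q1": "left", "q2": "no"}         # hypothetical VLM answers
rule_based = {"q1": "left", "q2": "yes"}  # hypothetical rule-based baseline

vlm_score = agreement_probability(vlm, human)
rule_score = agreement_probability(rule_based, human)
print(vlm_score, rule_score)
```

Scoring both systems against the same annotator votes makes the comparison with a rule-based baseline direct, which is the kind of comparison the entry describes.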

LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

Authors: Yang Zhou, Shiyu Zhao, Yuxiao Chen, Zhenting Wang, Can Jin, Dimitris N. Metaxas

First: 2025-03-18T00:50:40+00:00 · Latest: 2025-09-10T15:29:43+00:00

Abs · PDF

Abstract

Large foundation models trained on large-scale vision-language data can boost Open-Vocabulary Object Detection (OVD) via synthetic training data, yet the hand-crafted pipelines often introduce bias and overfit to specific prompts. We sidestep this issue by directly fusing hidden states from Large Language Models (LLMs) into detectors, an avenue surprisingly under-explored. This paper presents a systematic method to enhance visual grounding by utilizing decoder layers of the LLM of an MLLM. We introduce a zero-initialized cross-attention adapter to enable efficient knowledge fusion from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We find that intermediate LLM layers already encode rich spatial semantics; adapting only the early layers yields most of the gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED lifts GroundingDINO by 3.82% on OmniLabel at just 8.7% extra GFLOPs, and a larger vision backbone pushes the improvement to 6.22%. Extensive ablations on adapter variants, LLM scales and fusion depths further corroborate our design.

中文标题/摘要

标题：LED：无需人工整理数据生成的大语言模型增强开放词汇对象检测

大规模的视觉-语言基础模型可以通过合成训练数据提升开放词汇对象检测（OVD），但手工设计的流水线往往会引入偏差并过度拟合特定提示。我们通过直接将大语言模型（LLM）的隐藏状态融合到检测器中绕过了这个问题，这是一个令人惊讶地未被充分探索的途径。本文提出了一种系统的方法，通过利用多模态大语言模型（MLLM）中LLM的解码器层来增强视觉定位。我们引入了一个零初始化的交叉注意力适配器，以实现从LLM到对象检测器的高效知识融合，这是一种新的方法，称为LED（大语言模型增强的开放词汇对象检测）。我们发现中间的LLM层已经编码了丰富的空间语义；仅调整早期层就能获得大部分收益。使用Swin-T作为视觉编码器，Qwen2-0.5B + LED在OmniLabel上将GroundingDINO的性能提升了3.82%，仅增加了8.7%的额外GFLOPs，而更大的视觉骨干则将改进幅度推高至6.22%。针对适配器变体、LLM规模和融合深度的大量消融实验进一步证实了我们的设计。

Summary / 总结

This paper addresses the challenge of Open-Vocabulary Object Detection (OVD) by leveraging Large Language Models (LLMs) to enhance visual grounding without the need for human-curated synthetic data. The authors propose a method called LED, which fuses hidden states from LLMs into detectors using a zero-initialized cross-attention adapter. Experiments show that adapting only early layers of the LLM yields significant improvements. Using Swin-T as the vision encoder, Qwen2-0.5B + LED enhances GroundingDINO by 3.82% on OmniLabel with minimal computational overhead, and a larger vision backbone further improves performance.

该论文通过利用大型语言模型（LLM）来增强视觉定位，解决开放词汇对象检测（OVD）的挑战，而不依赖于人工标注数据。方法称为LED，通过零初始化的交叉注意力适配器直接将LLM的隐藏状态融合到检测器中。实验表明，仅调整LLM的早期层即可获得显著改进。使用Swin-T作为视觉编码器，Qwen2-0.5B + LED在OmniLabel上将GroundingDINO的性能提升了3.82%，且计算开销较小，更大的视觉骨干进一步提升了性能。
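
The zero-initialized cross-attention adapter in this entry can be sketched as a residual branch whose output projection starts at zero, so the detector's features are unchanged at the start of training. This is a minimal pure-Python illustration, not the LED implementation: the shapes, the values, and the names `cross_attention_adapter` and `w_out` are hypothetical, and the learned query/key/value projections are omitted (treated as identity) for brevity.

```python
import math

def matmul(a, b):
    """Multiply an (n, k) matrix by a (k, p) matrix, both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_adapter(det_feats, llm_hidden, w_out):
    """Detector features attend to LLM hidden states; result is added residually."""
    d = len(det_feats[0])
    keys_t = [list(col) for col in zip(*llm_hidden)]   # (d, m)
    scores = matmul(det_feats, keys_t)                 # (n, m)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, llm_hidden)             # (n, d)
    update = matmul(attended, w_out)                   # (n, d)
    return [[x + u for x, u in zip(xr, ur)] for xr, ur in zip(det_feats, update)]

det_feats = [[1.0, 0.0], [0.0, 1.0]]                # two detector tokens
llm_hidden = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # three LLM tokens

zero_w = [[0.0, 0.0], [0.0, 0.0]]   # zero initialization of the output projection
out_at_init = cross_attention_adapter(det_feats, llm_hidden, zero_w)

identity_w = [[1.0, 0.0], [0.0, 1.0]]  # stands in for a trained projection
out_trained = cross_attention_adapter(det_feats, llm_hidden, identity_w)
print(out_at_init)
print(out_trained)
```

Starting from a zero projection means the adapter cannot degrade the pretrained detector on the first step; the LLM signal is mixed in only as the projection learns non-zero weights.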

LLaDA-VLA: Vision Language Diffusion Action Models

Authors: Yuqing Wen, Hebei Li, Kefan Gu, Yucheng Zhao, Tiancai Wang, Xiaoyan Sun

First: 2025-09-08T17:45:40+00:00 · Latest: 2025-09-10T14:34:25+00:00

Abs · PDF

Abstract

The rapid progress of auto-regressive vision-language models (VLMs) has inspired growing interest in vision-language-action models (VLAs) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to demonstrate competitive performance in text generation and multimodal applications, leading to the development of a series of diffusion-based VLMs (d-VLMs). However, leveraging such models for robot policy learning remains largely unexplored. In this work, we present LLaDA-VLA, the first Vision-Language-Diffusion-Action model built upon pretrained d-VLMs for robotic manipulation. To effectively adapt d-VLMs to the robotic domain, we introduce two key designs: (1) a localized special-token classification strategy that replaces full-vocabulary classification with special action token classification, reducing adaptation difficulty; (2) a hierarchical action-structured decoding strategy that decodes action sequences hierarchically considering the dependencies within and across actions. Extensive experiments demonstrate that LLaDA-VLA significantly outperforms state-of-the-art VLAs on both simulation and real-world robots.

中文标题/摘要

标题：LLaDA-VLA：视觉语言扩散动作模型

自回归视觉语言模型（VLMs）的快速发展激发了对视觉语言动作模型（VLA）在机器人操作方面的研究兴趣。最近，掩码扩散模型，这一与自回归模型不同的范式，已经开始在文本生成和多模态应用中展示出竞争力，推动了一系列基于扩散的VLMs（d-VLMs）的发展。然而，利用这些模型进行机器人策略学习仍然鲜有探索。本文介绍了LLaDA-VLA，这是首个基于预训练d-VLMs的视觉语言扩散动作模型，用于机器人操作。为了有效适应机器人领域，我们提出了两个关键设计：（1）局部特殊标记分类策略，用特殊动作标记分类替代全词汇分类，降低适应难度；（2）分层动作结构解码策略，考虑动作内部和跨动作的依赖关系进行分层解码。大量实验表明，LLaDA-VLA在模拟和真实机器人上均显著优于现有最先进的VLA。

Summary / 总结

This paper introduces LLaDA-VLA, a novel Vision-Language-Diffusion-Action model built on pretrained diffusion-based vision-language models for robotic manipulation. It addresses the challenge of adapting these models to robotic tasks by introducing a localized special-token classification strategy and a hierarchical action-structured decoding strategy. Experimental results show that LLaDA-VLA outperforms existing vision-language-action models in both simulation and real-world robotic manipulation tasks.

该研究提出了LLaDA-VLA，这是一种基于预训练的扩散型视觉语言模型（d-VLMs）构建的视觉语言动作模型，用于机器人操作。该模型包含两个关键设计：局部特殊标记分类策略和分层动作结构解码策略。实验表明，LLaDA-VLA 在模拟和真实世界机器人任务中均优于现有视觉语言动作模型。
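
The localized special-token classification described in this entry can be sketched as restricting the softmax to the action-token slice of the vocabulary. This is a minimal illustration under stated assumptions, not the authors' code: the six-token vocabulary, the logits, the action token ids, and the name `localized_action_probs` are hypothetical. The hierarchical decoding strategy is not covered here.

```python
import math

def localized_action_probs(logits, action_token_ids):
    """Softmax over the action tokens only, ignoring the rest of the vocabulary."""
    local = [logits[i] for i in action_token_ids]
    m = max(local)
    exps = [math.exp(v - m) for v in local]
    total = sum(exps)
    return {tid: e / total for tid, e in zip(action_token_ids, exps)}

# Hypothetical logits over a 6-token vocabulary; ids 2, 3, 4 are action tokens.
logits = [5.0, 4.0, 0.5, 1.5, 0.2, 3.0]
action_ids = [2, 3, 4]

probs = localized_action_probs(logits, action_ids)
predicted = max(probs, key=probs.get)

# Full-vocabulary argmax would pick an ordinary text token instead.
full_vocab_choice = max(range(len(logits)), key=lambda i: logits[i])
print(predicted, full_vocab_choice)
```

Restricting the classification target this way means the model never has to suppress ordinary text tokens when predicting an action, which is one plausible reading of why the abstract says it reduces adaptation difficulty.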

Have Large Vision-Language Models Mastered Art History?

Authors: Ombretta Strafforello, Derya Soydaner, Michiel Willems, Anne-Sofie Maerten, Stefanie De Winter

First: 2024-09-05T13:33:57+00:00 · Latest: 2025-09-10T14:31:31+00:00

Abs · PDF

Abstract

The emergence of large Vision-Language Models (VLMs) has established new baselines in image classification across multiple domains. We examine whether their multimodal reasoning can also address a challenge mastered by human experts. Specifically, we test whether VLMs can classify the style, author and creation date of paintings, a domain traditionally mastered by art historians. Artworks pose a unique challenge compared to natural images due to their inherently complex and diverse structures, characterized by variable compositions and styles. This requires a contextual and stylistic interpretation rather than straightforward object recognition. Art historians have long studied the unique aspects of artworks, with style prediction being a crucial component of their discipline. This paper investigates whether large VLMs, which integrate visual and textual data, can effectively reason about the historical and stylistic attributes of paintings. We present the first study of its kind, conducting an in-depth analysis of three VLMs, namely CLIP, LLaVA, and GPT-4o, evaluating their zero-shot classification of art style, author and time period. Using two image benchmarks of artworks, we assess the models' ability to interpret style, evaluate their sensitivity to prompts, and examine failure cases. Additionally, we focus on how these models compare to human art historical expertise by analyzing misclassifications, providing insights into their reasoning and classification patterns.

中文标题/摘要

标题：大型视觉-语言模型是否掌握了艺术史？

大型视觉-语言模型（VLMs）在多个领域图像分类中建立了新的基准。我们探讨它们的多模态推理是否也能解决由人类专家掌握的挑战。具体来说，我们测试VLMs是否能够对绘画的风格、作者和创作年代进行分类，这是一个传统上由艺术史学家掌握的领域。与自然图像相比，艺术品因其复杂多样的结构而构成独特的挑战，这些结构具有可变的组成和风格。这需要一种上下文和风格的解释，而不仅仅是简单的物体识别。艺术史学家长期研究艺术品的独特方面，风格预测是他们学科的关键组成部分。本文探讨了大型VLMs，这些模型整合了视觉和文本数据，是否能够有效地对绘画的历史和风格属性进行推理。我们进行了此类研究的第一个案例，对CLIP、LLaVA和GPT-4o三种VLMs进行了深入分析，评估它们在零样本分类中的艺术风格、作者和时期的能力。使用两个艺术品图像基准，我们评估了模型对风格的解释能力、对提示的敏感性以及失败案例。此外，我们通过分析错误分类，关注这些模型与人类艺术史专家的比较，提供有关它们推理和分类模式的见解。

Summary / 总结

This study investigates whether large Vision-Language Models (VLMs) can classify the style, author, and creation date of paintings, a task traditionally mastered by art historians. It evaluates the zero-shot classification of three models, CLIP, LLaVA, and GPT-4o, on two artwork image benchmarks, assessing how well they interpret style, how sensitive they are to prompts, and where they fail. Misclassifications are then compared against human art-historical expertise to shed light on the models' reasoning and classification patterns.

研究探讨了大型视觉-语言模型（VLMs）是否能够分类绘画的风格、作者和创作年代，这是传统上由艺术史学家掌握的领域。研究评估了CLIP、LLaVA和GPT-4o在使用两个艺术基准进行零样本分类任务中的表现，展示了它们在风格解释方面的能力以及对提示的敏感性。研究发现，尽管VLMs表现出一定的潜力，但在艺术解释和分类方面仍存在挑战，突显了艺术史领域与自然图像分类之间的独特差异。

To See a World in a Spark of Neuron: Disentangling Multi-task Interference for Training-free Model Merging

Authors: Zitao Fang, Guodong DU, Shuyang Yu, Yifei Guo, Yiwei Zhang, Yiyao Cao, Jing Li, Ho-Kin Tang, Sim Kuan Goh

Venue: EMNLP 2025

First: 2025-03-07T11:00:24+00:00 · Latest: 2025-09-10T13:56:44+00:00

Comments: Accepted to EMNLP 2025 Main Conference. This is the camera-ready version. Code: https://ZzzitaoFang.github.io/projects/NeuroMerging/

Abs · PDF · Project

Abstract

Fine-tuning pre-trained models on targeted datasets enhances task-specific performance but often comes at the expense of generalization. Model merging techniques, which integrate multiple fine-tuned models into a single multi-task model through task arithmetic, offer a promising solution. However, task interference remains a fundamental challenge, leading to performance degradation and suboptimal merged models. Existing approaches largely overlooked the fundamental roles of neurons, their connectivity, and activation, resulting in a merging process and a merged model that do not consider how neurons relay and process information. In this work, we present the first study that relies on neuronal mechanisms for model merging. Specifically, we decomposed task-specific representations into two complementary neuronal subspaces that regulate input sensitivity and task adaptability. Leveraging this decomposition, we introduced NeuroMerging, a novel merging framework developed to mitigate task interference within neuronal subspaces, enabling training-free model fusion across diverse tasks. Through extensive experiments, we demonstrated that NeuroMerging achieved superior performance compared to existing methods on multi-task benchmarks across both natural language and vision domains. Our findings highlighted the importance of aligning neuronal mechanisms in model merging, offering new insights into mitigating task interference and improving knowledge fusion. Our project is available at https://ZzzitaoFang.github.io/projects/NeuroMerging/.

中文标题/摘要

标题：在神经元火花中见世界：免训练模型融合中的多任务干扰拆解

在目标数据集上微调预训练模型可以提升特定任务的性能，但往往以牺牲泛化能力为代价。通过任务算术将多个微调模型整合成一个多任务模型的模型融合技术提供了一种有前景的解决方案。然而，任务干扰仍然是一个基本挑战，导致性能下降和次优融合模型。现有方法大多忽视了神经元及其连接性和激活的基本作用，导致融合过程和融合模型未能考虑神经元如何传递和处理信息。在本文中，我们首次依赖神经元机制进行模型融合。具体而言，我们将任务特定表示分解为两个互补的神经子空间，分别调节输入敏感性和任务适应性。利用这种分解，我们引入了NeuroMerging，这是一种新型融合框架，旨在减轻神经子空间内的任务干扰，实现跨多种任务的免训练模型融合。通过广泛的实验，我们证明了NeuroMerging在自然语言和视觉领域的多任务基准测试中优于现有方法，实现了更优的性能。我们的研究结果强调了在模型融合中对齐神经元机制的重要性，为减轻任务干扰和提高知识融合提供了新的见解。我们的项目可在https://ZzzitaoFang.github.io/projects/NeuroMerging/找到。

Summary / 总结

This work addresses the challenge of task interference in model merging by proposing NeuroMerging, a framework that decomposes task-specific representations into two complementary neuronal subspaces. The study demonstrates that NeuroMerging outperforms existing methods on multi-task benchmarks in both natural language and vision domains, highlighting the importance of aligning neuronal mechanisms in model merging to mitigate task interference and improve knowledge fusion.

该研究提出了一种名为NeuroMerging的新框架，通过将任务特定表示分解为两个互补的神经子空间来解决模型合并中的任务干扰问题。研究结果表明，NeuroMerging在自然语言和视觉领域的多任务基准测试中优于现有方法，强调了在模型合并中对齐神经机制以减轻任务干扰和提高知识融合的重要性。
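
Two ingredients mentioned in this entry can be sketched on toy weight vectors: merging by task arithmetic, and splitting a task vector into two complementary components. This is a minimal illustration, not NeuroMerging itself. Using the pre-trained weight vector as the reference direction for the split is an assumption made here for illustration, since the abstract does not say how the subspaces are built, and all names and numbers are hypothetical.

```python
def task_vector(finetuned, pretrained):
    """Difference between fine-tuned and pre-trained weights."""
    return [f - p for f, p in zip(finetuned, pretrained)]

def split_parallel_orthogonal(vec, reference):
    """Split vec into a part along `reference` and the orthogonal remainder."""
    ref_norm_sq = sum(r * r for r in reference)
    scale = sum(v * r for v, r in zip(vec, reference)) / ref_norm_sq
    parallel = [scale * r for r in reference]
    orthogonal = [v - p for v, p in zip(vec, parallel)]
    return parallel, orthogonal

def task_arithmetic_merge(pretrained, finetuned_models, lam=1.0):
    """Plain task arithmetic: add the scaled sum of task vectors to the base."""
    merged = list(pretrained)
    for ft in finetuned_models:
        tv = task_vector(ft, pretrained)
        merged = [m + lam * t for m, t in zip(merged, tv)]
    return merged

# Weights of a single toy neuron (hypothetical values).
pre = [1.0, 0.0]
ft_a = [1.5, 1.0]
ft_b = [2.0, -0.5]

merged = task_arithmetic_merge(pre, [ft_a, ft_b], lam=0.5)
par, orth = split_parallel_orthogonal(task_vector(ft_a, pre), pre)
print(merged, par, orth)
```

In this toy case the two task vectors partly cancel along the second coordinate, a small instance of the task interference that plain task arithmetic does nothing to prevent.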