arXiv Paper Digest

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Authors: Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li

First: 2025-09-11T17:59:59+00:00 · Latest: 2025-09-11T17:59:59+00:00

Comments: Project page: https://flux-reason-6m.github.io/

Abstract

The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The whole data curation takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/.

中文标题/摘要

标题：FLUX-Reason-6M & PRISM-Bench：百万规模的图文推理数据集及全面基准测试

开源文本到图像（T2I）模型的发展受限于缺乏大规模、注重推理的数据集和全面的评估基准，导致其性能与领先的封闭源系统存在差距。为解决这一挑战，我们引入了FLUX-Reason-6M和PRISM-Bench（精确且稳健的图像合成测量基准）。FLUX-Reason-6M是一个包含600万高质量FLUX生成图像和2000万双语（英语和中文）描述的庞大数据集，专门设计用于教授复杂推理。图像按照六个关键特征组织：想象力、实体、文本呈现、风格、情感和构图，并设计明确的生成链式思维（GCoT）以提供详细的图像生成步骤分解。整个数据整理耗时15000个A100 GPU天，为社区提供了此前仅在大型工业实验室才能获得的资源。PRISM-Bench提供了一种新颖的评估标准，包括七个不同的赛道，其中包括使用GCoT的艰巨长文本挑战。通过精心设计的提示，它利用先进的视觉语言模型进行细致的人类对齐评估和图像美学评估。我们在PRISM-Bench上对19个领先模型进行了广泛评估，揭示了关键性能差距并指出了需要改进的具体领域。我们的数据集、基准测试和评估代码已发布，以推动下一代注重推理的T2I生成。项目页面：https://flux-reason-6m.github.io/

Locality in Image Diffusion Models Emerges from Data Statistics

Authors: Artem Lukoianov, Chenyang Yuan, Justin Solomon, Vincent Sitzmann

First: 2025-09-11T17:59:08+00:00 · Latest: 2025-09-11T17:59:08+00:00

Comments: 30 pages, 18 figures, 6 tables

Abstract

Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.
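
The two denoisers the abstract contrasts can be sketched in a few lines. The following is a minimal illustration, not the authors' code. It assumes a variance-exploding noise model (`x_noisy = x0 + sigma * noise`), zero-mean data, and a toy 1D "image" whose pixel covariance decays as `rho ** |i - j|`; the function names and the toy covariance are hypothetical. The first function is the closed-form optimal denoiser for an empirical training set (a softmax-weighted average of training images). The second is the MSE-optimal linear denoiser `W = C (C + sigma^2 I)^-1`, whose rows concentrate around the diagonal when nearby pixels are correlated.

```python
import math


def optimal_denoiser(x_noisy, train_set, sigma):
    """Posterior mean E[x0 | x_noisy] when the data distribution is the
    empirical training set and the noise is Gaussian with std sigma."""
    # Log-weight of each training image: -||x_noisy - x_i||^2 / (2 sigma^2).
    logits = [
        -sum((a - b) ** 2 for a, b in zip(x_noisy, x)) / (2 * sigma ** 2)
        for x in train_set
    ]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    dim = len(x_noisy)
    return [
        sum(w * x[d] for w, x in zip(weights, train_set)) / total
        for d in range(dim)
    ]


def solve(a, b):
    """Solve a @ x = b (b is a matrix) by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def linear_denoiser(cov, sigma):
    """MSE-optimal linear denoiser for zero-mean data with covariance cov:
    W = (C + sigma^2 I)^-1 C, which equals C (C + sigma^2 I)^-1."""
    n = len(cov)
    noisy_cov = [
        [cov[i][j] + (sigma ** 2 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    return solve(noisy_cov, cov)
```

With a small `sigma`, `optimal_denoiser` snaps to the nearest training image, which is the memorization behavior the abstract describes. For the toy covariance, the middle row of `linear_denoiser(cov, sigma)` peaks at the diagonal and decays with pixel distance, a simple instance of locality arising from correlations alone.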

中文标题/摘要

标题：图像扩散模型中的局部性源自数据统计

在生成模型中，扩散模型因其训练目标的闭式最优解而独具魅力，通常被称为最优去噪器。然而，使用这种最优去噪器的扩散仅能复现训练集中的图像，因而无法捕捉深层扩散模型的行为。近期工作试图描述这种最优去噪器与深层扩散模型之间的差距，提出了无需训练的分析模型，这些模型可以生成类似于训练好的UNet生成的图像。表现最佳的方法假设卷积神经网络的平移等变性和局部性归纳偏置是性能差距的原因，因此将其假设纳入其分析模型中。在本文中，我们提供了证据表明，深层扩散模型中的局部性是图像数据集的统计属性，而非卷积神经网络的归纳偏置所致。具体而言，我们证明了最优参数线性去噪器表现出与深层神经去噪器相似的局部性特征。我们还通过理论和实验表明，这种局部性直接来源于自然图像数据集中像素间的相关性。最后，我们利用这些见解设计了一种分析去噪器，其预测分数与深层扩散模型更为匹配，优于先前的专家设计的替代方案。

Summary / 总结

This paper investigates why deep diffusion models generate images differently from the optimal denoiser, which merely reproduces training images. The authors propose that the locality in deep diffusion models is a statistical property of the image dataset rather than an inductive bias of convolutional neural networks. They show that an optimal parametric linear denoiser also exhibits similar locality properties and that this locality arises from pixel correlations in natural images. The study concludes by developing a new analytical denoiser that more closely matches deep diffusion model predictions.

本文研究了为什么深度扩散模型生成的图像与最优去噪器生成的图像不同，后者只能重现训练图像。作者提出，深度扩散模型中的局部性是图像数据集的统计特性，而不是卷积神经网络的归纳偏置。研究表明，最优参数线性去噪器也表现出类似的局部性，并且这种局部性源自自然图像数据中的像素相关性。研究结果导致了一个更符合深度扩散模型预测的分析性去噪器。

Improved GUI Grounding via Iterative Narrowing

Authors: Anthony Nguyen

First: 2024-11-18T05:47:12+00:00 · Latest: 2025-09-11T16:37:00+00:00

Comments: Code available at https://github.com/ant-8/GUI-Grounding-via-Iterative-Narrowing

Abstract

Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for zero-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework that employs an iterative narrowing mechanism to further improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
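
The abstract does not spell out the narrowing mechanism, so the following is only a sketch of one plausible reading: ask the model for a target location, crop a smaller window around that prediction, and ask again. The `predict` callback, the `shrink` factor, and the number of iterations are all hypothetical. A real setup would render the crop and send it to a VLM.

```python
def iterative_narrowing(predict, width, height, iterations=3, shrink=0.5):
    """Refine a GUI target location by repeatedly cropping around the
    previous prediction.

    predict(box) is a hypothetical callback: given a crop
    (left, top, right, bottom) in full-image pixels, it returns the
    predicted target (x, y), also in full-image pixels.
    """
    left, top, right, bottom = 0.0, 0.0, float(width), float(height)
    x, y = width / 2.0, height / 2.0
    for step in range(iterations):
        x, y = predict((left, top, right, bottom))
        if step == iterations - 1:
            break  # the last prediction is the answer; no further crop needed
        # Shrink the window and centre it on the prediction,
        # clamping so the crop stays inside the image.
        w = (right - left) * shrink
        h = (bottom - top) * shrink
        left = min(max(x - w / 2.0, 0.0), width - w)
        top = min(max(y - h / 2.0, 0.0), height - h)
        right, bottom = left + w, top + h
    return x, y
```

The design assumption is that a model localizes more precisely when the target occupies a larger share of its input. If the prediction error scales with crop size, each pass reduces it.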

中文标题/摘要

标题：通过迭代细化改进的GUI接地

图形用户界面（GUI）接地在增强视觉语言模型（VLM）代理的能力方面起着关键作用。虽然通用的VLM，如GPT-4V，在各种任务中表现出色，但在GUI接地方面的专业能力仍然不足。最近的研究集中在对这些模型进行微调，以实现零样本GUI接地，从而在基线性能上取得了显著改进。我们介绍了一种视觉提示框架，该框架采用迭代细化机制，进一步提高了通用模型和微调模型在GUI接地中的性能。为了评估，我们在包含各种UI平台的综合基准上测试了我们的方法，并提供了可重现我们结果的代码。

Summary / 总结

The paper addresses the limitation of Vision-Language Models (VLMs) in GUI grounding, which is crucial for enhancing the capabilities of VLM agents. It introduces a visual prompting framework with an iterative narrowing mechanism to improve both general and fine-tuned VLMs. The method was evaluated on a comprehensive benchmark, showing significant performance improvements over baseline models.

论文针对视觉语言模型（VLMs）在GUI定位方面的局限性，提出了一个包含迭代缩小机制的视觉提示框架，以提高通用和微调后的VLMs的性能。该方法在综合基准上进行了评估，显示出比基线模型显著的性能提升。

Compositional Concept Generalization with Variational Quantum Circuits

Authors: Hala Hawashin, Mina Abbaszadeh, Nicholas Joseph, Beth Pearson, Martha Lewis, Mehrnoosh Sadrzadeh

First: 2025-09-11T15:34:33+00:00 · Latest: 2025-09-11T15:34:33+00:00

Comments: Accepted to: 2025 IEEE International Conference on Quantum Artificial Intelligence (QAI), Naples, Italy, Nov 2-5, 2025. This is the authors' accepted manuscript (AAM). An IEEE copyright notice appears on page 1. The final published version will appear in IEEE Xplore; DOI to be added when available

Abstract

Compositional generalization is a key facet of human cognition, but it is lacking in current AI tools such as vision-language models. Previous work examined whether compositional tensor-based sentence semantics can overcome the challenge, but led to negative results. We conjecture that the increased training efficiency of quantum models will improve performance in these tasks. We interpret the representations of compositional tensor-based models in Hilbert spaces and train Variational Quantum Circuits to learn these representations on an image captioning task requiring compositional generalization. We use two image encoding techniques: a multi-hot encoding (MHE) on binary image vectors and an angle/amplitude encoding on image vectors taken from the vision-language model CLIP. We achieve good proof-of-concept results using noisy MHE encodings. Performance on CLIP image vectors was more mixed, but still outperformed classical compositional models.
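
For readers unfamiliar with angle and amplitude encoding, here is a small sketch using the textbook definitions. It is not the authors' circuit: amplitude encoding is shown as plain normalization to a unit vector, and angle encoding as one `RY(theta)` rotation per qubit applied to `|0>`, producing a product state. The function names are illustrative.

```python
import math


def amplitude_encode(vector):
    """Scale a real vector to unit norm so its entries can serve as state
    amplitudes. A real device also needs the length padded to a power of two."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return [v / norm for v in vector]


def angle_encode(angles):
    """Encode one feature per qubit: RY(theta)|0> = [cos(theta/2), sin(theta/2)].
    Returns the full state vector of the resulting product state."""
    state = [1.0]
    for theta in angles:
        qubit = [math.cos(theta / 2), math.sin(theta / 2)]
        # Tensor product of the state so far with the new qubit.
        state = [a * q for a in state for q in qubit]
    return state
```

Angle encoding uses one qubit per feature, while amplitude encoding packs `2^n` features into `n` qubits. That trade-off is one reason a paper might try both on high-dimensional CLIP vectors.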

中文标题/摘要

标题：使用变分量子电路的组合理构概念泛化

组合理构泛化是人类认知的关键方面，但在当前的AI工具如视觉-语言模型中缺失。先前的工作研究了是否可以通过组合张量基句法语义克服这一挑战，但结果为负。我们推测，量子模型的训练效率提高将改善这些任务的表现。我们解释了组合张量基模型在希尔伯特空间中的表示，并训练变分量子电路在需要组合理构泛化的图像字幕任务中学习这些表示。我们使用了两种图像编码技术：二值图像向量上的多热编码（MHE）和从视觉-语言模型CLIP获取的图像向量上的角度/振幅编码。我们使用嘈杂的MHE编码取得了良好的概念验证结果。在CLIP图像向量上的表现则更为混合，但仍优于经典组合模型。

Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

First: 2025-09-08T09:20:04+00:00 · Latest: 2025-09-11T15:24:22+00:00

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs' attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity; and (3) theoretically, we prove that the contrast of attention maps between general queries and task-specific queries enables the decomposition of visual signal into semantic signals and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.
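
The abstract names two quantities that are easy to illustrate: attention entropy, and a contrast between a general-query and a task-specific attention map. The sketch below uses the standard Shannon entropy. The ratio-based contrast is an assumption chosen for simplicity, since the exact operator CARVE uses is not given here. The toy 2x2 "attention maps" are plain nested lists.

```python
import math


def attention_entropy(attn):
    """Shannon entropy of an attention map after normalizing it to sum to 1.
    Higher values mean attention is spread more evenly over the image."""
    flat = [a for row in attn for a in row]
    total = sum(flat)
    return -sum((a / total) * math.log(a / total) for a in flat if a > 0)


def contrast_mask(general, task, eps=1e-6):
    """Hypothetical contrast: divide task-specific attention by general-query
    attention, then rescale so the strongest location equals 1."""
    ratio = [
        [t / (g + eps) for g, t in zip(g_row, t_row)]
        for g_row, t_row in zip(general, task)
    ]
    peak = max(max(row) for row in ratio)
    if peak == 0:
        return [[0.0 for _ in row] for row in ratio]
    return [[r / peak for r in row] for row in ratio]
```

Locations that receive attention regardless of the question are down-weighted, while locations that light up only for the task-specific query stand out. The resulting mask could then be used to crop or dim the input image before a second pass.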

中文标题/摘要

标题：基于对比注意力聚焦：增强VLMs的视觉推理能力

视觉-语言模型（VLMs）在多种视觉任务中表现出显著的成功，但在复杂视觉环境中其性能会下降。尽管现有的增强方法需要额外的训练、依赖外部分割工具或在粗粒度级别上操作，但它们忽视了VLMs内部固有的能力。为了解决这一问题，我们研究了VLMs的注意力模式，并发现：（1）视觉复杂性与注意力熵呈强烈正相关，负面影响了推理性能；（2）注意力从浅层的全局扫描逐渐细化到深层的集中收敛，收敛程度由视觉复杂性决定；（3）理论上，我们证明了通用查询与任务特定查询之间的注意力图对比能够将视觉信号分解为语义信号和视觉噪声成分。基于这些见解，我们提出了对比注意力细化以增强视觉效果（CARVE）的方法，这是一种无需训练的方法，通过像素级的注意力对比提取任务相关的视觉信号。大量实验表明，CARVE能够一致地提升性能，开源模型的性能提升高达75%。我们的工作为理解视觉复杂性和注意力机制之间的相互作用提供了关键见解，为通过对比注意力提高视觉推理能力提供了高效途径。

Summary / 总结

The research aims to enhance VLMs' performance in complex visual environments by leveraging their inherent attention mechanisms. The method involves analyzing VLMs' attention patterns and proposing Contrastive Attention Refinement for Visual Enhancement (CARVE), which extracts task-relevant visual signals through pixel-level attention contrasting. Experimental results show that CARVE significantly improves performance, achieving up to 75% enhancement on open-source models.

研究旨在通过分析注意力模式来提升Vision-Language模型（VLMs）在复杂环境中的视觉推理能力。研究发现，视觉复杂性会负面影响推理性能，并且注意力会从全局扫描逐渐细化到聚焦收敛。提出的Contrastive Attention Refinement for Visual Enhancement（CARVE）方法在无需额外训练的情况下提升了VLMs的表现，使其性能提高了高达75%。

Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics

Authors: Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi

First: 2025-09-10T10:07:27+00:00 · Latest: 2025-09-11T13:03:04+00:00

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMa 3.2, to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in the NOvA and DUNE experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events. Our evaluation considers both the classification performance and interpretability of the model predictions. We find that VLMs can outperform CNNs, while also providing greater flexibility in integrating auxiliary textual or semantic information and offering more interpretable, reasoning-based predictions. This work highlights the potential of VLMs as a general-purpose backbone for physics event classification, due to their high performance, interpretability, and generalizability, which opens new avenues for integrating multimodal reasoning in experimental neutrino physics.

中文标题/摘要

标题：将视觉语言模型适应于高能物理中的中微子事件分类

近年来，大型语言模型（LLMs）在处理和推理结构化和非结构化数据方面的能力已经显示出其显著优势，远超自然语言之外。在本项工作中，我们探讨了视觉语言模型（VLMs），特别是LLaMa 3.2的微调版本，应用于识别高能物理（HEP）实验中像素化探测器数据中的中微子相互作用的任务。我们将该模型与NOvA和DUNE实验中使用的类似卷积神经网络（CNN）架构进行基准测试，这些架构在分类电子和Muon中微子事件方面已达到高效率和纯度。我们的评估考虑了模型分类性能和预测的可解释性。我们发现VLMs可以超越CNNs，同时提供更大的灵活性以整合辅助的文本或语义信息，并提供更可解释、基于推理的预测。本项工作突显了VLMs作为物理事件分类的一般性基础架构的潜力，由于其高性能、可解释性和泛化能力，这为在实验中微子物理中整合多模态推理开辟了新的途径。

Summary / 总结

This study investigates the use of Vision Language Models (VLMs) for classifying neutrino interactions in pixelated detector data from high-energy physics experiments, comparing them to a state-of-the-art convolutional neural network (CNN). The VLM, a fine-tuned variant of LLaMa 3.2, can outperform the CNN in classification while offering greater flexibility in integrating auxiliary textual or semantic information and more interpretable, reasoning-based predictions. The research suggests VLMs as a general-purpose backbone for physics event classification due to their high performance, interpretability, and generalizability.

本研究探讨了使用视觉语言模型（VLMs）识别高能物理实验中的中微子相互作用，将其与最先进的卷积神经网络（CNNs）进行比较。VLMs基于LLaMa 3.2进行微调，表现出比CNNs更好的分类性能，并通过整合文本或语义信息提供了更大的灵活性和可解释性。研究展示了VLMs作为物理事件分类的通用基础架构的潜力，增强了实验中多模态推理的能力。

Shaken, Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks

Authors: Lukáš Gajdošech, Hassan Ali, Jan-Gerrit Habekost, Martin Madaras, Matthias Kerzel, Stefan Wermter

Venue: IROS

First: 2025-03-06T10:51:04+00:00 · Latest: 2025-09-11T12:49:34+00:00

Comments: Submitted and Accepted for Presentation at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025

Abstract

Datasets for object detection often do not cover enough variety of glasses, due to their transparent and reflective properties. Specifically, open-vocabulary object detectors, widely used in embodied robotic agents, fail to distinguish subclasses of glasses. This scientific gap poses an issue for robotic applications that suffer from accumulating errors between detection, planning, and action execution. This paper introduces a novel method for acquiring real-world data from RGB-D sensors that minimizes human effort. We propose an auto-labeling pipeline that generates labels for all the acquired frames based on the depth measurements. We provide a novel real-world glass object dataset, GlassNICOLDataset, that was collected on the Neuro-Inspired COLlaborator (NICOL), a humanoid robot platform. The dataset consists of 7850 images recorded from five different cameras. We show that our trained baseline model outperforms state-of-the-art open-vocabulary approaches. In addition, we deploy our baseline model in an embodied agent approach on the NICOL platform, where it achieves a success rate of 81% in a human-robot bartending scenario.

中文标题/摘要

标题：摇而不搅：一种用于人类-机器人调酒任务中玻璃视觉理解的新数据集

物体检测数据集往往未能涵盖足够多样的玻璃，由于玻璃的透明和反射特性。特别是，广泛应用于具身机器人代理的开放词汇物体检测器无法区分不同类别的玻璃。这一科学缺口对依赖于检测、规划和动作执行之间累积误差的机器人应用造成了问题。本文介绍了一种新的方法，用于从RGB-D传感器中获取真实世界数据，以最小化人工努力。我们提出了一种自动标注流水线，根据深度测量生成所有获取帧的标签。我们提供了一个新的真实世界玻璃对象数据集GlassNICOLDataset，该数据集是在神经启发式协作者（NICOL）人形机器人平台上收集的。数据集包含从五个不同摄像头记录的7850张图像。我们展示了我们训练的基本模型优于最先进的开放词汇方法。此外，我们在NICOL平台上部署了我们的基本模型，该模型在人类-机器人调酒场景中达到了81%的成功率。

Summary / 总结

This paper addresses the lack of variety of glasses in object detection datasets, which stems from their transparent and reflective properties and leaves open-vocabulary detectors unable to distinguish subclasses of glasses. The authors introduce a novel dataset, GlassNICOLDataset, of 7850 images recorded from five cameras with RGB-D sensors on the NICOL humanoid robot platform, together with an auto-labeling pipeline that generates labels from depth measurements to minimize human effort. Their trained baseline model outperforms state-of-the-art open-vocabulary approaches and achieves an 81% success rate in a human-robot bartending scenario on the NICOL platform.

本文解决了物体检测数据集中缺乏对透明和反射性玻璃的多样性表示的问题。它介绍了一种使用RGB-D传感器收集真实世界数据的新方法，并基于深度测量提出了一个自动标注流水线。GlassNICOLDataset在人形机器人平台上收集了来自五个摄像头的7850张图像。训练的基本模型优于最先进的开放词汇方法，并在NICOL平台上的人机调酒场景中实现了81%的成功率。

Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift

Authors: Umaima Rahman, Raza Imam, Mohammad Yaqub, Dwarikanath Mahapatra

First: 2025-09-11T12:26:57+00:00 · Latest: 2025-09-11T12:26:57+00:00

Abstract

Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at https://github.com/rumaima/DRiFt.

中文标题/摘要

标题：解耦临床和类别无关特征以实现可靠的少量样本适应性调整

医学视觉-语言模型（VLMs）为临床决策支持提供了希望，但在分布变化下的可靠性仍然是安全部署的主要关切。这些模型由于成像协议和自由文本报告的变异性，往往学习到任务无关的关联，限制了其泛化能力，并增加了在实际场景中失败的风险。我们提出了DRiFt，一种结构化的特征解耦框架，通过参数高效调优（LoRA）和可学习的提示标记，明确地将临床相关信号与任务无关的噪声分离。为了增强跨模态对齐并减少不确定性，我们通过为多样化的医学数据集生成描述来精心策划高质量的临床相关图像-文本对。我们的方法在分布内性能上比之前的基于提示的方法提高了11.4%的Top-1准确率和3.3%的宏F1分数，同时在未见数据集上保持了强大的鲁棒性。消融研究显示，分离任务相关特征和精细对齐显著增强了模型的泛化能力和减少了领域变化下的不可预测行为。这些见解有助于构建更安全、更值得信赖的VLMs用于临床应用。代码可在https://github.com/rumaima/DRiFt获取。

Summary / 总结

The research aims to improve the reliability of medical vision-language models (VLMs) under distribution shifts. DRiFt, a structured feature decoupling framework, uses parameter-efficient tuning (LoRA) and learnable prompt tokens to separate clinically relevant signals from task-agnostic noise. The approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining robustness across unseen datasets.

研究旨在通过分离临床和任务无关特征来提高医疗视觉-语言模型（VLMs）在分布变化下的可靠性。DRiFt框架通过参数高效调优（LoRA）和可学习提示标记来分离临床相关信号和任务无关噪声。该方法在分布内性能上提高了11.4%的Top-1准确率和3.3%的宏F1分数，同时在未见过的数据集上保持了鲁棒性。消融研究表明，分离任务相关特征和仔细对齐显著提高了模型的泛化能力和减少了领域变化下的不可预测行为。

Curriculum-Based Multi-Tier Semantic Exploration via Deep Reinforcement Learning

Authors: Abdel Hakim Drid, Vincenzo Suriani, Daniele Nardi, Abderrezzak Debilou

First: 2025-09-11T11:10:08+00:00 · Latest: 2025-09-11T11:10:08+00:00

Comments: The 19th International Conference on Intelligent Autonomous Systems (IAS 19), 2025, Genoa

Abstract

Navigating and understanding complex and unknown environments autonomously demands more than just basic perception and movement from embodied agents. Truly effective exploration requires agents to possess higher-level cognitive abilities, reason about their surroundings, and make more informed decisions regarding exploration strategies. However, traditional RL approaches struggle to balance efficient exploration and semantic understanding because of the limited cognitive capabilities embedded in the agents' small policies, which often leaves semantic exploration to be driven by humans. In this paper, we address this challenge by presenting a novel Deep Reinforcement Learning (DRL) architecture that is specifically designed for resource-efficient semantic exploration. A key methodological contribution is the integration of Vision-Language Model (VLM) common sense through a layered reward function. The VLM query is modeled as a dedicated action, allowing the agent to strategically query the VLM only when deemed necessary for gaining external guidance, thereby conserving resources. This mechanism is combined with a curriculum learning strategy designed to guide learning at different levels of complexity to ensure robust and stable learning. Our experimental evaluation results convincingly demonstrate that our agent achieves significantly enhanced object discovery rates and develops a learned capability to effectively navigate towards semantically rich regions. Furthermore, it also shows a strategic mastery of when to prompt for external environmental information. By demonstrating a practical and scalable method for embedding common-sense semantic reasoning in autonomous agents, this research provides a novel approach to pursuing fully intelligent and self-guided exploration in robotics.
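
To make the idea of "VLM query as a dedicated action" with a layered reward concrete, here is a toy sketch. Every tier, weight, and name in it is hypothetical: the abstract does not state the actual reward terms, only that the reward is layered and that querying is an action the agent must choose deliberately.

```python
QUERY_VLM = "query_vlm"  # hypothetical name for the dedicated query action


def layered_reward(action, new_cells, new_objects, followed_hint,
                   query_cost=0.5):
    """Toy layered reward with assumed weights.

    Tier 1 rewards geometric coverage, tier 2 rewards semantic discovery,
    and tier 3 rewards acting on VLM guidance. Choosing the query action
    carries a cost, so the policy learns to query only when it pays off.
    """
    reward = 0.1 * new_cells        # tier 1: newly explored map cells
    reward += 1.0 * new_objects     # tier 2: newly discovered objects
    if followed_hint:
        reward += 0.5               # tier 3: moved toward a VLM-suggested region
    if action == QUERY_VLM:
        reward -= query_cost        # resource cost of calling the VLM
    return reward
```

A curriculum could then be expressed by enabling the tiers progressively, for example training on tier 1 alone before adding the semantic and query-related terms. That staging is again an assumption, not the paper's stated schedule.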

中文标题/摘要

标题：基于 Curriculum 的多级语义探索深度强化学习

自主导航和理解复杂未知环境不仅需要基本的感知和移动，还需要具备高级认知能力，能够推理周围环境并做出更明智的探索策略选择。然而，传统强化学习方法由于代理嵌入的认知能力有限，难以平衡高效的探索和语义理解，导致在处理语义探索时常常需要人工干预。本文通过提出一种专为资源高效语义探索设计的新型深度强化学习（DRL）架构，解决了这一挑战。一个关键的方法论贡献是通过分层奖励函数整合了视觉语言模型（VLM）的常识。VLM 查询被建模为专用动作，使代理仅在必要时战略性地查询 VLM 以获取外部指导，从而节省资源。该机制与一种课程学习策略相结合，旨在引导不同复杂度水平的学习，以确保稳健和稳定的训练。我们的实验结果表明，我们的代理在物体发现率方面显著提高，并发展了有效导航至语义丰富区域的能力。此外，还展示了何时请求外部环境信息的战略掌握。通过展示一种实用且可扩展的方法，将常识语义推理嵌入自主代理，这项研究为追求机器人中完全智能和自我引导的探索提供了一种新方法。

Summary / 总结

This paper addresses the challenge of autonomous semantic exploration by proposing a DRL architecture that integrates Vision-Language Model (VLM) common sense through a layered reward function. The VLM query is modeled as a dedicated action, so the agent conserves resources by querying the VLM only when necessary, and a curriculum learning strategy guides learning at different levels of complexity. Experimental results show significant improvements in object discovery rates and the agent's ability to navigate towards semantically rich regions, along with strategic use of external information. This research provides a practical and scalable method for embedding common-sense semantic reasoning in autonomous agents for intelligent exploration.

本文提出了一种新颖的DRL架构，通过层次奖励函数和课程学习整合Vision-Language Model (VLM)，使代理能够战略性地查询VLM，从而节省资源并实现稳健学习。实验结果表明，该代理在物体发现率和向语义丰富区域导航方面取得了显著改进，并展示了对外部环境信息的策略性使用。这项研究提供了一种将常识性语义推理嵌入自主代理中的实用方法，以实现智能探索。

S$^2$-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models

Authors: Chubin Chen, Jiashu Zhu, Xiaokun Feng, Nisha Huang, Meiqi Wu, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Xiu Li

First: 2025-08-18T12:31:20+00:00 · Latest: 2025-09-11T10:04:07+00:00

Abstract

Classifier-free Guidance (CFG) is a widely used technique in modern diffusion models for enhancing sample quality and prompt adherence. However, through an empirical analysis on Gaussian mixture modeling with a closed-form solution, we observe a discrepancy between the suboptimal results produced by CFG and the ground truth. The model's excessive reliance on these suboptimal predictions often leads to semantic incoherence and low-quality outputs. To address this issue, we first empirically demonstrate that the model's suboptimal predictions can be effectively refined using sub-networks of the model itself. Building on this insight, we propose S^2-Guidance, a novel method that leverages stochastic block-dropping during the forward process to construct stochastic sub-networks, effectively guiding the model away from potential low-quality predictions and toward high-quality outputs. Extensive qualitative and quantitative experiments on text-to-image and text-to-video generation tasks demonstrate that S^2-Guidance delivers superior performance, consistently surpassing CFG and other advanced guidance strategies. Our code will be released.
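
For reference, the standard CFG update combines the unconditional and conditional noise predictions as `eps_uncond + scale * (eps_cond - eps_uncond)`. The sketch below implements that formula and then adds an illustrative self-guidance term. That second term is an assumption, not the paper's published update rule: it pushes the result away from the prediction of a randomly block-dropped sub-network. The names `s2_style_guidance`, `sub_scale`, and `sample_dropped_blocks` are hypothetical.

```python
import random


def cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance on noise predictions."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def s2_style_guidance(eps_uncond, eps_cond, eps_sub, scale, sub_scale):
    """CFG plus an assumed extra term that moves the prediction away from
    a weaker sub-network's output eps_sub (illustrative only)."""
    guided = cfg(eps_uncond, eps_cond, scale)
    return [g + sub_scale * (c - s)
            for g, c, s in zip(guided, eps_cond, eps_sub)]


def sample_dropped_blocks(num_blocks, drop_ratio, seed=None):
    """Pick which network blocks to skip when forming a stochastic
    sub-network for one forward pass."""
    rng = random.Random(seed)
    k = max(1, round(num_blocks * drop_ratio))
    return sorted(rng.sample(range(num_blocks), k))
```

With `sub_scale = 0` the second function reduces to plain CFG, which makes the extra term easy to ablate. In a real sampler, `eps_sub` would come from running the same model with the blocks returned by `sample_dropped_blocks` bypassed.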

中文标题/摘要

标题：S$^2$-指导：基于随机自我引导的无训练增强扩散模型

分类器无指导（CFG）是现代扩散模型中广泛使用的一种技术，用于提高样本质量和指令一致性。然而，通过对具有解析解的高斯混合模型进行经验分析，我们观察到CFG产生的次优结果与真实值之间存在差异。模型对这些次优预测的过度依赖往往导致语义不一致和低质量的输出。为了解决这一问题，我们首先通过经验表明，可以使用模型自身的子网络有效精炼模型的次优预测。在此基础上，我们提出了一种新颖的方法S$^2$-指导，该方法利用前向过程中随机块丢弃来构建随机子网络，有效地引导模型远离潜在的低质量预测，趋向高质量输出。在文本到图像和文本到视频生成任务上的广泛定性和定量实验表明，S$^2$-指导提供了优越的性能，始终优于CFG和其他先进的指导策略。我们的代码将被发布。

Summary / 总结

The paper addresses the issue of suboptimal results produced by Classifier-free Guidance (CFG) in diffusion models, which often lead to semantic incoherence and low-quality outputs. To improve this, the authors propose S$^2$-Guidance, a method that uses stochastic block-dropping to construct sub-networks, guiding the model towards high-quality predictions. Experiments show that S$^2$-Guidance outperforms CFG and other advanced guidance strategies in both qualitative and quantitative evaluations on text-to-image and text-to-video generation tasks.

论文针对分类器自由引导(CFG)在扩散模型中产生的次优结果导致语义不一致和低质量输出的问题，提出了一种名为S$^2$-引导的方法，该方法通过在前向过程中使用随机块丢弃来构建子网络，引导模型产生高质量的预测。实验表明，S$^2$-引导在文本到图像和文本到视频生成任务中的定性和定量评估中均优于CFG和其他高级引导策略。