arXiv Paper Digest

Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity

Authors: Eric Tillmann Bill, Enis Simsar, Thomas Hofmann

First: 2025-10-02T17:59:58+00:00 · Latest: 2025-10-02T17:59:58+00:00

Comments: Code: https://github.com/ericbill21/FOCUS/

Abstract

Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.

Summary

The paper tackles multi-subject text-to-image generation, where models show attribute leakage, identity entanglement, and subject omissions, by casting subject disentanglement as stochastic optimal control over a trained flow-matching sampler. It proposes two algorithms: a training-free test-time controller that perturbs the base velocity, and Adjoint Matching, a lightweight fine-tuning rule that trains a control network while preserving base-model capabilities. Both improve multi-subject alignment while maintaining the base model's style on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL. Test-time control runs efficiently on commodity GPUs, fine-tuned controllers trained on limited prompts generalize to unseen ones, and the highlighted variant, FOCUS, achieves state-of-the-art multi-subject fidelity across models.

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Authors: Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez

First: 2025-10-02T17:59:43+00:00 · Latest: 2025-10-02T17:59:43+00:00

Abstract

Text-to-image diffusion models trained on a fixed set of resolutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during training. High-resolution text-to-image generators are currently unable to easily offer an out-of-the-box budget-efficient alternative to their users who might not need high-resolution images. We identify a key technical insight in diffusion models that when addressed can help tackle this limitation: Noise schedulers have unequal perceptual effects across resolutions. The same level of noise removes disproportionately more signal from lower-resolution images than from high-resolution images, leading to a train-test mismatch. We propose NoiseShift, a training-free method that recalibrates the noise level of the denoiser conditioned on resolution size. NoiseShift requires no changes to model architecture or sampling schedule and is compatible with existing models. When applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by 10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent artifacts and enhancing the quality of low-resolution image generation.
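
The claim that the same noise level costs low-resolution images more signal can be illustrated with a toy calculation. The sketch below is not the NoiseShift procedure itself. It assumes a smooth image whose neighbouring pixels are nearly equal, so that average pooling keeps the signal while averaging independent noise; the function name and parameters are hypothetical.

```python
import random
import statistics


def pooled_noise_std(sigma: float, factor: int, n_blocks: int = 20000, seed: int = 0) -> float:
    """Empirical std of i.i.d. Gaussian noise after factor x factor average pooling.

    Assumption: pixel noise is independent, so averaging factor**2 pixels
    should shrink the noise std to roughly sigma / factor.
    """
    rng = random.Random(seed)
    block = factor * factor
    pooled = [
        sum(rng.gauss(0.0, sigma) for _ in range(block)) / block
        for _ in range(n_blocks)
    ]
    return statistics.pstdev(pooled)


# Compare the effective noise seen at a common coarse scale.
for factor in (1, 2, 4):
    print(factor, round(pooled_noise_std(1.0, factor), 3))
```

Under these assumptions, a high-resolution image noised at level sigma carries only about sigma / factor of noise when viewed at the coarser scale, whereas an image generated natively at that coarse scale carries the full sigma. This is one way to read the train-test mismatch the abstract describes.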

Summary

NoiseShift is a training-free method that recalibrates the denoiser's noise level according to resolution in order to improve low-resolution image generation. It addresses the train-test mismatch that arises because the same noise level removes disproportionately more signal from low-resolution images than from high-resolution ones. On LAION-COCO, NoiseShift improves average FID by 15.89% for Stable Diffusion 3.5, 8.56% for Stable Diffusion 3, and 2.44% for Flux-Dev; on CelebA, the corresponding improvements are 10.36%, 5.19%, and 3.02%.

VideoNSA: Native Sparse Attention Scales Video Understanding

Authors: Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu

First: 2025-10-02T17:58:54+00:00 · Latest: 2025-10-02T17:58:54+00:00

Comments: Project Page: https://enxinsong.com/VideoNSA-web/, Code: https://github.com/Espere-1119-Song/VideoNSA

Abstract

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.

Summary

VideoNSA adapts Native Sparse Attention (NSA) to video-language models through end-to-end training of Qwen2.5-VL, using a hardware-aware hybrid scheme that keeps dense attention for text and applies NSA to video. It outperforms token-compression and training-free sparse baselines on long-video understanding, temporal reasoning, and spatial benchmarks. Ablations show reliable scaling to 128K tokens, an optimal global-local attention allocation at a fixed budget, task-dependent branch usage patterns, and that the learnable combined sparse attention helps induce dynamic attention sinks.

From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens

Authors: Hala Sheta, Eric Huang, Shuyu Wu, Ilia Alenabi, Jiajun Hong, Ryker Lin, Ruoxi Ning, Daniel Wei, Jialin Yang, Jiawei Zhou, Ziqiao Ma, Freda Shi

Venue: EMNLP 2025

First: 2025-10-02T17:58:41+00:00 · Latest: 2025-10-02T17:58:41+00:00

Comments: EMNLP 2025 System Demonstration | Code: https://github.com/compling-wat/vlm-lens

Abstract

We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and their over 30 variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-sourced project to accelerate community efforts in understanding and improving VLMs.

Summary

The paper introduces VLM-Lens, a unified, YAML-configurable toolkit for systematic benchmarking, analysis, and interpretation of vision-language models (VLMs). It supports extracting intermediate outputs from any layer during the forward pass of open-source VLMs and integrates easily with various interpretability and analysis methods. Two simple analytical experiments reveal systematic differences in hidden representations across layers and target concepts.

Test-Time Anchoring for Discrete Diffusion Posterior Sampling

Authors: Litu Rout, Andreas Lugmayr, Yasamin Jafarian, Srivatsan Varadharajan, Constantine Caramanis, Sanjay Shakkottai, Ira Kemelmacher-Shlizerman

First: 2025-10-02T17:58:37+00:00 · Latest: 2025-10-02T17:58:37+00:00

Comments: Preprint

Abstract

We study the problem of posterior sampling using pretrained discrete diffusion foundation models, aiming to recover images from noisy measurements without retraining task-specific models. While diffusion models have achieved remarkable success in generative modeling, most advances rely on continuous Gaussian diffusion. In contrast, discrete diffusion offers a unified framework for jointly modeling categorical data such as text and images. Beyond unification, discrete diffusion provides faster inference, finer control, and principled training-free Bayesian inference, making it particularly well-suited for posterior sampling. However, existing approaches to discrete diffusion posterior sampling face severe challenges: derivative-free guidance yields sparse signals, continuous relaxations limit applicability, and split Gibbs samplers suffer from the curse of dimensionality. To overcome these limitations, we introduce Anchored Posterior Sampling (APS) for masked diffusion foundation models, built on two key innovations -- quantized expectation for gradient-like guidance in discrete embedding space, and anchored remasking for adaptive decoding. Our approach achieves state-of-the-art performance among discrete diffusion samplers across linear and nonlinear inverse problems on the standard benchmarks. We further demonstrate the benefits of our approach in training-free stylization and text-guided editing.

Summary

The paper studies posterior sampling with pretrained discrete diffusion foundation models, aiming to recover images from noisy measurements without retraining task-specific models. It introduces Anchored Posterior Sampling (APS) for masked diffusion models, which combines quantized expectation for gradient-like guidance in discrete embedding space with anchored remasking for adaptive decoding. APS achieves state-of-the-art performance among discrete diffusion samplers on linear and nonlinear inverse problems and also benefits training-free stylization and text-guided editing.

microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification

Authors: Sathira Silva, Eman Ali, Chetan Arora, Muhammad Haris Khan

First: 2025-10-02T17:47:39+00:00 · Latest: 2025-10-02T17:47:39+00:00

Abstract

Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP [CLS] token; however, this approach overlooks spatial precision. We propose microCLIP, a self-training framework that jointly refines CLIP's visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided [FG] token from patch embeddings and fuses it with the global [CLS] token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion's evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent 2.90% average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.
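
The convex combination behind Dynamic Knowledge Aggregation can be sketched in a few lines. This is only an illustration of the general idea stated in the abstract: the single mixing weight `alpha`, the use of per-class scores as plain lists, and both function names are assumptions, not the authors' implementation.

```python
from typing import List


def aggregate_scores(prior: List[float], current: List[float], alpha: float) -> List[float]:
    """Convex combination of fixed prior scores and evolving model scores.

    alpha is a hypothetical mixing weight in [0, 1]; alpha = 1 trusts only the prior.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    if len(prior) != len(current):
        raise ValueError("prior and current must have the same number of classes")
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(prior, current)]


def pseudo_label(scores: List[float]) -> int:
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


# The prior favours class 0, the adapted model favours class 1.
mixed = aggregate_scores([0.6, 0.4], [0.2, 0.8], alpha=0.5)
print(mixed, pseudo_label(mixed))
```

Because the weights are non-negative and sum to one, the mixed scores stay within the range spanned by the two sources, which is what keeps the fixed prior acting as a stabilizer while the learnable head evolves.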

Summary

microCLIP is an unsupervised self-training framework that adapts CLIP-based vision-language models to fine-grained image classification. Its lightweight TokenFusion module uses Saliency-Oriented Attention Pooling (SOAP) to build a saliency-guided [FG] token from patch embeddings and fuse it with the global [CLS] token. A two-headed LLM-derived classifier and Dynamic Knowledge Aggregation stabilize adaptation and iteratively refine pseudo-labels. The approach yields an average accuracy gain of 2.90% across 13 fine-grained benchmarks while requiring only light adaptation.

From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding

Authors: Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, Garin Kessler

First: 2025-10-02T17:43:01+00:00 · Latest: 2025-10-02T17:43:01+00:00

Abstract

Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision-language tasks, yet their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. Existing solutions alleviate this issue by selecting a sparse set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on the Video-MME, LongVideoBench, and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real-world video understanding applications. The project webpage is available at https://guangyusun.com/f2c .
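
The trade-off between clip length and spatial resolution under a fixed token budget comes down to simple arithmetic. The sketch below is a hypothetical illustration, not F2C's actual strategy: it assumes square frames, one token per patch, and an even split of the budget across all selected frames; the function names and the budget of 8192 tokens are made up for the example.

```python
import math


def tokens_per_frame(budget: int, num_clips: int, clip_len: int) -> int:
    """Tokens each frame may use when the budget is split evenly across frames."""
    frames = num_clips * clip_len
    if frames <= 0:
        raise ValueError("need at least one frame")
    return budget // frames


def side_in_patches(budget: int, num_clips: int, clip_len: int) -> int:
    """Side length, in patches, of the largest square frame that fits the per-frame share."""
    return math.isqrt(tokens_per_frame(budget, num_clips, clip_len))


# Same budget: 8 isolated key frames versus 8 clips of 4 frames each.
print(side_in_patches(8192, num_clips=8, clip_len=1))  # 32
print(side_in_patches(8192, num_clips=8, clip_len=4))  # 16
```

In this toy setting, quadrupling the clip length halves the side length of each frame, which is the kind of balance between temporal coverage and spatial detail that the abstract refers to.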

Summary

This work improves long-form video understanding with Video LLMs by selecting key clips, short temporally coherent segments, instead of isolated key frames. To keep the computational budget fixed, an adaptive resolution strategy trades spatial resolution against clip length so that the token count per video stays constant. The training-free method, F2C, outperforms uniform sampling by up to 8.1% on Video-MME, 5.6% on LongVideoBench, and 10.3% on MLVU, underscoring the value of preserving temporal coherence during selection.

GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation

Authors: Weijia Dou, Xu Zhang, Yi Bin, Jian Liu, Bo Peng, Guoqing Wang, Yang Yang, Heng Tao Shen

First: 2025-10-02T16:37:56+00:00 · Latest: 2025-10-02T16:37:56+00:00

Abstract

Recent attempts to transfer features from 2D Vision-Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale annotated 3D data. We argue that this limitation stems from the dominant segmentation-and-matching paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy, view-aggregated features. To exploit this property, we propose GeoPurify, which applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify matches or surpasses state-of-the-art performance while utilizing only about 1.5% of the training data. Our code and checkpoints are available at https://github.com/tj12323/GeoPurify.

Summary

GeoPurify is a data-efficient framework for open-vocabulary 3D segmentation that addresses the trade-off between noisy direct 2D-to-3D feature projection and costly geometry-aware training. A small Student Affinity Network purifies 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model, and a Geometry-Guided Pooling module further denoises the point cloud at inference to enforce semantic and structural consistency. Experiments on major 3D benchmarks show that GeoPurify matches or surpasses state-of-the-art performance while using only about 1.5% of the training data.

Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting

Authors: Shu Zou, Xinyu Tian, Lukas Wesemann, Fabian Waschkowski, Zhaoyuan Yang, Jing Zhang

First: 2025-10-02T16:06:31+00:00 · Latest: 2025-10-02T16:06:31+00:00

Comments: 14 pages, video anomaly detection

Abstract

Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g. violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces towards anomaly and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.

Summary

The paper introduces ASK-Hint, a structured prompting framework that adapts frozen vision-language models to video anomaly detection. It organizes prompts into semantically coherent groups and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. On UCF-Crime and XD-Violence, ASK-Hint improves AUC over prior baselines and achieves state-of-the-art performance compared with both fine-tuned and training-free methods, while providing interpretable reasoning traces and generalizing well across datasets and VLM backbones.

Post-hoc Probabilistic Vision-Language Models

Authors: Anton Baumann, Rui Li, Marcus Klasson, Santeri Mentu, Shyamgopal Karthik, Zeynep Akata, Arno Solin, Martin Trapp

First: 2024-12-08T18:16:13+00:00 · Latest: 2025-10-02T15:13:15+00:00

Comments: Project page: https://aaltoml.github.io/BayesVLM/

Abstract

Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.
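
To see what uncertainty over a cosine similarity means in practice, the sketch below propagates Gaussian uncertainty on two embeddings through the cosine by sampling. Note that this swaps in a Monte Carlo estimate for illustration; the paper states that it quantifies these uncertainties analytically. The isotropic Gaussian noise model, the shared `std` parameter, and the function names are assumptions.

```python
import math
import random
import statistics
from typing import List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mc_cosine(mu_img: List[float], mu_txt: List[float], std: float,
              n: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Monte Carlo mean and std of the cosine under isotropic Gaussian embeddings.

    Assumption: each embedding coordinate is independently N(mean, std**2).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        img = [rng.gauss(m, std) for m in mu_img]
        txt = [rng.gauss(m, std) for m in mu_txt]
        samples.append(cosine(img, txt))
    return statistics.fmean(samples), statistics.pstdev(samples)


mean, spread = mc_cosine([1.0, 0.0, 0.0], [0.8, 0.6, 0.0], std=0.1)
print(round(mean, 3), round(spread, 3))
```

A deterministic model reports only the single cosine of the two mean embeddings; the spread returned here is the extra quantity a probabilistic treatment makes available, for example to rank candidates for active learning.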

Summary

This work addresses the inability of deterministic vision-language model (VLM) embeddings to capture uncertainty by proposing a post-hoc estimation method that needs no additional training. It places a Bayesian posterior approximation over the last layers of the VLM and analytically quantifies uncertainty over cosine similarities. Compared to baselines, the method yields improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning, which is promising for safety-critical applications of large-scale models.

DiCache: Let Diffusion Model Determine Its Own Cache

Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Dahua Lin, Jiaqi Wang

First: 2025-08-24T13:30:00+00:00 · Latest: 2025-10-02T14:42:41+00:00

Comments: Project Page: https://bujiazi.github.io/dicache.github.io/, Code: https://github.com/Bujiazi/DiCache

Abstract

Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: "When to cache" and "How to use cache", typically relying on predefined empirical laws or dataset-level priors to determine caching timings and adopting handcrafted rules for multi-step cache utilization. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail to cope with diverse samples. In this paper, a strong sample-specific correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of deep-layer features. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present DiCache, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. Specifically, DiCache is composed of two principal components: (1) Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain an on-the-fly indicator for the caching error in real time, enabling the model to dynamically customize the caching schedule for each sample. (2) Dynamic Cache Trajectory Alignment adaptively approximates the deep-layer feature output from multi-step historical caches based on the shallow-layer feature trajectory, facilitating higher visual quality. Extensive experiments validate DiCache's capability in achieving higher efficiency and improved fidelity over state-of-the-art approaches on various leading diffusion models including WAN 2.1, HunyuanVideo and Flux.
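
The "when to cache" decision driven by a shallow-layer probe can be sketched as follows. This is a minimal illustration and not DiCache's actual algorithm: the relative L1 change as an error indicator, the accumulate-and-reset rule, the threshold value, and all function names are assumptions.

```python
from typing import List, Sequence


def relative_change(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Relative L1 change between probe features of two consecutive steps."""
    num = sum(abs(c - p) for p, c in zip(prev, curr))
    den = sum(abs(p) for p in prev) or 1.0  # avoid division by zero
    return num / den


def caching_schedule(probe_features: List[Sequence[float]], threshold: float) -> List[bool]:
    """Return one flag per step: True = run the full model, False = reuse the cache.

    The accumulated probe change stands in for the (unknown) caching error.
    """
    schedule = [True]  # the first step always needs a full forward pass
    accumulated = 0.0
    for prev, curr in zip(probe_features, probe_features[1:]):
        accumulated += relative_change(prev, curr)
        if accumulated >= threshold:
            schedule.append(True)
            accumulated = 0.0  # cache refreshed, error estimate resets
        else:
            schedule.append(False)
    return schedule


probes = [[1.0, 1.0], [1.0, 1.0], [2.0, 2.0], [2.0, 2.0]]
print(caching_schedule(probes, threshold=0.5))  # [True, False, True, False]
```

Because the schedule is computed from each sample's own probe features, two samples with different dynamics get different recompute steps, which is the sample-specific behaviour the abstract contrasts with predefined rules.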

Summary

The paper introduces DiCache, a training-free adaptive caching strategy that accelerates diffusion models at runtime. It exploits a sample-specific correlation between the variation patterns of shallow-layer feature differences and those of deep-layer features: a shallow-layer online probe decides when to cache for each sample, and Dynamic Cache Trajectory Alignment decides how to combine multi-step historical caches. Experiments show higher efficiency and improved fidelity over state-of-the-art approaches on WAN 2.1, HunyuanVideo, and Flux.

Robust Prompt Tuning for Vision-Language Models with Mild Semantic Noise

Authors: Yansheng Gao, Yufei Zheng, Shengsheng Wang

First: 2025-08-06T17:42:30+00:00 · Latest: 2025-10-02T14:03:47+00:00

Abstract

Prompt tuning has shown promising results, but its robustness and generalization to unseen categories remain limited. Through our experiments, we demonstrate that the complete removal of semantic noise is a key factor restricting robustness. Existing methods typically suppress or filter out semantic noise in the prompt space, inadvertently hindering the model's robustness and its ability to generalize to unseen categories. To address this, we propose ANPrompt, a robust prompt tuning framework that actively incorporates weak semantic noise. By clustering weakly perturbed features into noise prompts and integrating them with learnable tokens in both the text and vision encoders, ANPrompt ensures controlled exposure to semantic variations. To enhance the visual pathway, we introduce the Noise-Resistant Visual Prompt Prototype (NRVPP), which stabilizes visual semantics under weak perturbations. Additionally, we propose a Weak Alignment Loss (WALoss) at the logits level to enforce consistency between clean and perturbed predictions, providing stable supervision. By combining weak semantic noise exposure with logits-based consistency, ANPrompt prevents overfitting to specific phrasings while preserving semantic integrity. Extensive experiments across 11 benchmarks, including base-to-new splits, show that ANPrompt consistently outperforms existing prompt tuning methods, offering superior robustness to semantic noise and improved generalization across tasks.
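
A logits-level consistency term between clean and perturbed predictions can be sketched as below. The abstract does not give the exact form of WALoss, so the KL divergence used here is an assumption chosen as one common way to compare two predictive distributions; the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(clean_logits: List[float], perturbed_logits: List[float]) -> float:
    """KL(p_clean || p_perturbed) between the two softmax distributions.

    Assumption: KL is used only for illustration; it is zero when the
    predictions agree and grows as they diverge.
    """
    p = softmax(clean_logits)
    q = softmax(perturbed_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


print(consistency_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(consistency_loss([1.0, 2.0, 3.0], [1.2, 1.9, 2.8]))
```

Minimizing a term of this kind pushes predictions under weak perturbation toward the clean predictions without requiring the perturbation to be removed, which matches the abstract's emphasis on exposure to noise rather than filtering.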

中文标题/摘要

标题：具有轻微语义噪声的视觉-语言模型鲁棒提示调优

提示调优已经显示出有希望的结果，但其鲁棒性和对未见过类别的泛化能力仍然有限。通过我们的实验，我们证明了完全消除语义噪声是限制鲁棒性的关键因素。现有方法通常在提示空间中抑制或过滤语义噪声，无意中阻碍了模型的鲁棒性和其对未见过类别的泛化能力。为了解决这个问题，我们提出了ANPrompt，这是一种鲁棒的提示调优框架，主动整合弱语义噪声。通过将弱扰动特征聚类成噪声提示，并在文本和视觉编码器中与可学习的标记集成，ANPrompt确保了对语义变化的可控暴露。为了增强视觉路径，我们引入了抗噪声视觉提示原型（NRVPP），它在弱扰动下稳定视觉语义。此外，我们提出了在logits层的弱对齐损失（WALoss），以在干净和扰动预测之间强制一致性，提供稳定的监督。通过结合弱语义噪声暴露和logits一致性，ANPrompt防止了对特定短语的过拟合，同时保持了语义完整性。在包括基础到新类别的11个基准测试中，ANPrompt始终优于现有提示调优方法，提供了对语义噪声的更强鲁棒性和跨任务的更好泛化能力。

Summary / 总结

The research aims to enhance the robustness and generalization of vision-language models by addressing the limitations of prompt tuning methods, which often suppress semantic noise, thereby hindering model robustness. ANPrompt, a novel robust prompt tuning framework, is proposed to incorporate weak semantic noise. It achieves this by clustering weakly perturbed features into noise prompts and integrating them with learnable tokens in both text and vision encoders. Experimental results across 11 benchmarks demonstrate that ANPrompt outperforms existing methods, showing superior robustness to semantic noise and better generalization capabilities.

研究旨在通过解决提示调优方法的局限性，提高视觉-语言模型的鲁棒性和泛化能力。提出了一种新颖的鲁棒提示调优框架ANPrompt，该框架通过引入弱语义噪声来增强模型的鲁棒性和泛化能力。在11个基准测试中的实验表明，ANPrompt在鲁棒性对语义噪声和任务泛化能力方面优于现有方法。

PlaceFM: A Training-free Geospatial Foundation Model of Places using Large-Scale Point of Interest Data

Authors: Mohammad Hashemi, Hossein Amiri, Andreas Zufle

First: 2025-06-25T15:10:31+00:00 · Latest: 2025-10-02T13:01:06+00:00

Abstract

With the rapid growth and continual updates of geospatial data from diverse sources, geospatial foundation model pre-training for urban representation learning has emerged as a key research direction for advancing data-driven urban planning. Spatial structure is fundamental to effective geospatial intelligence systems; however, existing foundation models often lack the flexibility to reason about places, context-rich regions spanning multiple spatial granularities that may consist of many spatially and semantically related points of interest. To address this gap, we propose PlaceFM, a geospatial foundation model that captures place representations through a training-free, clustering-based approach. PlaceFM summarizes the entire point of interest graph constructed from U.S. Foursquare data, producing general-purpose region embeddings while automatically identifying places of interest. These embeddings can be directly integrated into geolocation data pipelines to support a variety of urban downstream tasks. Without the need for costly pre-training, PlaceFM provides a scalable and efficient solution for multi-granular geospatial analysis. Extensive experiments on two real-world prediction tasks, ZIP code-level population density and housing prices, demonstrate that PlaceFM not only outperforms most state-of-the-art graph-based geospatial foundation models but also achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation is available at https://github.com/mohammadhashemii/PlaceFM.

中文标题/摘要

标题：PlaceFM：基于大规模兴趣点数据的无训练地理空间基础模型

随着地理空间数据从多种来源的快速增长和持续更新，基于城市表示学习的地理空间基础模型预训练已成为推动数据驱动城市规划的关键研究方向。空间结构是有效地理空间智能系统的基础；然而，现有的基础模型往往缺乏灵活地推理地方的能力，即跨越多个空间粒度的丰富上下文区域，这些区域可能包含许多空间上和语义上相关的兴趣点。为了解决这一差距，我们提出了一种无训练的地理空间基础模型PlaceFM，通过基于聚类的方法捕捉地方表示。PlaceFM 通过美国Foursquare数据构建的兴趣点图总结，生成通用区域嵌入，同时自动识别兴趣点。这些嵌入可以直接集成到地理定位数据管道中，以支持各种城市下游任务。无需昂贵的预训练，PlaceFM 提供了一种可扩展且高效的多粒度地理空间分析解决方案。在两个实际预测任务（邮政编码级别的人口密度和住房价格）上的广泛实验表明，PlaceFM 不仅优于大多数最先进的基于图的地理空间基础模型，还在大规模POI图上生成区域级表示时实现了高达100倍的速度提升。实现代码可在https://github.com/mohammadhashemii/PlaceFM 获取。

Summary / 总结

PlaceFM is a geospatial foundation model that captures place representations using a training-free, clustering-based approach on large-scale point of interest data. It automatically identifies places of interest and generates general-purpose region embeddings, which can be integrated into geolocation data pipelines. Experiments show that PlaceFM outperforms most state-of-the-art graph-based geospatial foundation models and achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs.

PlaceFM 是一种无需训练的地理空间基础模型，通过聚类方法在大规模点兴趣数据上捕捉地方表示。它能够自动识别地方兴趣点，并生成通用的区域嵌入，这些嵌入可以直接集成到地理定位数据管道中以支持各种城市任务。实验表明，PlaceFM 在生成大规模 POI 图上的区域级表示方面比最先进的图基地理空间基础模型性能更优，且速度提升高达 100 倍。

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang

First: 2025-09-30T06:37:47+00:00 · Latest: 2025-10-02T12:24:56+00:00

Abstract

Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/

中文标题/摘要

标题：更多思考，更少准确度？关于视觉-语言模型中推理的双重性质

推理已成为大型语言模型（LLMs）的关键能力。通过强化学习（RL），通常使用组相对策略优化（GRPO），这些模型能够解决复杂的任务，如数学和代码生成。在此基础上，最近的研究试图将推理扩展到视觉-语言模型（VLMs），并在多种视觉任务中取得了令人鼓舞的结果。尽管取得了这些进展，我们的研究揭示了多模态推理的双重性质：虽然它显著增强了逻辑推理并促进了对复杂问题的解决，但它可能会逐渐损害知觉定位，导致在原本基本的视觉问题上出现识别失败。通过进一步分析，我们将其归因于视觉遗忘，即长时间推理导致模型越来越多地忽视视觉输入。为了解决这一问题，我们提出了视觉锚定策略优化（VAPO），这是一种简单而有效的方法，明确引导推理过程向视觉定位轨迹发展。我们的结果模型VAPO-Thinker-7B显著增强了模型对视觉信息的依赖，并在一系列标准基准上取得了新的最佳结果。项目页面：https://xytian1008.github.io/VAPO/

Summary / 总结

This study explores the dual nature of reasoning in Vision-Language Models (VLMs), showing that while reasoning enhances logical inference and performance on complex tasks, it can also impair perceptual grounding, leading to recognition failures. The authors propose Vision-Anchored Policy Optimization (VAPO) to address this issue, resulting in a model, VAPO-Thinker-7B, that significantly improves reliance on visual information and achieves new state-of-the-art results on various benchmarks.

研究探讨了视觉语言模型（VLMs）中推理的双重性质，发现虽然推理可以增强逻辑推理和复杂任务的表现，但也可能损害感知定位，导致基本视觉问题的识别失败。作者提出了一种名为视觉锚定策略优化（VAPO）的方法来解决这一问题，最终的模型VAPO-Thinker-7B显著增强了对视觉信息的依赖，并在多个基准测试中取得了新的最佳结果。

SoftCFG: Uncertainty-guided Stable Guidance for Visual Autoregressive Model

Authors: Dongli Xu, Aleksei Tiulpin, Matthew B. Blaschko

First: 2025-10-01T15:04:00+00:00 · Latest: 2025-10-02T09:32:57+00:00

Comments: preprint

Abstract

Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditional-unconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256×256 among autoregressive models.

中文标题/摘要

标题：SoftCFG：基于不确定性指导的视觉自回归模型稳定引导

自回归（AR）模型已成为通过将图像建模为离散标记序列来生成图像的强大工具。尽管无分类器自由引导（CFG）已被采用以提高条件生成，但在AR模型中的应用面临两个关键问题：引导减弱，其中条件与无条件之间的差距随着解码进程迅速消失，以及过度引导，其中强烈的条件会扭曲视觉连贯性。为了解决这些挑战，我们提出了SoftCFG，这是一种基于不确定性指导的推理方法，它在整个序列中分配自适应扰动。SoftCFG的核心思想是让生成的每个标记贡献加权引导，确保信号在步骤之间持续存在，同时解决文本指导与视觉上下文之间的冲突。为了进一步稳定长序列生成，我们引入了步长归一化，以限制SoftCFG的累积扰动。我们的方法是无需训练的、模型无关的，并且可以无缝集成到现有的AR管道中。实验表明，SoftCFG在图像质量上显著优于标准CFG，并且在ImageNet 256*256的自回归模型中达到了最先进的FID。

Summary / 总结

SoftCFG is an uncertainty-guided inference method designed to enhance the stability of autoregressive models in image generation. It addresses the issues of guidance diminishing and over-guidance by distributing adaptive perturbations across all tokens in the sequence, ensuring that the guidance signal persists while resolving conflicts between text and visual context. Step Normalization is introduced to stabilize long-sequence generation by bounding cumulative perturbations. Experiments demonstrate that SoftCFG improves image quality and achieves state-of-the-art FID scores on ImageNet 256×256 among autoregressive models.

SoftCFG 是一种不确定性引导的推理方法，旨在提高自回归模型中图像生成的稳定性和质量。它通过在整个序列中分配自适应扰动来解决指导信号减弱和过度指导的问题，确保指导信号在各步骤中持续存在并解决文本和视觉上下文之间的冲突。引入了步骤归一化以进一步稳定长序列生成。实验表明，SoftCFG 在 ImageNet 256*256 上的 FID 分数优于标准的 Classifier-Free Guidance (CFG)，并达到了自回归模型中的最先进水平。
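For readers unfamiliar with the baseline that SoftCFG modifies, the following is a minimal sketch of standard classifier-free guidance applied to next-token logits. It is not the SoftCFG algorithm; the function names `cfg_logits` and `softmax` and the example numbers are illustrative assumptions. It also shows why a vanishing conditional-unconditional gap makes guidance ineffective: when the two logit vectors coincide, the guided logits equal the unconditional ones for any scale.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cfg_logits(cond_logits, uncond_logits, scale):
    """Standard CFG: move the unconditional logits along the
    conditional-unconditional gap by a factor of `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Illustrative logits for a 3-token vocabulary (made-up values).
cond = [2.0, 0.5, -1.0]
uncond = [1.0, 0.8, -0.5]

guided = cfg_logits(cond, uncond, scale=3.0)
probs = softmax(guided)

# If the gap has vanished (cond == uncond), guidance has no effect.
no_gap = cfg_logits(uncond, uncond, scale=3.0)
```

With `scale=1.0` the guided logits reduce to the conditional logits, and larger scales amplify the gap, which is the regime where the abstract's over-guidance concern arises.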

VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention

Authors: Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, Ser-Nam Lim

First: 2024-12-03T08:33:50+00:00 · Latest: 2025-10-02T08:38:03+00:00

Comments: Code: https://github.com/DuNGEOnmassster/VideoGen-of-Thought.git; Webpage: https://cheliosoops.github.io/VGoT/

Abstract

Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which turns the user prompt into concise shot drafts and then expands them into detailed specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, and HDR lighting) with self-validation to ensure logical progression. (2) Visual inconsistency: Previous approaches struggle to maintain consistent appearance across shots. Our identity-aware cross-shot propagation builds identity-preserving portrait (IPP) tokens that keep character identity while allowing controlled trait changes (expressions, aging) required by the story. (3) Transition artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while requiring 10x fewer manual adjustments. VGoT bridges the gap between raw visual synthesis and director-level storytelling for automated multi-shot video generation.

中文标题/摘要

标题：VideoGen-of-Thought：逐步生成多镜头视频，最少手动干预

当前的视频生成模型在短片段方面表现出色，但在生成连贯的多镜头叙事方面却因视觉动态不连贯和故事情节断裂而失败。现有解决方案要么依赖于大量的手动脚本编写/编辑，要么优先考虑单镜头的保真度而忽视跨场景的连续性，这限制了它们在电影内容方面的实用性。我们引入了VideoGen-of-Thought (VGoT)，这是一种逐步框架，可以从一句话自动合成多镜头视频，通过系统地解决三个核心挑战来实现：(1) 故事片段化：现有方法缺乏结构化的叙事。我们提出了动态故事情节建模，将用户提示转化为简洁的镜头草稿，然后扩展到五个领域（角色动态、背景连续性、关系演变、摄像机运动和高动态范围照明）的详细规范，并通过自我验证确保逻辑进展。(2) 视觉不一致：先前的方法难以在镜头之间保持一致的外观。我们的身份感知跨镜头传播构建保持身份的肖像（IPP）令牌，同时允许故事所需的可控特征变化（表情、老化）。(3) 过渡伪影：镜头突变会破坏沉浸感。我们的相邻潜在过渡机制实施边界感知重置策略，在过渡点处理相邻镜头的特征，从而实现无缝的视觉流动，同时保持叙事连续性。在无需训练的管道中，VGoT 在镜头内面部一致性方面超过强大基线 20.4%，在风格一致性方面超过 17.4%，同时需要 10 倍少的手动调整。VGoT 在原始视觉合成与导演级叙事之间架起桥梁，用于自动化多镜头视频生成。

Summary / 总结

VideoGen-of-Thought (VGoT) addresses the limitations of current video generation models by introducing a step-by-step framework to synthesize multi-shot videos from a single sentence. It tackles narrative fragmentation, visual inconsistency, and transition artifacts through dynamic storyline modeling, identity-aware cross-shot propagation, and adjacent latent transition mechanisms. VGoT outperforms strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, requiring 10x fewer manual adjustments. This framework bridges the gap between visual synthesis and director-level storytelling for automated multi-shot video generation.

VideoGen-of-Thought (VGoT) 提出了一种逐步框架，从一句话合成多镜头视频，解决当前视频生成模型在连贯叙事、视觉一致性以及转场问题上的不足。通过动态故事情节建模、身份感知跨镜头传播和相邻潜在转场机制，VGoT 在单帧内面部一致性上超越强基线 20.4%，在风格一致性上超越 17.4%，且需要 10 倍少的手动调整。该框架填补了从纯视觉合成到导演级叙事的自动化多镜头视频生成的空白。

Nav-EE: Navigation-Guided Early Exiting for Efficient Vision-Language Models in Autonomous Driving

Authors: Haibo Hu, Lianming Huang, Xinyu Wang, Yufei Cui, Nan Guan, Chun Jason Xue

First: 2025-10-02T08:37:58+00:00 · Latest: 2025-10-02T08:37:58+00:00

Abstract

Vision-Language Models (VLMs) are increasingly applied in autonomous driving for unified perception and reasoning, but high inference latency hinders real-time deployment. Early-exit reduces latency by terminating inference at intermediate layers, yet its task-dependent nature limits generalization across diverse scenarios. We observe that this limitation aligns with autonomous driving: navigation systems can anticipate upcoming contexts (e.g., intersections, traffic lights), indicating which tasks will be required. We propose Nav-EE, a navigation-guided early-exit framework that precomputes task-specific exit layers offline and dynamically applies them online based on navigation priors. Experiments on CODA, Waymo, and BOSCH show that Nav-EE achieves accuracy comparable to full inference while reducing latency by up to 63.9%. Real-vehicle integration with Autoware Universe further demonstrates reduced inference latency (600ms to 300ms), supporting faster decision-making in complex scenarios. These results suggest that coupling navigation foresight with early-exit offers a viable path toward efficient deployment of large models in autonomous systems. Code and data are available at our anonymous repository: https://anonymous.4open.science/r/Nav-EE-BBC4

中文标题/摘要

标题：Nav-EE：自主驾驶中高效视觉-语言模型的导航引导早期退出

视觉-语言模型（VLMs）在自主驾驶中被广泛应用，用于统一感知和推理，但高推理延迟阻碍了实时部署。早期退出通过在中间层终止推理来减少延迟，但其任务依赖性限制了在不同场景中的泛化能力。我们观察到这一限制与自主驾驶一致：导航系统可以预测即将出现的上下文（如交叉口、交通灯），从而指示哪些任务将被需要。我们提出了一种导航引导的早期退出框架Nav-EE，该框架离线预计算了任务特定的退出层，并基于导航先验在线动态应用。在CODA、Waymo和BOSCH上的实验表明，Nav-EE在减少延迟最多63.9%的同时，实现了与完整推理相当的准确性。与Autoware Universe的实车集成进一步证明了推理延迟的减少（从600ms降至300ms），支持在复杂场景中更快的决策。这些结果表明，将导航先见与早期退出相结合，为在自主系统中高效部署大型模型提供了一条可行路径。代码和数据可在我们的匿名存储库中获得：https://anonymous.4open.science/r/Nav-EE-BBC4
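The offline/online split described in the Nav-EE abstract can be illustrated with a toy sketch: exit layers are looked up from a precomputed table keyed by the navigation context, and inference stops at that layer. Everything here is a hypothetical illustration, including the table contents, the context names, the fallback to full depth, and the toy layers. It is not the authors' implementation.

```python
# Hypothetical table produced offline: navigation context -> exit layer.
EXIT_TABLE = {"intersection": 3, "traffic_light": 2}
FULL_DEPTH = 4  # number of layers in the toy model


def select_exit_layer(nav_context):
    """Pick the precomputed exit layer; unknown contexts fall back to
    full inference (an assumption of this sketch)."""
    return EXIT_TABLE.get(nav_context, FULL_DEPTH)


def run_with_early_exit(layers, x, exit_layer):
    """Apply layers in order and stop once `exit_layer` layers have run."""
    for layer in layers[:exit_layer]:
        x = layer(x)
    return x


# Toy layers that each add 1, so the output counts the layers executed.
layers = [lambda v: v + 1 for _ in range(FULL_DEPTH)]

layers_run_at_light = run_with_early_exit(layers, 0, select_exit_layer("traffic_light"))
layers_run_unknown = run_with_early_exit(layers, 0, select_exit_layer("highway"))
```

In this toy setup the latency saving is simply the fraction of layers skipped; the real trade-off between accuracy and latency depends on how the exit layers are chosen offline.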

Accelerating Attention with Basis Decomposition

Authors: Jialin Zhao

First: 2025-10-02T06:58:10+00:00 · Latest: 2025-10-02T06:58:10+00:00

Abstract

Attention is a core operation in large language models (LLMs) and vision-language models (VLMs). We present BD Attention (BDA), the first lossless algorithmic reformulation of attention. BDA is enabled by a simple matrix identity from Basis Decomposition (BD), which restructures multi-head projections into a compact form while preserving exact outputs. Unlike I/O-aware system optimizations such as FlashAttention, BDA provides a mathematically guaranteed acceleration that is architecture-agnostic. On DeepSeek-V2-Lite (16B, FP16), BDA requires only 4s of offline preparation with no retraining required and, on modern GPUs, achieves 32% faster key/value projections and 25% smaller weights, while increasing end-to-end perplexity (PPL) by just 0.02% (FP16) or 0.0004% (FP32), a negligible effect on model performance. These results position BDA as the first theoretically exact method for lossless attention acceleration that is complementary to existing engineering-level optimizations. Our code is available at https://github.com/abcbdf/basis-decomposition-official.

中文标题/摘要

标题：基于分解加速注意力

注意力是大型语言模型（LLMs）和视觉-语言模型（VLMs）中的核心操作。我们提出了BD注意力（BDA），这是第一个无损的算法重构注意力的方法。BDA得益于基分解（BD）中的一个简单矩阵恒等式，将多头投影重新结构化为紧凑形式，同时保持精确输出。与FlashAttention等输入输出感知系统优化不同，BDA提供了一种数学上保证的加速，且不受架构限制。在DeepSeek-V2-Lite（16B，FP16）上，BDA只需要4s的离线准备时间，无需重新训练，并在现代GPU上实现了32%更快的关键/值投影和25%更小的权重，同时增加端到端困惑度（PPL）仅0.02%（FP16）或0.0004%（FP32），对模型性能的影响可以忽略不计。这些结果将BDA定位为第一个理论上精确的无损注意力加速方法，与现有的工程级优化相辅相成。我们的代码可在https://github.com/abcbdf/basis-decomposition-official 获取。

Summary / 总结

The paper presents BD Attention (BDA), a lossless algorithmic reformulation of attention that uses Basis Decomposition (BD) to restructure multi-head projections into a compact form while preserving exact outputs. On DeepSeek-V2-Lite (16B, FP16), BDA requires 4s of offline preparation, achieves 32% faster key/value projections and 25% smaller weights, with only a negligible increase in end-to-end perplexity (0.02% in FP16 or 0.0004% in FP32).

BD Attention (BDA) 是一种使用基分解重新构建立多头投影的无损算法重构方法，可以在保持精确输出的同时将多头投影重构为紧凑形式。它只需要 4 秒的离线准备时间且无需重新训练，即可在 DeepSeek-V2-Lite (16B, FP16) 上实现 32% 更快的键/值投影和 25% 更小的权重，同时对模型性能的影响微乎其微，表现为端到端困惑度（PPL）的微小增加（FP16 中为 0.02%，FP32 中为 0.0004%）。

Contrastive Representation Regularization for Vision-Language-Action Models

Authors: Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin

First: 2025-10-02T06:41:22+00:00 · Latest: 2025-10-02T06:41:22+00:00

Comments: 20 pages, 12 figures

Abstract

Vision-Language-Action (VLA) models have shown their capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive states. To address the issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL effectively enhances control-relevant representation learning, while being lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the manipulation performance of state-of-the-art VLA models; it pushes the prior art from 30.8% to 41.5% on pick-and-place tasks in RoboCasa-Kitchen, through more accurate positioning during grasping and placing, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.

中文标题/摘要

标题：视觉-语言-动作模型的对比表示正则化

视觉-语言-动作（VLA）模型通过利用预训练视觉-语言模型（VLM）的丰富表示，在机器人操作方面展示了其能力。然而，它们的表示可能仍然不够优化，缺乏对控制动作和本体感受状态等机器人信号的敏感性。为了解决这一问题，我们引入了机器人状态感知对比损失（RS-CL），这是一种简单而有效的VLA模型表示正则化方法，旨在弥合VLM表示与机器人信号之间的差距。特别是，RS-CL通过使用状态之间的相对距离作为软监督，使表示更紧密地与机器人的本体感受状态对齐。RS-CL补充了原始的动作预测目标，有效地增强了控制相关的表示学习，同时保持轻量级且完全兼容标准的VLA训练流程。我们的实验证明，RS-CL显著提高了最先进的VLA模型的操纵性能；在RoboCasa-Kitchen的拾取和放置任务中，它将先前的最佳性能从30.8%提升到41.5%，通过更准确的抓取和放置定位，以及在具有挑战性的真实机器人操纵任务中将成功率从45.0%提升到58.3%。

Summary / 总结

The paper introduces Robot State-aware Contrastive Loss (RS-CL) to enhance the representation learning of Vision-Language-Action (VLA) models by aligning them with robotic signals such as proprioceptive states. This method improves the models' sensitivity to control actions and proprioceptive states, complementing the original action prediction objective. Empirical results show that RS-CL significantly boosts manipulation performance, achieving a success rate of 58.3% on real-robot tasks compared to 45.0% without RS-CL, and improving the pick-and-place task performance from 30.8% to 41.5% in RoboCasa-Kitchen.

论文提出了机器人状态感知对比损失（RS-CL），通过将视觉-语言-动作（VLA）模型的表示与机器人信号如本体感受状态对齐来改进其表示学习。该方法在不改变标准VLA训练流程的情况下增强了控制相关的表示学习。实验证明，RS-CL显著提高了操作性能，实机器人任务的成功率从45.0%提升到58.3%，在RoboCasa-Kitchen的拾取和放置任务中，成功率从30.8%提升到41.5%。
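One way to read "relative distances between the states as soft supervision" is a contrastive loss whose targets are a softmax over negative state distances instead of hard positive/negative labels. The sketch below follows that reading only; the exact RS-CL formulation is not given in the abstract, so the function `soft_contrastive_loss`, the temperatures, the dot-product similarity, and the toy data are all assumptions.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def soft_contrastive_loss(reps, states, tau_rep=0.1, tau_state=1.0):
    """Cross-entropy between state-distance soft targets and
    representation-similarity predictions, averaged over anchors."""
    n = len(reps)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Soft targets: samples with closer robot states get more weight.
        log_t = log_softmax([-math.dist(states[i], states[j]) / tau_state for j in others])
        targets = [math.exp(v) for v in log_t]
        # Predictions: dot-product similarity between representations.
        sims = [sum(a * b for a, b in zip(reps[i], reps[j])) / tau_rep for j in others]
        log_p = log_softmax(sims)
        total += -sum(t * lp for t, lp in zip(targets, log_p))
    return total / n


# Toy 1-D proprioceptive states: samples 0 and 1 are close, sample 2 is far.
states = [[0.0], [0.1], [5.0]]
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]     # similar reps for close states
misaligned = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # similar reps for distant states

loss_aligned = soft_contrastive_loss(aligned, states)
loss_misaligned = soft_contrastive_loss(misaligned, states)
```

Under these assumptions the loss is lower when representation similarity mirrors state proximity, which is the behavior the abstract attributes to the regularizer. In training, such a term would be added to the action prediction objective with some weight.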

Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion

Authors: Yu Sun, Yin Li, Ruixiao Sun, Chunhui Liu, Fangming Zhou, Ze Jin, Linjie Wang, Xiang Shen, Zhuolin Hao, Hongyu Xiong

First: 2025-03-21T21:55:05+00:00 · Latest: 2025-10-02T06:26:27+00:00

Abstract

Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective in distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially in short-video platforms, yet most pre-trained multimodal architectures primarily focus on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained visual-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models. This system has been deployed in production, leading to significant business gains.

中文标题/摘要

标题：增强音频的视觉语言建模与潜在空间扩展以实现高质量数据扩展

基于变换器的多模态模型在工业规模的内容理解和相关性排名中广泛应用于推荐、搜索和广告系统。提高标记训练数据质量和跨模态融合显著提升了模型性能，影响了诸如高质量观看率和广告收入等关键指标。高质量的注解对于推进内容建模至关重要，但传统的基于统计的主动学习（AL）方法存在局限性：它们难以检测出过度自信的误分类，并且在区分深度神经网络中的语义相似项方面效果较差。此外，音频信息在短视频平台中发挥着越来越重要的作用，但大多数预训练的多模态架构主要集中在文本和图像上。虽然可以从所有三种模态重新训练是可能的，但这牺牲了利用现有预训练视觉语言（VL）和音频模型的好处。为了解决这些挑战，我们提出了基于kNN的潜在空间扩展（LSB）以提高AL效率，并提出了结合音频的视觉语言建模（VLMAE），这是一种中间融合方法，将音频整合到VL模型中。该系统部署在生产系统中，带来了显著的商业收益。

Summary / 总结

The research aims to improve the quality of labeled training data and cross-modal fusion in transformer-based multimodal models, which are widely used in industrial recommendation, search, and advertising systems. It proposes kNN-based Latent Space Broadening (LSB) to enhance active learning efficiency, and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach that integrates audio into vision-language models. The system has been deployed in production, where the authors report significant business gains.

本文旨在解决增强标注训练数据质量及提高多模态模型跨模态融合的问题，提出了基于kNN的Latent Space Broadening (LSB)方法以提高主动学习效率，并提出Vision-Language Modeling with Audio Enhancement (VLMAE)，将音频信息整合到现有的视觉语言模型中。该系统已在生产环境中部署，取得了显著的商业收益。

VaPR -- Vision-language Preference alignment for Reasoning

Authors: Rohan Wadhawan, Fabrice Y Harel-Canada, Zi-Yi Dou, Suhaila Shakiah, Robinson Piramuthu, Nanyun Peng

Venue: COLM 2025

First: 2025-10-02T06:10:43+00:00 · Latest: 2025-10-02T06:10:43+00:00

Abstract

Preference finetuning methods like Direct Preference Optimization (DPO) with AI-generated feedback have shown promise in aligning Large Vision-Language Models (LVLMs) with human preferences. However, existing techniques overlook the prevalence of noise in synthetic preference annotations in the form of stylistic and length biases. To this end, we introduce a hard-negative response generation framework based on LLM-guided response editing, which produces rejected responses with targeted errors while maintaining stylistic and length similarity to the accepted ones. Using this framework, we develop the VaPR dataset, comprising 30K high-quality samples, to finetune three LVLM families: LLaVA-V1.5, Qwen2VL, and Qwen2.5VL (2B-13B sizes). Our VaPR models deliver significant performance improvements across ten benchmarks, achieving average gains of 6.5% (LLaVA), 4.0% (Qwen2VL), and 1.5% (Qwen2.5VL), with notable improvements on reasoning tasks. A scaling analysis shows that performance consistently improves with data size, with LLaVA models benefiting even at smaller scales. Moreover, VaPR reduces the tendency to answer "Yes" in binary questions, addressing a common failure mode in LVLMs like LLaVA. Lastly, we show that the framework generalizes to open-source LLMs as editors, with models trained on VaPR-OS achieving ~99% of the performance of models trained on VaPR, which is synthesized using GPT-4o. Our data, models, and code can be found on the project page https://vap-r.github.io

中文标题/摘要

标题：VaPR -- 视觉-语言偏好对齐以推理

像直接偏好优化（DPO）这样的偏好微调方法，通过AI生成的反馈，已经在对齐大型视觉-语言模型（LVLMs）与人类偏好方面显示出潜力。然而，现有技术忽视了合成偏好注解中噪声的普遍存在，这些噪声以风格和长度偏差的形式出现。为此，我们引入了一种基于LLM引导的响应编辑的硬负响应生成框架，该框架生成具有目标错误的被拒绝响应，同时保持与接受响应的风格和长度相似性。利用这一框架，我们开发了包含30000个高质量样本的VaPR数据集，用于微调三个LVLM家族：LLaVA-V1.5、Qwen2VL & Qwen2.5VL（2B-13B规模）。我们的VaPR模型在十个基准测试中实现了显著的性能提升，平均增益分别为6.5%（LLaVA）、4.0%（Qwen2VL）和1.5%（Qwen2.5VL），特别是在推理任务上取得了显著改进。性能分析显示，随着数据量的增加，性能持续提升，LLaVA模型即使在较小规模下也能受益。此外，VaPR减少了在二元问题中回答“是”的倾向，解决了LVLMs如LLaVA的常见失败模式。最后，我们展示了该框架可以应用于开源LLM作为编辑器，使用VaPR-OS训练的模型在性能上达到了使用VaPR（由GPT-4o合成）训练的模型的约99%。我们的数据、模型和代码可以在项目页面https://vap-r.github.io找到

Summary / 总结

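VaPR targets stylistic and length biases in the synthetic preference annotations used to align Large Vision-Language Models (LVLMs) through methods such as Direct Preference Optimization (DPO). An LLM-guided response-editing framework generates hard-negative rejected responses that contain targeted errors while matching the accepted responses in style and length. The resulting 30K-sample VaPR dataset is used to finetune LLaVA-V1.5, Qwen2VL, and Qwen2.5VL models, yielding average gains of 6.5%, 4.0%, and 1.5% across ten benchmarks and reducing the tendency to answer "Yes" on binary questions.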
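As background for the preference-finetuning setup, here is a minimal sketch of the standard per-pair DPO loss, computed from sequence log-probabilities under the policy and a frozen reference model. This is the generic DPO objective, not VaPR-specific code; the function name `dpo_loss`, the `beta` value, and the example log-probabilities are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)), where the margin is
    the policy's log-ratio gain on the chosen response minus its gain on
    the rejected response, both measured relative to the reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up sequence log-probabilities for one (chosen, rejected) pair.
at_init = dpo_loss(-12.0, -12.5, -12.0, -12.5)    # policy equals reference
after_step = dpo_loss(-11.0, -14.0, -12.0, -12.5)  # policy favors the chosen response
```

Because the loss depends only on log-probability differences, a rejected response that differs from the chosen one mainly in length or style can be separated without learning the intended error, which is the bias the abstract's hard-negative editing is meant to remove.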

Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

Authors: Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen

First: 2025-09-29T08:49:21+00:00 · Latest: 2025-10-02T06:05:35+00:00

Abstract

Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it still remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. To enable the model to acquire and apply Euclidean principles from these geometry problems, we employed Group Relative Policy Optimization (GRPO) to finetune the Qwen2.5VL family and RoboBrain2.0 family, inspiring the models to identify shapes, count, and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on Euclid30K, the mean VSI-Bench accuracy of all evaluated models rose from 34.5% to 40.5%, improving by 5.5 percentage points. Among them, RoboBrain2.0-Euclid-7B achieves 49.6% accuracy, surpassing the previous state-of-the-art model, Spatial-MLLM. To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can endow vision-language models with broadly transferable spatial skills. Code and the Euclid30K dataset can be found at https://zgca-ai4edu.github.io/Euclids_Gift.

中文标题/摘要

标题：欧几里得的馈赠：通过几何代理任务增强视觉-语言模型的空间感知与推理能力

空间智能涵盖了丰富的能力，包括可视化和变换形状、心理旋转物体、判断相对位置和包含关系，以及估算数量。然而，这仍然是多模态大型语言模型（MLLMs）的一个关键未解决的挑战。为了填补这一空白，我们建议将欧几里得几何问题解决作为代理任务。具体来说，我们精心构建了一个多模态数据集，称为Euclid30K，包含约30000个平面几何和立体几何问题。为了使模型能够从这些几何问题中学习和应用欧几里得原理，我们使用了组相对策略优化（GRPO）对Qwen2.5VL家族和RoboBrain2.0家族进行微调，激励模型识别形状、计数和关联实体，并使用欧几里得原理进行多步演绎推理。我们的实验表明，经过微调的模型在四个空间推理基准测试（Super-CLEVR、Omni3DBench、VSI-Bench和MindCube）上实现了显著的零样本增益，无需任何特定任务的适应。值得注意的是，经过Euclid30K训练后，所有评估模型的平均VSI-Bench准确率从34.5%提高到40.5%，提高了5.5个百分点。其中，RoboBrain2.0-Euclid-7B的准确率达到49.6%，超越了之前的最佳模型Spatial-MLLM。据我们所知，这是首次系统研究表明，以几何为中心的微调可以赋予视觉-语言模型广泛转移的空间技能。代码和Euclid30K数据集可在https://zgca-ai4edu.github.io/Euclids_Gift/找到。

Summary / 总结

The research aims to enhance the spatial reasoning capabilities of Vision-Language Models (VLMs) by treating Euclidean geometry problem-solving as a surrogate task. The authors created a dataset, Euclid30K, containing 30,000 geometry problems and used Group Relative Policy Optimization (GRPO) to fine-tune models like Qwen2.5VL and RoboBrain2.0. The experiments show that these models significantly improved their performance on four spatial reasoning benchmarks, with the mean VSI-Bench accuracy increasing from 34.5% to 40.5%, and RoboBrain2.0-Euclid-7B achieving 49.6%, surpassing the previous state-of-the-art model, Spatial-MLLM.

研究旨在通过将欧几里得几何问题解决作为代理任务来提升视觉语言模型（VLMs）的空间推理能力。作者创建了一个包含30,000个几何问题的数据集Euclid30K，并使用组相对策略优化（GRPO）对Qwen2.5VL和RoboBrain2.0等模型进行了微调。实验表明，这些模型在四个空间推理基准测试上的表现显著提升，VSI-Bench的平均准确率从34.5%提高到40.5%，其中RoboBrain2.0-Euclid-7B达到了49.6%，超过了之前的最佳模型Spatial-MLLM。
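The core of GRPO, the fine-tuning method named in the abstract, is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization step; the function name, the `eps` constant, the use of the population standard deviation, and the example rewards are assumptions, and the clipped policy-gradient and KL terms of the full objective are omitted.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative binary correctness rewards for four sampled answers
# to a single geometry problem (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When every answer receives the same reward, no response is preferred.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Since the baseline is the group mean, no separate value network is needed, and a prompt on which all samples are right or all are wrong contributes no learning signal.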

FreeViS: Training-free Video Stylization with Inconsistent References

Authors: Jiacong Xu, Yiqun Mei, Ke Zhang, Vishal M. Patel

First: 2025-10-02T05:27:06+00:00 · Latest: 2025-10-02T05:27:06+00:00

Comments: Project Page: https://xujiacong.github.io/FreeViS/

Abstract

Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references into a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works, without introducing flickers and stutters. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/

Summary / 总结

FreeViS is a training-free video stylization framework that uses multiple inconsistent references to generate videos with rich style details and strong temporal coherence. It integrates these references into a pretrained image-to-video model to mitigate propagation errors and avoid flickers. FreeViS outperforms recent baselines in terms of stylization fidelity and temporal consistency, and it is preferred by humans. The method offers a practical and cost-effective solution for high-quality, temporally coherent video stylization.

FreeViS 是一个无需训练的视频风格化框架，通过使用多个风格化参考来生成具有丰富风格细节和强时间连贯性的视频。它利用高频补偿和基于流的运动线索来保留风格纹理并约束内容布局和运动，其在风格化保真度和时间连贯性方面优于最近的基线。这种方法提供了一种实用且经济的解决方案，用于生成高质量、时间连贯的视频风格化，无需配对的视频数据或大量训练。

Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning

Authors: Xuchen Li, Xuzhao Li, Jiahui Gao, Renjie Pi, Shiyu Hu, Wentao Zhang

First: 2025-10-02T05:14:52+00:00 · Latest: 2025-10-02T05:14:52+00:00

Comments: Preprint, Under review

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) excel at many multimodal tasks, yet they frequently struggle with tasks requiring precise understanding and handling of fine-grained visual elements. This is mainly due to information loss during image encoding or insufficient attention to critical regions. Recent work has shown promise by incorporating pixel-level visual information into the reasoning process, enabling VLMs to access high-resolution visual details during their thought process. However, this pixel-level information is often overused, leading to inefficiency and distraction from irrelevant visual details. To address these challenges, we propose the first framework for adaptive pixel reasoning that dynamically determines necessary pixel-level operations based on the input query. Specifically, we first apply operation-aware supervised fine-tuning to establish baseline competence in textual reasoning and visual operations, then design a novel rollout-guided reinforcement learning framework relying on feedback of the model's own responses, which enables the VLM to determine when pixel operations should be invoked based on query difficulty. Experiments on extensive multimodal reasoning benchmarks show that our model achieves superior performance while significantly reducing unnecessary visual operations. Impressively, our model achieves 73.4% accuracy on HR-Bench 4K while maintaining a tool usage ratio of only 20.1%, improving accuracy and simultaneously reducing tool usage by 66.5% compared to the previous methods.

中文标题/摘要

标题：少看多思：基于展开引导的自适应像素空间推理

视觉-语言模型（VLMs）在许多跨模态任务中表现出色，但在需要精确理解和处理细粒度视觉元素的任务中经常遇到困难。这主要是由于图像编码过程中的信息丢失或对关键区域的关注不足。最近的工作通过将像素级视觉信息纳入推理过程，使VLMs能够在思考过程中访问高分辨率的视觉细节，显示出前景。然而，这种像素级信息的过度使用导致了效率低下和对无关视觉细节的干扰。为了解决这些挑战，我们提出了第一个自适应像素推理框架，该框架根据输入查询动态确定必要的像素级操作。具体来说，我们首先应用操作感知的监督微调来建立文本推理和视觉操作的基础能力，然后设计一个基于模型自身响应反馈的新型展开引导强化学习框架，使VLM能够根据查询难度决定何时调用像素操作。在广泛的跨模态推理基准测试中，我们的模型在显著减少不必要的视觉操作的同时实现了优越的性能。令人印象深刻的是，我们的模型在HR-Bench 4K上的准确率为73.4%，工具使用比率为20.1%，与先前方法相比，准确率提高了，同时工具使用率降低了66.5%。

Summary / 总结

This paper addresses the challenge of fine-grained visual understanding in Vision-Language Models (VLMs) by proposing a framework for adaptive pixel reasoning. The method involves operation-aware fine-tuning and a rollout-guided reinforcement learning framework to dynamically decide when to use pixel-level information. Experiments show that the model achieves 73.4% accuracy on HR-Bench 4K with only 20.1% tool usage, significantly improving accuracy and reducing unnecessary visual operations compared to previous methods.

该论文通过提出一种自适应像素推理框架，解决了视觉语言模型（VLMs）在精细视觉理解方面的挑战。方法包括操作感知的微调和基于模型自身响应的展开引导强化学习框架，以动态决定何时使用像素级信息。实验表明，该模型在HR-Bench 4K上达到了73.4%的准确率，同时工具使用率为20.1%，显著提高了准确率并减少了不必要的视觉操作，优于先前的方法。

Does Bigger Mean Better? Comparative Analysis of CNNs and Biomedical Vision Language Models in Medical Diagnosis

Authors: Ran Tong, Jiaqi Liu, Su Liu, Jiexi Xu, Lanruo Wang, Tong Wang

First: 2025-10-01T01:46:09+00:00 · Latest: 2025-10-02T04:22:36+00:00

Comments: 6 pages, 3 figures. Under review at the International Conference on Artificial Intelligence, Computer, Data Sciences and Applications

Abs · PDF · Code1 · Code2

Abstract

The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve a superior F1-score of 0.8841, surpassing the supervised CNN's 0.8803. For tuberculosis detection, calibration dramatically improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline's 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to match or even outperform efficient, task-specific supervised models.

中文标题/摘要

标题：更大意味着更好吗？医学诊断中CNN与生物医学视觉语言模型的比较分析

使用自动化方法准确解释胸部X光片是医学影像中的关键任务。本文在两个不同的诊断任务上对监督的轻量级卷积神经网络（CNN）和最先进的零样本医学视觉-语言模型（VLM）BiomedCLIP进行了比较分析：在PneumoniaMNIST基准上进行肺炎检测，在Shenzhen TB数据集上进行结核病检测。我们的实验表明，在这两种情况下，监督的CNN都是极具竞争力的基线模型。虽然VLM的默认零样本性能较低，但我们证明了一个简单的关键补救措施——决策阈值校准，可以显著提升其性能。通过在验证集上优化分类阈值，BiomedCLIP在两个数据集上的性能都有显著提升。对于肺炎检测，校准使零样本VLM的F1分数达到0.8841，超过了监督CNN的0.8803。对于结核病检测，校准将F1分数从0.4812提高到0.7684，使其接近监督基线的0.7834。这项工作强调了一个关键见解：适当的校准对于充分利用零样本VLM的全部诊断能力至关重要，使其能够匹配甚至超越高效的、针对特定任务的监督模型。

Summary / 总结

This paper compares a supervised lightweight Convolutional Neural Network (CNN) with a zero-shot medical Vision-Language Model (VLM), BiomedCLIP, for pneumonia and tuberculosis detection. Experiments show that while BiomedCLIP initially performs lower, optimizing its decision threshold significantly improves its performance. Calibration enables BiomedCLIP to achieve an F1-score of 0.8841 for pneumonia detection and 0.7684 for tuberculosis detection, surpassing or closely matching the supervised CNN's performance.

该研究比较了监督轻量级卷积神经网络（CNN）和零样本医学视觉语言模型（VLM）BiomedCLIP在肺炎和肺结核检测中的表现。实验表明，虽然BiomedCLIP初始性能较低，但通过优化其决策阈值，其性能显著提升。对于肺炎检测，BiomedCLIP在校准后达到了0.8841的F1分数，超过了CNN的0.8803。对于肺结核检测，校准后的BiomedCLIP从0.4812提高到0.7684，接近CNN的0.7834。该研究强调了适当校准对于VLMs的重要性，使其能够匹配甚至超越特定任务的监督模型。
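
The decision threshold calibration described in this entry can be illustrated with a minimal sketch. It assumes calibration means sweeping candidate thresholds over validation-set similarity scores and keeping the one with the highest F1; the function names and the toy scores and labels below are hypothetical and are not taken from the paper or from BiomedCLIP.

```python
def f1_score(labels, preds):
    """F1 for binary labels and predictions (1 = positive class)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(scores, labels):
    """Return the (threshold, F1) pair that maximizes F1 on a validation set.

    Candidate thresholds are the observed scores themselves; a sample is
    predicted positive when its score is >= the threshold.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical zero-shot "positive" scores that all sit below 0.5,
# so the default threshold predicts no positives at all.
val_scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]
val_labels = [0, 0, 1, 0, 1, 1]

threshold, f1 = calibrate_threshold(val_scores, val_labels)
default_preds = [1 if s >= 0.5 else 0 for s in val_scores]
default_f1 = f1_score(val_labels, default_preds)
```

In this toy setup the default threshold of 0.5 misses every positive case, while the calibrated threshold recovers them. This is the kind of gap the abstract attributes to an uncalibrated zero-shot model. The chosen threshold should then be applied unchanged to a held-out test set.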

Source-Free Cross-Domain Continual Learning

Authors: Muhammad Tanzil Furqon, Mahardhika Pratama, Igor Škrjanc, Lin Liu, Habibullah Habibullah, Kutluyil Dogancay

First: 2025-10-02T04:09:25+00:00 · Latest: 2025-10-02T04:09:25+00:00

Abs · PDF · Code1 · Code2

Abstract

Although existing cross-domain continual learning approaches successfully address many streaming tasks with domain shifts, they require a fully labeled source domain, which hinders their feasibility in privacy-constrained environments. This paper goes one step further and tackles source-free cross-domain continual learning, where the use of source-domain samples is completely prohibited. We propose rehearsal-free frequency-aware dynamic prompt collaborations (REFEREE) to cope with the absence of labeled source-domain samples in the realm of cross-domain continual learning. REFEREE is built upon a synergy between a source-pre-trained model and a large-scale vision-language model, thus overcoming the problem of sub-optimal generalization when relying only on a source-pre-trained model. The domain shift between the source domain and the target domain is handled by a frequency-aware prompting technique that encourages low-frequency components while suppressing high-frequency components. This strategy generates frequency-aware augmented samples that are robust against noisy pseudo labels. The noisy pseudo-label problem is further addressed with an uncertainty-aware weighting strategy in which the mean and covariance matrix are weighted by prediction uncertainties, thus mitigating the adverse effects of noisy pseudo labels. In addition, catastrophic forgetting (CF) is overcome by kernel linear discriminant analysis (KLDA), where the backbone network is frozen while classification is performed using linear discriminant analysis guided by the random kernel method. Our rigorous numerical studies confirm the advantage of our approach, which outperforms prior methods that have access to source-domain samples by significant margins.

中文标题/摘要

标题：无源跨域连续学习

尽管现有的跨域连续学习方法成功地解决了许多具有领域转移的流式任务，但它们需要完全标记的源域，这阻碍了它们在隐私受限环境中的可行性。本文更进一步，研究无源跨域连续学习问题，其中完全禁止使用源域样本。我们提出了无回放频率感知动态提示协作（REFEREE），以应对跨域连续学习中有标记源域样本缺失的问题。REFEREE基于源预训练模型和大规模视觉语言模型之间的协同作用，从而克服了仅依赖源预训练模型时的次优泛化问题。源域和目标域之间的领域转移问题通过频率感知提示技术来处理，该技术鼓励低频成分并抑制高频成分，从而生成对噪声伪标签具有鲁棒性的频率感知增强样本。噪声伪标签问题通过不确定性感知加权策略得到进一步缓解，其中均值和协方差矩阵根据预测不确定性加权，从而减轻噪声伪标签的不良影响。此外，灾难性遗忘（CF）问题通过核线性判别分析（KLDA）来克服，其中主干网络被冻结，分类则采用由随机核方法引导的线性判别分析方法。我们严格的数值研究证实了该方法的优势：它以显著优势超越了可以访问源域样本的先前方法。

Summary / 总结

The paper addresses the challenge of source-free cross-domain continual learning, where labeled source-domain samples are not available. It proposes REFEREE, which combines a source-pretrained model with a large-scale vision-language model to handle domain shifts. Key findings include improved generalization, robustness against noisy pseudo labels, and mitigation of catastrophic forgetting through a novel weighting strategy and kernel linear discriminant analysis.

该论文解决了源域样本不可用的跨域连续学习问题。提出了一种名为REFEREE的方法，结合了源预训练模型和大规模视觉-语言模型来处理域偏移。该方法使用频率感知提示生成鲁棒的增强样本，并采用不确定性感知加权策略来缓解伪标签噪声的影响。此外，通过核线性判别分析来防止灾难性遗忘。实验结果表明，REFEREE显著优于可以访问源域样本的现有方法。
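
The uncertainty-aware weighting mentioned in this entry, where the mean and covariance matrix are weighted by prediction uncertainties, can be sketched roughly as follows. The abstract does not state the uncertainty measure, so this sketch assumes a weight of one minus the normalized entropy of the predicted class distribution. The function names and toy data are hypothetical and do not come from the REFEREE code.

```python
import math


def confidence_weight(probs):
    """Map a predicted class distribution to a weight in [0, 1].

    Assumption: weight = 1 - normalized entropy, so confident pseudo
    labels count more. The paper may use a different measure.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))


def weighted_mean_and_cov(features, weights):
    """Weighted mean vector and covariance matrix of feature rows.

    Requires at least one non-zero weight.
    """
    total = sum(weights)
    dim = len(features[0])
    mean = [
        sum(w * x[d] for x, w in zip(features, weights)) / total
        for d in range(dim)
    ]
    cov = [
        [
            sum(w * (x[i] - mean[i]) * (x[j] - mean[j])
                for x, w in zip(features, weights)) / total
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    return mean, cov


# Two confidently pseudo-labeled samples and one maximally uncertain outlier.
features = [[1.0, 0.0], [3.0, 0.0], [100.0, 50.0]]
probs = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]

weights = [confidence_weight(p) for p in probs]
mean, cov = weighted_mean_and_cov(features, weights)
```

Because the outlier's prediction is maximally uncertain, its weight is effectively zero and it barely affects the class statistics. Down-weighting unreliable pseudo labels in this way is the intended effect of such a scheme.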

Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

Authors: Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan

Venue: NeurIPS 2025

First: 2025-02-03T04:51:28+00:00 · Latest: 2025-10-02T04:07:53+00:00

Comments: NeurIPS 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically use Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we show that pre-trained diffusion models are naturally suited for step-level reward modeling in the noisy latent space, as they are explicitly designed to process latent images at various noise levels. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of the diffusion model to predict preferences of latent images at arbitrary timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a step-level preference optimization method conducted directly in the noisy latent space. Experimental results indicate that LPO significantly improves the model's alignment with general, aesthetic, and text-image alignment preferences, while achieving a 2.5-28x training speedup over existing preference optimization methods. Our code and models are available at https://github.com/Kwai-Kolors/LPO.

中文标题/摘要

标题：作为噪声感知潜在奖励模型的扩散模型在步骤级偏好优化中的应用

扩散模型的偏好优化旨在使它们与人类对图像的偏好相一致。先前的方法通常使用视觉语言模型（VLM）作为像素级奖励模型来近似人类偏好。然而，当用于步骤级偏好优化时，这些模型在处理不同时间步的噪声图像时面临挑战，并需要复杂的像素空间转换。在本文中，我们展示了预训练的扩散模型自然适合在噪声潜在空间中进行步骤级奖励建模，因为它们明确设计用于处理各种噪声水平的潜在图像。因此，我们提出了潜在奖励模型（LRM），该模型重新利用扩散模型的组件来预测任意时间步的潜在图像偏好。基于LRM，我们引入了潜在偏好优化（LPO），这是一种直接在噪声潜在空间中进行步骤级偏好优化的方法。实验结果表明，LPO在与一般、美学和图文对齐偏好方面显著提高了模型的对齐程度，同时比现有偏好优化方法实现了2.5至28倍的训练加速。我们的代码和模型可在https://github.com/Kwai-Kolors/LPO/获得。

Summary / 总结

This research aims to improve the alignment of diffusion models with human preferences for images by optimizing preferences at the step level. The authors propose the Latent Reward Model (LRM), which repurposes components of a pre-trained diffusion model to predict preferences of latent images at arbitrary timesteps. Building on LRM, Latent Preference Optimization (LPO) performs step-level preference optimization directly in the noisy latent space, leading to a significant improvement in model alignment with general, aesthetic, and text-image alignment preferences, with a 2.5-28x training speedup compared to existing methods.

本文提出了一种名为Latent Preference Optimization (LPO)的步级偏好优化方法，通过利用预训练的扩散模型在噪声潜空间中预测偏好，避免了复杂的像素级变换。实验结果表明，LPO在一般、美学和图文匹配偏好方面提高了模型的对齐程度，并比现有方法快2.5到28倍的训练速度。

ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models

Authors: Krishna Teja Chitty-Venkata, Murali Emani

First: 2025-10-02T02:02:45+00:00 · Latest: 2025-10-02T02:02:45+00:00

Comments: Preprint

Abs · PDF · Code1 · Code2

Abstract

We develop ImageNet-Think, a multimodal reasoning dataset designed to aid the development of Vision Language Models (VLMs) with explicit reasoning capabilities. Our dataset is built on 250,000 images from the ImageNet21k dataset, providing structured thinking tokens and corresponding answers. Our synthetic dataset is generated by two state-of-the-art VLMs: GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506. Each image is accompanied by two pairs of thinking-answer sequences, creating a resource for training and evaluating multimodal reasoning models. We capture the step-by-step reasoning process of VLMs and the final descriptive answers. Our goal with this dataset is to enable the development of more robust VLMs while contributing to the broader understanding of multimodal reasoning mechanisms. The dataset and evaluation benchmarks will be publicly available to aid research in reasoning/thinking multimodal VLMs.

中文标题/摘要

标题：ImageNet-Think-250K：一种用于视觉语言模型多模态推理的大规模合成数据集

我们开发了ImageNet-Think，这是一个多模态推理数据集，旨在帮助开发具有明确推理能力的视觉语言模型（VLMs）。我们的数据集基于ImageNet21k数据集中的250,000张图像，提供了结构化的思考标记和相应的答案。我们的合成数据集由两个最先进的VLMs生成：GLM-4.1V-9B-Thinking和Kimi-VL-A3B-Thinking-2506。每张图像都配有两对思考-答案序列，为训练和评估多模态推理模型提供了资源。我们捕捉了VLMs的逐步推理过程和最终描述性答案。我们希望通过这个数据集促进更稳健的VLMs的发展，同时为更广泛的多模态推理机制的理解做出贡献。该数据集和评估基准将公开发布，以帮助研究多模态推理/思考的VLMs。

Summary / 总结

The research develops ImageNet-Think, a dataset for training Vision Language Models (VLMs) with explicit reasoning capabilities, using 250,000 images from ImageNet21k. Two state-of-the-art VLMs generate structured thinking tokens and corresponding answers, so each image comes with two thinking-answer pairs that serve as a resource for training and evaluating multimodal reasoning models. The dataset captures both the step-by-step reasoning process and the final descriptive answers, with the aim of supporting more robust VLMs and a better understanding of multimodal reasoning mechanisms.

研究旨在开发ImageNet-Think数据集，用于训练具有推理能力的视觉语言模型（VLMs）。该数据集使用来自ImageNet21k的250,000张图像，配以由两个最先进的VLMs生成的结构化思考令牌和答案。数据集捕捉了推理过程和最终答案，有助于评估VLMs的推理能力，并促进对多模态推理机制的理解。

Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving

Authors: Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, Liam Paull

First: 2025-06-12T19:14:00+00:00 · Latest: 2025-10-02T01:37:54+00:00

Abs · PDF · Code1 · Code2

Abstract

Maintaining good driving behavior in out-of-distribution scenarios remains a critical challenge in autonomous driving. A promising direction is to leverage the generalist knowledge and reasoning capabilities of large language models by treating unusual driving scenarios as a logical reasoning task. In this work, we present Poutine, a method that uses an off-the-shelf 3B-parameter vision-language model (VLM), without any additional components, to achieve robust end-to-end autonomous driving via a simple and scalable training recipe. To learn strong base driving capabilities, we first train Poutine-Base using self-supervised next-token prediction over vision, language, and trajectory (VLT) tokens, leveraging both nominal and long-tail driving data. In the second stage, we fine-tune Poutine-Base using Group Relative Policy Optimization (GRPO) with a small set of human preference-labeled examples. We evaluated our approach on the Waymo end-to-end driving benchmark curated for long-tail scenarios. The final Poutine model achieves an RFS of 7.99 on the test set, placing 1st in the 2025 Waymo Vision-Based End-to-End Driving Challenge by a significant margin. Our results suggest that handcrafted tokenizers or custom architectural components added to base VLMs in prior work are not necessary to achieve strong driving performance. Instead, this work highlights the potential of scalable VLT pretraining combined with lightweight RL fine-tuning to enable robust and generalizable autonomous driving.

中文标题/摘要

标题：Poutine：视觉-语言-轨迹预训练和强化学习后训练实现稳健的端到端自动驾驶

在分布外场景中保持良好的驾驶行为仍然是自动驾驶领域的关键挑战。一个有前景的方向是将不寻常的驾驶场景视为逻辑推理任务，从而利用大型语言模型的通用知识和推理能力。在本文中，我们提出了Poutine，该方法仅使用一个现成的30亿参数视觉-语言模型（VLM），无需任何额外组件，通过简单且可扩展的训练方案实现稳健的端到端自动驾驶。为了学习强大的基础驾驶能力，我们首先利用常规和长尾驾驶数据，通过对视觉、语言和轨迹（VLT）标记进行自监督的下一个标记预测来训练Poutine-Base。在第二阶段，我们使用组相对策略优化（GRPO）和少量带有人类偏好标注的示例对Poutine-Base进行微调。我们在专为长尾场景构建的Waymo端到端驾驶基准上评估了我们的方法。最终的Poutine模型在测试集上的RFS为7.99，在2025年Waymo基于视觉的端到端驾驶挑战赛中以显著优势获得第一名。我们的结果表明，先前工作中添加到基础VLM上的手工设计标记器或定制架构组件并非实现强大驾驶性能所必需。相反，本文强调了可扩展的VLT预训练与轻量级RL微调相结合的潜力，以实现稳健且可泛化的自动驾驶。
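
The second training stage in this entry relies on Group Relative Policy Optimization (GRPO). As a rough illustration of the core idea, the sketch below computes group-relative advantages in the commonly described form: each sampled response's reward is normalized by the mean and standard deviation of its group. The function name, the `eps` constant, and the toy rewards are assumptions for illustration. How Poutine derives rewards from its preference-labeled examples is not specified in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    The population standard deviation is used here; some implementations
    use the sample standard deviation instead.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four trajectories sampled for the same scene.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Trajectories scoring above the group average receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.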

MIRA: Towards Mitigating Reward Hacking in Inference-Time Alignment of T2I Diffusion Models

Authors: Kevin Zhai, Utsav Singh, Anirudh Thatipelli, Souradip Chakraborty, Anit Kumar Sahu, Furong Huang, Amrit Singh Bedi, Mubarak Shah

First: 2025-10-02T00:47:36+00:00 · Latest: 2025-10-02T00:47:36+00:00

Abs · PDF · Code1 · Code2

Abstract

Diffusion models excel at generating images conditioned on text prompts, but the resulting images often do not satisfy user-specific criteria measured by scalar rewards such as Aesthetic Scores. This alignment typically requires fine-tuning, which is computationally demanding. Recently, inference-time alignment via noise optimization has emerged as an efficient alternative, modifying initial input noise to steer the diffusion denoising process towards generating high-reward images. However, this approach suffers from reward hacking, where the model produces images that score highly, yet deviate significantly from the original prompt. We show that noise-space regularization is insufficient and that preventing reward hacking requires an explicit image-space constraint. To this end, we propose MIRA (MItigating Reward hAcking), a training-free, inference-time alignment method. MIRA introduces an image-space, score-based KL surrogate that regularizes the sampling trajectory with a frozen backbone, constraining the output distribution so reward can increase without off-distribution drift (reward hacking). We derive a tractable approximation to KL using diffusion scores. Across SDv1.5 and SDXL, multiple rewards (Aesthetic, HPSv2, PickScore), and public datasets (e.g., Animal-Animal, HPDv2), MIRA achieves >60% win rate vs. strong baselines while preserving prompt adherence; mechanism plots show reward gains with near-zero drift, whereas DNO drifts as compute increases. We further introduce MIRA-DPO, mapping preference optimization to inference time with a frozen backbone, extending MIRA to non-differentiable rewards without fine-tuning.

中文标题/摘要

标题：MIRA：减轻文本到图像扩散模型推理时对齐中的奖励作弊

扩散模型在根据文本提示生成图像方面表现出色，但生成的图像往往不满足由标量奖励（如美学评分）衡量的用户特定标准。这种对齐通常需要微调，计算开销很大。最近，通过噪声优化的推理时对齐作为一种高效的替代方案出现了，通过修改初始输入噪声来引导去噪过程，以生成高奖励的图像。然而，这种方法存在奖励作弊的问题，模型生成的图像虽然得分高，但与原始提示相差甚远。我们表明，噪声空间正则化是不够的，防止奖励作弊需要显式的图像空间约束。为此，我们提出了MIRA（MItigating Reward hAcking），一种无需训练的推理时对齐方法。MIRA引入了一个基于分数的KL近似作为图像空间的正则化因子，使用冻结的主干来约束输出分布，使奖励可以增加而不发生离分布漂移（奖励作弊）。我们利用扩散分数推导出KL的可处理近似。在SDv1.5和SDXL、多种奖励（美学、HPSv2、PickScore）以及公共数据集（如Animal-Animal、HPDv2）上，MIRA在与强大基线相比时，胜率超过60%，同时保持了对原始提示的遵守；机制图显示奖励增加几乎无漂移，而DNO随着计算量增加则出现漂移。我们还引入了MIRA-DPO，通过冻结主干将偏好优化映射到推理时间，将MIRA扩展到非可微奖励而无需微调。

Summary / 总结

The paper addresses the issue of reward hacking in inference-time alignment of text-to-image diffusion models, where the model generates images that score highly but do not adhere to the original prompt. To mitigate this, the authors propose MIRA, an inference-time method that introduces an image-space, score-based KL surrogate to regularize the sampling trajectory, preventing off-distribution drift. Experiments show that MIRA outperforms strong baselines with a win rate of over 60% across different models, rewards, and datasets, while maintaining prompt adherence and minimal drift. Additionally, MIRA-DPO extends MIRA to non-differentiable rewards without requiring fine-tuning.

论文针对文本到图像扩散模型在推理时对齐过程中出现的奖励作弊问题，即生成的图像虽然得分高但偏离原始提示。提出了一种名为MIRA的无需训练的方法，通过图像空间的基于分数的KL近似来正则化采样轨迹，防止奖励作弊。实验表明，MIRA在多个奖励和数据集上优于强基线，胜率超过60%，同时保持对提示的遵从性，并且在计算量增加时，奖励提升几乎无漂移，而噪声优化方法则会出现漂移现象。