arXiv Paper Digest

Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity

Authors: Eric Tillmann Bill, Enis Simsar, Thomas Hofmann

First: 2025-10-02T17:59:58+00:00 · Latest: 2025-10-02T17:59:58+00:00

Comments: Code: https://github.com/ericbill21/FOCUS/

Abstract

Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.
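
The test-time controller described above perturbs the base velocity of a trained sampler. A minimal sketch of that idea, assuming a toy two-dimensional state, a hypothetical base velocity, and a hypothetical quadratic cost standing in for the multi-subject objective (none of these names or values come from the paper, and this is not the FOCUS controller itself):

```python
from typing import Callable, List, Optional

Vector = List[float]
Field = Callable[[Vector, float], Vector]


def euler_sample(x0: Vector, velocity: Field,
                 control: Optional[Field] = None, steps: int = 50) -> Vector:
    """Integrate dx/dt = v(x, t) + u(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        # With no controller, the update reduces to the base sampler.
        u = control(x, t) if control is not None else [0.0] * len(x)
        x = [xi + dt * (vi + ui) for xi, vi, ui in zip(x, v, u)]
    return x


# Toy stand-ins (assumptions for illustration only).
DATA = [1.0, 1.0]   # point the toy base velocity drifts toward
GOAL = [2.0, 0.0]   # minimiser of the toy cost


def base_velocity(x: Vector, t: float) -> Vector:
    """Hypothetical base model: a simple pull toward DATA."""
    return [d - xi for d, xi in zip(DATA, x)]


def cost(x: Vector) -> float:
    """Hypothetical cost; the paper's objective targets subject disentanglement."""
    return 0.5 * sum((xi - g) ** 2 for xi, g in zip(x, GOAL))


def gradient_control(x: Vector, t: float, strength: float = 1.0) -> Vector:
    """u = -strength * grad cost(x): one gradient evaluation per step."""
    return [-strength * (xi - g) for xi, g in zip(x, GOAL)]


x_base = euler_sample([0.0, 0.0], base_velocity)
x_ctrl = euler_sample([0.0, 0.0], base_velocity, gradient_control)
print(cost(x_base), cost(x_ctrl))
```

In this toy setting the controlled trajectory ends with a lower cost than the uncontrolled one while still following the base dynamics; `strength` trades off the two.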

Summary

The paper addresses attribute leakage, identity entanglement, and subject omissions in multi-subject text-to-image generation by viewing flow matching through stochastic optimal control. It proposes two algorithms: a training-free test-time controller that perturbs the base velocity with a single-pass update, and Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal. On Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both consistently improve multi-subject alignment while preserving the base model's style; test-time control runs efficiently, and fine-tuned controllers generalize to unseen prompts. FOCUS (Flow Optimal Control for Unentangled Subjects) achieves state-of-the-art multi-subject fidelity across models.

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Authors: Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez

First: 2025-10-02T17:59:43+00:00 · Latest: 2025-10-02T17:59:43+00:00

Abstract

Text-to-image diffusion models trained on a fixed set of resolutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during training. High-resolution text-to-image generators are currently unable to easily offer an out-of-the-box budget-efficient alternative to their users who might not need high-resolution images. We identify a key technical insight in diffusion models that when addressed can help tackle this limitation: Noise schedulers have unequal perceptual effects across resolutions. The same level of noise removes disproportionately more signal from lower-resolution images than from high-resolution images, leading to a train-test mismatch. We propose NoiseShift, a training-free method that recalibrates the noise level of the denoiser conditioned on resolution size. NoiseShift requires no changes to model architecture or sampling schedule and is compatible with existing models. When applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by 10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent artifacts and enhancing the quality of low-resolution image generation.
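
One common way to see why the same noise level acts differently across resolutions is to compare images at a shared coarse scale. This is an illustrative argument, not necessarily the paper's own derivation. If independent Gaussian noise with standard deviation `sigma` is added to every pixel and the image is then area-averaged by a factor `k` per side, each coarse pixel is the mean of `k * k` samples, so its noise standard deviation is `sigma / k`. The helper name below is hypothetical:

```python
def effective_sigma(sigma: float, resolution: int, reference: int) -> float:
    """Std of i.i.d. Gaussian pixel noise after area-averaging an image
    from `resolution` down to `reference` pixels per side."""
    if resolution % reference != 0:
        raise ValueError("resolution must be a multiple of reference")
    k = resolution // reference   # each coarse pixel averages k * k inputs
    return sigma / k              # std of the mean of k * k i.i.d. samples


for res in (256, 512, 1024):
    print(res, effective_sigma(1.0, res, 256))
```

At the same nominal noise level, the coarse content of a 1024-pixel image sees a quarter of the noise that a 256-pixel image does, which matches the abstract's claim that low-resolution images lose disproportionately more signal.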

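Summary

NoiseShift is a training-free method that recalibrates the noise level the denoiser is conditioned on according to the target resolution. It is motivated by the observation that the same noise level removes disproportionately more signal from low-resolution images than from high-resolution ones, which causes a train-test mismatch. The method requires no changes to model architecture or sampling schedule. Average FID improvements are 15.89% for SD3.5, 8.56% for SD3, and 2.44% for Flux-Dev on LAION-COCO, and 10.36%, 5.19%, and 3.02% respectively on CelebA.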

VideoNSA: Native Sparse Attention Scales Video Understanding

Authors: Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu

First: 2025-10-02T17:58:54+00:00 · Latest: 2025-10-02T17:58:54+00:00

Comments: Project Page: https://enxinsong.com/VideoNSA-web/, Code: https://github.com/Espere-1119-Song/VideoNSA

Abstract

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.
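
The hybrid scheme above (dense attention for text, sparse attention for video) can be illustrated with a per-query routing rule. The sketch below is an assumption-laden stand-in: it uses a fixed local window plus strided "global" positions for video queries, whereas NSA's sparse branches are learned and are not reproduced here. The function name and parameters are hypothetical.

```python
from typing import List, Set


def allowed_keys(q: int, modality: List[str], window: int, stride: int) -> Set[int]:
    """Causal key positions that query position q may attend to."""
    if modality[q] == "text":
        # Text queries keep dense causal attention over all earlier tokens.
        return set(range(q + 1))
    # Video queries: a local window plus a strided set of global positions.
    local = set(range(max(0, q - window + 1), q + 1))
    strided = set(range(0, q + 1, stride))
    return local | strided


# 12 video tokens followed by 4 text tokens.
modality = ["video"] * 12 + ["text"] * 4
print(sorted(allowed_keys(11, modality, window=3, stride=4)))  # [0, 4, 8, 9, 10, 11]
print(len(allowed_keys(15, modality, window=3, stride=4)))     # 16
```

The `window` and `stride` values set how a fixed key budget is split between local and global context, which is the kind of allocation the ablation in the abstract studies.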

Summary

VideoNSA adapts Native Sparse Attention (NSA) to video-language models by training Qwen2.5-VL end to end on a 216K video instruction dataset. It uses a hybrid approach, keeping dense attention for text and applying NSA to video, and outperforms token-compression and training-free sparse baselines on long-video understanding, temporal reasoning, and spatial benchmarks. Key findings are reliable scaling to 128K tokens, an optimal global-local attention allocation at a fixed budget, task-dependent branch usage patterns, and evidence that learnable combined sparse attention helps induce dynamic attention sinks.

From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens

Authors: Hala Sheta, Eric Huang, Shuyu Wu, Ilia Alenabi, Jiajun Hong, Ryker Lin, Ruoxi Ning, Daniel Wei, Jialin Yang, Jiawei Zhou, Ziqiao Ma, Freda Shi

Venue: EMNLP 2025

First: 2025-10-02T17:58:41+00:00 · Latest: 2025-10-02T17:58:41+00:00

Comments: EMNLP 2025 System Demonstration | Code: https://github.com/compling-wat/vlm-lens

Abstract

We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and their over 30 variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-sourced project to accelerate community efforts in understanding and improving VLMs.
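
Extracting intermediate outputs during a forward pass can be pictured with a toy pipeline of named layers. The sketch below is not the VLM-Lens API; the function, the layer names, and the toy "model" are all hypothetical. In real PyTorch models the usual mechanism for this is `torch.nn.Module.register_forward_hook`, though the abstract does not say how the toolkit implements it.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Layer = Tuple[str, Callable[[List[float]], List[float]]]


def forward_with_capture(
    layers: List[Layer], x: List[float], capture: Iterable[str]
) -> Tuple[List[float], Dict[str, List[float]]]:
    """Run layers in order and keep a copy of each requested layer's output."""
    wanted = set(capture)
    captured: Dict[str, List[float]] = {}
    for name, fn in layers:
        x = fn(x)
        if name in wanted:
            captured[name] = list(x)  # copy so later layers cannot mutate it
    return x, captured


# Toy three-layer "model" (illustrative only).
toy_model: List[Layer] = [
    ("embed", lambda v: [2.0 * a for a in v]),
    ("block0", lambda v: [a + 1.0 for a in v]),
    ("head", lambda v: [sum(v)]),
]

out, hidden = forward_with_capture(toy_model, [1.0, 2.0], capture=["block0"])
print(out, hidden)  # [8.0] {'block0': [3.0, 5.0]}
```

The list of layer names passed as `capture` plays the role that a configuration file would play in a configurable toolkit: the core loop stays the same while the set of extracted layers changes.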

Summary

VLM-Lens is a toolkit for systematic benchmarking, analysis, and interpretation of vision-language models (VLMs). It provides a unified, YAML-configurable interface for extracting intermediate outputs from any layer during the forward pass of open-source VLMs. It currently supports 16 state-of-the-art base VLMs and over 30 variants, and it can be extended to new models without changing the core logic. The authors demonstrate it with two analytical experiments that reveal systematic differences in hidden representations across layers and target concepts.

Test-Time Anchoring for Discrete Diffusion Posterior Sampling

Authors: Litu Rout, Andreas Lugmayr, Yasamin Jafarian, Srivatsan Varadharajan, Constantine Caramanis, Sanjay Shakkottai, Ira Kemelmacher-Shlizerman

First: 2025-10-02T17:58:37+00:00 · Latest: 2025-10-02T17:58:37+00:00

Comments: Preprint

Abstract

We study the problem of posterior sampling using pretrained discrete diffusion foundation models, aiming to recover images from noisy measurements without retraining task-specific models. While diffusion models have achieved remarkable success in generative modeling, most advances rely on continuous Gaussian diffusion. In contrast, discrete diffusion offers a unified framework for jointly modeling categorical data such as text and images. Beyond unification, discrete diffusion provides faster inference, finer control, and principled training-free Bayesian inference, making it particularly well-suited for posterior sampling. However, existing approaches to discrete diffusion posterior sampling face severe challenges: derivative-free guidance yields sparse signals, continuous relaxations limit applicability, and split Gibbs samplers suffer from the curse of dimensionality. To overcome these limitations, we introduce Anchored Posterior Sampling (APS) for masked diffusion foundation models, built on two key innovations -- quantized expectation for gradient-like guidance in discrete embedding space, and anchored remasking for adaptive decoding. Our approach achieves state-of-the-art performance among discrete diffusion samplers across linear and nonlinear inverse problems on the standard benchmarks. We further demonstrate the benefits of our approach in training-free stylization and text-guided editing.
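
The abstract names "quantized expectation" as the source of gradient-like guidance in a discrete embedding space but does not describe the procedure. One possible reading, offered only as a sketch: take the expectation of codebook embeddings under the model's token distribution, move it along a guidance gradient, and snap the result back to the nearest codebook entry. Every function, the codebook, and the gradient below are assumptions for illustration, not the APS algorithm.

```python
from typing import List, Sequence

Vector = List[float]


def expected_embedding(probs: Sequence[float], codebook: Sequence[Vector]) -> Vector:
    """Probability-weighted mean of the codebook embeddings."""
    dim = len(codebook[0])
    return [sum(p * c[d] for p, c in zip(probs, codebook)) for d in range(dim)]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    def sqdist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(range(len(codebook)), key=lambda k: sqdist(codebook[k]))


def guided_token(probs: Sequence[float], codebook: Sequence[Vector],
                 grad: Vector, step: float) -> int:
    """Shift the expected embedding against `grad`, then quantize to a token."""
    e = expected_embedding(probs, codebook)
    moved = [ei - step * gi for ei, gi in zip(e, grad)]
    return nearest_code(moved, codebook)


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # hypothetical embeddings
probs = [0.1, 0.6, 0.3]                           # hypothetical token distribution

print(guided_token(probs, codebook, grad=[0.0, 0.0], step=0.5))   # 1
print(guided_token(probs, codebook, grad=[0.0, -1.0], step=0.5))  # 2
```

With a zero gradient the most likely token is kept; a gradient that favors the second embedding dimension moves the choice to a different token, which is the sense in which the guidance is "gradient-like" even though the output stays discrete.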

Summary

This paper addresses posterior sampling with pretrained discrete diffusion models, recovering images from noisy measurements without retraining task-specific models. The authors introduce Anchored Posterior Sampling (APS) for masked diffusion foundation models, which combines quantized expectation for gradient-like guidance in discrete embedding space with anchored remasking for adaptive decoding. APS achieves state-of-the-art performance among discrete diffusion samplers on linear and nonlinear inverse problems on standard benchmarks, and it also shows benefits in training-free stylization and text-guided editing.