arXiv Paper Digest

Coupled Diffusion Sampling for Training-Free Multi-View Image Editing

Authors: Hadi Alzayer, Yunzhi Zhang, Chen Geng, Jia-Bin Huang, Jiajun Wu

First: 2025-10-16T17:59:59+00:00 · Latest: 2025-10-16T17:59:59+00:00

Comments: Project page: https://coupled-diffusion.github.io

Abstract

We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse view settings. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution, using a coupling term to enforce the multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.
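
The coupling idea in the abstract can be illustrated with a toy sketch. It is not the paper's update rule: it assumes two 1-D Gaussian targets standing in for the multi-view and edited-image distributions, plain Langevin steps in place of a diffusion sampler, and a quadratic coupling term. The names `coupled_sample` and `mean_gap` and the `coupling` parameter are hypothetical.

```python
import random


def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of a 1-D Gaussian at x."""
    return (mean - x) / (std * std)


def coupled_sample(coupling, steps=200, step_size=0.05, seed=0):
    """Run two Langevin trajectories that are pulled toward each other.

    Assumption: the 'multi-view' target is N(-1, 1) and the 'edited' target
    is N(+1, 1); both are toy stand-ins, not real image distributions.
    """
    rng = random.Random(seed)
    x_mv = rng.gauss(0.0, 1.0)    # trajectory guided by the multi-view stand-in
    x_edit = rng.gauss(0.0, 1.0)  # trajectory guided by the 2D-edit stand-in
    noise_scale = (2.0 * step_size) ** 0.5
    for _ in range(steps):
        # Each trajectory follows its own score plus a pull toward the other.
        drift_mv = gaussian_score(x_mv, -1.0, 1.0) - coupling * (x_mv - x_edit)
        drift_edit = gaussian_score(x_edit, 1.0, 1.0) - coupling * (x_edit - x_mv)
        x_mv = x_mv + step_size * drift_mv + noise_scale * rng.gauss(0.0, 1.0)
        x_edit = x_edit + step_size * drift_edit + noise_scale * rng.gauss(0.0, 1.0)
    return x_mv, x_edit


def mean_gap(coupling, trials=200):
    """Average absolute gap between the two final samples over several seeds."""
    total = 0.0
    for seed in range(trials):
        x_mv, x_edit = coupled_sample(coupling, seed=seed)
        total += abs(x_mv - x_edit)
    return total / trials


if __name__ == "__main__":
    print("uncoupled gap:", mean_gap(0.0))
    print("coupled gap:  ", mean_gap(5.0))
```

In this toy setting, raising `coupling` trades fidelity to each individual target for agreement between the two samples, which is the trade-off a coupling term has to balance.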

Summary

The paper introduces a training-free, inference-time method for multi-view consistent image editing with pre-trained 2D image editing models. Coupled diffusion sampling runs two sampling trajectories concurrently, one from a multi-view image distribution and one from a 2D edited image distribution, and links them with a coupling term so that the edited views stay consistent with one another. Because consistency comes from this implicit 3D regularization rather than from optimizing an explicit 3D representation, the method avoids the lengthy optimization and sparse-view instability of earlier approaches. It is validated on three distinct multi-view editing tasks and across several model architectures.

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

Authors: Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu

First: 2025-10-16T17:59:58+00:00 · Latest: 2025-10-16T17:59:58+00:00

Comments: 21 pages, 7 figures

Abstract

The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field. In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.

Summary

This paper sets out guiding principles for native Vision-Language Models (VLMs): a native VLM primitive should align pixel and word representations in a shared semantic space, integrate the strengths of formerly separate vision and language modules, and support unified cross-modal encoding, aligning, and reasoning. Building on these primitives, the authors introduce NEO, a family of native VLMs that develops visual perception from scratch with only 390M image-text examples and is reported to rival top-tier modular counterparts across diverse real-world scenarios, while mitigating vision-language conflicts inside a dense, monolithic model. NEO is positioned as a scalable, cost-effective foundation, and its code, models, and reusable components are publicly available.

Learning an Image Editing Model without Image Editing Pairs

Authors: Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, Xun Huang

First: 2025-10-16T17:59:57+00:00 · Latest: 2025-10-16T17:59:57+00:00

Comments: project page: https://nupurkmr9.github.io/npedit/

Abstract

Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.

Summary

This work addresses training image editing models without paired input-target data, which is hard to curate at scale. The authors propose a training paradigm that unrolls a few-step diffusion model during training and optimizes it directly with feedback from a vision-language model (VLM), which judges whether each edit follows the instruction and preserves unchanged content. A distribution matching loss (DMD) keeps the outputs within the image manifold learned by pretrained models. Without any paired data, the method performs on par, in the few-step setting, with editing models trained on extensive supervised pairs, and it outperforms RL-based techniques such as Flow-GRPO when the same VLM serves as the reward model.

Attention Is All You Need for KV Cache in Diffusion LLMs

Authors: Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen

First: 2025-10-16T17:59:48+00:00 · Latest: 2025-10-16T17:59:48+00:00

Comments: https://vila-lab.github.io/elastic-cache-webpage/

Abstract

This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant MASK tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose Elastic-Cache, a training-free, architecture-agnostic strategy that jointly decides when to refresh (via an attention-aware drift test on the most-attended token) and where to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: 8.7× on GSM8K (256 tokens), 45.1× on longer sequences, and 4.8× on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput (6.8× on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
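
The "when" and "where" decisions described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the drift test is taken to be a cosine-similarity check against a fixed `threshold`, KV states are plain lists of floats, and the function names (`layers_to_refresh`, `most_attended_index`) and the `start_layer` parameter are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_attended_index(attention_weights):
    """Index of the token that receives the largest attention weight."""
    return max(range(len(attention_weights)), key=lambda i: attention_weights[i])


def layers_to_refresh(cached_kv, current_kv, attention_weights,
                      num_layers, start_layer, threshold=0.95):
    """Decide whether to refresh the cache and, if so, which layers.

    'When': probe only the most-attended token; if its KV state has barely
    drifted (similarity >= threshold, an assumed criterion), reuse every cache.
    'Where': otherwise recompute from start_layer onward and keep reusing the
    shallower layers' caches.
    """
    probe = most_attended_index(attention_weights)
    similarity = cosine_similarity(cached_kv[probe], current_kv[probe])
    if similarity >= threshold:
        return []  # drift is small: no layer needs recomputation
    return list(range(start_layer, num_layers))


if __name__ == "__main__":
    cached = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    attention = [0.1, 0.7, 0.2]  # token 1 is the most attended
    stable = [[1.0, 0.0], [0.05, 1.0], [1.0, 1.0]]
    drifted = [[1.0, 0.0], [1.0, 0.2], [1.0, 1.0]]
    print(layers_to_refresh(cached, stable, attention, 4, 2))
    print(layers_to_refresh(cached, drifted, attention, 4, 2))
```

Probing a single token keeps the test cheap; the abstract's third observation is what motivates using the most-attended token as that probe.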

Summary

This work addresses how to manage key-value (KV) caches in diffusion large language models (DLMs) so that decoding latency drops while prediction accuracy is maintained. The authors observe that distant MASK tokens can be cached block-wise, that KV dynamics increase with depth, and that the most-attended token shows the smallest KV drift. Building on these observations, they propose Elastic-Cache, a training-free strategy that decides when to refresh (an attention-aware drift test on the most-attended token) and where to refresh (a depth-aware schedule). Reported speedups reach 45.1× on longer sequences with negligible loss in generation quality, and throughput exceeds that of existing confidence-based approaches.

RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks

Authors: Mingxuan Yan, Yuping Wang, Zechun Liu, Jiachen Li

Venue: NeurIPS 2025

First: 2025-10-16T17:59:37+00:00 · Latest: 2025-10-16T17:59:37+00:00

Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025); Project Website: rdd-neurips.github.io

Abstract

To tackle long-horizon tasks, recent hierarchical vision-language-action (VLA) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can easily handle. Typically, the VLM planner is finetuned to learn to decompose a target task. This finetuning requires target task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, the heuristic sub-tasks can deviate significantly from the training data of the visuomotor policy, which degrades task performance. To address these issues, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes demonstrations into sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. Our method outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at rdd-neurips.github.io.
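
One way to picture retrieval-based decomposition is as a segmentation problem solved by dynamic programming. The sketch below is an assumption-laden toy, not the RDD algorithm itself: it summarizes each candidate interval by the mean of its frame features, scores it by squared distance to the nearest entry in a hypothetical feature bank built from the policy's training data, and picks the cut points with the lowest total cost. The names `decompose`, `interval_feature`, and `nearest_distance` are illustrative.

```python
def interval_feature(frames, start, end):
    """Mean feature vector of frames[start:end] (assumed interval summary)."""
    dim = len(frames[0])
    count = end - start
    return [sum(frame[d] for frame in frames[start:end]) / count for d in range(dim)]


def nearest_distance(feature, bank):
    """Squared Euclidean distance from feature to its closest bank entry."""
    return min(sum((a - b) ** 2 for a, b in zip(feature, ref)) for ref in bank)


def decompose(frames, bank, min_len=1):
    """Split a demonstration into intervals that best match the feature bank.

    best[end] holds the lowest cost of segmenting frames[:end]; each interval
    costs its length times its retrieval distance, so long mismatched
    intervals are penalized more than short ones.
    """
    n = len(frames)
    inf = float("inf")
    best = [inf] * (n + 1)
    best[0] = 0.0
    prev = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(0, end - min_len + 1):
            if best[start] == inf:
                continue
            feature = interval_feature(frames, start, end)
            cost = best[start] + (end - start) * nearest_distance(feature, bank)
            if cost < best[end]:
                best[end] = cost
                prev[end] = start
    # Walk the back-pointers to recover the chosen (start, end) intervals.
    intervals = []
    end = n
    while end > 0:
        intervals.append((prev[end], end))
        end = prev[end]
    return intervals[::-1]


if __name__ == "__main__":
    frames = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
    bank = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical sub-task features
    print(decompose(frames, bank))
```

Ties are broken toward longer intervals because a later candidate replaces the current best only when its cost is strictly lower.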

Summary

The paper introduces RDD, which automatically decomposes demonstrations into sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. This addresses the problem that heuristic sub-tasks can deviate from the policy's training data and degrade task performance. RDD outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, showing robustness across diverse settings.