Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
Authors: Hadi Alzayer, Yunzhi Zhang, Chen Geng, Jia-Bin Huang, Jiajun Wu
First: 2025-10-16T17:59:59+00:00 · Latest: 2025-10-16T17:59:59+00:00
Comments: Project page: https://coupled-diffusion.github.io
Abstract
We present an inference-time diffusion sampling method to perform multi-view
consistent image editing using pre-trained 2D image editing models. These
models can independently produce high-quality edits for each image in a set of
multi-view images of a 3D scene or object, but they do not maintain consistency
across views. Existing approaches typically address this by optimizing over
explicit 3D representations, but they suffer from a lengthy optimization
process and instability under sparse view settings. We propose an implicit 3D
regularization approach by constraining the generated 2D image sequences to
adhere to a pre-trained multi-view image distribution. This is achieved through
coupled diffusion sampling, a simple diffusion sampling technique that
concurrently samples two trajectories from both a multi-view image distribution
and a 2D edited image distribution, using a coupling term to enforce the
multi-view consistency among the generated images. We validate the
effectiveness and generality of this framework on three distinct multi-view
image editing tasks, demonstrating its applicability across various model
architectures and highlighting its potential as a general solution for
multi-view consistent editing.
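The coupling idea in the abstract can be illustrated with a toy sketch. The abstract does not give the form of the coupling term or the sampler, so everything below is an assumption made for illustration: the two "distributions" are unit-variance Gaussians with different means, the update is a plain Langevin-style step, and the coupling is a quadratic attraction between the two trajectories. The names `gaussian_score` and `coupled_sampling` are hypothetical and do not come from the paper.

```python
import numpy as np

def gaussian_score(mean):
    """Score (gradient of the log-density) of a unit-variance Gaussian at `mean`."""
    return lambda x: -(x - mean)

def coupled_sampling(score_a, score_b, x_a, x_b, coupling,
                     n_steps=200, step=0.05, noise_scale=0.0, seed=0):
    """Advance two trajectories together (toy assumption, not the paper's sampler).

    Each trajectory follows its own score, plus an assumed quadratic coupling
    term that pulls it toward the other trajectory.
    """
    rng = np.random.default_rng(seed)
    x_a, x_b = x_a.copy(), x_b.copy()
    for _ in range(n_steps):
        # Compute both drifts before updating so the step is simultaneous.
        drift_a = score_a(x_a) + coupling * (x_b - x_a)
        drift_b = score_b(x_b) + coupling * (x_a - x_b)
        noise_a = noise_scale * np.sqrt(2 * step) * rng.standard_normal(x_a.shape)
        noise_b = noise_scale * np.sqrt(2 * step) * rng.standard_normal(x_b.shape)
        x_a = x_a + step * drift_a + noise_a
        x_b = x_b + step * drift_b + noise_b
    return x_a, x_b

# Stand-ins for the multi-view prior and the 2D editing model.
score_multiview = gaussian_score(-1.0)
score_edit = gaussian_score(+1.0)
start = np.zeros(4)

free_a, free_b = coupled_sampling(score_multiview, score_edit, start, start, coupling=0.0)
coupled_a, coupled_b = coupled_sampling(score_multiview, score_edit, start, start, coupling=1.0)

print("gap without coupling:", np.abs(free_a - free_b).max())
print("gap with coupling:   ", np.abs(coupled_a - coupled_b).max())
```

With the coupling switched off, each trajectory settles at its own mode. With it on, the two samples end up closer together while each is still shaped by its own score, which is the qualitative behaviour the abstract describes.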
中文标题/摘要
标题:耦合扩散采样用于无训练多视图图像编辑
我们提出了一种推理时的扩散采样方法,使用预训练的2D图像编辑模型实现多视图一致的图像编辑。这些模型可以独立地为3D场景或物体的一组多视图图像中的每张图像生成高质量的编辑,但无法在不同视图之间保持一致性。现有方法通常通过优化显式的3D表示来解决这个问题,但其优化过程漫长,且在稀疏视图设置下不稳定。我们提出了一种隐式的3D正则化方法,约束生成的2D图像序列遵循预训练的多视图图像分布。这通过耦合扩散采样实现:这是一种简单的扩散采样技术,同时从多视图图像分布和2D编辑图像分布中采样两条轨迹,并使用耦合项来强制生成图像之间的多视图一致性。我们在三个不同的多视图图像编辑任务上验证了该框架的有效性和通用性,展示了其在各种模型架构中的适用性,并强调了其作为多视图一致编辑的通用解决方案的潜力。
Summary / 总结
The paper introduces a training-free, inference-time method for multi-view consistent image editing using pre-trained 2D image editing models. It proposes coupled diffusion sampling, which concurrently samples two trajectories, one from a multi-view image distribution and one from a 2D edited image distribution, and uses a coupling term to enforce consistency across the generated views. The method is validated on three multi-view image editing tasks and shown to apply across different model architectures.
该研究提出了一种基于耦合扩散采样的免训练方法,以解决多视图图像编辑中的一致性问题。预训练的2D图像编辑模型能够独立地为每个视图生成高质量的编辑,但无法保持视图间的一致性;该方法通过耦合项约束生成结果遵循预训练的多视图图像分布。在三个多视图图像编辑任务上的实验表明,该方法无需显式的3D优化即可强制执行多视图一致性,并适用于各种模型架构。
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Authors: Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu
First: 2025-10-16T17:59:58+00:00 · Latest: 2025-10-16T17:59:58+00:00
Comments: 21 pages, 7 figures
Abstract
The edifice of native Vision-Language Models (VLMs) has emerged as a rising
contender to typical modular VLMs, shaped by evolving model architectures and
training paradigms. Yet, two lingering clouds cast shadows over its widespread
exploration and promotion: (-) What fundamental constraints set native VLMs
apart from modular ones, and to what extent can these barriers be overcome? (-)
How can research in native VLMs be made more accessible and democratized, thereby
accelerating progress in the field? In this paper, we clarify these challenges
and outline guiding principles for constructing native VLMs. Specifically, one
native VLM primitive should: (i) effectively align pixel and word
representations within a shared semantic space; (ii) seamlessly integrate the
strengths of formerly separate vision and language modules; (iii) inherently
embody various cross-modal properties that support unified vision-language
encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of
native VLMs built from first principles, capable of rivaling top-tier modular
counterparts across diverse real-world scenarios. With only 390M image-text
examples, NEO efficiently develops visual perception from scratch while
mitigating vision-language conflicts inside a dense and monolithic model
crafted from our elaborate primitives. We position NEO as a cornerstone for
scalable and powerful native VLMs, paired with a rich set of reusable
components that foster a cost-effective and extensible ecosystem. Our code and
models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.
中文标题/摘要
标题:从像素到文字——迈向大规模原生视觉-语言基础
原生视觉-语言模型(VLMs)已经成为典型模块化VLMs的有力竞争者,这得益于不断演进的模型架构和训练范式。然而,两个悬而未决的问题仍然阻碍着其广泛探索和推广:(-)原生VLMs与模块化VLMs的根本差异源于哪些基本约束,这些障碍可以克服到什么程度?(-)如何使原生VLMs的研究更加普及和民主化,从而加速该领域的进展?在本文中,我们澄清了这些挑战,并概述了构建原生VLMs的指导原则。具体而言,一个原生VLM原语应该:(i)在共享语义空间内有效对齐像素和词的表示;(ii)无缝整合以前分离的视觉和语言模块的优势;(iii)内在地体现各种跨模态特性,以支持统一的视觉-语言编码、对齐和推理。因此,我们推出了NEO,这是一个从第一性原理构建的全新原生VLM家族,能够在多种现实场景中与顶级模块化模型相抗衡。仅使用3.9亿个图像-文本样本,NEO就能从头开始高效地发展视觉感知,同时在由我们精心设计的原语构建的密集单体模型中缓解视觉-语言冲突。我们将NEO定位为可扩展且强大的原生VLMs的基石,并配有一套丰富的可重用组件,以促进经济高效且可扩展的生态系统。我们的代码和模型已公开发布在:https://github.com/EvolvingLMMs-Lab/NEO。
Summary / 总结
This paper clarifies the challenges in developing native Vision-Language Models (VLMs) and outlines guiding principles for primitives that align pixel and word representations, integrate the strengths of formerly separate vision and language modules, and support unified cross-modal encoding, aligning, and reasoning. The authors introduce NEO, a novel family of native VLMs that develops visual perception from scratch with only 390M image-text examples and rivals top-tier modular counterparts across diverse real-world scenarios while mitigating vision-language conflicts within a dense, monolithic model. NEO is positioned as scalable and cost-effective, with reusable components, code, and models publicly available.
本文通过定义关键的原生视觉-语言模型(VLM)原语来应对其挑战,这些原语能够对齐像素和词的表示,整合视觉和语言模块,并支持统一的跨模态编码。作者引入了NEO,这是一种新型的原生VLM家族,仅使用3.9亿个图像-文本示例,就能够与顶级模块化模型竞争,并在密集的单体模型中缓解视觉-语言冲突。NEO设计为可扩展且经济高效,并公开提供了可重用的组件。
Learning an Image Editing Model without Image Editing Pairs
Authors: Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, Xun Huang
First: 2025-10-16T17:59:57+00:00 · Latest: 2025-10-16T17:59:57+00:00
Comments: project page: https://nupurkmr9.github.io/npedit/
Abstract
Recent image editing models have achieved impressive results while following
natural language editing instructions, but they rely on supervised fine-tuning
with large datasets of input-target pairs. This is a critical bottleneck, as
such naturally occurring pairs are hard to curate at scale. Current workarounds
use synthetic training pairs that leverage the zero-shot capabilities of
existing models. However, this can propagate and magnify the artifacts of the
pretrained model into the final trained model. In this work, we present a new
training paradigm that eliminates the need for paired data entirely. Our
approach directly optimizes a few-step diffusion model by unrolling it during
training and leveraging feedback from vision-language models (VLMs). For each
input and editing instruction, the VLM evaluates if an edit follows the
instruction and preserves unchanged content, providing direct gradients for
end-to-end optimization. To ensure visual fidelity, we incorporate distribution
matching loss (DMD), which constrains generated images to remain within the
image manifold learned by pretrained models. We evaluate our method on standard
benchmarks and include an extensive ablation study. Without any paired data,
our method performs on par with various image editing diffusion models trained
on extensive supervised paired data, under the few-step setting. Given the same
VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
中文标题/摘要
标题:无需图像编辑配对学习图像编辑模型
最近的图像编辑模型在遵循自然语言编辑指令方面取得了令人印象深刻的成果,但它们依赖于使用大量输入-目标配对数据集的监督微调。这是一个关键瓶颈,因为这种自然出现的配对数据难以大规模整理。当前的变通方法使用合成训练配对,利用现有模型的零样本能力。然而,这可能会将预训练模型的伪影传播并放大到最终训练的模型中。在本工作中,我们提出了一种新的训练范式,完全消除了对配对数据的需求。我们的方法直接优化一个少步扩散模型,在训练过程中将其展开,并利用视觉语言模型(VLM)的反馈。对于每个输入和编辑指令,VLM 评估编辑是否遵循指令并保留未更改的内容,为端到端优化提供直接梯度。为了确保视觉保真度,我们引入了分布匹配损失(DMD),该损失约束生成的图像保持在预训练模型学习到的图像流形内。我们在标准基准上评估了我们的方法,并包括了详尽的消融研究。在没有任何配对数据的情况下,我们的方法在少步设置下与各种在大量监督配对数据上训练的图像编辑扩散模型表现相当。使用相同的 VLM 作为奖励模型时,我们还优于基于 RL 的技术如 Flow-GRPO。
Summary / 总结
This work addresses the challenge of training image editing models without paired input-target data, which is hard to curate at scale. The authors propose a training paradigm that directly optimizes a few-step diffusion model by unrolling it during training and using feedback from vision-language models (VLMs) on whether an edit follows the instruction and preserves unchanged content. Combined with a distribution matching loss (DMD) for visual fidelity, the approach performs on par, under the few-step setting, with editing models trained on extensive supervised paired data. Given the same VLM as the reward model, it also outperforms RL-based techniques like Flow-GRPO.
该研究解决了无需依赖大规模配对输入-目标数据训练图像编辑模型的挑战。作者提出了一种新的训练范式,通过视觉语言模型的反馈直接优化扩散模型。该方法包括分布匹配损失,即使在没有配对数据的情况下也能达到与大量配对数据训练的模型相当的性能。当使用相同的视觉语言模型作为奖励模型时,该方法还优于基于RL的技术如Flow-GRPO。
Attention Is All You Need for KV Cache in Diffusion LLMs
Authors: Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen
First: 2025-10-16T17:59:48+00:00 · Latest: 2025-10-16T17:59:48+00:00
Comments: https://vila-lab.github.io/elastic-cache-webpage/
Abstract
This work studies how to adaptively recompute key-value (KV) caches for
diffusion large language models (DLMs) to maximize prediction accuracy while
minimizing decoding latency. Prior methods' decoders recompute QKV for all
tokens at every denoising step and layer, despite KV states changing little
across most steps, especially in shallow layers, leading to substantial
redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens
primarily act as a length-bias and can be cached block-wise beyond the active
prediction window; (2) KV dynamics increase with depth, suggesting that
selective refresh starting from deeper layers is sufficient; and (3) the
most-attended token exhibits the smallest KV drift, providing a conservative
lower bound on cache change for other tokens. Building on these, we propose
${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that
jointly decides ${when}$ to refresh (via an attention-aware drift test on the
most-attended token) and ${where}$ to refresh (via a depth-aware schedule that
recomputes from a chosen layer onward while reusing shallow-layer caches and
off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs
adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant
computation and accelerating decoding with negligible loss in generation
quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across
mathematical reasoning and code generation tasks demonstrate consistent
speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences,
and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy
than the baseline. Our method achieves significantly higher throughput
($6.8\times$ on GSM8K) than existing confidence-based approaches while
preserving generation quality, enabling practical deployment of diffusion LLMs.
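The "when" and "where" decisions described above can be sketched on toy arrays. This is a simplified illustration under stated assumptions and not the authors' implementation: cosine similarity as the drift measure, the 0.95 threshold, the rule "refresh from the first layer whose most-attended token has drifted", and the names `refresh_start_layer` and `reuse_and_refresh` are all hypothetical. For brevity the sketch is handed fresh states for every token, whereas a real decoder would only compute the probe token before deciding.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def refresh_start_layer(attn, cached_kv, fresh_kv, threshold=0.95):
    """Return the first layer whose most-attended token drifted, else None.

    attn:      (layers, tokens) attention mass received by each token
    cached_kv: (layers, tokens, dim) cached states
    fresh_kv:  (layers, tokens, dim) newly computed states (toy stand-in)
    """
    for layer in range(attn.shape[0]):
        probe = int(np.argmax(attn[layer]))  # most-attended token at this layer
        similarity = cosine_similarity(cached_kv[layer, probe], fresh_kv[layer, probe])
        if similarity < threshold:  # assumed drift test
            return layer
    return None

def reuse_and_refresh(cached_kv, fresh_kv, start_layer):
    """Keep shallow-layer caches; overwrite from `start_layer` onward."""
    updated = cached_kv.copy()
    if start_layer is not None:
        updated[start_layer:] = fresh_kv[start_layer:]
    return updated

rng = np.random.default_rng(0)
n_layers, n_tokens, dim = 4, 5, 8
attn = rng.random((n_layers, n_tokens))
cached_kv = rng.standard_normal((n_layers, n_tokens, dim))

# Toy drift: layers 0-1 are unchanged, layers 2-3 flip sign.
fresh_kv = cached_kv.copy()
fresh_kv[2:] = -cached_kv[2:]

start_layer = refresh_start_layer(attn, cached_kv, fresh_kv)
updated_kv = reuse_and_refresh(cached_kv, fresh_kv, start_layer)
print("refresh from layer:", start_layer)
```

When no probe token drifts, `refresh_start_layer` returns `None` and the whole cache is reused, which corresponds to skipping recomputation for that denoising step.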
中文标题/摘要
标题:扩散大语言模型中的KV缓存只需注意力
本研究探讨了如何为扩散大语言模型(DLMs)自适应地重新计算键值(KV)缓存,以最大化预测准确性并最小化解码延迟。先前方法的解码器在每个去噪步骤和每一层中都为所有标记重新计算QKV,尽管KV状态在大多数步骤中变化不大,尤其是在浅层,导致大量冗余。我们做出了三个观察:(1)距离较远的${\bf MASK}$标记主要起到长度偏置的作用,可以在活动预测窗口之外按块缓存;(2)KV动态随深度增加,表明从较深层开始的选择性刷新是足够的;(3)最受关注的标记表现出最小的KV漂移,为其他标记的缓存变化提供了保守的下限。基于这些观察,我们提出了${\bf Elastic-Cache}$,这是一种无需训练、架构无关的策略,联合决定何时刷新(通过对最受关注标记的注意力感知漂移测试)和何处刷新(通过深度感知调度,从选定层开始重新计算,同时重用浅层缓存和窗口外的MASK缓存)。与固定周期方案不同,Elastic-Cache为扩散大语言模型执行自适应、层感知的缓存更新,减少冗余计算并加速解码,同时几乎不损失生成质量。在LLaDA-Instruct、LLaDA-1.5和LLaDA-V上的数学推理和代码生成任务实验展示了一致的加速:在GSM8K(256个标记)上为$8.7\times$,在较长序列上为$45.1\times$,在HumanEval上为$4.8\times$,同时始终保持比基线更高的准确性。我们的方法实现了比现有基于置信度的方法显著更高的吞吐量(在GSM8K上为$6.8\times$),同时保持了生成质量,使扩散大语言模型的实际部署成为可能。
Summary / 总结
This work addresses the challenge of efficiently managing key-value (KV) caches in diffusion large language models (DLMs) to reduce decoding latency while maintaining prediction accuracy. The authors observe that distant ${\bf MASK}$ tokens can be cached block-wise, KV dynamics increase with depth, and the most-attended token exhibits the smallest KV drift. They propose ${\bf Elastic-Cache}$, a training-free strategy that decides when and where to refresh KV caches based on these observations, leading to significant speedups (up to $45.1\times$ on longer sequences) with negligible loss in generation quality. Experiments show higher accuracy than the baseline and higher throughput than existing confidence-based approaches.
该研究旨在提高扩散大语言模型(DLMs)中键值(KV)缓存的效率,以在保持预测准确性的同时减少解码延迟。提出了一种无需训练的策略Elastic-Cache,根据最受关注标记的注意力感知漂移测试和深度感知调度来决定何时和何处刷新缓存。实验表明,Elastic-Cache 可以实现显著的加速(在较长序列上最高45.1倍),同时保持生成质量,并且在吞吐量方面优于现有的基于置信度的方法。
RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks
Authors: Mingxuan Yan, Yuping Wang, Zechun Liu, Jiachen Li
Venue: NeurIPS 2025
First: 2025-10-16T17:59:37+00:00 · Latest: 2025-10-16T17:59:37+00:00
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS
2025); Project Website: rdd-neurips.github.io
Abstract
To tackle long-horizon tasks, recent hierarchical vision-language-action
(VLA) frameworks employ vision-language model (VLM)-based planners to
decompose complex manipulation tasks into simpler sub-tasks that low-level
visuomotor policies can easily handle. Typically, the VLM planner is finetuned
to learn to decompose a target task. This finetuning requires target task
demonstrations segmented into sub-tasks by either human annotation or heuristic
rules. However, the heuristic sub-tasks can deviate significantly from the
training data of the visuomotor policy, which degrades task performance. To
address these issues, we propose a Retrieval-based Demonstration Decomposer
(RDD) that automatically decomposes demonstrations into sub-tasks by aligning
the visual features of the decomposed sub-task intervals with those from the
training data of the low-level visuomotor policies. Our method outperforms the
state-of-the-art sub-task decomposer on both simulation and real-world tasks,
demonstrating robustness across diverse settings. Code and more results are
available at rdd-neurips.github.io.
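One way to picture retrieval-based decomposition is as a segmentation problem scored against a bank of features from the policy's training data. The abstract does not specify the scoring or the search procedure, so the sketch below rests on assumptions: an interval is represented by its mean frame feature, its score is the best cosine similarity to any bank entry weighted by interval length, a small per-segment penalty discourages over-splitting, and the cut points are found by dynamic programming. The names `interval_score` and `decompose` and the parameters `min_len` and `segment_penalty` are hypothetical.

```python
import numpy as np

def interval_score(frames, bank):
    """Best cosine similarity between the interval's mean feature and the bank."""
    feat = frames.mean(axis=0)
    feat = feat / (np.linalg.norm(feat) + 1e-12)
    bank_unit = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-12)
    return float((bank_unit @ feat).max())

def decompose(frames, bank, min_len=2, segment_penalty=0.1):
    """Split a demonstration into intervals that best match the feature bank.

    Dynamic programming over cut points; every interval has at least
    `min_len` frames. Returns a list of (start, end) pairs, end exclusive.
    """
    n_frames = len(frames)
    best = np.full(n_frames + 1, -np.inf)  # best[t]: best score for frames[:t]
    best[0] = 0.0
    prev = np.zeros(n_frames + 1, dtype=int)
    for end in range(min_len, n_frames + 1):
        for start in range(0, end - min_len + 1):
            if not np.isfinite(best[start]):
                continue  # frames[:start] cannot be segmented validly
            gain = (end - start) * interval_score(frames[start:end], bank)
            score = best[start] + gain - segment_penalty
            if score > best[end]:
                best[end], prev[end] = score, start
    # Walk the back-pointers to recover the intervals.
    intervals, end = [], n_frames
    while end > 0:
        start = int(prev[end])
        intervals.append((start, end))
        end = start
    return intervals[::-1]

# Toy bank: three sub-task prototypes; toy demo: 4 + 3 + 5 frames.
bank = np.eye(3)
demo = np.concatenate([np.tile(bank[0], (4, 1)),
                       np.tile(bank[1], (3, 1)),
                       np.tile(bank[2], (5, 1))])
segments = decompose(demo, bank)
print(segments)
```

In a realistic setting the frame features would come from a visual encoder and the bank from the visuomotor policy's training data; both are replaced by one-hot vectors here so the recovered boundaries are easy to check by eye.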
中文标题/摘要
标题:RDD:用于长时序任务中规划器对齐的基于检索的演示分解器
为解决长时序任务,近期的分层视觉-语言-动作(VLA)框架采用基于视觉-语言模型(VLM)的规划器,将复杂的操作任务分解为低级视觉-运动策略可以轻松处理的简单子任务。通常,VLM规划器需要经过微调以学习分解目标任务。这种微调需要通过人工标注或启发式规则将目标任务的演示切分为子任务。然而,启发式得到的子任务可能与视觉-运动策略的训练数据相差甚远,从而降低任务性能。为了解决这些问题,我们提出了一种基于检索的演示分解器(RDD),它通过将分解所得子任务区间的视觉特征与低级视觉-运动策略训练数据中的视觉特征对齐,自动将演示分解为子任务。我们的方法在仿真和真实世界任务中均优于最先进的子任务分解器,展示了在各种设置下的鲁棒性。代码和更多结果可在rdd-neurips.github.io获取。
Summary / 总结
The paper introduces RDD, a method for automatically decomposing demonstrations into sub-tasks by aligning visual features with the training data of low-level visuomotor policies. This approach addresses the issue of heuristic sub-tasks deviating from the training data, thereby improving task performance. RDD outperforms existing methods on both simulation and real-world tasks, showing robustness across different settings.
该论文提出了RDD方法,通过将分解后的子任务区间的视觉特征与低级视觉-运动策略的训练数据对齐,自动将演示分解为子任务。这解决了启发式子任务偏离训练数据、导致任务性能下降的问题。RDD在仿真和真实世界任务中均优于现有方法,展示了在不同环境中的鲁棒性。