arXiv Paper Digest

MV-RAG: Retrieval Augmented Multiview Diffusion

Authors: Yosef Dayani, Omer Benishu, Sagie Benaim

First: 2025-08-22T17:59:40+00:00 · Latest: 2025-08-22T17:59:40+00:00

Comments: Project page: https://yosefdayani.github.io/MV-RAG

Abstract

Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
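The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how retrieval is implemented, so this example assumes embedding-based nearest-neighbour search with cosine similarity; the function names `cosine` and `retrieve_top_k`, the toy two-dimensional embeddings, and the image names are all hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    database: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the names of the k database entries most similar to the query.

    ASSUMPTION: retrieval is done by ranking precomputed image embeddings
    against a text-prompt embedding. The paper's actual retriever may differ.
    """
    ranked = sorted(
        database,
        key=lambda name: cosine(query, database[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings, purely illustrative.
toy_database = {
    "image_a": [1.0, 0.0],
    "image_b": [0.0, 1.0],
    "image_c": [1.0, 1.0],
}
toy_query = [1.0, 0.1]
conditioning_images = retrieve_top_k(toy_query, toy_database, k=2)
```

In a pipeline of the kind the abstract describes, the returned images would then serve as conditioning inputs to the multiview diffusion model; that model is outside the scope of this sketch.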

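Title / Abstract (translated from the Chinese)

Title: MV-RAG: Retrieval-Augmented Multiview Diffusion Model

Text-to-3D generation methods have made significant progress by leveraging pretrained 2D diffusion priors and can produce high-quality, 3D-consistent results. However, these methods often perform poorly on out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate outputs. To address this, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant images from a large-scale in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. The model is trained with a novel hybrid strategy that combines structured multiview data with diverse 2D image collections. On one side, it trains on multiview data using augmented conditioning views that simulate retrieval variance, targeting view-specific reconstruction. On the other, it trains on sets of retrieved real-world 2D images with a distinctive held-out view prediction objective: the model predicts the held-out view from the other views, and so infers 3D consistency from 2D data. To support rigorous OOD evaluation, we construct a collection of challenging OOD prompts. Comparisons against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our method significantly improves 3D consistency, photorealism, and text adherence for OOD and rare concepts while remaining competitive on standard benchmarks.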

Summary

To address the limitations of text-to-3D generation methods in handling out-of-domain or rare concepts, which often result in inconsistent or inaccurate outputs, this work introduces MV-RAG, a retrieval-augmented pipeline. The method first retrieves relevant 2D images from a large database and conditions a multiview diffusion model on these images, using a hybrid training strategy that combines structured multiview data with diverse 2D collections via view-specific reconstruction and held-out view prediction. Experimental results demonstrate that MV-RAG significantly enhances 3D consistency, photorealism, and text adherence for challenging OOD prompts, outperforming state-of-the-art text-to-3D, image-to-3D, and personalization baselines while maintaining competitive performance on standard benchmarks.
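The held-out view prediction objective mentioned here can be pictured with a small data-preparation sketch. It shows only the split itself: from a set of images, each one in turn becomes the prediction target while the rest act as conditioning. The helper name `held_out_splits` and the placeholder image names are hypothetical, and the sketch assumes every image in the set may be held out; the paper's actual sampling scheme and loss are not specified in this digest.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def held_out_splits(views: Sequence[T]) -> List[Tuple[List[T], T]]:
    """Return one (conditioning_views, target_view) pair per input view.

    ASSUMPTION: each view is held out exactly once (leave-one-out). A real
    training loop might instead sample a single held-out view per step.
    """
    if len(views) < 2:
        raise ValueError("need at least two views to hold one out")
    splits = []
    for i, target in enumerate(views):
        # Condition on every view except the one being predicted.
        conditioning = [v for j, v in enumerate(views) if j != i]
        splits.append((conditioning, target))
    return splits


# Placeholder names standing in for retrieved 2D images of one concept.
retrieved = ["img_a", "img_b", "img_c"]
for conditioning, target in held_out_splits(retrieved):
    print(conditioning, "->", target)
```

Each pair would correspond to one training example in which the model sees the conditioning images and is asked to reproduce the target.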

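To address the problem that text-to-3D generation methods often produce inconsistent or inaccurate results for out-of-domain or rare concepts, this paper proposes MV-RAG, a retrieval-augmented multiview diffusion pipeline. The method first retrieves relevant images from a large-scale 2D image database and uses them as conditioning to train a multiview diffusion model. Its hybrid strategy combines structured multiview data with diverse 2D image collections and infers 3D consistency through view-specific reconstruction and held-out view prediction. Experiments show that MV-RAG significantly improves 3D consistency, photorealism, and text adherence on challenging out-of-domain concepts while remaining competitive on standard benchmarks.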

Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet

Authors: Anyu Ying, Natarajan Balaji Shankar, Chyi-Jiunn Lin, Mohan Shi, Pu Wang, Hye-jin Shim, Siddhant Arora, Hugo Van hamme, Abeer Alwan, Shinji Watanabe

First: 2025-08-22T17:59:35+00:00 · Latest: 2025-08-22T17:59:35+00:00

Comments: 5 pages, 3 figures, presented at WOCCI 2025 (Workshop on Child Computer Interaction), satellite workshop of Interspeech 2025