arXiv Paper Digest

MV-RAG: Retrieval Augmented Multiview Diffusion

Authors: Yosef Dayani, Omer Benishu, Sagie Benaim

First: 2025-08-22T17:59:40+00:00 · Latest: 2025-08-22T17:59:40+00:00

Comments: Project page: https://yosefdayani.github.io/MV-RAG

Abstract

Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
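The held-out view prediction objective described in the abstract can be illustrated with a small data-preparation sketch: from a set of images retrieved for one prompt, one image is set aside as the prediction target and the rest serve as conditioning. This is not the authors' code. The function name `split_held_out`, the file names, and the uniform random choice of the target are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_held_out(
    views: Sequence[str], rng: random.Random
) -> Tuple[List[str], str]:
    """Pick one view as the prediction target; the rest become conditioning.

    `views` holds identifiers (for example file names) of the images
    retrieved for one prompt. Choosing the target uniformly at random is an
    assumption of this sketch, not something the abstract specifies.
    """
    if len(views) < 2:
        raise ValueError("need at least two views: one target, one condition")
    target_index = rng.randrange(len(views))
    conditioning = [v for i, v in enumerate(views) if i != target_index]
    return conditioning, views[target_index]


if __name__ == "__main__":
    # Hypothetical retrieved images for a single prompt.
    retrieved = ["img_a.jpg", "img_b.jpg", "img_c.jpg", "img_d.jpg"]
    cond, target = split_held_out(retrieved, random.Random(0))
    print("condition on:", cond)
    print("predict:", target)
```

In a training loop, a model would receive the conditioning images and be penalized on how well it reconstructs the held-out one; that part depends on the diffusion model and is left out here.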

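Title and Abstract (Translated from Chinese)

Title: MV-RAG: Retrieval-Augmented Multiview Diffusion Model

Text-to-3D generation methods have made significant progress by leveraging pretrained 2D diffusion priors and can produce high-quality, 3D-consistent results. However, these methods often fail on out-of-domain (OOD) or rare concepts, producing inconsistent or inaccurate results. To address this, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant images from a large-scale real-world 2D database and then conditions a multiview diffusion model on those images to synthesize consistent and accurate multiview outputs. The retrieval-conditioned model is trained with a novel hybrid strategy that bridges structured multiview data and diverse 2D image collections: on one hand, it trains on multiview data with augmented conditioning views that simulate retrieval variance, for view-specific reconstruction; on the other, it trains on sets of retrieved real 2D images with a distinctive held-out view prediction objective, in which the model predicts the held-out view from the other views and thereby infers 3D consistency from 2D data. To support rigorous OOD evaluation, we build a collection of challenging OOD prompts. Comparisons with state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our method significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts while remaining competitive on standard benchmarks.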

Summary

Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs.

Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet

Authors: Anyu Ying, Natarajan Balaji Shankar, Chyi-Jiunn Lin, Mohan Shi, Pu Wang, Hye-jin Shim, Siddhant Arora, Hugo Van hamme, Abeer Alwan, Shinji Watanabe

First: 2025-08-22T17:59:35+00:00 · Latest: 2025-08-22T17:59:35+00:00

Comments: 5 pages, 3 figures, presented at WOCCI 2025 (Workshop on Child Computer Interaction), satellite workshop of Interspeech 2025

Abstract

Despite advancements in ASR, child speech recognition remains challenging due to acoustic variability and limited annotated data. While fine-tuning adult ASR models on child speech is common, comparisons with flat-start training remain underexplored. We compare flat-start training across multiple datasets, SSL representations (WavLM, XEUS), and decoder architectures. Our results show that SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. We also analyze model scaling, finding consistent improvements up to 1B parameters, beyond which performance plateaus. Additionally, age-related ASR and speaker verification analysis highlights the limitations of proprietary models like Whisper, emphasizing the need for open-data models for reliable child speech research. All investigations are conducted using ESPnet, and our publicly available benchmark provides insights into training strategies for robust child speech processing.

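Title and Abstract (Translated from Chinese)

Title: A Benchmark Study of Training Paradigms, Dataset Composition, and Model Scaling for Child Speech Recognition in ESPnet

Although automatic speech recognition (ASR) has advanced, child speech recognition remains challenging because of acoustic variability and limited annotated data. Fine-tuning adult ASR models on child speech is common practice, but comparisons with training from scratch remain insufficient. We compare flat-start training across multiple datasets, self-supervised learning (SSL) representations (WavLM, XEUS), and decoder architectures. The results show that SSL representations are biased toward adult speech and that flat-start training on child speech mitigates this bias. The model scaling analysis shows that performance keeps improving up to 1 billion parameters and then plateaus. Age-related ASR and speaker verification analyses reveal the limitations of proprietary models such as Whisper and underline the need for open-data models to support reliable child speech research. All studies are based on the ESPnet framework, and the public benchmark offers guidance on training strategies for robust child speech processing.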

Summary

Despite advancements in ASR, child speech recognition remains challenging due to acoustic variability and limited annotated data.

A Disease-Centric Vision-Language Foundation Model for Precision Oncology in Kidney Cancer

Authors: Yuhui Tao, Zhongwei Zhao, Zilong Wang, Xufang Luo, Feng Chen, Kang Wang, Chuanfu Wu, Xue Zhang, Shaoting Zhang, Jiaxi Yao, Xingwei Jin, Xinyang Jiang, Yifan Yang, Dongsheng Li, Lili Qiu, Zhiqiang Shao, Jianming Guo, Nengwang Yu, Shuo Wang, Ying Xiong

First: 2025-08-22T17:48:19+00:00 · Latest: 2025-08-22T17:48:19+00:00

Abstract

The non-invasive assessment of increasingly incidentally discovered renal masses is a critical challenge in urologic oncology, where diagnostic uncertainty frequently leads to the overtreatment of benign or indolent tumors. In this study, we developed and validated RenalCLIP, a vision-language foundation model for the characterization, diagnosis, and prognosis of renal masses, using a dataset of 27,866 CT scans from 8,809 patients across nine Chinese medical centers and the public TCIA cohort. The model was developed via a two-stage pre-training strategy that first enhances the image and text encoders with domain-specific knowledge and then aligns them through a contrastive learning objective, creating robust representations for superior generalization and diagnostic precision. Compared with other state-of-the-art general-purpose CT foundation models, RenalCLIP achieved better performance and superior generalizability across 10 core tasks spanning the full clinical workflow of kidney cancer, including anatomical assessment, diagnostic classification, and survival prediction. In particular, for complicated tasks such as recurrence-free survival prediction in the TCIA cohort, RenalCLIP achieved a C-index of 0.726, a substantial improvement of approximately 20% over the leading baselines. Furthermore, RenalCLIP's pre-training imparted remarkable data efficiency: in the diagnostic classification task, it needed only 20% of the training data to reach the peak performance of all baseline models, even after they were fully fine-tuned on 100% of the data. Additionally, it achieved superior performance in report generation, image-text retrieval, and zero-shot diagnosis tasks. Our findings establish that RenalCLIP provides a robust tool with the potential to enhance diagnostic accuracy, refine prognostic stratification, and personalize the management of patients with kidney cancer.
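The abstract says the image and text encoders are aligned "through a contrastive learning objective" but does not give its exact form. As a rough illustration only, the sketch below implements the symmetric, CLIP-style contrastive loss that is commonly used for this kind of alignment. Whether RenalCLIP uses exactly this loss is an assumption, and the function names and the temperature value of 0.07 are hypothetical choices for the example.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def contrastive_loss(
    image_emb: Sequence[Sequence[float]],
    text_emb: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Symmetric contrastive loss over a batch of matched image/text pairs.

    Pair i of `image_emb` is assumed to match pair i of `text_emb`.
    """
    imgs = [_normalize(v) for v in image_emb]
    txts = [_normalize(v) for v in text_emb]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    print(contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]]))  # matched pairs
    print(contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]]))  # swapped pairs
```

Matched pairs yield a loss near zero, while swapped pairs yield a large loss, which is the pressure that pulls corresponding image and report embeddings together.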

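Title and Abstract (Translated from Chinese)

Title: A Disease-Centric Vision-Language Foundation Model for Precision Oncology in Kidney Cancer

Non-invasive assessment of the growing number of incidentally discovered renal masses is a key challenge in urologic oncology, where diagnostic uncertainty often leads to overtreatment of benign or indolent tumors. Using a dataset of 27,866 CT scans from 8,809 patients across nine Chinese medical centers and the public TCIA cohort, this study developed and validated RenalCLIP, a vision-language foundation model for the characterization, diagnosis, and prognosis of renal masses. The model was developed with a two-stage pre-training strategy: the image and text encoders are first enhanced with domain-specific knowledge and then aligned through a contrastive learning objective, producing robust representations with strong generalization and diagnostic precision. Compared with other state-of-the-art general-purpose CT foundation models, RenalCLIP showed better performance and generalizability on 10 core tasks covering the full clinical workflow of kidney cancer, including anatomical assessment, diagnostic classification, and survival prediction. In particular, on complex tasks such as recurrence-free survival prediction in the TCIA cohort, RenalCLIP reached a C-index of 0.726, an improvement of about 20% over the leading baselines. In addition, RenalCLIP's pre-training gave it notable data efficiency: in the diagnostic classification task, it needed only 20% of the training data to reach the peak performance of all baseline models fully fine-tuned on 100% of the data. The model also achieved superior performance in report generation, image-text retrieval, and zero-shot diagnosis. Our study confirms that RenalCLIP is a powerful tool for improving diagnostic accuracy, refining prognostic stratification, and individualizing the management of patients with kidney cancer.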

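Summary

The non-invasive assessment of increasingly incidentally discovered renal masses is a critical challenge in urologic oncology, where diagnostic uncertainty frequently leads to the overtreatment of benign or indolent tumors.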

Closer to Reality: Practical Semi-Supervised Federated Learning for Foundation Model Adaptation

Authors: Guangyu Sun, Jingtao Li, Weiming Zhuang, Chen Chen, Chen Chen, Lingjuan Lyu

First: 2025-08-22T17:47:02+00:00 · Latest: 2025-08-22T17:47:02+00:00

Abstract

Foundation models (FMs) exhibit remarkable generalization but require adaptation to downstream tasks, particularly in privacy-sensitive applications. Due to data privacy regulations, cloud-based FMs cannot directly access private edge data, limiting their adaptation. Federated learning (FL) provides a privacy-aware alternative, but existing FL approaches overlook the constraints imposed by edge devices -- namely, limited computational resources and the scarcity of labeled data. To address these challenges, we introduce Practical Semi-Supervised Federated Learning (PSSFL), where edge devices hold only unlabeled, low-resolution data, while the server has limited labeled, high-resolution data. In this setting, we propose the Federated Mixture of Experts (FedMox), a novel framework that enhances FM adaptation in FL. FedMox tackles computational and resolution mismatch challenges via a sparse Mixture-of-Experts architecture, employing a spatial router to align features across resolutions and a Soft-Mixture strategy to stabilize semi-supervised learning. We take object detection as a case study, and experiments on real-world autonomous driving datasets demonstrate that FedMox effectively adapts FMs under PSSFL, significantly improving performance with constrained memory costs on edge devices. Our work paves the way for scalable and privacy-preserving FM adaptation in federated scenarios.
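To make the phrase "sparse Mixture-of-Experts architecture" concrete, here is a minimal sketch of generic top-k expert routing: a router scores every expert for an input, only the best `top_k` experts run, and their outputs are blended with softmax gates. This is the textbook mechanism, not FedMox itself; the abstract's spatial router and Soft-Mixture strategy are not reproduced, and `sparse_moe`, the linear router, and the toy experts are assumptions for illustration.

```python
import math
from typing import Callable, List, Sequence

Expert = Callable[[Sequence[float]], List[float]]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def sparse_moe(
    x: Sequence[float],
    router_weights: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    top_k: int = 1,
) -> List[float]:
    """Route `x` to the `top_k` highest-scoring experts and blend outputs.

    `router_weights[i]` is a hypothetical linear scoring vector for expert i.
    Experts outside the top k are never called, which is where the compute
    saving of a sparse mixture comes from.
    """
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)
    chosen = chosen[:top_k]
    gates = softmax([scores[i] for i in chosen])  # renormalize over chosen
    out = [0.0] * len(experts[chosen[0]](x))
    for gate, i in zip(gates, chosen):
        out = [o + gate * v for o, v in zip(out, experts[i](x))]
    return out


if __name__ == "__main__":
    toy_experts = [
        lambda v: [2.0 * a for a in v],  # expert 0 doubles the input
        lambda v: [-a for a in v],       # expert 1 negates the input
    ]
    router = [[1.0, 0.0], [0.0, 1.0]]
    print(sparse_moe([3.0, 1.0], router, toy_experts, top_k=1))
    print(sparse_moe([3.0, 1.0], router, toy_experts, top_k=2))
```

With `top_k=1` only one expert's cost is paid per input, which is the property that makes sparse mixtures attractive under the memory and compute limits of edge devices.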

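Title and Abstract (Translated from Chinese)

Title: Closer to Reality: Practical Semi-Supervised Federated Learning for Foundation Model Adaptation

Foundation models (FMs) show remarkable generalization but need to be adapted to downstream tasks, especially in privacy-sensitive applications. Because of data privacy regulations, cloud-based FMs cannot directly access private edge data, which limits their adaptation. Federated learning (FL) offers a privacy-preserving alternative, but existing FL methods overlook the constraints imposed by edge devices, namely limited computational resources and scarce labeled data. To address these challenges, we propose Practical Semi-Supervised Federated Learning (PSSFL), in which edge devices hold only unlabeled, low-resolution data while the server holds limited labeled, high-resolution data. In this setting, we propose the Federated Mixture of Experts (FedMox), a novel framework that enhances FM adaptation in FL. FedMox handles the computational and resolution mismatch challenges with a sparse Mixture-of-Experts architecture, using a spatial router to align features across resolutions and a Soft-Mixture strategy to stabilize semi-supervised learning. Taking object detection as a case study, experiments on real-world autonomous driving datasets show that FedMox effectively adapts FMs under PSSFL and significantly improves performance under the limited memory budget of edge devices. Our work paves the way for scalable, privacy-preserving FM adaptation in federated scenarios.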

Summary

Foundation models (FMs) exhibit remarkable generalization but require adaptation to downstream tasks, particularly in privacy-sensitive applications.