arXiv Paper Digest

CitySeg: A 3D Open Vocabulary Semantic Segmentation Foundation Model in City-scale Scenarios

Authors: Jialei Xu, Zizhuang Wei, Weikang You, Linyun Li, Weijian Sun

First: 2025-08-13T03:55:56+00:00 · Latest: 2025-08-13T03:55:56+00:00

Abstract

Semantic segmentation of city-scale point clouds is a critical technology for Unmanned Aerial Vehicle (UAV) perception systems, enabling the classification of 3D points without relying on any visual information to achieve comprehensive 3D understanding. However, existing models are frequently constrained by the limited scale of 3D data and the domain gap between datasets, which lead to reduced generalization capability. To address these challenges, we propose CitySeg, a foundation model for city-scale point cloud semantic segmentation that incorporates text modality to achieve open vocabulary segmentation and zero-shot inference. Specifically, in order to mitigate the issue of non-uniform data distribution across multiple domains, we customize the data preprocessing rules, and propose a local-global cross-attention network to enhance the perception capabilities of point networks in UAV scenarios. To resolve semantic label discrepancies across datasets, we introduce a hierarchical classification strategy. A hierarchical graph established according to the data annotation rules consolidates the data labels, and the graph encoder is used to model the hierarchical relationships between categories. In addition, we propose a two-stage training strategy and employ hinge loss to increase the feature separability of subcategories. Experimental results demonstrate that the proposed CitySeg achieves state-of-the-art (SOTA) performance on nine closed-set benchmarks, significantly outperforming existing approaches. Moreover, for the first time, CitySeg enables zero-shot generalization in city-scale point cloud scenarios without relying on visual information.
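
The abstract mentions a hinge loss for separating subcategories but does not give its form. As a rough illustration only, the sketch below implements a generic multi-class hinge (margin) loss in plain Python. The function name `multiclass_hinge_loss`, the `margin` default, the sum-over-classes form, and the example scores are all assumptions for illustration, not CitySeg's actual loss.

```python
def multiclass_hinge_loss(scores, target, margin=1.0):
    """Generic multi-class hinge loss (illustrative; not taken from the paper).

    Each non-target class contributes max(0, margin - s_target + s_other),
    so the loss is zero only when the target score beats every other
    score by at least `margin`.
    """
    target_score = scores[target]
    return sum(
        max(0.0, margin - target_score + s)
        for i, s in enumerate(scores)
        if i != target
    )


# Hypothetical scores for three sibling subcategories of one parent class.
well_separated = multiclass_hinge_loss([3.0, 0.5, 1.0], target=0)
poorly_separated = multiclass_hinge_loss([2.0, 0.5, 1.5], target=0)
```

The second call is penalized because one competing subcategory scores within the margin of the target, which is the kind of pressure a margin-based loss puts on feature separability.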

TL;DR

CitySeg is a foundation model for city-scale point cloud semantic segmentation that incorporates the text modality to enable open-vocabulary segmentation and zero-shot inference without relying on visual information, and it reports state-of-the-art results on nine closed-set benchmarks.

Method Card + Discussion

Method Card
- Task / Problem: Open-Vocabulary, Segmentation, 3D Vision
- Core Idea: Unify labels from multiple datasets through a hierarchical graph modeled by a graph encoder, pair it with a local-global cross-attention point network, and train in two stages with a hinge loss to separate subcategories.

Discussion
1. Are the gains over strong baselines consistent and significant?
2. What are the compute cost, latency, and memory overhead, and are the details sufficient for reproduction?
3. What are the failure modes and limitations, and what improvements are possible?
4. Do the data and metrics adequately support the conclusions, and is there any bias or overlap?
5. Can the method transfer to real-world applications or edge devices?

ReferSplat: Referring Segmentation in 3D Gaussian Splatting

Authors: Shuting He, Guangquan Jie, Changshuo Wang, Yun Zhou, Shuming Hu, Guanbin Li, Henghui Ding

Venue: ICML 2025 Oral

First: 2025-08-11T17:59:30+00:00 · Latest: 2025-08-11T17:59:30+00:00

Comments: ICML 2025 Oral, Code: https://github.com/heshuting555/ReferSplat

Abstract

We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at https://github.com/heshuting555/ReferSplat.

TL;DR

The paper introduces Referring 3D Gaussian Splatting Segmentation (R3DGS), the task of segmenting target objects in a 3D Gaussian scene from natural language descriptions, together with the Ref-LERF dataset and the ReferSplat framework.

Method Card + Discussion

Method Card
- Task / Problem: Open-Vocabulary, Segmentation, Referring / Grounding, 3D Vision
- Core Idea: Explicitly model 3D Gaussian points with natural language expressions in a spatially aware paradigm, so that described objects can be segmented even when they are occluded or not directly visible in a novel view.
- Venue: ICML 2025 Oral

Discussion
1. Are the gains over strong baselines consistent and significant?
2. What are the compute cost, latency, and memory overhead, and are the details sufficient for reproduction?
3. What are the failure modes and limitations, and what improvements are possible?
4. Do the data and metrics adequately support the conclusions, and is there any bias or overlap?
5. Can the method transfer to real-world applications or edge devices?

Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild

Authors: Haoran Wang, Zekun Li, Jian Zhang, Lei Qi, Yinghuan Shi

First: 2025-08-11T08:42:49+00:00 · Latest: 2025-08-11T08:42:49+00:00

Abstract

Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and incurs massive data and computational costs. In this study, we propose a novel approach that represents the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV-SAM). CAV-SAM comprises two key modules: the Diffusion-Based Semantic Transition (DBST) module employs a diffusion model to construct a semantic transformation sequence, while the Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test-time fine-tuning. We evaluated CAV-SAM on widely used datasets, achieving segmentation performance improvements exceeding 5% over SOTA methods. Implementation is provided in the supplementary materials.

TL;DR

CAV-SAM represents the correspondence between a reference image and a target image as a pseudo video, so that SAM2's interactive video object segmentation (iVOS) capability can be adapted to reference segmentation in a lightweight manner. The authors report improvements exceeding 5% over SOTA methods.

Method Card + Discussion

Method Card
- Task / Problem: Segmentation
- Core Idea: A Diffusion-Based Semantic Transition (DBST) module builds a semantic transformation sequence between reference and target, and a Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within that sequence through test-time fine-tuning.

Discussion
1. Are the gains over strong baselines consistent and significant?
2. What are the compute cost, latency, and memory overhead, and are the details sufficient for reproduction?
3. What are the failure modes and limitations, and what improvements are possible?
4. Do the data and metrics adequately support the conclusions, and is there any bias or overlap?
5. Can the method transfer to real-world applications or edge devices?

Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts

Authors: Kiran Chhatre, Christopher Peters, Srikrishna Karanam

First: 2025-08-08T05:36:20+00:00 · Latest: 2025-08-08T05:36:20+00:00

Comments: 16 pages, 11 figures

Abstract

Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types. Recent open-vocabulary segmentation approaches leverage pretrained text-to-image (T2I) diffusion model features for strong zero-shot transfer, but typically group entire humans into a single person category, failing to distinguish diverse clothing or detailed body parts. To address this, we propose Spectrum, a unified network for part-level pixel parsing (body parts and clothing) and instance-level grouping. While diffusion-based open-vocabulary models generalize well across tasks, their internal representations are not specialized for detailed human parsing. We observe that, unlike diffusion models with broad representations, image-driven 3D texture generators maintain faithful correspondence to input images, enabling stronger representations for parsing diverse clothing and body parts. Spectrum introduces a novel repurposing of an Image-to-Texture (I2Tx) diffusion model -- obtained by fine-tuning a T2I model on 3D human texture maps -- for improved alignment with body parts and clothing. From an input image, we extract human-part internal features via the I2Tx diffusion model and generate semantically valid masks aligned to diverse clothing categories through prompt-guided grounding. Once trained, Spectrum produces semantic segmentation maps for every visible body part and clothing category, ignoring standalone garments or irrelevant objects, for any number of humans in the scene. We conduct extensive cross-dataset experiments -- separately assessing body parts, clothing parts, unseen clothing categories, and full-body masks -- and demonstrate that Spectrum consistently outperforms baseline methods in prompt-based segmentation.

TL;DR

Spectrum is a unified network for part-level pixel parsing of body parts and clothing and for instance-level grouping, built on a repurposed Image-to-Texture (I2Tx) diffusion model and prompt-guided grounding.

Method Card + Discussion

Method Card
- Task / Problem: Open-Vocabulary, Segmentation, Grounding, 3D Vision
- Core Idea: Image-driven 3D texture generators stay faithful to the input image, so features extracted from an I2Tx diffusion model (a T2I model fine-tuned on 3D human texture maps) align better with body parts and clothing than features from general diffusion models.

Discussion
1. Are the gains over strong baselines consistent and significant?
2. What are the compute cost, latency, and memory overhead, and are the details sufficient for reproduction?
3. What are the failure modes and limitations, and what improvements are possible?
4. Do the data and metrics adequately support the conclusions, and is there any bias or overlap?
5. Can the method transfer to real-world applications or edge devices?

What Holds Back Open-Vocabulary Segmentation?

Authors: Josip Šarić, Ivan Martinović, Matej Kristan, Siniša Šegvić

Venue: ICCV 2025 Workshop

First: 2025-08-06T08:46:47+00:00 · Latest: 2025-08-06T08:46:47+00:00

Comments: Accepted for publication at ICCV 25 Workshop: What is Next in Multimodal Foundation Models?

Abstract

Standard segmentation setups are unable to deliver models that can recognize concepts outside the training taxonomy. Open-vocabulary approaches promise to close this gap through language-image pretraining on billions of image-caption pairs. Unfortunately, we observe that the promise is not delivered due to several bottlenecks that have caused the performance to plateau for almost two years. This paper proposes novel oracle components that identify and decouple these bottlenecks by taking advantage of ground-truth information. The presented validation experiments deliver important empirical findings that provide deeper insight into the failures of open-vocabulary models and suggest prominent approaches to unlock future research.

TL;DR

Open-vocabulary segmentation performance has plateaued for almost two years, and this paper uses oracle components to analyze why.

Method Card + Discussion

Method Card
- Task / Problem: Open-Vocabulary, Segmentation, Vision-Language
- Core Idea: Use oracle components that exploit ground-truth information to identify and decouple the bottlenecks that limit open-vocabulary segmentation.
- Venue: ICCV 2025 Workshop

Discussion
1. Are the gains over strong baselines consistent and significant?
2. What are the compute cost, latency, and memory overhead, and are the details sufficient for reproduction?
3. What are the failure modes and limitations, and what improvements are possible?
4. Do the data and metrics adequately support the conclusions, and is there any bias or overlap?
5. Can the method transfer to real-world applications or edge devices?

Multimodal Referring Segmentation: A Survey

Authors: Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, Yu-Gang Jiang

First: 2025-08-01T02:14:00+00:00 · Latest: 2025-08-05T11:42:44+00:00

Comments: Project Page: https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation

Abstract

Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field's background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.

TL;DR

A survey of multimodal referring segmentation, which segments target objects in images, videos, and 3D scenes based on referring expressions in text or audio format. It covers background, datasets, methods, Generalized Referring Expression (GREx), and benchmark comparisons.

Method Card + Discussion

Method Card
- Task / Problem: Segmentation, Referring / Grounding, 3D Vision, Vision-Language
- Core Idea: Summarize referring segmentation under a unified meta architecture and review representative methods for images, videos, and 3D scenes.

Discussion
1. Are the gains over strong baselines consistent and significant?
2. What are the compute cost, latency, and memory overhead, and are the details sufficient for reproduction?
3. What are the failure modes and limitations, and what improvements are possible?
4. Do the data and metrics adequately support the conclusions, and is there any bias or overlap?
5. Can the method transfer to real-world applications or edge devices?

Enhancing and Accelerating Brain MRI through Deep Learning Reconstruction Using Prior Subject-Specific Imaging

Authors: Amirmohammad Shamaei, Alexander Stebner, Salome Bosshart, Johanna Ospel, Gouri Ginde, Mariana Bento, Roberto Souza

First: 2025-07-28T21:39:36+00:00 · Latest: 2025-07-28T21:39:36+00:00

Abstract

Magnetic resonance imaging (MRI) is a crucial medical imaging modality. However, long acquisition times remain a significant challenge, leading to increased costs and reduced patient comfort. Recent studies have shown the potential of using deep learning models that incorporate information from prior subject-specific MRI scans to improve reconstruction quality of present scans. Integrating this prior information requires registration of the previous scan to the current image reconstruction, which can be time-consuming. We propose a novel deep-learning-based MRI reconstruction framework which consists of an initial reconstruction network, a deep registration model, and a transformer-based enhancement network. We validated our method on a longitudinal dataset of T1-weighted MRI scans with 2,808 images from 18 subjects at four acceleration factors (R5, R10, R15, R20). Quantitative metrics confirmed our approach's superiority over existing methods (p < 0.05, Wilcoxon signed-rank test). Furthermore, we analyzed the impact of our MRI reconstruction method on the downstream task of brain segmentation and observed improved accuracy and volumetric agreement with reference segmentations. Our approach also achieved a substantial reduction in total reconstruction time compared to methods that use traditional registration algorithms, making it more suitable for real-time clinical applications. The code associated with this work is publicly available at https://github.com/amirshamaei/longitudinal-mri-deep-recon.
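
For readers unfamiliar with the notation, an acceleration factor R means that roughly 1/R of the k-space lines are acquired. The sketch below builds a simple equispaced sampling mask for a given R. The equispaced pattern, the function name `equispaced_mask`, and the 300-line acquisition are illustrative assumptions, since the abstract does not state which undersampling pattern was used.

```python
def equispaced_mask(num_lines, acceleration):
    """Return a 0/1 list marking which phase-encode lines are acquired.

    Keeps every `acceleration`-th line, so about num_lines / acceleration
    lines are sampled. (Illustrative pattern only, not the paper's.)
    """
    if acceleration < 1:
        raise ValueError("acceleration factor must be >= 1")
    return [1 if i % acceleration == 0 else 0 for i in range(num_lines)]


# Number of lines kept for the four factors named in the abstract,
# using a hypothetical 300-line acquisition.
kept = {r: sum(equispaced_mask(300, r)) for r in (5, 10, 15, 20)}
```

At R20 only 15 of the 300 hypothetical lines remain, which gives a sense of how little measured data the reconstruction has to work with at the highest factor.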

TL;DR

A deep-learning MRI reconstruction framework uses prior subject-specific scans to improve reconstruction quality at acceleration factors from R5 to R20, while reducing total reconstruction time compared to methods that use traditional registration algorithms.

Method Card + Discussion

Method Card
- Task / Problem: Segmentation
- Core Idea: Use a deep registration model, together with an initial reconstruction network and a transformer-based enhancement network, to integrate a prior subject-specific scan without time-consuming traditional registration.

Discussion
1. Are the gains over strong baselines consistent and significant?
2. What are the compute cost, latency, and memory overhead, and are the details sufficient for reproduction?
3. What are the failure modes and limitations, and what improvements are possible?
4. Do the data and metrics adequately support the conclusions, and is there any bias or overlap?
5. Can the method transfer to real-world applications or edge devices?

Taking Language Embedded 3D Gaussian Splatting into the Wild

Authors: Yuze Wang, Yue Qi

First: 2025-07-26T07:00:32+00:00 · Latest: 2025-08-05T01:40:57+00:00

Comments: Visit our project page at https://yuzewang1998.github.io/takinglangsplatw/

Abstract

Recent advances in leveraging large-scale Internet photo collections for 3D reconstruction have enabled immersive virtual exploration of landmarks and historic sites worldwide. However, little attention has been given to the immersive understanding of architectural styles and structural knowledge, which remains largely confined to browsing static text-image pairs. Therefore, can we draw inspiration from 3D in-the-wild reconstruction techniques and use unconstrained photo collections to create an immersive approach for understanding the 3D structure of architectural components? To this end, we extend language embedded 3D Gaussian splatting (3DGS) and propose a novel framework for open-vocabulary scene understanding from unconstrained photo collections. Specifically, we first render multiple appearance images from the same viewpoint as the unconstrained image with the reconstructed radiance field, then extract multi-appearance CLIP features and two types of language feature uncertainty maps (transient and appearance uncertainty) derived from the multi-appearance features to guide the subsequent optimization process. Next, we propose a transient uncertainty-aware autoencoder, a multi-appearance language field 3DGS representation, and a post-ensemble strategy to effectively compress, learn, and fuse language features from multiple appearances. Finally, to quantitatively evaluate our method, we introduce PT-OVS, a new benchmark dataset for assessing open-vocabulary segmentation performance on unconstrained photo collections. Experimental results show that our method outperforms existing methods, delivering accurate open-vocabulary segmentation and enabling applications such as interactive roaming with open-vocabulary queries, architectural style pattern recognition, and 3D scene editing.
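
Open-vocabulary queries against a language field are commonly answered by comparing a text embedding with per-pixel language features. The sketch below shows that comparison with cosine similarity on toy vectors. The 3-dimensional features, the threshold, and the helper names `cosine_similarity` and `query_mask` are assumptions for illustration; the code does not reproduce the paper's pipeline or real CLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature as matching nothing
    return dot / (norm_a * norm_b)


def query_mask(pixel_features, text_feature, threshold=0.8):
    """Mark pixels whose language feature is close to the text feature."""
    return [
        cosine_similarity(feature, text_feature) >= threshold
        for feature in pixel_features
    ]


# Toy 3-D features standing in for rendered language features.
pixels = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
text = [1.0, 0.0, 0.0]  # hypothetical embedding of a text query
mask = query_mask(pixels, text)
```

In this toy case the first two pixels are selected and the third is rejected because its feature is orthogonal to the query.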

TL;DR

The paper extends language embedded 3D Gaussian splatting (3DGS) to open-vocabulary scene understanding from unconstrained photo collections and introduces PT-OVS, a benchmark for open-vocabulary segmentation in that setting.

Method Card + Discussion

Method Card
- Task / Problem: Open-Vocabulary, Segmentation, 3D Vision
- Core Idea: Render multiple appearance images from the viewpoint of each unconstrained photo, derive transient and appearance uncertainty maps from their CLIP features, and use a transient uncertainty-aware autoencoder, a multi-appearance language field, and a post-ensemble strategy to compress, learn, and fuse the language features.

Discussion
1. Are the gains over strong baselines consistent and significant?
2. What are the compute cost, latency, and memory overhead, and are the details sufficient for reproduction?
3. What are the failure modes and limitations, and what improvements are possible?
4. Do the data and metrics adequately support the conclusions, and is there any bias or overlap?
5. Can the method transfer to real-world applications or edge devices?

DiSCO-3D: Discovering and segmenting Sub-Concepts from Open-vocabulary queries in NeRF

Authors: Doriand Petit, Steve Bourgeois, Vincent Gay-Bellile, Florian Chabot, Loïc Barthe

Venue: ICCV 2025

First: 2025-07-19T12:46:20+00:00 · Latest: 2025-07-19T12:46:20+00:00

Comments: Published at ICCV'25

Abstract

3D semantic segmentation provides high-level scene understanding for applications in robotics, autonomous systems, etc. Traditional methods adapt exclusively to either task-specific goals (open-vocabulary segmentation) or scene content (unsupervised semantic segmentation). We propose DiSCO-3D, the first method addressing the broader problem of 3D Open-Vocabulary Sub-concepts Discovery, which aims to provide a 3D semantic segmentation that adapts to both the scene and user queries. We build DiSCO-3D on Neural Fields representations, combining unsupervised segmentation with weak open-vocabulary guidance. Our evaluations demonstrate that DiSCO-3D achieves effective performance in Open-Vocabulary Sub-concepts Discovery and exhibits state-of-the-art results in the edge cases of both open-vocabulary and unsupervised segmentation.

TL;DR

DiSCO-3D is presented as the first method for 3D Open-Vocabulary Sub-concepts Discovery, producing a 3D semantic segmentation that adapts to both the scene content and the user query.

Method Card + Discussion

Method Card
- Task / Problem: Open-Vocabulary, Segmentation, 3D Vision
- Core Idea: Combine unsupervised segmentation with weak open-vocabulary guidance on a Neural Fields representation.
- Venue: ICCV 2025

Discussion
1. Are the gains over strong baselines consistent and significant?
2. What are the compute cost, latency, and memory overhead, and are the details sufficient for reproduction?
3. What are the failure modes and limitations, and what improvements are possible?
4. Do the data and metrics adequately support the conclusions, and is there any bias or overlap?
5. Can the method transfer to real-world applications or edge devices?

LOSC: LiDAR Open-voc Segmentation Consolidator

Authors: Nermin Samet, Gilles Puy, Renaud Marlet

First: 2025-07-10T10:10:13+00:00 · Latest: 2025-07-10T10:10:13+00:00

Abstract

We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds. Yet, resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations. We then train a 3D network based on these refined labels. This simple method, called LOSC, outperforms the SOTA of zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI, with significant margins.
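
The back-projection step mentioned in the abstract can be sketched as follows: each 3D point is projected into the image with a pinhole camera model and takes the label of the pixel it lands on. This is a minimal sketch under stated assumptions: the points are already expressed in the camera frame, the intrinsics `fx`, `fy`, `cx`, `cy` are known, and the function name `backproject_labels` and the toy label map are hypothetical. It does not include LOSC's consolidation step.

```python
def backproject_labels(points_cam, label_map, fx, fy, cx, cy, unlabeled=-1):
    """Assign each 3D point the label of the image pixel it projects to.

    points_cam: iterable of (x, y, z) in the camera frame, z pointing forward.
    label_map:  2D list indexed as label_map[row][col] with per-pixel labels.
    Points behind the camera or outside the image get `unlabeled`.
    """
    height = len(label_map)
    width = len(label_map[0])
    labels = []
    for x, y, z in points_cam:
        if z <= 0:  # behind the camera: no valid projection
            labels.append(unlabeled)
            continue
        u = int(round(fx * x / z + cx))  # pixel column
        v = int(round(fy * y / z + cy))  # pixel row
        if 0 <= u < width and 0 <= v < height:
            labels.append(label_map[v][u])
        else:
            labels.append(unlabeled)
    return labels


# Toy 2x2 label map and four points (all values are made up).
toy_label_map = [[0, 1], [2, 3]]
toy_points = [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
toy_labels = backproject_labels(toy_points, toy_label_map, 1.0, 1.0, 0.0, 0.0)
```

Points that fall outside the image or behind the camera stay unlabeled, which is one reason the resulting point labels are sparse.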

TL;DR

LOSC uses image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings and reports significant gains over the zero-shot SOTA in semantic and panoptic segmentation on nuScenes and SemanticKITTI.

Method Card + Discussion

Method Card
- Task / Problem: Open-Vocabulary, Segmentation, 3D Vision, Vision-Language
- Core Idea: Consolidate the noisy, sparse labels obtained by back-projecting image semantics onto 3D point clouds, enforcing spatio-temporal consistency and robustness to image-level augmentations, then train a 3D network on the refined labels.
- Data / Benchmarks: SemanticKITTI, nuScenes

Discussion
1. Are the gains over strong baselines consistent and significant?
2. What are the compute cost, latency, and memory overhead, and are the details sufficient for reproduction?
3. What are the failure modes and limitations, and what improvements are possible?
4. Do the data and metrics adequately support the conclusions, and is there any bias or overlap?
5. Can the method transfer to real-world applications or edge devices?