arXiv Paper Digest

SLGaussian: Fast Language Gaussian Splatting in Sparse Views

Authors: Kangjie Chen, BingQuan Dai, Minghan Qin, Dongbin Zhang, Peihao Li, Yingshuang Zou, Haoqian Wang

Venue: ACM MM 2025

First: 2024-12-11T12:18:30+00:00 · Latest: 2025-08-18T08:08:13+00:00

Comments: Accepted by ACM MM 2025. Project page: https://chenkangjie1123.github.io/SLGaussian.github.io/

Abstract

3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse view conditions, relying on inefficient per-scene multi-view optimizations, which are impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, allowing direct inference of 3DGS-based scenes. By ensuring consistent SAM segmentations through video tracking and using low-dimensional indexing for high-dimensional CLIP features, SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse view conditions. In experiments on two-view sparse 3D object querying and segmentation in the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. Moreover, our model achieves scene inference in under 30 seconds and open-vocabulary querying in just 0.011 seconds per query.

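Translated Title / Abstract

Title: SLGaussian: Fast Language Gaussian Splatting for 3D Semantic Field Construction in Sparse Views

3D semantic field learning is essential for applications such as autonomous navigation, AR/VR, and robotics, which need to understand 3D scenes accurately from limited viewpoints. Existing methods perform poorly under sparse-view conditions because they rely on inefficient multi-view optimization, which is impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method that builds 3D semantic fields from sparse viewpoints and allows direct inference of 3DGS-based scenes. By using video tracking to keep SAM segmentations consistent and low-dimensional indexing to embed high-dimensional CLIP features, SLGaussian efficiently embeds language information in 3D space and offers a robust solution for accurate 3D scene understanding under sparse-view conditions. In two-view sparse 3D object querying and segmentation experiments on the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. In addition, the model needs less than 30 seconds for scene inference and only 0.011 seconds per open-vocabulary query.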

Summary

SLGaussian is a feed-forward method designed to construct 3D semantic fields from sparse viewpoints, enabling efficient and accurate 3D scene understanding. By leveraging consistent SAM segmentations through video tracking and low-dimensional indexing for high-dimensional CLIP features, SLGaussian embeds language information in 3D space. Experimental results show that SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU, and achieves scene inference in under 30 seconds and open-vocabulary querying in just 0.011 seconds per query.

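SLGaussian is a feed-forward method for constructing 3D semantic fields from sparse viewpoints, removing the reliance of existing methods on inefficient multi-view optimization. By using video tracking to obtain consistent segmentations and low-dimensional indexing of CLIP features, SLGaussian efficiently embeds language information in 3D space and outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. It also performs scene inference in under 30 seconds and answers each open-vocabulary query in just 0.011 seconds.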
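
The low-dimensional indexing idea mentioned in the SLGaussian abstract can be illustrated with a toy lookup table. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each Gaussian stores only a small integer segment id, that a separate table maps ids to feature vectors, and that a query is scored by cosine similarity against a threshold. The names `query_gaussians` and `feature_table`, and the 4-dimensional vectors standing in for CLIP embeddings, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def query_gaussians(gaussian_ids, feature_table, text_embedding, threshold=0.5):
    """Return indices of Gaussians whose segment feature matches the query.

    gaussian_ids: one small integer per Gaussian (the low-dimensional index).
    feature_table: segment id -> high-dimensional feature vector.
    """
    # Score each distinct segment once instead of scoring every Gaussian.
    scores = {sid: cosine(feat, text_embedding)
              for sid, feat in feature_table.items()}
    return [i for i, sid in enumerate(gaussian_ids) if scores[sid] >= threshold]


# Toy data: hypothetical 4-dimensional stand-ins for CLIP embeddings.
feature_table = {
    0: [1.0, 0.0, 0.0, 0.0],
    1: [0.0, 1.0, 0.0, 0.0],
    2: [0.7, 0.7, 0.0, 0.0],
}
gaussian_ids = [0, 0, 1, 2, 1, 0]      # one segment id per Gaussian
text_embedding = [1.0, 0.0, 0.0, 0.0]  # hypothetical encoded text query

selected = query_gaussians(gaussian_ids, feature_table, text_embedding,
                           threshold=0.9)
print(selected)
```

In this sketch the number of similarity computations depends on the number of segments rather than the number of Gaussians, which is one plausible reason an index-based layout could keep per-query cost low.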

Splat Feature Solver

Authors: Butian Xiong, Rong Liu, Kenneth Xu, Meida Chen, Andrew Feng

First: 2025-08-17T03:13:06+00:00 · Latest: 2025-08-17T03:13:06+00:00

Comments: webpage not that stable

Abstract

Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations. The core challenge lies in optimally assigning rich general attributes to 3D primitives while addressing the inconsistency issues from multi-view images. We present a unified, kernel- and feature-agnostic formulation of the feature lifting problem as a sparse linear inverse problem, which can be solved efficiently in closed form. Our approach admits a provable upper bound on the global optimal error under convex losses for delivering high quality lifted features. To address inconsistencies and noise in multi-view observations, we introduce two complementary regularization strategies to stabilize the solution and enhance semantic fidelity. Tikhonov Guidance enforces numerical stability through soft diagonal dominance, while Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing the lifted features in minutes. Code is available at https://github.com/saliteta/splat-distiller.git. We also have a webpage at https://splat-distiller.pages.dev/

Summary

The research aims to improve 3D scene understanding by optimally assigning rich image feature descriptors to 3D primitives. The method formulates the feature lifting problem as a sparse linear inverse problem, providing a closed-form solution with a provable upper bound on the global optimal error. Key findings show that the approach outperforms existing baselines on open-vocabulary 3D segmentation benchmarks, achieving state-of-the-art performance and producing lifted features in minutes.

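The research aims to improve 3D scene understanding by optimally assigning rich image feature descriptors to 3D primitives. The method formulates feature lifting as a sparse linear inverse problem and introduces regularization strategies to handle inconsistency and noise. Experiments show that it outperforms other baselines on 3D segmentation benchmarks and generates lifted features quickly.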
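
To make the "linear inverse problem with a closed-form solution" idea concrete, here is a minimal sketch using ordinary Tikhonov (ridge) regularization. It is an illustration under assumptions, not the paper's solver: it assumes a dense weight matrix W (pixels by primitives) of blending weights, an observation matrix F (pixels by feature dimensions), and the textbook solution X = (WᵀW + λI)⁻¹WᵀF. The function names `solve` and `lift_features` and the toy matrices are hypothetical, and the paper's Tikhonov Guidance and Post-Lifting Aggregation steps are not reproduced here.

```python
def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting.

    A is n x n and B is n x d, both given as lists of lists of floats.
    """
    n = len(A)
    # Build the augmented matrix [A | B] on copies of the inputs.
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        # Move the row with the largest pivot candidate into place.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]


def lift_features(W, F, lam=1e-3):
    """Ridge solution X = (W^T W + lam * I)^{-1} W^T F.

    W: pixels x primitives weights; F: pixels x d observed features.
    With lam > 0 the system matrix is positive definite, so it is solvable.
    """
    m, n, d = len(W), len(W[0]), len(F[0])
    WtW = [[sum(W[k][i] * W[k][j] for k in range(m)) + (lam if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    WtF = [[sum(W[k][i] * F[k][c] for k in range(m)) for c in range(d)]
           for i in range(n)]
    return solve(WtW, WtF)


# Toy problem: 3 pixels, 2 primitives, 2-dimensional features.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
F = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # generated from identity features
X = lift_features(W, F, lam=1e-9)
print(X)
```

A larger `lam` shrinks the recovered features toward zero in exchange for a better-conditioned system, which is the usual trade-off with this kind of regularization.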

Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception

Authors: Junjie Wang, Keyu Chen, Yulin Li, Bin Chen, Hengshuang Zhao, Xiaojuan Qi, Zhuotao Tian

First: 2025-08-15T06:43:51+00:00 · Latest: 2025-08-15T06:43:51+00:00

Comments: arXiv admin note: text overlap with arXiv:2505.04410

Abstract

Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP

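Translated Title / Abstract

Title: Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception

Dense visual perception tasks are limited by their dependence on predefined categories, which restricts their use in real-world scenarios where visual concepts are unbounded. Although vision-language models (VLMs) such as CLIP have shown promise in open-vocabulary tasks, applying them directly to dense perception often gives poor performance because of limitations in local feature representation. In this paper, we observe that CLIP's image tokens struggle to aggregate information from spatially or semantically related regions, which leaves the features lacking local discriminability and spatial consistency. To address this, we propose a new framework named DeCLIP, which decouples the self-attention module to obtain "content" and "context" features separately. The context features are enhanced by jointly distilling semantic correlations from vision foundation models (VFMs) and extracting object integrity cues from diffusion models, which improves spatial consistency. At the same time, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments show that DeCLIP lays a solid foundation for open-vocabulary dense perception and consistently achieves state-of-the-art performance across a wide range of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP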

Summary

This work addresses CLIP's weak local feature representation in dense visual perception by proposing DeCLIP, which decouples the self-attention module into content and context features to improve local discriminability and spatial consistency. Extensive experiments show that DeCLIP outperforms existing methods across various tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.

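The study proposes DeCLIP, which strengthens the local discriminability and spatial consistency of CLIP features, to address the limitations of dense visual perception tasks. Extensive experiments show that DeCLIP outperforms existing methods on a range of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.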