arXiv Paper Digest

2025-08-25 16:45
Snapshot: 20250825_1645
SLGaussian: Fast Language Gaussian Splatting in Sparse Views
Authors: Kangjie Chen, BingQuan Dai, Minghan Qin, Dongbin Zhang, Peihao Li, Yingshuang Zou, Haoqian Wang
Venue: ACM MM 2025
First: 2024-12-11T12:18:30+00:00 · Latest: 2025-08-18T08:08:13+00:00
Comments: Accepted by ACM MM 2025. Project page: https://chenkangjie1123.github.io/SLGaussian.github.io/
Abstract
3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse view conditions, relying on inefficient per-scene multi-view optimizations, which are impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, allowing direct inference of 3DGS-based scenes. By ensuring consistent SAM segmentations through video tracking and using low-dimensional indexing for high-dimensional CLIP features, SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse view conditions. In experiments on two-view sparse 3D object querying and segmentation in the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. Moreover, our model achieves scene inference in under 30 seconds and open-vocabulary querying in just 0.011 seconds per query.
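Title/abstract (translated from Chinese)
Title: SLGaussian: Fast Language Gaussian Splatting in Sparse Views
3D semantic field learning is essential for applications such as autonomous navigation, AR/VR, and robotics, which need to understand 3D scenes accurately from limited viewpoints. Existing methods perform poorly under sparse-view conditions because they rely on inefficient multi-view optimization, which is impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method that constructs 3D semantic fields from sparse viewpoints and allows direct inference of 3DGS-based scenes. By using video tracking to keep SAM segmentations consistent and low-dimensional indexing to represent high-dimensional CLIP features, SLGaussian embeds language information in 3D space efficiently, giving a robust solution for accurate 3D scene understanding under sparse views. In two-view sparse 3D object querying and segmentation experiments on the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, localization accuracy, and mIoU. The model also completes scene inference in under 30 seconds and answers each open-vocabulary query in 0.011 seconds.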
Summary
SLGaussian is a feed-forward method that constructs 3D semantic fields from sparse viewpoints, avoiding the multi-view optimization that limits existing methods. It keeps SAM segmentations consistent across views through video tracking and uses low-dimensional indexing of high-dimensional CLIP features to embed language information in 3D space. Experiments show that SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU, with scene inference in under 30 seconds and open-vocabulary querying in 0.011 seconds per query.
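The "low-dimensional indexing for high-dimensional CLIP features" idea can be illustrated with a small sketch. The abstract does not describe the actual indexing scheme, so everything here is an assumption: each Gaussian is taken to store one integer object ID, a per-scene table maps IDs to feature vectors, and random unit vectors stand in for real CLIP embeddings. The function names `build_codebook` and `query_gaussians` are hypothetical.

```python
import numpy as np


def build_codebook(num_objects, clip_dim, seed=0):
    """Stand-in for per-object CLIP features: random unit vectors (assumption)."""
    rng = np.random.default_rng(seed)
    feats = rng.normal(size=(num_objects, clip_dim))
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)


def query_gaussians(gaussian_index, codebook, text_feature):
    """Return the best-matching object ID and a boolean mask over Gaussians."""
    text = text_feature / np.linalg.norm(text_feature)
    scores = codebook @ text  # cosine similarity, one score per object
    best = int(np.argmax(scores))
    return best, gaussian_index == best


# Five Gaussians, each storing only a small integer index into the codebook.
codebook = build_codebook(num_objects=3, clip_dim=512)
gaussian_index = np.array([0, 2, 1, 2, 2])

# Use object 2's own feature as a stand-in for an encoded text query.
best, mask = query_gaussians(gaussian_index, codebook, codebook[2])
print(best, mask)
```

Under these assumptions a query costs one small matrix-vector product over the object table rather than a comparison against every Gaussian, which is one plausible reason an index-based layout makes querying cheap.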
Splat Feature Solver
Authors: Butian Xiong, Rong Liu, Kenneth Xu, Meida Chen, Andrew Feng
First: 2025-08-17T03:13:06+00:00 · Latest: 2025-08-17T03:13:06+00:00
Comments: webpage not that stable
Abstract
Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations. The core challenge lies in optimally assigning rich general attributes to 3D primitives while addressing the inconsistency issues from multi-view images. We present a unified, kernel- and feature-agnostic formulation of the feature lifting problem as a sparse linear inverse problem, which can be solved efficiently in closed form. Our approach admits a provable upper bound on the global optimal error under convex losses for delivering high quality lifted features. To address inconsistencies and noise in multi-view observations, we introduce two complementary regularization strategies to stabilize the solution and enhance semantic fidelity. Tikhonov Guidance enforces numerical stability through soft diagonal dominance, while Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing the lifted features in minutes. Code is available at https://github.com/saliteta/splat-distiller.git. See also https://splat-distiller.pages.dev/
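Title/abstract (translated from Chinese)
Title: Splat Feature Solver
Feature lifting has become a key component of 3D scene understanding: it attaches rich image feature descriptors (e.g., DINO, CLIP) to splat-based 3D representations. The core challenge is to assign rich general attributes to 3D primitives optimally while resolving inconsistencies across multi-view images. We formulate feature lifting as a unified, kernel- and feature-agnostic sparse linear inverse problem that can be solved efficiently in closed form. Under convex losses, our method admits a provable upper bound on the global optimal error, which supports high-quality lifted features. To handle inconsistency and noise in multi-view observations, we introduce two complementary regularization strategies that stabilize the solution and improve semantic fidelity: Tikhonov Guidance enforces numerical stability through soft diagonal dominance, and Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments show state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing the lifted features in minutes. Code is available at https://github.com/saliteta/splat-distiller.git. See also https://splat-distiller.pages.dev/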
Summary
The work addresses how to assign rich image feature descriptors optimally to the 3D primitives of splat-based representations. It formulates feature lifting as a sparse linear inverse problem with a closed-form solution and a provable upper bound on the global optimal error. Two regularization strategies, Tikhonov Guidance and Post-Lifting Aggregation, handle inconsistencies and noise in multi-view observations. The method achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks and produces lifted features in minutes. Code is available at https://github.com/saliteta/splat-distiller.git.
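To make "a linear inverse problem solved in closed form" concrete, here is a minimal sketch using standard Tikhonov (ridge) regularization. It is not the paper's formulation: the abstract does not give the exact objective, so the least-squares loss, the dense `weights` matrix of per-pixel blending weights, the parameter `lam`, and the function name `lift_features` are all assumptions for illustration. A practical version would use a sparse matrix instead of dense NumPy arrays.

```python
import numpy as np


def lift_features(weights, pixel_features, lam=1e-3):
    """Solve min_F ||W F - Y||^2 + lam * ||F||^2 in closed form.

    weights:        (num_pixels, num_splats) blending weights W (assumed given)
    pixel_features: (num_pixels, feat_dim) 2D image features Y
    returns:        (num_splats, feat_dim) per-splat features F
    """
    num_splats = weights.shape[1]
    # Adding lam * I raises every diagonal entry, so the system stays
    # solvable even when some splat is never observed (all-zero column).
    gram = weights.T @ weights + lam * np.eye(num_splats)
    rhs = weights.T @ pixel_features
    return np.linalg.solve(gram, rhs)


rng = np.random.default_rng(0)
true_features = rng.normal(size=(5, 4))   # 5 splats, 4-dim features
weights = rng.uniform(size=(50, 5))       # 50 pixels
pixels = weights @ true_features          # noise-free observations
lifted = lift_features(weights, pixels, lam=1e-8)
print(np.abs(lifted - true_features).max())
```

With noise-free observations and a tiny `lam`, the recovered features match the ground truth closely. A larger `lam` trades some accuracy for stability when views disagree or coverage is poor.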
Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception
Authors: Junjie Wang, Keyu Chen, Yulin Li, Bin Chen, Hengshuang Zhao, Xiaojuan Qi, Zhuotao Tian
First: 2025-08-15T06:43:51+00:00 · Latest: 2025-08-15T06:43:51+00:00
Comments: arXiv admin note: text overlap with arXiv:2505.04410
Abstract
Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP
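Title/abstract (translated from Chinese)
Title: Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception
Dense visual perception tasks are constrained by their reliance on predefined categories, which limits their use in real-world scenarios where visual concepts are unbounded. Vision-Language Models (VLMs) such as CLIP show promise on open-vocabulary tasks, but applying them directly to dense perception often gives poor results because of limitations in local feature representation. In this work, we observe that CLIP's image tokens struggle to aggregate information from spatially or semantically related regions, so the resulting features lack local discriminability and spatial consistency. To address this, we propose DeCLIP, a new framework that decouples the self-attention module to obtain "content" and "context" features separately. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, which improves spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments show that DeCLIP lays a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a range of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP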
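Summary
Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. The work proposes DeCLIP, a framework that enhances CLIP by decoupling the self-attention module to obtain content and context features. The context features are improved by combining semantic correlations from Vision Foundation Models (VFMs) with object integrity cues from diffusion models, while the content features are aligned with image crop representations and constrained by VFM region correlations to improve local discriminability. Extensive experiments show that DeCLIP outperforms existing methods across tasks including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.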
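One way to read "distilling semantic correlations" from a VFM is to match token-to-token similarity structure rather than the features themselves. The sketch below is only an illustration under that assumption. The abstract does not state the loss DeCLIP uses, so the cosine-similarity matrices, the mean-squared-error objective, and the names `correlation_matrix` and `correlation_distillation_loss` are hypothetical.

```python
import numpy as np


def correlation_matrix(tokens):
    """Pairwise cosine similarity between tokens, shape (num_tokens, num_tokens)."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    return normed @ normed.T


def correlation_distillation_loss(student_tokens, teacher_tokens):
    """Mean squared gap between student and teacher correlation matrices.

    The two token sets may have different feature dimensions, because only
    the (num_tokens, num_tokens) similarity structure is compared.
    """
    student_corr = correlation_matrix(student_tokens)
    teacher_corr = correlation_matrix(teacher_tokens)
    return float(np.mean((student_corr - teacher_corr) ** 2))


rng = np.random.default_rng(0)
teacher = rng.normal(size=(6, 16))  # stand-in for VFM patch tokens
student = rng.normal(size=(6, 8))   # stand-in for CLIP context features
print(correlation_distillation_loss(student, teacher))
```

Comparing similarity matrices sidesteps the mismatch in feature dimension between the two models, which is why this form is a common choice for correlation-style distillation.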
History