arXiv 论文速递

SLGaussian: Fast Language Gaussian Splatting in Sparse Views

Authors: Kangjie Chen, BingQuan Dai, Minghan Qin, Dongbin Zhang, Peihao Li, Yingshuang Zou, Haoqian Wang

Venue: ACM MM 2025

First: 2024-12-11T12:18:30+00:00 · Latest: 2025-08-18T08:08:13+00:00

Comments: Accepted by ACM MM 2025. Project page: https://chenkangjie1123.github.io/SLGaussian.github.io/

Abstract

3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse view conditions, relying on inefficient per-scene multi-view optimizations, which are impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, allowing direct inference of 3DGS-based scenes. By ensuring consistent SAM segmentations through video tracking and using low-dimensional indexing for high-dimensional CLIP features, SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse view conditions. In experiments on two-view sparse 3D object querying and segmentation in the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. Moreover, our model achieves scene inference in under 30 seconds and open-vocabulary querying in just 0.011 seconds per query.
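
The "low-dimensional indexing" idea mentioned in the abstract can be pictured with a small sketch. This is an illustration under assumptions, not SLGaussian's implementation: each Gaussian is assumed to store only an integer index into a small table of CLIP features (for example, one entry per tracked SAM segment), so a text query is scored once per table entry rather than once per Gaussian. The function names, the toy 3-dimensional vectors, and the threshold are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def query_gaussians(segment_features, gaussian_index, text_embedding, threshold=0.8):
    """Return ids of Gaussians whose indexed feature matches the text query."""
    # Score each table entry once; cost scales with the number of segments.
    scores = [cosine(f, text_embedding) for f in segment_features]
    # Each Gaussian only needs a table lookup through its stored index.
    return [g for g, idx in enumerate(gaussian_index) if scores[idx] >= threshold]

# Toy stand-ins for CLIP features (real CLIP embeddings are much larger).
segment_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
gaussian_index = [0, 0, 1, 1, 0]      # hypothetical per-Gaussian segment index
text_embedding = [0.9, 0.1, 0.0]      # hypothetical encoded text query
selected = query_gaussians(segment_features, gaussian_index, text_embedding)
print(selected)
```

Storing an index instead of a full feature vector keeps the per-Gaussian payload small, which is one plausible reason a query can be answered quickly.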

中文标题/摘要

标题：SLGaussian：稀疏视角下的快速语言高斯泼溅技术

三维语义场学习对于自动驾驶导航、增强现实/虚拟现实（AR/VR）以及机器人技术等应用至关重要，这些应用需要从有限视角准确理解三维场景。现有方法在稀疏视角条件下表现不佳，依赖于低效的逐场景多视角优化，这在许多实际任务中不切实际。为此，我们提出SLGaussian，一种前馈方法，用于从稀疏视角构建三维语义场，支持直接推断基于3DGS的场景。通过视频跟踪确保一致的SAM分割，并利用低维索引处理高维CLIP特征，SLGaussian高效地将语言信息嵌入三维空间，为稀疏视角条件下的精确三维场景理解提供了稳健解决方案。在LERF和3D-OVS数据集上的双视角稀疏三维物体查询与分割实验中，SLGaussian在选定的IoU、定位精度和mIoU指标上均优于现有方法。此外，我们的模型在30秒内完成场景推断，每项开放词汇查询仅需0.011秒。

Summary / 总结

3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential.

Splat Feature Solver

Authors: Butian Xiong, Rong Liu, Kenneth Xu, Meida Chen, Andrew Feng

First: 2025-08-17T03:13:06+00:00 · Latest: 2025-08-17T03:13:06+00:00

Comments: webpage not that stable

Abstract

Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations. The core challenge lies in optimally assigning rich general attributes to 3D primitives while addressing the inconsistency issues from multi-view images. We present a unified, kernel- and feature-agnostic formulation of the feature lifting problem as a sparse linear inverse problem, which can be solved efficiently in closed form. Our approach admits a provable upper bound on the global optimal error under convex losses for delivering high-quality lifted features. To address inconsistencies and noise in multi-view observations, we introduce two complementary regularization strategies to stabilize the solution and enhance semantic fidelity. Tikhonov Guidance enforces numerical stability through soft diagonal dominance, while Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing the lifted features in minutes. Code is available at https://github.com/saliteta/splat-distiller.git. A demo page is also available at https://splat-distiller.pages.dev/.
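
To make the "linear inverse problem with a closed-form solution" framing concrete, here is a minimal sketch of Tikhonov-regularized least squares on a toy system. It is not the paper's solver: the abstract does not give the exact formulation, so the matrix `A` (assumed to hold per-pixel blending weights of each splat), the observation vector `b` (one feature channel per pixel), and the helper names are all hypothetical. The sketch only shows how adding `lam` to the diagonal of the normal matrix damps the solution.

```python
def solve_linear(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting.

    Assumes M is square and non-singular (true here whenever lam > 0
    or the columns of A are linearly independent).
    """
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def lift_features(A, b, lam):
    """Minimise ||A x - b||^2 + lam * ||x||^2 via the normal equations."""
    n = len(A[0])
    # Normal matrix A^T A with lam added on the diagonal (the Tikhonov term).
    AtA = [[sum(row[i] * row[j] for row in A) + (lam if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    Atb = [sum(row[i] * y for row, y in zip(A, b)) for i in range(n)]
    return solve_linear(AtA, Atb)

# Rows are pixels, columns are splats; entries are hypothetical blending weights.
A = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
b = [2.0, 3.0, 4.0]                       # one observed feature channel per pixel
x_plain = lift_features(A, b, lam=0.0)    # unregularized least squares
x_damped = lift_features(A, b, lam=1.0)   # regularized, shrunk toward zero
print(x_plain, x_damped)
```

A real scene would have a very large, sparse `A`, so a dense elimination like this is only suitable for illustrating the algebra.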

中文标题/摘要

标题：Splat特征求解器

特征提升已成为3D场景理解的关键组成部分，能够将丰富的图像特征描述符（如DINO、CLIP）附加到基于splat的3D表示上。核心挑战在于如何最优地将丰富通用属性分配给3D图元，同时解决多视角图像的不一致性问题。我们提出了一个统一、与内核和特征无关的特征提升问题稀疏线性逆问题表述，可通过闭式解高效求解。该方法在凸损失下可证明全局最优误差上界，从而提供高质量提升特征。针对多视角观测中的不一致性和噪声，我们引入两种互补的正则化策略：吉洪诺夫指导通过软对角占优确保数值稳定性，后提升聚合通过特征聚类过滤噪声输入。大量实验表明，我们的方法在开放词汇3D分割基准上达到最先进性能，在几分钟内生成提升特征的同时，优于基于训练、分组和启发式的前沿基线。代码发布于 https://github.com/saliteta/splat-distiller.git ，另设演示页面 https://splat-distiller.pages.dev/ 。

Summary / 总结

Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations.

Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception

Authors: Junjie Wang, Keyu Chen, Yulin Li, Bin Chen, Hengshuang Zhao, Xiaojuan Qi, Zhuotao Tian

First: 2025-08-15T06:43:51+00:00 · Latest: 2025-08-15T06:43:51+00:00

Comments: arXiv admin note: text overlap with arXiv:2505.04410

Abstract

Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP

中文标题/摘要

标题：广义解耦学习增强开放词汇密集感知

密集视觉感知任务长期受限于预定义类别，难以适应现实世界中无边界视觉概念的应用场景。尽管CLIP等视觉语言模型在开放词汇任务中展现出潜力，但其直接应用于密集感知时，由于局部特征表示的限制往往导致性能欠佳。本研究观察到CLIP的图像令牌难以有效聚合空间或语义相关区域的信息，导致特征缺乏局部判别性和空间一致性。为此，我们提出DeCLIP框架，通过解耦自注意力模块分别获取'内容'与'上下文'特征。上下文特征通过联合蒸馏视觉基础模型的语义关联和扩散模型的物体完整性线索得到增强，从而提升空间一致性；同时，内容特征与图像裁剪表示对齐，并受视觉基础模型的区域相关性约束以改善局部判别性。大量实验表明，DeCLIP为开放词汇密集感知奠定了坚实基础，在2D检测与分割、3D实例分割、视频实例分割及6D物体姿态估计等广泛任务中持续取得最先进性能。代码已开源：https://github.com/xiaomoguhz/DeCLIP

Summary / 总结

Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded.

DualMap: Online Open-Vocabulary Semantic Mapping for Natural Language Navigation in Dynamic Changing Scenes

Authors: Jiajun Jiang, Yiming Zhu, Zirui Wu, Jie Song

First: 2025-06-02T17:59:10+00:00 · Latest: 2025-08-13T07:21:25+00:00

Comments: 14 pages, 14 figures. Code: https://github.com/Eku127/DualMap Project page: https://eku127.github.io/DualMap/

Abstract

We introduce DualMap, an online open-vocabulary mapping system that enables robots to understand and navigate dynamically changing environments through natural language queries. Designed for efficient semantic mapping and adaptability to changing environments, DualMap meets the essential requirements for real-world robot navigation applications. Our proposed hybrid segmentation frontend and object-level status check eliminate the costly 3D object merging required by prior methods, enabling efficient online scene mapping. The dual-map representation combines a global abstract map for high-level candidate selection with a local concrete map for precise goal-reaching, effectively managing and updating dynamic changes in the environment. Through extensive experiments in both simulation and real-world scenarios, we demonstrate state-of-the-art performance in 3D open-vocabulary segmentation, efficient scene mapping, and online language-guided navigation. Project page: https://eku127.github.io/DualMap/

中文标题/摘要

标题：DualMap：动态变化场景中自然语言导航的在线开放词汇语义映射

我们推出DualMap，一种在线开放词汇映射系统，使机器人能通过自然语言查询理解并导航动态变化的环境。该系统专为高效语义映射和适应环境变化而设计，满足现实世界机器人导航应用的核心需求。提出的混合分割前端和对象级状态检查消除了先前方法所需的昂贵3D对象合并，实现了高效的在线场景映射。双地图表示结合了用于高层候选选择的全局抽象地图与用于精确抵达目标的局部具体地图，有效管理并更新环境中的动态变化。通过仿真和真实场景的广泛实验，我们在3D开放词汇分割、高效场景映射和在线语言引导导航方面展示了最先进的性能。项目页面：https://eku127.github.io/DualMap/

Summary / 总结

We introduce DualMap, an online open-vocabulary mapping system that enables robots to understand and navigate dynamically changing environments through natural language queries.

CitySeg: A 3D Open Vocabulary Semantic Segmentation Foundation Model in City-scale Scenarios

Authors: Jialei Xu, Zizhuang Wei, Weikang You, Linyun Li, Weijian Sun

First: 2025-08-13T03:55:56+00:00 · Latest: 2025-08-13T03:55:56+00:00

Abstract

Semantic segmentation of city-scale point clouds is a critical technology for Unmanned Aerial Vehicle (UAV) perception systems, enabling the classification of 3D points without relying on any visual information to achieve comprehensive 3D understanding. However, existing models are frequently constrained by the limited scale of 3D data and the domain gap between datasets, which lead to reduced generalization capability. To address these challenges, we propose CitySeg, a foundation model for city-scale point cloud semantic segmentation that incorporates text modality to achieve open vocabulary segmentation and zero-shot inference. Specifically, in order to mitigate the issue of non-uniform data distribution across multiple domains, we customize the data preprocessing rules, and propose a local-global cross-attention network to enhance the perception capabilities of point networks in UAV scenarios. To resolve semantic label discrepancies across datasets, we introduce a hierarchical classification strategy. A hierarchical graph established according to the data annotation rules consolidates the data labels, and the graph encoder is used to model the hierarchical relationships between categories. In addition, we propose a two-stage training strategy and employ hinge loss to increase the feature separability of subcategories. Experimental results demonstrate that the proposed CitySeg achieves state-of-the-art (SOTA) performance on nine closed-set benchmarks, significantly outperforming existing approaches. Moreover, for the first time, CitySeg enables zero-shot generalization in city-scale point cloud scenarios without relying on visual information.

中文标题/摘要

标题：CitySeg：城市场景下的三维开放词汇语义分割基础模型

城市级点云语义分割是无人机感知系统的关键技术，通过对三维点进行无需视觉信息的分类实现全面三维理解。然而现有模型常受限于三维数据规模有限及数据集间的领域差异，导致泛化能力下降。为此我们提出CitySeg——融合文本模态实现开放词汇分割与零样本推理的城市级点云语义分割基础模型。针对多领域数据分布不均问题，定制数据预处理规则并提出局部-全局交叉注意力网络以增强点网络在无人机场景中的感知能力；通过建立符合数据标注规则的层次化图谱整合标签，并利用图编码器建模类别层级关系来解决语义标签差异；采用两阶段训练策略和铰链损失提升子类特征可分性。实验表明CitySeg在九个封闭集基准上达到最先进性能，显著优于现有方法，并首次实现不依赖视觉信息的城市级点云零样本泛化。

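Summary / 总结

Semantic segmentation of city-scale point clouds is a critical technology for Unmanned Aerial Vehicle (UAV) perception systems, enabling the classification of 3D points without relying on any visual information to achieve comprehensive 3D understanding.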

ReferSplat: Referring Segmentation in 3D Gaussian Splatting

Authors: Shuting He, Guangquan Jie, Changshuo Wang, Yun Zhou, Shuming Hu, Guanbin Li, Henghui Ding

Venue: ICML 2025 Oral

First: 2025-08-11T17:59:30+00:00 · Latest: 2025-08-11T17:59:30+00:00

Comments: ICML 2025 Oral, Code: https://github.com/heshuting555/ReferSplat

Abstract

We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at https://github.com/heshuting555/ReferSplat.

中文标题/摘要

标题：ReferSplat：基于3D高斯泼溅的指代分割

我们提出了指代式3D高斯泼溅分割（R3DGS）这一新任务，旨在通过自然语言描述（常包含空间关系或物体属性）对3D高斯场景中的目标物体进行分割。该任务要求模型识别在新视角下可能被遮挡或不可见的新描述物体，这对3D多模态理解提出了重大挑战。发展此能力对推进具身人工智能至关重要。为支持该领域研究，我们构建了首个R3DGS数据集Ref-LERF。分析表明，3D多模态理解与空间关系建模是R3DGS的核心挑战。为此，我们提出ReferSplat框架，在空间感知范式下显式建模自然语言表达与3D高斯点的关联。ReferSplat在新建的R3DGS任务和3D开放词汇分割基准上均实现了最先进性能。数据集与代码详见https://github.com/heshuting555/ReferSplat。

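Summary / 总结

We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes.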

Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images

Authors: Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie Wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, Eunbyung Park

First: 2025-08-05T16:54:55+00:00 · Latest: 2025-08-11T03:47:38+00:00

Comments: The code is available at https://github.com/HorizonRobotics/Uni3R

Abstract

Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction, all within a single, feed-forward pass. Extensive experiments demonstrate that Uni3R establishes a new state-of-the-art across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet. Our work signifies a novel paradigm towards generalizable, unified 3D scene reconstruction and understanding. The code is available at https://github.com/HorizonRobotics/Uni3R.
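
For readers unfamiliar with the two metrics quoted above, the following sketch shows the standard definitions of PSNR and mIoU on toy data. It is not Uni3R's evaluation code: benchmark protocols differ in details such as ignored labels or per-image versus dataset-level accumulation, and the function names and inputs here are hypothetical.

```python
import math

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for flat lists of pixel values."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def mean_iou(pred_labels, true_labels, num_classes):
    """Mean intersection-over-union across classes that actually occur."""
    ious = []
    for c in range(num_classes):
        pairs = list(zip(pred_labels, true_labels))
        inter = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy inputs: two pixels for PSNR, four labelled points for mIoU.
toy_psnr = psnr([0.5, 0.5], [0.4, 0.6])
toy_miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(round(toy_psnr, 2), round(toy_miou, 4))
```

PSNR is reported in decibels, while mIoU is a ratio between 0 and 1 that papers usually quote as a percentage.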

中文标题/摘要

标题：Uni3R：通过未标定多视角图像中可泛化的高斯溅射实现统一的三维重建与语义理解

从稀疏二维视图重建并语义解析三维场景始终是计算机视觉领域的核心挑战。传统方法常将语义理解与重建过程解耦，或需进行昂贵的逐场景优化，从而限制了其可扩展性与泛化能力。本文提出Uni3R——一种新颖的前馈框架，可直接从未标定多视角图像中联合重建具有开放词汇语义的统一三维场景表征。该方法通过跨视角Transformer鲁棒地整合任意多视角输入信息，进而回归出带有语义特征场的三维高斯基元集合。这种统一表征可在单次前馈过程中实现高保真新视角合成、开放词汇三维语义分割及深度预测。大量实验表明，Uni3R在多个基准测试中创下新纪录，包括RE10K数据集上25.07的PSNR值和ScanNet数据集上55.84的mIoU值。本工作标志着向可泛化、统一化的三维场景重建与理解新范式的重大迈进。代码已开源：https://github.com/HorizonRobotics/Uni3R

Summary / 总结

Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision.

Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction

Authors: Yunheng Li, Yuxuan Li, Quansheng Zeng, Wenhai Wang, Qibin Hou, Ming-Ming Cheng

Venue: ICCV 2025

First: 2024-12-09T06:34:23+00:00 · Latest: 2025-08-10T11:17:34+00:00

Comments: Accepted at ICCV 2025. The code is available at https://github.com/HVision-NKU/DenseVLM

Abstract

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from significant "foreground bias", where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. DenseVLM leverages the pre-trained VLM to retrieve categories for unlabeled regions and then decouples the interference between foreground and background features. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods, leading to notable performance improvements. Furthermore, it exhibits promising zero-shot scalability when training on more extensive and diverse datasets. Our code is available at https://github.com/HVision-NKU/DenseVLM.

中文标题/摘要

标题：开放词汇密集预测的无偏区域-语言对齐

预训练视觉-语言模型（如CLIP）已展现出卓越的零样本识别能力，但在密集预测任务中仍表现不足。自蒸馏技术近期成为无需大量标注即可微调VLM以适应局部区域的有效方法。然而，现有先进方法常存在显著的前景偏差，即模型易将背景区域误判为前景对象。为缓解此问题，我们提出DenseVLM框架，通过预训练VLM表征学习无偏的区域-语言对齐。该框架利用预训练VLM检索未标注区域的类别，并解耦前景与背景特征间的干扰。实验表明，DenseVLM可直接替代开放词汇目标检测和图像分割方法中的原始VLM，实现显著性能提升，且在更广泛多样数据集训练时展现出良好的零样本扩展性。代码已开源：https://github.com/HVision-NKU/DenseVLM

Summary / 总结

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks.

Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation

Authors: Jiho Choi, Seonho Lee, Minhyun Lee, Seungho Lee, Hyunjung Shim

Venue: CVPR 2025

First: 2025-01-16T17:40:19+00:00 · Latest: 2025-08-08T08:51:23+00:00

Comments: CVPR 2025

Abstract

Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.

中文标题/摘要

标题：细粒度图像-文本对应与成本聚合在开放词汇部件分割中的应用

开放词汇部件分割（OVPS）是一个新兴领域，旨在识别未见类别中的细粒度部件。我们识别出OVPS中的两个主要挑战：（1）对齐部件级图像-文本对应的困难，（2）在分割对象部件时缺乏结构理解。为解决这些问题，我们提出了PartCATSeg，这是一个新颖框架，集成了对象感知的部件级成本聚合、组合损失以及来自DINO的结构指导。我们的方法采用解耦成本聚合策略，分别处理对象和部件级成本，从而提升部件级分割的精确度。我们还引入了组合损失以更好地捕捉部件-对象关系，弥补部件标注的不足。此外，DINO特征的结构指导改善了边界划分和部件间理解。在Pascal-Part-116、ADE20K-Part-234和PartImageNet数据集上的大量实验表明，我们的方法显著优于现有最先进方法，为对未见部件类别的鲁棒泛化设立了新基准。

Summary / 总结

Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories.

SynSeg: Feature Synergy for Multi-Category Contrastive Learning in Open-Vocabulary Semantic Segmentation

Authors: Weichen Zhang, Kebin Liu, Fan Dang, Zhui Zhu, Xikai Sun, Yunhao Liu

First: 2025-08-08T08:26:41+00:00 · Latest: 2025-08-08T08:26:41+00:00

Abstract

Semantic segmentation in open-vocabulary scenarios presents significant challenges due to the wide range and granularity of semantic categories. Existing weakly-supervised methods often rely on category-specific supervision and ill-suited feature construction methods for contrastive learning, leading to semantic misalignment and poor performance. In this work, we propose a novel weakly-supervised approach, SynSeg, to address these challenges. SynSeg performs Multi-Category Contrastive Learning (MCCL) as a stronger training signal with a new feature reconstruction framework named Feature Synergy Structure (FSS). Specifically, the MCCL strategy robustly combines both intra- and inter-category alignment and separation so that the model learns the correlations between different categories within the same image. Moreover, FSS reconstructs discriminative features for contrastive learning through prior fusion and semantic-activation-map enhancement, effectively avoiding the foreground bias introduced by the visual encoder. In general, SynSeg effectively improves semantic localization and discrimination under weak supervision. Extensive experiments on benchmarks demonstrate that our method outperforms state-of-the-art (SOTA) methods. For instance, SynSeg achieves higher accuracy than SOTA baselines by 4.5% on VOC, 8.9% on Context, 2.6% on Object and 2.0% on City.

中文标题/摘要

标题：SynSeg：开放词汇语义分割中多类别对比学习的特征协同

开放词汇场景下的语义分割因语义类别范围广泛且粒度精细而面临巨大挑战。现有弱监督方法常依赖特定类别的监督和不适合对比学习的特征构建方法，导致语义错位和性能不佳。本研究提出新型弱监督方法SynSeg，通过多类别对比学习（MCCL）作为更强训练信号，并结合名为特征协同结构（FSS）的新特征重构框架。具体而言，MCCL策略鲁棒地结合了类内与类间的对齐与分离，使模型能够学习同一图像中不同类别间的相关性知识。此外，FSS通过先验融合和语义激活图增强来重构判别性特征以进行对比学习，有效避免了视觉编码器引入的前景偏差。总体而言，SynSeg显著提升了弱监督下的语义定位与判别能力。在基准测试上的大量实验表明，本方法优于现有最先进（SOTA）性能，例如在VOC数据集上准确率比SOTA基线高4.5%，Context高8.9%，Object高2.6%，Cityscapes高2.0%。

Summary / 总结

Semantic segmentation in open-vocabulary scenarios presents significant challenges due to the wide range and granularity of semantic categories.

Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts

Authors: Kiran Chhatre, Christopher Peters, Srikrishna Karanam

First: 2025-08-08T05:36:20+00:00 · Latest: 2025-08-08T05:36:20+00:00

Comments: 16 pages, 11 figures

Abstract

Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types. Recent open-vocabulary segmentation approaches leverage pretrained text-to-image (T2I) diffusion model features for strong zero-shot transfer, but typically group entire humans into a single person category, failing to distinguish diverse clothing or detailed body parts. To address this, we propose Spectrum, a unified network for part-level pixel parsing (body parts and clothing) and instance-level grouping. While diffusion-based open-vocabulary models generalize well across tasks, their internal representations are not specialized for detailed human parsing. We observe that, unlike diffusion models with broad representations, image-driven 3D texture generators maintain faithful correspondence to input images, enabling stronger representations for parsing diverse clothing and body parts. Spectrum introduces a novel repurposing of an Image-to-Texture (I2Tx) diffusion model (obtained by fine-tuning a T2I model on 3D human texture maps) for improved alignment with body parts and clothing. From an input image, we extract human-part internal features via the I2Tx diffusion model and generate semantically valid masks aligned to diverse clothing categories through prompt-guided grounding. Once trained, Spectrum produces semantic segmentation maps for every visible body part and clothing category, ignoring standalone garments or irrelevant objects, for any number of humans in the scene. We conduct extensive cross-dataset experiments (separately assessing body parts, clothing parts, unseen clothing categories, and full-body masks) and demonstrate that Spectrum consistently outperforms baseline methods in prompt-based segmentation.

中文标题/摘要

标题：学习用于解析多样化人体服装与部位的三维纹理感知表示

现有人体解析方法常采用固定掩码类别和宽泛标签，难以区分细粒度服装类型。尽管基于预训练文本到图像扩散模型的开放词汇分割方法具有强大的零样本迁移能力，但通常将整个人体归为单一类别，无法辨别多样化服装或细节部位。为此，我们提出Spectrum——一个统一网络，可实现部件级像素解析（人体部位与服装）和实例级分组。虽然扩散模型能良好泛化，但其内部表示未针对精细人体解析优化。我们发现，与具有宽泛表示的扩散模型不同，图像驱动的三维纹理生成器能保持与输入图像的高度对应性，从而为解析多样化服装和身体部位提供更强表征。Spectrum创新性地重构了图像到纹理扩散模型（通过对T2I模型在三维人体纹理图上微调获得），以提升与身体部位及服装的对齐能力。通过从输入图像提取人体部件内部特征，并经由提示词引导生成与多样化服装类别对齐的语义有效掩码。训练后的Spectrum可为场景中任意数量人体生成所有可见身体部位和服装类别的语义分割图，忽略独立衣物或无关物体。通过跨数据集实验（分别评估身体部位、服装部件、未见服装类别及全身掩码），我们证明Spectrum在基于提示的分割中持续优于基线方法。

Summary / 总结

Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types.

EarthSynth: Generating Informative Earth Observation with Diffusion Models

Authors: Jiancheng Pan, Shiye Lei, Yuqian Fu, Jiahao Li, Yanxing Liu, Yuze Sun, Xiao He, Long Peng, Xiaomeng Huang, Bo Zhao

First: 2025-05-17T18:27:15+00:00 · Latest: 2025-08-07T10:33:17+00:00

Comments: 25 pages

Abstract

Remote sensing image (RSI) interpretation typically faces challenges due to the scarcity of labeled data, which limits the performance of RSI interpretation tasks. To tackle this challenge, we propose EarthSynth, a diffusion-based generative foundation model that enables synthesizing multi-category, cross-satellite labeled Earth observation for downstream RSI interpretation tasks. To the best of our knowledge, EarthSynth is the first to explore multi-task generation for remote sensing, tackling the challenge of limited generalization in task-oriented synthesis for RSI interpretation. EarthSynth, trained on the EarthSynth-180K dataset, employs the Counterfactual Composition training strategy with a three-dimensional batch-sample selection mechanism to improve training data diversity and enhance category control. Furthermore, a rule-based method of R-Filter is proposed to filter more informative synthetic data for downstream tasks. We evaluate our EarthSynth on scene classification, object detection, and semantic segmentation in open-world scenarios. There are significant improvements in open-vocabulary understanding tasks, offering a practical solution for advancing RSI interpretation.

中文标题/摘要

标题：EarthSynth：基于扩散模型生成信息丰富的地球观测数据

遥感影像（RSI）解译常因标注数据稀缺而面临挑战，限制了RSI解译任务的性能。为此，我们提出EarthSynth——基于扩散模型的生成式基础模型，能够为下游RSI解译任务合成多类别、跨卫星标注的地球观测数据。据我们所知，EarthSynth是首个探索遥感多任务生成的方法，解决了RSI解译中面向任务合成泛化能力受限的难题。该模型在EarthSynth-180K数据集上训练，采用反事实组合训练策略与三维批量样本选择机制，提升训练数据多样性并增强类别控制。此外，提出基于规则的R-Filter方法为下游任务筛选信息量更大的合成数据。我们在开放世界场景中对EarthSynth进行场景分类、目标检测和语义分割评估，在开放词汇理解任务中取得显著提升，为推进RSI解译提供了实用解决方案。

Summary / 总结

Remote sensing image (RSI) interpretation typically faces challenges due to the scarcity of labeled data, which limits the performance of RSI interpretation tasks.

What Holds Back Open-Vocabulary Segmentation?

Authors: Josip Šarić, Ivan Martinović, Matej Kristan, Siniša Šegvić

Venue: ICCV 2025 Workshop

First: 2025-08-06T08:46:47+00:00 · Latest: 2025-08-06T08:46:47+00:00

Comments: Accepted for publication at ICCV 25 Workshop: What is Next in Multimodal Foundation Models?

Abstract

Standard segmentation setups are unable to deliver models that can recognize concepts outside the training taxonomy. Open-vocabulary approaches promise to close this gap through language-image pretraining on billions of image-caption pairs. Unfortunately, we observe that the promise is not delivered due to several bottlenecks that have caused the performance to plateau for almost two years. This paper proposes novel oracle components that identify and decouple these bottlenecks by taking advantage of the ground-truth information. The presented validation experiments deliver important empirical findings that provide a deeper insight into the failures of open-vocabulary models and suggest prominent approaches to unlock future research.

中文标题/摘要

标题：开放词汇分割的瓶颈何在？

标准分割设置无法产生能识别训练分类外概念的模型。开放词汇方法承诺通过数十亿图像-标题对的语言-图像预训练来弥合这一差距。然而，我们发现由于多个瓶颈导致性能近两年停滞不前，这一承诺未能实现。本文提出新颖的预言组件，利用真实标注信息识别并解耦这些瓶颈。验证实验提供了重要的实证发现，深入揭示了开放词汇模型的失败原因，并为未来研究指明了突破方向。

Summary / 总结

Standard segmentation setups are unable to deliver models that can recognize concepts outside the training taxonomy.

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Authors: Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara

First: 2024-11-28T19:00:03+00:00 · Latest: 2025-08-05T12:26:14+00:00

Abstract

Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/.
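
The alignment step described above can be sketched roughly as follows. This is an illustration under assumptions, not Talk2DINO's code: the abstract only says a "learned mapping function" projects CLIP text embeddings into the DINOv2 patch space, so the linear matrix `W`, the toy dimensions, and all function names are hypothetical stand-ins. Each patch is simply assigned the class whose mapped text embedding is most similar.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def segment_patches(patch_features, text_embeddings, W):
    """Label each patch with the index of its best-matching text concept."""
    # Project every text embedding into the patch feature space once.
    mapped = [matvec(W, t) for t in text_embeddings]
    labels = []
    for patch in patch_features:
        scores = [cosine(patch, m) for m in mapped]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels

# Toy setup: 3-dimensional "text" space mapped into a 2-dimensional "patch" space.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]                 # hypothetical learned projection
text_embeddings = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]   # two hypothetical class prompts
patch_features = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]  # three hypothetical patches
labels = segment_patches(patch_features, text_embeddings, W)
print(labels)
```

Because only the mapping is trained in this picture, both backbones can stay frozen, which matches the abstract's statement that no backbone fine-tuning is needed.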

中文标题/摘要

标题：与DINO对话：融合自监督视觉骨干与语言实现开放词汇分割

开放词汇分割（OVS）旨在无需预定义训练类别的情况下，根据自由形式的文本概念对图像进行分割。虽然CLIP等现有多模态模型可通过利用视觉Transformer的粗略空间信息生成分割掩码，但由于其图像与文本特征的全局对齐方式，在空间定位方面面临挑战。相反，DINO等自监督视觉模型擅长细粒度视觉编码，但缺乏与语言的整合。为弥合这一差距，我们提出Talk2DINO——一种将DINOv2的空间精确性与CLIP的语言理解能力相结合的新型混合方法。该方法通过可学习的映射函数将CLIP的文本嵌入与DINOv2的补丁级特征对齐，无需微调底层骨干网络。训练时利用DINOv2的注意力图选择性对齐局部视觉补丁与文本嵌入。实验表明，Talk2DINO强大的语义和定位能力能优化分割过程，产生更自然、噪声更少的分割结果，并能有效区分前景与背景。在多个无监督OVS基准测试中达到最先进性能。源代码与模型已开源：https://lorebianchi98.github.io/Talk2DINO/

Summary / 总结

Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes.

Taking Language Embedded 3D Gaussian Splatting into the Wild

Authors: Yuze Wang, Yue Qi

First: 2025-07-26T07:00:32+00:00 · Latest: 2025-08-05T01:40:57+00:00

Comments: Visit our project page at https://yuzewang1998.github.io/takinglangsplatw/

Abstract

Recent advances in leveraging large-scale Internet photo collections for 3D reconstruction have enabled immersive virtual exploration of landmarks and historic sites worldwide. However, little attention has been given to the immersive understanding of architectural styles and structural knowledge, which remains largely confined to browsing static text-image pairs. Therefore, can we draw inspiration from 3D in-the-wild reconstruction techniques and use unconstrained photo collections to create an immersive approach for understanding the 3D structure of architectural components? To this end, we extend language embedded 3D Gaussian splatting (3DGS) and propose a novel framework for open-vocabulary scene understanding from unconstrained photo collections. Specifically, we first render multiple appearance images from the same viewpoint as the unconstrained image with the reconstructed radiance field, then extract multi-appearance CLIP features and two types of language feature uncertainty maps (transient and appearance uncertainty), derived from the multi-appearance features, to guide the subsequent optimization process. Next, we propose a transient uncertainty-aware autoencoder, a multi-appearance language field 3DGS representation, and a post-ensemble strategy to effectively compress, learn, and fuse language features from multiple appearances. Finally, to quantitatively evaluate our method, we introduce PT-OVS, a new benchmark dataset for assessing open-vocabulary segmentation performance on unconstrained photo collections. Experimental results show that our method outperforms existing methods, delivering accurate open-vocabulary segmentation and enabling applications such as interactive roaming with open-vocabulary queries, architectural style pattern recognition, and 3D scene editing.

中文标题/摘要

标题：将语言嵌入的3D高斯溅射技术引入实景应用

近年来利用互联网大规模照片集进行三维重建的进展，实现了对全球地标和历史遗址的沉浸式虚拟探索。然而，对于建筑风格与结构知识的沉浸式理解仍鲜有关注，目前主要局限于浏览静态图文对。为此，我们能否从野外三维重建技术中汲取灵感，利用无约束照片集创建理解建筑构件三维结构的沉浸式方法？本文扩展了语言嵌入的3D高斯溅射技术（3DGS），提出了一种基于无约束照片集的开放词汇场景理解新框架。具体而言，我们首先通过重建辐射场从与无约束图像相同视角渲染多外观图像，继而提取多外观CLIP特征及两种语言特征不确定性图谱——瞬态不确定性和外观不确定性（源自多外观特征）以指导后续优化过程。接着提出瞬态不确定性感知自编码器、多外观语言场3DGS表示及后集成策略，有效压缩、学习并融合多外观语言特征。最后，为量化评估方法，我们引入PT-OVS基准数据集，用于评估无约束照片集上的开放词汇分割性能。实验结果表明，本方法优于现有技术，可实现精确的开放词汇分割，并支持开放词汇查询交互漫游、建筑风格模式识别及三维场景编辑等应用。

Summary / 总结

Recent advances in leveraging large-scale Internet photo collections for 3D reconstruction have enabled immersive virtual exploration of landmarks and historic sites worldwide.

AG$^2$aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing

Authors: Zhaonan Wang, Manyi Li, Changhe Tu

First: 2025-08-03T12:47:30+00:00 · Latest: 2025-08-03T12:47:30+00:00

Abstract

3D Gaussian Splatting (3DGS) has witnessed exponential adoption across diverse applications, driving a critical need for semantic-aware 3D Gaussian representations to enable scene understanding and editing tasks. Existing approaches typically attach semantic features to a collection of free Gaussians and distill the features via differentiable rendering, leading to noisy segmentation and a messy selection of Gaussians. In this paper, we introduce AG$^2$aussian, a novel framework that leverages an anchor-graph structure to organize semantic features and regulate Gaussian primitives. Our anchor-graph structure not only promotes compact and instance-aware Gaussian distributions, but also facilitates graph-based propagation, achieving a clean and accurate instance-level Gaussian selection. Extensive validation across four applications (interactive click-based query, open-vocabulary text-driven query, object removal editing, and physics simulation) demonstrates the advantages of our approach. The experiments and ablation studies further evaluate the effectiveness of the key designs of our approach.

中文标题/摘要

标题：AG$^2$aussian：基于锚图结构的高斯溅射实现实例级三维场景理解与编辑

三维高斯溅射（3DGS）技术已在多个领域得到指数级应用，这推动了对具备语义感知能力的三维高斯表示的迫切需求，以实现场景理解与编辑任务。现有方法通常将语义特征附加到自由高斯集合上，并通过可微分渲染提取特征，导致分割结果存在噪声且高斯选择混乱。本文提出AG$^2$aussian——一种利用锚图结构组织语义特征并规整高斯基元的新框架。我们的锚图结构不仅促进了紧凑且实例感知的高斯分布，还支持基于图的传播机制，实现了清晰准确的实例级高斯选择。通过在交互式点击查询、开放词汇文本驱动查询、物体移除编辑及物理仿真四个应用领域的广泛验证，证明了本方法的优势及其多场景适用性。实验与消融研究进一步评估了关键设计的有效性。

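Summary / 总结

3D Gaussian Splatting (3DGS) has witnessed exponential adoption across diverse applications, driving a critical need for semantic-aware 3D Gaussian representations to enable scene understanding and editing tasks.
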
OpenGS-Fusion: Open-Vocabulary Dense Mapping with Hybrid 3D Gaussian Splatting for Refined Object-Level Understanding

Authors: Dianyi Yang, Xihan Wang, Yu Gao, Shiyang Liu, Bohan Ren, Yufeng Yue, Yi Yang

First: 2025-08-02T02:22:36+00:00 · Latest: 2025-08-02T02:22:36+00:00

Comments: IROS2025

Abstract

Recent advancements in 3D scene understanding have made significant strides in enabling interaction with scenes using open-vocabulary queries, particularly for VR/AR and robotic applications. Nevertheless, existing methods are hindered by rigid offline pipelines and the inability to provide precise 3D object-level understanding given open-ended queries. In this paper, we present OpenGS-Fusion, an innovative open-vocabulary dense mapping framework that improves semantic modeling and refines object-level understanding. OpenGS-Fusion combines 3D Gaussian representation with a Truncated Signed Distance Field to facilitate lossless fusion of semantic features on the fly. Furthermore, we introduce a novel multimodal language-guided approach named MLLM-Assisted Adaptive Thresholding, which refines the segmentation of 3D objects by adaptively adjusting similarity thresholds, achieving a 17% improvement in 3D mIoU compared to the fixed threshold strategy. Extensive experiments demonstrate that our method outperforms existing methods in 3D object understanding and scene reconstruction quality, and show its effectiveness in language-guided scene interaction. The code is available at https://young-bit.github.io/opengs-fusion.github.io/.
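
To illustrate why the choice of similarity threshold matters for open-vocabulary 3D segmentation, here is a minimal Python sketch of the fixed-threshold baseline the abstract compares against. Everything in it is an assumption for illustration: the toy 2-D vectors stand in for language features attached to 3D primitives, and the names `select` and `iou` are hypothetical, not OpenGS-Fusion's API. The MLLM-assisted adaptive step itself is not reproduced, since the abstract does not specify it.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select(features, query, threshold):
    """Indices of primitives whose similarity to the query meets the threshold."""
    return {i for i, f in enumerate(features) if cosine(f, query) >= threshold}


def iou(pred, truth):
    """Intersection over union of two index sets."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0


# Hypothetical toy scene: five primitives with 2-D "language" features.
features = [
    [1.0, 0.0],  # 0: object, strong match
    [0.9, 0.1],  # 1: object
    [0.6, 0.4],  # 2: object, weaker match
    [0.5, 0.5],  # 3: background that partly resembles the query
    [0.0, 1.0],  # 4: background
]
query = [1.0, 0.0]   # stand-in for a text embedding
truth = {0, 1, 2}    # ground-truth object primitives

for threshold in (0.5, 0.8, 0.95):
    picked = select(features, query, threshold)
    print(threshold, sorted(picked), round(iou(picked, truth), 3))
```

On this toy data a threshold of 0.5 admits one background primitive (IoU 0.75), 0.95 drops the weakly matching object primitive (IoU 2/3), and only 0.8 recovers the object exactly. That sensitivity is what a per-query adaptive threshold aims to remove.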

中文标题/摘要

标题：OpenGS-Fusion：基于混合3D高斯泼溅的开放词汇密集建图与精细化对象级理解

近期三维场景理解技术的进步在支持通过开放词汇查询进行场景交互方面取得重大进展，尤其在VR/AR和机器人应用中。然而，现有方法受限于僵化的离线流程及无法针对开放式查询提供精确的三维对象级理解。本文提出OpenGS-Fusion——一种创新的开放词汇密集建图框架，通过融合3D高斯表示与截断符号距离场实现语义特征的无损实时融合，显著提升语义建模能力与对象级理解精度。我们进一步引入名为MLLM辅助自适应阈值分割的多模态语言引导方法，通过动态调整相似度阈值优化三维对象分割，相比固定阈值策略使三维mIoU指标提升17%。大量实验表明，本方法在三维对象理解、场景重建质量及语言引导场景交互方面均优于现有方法。代码发布于https://young-bit.github.io/opengs-fusion.github.io/

Summary / 总结

Recent advancements in 3D scene understanding have made significant strides in enabling interaction with scenes using open-vocabulary queries, particularly for VR/AR and robotic applications.

Training-Free Class Purification for Open-Vocabulary Semantic Segmentation

Authors: Qi Chen, Lingxiao Yang, Yun Chen, Nailong Zhao, Jianhuang Lai, Jie Shao, Xiaohua Xie

Venue: ICCV 2025

First: 2025-08-01T11:55:12+00:00 · Latest: 2025-08-01T11:55:12+00:00

Comments: Accepted to ICCV 2025

Abstract

Fine-tuning pre-trained vision-language models has emerged as a powerful approach for enhancing open-vocabulary semantic segmentation (OVSS). However, the substantial computational and resource demands associated with training on large datasets have prompted interest in training-free methods for OVSS. Existing training-free approaches primarily focus on modifying model architectures and generating prototypes to improve segmentation performance. However, they often neglect the challenges posed by class redundancy, where multiple categories are not present in the current test image, and visual-language ambiguity, where semantic similarities among categories create confusion in class activation. These issues can lead to suboptimal class activation maps and affinity-refined activation maps. Motivated by these observations, we propose FreeCP, a novel training-free class purification framework designed to address these challenges. FreeCP focuses on purifying semantic categories and rectifying errors caused by redundancy and ambiguity. The purified class representations are then leveraged to produce final segmentation predictions. We conduct extensive experiments across eight benchmarks to validate FreeCP's effectiveness. Results demonstrate that FreeCP, as a plug-and-play module, significantly boosts segmentation performance when combined with other OVSS methods.

中文标题/摘要

标题：开放词汇语义分割的无训练类别净化方法

微调预训练的视觉-语言模型已成为增强开放词汇语义分割（OVSS）的有效途径。然而，大规模数据集训练所需的高计算和资源成本引发了人们对无训练OVSS方法的兴趣。现有无训练方法主要侧重于修改模型架构和生成原型以提升分割性能，但往往忽略了类别冗余（当前测试图像中不存在的多个类别）和视觉-语言歧义（类别间语义相似性导致类激活混淆）带来的挑战。这些问题可能导致次优的类激活图及亲和力精炼激活图。基于这些观察，我们提出FreeCP——一种创新的无训练类别净化框架，旨在解决这些挑战。FreeCP通过净化语义类别并修正冗余与歧义导致的误差，利用净化后的类别表征生成最终分割预测。我们在八个基准测试上开展广泛实验验证FreeCP的有效性。结果表明，FreeCP作为即插即用模块，与其他OVSS方法结合时能显著提升分割性能。

Summary / 总结

Fine-tuning pre-trained vision-language models has emerged as a powerful approach for enhancing open-vocabulary semantic segmentation (OVSS).

OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning

Authors: Zongyan Han, Jiale Cao, Shuo Chen, Tong Wang, Jorma Laaksonen, Rao Muhammad Anwer

First: 2025-05-22T17:51:48+00:00 · Latest: 2025-08-01T08:53:45+00:00

Abstract

Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS models to distinguish similar categories in open-world settings due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. The proposed OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual reason for objects in a coarse-to-fine manner. Based on these reasoning steps, we can compose detailed description prompts and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability. Our code is publicly available at https://github.com/Hanzy1996/OpenSeg-R.

中文标题/摘要

标题：OpenSeg-R：通过逐步视觉推理改进开放词汇分割

开放词汇分割（OVS）因其能够泛化分割至预定义类别之外而日益受到关注。然而，现有方法通常通过简单的前向推理预测分割掩码，缺乏显式推理和可解释性。这使得OVS模型在开放世界场景中因缺乏上下文理解和判别性视觉线索而难以区分相似类别。为解决这一局限，我们提出了名为OpenSeg-R的逐步视觉推理框架。该框架利用大型多模态模型（LMMs）在分割前执行分层视觉推理，通过生成通用及图像特定的推理内容，构建结构化三元组以粗到细的方式解释物体视觉特征。基于这些推理步骤，可组合详细描述提示并输入分割器以生成更精确的分割掩码。据我们所知，OpenSeg-R是首个将显式逐步推理引入OVS的框架。实验结果表明，在五个基准数据集上，OpenSeg-R在开放词汇语义分割任务中显著优于现有最优方法，并在开放词汇全景分割的所有指标上实现一致提升。定性结果进一步验证了该推理引导框架对提升分割精度与可解释性的有效性。代码已开源：https://github.com/Hanzy1996/OpenSeg-R。

Summary / 总结

Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories.

CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation

Authors: Dengke Zhang, Fagui Liu, Quan Tang

Venue: ICCV 2025 Oral

First: 2024-11-15T10:14:55+00:00 · Latest: 2025-08-01T08:25:34+00:00

Comments: Accepted to ICCV 2025 Oral

Abstract

Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories. While Contrastive Language-Image Pre-training (CLIP) excels in zero-shot classification, it struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter-class correlations are the main cause of CLIP's impaired segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. Specifically, CorrCLIP leverages the Segment Anything Model (SAM) to define the scope of patch interactions, reducing inter-class correlations. To mitigate the problem that SAM-generated masks may contain patches belonging to different classes, CorrCLIP incorporates self-supervised models to compute coherent similarity values, suppressing the weight of inter-class correlations. We also introduce two additional branches to strengthen the spatial details and semantic representation of patch features. Finally, we update segmentation maps with SAM-generated masks to improve spatial consistency. Based on the improvements across patch correlations, feature representations, and segmentation maps, CorrCLIP achieves superior performance across eight benchmarks. Code is available at: https://github.com/zdk258/CorrCLIP.
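
The mechanism described above, restricting which patches may interact and re-weighting the interactions that remain, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not CorrCLIP's implementation: `region_ids` stands in for SAM mask membership, `guide_feats` for features from a self-supervised model, and the function names and the softmax temperature are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def restricted_attention(guide_feats, region_ids, temperature=0.1):
    """Row-normalized weights where patch i attends only to patches in its own region.

    Scope: pairs in different regions get weight exactly 0.
    Value: within a region, weights are a softmax over guide-feature similarity.
    """
    n = len(guide_feats)
    weights = []
    for i in range(n):
        scores = [
            math.exp(cosine(guide_feats[i], guide_feats[j]) / temperature)
            if region_ids[i] == region_ids[j]
            else 0.0
            for j in range(n)
        ]
        total = sum(scores)  # always > 0 because patch i shares a region with itself
        weights.append([s / total for s in scores])
    return weights


def aggregate(weights, values):
    """Mix per-patch value vectors using the attention weights."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]


# Hypothetical toy input: four patches, two regions of two patches each.
region_ids = [0, 0, 1, 1]
guide_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

weights = restricted_attention(guide_feats, region_ids)
mixed = aggregate(weights, values)
for row in mixed:
    print([round(x, 3) for x in row])
```

Because cross-region weights are exactly zero, each patch's output mixes only values from its own region, so the two regions in the toy input never contaminate each other.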

中文标题/摘要

标题：CorrCLIP：重构CLIP中的图像块相关性以实现开放词汇语义分割

开放词汇语义分割旨在不受预定义类别限制的情况下为每个像素分配语义标签。尽管对比语言-图像预训练模型（CLIP）在零样本分类中表现卓越，但由于其图像块相关性不连贯，难以将图像块与类别嵌入对齐。本研究揭示类间相关性是影响CLIP分割性能的主要原因。据此，我们提出CorrCLIP，通过重构图像块相关性的作用域和数值来解决问题。具体而言，CorrCLIP利用分割任意模型（SAM）界定图像块交互范围以降低类间相关性；针对SAM生成掩码可能包含多类别图像块的问题，引入自监督模型计算一致性相似值以抑制类间相关性权重。此外，我们新增两个分支强化图像块特征的空间细节和语义表征，并采用SAM生成掩码更新分割图以提升空间一致性。通过改进图像块相关性、特征表征和分割图，CorrCLIP在八个基准测试中均取得优异性能。代码详见：https://github.com/zdk258/CorrCLIP。

Summary / 总结

Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories.