SLGaussian: Fast Language Gaussian Splatting in Sparse Views
Authors: Kangjie Chen, BingQuan Dai, Minghan Qin, Dongbin Zhang, Peihao Li, Yingshuang Zou, Haoqian Wang
Venue: ACM MM 2025
First: 2024-12-11T12:18:30+00:00 · Latest: 2025-08-18T08:08:13+00:00
Comments: Accepted by ACM MM 2025. Project page:
https://chenkangjie1123.github.io/SLGaussian.github.io/
Abstract
3D semantic field learning is crucial for applications like autonomous
navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from
limited viewpoints is essential. Existing methods struggle under sparse view
conditions, relying on inefficient per-scene multi-view optimizations, which
are impractical for many real-world tasks. To address this, we propose
SLGaussian, a feed-forward method for constructing 3D semantic fields from
sparse viewpoints, allowing direct inference of 3DGS-based scenes. By ensuring
consistent SAM segmentations through video tracking and using low-dimensional
indexing for high-dimensional CLIP features, SLGaussian efficiently embeds
language information in 3D space, offering a robust solution for accurate 3D
scene understanding under sparse view conditions. In experiments on two-view
sparse 3D object querying and segmentation in the LERF and 3D-OVS datasets,
SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy,
and mIoU. Moreover, our model achieves scene inference in under 30 seconds and
open-vocabulary querying in just 0.011 seconds per query.
中文标题/摘要
标题:SLGaussian:稀疏视角下的快速语言高斯泼溅技术
三维语义场学习对于自动驾驶导航、增强现实/虚拟现实(AR/VR)以及机器人技术等应用至关重要,这些应用需要从有限视角准确理解三维场景。现有方法在稀疏视角条件下表现不佳,依赖于低效的逐场景多视角优化,这在许多实际任务中不切实际。为此,我们提出SLGaussian,一种前馈方法,用于从稀疏视角构建三维语义场,支持直接推断基于3DGS的场景。通过视频跟踪确保一致的SAM分割,并利用低维索引处理高维CLIP特征,SLGaussian高效地将语言信息嵌入三维空间,为稀疏视角条件下的精确三维场景理解提供了稳健解决方案。在LERF和3D-OVS数据集上的双视角稀疏三维物体查询与分割实验中,SLGaussian在选定的IoU、定位精度和mIoU指标上均优于现有方法。此外,我们的模型在30秒内完成场景推断,每项开放词汇查询仅需0.011秒。
Summary / 总结
3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential.
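The SLGaussian abstract mentions "low-dimensional indexing for high-dimensional CLIP features". The following is a minimal sketch of that general idea, not the authors' implementation. It assumes each pixel (or Gaussian) stores only a small integer index into a shared codebook of language features, so a text query is scored once per codebook entry rather than once per pixel. The names `query_mask`, `codebook`, and `index_map`, the threshold, and the 3-dimensional toy vectors (standing in for real CLIP embeddings) are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def query_mask(index_map, codebook, text_embedding, threshold=0.5):
    """Return a boolean mask of pixels whose indexed feature matches the query.

    index_map: 2D list of integer indices into `codebook` (hypothetical layout).
    codebook:  list of feature vectors, one per segment.
    """
    # Score every codebook entry once; the cost is independent of image size.
    scores = [cosine(feature, text_embedding) for feature in codebook]
    # Per-pixel work is then only a table lookup and a comparison.
    return [[scores[i] >= threshold for i in row] for row in index_map]


# Toy data: three segments with orthogonal stand-in "language" features.
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
index_map = [[0, 0, 1], [2, 1, 1]]
text = [0.1, 0.9, 0.0]  # query closest to segment 1

mask = query_mask(index_map, codebook, text)
```

In this toy setup only the pixels indexed to segment 1 are selected. Storing indices instead of full feature vectors keeps the per-pixel payload small, which is one plausible reason such indexing makes queries fast.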
Splat Feature Solver
Authors: Butian Xiong, Rong Liu, Kenneth Xu, Meida Chen, Andrew Feng
First: 2025-08-17T03:13:06+00:00 · Latest: 2025-08-17T03:13:06+00:00
Comments: webpage not that stable
Abstract
Feature lifting has emerged as a crucial component in 3D scene understanding,
enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP)
onto splat-based 3D representations. The core challenge lies in optimally
assigning rich general attributes to 3D primitives while addressing the
inconsistency issues from multi-view images. We present a unified, kernel- and
feature-agnostic formulation of the feature lifting problem as a sparse linear
inverse problem, which can be solved efficiently in closed form. Our approach
admits a provable upper bound on the global optimal error under convex losses
for delivering high quality lifted features. To address inconsistencies and
noise in multi-view observations, we introduce two complementary regularization
strategies to stabilize the solution and enhance semantic fidelity. Tikhonov
Guidance enforces numerical stability through soft diagonal dominance, while
Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive
experiments demonstrate that our approach achieves state-of-the-art performance
on open-vocabulary 3D segmentation benchmarks, outperforming training-based,
grouping-based, and heuristic-forward baselines while producing the lifted
features in minutes. Code is available at
https://github.com/saliteta/splat-distiller.git. See also
https://splat-distiller.pages.dev/
中文标题/摘要
标题:Splat特征求解器
特征提升已成为3D场景理解的关键组成部分,能够将丰富的图像特征描述符(如DINO、CLIP)附加到基于splat的3D表示上。核心挑战在于如何最优地将丰富通用属性分配给3D图元,同时解决多视角图像的不一致性问题。我们提出了一个统一、与内核和特征无关的特征提升问题稀疏线性逆问题表述,可通过闭式解高效求解。该方法在凸损失下可证明全局最优误差上界,从而提供高质量提升特征。针对多视角观测中的不一致性和噪声,我们引入两种互补的正则化策略:吉洪诺夫指导通过软对角占优确保数值稳定性,后提升聚合通过特征聚类过滤噪声输入。大量实验表明,我们的方法在开放词汇3D分割基准上达到最先进性能,在几分钟内生成提升特征的同时,优于基于训练、分组和启发式的前沿基线。代码发布于 https://github.com/saliteta/splat-distiller.git ,另设演示页面 https://splat-distiller.pages.dev/ 。
Summary / 总结
Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations.
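The Splat Feature Solver abstract frames feature lifting as a linear inverse problem with a closed-form solution and Tikhonov-style regularization. As a rough illustration only, the sketch below solves a generic ridge-regularized least-squares problem, F = (WᵀW + λI)⁻¹ WᵀY. This is textbook Tikhonov regularization, not the paper's solver or its "Tikhonov Guidance". It assumes `W[p][i]` is the blending weight of primitive `i` at pixel `p` and `Y[p]` is the observed 2D feature at that pixel; the function names and the toy data are hypothetical.

```python
def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]  # augmented matrix [A | B]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [v - factor * c for v, c in zip(M[r], M[col])]
    return [row[n:] for row in M]


def lift_features(W, Y, lam=1e-3):
    """Return F minimizing ||W F - Y||^2 + lam * ||F||^2 (ridge / Tikhonov)."""
    n_pix, n_prim, dim = len(W), len(W[0]), len(Y[0])
    # Normal matrix W^T W + lam * I; the lam term keeps it well conditioned.
    A = [[sum(W[p][i] * W[p][j] for p in range(n_pix)) + (lam if i == j else 0.0)
          for j in range(n_prim)] for i in range(n_prim)]
    # Right-hand side W^T Y.
    B = [[sum(W[p][i] * Y[p][k] for p in range(n_pix)) for k in range(dim)]
         for i in range(n_prim)]
    return solve(A, B)


# Toy problem: 3 pixels, 2 primitives, 2-dimensional features.
W = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
F_true = [[1.0, 0.0], [0.0, 1.0]]
Y = [[sum(W[p][i] * F_true[i][k] for i in range(2)) for k in range(2)]
     for p in range(3)]

F = lift_features(W, Y, lam=1e-6)
```

With noise-free observations and a very small `lam`, the recovered `F` is close to `F_true`. A larger `lam` trades fidelity for stability when observations from different views disagree. A real system would use sparse matrices rather than dense lists.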
Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception
Authors: Junjie Wang, Keyu Chen, Yulin Li, Bin Chen, Hengshuang Zhao, Xiaojuan Qi, Zhuotao Tian
First: 2025-08-15T06:43:51+00:00 · Latest: 2025-08-15T06:43:51+00:00
Comments: arXiv admin note: text overlap with arXiv:2505.04410
Abstract
Dense visual perception tasks have been constrained by their reliance on
predefined categories, limiting their applicability in real-world scenarios
where visual concepts are unbounded. While Vision-Language Models (VLMs) like
CLIP have shown promise in open-vocabulary tasks, their direct application to
dense perception often leads to suboptimal performance due to limitations in
local feature representation. In this work, we present our observation that
CLIP's image tokens struggle to effectively aggregate information from
spatially or semantically related regions, resulting in features that lack
local discriminability and spatial consistency. To address this issue, we
propose DeCLIP, a novel framework that enhances CLIP by decoupling the
self-attention module to obtain "content" and "context" features
respectively. The context features are enhanced by jointly distilling
semantic correlations from Vision Foundation Models (VFMs) and object integrity
cues from diffusion models, thereby enhancing spatial consistency. In parallel,
the content features are aligned with image crop representations and
constrained by region correlations from VFMs to improve local discriminability.
Extensive experiments demonstrate that DeCLIP establishes a solid foundation
for open-vocabulary dense perception, consistently achieving state-of-the-art
performance across a broad spectrum of tasks, including 2D detection and
segmentation, 3D instance segmentation, video instance segmentation, and 6D
object pose estimation. Code is available at
https://github.com/xiaomoguhz/DeCLIP
中文标题/摘要
标题:广义解耦学习增强开放词汇密集感知
密集视觉感知任务长期受限于预定义类别,难以适应现实世界中无边界视觉概念的应用场景。尽管CLIP等视觉语言模型在开放词汇任务中展现出潜力,但其直接应用于密集感知时,由于局部特征表示的限制往往导致性能欠佳。本研究观察到CLIP的图像令牌难以有效聚合空间或语义相关区域的信息,导致特征缺乏局部判别性和空间一致性。为此,我们提出DeCLIP框架,通过解耦自注意力模块分别获取'内容'与'上下文'特征。上下文特征通过联合蒸馏视觉基础模型的语义关联和扩散模型的物体完整性线索得到增强,从而提升空间一致性;同时,内容特征与图像裁剪表示对齐,并受视觉基础模型的区域相关性约束以改善局部判别性。大量实验表明,DeCLIP为开放词汇密集感知奠定了坚实基础,在2D检测与分割、3D实例分割、视频实例分割及6D物体姿态估计等广泛任务中持续取得最先进性能。代码已开源:https://github.com/xiaomoguhz/DeCLIP
Summary / 总结
Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded.
DualMap: Online Open-Vocabulary Semantic Mapping for Natural Language Navigation in Dynamic Changing Scenes
Authors: Jiajun Jiang, Yiming Zhu, Zirui Wu, Jie Song
First: 2025-06-02T17:59:10+00:00 · Latest: 2025-08-13T07:21:25+00:00
Comments: 14 pages, 14 figures. Code: https://github.com/Eku127/DualMap
Project page: https://eku127.github.io/DualMap/
Abstract
We introduce DualMap, an online open-vocabulary mapping system that enables
robots to understand and navigate dynamically changing environments through
natural language queries. Designed for efficient semantic mapping and
adaptability to changing environments, DualMap meets the essential requirements
for real-world robot navigation applications. Our proposed hybrid segmentation
frontend and object-level status check eliminate the costly 3D object merging
required by prior methods, enabling efficient online scene mapping. The
dual-map representation combines a global abstract map for high-level candidate
selection with a local concrete map for precise goal-reaching, effectively
managing and updating dynamic changes in the environment. Through extensive
experiments in both simulation and real-world scenarios, we demonstrate
state-of-the-art performance in 3D open-vocabulary segmentation, efficient
scene mapping, and online language-guided navigation. Project page:
https://eku127.github.io/DualMap/
中文标题/摘要
标题:DualMap:动态变化场景中自然语言导航的在线开放词汇语义映射
我们推出DualMap,一种在线开放词汇映射系统,使机器人能通过自然语言查询理解并导航动态变化的环境。该系统专为高效语义映射和适应环境变化而设计,满足现实世界机器人导航应用的核心需求。提出的混合分割前端和对象级状态检查消除了先前方法所需的昂贵3D对象合并,实现了高效的在线场景映射。双地图表示结合了用于高层候选选择的全局抽象地图与用于精确抵达目标的局部具体地图,有效管理并更新环境中的动态变化。通过仿真和真实场景的广泛实验,我们在3D开放词汇分割、高效场景映射和在线语言引导导航方面展示了最先进的性能。项目页面:https://eku127.github.io/DualMap/
Summary / 总结
We introduce DualMap, an online open-vocabulary mapping system that enables robots to understand and navigate dynamically changing environments through natural language queries.
CitySeg: A 3D Open Vocabulary Semantic Segmentation Foundation Model in City-scale Scenarios
Authors: Jialei Xu, Zizhuang Wei, Weikang You, Linyun Li, Weijian Sun
First: 2025-08-13T03:55:56+00:00 · Latest: 2025-08-13T03:55:56+00:00
Abstract
Semantic segmentation of city-scale point clouds is a critical technology for
Unmanned Aerial Vehicle (UAV) perception systems, enabling the classification
of 3D points without relying on any visual information to achieve comprehensive
3D understanding. However, existing models are frequently constrained by the
limited scale of 3D data and the domain gap between datasets, which lead to
reduced generalization capability. To address these challenges, we propose
CitySeg, a foundation model for city-scale point cloud semantic segmentation
that incorporates text modality to achieve open vocabulary segmentation and
zero-shot inference. Specifically, in order to mitigate the issue of
non-uniform data distribution across multiple domains, we customize the data
preprocessing rules, and propose a local-global cross-attention network to
enhance the perception capabilities of point networks in UAV scenarios. To
resolve semantic label discrepancies across datasets, we introduce a
hierarchical classification strategy. A hierarchical graph established
according to the data annotation rules consolidates the data labels, and the
graph encoder is used to model the hierarchical relationships between
categories. In addition, we propose a two-stage training strategy and employ
hinge loss to increase the feature separability of subcategories. Experimental
results demonstrate that the proposed CitySeg achieves state-of-the-art (SOTA)
performance on nine closed-set benchmarks, significantly outperforming existing
approaches. Moreover, for the first time, CitySeg enables zero-shot
generalization in city-scale point cloud scenarios without relying on visual
information.
中文标题/摘要
标题:CitySeg:城市场景下的三维开放词汇语义分割基础模型
城市级点云语义分割是无人机感知系统的关键技术,通过对三维点进行无需视觉信息的分类实现全面三维理解。然而现有模型常受限于三维数据规模有限及数据集间的领域差异,导致泛化能力下降。为此我们提出CitySeg——融合文本模态实现开放词汇分割与零样本推理的城市级点云语义分割基础模型。针对多领域数据分布不均问题,定制数据预处理规则并提出局部-全局交叉注意力网络以增强点网络在无人机场景中的感知能力;通过建立符合数据标注规则的层次化图谱整合标签,并利用图编码器建模类别层级关系来解决语义标签差异;采用两阶段训练策略和铰链损失提升子类特征可分性。实验表明CitySeg在九个封闭集基准上达到最先进性能,显著优于现有方法,并首次实现不依赖视觉信息的城市级点云零样本泛化。
Summary / 总结
Semantic segmentation of city-scale point clouds is a critical technology for Unmanned Aerial Vehicle (UAV) perception systems, enabling the classification of 3D points without relying on any visual information to achieve comprehensive 3D understanding.
ReferSplat: Referring Segmentation in 3D Gaussian Splatting
Authors: Shuting He, Guangquan Jie, Changshuo Wang, Yun Zhou, Shuming Hu, Guanbin Li, Henghui Ding
Venue: ICML 2025 Oral
First: 2025-08-11T17:59:30+00:00 · Latest: 2025-08-11T17:59:30+00:00
Comments: ICML 2025 Oral, Code: https://github.com/heshuting555/ReferSplat
Abstract
We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task
that aims to segment target objects in a 3D Gaussian scene based on natural
language descriptions, which often contain spatial relationships or object
attributes. This task requires the model to identify newly described objects
that may be occluded or not directly visible in a novel view, posing a
significant challenge for 3D multi-modal understanding. Developing this
capability is crucial for advancing embodied AI. To support research in this
area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that
3D multi-modal understanding and spatial relationship modeling are key
challenges for R3DGS. To address these challenges, we propose ReferSplat, a
framework that explicitly models 3D Gaussian points with natural language
expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art
performance on both the newly proposed R3DGS task and 3D open-vocabulary
segmentation benchmarks. Dataset and code are available at
https://github.com/heshuting555/ReferSplat.
中文标题/摘要
标题:ReferSplat:基于3D高斯泼溅的指代分割
我们提出了指代式3D高斯泼溅分割(R3DGS)这一新任务,旨在通过自然语言描述(常包含空间关系或物体属性)对3D高斯场景中的目标物体进行分割。该任务要求模型识别在新视角下可能被遮挡或不可见的新描述物体,这对3D多模态理解提出了重大挑战。发展此能力对推进具身人工智能至关重要。为支持该领域研究,我们构建了首个R3DGS数据集Ref-LERF。分析表明,3D多模态理解与空间关系建模是R3DGS的核心挑战。为此,我们提出ReferSplat框架,在空间感知范式下显式建模自然语言表达与3D高斯点的关联。ReferSplat在新建的R3DGS任务和3D开放词汇分割基准上均实现了最先进性能。数据集与代码详见https://github.com/heshuting555/ReferSplat。
Summary / 总结
We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes.
Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
Authors: Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie Wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, Eunbyung Park
First: 2025-08-05T16:54:55+00:00 · Latest: 2025-08-11T03:47:38+00:00
Comments: The code is available at https://github.com/HorizonRobotics/Uni3R
Abstract
Reconstructing and semantically interpreting 3D scenes from sparse 2D views
remains a fundamental challenge in computer vision. Conventional methods often
decouple semantic understanding from reconstruction or necessitate costly
per-scene optimization, thereby restricting their scalability and
generalizability. In this paper, we introduce Uni3R, a novel feed-forward
framework that jointly reconstructs a unified 3D scene representation enriched
with open-vocabulary semantics, directly from unposed multi-view images. Our
approach leverages a Cross-View Transformer to robustly integrate information
across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian
primitives endowed with semantic feature fields. This unified representation
facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic
segmentation, and depth prediction, all within a single, feed-forward pass.
Extensive experiments demonstrate that Uni3R establishes a new state-of-the-art
across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on
ScanNet. Our work signifies a novel paradigm towards generalizable, unified 3D
scene reconstruction and understanding. The code is available at
https://github.com/HorizonRobotics/Uni3R.
中文标题/摘要
标题:Uni3R:通过未标定多视角图像中可泛化的高斯溅射实现统一的三维重建与语义理解
从稀疏二维视图重建并语义解析三维场景始终是计算机视觉领域的核心挑战。传统方法常将语义理解与重建过程解耦,或需进行昂贵的逐场景优化,从而限制了其可扩展性与泛化能力。本文提出Uni3R——一种新颖的前馈框架,可直接从未标定多视角图像中联合重建具有开放词汇语义的统一三维场景表征。该方法通过跨视角Transformer鲁棒地整合任意多视角输入信息,进而回归出带有语义特征场的三维高斯基元集合。这种统一表征可在单次前馈过程中实现高保真新视角合成、开放词汇三维语义分割及深度预测。大量实验表明,Uni3R在多个基准测试中创下新纪录,包括RE10K数据集上25.07的PSNR值和ScanNet数据集上55.84的mIoU值。本工作标志着向可泛化、统一化的三维场景重建与理解新范式的重大迈进。代码已开源:https://github.com/HorizonRobotics/Uni3R
Summary / 总结
Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision.
Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction
Authors: Yunheng Li, Yuxuan Li, Quansheng Zeng, Wenhai Wang, Qibin Hou, Ming-Ming Cheng
Venue: ICCV 2025
First: 2024-12-09T06:34:23+00:00 · Latest: 2025-08-10T11:17:34+00:00
Comments: Accepted at ICCV 2025. The code is available at
https://github.com/HVision-NKU/DenseVLM
Abstract
Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated
impressive zero-shot recognition capability, but still underperform in dense
prediction tasks. Self-distillation has recently emerged as a promising
approach for fine-tuning VLMs to better adapt to local regions without
requiring extensive annotations. However, previous state-of-the-art approaches
often suffer from significant "foreground bias", where models tend to wrongly
identify background regions as foreground objects. To alleviate this issue, we
propose DenseVLM, a framework designed to learn unbiased region-language
alignment from powerful pre-trained VLM representations.
DenseVLM leverages the pre-trained VLM to retrieve categories for unlabeled
regions and then decouples the interference between foreground and background
features. We show that DenseVLM can directly replace the original VLM in
open-vocabulary object detection and image segmentation methods, leading to
notable performance improvements. Furthermore, it exhibits promising zero-shot
scalability when training on more extensive and diverse datasets. Our code is
available at https://github.com/HVision-NKU/DenseVLM.
中文标题/摘要
标题:开放词汇密集预测的无偏区域-语言对齐
预训练视觉-语言模型(如CLIP)已展现出卓越的零样本识别能力,但在密集预测任务中仍表现不足。自蒸馏技术近期成为无需大量标注即可微调VLM以适应局部区域的有效方法。然而,现有先进方法常存在显著的前景偏差,即模型易将背景区域误判为前景对象。为缓解此问题,我们提出DenseVLM框架,通过预训练VLM表征学习无偏的区域-语言对齐。该框架利用预训练VLM检索未标注区域的类别,并解耦前景与背景特征间的干扰。实验表明,DenseVLM可直接替代开放词汇目标检测和图像分割方法中的原始VLM,实现显著性能提升,且在更广泛多样数据集训练时展现出良好的零样本扩展性。代码已开源:https://github.com/HVision-NKU/DenseVLM
Summary / 总结
Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks.
Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation
Authors: Jiho Choi, Seonho Lee, Minhyun Lee, Seungho Lee, Hyunjung Shim
Venue: CVPR 2025
First: 2025-01-16T17:40:19+00:00 · Latest: 2025-08-08T08:51:23+00:00
Comments: CVPR 2025
Abstract
Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing
fine-grained parts in unseen categories. We identify two primary challenges in
OVPS: (1) the difficulty in aligning part-level image-text correspondence, and
(2) the lack of structural understanding in segmenting object parts. To address
these issues, we propose PartCATSeg, a novel framework that integrates
object-aware part-level cost aggregation, compositional loss, and structural
guidance from DINO. Our approach employs a disentangled cost aggregation
strategy that handles object and part-level costs separately, enhancing the
precision of part-level segmentation. We also introduce a compositional loss to
better capture part-object relationships, compensating for the limited part
annotations. Additionally, structural guidance from DINO features improves
boundary delineation and inter-part understanding. Extensive experiments on
Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that
our method significantly outperforms state-of-the-art approaches, setting a new
baseline for robust generalization to unseen part categories.
中文标题/摘要
标题:细粒度图像-文本对应与成本聚合在开放词汇部件分割中的应用
开放词汇部件分割(OVPS)是一个新兴领域,旨在识别未见类别中的细粒度部件。我们识别出OVPS中的两个主要挑战:(1)对齐部件级图像-文本对应的困难,(2)在分割对象部件时缺乏结构理解。为解决这些问题,我们提出了PartCATSeg,这是一个新颖框架,集成了对象感知的部件级成本聚合、组合损失以及来自DINO的结构指导。我们的方法采用解耦成本聚合策略,分别处理对象和部件级成本,从而提升部件级分割的精确度。我们还引入了组合损失以更好地捕捉部件-对象关系,弥补部件标注的不足。此外,DINO特征的结构指导改善了边界划分和部件间理解。在Pascal-Part-116、ADE20K-Part-234和PartImageNet数据集上的大量实验表明,我们的方法显著优于现有最先进方法,为对未见部件类别的鲁棒泛化设立了新基准。
Summary / 总结
Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories.
SynSeg: Feature Synergy for Multi-Category Contrastive Learning in Open-Vocabulary Semantic Segmentation
Authors: Weichen Zhang, Kebin Liu, Fan Dang, Zhui Zhu, Xikai Sun, Yunhao Liu
First: 2025-08-08T08:26:41+00:00 · Latest: 2025-08-08T08:26:41+00:00
Abstract
Semantic segmentation in open-vocabulary scenarios presents significant
challenges due to the wide range and granularity of semantic categories.
Existing weakly-supervised methods often rely on category-specific supervision
and ill-suited feature construction methods for contrastive learning, leading
to semantic misalignment and poor performance. In this work, we propose a novel
weakly-supervised approach, SynSeg, to address the challenges. SynSeg performs
Multi-Category Contrastive Learning (MCCL) as a stronger training signal with a
new feature reconstruction framework named Feature Synergy Structure (FSS).
Specifically, MCCL strategy robustly combines both intra- and inter-category
alignment and separation in order to make the model learn the knowledge of
correlations from different categories within the same image. Moreover, FSS
reconstructs discriminative features for contrastive learning through prior
fusion and semantic-activation-map enhancement, effectively avoiding the
foreground bias introduced by the visual encoder. In general, SynSeg
effectively improves the abilities in semantic localization and discrimination
under weak supervision. Extensive experiments on benchmarks demonstrate that
our method outperforms state-of-the-art (SOTA) methods. For instance,
SynSeg achieves higher accuracy than SOTA baselines by 4.5% on VOC, 8.9% on
Context, 2.6% on Object and 2.0% on City.
中文标题/摘要
标题:SynSeg:开放词汇语义分割中多类别对比学习的特征协同
开放词汇场景下的语义分割因语义类别范围广泛且粒度精细而面临巨大挑战。现有弱监督方法常依赖特定类别的监督和不适合对比学习的特征构建方法,导致语义错位和性能不佳。本研究提出新型弱监督方法SynSeg,通过多类别对比学习(MCCL)作为更强训练信号,并结合名为特征协同结构(FSS)的新特征重构框架。具体而言,MCCL策略鲁棒地结合了类内与类间的对齐与分离,使模型能够学习同一图像中不同类别间的相关性知识。此外,FSS通过先验融合和语义激活图增强来重构判别性特征以进行对比学习,有效避免了视觉编码器引入的前景偏差。总体而言,SynSeg显著提升了弱监督下的语义定位与判别能力。在基准测试上的大量实验表明,本方法优于现有最先进(SOTA)性能,例如在VOC数据集上准确率比SOTA基线高4.5%,Context高8.9%,Object高2.6%,Cityscapes高2.0%。
Summary / 总结
Semantic segmentation in open-vocabulary scenarios presents significant challenges due to the wide range and granularity of semantic categories.
Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts
Authors: Kiran Chhatre, Christopher Peters, Srikrishna Karanam
First: 2025-08-08T05:36:20+00:00 · Latest: 2025-08-08T05:36:20+00:00
Comments: 16 pages, 11 figures
Abstract
Existing methods for human parsing into body parts and clothing often use
fixed mask categories with broad labels that obscure fine-grained clothing
types. Recent open-vocabulary segmentation approaches leverage pretrained
text-to-image (T2I) diffusion model features for strong zero-shot transfer, but
typically group entire humans into a single person category, failing to
distinguish diverse clothing or detailed body parts. To address this, we
propose Spectrum, a unified network for part-level pixel parsing (body parts
and clothing) and instance-level grouping. While diffusion-based
open-vocabulary models generalize well across tasks, their internal
representations are not specialized for detailed human parsing. We observe
that, unlike diffusion models with broad representations, image-driven 3D
texture generators maintain faithful correspondence to input images, enabling
stronger representations for parsing diverse clothing and body parts. Spectrum
introduces a novel repurposing of an Image-to-Texture (I2Tx) diffusion model --
obtained by fine-tuning a T2I model on 3D human texture maps -- for improved
alignment with body parts and clothing. From an input image, we extract
human-part internal features via the I2Tx diffusion model and generate
semantically valid masks aligned to diverse clothing categories through
prompt-guided grounding. Once trained, Spectrum produces semantic segmentation
maps for every visible body part and clothing category, ignoring standalone
garments or irrelevant objects, for any number of humans in the scene. We
conduct extensive cross-dataset experiments -- separately assessing body parts,
clothing parts, unseen clothing categories, and full-body masks -- and
demonstrate that Spectrum consistently outperforms baseline methods in
prompt-based segmentation.
中文标题/摘要
标题:学习用于解析多样化人体服装与部位的三维纹理感知表示
现有人体解析方法常采用固定掩码类别和宽泛标签,难以区分细粒度服装类型。尽管基于预训练文本到图像扩散模型的开放词汇分割方法具有强大的零样本迁移能力,但通常将整个人体归为单一类别,无法辨别多样化服装或细节部位。为此,我们提出Spectrum——一个统一网络,可实现部件级像素解析(人体部位与服装)和实例级分组。虽然扩散模型能良好泛化,但其内部表示未针对精细人体解析优化。我们发现,与具有宽泛表示的扩散模型不同,图像驱动的三维纹理生成器能保持与输入图像的高度对应性,从而为解析多样化服装和身体部位提供更强表征。Spectrum创新性地重构了图像到纹理扩散模型(通过对T2I模型在三维人体纹理图上微调获得),以提升与身体部位及服装的对齐能力。通过从输入图像提取人体部件内部特征,并经由提示词引导生成与多样化服装类别对齐的语义有效掩码。训练后的Spectrum可为场景中任意数量人体生成所有可见身体部位和服装类别的语义分割图,忽略独立衣物或无关物体。通过跨数据集实验(分别评估身体部位、服装部件、未见服装类别及全身掩码),我们证明Spectrum在基于提示的分割中持续优于基线方法。
Summary / 总结
Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types.
EarthSynth: Generating Informative Earth Observation with Diffusion Models
Authors: Jiancheng Pan, Shiye Lei, Yuqian Fu, Jiahao Li, Yanxing Liu, Yuze Sun, Xiao He, Long Peng, Xiaomeng Huang, Bo Zhao
First: 2025-05-17T18:27:15+00:00 · Latest: 2025-08-07T10:33:17+00:00
Comments: 25 pages
Abstract
Remote sensing image (RSI) interpretation typically faces challenges due to
the scarcity of labeled data, which limits the performance of RSI
interpretation tasks. To tackle this challenge, we propose EarthSynth, a
diffusion-based generative foundation model that enables synthesizing
multi-category, cross-satellite labeled Earth observation for downstream RSI
interpretation tasks. To the best of our knowledge, EarthSynth is the first to
explore multi-task generation for remote sensing, tackling the challenge of
limited generalization in task-oriented synthesis for RSI interpretation.
EarthSynth, trained on the EarthSynth-180K dataset, employs the Counterfactual
Composition training strategy with a three-dimensional batch-sample selection
mechanism to improve training data diversity and enhance category control.
Furthermore, a rule-based method of R-Filter is proposed to filter more
informative synthetic data for downstream tasks. We evaluate our EarthSynth on
scene classification, object detection, and semantic segmentation in open-world
scenarios. There are significant improvements in open-vocabulary understanding
tasks, offering a practical solution for advancing RSI interpretation.
中文标题/摘要
标题:EarthSynth:基于扩散模型生成信息丰富的地球观测数据
遥感影像(RSI)解译常因标注数据稀缺而面临挑战,限制了RSI解译任务的性能。为此,我们提出EarthSynth——基于扩散模型的生成式基础模型,能够为下游RSI解译任务合成多类别、跨卫星标注的地球观测数据。据我们所知,EarthSynth是首个探索遥感多任务生成的方法,解决了RSI解译中面向任务合成泛化能力受限的难题。该模型在EarthSynth-180K数据集上训练,采用反事实组合训练策略与三维批量样本选择机制,提升训练数据多样性并增强类别控制。此外,提出基于规则的R-Filter方法为下游任务筛选信息量更大的合成数据。我们在开放世界场景中对EarthSynth进行场景分类、目标检测和语义分割评估,在开放词汇理解任务中取得显著提升,为推进RSI解译提供了实用解决方案。
Summary / 总结
Remote sensing image (RSI) interpretation typically faces challenges due to the scarcity of labeled data, which limits the performance of RSI interpretation tasks.
What Holds Back Open-Vocabulary Segmentation?
Authors: Josip Šarić, Ivan Martinović, Matej Kristan, Siniša Šegvić
Venue: ICCV 2025 Workshop
First: 2025-08-06T08:46:47+00:00 · Latest: 2025-08-06T08:46:47+00:00
Comments: Accepted for publication at ICCV 25 Workshop: What is Next in
Multimodal Foundation Models?
Abstract
Standard segmentation setups are unable to deliver models that can recognize
concepts outside the training taxonomy. Open-vocabulary approaches promise to
close this gap through language-image pretraining on billions of image-caption
pairs. Unfortunately, we observe that the promise is not delivered due to
several bottlenecks that have caused the performance to plateau for almost two
years. This paper proposes novel oracle components that identify and decouple
these bottlenecks by taking advantage of the groundtruth information. The
presented validation experiments deliver important empirical findings that
provide a deeper insight into the failures of open-vocabulary models and
suggest promising approaches for unlocking future research.
中文标题/摘要
标题:开放词汇分割的瓶颈何在?
标准分割设置无法产生能识别训练分类外概念的模型。开放词汇方法承诺通过数十亿图像-标题对的语言-图像预训练来弥合这一差距。然而,我们发现由于多个瓶颈导致性能近两年停滞不前,这一承诺未能实现。本文提出新颖的预言组件,利用真实标注信息识别并解耦这些瓶颈。验证实验提供了重要的实证发现,深入揭示了开放词汇模型的失败原因,并为未来研究指明了突破方向。
Summary / 总结
Standard segmentation setups are unable to deliver models that can recognize concepts outside the training taxonomy.
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
Authors: Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara
First: 2024-11-28T19:00:03+00:00 · Latest: 2025-08-05T12:26:14+00:00
Abstract
Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form
textual concepts without predefined training classes. While existing
vision-language models such as CLIP can generate segmentation masks by
leveraging coarse spatial information from Vision Transformers, they face
challenges in spatial localization due to their global alignment of image and
text features. Conversely, self-supervised visual models like DINO excel in
fine-grained visual encoding but lack integration with language. To bridge this
gap, we present Talk2DINO, a novel hybrid approach that combines the spatial
accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns
the textual embeddings of CLIP to the patch-level features of DINOv2 through a
learned mapping function without the need to fine-tune the underlying
backbones. At training time, we exploit the attention maps of DINOv2 to
selectively align local visual patches with textual embeddings. We show that
the powerful semantic and localization abilities of Talk2DINO can enhance the
segmentation process, resulting in more natural and less noisy segmentations,
and that our approach can also effectively distinguish foreground objects from
the background. Experimental results demonstrate that Talk2DINO achieves
state-of-the-art performance across several unsupervised OVS benchmarks. Source
code and models are publicly available at:
https://lorebianchi98.github.io/Talk2DINO/.
中文标题/摘要
标题:与DINO对话:融合自监督视觉骨干与语言实现开放词汇分割
开放词汇分割(OVS)旨在无需预定义训练类别的情况下,根据自由形式的文本概念对图像进行分割。虽然CLIP等现有多模态模型可通过利用视觉Transformer的粗略空间信息生成分割掩码,但由于其图像与文本特征的全局对齐方式,在空间定位方面面临挑战。相反,DINO等自监督视觉模型擅长细粒度视觉编码,但缺乏与语言的整合。为弥合这一差距,我们提出Talk2DINO——一种将DINOv2的空间精确性与CLIP的语言理解能力相结合的新型混合方法。该方法通过可学习的映射函数将CLIP的文本嵌入与DINOv2的补丁级特征对齐,无需微调底层骨干网络。训练时利用DINOv2的注意力图选择性对齐局部视觉补丁与文本嵌入。实验表明,Talk2DINO强大的语义和定位能力能优化分割过程,产生更自然、噪声更少的分割结果,并能有效区分前景与背景。在多个无监督OVS基准测试中达到最先进性能。源代码与模型已开源:https://lorebianchi98.github.io/Talk2DINO/
Summary / 总结
Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes.
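The Talk2DINO abstract describes mapping CLIP text embeddings into the space of DINOv2 patch features and comparing them there. The sketch below illustrates only the inference-time idea under simplifying assumptions: the "learned mapping function" is taken to be a plain linear map (the abstract does not specify its form), each patch is labeled with its most similar projected text embedding, and the 2-dimensional vectors are toy stand-ins for real features. The names `segment`, `projection`, and `patch_features` are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def segment(patch_features, text_embeddings, projection):
    """Assign each patch the index of the most similar projected text embedding."""
    # Map every class embedding from the text space into the patch-feature space.
    projected = [matvec(projection, t) for t in text_embeddings]
    labels = []
    for patch in patch_features:
        sims = [cosine(patch, t) for t in projected]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


# Toy setup: the "learned" map simply swaps the two axes.
projection = [[0.0, 1.0], [1.0, 0.0]]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # class 0, class 1 in text space
patch_features = [[0.1, 0.9], [0.8, 0.2], [0.2, 0.7]]

labels = segment(patch_features, text_embeddings, projection)
```

Here class 0 projects to `[0.0, 1.0]` and class 1 to `[1.0, 0.0]`, so the first and third patches receive label 0 and the second receives label 1. Because only the projection is trained, both backbones can stay frozen, consistent with the abstract's statement that no backbone fine-tuning is needed.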
Taking Language Embedded 3D Gaussian Splatting into the Wild
Authors: Yuze Wang, Yue Qi
First: 2025-07-26T07:00:32+00:00 · Latest: 2025-08-05T01:40:57+00:00
Comments: Visit our project page at
https://yuzewang1998.github.io/takinglangsplatw/
Abstract
Recent advances in leveraging large-scale Internet photo collections for 3D
reconstruction have enabled immersive virtual exploration of landmarks and
historic sites worldwide. However, little attention has been given to the
immersive understanding of architectural styles and structural knowledge, which
remains largely confined to browsing static text-image pairs. Therefore, can we
draw inspiration from 3D in-the-wild reconstruction techniques and use
unconstrained photo collections to create an immersive approach for
understanding the 3D structure of architectural components? To this end, we
extend language embedded 3D Gaussian splatting (3DGS) and propose a novel
framework for open-vocabulary scene understanding from unconstrained photo
collections. Specifically, we first render multiple appearance images from the
same viewpoint as the unconstrained image with the reconstructed radiance
field, then extract multi-appearance CLIP features and two types of language
feature uncertainty maps (transient and appearance uncertainty) derived from the
multi-appearance features to guide the subsequent optimization process. Next,
we propose a transient uncertainty-aware autoencoder, a multi-appearance
language field 3DGS representation, and a post-ensemble strategy to effectively
compress, learn, and fuse language features from multiple appearances. Finally,
to quantitatively evaluate our method, we introduce PT-OVS, a new benchmark
dataset for assessing open-vocabulary segmentation performance on unconstrained
photo collections. Experimental results show that our method outperforms
existing methods, delivering accurate open-vocabulary segmentation and enabling
applications such as interactive roaming with open-vocabulary queries,
architectural style pattern recognition, and 3D scene editing.
中文标题/摘要
标题:将语言嵌入的3D高斯溅射技术引入实景应用
近年来利用互联网大规模照片集进行三维重建的进展,实现了对全球地标和历史遗址的沉浸式虚拟探索。然而,对于建筑风格与结构知识的沉浸式理解仍鲜有关注,目前主要局限于浏览静态图文对。为此,我们能否从野外三维重建技术中汲取灵感,利用无约束照片集创建理解建筑构件三维结构的沉浸式方法?本文扩展了语言嵌入的3D高斯溅射技术(3DGS),提出了一种基于无约束照片集的开放词汇场景理解新框架。具体而言,我们首先通过重建辐射场从与无约束图像相同视角渲染多外观图像,继而提取多外观CLIP特征及两种语言特征不确定性图谱——瞬态不确定性和外观不确定性(源自多外观特征)以指导后续优化过程。接着提出瞬态不确定性感知自编码器、多外观语言场3DGS表示及后集成策略,有效压缩、学习并融合多外观语言特征。最后,为量化评估方法,我们引入PT-OVS基准数据集,用于评估无约束照片集上的开放词汇分割性能。实验结果表明,本方法优于现有技术,可实现精确的开放词汇分割,并支持开放词汇查询交互漫游、建筑风格模式识别及三维场景编辑等应用。
Summary / 总结
Recent advances in leveraging large-scale Internet photo collections for 3D reconstruction have enabled immersive virtual exploration of landmarks and historic sites worldwide.
AG$^2$aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing
Authors: Zhaonan Wang, Manyi Li, Changhe Tu
First: 2025-08-03T12:47:30+00:00 · Latest: 2025-08-03T12:47:30+00:00
Abstract
3D Gaussian Splatting (3DGS) has witnessed exponential adoption across
diverse applications, driving a critical need for semantic-aware 3D Gaussian
representations to enable scene understanding and editing tasks. Existing
approaches typically attach semantic features to a collection of free Gaussians
and distill the features via differentiable rendering, leading to noisy
segmentation and a messy selection of Gaussians. In this paper, we introduce
AG$^2$aussian, a novel framework that leverages an anchor-graph structure to
organize semantic features and regulate Gaussian primitives. Our anchor-graph
structure not only promotes compact and instance-aware Gaussian distributions,
but also facilitates graph-based propagation, achieving a clean and accurate
instance-level Gaussian selection. Extensive validation across four
applications, i.e. interactive click-based query, open-vocabulary text-driven
query, object removal editing, and physics simulation, demonstrates the
advantages of our approach and its benefits to various applications. The
experiments and ablation studies further evaluate the effectiveness of the key
designs of our approach.
中文标题/摘要
标题:AG$^2$aussian:基于锚图结构的高斯溅射实现实例级三维场景理解与编辑
三维高斯溅射(3DGS)技术已在多个领域得到指数级应用,这推动了对具备语义感知能力的三维高斯表示的迫切需求,以实现场景理解与编辑任务。现有方法通常将语义特征附加到自由高斯集合上,并通过可微分渲染提取特征,导致分割结果存在噪声且高斯选择混乱。本文提出AG$^2$aussian——一种利用锚图结构组织语义特征并规整高斯基元的新框架。我们的锚图结构不仅促进了紧凑且实例感知的高斯分布,还支持基于图的传播机制,实现了清晰准确的实例级高斯选择。通过在交互式点击查询、开放词汇文本驱动查询、物体移除编辑及物理仿真四个应用领域的广泛验证,证明了本方法的优势及其多场景适用性。实验与消融研究进一步评估了关键设计的有效性。
Summary / 总结
3D Gaussian Splatting (3DGS) has witnessed exponential adoption across diverse applications, driving a critical need for semantic-aware 3D Gaussian representations to enable scene understanding and editing tasks.
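The graph-based propagation mentioned in the AG$^2$aussian abstract can be pictured with a minimal sketch. This is not the authors' implementation: the abstract does not specify the graph construction or the propagation rule, so the cosine-similarity test, the `min_sim` threshold, and the names `propagate_selection`, `features`, and `edges` are all assumptions made for illustration. The idea shown is region growing from a clicked anchor over graph edges, so a similar anchor that is only reachable through a dissimilar one is not selected.

```python
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def propagate_selection(features, edges, seed, min_sim=0.9):
    """Breadth-first region growing from `seed` over an undirected graph.

    A neighbor joins the selection only if its feature is close enough to the
    seed's feature. The similarity rule is a hypothetical stand-in.
    """
    neighbors = {i: set() for i in range(len(features))}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    selected = {seed}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in selected and cosine(features[seed], features[v]) >= min_sim:
                selected.add(v)
                queue.append(v)
    return selected


# Toy anchor graph: anchors 0, 1, 3 look alike, but 3 is only reachable via 2.
features = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0], [0.99, 0.02]]
edges = [(0, 1), (1, 2), (2, 3)]
selected = propagate_selection(features, edges, seed=0)
```

In this toy graph the selection stops at the dissimilar anchor 2, which is the kind of spatially connected, instance-level selection that a graph structure makes possible and a plain per-Gaussian feature threshold does not.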
OpenGS-Fusion: Open-Vocabulary Dense Mapping with Hybrid 3D Gaussian Splatting for Refined Object-Level Understanding
Authors: Dianyi Yang, Xihan Wang, Yu Gao, Shiyang Liu, Bohan Ren, Yufeng Yue, Yi Yang
First: 2025-08-02T02:22:36+00:00 · Latest: 2025-08-02T02:22:36+00:00
Comments: IROS2025
Abstract
Recent advancements in 3D scene understanding have made significant strides
in enabling interaction with scenes using open-vocabulary queries, particularly
for VR/AR and robotic applications. Nevertheless, existing methods are hindered
by rigid offline pipelines and the inability to provide precise 3D object-level
understanding given open-ended queries. In this paper, we present
OpenGS-Fusion, an innovative open-vocabulary dense mapping framework that
improves semantic modeling and refines object-level understanding.
OpenGS-Fusion combines 3D Gaussian representation with a Truncated Signed
Distance Field to facilitate lossless fusion of semantic features on-the-fly.
Furthermore, we introduce a novel multimodal language-guided approach named
MLLM-Assisted Adaptive Thresholding, which refines the segmentation of 3D
objects by adaptively adjusting similarity thresholds, achieving an improvement
of 17% in 3D mIoU compared to the fixed threshold strategy. Extensive experiments
demonstrate that our method outperforms existing methods in 3D object
understanding and scene reconstruction quality, and show its
effectiveness in language-guided scene interaction. The code is available at
https://young-bit.github.io/opengs-fusion.github.io/ .
中文标题/摘要
标题:OpenGS-Fusion:基于混合3D高斯泼溅的开放词汇密集建图与精细化对象级理解
近期三维场景理解技术的进步在支持通过开放词汇查询进行场景交互方面取得重大进展,尤其在VR/AR和机器人应用中。然而,现有方法受限于僵化的离线流程及无法针对开放式查询提供精确的三维对象级理解。本文提出OpenGS-Fusion——一种创新的开放词汇密集建图框架,通过融合3D高斯表示与截断符号距离场实现语义特征的无损实时融合,显著提升语义建模能力与对象级理解精度。我们进一步引入名为MLLM辅助自适应阈值分割的多模态语言引导方法,通过动态调整相似度阈值优化三维对象分割,相比固定阈值策略使三维mIoU指标提升17%。大量实验表明,本方法在三维对象理解、场景重建质量及语言引导场景交互方面均优于现有方法。代码发布于https://young-bit.github.io/opengs-fusion.github.io/
Summary / 总结
Recent advancements in 3D scene understanding have made significant strides in enabling interaction with scenes using open-vocabulary queries, particularly for VR/AR and robotic applications.
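The contrast between a fixed similarity threshold and an adaptively chosen one, as described in the OpenGS-Fusion abstract, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not describe how the MLLM drives the adjustment, so `score_fn` is a hypothetical stand-in for that judgment, and in the demo it is replaced by an IoU against a known target set purely so the example is deterministic. The function names and the candidate list are likewise assumptions.

```python
def select_by_threshold(similarities, threshold):
    """Indices of points whose query similarity is at least `threshold`."""
    return {i for i, s in enumerate(similarities) if s >= threshold}


def iou(a, b):
    """Intersection over union of two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def adaptive_threshold(similarities, candidates, score_fn):
    """Return the candidate threshold whose selection `score_fn` rates highest.

    `score_fn` is a hypothetical judge of a candidate selection; in the
    abstract's setting that role is played by an MLLM.
    """
    return max(
        candidates,
        key=lambda t: score_fn(select_by_threshold(similarities, t)),
    )


# Per-point similarity to a text query; points 0-2 belong to the target object.
sims = [0.91, 0.88, 0.62, 0.35, 0.30]
target = {0, 1, 2}

fixed = select_by_threshold(sims, 0.8)
best = adaptive_threshold(sims, [0.3, 0.5, 0.7, 0.9], lambda sel: iou(sel, target))
adaptive = select_by_threshold(sims, best)
```

Here the fixed threshold of 0.8 misses point 2, while the searched threshold recovers the whole object. A real system would have no ground-truth set at query time, which is why an external judge is needed.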
Training-Free Class Purification for Open-Vocabulary Semantic Segmentation
Authors: Qi Chen, Lingxiao Yang, Yun Chen, Nailong Zhao, Jianhuang Lai, Jie Shao, Xiaohua Xie
Venue: ICCV 2025
First: 2025-08-01T11:55:12+00:00 · Latest: 2025-08-01T11:55:12+00:00
Comments: Accepted to ICCV 2025
Abstract
Fine-tuning pre-trained vision-language models has emerged as a powerful
approach for enhancing open-vocabulary semantic segmentation (OVSS). However,
the substantial computational and resource demands associated with training on
large datasets have prompted interest in training-free methods for OVSS.
Existing training-free approaches primarily focus on modifying model
architectures and generating prototypes to improve segmentation performance.
However, they often neglect the challenges posed by class redundancy, where
multiple categories are not present in the current test image, and
visual-language ambiguity, where semantic similarities among categories create
confusion in class activation. These issues can lead to suboptimal class
activation maps and affinity-refined activation maps. Motivated by these
observations, we propose FreeCP, a novel training-free class purification
framework designed to address these challenges. FreeCP focuses on purifying
semantic categories and rectifying errors caused by redundancy and ambiguity.
The purified class representations are then leveraged to produce final
segmentation predictions. We conduct extensive experiments across eight
benchmarks to validate FreeCP's effectiveness. Results demonstrate that FreeCP,
as a plug-and-play module, significantly boosts segmentation performance when
combined with other OVSS methods.
中文标题/摘要
标题:开放词汇语义分割的无训练类别净化方法
微调预训练的视觉-语言模型已成为增强开放词汇语义分割(OVSS)的有效途径。然而,大规模数据集训练所需的高计算和资源成本引发了人们对无训练OVSS方法的兴趣。现有无训练方法主要侧重于修改模型架构和生成原型以提升分割性能,但往往忽略了类别冗余(当前测试图像中不存在的多个类别)和视觉-语言歧义(类别间语义相似性导致类激活混淆)带来的挑战。这些问题可能导致次优的类激活图及亲和力精炼激活图。基于这些观察,我们提出FreeCP——一种创新的无训练类别净化框架,旨在解决这些挑战。FreeCP通过净化语义类别并修正冗余与歧义导致的误差,利用净化后的类别表征生成最终分割预测。我们在八个基准测试上开展广泛实验验证FreeCP的有效性。结果表明,FreeCP作为即插即用模块,与其他OVSS方法结合时能显著提升分割性能。
Summary / 总结
Fine-tuning pre-trained vision-language models has emerged as a powerful approach for enhancing open-vocabulary semantic segmentation (OVSS).
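To make the class redundancy problem from the FreeCP abstract concrete, here is a small sketch. It is not FreeCP's procedure, which the abstract does not detail: the peak-activation rule, the `min_peak` value, and the names `purify_classes` and `argmax_labels` are assumptions chosen only to show why removing absent categories before the per-pixel argmax can change the prediction.

```python
def purify_classes(activations, min_peak=0.5):
    """Keep only classes whose strongest per-pixel activation reaches `min_peak`.

    `activations` maps a class name to a list of per-pixel scores. The peak
    rule is a hypothetical filter, not the method from the paper.
    """
    return {c: a for c, a in activations.items() if max(a) >= min_peak}


def argmax_labels(activations):
    """Assign each pixel the class with the highest activation."""
    names = list(activations)
    n_pixels = len(next(iter(activations.values())))
    return [max(names, key=lambda c: activations[c][i]) for i in range(n_pixels)]


# "tiger" is not in the image but is semantically close to "cat", so it has
# weak activations that still win at an uncertain pixel (index 2).
activations = {
    "cat": [0.9, 0.8, 0.2, 0.1],
    "sofa": [0.1, 0.3, 0.22, 0.8],
    "tiger": [0.3, 0.2, 0.25, 0.15],
}

before = argmax_labels(activations)
after = argmax_labels(purify_classes(activations))
```

Before filtering, the absent class takes the uncertain pixel; after filtering, that pixel falls to the best remaining class.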
OpenSeg-R: Improving Open-Vocabulary Segmentation via Step-by-Step Visual Reasoning
Authors: Zongyan Han, Jiale Cao, Shuo Chen, Tong Wang, Jorma Laaksonen, Rao Muhammad Anwer
First: 2025-05-22T17:51:48+00:00 · Latest: 2025-08-01T08:53:45+00:00
Abstract
Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its
capacity to generalize segmentation beyond predefined categories. However,
existing methods typically predict segmentation masks with simple forward
inference, lacking explicit reasoning and interpretability. This makes it
challenging for OVS models to distinguish similar categories in open-world
settings due to the lack of contextual understanding and discriminative visual
cues. To address this limitation, we propose a step-by-step visual reasoning
framework for open-vocabulary segmentation, named OpenSeg-R. The proposed
OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical
visual reasoning before segmentation. Specifically, we generate both generic
and image-specific reasoning for each image, forming structured triplets that
explain the visual reason for objects in a coarse-to-fine manner. Based on
these reasoning steps, we can compose detailed description prompts, and feed
them to the segmentor to produce more accurate segmentation masks. To the best
of our knowledge, OpenSeg-R is the first framework to introduce explicit
step-by-step visual reasoning into OVS. Experimental results demonstrate that
OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary
semantic segmentation across five benchmark datasets. Moreover, it achieves
consistent gains across all metrics on open-vocabulary panoptic segmentation.
Qualitative results further highlight the effectiveness of our reasoning-guided
framework in improving both segmentation precision and interpretability. Our
code is publicly available at https://github.com/Hanzy1996/OpenSeg-R.
中文标题/摘要
标题:OpenSeg-R:通过逐步视觉推理改进开放词汇分割
开放词汇分割(OVS)因其能够泛化分割至预定义类别之外而日益受到关注。然而,现有方法通常通过简单的前向推理预测分割掩码,缺乏显式推理和可解释性。这使得OVS模型在开放世界场景中因缺乏上下文理解和判别性视觉线索而难以区分相似类别。为解决这一局限,我们提出了名为OpenSeg-R的逐步视觉推理框架。该框架利用大型多模态模型(LMMs)在分割前执行分层视觉推理,通过生成通用及图像特定的推理内容,构建结构化三元组以粗到细的方式解释物体视觉特征。基于这些推理步骤,可组合详细描述提示并输入分割器以生成更精确的分割掩码。据我们所知,OpenSeg-R是首个将显式逐步推理引入OVS的框架。实验结果表明,在五个基准数据集上,OpenSeg-R在开放词汇语义分割任务中显著优于现有最优方法,并在开放词汇全景分割的所有指标上实现一致提升。定性结果进一步验证了该推理引导框架对提升分割精度与可解释性的有效性。代码已开源:https://github.com/Hanzy1996/OpenSeg-R。
Summary / 总结
Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories.
CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
Authors: Dengke Zhang, Fagui Liu, Quan Tang
Venue: ICCV 2025 Oral
First: 2024-11-15T10:14:55+00:00 · Latest: 2025-08-01T08:25:34+00:00
Comments: Accepted to ICCV 2025 Oral
Abstract
Open-vocabulary semantic segmentation aims to assign semantic labels to each
pixel without being constrained by a predefined set of categories. While
Contrastive Language-Image Pre-training (CLIP) excels in zero-shot
classification, it struggles to align image patches with category embeddings
because of its incoherent patch correlations. This study reveals that
inter-class correlations are the main cause of CLIP's impaired segmentation
performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and
value of patch correlations. Specifically, CorrCLIP leverages the Segment
Anything Model (SAM) to define the scope of patch interactions, reducing
inter-class correlations. To mitigate the problem that SAM-generated masks may
contain patches belonging to different classes, CorrCLIP incorporates
self-supervised models to compute coherent similarity values, suppressing the
weight of inter-class correlations. We also introduce two additional
branches to strengthen patch features' spatial details and semantic
representation. Finally, we update segmentation maps with SAM-generated masks
to improve spatial consistency. Based on the improvement across patch
correlations, feature representations, and segmentation maps, CorrCLIP achieves
superior performance across eight benchmarks. Code is available at:
https://github.com/zdk258/CorrCLIP.
中文标题/摘要
标题:CorrCLIP:重构CLIP中的图像块相关性以实现开放词汇语义分割
开放词汇语义分割旨在不受预定义类别限制的情况下为每个像素分配语义标签。尽管对比语言-图像预训练模型(CLIP)在零样本分类中表现卓越,但由于其图像块相关性不连贯,难以将图像块与类别嵌入对齐。本研究揭示类间相关性是影响CLIP分割性能的主要原因。据此,我们提出CorrCLIP,通过重构图像块相关性的作用域和数值来解决问题。具体而言,CorrCLIP利用分割任意模型(SAM)界定图像块交互范围以降低类间相关性;针对SAM生成掩码可能包含多类别图像块的问题,引入自监督模型计算一致性相似值以抑制类间相关性权重。此外,我们新增两个分支强化图像块特征的空间细节和语义表征,并采用SAM生成掩码更新分割图以提升空间一致性。通过改进图像块相关性、特征表征和分割图,CorrCLIP在八个基准测试中均取得优异性能。代码详见:https://github.com/zdk258/CorrCLIP。
Summary / 总结
Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories.
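Two ideas from the CorrCLIP abstract, limiting patch interactions to a mask-defined scope and updating the segmentation map with those masks, can be sketched in a few lines. This is a simplified illustration rather than the released code: the region ids stand in for SAM-generated masks, the dot-product softmax stands in for the model's attention, and majority voting is an assumed way of applying masks to the label map. All function names are hypothetical.

```python
import math
from collections import Counter


def region_restricted_attention(features, regions):
    """Each patch attends (softmax over dot products) only to same-region patches."""
    out = []
    for i, fi in enumerate(features):
        peers = [j for j, r in enumerate(regions) if r == regions[i]]
        logits = [sum(a * b for a, b in zip(fi, features[j])) for j in peers]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        out.append([
            sum(w * features[j][d] for w, j in zip(weights, peers)) / total
            for d in range(len(fi))
        ])
    return out


def vote_within_regions(labels, regions):
    """Replace each patch label with the majority label of its region."""
    majority = {}
    for r in set(regions):
        counts = Counter(l for l, rr in zip(labels, regions) if rr == r)
        majority[r] = counts.most_common(1)[0][0]
    return [majority[r] for r in regions]


# Three patches; the first two share a region, the third is alone in its own.
attended = region_restricted_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [0, 0, 1])

# Five patches in two regions; one patch in region 0 is mislabeled.
smoothed = vote_within_regions(["cat", "dog", "cat", "dog", "dog"], [0, 0, 0, 1, 1])
```

The third patch has no peers outside its own region, so its feature is unchanged, and the stray label in the first region is overwritten by that region's majority.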