arXiv Paper Digest (arXiv 论文速递)

2025-11-15 03:27
Snapshot: 20251115_0327
A Simple Framework for Open-Vocabulary Segmentation and Detection
Authors: Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, Lei Zhang
First: 2023-03-14T17:58:34+00:00 · Latest: 2023-03-20T10:52:40+00:00
Abstract
We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets. To bridge the gap of vocabulary and annotation granularity, we first introduce a pre-trained text encoder to encode all the visual concepts in two tasks and learn a common semantic space for them. This gives us reasonably good results compared with the counterparts trained on segmentation task only. To further reconcile them, we locate two discrepancies: i) task discrepancy -- segmentation requires extracting masks for both foreground objects and background stuff, while detection merely cares about the former; ii) data discrepancy -- box and mask annotations have different spatial granularity, and thus are not directly interchangeable. To address these issues, we propose a decoupled decoding to reduce the interference between foreground/background and a conditioned mask decoding to assist in generating masks for given boxes. To this end, we develop a simple encoder-decoder model encompassing all three techniques and train it jointly on COCO and Objects365. After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection. Specifically, OpenSeeD beats the state-of-the-art method for open-vocabulary instance and panoptic segmentation across 5 datasets, and outperforms previous work for open-vocabulary detection on LVIS and ODinW under similar settings. When transferred to specific tasks, our model achieves new SoTA for panoptic segmentation on COCO and ADE20K, and instance segmentation on ADE20K and Cityscapes. Finally, we note that OpenSeeD is the first to explore the potential of joint training on segmentation and detection, and hope it can be received as a strong baseline for developing a single model for both tasks in open world.
中文标题/摘要
标题:一种简单的开放词汇分割与检测框架
我们提出了OpenSeeD,这是一种简单的开放词汇分割与检测框架,能够从不同的分割和检测数据集中联合学习。为了解决词汇和注释粒度之间的差距,我们首先引入了一个预训练的文本编码器,将两个任务中的所有视觉概念编码,并学习一个共同的语义空间。这使我们能够获得与仅在分割任务上训练的模型相当甚至更好的结果。为进一步解决这些问题,我们定位了两个差异:i) 任务差异——分割需要提取前景对象和背景内容的掩码,而检测仅关心前者;ii) 数据差异——框和掩码注释具有不同的空间粒度,因此不能直接互换。为了解决这些问题,我们提出了一种解耦的解码方法来减少前景/背景之间的干扰,并提出了一种条件掩码解码方法来辅助生成给定框的掩码。为此,我们开发了一个包含所有三种技术的简单编码器-解码器模型,并在COCO和Objects365上联合训练。经过预训练后,我们的模型在分割和检测的零样本迁移性能上表现出竞争力或更强的性能。具体来说,OpenSeeD在5个数据集上击败了最先进的开放词汇实例和全景分割方法,并在LVIS和ODinW的相似设置下优于之前的开放词汇检测工作。当转移到特定任务时,我们的模型在COCO和ADE20K上的全景分割以及ADE20K和Cityscapes上的实例分割上达到了新的最佳性能。最后,我们注意到OpenSeeD是第一个探索分割和检测联合训练潜力的工作,并希望它能作为开发同时处理两个任务的单一模型的强大基线被接受。
Summary / 总结
The paper introduces OpenSeeD, a framework that jointly learns from segmentation and detection datasets to bridge gaps in vocabulary and annotation granularity. It uses a pre-trained text encoder to encode the visual concepts of both tasks in a common semantic space, proposes decoupled decoding to reduce foreground/background interference (the task discrepancy), and adds conditioned mask decoding to generate masks for given boxes (the data discrepancy). After joint training on COCO and Objects365, OpenSeeD outperforms state-of-the-art methods for open-vocabulary instance and panoptic segmentation across five datasets and previous open-vocabulary detection work on LVIS and ODinW under similar settings. When transferred to specific tasks, it sets new state-of-the-art results for panoptic segmentation on COCO and ADE20K and for instance segmentation on ADE20K and Cityscapes.
该论文提出了OpenSeeD框架,通过联合学习分割和检测数据集来弥合词汇和注释粒度的差距。它使用预训练的文本编码器将两个任务的视觉概念编码到共同的语义空间,提出解耦解码以减少前景/背景之间的干扰(任务差异),并提出条件掩码解码以辅助为给定框生成掩码(数据差异)。实验结果显示,OpenSeeD在五个数据集上的开放词汇实例和全景分割中优于最先进的方法;迁移到特定任务时,在COCO和ADE20K上的全景分割以及ADE20K和Cityscapes上的实例分割中取得了新的最佳结果。
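The shared semantic space described in this entry can be illustrated with a minimal sketch: once class names from both tasks are embedded by one text encoder, a region is labelled by its nearest text embedding, whichever dataset the name came from. The function names and the toy 3-d vectors below are assumptions made for illustration; they are not OpenSeeD's code or real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def classify_region(region_embedding, text_embeddings):
    """Return the concept whose text embedding is closest to the region embedding."""
    return max(text_embeddings,
               key=lambda name: cosine(region_embedding, text_embeddings[name]))

# Toy embeddings standing in for a text encoder's output (made-up values).
text_embeddings = {
    "dog": [1.0, 0.0, 0.0],      # foreground object, relevant to both tasks
    "sky": [0.0, 1.0, 0.0],      # background stuff, segmentation only
    "bicycle": [0.0, 0.0, 1.0],
}

print(classify_region([0.9, 0.1, 0.2], text_embeddings))  # dog
```

Because the label set is just a dictionary of text embeddings, swapping in a new vocabulary at test time requires no change to the classifier itself, which is the property open-vocabulary transfer relies on.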
USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation
Authors: Xiaoqi Wang, Wenbin He, Xiwei Xuan, Clint Sebastian, Jorge Piazentin Ono, Xin Li, Sima Behpour, Thang Doan, Liang Gou, Han Wei Shen, Liu Ren
First: 2024-06-07T21:41:18+00:00 · Latest: 2024-06-07T21:41:18+00:00
Abstract
The open-vocabulary image segmentation task involves partitioning images into semantically meaningful segments and classifying them with flexible text-defined categories. The recent vision-based foundation models such as the Segment Anything Model (SAM) have shown superior performance in generating class-agnostic image segments. The main challenge in open-vocabulary image segmentation now lies in accurately classifying these segments into text-defined categories. In this paper, we introduce the Universal Segment Embedding (USE) framework to address this challenge. This framework is comprised of two key components: 1) a data pipeline designed to efficiently curate a large amount of segment-text pairs at various granularities, and 2) a universal segment embedding model that enables precise segment classification into a vast range of text-defined categories. The USE model can not only help open-vocabulary image segmentation but also facilitate other downstream tasks (e.g., querying and ranking). Through comprehensive experimental studies on semantic segmentation and part segmentation benchmarks, we demonstrate that the USE framework outperforms state-of-the-art open-vocabulary segmentation methods.
中文标题/摘要
标题:USE: 开集词汇图像分割的通用分割嵌入
开集词汇图像分割任务涉及将图像划分为语义上有意义的片段,并用灵活的文本定义类别进行分类。最近的基于视觉的基础模型,如Segment Anything Model (SAM),在生成无类别图像片段方面表现出色。当前开集词汇图像分割的主要挑战在于准确地将这些片段分类到文本定义的类别中。在本文中,我们提出了通用分割嵌入(USE)框架来解决这一挑战。该框架由两个关键组件组成:1)一个数据管道,旨在高效地收集各种粒度的片段-文本对,以及2)一个通用分割嵌入模型,能够将片段精确分类到广泛的文本定义类别中。USE模型不仅可以帮助开集词汇图像分割,还可以促进其他下游任务(例如查询和排序)。通过在语义分割和部分分割基准上的全面实验研究,我们证明了USE框架优于最先进的开集词汇分割方法。
Summary / 总结
The paper introduces the Universal Segment Embedding (USE) framework for open-vocabulary image segmentation, addressing the challenge of accurately classifying segments into text-defined categories. The framework consists of a data pipeline for curating segment-text pairs and a universal segment embedding model. Experimental results show that USE outperforms existing methods on semantic and part segmentation benchmarks.
针对开放词汇图像分割中将图像片段准确分类到文本定义类别的挑战,论文提出了通用分割嵌入(USE)框架,该框架包括用于构建片段-文本对的数据管道和用于精确分类的通用分割嵌入模型。实验结果表明,USE框架在语义分割和部件分割基准测试中优于现有方法。
From Open-Vocabulary to Vocabulary-Free Semantic Segmentation
Authors: Klara Reichard, Giulia Rizzoli, Stefano Gasperini, Lukas Hoyer, Pietro Zanuttigh, Nassir Navab, Federico Tombari
First: 2025-02-17T15:17:08+00:00 · Latest: 2025-02-17T15:17:08+00:00
Comments: Submitted to: Pattern Recognition Letters, Klara Reichard and Giulia Rizzoli equally contributed to this work
Abstract
Open-vocabulary semantic segmentation enables models to identify novel object categories beyond their training data. While this flexibility represents a significant advancement, current approaches still rely on manually specified class names as input, creating an inherent bottleneck in real-world applications. This work proposes a Vocabulary-Free Semantic Segmentation pipeline, eliminating the need for predefined class vocabularies. Specifically, we address the chicken-and-egg problem where users need knowledge of all potential objects within a scene to identify them, yet the purpose of segmentation is often to discover these objects. The proposed approach leverages Vision-Language Models to automatically recognize objects and generate appropriate class names, aiming to solve the challenge of class specification and naming quality. Through extensive experiments on several public datasets, we highlight the crucial role of the text encoder in model performance, particularly when the image text classes are paired with generated descriptions. Despite the challenges introduced by the sensitivity of the segmentation text encoder to false negatives within the class tagging process, which adds complexity to the task, we demonstrate that our fully automated pipeline significantly enhances vocabulary-free segmentation accuracy across diverse real-world scenarios.
中文标题/摘要
标题:从开放词汇到无词汇语义分割
开放词汇语义分割使模型能够识别超出训练数据的新颖对象类别。尽管这种灵活性代表了重要进步,但当前方法仍需手动指定类名作为输入,这在实际应用中形成了瓶颈。本文提出了一种无词汇语义分割管道,消除了预定义词汇表的需要。具体而言,我们解决了这样一个难题:用户需要了解场景中所有潜在对象的知识才能识别它们,而分割的目的往往是发现这些对象。所提出的方法利用视觉-语言模型自动识别对象并生成适当的类名,旨在解决类指定和命名质量的挑战。通过在多个公开数据集上的大量实验,我们强调了文本编码器在模型性能中的关键作用,尤其是在图像文本类别与生成描述配对时。尽管分割文本编码器对类标签过程中假阴性的敏感性引入了任务复杂性,但我们证明了我们完全自动化的管道在多种现实场景中显著提高了无词汇分割的准确性。
Summary / 总结
This work addresses the limitation of open-vocabulary semantic segmentation by proposing a Vocabulary-Free Semantic Segmentation pipeline that eliminates the need for predefined class vocabularies. It uses Vision-Language Models to automatically recognize objects and generate class names, solving the challenge of class specification. Extensive experiments on public datasets show that the text encoder plays a crucial role in model performance, especially when paired with generated descriptions. Despite the complexity introduced by the sensitivity of the segmentation text encoder to false negatives, the fully automated pipeline significantly improves vocabulary-free segmentation accuracy in various real-world scenarios.
该研究旨在通过提出一种无词汇表语义分割管道,消除语义分割中预定义类词汇表的需求。该方法利用视觉-语言模型自动识别物体并生成类名,解决了鸡和蛋问题。大量实验表明,文本编码器在模型性能中起着关键作用,且该管道在多种现实场景中显著提高了无词汇表分割的准确性,尽管存在类标签过程中假阴性带来的挑战。
Weakly Supervised 3D Open-vocabulary Segmentation
Authors: Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, Shijian Lu
Venue: NeurIPS 2023
First: 2023-05-23T14:16:49+00:00 · Latest: 2024-01-09T17:09:47+00:00
Comments: Accepted to NeurIPS 2023
Abstract
Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception and thus a crucial objective in computer vision research. However, this task is heavily impeded by the lack of large-scale and diverse 3D open-vocabulary segmentation datasets for training robust and generalizable models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation models helps but it compromises the open-vocabulary feature as the 2D models are mostly finetuned with closed-vocabulary datasets. We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner. Specifically, given only the open-vocabulary text descriptions of the objects in a scene, we distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF), which effectively lifts 2D features into view-consistent 3D segmentation. A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process. Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations in certain scenes, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs. Code is available at https://github.com/Kunhao-Liu/3D-OVS.
中文标题/摘要
标题:弱监督3D开放词汇分割
3D场景的开放词汇分割是人类感知的基本功能,也是计算机视觉研究中的关键目标。然而,由于缺乏大规模和多样化的3D开放词汇分割数据集,训练出鲁棒且泛化的模型变得非常困难。从预训练的2D开放词汇分割模型中提取知识有助于解决这一问题,但会牺牲开放词汇特征,因为这些2D模型大多是在封闭词汇数据集上微调的。我们通过弱监督方式利用预训练的基础模型CLIP和DINO来解决3D开放词汇分割的挑战。具体来说,仅给定场景中对象的开放词汇文本描述,我们将CLIP和DINO的开放词汇多模态知识和对象推理能力提炼到神经辐射场(NeRF)中,从而将2D特征提升为视图一致的3D分割。我们方法的一个显著特点是,它不需要为基础模型或提炼过程提供任何手动分割注释。大量实验表明,即使在某些场景中,我们的方法也优于使用分割注释进行完全监督训练的模型,这表明3D开放词汇分割可以从2D图像和图文对中有效学习。代码可在https://github.com/Kunhao-Liu/3D-OVS获取。
Summary / 总结
The research aims to address the challenge of 3D open-vocabulary segmentation by leveraging pre-trained CLIP and DINO models in a weakly supervised manner. The method distills open-vocabulary knowledge from these models into a neural radiance field (NeRF) to achieve 3D segmentation without requiring any manual annotations. Experiments demonstrate that the proposed approach outperforms fully supervised models in certain scenarios, indicating the potential of learning 3D open-vocabulary segmentation from 2D images and text-image pairs.
研究旨在通过弱监督方式利用预训练的CLIP和DINO模型解决3D开放词汇分割的挑战。方法将这些模型中的开放词汇多模态知识提炼到神经辐射场(NeRF)中,以实现3D分割,无需任何人工标注。实验表明,在某些场景下,该方法的性能优于完全监督模型,表明可以从2D图像和图文对中学习3D开放词汇分割的潜力。
Diffusion Models for Open-Vocabulary Segmentation
Authors: Laurynas Karazija, Iro Laina, Andrea Vedaldi, Christian Rupprecht
Venue: ECCV 2024
First: 2023-06-15T17:51:28+00:00 · Latest: 2024-09-30T03:17:39+00:00
Comments: ECCV 2024
Abstract
Open-vocabulary segmentation is the task of segmenting anything that can be named in an image. Recently, large-scale vision-language modelling has led to significant advances in open-vocabulary segmentation, but at the cost of gargantuan and increasing training and annotation efforts. Hence, we ask if it is possible to use existing foundation models to synthesise on-demand efficient segmentation algorithms for specific class sets, making them applicable in an open-vocabulary setting without the need to collect further data, annotations or perform training. To that end, we present OVDiff, a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation. OVDiff synthesises support image sets for arbitrary textual categories, creating for each a set of prototypes representative of both the category and its surrounding context (background). It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training. Our approach shows strong performance on a range of benchmarks, obtaining a lead of more than 5% over prior work on PASCAL VOC.
中文标题/摘要
标题:开放词汇分割的扩散模型
开放词汇分割是将图像中任何可命名的物体进行分割的任务。最近,大规模的视觉-语言建模在开放词汇分割方面取得了显著进展,但代价是巨大的且不断增加的训练和标注努力。因此,我们询问是否可以利用现有的基础模型来合成针对特定类别集的高效分割算法,使其在开放词汇设置下应用,而无需收集更多数据、标注或进行训练。为此,我们提出了OVDiff,一种利用生成文本到图像扩散模型进行无监督开放词汇分割的新方法。OVDiff 为任意文本类别合成了支持图像集,为每个类别创建了一组代表该类别及其周围上下文(背景)的原型。它仅依赖于预训练组件,并直接输出合成的分割器,无需训练。我们的方法在一系列基准测试中表现出色,PASCAL VOC 上的性能比先前工作高出超过 5%。
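The prototype idea in this entry can be sketched as nearest-prototype labelling: each category (and the background) is represented by a few feature prototypes, and a pixel takes the label of its best-matching prototype. This is a simplified illustration under stated assumptions, not OVDiff itself: the prototypes are toy 2-d vectors rather than features from synthesised support images, and a single shared background set stands in for the context prototypes mentioned in the abstract.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def label_pixels(pixel_features, prototypes):
    """Assign each pixel feature the label of its best-matching prototype.

    prototypes maps a label (a category name or "background") to a list of
    prototype vectors; a label's score is its best single prototype match.
    """
    labels = []
    for feat in pixel_features:
        scores = {name: max(dot(feat, p) for p in protos)
                  for name, protos in prototypes.items()}
        labels.append(max(scores, key=scores.get))
    return labels

# Hypothetical prototypes and pixel features (made-up values).
prototypes = {
    "cat": [[1.0, 0.0], [0.8, 0.2]],
    "background": [[0.0, 1.0]],
}
pixels = [[0.9, 0.1], [0.1, 0.9]]
print(label_pixels(pixels, prototypes))  # ['cat', 'background']
```

No parameters are learned here: changing the class set only means building a different prototype dictionary, which mirrors the training-free setting the abstract describes.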
OpenMask3D: Open-Vocabulary 3D Instance Segmentation
Authors: Ayça Takmaz, Elisabetta Fedele, Robert W. Sumner, Marc Pollefeys, Federico Tombari, Francis Engelmann
Venue: NeurIPS 2023
First: 2023-06-23T17:36:44+00:00 · Latest: 2023-10-29T14:04:25+00:00
Comments: NeurIPS 2023. Project page: https://openmask3d.github.io/
Abstract
We introduce the task of open-vocabulary 3D instance segmentation. Current approaches for 3D instance segmentation can typically only recognize object categories from a pre-defined closed set of classes that are annotated in the training datasets. This results in important limitations for real-world applications where one might need to perform tasks guided by novel, open-vocabulary queries related to a wide variety of objects. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods cannot separate multiple object instances. In this work, we address this limitation, and propose OpenMask3D, which is a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D's ability to segment object properties based on free-form queries describing geometry, affordances, and materials.
中文标题/摘要
标题:OpenMask3D: 开放词汇3D实例分割
我们介绍了开放词汇3D实例分割的任务。当前的3D实例分割方法通常只能识别训练数据集中预定义的封闭类别集合中的对象类别。这在实际应用中造成了重要限制,因为可能需要执行由新颖的开放词汇查询引导的任务,这些查询与各种对象有关。最近,为了解决这个问题,出现了开放词汇3D场景理解方法,通过学习场景中每个点的可查询特征来解决这一问题。虽然这种表示可以直接用于执行语义分割,但现有方法无法区分多个对象实例。在本文中,我们解决了这一限制,并提出了OpenMask3D,这是一种开放词汇3D实例分割的零样本方法。通过预测的类无差别的3D实例掩码引导,我们的模型通过基于CLIP的图像嵌入的多视图融合来聚合每个掩码特征。在ScanNet200和Replica上的实验和消融研究显示,OpenMask3D在长尾分布上优于其他开放词汇方法。进一步的定性实验展示了OpenMask3D根据自由形式查询描述几何、功能和材料的能力来分割对象属性。
Summary / 总结
The research introduces open-vocabulary 3D instance segmentation to address the limitations of closed-set approaches in recognizing novel object categories. OpenMask3D proposes a zero-shot method that uses class-agnostic 3D instance masks and multi-view fusion of CLIP-based image embeddings to aggregate per-mask features. Experiments on ScanNet200 and Replica demonstrate that OpenMask3D outperforms existing methods, particularly on long-tail distributions, and effectively segments objects based on free-form queries describing geometry, affordances, and materials.
研究引入了开放词汇量的3D实例分割任务,以解决当前3D实例分割方法只能识别预定义对象类别的问题。提出了OpenMask3D作为零样本方法,通过类无差别3D实例掩码和基于CLIP的多视图融合图像嵌入来聚合每个掩码特征。在ScanNet200和Replica上的实验表明,OpenMask3D在长尾分布的对象类别上优于其他方法。
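As a rough sketch of the per-mask aggregation described above, the embeddings of the views in which a 3D mask is visible can be averaged into one feature, and masks can then be ranked against a text query by cosine similarity. Plain averaging, the function names and the toy 2-d vectors are assumptions made for illustration; the paper's view selection and cropping steps are not reproduced.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def fuse_views(view_embeddings):
    """Average unit-normalized per-view embeddings into one per-mask feature."""
    unit = [normalize(v) for v in view_embeddings]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return normalize(mean)

def rank_masks(mask_views, query_embedding):
    """Sort mask ids by cosine similarity between fused feature and query."""
    q = normalize(query_embedding)
    score = {mask_id: sum(a * b for a, b in zip(fuse_views(views), q))
             for mask_id, views in mask_views.items()}
    return sorted(score, key=score.get, reverse=True)

# Hypothetical per-view embeddings for two class-agnostic 3D masks.
mask_views = {
    "mask_0": [[1.0, 0.0], [0.9, 0.1]],
    "mask_1": [[0.0, 1.0], [0.2, 0.8]],
}
print(rank_masks(mask_views, [1.0, 0.1]))  # ['mask_0', 'mask_1']
```

Since the query is only compared against stored per-mask features, any free-form text that the text encoder can embed can be used at query time.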
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
Authors: Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, Shalini De Mello
Venue: CVPR 2023 Highlight
First: 2023-03-08T18:58:26+00:00 · Latest: 2023-04-05T17:40:38+00:00
Comments: CVPR 2023 Highlight. Project page and code: https://jerryxu.net/ODISE
Abstract
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have the remarkable ability to generate high-quality images with diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We leverage the frozen internal representations of both these models to perform panoptic segmentation of any category in the wild. Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, with 8.3 PQ and 7.9 mIoU absolute improvement over the previous state of the art. We open-source our code and models at https://github.com/NVlabs/ODISE .
中文标题/摘要
标题:基于文本到图像扩散模型的开放词汇全景分割
我们提出了ODISE:开放词汇基于扩散的全景分割,它将预训练的文本到图像扩散模型和判别模型统一起来,以执行开放词汇全景分割。文本到图像扩散模型具有生成具有多样化开放词汇语言描述的高质量图像的显著能力。这表明它们的内部表示空间与现实世界中的开放概念高度相关。另一方面,像CLIP这样的文本-图像判别模型擅长将图像分类到开放词汇标签中。我们利用这两种模型的冻结内部表示来对任何野外类别进行全景分割。我们的方法在开放词汇全景分割和语义分割任务上均显著优于此前的最先进方法。特别是,仅使用COCO训练,我们的方法在ADE20K数据集上实现了23.4 PQ和30.0 mIoU,比此前的最先进方法绝对提高了8.3 PQ和7.9 mIoU。我们在 https://github.com/NVlabs/ODISE 开放了我们的代码和模型。
Summary / 总结
The research aims to improve open-vocabulary panoptic segmentation by integrating text-to-image diffusion models and discriminative models like CLIP. The method leverages the internal representations of these models to perform segmentation on any category without requiring category-specific training. The approach significantly outperforms previous methods, achieving 23.4 PQ and 30.0 mIoU on ADE20K with COCO training only, marking an 8.3 PQ and 7.9 mIoU improvement over the state of the art.
ODISE 将预训练的文本-图像扩散模型和 CLIP 等判别模型结合,以执行开放词汇量全景分割。该方法利用扩散模型的高质量图像生成能力和判别模型的分类能力,实现了显著的性能提升。该方法在 ADE20K 上仅使用 COCO 训练数据,达到了 23.4 PQ 和 30.0 mIoU,比之前的最佳结果分别提高了 8.3 PQ 和 7.9 mIoU。
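For readers unfamiliar with the PQ figures quoted in this entry, the sketch below computes panoptic quality for a single class using the standard definition: the sum of IoUs over matched segment pairs divided by TP + 0.5 FP + 0.5 FN. The function name and the toy numbers are illustrative assumptions and are unrelated to ODISE's evaluation code. Taking the reported gains at face value, the previous best results would sit near 15.1 PQ (23.4 - 8.3) and 22.1 mIoU (30.0 - 7.9).

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Panoptic quality for one class.

    matched_ious holds the IoU of every predicted/ground-truth segment pair
    counted as a true positive (by convention, pairs with IoU > 0.5).
    """
    true_positives = len(matched_ious)
    denom = true_positives + 0.5 * num_false_positives + 0.5 * num_false_negatives
    return sum(matched_ious) / denom if denom else 0.0

# Two matched segments, one spurious prediction, one missed segment.
pq = panoptic_quality([0.8, 0.6], num_false_positives=1, num_false_negatives=1)
print(round(pq, 3))  # 0.467
```

Benchmark PQ is usually reported as the average of this per-class value over all classes, multiplied by 100.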
Open-Vocabulary Camouflaged Object Segmentation
Authors: Youwei Pang, Xiaoqi Zhao, Jiaming Zuo, Lihe Zhang, Huchuan Lu
Venue: ECCV 2024
First: 2023-11-19T06:00:39+00:00 · Latest: 2024-07-04T08:56:21+00:00
Comments: Update details. Accepted by ECCV 2024
Abstract
Recently, the emergence of the large-scale vision-language model (VLM), such as CLIP, has opened the way towards open-world object perception. Many works have explored the utilization of pre-trained VLM for the challenging open-vocabulary dense prediction task that requires perceiving diverse objects with novel classes at inference time. Existing methods construct experiments based on the public datasets of related tasks, which are not tailored for open vocabulary and rarely involve imperceptible objects camouflaged in complex scenes due to data collection bias and annotation costs. To fill in the gaps, we introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS), and construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes. Further, we build a strong single-stage open-vocabulary camouflaged object segmentation transformer baseline OVCoser attached to the parameter-fixed CLIP with iterative semantic guidance and structure enhancement. By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects. Moreover, this effective framework also surpasses previous state-of-the-arts of open-vocabulary semantic image segmentation by a large margin on our OVCamo dataset. With the proposed dataset and baseline, we hope that this new task with more practical value can further expand the research on open-vocabulary dense prediction tasks. Our code and data can be found at https://github.com/lartpang/OVCamo.
中文标题/摘要
标题:开放词汇伪装目标分割
近年来,大规模视觉-语言模型(VLM),如CLIP的出现,为开放世界物体感知打开了大门。许多研究探索了利用预训练VLM进行具有新颖类别的多样化物体的挑战性开放词汇密集预测任务。现有方法基于相关任务的公共数据集构建实验,这些数据集未针对开放词汇进行定制,且很少涉及由于数据收集偏差和注释成本而难以察觉的伪装在复杂场景中的物体。为填补这一空白,我们引入了一个新的任务——开放词汇伪装目标分割(OVCOS),并构建了一个包含11,483张手工挑选的图像和精细注释的大规模复杂场景数据集(OVCamo),以及相应的物体类别。此外,我们构建了一个强的单阶段开放词汇伪装目标分割变换器基线OVCoser,该基线附着于参数固定的CLIP,并具有迭代语义指导和结构增强。通过整合类别语义知识的指导和边缘和深度信息提供的视觉结构线索的补充,所提出的方法可以有效地捕捉伪装目标。此外,该有效框架在我们的OVCamo数据集上也大幅超越了开放词汇语义图像分割的先前最佳方法。借助所提出的数据集和基线,我们希望这一具有更高实用价值的新任务能够进一步扩展开放词汇密集预测任务的研究。我们的代码和数据可以在链接中找到。
Summary / 总结
The research aims to address the challenge of open-vocabulary dense prediction, particularly for camouflaged objects in complex scenes. It introduces a new task, open-vocabulary camouflaged object segmentation (OVCOS), and constructs a large-scale dataset (OVCamo) with 11,483 finely annotated images. A strong single-stage baseline, OVCoser, is built on a parameter-fixed CLIP model with iterative semantic guidance and structure enhancement from edge and depth cues. This approach effectively captures camouflaged objects and outperforms previous state-of-the-art open-vocabulary semantic segmentation methods on the OVCamo dataset by a large margin.
研究旨在解决开放词汇密集预测的挑战,特别是复杂场景中的伪装物体。方法引入了一个新的任务,即开放词汇伪装物体分割(OVCOS),并构建了一个包含11,483张精细标注图像的大规模数据集OVCamo。提出了一种强单阶段基线OVCoser,通过迭代语义引导和结构增强有效捕捉伪装物体。该方法在OVCamo数据集上显著超越了之前的最佳方法。
Open-vocabulary Panoptic Segmentation with Embedding Modulation
Authors: Xi Chen, Shuang Li, Ser-Nam Lim, Antonio Torralba, Hengshuang Zhao
First: 2023-03-20T17:58:48+00:00 · Latest: 2023-07-15T11:04:26+00:00
Comments: ICCV2023
Abstract
Open-vocabulary image segmentation is attracting increasing attention due to its critical applications in the real world. Traditional closed-vocabulary segmentation methods are not able to characterize novel objects, whereas several recent open-vocabulary attempts obtain unsatisfactory results, i.e., notable performance reduction on the closed vocabulary and massive demand for extra data. To this end, we propose OPSNet, an omnipotent and data-efficient framework for Open-vocabulary Panoptic Segmentation. Specifically, the exquisitely designed Embedding Modulation module, together with several meticulous components, enables adequate embedding enhancement and information exchange between the segmentation model and the visual-linguistic well-aligned CLIP encoder, resulting in superior segmentation performance under both open- and closed-vocabulary settings with much less need for additional data. Extensive experimental evaluations are conducted across multiple datasets (e.g., COCO, ADE20K, Cityscapes, and PascalContext) under various circumstances, where the proposed OPSNet achieves state-of-the-art results, which demonstrates the effectiveness and generality of the proposed approach. The code and trained models will be made publicly available.
中文标题/摘要
标题:具有嵌入调制的开放词汇全景分割
开放词汇图像分割由于其实用应用越来越受到关注。传统的封闭词汇分割方法无法描述新对象,而最近的开放词汇尝试则获得不令人满意的结果,即在封闭词汇上的显著性能下降和对额外数据的巨大需求。为此,我们提出OPSNet,这是一种用于开放词汇全景分割的全能且数据高效的框架。具体而言,精心设计的嵌入调制模块,结合几个细致的组件,使得分割模型与视觉语言对齐的CLIP编码器之间能够充分增强嵌入和信息交换,从而在开放词汇和封闭词汇设置下均能获得优越的分割性能,且对额外数据的需求大大减少。在多个数据集(如COCO、ADE20K、Cityscapes和PascalContext)下进行了广泛的实验评估,其中提出的OPSNet达到了最先进的结果,这表明所提出方法的有效性和普适性。代码和训练模型将公开提供。
Summary / 总结
The research aims to address the limitations of traditional closed-vocabulary segmentation methods in handling novel objects and the unsatisfactory results of recent open-vocabulary approaches. OPSNet, a framework for open-vocabulary panoptic segmentation, is proposed, featuring an Embedding Modulation module that enhances embedding and facilitates information exchange between the segmentation model and a CLIP encoder. Experimental results across multiple datasets show that OPSNet outperforms existing methods in both open- and closed-vocabulary settings with less reliance on additional data.
该研究提出了OPSNet框架,通过增强嵌入调制和分割模型与CLIP编码器之间的信息交换,提升开放词汇和封闭词汇场景下的分割性能,且无需大量额外数据。在多个数据集上的广泛实验表明,OPSNet达到了最先进的性能,验证了其有效性和普适性。
Search3D: Hierarchical Open-Vocabulary 3D Segmentation
Authors: Ayca Takmaz, Alexandros Delitzas, Robert W. Sumner, Francis Engelmann, Johanna Wald, Federico Tombari
First: 2024-09-27T03:44:07+00:00 · Latest: 2025-01-22T15:09:00+00:00
Comments: This manuscript is provided as a pre-print, it has been accepted for publication by IEEE RA-L
Abstract
Open-vocabulary 3D segmentation enables exploration of 3D spaces using free-form text descriptions. Existing methods for open-vocabulary 3D instance segmentation primarily focus on identifying object-level instances but struggle with finer-grained scene entities such as object parts, or regions described by generic attributes. In this work, we introduce Search3D, an approach to construct hierarchical open-vocabulary 3D scene representations, enabling 3D search at multiple levels of granularity: fine-grained object parts, entire objects, or regions described by attributes like materials. Unlike prior methods, Search3D shifts towards a more flexible open-vocabulary 3D search paradigm, moving beyond explicit object-centric queries. For systematic evaluation, we further contribute a scene-scale open-vocabulary 3D part segmentation benchmark based on MultiScan, along with a set of open-vocabulary fine-grained part annotations on ScanNet++. Search3D outperforms baselines in scene-scale open-vocabulary 3D part segmentation, while maintaining strong performance in segmenting 3D objects and materials. Our project page is http://search3d-segmentation.github.io.
中文标题/摘要
标题:Search3D:分层开放词汇3D分割
开放词汇3D分割能够使用自由形式的文本描述探索3D空间。现有的开放词汇3D实例分割方法主要集中在识别对象级别的实例,但在处理更细粒度的场景实体,如对象部件或由通用属性描述的区域方面存在困难。在本工作中,我们提出了Search3D,一种构建分层开放词汇3D场景表示的方法,使3D搜索能够在多个粒度级别进行:细粒度的对象部件、整个对象或由材料等属性描述的区域。与先前的方法不同,Search3D转向了更灵活的开放词汇3D搜索范式,超越了明确的对象中心查询。为了系统评估,我们进一步贡献了一个基于MultiScan的场景规模开放词汇3D部件分割基准,以及一个针对ScanNet++的开放词汇细粒度部件注释集。Search3D在场景规模开放词汇3D部件分割中优于基线模型,同时在分割3D对象和材料方面保持了强大的性能。我们的项目页面是http://search3d-segmentation.github.io。
Summary / 总结
Search3D introduces a hierarchical open-vocabulary 3D segmentation approach that enables detailed scene representation at multiple levels of granularity, from object parts to entire objects to regions described by attributes such as materials. Unlike previous methods, it focuses on flexible open-vocabulary 3D search rather than object-centric queries. For evaluation, the authors contribute a scene-scale part segmentation benchmark based on MultiScan and a set of fine-grained part annotations on ScanNet++. Search3D outperforms baselines in open-vocabulary 3D part segmentation while maintaining strong object and material segmentation performance.
Search3D 提出了一种层次化的开放词汇表 3D 分割方法,能够使用自由文本描述进行详细的 3D 场景探索。与专注于对象级实例的先前方法不同,Search3D 可以识别更细粒度的元素,如对象部分和由属性定义的区域。该方法在场景尺度的开放词汇表 3D 部件分割中优于基线方法,同时在对象和材料分割方面保持良好的性能。提供了一个新的基准和注释用于评估。
Scaling Open-Vocabulary Image Segmentation with Image-Level Labels
Authors: Golnaz Ghiasi, Xiuye Gu, Yin Cui, Tsung-Yi Lin
Venue: ECCV 2022
First: 2021-12-22T18:57:54+00:00 · Latest: 2022-07-20T21:56:52+00:00
Comments: Accepted at ECCV 2022
Abstract
We design an open-vocabulary image segmentation model to organize an image into meaningful regions indicated by arbitrary texts. Recent works (CLIP and ALIGN), despite attaining impressive open-vocabulary classification accuracy with image-level caption labels, are unable to segment visual concepts with pixels. We argue that these models miss an important step of visual grouping, which organizes pixels into groups before learning visual-semantic alignments. We propose OpenSeg to address the above issue while still making use of scalable image-level supervision of captions. First, it learns to propose segmentation masks for possible organizations. Then it learns visual-semantic alignments by aligning each word in a caption to one or a few predicted masks. We find the mask representations are the key to support learning image segmentation from captions, making it possible to scale up the dataset and vocabulary sizes. OpenSeg significantly outperforms the recent open-vocabulary method of LSeg by +19.9 mIoU on PASCAL dataset, thanks to its scalability.
中文标题/摘要
标题:利用图像级标签扩展开放词汇图像分割
我们设计了一种开放词汇量图像分割模型,将图像组织成由任意文本指示的意义区域。尽管最近的工作(CLIP和ALIGN)使用图像级描述符标签实现了令人印象深刻的开放词汇量分类准确性,但它们无法用像素分割视觉概念。我们认为这些模型缺少一个重要的视觉分组步骤,在学习视觉语义对齐之前,它会将像素组织成组。我们提出OpenSeg以解决上述问题,同时仍然利用图像级描述符标签的可扩展监督。首先,它学习为可能的组织提出分割掩码。然后,通过将描述符中的每个词与一个或几个预测的掩码对齐,学习视觉语义对齐。我们发现掩码表示是支持从描述符学习图像分割的关键,使其成为扩展数据集和词汇量规模的可能。OpenSeg在PASCAL数据集上显著优于LSeg的开放词汇量方法,+19.9 mIoU,得益于其可扩展性。
Summary / 总结
The research aims to develop an open-vocabulary image segmentation model that can organize images into meaningful regions using arbitrary text labels. The method, OpenSeg, learns to propose segmentation masks and aligns each word in a caption to one or a few predicted masks, leveraging scalable image-level caption labels. Key findings show that OpenSeg outperforms the recent open-vocabulary method LSeg by 19.9 mIoU on the PASCAL dataset, demonstrating its scalability and effectiveness in learning image segmentation from captions.
研究旨在开发一种使用任意文本标签将图像组织成有意义区域的开放词汇量图像分割模型。方法OpenSeg学习提出分割掩码,并将每个单词在标题中与一个或几个预测的掩码对齐,利用可扩展的图像级标题标签。关键发现表明,OpenSeg在PASCAL数据集上的mIoU比最近的开放词汇量方法LSeg高出19.9,展示了其在从标题学习图像分割方面的可扩展性和有效性。
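The word-to-mask alignment step described in this entry can be sketched as follows: each caption word is compared with every predicted mask embedding, and a softmax over masks yields soft assignment weights, so that a word attaches to one or a few masks. This is a minimal illustration under stated assumptions; the function names and toy embeddings are hypothetical, and the actual OpenSeg training loss is not reproduced.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def align_words_to_masks(word_embeddings, mask_embeddings):
    """For each word, return soft weights over masks from dot-product similarity."""
    weights = {}
    for word, w in word_embeddings.items():
        sims = [sum(a * b for a, b in zip(w, m)) for m in mask_embeddings]
        weights[word] = softmax(sims)
    return weights

# Hypothetical embeddings: two caption words and two predicted masks.
words = {"cat": [4.0, 0.0], "grass": [0.0, 4.0]}
masks = [[1.0, 0.0], [0.0, 1.0]]
for word, w in align_words_to_masks(words, masks).items():
    print(word, [round(x, 3) for x in w])
```

In this toy case "cat" puts almost all of its weight on the first mask and "grass" on the second; only image-level captions are needed to produce such assignments, which is what makes the supervision scalable.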
ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation
Authors: Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang
Venue: ECCV 2024
First: 2024-08-09T06:17:00+00:00 · Latest: 2024-08-09T06:17:00+00:00
Comments: Accepted to ECCV 2024. Code available at https://github.com/mc-lan/ProxyCLIP
Abstract
Open-vocabulary semantic segmentation requires models to effectively integrate visual representations with open-vocabulary semantic labels. While Contrastive Language-Image Pre-training (CLIP) models shine in recognizing visual concepts from text, they often struggle with segment coherence due to their limited localization ability. In contrast, Vision Foundation Models (VFMs) excel at acquiring spatially consistent local visual representations, yet they fall short in semantic understanding. This paper introduces ProxyCLIP, an innovative framework designed to harmonize the strengths of both CLIP and VFMs, facilitating enhanced open-vocabulary semantic segmentation. ProxyCLIP leverages the spatial feature correspondence from VFMs as a form of proxy attention to augment CLIP, thereby inheriting the VFMs' robust local consistency and maintaining CLIP's exceptional zero-shot transfer capacity. We propose an adaptive normalization and masking strategy to get the proxy attention from VFMs, allowing for adaptation across different VFMs. Remarkably, as a training-free approach, ProxyCLIP significantly improves the average mean Intersection over Union (mIoU) across eight benchmarks from 40.3 to 44.4, showcasing its exceptional efficacy in bridging the gap between spatial precision and semantic richness for the open-vocabulary segmentation task.
中文标题/摘要
标题:ProxyCLIP:代理注意机制提升CLIP在开放词汇分割中的表现
开放词汇语义分割要求模型能够有效地将视觉表示与开放词汇语义标签结合起来。虽然对比语言-图像预训练(CLIP)模型在从文本识别视觉概念方面表现出色,但在分割连贯性方面却因定位能力有限而常常遇到困难。相比之下,视觉基础模型(VFMs)擅长获取空间一致的局部视觉表示,但在语义理解方面却有所欠缺。本文介绍了一种名为ProxyCLIP的创新框架,旨在协调CLIP和VFMs的优势,促进增强的开放词汇语义分割。ProxyCLIP利用来自VFMs的空间特征对应作为代理注意机制来增强CLIP,从而继承VFMs的稳健局部一致性,并保持CLIP的出色零样本迁移能力。我们提出了一种自适应归一化和掩码策略,以从VFMs中获取代理注意机制,允许在不同VFMs之间进行适应。令人惊讶的是,作为一种无需训练的方法,ProxyCLIP在八个基准测试中的平均交并比(mIoU)从40.3显著提高到44.4,展示了其在填补空间精确度和语义丰富度之间差距方面的出色效果。
Summary / 总结
ProxyCLIP is designed to enhance open-vocabulary semantic segmentation by combining the strengths of CLIP and Vision Foundation Models (VFMs). It uses proxy attention from VFMs to improve CLIP's localization ability while maintaining its zero-shot transfer capacity. ProxyCLIP achieves a significant improvement in average mean Intersection over Union (mIoU) from 40.3 to 44.4 across eight benchmarks, demonstrating its effectiveness in balancing spatial precision and semantic richness.
ProxyCLIP旨在通过结合Contrastive Language-Image Pre-training (CLIP)和Vision Foundation Models (VFMs)的优势来提升开放词汇语义分割。它利用VFMs的空间特征对应作为代理注意力来增强CLIP的定位能力,同时保持其零样本迁移能力。ProxyCLIP在八个基准上的平均mean Intersection over Union (mIoU)从40.3提升到44.4,展示了其在平衡空间精度和语义丰富性方面的有效性。
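As a rough illustration of the proxy-attention idea summarized above, the sketch below builds an attention map from the cosine similarities of VFM patch features and uses it to re-aggregate CLIP value features. The mean-shift normalization, the scale factor `beta`, the masking of non-positive scores, and all function names are assumptions made for illustration; they are not taken from the ProxyCLIP paper or its code.

```python
import math

def cosine_similarity_matrix(feats):
    """Pairwise cosine similarity between patch feature vectors."""
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in feats]
    unit = [[x / n for x in f] for f, n in zip(feats, norms)]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def proxy_attention(vfm_feats, beta=10.0):
    """Turn VFM patch similarities into row-stochastic attention weights.

    Assumed recipe (hypothetical): subtract the mean similarity, scale by
    beta, mask out non-positive scores, then apply a row-wise softmax.
    """
    sim = cosine_similarity_matrix(vfm_feats)
    n = len(sim)
    mean = sum(sum(row) for row in sim) / (n * n)
    attn = []
    for row in sim:
        scores = [beta * (s - mean) for s in row]
        kept = [s if s > 0 else -math.inf for s in scores]
        if all(s == -math.inf for s in kept):
            kept = scores  # nothing survives the mask: fall back to raw scores
        top = max(kept)
        exps = [math.exp(s - top) for s in kept]
        total = sum(exps)
        attn.append([e / total for e in exps])
    return attn

def apply_attention(attn, clip_values):
    """Aggregate CLIP value features with the proxy attention weights."""
    dim = len(clip_values[0])
    return [[sum(w * v[d] for w, v in zip(row, clip_values)) for d in range(dim)]
            for row in attn]
```

With toy inputs, patches whose VFM features point in similar directions share CLIP values, while dissimilar patches are masked out of each other's rows.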
PosSAM: Panoptic Open-vocabulary Segment Anything
Authors: Vibashan VS, Shubhankar Borse, Hyojin Park, Debasmit Das, Vishal Patel, Munawar Hayat, Fatih Porikli
First: 2024-03-14T17:55:03+00:00 · Latest: 2024-03-14T17:55:03+00:00
Abstract
In this paper, we introduce an open-vocabulary panoptic segmentation model that effectively unifies the strengths of the Segment Anything Model (SAM) with the vision-language CLIP model in an end-to-end framework. While SAM excels in generating spatially-aware masks, its decoder falls short in recognizing object class information and tends to oversegment without additional guidance. Existing approaches address this limitation by using multi-stage techniques and employing separate models to generate class-aware prompts, such as bounding boxes or segmentation masks. Our proposed method, PosSAM, is an end-to-end model which leverages SAM's spatially rich features to produce instance-aware masks and harnesses CLIP's semantically discriminative features for effective instance classification. Specifically, we address the limitations of SAM and propose a novel Local Discriminative Pooling (LDP) module leveraging class-agnostic SAM and class-aware CLIP features for unbiased open-vocabulary classification. Furthermore, we introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image. We conducted extensive experiments to demonstrate our method's strong generalization properties across multiple datasets, achieving state-of-the-art performance with substantial improvements over SOTA open-vocabulary panoptic segmentation methods. In both COCO to ADE20K and ADE20K to COCO settings, PosSAM outperforms the previous state-of-the-art methods by a large margin, 2.4 PQ and 4.6 PQ, respectively. Project Website: https://vibashan.github.io/possam-web/.
中文标题/摘要
标题:PosSAM:全景开放词汇分割一切
在本文中,我们提出了一种开放词汇全景分割模型,该模型有效地将Segment Anything Model (SAM) 的优势与Vision-Language CLIP模型结合在一个端到端框架中。虽然SAM在生成空间感知掩码方面表现出色,但其解码器在识别对象类别信息方面存在不足,且在没有额外指导的情况下容易过度分割。现有方法通过使用多阶段技术并采用单独模型生成类别感知提示(如边界框或分割掩码)来解决这一局限性。我们提出的方法PosSAM是一个端到端模型,它利用SAM的空间丰富特征生成实例感知掩码,并利用CLIP的语义区分特征进行有效的实例分类。具体而言,我们针对SAM的局限性,提出了一种新颖的局部区分池化(LDP)模块,利用类别无关的SAM特征和类别相关的CLIP特征进行无偏的开放词汇分类。此外,我们引入了一种掩码感知选择性融合(MASE)算法,在推断过程中自适应地提高生成掩码的质量,并增强开放词汇分类的性能。我们进行了广泛的实验,展示了我们的方法在多个数据集上的强大泛化能力,实现了相对于SOTA开放词汇全景分割方法的显著性能提升。在COCO到ADE20K和ADE20K到COCO设置中,PosSAM分别以2.4 PQ和4.6 PQ的大幅优势超越了先前的SOTA方法。项目网站:https://vibashan.github.io/possam-web/
Summary / 总结
The research introduces PosSAM, an end-to-end open-vocabulary panoptic segmentation model that combines the strengths of Segment Anything Model (SAM) and CLIP. PosSAM uses SAM's spatially rich features to generate instance-aware masks and CLIP's semantic features for effective instance classification. It includes a Local Discriminative Pooling (LDP) module and a Mask-Aware Selective Ensembling (MASE) algorithm to enhance mask quality and improve open-vocabulary classification. Extensive experiments show PosSAM outperforms previous methods, achieving state-of-the-art performance with 2.4 and 4.6 PQ improvements in COCO to ADE20K and ADE20K to COCO settings, respectively.
该研究提出了PosSAM,一种将Segment Anything Model (SAM) 和CLIP的优势结合在端到端框架中的开放词汇全景分割模型。PosSAM通过使用局部辨别池化(LDP)模块和掩码感知选择性融合(MASE)算法来生成实例感知的掩码并提高开放词汇分类。大量实验表明,PosSAM在COCO到ADE20K和ADE20K到COCO设置中分别比之前最先进的方法提高了2.4和4.6 PQ。
A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP
Authors: Ying Dai, Wei Yu Chen
First: 2025-10-22T07:54:18+00:00 · Latest: 2025-10-27T02:16:09+00:00
Abstract
This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two-stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition to obtain latent representations, which are then clustered using hierarchical clustering to segment semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision Transformer backbone of CLIP. Text embeddings are precomputed using CLIP's text encoder from category-specific prompts, including a generic "something else" prompt to support open-set recognition. The image and text embeddings are concatenated and projected into a shared latent feature space via SVD to enhance cross-modal alignment. Recognition is performed by computing the softmax over the similarities between the projected image and text embeddings. The proposed method is evaluated on standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score. These results demonstrate the effectiveness, flexibility, and generalizability of the proposed framework.
中文标题/摘要
标题:一种基于EfficientNet和CLIP的无训练框架用于开放词汇图像分割和识别
本文提出了一种新颖的无训练框架,用于开放词汇图像分割和对象识别(OVSR),该框架利用EfficientNetB0,一种卷积神经网络,进行无监督分割,并利用CLIP,一种视觉-语言模型,进行开放词汇对象识别。所提出的框架采用两阶段管道:无监督图像分割,随后是通过视觉-语言对齐进行的分割级别识别。在第一阶段,从EfficientNetB0提取的像素级特征通过奇异值分解进行分解,以获得潜在表示,然后使用层次聚类进行聚类,以分割语义上相关的区域。聚类的数量通过奇异值的分布自适应确定。在第二阶段,分割区域通过CLIP的视觉变换器主干进行定位和编码为图像嵌入。使用CLIP的文本编码器从类别特定的提示中预计算文本嵌入,包括一个通用的其他提示以支持开放集识别。图像和文本嵌入通过奇异值分解连接并投影到共享的潜在特征空间中,以增强跨模态对齐。通过计算投影图像和文本嵌入之间的相似性的softmax来进行识别。所提出的方法在标准基准上进行了评估,包括COCO、ADE20K和PASCAL VOC,以匈牙利mIoU、精确度、召回率和F1分数衡量,取得了最先进的性能。这些结果表明所提出框架的有效性、灵活性和泛化能力。
Summary / 总结
This paper introduces a training-free framework for open-vocabulary image segmentation and recognition (OVSR) using EfficientNetB0 for unsupervised segmentation and CLIP for open-vocabulary object recognition. The method employs a two-stage pipeline: first, unsupervised image segmentation is achieved by decomposing pixel-wise features from EfficientNetB0 with SVD and hierarchically clustering the latent representations, and second, segment-level recognition is performed by localizing and encoding regions using CLIP's Vision Transformer backbone. Text embeddings are precomputed from category-specific prompts, concatenated with the image embeddings, and projected into a shared latent space via SVD to enhance cross-modal alignment; recognition then takes a softmax over the similarities. The framework achieves state-of-the-art performance on COCO, ADE20K, and PASCAL VOC benchmarks, indicating its effectiveness and generalizability.
本文提出了一种无需训练的开放词汇图像分割和识别(OVSR)框架,使用EfficientNetB0进行无监督分割,CLIP进行开放词汇对象识别。该框架包括两个阶段:无监督图像分割和通过视觉-语言对齐的区域级识别。第一阶段中,EfficientNetB0提取像素级特征并进行分解和聚类,以分割出语义上有意义的区域。第二阶段中,CLIP的Vision Transformer编码分割出的区域,使用CLIP的文本编码器预计算文本嵌入,并将这些嵌入投影到共享的潜在空间中以增强跨模态对齐,通过计算投影后的图像和文本嵌入之间的相似性softmax值来进行识别。该方法在COCO、ADE20K和PASCAL VOC基准测试上取得了最先进的性能,展示了其有效性和通用性。
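The recognition step described above (a softmax over similarities between a segment embedding and text embeddings, with a generic "something else" prompt for open-set cases) can be sketched as follows. The embeddings are toy vectors rather than CLIP outputs, the SVD projection is omitted, and the `temperature` value and the `recognize` helper are hypothetical choices for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recognize(segment_embedding, text_embeddings, temperature=0.05):
    """Return (best_label, probabilities) via softmax over cosine similarities.

    text_embeddings maps each label (including "something else") to a vector.
    The temperature is an assumed value, not one reported in the paper.
    """
    labels = list(text_embeddings)
    logits = [cosine(segment_embedding, text_embeddings[name]) / temperature
              for name in labels]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract max for stability
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs
```

A segment that matches none of the named categories ends up closest to the "something else" prompt, which is how the open-set case is absorbed in this sketch.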
OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation
Authors: Gonca Yilmaz, Songyou Peng, Marc Pollefeys, Francis Engelmann, Hermann Blum
First: 2024-05-30T15:16:06+00:00 · Latest: 2024-10-29T23:03:34+00:00
Abstract
Recently, Vision-Language Models (VLMs) have advanced segmentation techniques by shifting from the traditional segmentation of a closed-set of predefined object classes to open-vocabulary segmentation (OVS), allowing users to segment novel classes and concepts unseen during training of the segmentation model. However, this flexibility comes with a trade-off: fully-supervised closed-set methods still outperform OVS methods on base classes, that is on classes on which they have been explicitly trained. This is due to the lack of pixel-aligned training masks for VLMs (which are trained on image-caption pairs), and the absence of domain-specific knowledge, such as autonomous driving. Therefore, we propose the task of open-vocabulary domain adaptation to infuse domain-specific knowledge into VLMs while preserving their open-vocabulary nature. By doing so, we achieve improved performance in base and novel classes. Existing VLM adaptation methods improve performance on base (training) queries, but fail to fully preserve the open-set capabilities of VLMs on novel queries. To address this shortcoming, we combine parameter-efficient prompt tuning with a triplet-loss-based training strategy that uses auxiliary negative queries. Notably, our approach is the only parameter-efficient method that consistently surpasses the original VLM on novel classes. Our adapted VLMs can seamlessly be integrated into existing OVS pipelines, e.g., improving OVSeg by +6.0% mIoU on ADE20K for open-vocabulary 2D segmentation, and OpenMask3D by +4.1% AP on ScanNet++ Offices for open-vocabulary 3D instance segmentation without other changes. The project page is available at https://open-das.github.io/.
中文标题/摘要
标题:OpenDAS:开放词汇领域适应在2D和3D分割中的应用
近年来,视觉-语言模型(VLMs)通过从传统的封闭词汇集预定义对象类别分割转向开放词汇分割(OVS),推动了分割技术的发展,允许用户分割训练过程中未见过的新类别和概念。然而,这种灵活性伴随着一个权衡:完全监督的封闭词汇集方法在基础类别上仍然优于OVS方法,即在它们明确训练的类别上。这是由于VLMs(在图像-描述对上训练)缺乏像素对齐的训练掩码,以及缺乏特定领域的知识,如自动驾驶。因此,我们提出了开放词汇领域适应任务,以将特定领域的知识注入VLMs,同时保持其开放词汇的性质。通过这种方式,我们在基础类别和新类别上都实现了更好的性能。现有的VLM适应方法在基础(训练)查询上提高了性能,但在新查询上未能完全保留VLMs的开放集能力。为了解决这一不足,我们结合了参数高效提示调优与基于三元损失的训练策略,使用辅助负查询。值得注意的是,我们的方法是唯一一个参数高效的、在新类别上始终超越原始VLM的方法。我们的适应VLMs可以无缝集成到现有的OVS流水线中,例如,在ADE20K上将OVSeg的mIoU提高6.0%进行开放词汇2D分割,以及在ScanNet++ Offices上将OpenMask3D的AP提高4.1%进行开放词汇3D实例分割,而无需其他更改。项目页面可在https://open-das.github.io/获取。
Summary / 总结
The research aims to enhance the performance of Vision-Language Models (VLMs) in open-vocabulary segmentation by incorporating domain-specific knowledge while maintaining their ability to segment novel classes. The method involves using parameter-efficient prompt tuning and a triplet-loss-based training strategy with auxiliary negative queries. Key experimental findings show that the proposed approach consistently outperforms the original VLM on novel classes and improves existing open-vocabulary segmentation pipelines, such as OVSeg and OpenMask3D, by 6.0% mIoU and 4.1% AP, respectively, without other changes.
研究旨在通过开放词汇域适应提升视觉-语言模型(VLM)在基类和新类中的性能。方法结合了参数高效提示调优和使用辅助负查询的三重损失训练策略。实验结果表明,适应后的VLM在新类上优于原始VLM,并且在OVSeg和OpenMask3D等现有开放词汇分割流水线中分别提高了6.0%的mIoU和4.1%的AP,无需其他更改。项目页面见https://open-das.github.io/.
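To make the triplet-loss idea with auxiliary negative queries more concrete, here is a minimal sketch of a standard margin-based triplet loss on cosine distance. The margin value, the hardest-negative selection, and the function names are assumptions for illustration; the exact formulation used by OpenDAS may differ.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def triplet_loss(anchor, positive, negatives, margin=0.2):
    """Margin-based triplet loss using the hardest (closest) negative.

    anchor:    visual embedding of a segment
    positive:  text embedding of its ground-truth query
    negatives: text embeddings of auxiliary negative queries
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = min(cosine_distance(anchor, neg) for neg in negatives)
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the positive query is closer to the anchor than every negative query by at least the margin, so only confusable negatives contribute a training signal.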
Going Denser with Open-Vocabulary Part Segmentation
Authors: Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, Zhicheng Yan
First: 2023-05-18T17:59:10+00:00 · Latest: 2023-05-18T17:59:10+00:00
Comments: Code is available at https://github.com/facebookresearch/VLPart
Abstract
Object detection has been expanded from a limited number of categories to open vocabulary. Moving forward, a complete intelligent vision system requires understanding more fine-grained object descriptions, namely object parts. In this paper, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation. This ability comes from two designs. First, we train the detector jointly on part-level, object-level and image-level data to build the multi-granularity alignment between language and image. Second, we parse the novel object into its parts by its dense semantic correspondence with the base object. These two designs enable the detector to largely benefit from various data sources and foundation models. In open-vocabulary part segmentation experiments, our method outperforms the baseline by 3.3 to 7.3 mAP in cross-dataset generalization on PartImageNet, and improves the baseline by 7.3 novel AP$_{50}$ in cross-category generalization on Pascal Part. Finally, we train a detector that generalizes to a wide range of part segmentation datasets while achieving better performance than dataset-specific training.
中文标题/摘要
标题:迈向更密集的开放词汇部件分割
对象检测已从有限的类别扩展到开放词汇集。未来,完整的智能视觉系统需要理解更精细的对象描述和对象部分。本文提出了一种能够预测开放词汇集对象及其部分分割的检测器。这种能力来自两个设计。首先,我们在部分级、对象级和图像级数据的联合上训练检测器,以建立语言和图像之间的多粒度对齐。其次,我们通过其与基础对象的密集语义对应将新对象解析为其部分。这两个设计使检测器能够从各种数据源和基础模型中受益。在开放词汇集部分分割实验中,我们的方法在PartImageNet的跨数据集泛化上比基线高出3.3至7.3 mAP,在Pascal Part的跨类别泛化上提高了7.3个新型AP$_{50}$。最后,我们训练了一个能够泛化到广泛部分分割数据集并优于数据集特定训练的检测器。
Summary / 总结
The paper addresses the challenge of open-vocabulary part segmentation in object detection, aiming to improve the understanding of fine-grained object descriptions. The method combines part-level, object-level, and image-level data for training, and parses novel objects into parts based on dense semantic correspondence with base objects. Experiments show that the proposed method outperforms baselines by 3.3 to 7.3 mAP in cross-dataset generalization and by 7.3 novel AP$_{50}$ in cross-category generalization. The detector generalizes well across various part segmentation datasets and achieves better performance than dataset-specific training.
该论文旨在解决开放词汇量部件分割在目标检测中的挑战,以提高对细粒度对象描述的理解。方法结合了部件级、对象级和图像级数据进行训练,并基于基对象的密集语义对应将新对象解析为部件。实验表明,所提出的方法在跨数据集泛化中比基线高出3.3到7.3 mAP,在跨类别泛化中比基线高出7.3 novel AP$_{50}$。该检测器在各种部件分割数据集上泛化良好,并且性能优于针对特定数据集的训练。
Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation
Authors: Junha Lee, Chunghyun Park, Jaesung Choe, Yu-Chiang Frank Wang, Jan Kautz, Minsu Cho, Chris Choy
First: 2025-02-04T18:18:50+00:00 · Latest: 2025-04-14T18:27:02+00:00
Comments: project page: https://nvlabs.github.io/Mosaic3D/
Abstract
We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models, we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation tasks including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.
中文标题/摘要
标题:Mosaic3D:开放词汇3D分割的基础数据集和模型
我们通过引入一种新颖的数据生成管道和训练框架,解决开放词汇3D场景理解问题。我们的方法满足了有效训练的三个关键要求:精确的3D区域分割、全面的文本描述和足够的数据集规模。通过利用最先进的开放词汇图像分割模型和区域感知的视觉-语言模型,我们开发了一个自动管道,生成高质量的3D掩码-文本对。将此管道应用于多个3D场景数据集,我们创建了Mosaic3D-5.6M,一个包含超过30K标注场景和5.6M掩码-文本对的数据集,规模远大于现有数据集。基于此数据,我们提出了Mosaic3D,一个结合了通过对比学习训练的3D编码器和轻量级掩码解码器的基础模型,用于开放词汇3D语义和实例分割。我们的方法在包括ScanNet200、Matterport3D和ScanNet++的开放词汇3D语义和实例分割任务中达到了最先进的结果,消融研究验证了我们大规模训练数据的有效性。
Summary / 总结
The research introduces Mosaic3D, a novel dataset and model for open-vocabulary 3D segmentation. It targets three requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. An automatic pipeline built on state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models produces Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. The proposed Mosaic3D model, combining a 3D encoder trained with contrastive learning and a lightweight mask decoder, achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation tasks, with ablation studies validating the effectiveness of the large-scale training data.
研究引入了Mosaic3D,一种用于开放词汇3D分割的新数据集和模型。通过开发一个自动生成高质量3D掩码-文本对的管道,解决了精确3D区域分割、全面的文本描述和大数据集的需求。Mosaic3D-5.6M数据集包含超过30K标注场景和5.6M掩码-文本对,远大于现有数据集。基于此,Mosaic3D模型结合了通过对比学习训练的3D编码器和轻量级掩码解码器,实现了开放词汇3D语义和实例分割任务的最新成果。
Renovating Names in Open-Vocabulary Segmentation Benchmarks
Authors: Haiwen Huang, Songyou Peng, Dan Zhang, Andreas Geiger
First: 2024-03-14T17:35:32+00:00 · Latest: 2024-05-24T07:57:33+00:00
Abstract
Names are essential to both human cognition and vision-language models. Open-vocabulary models utilize class names as text prompts to generalize to categories unseen during training. However, the precision of these names is often overlooked in existing datasets. In this paper, we address this underexplored problem by presenting a framework for "renovating" names in open-vocabulary segmentation benchmarks (RENOVATE). Our framework features a renaming model that enhances the quality of names for each visual segment. Through experiments, we demonstrate that our renovated names help train stronger open-vocabulary models with up to 15% relative improvement and significantly enhance training efficiency with improved data quality. We also show that our renovated names improve evaluation by better measuring misclassification and enabling fine-grained model analysis. We will provide our code and relabelings for several popular segmentation datasets (MS COCO, ADE20K, Cityscapes) to the research community.
中文标题/摘要
标题:开放词汇分割基准中的名称改造
名称对于人类认知和视觉语言模型都至关重要。开放词汇模型利用类别名称作为文本提示以泛化到训练期间未见过的类别。然而,现有数据集中这些名称的精确性往往被忽视。在本文中,我们通过提出一种“改造”开放词汇分割基准中名称的框架(RENOVATE)来解决这一未充分探索的问题。我们的框架包含一个重命名模型,用于提升每个视觉片段名称的质量。通过实验,我们证明我们的改造名称有助于训练更强的开放词汇模型,相对改进高达15%,并显著提高训练效率,同时提高数据质量。我们还展示了我们的改造名称如何通过更好地衡量误分类和实现细粒度模型分析来改进评估。我们将为MS COCO、ADE20K、Cityscapes等几个流行的分割数据集提供我们的代码和重新标注,供研究界使用。
Summary / 总结
This paper addresses the underexplored issue of name precision in open-vocabulary segmentation benchmarks. It introduces a framework called RENOVATE, whose renaming model enhances the quality of the name assigned to each visual segment. Experiments show that training with renovated names yields stronger open-vocabulary models, with up to 15% relative improvement, and improves training efficiency. The renovated names also improve evaluation by better measuring misclassification and enabling fine-grained model analysis.
本文解决了开放词汇量分割基准中名称精度不足的问题,提出了一个名为RENOVATE的框架,以提高每个视觉片段的类别名称质量。实验表明,使用改进后的名称可以将开放词汇量模型的性能提高多达15%的相对改进,并提高训练效率。该框架还通过更好地衡量误分类和实现详细的模型分析来改进评估。
OpenSD: Unified Open-Vocabulary Segmentation and Detection
Authors: Shuai Li, Minghan Li, Pengfei Wang, Lei Zhang
First: 2023-12-10T08:51:34+00:00 · Latest: 2023-12-10T08:51:34+00:00
Abstract
Recently, a few open-vocabulary methods have been proposed by employing a unified architecture to tackle generic segmentation and detection tasks. However, their performance still lags behind the task-specific models due to the conflict between different tasks, and their open-vocabulary capability is limited due to the inadequate use of CLIP. To address these challenges, we present a universal transformer-based framework, abbreviated as OpenSD, which utilizes the same architecture and network parameters to handle open-vocabulary segmentation and detection tasks. First, we introduce a decoder decoupled learning strategy to alleviate the semantic conflict between thing and stuff categories so that each individual task can be learned more effectively under the same framework. Second, to better leverage CLIP for end-to-end segmentation and detection, we propose dual classifiers to handle the in-vocabulary domain and out-of-vocabulary domain, respectively. The text encoder is further trained to be region-aware for both thing and stuff categories through decoupled prompt learning, enabling them to filter out duplicated and low-quality predictions, which is important to end-to-end segmentation and detection. Extensive experiments are conducted on multiple datasets under various circumstances. The results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed- and open-vocabulary settings. Code is available at https://github.com/strongwolf/OpenSD
中文标题/摘要
标题:OpenSD:统一开放词汇分割与检测
最近,提出了一些通过统一架构解决通用分割和检测任务的开放词汇方法。然而,由于不同任务之间的冲突,它们的性能仍然落后于任务特定模型,而且由于CLIP的不足使用,它们的开放词汇能力也受到限制。为了解决这些挑战,我们提出了一种通用的基于转换器的框架,简称为OpenSD,该框架使用相同的架构和网络参数来处理开放词汇分割和检测任务。首先,我们引入了一种解耦学习策略,以缓解事物和背景类别之间的语义冲突,从而使每个单独的任务在同一个框架下能够更有效地学习。其次,为了更好地利用CLIP进行端到端分割和检测,我们提出了双分类器分别处理词汇内领域和词汇外领域。通过解耦提示学习,进一步训练文本编码器以对事物和背景类别都具有区域意识,使它们能够过滤掉重复和低质量的预测,这对于端到端分割和检测非常重要。在多种数据集的各种情况下进行了广泛的实验。结果表明,OpenSD在封闭词汇和开放词汇设置中均优于最先进的开放词汇分割和检测方法。代码可在https://github.com/strongwolf/OpenSD 获取
Summary / 总结
The research aims to improve open-vocabulary segmentation and detection by addressing the conflicts between tasks and enhancing the use of CLIP. OpenSD, a unified transformer-based framework, employs a decoder decoupled learning strategy and dual classifiers to handle thing and stuff categories effectively. The framework also uses decoupled prompt learning to train the text encoder to be region-aware, improving prediction quality. Experiments show that OpenSD outperforms existing methods in both closed- and open-vocabulary settings.
研究旨在通过解决语义冲突和CLIP使用不足的问题来提升开放词汇的分割和检测。OpenSD是一个统一的基于Transformer的框架,采用解码器解耦学习策略和双分类器来增强特定任务的学习并有效利用CLIP。实验结果显示,OpenSD在闭合词汇和开放词汇设置中均优于现有方法。
Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation
Authors: Yongkang Li, Tianheng Cheng, Bin Feng, Wenyu Liu, Xinggang Wang
Venue: CVPR 2025
First: 2024-12-05T17:42:37+00:00 · Latest: 2025-03-10T12:14:22+00:00
Comments: Accepted by CVPR 2025; Code & models: https://github.com/hustvl/MaskAdapter
Abstract
Recent open-vocabulary segmentation methods adopt mask generators to predict segmentation masks and leverage pre-trained vision-language models, e.g., CLIP, to classify these masks via mask pooling. Although these approaches show promising results, it is counterintuitive that accurate masks often fail to yield accurate classification results through pooling CLIP image embeddings within the mask regions. In this paper, we reveal the performance limitations of mask pooling and introduce Mask-Adapter, a simple yet effective method to address these challenges in open-vocabulary segmentation. Compared to directly using proposal masks, our proposed Mask-Adapter extracts semantic activation maps from proposal masks, providing richer contextual information and ensuring alignment between masks and CLIP. Additionally, we propose a mask consistency loss that encourages proposal masks with similar IoUs to obtain similar CLIP embeddings to enhance models' robustness to varying predicted masks. Mask-Adapter integrates seamlessly into open-vocabulary segmentation methods based on mask pooling in a plug-and-play manner, delivering more accurate classification results. Extensive experiments across several zero-shot benchmarks demonstrate significant performance gains for the proposed Mask-Adapter on several well-established methods. Notably, Mask-Adapter also extends effectively to SAM and achieves impressive results on several open-vocabulary segmentation datasets. Code and models are available at https://github.com/hustvl/MaskAdapter.
中文标题/摘要
标题:Mask-Adapter:开放词汇分割中的魔鬼在于掩码
最近的开放词汇分割方法采用掩码生成器来预测分割掩码,并利用预训练的视觉-语言模型,例如CLIP,通过掩码池化来分类这些掩码。尽管这些方法显示出有希望的结果,但通过在掩码区域内部池化CLIP图像嵌入来分类准确的掩码往往未能产生准确的分类结果,这一点是反直觉的。在本文中,我们揭示了掩码池化的性能限制,并引入了Mask-Adapter,这是一种简单而有效的方法,用于解决开放词汇分割中的这些挑战。与直接使用提案掩码相比,我们提出的Mask-Adapter从提案掩码中提取语义激活图,提供更丰富的上下文信息,并确保掩码与CLIP之间的对齐。此外,我们还提出了一种掩码一致性损失,鼓励具有相似IoU的提案掩码获得相似的CLIP嵌入,以增强模型对预测掩码变化的鲁棒性。Mask-Adapter可以以即插即用的方式无缝集成到基于掩码池化的开放词汇分割方法中,以实现更准确的分类结果。在多个零样本基准上的广泛实验表明,所提出的Mask-Adapter在多个已建立的方法上取得了显著的性能提升。值得注意的是,Mask-Adapter还有效地扩展到了SAM,并在多个开放词汇分割数据集上取得了令人印象深刻的结果。代码和模型可在https://github.com/hustvl/MaskAdapter获取。
Summary / 总结
This paper addresses the limitations of open-vocabulary segmentation methods that rely on mask pooling with pre-trained vision-language models like CLIP. It introduces Mask-Adapter, which extracts semantic activation maps from proposal masks to provide richer contextual information and ensures alignment with CLIP. The method also includes a mask consistency loss to enhance robustness. Experiments show significant performance gains across various benchmarks and datasets, including SAM, demonstrating the effectiveness of Mask-Adapter in improving classification accuracy.
本文针对使用预训练视觉-语言模型如CLIP进行掩码池化的开放词汇分割方法中存在的问题,提出了一种Mask-Adapter方法,通过从提案掩码中提取语义激活图来改善上下文信息和与CLIP的对齐。该方法还包括一个掩码一致性损失,以增强鲁棒性。实验结果显示,Mask-Adapter在各种基准和数据集上取得了显著的性能提升,包括SAM,证明了其在提高分类准确性方面的有效性。
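For readers unfamiliar with the mask pooling baseline that Mask-Adapter revisits, the sketch below averages the feature vectors that fall inside a binary mask. It operates on a toy nested-list feature map rather than real CLIP embeddings, and the `mask_pool` name is hypothetical; Mask-Adapter itself replaces this step with learned semantic activation maps, which are not shown here.

```python
def mask_pool(feature_map, mask):
    """Average the feature vectors at positions where the mask is 1.

    feature_map: H x W x C nested lists of floats
    mask:        H x W nested lists of 0/1
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for feat, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        raise ValueError("mask selects no positions")
    return [t / count for t in total]
```

The pooled vector would then be compared against text embeddings to classify the mask. Because every position inside the mask is weighted equally, context outside the mask is discarded, which is one intuition for why accurate masks can still be misclassified.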
MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation
Authors: Yuanbing Zhu, Bingke Zhu, Yingying Chen, Yunfang Niu, Ming Tang, Jinqiao Wang
First: 2024-08-27T04:45:53+00:00 · Latest: 2024-11-27T15:26:41+00:00
Comments: Technical report
Abstract
Pretrained vision-language models (VLMs), e.g., CLIP, are increasingly used to bridge the gap between open- and closed-vocabulary recognition in open-vocabulary image segmentation. As VLMs are generally pretrained with low-resolution images (e.g., $224\times224$), most previous methods operate only on downscaled images. We question this design as low-resolution features often fail to preserve fine details. A typical solution is to employ additional image backbones for high-resolution inputs, but it also introduces significant computation overhead. Therefore, we propose MROVSeg, a multi-resolution training framework for open-vocabulary image segmentation with a single pretrained CLIP backbone, that uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key components include a Multi-Res Adapter, which restores the spatial geometry and grasps local-global correspondences across patches by interacting with multi-resolution features. To achieve accurate segmentation, we introduce a Multi-grained Masked Attention scheme to aggregate multi-grained semantics from multi-resolution CLIP features to object queries. Through comprehensive experiments, we demonstrate the superiority of MROVSeg on well-established open-vocabulary image segmentation benchmarks, establishing new standards for open-vocabulary image segmentation.
中文标题/摘要
标题:MROVSeg:打破开放词汇图像分割中视觉-语言模型的分辨率诅咒
预训练的视觉-语言模型(VLMs),例如CLIP,越来越多地用于在开放词汇图像分割中弥合开放词汇和封闭词汇识别之间的差距。由于VLMs通常是在低分辨率图像(例如$224\times224$)上进行预训练的,大多数先前的方法仅在下采样图像上操作。我们质疑这种设计,因为低分辨率特征往往无法保留细节。一个典型的解决方案是使用额外的图像骨干网络处理高分辨率输入,但这也会引入显著的计算开销。因此,我们提出了MROVSeg,这是一种使用单个预训练CLIP骨干网络的多分辨率训练框架,通过滑动窗口将高分辨率输入分割成均匀的块,每个块的大小与训练良好的图像编码器的输入大小匹配。其关键组件包括一个多分辨率适配器,该适配器通过与多分辨率特征交互恢复空间几何结构并跨块捕捉局部-全局对应关系。为了实现准确的分割,我们引入了多粒度掩码注意力方案,从多分辨率CLIP特征中聚合多粒度语义到对象查询。通过全面的实验,我们在成熟的开放词汇图像分割基准上展示了MROVSeg的优越性,建立了开放词汇图像分割的新标准。
Summary / 总结
MROVSeg addresses the resolution limitation of vision-language models in open-vocabulary image segmentation by proposing a multi-resolution training framework that uses a single pretrained CLIP backbone. It employs sliding windows to process high-resolution inputs and introduces a Multi-Res Adapter to restore spatial geometry and local-global correspondences. Additionally, a Multi-grained Masked Attention scheme is used to aggregate multi-grained semantics from multi-resolution CLIP features to improve segmentation accuracy. Experiments show that MROVSeg outperforms previous methods on open-vocabulary image segmentation benchmarks.
MROVSeg 是一种多分辨率训练框架,使用单个预训练的 CLIP 后端,在高分辨率输入中切片并使用 Multi-Res Adapter 恢复空间几何结构和局部-全局对应关系。该方法引入了多粒度掩码注意力方案,从多分辨率 CLIP 特征中聚合多粒度语义到对象查询。实验表明,MROVSeg 在开放词汇图像分割基准测试中优于先前的方法,解决了 VLM 的分辨率限制问题。
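The sliding-window slicing mentioned above can be sketched as follows: a high-resolution input is covered by uniform windows that each match the encoder's input size (224 is used here, following the example resolution in the abstract). The stride default, the handling of the border by shifting a final window back inside the image, and the function names are assumptions for illustration rather than details from the MROVSeg paper.

```python
def window_starts(length, window, stride):
    """Start offsets so that fixed-size windows cover [0, length)."""
    if length < window:
        raise ValueError("input is smaller than the window")
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] + window < length:
        # Assumed border policy: add one window flush with the far edge.
        starts.append(length - window)
    return starts

def sliding_windows(height, width, window=224, stride=224):
    """Return (top, left, bottom, right) boxes, all of size window x window."""
    return [(top, left, top + window, left + window)
            for top in window_starts(height, window, stride)
            for left in window_starts(width, window, stride)]
```

For a 448 x 448 input this gives four non-overlapping 224 x 224 windows; when the size is not a multiple of the window, the last window overlaps its neighbor instead of being padded or resized.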
A Survey on Training-free Open-Vocabulary Semantic Segmentation
Authors: Naomi Kombol, Ivan Martinović, Siniša Šegvić
First: 2025-05-28T10:37:52+00:00 · Latest: 2025-05-28T10:37:52+00:00
Abstract
Semantic segmentation is one of the most fundamental tasks in image understanding, with a long history of research and, consequently, a myriad of different approaches. Traditional methods strive to train models from scratch, requiring vast amounts of computational resources and training data. With the move to open-vocabulary semantic segmentation, which asks models to classify beyond learned categories, collecting large quantities of finely annotated data would be prohibitively expensive. Researchers have instead turned to training-free methods, which leverage existing models built for tasks where data is more easily acquired. Specifically, this survey will cover the history, nuance, idea development and the state-of-the-art in training-free open-vocabulary semantic segmentation that leverages existing multi-modal classification models. We will first give a preliminary on the task definition followed by an overview of popular model archetypes and then spotlight over 30 approaches split into broader research branches: purely CLIP-based, those leveraging auxiliary visual foundation models and ones relying on generative methods. Subsequently, we will discuss the limitations and potential problems of current research, as well as provide some underexplored ideas for future study. We believe this survey will serve as a good onboarding read to new researchers and spark increased interest in the area.
中文标题/摘要
标题:无训练开放词汇语义分割综述
语义分割是图像理解中最基本的任务之一,具有悠久的研究历史,并且随之产生了众多不同的方法。传统方法试图从头训练模型,需要大量的计算资源和训练数据。随着向开放词汇语义分割的过渡,要求模型超越已学习的类别进行分类,大量精细标注的数据将变得极其昂贵。研究人员转而采用无训练方法,利用为更容易获取数据的任务训练的现有模型。具体而言,本文综述了无训练开放词汇语义分割的历史、细微差别、思想发展和最新进展,这些方法利用现有的多模态分类模型。我们将首先介绍任务定义,然后概述流行的模型架构,并重点介绍超过30种方法,分为更广泛的科研分支:基于CLIP的方法、利用辅助视觉基础模型的方法以及依赖生成方法的方法。随后,我们将讨论当前研究的局限性和潜在问题,并提供一些未充分探索的未来研究想法。我们认为,本文综述将为新研究人员提供良好的入门读物,并激发对该领域的更大兴趣。
Towards Open-Vocabulary Video Instance Segmentation
Authors: Haochen Wang, Cilin Yan, Shuai Wang, Xiaolong Jiang, XU Tang, Yao Hu, Weidi Xie, Efstratios Gavves
First: 2023-04-04T11:25:23+00:00 · Latest: 2023-08-06T20:08:58+00:00
Abstract
Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories, lacking the generalization ability to handle novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS), that contains well-annotated objects from 1,196 diverse categories, significantly surpassing the category size of existing datasets by more than one order of magnitude. Third, we propose an efficient Memory-Induced Transformer architecture, OV2Seg, to first achieve Open-Vocabulary VIS in an end-to-end manner with near real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg on novel categories. The dataset and code are released at https://github.com/haochenheheda/LVVIS.
中文标题/摘要
标题:迈向开放词汇视频实例分割
视频实例分割(VIS)旨在从封闭训练类别的集合中对视频中的对象进行分割和分类,缺乏处理真实世界视频中新型类别的泛化能力。为解决这一局限,我们做出了以下三项贡献。首先,我们引入了开放词汇视频实例分割这一新任务,旨在同时对开放集类别,包括训练期间未见过的新类别中的视频中的对象进行分割、跟踪和分类。其次,为了评估开放词汇VIS,我们收集了一个大型词汇视频实例分割数据集(LV-VIS),该数据集包含来自1,196个多样类别的良好标注对象,显著超越现有数据集的类别数量一个数量级。第三,我们提出了一种高效的基于记忆的变换器架构OV2Seg,以端到端的方式实现开放词汇VIS,并具有接近实时的推理速度。在LV-VIS和四个现有VIS数据集上的广泛实验表明,OV2Seg在新型类别上的零样本泛化能力很强。数据集和代码在此处发布:https://github.com/haochenheheda/LVVIS。
Summary / 总结
The paper addresses the limitation of Video Instance Segmentation (VIS) in handling novel categories not seen during training. It introduces Open-Vocabulary VIS, which aims to segment, track, and classify objects from both closed and open-set categories. To benchmark this task, the authors created LV-VIS, a large dataset with 1,196 diverse categories. They also proposed OV2Seg, an efficient Memory-Induced Transformer architecture, which achieves near real-time inference for Open-Vocabulary VIS and demonstrates strong zero-shot generalization on novel categories. The dataset and code are publicly available.
论文旨在解决视频实例分割(VIS)在处理未见过的新类别时的局限性。它引入了开放词汇量VIS任务,旨在从已见和未见的类别中分割、跟踪和分类物体。为此,作者创建了一个包含1,196个不同类别的新数据集LV-VIS。作者提出了OV2Seg,一种高效的记忆诱导Transformer架构,能够以接近实时的速度实现开放词汇量VIS,并在新类别上展示了强大的零样本泛化能力。数据集和代码已公开发布。
Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter
Authors: Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, Dong Xu
First: 2023-09-06T06:31:08+00:00 · Latest: 2024-01-22T07:18:55+00:00
Abstract
Pre-trained text-image discriminative models, such as CLIP, have been explored for open-vocabulary semantic segmentation with unsatisfactory results due to the loss of crucial localization information and awareness of object shapes. Recently, there has been a growing interest in expanding the application of generative models from generation tasks to semantic segmentation. These approaches utilize generative models either for generating annotated data or extracting features to facilitate semantic segmentation. This typically involves generating a considerable amount of synthetic data or requiring additional mask annotations. To this end, we uncover the potential of generative text-to-image diffusion models (e.g., Stable Diffusion) as highly efficient open-vocabulary semantic segmenters, and introduce a novel training-free approach named DiffSegmenter. The insight is that to generate realistic objects that are semantically faithful to the input text, both the complete object shapes and the corresponding semantics are implicitly learned by diffusion models. We discover that the object shapes are characterized by the self-attention maps while the semantics are indicated through the cross-attention maps produced by the denoising U-Net, forming the basis of our segmentation results. Additionally, we carefully design effective textual prompts and a category filtering mechanism to further enhance the segmentation results. Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.
中文标题/摘要
标题:扩散模型实际上是无需训练的开放词汇语义分割器
预训练的文本-图像判别模型(如CLIP)在开放词汇语义分割任务中由于丢失了关键的定位信息和物体形状意识,导致效果不佳。最近,人们越来越关注将生成模型的应用从生成任务扩展到语义分割。这些方法利用生成模型要么生成标注数据,要么提取特征以促进语义分割。这通常涉及生成大量合成数据或需要额外的掩码注释。为此,我们发现了生成文本到图像扩散模型(如Stable Diffusion)作为高效开放词汇语义分割器的潜力,并引入了一种全新的无需训练的方法,名为DiffSegmenter。我们的见解是,为了生成与输入文本语义一致的逼真物体,扩散模型会隐式地学习物体的完整形状和相应的语义。我们发现物体形状由自注意力图表征,而语义则通过去噪U-Net生成的交叉注意力图指示,构成了我们的分割结果的基础。此外,我们精心设计了有效的文本提示和类别筛选机制,以进一步提高分割结果。在三个基准数据集上的大量实验表明,提出的DiffSegmenter在开放词汇语义分割任务中取得了令人印象深刻的结果。
Summary / 总结
This paper explores the use of diffusion models for open-vocabulary semantic segmentation, addressing the limitations of text-image discriminative models such as CLIP. The authors propose a training-free approach called DiffSegmenter, which exploits the object shapes and semantics that diffusion models learn implicitly: shapes are read from the self-attention maps and semantics from the cross-attention maps of the denoising U-Net. Experiments on three benchmark datasets show that DiffSegmenter achieves impressive results in open-vocabulary semantic segmentation without training, large-scale synthetic data generation, or additional mask annotations.
研究旨在利用扩散模型进行开放词汇语义分割,克服了如CLIP等文本-图像判别模型的局限性。方法是利用扩散模型生成与输入文本语义一致的现实物体,隐式学习物体形状和语义。在三个基准数据集上的实验表明,提出的DiffSegmenter在开放词汇语义分割方面取得了优于现有方法的结果,无需额外训练或掩码标注。
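The DiffSegmenter abstract above says that cross-attention maps carry semantics while self-attention maps carry object shape. A minimal sketch of how the two could be combined is shown below; it is an illustration only, not the authors' implementation. It assumes the cross-attention scores are given per pixel and per class, and that the self-attention matrix is row-normalized; the function names `refine_with_self_attention` and `argmax_labels` and the toy values are hypothetical.

```python
def refine_with_self_attention(cross_attn, self_attn):
    """Propagate per-class cross-attention scores through self-attention.

    cross_attn: N rows (one per pixel), each holding C class scores.
    self_attn:  N x N matrix of pixel-to-pixel affinities, assumed
                row-normalized (each row sums to 1).
    Returns an N x C list of refined scores (self_attn @ cross_attn).
    """
    n = len(cross_attn)
    c = len(cross_attn[0])
    refined = []
    for i in range(n):
        row = [
            sum(self_attn[i][j] * cross_attn[j][k] for j in range(n))
            for k in range(c)
        ]
        refined.append(row)
    return refined


def argmax_labels(scores):
    """Pick the highest-scoring class index for every pixel."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy example (hypothetical values): 4 pixels, 2 classes.
# Pixel 1 has a noisy cross-attention score that disagrees with pixel 0,
# although self-attention groups pixels {0, 1} and {2, 3} together.
cross_attn = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.1, 0.9],
    [0.2, 0.8],
]
self_attn = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]

raw_labels = argmax_labels(cross_attn)
refined_labels = argmax_labels(refine_with_self_attention(cross_attn, self_attn))
print(raw_labels, refined_labels)
```

In this toy case the self-attention grouping pulls the noisy pixel back to the class of its neighbour, which is the intuition behind using shape cues to complete semantic cues.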
TAG: Guidance-free Open-Vocabulary Semantic Segmentation
Authors: Yasufumi Kawano, Yoshimitsu Aoki
First: 2024-03-17T12:49:02+00:00 · Latest: 2024-03-17T12:49:02+00:00
Comments: 18 pages
Abstract
Semantic segmentation is a crucial task in computer vision, where each pixel in an image is classified into a category. However, traditional methods face significant challenges, including the need for pixel-level annotations and extensive training. Furthermore, because supervised learning uses a limited set of predefined categories, models typically struggle with rare classes and cannot recognize new ones. Unsupervised and open-vocabulary segmentation, proposed to tackle these issues, faces challenges, including the inability to assign specific class labels to clusters and the necessity of user-provided text queries for guidance. In this context, we propose a novel approach, TAG, which achieves Training, Annotation, and Guidance-free open-vocabulary semantic segmentation. TAG utilizes pre-trained models such as CLIP and DINO to segment images into meaningful categories without additional training or dense annotations. It retrieves class labels from an external database, providing flexibility to adapt to new scenarios. Our TAG achieves state-of-the-art results on PascalVOC, PascalContext and ADE20K for open-vocabulary segmentation without given class names, e.g., an improvement of +15.3 mIoU on PascalVOC. All code and data will be released at https://github.com/Valkyrja3607/TAG.
中文标题/摘要
标题:TAG: 无需指导的开放词汇语义分割
语义分割是计算机视觉中的关键任务,其中图像中的每个像素都被分类到一个类别中。然而,传统方法面临显著挑战,包括需要像素级注释和大量训练。此外,由于监督学习使用有限的预定义类别集,模型通常难以处理稀有类别并无法识别新的类别。为了解决这些问题,提出了无监督和开放词汇分割,但这种方法面临挑战,包括无法为聚类分配特定类别标签,以及需要用户提供的文本查询作为指导。在此背景下,我们提出了一种新颖的方法,TAG,实现了无需训练、无需注释、无需指导的开放词汇语义分割。TAG 利用预训练模型如 CLIP 和 DINO 对图像进行有意义类别的分割,无需额外训练或密集注释。它从外部数据库检索类别标签,提供适应新场景的灵活性。我们的 TAG 在 PascalVOC、PascalContext 和 ADE20K 上实现了开放词汇分割的最新成果,无需给定类别名称,例如在 PascalVOC 上提升了 +15.3 mIoU。所有代码和数据将在 https://github.com/Valkyrja3607/TAG 上发布。
Summary / 总结
The paper addresses the limitations of traditional semantic segmentation methods, such as the need for pixel-level annotations and reliance on predefined categories. It introduces TAG, a novel approach that achieves training-, annotation-, and guidance-free open-vocabulary semantic segmentation by leveraging pre-trained models like CLIP and DINO. TAG retrieves class labels from an external database, enabling it to handle new scenarios without additional training or dense annotations. The method reports state-of-the-art results on PascalVOC, PascalContext, and ADE20K for open-vocabulary segmentation without given class names, including a +15.3 mIoU improvement on PascalVOC.
该论文针对传统语义分割方法的局限性,如需要像素级标注和依赖预定义类别。它提出了TAG,一种新型方法,通过利用预训练模型如CLIP和DINO实现无指导的开放词汇语义分割。TAG从外部数据库检索类别标签,使其能够在无需额外训练或密集标注的情况下处理新场景。该方法显著提高了开放词汇语义分割的结果,在PascalVOC上相比之前的方法实现了+15.3 mIoU的提升,且无需给定类别名称。
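The TAG entry above mentions retrieving class labels from an external database rather than from user-provided text queries. One plausible way to picture this step is a nearest-neighbour lookup by cosine similarity between a segment embedding and labelled database embeddings. The sketch below assumes exactly that; the abstract does not specify the retrieval mechanism, and the names `cosine`, `retrieve_label`, and the three-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def retrieve_label(segment_embedding, database):
    """Return the database label whose embedding is most similar."""
    return max(
        database,
        key=lambda name: cosine(segment_embedding, database[name]),
    )


# Hypothetical database: label -> embedding (real embeddings would come
# from a pre-trained encoder and have hundreds of dimensions).
database = {
    "cat": [1.0, 0.0, 0.1],
    "dog": [0.0, 1.0, 0.1],
    "sky": [0.0, 0.1, 1.0],
}
segment = [0.9, 0.2, 0.0]

label = retrieve_label(segment, database)
print(label)
```

Because the label set lives in the database rather than in the model, swapping in a different database changes the vocabulary without retraining, which matches the flexibility the abstract describes.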
Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models
Authors: Muhammad Atta ur Rahman, Dooseop Choi, Seung-Ik Lee, KyoungWook Min
First: 2025-01-28T07:49:52+00:00 · Latest: 2025-07-02T01:46:17+00:00
Comments: Accepted at the 17th IEEE International Conference on Advanced Computational Intelligence (ICACI 2025)
Abstract
Open-vocabulary semantic segmentation attempts to classify and outline objects in an image using arbitrary text labels, including those unseen during training. Self-supervised learning resolves numerous visual and linguistic processing problems when effectively trained. This study investigates simple yet efficient methods for adapting previously learned foundation models for open-vocabulary semantic segmentation tasks. Our research proposes "Beyond-Labels", a lightweight transformer-based fusion module that uses a small amount of image segmentation data to fuse frozen visual representations with language concepts. This strategy allows the model to leverage the extensive knowledge of pre-trained models without requiring significant retraining, making the approach data-efficient and scalable. Furthermore, we capture positional information in images using Fourier embeddings, improving generalization and enabling smooth and consistent spatial encoding. We perform thorough ablation studies to examine the main components of our proposed method. On the standard benchmark PASCAL-5i, the method performs better despite being trained on frozen vision and language representations. Index Terms: Beyond-Labels, open-vocabulary semantic segmentation, Fourier embeddings, PASCAL-5i
中文标题/摘要
标题:超越标签:利用视觉语言模型推进开放词汇分割
开放词汇语义分割试图使用任意文本标签对图像中的对象进行分类和轮廓化,包括训练期间未见过的标签。自我监督学习在有效训练时可以解决许多视觉和语言处理问题。本研究探讨了简单而高效的方法,以适应先前学习的基础模型进行开放词汇语义分割任务。“Beyond-Labels”是一种轻量级的基于Transformer的融合模块,使用少量图像分割数据将冻结的视觉表示与语言概念融合。该策略允许模型利用预训练模型的广泛知识,而无需进行大量重新训练,从而使方法具有数据效率和可扩展性。此外,我们使用傅里叶嵌入捕获图像中的位置信息,提高泛化能力并实现平滑一致的空间编码。我们进行了详尽的消融研究以检查我们提出方法的主要组成部分。尽管仅在冻结的视觉和语言表示上进行训练,该方法在标准基准PASCAL-5i上表现更好。
Summary / 总结
The study aims to enhance open-vocabulary semantic segmentation by using vision-language models. It introduces 'Beyond-Labels', a lightweight transformer-based fusion module that integrates frozen visual representations with language concepts using a small amount of image segmentation data. This approach improves the model's performance on the PASCAL-5i benchmark without significant retraining, demonstrating data efficiency and scalability. Fourier embeddings are used to capture positional information, enhancing generalization and spatial encoding consistency.
研究旨在通过视觉语言模型提升开放词汇语义分割。提出了一种名为'Beyond-Labels'的轻量级Transformer融合模块,该模块使用少量图像分割数据将冻结的视觉表示与语言概念融合。该方法无需大量重新训练,即可在PASCAL-5i基准测试中取得更好的表现,体现了数据效率和可扩展性。此外,利用傅里叶嵌入捕获位置信息,提升了泛化能力和空间编码的一致性。消融研究考察了所提方法的主要组成部分。
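The Beyond-Labels entry above refers to Fourier embeddings for positional information. The abstract does not give the exact formulation, so the sketch below uses a common sinusoidal variant as an assumption: each normalized coordinate is passed through sine and cosine at geometrically increasing frequencies. The function name `fourier_embed` and the `num_bands` parameter are hypothetical.

```python
import math


def fourier_embed(x, y, num_bands=4):
    """Encode a 2D position with sinusoidal Fourier features.

    x, y:      coordinates normalized to [0, 1].
    num_bands: number of frequency bands (assumed hyperparameter).
    Returns a list of 4 * num_bands values in [-1, 1].
    """
    features = []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi  # frequencies pi, 2*pi, 4*pi, ...
        for coord in (x, y):
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features


emb = fourier_embed(0.25, 0.5)
print(len(emb))
```

Nearby positions map to nearby feature vectors at low frequencies, while the higher bands separate fine spatial detail, which is one reason such encodings are described as smooth and consistent.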
Open-Vocabulary Audio-Visual Semantic Segmentation
Authors: Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying
Venue: ACM MM 2024 Oral
First: 2024-07-31T16:14:09+00:00 · Latest: 2024-07-31T16:14:09+00:00
Comments: Accepted by ACM MM 2024 (Oral)
Abstract
Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate on the closed-set assumption and only identify pre-defined categories from training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, extending the AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen or heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module to perform audio-visual fusion and locate all potential sounding objects and 2) an open-vocabulary classification module to predict categories with the help of the prior knowledge from large-scale pre-trained vision-language models. To properly evaluate the open-vocabulary AVSS, we split zero-shot training and testing subsets based on the AVSBench-semantic benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and open-vocabulary method by 10.2%/11.6%. The code is available at https://github.com/ruohaoguo/ovavss.
中文标题/摘要
标题:开放词汇音频-视觉语义分割
音频-视觉语义分割(AVSS)旨在利用声学线索对视频中的发声对象进行分割和分类。然而,大多数方法基于封闭集假设,只能识别训练数据中预定义的类别,缺乏在实际应用中检测未见过的新类别的泛化能力。本文引入了一个新的任务:开放词汇音频-视觉语义分割,将AVSS任务扩展到标注标签空间之外的开放世界场景。这是一个更具挑战性的任务,需要识别所有类别,即使这些类别在训练过程中从未见过也从未听过。此外,我们提出了第一个开放词汇AVSS框架OV-AVSS,主要由两部分组成:1)一个通用声源定位模块,用于执行音频-视觉融合并定位所有潜在的发声对象;2)一个开放词汇分类模块,在大规模预训练视觉-语言模型先验知识的帮助下预测类别。为了适当评估开放词汇AVSS,我们基于AVSBench-semantic基准划分了零样本训练和测试子集,即AVSBench-OV。大量实验表明,我们的模型在所有类别上的分割和零样本泛化能力都很强。在AVSBench-OV数据集上,OV-AVSS在基础类别上的mIoU为55.43%,在新类别上的mIoU为29.14%,分别超过了最先进的零样本方法41.88%/20.61%,开放词汇方法10.2%/11.6%。代码可在https://github.com/ruohaoguo/ovavss/获取。
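The OV-AVSS results above are reported as mIoU separately on base and novel categories. The sketch below shows how such a split can be computed from flat label lists. It is a simplified illustration: benchmark implementations typically accumulate intersections and unions over a whole dataset, and the class split, the function names `per_class_iou` and `mean_iou`, and the toy labels are hypothetical.

```python
def per_class_iou(pred, target, num_classes):
    """Intersection over union for each class present in pred or target."""
    ious = {}
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both pred and target
            ious[c] = inter / union
    return ious


def mean_iou(ious, classes):
    """Average IoU over a chosen subset of classes."""
    values = [ious[c] for c in classes if c in ious]
    return sum(values) / len(values) if values else float("nan")


# Toy flattened label maps (hypothetical): 8 pixels, 3 classes.
pred = [0, 0, 1, 1, 2, 2, 2, 0]
target = [0, 0, 1, 2, 2, 2, 1, 0]

ious = per_class_iou(pred, target, num_classes=3)
base_miou = mean_iou(ious, classes=[0, 1])  # classes seen in training
novel_miou = mean_iou(ious, classes=[2])    # classes held out
print(base_miou, novel_miou)
```

Reporting the two averages separately makes it visible when a model does well on training categories but degrades on unseen ones.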
OV-PARTS: Towards Open-Vocabulary Part Segmentation
Authors: Meng Wei, Xiaoyu Yue, Wenwei Zhang, Shu Kong, Xihui Liu, Jiangmiao Pang
Venue: NeurIPS
First: 2023-10-08T10:28:42+00:00 · Latest: 2023-10-08T10:28:42+00:00
Comments: Accepted by NeurIPS Dataset and Benchmark Track 2023
Abstract
Segmenting and recognizing diverse object parts is a crucial ability in applications spanning various computer vision and robotic tasks. While significant progress has been made in object-level Open-Vocabulary Semantic Segmentation (OVSS), i.e., segmenting objects with arbitrary text, the corresponding part-level research poses additional challenges. Firstly, part segmentation inherently involves intricate boundaries, while limited annotated data compounds the challenge. Secondly, part segmentation introduces an open granularity challenge due to the diverse and often ambiguous definitions of parts in the open world. Furthermore, the large-scale vision and language models, which play a key role in the open vocabulary setting, struggle to recognize parts as effectively as objects. To comprehensively investigate and tackle these challenges, we propose an Open-Vocabulary Part Segmentation (OV-PARTS) benchmark. OV-PARTS includes refined versions of two publicly available datasets: Pascal-Part-116 and ADE20K-Part-234. And it covers three specific tasks: Generalized Zero-Shot Part Segmentation, Cross-Dataset Part Segmentation, and Few-Shot Part Segmentation, providing insights into analogical reasoning, open granularity and few-shot adapting abilities of models. Moreover, we analyze and adapt two prevailing paradigms of existing object-level OVSS methods for OV-PARTS. Extensive experimental analysis is conducted to inspire future research in leveraging foundational models for OV-PARTS. The code and dataset are available at https://github.com/OpenRobotLab/OV_PARTS.
中文标题/摘要
标题:OV-PARTS:迈向开放词汇部件分割
在各种计算机视觉和机器人任务中,分割和识别多样化的物体部件是一项关键能力。尽管在物体级开放词汇语义分割(OVSS)方面取得了显著进展,即分割任意文本描述的物体,但相应的部件级研究带来了额外的挑战。首先,部件分割本身涉及复杂的边界,而有限的标注数据进一步增加了挑战。其次,部件分割引入了开放粒度挑战,因为开放世界中部件的定义多样且往往模糊不清。此外,大型的视觉和语言模型,在开放词汇设置中扮演关键角色,难以像识别物体那样有效地识别部件。为了全面研究和解决这些挑战,我们提出了一个开放词汇部件分割(OV-PARTS)基准。OV-PARTS 包括两个公开可用数据集的精炼版本:Pascal-Part-116 和 ADE20K-Part-234,并涵盖了三个特定任务:泛化零样本部件分割、跨数据集部件分割和少量样本部件分割,提供了模型类比推理、开放粒度和少量样本适应能力的见解。此外,我们分析并调整了现有物体级 OVSS 方法的两种主要范式以适应 OV-PARTS。进行了广泛的实验分析,以启发未来利用基础模型进行 OV-PARTS 的研究。代码和数据集可在 https://github.com/OpenRobotLab/OV_PARTS/ 获取。
Summary / 总结
The research aims to address the challenges of part-level segmentation in open-vocabulary settings, particularly the intricate boundaries and open granularity issues. The authors propose OV-PARTS, a benchmark that includes refined versions of Pascal-Part-116 and ADE20K-Part-234 datasets, and covers three tasks: Generalized Zero-Shot Part Segmentation, Cross-Dataset Part Segmentation, and Few-Shot Part Segmentation. Key findings include insights into the analogical reasoning, open granularity, and few-shot adapting abilities of models, and the adaptation of existing object-level open-vocabulary semantic segmentation methods for part-level tasks.
论文针对开放词汇量部件分割这一挑战,提出了OV-PARTS基准,该基准包含两个现有数据集的改进版本,并涵盖了三个特定任务:通用零样本部件分割、跨数据集部件分割和少量样本部件分割。这些任务旨在评估模型的类比推理、开放粒度和少量样本适应能力。研究还对现有对象级开放词汇量语义分割方法进行了适应,并探讨了大规模视觉和语言模型在识别部件方面的局限性。进行了广泛的实验分析,以指导未来的研究。
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Authors: Yong Liu, Sule Bai, Guanbin Li, Yitong Wang, Yansong Tang
First: 2023-12-07T07:00:09+00:00 · Latest: 2024-11-26T13:45:09+00:00
Comments: Accepted by CVPR2024
Abstract
This paper studies open-vocabulary segmentation (OVS) by calibrating the in-vocabulary and domain-biased embedding space with the generalized contextual prior of CLIP. As the core of open-vocabulary understanding, alignment of visual content with the semantics of unbounded text has become the bottleneck of this field. To address this challenge, recent works propose to utilize CLIP as an additional classifier and aggregate model predictions with CLIP classification results. Despite their remarkable progress, the performance of OVS methods in relevant scenarios is still unsatisfactory compared with supervised counterparts. We attribute this to the in-vocabulary embedding and domain-biased CLIP prediction. To this end, we present a Semantic-assisted CAlibration Network (SCAN). In SCAN, we incorporate the generalized semantic prior of CLIP into the proposal embedding to avoid collapsing on known categories. Besides, a contextual shift strategy is applied to mitigate the lack of global context and unnatural background noise. With the above designs, SCAN achieves state-of-the-art performance on all popular open-vocabulary segmentation benchmarks. Furthermore, we also focus on a problem of the existing evaluation system, which ignores semantic duplication across categories, and propose a new metric called Semantic-Guided IoU (SG-IoU).
中文标题/摘要
标题:基于语义辅助校准的开放词汇分割
本文通过使用CLIP的广义上下文先验来校准词汇内和领域偏差嵌入空间,研究了开放词汇分割(OVS)。开放词汇理解的核心是视觉内容与未定义文本语义的对齐,这是该领域的瓶颈。为了解决这一挑战,最近的研究提出利用CLIP作为附加分类器,并将CLIP分类结果与模型预测结果进行聚合。尽管取得了显著进展,但OVS方法在相关场景中的性能仍然不如监督方法。我们将其归因于词汇内嵌入和领域偏差的CLIP预测。为此,我们提出了一种语义辅助校准网络(SCAN)。在SCAN中,我们将CLIP的广义语义先验融入提案嵌入中,以避免在已知类别上崩溃。此外,我们还应用了上下文偏移策略,以缓解缺乏全局上下文和不自然背景噪声的问题。通过上述设计,SCAN在所有流行的开放词汇分割基准上达到了最先进的性能。此外,我们还关注现有评估系统忽略类别间语义重复的问题,并提出了一种新的度量标准,称为语义导向交并比(SG-IoU)。
Summary / 总结
This paper addresses the challenge of open-vocabulary segmentation by proposing a Semantic-assisted CAlibration Network (SCAN) that incorporates the generalized semantic prior of CLIP into proposal embeddings, avoiding collapse onto known categories and improving alignment between visual content and unbounded text. SCAN also applies a contextual shift strategy to mitigate the lack of global context and unnatural background noise, achieving state-of-the-art performance on popular benchmarks. Additionally, the paper introduces a new metric, Semantic-Guided IoU (SG-IoU), which accounts for the semantic duplication across categories that existing evaluation ignores.
本文通过提出一种结合CLIP通用语义先验的Semantic-assisted CAlibration Network (SCAN) 来解决开放词汇分割中的挑战,以提高视觉内容与未定义文本之间的对齐。SCAN还包含一种上下文偏移策略来处理全局上下文和背景噪声问题。实验结果表明,SCAN在各种开放词汇分割基准上优于现有方法。此外,本文还提出了一种新的评估指标,称为语义导向的IoU (SG-IoU),以更准确地评估类别间的语义重复。
Effective SAM Combination for Open-Vocabulary Semantic Segmentation
Authors: Minhyeok Lee, Suhwan Cho, Jungho Lee, Sunghun Yang, Heeseung Choi, Ig-Jae Kim, Sangyoun Lee
Venue: CVPR 2025
First: 2024-11-22T04:36:12+00:00 · Latest: 2025-03-30T10:33:55+00:00
Comments: Accepted to CVPR 2025
Abstract
Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. But these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.
中文标题/摘要
标题:有效结合的SAM模型在开放词汇语义分割中的应用
开放词汇语义分割旨在为无限数量类别的图像分配像素级标签。传统方法通过将强大的掩码生成器,如Segment Anything Model (SAM),与预训练的视觉-语言模型,如CLIP,进行顺序连接来解决这一问题。但这些两阶段方法往往面临高计算成本和内存效率低的问题。在本文中,我们提出了一种新颖的一阶段开放词汇分割模型ESC-Net,该模型利用SAM解码块在高效推理框架中实现类无差别分割。通过将来自图像-文本相关性的伪提示嵌入SAM的可提示分割框架中,ESC-Net实现了精细的空间聚合,以获得准确的掩码预测。ESC-Net在标准基准测试中表现出色,包括ADE20K、PASCAL-VOC和PASCAL-Context,其在效率和准确性方面均优于先前的方法。全面的消融研究进一步证明了其在具有挑战性条件下的鲁棒性。
Summary / 总结
The paper addresses the challenge of open-vocabulary semantic segmentation by proposing ESC-Net, a one-stage model that integrates the SAM decoder blocks for efficient class-agnostic segmentation. By incorporating pseudo prompts from image-text correlations, the model achieves accurate mask predictions. ESC-Net outperforms previous methods on standard benchmarks like ADE20K, PASCAL-VOC, and PASCAL-Context, demonstrating superior performance in both efficiency and accuracy.
论文提出了一种名为ESC-Net的一阶段模型,通过整合SAM解码块实现高效的无类别分割。通过结合来自图像-文本关联的伪提示,ESC-Net增强了空间聚合以获得精确的掩码预测。该模型在标准基准上表现出色,优于先前的方法,在效率和准确性方面均表现出色,并得到了稳健的消融研究支持。