arXiv Paper Digest

2025-09-04 03:23
Snapshot: 20250904_0323
NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation
Authors: Max Gandyra, Alessandro Santonicola, Michael Beetz
Venue: ICLR 2026
First: 2025-07-02T08:23:14+00:00 · Latest: 2025-09-02T11:45:02+00:00
Comments: 10 pages, 3 figures, 5 tables, ICLR 2026 preprint
Abstract
Instance segmentation of novel object instances in RGB images, given some example images for each object, is a well-known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re-)training has proven to be a difficult task. To handle this, we present a new training-free framework called Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). NOCTIS integrates two pre-trained models: Grounded-SAM 2 for object proposals with precise bounding boxes and corresponding segmentation masks, and DINOv2 for robust class and patch embeddings, chosen for its zero-shot capabilities. Internally, proposal-object matching is realized by computing an object matching score from the similarity of the class embeddings and the average maximum similarity of the patch embeddings, combined with a new cyclic thresholding (CT) mechanism that mitigates unstable matches caused by repetitive textures or visually similar patterns. Beyond CT, NOCTIS introduces: (i) an appearance score that is unaffected by object selection bias; (ii) the use of the average confidence of the proposal's bounding box and mask as a scoring component; and (iii) an RGB-only pipeline that performs even better than RGB-D ones. We empirically show that NOCTIS, without further training or fine-tuning, attains state-of-the-art mean AP scores compared with the best RGB and RGB-D methods on the seven core datasets of the BOP 2023 challenge for the "Model-based 2D segmentation of unseen objects" task.
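The matching step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not define cyclic thresholding precisely, so the code assumes one plausible reading (a round-trip check in which a proposal patch is matched to its best template patch and that template patch must map back to within `cycle_tol` patch indices of the starting patch). The names `appearance_score`, `matching_score`, and `cycle_tol`, the zero score for rejected patches, and the unweighted mean of the three components are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_match(query, candidates):
    """Return (index, similarity) of the candidate most similar to query."""
    sims = [cosine(query, c) for c in candidates]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]


def appearance_score(prop_patches, tmpl_patches, cycle_tol=0):
    """Average maximum patch similarity with an assumed round-trip filter.

    A proposal patch contributes its best similarity only if the matched
    template patch maps back to (nearly) the same proposal patch.
    Rejected patches contribute 0 (an assumption, not from the paper).
    """
    if not prop_patches or not tmpl_patches:
        return 0.0
    total = 0.0
    for i, patch in enumerate(prop_patches):
        j, sim = best_match(patch, tmpl_patches)          # proposal -> template
        i_back, _ = best_match(tmpl_patches[j], prop_patches)  # template -> proposal
        if abs(i_back - i) <= cycle_tol:
            total += sim
    return total / len(prop_patches)


def matching_score(cls_prop, cls_tmpl, prop_patches, tmpl_patches,
                   box_conf, mask_conf):
    """Combine the three components named in the abstract (unweighted mean assumed)."""
    semantic = cosine(cls_prop, cls_tmpl)
    appearance = appearance_score(prop_patches, tmpl_patches)
    confidence = (box_conf + mask_conf) / 2
    return (semantic + appearance + confidence) / 3
```

In this toy version, two identical proposal patches competing for a single template patch illustrate the repetitive-texture case: only one of them survives the round trip when `cycle_tol=0`, so the duplicate does not inflate the score.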
Summary
NOCTIS is a training-free framework for instance segmentation of novel objects in RGB images. It combines Grounded-SAM 2 for precise object proposals and DINOv2 for robust class and patch embeddings. NOCTIS introduces a cyclic thresholding mechanism to improve matching stability and includes an appearance score and confidence scoring component. Empirically, NOCTIS achieves state-of-the-art mean AP scores on seven core BOP 2023 datasets without further training.
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
Authors: Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu
First: 2025-08-21T03:48:28+00:00 · Latest: 2025-09-02T11:29:34+00:00
Abstract
Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to balance efficiency and model accuracy. In practice, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level, while dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, so it can be combined with them to achieve further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. Furthermore, even with an aggressive pruning ratio of 80%, SPARK maintains performance with less than 5% degradation compared to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at https://github.com/Xnhyacinth/SparK.
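The idea of pruning along the channel axis and restoring pruned entries when computing attention scores can be sketched as follows. This is a minimal illustration under stated assumptions, not the SparK code: the abstract does not say how saliency is measured or how entries are restored, so the sketch assumes saliency is the magnitude of the per-channel query-key product and that pruned channels are filled with a per-channel mean over cached keys. The names `prune_key`, `channel_means`, `attention_logit`, and `keep_ratio` are hypothetical.

```python
def prune_key(key, query, keep_ratio):
    """Keep the most salient channels of one cached key vector.

    Saliency is assumed to be |q_c * k_c| for a reference query.
    Returns a sparse {channel_index: value} mapping.
    """
    n_keep = max(1, int(len(key) * keep_ratio))
    saliency = [abs(q * k) for q, k in zip(query, key)]
    top = sorted(range(len(key)), key=lambda c: saliency[c], reverse=True)
    return {c: key[c] for c in top[:n_keep]}


def channel_means(keys):
    """Per-channel mean over all cached keys, used here as the restore value."""
    n = len(keys)
    return [sum(k[c] for k in keys) / n for c in range(len(keys[0]))]


def attention_logit(query, sparse_key, fill):
    """Unnormalized attention score; pruned channels are read from `fill`."""
    return sum(q * sparse_key.get(c, fill[c]) for c, q in enumerate(query))


# Toy cache with two tokens and four channels.
keys = [[4.0, 0.1, -3.0, 0.2],
        [2.0, 0.3, -1.0, 0.0]]
query = [1.0, 1.0, 1.0, 1.0]

sparse = prune_key(keys[0], query, keep_ratio=0.5)   # stores 2 of 4 channels
restored = attention_logit(query, sparse, channel_means(keys))
zero_filled = attention_logit(query, sparse, [0.0] * 4)
```

A pruning ratio of 80%, as quoted in the abstract, would correspond to `keep_ratio=0.2` in this sketch. Comparing `restored` with `zero_filled` shows why some form of restoration matters: dropping pruned channels outright shifts the score, whereas filling them with an estimate keeps it closer to the dense dot product.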
Summary
SPARK is a query-aware unstructured sparsity method that prunes the KV cache at the channel level while dynamically restoring pruned entries during attention score computation. This approach addresses the KV cache bottleneck in long-context inference of large language models by reducing channel-level redundancy while preserving or improving model accuracy at equal sequence length. SPARK reduces KV cache storage by over 30% compared to eviction-based methods and, even at an 80% pruning ratio, shows less than 5% degradation relative to the baseline eviction method.
VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework
Authors: Chao Wang, Chunbai Zhang, Yongxiao Tian, Yang Zhou, Yan Peng
First: 2025-02-02T07:54:55+00:00 · Latest: 2025-09-02T05:28:29+00:00
Comments: 14 pages, 17 figures
Abstract
Visual reasoning is the task of answering questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language models (VLMs) or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability and hindered by underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. VIKSER then uses this fine-grained visual knowledge to paraphrase underspecified questions. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of "evidence for reasoning" to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection technology gives VIKSER the ability to learn from and correct its mistakes. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results on the relevant tasks. Moreover, VIKSER achieves performance on par with leading proprietary models, such as the latest ChatGPT-5.