arXiv Paper Digest (arXiv 论文速递)

2026-03-02 03:36
Snapshot: 20260302_0336
Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Authors: Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang, Ranjay Krishna
First: 2026-02-26T18:54:06+00:00 · Latest: 2026-02-26T18:54:06+00:00
Comments: TACL 2026
Abstract
The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.
中文标题/摘要
标题:规模无法克服语用学:报告偏差对视觉语言推理的影响
视觉语言模型(VLMs)缺乏推理能力的问题一直是研究讨论的焦点。我们认为这种行为源于其训练数据中的报告偏差。也就是说,人们默认描述视觉内容时会省略一些监督某些类型推理所需的隐含信息;例如,“今天在比赛!”比“一张37个人站在田野后面的图片”更可能作为描述。我们通过语用学理论的视角,研究了流行的VLMs OpenCLIP、LLaVA-1.5和Molmo的数据基础,发现报告偏差导致在空间、时间、否定和计数这四种推理技能上缺乏足够的表示,尽管这些语料库是大规模的,或者合成生成的。通过一组精心策划的基准测试,我们证明:(i) VLMs在由报告偏差抑制的上述类型推理上表现不佳;(ii) 与普遍认为的相反,增加数据量、模型规模和多语言训练并不会默认产生这些技能;但令人鼓舞的是,(iii) 特别收集用于获取隐含信息的注解是有效的。我们的研究结果强调了需要更故意的数据集策划方法,而不是依赖规模来产生推理能力。
Summary / 总结
The research attributes the weak reasoning of Vision-Language Models (VLMs) to reporting bias in their training data. Examining the data behind OpenCLIP, LLaVA-1.5, and Molmo, the study finds that this data under-represents four reasoning skills (spatial, temporal, negation, and counting) because captions omit tacit information, even when the corpora are web-scale and/or synthetically generated. On curated benchmarks, VLMs perform poorly on these skills, and scaling data size, model size, or the number of languages does not make them emerge by default. Incorporating annotations collected specifically to capture tacit information is effective, which underscores the need for more deliberate data curation.
研究探讨了报告偏见对视觉语言模型(VLMs)如OpenCLIP、LLaVA-1.5和Molmo推理能力的影响。通过使用语用学理论分析训练数据,研究发现报告偏见导致空间、时间、否定和计数推理技能的不足表示,尽管数据集规模庞大。研究显示,VLMs在这些被报告偏见抑制的推理类型上表现不佳,单纯增加数据或模型规模并不能改善这些技能。然而,通过特定注解来捕捉隐含信息可以提高这些推理能力,强调了更故意的数据整理方法的必要性。
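The corpus analysis described above can be illustrated with a small sketch that measures how often captions contain cue words for each of the four skills. This is only an illustration under stated assumptions: the cue-word lists and the function name `cue_rates` are hypothetical and are not taken from the paper, whose actual methodology is not given in the abstract.

```python
import re

# Hypothetical cue-word lists (assumption); the paper's lexicons are not in the abstract.
CUES = {
    "spatial": {"left", "right", "above", "below", "behind", "under"},
    "temporal": {"before", "after", "then", "while"},
    "negation": {"no", "not", "without", "never"},
    "counting": {"two", "three", "four", "five", "several"},
}


def cue_rates(captions):
    """Return, per skill, the fraction of captions with at least one cue word."""
    counts = {skill: 0 for skill in CUES}
    for caption in captions:
        tokens = set(re.findall(r"[a-z0-9]+", caption.lower()))
        for skill, words in CUES.items():
            has_cue = bool(tokens & words)
            # Treat purely numeric tokens (e.g. "37") as counting cues too.
            if skill == "counting" and any(t.isdigit() for t in tokens):
                has_cue = True
            if has_cue:
                counts[skill] += 1
    total = len(captions) or 1  # avoid division by zero on an empty corpus
    return {skill: n / total for skill, n in counts.items()}


captions = [
    "at the game today!",
    "a photo of 37 people standing behind a field",
    "my dog",
    "a cat without a collar",
]
rates = cue_rates(captions)
print(rates)
```

On a real corpus, low rates for a skill would be consistent with the reporting-bias hypothesis, although keyword matching is a crude proxy for whether a caption actually supervises that skill.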
Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
Authors: Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
First: 2026-02-26T18:45:33+00:00 · Latest: 2026-02-26T18:45:33+00:00
Abstract
Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
中文标题/摘要
标题:检索与分割:少量示例足以弥合开放词汇分割中的监督缺口吗?
开放词汇分割(OVS)将视觉语言模型(VLMs)的零样本识别能力扩展到像素级预测,使模型能够根据文本提示分割任意类别。尽管取得了进展,但由于训练VLMs所使用的粗略图像级监督和自然语言的语义模糊性,OVS仍落后于完全监督的方法。我们通过引入一种少量样本设置,将文本提示与像素标注图像的支持集相结合,来解决这些限制。在此基础上,我们提出了一种检索增强的测试时适配器,通过融合文本和视觉支持特征学习一种轻量级的、针对每张图像的分类器。与依赖于后期手工融合的先前方法不同,我们的方法进行学习的、针对每个查询的融合,实现了模态之间的更强协同作用。该方法支持不断扩展的支持集,并适用于细粒度任务,如个性化分割。实验表明,我们显著缩小了零样本和监督分割之间的差距,同时保留了开放词汇的能力。
Summary / 总结
This paper addresses the limitations of open-vocabulary segmentation (OVS) by proposing a few-shot setting that combines textual prompts with pixel-annotated images. The method introduces a retrieval-augmented test-time adapter to learn a lightweight classifier that fuses textual and visual support features, achieving better synergy between modalities than previous methods. Experiments demonstrate that this approach significantly reduces the gap between zero-shot and supervised segmentation while maintaining open-vocabulary capabilities.
该研究通过将文本提示与像素标注图像结合的少量样本设置来解决开放词汇分割(OVS)的局限性。方法引入了一种检索增强的测试时适配器,通过融合文本和视觉支持特征来学习轻量级分类器,实现比先前方法更好的模态协同效应。实验表明,该方法显著缩小了零样本和监督分割之间的差距,同时保持了开放词汇的能力。
ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
Authors: Yiran Guan, Sifan Tu, Dingkang Liang, Linghao Zhu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
Venue: ICLR 2026
First: 2026-02-26T18:10:41+00:00 · Latest: 2026-02-26T18:10:41+00:00
Comments: Accepted by ICLR 2026
Abstract
Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.
中文标题/摘要
标题:ThinkOmni:通过指导解码提升到全模态场景的文本推理
全模态推理对于智能系统理解并从多种数据源中推断信息至关重要。虽然现有的全模态大型语言模型(OLLM)在感知多种模态方面表现出色,但它们缺乏近期大型推理模型(LRM)的复杂推理能力。然而,通过额外训练来增强OLLM的推理能力面临着重大挑战,包括高质量数据的需求、任务特定的适应以及巨大的计算成本。为了解决这些限制,我们提出ThinkOmni,这是一种无需训练和数据的框架,将文本推理提升到全模态场景。ThinkOmni引入了两个关键组件:1)LRM-as-a-Guide,利用现成的LRM来指导OLLM的解码过程;2)逐步对比缩放,无需手动超参数调整即可适应性平衡感知和推理信号。在六个跨模态推理基准上的实验表明,ThinkOmni始终能够提供性能改进,主要结果在MathVista上达到70.2,在MMAU上达到75.5。总体而言,ThinkOmni提供了一种灵活且通用的全模态推理解决方案,并为推理能力的泛化和应用提供了新的见解。
Summary / 总结
The research aims to enhance the reasoning abilities of omni-modal large language models (OLLMs) without additional training or data, by leveraging large reasoning models (LRMs) and a stepwise contrastive scaling method. The proposed ThinkOmni framework improves performance on six multi-modal reasoning benchmarks, achieving 70.2 on MathVista and 75.5 on MMAU.
ThinkOmni 是一个无需训练和数据的框架,通过利用现成的大型推理模型(LRMs)进行引导解码,并通过逐步对比缩放来适应性地平衡感知和推理信号,从而增强全模态大型语言模型(OLLMs)的推理能力。在六个跨模态推理基准上的实验显示了一致的性能提升,特别是在 MathVista 达到 70.2 和 MMAU 达到 75.5 的主要结果上。
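The idea of letting a reasoning model steer the decoding of an omni-modal model can be sketched as a per-step blend of next-token logits. This is a simplified illustration, not the ThinkOmni algorithm: plain linear interpolation with a fixed weight `alpha` stands in for the paper's Stepwise Contrastive Scaling, which the abstract says sets the balance adaptively. The function names and the toy logits are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_step(omni_logits, guide_logits, alpha):
    """Blend next-token logits and pick the greedy token.

    alpha = 0 keeps the omni-modal model's choice; alpha = 1 follows the guide.
    A fixed alpha is an assumption made for this sketch.
    """
    mixed = [(1 - alpha) * o + alpha * g for o, g in zip(omni_logits, guide_logits)]
    probs = softmax(mixed)
    token = max(range(len(probs)), key=probs.__getitem__)
    return token, probs


# Toy vocabulary of three tokens (values are made up for illustration).
omni_logits = [2.0, 1.0, 0.0]   # perception-driven preference for token 0
guide_logits = [0.0, 3.0, 0.0]  # reasoning-driven preference for token 1

token_unguided, _ = guided_step(omni_logits, guide_logits, alpha=0.0)
token_guided, probs_guided = guided_step(omni_logits, guide_logits, alpha=0.5)
print(token_unguided, token_guided)
```

Both models must share a vocabulary for the element-wise blend to make sense; how the paper handles that is not stated in the abstract.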
ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation
Authors: Ayush Roy, Wei-Yang Alex Lee, Rudrasis Chakraborty, Vishnu Suresh Lokhande
First: 2026-02-26T18:07:10+00:00 · Latest: 2026-02-26T18:07:10+00:00
Comments: CVPR 2026
Abstract
In recent times, large datasets hinder efficient model training while also containing redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which often are rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, l2 distance among real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.
中文标题/摘要
标题:ManifoldGD:无需训练的分层流形引导扩散基础数据集蒸馏
近年来,大规模数据集妨碍了高效的模型训练,同时也包含冗余的概念。数据集蒸馏旨在合成紧凑的数据集,同时保留大规模训练集的知识,大幅减少存储和计算需求。扩散模型的最新进展使无需训练的蒸馏成为可能,通过利用预训练生成先验;然而,现有的引导策略仍然有限。当前基于分数的方法要么进行无引导的去噪,要么依赖于简单的基于实例原型中心(IPC中心)的模式引导,这些中心往往是原始且次优的。我们提出了一种无需训练的基于扩散的框架——流形引导蒸馏(ManifoldGD),该框架在每次去噪时间步中整合流形一致的引导。我们的方法通过VAE潜在特征的分层、分裂聚类计算IPC,生成多尺度的核心集,捕捉粗粒度语义模式和细粒度类内变异性。通过提取的IPC中心的局部邻域,我们为每次扩散去噪时间步创建潜在流形。在每次去噪步骤中,我们将模式对齐向量投影到估计的潜在流形的局部切空间,从而约束生成轨迹保持流形忠实性,同时保持语义一致性。这种表述在无需任何模型重新训练的情况下提高了表示性、多样性和图像保真度。实验证明,ManifoldGD在FID、真实和合成数据集嵌入的l2距离以及分类准确性方面优于现有的无需训练和基于训练的基线,确立了ManifoldGD作为首个几何感知的无需训练数据蒸馏框架的地位。
Summary / 总结
ManifoldGD is a training-free diffusion-based framework that enhances dataset distillation by integrating manifold consistent guidance at each denoising step. It uses hierarchical clustering of VAE latent features to compute IPC centroids, creating a multi-scale coreset that captures both coarse semantic modes and fine intra-class variability. This method projects the mode-alignment vector onto the local tangent space of the estimated latent manifold, ensuring manifold-faithful generation while preserving semantic consistency. Empirical results show consistent improvements over existing training-free and training-based baselines in terms of FID, l2 distance, and classification accuracy.
ManifoldGD 是一种无需训练的扩散基础框架,通过在每个去噪步骤中整合流形一致的指导来增强数据集蒸馏。它使用 VAE 潜在特征的分层聚类来计算实例原型中心(IPCs),创建一个多尺度核心集,同时捕捉粗粒度语义模式和细粒度类内变异性。在每个去噪步骤中,该方法将模式对齐向量投影到估计的潜在流形的局部切空间上,确保生成轨迹保持流形一致性并保留语义一致性。实验结果显示,在 FID、l2 距离和分类准确性方面,ManifoldGD 优于现有的无需训练和基于训练的基线方法。
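The tangent-space projection step can be sketched in a few lines. The sketch assumes the local tangent space at a centroid is approximated by the span of the differences to its neighboring centroids, orthonormalized with Gram-Schmidt; the paper may estimate the manifold differently (for example with PCA), and the function names here are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def tangent_basis(anchor, neighbors, eps=1e-9):
    """Orthonormal basis for span{neighbor - anchor} via Gram-Schmidt.

    Assumption: this span approximates the local tangent space at `anchor`.
    """
    basis = []
    for point in neighbors:
        diff = [p - a for p, a in zip(point, anchor)]
        for b in basis:
            coeff = dot(diff, b)
            diff = [x - coeff * y for x, y in zip(diff, b)]
        norm = dot(diff, diff) ** 0.5
        if norm > eps:  # skip directions already covered by the basis
            basis.append([x / norm for x in diff])
    return basis


def project(vector, basis):
    """Project `vector` onto the subspace spanned by an orthonormal `basis`."""
    out = [0.0] * len(vector)
    for b in basis:
        coeff = dot(vector, b)
        out = [o + coeff * x for o, x in zip(out, b)]
    return out


anchor = [0.0, 0.0, 0.0]
neighbors = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]]
basis = tangent_basis(anchor, neighbors)

guidance = [3.0, 4.0, 5.0]            # stand-in for a mode-alignment vector
projected = project(guidance, basis)  # component off the local plane is removed
print(projected)
```

Here the neighbors all lie in the plane of the first two coordinates, so the projection drops the third component of the guidance vector and keeps the update within that plane.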
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Authors: Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown
Venue: ICLR 2026
First: 2025-10-21T20:30:20+00:00 · Latest: 2026-02-26T18:05:42+00:00
Comments: Accepted at ICLR 2026. 26 pages, 9 figures. Metric/benchmark available at https://github.com/amith-ananthram/posh
Abstract
While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $ρ$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
中文标题/摘要
标题:PoSh:使用场景图引导LLM作为裁判进行详细图像描述
尽管视觉-语言模型(VLMs)在详细图像描述方面取得了进展,但评估仍是一个挑战。标准指标(如CIDEr、SPICE)是为短文本设计的,并且调整为识别现在已不常见的错误,例如物体识别错误。相比之下,长文本需要对属性和关系的敏感度以及能够定位特定文本片段错误的评分。在本文中,我们引入了PoSh,这是一种用于详细图像描述的指标,它使用场景图作为结构化的评分标准来引导LLM作为裁判,产生基于细粒度错误(如组合理解错误)的综合评分。PoSh是可复制的、可解释的,并且比现有指标(包括GPT4o作为裁判)更接近人类评分者。为了验证PoSh,我们引入了一个新的具有挑战性的数据集DOCENT。这个新的基准数据集包含艺术品,并配以专家撰写的参考文本和模型生成的描述,还增加了艺术史学生对它们质量的精细和粗略判断。因此,DOCENT使我们能够在一个新的具有挑战性的领域中评估详细图像描述指标和详细图像描述本身。我们展示了PoSh与DOCENT中的人类判断相比,具有更强的相关性(Spearman ρ +0.05),并且对图像类型具有鲁棒性(使用CapArena,一个现有的网络图像数据集),并且是一个有效的奖励函数,优于标准的监督微调。然后,使用PoSh,我们表征了开放和封闭模型在描述DOCENT中的绘画、素描和雕像的表现,并发现基础模型难以实现对具有丰富场景动态的图像的全面、无误的覆盖,从而确立了一个新的具有挑战性的任务来衡量VLM的进步。通过PoSh和DOCENT,我们希望促进在诸如辅助文本生成等重要领域的发展。
Summary / 总结
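PoSh is a new metric for evaluating detailed image descriptions that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors. It is validated on DOCENT, a new dataset of artwork paired with expert-written references, model-generated descriptions, and quality judgments from art history students. PoSh correlates more strongly with the human judgments in DOCENT than the best open-weight alternatives (+0.05 Spearman ρ), is robust to image type, and, when used as a reward function, outperforms standard supervised fine-tuning. Using PoSh, the authors find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics.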
PoSh 是一个使用场景图来指导LLM作为裁判的新评价指标,专注于细粒度错误。它在包含艺术作品及其专家参考和艺术史学生质量判断的新数据集 DOCENT 上进行了验证。PoSh 与人类判断的相关性更强,优于现有指标,并且优于标准的监督微调,突出了准确描述复杂场景的挑战。
CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays
Authors: Hyungyung Lee, Hangyul Yoon, Edward Choi
First: 2026-02-26T17:51:21+00:00 · Latest: 2026-02-26T17:51:21+00:00
Abstract
Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also requiring costly retraining to support new diagnostic tasks, limiting their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings.
中文标题/摘要
标题:CXReasonAgent:基于证据的胸部X光诊断推理代理
胸部X光在胸部诊断中起着核心作用,其解释本质上需要多步、基于证据的推理。然而,大型视觉-语言模型(LVLM)通常生成的响应虽然看似合理,但并不忠实于诊断证据,提供的视觉证据有限,难以验证,同时还需要昂贵的重新训练以支持新的诊断任务,这限制了它们在临床环境中的可靠性和适应性。为了解决这些限制,我们提出了CXReasonAgent,这是一种将大型语言模型(LLM)与临床基础的诊断工具结合的诊断代理,用于使用图像衍生的诊断和视觉证据进行基于证据的诊断推理。为了评估这些能力,我们引入了包含1,946轮对话的多轮对话基准CXReasonDial,涉及12项诊断任务,并展示了CXReasonAgent生成忠实于证据的响应,使其在临床环境中比LVLMs提供更可靠和可验证的诊断推理。这些发现强调了在安全关键的临床环境中整合基于临床证据的诊断工具的重要性。
Summary / 总结
CXReasonAgent is designed to perform evidence-grounded diagnostic reasoning for chest X-rays by integrating a large language model with clinically grounded diagnostic tools. It addresses the limitations of large vision-language models by generating responses that are more faithfully grounded in diagnostic evidence and providing visual evidence for verification. CXReasonAgent outperforms large vision-language models in producing reliable and verifiable diagnostic reasoning, as demonstrated through the CXReasonDial benchmark with 1,946 dialogues across 12 diagnostic tasks.
该论文介绍了CXReasonAgent,这是一种结合了大型语言模型和临床基础诊断工具的诊断代理,用于进行基于证据的胸部X光诊断推理。它通过生成更忠实于诊断证据的响应并提供可验证的视觉证据来解决大型视觉-语言模型的局限性。CXReasonAgent使用CXReasonDial多轮对话基准进行了评估,并展示了与大型视觉-语言模型相比,能够产生更可靠和可验证的诊断推理的能力。
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Authors: Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
First: 2025-08-28T09:08:30+00:00 · Latest: 2026-02-26T17:33:06+00:00
Abstract
Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce Dyslexify - a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, Dyslexify improves performance by up to 22.06% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1%, and we demonstrate its utility in a medical foundation model for skin lesion diagnosis. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
中文标题/摘要
标题:Dyslexify:CLIP对抗 typographic 攻击的机制性防御
typographic 攻击通过在图像中注入文本来利用多模态系统,导致目标错误分类、恶意内容生成,甚至视觉语言模型的逃逸。在本研究中,我们分析了CLIP视觉编码器在typographic 攻击下的行为,发现模型后半部分的注意力头专门提取并传递typographic 信息至cls标记。基于这些见解,我们提出了Dyslexify——一种通过选择性消除typographic 电路(由注意力头组成)来防御CLIP模型的对抗方法。无需微调,Dyslexify在typographic变体的ImageNet-100上性能提升高达22.06%,同时将标准ImageNet-100的准确性降低不到1%,并在皮肤病变诊断的医学基础模型中展示了其效用。值得注意的是,我们的无训练方法在依赖微调的当前最先进的typographic防御方法中仍具有竞争力。为此,我们发布了对抗typographic 攻击具有显著更强鲁棒性的Dyslexic CLIP模型系列,这些模型适用于广泛的安全关键应用,其中基于文本的操纵风险超过了文本识别的实用性。
Summary / 总结
This work addresses typographic attacks on multi-modal systems, particularly CLIP models, by analyzing how CLIP vision encoders process typographic information. The authors identify specialized attention heads in the latter half of the model's layers that transmit typographic information to the cls token and introduce Dyslexify, a method that selectively ablates these heads to defend against attacks. Dyslexify improves performance by up to 22.06% on a typographic variant of ImageNet-100 while reducing standard ImageNet-100 accuracy by less than 1%, and it shows utility in a medical foundation model for skin lesion diagnosis. The approach is training-free and competitive with state-of-the-art defenses that require finetuning.
该研究针对CLIP模型等多模态系统中的字型攻击问题,通过分析CLIP视觉编码器如何处理字型信息来应对这些攻击。作者识别出专门将字型数据传输到cls标记的注意力头,并引入了Dyslexify方法,通过选择性地消除这些头来防御攻击。Dyslexify在字型变体的ImageNet-100数据集上可提高高达22.06%的性能,同时保持标准准确性,并在医疗应用中显示出实用性。该方法无需训练,与需要微调的当前最佳防御方法具有竞争力。
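Head ablation as described above can be sketched on a toy example. In multi-head attention, the layer output for a token is the sum of per-head contributions, so ablating a head amounts to dropping its term. The sketch assumes zero-ablation; the abstract does not say whether the paper zeroes heads or replaces them with a mean activation, and the head indices and vectors below are made up.

```python
def ablate_heads(head_contributions, ablated):
    """Sum per-head contributions to a token, skipping the ablated heads.

    head_contributions: list of vectors, one per attention head, each already
    mapped into the residual-stream dimension.
    ablated: set of head indices to remove (zero-ablation is an assumption).
    """
    dim = len(head_contributions[0])
    total = [0.0] * dim
    for index, contribution in enumerate(head_contributions):
        if index in ablated:
            continue  # this head no longer writes to the token
        total = [t + c for t, c in zip(total, contribution)]
    return total


# Toy layer with three heads writing to a 2-dimensional cls token.
heads = [[1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]

full_output = ablate_heads(heads, ablated=set())
defended_output = ablate_heads(heads, ablated={2})  # pretend head 2 carries text
print(full_output, defended_output)
```

In a real model the same effect is usually obtained by masking the chosen heads' outputs before the attention output projection, which requires no retraining and matches the training-free framing of the method.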
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Authors: Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao
First: 2026-02-26T17:12:40+00:00 · Latest: 2026-02-26T17:12:40+00:00
Abstract
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
中文标题/摘要
标题:时空令牌剪枝以实现高效的高分辨率GUI代理
纯视觉GUI代理提供了通用的交互能力,但由于高分辨率屏幕截图和历史轨迹中固有的大量时空冗余,它们遭受了严重的效率瓶颈。我们发现现有压缩范式中的两个关键不匹配:时间不匹配,其中均匀的历史编码与代理的“衰减记忆”注意力模式相背离,以及空间拓扑冲突,其中无结构的剪枝破坏了用于精确坐标定位所需的网格完整性,导致空间幻觉。为了解决这些挑战,我们引入了GUIPruner,这是一种针对高分辨率GUI导航的无需训练框架。它结合了基于衰减的重缩放以消除历史冗余的时空自适应分辨率(TAR),以及优先处理交互前景和语义锚点并保护全局布局的分层结构感知剪枝(SSP)。在多种基准上的广泛评估表明,GUIPruner始终能够实现最先进的性能,有效防止在高压缩下大型模型的性能崩溃。值得注意的是,在Qwen2-VL-2B上,我们的方法在FLOPs上减少了3.4倍,在视觉编码延迟上加快了3.3倍,同时保留了超过94%的原始性能,使实时、高精度导航在极低资源消耗下成为可能。
Summary / 总结
The research aims to improve the efficiency of high-resolution GUI agents by addressing temporal and spatial redundancy issues. It introduces GUIPruner, a training-free framework combining Temporal-Adaptive Resolution and Stratified Structure-aware Pruning to reduce historical redundancy and preserve grid integrity. Experimental results show that GUIPruner achieves state-of-the-art performance with a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while maintaining over 94% of the original performance.
本文提出了一种名为GUIPruner的无训练框架,结合了Temporal-Adaptive Resolution (TAR) 和 Stratified Structure-aware Pruning (SSP)。TAR通过衰减基线调整历史冗余,而SSP优先处理交互元素和语义锚点以保持布局完整性。该方法在多种基准测试中显著提升了性能,实现了3.4倍的FLOPs减少和3.3倍的视觉编码延迟加速,同时保留了大部分原始性能,从而实现实时、高精度的导航。
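The decay-based resizing behind Temporal-Adaptive Resolution can be sketched as a schedule that shrinks older screenshots more aggressively than recent ones. The exponential form, the `decay` and `min_scale` values, and the function name are assumptions for illustration; the abstract does not specify the actual schedule.

```python
def decayed_sizes(width, height, num_frames, decay=0.5, min_scale=0.25):
    """Return (width, height) per history frame, most recent frame first.

    A frame of age k is scaled by max(min_scale, decay ** k). Both parameters
    are hypothetical defaults, not values from the paper.
    """
    sizes = []
    for age in range(num_frames):  # age 0 is the current screenshot
        scale = max(min_scale, decay ** age)
        sizes.append((max(1, round(width * scale)), max(1, round(height * scale))))
    return sizes


sizes = decayed_sizes(1000, 800, num_frames=4)
print(sizes)
```

Because the number of visual tokens grows with image area, halving each side of an older frame cuts its token count roughly fourfold, which is where the compute savings of this kind of schedule would come from.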
Large Multimodal Models as General In-Context Classifiers
Authors: Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
Venue: CVPR
First: 2026-02-26T17:08:18+00:00 · Latest: 2026-02-26T17:08:18+00:00
Comments: CVPR Findings 2026. Project website at https://circle-lmm.github.io/
Abstract
Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMM) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.
中文标题/摘要
标题:大型多模态模型作为通用上下文分类器
我们应该使用哪种多模态模型进行分类?以往的研究表明,答案在于CLIP类对比视觉-语言模型(VLMs),因为它们在零样本分类中的表现非常出色。相比之下,大型多模态模型(LMM)更适合复杂任务。在本文中,我们提出,这种答案忽视了LMM的一个重要能力:上下文学习。我们在多种数据集上对最先进的LMM进行基准测试,发现尽管它们的零样本性能低于CLIP,但在提供少量上下文示例的情况下,LMM可以匹配甚至超越基于缓存适配器的对比VLM,其“上下文”等价物。我们将这种分析扩展到开放世界设置,在这种设置中,LMM的生成性质使它们更适合该任务。在这种具有挑战性的场景中,LMM在提供不完美上下文信息时会遇到困难。为了解决这一问题,我们提出了一种简单的无训练方法CIRCLE,该方法为上下文示例分配伪标签,并通过可用的上下文本身逐步优化它们。通过广泛的实验,我们表明CIRCLE为开放世界分类建立了稳健的基础,超越了VLM的对应物,并突显了LMM作为统一分类器和服务于专门模型的灵活替代方案的潜力。
Summary / 总结
This work explores the use of Large Multimodal Models (LMMs) for classification, arguing that their in-context learning capability has been overlooked. In closed-world classification, LMMs have lower zero-shot performance than CLIP but, given a few in-context examples, can match or even surpass contrastive Vision-Language Models (VLMs) equipped with cache-based adapters. In the open-world setting, LMMs struggle when the context information is imperfect; the authors therefore introduce CIRCLE, a training-free method that assigns pseudo-labels to in-context examples and iteratively refines them using the available context, surpassing VLM counterparts.
研究探讨了大型多模态模型(LMMs)在分类任务中的应用,指出其在上下文学习方面的潜力被忽视了。作者在多种数据集上将LMMs与对比视觉-语言模型(VLMs)进行了对比,并发现LMMs在提供少量上下文示例后,可以匹配甚至超越VLMs。研究还扩展到了开放世界设置,LMMs在不完美的上下文信息下表现不佳。为解决这一问题,作者提出了CIRCLE,一种无需训练的方法,通过迭代使用可用的上下文信息来细化伪标签,展示了LMMs作为统一分类器的潜力,并作为专门模型的灵活替代方案。
MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
Authors: Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang
First: 2026-02-26T17:08:08+00:00 · Latest: 2026-02-26T17:08:08+00:00
Comments: 6 pages, CSCWD 2026
Abstract
With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings--precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
中文标题/摘要
标题:MovieTeller:工具增强的电影摘要工具,具有ID一致渐进抽象
随着数字娱乐的爆炸性增长,自动视频摘要已成为内容索引、个性化推荐和高效媒体归档等应用不可或缺的技术。对于长格式视频,如电影和电视剧的自动摘要生成,现有视觉-语言模型(VLMs)面临重大挑战。尽管在单张图像描述方面表现出色,但这些通用模型在长时间段上下文中往往表现出关键性失败,主要是缺乏ID一致的人物识别和叙述连贯性断裂。为克服这些限制,我们提出了一种名为MovieTeller的新框架,用于通过工具增强的渐进抽象生成电影摘要。我们的核心贡献是一种无需训练、工具增强、基于事实的生成过程。我们不需进行昂贵的模型微调,而是直接以即插即用的方式利用现成模型。我们首先调用一个专门的面部识别模型作为外部“工具”,建立事实基础——精确的人物身份及其对应的边界框。这些基础随后被注入提示中,引导VLM的推理,确保生成的场景描述基于可验证的事实。此外,我们的渐进抽象流水线将整部电影的总结分解为多阶段过程,有效缓解了当前VLMs的上下文长度限制。实验表明,与端到端基线相比,我们的方法在事实准确性、人物一致性以及整体叙述连贯性方面取得了显著改进。
Summary / 总结
The research aims to address the challenges of automatic synopsis generation for long-form videos by proposing MovieTeller, a tool-augmented framework that uses a specialized face recognition model to establish factual groundings and a progressive abstraction pipeline to decompose the summarization process. The key experimental findings show that MovieTeller improves factual accuracy, character consistency, and narrative coherence compared to end-to-end baselines.
研究旨在解决使用Vision-Language模型(VLM)为长格式视频生成准确且连贯的电影概要时遇到的挑战。提出的MovieTeller框架采用工具增强的渐进抽象过程来提升事实准确性和叙事连贯性。该框架利用专门的面部识别模型来确定精确的人物身份,并将这些信息注入到VLM提示中,确保生成的概要基于可验证的事实。实验结果表明,MovieTeller在事实准确性、人物一致性以及叙事连贯性方面优于端到端基线方法。
Object-Centric Representation Learning for Enhanced 3D Semantic Scene Graph Prediction
Authors: KunHo Heo, GiHyun Kim, SuYeon Kim, MyeongAh Cho
Venue: NeurIPS 2025
First: 2025-10-06T11:33:09+00:00 · Latest: 2026-02-26T16:03:04+00:00
Comments: Accepted by NeurIPS 2025. Code: https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes
Abstract
3D Semantic Scene Graph Prediction aims to detect objects and their semantic relationships in 3D scenes, and has emerged as a crucial technology for robotics and AR/VR applications. While previous research has addressed dataset limitations and explored various approaches including Open-Vocabulary settings, they frequently fail to optimize the representational capacity of object and relationship features, showing excessive reliance on Graph Neural Networks despite insufficient discriminative capability. In this work, we demonstrate through extensive analysis that the quality of object features plays a critical role in determining overall scene graph accuracy. To address this challenge, we design a highly discriminative object feature encoder and employ a contrastive pretraining strategy that decouples object representation learning from the scene graph prediction. This design not only enhances object classification accuracy but also yields direct improvements in relationship prediction. Notably, when plugging in our pretrained encoder into existing frameworks, we observe substantial performance improvements across all evaluation metrics. Additionally, whereas existing approaches have not fully exploited the integration of relationship information, we effectively combine both geometric and semantic features to achieve superior relationship prediction. Comprehensive experiments on the 3DSSG dataset demonstrate that our approach significantly outperforms previous state-of-the-art methods. Our code is publicly available at https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes.
中文标题/摘要
标题:基于对象中心表示学习的增强3D语义场景图预测
3D语义场景图预测旨在检测3D场景中的对象及其语义关系,并已成为机器人技术和AR/VR应用中的关键技术。尽管先前的研究解决了数据集限制并探索了各种方法,包括开放式词汇设置,但它们经常未能优化对象和关系特征的表示能力,过度依赖图神经网络,尽管其区分能力不足。在本工作中,我们通过广泛分析表明,对象特征的质量对整体场景图准确性起着关键作用。为了解决这一挑战,我们设计了一种高度区分的对象特征编码器,并采用对比预训练策略,将对象表示学习与场景图预测分离。这一设计不仅提高了对象分类准确性,还直接提高了关系预测。值得注意的是,当将我们的预训练编码器插入现有框架时,我们观察到所有评估指标上都取得了显著性能提升。此外,尽管现有方法尚未充分利用关系信息的整合,我们有效结合了几何和语义特征,实现了更优的关系预测。在3DSSG数据集上的全面实验表明,我们的方法显著优于先前的最先进方法。我们的代码可在https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes公开获取。
Summary / 总结
This research improves 3D semantic scene graph prediction by focusing on the quality of object features. The authors design a discriminative object feature encoder, pretrain it contrastively so that object representation learning is decoupled from scene graph prediction, and combine geometric and semantic features for relationship prediction. Plugging the pretrained encoder into existing frameworks improves all evaluation metrics, and experiments on the 3DSSG dataset show that the approach significantly outperforms previous state-of-the-art methods.
该研究旨在通过提高对象特征的质量来改进3D语义场景图预测。作者引入了一种区分性对象特征编码器和对比预训练策略,该策略将对象表示学习与场景图预测分离。这种方法不仅增强了对象分类和关系预测的准确性,还在集成到现有框架时实现了所有评估指标的显著性能提升。全面的实验表明,他们的方法在3DSSG数据集上优于之前的最先进的方法。
Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy
Authors: Matthew Sutton, Katrin Amunts, Timo Dickscheid, Christian Schiffer
First: 2026-02-26T15:10:39+00:00 · Latest: 2026-02-26T15:10:39+00:00
Comments: 8 pages, 3 figures, submitted for inclusion at a conference
Abstract
Foundation models increasingly offer potential to support interactive, agentic workflows that assist researchers during analysis and interpretation of image data. Such workflows often require coupling vision to language to provide a natural-language interface. However, paired image-text data needed to learn this coupling are scarce and difficult to obtain in many research and clinical settings. One such setting is microscopic analysis of cell-body-stained histological human brain sections, which enables the study of cytoarchitecture: cell density and morphology and their laminar and areal organization. Here, we propose a label-mediated method that generates meaningful captions from images by linking images and text only through a label, without requiring curated paired image-text data. Given the label, we automatically mine area descriptions from related literature and use them as synthetic captions reflecting canonical cytoarchitectonic attributes. An existing cytoarchitectonic vision foundation model (CytoNet) is then coupled to a large language model via an image-to-text training objective, enabling microscopy regions to be described in natural language. Across 57 brain areas, the resulting method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas. It matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy and, with the area label masked, its descriptions remain discriminative enough to recover the area in an 8-way test with 68.6% accuracy. These results suggest that weak, label-mediated pairing can suffice to connect existing biomedical vision foundation models to language, providing a practical recipe for integrating natural-language in domains where fine-grained paired annotations are scarce.
中文标题/摘要
标题:细胞架构的语言表达:弱监督视觉-语言模型在人类大脑显微镜分析中的应用
基础模型越来越多地提供支持研究人员在图像数据分析和解释过程中进行互动、自主工作流程的潜力。此类工作流程通常需要将视觉与语言结合以提供自然语言界面。然而,在许多研究和临床环境中,用于学习这种结合的成对图像-文本数据稀缺且难以获取。其中一个环境是细胞体染色的人类大脑组织切片的显微镜分析,这使我们能够研究细胞架构:细胞密度和形态及其层状和区域组织。在此,我们提出了一种标签介导的方法,仅通过标签将图像和文本链接起来以生成有意义的描述,而无需使用经过精心策划的成对图像-文本数据。给定标签,我们自动从相关文献中挖掘区域描述,并使用它们作为反映经典细胞架构属性的合成描述。然后,通过图像到文本的训练目标将现有的细胞架构视觉基础模型(CytoNet)与大型语言模型耦合,使显微镜区域能够用自然语言描述。在57个脑区中,该方法生成了合理的区域级描述,并通过明确拒绝未见过的区域支持开放集使用。对于范围内的图像块,其与细胞架构参考标签的匹配准确率为90.6%;在掩码区域标签的情况下,其描述仍具有足够的区分性,能够在8类测试中以68.6%的准确率恢复区域。这些结果表明,弱的、标签介导的配对足以将现有的生物医学视觉基础模型与语言连接起来,为在细粒度成对注释稀缺的领域中集成自然语言提供了一种实用的配方。
Summary / 总结
This study proposes a label-mediated method for generating meaningful captions from microscopic images of human brain sections without requiring curated paired image-text data. Area descriptions mined automatically from the literature serve as synthetic captions for coupling an existing cytoarchitectonic vision foundation model (CytoNet) to a large language model. Across 57 brain areas, the method produces plausible area-level descriptions, matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy, and, with the area label masked, recovers the area in an 8-way test with 68.6% accuracy.
研究旨在开发一种弱监督的视觉-语言模型,以描述显微镜下的脑图像,解决配对图像-文本数据稀缺的问题。方法仅通过标签将图像和文本链接起来,从文献中自动提取区域描述作为合成描述,并将现有的细胞架构视觉基础模型(CytoNet)与大型语言模型耦合。在57个脑区中,该模型对范围内图像块的参考标签匹配准确率为90.6%,而在掩码区域标签的8类测试中准确率为68.6%。
Inducing Dyslexia in Vision Language Models
Authors: Melika Honarmand, Ayati Sharma, Badr AlKhamissi, Johannes Mehrer, Martin Schrimpf
First: 2025-09-29T11:03:16+00:00 · Latest: 2026-02-26T15:04:01+00:00
Abstract
Dyslexia, a neurodevelopmental disorder characterized by persistent reading difficulties, is often linked to reduced activity of the visual word form area (VWFA) in the ventral occipito-temporal cortex. Traditional approaches to studying dyslexia, such as behavioral and neuroimaging methods, have provided valuable insights but remain limited in their ability to test causal hypotheses about the underlying mechanisms of reading impairments. In this study, we use large-scale vision-language models (VLMs) to simulate dyslexia by functionally identifying and perturbing artificial analogues of word processing. Using stimuli from cognitive neuroscience, we identify visual-word-form-selective units within VLMs and demonstrate that they predict human VWFA neural responses. Ablating model VWF units leads to selective impairments in reading tasks while general visual and language comprehension abilities remain intact. In particular, the resulting model matches dyslexic humans' phonological deficits without a significant change in orthographic processing, and mirrors dyslexic behavior in font sensitivity. Taken together, our modeling results replicate key characteristics of dyslexia and establish a computational framework for investigating brain disorders.
中文标题/摘要
标题:在视觉语言模型中诱发阅读障碍
阅读障碍是一种神经发育障碍,表现为持续的阅读困难,通常与腹侧枕颞皮层中的视觉单词形式区(VWFA)活动减少有关。传统上,通过行为和神经影像学方法研究阅读障碍虽然提供了宝贵见解,但在测试阅读障碍潜在机制的因果假设方面仍有限制。本研究使用大规模视觉-语言模型(VLMs)通过功能上识别和扰动单词处理的人工模拟来模拟阅读障碍。使用认知神经科学的刺激,我们识别出VLMs中的视觉单词形式选择性单元,并证明它们可以预测人类VWFA神经反应。删除模型中的VWF单元会导致阅读任务中的选择性障碍,而一般视觉和语言理解能力保持不变。特别是,该模型表现出与阅读障碍患者相似的音韵缺陷,而书写形式处理没有显著变化,并且在字体敏感性方面反映了阅读障碍的行为特征。综上所述,我们的建模结果复制了阅读障碍的关键特征,并建立了一个研究大脑疾病的计算框架。
Summary / 总结
This study investigates dyslexia using large-scale vision-language models by identifying and perturbing units that simulate word processing. The research demonstrates that ablation of these units leads to selective reading impairments without affecting general visual and language comprehension. The model replicates key characteristics of dyslexia, such as phonological deficits and font sensitivity, and establishes a computational framework for studying brain disorders.
本研究旨在通过识别并扰动视觉-语言模型中的视觉单词形式选择性单元来模拟阅读障碍,这些单元类似于大脑中的VWFA。研究显示,移除这些单元会导致特定的阅读障碍,而不影响一般的视觉和语言理解能力。该模型重现了阅读障碍的关键特征,如音位缺陷和字体敏感性,为研究大脑疾病提供了计算框架。
WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
Authors: Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang
First: 2026-02-26T14:11:10+00:00 · Latest: 2026-02-26T14:11:10+00:00
Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.
中文标题/摘要
标题:WISER:更广泛的搜索、更深的思考和自适应融合的无训练零样本组合图像检索
零样本组合图像检索(ZS-CIR)旨在根据包含参考图像和修改文本的多模态查询检索目标图像,无需使用标注三元组进行训练。现有方法通常将多模态查询转换为单一模态——要么作为文本到图像检索(T2I)中的编辑标题,要么作为图像到图像检索(I2I)中的编辑图像。然而,每种范式都有其固有的局限性:T2I往往丢失了细粒度的视觉细节,而I2I则难以处理复杂的语义修改。为了在各种查询意图下有效利用它们的互补优势,我们提出了一种无训练框架WISER,通过“检索-验证-精炼”管道统一T2I和I2I,明确建模意图意识和不确定性意识。具体而言,WISER首先通过生成编辑后的标题和图像进行并行检索,以扩大候选池,进行更广泛的搜索。然后,它通过验证器进行自适应融合,评估检索置信度,对不确定的检索结果触发精炼,并动态融合双路径以获得可靠的检索结果。对于不确定的检索结果,WISER通过结构化的自我反思生成精炼建议,以指导下一轮检索朝着更深的思考进行。广泛的实验表明,WISER在多个基准测试中显著优于先前的方法,在CIRCO(mAP@5)上相对提高了45%,在CIRR(Recall@1)上相对提高了57%。值得注意的是,它甚至超越了许多依赖训练的方法,突显了其在各种场景下的优越性和泛化能力。代码将在https://github.com/Physicsmile/WISER上发布。
Summary / 总结
WISER is a training-free framework for Zero-Shot Composed Image Retrieval that unifies Text-to-Image and Image-to-Image retrieval through a 'retrieve-verify-refine' pipeline. It generates both edited captions and edited images for parallel retrieval, uses a verifier to assess retrieval confidence, fuses the two paths when the retrieval is reliable, and refines uncertain results through structured self-reflection. Experiments show relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods.
WISER 是一个无需训练的零样本组合图像检索框架,结合了文本到图像(T2I)和图像到图像(I2I)检索方法。它同时生成编辑后的文本和图像进行并行检索,以扩大候选池,通过自适应融合评估检索置信度,并对不确定的检索结果进行细化。实验表明,WISER 在多个基准测试中优于先前的方法,分别在 CIRCO(mAP@5)和 CIRR(Recall@1)上实现了 45% 和 57% 的相对改进。
SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
Authors: Camile Lendering, Erkut Akdag, Egor Bondarev
Venue: CVPR 2026
First: 2026-02-26T13:52:57+00:00 · Latest: 2026-02-26T13:52:57+00:00
Comments: Accepted to CVPR 2026
Abstract
Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method, that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 98.0% and 97.6% on the MVTec-AD dataset, and 93.3% and 98.3% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at https://github.com/CLendering/SubspaceAD.
中文标题/摘要
标题:SubspaceAD:基于子空间建模的无需训练少样本异常检测
在工业检测中检测视觉异常通常需要仅用每类别少量的正常图像进行训练。最近的少量样本方法通过基础模型特征取得了很好的结果,但通常依赖于记忆库、辅助数据集或视觉语言模型的多模态调优。因此,我们质疑在视觉基础模型特征表示下是否有必要如此复杂。为回答这个问题,我们引入了SubspaceAD,一种无需训练的方法,分为两个简单的阶段。首先,通过冻结的DINOv2主干从少量正常图像中提取补丁级别的特征。其次,使用主成分分析(PCA)模型拟合这些特征以估计正常变化的低维子空间。在推理时,通过相对于该子空间的重构残差检测异常,生成可解释且统计上可靠的异常评分。尽管简单,SubspaceAD在无需训练、提示调优或记忆库的情况下,在单次样本和少量样本设置中均取得了最先进的性能。在单次样本异常检测设置中,SubspaceAD在MVTec-AD数据集上实现了图像级和像素级的AUROC分别为98.0%和97.6%,在VisA数据集上分别为93.3%和98.3%,超越了先前的最先进的结果。代码和演示可在https://github.com/CLendering/SubspaceAD获取。
Summary / 总结
SubspaceAD is a training-free few-shot anomaly detection method that uses a simple two-stage process. First, it extracts patch-level features from a small set of normal images using a frozen DINOv2 backbone. Second, it fits a Principal Component Analysis (PCA) model to these features to estimate the normal variations' low-dimensional subspace. During inference, anomalies are detected by measuring the reconstruction residual with respect to this subspace, resulting in interpretable and statistically grounded anomaly scores. SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without requiring training, prompt tuning, or memory banks, surpassing prior results on the MVTec-AD and VisA datasets.
SubspaceAD 是一种无需训练的少样本异常检测方法,采用两阶段简单过程。首先,使用冻结的 DINOv2 主干从少量正常图像中提取 patch 级别特征。其次,使用这些特征拟合主成分分析(PCA)模型来估计正常变化的低维子空间。在推理阶段,通过与该子空间的重构残差检测异常,生成可解释且统计上可靠的异常评分。SubspaceAD 在少样本设置中达到了最先进的性能,无需训练、提示调优或记忆库,超越了 MVTec-AD 和 VisA 数据集上的先前最佳结果。
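The two-stage procedure summarized above (fit PCA to features of normal images, then score by reconstruction residual) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: synthetic vectors stand in for DINOv2 patch features, and the function names `fit_subspace` and `anomaly_scores`, the feature dimension, and the number of components `k` are all assumptions made for this example.

```python
import numpy as np

def fit_subspace(normal_feats, k):
    """Fit a k-dimensional PCA subspace to normal patch features (N x D)."""
    mean = normal_feats.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(normal_feats - mean, full_matrices=False)
    return mean, vt[:k]

def anomaly_scores(feats, mean, basis):
    """Score each patch by its squared reconstruction residual."""
    centered = feats - mean
    recon = centered @ basis.T @ basis  # projection onto the normal subspace
    return np.sum((centered - recon) ** 2, axis=1)

rng = np.random.default_rng(0)
# Hypothetical stand-in for patch features: normal data varies only
# along 3 orthonormal directions of a 16-D feature space.
directions = np.linalg.qr(rng.normal(size=(16, 16)))[0]
normal = rng.normal(size=(200, 3)) @ directions[:3]
mean, basis = fit_subspace(normal, k=3)

in_subspace = rng.normal(size=(5, 3)) @ directions[:3]
off_subspace = in_subspace + 2.0 * directions[10]  # shifted along an unseen direction
normal_scores = anomaly_scores(in_subspace, mean, basis)
anomalous_scores = anomaly_scores(off_subspace, mean, basis)
```

In this toy setup, patches that stay inside the fitted subspace score close to zero, while the shifted patches score the squared length of the shift. Choosing `k` and aggregating patch scores into image-level scores are left out of the sketch.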
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
Authors: Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu, Xin Jiang, Xinghao Chen
First: 2025-12-02T12:30:05+00:00 · Latest: 2026-02-26T13:16:26+00:00
Comments: Accepted by CVPR2026
Abstract
Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while delivering an end-to-end inference speedup. The code is available at https://github.com/Casey-bit/VLMPruner.
中文标题/摘要
标题:VLM-Pruner:高效VLM离心式令牌剪枝范式中的空间稀疏性缓冲
视觉语言模型(VLMs)在图像理解任务中表现出色,但大量的视觉令牌导致了显著的计算成本,阻碍了其在移动设备上的部署。许多剪枝方法仅依赖于令牌的重要性,从而忽略了令牌间的冗余性,保留了大量重复的令牌,浪费了容量。尽管提出了一些具有冗余意识的方法,但它们往往忽略了视觉令牌之间的空间关系。这可能导致保留的令牌过于稀疏,无法充分覆盖目标对象的区域。为了解决这些局限性,我们提出了一种无需训练的VLM-Pruner令牌剪枝算法,明确平衡冗余性和空间稀疏性。我们引入了一种离心式令牌剪枝范式,能够在优先保留细粒度对象细节的同时,实现从近到远的选择。此外,我们设计了一种空间稀疏性缓冲(BSS)准则,推迟选择空间上距离较远的令牌。我们还采用了一种并行贪婪策略,以高效地进行令牌选择。为了减轻剪枝带来的信息损失,我们有选择地将被丢弃的令牌中的重要信息融合到保留的令牌中。全面的比较表明,VLM-Pruner在五个VLM中以88.9%的剪枝率持续优于强大的基线模型,同时实现了端到端的推理加速。代码可在https://github.com/Casey-bit/VLMPruner获取。
Summary / 总结
VLM-Pruner is a training-free token pruning algorithm that reduces the computational cost of vision-language models (VLMs) by balancing redundancy and spatial sparsity. It introduces a centrifugal (near-to-far) token pruning paradigm and a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens, and it fuses salient information from discarded tokens into the retained ones. Experimental results show that VLM-Pruner consistently outperforms strong baselines across five VLMs at an 88.9% pruning rate and provides an end-to-end inference speedup.
VLM-Pruner 是一种无需训练的 token 剪枝算法,旨在高效降低视觉语言模型的计算成本同时保留空间细节。它引入了离心 token 剪枝范式和空间稀疏性缓冲准则来平衡冗余和空间稀疏性,并采用并行贪婪策略进行高效的 token 选择。实验结果表明,VLM-Pruner 在 88.9% 的剪枝率下优于强基线,并实现了端到端的推理加速。
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Authors: Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng, Ismail Elezi
First: 2026-02-26T11:08:39+00:00 · Latest: 2026-02-26T11:08:39+00:00
Abstract
Reasoning with large language models often benefits from generating multiple chains-of-thought, but existing aggregation strategies are typically trajectory-level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or "nearly correct" attempts. We propose Stitching Noisy Diffusion Thoughts, a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. Given a problem, we (i) sample many diverse, low-cost reasoning trajectories using a masked diffusion language model, (ii) score every intermediate step with an off-the-shelf process reward model (PRM), and (iii) stitch these highest-quality steps across trajectories into a composite rationale. This rationale then conditions an autoregressive (AR) model (solver) to recompute only the final answer. This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search. Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate answers. Using low-confidence diffusion sampling with parallel, independent rollouts, our training-free framework improves average accuracy by up to 23.8% across six math and coding tasks. At the same time, it achieves up to a 1.8x latency reduction relative to both traditional diffusion models (e.g., Dream, LLaDA) and unified architectures (e.g., TiDAR). Code is available at https://github.com/roymiles/diffusion-stitching.
中文标题/摘要
标题:通过奖励引导拼接实现扩散语言模型的测试时缩放
使用大型语言模型进行推理通常可以从生成多个链式思考中受益,但现有的聚合策略通常是轨迹级别的(例如,选择最佳轨迹或对最终答案进行投票),会丢弃来自部分或“几乎正确”尝试的有用中间工作。我们提出了一种名为Noisy Diffusion Thoughts拼接的自一致性框架,将廉价采样的推理转换为可重复使用的步骤级候选池。给定一个问题,我们(i)使用掩码扩散语言模型采样许多多样且低成本的推理轨迹,(ii)使用现成的过程奖励模型(PRM)评分每个中间步骤,(iii)将这些最高质量的步骤跨轨迹拼接成一个综合的推理。然后,这种综合的推理条件一个自回归(AR)模型(求解器)仅重新计算最终答案。这种模块化管道将探索(扩散)与评估和解决方案合成分离,避免了单一统一的混合体,同时保留了广泛的搜索。在数学推理基准测试中,我们发现步骤级重组在更难的问题上最有益,消融实验强调了最终AR求解器在将拼接但不完美的推理转化为准确答案中的重要性。使用低置信度扩散采样和并行独立的展开,我们的无需训练框架在六个数学和编程任务中将平均准确性提高了最多23.8%。同时,它相对于传统扩散模型(例如Dream,LLaDA)和统一架构(例如TiDAR)实现了最多1.8倍的延迟减少。代码可在https://github.com/roymiles/diffusion-stitching/ 获取。
Summary / 总结
The paper introduces Stitching Noisy Diffusion Thoughts, a training-free self-consistency framework for reasoning with language models. It samples many diverse, low-cost reasoning trajectories with a masked diffusion language model, scores every intermediate step with an off-the-shelf process reward model, and stitches the highest-quality steps across trajectories into a composite rationale. This rationale conditions an autoregressive solver that recomputes only the final answer. The approach improves average accuracy by up to 23.8% across six math and coding tasks, with up to a 1.8x latency reduction relative to traditional diffusion models and unified architectures.
该论文提出了一种名为Stitching Noisy Diffusion Thoughts的方法,以增强大型语言模型的推理能力。该方法包括使用掩码扩散语言模型生成多个低成本的推理轨迹,使用过程奖励模型对每个中间步骤进行评分,然后将最高质量的步骤缝合到一个综合的推理中。该综合推理用于条件化自回归模型以计算最终答案。该方法在数学和编码任务上的准确率最多可提高23.8%,同时与传统的扩散模型和统一架构相比,将延迟最多减少1.8倍。
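The step-level stitching described above can be illustrated with a small pure-Python sketch. The abstract does not spell out the stitching rule, so this example assumes the simplest one: at each step position, keep the highest-scoring step found across all sampled trajectories. The function name `stitch`, the toy score table standing in for a process reward model, and the example steps are hypothetical.

```python
def stitch(trajectories, score_step):
    """Pick, at each step index, the highest-scoring step across trajectories."""
    num_steps = max(len(t) for t in trajectories)
    rationale = []
    for i in range(num_steps):
        # Only trajectories long enough to have a step at position i compete.
        candidates = [t[i] for t in trajectories if i < len(t)]
        rationale.append(max(candidates, key=score_step))
    return rationale

# Toy stand-in for a process reward model: a lookup table of step scores.
toy_scores = {
    "let x = 3": 0.9, "let x = 4": 0.2,
    "then 2x = 6": 0.8, "then 2x = 7": 0.1,
    "so 2x + 1 = 7": 0.7,
}
trajectories = [
    ["let x = 3", "then 2x = 7"],
    ["let x = 4", "then 2x = 6", "so 2x + 1 = 7"],
]
composite = stitch(trajectories, toy_scores.get)
```

Here neither sampled trajectory is fully correct, yet the composite rationale combines the good first step of one with the good later steps of the other. In the described pipeline, a composite like this would then be passed to an autoregressive solver that recomputes the final answer; that stage is omitted here.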
TrajTok: Learning Trajectory Tokens enables better Video Understanding
Authors: Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna
Venue: CVPR 2026
First: 2026-02-26T09:15:34+00:00 · Latest: 2026-02-26T09:15:34+00:00
Comments: CVPR 2026
Abstract
Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.
中文标题/摘要
标题:TrajTok:学习轨迹标记使视频理解更优秀
视频模型中的标记化通常通过分块生成大量的标记,这导致了冗余和效率低下。这严重限制了视频的效率和可扩展性。虽然基于轨迹的标记器通过解耦视频时长和标记数量提供了有希望的解决方案,但它们依赖于复杂的外部分割和跟踪管道,这些管道既慢又不针对特定任务。我们提出了一种端到端的视频标记模块TrajTok,该模块完全集成并与视频模型联合训练,以适应下游目标,动态调整标记粒度以适应语义复杂性,而不依赖于视频时长。TrajTok包含一个统一的分割器,该分割器在空间和时间上对像素进行隐式聚类,直接在单次前向传播中生成对象轨迹。通过优先考虑下游适应性而非像素级分割精度,TrajTok轻量且高效,但实验证明其能提高视频理解性能。借助TrajTok,我们实现了一个从零开始训练的视频CLIP模型(TrajViT2),它在分类和检索基准测试中都达到了最佳的准确性,同时保持了与最佳标记合并方法相当的效率。TrajTok还证明了其作为标记器之外的多功能性。我们展示了它可以无缝集成为预训练视觉特征的探针头(TrajAdapter)或视觉-语言模型中的对齐连接器(TrajVLM),特别是在长视频推理方面表现出色。
Summary / 总结
TrajTok is an end-to-end video tokenizer, co-trained with the video model, that adapts its token granularity to semantic complexity rather than video duration. Its unified segmenter implicitly clusters pixels in space and time and produces object trajectories in a single forward pass. A video CLIP model trained from scratch with TrajTok (TrajViT2) achieves the best accuracy at scale on classification and retrieval benchmarks while keeping efficiency comparable to the best token-merging methods. Beyond tokenization, TrajTok can serve as a probing head for pretrained visual features (TrajAdapter) or as an alignment connector in vision-language models (TrajVLM), with especially strong performance in long-video reasoning.
TrajTok 是一种端到端的视频分词器,能够根据语义复杂性动态调整分词粒度,提高视频理解的效率和性能。与之前的轨迹分词器不同,TrajTok 集成了一个统一的分割器,能够在单次前向传播中生成物体轨迹,使其轻量且高效。TrajTok 使基于视频的 CLIP 模型(TrajViT2)能够超越现有方法,在分类和检索基准测试中表现出色,同时保持与最佳分词合并方法相当的效率。TrajTok 还展示了其在增强预训练视觉特征和视觉语言模型方面的多功能性。
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-02-26T08:46:23+00:00
Comments: Fixed results in Table 7
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当前最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs的合成数据,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。至关重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是像素跟踪。即使是私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLMs家族,是开源模型中的最先进的,并展示了在单图像、多图像和视频任务中出色的基于指针的定位能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、自由形式的视频问答数据集、新的具有复杂查询的对象跟踪数据集以及创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的数据训练配方,并展示了在视觉标记上进行双向注意以及一种新的标记权重策略可以提高性能。我们的最佳8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型,并在长视频方面具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率为35.5 vs 29.6)并在某些任务上超越了私有模型如Gemini 3 Pro(视频指针F1得分为38.4 vs 20.0,视频跟踪J&F得分为56.2 vs 41.1)。
Summary / 总结
The paper introduces Molmo2, a new family of vision-language models that is state-of-the-art among open-source models and adds point-driven grounding for single-image, multi-image, and video tasks. The authors release 7 new video datasets and 2 multi-image datasets, covering detailed video captions, free-form video Q&A, object tracking with complex queries, and video pointing, all collected without closed VLMs. The training recipe uses efficient packing and message-tree encoding, with bi-directional attention on vision tokens and a novel token-weight strategy. On video grounding, Molmo2 outperforms existing open-weight models such as Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing, 56.2 vs 41.1 J&F on video tracking).
论文介绍了Molmo2,这是一种新的开源视觉-语言模型,是开源模型中的佼佼者,并展示了出色的定位能力。作者提供了9个新数据集,包括视频字幕、问答、物体跟踪和指针等,所有数据集均未使用封闭模型收集。模型使用高效的训练配方,带有新颖的token权重策略和双向注意力,实现了在各种任务上的优越性能,特别是在视频定位方面,Molmo2超越了现有的开源模型,甚至一些封闭模型。
ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control
Authors: Akihisa Watanabe, Qing Yu, Edgar Simo-Serra, Kent Fujiwara
First: 2026-02-26T08:29:25+00:00 · Latest: 2026-02-26T08:29:25+00:00
Abstract
Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on representative applications, motion inpainting, and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.
中文标题/摘要
标题:ProjFlow: 基于流匹配的投影采样方法实现零样本精确空间运动控制
生成具有精确空间控制的人类运动是一个具有挑战性的问题。现有方法通常需要特定任务的训练或缓慢的优化,并且施加硬约束经常破坏运动的自然性。基于许多动画任务可以表述为线性逆问题的观察,我们引入了ProjFlow,这是一种无需训练的采样器,能够在不破坏运动真实性的前提下实现零样本、精确满足线性空间约束。我们的主要进展是一种新颖的运动学感知度量,它编码了骨骼拓扑结构。这种度量允许采样器通过在整个骨架上一致地分配修正来施加硬约束,从而避免了简单投影的不自然伪影。此外,对于稀疏输入,例如填补几帧之间较长的空白,我们引入了一种时间变化的公式,使用在采样过程中逐渐淡出的伪观测值。在代表性的应用、运动填补和2D到3D提升的广泛实验中,证明了ProjFlow实现了精确的约束满足,并且在零样本基线的基础上匹配或提高了真实感,同时保持了与基于训练的控制器的竞争力。
Summary / 总结
The research aims to generate human motion with precise spatial control without task-specific training or slow optimization. ProjFlow, a training-free sampler, achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Its key component is a kinematics-aware metric that encodes skeletal topology, so corrections are distributed coherently across the entire skeleton rather than producing the artifacts of naive projection. For sparse inputs, such as long gaps between a few keyframes, ProjFlow uses a time-varying formulation with pseudo-observations that fade during sampling. Experiments on motion inpainting and 2D-to-3D lifting show exact constraint satisfaction, realism that matches or improves on zero-shot baselines, and results competitive with training-based controllers.
研究旨在通过精确的空间控制生成人类动作,无需特定任务的训练或缓慢的优化。引入了ProjFlow,这是一种无需训练的采样器,能够在不损害动作真实性的前提下实现零样本、精确的空间约束满足。关键在于一种基于运动学的度量,确保在整个骨骼上一致地应用修正,避免不自然的伪影。对于稀疏输入,如填补关键帧之间的长空白,ProjFlow使用时间变化的公式和伪观测值进行采样。实验表明,ProjFlow在精确约束满足方面表现出色,并且在保持或提高真实感方面优于零样本基线,同时与基于训练的控制器具有竞争力。
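To see why the choice of metric matters when enforcing a hard linear constraint, consider the standard metric-weighted projection onto the set A x = b: the closest point to x under a symmetric positive-definite metric M is x - M^-1 A^T (A M^-1 A^T)^-1 (A x - b). The sketch below applies this textbook formula to a three-coordinate toy chain. It is not ProjFlow's sampler, and the matrix `coupled` is a hypothetical stand-in for the paper's kinematics-aware metric, whose actual construction the abstract does not give.

```python
import numpy as np

def project(x, A, b, M):
    """Closest point to x (in the M-norm) that satisfies A @ x = b."""
    Minv_At = np.linalg.solve(M, A.T)
    correction = Minv_At @ np.linalg.solve(A @ Minv_At, A @ x - b)
    return x - correction

x = np.array([0.0, 0.0, 0.0])    # three toy coordinates along a chain
A = np.array([[0.0, 0.0, 1.0]])  # constrain the last coordinate...
b = np.array([1.0])              # ...to equal 1
identity = np.eye(3)
# Hypothetical metric that couples neighbouring coordinates (chain 0-1-2).
coupled = np.array([[2.0, -1.0, 0.0],
                    [-1.0, 2.0, -1.0],
                    [0.0, -1.0, 1.1]])

naive = project(x, A, b, identity)    # moves only the constrained coordinate
coherent = project(x, A, b, coupled)  # spreads the correction along the chain
```

Both results satisfy the constraint exactly, but the identity metric changes only the constrained coordinate, whereas the coupled metric moves its neighbours part of the way too. That is the qualitative behaviour the summary attributes to distributing corrections across the skeleton.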
HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
Authors: Yangguang Lin, Quan Fang, Yufei Li, Jiachen Sun, Junyu Gao, Jitao Sang
Venue: CVPR 2026
First: 2026-02-26T08:08:25+00:00 · Latest: 2026-02-26T08:08:25+00:00
Comments: accepted at CVPR 2026
Abstract
Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces - visual evidence, conflicting priors, and residual uncertainty - enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs.
中文标题/摘要
标题:HulluEdit:单次通过的一致性子空间编辑以减轻大型视觉-语言模型中的幻觉
大型视觉-语言模型(LVLMs)中的对象幻觉严重阻碍了其可靠部署。现有方法难以在效率和准确性之间取得平衡:它们通常需要昂贵的参考模型和多次前向传递,或者应用静态编辑,这可能会抑制真实的视觉证据。为了解决这个问题,我们引入了HulluEdit,这是一种单次通过、无需参考模型的干预框架。我们的核心创新是正交子空间编辑:我们将模型的隐藏状态分解为正交子空间——视觉证据、冲突的先验和残余不确定性,从而能够选择性地抑制幻觉模式而不干扰视觉定位。这种方法从数学上保证了对先验子空间的编辑不会影响视觉部分。大量实验表明,HulluEdit在POPE和CHAIR等基准测试中实现了最先进的幻觉减少效果,同时在MME上保持了通用能力,并且保持了高效的推理。我们的方法在对比解码和静态子空间编辑基线中表现更优,为更可信的LVLMs开辟了一条新途径。
Summary / 总结
HulluEdit is a single-pass, reference-free framework for mitigating object hallucinations in large vision-language models. It decomposes the model's hidden states into orthogonal subspaces (visual evidence, conflicting priors, and residual uncertainty) so that hallucinatory patterns can be suppressed without affecting visual grounding. HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR, outperforming contrastive decoding and static subspace editing baselines, while preserving general capabilities on MME and efficient inference.
HulluEdit 是一种单次通过、无需参考模型的框架,旨在通过将隐藏状态分解为正交子空间来减轻大型视觉-语言模型中的幻觉。这种方法选择性地抑制幻觉模式而不影响视觉定位,确保对先验子空间的编辑不会干扰视觉部分。HulluEdit 在幻觉减少基准测试中表现出色,同时保持一般能力和高效推理。
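The guarantee that edits to the prior subspace leave the visual component untouched follows directly from orthogonality, which a short NumPy check can make concrete. This is only an illustration of that linear-algebra fact: a random orthonormal basis stands in for the learned subspaces, and the function `suppress_prior`, the dimensions, and the `strength` parameter are assumptions made for this example. How HulluEdit actually identifies the subspaces is not described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
# Orthonormal basis of an 8-D hidden space, split into orthogonal blocks.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
visual, prior = q[:, :3], q[:, 3:5]  # remaining columns: residual subspace

def suppress_prior(h, prior_basis, strength):
    """Scale down the component of h that lies in the prior subspace."""
    prior_part = prior_basis @ (prior_basis.T @ h)
    return h - strength * prior_part

h = rng.normal(size=8)  # stand-in for one hidden state
edited = suppress_prior(h, prior, strength=0.8)

visual_before = visual.T @ h
visual_after = visual.T @ edited
```

Because the columns of `visual` are orthogonal to those of `prior`, the visual coordinates of the hidden state are identical before and after the edit, while the prior coordinates shrink by the chosen factor.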
SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs
Authors: Guanting Ye, Qiyan Zhao, Wenhao Yu, Liangyu Yuan, Mingkai Li, Xiaofeng Zhang, Jianmin Ji, Yanyong Zhang, Qing Jiang, Ka-Veng Yuen
Venue: CVPR 2026
First: 2026-02-26T07:42:15+00:00 · Latest: 2026-02-26T07:42:15+00:00
Comments: CVPR 2026
Abstract
3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model's ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.
中文标题/摘要
标题:SoPE:基于球坐标的位置嵌入以增强3D LVLM的空间感知
基于大型语言模型(LLMs)构建的3D大型视觉-语言模型(3D LVLMs)在各种多模态任务中取得了显著进展。然而,它们继承的位置依赖性建模机制,旋转位置嵌入(RoPE),对于3D多模态理解仍然不够优化。传统的RoPE公式在编码3D标记时无法保留关键的三维空间结构,并且其相对距离计算忽略了角度依赖性,阻碍了模型捕捉视觉表示中的方向变化。为克服这些限制,我们引入了基于球坐标的位置嵌入(SoPE)。我们的方法将点云标记索引映射到3D球坐标空间,从而统一建模空间位置和方向角度。这种表示形式保留了点云数据的固有几何结构,增强了空间意识,并为多模态学习提供了更一致和表达性的几何表示。此外,我们引入了一种多尺度频率混合策略,以在不同频率域中融合特征信息。在多个3D场景基准上的实验结果验证了我们方法的有效性,而实际部署实验进一步证明了其强大的泛化能力。
Summary / 总结
The research aims to improve the spatial perception of 3D Large Vision-Language Models (3D LVLMs) by addressing the limitations of Rotary Position Embedding (RoPE). The proposed Spherical Coordinate-based Positional Embedding (SoPE) maps 3D token indices into a 3D spherical coordinate space, preserving spatial structures and directional angles. This method enhances spatial awareness and provides more consistent geometric representations. Additionally, a multi-scale frequency mixing strategy is introduced to fuse feature information across different frequency domains. Experiments on multiple 3D scene benchmarks show the effectiveness of SoPE, and real-world deployment experiments demonstrate its strong generalization capability.
研究旨在通过改进3D大型视觉语言模型(3D LVLM)的空间感知能力,解决旋转位置嵌入(RoPE)的局限性。提出的球坐标位置嵌入(SoPE)方法将3D令牌索引映射到3D球坐标空间,保留了空间结构和方向角度。这种方法增强了模型对3D场景的理解能力,并提供了更一致和表达力更强的几何表示。多组3D场景基准实验验证了SoPE的有效性,而实际部署实验进一步展示了其强大的泛化能力。
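To make the spherical-coordinate idea in the SoPE abstract concrete, here is a minimal sketch of the standard Cartesian-to-spherical conversion. It is an illustration under assumptions: the abstract does not specify how token indices are mapped to coordinates or which angle convention is used, so the function name `to_spherical` and the physics convention (polar angle from the +z axis) are choices made here, not the paper's.

```python
import math
from typing import Tuple


def to_spherical(x: float, y: float, z: float) -> Tuple[float, float, float]:
    """Return (r, theta, phi): radius, polar angle from +z in [0, pi], azimuth in (-pi, pi]."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # Angles are undefined at the origin; return zeros by convention.
        return (0.0, 0.0, 0.0)
    theta = math.acos(z / r)
    phi = math.atan2(y, x)
    return (r, theta, phi)


# Two points at the same distance from the origin but in different directions:
# a purely distance-based encoding cannot tell them apart, the angles can.
p_east = to_spherical(1.0, 0.0, 0.0)
p_north = to_spherical(0.0, 1.0, 0.0)
```

Both points have radius 1 and the same polar angle, but their azimuths differ by a quarter turn, which is the kind of directional information the abstract says a relative-distance computation overlooks.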
Visual Instruction Pretraining for Domain-Specific Foundation Models
Authors: Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang
First: 2025-09-22T10:57:42+00:00 · Latest: 2026-02-26T07:40:53+00:00
Abstract
Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.
中文标题/摘要
标题:领域特定基础模型的视觉指令预训练
现代计算机视觉正在形成一个闭环,在这个闭环中,感知、推理和生成相互增强。然而,这个闭环仍然不完整:高层推理对低层感知特征基础学习的自上而下的影响尚未得到充分探索。本文通过提出一种新的预训练范式来解决这一缺口,以在下游领域预训练基础模型。我们引入了视觉指令预训练(ViTP),这是一种新颖的方法,可以直接利用推理来增强感知。ViTP 将视觉变换器(ViT)主干嵌入到视觉语言模型中,并使用从目标下游领域收集的丰富视觉指令数据集对其进行端到端预训练。ViTP 由我们提出的视觉鲁棒性学习(VRL)驱动,促使 ViT 从稀疏的视觉标记中学习稳健且领域相关的特征。在 16 个具有挑战性的遥感和医学成像基准测试上的广泛实验表明,ViTP 在多种下游任务中建立了新的最佳性能。代码可在 https://github.com/zcablii/ViTP 获取。
Summary / 总结
This paper aims to enhance the foundational learning of low-level perceptual features by incorporating high-level reasoning through a new pretraining paradigm called Visual insTruction Pretraining (ViTP). ViTP uses a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end with visual instruction data from target domains. The key finding is that ViTP achieves new state-of-the-art performance on 16 challenging benchmarks in remote sensing and medical imaging tasks, demonstrating its effectiveness in learning robust and domain-relevant features. The code is available at https://github.com/zcablii/ViTP.
本文旨在通过引入新的预训练范式Visual insTruction Pretraining (ViTP),将高层推理融入低级感知特征的基础学习中。ViTP 使用嵌入在 Vision-Language 模型中的 Vision Transformer (ViT),并使用目标域的视觉指令数据进行端到端预训练。实验结果显示,ViTP 在 16 个不同的遥感和医学成像基准测试中表现出色,超越了现有方法,并在多种下游任务中建立了新的最佳性能。
Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning
Authors: Hao Yu, Shuning Jia, Guanghao Li, Wenhao Jiang, Chun Yuan
First: 2026-02-26T07:28:04+00:00 · Latest: 2026-02-26T07:28:04+00:00
Abstract
Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language (DSL) representations, along with an efficient automatic data generation pipeline. This design enables the isolated evaluation of geometric perception independently from reasoning. To exploit the data provided by GeoPerceive for enhancing the geometric perception capabilities of VLMs, we propose GeoDPO, a translator-guided reinforcement learning (RL) framework. GeoDPO employs an NL-to-DSL translator, which is trained on synthetic pairs generated by the data engine of GeoPerceive, to bridge natural language and DSL. This translator facilitates the computation of fine-grained, DSL-level scores, which serve as reward signals in reinforcement learning. We assess GeoDPO on both in-domain and out-of-domain datasets, spanning tasks in geometric perception as well as downstream reasoning. Experimental results demonstrate that, while supervised fine-tuning (SFT) offers only marginal improvements and may even impair performance in out-of-domain scenarios, GeoDPO achieves substantial gains: +26.5% on in-domain data, +8.0% on out-of-domain data, and +39.0% on downstream reasoning tasks. These findings underscore the superior performance and generalization ability of GeoDPO over SFT. All code is released at https://github.com/Longin-Yu/GeoPerceive to ensure reproducibility.
Summary / 总结
The paper addresses the challenge of geometric reasoning in VLMs by introducing GeoPerceive, a benchmark with DSL representations, and proposing GeoDPO, a translator-guided RL framework. GeoDPO uses an NL-to-DSL translator trained on synthetic data to enhance geometric perception, achieving significant improvements over supervised fine-tuning, with gains of 26.5% in-domain, 8.0% out-of-domain, and 39.0% on downstream tasks.
论文通过引入包含DSL表示的GeoPerceive基准和提出基于翻译者的RL框架GeoDPO来解决VLMs的几何推理问题。GeoDPO使用一个在合成数据上训练的NL-to-DSL翻译器来增强几何感知,相比监督微调,取得了显著的改进,在领域内数据上提高了26.5%,在领域外数据上提高了8.0%,在下游任务上提高了39.0%。
Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification
Authors: Tao Huang, Rui Wang, Xiaofei Liu, Yi Qin, Li Duan, Liping Jing
Venue: ICLR 2026
First: 2026-02-05T10:51:39+00:00 · Latest: 2026-02-26T07:10:35+00:00
Comments: Accepted to ICLR 2026. Code is available at https://github.com/HT86159/EUQ
Abstract
Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. However, when presented with incompetent or adversarial inputs, they frequently produce unreliable or even harmful content, such as fact hallucinations or dangerous instructions. This misalignment with human expectations, referred to as misbehaviors of LVLMs, raises serious concerns for deployment in critical applications. These misbehaviors are found to stem from epistemic uncertainty, specifically either conflicting internal knowledge or the absence of supporting information. However, existing uncertainty quantification methods, which typically capture only overall epistemic uncertainty, have shown limited effectiveness in identifying such issues. To address this gap, we propose Evidential Uncertainty Quantification (EUQ), a fine-grained method that captures both information conflict and ignorance for effective detection of LVLM misbehaviors. In particular, we interpret features from the model output head as either supporting (positive) or opposing (negative) evidence. Leveraging Evidence Theory, we model and aggregate this evidence to quantify internal conflict and knowledge gaps within a single forward pass. We extensively evaluate our method across four categories of misbehavior, including hallucinations, jailbreaks, adversarial vulnerabilities, and out-of-distribution (OOD) failures, using state-of-the-art LVLMs, and find that EUQ consistently outperforms strong baselines, showing that hallucinations correspond to high internal conflict and OOD failures to high ignorance. Furthermore, layer-wise evidential uncertainty dynamics analysis helps interpret the evolution of internal representations from a new perspective. The source code is available at https://github.com/HT86159/EUQ.
中文标题/摘要
标题:通过证据不确定性量化检测大型视觉-语言模型的不当行为
大型视觉-语言模型(LVLMs)在多模态理解和生成方面取得了显著进展。然而,当面对无能或对抗性输入时,它们经常生成不可靠甚至有害的内容,如事实幻觉或危险指令。这种与人类期望的不一致,被称为LVLMs的不当行为,对于在关键应用中的部署提出了严重关切。这些不当行为被发现源自于认识不确定性,具体来说是内部知识冲突或缺乏支持信息。然而,现有的不确定性量化方法通常只能捕捉整体认识不确定性,显示出有限的效果来识别这些问题。为了解决这一差距,我们提出了证据不确定性量化(EUQ),这是一种细粒度的方法,能够同时捕捉信息冲突和无知,从而有效检测LVLM的不当行为。特别是,我们将模型输出头的特征解释为支持(正面)或反对(负面)证据。利用证据理论,我们建模并聚合这些证据,在单次前向传播中量化内部冲突和知识空白。我们使用最先进的LVLMs在四个类别(幻觉、脱逃、对抗性漏洞和分布外失败)的不当行为上广泛评估了我们的方法,并发现EUQ始终优于强基线,表明幻觉对应于高内部冲突,而分布外失败对应于高无知。此外,逐层证据不确定性动态分析有助于从新视角解释内部表示的演变。源代码可在https://github.com/HT86159/EUQ获取。
Summary / 总结
The paper addresses the issue of misbehaviors in large vision-language models (LVLMs) by proposing Evidential Uncertainty Quantification (EUQ), which captures both information conflict and ignorance to detect misbehaviors such as hallucinations and out-of-distribution failures. EUQ models and aggregates evidence from the model output head using Evidence Theory, achieving consistent performance improvements over strong baselines across various misbehavior categories.
论文通过提出证据不确定性量化(EUQ)方法来解决大型视觉-语言模型(LVLM)的不良行为问题,该方法捕捉信息冲突和无知以检测幻觉和分布外失败等不良行为。EUQ 使用证据理论从模型输出头中建模和聚合证据,显示出比现有方法更好的性能。该方法在四个类别不良行为的评估中表现出色,并且始终优于强基线,提供了对LVLM内部表示演变的新见解。
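The distinction between conflict and ignorance that the EUQ abstract draws can be illustrated with Dempster's rule of combination on a two-hypothesis frame. This is a generic Evidence Theory sketch, not EUQ itself: the abstract does not say how evidence is extracted from the output head or aggregated, so the mass-function layout (keys `"a"`, `"b"`, `"theta"`) and the function name `combine` are assumptions made here.

```python
from typing import Dict, Tuple

# A mass function over the frame {a, b}: mass on "a", on "b", and on the
# whole frame "theta" (uncommitted mass, i.e. ignorance). Values sum to 1.
Mass = Dict[str, float]


def combine(m1: Mass, m2: Mass) -> Tuple[Mass, float]:
    """Dempster's rule for two sources; returns (combined mass, conflict K)."""
    conflict = m1["a"] * m2["b"] + m1["b"] * m2["a"]
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    norm = 1.0 - conflict
    combined = {
        "a": (m1["a"] * m2["a"] + m1["a"] * m2["theta"] + m1["theta"] * m2["a"]) / norm,
        "b": (m1["b"] * m2["b"] + m1["b"] * m2["theta"] + m1["theta"] * m2["b"]) / norm,
        "theta": (m1["theta"] * m2["theta"]) / norm,
    }
    return combined, conflict


# Two confident sources that disagree: high conflict, little ignorance.
disagree, k_disagree = combine(
    {"a": 0.7, "b": 0.1, "theta": 0.2},
    {"a": 0.1, "b": 0.7, "theta": 0.2},
)

# Two sources that know almost nothing: low conflict, high ignorance.
unsure, k_unsure = combine(
    {"a": 0.05, "b": 0.05, "theta": 0.9},
    {"a": 0.05, "b": 0.05, "theta": 0.9},
)
```

In the first case the conflict term is large while the mass left on the whole frame is small; in the second the reverse holds. The abstract associates these two regimes with hallucinations and OOD failures respectively.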
No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings
Authors: Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son, Sung-Eui Yoon
Venue: ICLR 2026
First: 2026-02-26T07:07:11+00:00 · Latest: 2026-02-26T07:07:11+00:00
Comments: Accepted to ICLR 2026
Abstract
Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model's unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.
中文标题/摘要
标题:无标题,无问题:基于模型拟合嵌入的无标题成员推理
潜在扩散模型在高保真文本到图像生成方面取得了显著成功,但它们倾向于记忆训练数据,这引发了重要的隐私和知识产权问题。成员推理攻击(MIAs)提供了一种原则性的方法来审计这种记忆现象,通过确定给定样本是否包含在训练中。然而,现有方法假设可以访问真实标题。这一假设在只有图像可用且其文本注释未披露的现实场景中失败,使得先前的方法在用视觉语言模型(VLM)标题替代时无效。在本文中,我们提出了一种名为MoFit的无标题MIA框架,该框架构建了显式过度拟合目标模型生成流形的合成条件输入。对于查询图像,MoFit分为两个阶段:(i) 基于模型拟合的替代优化,其中对图像应用的扰动被优化以构建在模型无条件先验中从成员样本学习的区域中的替代,(ii) 替代驱动的嵌入提取,其中从替代中提取基于模型拟合的嵌入,然后将其用作查询图像的不匹配条件。该嵌入增强了成员样本的条件损失响应,同时相对较少影响保留样本,从而在没有真实标题的情况下增强可分性。我们在多个数据集和扩散模型上的全面实验表明,MoFit始终优于先前的VLM条件基线,并且性能与依赖标题的方法相当。
Summary / 总结
The research addresses the privacy concerns of latent diffusion models by developing MoFit, a caption-free membership inference framework. It constructs synthetic conditioning inputs overfitted to the model's generative manifold to enhance separability without ground-truth captions. Experiments show that MoFit outperforms previous vision-language model-conditioned methods and achieves performance comparable to caption-dependent approaches across various datasets and diffusion models.
研究通过开发MoFit框架解决了潜在扩散模型的隐私问题,该框架是一种无图注的成员身份推理攻击方法。MoFit通过构造与模型生成流形过拟合的合成条件输入来推断查询图像是否属于训练数据。该方法包含两个阶段:模型拟合的代理优化和代理驱动的嵌入提取。实验结果表明,MoFit在多个数据集和扩散模型上的表现优于之前的基于视觉语言模型的方法,并且其性能与使用真实图注的方法相当。
SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
Authors: Zhuohang Jiang, Xu Yuan, Haohao Qu, Shanru Lin, Kanglong Liu, Wenqi Fan, Qing Li
First: 2026-02-26T06:55:48+00:00 · Latest: 2026-02-26T06:55:48+00:00
Abstract
The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.
中文标题/摘要
标题:SUPERGLASSES:将视觉语言模型作为智能眼镜的智能代理进行基准测试
随着AI驱动的智能眼镜这一热门可穿戴设备的迅速发展,多模态交互的新领域被解锁,其中外部知识源上的视觉问答(VQA)成为核心应用。现有的适应智能眼镜的视觉语言模型通常在传统的多模态数据集上进行训练和评估;然而,这些数据集缺乏反映智能眼镜使用场景的多样性和现实性,无法体现其特定挑战,其中准确识别目标对象必须先于任何外部知识检索。为弥合这一差距,我们引入了SUPERGLASSES,这是首个基于智能眼镜设备完全收集的真实数据构建的全面VQA基准。SUPERGLASSES包含2,422个第一人称视角图像-问题对,覆盖14个图像领域和8个查询类别,并附带完整的搜索轨迹和推理注释。我们在该基准上评估了26个代表性视觉语言模型,揭示了显著的性能差距。为解决现有模型的局限性,我们进一步提出了SUPERLENS,这是一种多模态智能眼镜代理,通过结合自动目标检测、查询解耦和多模态网络搜索,实现检索增强的答案生成。我们的代理达到了最先进的性能,超越了GPT-4o 2.19个百分点,并突显了智能眼镜VQA场景中需要任务特定解决方案的需求。
Summary / 总结
The paper introduces SUPERGLASSES, a new VQA benchmark for smart glasses, addressing the limitations of traditional multimodal datasets. It includes 2,422 egocentric image-question pairs and evaluates 26 VLMs, revealing significant performance gaps. The authors propose SUPERLENS, a multimodal smart glasses agent that integrates object detection and web search, achieving state-of-the-art performance and emphasizing the need for task-specific solutions in smart glasses VQA scenarios.
论文介绍了SUPERGLASSES,这是一个新的用于智能眼镜的VQA基准,旨在解决现有跨模态数据集的局限性。它包含2,422个真实世界的第一人称图像-问题对,并评估了26种VLM,揭示了显著的性能差距。作者提出了SUPERLENS,这是一种多模态智能眼镜代理,结合了物体检测、查询解耦和网络搜索,实现了最先进的性能,并强调了在智能眼镜VQA场景中需要特定任务的解决方案。
ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport
Authors: Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham
First: 2026-02-26T06:51:25+00:00 · Latest: 2026-02-26T06:51:25+00:00
Comments: Preprint submitted to Expert Systems with Applications
Abstract
Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and reduced modality gap. The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.
中文标题/摘要
标题:ViCLIP-OT:首个针对越南语图像-文本检索的基座视觉-语言模型,采用最优传输
图像-文本检索已成为智能多媒体系统中的基本组成部分;然而,大多数现有的视觉-语言模型都是针对高资源语言优化的,在越南语等低资源环境中表现不佳。本文介绍了ViCLIP-OT,这是一种专门针对越南语图像-文本检索的基座视觉-语言模型。所提出的方法将CLIP风格的对比学习与相似性图正则化最优传输(SIGROT)损失相结合,以增强全局跨模态一致性并缓解模态差距问题。在三个越南语基准数据集(UIT-OpenViIC、KTVIC和Crossmodal-3600)上的广泛实验表明,ViCLIP-OT在领域内和零样本设置中均优于CLIP和SigLIP基线。在UIT-OpenViIC上,该模型的平均Recall@K为67.34%,比CLIP提高了5.75个百分点。在Crossmodal-3600上的零样本评估中,ViCLIP-OT比CLIP提高了11.72个百分点。嵌入空间分析进一步证实了更好的对齐和减少的模态差距。结果表明,将SIGROT集成起来为低资源语言的跨模态检索提供了一种有效且可扩展的策略,为越南语和其他代表性不足的语言环境中的智能多媒体检索系统提供了实际意义。
Summary / 总结
The research aims to address the limitations of existing vision-language models for low-resource languages like Vietnamese. ViCLIP-OT, a foundation model, integrates CLIP-style contrastive learning with SIGROT loss to enhance cross-modal consistency and reduce modality gaps. Experiments on three Vietnamese benchmarks show that ViCLIP-OT outperforms CLIP and SigLIP in both in-domain and zero-shot settings, with significant improvements in Recall@K and alignment of embedding spaces.
研究旨在解决低资源语言如越南语的视觉-语言模型的局限性。ViCLIP-OT 基础模型结合了 CLIP 风格的对比学习和 SIGROT 损失,以增强跨模态一致性并减少模态差距。在三个越南语基准测试上的实验表明,ViCLIP-OT 在 UIT-OpenViIC 上的平均 Recall@K 达到 67.34%,并在 Crossmodal-3600 的零样本评估中超越 CLIP 11.72 个百分点。
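The Recall@K metric quoted for ViCLIP-OT can be sketched as follows. This is a minimal illustration under assumptions: query `i` is taken to have exactly one correct item at index `i`, and the toy similarity matrix is made up. The abstract does not state which values of K are averaged or whether benchmarks have multiple captions per image, so neither is modelled here.

```python
from typing import List


def recall_at_k(sim: List[List[float]], k: int) -> float:
    """Fraction of queries whose matching item (same index) ranks in the top k.

    sim[i][j] is the similarity between query i and candidate item j.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Candidate indices sorted by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy 3x3 similarity matrix (hypothetical values).
sim = [
    [0.9, 0.1, 0.2],  # query 0: correct item ranks 1st
    [0.3, 0.2, 0.8],  # query 1: correct item ranks 3rd
    [0.1, 0.7, 0.6],  # query 2: correct item ranks 2nd
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
r3 = recall_at_k(sim, 3)
```

An "average Recall@K" would then be the mean of such values over the chosen K settings and retrieval directions.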
Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
Authors: Changqing Zhou, Yueru Luo, Han Zhang, Zeyu Jiang, Changhao Chen
First: 2026-02-26T06:37:43+00:00 · Latest: 2026-02-26T06:37:43+00:00
Comments: Accepted by CVPR2026
Abstract
Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.
中文标题/摘要
标题:单目开放词汇占用预测用于室内场景
开放词汇3D占用对于具身智能体至关重要,它们需要理解复杂且语义类别丰富的室内环境,这些语义类别超越了固定分类体系。虽然最近的工作已经探索了开放词汇占用在户外驾驶场景中的应用,但这些方法在室内环境中表现不佳,因为室内几何结构更密集,布局更复杂,语义也更为精细。为了解决这些挑战,我们采用了一种仅使用二元占用标签(占用 vs 空闲)的几何监督范式。我们的框架基于3D语言嵌入高斯分布,这是一种统一的中间表示,将精细的3D几何结构与语言对齐的语义嵌入联系起来。在几何方面,我们发现现有的高斯到占用的操作符在如此弱的监督下无法收敛,因此我们引入了一种基于透明度的Poisson方法,以稳定体素聚合。在语义方面,直接对齐渲染特征和开放词汇分割特征会受到特征混杂的影响;因此,我们提出了一种渐进温度衰减计划,该计划在点积过程中逐渐增强高斯-语言对齐。在Occ-ScanNet上,我们的框架在开放词汇设置中实现了59.50 IoU和21.05 mIoU,超越了所有现有的占用方法,在IoU上领先,并在mIoU上大幅超越了先前的开放词汇方法。代码将在https://github.com/JuIvyy/LegoOcc上发布。
Summary / 总结
This paper addresses the challenge of predicting open-vocabulary 3D occupancy for indoor scenes, where methods designed for outdoor driving transfer poorly because geometry is denser, layouts are more intricate, and semantics are more fine-grained. The authors propose a geometry-only supervision approach using binary occupancy labels and build upon 3D Language-Embedded Gaussians to couple fine-grained 3D geometry with semantic embeddings. They introduce an opacity-aware Poisson-based approach for volumetric aggregation and a Progressive Temperature Decay schedule for semantic alignment. On Occ-ScanNet, their method achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU.
该论文旨在解决在室内场景中预测开放词汇3D占用率的挑战,其中几何结构密集且语义非常精细。作者提出了一种仅基于几何的监督方法,使用二元占用标签,并基于3D语言嵌入高斯模型来结合精细的3D几何结构和语义嵌入。他们引入了一种基于透明度的Poisson方法进行体素聚合,并提出了一种渐进温度衰减调度来逐步增强高斯-语言对齐。实验结果表明,该方法在Occ-ScanNet上实现了59.50 IoU和21.05 mIoU,超越了之前的多种方法。
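The two numbers reported on Occ-ScanNet measure different things, which a short sketch can make explicit. This assumes the usual occupancy-benchmark convention (IoU on occupied vs. free voxels, mIoU averaged over semantic classes); the label encoding with `0` as free space and the toy voxel lists are hypothetical.

```python
from typing import List, Sequence


def occupancy_iou(pred: Sequence[int], gt: Sequence[int], empty: int = 0) -> float:
    """Class-agnostic IoU: any non-empty label counts as occupied."""
    inter = sum(1 for p, g in zip(pred, gt) if p != empty and g != empty)
    union = sum(1 for p, g in zip(pred, gt) if p != empty or g != empty)
    return inter / union if union else 1.0


def mean_iou(pred: Sequence[int], gt: Sequence[int], classes: Sequence[int]) -> float:
    """Average of per-class IoU over classes that appear in pred or gt."""
    ious: List[float] = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Six voxels, flattened; 0 = free, 1 and 2 = two semantic classes.
pred = [0, 1, 1, 2, 2, 0]
gt = [0, 1, 2, 2, 2, 1]
geo = occupancy_iou(pred, gt)        # geometry only
sem = mean_iou(pred, gt, [1, 2])     # geometry and semantics
```

A model can score well on the first metric while mislabelling classes, which is why the geometric and semantic numbers are reported separately.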
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
Authors: Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang
Venue: CVPR 2026
First: 2026-02-26T06:13:33+00:00 · Latest: 2026-02-26T06:13:33+00:00
Comments: Accepted by CVPR 2026
Abstract
Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by +0.031 ImageReward at 4.87× speedup and even surpassing the full-step baseline by +0.028 ImageReward at 3.54× speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at https://github.com/argsss/DPCache.
中文标题/摘要
标题:降噪作为路径规划:基于DPCache的无训练加速扩散模型
扩散模型在图像和视频生成方面取得了显著的成功,但其实际部署仍受到多步迭代采样带来的巨大计算开销的阻碍。在加速策略中,基于缓存的方法提供了一种无训练且有效的解决方案,通过在时间步之间重用或预测特征来实现加速。然而,现有方法依赖于固定或局部自适应的时间表,而不考虑去噪轨迹的全局结构,这通常会导致误差累积和视觉伪影。为克服这一限制,我们提出了一种名为DPCache的新型无训练加速框架,将扩散采样的加速问题表述为全局路径规划问题。DPCache从少量校准集中构建路径感知代价张量,以量化在给定前一关键时间步的情况下跳过时间步的路径依赖误差。利用该张量,DPCache采用动态规划选择一个最优的关键时间步序列,以最小化总路径成本并保持轨迹保真度。在推理过程中,模型仅在这些关键时间步进行完整计算,而中间输出则通过缓存特征高效预测。在DiT、FLUX和HunyuanVideo上的广泛实验表明,DPCache在保持最小质量损失的情况下实现了显著加速,与先前的加速方法相比,在4.87倍加速下提高了0.031 ImageReward,在FLUX上3.54倍加速下提高了0.028 ImageReward,甚至超过了全步基线,验证了我们路径感知全局调度框架的有效性。代码将在https://github.com/argsss/DPCache上发布。
Summary / 总结
The paper proposes DPCache, a training-free acceleration framework for diffusion models that formulates the acceleration problem as a path planning task. By constructing a Path-Aware Cost Tensor from a calibration set, DPCache selects an optimal sequence of key timesteps to minimize path cost while preserving trajectory fidelity. Experiments show DPCache achieves strong acceleration with minimal quality loss, outperforming previous methods on DiT, FLUX, and HunyuanVideo.
论文提出了一种名为DPCache的训练-free 加速框架,将加速问题建模为路径规划任务。通过从校准集构建路径感知代价张量,DPCache 选择最优的关键时间步序列以最小化路径代价同时保持轨迹保真度。实验结果显示,DPCache 在 DiT、FLUX 和 HunyuanVideo 上实现了显著加速且质量损失较小,优于先前的方法。
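The key-timestep selection described for DPCache can be sketched as a shortest-path dynamic program. This is a deliberately simplified illustration, not the paper's algorithm: it uses a pairwise cost `cost[j][k]` (error of jumping from key step `j` to key step `k`) rather than the path-aware tensor conditioned on the preceding key timestep, it assumes the first and last timesteps are always key steps, and the quadratic toy cost is invented for the example.

```python
import math
from typing import List


def select_key_timesteps(cost: List[List[float]], num_keys: int) -> List[int]:
    """Pick num_keys timesteps (including first and last) minimising summed jump cost."""
    T = len(cost)
    if not 2 <= num_keys <= T:
        raise ValueError("num_keys must be between 2 and the number of timesteps")
    # best[n][k]: minimal cost of a path that uses n key steps and ends at step k.
    best = [[math.inf] * T for _ in range(num_keys + 1)]
    prev = [[-1] * T for _ in range(num_keys + 1)]
    best[1][0] = 0.0
    for n in range(2, num_keys + 1):
        for k in range(1, T):
            for j in range(k):
                cand = best[n - 1][j] + cost[j][k]
                if cand < best[n][k]:
                    best[n][k] = cand
                    prev[n][k] = j
    # Backtrack from the final timestep.
    path = [T - 1]
    n, k = num_keys, T - 1
    while n > 1:
        k = prev[n][k]
        path.append(k)
        n -= 1
    return path[::-1]


# Toy cost: skipping s intermediate steps costs s squared (hypothetical).
T = 7
cost = [[float((k - j - 1) ** 2) if k > j else 0.0 for k in range(T)] for j in range(T)]
schedule = select_key_timesteps(cost, num_keys=3)
```

With this toy cost the program places the middle key step at the midpoint, since two equal jumps are cheaper than one short and one long jump. A calibrated cost would instead concentrate key steps where the denoising trajectory changes fastest.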