Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Authors: Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang, Ranjay Krishna
First: 2026-02-26T18:54:06+00:00 · Latest: 2026-02-26T18:54:06+00:00
Comments: TACL 2026
Abstract
The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being web-scale and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, or the number of languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.
中文标题/摘要
标题:规模无法克服语用学:报告偏差对视觉语言推理的影响
视觉语言模型(VLMs)缺乏推理能力的问题一直是研究讨论的焦点。我们认为这种行为源于其训练数据中的报告偏差。也就是说,人们默认在描述视觉内容时会省略一些必要的隐含信息,以监督某些类型的推理;例如,“今天在比赛!”比“一张37个人站在田野后面的图片”更可能作为描述。我们通过语用学理论的视角,研究了流行的VLMs OpenCLIP、LLaVA-1.5和Molmo的数据,发现报告偏差导致在空间、时间、否定和计数这四种推理技能上缺乏足够的表示,尽管这些语料库是大规模的,或者合成生成的。通过一组精心策划的基准测试,我们证明:(i) VLMs在由报告偏差抑制的上述类型推理上表现不佳;(ii) 与普遍认为的相反,增加数据量、模型规模和多语言训练并不会默认产生这些技能;但令人欣慰的是,(iii) 特别收集的用于获取隐含信息的注解是有效的。我们的研究结果强调了需要更故意的数据策划方法,而不是依赖规模来产生推理能力。
Summary / 总结
The study investigates the impact of reporting bias on the reasoning capabilities of Vision-Language Models (VLMs) like OpenCLIP, LLaVA-1.5, and Molmo. By analyzing the training data through pragmatics theories, the research finds that reporting bias leads to insufficient representation of spatial, temporal, negation, and counting reasoning skills, despite the large scale of the corpora. The experiments show that scaling the data or model size does not inherently improve these skills, but specifically collecting annotations to capture tacit information does enhance them. This suggests that intentional data curation is crucial for developing reasoning capabilities in VLMs.
研究探讨了报告偏见对视觉语言模型(VLMs)如OpenCLIP、LLaVA-1.5和Molmo推理能力的影响。通过使用语用学理论分析训练数据,研究发现报告偏见导致空间、时间、否定和计数推理技能的不足表示,尽管数据集规模庞大。实验表明,增加数据或模型规模并不能自然提升这些技能,但专门收集用于捕捉隐含信息的注解则有效提升了它们。这表明,故意的数据整理方法对于开发VLMs的推理能力至关重要。
Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
Authors: Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
First: 2026-02-26T18:45:33+00:00 · Latest: 2026-02-26T18:45:33+00:00
Abstract
Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
中文标题/摘要
标题:检索与分割:少量示例足以弥合开放词汇分割中的监督缺口吗?
开放词汇分割(OVS)将视觉语言模型(VLM)的零样本识别能力扩展到像素级预测,使模型能够根据文本提示分割任意类别。尽管取得了进展,但由于训练VLMs所使用的粗略图像级监督和自然语言的语义模糊性,OVS仍落后于完全监督的方法。我们通过引入一种少量示例设置,将文本提示与像素标注图像的支持集相结合,来解决这些限制。在此基础上,我们提出了一种检索增强的测试时适配器,通过融合文本和视觉支持特征学习一种轻量级的、针对每张图像的分类器。与依赖于后期手工融合的先前方法不同,我们的方法进行学习的、针对每个查询的融合,实现了模态之间的更强协同作用。该方法支持不断扩展的支持集,并适用于细粒度任务,如个性化分割。实验表明,我们显著缩小了零样本和监督分割之间的差距,同时保留了开放词汇的能力。
Summary / 总结
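The paper addresses the limitations of open-vocabulary segmentation (OVS) by proposing a few-shot setting that augments text prompts with a support set of pixel-annotated images. It introduces a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike methods relying on late, hand-crafted fusion, it performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets and fine-grained tasks such as personalized segmentation. Experiments show that it significantly narrows the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.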
论文通过提出一种使用像素标注图像支持集来增强文本提示的少量样本设置,解决了开放词汇分割(OVS)的局限性。它引入了一种检索增强的测试时适配器,通过融合文本和视觉支持特征来学习轻量级的、针对每个图像的分类器。实验表明,这种方法显著缩小了零样本和监督分割之间的差距,同时保持了开放词汇的能力。与依赖于后期手工融合的方法不同,该方法执行学习的、针对每个查询的融合,实现了更强的模态协同,并支持对细粒度任务如个性化分割的持续扩展支持集。
ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
Authors: Yiran Guan, Sifan Tu, Dingkang Liang, Linghao Zhu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
Venue: ICLR 2026
First: 2026-02-26T18:10:41+00:00 · Latest: 2026-02-26T18:10:41+00:00
Comments: Accepted by ICLR 2026
Abstract
Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.
中文标题/摘要
标题:ThinkOmni:通过指导解码提升文本推理至全模态场景
全模态推理对于智能系统理解并从多种数据源中推断信息至关重要。虽然现有的全模态大型语言模型(OLLM)在感知多种模态方面表现出色,但它们缺乏近期大型推理模型(LRM)的复杂推理能力。然而,通过额外训练来增强OLLM的推理能力面临着重大挑战,包括高质量数据的需求、任务特定的适应以及巨大的计算成本。为了解决这些限制,我们提出了ThinkOmni,这是一种无需训练和数据的框架,将文本推理提升至全模态场景。ThinkOmni引入了两个关键组件:1)LRM-as-a-Guide,利用现成的LRM来指导OLLM的解码过程;2)逐步对比缩放,无需手动超参数调整即可适应性平衡感知和推理信号。在六个跨模态推理基准上的实验表明,ThinkOmni始终能够提供性能改进,主要结果在MathVista上达到70.2,在MMAU上达到75.5。总体而言,ThinkOmni提供了一种灵活且通用的全模态推理解决方案,并为推理能力的泛化和应用提供了新的见解。
Summary / 总结
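ThinkOmni is a training-free and data-free framework that enhances the reasoning ability of omni-modal large language models (OLLMs) by leveraging off-the-shelf large reasoning models (LRMs) and a Stepwise Contrastive Scaling mechanism. The method delivers consistent improvements across six multi-modal reasoning benchmarks, reaching 70.2 on MathVista and 75.5 on MMAU.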
ThinkOmni 是一个无需训练和数据的框架,通过利用现成的大型推理模型(LRMs)和逐步对比缩放机制来增强全模态大型语言模型(OLLMs)的推理能力。该方法在六个多模态推理基准测试中表现出色,分别在 MathVista 和 MMAU 上取得了 70.2 和 75.5 的成绩。
ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation
Authors: Ayush Roy, Wei-Yang Alex Lee, Rudrasis Chakraborty, Vishnu Suresh Lokhande
First: 2026-02-26T18:07:10+00:00 · Latest: 2026-02-26T18:07:10+00:00
Comments: CVPR 2026
Abstract
In recent times, large datasets hinder efficient model training while also containing redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which often are rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, l2 distance among real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.
中文标题/摘要
标题:ManifoldGD:无需训练的分层流形引导扩散基础数据集蒸馏
近年来,大规模数据集妨碍了高效的模型训练,同时也包含冗余的概念。数据集蒸馏旨在合成紧凑的数据集,同时保留大规模训练集的知识,大幅减少存储和计算需求。扩散模型的最新进展使无需训练的蒸馏成为可能,通过利用预训练生成先验;然而,现有的引导策略仍然有限。当前基于分数的方法要么进行无引导的去噪,要么依赖于简单的基于实例原型中心(IPC中心)的模式引导,这些中心往往过于简单且不理想。我们提出了一种无需训练的基于扩散的框架——流形引导蒸馏(ManifoldGD),该框架在每次去噪时间步中整合流形一致的引导。我们的方法通过VAE潜在特征的分层、分裂聚类计算IPC,生成多尺度的核心集,捕捉粗粒度语义模式和细粒度类内变异性。通过提取的IPC中心的局部邻域,我们为每次扩散去噪时间步创建潜在流形。在每次去噪步骤中,我们将模式对齐向量投影到估计的潜在流形的局部切空间上,从而约束生成轨迹保持流形忠实性,同时保持语义一致性。这种表述在无需任何模型重训练的情况下提高了表示性、多样性和图像保真度。实验证明,ManifoldGD在FID、真实和合成数据集嵌入的l2距离以及分类准确性方面优于现有的无需训练和基于训练的基线,确立了ManifoldGD作为首个几何感知的无需训练数据蒸馏框架的地位。
Summary / 总结
ManifoldGD is a training-free diffusion-based framework that enhances dataset distillation by integrating manifold consistent guidance at each denoising step. It uses hierarchical clustering of VAE latent features to generate a multi-scale coreset of instance prototype centroids (IPCs), which are then used to create a latent manifold for each denoising step. This approach improves representativeness, diversity, and image fidelity without retraining. Experiments show consistent improvements over existing training-free and training-based methods in terms of FID, l2 distance, and classification accuracy.
ManifoldGD 是一种训练-free 的扩散模型框架,通过在每个去噪步骤中集成流形一致的指导来增强数据集蒸馏。它使用分层分裂聚类从VAE隐特征中计算实例原型中心(IPCs),创建一个多尺度核心集,同时捕捉粗粒度语义模式和细粒度类内变异性。在每个去噪步骤中,该方法将模式对齐向量投影到估计的隐流形的局部切空间上,确保生成轨迹保持流形一致性并保留语义一致性。实验表明,ManifoldGD 在 FID、真实和合成数据集嵌入的 l2 距离以及分类准确性方面的一致改进,使其成为第一个几何感知的训练-free 数据蒸馏框架。
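The tangent-space projection described in the ManifoldGD abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the local tangent space is estimated, so the sketch assumes it is the span of the offsets from an IPC centroid to its neighbors, orthonormalized with Gram-Schmidt. The names `tangent_basis` and `project_onto_tangent` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tangent_basis(center, neighbors, eps=1e-9):
    """Orthonormal basis for the span of the offsets from `center` to its neighbors.

    Assumption: this Gram-Schmidt span stands in for the paper's tangent estimate.
    """
    basis = []
    for point in neighbors:
        w = [p - c for p, c in zip(point, center)]
        # Remove the components already covered by earlier basis vectors.
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip offsets that add no new direction
            basis.append([wi / norm for wi in w])
    return basis

def project_onto_tangent(vec, basis):
    """Project `vec` onto the subspace spanned by the orthonormal `basis`."""
    out = [0.0] * len(vec)
    for b in basis:
        coeff = dot(vec, b)
        out = [o + coeff * bi for o, bi in zip(out, b)]
    return out

# Toy 3-D latent space: the neighbors of the centroid all lie in the xy-plane.
center = [0.0, 0.0, 0.0]
neighbors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
guidance = [0.5, -2.0, 3.0]  # hypothetical mode-alignment vector

basis = tangent_basis(center, neighbors)
projected = project_onto_tangent(guidance, basis)
print(projected)  # the z component, which points off the plane, is removed
```

In the toy example the guidance vector keeps its in-plane components and loses the component normal to the plane, which is the sense in which the projected update stays "manifold-faithful".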
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Authors: Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown
Venue: ICLR 2026
First: 2025-10-21T20:30:20+00:00 · Latest: 2026-02-26T18:05:42+00:00
Comments: Accepted at ICLR 2026. 26 pages, 9 figures. Metric/benchmark available at https://github.com/amith-ananthram/posh
Abstract
While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $ρ$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
中文标题/摘要
标题:PoSh:使用场景图引导LLM作为裁判进行详细图像描述
尽管视觉-语言模型(VLMs)在详细图像描述方面取得了进展,但评估仍是一个挑战。标准指标(如CIDEr、SPICE)是为短文本设计的,并且调整为识别现在已不常见的错误,例如物体识别错误。相比之下,长文本需要对属性和关系的敏感度以及能够定位特定文本段落错误的评分。在本研究中,我们引入了PoSh,这是一种用于详细图像描述的指标,使用场景图作为结构化的评分标准来引导LLM作为裁判,产生基于细粒度错误(如组合理解错误)的综合评分。PoSh是可复制的、可解释的,并且比现有指标(包括GPT4o作为裁判)更接近人类评分者。为了验证PoSh,我们引入了一个新的具有挑战性的数据集DOCENT。这个新的基准数据集包含艺术品,并配以专家撰写的参考文本和模型生成的描述,还增加了艺术史学生对它们质量的精细和粗略判断。因此,DOCENT不仅能够评估详细图像描述指标,还能够在一个新的具有挑战性的领域中评估详细图像描述本身。我们展示了PoSh与DOCENT中的人类判断具有更强的相关性(Spearman ρ +0.05),并且对图像类型具有鲁棒性(使用CapArena,一个现有的网络图像数据集),并且是一个有效的奖励函数,优于标准的监督微调。然后,使用PoSh,我们表征了开放和封闭模型在描述DOCENT中的绘画、素描和雕像的表现,并发现基础模型难以实现对具有丰富场景动态的图像的全面、无误的描述,从而确立了一个新的具有挑战性的任务来衡量VLM的进步。通过PoSh和DOCENT,我们希望促进在诸如辅助文本生成等重要领域的发展。
Summary / 总结
PoSh is a new metric for evaluating detailed image descriptions by using scene graphs to guide LLMs as judges. It provides aggregate scores based on fine-grained errors and correlates better with human judgments than existing metrics. PoSh was validated on a new dataset, DOCENT, which includes artwork and expert-written references, and showed strong performance in evaluating both detailed image description metrics and models' performance. The study also found that foundation models struggle with rich scene dynamics, setting a new benchmark for VLM progress.
PoSh 是一个使用场景图来指导LLM作为评判者评估详细图像描述的新指标。它基于细粒度错误提供综合评分,并且与现有指标相比,与人类判断的相关性更强。PoSh 在一个新数据集 DOCENT 上得到了验证,该数据集包含艺术品和专家撰写的参考文本,展示了评估详细图像描述指标和模型性能的强大能力。研究还发现,基础模型在处理丰富场景动态时存在困难,为 VLM 进展设定了新的基准。
CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays
Authors: Hyungyung Lee, Hangyul Yoon, Edward Choi
First: 2026-02-26T17:51:21+00:00 · Latest: 2026-02-26T17:51:21+00:00
Abstract
Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also requiring costly retraining to support new diagnostic tasks, limiting their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings.
中文标题/摘要
标题:CXReasonAgent:基于证据的胸部X光诊断推理代理
胸部X光在胸部诊断中起着核心作用,其解释本质上需要多步、基于证据的推理。然而,大型视觉-语言模型(LVLM)通常生成的响应虽然看似合理,但并不忠实于诊断证据,提供的视觉证据有限,难以验证,同时还需要昂贵的重新训练以支持新的诊断任务,这限制了它们在临床环境中的可靠性和适应性。为了解决这些限制,我们提出了CXReasonAgent,这是一种将大型语言模型(LLM)与临床导向的诊断工具结合的诊断代理,用于使用图像衍生的诊断和视觉证据进行基于证据的诊断推理。为了评估这些能力,我们引入了包含1,946轮对话的多轮对话基准CXReasonDial,涵盖12项诊断任务,并展示了CXReasonAgent生成忠实于证据的响应,使其在临床环境中比LVLMs提供更可靠和可验证的诊断推理。这些发现强调了在安全关键的临床环境中整合基于临床的诊断工具的重要性。
Summary / 总结
The research aims to improve the reliability and adaptability of diagnostic reasoning for chest X-rays by addressing the limitations of large vision-language models. CXReasonAgent, a diagnostic agent, integrates a large language model with clinically grounded diagnostic tools to perform evidence-grounded reasoning. The evaluation on CXReasonDial, a multi-turn dialogue benchmark, shows that CXReasonAgent produces more faithfully grounded responses, enhancing the reliability and verifiability of diagnostic reasoning compared to LVLMs.
CXReasonAgent 通过将大型语言模型与临床相关的诊断工具结合,用于胸片的证据导向诊断推理。它解决了大型视觉语言模型生成的响应与诊断证据不一致的问题,并提供了可验证的视觉证据。CXReasonAgent 在 CXReasonDial 基准测试中的 1,946 轮对话(涵盖 12 个诊断任务)中表现出更可靠和可验证的诊断推理能力,优于大型视觉语言模型。
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Authors: Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
First: 2025-08-28T09:08:30+00:00 · Latest: 2026-02-26T17:33:06+00:00
Abstract
Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce Dyslexify, a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit consisting of attention heads. Without requiring finetuning, Dyslexify improves performance by up to 22.06% on a typographic variant of ImageNet-100 while reducing standard ImageNet-100 accuracy by less than 1%, and we demonstrate its utility in a medical foundation model for skin lesion diagnosis. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
中文标题/摘要
标题:Dyslexify:CLIP对抗 typographic 攻击的机制性防御
typographic 攻击通过在图像中注入文本来利用多模态系统,导致目标错误分类、恶意内容生成,甚至视觉语言模型的逃逸。在本研究中,我们分析了CLIP视觉编码器在typographic 攻击下的行为,发现模型后半部分层中的特定注意力头因果性地提取并传递typographic 信息至cls标记。基于这些见解,我们引入了Dyslexify——一种通过选择性地消除typographic 电路(由注意力头组成)来防御CLIP模型对抗typographic 攻击的方法。无需微调,Dyslexify在typographic 变体的ImageNet-100上性能提升高达22.06%,同时将标准ImageNet-100的准确性降低不到1%,并在皮肤病变诊断的医学基础模型中展示了其实用性。值得注意的是,我们的无需训练的方法在当前依赖微调的typographic 防御中仍具有竞争力。为此,我们发布了对抗typographic 攻击具有显著更强鲁棒性的Dyslexic CLIP模型系列,这些模型适合作为广泛的安全关键应用的即插即用替代品,其中基于文本的操纵风险超过了文本识别的实用性。
Summary / 总结
This work addresses typographic attacks on CLIP models by analyzing how CLIP vision encoders process typographic information. The authors identify specific attention heads that transmit typographic data to the cls token and introduce Dyslexify, a method that selectively ablates these heads to defend against attacks. Dyslexify improves performance on a typographic variant of ImageNet-100 by up to 22.06% without requiring finetuning, and it also shows utility in a medical foundation model for skin lesion diagnosis. The approach remains competitive with state-of-the-art defenses that rely on finetuning and is released as a family of dyslexic CLIP models for broader safety-critical applications.
该研究通过分析CLIP视觉编码器如何处理字体信息,来应对字体攻击问题。作者识别出特定的注意力头将字体数据传输到cls标记,并引入了Dyslexify方法,通过选择性地消除这些头来防御攻击。Dyslexify在字体变体的ImageNet-100上提高了高达22.06%的性能,且无需微调,同时在皮肤病变诊断的医学基础模型中也显示出实用性。该方法在不依赖微调的情况下与当前最先进的防御方法竞争,并作为更广泛的安全关键应用的合适替代品发布了一组Dyslexic CLIP模型。
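A minimal sketch of what ablating attention heads means mechanically, under stated assumptions: each head contributes one output vector, the vectors are concatenated before the output projection, and ablation replaces a head's vector with zeros. The abstract does not specify the ablation type (zero versus mean ablation) or which heads form the circuit, so the head index used here is purely illustrative and `combine_heads` is a hypothetical helper.

```python
def combine_heads(head_outputs, ablated=frozenset()):
    """Concatenate per-head output vectors, zeroing heads whose index is in `ablated`.

    Assumption: zero-ablation; the source does not state which variant is used.
    """
    combined = []
    for idx, out in enumerate(head_outputs):
        if idx in ablated:
            combined.extend([0.0] * len(out))  # this head no longer writes anything
        else:
            combined.extend(out)
    return combined

# Three toy heads with 2-D outputs; pretend head 1 carries typographic information.
heads = [[0.2, 0.1], [0.9, -0.4], [0.3, 0.3]]
clean = combine_heads(heads)
defended = combine_heads(heads, ablated={1})
print(clean)
print(defended)
```

Because only the selected heads are silenced and no weights are updated, a defense of this shape needs no finetuning, which matches the training-free framing of the abstract.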
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Authors: Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao
First: 2026-02-26T17:12:40+00:00 · Latest: 2026-02-26T17:12:40+00:00
Abstract
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
中文标题/摘要
标题:空间-时间令牌剪枝以实现高效的高分辨率GUI代理
纯视觉GUI代理提供了通用的交互能力,但由于高分辨率屏幕截图和历史轨迹中固有的大量空间-时间冗余,它们遭受了严重的效率瓶颈。我们识别出现有压缩范式中的两个关键不匹配:时间不匹配,其中均匀的历史编码与代理的“衰减记忆”注意力模式相偏离,以及空间拓扑冲突,其中无结构的剪枝破坏了用于精确坐标定位所需的网格完整性,导致空间幻觉。为了解决这些挑战,我们引入了GUIPruner,这是一种针对高分辨率GUI导航的无需训练框架。它结合了基于衰减的重缩放来消除历史冗余的时空自适应分辨率(TAR),以及优先考虑交互前景和语义锚点的同时保护全局布局的分层结构感知剪枝(SSP)。在多种基准上的广泛评估表明,GUIPruner始终能够实现最先进的性能,有效防止在大规模模型下高压缩导致的性能崩溃。值得注意的是,在Qwen2-VL-2B上,我们的方法在FLOPs上减少了3.4倍,在视觉编码延迟上加快了3.3倍,同时保留了超过94%的原始性能,使实时、高精度导航在极低资源消耗下成为可能。
Summary / 总结
The research aims to improve the efficiency of high-resolution GUI agents by addressing temporal and spatial redundancy. The method involves GUIPruner, which combines Temporal-Adaptive Resolution (TAR) and Stratified Structure-aware Pruning (SSP) to reduce historical redundancy and preserve grid integrity. The study shows that GUIPruner achieves state-of-the-art performance with a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while maintaining over 94% of the original performance, enabling real-time, high-precision navigation.
论文通过引入GUIPruner框架,结合Temporal-Adaptive Resolution (TAR)和Stratified Structure-aware Pruning (SSP),解决了纯视觉GUI代理的效率问题。TAR通过衰减基线缩放减少历史冗余,而SSP优先处理交互元素和语义锚点以保持网格完整性。实验结果表明,GUIPruner实现了最先进的性能,FLOPs减少了3.4倍,视觉编码延迟加速了3.3倍,同时保留了超过94%的原始性能,从而实现实时、高精度的导航。
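The decay-based resizing behind Temporal-Adaptive Resolution can be sketched as a schedule that shrinks older screenshots more aggressively than recent ones. The abstract does not give the decay function, so the exponential form, the `decay` and `min_scale` parameters, and both function names below are assumptions made for illustration only.

```python
def history_scales(num_frames, decay=0.5, min_scale=0.25):
    """Resize factor per history frame, ordered oldest to newest.

    Assumption: exponential decay with a floor; the newest frame keeps scale 1.0.
    """
    scales = []
    for i in range(num_frames):
        age = num_frames - 1 - i  # 0 for the most recent frame
        scales.append(max(min_scale, decay ** age))
    return scales

def resized_shape(height, width, scale):
    """Target (height, width) after applying a scale factor, at least 1 pixel each."""
    return max(1, round(height * scale)), max(1, round(width * scale))

scales = history_scales(4)
shapes = [resized_shape(1080, 1920, s) for s in scales]
print(scales)
print(shapes)
```

Since the number of visual tokens grows roughly with image area, halving the side length of an old frame cuts its token count to about a quarter, which is where a schedule like this would save compute on history.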
Large Multimodal Models as General In-Context Classifiers
Authors: Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
Venue: CVPR
First: 2026-02-26T17:08:18+00:00 · Latest: 2026-02-26T17:08:18+00:00
Comments: CVPR Findings 2026. Project website at https://circle-lmm.github.io/
Abstract
Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMM) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.
中文标题/摘要
标题:大型多模态模型作为通用上下文分类器
在分类任务中我们应该使用哪种多模态模型?先前的研究表明,答案在于CLIP类对比视觉-语言模型(VLMs),因为它们在零样本分类中的表现非常出色。相比之下,大型多模态模型(LMM)更适合复杂任务。在本文中,我们提出,这种答案忽视了LMM的一个重要能力:上下文学习。我们在多种数据集上对最先进的LMM进行基准测试,发现尽管它们的零样本性能低于CLIP,但在提供少量上下文示例的情况下,LMM可以与基于缓存的适配器(其“上下文”等价物)的对比VLM匹敌甚至超越。我们将这种分析扩展到开放世界设置,在这种设置中,LMM的生成性质使它们更适合该任务。在这种具有挑战性的场景中,LMM在提供不完美的上下文信息时会遇到困难。为了解决这个问题,我们提出了一种简单的无训练方法CIRCLE,该方法为上下文示例分配伪标签,并通过可用的上下文本身逐步优化它们。通过广泛的实验,我们表明CIRCLE为开放世界分类建立了稳健的基础,超越了VLM的对应物,并突显了LMM作为统一分类器和服务于特定模型的灵活替代方案的潜力。
Summary / 总结
This study explores the use of Large Multimodal Models (LMMs) for classification tasks, arguing that their capability in in-context learning is often overlooked. By benchmarking state-of-the-art LMMs on various datasets, the researchers found that LMMs, when provided with a few in-context examples, can match or even outperform CLIP-like contrastive Vision-Language Models (VLMs) with cache-based adapters. The study also introduces CIRCLE, a method that enhances LMMs in open-world settings, demonstrating that LMMs can serve as unified classifiers and flexible alternatives to specialized models.
该研究探讨了大型多模态模型(LMMs)在分类任务中的应用,指出其在上下文学习方面的能力往往被忽视。研究在多种数据集上将LMMs与CLIP-like模型进行了对比,并发现LMMs在提供少量上下文示例的情况下,可以匹配甚至超越对比性VLMs的表现。研究还扩展了这一分析到开放世界场景,在这种更具挑战性的环境中,LMMs由于上下文信息不准确而表现不佳,并提出了一种名为CIRCLE的方法,通过迭代地用可用的上下文信息来细化伪标签,从而改善其表现。
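The iterative pseudo-label refinement that the abstract attributes to CIRCLE can be sketched as a leave-one-out loop: each in-context example is relabeled using the remaining examples and their current labels as context, and the loop repeats until the labels stop changing. This is only an illustration of that idea. The `predict` callable is a hypothetical stand-in for an LMM call, and the toy nearest-neighbor predictor below is not part of the paper.

```python
def refine_pseudo_labels(examples, initial_labels, predict, rounds=3):
    """Relabel each example from the others' current labels until labels are stable."""
    labels = list(initial_labels)
    for _ in range(rounds):
        new_labels = []
        for i, example in enumerate(examples):
            # Context is every other example paired with its current pseudo-label.
            context = [(examples[j], labels[j]) for j in range(len(examples)) if j != i]
            new_labels.append(predict(example, context))
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def nearest_neighbor_predict(query, context):
    """Toy predictor: copy the label of the closest context item (1-D features)."""
    _, label = min(context, key=lambda item: abs(item[0] - query))
    return label

examples = [10, 11, 13, 50, 51, 52]
noisy = ["cat", "cat", "dog", "dog", "dog", "dog"]  # the third label is wrong
refined = refine_pseudo_labels(examples, noisy, nearest_neighbor_predict)
print(refined)
```

In the toy run the mislabeled third example is pulled back to the label of its neighbors after one round, and the second round confirms that nothing else changes.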
MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
Authors: Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang
First: 2026-02-26T17:08:08+00:00 · Latest: 2026-02-26T17:08:08+00:00
Comments: 6 pages, CSCWD 2026
Abstract
With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings--precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
中文标题/摘要
标题:MovieTeller:工具增强的电影概要生成与ID一致渐进抽象
随着数字娱乐的爆炸性增长,自动视频摘要已成为内容索引、个性化推荐和高效媒体归档等应用不可或缺的一部分。对于长格式视频,如电影和电视剧的自动概要生成,现有的视觉-语言模型(VLMs)面临重大挑战。尽管在单张图像描述方面表现出色,但这些通用模型在长时间段上下文中往往表现出关键性失败,主要是缺乏ID一致的人物识别和叙述连贯性断裂。为克服这些限制,我们提出了一种新的框架——MovieTeller,用于通过工具增强的渐进抽象生成电影概要。我们的核心贡献是一种无需训练、工具增强、基于事实的生成过程。我们不需进行昂贵的模型微调,而是直接以插件方式利用现成模型。我们首先调用一个专门的面部识别模型作为外部“工具”,建立事实基础——精确的人物身份及其对应的边界框。这些基础随后被注入提示中,引导VLM的推理,确保生成的场景描述基于可验证的事实。此外,我们的渐进抽象流水线将整部电影的总结分解为多阶段过程,有效缓解了当前VLMs的上下文长度限制。实验表明,与端到端基线相比,我们的方法在事实准确性、人物一致性以及整体叙述连贯性方面取得了显著改进。
Summary / 总结
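The work targets the challenges Vision-Language Models (VLMs) face in generating accurate, coherent synopses for long-form videos such as movies and TV series. The proposed MovieTeller framework uses a tool-augmented, fact-grounded generation process to overcome limitations in character identification and narrative coherence. By using a specialized face recognition model to establish precise character identities and injecting these groundings into the VLM prompt, the approach yields significant improvements over end-to-end baselines in factual accuracy, character consistency, and overall narrative coherence.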
研究旨在解决使用Vision-Language模型(VLMs)生成长格式视频(如电影和电视剧)准确且连贯的概要时遇到的挑战。提出的MovieTeller框架采用工具增强的事实基础生成过程,克服了人物识别和叙事连贯性方面的限制。通过利用专门的人脸识别模型来建立精确的人物身份,并将这些基础注入到VLM提示中,该方法在事实准确性、人物一致性以及整体叙事连贯性方面显著优于端到端基线。
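The progressive abstraction idea from the MovieTeller abstract, summarizing a long movie in stages so that no single call exceeds the context limit, can be sketched as repeated grouping and summarizing. The number of stages, the grouping rule, and the `summarize` callable (a stand-in for a VLM or LLM call) are all assumptions; the abstract only states that the process is multi-stage.

```python
def progressive_abstraction(items, summarize, group_size=4):
    """Repeatedly summarize fixed-size groups until a single synopsis remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each stage shrinks the list")
    level = list(items)
    while len(level) > 1:
        level = [
            summarize(level[i:i + group_size])
            for i in range(0, len(level), group_size)
        ]
    return level[0] if level else ""

# Toy summarizer that just joins its inputs; a real one would call a model.
scene_descriptions = ["s1", "s2", "s3", "s4", "s5"]
synopsis = progressive_abstraction(scene_descriptions, "+".join, group_size=2)
print(synopsis)
```

Each stage sees at most `group_size` inputs, so the longest prompt is bounded by the group size rather than by the length of the movie.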
Object-Centric Representation Learning for Enhanced 3D Semantic Scene Graph Prediction
Authors: KunHo Heo, GiHyun Kim, SuYeon Kim, MyeongAh Cho
Venue: NeurIPS 2025
First: 2025-10-06T11:33:09+00:00 · Latest: 2026-02-26T16:03:04+00:00
Comments: Accepted by NeurIPS 2025. Code: https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes
Abstract
3D Semantic Scene Graph Prediction aims to detect objects and their semantic relationships in 3D scenes, and has emerged as a crucial technology for robotics and AR/VR applications. While previous research has addressed dataset limitations and explored various approaches including Open-Vocabulary settings, they frequently fail to optimize the representational capacity of object and relationship features, showing excessive reliance on Graph Neural Networks despite insufficient discriminative capability. In this work, we demonstrate through extensive analysis that the quality of object features plays a critical role in determining overall scene graph accuracy. To address this challenge, we design a highly discriminative object feature encoder and employ a contrastive pretraining strategy that decouples object representation learning from the scene graph prediction. This design not only enhances object classification accuracy but also yields direct improvements in relationship prediction. Notably, when plugging in our pretrained encoder into existing frameworks, we observe substantial performance improvements across all evaluation metrics. Additionally, whereas existing approaches have not fully exploited the integration of relationship information, we effectively combine both geometric and semantic features to achieve superior relationship prediction. Comprehensive experiments on the 3DSSG dataset demonstrate that our approach significantly outperforms previous state-of-the-art methods. Our code is publicly available at https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes.
中文标题/摘要
标题:面向对象的表示学习以增强3D语义场景图预测
3D语义场景图预测旨在检测3D场景中的对象及其语义关系,并已成为机器人技术和AR/VR应用中的关键技术。尽管先前的研究解决了数据集限制并探索了各种方法,包括开放式词汇设置,但它们经常未能优化对象和关系特征的表示能力,过度依赖图神经网络,尽管其区分能力不足。在本工作中,我们通过广泛的分析表明,对象特征的质量对整体场景图准确性至关重要。为了解决这一挑战,我们设计了一种高度区分性的对象特征编码器,并采用对比预训练策略,将对象表示学习与场景图预测分离。这一设计不仅提高了对象分类准确性,还直接改善了关系预测。值得注意的是,当将我们的预训练编码器插入现有框架时,我们观察到所有评估指标上都取得了显著性能提升。此外,与现有方法未能充分利用关系信息的整合不同,我们有效结合了几何和语义特征,实现了更优的关系预测。在3DSSG数据集上的全面实验表明,我们的方法显著优于先前的最先进方法。我们的代码可在https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes公开获取。
Summary / 总结
This research aims to improve 3D semantic scene graph prediction by focusing on the quality of object features. The authors introduce a discriminative object feature encoder and a contrastive pretraining strategy that decouples object representation learning from scene graph prediction. This approach enhances both object classification and relationship prediction, leading to significant performance improvements across all metrics when integrated into existing frameworks. Comprehensive experiments on the 3DSSG dataset show that their method outperforms previous state-of-the-art methods.
该研究旨在通过提高物体特征的质量来改进3D语义场景图预测。作者提出了一种具有区分性的物体特征编码器和对比预训练策略,以增强物体和关系预测。实验表明,他们的方法在3DSSG数据集上所有指标上显著优于之前的方法。
Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy
Authors: Matthew Sutton, Katrin Amunts, Timo Dickscheid, Christian Schiffer
First: 2026-02-26T15:10:39+00:00 · Latest: 2026-02-26T15:10:39+00:00
Comments: 8 pages, 3 figures, submitted for inclusion at a conference
Abstract
Foundation models increasingly offer potential to support interactive, agentic workflows that assist researchers during analysis and interpretation of image data. Such workflows often require coupling vision to language to provide a natural-language interface. However, paired image-text data needed to learn this coupling are scarce and difficult to obtain in many research and clinical settings. One such setting is microscopic analysis of cell-body-stained histological human brain sections, which enables the study of cytoarchitecture: cell density and morphology and their laminar and areal organization. Here, we propose a label-mediated method that generates meaningful captions from images by linking images and text only through a label, without requiring curated paired image-text data. Given the label, we automatically mine area descriptions from related literature and use them as synthetic captions reflecting canonical cytoarchitectonic attributes. An existing cytoarchitectonic vision foundation model (CytoNet) is then coupled to a large language model via an image-to-text training objective, enabling microscopy regions to be described in natural language. Across 57 brain areas, the resulting method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas. It matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy and, with the area label masked, its descriptions remain discriminative enough to recover the area in an 8-way test with 68.6% accuracy. These results suggest that weak, label-mediated pairing can suffice to connect existing biomedical vision foundation models to language, providing a practical recipe for integrating natural language into domains where fine-grained paired annotations are scarce.
中文标题/摘要
标题:细胞架构的语言表达:弱监督视觉-语言模型在人类大脑显微镜分析中的应用
基础模型越来越多地提供支持研究人员在图像数据分析和解释过程中进行互动、自主工作流程的潜力。此类工作流程通常需要将视觉与语言结合以提供自然语言界面。然而,在许多研究和临床环境中,用于学习这种结合的成对图像-文本数据稀缺且难以获取。其中一个环境是细胞体染色的人类大脑组织切片的显微镜分析,这使我们能够研究细胞架构:细胞密度和形态及其层状和区域组织。在此,我们提出了一种标签介导的方法,通过仅通过标签将图像和文本链接起来生成有意义的描述,而无需使用经过精心策划的成对图像-文本数据。给定标签,我们自动从相关文献中挖掘区域描述,并使用它们作为反映经典细胞架构属性的合成描述。然后,通过图像到文本的训练目标将现有的细胞架构视觉基础模型(CytoNet)与大型语言模型耦合,使显微镜区域能够用自然语言描述。在57个脑区中,该方法生成了合理的区域级描述,并通过明确拒绝未见过的区域支持开放集使用。在掩蔽区域标签的情况下,其描述具有足够的区分性,能够在8分类测试中以68.6%的准确率恢复区域。这些结果表明,弱的、标签介导的配对足以将现有的生物医学视觉基础模型与语言连接起来,为在细粒度成对注释稀缺的领域中集成自然语言提供了一种实用的配方。
Summary / 总结
This study proposes a label-mediated method to generate meaningful captions for human brain microscopy images without requiring curated paired image-text data. Area descriptions mined automatically from the literature serve as synthetic captions, and an existing cytoarchitectonic vision foundation model (CytoNet) is coupled to a large language model via an image-to-text training objective. Across 57 brain areas, the method produces plausible area-level descriptions, matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy, and, with the area label masked, recovers the area in an 8-way test with 68.6% accuracy.
该研究提出了一种标签中介的方法,用于弱监督的视觉-语言建模,以生成人类大脑显微镜图像的有意义描述。通过标签将图像和文本连接起来,该方法避免了稀缺的研究环境中配对的图像-文本数据的需求。该方法使用文献中的区域描述作为合成描述,并将现有的细胞建筑学视觉基础模型(CytoNet)与大型语言模型耦合。结果表明,该方法生成了合理的区域级描述,并且在8分类测试中可以以68.6%的准确率恢复正确的区域,对于范围内的斑块,其与参考标签匹配的准确率为90.6%。这表明标签中介配对可以有效地将视觉和语言连接起来,在标注数据有限的领域中具有实际应用价值。
Inducing Dyslexia in Vision Language Models
Authors: Melika Honarmand, Ayati Sharma, Badr AlKhamissi, Johannes Mehrer, Martin Schrimpf
First: 2025-09-29T11:03:16+00:00 · Latest: 2026-02-26T15:04:01+00:00
Abstract
Dyslexia, a neurodevelopmental disorder characterized by persistent reading difficulties, is often linked to reduced activity of the visual word form area (VWFA) in the ventral occipito-temporal cortex. Traditional approaches to studying dyslexia, such as behavioral and neuroimaging methods, have provided valuable insights but remain limited in their ability to test causal hypotheses about the underlying mechanisms of reading impairments. In this study, we use large-scale vision-language models (VLMs) to simulate dyslexia by functionally identifying and perturbing artificial analogues of word processing. Using stimuli from cognitive neuroscience, we identify visual-word-form-selective units within VLMs and demonstrate that they predict human VWFA neural responses. Ablating model VWF units leads to selective impairments in reading tasks while general visual and language comprehension abilities remain intact. In particular, the resulting model matches dyslexic humans' phonological deficits without a significant change in orthographic processing, and mirrors dyslexic behavior in font sensitivity. Taken together, our modeling results replicate key characteristics of dyslexia and establish a computational framework for investigating brain disorders.
中文标题/摘要
标题:在视觉语言模型中诱发阅读障碍
阅读障碍是一种神经发育障碍,表现为持续的阅读困难,通常与背外侧枕颞叶皮层中的视觉单词形式区(VWFA)活动减少有关。传统上,通过行为和神经影像学方法研究阅读障碍虽然提供了宝贵见解,但在测试阅读障碍潜在机制的因果假设方面仍有限制。本研究使用大规模视觉-语言模型(VLMs)通过功能上识别和扰动单词处理的人工模拟来模拟阅读障碍。使用认知神经科学的刺激,我们识别出VLMs中的视觉单词形式选择性单元,并证明它们可以预测人类VWFA神经反应。删除模型中的VWF单元会导致阅读任务中的选择性障碍,而一般视觉和语言理解能力保持不变。特别是,该模型表现出与阅读障碍患者相似的音位学缺陷,而书写形式处理没有显著变化,并且在字体敏感性方面反映了阅读障碍的行为特征。综上所述,我们的建模结果复制了阅读障碍的关键特征,并建立了一个研究大脑疾病的计算框架。
Summary / 总结
This study simulates dyslexia in large-scale vision-language models (VLMs) by functionally identifying and perturbing artificial analogues of word processing. The identified visual-word-form-selective units predict human VWFA neural responses, and ablating them leads to selective reading impairments while general visual and language comprehension remain intact. The resulting model matches dyslexic humans' phonological deficits without a significant change in orthographic processing and mirrors dyslexic behavior in font sensitivity, replicating key characteristics of dyslexia.
本研究旨在通过功能识别和干扰人工单词处理单元来模拟视觉语言模型中的阅读障碍。研究显示这些模型可以预测人类VWFA神经反应,并且删除模型中的VWFA单元会导致选择性的阅读障碍,而不影响一般的视觉和语言理解能力。该模型表现出与阅读障碍患者相似的音位缺陷,并且对字体的敏感性,复制了阅读障碍的关键特征。
WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
Authors: Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang
First: 2026-02-26T14:11:10+00:00 · Latest: 2026-02-26T14:11:10+00:00
Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.
中文标题/摘要
标题:WISER: 更广泛的搜索、更深的思考和自适应融合的无训练零样本组合图像检索
零样本组合图像检索(ZS-CIR)旨在根据包含参考图像和修改文本的多模态查询检索目标图像,无需使用标注三元组进行训练。现有方法通常将多模态查询转换为单一模态——要么作为文本到图像检索(T2I)中的编辑标题,要么作为图像到图像检索(I2I)中的编辑图像。然而,每种范式都有其固有的局限性:T2I往往丢失了细微的视觉细节,而I2I则难以处理复杂的语义修改。为了在各种查询意图下有效利用它们的互补优势,我们提出了一种无训练框架WISER,通过“检索-验证-精炼”管道统一T2I和I2I,明确建模意图意识和不确定性意识。具体而言,WISER首先通过生成编辑后的标题和图像进行并行检索,以扩大候选池,进行更广泛的搜索。然后,它通过验证器进行自适应融合,评估检索置信度,对不确定的检索结果触发精炼,并动态融合双路径以获得可靠的检索结果。对于不确定的检索结果,WISER通过结构化的自我反思生成精炼建议,以指导下一轮检索朝着更深的思考进行。广泛的实验表明,WISER在多个基准测试中显著优于先前的方法,在CIRCO(mAP@5)上相对提高了45%,在CIRR(Recall@1)上相对提高了57%。值得注意的是,它甚至超越了许多依赖训练的方法,突显了其在各种场景下的优越性和泛化能力。代码将在https://github.com/Physicsmile/WISER上发布。
SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
Authors: Camile Lendering, Erkut Akdag, Egor Bondarev
Venue: CVPR 2026
First: 2026-02-26T13:52:57+00:00 · Latest: 2026-02-26T13:52:57+00:00
Comments: Accepted to CVPR 2026
Abstract
Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results using foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 98.0% and 97.6% on the MVTec-AD dataset, and 93.3% and 98.3% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at https://github.com/CLendering/SubspaceAD.
中文标题/摘要
标题:SubspaceAD:无需训练的少量样本异常检测方法通过子空间建模
在工业检测中检测视觉异常通常需要仅用每类少量的正常图像进行训练。最近的少量样本方法通过基础模型特征取得了很好的结果,但通常依赖于记忆库、辅助数据集或视觉-语言模型的多模态调优。因此,我们质疑在视觉基础模型特征表示下是否有必要如此复杂。为回答这个问题,我们引入了SubspaceAD,一种无需训练的方法,它分为两个简单的阶段。首先,通过冻结的DINOv2主干从少量正常图像中提取补丁级别的特征。其次,使用主成分分析(PCA)模型拟合这些特征以估计正常变化的低维子空间。在推理时,通过相对于该子空间的重构残差检测异常,生成可解释且统计上可靠的异常评分。尽管简单,SubspaceAD在无需训练、提示调优或记忆库的情况下,在单次样本和少量样本设置中均取得了最先进的性能。在单次样本异常检测设置中,SubspaceAD在MVTec-AD数据集上的图像级和像素级AUROC分别为98.0%和97.6%,在VisA数据集上的图像级和像素级AUROC分别为93.3%和98.3%,超过了先前的最先进的结果。代码和演示可在https://github.com/CLendering/SubspaceAD获取。
Summary / 总结
SubspaceAD is a training-free few-shot anomaly detection method that uses a simple two-stage process. First, it extracts patch-level features from a small set of normal images using a frozen DINOv2 backbone. Second, it fits a Principal Component Analysis (PCA) model to these features to estimate the normal variations' low-dimensional subspace. During inference, anomalies are detected by measuring the reconstruction residual with respect to this subspace, yielding interpretable and statistically grounded anomaly scores. SubspaceAD achieves state-of-the-art performance in one-shot and few-shot settings without requiring training, prompt tuning, or memory banks, surpassing prior results on the MVTec-AD and VisA datasets.
SubspaceAD 是一种无需训练的少样本异常检测方法,采用两阶段简单流程。首先,使用冻结的 DINOv2 主干从少量正常图像中提取 patch 级别特征。其次,使用这些特征拟合 PCA 模型以估计正常变化的低维子空间。在推理阶段,通过测量相对于该子空间的重构残差来检测异常,生成可解释且统计上可靠的异常评分。SubspaceAD 在少样本设置中达到了最先进的性能,无需训练、提示调优或记忆库,超越了先前的方法在 MVTec-AD 和 VisA 数据集上的表现。
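The two stages described above (fit PCA to normal patch features, then score by reconstruction residual) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the features here are synthetic stand-ins for DINOv2 patch features, and the function names `fit_subspace` and `anomaly_scores` as well as the choice of `k` are assumptions made for illustration.

```python
import numpy as np

def fit_subspace(normal_feats, k):
    """Fit a k-dimensional PCA subspace to normal patch features of shape (n, d)."""
    mean = normal_feats.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(normal_feats - mean, full_matrices=False)
    return mean, vt[:k]

def anomaly_scores(feats, mean, components):
    """Squared reconstruction residual of each feature w.r.t. the subspace."""
    centered = feats - mean
    recon = centered @ components.T @ components
    return np.sum((centered - recon) ** 2, axis=1)

rng = np.random.default_rng(0)
# Synthetic stand-in for patch features (assumption): normal data lies
# near a 2-D plane embedded in an 8-D feature space.
basis = np.linalg.qr(rng.normal(size=(8, 2)))[0].T          # (2, 8), orthonormal rows
normal = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 8))
mean, comps = fit_subspace(normal, k=2)

test_normal = rng.normal(size=(5, 2)) @ basis
# Push the same points off the plane along a direction orthogonal to it.
off = rng.normal(size=8)
off -= basis.T @ (basis @ off)
off /= np.linalg.norm(off)
test_anomalous = test_normal + 3.0 * off

scores_normal = anomaly_scores(test_normal, mean, comps)
scores_anomalous = anomaly_scores(test_anomalous, mean, comps)
```

In this toy setup the off-plane points receive much larger residuals than the in-plane ones, which is the property the method relies on; thresholding and pixel-level map construction are left out.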
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
Authors: Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu, Xin Jiang, Xinghao Chen
First: 2025-12-02T12:30:05+00:00 · Latest: 2026-02-26T13:16:26+00:00
Comments: Accepted by CVPR2026
Abstract
Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while delivering an end-to-end inference speedup. The code is available at https://github.com/Casey-bit/VLMPruner.
中文标题/摘要
标题:VLM-Pruner:高效VLM离心式令牌剪枝范式中的空间稀疏性缓冲
视觉语言模型(VLMs)在图像理解任务中表现出色,但大量的视觉令牌导致了显著的计算成本,阻碍了其在移动设备上的部署。许多剪枝方法仅依赖于令牌的重要性,从而忽略了令牌间的冗余性,保留了大量重复的令牌,浪费了容量。尽管提出了一些具有冗余意识的方法,但它们往往忽略了视觉令牌之间的空间关系。这可能导致保留的令牌过于稀疏,无法充分覆盖目标对象的区域。为了解决这些局限性,我们提出了一种无需训练的VLM-Pruner令牌剪枝算法,明确平衡冗余性和空间稀疏性。我们引入了一种离心式令牌剪枝范式,能够在优先保留细粒度对象细节的同时,实现从近到远的选择。此外,我们设计了一种空间稀疏性缓冲(BSS)准则,推迟选择空间上距离较远的令牌。我们还采用了一种并行贪婪策略,以高效地进行令牌选择。为了减轻剪枝带来的信息损失,我们有选择地将被丢弃的令牌中的重要信息融合到保留的令牌中。全面的比较表明,VLM-Pruner在五个VLM中以88.9%的剪枝率持续优于强大的基线模型,同时实现了端到端的推理加速。代码可在https://github.com/Casey-bit/VLMPruner获取。
Summary / 总结
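VLM-Pruner is a training-free token pruning algorithm that addresses the computational cost of vision-language models (VLMs) by explicitly balancing redundancy and spatial sparsity. It introduces a centrifugal token pruning paradigm that selects tokens from near to far while preserving fine-grained object details, a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens, and a parallel greedy strategy for efficient selection; salient information from discarded tokens is fused into the retained ones. VLM-Pruner outperforms strong baselines across five VLMs at an 88.9% pruning rate while delivering an end-to-end inference speedup.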
VLM-Pruner 是一种无需训练的 token 剪枝算法,旨在通过平衡冗余和空间稀疏性来解决视觉语言模型(VLMs)的计算挑战。它引入了离心 token 剪枝范式和空间稀疏性缓冲(BSS)准则,以优先保留细粒度的物体细节并高效选择 token。实验结果表明,VLM-Pruner 在五个 VLMs 上以 88.9% 的剪枝率优于强基线,并提供端到端的推理加速。
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Authors: Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng, Ismail Elezi
First: 2026-02-26T11:08:39+00:00 · Latest: 2026-02-26T11:08:39+00:00
Abstract
Reasoning with large language models often benefits from generating multiple chains-of-thought, but existing aggregation strategies are typically trajectory-level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or "nearly correct" attempts. We propose Stitching Noisy Diffusion Thoughts, a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. Given a problem, we (i) sample many diverse, low-cost reasoning trajectories using a masked diffusion language model, (ii) score every intermediate step with an off-the-shelf process reward model (PRM), and (iii) stitch these highest-quality steps across trajectories into a composite rationale. This rationale then conditions an autoregressive (AR) model (solver) to recompute only the final answer. This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search. Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate answers. Using low-confidence diffusion sampling with parallel, independent rollouts, our training-free framework improves average accuracy by up to 23.8% across six math and coding tasks. At the same time, it achieves up to a 1.8x latency reduction relative to both traditional diffusion models (e.g., Dream, LLaDA) and unified architectures (e.g., TiDAR). Code is available at https://github.com/roymiles/diffusion-stitching.
中文标题/摘要
标题:通过奖励引导拼接实现扩散语言模型的测试时缩放
使用大型语言模型进行推理通常可以从生成多个思维链中受益,但现有的聚合策略通常是轨迹级别的(例如,选择最佳轨迹或对最终答案进行投票),会丢弃来自部分或“几乎正确”尝试的有用中间工作。我们提出了一种名为Stitching Noisy Diffusion Thoughts的自一致性框架,将廉价采样的推理转换为可重复使用的步骤级候选池。给定一个问题,我们(i) 使用掩码扩散语言模型采样许多多样且低成本的推理轨迹,(ii) 使用现成的过程奖励模型(PRM)评分每个中间步骤,(iii) 将这些最高质量的步骤跨轨迹拼接成一个综合的推理。然后,这种综合的推理条件一个自回归(AR)模型(求解器)仅重新计算最终答案。这种模块化管道将探索(扩散)与评估和解决方案合成分离,避免了单一统一的混合体,同时保留了广泛的搜索。在数学推理基准测试中,我们发现步骤级重组在更难的问题上最有益,消融实验强调了最终AR求解器在将拼接但不完美的推理转化为准确答案中的重要性。使用低置信度扩散采样和并行独立的展开,我们的无需训练框架在六个数学和编程任务中平均准确率提高了最多23.8%。同时,它相对于传统扩散模型(如Dream、LLaDA)和统一架构(如TiDAR)实现了最多1.8倍的延迟减少。代码可在https://github.com/roymiles/diffusion-stitching/ 获取。
Summary / 总结
The paper introduces a method called Stitching Noisy Diffusion Thoughts to improve reasoning with large language models. It involves sampling multiple diverse reasoning trajectories using a masked diffusion language model, scoring each step with a process reward model, and then stitching the highest-quality steps into a composite rationale. This rationale is used to condition an autoregressive model to compute the final answer. The method shows up to a 23.8% improvement in accuracy and a 1.8x reduction in latency compared to existing models on math and coding tasks.
该论文提出了一种名为Stitching Noisy Diffusion Thoughts的方法,以提高大型语言模型的推理能力。该方法包括使用掩码扩散语言模型生成多种多样的推理轨迹,使用过程奖励模型评分每个步骤,然后将最高质量的步骤缝合到一个综合的推理中。该综合推理用于条件化自回归模型以计算最终答案。该方法在各种数学和编码任务中显示出显著的准确性提升,最高可达23.8%,同时与传统和统一模型相比,将延迟降低了最多1.8倍。
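The score-then-stitch step can be sketched in a few lines of Python. The abstract does not say how steps from different trajectories are aligned, so this sketch assumes the simplest possible rule: trajectories are aligned by step index, and the highest-scoring candidate is kept at each position. The `stitch` function, the toy trajectories, and the score table standing in for a process reward model are all hypothetical.

```python
from typing import Callable, List

def stitch(trajectories: List[List[str]], score: Callable[[str], float]) -> List[str]:
    """For each step position, keep the candidate step with the highest score."""
    n_steps = max(len(t) for t in trajectories)
    rationale = []
    for i in range(n_steps):
        # Only trajectories long enough to have a step at position i contribute.
        candidates = [t[i] for t in trajectories if i < len(t)]
        rationale.append(max(candidates, key=score))
    return rationale

# Toy scores standing in for a process reward model (hypothetical values).
toy_scores = {
    "let x = 3": 0.9,
    "let x = 4": 0.2,
    "then 2x = 6": 0.8,
    "then 2x = 7": 0.1,
    "so 2x + 1 = 7": 0.7,
}
trajectories = [
    ["let x = 3", "then 2x = 7"],
    ["let x = 4", "then 2x = 6", "so 2x + 1 = 7"],
]
rationale = stitch(trajectories, toy_scores.__getitem__)
```

Neither input trajectory is correct on its own, yet the stitched rationale combines the good first step of one with the good later steps of the other. In the described pipeline this rationale would then condition an autoregressive solver, which is omitted here.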
TrajTok: Learning Trajectory Tokens enables better Video Understanding
Authors: Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna
Venue: CVPR 2026
First: 2026-02-26T09:15:34+00:00 · Latest: 2026-02-26T09:15:34+00:00
Comments: CVPR 2026
Abstract
Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.
中文标题/摘要
标题:TrajTok:学习轨迹标记使视频理解更优秀
视频模型中的标记化通常通过分块化生成过多且冗余的标记。这严重限制了视频的效率和可扩展性。虽然基于轨迹的标记器通过解耦视频时长和标记数量提供了有希望的解决方案,但它们依赖于复杂的外部分割和跟踪管道,这些管道既慢又任务无关。我们提出了一种端到端的视频标记模块TrajTok,该模块完全集成并与视频模型联合训练,以适应下游目标,动态调整标记粒度以适应语义复杂性,与视频时长无关。TrajTok包含一个统一的分割器,该分割器在空间和时间上对像素进行隐式聚类,直接在单次前向传播中生成对象轨迹。通过优先考虑下游适应性而非像素级分割精度,TrajTok轻量且高效,但实验证明其能提高视频理解性能。借助TrajTok,我们实现了一个从零开始训练的视频CLIP模型(TrajViT2)。它在分类和检索基准测试中均实现了最佳的准确性,同时保持了与最佳标记合并方法相当的效率。TrajTok还证明了其作为标记器之外的多功能组件。我们展示了它可以无缝集成为预训练视觉特征的探针头(TrajAdapter)或视觉-语言模型中的对齐连接器(TrajVLM),特别是在长视频推理方面表现出色。
Summary / 总结
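TrajTok is an end-to-end video tokenizer, co-trained with the video model, that adapts its token granularity to semantic complexity rather than video duration and removes the need for external segmentation and tracking pipelines. It improves video understanding performance, and the video CLIP model built on it (TrajViT2) achieves the best accuracy at scale on classification and retrieval benchmarks while keeping efficiency comparable to the best token-merging methods. TrajTok also serves as a probing head for pretrained visual features (TrajAdapter) and as an alignment connector in vision-language models (TrajVLM).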
TrajTok 是一个端到端的视频分词器,能够根据语义复杂性动态调整分词粒度,无需外部分割和跟踪管道。它提高了视频理解性能,并使视频 CLIP 模型(TrajViT2)能够在分类和检索基准测试中达到最先进的准确性,同时保持与最佳分词合并方法相当的效率。TrajTok 还展示了作为预训练视觉特征的探针头和视觉语言模型中的对齐连接器的多功能性。
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-02-26T08:46:23+00:00
Comments: Fixed results in Table 7
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当今最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs的合成数据,有效从中提炼,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。至关重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是通过像素跟踪。即使是私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLM家族,是开源模型中的最先进的,并展示了在单图像、多图像和视频任务中出色的基于指针的定位新能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、自由形式的视频问答数据集、新的具有复杂查询的对象跟踪数据集以及创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的训练方法,并展示了在视觉标记上使用双向注意和一种新颖的标记权重策略可以提高性能。我们的最佳8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型,并在长视频方面具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率35.5 vs 29.6)并超越了某些任务上的私有模型如Gemini 3 Pro(视频指针F1准确率38.4 vs 20.0,视频跟踪J&F准确率56.2 vs 41.1)。
Summary / 总结
The research addresses the lack of open video-language models (VLMs) with strong grounding capabilities. Molmo2 is a new family of VLMs that is state-of-the-art among open-source models and excels at point-driven grounding in single-image, multi-image, and video tasks. Key contributions include 9 new datasets (7 video and 2 multi-image), collected without closed VLMs, and a training recipe that uses efficient packing, message-tree encoding, bi-directional attention on vision tokens, and a novel token-weight strategy. Molmo2 outperforms open-weight models such as Qwen3-VL on video counting (35.5 vs 29.6 accuracy) and surpasses Gemini 3 Pro on video pointing (38.4 vs 20.0 F1) and video tracking (56.2 vs 41.1 J&F).
研究旨在解决缺乏具备稳健定位能力的开源视频-语言模型(VLMs)的问题。作者引入了Molmo2,这是一种新的VLMs家族,其在点驱动的定位任务中优于现有开源模型。关键贡献包括9个新数据集和一个通过高效打包和消息树编码提升模型性能的训练方法。Molmo2在视频计数、视频描述和视频定位等任务中显著优于开源和专有模型,展示了更强的定位能力。
ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control
Authors: Akihisa Watanabe, Qing Yu, Edgar Simo-Serra, Kent Fujiwara
First: 2026-02-26T08:29:25+00:00 · Latest: 2026-02-26T08:29:25+00:00
Abstract
Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on representative applications, namely motion inpainting and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.
中文标题/摘要
标题:ProjFlow:基于流匹配的投影采样方法实现零样本精确空间运动控制
精确的空间控制生成人类运动是一个具有挑战性的问题。现有方法通常需要特定任务的训练或缓慢的优化,并且施加硬约束经常破坏运动的自然性。基于许多动画任务可以表述为线性逆问题的观察,我们引入了ProjFlow,这是一种无需训练的采样器,能够在不破坏运动真实性的前提下实现零样本、精确满足线性空间约束。我们的主要进展是一种新颖的动力学感知度量,它编码了骨骼拓扑结构。这种度量允许采样器通过在整棵骨骼上一致地分配修正来施加硬约束,从而避免了简单投影的不自然伪影。此外,对于稀疏输入,例如填补几帧之间较长的空白,我们引入了一种时间变化的公式,使用在采样过程中逐渐淡出的伪观测值。在代表性的应用、运动填补和2D到3D提升的广泛实验中,证明了ProjFlow实现了精确的约束满足,并且在零样本基线的基础上匹配或提高了真实感,同时保持了与基于训练的控制器的竞争力。
Summary / 总结
The research aims to generate human motion with precise spatial control without requiring task-specific training or slow optimization. ProjFlow, a training-free sampler, is introduced to achieve zero-shot, exact satisfaction of linear spatial constraints while maintaining motion realism. Key to this is a kinematics-aware metric that ensures coherent corrections across the entire skeleton, avoiding unnatural artifacts. For sparse inputs, ProjFlow uses a time-varying formulation with pseudo-observations to fill gaps between keyframes. Experiments show that ProjFlow satisfies constraints exactly and maintains or improves realism compared to zero-shot baselines, while being competitive with training-based controllers.
ProjFlow旨在生成具有精确空间控制的人体运动,无需特定任务的训练或缓慢的优化。它使用一种运动学感知的度量来强制执行硬约束,同时保持运动的逼真性。对于稀疏输入,如填补关键帧之间的长空白,ProjFlow引入了一种时间变化的公式,有助于填补空白。实验表明,ProjFlow能够精确满足约束条件,并且在保持或提高逼真度方面优于零样本基线,同时与基于训练的控制器具有竞争力。
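The idea of enforcing a linear constraint exactly while letting a metric decide how the correction is distributed can be illustrated with the standard weighted projection onto {x : Ax = b}, namely x' = x - M^{-1} A^T (A M^{-1} A^T)^{-1} (Ax - b). The sketch below is only that textbook formula; the matrix `M` is a hypothetical coupling between coordinates, not the paper's kinematics-aware metric, and the sampler itself is not reproduced.

```python
import numpy as np

def project(x, A, b, M):
    """Closest point to x, in the norm induced by SPD matrix M, satisfying A @ x' == b."""
    Minv_At = np.linalg.solve(M, A.T)               # M^{-1} A^T
    lam = np.linalg.solve(A @ Minv_At, A @ x - b)   # Lagrange multipliers
    return x - Minv_At @ lam

x = np.array([1.0, 2.0, 3.0])
A = np.array([[1.0, 0.0, 0.0]])   # constrain the first coordinate only
b = np.array([0.0])

# Naive (Euclidean) projection: only the constrained coordinate moves.
x_euclid = project(x, A, b, np.eye(3))

# Hypothetical metric coupling coordinates 0 and 1 (assumption for illustration):
# the correction is now shared with the coupled coordinate.
M = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
x_metric = project(x, A, b, M)
```

With the identity metric the result is `[0, 2, 3]`; with the coupled metric it is `[0, 1.5, 3]`. Both satisfy the constraint exactly, but the second spreads the correction to the neighbouring coordinate, which is the kind of coherent adjustment the abstract attributes to its skeleton-aware metric.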
HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
Authors: Yangguang Lin, Quan Fang, Yufei Li, Jiachen Sun, Junyu Gao, Jitao Sang
Venue: CVPR 2026
First: 2026-02-26T08:08:25+00:00 · Latest: 2026-02-26T08:08:25+00:00
Comments: accepted at CVPR 2026
Abstract
Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces - visual evidence, conflicting priors, and residual uncertainty - enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs.
中文标题/摘要
标题:HulluEdit:单次通过的一致性子空间编辑以减轻大型视觉-语言模型中的幻觉
大型视觉-语言模型(LVLMs)中的对象幻觉严重阻碍了其可靠部署。现有方法难以在效率和准确性之间取得平衡:它们通常需要昂贵的参考模型和多次前向传递,或者应用静态编辑,这可能会抑制真实的视觉证据。为了解决这个问题,我们引入了HulluEdit,这是一种单次通过、无需参考的干预框架。我们的核心创新是正交子空间编辑:我们将模型的隐藏状态分解为正交子空间——视觉证据、冲突先验和残余不确定性,从而能够选择性地抑制幻觉模式,而不干扰视觉定位。这种方法从数学上保证了对先验子空间的编辑不会影响视觉部分。大量实验表明,HulluEdit在POPE和CHAIR等基准测试中实现了最先进的幻觉减少效果,同时在MME上保持了通用能力,并且保持了高效的推理。我们的方法在对比解码和静态子空间编辑基线中表现更优,为更可信的LVLMs开辟了一条新途径。
Summary / 总结
HulluEdit is a single-pass, reference-free framework designed to mitigate hallucinations in Large Vision-Language Models (LVLMs) by decomposing the hidden states into orthogonal subspaces. This allows for selective suppression of hallucinatory patterns without affecting visual grounding. HulluEdit outperforms existing methods on hallucination reduction benchmarks while maintaining general capabilities and efficient inference.
HulluEdit 是一种单步无参考框架,旨在通过将隐藏状态分解为正交子空间来减轻大型视觉-语言模型(LVLM)中的幻觉问题。这种方法允许选择性地抑制幻觉模式而不影响视觉定位。HulluEdit 在幻觉减少基准测试中表现出色,同时保持了通用能力和高效的推理。
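The claim that edits confined to one orthogonal subspace cannot change the component in another follows directly from orthogonality, and can be checked numerically. In the sketch below the three subspaces come from a random orthonormal basis, which is purely an assumption: how HulluEdit actually estimates the visual-evidence, conflicting-prior, and residual subspaces is not described in the abstract, and the `edit` function is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
# Orthonormal basis of R^d, split into three mutually orthogonal subspaces.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, P, R = Q[:, :2], Q[:, 2:4], Q[:, 4:]   # "visual", "prior", "residual" (hypothetical split)

def edit(h, prior_scale):
    """Rescale only the component of h that lies in the prior subspace."""
    prior_part = P @ (P.T @ h)
    return h - (1.0 - prior_scale) * prior_part

h = rng.normal(size=d)                 # stand-in for a hidden state
h_edited = edit(h, prior_scale=0.0)    # fully suppress the prior component

visual_before, visual_after = V.T @ h, V.T @ h_edited
```

Because the columns of `V` are orthogonal to those of `P`, the visual coordinates are identical before and after the edit, up to floating-point error, regardless of how strongly the prior component is suppressed.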
SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs
Authors: Guanting Ye, Qiyan Zhao, Wenhao Yu, Liangyu Yuan, Mingkai Li, Xiaofeng Zhang, Jianmin Ji, Yanyong Zhang, Qing Jiang, Ka-Veng Yuen
Venue: CVPR 2026
First: 2026-02-26T07:42:15+00:00 · Latest: 2026-02-26T07:42:15+00:00
Comments: CVPR 2026
Abstract
3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model's ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.
中文标题/摘要
标题:SoPE:基于球坐标的位置嵌入以增强3D LVLM的空间感知
基于大型语言模型(LLMs)构建的3D大型视觉-语言模型(3D LVLMs)在各种多模态任务中取得了显著进展。然而,它们继承的位置依赖性建模机制,旋转位置嵌入(RoPE),对于3D多模态理解仍然不够优化。传统的RoPE公式在编码3D标记时无法保留关键的三维空间结构,并且其相对距离计算忽略了角度依赖性,阻碍了模型捕捉视觉表示中的方向变化。为克服这些限制,我们引入了基于球坐标的位置嵌入(SoPE)。我们的方法将点云标记索引映射到3D球坐标空间,从而实现空间位置和方向角度的统一建模。这种表示形式保留了点云数据的固有几何结构,增强了空间意识,并为多模态学习提供了更一致和表达力更强的几何表示。此外,我们引入了一种多尺度频率混合策略,以在不同频率域中融合特征信息。在多个3D场景基准上的实验结果验证了我们方法的有效性,而实际部署实验进一步证明了其强大的泛化能力。
Summary / 总结
This paper addresses the limitations of Rotary Position Embedding (RoPE) in 3D Large Vision-Language Models (3D LVLMs) by introducing Spherical Coordinate-based Positional Embedding (SoPE). SoPE maps 3D token indices into a 3D spherical coordinate space, preserving spatial structures and directional angles, which enhances spatial perception. Additionally, a multi-scale frequency mixing strategy is used to fuse feature information across different frequency domains. Experiments on multiple 3D scene benchmarks show the effectiveness of SoPE in improving 3D multimodal understanding, and real-world deployment experiments demonstrate its strong generalization capability.
研究旨在通过改进3D大型视觉语言模型(3D LVLM)的空间感知能力,解决旋转位置嵌入(RoPE)的局限性。提出的球坐标位置嵌入(SoPE)将3D标记索引映射到3D球坐标空间,保留空间结构和方向角度。该方法增强了空间意识,并提供了更一致的几何表示。此外,引入了多尺度频率混合策略,以在不同频率域中融合特征信息。在多个3D场景基准上的实验验证了SoPE的有效性,而实际部署实验进一步展示了其强大的泛化能力。
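As a rough illustration of the coordinate space SoPE works in, the sketch below applies the textbook Cartesian-to-spherical conversion to a single 3D point. The abstract does not specify how token indices are mapped or how the angles enter the rotary embedding, so the function name `to_spherical` and the angle conventions are assumptions made for illustration only, not the authors' formulation.

```python
import math

def to_spherical(x, y, z):
    """Convert a Cartesian point to (radius, polar angle, azimuth).

    Standard physics convention (an assumption, not taken from the paper):
    the polar angle is measured from the +z axis, the azimuth from the +x axis.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # angles are undefined at the origin; use zeros
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle in [0, pi]
    phi = math.atan2(y, x)                          # azimuth in (-pi, pi]
    return r, theta, phi
```

Unlike a flat 1D index, the triple separates distance from direction, which is the property the abstract attributes to the spherical formulation.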
Visual Instruction Pretraining for Domain-Specific Foundation Models
Authors: Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang
First: 2025-09-22T10:57:42+00:00 · Latest: 2026-02-26T07:40:53+00:00
Abstract
Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.
中文标题/摘要
标题:领域特定基础模型的视觉指令预训练
现代计算机视觉正在形成一个闭环,在这个闭环中,感知、推理和生成相互增强。然而,这个闭环仍然不完整:高层推理对低层感知特征基础学习的自上而下的影响尚未得到充分探索。本文通过提出一种新的预训练范式来解决这一差距,以在下游领域预训练基础模型。我们引入了视觉指令预训练(ViTP),这是一种新颖的方法,可以直接利用推理来增强感知。ViTP 将视觉变换器(ViT)主干嵌入到视觉语言模型中,并使用从目标下游领域收集的丰富视觉指令数据集对其进行端到端预训练。ViTP 由我们提出的视觉鲁棒性学习(VRL)驱动,促使 ViT 从稀疏的视觉标记集中学习稳健且领域相关的特征。在 16 个具有挑战性的遥感和医学成像基准测试上的广泛实验表明,ViTP 在多种下游任务中建立了新的最佳性能。代码可在 https://github.com/zcablii/ViTP 获取。
Summary / 总结
This paper aims to enhance the foundational learning of low-level perceptual features in computer vision by incorporating high-level reasoning through a new paradigm called Visual insTruction Pretraining (ViTP). ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end with visual instruction data from target domains. On 16 remote sensing and medical imaging benchmarks, ViTP establishes new state-of-the-art performance across a range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.
本文旨在通过引入高阶推理的新范式Visual insTruction Pretraining (ViTP) 来增强计算机视觉中低级感知特征的基础学习。ViTP 将 Vision Transformer (ViT) 骨干嵌入视觉语言模型中,并在目标领域视觉指令数据上进行端到端预训练。实验结果表明,ViTP 在 16 个遥感和医学成像基准测试上建立了新的最佳性能。代码可在 https://github.com/zcablii/ViTP 获取。
Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning
Authors: Hao Yu, Shuning Jia, Guanghao Li, Wenhao Jiang, Chun Yuan
First: 2026-02-26T07:28:04+00:00 · Latest: 2026-02-26T07:28:04+00:00
Abstract
Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language (DSL) representations, along with an efficient automatic data generation pipeline. This design enables the isolated evaluation of geometric perception independently from reasoning. To exploit the data provided by GeoPerceive for enhancing the geometric perception capabilities of VLMs, we propose GeoDPO, a translator-guided reinforcement learning (RL) framework. GeoDPO employs an NL-to-DSL translator, which is trained on synthetic pairs generated by the data engine of GeoPerceive, to bridge natural language and DSL. This translator facilitates the computation of fine-grained, DSL-level scores, which serve as reward signals in reinforcement learning. We assess GeoDPO on both in-domain and out-of-domain datasets, spanning tasks in geometric perception as well as downstream reasoning. Experimental results demonstrate that, while supervised fine-tuning (SFT) offers only marginal improvements and may even impair performance in out-of-domain scenarios, GeoDPO achieves substantial gains: +26.5% on in-domain data, +8.0% on out-of-domain data, and +39.0% on downstream reasoning tasks. These findings underscore the superior performance and generalization ability of GeoDPO over SFT. All code is released at https://github.com/Longin-Yu/GeoPerceive to ensure reproducibility.
中文标题/摘要
标题:通过翻译引导强化学习提升VLMs的几何感知能力
视觉-语言模型(VLMs)在几何推理方面常常遇到困难,因为它们对基本图表元素的感知能力有限。为了解决这一挑战,我们引入了GeoPerceive,这是一个包含图表实例及其领域特定语言(DSL)表示的基准测试,以及一个高效的自动数据生成管道。这种设计使得几何感知的评估可以独立于推理进行。为了利用GeoPerceive提供的数据来提升VLMs的几何感知能力,我们提出了GeoDPO,这是一种翻译引导的强化学习(RL)框架。GeoDPO 使用一个从GeoPerceive数据引擎生成的合成对中训练的自然语言到DSL的翻译器,以连接自然语言和DSL。该翻译器使细粒度的DSL级别评分的计算成为可能,这些评分作为强化学习中的奖励信号。我们在领域内和领域外数据集上评估了GeoDPO,涵盖了几何感知任务以及下游推理任务。实验结果表明,虽然监督微调(SFT)仅提供了微小的改进,并且在领域外场景中甚至可能损害性能,但GeoDPO实现了显著的提升:领域内数据上提高了26.5%,领域外数据上提高了8.0%,下游推理任务上提高了39.0%。这些发现突显了GeoDPO相对于SFT的优越性能和泛化能力。所有代码已发布在https://github.com/Longin-Yu/GeoPerceive以确保可再现性。
Summary / 总结
The paper addresses the limitation of VLMs in geometric reasoning by introducing GeoPerceive, a benchmark with DSL representations and a data generation pipeline. GeoDPO, a translator-guided RL framework, is proposed to enhance VLMs' geometric perception. GeoDPO uses an NL-to-DSL translator trained on synthetic data to compute fine-grained scores as reward signals. Experiments show that, whereas supervised fine-tuning yields only marginal improvements, GeoDPO achieves gains of +26.5% on in-domain data, +8.0% on out-of-domain data, and +39.0% on downstream reasoning tasks.
论文通过引入包含DSL表示和数据生成管道的GeoPerceive基准,解决了VLMs在几何推理方面的局限性。提出了GeoDPO,一种基于翻译器的强化学习框架,以增强VLMs的几何感知能力。GeoDPO使用一个在合成数据上训练的NL-to-DSL翻译器来计算细粒度的得分作为奖励信号。实验表明,GeoDPO在域内和域外数据集以及下游推理任务上均显著优于监督微调方法。
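To make the idea of a DSL-level reward concrete, here is a minimal sketch that scores a translated DSL output against a reference by F1 over statement sets, then picks a preferred and a rejected candidate from the scores. The abstract does not describe the actual scoring function, the DSL syntax, or how preference pairs are built, so `dsl_score`, `build_preference_pair`, and statement strings such as `Line(A,B)` are all hypothetical.

```python
def dsl_score(predicted, reference):
    """F1 overlap between two collections of DSL statements (hypothetical metric)."""
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)

def build_preference_pair(candidates, reference):
    """Return (chosen, rejected): the highest- and lowest-scoring candidates."""
    ranked = sorted(candidates, key=lambda c: dsl_score(c, reference))
    return ranked[-1], ranked[0]

# Usage with made-up statements standing in for translator output.
reference = ["Line(A,B)", "Circle(O)"]
candidates = [["Line(A,B)"], ["Point(X)"], ["Line(A,B)", "Circle(O)"]]
chosen, rejected = build_preference_pair(candidates, reference)
```

A statement-level score of this kind is finer-grained than an exact-match check on the whole description, which is the motivation the abstract gives for translating to DSL before scoring.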
Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification
Authors: Tao Huang, Rui Wang, Xiaofei Liu, Yi Qin, Li Duan, Liping Jing
Venue: ICLR 2026
First: 2026-02-05T10:51:39+00:00 · Latest: 2026-02-26T07:10:35+00:00
Comments: Accepted to ICLR 2026. Code is available at https://github.com/HT86159/EUQ
Abstract
Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. However, when presented with incompetent or adversarial inputs, they frequently produce unreliable or even harmful content, such as fact hallucinations or dangerous instructions. This misalignment with human expectations, referred to as misbehaviors of LVLMs, raises serious concerns for deployment in critical applications. These misbehaviors are found to stem from epistemic uncertainty, specifically either conflicting internal knowledge or the absence of supporting information. However, existing uncertainty quantification methods, which typically capture only overall epistemic uncertainty, have shown limited effectiveness in identifying such issues. To address this gap, we propose Evidential Uncertainty Quantification (EUQ), a fine-grained method that captures both information conflict and ignorance for effective detection of LVLM misbehaviors. In particular, we interpret features from the model output head as either supporting (positive) or opposing (negative) evidence. Leveraging Evidence Theory, we model and aggregate this evidence to quantify internal conflict and knowledge gaps within a single forward pass. We extensively evaluate our method across four categories of misbehavior, including hallucinations, jailbreaks, adversarial vulnerabilities, and out-of-distribution (OOD) failures, using state-of-the-art LVLMs, and find that EUQ consistently outperforms strong baselines, showing that hallucinations correspond to high internal conflict and OOD failures to high ignorance. Furthermore, layer-wise evidential uncertainty dynamics analysis helps interpret the evolution of internal representations from a new perspective. The source code is available at https://github.com/HT86159/EUQ.
中文标题/摘要
标题:通过证据不确定性量化检测大型视觉-语言模型的不当行为
大型视觉-语言模型(LVLMs)在多模态理解和生成方面取得了显著进展。然而,当面对无能或对抗性输入时,它们经常生成不可靠甚至有害的内容,如事实幻觉或危险指令。这种与人类期望的不一致,被称为LVLMs的不当行为,对关键应用的部署提出了严重关切。这些不当行为源于认识不确定性,具体来说是内部知识冲突或缺乏支持信息。然而,现有的不确定性量化方法通常只能捕捉整体认识不确定性,对于识别此类问题效果有限。为解决这一问题,我们提出了一种细粒度的方法——证据不确定性量化(EUQ),该方法能够同时捕捉信息冲突和无知,从而有效检测LVLM的不当行为。特别是,我们将模型输出头的特征解释为支持(正面)或反对(负面)证据。利用证据理论,我们建模并聚合这些证据,在单次前向传播中量化内部冲突和知识空白。我们使用最先进的LVLMs在四个类别(幻觉、脱逃、对抗性漏洞和分布外失败)的不当行为上广泛评估了该方法,并发现EUQ始终优于强基线,表明幻觉对应于高内部冲突,分布外失败对应于高无知。此外,逐层证据不确定性动态分析有助于从新视角解释内部表示的演变。源代码可在https://github.com/HT86159/EUQ获取。
Summary / 总结
The paper addresses the issue of misbehaviors in large vision-language models (LVLMs) by proposing Evidential Uncertainty Quantification (EUQ), which captures both information conflict and ignorance. EUQ interprets model output features as evidence and uses Evidence Theory to quantify internal conflict and knowledge gaps. The method is evaluated across four categories of misbehavior and shows superior performance compared to existing methods, indicating that hallucinations are associated with high internal conflict and out-of-distribution failures with high ignorance. Layer-wise analysis further helps interpret the evolution of internal representations.
论文通过提出证据不确定性量化(EUQ)方法来解决大型视觉-语言模型(LVLM)的不当行为问题,该方法能够捕捉信息冲突和无知。EUQ 将模型输出特征解释为证据,并利用证据理论量化内部冲突和知识空白。该方法在四个类别的不当行为中进行了评估,并且在与现有方法的比较中表现出更优的性能,表明幻觉与高内部冲突相关,而分布外失败与高无知相关。逐层分析进一步有助于从新视角解释内部表示的演变。
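The distinction between conflict and ignorance can be illustrated with textbook Dempster-Shafer evidence theory. The sketch below combines two mass functions over a two-hypothesis frame with Dempster's rule and reports the conflict mass. It is not the EUQ procedure: the abstract does not say how evidence is extracted from the output head or aggregated, and the keys `yes`, `no`, and `either` (mass left on the whole frame, i.e. ignorance) are names chosen here for illustration.

```python
def combine(m1, m2):
    """Dempster's rule for two mass functions over {'yes', 'no', 'either'}.

    'either' holds the mass assigned to the whole frame (ignorance).
    Returns the combined mass function and the conflict between the inputs.
    """
    conflict = m1["yes"] * m2["no"] + m1["no"] * m2["yes"]
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; combination undefined")
    norm = 1.0 - conflict
    combined = {
        "yes": (m1["yes"] * m2["yes"] + m1["yes"] * m2["either"]
                + m1["either"] * m2["yes"]) / norm,
        "no": (m1["no"] * m2["no"] + m1["no"] * m2["either"]
               + m1["either"] * m2["no"]) / norm,
        "either": m1["either"] * m2["either"] / norm,
    }
    return combined, conflict

# One source mostly supports, the other mostly opposes: high conflict.
supporting = {"yes": 0.8, "no": 0.0, "either": 0.2}
opposing = {"yes": 0.0, "no": 0.6, "either": 0.4}
combined, conflict = combine(supporting, opposing)
```

In this framing, two strongly disagreeing sources produce a large conflict value, while two sources that both leave most mass on `either` produce a combined function dominated by ignorance. The abstract associates the first pattern with hallucinations and the second with OOD failures.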
No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings
Authors: Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son, Sung-Eui Yoon
Venue: ICLR 2026
First: 2026-02-26T07:07:11+00:00 · Latest: 2026-02-26T07:07:11+00:00
Comments: Accepted to ICLR 2026
Abstract
Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model's unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.
Summary / 总结
The research addresses privacy concerns in latent diffusion models by proposing MoFit, a caption-free membership inference framework. It constructs synthetic conditioning inputs overfitted to the target model's generative manifold. MoFit optimizes perturbations to images and derives model-fitted embeddings to enhance separability of member samples. Experiments show MoFit outperforms previous VLM-conditioned methods and achieves performance comparable to caption-dependent approaches across various datasets and diffusion models.
研究通过提出MoFit框架解决了潜在扩散模型的隐私问题,这是一种无需图注的成员推理攻击方法。MoFit构建了与模型生成流形过拟合的合成条件输入。该方法优化扰动以在模型的无条件先验中创建一个替代样本,然后提取一个模型拟合嵌入以增强在没有图注的情况下可分性。实验表明,MoFit优于之前的基于VLM的方法,并且与依赖图注的方法具有竞争力。
SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
Authors: Zhuohang Jiang, Xu Yuan, Haohao Qu, Shanru Lin, Kanglong Liu, Wenqi Fan, Qing Li
First: 2026-02-26T06:55:48+00:00 · Latest: 2026-02-26T06:55:48+00:00
Abstract
The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.
中文标题/摘要
标题:SUPERGLASSES:将视觉语言模型作为智能眼镜的智能代理进行基准测试
随着AI驱动的智能眼镜这一热门可穿戴设备的迅速发展,多模态交互的新领域被解锁,其中外部知识源上的视觉问答(VQA)成为核心应用。现有的适应智能眼镜的视觉语言模型通常在传统的多模态数据集上进行训练和评估;然而,这些数据集缺乏反映智能眼镜使用场景的多样性和现实性,无法体现其特定挑战,其中准确识别目标对象必须先于任何外部知识检索。为弥合这一差距,我们引入了SUPERGLASSES,这是首个基于智能眼镜设备完全收集的真实数据构建的全面VQA基准。SUPERGLASSES包含2,422个第一人称视角图像-问题对,覆盖14个图像领域和8个查询类别,并附带完整的搜索轨迹和推理注释。我们在该基准上评估了26个代表性视觉语言模型,揭示了显著的性能差距。为解决现有模型的局限性,我们进一步提出了SUPERLENS,这是一种多模态智能眼镜代理,通过结合自动目标检测、查询解耦和多模态网络搜索,实现检索增强的答案生成。我们的代理达到了最先进的性能,超越了GPT-4o 2.19个百分点,并突显了智能眼镜VQA场景中需要任务特定解决方案的需求。
Summary / 总结
The paper introduces SUPERGLASSES, a new VQA benchmark for smart glasses built on real-world data collected by smart glasses devices, addressing the limitations of existing datasets. It evaluates 26 VLMs and finds significant performance gaps. The authors propose SUPERLENS, a multimodal agent that integrates object detection, query decoupling, and web search, achieving state-of-the-art performance and surpassing GPT-4o by 2.19 percent.
论文介绍了SUPERGLASSES,这是一个新的用于智能眼镜的VQA基准,通过使用真实世界数据解决了现有数据集的局限性。它评估了26个VLM,并发现了显著的性能差距。作者提出了SUPERLENS,一个结合了物体检测和网络搜索的多模态代理,其性能超越了GPT-4o 2.19个百分点,强调了在智能眼镜VQA场景中需要专门的任务解决方案。
ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport
Authors: Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham
First: 2026-02-26T06:51:25+00:00 · Latest: 2026-02-26T06:51:25+00:00
Comments: Preprint submitted to Expert Systems with Applications
Abstract
Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and reduced modality gap. The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.
中文标题/摘要
标题:ViCLIP-OT:首个针对越南语图像-文本检索的基础视觉-语言模型,采用最优传输
图像-文本检索已成为智能多媒体系统中的基本组成部分;然而,大多数现有的视觉-语言模型都是针对高资源语言优化的,在越南语等低资源环境中表现不佳。本文介绍了ViCLIP-OT,这是一种专门针对越南语图像-文本检索的基础视觉-语言模型。提出的框架结合了CLIP风格的对比学习和相似性图正则化最优传输(SIGROT)损失,以增强全局跨模态一致性并缓解模态差距问题。在三个越南语基准数据集(UIT-OpenViIC、KTVIC和Crossmodal-3600)上的广泛实验表明,ViCLIP-OT在领域内和零样本设置中均优于CLIP和SigLIP基线。在UIT-OpenViIC上,该模型的平均Recall@K为67.34%,比CLIP提高了5.75个百分点。在Crossmodal-3600上的零样本评估中,ViCLIP-OT比CLIP提高了11.72个百分点。嵌入空间分析进一步证实了更好的对齐和减少的模态差距。结果表明,结合SIGROT为低资源语言的跨模态检索提供了一种有效且可扩展的策略,为越南语和其他未充分代表的语言环境中的智能多媒体检索系统提供了实际意义。
Summary / 总结
The research addresses the limitations of existing vision-language models for low-resource languages like Vietnamese. It introduces ViCLIP-OT, which combines CLIP-style contrastive learning with SIGROT loss to improve cross-modal consistency and reduce modality gaps. Experiments on three Vietnamese benchmarks show that ViCLIP-OT outperforms CLIP and SigLIP, achieving higher Recall@K scores and reducing the modality gap in embedding space.
该研究提出了ViCLIP-OT,一种针对越南语图像-文本检索的基础视觉-语言模型。它结合了CLIP风格的对比学习和SIGROT损失,以增强跨模态一致性并减少模态差距。在三个越南语基准上的实验表明,ViCLIP-OT在领域内和零样本设置中均优于CLIP和SigLIP,特别是在Recall@K和嵌入空间对齐方面取得了显著改进。
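For readers unfamiliar with the retrieval metric quoted above, the following sketch computes Recall@K from a query-by-candidate similarity matrix, assuming that query i's single correct match is candidate i. That pairing convention, the function name `recall_at_k`, and the toy matrix are assumptions for illustration; the abstract does not state which K values are averaged or how multiple captions per image are handled.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose true match (same index) ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy 3x3 similarity matrix (made-up values).
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.6],
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
```

An "average Recall@K" would then be the mean of this value over several cutoffs, computed for both retrieval directions if the benchmark reports both.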
Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
Authors: Changqing Zhou, Yueru Luo, Han Zhang, Zeyu Jiang, Changhao Chen
First: 2026-02-26T06:37:43+00:00 · Latest: 2026-02-26T06:37:43+00:00
Comments: Accepted by CVPR2026
Abstract
Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.
中文标题/摘要
标题:单目开放词汇占用预测用于室内场景
开放词汇3D占用对于具身智能体至关重要,它们需要理解复杂且具有丰富语义类别的室内环境,这些语义类别超越了固定分类体系。虽然近期工作已经探索了开放词汇占用在户外驾驶场景中的应用,但这些方法在室内环境中表现不佳,因为室内几何结构更密集,布局更复杂,语义更为精细。为应对这些挑战,我们采用了一种仅使用二元占用标签(占用 vs 空闲)的几何监督范式。我们的框架基于3D语言嵌入高斯分布,作为统一的中间表示,将精细的3D几何结构与语言对齐的语义嵌入耦合起来。在几何方面,我们发现现有的高斯到占用的操作符在如此弱的监督下无法收敛,因此我们引入了一种基于泊松的透明度感知方法,以稳定体素聚合。在语义方面,直接对齐渲染特征和开放词汇分割特征会受到特征混杂的影响;因此,我们提出了一种渐进温度衰减计划,逐步在点绘制期间增强高斯-语言对齐。在Occ-ScanNet上,我们的框架在开放词汇设置中实现了59.50 IoU和21.05 mIoU,超越了所有现有的占用方法,在IoU上领先,并在mIoU上大幅超越了先前的开放词汇方法。代码将在https://github.com/JuIvyy/LegoOcc上发布。
Summary / 总结
This paper addresses the challenge of predicting open-vocabulary 3D occupancy for indoor scenes, where the geometry is dense and the semantics are fine-grained. The authors propose a geometry-only supervision paradigm using binary occupancy labels and build upon 3D Language-Embedded Gaussians to couple fine-grained 3D geometry with semantic embeddings. They introduce an opacity-aware, Poisson-based approach to stabilize volumetric aggregation and a Progressive Temperature Decay schedule to sharpen opacities during splatting. The framework achieves 59.50 IoU and 21.05 mIoU on Occ-ScanNet in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.
该论文旨在解决预测室内场景中开放词汇3D占用率的挑战,传统方法在密集几何和细粒度语义的情况下失效。作者提出了一种仅基于几何的监督方法,使用二元占用标签,并基于3D语言嵌入高斯模型。他们引入了一种基于透明度的Poisson聚合方法和渐进温度衰减调度,以改善语义对齐。其框架在Occ-ScanNet上实现了59.50 IoU和21.05 mIoU,在IoU上超越了所有现有的占用方法,并在mIoU上大幅优于先前的开放词汇方法。代码将在https://github.com/JuIvyy/LegoOcc发布。
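The two reported numbers measure different things: IoU scores occupied-versus-free geometry, while mIoU averages per-class overlap. A minimal sketch of both over flattened voxel labels follows, assuming label 0 means free space and positive integers are semantic classes. The benchmark's exact protocol (for example, which voxels are masked out) is not given in the abstract, so the function names and conventions here are illustrative only.

```python
def occupancy_iou(pred, gt):
    """Binary IoU: a voxel counts as occupied when its label is non-zero."""
    inter = sum(1 for p, g in zip(pred, gt) if p > 0 and g > 0)
    union = sum(1 for p, g in zip(pred, gt) if p > 0 or g > 0)
    return inter / union if union else 1.0

def mean_iou(pred, gt, classes):
    """Average per-class IoU, skipping classes absent from both inputs."""
    scores = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            scores.append(inter / union)
    return sum(scores) / len(scores) if scores else 0.0

# Six voxels with made-up labels: 0 = free, 1 and 2 = semantic classes.
pred = [0, 1, 1, 2, 0, 2]
gt = [0, 1, 2, 2, 1, 0]
geometry = occupancy_iou(pred, gt)
semantics = mean_iou(pred, gt, classes=[1, 2])
```

In the toy example the geometry score is higher than the semantic one because a voxel can be correctly marked occupied while carrying the wrong class, which is also why the two figures in the abstract differ so widely.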
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
Authors: Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang
Venue: CVPR 2026
First: 2026-02-26T06:13:33+00:00 · Latest: 2026-02-26T06:13:33+00:00
Comments: Accepted by CVPR 2026
Abstract
Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by +0.031 ImageReward at 4.87× speedup and even surpassing the full-step baseline by +0.028 ImageReward at 3.54× speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at https://github.com/argsss/DPCache.
中文标题/摘要
标题:降噪作为路径规划:基于DPCache的无训练加速扩散模型
扩散模型在图像和视频生成方面取得了显著的成功,但其实际部署仍受到多步迭代采样带来的大量计算开销的阻碍。在加速策略中,基于缓存的方法提供了一种无训练且有效的解决方案,通过在时间步之间重用或预测特征来实现加速。然而,现有方法依赖于固定或局部自适应的时间表,而不考虑去噪轨迹的全局结构,这通常会导致误差累积和视觉伪影。为克服这一限制,我们提出了一种名为DPCache的新型无训练加速框架,将扩散采样的加速问题表述为全局路径规划问题。DPCache从少量校准集中构建路径感知代价张量,以量化在给定前一关键时间步的情况下跳过时间步的路径依赖误差。利用该张量,DPCache采用动态规划选择一个最优的关键时间步序列,以最小化总路径成本同时保持轨迹保真度。在推理过程中,模型仅在这些关键时间步进行完整计算,而中间输出则通过缓存特征高效预测。在DiT、FLUX和HunyuanVideo上的广泛实验表明,DPCache在保持最小质量损失的情况下实现了显著加速,优于先前的加速方法,特别是在FLUX上实现了4.87倍加速时的ImageReward提升0.031,甚至在3.54倍加速时超过了全步基线的ImageReward提升0.028,验证了我们路径感知全局调度框架的有效性。代码将在https://github.com/argsss/DPCache上发布。
Summary / 总结
The paper addresses the computational challenge of diffusion models in image and video generation by proposing DPCache, a training-free acceleration framework. DPCache formulates the acceleration problem as a path planning task, builds a Path-Aware Cost Tensor from a small calibration set, and uses dynamic programming over it to select key timesteps for full computation, while predicting intermediate outputs using cached features. Experiments on DiT, FLUX, and HunyuanVideo show strong speedups with minimal quality loss; on FLUX, DPCache outperforms prior acceleration methods by +0.031 ImageReward at 4.87× speedup and surpasses the full-step baseline by +0.028 ImageReward at 3.54× speedup.
论文提出了一种名为DPCache的无训练加速框架,通过将加速问题表述为路径规划任务,并利用路径感知成本张量选择关键时间步进行完整计算,而使用缓存特征预测中间输出。实验表明,DPCache在DiT、FLUX和HunyuanVideo上实现了显著的加速,同时保持了较低的质量损失,优于之前的加速方法。
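The path-planning view can be sketched as a shortest-path dynamic program over timesteps. The version below is a deliberate simplification: it uses a pairwise cost `cost[i][j]` for computing fully at timestep i and next at timestep j, and so drops the conditioning on the preceding key timestep that the paper's cost tensor carries. The function name `plan_key_timesteps`, the end sentinel, and the toy cost are assumptions, not the released implementation.

```python
def plan_key_timesteps(cost, num_keys):
    """Pick num_keys key timesteps (always including 0) with minimal total cost.

    cost is a (T+1) x (T+1) matrix; index T is a sentinel marking the end of
    sampling, and cost[i][j] is the error of skipping every step between i and j.
    """
    T = len(cost) - 1
    if not 1 <= num_keys <= T:
        raise ValueError("num_keys must be between 1 and the number of timesteps")
    INF = float("inf")
    # best[n][j]: cheapest way to reach node j after n full computations.
    best = [[INF] * (T + 1) for _ in range(num_keys + 1)]
    prev = [[None] * (T + 1) for _ in range(num_keys + 1)]
    best[0][0] = 0.0
    for n in range(1, num_keys + 1):
        for j in range(1, T + 1):
            for i in range(j):
                candidate = best[n - 1][i] + cost[i][j]
                if candidate < best[n][j]:
                    best[n][j] = candidate
                    prev[n][j] = i
    # Walk back from the sentinel to recover the chosen key timesteps.
    keys, j = [], T
    for n in range(num_keys, 0, -1):
        j = prev[n][j]
        keys.append(j)
    return keys[::-1], best[num_keys][T]

# Toy cost: squared number of skipped steps between consecutive keys.
T = 4
toy_cost = [[(j - i - 1) ** 2 if j > i else 0 for j in range(T + 1)]
            for i in range(T + 1)]
keys, total = plan_key_timesteps(toy_cost, num_keys=2)
```

With the toy cost, two keys are best placed at timesteps 0 and 2, splitting the skipped steps evenly. Handling the path-dependent tensor described in the abstract would require extending the DP state to track the previous key as well.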