arXiv Paper Digest

2026-03-13 03:52
Snapshot: 20260313_0352
Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style
Authors: Marvin Limpijankit, Milad Alshomary, Yassin Oulad Daoud, Amith Ananthram, Tim Trombley, Elias Stengel-Eskin, Mohit Bansal, Noam M. Elcott, Kathleen McKeown
First: 2026-03-11T17:49:45+00:00 · Latest: 2026-03-11T17:49:45+00:00
Comments: 12 pages, 12 figures
Abstract
VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generation of art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs' ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature and 90% of concepts used to predict style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might "understand" a concept in more formal terms, such as dark/light contrasts.
中文标题/摘要
标题:AI能否像艺术史家一样看画?视觉语言模型如何识别艺术风格的解读
视觉语言模型(VLMs)在一系列计算机视觉任务中变得越来越熟练,包括视觉问答和物体检测。这包括在艺术领域的强大能力,从分析艺术品到生成艺术。在计算机科学家与艺术史家的跨学科合作中,我们描述了VLMs预测艺术风格的机制,并评估了它们与艺术史家用来推理艺术风格的标准的一致性程度。我们采用潜在空间分解方法来识别驱动艺术风格预测的概念,并进行了定量评估、因果分析和艺术史家的评估。我们的研究发现,73%提取的概念被认为由艺术史家判断具有连贯且语义上有意义的视觉特征,90%用于预测特定艺术品风格的概念被认为相关。在使用不相关概念成功预测风格的情况下,艺术史家指出了可能的原因;例如,模型可能“理解”概念在更形式化的层面,如明暗对比。
Summary / 总结
The study investigates how Vision Language Models (VLMs) recognize artistic style and assesses how far the concepts they rely on align with the criteria art historians use. Using a latent-space decomposition approach, the researchers extract the concepts that drive art style prediction and evaluate them quantitatively, causally, and with art historians. Art historians judged 73% of the extracted concepts to exhibit a coherent, semantically meaningful visual feature, and 90% of the concepts used to predict the style of a given artwork to be relevant. Where an irrelevant concept led to a correct style prediction, art historians suggested possible reasons, for example that the model may "understand" the concept in more formal terms, such as dark/light contrasts.
研究探讨了视觉语言模型(VLMs)如何识别艺术风格,并将其性能与艺术史学家进行了比较。通过使用潜在空间分解方法,研究人员确定了驱动艺术风格预测的关键概念。研究发现,73%的概念符合艺术史学家的标准,而用于预测风格的概念中有90%被认为是相关的。在使用无关概念成功预测风格的情况下,艺术史学家提供了解释,例如模型可能从形式角度理解概念,如明暗对比。
Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity
Authors: Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, Wenjie Pei
First: 2026-03-10T10:31:58+00:00 · Latest: 2026-03-11T17:27:13+00:00
Comments: accepted by ICLR2026
Abstract
Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8× faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.
中文标题/摘要
标题:去冗存精,协同重要性多样性:VLMs中的视觉标记压缩
视觉语言模型(VLMs)因视觉标记过度生成面临显著的计算效率问题。尽管先前工作表明大量视觉标记是冗余的,但现有压缩方法难以在重要性保存和信息多样性之间取得平衡。为解决这一问题,我们提出了一种名为PruneSID的无训练协同重要性多样性方法,其包含两阶段管道:(1)主语义成分分析(PSCA)用于将标记聚类为语义一致的组,确保全面的概念覆盖;(2)组内非最大抑制(NMS)用于去除冗余标记同时保留每个组内的关键代表性标记。此外,PruneSID还引入了一种基于图像复杂性的信息感知动态压缩比机制,根据图像复杂性优化标记压缩率,从而在多种场景中实现更有效的平均信息保存。大量实验表明,PruneSID在LLaVA-1.5上达到96.3%的准确率,仅保留11.1%的标记,并在LLaVA-NeXT上以5.6%的极端压缩率实现92.8%的准确率,相比先前方法提高了2.5%,且预填充速度比原模型快7.8倍。我们的框架适用于多种VLMs和图像、视频模态,展示了强大的跨模态通用性。代码可在https://github.com/ZhengyaoFang/PruneSID获取。
Summary / 总结
PruneSID is a training-free method for compressing visual tokens in vision-language models (VLMs) by clustering tokens into semantically coherent groups and pruning redundant tokens within each group. It achieves 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention and 92.8% accuracy at 5.6% token retention on LLaVA-NeXT, outperforming previous methods with faster prefilling speed. The method is versatile across different VLMs and modalities.
研究旨在解决由于冗余视觉标记导致的视觉语言模型(VLMs)的计算效率低下问题。PruneSID 是一种无需训练的方法,采用两阶段管道:主语义组件分析(PSCA)进行标记聚类和组内非最大抑制(NMS)进行冗余标记的修剪,同时保留关键标记。它还包含一种基于信息的动态压缩比率机制。实验表明,PruneSID 在低标记保留率下仍能保持高准确性,优于先前的方法,并提供更快的预填充速度。该方法在不同 VLMs 和模态上具有良好的泛化能力。
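Illustrative sketch for the PruneSID entry above. The second stage, intra-group NMS, can be pictured as greedy suppression by similarity inside each token group. This is a minimal illustration and not the authors' implementation: the function name `intra_group_nms`, the cosine-similarity criterion, the `sim_threshold` value, and the assumption that group ids and importance scores are already available (standing in for PSCA and for the paper's importance measure) are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def intra_group_nms(tokens, scores, groups, sim_threshold=0.9):
    """Greedy NMS inside each group (hypothetical helper).

    tokens: list of feature vectors, one per visual token
    scores: importance score per token (assumed given)
    groups: group id per token (assumed to come from a clustering step)
    Returns the sorted indices of the tokens that are kept.
    """
    kept = []
    for gid in sorted(set(groups)):
        members = [i for i, g in enumerate(groups) if g == gid]
        # Visit the most important tokens first.
        members.sort(key=lambda i: scores[i], reverse=True)
        group_kept = []
        for i in members:
            # Keep a token only if it is not too similar to a kept one.
            if all(cosine(tokens[i], tokens[j]) < sim_threshold
                   for j in group_kept):
                group_kept.append(i)
        kept.extend(group_kept)
    return sorted(kept)
```

Because every group keeps at least its highest-scoring token, each semantic group stays represented after pruning, which is the importance/diversity balance the abstract describes.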
Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity
Authors: Zhengyao Fang, Zexi Jia, Yijia Zhong, Pengcheng Luo, Jinchao Zhang, Guangming Lu, Jun Yu, Wenjie Pei
First: 2026-03-11T17:18:12+00:00 · Latest: 2026-03-11T17:18:12+00:00
Comments: accepted by CVPR2026
Abstract
Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often too vivid to be real even when prompted for realistic-style images. To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. The dataset and code are available at https://github.com/ZhengyaoFang/CFM.
中文标题/摘要
标题:太生动以至于不真实?生成色彩保真度的基准测试与校准
近年来,文本到图像(T2I)生成技术在视觉质量方面取得了显著进步,但生成出与现实世界摄影看起来真实的图像仍然具有挑战性。这在一定程度上是由于现有评估范式的偏见:人类评分和偏好训练的度量标准往往偏好视觉上生动、饱和度和对比度夸张的图像,即使在要求生成现实风格图像时,生成的图像也往往过于生动而不真实。为了解决这一问题,我们提出了色彩保真度数据集(CFD)和色彩保真度度量(CFM),用于客观评估现实风格生成中的色彩保真度。CFD包含超过130万张真实和合成图像,具有不同程度的色彩现实性,而CFM采用多模态编码器学习感知色彩保真度。此外,我们提出了一种无需训练的色彩保真度精炼(CFR),它能够自适应地调节生成中的空间-时间指导尺度,从而增强色彩的真实性。结合CFD支持CFM进行评估,其学习到的注意力进一步引导CFR精炼T2I保真度,形成一个逐步框架,用于评估和改进现实风格T2I生成中的色彩保真度。数据集和代码可在https://github.com/ZhengyaoFang/CFM/获取。
Summary / 总结
This paper addresses the challenge of generating images that look authentic as real-world photographs, which remains difficult despite recent improvements in text-to-image generation. It introduces the Color Fidelity Dataset (CFD) and the Color Fidelity Metric (CFM) to objectively evaluate color fidelity in realistic-style images. CFD consists of over 1.3 million real and synthetic images with ordered levels of color realism, while CFM uses a multimodal encoder to learn perceptual color fidelity. Additionally, a training-free Color Fidelity Refinement (CFR) adaptively modulates the spatial-temporal guidance scale during generation to enhance color authenticity. CFD supports CFM for assessment, and CFM's learned attention in turn guides CFR, forming a progressive framework for assessing and improving color fidelity in realistic-style text-to-image generation.
本文通过提出一个基准数据集和评估指标来解决生成视觉上真实感图像的挑战,该数据集名为Color Fidelity Dataset (CFD),包含超过130万张真实和合成图像,具有不同程度的颜色真实性。Color Fidelity Metric (CFM) 使用多模态编码器来评估感知颜色保真度。此外,还提出了一种无需训练的Color Fidelity Refinement (CFR) 方法,以增强生成图像的颜色真实性。CFD 和 CFM 的结合形成了一种渐进框架,用于评估和提高现实风格文本到图像生成的颜色保真度。
GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations
Authors: Boyuan Chen, Minghao Shao, Siddharth Garg, Ramesh Karri, Muhammad Shafique
First: 2026-03-11T17:04:30+00:00 · Latest: 2026-03-11T17:04:30+00:00
Abstract
Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B), a 6.6 pp improvement, while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2–7.5 pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.
中文标题/摘要
标题:GroundCount:通过对象检测实现视觉语言模型的空间定位以减轻计数幻觉
视觉语言模型(VLMs)在计数任务中表现出持续的幻觉现象,准确率远低于其他视觉推理任务(不包括情感分析)。这一现象在最先进的推理能力VLMs中依然存在。相反,基于CNN的对象检测模型(ODMs)如YOLO在空间定位和实例计数方面表现出色,且计算开销较小。我们提出了一种名为GroundCount的框架,通过从ODMs引入显式空间定位来增强VLMs,以减轻计数幻觉。在最佳情况下,基于提示的增强策略在性能最佳的模型(Ovis2.5-2B)上实现了81.3%的计数准确率,比基线提高了6.6个百分点,同时通过消除幻觉驱动的推理循环将推理时间减少了22%。我们进行了全面的消融研究,表明位置编码是关键组件,对强模型有益但对弱模型有害。相比之下,置信度分数对大多数架构引入了噪声,其移除在五个评估模型中有四个模型中提高了性能。我们进一步评估了特征级融合架构,发现通过结构化提示实现的显式符号定位优于隐式特征融合,尽管有复杂的跨注意力机制。我们的方法在四个评估的VLM架构中(6.2-7.5个百分点)提供了持续改进,其中一个架构由于其迭代反射机制与结构化提示不兼容而表现出性能下降。这些结果表明,计数失败的根本原因在于空间语义整合的局限性,而不是特定架构的缺陷,同时强调了增强策略与架构兼容性的重要性。
Summary / 总结
The paper addresses counting hallucinations in Vision Language Models (VLMs) by proposing GroundCount, a framework that augments VLMs with explicit spatial grounding from CNN-based object detection models (ODMs). Its prompt-based augmentation reaches 81.3% counting accuracy on the best-performing model (Ovis2.5-2B), a 6.6 percentage point improvement, while reducing inference time by 22%. Ablations show that positional encoding is beneficial for stronger models but detrimental for weaker ones, that removing confidence scores improves performance in four of five evaluated models, and that explicit symbolic grounding via structured prompts outperforms implicit feature fusion. Four of the five evaluated architectures improve (6.2 to 7.5 percentage points); the fifth degrades because its iterative reflection mechanisms are incompatible with structured prompts.
论文提出GroundCount框架,通过结合对象检测模型(ODMs)来解决视觉语言模型(VLMs)在计数任务中的幻觉问题。该方法通过将VLMs与ODMs的空间定位相结合,使Ovis2.5-2B模型的计数准确率达到81.3%,同时将推理时间减少了22%。研究还发现,位置编码对较强模型至关重要但对较弱模型有害,明确的符号定位通过结构化提示优于隐式特征融合。总体而言,该方法在五个VLM架构中的四个中一致提高了计数准确性。
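Illustrative sketch for the GroundCount entry above. The abstract describes feeding detector output to the VLM through a structured prompt, with positions and confidence scores as switchable components. The snippet below shows one way such a prompt could be assembled. The prompt wording, the function name `build_grounded_prompt`, and the detection dictionary layout (`label`, `box`, `confidence`) are assumptions made for illustration; the paper's actual prompt format is not given in the abstract.

```python
from collections import Counter


def build_grounded_prompt(question, detections,
                          include_positions=True,
                          include_confidence=False):
    """Turn detector output into a structured text prompt (hypothetical format).

    detections: list of dicts with keys "label", "box" (x1, y1, x2, y2)
    and "confidence". The two flags mirror the ablated components:
    positional information and confidence scores.
    """
    counts = Counter(d["label"] for d in detections)
    lines = ["Detected objects (from an external detector):"]
    for label in sorted(counts):
        lines.append(f"- {label}: {counts[label]}")
    if include_positions:
        lines.append("Instances:")
        for d in detections:
            x1, y1, x2, y2 = d["box"]
            entry = f"- {d['label']} at ({x1}, {y1}, {x2}, {y2})"
            if include_confidence:
                entry += f" conf={d['confidence']:.2f}"
            lines.append(entry)
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Confidence is off by default here only because the abstract reports that removing confidence scores helped four of the five evaluated models; whether positions help depends on the model, so both are left as flags.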
EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation
Authors: Shih-Yang Liu, Maksim Khadkevich, Nai Chit Fung, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang, Saurav Muralidharan, Hongxu Yin, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen
Venue: ICLR 2026
First: 2024-10-28T17:59:03+00:00 · Latest: 2026-03-11T16:03:41+00:00
Comments: ICLR 2026 workshops. Code: https://github.com/NVlabs/EoRA
Abstract
While post-training compression techniques effectively reduce the memory footprint, latency, and power consumption of Large Language Models (LLMs), they often result in noticeable accuracy degradation and remain limited by hardware and kernel constraints that restrict supported compression formats, ultimately reducing flexibility across a wide range of deployment scenarios. In this work, we propose EoRA, a novel fine-tuning-free method that augments compressed LLMs with low-rank matrices, allowing users to rapidly enhance task-specific performance and freely balance the trade-off between accuracy and computational overhead beyond the constraints of compression formats. EoRA consistently outperforms prior training-free low rank methods in recovering the accuracy of compressed LLMs, achieving notable accuracy improvements (e.g., 10.84% on ARC-Challenge, 6.74% on MathQA, and 11.45% on GSM8K) for LLaMA3-8B compressed to 3-bit. We also introduce an optimized CUDA kernel, accelerating inference by up to 1.4x and reducing memory overhead through quantizing EoRA. Overall, EoRA offers a prompt solution for improving the accuracy of compressed models under varying user requirements, enabling more efficient and flexible deployment of LLMs. Code is available at https://github.com/NVlabs/EoRA.
中文标题/摘要
标题:EoRA:基于特征空间低秩逼近的无微调补偿方法
虽然后训练压缩技术有效地减少了大型语言模型(LLM)的内存占用、延迟和功耗,但它们通常会导致明显的准确度下降,并且受限于硬件和内核约束,限制了支持的压缩格式,最终减少了在广泛部署场景中的灵活性。在本文中,我们提出了一种名为EoRA的创新无微调方法,该方法通过低秩矩阵增强压缩的LLM,使用户能够快速提升特定任务的性能,并自由平衡准确度和计算开销之间的权衡,超越压缩格式的限制。EoRA在恢复压缩LLM的准确度方面始终优于先前的无训练低秩方法,在压缩到3比特的LLaMA3-8B上实现了显著的准确度提升(例如,在ARC-Challenge上提高了10.84%,在MathQA上提高了6.74%,在GSM8K上提高了11.45%)。我们还引入了一个优化的CUDA内核,通过量化EoRA加速推理多达1.4倍,并减少内存开销。总体而言,EoRA为满足不同用户需求提高压缩模型的准确度提供了一种简便的解决方案,使LLM的部署更加高效和灵活。代码可在https://github.com/NVlabs/EoRA获取。
Summary / 总结
EoRA is a fine-tuning-free method that enhances the performance of compressed LLMs by adding low-rank matrices, allowing users to adjust the trade-off between accuracy and computational overhead. It consistently outperforms previous training-free low-rank methods, achieving significant accuracy improvements on various benchmarks. EoRA also includes an optimized CUDA kernel that accelerates inference and reduces memory overhead through quantization.
EoRA 是一种无需微调的方法,通过在压缩的 LLM 中添加低秩矩阵来提升其性能,允许用户在准确性和计算开销之间进行权衡。它在多个基准测试中优于之前的无需训练的低秩方法,实现了显著的准确率提升。EoRA 还包含一个优化的 CUDA 内核,可以加速推理并减少内存开销。
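Illustrative sketch for the EoRA entry above. The general idea of compensating a compressed weight matrix with low-rank factors can be shown on a toy example: take the residual between the original and compressed weights and add back its best rank-1 approximation. Note the substitution: this is plain residual low-rank approximation via power iteration, not EoRA's eigenspace projection, whose details are not in the abstract. The rounding quantizer and all function names are hypothetical stand-ins.

```python
import math


def quantize(W, step=0.5):
    """Toy uniform quantizer standing in for a real compression method."""
    return [[round(x / step) * step for x in row] for row in W]


def residual(W, W_other):
    """Element-wise difference W - W_other."""
    return [[a - b for a, b in zip(rw, ro)] for rw, ro in zip(W, W_other)]


def frobenius(M):
    """Frobenius norm of a matrix given as nested lists."""
    return math.sqrt(sum(x * x for row in M for x in row))


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else v


def rank1_compensation(E, iters=50):
    """Power iteration for a leading singular triple of the residual E.

    Returns factors B (m x 1) and A (1 x n) with B @ A = sigma * u v^T.
    """
    m, n = len(E), len(E[0])
    v = _normalize([1.0] * n)
    u = [0.0] * m
    for _ in range(iters):
        u = _normalize([sum(E[i][j] * v[j] for j in range(n))
                        for i in range(m)])
        v = _normalize([sum(E[i][j] * u[i] for i in range(m))
                        for j in range(n)])
    sigma = sum(u[i] * E[i][j] * v[j] for i in range(m) for j in range(n))
    B = [[sigma * ui] for ui in u]
    A = [v]
    return B, A


def apply_compensation(Wq, B, A):
    """Return Wq + B @ A, the compressed weights plus the low-rank term."""
    rank = len(A)
    return [[Wq[i][j] + sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(len(Wq[0]))]
            for i in range(len(Wq))]
```

In a deployment like the one the abstract describes, the compressed weights would stay in their kernel-supported format and the low-rank factors would be applied alongside them; the rank is the knob that trades accuracy against extra compute.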
Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval
Authors: Tong Wang, Yunhan Zhao, Shu Kong
First: 2026-01-31T16:42:55+00:00 · Latest: 2026-03-11T15:40:19+00:00
Abstract
Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a "mental image", based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this "mental image" is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search for the target image. In contrast, we address CIR from first principles by directly generating the "mental image" for more accurate matching. Particularly, we prompt an LMM to generate a "mental image" for a given multimodal query and propose to use this "mental image" to search for the target image. As the "mental image" has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses LMM to construct a "paracosm", where it matches the multimodal query and database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
中文标题/摘要
标题:生成平行宇宙以实现无需训练的零样本组合图像检索
组合图像检索(CIR)是指使用包含参考图像和修改文本的多模态查询从数据库中检索目标图像的任务。文本说明如何修改参考图像以形成“心理图像”,基于此CIR应在数据库中找到目标图像。CIR的基本挑战在于这种“心理图像”是不可物理获取的,仅由查询隐式定义。当代文献追求零样本方法,并使用大型多模态模型(LMM)生成给定多模态查询的文本描述,然后使用视觉语言模型(VLM)进行文本-视觉匹配以搜索目标图像。相反,我们从第一原理出发,直接生成“心理图像”以实现更准确的匹配。特别地,我们提示LMM生成给定多模态查询的“心理图像”,并提议使用此“心理图像”来搜索目标图像。由于“心理图像”与真实图像之间存在合成到现实的领域差距,我们还为数据库中的每个真实图像生成一个合成对应物以促进匹配。因此,我们的方法使用LMM构建一个“平行宇宙”,其中匹配多模态查询和数据库图像。因此,我们称此方法为“平行宇宙”。值得注意的是,平行宇宙是一种无需训练的零样本CIR方法。它在具有挑战性的基准测试中显著优于现有零样本方法,实现了零样本CIR的最新性能。
Summary / 总结
The paper addresses the challenge of Composed Image Retrieval (CIR) by proposing a training-free zero-shot method called Paracosm. It directly generates a 'mental image' using a Large Multimodal Model (LMM) for a given multimodal query and uses this 'mental image' for image retrieval. To bridge the synthetic-to-real domain gap, the method also generates synthetic counterparts for real images in the database. Experimental results show that Paracosm outperforms existing zero-shot methods on challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
该论文通过使用大型多模态模型(LMM)直接生成‘心理图像’,然后在其中搜索目标图像,解决了组成图像检索(CIR)的挑战。这种方法名为Paracosm,构建了一个合成的‘平行宇宙’来匹配查询和数据库图像,其在具有挑战性的基准测试中显著优于现有零样本方法,并实现了零样本CIR的最新性能。该方法无需训练,并使用真实图像的合成对应物来促进匹配。
Ego: Embedding-Guided Personalization of Vision-Language Models
Authors: Soroush Seifi, Simon Gardier, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi
Venue: CVPR
First: 2026-03-10T15:10:41+00:00 · Latest: 2026-03-11T15:26:01+00:00
Comments: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract
AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model's inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model's internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and SOTA methods across various personalization settings including single-concept, multi-concept, and video personalization, demonstrating strong performance gains with minimal personalization overhead.
中文标题/摘要
标题:自我:嵌入引导的视觉语言模型个性化
支持人类日常生活的AI助手正变得越来越可行,这得益于多模态语言模型的迅速发展。一个关键挑战在于克服这些模型的通用性,以提供个性化的体验。现有对大型视觉语言模型进行个性化的做法往往依赖于额外的训练阶段,这限制了通用性和可扩展性,或者依赖于具有外部预训练模块的工程化管道,这阻碍了部署效率。在本文中,我们提出了一种高效的个性化方法,利用模型内在捕捉个性化概念的能力。具体来说,我们通过利用模型内部的注意力机制提取主要代表目标概念的视觉标记。这些标记作为该特定概念的记忆,使模型能够在测试图像中出现时回忆和描述它。我们对我们的方法和当前最佳方法进行了全面统一的评估,涵盖了单概念、多概念和个人化视频等各种个性化设置,展示了在最小个性化开销下取得的强大性能提升。
Summary / 总结
This work addresses the challenge of personalizing vision-language models without additional training stages or external pre-trained modules. The method uses the model's internal attention mechanisms to extract the visual tokens that predominantly represent a target concept; these tokens act as a memory of that concept, letting the model recall and describe it when it appears in test images. In a unified evaluation against state-of-the-art methods across single-concept, multi-concept, and video personalization settings, the approach shows strong performance gains with minimal personalization overhead.
研究旨在通过增强视觉语言模型的个性化,使AI助手更好地支持日常生活。方法利用模型内部的注意力机制提取特定概念的视觉标记,然后用于个性化模型。实验结果显示,在各种个性化设置中,与最新方法相比,这种方法具有显著的性能提升且开销较小。
UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment
Authors: Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang, Ce Hou, Wen Ji, Hao Huang, Yunshan Wan, Jian Yu, Junhao Xia, Yuru Zhang, Chunlei Shi
First: 2026-02-23T02:24:55+00:00 · Latest: 2026-03-11T15:04:20+00:00
Comments: 26 pages
Abstract
Vision-language models (VLMs) can describe urban scenes in rich detail, yet consistently fail to produce reliable human preference labels in domain-specific tasks such as safety assessment and aesthetic evaluation. The standard fix, fine-tuning or RLHF, requires large-scale annotations and model retraining. We ask a different question: can a frozen VLM be aligned with human preferences without modifying any weights? Our key insight is that VLMs are strong concept extractors but poor decision calibrators. We propose a three-stage post-hoc pipeline that exploits this asymmetry: (i) interpretable evaluation dimensions are automatically mined from consensus exemplars; (ii) an Observer-Debater-Judge chain extracts robust concept scores from the frozen VLM; and (iii) locally-weighted ridge regression on a hybrid manifold calibrates these scores to human ratings. Applied as UrbanAlign on Place Pulse 2.0, the framework reaches 72.2% accuracy (kappa=0.45) across six perception categories, outperforming all baselines by +11.0 pp and zero-shot VLM by +15.5 pp, with full interpretability and zero weight modification.
中文标题/摘要
标题:UrbanAlign:后验语义校准以实现VLM-人类偏好对齐
视觉-语言模型(VLMs)可以详细描述城市场景,但在诸如安全评估和审美评价等特定领域任务中,却始终无法产生可靠的人类偏好标签。标准的解决方案,即微调或RLHF,需要大规模注释和模型重新训练。我们提出了一个不同的问题:一个冻结的VLM能否在不修改任何权重的情况下与人类偏好对齐?我们的关键洞察是,VLMs是强大的概念提取器,但决策校准能力较弱。我们提出了一种三阶段的后验管道,利用这种不对称性:(i) 自动从共识示例中挖掘可解释的评估维度;(ii) 观察者-辩手-法官链从冻结的VLM中提取稳健的概念得分;(iii) 在混合流形上进行局部加权岭回归校准这些得分以匹配人类评分。将该框架应用于Place Pulse 2.0,框架在六个感知类别中的准确率达到72.2%(κ=0.45),优于所有基线+11.0个百分点,优于零样本VLM+15.5个百分点,具有完全的可解释性和零权重修改。
Summary / 总结
The research aims to align vision-language models (VLMs) with human preferences in domain-specific tasks without fine-tuning or retraining. It proposes a three-stage post-hoc pipeline: mining evaluation dimensions from consensus exemplars, extracting concept scores using an Observer-Debater-Judge chain, and calibrating these scores with locally-weighted ridge regression. On the Place Pulse 2.0 dataset, UrbanAlign achieves 72.2% accuracy across six perception categories, outperforming all baselines by 11.0 percentage points and zero-shot VLM by 15.5 percentage points, while maintaining full interpretability and zero weight modification.
研究旨在通过后处理方法而非重新训练或微调,使视觉语言模型(VLMs)与城市场景评估的人类偏好保持一致。提出了一种三阶段后处理管道:从共识示例中挖掘评估维度,使用观察者-辩论者-法官链提取概念得分,并通过局部加权岭回归校准这些得分。在Place Pulse 2.0上,UrbanAlign在六个感知类别中的准确率达到72.2%,比所有基线高出11.0个百分点,比零样本VLM高出15.5个百分点,同时保持完全可解释性和无权重修改。
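Illustrative sketch for the UrbanAlign entry above. Stage (iii) calibrates concept scores to human ratings with locally-weighted ridge regression. The snippet shows that technique in its simplest form: one scalar concept score, a Gaussian kernel around the query point, and a ridge penalty on the slope only. The paper's "hybrid manifold" and its multi-dimensional scores are not reproduced here; the function name, kernel choice, and parameter names (`bandwidth`, `lam`) are assumptions for illustration.

```python
import math


def local_ridge_predict(x0, xs, ys, bandwidth=1.0, lam=0.1):
    """Predict a rating at x0 by locally-weighted ridge regression.

    Minimizes sum_i w_i * (y_i - a - b * x_i)^2 + lam * b^2, where
    w_i is a Gaussian kernel weight centered on the query point x0.
    xs: concept scores of calibration examples (assumed 1-D here)
    ys: human ratings for those examples
    """
    weights = [math.exp(-((x - x0) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, xs))
    swy = sum(w * y for w, y in zip(weights, ys))
    swxx = sum(w * x * x for w, x in zip(weights, xs))
    swxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    # 2x2 normal equations, solved with Cramer's rule.
    det = sw * (swxx + lam) - swx * swx
    if det == 0:
        raise ValueError("degenerate fit: need distinct xs or lam > 0")
    a = (swy * (swxx + lam) - swx * swxy) / det
    b = (sw * swxy - swx * swy) / det
    return a + b * x0
```

The model weights are never touched: only `a` and `b` are fitted, and they are refitted for each query point, which is what makes the calibration both post-hoc and local.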
HanMoVLM: Large Vision-Language Models for Professional Artistic Painting Evaluation
Authors: Hongji Yang, Yucheng Zhou, Wencheng Han, Songlian Li, Xiaotong Zhao, Jianbing Shen
First: 2026-03-11T14:21:59+00:00 · Latest: 2026-03-11T14:21:59+00:00
Comments: 14 pages
Abstract
While Large Vision-Language Models (VLMs) demonstrate impressive general visual capabilities, they remain artistically blind and unable to offer professional evaluation of artworks within specific artistic domains like human experts. To bridge this gap, we transform VLMs into experts capable of professional-grade painting evaluation in the Chinese Artistic Domain, which is more abstract and demands extensive artistic training for evaluation. We introduce HanMo-Bench, a new dataset that features authentic auction-grade masterpieces and AI-generated works, grounded in real-world market valuations. To realize the rigorous judgment, we propose the HanMoVLM and construct a Chain-of-Thought (CoT) validated by experts. This CoT guides the model to perform expert-level reasoning: from content identification and Region of Interest (RoI) localization to professional evaluation, guided by both theme-specific evaluation and typical three-tier evaluation in Chinese paintings. Furthermore, we design a reward function to refine the reasoning process of the HanMoVLM to improve the accuracy. We demonstrate that HanMoVLM can serve as a critical backbone for Test-time Scaling in image generation. By acting as a high-quality verifier, HanMoVLM enables generative models to select the most artistically superior outputs from multiple candidates. Experimental results and human studies confirm that the proposed HanMoVLM effectively bridges the gap, achieving a high consistency with professional experts and significantly improving the quality of Chinese Painting generation.
中文标题/摘要
标题:HanMoVLM:专业艺术绘画评估的大规模视觉-语言模型
虽然大规模视觉-语言模型(VLMs)展示了令人印象深刻的通用视觉能力,但它们仍然缺乏艺术眼光,无法在特定的艺术领域如人类专家那样对艺术品进行专业评估。为弥合这一差距,我们将VLMs转化为能够在中国艺术领域进行专业级绘画评估的专家,该领域更加抽象,需要广泛的艺术训练才能进行评估。我们引入了HanMo-Bench,这是一个新数据集,包含真实的拍卖级大师作品和AI生成的作品,基于实际市场估值。为了实现严格的判断,我们提出了HanMoVLM,并构建了一个由专家验证的思维链(CoT)。该CoT指导模型进行专家级推理:从内容识别和兴趣区域(RoI)定位到专业评估,由主题特定评估和中国画的典型三级评估指导。此外,我们设计了一个奖励函数来细化HanMoVLM的推理过程,以提高准确性。我们证明HanMoVLM可以作为图像生成中测试时扩展的关键骨干。通过作为高质量验证器,HanMoVLM使生成模型能够从多个候选作品中选择最艺术上优越的输出。实验结果和人类研究证实,提出的HanMoVLM有效地弥合了这一差距,实现了与专业专家的高度一致性,并显著提高了中国画生成的质量。
Summary / 总结
The research aims to turn Large Vision-Language Models (VLMs) into professional-grade evaluators of Chinese paintings. The authors introduce HanMo-Bench, a dataset of authentic auction-grade masterpieces and AI-generated works grounded in real-world market valuations, and HanMoVLM, which follows an expert-validated Chain-of-Thought (CoT) that moves from content identification and Region of Interest (RoI) localization to professional evaluation. A reward function further refines the model's reasoning. Experimental results and human studies show that HanMoVLM achieves high consistency with professional experts and, used as a verifier for test-time scaling, significantly improves the quality of Chinese painting generation.
研究旨在提升大型视觉-语言模型(VLMs)以实现对中国艺术绘画的专业评估。方法包括创建汉莫基准数据集,包含真实和AI生成的艺术品,并开发汉莫VLM,该模型通过专家验证的链式思考(CoT)进行引导,完成内容识别、区域兴趣(RoI)定位和专业评估。研究表明,汉莫VLM能够准确评估绘画作品,并通过作为生成模型的高质量验证器来提高中国绘画生成的质量。
Taking Shortcuts for Categorical VQA Using Super Neurons
Authors: Pierre Musacchio, Jaeyi Jeong, Dahun Kim, Jaesik Park
First: 2026-03-11T13:54:45+00:00 · Latest: 2026-03-11T13:54:45+00:00
Comments: 25 pages, 15 tables, 8 figures
Abstract
Sparse Attention Vectors (SAVs) have emerged as an excellent training-free alternative to supervised finetuning or low-rank adaptation to improve the performance of Vision Language Models (VLMs). At their heart, SAVs select a few accurate attention heads for a task of interest and use them as classifiers, rather than relying on the model's prediction. In a similar spirit, we find that directly probing the raw activations of the VLM, in the form of scalar values, is sufficient to yield accurate classifiers on diverse visually grounded downstream tasks. Shifting focus from attention vectors to scalar activations dramatically increases the search space for accurate parameters, allowing us to find more discriminative neurons immediately from the first generated token. We call such activations Super Neurons (SNs). In this probing setting, we discover that enough SNs appear in the shallower layers of the large language model to allow for extreme early exiting from the first layer of the model at the first generated token. Compared to the original network, SNs robustly improve the classification performance while achieving a speedup of up to 5.10x.
中文标题/摘要
标题:利用超神经元进行分类型VQA的捷径
稀疏注意向量(SAVs)已成为一种优秀的无需训练替代方法,用于提高视觉语言模型(VLMs)的性能,替代监督微调或低秩适应。SAVs的核心在于选择几个针对特定任务的准确注意头,并使用它们作为分类器,而不是依赖模型的预测。类似地,我们发现直接探测VLM的原始激活值,以标量值的形式,足以在多种视觉导向的下游任务中获得准确的分类器。从注意向量转向标量激活显著增加了准确参数的搜索空间,使我们能够从第一个生成的标记开始立即找到更具区分性的神经元。我们称这些激活为超神经元(SNs)。在探测设置中,我们发现足够多的SNs出现在大型语言模型的较浅层中,允许在模型的第一个生成标记时从第一层极端地提前退出。与原始网络相比,SNs稳健地提高了分类性能,同时实现了高达5.10倍的速度提升。
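Illustrative sketch for the Super Neurons entry above. The abstract describes probing individual scalar activations and using the most discriminative ones as classifiers. A minimal version of that idea for a binary task is shown below. The midpoint-threshold probe, the majority vote, and all function names are assumptions made for illustration; the paper's actual selection procedure is not given in the abstract.

```python
def fit_neuron_probe(values, labels):
    """Fit a threshold classifier on one scalar activation.

    values: activation of a single neuron for each sample
    labels: 0/1 label per sample (both classes must be present)
    Returns (threshold, sign, accuracy) on the given samples.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos >= mean_neg else -1
    preds = [1 if sign * (v - threshold) > 0 else 0 for v in values]
    acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return threshold, sign, acc


def select_super_neurons(activations, labels, top_k=1):
    """Rank neurons by probe accuracy and return the best top_k.

    activations: list of samples, each a list of scalar activations
    Returns a list of (accuracy, neuron_index, threshold, sign).
    """
    probes = []
    for j in range(len(activations[0])):
        column = [row[j] for row in activations]
        threshold, sign, acc = fit_neuron_probe(column, labels)
        probes.append((acc, j, threshold, sign))
    probes.sort(key=lambda p: (-p[0], p[1]))
    return probes[:top_k]


def predict(probes, activation_row):
    """Majority vote of the selected neuron probes on one sample."""
    votes = sum(1 if sign * (activation_row[j] - threshold) > 0 else 0
                for _, j, threshold, sign in probes)
    return 1 if votes * 2 > len(probes) else 0
```

If the selected neurons all sit in an early layer, the forward pass can stop there, which is the source of the early-exit speedup the abstract reports.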
Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers
Authors: Wenhao Sun, Ji Li, Zhaoqiang Liu
First: 2026-03-11T13:16:41+00:00 · Latest: 2026-03-11T13:16:41+00:00
Comments: Accepted by CVPR2026
Abstract
Diffusion Transformers have established a new state-of-the-art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. While existing acceleration methods often focus on the temporal domain, they overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. The uniform computational treatment of all spatial regions represents a critical inefficiency. In this paper, we introduce Just-in-Time (JiT), a novel training-free framework that addresses this challenge by acceleration in the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the full latent state evolution based on computations from a dynamically selected, sparse subset of anchor tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a 7x speedup with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a new and superior trade-off between inference speed and generation fidelity.
中文标题/摘要
标题:Just-in-Time: 无需训练的空间加速方法用于扩散变换器
扩散变换器已在图像合成领域确立了新的前沿地位,但迭代采样的高计算成本严重阻碍了其实用部署。尽管现有的加速方法通常集中在时间域,但它们忽视了生成过程中固有的大量空间冗余,即全局结构在细粒度细节形成之前就已经出现。对所有空间区域的均匀计算处理是一个关键的低效率。在本文中,我们引入了Just-in-Time (JiT),这是一种新颖的无需训练的框架,通过在空间域加速来解决这一挑战。JiT 形式化了一个基于动态选择的稀疏锚定标记的子集进行计算的空间近似生成常微分方程(ODE),以驱动整个潜在状态的演变。为了确保在新标记被纳入以扩展潜在状态维度时无缝过渡,我们提出了一种确定性微流,这是一种简单且有效的有限时间 ODE,能够保持结构连贯性和统计正确性。在最先进的 FLUX.1-dev 模型上的广泛实验表明,JiT 可以实现高达 7 倍的速度提升,几乎无性能损失,显著优于现有加速方法,并建立了推理速度和生成保真度之间新的和更优的权衡。
Summary / 总结
The paper introduces Just-in-Time (JiT), a training-free framework that accelerates diffusion transformers for image synthesis by exploiting spatial redundancy. JiT formulates a spatially approximated generative ODE that drives the full latent state from computations on a dynamically selected sparse subset of anchor tokens, and uses a deterministic micro-flow to keep structural coherence and statistical correctness as new tokens are incorporated. Experiments on FLUX.1-dev show that JiT achieves up to a 7x speedup with nearly lossless performance, significantly outperforming existing acceleration methods.
本文提出了Just-in-Time (JiT)框架,这是一种无需训练的加速方法,用于解决扩散变换器在图像合成中的高计算成本问题。JiT通过动态选择的稀疏锚点令牌计算生成的近似空间ODE,确保结构连贯性和统计正确性。实验表明,JiT可以实现高达7倍的加速,并且几乎不损失性能,优于现有方法。
MonitorVLM: A Vision Language Framework for Safety Violation Detection in Mining Operations
Authors: Jiang Wu, Sichao Wu, Yinsong Ma, Guangyuan Yu, Haoyuan Xu, Lifang Zheng, Jingliang Duan
First: 2025-10-04T04:46:21+00:00 · Latest: 2026-03-11T12:33:18+00:00
Abstract
Industrial accidents, particularly in high-risk domains such as surface and underground mining, are frequently caused by unsafe worker behaviors. Traditional manual inspection remains labor-intensive, error-prone, and insufficient for large-scale, dynamic environments, highlighting the urgent need for intelligent and automated safety monitoring. In this paper, we present MonitorVLM, a novel vision-language framework designed to detect safety violations directly from surveillance video streams. MonitorVLM introduces three key innovations: (1) a domain-specific violation dataset comprising 9,000 vision-question-answer (VQA) samples across 40 high-frequency mining regulations, enriched with augmentation and auxiliary detection cues; (2) a clause filter (CF) module that dynamically selects the Top-K most relevant clauses, reducing inference latency by 13.56% while maintaining accuracy; and (3) a behavior magnifier (BM) module that enhances worker regions to improve fine-grained action recognition, yielding additional gains of 3.45% in precision and 8.62% in recall. Experimental results demonstrate that MonitorVLM significantly outperforms baseline vision-language models, achieving improvements of 22.01% in precision, 34.22% in recall, and 28.37% in F1 score over the 72B unfine-tuned baseline. A lightweight web-based interface further integrates MonitorVLM into practical workflows, enabling automatic violation reporting with video timestamping. This study highlights the potential of multimodal large models to enhance occupational safety monitoring in mining and beyond.
中文标题/摘要
标题:MonitorVLM:一种用于采矿作业安全违规检测的视觉语言框架
工业事故,尤其是在露天和地下采矿等高风险领域,通常由不安全的工人行为引起。传统的人工检查仍然劳动密集、容易出错,且不足以应对大规模、动态的环境,突显了对智能化、自动化安全监控的迫切需求。在本文中,我们提出了MonitorVLM,这是一种新型的视觉-语言框架,旨在直接从监控视频流中检测安全违规行为。MonitorVLM 引入了三个关键创新:(1)一个特定领域的违规数据集,包含覆盖40项高频采矿规定的9,000个视觉-问题-答案(VQA)样本,并通过数据增强和辅助检测线索加以丰富;(2)一个条款过滤器(CF)模块,动态选择最相关的Top-K条款,将推理延迟减少13.56%,同时保持准确性;(3)一个行为放大器(BM)模块,增强工人区域以提高细粒度动作识别,使精确度和召回率分别额外提高3.45%和8.62%。实验结果表明,MonitorVLM 显著优于基线视觉-语言模型,相较于未微调的72B基线,精确度提高了22.01%,召回率提高了34.22%,F1分数提高了28.37%。一个轻量级的基于Web的界面进一步将MonitorVLM 集成到实际工作流程中,实现带有视频时间戳的自动违规报告。本研究突显了多模态大模型在采矿及其他领域增强职业安全监控的潜力。
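The clause filter described above keeps only the Top-K regulation clauses most relevant to the current scene. The abstract does not say how relevance is scored, so the following is a minimal sketch that assumes cosine similarity between a scene embedding and per-clause embeddings; the clause names, the 3-d vectors, and the function `top_k_clauses` are hypothetical and only illustrate the selection step.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_clauses(scene_vec, clause_vecs, k):
    """Return the ids of the k clauses most similar to the scene embedding."""
    ranked = sorted(
        clause_vecs,
        key=lambda cid: cosine(scene_vec, clause_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings, for illustration only (not from the paper).
clauses = {
    "helmet_required": [0.9, 0.1, 0.0],
    "no_smoking": [0.0, 1.0, 0.1],
    "harness_at_height": [0.7, 0.0, 0.6],
}
scene = [1.0, 0.0, 0.2]
selected = top_k_clauses(scene, clauses, k=2)
print(selected)
```

Passing only the selected clauses to the model shortens the prompt, which is one plausible reason a filter of this kind would reduce inference latency.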
WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation
Authors: Rafi Ibn Sultan, Hui Zhu, Xiangyu Zhou, Chengyin Li, Prashant Khanduri, Marco Brocanelli, Dongxiao Zhu
Venue: CVPR
First: 2026-03-11T12:15:40+00:00 · Latest: 2026-03-11T12:15:40+00:00
Comments: Accepted by CVPR-2026
Abstract
Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are available on the project website: https://sites.google.com/view/walkgpt-26/home
中文标题/摘要
标题:WalkGPT:基于深度感知分割的行人导航视觉-语言对话
确保无障碍行人导航需要对复杂城市场景的语义和空间方面进行推理,而现有的大型视觉-语言模型(LVLM)难以应对这一挑战。尽管这些模型可以描述视觉内容,但它们缺乏明确的定位,导致物体幻觉和不可靠的深度推理,限制了它们在无障碍指导中的实用性。我们提出了WalkGPT,这是一种像素级定位的LVLM,面向“定位导航指导”这一新任务,将语言推理和分割统一在一个架构中,以实现深度感知的无障碍指导。给定一个行人视角的图像和导航查询,WalkGPT生成一个带有分割掩码的对话式响应,这些掩码界定了无障碍和有害特征,并提供相对深度估计。该模型包含一个多尺度查询投影器(MSQP),通过在空间层次上沿文本标记聚合图像标记来塑造最终图像标记,以及一个校准文本投影器(CTP),由提出的区域对齐损失引导,将语言嵌入映射到分割感知表示。这些组件使模型能够在无需用户提供的提示或锚点的情况下实现精细的定位和深度推理,从而生成完整且现实的导航指导。我们还引入了PAVE,这是一个大规模基准,包含41k张行人视角图像,并配有关注无障碍的问题和基于深度的答案。实验表明,WalkGPT在定位推理和分割性能方面表现出色。源代码和数据集可在项目网站获取:https://sites.google.com/view/walkgpt-26/home
Summary / 总结
WalkGPT is a pixel-grounded Large Vision-Language Model designed for grounded navigation guidance, addressing the limitations of existing models in handling complex urban scenes. It uses a Multi-Scale Query Projector and a Calibrated Text Projector to generate conversational responses with segmentation masks and depth estimation for accessible and harmful features. Experiments demonstrate strong performance in grounded reasoning and segmentation. The model does not require user-provided cues or anchor points, enabling comprehensive and realistic navigation guidance.
WalkGPT 旨在通过将语言推理和分割集成到单一架构中来解决无障碍行人导航的挑战。它借助多尺度查询投影器和校准文本投影器,生成带有无障碍与有害特征分割掩码以及相对深度估计的对话式响应。实验表明,其在定位推理和分割方面表现出色,且无需用户提供提示或锚点即可给出深度感知的导航指导。
Are Video Reasoning Models Ready to Go Outside?
Authors: Yangfan He, Changgyu Boo, Jaehong Yoon
First: 2026-03-11T11:10:52+00:00 · Latest: 2026-03-11T11:10:52+00:00
Comments: Project Page: https://robust-video-reason.github.io/
Abstract
In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
中文标题/摘要
标题:视频推理模型准备好走出室内了吗?
在实际部署中,视觉-语言模型常常会遇到诸如天气、遮挡和摄像机运动等干扰,这些条件会导致其理解和推理能力大幅下降,揭示了干净、受控(即未受干扰)评估环境与实际鲁棒性之间的差距。为解决这一局限,我们提出了ROVA,一种在时空扰动下建模鲁棒性感知一致性奖励的新颖训练框架,以提高鲁棒性。ROVA引入了一种难度感知的在线训练策略,根据模型不断演变的能力优先处理信息丰富的样本。具体而言,它通过自我反思评估不断重新估计样本难度,从而实现具有鲁棒性感知一致性奖励的自适应训练。我们还引入了PVRBench,这是一个新的基准,通过向具身视频数据集注入真实世界的扰动来评估在现实干扰下的准确性和推理质量。我们在PVRBench、UrbanVideo和VisBench上评估了ROVA和基线模型,开源和专有模型在现实扰动下的准确性和推理能力最多分别下降35%和28%。ROVA有效缓解了性能下降,相对于基线模型(QWen2.5/3-VL、InternVL2.5、Embodied-R),相对准确性提升至少24%,推理能力提升超过9%。这些增益同样迁移到干净的标准基准上,带来了一致的改进。
Summary / 总结
The research aims to improve the robustness of vision-language models in real-world scenarios by addressing the gap between controlled and real-world evaluation settings. The study proposes ROVA, a training framework that enhances robustness through a difficulty-aware online strategy and a robustness-aware consistency reward. Experiments on PVRBench, UrbanVideo, and VisBench show that ROVA significantly improves accuracy and reasoning quality under realistic perturbations, with relative accuracy gains of at least 24% and reasoning improvements over 9% compared to baseline models.
研究旨在通过解决视觉-语言模型在天气、遮挡和摄像机运动等条件下的退化问题,增强其实用性。研究引入了ROVA,一种在时空扰动下建模鲁棒性一致奖励的训练框架,以及PVRBench,一个用于评估模型在现实干扰下表现的新基准。实验结果显示,ROVA相比基线模型提高了至少24%的相对准确性和超过9%的推理质量,并且在干净的标准基准中也表现出一致的改进。
R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment
Authors: Zhuangzi Li, Jian Jin, Shilv Cai, Weisi Lin
First: 2026-03-11T09:28:49+00:00 · Latest: 2026-03-11T09:28:49+00:00
Abstract
Immersive Computer Graphics (CGs) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: first, existing CG datasets lack systematic descriptions of rendering quality; and second, existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3500 CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question-answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM's understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment.
中文标题/摘要
标题:R4-CGQA:基于检索的视觉语言模型用于计算机图形图像质量评估
沉浸式计算机图形(CG)渲染在现代日常生活中已无处不在。然而,全面评估CG质量仍然具有挑战性,原因有两个:首先,现有的CG数据集缺乏对渲染质量的系统描述;其次,现有的CG质量评估方法无法提供合理的基于文本的解释。为了解决这些问题,我们首先从用户的角度识别出CG质量的六个关键感知维度,并构建了一个包含3500张CG图像及其相应质量描述的数据集。每个描述涵盖了CG风格、内容以及在选定维度上的感知质量。此外,我们使用数据集的一个子集构建了多个基于描述的问答基准,以评估现有视觉语言模型(VLMs)的响应。我们发现,当前的VLMs在判断细粒度CG质量方面不够准确,但视觉相似图像的描述可以显著提高VLM对给定CG图像的理解。受此观察的启发,我们采用检索增强生成,并提出了一种双流检索框架,该框架有效地增强了VLMs的CG质量评估能力。在几个代表性VLMs上的实验表明,我们的方法显著提高了它们在CG质量评估上的性能。
Summary / 总结
The paper addresses the challenge of comprehensively evaluating computer graphics (CG) quality by identifying six key perceptual dimensions and constructing a dataset with corresponding quality descriptions. It finds that current Vision Language Models (VLMs) struggle with fine-grained CG quality assessment but can be improved by using descriptions of visually similar images. To enhance VLMs, the authors propose a retrieval-augmented generation framework, which significantly improves their performance in CG quality assessment tasks.
论文通过识别六个关键感知维度并构建包含3500张CG图像及其质量描述的数据集来解决CG质量评估的挑战。研究发现,当前的视觉语言模型(VLMs)在判断细粒度CG质量方面不够准确,但相似图像的描述可以提高理解能力。为了增强CG质量评估能力,提出了一个检索增强生成框架,该框架在该任务中显著提高了VLM的性能。
CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents
Authors: Marta Sumyk, Oleksandr Kosovan
First: 2026-03-11T09:28:41+00:00 · Latest: 2026-03-11T09:28:41+00:00
Abstract
Computer-Use Agents (CUAs) are emerging as a new paradigm in human-computer interaction, enabling autonomous execution of tasks in desktop environments by perceiving high-level natural-language instructions. As such agents become increasingly capable and are deployed across diverse desktop environments, evaluating their behavior in a scalable and reliable manner becomes a critical challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, which are brittle, costly, and poorly aligned with real-world usage. In this work, we study Vision-Language Models (VLMs) as autonomous auditors for assessing CUA task completion directly from observable interactions and conduct a large-scale meta-evaluation of five VLMs that judge task success given a natural-language instruction and the final environment state. Our evaluation spans three widely used CUA benchmarks across macOS, Windows, and Linux environments and analyzes auditor behavior along three complementary dimensions: accuracy, calibration of confidence estimates, and inter-model agreement. We find that while state-of-the-art VLMs achieve strong accuracy and calibration, all auditors exhibit notable performance degradation in more complex or heterogeneous environments, and even high-performing models show significant disagreement in their judgments. These results expose fundamental limitations of current model-based auditing approaches and highlight the need to explicitly account for evaluator reliability, uncertainty, and variance when deploying autonomous CUAs in real-world settings.
中文标题/摘要
标题:CUAAudit:对视觉-语言模型作为自主计算机使用代理审计者的元评估
计算机使用代理(CUAs)正在成为人机交互的新范式,能够在桌面环境中通过感知高级自然语言指令自主执行任务。随着这些代理变得越来越强大,并在各种桌面环境中部署,以可扩展和可靠的方式评估其行为成为一个关键挑战。现有的评估管道依赖于静态基准、基于规则的成功检查或人工检查,这些方法是脆弱的、成本高昂的,并且与实际使用情况不匹配。在这项工作中,我们研究视觉-语言模型(VLMs)作为自主评估者,直接从可观察的交互中评估CUA任务完成情况,并对五种VLM进行大规模元评估,这些模型根据自然语言指令和最终环境状态判断任务成功与否。我们的评估跨越了macOS、Windows和Linux环境下的三种广泛使用的CUA基准,并从准确度、置信度估计的校准以及模型间一致性这三个互补维度分析评估者的行为。我们发现,尽管最先进的VLMs在准确度和校准方面表现出色,但所有评估者在更复杂或异构的环境中表现出明显的性能下降,即使是高表现的模型在判断上也显示出显著的分歧。这些结果揭示了当前基于模型的评估方法的基本局限性,并强调了在实际环境中部署自主CUA时需要明确考虑评估者可靠性、不确定性和变异性的必要性。
Summary / 总结
The research aims to evaluate the performance of Vision-Language Models (VLMs) as auditors for Computer-Use Agents (CUAs) by assessing their ability to judge task success based on natural-language instructions and observable interactions. The study evaluates five VLMs across three CUA benchmarks on macOS, Windows, and Linux. Key findings include strong accuracy and calibration of VLMs, but notable performance degradation in complex or heterogeneous environments, and significant disagreement among high-performing models in their judgments.
研究旨在使用Vision-Language模型(VLMs)作为自主审计员来评估计算机使用代理(CUAs)。研究在macOS、Windows和Linux环境中,针对三种CUA基准测试了五个VLMs。主要发现包括VLMs在准确性和校准方面表现出色,但在复杂或异构环境中性能显著下降,即使是高性能模型之间也存在显著分歧。这揭示了当前基于模型的审计方法的基本局限性,并强调在实际部署中需要考虑评估者的可靠性和不确定性。
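The three dimensions named in the abstract (accuracy, calibration of confidence estimates, inter-model agreement) can be made concrete with a short sketch. The abstract does not state which calibration or agreement metrics the authors use, so this assumes expected calibration error (ECE) with equal-width bins and raw pairwise agreement; the auditor names and toy verdicts are hypothetical.

```python
from itertools import combinations

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        acc = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(acc - avg_conf)
    return ece

def pairwise_agreement(verdicts):
    """Fraction of tasks on which each pair of auditors gives the same verdict."""
    out = {}
    for a, b in combinations(sorted(verdicts), 2):
        same = sum(x == y for x, y in zip(verdicts[a], verdicts[b]))
        out[(a, b)] = same / len(verdicts[a])
    return out

# Toy data, for illustration only.
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
agreement = pairwise_agreement({
    "auditor_a": [True, True, False, True],
    "auditor_b": [True, False, False, True],
})
print(round(ece, 3), agreement)
```

Raw agreement does not correct for chance; a chance-corrected statistic such as Cohen's kappa would be a natural alternative when verdicts are heavily skewed toward one class.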
Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues
Authors: Mohammed Salah, Eman Ouda, Giuseppe Dell'Avvocato, Fabrizio Sarasini, Ester D'Accardi, Jorge Dias, Davor Svetinovic, Stefano Sfarra, Yusra Abdulrahman
First: 2026-03-11T08:58:15+00:00 · Latest: 2026-03-11T08:58:15+00:00
Abstract
Active infrared thermography (AIRT) is currently witnessing a surge of artificial intelligence (AI) methodologies being deployed for automated subsurface defect analysis of high-performance carbon fiber-reinforced polymers (CFRP). Deploying AI-based AIRT methodologies for inspecting CFRPs requires the creation of time-consuming and expensive datasets of CFRP inspection sequences to train neural networks. To address this challenge, this work introduces a novel language-guided framework for cognitive defect analysis in CFRPs using AIRT and vision-language models (VLMs). Unlike conventional learning-based approaches, the proposed framework does not require developing training datasets for extensive training of defect detectors; instead, it relies solely on pretrained multimodal VLM encoders coupled with a lightweight adapter to enable generative zero-shot understanding and localization of subsurface defects. By leveraging pretrained multimodal encoders, the proposed system enables generative zero-shot understanding of thermographic patterns and automatic detection of subsurface defects. Given the domain gap between thermographic data and natural images used to train VLMs, an AIRT-VLM Adapter is proposed to enhance the visibility of defects while aligning the thermographic domain with the learned representations of VLMs. The proposed framework is validated using three representative VLMs; specifically, GroundingDINO, Qwen-VL-Chat, and CogVLM. Validation is performed on 25 CFRP inspection sequences with impacts introduced at different energy levels, reflecting realistic defects encountered in industrial scenarios. Experimental results demonstrate that the AIRT-VLM adapter achieves signal-to-noise ratio (SNR) gains exceeding 10 dB compared with conventional thermographic dimensionality-reduction methods, while enabling zero-shot defect detection with intersection-over-union values reaching 70%.
中文标题/摘要
标题:基于视觉-文本提示的主动红外热成像认知缺陷分析
主动红外热成像(AIRT)领域正涌现出大量人工智能(AI)方法,用于高性能碳纤维增强聚合物(CFRP)的自动化次表面缺陷分析。使用基于AI的AIRT方法检查CFRP,需要构建耗时且昂贵的CFRP检查序列数据集来训练神经网络。为了解决这一挑战,本研究引入了一种新的语言引导框架,利用AIRT和视觉-语言模型(VLMs)对CFRP进行认知缺陷分析。与传统的基于学习的方法不同,所提出的框架无需为缺陷检测器的大量训练构建数据集,而是仅依赖预训练的多模态VLM编码器与轻量级适配器相结合,实现对次表面缺陷的生成式零样本理解和定位。通过利用预训练的多模态编码器,所提出的系统能够对热图像模式进行生成式零样本理解,并自动检测次表面缺陷。鉴于热图像数据与用于训练VLMs的自然图像之间的领域差距,本文提出了AIRT-VLM适配器,以增强缺陷的可见性,并使热图像领域与VLMs学习到的表示对齐。所提出的框架使用三种代表性VLMs进行验证,即GroundingDINO、Qwen-VL-Chat和CogVLM。验证在25个CFRP检查序列上进行,这些试样以不同能量水平引入了冲击损伤,反映了工业场景中遇到的真实缺陷。实验结果表明,与传统的热图像降维方法相比,AIRT-VLM适配器实现了超过10 dB的信噪比(SNR)增益,同时使零样本缺陷检测的交并比达到70%。
Summary / 总结
This work introduces a novel language-guided framework for cognitive defect analysis in CFRPs using AIRT and vision-language models (VLMs). Unlike conventional learning-based approaches, the framework relies on pretrained multimodal VLM encoders and a lightweight adapter to enable zero-shot understanding and localization of subsurface defects without extensive training datasets. The AIRT-VLM adapter enhances defect visibility and aligns the thermographic domain with VLMs, achieving SNR gains exceeding 10 dB and zero-shot defect detection with intersection-over-union values reaching 70%. Validation was performed on 25 CFRP inspection sequences with different energy levels of impacts.
该研究提出了一种使用AIRT和视觉语言模型的认知缺陷分析框架,无需大量训练数据。该框架利用预训练的多模态编码器和轻量级适配器实现零样本理解与缺陷定位。AIRT-VLM适配器增强了缺陷的可见性并使热图域与VLMs的学习表示相匹配,实现了超过10 dB的信噪比增益和70%的交并比值,在零样本缺陷检测中表现出色。
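The two reported metrics, intersection-over-union and SNR in decibels, can be sketched as follows. IoU is the standard mask-overlap ratio. The SNR definition is an assumption: the abstract does not give one, and this sketch uses the contrast between defect and sound regions divided by the standard deviation of the sound region, expressed as 20·log10. The masks and pixel values are toy data.

```python
import math
from statistics import mean, pstdev

def iou(pred, truth):
    """Intersection-over-union of two binary masks given as nested 0/1 lists."""
    inter = union = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0

def snr_db(defect_pixels, sound_pixels):
    """Assumed definition: 20*log10(|mean(defect) - mean(sound)| / std(sound))."""
    contrast = abs(mean(defect_pixels) - mean(sound_pixels))
    return 20 * math.log10(contrast / pstdev(sound_pixels))

# Toy masks and intensities, for illustration only.
pred_mask = [[1, 1, 0], [0, 1, 0]]
true_mask = [[1, 0, 0], [0, 1, 1]]
overlap = iou(pred_mask, true_mask)
snr = snr_db([12.0, 12.0], [1.0, 3.0])
print(overlap, snr)
```

Under this definition, an SNR gain of 10 dB corresponds to roughly a 3.16-fold increase in the contrast-to-noise amplitude ratio.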
OmniVTON++: Training-Free Universal Virtual Try-On with Principal Pose Guidance
Authors: Zhaotong Yang, Yong Du, Shengfeng He, Yuhui Li, Xinzhe Li, Yangyang Xu, Junyu Dong, Jian Yang
First: 2026-02-16T08:27:43+00:00 · Latest: 2026-03-11T08:41:53+00:00
Abstract
Image-based Virtual Try-On (VTON) concerns the synthesis of realistic person imagery through garment re-rendering under human pose and body constraints. In practice, however, existing approaches are typically optimized for specific data conditions, making their deployment reliant on retraining and limiting their generalization as a unified solution. We present OmniVTON++, a training-free VTON framework designed for universal applicability. It addresses the intertwined challenges of garment alignment, human structural coherence, and boundary continuity by coordinating Structured Garment Morphing for correspondence-driven garment adaptation, Principal Pose Guidance for step-wise structural regulation during diffusion sampling, and Continuous Boundary Stitching for boundary-aware refinement, forming a cohesive pipeline without task-specific retraining. Experimental results demonstrate that OmniVTON++ achieves state-of-the-art performance across diverse generalization settings, including cross-dataset and cross-garment-type evaluations, while reliably operating across scenarios and diffusion backbones within a single formulation. In addition to single-garment, single-human cases, the framework supports multi-garment, multi-human, and anime character virtual try-on, expanding the scope of virtual try-on applications. The code is available at https://github.com/Jerome-Young/OmniVTON-PlusPlus.
中文标题/摘要
标题:OmniVTON++:基于主要姿态引导的无需训练通用虚拟试衣
基于图像的虚拟试衣(VTON)涉及在人体姿态和身体约束下通过服装重渲染合成逼真的人像。然而在实践中,现有方法通常针对特定的数据条件进行优化,使其部署依赖于重新训练,并限制了其作为统一解决方案的泛化能力。我们提出了OmniVTON++,这是一种无需训练的VTON框架,旨在实现通用适用性。它针对服装对齐、人体结构连贯性和边界连续性这三个相互交织的挑战,协调使用三个组件:结构化服装变形,用于基于对应关系的服装适应;主要姿态引导,用于在扩散采样过程中逐步进行结构调节;连续边界缝合,用于边界感知的细化,从而形成一个无需特定任务重新训练的统一管道。实验结果表明,OmniVTON++在多种泛化设置中实现了最先进的性能,包括跨数据集和跨服装类型的评估,同时在单一公式内跨不同场景和扩散基础模型可靠运行。除了单件服装、单个人物的情况外,该框架还支持多件服装、多人物以及动漫角色的虚拟试衣,扩展了虚拟试衣的应用范围。代码可在 https://github.com/Jerome-Young/OmniVTON-PlusPlus 获取。
Summary / 总结
OmniVTON++ is a training-free VTON framework that addresses the challenges of garment alignment, human structural coherence, and boundary continuity by using Structured Garment Morphing, Principal Pose Guidance, and Continuous Boundary Stitching. It achieves state-of-the-art performance across various scenarios and diffusion backbones, supporting single and multi-garment, single and multi-human, and anime character virtual try-on. The framework reliably operates across diverse generalization settings without the need for retraining. The code is available at https://github.com/Jerome-Young/OmniVTON-PlusPlus.
OmniVTON++ 是一个无需训练的 VTON 框架,旨在解决服装对齐、人体结构一致性和边界连续性的问题。它通过使用结构化服装变形、主要姿态引导和连续边界缝合来形成一个统一的管道,以实现通用的虚拟试穿。实验结果表明,OmniVTON++ 在跨数据集和跨服装类型评估中优于现有方法,并支持包括多件服装和动漫角色在内的多种场景的虚拟试穿。
Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs
Authors: Zhikang Xu, Qianqian Xu, Zitai Wang, Cong Hua, Sicong Li, Zhiyong Yang, Qingming Huang
Venue: CVPR 2026
First: 2026-03-03T05:44:47+00:00 · Latest: 2026-03-11T07:49:07+00:00
Comments: Accepted by the main track of CVPR 2026
Abstract
Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm creates an inherent inconsistency against the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we devise an inter-modal criterion for selecting negative texts. From the visual perspective, we dynamically identify high-confidence OOD images and invert them into the textual space, generating extra negative text embeddings guided by inter-modal distance. Extensive experiments across multiple benchmarks demonstrate the superiority of our approach. Notably, our InterNeg achieves state-of-the-art performance compared to existing works, with a 3.47% reduction in FPR95 on the large-scale ImageNet benchmark and a 5.50% improvement in AUROC on the challenging Near-OOD benchmark.
中文标题/摘要
标题:注意负文本的选择方式:在基于VLM的OOD检测中追求距离一致性
分布外(OOD)检测旨在识别来自未知类别的样本,这是在开放世界场景中部署机器学习模型的关键能力。最近的研究表明,视觉-语言模型(VLMs)能够有效利用其多模态表示进行OOD检测。然而,当前的方法通常在OOD检测中引入模态内距离,例如将负文本与ID标签进行比较,或将测试图像与图像代理进行比较。这种设计范式与CLIP类VLMs所优化的跨模态距离存在固有的不一致,可能导致性能欠佳。为了解决这一局限性,我们提出了一种简单而有效的框架InterNeg,该框架从文本和视觉两个视角系统地利用一致的跨模态距离增强。从文本视角出发,我们设计了一种跨模态标准来选择负文本。从视觉视角出发,我们动态识别高置信度的OOD图像,并将其反演到文本空间,生成由跨模态距离引导的额外负文本嵌入。在多个基准上的广泛实验表明,我们的方法具有优越性。值得注意的是,与现有工作相比,InterNeg实现了最先进的性能,在大规模ImageNet基准上FPR95降低了3.47%,在具有挑战性的Near-OOD基准上AUROC提高了5.50%。
Summary / 总结
This paper addresses the limitation of current out-of-distribution (OOD) detection methods that use intra-modal distances, which can create inconsistency with the inter-modal distance optimized by VLMs. The authors propose InterNeg, a framework that enhances inter-modal distance consistency by selecting negative texts based on an inter-modal criterion and dynamically generating extra negative text embeddings. Experiments show that InterNeg outperforms existing methods, achieving a 3.47% reduction in FPR95 on ImageNet and a 5.50% improvement in AUROC on the Near-OOD benchmark.
论文指出,当前OOD检测方法使用的模态内距离与VLMs所优化的跨模态距离不一致,并针对这一问题提出了InterNeg框架,从文本和视觉两个角度增强跨模态距离的一致性。InterNeg基于跨模态标准选择负文本,并动态生成额外的负文本嵌入,从而提升了OOD检测性能:在ImageNet上FPR95降低3.47%,在Near-OOD上AUROC提升5.50%,优于现有方法。
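To make the role of negative texts concrete, here is a minimal sketch of a softmax-style OOD score of the kind commonly used with negative texts: the share of probability mass that an image assigns to in-distribution (ID) labels rather than to negative texts. This is not InterNeg's selection criterion, which the abstract does not spell out; the similarity values, the temperature, and the function name `id_score` are assumptions for illustration.

```python
import math

def id_score(sim_id, sim_neg, temperature=0.01):
    """Share of softmax mass on in-distribution labels; low values suggest OOD."""
    m = max(sim_id + sim_neg)  # subtract the max for numerical stability
    e_id = sum(math.exp((s - m) / temperature) for s in sim_id)
    e_neg = sum(math.exp((s - m) / temperature) for s in sim_neg)
    return e_id / (e_id + e_neg)

# Hypothetical image-to-text cosine similarities (inter-modal distances).
score_in = id_score(sim_id=[0.30, 0.20], sim_neg=[0.18, 0.15])
score_out = id_score(sim_id=[0.18, 0.17], sim_neg=[0.29, 0.20])
print(score_in, score_out)
```

Every similarity in this score is an image-to-text quantity, which is the inter-modal distance the abstract argues should also drive the choice of negative texts.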
Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression
Authors: Hamidreza Dastmalchi, Aijun An, Ali Cheraghian, Hamed Barzamini
Venue: CVPR 2026
First: 2026-03-11T06:40:50+00:00 · Latest: 2026-03-11T06:40:50+00:00
Comments: CVPR 2026
Abstract
While large vision-language models (LVLMs) achieve strong performance on multimodal tasks, they frequently generate hallucinations -- unfaithful outputs misaligned with the visual input. To address this issue, we introduce CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free method that suppresses vision-induced hallucinations via lightweight feature-level correction. Unlike prior training-free approaches that primarily focus on text-induced hallucinations, CIPHER explicitly targets hallucinations arising from the visual modality. CIPHER operates in two phases. In the offline phase, we construct OHC-25K (Object-Hallucinated Counterfactuals, 25,000 samples), a counterfactual dataset consisting of diffusion-edited images that intentionally contradict the original ground-truth captions. We pair these edited images with the unchanged ground-truth captions and process them through an LVLM to extract hallucination-related representations. Contrasting these representations with those from authentic (image, caption) pairs reveals structured, systematic shifts spanning a low-rank subspace characterizing vision-induced hallucination. In the inference phase, CIPHER suppresses hallucinations by projecting intermediate hidden states away from this subspace. Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness. Code and additional materials are available at https://hamidreza-dastmalchi.github.io/cipher-cvpr2026/.
中文标题/摘要
标题:用反事实对抗幻觉:扩散引导的扰动用于抑制LVLM幻觉
尽管大型视觉-语言模型(LVLMs)在多模态任务上表现出色,但它们经常生成幻觉,即与视觉输入不符的不忠实输出。为了解决这一问题,我们引入了CIPHER(用于幻觉提取与消除的反事实图像扰动),这是一种无需训练的方法,通过轻量级的特征级纠正来抑制由视觉引起的幻觉。与主要针对文本引起的幻觉的先前无需训练的方法不同,CIPHER 明确针对由视觉模态引起的幻觉。CIPHER 分为两个阶段。在离线阶段,我们构建了OHC-25K(对象幻觉反事实,25,000个样本),这是一个反事实数据集,由经扩散模型编辑、刻意与原始真实描述(ground-truth captions)相矛盾的图像组成。我们将这些编辑后的图像与未更改的真实描述配对,并通过LVLM处理以提取与幻觉相关的表示。将这些表示与真实(图像,描述)配对的表示进行对比,可以发现结构化、系统性的偏移,这些偏移张成一个低秩子空间,刻画了由视觉引起的幻觉。在推理阶段,CIPHER 通过将中间隐藏状态投影到远离该子空间的方向来抑制幻觉。在多个基准测试中的实验表明,CIPHER 显著降低了幻觉率,同时保持了任务性能,证明了反事实视觉扰动对提高LVLM忠实度的有效性。代码和额外材料可在 https://hamidreza-dastmalchi.github.io/cipher-cvpr2026/ 获取。
Summary / 总结
The paper introduces CIPHER, a training-free method to suppress vision-induced hallucinations in large vision-language models (LVLMs) by using counterfactual image perturbations. CIPHER constructs a dataset of 25,000 counterfactual images that contradict ground-truth captions and uses this to extract hallucination-related representations. During inference, CIPHER projects intermediate hidden states away from the subspace characterized by these representations to suppress hallucinations. Experiments show that CIPHER reduces hallucination rates while maintaining task performance, demonstrating its effectiveness in improving LVLM faithfulness.
论文提出了一种名为CIPHER的无需训练的方法,通过反事实图像扰动来抑制大型视觉-语言模型(LVLM)中由视觉引起的幻觉。CIPHER构建了一个包含25,000个反事实图像的数据集,从中提取与幻觉相关的表示,并在推理阶段将中间隐藏状态投影到远离这些表示所刻画的子空间的方向。实验表明,CIPHER能够有效降低幻觉率,同时保持任务性能。
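The inference-time step described above, projecting a hidden state away from a low-rank subspace, can be sketched with plain linear algebra. The abstract does not say how the subspace is estimated, so this assumes it is spanned by the differences between counterfactual and authentic representations, orthonormalized with Gram-Schmidt as a stand-in for whatever decomposition the authors use. All vectors are toy values and the function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-9):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # drop directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis

def project_away(h, basis):
    """Remove from h its component inside the subspace spanned by basis."""
    out = list(h)
    for u in basis:
        c = dot(h, u)
        out = [oi - c * ui for oi, ui in zip(out, u)]
    return out

# Toy representations: counterfactual (hallucination-inducing) vs. authentic pairs.
counterfactual = [[2.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
authentic = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
shifts = [[c - a for c, a in zip(cv, av)] for cv, av in zip(counterfactual, authentic)]
basis = orthonormal_basis(shifts)
cleaned = project_away([5.0, 2.0, 3.0], basis)
print(basis, cleaned)
```

Here both shift vectors point along the first axis, so the estimated subspace is one-dimensional and the projection zeroes only that coordinate while leaving the rest of the hidden state untouched.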
SVBench: Evaluation of Video Generation Models on Social Reasoning
Authors: Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang
First: 2025-12-25T04:44:59+00:00 · Latest: 2026-03-11T06:16:36+00:00
Comments: 10 pages
Abstract
Recent text-to-video generation models have made remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they still struggle to produce socially coherent behavior. Unlike humans, who readily infer intentions, beliefs, emotions, and social norms from brief visual cues, current models often generate literal scenes without capturing the underlying causal and psychological dynamics. To systematically assess this limitation, we introduce the first benchmark for social reasoning in video generation. Grounded in developmental and social psychology, the benchmark covers thirty classic social cognition paradigms spanning seven core dimensions: mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we build a fully training-free agent-based pipeline that distills the reasoning structure of each paradigm, synthesizes diverse video-ready scenarios, enforces conceptual neutrality and difficulty control through cue-based critique, and evaluates generated videos with a high-capacity VLM judge along five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale evaluation of seven state-of-the-art video generation systems. Results show a clear gap between surface-level plausibility and deeper social reasoning, suggesting that current models remain limited in their ability to generate socially grounded behavior. https://github.com/Gloria2tt/SVBench-Evaluation
中文标题/摘要
标题:SVBench:视频生成模型的社会推理能力评估
近期的文本到视频生成模型在视觉真实感、运动保真度和文本视频对齐方面取得了显著进展,但仍然难以生成社会上连贯的行为。与人类能够从简短的视觉线索中推断意图、信念、情感和社会规范不同,当前的模型往往生成字面场景,而未能捕捉到潜在的因果和心理动态。为了系统地评估这一局限性,我们首次引入了视频生成中的社会推理基准。该基准基于发展和社会心理学,涵盖了三十个经典的社会认知范式,涉及七个核心维度:心理状态推断、目标导向行为、共同注意、社会协调、亲社会行为、社会规范和多智能体策略。为了实现这些范式的操作化,我们构建了一个完全无需训练的基于代理的流水线,提取每个范式的推理结构,合成多种多样的视频场景,通过基于线索的批评实现概念中立性和难度控制,并使用高容量的VLM裁判员从五个可解释的社会推理维度评估生成的视频。使用此框架,我们首次对七种最先进的视频生成系统进行了大规模评估。结果表明,表面合理性与深层次的社会推理之间存在明显差距,表明当前模型在生成社会基础行为方面仍然有限。
Summary / 总结
SVBench is a benchmark for evaluating video generation models on their ability to generate socially coherent behavior. It introduces thirty social cognition paradigms covering seven core dimensions, and uses a training-free agent-based pipeline to synthesize diverse video scenarios and evaluate generated videos with a high-capacity VLM judge. The evaluation of seven state-of-the-art video generation systems reveals a significant gap between surface-level plausibility and deeper social reasoning capabilities, indicating current models' limitations in generating socially grounded behavior.
研究旨在评估文本到视频生成模型在产生社会连贯行为方面的能力,当前模型在这方面往往表现不佳。研究引入了基于社会和发展心理学的SVBench基准,涵盖了三十个经典的社会认知范式。评估使用一个无需训练的基于代理的管道来合成多样化的场景,并使用高容量的VLM评判生成的视频。结果表明,当前模型在更深层次的社会推理能力方面存在显著差距。
Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning
Authors: Zhuofan Xie, Zishan Lin, Jinliang Lin, Jie Qi, Shaohua Hong, Shuo Li
Venue: CVPR 2026
First: 2026-02-21T15:21:54+00:00 · Latest: 2026-03-11T06:14:28+00:00
Comments: Accepted to CVPR 2026 (to appear)
Abstract
Active Learning (AL) reduces annotation costs in medical imaging by selecting only the most informative samples for labeling, but suffers from cold-start when labeled data are scarce. Vision-Language Models (VLMs) address the cold-start problem via zero-shot predictions, yet their temperature-scaled softmax outputs treat text-image similarities as deterministic scores while ignoring inherent uncertainty, leading to overconfidence. This overconfidence misleads sample selection, wasting annotation budgets on uninformative cases. To overcome these limitations, the Similarity-as-Evidence (SaE) framework calibrates text-image similarities by introducing a Similarity Evidence Head (SEH), which reinterprets the similarity vector as evidence and parameterizes a Dirichlet distribution over labels. In contrast to a standard softmax that enforces confident predictions even under weak signals, the Dirichlet formulation explicitly quantifies lack of evidence (vacuity) and conflicting evidence (dissonance), thereby mitigating overconfidence caused by rigid softmax normalization. Building on this, SaE employs a dual-factor acquisition strategy: high-vacuity samples (e.g., rare diseases) are prioritized in early rounds to ensure coverage, while high-dissonance samples (e.g., ambiguous diagnoses) are prioritized later to refine boundaries, providing clinically interpretable selection rationales. Experiments on ten public medical imaging datasets with a 20% label budget show that SaE attains state-of-the-art macro-averaged accuracy of 82.57%. On the representative BTMRI dataset, SaE also achieves superior calibration, with a negative log-likelihood (NLL) of 0.425.
中文标题/摘要
标题:相似性作为证据:校准过度自信的VLMs以实现可解释和标签高效的医学主动学习
主动学习(AL)通过仅选择最具信息量的样本进行标注来减少医学成像中的标注成本,但在标注数据稀缺时会遇到冷启动问题。视觉-语言模型(VLMs)通过零样本预测解决了冷启动问题,但其温度缩放的softmax输出将文本-图像相似性视为确定性得分,忽略了固有的不确定性,导致过度自信。这种过度自信会误导样本选择,把标注预算浪费在信息量不足的案例上。为克服这些限制,相似性作为证据(SaE)框架通过引入相似性证据头(SEH)校准文本-图像相似性,将相似性向量重新解释为证据,并参数化标签上的狄利克雷分布。与即使在弱信号下也强制给出自信预测的标准softmax不同,狄利克雷建模明确量化了证据缺乏(空缺度,vacuity)和证据冲突(失调度,dissonance),从而减轻了由刚性softmax归一化引起的过度自信。在此基础上,SaE采用双因素采样策略:在早期轮次中优先选择高空缺度样本(例如罕见疾病)以确保覆盖范围,而在后期轮次中优先选择高失调度样本(例如模棱两可的诊断)以细化边界,提供临床可解释的选择理由。在20%的标签预算下,对十个公开的医学成像数据集的实验显示,SaE达到了最先进的宏平均准确率82.57%。在代表性的BTMRI数据集上,SaE还实现了更好的校准,负对数似然(NLL)为0.425。
Summary / 总结
The paper addresses the overconfidence issue in Vision-Language Models (VLMs) for medical active learning, proposing the Similarity-as-Evidence (SaE) framework. SaE introduces a Similarity Evidence Head (SEH) to reinterpret text-image similarities as evidence and parameterize a Dirichlet distribution, which explicitly quantifies lack of evidence and conflicting evidence. This mitigates overconfidence and improves sample selection. Experiments show SaE achieves state-of-the-art macro-averaged accuracy of 82.57% and superior calibration on the BTMRI dataset with a negative log-likelihood of 0.425.
论文针对Vision-Language模型(VLM)在医疗影像领域主动学习(AL)中表现出的过度自信问题,提出了Similarity-as-Evidence(SaE)框架,通过引入Similarity Evidence Head(SEH)将文本-图像相似性重新解释为证据,并参数化狄利克雷分布,从而减轻过度自信。实验结果显示,SaE在医疗影像数据集上实现了82.57%的最优宏平均准确率,并且在标注预算为20%的情况下具有更好的校准效果。
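The vacuity and dissonance quantities named in the SaE abstract can be illustrated with the standard subjective-logic definitions for a Dirichlet distribution. The sketch below is not the authors' implementation. The softplus mapping from similarity to evidence, the `scale` parameter, and the linear round schedule in `acquisition_score` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def dirichlet_uncertainty(similarities, scale=10.0):
    """Map a text-image similarity vector to Dirichlet-based uncertainty.

    Assumption: evidence = softplus(scale * similarity). The abstract does
    not say how the Similarity Evidence Head produces evidence.
    """
    evidence = [math.log1p(math.exp(scale * s)) for s in similarities]
    alpha = [e + 1.0 for e in evidence]          # Dirichlet parameters
    strength = sum(alpha)                        # Dirichlet strength S
    num_classes = len(alpha)

    probs = [a / strength for a in alpha]        # expected class probabilities
    belief = [e / strength for e in evidence]    # per-class belief mass
    vacuity = num_classes / strength             # lack of evidence

    def balance(b_j, b_k):
        total = b_j + b_k
        return 1.0 - abs(b_j - b_k) / total if total > 0 else 0.0

    # Dissonance: how evenly the belief mass is split between classes.
    dissonance = 0.0
    for k, b_k in enumerate(belief):
        others = [b for j, b in enumerate(belief) if j != k]
        denom = sum(others)
        if denom > 0:
            dissonance += b_k * sum(b * balance(b, b_k) for b in others) / denom
    return probs, vacuity, dissonance


def acquisition_score(vacuity, dissonance, round_idx, total_rounds):
    """Hypothetical linear schedule: vacuity early, dissonance late."""
    w = round_idx / max(total_rounds - 1, 1)
    return (1.0 - w) * vacuity + w * dissonance


# Weak signal on every class gives high vacuity; two equally strong classes
# give high dissonance; one dominant class gives low values for both.
for sims in ([-1.0, -1.0, -1.0], [1.0, 1.0, -1.0], [1.0, -1.0, -1.0]):
    _, vac, dis = dirichlet_uncertainty(sims)
    print(sims, round(vac, 3), round(dis, 3))
```

Belief masses and vacuity sum to one by construction, so a sample cannot score high on both "no evidence" and "strong conflicting evidence" at once, which is what makes the two factors usable as separate selection criteria.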
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
Authors: Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye
Venue: CVPR 2026
First: 2025-12-15T07:11:56+00:00 · Latest: 2026-03-11T05:58:08+00:00
Comments: Accepted by CVPR 2026
Abstract
Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR that matches its performance without training on or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during ongoing RL training and then uses the resulting merged model as a "free" teacher to guide subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and maintains stable training. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
中文标题/摘要
标题:GTR-Turbo:合并检查点秘密地成为有能动性的VLM训练的免费教师
基于视觉语言模型(VLMs)构建的多模态代理的多轮强化学习(RL)受到稀疏奖励和长期信用分配的阻碍。最近的方法通过查询提供逐步反馈的教师来增加奖励密度,例如引导思想强化学习(GTR)和在线策略蒸馏,但依赖于昂贵且通常是有特权的模型作为教师,限制了其实用性和可再现性。我们引入了GTR-Turbo,这是一种高度高效的GTR升级版,无需训练或查询昂贵的教师模型即可匹配其性能。具体而言,GTR-Turbo将正在进行的RL训练过程中生成的检查点权重合并,并使用结果合并模型作为“免费”的教师,通过监督微调或软logit蒸馏来指导后续的RL。此设计消除了对特权VLM(例如GPT或Gemini)的依赖,缓解了先前工作中观察到的“熵崩溃”现象,并保持了稳定的训练。在各种视觉有能动性任务中,GTR-Turbo将基线模型的准确性提高了10-30%,同时将墙钟训练时间减少了50%,计算成本降低了60%,相对于GTR而言。
Summary / 总结
GTR-Turbo addresses the challenges of multi-turn reinforcement learning for multi-modal agents by merging checkpoints from ongoing RL training to create a 'free' teacher model. This method eliminates the need for expensive, privileged models and improves the baseline model's accuracy by 10-30% while reducing training time and compute cost by 50% and 60%, respectively, compared to GTR.
GTR-Turbo通过合并正在进行的RL训练产生的检查点来创建一个‘免费’教师模型,该模型通过监督微调或软logit蒸馏来引导后续的RL。这种方法消除了对昂贵的特权模型的依赖,将基线模型的准确性提高了10-30%,将训练时间减少了50%,并将计算成本降低了60%,相比GTR有显著改进。
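The GTR-Turbo abstract describes merging checkpoint weights and using the merged model as a teacher via soft logit distillation. The sketch below shows one plausible reading under stated assumptions: checkpoints are merged by a (weighted) arithmetic mean of parameters, and distillation uses a temperature-scaled KL(teacher || student) loss. Neither choice is confirmed by the abstract, and the function names and plain-list parameter format are hypothetical.

```python
import math


def merge_checkpoints(checkpoints, weights=None):
    """Average parameters across checkpoints (assumed merge rule).

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats. Uniform weights are used when none are given.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = sum(weights)
    merged = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(len(reference))
        ]
    return merged


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract max for stability
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )


# Two toy checkpoints from the same run, merged into a "free" teacher.
teacher = merge_checkpoints([{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}])
print(teacher)
print(distillation_loss([0.0, 1.0], [1.0, 0.0]))
```

In a real training loop the parameters would be framework tensors rather than Python lists; the lists here only keep the example dependency-free.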
VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs
Authors: Xiyao Wang, Xiaoyu Tan, Yang Dai, Yuxuan Fu, Shuo Li, Xihe Qiu
First: 2026-03-10T02:42:51+00:00 · Latest: 2026-03-11T04:56:02+00:00
Comments: 10 pages, 4 figures
Abstract
Vision-language pretraining has driven significant progress in medical image analysis. However, current methods typically supervise visual encoders using one-hot labels or free-form text, neither of which effectively captures the complex semantic relationships among clinical findings. In this study, we introduce VIVID-Med, a novel framework that leverages a frozen large language model (LLM) as a structured semantic teacher to pretrain medical vision transformers (ViTs). VIVID-Med translates clinical findings into verifiable JSON field-state pairs via a Unified Medical Schema (UMS), utilizing answerability-aware masking to focus optimization. It then employs Structured Prediction Decomposition (SPD) to partition cross-attention into orthogonality-regularized query groups, extracting complementary visual aspects. Crucially, the LLM is discarded post-training, yielding a lightweight, deployable ViT-only backbone. We evaluated VIVID-Med across multiple settings: on CheXpert linear probing, it achieves a macro-AUC of 0.8588, outperforming BiomedCLIP by +6.65 points while using 500x less data. It also demonstrates robust zero-shot cross-domain transfer to NIH ChestX-ray14 (0.7225 macro-AUC) and strong cross-modality generalization to CT, achieving 0.8413 AUC on LIDC-IDRI lung nodule classification and 0.9969 macro-AUC on OrganAMNIST 11-organ classification. VIVID-Med offers a highly efficient, scalable alternative to deploying resource-heavy vision-language models in clinical settings.
中文标题/摘要
标题:VIVID-Med: LLM监督结构化预训练可部署医疗ViTs
视觉-语言预训练在医疗图像分析中取得了显著进展。然而,当前方法通常使用一热标签或自由文本来监督视觉编码器,这两种方式都无法有效捕捉临床发现之间的复杂语义关系。在本研究中,我们引入了VIVID-Med,这是一种新颖的框架,利用冻结的大语言模型(LLM)作为结构化语义教师来预训练医疗视觉变换器(ViTs)。VIVID-Med通过统一医学模式(UMS)将临床发现翻译为可验证的JSON字段状态对,并利用答案感知掩码来聚焦优化。然后,它使用结构化预测分解(SPD)将跨注意力划分为正交性正则化的查询组,提取互补的视觉方面。关键的是,训练后丢弃LLM,从而获得一个轻量级、可部署的仅ViT主干。我们在多个场景下评估了VIVID-Med:在CheXpert线性探针上,其宏AUC为0.8588,比BiomedCLIP高出6.65个百分点,同时使用的数据量仅为后者的1/500。它还展示了强大的跨域零样本迁移能力,对NIH ChestX-ray14的宏AUC为0.7225,并且在CT上实现了LIDC-IDRI肺结节分类的0.8413 AUC和OrganAMNIST 11器官分类的0.9969宏AUC。VIVID-Med提供了一种在临床环境中部署资源密集型视觉-语言模型的高效、可扩展的替代方案。
Summary / 总结
VIVID-Med uses a frozen large language model (LLM) as a structured semantic teacher to pretrain medical vision transformers, translating clinical findings into verifiable JSON field-state pairs. The LLM is discarded after training, leaving a lightweight ViT-only backbone. On CheXpert linear probing it achieves a macro-AUC of 0.8588, outperforming BiomedCLIP by 6.65 points while using 500x less data. It also transfers zero-shot to NIH ChestX-ray14 (0.7225 macro-AUC) and generalizes to CT, with 0.8413 AUC on LIDC-IDRI lung nodule classification and 0.9969 macro-AUC on OrganAMNIST.
VIVID-Med 是一种新颖的框架,通过大型语言模型(LLM)将临床发现翻译成结构化的 JSON 字段状态对来预训练医疗视觉变压器(ViTs)。这种方法提高了复杂语义关系的捕捉,并产生了一个轻量级、可部署的 ViT 仅骨干网络。VIVID-Med 在 CheXpert 上的宏 AUC 达到 0.8588,展示了在 NIH ChestX-ray14 上的鲁棒零样本跨域迁移,并在 CT 和 OrganAMNIST 上展示了强大的跨模态泛化能力。
Boosting Cross-problem Generalization in Diffusion-Based Neural Combinatorial Solver via Inference Time Adaptation
Authors: Haoyu Lei, Kaiwen Zhou, Yinchuan Li, Zhitang Chen, Farzan Farnia
First: 2025-02-15T08:04:00+00:00 · Latest: 2026-03-11T03:51:39+00:00
Abstract
Diffusion-based Neural Combinatorial Optimization (NCO) has demonstrated effectiveness in solving NP-complete (NPC) problems by learning discrete diffusion models for solution generation, eliminating hand-crafted domain knowledge. Despite their success, existing NCO methods face significant challenges in both cross-scale and cross-problem generalization, and incur high training costs compared to traditional solvers. While recent studies on diffusion models have introduced training-free guidance approaches that leverage pre-defined guidance functions for conditional generation, such methodologies have not been extensively explored in combinatorial optimization. To bridge this gap, we propose a training-free inference time adaptation framework (DIFU-Ada) that enables both zero-shot cross-problem transfer and cross-scale generalization for diffusion-based NCO solvers without requiring additional training. We provide a theoretical analysis that helps explain the cross-problem transfer capability. Our experimental results demonstrate that a diffusion solver, trained exclusively on the Traveling Salesman Problem (TSP), can achieve competitive zero-shot transfer performance across different problem scales on TSP variants, such as Prize Collecting TSP (PCTSP) and the Orienteering Problem (OP), through inference time adaptation.
中文标题/摘要
标题:通过推理时自适应提升基于扩散的神经组合求解器跨问题泛化能力
基于扩散的神经组合优化(NCO)通过学习离散扩散模型来生成解决方案,从而解决了NP完全(NPC)问题,消除了手工构建的领域知识。尽管取得了成功,现有的NCO方法在跨尺度和跨问题泛化方面仍面临重大挑战,并且与传统求解器相比,训练成本较高。虽然最近关于扩散模型的研究引入了无需训练的指导方法,利用预定义的指导函数进行条件生成,但这些方法在组合优化中的应用尚未得到广泛探索。为解决这一问题,我们提出了一种无需训练的推理时自适应框架(DIFU-Ada),该框架使基于扩散的NCO求解器能够在不需额外训练的情况下实现零样本跨问题转移和跨尺度泛化能力。我们提供了理论分析,以帮助理解跨问题转移能力。实验结果表明,仅在旅行商问题(TSP)上训练的扩散求解器,可以通过推理时自适应,在TSP变体,如收集奖励TSP(PCTSP)和旅游规划问题(OP)的不同问题尺度上实现具有竞争力的零样本转移性能。
Summary / 总结
The paper addresses the limitations of diffusion-based Neural Combinatorial Optimization (NCO) methods in cross-scale and cross-problem generalization and high training costs. It introduces DIFU-Ada, a training-free inference time adaptation framework that enhances the cross-problem transfer and cross-scale generalization capabilities of diffusion-based NCO solvers. Experiments show that a diffusion solver trained on the Traveling Salesman Problem can achieve competitive zero-shot transfer performance on TSP variants through inference time adaptation.
论文针对基于扩散的神经组合优化(NCO)方法在跨尺度和跨问题泛化以及高训练成本方面的局限性。提出了一个无训练的推理时自适应框架DIFU-Ada,增强了基于扩散的NCO求解器的零样本跨问题转移和跨尺度泛化能力。实验结果表明,仅在旅行商问题(TSP)上训练的扩散求解器可以通过推理时自适应实现对PCTSP和旅行商问题变种OP的竞争力表现。
One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination
Authors: Zhan Fa, Yue Duan, Jian Zhang, Lei Qi, Yinghuan Shi
First: 2026-03-11T03:19:46+00:00 · Latest: 2026-03-11T03:19:46+00:00
Comments: 10 pages
Abstract
Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language prior, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find their naive combination is also ineffective, necessitating a unified framework. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while our Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. By harmonizing these two roles, our framework effectively restores the vision-language balance, significantly reducing object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead.
中文标题/摘要
标题:一令牌,两命运:通过视觉令牌操控实现对MLLM幻觉的统一框架
当前的无训练方法通过单独的策略来应对MLLM的幻觉:要么增强视觉信号,要么抑制文本惯性。然而,这些单独的方法由于关键权衡不足而不够充分:单纯增强视觉信号往往无法对抗强大的语言先验,而抑制语言则可能引入额外的与图像无关的噪声。此外,我们发现它们的简单组合也是无效的,因此需要一个统一的框架。我们通过关注核心资产——视觉令牌,提出了这样一个框架。我们的设计基于两个关键见解:(1) 增强的图像提供了互补的视觉语义,(2) 删除视觉令牌(信息缺口)比扭曲图像(模态缺口)更精确地隔离了幻觉倾向。基于这些,我们的框架以两种不同的方式使用视觉令牌,都作用于潜在表示:我们的协同视觉校准(SVC)模块结合增强的令牌以增强视觉表示,而我们的因果表示校准(CRC)模块使用精简的令牌来创建潜在空间的负样本以纠正模型内部偏差。通过协调这两种角色,我们的框架有效地恢复了视觉-语言平衡,显著减少了物体幻觉,仅在多个基准测试中将LLaVA-1.5的POPE准确性平均提高了2%,同时仅增加了1.06倍的推理延迟。
Summary / 总结
The paper addresses the limitations of current training-free methods for mitigating MLLM hallucination, which either enhance visual signals or suppress text inertia, by proposing a unified framework built on vision tokens. The Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while the Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. The approach reduces object hallucinations and improves POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks, with only a 1.06x inference latency overhead.
本文提出了一种统一框架来解决当前训练-free 方法在缓解 MLLM 幻觉方面的局限性。该框架利用视觉标记同时增强视觉信号和抑制文本惯性。框架包括协同视觉校准(SVC)模块,用于结合增强标记,以及因果表示校准(CRC)模块,用于使用剪枝标记创建负样本以纠正模型偏差。这种方法减少了物体幻觉,提高了 LLaVA-1.5 在多个基准上的 POPE 准确率,仅增加了 1.06 倍的推理延迟。
StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References
Authors: Boyu He, Yunfan Ye, Chang Liu, Weishang Wu, Fang Liu, Zhiping Cai
First: 2026-03-11T03:05:02+00:00 · Latest: 2026-03-11T03:05:02+00:00
Comments: 10 pages, 23 figures, Conference on Computer Vision and Pattern Recognition 2026
Abstract
Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) semantic gap: the style reference could miss proper content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks) restricting applicability; 3) rigid feature associations lacking adaptive global-local alignment, failing to balance fine-grained stylization and global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability. To address these, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization. It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy function-guided diffusion sampling with regional style loss to optimize stylization). Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references.
中文标题/摘要
标题:StyleGallery:无需训练且具有语义意识的个性化风格迁移,从任意图像参考中
尽管在基于扩散的图像风格迁移方面取得了进展,但现有方法通常受限于1)语义差距:风格参考可能缺少适当的内容语义,导致不可控的风格化;2)依赖额外约束(例如,语义掩码)限制了适用性;3)刚性特征关联缺乏适应性的全局-局部对齐,无法平衡精细风格化和全局内容保留。这些限制,尤其是无法灵活利用风格输入,从根本上限制了风格迁移在个性化、准确性和适应性方面的应用。为了解决这些问题,我们提出了StyleGallery,这是一种无需训练且具有语义意识的框架,支持任意参考图像作为输入,并能够实现有效的个性化定制。它包括三个核心阶段:语义区域分割(在潜在扩散特征上进行自适应聚类,无需额外输入以划分区域);聚类区域匹配(在提取特征上进行块过滤,以实现精确对齐);以及风格迁移优化(基于能量函数的扩散采样与区域风格损失引导的优化,以优化风格化)。在我们引入的基准测试上进行的实验表明,StyleGallery在内容结构保留、区域风格化、可解释性和个性化定制方面优于最先进的方法,特别是在利用多个风格参考时。
Summary / 总结
StyleGallery is a training-free and semantic-aware framework for personalized image style transfer that addresses limitations in existing methods such as semantic gaps and rigid feature associations. It uses semantic region segmentation, clustered region matching, and style transfer optimization to achieve effective stylization while preserving content structure. Experiments show that StyleGallery outperforms state-of-the-art methods in content preservation, regional stylization, interpretability, and personalized customization, especially when using multiple style references.
StyleGallery 是一个无需训练且具备语义意识的框架,用于个性化图像风格转移,解决了现有方法中的语义差距和刚性特征关联等问题。它通过语义区域分割、聚类区域匹配和风格转移优化来实现有效的自定义,支持任意参考图像输入。实验表明,StyleGallery 在内容结构保留、区域风格化、可解释性和个性化定制方面优于现有最先进的方法,尤其是在使用多个风格参考时表现更佳。
KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing
Authors: Siyu Jiang, Feiyang Chen, Xiaojin Zhang, Kun He
Venue: CVPR 2026
First: 2026-02-04T06:59:17+00:00 · Latest: 2026-03-11T02:54:58+00:00
Comments: Accepted by CVPR 2026
Abstract
Despite the significant progress of Multimodal Large Language Models (MLLMs) across diverse tasks, hallucination -- corresponding to the generation of visually inconsistent objects, attributes, or relations -- remains a major obstacle to their reliable deployment. Unlike pure language models, MLLMs must ground their generation process in visual inputs. However, existing models often suffer from semantic drift during decoding, causing outputs to diverge from visual facts as the sequence length increases. To address this issue, we propose KVSmooth, a training-free and plug-and-play method that mitigates hallucination by performing attention-entropy-guided adaptive smoothing on hidden states. Specifically, KVSmooth applies an exponential moving average (EMA) to both keys and values in the KV-Cache, while dynamically quantifying the sink degree of each token through the entropy of its attention distribution to adaptively adjust the smoothing strength. Unlike computationally expensive retraining or contrastive decoding methods, KVSmooth operates efficiently during inference without additional training or model modification. Extensive experiments demonstrate that KVSmooth significantly reduces hallucination ($\mathit{CHAIR}_{S}$ from $41.8 \rightarrow 18.2$) while improving overall performance ($F_1$ score from $77.5 \rightarrow 79.2$), achieving higher precision and recall simultaneously. In contrast, prior methods often improve one at the expense of the other, validating the effectiveness and generality of our approach.
中文标题/摘要
标题:KVSmooth:通过键值平滑减轻多模态大型语言模型中的幻觉
尽管多模态大型语言模型(MLLMs)在多种任务中取得了显著进展,但在视觉输入中生成不一致的对象、属性或关系的幻觉仍然是其可靠部署的主要障碍。与纯粹的语言模型不同,MLLMs 必须在其生成过程中扎根于视觉输入。然而,现有模型在解码过程中经常出现语义漂移,导致输出随着序列长度的增加而偏离视觉事实。 为了解决这一问题,我们提出了一种无需训练且即插即用的方法 KVSmooth,通过注意力-熵引导的自适应平滑来减轻幻觉。具体而言,KVSmooth 对 KV 缓存中的键和值应用指数移动平均(EMA),并通过每个令牌的注意力分布的熵动态量化其汇入度,以自适应调整平滑强度。 与计算成本高昂的重新训练或对比解码方法不同,KVSmooth 在推理过程中高效运行,无需额外的训练或模型修改。广泛的实验表明,KVSmooth 显著减少了幻觉($\mathit{CHAIR}_{S}$ 从 $41.8 \rightarrow 18.2$),同时提高了整体性能($F_1$ 分数从 $77.5 \rightarrow 79.2$),同时提高了精确度和召回率。相比之下,先前的方法往往在提高一个方面的同时牺牲另一个方面,验证了我们方法的有效性和普适性。
Summary / 总结
KVSmooth is a training-free, plug-and-play method that reduces hallucination in MLLMs by applying attention-entropy-guided adaptive smoothing on hidden states. It applies an exponential moving average to the keys and values in the KV-Cache and adjusts the smoothing strength per token based on the entropy of its attention distribution. Experiments show that KVSmooth reduces hallucination (CHAIR_S from 41.8 to 18.2) while improving overall performance (F1 score from 77.5 to 79.2), raising precision and recall simultaneously, whereas prior methods often improve one at the expense of the other.
KVSmooth 是一种无需训练的方法,通过在隐藏状态上进行注意力-熵引导的自适应平滑来减轻多模态大型语言模型中的幻觉。它对 KV-Cache 中的键和值应用指数移动平均,并根据令牌注意力分布的熵动态调整平滑强度。实验表明,KVSmooth 显著减少了幻觉(CHAIR_S 从 41.8 降低到 18.2),同时提高了整体性能(F1 分数从 77.5 提高到 79.2),同时提高了精确度和召回率,优于以往常常在两者之间进行权衡的方法。
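The entropy-guided EMA described in the KVSmooth abstract can be sketched as follows. This is an illustration under explicit assumptions, not the authors' method: the abstract does not give the mapping from attention entropy to smoothing strength, so the code assumes that low-entropy (peaked, sink-like) attention gets stronger smoothing, capped by a hypothetical `max_strength` parameter. Keys and values are plain lists of floats, one vector per token.

```python
import math


def normalized_entropy(attn):
    """Entropy of an attention distribution, scaled to [0, 1]."""
    n = len(attn)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in attn if p > 0)
    return h / math.log(n)


def ema_step(prev, new, lam):
    """One EMA update: lam weights the running state, 1 - lam the new vector."""
    return [lam * p + (1.0 - lam) * x for p, x in zip(prev, new)]


def smooth_kv(keys, values, attn_rows, max_strength=0.5):
    """Apply an entropy-adaptive EMA along the token axis of a KV cache.

    attn_rows[t] is the attention distribution of token t; row 0 is unused
    because the first token has no running state to blend with.
    Assumed rule: lam = max_strength * (1 - normalized entropy).
    """
    smoothed_k = [list(keys[0])]
    smoothed_v = [list(values[0])]
    for t in range(1, len(keys)):
        lam = max_strength * (1.0 - normalized_entropy(attn_rows[t]))
        smoothed_k.append(ema_step(smoothed_k[-1], keys[t], lam))
        smoothed_v.append(ema_step(smoothed_v[-1], values[t], lam))
    return smoothed_k, smoothed_v


keys = [[0.0], [2.0], [4.0]]
values = [[1.0], [1.0], [3.0]]
attn = [[1.0], [1.0, 0.0], [0.5, 0.5]]   # token 1 peaked, token 2 uniform
print(smooth_kv(keys, values, attn))
```

With uniform attention the assumed rule gives a smoothing strength of zero, so the token's key and value pass through unchanged; only tokens with concentrated attention are pulled toward the running average.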
Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation
Authors: Sangmim Song, Sarath Kodagoda, Marc Carmichael, Karthick Thiyagarajan
First: 2026-03-11T02:21:02+00:00 · Latest: 2026-03-11T02:21:02+00:00
Comments: 7 pages, 4 figures, 3 tables
Abstract
Vision-Language-Action (VLA) models demonstrate impressive zero-shot generalization but frequently suffer from a "Precision-Reasoning Gap" in cluttered environments. This failure is driven by background-induced feature dilution, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. To bridge this gap, we propose Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that stabilizes VLA policies. CGVD operates by parsing instructions into safe and distractor sets, utilizing a two-layer target refinement process (combining cross-validation and spatial disambiguation) to explicitly penalize false positives and isolate genuine manipulation targets. We then process the scene via Fourier-based inpainting, generating a clean observation that actively suppresses semantic distractors while preserving critical spatial geometry and visual proprioception. Extensive evaluations in highly cluttered manipulation tasks demonstrate that CGVD prevents performance collapse. In environments with dense semantic distractors, our method significantly outperforms state-of-the-art baselines, achieving a 77.5% success rate compared to the baseline's 43.0%. By enforcing strict attribute adherence, CGVD establishes inference-time visual distillation as a critical prerequisite for robust robotic manipulation in clutter.
中文标题/摘要
标题:通过概念门控视觉蒸馏克服视觉杂乱在视觉语言动作模型中的问题
视觉-语言-动作(VLA)模型展示了令人印象深刻的零样本泛化能力,但在杂乱环境中经常遭受“精确-推理差距”的困扰。这种失败是由背景引起的特征稀释驱动的,其中高频语义噪声会破坏精确操作所需的几何定位。为了弥合这一差距,我们提出了概念门控视觉蒸馏(CGVD),这是一种无需训练、模型无关的推理框架,可以稳定VLA策略。CGVD通过将指令解析为安全集和干扰集,并利用两层目标精炼过程——结合交叉验证和空间去模糊——来明确惩罚假阳性并隔离真正的操作目标。然后,通过基于傅里叶的修补处理场景,生成一个干净的观察结果,该结果主动抑制语义干扰,同时保留关键的空间几何结构和视觉定位。在高度杂乱的操作任务中的广泛评估表明,CGVD可以防止性能崩溃。在具有密集语义干扰的环境中,我们的方法显著优于最先进的基线,成功率达到77.5%,而基线的成功率为43.0%。通过严格属性遵从,CGVD确立了推理时的视觉蒸馏是实现鲁棒机器人操作的关键先决条件。
Summary / 总结
The paper addresses the issue of visual clutter in Vision-Language-Action (VLA) models, which can lead to performance degradation. It introduces Concept-Gated Visual Distillation (CGVD), a model-agnostic inference framework that stabilizes VLA policies by parsing instructions and refining targets. CGVD generates a clean observation through Fourier-based inpainting, which suppresses semantic distractors while preserving critical spatial geometry. Experimental results show that CGVD significantly outperforms state-of-the-art baselines in cluttered manipulation tasks, achieving a 77.5% success rate compared to 43.0% for the baseline.
论文解决了VLA模型在杂乱环境中表现不佳的问题,主要是由于背景噪声导致的特征稀释。文中提出了一种名为Concept-Gated Visual Distillation (CGVD)的模型无关框架,通过解析指令和通过交叉验证和空间消歧来精炼目标。CGVD然后使用基于傅里叶的修复生成一个干净的观察,抑制语义干扰的同时保留空间几何结构。实验表明,CGVD在杂乱的抓取任务中显著提高了性能,成功率达到77.5%,而基线方法仅为43.0%。
SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models
Authors: Zhanxuan Hu, Qiyu Xu, Yu Duan, Yonghang Tai, Huafeng Li
First: 2025-06-16T17:27:47+00:00 · Latest: 2026-03-11T02:17:06+00:00
Abstract
Foundation models have attracted widespread attention across domains due to their powerful zero-shot classification capabilities. This work is motivated by two key observations: (1) Vision-Language Models (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas Vision-only Foundation Models (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose SOTA (Self-adaptive Optimal TrAnsport), a training-free ensemble framework that integrates the outputs of multiple foundation models (VFMs or VLMs) by learning a self-adaptive transport plan. Notably, SOTA is prior-free and automatically balances model contributions. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate the generalizability of SOTA. The results consistently show that it effectively leverages the complementary strengths of different foundation models and achieves substantial improvements over individual models. The implementation code is available at: https://github.com/Afleve/self-adaptive-Optimal-Transport.
中文标题/摘要
标题:SOTA:自适应最优运输在多基础模型零样本分类中的应用
基础模型因其强大的零样本分类能力而在各个领域引起了广泛关注。本文受到两个关键观察的启发:(1)视觉-语言模型(VLMs),如CLIP,往往过度依赖于类别级的文本先验,难以捕捉细微的视觉线索,而视觉基础模型(VFMs),如DINO,则提供了丰富的区分性视觉特征,但缺乏语义对齐;(2)不同VLMs在不同数据集上的性能差异很大,这归因于预训练的不同。为了解决这些挑战,我们提出了SOTA(自适应最优运输),这是一种无需训练的集成框架,通过学习自适应运输计划来整合多个基础模型(VFMs或VLMs)的输出。值得注意的是,SOTA 是无先验的,并且能够自动平衡模型的贡献。在包括自然图像、医学病理和遥感在内的多个领域进行的广泛实验验证了SOTA的普适性。结果一致表明,它有效地利用了不同基础模型的互补优势,并在单个模型上取得了显著的改进。代码可在以下链接获取:https://github.com/Afleve/self-adaptive-Optimal-Transport.
Summary / 总结
This work addresses the limitations of Vision-Language Models (VLMs) and Vision-only Foundation Models (VFMs) in zero-shot classification by proposing SOTA, a training-free ensemble framework. SOTA integrates the outputs of multiple foundation models by learning a self-adaptive transport plan, which automatically balances model contributions without relying on prior knowledge. Experiments across various domains demonstrate that SOTA effectively leverages the complementary strengths of different foundation models and achieves significant performance improvements over individual models.
该研究通过提出SOTA,一种无需训练的集成框架,解决了Vision-Language模型(VLMs)和Vision-only基础模型(VFMs)在零样本分类中的局限性。SOTA通过学习自适应的运输计划来整合多个基础模型的输出,无需依赖先验信息即可自动平衡模型贡献。实验结果表明,SOTA能够有效利用不同基础模型的互补优势,显著优于单一模型的性能。