arXiv Paper Digest / arXiv 论文速递

Snapshot: 20260409_0411

HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models

Authors: Reihaneh Zohrabi, Hosein Hasani, Akshita Gupta, Mahdieh Soleymani Baghshah, Anna Rohrbach, Marcus Rohrbach

First: 2026-04-07T17:58:04+00:00 · Latest: 2026-04-07T17:58:04+00:00

Abstract

Large vision-language models can produce object hallucinations in image descriptions, highlighting the need for effective detection and mitigation strategies. Prior work commonly relies on the model's attention weights on visual tokens as a detection signal. We reveal that coarse-grained attention-based analysis is unreliable due to hidden confounders, specifically token position and object repetition in a description. This leads to Simpson's paradox: the attention trends reverse or disappear when statistics are aggregated. Based on this observation, we introduce HaloProbe, a Bayesian framework that factorizes external description statistics and internal decoding signals to estimate token-level hallucination probabilities. HaloProbe uses balanced training to isolate internal evidence and combines it with a learned prior over external features to recover the true posterior. While intervention-based mitigation methods often degrade utility or fluency by modifying models' internals, we use HaloProbe as an external scoring signal for non-invasive mitigation. Our experiments show that HaloProbe-guided decoding reduces hallucinations more effectively than state-of-the-art intervention-based methods while preserving utility.

中文标题/摘要

标题：HaloProbe：视觉语言模型中物体幻觉的贝叶斯检测与缓解

大型视觉语言模型在图像描述中可能会产生物体幻觉，这突显了有效检测和缓解策略的必要性。先前的工作通常依赖模型对视觉标记的注意力权重作为检测信号。我们揭示了粗粒度的基于注意力的分析并不可靠，因为存在隐藏的混杂因素，特别是描述中的标记位置和物体重复。这导致辛普森悖论：当统计数据被汇总时，注意力趋势会逆转或消失。基于这一观察，我们引入了HaloProbe，这是一种贝叶斯框架，它将外部描述统计与内部解码信号分解开来，以估计标记级别的幻觉概率。HaloProbe通过平衡训练分离出内部证据，并将其与在外部特征上学习到的先验相结合，以恢复真实的后验。基于干预的缓解方法往往因修改模型内部而损害效用或流畅性，而我们将HaloProbe用作外部评分信号，实现非侵入式缓解。我们的实验表明，HaloProbe引导的解码比最先进的基于干预的方法更有效地减少幻觉，同时保持效用。
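
One way to read the factorization described in the HaloProbe abstract is as a standard prior-shift correction. The sketch below illustrates that reading only; it is not the authors' implementation. It assumes a probe trained on class-balanced data, so that its score reflects internal evidence alone, and a separately estimated prior from external features such as token position. All function names and numbers are hypothetical.

```python
def likelihood_ratio(p_balanced: float) -> float:
    """Turn the score of a class-balanced probe into a likelihood ratio.

    With equal class priors during training, p / (1 - p) equals
    P(evidence | hallucinated) / P(evidence | grounded).
    """
    return p_balanced / (1.0 - p_balanced)


def posterior(p_balanced: float, prior: float) -> float:
    """Combine internal evidence with an external prior via Bayes' rule in odds form."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = likelihood_ratio(p_balanced) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical token: the probe sees strong internal evidence (0.8), but tokens
# in this position are rarely hallucinated (assumed prior of 0.1).
print(round(posterior(0.8, 0.1), 4))
```

With these assumed inputs the corrected posterior is 4/13, roughly 0.31: strong internal evidence is tempered by a low prior, which is the kind of confounder correction the abstract motivates.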

Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Authors: Md Zarif Hossain, Ahmed Imteaj

First: 2024-07-20T19:53:52+00:00 · Latest: 2026-04-07T17:57:19+00:00

Comments: Accepted at IJCNN 2026

Abstract

Vision-Language Models (VLMs) rely heavily on pretrained vision encoders to support downstream tasks such as image captioning, visual question answering, and zero-shot classification. Despite their strong performance, these encoders remain highly vulnerable to imperceptible adversarial perturbations, which can severely degrade both robustness and semantic quality in multimodal reasoning. In this work, we introduce Sim-CLIP, an unsupervised adversarial fine-tuning framework that enhances the robustness of the CLIP vision encoder while preserving overall semantic representations. Sim-CLIP adopts a Siamese training architecture with a cosine similarity objective and a symmetric stop-gradient mechanism to enforce semantic alignment between clean and adversarial views. This design avoids large-batch contrastive learning and additional momentum encoders, enabling robust training with low computational overhead. We evaluate Sim-CLIP across multiple Vision-Language Models and tasks under both targeted and untargeted adversarial attacks. Experimental results demonstrate that Sim-CLIP consistently outperforms state-of-the-art robust CLIP variants, achieving stronger adversarial robustness while maintaining or improving semantic fidelity. These findings highlight the limitations of existing adversarial defenses and establish Sim-CLIP as an effective and scalable solution for robust vision-language representation learning.

中文标题/摘要

标题：Sim-CLIP：面向鲁棒且语义丰富的视觉语言模型的无监督孪生对抗微调

视觉语言模型（VLMs）依赖预训练的视觉编码器来支持诸如图像字幕、视觉问答和零样本分类等下游任务。尽管这些编码器表现出色，但它们仍然高度易受不可感知的对抗性扰动的影响，这会严重降低多模态推理中的鲁棒性和语义质量。在本文中，我们提出了Sim-CLIP，这是一种无监督的对抗微调框架，可以增强CLIP视觉编码器的鲁棒性，同时保留整体语义表示。Sim-CLIP采用了一种双胞胎训练架构，使用余弦相似度目标和对称的停止梯度机制，以确保干净视图和对抗视图之间的语义对齐。该设计避免了大规模对比学习和额外的动量编码器，从而实现低计算开销的鲁棒训练。我们在多个视觉语言模型和任务上对Sim-CLIP进行了评估，包括有目标和无目标的对抗攻击。实验结果表明，Sim-CLIP在鲁棒性方面始终优于最先进的鲁棒CLIP变体，同时保持或提高了语义保真度。这些发现突显了现有对抗防御的局限性，并将Sim-CLIP确立为鲁棒视觉语言表示学习的有效且可扩展的解决方案。

Summary / 总结

The research aims to enhance the robustness of pretrained vision encoders used in Vision-Language Models (VLMs) while preserving semantic quality. Sim-CLIP employs an unsupervised adversarial fine-tuning framework with a Siamese architecture and cosine similarity objective, which improves robustness against adversarial attacks. Experiments show that Sim-CLIP outperforms existing robust CLIP variants in adversarial settings while maintaining semantic fidelity across various VLM tasks.

该研究旨在增强视觉语言模型（VLM）所用预训练视觉编码器的鲁棒性，同时保留语义质量。Sim-CLIP 是一种无监督的对抗微调框架，采用孪生架构和余弦相似度目标，并通过对称的停止梯度机制对齐干净视图和对抗视图。实验表明，Sim-CLIP 在各种 VLM 任务中的对抗鲁棒性优于现有鲁棒 CLIP 变体，同时保持或提高了语义保真度。

Gym-Anything: Turn any Software into an Agent Environment

Authors: Pranjal Aggarwal, Graham Neubig, Sean Welleck

First: 2026-04-07T17:38:15+00:00 · Latest: 2026-04-07T17:38:15+00:00

Abstract

Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. Distilling successful trajectories from the training split into a 2B vision-language model outperforms models 2× its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.

中文标题/摘要

标题：Gym-Anything: 将任何软件转换为代理环境

计算机使用代理具有在广泛数字经济活动中提供协助的潜力。然而，当前研究主要集中在有限的软件包上，这些软件包具有有限的经济价值，例如基本的电子商务和操作系统配置任务。主要原因在于，为复杂软件创建环境需要大量时间和人力，因此无法扩展。为解决这一问题，我们引入了Gym-Anything框架，用于将任何软件转换为交互式计算机使用环境。我们将环境创建本身作为多代理任务进行框架化：编码代理编写设置脚本，下载真实世界数据，并配置软件，同时产生正确的设置证据。独立的审计代理随后根据质量检查表验证环境设置的证据。基于美国GDP数据中的经济价值职业分类，我们应用此流水线对200个具有广泛职业覆盖的应用软件进行处理。结果是CUA-World，一个包含超过10000个长周期任务的集合，这些任务覆盖从医学科学和天文学到工程和企业系统等多个领域，每个任务都配置了现实数据，并且有训练和测试分割。CUA-World还包括CUA-World-Long，一个具有挑战性的长周期基准，其中的任务通常需要超过500步，远超现有基准。从训练分割中提炼成功的轨迹，用于训练一个20亿参数的视觉语言模型，其性能优于其两倍大小的模型。我们还在测试时应用相同的审计原则：一个独立的视觉语言模型审查已完成的轨迹，并提供反馈，以改进Gemini-3-Flash在CUA-World-Long上的表现，从11.5%提高到14.0%。我们发布了所有代码、基础设施和基准数据，以促进未来在现实计算机使用代理方面的研究。

Summary / 总结

The research aims to extend computer-use agents from short-horizon tasks to longer, more economically valuable ones by converting any software into an interactive environment. Environment creation is framed as a multi-agent task in which a coding agent sets up the software and an independent audit agent verifies the setup against a quality checklist. The result is CUA-World, a collection of over 10,000 long-horizon tasks across 200 software applications, including CUA-World-Long, a challenging benchmark whose tasks often require over 500 steps. A 2B vision-language model distilled from successful training trajectories outperforms models twice its size, and test-time auditing by a separate VLM improves Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%.

研究旨在通过将任何软件转换为交互式环境，将计算机使用代理的任务范围从短期任务扩展到更复杂的、具有更高经济价值的任务。方法是采用一个多代理框架，其中编码代理设置软件，审计代理验证设置。结果是CUA-World，包含超过10,000个跨多个领域的长期任务，其中CUA-World-Long是一个具有挑战性的基准，许多任务通常需要超过500步。研究还展示了将成功轨迹提炼成视觉语言模型优于更大模型，并在基准任务上提高了性能。

Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

Authors: Hao Chen, Fang Qiu, Fangchao Dong, Defei Yang, Eve Bohnett, Li An

First: 2026-04-07T17:36:01+00:00 · Latest: 2026-04-07T17:36:01+00:00

Abstract

This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its practical utility using a real drone-collected dataset. A thermal dataset was developed from drone-collected imagery and was used to fine-tune VLMs through multimodal projector alignment, enabling the transfer of information from RGB-based visual representations to thermal radiometric inputs. Three representative models, including InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, and Qwen3-VL-8B-Instruct, were benchmarked under both closed-set and open-set prompting conditions for species recognition and instance enumeration. Among the tested models, Qwen3-VL-8B-Instruct with open-set prompting achieved the best overall performance, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant, and within-1 enumeration accuracies of 0.779, 0.982, and 1.000, respectively. In addition, combining thermal imagery with simultaneously collected RGB imagery enabled the model to generate habitat-context information, including land-cover characteristics, key landscape features, and visible human disturbance. Overall, the findings demonstrate that lightweight projector-based adaptation provides an effective and practical route for transferring RGB-pretrained VLMs to thermal drone imagery, expanding their utility from object-level recognition to habitat-context interpretation in ecological monitoring.

中文标题/摘要

标题：轻量级多模态视觉语言模型在无人机热红外影像物种识别和栖息地环境解释中的适应

本研究提出了一种轻量级多模态适应框架，以弥合基于RGB预训练的VLMs与热红外影像之间的表示差距，并通过实际的无人机采集数据集展示了其实用价值。从无人机采集的影像中开发了一个热红外数据集，并通过多模态投影对齐对VLMs进行了微调，从而将基于RGB的视觉表示信息转移到热红外辐射输入中。三种代表性模型，包括InternVL3-8B-Instruct、Qwen2.5-VL-7B-Instruct和Qwen3-VL-8B-Instruct，分别在封闭集和开放集提示条件下进行了物种识别和实例枚举基准测试。在测试的模型中，使用开放集提示的Qwen3-VL-8B-Instruct在鹿、犀牛和大象上的F1分数分别为0.935、0.915和0.968，1以内枚举准确率分别为0.779、0.982和1.000。此外，结合热红外影像和同时采集的RGB影像使模型能够生成栖息地环境信息，包括土地覆盖特征、关键景观特征和可见的人类干扰。总体而言，研究结果表明，基于轻量级投影的适应为将基于RGB预训练的VLMs转移到无人机热红外影像中提供了有效且实用的途径，从而扩展了它们在生态监测中从对象级识别到栖息地环境解释的应用。

Summary / 总结

This study introduces a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained vision language models and thermal infrared imagery. It fine-tunes VLMs using a thermal dataset from drone-collected imagery and demonstrates the framework's effectiveness through species recognition and habitat context interpretation. The Qwen3-VL-8B-Instruct model with open-set prompting achieved the highest performance, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant, and within-1 enumeration accuracies of 0.779, 0.982, and 1.000, respectively. Combining thermal and RGB imagery also enabled habitat-context information generation, such as land-cover characteristics and visible human disturbance, enhancing the models' utility in ecological monitoring.

本研究提出了一种轻量级多模态适应框架，以弥合 RGB 预训练视觉语言模型与热红外图像之间的表示差距。该框架使用无人机采集的热图像数据集对 VLM 进行微调，使信息能够从基于 RGB 的视觉表示转移到热辐射输入。在闭集和开集提示条件下进行基准测试后，采用开集提示的 Qwen3-VL-8B-Instruct 表现最佳，在鹿、犀牛和大象上的 F1 分数分别为 0.935、0.915 和 0.968，误差 1 以内的计数准确率分别为 0.779、0.982 和 1.000。此外，结合热图像和同时采集的 RGB 图像使模型能够生成栖息地背景信息，包括土地覆盖特征和可见的人类干扰，展示了该框架在生态监测中的实用价值。
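
The two metrics reported for the thermal-imagery study can be sketched as follows. The abstract does not define within-1 enumeration accuracy, so the definition used here (the share of images whose predicted count differs from the true count by at most one) is an assumption, and the counts are made-up illustration data, not results from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall, from raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def within_k_accuracy(predicted, actual, k: int = 1) -> float:
    """Fraction of images whose predicted count is within k of the true count."""
    hits = sum(abs(p - a) <= k for p, a in zip(predicted, actual))
    return hits / len(actual)


# Hypothetical per-image animal counts (predicted vs. ground truth).
predicted_counts = [3, 5, 2, 7]
true_counts = [3, 4, 4, 7]

print(within_k_accuracy(predicted_counts, true_counts))  # 3 of 4 images within 1
print(f1_score(tp=8, fp=2, fn=2))
```

Under this reading, a within-1 accuracy of 1.000 for elephants would mean every predicted count was at most one off from the ground truth.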

CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics

Authors: Yulin Zou, Yan Chen, Wenyan Chen, JooYoung Park, Shivaraman Nitin, Luo Tao, Francisco Romero, Dmitrii Ustiugov

First: 2026-04-07T16:31:45+00:00 · Latest: 2026-04-07T16:31:45+00:00

Comments: 18 pages, 34 figures

Abstract

Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CoStream, a codec-guided streaming video analytics system built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CoStream treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CoStream achieves up to 3x throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines, while maintaining competitive accuracy with only 0-8% F1 drop.

中文标题/摘要

标题：CoStream：基于编解码器的资源高效视频流分析系统

视频流分析是视觉语言模型服务中的关键工作负载，但多模态推理的高成本限制了其可扩展性。先前的系统通过利用视频流中的时间和空间冗余来降低推理成本，但它们只针对视觉Transformer（ViT）或LLM其中之一，视角有限，未能挖掘端到端的优化机会。此外，现有方法在识别冗余时开销很大，要么依赖离线分析和训练，要么依赖昂贵的在线计算，因此不适合动态实时流。我们提出了CoStream，这是一种编解码器引导的视频流分析系统，其关键观察是：视频编解码器在压缩过程中已经顺带提取了每个流的时间和空间结构。CoStream 将这种编解码器元数据视为低成本的运行时信号，以统一视频解码、视觉处理和LLM预填充的优化；由于直接在压缩位流上操作，减少传输量是其固有的好处。由此实现了ViT编码前的编解码器引导补丁剪枝，以及LLM预填充期间的选择性键值缓存刷新，两者都完全在线进行，不需要离线训练。实验表明，与最先进的基线相比，CoStream 可实现高达3倍的吞吐量提升和高达87%的GPU计算量减少，同时保持有竞争力的准确率，F1仅下降0-8%。

Summary / 总结

CoStream is a codec-guided system for video streaming analytics that reuses the metadata video codecs already produce during compression to cut inference cost. By treating codec metadata as a low-cost runtime signal, CoStream jointly optimizes video decoding, visual processing, and LLM prefilling, achieving up to 3x throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines with only a 0-8% F1 drop.

CoStream 是一个编解码器引导的视频流分析系统，利用视频编解码器已提取的时间和空间结构来降低推理成本。通过将编解码器元数据用作低成本的运行时信号，CoStream 优化了视频解码、视觉处理和 LLM 预填充，实现了最多 3 倍的吞吐量提升和最多 87% 的 GPU 计算量减少，同时 F1 仅下降 0-8%。
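
The idea of codec-guided patch pruning can be illustrated with a toy example. The abstract does not say which codec fields CoStream reads or how it thresholds them, so the per-patch motion magnitudes, the threshold, and the function names below are all hypothetical; this is a sketch of the general technique, not the system's pipeline.

```python
def select_changed_patches(motion_magnitudes, threshold):
    """Return indices of patches whose codec-reported motion exceeds the threshold.

    Only these patches would be sent to the vision encoder; the rest are
    treated as unchanged since the previous frame.
    """
    return [i for i, m in enumerate(motion_magnitudes) if m > threshold]


def fraction_skipped(num_patches, kept):
    """Share of patches that skip encoding for this frame."""
    return 1.0 - len(kept) / num_patches


# Hypothetical motion magnitudes for the 8 patches of one inter-coded frame.
motion = [0.0, 2.5, 0.1, 4.0, 0.0, 0.0, 0.3, 0.0]
kept = select_changed_patches(motion, threshold=0.5)

print(kept)                                 # patches that still need encoding
print(fraction_skipped(len(motion), kept))  # share of encoder work avoided
```

The same index list could in principle drive a selective key-value cache refresh, recomputing only the tokens that correspond to the kept patches, which is the second optimization the abstract names.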

Online In-Context Distillation for Low-Resource Vision Language Models

Authors: Zhiqi Kang, Rahaf Aljundi, Vaggelis Dorovatas, Karteek Alahari

First: 2025-10-20T21:35:17+00:00 · Latest: 2026-04-07T16:14:59+00:00

Abstract

As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher's zero-shot performance.

中文标题/摘要

标题：在线上下文蒸馏以适应低资源视觉语言模型

随着领域对资源的追求不断加强，这项工作将焦点转向了一个关键问题：视觉语言模型（VLM）如何在资源有限、预算受限的环境中茁壮成长？虽然大型VLM表现出色，但在这些环境中部署它们是不切实际的。相比之下，小型VLM虽然高效，但在部署领域与大型模型的性能差距通常需要昂贵的微调来弥补。受上下文学习框架的启发，我们提出了一种在线上下文蒸馏（ICD）方法，在此方法中，小型VLM在推理时与更强的教师模型合作，通过稀疏演示来传递知识，从而高效地弥合两者之间的差距。我们的方法基于对当前视觉语言上下文学习（ICL）可行性的深入分析，展示了在受限计算预算下ICL相较于微调的优势。我们通过一种新颖的跨模态演示选择策略、教师测试时缩放以减少噪声、以及学生不确定性条件来动态填充演示池并最小化教师查询，来增强我们的方法。我们的ICD方法使用稀缺的教师注释（低至4%）显著提升了小型模型的性能（最多提升33%），并能与教师的零样本性能相媲美。

Summary / 总结

This work addresses the challenge of deploying vision-language models (VLMs) in low-resource settings by proposing an online In-Context Distillation (ICD) method. A small VLM learns from a stronger teacher model through sparse demonstrations at inference time, thereby bridging the performance gap. Enhanced with a cross-modal demonstration selection strategy, teacher test-time scaling, and student uncertainty conditioning, ICD improves small-model performance by up to 33% using scarce teacher annotations (as low as 4%) and competes with the teacher's zero-shot performance. The underlying analysis also shows an advantage of in-context learning over fine-tuning under constrained compute budgets.

该研究针对在低资源环境下部署视觉语言模型（VLMs）的挑战，提出了一种在线In-Context Distillation (ICD) 方法。该方法在推理时让一个小模型通过稀疏演示从一个较大的教师模型学习，从而缩小性能差距。主要发现表明，ICD 可以显著提升小模型的性能，最多可提升 33%，仅使用教师模型注解的 4% 就能达到这一效果，并且与教师的零样本性能相当。
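
Student uncertainty conditioning, as described in the ICD abstract, can be sketched as an entropy gate in front of the teacher. Everything concrete below is an assumption made for illustration: the entropy threshold, the toy student and teacher, and the rule that a teacher answer is both returned and stored as a demonstration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def run_stream(samples, student, teacher, threshold):
    """Query the teacher only when the student is uncertain, growing the demo pool."""
    pool, outputs, teacher_calls = [], [], 0
    for x in samples:
        label, probs = student(x, pool)
        if entropy(probs) > threshold:
            label = teacher(x)        # expensive call, made sparingly
            teacher_calls += 1
            pool.append((x, label))   # becomes an in-context demonstration
        outputs.append(label)
    return outputs, pool, teacher_calls


def toy_student(x, pool):
    """Hypothetical student: confident only if a similar demonstration exists."""
    for demo_x, demo_label in pool:
        if demo_x[0] == x[0]:
            return demo_label, [0.95, 0.05]
    return "unknown", [0.5, 0.5]


def toy_teacher(x):
    """Hypothetical teacher that always knows the answer."""
    return "cat" if x.startswith("c") else "dog"


outputs, pool, calls = run_stream(
    ["c1", "c2", "d1", "c3", "d2"], toy_student, toy_teacher, threshold=0.5
)
print(outputs, calls)
```

In this toy stream the teacher is consulted twice for five inputs; once a demonstration of each kind is in the pool, the student answers the rest on its own.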

Vero: An Open RL Recipe for General Visual Reasoning

Authors: Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

First: 2026-04-06T17:56:25+00:00 · Latest: 2026-04-07T15:20:05+00:00

Comments: Project page: https://vero-reasoning.github.io/

Abstract

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) show such broad visual reasoning is within reach, but the recipe behind them remains unclear, locked behind proprietary reinforcement learning (RL) pipelines with non-public data. We introduce Vero, a family of fully open VLMs that matches or exceeds existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answer formats. Vero achieves state-of-the-art performance, improving over four base models by 3.6-5.3 points on average across VeroEval, our suite of 30 challenging benchmarks. Starting from Qwen3-VL-8B-Instruct, Vero outperforms Qwen3-VL-8B-Thinking on 23 of 30 benchmarks without additional proprietary thinking data. When trained from the same base model, Vero-600K exceeds existing RL datasets across task categories. Systematic ablations reveal that different task categories elicit qualitatively distinct reasoning patterns that transfer poorly in isolation, suggesting that broad data coverage is the primary driver of strong RL scaling. All data, code, and models are released.

中文标题/摘要

标题：Vero：面向通用视觉推理的开放RL方案

要构建一个能够在图表、科学、空间理解和开放任务中通用的视觉推理器，需要什么？最强的视觉-语言模型（VLMs）表明这种广泛的视觉推理是可行的，但它们背后的配方仍然不清楚，被私有的强化学习（RL）管道和非公开数据所封锁。我们介绍了Vero，一个完全开放的VLM家族，它在多种视觉推理任务中与现有开放权重模型相当或超越。我们跨六个广泛的任务类别扩展了RL数据和奖励，构建了Vero-600K，一个来自59个数据集的600K样本数据集，并设计了任务导向的奖励来处理异构答案格式。Vero在我们的30个具有挑战性的基准测试套件VeroEval中达到了最先进的性能，平均在四个基模型上提高了3.6-5.3分。从Qwen3-VL-8B-Instruct开始，Vero在30个基准测试中的23个上优于Qwen3-VL-8B-Thinking，而无需额外的私有思考数据。当从相同的基模型训练时，Vero-600K在任务类别中超过了现有的RL数据集。系统性的消融实验表明，不同的任务类别引发了不同的推理模式，这些模式在孤立时难以转移，这表明广泛的数据覆盖是强RL扩展的主要驱动力。所有数据、代码和模型均已发布。

Summary / 总结

Vero is an open reinforcement learning (RL) recipe for visual reasoning across various tasks such as charts, science, and spatial understanding. It scales RL data and rewards across six broad task categories, creating a 600K-sample dataset, Vero-600K, and uses task-routed rewards to handle different answer formats. Vero achieves state-of-the-art performance, improving over four base models by an average of 3.6-5.3 points on VeroEval, a suite of 30 challenging benchmarks. The systematic ablations show that broad data coverage is crucial for strong RL scaling across different task categories.

Vero 是一种开放的强化学习 (RL) 方法，构建了一系列视觉语言模型 (VLMs)，能够完成多种视觉推理任务。它在六个广泛的任务类别中扩展了 RL 数据和奖励，创建了一个包含 600K 个样本的数据集 Vero-600K，并使用任务导向的奖励来处理不同的答案格式。Vero 达到了最先进的性能，在 VeroEval 套件中的 30 个具有挑战性的基准测试中，平均提高了 3.6-5.3 个点。从 Qwen3-VL-8B-Instruct 开始，Vero 在 30 个基准测试中的 23 个上超过了 Qwen3-VL-8B-Thinking，而无需额外的专有思考数据，这表明广泛的覆盖数据是强 RL 扩展的主要驱动力。
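
Task-routed rewards for heterogeneous answer formats can be pictured as a dispatch table keyed by task type. The Vero abstract does not list its task types or scoring rules, so the three routes, the 1% numeric tolerance, and all names below are hypothetical choices made only to show the routing pattern.

```python
def reward_multiple_choice(pred: str, gold: str) -> float:
    """Exact match on the option letter, ignoring case and whitespace."""
    return 1.0 if pred.strip().upper() == gold.strip().upper() else 0.0


def reward_numeric(pred: str, gold: str, rel_tol: float = 0.01) -> float:
    """Match within a relative tolerance; unparsable predictions score zero."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return 0.0
    return 1.0 if abs(p - g) <= rel_tol * max(1.0, abs(g)) else 0.0


def reward_open_text(pred: str, gold: str) -> float:
    """Exact match after lowercasing and collapsing whitespace."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return 1.0 if normalize(pred) == normalize(gold) else 0.0


# Hypothetical routing table: each task type gets a scorer suited to its format.
REWARD_ROUTES = {
    "multiple_choice": reward_multiple_choice,
    "numeric": reward_numeric,
    "open_text": reward_open_text,
}


def routed_reward(task_type: str, pred: str, gold: str) -> float:
    """Pick the reward function that matches the sample's task type."""
    return REWARD_ROUTES[task_type](pred, gold)


print(routed_reward("numeric", "3.14", "3.1416"))
```

Routing keeps each scorer simple and lets a mixed-format training set share one reward interface, which is the practical problem the abstract points to.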

Is CLIP Cross-Eyed? Revealing and Mitigating Center Bias in the CLIP Family

Authors: Oscar Chew, Hsiao-Ying Huang, Kunal Jain, Tai-I Chen, Khoa D Doan, Kuan-Hao Huang

First: 2026-04-07T15:04:33+00:00 · Latest: 2026-04-07T15:04:33+00:00

Abstract

Recent research has shown that contrastive vision-language models such as CLIP often lack fine-grained understanding of visual content. While a growing body of work has sought to address this limitation, we identify a distinct failure mode in the CLIP family, which we term center bias, that persists even in recent model variants. Specifically, CLIP tends to disproportionately focus on the central region of an image, overlooking important objects located near the boundaries. This limitation is fundamental, as failure to recognize relevant objects makes it difficult to perform any sophisticated task that depends on those objects. To understand the underlying causes of the limitation, we conduct analyses from both representation and attention perspectives. Using interpretability methods, i.e., embedding decomposition and attention map analysis, we find that relevant concepts, especially those associated with off-center objects, vanish from the model's final representation due to information loss during the aggregation of visual embeddings, particularly the reliance on pooling mechanisms. Finally, we show that this bias can be alleviated with training-free strategies such as visual prompting and attention redistribution, which redirect models' attention to off-center regions.

中文标题/摘要

标题：CLIP是斜视的吗？揭示和缓解CLIP家族中的中心偏见

近期研究表明，CLIP等对比式视觉-语言模型往往缺乏对视觉内容的细粒度理解。尽管越来越多的工作试图解决这一局限，我们发现CLIP家族中存在一种独特的失败模式，我们称之为中心偏见，它即使在最近的模型变体中也依然存在。具体来说，CLIP倾向于过度关注图像的中心区域，而忽视靠近边界的重要物体。这一局限是根本性的，因为无法识别相关物体会使依赖这些物体的任何复杂任务都难以完成。为了理解这一局限的根本原因，我们从表示和注意力两个角度进行了分析。借助嵌入分解和注意力图分析这两种可解释性方法，我们发现相关概念（尤其是与偏离中心的物体相关的概念）会从模型的最终表示中消失，原因是视觉嵌入聚合过程中的信息丢失，特别是对池化机制的依赖。最后，我们展示了可以通过视觉提示和注意力重分配等免训练策略，将模型的注意力重新引导到偏离中心的区域，从而缓解这种偏见。

"I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?

Authors: Naen Xu, Jiayi Sheng, Changjiang Li, Chunyi Zhou, Yuyuan Li, Tianyu Du, Jun Wang, Zhihui Fu, Jinbao Li, Shouling Ji

Venue: ACL 2026

First: 2026-04-07T14:31:32+00:00 · Latest: 2026-04-07T14:31:32+00:00

Comments: ACL 2026 Main

Abstract

Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.

中文标题/摘要

标题："我看到了你的伎俩": 大型视觉-语言模型能否理解多模态双关语？

双关语是一种常见的修辞性文字游戏，利用一词多义和语音相似性来制造幽默。在多模态双关语中，视觉和文本元素协同作用，在锚定字面意义的同时唤起比喻意义。尽管视觉-语言模型（VLMs）广泛应用于多模态理解和生成，但由于缺乏严格的基准测试，它们理解双关语的能力尚未得到系统研究。为了解决这一问题，我们首先提出了一种多模态双关语生成管道。然后，我们引入了MultiPun数据集，该数据集包含各种类型的双关语以及对抗性的非双关语干扰项。我们的评估表明，大多数模型难以区分真正的双关语和这些干扰项。此外，我们提出了提示级和模型级两类策略来增强双关语理解，F1分数平均提高了16.5%。我们的研究结果为开发未来能够通过跨模态推理掌握人类幽默微妙之处的VLMs提供了宝贵的见解。

Summary / 总结

The paper investigates whether large vision-language models can understand multimodal puns, which exploit both visual and textual elements for humor. It introduces a multimodal pun generation pipeline and the MultiPun dataset, which includes various types of puns and non-pun distractors. The evaluation shows that most models have difficulty distinguishing puns from non-pun distractors. The authors propose strategies to improve pun comprehension, resulting in an average 16.5% increase in F1 scores. This work provides insights for developing VLMs that can better understand human-like humor through cross-modal reasoning.

论文探讨了大型视觉-语言模型是否能够理解结合了视觉和文本元素的双关语。它提出了一个跨模态双关语生成管道，并构建了包含各种类型双关语和非双关语干扰项的MultiPun数据集。评估结果显示，大多数模型难以区分真正的双关语和非双关语干扰项。作者提出了提高双关语理解的策略，平均提高了16.5%的F1分数。这项工作为开发能够通过跨模态推理更好地理解人类幽默的VLMs提供了有价值的见解。

AICA-Bench: Holistically Examining the Capabilities of VLMs in Affective Image Content Analysis

Authors: Dong She, Xianrong Yao, Liqun Chen, Jinghe Yu, Yang Gao, Zhanpeng Jin

Venue: ACL 2026

First: 2026-04-07T14:05:17+00:00 · Latest: 2026-04-07T14:05:17+00:00

Comments: Accepted by Findings of ACL 2026

Abstract

Vision-Language Models (VLMs) have demonstrated strong capabilities in perception, yet holistic Affective Image Content Analysis (AICA), which integrates perception, reasoning, and generation into a unified framework, remains underexplored. To address this gap, we introduce AICA-Bench, a comprehensive benchmark with three core tasks: Emotion Understanding (EU), Emotion Reasoning (ER), and Emotion-Guided Content Generation (EGCG). We evaluate 23 VLMs and identify two major limitations: weak intensity calibration and shallow open-ended descriptions. To address these issues, we propose Grounded Affective Tree (GAT) Prompting, a training-free framework that combines visual scaffolding with hierarchical reasoning. Experiments show that GAT reduces intensity errors and improves descriptive depth, providing a strong baseline for future research on affective multimodal understanding and generation.

中文标题/摘要

标题：AICA-Bench：全面评估VLMs在情感图像内容分析中的能力

视觉-语言模型（VLMs）在感知方面表现出强大的能力，但将感知、推理和生成整合到统一框架中的整体性情感图像内容分析（AICA）仍未得到充分探索。为填补这一空白，我们引入了AICA-Bench，这是一个全面的基准测试，包含三个核心任务：情感理解（EU）、情感推理（ER）和情感引导的内容生成（EGCG）。我们评估了23个VLMs，并发现两个主要局限：强度校准能力弱和开放式描述浅显。为解决这些问题，我们提出了Grounded Affective Tree（GAT）提示，这是一种无需训练的框架，将视觉支架与层次推理相结合。实验表明，GAT减少了强度误差并提高了描述的深度，为未来的情感多模态理解和生成研究提供了强有力的基线。

Summary / 总结

The research aims to explore the capabilities of Vision-Language Models (VLMs) in Affective Image Content Analysis (AICA), which involves integrating perception, reasoning, and generation. AICA-Bench, a comprehensive benchmark, evaluates 23 VLMs and identifies limitations in intensity calibration and open-ended description. To address these, the study proposes Grounded Affective Tree (GAT) Prompting, which combines visual scaffolding with hierarchical reasoning to improve intensity accuracy and descriptive depth, offering a strong baseline for future research.

研究旨在探索视觉语言模型（VLMs）在情感图像内容分析（AICA）中的能力，AICA涉及感知、推理和生成的整合。AICA-Bench 是一个新的基准，评估了23个VLMs在情感理解、情感推理和情感引导内容生成三个任务上的表现。研究发现VLMs在强度校准和开放性描述质量方面存在不足。为解决这些问题，作者提出了基于情感树（GAT）提示，这是一种无需训练的框架，能够提高强度准确性并增强描述深度，为未来的情感多模态理解和生成研究提供了一个强有力的基线。

EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration

Authors: Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi, Wei Zhang, Ran Lin, Liang Wang

Venue: CVPR 2026

First: 2025-12-22T13:42:18+00:00 · Latest: 2026-04-07T13:58:20+00:00

Comments: CVPR 2026 Findings

Abstract

Contemporary GUI agents, while increasingly capable due to advances in Large Vision-Language Models (VLMs), often operate with a critical limitation: they treat each task in isolation, lacking a mechanism to systematically learn from past successes. This digital "amnesia" results in sub-optimal performance, repeated errors, and poor generalization to novel challenges. To bridge this gap, we introduce EchoTrail-GUI, a novel framework designed to mimic human-like experiential learning by equipping agents with a dynamic, accessible memory. Our framework operates in three distinct stages. First, during Experience Exploration, an agent autonomously interacts with GUI environments to build a curated database of successful task trajectories, validated by a reward model. Crucially, the entire knowledge base construction is thus fully automated, requiring no human supervision. Second, in the Memory Injection stage, upon receiving a new task, our system efficiently retrieves the most relevant past trajectories to serve as actionable "memories". Finally, during GUI Task Inference, these memories are injected as in-context guidance to inform the agent's reasoning and decision-making process. We demonstrate the efficacy of our approach on benchmarks including Android World and AndroidLab. The results show that EchoTrail-GUI significantly improves the task success rate and operational efficiency of baseline agents, validating the power of structured memory in creating more robust and intelligent GUI automation.

中文标题/摘要

标题：EchoTrail-GUI：通过评判器引导的自我探索为GUI代理构建可操作的记忆

当代GUI代理由于大型视觉-语言模型（VLMs）的进步而变得越来越强大，但它们通常以一个关键限制为代价：它们将每个任务视为独立的，缺乏系统学习过去成功经验的机制。这种“数字健忘症”导致了次优性能、重复错误和对新挑战的不良泛化。为了弥合这一差距，我们提出了EchoTrail-GUI，这是一种新型框架，旨在通过为代理提供动态且易于访问的记忆来模拟人类经验学习。我们的框架分为三个阶段。首先，在经验探索阶段，代理自主与GUI环境交互，构建由奖励模型验证的成功任务轨迹数据库。重要的是，整个知识库构建过程完全自动化，无需人类监督。其次，在记忆注入阶段，当收到新任务时，我们的系统高效地检索最相关的过去轨迹，作为可操作的“记忆”。最后，在GUI任务推理阶段，这些记忆作为上下文指导注入，以指导代理的推理和决策过程。我们在Android World和AndroidLab等基准测试上展示了我们方法的有效性。结果表明，EchoTrail-GUI 显著提高了基线代理的任务成功率和操作效率，验证了结构化记忆在创建更强大和智能的GUI自动化方面的强大功能。

Summary / 总结

EchoTrail-GUI is a framework that addresses the limitation of GUI agents by providing them with a dynamic memory system. The system autonomously collects successful task trajectories and uses a reward model to validate them. When a new task is given, it retrieves relevant past trajectories to guide the agent's actions. This approach significantly improves task success rates and operational efficiency on benchmarks like Android World and AndroidLab, demonstrating the effectiveness of structured memory in GUI automation.

EchoTrail-GUI 是一个框架，通过为 GUI 代理提供动态记忆系统来解决数字健忘的问题。该框架包括三个阶段：经验探索，代理自主学习过去的经验；记忆注入，为新任务检索相关经验；以及 GUI 任务推理，这些记忆指导代理的行为。该方法在 Android World 和 AndroidLab 等基准测试中显著提高了任务成功率和操作效率。
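
The Memory Injection and GUI Task Inference stages can be sketched as top-k retrieval followed by prompt assembly. The abstract does not specify the retrieval method, so the word-overlap (Jaccard) similarity, the memory record layout, and the prompt wording below are stand-in assumptions rather than the framework's actual design.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two task descriptions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve(task: str, memory: list, k: int = 1) -> list:
    """Return the k stored trajectories whose task text is most similar."""
    ranked = sorted(memory, key=lambda m: jaccard(task, m["task"]), reverse=True)
    return ranked[:k]


def build_prompt(task: str, memories: list) -> str:
    """Inject retrieved trajectories as in-context guidance ahead of the new task."""
    lines = ["Past successful trajectories:"]
    for m in memories:
        lines.append(f"- {m['task']}: {' -> '.join(m['steps'])}")
    lines.append(f"New task: {task}")
    return "\n".join(lines)


# Hypothetical memory bank of validated trajectories.
memory = [
    {"task": "turn off wifi in settings", "steps": ["open Settings", "tap Network", "toggle Wi-Fi"]},
    {"task": "send an email to alice", "steps": ["open Mail", "tap Compose", "tap Send"]},
    {"task": "open settings and enable bluetooth", "steps": ["open Settings", "toggle Bluetooth"]},
]

task = "turn on wifi in settings"
top = retrieve(task, memory, k=1)
print(build_prompt(task, top))
```

A real system would more likely use learned embeddings for retrieval; word overlap is used here only to keep the example dependency-free.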

A State-Update Prompting Strategy for Efficient and Robust Multi-turn Dialogue

Authors: Ziyi Liu

First: 2025-09-22T13:26:24+00:00 · Latest: 2026-04-07T13:32:38+00:00

Abstract

Large Language Models (LLMs) struggle with information forgetting and inefficiency in long-horizon, multi-turn dialogues. To address this, we propose a training-free prompt engineering method, the State-Update Multi-turn Dialogue Strategy. It utilizes "State Reconstruction" and "History Remind" mechanisms to effectively manage dialogue history. Our strategy shows strong performance across multiple multi-hop QA datasets. For instance, on the HotpotQA dataset, it improves the core information filtering score by 32.6%, leading to a 14.1% increase in the downstream QA score, while also reducing inference time by 73.1% and token consumption by 59.4%. Ablation studies confirm the pivotal roles of both components. Our work offers an effective solution for optimizing LLMs in long-range interactions, providing new insights for developing more robust Agents.

中文标题/摘要

标题：一种面向高效、稳健多轮对话的状态更新提示策略

大型语言模型（LLMs）在长时域、多轮对话中面临信息遗忘和低效的问题。为解决这一问题，我们提出了一种无需训练的提示工程方法——状态更新多轮对话策略。该方法利用“状态重建”和“历史提醒”机制有效管理对话历史。我们的策略在多个多跳问答数据集上表现出色。例如，在HotpotQA数据集上，它将核心信息过滤得分提高了32.6%，导致下游问答得分提高了14.1%，同时将推理时间减少了73.1%，减少了59.4%的令牌消耗。消融研究证实了两个组件的关键作用。我们的工作为优化LLMs在长距离交互中的表现提供了有效解决方案，为开发更稳健的代理提供了新的见解。

Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models

Authors: Zonghao Ying, Haowen Dai, Lianyu Hu, Zonglei Jing, Quanchen Zou, Yaodong Yang, Aishan Liu, Xianglong Liu

First: 2026-04-07T13:16:07+00:00 · Latest: 2026-04-07T13:16:07+00:00

Abs · PDF · Code1 · Code2

Abstract

Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zero-order loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across 7 models on 2 benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignments and underscore the urgent need for typography-aware multimodal defense mechanisms.

中文标题/摘要

标题：在像素之间阅读：针对文本到图像模型的书写性越狱攻击

现代文本到图像（T2I）模型现在可以渲染清晰可读的、段落长度的文本，由此产生了一类全新的滥用方式。我们识别并形式化了书写性越狱（inscriptive jailbreak）：攻击者迫使 T2I 系统生成含有有害文本载荷（例如欺诈性文件）的图像，这些载荷嵌入在视觉上无害的场景中。与诱导生成视觉上令人反感图像的传统描绘性越狱不同，书写性攻击利用的是文本渲染能力本身。由于现有的越狱技术是为粗粒度的视觉操纵而设计的，它们难以在保持字符级保真度的同时绕过多级安全过滤器。为了揭示这一漏洞，我们提出了 Etch，一种黑盒攻击框架，将对抗性提示分解为三个功能上正交的层：语义伪装、视觉-空间锚定和排版编码。这种分解将对整个提示空间的联合优化化简为可处理的子问题，并通过零阶循环迭代细化。在这个过程中，一个视觉-语言模型对每幅生成的图像进行评判，将失败定位到特定层，并给出针对性的修订。在 7 个模型、2 个基准上的广泛评估表明，Etch 的平均攻击成功率达到 65.57%（峰值为 91.00%），显著优于现有基线。我们的结果揭示了当前 T2I 安全对齐中的一个关键盲点，并强调迫切需要具备排版感知能力的多模态防御机制。

Summary / 总结

The research aims to expose a new type of vulnerability in text-to-image models, known as the inscriptive jailbreak, where harmful text can be embedded in visually benign images. The study proposes Etch, a black-box attack framework that decomposes the adversarial prompt into semantic camouflage, visual-spatial anchoring, and typographic encoding to bypass multi-stage safety filters. Evaluations across seven models on two benchmarks show that Etch achieves an average attack success rate of 65.57%, significantly outperforming existing methods, highlighting the need for typography-aware defense mechanisms.

研究旨在揭示文本到图像模型中的一种新型漏洞——称为书写性越狱，攻击者可以在看似无害的图像中嵌入有害的文本内容。研究提出了一种名为Etch的黑盒攻击框架，将对抗性提示分解为语义伪装、视觉空间锚定和字体编码三个功能独立的层，以绕过多级安全过滤器。Etch的攻击成功率平均为65.57%，显著优于现有方法，突显了在文本到图像模型中需要具备字体意识的防御机制的紧迫性。

Vision-Guided Iterative Refinement for Frontend Code Generation

Authors: Hannah Sansford, Derek H. C. Law, Wei Liu, Abhishek Tripathi, Niresh Agarwal, Gerrit J. J. van den Burg

Venue: ICLR 2026

First: 2026-04-07T13:06:48+00:00 · Latest: 2026-04-07T13:06:48+00:00

Comments: Accepted at ICLR 2026 Workshop on AI with Recursive Self-Improvement

Abs · PDF · Code1 · Code2

Abstract

Code generation with large language models often relies on multi-stage human-in-the-loop refinement, which is effective but very costly - particularly in domains such as frontend web development where the solution quality depends on rendered visual output. We present a fully automated critic-in-the-loop framework in which a vision-language model serves as a visual critic that provides structured feedback on rendered webpages to guide iterative refinement of generated code. Across real-world user requests from the WebDev Arena dataset, this approach yields consistent improvements in solution quality, achieving up to 17.8% increase in performance over three refinement cycles. Next, we investigate parameter-efficient fine-tuning using LoRA to understand whether the improvements provided by the critic can be internalized by the code-generating LLM. Fine-tuning achieves 25% of the gains from the best critic-in-the-loop solution without a significant increase in token counts. Our findings indicate that automated, VLM-based critique of frontend code generation leads to significantly higher quality solutions than can be achieved through a single LLM inference pass, and highlight the importance of iterative refinement for the complex visual outputs associated with web development.

中文标题/摘要

标题：基于视觉引导的迭代优化在前端代码生成中的应用

使用大型语言模型进行代码生成通常依赖多阶段的人工介入修正，这虽然有效但成本极高，特别是在前端网页开发领域，解决方案的质量取决于渲染的视觉输出。我们提出了一种完全自动化的批评者在环框架，其中视觉语言模型作为视觉批评者，对渲染后的网页提供结构化的反馈，以指导生成代码的迭代优化。在WebDev Arena数据集的真实世界用户请求上，这种方法持续提高了解决方案质量，经过三个修正周期后性能最高提升17.8%。接下来，我们研究了使用LoRA进行参数高效微调，以了解批评者提供的改进是否可以被代码生成的LLM内化。微调实现了最佳批评者在环解决方案25%的收益，且令牌数量没有显著增加。我们的研究结果表明，基于视觉语言模型的自动批评能显著提升前端代码生成的解决方案质量，远高于单次LLM推理所能达到的水平，并突显了对于与网页开发相关的复杂视觉输出进行迭代优化的重要性。

Summary / 总结

The research aims to improve the quality of frontend code generation by using a vision-language model to provide structured feedback on rendered webpages, guiding iterative refinement. This approach consistently enhances solution quality, achieving up to a 17.8% increase in performance over three refinement cycles. Parameter-efficient fine-tuning with LoRA partially internalizes these improvements, recovering 25% of the gains of the best critic-in-the-loop solution without a significant increase in token counts.

该研究针对前端 web 开发中代码生成的人工迭代改进成本高问题，提出了一种基于视觉的迭代改进框架。视觉语言模型提供结构化的反馈，指导生成代码的逐步优化。在实际用户请求中，这种方法在三个迭代周期后将解决方案质量提高了最多 17.8%。使用 LoRA 进行参数高效的微调实现了这些改进的 25% 而无需显著增加 token 数量，表明自动化视觉批评对于复杂视觉输出的代码质量提升具有显著效果。
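
The critic-in-the-loop procedure summarized above amounts to a short generate, render, critique, revise loop. The sketch below shows only that control flow, under stated assumptions: `generate`, `render`, `critique`, and `revise` are hypothetical callables supplied by the caller (for example an LLM call, a headless-browser screenshot, and a VLM call), and none of these names come from the paper.

```python
def refine(request, generate, render, critique, revise, cycles=3):
    """Run a fixed number of critic-guided refinement cycles.

    generate(request) -> code
    render(code) -> screenshot (any object the critic understands)
    critique(request, screenshot) -> structured feedback
    revise(request, code, feedback) -> improved code
    All four callables are placeholders for model or tool calls.
    """
    code = generate(request)
    history = [code]  # keep every version so cycles can be compared afterwards
    for _ in range(cycles):
        screenshot = render(code)
        feedback = critique(request, screenshot)
        code = revise(request, code, feedback)
        history.append(code)
    return code, history
```

Keeping the full `history` makes it possible to score each cycle separately, which is how a per-cycle improvement curve such as the one reported over three cycles would be measured.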

ReMemNav: A Rethinking and Memory-Augmented Framework for Zero-Shot Object Navigation

Authors: Feng Wu, Wei Zuo, Wenliang Yang, Jun Xiao, Yang Liu, Xinhua Zeng

First: 2026-03-25T09:07:32+00:00 · Latest: 2026-04-07T12:54:16+00:00

Comments: 8 pages, 5 figures

Abs · PDF · Code1 · Code2

Abstract

Zero-shot object navigation requires agents to locate unseen target objects in unfamiliar environments without prior maps or task-specific training, which remains a significant challenge. Although recent advancements in vision-language models (VLMs) provide promising commonsense reasoning capabilities for this task, these models still suffer from spatial hallucinations, local exploration deadlocks, and a disconnect between high-level semantic intent and low-level control. In this regard, we propose a novel hierarchical navigation framework named ReMemNav, which seamlessly integrates panoramic semantic priors and episodic memory with VLMs. We introduce the Recognize Anything Model to anchor the spatial reasoning process of the VLM. We also design an adaptive dual-modal rethinking mechanism based on an episodic semantic buffer queue. The proposed mechanism actively verifies target visibility and corrects decisions using historical memory to prevent deadlocks. For low-level action execution, ReMemNav extracts a sequence of feasible actions using depth masks, allowing the VLM to select the optimal action for mapping into actual spatial movement. Extensive evaluations on HM3D and MP3D demonstrate that ReMemNav outperforms existing training-free zero-shot baselines in both success rate and exploration efficiency. Specifically, we achieve significant absolute performance improvements, with SR and SPL increasing by 1.7% and 7.0% on HM3D v0.1, 18.2% and 11.1% on HM3D v0.2, and 8.7% and 7.9% on MP3D.

中文标题/摘要

标题：ReMemNav：一种重新思考和记忆增强的零样本对象导航框架

零样本对象导航要求代理在没有先验地图或任务特定训练的情况下，在不熟悉的环境中定位未见过的目标对象，这仍然是一个重大挑战。尽管最近在视觉-语言模型（VLMs）方面的进展为这项任务提供了有希望的常识推理能力，但这些模型仍然存在空间幻觉、局部探索死锁以及高层语义意图与低层控制之间的脱节。为此，我们提出了一种新的分层导航框架ReMemNav，该框架无缝地将全景语义先验和情景记忆与VLMs结合在一起。我们引入了Recognize Anything Model来锚定VLM的空间推理过程。我们还设计了一种基于情景语义缓冲队列的自适应双模态重新思考机制。该机制主动验证目标可见性，并利用历史记忆纠正决策，以防止死锁。在低级动作执行方面，ReMemNav使用深度掩码提取一系列可行动作，使VLM能够选择最佳动作并将其映射为实际的空间运动。在HM3D和MP3D上的广泛评估表明，ReMemNav在成功率和探索效率方面均优于现有的无训练零样本基线。具体而言，我们实现了显著的绝对性能改进：SR和SPL在HM3D v0.1上分别提高了1.7%和7.0%，在HM3D v0.2上分别提高了18.2%和11.1%，在MP3D上分别提高了8.7%和7.9%。

Summary / 总结

ReMemNav is a hierarchical navigation framework that integrates panoramic semantic priors and episodic memory with vision-language models to address the challenge of zero-shot object navigation. It introduces the Recognize Anything Model to anchor spatial reasoning, and an adaptive dual-modal rethinking mechanism that verifies target visibility and uses historical memory to prevent deadlocks. Experimental results show that ReMemNav outperforms existing training-free zero-shot baselines on HM3D and MP3D, with gains in both success rate (SR) and SPL.

ReMemNav 是一个将全景语义先验和情景记忆与视觉语言模型相结合的层级导航框架，旨在解决零样本物体导航的挑战。它引入了 Recognize Anything 模型来锚定空间推理，并设计了基于情景语义缓冲队列的自适应双模态重思考机制以防止死锁。实验结果表明，ReMemNav 在 HM3D 和 MP3D 数据集上的成功率和探索效率均优于现有基线，特别是在不同版本的 HM3D 和 MP3D 上，成功率和 SPL 均有显著提升。
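
The abstract reports gains in SR and SPL without defining them. As a reading aid, here is a minimal sketch of the two metrics as they are commonly defined for embodied navigation: SR is the fraction of successful episodes, and SPL weights each success by the ratio of shortest-path length to the path actually taken. The episode dictionary keys (`success`, `shortest`, `path`) are assumptions made for this example, not the evaluation code used in the paper.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length.

    Each successful episode contributes shortest / max(path, shortest);
    failed episodes contribute 0. The result is averaged over all episodes.
    """
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest"] / max(e["path"], e["shortest"])
    return total / len(episodes)
```

For instance, an agent that succeeds in two of three episodes, once along the shortest path and once along a path twice as long, gets SR = 2/3 but SPL = (1 + 0.5 + 0) / 3 = 0.5, which is why SPL is read as a measure of efficiency as well as success.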

WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

Authors: Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan, Shiming Xiang

Venue: ACL 2026

First: 2026-04-07T12:52:38+00:00 · Latest: 2026-04-07T12:52:38+00:00

Comments: Accepted by ACL 2026 Findings

Abs · PDF · Code1 · Code2 · Code3

Abstract

Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advancements, prevailing methods still primarily depend on images as the retrieval key, and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to leverage their potential fully. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that bridges these gaps by proposing a multi-modal retriever and redefining the role of VLMs. Rather than serving merely as answer generators, we assign VLMs two specialized agents: a Refiner and an Inspector. The Refiner utilizes the capability of VLMs to rewrite the textual query according to the input image, significantly improving the performance of the multimodal retriever. The Inspector facilitates a decoupled generation strategy by selectively routing reliable retrieved context to another LLM for answer generation, while relying on the VLM's internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released on https://github.com/zhuyjan/WikiSeeker.

中文标题/摘要

标题：WikiSeeker：重新思考知识基础视觉问答中多模态视觉语言模型的作用

多模态检索增强生成（RAG）已成为知识基础视觉问答（KB-VQA）中非常有效的范式。尽管最近取得了进展，但现有方法仍然主要依赖图像作为检索键，并且往往忽视或错误地定位视觉语言模型（VLMs）的作用，从而未能充分利用其潜力。在本文中，我们介绍了WikiSeeker，这是一种新颖的多模态RAG框架，通过提出多模态检索器并重新定义VLMs的作用来弥补这些差距。我们赋予VLMs两个专门的代理：修整器和检查员。修整器利用VLMs根据输入图像重写文本查询的能力，显著提高了多模态检索器的性能。检查员通过选择性地将可靠的检索上下文路由到另一个LLM进行答案生成，促进了一种解耦生成策略，而在检索不可靠时依赖VLMs的内部知识。在EVQA、InfoSeek和M2KR上的广泛实验表明，WikiSeeker达到了最先进的性能，检索准确性和答案质量都有显著提高。我们的代码将在https://github.com/zhuyjan/WikiSeeker上发布。

Summary / 总结

WikiSeeker is a novel multi-modal RAG framework that enhances Knowledge-Based Visual Question Answering by leveraging Vision-Language Models (VLMs) more effectively. It introduces a Refiner and an Inspector to improve retrieval accuracy and answer quality. The Refiner rewrites textual queries based on input images, while the Inspector selectively routes retrieved context to another LLM for answer generation. Experiments on EVQA, InfoSeek, and M2KR show that WikiSeeker outperforms existing methods with significant improvements in both retrieval accuracy and answer quality.

WikiSeeker 是一种新型的多模态 RAG 框架，通过更有效地利用视觉语言模型（VLMs）来提升基于知识的视觉问答能力。它引入了 Refiner 和 Inspector，以改进检索和答案生成。Refiner 根据输入图像重写文本查询，而 Inspector 选择性地将检索到的上下文路由给另一个 LLM 进行答案生成。实验表明，WikiSeeker 在 EVQA、InfoSeek 和 M2KR 数据集上的检索准确性和答案质量都优于现有方法。

Image Diffusion Preview with Consistency Solver

Authors: Fu-Yun Wang, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong, Bohyung Han, Ming-Hsuan Yang, Han Zhang, Yukun Zhu, Ting Liu, Long Zhao

Venue: CVPR 2026

First: 2025-12-15T17:47:49+00:00 · Latest: 2026-04-07T12:50:18+00:00

Comments: Accepted by CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver derived from general linear multistep methods, a lightweight, trainable high-order solver optimized via Reinforcement Learning, that enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on-par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at https://github.com/G-U-N/consolver.

中文标题/摘要

标题：图像扩散预览与一致性求解器

图像扩散模型的缓慢推理过程显著降低了交互式用户体验。为解决这一问题，我们引入了扩散预览这一新颖范式，通过快速、低步数采样生成初步输出供用户评估，直到预览被判定满意后再进行全步数细化。现有的加速方法，包括无训练求解器和后训练蒸馏，难以提供高质量的预览或确保预览与最终输出之间的一致性。我们提出了一致性求解器ConsistencySolver，这是一种源自通用线性多步法的轻量级、可训练的高阶求解器，通过强化学习优化，能够提升预览质量和一致性。实验结果表明，ConsistencySolver在低步数场景中显著提高了生成质量和一致性，使其成为高效的预览和细化工作流的理想选择。值得注意的是，它在使用47%更少的步骤的情况下，FID得分与Multistep DPM-Solver相当，同时优于蒸馏基线。此外，用户研究显示，我们的方法将总体用户交互时间减少了近50%，同时保持了生成质量。代码可在https://github.com/G-U-N/consolver/获取。

Summary / 总结

The research aims to improve the interactive user experience of image diffusion models by introducing Diffusion Preview, which uses rapid, low-step sampling to generate preliminary outputs for user evaluation. The ConsistencySolver, a lightweight, trainable solver derived from general linear multistep methods and optimized via Reinforcement Learning, is proposed to enhance preview quality and consistency. Experiments show that ConsistencySolver significantly improves generation quality and consistency, achieving FID scores comparable to Multistep DPM-Solver with fewer steps and outperforming distillation baselines. User studies indicate a nearly 50% reduction in user interaction time while maintaining generation quality.

研究旨在通过引入Diffusion Preview，利用快速、低步数采样进行初始输出生成，并在预览满意后再进行全步数细化，以改善图像扩散模型的交互用户体验。研究提出了一种轻量级、可训练的ConsistencySolver，该方法基于通用线性多步法并使用强化学习进行优化，以增强预览质量和一致性。实验结果显示，ConsistencySolver在较少步骤的情况下显著提高了生成质量和一致性，优于现有方法，并且在保持质量的同时将用户交互时间减少了近50%。
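
Since ConsistencySolver is described as derived from general linear multistep methods, a small example of that family may help. The sketch below implements a generic explicit linear multistep update, x_{n+1} = x_n + h * sum_j b_j * f(t_{n-j}, x_{n-j}), on a toy ODE. It is not the paper's solver: the fixed Adams-Bashforth coefficients (1.5, -0.5) stand in for coefficients that the paper says are learned via reinforcement learning, and the function name and Euler bootstrap are assumptions made for this example.

```python
def linear_multistep(f, x0, t0, t1, steps, coeffs=(1.5, -0.5)):
    """Integrate dx/dt = f(t, x) from t0 to t1 with an explicit multistep rule.

    coeffs[j] multiplies the derivative evaluated j steps back. The default
    is two-step Adams-Bashforth; a trainable solver would tune these values.
    """
    h = (t1 - t0) / steps
    t, x = t0, x0
    history = []  # past derivative evaluations, most recent first
    for _ in range(steps):
        history.insert(0, f(t, x))
        history = history[: len(coeffs)]
        if len(history) < len(coeffs):
            # Not enough history yet: bootstrap with a plain Euler step.
            x = x + h * history[0]
        else:
            x = x + h * sum(c * d for c, d in zip(coeffs, history))
        t += h
    return x
```

Passing `coeffs=(1.0,)` reduces the rule to Euler's method, which makes it easy to see how reusing past derivative evaluations buys accuracy at the same number of function calls. That trade-off is what matters in the low-step regime the abstract targets.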

Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

Authors: Naman Deep Singh, Francesco Croce, Matthias Hein

Venue: CVPR 2026

First: 2024-12-01T08:39:12+00:00 · Latest: 2026-04-07T12:10:40+00:00

Comments: CVPR 2026 Findings

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-Language models like CLIP have been shown to be highly effective at linking visual perception and natural language understanding, enabling sophisticated image-text capabilities, including strong retrieval and zero-shot classification performance. Their widespread use, as well as the fact that CLIP models are trained on image-text pairs from the web, make them both a worthwhile and relatively easy target for backdoor attacks. As training foundational models, such as CLIP, from scratch is very expensive, this paper focuses on cleaning potentially poisoned models via fine-tuning. We first show that existing cleaning techniques are not effective against simple structured triggers used in Blended or BadNet backdoor attacks, exposing a critical vulnerability for potential real-world deployment of these models. Then, we introduce PAR, Perturb and Recover, a surprisingly simple yet effective mechanism to remove backdoors from CLIP models. Through extensive experiments across different encoders and types of backdoor attacks, we show that PAR achieves high backdoor removal rate while preserving good standard performance. Finally, we illustrate that our approach is effective even only with synthetic text-image pairs, i.e. without access to real training data. The code and models are available on GitHub: https://github.com/nmndeep/PerturbAndRecover.

中文标题/摘要

标题：扰动与恢复：针对CLIP的有效后门移除微调

视觉-语言模型如CLIP已被证明在连接视觉感知与自然语言理解方面非常有效，能够支持复杂的图像-文本能力，包括强大的检索和零样本分类性能。由于其广泛应用，加之CLIP模型是在来自网络的图像-文本对上训练的，它们成为后门攻击既有价值又相对容易的目标。由于从头训练CLIP这样的基础模型代价高昂，本文专注于通过微调来净化可能被投毒的模型。我们首先表明，现有的净化技术对Blended或BadNet后门攻击中使用的简单结构化触发器无效，这暴露了这些模型在实际部署中的关键漏洞。然后，我们引入了PAR（Perturb and Recover，扰动与恢复），这是一种出人意料地简单而有效的机制，用于从CLIP模型中移除后门。通过在不同编码器和不同类型的后门攻击上进行广泛实验，我们展示了PAR在实现高后门移除率的同时保持良好的标准性能。最后，我们说明即使仅使用合成的文本-图像对，即无法访问真实训练数据，我们的方法依然有效。代码和模型可在GitHub上获得。

Summary / 总结

This paper addresses the vulnerability of CLIP models to backdoor attacks by introducing a new fine-tuning method called PAR (Perturb and Recover). It demonstrates that existing cleaning techniques are ineffective against simple structured triggers used in backdoor attacks. The PAR method is shown to effectively remove backdoors while maintaining good standard performance across different encoders and types of attacks. Additionally, the approach works even with synthetic data, indicating its robustness without needing real training data.

该论文通过引入新的细调方法PAR（Perturb and Recover）来应对CLIP模型对后门攻击的脆弱性。它表明现有的清洁技术对使用简单结构触发器的后门攻击无效。通过大量实验，PAR被证明能够有效地移除后门并保持良好的标准性能，即使仅使用合成数据而无需访问真实训练数据。

Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0

Authors: Roni Goldshmidt, Hamish Scott, Lorenzo Niccolini, Hernan Matzner

First: 2026-04-07T12:10:21+00:00 · Latest: 2026-04-07T12:10:21+00:00

Abs · PDF · Code1 · Code2

Abstract

We present BADAS-2.0, the second generation of our collision anticipation system, building on BADAS-1.0 [7], which showed that fine-tuning V-JEPA2 [1] on large-scale ego-centric dashcam data outperforms both academic baselines and production ADAS systems. BADAS-2.0 advances the state of the art along three axes. (i) Long-tail benchmark and accuracy: We introduce a 10-group long-tail benchmark targeting rare and safety-critical scenarios. To construct it, BADAS-1.0 is used as an active oracle to score millions of unlabeled drives and surface high-risk candidates for annotation. Combined with Nexar's Atlas platform [13] for targeted data collection, this expands the dataset from 40k to 178,500 labeled videos (~2M clips), yielding consistent gains across all subgroups, with the largest improvements on the hardest long-tail cases. (ii) Knowledge distillation to edge: Domain-specific self-supervised pre-training on 2.25M unlabeled driving videos enables distillation into compact models, BADAS-2.0-Flash (86M) and BADAS-2.0-Flash-Lite (22M), achieving 7-12x speedup with near-parity accuracy, enabling real-time edge deployment. (iii) Explainability: BADAS-2.0 produces real-time object-centric attention heatmaps that localize the evidence behind predictions. BADAS-Reason [17] extends this with a vision-language model that consumes the last frame and heatmap to generate driver actions and structured textual reasoning. Inference code and evaluation benchmarks are publicly available.

中文标题/摘要

标题：超越蜂鸣声：基于BADAS-2.0的大规模碰撞预见与实时可解释性

我们提出了BADAS-2.0，这是我们的碰撞预见系统的第二代产品，基于BADAS-1.0 [7]，后者表明在大规模第一人称行车记录仪数据上微调V-JEPA2 [1]优于学术基线和量产ADAS系统。BADAS-2.0在三个维度上推进了技术前沿。(i) 长尾基准和准确性：我们引入了一个10组长尾基准，针对罕见和安全关键场景。为构建该基准，我们将BADAS-1.0用作主动预言机，对数百万未标记的驾驶视频进行评分，并筛选出高风险候选进行标注。结合Nexar的Atlas平台 [13] 进行有针对性的数据收集，这将数据集从4万扩展到178,500个标注视频（约200万片段），在所有子组中均取得一致的改进，最大的改进出现在最难的长尾案例上。(ii) 面向边缘的知识蒸馏：在225万未标记驾驶视频上进行领域特定的自监督预训练，使模型能够被蒸馏为紧凑模型BADAS-2.0-Flash（86M）和BADAS-2.0-Flash-Lite（22M），在精度接近持平的情况下实现7-12倍的速度提升，从而支持实时边缘部署。(iii) 可解释性：BADAS-2.0生成实时的以对象为中心的注意力热图，定位预测背后的证据。BADAS-Reason [17] 在此基础上引入一个视觉语言模型，该模型以最后一帧和热图为输入，生成驾驶员动作建议和结构化文本推理。推理代码和评估基准已公开。

Summary / 总结

BADAS-2.0 is an advanced collision anticipation system that builds on BADAS-1.0 by introducing a 10-group long-tail benchmark for rare and critical scenarios, expanding the dataset to 178,500 labeled videos. It also includes compact models BADAS-2.0-Flash and BADAS-2.0-Flash-Lite, which achieve real-time edge deployment with near-parity accuracy and up to 12x speedup. Additionally, BADAS-2.0 provides real-time object-centric attention heatmaps for explainability, with BADAS-Reason generating structured textual reasoning based on these heatmaps and the last frame.

BADAS-2.0 是一种先进的碰撞预测系统，在 BADAS-1.0 的基础上引入了一个针对罕见和安全关键场景的10组长尾基准，并将数据集扩展至178,500个标注视频。它还包括紧凑型模型 BADAS-2.0-Flash 和 BADAS-2.0-Flash-Lite，在精度接近持平的情况下实现实时边缘部署，速度提升最高可达12倍。此外，BADAS-2.0 提供了实时的对象中心注意力热图，以实现可解释性，BADAS-Reason 基于这些热图和最后一帧生成结构化的文本推理。

One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

Authors: Haoxiang Rao, Zhao Wang, Chenyang Si, Yan Lyu, Yuanyi Duan, Fang Zhao, Caifeng Shan

First: 2026-03-18T09:32:41+00:00 · Latest: 2026-04-07T11:27:04+00:00

Comments: Accepted by CVPR2026. Code: https://github.com/echrao/O2MAG

Abs · PDF · Code1 · Code2 · Code3

Abstract

Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.

中文标题/摘要

标题：一到多：基于注意力控制的无训练高保真异常生成

工业异常检测（AD）的特点是正常图像丰富而异常图像稀缺。尽管已经提出了许多少样本异常合成方法来增加异常数据以供下游AD任务使用，但大多数现有方法需要耗时的训练，并且难以学习与真实异常一致的分布，从而限制了基于此类数据训练的AD模型的效果。为了解决这些限制，我们提出了一种无训练的少样本异常生成方法，即O2MAG，该方法利用一个参考异常图像中的自我注意力来合成更多真实的异常，支持有效的下游异常检测。具体而言，O2MAG通过自我注意力嫁接操纵三个并行的扩散过程，并结合异常掩码以减轻前景背景查询混淆，生成与真实异常分布紧密一致的文本引导异常。为了弥合编码异常文本提示与真实异常语义之间的语义差距，引入了异常引导优化以使合成过程与目标异常分布对齐，引导生成真实且文本一致的异常。此外，为了减轻异常掩码内的微弱异常合成，生成过程中采用了双重注意力增强来强化掩码区域的自我和交叉注意力。广泛的实验验证了O2MAG的有效性，证明了其在下游AD任务中优于先前的最先进方法。

Summary / 总结

The paper addresses the challenge of generating realistic anomalies for industrial anomaly detection tasks, where normal data is abundant but anomalous data is scarce. It introduces O2MAG, a training-free method that uses self-attention to synthesize more realistic anomalies from a single reference image. Key findings show that O2MAG outperforms existing methods in generating text-guided anomalies that closely match real anomalous distributions, thereby improving the effectiveness of downstream anomaly detection models.

论文针对工业异常检测任务中正常数据丰富而异常数据稀缺的问题，提出了一种无需训练的O2MAG方法，利用自注意力机制从单个参考异常图像中合成更真实的异常样本。实验结果表明，O2MAG在生成与真实异常分布高度一致的文本引导异常方面优于现有方法，从而提高了下游异常检测模型的效果。

MPM: Mutual Pair Merging for Efficient Vision Transformers

Authors: Simon Ravé, Pejman Rasti, David Rousseau

Venue: CVPR 2026

First: 2026-04-07T11:16:18+00:00 · Latest: 2026-04-07T11:16:18+00:00

Comments: Accepted to CVPR 2026 (Findings)

Abs · PDF · Code1 · Code2

Abstract

Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.

中文标题/摘要

标题：MPM：互惠配对合并以提高视觉变换器的效率

减少序列长度是加速变换器的常见方法，但之前的标记减少工作通常针对分类并报告代理指标，而不是端到端延迟。对于语义分割，标记减少进一步受到需要重建密集的、像素对齐的特征的限制，而在现代加速器上，计算合并图的开销可能会抵消预期的收益。我们提出了互惠配对合并（MPM），这是一种无需训练的标记聚合模块，它在余弦空间中形成互近邻配对，对每个配对求平均，并记录一个合并图，以便在解码器之前进行基于收集的重建，从而可以使用现有的分割头不变。MPM 没有引入任何学习参数，也没有连续压缩旋钮（没有保留率或阈值）。速度-准确性的权衡由离散插入计划设置。我们在 NVIDIA H100 GPU（带和不带 FlashAttention-2）以及 Raspberry Pi 5 上对标准分割数据集进行了端到端延迟基准测试。在 ADE20K 上，MPM 将 ViT-Tiny 在 Raspberry Pi 5 上的每张图像延迟最多减少了 60%，同时在带有 FlashAttention-2 的 H100 上，吞吐量最多提高了 20%，且 mIoU 下降不到 3%。这些结果表明，当开销被明确考虑时，简单的、重建意识的、无需训练的标记合并可以转化为实际的墙钟收益。

Summary / 总结

The motivation for MPM is to accelerate vision transformers for semantic segmentation by reducing sequence length at a small, bounded cost in accuracy. MPM forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map that enables gather-based reconstruction before the decoder. This method introduces no learned parameters or continuous compression knobs and sets the speed-accuracy trade-off through a discrete insertion schedule. Experiments on ADE20K show that MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5 and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%.

论文提出了一种名为Mutual Pair Merging (MPM)的无训练token聚合方法，通过在余弦空间中形成互近邻对并平均它们来减少序列长度。该方法避免了学习参数或连续压缩旋钮的需求，并通过离散插入计划来设置速度-准确度权衡。标准分割数据集上的实验表明，MPM可以在Raspberry Pi 5上将每张图像的延迟最多减少60%，并在带有FlashAttention-2的NVIDIA H100 GPU上最多增加20%的吞吐量，同时保持mIoU低于3%的下降。
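
The merge-and-reconstruct idea described above can be illustrated with a small NumPy sketch. It follows the steps named in the abstract (mutual nearest neighbors in cosine space, pair averaging, a merge map, gather-based reconstruction) but is an illustration under stated assumptions rather than the authors' code: the function names, the single-sequence `(N, D)` layout, and the Python loop are choices made for readability here.

```python
import numpy as np


def mutual_pair_merge(x, eps=1e-12):
    """Merge mutual nearest-neighbor token pairs.

    x: array of shape (N, D), one row per token.
    Returns (merged, merge_map), where merged has shape (M, D) with M <= N
    and merge_map[i] is the row of merged that original token i ended up in.
    """
    n = x.shape[0]
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    sim = xn @ xn.T                      # cosine similarity between all tokens
    np.fill_diagonal(sim, -np.inf)       # a token is never its own neighbor
    nn = sim.argmax(axis=1)              # nearest neighbor of each token
    mutual = nn[nn] == np.arange(n)      # i and nn[i] choose each other

    merge_map = np.full(n, -1, dtype=int)
    merged = []
    for i in range(n):
        if merge_map[i] != -1:
            continue                     # already merged as someone's partner
        j = nn[i]
        if mutual[i] and merge_map[j] == -1:
            merge_map[i] = merge_map[j] = len(merged)
            merged.append((x[i] + x[j]) / 2.0)   # average the pair
        else:
            merge_map[i] = len(merged)
            merged.append(x[i])          # unpaired tokens pass through unchanged
    return np.stack(merged), merge_map


def unmerge(merged, merge_map):
    """Gather-based reconstruction back to the original sequence length."""
    return merged[merge_map]
```

Because `unmerge` is a plain gather, the reconstructed sequence has the original length and ordering, which is what would let an unchanged segmentation head consume it. A production version would vectorize the pairing step, since the abstract stresses that merge overhead can erase the gains.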

TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

Authors: Alexandros Stergiou

Venue: CVPR

First: 2025-11-23T09:12:48+00:00 · Latest: 2026-04-07T10:36:41+00:00

Comments: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, Project page: https://alexandrosstergiou.github.io/TRANSPORTER

Abs · PDF · Code1 · Code2 · Project1

Abstract

How do video understanding models acquire their answers? Although current Vision Language Models (VLMs) reason over complex scenes with diverse objects, action performances, and scene dynamics, understanding and controlling their internal processes remains an open challenge. Motivated by recent advancements in text-to-video (T2V) generative models, this paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos that capture the underlying rules behind VLMs' predictions. Given the high-visual-fidelity produced by T2V models, TRANSPORTER learns an optimal transport coupling to VLM's high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation. TRANSPORTER generates videos that reflect caption changes over diverse object attributes, action adverbs, and scene context. Quantitative and qualitative evaluations across VLMs demonstrate that L2V can provide a fidelity-rich, novel direction for model interpretability that has not been previously explored.

中文标题/摘要

标题：TRANSPORTER: 从VLM流形转移视觉语义

视频理解模型是如何获得它们的答案的？尽管当前的视觉语言模型（VLMs）能够处理包含多种物体、动作表演和场景动态的复杂场景，但理解并控制其内部过程仍然是一个开放的挑战。受文本到视频（T2V）生成模型最新进展的启发，本文引入了一个logits到视频（L2V）任务以及一个模型独立的方法TRANSPORTER，以生成能够捕捉VLM预测背后规则的视频。由于T2V模型生成的高视觉保真度，TRANSPORTER学习到VLM的高语义嵌入空间的最佳传输耦合。反过来，logit分数定义了条件视频生成的嵌入方向。TRANSPORTER生成的视频反映了不同物体属性、动作副词和场景上下文的字幕变化。跨VLM的定量和定性评估表明，L2V可以提供一种丰富保真度、新颖的模型可解释性方向，这是之前未被探索过的。

Summary / 总结

This paper aims to understand how video understanding models generate their predictions by introducing a logits-to-video (L2V) task and a model-independent approach called TRANSPORTER. TRANSPORTER learns an optimal transport coupling to VLM's high-semantic embedding spaces, where logit scores define embedding directions for conditional video generation. The results show that TRANSPORTER can generate videos reflecting changes in diverse object attributes, action adverbs, and scene context, providing a new direction for model interpretability that has not been explored before.

该论文通过引入logits-to-video (L2V) 任务和一个模型独立的方法TRANSPORTER，解决了理解视觉语言模型（VLMs）内部过程的挑战。TRANSPORTER 通过学习VLM的高语义嵌入空间中的最优传输耦合，使用logit分数定义条件视频生成的嵌入方向。该方法生成的视频反映了基于不同对象属性、动作副词和场景上下文的标题变化。实验结果表明，L2V 提供了一种具有高视觉保真度和语义准确性的新型模型可解释性方向，此前尚未被探索过。

Less Detail, Better Answers: Degradation-Driven Prompting for VQA

Authors: Haoxuan Han, Weijie Wang, Zeyu Zhang, Yefei He, Bohan Zhuang

First: 2026-04-06T16:41:19+00:00 · Latest: 2026-04-07T10:14:44+00:00

Comments: Accepted to CVPRW 2026. Project page: https://hhx-jpg.github.io/ddp/ , Code: https://github.com/ziplab/DDP

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Recent advancements in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes become noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity to force models to focus on essential structural information. We evaluate DDP across two distinct tasks. Physical attributes targets images prone to human misjudgment, where DDP employs a combination of 80p downsampling, structural visual aids (white background masks and orthometric lines), and In-Context Learning (ICL) to calibrate the model's focus. Perceptual phenomena addresses various machine-susceptible visual anomalies and illusions, including Visual Anomaly (VA), Color (CI), Motion (MI), Gestalt (GI), Geometric (GSI), and Visual Illusions (VI). For this task, DDP integrates a task-classification stage with specialized tools such as blur masks and contrast enhancement alongside downsampling. Our experimental results demonstrate that less is more: by intentionally degrading visual inputs and providing targeted structural prompts, DDP enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.

中文标题/摘要

标题：减少细节，获得更好答案：降级驱动的VQA提示

近期视觉-语言模型（VLMs）的进展显著推动了视觉问答（VQA）的边界。然而，高分辨率的细节有时会成为噪声，导致幻觉或推理错误。在本文中，我们提出了一种新颖的降级驱动提示（DDP）框架，通过有策略地降低图像保真度，迫使模型专注于关键结构信息，从而提高VQA性能。我们在两个不同的任务上评估了DDP。物理属性任务针对容易引起人类误判的图像，DDP结合使用80p下采样、结构视觉辅助（白色背景遮罩和正交线）以及上下文学习（ICL）来校准模型的焦点。知觉现象任务针对各种容易误导机器的视觉异常和错觉，包括视觉异常（VA）、颜色（CI）、运动（MI）、格式塔（GI）、几何（GSI）和视觉错觉（VI）。对于此任务，DDP将任务分类阶段与模糊遮罩、对比度增强等专用工具相结合，并配合下采样。我们的实验结果表明，少即是多：通过故意降级视觉输入并提供有针对性的结构提示，DDP使VLMs能够绕过分散注意力的纹理，在具有挑战性的视觉基准上实现更高的推理准确性。

Summary / 总结

This paper proposes Degradation-Driven Prompting (DDP), which improves VQA performance by deliberately reducing image fidelity so that the model focuses on essential structural information. DDP is evaluated on two tasks, Physical attributes and Perceptual phenomena, using techniques such as downsampling, structural visual aids, and in-context learning. The authors report that DDP improves reasoning accuracy by letting VLMs bypass distracting textures.

本文提出了降级驱动提示（DDP）方法，通过降低图像保真度来聚焦于关键结构信息，从而提升VQA性能。DDP在物理属性和知觉现象两个任务上进行了评估。对于物理属性任务，DDP使用80p下采样、结构视觉辅助和上下文学习。对于知觉现象任务，它结合了任务分类和模糊遮罩、对比度增强以及下采样。实验结果表明，通过有意降级视觉输入并提供有针对性的结构提示，DDP能够绕过干扰纹理并提高推理准确性。

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Authors: Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu

First: 2025-11-06T17:25:23+00:00 · Latest: 2026-04-07T09:55:11+00:00

Comments: 34 pages, 17 figures

Abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and Vision-Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) text and vision are separated as distinct modalities, which hinders unified multimodal understanding and generation. Therefore, we propose "Thinking with Video", a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench), which covers both vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., GSM8K and MMMU). Our evaluation on VideoThinkBench establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is comparable to state-of-the-art (SOTA) VLMs, and even surpasses GPT-5 by 10% on eyeballing puzzles. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 69.2% accuracy on MMMU. Furthermore, we systematically analyze the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings show that the video generation model is a potential unified multimodal understanding and generation model, positioning "Thinking with Video" as a potential unified multimodal reasoning paradigm.

中文标题/摘要

标题：视频思考：视频生成作为有前途的多模态推理范式

"文本思考"和"图像思考"范式显著提高了大型语言模型（LLMs）和视觉-语言模型（VLMs）的推理能力。然而，这些范式存在固有的局限性。首先，图像只能捕捉单一时刻，无法表示动态过程或连续变化；其次，将文本和视觉视为不同的模态，阻碍了统一的多模态理解和生成。因此，我们提出了"视频思考"这一新范式，利用如Sora-2等视频生成模型，将视频帧作为统一的多模态推理媒介。为了支持这一探索，我们开发了视频思考基准（VideoThinkBench），涵盖了视觉中心任务（如眼力拼图）和文本中心任务（如GSM8K和MMMU）。在VideoThinkBench上的评估表明Sora-2是一个有效的推理者。在视觉中心任务中，Sora-2与最先进的视觉-语言模型（SOTA）相当，甚至在眼力拼图上比GPT-5高出10%。在文本中心任务中，Sora-2在MATH上达到92%的准确率，在MMMU上达到69.2%的准确率。此外，我们系统地分析了这些能力的来源。我们还发现，自我一致性与上下文学习可以提高Sora-2的性能。总之，我们的研究结果表明，视频生成模型可能是统一的多模态理解和生成模型，将"视频思考"定位为潜在的统一多模态推理范式。

Summary / 总结

This paper introduces the 'Thinking with Video' paradigm, which uses video generation models for multimodal reasoning. It addresses the limitations of 'Thinking with Text' and 'Thinking with Images' by using video frames as a unified medium. The authors developed the Video Thinking Benchmark (VideoThinkBench) to evaluate this paradigm. On vision-centric tasks, Sora-2, a video generation model, performs comparably to state-of-the-art Vision-Language Models and surpasses GPT-5 by 10% on eyeballing puzzles. On text-centric tasks, Sora-2 reaches 92% accuracy on MATH and 69.2% on MMMU. The study also finds that self-consistency and in-context learning improve Sora-2's performance.

本文提出了"视频思考"的新范式，利用视频生成模型增强多模态推理能力，通过将视频帧作为统一媒介来解决"文本思考"和"图像思考"的局限性。作者开发了视频思考基准（VideoThinkBench）来评估这一范式，结果显示Sora-2这一视频生成模型在视觉中心任务上与最先进的视觉语言模型表现相当，并在眼力拼图上比GPT-5高出10%。在文本中心任务上，Sora-2在MATH和MMMU上的准确率分别达到92%和69.2%。研究还发现自我一致性与上下文学习可以提高Sora-2的表现。

MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Authors: Animesh Jain, Alexandros Stergiou

First: 2025-08-11T10:36:58+00:00 · Latest: 2026-04-07T09:29:57+00:00

Comments: Accepted at CVPRw 2026 - How Do Vision Models Work? (HOW) Workshop, Project page: https://anaekin.github.io/MIMIC

Abstract

Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework that inverts the internal encodings of VLMs. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We evaluate MIMIC both quantitatively and qualitatively by inverting visual concepts across a range of free-form VLM outputs of varying length. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.

中文标题/摘要

标题：MIMIC：多模态反演以实现模型解释与概念化

视觉语言模型（VLMs）将多模态输入编码到大型、复杂且难以解释的架构中，这限制了透明度和信任度。我们提出了一种多模态反演以实现模型解释与概念化（MIMIC）框架，该框架反演VLM的内部编码。MIMIC使用基于VLM的联合反演和特征对齐目标来考虑VLM的自回归处理。此外，它还包括用于空间对齐、自然图像平滑性和语义现实性的三个正则化器。我们通过反演不同长度的自由形式VLM输出中的视觉概念，从定量和定性两个方面评估MIMIC。报告的结果包括标准的视觉质量指标和语义文本指标。据我们所知，这是第一个针对VLM概念视觉解释的模型反演方法。

Summary / 总结

The research aims to improve transparency and trust in Vision Language Models (VLMs) through MIMIC, a multimodal inversion framework. MIMIC inverts the internal encodings of VLMs using a joint VLM-based inversion and a feature alignment objective, together with regularizers for spatial alignment, natural image smoothness, and semantic realism. The study evaluates MIMIC quantitatively and qualitatively by inverting visual concepts across free-form VLM outputs of varying length, reporting both visual quality metrics and semantic text-based metrics. According to the authors, it is, to the best of their knowledge, the first model inversion approach for visual interpretations of VLM concepts.

研究旨在通过开发名为MIMIC的多模态反演框架来增强视觉语言模型（VLMs）的透明度和信任度。MIMIC使用联合反演和特征对齐的方法，并包含用于空间对齐、自然图像平滑性和语义现实性的正则化项。研究通过在各种VLM输出中反演视觉概念，从定量和定性两个方面进行评估，并报告了视觉质量和语义指标。据作者所知，这是首个针对VLM概念视觉解释的模型反演方法。

From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection

Authors: Mengfei Liang, Yiting Qu, Yukun Jiang, Michael Backes, Yang Zhang

First: 2025-10-31T18:36:49+00:00 · Latest: 2026-04-07T09:06:09+00:00

Comments: 15 pages, 5 figures

Abstract

The rapid evolution of AI-generated images poses growing challenges to information integrity and media authenticity. Existing detection approaches face limitations in robustness, interpretability, and generalization across diverse generative models, particularly when relying on a single source of visual evidence. We introduce AIFo (Agent-based Image Forensics), a training-free framework that formulates AI-generated image detection as a multi-stage forensic analysis process through multi-agent collaboration. The framework integrates a set of forensic tools, including reverse image search, metadata extraction, pre-trained classifiers, and vision-language model analysis, and resolves insufficient or conflicting evidence through a structured multi-agent debate mechanism. An optional memory-augmented module further enables the framework to incorporate information from historical cases. We evaluate AIFo on a benchmark of 6,000 images spanning controlled laboratory settings and challenging real-world scenarios, where it achieves 97.05% accuracy and consistently outperforms traditional classifiers and strong vision-language model baselines. These findings demonstrate the effectiveness of agent-based procedural reasoning for AI-generated image detection.

中文标题/摘要

标题：从证据到判决：基于代理的AI生成图像检测法医框架

AI生成图像的迅速发展对信息完整性和媒体真实性提出了日益增长的挑战。现有的检测方法在鲁棒性、可解释性和跨多种生成模型的一般化方面存在局限性，尤其是在依赖单一视觉证据来源时。我们引入了AIFo（基于代理的图像法医），这是一种无需训练的框架，通过多代理协作将AI生成图像的检测过程转化为多阶段的法医分析过程。该框架整合了一系列法医工具，包括逆向图像搜索、元数据提取、预训练分类器和视觉-语言模型分析，并通过结构化的多代理辩论机制解决证据不足或冲突的问题。可选的记忆增强模块进一步使框架能够结合历史案例的信息。我们在包含6,000张图像的基准测试上评估了AIFo，这些图像覆盖了受控实验室环境和具有挑战性的现实世界场景，其准确率达到97.05%，并始终优于传统的分类器和强大的视觉-语言模型基线。这些发现表明，基于代理的程序性推理对于AI生成图像的检测是有效的。

Summary / 总结

The paper addresses the challenge of detecting AI-generated images by proposing AIFo, an agent-based forensic framework. AIFo uses multi-agent collaboration to integrate various forensic tools and resolve evidence conflicts through structured debates. It achieves 97.05% accuracy on a benchmark of 6,000 images, outperforming traditional classifiers and vision-language models.

论文提出了一种基于代理的法医框架AIFo，以应对检测AI生成图像的挑战。AIFo利用多代理协作整合各种法医工具，并通过结构化的代理辩论机制解决证据冲突。该框架在包含6,000张图像的基准测试中达到了97.05%的准确率，优于传统分类器和视觉语言模型。

ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference

Authors: Zhaohong Huang, Wenjing Liu, Yuxin Zhang, Fei Chao, Rongrong Ji

First: 2026-04-07T08:52:28+00:00 · Latest: 2026-04-07T08:52:28+00:00

Abstract

Recent advances have explored visual token pruning to accelerate the inference of large vision-language models (LVLMs). However, existing methods often struggle to balance token importance and diversity: importance-based methods tend to retain redundant tokens, whereas diversity-based methods may overlook informative ones. This trade-off becomes especially problematic under high reduction ratios, where preserving only a small subset of visual tokens is critical. To address this issue, we propose ID-Selection, a simple yet effective token selection strategy for efficient LVLM inference. The key idea is to couple importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, after which high-scoring tokens are selected one by one while the scores of similar tokens are progressively suppressed. In this way, ID-Selection preserves informative tokens while reducing redundancy in a unified selection process. Extensive experiments across 5 LVLM backbones and 16 main benchmarks demonstrate that ID-Selection consistently achieves superior performance and efficiency, especially under extreme pruning ratios. For example, on LLaVA-1.5-7B, ID-Selection prunes 97.2% of visual tokens, retaining only 16 tokens, while reducing inference FLOPs by over 97% and preserving 91.8% of the original performance, all without additional training.

中文标题/摘要

标题：ID-选择：基于重要性-多样性的视觉标记选择策略以实现高效的LVLM推理

近期进展探索了视觉标记剪枝以加速大型视觉-语言模型（LVLMs）的推理。然而，现有方法往往难以平衡标记的重要性与多样性：基于重要性的方法倾向于保留冗余标记，而基于多样性的方法可能会忽略有信息性的标记。这种权衡在高剪枝比下尤为突出，因为仅保留一小部分视觉标记至关重要。为了解决这一问题，我们提出了一种简单的有效标记选择策略ID-选择，以实现高效的LVLM推理。核心思想是将重要性估计与多样性意识的迭代选择相结合：首先为每个标记分配一个重要性评分，然后依次选择高评分标记，同时逐步抑制类似标记的评分。这样，ID-选择在统一的选择过程中保留了有信息性的标记，同时减少了冗余。在5种LVLM骨干网络和16个主要基准上的广泛实验表明，ID-选择在所有情况下都实现了优越的性能和效率，尤其是在极端剪枝比下。例如，在LLaVA-1.5-7B上，ID-选择剪枝了97.2%的视觉标记，仅保留了16个标记，推理FLOPs减少了超过97%，同时保留了91.8%的原始性能，且无需额外训练。

Summary / 总结

ID-Selection is a training-free token selection strategy for efficient inference of large vision-language models (LVLMs) that balances token importance and diversity. It assigns an importance score to each token, then selects high-scoring tokens one by one while progressively suppressing the scores of similar tokens, so that informative tokens are kept and redundancy is reduced. Across 5 LVLM backbones and 16 benchmarks, the authors report consistent gains over existing methods, especially at extreme pruning ratios. On LLaVA-1.5-7B, for example, ID-Selection prunes 97.2% of visual tokens (keeping 16), cuts inference FLOPs by over 97%, and preserves 91.8% of the original performance.

ID-Selection 是一种用于高效 LVLM 推断的 token 选择策略，能够平衡 token 的重要性和多样性。它为每个 token 分配重要性分数，并逐步选择高分 token 同时抑制相似的 token，从而保留信息性 token 并减少冗余。实验表明，ID-Selection 在保留 91.8% 的原始性能的同时，仅保留了 16 个 token，并且在 LLaVA-1.5-7B 上去除了 97.2% 的视觉 token，同时将推理 FLOPs 减少了超过 97%。
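
As a rough illustration of the selection loop described above, the following Python sketch picks tokens greedily by score and damps the scores of tokens that resemble each pick. It is not the authors' implementation: the cosine-similarity measure, the multiplicative `1 - similarity` suppression rule, and the names `cosine` and `select_tokens` are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_tokens(features, importance, k):
    """Greedy importance-plus-diversity selection (hypothetical sketch).

    features:   one feature vector per visual token
    importance: one initial importance score per token
    k:          number of tokens to keep
    Returns the indices of the kept tokens, in selection order.
    """
    scores = list(importance)
    selected = []
    for _ in range(min(k, len(features))):
        # Pick the highest-scoring token that has not been selected yet.
        best = max(
            (i for i in range(len(scores)) if i not in selected),
            key=lambda i: scores[i],
        )
        selected.append(best)
        # Assumed suppression rule: shrink the scores of the remaining
        # tokens in proportion to their similarity to the token just picked.
        for j in range(len(scores)):
            if j not in selected:
                sim = max(0.0, cosine(features[best], features[j]))
                scores[j] *= 1.0 - sim
    return selected


# Tokens 0 and 1 are near-duplicates; token 2 is different but less important.
toy_features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
toy_importance = [0.9, 0.8, 0.5]
print(select_tokens(toy_features, toy_importance, 2))  # [0, 2]
```

Ranking by importance alone would keep the two near-duplicate tokens 0 and 1; the suppression step makes the sketch keep tokens 0 and 2 instead. On the reported numbers: assuming LLaVA-1.5's usual 576 visual tokens (a figure not stated in the abstract), keeping 16 tokens corresponds to pruning 1 - 16/576 ≈ 97.2%, which matches the quoted ratio.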

Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality

Authors: Yanming Xiu, Zhengyuan Jiang, Neil Zhenqiang Gong, Maria Gorlatova

Venue: CVPR 2026

First: 2026-04-07T07:04:28+00:00 · Latest: 2026-04-07T07:04:28+00:00

Comments: CVPR 2026 Findings

Abstract

Augmented reality (AR) has rapidly expanded over the past decade. As AR becomes increasingly integrated into daily life, its security and reliability emerge as critical challenges. Among various threats, contradictory virtual content attacks, where malicious or inconsistent virtual elements are introduced into the user's view, pose a unique risk by misleading users, creating semantic confusion, or delivering harmful information. In this work, we systematically model such attacks and present ContrAR, a novel benchmark for evaluating the robustness of vision-language models (VLMs) against virtual content manipulation and contradiction in AR. ContrAR contains 312 real-world AR videos validated by 10 human participants. We further benchmark 11 VLMs, including both commercial and open-source models. Experimental results reveal that while current VLMs exhibit reasonable understanding of contradictory virtual content, room still remains for improvement in detecting and reasoning about adversarial content manipulations in AR environments. Moreover, balancing detection accuracy and latency remains challenging.

中文标题/摘要

标题：在增强现实中的矛盾虚拟内容攻击下视觉-语言模型基准测试

增强现实（AR）在过去十年中迅速扩展。随着AR越来越多地融入日常生活，其安全性和可靠性成为关键挑战。各种威胁中，矛盾虚拟内容攻击尤为独特，这种攻击通过引入恶意或不一致的虚拟元素误导用户，造成语义混淆或传递有害信息。在本工作中，我们系统地建模此类攻击，并提出了ContrAR，这是一种新型基准，用于评估视觉-语言模型（VLMs）在AR环境中对抗虚拟内容操纵和矛盾的鲁棒性。ContrAR包含10名人类参与者验证的312个真实世界的AR视频。我们进一步对11个VLM进行了基准测试，包括商业和开源模型。实验结果表明，尽管当前的VLM在理解矛盾虚拟内容方面表现出一定的能力，但在检测和推理AR环境中的对抗内容操纵方面仍有改进空间。此外，检测准确性和延迟之间的平衡仍然具有挑战性。

Summary / 总结

This work addresses the security challenge of contradictory virtual content attacks in augmented reality (AR) by introducing ContrAR, a benchmark for evaluating vision-language models (VLMs). ContrAR includes 312 real-world AR videos validated by humans and benchmarks 11 VLMs. The experiments show that current VLMs can understand contradictory virtual content but need improvement in detecting and reasoning about adversarial manipulations. Balancing detection accuracy and latency is also a challenge.

该研究通过引入ContrAR基准，评估视觉语言模型(VLMs)在增强现实(AR)中对抗矛盾虚拟内容攻击的能力。基准包含312个由人类验证的真实AR视频，并测试了11个VLMs。结果显示，尽管VLMs能够理解矛盾内容，但在检测和推理关于对抗性操纵方面仍需改进，同时在检测准确性和延迟之间取得平衡仍然具有挑战性。

Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

Authors: Keuntae Kim, Mingyu Kang, Yong Suk Choi

Venue: CVPR 2026

First: 2026-04-07T06:41:05+00:00 · Latest: 2026-04-07T06:41:05+00:00

Comments: CVPR 2026 - main

Abstract

Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language models (dMLLMs). These models are expected to retain the reasoning capabilities of LLMs while enabling faster inference through parallel generation. However, when combined with Chain-of-Thought (CoT) reasoning, dMLLMs exhibit two critical issues. First, we observe that dMLLMs often generate the final answer token at a very early timestep. This trend indicates that the model determines the answer before sufficient reasoning, leading to degraded reasoning performance. Second, during the initial timesteps, dMLLMs show minimal dependency on visual prompts, exhibiting a fundamentally different pattern of visual information utilization compared to AR vision-language models. In summary, these findings indicate that dMLLMs tend to generate premature final answers without sufficiently grounding on visual inputs. To address these limitations, we propose Position and Step Penalty (PSP) and Visual Reasoning Guidance (VRG). PSP penalizes tokens in later positions during early timesteps, delaying premature answer generation and encouraging progressive reasoning across timesteps. VRG, inspired by classifier-free guidance, amplifies visual grounding signals to enhance the model's alignment with visual evidence. Extensive experiments across various dMLLMs demonstrate that our method achieves up to 7.5% higher accuracy while delivering more than 3x speedup compared to reasoning with four times more diffusion steps.

中文标题/摘要

标题：思考扩散：在扩散多模态语言模型中惩罚和引导视觉导向推理

扩散大型语言模型（dLLMs）正在成为自回归（AR）LLMs的有前途的替代方案。最近，这一范式已扩展到多模态任务，导致了扩散多模态大型语言模型（dMLLMs）的发展。这些模型有望保留LLMs的推理能力，同时通过并行生成实现更快的推理。然而，当与链式思考（CoT）推理结合时，dMLLMs表现出两个关键问题。首先，我们观察到dMLLMs通常在非常早期的时间步生成最终答案标记。这一趋势表明，模型在充分推理之前就确定了答案，导致推理性能下降。其次，在初始时间步，dMLLMs对视觉提示的依赖性极小，显示出与AR视觉语言模型相比，视觉信息利用的根本不同模式。总之，这些发现表明，dMLLMs倾向于在没有充分基于视觉输入的情况下生成过早的最终答案。为了应对这些限制，我们提出了位置和步骤惩罚（PSP）和视觉推理引导（VRG）。PSP在早期时间步惩罚后期位置的标记，推迟过早的答案生成，并鼓励跨时间步的渐进推理。VRG受到无分类器引导的启发，放大视觉接地信号，增强模型与视觉证据的对齐。在各种dMLLMs上的广泛实验表明，我们的方法最高可将准确率提高7.5%，同时比使用四倍扩散步骤的推理速度快3倍以上。

Summary / 总结

The paper addresses the issues of premature answer generation and insufficient visual grounding in diffusion multimodal large language models (dMLLMs) when using Chain-of-Thought reasoning. It introduces Position and Step Penalty (PSP) and Visual Reasoning Guidance (VRG) to mitigate these problems. PSP delays premature answers by penalizing later tokens early on, while VRG enhances visual grounding. Experiments show that the proposed method improves accuracy by up to 7.5% and speeds up inference by more than 3x compared to using four times more diffusion steps.

研究针对扩散多模态大型语言模型（dMLLMs）中过早生成答案和视觉接地不足的问题。提出了位置和步骤惩罚（PSP）和视觉推理引导（VRG）来解决这些问题。PSP通过在早期惩罚后期生成的令牌来延迟最终答案的生成，而VRG增强了视觉接地信号。实验表明，所提出的方法可以将准确率提高多达7.5%，并将推理速度加快超过3倍，与使用四倍扩散步骤的推理相比。
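
To make the two ideas above more concrete, here is a minimal Python sketch. The abstract gives no formulas, so everything below is an assumption for illustration: the linear penalty schedule in `psp_penalty`, the guidance rule in `vrg_logits` (borrowed from the standard classifier-free guidance form, conditional plus a scaled difference from unconditional), and the parameter names `strength` and `scale`.

```python
def psp_penalty(position, seq_len, step, num_steps, strength=1.0):
    """Hypothetical position-and-step penalty.

    Largest for late positions at early denoising steps, and zero at the
    first position or at the final step. The linear schedule is an
    assumption, not the paper's formula.
    """
    pos_frac = position / max(seq_len - 1, 1)        # 0 at start, 1 at end
    early_frac = 1.0 - step / max(num_steps - 1, 1)  # 1 at first step, 0 at last
    return strength * pos_frac * early_frac


def vrg_logits(logits_with_image, logits_without_image, scale=1.0):
    """Hypothetical visual guidance in the classifier-free guidance style.

    Pushes the image-conditioned logits further away from the logits
    obtained without the image, amplifying what the image contributes.
    """
    return [
        cond + scale * (cond - uncond)
        for cond, uncond in zip(logits_with_image, logits_without_image)
    ]


# The last position is fully penalized at the first step, not at the last.
print(psp_penalty(position=9, seq_len=10, step=0, num_steps=8))  # 1.0
print(psp_penalty(position=9, seq_len=10, step=7, num_steps=8))  # 0.0

# Only the token whose logit depends on the image is boosted.
print(vrg_logits([2.0, 1.0], [1.0, 1.0], scale=0.5))  # [2.5, 1.0]
```

In a confidence-based unmasking loop, one plausible use would be to subtract `psp_penalty` from each masked position's confidence before choosing which tokens to reveal at a given step, so that late positions (where the final answer usually sits) are revealed later.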

Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis

Authors: Pu Wang, Zhixuan Mao, Jialu Li, Zhuoran Zheng, Dianjie Lu, Youshan Zhang

First: 2026-04-07T06:22:43+00:00 · Latest: 2026-04-07T06:22:43+00:00

Abstract

Automatic diagnosis of canine pneumothorax is challenged by data scarcity and the need for trustworthy models. To address this, we first introduce a public, pixel-level annotated dataset to facilitate research. We then propose a novel diagnostic paradigm that reframes the task as a synergistic process of signal localization and spectral detection. For localization, our method employs a Vision-Language Model (VLM) to guide an iterative Flow Matching process, which progressively refines segmentation masks to achieve superior boundary accuracy. For detection, the segmented mask is used to isolate features from the suspected lesion. We then apply Random Matrix Theory (RMT), a departure from traditional classifiers, to analyze these features. This approach models healthy tissue as predictable random noise and identifies pneumothorax by detecting statistically significant outlier eigenvalues that represent a non-random pathological signal. The high-fidelity localization from Flow Matching is crucial for purifying the signal, thus maximizing the sensitivity of our RMT detector. This synergy of generative segmentation and first-principles statistical analysis yields a highly accurate and interpretable diagnostic system (source code is available at: https://github.com/Pu-Wang-alt/Canine-pneumothorax).

中文标题/摘要

标题：统一VLM引导的流匹配和谱异常检测以实现可解释的兽医诊断

犬气胸的自动诊断受到数据稀缺性和需要可信模型的挑战。为解决这一问题，我们首先引入了一个公开的像素级标注数据集以促进研究。然后，我们提出了一种新的诊断范式，将任务重新定义为信号定位和谱检测的协同过程。在定位方面，我们的方法使用视觉语言模型（VLM）引导迭代的流匹配过程，逐步细化分割掩码以实现更优的边界精度。在检测方面，分割掩码用于隔离疑似病灶的特征。然后，我们应用随机矩阵理论（RMT），这是一种不同于传统分类器的方法，来分析这些特征。该方法将健康组织建模为可预测的随机噪声，并通过检测统计上显著的异常特征值来识别气胸，这些特征值代表了非随机的病理信号。流匹配的高保真度定位对于净化信号至关重要，从而最大化了我们RMT检测器的灵敏度。这种生成式分割与第一性原理统计分析的协同作用产生了高度准确且可解释的诊断系统（源代码可在：https://github.com/Pu-Wang-alt/Canine-pneumothorax 获取）。

Summary / 总结

The research aims to address the challenges of diagnosing canine pneumothorax due to data scarcity and the need for trustworthy models. It introduces a new diagnostic paradigm that combines Vision-Language Model-guided Flow Matching for precise localization and Random Matrix Theory for spectral detection. The method uses Flow Matching to refine segmentation masks, which are then used to isolate suspected lesion features for RMT analysis. This approach models healthy tissue as random noise and detects pneumothorax by identifying outlier eigenvalues. The results show high accuracy and interpretability in diagnosing canine pneumothorax.

研究旨在开发一种可解释的犬气胸诊断系统，以应对数据稀缺和模型可信度需求的问题。研究引入了一个公开的像素级标注数据集，并提出了一种新范式：用视觉语言模型（VLM）引导的流匹配进行信号定位，用随机矩阵理论（RMT）进行谱检测。流匹配逐步细化分割掩码以提高边界精度，掩码随后用于隔离疑似病灶的特征；RMT将健康组织建模为随机噪声，并通过识别异常特征值来检测气胸。作者报告该系统具有较高的准确性和可解释性。
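
The outlier-eigenvalue idea can be illustrated with a standard random matrix theory result. For a sample covariance matrix built from pure noise with variance σ², the eigenvalues asymptotically fall below the Marchenko-Pastur upper edge σ²(1 + √(p/n))², where p is the number of features and n the number of samples. The Python sketch below flags eigenvalues above that edge. It is only an assumed stand-in for the paper's detector: the abstract does not say which threshold or significance test is used, and the function names are hypothetical.

```python
import math


def marchenko_pastur_upper_edge(n_samples, n_features, noise_variance=1.0):
    """Upper edge of the Marchenko-Pastur bulk for a noise-only covariance."""
    gamma = n_features / n_samples  # aspect ratio p / n
    return noise_variance * (1.0 + math.sqrt(gamma)) ** 2


def outlier_eigenvalues(eigenvalues, n_samples, n_features, noise_variance=1.0):
    """Return the eigenvalues that exceed the noise bulk edge.

    Uses the asymptotic edge as a hard threshold, which ignores
    finite-sample fluctuations (an assumption of this sketch).
    """
    edge = marchenko_pastur_upper_edge(n_samples, n_features, noise_variance)
    return [lam for lam in eigenvalues if lam > edge]


# With 400 samples and 100 features, the unit-variance noise edge is 2.25.
print(marchenko_pastur_upper_edge(400, 100))                     # 2.25
print(outlier_eigenvalues([0.4, 1.1, 2.0, 5.3], 400, 100))       # [5.3]
```

In this toy case only the eigenvalue 5.3 lies outside the noise bulk, so it would be read as a candidate non-random signal. A practical detector would also need a margin for finite-sample effects and an estimate of the noise variance.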
