arXiv Paper Digest

Snapshot: 20260424_0430

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

Authors: Qiguang Chen, Chengyu Luan, Jiajun Wu, Qiming Yu, Yi Yang, Yizhuo Li, Jingqi Tong, Xiachong Feng, Libo Qin, Wanxiang Che

Venue: ACL 2026

First: 2026-04-22T17:37:40+00:00 · Latest: 2026-04-22T17:37:40+00:00

Comments: ACL 2026 Camera Ready

Abstract

Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.

CLIP-SVD: Efficient and Interpretable Vision-Language Adaptation via Singular Values

Authors: Taha Koleilat, Hassan Rivaz, Yiming Xiao

First: 2025-09-03T22:00:23+00:00 · Latest: 2026-04-22T16:33:26+00:00

Comments: TMLR 2026

Abstract

Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a multi-modal and parameter-efficient adaptation framework that applies Singular Value Fine-tuning (SVF) to CLIP, leveraging Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. Overall, this work provides the first extensive empirical evaluation of SVD-based finetuning in the vision-language model setting. The code and biomedical corpus are publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
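
The parameter saving can be made concrete for a single weight matrix. The following is a generic sketch of singular-value fine-tuning, not the paper's exact formulation; the symbols W, U, V, σ, σ' and r are notation assumed here.

```latex
W = U \,\mathrm{diag}(\sigma)\, V^{\top},
\qquad U \in \mathbb{R}^{m \times r},\;
V \in \mathbb{R}^{n \times r},\;
\sigma \in \mathbb{R}^{r},\;
r = \min(m, n)

W' = U \,\mathrm{diag}(\sigma')\, V^{\top},
\qquad \sigma' \text{ trainable and initialised at } \sigma,\;
U \text{ and } V \text{ frozen}

\frac{\#\,\text{trainable}}{\#\,\text{entries of } W}
= \frac{r}{mn} = \frac{1}{\max(m, n)}
```

For a 768 × 768 matrix this gives 768 trainable values out of 589,824 entries, roughly 0.13% of that matrix. The 0.04% figure quoted in the abstract is the paper's own count over the whole model.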

F²LP-AP: Fast & Flexible Label Propagation with Adaptive Propagation Kernel

Authors: Yutong Shen, Ruizhe Xia, Jingyi Liu, Yinqi Liu

First: 2026-04-22T16:23:17+00:00 · Latest: 2026-04-22T16:23:17+00:00

Comments: 16 pages, 5 figures

Abstract

Semi-supervised node classification is a foundational task in graph machine learning, yet state-of-the-art Graph Neural Networks (GNNs) are hindered by significant computational overhead and reliance on strong homophily assumptions. Traditional GNNs require expensive iterative training and multi-layer message passing, while existing training-free methods, such as Label Propagation, lack adaptability to heterophilous graph structures. This paper presents F²LP-AP (Fast and Flexible Label Propagation with Adaptive Propagation Kernel), a training-free, computationally efficient framework that adapts to local graph topology. Our method constructs robust class prototypes via the geometric median and dynamically adjusts propagation parameters based on the Local Clustering Coefficient (LCC), enabling effective modeling of both homophilous and heterophilous graphs without gradient-based training. Extensive experiments across diverse benchmark datasets demonstrate that F²LP-AP achieves competitive or superior accuracy compared to trained GNNs, while significantly outperforming existing baselines in computational efficiency.
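
The two quantities named in the abstract can be illustrated in isolation. This is a minimal sketch using the standard Weiszfeld iteration for the geometric median and the textbook definition of the local clustering coefficient. The function names and toy data are assumptions, and the abstract does not say how the LCC is mapped to propagation parameters, so that step is not shown.

```python
import math


def geometric_median(points, iters=200, tol=1e-9):
    """Weiszfeld iterations for the geometric median of equal-length vectors."""
    dim = len(points[0])
    # Start from the arithmetic mean.
    y = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        # Inverse-distance weights; the clamp avoids division by zero.
        w = [1.0 / max(math.dist(p, y), 1e-12) for p in points]
        total = sum(w)
        new_y = [sum(wi * p[d] for wi, p in zip(w, points)) / total
                 for d in range(dim)]
        if math.dist(new_y, y) < tol:
            return new_y
        y = new_y
    return y


def local_clustering_coefficient(adj, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))


# Four inliers on a unit square plus one far outlier (toy features).
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
prototype = geometric_median(features)

# Triangle a-b-c with a pendant node d attached to a (toy graph).
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
lcc_a = local_clustering_coefficient(graph, "a")
lcc_b = local_clustering_coefficient(graph, "b")
```

In this toy example the arithmetic mean is pulled to (20.4, 20.4) by the outlier, while the geometric median stays inside the unit square, which is the robustness property that motivates median-based prototypes.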

SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images

Authors: Kaiyu Li, Shengqi Zhang, Yujie Wang, Yupeng Deng, Zhi Wang, Deyu Meng, Xiangyong Cao

First: 2025-12-09T15:42:28+00:00 · Latest: 2026-04-22T15:53:23+00:00

Abstract

Most existing methods for training-free open-vocabulary semantic segmentation are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a comprehensive exploration of applying SAM 3 to the remote sensing open-vocabulary tasks (i.e., 2D semantic segmentation, change detection, and 3D semantic segmentation) without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3's semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes. Furthermore, we extend our method to open-vocabulary change detection by a joint instance- and pixel-level verification strategy built directly upon our fused logits. We evaluate our method on extensive remote sensing datasets and tasks, including 20 segmentation datasets, 3 change detection datasets, and a 3D segmentation dataset. Experiments show that our method achieves promising performance, demonstrating the potential of SAM 3 for remote sensing open-vocabulary tasks. Our code is released at https://github.com/earth-insights/SegEarth-OV-3.
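
The fusion and filtering steps can be sketched on toy per-pixel scores. The abstract does not give the fusion rule or the filtering threshold, so the element-wise maximum and the 0.5 cut-off below are assumptions, as are the function name and the data layout.

```python
def fuse_and_filter(semantic, instance, presence, threshold=0.5):
    """Return one class label per pixel, or None where no class survives.

    semantic, instance: {class_name: [score per pixel]} from the two heads.
    presence: {class_name: scene-level presence score}.
    The max-fusion rule and the default threshold are illustrative assumptions.
    """
    n_pixels = len(next(iter(semantic.values())))
    # Drop classes whose presence score says they are absent from the scene.
    kept = [c for c in semantic if presence[c] >= threshold]
    if not kept:
        return [None] * n_pixels
    # Fuse the two heads per class and per pixel.
    fused = {c: [max(s, i) for s, i in zip(semantic[c], instance[c])]
             for c in kept}
    # Assign each pixel to the surviving class with the highest fused score.
    return [max(kept, key=lambda c: fused[c][p]) for p in range(n_pixels)]


semantic = {"building": [0.9, 0.2, 0.1],
            "water": [0.1, 0.3, 0.2],
            "road": [0.2, 0.6, 0.3]}
instance = {"building": [0.7, 0.1, 0.1],
            "water": [0.0, 0.8, 0.9],
            "road": [0.1, 0.2, 0.1]}
presence = {"building": 0.95, "water": 0.10, "road": 0.80}

labels = fuse_and_filter(semantic, instance, presence)
```

Here "water" has the highest fused score on the last two pixels but is removed by its low presence score, so those pixels fall back to "road". This is the false-positive suppression the abstract describes.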

Summary

This paper explores the application of Segment Anything Model 3 (SAM 3) for open-vocabulary semantic segmentation in remote sensing images. The authors implement a mask fusion strategy combining SAM 3's semantic segmentation head and Transformer decoder to improve land coverage. They also use the presence score from the presence head to filter out non-existent categories, reducing false positives. The method is extended to open-vocabulary change detection using a joint instance- and pixel-level verification strategy. Experiments on various remote sensing datasets show promising results, highlighting SAM 3's potential for these tasks.

R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

Authors: Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr, Federico Tombari, Bernt Schiele

First: 2026-04-22T15:41:33+00:00 · Latest: 2026-04-22T15:41:33+00:00

Abstract

Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., the claim of nonexistent objects in the visual input. To address this challenge, we propose Region-aware Chain-of-Verification (R-CoV), a visual chain-of-verification method to alleviate object hallucinations in LVLMs in a post-hoc manner. Motivated by how humans comprehend intricate visual information -- often focusing on specific image regions or details within a given sample -- we elicit such region-level processing from LVLMs themselves and use it as a chaining cue to detect and alleviate their own object hallucinations. Specifically, our R-CoV consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation. As a simple yet effective method, R-CoV can be seamlessly integrated into various LVLMs in a training-free manner and without relying on external detection models. Extensive experiments on several widely used hallucination benchmarks across multiple LVLMs demonstrate that R-CoV can significantly alleviate object hallucinations in LVLMs. Project page: https://github.com/Jiahao000/R-CoV.
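
The six steps can be read as a chain of calls to one model. The sketch below only shows that control flow: `lvlm(prompt, image)` is a hypothetical interface, the prompt wording is invented for illustration, and `stub_lvlm` is a canned stand-in, not a real model.

```python
def r_cov(lvlm, image, question):
    """Chain the six R-CoV steps around a single model callable.

    lvlm(prompt, image) -> str is a hypothetical interface; the prompts
    are illustrative and not taken from the paper.
    """
    # 1. Initial response generation.
    initial = lvlm(question, image)
    # 2. Entity extraction from the initial response.
    listed = lvlm(f"List the objects mentioned in: {initial}", None)
    entities = [e.strip() for e in listed.split(",") if e.strip()]
    evidence = []
    for entity in entities:
        # 3. Coordinate generation: the model proposes the region itself.
        box = lvlm(f"Give the bounding box of the {entity}.", image)
        # 4. Region description: what is visible inside that region.
        description = lvlm(f"Describe the region {box}.", image)
        # 5. Verification execution: check the claim against the description.
        verdict = lvlm(
            f"Does '{description}' show a {entity}? Answer yes or no.", None)
        evidence.append((entity, verdict.strip().lower().startswith("yes")))
    # 6. Final response generation, conditioned on the verification results.
    rejected = [e for e, ok in evidence if not ok]
    if not rejected:
        return initial, evidence
    revise = (f"Question: {question}\nDraft answer: {initial}\n"
              f"These objects could not be verified: {', '.join(rejected)}. "
              "Rewrite the answer without them.")
    return lvlm(revise, image), evidence


def stub_lvlm(prompt, image):
    """Canned stand-in for a real model, used only to exercise the chain."""
    if prompt.startswith("List the objects"):
        return "dog, frisbee"
    if prompt.startswith("Give the bounding box"):
        return "[0, 0, 10, 10]"
    if prompt.startswith("Describe the region"):
        return "a dog on grass"
    if prompt.startswith("Does"):
        return "yes" if "show a dog?" in prompt else "no"
    if prompt.startswith("Question:"):
        return "A dog is on the grass."
    return "A dog is catching a frisbee."


answer, evidence = r_cov(stub_lvlm, image=None,
                         question="What is in the image?")
```

With the stub, the unverified "frisbee" is dropped from the final answer, and no external detector is involved at any step.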

Summary

The paper proposes R-CoV, a region-aware chain-of-verification method to address object hallucinations in large vision-language models (LVLMs). Motivated by how humans attend to specific image regions, R-CoV elicits region-level processing from the LVLM itself and uses it as a cue to detect and correct the model's own hallucinations. The method consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation. Experiments show that R-CoV alleviates object hallucinations across multiple LVLMs without requiring training or external detection models.

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Authors: Karan Goyal, Dikshant Kukreja

First: 2026-04-22T15:15:32+00:00 · Latest: 2026-04-22T15:15:32+00:00

Abstract

The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.

Evian: Towards Explainable Visual Instruction-tuning Data Auditing

Authors: Zimu Jia, Mingjie Xu, Andrew Estornell, Jiaheng Wei

Venue: ACL 2026

First: 2026-04-22T13:28:27+00:00 · Latest: 2026-04-22T13:28:27+00:00

Comments: Accepted at ACL 2026

Abstract

The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.

Summary

The research aims to improve the quality of training data for Large Vision-Language Models (LVLMs), addressing the inconsistent quality of existing datasets and the coarse-grained scores used by current filtering methods. The study introduces a 300K-sample benchmark built by injecting subtle defects, a "Decomposition-then-Evaluation" paradigm that splits responses into visual description, subjective inference, and factual claim, and EVIAN, an automated framework that scores these components on Image-Text Consistency, Logical Coherence, and Factual Accuracy. A model fine-tuned on a compact, high-quality subset curated by EVIAN outperformed models trained on orders-of-magnitude larger datasets, and Logical Coherence emerged as the most critical factor in data quality evaluation.

Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing

Authors: Xi Chen, Xu Chen, Xiangyang Jia, Xu Zhang, Shuquan Wei, Wei Wang

First: 2026-04-22T10:50:38+00:00 · Latest: 2026-04-22T10:50:38+00:00

Abstract

Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without introducing additional learnable parameters. Furthermore, an inter- and intra-modal loss is designed to jointly optimize cross-modal alignment across multi-granular representations. Extensive experiments on public benchmarks demonstrate that the FTF achieves competitive retrieval accuracy while significantly improving retrieval efficiency compared with existing methods.
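
The recall-then-rerank split can be sketched with toy vectors. Stage 1 below ranks images by a precomputed global vector that does not depend on the query, and stage 2 rescores only the shortlisted images with a text-weighted pooling of patch features. That pooling rule is an assumed stand-in for the paper's parameter-free interaction block, and all names and data are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def recall(text_vec, global_vecs, k):
    """Stage 1: shortlist k images using text-agnostic global vectors."""
    order = sorted(global_vecs,
                   key=lambda name: cosine(text_vec, global_vecs[name]),
                   reverse=True)
    return order[:k]


def text_guided_score(text_vec, patches):
    """Stage 2: pool patches with softmax weights derived from the text.

    No learnable parameters are involved; the pooling rule is an assumption.
    """
    sims = [dot(text_vec, p) for p in patches]
    m = max(sims)
    w = [math.exp(s - m) for s in sims]
    z = sum(w)
    pooled = [sum(wi * p[d] for wi, p in zip(w, patches)) / z
              for d in range(len(text_vec))]
    return cosine(text_vec, pooled)


def retrieve(text_vec, global_vecs, patch_vecs, k=2):
    """Run the cheap recall first, then rerank only the shortlist."""
    candidates = recall(text_vec, global_vecs, k)
    return sorted(candidates,
                  key=lambda name: text_guided_score(text_vec, patch_vecs[name]),
                  reverse=True)


text_vec = [1.0, 0.0]
global_vecs = {"harbor": [0.6, 0.8], "airport": [0.8, 0.6], "forest": [0.0, 1.0]}
patch_vecs = {"harbor": [[1.0, 0.0], [0.0, 1.0]],
              "airport": [[0.5, 0.5], [0.5, 0.5]],
              "forest": [[0.0, 1.0], [0.0, 1.0]]}

shortlist = recall(text_vec, global_vecs, k=2)
ranking = retrieve(text_vec, global_vecs, patch_vecs, k=2)
```

In this toy case the coarse stage prefers "airport", but the fine stage promotes "harbor" because one of its patches matches the query exactly. The expensive scoring runs on two images rather than three, which is where the efficiency gain of a two-stage design comes from.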

Summary

The paper proposes a fast-then-fine (FTF) two-stage framework for remote sensing image-text retrieval, addressing the challenges of fine-grained cross-modal alignment and retrieval efficiency. The framework consists of a recall stage using text-agnostic coarse-grained representations for efficient candidate selection and a rerank stage employing a parameter-free balanced text-guided interaction block for fine-grained alignment. The FTF framework also includes an inter- and intra-modal loss to optimize cross-modal alignment across multi-granular representations. Experimental results show that the FTF framework achieves competitive retrieval accuracy while improving retrieval efficiency compared to existing methods.

SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

Authors: Chris Choy, Junha Lee, Chunghyun Park, Minsu Cho, Jan Kautz

First: 2026-04-22T09:57:57+00:00 · Latest: 2026-04-22T09:57:57+00:00

Comments: Project page: https://nvlabs.github.io/SpaCeFormer/

Abstract

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8x improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
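
Morton-curve serialization itself is a standard technique and can be sketched independently of the model: quantize each point to a voxel grid, interleave the bits of the three grid coordinates, and sort by the resulting code. The voxel size, bit width, and x-y-z bit order below are assumptions, not values from the paper.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of three non-negative integers."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((z >> b) & 1) << (3 * b + 2)
    return code


def serialize(points, voxel=0.5, bits=10):
    """Return point indices ordered along the Morton curve."""
    # Shift so that all grid coordinates are non-negative.
    lo = [min(p[d] for p in points) for d in range(3)]

    def key(i):
        gx, gy, gz = (int((points[i][d] - lo[d]) / voxel) for d in range(3))
        return morton3d(gx, gy, gz, bits)

    return sorted(range(len(points)), key=key)


points = [(1.5, 1.5, 1.5), (0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0)]
order = serialize(points)
```

Points that are close in space tend to receive nearby codes, so a 1D window over the sorted sequence covers a spatially compact neighbourhood.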

Cross-Modal Taxonomic Generalization in (Vision-) Language Models

Authors: Tianyang Xu, Marcelo Sandoval-Castaneda, Karen Livescu, Greg Shakhnarovich, Kanishka Misra

Venue: ACL 2026

First: 2026-03-08T05:29:28+00:00 · Latest: 2026-04-22T09:05:21+00:00

Comments: ACL 2026 (main conference)

Abstract

What is the interplay between semantic representations learned by language models (LMs) from surface form alone and those learned from more grounded evidence? We study this question for a scenario where part of the input comes from a different modality -- in our case, in a vision-language model (VLM), where a pretrained LM is aligned with a pretrained image encoder. As a case study, we focus on the task of predicting hypernyms of objects represented in images. We do so in a VLM setup where the image encoder and LM are kept frozen, and only the intermediate mappings are learned. We progressively deprive the VLM of explicit evidence for hypernyms, and test whether knowledge of hypernyms is recoverable from the LM. We find that the LMs we study can recover this knowledge and generalize even in the most extreme version of this experiment (when the model receives no evidence of a hypernym during training). Additional experiments suggest that this cross-modal taxonomic generalization persists under counterfactual image-label mappings only when the counterfactual data have high visual similarity within each category. Taken together, these findings suggest that cross-modal generalization in LMs arises as a result of both coherence in the extralinguistic input and knowledge derived from language cues.

Summary

This study investigates how the semantic representations that language models (LMs) learn from surface form alone interact with those learned from more grounded evidence, using a vision-language model (VLM) setup. The researchers focus on predicting hypernyms of objects in images, keeping the image encoder and LM frozen while learning only the intermediate mappings. They find that the LMs can recover knowledge of hypernyms even when no hypernym evidence is provided during training, indicating cross-modal taxonomic generalization. Additional experiments suggest that this generalization persists under counterfactual image-label mappings only when the counterfactual data have high visual similarity within each category.

Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

Authors: Xingyu Zhu, Junfeng Fang, Shuo Wang, Beier Zhu, Zhicai Wang, Yonghui Yang, Xiangnan He

Venue: ACL 2026 Oral

First: 2026-04-22T09:02:17+00:00 · Latest: 2026-04-22T09:02:17+00:00

Comments: ACL 2026 (Oral)

Abstract

Large Vision-Language Models (LVLMs) exhibit powerful generative capabilities but frequently produce hallucinations that compromise output reliability. Fine-tuning on annotated data devoid of hallucinations offers the most direct solution, while its high computational cost motivates recent representation-based methods, which focus on mitigating hallucinatory components within hidden representations. Though efficient, we empirically observe that these methods degrade general generation capacity due to incomplete extraction of hallucination components and non-selective parameter updates. To address these limitations, we propose MPD, a dual-stage framework for mitigating hallucinations without performance degradation. Specifically, our MPD relies on two essential factors: (1) semantic-aware component disentanglement to extract pure hallucination components, and (2) interpretable parameter updates that selectively modify parameters most relevant to hallucination. Extensive experiments demonstrate that MPD achieves state-of-the-art performance, reducing hallucinations by 23.4% while maintaining 97.4% of general generative capability as evaluated on LLaVA-Bench and MME, with no additional computational cost.

Summary

The paper addresses the issue of hallucinations in large vision-language models (LVLMs) by proposing MPD, a dual-stage framework that mitigates hallucinations without degrading general generation capacity. MPD uses semantic-aware component disentanglement to extract pure hallucination components and interpretable parameter updates to selectively modify parameters relevant to hallucination. Experiments show that MPD reduces hallucinations by 23.4% while maintaining 97.4% of general generative capability, with no additional computational cost.

Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models

Authors: Rong Quan, Yantao Lai, Dong Liang, Jie Qin

First: 2026-04-22T09:00:51+00:00 · Latest: 2026-04-22T09:00:51+00:00

Comments: ICMR 2026

Abstract

Object Referring-guided Scanpath Prediction (ORSP) aims to predict the human attention scanpath when a person searches for a specific target object in a visual scene according to a linguistic description of the object. Multimodal information fusion is central to ORSP. Therefore, we propose a novel model, ScanVLA, to first exploit a Vision-Language Model (VLM) to extract and fuse inherently aligned visual and linguistic feature representations from the input image and referring expression. Next, to enhance ScanVLA's perception of fine-grained positional information, we not only propose a novel History Enhanced Scanpath Decoder (HESD) that directly takes historical fixations' position information as input to help predict a more reasonable position for the current fixation, but also adopt a frozen Segmentation LoRA as an auxiliary component to help localize the referred object more precisely, which improves the scanpath prediction task without incurring additional large computational and time costs. Extensive experimental results demonstrate that ScanVLA can significantly outperform existing scanpath prediction methods under object referring.

Bimanual Robot Manipulation via Multi-Agent In-Context Learning

Authors: Alessio Palma, Indro Spinelli, Vignesh Prasad, Luca Scofano, Yufeng Jin, Georgia Chalvatzaki, Fabio Galasso

First: 2026-04-22T08:51:07+00:00 · Latest: 2026-04-22T08:51:07+00:00

Abstract

Large Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables off-the-shelf, text-only LLMs to predict robot actions without any task-specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging, as the high-dimensional joint action space and tight inter-arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In-Context Learning), the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning. BiCICLe frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. This naturally extends to Arms' Debate, an iterative refinement process, and to the introduction of a third LLM-as-Judge to evaluate and select the most plausible coordinated trajectories. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves up to 71.1% average success rate, outperforming the best training-free baseline by 6.7 percentage points and surpassing most supervised methods. We further demonstrate strong few-shot generalization on novel tasks.

Summary

The research aims to enable standard Large Language Models (LLMs) to perform few-shot bimanual manipulation through In-Context Learning (ICL) without fine-tuning, addressing the high-dimensional joint action space and tight inter-arm coordination constraints. BiCICLe (Bimanual Coordinated In-Context Learning) frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. It adds an iterative refinement process (Arms' Debate) and a third LLM-as-Judge that selects the most plausible coordinated trajectories. The framework achieves up to 71.1% average success rate across 13 tasks from the TWIN benchmark, outperforming the best training-free baseline by 6.7 percentage points and surpassing most supervised methods.

PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning

Authors: Jinlong Liu, Wanggui He, Peng Zhang, Mushui Liu, Hao Jiang, Pipei Huang

First: 2026-04-14T12:21:15+00:00 · Latest: 2026-04-22T08:05:01+00:00

Abstract

Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires no annotation and no reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.
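
The reward computation can be sketched from the logits alone. The example assumes the frozen VLM has already been run on the generated image and guiding query with the prompt teacher-forced, yielding one row of logits per prompt token. Taking the negative mean of the per-token losses as the reward is an assumption, since the abstract does not give the exact aggregation, and the function names are hypothetical.

```python
import math


def token_cross_entropy(logits, target):
    """Cross-entropy of one target token id under one row of logits."""
    m = max(logits)
    # Numerically stable log-sum-exp.
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def prompt_echo_reward(per_token_logits, prompt_ids):
    """Negative mean token-level cross-entropy of the prompt tokens."""
    losses = [token_cross_entropy(row, t)
              for row, t in zip(per_token_logits, prompt_ids)]
    return -sum(losses) / len(losses)


prompt_ids = [2, 0]                      # toy prompt of two tokens, vocab of 4
uniform = [[0.0, 0.0, 0.0, 0.0],         # VLM cannot recover the prompt
           [0.0, 0.0, 0.0, 0.0]]
aligned = [[0.0, 0.0, 5.0, 0.0],         # VLM strongly predicts the prompt
           [5.0, 0.0, 0.0, 0.0]]

r_uniform = prompt_echo_reward(uniform, prompt_ids)
r_aligned = prompt_echo_reward(aligned, prompt_ids)
```

An image from which the VLM can "echo" the prompt receives the higher reward. Because the computation is a single deterministic forward pass, no sampling or reward-model training enters the loop.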

Summary / 总结

PromptEcho is a reward construction method for text-to-image reinforcement learning that requires neither human annotation nor reward model training. Given a generated image and a guiding query, it computes the token-level cross-entropy loss of a frozen vision-language model with the original prompt as the label, which extracts the image-text alignment knowledge encoded during VLM pretraining. On Z-Image and QwenImage-2512, the method improves prompt following on DenseAlignBench (+26.8pp / +16.2pp net win rate) and yields consistent gains on GenEval, DPG-Bench, and TIIFBench without task-specific training. Ablations indicate that reward quality scales with VLM size.

PromptEcho 是一种无需人工标注或额外模型训练的奖励构建方法，通过计算冻结的视觉-语言模型与原始提示之间的标记级交叉熵损失，利用预训练知识来提高文本到图像模型的提示遵循能力。该方法在 DenseAlignBench 和其他基准上实现了显著的改进，无需特定任务的训练。随着更强大的 VLM 的可用性，奖励质量也会提高。
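The reward described in the PromptEcho abstract can be illustrated with a minimal sketch. It assumes the frozen VLM has already been run with teacher forcing on the generated image and the guiding query, so that one vector of vocabulary logits is available per prompt token. The function names, the input layout, and the choice to use the negative mean loss as the reward are assumptions made for illustration, not details confirmed by the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def prompt_echo_reward(token_logits, prompt_token_ids):
    """Negative mean token-level cross-entropy of the original prompt.

    token_logits: one list of vocabulary logits per prompt position
        (hypothetical format for the frozen VLM's output).
    prompt_token_ids: token ids of the original prompt, used as labels.
    The sign and the mean normalization are assumptions of this sketch.
    """
    if not prompt_token_ids or len(token_logits) != len(prompt_token_ids):
        raise ValueError("need one logit vector per prompt token")
    losses = [
        -log_softmax(logits)[label]
        for logits, label in zip(token_logits, prompt_token_ids)
    ]
    return -sum(losses) / len(losses)
```

Under this convention, an image that makes the original prompt more predictable for the VLM receives a higher reward, and no reward model has to be trained.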

X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference

Authors: Yixiao Zeng, Jianlei Zheng, Chaoda Zheng, Shijia Chen, Mingdian Liu, Tongping Liu, Tengwei Luo, Yu Zhang, Boyang Wang, Linkun Xu, Siyuan Lu, Bo Tian, Xianming Liu

First: 2026-04-22T07:36:59+00:00 · Latest: 2026-04-22T07:36:59+00:00

Comments: Technical Report

Abs · PDF · Code1 · Code2

Abstract

Real-time world simulation is becoming a key infrastructure for scalable evaluation and online reinforcement learning of autonomous driving systems. Recent driving world models built on autoregressive video diffusion achieve high-fidelity, controllable multi-camera generation, but their inference cost remains a bottleneck for interactive deployment. However, existing diffusion caching methods are designed for offline video generation with multiple denoising steps, and do not transfer to this scenario. Few-step distilled models have no inter-step redundancy left for these methods to reuse, and sequence-level parallelization techniques require future conditioning that closed-loop interactive generation does not provide. We present X-Cache, a training-free acceleration method that caches along a different axis: across consecutive generation chunks rather than across denoising steps. X-Cache maintains per-block residual caches that persist across chunks, and applies a dual-metric gating mechanism over a structure- and action-aware block-input fingerprint to independently decide whether each block should recompute or reuse its cached residual. To prevent approximation errors from permanently contaminating the autoregressive KV cache, X-Cache identifies KV update chunks (the forward passes that write clean keys and values into the persistent cache) and unconditionally forces full computation on these chunks, cutting off error propagation. We implement X-Cache on X-world, a production multi-camera action-conditioned driving world model built on multi-block causal DiT with few-step denoising and rolling KV cache. X-Cache achieves 71% block skip rate with 2.6x wall-clock speedup while maintaining minimal degradation.

Summary / 总结

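X-Cache is a training-free method that accelerates few-step autoregressive world models by caching across consecutive generation chunks rather than across denoising steps. It keeps per-block residual caches and uses a dual-metric gating mechanism over a block-input fingerprint to decide, block by block, whether to recompute or reuse the cached residual. Chunks that write keys and values into the persistent KV cache are always fully computed so that approximation errors do not propagate. On the X-world driving world model, X-Cache achieves a 71% block skip rate and a 2.6x wall-clock speedup with minimal degradation.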

X-Cache 是一种无需训练的方法，通过在连续生成块之间而不是去噪步骤之间进行缓存来加速少量步骤的自回归世界模型。它使用双重度量门控机制来决定是重新计算还是重用缓存的残差。X-Cache 实现了 71% 的块跳过率和 2.6 倍的墙钟速度提升，同时保持了模型性能的最小下降。
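The gating logic summarized above can be sketched for a single block as follows. The abstract does not say which two metrics the dual-metric gate uses, so the relative L2 change and cosine similarity below, along with the thresholds and the `BlockCache` class itself, are hypothetical stand-ins chosen for illustration.

```python
import math


def rel_l2_change(new, old):
    """Relative L2 distance between two fingerprints (assumed metric 1)."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(new, old)))
    den = math.sqrt(sum(b * b for b in old)) or 1.0
    return num / den


def cosine_sim(a, b):
    """Cosine similarity between two fingerprints (assumed metric 2)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class BlockCache:
    """Residual cache for one block that persists across chunks."""

    def __init__(self, max_change=0.05, min_cosine=0.99):
        self.max_change = max_change  # hypothetical threshold
        self.min_cosine = min_cosine  # hypothetical threshold
        self.fingerprint = None
        self.residual = None

    def step(self, fingerprint, compute_residual, is_kv_update_chunk):
        """Return (residual, reused) for the current chunk."""
        can_reuse = (
            self.residual is not None
            and not is_kv_update_chunk  # KV update chunks always recompute
            and rel_l2_change(fingerprint, self.fingerprint) <= self.max_change
            and cosine_sim(fingerprint, self.fingerprint) >= self.min_cosine
        )
        if can_reuse:
            return self.residual, True
        self.residual = compute_residual()
        self.fingerprint = list(fingerprint)
        return self.residual, False
```

In this sketch the stored fingerprint is refreshed only when the block is recomputed, so drift is always measured against the input that produced the cached residual.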

Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing

Authors: Jingxuan He, Xiyu Wang, Mengyu Zheng, Xiangyu Zeng, Yunke Wang, Chang Xu

First: 2026-04-22T07:08:01+00:00 · Latest: 2026-04-22T07:08:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Instruction-based image editing (IIE) aims to modify images according to textual instructions while preserving irrelevant content. Despite recent advances in diffusion transformers, existing methods often suffer from over-editing, introducing unintended changes to regions unrelated to the desired edit. We identify that this limitation arises from the lack of an explicit mechanism for edit localization. In particular, different editing operations (e.g., addition, removal and replacement) induce distinct spatial patterns, yet current IIE models typically treat localization in a task-agnostic manner. To address this limitation, we propose a training-free, task-aware edit localization framework that exploits the intrinsic source and target image streams within IIE models. For each image stream, we first obtain attention-based edit cues, and then construct feature centroids based on these attentive cues to partition tokens into edit and non-edit regions. Based on the observation that optimal localization is inherently task-dependent, we further introduce a unified mask construction strategy that selectively leverages source and target image streams for different editing tasks. We provide a systematic analysis for our proposed insights and approaches. Extensive experiments on EdiVal-Bench demonstrate our framework consistently improves non-edit region consistency while maintaining strong instruction-following performance on top of powerful recent image editing backbones, including Step1X-Edit and Qwen-Image-Edit.

Summary / 总结

The paper addresses over-editing in instruction-based image editing (IIE) by proposing a training-free, task-aware edit localization framework. For the source and target image streams inside an IIE model, it derives attention-based edit cues and builds feature centroids from them to partition tokens into edit and non-edit regions. A unified mask construction strategy then selectively draws on the source or target stream depending on the editing task. Experiments on EdiVal-Bench show that the framework improves non-edit region consistency while maintaining strong instruction-following performance on top of recent backbones such as Step1X-Edit and Qwen-Image-Edit.

论文通过提出一种任务感知的编辑定位框架来解决基于指令的图像编辑（IIE）中的过度编辑问题。该框架利用注意力编辑线索和特征质心将标记划分为编辑和非编辑区域，并为不同的编辑任务引入了一种统一的掩码构建策略。实验表明，所提出的框架在保持对强大近期IIE模型如Step1X-Edit和Qwen-Image-Edit的指令遵循性能的同时，增强了非编辑区域的一致性。
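The centroid-based partition described above can be sketched as follows for one image stream. The abstract does not specify how cues are turned into centroids, so taking the top fraction of tokens by attention as seeds and assigning tokens by cosine similarity are assumptions, and all names are hypothetical. The sketch expects at least two tokens.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def partition_tokens(features, attention, top_fraction=0.25):
    """Return a 0/1 mask per token (1 = edit region).

    features: one feature vector per token.
    attention: one attention-based edit cue per token.
    top_fraction: share of tokens used to seed the edit centroid
        (hypothetical hyperparameter).
    """
    n = len(features)
    k = max(1, min(n - 1, round(n * top_fraction)))
    order = sorted(range(n), key=lambda i: attention[i], reverse=True)
    edit_centroid = mean_vector([features[i] for i in order[:k]])
    keep_centroid = mean_vector([features[i] for i in order[k:]])
    return [
        1 if cosine(f, edit_centroid) > cosine(f, keep_centroid) else 0
        for f in features
    ]
```

A token with a weak attention cue can still land in the edit region if its features sit close to the edit centroid, which is the point of clustering on features rather than thresholding attention directly.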

Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization

Authors: Jianzong Wang, Botao Zhao, Yayun He, Junqing Peng, Xulong Zhang

First: 2026-04-15T06:29:02+00:00 · Latest: 2026-04-22T06:55:18+00:00

Comments: This work has been accepted for publication in the Proceedings of the 2026 International Joint Conference on Neural Networks (IJCNN 2026)

Abs · PDF · Code1 · Code2

Abstract

Achieving general-purpose robotics requires empowering robots to adapt and evolve based on their environment and feedback. Traditional methods face limitations such as extensive training requirements, difficulties in cross-task generalization, and lack of interpretability. Prompt learning offers new opportunities for self-evolving robots without extensive training, simply by reflecting on past experiences. However, extracting meaningful insights from task successes and failures remains a challenge. To this end, we propose the evolvable embodied agent (EEAgent) framework, which leverages large vision-language models (VLMs) for better environmental interpretation and policy planning. To enhance reflection on past experiences, we propose a long short-term reflective optimization (LSTRO) mechanism that dynamically refines prompts based on both past experiences and newly learned lessons, facilitating continuous self-evolution and thereby enhancing overall task success rates. Evaluations on six VIMA-Bench tasks reveal that our approach sets a new state-of-the-art, notably outperforming baselines in complex scenarios.

Summary / 总结

The research aims to develop a self-evolving robotic agent capable of adapting to various tasks without extensive training. The proposed EEAgent framework uses large vision-language models for better environmental understanding and policy planning. It introduces a long short-term reflective optimization mechanism to dynamically refine prompts based on past experiences and new lessons, leading to improved task success rates. Experiments on six VIMA-Bench tasks demonstrate that the approach outperforms existing methods, especially in complex scenarios.

研究旨在开发一种无需大量训练即可适应各种任务的自进化机器人代理。提出的EEAgent框架利用大型视觉-语言模型进行更好的环境理解和策略规划。它引入了一种长期短期反思优化机制，根据过往经验和新学到的教训动态调整提示，从而提高任务成功率。实验结果显示，该方法在六个VIMA-Bench任务中表现出色，尤其是在复杂场景中超越了现有方法。

From Scene to Object: Text-Guided Dual-Gaze Prediction

Authors: Zehong Ke, Yanbo Jiang, Jinhao Li, Zhiyuan Liu, Yiqian Tu, Qingwen Meng, Heye Huang, Jianqiang Wang

First: 2026-04-22T05:11:59+00:00 · Latest: 2026-04-22T05:11:59+00:00

Abs · PDF · Code1 · Code2

Abstract

Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations, inherently failing to support text-grounded cognitive modeling. Consequently, while Vision-Language Models (VLMs) hold great potential for semantic reasoning, this critical data limitation leads to severe text-vision decoupling and visual-bias hallucinations. To break this bottleneck and achieve precise object-level attention prediction, this paper proposes a novel dual-branch gaze prediction framework, establishing a complete paradigm from data construction to model architecture. First, we construct G-W3DA, an object-level driver attention dataset. By integrating a multimodal large language model with the Segment Anything Model 3 (SAM3), we decouple macroscopic heatmaps into object-level masks under rigorous cross-validation, fundamentally eliminating annotation hallucinations. Building upon this high-quality data foundation, we propose the DualGaze-VLM architecture. This architecture extracts the hidden states of semantic queries and dynamically modulates visual features via a Condition-Aware SE-Gate, achieving intent-driven precise spatial anchoring. Extensive experiments on the W3DA benchmark demonstrate that DualGaze-VLM consistently surpasses existing state-of-the-art (SOTA) models in spatial alignment metrics, notably achieving up to a 17.8% improvement in Similarity (SIM) under safety-critical scenarios. Furthermore, a visual Turing test reveals that the attention heatmaps generated by DualGaze-VLM are perceived as authentic by 88.22% of human evaluators, proving its capability to generate rational cognitive priors.

Summary / 总结

This paper addresses the challenge of interpreting driver attention for human-like autonomous driving by proposing a dual-branch gaze prediction framework. It constructs G-W3DA, an object-level driver attention dataset, and introduces the DualGaze-VLM architecture, which uses a Condition-Aware SE-Gate to dynamically modulate visual features based on semantic queries. Experiments show that DualGaze-VLM outperforms existing models in spatial alignment metrics, with up to a 17.8% improvement in Similarity (SIM) under safety-critical scenarios, and is perceived as authentic by 88.22% of human evaluators in a visual Turing test.

本文提出了一种双分支凝视预测框架，以解决人类样式的自主驾驶中驾驶员注意力的解释问题。该研究构建了G-W3DA，一个对象级别的驾驶员注意力数据集，并引入了DualGaze-VLM架构，该架构通过条件感知SE门动态调节基于语义查询的视觉特征。实验表明，DualGaze-VLM在空间对齐指标上优于现有模型，安全关键场景下的相似度（SIM）提高了17.8%，并且在视觉图灵测试中，有88.22%的人类评估者认为其生成的注意力热图是真实的。
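For readers unfamiliar with the Similarity (SIM) metric reported above, the sketch below uses its standard definition from saliency evaluation: both maps are normalized to sum to one and the per-pixel minima are summed. Whether the paper applies extra preprocessing such as resizing or blurring is not stated in the abstract, so this is the textbook form only, with hypothetical function names.

```python
def normalize(heatmap):
    """Flatten a 2D non-negative heatmap and scale it to sum to 1."""
    flat = [v for row in heatmap for v in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("heatmap must contain positive mass")
    return [v / total for v in flat]


def similarity(pred, gt):
    """SIM (histogram intersection): 1.0 for identical maps, 0.0 for disjoint."""
    p, q = normalize(pred), normalize(gt)
    if len(p) != len(q):
        raise ValueError("heatmaps must have the same number of pixels")
    return sum(min(a, b) for a, b in zip(p, q))
```

Because SIM is bounded in [0, 1], a relative improvement such as the reported 17.8% should be read against the baseline score rather than as an absolute gain.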

FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation

Authors: Huy Che, Vinh-Tiep Nguyen

Venue: Neurocomputing 660 (2026) 131844

First: 2025-06-29T16:41:41+00:00 · Latest: 2026-04-22T05:04:11+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive learning based models enable zero-shot segmentation, they often lose fine spatial precision at pixel level, due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the computation costs and the quality of the segmentation mask. In this work, we present FA-Seg, a Fast and Accurate training-free framework for open-vocabulary segmentation based on diffusion models. FA-Seg performs segmentation using only a (1+1)-step from a pretrained diffusion model. Moreover, instead of running multiple times for different classes, FA-Seg performs segmentation for all classes at once. To further enhance the segmentation quality, FA-Seg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances semantic precision via multi-resolution attention fusion, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FA-Seg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FA-Seg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency. The source code is available at https://github.com/chequanghuy/FA-Seg.

Summary / 总结

FA-Seg is a fast and accurate training-free framework for open-vocabulary segmentation based on diffusion models. It segments all classes at once using only a (1+1)-step pass of a pretrained diffusion model, and improves mask quality with a dual-prompt mechanism, a Hierarchical Attention Refinement Method (HARD), and a Test-Time Flipping (TTF) scheme. Experiments show that FA-Seg achieves state-of-the-art training-free performance with 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining high inference efficiency.

FA-Seg 是一种基于扩散模型的快速准确无训练框架，用于开放词汇分割。它通过单步从预训练模型进行分割，并使用双提示机制、层次注意力精炼方法（HARD）和测试时翻转（TTF）来提升分割质量。实验表明，FA-Seg 在 PASCAL VOC、PASCAL Context 和 COCO Object 基准上的平均 mIoU 达到 43.8%，同时保持了高推理效率。
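The idea behind a test-time flipping scheme can be sketched as below: predict on the image and on its mirror, flip the second prediction back, and average the two score maps. This is the generic form of flip-based test-time augmentation. The abstract does not describe how FA-Seg's TTF fuses the two passes, so the horizontal flip, the simple average, and the function names are assumptions.

```python
def hflip(grid):
    """Mirror a 2D list left to right."""
    return [list(reversed(row)) for row in grid]


def flip_averaged_prediction(predict, image):
    """Average a score map with the flipped-back score map of the mirror image.

    predict: callable mapping a 2D image to a 2D score map of the same shape
        (hypothetical interface for a single class).
    """
    direct = predict(image)
    mirrored = hflip(predict(hflip(image)))
    return [
        [(a + b) / 2 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(direct, mirrored)
    ]
```

Averaging cancels out any left-right bias in the predictor, which is one plausible reading of the "spatial consistency" the abstract attributes to TTF.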

BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

Authors: Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Shan Mu, Weihao Yuan, Siyu Zhu

First: 2026-04-15T09:17:38+00:00 · Latest: 2026-04-22T04:56:34+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with ≤ 4.4M data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to 3x decoding throughput speedup compared to the source model. Code is available at: https://github.com/fudan-generative-vision/Bard-VL.

Summary / 总结

BARD is a framework that converts a pretrained autoregressive vision-language model into a same-architecture, decoding-efficient diffusion VLM (dVLM) by combining progressive supervised block merging with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor. A key finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. With at most 4.4M training samples, BARD-VL transfers multimodal capability from Qwen3-VL, achieves up to 3x decoding throughput speedup over the source model, and sets a new SOTA among comparable-scale open dVLMs on the authors' evaluation suite at the 4B and 8B scales.

BARD 是一种框架，通过结合渐进式块合并和阶段内扩散蒸馏，将预训练的自回归视觉语言模型转换为高效的扩散模型。关键发现包括直接自回归到扩散蒸馏的无效性以及阶段内扩散蒸馏的有效性。BARD-VL 能够转移强大的多模态能力，并且与源模型相比，解码速度提升高达 3 倍，同时在 4B 和 8B 规模的评估套件上建立了新的 SOTA 性能。

Semantic-Fast-SAM: Efficient Semantic Segmenter

Authors: Byunghyun Kim

First: 2026-04-22T04:18:39+00:00 · Latest: 2026-04-22T04:18:39+00:00

Comments: APSIPA ASC 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

We propose Semantic-Fast-SAM (SFS), a semantic segmentation framework that combines the Fast Segment Anything model with a semantic labeling pipeline to achieve real-time performance without sacrificing accuracy. FastSAM is an efficient CNN-based re-implementation of the Segment Anything Model (SAM) that runs much faster than the original transformer-based SAM. Building upon FastSAM's rapid mask generation, we integrate a Semantic-Segment-Anything (SSA) labeling strategy to assign meaningful categories to each mask. The resulting SFS model produces high-quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach. Experiments on Cityscapes and ADE20K benchmarks demonstrate that SFS matches the accuracy of prior SAM-based methods (mIoU ~ 70.33 on Cityscapes and 48.01 on ADE20K) while achieving approximately 20x faster inference than SSA in the closed-set setting. We also show that SFS effectively handles open-vocabulary segmentation by leveraging CLIP-based semantic heads, outperforming recent open-vocabulary models on broad class labeling. This work enables practical real-time semantic segmentation with the "segment-anything" capability, broadening the applicability of foundation segmentation models in robotics scenarios. The implementation is available at https://github.com/KBH00/Semantic-Fast-SAM.

Summary / 总结

Semantic-Fast-SAM (SFS) combines FastSAM's rapid mask generation with a Semantic-Segment-Anything (SSA) labeling strategy to achieve real-time semantic segmentation without sacrificing accuracy. SFS produces high-quality segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach, matching the accuracy of prior SAM-based methods (mIoU of about 70.33 on Cityscapes and 48.01 on ADE20K) while running approximately 20x faster than SSA in the closed-set setting. With CLIP-based semantic heads, SFS also handles open-vocabulary segmentation and outperforms recent open-vocabulary models on broad class labeling.

Semantic-Fast-SAM (SFS) 结合了 FastSAM 和语义标注管道，实现了实时语义分割而不牺牲准确性。SFS 以比原始 SAM 基于方法低得多的计算成本生成高质量的分割图，同时在 Cityscapes 和 ADE20K 基准上的 mIoU 分数与 SAM 基于方法相当，且速度约快 20 倍。SFS 还在处理广泛类别标注方面优于最近的开放词汇模型。这项工作使实时语义分割在机器人应用中成为可能。
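One simple way to assign a category to each class-agnostic mask is a majority vote over a per-pixel semantic prediction, sketched below. The abstract only says that an SSA labeling strategy assigns categories to FastSAM masks, so the voting rule, the data layout, and the function name are assumptions for illustration rather than the paper's pipeline.

```python
from collections import Counter


def label_masks(masks, semantic_map):
    """Assign each mask the most frequent class id among its pixels.

    masks: list of sets of (row, col) pixel coordinates, one set per mask
        (hypothetical format for the mask generator's output).
    semantic_map: 2D list of per-pixel class ids from a semantic head.
    Returns one class id per mask, or None for an empty mask.
    """
    labels = []
    for mask in masks:
        votes = Counter(semantic_map[r][c] for r, c in mask)
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return labels
```

The mask generator and the labeler stay decoupled in this design, so a faster mask generator can be swapped in without retraining the semantic head.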

LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

Authors: Zhiyuan Jiang, Weihao Hong, Xinlei Guan, Tejaswi Dhandu, Miles Q. Li, Meng Xu, Kuan Huang, Umamaheswara Rao Tida, Bingyu Shen, Daehan Kwak, Boyang Li

First: 2026-04-20T20:21:27+00:00 · Latest: 2026-04-22T02:39:44+00:00

Comments: 23 pages, 12 figures

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) are increasingly deployed in settings where reliable visual grounding carries operational consequences, yet their behavior under progressively coercive prompt phrasing remains undercharacterized. Existing hallucination benchmarks predominantly rely on neutral prompts and binary detection, leaving open how both the incidence and the intensity of fabrication respond to graded linguistic pressure across structurally distinct task types. We present Ghost-100, a procedurally constructed benchmark of 800 synthetically generated images spanning eight categories across three task families: text-illegibility, time-reading, and object-absence, each designed under a negative-ground-truth principle that guarantees the queried target is absent, illegible, or indeterminate by construction. Every image is paired with five prompts drawn from a structured 5-Level Prompt Intensity Framework, holding the image and task identity fixed while varying only directive force, so that tone is isolated as the sole independent variable. We adopt a dual-track evaluation protocol: a rule-based H-Rate measuring the proportion of responses in which a model crosses from grounded refusal into unsupported positive commitment, and a GPT-4o-mini-judged H-Score on a 1-5 scale characterizing the confidence and specificity of fabrication once it occurs. We additionally release a three-stage automated validation workflow, which retrospectively confirms 717 of 800 images as strictly compliant. Evaluating nine open-weight VLMs, we find that H-Rate and H-Score dissociate substantially across model families, reading-style and presence-detection subsets respond to prompt pressure in qualitatively different ways, and several models exhibit non-monotonic sensitivity peaking at intermediate tone levels: patterns that aggregate metrics obscure.

Summary / 总结

The research evaluates how Vision-Language Models (VLMs) respond to increasingly coercive prompts, focusing on tone-induced hallucination. The study introduces Ghost-100, a benchmark of 800 synthetic images spanning eight categories across three task families (text-illegibility, time-reading, and object-absence), with each image paired with five prompts of graded intensity. The evaluation uses a dual-track protocol: a rule-based H-Rate measuring how often a model moves from grounded refusal to unsupported positive commitment, and a GPT-4o-mini-judged H-Score on a 1-5 scale characterizing the confidence and specificity of the fabrication. Across nine open-weight VLMs, H-Rate and H-Score dissociate substantially between model families, reading-style and presence-detection subsets respond differently to prompt pressure, and several models show non-monotonic sensitivity that peaks at intermediate tone levels.

研究旨在评估视觉语言模型（VLMs）在受到越来越具强制性的提示时的响应，特别是针对语气诱导的幻觉。研究引入了Ghost-100，这是一个包含800张合成图像的基准，覆盖三个任务家族，每个任务家族有五个强度不同的提示。评估使用双轨制：H-Rate衡量从有根据的拒绝到无根据的积极承诺的转变，H-Score评估幻觉的置信度和具体性。关键发现表明，不同模型家族和子集对提示压力的响应存在差异，一些模型在中间语气水平上表现出非单调的敏感性。
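The H-Rate described above is a proportion, so it can be tabulated per prompt-intensity level once each response has been flagged. The sketch below assumes a boolean flag per response from some rule-based detector, which is not specified in the abstract. The record layout and function names are hypothetical.

```python
from collections import defaultdict


def h_rate_by_level(records):
    """Fraction of hallucinated responses for each prompt-intensity level.

    records: iterable of (level, hallucinated) pairs, where `hallucinated`
        is True when the response makes an unsupported positive commitment.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, hallucinated in records:
        totals[level] += 1
        hits[level] += 1 if hallucinated else 0
    return {level: hits[level] / totals[level] for level in sorted(totals)}


def peak_level(rates):
    """Level with the highest H-Rate (first one wins on ties)."""
    return max(rates, key=rates.get)
```

A peak at a level other than the highest one is the non-monotonic pattern the abstract reports for several models, and it would be hidden by a single rate aggregated over all levels.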

Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge

Authors: Zihao Ye, Yung-Hsiang Lu, Xiao Hu, Shuai Zhang, Taotao Jing, Xin Li, Zhen Yao, Bo Lang, Zhihao Zheng, Seungmin Oh, Hankyul Kang, Seunghun Kang, Jongbin Ryu, Kexin Chen, Yuan Qi, George K Thiruvathukal, Mooi Choo Chuah

First: 2026-04-21T04:00:55+00:00 · Latest: 2026-04-22T01:26:04+00:00

Comments: 11 pages, 8 figures, 4 tables

Abs · PDF · Code1 · Code2

Abstract

The IEEE Low-Power Computer Vision Challenge (LPCVC) aims to promote the development of efficient vision models for edge devices, balancing accuracy with constraints such as latency, memory capacity, and energy use. The 2025 challenge featured three tracks: (1) Image classification under various lighting conditions and styles, (2) Open-Vocabulary Segmentation with Text Prompt, and (3) Monocular Depth Estimation. This paper presents the design of LPCVC 2025, including its competition structure and evaluation framework, which integrates the Qualcomm AI Hub for consistent and reproducible benchmarking. The paper also introduces the top-performing solutions from each track and outlines key trends and observations. The paper concludes with suggestions for future computer vision competitions.

中文标题/摘要

标题：2025低功耗计算机视觉挑战获胜方案评估

IEEE低功耗计算机视觉挑战（LPCVC）旨在促进边缘设备上高效视觉模型的发展，平衡准确性和延迟、内存容量和能耗等约束。2025年的挑战包括三个赛道：（1）在不同光照条件和风格下的图像分类，（2）带有文本提示的开放词汇分割，以及（3）单目深度估计。本文介绍了LPCVC 2025的设计，包括其竞赛结构和评价框架，该框架整合了Qualcomm AI Hub以实现一致和可重复的基准测试。本文还介绍了每个赛道表现最佳的解决方案，并概述了关键趋势和观察结果。最后，本文提出了对未来计算机视觉竞赛的建议。

Summary / 总结

The IEEE Low-Power Computer Vision Challenge (LPCVC) 2025 aimed to advance efficient vision models for edge devices by balancing accuracy with constraints like latency, memory, and energy. The challenge included three tracks: image classification, open-vocabulary segmentation, and monocular depth estimation. The paper details the competition structure, evaluation framework using Qualcomm AI Hub, and highlights the top solutions from each track, noting key trends. It also provides recommendations for future computer vision competitions.

IEEE低功耗计算机视觉挑战（LPCVC）2025旨在通过平衡准确性和延迟、内存和能耗等约束来推进边缘设备上的高效视觉模型。挑战包括三个赛道：图像分类、开放词汇分割和单目深度估计。论文详细介绍了竞赛结构、使用高通AI Hub进行的一致性和可重复性评估框架，以及每个赛道的顶级解决方案，并指出了关键趋势。还提供了对未来计算机视觉竞赛的建议。

Training-free retrieval-augmented generation with reinforced reasoning for flood damage nowcasting

Authors: Lipai Huang, Kai Yin, Chia-Fu Liu, Ali Mostafavi

First: 2026-02-10T21:31:33+00:00 · Latest: 2026-04-21T21:47:46+00:00

Comments: 18 pages, 3 figures, 8 tables, submitted to CACAIE journal

Abs · PDF · Code1 · Code2

Abstract

We propose R2RAG-Flood, a training-free retrieval-augmented generation framework for flood damage nowcasting with reinforced reasoning. The framework builds a reasoning-centric knowledge base from labeled tabular records, where each sample includes structured predictors, a compact text-mode summary, and a model-generated reasoning trajectory. During inference, the target prompt is augmented with geographically local neighbors and selected free-shots to support case-based reasoning without task-specific fine-tuning. A two-stage procedure first determines damage occurrence and then refines severity within a three-level Property Damage Extent (PDE) classification, followed by a conservative downgrade check for weakly supported over-severe outputs. In a Hurricane Harvey case study in Harris County, Texas, the supervised tabular baseline achieves 0.714 overall accuracy and 0.859 accuracy on the damaged classes (medium and high PDE). Across seven LLM backbones, R2RAG-Flood achieves 0.613 to 0.668 overall accuracy and 0.757 to 0.896 accuracy on the damaged classes while providing a structured rationale for each prediction. Under the severity-per-cost metric used in this study, lighter R2RAG-Flood variants are more cost-efficient than the supervised baseline and larger LLM backbones. These results demonstrate the feasibility of a reasoning-centric, training-free pipeline for flood damage nowcasting in a realistic case-study setting.

中文标题/摘要

标题：基于强化推理的无需训练检索增强生成方法用于洪水损害现在预测

我们提出了一种无需训练的检索增强生成框架R2RAG-Flood，用于洪水损害现在预测，并结合了强化推理。该框架从标记的表格记录中构建以推理为中心的知识库，每个样本包括结构化的预测器、紧凑的文本模式摘要以及模型生成的推理轨迹。在推理过程中，目标提示通过地理邻近样本和选定的无监督样本进行增强，以支持案例推理，无需特定任务的微调。该过程分为两阶段：首先确定损害的发生，然后在财产损害程度（PDE）的三级分类中细化严重程度，最后进行保守的降级检查，以检查弱支持的过度严重输出。在德克萨斯州哈里斯县的飓风哈维案例研究中，监督表格基线的整体准确率为0.714，受损类别的准确率为0.859。在七个LLM骨干网络中，R2RAG-Flood的整体准确率为0.613-0.668，受损类别的准确率为0.757-0.896，同时为每个预测提供了结构化的理由。在本研究中使用的按严重程度计成本的度量标准下，较轻的R2RAG-Flood变体比监督基线和较大的LLM骨干网络更具有成本效益。这些结果表明，在现实案例研究环境中，基于推理的无需训练管道在洪水损害现在预测中的可行性。

Summary / 总结

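R2RAG-Flood is a training-free retrieval-augmented generation framework for flood damage nowcasting with reinforced reasoning. It builds a reasoning-centric knowledge base from labeled tabular records and, at inference, augments the target prompt with geographically local neighbors and selected free-shots, with no task-specific fine-tuning. A two-stage procedure first decides whether damage occurred, then refines severity within a three-level Property Damage Extent (PDE) classification, followed by a conservative downgrade check. In a Hurricane Harvey case study, the framework achieves 0.613 to 0.668 overall accuracy and 0.757 to 0.896 accuracy on the damaged classes across seven LLM backbones, compared with 0.714 and 0.859 for the supervised tabular baseline, while providing a structured rationale for each prediction. Under the study's severity-per-cost metric, lighter R2RAG-Flood variants are more cost-efficient than the supervised baseline and larger LLM backbones.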

该论文提出了一种名为R2RAG-Flood的训练-free框架，用于利用推理为中心的知识库和强化推理进行洪水损害现在预测。在推理过程中，框架通过添加本地邻居和自由射击来增强目标提示，以支持案例推理。该框架在不同LLM基座上实现了0.613--0.668的整体准确率和0.757--0.896的受损类准确率，为每个预测提供了结构化的推理。在本研究中使用的严重性-成本度量标准下，更轻量的R2RAG-Flood变体比监督基线和更大规模的LLM更具有成本效益。
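The retrieval and decision flow described above can be sketched in a few lines. In the paper the two stages are LLM judgments, which are represented here as plain boolean inputs. The record fields, the label name "none" for the undamaged level, and the rule that a "high" output needs at least one "high" neighbor are all hypothetical choices made for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (Earth radius taken as 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_neighbors(lat, lon, records, k):
    """The k labeled records closest to the target location."""
    return sorted(
        records, key=lambda r: haversine_km(lat, lon, r["lat"], r["lon"])
    )[:k]


def two_stage_label(is_damaged, is_high, neighbors, min_support=1):
    """Stage 1: occurrence. Stage 2: severity. Then a conservative downgrade."""
    if not is_damaged:
        return "none"
    if not is_high:
        return "medium"
    support = sum(1 for r in neighbors if r["pde"] == "high")
    # Downgrade an over-severe output that local evidence does not support.
    return "high" if support >= min_support else "medium"
```

The downgrade step only ever lowers severity, which matches the abstract's description of a conservative check on weakly supported over-severe outputs.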

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

Authors: Yiyang Du, Zhanqiu Guo, Xin Ye, Liu Ren, Chenyan Xiong

First: 2026-04-21T21:40:58+00:00 · Latest: 2026-04-21T21:40:58+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language-Action Models (VLAs) inherit their visual and linguistic capabilities from Vision-Language Models (VLMs), yet most VLAs are built from off-the-shelf VLMs that are not adapted to the embodied domain, limiting their downstream performance. In this work, we propose EmbodiedMidtrain to bridge the gap between VLMs and VLAs. We first characterize the data distribution gap between them, showing that VLA data occupy compact regions that are largely separated from the broader VLM distribution, while the degree of alignment varies substantially both across and within VLM data sources. Then, we build a mid-training data engine that leverages a lightweight learnable proximity estimator to select the most VLA-aligned candidates from a large VLM pool, and mid-trains the VLM on this curated mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show that mid-training consistently improves performance across different VLM backbones, achieving results competitive with expert VLAs and off-the-shelf VLMs trained with larger model scale and training budgets. Further analysis reveals that mid-training provides a stronger initialization for VLA fine-tuning, with gains emerging from the earliest steps and widening throughout training. Moreover, the data engine captures both dataset-level and sample-level alignment signals, favoring spatial reasoning over text-centric tasks while preserving the diversity of the VLM data. We will release all code, data and models for future research.

中文标题/摘要

标题：EmbodiedMidtrain：通过中期训练弥合视觉语言模型与视觉语言动作模型之间的差距

视觉语言动作模型（VLAs）继承了视觉语言模型（VLMs）的视觉和语言能力，但大多数VLAs都是基于现成的VLMs构建的，这些VLMs未适应实体领域，限制了其下游性能。在本研究中，我们提出EmbodiedMidtrain以弥合VLMs和VLAs之间的差距。我们首先描述了它们之间的数据分布差距，表明VLAs的数据占据紧凑的区域，这些区域与更广泛的VLM分布有较大分离，而VLM数据源之间的对齐程度差异显著。然后，我们构建了一个中期训练数据引擎，利用轻量级可学习的邻近度估计器从大规模的VLM池中选择最符合VLAs对齐的候选者，并在这些精选的混合数据上对VLM进行中期训练，然后进行下游VLAs微调。在三个机器人操作基准测试上的实验表明，中期训练在不同VLM骨干网络上都提高了性能，达到了与专家VLAs和更大模型规模及训练预算的现成VLMs竞争的结果。进一步的分析表明，中期训练为VLAs微调提供了更强的初始化，从最早几步开始就有所增益，并在整个训练过程中逐渐扩大。此外，数据引擎捕捉到了数据集级和样本级的对齐信号，更倾向于空间推理而非文本中心任务，同时保留了VLM数据的多样性。我们将发布所有代码、数据和模型供未来研究使用。

Summary / 总结

This paper proposes EmbodiedMidtrain to bridge the gap between Vision-Language Models (VLMs) and Vision-Language-Action Models (VLAs). VLAs inherit their visual and linguistic capabilities from VLMs, but VLA data occupy compact regions that are largely separated from the broader VLM distribution. The method builds a mid-training data engine that uses a lightweight learnable proximity estimator to select the most VLA-aligned candidates from a large VLM pool, then mid-trains the VLM on this curated mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show that mid-training consistently improves performance across different VLM backbones, achieving results competitive with expert VLAs and with off-the-shelf VLMs trained at larger model scale and training budgets. Further analysis indicates that the data engine captures both dataset-level and sample-level alignment signals, favoring spatial reasoning over text-centric tasks while preserving the diversity of the VLM data.
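The selection step of such a data engine can be sketched with a tiny logistic-regression scorer trained to tell VLA features from VLM features, followed by a top-k pick. The abstract only says the proximity estimator is lightweight and learnable, so the model family, the feature vectors, the hyperparameters, and all names below are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_proximity_estimator(vla_feats, vlm_feats, epochs=200, lr=0.5):
    """Fit weights so that VLA-like features score high (label 1) and others low."""
    data = [(x, 1.0) for x in vla_feats] + [(x, 0.0) for x in vlm_feats]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def select_aligned(candidates, w, b, k):
    """Return the k candidates with the highest proximity score."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return sorted(candidates, key=score, reverse=True)[:k]
```

Scoring each sample individually is what lets a selector of this kind pick up sample-level differences inside a single data source, in addition to differences between sources.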

scpFormer: A Foundation Model for Unified Representation and Integration of the Single-Cell Proteomics

Authors: Qifeng Zhou, Lei Yu, Yuzhi Guo, Yuwei Miao, Hehuan Ma, Wenliang Zhong, Lin Xu, Junzhou Huang

First: 2026-04-21T21:27:04+00:00 · Latest: 2026-04-21T21:27:04+00:00

Abs · PDF · Code1 · Code2

Abstract

The integration of single-cell proteomic data is often hindered by the fragmented nature of targeted antibody panels. To address this limitation, we introduce scpFormer, a transformer-based foundation model designed for single-cell proteomics. Pre-trained on over 390 million cells, scpFormer replaces standard index-based tokenization with a continuous, sequence-anchored approach. By combining Evolutionary Scale Modeling (ESM) with value-aware expression embeddings, it dynamically maps variable panels into a shared semantic space without artificial discretization. We demonstrate that scpFormer generates global cell representations that perform competitively in large-scale batch integration and unsupervised clustering. Moreover, its open-vocabulary architecture facilitates in silico panel expansion, assisting in the reconstruction of biological manifolds in sparse clinical datasets. Finally, this learned protein co-expression logic is transferable to bulk-omics tasks, supporting applications like cancer drug response prediction. scpFormer provides a versatile, panel-agnostic framework to facilitate scalable biomarker discovery and precision oncology.

DistortBench: Benchmarking Vision Language Models on Image Distortion Identification

Authors: Divyanshu Goyal, Akhil Eppa, Vanya Bannihatti Kumar

First: 2026-04-21T20:20:59+00:00 · Latest: 2026-04-21T20:20:59+00:00

Abstract

Vision-language models (VLMs) are increasingly used in settings where sensitivity to low-level image degradations matters, including content moderation, image restoration, and quality monitoring. Yet their ability to recognize distortion type and severity remains poorly understood. We present DistortBench, a diagnostic benchmark for no-reference distortion perception in VLMs. DistortBench contains 13,500 four-choice questions covering 27 distortion types, six perceptual categories, and five severity levels: 25 distortions inherit KADID-10k calibrations, while two added rotation distortions use monotonic angle-based levels. We evaluate 18 VLMs, including 17 open-weight models from five families and one proprietary model. Despite strong performance on high-level vision-language tasks, the best model reaches only 61.9% accuracy, just below the human majority-vote baseline of 65.7% (average individual: 60.2%), indicating that low-level perceptual understanding remains a major weakness of current VLMs. Our analysis further reveals weak and non-monotonic scaling with model size, performance drops in most base–thinking pairs, and distinct severity-response patterns across model families. We hope DistortBench will serve as a useful benchmark for measuring and improving low-level visual perception in VLMs.

Summary

DistortBench is a diagnostic benchmark that tests whether vision-language models (VLMs) can identify the type and severity of image distortions without a reference image. It contains 13,500 four-choice questions covering 27 distortion types, six perceptual categories, and five severity levels. Across 18 evaluated VLMs, the best model reaches only 61.9% accuracy, below the human majority-vote baseline of 65.7% and slightly above the average individual human score of 60.2%, which points to low-level perceptual understanding as a weakness of current VLMs. The analysis also finds weak, non-monotonic scaling with model size and distinct severity-response patterns across model families.
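The abstract reports two human reference points: a majority-vote baseline (65.7%) and an average individual score (60.2%). The sketch below shows how those two quantities can differ on the same set of responses. It is a minimal illustration only: the function names, the tie-breaking rule, and the toy data are assumptions, not the benchmark's actual scoring code or data.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def individual_and_majority_accuracy(responses, truth):
    """responses[r][q] is rater r's choice on question q; truth[q] is the key."""
    n_raters, n_questions = len(responses), len(truth)
    # Accuracy of each rater taken alone.
    per_rater = [
        sum(r[q] == truth[q] for q in range(n_questions)) / n_questions
        for r in responses
    ]
    # Accuracy of the pooled answer obtained by voting per question.
    votes = [majority_vote([r[q] for r in responses]) for q in range(n_questions)]
    majority = sum(v == t for v, t in zip(votes, truth)) / n_questions
    return sum(per_rater) / n_raters, majority


# Hypothetical toy data: 3 raters, 4 four-choice questions.
truth = ["A", "C", "B", "D"]
responses = [
    ["A", "C", "B", "A"],  # wrong on the last question
    ["A", "C", "D", "D"],  # wrong on the third question
    ["A", "B", "B", "D"],  # wrong on the second question
]
avg_individual, majority = individual_and_majority_accuracy(responses, truth)
print(avg_individual, majority)
```

In this toy case each rater errs on a different question, so voting cancels the errors and the pooled accuracy exceeds the average individual accuracy. That is the same direction as the gap between the two human figures quoted in the abstract.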

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

Authors: Palawat Busaranuvong, Reza Saadati Fard, Emmanuel Agu, Deepak Kumar, Shefalika Gautam, Bengisu Tulu, Diane Strong

First: 2026-04-21T19:28:08+00:00 · Latest: 2026-04-21T19:28:08+00:00

Abstract

Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.

Summary

Infection-Reasoner is a compact 4B-parameter vision-language model for chronic wound infection classification with evidence-grounded rationales. Training has two stages: reasoning distillation, in which GPT-5.1 rationales for unlabeled wound images initialize wound-specific reasoning in the student model, followed by reinforcement learning with Group Relative Policy Optimization on a small labeled infection dataset. On a held-out heterogeneous wound dataset it reaches 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Wound experts rated 61.8% of rationales as Correct and 32.4% as Partially Correct, and visual-support agreement scores from four MLLM judges ranged from 0.722 to 0.903.
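The second training stage names Group Relative Policy Optimization without describing it. As general background, the sketch below shows the group-relative advantage at the core of that method: several responses are sampled for one input, and each response's reward is standardized against the group's mean and standard deviation. The 0/1 reward (1 if the predicted infection label is correct) and the use of the population standard deviation are assumptions for illustration; the abstract does not state the paper's reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    rewards: scalar rewards for responses sampled from the same prompt.
    eps keeps the division defined when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled rationales for one wound image,
# rewarded 1.0 when the final label is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# A group where every response earns the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(flat)
```

Responses that beat their group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to provide a baseline.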

MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

Authors: Zijie Li, Yichun Shi, Jingxiang Sun, Ye Wang, Yixuan Huang, Zhiyao Guo, Xiaochen Lian, Peihao Zhu, Yu Tian, Zhonghua Zhai, Peng Wang

First: 2026-04-21T18:25:25+00:00 · Latest: 2026-04-21T18:25:25+00:00

Abstract

We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens, which subsequently serve as conditioning signals for a diffusion model. This streamlined design effectively transfers the rich understanding and reasoning capabilities of VLMs into the visual generation process. By obviating the need for deep fusion between autoregressive and diffusion models or training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis. MMCORE seamlessly integrates text-to-image synthesis with interleaved image generation, demonstrating robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad spectrum of text-to-image and single/multi-image editing benchmarks.

Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images

Authors: Yikun Ji, Yan Hong, Bowen Deng, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang

First: 2025-10-05T14:29:01+00:00 · Latest: 2026-04-21T18:16:55+00:00

Comments: 18 pages, 11 figures (including supplementary material)

Abstract

The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising practical concerns for digital integrity. Vision-language models (VLMs) can provide natural language explanations, but standard one-pass classifiers often miss subtle artifacts in high-quality synthetic images and offer limited grounding in the pixels. We propose Locate-Then-Examine (LTE), a two-stage VLM-based forensic framework that first localizes suspicious regions and then re-examines these crops together with the full image to refine the real vs. AI-generated verdict and its explanation. LTE explicitly links each decision to localized visual evidence through region proposals and region-aware reasoning. To support training and evaluation, we introduce TRACE, a dataset of 20,000 real and high-quality synthetic images with region-level annotations and automatically generated forensic explanations, constructed by a VLM-based pipeline with additional consistency checks and quality control. Across TRACE and multiple external benchmarks, LTE achieves competitive accuracy and improved robustness while providing human-understandable, region-grounded explanations suitable for forensic deployment.
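The two-stage flow described in the abstract (localize suspicious regions, then re-examine the crops together with the full image) can be sketched as a small pipeline. Everything below is a hypothetical illustration: the `locate` and `examine` callables stand in for VLM calls, the toy image is a nested list of pixel values, and the brightness threshold is a placeholder rule, not the paper's detector.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]          # toy grayscale image as rows of pixels
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


@dataclass
class Verdict:
    label: str          # "real" or "ai-generated"
    explanation: str
    regions: List[Box]  # evidence regions the verdict is linked to


def crop(image: Image, box: Box) -> Image:
    """Cut the rectangle described by box out of the image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def locate_then_examine(
    image: Image,
    locate: Callable[[Image], Sequence[Box]],
    examine: Callable[[Image, List[Image]], Tuple[str, str]],
) -> Verdict:
    # Stage 1: propose suspicious regions.
    boxes = list(locate(image))
    # Stage 2: re-examine the crops together with the full image.
    crops = [crop(image, b) for b in boxes]
    label, explanation = examine(image, crops)
    return Verdict(label, explanation, boxes)


# Hypothetical stand-ins for the two model calls.
def toy_locate(image: Image) -> Sequence[Box]:
    return [(0, 0, 2, 2)]


def toy_examine(image: Image, crops: List[Image]) -> Tuple[str, str]:
    flagged = [c for c in crops if any(p > 200 for row in c for p in row)]
    if flagged:
        return "ai-generated", f"{len(flagged)} suspicious region(s)"
    return "real", "no suspicious regions"


image = [[255, 10, 10], [10, 10, 10], [10, 10, 10]]
verdict = locate_then_examine(image, toy_locate, toy_examine)
print(verdict)
```

Returning the boxes alongside the label keeps each decision tied to the regions that produced it, which is the grounding property the abstract emphasizes.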
