ParseBench: A Document Parsing Benchmark for AI Agents
Authors: Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, Daniel B. Ospina, Simon Suo
First: 2026-04-09T17:59:36+00:00 · Latest: 2026-04-13T17:37:21+00:00
Abstract
AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of ~2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at 84.9%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available at https://huggingface.co/datasets/llamaindex/ParseBench and https://github.com/run-llama/ParseBench.
中文标题/摘要
标题:ParseBench:AI代理的文档解析基准
AI代理正在改变文档解析的要求。关键在于语义正确性:解析输出必须保留用于自主决策所需的结构和意义,包括正确的表格结构、精确的图表数据、语义相关的格式以及视觉定位。现有基准未能充分捕捉企业自动化中的这一设置,依赖于狭窄的文档分布和文本相似性度量,这些度量忽略了代理关键的失败。我们引入了ParseBench,这是一个包含约2000个企业文档中的人工验证页面的基准,这些文档涵盖了保险、金融和政府领域,围绕五个能力维度组织:表格、图表、内容忠实度、语义格式和视觉定位。在涵盖视觉语言模型、专门的文档解析器和LlamaParse在内的14种方法中,基准揭示了一个碎片化的能力景观:没有一种方法在所有五个维度上都表现出色。LlamaParse Agentic以84.9%的最高总体得分,基准突显了当前系统中的剩余能力差距。数据集和评估代码可在https://huggingface.co/datasets/llamaindex/ParseBench和https://github.com/run-llama/ParseBench获取。
Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games
Authors: Keyang Zhong, Junlin Xie, Hefeng Wu, Haofeng Li, Guanbin Li
Venue: ACL 2026
First: 2026-04-13T17:16:23+00:00 · Latest: 2026-04-13T17:16:23+00:00
Comments: 9 pages, 5 figures, Findings of ACL 2026
Abstract
Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which require inferring hidden truths based on partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi-agent framework for evaluating and synthesizing high-quality, role-driven multiplayer game scripts, enabling fine-grained interaction patterns tailored to character identities (i.e., murderer vs. innocent). Our system generates rich multimodal contexts, including character backstories, visual and textual clues, and multi-hop reasoning chains, through coordinated agent interactions. We design a two-stage agent-monitored training strategy to enhance the reasoning ability of VLMs: (1) chain-of-thought based fine-tuning on curated and synthetic datasets that model uncertainty and deception; (2) GRPO-based reinforcement learning with agent-monitored reward shaping, encouraging the model to develop character-specific reasoning behaviors and effective multimodal multi-hop inference. Extensive experiments demonstrate that our method significantly boosts the performance of VLMs in narrative reasoning, hidden fact extraction, and deception-resilient understanding. Our contributions offer a scalable solution for training and evaluating VLMs under uncertain, adversarial, and socially complex conditions, laying the groundwork for future benchmarks in multimodal multi-hop reasoning under imperfect information.
中文标题/摘要
标题:增强谋杀谜题游戏中不完全信息推理的协作多智能体脚本生成
视觉语言模型(VLMs)在感知任务中表现出色,但在多人游戏设置中面对不完全和欺骗性信息时,其在复杂多跳推理中的能力会下降。本文研究了一个代表性的多人任务——谋杀谜题游戏,要求根据不同意图角色提供的部分线索推断隐藏的真相。为应对这一挑战,我们提出了一种协作多智能体框架,用于评估和合成高质量、角色驱动的多人游戏脚本,使交互模式能够根据角色身份(即凶手 vs 无辜者)进行精细调整。我们的系统通过协调智能体交互生成丰富的多模态上下文,包括角色背景故事、视觉和文本线索以及多跳推理链。我们设计了两阶段智能体监控训练策略,以增强VLMs的推理能力:(1)基于推理链的微调,使用建模不确定性和欺骗性的定制和合成数据集;(2)基于GRPO的强化学习,通过智能体监控奖励塑造,鼓励模型发展特定于角色的推理行为和有效的多模态多跳推理。大量实验表明,我们的方法显著提升了VLMs在叙事推理、隐藏事实提取和欺骗抵御理解方面的性能。我们的贡献提供了一种在不确定、对抗性和社会复杂条件下训练和评估VLMs的可扩展解决方案,为未来在不完全信息下的多模态多跳推理基准奠定了基础。
Summary / 总结
This paper addresses the challenge of complex multi-hop reasoning in multiplayer games with imperfect and deceptive information, focusing on Murder Mystery Games. It proposes a collaborative multi-agent framework that evaluates and synthesizes high-quality, role-driven game scripts through coordinated agent interactions. The system uses a two-stage agent-monitored training strategy: chain-of-thought fine-tuning on curated and synthetic datasets, followed by GRPO-based reinforcement learning with agent-monitored reward shaping. Experiments show significant improvements in narrative reasoning, hidden fact extraction, and deception-resilient understanding for vision-language models.
本文针对具有不完整信息的多人游戏中复杂的推理挑战,集中在谋杀谜题游戏中。它提出了一种协作多智能体框架,通过协调交互生成高质量的角色驱动游戏脚本。该系统使用两阶段训练策略:基于链式思考的微调以及带有奖励塑造的强化学习。实验表明,这种方法显著提高了视觉语言模型在叙事推理、隐藏事实提取和欺骗抵御理解方面的性能。
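The second training stage above relies on GRPO, whose core step scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization. The reward values are hypothetical, the abstract does not describe how the agent-monitored rewards are computed, and the population standard deviation is used here for simplicity (implementations differ on this detail).

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical shaped rewards for four sampled answers to one game prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses that beat the group average receive positive advantages and are reinforced; no separate value network is needed for this step.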
El Agente Estructural: An Artificially Intelligent Molecular Editor
Authors: Changhyeok Choi, Yunheng Zou, Marcel Müller, Han Hao, Yeonghun Kang, Juan B. Pérez-Sánchez, Ignacio Gustin, Hanyong Xu, Andrew Wang, Mohammad Ghazi Vakili, Chris Crebolder, Alán Aspuru-Guzik, Varinia Bernales
First: 2026-02-04T18:38:48+00:00 · Latest: 2026-04-13T16:51:33+00:00
Abstract
We present El Agente Estructural, a multimodal, natural-language-driven geometry-generation and manipulation agent for autonomous chemistry and molecular modelling. Unlike molecular generation or editing via generative models, Estructural mimics how human experts directly manipulate molecular systems in three dimensions by integrating a comprehensive set of domain-informed tools and vision-language models. This design enables precise control over atomic or functional group replacements, atomic connectivity, and stereochemistry without the need to rebuild extensive core molecular frameworks. Through a series of representative case studies, we demonstrate that Estructural enables chemically meaningful geometry manipulation across a wide range of real-world scenarios. These include site-selective functionalization, ligand binding, ligand exchange, stereochemically controlled structure construction, isomer interconversion, fragment-level structural analysis, image-guided generation of structures from schematic reaction mechanisms, and mechanism-driven geometry generation and modification. These examples illustrate how multimodal reasoning, when combined with specialized geometry-aware tools, supports interactive and context-aware molecular modelling beyond structure generation. Looking forward, the integration of Estructural into El Agente Quntur, an autonomous multi-agent quantum chemistry platform, enhances its capabilities by adding sophisticated tools for the generation and editing of three-dimensional structures.
中文标题/摘要
标题:结构特工:一种人工智能分子编辑器
我们介绍了结构特工,这是一种多模态、自然语言驱动的几何生成和操作代理,用于自主化学和分子建模。与通过生成模型进行的分子生成或编辑不同,结构工模仿了人类专家如何直接在三维空间中操作分子系统的方式,通过整合一系列全面的领域指导工具和视觉语言模型。这种设计使得在无需重建大量核心分子框架的情况下,能够精确控制原子或功能团替换、原子连接性和立体化学。通过一系列代表性案例研究,我们展示了结构工如何在广泛的实际场景中实现有意义的几何操作。这些场景包括选择性功能化、配体结合、配体交换、立体化学控制的结构构建、异构体互变、片段级结构分析、基于示意图反应机制的结构生成以及机制驱动的几何生成和修改。这些示例说明了当结合多模态推理与专门的几何感知工具时,如何支持超越结构生成的交互式和上下文感知的分子建模。展望未来,将结构工整合到量子化学多代理平台El Agente Quntur中,通过添加复杂的三维结构生成和编辑工具,增强了其功能。
BEM: Training-Free Background Embedding Memory for False-Positive Suppression in Real-Time Fixed-Background Camera
Authors: Junwoo Park, Jangho Lee, Sunho Lim
First: 2026-04-13T16:50:05+00:00 · Latest: 2026-04-13T16:50:05+00:00
Comments: Accepted to ICPR 2026
Abstract
Pretrained detectors perform well on benchmarks but often suffer performance degradation in real-world deployments due to distribution gaps between training data and target environments. COCO-like benchmarks emphasize category diversity rather than instance density, causing detectors trained under per-class sparsity to struggle in dense, single- or few-class scenes such as surveillance and traffic monitoring. In fixed-camera environments, the quasi-static background provides a stable, label-free prior that can be exploited at inference to suppress spurious detections. To address the issue, we propose Background Embedding Memory (BEM), a lightweight, training-free, weight-frozen module that can be attached to pretrained detectors during inference. BEM estimates clean background embeddings, maintains a prototype memory, and re-scores detection logits with an inverse-similarity, rank-weighted penalty, effectively reducing false positives while maintaining recall. Empirically, background-frame cosine similarity correlates negatively with object count and positively with Precision-Confidence AUC (P-AUC), motivating its use as a training-free control signal. Across YOLO and RT-DETR families on LLVIP and simulated surveillance streams, BEM consistently reduces false positives while preserving real-time performance. Our code is available at https://github.com/Leo-Park1214/Background-Embedding-Memory.git
中文标题/摘要
标题:BEM:实时固定背景摄像头中虚假正例抑制的无训练背景嵌入记忆
预训练检测器在基准测试中表现良好,但在实际部署中由于训练数据与目标环境之间的分布差距,其性能往往会下降。COCO样式的基准测试强调类别多样性而非实例密度,导致在密集、单一或少数类别的场景(如监控和交通监控)中,按类别稀疏训练的检测器难以应对。在固定摄像头环境中,准静态背景提供了一个稳定的、无标签的先验信息,可以在推理时利用以抑制虚假检测。为了解决这一问题,我们提出了背景嵌入记忆(BEM),这是一个轻量级、无训练、权重冻结的模块,可以在推理时附加到预训练检测器上。BEM 估计干净的背景嵌入,维护原型记忆,并使用逆相似度、排名加权惩罚重新评分检测逻辑,从而有效减少虚假正例同时保持召回率。实验证明,背景帧余弦相似度与物体数量呈负相关,与精确度-置信度AUC(P-AUC)呈正相关,这促使将其用作无训练控制信号。在LLVIP和模拟监控流上,BEM 在YOLO和RT-DETR家族中一致减少了虚假正例,同时保持了实时性能。我们的代码可在 https://github.com/Leo-Park1214/Background-Embedding-Memory.git 获取
Summary / 总结
The paper addresses the performance degradation of pretrained detectors in real-world deployments caused by distribution gaps between training data and target environments. It proposes Background Embedding Memory (BEM), a lightweight, training-free module that estimates clean background embeddings, maintains a prototype memory, and re-scores detection logits to reduce false positives while maintaining recall. Across YOLO and RT-DETR families on LLVIP and simulated surveillance streams, BEM consistently reduces false positives while preserving real-time performance.
论文针对预训练检测器在实际应用中由于分布差距导致的性能下降问题,提出了背景嵌入记忆(BEM)模块,该模块通过估计干净的背景嵌入并重新评分检测逻辑来减少误报,同时保持召回率。BEM在LLVIP和模拟监控流上的YOLO和RT-DETR家族中表现出一致的性能改进,且不影响实时性能。
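As a rough illustration of the re-scoring idea in the BEM entry above, the sketch below penalizes detections whose embeddings resemble stored background prototypes. It is one possible reading of "inverse-similarity, rank-weighted penalty", not the authors' formula: the `1/rank` weights, the `strength` parameter, and the function names are all assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def rescore(logits, det_emb, bg_protos, strength=2.0):
    """Lower the logit of detections that look like background prototypes."""
    sims = l2_normalize(det_emb) @ l2_normalize(bg_protos).T  # (N, M) cosine
    sims_sorted = -np.sort(-sims, axis=1)       # most similar prototype first
    weights = 1.0 / np.arange(1, sims_sorted.shape[1] + 1)  # assumed 1/rank
    weights /= weights.sum()
    bg_score = np.clip(sims_sorted @ weights, 0.0, 1.0)
    return logits - strength * bg_score

protos = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # background memory
dets = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])    # det 0 matches background
new_logits = rescore(np.array([2.0, 2.0]), dets, protos)
```

Because the detector weights are never touched, a step like this can run as a post-processing pass at inference time.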
Training-Free Multi-User Generative Semantic Communications via Null-Space Diffusion Sampling
Authors: Eleonora Grassucci, Jinho Choi, Jihong Park, Riccardo F. Gramaccioni, Giordano Cicchetti, Danilo Comminiello
First: 2024-05-16T07:43:15+00:00 · Latest: 2026-04-13T16:23:32+00:00
Comments: Accepted in IEEE Access
Abstract
In recent years, novel communication strategies have emerged to meet the challenges posed by the growing number of connected devices and the higher quality of transmitted information. Among them, semantic communication has obtained promising results, especially when combined with state-of-the-art deep generative models, such as large language or diffusion models, that can regenerate content from extremely compressed semantic information. However, most of these approaches focus on single-user scenarios, processing the received content at the receiver on top of conventional communication systems. In this paper, we propose to go beyond these methods by developing a novel generative semantic communication framework tailored for multi-user scenarios. This system assigns the channel to users knowing that the lost information can be filled in with a diffusion model at the receivers. Under this perspective, OFDMA systems should not aim to transmit the largest part of the information, but solely the bits the generative model needs to semantically regenerate the missing ones. The thorough experimental evaluation shows the capabilities of the novel diffusion model and the effectiveness of the proposed framework, leading towards a GenAI-based next generation of communications.
中文标题/摘要
标题:无训练多用户生成语义通信通过空域扩散采样
近年来,为应对连接设备数量增加和传输信息质量提高带来的挑战,新兴的通信策略不断涌现。其中,语义通信在与最先进的深度生成模型(如大型语言模型或扩散模型)结合时取得了显著成果,这些模型能够从极其压缩的语义信息中再生内容。然而,大多数这些方法都集中在单用户场景,处理接收器在传统通信系统基础上接收到的内容。在本文中,我们提出了一种超越这些方法的新颖生成语义通信框架,专门针对多用户场景。该系统在分配信道给用户时,知道丢失的信息可以通过接收端的扩散模型补充。从这一创新视角来看,OFDMA系统不应旨在传输大部分信息,而只需传输生成模型所需的必要位,以语义再生缺失的部分。详尽的实验评估显示了新型扩散模型的能力以及所提框架的有效性,朝着基于GenAI的下一代通信迈进。
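The abstract above does not spell out the sampling rule, so the following is only a sketch of the generic range/null-space decomposition that null-space diffusion samplers typically apply to the clean estimate at each step: keep what was actually received, and let the generative model fill in the rest. The operator `A` (here a hypothetical selection of the transmitted entries) and all names are assumptions.

```python
import numpy as np

def null_space_refine(x_gen, y, A):
    """Combine received data y = A x with a generated estimate x_gen.

    The range-space part comes from y; the null-space part comes from x_gen.
    """
    A_pinv = np.linalg.pinv(A)
    return A_pinv @ y + x_gen - A_pinv @ (A @ x_gen)

A = np.eye(4)[[0, 2]]                 # hypothetical: only entries 0 and 2 arrive
x_true = np.array([1.0, 2.0, 3.0, 4.0])
y = A @ x_true                        # what the receiver observes
x_gen = np.full(4, 9.0)               # stand-in for a diffusion model estimate
x_hat = null_space_refine(x_gen, y, A)
```

By construction `A @ x_hat` reproduces the received values, so the generated content never contradicts what was transmitted.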
AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation
Authors: Mingyang Li, Haofan Xu, Haowen Sun, Xinzhe Chen, Sihua Ren, Liqi Huang, Xinyang Sui, Chenyang Miao, Qiongjie Cui, Zeyang Liu, Xingyu Chen, Xuguang Lan
First: 2026-04-13T16:21:44+00:00 · Latest: 2026-04-13T16:21:44+00:00
Abstract
Simulation-based data generation has become a dominant paradigm for training robotic manipulation policies, yet existing platforms do not incorporate object affordance information into trajectory generation. As a result, tasks requiring precise interaction with specific functional regions--grasping a mug by its handle, pouring from a cup's rim, or hanging a mug on a hook--cannot be automatically generated with semantically correct trajectories. We introduce AffordSim, the first simulation framework that integrates open-vocabulary 3D affordance prediction into the manipulation data generation pipeline. AffordSim uses our VoxAfford model, an open-vocabulary 3D affordance detector that enhances MLLM output tokens with multi-scale geometric features, to predict affordance maps on object point clouds, guiding grasp pose estimation toward task-relevant functional regions. Built on NVIDIA Isaac Sim with cross-embodiment support (Franka FR3, Panda, UR5e, Kinova), VLM-powered task generation, and novel domain randomization using DA3-based 3D Gaussian reconstruction from real photographs, AffordSim enables automated, scalable generation of affordance-aware manipulation data. We establish a benchmark of 50 tasks across 7 categories (grasping, placing, stacking, pushing/pulling, pouring, mug hanging, long-horizon composite) and evaluate 4 imitation learning baselines (BC, Diffusion Policy, ACT, Pi 0.5). Our results reveal that while grasping is largely solved (53-93% success), affordance-demanding tasks such as pouring into narrow containers (1-43%) and mug hanging (0-47%) remain significantly more challenging for current imitation learning methods, highlighting the need for affordance-aware data generation. Zero-shot sim-to-real experiments on a real Franka FR3 validate the transferability of the generated data.
中文标题/摘要
标题:AffordSim:一种可扩展的数据生成器和基准测试,用于感知功能的机器人操作
基于仿真的数据生成已成为训练机器人操作策略的主要范式,但现有平台未将物体功能信息纳入轨迹生成中。因此,需要精确与特定功能区域交互的任务——如用杯子把手抓取、从杯子边缘倒水或用挂钩挂杯子——无法自动生成语义正确的轨迹。我们引入了AffordSim,这是第一个将开放词汇3D功能预测集成到操作数据生成管道中的仿真框架。AffordSim 使用我们的VoxAfford模型,这是一种开放词汇的3D功能检测器,通过多尺度几何特征增强MLLM输出标记,预测物体点云的功能图,引导抓取姿态估计向任务相关功能区域。基于NVIDIA Isaac Sim,支持跨体态(Franka FR3、Panda、UR5e、Kinova),VLM驱动的任务生成,以及基于DA3的3D高斯重建的新颖领域随机化,AffordSim 使自动化、可扩展的功能感知操作数据生成成为可能。我们建立了涵盖7类(抓取、放置、堆叠、推拉、倒水、挂杯子、长时复合)50个任务的基准,并评估了4种模仿学习基线(BC、扩散策略、ACT、Pi 0.5)。我们的结果表明,虽然抓取问题已基本解决(53-93%成功率),但如向狭窄容器倒水(1-43%)和挂杯子(0-47%)等需要功能感知的任务,对当前模仿学习方法来说仍然更具挑战性,突显了功能感知数据生成的必要性。在真实Franka FR3上的零样本仿真实验验证了生成数据的可迁移性。
Powerful Training-Free Membership Inference Against Autoregressive Language Models
Authors: David Ilić, David Stanojević, Kostadin Cvejoski
First: 2026-01-17T16:59:41+00:00 · Latest: 2026-04-13T15:25:39+00:00
Comments: 9 pages, 2 figures; appendix with additional experiments and derivations
Abstract
Fine-tuned language models pose significant privacy risks, as they may memorize and expose sensitive information from their training data. Membership inference attacks (MIAs) provide a principled framework for auditing these risks, yet existing methods achieve limited detection rates, particularly at the low false-positive thresholds required for practical privacy auditing. We present EZ-MIA, a membership inference attack that exploits a key observation: memorization manifests most strongly at error positions, specifically tokens where the model predicts incorrectly yet still shows elevated probability for training examples. We introduce the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at error positions relative to a pretrained reference model. This principled statistic requires only two forward passes per query and no model training of any kind. On WikiText with GPT-2, EZ-MIA achieves 3.8x higher detection than the previous state-of-the-art under identical conditions (66.3% versus 17.5% true positive rate at 1% false positive rate), with near-perfect discrimination (AUC 0.98). At the stringent 0.1% FPR threshold critical for real-world auditing, we achieve 8x higher detection than prior work (14.0% versus 1.8%), requiring no reference model training. These gains extend to larger architectures: on AG News with Llama-2-7B, we achieve 3x higher detection (46.7% versus 15.8% TPR at 1% FPR). These results establish that privacy risks of fine-tuned language models are substantially greater than previously understood, with implications for both privacy auditing and deployment decisions. Code is available at https://github.com/JetBrains-Research/ez-mia.
Summary / 总结
This paper addresses the privacy risks of fine-tuned language models by presenting EZ-MIA, a training-free membership inference attack. EZ-MIA leverages the observation that memorization is most evident at error positions, where the model incorrectly predicts but still shows elevated probability for training examples. It introduces the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at these error positions. EZ-MIA achieves significantly higher detection rates compared to previous methods, with an AUC of 0.98 and up to 8x higher true positive rates at very low false positive rates, indicating substantial privacy risks in fine-tuned language models. This work has implications for privacy auditing and deployment decisions of these models.
本文提出了一种无训练的成员推理攻击EZ-MIA,以应对细调语言模型的隐私风险。EZ-MIA 利用了一个观察结果:即记忆效应在错误位置最为明显,模型虽然错误预测但对训练样本仍表现出较高的概率。它引入了错误区(EZ)得分,用于衡量这些错误位置上概率变化的方向性不平衡。EZ-MIA 在检测率上显著优于先前的方法,AUC 达到 0.98,且在极低的误报率下最高可提高 8 倍的真阳性率,表明细调语言模型的隐私风险远超预期。这项工作对这些模型的隐私审计和部署决策具有重要意义。
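The abstract describes the EZ score only in words (a directional imbalance of probability shifts at error positions, relative to a pretrained reference model), so the formula below is a hypothetical stand-in rather than the paper's definition. It assumes per-token log-probabilities of the true token under both models and a mask marking where the fine-tuned model's top prediction was wrong; the official code is linked in the entry above.

```python
import numpy as np

def ez_score(logp_target, logp_ref, is_error, eps=1e-12):
    """Fraction of probability-shift mass at error positions that points upward.

    Hypothetical formula: values near 1 mean the fine-tuned model raised the
    true token's probability at its mistakes, hinting at memorization.
    """
    mask = np.asarray(is_error, dtype=bool)
    shift = (np.asarray(logp_target) - np.asarray(logp_ref))[mask]
    if shift.size == 0:
        return 0.5  # no error positions: arbitrary neutral value
    up = shift[shift > 0].sum()
    down = -shift[shift < 0].sum()
    return float(up / (up + down + eps))

member_like = ez_score([-1.0, -2.0, -0.1], [-3.0, -2.5, -0.1], [True, True, False])
non_member_like = ez_score([-3.0, -2.0], [-2.5, -2.5], [True, True])
```

Each query needs only the two forward passes mentioned in the abstract: one through the fine-tuned model and one through the reference model.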
Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models
Authors: Songlong Xing, Weijie Wang, Zhengyu Zhao, Jindong Gu, Philip Torr, Nicu Sebe
Venue: CVPR
First: 2026-04-13T14:54:25+00:00 · Latest: 2026-04-13T14:54:25+00:00
Comments: Accepted to CVPR Findings Track 2026
Abstract
Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance CLIP's adversarial robustness, recent studies finetune its pretrained vision encoder with adversarial examples on a proxy dataset such as ImageNet by aligning adversarial images with correct class labels. However, these methods overlook the important roles of training data distributions and learning objectives, resulting in reduced zero-shot capabilities and limited transferability of robustness across domains and datasets. In this work, we propose a simple yet effective paradigm, AdvFLYP, which follows the training recipe of CLIP's pretraining process when performing adversarial finetuning of the model. Specifically, AdvFLYP finetunes CLIP with adversarial images created from image-text pairs collected from the web, and matches them with their corresponding texts via a contrastive loss. To alleviate distortion of adversarial image embeddings of noisy web images, we further propose to regularise AdvFLYP by penalising deviation of adversarial image features. We show that logit- and feature-level regularisation terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show the superiority of our paradigm over mainstream practices. Our code and model weights are released at https://github.com/Sxing2/AdvFLYP.
Summary / 总结
This paper addresses the vulnerability of vision-language models like CLIP to adversarial attacks by proposing a new paradigm called AdvFLYP. Unlike previous approaches that finetune CLIP's vision encoder with adversarial examples and class labels on ImageNet, AdvFLYP follows the pretraining recipe of CLIP, creating adversarial images from web-collected image-text pairs and matching them with their texts via a contrastive loss. The method also adds logit- and feature-level regularisation terms, which benefit robustness and clean accuracy, respectively. Experiments on 14 downstream datasets show the superiority of AdvFLYP over mainstream adversarial finetuning practices.
本文提出了一种名为AdvFLYP的新方法,旨在通过使用来自网络的图像-文本对生成的对抗性样本来增强视觉-语言模型CLIP的对抗鲁棒性,而不是像先前方法那样在ImageNet上使用对抗样本进行微调。AdvFLYP通过对比损失将这些对抗性图像与相应的文本匹配,并引入正则化来减少对抗性图像嵌入的失真。在14个下游数据集上的实验表明,AdvFLYP在鲁棒性和干净准确率方面均优于现有方法。
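The contrastive objective that AdvFLYP reuses from CLIP pretraining can be sketched as a symmetric cross-entropy over an image-text similarity matrix. This is a minimal NumPy illustration of that standard loss only; in the adversarial setting the image embeddings would come from adversarially perturbed images, and neither the attack nor the regularisation terms are modelled here. The temperature value and function names are assumptions.

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of img_emb is paired with row i of txt_emb."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image to text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text to image
    return 0.5 * (loss_i2t + loss_t2i)

texts = np.eye(3)
matched = clip_contrastive_loss(np.eye(3), texts)
shuffled = clip_contrastive_loss(np.eye(3)[[1, 2, 0]], texts)
```

Correctly paired embeddings yield a much lower loss than mismatched ones, which is the signal that pulls adversarial image features back toward their captions.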
Training-Free Model Ensemble for Single-Image Super-Resolution via Strong-Branch Compensation
Authors: Gengjia Chang, Xining Ge, Weijun Yuan, Zhan Li, Qiurong Song, Luen Zhu, Shuhong Liu
First: 2026-04-13T14:48:03+00:00 · Latest: 2026-04-13T14:48:03+00:00
Abstract
Single-image super-resolution has progressed from deep convolutional baselines to stronger Transformer and state-space architectures, yet the corresponding performance gains typically come with higher training cost, longer engineering iteration, and heavier deployment burden. In many practical settings, multiple pretrained models with partially complementary behaviors are already available, and the binding constraint is no longer architectural capacity but how effectively their outputs can be combined without additional training. Rather than pursuing further architectural redesign, this paper proposes a training-free output-level ensemble framework. A dual-branch pipeline is constructed in which a Hybrid attention network with TLC inference provides stable main reconstruction, while a MambaIRv2 branch with geometric self-ensemble supplies strong compensation for high-frequency detail recovery. The two branches process the same low-resolution input independently and are fused in the image space via a lightweight weighted combination, without updating any model parameters or introducing an additional trainable module. As our solution to the NTIRE 2026 Image Super-Resolution ($\times 4$) Challenge, the proposed design consistently improves over the base branch and slightly exceeds the pure strong branch in PSNR at the best operating point under a unified DIV2K bicubic $\times 4$ evaluation protocol. Ablation studies confirm that output-level compensation provides a low-overhead and practically accessible upgrade path for existing super-resolution systems.
Summary / 总结
This paper addresses the challenge of combining multiple pretrained models for single-image super-resolution without additional training. It proposes a training-free, output-level ensemble framework using a dual-branch pipeline, where a Hybrid attention network with TLC inference handles main reconstruction, and a MambaIRv2 branch with geometric self-ensemble compensates for high-frequency details. The two branches independently process the same low-resolution input and are fused in the image space through a lightweight weighted combination. Developed as a solution to the NTIRE 2026 Image Super-Resolution (x4) Challenge, the method consistently improves over the base branch and slightly exceeds the pure strong branch in PSNR at the best operating point under a unified DIV2K bicubic x4 evaluation protocol.
该论文提出了一种无需额外训练即可结合多个预训练模型进行单图像超分辨率的方法。它使用一个双分支管道,其中Hybrid注意力网络结合TLC推理负责主要重建,而MambaIRv2分支结合几何自增强补偿高频细节。两个分支独立处理相同的低分辨率输入,并通过轻量级加权组合在图像空间中融合。所提出的方法在NTIRE 2026图像超分辨率挑战中实现了对基分支的一致改进,并在PSNR评估下略优于强分支的性能。
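The two training-free ingredients named in this entry, geometric self-ensemble and weighted image-space fusion, can be sketched as follows. The eight flip/rotation transforms are the usual choice for self-ensemble but are assumed here, as are the fusion weight `alpha` and the function names; `model` is any callable that maps a low-resolution array to a super-resolved one.

```python
import numpy as np

def self_ensemble(model, lr):
    """Average a model's outputs over 4 rotations x 2 flips of the input."""
    outputs = []
    for k in range(4):
        for flip in (False, True):
            x = np.rot90(lr, k, axes=(0, 1))
            if flip:
                x = x[:, ::-1]
            y = model(x)
            if flip:                      # undo the flip, then the rotation
                y = y[:, ::-1]
            outputs.append(np.rot90(y, -k, axes=(0, 1)))
    return np.mean(outputs, axis=0)

def fuse(base_sr, strong_sr, alpha=0.7):
    """Weighted image-space combination; no parameters are updated."""
    return alpha * strong_sr + (1.0 - alpha) * base_sr

def upsample2x(x):
    """Toy nearest-neighbour stand-in for a pretrained SR network."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

lr = np.arange(6.0).reshape(2, 3)
fused = fuse(upsample2x(lr), self_ensemble(upsample2x, lr))
```

In practice `alpha` would be chosen on a validation set; the abstract's "best operating point" suggests such a sweep, though the exact procedure is not given.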
CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
Authors: Sohwi Lim, Lee Hyoseok, Jungjoon Park, Tae-Hyun Oh
Venue: CVPR 2026
First: 2026-04-13T14:33:13+00:00 · Latest: 2026-04-13T14:33:13+00:00
Comments: CVPR 2026, Project page: https://sohwi-lim.github.io/CLAY
Abstract
Human perception of visual similarity is inherently adaptive and subjective, depending on the users' interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process and visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared to previous works.
中文标题/摘要
标题:CLAY:视觉相似性条件化在视觉-语言嵌入空间中的条件化
人类对视觉相似性的感知是适应性和主观性的,取决于用户的兴趣和关注点。然而,大多数图像检索系统未能反映这种灵活性,依赖于一个固定的、单一的度量标准,无法同时容纳多种条件。为了解决这个问题,我们提出了CLAY,一种适应性相似性计算方法,将预训练的视觉-语言模型(VLMs)的嵌入空间重新构想为一个文本条件化的相似性空间,无需额外训练。该设计将文本条件化过程与视觉特征提取分离,允许使用固定视觉嵌入进行高效且多条件的检索。我们还构建了一个合成评估数据集CLAY-EVAL,以在多种条件化的检索设置下进行全面评估。在标准数据集和我们提出的数据集上的实验表明,CLAY在检索准确性和计算效率方面均优于以往的工作。
Summary / 总结
CLAY is an adaptive similarity computation method that modifies the embedding space of pretrained Vision-Language Models to create a text-conditional similarity space, enabling flexible and efficient multi-conditioned image retrieval without additional training. Experiments demonstrate that CLAY outperforms previous methods in terms of retrieval accuracy and computational efficiency.
研究旨在通过使图像检索系统更具适应性和灵活性来满足用户兴趣。CLAY 是一种方法,它修改预训练的视觉-语言模型的嵌入空间,创建一个基于文本的相似性空间,从而在无需额外训练的情况下实现高效的多条件检索。实验表明,CLAY 在检索准确性和计算效率方面优于先前的方法。
SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models
Authors: Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa
First: 2026-04-13T14:30:13+00:00 · Latest: 2026-04-13T14:30:13+00:00
Abstract
Vision-Language Models (VLMs) have revolutionized multimodal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing pruning methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-K tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.
中文标题/摘要
标题:SVD-Prune:无需训练的视觉语言模型令牌剪枝方法
视觉语言模型(VLM)通过联合处理视觉和文本信息,革新了多模态学习。然而,它们面临着处理长序列视觉令牌时计算和内存需求高的挑战。许多现有方法依赖于局部启发式方法,如注意力分数或令牌范数。然而,这些标准存在位置偏差和信息分散的问题,限制了它们在高剪枝比下保留关键内容的能力,并导致在详细视觉图像上的性能下降。为了解决这些问题,我们提出了一种基于奇异值分解的无需训练、即插即用的令牌剪枝方法——SVD-Prune。该方法分解视觉令牌特征矩阵,并使用统计杠杆得分选择前K个令牌,确保仅保留对主导全局方差贡献最大的令牌。实验表明,SVD-Prune在极端视觉令牌预算下始终优于先前的剪枝方法,即使在仅使用32和16个视觉令牌时也能保持良好的性能。
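Statistical leverage scores, which SVD-Prune uses to rank tokens, are the squared row norms of the leading left singular vectors of the token matrix. The sketch below shows that selection step under stated assumptions: the subspace rank defaults to `k`, no centering is applied, and the function name is illustrative; the abstract does not specify these details.

```python
import numpy as np

def svd_prune(tokens, k, rank=None):
    """Keep the k tokens (rows) with the highest leverage scores.

    tokens: (N, D) vision token features. Returns (kept_tokens, kept_indices).
    """
    U, S, Vt = np.linalg.svd(tokens, full_matrices=False)
    r = rank or min(k, len(S))                 # assumed subspace size
    leverage = (U[:, :r] ** 2).sum(axis=1)     # one score per token
    keep = np.sort(np.argsort(-leverage)[:k])  # preserve original token order
    return tokens[keep], keep

# Six toy tokens: rows 0, 2 and 5 carry all the signal, the rest are empty.
toy = np.zeros((6, 4))
toy[0, 0], toy[2, 1], toy[5, 2] = 1.0, 2.0, 3.0
kept, kept_idx = svd_prune(toy, k=3)
```

Unlike attention-score heuristics, the ranking depends only on the global structure of the feature matrix, not on token position.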
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Authors: Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune
First: 2026-04-13T14:03:18+00:00 · Latest: 2026-04-13T14:03:18+00:00
Abstract
Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. These results identify global embedding matching as a key bottleneck in dual-encoder VLMs and highlight the importance of alignment mechanisms for robust compositional generalization.
Summary / 总结
The study revisits the compositionality issue in dual-encoder VLMs like CLIP, suggesting that poor performance on compositional benchmarks is due less to the representations themselves than to the standard inference protocol based on global cosine similarity. By enforcing fine-grained region-segment alignment during inference and introducing a lightweight transformer to learn such alignments from frozen embeddings, the research improves compositional performance without updating pretrained encoders. The findings show that while full fine-tuning and end-to-end compositional training methods enhance in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations achieves comparable in-domain performance and significant improvements on out-of-domain compositional benchmarks, highlighting the importance of alignment mechanisms for robust compositional generalization.
研究重新审视了像CLIP这样的双编码器VLM在组成性基准上的表现问题,认为表现不佳主要是由于推理协议而非表示本身。通过在推理过程中强制执行细粒度的区域-片段对齐,并引入一个轻量级的变压器直接从冻结的片段和标记嵌入中学习这种对齐,研究提高了组成性表现且无需更新预训练编码器。研究发现,虽然全面微调和端到端组成性训练方法在领域内检索上有所提升,但它们在分布变化下的表现并不一致。相比之下,学习冻结表示上的局部对齐在领域内检索上达到了与全面微调相当的表现,并在受控的领域外组成性基准上取得了显著改进,突显了对齐机制对于稳健组成性泛化的关键作用。
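To make the contrast between global matching and localized alignment concrete, the sketch below compares a single pooled cosine similarity with a simple late-interaction score (each text token matched to its best image patch). This max-over-patches rule is a swapped-in illustration, not the paper's region-segment alignment or its learned transformer, and mean pooling stands in for the encoders' actual global embeddings.

```python
import numpy as np

def unit(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def global_score(patch_emb, token_emb):
    """Cosine similarity between mean-pooled image and text embeddings."""
    return float(unit(patch_emb.mean(axis=0)) @ unit(token_emb.mean(axis=0)))

def localized_score(patch_emb, token_emb):
    """Average, over text tokens, of each token's best-matching patch."""
    sim = unit(token_emb) @ unit(patch_emb).T  # (tokens, patches)
    return float(sim.max(axis=1).mean())

patches = np.eye(3)            # three toy frozen patch embeddings
tokens = np.eye(3)[[2, 0]]     # two text tokens, each grounded in one patch
g = global_score(patches, tokens)
loc = localized_score(patches, tokens)
```

Here every token has an exact patch match, which the localized score reflects fully while the pooled score blurs it.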
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
Authors: Samuel Cahyawijaya, Peerat Limkonchotiwat, Tack Hwa Wong, Hitesh Laxmichand Patel, Amit Agarwal, Manuel Antonio Rufino, Carlos Rafael Catalan, Muhammad Reza Qorib, Vicky Feliren, Holy Lovenia, Aye Hninn Khine, Frederikus Hudi, David Anugraha, Alham Fikri Aji, Romrawin Chumpu, Viet-Thanh Pham, Minghan Wang, Mohamed Fazli Imam, Ruochen Zhang, Joseph Marvin Imperial, Do Xuan Long, Musa Izzanardi Wijanarko, Joel Ruben Antony Moniz, Patrick Amadeus Irawan, Hanif Muhammad Zhafran, Isaiah Flores, Ira Salsabila, Jun Kevin, Jostin Jerico Rosal, Patricia Nicole Monderin, Kun Kerdthaisong, Ahmad Mustafid, My Chiffon Nguyen, Natchapon Jongwiriyanurak, Siva Worajitwannakul, Haochen Li, Adrian Xuan Wei Lim, Bin Wang, Muhammad Ravi Shulthan Habibi, Lynnette Hui Xian Ng, Mithil Bangera, Yeshil Bangera, Priyaranjan Pattnayak, Dun Li Chan, Sherissa Caren Djuniwar, Hee Ming Shan
First: 2026-04-13T13:56:00+00:00 · Latest: 2026-04-13T13:56:00+00:00
Abstract
While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple, but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on 3 VL architectures: large vision-language models, text-to-image diffusion models, and vision-language embedding models, and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm towards applicability of multimodal vision-language models in diverse regions and demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.
Summary / 总结
This paper addresses the lack of human-centric alignment in vision-language systems by introducing Anthropogenic Regional Adaptation, a novel paradigm that optimizes model relevance to specific regional contexts while maintaining global generalization. The authors propose Geographical-generalization-made-easy (GG-EZ), a simple adaptation method using regional data filtering and model merging. Experiments on three vision-language architectures and a Southeast Asia case study show a 5-15% gain in cultural relevance metrics with GG-EZ, while maintaining over 98% of global performance and occasionally surpassing it.
本文通过引入Anthropogenic Regional Adaptation这一新颖范式,解决了视觉-语言系统中缺乏人类中心对齐的问题,该范式旨在优化模型对特定区域背景的相关性,同时保持全局泛化能力。作者提出了Geographical-generalization-made-easy (GG-EZ) 简单的适应方法,该方法利用区域数据过滤和模型合并。在三种视觉-语言架构和东南亚案例研究中的实验表明,使用GG-EZ可以获得5-15%的文化相关性指标提升,同时保持超过98%的全局性能,并且有时甚至超过了全局性能。
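The abstract above names model merging as one ingredient of GG-EZ but does not say which merging rule is used. As a rough illustration only, the sketch below assumes the simplest option, linear interpolation between a globally trained model and a regionally adapted one; the function name `merge_weights`, the flat-list parameter layout, and the mixing weight `alpha` are all hypothetical.

```python
def merge_weights(global_w, regional_w, alpha=0.5):
    """Linearly interpolate two parameter dicts (name -> flat list of floats).

    alpha=0 returns the global model, alpha=1 the regional one.
    This is an assumed merging rule, not necessarily the one used by GG-EZ.
    """
    if global_w.keys() != regional_w.keys():
        raise ValueError("models must share the same parameter names")
    merged = {}
    for name, g in global_w.items():
        r = regional_w[name]
        if len(g) != len(r):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [(1 - alpha) * gv + alpha * rv for gv, rv in zip(g, r)]
    return merged


global_model = {"w": [0.0, 2.0]}
regional_model = {"w": [1.0, 4.0]}
merged_model = merge_weights(global_model, regional_model, alpha=0.5)
```

Under this assumption, a small `alpha` stays close to the global model, which is one plausible way to trade regional relevance against global performance.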
Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts
Authors: Farhad Nooralahzadeh, Omid Rohanian, Yi Zhang, Jonathan Fürst, Kurt Stockinger
First: 2026-04-10T14:36:07+00:00 · Latest: 2026-04-13T13:46:40+00:00
Abstract
When a Vision-Language Model (VLM) sees a blue banana and answers "yellow", is the problem one of perception or arbitration? We explore the question in ten VLMs of various sizes and reveal an Encoding-Grounding Dissociation: models that fail to report what they see (and thus provide a wrong answer) still encode the visual evidence as strongly as models that provide the correct answer. Using Multimodal Arbitration Crossover (MAC) analysis with layer-by-layer Logit Lens probing, we track the competition between visual and prior signals across every layer of each model. We show that visual attributes can be linearly decoded from early layers (AUC > 0.86). The accuracy remains nearly identical for both successful and failed samples. However, the gap in the final-layer logit - not the strength of encoding - better predicts grounding outcomes with a correlation of $\rho = 0.847$. After having studied when VLMs base their answers on image clues rather than prior knowledge, we want to understand the causal relationships. We establish causality through full-sequence activation patching. The standard last-token interventions in LLM interpretability do not affect VLMs. In contrast, replacing the full token sequence at layers identified by MAC alters 60 to 84% of outputs. Partial-token decomposition shows that image tokens carry almost all of the causal impact, while text tokens have none. Scaling addresses the remaining architectural differences to achieve perfect retention. Moving from diagnosis to intervention, we show that training-free activation steering - both linear and sparse autoencoder-guided - in early layers can improve visual grounding by up to +3.8%, though performance degrades in some setups. Overall, these findings lead to a clear conclusion: VLMs already see well, but the challenge is acting on what they see. Targeted interventions can help to bridge this gap.
中文标题/摘要
标题:仲裁失败,而非感知盲点:视觉语言模型如何解决视觉语言冲突
当视觉语言模型(VLM)看到一个蓝色的香蕉并回答‘黄色’时,是感知问题还是仲裁问题?我们在这十个不同规模的VLM中进行了探索,揭示了编码-接地分离现象:那些未能报告所见(从而给出错误答案)的模型,其对视觉证据的编码强度与给出正确答案的模型相当。通过分层逻辑探针的多模态仲裁交叉(MAC)分析,我们追踪了每层模型中视觉信号与先验信号的竞争。我们发现早期层的视觉属性可以线性可解(AUC > 0.86)。准确率在成功和失败样本中几乎保持不变。然而,最终层的逻辑差距——而不是编码强度——更好地预测了接地结果,相关系数为ρ=0.847。在研究VLMs是基于图像线索还是先验知识作答后,我们想理解因果关系。我们通过全序列激活补丁建立了因果关系。LLM的最后标记干预对VLMs没有影响。相反,由MAC识别的层替换整个标记序列会改变60%到84%的输出。部分标记分解显示,图像标记几乎承载了全部因果影响,而文本标记没有影响。通过缩放解决剩余的架构差异,实现了完美的保留。从诊断转向干预,我们展示了在早期层通过无训练激活引导——无论是线性还是稀疏自编码器引导——可以提高视觉接地,最高可提高3.8%,但在某些设置中性能会下降。总体而言,这些发现得出一个明确的结论:VLMs已经看得很好,但挑战在于如何行动。有针对性的干预可以有助于弥合这一差距。
Summary / 总结
The study investigates whether Vision-Language Models (VLMs) fail due to perceptual issues or arbitration problems. By analyzing ten VLMs with different sizes, the researchers found that models that provide incorrect answers still encode visual evidence as strongly as those that give correct answers. Using MAC analysis and Logit Lens probing, they discovered that early layers can linearly decode visual attributes with high accuracy, but the gap in final-layer logits better predicts grounding outcomes. Full-sequence activation patching revealed that image tokens carry the majority of the causal impact, while text tokens have no effect. The findings suggest that VLMs perceive well but struggle to act on visual information, and targeted interventions in early layers can improve visual grounding.
研究探讨了Vision-Language模型(VLM)出错是由于感知问题还是仲裁问题。通过分析十种不同规模的VLM,研究人员发现,提供错误答案的模型在编码视觉证据方面与给出正确答案的模型一样强烈。使用MAC分析和Logit Lens探针,他们发现早期层可以线性解码视觉属性,但最终层logit的差距更能预测接地结果。全序列激活补丁显示,图像标记承载了大部分因果影响,而文本标记没有影响。研究结果表明,VLMs感知良好,但难以采取行动,早期层的针对性干预可以改善视觉接地。
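The entry above reports that the final-layer logit gap, rather than encoding strength, predicts whether a model grounds its answer in the image. The abstract does not give the exact definition of that gap, so the sketch below assumes the most direct reading: the final-layer logit of the visually correct token minus that of the prior-driven token. The names `logit_gap` and `predicts_grounded`, and the zero threshold, are hypothetical.

```python
def logit_gap(final_logits, visual_token, prior_token):
    """Final-layer logit of the image-supported answer minus the prior answer."""
    return final_logits[visual_token] - final_logits[prior_token]


def predicts_grounded(final_logits, visual_token, prior_token):
    """Assumed decision rule: a positive gap means the visual answer wins."""
    return logit_gap(final_logits, visual_token, prior_token) > 0


# Blue-banana example: the prior ("yellow") outscores the visual evidence ("blue").
example_logits = {"blue": 2.5, "yellow": 4.0}
gap = logit_gap(example_logits, "blue", "yellow")
grounded = predicts_grounded(example_logits, "blue", "yellow")
```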
X-SYS: A Reference Architecture for Interactive Explanation Systems
Authors: Tobias Labarta, Nhi Hoang, Maximilian Dreyer, Jim Berend, Oleg Hein, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin
First: 2026-02-13T09:24:03+00:00 · Latest: 2026-04-13T13:42:26+00:00
Comments: 18 pages, 8 figures
Abstract
The explainable AI (XAI) research community has proposed numerous technical methods, yet deploying explainability as systems remains challenging: Interactive explanation systems require both suitable algorithms and system capabilities that maintain explanation usability across repeated queries, evolving models and data, and governance constraints. We argue that operationalizing XAI requires treating explainability as an information systems problem where user interaction demands induce specific system requirements. We introduce X-SYS, a reference architecture for interactive explanation systems, that guides (X)AI researchers, developers and practitioners in connecting interactive explanation user interfaces (XUI) with system capabilities. X-SYS organizes around four quality attributes named STAR (scalability, traceability, responsiveness, and adaptability), and specifies a five-component decomposition (XUI Services, Explanation Services, Model Services, Data Services, Orchestration and Governance). It maps interaction patterns to system capabilities to decouple user interface evolution from backend computation. We implement X-SYS through SemanticLens, a system for semantic search and activation steering in vision-language models. SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability. Together, this work provides a reusable blueprint and concrete instantiation for interactive explanation systems supporting end-to-end design under operational constraints.
中文标题/摘要
标题:X-SYS:交互解释系统参考架构
可解释人工智能(XAI)研究社区提出了许多技术方法,但将可解释性部署为系统仍然具有挑战性:交互式解释系统需要合适的算法和系统能力,以保持解释的可用性,跨越重复查询、模型和数据的演变以及治理约束。我们认为,实现XAI需要将可解释性视为信息系统问题,其中用户交互需求引发特定系统需求。我们介绍了X-SYS,这是一种交互式解释系统的参考架构,指导(X)AI研究人员、开发人员和从业人员将交互式解释用户界面(XUI)与系统能力连接起来。X-SYS围绕四个质量属性(可扩展性、可追溯性、响应性和适应性)组织,并规定了五个组件分解(XUI服务、解释服务、模型服务、数据服务、编排和治理)。它将交互模式映射到系统能力,以解耦用户界面的演变与后端计算。我们通过SemanticLens系统实现了X-SYS,这是一种用于视觉语言模型中语义搜索和激活引导的系统。SemanticLens展示了基于合同的服务边界如何实现独立演变,离线/在线分离如何确保响应性,持久状态管理如何支持可追溯性。这项工作一起提供了一个可重用的蓝图和一个具体的实例,支持在运营约束下端到端设计交互式解释系统。
Summary / 总结
The paper addresses the challenge of deploying interactive explanation systems in explainable AI (XAI) by proposing X-SYS, a reference architecture. X-SYS aims to connect user interfaces with system capabilities through four quality attributes (scalability, traceability, responsiveness, and adaptability) and a five-component decomposition. Key findings include the demonstration of SemanticLens, which shows how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability.
论文提出了X-SYS参考架构来应对交互解释系统的部署挑战。X-SYS通过四个质量属性(可扩展性、可追溯性、响应性和适应性)和五个组件分解来连接用户界面与系统能力。关键发现包括通过合同基服务边界实现独立演化、离线/在线分离确保响应性和持久状态管理支持可追溯性来展示SemanticLens的工作。
TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection
Authors: Lei Jiang, Chunzhao Xie, Tongxuan Liu, Yuting Zeng, Jinrong Guo, Yunheng Shen, Weizhe Huang, Jing Li, Xiaohua Xu
First: 2025-04-05T07:57:11+00:00 · Latest: 2026-04-13T13:08:28+00:00
Comments: 8 pages, 9 figures
Abstract
Large Vision-Language Models have demonstrated remarkable capabilities, yet they suffer from hallucinations that limit practical deployment. While various mitigation strategies exist, they often incur high computational overhead or require extensive retraining. In this paper, we address the issue of visual attention decay during generation, a key factor contributing to hallucinations. We propose Temporal Attention Real-time Accumulative Connection (TARAC), a novel training-free framework that dynamically accumulates and re-injects historical attention to sustain visual grounding. Inspired by cognitive reinforcement mechanisms, TARAC operates as a lightweight, plug-and-play module. Extensive experiments across diverse models (e.g., LLaVA, Qwen2-VL) and benchmarks demonstrate that TARAC significantly outperforms state-of-the-art methods. Remarkably, it achieves these gains with negligible inference overhead (~4% TPOT increase), compared to the substantial costs of existing training-free baselines. Specifically, TARAC reduces hallucinated sentences by 25.2% on CHAIR and improves Perception score by +10.65 on MME, validating its effectiveness and efficiency.
中文标题/摘要
标题:TARAC:通过时间注意力实时累积连接减轻LVLM中的幻觉
大型视觉-语言模型展示了显著的能力,但它们受到幻觉的限制,这限制了其实用部署。虽然存在各种缓解策略,但它们往往需要高计算开销或需要大量重新训练。在本文中,我们解决了生成过程中视觉注意力衰减的问题,这是导致幻觉的关键因素。我们提出了时间注意力实时累积连接(TARAC),这是一种新的无需训练的框架,可以动态累积和重新注入历史注意力以维持视觉定位。受认知强化机制的启发,TARAC 作为轻量级的即插即用模块运行。在多种模型(例如,LLaVA,Qwen2-VL)和基准测试上的广泛实验表明,TARAC 显著优于现有方法。值得注意的是,与现有无需训练基线的高昂成本相比,TARAC 实现这些收益时几乎不增加推理开销(约 4% TPOT 增加)。具体而言,TARAC 在 CHAIR 上减少了 25.2% 的幻觉句子,并在 MME 上提高了感知分数 +10.65,验证了其有效性和效率。
Summary / 总结
The paper addresses the issue of hallucinations in Large Vision-Language Models by proposing TARAC, a training-free framework that dynamically accumulates and re-injects historical attention to maintain visual grounding. Experiments show that TARAC significantly outperforms existing methods with minimal inference overhead, reducing hallucinated sentences by 25.2% on CHAIR and improving the Perception score by 10.65 on MME.
论文提出了一种名为TARAC的训练-free框架,通过动态积累和重新注入历史注意力来维持视觉接地,以解决大型视觉-语言模型中的幻觉问题。实验表明,TARAC在最小的推理开销下显著优于现有方法,在CHAIR上减少了25.2%的幻觉句子,并在MME上提高了感知分数10.65。
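The TARAC entry above describes accumulating attention over image tokens across decoding steps and re-injecting it so that visual attention does not decay. The abstract does not specify the update rule, so the following is only a sketch under assumed choices: an exponentially decayed running sum (`decay`) and an additive re-injection followed by renormalization (`beta`). Both function names and both parameters are hypothetical.

```python
def accumulate(prev_acc, attn, decay=0.5):
    """Update the running attention history with the current step's attention."""
    return [decay * p + a for p, a in zip(prev_acc, attn)]


def reinject(attn, acc, beta=0.5):
    """Add a fraction of the accumulated history to the current attention,
    then renormalize so the weights sum to 1."""
    mixed = [a + beta * c for a, c in zip(attn, acc)]
    total = sum(mixed)
    return [m / total for m in mixed]


# Step 1 attends mostly to image token 0; step 2 has drifted to token 1.
history = accumulate([0.0, 0.0], [0.6, 0.4])
boosted = reinject([0.2, 0.8], history, beta=0.5)
```

In this toy run the re-injected history raises the weight on token 0 from 0.2 to about one third, which is the kind of sustained grounding the abstract describes.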
Optimization-Guided Diffusion for Interactive Scene Generation
Authors: Shihao Li, Naisheng Ye, Tianyu Li, Kashyap Chitta, Tuo An, Peng Su, Boyang Wang, Haiou Liu, Chen Lv, Hongyang Li
First: 2025-12-08T15:56:18+00:00 · Latest: 2026-04-13T13:01:08+00:00
Abstract
Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events, which are essential for this task, are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration capabilities, and from 11% to 80% for controllability-focused generation. Our approach can also generate $5\times$ more near-collision frames with a time-to-collision under three seconds while maintaining the overall scene realism.
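The OMEGA entry above counts near-collision frames as those with a time-to-collision under three seconds. How the authors compute time-to-collision is not stated in the abstract; the sketch below assumes the textbook one-dimensional case of an ego vehicle following a lead vehicle at constant speeds. The function names and the car-following setup are illustrative assumptions.

```python
def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Seconds until the ego vehicle reaches the lead vehicle at constant speeds.

    Returns infinity when the ego vehicle is not closing the gap.
    """
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0:
        return float("inf")
    return gap_m / closing_speed


def is_near_collision(ttc_s, threshold_s=3.0):
    """Flag a frame as near-collision using the three-second threshold."""
    return ttc_s < threshold_s


ttc = time_to_collision(gap_m=20.0, ego_speed_mps=15.0, lead_speed_mps=5.0)
flagged = is_near_collision(ttc)
```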
Scene Change Detection with Vision-Language Representation Learning
Authors: Diwei Sheng, Vijayraj Gohil, Satyam Gaba, Zihan Liu, Giles Hamilton-Fletcher, John-Ross Rizzo, Yongqing Liang, Chen Feng
First: 2026-04-13T12:43:35+00:00 · Latest: 2026-04-13T12:43:35+00:00
Abstract
Scene change detection (SCD) is crucial for urban monitoring and navigation but remains challenging in real-world environments due to lighting variations, seasonal shifts, viewpoint differences, and complex urban layouts. Existing methods rely primarily on low-level visual features, limiting their ability to accurately identify changed objects amid the visual complexity of urban scenes. In this paper, we propose LangSCD, a vision-language framework for scene change detection that overcomes this single-modal limitation by incorporating semantic reasoning through language. Our approach introduces a modular language component that leverages vision-language models (VLMs) to generate textual descriptions of scene changes, which are fused with visual features through a cross-modal feature enhancer. We further introduce a geometric-semantic matching module that refines the predicted masks by enforcing semantic consistency and spatial completeness. Existing real-world scene change detection benchmarks provide only binary change annotations, which are insufficient for downstream applications requiring fine-grained understanding of scene dynamics. To address this limitation, we introduce NYC-CD, a large-scale dataset of 8,122 real-world image pairs collected in New York City with multiclass change annotations generated through a semi-automatic pipeline. Extensive experiments across multiple street-view benchmarks demonstrate that our language and matching modules consistently improve existing change-detection architectures, achieving state-of-the-art performance and highlighting the value of integrating linguistic reasoning with visual representations for robust scene change detection.
中文标题/摘要
标题:基于视觉语言表示学习的场景变化检测
场景变化检测(SCD)对于城市监控和导航至关重要,但在现实环境中由于光照变化、季节转换、视角差异和复杂的城市布局,仍然具有挑战性。现有方法主要依赖低级视觉特征,限制了它们在城市场景的视觉复杂性中准确识别变化对象的能力。本文提出了一种名为LangSCD的视觉语言框架,通过引入语义推理模块,克服了单一模态的限制。我们的方法引入了一个模块化的语言组件,利用视觉语言模型(VLMs)生成场景变化的文本描述,并通过跨模态特征增强器与视觉特征融合。我们还引入了一个几何语义匹配模块,通过强制执行语义一致性和空间完整性来细化预测的掩码。现有的现实世界场景变化检测基准仅提供二元变化注释,不足以满足需要对场景动态进行精细理解的下游应用。为了解决这一限制,我们引入了NYC-CD,这是一个包含8,122对纽约市真实世界图像对的大规模数据集,通过半自动管道生成了多类变化注释。在多个街景基准上的广泛实验表明,我们的语言和匹配模块能够持续改进现有的变化检测架构,实现了最先进的性能,并突显了将语言推理与视觉表示相结合对于稳健场景变化检测的价值。
Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging
Authors: Zihang Fu, Haonan Wang, Jian Kang, Kenji Kawaguchi, Jiaying Wu
First: 2026-04-13T12:41:50+00:00 · Latest: 2026-04-13T12:41:50+00:00
Abstract
Multimodal adaptation equips large language models (LLMs) with perceptual capabilities, but often weakens the reasoning ability inherited from language-only pretraining. This trade-off is especially pronounced in video-language models (VLMs), where visual alignment can impair temporal reasoning (TR) over sequential events. We propose MERIT, a training-free, task-driven model merging framework for restoring TR in VLMs. MERIT searches over layer-wise self-attention merging recipes between a VLM and its paired text-only backbone using an objective that improves TR while penalizing degradation in temporal perception (TP). Across three representative VLMs and multiple challenging video benchmarks, MERIT consistently improves TR, preserves or improves TP, and generalizes beyond the search set to four distinct benchmarks. It also outperforms uniform full-model merging and random layer selection, showing that effective recovery depends on selecting the right layers. Interventional masking and frame-level attribution further show that the selected layers are disproportionately important for reasoning and shift model decisions toward temporally and causally relevant evidence. These results show that targeted, perception-aware model merging can effectively restore TR in VLMs without retraining.
中文标题/摘要
标题:推理存在于层级中:通过层级选择性合并恢复视频语言模型的时间推理能力
多模态适应使大型语言模型(LLMs)获得了感知能力,但往往会削弱从语言仅预训练继承来的推理能力。这种权衡在视频语言模型(VLMs)中尤为明显,其中视觉对齐可能会损害时间推理(TR)对序列事件的推理。我们提出了一种无需训练的任务驱动模型合并框架MERIT,用于恢复VLMs的时间推理能力。MERIT使用一个目标函数在VLM与其配对的文本仅骨干之间搜索层级自注意力合并配方,该目标函数可以提高TR并惩罚时间感知(TP)的退化。在三个代表性VLMs和多个具有挑战性的视频基准测试中,MERIT始终提高了TR,保持或提高了TP,并且在搜索集之外泛化到四个不同的基准测试中。它还优于均匀的全模型合并和随机层级选择,表明有效的恢复取决于选择正确的层级。干预性掩码和帧级归因进一步表明,所选层级对于推理而言尤为重要,并且使模型决策偏向于时间上和因果上相关的证据。这些结果表明,有针对性、感知意识的模型合并可以在不重新训练的情况下有效恢复VLMs的时间推理能力。
Summary / 总结
The paper addresses the trade-off between reasoning and perception in video-language models (VLMs), proposing MERIT, a training-free merging framework that selectively merges layers between a VLM and its text-only backbone to enhance temporal reasoning (TR) while preserving temporal perception (TP). Across various VLMs and benchmarks, MERIT consistently improves TR, maintains or enhances TP, and generalizes well to new benchmarks, demonstrating the importance of selecting the right layers for effective recovery of reasoning abilities.
论文针对视频语言模型(VLMs)中推理和感知之间的权衡问题,提出了一种无需训练的合并框架MERIT,该框架选择性地将VLM与其文本仅有的主干网络之间的层进行合并,以增强时间推理(TR)并保持时间感知(TP)。在多种VLM和基准测试中,MERIT一致地提高了TR,维持或提升了TP,并且能够很好地泛化到新的基准测试中,表明选择合适的层对于有效恢复推理能力的重要性。
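The MERIT entry above describes a search over layer-wise merging recipes with an objective that rewards temporal reasoning (TR) and penalizes degradation in temporal perception (TP). The exact objective is not given in the abstract, so the sketch below assumes one simple form: TR accuracy minus a weighted penalty for any TP drop below the unmerged baseline. The names `recipe_score` and `best_recipe`, the `penalty` weight, and the candidate numbers are hypothetical.

```python
def recipe_score(tr_acc, tp_acc, base_tp_acc, penalty=1.0):
    """Reward TR accuracy; subtract a penalty only when TP falls below baseline."""
    tp_drop = max(0.0, base_tp_acc - tp_acc)
    return tr_acc - penalty * tp_drop


def best_recipe(candidates, base_tp_acc, penalty=1.0):
    """Pick the candidate merging recipe with the highest assumed score."""
    return max(
        candidates,
        key=lambda c: recipe_score(c["tr"], c["tp"], base_tp_acc, penalty),
    )


# Illustrative candidates: which layers to merge, and the resulting accuracies.
candidates = [
    {"layers": (0, 1), "tr": 0.60, "tp": 0.70},
    {"layers": (4, 5), "tr": 0.66, "tp": 0.50},
]
chosen = best_recipe(candidates, base_tp_acc=0.70, penalty=1.0)
```

With the penalty switched off, the second recipe would win on raw TR accuracy; with it on, the recipe that preserves perception is preferred.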
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Authors: Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie
Venue: CVPR 2026 Highlight
First: 2025-05-22T17:59:03+00:00 · Latest: 2026-04-13T12:33:41+00:00
Comments: Accepted by CVPR 2026 (Highlight); Project Page: https://haoningwu3639.github.io/SpatialScore
Abstract
Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question-answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 49 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human-level spatial intelligence; (iii) to advance model capabilities, we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples that supports fine-tuning on spatial reasoning tasks and significantly improves the performance of existing models (e.g., Qwen3-VL); (iv) to complement this data-driven route with a training-free paradigm, we develop SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools that supports both Plan-Execute and ReAct reasoning, enabling substantial gains in spatial reasoning without additional model training. Extensive experiments and in-depth analyses demonstrate the effectiveness of our benchmark, corpus, and agent framework. We expect these resources to serve as a solid foundation for advancing MLLMs toward human-level spatial intelligence. All data, code, and models will be released to the research community.
中文标题/摘要
标题:SpatialScore:向着全面评估空间智能的方向
现有的多模态大型语言模型(MLLMs)在空间智能方面的评估通常是碎片化的且范围有限。在本工作中,我们旨在对现代MLLM的空间理解能力进行全面评估,并提出数据驱动和基于代理的互补解决方案。具体来说,我们做出了以下贡献:(i) 我们引入了SpatialScore,据我们所知,这是迄今为止最全面和多样的多模态空间智能基准。它涵盖了多种视觉数据类型、输入模态和问答格式,并包含约5000个手动验证的样本,覆盖30个不同的任务;(ii) 使用SpatialScore,我们广泛评估了49个代表性MLLM,揭示了持续存在的挑战以及当前模型与人类水平空间智能之间的巨大差距;(iii) 为了提高模型能力,我们构建了SpatialCorpus,这是一个包含33.1万个问答样本的大规模训练资源,支持空间推理任务的微调,并显著提高了现有模型的性能(例如Qwen3-VL);(iv) 为了补充数据驱动的方法,我们开发了SpatialAgent,这是一个配备有12种专门空间感知工具的多代理系统,支持计划-执行和ReAct推理,能够在无需额外模型训练的情况下显著提高空间推理能力。广泛的实验和深入的分析证明了我们基准、语料库和代理框架的有效性。我们期望这些资源能够为MLLMs向人类水平空间智能的发展奠定坚实的基础。所有数据、代码和模型将向研究界开放。
Summary / 总结
The research aims to comprehensively evaluate the spatial understanding capabilities of modern multimodal large language models (MLLMs) by proposing SpatialScore, a new benchmark that includes diverse visual data types, input modalities, and question-answering formats. The study evaluates 49 MLLMs using this benchmark, revealing significant gaps between current models and human-level spatial intelligence. Additionally, the authors introduce SpatialCorpus, a large-scale training resource, and SpatialAgent, a multi-agent system, to enhance model performance in spatial reasoning tasks without additional training. Extensive experiments show the effectiveness of these resources in advancing MLLMs towards human-level spatial intelligence.
研究旨在通过提出SpatialScore这一新的基准,全面评估现代多模态大型语言模型(MLLMs)的空间理解能力,该基准涵盖了多种视觉数据类型、输入模态和问答格式。研究使用该基准评估了49个MLLMs,揭示了当前模型与人类水平的空间智能之间存在显著差距。此外,作者还引入了SpatialCorpus大规模训练资源和SpatialAgent多代理系统,以增强模型在空间推理任务中的性能,无需额外训练。大量实验表明,这些资源在推动MLLMs向人类水平的空间智能发展方面具有有效性。
Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
Authors: Jiaying Wu, Fanxiao Li, Zihang Fu, Min-Yen Kan, Bryan Hooi
Venue: ICLR 2026
First: 2025-05-21T13:14:32+00:00 · Latest: 2026-04-13T12:32:46+00:00
Comments: ICLR 2026
Abstract
The impact of multimodal misinformation arises not only from factual inaccuracies but also from the misleading narratives that creators deliberately embed. Interpreting such creator intent is therefore essential for multimodal misinformation detection (MMD) and effective information governance. To this end, we introduce DeceptionDecoded, a large-scale benchmark of 12,000 image-caption pairs grounded in trustworthy reference articles, created using an intent-guided simulation framework that models both the desired influence and the execution plan of news creators. The dataset captures both misleading and non-misleading cases, spanning manipulations across visual and textual modalities, and supports three intent-centric tasks: (1) misleading intent detection, (2) misleading source attribution, and (3) creator desire inference. We evaluate 14 state-of-the-art vision-language models (VLMs) and find that they struggle with intent reasoning, often relying on shallow cues such as surface-level alignment, stylistic polish, or heuristic authenticity signals. To bridge this, our framework systematically synthesizes data that enables models to learn implication-level intent reasoning. Models trained on DeceptionDecoded demonstrate strong transferability to real-world MMD, validating our framework as both a benchmark to diagnose VLM fragility and a data synthesis engine that provides high-quality, intent-focused resources for enhancing robustness in real-world multimodal misinformation governance.
中文标题/摘要
标题:透视欺骗:利用视觉语言模型揭示多模态新闻中的误导性创作者意图
多模态误导信息的影响不仅来自事实错误,还来自创作者故意嵌入的误导性叙述。因此,解读这种创作者意图对于多模态误导信息检测(MMD)和有效的信息治理至关重要。为此,我们引入了DeceptionDecoded,这是一个包含12,000个图像-描述对的大规模基准数据集,这些数据集基于可信的参考文章,使用意图引导的模拟框架来建模新闻创作者的期望影响和执行计划。该数据集捕捉了误导性和非误导性案例,涵盖了视觉和文本模态的操纵,并支持三个以意图为中心的任务:(1)误导意图检测,(2)误导来源归因,(3)创作者愿望推断。我们评估了14个最先进的视觉语言模型(VLMs),发现它们在意图推理方面存在困难,往往依赖于表面级别的对齐、风格上的润色或启发式的可信度信号。为了弥补这一不足,我们的框架系统地合成数据,使模型能够学习到意图层面的推理。在DeceptionDecoded上训练的模型在现实世界的MMD中表现出强大的迁移性,验证了我们的框架既是基准数据集,用于诊断VLM的脆弱性,又是提供高质量、意图导向资源的数据合成引擎,以增强现实世界多模态误导信息治理的鲁棒性。
Summary / 总结
This paper addresses the challenge of detecting misleading narratives in multimodal news by introducing DeceptionDecoded, a large-scale dataset of 12,000 image-caption pairs. The dataset is created using an intent-guided simulation framework to model both the desired influence and execution plan of news creators. The evaluation of 14 state-of-the-art vision-language models shows that they struggle with intent reasoning, often relying on surface-level cues. The authors propose a framework that synthesizes data to enable models to learn implication-level intent reasoning, demonstrating strong transferability to real-world multimodal misinformation detection tasks.
该论文通过引入包含12,000个图像-标题对的DeceptionDecoded数据集,旨在检测多模态新闻中的误导性叙述。该数据集使用意图引导的模拟框架创建,以建模新闻创作者的预期影响和执行计划。对14个最先进的视觉-语言模型的评估显示,它们在意图推理方面存在困难,往往依赖于表面线索。作者提出了一种框架,通过合成数据使模型能够学习意图层面的推理,展示了其在实际多模态虚假信息检测任务中的强大迁移能力。
What Do Vision-Language Models Encode for Personalized Image Aesthetics Assessment?
Authors: Koki Ryu, Hitomi Yanaka
Venue: ACL 2026
First: 2026-04-13T12:15:24+00:00 · Latest: 2026-04-13T12:15:24+00:00
Comments: To appear at ACL 2026 Findings
Abstract
Personalized image aesthetics assessment (PIAA) is an important research problem with practical real-world applications. While methods based on vision-language models (VLMs) are promising candidates for PIAA, it remains unclear whether they internally encode rich, multi-level aesthetic attributes required for effective personalization. In this paper, we first analyze the internal representations of VLMs to examine the presence and distribution of such aesthetic attributes, and then leverage them for lightweight, individual-level personalization without model fine-tuning. Our analysis reveals that VLMs encode diverse aesthetic attributes that propagate into the language decoder layers. Building on these representations, we demonstrate that simple linear models can perform PIAA effectively. We further analyze how aesthetic information is transferred across layers in different VLM architectures and across image domains. Our findings provide insights into how VLMs can be utilized for modeling subjective, individual aesthetic preferences. Our code is available at https://github.com/ynklab/vlm-latent-piaa.
中文标题/摘要
标题:视觉-语言模型编码哪些特征用于个性化图像美学评估?
个性化图像美学评估(PIAA)是一个具有实际应用价值的重要研究问题。虽然基于视觉-语言模型(VLMs)的方法是PIAA的有希望的候选者,但尚不清楚它们是否内部编码了有效个性化所需的丰富多层次美学属性。在本文中,我们首先分析VLMs的内部表示,以检查这些美学属性的存在和分布,然后利用它们进行轻量级、个体级别的个性化,无需模型微调。我们的分析表明,VLMs编码了多种多样的美学属性,这些属性传播到语言解码层。基于这些表示,我们证明简单的线性模型可以有效地进行PIAA。我们进一步分析了不同VLM架构和图像领域中美学信息如何在不同层之间传递。我们的发现为如何利用VLMs建模主观的个体美学偏好提供了见解。我们的代码可在https://github.com/ynklab/vlm-latent-piaa获取。
Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility
Authors: Mengxuan Wang, Yuxin Chen, Gang Xu, Tao He, Hongjie Jiang, Ming Li
First: 2026-02-03T11:26:05+00:00 · Latest: 2026-04-13T12:02:34+00:00
Abstract
Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model's LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.
Summary / 总结
The paper addresses the vulnerability of vision-language models (VLMs) to multimodal jailbreak attacks by proposing Risk Awareness Injection (RAI), a lightweight framework that enhances the model's ability to recognize unsafe content without significant training costs or utility degradation. RAI amplifies unsafe signals through targeted modulation of high-risk visual tokens, restoring the model's capability to detect unsafe content from visual inputs while maintaining semantic integrity for cross-modal reasoning. Experiments show that RAI effectively reduces attack success rates across various benchmarks without compromising task performance.
论文提出了一种名为Risk Awareness Injection (RAI)的轻量级框架,通过增强模型识别不安全内容的能力来应对视觉语言模型(VLMs)的多模态脱缰攻击,同时避免显著的训练成本或功能退化。RAI通过针对性地调节高风险视觉标记来放大不安全信号,恢复模型从视觉输入中检测不安全内容的能力,同时保持跨模态推理中原始标记的语义完整性。实验表明,RAI在各种基准测试中有效降低了攻击成功率,同时不损害任务性能。
Variational Visual Question Answering for Uncertainty-Aware Selective Prediction
Authors: Tobias Jan Wieczorek, Nathalie Daun, Mohammad Emtiyaz Khan, Marcus Rohrbach
Venue: Transactions on Machine Learning Research (2026)
First: 2025-05-14T17:40:22+00:00 · Latest: 2026-04-13T11:55:58+00:00
Comments: TMLR April 2026 version. 13 pages main paper, 31 pages with appendix. Updated bibliography
Abstract
Despite remarkable progress in recent years, Vision Language Models (VLMs) remain prone to overconfidence and hallucinations on tasks such as Visual Question Answering (VQA) and Visual Reasoning. Bayesian methods can potentially improve reliability by helping models predict selectively, that is, models respond only when they are sufficiently confident. Unfortunately, such approaches can be costly and ineffective for large models, and there exists little evidence to show otherwise for multimodal applications. Here, we show for the first time the effectiveness and competitive edge of variational Bayes for selective prediction in VQA. We build on recent advances in variational methods for deep learning and propose an extension called "Variational VQA". This method improves calibration and yields significant gains for selective prediction on VQA and Visual Reasoning, particularly when the error tolerance is low ($\leq 1\%$). Often, just one posterior sample yields more reliable answers than those given by models trained with AdamW. In addition, we propose a new risk-averse selector that outperforms standard sample averaging by considering the variance of predictions. Overall, we present compelling evidence that variational learning is a viable option to make large VLMs safer and more trustworthy.
Summary / 总结
The research addresses the overconfidence and hallucinations of Vision Language Models (VLMs) in tasks like VQA and visual reasoning. It introduces Variational VQA, which uses variational Bayes for selective prediction, improving calibration and performance, especially with low error tolerance. The method often provides more reliable answers with a single posterior sample compared to models trained with AdamW. Additionally, a risk-averse selector is proposed, which outperforms standard sample averaging by considering prediction variance.
研究针对视觉语言模型(VLMs)在视觉问答(VQA)和视觉推理等任务中表现出的过度自信和幻觉问题。提出了一种基于变分贝叶斯技术的方法——变分VQA,该方法提高了模型的校准和选择性预测能力。该方法在低误差容忍度下显示出显著的改进,通常只需一个后验样本就能提供比AdamW训练的模型更可靠的答案。此外,还提出了一种风险规避选择器,通过考虑预测的方差,其性能优于标准样本平均方法。
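The Variational VQA entry above mentions a risk-averse selector that considers the variance of predictions instead of only averaging posterior samples. Its exact form is not given in the abstract; the sketch below assumes a common mean-minus-spread rule, where the model answers only if the mean confidence minus `k` standard deviations clears a threshold. The function names, `k`, and `threshold` are hypothetical.

```python
from statistics import mean, pstdev


def risk_averse_score(sample_confidences, k=1.0):
    """Mean confidence across posterior samples, discounted by their spread."""
    return mean(sample_confidences) - k * pstdev(sample_confidences)


def should_answer(sample_confidences, threshold=0.8, k=1.0):
    """Answer only when the discounted confidence clears the threshold."""
    return risk_averse_score(sample_confidences, k) >= threshold


# Samples that agree are trusted; samples that disagree lead to abstention.
consistent = should_answer([0.9, 0.9, 0.9])
inconsistent = should_answer([1.0, 0.6, 1.0, 0.6])
```

Setting `k=0` recovers plain sample averaging, the baseline the abstract compares against.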
Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
Authors: Kexin Ma, Jing Xiao, Chaofeng Chen, Geyong Min, Guibo Zhu, Jinqiao Wang, Liang Liao
First: 2026-04-13T09:44:52+00:00 · Latest: 2026-04-13T09:44:52+00:00
Abstract
Token pruning has emerged as an effective approach to reduce the substantial computational overhead of Large Vision-Language Models (LVLMs) by discarding less informative visual tokens while preserving performance. However, existing methods typically rely on individual attention sources from different LVLM components, resulting in incomplete and suboptimal pruning decisions due to biased attention distributions. To address this problem, we propose DeSAP, a novel Decoupled Similarity-Aware Pruning method for precise, task-aware token pruning within the visual encoder. Specifically, DeSAP introduces a decoupled similarity to capture fine-grained cross-modal relevance between visual features and text tokens, providing explicit task-related guidance for pruning. By integrating decoupled similarity with visual saliency signals derived from visual attention, DeSAP performs token pruning under the guidance of both task-related and visual cues, enabling robust pruning even under aggressive pruning ratios. Extensive experiments across diverse benchmarks and architectures show that DeSAP consistently outperforms SOTA methods in both accuracy and efficiency. On LLaVA-1.5-7B, DeSAP achieves a 10 times FLOPs reduction and a 2.3 times prefill speedup by retaining only 11.1% of visual tokens, while maintaining 98.1% of the original performance.
中文标题/摘要
标题:大型视觉语言模型任务感知 token 裁剪中的解耦相似性
token 裁剪已成为通过丢弃不那么信息性的视觉 token 同时保持性能来减少大型视觉语言模型(LVLMs)巨大计算开销的有效方法。然而,现有方法通常依赖于来自不同 LVLM 组件的个体注意力来源,导致由于注意力分布偏差而产生不完整和次优的裁剪决策。为了解决这个问题,我们提出了一种新颖的解耦相似性感知裁剪方法 DeSAP,用于视觉编码器内的精确任务感知 token 裁剪。具体而言,DeSAP 引入了解耦相似性来捕捉视觉特征与文本 token 之间的细粒度跨模态相关性,提供明确的任务相关指导以进行裁剪。通过将解耦相似性与来自视觉注意力的视觉显著性信号结合,DeSAP 在任务相关和视觉线索的指导下进行 token 裁剪,即使在激进的裁剪比例下也能实现稳健的裁剪。在多种基准和架构上的广泛实验表明,DeSAP 在准确性和效率上均优于当前最佳方法。在 LLaVA-1.5-7B 上,DeSAP 通过保留仅 11.1% 的视觉 token 实现了 10 倍的 FLOPs 减少和 2.3 倍的预填充加速,同时保持 98.1% 的原始性能。
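The idea of pruning visual tokens under both task-related and visual cues, as described in the abstract above, can be sketched as a top-k selection over a combined score. This is a minimal illustration only: the abstract does not define how decoupled similarity is computed or how the two signals are fused, so the min-max normalization, the `alpha` mixing weight, and the function names are assumptions. The 576-token figure is the visual token count of LLaVA-1.5, so keeping 11.1% corresponds to about 64 tokens.

```python
def min_max(values):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def prune_tokens(text_sim, saliency, keep_ratio, alpha=0.5):
    """Return the sorted indices of the visual tokens to keep.

    text_sim:  per-token relevance to the text prompt (stand-in for decoupled similarity)
    saliency:  per-token visual saliency (stand-in for visual attention)
    alpha:     hypothetical weight balancing the two cues
    """
    if len(text_sim) != len(saliency):
        raise ValueError("score lists must have the same length")
    n = len(text_sim)
    k = max(1, round(n * keep_ratio))
    t, s = min_max(text_sim), min_max(saliency)
    score = [alpha * a + (1 - alpha) * b for a, b in zip(t, s)]
    top = sorted(range(n), key=lambda i: score[i], reverse=True)[:k]
    return sorted(top)  # preserve the original token order


kept = prune_tokens([0.9, 0.1, 0.5, 0.2], [0.2, 0.1, 0.9, 0.3], keep_ratio=0.5)
print(kept)
```

Returning the kept indices in their original order preserves the spatial ordering of the surviving tokens for whatever stage consumes them next.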
Seg2Change: Adapting Open-Vocabulary Semantic Segmentation Model for Remote Sensing Change Detection
Authors: You Su, Yonghong Song, Jingqi Chen, Zehan Wen
First: 2026-04-13T09:35:14+00:00 · Latest: 2026-04-13T09:35:14+00:00
Comments: 21 pages, 15 figures
Abstract
Change detection is a fundamental task in remote sensing, aiming to quantify the impacts of human activities and ecological dynamics on land-cover changes. Existing change detection methods are limited to predefined classes in training datasets, which constrains their scalability in real-world scenarios. In recent years, numerous advanced open-vocabulary semantic segmentation models have emerged for remote sensing imagery. However, there is still a lack of an effective framework for directly applying these models to open-vocabulary change detection (OVCD), a novel task that integrates vision and language to detect changes across arbitrary categories. To address these challenges, we first construct a category-agnostic change detection dataset, termed CA-CDD. Further, we design a category-agnostic change head to detect the transitions of arbitrary categories and index them to specific classes. Based on them, we propose Seg2Change, an adapter designed to adapt open-vocabulary semantic segmentation models to the change detection task. Without bells and whistles, this simple yet effective framework achieves state-of-the-art OVCD performance (+9.52 IoU on WHU-CD and +5.50 mIoU on SECOND). Our code is released at https://github.com/yogurts-sy/Seg2Change.
中文标题/摘要
标题:Seg2Change: 适应开放词汇语义分割模型的遥感变化检测
变化检测是遥感中的一个基本任务,旨在量化人类活动和生态动态对土地覆盖变化的影响。现有的变化检测方法受限于训练数据集中的预定义类别,这限制了它们在实际场景中的可扩展性。近年来,出现了许多先进的开放词汇语义分割模型用于遥感图像。然而,仍然缺乏一个有效的框架可以直接将这些模型应用于开放词汇变化检测(OVCD),这是一个将视觉和语言结合以检测任意类别变化的新任务。为了解决这些挑战,我们首先构建了一个类别无关的变化检测数据集,称为CA-CDD。进一步地,我们设计了一个类别无关的变化头来检测任意类别的转换,并将它们索引到特定类别。基于此,我们提出了Seg2Change,这是一种设计用于将开放词汇语义分割模型适应变化检测任务的适配器。没有花哨的功能,这个简单而有效的框架在WHU-CD上达到了最先进的OVCD性能(+9.52 IoU)和在SECOND上达到了+5.50 mIoU。我们的代码发布在https://github.com/yogurts-sy/Seg2Change。
Summary / 总结
The research aims to address the limitations of existing change detection methods in handling open-vocabulary scenarios. The authors propose Seg2Change, a framework that adapts open-vocabulary semantic segmentation models for change detection. They construct a category-agnostic change detection dataset and design a category-agnostic change head to detect transitions across arbitrary categories. Experimental results show that Seg2Change outperforms existing methods, achieving state-of-the-art performance on WHU-CD and SECOND datasets with improvements of +9.52 IoU and +5.50 mIoU, respectively.
研究旨在解决现有变化检测方法在处理开放词汇场景时的局限性。作者提出了Seg2Change框架,将开放词汇语义分割模型应用于变化检测任务。他们构建了一个类别无关的变化检测数据集,并设计了一个类别无关的变化头来检测跨任意类别的转换。实验结果显示,Seg2Change在WHU-CD和SECOND数据集上优于现有方法,分别实现了+9.52 IoU和+5.50 mIoU的最佳性能。
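The IoU and mIoU figures quoted above are based on the standard intersection-over-union metric. The following worked example shows that metric on flattened binary change masks. The function names and toy masks are illustrative assumptions and are not taken from the Seg2Change code.

```python
def iou(pred, target):
    """Intersection over union for two flattened binary masks (1 = changed)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention chosen here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(per_class_pairs):
    """mIoU: average the IoU over a list of (pred, target) mask pairs, one per class."""
    scores = [iou(pred, target) for pred, target in per_class_pairs]
    return sum(scores) / len(scores)


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
print(iou(pred, target))  # 2 overlapping pixels out of 4 in the union
```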
Sign Language Recognition in the Age of LLMs
Authors: Vaclav Javorek, Jakub Honzik, Ivan Gruber, Tomas Zelezny, Marek Hruz
Venue: CVPR 2026
First: 2026-04-13T09:26:16+00:00 · Latest: 2026-04-13T09:26:16+00:00
Comments: Accepted at the CVPR 2026 Workshop on Multimodal Sign Language Research (MSLR), 8 pages, 3 figures
Abstract
Recent Vision Language Models (VLMs) have demonstrated strong performance across a wide range of multimodal reasoning tasks. This raises the question of whether such general-purpose models can also address specialized visual recognition problems such as isolated sign language recognition (ISLR) without task-specific training. In this work, we investigate the capability of modern VLMs to perform ISLR in a zero-shot setting. We evaluate several open-source and proprietary VLMs on the WLASL300 benchmark. Our experiments show that, under prompt-only zero-shot inference, current open-source VLMs remain far behind classic supervised ISLR classifiers. However, follow-up experiments reveal that these models capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. All our code is publicly available on GitHub.
MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration
Authors: Jiahui Peng, He Yao, Jingwen Li, Yanzhou Su, Sibo Ju, Yujie Lu, Jin Ye, Hongchun Lu, Xue Li, Lincheng Jiang, Min Zhu, Junlong Cheng
First: 2026-04-13T08:53:36+00:00 · Latest: 2026-04-13T08:53:36+00:00
Abstract
Contrastive Language-Image Pre-training (CLIP) has demonstrated outstanding performance in global image understanding and zero-shot transfer through large-scale text-image alignment. However, the core of medical image analysis often lies in the fine-grained understanding of specific anatomical structures or lesion regions. Therefore, precisely comprehending region-of-interest (RoI) information provided by medical professionals or perception models becomes crucial. To address this need, we propose MedP-CLIP, a region-aware medical vision-language model (VLM). MedP-CLIP innovatively integrates medical prior knowledge and designs a feature-level region prompt integration mechanism, enabling it to flexibly respond to various prompt forms (e.g., points, bounding boxes, masks) while maintaining global contextual awareness when focusing on local regions. We pre-train the model on a meticulously constructed large-scale dataset (containing over 6.4 million medical images and 97.3 million region-level annotations), equipping it with cross-disease and cross-modality fine-grained spatial semantic understanding capabilities. Experiments demonstrate that MedP-CLIP significantly outperforms baseline methods in various medical tasks, including zero-shot recognition, interactive segmentation, and empowering multimodal large language models. This model provides a scalable, plug-and-play visual backbone for medical AI, combining holistic image understanding with precise regional analysis.
中文标题/摘要
标题:MedP-CLIP:具有区域感知提示集成的医学CLIP
对比语言-图像预训练(CLIP)在大规模文本-图像对齐中展示了出色的全局图像理解和零样本迁移性能。然而,医学图像分析的核心往往在于对特定解剖结构或病灶区域的精细理解。因此,准确理解医学专业人士或感知模型提供的感兴趣区域(RoI)信息变得至关重要。为解决这一需求,我们提出MedP-CLIP,一种具有区域感知的医学视觉-语言模型(VLM)。MedP-CLIP创新性地整合了医学先验知识,并设计了一种特征级区域提示集成机制,使其能够灵活应对各种提示形式(例如,点、边界框、掩码),同时在关注局部区域时保持全局上下文意识。我们使用一个精心构建的大规模数据集(包含超过640万张医学图像和9730万张区域级注释)对其进行预训练,赋予其跨疾病和跨模态的精细空间语义理解能力。实验表明,MedP-CLIP在各种医学任务中(包括零样本识别、交互式分割和增强多模态大型语言模型)显著优于基线方法。该模型为医学AI提供了一个可扩展、即插即用的视觉骨干,结合了整体图像理解和精确的区域分析。
Summary / 总结
MedP-CLIP is a region-aware medical vision-language model that integrates medical prior knowledge with a feature-level region prompt integration mechanism, enabling it to respond flexibly to region-of-interest (RoI) prompts such as points, bounding boxes, and masks while maintaining global contextual awareness. It is pre-trained on a dataset of over 6.4 million medical images and 97.3 million region-level annotations. Experiments show that MedP-CLIP outperforms baseline methods on medical tasks including zero-shot recognition, interactive segmentation, and empowering multimodal large language models.
MedP-CLIP 是一种结合了医学先验知识和特征级区域提示集成机制的区域感知医学视觉语言模型,能够在保持全局上下文的同时增强局部区域的理解。它在包含超过 640 万张医学图像和 9730 万张区域级注释的大型数据集上进行了预训练,使其在零样本识别、交互式分割和多模态大型语言模型等医学任务中表现出色。实验表明,MedP-CLIP 在这些任务中的表现优于基线方法。
Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding
Authors: Shivam Sharma, Sankalp Nagaonkar, Ashish Choithani, Ashutosh Trivedi
First: 2026-04-13T08:40:10+00:00 · Latest: 2026-04-13T08:40:10+00:00
Abstract
We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred tokens. Flash Lite offers the best balance between quality and token usage. Tight reasoning budgets cause the model to add content in the final output that it never reasoned about, a form of compression-step hallucination. Despite being different model tiers, Flash and Flash Lite produce similar thought streams, though they differ in style: Flash discusses its reasoning process, while Lite focuses on describing the scene.
中文标题/摘要
标题:思维流重要吗?评估Gemini视觉语言模型在视频场景理解中的推理
我们评估内部推理轨迹,即我们称为思维流的现象,对视觉语言模型在视频场景理解中的影响。使用来自100小时视频提取的四个配置的Google Gemini 2.5 Flash和Flash Lite场景,我们提出了三个问题:更多的思考是否会导致更好的输出,收益在哪里停止,以及这些模型实际上在思考什么?我们引入了三个评估指标。内容丰富度衡量思维流中有用的场景内容与元评论的比例。思维最终覆盖率衡量思维流如何忠实转化为最终输出。主体分析识别模型关注的主题、动作和场景。GPT-5作为独立裁判。我们发现,额外思考的质量收益迅速达到平台期,大部分改进发生在前几百个令牌中。Flash Lite在质量和令牌使用之间提供了最佳平衡。紧张的推理预算导致模型在最终输出中添加从未推理的内容,这是一种压缩步骤的幻觉。尽管是不同的模型层级,Flash和Flash Lite生成相似的思维流,但在风格上有所不同:Flash讨论其推理过程,而Lite专注于描述场景。
Precision Synthesis of Multi-Tracer PET via VLM-Modulated Rectified Flow for Stratifying Mild Cognitive Impairment
Authors: Tuo Liu, Shuijin Lin, Shaozhen Yan, Haifeng Wang, Jie Lu, Jianhua Ma, Chunfeng Lian
First: 2026-04-13T08:37:24+00:00 · Latest: 2026-04-13T08:37:24+00:00
Comments: 15 pages, 5 figures
Abstract
The biological definition of Alzheimer's disease (AD) relies on multi-modal neuroimaging, yet the clinical utility of positron emission tomography (PET) is limited by cost and radiation exposure, hindering early screening at preclinical or prodromal stages. While generative models offer a promising alternative by synthesizing PET from magnetic resonance imaging (MRI), achieving subject-specific precision remains a primary challenge. Here, we introduce DIReCT$++$, a Domain-Informed ReCTified flow model for synthesizing multi-tracer PET from MRI combined with fundamental clinical information. Our approach integrates a 3D rectified flow architecture to capture complex cross-modal and cross-tracer relationships with a domain-adapted vision-language model (BiomedCLIP) that provides text-guided, personalized generation using clinical scores and imaging knowledge. Extensive evaluations on multi-center datasets demonstrate that DIReCT$++$ not only produces synthetic PET images ($^{18}$F-AV-45 and $^{18}$F-FDG) of superior fidelity and generalizability but also accurately recapitulates disease-specific patterns. Crucially, combining these synthesized PET images with MRI enables precise personalized stratification of mild cognitive impairment (MCI), advancing a scalable, data-efficient tool for the early diagnosis and prognostic prediction of AD. The source code will be released on https://github.com/ladderlab-xjtu/DIReCT-PLUS.
Summary / 总结
The study aims to improve the clinical utility of PET imaging for early screening of Alzheimer's disease by synthesizing multi-tracer PET images from MRI and clinical information. DIReCT$++$ uses a 3D rectified flow architecture and a domain-adapted vision-language model to generate high-fidelity and generalizable PET images, which accurately reflect disease-specific patterns. The method enables precise stratification of mild cognitive impairment, advancing early diagnosis and prognostic prediction of AD.
研究旨在通过从MRI和临床信息合成多示踪剂PET图像,提高阿尔茨海默病早期筛查的临床应用。DIReCT$++$使用3D整流流架构和领域适应的视觉语言模型生成高保真度和泛化的PET图像,准确反映疾病特异性模式。该方法能够精确分层轻度认知障碍,推动AD的早期诊断和预后预测。
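The rectified flow named in the abstract above builds on a general recipe: interpolate linearly between a source sample and a target sample, regress the constant velocity along that line, and generate by integrating the learned velocity field. The toy sketch below shows only that generic recipe on two-dimensional vectors. The oracle velocity function stands in for a trained network and is an assumption; DIReCT++'s 3D architecture and BiomedCLIP conditioning are not modeled.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity model along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=10):
    """Generate by integrating dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


source, target = [0.0, 1.0], [2.0, -1.0]
oracle = lambda x, t: target_velocity(source, target)  # stand-in for a trained model
generated = euler_sample(source, oracle)
print(generated)
```

Because the oracle velocity is constant along the path, Euler integration recovers the target up to floating-point error; a trained model only approximates this.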