ArrowGEV: Grounding Events in Video via Learning the Arrow of Time
Authors: Fangxu Yu, Ziyao Lu, Liqiang Niu, Fandong Meng, Jie Zhou
Venue: ACL 2026
First: 2026-01-10T13:05:23+00:00 · Latest: 2026-04-16T17:52:47+00:00
Comments: Accepted to Findings of ACL 2026
Abstract
Grounding events in videos serves as a fundamental capability in video analysis. While Vision Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left hand). The former denote events whose reversal substantially alters their meaning, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.
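The two reward cases described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' formulation: the function names, the use of temporal IoU as the grounding score, the convention that a backward prediction of `None` means "event absent in the reversed clip", and the unweighted sum of the two terms.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def mirror(interval, duration):
    """Map an interval on the forward timeline onto the reversed video's timeline."""
    start, end = interval
    return (duration - end, duration - start)


def arrow_reward(time_sensitive, pred_fwd, pred_bwd, gt_fwd, duration):
    """Hypothetical reward: forward grounding quality plus a direction term."""
    grounding = temporal_iou(pred_fwd, gt_fwd)
    if time_sensitive:
        # Assumed convention: the model should reject the reversed clip,
        # signalled here by pred_bwd being None.
        direction = 1.0 if pred_bwd is None else 0.0
    else:
        # Time-insensitive events should be grounded at the mirrored interval.
        if pred_bwd is None:
            direction = 0.0
        else:
            direction = temporal_iou(pred_bwd, mirror(pred_fwd, duration))
    return grounding + direction
```

For a 10-second clip, a time-insensitive event predicted at (2, 6) in the forward video is fully consistent with a backward prediction of (4, 8), since reversing the clip maps second t to second 10 - t.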
Why Do Vision Language Models Struggle To Recognize Human Emotions?
Authors: Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara, Steven McDonagh
First: 2026-04-16T17:49:58+00:00 · Latest: 2026-04-16T17:49:58+00:00
Abstract
Understanding emotions is a fundamental ability for intelligent systems that interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years on many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that the inherently continuous task of dynamic facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that avoid favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.
StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
Authors: Xuanyi Liu, Deyi Ji, Chunan Yu, Qi Zhu, Xuanfu Li, Jin Ma, Tianrun Chen, Lanyun Zhu
First: 2026-04-16T17:12:10+00:00 · Latest: 2026-04-16T17:12:10+00:00
Abstract
Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
Summary
StreamCacheVGGT is a training-free framework for reconstructing dense 3D geometry from video streams under a constant memory budget. It introduces Cross-Layer Consistency-Enhanced Scoring (CLCES) to mitigate activation noise and Hybrid Cache Compression (HCC) to preserve geometric context. CLCES tracks token importance across the Transformer hierarchy, while HCC applies a three-tier triage strategy that merges moderately important tokens into retained anchors rather than simply evicting them. Experiments on five benchmarks show that StreamCacheVGGT outperforms existing methods in reconstruction accuracy and long-term stability while maintaining constant memory usage.
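As a rough illustration of the scoring-then-triage idea, the sketch below scores each token by an order statistic across layers and then keeps, merges, or evicts tokens by tier. The specifics are assumptions made for the example and are not taken from the paper: the median as the order statistic, squared Euclidean distance on key vectors, running-mean merging, and the names `cross_layer_score` and `hybrid_compress`.

```python
from statistics import median


def cross_layer_score(layer_scores):
    """Per-token median of importance scores across layers.

    layer_scores[l][t] is the (assumed) importance of token t at layer l.
    """
    n_tokens = len(layer_scores[0])
    return [median(layer[t] for layer in layer_scores) for t in range(n_tokens)]


def hybrid_compress(keys, scores, n_keep, n_merge):
    """Three-tier triage: keep top tokens, merge the middle tier, evict the rest.

    Requires n_keep >= 1 so that every merged token has an anchor to join.
    """
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    keep, merge = order[:n_keep], order[n_keep:n_keep + n_merge]
    anchors = [list(keys[i]) for i in keep]
    counts = [1] * len(anchors)
    for i in merge:
        # Nearest anchor by squared Euclidean distance (anchors may already
        # have been shifted by earlier merges).
        j = min(
            range(len(anchors)),
            key=lambda a: sum((x - y) ** 2 for x, y in zip(anchors[a], keys[i])),
        )
        counts[j] += 1
        # Running mean of the anchor and everything merged into it.
        anchors[j] = [x + (y - x) / counts[j] for x, y in zip(anchors[j], keys[i])]
    return anchors  # tokens outside both tiers are evicted
```

The cache size after compression is always `n_keep`, which is what keeps memory constant; the merge tier only changes what the retained anchors represent.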
RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography
Authors: Mélanie Roschewitz, Kenneth Styppa, Yitian Tao, Jiwoong Sohn, Jean-Benoit Delbrouck, Benjamin Gundersen, Nicolas Deperrois, Christian Bluethgen, Julia Vogt, Bjoern Menze, Farhad Nooralahzadeh, Michael Krauthammer, Michael Moor
First: 2026-04-16T17:09:30+00:00 · Latest: 2026-04-16T17:09:30+00:00
Abstract
Vision-language models (VLMs) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro-F1 and 5.4 points (19.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer to transparent and reliable AI for radiology.
Summary
RadAgent is a tool-using AI agent that generates CT reports through a stepwise, interpretable process, giving clinicians a traceable record of its reasoning. Compared with CT-Chat, it improves macro-F1 by 6.0 points and robustness under adversarial conditions by 24.7 points, and it reaches 37.0% faithfulness, a capability absent in CT-Chat.
Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference
Authors: Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao
Venue: ACL 2026
First: 2026-01-12T15:47:35+00:00 · Latest: 2026-04-16T16:46:40+00:00
Comments: ACL 2026 Findings. Source code available at https://github.com/TANIGUCHIREI/ASL
Abstract
Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received remarkable attention. Among the numerous works proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in the KV cache and prune the others, are among the most popular schemes. They primarily adopt a set of pre-defined layers at which tokens are selected. Such a design is inflexible in the sense that accuracy varies significantly across tasks and deteriorates on harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances performance across different tasks while meeting the user-specified KV budget requirement. ASL operates during the prefilling stage and can be used jointly with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. Through evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that ASL, equipped with one-shot token selection, adaptively trades inference speed for accuracy, outperforming state-of-the-art layer-wise token pruning methods on difficult tasks.
Summary
This paper addresses key-value (KV) cache reduction in large language model (LLM) inference, focusing on layer-wise token pruning. Unlike prior methods that prune tokens at fixed, pre-defined layers, the proposed ASL is a training-free method that adaptively chooses the selection layer based on the variance of token ranks ordered by attention score. ASL balances performance across tasks while meeting the user-specified KV budget and can be combined with existing KV cache reduction methods such as SnapKV. Results on the InfiniteBench, RULER, and NIAH benchmarks show that ASL outperforms state-of-the-art layer-wise token pruning methods on difficult tasks by adaptively trading inference speed for accuracy.
VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
Authors: Huawei Ji, Yuanhao Sun, Yuan Jin, Cheng Deng, Jiaxin Ding, Luoyi Fu, Xinbing Wang
First: 2026-04-16T16:21:05+00:00 · Latest: 2026-04-16T16:21:05+00:00
Abstract
Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs' hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
Summary
The research aims to optimize visual token pruning configurations in vision-language models, balancing computational efficiency against performance. The method formulates pruning as a Pareto configuration optimization problem and uses continuous relaxation with gradient-based search to identify optimal configurations. Experiments across eight visual benchmarks show that the approach closely approximates the empirical Pareto frontier obtained through grid search and generalizes across pruning methods and model architectures. The study also finds that multi-step progressive pruning captures the models' hierarchical compression structure better than single-layer pruning, yielding better accuracy-efficiency trade-offs.
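To make the budget-constrained search concrete, here is a toy sketch of the augmented Lagrangian method (method of multipliers) on a one-dimensional problem. It is not VisPCO itself: the scalar keep ratio `x`, the surrogate loss `(1 - x)^2`, the equality constraint `x = budget`, and all hyperparameters are assumptions chosen so the example stays small, and the straight-through estimator mentioned in the abstract is omitted because the toy variable is already continuous.

```python
def solve_budgeted_keep_ratio(budget, rho=10.0, lr=0.05, outer_steps=20, inner_steps=100):
    """Minimize (1 - x)^2 subject to x == budget with an augmented Lagrangian.

    L(x, lam) = (1 - x)^2 + lam * (x - budget) + (rho / 2) * (x - budget)^2
    """
    x, lam = 1.0, 0.0  # start by keeping every token, with a zero multiplier
    for _ in range(outer_steps):
        # Inner loop: gradient descent on L with the multiplier held fixed.
        for _ in range(inner_steps):
            violation = x - budget
            grad = -2.0 * (1.0 - x) + lam + rho * violation
            x -= lr * grad
        # Outer step: multiplier update driven by the remaining violation.
        lam += rho * (x - budget)
    return x, lam
```

With `budget=0.3` the iterate settles at the budget, and the multiplier approaches 1.4, which is the slope of the surrogate loss at that point, i.e. the marginal accuracy cost of tightening the budget in this toy setting.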
Agent-Aided Design for Dynamic CAD Models
Authors: Mitch Adler, Matthew Russo, Michael Cafarella
First: 2026-04-16T16:15:23+00:00 · Latest: 2026-04-16T16:15:23+00:00
Comments: 6 pages, 3 figures, to be published in CAIS'26
Abstract
In the past year, researchers have started to create agentic systems that can design real-world CAD-style objects in a training-free setting, a new variety of system that we call Agent-Aided Design. Generally speaking, these systems place an agent in a feedback loop in which it can write code, compile that code to an assembly of CAD model(s), visualize the model, and then iteratively refine its code based on visual and other feedback. Despite rapid progress, a key problem remains: none of these systems can build complex 3D assemblies with moving parts. For example, no existing system can build a piston, a pendulum, or even a pair of scissors. In order for Agent-Aided Design to make a real impact in industrial manufacturing, we need a system that is capable of generating such 3D assemblies. In this paper we present a prototype of AADvark, an agentic system designed for this task. Unlike previous state-of-the-art systems, AADvark captures the dynamic part interactions with one or more degrees-of-freedom. This design decision allows AADvark to reason directly about assemblies with moving parts and thereby to achieve cross-cutting goals, including but not limited to mechanical movements. Unfortunately, current LLMs are imperfect spatial reasoners, a problem that AADvark addresses by incorporating external constraint solver tools with a specialized visual feedback mechanism. We demonstrate that, by modifying the agent's tools (FreeCAD and the assembly solver), we are able to create a strong verification signal which enables our system to build 3D assemblies with movable parts.
Summary
The research addresses the challenge of building complex 3D assemblies with moving parts within Agent-Aided Design, a capability the authors consider crucial for industrial manufacturing and one that existing systems lack (the abstract cites pistons, pendulums, and scissors as examples). AADvark, a prototype system, captures dynamic part interactions so that it can reason about assemblies with moving parts, and it compensates for the limited spatial reasoning of current language models with external constraint solver tools and a specialized visual feedback mechanism. By modifying the agent's tools (FreeCAD and the assembly solver), the system obtains a verification signal strong enough to build 3D assemblies with movable parts.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
Authors: Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang
Venue: NeurIPS 2025
First: 2026-04-15T17:59:52+00:00 · Latest: 2026-04-16T15:48:38+00:00
Comments: Appears in the proceedings of NeurIPS 2025
Abstract
Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named question-conditioned compression (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, i.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined token-level and frame-level compression leads to an extreme compression model for long video understanding, named XComp, achieving a significantly larger compression ratio and enabling denser frame sampling. Our XComp is finetuned from VideoChat-Flash with a data-efficient supervised compression tuning stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.
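One plausible reading of the segment-wise frame selection can be sketched as follows. This is an illustrative assumption rather than the XComp implementation: the per-frame relevance scores are taken as given (in the paper they come from internal LLM attention), and the function name, the fixed segment length, and the per-segment quota are all hypothetical.

```python
def select_frames(scores, segment_len, keep_per_segment):
    """Pick the highest-scoring frames independently inside each short segment.

    Selecting per segment, instead of globally, keeps frames from every part
    of the video even when scores are inflated near the start or the end.
    """
    selected = []
    for start in range(0, len(scores), segment_len):
        segment = range(start, min(start + segment_len, len(scores)))
        top = sorted(segment, key=lambda i: scores[i], reverse=True)[:keep_per_segment]
        selected.extend(sorted(top))  # restore temporal order within the segment
    return selected
```

With scores `[0.9, 0.8, 0.7, 0.1, 0.2, 0.3]`, a global top-2 would keep only frames 0 and 1 from the opening of the video, whereas segments of length 3 with one frame each keep frames 0 and 5.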
IROSA: Interactive Robot Skill Adaptation using Natural Language
Authors: Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp, João Silvério
Venue: IEEE Robotics and Automation Letters (RA-L), 2026
First: 2026-03-04T09:54:09+00:00 · Latest: 2026-04-16T15:37:03+00:00
Comments: Accepted IEEE Robotics and Automation Letters (RA-L) journal, 8 pages, 5 figures, 3 tables, 1 listing. Code available: https://github.com/DLR-RM/IROSA
Abstract
Foundation models have demonstrated impressive capabilities across diverse domains, while imitation learning provides principled methods for robot skill adaptation from limited data. Combining these approaches holds significant promise for direct application to robotics, yet this combination has received limited attention, particularly for industrial deployment. We present a novel framework that enables open-vocabulary skill adaptation through a tool-based architecture, maintaining a protective abstraction layer between the language model and robot hardware. Our approach leverages pre-trained LLMs to select and parameterize specific tools for adapting robot skills without requiring fine-tuning or direct model-to-robot interaction. We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and obstacle avoidance while maintaining safety, transparency, and interpretability.
OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis
Authors: Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao, Qiushi Sun, Zichen Ding, Fangzhi Xu, Hang Yan, Jiajun Chen, Anh Tuan Luu, Jianbing Zhang, Lewei Lu, Dahua Lin
First: 2026-04-16T14:53:08+00:00 · Latest: 2026-04-16T14:53:08+00:00
Comments: Work in progress
Abstract
Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open-source framework that synthesizes high-quality task instructions and agent trajectories, with two key components: (1) a scalable task synthesis pipeline that constructs a global environment memory from exploration and then leverages it to generate diverse and grounded instructions, and (2) a policy-switching strategy for trajectory rollout that, by alternating between learner and expert models, captures essential error-recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine-tuned Qwen2.5-VL and Qwen3-VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open-data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at https://njucckevin.github.io/openmobile/ to bridge the data gap and facilitate broader mobile agent research.
Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID
Authors: Jiaxuan Li, Xin Wen, Zhihang Li
First: 2026-04-16T14:49:30+00:00 · Latest: 2026-04-16T14:49:30+00:00
Abstract
Any-Time Person Re-identification (AT-ReID) necessitates the robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios, ranging from short-term to long-term intervals. However, existing methods rely heavily on purely visual features, which are prone to change with environmental and temporal factors, resulting in significant performance deterioration in scenarios involving illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants and drives the semantic model. The text token is further used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results. Moreover, the model trained on AT-USTC was evaluated across 5 widely used ReID benchmarks, demonstrating superior generalization capabilities with highly competitive results. Our code will be available soon.
Summary
The research aims to improve the robustness of person re-identification (ReID) under varying conditions such as illumination and clothing changes, where purely visual features are sensitive to environmental factors. The proposed Semantic-driven Token Filtering and Expert Routing (STFER) framework uses Large Vision-Language Models to generate identity-consistent text, which guides visual token filtering and expert routing and improves performance across scenarios. Experiments show that STFER outperforms existing methods on the AT-USTC dataset and generalizes well across multiple ReID benchmarks.
DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA
Authors: Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Ser-Nam Lim, Rajiv Ramnath
First: 2025-11-27T15:00:58+00:00 · Latest: 2026-04-16T14:40:49+00:00
Abstract
Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are more efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we propose DocVAL, a validated chain-of-thought (CoT) distillation framework that transfers explicit spatial reasoning from large teacher models to compact, deployable student VLMs. DocVAL combines (1) teacher-generated spatial CoT supervision, (2) a rule-based dual-mode validator that filters low-quality training signals and provides fine-grained, pixel-level corrective feedback, and (3) a validation-driven two-stage training procedure with iterative refinement. Text detection is used only as training-time scaffolding for supervision and validation, enabling the final student to operate as a pure VLM without OCR or detection at inference. Across multiple document understanding benchmarks, DocVAL yields consistent improvements of up to 6-7 ANLS points over comparable compact VLMs. We further introduce mean Average Precision (mAP) as a localization metric for document question answering and report strong spatial grounding performance under this new evaluation. We release 95K validator-verified CoT traces and show that high-quality, validated supervision is more effective than scaling unfiltered data, enabling efficient and trustworthy document grounding. Dataset and implementation: https://github.com/ahmad-shirazi/DocVAL
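For readers unfamiliar with the metric behind the reported gains, the sketch below computes ANLS (Average Normalized Levenshtein Similarity) as it is commonly defined for document VQA benchmarks: per question, the best similarity over the reference answers, zeroed when the normalized edit distance reaches a threshold of 0.5. The lowercasing and whitespace stripping are assumptions about preprocessing, and this is not the evaluation script used by the authors.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """Mean over questions of the best thresholded similarity to any reference.

    predictions: list of answer strings
    references:  list of lists of acceptable answer strings
    """
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            # Answers that are too far from the reference score zero.
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

The function returns a value in [0, 1]; results are usually quoted on a 0-100 scale, so a gain of 6 ANLS points corresponds to 0.06 here.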
UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards
Authors: Jun Wang, Shuo Tan, Zelong Sun, Tiancheng Gu, Yongle Zhao, Ziyong Feng, Kaicheng Yang, Cewu Lu
First: 2026-04-16T13:03:32+00:00 · Latest: 2026-04-16T13:03:32+00:00
Comments: 17 pages, 11 figures
Abstract
Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
Summary
UniDoc-RL is a unified reinforcement learning framework that enhances visual Retrieval-Augmented Generation (RAG) with hierarchical actions and dense rewards. It progressively refines visual evidence from coarse to fine granularity to support better reasoning, uses a dense multi-reward scheme for end-to-end training, and aligns agent behavior with multiple objectives. Experiments on three benchmarks show that UniDoc-RL outperforms existing methods, with gains of up to 17.7% over prior RL-based approaches.
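The reason GRPO needs no separate value network is that each sampled response is scored relative to the other responses drawn for the same prompt. A minimal sketch of that normalization, together with a weighted sum standing in for the dense multi-reward scheme, is given below. The reward component names and weights are hypothetical, and the use of the population standard deviation is an assumption; implementations differ on that detail.

```python
from statistics import mean, pstdev


def combine_rewards(parts, weights):
    """Weighted sum of per-action rewards (component names are illustrative)."""
    return sum(weights[name] * value for name, value in parts.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt.

    The group mean acts as the baseline, so no learned critic is required.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every rollout in a group receives the same reward, all advantages are zero and that group contributes no gradient signal, which is one reason dense, per-action rewards can help: they make ties within a group less likely.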
Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models
Authors: Yifu Chen, Shengpeng Ji, Zhengqing Liu, Qian Chen, Wen Wang, Ziqing Wang, Yangzhuo Li, Tianle Liang, Zhou Zhao
First: 2026-04-16T12:03:50+00:00 · Latest: 2026-04-16T12:03:50+00:00
Abstract
Achieving seamless, human-like interaction remains a key challenge for full-duplex spoken dialogue models (SDMs). Reinforcement learning (RL) has substantially enhanced text- and vision-language models, and well-designed reward signals are crucial for the performance of RL. We consider RL a promising strategy to address the key challenge for SDMs. However, a fundamental barrier persists: prevailing automated metrics for assessing interaction quality rely on superficial proxies, such as behavioral statistics or timing-prediction accuracy, failing to provide reliable reward signals for RL. On the other hand, human evaluations, despite their richness, remain costly, inconsistent, and difficult to scale. We tackle this critical barrier by proposing a Dual-Axis Generative Reward Model, which is trained to understand complex interaction dynamics using a detailed taxonomy and an annotated dataset. It produces a single score and, crucially, provides separate evaluations for semantic quality and interaction timing. Such dual outputs furnish precise diagnostic feedback for SDMs and deliver a dependable, instructive reward signal suitable for online reinforcement learning. Our model achieves state-of-the-art performance on interaction-quality assessment across a wide spectrum of datasets, spanning synthetic dialogues and complex real-world interactions.
Summary / 总结
The research aims to improve the human-like interaction of full-duplex spoken dialogue models by addressing the challenge of reliable reward signals for reinforcement learning. The method involves a Dual-Axis Generative Reward Model that evaluates both semantic quality and interaction timing, providing detailed feedback for the models. Key experimental findings show that this model outperforms existing methods in assessing interaction quality across various datasets, including synthetic and real-world dialogues.
研究旨在通过解决可靠奖励信号的问题来提升全双工语音对话模型的人类交互能力。提出的Dual-Axis Generative Reward Model被训练以评估语义质量和交互时机,提供一个综合评分和每个方面的单独反馈。该模型在各种数据集上的交互质量评估中表现出色,包括合成对话和复杂的真实世界交互。
ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints
Authors: Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su, Yi-Ting Chen, Min Sun, Winston Hsu
First: 2026-04-16T11:46:30+00:00 · Latest: 2026-04-16T11:46:30+00:00
Abstract
Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.
中文标题/摘要
标题:ADAPT:在未指定操作条件下的常识性规划基准测试
智能具身代理不应仅仅遵循指令,因为现实环境往往包含意外情况和例外。然而,现有方法通常专注于直接执行指令,而不考虑目标对象是否可以实际操作,这意味着它们无法评估可用的操作条件。为了解决这一局限性,我们引入了DynAfford,这是一个基准测试,评估具身代理在动态环境中表现,其中对象的操作条件可能会随时间变化且未在指令中指定。DynAfford 要求代理感知对象状态、推断隐含的前提条件,并相应地调整其行为。为了使这一能力成为可能,我们引入了ADAPT,这是一个即插即用模块,可以将显式操作条件推理添加到现有规划器中。实验表明,将ADAPT整合可以显著提高在已见和未见环境中的鲁棒性和任务成功率。我们还展示了,作为操作条件推理后端使用的领域适应、LoRA微调的视觉语言模型优于商业LLM(GPT-4o),突显了任务对齐的操作条件接地的重要性。
Summary / 总结
The research targets embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. It introduces DynAfford, a benchmark that requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly, together with ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments show that incorporating ADAPT improves robustness and task success in both seen and unseen environments. Additionally, a domain-adapted, LoRA-finetuned vision-language model outperforms a commercial LLM (GPT-4o) as the affordance inference backend, underscoring the importance of task-aligned affordance grounding.
研究旨在提高智能实体代理在指令未指定物体功能的动态环境中的适应性。研究引入了ADAPT模块,使代理能够感知物体状态、推断隐含的前提条件并相应地调整其行为。实验表明,集成ADAPT显著提高了在已见和未见环境中的鲁棒性和任务成功率。此外,一个领域适应的、LoRA微调的视觉语言模型在功能推理方面优于商用LLM,突显了任务对齐的功能接地的重要性。
Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
Authors: Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott
First: 2026-04-16T11:28:53+00:00 · Latest: 2026-04-16T11:28:53+00:00
Abstract
Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
Summary / 总结
The study investigates reasoning dynamics in 18 vision-language models, focusing on how these models integrate visual and textual information. It finds that models often exhibit answer inertia, reinforcing early predictions rather than revising them during reasoning steps. Reasoning-trained models show stronger corrective behavior, but their gains vary with modality conditions. Misleading textual cues influence models even when visual evidence is sufficient, and this influence is not always detectable in the Chain-of-Thought (CoT): reasoning-trained models refer to the cues more explicitly, yet their longer, fluent traces can appear visually grounded while actually following the textual cues. Instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. These findings suggest that CoT provides only a partial view of modality reliance in VLMs, with implications for the transparency and safety of multimodal systems.
研究分析了18个视觉语言模型的推理动态,关注它们如何整合视觉和文本信息。研究发现,模型往往表现出答案惯性,倾向于坚持早期预测而不在此后推理步骤中进行修正。推理训练的模型显示出更强的纠正行为,但其效果在不同模态条件下有所不同。通过控制干预发现,模型即使在有足够的视觉证据时也会受到误导性文本提示的影响,并且这种影响可以在链式思考(CoT)中检测到,但不同模型之间的可检测性不同。指令调优的模型虽然不太可能明确引用提示,但在其较短的痕迹中显示出与视觉输入不一致的情况,表明CoT仅提供对不同模态如何驱动VLM决策的片面视角。
RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding
Authors: Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang, Hai Zhao
Venue: ACL 2026
First: 2026-04-16T11:23:55+00:00 · Latest: 2026-04-16T11:23:55+00:00
Comments: Accepted to Findings of ACL 2026
Abstract
Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than 2× speedup over autoregressive decoding, and outperforms prior training-free methods, offering a scalable, plug-and-play solution for efficient LLM decoding. Our source code is available at https://github.com/hkr04/RACER.
中文标题/摘要
标题:RACER:检索增强的上下文快速推测解码
大型语言模型(LLMs)中的自回归解码每次生成一个标记,导致高推理延迟。推测解码(SD)通过猜测和验证策略缓解了这一问题,但现有的无训练版本存在权衡:基于检索的草稿在没有完全匹配时会失效,而基于logits的草稿缺乏结构指导。我们提出了一种轻量级且无训练的方法——RACER(Retrieval-Augmented Contextual Rapid Speculative Decoding),该方法将检索到的精确模式与logits驱动的未来线索结合起来。这种结合提供了可靠的锚点和灵活的外推,生成更丰富的推测草稿。在Spec-Bench、HumanEval和MGSM-ZH上的实验表明,RACER一致地加速了推理,比自回归解码快2倍以上,并优于先前的无训练方法,提供了一种可扩展的、即插即用的高效LLM解码解决方案。我们的源代码可在https://github.com/hkr04/RACER 获取。
Summary / 总结
RACER is a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues to improve speculative decoding in Large Language Models (LLMs). It accelerates inference by more than 2 times compared to autoregressive decoding and outperforms prior training-free methods on Spec-Bench, HumanEval, and MGSM-ZH benchmarks.
RACER 是一种轻量级且无需训练的方法,通过结合检索到的精确模式和基于 logits 的未来线索来改进大型语言模型(LLMs)的推测解码。它将推理速度提高了超过 2 倍,并在 Spec-Bench、HumanEval 和 MGSM-ZH 基准测试中优于之前的无需训练的方法。
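Illustrative sketch (not from the paper): the guess-and-verify loop behind speculative decoding, as described in the RACER abstract above, can be shown with a toy retrieval-based drafter. This is a generic greedy-verification example, not RACER's algorithm. The helper names `retrieve_draft` and `speculative_step`, the n-gram size, and the `target_next` callable standing in for the target model are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Token = str


def retrieve_draft(context: Sequence[Token], max_len: int = 4, ngram: int = 2) -> List[Token]:
    """Hypothetical retrieval drafter: find the most recent earlier occurrence
    of the last `ngram` tokens and copy the tokens that followed it."""
    if len(context) < ngram:
        return []
    key = list(context[-ngram:])
    # Scan backwards over earlier positions, skipping the suffix itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == key:
            return list(context[start + ngram:start + ngram + max_len])
    return []  # no exact match: the draft is empty


def speculative_step(context: List[Token],
                     target_next: Callable[[Sequence[Token]], Token],
                     max_len: int = 4) -> List[Token]:
    """One guess-and-verify step with greedy verification.

    Draft tokens are accepted while they equal the target model's greedy
    choice; the first mismatch is replaced by the target's token. The result
    is identical to plain greedy decoding. A real system would verify all
    draft tokens in one batched forward pass; this toy calls the target
    once per token for clarity.
    """
    draft = retrieve_draft(context, max_len=max_len)
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # correction token from the target
            return accepted
        accepted.append(tok)
    # Draft exhausted or empty: the target still contributes one token.
    accepted.append(target_next(context + accepted))
    return accepted
```

When the draft is empty the step degrades to ordinary one-token decoding, which corresponds to the failure mode the abstract attributes to purely retrieval-based drafts.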
MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry
Authors: Meng-Xun Li, Wen-Hui Deng, Zhi-Xing Wu, Chun-Xiao Jin, Jia-Min Wu, Yue Han, James Kit Hon Tsoi, Gui-Song Xia, Cui Huang
Venue: Journal of Dental Research, p.00220345261424242 (2026)
First: 2026-04-16T10:56:54+00:00 · Latest: 2026-04-16T10:56:54+00:00
Comments: Project website: https://menxli.github.io/metadent
Abstract
Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.
Summary / 总结
MetaDent addresses the lack of fine-grained, annotated datasets for vision-language models in dentistry by creating a comprehensive resource that includes a large-scale image dataset, a semi-structured annotation framework, and benchmark suites. The dataset contains 60,669 dental images, of which a representative subset of 2,588 is annotated using a meta-labeling scheme that combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. The benchmarks, derived using Large Language Models, consist of approximately 15K Visual Question Answering pairs and an 18-class multi-label classification dataset, validated through human review and error analysis. Experimental results show that even state-of-the-art models struggle with fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. The dataset, annotations, and tools are publicly released to promote reproducible research and accelerate dental applications of vision-language systems.
MetaDent通过创建一个包括大规模图像数据集、半结构化注释框架和基准测试套件的综合资源,解决了牙科领域中缺乏细粒度标注数据集的问题。该数据集包含60,669张牙科图像,并使用结合了高一级摘要和详细描述的元标注方案对其中2,588张进行了标注。基准测试包括15K视觉问答对和一个包含18个类别的多标签分类数据集,通过人工审查和错误分析进行了验证。实验结果显示,最先进的模型在视觉问答和分类任务中取得了中等准确度,但在图像描述任务中表现出细粒度理解能力不足,生成了不一致或不完整的描述。数据集和工具已公开发布,以促进可重复研究并加速牙科应用中的视觉语言系统的发展。
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Authors: Imanol Miranda, Ander Salaberria, Eneko Agirre, Gorka Azkune
First: 2026-04-13T14:03:18+00:00 · Latest: 2026-04-16T10:51:43+00:00
Abstract
Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation may stem less from deficient representations than from the standard inference protocol based on global cosine similarity. First, through controlled diagnostic experiments, we show that explicitly enforcing fine-grained region-segment alignment at inference dramatically improves compositional performance without updating pretrained encoders. We then introduce a lightweight transformer that learns such alignments directly from frozen patch and token embeddings. Comparing against full fine-tuning and prior end-to-end compositional training methods, we find that although these approaches improve in-domain retrieval, their gains do not consistently transfer under distribution shift. In contrast, learning localized alignment over frozen representations matches full fine-tuning on in-domain retrieval while yielding substantial improvements on controlled out-of-domain compositional benchmarks. These results identify global embedding matching as a key bottleneck in dual-encoder VLMs and highlight the importance of alignment mechanisms for robust compositional generalization.
Summary / 总结
The study revisits the compositionality issue in dual-encoder VLMs like CLIP, suggesting that poor performance on compositional benchmarks is more due to the inference protocol than the representations themselves. The research shows that enforcing fine-grained region-segment alignment at inference improves compositional performance. A lightweight transformer is introduced to learn these alignments directly from frozen embeddings, outperforming full fine-tuning and other compositional training methods on out-of-domain benchmarks while maintaining in-domain retrieval performance.
研究重新审视了像CLIP这样的双编码器VLM在组成性基准上的表现问题,认为表现不佳主要是由于推理协议而非表示本身。研究显示,在推理时强制执行细粒度的区域-片段对齐可以提高组成性性能。引入了一个轻量级的变压器,直接从冻结的嵌入中学习这些对齐,优于全面微调和其他组成性训练方法在离域基准上的表现,同时保持在域内检索性能。
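Illustrative sketch (not from the paper): the contrast between global cosine similarity and fine-grained alignment in the entry above can be shown with a toy example. The local score here is a generic late-interaction score (each text token is matched to its best image patch), swapped in for illustration. It is not the paper's alignment mechanism or its learned transformer. All function names and the 2-D embeddings are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    """Average a set of vectors into one global embedding."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def global_score(patches: Sequence[Vector], tokens: Sequence[Vector]) -> float:
    """Standard dual-encoder style scoring: one pooled vector per modality."""
    return cosine(mean_pool(patches), mean_pool(tokens))


def local_alignment_score(patches: Sequence[Vector], tokens: Sequence[Vector]) -> float:
    """Late-interaction scoring: match each token to its best patch, then average."""
    return sum(max(cosine(t, p) for p in patches) for t in tokens) / len(tokens)


# Toy data: two text tokens along orthogonal directions.
tokens = [[1.0, 0.0], [0.0, 1.0]]
image_x = [[1.0, 0.0], [0.0, 1.0]]  # one patch per token direction
image_y = [[1.0, 1.0], [1.0, 1.0]]  # every patch is a blend of both directions
```

In this toy case `global_score` gives both images the same score because pooling discards which patch carries which content, while `local_alignment_score` ranks `image_x` above `image_y`.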
Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems
Authors: Haileab Yagersew
First: 2026-04-16T10:32:20+00:00 · Latest: 2026-04-16T10:32:20+00:00
Comments: 16 pages, 3 figures, Code to be released at https://github.com/xHaileab/Paza-AI
Abstract
Retail theft costs the global economy over $100 billion annually, yet existing AI-based detection systems require expensive custom model training on proprietary datasets and charge $200-500/month per store. We present Paza, a zero-shot retail theft detection framework that achieves practical concealment detection without training any model. Our approach orchestrates multiple existing models in a layered pipeline - cheap object detection and pose estimation running continuously, with an expensive vision-language model (VLM) invoked only when behavioral pre-filters trigger. A multi-signal suspicion pre-filter (requiring dwell time plus at least one behavioral signal) reduces VLM invocations by 240x compared to per-frame analysis, bounding calls to <=10/minute and enabling a single GPU to serve 10-20 stores. The architecture is model-agnostic: the VLM component accepts any OpenAI-compatible endpoint, enabling operators to swap between models such as Gemma 4, Qwen3.5-Omni, GPT-4o, or future releases without code changes - ensuring the system improves as the VLM landscape evolves. We evaluate the VLM component on the DCSASS synthesized shoplifting dataset (169 clips, controlled environment), achieving 89.5% precision and 92.8% specificity at 59.3% recall zero-shot - where the recall gap is attributable to sparse frame sampling in offline evaluation rather than VLM reasoning failures, as precision and specificity are the operationally critical metrics determining false alarm rates. We present a detailed cost model showing viability at $50-100/month per store (3-10x cheaper than commercial alternatives), and introduce a privacy-preserving design that obfuscates faces in the detection pipeline. The source code is available at https://github.com/xHaileab/Paza-AI.
Summary / 总结
The paper presents Paza, a zero-shot retail theft detection framework that uses a layered pipeline of existing models to achieve practical concealment detection without custom model training. A multi-signal suspicion pre-filter (dwell time plus at least one behavioral signal) reduces expensive VLM invocations by 240x compared to per-frame analysis, enabling a single GPU to serve 10-20 stores. The VLM component, which accepts any OpenAI-compatible endpoint, achieves 89.5% precision and 92.8% specificity at 59.3% recall on the DCSASS dataset, and the cost model shows viability at $50-100/month per store, 3-10x cheaper than commercial alternatives.
论文提出了Paza,这是一种零样本零售盗窃检测框架,通过使用现有模型的分层流水线来实现实际的藏匿检测,无需进行定制模型训练。该框架通过多信号疑虑预过滤器将昂贵的VLM调用次数减少240倍,使单个GPU能够为多个商店提供服务。VLM组件接受任何OpenAI兼容的端点,实现了89.5%的精确度和92.8%的特异性,在59.3%的召回率下,成本模型显示每店每月50-100美元,远低于商业替代方案。
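Illustrative sketch (not from the paper): the gating rule described in the Paza abstract, dwell time plus at least one behavioral signal with VLM calls bounded to 10 per minute, might look roughly like this. The class names, the individual signal names, and the 8-second dwell threshold are assumptions made for illustration. Only the overall rule and the 10-calls-per-minute cap come from the abstract.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class TrackState:
    dwell_seconds: float            # how long the tracked person has stayed in a zone
    hand_near_pocket: bool = False  # hypothetical behavioral signals
    item_left_view: bool = False
    crouching: bool = False


@dataclass
class SuspicionGate:
    min_dwell: float = 8.0          # assumed threshold, not taken from the paper
    max_calls_per_minute: int = 10  # cap stated in the abstract
    _calls: Deque[float] = field(default_factory=deque)

    def should_invoke_vlm(self, track: TrackState, now: float) -> bool:
        """Return True only if the pre-filter fires and the rate cap allows a call."""
        signals = (track.hand_near_pocket, track.item_left_view, track.crouching)
        # Multi-signal rule: enough dwell time AND at least one behavioral signal.
        if track.dwell_seconds < self.min_dwell or not any(signals):
            return False
        # Sliding window: forget calls made 60 or more seconds ago.
        while self._calls and now - self._calls[0] >= 60.0:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls_per_minute:
            return False
        self._calls.append(now)
        return True
```

The cheap detectors would update `TrackState` on every frame, and only a `True` result would trigger a request to the VLM endpoint.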
POP: Prefill-Only Pruning for Efficient Large Model Inference
Authors: Junhui He, Zhihui Fu, Jun Wang, Qingan Li
First: 2026-02-03T09:22:26+00:00 · Latest: 2026-04-16T10:22:57+00:00
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37× speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.
Summary / 总结
This paper addresses the computational challenges of deploying Large Language Models (LLMs) and Vision-Language Models (VLMs) by proposing Prefill-Only Pruning (POP), a stage-aware inference strategy. POP leverages the asymmetric roles of the prefill and decode stages to safely omit deep layers during the prefill stage while retaining the full model for the decode stage. The method introduces independent Key-Value projections and a boundary handling strategy to maintain cache integrity and ensure accuracy. Experiments show that POP achieves up to 1.37 times speedup in prefill latency with minimal performance loss.
本文提出了一种阶段感知的推理策略——Prefill-Only Pruning (POP),以解决大规模语言模型(LLMs)和视觉-语言模型(VLMs)的计算挑战。POP 利用预填充和解码阶段的不对称作用,在预填充阶段安全地省略深层层,而在敏感的解码阶段保留完整模型。该方法引入了独立的键-值投影和边界处理策略,以保持缓存的完整性并确保准确性。实验表明,POP 可以实现高达 1.37 倍的预填充延迟加速,同时保持最小的性能损失。
MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models
Authors: Anh Thai, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg
First: 2025-05-26T15:23:18+00:00 · Latest: 2026-04-16T10:15:26+00:00
Abstract
This paper introduces MEBench, a novel benchmark for evaluating mutual exclusivity (ME) bias, a cognitive phenomenon observed in children during word learning. Unlike traditional ME tasks, MEBench further incorporates spatial reasoning to create more challenging and realistic evaluation settings. To facilitate controlled experimentation, we also present a flexible and scalable data generation pipeline that supports the construction of diverse annotated scenes. We assess the performance of various vision-language models (VLMs) on this benchmark using novel evaluation metrics that capture key aspects of ME-based reasoning. We find that these VLMs exhibit weak ME bias, while showing some ability to leverage extra spatial context to resolve ambiguity in multiple novel object settings. Project page: http://mebench.github.io/.
中文标题/摘要
标题:MEBench:一种理解视觉语言模型中互斥偏见的新基准
本文介绍了MEBench,这是一种用于评估互斥(ME)偏见的新基准,ME偏见是儿童在词汇学习过程中观察到的一种认知现象。与传统的ME任务不同,MEBench进一步结合了空间推理,以创建更具挑战性和现实性的评估环境。为了便于控制实验,我们还提出了一种灵活且可扩展的数据生成管道,支持构建多样化的标注场景。我们使用新颖的评估指标来评估各种视觉语言模型(VLMs)在该基准上的性能,这些指标捕捉了ME推理的关键方面。我们发现这些VLMs表现出较弱的ME偏见,但在多个新物体设置中能够利用额外的空间上下文来解决歧义。项目页面:http://mebench.github.io/
Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems
Authors: Nishanth Madhusudhan, Vikas Yadav, Alexandre Lacoste
First: 2026-04-16T09:23:22+00:00 · Latest: 2026-04-16T09:23:22+00:00
Comments: 10 pages and 4 figures (excluding appendix)
Abstract
Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.
中文标题/摘要
标题:何时不答:评估多模态推理系统的回避能力
有效的回避(EA),识别证据不足并避免回答,对于可靠的多模态系统至关重要。然而,现有的视觉-语言模型(VLM)和多智能体系统(MAS)评估范式假设可回答性,促使模型总是回应。回避在纯文本环境中已有研究,但在多模态环境中仍被忽视;当前基准要么忽略不可回答性,要么依赖粗略的方法,无法捕捉到现实中的失败模式。我们引入了MM-AQA基准,通过沿两个轴进行转换从可回答实例构建不可回答实例:视觉模态依赖性和证据充足性。评估三个前沿的VLM,涵盖封闭源和开源模型,以及两种MAS架构的2079个样本,我们发现:(1)在标准提示下,VLMs很少回避;即使简单的置信度基线也优于此设置,(2)MAS提高了回避能力但引入了准确性和回避之间的权衡,(3)序列设计匹配或超过了迭代变体,表明瓶颈在于校准不当而非推理深度,(4)当图像或文本证据缺失时,模型会回避,但在降级或矛盾证据下尝试调和。有效的多模态回避需要回避意识训练,而不仅仅是更好的提示或更多的智能体。
Summary / 总结
The paper introduces MM-AQA, a benchmark for evaluating effective abstention in multimodal systems, addressing the underexplored area of unanswerability in vision-language models and multi-agent systems. The study finds that vision-language models rarely abstain under standard prompting, while multi-agent systems improve abstention but at the cost of accuracy. Sequential designs perform as well as iterative ones, indicating that the issue lies in miscalibration rather than reasoning depth. Models abstain when evidence is absent but attempt to reconcile with degraded or contradictory evidence, suggesting the need for abstention-aware training.
研究关注多模态系统中有效弃答的重要性,即在证据不足时识别并选择不回答以确保系统的可靠性。研究引入了MM-AQA基准,通过改变视觉模态依赖性和证据充足性将可回答实例转化为不可回答实例。对三个VLM和两个MAS架构在2079个样本上的评估显示,标准提示下VLMs很少弃答,即使简单的置信度基线也能超越这一设置,MAS虽然能提高弃答能力但会牺牲准确性。研究还发现,当证据缺失时模型会弃答,但在退化或矛盾证据下会尝试调和,表明弃答意识训练对于有效多模态系统至关重要。
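Illustrative sketch (not from the paper): the "simple confidence baseline" mentioned in the MM-AQA abstract can be read as thresholded abstention. The threshold value, the function names, and the two reported rates below are assumptions made for illustration. The benchmark's actual metrics and baseline details are not given in the abstract.

```python
from typing import List, Optional, Tuple


def answer_or_abstain(answer: str, confidence: float, threshold: float = 0.5) -> Optional[str]:
    """Return the answer if the model is confident enough, else None (abstain)."""
    return answer if confidence >= threshold else None


def abstention_rates(predictions: List[Optional[str]],
                     answerable: List[bool]) -> Tuple[float, float]:
    """Return (abstention rate on unanswerable items, abstention rate on answerable items).

    The first value should be high; the second measures over-abstention and
    should be low.
    """
    on_unanswerable = [p is None for p, a in zip(predictions, answerable) if not a]
    on_answerable = [p is None for p, a in zip(predictions, answerable) if a]
    rate_unans = sum(on_unanswerable) / len(on_unanswerable) if on_unanswerable else 0.0
    rate_ans = sum(on_answerable) / len(on_answerable) if on_answerable else 0.0
    return rate_unans, rate_ans


# Hypothetical (answer, confidence) pairs and gold answerability labels.
outputs = [("cat", 0.9), ("dog", 0.2), ("red", 0.4), ("blue", 0.8)]
labels = [True, False, True, False]
predictions = [answer_or_abstain(ans, conf) for ans, conf in outputs]
```

Raising the threshold trades more abstention on unanswerable items for more over-abstention on answerable ones, which is the kind of accuracy-abstention trade-off the abstract reports for multi-agent systems.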
AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
Authors: Peifeng Zhang, Zice Qiu, Donghua Yu, Shilei Cao, Juepeng Zheng, Yutong Lu, Haohuan Fu
Venue: ACM MM 2026
First: 2026-04-16T08:39:02+00:00 · Latest: 2026-04-16T08:39:02+00:00
Comments: 18 pages, 9 figures. Submitted to ACM MM 2026
Abstract
In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.
Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach
Authors: Zijian Zhao, Dian Jin, Zijing Zhou
First: 2025-09-26T14:07:29+00:00 · Latest: 2026-04-16T08:15:21+00:00
Abstract
Recently, Image-to-Music (I2M) generation has garnered significant attention, with potential applications in fields such as gaming, advertising, and multi-modal art creation. However, due to the ambiguous and subjective nature of I2M tasks, most end-to-end methods lack interpretability, leaving users puzzled about the generation results. Even methods based on emotion mapping face controversy, as emotion represents only a singular aspect of art. Additionally, most learning-based methods require substantial computational resources and large datasets for training, hindering accessibility for common users. To address these challenges, we propose the first Vision Language Model (VLM)-based I2M framework that offers high interpretability and low computational cost. Specifically, we utilize ABC notation to bridge the text and music modalities, enabling the VLM to generate music using natural language. We then apply multi-modal Retrieval-Augmented Generation (RAG) and self-refinement techniques to allow the VLM to produce high-quality music without external training. Furthermore, we leverage the generated motivations in text and the attention maps from the VLM to provide explanations for the generated results in both text and image modalities. To validate our method, we conduct both human studies and machine evaluations, where our method outperforms others in terms of music quality and music-image consistency, indicating promising results. Our code is available at https://github.com/RS2002/Image2Music .
中文标题/摘要
标题:零努力图像到音乐生成:一种可解释的RAG基视觉语言模型方法
近年来,图像到音乐(I2M)生成引起了广泛关注,其潜在应用领域包括游戏、广告和多模态艺术创作。然而,由于I2M任务的模糊性和主观性,大多数端到端方法缺乏可解释性,使用户对生成结果感到困惑。即使基于情绪映射的方法也存在争议,因为情绪仅代表艺术的一个方面。此外,大多数基于学习的方法需要大量的计算资源和训练数据,这阻碍了普通用户的使用。为了解决这些挑战,我们提出了第一个基于视觉语言模型(VLM)的I2M框架,该框架具有高可解释性和低计算成本。具体而言,我们利用ABC符号来连接文本和音乐模态,使VLM能够使用自然语言生成音乐。然后,我们应用多模态检索增强生成(RAG)和自我精炼技术,使VLM能够在无需外部训练的情况下生成高质量的音乐。此外,我们利用生成的动机和VLM的注意力图来在文本和图像模态中为生成结果提供解释。为了验证我们的方法,我们进行了人类研究和机器评估,结果显示我们的方法在音乐质量和音乐-图像一致性方面优于其他方法,显示出有希望的结果。我们的代码可在https://github.com/RS2002/Image2Music 获取。
Summary / 总结
This paper addresses the challenges of generating music from images by proposing a novel VLM-based approach that offers high interpretability and low computational cost. The method uses ABC notation to bridge text and music modalities, enabling the VLM to generate music through natural language. It employs multi-modal RAG and self-refinement techniques to produce high-quality music without external training. The results show that the proposed method outperforms others in terms of music quality and consistency with images, as validated by both human studies and machine evaluations.
本文提出了一种新型的基于VLM的方法,以解决从图像生成音乐的挑战,该方法具有高可解释性和低计算成本。该方法使用ABC符号将文本和音乐模态连接起来,使VLM能够通过自然语言生成音乐。它采用多模态RAG和自我精炼技术,在无需外部训练的情况下生成高质量的音乐。结果表明,该方法在音乐质量和与图像的一致性方面优于其他方法,经过人类研究和机器评估的验证。
SGA-MCTS: Decoupling Planning from Execution via Training-Free Atomic Experience Retrieval
Authors: Xin Xie, Dongyun Xue, Wuguannan Yao, Mingxiao Feng, Wengang Zhou, Xiang Qi, Houqiang Li, Peng Zhang
First: 2026-04-16T07:22:36+00:00 · Latest: 2026-04-16T07:22:36+00:00
Abstract
LLM-powered systems require complex multi-step decision-making abilities to solve real-world tasks, yet current planning approaches face a trade-off between the high latency of inference-time search and the limited generalization of supervised fine-tuning. To address this limitation, we introduce SGA-MCTS, a framework that casts LLM planning as non-parametric retrieval. Offline, we leverage Monte Carlo Tree Search (MCTS) to explore the solution space and distill high-fidelity trajectories into State-Goal-Action (SGA) atoms. These atoms are de-lexicalized primitives that abstract concrete entities into symbolic slots, preserving reusable causal logic while discarding domain-specific noise. Online, a retrieval-augmented agent employs a hybrid symbolic-semantic mechanism to fetch relevant SGAs and re-ground them into the current context as soft reasoning hints. Empirical results on complex benchmarks demonstrate that this paradigm enables frozen, open-weights models to match the performance of SOTA systems (e.g., GPT-5) without task-specific fine-tuning. By effectively amortizing the heavy computational cost of search, SGA-MCTS achieves System 2 reasoning depth at System 1 inference speeds, rendering autonomous planning both scalable and real-time feasible.
G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
Authors: Jiyoung Lim, Heejae Yang, Jee-Hyong Lee
Venue: CVPR 2026
First: 2026-04-16T07:21:21+00:00 · Latest: 2026-04-16T07:21:21+00:00
Comments: CVPR 2026 Accepted
Abstract
Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training. Our code will be available at https://github.com/maya0395/gmixer.
SAM3-I: Segment Anything with Instructions
Authors: Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Wei Ji, Qi Bi, Yongri Piao, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Huchuan Lu, Li Cheng
First: 2025-12-04T09:00:25+00:00 · Latest: 2026-04-16T07:12:40+00:00
Abstract
Segment Anything Model 3 (SAM3) advances open-vocabulary segmentation through promptable concept segmentation, enabling users to segment all instances associated with a given concept using short noun-phrase (NP) prompts. While effective for concept-level grounding, real-world interactions often involve far richer natural-language instructions that combine attributes, relations, actions, states, or implicit reasoning. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and conducts iterative mask filtering, leading to coarse representations and limited instance specificity. In this work, we present SAM3-I, an instruction-following extension of the SAM family that unifies concept-level grounding and instruction-level reasoning within a single segmentation framework. Built upon SAM3, SAM3-I introduces an instruction-aware cascaded adaptation mechanism with dedicated alignment losses that progressively aligns expressive instruction semantics with SAM3's vision-language representations, enabling direct interpretation of natural-language instructions while preserving its strong concept recall ability. To enable instruction-following learning, we introduce HMPL-Instruct, a large-scale instruction-centric dataset that systematically covers hierarchical instruction semantics and diverse target granularities. Experiments demonstrate that SAM3-I achieves appealing performance across referring and reasoning-based segmentation, showing that SAM3 can be effectively extended to follow complex natural-language instructions without sacrificing its original concept-driven strengths. Code and dataset are available at https://github.com/debby-0527/SAM3-I.
Summary
SAM3-I introduces an instruction-aware cascaded adaptation mechanism with dedicated alignment losses, enabling SAM3 to interpret natural-language instructions directly while preserving its strong concept recall. Experiments show that SAM3-I performs well on referring and reasoning-based segmentation, indicating that it can follow complex natural-language instructions without sacrificing its original strengths.
One RL to See Them All: Visual Triple Unified Reinforcement Learning
Authors: Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan
First: 2025-05-23T17:41:14+00:00 · Latest: 2026-04-16T06:56:57+00:00
Comments: Technical Report
Abstract
Reinforcement learning (RL) is becoming an important direction for post-training vision-language models (VLMs), but public training methodologies for unified multimodal RL remain much less mature, especially for heterogeneous reasoning and perception-heavy tasks. We propose V-Triune, a Visual Triple Unified Reinforcement Learning methodology for unified multimodal RL. It organizes training around three coordinated abstractions: Sample-Level Reward Routing, Verifier-Level Outcome Verification, and Source-Level Diagnostics. Within this methodology, Dynamic IoU provides localization-specific reward shaping that avoids reward ambiguity under loose thresholds and reward sparsity under strict ones. Built on V-Triune, we develop Orsta (7B, 32B), a family of models jointly trained on eight reasoning and perception tasks. Under matched budgets, unified training matches or outperforms specialist mixtures. The final Orsta models improve over their backbones on MEGA-Bench, compare favorably with strong multi-task RL-VLM baselines, and transfer these gains to a broad set of downstream benchmarks. These results show that unified RL can improve both reasoning and perception within a single VLM RL pipeline. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI/One-RL-to-See-Them-All.
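One way to read the Dynamic IoU idea is as a binary localization reward whose IoU threshold tightens as training progresses: loose early on so rewards are not sparse, strict later so they are not ambiguous. The sketch below follows that reading. The schedule values, the `progress` argument, and all function names are assumptions for illustration and are not taken from the V-Triune release.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical schedule: (training progress at which it applies, threshold).
SCHEDULE = ((0.0, 0.5), (0.3, 0.75), (0.7, 0.9))


def dynamic_threshold(progress, schedule=SCHEDULE):
    """Return the threshold of the latest stage whose start is <= progress."""
    threshold = schedule[0][1]
    for start, value in schedule:
        if progress >= start:
            threshold = value
    return threshold


def dynamic_iou_reward(pred_box, gt_box, progress):
    """Binary reward: 1.0 if IoU clears the current threshold, else 0.0."""
    return 1.0 if iou(pred_box, gt_box) >= dynamic_threshold(progress) else 0.0


# The same prediction (IoU = 0.8) is rewarded early but not late in training.
early = dynamic_iou_reward((0, 0, 10, 8), (0, 0, 10, 10), progress=0.1)
late = dynamic_iou_reward((0, 0, 10, 8), (0, 0, 10, 10), progress=0.9)
```

Under this assumed schedule, a moderately accurate box earns reward while the policy is still learning to localize, and the same box stops earning reward once the stricter threshold takes over.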
DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts
Authors: Bo Qian, Dahu Shi, Xing Wei
Venue: ICLR 2026
First: 2026-04-16T06:40:44+00:00 · Latest: 2026-04-16T06:40:44+00:00
Comments: Published as a conference paper at ICLR 2026
Abstract
Visual prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual prompt detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.
Summary
DETR-ViP is designed to improve visual prompted object detection by addressing the lack of global discriminability in visual prompts. It incorporates global prompt integration and visual-textual prompt relation distillation to enhance discriminative prompt representations and uses a selective fusion strategy for stable detection. Experiments on COCO, LVIS, ODinW, and Roboflow100 show that DETR-ViP outperforms other state-of-the-art methods in visual prompt detection. Ablation studies confirm the effectiveness of these improvements.