arXiv Paper Digest

Snapshot: 20260326_0357

MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Authors: Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal, Yutong Xie, Mohammad Yaqub, Muhammad Haris Khan

First: 2026-03-24T17:59:54+00:00 · Latest: 2026-03-24T17:59:54+00:00

Comments: 11 Pages

Abstract

Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
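The two failure signals named in the abstract, overall accuracy and anomalies hallucinated on negative controls, can be illustrated with a small scoring sketch. The function name, the label encoding (panel index, or `None` for a coherent set), and the metric names are assumptions made for illustration; the benchmark's own scoring code is not described in this digest.

```python
from typing import Dict, Optional, Sequence


def sanity_check_metrics(
    gold: Sequence[Optional[int]], pred: Sequence[Optional[int]]
) -> Dict[str, float]:
    """Score set-level sanity checks (hypothetical encoding).

    Each entry is the index of the panel flagged as inconsistent, or None
    when the whole set is judged coherent (a negative control in `gold`).
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")

    # Exact match: right panel flagged, or correctly reported as coherent.
    correct = sum(g == p for g, p in zip(gold, pred))

    # Predictions made on negative controls only.
    on_negatives = [p for g, p in zip(gold, pred) if g is None]
    false_alarms = sum(p is not None for p in on_negatives)

    return {
        "accuracy": correct / len(gold),
        "false_alarm_rate": (
            false_alarms / len(on_negatives) if on_negatives else 0.0
        ),
    }


scores = sanity_check_metrics(gold=[None, 2, None, 0], pred=[1, 2, None, 3])
```

Reporting the false-alarm rate separately keeps a model that flags every set as anomalous from looking competent on accuracy alone.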

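Title / Abstract (translated from Chinese)

Title: MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Vision Language Models (VLMs) are increasingly used for tasks such as medical report generation and visual question answering. Fluent diagnostic text, however, does not guarantee safe visual understanding. In clinical practice, interpretation starts with pre-diagnostic sanity checks that verify the input is valid to read: correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations. Existing benchmarks mostly assume this step is solved and therefore miss a key failure mode: a model can produce a plausible narrative even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets, where the model must decide whether any panel violates the expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades on larger image sets, and measured accuracy differs substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification is still unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.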

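Summary

The paper introduces MedObvious, a benchmark of 1,880 tasks that tests input validation in Vision Language Models (VLMs) for medical applications, that is, verifying that a small multi-panel image set is coherent before any diagnosis is attempted. An evaluation of 17 VLMs finds sanity checking unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades on larger image sets, and accuracy varies substantially between multiple-choice and open-ended formats. The authors argue that pre-diagnostic verification should be treated as a distinct, safety-critical capability for medical VLMs.

The work aims to expose the limitations of VLMs in medical applications through MedObvious, a benchmark for input validation. The benchmark comprises 1,880 tasks across five progressive tiers and assesses whether a model can identify inconsistencies in multi-panel image sets. Key findings are that several VLMs report anomalies on normal inputs, that performance drops as the image set grows, and that accuracy differs markedly between multiple-choice and open-ended settings, indicating that pre-diagnostic verification remains a key unsolved problem for medical VLMs.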

VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Authors: Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Yassine Ouali, Georgios Tzimiropoulos

Venue: CVPR 2026

First: 2026-03-24T17:58:17+00:00 · Latest: 2026-03-24T17:58:17+00:00

Comments: Accepted at CVPR 2026

Abstract

Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.

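Title / Abstract (translated from Chinese)

Title: VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Existing methods for improving the efficiency of Large Vision-Language Models (LVLMs) are mostly based on visual token reduction. That approach creates an information bottleneck that hurts performance, especially on tasks requiring fine-grained understanding and reasoning. In this work, we challenge the paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Rather than compressing the image, VISOR gains efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small number of strategically placed attention layers: efficient text-image cross-attention supplies general visual context, while a few well-placed, dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Building on this principle, we first train a single universal network over a range of compute budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that allocates visual computation dynamically according to per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and that it excels on challenging tasks requiring detailed visual understanding.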

Summary

VISion On Request (VISOR) aims to make Large Vision-Language Models (LVLMs) more efficient by reducing inference cost without discarding visual information. It sparsifies the interaction between image and text tokens: efficient cross-attention supplies general visual context, and a few strategically placed, dynamically selected self-attention layers refine the visual representations when needed. Experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art performance across a range of benchmarks, and that it excels on tasks requiring detailed visual understanding.

Targeting the limitations of visual token reduction for LVLM efficiency, the work proposes VISion On Request (VISOR), which gains efficiency by sparsifying the interaction between image and text tokens. VISOR dynamically selects a small number of self-attention layers to refine the visual representations, enabling them when complex reasoning is needed. Experiments show that VISOR reduces computational cost across a range of benchmarks while matching or exceeding state-of-the-art performance, particularly on tasks requiring detailed visual understanding.

AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation

Authors: Woojeong Jin, Jaeho Lee, Heeseong Shin, Seungho Jang, Junhwan Heo, Seungryong Kim

First: 2026-03-24T17:55:17+00:00 · Latest: 2026-03-24T17:55:17+00:00

Abstract

Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: a MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and a MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.

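Title / Abstract (translated from Chinese)

Title: AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation

Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: an MLLM selects keyframes and grounds the referred object within those frames, and a video segmentation model propagates the results. Although intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, which limits both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, with iterative pruning guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.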

Summary

AgentRVOS addresses the limitations of existing training-free methods for Referring Video Object Segmentation (RVOS) by combining the strengths of SAM3 and an MLLM. SAM3 generates mask tracks that cover the full spatio-temporal extent of the video; the MLLM then reasons over this object-level evidence with respect to the query, iteratively pruning candidates based on SAM3's temporal existence information. Experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods on multiple benchmarks, with consistent results across different MLLM backbones.

AgentRVOS aims to improve referring video object segmentation by addressing the limitations of existing training-free methods. It combines SAM3, for reliable spatio-temporal tracking, with an MLLM, for query-grounded reasoning, and prunes iteratively based on temporal existence information. Experimental results show that AgentRVOS outperforms other training-free methods on multiple benchmarks and performs consistently across different MLLM backbones.

UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

Authors: Jiaying Lin, Dan Xu

First: 2026-03-24T17:42:31+00:00 · Latest: 2026-03-24T17:42:31+00:00

Abstract

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.

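Title / Abstract (translated from Chinese)

Title: UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive, and heuristic frame selection. We present UniFunc3D, a unified, training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D reasons jointly and grounds task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy, which lets the model select the correct video frames adaptively and focus on high-detail interactive parts while preserving the global context needed for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods with a relative mIoU improvement of 59.9%, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.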

Summary

UniFunc3D addresses the limitations of existing functionality segmentation methods with a unified, training-free framework. It uses a multimodal large language model as an active observer that performs joint semantic, temporal, and spatial reasoning in a single forward pass. Its coarse-to-fine active spatial-temporal grounding lets the model select the correct video frames adaptively and focus on high-detail interactive parts while preserving global context. On SceneFun3D, UniFunc3D surpasses both training-free and training-based methods with a relative mIoU improvement of 59.9%, without task-specific training.

UniFunc3D is a unified, training-free framework that strengthens functionality segmentation in 3D scenes by consolidating semantic, temporal, and spatial reasoning. It uses a multimodal large language model as an active observer that reasons jointly, selecting video frames adaptively and focusing on high-detail interactive parts. On SceneFun3D it achieves a relative mIoU improvement of 59.9%, surpassing both training-free and training-based methods without any task-specific training.

POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

Authors: Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi

Venue: CVPR

First: 2025-10-01T15:15:36+00:00 · Latest: 2026-03-24T16:54:51+00:00

Comments: Accepted in MAR at CVPR Workshop (Proceedings Track)

Abstract

Video Question Answering (VQA) with Large Vision Language Models (LVLMs) has gained significant traction in research ever since the Flamingo was introduced by Deepmind. Recent advancements in large context/long video question answering have allowed VQA tasks to have context window of 1500+ frames. However, this only leads to 50 seconds of video footage without losing any significant information. We introduce POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image (via motion blur and weighted averaging variants) and then align LVLMs with lightweight supervision. Concretely, we build 1 fps input sources using Blend Blur with Last Frame, Weighted Average, Exponential and Ramp pooling and fine-tune QWEN-2.5-VL 7B with supervised two turn target including reasoning and final answer. We apply Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO) on our novel dataset ReasonVQA consisting of 12 movies with 239 human annotated question-answer with reasoning prompts. On our ReasonVQA dataset, this method dramatically improves performance over pooled baselines: F1 score improves from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Rationale quality also significantly increases. Cross-evaluation of SFT + DPO on various pooling functions show that the gains persist regardless of the pooling scheme used at train or test time, indicating strong robustness on summarization of temporal evidence. Similar observations were made on zero-shot in TVQA.
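For context on the numbers above: 1500 frames covering 50 seconds corresponds to 30 frames per second, so pooling each second into one image shrinks the input by roughly that factor. The sketch below illustrates temporal pooling by weighted averaging. The pooling names come from the abstract, but the exact weight formulas, the `decay` parameter, and the flat-list frame representation are assumptions for illustration, not the paper's definitions.

```python
from typing import List, Sequence


def ramp_weights(n: int) -> List[float]:
    """Linearly increasing weights (later frames count more), summing to 1."""
    total = n * (n + 1) / 2
    return [(i + 1) / total for i in range(n)]


def exponential_weights(n: int, decay: float = 0.5) -> List[float]:
    """Geometrically decaying weights going back in time, summing to 1."""
    raw = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def pool_second(
    frames: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Collapse the frames of one second into a single weighted-average image.

    Each frame is a flat list of pixel intensities of identical length.
    """
    if not frames or len(frames) != len(weights):
        raise ValueError("need one weight per frame")
    n_pixels = len(frames[0])
    return [
        sum(w * frame[p] for w, frame in zip(weights, frames))
        for p in range(n_pixels)
    ]


# Three tiny two-pixel "frames" standing in for one second of video.
frames = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
pooled = pool_second(frames, ramp_weights(3))
```

Weighting later frames more heavily keeps the most recent state of the scene sharp while earlier motion appears as a fainter trail, which is one plausible reading of the blur-style variants the abstract lists.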

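Title / Abstract (translated from Chinese)

Title: POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

Video Question Answering (VQA) with Large Vision Language Models (LVLMs) has attracted significant research attention since DeepMind introduced Flamingo. Recent advances in long-context, long-video question answering allow VQA tasks to use context windows of 1500+ frames, yet this covers only about 50 seconds of footage if no significant information is to be lost. We introduce POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image (via motion blur and weighted-averaging variants) and then aligns LVLMs with lightweight supervision. Concretely, we build 1 fps input sources using Blend Blur with Last Frame, Weighted Average, Exponential, and Ramp pooling, and fine-tune QWEN-2.5-VL 7B with a supervised two-turn target that includes reasoning and a final answer. We apply Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO) on our new dataset ReasonVQA, which consists of 12 movies with 239 human-annotated question-answer pairs with reasoning prompts. On ReasonVQA, the method dramatically improves performance over pooled baselines: F1 rises from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Rationale quality also increases significantly. Cross-evaluation of SFT + DPO across pooling functions shows that the gains persist regardless of the pooling scheme used at train or test time, indicating strong robustness in summarizing temporal evidence. Similar observations were made in zero-shot evaluation on TVQA.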

Summary

POVQA is a data-efficient pipeline that compresses each second of video into a single temporally pooled image and aligns a large vision language model with lightweight supervision. The 1 fps inputs are built with Blend Blur with Last Frame, Weighted Average, Exponential, and Ramp pooling, and QWEN-2.5-VL 7B is then fine-tuned with SFT and DPO. On the ReasonVQA dataset, F1 improves from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528 over pooled baselines, and the gains persist across pooling functions.

POVQA is a data-efficient method that compresses each second of video into a single image via motion blur and weighted averaging, then fine-tunes a large vision language model with lightweight supervision. On the ReasonVQA dataset it significantly improves performance: F1 rises from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. The method also improves rationale quality and remains robust across different pooling functions.

ARGENT: Adaptive Hierarchical Image-Text Representations

Authors: Chuong Huynh, Hossein Souri, Abhinav Kumar, Vitali Petsiuk, Deen Dayal Mohan, Suren Kumar

First: 2026-03-24T15:14:12+00:00 · Latest: 2026-03-24T15:14:12+00:00

Abstract

Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable, being largely retrieval-based and correlation-based metrics and prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. This paper introduces a stronger hyperbolic VLM baseline ARGENT, Adaptive hieRarchical imaGe-tExt represeNTation. ARGENT improves the SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and proposed hierarchical metrics, respectively.
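The widening described above can be seen numerically with the half-aperture formula used for entailment cones in the Poincaré ball, psi(x) = arcsin(K (1 - ||x||^2) / ||x||), from the hyperbolic entailment cone literature. This is only an illustration under stated assumptions: the paper may work in a different hyperbolic model, the constant `k` is arbitrary, and the clamp below is exactly the kind of clipping heuristic the abstract says ARGENT avoids.

```python
import math


def half_aperture(norm: float, k: float = 0.1) -> float:
    """Half-aperture (radians) of the entailment cone at a point of given norm.

    Poincare-ball form: psi = arcsin(k * (1 - norm^2) / norm).
    The argument is clamped to 1, where the cone becomes a half-space.
    """
    if not 0.0 < norm < 1.0:
        raise ValueError("norm must lie strictly inside the unit ball")
    arg = k * (1.0 - norm ** 2) / norm
    return math.asin(min(1.0, arg))


# As the parent embedding contracts toward the origin, the cone widens.
apertures = [half_aperture(r) for r in (0.9, 0.5, 0.2, 0.05)]
```

With these values the aperture grows monotonically as the norm shrinks and reaches pi/2 at the smallest norm, at which point almost any child lies inside the cone and the hierarchy is no longer constrained.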

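Title / Abstract (translated from Chinese)

Title: ARGENT: Adaptive Hierarchical Image-Text Representations

Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing a catastrophic cone collapse that destroys the intended hierarchy. In addition, hierarchical evaluation of these models remains unreliable: it relies mainly on retrieval-based and correlation-based metrics and is prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. This paper introduces a stronger hyperbolic VLM baseline, ARGENT (Adaptive hieRarchical imaGe-tExt represeNTation). ARGENT improves on the state-of-the-art hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the proposed hierarchical metrics, respectively.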

Summary

The research aims to improve how Vision-Language Models (VLMs) represent hierarchical structure using hyperbolic geometry, addressing the instability of existing entailment losses and the unreliability of hierarchical evaluation metrics. The proposed method, ARGENT, pairs an adaptive entailment loss with a norm regularizer to prevent cone collapse, and introduces an angle-based probabilistic entailment protocol (PEP) for hierarchical evaluation. ARGENT improves on the state-of-the-art hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the proposed hierarchical metrics, respectively.

The work proposes ARGENT, which uses hyperbolic geometry to capture hierarchical structure in image-text representations and so addresses a limitation of existing vision-language models (VLMs). It introduces an adaptive entailment loss and a norm regularizer to prevent cone collapse, together with an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding. ARGENT improves on the state-of-the-art hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the proposed hierarchical metrics, respectively.

GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding

Authors: Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Bo Yang

First: 2025-12-02T07:59:46+00:00 · Latest: 2026-03-24T14:32:11+00:00

Abstract

Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.

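Title / Abstract (translated from Chinese)

Title: GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding

Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding: they force a rigid sequential narrative onto scenes and fundamentally hinder the generation of structured, coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To put this into practice, we introduce GeoDiT, the first diffusion-based vision-language model tailored to the geospatial domain. Extensive experiments show that GeoDiT sets a new state of the art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.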

Summary

GeoDiT is a diffusion-based vision-language model for geospatial understanding. It addresses the limitations of autoregressive models by treating geospatial generation as a parallel refinement process, which enables a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. Experiments show that GeoDiT sets a new state of the art on benchmarks requiring structured, object-centric outputs, with significant gains in image captioning, visual grounding, and multi-object detection.

GeoDiT is a diffusion-based vision-language model for geospatial understanding that addresses the limitations of autoregressive models in generating structured, coherent outputs. By treating geospatial generation as a parallel refinement process, GeoDiT synthesizes semantic elements holistically. Experiments show that GeoDiT outperforms existing models on image captioning, visual grounding, and multi-object detection, the tasks where autoregressive models perform poorly.

TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

Authors: Junyuan Zhang, Bin Wang, Qintong Zhang, Fan Wu, Zichen Wen, Jialin Lu, Junjie Shan, Ziqi Zhao, Shuya Yang, Ziling Wang, Ziyang Miao, Huaping Zhong, Yuhang Zang, Xiaoyi Dong, Ka-Ho Chow, Conghui He

Venue: CVPR 2026

First: 2025-12-01T03:49:00+00:00 · Latest: 2026-03-24T14:30:01+00:00

Comments: Accepted by CVPR 2026

Abstract

Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) using labeled data. While VLMs have brought TR to the next level, pushing performance further demands large-scale labeled data that is costly to obtain. Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies unlabeled samples that most effectively facilitate learning and eliminates the need for human annotations through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer them correctly provides feedback to optimize the TR model. This closed-loop process allows the TR model to autonomously learn to recognize, structure, and reason over tables without labeled data. Leveraging this pipeline, we present TRivia-3B, an open-sourced, compact, and state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. Model and code are released at: https://github.com/HKU-TASR/TRivia
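The reward loop described above can be sketched in two small pieces: a question-answering reward for one recognition result, and the group-relative advantage that Group Relative Policy Optimization computes by standardizing rewards within a group of sampled outputs. The function names, exact-match scoring, and `eps` value are assumptions for illustration; the digest does not give TRivia's actual reward or sample-selection rule.

```python
import statistics
from typing import List, Sequence


def qa_reward(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions answered correctly from a recognition result."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("need one predicted answer per question")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Standardize rewards within one group of sampled outputs."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled recognitions of the same table, scored by QA accuracy.
advantages = group_relative_advantages([1.0, 0.5, 0.0])

# If every sample earns the same reward, all advantages are zero.
flat = group_relative_advantages([0.5, 0.5])
```

The second case hints at why sample choice matters in this setting: a table on which all sampled outputs score identically contributes no gradient signal, so tables that produce varied rewards are the informative ones. Whether TRivia selects samples this way is not stated in the abstract.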

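Title / Abstract (translated from Chinese)

Title: TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

Table recognition (TR) aims to transform table images into semi-structured representations such as HTML or Markdown. As a core component of document parsing, TR has long relied on supervised learning, with recent efforts dominated by fine-tuning vision-language models (VLMs) on labeled data. Although VLMs have taken TR to a new level, pushing performance further requires large-scale labeled data that is costly to obtain. As a result, while proprietary models keep advancing the performance boundary, open-source models still lag far behind; these are often trained with limited resources and, because of privacy regulations, are in practice the only viable option for many users. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. Built upon Group Relative Policy Optimization, TRivia automatically identifies the unlabeled samples that most effectively facilitate learning and removes the need for human annotation through a question-answering-based reward mechanism. An attention-guided module generates diverse questions for each table image, and the ability to interpret the recognition results and answer those questions correctly provides the feedback used to optimize the TR model. This closed-loop process lets the TR model learn autonomously to recognize, structure, and reason over tables without labeled data. Using this pipeline, we present TRivia-3B, an open-source, compact, state-of-the-art TR model that surpasses existing systems (e.g., Gemini 2.5 Pro, MinerU2.5) on three popular benchmarks. Model and code are released at: https://github.com/HKU-TASR/TRivia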

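Summary

TRivia is a self-supervised fine-tuning method for table recognition (TR) that lets pretrained vision-language models (VLMs) learn directly from unlabeled table images. Built on Group Relative Policy Optimization, it automatically identifies the unlabeled samples that most effectively facilitate learning, and an attention-guided module generates diverse questions for each table image. How well those questions can be answered from the recognition result serves as the reward that optimizes the TR model. The resulting model, TRivia-3B, surpasses existing systems on three popular benchmarks without labeled data.

TRivia is a self-supervised fine-tuning method that builds on Group Relative Policy Optimization to learn from unlabeled table images. It generates diverse questions for each table and optimizes the model through whether they are answered correctly, with no human annotation. TRivia-3B surpasses existing systems on three benchmarks without labeled data, narrowing the gap for open-source models, which privacy regulations make the only viable option for many users.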

Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance

Authors: Xinrong Chen, Xu Chu, Yingmin Qiu, Hengyuan Zhang, Jing Xiong, Shiyu Tang, Shuai Liu, Shaokang Yang, Cheng Yang, Hayden Kwok-Hay So, Ngai Wong

Venue: CVPR 2026

First: 2026-02-01T06:12:05+00:00 · Latest: 2026-03-24T13:53:53+00:00

Comments: Accepted by CVPR 2026

Abstract

Large Vision-Language Models (LVLMs) can reason from image-text inputs and perform well in various multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations. Hallucinations denote generated content that is grammatically and syntactically coherent, yet bears no match or direct relevance to visual input. To address this problem, we propose Residual Decoding (ResDec). It is a novel training-free method that uses historical information to aid decoding. The method relies on the internal implicit reasoning mechanism and token logits evolution mechanism of LVLMs to correct biases. Extensive experiments demonstrate that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. In addition to mitigating hallucinations, ResDec also performs exceptionally well on comprehensive LVLM benchmarks, highlighting its broad applicability.

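Title / Abstract (translated from Chinese)

Title: Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance

Large Vision-Language Models (LVLMs) can reason from image-text inputs and perform well on a variety of multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations, that is, generated content that is grammatically and syntactically coherent yet has no match with, or direct relevance to, the visual input. To address this problem, we propose Residual Decoding (ResDec), a novel training-free method that uses historical information to aid decoding. The method relies on the internal implicit reasoning mechanism and the token logits evolution mechanism of LVLMs to correct biases. Extensive experiments show that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. Beyond mitigating hallucinations, ResDec also performs exceptionally well on comprehensive LVLM benchmarks, highlighting its broad applicability.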

Summary

The paper addresses hallucinations in large vision-language models (LVLMs) with Residual Decoding (ResDec), a training-free method that uses historical information during decoding to correct biases from language priors. Experiments show that ResDec suppresses prior-induced hallucinations, improves visual grounding, and reduces object hallucinations, while also performing well on comprehensive LVLM benchmarks.

The paper proposes Residual Decoding (ResDec), a method that uses historical information to correct biases and thereby addresses hallucinations in large vision-language models (LVLMs). Experiments show that ResDec effectively reduces hallucinations, improves visual grounding, and lowers object hallucinations, while performing well on comprehensive LVLM benchmarks.

Gaze-Regularized VLMs for Ego-Centric Behavior Understanding

Authors: Anupam Pani, Yanchao Yang

First: 2026-03-24T13:37:28+00:00 · Latest: 2026-03-24T13:37:28+00:00

Abstract

Eye gaze, encompassing fixations and saccades, provides critical insights into human intentions and future actions. This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike existing methods that rely solely on visual data and overlook gaze information, our approach directly incorporates gaze information into the VLM architecture during training. By generating gaze-based queries, the model dynamically focuses on gaze-highlighted regions, while a gaze-regularization mechanism ensures the alignment of model attention with human attention patterns. To better understand how gaze can be effectively integrated into VLMs, we conducted extensive experiments exploring various strategies for incorporating gaze data. These innovations enable the prediction of future events with detailed action descriptions. Experimental results demonstrate a nearly 13 % improvement in semantic scores compared to baseline models not leveraging gaze data, highlighting the effectiveness of our approach. This work establishes a foundation for leveraging the human gaze in VLMs, significantly boosting their predictive capabilities in applications requiring accurate and robust future event prediction.
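One common way to align a model's attention with a human attention map is to penalize the divergence between the two distributions. The sketch below uses a KL divergence over image regions; this choice of loss, the function names, and the `eps` smoothing are assumptions for illustration, since the abstract does not specify the form of the gaze-regularization term.

```python
import math
from typing import List, Sequence


def normalize(values: Sequence[float]) -> List[float]:
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return [v / total for v in values]


def gaze_kl(
    gaze_heat: Sequence[float], attention: Sequence[float], eps: float = 1e-8
) -> float:
    """KL(gaze || attention) over image regions; 0 when the two agree."""
    g = normalize(gaze_heat)
    a = normalize(attention)
    return sum(gi * math.log((gi + eps) / (ai + eps)) for gi, ai in zip(g, a))


# Gaze rests entirely on region 0; the model splits attention evenly.
penalty = gaze_kl([1.0, 0.0], [0.5, 0.5])
aligned = gaze_kl([1.0, 1.0], [2.0, 2.0])
```

In a training loop such a term would be added to the task loss with a weighting coefficient, so that attention is nudged toward gazed regions without being forced to copy them.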

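Title / Abstract (translated from Chinese)

Title: Gaze-Regularized VLMs for Ego-Centric Behavior Understanding

Eye gaze, encompassing fixations and saccades, provides critical insight into human intentions and future actions. This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike existing methods that rely solely on visual data and overlook gaze information, our approach incorporates gaze information directly into the VLM architecture during training. By generating gaze-based queries, the model dynamically focuses on gaze-highlighted regions, while a gaze-regularization mechanism keeps the model's attention aligned with human attention patterns. To better understand how gaze can be integrated effectively into VLMs, we ran extensive experiments exploring various strategies for incorporating gaze data. These innovations enable the prediction of future events with detailed action descriptions. Experimental results show an improvement of nearly 13% in semantic scores over baseline models that do not use gaze data, highlighting the effectiveness of our approach. This work lays a foundation for leveraging human gaze in VLMs, significantly boosting their predictive capabilities in applications that require accurate and robust future event prediction.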

Summary

This study introduces a gaze-regularized framework that enhances Vision Language Models (VLMs) for egocentric behavior understanding. Unlike previous methods that rely solely on visual data, it incorporates gaze information during training, using gaze-based queries to focus on gaze-highlighted regions and a regularization mechanism to align model attention with human attention. Experiments show an improvement of nearly 13% in semantic scores over baselines that do not use gaze data, supporting the value of gaze for future event prediction with detailed action descriptions.

The study proposes a gaze-regularized framework to strengthen the understanding of egocentric behavior in Vision Language Models (VLMs). By introducing gaze information during training, the model focuses on gaze-highlighted regions and aligns with human attention patterns. Extensive experiments show an improvement of nearly 13% in semantic scores over baseline models that do not use gaze data, demonstrating the effectiveness of integrating gaze data into VLMs.

ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting

Authors: Yeonkyung Lee, Dayun Ju, Youngmin Kim, Seil Kang, Seong Jae Hwang

First: 2026-03-24T13:32:52+00:00 · Latest: 2026-03-24T13:32:52+00:00

Comments: accepted to CVPR2026

Abstract

Recent advancements in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at minimizing redundancy, these methods often cause notable performance drops on tasks requiring temporal reasoning. Unlike humans, who can infer event progression from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis reveals that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM leverages frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of frames.
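The idea of using frame indices as dictionary-like keys can be illustrated with a plain mapping from keywords to the ordinal labels of sparsely sampled frames. Everything below is a hypothetical sketch: the uniform sampling rule, the function names, and the assumption that per-frame keywords are already available are illustrative and are not taken from the paper.

```python
from typing import Dict, List, Set


def sample_frames(num_frames: int, keep_ratio: float = 0.2) -> List[int]:
    """Uniformly sample a sparse subset of frame indices."""
    k = max(1, round(num_frames * keep_ratio))
    step = num_frames / k
    return [int(i * step) for i in range(k)]


def ordinal_labels(indices: List[int]) -> Dict[int, int]:
    """Map the ordinal drawn on each kept frame (1, 2, ...) to its index."""
    return {pos + 1: idx for pos, idx in enumerate(indices)}


def keyword_frame_map(frame_keywords: Dict[int, Set[str]]) -> Dict[str, List[int]]:
    """Invert per-frame keywords into keyword -> ordinal labels, in order."""
    mapping: Dict[str, List[int]] = {}
    for ordinal in sorted(frame_keywords):
        for keyword in sorted(frame_keywords[ordinal]):
            mapping.setdefault(keyword, []).append(ordinal)
    return mapping


kept = sample_frames(10, keep_ratio=0.2)
labels = ordinal_labels(kept)
anchors = keyword_frame_map({1: {"door"}, 2: {"door", "cup"}})
```

A question mentioning "cup" could then be answered with an explicit pointer to frame 2, giving the model a temporal anchor even though the intermediate frames were dropped.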

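Title / Abstract (translated from Chinese)

Title: ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting

Recent advances in Video Large Language Models (VideoLLMs) have enabled strong performance across diverse multimodal video tasks. To reduce the high computational cost of processing dense video frames, efficiency-oriented methods such as frame selection have been widely adopted. While effective at reducing redundancy, these methods often cause notable performance drops on tasks that require temporal reasoning. Unlike humans, who can infer how events progress from sparse visual cues, VideoLLMs frequently misinterpret temporal relations when intermediate frames are omitted. To address this limitation, we explore visual prompting (VP) as a lightweight yet effective way to enhance temporal understanding in VideoLLMs. Our analysis shows that simply annotating each frame with explicit ordinal information helps the model perceive temporal continuity. This visual cue also supports frame-level referencing and mitigates positional ambiguity within a sparsely sampled sequence. Building on these insights, we introduce ViKey, a training-free framework that combines VP with a lightweight Keyword-Frame Mapping (KFM) module. KFM uses frame indices as dictionary-like keys to link textual cues to the most relevant frames, providing explicit temporal anchors during inference. Despite its simplicity, our approach substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of the frames.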

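Summary

The research aims to improve temporal understanding in Video Large Language Models (VideoLLMs) by addressing the performance drop that frame selection causes on temporal reasoning tasks. The authors propose ViKey, a training-free framework that combines visual prompting with a lightweight Keyword-Frame Mapping (KFM) module to provide explicit temporal anchors. It substantially improves temporal reasoning and, on some datasets, preserves dense-frame baseline performance with as few as 20% of the frames.

The work aims to strengthen temporal understanding in Video Large Language Models (VideoLLMs) by addressing the performance drop caused by frame selection. The authors propose ViKey, a training-free framework that combines visual prompting with a lightweight Keyword-Frame Mapping (KFM) module to provide temporal anchors, improving temporal reasoning and, on some datasets, matching dense-frame baseline performance with as few as 20% of the frames.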

Conformal Cross-Modal Active Learning

Authors: Huy Hoang Nguyen, Cédric Jung, Shirin Salehi, Tobias Glück, Anke Schmeink, Andreas Kugi

First: 2026-03-24T12:59:47+00:00 · Latest: 2026-03-24T12:59:47+00:00

Comments: 20 pages, 14 figures

Abstract

Foundation models for vision have transformed visual recognition with powerful pretrained representations and strong zero-shot capabilities, yet their potential for data-efficient learning remains largely untapped. Active Learning (AL) aims to minimize annotation costs by strategically selecting the most informative samples for labeling, but existing methods largely overlook the rich multimodal knowledge embedded in modern vision-language models (VLMs). We introduce Conformal Cross-Modal Acquisition (CCMA), a novel AL framework that bridges vision and language modalities through a teacher-student architecture. CCMA employs a pretrained VLM as a teacher to provide semantically grounded uncertainty estimates, conformally calibrated to guide sample selection for a vision-only student model. By integrating multimodal conformal scoring with diversity-aware selection strategies, CCMA achieves superior data efficiency across multiple benchmarks. Our approach consistently outperforms state-of-the-art AL baselines, demonstrating clear advantages over methods relying solely on uncertainty or diversity metrics.

中文标题/摘要

标题：共形跨模态主动学习

视觉领域的基础模型通过强大的预训练表示和强大的零样本能力已经彻底改变了视觉识别，但它们在数据高效学习方面的潜力尚未充分利用。主动学习（AL）旨在通过战略性地选择最具信息量的样本进行标注来最小化标注成本，但现有方法大多忽视了现代视觉语言模型（VLMs）中嵌入的丰富多模态知识。我们提出了共形跨模态获取（Conformal Cross-Modal Acquisition，CCMA），这是一种新颖的AL框架，通过教师-学生架构将视觉和语言模态联系起来。CCMA 使用一个预训练的 VLM 作为教师，提供具有语义依据的不确定性估计，并经过共形校准来指导纯视觉学生模型的样本选择。通过将多模态共形评分与多样性感知选择策略相结合，CCMA 在多个基准测试中实现了更高的数据效率。我们的方法始终优于最先进的AL基线，相比仅依赖不确定性或多样性指标的方法展现出明显优势。

Summary / 总结

The research aims to enhance data efficiency in visual recognition by leveraging multimodal knowledge in vision-language models for active learning. The method, Conformal Cross-Modal Acquisition (CCMA), uses a pretrained vision-language model as a teacher to provide calibrated uncertainty estimates, which guide the selection of samples for a vision-only student model. This approach integrates multimodal conformal scoring with diversity-aware selection strategies, leading to superior performance across multiple benchmarks compared to existing active learning baselines.

研究旨在通过利用视觉语言模型中的多模态知识进行主动学习，提高视觉识别的数据效率。所提方法Conformal Cross-Modal Acquisition (CCMA) 使用预训练的视觉语言模型作为教师，提供经过校准的不确定性估计，以指导纯视觉学生模型的样本选择。该方法将多模态共形评分与多样性感知选择策略相结合，在多个基准测试中优于现有主动学习基线。
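
As a rough illustration of conformally calibrated uncertainty guiding sample selection, the sketch below uses standard split conformal prediction: a threshold is calibrated on labeled data, and unlabeled samples with the largest prediction sets are treated as the most uncertain. This is not the CCMA algorithm; the abstract does not specify its scoring function or diversity step, so the function names, the `1 - p(true class)` score, and the top-`budget` rule are all assumptions made for illustration.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold on the score 1 - p(true class).

    cal_probs: (n, C) teacher probabilities on a labeled calibration set.
    cal_labels: (n,) integer class labels.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level, clipped to 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_set_sizes(pool_probs, qhat):
    """Count the classes whose score 1 - p falls within the threshold."""
    return (1.0 - pool_probs <= qhat).sum(axis=1)

def select_for_labeling(pool_probs, qhat, budget):
    """Pick the `budget` pool samples with the largest prediction sets
    (hypothetical acquisition rule; ties keep the original order)."""
    sizes = prediction_set_sizes(pool_probs, qhat)
    order = np.argsort(-sizes, kind="stable")
    return order[:budget]
```

A larger prediction set means the teacher cannot rule out many classes at the chosen coverage level, which is one common way to turn conformal calibration into an acquisition signal.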

SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer

Authors: Tong Shao, Yusen Fu, Guoying Sun, Jingde Kong, Zhuotao Tian, Jingyong Su

Venue: CVPR 2026

First: 2026-03-07T06:33:07+00:00 · Latest: 2026-03-24T12:54:59+00:00

Comments: 23 pages, CVPR 2026 accepted

Abs · PDF · Code1 · Code2 · Code3

Abstract

Diffusion Transformers have become a dominant paradigm in visual generation, yet their low inference efficiency remains a key bottleneck hindering further advancement. Among common training-free techniques, caching offers high acceleration efficiency but often compromises fidelity, whereas pruning shows the opposite trade-off. Integrating caching with pruning achieves a balance between acceleration and generation quality. However, existing methods typically employ fixed and heuristic schemes to configure caching and pruning strategies. While they roughly follow the overall sensitivity trend of generation models to acceleration, they fail to capture fine-grained and complex variations, inevitably skipping highly sensitive computations and leading to quality degradation. Furthermore, such manually designed strategies exhibit poor generalization. To address these issues, we propose SODA, a Sensitivity-Oriented Dynamic Acceleration method that adaptively performs caching and pruning based on fine-grained sensitivity. SODA builds an offline sensitivity error modeling framework across timesteps, layers, and modules to capture the sensitivity to different acceleration operations. The cache intervals are optimized via dynamic programming with sensitivity error as the cost function, minimizing the impact of caching on model sensitivity. During pruning and cache reuse, SODA adaptively determines the pruning timing and rate to preserve computations of highly sensitive tokens, significantly enhancing generation fidelity. Extensive experiments on DiT-XL/2, PixArt-$α$, and OpenSora demonstrate that SODA achieves state-of-the-art generation fidelity under controllable acceleration ratios. Our code is released publicly at: https://github.com/leaves162/SODA.

中文标题/摘要

标题：SODA：面向灵敏度的动态加速方法用于扩散变换器

扩散变换器已成为视觉生成的主要范式，但其低推理效率仍然是进一步发展的关键瓶颈。在常见的无训练技术中，缓存提供了高效的加速，但往往牺牲了保真度，而剪枝则相反。将缓存与剪枝结合可以平衡加速和生成质量。然而，现有方法通常采用固定和启发式的方案来配置缓存和剪枝策略。虽然它们大致遵循生成模型对加速的整体灵敏度趋势，但无法捕捉到细微和复杂的差异，不可避免地跳过了高度灵敏的计算，导致质量下降。此外，这些手动设计的策略表现出较差的泛化能力。为了解决这些问题，我们提出了一种面向灵敏度的动态加速方法SODA，该方法基于细粒度的灵敏度自适应地执行缓存和剪枝。SODA构建了一个跨时间步、层和模块的离线灵敏度误差建模框架，以捕捉不同加速操作的灵敏度。通过使用灵敏度误差作为成本函数的动态规划优化缓存间隔，最小化缓存对模型灵敏度的影响。在剪枝和缓存重用期间，SODA自适应地确定剪枝时机和速率，以保留高度灵敏的标记的计算，显著提高生成保真度。在DiT-XL/2、PixArt-$α$和OpenSora上的广泛实验表明，SODA在可控加速比下实现了最先进的生成保真度。我们的代码已公开发布在：https://github.com/leaves162/SODA。

Summary / 总结

SODA is a method that enhances the efficiency of Diffusion Transformers by adaptively applying caching and pruning based on fine-grained sensitivity. It models sensitivity errors across timesteps, layers, and modules to optimize cache intervals and pruning timing, thereby balancing acceleration and generation quality. Experiments show that SODA achieves state-of-the-art generation fidelity under controllable acceleration ratios.

SODA 是一种针对扩散变换器的灵敏度导向动态加速方法，旨在提高推理效率同时保持生成质量。它通过离线的灵敏度误差建模框架捕捉不同加速操作的灵敏度，并使用动态规划优化缓存间隔。在剪枝和缓存重用过程中，SODA 会自适应地确定剪枝时机和速率，以保留高度灵敏的计算，显著提升生成保真度。实验表明，SODA 在可控加速比下实现了最先进的生成保真度。

Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models

Authors: Massimiliano Pappa, Luca Romani, Valentino Sacco, Alessio Palma, Stéphane Lathuilière, Fabio Galasso, Xavier Alameda-Pineda, Indro Spinelli

First: 2026-03-24T12:49:25+00:00 · Latest: 2026-03-24T12:49:25+00:00

Abs · PDF · Code1 · Code2

Abstract

Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy's latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from "simulate-then-act" to "describe-then-act." DILLO is trained via cross-modal distillation, where a privileged Vision Language Model teacher annotates offline trajectories and a latent-conditioned Large Language Model student learns to predict semantic outcomes. This creates a text-only inference path, bypassing heavy visual generation entirely, achieving a 14x speedup over baselines. Experiments on MetaWorld and LIBERO demonstrate that DILLO produces high-fidelity descriptions of the next state and is able to steer the policy, improving episode success rate by up to 15 pp and 9.3 pp on average across tasks.

中文标题/摘要

标题：描述-行动：通过提炼的语言-行动世界模型的主动代理引导

部署安全关键型智能体需要在执行动作之前预测其后果。虽然世界模型为这种主动预见提供了范式，但当前依赖于视觉模拟的方法往往会导致不可接受的延迟，通常每步超过几秒钟。在本文中，我们挑战视觉处理对于防止失败是必要的这一假设。我们展示了训练策略的潜在状态与计划动作结合后，已经包含了足够的信息来预测动作结果，使得视觉模拟对于防止失败变得多余。为此，我们引入了DILLO（蒸馏的语言-行动世界模型），这是一种快速的引导层，将范式从“模拟-行动”转变为“描述-行动”。DILLO通过跨模态蒸馏训练，其中特权视觉语言模型教师标注离线轨迹，而以潜在状态为条件的大型语言模型学生学习预测语义结果。这创建了一条仅基于文本的推理路径，完全绕过了沉重的视觉生成，实现了比基线快14倍的速度。在MetaWorld和LIBERO上的实验表明，DILLO能够生成高保真的下一个状态描述，并能够引导策略，使回合成功率最高提升15个百分点，跨任务平均提升9.3个百分点。

Summary / 总结

This work addresses the need for proactive foresight in safety-critical agents by proposing DILLO, a fast steering layer that uses a latent state and planned actions to predict action outcomes, eliminating the need for visual simulation. Experiments show DILLO can achieve a 14x speedup over baselines and improve episode success rates by up to 15 percentage points on various tasks.

该研究通过提出DILLO，一种使用潜状态和计划动作来预测行动结果而不进行视觉模拟的快速转向层，解决了安全关键型代理部署中的前瞻预见性挑战。该方法通过跨模态蒸馏训练，实现了比基线快14倍的速度，并在MetaWorld和LIBERO任务中将每集成功率提高了最多15个百分点。

HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning

Authors: Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku

First: 2025-11-25T17:19:47+00:00 · Latest: 2026-03-24T12:04:54+00:00

Comments: Previously this version appeared as arXiv:2603.15253 which was submitted as a new work by accident

Abs · PDF · Code1 · Code2 · Project1

Abstract

Hallucination detection in captions (HalDec) assesses a vision-language model's ability to correctly align image content with text by identifying errors in captions that misrepresent the image. Beyond evaluation, effective hallucination detection is also essential for curating high-quality image-caption pairs used to train VLMs. However, the generalizability of VLMs as hallucination detectors across different captioning models and hallucination types remains unclear due to the lack of a comprehensive benchmark. In this work, we introduce HalDec-Bench, a benchmark designed to evaluate hallucination detectors in a principled and interpretable manner. HalDec-Bench contains captions generated by diverse VLMs together with human annotations indicating the presence of hallucinations, detailed hallucination-type categories, and segment-level labels. The benchmark provides tasks with a wide range of difficulty levels and reveals performance differences across models that are not visible in existing multimodal reasoning or alignment benchmarks. Our analysis further uncovers two key findings. First, detectors tend to recognize sentences appearing at the beginning of a response as correct, regardless of their actual correctness. Second, our experiments suggest that dataset noise can be substantially reduced by using strong VLMs as filters while employing recent VLMs as caption generators. Our project page is available at https://dahlian00.github.io/HalDec-Bench-Page/.

中文标题/摘要

标题：HalDec-Bench：图像描述中的幻觉检测基准测试

幻觉检测（HalDec）评估视觉-语言模型正确对齐图像内容与文本的能力，通过识别错误描述图像的描述错误。除了评估之外，有效的幻觉检测对于收集用于训练VLM的高质量图像-描述对也至关重要。然而，由于缺乏全面的基准测试，VLM作为幻觉检测器在不同描述模型和幻觉类型之间的通用性仍然不清楚。在此项工作中，我们引入了HalDec-Bench，这是一种旨在以原则性和可解释性的方式评估幻觉检测器的基准测试。HalDec-Bench包含由多种VLM生成的描述，以及人类注释表明幻觉的存在，详细的幻觉类型分类，以及段落级别的标签。基准测试提供了不同难度级别的任务，并揭示了现有跨模态推理或对齐基准中不可见的模型性能差异。我们的分析进一步揭示了两个关键发现。首先，检测器倾向于将响应开头的句子识别为正确的，无论其实际正确性如何。其次，我们的实验表明，通过使用强大的VLM作为过滤器并采用最近的VLM作为描述生成器，可以显著减少数据集噪声。我们的项目页面可在https://dahlian00.github.io/HalDec-Bench-Page/上找到。

Summary / 总结

The research aims to evaluate the effectiveness of hallucination detectors in image captioning by introducing HalDec-Bench, a comprehensive benchmark. The method involves using diverse vision-language models to generate captions and human annotations to identify hallucinations. Key findings include that detectors often incorrectly label sentences at the beginning of responses as correct and that using strong VLMs as filters can reduce dataset noise when employing recent VLMs as caption generators.

研究旨在通过引入HalDec-Bench这一综合基准来评估图像字幕中幻觉检测器的有效性。方法包括从多种视觉语言模型收集字幕，并通过人类注释来识别幻觉，同时包含详细的类别和段落级别标签。主要发现包括，检测器往往错误地验证响应开头的句子，并且使用强大的视觉语言模型作为过滤器，可以减少使用最新视觉语言模型生成字幕时的数据集噪声。

WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

Authors: Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang

Venue: CVPR 2026

First: 2026-02-26T14:11:10+00:00 · Latest: 2026-03-24T11:51:05+00:00

Comments: Accepted to CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.

中文标题/摘要

标题：WISER：更广泛的搜索、更深的思考和自适应融合的无训练零样本组合图像检索

零样本组合图像检索（ZS-CIR）旨在根据包含参考图像和修改文本的多模态查询检索目标图像，无需使用标注三元组进行训练。现有方法通常将多模态查询转换为单一模态——要么作为文本到图像检索（T2I）的编辑文本，要么作为图像到图像检索（I2I）的编辑图像。然而，每种范式都有其固有的局限性：T2I往往丢失了细微的视觉细节，而I2I则难以处理复杂的语义修改。为了在各种查询意图下有效利用它们的互补优势，我们提出了一种无训练框架WISER，通过“检索-验证-精炼”管道统一T2I和I2I，明确建模意图意识和不确定性意识。具体而言，WISER首先通过生成编辑后的文本和图像进行并行检索，以扩大候选池进行更广泛的搜索。然后，它通过验证器进行自适应融合，评估检索置信度，对不确定的检索结果触发精炼，并动态融合双路径以获得可靠的检索结果。对于不确定的检索结果，WISER通过结构化的自我反思生成精炼建议，以指导下一轮检索向更深的思考迈进。广泛的实验表明，WISER在多个基准测试中显著优于先前的方法，在CIRCO（mAP@5）上相对提高了45%，在CIRR（Recall@1）上相对提高了57%。值得注意的是，它甚至超越了许多依赖训练的方法，突显了其在各种场景下的优越性和泛化能力。代码将在https://github.com/Physicsmile/WISER/发布。

HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning

Authors: Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku

First: 2026-03-16T13:21:55+00:00 · Latest: 2026-03-24T11:48:00+00:00

Comments: This work was intended as a replacement of arXiv:2511.20515 and any subsequent updates will appear there

Abs · PDF · Code1 · Code2 · Project1

Abstract

中文标题/摘要

标题：HalDec-Bench：图像描述中的幻觉检测基准测试

幻觉检测（HalDec）评估视觉-语言模型正确对齐图像内容与文本的能力，通过识别错误描述图像的错误标题。除了评估之外，有效的幻觉检测对于收集用于训练VLM的高质量图像-描述对也至关重要。然而，由于缺乏全面的基准测试，VLM作为幻觉检测器在不同描述模型和幻觉类型之间的通用性仍然不清楚。在此项工作中，我们引入了HalDec-Bench，这是一种旨在以原则性和可解释性的方式评估幻觉检测器的基准测试。HalDec-Bench包含由多种VLM生成的描述，以及人类注释表明幻觉的存在，详细的幻觉类型分类，以及段落级别的标签。基准测试提供了不同难度级别的任务，并揭示了现有跨模态推理或对齐基准中不可见的模型性能差异。我们的分析进一步揭示了两个关键发现。首先，检测器倾向于将响应开头的句子识别为正确的，无论其实际正确性如何。其次，我们的实验表明，通过使用强大的VLM作为过滤器并采用最近的VLM作为描述生成器，可以显著减少数据集噪声。我们的项目页面可在https://dahlian00.github.io/HalDec-Bench-Page/上找到。

Summary / 总结

The research aims to evaluate the ability of hallucination detectors in image captioning to identify errors in captions that misrepresent images. The HalDec-Bench benchmark includes captions from various vision-language models with human annotations and detailed hallucination-type categories. Key findings include that detectors often incorrectly identify sentences at the beginning of responses as correct and that using strong VLMs as filters can reduce dataset noise. The benchmark reveals performance differences across models not evident in existing benchmarks.

研究旨在通过引入HalDec-Bench基准来评估图像字幕中幻觉检测器的有效性。该基准包含来自多种视觉-语言模型的字幕，并附有人类标注和详细的幻觉类型类别。主要发现包括：检测器往往错误地将响应开头的句子判定为正确；使用强VLM作为过滤器可以减少数据集噪声。该基准还揭示了现有基准中未能体现的模型间性能差异。

VL-KnG: Persistent Spatiotemporal Knowledge Graphs from Egocentric Video for Embodied Scene Understanding

Authors: Mohamad Al Mdfaa, Svetlana Lukina, Timur Akhtyamov, Arthur Nigmatzyanov, Dmitrii Nalberskii, Sergey Zagoruyko, Gonzalo Ferrer

First: 2025-10-01T21:53:44+00:00 · Latest: 2026-03-24T11:41:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) demonstrate strong image-level scene understanding but often lack persistent memory, explicit spatial representations, and computational efficiency when reasoning over long video sequences. We present VL-KnG, a training-free framework that constructs spatiotemporal knowledge graphs from monocular video, bridging fine-grained scene graphs and global topological graphs without 3D reconstruction. VL-KnG processes video in chunks, maintains persistent object identity via LLM-based Spatiotemporal Object Association (STOA), and answers queries via Graph-Enhanced Retrieval (GER), a hybrid of GraphRAG subgraph retrieval and SigLIP2 visual grounding. Once built, the knowledge graph eliminates the need to re-process video at query time, enabling constant-time inference regardless of video length. Evaluation across three benchmarks, OpenEQA, NaVQA, and WalkieKnowledge (our newly introduced benchmark), shows that VL-KnG matches or surpasses frontier VLMs on embodied scene understanding tasks at significantly lower query latency, with explainable, graph-grounded reasoning. Real-world robot deployment confirms practical applicability with constant-time scaling.

中文标题/摘要

标题：VL-KnG：从第一人称视频构建持久时空知识图谱以实现具身场景理解

视觉-语言模型（VLMs）在图像级场景理解方面表现出色，但在处理长视频序列时往往缺乏持久记忆、明确的空间表示和计算效率。我们提出了VL-KnG，这是一种无需训练的框架，可以从单目视频中构建时空知识图谱，无需3D重建即可连接细粒度场景图和全局拓扑图。VL-KnG 以块为单位处理视频，通过基于LLM的时空对象关联（STOA）保持持久的对象身份，并通过图增强检索（GER）回答查询，这是一种结合GraphRAG子图检索和SigLIP2视觉定位的混合方法。一旦构建完成，知识图谱在查询时无需重新处理视频，从而实现与视频长度无关的常数时间推理。在三个基准测试（OpenEQA、NaVQA和WalkieKnowledge，我们新引入的基准测试）上的评估表明，VL-KnG 在显著降低查询延迟的情况下，与前沿的VLMs在具身场景理解任务上表现相当或更优，具有可解释的、基于图的推理。实际的机器人部署证实了其在常数时间扩展下的实用适用性。

Summary / 总结

VL-KnG is a training-free framework that constructs spatiotemporal knowledge graphs from monocular video to enhance embodied scene understanding. It maintains persistent object identity using LLM-based Spatiotemporal Object Association and answers queries through Graph-Enhanced Retrieval, combining GraphRAG subgraph retrieval and SigLIP2 visual grounding. VL-KnG demonstrates competitive performance on three benchmarks, achieving lower query latency and explainable reasoning, and is practically scalable for real-world robot applications.

VL-KnG 是一个无需训练的框架，通过单目视频构建时空知识图谱以提升具身场景理解能力。它使用基于LLM的时空对象关联来保持持久的对象身份，并通过结合 GraphRAG 子图检索和 SigLIP2 视觉定位的图增强检索来回答查询。VL-KnG 在三个基准测试中表现出竞争力，实现了较低的查询延迟和可解释的推理，并且在实际机器人部署中具有可扩展性。

MedCausalX: Adaptive Causal Reasoning with Self-Reflection for Trustworthy Medical Vision-Language Models

Authors: Jianxin Lin, Chunzheng Zhu, Peter J. Kneuertz, Yunfei Bai, Yuan Xue

First: 2026-03-24T11:28:15+00:00 · Latest: 2026-03-24T11:28:15+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet, existing medical chain-of-thought (CoT) models lack explicit mechanisms to represent and enforce causal reasoning, leaving them vulnerable to spurious correlations and limiting their clinical reliability. We pinpoint three core challenges in medical CoT reasoning: how to adaptively trigger causal correction, construct high-quality causal-spurious contrastive samples, and maintain causal consistency across reasoning trajectories. To address these challenges, we propose MedCausalX, an end-to-end framework that explicitly models causal reasoning chains in medical VLMs. We first introduce the CRMed dataset providing fine-grained anatomical annotations, structured causal reasoning chains, and counterfactual variants that guide the learning of causal relationships beyond superficial correlations. Building upon CRMed, MedCausalX employs a two-stage adaptive reflection architecture equipped with $\langle$causal$\rangle$ and $\langle$verify$\rangle$ tokens, enabling the model to autonomously determine when and how to perform causal analysis and verification. Finally, a trajectory-level causal correction objective optimized through error-attributed reinforcement learning refines the reasoning chain, allowing the model to distinguish genuine causal dependencies from shortcut associations. Extensive experiments on multiple benchmarks show that MedCausalX consistently outperforms state-of-the-art methods, improving diagnostic consistency by +5.4 points, reducing hallucination by over 10 points, and attaining top spatial grounding IoU, thereby setting a new standard for causally grounded medical reasoning.

中文标题/摘要

标题：MedCausalX：面向可信医学视觉语言模型的自适应因果推理与自我反思

视觉语言模型（VLMs）通过结合视觉感知和语言推理，实现了可解释的医学诊断。然而，现有的医学链式思考（CoT）模型缺乏明确机制来表示和执行因果推理，使其容易受到虚假相关性的影响，并限制了其临床可靠性。我们指出了医学CoT推理中的三个核心挑战：如何适应性地触发因果修正、构建高质量的因果-虚假对比样本以及在推理轨迹中保持因果一致性。为了解决这些挑战，我们提出了MedCausalX，这是一种端到端框架，明确地在医疗VLM中建模因果推理链。我们首先引入了CRMed数据集，提供了精细的解剖学注释、结构化的因果推理链以及引导学习因果关系而非表面相关性的反事实变体。基于CRMed，MedCausalX采用了一种两阶段自适应反思架构，配备了<因果>和<验证>标记，使模型能够自主决定何时以及如何进行因果分析和验证。最后，通过错误归因强化学习优化的轨迹级因果修正目标进一步细化推理链，使模型能够区分真正的因果依赖关系和捷径关联。在多个基准上的广泛实验表明，MedCausalX 一致地优于最先进的方法，提高了诊断一致性5.4个百分点，减少了幻觉超过10个百分点，并获得了顶级的空间定位IoU，从而为因果驱动的医学推理设定了新的标准。

Summary / 总结

MedCausalX addresses the limitations of existing medical chain-of-thought models by introducing an end-to-end framework that explicitly models causal reasoning chains. It uses the CRMed dataset for training and a two-stage adaptive reflection architecture with causal and verify tokens to autonomously perform causal analysis and verification. The model also includes a trajectory-level causal correction objective to refine reasoning chains. Experiments show that MedCausalX improves diagnostic consistency and reduces hallucination compared to state-of-the-art methods, setting a new standard for causally grounded medical reasoning.

MedCausalX通过引入一个明确建模因果推理链的端到端框架来解决现有医疗链式思维模型的局限性。它使用CRMed数据集进行训练，并采用带有因果和验证标记的两阶段自适应反思架构，以自主执行因果分析和验证。模型还包含一个轨迹级别的因果修正目标，以细化推理链。实验结果表明，MedCausalX在提高诊断一致性并减少幻觉方面优于最先进的方法，从而为因果驱动的医疗推理设立了新标准。

Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models

Authors: Hayeon Kim, Ji Ha Jang, Junghun James Kim, Se Young Chun

Venue: CVPR 2026

First: 2026-03-23T14:41:20+00:00 · Latest: 2026-03-24T11:22:50+00:00

Comments: Accepted to CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.

中文标题/摘要

标题：基于不确定性引导的分部到整体语义表示性组成性对齐在双曲视觉-语言模型中的应用

尽管视觉-语言模型（VLMs）已经取得了显著的性能，但它们的欧几里得嵌入在捕捉层次关系（如分部到整体或父级子级结构）方面仍然有限，并且在多对象组成场景中常常面临挑战。双曲VLMs通过更好地保留层次结构和通过蕴含来建模分部到整体关系（即整个场景及其分部图像）来缓解这一问题。然而，现有方法没有建模每个分部对整体的不同语义表示性。我们提出了不确定性引导的双曲组成性对齐（UNCHA）以增强双曲VLMs。UNCHA通过赋予更代表性的分部较低的不确定性，而赋予整体场景中不那么代表性的分部较高的不确定性，来建模分部到整体的语义表示性，使用双曲不确定性。这种表示性随后通过不确定性引导的权重纳入对比目标中。最后，不确定性通过基于熵的项正则化的蕴含损失进一步校准。通过提出的损失，UNCHA学习具有更准确分部到整体顺序的双曲嵌入，捕捉图像中的潜在组成结构，并提高其对复杂多对象场景的理解。UNCHA在零样本分类、检索和多标签分类基准测试中达到了最先进的性能。我们的代码和模型可在：https://github.com/jeeit17/UNCHA.git。

Summary / 总结

The research aims to enhance the performance of Vision-Language Models (VLMs) in capturing part-to-whole relationships by proposing UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA). UNCHA models the semantic representativeness of parts to the whole using hyperbolic uncertainty, assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones. This method is integrated into a contrastive objective with uncertainty-guided weights and further calibrated with an entailment loss. The results show that UNCHA improves the accuracy of part-whole ordering and enhances the understanding of complex multi-object scenes, achieving state-of-the-art performance on various benchmarks.

研究旨在通过解决欧几里得嵌入在捕捉部分到整体结构方面的局限性，改进视觉-语言模型（VLM）中的层次关系建模。提出了UNCHA，通过使用双曲不确定性来整合部分到整体的语义代表性，对更代表性的部分赋予较低的不确定性，对较不具代表性的部分赋予较高的不确定性。该方法被整合到具有不确定性引导权重的对比目标中，并通过基于熵的项进一步校准不确定性，最终通过蕴含损失进行调整。实验结果表明，UNCHA在零样本分类、检索和多标签分类基准上优于现有方法，展示了对复杂多对象场景的更好理解。

Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search

Authors: Xinlei Yin, Xiulian Peng, Xiao Li, Zhiwei Xiong, Yan Lu

First: 2026-01-20T08:23:29+00:00 · Latest: 2026-03-24T11:07:39+00:00

Comments: Accepted by CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions relying on naive chunking strategies with retrieval-augmented generation typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global summary, scene, segment, and entity levels. Then we employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves good temporal coherence, entity consistency, and retrieval efficiency, establishing a new state-of-the-art with an overall accuracy of 84.1% on LVBench. Notably, it achieves outstanding performance in the challenging reasoning category, reaching 80.1%. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.

中文标题/摘要

标题：基于视听实体连贯性和智能体搜索的层次化长视频理解

由于极长的上下文窗口，长视频理解对视觉语言模型提出了重大挑战。现有的依赖于简单的分块策略和检索增强生成的解决方案，通常会遭受信息碎片化和全局连贯性丧失的问题。我们提出了HAVEN，这是一种统一的长视频理解框架，通过整合视听实体连贯性和层次化视频索引与智能体搜索，实现连贯和全面的推理。首先，通过整合视觉和听觉流中的实体级表示，保持语义一致性，同时将内容组织成跨越全局摘要、场景、片段和实体级别的结构化层次。然后，采用智能体搜索机制，实现跨这些层次的动态检索和推理，促进连贯叙事重建和细粒度实体跟踪。广泛的实验表明，我们的方法在时间连贯性、实体一致性以及检索效率方面表现良好，在LVBench上以84.1%的整体准确率创下新的最先进水平，特别是在具有挑战性的推理类别中，准确率达到80.1%。这些结果突显了结构化多模态推理在长视频全面和上下文一致理解中的有效性。

Summary / 总结

The paper addresses the challenge of understanding long videos by proposing HAVEN, a framework that integrates audiovisual entity cohesion and hierarchical video indexing with agentic search. It preserves semantic consistency across visual and auditory streams and organizes content into a structured hierarchy. Experiments show that HAVEN achieves good temporal coherence, entity consistency, and retrieval efficiency, setting a new state-of-the-art with 84.1% accuracy on LVBench, particularly excelling in the reasoning category with 80.1% accuracy.

论文提出HAVEN框架，通过结合音频视觉实体一致性、层次视频索引和代理搜索来解决长视频理解的挑战。该框架在视觉和听觉流中保持语义一致性，并将内容组织成一个结构化的层次体系。实验表明，HAVEN在时间连贯性、实体一致性以及检索效率方面表现出色，在LVBench上达到84.1%的整体准确率，特别是在推理类别中达到80.1%的准确率。

On-device Semantic Selection Made Low Latency and Memory Efficient with Monolithic Forwarding

Authors: Jiahao Zhou, Chengliang Lin, Dingji Li, Mingkai Dong, Haibo Chen

First: 2025-10-17T13:06:09+00:00 · Latest: 2026-03-24T10:13:45+00:00

Abs · PDF · Code1 · Code2

Abstract

Semantic top-K selection with cross-encoder rerankers underpins on-device AI services, such as retrieval-augmented generation, agent memory, and personalized recommendation. However, its latency and memory demands dominate end-to-end budgets on edge hardware. Revisiting the objective of top-K selection, we reveal that only relative rankings matter, not exact per-candidate scores. We further observe sequence-level sparsity: relative rankings progressively stabilize in intermediate layers, enabling early pruning prior to completing full inference. Building on this insight, we propose monolithic forwarding and develop a training-free inference system, PRISM. By maintaining a global view of all candidates, it reduces latency through progressive cluster pruning. It also bounds peak memory usage by strategically overlapping I/O with computation via overlapped layer streaming and chunked execution. We evaluate PRISM against state-of-the-art baselines on rerankers from 0.6 B to 8 B parameters across Apple M2 and RTX 5070. PRISM consistently reduces latency by up to 89.2% and peak memory by up to 91.3% in microbenchmarks, without compromising precision. Across three real-world on-device AI applications, PRISM lowers latency by 11.6%-51.0% and peak memory by 18.6%-77.8%, demonstrating substantial improvements in efficiency and deployability.

中文标题/摘要

标题：在设备上实现低延迟和内存高效的整体前向传播语义选择

使用交叉编码器重排序的语义Top-K选择是设备上AI服务的基础，如检索增强生成、代理记忆和个性化推荐。然而，它的延迟和内存需求在边缘硬件的端到端预算中占主导地位。重新审视Top-K选择的目标，我们发现只有相对排名才重要，而不是每个候选项的精确得分。我们进一步观察到序列级别的稀疏性：相对排名在中间层中逐渐稳定，允许在完成完整推理之前进行早期剪枝。基于这一洞察，我们提出了整体前向传播，并开发了一个无需训练的推理系统PRISM。通过维护所有候选项的全局视图，它通过逐步聚类剪枝来减少延迟。它还通过重叠层流式传输和分块执行战略性地重叠输入/输出与计算，来限制峰值内存使用。我们在Apple M2和RTX 5070上对参数量从0.6B到8B（6亿到80亿）的重排序器进行了与最先进的基线的评估。在微基准测试中，PRISM将延迟最多减少89.2%，峰值内存最多减少91.3%，而不牺牲精度。在三个实际的设备上AI应用中，PRISM将延迟降低11.6%-51.0%，峰值内存降低18.6%-77.8%，展示了在效率和部署性方面的显著改进。

Summary / 总结

The paper addresses the high latency and memory demands of semantic top-K selection with cross-encoder rerankers in on-device AI services. By recognizing that only relative rankings are necessary, the authors propose PRISM, a monolithic forwarding system that reduces latency by up to 89.2% and peak memory by up to 91.3% without sacrificing precision. PRISM achieves these improvements through progressive cluster pruning and strategic I/O overlap during inference.

论文针对跨编码器重排序在设备端AI服务中的高延迟和高内存需求问题，提出了一种名为PRISM的单体转发系统，通过逐步聚类剪枝和策略性I/O重叠，将延迟最多降低89.2%，峰值内存最多降低91.3%，同时不牺牲精度。在实际应用中，PRISM将延迟降低了11.6%-51.0%，峰值内存降低了18.6%-77.8%。
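
The observation that only relative rankings matter, and that they stabilize in intermediate layers, can be illustrated with a small early-pruning sketch. It is a simplification, not PRISM itself: the abstract describes progressive cluster pruning, whereas this example drops a fixed fraction of candidates per layer, and `scores_per_layer` (an intermediate score for every candidate at every layer) and `keep_ratio` are hypothetical inputs chosen for illustration.

```python
import math

def progressive_topk(scores_per_layer, k, keep_ratio=0.5):
    """Select top-k candidates while pruning low-ranked ones layer by layer.

    scores_per_layer[l][c] is the (assumed) intermediate score of candidate c
    after layer l. Returns (top_k_indices, number_of_layer_evaluations).
    """
    survivors = list(range(len(scores_per_layer[0])))
    evaluated = 0
    for layer_scores in scores_per_layer:
        # Only surviving candidates are pushed through this layer.
        evaluated += len(survivors)
        survivors.sort(key=lambda c: layer_scores[c], reverse=True)
        # Never prune below k, so a full top-k list always remains.
        keep = max(k, math.ceil(len(survivors) * keep_ratio))
        survivors = survivors[:keep]
    return survivors[:k], evaluated
```

In a three-layer toy run with six candidates, this evaluates 11 candidate-layer pairs instead of the 18 a full pass would need, while returning the same top-2 as long as the intermediate rankings are stable.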

Traffic Sign Recognition in Autonomous Driving: Dataset, Benchmark, and Field Experiment

Authors: Guoyang Zhao, Weiqing Qi, Kai Zhang, Chenguang Zhang, Zeying Gong, Zhihai Bi, Kai Chen, Benshan Ma, Ming Liu, Jun Ma

First: 2026-03-24T10:11:27+00:00 · Latest: 2026-03-24T10:11:27+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Traffic Sign Recognition (TSR) is a core perception capability for autonomous driving, where robustness to cross-region variation, long-tailed categories, and semantic ambiguity is essential for reliable real-world deployment. Despite steady progress in recognition accuracy, existing traffic sign datasets and benchmarks offer limited diagnostic insight into how different modeling paradigms behave under these practical challenges. We present TS-1M, a large-scale and globally diverse traffic sign dataset comprising over one million real-world images across 454 standardized categories, together with a diagnostic benchmark designed to analyze model capability boundaries. Beyond standard train-test evaluation, we provide a suite of challenge-oriented settings, including cross-region recognition, rare-class identification, low-clarity robustness, and semantic text understanding, enabling systematic and fine-grained assessment of modern TSR models. Using TS-1M, we conduct a unified benchmark across three representative learning paradigms: classical supervised models, self-supervised pretrained models, and multimodal vision-language models (VLMs). Our analysis reveals consistent paradigm-dependent behaviors, showing that semantic alignment is a key factor for cross-region generalization and rare-category recognition, while purely visual models remain sensitive to appearance shift and data imbalance. Finally, we validate the practical relevance of TS-1M through real-scene autonomous driving experiments, where traffic sign recognition is integrated with semantic reasoning and spatial localization to support map-level decision constraints. Overall, TS-1M establishes a reference-level diagnostic benchmark for TSR and provides principled insights into robust and semantic-aware traffic sign perception. Project page: https://guoyangzhao.github.io/projects/ts1m.

中文标题/摘要

标题：自动驾驶中的交通标志识别：数据集、基准和实地试验

交通标志识别（TSR）是自动驾驶的核心感知能力，其中跨区域变化的鲁棒性、长尾类别和语义模糊性对于可靠的现实世界部署至关重要。尽管在识别准确度方面取得了稳步进展，但现有的交通标志数据集和基准提供的诊断洞察有限，无法揭示不同建模范式在这些实际挑战下的表现。我们提出了TS-1M，这是一个大规模且全球多样化的交通标志数据集，包含超过一百万张来自454个标准化类别的真实世界图像，以及一个用于分析模型能力边界的诊断基准。除了标准的训练-测试评估外，我们还提供了一系列挑战导向的设置，包括跨区域识别、稀有类别识别、低清晰度鲁棒性和语义文本理解，使现代TSR模型能够系统地进行细致评估。使用TS-1M，我们在三种代表性学习范式之间进行了统一基准测试：经典监督模型、自我监督预训练模型和多模态视觉-语言模型（VLMs）。我们的分析揭示了范式依赖的一致行为，表明语义对齐是跨区域泛化和稀有类别识别的关键因素，而纯视觉模型对外观变化和数据不平衡仍然敏感。最后，我们通过将交通标志识别与语义推理和空间定位集成到实际场景的自动驾驶实验中，验证了TS-1M的实用相关性，以支持地图级决策约束。总体而言，TS-1M为TSR建立了参考级诊断基准，并提供了关于鲁棒性和语义感知交通标志感知的原理性见解。项目页面：https://guoyangzhao.github.io/projects/ts1m.

Summary / 总结

The paper addresses the need for robust Traffic Sign Recognition (TSR) in autonomous driving, focusing on cross-region variation, long-tailed categories, and semantic ambiguity. It introduces TS-1M, a large-scale and globally diverse dataset with over one million images across 454 categories, and a diagnostic benchmark for evaluating model performance under various challenges. Experimental results show that classical supervised models, self-supervised pretrained models, and multimodal vision-language models exhibit different behaviors, with semantic alignment being crucial for cross-region generalization and rare-category recognition. The study also validates the practical relevance of TS-1M through real-scene autonomous driving experiments, integrating traffic sign recognition with semantic reasoning and spatial localization for map-level decision-making constraints.

论文旨在通过引入包含超过一百万张图像、涵盖454个类别的大规模全球多样化数据集TS-1M，解决自动驾驶中交通标志识别（TSR）的鲁棒性挑战。同时，提出了一个诊断基准，以评估模型在跨区域识别和稀有类别识别等实际挑战下的性能。研究使用了三种学习范式——经典监督模型、自我监督预训练模型和多模态视觉-语言模型，并发现语义对齐对于跨区域泛化和稀有类别识别至关重要，而纯视觉模型对外观变化和数据不平衡敏感。通过实际场景的自动驾驶实验验证了TS-1M的实际相关性。

Looking Beyond the Window: Global-Local Aligned CLIP for Training-free Open-Vocabulary Semantic Segmentation

Authors: ByeongCheol Lee, Hyun Seok Seong, Sangeek Hyun, Gilhan Park, WonJun Moon, Jae-Pil Heo

Venue: CVPR 2026

First: 2026-03-24T10:10:12+00:00 · Latest: 2026-03-24T10:10:12+00:00

Comments: 18 pages, 13 figures, 12 tables, Accepted to CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

A sliding-window inference strategy is commonly adopted in recent training-free open-vocabulary semantic segmentation methods to overcome the limitations of CLIP in processing high-resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancy across windows. To address this issue, we propose Global-Local Aligned CLIP (GLA-CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA-CLIP extends key-value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer-window tokens are less likely to be attended, since query features are produced through interactions within the inner window patches, thereby lacking semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens highly similar to the given query from all windows, which provides a unified semantic reference for measuring similarity across both inner- and outer-window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by dynamically scaling and thresholding the attention map to cope with small-object scenarios. Moreover, GLA-CLIP can be plugged into existing methods to broaden their receptive field. Extensive experiments validate the effectiveness of GLA-CLIP in enhancing training-free open-vocabulary semantic segmentation performance. Code is available at https://github.com/2btlFe/GLA-CLIP.

中文标题/摘要

标题：超越窗口：全局-局部对齐的CLIP用于免训练的开放词汇语义分割

近期的免训练开放词汇语义分割方法通常采用滑动窗口推理策略来克服CLIP在处理高分辨率图像时的局限性。然而，这种方法引入了一个新的挑战：每个窗口独立处理，导致窗口间语义不一致。为了解决这个问题，我们提出了全局-局部对齐的CLIP（GLA-CLIP）框架，该框架促进了窗口间的信息全面交换。GLA-CLIP不再将注意力局限于单个窗口内的令牌，而是扩展键值令牌以纳入所有窗口的上下文线索。然而，我们观察到窗口偏见：外窗口令牌不太可能被关注，因为查询特征是通过内窗口图像块内的交互生成的，因此缺乏超出其局部上下文的语义基础。为了缓解这一问题，我们引入了一个代理锚点，它通过聚合所有窗口中与给定查询高度相似的令牌构建而成，为衡量内窗口和外窗口图像块的相似性提供了统一的语义参考。此外，我们提出了一种动态归一化方案，通过对注意力图进行动态缩放和阈值化，根据对象尺度调整注意力强度，以应对小对象场景。此外，GLA-CLIP可以集成到现有方法中并扩大其感受野。大量实验验证了GLA-CLIP在提升免训练开放词汇语义分割性能方面的有效性。代码可在https://github.com/2btlFe/GLA-CLIP获取。

Summary / 总结

This paper addresses the issue of semantic discrepancy in training-free open-vocabulary semantic segmentation methods by proposing GLA-CLIP, which facilitates information exchange across sliding windows. GLA-CLIP extends key-value tokens to incorporate contextual cues from all windows and introduces a proxy anchor to mitigate window bias. Additionally, a dynamic normalization scheme is proposed to adjust attention strength based on object scale. Experiments show that GLA-CLIP improves the performance of existing methods in training-free open-vocabulary semantic segmentation. Code is available at https://github.com/2btlFe/GLA-CLIP.

论文针对使用滑动窗口推理策略的免训练开放词汇语义分割方法中存在的窗口间语义不一致问题，提出了GLA-CLIP框架，通过从所有窗口中引入上下文线索来促进窗口间的信息交流。该方法引入了代理锚点以缓解窗口偏见，并提出了一种动态归一化方案，根据物体尺度动态调整注意力强度。实验表明，GLA-CLIP能够提高现有方法在该领域的性能。
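
To make the window-local versus cross-window distinction in the GLA-CLIP abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the token values, the `attend` helper, and the two-window layout are all hypothetical, and the proxy anchor and dynamic normalization steps are not modelled. It only shows how the attention of one query changes when keys and values are drawn from every window instead of the query's own window.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights

# Two toy windows, each holding two 2-D patch tokens (hypothetical values).
windows = [
    [[1.0, 0.0], [0.9, 0.1]],  # window 0
    [[0.0, 1.0], [1.0, 0.1]],  # window 1
]
query = windows[0][0]

# Window-local attention: keys/values come only from the query's own window.
local_out, local_w = attend(query, windows[0], windows[0])

# Cross-window attention: keys/values are the tokens of all windows together.
all_tokens = [tok for win in windows for tok in win]
global_out, global_w = attend(query, all_tokens, all_tokens)

print("local :", local_out)
print("global:", global_out)
```

In this toy setup the cross-window output picks up information from window 1 that the local output cannot see, which is the effect the extended key-value tokens are meant to provide.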

Zero-Shot Personalization of Objects via Textual Inversion

Authors: Aniket Roy, Maitreya Suin, Rama Chellappa

First: 2026-03-24T09:54:30+00:00 · Latest: 2026-03-24T09:54:30+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent advances in text-to-image diffusion models have substantially improved the quality of image customization, enabling the synthesis of highly realistic images. Despite this progress, achieving fast and efficient personalization remains a key challenge, particularly for real-world applications. Existing approaches primarily accelerate customization for human subjects by injecting identity-specific embeddings into diffusion models, but these strategies do not generalize well to arbitrary object categories, limiting their applicability. To address this limitation, we propose a novel framework that employs a learned network to predict object-specific textual inversion embeddings, which are subsequently integrated into the UNet timesteps of a diffusion model for text-conditional customization. This design enables rapid, zero-shot personalization of a wide range of objects in a single forward pass, offering both flexibility and scalability. Extensive experiments across multiple tasks and settings demonstrate the effectiveness of our approach, highlighting its potential to support fast, versatile, and inclusive image customization. To the best of our knowledge, this work represents the first attempt to achieve such general-purpose, training-free personalization within diffusion models, paving the way for future research in personalized image generation.

中文标题/摘要

标题：通过文本反转实现对象的零样本个性化

最近在文本到图像扩散模型方面的进展显著提高了图像定制的质量，使得合成高度逼真的图像成为可能。尽管取得了这些进展，实现快速和高效的个性化仍然是一个关键挑战，尤其是在实际应用中。现有的方法主要通过向扩散模型注入身份特定嵌入来加速人类主体的定制，但这些策略不适用于任意对象类别，限制了它们的应用范围。为了解决这一局限性，我们提出了一种新的框架，该框架利用一个学习网络预测对象特定的文本反转嵌入，这些嵌入随后被整合到扩散模型的UNet时间步中，以实现文本条件下的定制。这种设计使得在单次前向传播中能够快速实现广泛对象的零样本个性化，提供了灵活性和可扩展性。在多个任务和设置下的广泛实验表明了我们方法的有效性，突显了其支持快速、多功能和包容性图像定制的潜力。据我们所知，这项工作是首次尝试在扩散模型中实现这种通用的、无需训练的个性化，为未来个性化图像生成的研究铺平了道路。

Summary / 总结

The research aims to enable fast and efficient personalization of images for objects using text-to-image diffusion models. The method involves predicting object-specific textual inversion embeddings with a learned network and integrating them into the UNet timesteps of a diffusion model. Experiments show that this approach allows for rapid, zero-shot personalization of various objects in a single forward pass, demonstrating its effectiveness and scalability in image customization tasks.

研究旨在利用文本到图像的扩散模型，实现对象（非人类主体）的快速高效个性化。方法是通过一个学习网络预测对象特定的文本反转嵌入，然后将其整合到扩散模型的UNet时间步中。实验结果表明，这种方法可以在单次前向传播中实现各种对象的快速零样本个性化，展示了其在多个任务和设置中的灵活性和可扩展性。

Retrieval-Augmented Generation with Covariate Time Series

Authors: Kenny Ye Liang, Zhongyi Pei, Huan Zhang, Yuhui Liu, Shaoxu Song, Jianmin Wang

First: 2026-03-05T08:45:24+00:00 · Latest: 2026-03-24T09:38:04+00:00

Comments: 12 pages. Preprint

Abs · PDF · Code1 · Code2

Abstract

While RAG has greatly enhanced LLMs, extending this paradigm to Time-Series Foundation Models (TSFMs) remains a challenge. This is exemplified in the Predictive Maintenance of the Pressure Regulating and Shut-Off Valve (PRSOV), a high-stakes industrial scenario characterized by (1) data scarcity, (2) short transient sequences, and (3) covariate-coupled dynamics. Unfortunately, existing time-series RAG approaches predominantly rely on generated static vector embeddings and learnable context augmenters, which may fail to distinguish similar regimes in such scarce, transient, and covariate-coupled scenarios. To address these limitations, we propose RAG4CTS, a regime-aware, training-free RAG framework for Covariate Time-Series. Specifically, we construct a hierarchical time-series native knowledge base to enable lossless storage and physics-informed retrieval of raw historical regimes. We design a two-stage bi-weighted retrieval mechanism that aligns historical trends through point-wise and multivariate similarities. For context augmentation, we introduce an agent-driven strategy to dynamically optimize context in a self-supervised manner. Extensive experiments on PRSOV demonstrate that our framework significantly outperforms state-of-the-art baselines in prediction accuracy. The proposed system is deployed in Apache IoTDB within China Southern Airlines. Since deployment, our method has successfully identified one PRSOV fault in two months with zero false alarms.

中文标题/摘要

标题：基于协变量时间序列的检索增强生成

虽然RAG极大地提升了LLMs，但将其扩展到时间序列基础模型（TSFMs）仍面临挑战。这在压力调节和关断阀（PRSOV）的预测维护中尤为明显，这是一个高风险的工业场景，具有（1）数据稀缺性，（2）短暂的瞬态序列，以及（3）协变量耦合的动力学。不幸的是，现有的时间序列RAG方法主要依赖于生成的静态向量嵌入和可学习的上下文增强器，这在稀缺、短暂且协变量耦合的场景中可能无法区分相似的运行状态。为了解决这些局限性，我们提出了RAG4CTS，这是一种针对协变量时间序列的无训练检索增强生成框架。具体而言，我们构建了一个层次化的时间序列本征知识库，以实现无损存储和基于物理的检索历史运行状态。我们设计了一种两阶段的双加权检索机制，通过点对点和多变量相似性对历史趋势进行对齐。对于上下文增强，我们引入了一种基于代理的策略，以自监督方式动态优化上下文。在PRSOV上的广泛实验表明，我们的框架在预测准确性上显著优于最先进的基线。所提出系统已部署在中国南方航空公司的Apache IoTDB中。自部署以来，我们的方法在两个月内成功检测到一个PRSOV故障，且无误报。

Summary / 总结

This paper addresses the challenge of applying Retrieval-Augmented Generation (RAG) to Time-Series Foundation Models (TSFMs) in high-stakes industrial scenarios such as predictive maintenance of the Pressure Regulating and Shut-Off Valve (PRSOV). The authors propose RAG4CTS, a regime-aware, training-free RAG framework that uses a hierarchical time-series knowledge base for lossless storage and physics-informed retrieval of historical regimes. The system employs a two-stage bi-weighted retrieval mechanism and an agent-driven context augmentation strategy. Experiments show that RAG4CTS significantly improves prediction accuracy over existing methods. The system is deployed in Apache IoTDB within China Southern Airlines, where it identified one PRSOV fault in two months with zero false alarms.

该论文旨在解决将检索增强生成（RAG）应用于时间序列基础模型（TSFMs）的挑战，并以PRSOV阀门预测性维护这一高风险工业场景为例。作者提出了RAG4CTS，这是一种基于层次化时间序列知识库的免训练RAG框架，用于无损存储和基于物理信息检索历史运行状态。系统采用两阶段的双加权检索机制和基于代理的上下文增强策略。实验表明，RAG4CTS在预测准确性方面显著优于现有方法。该系统已部署于中国南方航空公司，部署后两个月内成功识别出一例PRSOV故障，且无误报。

VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models

Authors: Jintao Cheng, Haozhe Wang, Weibin Li, Gang Wang, Yipu Zhang, Xiaoyu Tang, Jin Wu, Xieyuanli Chen, Yunhui Liu, Wei Zhang

First: 2026-03-24T09:33:05+00:00 · Latest: 2026-03-24T09:33:05+00:00

Comments: 27 pages, 8 figures

Abs · PDF · Code1 · Code2 · Project1

Abstract

Vision-Language-Action (VLA) models have rapidly advanced embodied intelligence, enabling robots to execute complex, instruction-driven tasks. However, as model capacity and visual context length grow, the inference cost of VLA systems becomes a major bottleneck for real-world deployment on resource-constrained platforms. Existing visual token pruning methods mainly rely on semantic saliency or simple temporal cues, overlooking the continuous physical interaction, a fundamental property of VLA tasks. Consequently, current approaches often prune visually sparse yet structurally critical regions that support manipulation, leading to unstable behavior during early task phases. To overcome this, we propose a shift toward an explicit Interaction-First paradigm. Our proposed training-free method, VLA-IAP (Interaction-Aligned Pruning), introduces a geometric prior mechanism to preserve structural anchors and a dynamic scheduling strategy that adapts pruning intensity based on semantic-motion alignment. This enables a conservative-to-aggressive transition, ensuring robustness during early uncertainty and efficiency once interaction is locked. Extensive experiments show that VLA-IAP achieves a 97.8% success rate with a 1.25× speedup on the LIBERO benchmark, and up to 1.54× speedup while maintaining performance comparable to the unpruned backbone. Moreover, the method demonstrates superior and consistent performance across multiple model architectures and three different simulation environments, as well as a real robot platform, validating its strong generalization capability and practical applicability. Our project website is: https://chengjt1999.github.io/VLA-IAP.github.io/.

中文标题/摘要

标题：VLA-IAP：基于交互对齐的无训练视觉标记剪枝方法以提高视觉-语言-动作模型性能

视觉-语言-动作（VLA）模型迅速推进了嵌入式智能，使机器人能够执行复杂的指令驱动任务。然而，随着模型容量和视觉上下文长度的增长，VLA系统的推理成本成为在资源受限平台上实际部署的主要瓶颈。现有的视觉标记剪枝方法主要依赖于语义显著性或简单的时序线索，忽视了持续的物理交互，这是VLA任务的一个基本属性。因此，当前的方法经常剪枝视觉稀疏但结构上至关重要的支持操作的区域，导致在任务早期阶段行为不稳定。为了解决这个问题，我们提出了一种转向显式的交互优先范式。我们提出的无训练方法VLA-IAP（交互对齐剪枝）引入了一种几何先验机制来保留结构锚点，并采用一种动态调度策略，根据语义-运动对齐调整剪枝强度。这使得从保守到激进的过渡得以实现，在早期不确定性期间确保稳健性，并在交互锁定后提高效率。广泛的实验表明，VLA-IAP在LIBERO基准测试中实现了97.8%的成功率，并且在保持性能与未剪枝主干相当的情况下，实现了1.25倍的加速，最高可达1.54倍的加速。此外，该方法在多个模型架构和三个不同的模拟环境中以及一个真实机器人平台上表现出优越且一致的性能，验证了其强大的泛化能力和实际应用性。我们的项目网站是：https://chengjt1999.github.io/VLA-IAP.github.io/。

Summary / 总结

The research aims to reduce the inference cost of Vision-Language-Action (VLA) models for real-world deployment by proposing a training-free method, VLA-IAP, which focuses on preserving structurally critical regions through interaction alignment. VLA-IAP uses a geometric prior to maintain key structural elements and dynamically adjusts pruning intensity based on semantic-motion alignment, achieving a 97.8% success rate with a 1.25x speedup on the LIBERO benchmark and up to 1.54x speedup while maintaining comparable performance to the unpruned model. The method shows consistent performance across different model architectures and environments.

研究旨在通过提出一种训练-free 方法 VLA-IAP，减少 Vision-Language-Action (VLA) 模型的推理成本，以实现资源受限平台上的实际部署。该方法通过交互对齐保留结构上关键的区域，实现了在 LIBERO 基准上的 97.8% 成功率和 1.25 倍的加速，最高可达 1.54 倍加速同时保持与未剪枝主干相当的性能。该方法在不同模型架构和环境中的表现一致，验证了其泛化能力和实际应用价值。
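
The conservative-to-aggressive schedule described for VLA-IAP can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's algorithm: the linear interpolation in `keep_ratio`, the 0.9 and 0.4 bounds, the saliency scores, and the anchor indices are all hypothetical. It shows only the general idea of always keeping a set of structural anchor tokens while the overall token budget shrinks as a semantic-motion alignment score rises.

```python
def keep_ratio(alignment, conservative=0.9, aggressive=0.4):
    """Fraction of tokens to keep: high while alignment is low, lower once it is high.

    The linear interpolation and both bounds are illustrative assumptions.
    """
    a = min(1.0, max(0.0, alignment))
    return conservative + (aggressive - conservative) * a

def prune_tokens(scores, anchor_ids, alignment):
    """Return sorted indices of kept tokens; anchor tokens are always kept."""
    budget = max(len(anchor_ids), round(keep_ratio(alignment) * len(scores)))
    kept = set(anchor_ids)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for i in ranked:
        if len(kept) >= budget:
            break
        kept.add(i)
    return sorted(kept)

# Ten visual tokens with hypothetical saliency scores.
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.05, 0.6, 0.3, 0.4, 0.5]
# Tokens 1 and 5 have low saliency but are treated as structural anchors.
anchors = {1, 5}

early = prune_tokens(scores, anchors, alignment=0.0)  # early, uncertain phase
late = prune_tokens(scores, anchors, alignment=1.0)   # interaction locked

print("early:", early)
print("late :", late)
```

A saliency-only pruner would drop tokens 1 and 5 first; here they survive in both phases while the remaining budget is filled by saliency.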

CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays

Authors: Hyungyung Lee, Hangyul Yoon, Edward Choi

First: 2026-02-26T17:51:21+00:00 · Latest: 2026-03-24T09:05:23+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also requiring costly retraining to support new diagnostic tasks, limiting their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings. The demo is available at https://ttumyche.github.io/cxreasonagent/#demo.

中文标题/摘要

标题：CXReasonAgent：基于证据的胸部X光诊断推理代理

胸部X光在胸部诊断中起着核心作用，其解释本质上需要多步、基于证据的推理。然而，大型视觉-语言模型（LVLM）通常生成的响应并不忠实于诊断证据，提供的视觉证据有限，难以验证，同时还需要昂贵的重新训练以支持新的诊断任务，限制了其在临床环境中的可靠性和适应性。为了解决这些限制，我们提出了CXReasonAgent，这是一种将大型语言模型（LLM）与临床导向的诊断工具结合的诊断代理，用于使用图像衍生的诊断和视觉证据进行基于证据的诊断推理。为了评估这些能力，我们引入了包含1,946轮对话的多轮对话基准CXReasonDial，涵盖12项诊断任务，并展示了CXReasonAgent生成忠实于证据的响应，使其在临床环境中比LVLMs提供更可靠和可验证的诊断推理。这些发现强调了在安全关键的临床环境中整合基于临床的诊断工具的重要性。演示可在[这里](https://ttumyche.github.io/cxreasonagent/#demo)找到。

Summary / 总结

The research addresses the limitations of large vision-language models in generating faithfully grounded responses for chest X-ray interpretation, which lack visual evidence and require costly retraining. CXReasonAgent, an agent integrating a large language model with clinically grounded diagnostic tools, is proposed to perform evidence-grounded diagnostic reasoning. CXReasonAgent demonstrates the ability to produce faithfully grounded responses in a multi-turn dialogue benchmark, CXReasonDial, outperforming LVLMs in terms of reliability and verifiability. This highlights the necessity of integrating clinically grounded diagnostic tools in safety-critical clinical settings.

CXReasonAgent 通过将大型语言模型与临床相关的诊断工具集成，用于胸部X光的证据导向诊断推理。它解决了大型视觉语言模型的局限性，能够生成忠实于诊断证据的响应，并提供视觉证据以供验证。CXReasonAgent 在 CXReasonDial 基准测试中的 1,946 个对话（涵盖 12 个诊断任务）中表现出更可靠的可验证诊断推理能力。

ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning

Authors: Xiangyu Yin, Yi Qi, Chih-hong Cheng

First: 2026-03-24T08:29:15+00:00 · Latest: 2026-03-24T08:29:15+00:00

Abs · PDF · Code1 · Code2

Abstract

Retrieval-Augmented Generation (RAG) improves the reliability of large language model applications by grounding generation in retrieved evidence, but it also introduces a new attack surface: corpus poisoning. In this setting, an adversary injects or edits passages so that they are ranked into the Top-K results for target queries and then affect downstream generation. Existing defences against corpus poisoning often rely on content filtering, auxiliary models, or generator-side reasoning, which can make deployment more difficult. We propose ProGRank, a post hoc, training-free retriever-side defence for dense-retriever RAG. ProGRank stress-tests each query-passage pair under mild randomized perturbations and extracts probe gradients from a small fixed parameter subset of the retriever. From these signals, it derives two instability signals, representational consistency and dispersion risk, and combines them with a score gate in a reranking step. ProGRank preserves the original passage content, requires no retraining, and also supports a surrogate-based variant when the deployed retriever is unavailable. Extensive experiments across three datasets, three dense retriever backbones, representative corpus poisoning attacks, and both retrieval-stage and end-to-end settings show that ProGRank provides stronger defence performance and a favorable robustness-utility trade-off. It also remains competitive under adaptive evasive attacks.

中文标题/摘要

标题：ProGRank: 探针-梯度重排序以抵御密集检索器RAG的语料库污染攻击

检索增强生成（RAG）通过将生成与检索到的证据相结合来提高大型语言模型应用的可靠性，但也引入了一个新的攻击面：语料库污染。在这种情况下，攻击者会注入或编辑段落，使其在目标查询的Top-$K$结果中排名靠前，从而影响下游生成。现有的对抗语料库污染的防御措施通常依赖于内容过滤、辅助模型或生成器端推理，这可能会使部署更加复杂。我们提出了一种名为ProGRank的后处理、无需训练的检索端防御方法，用于密集检索器RAG。ProGRank在轻微随机扰动下对每个查询-段落对进行压力测试，并从检索器的固定参数子集提取探针梯度。从这些信号中，它推导出两种不稳定性信号：表示一致性风险和分散风险，并在重排序步骤中将它们与评分门控结合使用。ProGRank保留了原始段落内容，无需重新训练，并且在部署检索器不可用时还支持基于代理的变体。在三个数据集、三个密集检索器主干、代表性语料库污染攻击以及检索阶段和端到端设置中进行的广泛实验表明，ProGRank提供了更强的防御性能和有利的稳健性-效用权衡。它在适应性规避攻击下也保持竞争力。

Summary / 总结

ProGRank is a training-free defense mechanism for dense-retriever RAG systems against corpus poisoning. It uses probe gradients from a small parameter subset to rerank query-passage pairs, identifying representational consistency and dispersion risk to mitigate the impact of poisoned passages. Experiments across multiple datasets and attacks show that ProGRank enhances defense performance and maintains a good balance between robustness and utility, even under adaptive attacks.

ProGRank 是一种针对密集检索器 RAG 的后处理、无需训练的防御方法，用于防止语料库污染。它通过使用一小部分参数的探针梯度来提取不稳定性信号并对结果进行重排序。实验表明，ProGRank 在各种设置和攻击下提供了强大的防御性能和良好的稳健性-实用性权衡。
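
The general shape of a stability-aware reranker can be sketched as follows. This is deliberately a simplified stand-in rather than ProGRank itself: instead of probe gradients from retriever parameters, it measures how much a dot-product score spreads when the query embedding is perturbed with small Gaussian noise. The embeddings, the `gate` and `penalty` values, and the interpretation of the score gate (only passages scoring above the gate are penalized) are all hypothetical assumptions.

```python
import random
import statistics

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dispersion(query, passage, rng, n_probes=16, noise=0.05):
    """Std. dev. of the similarity score under small random query perturbations."""
    scores = []
    for _ in range(n_probes):
        perturbed = [q + rng.gauss(0.0, noise) for q in query]
        scores.append(dot(perturbed, passage))
    return statistics.pstdev(scores)

def rerank(query, passages, gate=0.5, penalty=2.0, seed=0):
    """Return passage indices ordered by score minus an instability penalty."""
    rng = random.Random(seed)
    adjusted = []
    for idx, p in enumerate(passages):
        base = dot(query, p)
        # Assumed score gate: only passages that already score highly are penalized.
        risk = dispersion(query, p, rng) if base >= gate else 0.0
        adjusted.append((base - penalty * risk, idx))
    return [idx for _, idx in sorted(adjusted, reverse=True)]

query = [1.0, 0.0, 0.0]
passages = [
    [0.9, 0.1, 0.0],   # benign passage, stable score
    [1.0, 4.0, -4.0],  # toy "poisoned" passage: top raw score, unstable under noise
    [0.2, 0.0, 0.1],   # low-scoring passage, below the gate
]

raw_top = max(range(len(passages)), key=lambda i: dot(query, passages[i]))
order = rerank(query, passages)
print("raw top-1:", raw_top, "reranked:", order)
```

The toy poisoned passage wins on raw similarity but its score swings widely under perturbation, so the penalty pushes it below the benign passage.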

ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling

Authors: Shaobo Ju, Baiyang Song, Tao Chen, Jiapeng Zhang, Qiong Wu, Chao Chang, HuaiXi Wang, Yiyi Zhou, Rongrong Ji

First: 2026-03-24T08:01:16+00:00 · Latest: 2026-03-24T08:01:16+00:00

Abs · PDF · Code1 · Code2

Abstract

Because it greatly reduces computation and memory overhead, token compression has become a research hotspot for MLLMs and has achieved remarkable progress in image-language tasks. However, for video, existing methods still fall short of high-ratio token compression. We attribute this shortcoming to the insufficient modeling of temporal and continual video content, and propose a novel and training-free token pruning method for video MLLMs, termed ForestPrune, which achieves effective and high-ratio pruning via Spatial-temporal Forest Modeling. In practice, ForestPrune constructs token forests across video frames based on semantic, spatial and temporal constraints, yielding an overall comprehension of the video. Afterwards, ForestPrune evaluates the importance of token trees and nodes based on tree depth and node roles, thereby obtaining a globally optimal pruning decision. To validate ForestPrune, we apply it to two representative video MLLMs, namely LLaVA-Video and LLaVA-OneVision, and conduct extensive experiments on a range of video benchmarks. The experimental results not only show its effectiveness for video MLLMs, e.g., retaining 95.8% average accuracy while reducing tokens by 90% for LLaVA-OneVision, but also show its superior performance and efficiency over the compared token compression methods, e.g., +10.1% accuracy on MLVU and -81.4% pruning time compared with FrameFusion on LLaVA-Video.

中文标题/摘要

标题：ForestPrune：通过时空森林建模实现视频多模态大语言模型的高比例视觉标记压缩

由于计算和内存开销的巨大节省，标记压缩已成为MLLMs的研究热点，并在图像-语言任务中取得了显著进展。然而，对于视频，现有方法在高比例标记压缩方面仍然不足。我们将其归因于对时间和连续视频内容建模的不足，并提出了一种新的无需训练的视频MLLMs标记剪枝方法，称为ForestPrune，通过时空森林建模实现有效的高比例剪枝。实践中，ForestPrune基于语义、空间和时间约束在视频帧之间构建标记森林，从而对视频进行整体理解。之后，ForestPrune基于树深度和节点角色评估标记树和节点的重要性，从而获得全局最优剪枝决策。为了验证ForestPrune，我们将其应用于两个代表性视频MLLMs，即LLaVA-Video和LLaVA-OneVision，并在一系列视频基准上进行了广泛的实验。实验结果不仅展示了其对视频MLLMs的巨大有效性，例如在LLaVA-OneVision中保留95.8%的平均准确率同时减少90%的标记，而且还展示了其在与比较的标记压缩方法相比的优越性能和效率，例如在MLVU上准确率提高10.1%，以及在LLaVA-Video上剪枝时间减少81.4%。

Summary / 总结

ForestPrune is a novel, training-free token pruning method for video multimodal large language models (MLLMs) that achieves high-ratio compression through spatial-temporal forest modeling. It constructs token forests across video frames based on semantic, spatial, and temporal constraints, scores token trees and nodes by tree depth and node role, and derives a global pruning decision from those scores. On LLaVA-OneVision, ForestPrune retains 95.8% average accuracy while reducing tokens by 90%. On LLaVA-Video, it outperforms FrameFusion by 10.1% accuracy on MLVU with 81.4% less pruning time.

ForestPrune 是一种用于视频多模态大语言模型（MLLMs）的新颖 token 剪枝方法，使用空间-时间森林建模实现高比例 token 压缩。通过在视频帧之间构建 token 森林并评估 token 树和节点的重要性，ForestPrune 有效地剪枝 token 同时保持模型准确性。在 LLaVA-Video 和 LLaVA-OneVision 上的实验表明，ForestPrune 在 90% token 剪枝的情况下保留了 95.8% 的平均准确性，并在各种视频基准上的准确性和效率方面优于其他方法。
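
A rough sketch of how a token forest across frames might be built is given below. It is a simplification under stated assumptions, not the ForestPrune procedure: a token becomes the child of the token at the same spatial position in the previous frame when their cosine similarity reaches a hypothetical `threshold`, and pruning simply keeps tree roots. The paper's importance scoring by tree depth and node role is not modelled, and the feature values are made up.

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def build_forest(frames, threshold=0.95):
    """frames[t][p] is the feature of the token at position p in frame t.

    Returns a dict mapping each node (t, p) to its parent (t - 1, p),
    or to None when the node starts a new tree.
    """
    parent = {}
    for t, frame in enumerate(frames):
        for p, feat in enumerate(frame):
            if t > 0 and cosine(feat, frames[t - 1][p]) >= threshold:
                parent[(t, p)] = (t - 1, p)  # redundant with the previous frame
            else:
                parent[(t, p)] = None        # new content: root of a new tree
    return parent

def keep_roots(parent):
    """Simplest possible pruning rule (an assumption): keep only tree roots."""
    return sorted(node for node, par in parent.items() if par is None)

# Three frames, two spatial positions each (hypothetical 2-D features).
frames = [
    [[1.0, 0.00], [0.0, 1.0]],
    [[1.0, 0.01], [0.0, 1.0]],
    [[1.0, 0.02], [1.0, 1.0]],  # position 1 changes content in the last frame
]

forest = build_forest(frames)
kept = keep_roots(forest)
print("kept tokens:", kept, "of", len(forest))
```

Static content collapses into one tree per position, while the token whose content changes in the last frame starts its own tree and is kept.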
