arXiv Paper Digest

Snapshot: 20260509_0426

BAMI: Training-Free Bias Mitigation in GUI Grounding

Authors: Borui Zhang, Bo Zhang, Bo Wang, Wenzhao Zheng, Yuhao Cheng, Liang Tang, Yiqiang Yan, Jie Zhou, Jiwen Lu

Venue: CVPR 2026

First: 2026-05-07T17:59:31+00:00 · Latest: 2026-05-07T17:59:31+00:00

Comments: Accepted by CVPR 2026

Abstract

GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed Masked Prediction Distribution (MPD) attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce Bias-Aware Manipulation Inference (BAMI), which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. Our extensive experimental results demonstrate that BAMI significantly enhances the accuracy of various GUI grounding models in a training-free setting. For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9% to 57.8%. Furthermore, ablation studies confirm the robustness of the BAMI approach across diverse parameter configurations, highlighting its stability and effectiveness. Code is available at https://github.com/Neur-IO/BAMI.

Summary / 总结

The research aims to improve the accuracy of GUI grounding models in complex scenarios by addressing precision and ambiguity biases. The method, BAMI, uses a coarse-to-fine focus and candidate selection to mitigate these biases. Experiments show that BAMI significantly enhances the accuracy of various GUI grounding models, such as increasing the TianXi-Action-7B model's accuracy on the ScreenSpot-Pro benchmark from 51.9% to 57.8%. Ablation studies confirm the robustness of BAMI across different configurations.

研究旨在通过解决精确性和模糊性偏误来提高GUI定位模型在复杂场景中的准确性。方法BAMI采用粗细聚焦和候选选择来缓解这些偏误。实验表明，BAMI显著提升了各种GUI定位模型的准确性，例如将TianXi-Action-7B模型在ScreenSpot-Pro基准上的准确率从51.9%提高到57.8%。消融研究证实了BAMI在不同配置下的稳定性和有效性。
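
The coarse-to-fine focus described above can be pictured as coordinate bookkeeping: predict on the full screenshot, crop a window around that prediction, predict again inside the crop, and map the result back. The following is a minimal sketch under stated assumptions, not BAMI's implementation: `crop_box`, `to_full_coords`, `coarse_to_fine`, the `ratio` parameter, and the `predict` callback are all hypothetical names, and candidate selection is not modeled.

```python
def crop_box(point, image_size, ratio=0.5):
    """Return (left, top, right, bottom) of a window centred on `point`.

    `ratio` is the fraction of each image dimension that is kept. The
    window is shifted, not shrunk, when it would cross the image border.
    """
    (x, y), (w, h) = point, image_size
    cw, ch = int(w * ratio), int(h * ratio)
    left = min(max(x - cw // 2, 0), w - cw)
    top = min(max(y - ch // 2, 0), h - ch)
    return left, top, left + cw, top + ch


def to_full_coords(point_in_crop, box):
    """Map a point predicted inside the crop back to full-image pixels."""
    left, top, _, _ = box
    return point_in_crop[0] + left, point_in_crop[1] + top


def coarse_to_fine(predict, image_size, steps=2, ratio=0.5):
    """Repeatedly zoom into the region around the current prediction.

    `predict(box)` is a hypothetical stand-in for a grounding model: it
    receives the crop box and returns a point in crop-local pixels.
    """
    box = (0, 0, image_size[0], image_size[1])
    point = to_full_coords(predict(box), box)
    for _ in range(steps):
        size = (box[2] - box[0], box[3] - box[1])
        local = (point[0] - box[0], point[1] - box[1])
        l, t, r, b = crop_box(local, size, ratio)
        box = (box[0] + l, box[1] + t, box[0] + r, box[1] + b)
        point = to_full_coords(predict(box), box)
    return point
```

Each zoom step shrinks the visible region, so a fixed model input resolution covers fewer screen pixels per step, which is one plausible way to reduce the precision bias the abstract attributes to high-resolution screenshots.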

Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

Authors: Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava

First: 2026-05-07T17:54:29+00:00 · Latest: 2026-05-07T17:54:29+00:00

Abstract

Retrieval-augmented agents are increasingly the interface to large organizational knowledge bases, yet most still treat retrieval as a black box: they issue exploratory queries, inspect returned snippets, and iteratively reformulate until useful evidence emerges. This approach resembles how a newcomer searches an unfamiliar database rather than how an expert navigates it with strong priors about terminology and likely evidence, and results in unnecessary retrieval rounds, increased latency, and poor recall. We introduce SuperIntelligent Retrieval Agent (SIRA), which defines superintelligence in retrieval as the ability to compress multi-round exploratory search into a single corpus-discriminative retrieval action. SIRA does not merely ask what terms are relevant to the query; it asks which terms are likely to separate the desired evidence from corpus-level confusers. On the corpus side, an LLM enriches each document offline with missing search vocabulary; on the query side, it predicts evidence vocabulary omitted by the query; and document-frequency statistics, exposed as a tool call, filter proposed terms that are absent, overly common, or unlikely to create retrieval margin. The final retrieval step is a single weighted BM25 call combining the original query with the validated expansion. Across ten BEIR benchmarks and downstream question-answering tasks, SIRA significantly outperforms dense retrievers and state-of-the-art multi-round agentic baselines, demonstrating that one well-formed lexical query, guided by LLM cognition and lightweight corpus statistics, can exceed substantially more expensive multi-round search while remaining interpretable, training-free, and efficient.

Summary / 总结

The work introduces the SuperIntelligent Retrieval Agent (SIRA), which aims to compress multi-round exploratory search into a single corpus-discriminative retrieval action. An LLM enriches each document offline with missing search vocabulary and predicts evidence vocabulary omitted by the query, while document-frequency statistics filter out proposed terms that are absent, overly common, or unlikely to create retrieval margin. The final step is a single weighted BM25 call that combines the original query with the validated expansion. Across ten BEIR benchmarks and downstream question-answering tasks, SIRA outperforms dense retrievers and multi-round agentic baselines while remaining interpretable, training-free, and efficient.
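
The filter-then-retrieve step can be sketched as follows. This is an illustrative sketch, not SIRA's code: the helper names, the `max_df_ratio` cutoff, and the way term weights enter the score are assumptions. The scoring itself is the standard Okapi BM25 formula with a per-term weight multiplied in.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Number of documents containing each term (docs are token lists)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def filter_terms(candidates, df, n_docs, max_df_ratio=0.5):
    """Keep proposed terms that occur in the corpus but are not too common."""
    return [t for t in candidates if 0 < df[t] <= max_df_ratio * n_docs]


def weighted_bm25(query_weights, docs, df, k1=1.2, b=0.75):
    """Score every document with BM25, scaling each term by its weight."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term, weight in query_weights.items():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += weight * idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

In this sketch the original query terms would carry weight 1.0 and validated expansion terms a smaller weight, so a single call to `weighted_bm25` plays the role of the single retrieval action.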

DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression

Authors: Numan Saeed, Asif Hanif, Fadillah Adamsyah Maani, Hussain Alasmawi, Mohammad Yaqub

First: 2026-03-05T17:43:00+00:00 · Latest: 2026-05-07T17:28:16+00:00

Comments: Project website: www.numansaeed.com/mobilefetalclip

Abstract

Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude or more. We argue that, under such gaps, strict imitation of the teacher is a poor objective: much of the teacher's pairwise similarity structure reflects its own architectural biases rather than information a compact student can efficiently represent. We propose Diagonal-Anchored Repulsive Knowledge Distillation (DARK), a contrastive KD framework that decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment throughout training; the off-diagonal term is annealed from positive to negative weighting, transitioning the student from imitating to repelling the teacher's non-target similarity structure. We instantiate DARK by distilling FetalCLIP, a 427M-parameter fetal ultrasound vision-language model, into MobileFetalCLIP, a 75M-parameter student model with a $26\times$ smaller visual encoder, running in 1.6 ms on an iPhone 16 Pro. The student matches or exceeds its teacher on three zero-shot benchmarks, including HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702). Embedding-geometry and logit analyses show that DARK induces structured decorrelation: the student preserves teacher-aligned per-image confidence while diverging from inherited inter-class confusion, suggesting that controlled repulsion can be more efficient than imitation under extreme compression.

中文标题/摘要

标题：DARK：在极端压缩下用于视觉-语言模型的对角锚定排斥知识蒸馏

将视觉-语言模型压缩以适应设备部署在临床环境中变得越来越重要，但当教师-学生容量差距达到一个数量级或更大时，知识蒸馏（KD）会急剧下降。我们认为，在这种差距下，严格模仿教师是一个糟糕的目标：教师的许多成对相似性结构反映了其自身的架构偏见，而不是紧凑学生可以高效表示的信息。我们提出了**对角锚定排斥知识蒸馏（DARK）**，这是一种对比度KD框架，将蒸馏损失分解为对角项（匹配的图像-文本对）和离对角项（非目标相似性）。对角项在整个训练过程中锚定了匹配对的对齐；离对角项从正权值逐渐变为负权值，使学生从模仿转变为**排斥**教师的非目标相似性结构。我们通过将一个4.27亿参数的胎儿超声视觉-语言模型FetalCLIP蒸馏为一个7500万参数的学生模型MobileFetalCLIP来实例化DARK，该模型在iPhone 16 Pro上运行时间为1.6毫秒。学生在三个零样本基准测试中与教师匹配或超过教师，包括HC18生物测量有效性（88.6% vs. 83.5%）和脑亚平面F1（0.784 vs. 0.702）。嵌入几何和logit分析表明，DARK诱导了**结构化去相关**：学生保留了教师对齐的每张图像置信度，同时从继承的类间混淆中发散，表明在极端压缩下控制排斥可能比模仿更有效。
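
To make the diagonal/off-diagonal decomposition concrete, here is a toy sketch. The abstract does not give the loss formula, so the squared-error terms, the linear schedule, and the names `offdiag_weight` and `dark_style_loss` are all assumptions chosen for illustration. Only the overall shape (an always-on diagonal term plus an off-diagonal term whose weight is annealed from positive to negative) follows the description.

```python
def offdiag_weight(step, total_steps, start=1.0, end=-1.0):
    """Linearly anneal the off-diagonal weight from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def dark_style_loss(student, teacher, step, total_steps):
    """Toy loss over n x n image-text similarity matrices (lists of lists).

    Diagonal entries are matched pairs; off-diagonal entries are the
    non-target similarities. Squared error is an assumed stand-in.
    """
    n = len(student)
    diag = sum((student[i][i] - teacher[i][i]) ** 2 for i in range(n)) / n
    off = sum(
        (student[i][j] - teacher[i][j]) ** 2
        for i in range(n)
        for j in range(n)
        if i != j
    ) / (n * (n - 1))
    return diag + offdiag_weight(step, total_steps) * off
```

Early in training the off-diagonal weight is positive, so minimizing the loss pulls the student toward the teacher's non-target similarities. Once the weight turns negative, the same term rewards moving away from them while the diagonal term keeps matched pairs aligned.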

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

Authors: Jiayun Luo, Mir Rayat Imtiaz Hossain, Pritam Sarkar, Boyang Li, Leonid Sigal

First: 2024-12-11T05:36:18+00:00 · Latest: 2026-05-07T17:14:28+00:00

Abstract

Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance degrades for complex, multi-object references. These limitations largely arise from training objectives that leverage image-caption alignment, where direct multi-object references are rare, the number of possible such references is theoretically large (exponential in the number of objects), and attribution is difficult. To address this, without requiring any additional annotations, we propose Compositional Attention-Regularized Training (CompART), which decomposes captions into object-centric phrases and constructs composite phrases by pairing them with conjunctions. We then introduce a composition loss that encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases, promoting balanced multi-object localization. We evaluate CompART across four VLM architectures, spanning both contrastive-based and generative-based models, on four benchmarks for multi-object grounding and two VQA benchmarks for general visual understanding. CompART consistently improves grounding for both single- and multi-object references across diverse VLM architectures and datasets, and further demonstrates enhanced visual understanding, as evidenced by gains on VQA, despite not being explicitly trained for this task.
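
The composition loss can be illustrated on flattened attention maps. This is a minimal sketch assuming a mean-squared-error penalty, which the abstract does not specify; `composition_loss` is a hypothetical name.

```python
def composition_loss(attn_composite, attn_parts):
    """Mean squared gap between composite attention and the sum of parts.

    `attn_composite` is the flattened attention map for a phrase such as
    "the dog and the ball"; `attn_parts` holds one flattened map per
    constituent phrase ("the dog", "the ball").
    """
    total = [sum(vals) for vals in zip(*attn_parts)]
    return sum((c - t) ** 2 for c, t in zip(attn_composite, total)) / len(total)
```

A composite map that attends to only one of the two objects leaves a residual at the other object's location, so the penalty is nonzero until attention covers both.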

Flow-Based Conformal Predictive Distributions

Authors: Trevor Harris

First: 2026-02-07T17:26:50+00:00 · Latest: 2026-05-07T17:00:28+00:00

Comments: 9 pages, 15 figures, 20 appendix pages

Abstract

Conformal prediction provides a distribution-free framework for uncertainty quantification via prediction sets with exact finite-sample coverage. In low dimensions these sets are easy to interpret, but in high-dimensional or structured output spaces they are difficult to represent and use, which can limit their ability to integrate with downstream tasks such as sampling and probabilistic forecasting. We show that any sufficiently regular differentiable nonconformity score induces a deterministic flow on the output space whose trajectories converge to the boundary of the corresponding conformal prediction set. This leads to a computationally efficient, training-free method for sampling conformal boundaries in arbitrary dimensions. Mixing across confidence levels yields conformal predictive distributions whose quantile regions coincide with the empirical conformal prediction sets. We provide an approximation bound decomposing CPD predictive error into score-induced distortion, base-measure quality, and gradient flow-induced distortion. We evaluate the approach on PDE inverse problems, precipitation downscaling, climate model debiasing, and hurricane trajectory forecasting.
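
As a toy illustration of a flow whose trajectories converge to a conformal boundary, consider the score s(y) = ||y - center|| with a given threshold q, so that the prediction set is a ball and its boundary is the sphere s(y) = q. The flow dy/dt = -(s(y) - q) * grad s(y) used below is an assumed form chosen for simplicity. The abstract states only that such a flow exists for sufficiently regular scores, and the names here are hypothetical.

```python
import math


def score(y, center):
    """Nonconformity score: Euclidean distance to the point prediction."""
    return math.dist(y, center)


def score_grad(y, center):
    """Gradient of the score (unit vector away from the centre).

    Undefined at y == center, where the distance is zero.
    """
    s = score(y, center)
    return [(a - c) / s for a, c in zip(y, center)]


def flow_to_boundary(y, center, q, step=0.1, n_steps=200):
    """Euler-integrate dy/dt = -(s(y) - q) * grad s(y) from the start point.

    Points outside the set move inward and points inside move outward,
    so the trajectory settles where s(y) == q.
    """
    for _ in range(n_steps):
        gap = score(y, center) - q
        grad = score_grad(y, center)
        y = [a - step * gap * g for a, g in zip(y, grad)]
    return y
```

Starting many points from random locations and flowing each one to the boundary gives boundary samples without training any model, which is the kind of use the abstract describes.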

DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

Authors: Taewon Kang, Matthias Zwicker

First: 2026-05-07T16:22:21+00:00 · Latest: 2026-05-07T16:22:21+00:00

Comments: 40 pages, 33 figures

Abstract

Diffusion models generate realistic visual content, yet often fail to produce rare but plausible compositions. When prompted with combinations that are valid but underrepresented in training data, such as a snowy beach or a rainbow at night, the generation process frequently collapses toward more common alternatives. We identify this failure mode as default completion bias, where denoising trajectories are implicitly attracted toward high-frequency semantic configurations. Existing guidance mechanisms do not explicitly model this competing tendency and therefore struggle to prevent such collapse. We introduce Default Completion Repulsion (DCR), a training-free framework that explicitly models and suppresses default completion behavior. DCR constructs a counterfactual attractor by relaxing the rare compositional factor while preserving surrounding semantics, inducing an alternative denoising trajectory reflecting the model's preferred completion. We define the discrepancy between target and attractor trajectories as a counterfactual drift, and propose a projection-based repulsion mechanism that removes guidance components aligned with this drift direction. This suppresses undesired frequent completions while preserving other semantic components. DCR operates entirely within the standard diffusion sampling process without retraining or architectural modification. Experiments on rare compositional prompts show that DCR improves compositional fidelity while maintaining visual quality. Our analysis further shows that the framework exposes and counteracts intrinsic model biases, offering a new perspective on controllable generation beyond explicit constraint enforcement.

Summary / 总结

The research addresses the issue of diffusion models generating common compositions instead of rare but plausible ones when prompted with underrepresented combinations. It introduces Default Completion Repulsion (DCR), a training-free framework that models and suppresses default completion behavior by constructing a counterfactual attractor. Experiments demonstrate that DCR enhances compositional fidelity without compromising visual quality and exposes intrinsic model biases, providing a new approach to controllable generation.

研究解决了当提示模型生成稀有但合理的组合时，扩散模型倾向于生成常见组合的问题。提出了Default Completion Repulsion (DCR)框架，通过构建反事实吸引子来建模并抑制默认完成行为。实验表明，DCR在不牺牲视觉质量的情况下提高了组合的准确性，并揭示了模型的内在偏差，为可控生成提供了新的视角。
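
The projection-based repulsion can be sketched with plain vector algebra: subtract from the guidance vector its component along the drift direction. This is a minimal sketch, assuming the drift is available as a vector of the same shape as the guidance. The function name is hypothetical, and whether DCR removes the full component or only its positively aligned part is not stated in the abstract.

```python
def remove_drift_component(guidance, drift, eps=1e-12):
    """Subtract from `guidance` its orthogonal projection onto `drift`.

    Returns the guidance unchanged when the drift is (numerically) zero.
    """
    dot = sum(g * d for g, d in zip(guidance, drift))
    norm_sq = sum(d * d for d in drift)
    if norm_sq < eps:
        return list(guidance)
    coef = dot / norm_sq
    return [g - coef * d for g, d in zip(guidance, drift)]
```

The result is orthogonal to the drift, so the remaining guidance no longer pushes the sample along the direction that separates the target trajectory from the default completion.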

FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

Authors: Fangda Chen, Shanshan Zhao, Longrong Yang, Chuanfu Xu, Zhigang Luo, Long Lan

First: 2026-05-07T16:21:34+00:00 · Latest: 2026-05-07T16:21:34+00:00

Abstract

Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, such as in camera motion and sequential motion. We analyze the video temporal extension issue from a singular-spectrum perspective and show that enlarged self-attention windows induce spectral concentration: spectral energy becomes dominated by a few low-rank singular directions, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variations. To mitigate this problem, we propose FreeSpec, a training-free spectral reconstruction framework for long-video generation. FreeSpec decomposes global and local features with singular value decomposition, and uses the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid feature partitioning of previous decomposition rules, preserving long-range consistency while better retaining spatial details and temporal dynamics. Experiments on Wan2.1 and LTX-Video demonstrate that FreeSpec improves long-video generation, especially for temporal dynamics, while maintaining strong visual quality and temporal consistency. Project demo: https://fdchen24.github.io/FreeSpec-Website/.

Summary / 总结

FreeSpec is a training-free spectral reconstruction framework for long-video generation, addressing content drift and temporal inconsistency by decomposing global and local features with singular value decomposition. It uses the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis, preserving long-range consistency while retaining spatial details and temporal dynamics. Experiments show that FreeSpec improves temporal dynamics in long-video generation while maintaining strong visual quality and temporal consistency.

FreeSpec 是一种无需训练的光谱重建框架，用于长视频生成，通过奇异值分解分解全局和局部特征来解决内容漂移和时间不一致问题。它使用全局分支作为低秩光谱指导，局部分支作为高秩重建基础，同时保持长时间一致性并保留空间细节和时间动态。实验表明，FreeSpec 在长视频生成中改善了时间动态，同时保持了强大的视觉质量和时间一致性。
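
One simple way to picture spectrum-level fusion is to take the leading singular components from the global features and the remaining ones from the local features. The sketch below does exactly that with NumPy. It is an assumed simplification rather than FreeSpec's fusion rule, and `spectral_fuse` and `rank` are hypothetical names.

```python
import numpy as np


def spectral_fuse(global_feat, local_feat, rank):
    """Combine the top-`rank` singular components of `global_feat` with
    the remaining (higher-index) components of `local_feat`.

    Both inputs are 2-D arrays of the same shape (tokens x channels).
    """
    ug, sg, vgt = np.linalg.svd(global_feat, full_matrices=False)
    ul, sl, vlt = np.linalg.svd(local_feat, full_matrices=False)
    low = (ug[:, :rank] * sg[:rank]) @ vgt[:rank]    # coarse structure
    high = (ul[:, rank:] * sl[rank:]) @ vlt[rank:]   # details and motion
    return low + high
```

With `rank=0` the output equals the local features, and with `rank` equal to the number of singular values it equals the global features, so `rank` acts as a dial between long-range consistency and local detail in this toy version.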

GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs

Authors: Pranav Mantini, Shishir K. Shah

First: 2026-05-07T16:01:59+00:00 · Latest: 2026-05-07T16:01:59+00:00

Abstract

We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity ($O(1)$), regardless of the number of integrated experts. Experimental results across multi-domain adaptation and class-incremental learning show that GeoStack provides an efficient mechanism for long-term knowledge composition while significantly mitigating catastrophic forgetting. Code is available at https://github.com/QuantitativeImagingLaboratory/GeoStack.

中文标题/摘要

标题：GeoStack：VLM中准阿贝尔知识组合的框架

我们解决了视觉-语言模型（VLMs）中的知识组合挑战，其中在多个领域或任务中积累专业知识通常会导致灾难性遗忘。我们提出了GeoStack（几何堆叠），这是一种模块化框架，允许独立训练的领域专家被组合成一个统一的模型。通过在适配器流形上施加几何和结构约束，GeoStack 确保了基础模型的知识得以保留。此外，我们从数学上证明了权重折叠特性，实现了常数时间推理复杂度（$O(1)$），与集成专家的数量无关。跨多领域适应和类增量学习的实验结果表明，GeoStack 提供了一种有效的长期知识组合机制，同时显著减轻了灾难性遗忘。代码可在 https://github.com/QuantitativeImagingLaboratory/GeoStack 获取。
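
The idea behind weight folding can be shown with additive low-rank adapters: if each expert contributes an update B @ A to a base weight, the updates can be summed into one matrix ahead of time, and inference then costs the same for any number of experts. This is a generic illustration under that assumption. GeoStack's geometric constraints and its actual composition rule are not represented, and all names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fold(base, adapters):
    """Fold every low-rank update B @ A into a single weight matrix."""
    w = base
    for b, a in adapters:
        w = add(w, matmul(b, a))
    return w


def forward(w, x):
    """Apply the folded weight to an input vector: one matrix-vector product."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
```

Because addition commutes, the folded weight in this sketch does not depend on the order in which experts are added.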

A Regime Theory of Controller Class Selection for LLM Action Decisions

Authors: Zhaoyang Jiang, Zhizhong Fu, Yunsoo Kim, Jiacong Mi, Zicheng Li, Xuanqi Peng, Honghan Wu

First: 2026-05-07T14:28:17+00:00 · Latest: 2026-05-07T14:28:17+00:00

Abstract

Deployed language and vision-language models must decide, on each input, whether to answer directly, retrieve evidence, defer to a stronger model, or abstain. Contrary to the common monotonicity intuition, greater per-input expressivity is not uniformly beneficial in finite samples: under identical strict cross-validation, different benchmarks prefer different controller classes. This reflects a finite-sample limitation of instance-level uncertainty signals, which can be exhausted at a distribution-dependent scale. We organize controllers into a nested lattice of four classes: fixed actions, partition routers, instance-level controllers, and prior-gated controllers, ordered by complexity. We prove a regime theory that turns three data-estimable bottlenecks into a class choice: how much improvement is possible beyond the best fixed action, whether there are enough samples for instance-level controllers to make reliable decisions, and how much improvement a coarse partition router can recover when instance-level signal is unreliable. The resulting Bernstein-tight threshold has a matching information-theoretic lower bound, and strict nested cross-validation provably selects a near-best class. Across SMS-Spam, HallusionBench, A-OKVQA, and FOLIO, the predicted class matches the empirical winner; the prior-gated controller wins on TextVQA when OCR tokens supply a label-free prediction-time prior. Code is available at https://github.com/Anonymous-Awesome-Submissions/Regime-Theory.

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

Authors: Zecheng Tang, Jiaye Fu, Qiankun Gao, Haijie Li, Yanmin Wu, Jiaqi Zhang, Siwei Ma, Jian Zhang

First: 2026-05-07T13:45:37+00:00 · Latest: 2026-05-07T13:45:37+00:00

Abstract

Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, $π^3$, and Depth-Anything-3, and achieves up to $28\times$ speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
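
The asymmetric treatment of queries and key-value tokens can be sketched on a single attention call. This is a toy sketch, not Spark3R's implementation: merging consecutive queries by averaging, pruning keys by norm, and copying each merged output back to its group members are all assumed choices, and the function names are hypothetical.

```python
import math


def merge_queries(q, group):
    """Average each run of `group` consecutive query tokens."""
    merged = []
    for i in range(0, len(q), group):
        chunk = q[i:i + group]
        merged.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return merged


def prune_kv(k, v, keep):
    """Keep the `keep` key/value pairs whose keys have the largest norm."""
    order = sorted(range(len(k)), key=lambda i: -sum(x * x for x in k[i]))
    kept = sorted(order[:keep])
    return [k[i] for i in kept], [v[i] for i in kept]


def attention(q, k, v):
    """Plain scaled dot-product attention on lists of vectors."""
    scale = 1 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        logits = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(logits)
        w = [math.exp(l - top) for l in logits]
        z = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / z
                    for d in range(len(v[0]))])
    return out


def asymmetric_attention(q, k, v, q_group, kv_keep):
    """Mild merging for queries, aggressive pruning for keys and values."""
    merged = merge_queries(q, q_group)
    kp, vp = prune_kv(k, v, kv_keep)
    out = attention(merged, kp, vp)
    # Un-merge: every original query receives its group's output.
    return [out[i // q_group] for i in range(len(q))]
```

The attention cost scales with the product of the two reduced lengths, so `q_group` and `kv_keep` can be set independently, with `kv_keep` reduced much more sharply than the query count.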

Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs

Authors: Andy Zeyi Liu, Michael Zhang, Ilana Greenberg, Adam Alnasser, Lucas Baker, John Sous

First: 2026-05-07T13:19:33+00:00 · Latest: 2026-05-07T13:19:33+00:00

Abstract

Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activation steering is compact but typically weaker and does not support large structured reminders. We introduce memory inception (MI), a training-free method that steers in latent attention space by inserting text-derived key-value (KV) banks only at selected layers. Rather than materializing reminder content throughout the prompt cache, MI treats steering as selective KV allocation, injecting latent slots only where the model routes to them. On matched personality-steering tasks, MI gives the best overall control-drift trade-off, remaining competitive with prompting while consistently outperforming CAA. On updateable guidance, MI supports mid-conversation behavior shifts without rewriting the visible transcript, achieving the highest post-shift alignment on Qwen3. On structured reasoning, with HARDMath and PHYSICS serving as proxies for structured reasoning in verifiable domains, MI outperforms visible prompting (10/12 subject$\times$mode cells) while cutting content-matched KV storage by up to 118$\times$. These results position MI as a powerful steering method when guidance is persistent, structured, or expensive to keep in the visible transcript.

Summary / 总结

Memory inception (MI) is a training-free method that steers LLMs in latent attention space by inserting text-derived KV banks only at selected layers. On matched personality-steering tasks, MI gives the best overall control-drift trade-off, remaining competitive with prompting and outperforming CAA. It supports mid-conversation behavior shifts without rewriting the visible transcript, achieving the highest post-shift alignment on Qwen3, and it outperforms visible prompting on HARDMath and PHYSICS while cutting content-matched KV storage by up to 118$\times$.
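
The storage argument behind selective KV allocation can be illustrated with a toy per-layer cache. This is a sketch under stated assumptions: the cache layout (a dict from layer index to a pair of key and value lists), `build_cache`, and `cache_slots` are hypothetical, and how MI derives the bank from text or chooses layers is not modeled.

```python
def build_cache(prompt_kv, bank_kv, selected_layers):
    """Prepend the bank's KV pairs to the cache at selected layers only.

    `prompt_kv` maps layer -> (keys, values) for the visible prompt and
    `bank_kv` maps layer -> (keys, values) for the steering bank.
    """
    cache = {}
    for layer, (keys, values) in prompt_kv.items():
        if layer in selected_layers:
            bank_keys, bank_values = bank_kv[layer]
            cache[layer] = (bank_keys + keys, bank_values + values)
        else:
            cache[layer] = (keys, values)
    return cache


def cache_slots(cache):
    """Total number of cached key slots across all layers."""
    return sum(len(keys) for keys, _ in cache.values())
```

With 4 layers, 3 prompt tokens, and a 2-slot bank at 2 selected layers, this cache holds 16 slots, whereas placing the same 2 guidance tokens in the visible prompt would occupy all 4 layers and hold 20.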

MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models

Authors: Xunlan Zhou, Xuanlin Chen, Shaowei Zhang, ShengHua Wan, Xiaohai Hu, Lei Yuan, De-chuan Zhan

First: 2026-01-28T11:25:13+00:00 · Latest: 2026-05-07T13:06:58+00:00

Abstract

Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL (Multi-stAge guidance for Robotic manipulation via Vision-Language models). MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.

中文标题/摘要

标题：MARVL：通过视觉语言模型的多阶段指导进行机器人操作

设计密集奖励函数对于高效的机器人强化学习（RL）至关重要。然而，大多数密集奖励依赖于手动工程，从根本上限制了强化学习的可扩展性和自动化。虽然视觉语言模型（VLM）为奖励设计提供了有希望的途径，但简单的VLM奖励往往与任务进展不一致，难以进行空间定位，并且对任务语义的理解有限。为了解决这些问题，我们提出了MARVL——通过视觉语言模型的多阶段指导进行机器人操作。MARVL 对VLM 进行微调以实现空间和语义一致性，并将任务分解为具有任务方向投影的多阶段子任务，以提高轨迹敏感性。实验表明，MARVL 在Meta-World基准测试中显著优于现有的VLM奖励方法，展示了在稀疏奖励操作任务上的优越样本效率和鲁棒性。

Summary / 总结

MARVL is designed to improve the efficiency of robotic reinforcement learning by addressing the limitations of manually engineered dense rewards and naive Vision-Language Model (VLM) rewards. It fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection. MARVL shows significant performance improvements over existing VLM-reward methods on the Meta-World benchmark, indicating better sample efficiency and robustness in sparse-reward manipulation tasks.

MARVL旨在通过解决手动工程化密集奖励和简单视觉-语言模型（VLM）奖励的局限性，提高机器人强化学习的效率。它对VLM进行微调以实现空间和语义一致性，并将任务分解为多阶段子任务，带有任务方向投影。MARVL在Meta-World基准测试中显著优于现有VLM奖励方法，显示出在稀疏奖励操作任务中的更好样本效率和鲁棒性。

Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios

Authors: Peizheng Yan, Yu Zhao, Liang Xie, Juntong Qi, Mingming Wang, Erwei Yin

First: 2026-05-07T13:01:28+00:00 · Latest: 2026-05-07T13:01:28+00:00

Abstract

Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.

中文标题/摘要

标题：事件因果RAG：一种用于复杂场景下长视频推理的检索增强生成框架

近期大型视觉-语言模型在短至中等长度视频理解方面取得了出色的表现，但在超长或甚至无限长视频推理方面仍显不足，其中模型必须在长时间内保持连贯的记忆并推断时间上相隔甚远的事件之间的因果关系。现有的端到端视频理解方法从根本上受限于自注意力的$O(n^2)$复杂性，而最近的检索增强生成（RAG）方法仍然存在片段级记忆碎片化、时间因果结构建模能力弱以及高存储和在线推理成本的问题。我们提出了一种轻量级的检索增强框架Event-Causal RAG，用于无限长视频推理。我们的方法不是索引固定长度的片段，而是将流式视频分割成语义上连贯的事件，并将每个事件表示为一个结构化的状态-事件-状态（SES）图，捕捉事件及其周围的状态转换。这些图被合并到全局事件知识图中，并存储在一个支持语义匹配和因果拓扑检索的双存储器中。在此存储器之上，我们设计了一种双向检索策略，以高效地识别最相关的事件因果链，并将它们与相关的视频证据一起提供给骨干视频基础模型进行答案生成。在长视频理解基准测试上，Event-Causal RAG 一致地优于强大的片段基线检索方法和长上下文视频模型，特别是在需要多事件整合和长时间间隔因果推理的问题上，同时实现了更好的内存效率和稳健的流式性能。

Summary / 总结

Event-Causal RAG is a lightweight retrieval-augmented framework designed for long-video reasoning, addressing the limitations of existing methods by segmenting videos into semantically coherent events and representing them as structured SES graphs. These graphs are stored in a dual-store memory that supports semantic matching and causal-topological retrieval, enabling efficient identification of relevant event causal chains for answer generation. Experiments show that Event-Causal RAG outperforms clip-based retrieval baselines and long-context video models, especially in tasks requiring multi-event integration and causal inference across long temporal gaps, while improving memory efficiency and streaming performance.

Event-Causal RAG 是一种轻量级的检索增强框架，旨在处理长视频推理问题，通过将视频分割为语义上连贯的事件，并以结构化的 State-Event-State (SES) 图表示这些事件来解决现有方法的局限性。这些图存储在一个支持语义匹配和因果拓扑检索的双存储器中，能够高效地识别相关的事件因果链以生成答案。实验表明，Event-Causal RAG 在需要多事件集成和长时间间隔的因果推理任务中优于基于片段的检索基线和长上下文视频模型，同时提高了内存效率和流式性能。
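
The State-Event-State representation and the bidirectional walk over causal links can be sketched with a small in-memory graph. This is an illustrative sketch only: the `SESEvent` fields, `link`, `causal_chain`, and the hop limit `depth` are hypothetical, and the seed event is assumed to have been found already by semantic matching.

```python
from dataclasses import dataclass, field


@dataclass
class SESEvent:
    event_id: str
    pre_state: str    # scene state before the event
    event: str        # what happened
    post_state: str   # scene state after the event
    causes: list = field(default_factory=list)   # ids of upstream events
    effects: list = field(default_factory=list)  # ids of downstream events


def link(graph, cause_id, effect_id):
    """Record a causal edge in both directions."""
    graph[cause_id].effects.append(effect_id)
    graph[effect_id].causes.append(cause_id)


def causal_chain(graph, seed_id, depth=2):
    """Walk up to `depth` hops backward (causes) and forward (effects)."""
    def walk(attr):
        seen, frontier = [], [seed_id]
        for _ in range(depth):
            nxt = [n for f in frontier for n in getattr(graph[f], attr)]
            frontier = [n for n in dict.fromkeys(nxt)
                        if n not in seen and n != seed_id]
            seen.extend(frontier)
        return seen

    causes = walk("causes")
    effects = walk("effects")
    return causes[::-1] + [seed_id] + effects
```

The returned list orders events from earliest cause to latest effect, which is the kind of chain that could be handed to a video model together with the matching clips.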

Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation

Authors: Abdelrahman Zaian, Sheethal Bhat, Mohamed Abdalkader, Andreas Maier

Venue: MICCAI 2026 (submitted)

First: 2026-05-07T12:54:53+00:00 · Latest: 2026-05-07T12:54:53+00:00

Comments: 10 pages, 5 figures. Submitted to MICCAI 2026

Abstract

Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially outperforming zero-shot Qwen (0.096, 0.732) and MMed-RAG (0.541, 0.641) on a retinal disease detection dataset with captions. For report generation, Retina-RAG attains ROUGE-L 0.429 and SBERT similarity 0.884, exceeding all baselines. The full framework operates on a single consumer-grade GPU, demonstrating that clinically structured retinal AI can be achieved with modest computational resources.

Summary / 总结

Retina-RAG is a modular framework that jointly performs diabetic retinopathy (DR) severity grading, macular edema (ME) detection, and report generation. It pairs a high-performance retinal classifier with a parameter-efficient vision-language model adapted via Low-Rank Adaptation, and uses a retrieval-augmented generation module to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves F1-scores of 0.731 for DR grading and 0.948 for ME detection, compared with 0.096 and 0.732 for zero-shot Qwen and 0.541 and 0.641 for MMed-RAG, and exceeds all baselines in report generation with ROUGE-L 0.429 and SBERT similarity 0.884. The framework operates on a single consumer-grade GPU, so it needs only modest computational resources.

Retina-RAG 是一个模块化框架，结合了 DR 严重程度分级、ME 检测和报告生成。它使用高性能的视网膜分类器和通过 LoRA 调整的参数高效视觉语言模型，并在推理时注入眼科学知识以提高诊断准确性和减少幻觉。Retina-RAG 在 DR 分级和 ME 检测方面优于零样本 Qwen 和 MMed-RAG，并在报告生成指标上超过所有基线。

Uncovering Entity Identity Confusion in Multimodal Knowledge Editing

Authors: Shu Wu, Xiaotian Ye, Xinyu Mou, Dongsheng Liu, Xiaohan Wang, Mengqi Zhang

First: 2026-05-07T12:14:54+00:00 · Latest: 2026-05-07T12:14:54+00:00

Abstract

Multimodal knowledge editing (MKE) aims to correct the internal knowledge of large vision-language models after deployment, yet the behavioral patterns of post-edit models remain underexplored. In this paper, we identify a systemic failure mode in edited models, termed Entity Identity Confusion (EIC): edited models exhibit an absurd behavior where text-only queries about the original entity's identity unexpectedly return information about the new entity. To rigorously investigate EIC, we construct EC-Bench, a diagnostic benchmark that directly probes how image-entity bindings shift before and after editing. Our analysis reveals that EIC stems from existing methods failing to distinguish between Image-Entity (I-E) binding and Entity-Entity (E-E) relational knowledge in the model, causing models to overfit E-E associations as a shortcut: the image is still perceived as the original entity, with the new entity's name serving only as a spurious identity label. We further explore potential mitigation strategies, showing that constraining edits to the model's I-E processing stage encourages edits to act more faithfully on I-E binding, thereby substantially reducing EIC. Based on these findings, we discuss principled desiderata for faithful MKE and provide methodological guidance for future research.

中文标题/摘要

标题：揭示多模态知识编辑中的实体身份混淆

多模态知识编辑（MKE）旨在部署后纠正大型视觉-语言模型的内部知识，但后编辑模型的行为模式仍被广泛忽视。在本文中，我们识别出编辑模型中的一个系统性失败模式，称为实体身份混淆（EIC）：编辑后的模型表现出一种荒谬的行为，即仅通过文本查询原始实体的身份时，意外地返回了新实体的信息。为了严格研究EIC，我们构建了EC-Bench，这是一个诊断基准，直接探测图像-实体绑定在编辑前后如何变化。我们的分析表明，EIC源于现有方法无法区分模型中的图像-实体（I-E）绑定和实体-实体（E-E）关系知识，导致模型过度拟合E-E关联作为捷径：图像仍然被视为原始实体，新实体的名字仅作为虚假的身份标签。我们进一步探讨了潜在的缓解策略，表明限制编辑仅在模型的I-E处理阶段可以鼓励编辑更忠实地作用于I-E绑定，从而显著减少EIC。基于这些发现，我们讨论了忠实的MKE的基本要求，并为未来的研究提供了方法论指导。

Summary / 总结

This paper addresses the issue of Entity Identity Confusion (EIC) in multimodal knowledge editing (MKE), where edited models incorrectly return information about a new entity when queried about the original entity. To investigate EIC, the authors developed EC-Bench, a benchmark that evaluates changes in image-entity bindings. They found that existing methods fail to differentiate between image-entity and entity-entity associations, leading to overfitting on entity-entity relations. By constraining edits to the image-entity processing stage, the authors reduced EIC. The study provides insights into faithful MKE and offers guidance for future research.

本文探讨了多模态知识编辑（MKE）中的实体身份混淆（EIC）问题，即编辑后的模型在查询原始实体时错误地返回新实体的信息。为了研究EIC，作者开发了EC-Bench基准，评估图像-实体绑定的变化。他们发现现有方法未能区分图像-实体和实体-实体关联，导致模型过度拟合实体-实体关系。通过将编辑限制在图像-实体处理阶段，作者减少了EIC。研究提供了忠实MKE的原则性要求，并为未来的研究提供了方法论指导。

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Authors: Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu

Venue: ACL 2026

First: 2026-01-21T07:26:15+00:00 · Latest: 2026-05-07T12:10:26+00:00

Comments: Accepted to ACL 2026 Main

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions and achieving a 10× faster time to first token (TTFT) than the prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.

中文标题/摘要

标题：HERMES: 作为分层内存的KV缓存以提高流式视频理解效率

近期多模态大型语言模型（MLLMs）在离线视频理解方面取得了显著进步。然而，将这些能力扩展到流式视频输入仍然具有挑战性，因为现有模型难以同时保持稳定的理解性能、实时响应和低GPU内存开销。为了解决这一挑战，我们提出了一种名为HERMES的新型无训练架构，用于实时和准确地理解视频流。基于机制性注意力调查，我们将KV缓存概念化为一种分层内存框架，能够跨多个粒度封装视频信息。在推理过程中，HERMES重用紧凑的KV缓存，能够在资源受限的情况下实现高效的流式理解。值得注意的是，HERMES在用户查询到达时不需要额外的辅助计算，从而保证了连续视频流交互的实时响应，TTFT比前SOTA快10倍。即使与均匀采样相比，将视频标记减少高达68%，HERMES在所有基准测试中仍能实现优于或可比的准确性，流式数据集上最高可获得11.4%的提升。

OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention

Authors: Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab, Federico Tombari

First: 2026-05-07T12:10:07+00:00 · Latest: 2026-05-07T12:10:07+00:00

Abstract

Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic predictions on geometric structure, this formulation strengthens the coupling between geometry and semantics, leading to improved spatial coherence across similar structures in 3D space. To further enforce object-level semantic consistency, we introduce a structured codebook that serves as a set of shared semantic primitives. Furthermore, a codebook-guided attention mechanism is proposed to retrieve language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance. Extensive experiments on standard 2D and 3D open-vocabulary benchmarks demonstrate that our method consistently outperforms prior approaches, achieving improved segmentation quality, stronger 3D semantic consistency and a semantically interpretable codebook that provides insight into the learned representation.
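
The codebook-guided attention described above can be pictured as soft retrieval over a small set of shared entries. The following is a minimal sketch under assumptions, not the OpenGaFF implementation: the function name `codebook_attention`, the cosine-similarity scoring, the softmax temperature, and the array sizes are all hypothetical choices made for illustration.

```python
import numpy as np

def codebook_attention(queries, codebook_keys, codebook_values, temperature=0.1):
    """Return one retrieved feature per query as a softmax-weighted mix of codebook values.

    Hypothetical helper: scoring rule and temperature are illustrative assumptions.
    """
    # Cosine similarity between L2-normalised queries and codebook keys.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = codebook_keys / np.linalg.norm(codebook_keys, axis=1, keepdims=True)
    logits = (q @ k.T) / temperature
    # Numerically stable softmax over codebook entries.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ codebook_values, weights

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 16))     # 8 codebook entries, 16-d keys (toy sizes)
values = rng.normal(size=(8, 32))   # 32-d language features attached to each entry
# Two queries that are slightly perturbed copies of entries 2 and 5.
queries = keys[[2, 5]] + 0.01 * rng.normal(size=(2, 16))
features, weights = codebook_attention(queries, keys, values)
```

Because every query draws its feature from the same small set of entries, points belonging to one object tend to receive near-identical features, which is one way to read the reduced intra-object variance claimed in the abstract.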

中文标题/摘要

标题：OpenGaFF: 基于码本注意力的开放词汇高斯特征场

由于多视图观测中语义预测存在碎片化和空间不一致的问题，使用高斯表示理解开放词汇3D场景仍然具有挑战性。本文提出了一种基于3D高斯泼溅（3D Gaussian Splatting）的新框架OpenGaFF，用于开放词汇3D场景理解。该方法的核心是一个高斯特征场，它将语义建模为高斯几何和外观的连续函数。通过显式地将语义预测条件化于几何结构，这种表述增强了几何和语义之间的耦合，从而在3D空间中相似结构上提高了空间一致性。为了进一步增强对象级语义一致性，我们引入了一个结构化的码本，作为一组共享的语义基元。此外，我们提出了一种码本引导的注意力机制，通过查询嵌入与学习到的码本条目的相似性匹配来检索语言特征，从而实现稳健的开放词汇推理并减少对象内特征的方差。在标准2D和3D开放词汇基准上的大量实验表明，我们的方法在分割质量、3D语义一致性和提供有关学习表示的语义可解释码本方面均优于先前的方法。

Summary / 总结

The paper addresses the challenge of understanding open-vocabulary 3D scenes using Gaussian-based representations. It introduces OpenGaFF, a framework that models semantics as a continuous function of Gaussian geometry and appearance, enhancing spatial coherence. The method uses a structured codebook and a codebook-guided attention mechanism to improve semantic consistency and robustness. Experiments show that OpenGaFF outperforms previous methods in segmentation quality and 3D semantic consistency, providing a semantically interpretable codebook for insights into the learned representation.

论文针对使用高斯表示理解开放词汇3D场景的挑战，提出了OpenGaFF框架，将语义建模为高斯几何和外观的连续函数，增强空间一致性。该方法使用结构化的代码本和代码本引导的注意力机制来提高对象级语义一致性并减少内部对象特征的方差。实验表明，OpenGaFF在分割质量和3D语义一致性方面优于先前的方法，并提供了一个语义可解释的代码本以洞察学习到的表示。

Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions

Authors: Kjetil Indrehus, Adrian Duric, Changkyu Choi, Ali Ramezani-Kebrya

First: 2026-05-07T11:42:23+00:00 · Latest: 2026-05-07T11:42:23+00:00

Abstract

Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Prediction via CoExVQA's chain-of-explanation enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence achieves SotA explainable DocVQA performance on PFL-DocVQA, improving ANLS by 12% over the current explainable baselines while providing transparent and verifiable predictions.

Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization

Authors: Weijian Su, Songqian Zhang, Yuqi Han, Jian Zhuang, Yongdong Huang, Qiang Zhang

Venue: CVPR 2026

First: 2026-05-07T11:34:41+00:00 · Latest: 2026-05-07T11:34:41+00:00

Comments: Accepted by CVPR 2026

Abstract

As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework integrating the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. The PALDM leverages a latent fusion prior and a joint conditional loss to generate diverse candidate fusion results with various properties. PCLDM is subsequently fine-tuned via instance direct preference optimization (IDPO), enabling direct control of the final fusion results with heterogeneous preference signals. Experimental results demonstrate that our framework not only attains precise preference alignment among humans, vision-language models, and task-driven networks, but also sets a new benchmark for adaptive fusion quality and task-oriented transferability.

Can Vision-Language Models Think from the Sky? Unifying UAV Reasoning and Generation

Authors: Jintao Sun, Gangyi Ding, Donglin Di, Hu Zhang, Zhedong Zheng

First: 2026-04-07T03:23:30+00:00 · Latest: 2026-05-07T11:33:25+00:00

Comments: 21 pages, 12 figures, 7 tables

Abstract

Vision-Language Models have achieved strong progress in ground-view visual understanding, yet they remain brittle in high-altitude Unmanned Aerial Vehicle scenes, where objects are tiny and densely packed, textures are repetitive, and top-down orientations are ambiguous. We introduce UAVReason, a large-scale UAV-native dataset and evaluation suite for studying unified aerial reasoning and generation under this nadir-view domain shift. UAVReason aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs within a consistent aerial domain. It contains 23.6K captioned frames, 273K VQA pairs including 68.2K two-frame temporal questions, and 188.8K cross-modal generation samples across RGB, depth, and segmentation modalities. We further adapt UAVReason-Bagel as a unified understanding-and-generation baseline that jointly optimizes language reasoning and dense visual generation objectives. Experiments show that general-purpose VLMs and off-the-shelf unified generators struggle with UAV-native grounding, while UAVReason-Bagel substantially improves over its pretrained counterpart, increasing VQA-1F F1 from 0.394 to 0.711, VQA-2F F1 from 0.427 to 0.822, and heading-aware VQA F1 from 0.798 to 0.973. For generation, it improves segmentation mIoU to 0.143 and reduces KID from 0.078 to 0.048 for depth-segmentation-text-conditioned RGB synthesis. More importantly, our ablations reveal a bidirectional synergy between synthesis and reasoning. Dense generation objectives improve temporal semantic consistency, while language-level reasoning regularizes sparse-condition image synthesis. These results suggest that unified reasoning and generation provide effective geometry-aware structural priors for physically grounded aerial intelligence. All data, code, and evaluation tools will be released.

Summary / 总结

This paper addresses the limitations of Vision-Language Models in high-altitude Unmanned Aerial Vehicle (UAV) scenes by introducing UAVReason, a large-scale dataset and evaluation suite for aerial reasoning and generation. The dataset aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs, covering 23.6K captioned frames, 273K VQA pairs, and 188.8K cross-modal generation samples. Experiments show that general-purpose VLMs and off-the-shelf unified generators struggle with UAV-native grounding, while the proposed UAVReason-Bagel raises VQA-1F F1 from 0.394 to 0.711 and VQA-2F F1 from 0.427 to 0.822 over its pretrained counterpart, and reduces KID from 0.078 to 0.048 for RGB synthesis. Ablations indicate that dense generation objectives improve temporal semantic consistency, and that unified reasoning and generation provide effective structural priors for aerial intelligence.

本文通过引入UAVReason数据集，解决视觉-语言模型在高空无人机场景中的局限性。该数据集包含RGB图像、深度图、语义分割掩码、描述和问答对，涵盖23.6K帧、273K VQA对和188.8K跨模态生成样本。实验表明，通用VLM和统一生成器在无人机原生定位（grounding）方面存在困难，而提出的UAVReason-Bagel显著提高了VQA F1分数，增强了时间语义一致性并减少了生成错误。结果表明，统一的推理和生成为基于物理的空中智能提供了有效的结构先验。

PlotPick: AI-powered batch extraction of numerical data from scientific figures

Authors: Tommy Carstensen

First: 2026-05-07T11:15:39+00:00 · Latest: 2026-05-07T11:15:39+00:00

Comments: 7 pages, 2 figures, 2 tables. Software available at https://plotpick.streamlit.app and https://github.com/tommycarstensen/plotpick

Abstract

Systematic reviews and meta-analyses frequently require numerical data that authors report only as figures, yet manual digitisation is slow and does not scale. We present PlotPick, an open-source tool that uses vision-language models (VLMs) to batch-extract structured tabular data from scientific figures. We evaluate six VLMs from three providers on two established chart-to-table benchmarks (ChartX and PlotQA) and compare against the dedicated chart-to-table model DePlot. All six VLMs outperform DePlot on both benchmarks. On ChartX (restricted to bar charts, line charts, box plots, and histograms; n=300), VLMs achieve 88-96% recall versus 71% for DePlot. On PlotQA (n=529), VLMs achieve 86-99% RMSF1 versus 94% for DePlot. The gap is largest on chart types absent from the dedicated models' training data: on box plots, DePlot achieves 24% RMSF1 while VLMs achieve 83-97%. PlotPick is available at https://plotpick.streamlit.app.

中文标题/摘要

标题：PlotPick：基于AI的批量提取科学图表中数值数据工具

系统评价和元分析经常需要作者仅以图表形式报告的数值数据，但手动数字化速度慢且无法扩展。我们介绍了PlotPick，这是一个开源工具，使用视觉-语言模型（VLMs）批量提取科学图表中的结构化表格数据。我们在两个已建立的图表到表格基准测试（ChartX和PlotQA）上评估了来自三个提供商的六种VLMs，并将其与专门的图表到表格模型DePlot进行比较。所有六种VLMs在两个基准测试上均优于DePlot。在ChartX（仅限条形图、线图、箱形图和直方图；n=300）上，VLMs的召回率为88-96%，而DePlot为71%。在PlotQA（n=529）上，VLMs的RMSF1为86-99%，而DePlot为94%。差距最大的是在专门模型训练数据中不存在的图表类型上：在箱形图上，DePlot的RMSF1为24%，而VLMs为83-97%。PlotPick可在https://plotpick.streamlit.app 获取。

Summary / 总结

PlotPick is an open-source tool that uses vision-language models (VLMs) to batch-extract structured tabular data from scientific figures, addressing the slowness of manual digitisation. Six VLMs are compared against the dedicated chart-to-table model DePlot on ChartX and PlotQA, reaching up to 96% recall on ChartX (versus 71% for DePlot) and up to 99% RMSF1 on PlotQA. The gap is largest on chart types absent from DePlot's training data: on box plots, VLMs achieve 83-97% RMSF1 while DePlot achieves 24%.

PlotPick 是一个开源工具，利用视觉-语言模型从科学图表中批量提取结构化表格数据，解决了手动数字化效率低下的问题。它在两个基准测试中优于专门的图表到表格模型 DePlot，最高可达 96% 的召回率和 99% 的 RMSF1。该工具特别适用于 DePlot 训练数据中未包含的图表类型，如箱线图，在这些图表类型上，它显著优于 DePlot。

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang

First: 2026-05-07T10:48:46+00:00 · Latest: 2026-05-07T10:48:46+00:00

Comments: 21 pages, 16 figures

Abstract

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.

中文标题/摘要

标题：4DThinker: 使用4D影像进行动态空间理解的思考

单目视频中的动态空间推理对于连接视觉智能和物理世界至关重要，但对视觉-语言模型（VLMs）来说仍然具有挑战性。先前的方法要么完全将空间-时间推理转化为文本，这在处理复杂动态时往往是冗长且不精确的，要么依赖于外部几何模块，增加了推理复杂性而未能培养模型的内在能力。在本文中，我们提出了4DThinker，这是第一个使VLMs能够通过动态潜藏的心理影像进行“4D思考”的框架，即在连续隐藏空间中内部模拟场景的演变。具体来说，我们首先引入了一种可扩展且无需标注的数据生成管道，从原始视频中合成4D推理数据。然后，我们提出了动态影像微调（DIFT），它联合监督文本标记和4D潜变量，使模型扎根于动态视觉语义。在此基础上，4D强化学习（4DRL）进一步通过基于结果的奖励解决复杂的推理任务，限制策略梯度仅作用于文本标记以确保优化的稳定性。在多个动态空间推理基准上的广泛实验表明，4DThinker在多个基准上持续优于强基线，并为VLMs中的4D推理提供了新的视角。我们的代码可在https://github.com/zhangquanchen/4DThinker获取。

Summary / 总结

4DThinker is a framework that enables vision-language models to perform dynamic spatial reasoning by internally simulating 4D imagery. It introduces a data generation pipeline for synthesizing 4D reasoning data and a fine-tuning method called DIFT that jointly supervises textual tokens and 4D latents. 4DRL further enhances this by using outcome-based rewards. Experiments show that 4DThinker outperforms strong baselines on multiple dynamic spatial reasoning benchmarks, providing a new approach to 4D reasoning in VLMs.

4DThinker 是一个框架，使视觉语言模型能够通过内部模拟4D图像来进行动态空间推理。它引入了一个数据生成管道来合成4D推理数据，并提出了一种称为DIFT的微调方法，该方法联合监督文本标记和4D潜变量。4DRL进一步通过基于结果的奖励来增强这一点，限制策略梯度仅针对文本标记以确保稳定的优化。实验表明，4DThinker 在多个动态空间推理基准测试中优于强大基线，为VLM中的4D推理提供了新的方法。

Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs

Authors: Zixuan Chen, Hao Lin, Zizhe Chen, Yizhou Tian, Garry Yang, Depeng Wang, Ya Guo, Huijia Zhu, James Cheng

First: 2026-05-07T10:04:39+00:00 · Latest: 2026-05-07T10:04:39+00:00

Abstract

LLMs reliably correct false claims when these are presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode correction suppression and construct a benchmark of 300 false premises to systematically evaluate it across eight models. Suppression rates range from 19% to 90%, with four models exceeding 80%, establishing correction suppression as a prevalent and severe phenomenon. Mechanistic analysis reveals that suppression is not a knowledge failure: the model registers the error internally but task context diverts early-layer attention from the false claim as output intent crystallizes toward compliance at middle layers. We characterize this as knowing but not correcting: suppression occurs at response selection rather than knowledge encoding. Guided by this mechanism, we propose two training-free interventions. Correction Direction Steering (CDS) estimates a correction-compliance direction from matched pairs and injects it at middle layers before output intent crystallizes. Dynamic Payload Amplification (DPA) localizes payload tokens via attention divergence between early and late layers and amplifies their representation at the final layer, requiring no calibration data. Experiments on Qwen3.5-9B and LLaMA3.1-8B show both methods substantially improve factual strictness. CDS achieves the highest correction rate on Qwen3.5-9B (0% → 58.2%). DPA is the only method that preserves or improves reasoning capability on both models. These findings introduce factual strictness, the willingness to uphold accuracy against contextual pressures, as a new dimension of model reliability.
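
The idea of estimating a correction-compliance direction from matched pairs and injecting it into hidden states can be illustrated on synthetic vectors. This is a minimal sketch under assumptions: the abstract does not say how CDS estimates the direction, so the difference of means used here is an assumed, commonly used choice, and `estimate_direction`, `steer`, and `alpha` are hypothetical names. No real model is involved; the "hidden states" are random arrays.

```python
import numpy as np

def estimate_direction(correct_states, comply_states):
    """Unit vector from the 'comply' mean to the 'correct' mean (assumed estimator)."""
    d = correct_states.mean(axis=0) - comply_states.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction; alpha is a hypothetical strength knob."""
    return hidden + alpha * direction

rng = np.random.default_rng(0)
dim = 64
true_dir = np.zeros(dim)
true_dir[0] = 1.0  # synthetic ground-truth axis separating the two behaviours

# Matched pairs: each 'correct' state is its 'comply' partner shifted along true_dir.
comply = rng.normal(size=(100, dim))
correct = comply + 3.0 * true_dir + 0.1 * rng.normal(size=(100, dim))

direction = estimate_direction(correct, comply)
h = rng.normal(size=dim)            # stand-in for one middle-layer hidden state
h_steered = steer(h, direction, alpha=4.0)
```

In a real model the injection would be applied inside the forward pass at a chosen layer; which layers count as "middle" and how large the shift should be are details the abstract leaves to the paper.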

中文标题/摘要

标题：明知而不纠正：常规任务请求抑制LLM事实纠正

LLM在孤立呈现虚假声明时会可靠地进行纠正，但在嵌入任务导向请求中时，它们往往遵守请求而不纠正。我们称这种失败模式为“纠正抑制”，并构建了一个包含300个虚假前提的基准测试，以系统地评估其在八个模型中的表现。抑制率从19%到90%不等，有四个模型超过80%，这表明纠正抑制是一种普遍且严重的现象。机制分析表明，抑制并非知识失败：模型内部已识别出错误，但任务背景会引导早期层的注意力远离虚假声明，以符合中间层的合规输出意图。我们将其描述为“明知而不纠正”——抑制发生在响应选择而非知识编码阶段。基于这一机制，我们提出了两种无需训练的干预措施。纠正方向引导（CDS）从匹配的成对数据中估计纠正-合规方向，并在输出意图固化之前注入到中间层。动态负载放大（DPA）通过早期层和晚期层之间的注意力差异定位负载标记，并在最终层放大其表示，无需校准数据。在Qwen3.5-9B和LLaMA3.1-8B上的实验表明，这两种方法都能显著提高事实准确性。CDS在Qwen3.5-9B上实现了最高的纠正率（0%→58.2%）。DPA是唯一一种在两种模型上保持或提高推理能力的方法。这些发现引入了“事实准确性”——在面对背景压力时坚持准确性的意愿——作为模型可靠性的一个新维度。

Adaptive Greedy Frame Selection for Long Video Understanding

Authors: Yuning Huang, Xiaoyu Ji, Joseph Huang, Yichi Zhang, Fengqing Zhu

First: 2026-03-20T17:55:32+00:00 · Latest: 2026-05-07T09:47:21+00:00

Abstract

Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
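
The greedy step over a relevance term plus a facility-location coverage term can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the exact weighting `lam * relevance + (1 - lam) * coverage`, the function name `greedy_select`, and the toy numbers are hypothetical, and the relevance scores and similarity matrix are written by hand instead of coming from SigLIP and DINOv2 embeddings.

```python
import numpy as np

def greedy_select(relevance, similarity, budget, lam=0.5):
    """Greedily pick frames maximising lam * relevance + (1 - lam) * coverage.

    relevance:  (n,) non-negative query-relevance scores (modular term).
    similarity: (n, n) non-negative frame-to-frame similarities; coverage of a set S
                is mean_j max_{i in S} similarity[i, j] (facility location).
    """
    n = len(relevance)
    chosen = np.zeros(n, dtype=bool)
    best_cover = np.zeros(n)  # how well each frame is already covered by the selection
    selected = []
    for _ in range(min(budget, n)):
        # Marginal coverage gain of adding each candidate frame.
        cover_gain = np.maximum(similarity - best_cover[None, :], 0.0).mean(axis=1)
        gain = lam * relevance + (1.0 - lam) * cover_gain
        gain[chosen] = -np.inf  # never pick the same frame twice
        pick = int(np.argmax(gain))
        chosen[pick] = True
        selected.append(pick)
        best_cover = np.maximum(best_cover, similarity[pick])
    return selected

# Toy pool: frames 0-2 are near-duplicates with high relevance, frame 3 is distinct.
relevance = np.array([0.9, 0.85, 0.8, 0.3])
similarity = np.array([
    [1.00, 0.95, 0.95, 0.0],
    [0.95, 1.00, 0.95, 0.0],
    [0.95, 0.95, 1.00, 0.0],
    [0.00, 0.00, 0.00, 1.0],
])
diverse = greedy_select(relevance, similarity, budget=2, lam=0.2)
relevance_only = greedy_select(relevance, similarity, budget=2, lam=1.0)
```

With `lam=1.0` the selection collapses onto the near-duplicate cluster, which is the failure mode the abstract attributes to purely relevance-driven selection; lowering `lam` lets the coverage term pull in the distinct frame. The preset strategies in the paper can be read as different settings of this trade-off, though their exact parameterisation is not given in the abstract.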

中文标题/摘要

标题：长视频理解中的自适应贪婪帧选择

大型视觉-语言模型（VLMs）在长视频问答中越来越被应用，但推理往往受限于输入帧的数量和由此产生的视觉标记数量。简单的稀疏采样可能会错过关键时刻，而纯粹基于相关性的选择经常陷入近似重复的帧中，并牺牲了时间上相距较远的证据的覆盖范围。我们提出了一种问题自适应的贪婪帧选择方法，该方法在固定帧预算下联合优化查询相关性和语义代表性。我们的方法构建了一个1 FPS候选池（最多1000个），具有精确的时间戳对齐，将候选者嵌入两个互补的空间（SigLIP用于问题相关性，DINOv2用于语义相似性），并通过贪婪地最大化加权和的模块化相关性项和设施位置覆盖项来选择帧。该目标是归一化的、单调的和次模的，提供了标准的（1-1/e）贪婪近似保证。为了考虑问题之间相关性和覆盖之间的依赖性权衡，我们引入了四种预设策略和一个轻量级的仅文本问题类型分类器，将每个查询路由到其表现最佳的预设策略。在MLVU上的实验显示，在各种帧预算下，该方法相对于均匀采样和一个强大的近期基线都有一致的准确率提升，特别是在预算紧张的情况下，提升最大。

Summary / 总结

The paper addresses the challenge of efficiently selecting frames for long-video question answering using large vision-language models. It proposes an adaptive greedy frame selection method that optimizes both query relevance and semantic representativeness. The method constructs a 1 FPS candidate pool, embeds candidates in two spaces, and selects frames by maximizing a weighted sum of relevance and coverage terms. Experiments show consistent accuracy improvements over uniform sampling and a strong baseline, with the largest gains under tight frame budgets.

论文旨在高效地从长视频中选择用于问题回答的帧，使用了大型视觉-语言模型。提出了一种自适应贪婪帧选择方法，同时优化查询相关性和语义代表性。该方法构建了一个1 FPS候选池，将候选帧嵌入两个空间，并通过最大化相关性和覆盖性的加权和来选择帧。实验结果显示，在均匀采样和一个强大的基线方法上，该方法具有一致的准确率提升，特别是在帧预算紧张的情况下，提升最为显著。

Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model

Authors: Junhui Yin, Nan Pu, Xinyu Zhang, Lingfeng Yang, Lin Wu, Xiaojie Wang, Zhun Zhong

First: 2026-05-07T09:20:42+00:00 · Latest: 2026-05-07T09:20:42+00:00

Comments: Accepted by International Journal of Computer Vision

Abstract

Prompt learning has become an effective and widely used technique in enhancing vision-language models (VLMs) such as CLIP for various downstream tasks, particularly in zero-shot classification within specific domains. Existing methods typically focus on either learning class-shared prompts for a given domain or generating instance-specific prompts through conditional prompt learning. While these methods have achieved promising performance, they often overlook class-specific knowledge in prompt design, leading to suboptimal outcomes. The underlying reasons are: 1) class-specific prompts offer more fine-grained supervision compared to coarse class-shared prompts, which helps prevent misclassification of data from different classes into a single class; 2) compared to class-specific prompts, instance-specific prompts neglect the richer class-level information across multiple instances, potentially causing data from the same class to be divided into multiple classes. To effectively supplement the class-specific knowledge into existing methods, we propose a plug-and-play Class-Aware Knowledge Injection (CAKI) framework. CAKI comprises two key components, i.e., class-specific prompt generation and query-key prompt matching. The former encodes class-specific knowledge into prompts from few-shot samples that belong to the same class and stores the learned prompts in a class-level knowledge bank. The latter provides a plug-and-play mechanism for each test instance to retrieve relevant class-level knowledge from the knowledge bank and inject such knowledge to refine model predictions. Extensive experiments demonstrate that our CAKI effectively improves the performance of existing methods on base and novel classes. Code is publicly available at https://github.com/yjh576/CAKI.

中文标题/摘要

标题：即插即用类感知知识注入用于视觉-语言模型的提示学习

提示学习已成为增强视觉-语言模型（如CLIP）的有效且广泛使用的技术，特别是在特定领域的零样本分类中。现有方法通常侧重于为给定领域学习类共享提示，或通过条件提示学习生成实例特定的提示。虽然这些方法取得了令人鼓舞的性能，但它们往往忽略了提示设计中的类特定知识，导致结果不佳。其原因在于：1）类特定提示相比粗略的类共享提示提供了更精细的监督，有助于防止不同类别的数据被误分类为单一类别；2）与类特定提示相比，实例特定提示忽略了多个实例间的丰富类级信息，可能导致同一类别的数据被划分为多个类别。为了有效补充类特定知识到现有方法中，我们提出了一种即插即用类感知知识注入（CAKI）框架。CAKI 包含两个关键组件，即类特定提示生成和查询-键提示匹配。前者将类特定知识编码到属于同一类别的少量样本的提示中，并将学习到的提示存储在类级知识库中。后者为每个测试实例提供了一种即插即用机制，可以从知识库中检索相关类级知识并注入此类知识以改进模型预测。广泛的实验表明，我们的CAKI能有效提高现有方法在基础类和新类上的性能。代码可在 https://github.com/yjh576/CAKI 获取。

Summary / 总结

The paper proposes a plug-and-play Class-Aware Knowledge Injection (CAKI) framework to enhance prompt learning in vision-language models like CLIP. Motivated by the need to incorporate class-specific knowledge, CAKI includes class-specific prompt generation and query-key prompt matching. The former encodes class-specific knowledge into prompts from few-shot samples, while the latter retrieves and injects this knowledge to refine model predictions. Experiments show that CAKI improves performance on both base and novel classes compared to existing methods.

论文提出了一种插件式Class-Aware Knowledge Injection (CAKI)框架，旨在增强视觉语言模型如CLIP中的提示学习。该方法受到现有方法使用类共享或实例特定提示的局限性的启发，旨在将类特定知识注入这些方法中。CAKI包括类特定提示生成和查询键提示匹配。前者将类特定知识编码到少量样本中，后者在测试时检索并注入这种知识。实验表明，CAKI在基类和新类上均能比现有方法提高性能。

SMI: Statistical Membership Inference for Reliable Unlearned Model Auditing

Authors: Jialong Sun, Zeming Wei, Jiaxuan Zou, Jiacheng Gong, Jie Fu, Chengyang Dong, Heng Xu, Jialong Li, Bo Liu

First: 2026-02-01T10:51:53+00:00 · Latest: 2026-05-07T09:14:58+00:00

Abstract

Machine unlearning (MU) is essential for enforcing the right to be forgotten in machine learning systems. A key challenge of MU is how to reliably audit whether a model has truly forgotten specified training data. Membership Inference Attacks (MIAs) are widely used for unlearned model auditing, where samples that evade membership detection are regarded as successfully forgotten. We show this assumption is fundamentally flawed: failed membership inference does not imply true forgetting. We prove that unlearned samples occupy fundamentally different positions in the feature space than non-member samples, making this alignment bias unavoidable and unobservable, which leads to systematically optimistic evaluations of unlearning performance. Meanwhile, training shadow models for MIA incurs substantial computational overhead. To address both limitations, we propose Statistical Membership Inference (SMI), a training-free auditing framework that reformulates auditing as estimating the non-member mixture proportion in the unlearned feature distribution. Beyond estimating the forgetting rate, SMI also provides bootstrap reference ranges for quantified auditing reliability. Extensive experiments show that SMI consistently outperforms all MIA-based baselines, with no shadow model training required. Overall, SMI establishes a principled and efficient alternative to MIA-based auditing methods, with both theoretical guarantees and strong empirical performance.
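
To make "estimating the non-member mixture proportion" with a bootstrap reference range concrete, here is a toy one-dimensional sketch. It is not the SMI estimator, which the abstract does not specify: the mean-matching formula, the Gaussian scores, and the names `mixture_proportion` and `bootstrap_range` are assumptions chosen only to show the shape of such an audit.

```python
import numpy as np

def mixture_proportion(unlearned, members, non_members):
    """Assumed estimator: match the unlearned mean to a member/non-member mixture mean."""
    mu_m, mu_n = members.mean(), non_members.mean()
    pi = (mu_m - unlearned.mean()) / (mu_m - mu_n)
    return float(np.clip(pi, 0.0, 1.0))

def bootstrap_range(unlearned, members, non_members, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap over the unlearned samples only."""
    rng = np.random.default_rng(seed)
    estimates = [
        mixture_proportion(
            rng.choice(unlearned, size=len(unlearned), replace=True),
            members,
            non_members,
        )
        for _ in range(n_boot)
    ]
    lo, hi = np.quantile(estimates, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)

rng = np.random.default_rng(1)
members = rng.normal(2.0, 1.0, 2000)       # synthetic scores of retained training data
non_members = rng.normal(0.0, 1.0, 2000)   # synthetic scores of held-out data
true_pi = 0.7                               # synthetic fraction that was truly forgotten
n = 2000
forgotten = rng.random(n) < true_pi
unlearned = np.where(forgotten, rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n))

pi_hat = mixture_proportion(unlearned, members, non_members)
lo, hi = bootstrap_range(unlearned, members, non_members)
```

The point of the sketch is the framing: the audit returns a population-level forgetting rate with a range around it, instead of a per-sample membership verdict, and no shadow models are trained.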

中文标题/摘要

标题：SMI：统计成员推断以实现可靠的未学习模型审计

机器遗忘（MU）对于在机器学习系统中执行被遗忘的权利至关重要。MU的关键挑战是如何可靠地审计模型是否真正忘记了指定的训练数据。成员推断攻击（MIAs）广泛用于未学习模型的审计，其中逃避成员检测的样本被视为成功遗忘。我们表明这种假设是根本错误的：失败的成员推断并不意味着真正的遗忘。我们证明未学习样本在特征空间中的位置与非成员样本完全不同，这种对齐偏差不可避免且不可见，导致了对遗忘性能的系统性乐观评估。同时，为MIAs训练阴影模型会带来巨大的计算开销。为了解决这两个问题，我们提出了统计成员推断（SMI），这是一种无需训练的审计框架，将审计重新定义为估计未学习特征分布中的非成员混合比例。除了估计遗忘率外，SMI还提供了量化审计可靠性的Bootstrap参考范围。广泛的实验表明，SMI在不需要训练阴影模型的情况下始终优于所有基于MIAs的基线方法。总体而言，SMI为基于MIAs的审计方法提供了一个有原则且高效的替代方案，具有理论保证和强大的实验性能。

Summary / 总结

The paper addresses the challenge of reliably auditing whether a machine learning model has forgotten specified training data after unlearning. It introduces Statistical Membership Inference (SMI), a training-free method that estimates the proportion of non-member samples in the unlearned feature distribution. Experiments show that SMI outperforms existing Membership Inference Attack (MIA)-based methods without requiring the training of shadow models, providing a more reliable and efficient auditing approach.

论文针对机器学习模型在执行数据遗忘后如何可靠地审计模型是否真正忘记了指定的训练数据这一挑战。提出了一种无需训练的统计成员推断（SMI）方法，用于估计未学习特征分布中非成员样本的比例。实验表明，SMI在无需训练阴影模型的情况下，比现有的基于会员推断攻击（MIA）的方法表现更优，提供了一种更可靠和高效的审计方法。

DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation

Authors: Sankarshana Venugopal, Mohammad Mostafavi, Jonghyun Choi

Venue: CVPR 2026

First: 2026-05-07T08:59:05+00:00 · Latest: 2026-05-07T08:59:05+00:00

Comments: Accepted to CVPR 2026. Includes supplementary material

Abstract

Diffusion-based image-to-image (I2I) translation excels in high-fidelity generation but suffers from slow sampling in state-of-the-art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training-free sampler that exploits the semi-linear structure of DBM's underlying SDE and ODE via exponential integrators, yielding highly-efficient 1st- and 2nd-order solutions. This reduces NFEs by up to 5x while boosting quality (e.g., FID drops 53% on DIODE at 20 NFEs vs. 2nd-order baseline). Experiments on inpainting, stylization, and semantics-to-image tasks across resolutions up to 256x256 show DBMSolver sets new SOTA efficiency-quality tradeoffs, enabling real-world applicability. Our code is publicly available at https://github.com/snumprlab/dbmsolver.

Summary / 总结

DBMSolver is a training-free sampler that speeds up diffusion-based image-to-image translation by reducing the number of function evaluations (NFEs) needed for sampling. It exploits the semi-linear structure of the SDE and ODE underlying Diffusion Bridge Models through exponential integrators, yielding 1st- and 2nd-order solvers that cut NFEs by up to 5x while improving image quality. For example, FID drops by 53% on the DIODE dataset at 20 NFEs compared to a 2nd-order baseline. Experiments on inpainting, stylization, and semantics-to-image translation at resolutions up to 256x256 show new state-of-the-art efficiency-quality tradeoffs. The code is publicly available.

DBMSolver 是一种无需训练的采样器，通过指数积分器利用基础 SDE 和 ODE 的半线性结构，减少采样所需的函数评估次数（NFE），从而提高基于扩散的图像到图像转换的效率：NFE 最多可减少 5 倍，同时图像质量得到提升。实验表明，DBMSolver 在修复、风格化和语义到图像等任务中取得了新的效率-质量折衷最优水平，使其适用于实际应用。在 DIODE 数据集上，该方法在 20 次函数评估时的 FID 比二阶基线降低了 53%。代码已公开。
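To see why exponential integrators suit semi-linear dynamics, consider a scalar toy ODE dx/dt = a*x + N(x, t). The sketch below is a generic first-order exponential Euler step, not DBMSolver's actual update rule. The function names and the test problem are assumptions chosen for illustration. The linear part is integrated exactly, so the step size is not limited by the stiffness of `a`.

```python
import math


def exponential_euler(x0, a, nonlinear, h, n_steps, t0=0.0):
    """First-order exponential integrator for dx/dt = a*x + nonlinear(x, t), a != 0."""
    x, t = x0, t0
    decay = math.exp(a * h)          # exact propagator of the linear part
    phi = math.expm1(a * h) / a      # integral of exp(a*s) over s in [0, h]
    for _ in range(n_steps):
        # The nonlinear term is frozen at the start of each step.
        x = decay * x + phi * nonlinear(x, t)
        t += h
    return x


def explicit_euler(x0, a, nonlinear, h, n_steps, t0=0.0):
    """Plain explicit Euler on the same ODE, for comparison."""
    x, t = x0, t0
    for _ in range(n_steps):
        x = x + h * (a * x + nonlinear(x, t))
        t += h
    return x


if __name__ == "__main__":
    a, c, x0 = -5.0, 1.0, 2.0        # dx/dt = -5x + 1, a stiff toy problem
    forcing = lambda x, t: c
    exact = -c / a + (x0 + c / a) * math.exp(a * 2.0)
    print("exact            ", exact)
    print("exponential Euler", exponential_euler(x0, a, forcing, h=0.5, n_steps=4))
    print("explicit Euler   ", explicit_euler(x0, a, forcing, h=0.5, n_steps=4))
```

With a constant forcing term the exponential step reproduces the exact solution even at a step size of 0.5, while explicit Euler diverges at the same step size. This is the general mechanism that lets such solvers use few function evaluations.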

Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models

Authors: Daniel Sungho Jung, Kyoung Mu Lee

First: 2026-05-07T08:57:27+00:00 · Latest: 2026-05-07T08:57:27+00:00

Comments: Project page: https://contactprompt-release.github.io/

Abstract

Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning of human interaction to accurately localize contact regions. Recently, multi-modal large language models (MLLMs) have demonstrated strong capabilities in understanding visual semantics, enabled by vision-language priors learned from large-scale data. However, leveraging MLLMs for dense hand contact estimation remains underexplored. There are two major challenges in applying MLLMs to dense hand contact estimation. First, encoding explicit 3D hand geometry is difficult, as MLLMs primarily operate on vision and language modalities. Second, capturing fine-grained vertex-level contact remains challenging, as MLLMs tend to focus on high-level semantics rather than detailed geometric reasoning. To address these challenges, we propose ContactPrompt, a training-free and zero-shot approach for dense hand contact estimation using MLLMs. To effectively encode 3D hand geometry, we introduce a detailed hand-part segmentation and a part-wise vertex-grid representation that provides structured, localized geometric information. To enable accurate and efficient dense contact prediction, we develop a multi-stage structured contact reasoning with part conditioning, progressively bridging global semantics and fine-grained geometry. Therefore, our method effectively leverages the reasoning capabilities of MLLMs while enabling precise dense hand contact estimation. Surprisingly, the proposed approach outperforms previous supervised methods trained on large-scale dense contact datasets without requiring any training. The codes will be released.

Summary / 总结

The research aims to leverage multi-modal large language models (MLLMs) for dense hand contact estimation, addressing the challenges of encoding 3D hand geometry and capturing fine-grained vertex-level contact. The proposed ContactPrompt method uses a detailed hand-part segmentation and a part-wise vertex-grid representation to encode 3D geometry, and a multi-stage structured contact reasoning scheme with part conditioning to predict dense contact. This training-free, zero-shot approach outperforms previous supervised methods trained on large-scale dense contact datasets without requiring any training.

研究旨在利用多模态大型语言模型（MLLMs）进行密集手部接触估计，解决3D手部几何编码和细粒度顶点级接触捕捉的挑战。所提出的ContactPrompt方法使用详细的手部部位分割和按部位划分的顶点网格表示来编码3D几何信息，并通过带部位条件的多阶段结构化接触推理来预测密集手部接触。该方法无需训练且为零样本，在不进行任何训练的情况下优于之前在大规模密集接触数据集上训练的监督方法。

StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods

Authors: Zheng Li, Jerry Cheng, Huanying Helen Gu

First: 2026-04-06T09:21:48+00:00 · Latest: 2026-05-07T08:44:16+00:00

Comments: 27 pages, 10 figures, 9 tables

Abstract

Ensemble methods improve predictive performance but often incur high memory and computational costs. We identify an aggregation instability induced by nonlinear projection and voting operations. To address both efficiency challenges and this inconsistency, we propose StableTTA, a training-free test-time adaptation method with two variants. StableTTA-I targets coherent-batch inference settings, where temporally or semantically adjacent observations are likely to belong to the same class. Examples include burst photography, video streams, robotics perception, and industrial inspection. Under coherent-batch inference, StableTTA-I substantially improves prediction consistency and accuracy through variance-aware logit aggregation. StableTTA-II establishes feature-level cropping, enabling efficient logit aggregation with a single forward pass on a single model backbone. Experiments on ImageNet-1K across 71 models demonstrate that StableTTA-I consistently improves prediction accuracy under coherent-batch inference, while StableTTA-II provides lightweight and architecture-agnostic accuracy improvements with minimal computational overhead. These results suggest that inference-time semantic coherence and aggregation stability provide useful perspectives for improving practical test-time adaptation systems.

中文标题/摘要

标题：StableTTA：通过无需训练的测试时自适应方法提高视觉模型性能

集成方法可以提高预测性能，但通常会带来高内存和计算成本。我们识别出由非线性投影和投票操作引起的聚合不稳定性。为了解决效率挑战和这种不一致性，我们提出了StableTTA，这是一种无需训练的测试时自适应方法，具有两种变体。StableTTA-I针对一致批次推理设置，其中时间上或语义上相邻的观测很可能属于同一类别。示例包括连拍摄影、视频流、机器人感知和工业检测。在一致批次推理下，StableTTA-I通过方差感知的logit聚合显著提高了预测一致性和准确性。StableTTA-II实现了特征级裁剪，只需在单个模型主干上进行一次前向传播即可高效地聚合logit。在ImageNet-1K上对71个模型进行的实验表明，StableTTA-I在一致批次推理下始终提高了预测准确性，而StableTTA-II提供了轻量级且架构无关的准确性改进，且计算开销最小。这些结果表明，推理时语义一致性和聚合稳定性为改进实际测试时自适应系统提供了有用视角。

Summary / 总结

The paper addresses the challenge of improving the performance of vision models by proposing StableTTA, a training-free test-time adaptation method. It includes two variants: StableTTA-I for coherent-batch inference and StableTTA-II for efficient logit aggregation. Experiments on ImageNet-1K across 71 models show that StableTTA-I enhances prediction consistency and accuracy under coherent-batch inference, while StableTTA-II offers lightweight and architecture-agnostic improvements with minimal computational overhead.

StableTTA 是一种无需训练的测试时自适应方法，旨在解决集成方法的效率挑战和聚合不稳定性问题。它包括两个变体：StableTTA-I 适用于一致批次推理，通过方差感知的 logits 聚合提高预测一致性和准确性；StableTTA-II 则通过单次前向传播实现高效 logits 聚合，且计算开销小。实验表明，StableTTA-I 在一致批次推理下提升预测准确性，而 StableTTA-II 提供轻量级且架构无关的改进，计算开销小。
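The contrast between voting and logit-level aggregation over a coherent batch can be shown on a tiny example. The abstract does not give StableTTA-I's aggregation formula, so the scoring rule below (per-class mean logit minus a penalty on its standard deviation) is purely an assumed stand-in for "variance-aware" aggregation. The function names and the `lam` parameter are hypothetical as well.

```python
import statistics


def majority_vote(batch_logits):
    """Each frame votes for its own argmax class; the most frequent class wins."""
    votes = [max(range(len(l)), key=l.__getitem__) for l in batch_logits]
    return max(set(votes), key=votes.count)


def variance_aware_aggregate(batch_logits, lam=0.5):
    """Hypothetical rule: score_c = mean_c - lam * std_c over the coherent batch."""
    n_classes = len(batch_logits[0])
    scores = []
    for c in range(n_classes):
        column = [logits[c] for logits in batch_logits]
        scores.append(statistics.fmean(column) - lam * statistics.pstdev(column))
    return max(range(n_classes), key=scores.__getitem__)


if __name__ == "__main__":
    # Three frames assumed to show the same object (a coherent batch).
    frames = [
        [2.0, 1.9, 0.0],   # class 0 wins by a hair
        [2.1, 1.8, 0.0],   # class 0 wins by a hair
        [0.0, 3.0, 0.0],   # class 1 wins clearly
    ]
    print("majority vote:", majority_vote(frames))
    print("logit aggregation:", variance_aware_aggregate(frames))
```

Here the vote is decided by two narrow wins for class 0, while aggregation over logits favors class 1, whose evidence is both stronger on average and more stable across frames. The example only illustrates how the two aggregation styles can disagree; it makes no claim about which is correct on real data.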

Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

Authors: Megha Mariam K. M, Vineeth N. Balasubramanian, C. V. Jawahar

Venue: CVPR 2026

First: 2026-05-07T08:04:50+00:00 · Latest: 2026-05-07T08:04:50+00:00

Comments: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract

The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models to assess their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at https://github.com/meghamariamkm2002/MCD
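An embedding-based correspondence baseline of the kind evaluated above can be sketched as nearest-neighbor matching under cosine similarity. This is a minimal illustration that assumes embeddings have already been computed by some model. The function names and the three-dimensional toy vectors are hypothetical and are not taken from the MCD benchmark.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_segments(query_embeddings, target_embeddings):
    """For each query, return (index of the most similar target, similarity)."""
    matches = []
    for q in query_embeddings:
        sims = [cosine(q, t) for t in target_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append((best, sims[best]))
    return matches


if __name__ == "__main__":
    slides = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0]]                        # toy slide embeddings
    paragraphs = [[0.0, 2.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # toy paper embeddings
    for slide_idx, (para_idx, sim) in enumerate(match_segments(slides, paragraphs)):
        print(f"slide {slide_idx} -> paragraph {para_idx} (cosine {sim:.3f})")
```

A matcher like this depends entirely on how well the embedding space aligns formats, which is consistent with the reported difficulty on equations and symbolic content that cluster separately.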
