arXiv Paper Digest

Snapshot: 20260511_0418

BAMI: Training-Free Bias Mitigation in GUI Grounding

Authors: Borui Zhang, Bo Zhang, Bo Wang, Wenzhao Zheng, Yuhao Cheng, Liang Tang, Yiqiang Yan, Jie Zhou, Jiwen Lu

Venue: CVPR 2026

First: 2026-05-07T17:59:31+00:00 · Latest: 2026-05-07T17:59:31+00:00

Comments: Accepted by CVPR 2026

Abstract

GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed Masked Prediction Distribution (MPD) attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce Bias-Aware Manipulation Inference (BAMI), which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. Our extensive experimental results demonstrate that BAMI significantly enhances the accuracy of various GUI grounding models in a training-free setting. For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9% to 57.8%. Furthermore, ablation studies confirm the robustness of the BAMI approach across diverse parameter configurations, highlighting its stability and effectiveness. Code is available at https://github.com/Neur-IO/BAMI.
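
The coarse-to-fine focus step can be pictured with a small sketch. This is an illustration under assumptions, not the authors' implementation: the function names, the fixed crop size, and the `predict` callback (standing in for any grounding model that returns a click point local to the region it is shown) are all hypothetical.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # left, top, right, bottom

def crop_box_around(point: Point, image_size: Tuple[int, int],
                    crop_size: Tuple[int, int]) -> Box:
    """Return a crop window centred on `point`, shifted to stay inside the image."""
    (x, y), (w, h) = point, image_size
    cw, ch = min(crop_size[0], w), min(crop_size[1], h)
    left = int(min(max(x - cw / 2, 0), w - cw))
    top = int(min(max(y - ch / 2, 0), h - ch))
    return (left, top, left + cw, top + ch)

def coarse_to_fine(predict: Callable[[Box], Point],
                   image_size: Tuple[int, int],
                   crop_size: Tuple[int, int]) -> Point:
    """Hypothetical two-pass grounding: predict on the full screen, then on a crop."""
    full = (0, 0, image_size[0], image_size[1])
    coarse = predict(full)  # the full box starts at the origin, so local == global
    box = crop_box_around(coarse, image_size, crop_size)
    fx, fy = predict(box)   # refined prediction, local to the crop
    return (box[0] + fx, box[1] + fy)  # map back to full-image coordinates
```

The second pass sees fewer pixels per target, which is one plausible way to reduce the precision bias the abstract attributes to high resolution. How BAMI chooses the crop and combines it with candidate selection is not stated in the abstract.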

Summary

The research addresses the suboptimal performance of GUI grounding models in complex scenarios, particularly in the ScreenSpot-Pro benchmark. It introduces BAMI, a training-free method that uses coarse-to-fine focus and candidate selection to mitigate precision and ambiguity biases. Experimental results show that BAMI improves the accuracy of the TianXi-Action-7B model from 51.9% to 57.8% on the ScreenSpot-Pro benchmark, and ablation studies confirm its robustness across different configurations.

Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

Authors: Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava

First: 2026-05-07T17:54:29+00:00 · Latest: 2026-05-07T17:54:29+00:00

Abstract

Retrieval-augmented agents are increasingly the interface to large organizational knowledge bases, yet most still treat retrieval as a black box: they issue exploratory queries, inspect returned snippets, and iteratively reformulate until useful evidence emerges. This approach resembles how a newcomer searches an unfamiliar database rather than how an expert navigates it with strong priors about terminology and likely evidence, and results in unnecessary retrieval rounds, increased latency, and poor recall. We introduce SuperIntelligent Retrieval Agent (SIRA), which defines superintelligence in retrieval as the ability to compress multi-round exploratory search into a single corpus-discriminative retrieval action. SIRA does not merely ask what terms are relevant to the query; it asks which terms are likely to separate the desired evidence from corpus-level confusers. On the corpus side, an LLM enriches each document offline with missing search vocabulary; on the query side, it predicts evidence vocabulary omitted by the query; and document-frequency statistics, exposed as a tool call, filter out proposed terms that are absent, overly common, or unlikely to create retrieval margin. The final retrieval step is a single weighted BM25 call combining the original query with the validated expansion. Across ten BEIR benchmarks and downstream question-answering tasks, SIRA significantly outperforms dense retrievers and state-of-the-art multi-round agentic baselines, demonstrating that one well-formed lexical query, guided by LLM cognition and lightweight corpus statistics, can exceed substantially more expensive multi-round search while remaining interpretable, training-free, and efficient.
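
The query-side pipeline described above (propose terms, filter them by document frequency, then issue one weighted BM25 call) can be sketched as follows. This is a minimal illustration under assumptions: the toy corpus, the `max_df_ratio` threshold, the 0.5 expansion weight, and the hard-coded `proposed` list (standing in for LLM output) are all hypothetical, and the BM25 variant is a common textbook form rather than necessarily the one SIRA uses.

```python
import math
from collections import Counter
from typing import Dict, List

def doc_freq(corpus: List[List[str]]) -> Counter:
    """Count how many documents contain each term."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    return df

def filter_terms(candidates: List[str], df: Counter, n_docs: int,
                 max_df_ratio: float = 0.5) -> List[str]:
    """Keep candidates that occur in the corpus but are not overly common."""
    return [t for t in candidates if 0 < df[t] <= max_df_ratio * n_docs]

def bm25_scores(weights: Dict[str, float], corpus: List[List[str]], df: Counter,
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document with per-term query weights applied to BM25."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        s = 0.0
        for term, w in weights.items():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
            s += w * idf * norm
        scores.append(s)
    return scores

corpus = [
    "the patient was given aspirin for headache".split(),
    "acetylsalicylic acid reduces fever and pain".split(),
    "the weather was pleasant for the picnic".split(),
    "the stock market fell sharply".split(),
]
df = doc_freq(corpus)
query = ["aspirin", "fever"]
proposed = ["acetylsalicylic", "the", "ibuprofen"]  # stand-in for LLM-predicted vocabulary
kept = filter_terms(proposed, df, len(corpus))      # drops the absent and the too-common term
weights = {t: 1.0 for t in query}
weights.update({t: 0.5 for t in kept})              # expansion terms get a lower weight
scores = bm25_scores(weights, corpus, df)
```

In this toy run the filter discards "ibuprofen" (absent from the corpus) and "the" (present in most documents), so only the discriminative expansion term reaches the single BM25 call.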

Summary

The paper introduces SuperIntelligent Retrieval Agent (SIRA), which aims to improve information retrieval by compressing multi-round exploratory search into a single retrieval action. SIRA uses an LLM to enrich documents with missing search terms and predict evidence vocabulary, and applies document-frequency statistics to filter terms. In benchmarks and downstream tasks, SIRA outperforms dense retrievers and multi-round baselines, showing that a well-formed query can achieve better performance than multiple rounds of search while remaining efficient and interpretable.

DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression

Authors: Numan Saeed, Asif Hanif, Fadillah Adamsyah Maani, Hussain Alasmawi, Mohammad Yaqub

First: 2026-03-05T17:43:00+00:00 · Latest: 2026-05-07T17:28:16+00:00

Comments: Project website: www.numansaeed.com/mobilefetalclip

Abstract

Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude or more. We argue that, under such gaps, strict imitation of the teacher is a poor objective: much of the teacher's pairwise similarity structure reflects its own architectural biases rather than information a compact student can efficiently represent. We propose Diagonal-Anchored Repulsive Knowledge Distillation (DARK), a contrastive KD framework that decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment throughout training; the off-diagonal term is annealed from positive to negative weighting, transitioning the student from imitating to repelling the teacher's non-target similarity structure. We instantiate DARK by distilling FetalCLIP, a 427M-parameter fetal ultrasound vision-language model, into MobileFetalCLIP, a 75M-parameter student model with a $26\times$ smaller visual encoder, running in 1.6 ms on an iPhone 16 Pro. The student matches or exceeds its teacher on three zero-shot benchmarks, including HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702). Embedding-geometry and logit analyses show that DARK induces structured decorrelation: the student preserves teacher-aligned per-image confidence while diverging from inherited inter-class confusion, suggesting that controlled repulsion can be more efficient than imitation under extreme compression.
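
To make the diagonal/off-diagonal split concrete, here is a small sketch. It is an assumption-laden illustration: the abstract does not give the loss form, so a squared-error match between student and teacher similarity matrices is used as a stand-in, and the linear annealing schedule and both function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]

def off_diag_weight(step: int, total_steps: int,
                    start: float = 1.0, end: float = -1.0) -> float:
    """Linearly anneal the off-diagonal weight from `start` (imitate) to `end` (repel)."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + frac * (end - start)

def split_distill_loss(student: Matrix, teacher: Matrix, lam: float) -> float:
    """Diagonal term (matched pairs) plus `lam` times the off-diagonal term."""
    n = len(student)
    diag = sum((student[i][i] - teacher[i][i]) ** 2 for i in range(n)) / n
    off = sum((student[i][j] - teacher[i][j]) ** 2
              for i in range(n) for j in range(n) if i != j) / max(n * (n - 1), 1)
    return diag + lam * off
```

With a positive weight the off-diagonal term rewards copying the teacher's non-target similarities; once the weight turns negative, the same term rewards moving away from them, while the diagonal term keeps matched pairs aligned throughout.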

Summary

The research aims to address the challenge of compressing vision-language models for on-device deployment in clinical settings. DARK, a contrastive knowledge distillation framework, decomposes the distillation loss into diagonal and off-diagonal terms to improve performance under extreme capacity gaps. The method anchors matched-pair alignment and gradually transitions the student from imitating to repelling the teacher's non-target similarities. Experiments show that MobileFetalCLIP, a 75M-parameter student model, matches or exceeds its 427M-parameter teacher on three zero-shot benchmarks, including HC18 biometry validity and brain sub-plane F1 scores.

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

Authors: Jiayun Luo, Mir Rayat Imtiaz Hossain, Pritam Sarkar, Boyang Li, Leonid Sigal

First: 2024-12-11T05:36:18+00:00 · Latest: 2026-05-07T17:14:28+00:00

Abstract

Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance degrades for complex, multi-object references. These limitations largely arise from training objectives that leverage image-caption alignment, where direct multi-object references are rare, the number of possible such references is theoretically large (exponential in the number of objects), and attribution is difficult. To address this, without requiring any additional annotations, we propose Compositional Attention-Regularized Training (CompART), which decomposes captions into object-centric phrases and constructs composite phrases by pairing them with conjunctions. We then introduce a composition loss that encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases, promoting balanced multi-object localization. We evaluate CompART across four VLM architectures, spanning both contrastive-based and generative-based models, on four benchmarks for multi-object grounding and two VQA benchmarks for general visual understanding. CompART consistently improves grounding for both single- and multi-object references across diverse VLM architectures and datasets, and further demonstrates enhanced visual understanding, as evidenced by gains on VQA, despite not being explicitly trained for this task.
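
The composition loss described above can be sketched in a few lines. This is a hedged illustration: attention maps are represented as flattened lists of floats, the mean-squared-error form is an assumption (the abstract only says the composite attention is encouraged to equal the sum of its parts), and the function name is hypothetical.

```python
from typing import List, Sequence

def composition_loss(attn_composite: Sequence[float],
                     attn_parts: Sequence[Sequence[float]]) -> float:
    """Mean squared gap between composite attention and the sum of part attentions."""
    summed: List[float] = [sum(vals) for vals in zip(*attn_parts)]
    return sum((c - s) ** 2 for c, s in zip(attn_composite, summed)) / len(attn_composite)
```

For a phrase such as "the dog and the ball", the loss is zero when the composite map covers both objects as the two single-object maps do, and grows when attention collapses onto one object.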

Flow-Based Conformal Predictive Distributions

Authors: Trevor Harris

First: 2026-02-07T17:26:50+00:00 · Latest: 2026-05-07T17:00:28+00:00

Comments: 9 pages, 15 figures, 20 appendix pages

Abstract

Conformal prediction provides a distribution-free framework for uncertainty quantification via prediction sets with exact finite-sample coverage. In low dimensions these sets are easy to interpret, but in high-dimensional or structured output spaces they are difficult to represent and use, which can limit their ability to integrate with downstream tasks such as sampling and probabilistic forecasting. We show that any sufficiently regular differentiable nonconformity score induces a deterministic flow on the output space whose trajectories converge to the boundary of the corresponding conformal prediction set. This leads to a computationally efficient, training-free method for sampling conformal boundaries in arbitrary dimensions. Mixing across confidence levels yields conformal predictive distributions whose quantile regions coincide with the empirical conformal prediction sets. We provide an approximation bound decomposing CPD predictive error into score-induced distortion, base-measure quality, and gradient flow-induced distortion. We evaluate the approach on PDE inverse problems, precipitation downscaling, climate model debiasing, and hurricane trajectory forecasting.
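
One way to picture a score-induced flow that converges to a conformal boundary is the sketch below. It rests on assumptions: the score is a plain Euclidean distance to a hypothetical point prediction `center`, the threshold `q` is a made-up stand-in for a calibrated conformal quantile, and the particular flow (which drives the gap between score and threshold to zero exponentially) is one choice with the stated convergence property, not necessarily the paper's.

```python
import math
from typing import List, Sequence

def score(y: Sequence[float], center: Sequence[float]) -> float:
    """Toy nonconformity score: Euclidean distance to the point prediction."""
    return math.sqrt(sum((a - c) ** 2 for a, c in zip(y, center)))

def grad_score(y: Sequence[float], center: Sequence[float]) -> List[float]:
    """Gradient of the toy score (undefined exactly at the center)."""
    s = score(y, center)
    return [(a - c) / s for a, c in zip(y, center)]

def flow_to_boundary(y: Sequence[float], center: Sequence[float], q: float,
                     dt: float = 0.1, steps: int = 200) -> List[float]:
    """Euler-integrate dy/dt = -(s(y) - q) * grad s / |grad s|^2 toward {s = q}."""
    y = list(y)
    for _ in range(steps):
        g = grad_score(y, center)
        g2 = sum(v * v for v in g)
        gap = score(y, center) - q
        y = [a - dt * gap * v / g2 for a, v in zip(y, g)]
    return y
```

Along this flow the gap s(y) - q shrinks by a constant factor per step, so points that start inside or outside the set both settle on its boundary. Repeating the procedure for thresholds at several confidence levels would give samples from nested boundaries.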

Summary

The research addresses the difficulty of representing and using conformal prediction sets in high-dimensional or structured output spaces, which limits their integration with downstream tasks. The method uses a sufficiently regular differentiable nonconformity score to induce a deterministic flow on the output space whose trajectories converge to the boundary of the conformal prediction set, giving a computationally efficient, training-free way to sample conformal boundaries. Mixing across confidence levels yields conformal predictive distributions whose quantile regions coincide with the empirical conformal prediction sets, and an approximation bound decomposes the predictive error into score-induced distortion, base-measure quality, and gradient flow-induced distortion. The approach is evaluated on PDE inverse problems, precipitation downscaling, climate model debiasing, and hurricane trajectory forecasting.

DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

Authors: Taewon Kang, Matthias Zwicker

First: 2026-05-07T16:22:21+00:00 · Latest: 2026-05-07T16:22:21+00:00

Comments: 40 pages, 33 figures

Abstract

Diffusion models generate realistic visual content, yet often fail to produce rare but plausible compositions. When prompted with combinations that are valid but underrepresented in training data, such as a snowy beach or a rainbow at night, the generation process frequently collapses toward more common alternatives. We identify this failure mode as default completion bias, where denoising trajectories are implicitly attracted toward high-frequency semantic configurations. Existing guidance mechanisms do not explicitly model this competing tendency and therefore struggle to prevent such collapse. We introduce Default Completion Repulsion (DCR), a training-free framework that explicitly models and suppresses default completion behavior. DCR constructs a counterfactual attractor by relaxing the rare compositional factor while preserving surrounding semantics, inducing an alternative denoising trajectory reflecting the model's preferred completion. We define the discrepancy between target and attractor trajectories as a counterfactual drift, and propose a projection-based repulsion mechanism that removes guidance components aligned with this drift direction. This suppresses undesired frequent completions while preserving other semantic components. DCR operates entirely within the standard diffusion sampling process without retraining or architectural modification. Experiments on rare compositional prompts show that DCR improves compositional fidelity while maintaining visual quality. Our analysis further shows that the framework exposes and counteracts intrinsic model biases, offering a new perspective on controllable generation beyond explicit constraint enforcement.
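
The projection-based repulsion can be illustrated with plain vector arithmetic. This is a sketch under assumptions: guidance and drift are treated as flat vectors, the full component along the drift is removed regardless of sign (the abstract does not say whether only positively aligned components are dropped), and the function name is hypothetical.

```python
from typing import List, Sequence

def remove_drift_component(guidance: Sequence[float], drift: Sequence[float],
                           eps: float = 1e-12) -> List[float]:
    """Project `guidance` onto the subspace orthogonal to `drift`."""
    dd = sum(d * d for d in drift)
    if dd < eps:  # no meaningful drift direction, leave guidance unchanged
        return list(guidance)
    coef = sum(g * d for g, d in zip(guidance, drift)) / dd
    return [g - coef * d for g, d in zip(guidance, drift)]
```

After the projection the guidance has zero inner product with the drift direction, so it no longer pushes the sample along the axis associated with the default completion while its other components are untouched.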

FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

Authors: Fangda Chen, Shanshan Zhao, Longrong Yang, Chuanfu Xu, Zhigang Luo, Long Lan

First: 2026-05-07T16:21:34+00:00 · Latest: 2026-05-07T16:21:34+00:00

Abstract

Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, such as in camera motion and sequential motion. We analyze the video temporal extension issue from a singular-spectrum perspective and show that enlarged self-attention windows induce spectral concentration: spectral energy becomes dominated by a few low-rank singular directions, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variations. To mitigate this problem, we propose FreeSpec, a training-free spectral reconstruction framework for long-video generation. FreeSpec decomposes global and local features with singular value decomposition, and uses the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid feature partitioning of previous decomposition rules, preserving long-range consistency while better retaining spatial details and temporal dynamics. Experiments on Wan2.1 and LTX-Video demonstrate that FreeSpec improves long-video generation, especially for temporal dynamics, while maintaining strong visual quality and temporal consistency. Project demo: https://fdchen24.github.io/FreeSpec-Website/.

GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs

Authors: Pranav Mantini, Shishir K. Shah

First: 2026-05-07T16:01:59+00:00 · Latest: 2026-05-07T16:01:59+00:00

Abstract

We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity ($O(1)$), regardless of the number of integrated experts. Experimental results across multi-domain adaptation and class-incremental learning show that GeoStack provides an efficient mechanism for long-term knowledge composition while significantly mitigating catastrophic forgetting. Code is available at https://github.com/QuantitativeImagingLaboratory/GeoStack.

Summary

The research addresses catastrophic forgetting in Vision-Language Models (VLMs) when expertise from multiple domains is accumulated. GeoStack, a modular framework, composes independently trained domain experts into a unified model. It imposes geometric and structural constraints on the adapter manifold to preserve the base model's foundational knowledge, and a weight-folding property keeps inference complexity constant ($O(1)$) regardless of the number of integrated experts. Experiments on multi-domain adaptation and class-incremental learning show that GeoStack significantly mitigates catastrophic forgetting and provides an efficient mechanism for long-term knowledge composition.

A Regime Theory of Controller Class Selection for LLM Action Decisions

Authors: Zhaoyang Jiang, Zhizhong Fu, Yunsoo Kim, Jiacong Mi, Zicheng Li, Xuanqi Peng, Honghan Wu

First: 2026-05-07T14:28:17+00:00 · Latest: 2026-05-07T14:28:17+00:00

Abstract

Deployed language and vision-language models must decide, on each input, whether to answer directly, retrieve evidence, defer to a stronger model, or abstain. Contrary to the common monotonicity intuition, greater per-input expressivity is not uniformly beneficial in finite samples: under identical strict cross-validation, different benchmarks prefer different controller classes. This reflects a finite-sample limitation of instance-level uncertainty signals, which can be exhausted at a distribution-dependent scale. We organize controllers into a nested lattice of four classes: fixed actions, partition routers, instance-level controllers, and prior-gated controllers, ordered by complexity. We prove a regime theory that turns three data-estimable bottlenecks into a class choice: how much improvement is possible beyond the best fixed action, whether there are enough samples for instance-level controllers to make reliable decisions, and how much improvement a coarse partition router can recover when instance-level signal is unreliable. The resulting Bernstein-tight threshold has a matching information-theoretic lower bound, and strict nested cross-validation provably selects a near-best class. Across SMS-Spam, HallusionBench, A-OKVQA, and FOLIO, the predicted class matches the empirical winner; the prior-gated controller wins on TextVQA when OCR tokens supply a label-free prediction-time prior. Code is available at https://github.com/Anonymous-Awesome-Submissions/Regime-Theory.

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

Authors: Zecheng Tang, Jiaye Fu, Qiankun Gao, Haijie Li, Yanmin Wu, Jiaqi Zhang, Siwei Ma, Jian Zhang

First: 2026-05-07T13:45:37+00:00 · Latest: 2026-05-07T13:45:37+00:00

Abstract

Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, $π^3$, and Depth-Anything-3, and achieves up to $28\times$ speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
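
The asymmetric treatment of query and key-value tokens can be sketched as two separate reductions. This is an illustration only: tokens are plain lists, queries are merged by averaging consecutive groups, key-value pairs are pruned by a hypothetical per-token importance score, and both function names are made up. The abstract does not specify how groups or pruning scores are chosen.

```python
from typing import List, Sequence, Tuple

def merge_queries(queries: Sequence[Sequence[float]], group_size: int) -> List[List[float]]:
    """Average each consecutive group of query tokens into one merged token."""
    merged = []
    for i in range(0, len(queries), group_size):
        group = queries[i:i + group_size]
        merged.append([sum(col) / len(group) for col in zip(*group)])
    return merged

def prune_kv(keys: Sequence, values: Sequence, scores: Sequence[float],
             keep: int) -> Tuple[list, list]:
    """Keep the `keep` highest-scoring key-value pairs, preserving original order."""
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    top = sorted(ranked)
    return [keys[i] for i in top], [values[i] for i in top]
```

Using a small `group_size` for queries and a small `keep` for key-value pairs mirrors the stated design choice: mild compression where tokens are sensitive, aggressive compression where they are not.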

Summary

Spark3R is designed to accelerate feed-forward 3D reconstruction models based on Vision Transformers by asymmetrically reducing query and key-value tokens. It applies distinct reduction factors to these tokens, with more aggressive compression on key-value tokens and intra-group merging on query tokens. Spark3R also adaptively adjusts the key-value reduction factor across layers to optimize the quality-efficiency trade-off. This framework achieves up to 28 times speedup on 1,000-frame inputs without retraining, maintaining competitive reconstruction quality across various models including VGGT, $π^3$, and Depth-Anything-3.

Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs

Authors: Andy Zeyi Liu, Michael Zhang, Ilana Greenberg, Adam Alnasser, Lucas Baker, John Sous

First: 2026-05-07T13:19:33+00:00 · Latest: 2026-05-07T13:19:33+00:00

Abstract

Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activation steering is compact but typically weaker and does not support large structured reminders. We introduce memory inception (MI), a training-free method that steers in latent attention space by inserting text-derived key-value (KV) banks only at selected layers. Rather than materializing reminder content throughout the prompt cache, MI treats steering as selective KV allocation, injecting latent slots only where the model routes to them. On matched personality-steering tasks, MI gives the best overall control-drift trade-off, remaining competitive with prompting while consistently outperforming CAA. On updateable guidance, MI supports mid-conversation behavior shifts without rewriting the visible transcript, achieving the highest post-shift alignment on Qwen3. On structured reasoning, MI outperforms visible prompting on HARDMath and PHYSICS (10/12 subject$\times$mode cells), which serve as proxies for structured reasoning in verifiable domains, while cutting content-matched KV storage by up to 118$\times$. These results position MI as a powerful steering method when guidance is persistent, structured, or expensive to keep in the visible transcript.

Summary

Memory Inception (MI) is a training-free method that steers large language models (LLMs) in latent attention space by inserting text-derived key-value (KV) banks only at selected layers. On matched personality-steering tasks it gives the best overall control-drift trade-off, remaining competitive with prompting and consistently outperforming CAA. It supports mid-conversation behavior shifts without rewriting the visible transcript, achieving the highest post-shift alignment on Qwen3. On HARDMath and PHYSICS, MI outperforms visible prompting in 10 of 12 subject-by-mode cells while cutting content-matched KV storage by up to 118 times.

MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models

Authors: Xunlan Zhou, Xuanlin Chen, Shaowei Zhang, ShengHua Wan, Xiaohai Hu, Lei Yuan, De-chuan Zhan

First: 2026-01-28T11:25:13+00:00 · Latest: 2026-05-07T13:06:58+00:00

Abstract

Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL (Multi-stAge guidance for Robotic manipulation via Vision-Language models). MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.

Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios

Authors: Peizheng Yan, Yu Zhao, Liang Xie, Juntong Qi, Mingming Wang, Erwei Yin

First: 2026-05-07T13:01:28+00:00 · Latest: 2026-05-07T13:01:28+00:00

Abstract

Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.
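
The State-Event-State representation and the causal retrieval over it can be sketched as a small data structure. Everything here is a hypothetical illustration: the class and field names, the string-valued states, and the breadth-first walk over both cause and effect edges are assumptions. The abstract does not describe the dual-store memory or the bidirectional retrieval strategy in this detail.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SESEvent:
    event_id: str
    state_before: str
    event: str
    state_after: str

@dataclass
class EventGraph:
    events: Dict[str, SESEvent] = field(default_factory=dict)
    effects: Dict[str, List[str]] = field(default_factory=dict)  # cause id -> effect ids
    causes: Dict[str, List[str]] = field(default_factory=dict)   # effect id -> cause ids

    def add_event(self, ev: SESEvent) -> None:
        self.events[ev.event_id] = ev
        self.effects.setdefault(ev.event_id, [])
        self.causes.setdefault(ev.event_id, [])

    def add_causal_edge(self, cause_id: str, effect_id: str) -> None:
        self.effects[cause_id].append(effect_id)
        self.causes[effect_id].append(cause_id)

    def causal_neighbourhood(self, seed_id: str, max_hops: int = 2) -> List[str]:
        """Breadth-first walk over cause and effect edges, in both directions."""
        seen, queue = {seed_id}, deque([(seed_id, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue
            for nxt in self.causes[node] + self.effects[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
        return sorted(seen)
```

A semantic match would pick the seed event, and the walk then gathers the surrounding causal chain so that temporally distant but causally linked events reach the answering model together.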

Summary

Event-Causal RAG is a lightweight retrieval-augmented framework designed for long-video reasoning, addressing the limitations of existing models by segmenting videos into semantically coherent events and representing them as structured SES graphs. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory for efficient causal-topological retrieval. The model uses a bidirectional retrieval strategy to identify relevant event causal chains and provide them to a backbone video foundation model for answer generation. Experiments show that Event-Causal RAG outperforms clip-based retrieval baselines and long-context video models, especially for questions requiring causal inference across long temporal gaps, while also improving memory efficiency and streaming performance.

Event-Causal RAG 是一种轻量级的检索增强框架，旨在处理长视频推理问题，通过将视频分割成语义上连贯的事件，并用结构化的 State-Event-State (SES) 图表示这些事件来解决现有模型的局限性。这些图被合并到一个全局事件知识图中，并存储在一个支持语义匹配和因果拓扑检索的双存储器中。该模型使用双向检索策略来识别相关的事件因果链，并将它们提供给基础视频模型以生成答案。实验表明，Event-Causal RAG 在需要跨长时间间隔进行因果推理的问题上优于基于片段的检索基线和长上下文视频模型，同时提高了内存效率和流式性能。

Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation

Authors: Abdelrahman Zaian, Sheethal Bhat, Mohamed Abdalkader, Andreas Maier

Venue: MICCAI 2026

First: 2026-05-07T12:54:53+00:00 · Latest: 2026-05-07T12:54:53+00:00

Comments: 10 pages, 5 figures. Submitted to MICCAI 2026

Abstract

Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially outperforming zero-shot Qwen (0.096, 0.732) and MMed-RAG (0.541, 0.641) on a retinal disease detection dataset with captions. For report generation, Retina-RAG attains ROUGE-L 0.429 and SBERT similarity 0.884, exceeding all baselines. The full framework operates on a single consumer-grade GPU, demonstrating that clinically structured retinal AI can be achieved with modest computational resources.

Summary / 总结

Retina-RAG is a modular framework that jointly performs diabetic retinopathy severity grading, macular edema detection, and clinical report generation. It uses a high-performance retinal classifier and a parameter-efficient vision-language model adapted via Low-Rank Adaptation, with a retrieval-augmented generation module to enhance diagnostic consistency. Retina-RAG achieves high F1-scores of 0.731 for DR grading and 0.948 for ME detection, and outperforms other models in report generation with ROUGE-L 0.429 and SBERT similarity 0.884.

Retina-RAG 是一个模块化框架，能够同时进行糖尿病视网膜病变严重程度分级、黄斑水肿检测和临床报告生成。它使用高性能的视网膜分类器和通过低秩适应调整的参数高效视觉语言模型，并带有检索增强生成模块以提高诊断一致性。Retina-RAG 在糖尿病视网膜病变分级上的 F1 分数达到 0.731，在黄斑水肿检测上的 F1 分数达到 0.948，并在报告生成上以 ROUGE-L 0.429 和 SBERT 相似度 0.884 超过其他基线模型。

Uncovering Entity Identity Confusion in Multimodal Knowledge Editing

Authors: Shu Wu, Xiaotian Ye, Xinyu Mou, Dongsheng Liu, Xiaohan Wang, Mengqi Zhang

First: 2026-05-07T12:14:54+00:00 · Latest: 2026-05-07T12:14:54+00:00

Abstract

Multimodal knowledge editing (MKE) aims to correct the internal knowledge of large vision-language models after deployment, yet the behavioral patterns of post-edit models remain underexplored. In this paper, we identify a systemic failure mode in edited models, termed Entity Identity Confusion (EIC): edited models exhibit an absurd behavior where text-only queries about the original entity's identity unexpectedly return information about the new entity. To rigorously investigate EIC, we construct EC-Bench, a diagnostic benchmark that directly probes how image-entity bindings shift before and after editing. Our analysis reveals that EIC stems from existing methods failing to distinguish between Image-Entity (I-E) binding and Entity-Entity (E-E) relational knowledge in the model, causing models to overfit E-E associations as a shortcut: the image is still perceived as the original entity, with the new entity's name serving only as a spurious identity label. We further explore potential mitigation strategies, showing that constraining edits to the model's I-E processing stage encourages edits to act more faithfully on I-E binding, thereby substantially reducing EIC. Based on these findings, we discuss principled desiderata for faithful MKE and provide methodological guidance for future research.

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Authors: Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu

Venue: ACL 2026

First: 2026-01-21T07:26:15+00:00 · Latest: 2026-05-07T12:10:26+00:00

Comments: Accepted to ACL 2026 Main

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions and achieving 10× faster time to first token (TTFT) compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.

OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention

Authors: Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab, Federico Tombari

First: 2026-05-07T12:10:07+00:00 · Latest: 2026-05-07T12:10:07+00:00

Abstract

Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic predictions on geometric structure, this formulation strengthens the coupling between geometry and semantics, leading to improved spatial coherence across similar structures in 3D space. To further enforce object-level semantic consistency, we introduce a structured codebook that serves as a set of shared semantic primitives. Furthermore, a codebook-guided attention mechanism is proposed to retrieve language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance. Extensive experiments on standard 2D and 3D open-vocabulary benchmarks demonstrate that our method consistently outperforms prior approaches, achieving improved segmentation quality, stronger 3D semantic consistency and a semantically interpretable codebook that provides insight into the learned representation.
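The codebook-guided attention described above (similarity matching between query embeddings and codebook entries, then retrieving language features) can be sketched as ordinary softmax attention over a small codebook. This is a minimal NumPy illustration, not the paper's implementation: the function name `codebook_attention`, the use of cosine similarity, and the `temperature` parameter are assumptions.

```python
import numpy as np


def codebook_attention(queries, codebook_keys, codebook_values, temperature=0.1):
    """Retrieve one feature per query as a softmax-weighted mix of codebook values.

    queries:         (N, D) per-Gaussian query embeddings (hypothetical input)
    codebook_keys:   (K, D) codebook entries used for matching
    codebook_values: (K, C) language features attached to each entry
    """
    # Cosine similarity between each query and each codebook entry.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = codebook_keys / np.linalg.norm(codebook_keys, axis=1, keepdims=True)
    logits = (q @ k.T) / temperature                 # (N, K)

    # Numerically stable softmax over the codebook axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ codebook_values, weights        # (N, C), (N, K)
```

Because nearby queries attend to the same few entries, their retrieved features end up close together, which is one plausible way a shared codebook could reduce intra-object feature variance.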

Summary / 总结

The paper addresses the challenge of understanding open-vocabulary 3D scenes using Gaussian-based representations. It introduces OpenGaFF, which uses a Gaussian Feature Field to model semantics as a continuous function of Gaussian geometry and appearance, enhancing spatial coherence. A structured codebook and codebook-guided attention mechanism are proposed to improve object-level semantic consistency and robustness in open-vocabulary reasoning. Experiments show that OpenGaFF outperforms previous methods in segmentation quality and 3D semantic consistency, and provides a semantically interpretable codebook for insight into the learned representation.

论文旨在使用高斯表示理解开放词汇的3D场景，提出了OpenGaFF框架，通过高斯特征场将语义建模为高斯几何和外观的连续函数，增强空间一致性。还提出了结构化码本和码本引导的注意力机制以提高对象级语义一致性及开放词汇推理的鲁棒性。实验表明，OpenGaFF在分割质量、3D语义一致性和提供可解释的码本以洞察学习表示方面优于先前方法。

Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions

Authors: Kjetil Indrehus, Adrian Duric, Changkyu Choi, Ali Ramezani-Kebrya

First: 2026-05-07T11:42:23+00:00 · Latest: 2026-05-07T11:42:23+00:00

Abstract

Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Prediction via CoExVQA's chain-of-explanation enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence achieves SotA explainable DocVQA performance on PFL-DocVQA, improving ANLS by 12% over the current explainable baselines while providing transparent and verifiable predictions.

中文标题/摘要

标题：迈向具有链式解释预测的自解释文档视觉问答

文档视觉问答（DocVQA）要求视觉-语言模型不仅推理文档中与问题相关的信息，还要确定答案在页面上的位置。现有DocVQA模型将问题相关证据和答案定位紧密结合，并且大多作为黑箱操作，提供有限的手段验证预测依赖于视觉证据的程度。我们提出了CoExVQA，这是一种具有自解释能力的DocVQA框架，通过链式解释设计实现基于推理过程的定位。CoExVQA首先识别问题相关证据，然后明确定位答案区域，最后仅从定位区域解码答案。通过CoExVQA的链式解释进行预测，可以在不同模态中直接检查和验证推理过程。实验证明，将解码限制在定位证据上，在PFL-DocVQA上实现了最先进的可解释DocVQA性能，相比当前的可解释基线提高了ANLS 12%，同时提供了透明且可验证的预测。

Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization

Authors: Weijian Su, Songqian Zhang, Yuqi Han, Jian Zhuang, Yongdong Huang, Qiang Zhang

Venue: CVPR 2026

First: 2026-05-07T11:34:41+00:00 · Latest: 2026-05-07T11:34:41+00:00

Comments: Accepted by CVPR 2026

Abstract

As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework integrating the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. The PALDM leverages a latent fusion prior and a joint conditional loss to generate diverse candidate fusion results with various properties. PCLDM is subsequently fine-tuned via instance direct preference optimization (IDPO), enabling direct control of the final fusion results with heterogeneous preference signals. Experimental results demonstrate that our framework not only attains precise preference alignment among humans, vision-language models, and task-driven networks, but also sets a new benchmark for adaptive fusion quality and task-oriented transferability.

Can Vision-Language Models Think from the Sky? Unifying UAV Reasoning and Generation

Authors: Jintao Sun, Gangyi Ding, Donglin Di, Hu Zhang, Zhedong Zheng

First: 2026-04-07T03:23:30+00:00 · Latest: 2026-05-07T11:33:25+00:00

Comments: 21 pages, 12 figures, 7 tables

Abstract

Vision-Language Models have achieved strong progress in ground-view visual understanding, yet they remain brittle in high-altitude Unmanned Aerial Vehicle scenes, where objects are tiny and densely packed, textures are repetitive, and top-down orientations are ambiguous. We introduce UAVReason, a large-scale UAV-native dataset and evaluation suite for studying unified aerial reasoning and generation under this nadir-view domain shift. UAVReason aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs within a consistent aerial domain. It contains 23.6K captioned frames, 273K VQA pairs including 68.2K two-frame temporal questions, and 188.8K cross-modal generation samples across RGB, depth, and segmentation modalities. We further adapt UAVReason-Bagel as a unified understanding-and-generation baseline that jointly optimizes language reasoning and dense visual generation objectives. Experiments show that general-purpose VLMs and off-the-shelf unified generators struggle with UAV-native grounding, while UAVReason-Bagel substantially improves over its pretrained counterpart, increasing VQA-1F F1 from 0.394 to 0.711, VQA-2F F1 from 0.427 to 0.822, and heading-aware VQA F1 from 0.798 to 0.973. For generation, it improves segmentation mIoU to 0.143 and reduces KID from 0.078 to 0.048 for depth-segmentation-text-conditioned RGB synthesis. More importantly, our ablations reveal a bidirectional synergy between synthesis and reasoning. Dense generation objectives improve temporal semantic consistency, while language-level reasoning regularizes sparse-condition image synthesis. These results suggest that unified reasoning and generation provide effective geometry-aware structural priors for physically grounded aerial intelligence. All data, code, and evaluation tools will be released.

Summary / 总结

This paper addresses the limitations of vision-language models in high-altitude Unmanned Aerial Vehicle (UAV) scenes by introducing UAVReason, a large-scale dataset and evaluation suite. UAVReason aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs, focusing on aerial reasoning and generation. Experiments show that general-purpose vision-language models and unified generators struggle with UAV-native grounding, while the adapted UAVReason-Bagel substantially improves over its pretrained counterpart, raising VQA-1F F1 from 0.394 to 0.711 and VQA-2F F1 from 0.427 to 0.822. The study also reveals a bidirectional synergy between generation and reasoning, suggesting that unified reasoning and generation provide effective structural priors for aerial intelligence.

本文通过引入UAVReason数据集和评估套件，解决了视觉语言模型在高空无人机场景中的局限性。UAVReason包含RGB图像、深度图、语义分割掩码、描述和问答对，专注于航空推理和生成。实验表明，通用的视觉语言模型和统一生成器在无人机本地定位上表现不佳，而适应后的UAVReason-Bagel显著提高了性能，增强了VQA准确性和分割质量。研究还揭示了生成和推理之间的双向协同作用，表明统一的推理和生成为物理接地的航空智能提供了有效的结构先验。

PlotPick: AI-powered batch extraction of numerical data from scientific figures

Authors: Tommy Carstensen

First: 2026-05-07T11:15:39+00:00 · Latest: 2026-05-07T11:15:39+00:00

Comments: 7 pages, 2 figures, 2 tables. Software available at https://plotpick.streamlit.app and https://github.com/tommycarstensen/plotpick

Abstract

Systematic reviews and meta-analyses frequently require numerical data that authors report only as figures, yet manual digitisation is slow and does not scale. We present PlotPick, an open-source tool that uses vision-language models (VLMs) to batch-extract structured tabular data from scientific figures. We evaluate six VLMs from three providers on two established chart-to-table benchmarks (ChartX and PlotQA) and compare against the dedicated chart-to-table model DePlot. All six VLMs outperform DePlot on both benchmarks. On ChartX (restricted to bar charts, line charts, box plots, and histograms; n=300), VLMs achieve 88-96% recall versus 71% for DePlot. On PlotQA (n=529), VLMs achieve 86-99% RMSF1 versus 94% for DePlot. The gap is largest on chart types absent from the dedicated models' training data: on box plots, DePlot achieves 24% RMSF1 while VLMs achieve 83-97%. PlotPick is available at https://plotpick.streamlit.app.

中文标题/摘要

标题：PlotPick：基于AI的批量提取科学图表中数值数据工具

系统评价和元分析经常需要作者仅以图表形式报告的数值数据，但手动数字化速度慢且无法扩展。我们介绍了PlotPick，这是一个开源工具，使用视觉-语言模型（VLMs）批量提取科学图表中的结构化表格数据。我们在两个已建立的图表到表格基准测试（ChartX和PlotQA）上评估了来自三个提供商的六种VLMs，并将其与专门的图表到表格模型DePlot进行比较。所有六种VLMs在两个基准测试上均优于DePlot。在ChartX（仅限条形图、线图、箱形图和直方图；n=300）上，VLMs的召回率为88-96%，而DePlot为71%。在PlotQA（n=529）上，VLMs的RMSF1为86-99%，而DePlot为94%。差距最大的是在专门模型训练数据中不存在的图表类型上：在箱形图上，DePlot的RMSF1为24%，而VLMs为83-97%。PlotPick可在https://plotpick.streamlit.app 获取。

Summary / 总结

PlotPick is an open-source tool that uses vision-language models (VLMs) to batch-extract numerical data from scientific figures, addressing the inefficiency of manual digitisation. The evaluated VLMs are compared against the dedicated chart-to-table model DePlot on two benchmarks, reaching 88-96% recall on ChartX (versus 71% for DePlot) and 86-99% RMSF1 on PlotQA (versus 94% for DePlot). The gap is largest on chart types absent from the dedicated model's training data, such as box plots, where DePlot achieves 24% RMSF1 while the VLMs achieve 83-97%.

PlotPick 是一个开源工具，使用视觉-语言模型（VLMs）从科学图表中批量提取数值数据，解决手动数字化效率低的问题。评估中的 VLMs 在两个基准测试上与专门的图表到表格模型 DePlot 进行比较：在 ChartX 上召回率为 88-96%（DePlot 为 71%），在 PlotQA 上 RMSF1 为 86-99%（DePlot 为 94%）。差距最大的是专门模型训练数据中不存在的图表类型，例如箱形图：DePlot 的 RMSF1 为 24%，而 VLMs 为 83-97%。

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang

First: 2026-05-07T10:48:46+00:00 · Latest: 2026-05-07T10:48:46+00:00

Comments: 21 pages, 16 figures

Abstract

Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.

Summary / 总结

4DThinker is a framework that enables vision-language models to perform dynamic spatial reasoning through internal 4D imagery simulation. It introduces a data generation pipeline for synthesizing 4D reasoning data and a fine-tuning method called DIFT that grounds the model in dynamic visual semantics. 4DRL further enhances this by using outcome-based rewards. Experiments show that 4DThinker outperforms strong baselines on multiple dynamic spatial reasoning benchmarks.

4DThinker 是一个框架，使视觉-语言模型能够通过内部 4D 图像模拟来进行动态空间推理。它引入了一个数据生成管道来合成 4D 推理数据，并提出了一种称为 DIFT 的微调方法，将模型与动态视觉语义联系起来。4DRL 进一步通过基于结果的奖励来增强这一点，限制策略梯度仅在文本标记上。实验表明，4DThinker 在多个动态空间推理基准测试中优于强基线。

Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs

Authors: Zixuan Chen, Hao Lin, Zizhe Chen, Yizhou Tian, Garry Yang, Depeng Wang, Ya Guo, Huijia Zhu, James Cheng

First: 2026-05-07T10:04:39+00:00 · Latest: 2026-05-07T10:04:39+00:00

Abstract

LLMs reliably correct false claims when presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode 'correction suppression' and construct a benchmark of 300 false premises to systematically evaluate it across eight models. Suppression rates range from 19% to 90%, with four models exceeding 80%, establishing correction suppression as a prevalent and severe phenomenon. Mechanistic analysis reveals that suppression is not a knowledge failure: the model registers the error internally but task context diverts early-layer attention from the false claim as output intent crystallizes toward compliance at middle layers. We characterize this as 'knowing but not correcting': suppression occurs at response selection rather than knowledge encoding. Guided by this mechanism, we propose two training-free interventions. Correction Direction Steering (CDS) estimates a correction-compliance direction from matched pairs and injects it at middle layers before output intent crystallizes. Dynamic Payload Amplification (DPA) localizes payload tokens via attention divergence between early and late layers and amplifies their representation at the final layer, requiring no calibration data. Experiments on Qwen3.5-9B and LLaMA3.1-8B show both methods substantially improve factual strictness. CDS achieves the highest correction rate on Qwen3.5-9B (0% → 58.2%). DPA is the only method that preserves or improves reasoning capability on both models. These findings introduce 'factual strictness', the willingness to uphold accuracy against contextual pressures, as a new dimension of model reliability.
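The steering idea behind CDS (estimate a correction-compliance direction from matched pairs, then inject it at a middle layer) resembles a difference-of-means activation steering step, which can be sketched in a few lines. This is an illustration under assumptions, not the authors' procedure: the function names, the difference-of-means estimator, and the scale `alpha` are hypothetical, and the hidden states are synthetic arrays rather than activations from a real model.

```python
import numpy as np


def estimate_direction(h_correct, h_comply):
    """Unit vector pointing from the mean 'comply' state to the mean 'correct' state.

    h_correct, h_comply: (n_pairs, d) hidden states at one layer for matched
    prompts where the model corrected vs. complied (hypothetical inputs).
    """
    diff = h_correct.mean(axis=0) - h_comply.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, alpha=4.0):
    """Add alpha * direction to every token's hidden state at that layer."""
    return hidden + alpha * direction
```

In a real model the `steer` step would run inside a forward hook on the chosen middle layer; which layer and what scale to use are exactly the details this sketch leaves open.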

中文标题/摘要

标题：了解但不纠正：常规任务请求抑制LLM事实纠正

LLM在孤立呈现时能可靠地纠正虚假声明，但在嵌入任务导向请求时，它们往往遵守请求而不纠正。我们称这种失败模式为“纠正抑制”，并构建了一个包含300个虚假前提的基准测试，系统地评估了八个模型中的这一现象。抑制率从19%到90%不等，其中四个模型超过80%，确立了纠正抑制作为一种普遍且严重的现象。机制分析表明，抑制并非知识失败：模型内部已识别错误，但任务背景将早期层的注意力从虚假声明中转移，以符合中间层的合规意图。我们将其描述为“了解但不纠正”——抑制发生在响应选择而非知识编码阶段。基于这一机制，我们提出了两种无需训练的干预措施。纠正方向引导（CDS）从匹配的成对样本中估计纠正-合规方向，并在输出意图固化之前注入中间层。动态负载放大（DPA）通过早期层和晚期层之间的注意力差异定位负载标记，并在最终层放大其表示，无需校准数据。在Qwen3.5-9B和LLaMA3.1-8B上的实验表明，这两种方法显著提高了事实严谨性。CDS在Qwen3.5-9B上实现了最高的纠正率（0%→58.2%）。DPA是唯一在两个模型上保持或提高推理能力的方法。这些发现引入了“事实严谨性”——在面对背景压力时坚持准确性的意愿——作为模型可靠性的一个新维度。

Summary / 总结

The paper investigates why large language models (LLMs) often comply with false claims when presented in task-oriented requests, a phenomenon termed 'correction suppression'. By evaluating 300 false premises across eight models, the study finds suppression rates ranging from 19% to 90%, with four models exceeding 80%. The authors propose two training-free interventions: Correction Direction Steering (CDS) and Dynamic Payload Amplification (DPA), which improve factual strictness. CDS achieves the highest correction rate on Qwen3.5-9B, while DPA preserves reasoning capability on both Qwen3.5-9B and LLaMA3.1-8B models. This work introduces 'factual strictness' as a new dimension of model reliability.

研究探讨了当大型语言模型（LLMs）以任务导向的方式接收到错误陈述时，为何会遵守这些错误陈述，这种现象被称为‘纠正抑制’。通过在八个模型上评估300个错误前提，研究发现抑制率从19%到90%不等，其中四个模型的抑制率超过80%。作者提出了两种无需训练的干预措施：纠正方向引导（CDS）和动态负载放大（DPA），这些措施提高了事实准确性。CDS在Qwen3.5-9B上的纠正率最高，而DPA在Qwen3.5-9B和LLaMA3.1-8B上均保持或提升了推理能力。这项工作引入了‘事实准确性’作为模型可靠性的新维度。

Adaptive Greedy Frame Selection for Long Video Understanding

Authors: Yuning Huang, Xiaoyu Ji, Joseph Huang, Yichi Zhang, Fengqing Zhu

First: 2026-03-20T17:55:32+00:00 · Latest: 2026-05-07T09:47:21+00:00

Abstract

Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard $(1 - 1/e)$ greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
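The objective described above, a weighted sum of a modular relevance term and a facility-location coverage term maximized greedily under a frame budget, can be sketched as follows. The weighting `lam`, the function name `greedy_select`, and the assumption that both scores are nonnegative (so the objective is zero on the empty set) are illustrative choices; the paper's actual presets and embeddings are not reproduced here.

```python
import numpy as np


def greedy_select(relevance, similarity, budget, lam=0.5):
    """Greedily pick `budget` frames maximizing
    lam * sum_{i in S} relevance[i]
      + (1 - lam) * sum_j max_{i in S} similarity[j, i].

    relevance:  (N,) nonnegative question-relevance score per candidate frame
    similarity: (N, N) nonnegative frame-to-frame similarity
    """
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    n = len(relevance)
    selected = []
    best_cover = np.zeros(n)  # max similarity of each frame to the selected set
    for _ in range(min(budget, n)):
        # Marginal coverage gain of adding candidate i (column i of the matrix).
        cover_gain = np.maximum(similarity - best_cover[:, None], 0.0).sum(axis=0)
        gain = lam * relevance + (1.0 - lam) * cover_gain
        if selected:
            gain[selected] = -np.inf  # never pick the same frame twice
        pick = int(np.argmax(gain))
        selected.append(pick)
        best_cover = np.maximum(best_cover, similarity[:, pick])
    return selected
```

With `lam=1.0` the rule degenerates to top-k relevance, which is where near-duplicate frames tend to crowd the selection; lowering `lam` trades some relevance for coverage of dissimilar frames.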

Summary / 总结

The paper addresses the challenge of efficiently selecting frames for long-video question answering using large vision-language models. It proposes an adaptive greedy frame selection method that optimizes both query relevance and semantic representativeness. The method constructs a 1 FPS candidate pool, embeds candidates in two spaces, and selects frames by maximizing a weighted sum of relevance and coverage terms. Experiments show consistent accuracy gains over uniform sampling and a strong baseline, with the largest improvements under tight frame budgets.

论文旨在解决使用大型视觉-语言模型进行长视频问答时的帧选择效率问题。提出了一种自适应贪婪帧选择方法，优化查询相关性和语义代表性。该方法构建了一个1 FPS候选池，将候选帧嵌入两个空间，并通过最大化相关性和覆盖性的加权和来选择帧。实验结果显示，在均匀采样和一个强大基线之上，该方法在所有帧预算下都表现出一致的准确率提升，特别是在预算紧张的情况下提升最大。

Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model

Authors: Junhui Yin, Nan Pu, Xinyu Zhang, Lingfeng Yang, Lin Wu, Xiaojie Wang, Zhun Zhong

First: 2026-05-07T09:20:42+00:00 · Latest: 2026-05-07T09:20:42+00:00

Comments: Accepted by International Journal of Computer Vision

Abstract

Prompt learning has become an effective and widely used technique in enhancing vision-language models (VLMs) such as CLIP for various downstream tasks, particularly in zero-shot classification within specific domains. Existing methods typically focus on either learning class-shared prompts for a given domain or generating instance-specific prompts through conditional prompt learning. While these methods have achieved promising performance, they often overlook class-specific knowledge in prompt design, leading to suboptimal outcomes. The underlying reasons are: 1) class-specific prompts offer more fine-grained supervision compared to coarse class-shared prompts, which helps prevent misclassification of data from different classes into a single class; 2) compared to class-specific prompts, instance-specific prompts neglect the richer class-level information across multiple instances, potentially causing data from the same class to be divided into multiple classes. To effectively supplement the class-specific knowledge into existing methods, we propose a plug-and-play Class-Aware Knowledge Injection (CAKI) framework. CAKI comprises two key components, i.e., class-specific prompt generation and query-key prompt matching. The former encodes class-specific knowledge into prompts from few-shot samples that belong to the same class and stores the learned prompts in a class-level knowledge bank. The latter provides a plug-and-play mechanism for each test instance to retrieve relevant class-level knowledge from the knowledge bank and inject such knowledge to refine model predictions. Extensive experiments demonstrate that our CAKI effectively improves the performance of existing methods on base and novel classes. Code is publicly available at https://github.com/yjh576/CAKI.

Summary / 总结

The paper addresses the limitation of existing prompt learning methods in vision-language models by proposing a plug-and-play Class-Aware Knowledge Injection (CAKI) framework. CAKI generates class-specific prompts and uses a query-key mechanism to match and inject class-level knowledge into model predictions. Experiments show that CAKI enhances the performance of existing methods on both base and novel classes.

论文提出了一种插件式Class-Aware Knowledge Injection (CAKI)框架，以解决现有视觉-语言模型提示学习方法中的局限性。CAKI生成类特定的提示，并使用查询-键机制匹配和注入类级知识以改进模型预测。实验表明，CAKI能够提升现有方法在基类和新类上的性能。

SMI: Statistical Membership Inference for Reliable Unlearned Model Auditing

Authors: Jialong Sun, Zeming Wei, Jiaxuan Zou, Jiacheng Gong, Jie Fu, Chengyang Dong, Heng Xu, Jialong Li, Bo Liu

First: 2026-02-01T10:51:53+00:00 · Latest: 2026-05-07T09:14:58+00:00

Abstract

Machine unlearning (MU) is essential for enforcing the right to be forgotten in machine learning systems. A key challenge of MU is how to reliably audit whether a model has truly forgotten specified training data. Membership Inference Attacks (MIAs) are widely used for unlearned model auditing, where samples that evade membership detection are regarded as successfully forgotten. We show this assumption is fundamentally flawed: failed membership inference does not imply true forgetting. We prove that unlearned samples occupy fundamentally different positions in the feature space than non-member samples, making this alignment bias unavoidable and unobservable, which leads to systematically optimistic evaluations of unlearning performance. Meanwhile, training shadow models for MIA incurs substantial computational overhead. To address both limitations, we propose Statistical Membership Inference (SMI), a training-free auditing framework that reformulates auditing as estimating the non-member mixture proportion in the unlearned feature distribution. Beyond estimating the forgetting rate, SMI also provides bootstrap reference ranges for quantified auditing reliability. Extensive experiments show that SMI consistently outperforms all MIA-based baselines, with no shadow model training required. Overall, SMI establishes a principled and efficient alternative to MIA-based auditing methods, with both theoretical guarantees and strong empirical performance.
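The reformulation described above, auditing as estimation of the non-member mixture proportion with bootstrap reference ranges, can be illustrated on a one-dimensional score. This is a deliberately simple stand-in, not the paper's estimator: it assumes a scalar feature (for example a confidence score), uses a method-of-moments estimate, and the names `mixture_proportion` and `bootstrap_range` are hypothetical.

```python
import numpy as np


def mixture_proportion(unlearned, members, non_members):
    """Method-of-moments estimate of the non-member share in `unlearned`.

    Assumes mean(unlearned) = p * mean(non_members) + (1 - p) * mean(members),
    and that the member and non-member means differ.
    """
    gap = members.mean() - non_members.mean()
    p = (members.mean() - unlearned.mean()) / gap
    return float(np.clip(p, 0.0, 1.0))


def bootstrap_range(unlearned, members, non_members, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap range for the estimated proportion."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample each group with replacement and re-estimate.
        u = rng.choice(unlearned, size=len(unlearned), replace=True)
        m = rng.choice(members, size=len(members), replace=True)
        n = rng.choice(non_members, size=len(non_members), replace=True)
        estimates.append(mixture_proportion(u, m, n))
    lo, hi = np.quantile(estimates, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)
```

An estimated proportion near 1 would suggest the forgotten samples look like data the model never saw, while the width of the bootstrap range indicates how much that reading can be trusted. No shadow models are trained anywhere in this sketch, which matches the training-free framing.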

Summary / 总结

The paper addresses the challenge of reliably auditing whether a machine learning model has forgotten specified training data after unlearning. It introduces Statistical Membership Inference (SMI), a training-free method that estimates the proportion of non-member samples in the unlearned feature distribution. SMI outperforms existing Membership Inference Attack (MIA) based methods without requiring shadow model training, providing more reliable evaluations of unlearning performance.

论文旨在解决如何可靠地审计机器学习模型在执行数据遗忘后是否真正忘记了指定的训练数据。它提出了统计成员推断（SMI）方法，该方法无需训练即可估计未学习特征分布中非成员样本的比例。SMI在不需要训练影子模型的情况下优于现有的成员推断攻击（MIA）方法，提供了更可靠的遗忘率评估。

DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation

Authors: Sankarshana Venugopal, Mohammad Mostafavi, Jonghyun Choi

Venue: CVPR 2026

First: 2026-05-07T08:59:05+00:00 · Latest: 2026-05-07T08:59:05+00:00

Comments: Accepted to CVPR 2026. Includes supplementary material

Abstract

Diffusion-based image-to-image (I2I) translation excels in high-fidelity generation but suffers from slow sampling in state-of-the-art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training-free sampler that exploits the semi-linear structure of DBM's underlying SDE and ODE via exponential integrators, yielding highly-efficient 1st- and 2nd-order solutions. This reduces NFEs by up to 5x while boosting quality (e.g., FID drops 53% on DIODE at 20 NFEs vs. 2nd-order baseline). Experiments on inpainting, stylization, and semantics-to-image tasks across resolutions up to 256x256 show DBMSolver sets new SOTA efficiency-quality tradeoffs, enabling real-world applicability. Our code is publicly available at https://github.com/snumprlab/dbmsolver.
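To show why exponential integrators suit semi-linear dynamics, here is a scalar toy comparison. It is not the DBM sampler: the equation dx/dt = -lam * x + nonlinear(x, t), the coefficient `lam`, and both function names are assumptions chosen only to show the technique, in which the linear part is integrated exactly and the nonlinear part is frozen over each step.

```python
import math


def exponential_euler(x0, lam, nonlinear, t_end, n_steps):
    """First-order exponential integrator for dx/dt = -lam * x + nonlinear(x, t).

    The linear part is integrated exactly; the nonlinear part is held
    constant at its value from the start of each step.
    """
    h = t_end / n_steps
    decay = math.exp(-lam * h)
    phi = (1.0 - decay) / lam  # integral of exp(-lam * s) for s in [0, h]
    x, t = x0, 0.0
    for _ in range(n_steps):
        x = decay * x + phi * nonlinear(x, t)
        t += h
    return x


def explicit_euler(x0, lam, nonlinear, t_end, n_steps):
    """Plain forward Euler on the same equation, for comparison."""
    h = t_end / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        x = x + h * (-lam * x + nonlinear(x, t))
        t += h
    return x
```

With a stiff linear term (large `lam`) and few steps, forward Euler diverges while the exponential scheme stays stable, and is exact when the nonlinear term is constant. That stability at large step sizes is the general reason such solvers can get by with fewer function evaluations.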

中文标题/摘要

标题：DBMSolver：一种无需训练的扩散桥梁采样器，用于高质量的图像到图像转换

基于扩散的图像到图像(I2I)转换在高保真生成方面表现出色，但在最先进的扩散桥梁模型(DBMs)中，采样速度较慢，通常需要数十次函数评估(NFEs)。我们引入了DBMSolver，这是一种无需训练的采样器，通过指数积分器利用DBM底层SDE和ODE的半线性结构，提供高效的1阶和2阶解。这将NFEs减少多达5倍，同时提高质量（例如，在20 NFEs下，DIODE上的FID相比2阶基线下降53%）。在分辨率高达256x256的修复、风格化和语义到图像任务中进行的实验表明，DBMSolver在效率-质量权衡上达到了新的SOTA，使其在实际应用中具有可行性。我们的代码可在https://github.com/snumprlab/dbmsolver上公开获取。
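
To make the idea of an exponential integrator for a semi-linear equation concrete, here is a minimal sketch on a scalar toy problem dx/dt = a*x + N(x, t). It is a generic first-order exponential Euler scheme, not DBMSolver's actual update rule (which the abstract does not spell out); the constant coefficient `a`, the function names, and the test problem are all assumptions made for illustration.

```python
import math


def exponential_euler(a, nonlinear, x0, t0, t1, steps):
    """Integrate dx/dt = a*x + nonlinear(x, t) with a first-order exponential integrator.

    The linear part is solved exactly; the nonlinear term is frozen over each step.
    Assumes a constant, nonzero scalar coefficient a.
    """
    h = (t1 - t0) / steps
    decay = math.exp(a * h)      # exact propagator of the linear part
    phi = math.expm1(a * h) / a  # integral of exp(a*s) for s in [0, h]
    x, t = x0, t0
    for _ in range(steps):
        x = decay * x + phi * nonlinear(x, t)
        t += h
    return x


def explicit_euler(a, nonlinear, x0, t0, t1, steps):
    """Plain explicit Euler on the same equation, for comparison."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x = x + h * (a * x + nonlinear(x, t))
        t += h
    return x


# Stiff toy problem with constant forcing c: the exact solution is known.
a, c, x0 = -50.0, 1.0, 1.0
forcing = lambda x, t: c
exact = (x0 + c / a) * math.exp(a * 1.0) - c / a
x_exp = exponential_euler(a, forcing, x0, 0.0, 1.0, steps=10)
x_euler = explicit_euler(a, forcing, x0, 0.0, 1.0, steps=10)
print(exact, x_exp, x_euler)
```

With only 10 steps the explicit Euler iterate diverges on this stiff problem, while the exponential scheme treats the linear part exactly. That is the general reason such integrators can get by with fewer function evaluations.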

Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models

Authors: Daniel Sungho Jung, Kyoung Mu Lee

First: 2026-05-07T08:57:27+00:00 · Latest: 2026-05-07T08:57:27+00:00

Comments: Project page: https://contactprompt-release.github.io/

Abstract

Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning of human interaction to accurately localize contact regions. Recently, multi-modal large language models (MLLMs) have demonstrated strong capabilities in understanding visual semantics, enabled by vision-language priors learned from large-scale data. However, leveraging MLLMs for dense hand contact estimation remains underexplored. There are two major challenges in applying MLLMs to dense hand contact estimation. First, encoding explicit 3D hand geometry is difficult, as MLLMs primarily operate on vision and language modalities. Second, capturing fine-grained vertex-level contact remains challenging, as MLLMs tend to focus on high-level semantics rather than detailed geometric reasoning. To address these challenges, we propose ContactPrompt, a training-free and zero-shot approach for dense hand contact estimation using MLLMs. To effectively encode 3D hand geometry, we introduce a detailed hand-part segmentation and a part-wise vertex-grid representation that provides structured, localized geometric information. To enable accurate and efficient dense contact prediction, we develop a multi-stage structured contact reasoning with part conditioning, progressively bridging global semantics and fine-grained geometry. Therefore, our method effectively leverages the reasoning capabilities of MLLMs while enabling precise dense hand contact estimation. Surprisingly, the proposed approach outperforms previous supervised methods trained on large-scale dense contact datasets without requiring any training. The codes will be released.

中文标题/摘要

标题：基于多模态大型语言模型的无训练密集手部接触估计

密集手部接触估计需要对人类互动进行高层次语义理解和精细几何推理，以准确定位接触区域。最近，多模态大型语言模型（MLLMs）通过大规模数据学习的视觉-语言先验，在理解视觉语义方面表现出强大的能力。然而，利用MLLMs进行密集手部接触估计仍未得到充分探索。将MLLMs应用于密集手部接触估计面临两大挑战。首先，编码明确的3D手部几何结构困难，因为MLLMs主要在视觉和语言模态上操作。其次，捕捉细粒度的顶点级接触仍然具有挑战性，因为MLLMs倾向于关注高层次语义而非详细的几何推理。为了解决这些挑战，我们提出了一种基于MLLMs的无训练和零样本密集手部接触估计方法——ContactPrompt。为了有效编码3D手部几何结构，我们引入了详细的手部部位分割和按部位划分的顶点网格表示，提供结构化、局部化的几何信息。为了实现准确且高效的密集接触预测，我们开发了一种带有部位条件的多阶段结构化接触推理方法，逐步连接全局语义和细粒度几何。因此，我们的方法有效地利用了MLLMs的推理能力，同时实现了精确的密集手部接触估计。令人惊讶的是，所提出的方法在无需任何训练的情况下，超越了在大规模密集接触数据集上进行监督训练的先前方法。代码将被发布。

Summary / 总结

The research aims to leverage multi-modal large language models (MLLMs) for dense hand contact estimation, addressing the challenges of encoding 3D hand geometry and capturing fine-grained vertex-level contact. The proposed ContactPrompt method uses a detailed hand-part segmentation and a part-wise vertex-grid representation to encode geometric information, and a multi-stage structured contact reasoning approach with part conditioning to predict dense contacts. Surprisingly, this training-free and zero-shot approach outperforms previous supervised methods trained on large-scale dense contact datasets, without requiring any training.

研究旨在利用多模态大型语言模型（MLLMs）进行密集手部接触估计，解决3D手部几何编码和细粒度顶点级接触捕捉的挑战。所提出的ContactPrompt方法使用详细的手部部位分割和按部位划分的顶点网格表示来编码几何信息，并采用带有部位条件的多阶段结构化接触推理方法来预测密集接触。令人惊讶的是，这种无需训练的零样本方法在无需任何训练的情况下，超越了之前在大规模密集接触数据集上训练的监督方法。

StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods

Authors: Zheng Li, Jerry Cheng, Huanying Helen Gu

First: 2026-04-06T09:21:48+00:00 · Latest: 2026-05-07T08:44:16+00:00

Comments: 27 pages, 10 figures, 9 tables

Abstract

Ensemble methods improve predictive performance but often incur high memory and computational costs. We identify an aggregation instability induced by nonlinear projection and voting operations. To address both efficiency challenges and this inconsistency, we propose StableTTA, a training-free test-time adaptation method with two variants. StableTTA-I targets coherent-batch inference settings, where temporally or semantically adjacent observations are likely to belong to the same class. Examples include burst photography, video streams, robotics perception, and industrial inspection. Under coherent-batch inference, StableTTA-I substantially improves prediction consistency and accuracy through variance-aware logit aggregation. StableTTA-II establishes feature-level cropping, enabling efficient logit aggregation with a single forward pass on a single model backbone. Experiments on ImageNet-1K across 71 models demonstrate that StableTTA-I consistently improves prediction accuracy under coherent-batch inference, while StableTTA-II provides lightweight and architecture-agnostic accuracy improvements with minimal computational overhead. These results suggest that inference-time semantic coherence and aggregation stability provide useful perspectives for improving practical test-time adaptation systems.

Summary / 总结

The paper aims to improve the performance of vision models by addressing the high cost and aggregation instability of ensemble methods. It introduces StableTTA, a training-free test-time adaptation method with two variants. StableTTA-I improves prediction consistency and accuracy in coherent-batch inference settings through variance-aware logit aggregation, while StableTTA-II uses feature-level cropping to aggregate logits with a single forward pass on a single model backbone. Experiments on ImageNet-1K across 71 models show that StableTTA-I consistently improves accuracy under coherent-batch inference, and that StableTTA-II provides lightweight, architecture-agnostic gains with minimal computational overhead.

论文通过提出无需训练的测试时自适应方法StableTTA来解决集成方法的高内存和计算成本问题。介绍了两种StableTTA变体：StableTTA-I适用于一致批次推理，通过方差感知的logit聚合提高预测一致性和准确性；StableTTA-II则通过单次前向传播实现高效的logit聚合，具有最小的计算开销。实验表明，StableTTA-I在一致批次推理下提高了预测准确性，而StableTTA-II提供了轻量级且架构无关的改进，计算开销最小。
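
The abstract does not give the aggregation formula, so the following is only a hedged sketch of what a variance-aware logit aggregation over a coherent batch could look like. The rule used here (per-class mean logit minus a penalty times the per-class standard deviation) is an assumption chosen for illustration, not StableTTA-I's actual rule; `aggregate_logits`, `predict`, and `penalty` are hypothetical names.

```python
from statistics import mean, pstdev


def aggregate_logits(batch_logits, penalty=1.0):
    """Score each class by its mean logit over the batch minus penalty * std.

    batch_logits: list of per-sample logit lists, all assumed to share one label.
    """
    per_class = list(zip(*batch_logits))  # transpose: one tuple of logits per class
    return [mean(col) - penalty * pstdev(col) for col in per_class]


def predict(batch_logits, penalty=1.0):
    """Return the index of the class with the highest aggregated score."""
    scores = aggregate_logits(batch_logits, penalty)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy coherent batch: class 0 is stable, class 1 has a higher mean but is erratic.
batch = [[2.0, 6.0], [2.0, -1.0], [2.0, 2.5]]
plain_vote = predict(batch, penalty=0.0)   # plain mean aggregation
stable_vote = predict(batch, penalty=1.0)  # variance-penalized aggregation
print(plain_vote, stable_vote)
```

In this toy batch, plain averaging picks the erratic class while the penalized score prefers the class whose logits agree across the batch, which is the kind of consistency a coherent-batch setting can exploit.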

Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

Authors: Megha Mariam K. M, Vineeth N. Balasubramanian, C. V. Jawahar

Venue: CVPR 2026

First: 2026-05-07T08:04:50+00:00 · Latest: 2026-05-07T08:04:50+00:00

Comments: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abstract

The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models to assess their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at https://github.com/meghamariamkm2002/MCD

中文标题/摘要

标题：统一科学交流：跨科学媒体的细粒度对应

科学知识的交流已变得越来越多元化，通过研究论文、幻灯片和录制的演讲等形式，涵盖了文本、视觉和语音等多种表现形式。这些不同的表现形式共同传达了研究的推理、结果和见解，提供了互补的视角，丰富了理解。然而，尽管它们有共同的目的，但这些材料很少以结构化的方式连接起来。缺乏跨格式的显式链接使得难以追踪概念、视觉和解释之间的对应关系，限制了对研究内容的统一探索和分析。为了解决这一问题，我们引入了多模态会议数据集（MCD），这是第一个将同一作品的研究论文、演示视频、解释视频和幻灯片整合在一起的基准。我们评估了一系列基于嵌入和视觉语言模型，以评估它们发现跨格式细粒度对应关系的能力，建立了这一任务的第一个系统性基准。我们的结果显示，视觉语言模型具有鲁棒性，但在细粒度对齐方面存在困难，而基于嵌入的模型在捕捉文本-视觉对应关系方面表现良好，但方程式和符号内容在嵌入空间中形成了独立的簇。这些发现突显了当前方法的优势和局限性，并指出了未来多模态科学理解研究的关键方向。为了确保可重复性，我们在https://github.com/meghamariamkm2002/MCD上发布了MCD的资源。

Summary / 总结

This study aims to connect scientific communication across media formats such as text, visuals, and speech by introducing the Multimodal Conference Dataset (MCD). The dataset integrates research papers, presentation videos, explanatory videos, and slides from the same works. The authors evaluate embedding-based and vision-language models on their ability to discover fine-grained correspondences between these formats. The results indicate that vision-language models are robust but have difficulty with fine-grained alignment, while embedding-based models capture text-visual correspondences well but place equations and symbolic content in distinct clusters in the embedding space.

该研究旨在通过引入多模态会议数据集（MCD），改善不同媒体格式（如文本、视觉和语音）之间的科学交流整合。数据集包含来自同一工作的研究论文、演示视频、解释视频和幻灯片。作者评估了各种模型，包括基于嵌入的模型和视觉语言模型，以发现这些格式之间的细粒度对应关系。研究结果表明，视觉语言模型虽然稳健，但在细粒度对齐方面存在困难，而基于嵌入的模型能较好地捕捉文本-视觉对应关系，但公式和符号内容在嵌入空间中形成了独立的簇。
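
As a rough illustration of embedding-based correspondence discovery across formats, the sketch below matches paper segments to slides by cosine similarity. It assumes each segment has already been embedded by some model; the 3-dimensional vectors, segment names, and the `best_matches` helper are made up for the example and do not come from MCD or its evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_matches(queries, candidates):
    """For each query id, return the candidate id with the highest cosine similarity."""
    return {
        q_id: max(candidates, key=lambda c_id: cosine(q_vec, candidates[c_id]))
        for q_id, q_vec in queries.items()
    }


# Hypothetical precomputed embeddings for paper segments and slides.
paper_segments = {"method": [1.0, 0.0, 0.2], "results": [0.0, 1.0, 0.1]}
slides = {"slide_3": [0.9, 0.1, 0.0], "slide_7": [0.1, 0.8, 0.3]}
links = best_matches(paper_segments, slides)
print(links)
```

Nearest-neighbor matching of this kind depends on related items lying close together in the embedding space, so content that forms its own separate cluster (as reported for equations and symbolic content) would be hard to link to its counterparts in other formats.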
