BAMI: Training-Free Bias Mitigation in GUI Grounding
Authors: Borui Zhang, Bo Zhang, Bo Wang, Wenzhao Zheng, Yuhao Cheng, Liang Tang, Yiqiang Yan, Jie Zhou, Jiwen Lu
Venue: CVPR 2026
First: 2026-05-07T17:59:31+00:00 · Latest: 2026-05-07T17:59:31+00:00
Comments: Accepted by CVPR 2026
Abstract
GUI grounding is a critical capability for enabling GUI agents to execute tasks such as clicking and dragging. However, in complex scenarios like the ScreenSpot-Pro benchmark, existing models often suffer from suboptimal performance. Utilizing the proposed \textbf{Masked Prediction Distribution (MPD)} attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce \textbf{Bias-Aware Manipulation Inference (BAMI)}, which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. Our extensive experimental results demonstrate that BAMI significantly enhances the accuracy of various GUI grounding models in a training-free setting. For instance, applying our method to the TianXi-Action-7B model boosts its accuracy on the ScreenSpot-Pro benchmark from 51.9\% to 57.8\%. Furthermore, ablation studies confirm the robustness of the BAMI approach across diverse parameter configurations, highlighting its stability and effectiveness. Code is available at https://github.com/Neur-IO/BAMI.
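The coarse-to-fine focus manipulation can be pictured as a crop-and-repredict loop. The sketch below is an illustration under stated assumptions, not the authors' implementation: `predict` is a hypothetical stand-in for a grounding model that returns a normalized click point inside whatever crop it is shown, and the `rounds` and `shrink` parameters are invented for the example.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]


def coarse_to_fine(predict: Callable[[Box], Point], width: float, height: float,
                   rounds: int = 3, shrink: float = 0.5) -> Point:
    """Repeatedly crop around the current prediction and re-predict.

    `predict(box)` is a hypothetical model call: it is shown the crop `box`
    and returns a point in normalized crop coordinates (each in 0..1).
    """
    box: Box = (0.0, 0.0, width, height)
    point: Point = (width / 2, height / 2)
    for _ in range(rounds):
        left, top, right, bottom = box
        nx, ny = predict(box)
        # Map the normalized crop prediction back to full-image pixels.
        point = (left + nx * (right - left), top + ny * (bottom - top))
        # Shrink the crop around the new point, clamped to the image bounds.
        w, h = (right - left) * shrink, (bottom - top) * shrink
        new_left = min(max(point[0] - w / 2, 0.0), width - w)
        new_top = min(max(point[1] - h / 2, 0.0), height - h)
        box = (new_left, new_top, new_left + w, new_top + h)
    return point
```

With a model whose output precision is limited relative to the crop it sees, each round reduces the pixel-level error because the same relative precision is applied to a smaller region.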
Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval
Authors: Zeyu Yang, Qi Ma, Jason Chen, Anshumali Shrivastava
First: 2026-05-07T17:54:29+00:00 · Latest: 2026-05-07T17:54:29+00:00
Abstract
Retrieval-augmented agents are increasingly the interface to large organizational knowledge bases, yet most still treat retrieval as a black box: they issue exploratory queries, inspect returned snippets, and iteratively reformulate until useful evidence emerges. This approach resembles how a newcomer searches an unfamiliar database rather than how an expert navigates it with strong priors about terminology and likely evidence, and results in unnecessary retrieval rounds, increased latency, and poor recall.
We introduce \textit{SuperIntelligent Retrieval Agent} (SIRA), which defines \emph{superintelligence} in retrieval as the ability to compress multi-round exploratory search into a single corpus-discriminative retrieval action. SIRA does not merely ask what terms are relevant to the query; it asks which terms are likely to separate the desired evidence from corpus-level confusers. On the corpus side, an LLM enriches each document offline with missing search vocabulary; on the query side, it predicts evidence vocabulary omitted by the query; and it consults document-frequency statistics through a tool call to filter out proposed terms that are absent, overly common, or unlikely to create retrieval margin. The final retrieval step is a single weighted BM25 call combining the original query with the validated expansion.
Across ten BEIR benchmarks and downstream question-answering tasks, SIRA outperforms dense retrievers and state-of-the-art multi-round agentic baselines, demonstrating that one well-formed lexical query, guided by LLM cognition and lightweight corpus statistics, can exceed substantially more expensive multi-round search while remaining interpretable, training-free, and efficient.
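The filter-then-retrieve step described above can be sketched in a few lines. This is a minimal illustration, not SIRA's code: the function names, the `max_df_ratio` cutoff, and the 0.5 weight on expansion terms are assumptions made for the example, and the BM25 variant uses the common Lucene-style IDF.

```python
import math
from collections import Counter
from typing import Dict, List


def document_frequency(corpus: List[List[str]]) -> Counter:
    """Count, for each term, the number of documents that contain it."""
    df: Counter = Counter()
    for doc in corpus:
        df.update(set(doc))
    return df


def filter_expansion(terms: List[str], df: Counter, n_docs: int,
                     max_df_ratio: float = 0.5) -> List[str]:
    """Keep proposed terms that occur in the corpus but are not overly common.

    The `max_df_ratio` threshold is an assumed value for illustration.
    """
    return [t for t in terms if 0 < df[t] <= max_df_ratio * n_docs]


def weighted_bm25(query_weights: Dict[str, float], corpus: List[List[str]],
                  k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document with BM25, scaling each term by its query weight."""
    n = len(corpus)
    df = document_frequency(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term, weight in query_weights.items():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
            score += weight * idf * norm
        scores.append(score)
    return scores
```

In use, the original query terms would receive weight 1.0 and the surviving expansion terms a smaller weight, so that a single call to `weighted_bm25` replaces several rounds of reformulation.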
DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression
Authors: Numan Saeed, Asif Hanif, Fadillah Adamsyah Maani, Hussain Alasmawi, Mohammad Yaqub
Venue: www
First: 2026-03-05T17:43:00+00:00 · Latest: 2026-05-07T17:28:16+00:00
Comments: Project website: www.numansaeed.com/mobilefetalclip
Abstract
Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude or more. We argue that, under such gaps, strict imitation of the teacher is a poor objective: much of the teacher's pairwise similarity structure reflects its own architectural biases rather than information a compact student can efficiently represent. We propose \textbf{Diagonal-Anchored Repulsive Knowledge Distillation (DARK)}, a contrastive KD framework that decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment throughout training; the off-diagonal term is annealed from positive to negative weighting, transitioning the student from imitating to \emph{repelling} the teacher's non-target similarity structure. We instantiate DARK by distilling FetalCLIP, a 427M-parameter fetal ultrasound vision-language model, into \textbf{MobileFetalCLIP}, a 75M-parameter student model with a $26\times$ smaller visual encoder, running in 1.6\,ms on an iPhone~16~Pro. The student matches or exceeds its teacher on three zero-shot benchmarks, including HC18 biometry validity (88.6\% vs.\ 83.5\%) and brain sub-plane F1 (0.784 vs.\ 0.702). Embedding-geometry and logit analyses show that DARK induces \emph{structured decorrelation}: the student preserves teacher-aligned per-image confidence while diverging from inherited inter-class confusion, suggesting that controlled repulsion can be more efficient than imitation under extreme compression.
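The diagonal/off-diagonal decomposition with an annealed off-diagonal weight can be illustrated as follows. This is a sketch under assumptions rather than the DARK objective itself: the squared-error form of each term, the linear schedule, and the end weight of -0.5 are all chosen for brevity and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def off_diagonal_weight(step: int, total_steps: int,
                        start: float = 1.0, end: float = -0.5) -> float:
    """Linearly anneal the weight from `start` (imitate) to `end` (repel)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def dark_style_loss(student: Matrix, teacher: Matrix,
                    step: int, total_steps: int) -> float:
    """Diagonal term plus annealed off-diagonal term on similarity matrices.

    Entry [i][j] is the similarity of image i and text j; the diagonal holds
    the matched pairs. Squared error is an assumed stand-in for both terms.
    """
    n = len(student)
    diag = sum((student[i][i] - teacher[i][i]) ** 2 for i in range(n)) / n
    off = sum((student[i][j] - teacher[i][j]) ** 2
              for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return diag + off_diagonal_weight(step, total_steps) * off
```

Early in training the off-diagonal weight is positive, so matching the teacher's non-target similarities lowers the loss; once the weight turns negative, deviating from them lowers it, while the diagonal term keeps anchoring the matched pairs.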
The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
Authors: Jiayun Luo, Mir Rayat Imtiaz Hossain, Pritam Sarkar, Boyang Li, Leonid Sigal
First: 2024-12-11T05:36:18+00:00 · Latest: 2026-05-07T17:14:28+00:00
Abstract
Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance degrades for complex, multi-object references. These limitations largely arise from training objectives that leverage image-caption alignment, where direct multi-object references are rare, the number of possible such references is theoretically large (exponential in the number of objects), and attribution is difficult. To address this, without requiring any additional annotations, we propose Compositional Attention-Regularized Training (CompART), which decomposes captions into object-centric phrases and constructs composite phrases by pairing them with conjunctions. We then introduce a composition loss that encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases, promoting balanced multi-object localization. We evaluate CompART across four VLM architectures, spanning both contrastive-based and generative-based models, on four benchmarks for multi-object grounding and two VQA benchmarks for general visual understanding. CompART consistently improves grounding for both single- and multi-object references across diverse VLM architectures and datasets, and further demonstrates enhanced visual understanding, as evidenced by gains on VQA, despite not being explicitly trained for this task.
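The composition loss can be made concrete with flattened attention maps. The sketch below assumes a mean-squared-error penalty and unnormalized maps; the paper may use a different distance or normalization, so treat the function as an illustration of the additive constraint only.

```python
from typing import List


def composition_loss(composite: List[float], parts: List[List[float]]) -> float:
    """Mean squared gap between the composite-phrase attention map and the
    element-wise sum of its constituent-phrase attention maps.

    All maps are flattened to equal-length lists; MSE is an assumed choice.
    """
    summed = [sum(values) for values in zip(*parts)]
    return sum((c - s) ** 2 for c, s in zip(composite, summed)) / len(composite)
```

A composite map that covers both objects incurs no penalty, while one that collapses onto a single object is penalized in proportion to the attention mass it is missing on the other.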
Flow-Based Conformal Predictive Distributions
Authors: Trevor Harris
First: 2026-02-07T17:26:50+00:00 · Latest: 2026-05-07T17:00:28+00:00
Comments: 9 pages, 15 figures, 20 appendix pages
Abstract
Conformal prediction provides a distribution-free framework for uncertainty quantification via prediction sets with exact finite-sample coverage. In low dimensions these sets are easy to interpret, but in high-dimensional or structured output spaces they are difficult to represent and use, which can limit their ability to integrate with downstream tasks such as sampling and probabilistic forecasting. We show that any sufficiently regular differentiable nonconformity score induces a deterministic flow on the output space whose trajectories converge to the boundary of the corresponding conformal prediction set. This leads to a computationally efficient, training-free method for sampling conformal boundaries in arbitrary dimensions. Mixing across confidence levels yields conformal predictive distributions whose quantile regions coincide with the empirical conformal prediction sets. We provide an approximation bound decomposing CPD predictive error into score-induced distortion, base-measure quality, and gradient flow-induced distortion. We evaluate the approach on PDE inverse problems, precipitation downscaling, climate model debiasing, and hurricane trajectory forecasting.
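One way to picture a score-induced flow that converges to the boundary of a conformal set $\{y : s(y) \le q\}$ is the level-set flow $\dot{y} = -(s(y) - q)\,\nabla s(y) / \|\nabla s(y)\|^2$, under which the gap $s(y) - q$ decays exponentially. The sketch below integrates that flow with Euler steps. It is an assumed, generic construction, not necessarily the flow defined in the paper; the residual-norm score, step size, and finite-difference gradient are illustrative choices.

```python
import math
from typing import Callable, List

Score = Callable[[List[float]], float]


def residual_norm_score(center: List[float]) -> Score:
    """Example nonconformity score: Euclidean distance to a point prediction."""
    return lambda y: math.sqrt(sum((a - b) ** 2 for a, b in zip(y, center)))


def numerical_gradient(score: Score, y: List[float], eps: float = 1e-6) -> List[float]:
    """Central finite-difference gradient of the score at y."""
    grad = []
    for i in range(len(y)):
        up = y[:i] + [y[i] + eps] + y[i + 1:]
        down = y[:i] + [y[i] - eps] + y[i + 1:]
        grad.append((score(up) - score(down)) / (2 * eps))
    return grad


def flow_to_boundary(score: Score, y0: List[float], q: float,
                     step: float = 0.1, iters: int = 200) -> List[float]:
    """Euler-integrate dy/dt = -(s(y) - q) * grad s / |grad s|^2."""
    y = list(y0)
    for _ in range(iters):
        grad = numerical_gradient(score, y)
        norm_sq = sum(g * g for g in grad)
        if norm_sq == 0.0:
            break  # flat score: the flow direction is undefined here
        gap = score(y) - q
        y = [yi - step * gap * g / norm_sq for yi, g in zip(y, grad)]
    return y
```

Starting points inside the set are pushed outward and points outside are pulled inward, so running the flow from many initial samples yields points on the level set $s(y) = q$ without any training.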
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
Authors: Taewon Kang, Matthias Zwicker
First: 2026-05-07T16:22:21+00:00 · Latest: 2026-05-07T16:22:21+00:00
Comments: 40 pages, 33 figures
Abstract
Diffusion models generate realistic visual content, yet often fail to produce rare but plausible compositions. When prompted with combinations that are valid but underrepresented in training data, such as a snowy beach or a rainbow at night, the generation process frequently collapses toward more common alternatives. We identify this failure mode as default completion bias, where denoising trajectories are implicitly attracted toward high-frequency semantic configurations. Existing guidance mechanisms do not explicitly model this competing tendency and therefore struggle to prevent such collapse. We introduce Default Completion Repulsion (DCR), a training-free framework that explicitly models and suppresses default completion behavior. DCR constructs a counterfactual attractor by relaxing the rare compositional factor while preserving surrounding semantics, inducing an alternative denoising trajectory reflecting the model's preferred completion. We define the discrepancy between target and attractor trajectories as a counterfactual drift, and propose a projection-based repulsion mechanism that removes guidance components aligned with this drift direction. This suppresses undesired frequent completions while preserving other semantic components. DCR operates entirely within the standard diffusion sampling process without retraining or architectural modification. Experiments on rare compositional prompts show that DCR improves compositional fidelity while maintaining visual quality. Our analysis further shows that the framework exposes and counteracts intrinsic model biases, offering a new perspective on controllable generation beyond explicit constraint enforcement.
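The projection-based repulsion can be sketched as a vector projection. The code below is a hedged illustration, not the DCR implementation: it assumes the drift points from the target prediction toward the attractor prediction, and that only the positively aligned part of the guidance is removed. Both conventions, along with the `strength` parameter, are assumptions.

```python
from typing import List


def repel_default_completion(guidance: List[float], target: List[float],
                             attractor: List[float],
                             strength: float = 1.0) -> List[float]:
    """Remove the guidance component aligned with the counterfactual drift.

    `target` and `attractor` are flattened denoiser predictions for the rare
    prompt and its relaxed (default) counterpart. The drift is taken as
    attractor - target, an assumed sign convention.
    """
    drift = [a - t for a, t in zip(attractor, target)]
    norm_sq = sum(d * d for d in drift)
    if norm_sq == 0.0:
        return list(guidance)  # no drift, nothing to remove
    coeff = sum(g * d for g, d in zip(guidance, drift)) / norm_sq
    coeff = max(coeff, 0.0)  # only remove the part pointing toward the default
    return [g - strength * coeff * d for g, d in zip(guidance, drift)]
```

After the projection, the guidance is orthogonal to the drift direction whenever it previously pointed along it, while components unrelated to the drift are left unchanged.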
Summary
The research addresses the tendency of diffusion models to generate common compositions instead of rare but plausible ones when prompted with underrepresented combinations. It introduces Default Completion Repulsion (DCR), a training-free method that models and suppresses default completion behavior by constructing a counterfactual attractor, which induces an alternative denoising trajectory reflecting the model's preferred completion. Experiments show that DCR improves compositional fidelity while preserving visual quality, and that it exposes and counteracts intrinsic model biases.
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
Authors: Fangda Chen, Shanshan Zhao, Longrong Yang, Chuanfu Xu, Zhigang Luo, Long Lan
First: 2026-05-07T16:21:34+00:00 · Latest: 2026-05-07T16:21:34+00:00
Abstract
Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, such as in camera motion and sequential motion. We analyze the video temporal extension issue from a singular-spectrum perspective and show that enlarged self-attention windows induce spectral concentration: spectral energy becomes dominated by a few low-rank singular directions, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variations. To mitigate this problem, we propose FreeSpec, a training-free spectral reconstruction framework for long-video generation. FreeSpec decomposes global and local features with singular value decomposition, and uses the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid feature partitioning of previous decomposition rules, preserving long-range consistency while better retaining spatial details and temporal dynamics. Experiments on Wan2.1 and LTX-Video demonstrate that FreeSpec improves long-video generation, especially for temporal dynamics, while maintaining strong visual quality and temporal consistency. Project demo: https://fdchen24.github.io/FreeSpec-Website/.
Summary
FreeSpec is a training-free spectral reconstruction framework for long-video generation that addresses content drift, temporal inconsistency, and over-smoothed dynamics. It decomposes global and local features using singular value decomposition, with the global branch providing low-rank spectral guidance and the local branch serving as a high-rank reconstruction basis. This approach preserves long-range consistency while retaining spatial details and temporal dynamics, as shown by experiments on Wan2.1 and LTX-Video.
GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs
Authors: Pranav Mantini, Shishir K. Shah
First: 2026-05-07T16:01:59+00:00 · Latest: 2026-05-07T16:01:59+00:00
Abstract
We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity ($O(1)$), regardless of the number of integrated experts. Experimental results across multi-domain adaptation and class-incremental learning show that GeoStack provides an efficient mechanism for long-term knowledge composition while significantly mitigating catastrophic forgetting. Code is available at https://github.com/QuantitativeImagingLaboratory/GeoStack.
Summary
GeoStack is a framework designed to address catastrophic forgetting in Vision-Language Models (VLMs) by composing independently trained domain experts under geometric and structural constraints on the adapter manifold. This modular approach preserves the foundational knowledge of the base model while achieving constant-time inference complexity regardless of the number of integrated experts. Experimental results on multi-domain adaptation and class-incremental learning demonstrate that GeoStack mitigates catastrophic forgetting while supporting long-term knowledge composition.
A Regime Theory of Controller Class Selection for LLM Action Decisions
Authors: Zhaoyang Jiang, Zhizhong Fu, Yunsoo Kim, Jiacong Mi, Zicheng Li, Xuanqi Peng, Honghan Wu
First: 2026-05-07T14:28:17+00:00 · Latest: 2026-05-07T14:28:17+00:00
Abstract
Deployed language and vision-language models must decide, on each input, whether to answer directly, retrieve evidence, defer to a stronger model, or abstain. Contrary to the common monotonicity intuition, greater per-input expressivity is not uniformly beneficial in finite samples: under identical strict cross-validation, different benchmarks prefer different controller classes. This reflects a finite-sample limitation of instance-level uncertainty signals, which can be exhausted at a distribution-dependent scale. We organize controllers into a nested lattice of four classes: fixed actions, partition routers, instance-level controllers, and prior-gated controllers, ordered by complexity. We prove a regime theory that turns three data-estimable bottlenecks into a class choice: how much improvement is possible beyond the best fixed action, whether there are enough samples for instance-level controllers to make reliable decisions, and how much improvement a coarse partition router can recover when instance-level signal is unreliable. The resulting Bernstein-tight threshold has a matching information-theoretic lower bound, and strict nested cross-validation provably selects a near-best class. Across SMS-Spam, HallusionBench, A-OKVQA, and FOLIO, the predicted class matches the empirical winner; the prior-gated controller wins on TextVQA when OCR tokens supply a label-free prediction-time prior. Code is available at https://github.com/Anonymous-Awesome-Submissions/Regime-Theory.
Summary
This paper addresses the challenge of selecting the appropriate controller class for language and vision-language models that must choose an action on each input. Contrary to the common belief that higher expressivity is always better, the study finds that different benchmarks prefer different controller classes because of a finite-sample limitation of instance-level uncertainty signals. The authors organize controllers into a nested lattice of four classes and prove a regime theory that turns three data-estimable bottlenecks into a class choice. The resulting Bernstein-tight threshold has a matching information-theoretic lower bound, and strict nested cross-validation provably selects a near-best class. The predicted class matches the empirical winner on SMS-Spam, HallusionBench, A-OKVQA, and FOLIO, and the prior-gated controller wins on TextVQA when OCR tokens provide a label-free prediction-time prior.
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
Authors: Zecheng Tang, Jiaye Fu, Qiankun Gao, Haijie Li, Yanmin Wu, Jiaqi Zhang, Siwei Ma, Jian Zhang
First: 2026-05-07T13:45:37+00:00 · Latest: 2026-05-07T13:45:37+00:00
Abstract
Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, $π^3$, and Depth-Anything-3, and achieves up to $28\times$ speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
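The asymmetry between query and key-value compression can be sketched with plain lists. This is an illustrative sketch, not Spark3R's code: grouping consecutive tokens, averaging within a group, and ranking keys by squared norm are assumptions standing in for whatever grouping and pruning criteria the method actually uses.

```python
from typing import List, Tuple

Vector = List[float]


def merge_queries(queries: List[Vector], factor: int) -> List[Vector]:
    """Average each consecutive group of `factor` query vectors (mild reduction)."""
    merged = []
    for start in range(0, len(queries), factor):
        group = queries[start:start + factor]
        merged.append([sum(col) / len(group) for col in zip(*group)])
    return merged


def prune_key_values(keys: List[Vector], values: List[Vector],
                     factor: int) -> Tuple[List[Vector], List[Vector]]:
    """Keep roughly 1/factor of the key-value tokens (aggressive reduction).

    Squared key norm is an assumed importance proxy, used only for illustration.
    """
    keep = max(1, len(keys) // factor)
    ranked = sorted(range(len(keys)), key=lambda i: -sum(k * k for k in keys[i]))
    kept = sorted(ranked[:keep])  # restore original token order
    return [keys[i] for i in kept], [values[i] for i in kept]


def attention_cost(n_queries: int, n_keys: int) -> int:
    """Number of query-key dot products in one global attention layer."""
    return n_queries * n_keys
```

Because the attention cost is the product of the two sequence lengths, a query factor of 2 combined with a key-value factor of 4 cuts the number of dot products by 8, while the more sensitive query side is compressed only mildly.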
Summary
The research addresses the scalability of feed-forward 3D reconstruction models based on Vision Transformers when processing video inputs with many frames. Spark3R, a training-free acceleration framework, decouples the compression of query tokens and key-value tokens by assigning them distinct reduction factors: intra-group token merging for query tokens and lightweight token pruning for key-value tokens. It also adapts the key-value reduction factor across layers to improve the quality-efficiency trade-off. The framework integrates directly into pretrained models including VGGT, $π^3$, and Depth-Anything-3 without retraining, and achieves up to 28 times speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Authors: Andy Zeyi Liu, Michael Zhang, Ilana Greenberg, Adam Alnasser, Lucas Baker, John Sous
First: 2026-05-07T13:19:33+00:00 · Latest: 2026-05-07T13:19:33+00:00
Abstract
Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activation steering is compact but typically weaker and does not support large structured reminders. We introduce memory inception (MI), a training-free method that steers in latent attention space by inserting text-derived key-value (KV) banks only at selected layers. Rather than materializing reminder content throughout the prompt cache, MI treats steering as selective KV allocation, injecting latent slots only where the model routes to them. On matched personality-steering tasks, MI gives the best overall control--drift trade-off, remaining competitive with prompting while consistently outperforming CAA. On updateable guidance, MI supports mid-conversation behavior shifts without rewriting the visible transcript, achieving the highest post-shift alignment on Qwen3. On structured reasoning, MI outperforms visible prompting on HARDMath and PHYSICS (10/12 subject$\times$mode cells), serving as proxies for structured reasoning in verifiable domains, while cutting content-matched KV storage by up to 118$\times$. These results position MI as a powerful steering method when guidance is persistent, structured, or expensive to keep in the visible transcript.
Summary
Memory Inception (MI) is a training-free method that steers large language models (LLMs) in latent attention space by inserting text-derived key-value (KV) banks at selected layers. This avoids cluttering the prompt cache with guidance tokens and supports mid-conversation behavior shifts without rewriting the visible transcript. On personality-steering tasks, MI remains competitive with prompting while consistently outperforming CAA; on structured reasoning tasks, it outperforms visible prompting while cutting content-matched KV storage by up to 118 times. MI is most useful when guidance is persistent, structured, or expensive to keep in the visible transcript.
MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models
Authors: Xunlan Zhou, Xuanlin Chen, Shaowei Zhang, ShengHua Wan, Xiaohai Hu, Lei Yuan, De-chuan Zhan
First: 2026-01-28T11:25:13+00:00 · Latest: 2026-05-07T13:06:58+00:00
Abstract
Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL (Multi-stAge guidance for Robotic manipulation via Vision-Language models). MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.
Summary
MARVL aims to improve the efficiency of robotic reinforcement learning by addressing the limitations of manually engineered dense rewards and naive VLM rewards. It fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks. On the Meta-World benchmark, MARVL outperforms existing VLM-reward methods in sample efficiency and robustness on sparse-reward manipulation tasks.
Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios
Authors: Peizheng Yan, Yu Zhao, Liang Xie, Juntong Qi, Mingming Wang, Erwei Yin
First: 2026-05-07T13:01:28+00:00 · Latest: 2026-05-07T13:01:28+00:00
Abstract
Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.
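The State-Event-State representation and the bidirectional walk over causal links can be sketched with a small in-memory graph. The class and field names below are hypothetical, and the fixed hop limit is an assumption; the paper's dual-store memory and semantic matching stage are not modeled here, so the seed event is simply passed in.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class SESNode:
    """One event together with the states before and after it."""
    event_id: str
    pre_state: str
    event: str
    post_state: str


@dataclass
class EventKnowledgeGraph:
    nodes: Dict[str, SESNode] = field(default_factory=dict)
    successors: Dict[str, Set[str]] = field(default_factory=dict)    # cause -> effects
    predecessors: Dict[str, Set[str]] = field(default_factory=dict)  # effect -> causes

    def add_event(self, node: SESNode) -> None:
        self.nodes[node.event_id] = node
        self.successors.setdefault(node.event_id, set())
        self.predecessors.setdefault(node.event_id, set())

    def add_causal_edge(self, cause: str, effect: str) -> None:
        self.successors[cause].add(effect)
        self.predecessors[effect].add(cause)

    def causal_chain(self, seed: str, hops: int = 2) -> Set[str]:
        """Collect events within `hops` causal steps of `seed`, walking
        backward to causes and forward to effects."""
        chain = {seed}
        for table in (self.predecessors, self.successors):
            seen, frontier = {seed}, {seed}
            for _ in range(hops):
                frontier = {n for e in frontier for n in table[e]} - seen
                seen |= frontier
            chain |= seen
        return chain
```

In a full pipeline the seed would come from semantic matching against the question, and the events returned by `causal_chain`, with their associated video evidence, would be passed to the video foundation model.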
Summary
Event-Causal RAG is a lightweight retrieval-augmented framework for long-video reasoning that addresses the limitations of existing models on ultra-long or infinite videos. It segments videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph; these graphs are merged into a global Event Knowledge Graph that supports both semantic matching and causal-topological retrieval. The relevant event causal chains are then passed, together with the associated video evidence, to a backbone video foundation model for answer generation. The method outperforms clip-based retrieval baselines and long-context video models, especially on questions requiring multi-event integration and causal inference over long temporal gaps, while maintaining memory efficiency and streaming performance.
Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation
Authors: Abdelrahman Zaian, Sheethal Bhat, Mohamed Abdalkader, Andreas Maier
Venue: MICCAI 2026
First: 2026-05-07T12:54:53+00:00 · Latest: 2026-05-07T12:54:53+00:00
Comments: 10 pages, 5 figures. Submitted to MICCAI 2026
Abstract
Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially outperforming zero-shot Qwen (0.096, 0.732) and MMed-RAG (0.541, 0.641) on a retinal disease detection dataset with captions. For report generation, Retina-RAG attains ROUGE-L 0.429 and SBERT similarity 0.884, exceeding all baselines. The full framework operates on a single consumer-grade GPU, demonstrating that clinically structured retinal AI can be achieved with modest computational resources.
中文标题/摘要
标题:Retina-RAG:联合视网膜诊断和临床报告生成的检索增强视觉语言建模
糖尿病视网膜病变(DR)是全球工作年龄成人可预防失明的主要原因,但大多数自动化筛查系统仅限于图像级别的分类,缺乏临床结构化的报告。我们提出了一种名为Retina-RAG的低成本模块化框架,该框架联合执行DR严重程度分级、黄斑水肿(ME)检测和报告生成。该架构将高性能的视网膜分类器和通过低秩适应(LoRA)调整的参数高效视觉语言模型(Qwen2.5-VL-7B-Instruct)解耦,使组件集成更加灵活。检索增强生成(RAG)模块在推理时注入了经过筛选的眼科知识和结构化的分类器输出,以提高诊断一致性并减少幻觉。Retina-RAG在DR分级上的F1分数为0.731,在ME检测上的F1分数为0.948,显著优于零样本Qwen(0.096,0.732)和MMed-RAG(0.541,0.641)在带有描述的视网膜疾病检测数据集上的表现。对于报告生成,Retina-RAG获得了ROUGE-L 0.429和SBERT相似度0.884,超过了所有基线。整个框架在单个消费级GPU上运行,证明了临床结构化的视网膜AI可以通过有限的计算资源实现。
Summary / 总结
Retina-RAG is a modular framework that jointly performs diabetic retinopathy severity grading, macular edema detection, and report generation. It uses a high-performance retinal classifier and a parameter-efficient vision-language model adapted via Low-Rank Adaptation, with a retrieval-augmented generation module that enhances diagnostic consistency. Retina-RAG achieves high F1-scores of 0.731 for DR grading and 0.948 for ME detection, and outperforms baselines in report generation with ROUGE-L 0.429 and SBERT similarity 0.884.
Retina-RAG 是一个模块化框架,能够同时进行糖尿病视网膜病变严重程度分级、黄斑水肿检测和报告生成。它使用高性能的视网膜分类器和通过低秩适应调整的参数高效视觉语言模型,并在推理时通过检索增强生成模块提高诊断一致性。Retina-RAG 在糖尿病视网膜病变分级上的 F1 分数达到 0.731,在黄斑水肿检测上的 F1 分数达到 0.948,且在报告生成指标上超过现有模型。该框架仅需单个消费级 GPU 运行,表明临床结构化的视网膜 AI 可以在有限的计算资源下实现。
Uncovering Entity Identity Confusion in Multimodal Knowledge Editing
Authors: Shu Wu, Xiaotian Ye, Xinyu Mou, Dongsheng Liu, Xiaohan Wang, Mengqi Zhang
First: 2026-05-07T12:14:54+00:00 · Latest: 2026-05-07T12:14:54+00:00
Abstract
Multimodal knowledge editing (MKE) aims to correct the internal knowledge of large vision-language models after deployment, yet the behavioral patterns of post-edit models remain underexplored. In this paper, we identify a systemic failure mode in edited models, termed Entity Identity Confusion (EIC): edited models exhibit an absurd behavior where text-only queries about the original entity's identity unexpectedly return information about the new entity. To rigorously investigate EIC, we construct EC-Bench, a diagnostic benchmark that directly probes how image-entity bindings shift before and after editing. Our analysis reveals that EIC stems from existing methods failing to distinguish between Image-Entity (I-E) binding and Entity-Entity (E-E) relational knowledge in the model, causing models to overfit E-E associations as a shortcut: the image is still perceived as the original entity, with the new entity's name serving only as a spurious identity label. We further explore potential mitigation strategies, showing that constraining edits to the model's I-E processing stage encourages edits to act more faithfully on I-E binding, thereby substantially reducing EIC. Based on these findings, we discuss principled desiderata for faithful MKE and provide methodological guidance for future research.
中文标题/摘要
标题:揭示多模态知识编辑中的实体身份混淆
多模态知识编辑(MKE)旨在部署后纠正大型视觉-语言模型的内部知识,但后编辑模型的行为模式仍被广泛忽视。在本文中,我们识别出编辑模型中的一个系统性失败模式,称为实体身份混淆(EIC):编辑后的模型表现出一种荒谬的行为,即仅通过文本查询原始实体的身份时,意外地返回了新实体的信息。为了严格研究EIC,我们构建了EC-Bench,这是一个诊断基准,直接探测图像-实体绑定在编辑前后如何变化。我们的分析表明,EIC源于现有方法无法区分模型中的图像-实体(I-E)绑定和实体-实体(E-E)关系知识,导致模型过度拟合E-E关联作为捷径:图像仍然被视为原始实体,新实体的名字仅作为虚假的身份标签。我们进一步探讨了潜在的缓解策略,表明限制编辑仅在模型的I-E处理阶段可以鼓励编辑更忠实地作用于I-E绑定,从而显著减少EIC。基于这些发现,我们讨论了忠实的MKE的基本要求,并为未来的研究提供了方法论指导。
Summary / 总结
This paper addresses the issue of Entity Identity Confusion (EIC) in multimodal knowledge editing (MKE), where edited models incorrectly return information about a new entity when queried about the original entity. To investigate EIC, the authors developed EC-Bench, a benchmark that evaluates changes in image-entity bindings before and after editing. The study found that existing methods fail to distinguish between image-entity and entity-entity associations, leading to overfitting on entity-entity relations. By constraining edits to the image-entity processing stage, the authors reduced EIC. The paper concludes with recommendations for faithful MKE and methodological guidance for future research.
本文探讨了多模态知识编辑(MKE)中的实体身份混淆(EIC)问题,即编辑后的模型在查询原始实体时会错误地返回新实体的信息。为了研究EIC,作者开发了EC-Bench基准,评估编辑前后图像-实体绑定的变化。研究发现,现有方法往往无法区分图像-实体(I-E)和实体-实体(E-E)知识,导致过度拟合E-E关联。通过将编辑限制在I-E处理阶段,研究展示了EIC的减少,表明了一种更忠实的MKE方法。研究结果强调了MKE中需要更好的准则,并为未来的研究提供了方法指导。
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
Authors: Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu
Venue: ACL 2026
First: 2026-01-21T07:26:15+00:00 · Latest: 2026-05-07T12:10:26+00:00
Comments: Accepted to ACL 2026 Main
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions and achieving a 10× faster time to first token (TTFT) compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
Summary / 总结
The research aims to enhance real-time video understanding for streaming inputs by addressing the limitations of existing models in maintaining performance, real-time responses, and low GPU memory usage. HERMES, a training-free architecture, leverages a hierarchical KV cache to efficiently manage video information across different granularities. During inference, HERMES reuses a compact cache, achieving 10 times faster TTFT than prior SOTA models and maintaining or improving accuracy even with a 68% reduction in video tokens.
研究旨在通过解决现有模型在保持性能、实时响应和低GPU内存使用方面的局限性,提升对流媒体输入的视频理解能力。HERMES是一种无需训练的架构,利用层次化的KV缓存来高效管理不同粒度的视频信息。在推理过程中,HERMES重用紧凑的缓存,比之前最先进的模型快10倍的TTFT,并且即使视频令牌减少68%,仍然能够保持或提高准确率。
OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention
Authors: Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab, Federico Tombari
First: 2026-05-07T12:10:07+00:00 · Latest: 2026-05-07T12:10:07+00:00
Abstract
Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic predictions on geometric structure, this formulation strengthens the coupling between geometry and semantics, leading to improved spatial coherence across similar structures in 3D space. To further enforce object-level semantic consistency, we introduce a structured codebook that serves as a set of shared semantic primitives. Furthermore, a codebook-guided attention mechanism is proposed to retrieve language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance. Extensive experiments on standard 2D and 3D open-vocabulary benchmarks demonstrate that our method consistently outperforms prior approaches, achieving improved segmentation quality, stronger 3D semantic consistency and a semantically interpretable codebook that provides insight into the learned representation.
Summary / 总结
The research aims to address the challenge of understanding open-vocabulary 3D scenes using Gaussian-based representations. OpenGaFF, a novel framework, employs a Gaussian Feature Field to model semantics as a continuous function of Gaussian geometry and appearance, enhancing the coupling between geometry and semantics. The method introduces a structured codebook and a codebook-guided attention mechanism to improve object-level semantic consistency and robustness in open-vocabulary reasoning. Experiments show that OpenGaFF outperforms previous methods in segmentation quality and 3D semantic consistency, with a semantically interpretable codebook that offers insights into the learned representation.
研究旨在使用基于高斯的表示来理解开放词汇的3D场景。OpenGaFF框架采用高斯特征场将语义建模为高斯几何和外观的连续函数,增强几何与语义之间的耦合。该方法引入了结构化码本和码本引导的注意力机制,以提高语义一致性和开放词汇推理的鲁棒性。实验表明,OpenGaFF在分割质量、3D语义一致性和具有语义解释性的码本方面优于先前的方法,码本提供了关于学习表示的见解。
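The codebook-guided attention described above, which retrieves language features by similarity matching between a query embedding and learned codebook entries, can be illustrated with a small sketch. This is not the OpenGaFF implementation: the cosine similarity, the softmax temperature, and the names `codebook_attention`, `keys`, and `values` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def codebook_attention(query, keys, values, temperature=0.1):
    """Soft lookup of a language feature from a codebook.

    query  : embedding of one Gaussian (hypothetical input)
    keys   : one key vector per codebook entry
    values : one language feature per codebook entry
    Returns the attention-weighted feature and the attention weights.
    """
    logits = [cosine(query, k) / temperature for k in keys]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    feature = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return feature, weights
```

Because every Gaussian reads its feature from the same small set of entries, Gaussians with similar queries receive near-identical features, which is one plausible reading of how a shared codebook reduces intra-object feature variance.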
Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions
Authors: Kjetil Indrehus, Adrian Duric, Changkyu Choi, Ali Ramezani-Kebrya
First: 2026-05-07T11:42:23+00:00 · Latest: 2026-05-07T11:42:23+00:00
Abstract
Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Prediction via CoExVQA's chain-of-explanation enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence achieves SotA explainable DocVQA performance on PFL-DocVQA, improving ANLS by 12% over the current explainable baselines while providing transparent and verifiable predictions.
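For readers unfamiliar with the ANLS metric quoted above, the following is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly defined for DocVQA-style evaluation. The 0.5 threshold and the lowercase/strip normalisation are the usual conventions and are assumed here; the function names are illustrative and not taken from CoExVQA.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """predictions: list of strings; references: list of lists of accepted answers."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            # Answers that are too far from the reference score zero.
            score = 1.0 - nl if nl < tau else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions) if predictions else 0.0
```

The threshold means a prediction with small OCR-style errors still earns partial credit, while an unrelated answer earns none.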
Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization
Authors: Weijian Su, Songqian Zhang, Yuqi Han, Jian Zhuang, Yongdong Huang, Qiang Zhang
Venue: CVPR 2026
First: 2026-05-07T11:34:41+00:00 · Latest: 2026-05-07T11:34:41+00:00
Comments: Accepted by CVPR 2026
Abstract
As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework integrating the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. The PALDM leverages a latent fusion prior and a joint conditional loss to generate diverse candidate fusion results with various properties. PCLDM is subsequently fine-tuned via instance direct preference optimization (IDPO), enabling direct control of the final fusion results with heterogeneous preference signals. Experimental results demonstrate that our framework not only attains precise preference alignment among humans, vision-language models, and task-driven networks, but also sets a new benchmark for adaptive fusion quality and task-oriented transferability.
中文标题/摘要
标题:融合之道:通过直接偏好优化使红外和可见光图像融合适应异构需求
作为多模态处理中的关键技术,红外和可见光图像融合(IVIF)在整合互补光谱信息以增强视觉效果和下游视觉任务中起着关键作用。尽管取得了显著进展,但现有方法难以灵活适应异构需求。实现与人类和机器视觉各种偏好相适应的自适应融合仍然是一个开放且具有挑战性的问题。为了解决这一挑战,我们提出了一种直接偏好优化(DPO)框架DPOFusion,该框架结合了属性对齐的潜在扩散模型(PALDM)和偏好可控的潜在扩散模型(PCLDM),使红外和可见光图像融合能够适应人类和机器视觉的任务指导和偏好适应。PALDM利用潜在融合先验和联合条件损失生成具有各种属性的多样化候选融合结果。PCLDM随后通过实例直接偏好优化(IDPO)进行微调,使最终融合结果能够直接控制具有异构偏好信号的生成。实验结果表明,我们的框架不仅在人类、视觉语言模型和任务驱动网络之间实现了精确的偏好对齐,还为自适应融合质量和任务导向的可转移性设立了新的基准。
Summary / 总结
The paper addresses the challenge of achieving adaptive infrared and visible image fusion (IVIF) that aligns with various preferences from both human and machine vision. It proposes DPOFusion, a direct preference optimization framework combining a property-aligned latent diffusion model (PALDM) and a preference-controllable latent diffusion model (PCLDM). The PALDM generates diverse candidate fusion results with various properties, and the PCLDM is then fine-tuned via instance direct preference optimization (IDPO) so that heterogeneous preference signals directly control the final fusion result. Experiments show that DPOFusion achieves precise preference alignment and sets a new benchmark for adaptive fusion quality and task-oriented transferability.
论文提出了DPOFusion框架,该框架结合了PALDM和PCLDM,以实现与人类和机器视觉各种偏好相适应的红外和可见光图像融合(IVIF)。该框架通过PALDM生成具有不同属性的候选融合结果,并通过实例直接偏好优化进一步微调PCLDM,以控制最终融合结果的偏好信号。实验结果表明,DPOFusion实现了精确的偏好对齐,并在自适应融合质量和任务导向的可转移性方面设立了新的基准。
Can Vision-Language Models Think from the Sky? Unifying UAV Reasoning and Generation
Authors: Jintao Sun, Gangyi Ding, Donglin Di, Hu Zhang, Zhedong Zheng
First: 2026-04-07T03:23:30+00:00 · Latest: 2026-05-07T11:33:25+00:00
Comments: 21 pages, 12 figures, 7 tables
Abstract
Vision-Language Models have achieved strong progress in ground-view visual understanding, yet they remain brittle in high-altitude Unmanned Aerial Vehicle scenes, where objects are tiny and densely packed, textures are repetitive, and top-down orientations are ambiguous. We introduce UAVReason, a large-scale UAV-native dataset and evaluation suite for studying unified aerial reasoning and generation under this nadir-view domain shift. UAVReason aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs within a consistent aerial domain. It contains 23.6K captioned frames, 273K VQA pairs including 68.2K two-frame temporal questions, and 188.8K cross-modal generation samples across RGB, depth, and segmentation modalities. We further adapt UAVReason-Bagel as a unified understanding-and-generation baseline that jointly optimizes language reasoning and dense visual generation objectives. Experiments show that general-purpose VLMs and off-the-shelf unified generators struggle with UAV-native grounding, while UAVReason-Bagel substantially improves over its pretrained counterpart, increasing VQA-1F F1 from 0.394 to 0.711, VQA-2F F1 from 0.427 to 0.822, and heading-aware VQA F1 from 0.798 to 0.973. For generation, it improves segmentation mIoU to 0.143 and reduces KID from 0.078 to 0.048 for depth-segmentation-text-conditioned RGB synthesis. More importantly, our ablations reveal a bidirectional synergy between synthesis and reasoning. Dense generation objectives improve temporal semantic consistency, while language-level reasoning regularizes sparse-condition image synthesis. These results suggest that unified reasoning and generation provide effective geometry-aware structural priors for physically grounded aerial intelligence. All data, code, and evaluation tools will be released.
Summary / 总结
This study addresses the limitations of Vision-Language Models in high-altitude Unmanned Aerial Vehicle (UAV) scenes by introducing UAVReason, a large-scale dataset and evaluation suite. UAVReason aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question-answer pairs for unified aerial reasoning and generation. Experiments show that general-purpose VLMs struggle with UAV-native grounding, while the adapted UAVReason-Bagel improves over its pretrained counterpart, raising VQA-1F F1 from 0.394 to 0.711 and VQA-2F F1 from 0.427 to 0.822, and lifting segmentation mIoU to 0.143. The results indicate that unified reasoning and generation provide effective geometry-aware structural priors for aerial intelligence.
该研究通过引入UAVReason数据集和评估套件,解决了视觉-语言模型在高空无人机场景中的局限性。UAVReason包含RGB图像、深度图、语义分割掩码、描述和问答对,专注于航空推理和生成。实验表明,通用模型在UAV本地定位上表现不佳,而适应的UAVReason-Bagel显著提高了性能,增强了VQA准确性和分割质量。结果表明,统一的推理和生成为物理上接地的航空智能提供了有效的几何先验。
PlotPick: AI-powered batch extraction of numerical data from scientific figures
Authors: Tommy Carstensen
First: 2026-05-07T11:15:39+00:00 · Latest: 2026-05-07T11:15:39+00:00
Comments: 7 pages, 2 figures, 2 tables. Software available at https://plotpick.streamlit.app and https://github.com/tommycarstensen/plotpick
Abstract
Systematic reviews and meta-analyses frequently require numerical data that authors report only as figures, yet manual digitisation is slow and does not scale. We present PlotPick, an open-source tool that uses vision-language models (VLMs) to batch-extract structured tabular data from scientific figures. We evaluate six VLMs from three providers on two established chart-to-table benchmarks (ChartX and PlotQA) and compare against the dedicated chart-to-table model DePlot. All six VLMs outperform DePlot on both benchmarks. On ChartX (restricted to bar charts, line charts, box plots, and histograms; n=300), VLMs achieve 88-96% recall versus 71% for DePlot. On PlotQA (n=529), VLMs achieve 86-99% RMSF1 versus 94% for DePlot. The gap is largest on chart types absent from the dedicated models' training data: on box plots, DePlot achieves 24% RMSF1 while VLMs achieve 83-97%. PlotPick is available at https://plotpick.streamlit.app.
Summary / 总结
PlotPick is an open-source tool that uses vision-language models (VLMs) to batch-extract numerical data from scientific figures, addressing the inefficiency of manual digitisation. On ChartX, the six evaluated VLMs reach 88-96% recall versus 71% for the dedicated chart-to-table model DePlot; on PlotQA they reach 86-99% RMSF1 versus 94% for DePlot. The gap is largest on chart types absent from the dedicated model's training data, such as box plots, where DePlot achieves 24% RMSF1 while the VLMs achieve 83-97%.
PlotPick 是一个开源工具,利用视觉-语言模型从科学图表中批量提取数值数据,解决手动数字化效率低的问题。在 ChartX 上,六个受评估的视觉-语言模型的召回率为 88-96%,而专门的图表到表格模型 DePlot 为 71%;在 PlotQA 上,它们的 RMSF1 为 86-99%,而 DePlot 为 94%。在 DePlot 训练数据中未出现的图表类型(如箱形图)上差距最大:DePlot 的 RMSF1 为 24%,而视觉-语言模型为 83-97%。
4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding
Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li, Xin Xie, ZiDong Wang, Mingze Sun, Shuang Chen, Hongyu Li, Xiaobin Hu, Ruqi Huang
First: 2026-05-07T10:48:46+00:00 · Latest: 2026-05-07T10:48:46+00:00
Comments: 21 pages, 16 figures
Abstract
Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to "think with 4D" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.
Summary / 总结
4DThinker is a framework that enables vision-language models to perform dynamic spatial reasoning through internal 4D imagery. It introduces a data generation pipeline for synthesizing 4D reasoning data and a fine-tuning method called DIFT that grounds the model in dynamic visual semantics. 4DRL further enhances this by using outcome-based rewards. Experiments show that 4DThinker outperforms strong baselines on multiple dynamic spatial reasoning benchmarks.
4DThinker 是一个框架,使视觉语言模型能够通过内部的4D图像进行动态空间推理。它引入了一个数据生成管道来合成4D推理数据,并提出了一种称为DIFT的微调方法,使模型扎根于动态视觉语义。4DRL 进一步通过基于结果的奖励来增强这一点,限制策略梯度仅针对文本标记。实验表明,4DThinker 在多个动态空间推理基准测试中优于强基线。
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
Authors: Zixuan Chen, Hao Lin, Zizhe Chen, Yizhou Tian, Garry Yang, Depeng Wang, Ya Guo, Huijia Zhu, James Cheng
First: 2026-05-07T10:04:39+00:00 · Latest: 2026-05-07T10:04:39+00:00
Abstract
LLMs reliably correct false claims when presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode correction suppression and construct a benchmark of 300 false premises to systematically evaluate it across eight models. Suppression rates range from 19% to 90%, with four models exceeding 80%, establishing correction suppression as a prevalent and severe phenomenon. Mechanistic analysis reveals that suppression is not a knowledge failure: the model registers the error internally but task context diverts early-layer attention from the false claim as output intent crystallizes toward compliance at middle layers. We characterize this as knowing but not correcting: suppression occurs at response selection rather than knowledge encoding. Guided by this mechanism, we propose two training-free interventions. Correction Direction Steering (CDS) estimates a correction-compliance direction from matched pairs and injects it at middle layers before output intent crystallizes. Dynamic Payload Amplification (DPA) localizes payload tokens via attention divergence between early and late layers and amplifies their representation at the final layer, requiring no calibration data. Experiments on Qwen3.5-9B and LLaMA3.1-8B show both methods substantially improve factual strictness. CDS achieves the highest correction rate on Qwen3.5-9B (0% → 58.2%). DPA is the only method that preserves or improves reasoning capability on both models. These findings introduce factual strictness, the willingness to uphold accuracy against contextual pressures, as a new dimension of model reliability.
Summary / 总结
The paper investigates why large language models (LLMs) often comply with false claims when these claims are part of task-oriented requests, a phenomenon termed 'correction suppression'. The study evaluates 300 false premises across eight models, finding suppression rates ranging from 19% to 90%, with four models exceeding 80%. The authors propose two training-free interventions: Correction Direction Steering (CDS) and Dynamic Payload Amplification (DPA), which improve factual strictness. CDS achieves the highest correction rate on Qwen3.5-9B, while DPA preserves reasoning capability on both Qwen3.5-9B and LLaMA3.1-8B.
研究探讨了为什么在任务导向请求中,大型语言模型(LLMs)往往会遵从错误的陈述,这一现象被称为‘纠正抑制’。研究评估了8个模型中的300个错误前提,发现抑制率从19%到90%不等,其中四个模型的抑制率超过80%。作者提出了两种无需训练的干预措施:纠正方向引导(CDS)和动态负载放大(DPA),这些措施提高了事实准确性。CDS在Qwen3.5-9B上实现了最高的纠正率,而DPA在Qwen3.5-9B和LLaMA3.1-8B上保持了推理能力。
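The abstract describes Correction Direction Steering only at a high level: a correction-compliance direction is estimated from matched pairs and injected at middle layers. One common way to build such a direction is a normalised difference of means, sketched below. That estimator, the additive injection, the scale `alpha`, and all function names are assumptions for illustration; the paper may estimate and apply the direction differently.

```python
import math


def correction_direction(correct_states, comply_states):
    """Unit-length mean difference between matched hidden states.

    correct_states[i] and comply_states[i] are hypothetical middle-layer
    activations for the same prompt when the model corrects vs. complies.
    """
    n = len(correct_states)
    dim = len(correct_states[0])
    diff = [
        sum(c[k] - p[k] for c, p in zip(correct_states, comply_states)) / n
        for k in range(dim)
    ]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff] if norm > 0 else diff


def steer(hidden, direction, alpha):
    """Shift one hidden state along the direction by a scalar strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

In a real model the `steer` step would run inside a forward hook on the chosen layer; the plain-list version here only shows the arithmetic.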
Adaptive Greedy Frame Selection for Long Video Understanding
Authors: Yuning Huang, Xiaoyu Ji, Joseph Huang, Yichi Zhang, Fengqing Zhu
First: 2026-03-20T17:55:32+00:00 · Latest: 2026-05-07T09:47:21+00:00
Abstract
Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
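The greedy objective above can be sketched in a few lines. This is an illustrative version, not the authors' code: it assumes non-negative relevance scores and a non-negative pairwise similarity matrix are already computed (the abstract obtains them from SigLIP and DINOv2), and the name `greedy_select` and the single trade-off weight `lam` are hypothetical.

```python
def greedy_select(relevance, similarity, budget, lam=0.5):
    """Greedily maximise lam * relevance + (1 - lam) * facility-location coverage.

    relevance  : relevance[i] >= 0, query relevance of candidate frame i
    similarity : similarity[j][i] >= 0, semantic similarity between frames j and i
    budget     : maximum number of frames to return
    """
    n = len(relevance)
    selected = []
    # best_cover[j] = similarity of frame j to its closest already-selected frame
    best_cover = [0.0] * n
    for _ in range(min(budget, n)):
        best_gain, best_idx = float("-inf"), None
        for i in range(n):
            if i in selected:
                continue
            # Marginal coverage gain: how much frame i improves each frame's best match.
            cover_gain = sum(
                max(similarity[j][i] - best_cover[j], 0.0) for j in range(n)
            )
            gain = lam * relevance[i] + (1.0 - lam) * cover_gain
            if gain > best_gain:
                best_gain, best_idx = gain, i
        selected.append(best_idx)
        best_cover = [max(best_cover[j], similarity[j][best_idx]) for j in range(n)]
    return selected
```

With `lam = 1.0` the rule degenerates to picking the top-relevance frames, which is exactly the near-duplicate collapse the abstract warns about; lowering `lam` makes the coverage term penalise a second frame that adds little beyond the first.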
Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model
Authors: Junhui Yin, Nan Pu, Xinyu Zhang, Lingfeng Yang, Lin Wu, Xiaojie Wang, Zhun Zhong
First: 2026-05-07T09:20:42+00:00 · Latest: 2026-05-07T09:20:42+00:00
Comments: Accepted by International Journal of Computer Vision
Abstract
Prompt learning has become an effective and widely used technique in enhancing vision-language models (VLMs) such as CLIP for various downstream tasks, particularly in zero-shot classification within specific domains. Existing methods typically focus on either learning class-shared prompts for a given domain or generating instance-specific prompts through conditional prompt learning. While these methods have achieved promising performance, they often overlook class-specific knowledge in prompt design, leading to suboptimal outcomes. The underlying reasons are: 1) class-specific prompts offer more fine-grained supervision compared to coarse class-shared prompts, which helps prevent misclassification of data from different classes into a single class; 2) compared to class-specific prompts, instance-specific prompts neglect the richer class-level information across multiple instances, potentially causing data from the same class to be divided into multiple classes. To effectively supplement the class-specific knowledge into existing methods, we propose a plug-and-play Class-Aware Knowledge Injection (CAKI) framework. CAKI comprises two key components, i.e., class-specific prompt generation and query-key prompt matching. The former encodes class-specific knowledge into prompts from few-shot samples that belong to the same class and stores the learned prompts in a class-level knowledge bank. The latter provides a plug-and-play mechanism for each test instance to retrieve relevant class-level knowledge from the knowledge bank and inject such knowledge to refine model predictions. Extensive experiments demonstrate that our CAKI effectively improves the performance of existing methods on base and novel classes. Code is publicly available at https://github.com/yjh576/CAKI.
SMI: Statistical Membership Inference for Reliable Unlearned Model Auditing
Authors: Jialong Sun, Zeming Wei, Jiaxuan Zou, Jiacheng Gong, Jie Fu, Chengyang Dong, Heng Xu, Jialong Li, Bo Liu
First: 2026-02-01T10:51:53+00:00 · Latest: 2026-05-07T09:14:58+00:00
Abstract
Machine unlearning (MU) is essential for enforcing the right to be forgotten in machine learning systems. A key challenge of MU is how to reliably audit whether a model has truly forgotten specified training data. Membership Inference Attacks (MIAs) are widely used for unlearned model auditing, where samples that evade membership detection are regarded as successfully forgotten. We show this assumption is fundamentally flawed: failed membership inference does not imply true forgetting. We prove that unlearned samples occupy fundamentally different positions in the feature space than non-member samples, making this alignment bias unavoidable and unobservable, which leads to systematically optimistic evaluations of unlearning performance. Meanwhile, training shadow models for MIA incurs substantial computational overhead. To address both limitations, we propose Statistical Membership Inference (SMI), a training-free auditing framework that reformulates auditing as estimating the non-member mixture proportion in the unlearned feature distribution. Beyond estimating the forgetting rate, SMI also provides bootstrap reference ranges for quantified auditing reliability. Extensive experiments show that SMI consistently outperforms all MIA-based baselines, with no shadow model training required. Overall, SMI establishes a principled and efficient alternative to MIA-based auditing methods, with both theoretical guarantees and strong empirical performance.
Summary / 总结
The paper addresses the challenge of reliably auditing whether a model has forgotten specified training data in machine unlearning. It introduces Statistical Membership Inference (SMI), a training-free method that estimates the proportion of non-member samples in the unlearned feature distribution. SMI outperforms existing Membership Inference Attack (MIA)-based methods without the need for shadow model training, providing both theoretical guarantees and strong empirical performance. This approach avoids the optimistic bias inherent in MIA and offers reliable quantified auditing of unlearning performance.
论文针对机器遗忘中如何可靠地审计模型是否真正忘记了指定的训练数据这一挑战,提出了一个无需训练的审计框架,即统计成员推断(SMI)。SMI通过估计遗忘后特征分布中的非成员样本比例来工作,无需训练影子模型,其性能优于现有的基于成员推断攻击(MIA)的方法,并提供了理论保证和强大的实证表现。这种方法避免了MIA固有的乐观偏差,并提供了可靠的审计可靠性量化。
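To make the idea of "estimating the non-member mixture proportion" with a bootstrap range concrete, here is a deliberately simple sketch on one-dimensional scores. It uses a moment-matching estimator and percentile bootstrap, both of which are stand-ins chosen for brevity; the actual SMI estimator and feature space are not specified in the abstract, and every name below is hypothetical.

```python
import random
from statistics import mean


def mixture_proportion(unlearned, members, non_members):
    """Estimate pi in: unlearned ~ pi * non_member + (1 - pi) * member.

    Matches first moments only, so it assumes the two reference groups
    have different mean scores.
    """
    mu_m, mu_n = mean(members), mean(non_members)
    if mu_n == mu_m:
        raise ValueError("reference groups are indistinguishable by mean")
    pi = (mean(unlearned) - mu_m) / (mu_n - mu_m)
    return min(1.0, max(0.0, pi))  # clip to a valid proportion


def bootstrap_range(unlearned, members, non_members, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap range for the estimated proportion."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        u = rng.choices(unlearned, k=len(unlearned))
        m = rng.choices(members, k=len(members))
        n = rng.choices(non_members, k=len(non_members))
        estimates.append(mixture_proportion(u, m, n))
    estimates.sort()
    lo = estimates[int((1.0 - level) / 2.0 * (len(estimates) - 1))]
    hi = estimates[int((1.0 + level) / 2.0 * (len(estimates) - 1))]
    return lo, hi
```

A proportion near 1 would indicate that the unlearned samples look like data the model never saw; the bootstrap range gives a rough sense of how much that estimate moves under resampling.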
DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation
Authors: Sankarshana Venugopal, Mohammad Mostafavi, Jonghyun Choi
Venue: CVPR 2026
First: 2026-05-07T08:59:05+00:00 · Latest: 2026-05-07T08:59:05+00:00
Comments: Accepted to CVPR 2026. Includes supplementary material
Abstract
Diffusion-based image-to-image (I2I) translation excels in high-fidelity generation but suffers from slow sampling in state-of-the-art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training-free sampler that exploits the semi-linear structure of DBM's underlying SDE and ODE via exponential integrators, yielding highly-efficient 1st- and 2nd-order solutions. This reduces NFEs by up to 5x while boosting quality (e.g., FID drops 53% on DIODE at 20 NFEs vs. 2nd-order baseline). Experiments on inpainting, stylization, and semantics-to-image tasks across resolutions up to 256x256 show DBMSolver sets new SOTA efficiency-quality tradeoffs, enabling real-world applicability. Our code is publicly available at https://github.com/snumprlab/dbmsolver.
Summary / 总结
DBMSolver is a training-free diffusion bridge sampler that improves the efficiency of high-fidelity image-to-image translation by reducing the number of function evaluations needed. It uses exponential integrators to achieve highly-efficient first- and second-order solutions, which decrease the number of function evaluations by up to 5x while enhancing image quality. Experiments show that DBMSolver sets new state-of-the-art efficiency-quality tradeoffs, enabling practical applications in tasks such as inpainting, stylization, and semantics-to-image translation. The method reduces FID scores by 53% at 20 function evaluations compared to a second-order baseline. The code is publicly available.
DBMSolver 是一种无需训练的扩散桥梁采样器,通过减少所需的功能评估次数来提高高保真度的图像到图像转换效率。它使用指数积分器实现高效的一阶和二阶解,将功能评估次数减少多达 5 倍,同时提高图像质量。实验表明,DBMSolver 在修复、风格化和语义到图像任务中设置了新的效率-质量折衷标准,使其在实际应用中具有可行性。该方法在 20 次功能评估时将 FID 分数降低了 53% 相比二阶基线。代码已公开可用。
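The abstract attributes the speed-up to exponential integrators applied to a semi-linear ODE. As background, the sketch below shows a generic first-order exponential integrator (exponential Euler) for a scalar equation dx/dt = a*x + g(x, t): the linear part is integrated exactly and only the nonlinear term g is frozen over each step. This is a textbook scheme, not DBMSolver's actual update rule; the function names and the scalar setting are assumptions.

```python
import math


def exponential_euler(a, g, x0, t0, t1, steps):
    """Integrate dx/dt = a * x + g(x, t) from t0 to t1.

    Each step uses x_next = exp(a*h) * x + phi * g(x, t),
    where phi = (exp(a*h) - 1) / a, or h when a == 0.
    """
    h = (t1 - t0) / steps
    decay = math.exp(a * h)
    phi = math.expm1(a * h) / a if a != 0 else h
    x, t = x0, t0
    for _ in range(steps):
        x = decay * x + phi * g(x, t)
        t += h
    return x


def explicit_euler(a, g, x0, t0, t1, steps):
    """Plain forward Euler on the same equation, for comparison."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x = x + h * (a * x + g(x, t))
        t += h
    return x
```

For example, with a = -2 and a constant g = 4 the exponential scheme reproduces the exact solution 2 * (1 - exp(-2t)) at any step size, while forward Euler with the same five steps is visibly off. That exactness on the linear part is why such solvers tolerate large steps, and hence few function evaluations.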
Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models
Authors: Daniel Sungho Jung, Kyoung Mu Lee
First: 2026-05-07T08:57:27+00:00 · Latest: 2026-05-07T08:57:27+00:00
Comments: Project page: https://contactprompt-release.github.io/
Abstract
Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning about human interaction to accurately localize contact regions. Recently, multi-modal large language models (MLLMs) have demonstrated strong capabilities in understanding visual semantics, enabled by vision-language priors learned from large-scale data. However, leveraging MLLMs for dense hand contact estimation remains underexplored. There are two major challenges in applying MLLMs to dense hand contact estimation. First, encoding explicit 3D hand geometry is difficult, as MLLMs primarily operate on vision and language modalities. Second, capturing fine-grained vertex-level contact remains challenging, as MLLMs tend to focus on high-level semantics rather than detailed geometric reasoning. To address these challenges, we propose ContactPrompt, a training-free and zero-shot approach for dense hand contact estimation using MLLMs. To effectively encode 3D hand geometry, we introduce a detailed hand-part segmentation and a part-wise vertex-grid representation that provides structured, localized geometric information. To enable accurate and efficient dense contact prediction, we develop a multi-stage structured contact reasoning with part conditioning, progressively bridging global semantics and fine-grained geometry. As a result, our method effectively leverages the reasoning capabilities of MLLMs while enabling precise dense hand contact estimation. Surprisingly, the proposed approach outperforms previous supervised methods trained on large-scale dense contact datasets without requiring any training. The code will be released.
中文标题/摘要
标题:基于多模态大型语言模型的无训练密集手部接触估计
密集手部接触估计需要对人类互动进行高层次语义理解和精细几何推理,以准确定位接触区域。最近,多模态大型语言模型(MLLMs)凭借从大规模数据中学习到的视觉-语言先验,在理解视觉语义方面表现出强大的能力。然而,利用MLLMs进行密集手部接触估计仍未得到充分探索。将MLLMs应用于密集手部接触估计面临两大挑战。首先,编码显式的3D手部几何结构较为困难,因为MLLMs主要在视觉和语言模态上运行。其次,捕捉细粒度的顶点级接触仍然具有挑战性,因为MLLMs倾向于关注高层次语义而非详细的几何推理。为了解决这些挑战,我们提出了ContactPrompt,一种基于MLLMs的无训练、零样本密集手部接触估计方法。为了有效编码3D手部几何结构,我们引入了细致的手部分区分割和逐部位的顶点网格表示,提供结构化、局部化的几何信息。为了实现准确且高效的密集接触预测,我们开发了一种带有部位条件的多阶段结构化接触推理方法,逐步连接全局语义和细粒度几何。因此,我们的方法有效地利用了MLLMs的推理能力,同时实现了精确的密集手部接触估计。令人惊讶的是,所提出的方法在无需任何训练的情况下,超越了在大规模密集接触数据集上进行监督训练的先前方法。代码将会发布。
Summary / 总结
The research aims to leverage multi-modal large language models (MLLMs) for dense hand contact estimation, which requires both high-level semantic understanding and fine-grained geometric reasoning. To address the challenges of encoding 3D hand geometry and capturing vertex-level contact, the authors propose ContactPrompt, a training-free and zero-shot approach. This method uses detailed hand-part segmentation and a part-wise vertex-grid representation to provide structured geometric information, and employs multi-stage structured contact reasoning with part conditioning to predict dense hand contact accurately. The approach outperforms previous supervised methods trained on large-scale dense contact datasets without requiring any training.
研究旨在利用多模态大型语言模型(MLLMs)进行密集手部接触估计,这需要高层次语义理解和精细几何推理。为了解决编码3D手部几何和捕捉顶点级接触的挑战,作者提出了ContactPrompt,这是一种无需训练的零样本方法。该方法使用细致的手部分区分割和逐部位的顶点网格表示来提供结构化的几何信息,并采用带有部位条件的多阶段结构化接触推理过程以准确预测密集手部接触。该方法在无需任何训练的情况下超过了此前在大规模密集接触数据集上训练的监督方法。
StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods
Authors: Zheng Li, Jerry Cheng, Huanying Helen Gu
First: 2026-04-06T09:21:48+00:00 · Latest: 2026-05-07T08:44:16+00:00
Comments: 27 pages, 10 figures, 9 tables
Abstract
Ensemble methods improve predictive performance but often incur high memory and computational costs. We identify an aggregation instability induced by nonlinear projection and voting operations. To address both efficiency challenges and this inconsistency, we propose StableTTA, a training-free test-time adaptation method with two variants. StableTTA-I targets coherent-batch inference settings, where temporally or semantically adjacent observations are likely to belong to the same class. Examples include burst photography, video streams, robotics perception, and industrial inspection. Under coherent-batch inference, StableTTA-I substantially improves prediction consistency and accuracy through variance-aware logit aggregation. StableTTA-II establishes feature-level cropping, enabling efficient logit aggregation with a single forward pass on a single model backbone. Experiments on ImageNet-1K across 71 models demonstrate that StableTTA-I consistently improves prediction accuracy under coherent-batch inference, while StableTTA-II provides lightweight and architecture-agnostic accuracy improvements with minimal computational overhead. These results suggest that inference-time semantic coherence and aggregation stability provide useful perspectives for improving practical test-time adaptation systems.
中文标题/摘要
标题:StableTTA:通过无需训练的测试时自适应方法提高视觉模型性能
集成方法可以提高预测性能,但通常会带来高内存和计算成本。我们识别出由非线性投影和投票操作引起的聚合不稳定性。为了解决效率挑战和这种不一致性,我们提出了StableTTA,一种无需训练的测试时自适应方法,包含两种变体。StableTTA-I针对一致批次推理场景,其中时间上或语义上相邻的观测很可能属于同一类别。示例包括连拍摄影、视频流、机器人感知和工业检测。在一致批次推理下,StableTTA-I通过方差感知的logit聚合显著提高了预测一致性和准确性。StableTTA-II引入了特征级裁剪,只需在单个模型主干上进行一次前向传播即可高效地完成logit聚合。在ImageNet-1K上对71个模型进行的实验表明,StableTTA-I在一致批次推理下始终提高了预测准确性,而StableTTA-II以极小的计算开销提供了轻量级且架构无关的准确性改进。这些结果表明,推理时语义一致性和聚合稳定性为改进实际测试时自适应系统提供了有用视角。
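The abstract does not specify StableTTA's aggregation rule, so the sketch below is not the authors' method. It only illustrates, on made-up logits, why aggregating by voting and aggregating by averaging logits can disagree for a coherent batch, which is the kind of inconsistency that motivates working at the logit level. The function names and the toy batch are hypothetical.

```python
from collections import Counter


def argmax(values):
    """Index of the largest entry (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def majority_vote(batch_logits):
    """Predict each sample separately, then take the most common class."""
    votes = Counter(argmax(logits) for logits in batch_logits)
    return votes.most_common(1)[0][0]


def mean_logit_prediction(batch_logits):
    """Average the logits over the batch first, then take a single argmax."""
    n = len(batch_logits)
    mean_logits = [sum(column) / n for column in zip(*batch_logits)]
    return argmax(mean_logits)


# Hypothetical coherent batch: three views assumed to show the same object.
# Two views narrowly prefer class 0; one view strongly prefers class 1.
coherent_batch = [
    [2.0, 1.9, 0.0],
    [2.0, 1.9, 0.0],
    [0.0, 3.0, 0.0],
]

vote_class = majority_vote(coherent_batch)
mean_class = mean_logit_prediction(coherent_batch)
print("majority vote:", vote_class)
print("mean logits:  ", mean_class)
```

Voting discards how confident each view was, so two marginal wins outvote one decisive one, while logit averaging keeps that information. A variance-aware scheme would additionally take the spread of logits across the batch into account; how StableTTA does this is not described in the abstract.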
Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media
Authors: Megha Mariam K. M, Vineeth N. Balasubramanian, C. V. Jawahar
Venue: CVPR 2026
First: 2026-05-07T08:04:50+00:00 · Latest: 2026-05-07T08:04:50+00:00
Comments: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Abstract
The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models to assess their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at https://github.com/meghamariamkm2002/MCD
Summary / 总结
The paper addresses the challenge of connecting different representations of the same scientific work, such as papers, slides, and videos, which are rarely linked in a structured way. It introduces the Multimodal Conference Dataset (MCD), which integrates research papers, presentation videos, explanatory videos, and slides, and evaluates embedding-based and vision-language models on finding fine-grained cross-format correspondences. Vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but place equations and symbolic content in distinct clusters of the embedding space.
论文针对同一研究的不同表示形式(如论文、幻灯片和视频)很少以结构化方式关联的问题。它引入了多模态会议数据集(MCD),该数据集整合了研究论文、演讲视频、讲解视频和幻灯片,并评估了基于嵌入的模型和视觉语言模型发现细粒度跨格式对应关系的能力。结果显示,视觉语言模型虽然稳健但难以实现细粒度对齐,而基于嵌入的模型能够较好地捕捉文本-视觉对应关系,但公式和符号内容在嵌入空间中形成了独立的聚类。
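As a rough illustration of what an embedding-based correspondence search looks like, the sketch below matches each slide to its most similar paper section by cosine similarity. The three-dimensional vectors, the segment names, and the helper functions are invented for the example; they are not MCD data and do not reflect the models evaluated in the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_matches(queries, candidates):
    """Map each query name to the candidate name with the highest cosine similarity."""
    matches = {}
    for query_name, query_vec in queries.items():
        matches[query_name] = max(
            candidates,
            key=lambda name: cosine_similarity(query_vec, candidates[name]),
        )
    return matches


# Hypothetical toy embeddings; a real system would obtain these from an encoder.
slide_embeddings = {
    "slide_03": [0.9, 0.1, 0.0],
    "slide_07": [0.0, 0.2, 0.9],
}
paper_embeddings = {
    "sec_intro": [0.3, 0.9, 0.1],
    "sec_method": [1.0, 0.0, 0.1],
    "sec_results": [0.1, 0.1, 1.0],
}

matches = best_matches(slide_embeddings, paper_embeddings)
for slide, section in matches.items():
    print(slide, "->", section)
```

Nearest-neighbor search of this kind works only as well as the embedding space allows: if some content type, such as equations, clusters away from the text and visuals it belongs with, its true counterpart will rarely be the nearest neighbor.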