AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
Authors: Yu-Ju Tsai, Brian Price, Qing Liu, Luis Figueroa, Daniil Pakhomov, Zhihong Ding, Scott Cohen, Ming-Hsuan Yang
First: 2026-05-04T17:59:59+00:00 · Latest: 2026-05-04T17:59:59+00:00
Comments: Project Page: https://liagm.github.io/AlbumFill/
Abstract
Personalized image completion aims to restore occluded regions in personal photos while preserving identity and appearance. Existing methods either rely on generic inpainting models that often fail to maintain identity consistency, or assume that suitable reference images are explicitly provided. In practice, suitable references are often not explicitly provided, requiring the system to search for identity-consistent images within personal photo collections. We present AlbumFill, a training-free framework that retrieves identity-consistent references from personal albums for personalized completion. Given an occluded image and a personal album, a vision-language model infers missing semantic cues to guide composed image retrieval, and the retrieved references are used by reference-based completion models. To facilitate this task, we introduce a dataset containing 54K human-centric samples with associated album images. Experiments across multiple baselines demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval.
Summary / 总结
AlbumFill is a training-free framework for personalized image completion that retrieves identity-consistent reference images from a user's personal album. Given an occluded image, a vision-language model infers the missing semantic cues to guide composed image retrieval, and the retrieved references are passed to a reference-based completion model that fills in the occluded regions. Experiments across multiple baselines highlight the importance of identity-consistent reference retrieval. The project also introduces a dataset of 54K human-centric samples with associated album images for benchmarking.
VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
Authors: Tanush Yadav, Mohammadreza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi, Rohun Tripathi, Ranjay Krishna
Venue: CVPR 2026 Highlight
First: 2026-05-04T17:11:16+00:00 · Latest: 2026-05-04T17:11:16+00:00
Comments: project website at https://tanu.sh/videonet
Abstract
Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Still, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide $k\in\{1,2,3\}$ in-context examples of the action. Some models excel in the few-shot setting, while others falter; Qwen improves $+7.0\%$, while Gemini declines $-4.8\%$. Notably, these gains fall short of the $+13.6\%$ improvement in non-expert humans when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.
jina-vlm: Small Multilingual Vision Language Model
Authors: Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao
First: 2025-12-03T18:13:41+00:00 · Latest: 2026-05-04T16:45:33+00:00
Comments: 23 pages, 1-10 main content, 11-23 references and appendix
Abstract
We present jina-vlm, a token-efficient 2.4B parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and makes use of image tiling and attention-pooling for token-efficient processing of arbitrary-resolution images. To understand the contribution of different training data categories, we conduct a leave-one-out data mixture ablation study, systematically removing task, domain, modality, and language categories, to diagnose which data types are necessary versus redundant and whether task benefits transfer across domains. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
中文标题/摘要
标题:jina-vlm:小型多语言视觉语言模型
我们介绍了jina-vlm,这是一种参数高效的24亿参数视觉-语言模型,在开放的2B规模视觉语言模型中实现了最先进的跨语言视觉问答性能。该模型结合了SigLIP2视觉编码器和Qwen3语言解码器,并利用图像切片和注意力池化进行高效处理任意分辨率的图像。为了理解不同训练数据类别的作用,我们进行了一个排除法的数据混合消融研究,系统地移除任务、领域、模态和语言类别,以诊断哪些数据类型是必要的还是冗余的,以及任务收益是否在领域之间转移。模型权重和代码已公开发布在https://huggingface.co/jinaai/jina-vlm。
Summary / 总结
jina-vlm is a token-efficient 2.4B-parameter vision-language model that reports state-of-the-art multilingual VQA performance among open 2B-scale VLMs. It couples a SigLIP2 vision encoder with a Qwen3 language decoder and uses image tiling and attention pooling to process arbitrary-resolution images efficiently. A leave-one-out data mixture ablation removes task, domain, modality, and language categories in turn to diagnose which data types are necessary or redundant and whether task benefits transfer across domains. The model weights and code are publicly available.
jina-vlm 是一个 2.4B 参数的视觉语言模型,在多语言 VQA 方面表现出色。它结合了 SigLIP2 视觉编码器和 Qwen3 语言解码器,并使用图像切片和注意力池化进行高效的处理。模型性能归因于其高效的 token 设计和不同训练数据类别的使用。消融研究揭示了各种数据类型的重要性以及任务收益在不同领域之间的可转移性。该模型及其代码已公开发布。
Perceptual Flow Network for Visually Grounded Reasoning
Authors: Yangfu Li, Yuning Gong, Hongjian Zhan, Teng Li, Yuanhuiyi Lyu, Tianyi Chen, Qi Liu, Ziyuan Huang, Zhihang Zhong, Dandan Zheng, Yue Lu
Venue: ICML 2026
First: 2026-05-04T15:31:11+00:00 · Latest: 2026-05-04T15:31:11+00:00
Comments: 36 pages with 17 figures, Accepted at ICML 2026
Abstract
Despite the success of Large-Vision Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose Perceptual Flow Network (PFlowNet), which eschews rigid alignment with the expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Based on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby facilitating reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, particularly setting new SOTA records on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
中文标题/摘要
标题:感知流网络用于视觉导向推理
尽管大型视觉语言模型(LVLM)取得了成功,但通用优化目标(例如标准MLE)未能约束视觉轨迹,导致语言偏见和幻觉。为缓解这一问题,当前方法引入来自视觉专家的几何先验作为额外监督。然而,我们观察到这种监督通常不理想:它偏向于几何精度,提供的推理实用性有限。为弥合这一差距,我们提出了感知流网络(PFlowNet),它摒弃了与专家先验的刚性对齐,实现了可解释且更有效的视觉推理。具体而言,PFlowNet 将感知与推理解耦,建立一个自我条件生成过程。在此基础上,它通过变分强化学习整合多维奖励与邻近几何塑造,从而促进以推理为导向的感知行为,同时保持视觉可靠性。PFlowNet 提供了可证明的性能保证,并取得了竞争性的实证结果,特别是在 V* 基准(90.6%)和 MME-RealWorld-lite(67.0%)上创下了新的 SOTA 记录。
Summary / 总结
The research aims to address the issue of language bias and hallucination in Large-Vision Language Models (LVLMs) by proposing Perceptual Flow Network (PFlowNet). PFlowNet decouples perception from reasoning, integrating multi-dimensional rewards with geometric shaping through variational reinforcement learning. This approach leads to improved visual reasoning and reliability, achieving a new state-of-the-art on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
研究旨在通过提出感知流网络(PFlowNet)解决大型视觉语言模型(LVLM)中的语言偏差和幻觉问题。PFlowNet将感知与推理分离,建立一个自我条件生成过程,通过变分强化学习整合多维度奖励与邻近几何塑造。该模型实现了可解释且有效的视觉推理,取得了竞争性的实验结果,在V* Bench(90.6%)和MME-RealWorld-lite(67.0%)上创下了新的SOTA记录。
PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature
Authors: Verena Jasmin Hallitschke, Carsten Eickhoff, Philipp Berens
First: 2026-05-04T15:19:42+00:00 · Latest: 2026-05-04T15:19:42+00:00
Comments: 12 pages, 4 figures, 3 supplementary figures. Dataset available at https://huggingface.co/datasets/pubmed-ophtha/PubMed-Ophtha. Code available at https://github.com/berenslab/pubmed-ophtha
Abstract
Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality -- color fundus photography, optical coherence tomography, retinal imaging, or other -- and a mark status indicating the presence of annotation marks such as arrows. Figure captions are split into panel-level subcaptions using a two-step LLM approach, achieving a mean average sentence BLEU score of 0.913 on human-annotated data. Panel and image detection models reach a mAP@0.50 of 0.909 and 0.892, respectively, and figure extraction achieves a median IoU of 0.997. To support reproducibility, we additionally release the human-annotated ground-truth data, all trained models, and the full dataset generation pipeline.
Summary / 总结
The research aims to develop vision-language models for ophthalmology by creating PubMed-Ophtha, a dataset of 102,023 image-caption pairs from 15,842 open-access articles. The method involves extracting figures directly from PDFs and decomposing them into panels, identifying imaging modalities, and annotating marks. Key findings include a mean average sentence BLEU score of 0.913 for panel-level subcaption generation, mAP@0.50 of 0.909 for panel detection, and 0.892 for image detection, with a median IoU of 0.997 for figure extraction. The dataset and related resources are publicly available for reproducibility.
研究旨在通过创建包含102,023个图像-描述对的PubMed-Ophtha数据集来开发眼科领域的视觉-语言模型,这些数据集来自15,842篇开放获取的PubMed Central文章。方法包括直接从PDF中提取图像并分解为面板,识别成像模态并标注标记。关键发现包括生成面板级子描述的平均句子BLEU得分为0.913,面板检测的mAP@0.50为0.909,图像检测为0.892,以及图像提取的中值IoU为0.997。数据集及相关资源已公开以支持可重复性。
AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs
Authors: Simon Dorer, Martin Büchner, Nick Heppert, Abhinav Valada
First: 2026-05-04T14:48:52+00:00 · Latest: 2026-05-04T14:48:52+00:00
Comments: 8 pages, 9 Figures, 3 Tables
Abstract
Dense and accurate depth estimation is essential for robotic manipulation, grasping, and navigation, yet currently available depth sensors are prone to errors on transparent, specular, and general non-Lambertian surfaces. To mitigate these errors, large-scale monocular depth estimation approaches provide strong structural priors, but their predictions can be potentially skewed or mis-scaled in metric units, limiting their direct use in robotics. Thus, in this work, we propose a training-free depth grounding framework that anchors monocular depth estimation priors from a depth foundation model in raw sensor depth through factor graph optimization. Our method performs a patch-wise affine alignment, locally grounding monocular predictions in metric real-world depth while preserving fine-grained geometric structure and discontinuities. To facilitate evaluation in challenging real-world conditions, we introduce a benchmark dataset with dense scene-wide ground truth depth in the presence of non-Lambertian objects. Ground truth is obtained via matte reflection spray and multi-camera fusion, overcoming the reliance on object-only CAD-based annotations used in prior datasets. Extensive evaluations across diverse sensors and domains demonstrate consistent improvements in depth performance without any (re-)training. We make our implementation publicly available at https://anchord.cs.uni-freiburg.de.
中文标题/摘要
标题:AnchorD:使用因子图的单目深度度量接地
密集且准确的深度估计对于机器人操作、抓取和导航至关重要,但目前可用的深度传感器在透明、镜面和一般非朗伯表面容易出错。为减轻这些错误,大规模单目深度估计方法提供了强大的结构先验,但它们的预测可能在米制单位中被潜在地偏斜或缩放错误,限制了它们在机器人领域的直接应用。因此,在这项工作中,我们提出了一种无需训练的深度接地框架,通过因子图优化将单目深度估计先验从深度基础模型锚定到原始传感器深度中。我们的方法进行局部的仿射对齐,将单目预测局部接地在米制的真实世界深度中,同时保留精细的几何结构和不连续性。为了在具有非朗伯物体的复杂现实条件下促进评估,我们引入了一个基准数据集,其中包含密集场景范围的真实深度,通过哑光反射喷雾和多相机融合获得真实深度,克服了先前数据集中仅依赖于物体的CAD注释的依赖。在多种传感器和领域中的广泛评估表明,无需任何(重新)训练即可一致地提高深度性能。我们将在https://anchord.cs.uni-freiburg.de/上公开我们的实现。
Summary / 总结
The research aims to improve the accuracy and reliability of monocular depth estimation for robotic applications, particularly in challenging conditions like non-Lambertian surfaces. The method uses a training-free factor graph optimization to align monocular depth predictions with raw sensor depth, preserving geometric details. Key findings show consistent improvements in depth performance across various sensors and domains without retraining, and a new benchmark dataset with dense ground truth depth is introduced to evaluate performance in real-world conditions.
研究旨在通过改进单目深度估计的准确性和可靠性,特别是在透明、镜面和非朗伯表面等挑战性条件下,提高机器人应用的效果。方法使用训练-free 因子图优化将单目深度预测与原始传感器深度对齐,同时保留几何细节。关键发现显示,在各种传感器和领域中,无需重新训练即可实现深度性能的一致提升,并引入了一个新的基准数据集,包含密集的真实深度,以在真实世界条件下评估性能。
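The patch-wise affine alignment described in the AnchorD abstract can be illustrated with a minimal sketch. It assumes each patch is aligned independently by closed-form least squares, fitting a scale and shift so that `scale * mono + shift` matches the valid sensor depths. The paper instead solves the alignment through factor graph optimization, which is not reproduced here. The function name `fit_patch_affine` and the toy values are hypothetical.

```python
from typing import List, Optional, Tuple


def fit_patch_affine(
    mono: List[float], sensor: List[Optional[float]]
) -> Tuple[float, float]:
    """Fit sensor ~= scale * mono + shift over pixels with valid sensor depth.

    Hypothetical helper: ordinary least squares on one patch, not the
    factor graph optimization used in the paper.
    """
    # Keep only pixels where the sensor returned a reading.
    pairs = [(m, s) for m, s in zip(mono, sensor) if s is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two valid sensor readings in the patch")
    n = len(pairs)
    mean_m = sum(m for m, _ in pairs) / n
    mean_s = sum(s for _, s in pairs) / n
    var_m = sum((m - mean_m) ** 2 for m, _ in pairs)
    if var_m == 0.0:
        raise ValueError("monocular depth is constant; scale is not identifiable")
    cov = sum((m - mean_m) * (s - mean_s) for m, s in pairs)
    scale = cov / var_m
    shift = mean_s - scale * mean_m
    return scale, shift


# Toy patch: relative monocular depth, and sensor depth in metres with
# dropouts (None), as might occur on a transparent surface.
mono_patch = [0.10, 0.20, 0.30, 0.40, 0.50]
sensor_patch = [0.7, None, 1.1, None, 1.5]  # generated as 2.0 * mono + 0.5

scale, shift = fit_patch_affine(mono_patch, sensor_patch)
# Ground every pixel, including the dropouts, in metric depth.
grounded = [scale * m + shift for m in mono_patch]
```

Because the fit only rescales and shifts the monocular prediction within a patch, the relative structure of the prediction is preserved while the missing sensor pixels receive metric values.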
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
Authors: Ruilin Yao, Shegnwu Xiong, Tianyu Zou, Shili Xiong, Yi Rong
First: 2026-05-04T14:18:46+00:00 · Latest: 2026-05-04T14:18:46+00:00
Comments: 18 pages, 4 figures
Abstract
Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored. We propose AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding. Our key insight is that token-level perplexity in coordinate generation naturally reflects spatial uncertainty. Rather than committing to a single prediction, AutoFocus samples multiple coordinate hypotheses and converts their axial perplexities into an anisotropic gaussian spatial probability field, explicitly modeling directional uncertainty. Based on this field, we generate global and local region proposals and introduce Shape-Aware Zooming to balance tight localization with contextual preservation. A visual prompt-based aggregation step then selects the most consistent prediction via structured comparison. Extensive experiments on ScreenSpot-Pro and ScreenSpot-V2 demonstrate consistent improvements across both general-purpose and GUI-specialized VLMs.
中文标题/摘要
标题:AutoFocus:面向GUI定位的不确定性感知主动视觉搜索
视觉-语言模型(VLMs)使自主GUI代理能够将自然语言指令转换为可执行的屏幕坐标。然而,在高分辨率界面中,密集布局和小型交互元素暴露了现代显示器与模型输入约束之间的分辨率差距,导致定位性能下降。现有的放大策略依赖于固定的锚点、启发式网格或强化学习,缺乏一种原理性的机制来适应性地确定需要细化的地方以及应探索多少空间不确定性。我们提出了AutoFocus,一种无需训练、具有不确定性感知的主动视觉搜索框架,用于GUI定位。我们的核心见解是,坐标生成中的标记级困惑度自然反映了空间不确定性。AutoFocus 不是做出单一预测,而是采样多个坐标假设,并将它们的轴向困惑度转换为各向异性高斯空间概率场,明确建模方向性不确定性。基于此场,我们生成全局和局部区域提案,并引入形状感知放大以平衡精确定位与上下文保留。随后,基于视觉提示的聚合步骤通过结构化比较选择最一致的预测。在ScreenSpot-Pro和ScreenSpot-V2上的广泛实验表明,AutoFocus 在通用和GUI专用的VLMs上均能实现一致的性能提升。
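The AutoFocus idea of turning per-axis perplexity into an anisotropic Gaussian can be sketched as follows. This is only an illustration under stated assumptions: the linear mapping from perplexity to standard deviation (`BASE_SIGMA`), the unnormalised axis-aligned Gaussian, and all token log-probabilities are hypothetical, and the abstract does not specify the exact conversion used by the authors.

```python
import math
from typing import List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity is the exponential of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def gaussian_field(
    x: float, y: float, cx: float, cy: float, sigma_x: float, sigma_y: float
) -> float:
    """Unnormalised axis-aligned anisotropic Gaussian centred at (cx, cy)."""
    zx = (x - cx) / sigma_x
    zy = (y - cy) / sigma_y
    return math.exp(-0.5 * (zx * zx + zy * zy))


BASE_SIGMA = 20.0  # pixels per unit of perplexity (assumption, not from the paper)

# Hypothetical log-probabilities of the tokens that spell each coordinate.
x_logprobs = [math.log(0.5), math.log(0.5)]  # model is unsure about x
y_logprobs = [math.log(0.9), math.log(0.9)]  # model is confident about y

sigma_x = BASE_SIGMA * perplexity(x_logprobs)
sigma_y = BASE_SIGMA * perplexity(y_logprobs)

cx, cy = 960.0, 540.0  # predicted click point (hypothetical)
# The field decays more slowly along the uncertain axis, so a region
# proposal built from it would extend further horizontally.
horizontal = gaussian_field(cx + 30.0, cy, cx, cy, sigma_x, sigma_y)
vertical = gaussian_field(cx, cy + 30.0, cx, cy, sigma_x, sigma_y)
```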
Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model
Authors: Zhou Bingtao, Xiang Mian, Ning Qian
First: 2026-05-04T13:51:45+00:00 · Latest: 2026-05-04T13:51:45+00:00
Abstract
Source-Free Domain Adaptation (SFDA) adapts source models to target domains without accessing source data, addressing privacy and transmission issues. However, existing methods still initialize from a source pre-trained model and thus are not truly source-free. Recent works have introduced Vision-Language (ViL) models to guide the adaptation process. In these methods, we observe that for the same target domain, different source models yield minimal variation in final results, indicating that the source model itself has limited impact. Motivated by this, we propose ViL-Only Domain Adaptation (VODA), a stricter setting that eliminates all dependencies on the source domain, relying solely on a randomly initialized model, a ViL model, and unlabeled target data. We analyze the adaptation dynamics of VODA and introduce Two-Stage Denoised-Region Distillation (TS-DRD), a two-stage framework that first warms up the model with ViL guidance, then seeks a Denoised-Region inherent in both the ViL and adapting model, yielding cleaner supervision for distillation. Experiments on Office-Home, VisDA, and DomainNet-126 show that under VODA, TS-DRD achieves competitive or superior performance to existing SFDA methods that still use source models, demonstrating its effectiveness and the potential of the VODA setting.
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
Authors: Berk Çiçek, Mert K. Er, Özgür S. Öğüz
Venue: RSS
First: 2026-05-04T13:49:19+00:00 · Latest: 2026-05-04T13:49:19+00:00
Comments: 21 pages, 9 figures, 3 tables. Accepted to Robotics: Science and Systems (RSS) 2026
Abstract
While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains a challenge due to their lack of explicit physical grounding and inability to perform adaptive control. To bridge this gap, we propose CoRAL (Contact-Rich Adaptive LLM-based control), a modular framework that enables zero-shot planning by decoupling high-level reasoning from low-level control. Unlike black-box policies, CoRAL uses LLMs not as direct controllers, but as cost designers that synthesize context-aware objective functions for a sampling-based motion planner (MPPI). To address the ambiguity of physical parameters in visual data, we introduce a neuro-symbolic adaptation loop: a VLM provides semantic priors for environmental dynamics, such as mass and friction estimates, which are then explicitly refined in real time via online system identification, while the LLM iteratively modulates the cost-function structure to correct strategic errors based on interaction feedback. Furthermore, a retrieval-based memory unit allows the system to reuse successful strategies across recurrent tasks. This hierarchical architecture ensures real-time control stability by decoupling high-level semantic reasoning from reactive execution, effectively bridging the gap between slow LLM inference and dynamic contact requirements. We validate CoRAL on both simulation and real-world hardware across challenging and novel tasks, such as flipping objects against walls by leveraging extrinsic contacts. Experiments demonstrate that CoRAL outperforms state-of-the-art VLA and foundation-model-based planner baselines by boosting success rates over 50% on average in unseen contact-rich scenarios, effectively handling sim-to-real gaps through its adaptive physical understanding.
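To make the role of a cost function in a sampling-based planner concrete, here is a minimal sketch of the standard MPPI update on a one-dimensional toy problem. It assumes a single scalar control and a hand-written quadratic cost; `push_cost`, the target value 1.5, and all hyperparameters are hypothetical stand-ins. In CoRAL the cost would be synthesized by the LLM and the planner would roll out a dynamics model, neither of which is reproduced here.

```python
import math
import random
from typing import Callable


def mppi_step(
    u_nominal: float,
    cost_fn: Callable[[float], float],
    num_samples: int = 256,
    noise_std: float = 0.5,
    temperature: float = 0.1,
    seed: int = 0,
) -> float:
    """One MPPI update: perturb the nominal control, weight each perturbation
    by exp(-cost / temperature), and shift the nominal by the weighted mean."""
    rng = random.Random(seed)
    noises = [rng.gauss(0.0, noise_std) for _ in range(num_samples)]
    costs = [cost_fn(u_nominal + eps) for eps in noises]
    # Subtracting the minimum cost keeps the exponentials numerically stable.
    min_cost = min(costs)
    weights = [math.exp(-(c - min_cost) / temperature) for c in costs]
    total = sum(weights)
    return u_nominal + sum(w * eps for w, eps in zip(weights, noises)) / total


def push_cost(u: float) -> float:
    """Hypothetical cost: squared distance of the control from a target of 1.5."""
    return (u - 1.5) ** 2


u = 0.0
for i in range(10):
    u = mppi_step(u, push_cost, seed=i)
```

Changing the objective only requires swapping `cost_fn`, which is why a language model can steer such a planner by rewriting the cost rather than by emitting controls directly.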
LVLM-Aided Alignment of Task-Specific Vision Models
Authors: Alexander Koebler, Lukas Kuhn, Ingo Thon, Florian Buettner
Venue: CVPR 2026
First: 2025-12-26T11:11:25+00:00 · Latest: 2026-05-04T12:51:00+00:00
Comments: Accepted at CVPR 2026
Abstract
In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This might result in brittle behavior once deployed in the real world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model. Our method demonstrates substantial improvement in aligning model behavior with human specifications, as validated on both synthetic and real-world datasets. We show that it effectively reduces the model's dependence on spurious features and on group-specific biases, without requiring fine-grained feedback.
Summary / 总结
The paper introduces LVLM-VA, a method for aligning small task-specific vision models with human domain knowledge using a Large Vision Language Model (LVLM). This approach translates model behavior into natural language and maps human class-level specifications to image-level critiques, improving model alignment and reducing dependence on spurious features and biases. Experiments on synthetic and real-world datasets show significant improvements in model behavior alignment with human specifications.
该研究提出了一种名为LVLM-VA的方法,利用大型视觉语言模型(LVLM)将小型任务特定视觉模型与人类领域知识对齐。该方法将模型行为转化为自然语言,并将人类的类别级规范映射到图像级批评,从而提高模型对齐并减少对虚假特征和群体特定偏见的依赖。实验结果表明,该方法在合成和真实世界数据集上显著提高了模型行为与人类规范的对齐程度。
A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory
Authors: Bogdan Felician Abaza, Andrei-Alexandru Staicu, Cristian Vasile Doicin
First: 2026-05-04T12:27:03+00:00 · Latest: 2026-05-04T12:27:03+00:00
Comments: 33 pages, 11 figures, 14 tables
Abstract
Autonomous indoor mobile robots can navigate reliably to metric coordinates using established frameworks such as ROS 2 Navigation 2, yet they lack the ability to interpret natural language instructions that express intent rather than positions. Vision-Language Models offer the semantic reasoning required to bridge this gap, but their inference latency (2-9 seconds per decision on consumer hardware) and session-by-session amnesia limit practical deployment. This paper presents the Semantic Autonomy Stack, a six-layer reference framework for semantically autonomous indoor navigation, and validates a complete instance featuring hybrid deterministic-VLM reasoning and cross-robot adaptive memory on physical robots with off-the-shelf edge hardware. A seven-step parametric resolver handles 88% of instructions in under 0.1 milliseconds without invoking a language model, camera, or GPU; only genuinely ambiguous instructions escalate to VLM reasoning. A five-category semantic memory framework with explicit scope taxonomy (global environment knowledge, per-operator preferences, per-robot capabilities) enables cross-session learning and cross-robot knowledge transfer: preferences learned through VLM interactions on one robot are promoted to deterministic resolution and transferred to a second robot via a shared compiled digest, achieving a measured latency reduction of 103,000-fold. Experimental validation on two custom-built differential-drive robots across 82 scenario-level decisions and three sessions demonstrates 100% semantic transfer accuracy (33/33, 95% CI [0.894, 1.000]), 100% semantic resolution accuracy, and concurrent multi-robot operation feasibility - all on Raspberry Pi 5 platforms with no onboard GPU, requiring zero training data.
中文标题/摘要
标题:集成视觉语言模型的室内移动机器人语义自主框架:混合确定性推理与跨机器人自适应记忆
自主室内移动机器人可以使用ROS 2 Navigation 2等现有框架可靠地导航到米制坐标,但它们缺乏理解自然语言指令的能力,这些指令表达的是意图而不是位置。视觉语言模型提供了所需的语义推理能力来弥合这一差距,但它们在消费级硬件上的推理延迟(每次决策2-9秒)和会话性健忘限制了其实用部署。本文提出了语义自主堆栈,这是一种六层参考框架,用于语义自主室内导航,并在物理机器人上通过现成的边缘硬件验证了一个完整的实例,该实例结合了混合确定性-VLM推理和跨机器人自适应记忆。七步参数解析器在不到0.1毫秒内处理了88%的指令,无需调用语言模型、相机或GPU;只有真正含糊的指令才会升级到VLM推理。一种五类语义记忆框架,具有明确的范围分类(全局环境知识、每操作员偏好、每机器人能力),实现了跨会话学习和跨机器人知识转移:通过VLM交互在一台机器人上学习的偏好被提升为确定性解析,并通过共享编译摘要传输到第二台机器人,实现了测量延迟减少103,000倍。在两台自定义差速驱动机器人上进行的实验验证了82个场景级决策和三个会话,展示了100%的语义转移准确性(33/33,95%置信区间[0.894, 1.000])、100%的语义解析准确性以及多机器人操作的可行性——所有这些都在没有内置GPU的Raspberry Pi 5平台上实现,无需任何训练数据。
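The interval reported for semantic transfer accuracy in the Semantic Autonomy Stack abstract (33/33, 95% CI [0.894, 1.000]) is numerically consistent with an exact Clopper-Pearson binomial interval. That the authors used this method is an assumption; the sketch below only shows that it reproduces the reported lower bound. When every trial succeeds, the upper bound is 1 and the lower bound solves p^n = alpha/2.

```python
from typing import Tuple


def clopper_pearson_all_successes(n: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Exact two-sided binomial interval for the special case of n successes in n trials.

    The lower bound p satisfies P(X = n | p) = p**n = alpha / 2.
    """
    alpha = 1.0 - confidence
    lower = (alpha / 2.0) ** (1.0 / n)
    return lower, 1.0


lower, upper = clopper_pearson_all_successes(33)
print(f"95% CI for 33/33: [{lower:.3f}, {upper:.3f}]")  # prints [0.894, 1.000]
```

The width of the interval reflects the sample size: even a perfect 33/33 record leaves roughly a ten-point margin on the lower side.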
The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling
Authors: Tu Nguyen, Rasul Tutunov, Xiaotong Ji, Matthieu Zimmer
First: 2026-05-04T10:26:34+00:00 · Latest: 2026-05-04T10:26:34+00:00
Abstract
A recurring pattern in "reasoning without training" is that base LLMs already assign non-trivial probability mass to correct multi-step solutions; the bottleneck is locating these modes efficiently at inference time. Power sampling provides a principled way to bias decoding toward such modes by targeting p_theta(x)^alpha with alpha > 1, but practical approximations must account for future-dependent correction factors that determine which prefixes remain promising.
We introduce Auxiliary Particle Power Sampling (APPS), a blockwise particle algorithm for approximating the sequence-level power target with a bounded population of partial solutions. APPS propagates hypotheses in parallel using proposal-corrected power reweighting and refines their survival through future-value-guided selection at resampling boundaries. This redistributes finite compute across competing prefixes rather than committing to a single unfolding path, while providing a direct scaling knob in the particle count and predictable peak memory. We instantiate the future-value signal with short-horizon rollouts and also study an amortized variant that replaces rollouts with a lightweight learned selection head. Across reasoning benchmarks, APPS improves the accuracy-runtime trade-off of training-free decoding and suggests that part of the gap to post-trained systems can be recovered through more faithful inference-time power approximation.
Summary / 总结
The research aims to improve the efficiency of finding correct multi-step solutions in reasoning tasks by addressing the bottleneck of locating these solutions at inference time. The method, Auxiliary Particle Power Sampling (APPS), uses a blockwise particle algorithm to approximate the sequence-level power target, redistributing compute across competing prefixes. This approach improves the accuracy-runtime trade-off of training-free decoding and suggests that more faithful inference-time power approximation can help close the gap to post-trained systems.
研究旨在通过解决推理任务中在推理时定位正确多步解决方案的瓶颈,提高找到这些解决方案的效率。方法是引入块状粒子算法的辅助粒子功率采样(APPS),以重新分配计算资源到竞争前缀。这种方法提高了无训练解码的准确性和运行时性能之间的权衡,并表明更忠实的推理时功率近似可以缩小与后训练系统的差距。
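The "future-dependent correction factors" mentioned in the APPS abstract can be seen in a two-token toy model. The sketch below enumerates all sequences, builds the sequence-level power target p(x)^alpha exactly, and compares its first-token marginal with naive per-token tempering. The vocabulary and probabilities are made up for illustration, and the exhaustive enumeration stands in for the particle approximation that APPS uses.

```python
from collections import defaultdict
from typing import Dict

ALPHA = 2.0  # power exponent, alpha > 1 sharpens the distribution

# Hypothetical two-step model: p(first token) and p(second token | first token).
first = {"a": 0.6, "b": 0.4}
second = {"a": {"x": 0.5, "y": 0.5}, "b": {"x": 0.95, "y": 0.05}}


def normalise(weights: Dict[str, float]) -> Dict[str, float]:
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


# Sequence-level power target: proportional to p(x)^alpha over full sequences.
seq_power = normalise(
    {t1 + t2: (p1 * p2) ** ALPHA for t1, p1 in first.items() for t2, p2 in second[t1].items()}
)

# First-token marginal under the power target.
power_first: Dict[str, float] = defaultdict(float)
for seq, p in seq_power.items():
    power_first[seq[0]] += p

# Naive token-level tempering ignores what can follow each prefix.
tempered_first = normalise({t: p ** ALPHA for t, p in first.items()})

# The missing piece is a per-prefix factor: the sum over continuations
# of p(continuation | prefix)^alpha.
future_factor = {t1: sum(p2 ** ALPHA for p2 in second[t1].values()) for t1 in first}
corrected_first = normalise({t: first[t] ** ALPHA * future_factor[t] for t in first})
```

In this toy model the prefix "b" is less likely than "a" at the first step but leads to a much more peaked continuation, so the power target assigns it more mass than tempering does, and the most probable sequence under the target is "bx".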
DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
Authors: Desong Yang, Mang Ye
First: 2026-05-04T10:09:18+00:00 · Latest: 2026-05-04T10:09:18+00:00
Comments: Project page: https://desongyang.github.io/Directedit/
Abstract
With recent advancements in large-scale pre-trained text-to-image (T2I) models, training-free image editing methods have demonstrated remarkable success. Typically, these methods involve adding noise to a clean image via an inversion process, followed by separate denoising steps for the reconstruction and editing paths during the forward process. However, since the reconstruction path is approximated using noisy latents from mismatched timesteps, existing methods inevitably suffer from accumulated drift, which fundamentally limits reconstruction fidelity. To address this challenge, we systematically analyze the inversion process within the flow transformer and propose DirectEdit, a simple yet effective editing method that eliminates the inherent reconstruction error without introducing additional neural function evaluations (NFEs). Unlike most prior works that attempt to rectify the inversion path, DirectEdit focuses on directly aligning the forward paths, enabling precise reconstruction and reliable feature sharing. Furthermore, we introduce a preservation mechanism based on attention feature injection and multi-branch mask-guided noise blending, which effectively balances fidelity and editability. Extensive experiments across diverse scenarios demonstrate that DirectEdit achieves efficient and accurate image editing, delivering superior performance that outperforms state-of-the-art methods. Code and examples are available at https://desongyang.github.io/Directedit.
Summary / 总结
This work analyzes the inversion process in flow transformers and proposes DirectEdit, a training-free image editing method that eliminates the inherent reconstruction error without introducing additional neural function evaluations (NFEs). Unlike most prior methods, which try to rectify the inversion path, DirectEdit directly aligns the forward paths, enabling precise reconstruction and reliable feature sharing. A preservation mechanism based on attention feature injection and multi-branch mask-guided noise blending balances fidelity and editability. Experiments across diverse scenarios show that DirectEdit outperforms state-of-the-art methods in image editing.
DirectEdit 是一种通过直接对齐前向路径来提高图像编辑准确性的方法,而不引入额外的神经函数评估。它通过消除固有的重建误差来解决现有方法中累积漂移的问题。关键发现表明,DirectEdit 在效率和准确性方面均优于最先进的方法,实现了更优的图像编辑效果。
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
Authors: Kyle Zheng, Han Zhang, Renliang Sun, Chenchen Ye, Wei Wang
First: 2026-05-04T10:01:24+00:00 · Latest: 2026-05-04T10:01:24+00:00
Abstract
A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText generates natural-language pseudo-tool descriptions as retrieval probes, refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (43k tools, 4 domains), FitText improves average retrieval rank from 8.81 to 2.78; on StableToolBench (16,464 APIs), it achieves a 0.73 average pass rate--a 24-point absolute gain over static query retrieval. The gains transfer across base models capable of acting as competent semantic operators; under weaker base models, Memetic's evolutionary search inverts--amplifying noise rather than refining signal--surfacing model capacity as a prerequisite for evolutionary tool exploration.
中文标题/摘要
标题:FitText:通过记忆性检索演化代理工具生态
用户描述任务的方式与工具文档化之间存在语义鸿沟。随着API生态系统的规模扩大到数万个端点,仅从初始查询进行静态检索无法弥合这一差距:代理在执行过程中对其所需的理解会演变,但其工具集不会随之更新。我们引入了FitText,这是一种无需训练的框架,通过直接嵌入代理的推理循环中使检索变得动态。FitText生成自然语言的伪工具描述作为检索探针,通过检索反馈迭代优化这些描述,并通过随机生成探索多种替代方案。记忆性检索通过工具记忆避免冗余搜索,为候选描述添加了进化选择压力。在ToolRet(43,000个工具,4个领域)上,FitText将平均检索排名从8.81提高到2.78;在StableToolBench(16,464个API)上,它实现了0.73的平均通过率——比静态查询检索高出24个绝对点。这些收益在能够作为有效语义操作者的基础模型中转移;在较弱的基础模型下,记忆性的进化搜索会反转——放大噪声而非精炼信号,表明模型能力是进化工具探索的前提。
FEAT: Fashion Editing and Try-On from Any Design
Authors: Soye Kwon, Keonyoung Lee, Dahuin Jung, Jaekoo Lee
First: 2026-05-04T09:35:44+00:00 · Latest: 2026-05-04T09:35:44+00:00
Comments: 10 pages, 9 figures, 2 tables
Abstract
Fashion design aims to express a designer's creative intent and to depict how garments interact with the human body. Recent methods condition on multimodal inputs to support garment editing and virtual try-on. However, existing methods still (i) confine design to garment-related images, excluding creative design sources such as artwork, abstract imagery, and natural photographs, and (ii) cannot support complete outfits, including accessories. We present FEAT (Fashion Editing And Try-On from Any Design), a method that enables editing and try-on across garments and accessories using diverse design sources. To achieve this, we introduce Disentangled Dual Injection (DDI). It takes both apparel and non-apparel design sources and selectively injects design cues via content and style disentanglement. Furthermore, we propose Orthogonal-Guided Noise Fusion (OGNF), a training-free mechanism that removes residual garments via orthogonal projection and applies region-specific noise strategies to enable virtual try-on for both garments and accessories. Extensive experiments demonstrate that FEAT achieves state-of-the-art performance in design flexibility, prompt consistency, and visual realism.
Summary / 总结
FEAT is designed to enhance fashion design by allowing editing and virtual try-on using a wide range of design sources, including non-garment images. It introduces Disentangled Dual Injection (DDI) to handle both apparel and non-apparel designs and Orthogonal-Guided Noise Fusion (OGNF) to remove residual garments and enable virtual try-on for accessories. Experiments show that FEAT outperforms existing methods in design flexibility, prompt consistency, and visual realism.
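FEAT aims to enable garment editing and try-on from diverse design sources, addressing the limitations of existing methods by accepting non-garment design inputs and supporting complete outfits. It introduces Disentangled Dual Injection (DDI), which separates content from style, and proposes Orthogonal-Guided Noise Fusion (OGNF), which removes residual garments and applies region-specific noise strategies. Experiments show that FEAT outperforms existing methods in design flexibility, prompt consistency, and visual realism.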
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
Authors: Haoyu Wang, Haonan Wang, Yuyan Chen, Jun Chen, Gang Liu, Qian Wang, Jiahong Yan, Yanghua Xiao
First: 2026-05-04T09:18:19+00:00 · Latest: 2026-05-04T09:18:19+00:00
Comments: Under review
Abstract
In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap: models often produce correct answers from flawed reasoning, while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual cues, and a skewed attention distribution that favors the initial image at the expense of subsequent context. To address these issues, we introduce a framework that restructures multimodal ICL as a principled inductive-deductive process. The framework incorporates a similarity-based visual token compression module to filter out redundant patches, a dynamic attention rebalancing mechanism to distribute focus equitably across all images, and a chain-of-thought paradigm that explicitly guides the model to analyze individual examples, derive a generalizable rule, and then apply it to the query. An auxiliary learning pipeline combines supervised fine-tuning with reinforcement learning using verifiable rewards to reinforce faithful citation and noise filtering. Evaluations across eight benchmarks covering visual perception, logical reasoning, STEM problems, and sarcasm detection demonstrate consistent and significant improvements over standard ICL baselines for multiple open-source VLMs, highlighting the potential of equipping models with genuine inductive capabilities in multimodal settings.
Summary / 总结
The paper addresses the limitations of in-context learning (ICL) in vision-language models (VLMs), particularly the inductive gap where models produce correct answers through flawed reasoning. To tackle this, the authors propose a framework that restructures ICL into an inductive-deductive process, incorporating visual token compression, dynamic attention rebalancing, and a chain-of-thought paradigm. Experiments across various benchmarks show consistent improvements over standard ICL methods, indicating the potential for enhancing multimodal ICL capabilities.
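The paper targets the inductive gap in in-context learning (ICL) for vision-language models (VLMs), in which models reach correct answers through flawed reasoning. To address it, the authors propose a framework that recasts ICL as an inductive-deductive process, comprising visual token compression, dynamic attention rebalancing, and a chain-of-thought paradigm. Across multiple benchmarks the method improves consistently over standard ICL baselines, indicating the potential of equipping models with genuine inductive capabilities in multimodal settings.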
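The similarity-based visual token compression mentioned in this entry can be illustrated with a small sketch. It assumes a greedy rule (keep a token only if it is not too similar to any token already kept) and a hypothetical cosine threshold; the abstract does not specify the module's actual criterion, so `compress_tokens` and its parameters are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def compress_tokens(tokens, threshold=0.95):
    """Return indices of tokens kept after greedy redundancy filtering.

    A token is dropped when its cosine similarity to any already-kept
    token reaches `threshold` (hypothetical value, not from the paper).
    """
    kept = []
    for idx, tok in enumerate(tokens):
        if all(cosine(tok, tokens[k]) < threshold for k in kept):
            kept.append(idx)
    return kept

# Toy patch embeddings: the second is nearly a copy of the first.
patches = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
kept_indices = compress_tokens(patches)
```

The greedy pass is quadratic in the number of kept tokens, which is acceptable for a sketch; a real module would operate on batched tensors.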
Differentiable Kernel Ridge Regression for Deep Learning Pipelines
Authors: Jean-Marc Mercier, Gabriele Santin
First: 2026-05-04T08:13:08+00:00 · Latest: 2026-05-04T08:13:08+00:00
Abstract
Deep neural networks dominate modern machine learning, while alternative function approximators remain comparatively underexplored at scale. In this work, we revisit kernel methods as drop-in components for standard deep learning pipelines. We introduce Sparse Kernels (SKs), a differentiable, localized, and lazy variant of kernel ridge regression (KRR) that defers training to inference time and reduces to the solution of small local systems. We integrate SKs into PyTorch as modular layers that preserve end-to-end trainability, and we show that they expose three distinct sets of parameters (feature representations, target values, and evaluation points), each of which can be fixed or learned. This decomposition broadens the design space available to practitioners, enabling, in particular, training-free transfer, nonlinear probing, and hybrid kernel-neural models. Across convolutional networks, vision transformers, and reinforcement learning, SK-based modules serve two complementary roles: in some settings, they match the performance of trained neural readouts with substantially less training; in others, they augment existing models and improve their performance when used as additional components. Our results suggest that kernel methods, once made scalable and differentiable, can be readily integrated with deep learning rather than treated as a separate paradigm.
Summary / 总结
This work revisits kernel methods as a drop-in component for deep learning pipelines, introducing Sparse Kernels (SKs) as a differentiable, localized, and lazy variant of kernel ridge regression. SKs are integrated into PyTorch as modular layers that preserve end-to-end trainability and expose three sets of parameters for different roles. Experiments show that SK-based modules can match the performance of trained neural readouts with less training in some settings, and augment existing models to improve performance in others, suggesting that kernel methods can be readily integrated with deep learning.
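This work revisits kernel methods as components of deep learning pipelines, introducing Sparse Kernels (SKs) as a differentiable, localized, and lazy variant of kernel ridge regression. SKs are integrated into PyTorch as modular layers that preserve end-to-end trainability and expose three sets of parameters that serve different roles. Experiments show that SK modules match the performance of trained neural readouts in some settings and augment existing models to improve their performance in others, which suggests that kernel methods can be readily integrated with deep learning rather than treated as a separate paradigm.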
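To show what "lazy" and "localized" kernel ridge regression can mean in practice, here is a minimal pure-Python sketch: nothing is fitted ahead of time, and each query solves a small ridge system over its k nearest stored points. The Gaussian kernel, the neighbor count, and all function names are assumptions made for illustration; this is not the paper's PyTorch layer.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def local_krr_predict(query, xs, ys, k=3, ridge=1e-6, bandwidth=1.0):
    """Predict at `query` from its k nearest neighbors only.

    Solves (K + ridge * I) alpha = y on the local subset, then returns
    sum_i alpha_i * kernel(query, x_i). All "training" happens here,
    at inference time.
    """
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(xs[i], query))
    nearest = sorted(range(len(xs)), key=sq_dist)[:k]
    K = [[gaussian_kernel(xs[i], xs[j], bandwidth) + (ridge if i == j else 0.0)
          for j in nearest] for i in nearest]
    alpha = solve(K, [ys[i] for i in nearest])
    return sum(a * gaussian_kernel(query, xs[i], bandwidth)
               for a, i in zip(alpha, nearest))
```

Because each query touches only k points, the linear system stays k-by-k regardless of how many points are stored, which is the property that makes the local variant cheap.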
Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift
Authors: Lixian Chen, Yanhui Chen, Junyi Lin
First: 2026-04-27T15:29:35+00:00 · Latest: 2026-05-04T07:57:04+00:00
Abstract
Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.
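Title / Abstract (translated from Chinese)
Title: Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift
Vision-language models perform well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. In that case, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate the fusion. We study this failure mode from a majorization perspective and formulate adaptation as a constrained de-mixing problem on the fused prediction. Building on this view, we propose MG-MTTA, which freezes the backbone and updates only a lightweight gate or adapter. The objective combines entropy minimization of the fused posterior with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking, together with a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA raises top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive on the visual-only benchmark. These results indicate that multimodal test-time adaptation should control modality reliability, not just prediction entropy.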
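The failure mode this entry describes, a fused posterior that gets sharper while the prediction gets worse, can be reproduced with a toy example. The sketch assumes the fusion is a convex mixture controlled by a scalar gate; that fusion rule, the numbers, and the function names are illustrative assumptions, not MG-MTTA's actual objective.

```python
import math

def fuse(p_visual, p_text, gate):
    """Convex mixture of two class posteriors; `gate` weights the visual branch."""
    return [gate * v + (1.0 - gate) * t for v, t in zip(p_visual, p_text)]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def argmax(p):
    return max(range(len(p)), key=lambda i: p[i])

# Hypothetical two-class case where the true class is 1.
p_visual = [0.9, 0.1]  # shifted visual branch: confident but wrong
p_text = [0.3, 0.7]    # text branch: less confident but right

dominated = fuse(p_visual, p_text, gate=0.9)  # unreliable branch dominates
balanced = fuse(p_visual, p_text, gate=0.2)   # gate favors the reliable branch
```

Here `dominated` has lower entropy than `balanced` yet predicts the wrong class, which is why minimizing entropy alone cannot be trusted when one modality is unreliable.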
Determinism of Randomness: Prompt-Residual Seed Shaping for Diffusion Generation
Authors: Song Yan, Wei Zhai, Chenfeng Wang, Xinliang Bi, Jian Yang, Yancheng Cai, Yusen Zhang, Yunwei Lan, Tao Zhang, GuanYe Xiong, Min Li, Zheng-Jun Zha
First: 2025-11-11T02:12:38+00:00 · Latest: 2026-05-04T06:30:15+00:00
Abstract
Diffusion models start generation from an isotropic Gaussian latent, yet changing only the random seed can lead to large differences in prompt faithfulness, composition, and visual quality. We study this seed sensitivity through the semantic map from initial noise to generated meaning. Although the sampling flow is locally invertible, the subsequent semantic projection is many-to-one, inducing a degenerate pullback semi-metric on the latent space: most local directions are nearly semantic-invariant, while semantic-sensitive variation is concentrated in a much smaller horizontal subspace. This provides an explanatory geometric view of the seed lottery. Motivated by this view, we introduce a training-free prompt-residual seed-shaping procedure. Rather than claiming to recover the exact horizontal space, the method uses a single high-noise cold-start prompt residual as a model-coupled proxy, injects only its tangential component, and retracts the seed to the original Gaussian radius shell. This keeps the initialization prior-compatible while adding only one conditional/unconditional probe before standard sampling. Across multiple generation benchmarks, the method improves alignment and quality metrics over standard sampling, supporting both the practical value of the proxy and the explanatory relevance of semantic anisotropy.
Summary / 总结
The paper investigates the sensitivity of diffusion models to random seeds, revealing that while the sampling process is locally invertible, the subsequent semantic projection is many-to-one, leading to a degenerate pullback semi-metric. Motivated by this, the authors propose a prompt-residual seed-shaping method that uses a single high-noise prompt residual to inject a tangential component and retracts the seed to the original Gaussian radius shell, improving alignment and quality metrics in multiple generation benchmarks without requiring training.
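The paper studies the sensitivity of diffusion models to random seeds and finds that, although the sampling process is locally invertible, the subsequent semantic projection is many-to-one, which induces a degenerate pullback semi-metric. Motivated by this, the authors propose a method that injects the tangential component of a single high-noise prompt residual and retracts the seed to the original Gaussian radius shell; it improves alignment and quality metrics on multiple generation benchmarks without any training.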
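The two geometric steps named in this entry, injecting only the tangential component of a residual and retracting to the original radius shell, can be sketched on plain vectors. The sketch assumes the prompt residual is already available as a vector (for example, a conditional minus unconditional prediction) and uses a hypothetical `strength` parameter; how the paper computes the residual and scales the step is not given in the abstract.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def shape_seed(seed, residual, strength=0.5):
    """Nudge `seed` along the part of `residual` orthogonal to it, then
    rescale so the result has the same norm as the original seed."""
    radius = norm(seed)
    # Remove the radial part of the residual, leaving the tangential part.
    radial_coeff = dot(residual, seed) / (radius ** 2)
    tangential = [r - radial_coeff * s for r, s in zip(residual, seed)]
    moved = [s + strength * t for s, t in zip(seed, tangential)]
    # Retract to the shell of the original radius.
    scale = radius / norm(moved)
    return [scale * m for m in moved]

# Toy usage with a random 16-dimensional "latent" and a random residual.
rng = random.Random(0)
seed = [rng.gauss(0.0, 1.0) for _ in range(16)]
residual = [rng.gauss(0.0, 1.0) for _ in range(16)]
shaped = shape_seed(seed, residual)
```

Keeping the norm fixed is what keeps the shaped seed compatible with the Gaussian prior: in high dimensions, Gaussian samples concentrate near a shell of nearly constant radius.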
CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding
Authors: Yuanyuan Jia, Shunpu Tang, Qianqian Yang
First: 2026-05-04T04:43:15+00:00 · Latest: 2026-05-04T04:43:15+00:00
Comments: 6 pages, 2 tables, 1 figure. Submitted to IEEE Globecom 2026
Abstract
Vision-language models (VLMs) have demonstrated strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging due to their substantial computational and memory demands. A practical alternative is device-edge co-inference, where a lightweight draft VLM on the mobile device collaborates with a larger target VLM on the edge server via speculative decoding. Nevertheless, directly extending speculative decoding to VLMs suffers from severe inefficiency due to excessive visual-token computation and high communication overhead. To address these challenges, we propose CoVSpec, an efficient collaborative speculative decoding framework for VLM inference. Specifically, we first develop a training-free visual token reduction framework that prunes redundant visual tokens on the mobile device by jointly considering query relevance, token activity, and low-rank dependency. Moreover, we design an adaptive drafting strategy that dynamically adjusts both the verification frequency and the draft length. In addition, we introduce a parallel branching mechanism with decoupled verification-correction to improve draft-side utilization during target-side verification and reduce correction-related transmission overhead. Experiments on multiple benchmarks show that CoVSpec achieves up to 2.21x higher throughput than target-only inference and reduces communication overhead by more than 96% compared with baselines, without compromising task accuracy.
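Title / Abstract (translated from Chinese)
Title: CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding
Vision-language models (VLMs) show strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging because of their substantial compute and memory demands. A practical alternative is device-edge co-inference, in which a lightweight draft VLM on the mobile device collaborates with a larger target VLM on the edge server through speculative decoding. Directly extending speculative decoding to VLMs is nevertheless severely inefficient, owing to excessive visual-token computation and high communication overhead. To address these challenges, we propose CoVSpec, an efficient collaborative speculative decoding framework for VLM inference. Specifically, we first develop a training-free visual token reduction framework that prunes redundant visual tokens on the mobile device by jointly considering query relevance, token activity, and low-rank dependency. We also design an adaptive drafting strategy that dynamically adjusts both the verification frequency and the draft length. In addition, we introduce a parallel branching mechanism with decoupled verification and correction, which improves draft-side utilization during target-side verification and reduces correction-related transmission overhead. Experiments on multiple benchmarks show that CoVSpec achieves up to 2.21x higher throughput than target-only inference and cuts communication overhead by more than 96% relative to baselines, without compromising task accuracy.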
Summary / 总结
CoVSpec is an efficient collaborative speculative decoding framework for vision-language models (VLMs) that addresses the challenges of deploying large VLMs on mobile devices. It includes a visual token reduction framework and an adaptive drafting strategy to reduce computational and communication overhead. Experiments show that CoVSpec achieves up to 2.21x higher throughput than target-only inference and reduces communication overhead by more than 96% compared to baselines, without affecting task accuracy.
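CoVSpec is an efficient collaborative speculative decoding framework for vision-language models (VLMs), designed to address the challenges of deploying large VLMs on mobile devices. It comprises a visual token reduction framework and an adaptive drafting strategy that reduce computational and communication overhead. Experiments show that CoVSpec achieves up to 2.21x higher throughput than target-only inference and reduces communication overhead by more than 96% compared with baselines, without sacrificing task accuracy.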
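For readers unfamiliar with the draft-and-verify loop that CoVSpec builds on, the following is a minimal greedy sketch of generic speculative decoding. It is not CoVSpec itself: there is no visual-token pruning, adaptive draft length, or parallel branching, and the models are stand-in functions (`draft_next`, `target_next`) that map a token list to the next token. The per-token target calls inside the loop stand in for what would be a single batched target pass.

```python
def speculative_decode(draft_next, target_next, prompt, max_new_tokens, draft_len=4):
    """Greedy speculative decoding with toy next-token functions.

    Returns (tokens, verification_rounds). The output matches what
    greedy decoding with `target_next` alone would produce.
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes `draft_len` tokens.
        ctx = list(out)
        proposal = []
        for _ in range(draft_len):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target verifies the proposal (one round trip).
        rounds += 1
        ctx = list(out)
        accepted = []
        for tok in proposal:
            expected = target_next(ctx)
            if tok == expected:
                accepted.append(tok)
                ctx.append(tok)
            else:
                accepted.append(expected)  # target's correction replaces the miss
                break
        else:
            accepted.append(target_next(ctx))  # bonus token when all drafts pass
        out.extend(accepted)
    return out[:len(prompt) + max_new_tokens], rounds
```

Each round yields at least one target-approved token, so the loop always terminates; the speedup comes from rounds in which several drafted tokens are accepted at once.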
Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score
Authors: Abdullah Ahmad Khan, Hamid Laga, Ferdous Sohel
Venue: NeurIPS 2026
First: 2026-05-04T04:13:00+00:00 · Latest: 2026-05-04T04:13:00+00:00
Comments: 9 pages, 6 figures, NeurIPS 2026
Abstract
Machine unlearning in Vision-Language Models (VLMs) is required for compliance with the General Data Protection Regulation (GDPR), yet current evaluation practices are inconsistent. We present the first systematic study of metric reliability in multimodal unlearning. Five standard metrics, Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS), yield conflicting method rankings across three VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench). Kendall tau analysis over 36 unlearned LLaVA-1.5-7B models reveals two opposing clusters, {FA, RA, MIA} and {AD, JS}, with tau_FA_AD = -0.26, reproduced on BLIP-2 OPT-2.7B. Agreement is lower in multimodal VQA (average tau = 0.086) than in unimodal classification (average tau = 0.158; difference = 0.072), indicating that dual image-and-text pathways amplify inconsistency. We introduce the Unified Quality Score (UQS), a composite metric with weights derived from each metric's Spearman correlation with the oracle distance d(M_hat, M_star), where M_star is the oracle model retrained only on the retain set. RA shows the strongest reliability (rho = 0.484, p = 0.003), while FA is negatively correlated (rho = -0.418, p = 0.011). UQS yields stable rankings under 100 random weight perturbations (tau = 0.647 +- 0.262). We release the benchmark, 36 checkpoints, and an interactive leaderboard. Code and pre-computed results are available at https://github.com/neurips26/UnifiedUnl.
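Title / Abstract (translated from Chinese)
Title: Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score
Machine unlearning in Vision-Language Models (VLMs) is necessary for compliance with the General Data Protection Regulation (GDPR), yet current evaluation practices are inconsistent. We present the first systematic study of metric reliability in multimodal unlearning. Five standard metrics, Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS), produce conflicting method rankings across three VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench). Kendall tau analysis over 36 unlearned LLaVA-1.5-7B models reveals two opposing clusters, {FA, RA, MIA} and {AD, JS}, with tau_FA_AD = -0.26, a result reproduced on BLIP-2 OPT-2.7B. Agreement is lower in multimodal VQA (average tau = 0.086) than in unimodal classification (average tau = 0.158; difference = 0.072), indicating that dual image-and-text pathways amplify inconsistency. We introduce the Unified Quality Score (UQS), a composite metric whose weights come from each metric's Spearman correlation with the oracle distance d(M_hat, M_star), where M_star is the oracle model retrained only on the retain set. RA shows the strongest reliability (rho = 0.484, p = 0.003), while FA is negatively correlated (rho = -0.418, p = 0.011). UQS keeps rankings stable under 100 random weight perturbations (tau = 0.647 +- 0.262). We release the benchmark, 36 checkpoints, and an interactive leaderboard. Code and pre-computed results are available at https://github.com/neurips26/UnifiedUnl/.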
Summary / 总结
This study addresses the inconsistency in evaluating machine unlearning in Vision-Language Models (VLMs) by analyzing five metrics: Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS). The research finds conflicting rankings among these metrics across three VQA benchmarks. A new Unified Quality Score (UQS) is introduced, which shows stable rankings and higher reliability compared to individual metrics. RA is found to be the most reliable metric, while FA is negatively correlated. The study also highlights that multimodal VQA benchmarks show lower agreement compared to unimodal classification benchmarks, suggesting that dual image-and-text pathways increase inconsistency in unlearning evaluations.
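The study addresses inconsistent evaluation of machine unlearning in vision-language models (VLMs) by analyzing five metrics: Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS). It finds that these metrics rank methods in conflicting ways across three VQA benchmarks. It introduces a new Unified Quality Score (UQS), which yields more stable rankings and higher reliability. RA is found to be the most reliable metric, while FA is negatively correlated. The study also notes that agreement is lower on multimodal VQA benchmarks than on unimodal classification benchmarks, suggesting that the dual image-and-text pathways increase inconsistency in unlearning evaluation.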
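The rank-agreement analysis in this entry rests on Kendall's tau. A minimal sketch of the tau-a variant (no tie correction) is shown below; whether the paper uses tau-a or the tie-corrected tau-b is not stated in the abstract, and the example scores are made up for illustration.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall tau-a between two equal-length score lists.

    Counts pairs ordered the same way by both lists (concordant) minus
    pairs ordered oppositely (discordant), divided by the number of pairs.
    Tied pairs count as neither.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / total_pairs

# Hypothetical scores that two metrics assign to the same four methods.
metric_x = [1, 2, 3, 4]
metric_y = [2, 1, 4, 3]
agreement = kendall_tau(metric_x, metric_y)
```

A value near +1 means two metrics rank the methods almost identically, a value near -1 means they rank them in opposite order, and values near 0 (like the averages quoted in the abstract) mean the rankings are close to unrelated.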
CBV: Clean-label Backdoor Attacks on Vision Language Models via Diffusion Models
Authors: Ji Guo, Xiaolong Qin, Cencen Liu, Jielei Wang, Jierun Chen, Wenbo Jiang
First: 2026-05-04T04:02:23+00:00 · Latest: 2026-05-04T04:02:23+00:00
Abstract
Vision-Language Models (VLMs) have achieved remarkable success in tasks such as image captioning and visual question answering (VQA). However, as their applications become increasingly widespread, recent studies have revealed that VLMs are vulnerable to backdoor attacks. Existing backdoor attacks on VLMs primarily rely on data poisoning by adding visual triggers and modifying text labels, where the induced image-text mismatch makes poisoned samples easy to detect. To address this limitation, we propose the Clean-Label Backdoor Attack on VLMs via Diffusion Models (CBV), which leverages diffusion models to generate natural poisoned examples via score matching. Specifically, CBV modifies the score during the reverse generation process of the diffusion model to guide the generation of poisoned samples that contain triggered image features. To further enhance the effectiveness of the attack, we incorporate the textual information of the triggered images as multimodal guidance during generation. Moreover, to enhance stealthiness, we introduce a GradCAM-guided Mask (GM) that restricts modifications to only the most semantically important regions, rather than the entire image. We evaluate our method on MSCOCO and VQA v2 with four representative VLMs, achieving over 80% ASR while preserving normal functionality.
Summary / 总结
The paper proposes CBV, a clean-label backdoor attack on Vision-Language Models (VLMs) using diffusion models. It generates natural poisoned examples by modifying the score during the reverse generation process of the diffusion model, incorporating multimodal guidance and a GradCAM-guided Mask to enhance stealthiness. Experiments on MSCOCO and VQA v2 with four VLMs show an attack success rate of over 80% while maintaining normal functionality.
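The paper proposes CBV, a clean-label backdoor attack on vision-language models that uses diffusion models. It generates natural-looking poisoned samples by modifying the score during the reverse generation process, and combines multimodal guidance with a GradCAM-guided mask to improve stealthiness. Experiments on MSCOCO and VQA v2 with four representative vision-language models show an attack success rate above 80% while normal functionality is preserved.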
Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
Authors: Jingqi Xu
First: 2026-03-31T04:48:52+00:00 · Latest: 2026-05-04T03:13:15+00:00
Abstract
Vision-Language Models (VLMs) have demonstrated strong capabilities across a wide range of multimodal tasks. However, recent studies have shown that VLMs, such as CLIP, perform poorly in understanding negation expressions, which are common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that improves CLIP's understanding of two types of negation, namely presence-based negation and absence-based negation, which correspond to negated expressions of objects that are actually present in an image and those that may plausibly exist in an image but are in fact absent, respectively, by modifying CLIP's original InfoNCE contrastive loss. Specifically, we design a presence-based contrastive objective that pulls image embeddings closer to their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings, and an absence-based contrastive objective that aligns image embeddings with both original and absence-based negated caption embeddings while maintaining a semantic distinction between the two text embeddings. Based on our observation that the front transformer layers of CLIP text encoder have stronger learning ability for negated text than the later layers, we fine-tune the front transformer layers of the CLIP text encoder at each training step using the combined contrastive objective. Experimental results show that, compared with pretrained CLIP, Omni-NegCLIP improves performance on presence-based negation and absence-based negation tasks by up to 52.65% and 12.50%, respectively, without sacrificing general capability in image-text retrieval and even improving it by up to 19.62%. Compared with prior works, Omni-NegCLIP demonstrates a more comprehensive ability to understand multiple types of negation tasks.
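Title / Abstract (translated from Chinese)
Title: Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
Vision-Language Models (VLMs) have shown strong capabilities across a wide range of multimodal tasks. However, recent studies show that VLMs such as CLIP perform poorly at understanding negation, which is common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that modifies CLIP's original InfoNCE contrastive loss to improve its understanding of two types of negation: presence-based negation and absence-based negation. Specifically, we design a presence-based contrastive objective that pulls image embeddings toward their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings, and an absence-based contrastive objective that aligns image embeddings with both the original and the absence-based negated caption embeddings while keeping the two text embeddings semantically distinct. Based on our observation that the front layers of the CLIP text encoder learn negated text better than the later layers, we fine-tune those front layers at each training step with the combined contrastive objective. Experimental results show that, compared with pretrained CLIP, Omni-NegCLIP improves performance on presence-based and absence-based negation tasks by up to 52.65% and 12.50%, respectively, without sacrificing general image-text retrieval capability, which even improves by up to 19.62%. Compared with prior work, Omni-NegCLIP demonstrates a more comprehensive ability to handle multiple types of negation tasks.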
Summary / 总结
Omni-NegCLIP is a fine-tuned CLIP model that enhances the understanding of negation expressions in images and text by modifying the original InfoNCE contrastive loss. It introduces presence-based and absence-based contrastive objectives to improve the model's performance on negation tasks. The model fine-tunes the front transformer layers of the CLIP text encoder, which have a stronger learning ability for negated text. Experimental results show that Omni-NegCLIP significantly improves performance on presence-based and absence-based negation tasks by up to 52.65% and 12.50%, respectively, while maintaining or even enhancing general image-text retrieval capabilities.
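Omni-NegCLIP is a fine-tuned CLIP model that improves the understanding of negation in images and text by modifying the original InfoNCE contrastive loss. It introduces presence-based and absence-based contrastive objectives to improve performance on negation tasks. The model fine-tunes the front transformer layers of the CLIP text encoder, because these layers learn negated text more effectively. Experimental results show that Omni-NegCLIP improves performance on presence-based and absence-based negation tasks by up to 52.65% and 12.50%, respectively, while maintaining or even improving general image-text retrieval capability.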
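The presence-based objective described in this entry (pull the image toward its caption, push it away from the negated caption) can be sketched as an InfoNCE-style loss over a single triplet. This is a simplified assumption for illustration: the paper's loss operates on batches and its exact form is not given here, and the temperature value and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def presence_contrastive_loss(image, caption, negated_caption, temperature=0.07):
    """Single-triplet InfoNCE-style loss.

    The original caption is the positive; the presence-based negated
    caption acts as a hard negative. Lower is better.
    """
    pos = math.exp(cosine(image, caption) / temperature)
    neg = math.exp(cosine(image, negated_caption) / temperature)
    return -math.log(pos / (pos + neg))

# Toy embeddings: the image sits close to the caption, far from its negation.
image = [1.0, 0.0]
loss_aligned = presence_contrastive_loss(image, [0.9, 0.1], [0.1, 0.9])
loss_confused = presence_contrastive_loss(image, [0.1, 0.9], [0.9, 0.1])
```

When the image embedding cannot distinguish the caption from its negation, the loss equals log 2, which is the value a model with no negation understanding would sit at for this triplet.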
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
Authors: Zhengan Yan, Shikang Zheng, Haoran Qin, Xiaobing Tu, Yinggui Wang, Jiacheng Liu, Jiaxuan Ren, Yuqi Lin, Peiliang Cai, Jinkui Ren, Xiantao Zhang, Linfeng Zhang
First: 2026-05-04T02:30:24+00:00 · Latest: 2026-05-04T02:30:24+00:00
Comments: Main paper with supplementary material; figures and tables included
Abstract
Diffusion-based image editing offers strong semantic controllability, but remains computationally expensive due to iterative high-resolution denoising over all spatial tokens. Dynamic-resolution sampling reduces this cost by performing early steps at reduced resolution. However, existing approaches prioritize upsampling using low-level heuristics such as edge detection or channel variance, which are weakly aligned with editing semantics and may lead to structural inconsistency. Moreover, spatial regions are often upsampled without verifying whether semantic modification is actually required, resulting in redundant high-resolution computation and accumulated errors. Therefore, we propose SpecEdit, a training-free dynamic-resolution framework tailored for diffusion-based image editing. SpecEdit follows a draft-and-verify scheme: a low-resolution draft first estimates the semantic outcome, after which token-level discrepancies are used to identify edit-relevant tokens for high-resolution denoising, while the remaining tokens stay at a coarse resolution. Experiments on Qwen-Image-Edit and FLUX.1-Kontext-dev demonstrate up to 10x and 7x acceleration, while maintaining strong quality. SpecEdit is complementary to step distillation and other acceleration techniques, achieving up to 13x speedup when combined with existing methods. Our code is in supplementary material and will be released on GitHub.
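Title / Abstract (translated from Chinese)
Title: SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
Diffusion-based image editing offers strong semantic controllability, but it remains computationally expensive because it requires iterative high-resolution denoising over all spatial tokens. Dynamic-resolution sampling lowers this cost by running the early steps at reduced resolution. However, existing approaches prioritize upsampling using low-level heuristics such as edge detection or channel variance, which are only weakly aligned with editing semantics and may cause structural inconsistency. Moreover, spatial regions are often upsampled without verifying whether semantic modification is actually needed, leading to redundant high-resolution computation and accumulated errors. We therefore propose SpecEdit, a training-free dynamic-resolution framework tailored to diffusion-based image editing. SpecEdit follows a draft-and-verify scheme: a low-resolution draft first estimates the semantic outcome, then token-level discrepancies are used to identify the edit-relevant tokens that need high-resolution denoising, while the remaining tokens stay at a coarse resolution. Experiments on Qwen-Image-Edit and FLUX.1-Kontext-dev demonstrate up to 10x and 7x acceleration, respectively, while maintaining strong quality. SpecEdit is complementary to step distillation and other acceleration techniques, reaching up to 13x speedup when combined with existing methods. Our code is in the supplementary material and will be released on GitHub.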
Statistical Consistency and Generalization of Contrastive Representation Learning
Authors: Yuanfan Li, Xiyuan Wei, Tianbao Yang, Yiming Ying
Venue: ICML 2026
First: 2026-05-04T00:38:29+00:00 · Latest: 2026-05-04T00:38:29+00:00
Comments: Accepted by ICML 2026
Abstract
Contrastive representation learning (CRL) underpins many modern foundation models. Despite recent theoretical progress, existing analyses suffer from several key limitations: (i) the statistical consistency of CRL remains poorly understood; (ii) available generalization bounds deteriorate as the number of negative samples increases, contradicting the empirical benefits of large negative sets; and (iii) the retrieval performance of CRL has received limited theoretical attention.
In this paper, we develop a unified statistical learning theory for CRL. For downstream tasks, we evaluate retrieval quality using an AUC-type population criterion and show that the contrastive loss is statistically consistent with optimal ranking. We further establish a calibration-style inequality that quantitatively relates excess contrastive risk to excess retrieval suboptimality. For upstream training, we study both supervised and self-supervised contrastive objectives and derive generalization bounds of order $O(1/m + 1/\sqrt{n})$ and $O(1/\sqrt{m} + 1/\sqrt{n})$, respectively, where $m$ denotes the number of negative samples and $n$ the number of anchor points. These bounds not only explain the empirical advantages of large negative sets but also reveal an explicit trade-off between $m$ and $n$. Extensive experiments on large-scale vision-language models corroborate our theoretical predictions.
Summary / 总结
This paper addresses the theoretical limitations of contrastive representation learning (CRL) by developing a unified statistical learning theory. It shows that the contrastive loss is statistically consistent with optimal ranking for downstream tasks and establishes a calibration-style inequality relating contrastive risk to retrieval suboptimality. For training, it derives generalization bounds for both supervised and self-supervised CRL, explaining the empirical benefits of large negative sets and revealing a trade-off between the number of negative samples and anchor points. Experiments on large-scale vision-language models support the theoretical findings.
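This paper addresses the theoretical limitations of contrastive representation learning (CRL) by building a unified statistical learning theory. It shows that the contrastive loss is statistically consistent with optimal ranking in downstream tasks, and establishes a calibration-style inequality that relates contrastive risk to retrieval suboptimality. For training, it derives generalization bounds for both supervised and self-supervised CRL, explaining the empirical benefit of large negative sets and revealing an explicit trade-off between the number of negative samples and the number of anchor points. Experiments on large-scale vision-language models support the theoretical predictions.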
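To make the stated trade-off between $m$ and $n$ concrete, the two rates can be balanced term by term. The following is a back-of-the-envelope reading of the orders quoted in the abstract, ignoring constants and any logarithmic factors (which the abstract does not state); it is not a result taken from the paper.

```latex
\begin{align*}
\text{supervised:}\quad
  & O\!\left(\frac{1}{m} + \frac{1}{\sqrt{n}}\right),
  && \frac{1}{m} \le \frac{1}{\sqrt{n}} \iff m \ge \sqrt{n}, \\
\text{self-supervised:}\quad
  & O\!\left(\frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}}\right),
  && \frac{1}{\sqrt{m}} \le \frac{1}{\sqrt{n}} \iff m \ge n.
\end{align*}
```

Under this reading, with $n = 10^6$ anchor points the negative-sample term stops dominating at roughly $m \approx 10^3$ negatives in the supervised case, but only at $m \approx 10^6$ in the self-supervised case. In both cases, increasing $m$ never makes the bound worse, in contrast to the earlier bounds that the abstract says deteriorate as the number of negatives grows.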
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
Authors: Rishab Balasubramanian, Pin-Jie Lin, Rituraj Sharma, Anjie Fang, Fardin Abdi, Viktor Rozgic, Zheng Du, Mohit Bansal, Tu Vu
First: 2026-04-07T19:02:10+00:00 · Latest: 2026-05-03T22:36:43+00:00
Abstract
We investigate whether post-trained capabilities can be transferred across models without retraining, with a focus on transfer across different model scales. We propose the Master Key Hypothesis, which states that model capabilities correspond to directions in a low-dimensional latent subspace that induce specific behaviors and are transferable across models through linear alignment. Based on this hypothesis, we introduce UNLOCK, a training-free and label-free framework that extracts a capability direction by contrasting activations between capability-present and capability-absent Source variants, aligns it with a Target model through a low-rank linear transformation, and applies it at inference time to elicit the behavior. Experiments on reasoning behaviors, including Chain-of-Thought (CoT) and mathematical reasoning, demonstrate substantial improvements across model scales without training. For example, transferring CoT reasoning from Qwen1.5-14B to Qwen1.5-7B yields an accuracy gain of 12.1% on MATH, and transferring a mathematical reasoning direction from Qwen3-4B-Base to Qwen3-14B-Base improves AGIEval Math accuracy from 61.1% to 71.3%, surpassing the 67.8% achieved by the 14B post-trained model. Our analysis shows that the success of transfer depends on the capabilities learned during pre-training, and that our intervention amplifies latent capabilities by sharpening the output distribution toward successful reasoning trajectories.
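Title / Abstract (translated from Chinese)
Title: The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
We investigate whether post-trained capabilities can be transferred across models without retraining, focusing on transfer between different model scales. We propose the Master Key Hypothesis, which holds that model capabilities correspond to directions in a low-dimensional latent subspace that induce specific behaviors and can be transferred between models through linear alignment. Based on this hypothesis, we introduce UNLOCK, a framework that extracts a capability direction by contrasting the activations of capability-present and capability-absent Source variants, aligns it with a Target model through a low-rank linear transformation, and applies it at inference time to elicit the behavior. Experiments on reasoning behaviors, including Chain-of-Thought (CoT) and mathematical reasoning, show substantial improvements across model scales without training. For example, transferring CoT reasoning from Qwen1.5-14B to Qwen1.5-7B raises accuracy on MATH by 12.1%, and transferring a mathematical reasoning direction from Qwen3-4B-Base to Qwen3-14B-Base improves AGIEval Math accuracy from 61.1% to 71.3%, surpassing the 67.8% achieved by the 14B post-trained model. Our analysis shows that the success of transfer depends on the capabilities learned during pre-training, and that our intervention amplifies latent capabilities by sharpening the output distribution toward successful reasoning trajectories.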
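The three steps this entry describes (extract a direction by contrasting activations, align it to the Target model with a linear map, apply it at inference) can be sketched on plain lists. The sketch assumes a difference-of-means estimator for the direction and takes the alignment matrix as given; both are illustrative assumptions, since the abstract does not say how UNLOCK estimates the direction or fits its low-rank transformation. All function names are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    count = len(rows)
    return [sum(col) / count for col in zip(*rows)]

def capability_direction(with_capability, without_capability):
    """Difference of mean activations between the two Source variants."""
    mu_on = mean_vector(with_capability)
    mu_off = mean_vector(without_capability)
    return [a - b for a, b in zip(mu_on, mu_off)]

def align(direction, transform):
    """Map a Source-space direction into Target space.

    `transform` is a (target_dim x source_dim) matrix, assumed to have
    been fitted elsewhere.
    """
    return [sum(w * d for w, d in zip(row, direction)) for row in transform]

def steer(hidden, direction, alpha=1.0):
    """Add the aligned direction to a Target hidden state at inference."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy usage: 2-dimensional Source activations, 3-dimensional Target state.
source_direction = capability_direction([[1.0, 2.0], [3.0, 4.0]],
                                        [[0.0, 0.0], [2.0, 2.0]])
target_direction = align(source_direction, [[1, 0], [0, 1], [1, 1]])
steered = steer([0.0, 0.0, 0.0], target_direction, alpha=0.5)
```

The scale `alpha` controls how strongly the behavior is pushed; in practice it would need tuning, since too large a step moves hidden states off the distribution the Target model was trained on.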
InfiniteDiffusion: Bridging Learned Fidelity and Procedural Utility for Open-World Terrain Generation
Authors: Alexander Goslin
First: 2025-12-09T07:10:35+00:00 · Latest: 2026-05-03T20:00:14+00:00
Comments: Project website: https://xandergos.github.io/terrain-diffusion/ Code: https://github.com/xandergos/terrain-diffusion/
Abstract
For decades, procedural worlds have been built on procedural noise functions such as Perlin noise, which are fast and infinite, yet fundamentally limited in realism and large-scale coherence. Conversely, diffusion models offer unprecedented fidelity but remain generally confined to bounded canvases. We introduce InfiniteDiffusion, a training-free algorithm that reformulates diffusion sampling for lazy and unbounded generation, bridging the fidelity of diffusion models with the properties that made procedural noise indispensable: seamless infinite extent, seed-consistency, and constant-time random access. To demonstrate the utility of this approach, we present Terrain Diffusion, a framework for learned procedural terrain generation with a procedural noise-like interface. Our framework outpaces orbital velocity by 9 times on a consumer GPU, enabling realistic terrain generation at interactive rates. We integrate a hierarchical stack of diffusion models to couple planetary context with local detail, a compact Laplacian encoding to stabilize outputs across Earth-scale dynamic ranges, and an open-source infinite-tensor framework for constant-memory manipulation of unbounded tensors. Together, these components position diffusion models as a practical foundation for the next generation of infinite virtual worlds.
Summary / 总结
InfiniteDiffusion is a training-free algorithm that reformulates diffusion sampling for lazy, unbounded generation, combining the fidelity of diffusion models with the properties of procedural noise such as Perlin noise: seamless infinite extent, seed-consistency, and constant-time random access. The accompanying Terrain Diffusion framework generates realistic terrain at interactive rates on a consumer GPU. It integrates a hierarchical stack of diffusion models to couple planetary context with local detail, a compact Laplacian encoding to stabilize outputs across Earth-scale dynamic ranges, and an open-source infinite-tensor framework for constant-memory manipulation of unbounded tensors, making it a practical foundation for infinite virtual worlds.
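The procedural-noise properties named in the abstract (infinite extent, seed-consistency, constant-time random access) can be illustrated with a small sketch. It does not reproduce InfiniteDiffusion's sampling algorithm, which the abstract does not detail; it only shows one common way to obtain such an interface for the underlying noise, by deriving each tile's values from a hash of the seed and the tile coordinates. The names `TILE`, `tile_noise`, and `sample` are hypothetical.

```python
import hashlib
import random
from typing import List

TILE = 4  # hypothetical tile edge length, in samples


def tile_noise(seed: int, tx: int, ty: int) -> List[List[float]]:
    """Gaussian noise for tile (tx, ty), determined only by the seed and the tile index."""
    digest = hashlib.sha256(f"{seed}:{tx}:{ty}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [[rng.gauss(0.0, 1.0) for _ in range(TILE)] for _ in range(TILE)]


def sample(seed: int, x: int, y: int) -> float:
    """Look up the noise value at any integer coordinate, including negative ones.

    Only the tile containing (x, y) is generated, so the cost of a query
    does not depend on how far the coordinate is from the origin.
    """
    tx, ty = x // TILE, y // TILE  # floor division keeps negative coordinates consistent
    return tile_noise(seed, tx, ty)[y % TILE][x % TILE]
```

Because nothing is stored between calls, querying the same seed and coordinate twice returns the same value, and no region has to be generated before its neighbours.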
A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation
Authors: Jingheng Pan, Xintong Wang, Longyue Wang, Liang Ding, Weihua Luo, Chris Biemann
First: 2026-05-03T19:55:06+00:00 · Latest: 2026-05-03T19:55:06+00:00
Abstract
Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks that provide supportive evidence for the role of vision, we observe substantial issues in data quality and a mismatch with translation scenarios. Moreover, existing ambiguity-oriented evaluations are not well suited to broader ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated ambiguous source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art Large Vision Language Models under vanilla inference, supervised fine-tuning (SFT), and our chain-of-thought SFT (CoT-SFT) show that while SFT improves overall translation quality, CoT-SFT yields more consistent gains in disambiguation accuracy, especially on out-of-distribution subsets, indicating a stronger generalization for resolving diverse ambiguity types.
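A span-level disambiguation metric of the kind described above reduces to counting judge verdicts once the LLM-as-a-judge has labeled each annotated span. The sketch below assumes each verdict is a pair of a subset name and a boolean "resolved correctly" flag; the function name, the subset labels, and the per-subset breakdown are illustrative assumptions, not the paper's definition of its Disambiguation-Centric Metrics.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def disambiguation_accuracy(verdicts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of annotated ambiguous spans judged as correctly resolved, per subset."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for subset, resolved in verdicts:
        totals[subset] += 1
        correct[subset] += int(resolved)
    return {subset: correct[subset] / totals[subset] for subset in totals}
```

Reporting the score per subset makes it possible to compare in-distribution and out-of-distribution behavior separately, which is the comparison the abstract highlights.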
Conventional Commit Classification using Large Language Models and Prompt Engineering
Authors: H. M. Sazzad Quadir, Sakib Al Hasan, Md. Nurul Ahad Tawhid
First: 2026-05-03T19:52:39+00:00 · Latest: 2026-05-03T19:52:39+00:00
Abstract
Conventional commits provide a structured format for writing commit messages, which improves readability, eases software maintenance, and enables automation tools such as changelog generators and semantic versioning systems. Existing approaches to conventional commit classification typically rely on ML/DL models trained on large labeled datasets. In this paper, we investigate a training-free alternative that leverages large language models (LLMs) through prompt engineering. Rather than building a task-specific classifier, we evaluate three prompting strategies (zero-shot, few-shot, and chain-of-thought) across three open-source LLMs of varying scale: Mistral-7B-Instruct, LLaMA-3-8B, and DeepSeek-R1-32B. Classification is performed directly on code diffs extracted from a balanced dataset of 3,200 commits mined from the InfluxDB repository, without any model fine-tuning. Our results show that few-shot prompting consistently achieves the highest accuracy, while chain-of-thought prompting does not yield additional gains for this classification task. Among the evaluated models, DeepSeek-R1-32B achieves the strongest overall performance, suggesting that model scale plays a meaningful role in conventional commit classification. These findings provide practical guidance for researchers and practitioners seeking to automate commit classification without the overhead of curating and maintaining labeled training data.
Summary / 总结
This paper explores large language models (LLMs) and prompt engineering as a training-free approach to conventional commit classification, avoiding the need for labeled training data. Three prompting strategies (zero-shot, few-shot, and chain-of-thought) are evaluated across three open-source LLMs of different scales, classifying code diffs from a balanced dataset of 3,200 commits mined from the InfluxDB repository. Few-shot prompting consistently yields the highest accuracy, while chain-of-thought prompting brings no additional gain. Among the models, DeepSeek-R1-32B performs best, suggesting that model scale matters for this task.
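To make the difference between zero-shot and few-shot prompting concrete, here is a minimal sketch of how such a prompt could be assembled from labeled example diffs. The prompt wording, the label set in `LABELS`, and the helper names are assumptions for illustration; the paper's actual prompts and label set are not given here.

```python
from typing import List, Tuple

# Assumed label set: commit types commonly used with Conventional Commits.
LABELS = ["feat", "fix", "docs", "refactor", "test", "chore"]


def build_few_shot_prompt(examples: List[Tuple[str, str]], diff: str) -> str:
    """Build a classification prompt; an empty examples list gives a zero-shot prompt."""
    lines = [
        "Classify the code diff into one conventional commit type.",
        "Allowed types: " + ", ".join(LABELS),
        "",
    ]
    for example_diff, label in examples:
        lines += ["Diff:", example_diff, f"Type: {label}", ""]
    lines += ["Diff:", diff, "Type:"]
    return "\n".join(lines)


def parse_label(completion: str) -> str:
    """Take the first word of the model's reply and map it onto the label set."""
    words = completion.strip().lower().split()
    first = words[0].strip(".,:;") if words else ""
    return first if first in LABELS else "unknown"
```

The prompt ends with an open `Type:` line so that the model's reply can be reduced to a single label, and replies outside the label set are reported as `unknown` rather than guessed.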
VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
Authors: Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Sean Du, Sharon Li
Venue: ACL 2026
First: 2026-02-24T16:11:14+00:00 · Latest: 2026-05-03T17:44:38+00:00
Comments: ACL 2026 (Findings)
Abstract
Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model's ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model's output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input, and an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multiple datasets.
Summary / 总结
The research addresses hallucination in Large Vision-Language Models (LVLMs) by proposing VAUQ, a vision-aware uncertainty quantification framework for self-evaluation. VAUQ measures how strongly a model's output depends on visual evidence through the Image-Information Score (IS), the reduction in predictive uncertainty attributable to visual input, together with an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with the core-masked IS yields a training-free scoring function that reflects answer correctness. Experiments show that VAUQ outperforms existing self-evaluation methods across multiple datasets.
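The idea of scoring an answer by how much the image reduces predictive uncertainty can be sketched as follows. The exact VAUQ formulas are not given in the abstract, so everything here is an assumption: the IS is taken as the entropy of the answer distribution when the core region is masked minus the entropy with the full image, and the final score subtracts the full-image entropy and adds the IS scaled by a hypothetical weight `alpha`.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def image_information_score(p_full: Sequence[float], p_masked: Sequence[float]) -> float:
    """Assumed IS: how much uncertainty rises when the salient image region is masked."""
    return entropy(p_masked) - entropy(p_full)


def self_eval_score(
    p_full: Sequence[float], p_masked: Sequence[float], alpha: float = 1.0
) -> float:
    """Assumed combination: low entropy and high visual dependence both raise the score."""
    return -entropy(p_full) + alpha * image_information_score(p_full, p_masked)
```

Under these assumptions, an answer that is confident with the full image but uncertain once the core region is masked receives a high score, while an answer that is equally uncertain in both cases receives a low one.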