arXiv Paper Digest

2026-03-10 03:50
Snapshot: 20260310_0350
Multimodal Large Language Models as Image Classifiers
Authors: Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas
First: 2026-03-06T18:59:58+00:00 · Latest: 2026-03-06T18:59:58+00:00
Abstract
Multimodal Large Language Model (MLLM) classification performance depends critically on evaluation protocol and ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices (batch size, image ordering, and text encoder selection), showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocol rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation.
中文标题/摘要
标题:多模态大型语言模型作为图像分类器
多模态大型语言模型(MLLM)的分类性能在很大程度上取决于评估协议和真实标签的质量。比较MLLM、监督模型和视觉-语言模型的研究报告结论不一,我们表明这些分歧源于要么夸大要么低估性能的评估协议。在最常见的评估协议中,我们识别并解决了关键问题:模型输出超出提供的类别列表并被丢弃、由于弱的选择题干扰项导致的夸大结果以及在开放世界设置中由于输出映射不佳而表现不佳。我们还量化了通常被忽视的设计选择——批量大小、图像排序和文本编码器选择的影响,表明它们显著影响准确性。在我们的多标签重注释的625个ImageNet-1k类别上进行评估显示,MLLM最受益于修正的标签(最多+10.8%),显著缩小了与监督模型之间的感知差距。因此,报告的MLLM在分类上的表现不佳很大程度上是由于嘈杂的真实标签和有缺陷的评估协议造成的,而不是真正的模型缺陷。对监督训练信号依赖较少的模型对注释质量最为敏感。最后,我们展示了MLLM可以辅助人类注释员:在受控案例研究中,注释员在大约50%的困难案例中确认或整合了MLLM的预测,证明了它们在大规模数据集整理中的潜力。
Summary / 总结
The study investigates Multimodal Large Language Models (MLLMs) as image classifiers and traces the conflicting conclusions of earlier comparisons to flawed evaluation protocols and noisy ground truth. After fixing discarded out-of-list outputs, weak multiple-choice distractors, and poor open-world output mapping, and evaluating on ReGT (a multilabel reannotation of 625 ImageNet-1k classes), MLLMs gain up to +10.8%, substantially narrowing the perceived gap with supervised models. The study also demonstrates that MLLMs can assist human annotators in dataset curation, with annotators confirming or integrating MLLM predictions in about 50% of difficult cases.
研究探讨了多模态大型语言模型(MLLM)作为图像分类器的性能,强调了评估协议和真实标签质量的重要性。通过解决模型输出超出类别列表、多项选择干扰项过弱以及开放世界设置下输出映射不佳等问题,研究显示使用修正后的标签时,MLLM的表现更好,缩小了与监督模型之间的差距。研究还发现,批量大小、图像顺序和文本编码器选择等设计选择显著影响准确性,并且MLLM可以辅助人类标注员进行大规模数据集整理:在约50%的困难案例中,标注员确认或采纳了MLLM的预测。
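One of the protocol issues named in the abstract, discarding model outputs that fall outside the provided class list, can be illustrated with a small sketch. The function name, the toy class list, and the toy predictions below are hypothetical and are not taken from the paper; the sketch only shows how dropping invalid outputs from the denominator can inflate accuracy compared with counting them as errors, using sets of acceptable labels to mimic a multilabel ground truth.

```python
def accuracy(predictions, label_sets, class_list, discard_invalid):
    """Top-1 accuracy against multilabel ground truth (toy illustration).

    predictions:     one predicted class name per image
    label_sets:      one set of acceptable labels per image
    discard_invalid: True  -> out-of-list predictions are dropped entirely
                     False -> out-of-list predictions count as errors
    """
    valid = set(class_list)
    correct = 0
    counted = 0
    for pred, labels in zip(predictions, label_sets):
        if pred not in valid and discard_invalid:
            continue  # silently removed from the evaluation
        counted += 1
        if pred in valid and pred in labels:
            correct += 1
    return correct / counted if counted else 0.0


# Hypothetical toy data: "kitten" is not in the class list.
classes = ["tabby", "tiger cat", "laptop", "notebook"]
preds = ["tabby", "kitten", "laptop", "notebook"]
truth = [{"tabby", "tiger cat"}, {"tabby"}, {"laptop", "notebook"}, {"laptop"}]

inflated = accuracy(preds, truth, classes, discard_invalid=True)
strict = accuracy(preds, truth, classes, discard_invalid=False)
```

In this toy case the discarding protocol scores 2 of 3 counted images, while the strict protocol scores 2 of 4, so the same predictions look noticeably better under the first protocol.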
SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning
Authors: Alejandra Perez, Anita Rau, Lee White, Busisiwe Mlambo, Chinedu Nwoye, Muhammad Abdullah Jamal, Omid Mohareri
First: 2026-03-06T18:58:36+00:00 · Latest: 2026-03-06T18:58:36+00:00
Abstract
Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this -- explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.
中文标题/摘要
标题:SUREON:一个手术推理基准和视觉-语言模型
外科医生不只是观察,他们还会进行解释。当专家观察手术场景时,他们不仅理解正在使用的器械是什么,还会理解为什么选择这种器械,它带来的风险是什么,接下来会发生什么。当前的手术AI无法回答这些问题,主要是因为大规模标注包含手术推理的训练数据极其困难。然而,手术视频讲座中已经包含了这些内容,即由专家出于教学目的讲述的意图、理由和预判。尽管这些讲述本身含有噪声且缺乏结构,但它们编码了当前手术AI所缺乏的推理。我们引入了SUREON,这是一个大规模的视频问答数据集,系统地从手术学术视频中收集这种训练信号。SUREON定义了12个问题类别,涵盖安全评估、决策理由和预测,并使用多智能体流水线大规模提取和结构化监督信号。基于134.7K段视频片段和170种手术类型,SUREON产生了206.8K个问答对和一个包含354个样本的专家验证基准。为了评估这种监督在多大程度上转化为手术推理能力,我们引入了两个模型:SureonVLM,通过监督微调适应的视觉-语言模型,以及SureonVLM-R1,使用组相对策略优化(Group Relative Policy Optimization)训练的推理模型。这两个模型都能回答复杂的手术问题,并显著优于规模更大的通用领域模型,在SUREON基准测试中超过84%的准确率,同时在标准的手术感知任务中也优于通用领域模型。对SureonVLM-R1的定性分析显示了明确的推理行为,例如从视觉上下文推断手术意图。
Summary / 总结
SUREON is a new video QA dataset for surgical reasoning, derived from surgical academic videos. It defines 12 question categories and yields 206.8k QA pairs across 134.7K clips and 170 procedure types, along with an expert-validated benchmark of 354 examples. Two models trained on this data, SureonVLM (supervised fine-tuning) and SureonVLM-R1 (Group Relative Policy Optimization), outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark, and SureonVLM-R1 shows explicit reasoning behavior.
SUREON 是一个新的视频问答数据集,用于手术推理,来源于手术学术视频。它包含12个问题类别和206,800个问答对,提供了一个评估手术推理能力的标准。两个模型SureonVLM和SureonVLM-R1在该数据集上训练,并超越了通用领域模型,准确率超过84%,展示了明确的推理能力。
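The abstract states that SureonVLM-R1 is trained with Group Relative Policy Optimization. As a rough illustration of the core idea usually associated with that method, the sketch below normalizes the rewards of a group of sampled answers to the same question by the group mean and standard deviation. The function name, the use of the population standard deviation, the `eps` term, and the toy rewards are all assumptions made for illustration; the abstract does not describe the reward design or normalization used for SUREON.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled answer relative to its own group.

    rewards: scalar rewards for several answers sampled for one prompt.
    eps:     small constant (assumed) to avoid division by zero when all
             rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four answers: two judged correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers that score above the group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to provide a baseline.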
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Authors: Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, Leoweiliang
First: 2026-03-06T18:58:04+00:00 · Latest: 2026-03-06T18:58:04+00:00
Comments: Penguin-VL Technical Report; Code: https://github.com/tencent-ailab/Penguin-VL
Abstract
Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL
中文标题/摘要
标题:Penguin-VL:基于LLM的视觉编码器探索VLM的效率极限
视觉语言模型(VLM)的发展主要依赖于扩大模型规模,这阻碍了在计算受限的移动和边缘设备(如智能手机和机器人)上的部署。在本研究中,我们探索了紧凑型(例如,2B和8B)VLM的性能极限。我们挑战了当前VLM必须依赖通过大规模对比预训练(例如,CLIP/SigLIP)初始化的视觉编码器的主流做法。我们发现对比学习优化了区分性,强化了粗略的类别不变性,抑制了密集描述和复杂VLM推理所需的细粒度视觉线索。为了解决这一问题,我们提出了Penguin-VL,其视觉编码器从纯文本的LLM初始化。我们的实验表明,Penguin-Encoder比传统的对比预训练更优越,为多模态理解提供了更高的视觉保真度和数据效率。在各种图像和视频基准测试中,Penguin-VL在数学推理方面达到了与领先VLM(如Qwen3-VL)相当的性能,在文档理解、视觉知识和多视角视频理解等任务上则超过了它们。值得注意的是,这些改进是通过轻量级架构实现的,表明改进的视觉表示而非模型规模是性能提升的主要驱动力。我们的消融实验表明,Penguin-Encoder始终优于对比预训练的编码器,保留了对密集感知和复杂推理至关重要的细粒度空间和时间线索。这使其成为计算高效的VLM的强有力替代品,并在资源受限的环境中实现了高性能。代码:https://github.com/tencent-ailab/Penguin-VL
Summary / 总结
This work explores the performance limits of compact vision language models (VLMs) by challenging the necessity of contrastive pretraining for vision encoders. The authors introduce Penguin-VL, which initializes its vision encoder from a text-only large language model (LLM), achieving performance comparable to leading VLMs in mathematical reasoning and surpassing them in tasks like document understanding and multi-perspective video understanding. The lightweight architecture of Penguin-VL demonstrates that improved visual representation is more critical than model scaling for performance gains in VLMs.
该研究探索了紧凑型视觉语言模型(VLM)的性能极限,并挑战了视觉编码器必须通过大规模对比预训练(如CLIP/SigLIP)初始化的观点。通过从文本仅大型语言模型(LLM)初始化视觉编码器,作者提出了Penguin-VL,该模型在数学推理任务中与领先VLMs表现相当,并在文档理解、视觉知识和多视角视频理解等任务中超越它们。Penguin-VL的轻量级架构表明,改进的视觉表示是关键,而不是仅仅通过模型规模扩展。消融实验显示,Penguin-Encoder在细粒度视觉线索的保留方面始终优于对比预训练编码器,这对于密集感知和复杂推理至关重要。
CASA: Cross-Attention over Self-Attention for Efficient Vision-Language Fusion
Authors: Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez
First: 2025-12-22T16:21:39+00:00 · Latest: 2026-03-06T18:46:27+00:00
Comments: updated with improved CA results
Abstract
Vision-language models (VLMs) are commonly trained by directly inserting image tokens from a pretrained vision encoder into the text stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes rapidly costly for long multi-image conversations or streaming video applications, both in terms of memory and compute. VLMs leveraging cross-attention (CA) are an efficient alternative to token insertion as image tokens are not added to the KV cache. Despite being introduced early on, multimodal CA models are scarce in the current VLM literature and often underperform their token insertion counterparts. In this work, we reinvestigate the effectiveness of cross-attention for vision-language modeling: (i) We analyze the core differences between the cross-attention and self-attention mechanisms, (ii) we train cross-attention VLMs both from a text-only LLM and by adapting a pretrained insertion-based VLM, showing that simple cross-attention is far more competitive with token insertion than previously reported, and (iii) we demonstrate the practical advantages of cross-attention on real-time video captioning, where it naturally maintains low latency and near-constant memory cost. For samples and code, please see our project page at https://kyutai.org/casa .
中文标题/摘要
标题:CASA:自注意力上的交叉注意力用于高效的视觉-语言融合
视觉-语言模型(VLMs)通常通过将预训练视觉编码器中的图像令牌直接插入语言模型的文字流中进行训练。这使得文本和图像信息能够在模型内部完全相互注意,但在长多图像对话或流式视频应用中,这在内存和计算方面变得迅速昂贵。利用交叉注意力(CA)的VLM是令牌插入的高效替代方案,因为图像令牌不会被添加到KV缓存中。尽管早在早期就被引入,但在当前的VLM文献中,多模态CA模型很少见,且往往不如其令牌插入的对应物表现良好。在本文中,我们重新调查了交叉注意力在视觉-语言建模中的有效性:(i) 我们分析了交叉注意力和自注意力机制的核心差异,(ii) 我们从仅文本的大语言模型和通过调整预训练的插入式VLM训练交叉注意力VLMs,表明简单的交叉注意力比之前报告的更具有竞争力,(iii) 我们展示了交叉注意力在实时视频字幕中的实际优势,它自然地保持了低延迟和近恒定的内存成本。有关样本和代码,请参见我们的项目页面 https://kyutai.org/casa 。
Summary / 总结
The research revisits cross-attention (CA) as an efficient alternative to token insertion in vision-language models: because image tokens are not added to the KV cache, CA avoids much of the memory and compute overhead of insertion. The study analyzes the core differences between cross-attention and self-attention, then trains CA models both from a text-only LLM and by adapting a pretrained insertion-based VLM, showing that simple cross-attention is far more competitive with token insertion than previously reported. On real-time video captioning, cross-attention maintains low latency and near-constant memory cost.
研究旨在探索交叉注意力(CA)在视觉-语言模型(VLMs)中的有效性,以高效处理多图像对话和流式视频应用。研究将CA与token插入方法进行了比较,表明简单的交叉注意力比之前认为的更具有竞争力。作者从纯文本语言模型和预训练的插入式VLM中训练CA VLMs,并展示了交叉注意力在实时视频字幕生成中保持低延迟和近似恒定的内存成本的优势。
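The memory argument in the abstract (image tokens inserted into the text stream grow the KV cache, while cross-attention keeps them out of it) can be made concrete with back-of-the-envelope arithmetic. The model dimensions, frame count, and tokens-per-frame below are hypothetical values chosen for illustration, not figures from the paper, and the sketch counts only the self-attention KV cache.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Size of a self-attention KV cache.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical configuration (assumed, not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
TEXT_TOKENS = 1_000
FRAMES, TOKENS_PER_FRAME = 100, 256

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)

# Token insertion: every image token of every frame lands in the cache.
insertion = kv_cache_bytes(TEXT_TOKENS + FRAMES * TOKENS_PER_FRAME,
                           LAYERS, KV_HEADS, HEAD_DIM)

# Cross-attention: only text tokens are cached in self-attention.
cross_attention = kv_cache_bytes(TEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)

ratio = insertion / cross_attention
```

Under these assumed numbers the insertion cache grows linearly with the number of frames, whereas the cross-attention cache depends only on the text length, which is consistent with the near-constant memory cost the abstract reports for streaming video.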
NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion
Authors: Taewon Kang, Ming C. Lin
First: 2026-03-06T18:21:49+00:00 · Latest: 2026-03-06T18:21:49+00:00
Comments: 50 pages, 32 figures
Abstract
Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a formal treatment of linguistic negation in diffusion-based generative models by modeling it as a structured feasibility constraint on semantic guidance within diffusion dynamics. Rather than introducing heuristics or retraining model parameters, we reinterpret classifier-free guidance as defining a semantic update direction and enforce negation by projecting the update onto a convex constraint set derived from linguistic structure. This novel formulation provides a unified framework for handling diverse negation phenomena, including object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation. Our approach is training-free, compatible with pretrained diffusion backbones, and naturally extends from image generation to temporally evolving video trajectories. In addition, we introduce a structured negation-centric benchmark suite that isolates distinct linguistic failure modes in generative systems, to further research in this area. Experiments demonstrate that our method achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation.
中文标题/摘要
标题:NEGATE:基于文本到视频扩散的约束语义指导语言否定
否定是基本的语义操作符,但在基于扩散的生成系统中仍未能充分建模。在本文中,我们通过将否定建模为扩散动力学中语义指导的结构化可行性约束,为基于扩散的生成模型提供了一种形式化的处理方法。我们不引入启发式方法或重新训练模型参数,而是重新解释无分类器引导作为定义语义更新方向,并通过从语言结构中导出的凸约束集投影更新来强制执行否定。这种新颖的表述提供了一个统一框架,用于处理各种否定现象,包括对象缺席、分级非反转语义、多重否定组合和范围敏感的消歧。我们的方法是无需训练的,与预训练的扩散主干兼容,并自然地从图像生成扩展到时间演变的视频轨迹。此外,我们引入了一个结构化否定中心基准套件,以隔离生成系统中的不同语言失败模式,进一步推动该领域的研究。实验表明,我们的方法在保持视觉保真度和结构连贯性的同时实现了稳健的否定一致性,建立了扩散生成模型中语言否定的第一个统一表述,超越了表示级评估。
Summary / 总结
This paper addresses the inadequacy of modeling negation in diffusion-based generative systems by proposing a structured feasibility constraint on semantic guidance. The method reinterprets classifier-free guidance and enforces negation through a convex constraint set derived from linguistic structure. Experiments show that the approach achieves robust negation compliance while maintaining visual fidelity and structural coherence, providing a unified framework for handling various negation phenomena.
该研究针对扩散生成模型对否定建模不足的问题,将否定建模为语义引导上的结构化可行性约束。方法把无分类器引导(classifier-free guidance)重新解释为语义更新方向,并将该更新投影到由语言结构导出的凸约束集上来强制执行否定,无需重新训练模型。实验表明,该方法在保持视觉保真度和结构连贯性的同时,实现了稳健的否定一致性,提供了一种统一的框架来处理各种否定现象。
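The abstract describes treating classifier-free guidance as a semantic update direction and projecting that update onto a convex constraint set. A minimal sketch of this idea follows, under a strong simplifying assumption: the constraint set is taken to be a single half-space that forbids movement along the direction of the negated concept. The paper's actual constraint sets are derived from linguistic structure and are not specified in the abstract, so the half-space, the function names, and the toy vectors are all illustrative stand-ins.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_onto_halfspace(d, n):
    """Euclidean projection of d onto the convex set {x : <x, n> <= 0}."""
    s = dot(d, n)
    nn = dot(n, n)
    if s <= 0 or nn == 0:
        return list(d)  # already feasible, nothing to do
    return [di - (s / nn) * ni for di, ni in zip(d, n)]


def guided_noise(eps_uncond, eps_cond, eps_neg, scale):
    """Classifier-free guidance with the update projected before use.

    eps_uncond: unconditional noise prediction
    eps_cond:   prediction conditioned on the full prompt
    eps_neg:    prediction conditioned on the negated concept (assumed input)
    """
    update = [c - u for c, u in zip(eps_cond, eps_uncond)]
    negated = [g - u for g, u in zip(eps_neg, eps_uncond)]
    feasible = project_onto_halfspace(update, negated)
    return [u + scale * f for u, f in zip(eps_uncond, feasible)]


# Toy 2-D example: the first coordinate plays the role of the negated concept.
result = guided_noise([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], scale=2.0)
```

In the toy example the component of the update that points toward the negated concept is removed while the remaining component is kept, which is the sense in which projection enforces the constraint without retraining.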
COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
Authors: Kartik Sharma, Rakshit S. Trivedi
Venue: ICLR 2026
First: 2026-03-06T17:27:27+00:00 · Latest: 2026-03-06T17:27:27+00:00
Comments: ICLR 2026. Code available at https://github.com/Ksartik/cold-steer
Abstract
Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods suboptimally capture steering signals from labeled examples, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples compared to the best baseline. COLD-Steer facilitates accommodating diverse perspectives without extensive demonstration data, which we validate through our experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
中文标题/摘要
标题:COLD-Steer:通过上下文内一步学习动力学引导大型语言模型
激活引导方法可以在无需重新训练的情况下,在推理时控制大型语言模型(LLM)的行为,但当前的方法面临一个基本的权衡:样本高效的策略以次优的方式捕捉标记示例中的引导信号,而能够更好地提取这些信号的策略则需要数百到数千个示例。我们提出了COLD-Steer,这是一种无需训练的框架,通过近似上下文内示例进行梯度下降所导致的表示变化来引导LLM的激活。我们的核心洞察是,对一小组示例进行微调的效果可以在推理时通过不实际更新参数来高效地近似。我们通过两种互补的方法来形式化这一点:(i)一种单位核近似方法,直接使用归一化的示例梯度来更新激活,(ii)一种仅需两次前向传递的有限差分近似,无论示例数量多少。在各种引导任务和基准测试中的实验表明,COLD-Steer 在使用比最佳基线少50倍的样本的情况下,可以实现高达95%的引导效果。COLD-Steer 使得在没有大量演示数据的情况下容纳多样化的观点成为可能,我们通过在多元一致对齐任务上的实验进行了验证。我们的框架为通过原理上近似学习动力学而不是专门的训练程序来实现适应性和上下文感知的模型控制打开了新的可能性。
Summary / 总结
COLD-Steer is a training-free framework that steers large language model activations by approximating the representational changes that gradient descent on in-context examples would produce. It uses two methods: a unit kernel approximation and a finite-difference approximation that needs only two forward passes regardless of example count. Experiments show COLD-Steer achieves up to 95% steering effectiveness with 50 times fewer samples than the best baseline, which makes it suitable for accommodating diverse perspectives without extensive demonstration data.
COLD-Steer 是一个无需训练的框架,通过近似在上下文示例上进行梯度下降的表示变化来引导大型语言模型(LLM)的激活。它使用两种方法:单位核近似和有限差分近似。实验表明,COLD-Steer 在样本效率方面表现出色,最多可实现 95% 的引导效果,所需样本数量仅为最佳基线的 50 分之一,使其能够更好地适应不同的观点而无需大量的演示数据。
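The abstract mentions a finite-difference approximation that needs only two forward passes. The sketch below shows the generic numerical identity such an approach can build on: the first-order change in an activation when parameters move along a given direction can be estimated from one forward pass at the current parameters and one at slightly nudged parameters. This is not COLD-Steer's algorithm; the toy `forward` function, the `direction` argument, and the step sizes are hypothetical, and how the paper obtains its direction from in-context examples is not described in the abstract.

```python
import math


def forward(weights, x):
    """Toy 'activation': tanh of a dot product (stand-in for a network layer)."""
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)))


def activation_shift(weights, x, direction, step, fd_eps=1e-4):
    """Estimate how the activation would change if the weights moved by
    step * direction, using a forward finite difference (two forward passes)."""
    base = forward(weights, x)
    nudged = forward([w + fd_eps * d for w, d in zip(weights, direction)], x)
    return step * (nudged - base) / fd_eps


# Hypothetical values for illustration.
weights = [0.1, -0.2]
x = [1.0, 2.0]
direction = [1.0, 0.0]   # e.g. a (negative) gradient direction, assumed given
estimate = activation_shift(weights, x, direction, step=0.5)

# Analytic first-order value for this toy function, for comparison.
pre = sum(w * xi for w, xi in zip(weights, x))
exact = 0.5 * (1.0 - math.tanh(pre) ** 2) * sum(xi * d for xi, d in zip(x, direction))
```

The estimated shift can then be added to the activation at inference time, which is the general pattern of steering without updating any parameters.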
Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement
Authors: Yakov Pyotr Shkolnikov
First: 2026-03-06T16:48:27+00:00 · Latest: 2026-03-06T16:48:27+00:00
Abstract
Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, while the best text output achieves only 20.0 degrees -- a 3.3x bottleneck. LoRA fine-tuning (r=16, 2,000 images) narrows this gap to 6.5 degrees, providing evidence for a pathway-training deficit rather than a representational one. Training objective determines accuracy more than architecture: five encoders spanning self-supervised, contrastive, and hybrid paradigms converge to statistically equivalent accuracy (R^2 approximately 0.55, TOST-equivalent at delta=0.03) despite sharing as little as CKA=0.41 representational similarity -- functional convergence without representational convergence. Autoregressive generation damages geometric fidelity, but the damage originates in the generation process, not in language alignment: Qwen2.5-VL's LLM layers actually improve probe accuracy over its raw vision encoder. Layer-wise analysis reveals a universal mid-network accuracy peak across all architectures, with attention heads in layers 18-22 carrying disproportionate geometric signal. These findings enable a single frozen backbone to function as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation.
中文标题/摘要
标题:基础模型知道几何学吗?探究冻结特征的连续物理测量
视觉-语言模型编码连续几何学,而其文本路径无法表达:一个包含6,000个参数的线性探测器从冻结特征中提取手关节角度,MAE为6.1度,而最佳文本输出仅为20.0度——3.3倍瓶颈。LoRA微调(r=16,2,000张图像)将这一差距缩小到6.5度,为路径训练缺陷而非表示缺陷提供了证据。训练目标比架构更能决定准确性:五个涵盖自监督、对比和混合范式的编码器收敛到统计上等效的准确性(R²约0.55,TOST等效于delta=0.03),尽管它们的表示相似性仅为CKA=0.41——功能收敛而无表示收敛。自回归生成损害几何保真度,但损害源自生成过程,而非语言对齐:Qwen2.5-VL的LLM层实际上提高了探测器的准确性,超过了其原始视觉编码器。逐层分析显示,所有架构中网络中间层的准确性存在普遍峰值,第18-22层的注意力头承载了不成比例的几何信号。这些发现使一个冻结的主干能够通过轻量级探测器作为多任务几何传感器,无需微调或文本生成。
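The probing setup in the abstract (fit a linear map from frozen features to a continuous quantity, then report mean absolute error) can be sketched in a few lines. The example below is a one-feature stand-in with synthetic data, fitted by closed-form least squares; the feature values, the target angles, and the function names are hypothetical. The final line only re-derives the "3.3x" figure from the two MAE values quoted in the abstract.

```python
def fit_linear_probe_1d(features, targets):
    """Closed-form least squares for target ~ w * feature + b."""
    n = len(features)
    mean_f = sum(features) / n
    mean_t = sum(targets) / n
    cov = sum((f - mean_f) * (t - mean_t) for f, t in zip(features, targets))
    var = sum((f - mean_f) ** 2 for f in features)
    w = cov / var
    b = mean_t - w * mean_f
    return w, b


def mean_absolute_error(predictions, targets):
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


# Synthetic stand-in for frozen features and joint angles in degrees.
feats = [0.0, 1.0, 2.0, 3.0]
angles = [10.0, 30.0, 50.0, 70.0]

w, b = fit_linear_probe_1d(feats, angles)
mae = mean_absolute_error([w * f + b for f in feats], angles)

# Ratio of the text-output MAE to the probe MAE, using the abstract's numbers.
bottleneck = 20.0 / 6.1
```

Because the backbone stays frozen, only `w` and `b` are learned, which is why such probes stay small (the abstract reports roughly 6,000 parameters for the full multi-dimensional case).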
VisualPrompter: Semantic-Aware Prompt Optimization with Visual Feedback for Text-to-Image Synthesis
Authors: Shiyu Wu, Mingzhen Sun, Weining Wang, Yequan Wang, Jing Liu
First: 2025-06-29T08:24:39+00:00 · Latest: 2026-03-06T16:34:03+00:00
Comments: ICLR2026 Camera Ready
Abstract
The notable gap between user-provided and model-preferred prompts poses a significant challenge for generating high-quality images with text-to-image models, compelling the need for prompt engineering. Current studies on prompt engineering can effectively enhance the style and aesthetics of generated images. However, they often neglect the semantic alignment between generated images and user descriptions, resulting in visually appealing but content-wise unsatisfying outputs. In this work, we propose VisualPrompter, a novel training-free prompt engineering framework that refines user inputs to model-preferred sentences. VisualPrompter utilizes an automatic self-reflection module that identifies absent concepts in the generated images, followed by a target-specific prompt optimization mechanism that revises the prompts in a fine-grained manner. By deconstructing prompts, introducing new elements at the atomic semantic level, and then reassembling them, our framework is able to maintain semantic consistency and integrity throughout the optimization process. Extensive experiments demonstrate the effectiveness of VisualPrompter, which achieves new state-of-the-art performance on multiple benchmarks for text-image alignment evaluation. Additionally, our framework features a plug-and-play design, making it highly adaptable to various generative models. Our code is available at https://github.com/teheperinko541/VisualPrompter.
中文标题/摘要
标题:VisualPrompter:基于视觉反馈的语义感知提示优化
用户提供的提示与模型偏好之间的显著差距,对使用文本生成图像模型生成高质量图像构成了重大挑战,促使需要进行提示工程。当前的提示工程研究可以有效增强生成图像的风格和美学。然而,它们往往忽视了生成图像与用户描述之间的语义对齐,导致视觉上吸引人但内容上不满意的输出。在本文中,我们提出了一种名为VisualPrompter的新型无需训练的提示工程框架,用于将用户输入优化为模型偏好句子。VisualPrompter利用一个自动自我反思模块来识别生成图像中缺失的概念,然后通过特定目标的提示优化机制以精细的方式修订提示。通过分解提示,在原子语义级别引入新元素,然后重新组装,我们的框架能够在优化过程中保持语义的一致性和完整性。广泛的实验表明,VisualPrompter在多个文本-图像对齐评估基准上达到了新的最佳性能。此外,我们的框架具有即插即用设计,使其高度适应各种生成模型。我们的代码可在https://github.com/teheperinko541/VisualPrompter获取。
Summary / 总结
VisualPrompter is a training-free prompt engineering framework that optimizes user inputs to align with model preferences, addressing the gap between user descriptions and generated images. It uses an automatic self-reflection module to identify missing concepts and a target-specific prompt optimization mechanism to refine prompts. Experiments show that VisualPrompter achieves state-of-the-art performance on text-image alignment benchmarks and is adaptable to various generative models.
VisualPrompter 是一个无需训练的提示工程框架,用于优化用户输入以与模型偏好对齐进行文本到图像合成。它使用自动自我反思模块来识别生成图像中缺失的概念,并使用目标特定的提示优化机制来细化提示。实验表明,VisualPrompter 改进了文本与图像的对齐,并在多个基准上达到了最先进的性能。该框架适用于各种生成模型,并已作为开源代码发布。
OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis
Authors: Yuxuan Fan, Jing Hao, Hong Chen, Jiahao Bao, Yihua Shao, Yuci Liang, Kuo Feng Hung, Hao Tang
Venue: CVPR 2026
First: 2026-03-06T15:16:30+00:00 · Latest: 2026-03-06T15:16:30+00:00
Comments: 34 pages, 24 figures, conference
Abstract
Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision-language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision-language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand-image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis.
中文标题/摘要
标题:OralGPT-Plus:通过强化学习学习使用视觉工具进行全景X光分析
全景牙科放射照需要精细的空间推理、双边对称理解和多步诊断验证,但现有的视觉-语言模型在静态单次通过的范式下运作,限制了其临床可靠性。在本文中,我们介绍了OralGPT-Plus,这是一种能够进行迭代和对称意识诊断推理的代理视觉-语言模型,用于全景牙科放射照分析。为了支持这一范式,我们构建了包含专家标注诊断轨迹的DentalProbe数据集,提供了局部检查和对侧比较的结构化监督。我们还开发了一种基于重检的强化学习框架,鼓励临床有意义的重新检查,并通过基于评分表的奖励和条件诊断驱动奖励来稳定长期推理。同时,我们提出了MMOral-X,这是第一个全景诊断基准,包含300个开放式问题和多难度级别的区域级注释。OralGPT-Plus在MMOral-X和已建立的全景基准上表现出一致且可靠的改进,表明交互式和对称指导推理的有效性。我们的工作突显了代理建模在牙科成像中的价值,并为未来在临床对齐的全景放射照分析中的研究提供了基础。
Summary / 总结
OralGPT-Plus is an agentic vision-language model designed for iterative and symmetry-aware diagnostic reasoning in panoramic dental radiograph analysis. It uses a dataset called DentalProbe and a reinforcement learning framework to support structured supervision and clinically meaningful re-examination. OralGPT-Plus shows consistent improvements over strong baselines on both new and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning.
OralGPT-Plus 是一种用于全景牙科放射图分析的具有迭代和对称感知推理能力的视觉-语言模型。它使用名为 DentalProbe 的数据集和强化学习框架来支持结构化的监督和临床意义的复查。OralGPT-Plus 在新的和现有的全景基准测试中都表现出一致的改进,表明交互式和对称感知推理的有效性。
K-MaT: Knowledge-Anchored Manifold Transport for Cross-Modal Prompt Learning in Medical Imaging
Authors: Jiajun Zeng, Shadi Albarqouni
First: 2026-03-06T14:46:55+00:00 · Latest: 2026-03-06T14:46:55+00:00
Abstract
Large-scale biomedical vision-language models (VLMs) adapted on high-end imaging (e.g., CT) often fail to transfer to frontline low-end modalities (e.g., radiography), collapsing into modality-specific shortcuts. We propose K-MaT (Knowledge-Anchored Manifold Transport), a prompt-learning framework that transfers decision structures to low-end modalities without requiring low-end training images. K-MaT factorizes prompts, anchors them to clinical text descriptions, and aligns the low-end prompt manifold to the visually-grounded high-end space using Fused Gromov-Wasserstein optimal transport. We evaluate K-MaT on four cross-modal benchmarks, including dermoscopy, mammography to ultrasound, and CT to chest X-ray. K-MaT achieves state-of-the-art results, improving the average harmonic mean of accuracy to 44.1% (from BiomedCoOp's 42.0%) and macro-F1 to 36.2%. Notably, on the challenging breast imaging task, it mitigates the catastrophic forgetting seen in standard methods like CoOp (which drops to 27.0% accuracy on the low-end), preserving robust performance across modalities. Aligning prompt manifolds via optimal transport provides a highly effective route for the zero-shot cross-modal deployment of medical VLMs.
中文标题/摘要
标题:K-MaT:知识锚定流形传输在医学影像跨模态提示学习中的应用
大规模生物医学视觉-语言模型(VLMs)在高端成像(如CT)上适应后,往往无法转移到前线低端模态(如X光片),导致陷入特定模态的捷径。我们提出K-MaT(知识锚定流形传输),这是一种提示学习框架,可以在不需要低端训练图像的情况下将决策结构转移到低端模态。K-MaT 分解提示,将其锚定到临床文本描述,并使用融合 Gromov-Wasserstein(Fused Gromov-Wasserstein)最优传输将低端提示流形对齐到视觉基础的高端空间。我们在四个跨模态基准上评估了K-MaT,包括皮肤镜检查、乳腺X光到超声检查以及CT到胸部X光检查。K-MaT 达到了最先进的结果,将准确率的平均调和平均值提高到44.1%(BiomedCoOp为42.0%),宏F1分数提高到36.2%。值得注意的是,在具有挑战性的乳腺成像任务中,它缓解了标准方法(如CoOp,其在低端模态上的准确率降至27.0%)中出现的灾难性遗忘现象,在不同模态中保持了稳健的性能。通过最优传输对提示流形进行对齐为医学VLMs的零样本跨模态部署提供了非常有效的途径。
Summary / 总结
K-MaT is a prompt-learning framework designed to transfer decision structures from high-end imaging modalities to low-end ones without requiring low-end training images. It factorizes prompts, anchors them to clinical text descriptions, and aligns the low-end prompt manifold to the high-end space using Fused Gromov-Wasserstein optimal transport. K-MaT achieves state-of-the-art results on four cross-modal benchmarks, improving accuracy and macro-F1 scores, and demonstrates robust performance across modalities, particularly on challenging breast imaging tasks where it mitigates catastrophic forgetting seen in standard methods.
K-MaT 是一种提示学习框架,旨在无需低端模态训练图像的情况下,将高端成像模态的决策结构转移到低端模态。它将提示分解,锚定到临床文本描述,并使用融合 Gromov-Wasserstein(Fused Gromov-Wasserstein)最优传输将低端提示流形对齐到高端空间。K-MaT 在四个跨模态基准测试中取得了最先进的成果,提高了准确率和宏-F1分数,并且在具有挑战性的乳腺成像任务中缓解了标准方法中出现的灾难性遗忘,保持了跨模态的稳健性能。
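The headline metric in the abstract is a harmonic mean of accuracy. Assuming this means the harmonic mean of the high-end and low-end modality accuracies (the abstract does not spell out the pairing), the sketch below shows why that metric penalizes a model that does well on one modality and collapses on the other. The accuracy values are toy numbers, not results from the paper.

```python
def harmonic_mean(acc_a, acc_b):
    """Harmonic mean of two accuracies; 0.0 if either is zero."""
    if acc_a <= 0 or acc_b <= 0:
        return 0.0
    return 2 * acc_a * acc_b / (acc_a + acc_b)


# Toy accuracies (hypothetical): a balanced model vs. a lopsided one
# with the same arithmetic mean of 0.6.
balanced = harmonic_mean(0.6, 0.6)
lopsided = harmonic_mean(0.8, 0.4)
```

Both toy models have the same arithmetic mean, yet the lopsided one scores lower under the harmonic mean, so forgetting on the low-end modality cannot be hidden by strong high-end accuracy.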
WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection
Authors: Peng Chen, Chao Huang
First: 2026-03-06T14:16:06+00:00 · Latest: 2026-03-06T14:16:06+00:00
Abstract
Vision-language models have recently shown strong generalization in zero-shot anomaly detection (ZSAD), enabling the detection of unseen anomalies without task-specific supervision. However, existing approaches typically rely on fixed textual prompts, which struggle to capture complex semantics, and focus solely on spatial-domain features, limiting their ability to detect subtle anomalies. To address these challenges, we propose a wavelet-enhanced mixture-of-experts prompt learning method for ZSAD. Specifically, a variational autoencoder is employed to model global semantic representations and integrate them into prompts to enhance adaptability to diverse anomaly patterns. Wavelet decomposition extracts multi-frequency image features that dynamically refine textual embeddings through cross-modal interactions. Furthermore, a semantic-aware mixture-of-experts module is introduced to aggregate contextual information. Extensive experiments on 14 industrial and medical datasets demonstrate the effectiveness of the proposed method.
中文标题/摘要
标题:WMoE-CLIP:小波增强混合专家提示学习方法在零样本异常检测中的应用
视觉语言模型在零样本异常检测(ZSAD)中表现出强大的泛化能力,能够在无需特定任务监督的情况下检测未见过的异常。然而,现有方法通常依赖固定的文本提示,难以捕捉复杂的语义,并且仅专注于空间域特征,限制了其检测细微异常的能力。为解决这些挑战,我们提出了一种小波增强混合专家提示学习方法用于ZSAD。具体而言,使用变分自编码器建模全局语义表示,并将其整合到提示中以增强对多种异常模式的适应性。小波分解提取多频率图像特征,通过跨模态交互动态细化文本嵌入。此外,引入了一种语义感知混合专家模块来聚合上下文信息。在14个工业和医疗数据集上的广泛实验表明了该方法的有效性。
Summary / 总结
The research aims to improve zero-shot anomaly detection by addressing the limitations of fixed textual prompts and spatial-domain features. The proposed method, WMoE-CLIP, uses a variational autoencoder to model global semantic representations and integrate them into prompts, enhancing adaptability. Wavelet decomposition extracts multi-frequency image features that dynamically refine textual embeddings, and a semantic-aware mixture-of-experts module aggregates contextual information. Experiments on 14 datasets show the method's effectiveness in detecting subtle anomalies across various domains.
研究旨在通过解决现有方法的限制,如固定文本提示和仅依赖空间特征,来提高零样本异常检测的性能。WMoE-CLIP 方法使用变分自编码器建模全局语义表示并将其集成到提示中,增强适应性。小波分解提取多频率图像特征,动态细化文本嵌入,并引入语义感知的混合专家模块聚合上下文信息。在14个工业和医疗数据集上的实验表明,该方法在各种领域中有效检测细微异常。
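The abstract says wavelet decomposition is used to extract multi-frequency image features. As a minimal illustration of what a single decomposition level produces, the sketch below applies an orthonormal Haar transform to one 2x2 pixel block, yielding one low-frequency average and three detail coefficients. Haar is chosen here only because it is the simplest wavelet; the abstract does not say which wavelet family or how many levels WMoE-CLIP uses, and the function name is hypothetical.

```python
def haar_2x2(block):
    """One-level orthonormal Haar transform of a 2x2 block [[a, b], [c, d]].

    Returns (average, column_detail, row_detail, diagonal_detail).
    """
    (a, b), (c, d) = block
    average = (a + b + c + d) / 2          # low-frequency content
    column_detail = (a - b + c - d) / 2    # change between left and right columns
    row_detail = (a + b - c - d) / 2       # change between top and bottom rows
    diagonal_detail = (a - b - c + d) / 2  # checkerboard-like change
    return average, column_detail, row_detail, diagonal_detail


flat = haar_2x2([[3.0, 3.0], [3.0, 3.0]])           # uniform patch
vertical_edge = haar_2x2([[1.0, 0.0], [1.0, 0.0]])  # bright left, dark right
```

A uniform patch yields zero detail coefficients, while an edge shows up in exactly one detail band, which hints at why frequency-domain features can expose subtle local anomalies that are hard to see in raw spatial features.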
DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models
Authors: Walid Bousselham, Angie Boggust, Hendrik Strobelt, Hilde Kuehne
First: 2026-03-06T14:07:37+00:00 · Latest: 2026-03-06T14:07:37+00:00
Comments: Project page: https://walidbousselham.com/DEX-AR
Abstract
As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, it becomes more and more crucial to understand their decision-making process. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method designed to address these challenges by generating both per-token and sequence-level 2D heatmaps highlighting image regions crucial for the model's textual responses. The proposed method interprets autoregressive VLMs, including the varying importance of layers and generated tokens, by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing between visually-grounded and purely linguistic tokens. Our evaluation on ImageNet, VQAv2, and PascalVOC shows a consistent improvement in both perturbation-based metrics, which use a novel normalized perplexity measure, and segmentation-based metrics.
中文标题/摘要
标题:DEX-AR:自回归视觉语言模型的动态可解释性方法
随着视觉语言模型(VLMs)变得越来越复杂和广泛使用,理解其决策过程变得越来越重要。传统的可解释性方法,设计用于分类任务,难以应对现代自回归VLMs,因为它们具有复杂的逐词生成过程和视觉与文本模态之间的复杂交互。我们提出了DEX-AR(动态自回归模型可解释性),这是一种新型的可解释性方法,通过生成逐词和序列级的2D热图来突出显示对模型文本响应至关重要的图像区域,以解决这些挑战。该方法通过计算逐词生成过程中相对于注意力图的层级梯度来解释自回归VLMs,包括不同层和生成词的重要性。DEX-AR引入了两项关键创新:动态头部筛选机制,用于识别关注视觉信息的注意力头,以及序列级筛选方法,用于聚合逐词解释并区分视觉支撑和纯粹语言词。我们在ImageNet、VQAv2和PascalVOC上的评估显示,在使用新型归一化困惑度度量的扰动基度量和分割基度量中都有一致的改进。
Summary / 总结
The research aims to improve the interpretability of autoregressive Vision-Language Models (VLMs) by addressing the limitations of traditional explainability methods. DEX-AR, a novel dynamic explainability method, generates 2D heatmaps for both per-token and sequence-level explanations, highlighting crucial image regions for the model's textual responses. Key innovations include a dynamic head filtering mechanism and a sequence-level filtering approach. Experimental results on ImageNet, VQAv2, and PascalVOC demonstrate consistent improvements in perturbation-based and segmentation-based metrics.
研究旨在通过解决传统解释方法的局限性,增强自回归视觉-语言模型(VLMs)的可解释性。DEX-AR 是一种新型动态解释方法,通过生成 2D 热图来突出显示对模型文本响应至关重要的图像区域。它在逐个生成标记的过程中计算层间梯度,并引入动态头过滤机制和序列级过滤来区分视觉基础和纯粹语言标记。实验结果表明,在 ImageNet、VQAv2 和 PascalVOC 上的一致改进了扰动基和分割基的度量标准。
FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
Authors: Zhen Wang, Youcan Xu, Jun Xiao, Long Chen
First: 2026-03-06T13:48:01+00:00 · Latest: 2026-03-06T13:48:01+00:00
Abstract
Video motion transfer aims to generate a target video that inherits motion patterns from a source video while rendering new scenes. Existing training-free approaches focus on constructing motion guidance based on the intermediate outputs of pre-trained T2V models, which results in heavy computational overhead and limited flexibility. In this paper, we present FlowMotion, a novel training-free framework that enables efficient and flexible motion transfer by directly leveraging the predicted outputs of flow-based T2V models. Our key insight is that early latent predictions inherently encode rich temporal information. Motivated by this, we propose flow guidance, which extracts motion representations based on latent predictions to align motion patterns between source and generated videos. We further introduce a velocity regularization strategy to stabilize optimization and ensure smooth motion evolution. By operating purely on model predictions, FlowMotion achieves superior time and resource efficiency as well as competitive performance compared with state-of-the-art methods.
中文标题/摘要
标题:FlowMotion:无需训练的流指导视频运动转移
视频运动转移旨在生成一个目标视频,该视频继承了源视频的运动模式,同时渲染新的场景。现有的无需训练的方法侧重于基于预训练T2V模型的中间输出构建运动指导,这导致了巨大的计算开销和有限的灵活性。在本文中,我们提出了FlowMotion,这是一种新颖的无需训练框架,通过直接利用基于流的T2V模型的预测输出来实现高效和灵活的运动转移。我们的关键见解是早期的潜在预测本身包含了丰富的时序信息。受此启发,我们提出了流指导,该方法基于潜在预测提取运动表示,以使源视频和生成视频之间的运动模式对齐。我们还引入了速度正则化策略来稳定优化并确保运动的平滑演变。通过仅在模型预测上操作,FlowMotion实现了优于最新方法的时间和资源效率以及竞争力的性能。
Summary / 总结
FlowMotion is a training-free framework for video motion transfer that uses the predicted outputs of flow-based T2V models directly as motion guidance, avoiding the computational overhead of guidance built on intermediate model outputs. It extracts motion representations from early latent predictions to align the source and generated videos, and uses velocity regularization to stabilize optimization. Experiments show that FlowMotion is more time- and resource-efficient than state-of-the-art methods while remaining competitive in performance.
FlowMotion 是一个无需训练的视频运动转移框架,通过利用流式 T2V 模型的预测输出直接指导源视频和目标视频之间的运动对齐。它从早期的潜在预测中提取运动表示以对齐运动模式,并引入速度正则化策略以稳定优化。实验结果表明,FlowMotion 在时间和资源效率方面优于现有方法,同时保持了竞争力的性能。
HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models
Authors: Lincen Bai, Hedi Tabia, Raul Santos-Rodriguez
First: 2026-03-06T13:31:54+00:00 · Latest: 2026-03-06T13:31:54+00:00
Abstract
Pruning vision-language models (VLMs) for efficient deployment is challenging because compression can affect not only task utility but also visual grounding, often amplifying object hallucinations even at the same sparsity level. We present HiPP-Prune, a hierarchical preference-conditioned structured pruning framework that treats pruning as conditional resource allocation under multiple objectives. HiPP-Prune makes plan-level decisions: a single policy invocation outputs a global pruning blueprint by factorizing decisions into an overall sparsity budget and a layer-wise allocation, enabling queryable trade-offs via a user-specified preference vector. To account for VLM-specific failure modes, our policy state integrates a visual sensitivity signal derived from attention flow between vision tokens and language hidden states, discouraging over-pruning of vision-critical layers that facilitate cross-modal fusion. We optimize pruning plans with plan-level Group Relative Policy Optimization (GRPO) under a multi-objective return that combines task utility, hallucination robustness (POPE), compression, and a synaptic-flow-inspired stability proxy to reduce unproductive exploration in high-sparsity regimes. Experiments on LLaVA with POPE and ScienceQA demonstrate that HiPP-Prune discovers diverse non-dominated pruning plans and provides controllable robustness--utility trade-offs under matched sparsity budgets.
中文标题/摘要
标题:HiPP-Prune: 分层偏好条件结构剪枝框架用于视觉-语言模型
对视觉-语言模型(VLMs)进行剪枝以实现高效的部署具有挑战性,因为压缩不仅会影响任务性能,还会影响视觉定位,通常会在相同的稀疏度水平下放大对象幻觉。我们提出了HiPP-Prune,这是一种分层偏好条件结构剪枝框架,将剪枝视为在多个目标下的条件资源分配。HiPP-Prune 在计划层面做出决策:一次策略调用输出一个全局剪枝蓝图,通过将决策分解为整体稀疏度预算和逐层分配来实现,从而通过用户指定的偏好向量实现可查询的权衡。为了考虑VLM特有的失败模式,我们的策略状态整合了一个视觉敏感信号,该信号源自视觉标记与语言隐藏状态之间的注意力流,以防止过度剪枝视觉关键层,这些层有助于跨模态融合。我们使用计划层面的组相对策略优化(GRPO)来优化剪枝计划,在结合任务性能、幻觉鲁棒性(POPE)、压缩和一种基于突触流的稳定性代理的多目标回报下进行优化,以减少在高稀疏度区域中的无效探索。在LLaVA和ScienceQA上的实验表明,HiPP-Prune 发现了多样化的非支配剪枝计划,并在匹配的稀疏度预算下提供了可控的鲁棒性-性能权衡。
Summary / 总结
HiPP-Prune is a hierarchical preference-conditioned structured pruning framework for vision-language models that aims to preserve task utility and limit object hallucinations under compression. It treats pruning as conditional resource allocation under multiple objectives: a single policy invocation outputs a global pruning plan, factorized into an overall sparsity budget and a layer-wise allocation, and a user-specified preference vector selects the trade-off. A visual sensitivity signal discourages over-pruning of vision-critical layers, and plans are optimized with plan-level GRPO under a multi-objective return that combines task utility, hallucination robustness, compression, and a stability proxy. Experiments on LLaVA, evaluated with POPE and ScienceQA, show that HiPP-Prune discovers diverse non-dominated pruning plans and provides controllable robustness-utility trade-offs under matched sparsity budgets.
HiPP-Prune 是一种针对视觉语言模型 (VLM) 的分层偏好条件结构剪枝框架,将剪枝视为在多个目标下的条件资源分配。它通过将决策分解为整体稀疏预算和逐层分配来输出全局剪枝蓝图,允许通过用户指定的偏好向量进行可查询的权衡。HiPP-Prune 集成了视觉敏感信号以防止过度剪枝关键视觉层,并使用结合任务性能、幻觉鲁棒性、压缩和稳定性代理的多目标回报来优化剪枝计划。实验结果表明,HiPP-Prune 发现了多样化的非支配剪枝计划,并在匹配的稀疏预算下提供了可控的鲁棒性-性能权衡。
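The plan-level factorization described in the HiPP-Prune entry above can be illustrated with a small sketch. This is not the authors' implementation: the softmax spreading rule, the four-element objective ordering, and the function names `layer_sparsities`, `scalarized_return`, and `group_relative_advantages` are all assumptions made for illustration. Only the general ideas (a global sparsity budget split into a layer-wise allocation, a preference vector weighting several objectives, and GRPO-style group-relative advantages) come from the abstract.

```python
import math
import statistics

def layer_sparsities(budget, layer_logits, max_sparsity=0.9):
    """Spread a global sparsity budget over layers (hypothetical rule).

    Softmax weights decide each layer's share. With equally sized layers
    the mean per-layer sparsity equals `budget` unless clipping kicks in.
    """
    peak = max(layer_logits)
    exps = [math.exp(x - peak) for x in layer_logits]
    total = sum(exps)
    n = len(layer_logits)
    return [min(max_sparsity, budget * n * e / total) for e in exps]

def scalarized_return(objectives, preference):
    """Weight the objectives by the user preference vector (assumed linear)."""
    return sum(p * o for p, o in zip(preference, objectives))

def group_relative_advantages(returns, eps=1e-8):
    """GRPO-style advantages: standardize returns within one sampled group."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return [(r - mean) / (std + eps) for r in returns]

# Lower logits for the first two layers stand in for "vision-critical"
# layers that should be pruned less (illustrative values only).
plan = layer_sparsities(budget=0.5, layer_logits=[0.0, 0.0, 1.0, 1.0])

# Assumed objective order: utility, robustness, compression, stability.
score = scalarized_return([0.8, 0.6, 0.5, 0.9], [0.25, 0.25, 0.25, 0.25])
advantages = group_relative_advantages([0.70, 0.55, 0.62])
```

In this toy plan the first two layers receive a lower sparsity than the last two while the average stays at the requested budget, which is the kind of trade-off a preference-conditioned policy would be expected to steer.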
GazeMoE: Perception of Gaze Target with Mixture-of-Experts
Authors: Zhuangzhuang Dai, Zhongxi Lu, Vincent G. Zakka, Luis J. Manso, Jose M Alcaraz Calero, Chen Li
Venue: ICRA 2026
First: 2026-03-06T13:16:29+00:00 · Latest: 2026-03-06T13:16:29+00:00
Comments: 8 pages, 3 figures, ICRA 2026
Abstract
Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues -- including eyes, head poses, gestures, and contextual features -- demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at https://huggingface.co/zdai257/GazeMoE
中文标题/摘要
标题:GazeMoE:混合专家模型的注视目标感知
从可见图像中估计人类的注视目标是机器人理解人类注意力的关键任务,但开发通用的神经架构和训练范式仍然具有挑战性。尽管预训练的视觉基础模型的最新进展为定位注视目标提供了有希望的途径,但多模态线索(包括眼睛、头部姿态、手势和上下文特征)的整合需要适应性和高效的解码机制。受大型视觉-语言模型中混合专家(MoE)适应领域专业知识的启发,我们提出了一种新颖的端到端框架GazeMoE,该框架通过MoE模块选择性地利用冻结的基础模型中的注视目标相关线索。为了解决注视目标分类中的类别不平衡(框内 vs. 框外)并增强鲁棒性,GazeMoE 结合了类别平衡的辅助损失以及包括区域特定裁剪和光度变换在内的战略数据增强。在基准数据集上的广泛实验表明,我们的GazeMoE 达到了最先进的性能,在具有挑战性的注视估计任务中优于现有方法。代码和预训练模型已发布在 https://huggingface.co/zdai257/GazeMoE
Summary / 总结
GazeMoE is a novel end-to-end framework that uses Mixture-of-Experts (MoE) to selectively leverage gaze-target-related cues from a frozen foundation model for estimating human gaze targets. It incorporates class-balancing auxiliary loss and strategic data augmentations to address class imbalance and enhance robustness. Experiments show that GazeMoE outperforms existing methods on gaze estimation tasks, achieving state-of-the-art performance on benchmark datasets.
GazeMoE 是一种使用 Mixture-of-Experts (MoE) 的端到端框架,从冻结的基础模型中选择性地利用与注视目标相关的线索进行注视目标估计。它通过引入类别平衡辅助损失和战略数据增强来解决类别不平衡问题并增强鲁棒性。实验表明,GazeMoE 在注视估计任务上优于现有方法,实现了基准数据集上的最先进性能。
NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving
Authors: Kai Luo, Xu Wang, Rui Fan, Kailun Yang
First: 2026-03-06T13:12:28+00:00 · Latest: 2026-03-06T13:12:28+00:00
Comments: Code will be available at https://github.com/xifen523/NOVA
Abstract
Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and "semantic-blind" heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an innovative paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve fine-grained semantic ambiguities and maintain identity consistency across complex long-range sequences through high-level commonsense reasoning. Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate the superior performance of NOVA. Notably, on the nuScenes dataset, NOVA achieves an AMOTA of 22.41% for Novel categories, yielding a significant 20.21% absolute improvement over the baseline. These gains are realized through a compact 0.5B autoregressive model. Code will be available at https://github.com/xifen523/NOVA.
中文标题/摘要
标题:NOVA:面向自动驾驶三维多目标跟踪的下一步开放词汇自回归
在开放世界感知中,跨未知目标的一般化至关重要,但现有的三维多目标跟踪(3D MOT)管道仍然受限于封闭集假设和“语义盲”的启发式方法。为了解决这一问题,我们提出了Next-step Open-Vocabulary Autoregression(NOVA),这是一种创新的范式,将三维跟踪从传统的基于距离的片段匹配转向生成时空语义建模。NOVA 将三维轨迹重新定义为结构化的时空语义序列,能够同时编码物理运动连续性和深层次的语义先验。通过利用大型语言模型(LLMs)的自回归能力,我们将跟踪任务转化为一个有原则的序列完成过程。这种机制使模型能够明确利用语言空间的层次结构来解决细微的语义歧义,并通过高层次常识推理在复杂的长序列中保持身份一致性。在 nuScenes、V2X-Seq-SPD 和 KITTI 上的广泛实验表明,NOVA 的性能优于现有方法。特别是在 nuScenes 数据集上,NOVA 在 Novel 类别的 AMOTA 达到 22.41%,相对于基线实现了 20.21% 的绝对改进。这些收益是通过一个紧凑的 0.5B 自回归模型实现的。
Summary / 总结
NOVA proposes a new approach for 3D multi-object tracking in autonomous driving by reformulating the problem as a generative spatio-temporal semantic modeling task. It leverages the autoregressive capabilities of Large Language Models to encode physical motion continuity and linguistic priors, treating tracking as next-step sequence completion so that unknown targets are handled better. On nuScenes, NOVA reaches 22.41% AMOTA on Novel categories, a 20.21% absolute improvement over the baseline, using a compact 0.5B autoregressive model; experiments also cover V2X-Seq-SPD and KITTI.
NOVA通过将轨迹重新定义为结构化的时空语义序列,并利用大型语言模型的自回归能力,提出了3D多目标跟踪的新范式。这种方法解决了现有封闭集假设和语义盲启发式算法的局限性。NOVA显著提高了性能,在nuScenes数据集上实现了20.21%的绝对AMOTA改进,对于新类别,使用的是一个紧凑的0.5B自回归模型。
Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
Authors: Xiaoxing You, Qiang Huang, Lingyu Li, Xiaojun Chang, Jun Yu
Venue: CVPR 2026
First: 2026-03-06T12:29:33+00:00 · Latest: 2026-03-06T12:29:33+00:00
Comments: Accepted to CVPR 2026
Abstract
Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce CoE, a training-free MMS framework that performs structured reasoning through a Chain-of-Events guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, CoE localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that CoE consistently outperforms state-of-the-art video CoT baselines, achieving average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.
中文标题/摘要
标题:直击核心:无需训练的多模态总结通过事件链
多模态总结(MMS)旨在通过理解并整合视频、转录和图像中的信息来生成简洁的文本摘要。然而,现有方法仍然面临三个主要挑战:(1)依赖于特定领域的监督,(2)模态间融合时缺乏强跨模态定位,(3)缺乏事件过渡的扁平时间建模。为了解决这些问题,我们引入了**CoE**,一种无需训练的MMS框架,通过由层次事件图(HEG)引导的**事件链**进行结构化推理。HEG将文本语义编码为明确的事件层次结构,支撑跨模态定位和时间推理。在这一结构的引导下,**CoE**定位关键视觉线索,建模事件演变和因果过渡,并通过轻量级风格适应进行领域对齐以优化输出。在八个不同数据集上的广泛实验表明,**CoE**在视频CoT基线中始终表现出色,平均分别获得**+3.04 ROUGE**,**+9.51 CIDEr**和**+1.88 BERTScore**的提升,突显了其稳健性、可解释性和跨域泛化能力。我们的代码可在https://github.com/youxiaoxing/CoE获取。
Summary / 总结
The paper introduces CoE, a training-free Multimodal Summarization (MMS) framework that performs structured reasoning through a Chain-of-Events guided by a Hierarchical Event Graph (HEG), which encodes textual semantics into an explicit event hierarchy to support cross-modal grounding and temporal reasoning. On eight datasets, CoE outperforms state-of-the-art video CoT baselines, with average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore.
论文通过引入基于层级事件图(HEG)引导的训练-free 框架 CoE,解决了现有多模态总结(MMS)方法的三大挑战,即依赖领域特定监督、跨模态融合弱跨模态接地以及扁平时间建模。CoE 通过链式事件进行结构化推理,定位关键视觉线索,建模事件演变和因果过渡,并通过轻量级风格适应进行领域对齐。实验表明,CoE 在八个不同数据集上优于最先进的视频 CoT 基线,分别在 ROUGE、CIDEr 和 BERTScore 上提高了 3.04、9.51 和 1.88。
Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding
Authors: Hoseong Ahn, Jeongyun Chae, Yoonji Park, Kyuhong Shim
First: 2026-03-06T12:04:12+00:00 · Latest: 2026-03-06T12:04:12+00:00
Comments: Submitted to Interspeech 2026
Abstract
Long-form speech recognition with large encoder-decoder models such as Whisper often exhibits hallucinations, repetition loops, and content omissions. These errors can accumulate and be further amplified when the previous segment's transcription is used as decoding context. We propose Whisper-CD, a training-free contrastive decoding framework that contrasts clean-audio logits against negative logits computed from three acoustically motivated perturbations: Gaussian noise injection, silence signal, and audio temporal shift. We aggregate these negatives via the log-sum-exp operator, building a unified multi-negative objective for token-by-token decoding. Across five English long-form benchmarks, Whisper-CD reduces WER by up to 24.3pp on CORAAL and shows 48% faster token generation throughput than beam search. Because Whisper-CD operates purely at inference time, it can be applied as a drop-in replacement to already-deployed Whisper systems without retraining.
中文标题/摘要
标题:Whisper-CD:使用多负对比解码的大模型长时语音识别
使用Whisper等大型编码器-解码器模型进行长时语音识别时,往往会表现出幻觉、重复循环和内容遗漏等问题。这些错误在使用前一段转录作为解码上下文时会进一步累积和放大。我们提出了一种无需训练的对比解码框架Whisper-CD,该框架将干净音频的logits与从三种基于声学的扰动中计算出的负logits进行对比:高斯噪声注入、静默信号和音频时间移位。我们通过log-sum-exp算子聚合这些负logits,构建一个统一的多负目标,用于逐token解码。在五个英语长时语音基准测试中,Whisper-CD将WER降低了最多24.3个百分点,并且比束搜索快48%的token生成吞吐量。由于Whisper-CD仅在推理时运行,因此可以作为插件替代品应用于已部署的Whisper系统,无需重新训练。
Summary / 总结
Whisper-CD is a training-free contrastive decoding framework designed to improve long-form speech recognition accuracy. It uses multi-negative contrastive decoding with Gaussian noise injection, silence signal, and audio temporal shift as negative examples to reduce hallucinations and content omissions. Across five English long-form benchmarks, Whisper-CD significantly reduces Word Error Rate (WER) by up to 24.3 percentage points on the CORAAL dataset and offers 48% faster token generation throughput compared to beam search.
Whisper-CD旨在通过解决大型编码器-解码器模型如Whisper中的幻听、重复和内容遗漏问题来改进长语音识别。它采用了一种对比解码框架,将干净音频的logits与从高斯噪声注入、静默信号和音频时间移位中计算的负logits进行对比。在五个英语基准测试中,Whisper-CD将词错误率最多降低了24.3个百分点,并且与束搜索相比,其标记生成吞吐量提高了48%。由于Whisper-CD仅在推理时运行,因此可以无缝集成到已部署的Whisper系统中而无需重新训练。
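The multi-negative contrast described in the Whisper-CD entry above can be sketched in a few lines. The abstract only states that negative logits from three perturbations are aggregated with log-sum-exp and contrasted against the clean-audio logits; the exact scoring rule below (`clean - alpha * pooled`), the `alpha` weight, the toy three-token vocabulary, and the function names are assumptions for illustration, not the paper's formulation.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def contrastive_scores(clean_logits, negative_logits, alpha=1.0):
    """Score each token as its clean logit minus the pooled negative logit.

    `negative_logits` holds one logit list per perturbed input. The
    subtraction rule and `alpha` are assumed, not taken from the paper.
    """
    scores = []
    for token_id, clean in enumerate(clean_logits):
        pooled = log_sum_exp([neg[token_id] for neg in negative_logits])
        scores.append(clean - alpha * pooled)
    return scores

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Toy logits over a 3-token vocabulary (made-up numbers).
clean = [2.0, 1.9, 0.1]
negatives = [
    [2.0, 0.5, 0.0],  # Gaussian-noise input
    [2.5, 0.2, 0.1],  # silence input
    [1.8, 0.4, 0.0],  # temporally shifted input
]
greedy_token = argmax(clean)
contrastive_token = argmax(contrastive_scores(clean, negatives))
```

In this toy case token 0 scores highly even when the audio is noise or silence, so plain greedy decoding picks it while the contrastive score prefers token 1, whose evidence comes from the clean audio.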
Making Training-Free Diffusion Segmentors Scale with the Generative Power
Authors: Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang
Venue: CVPR 2026
First: 2026-03-06T11:35:37+00:00 · Latest: 2026-03-06T11:35:37+00:00
Comments: Accepted to CVPR 2026
Abstract
As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are called training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation. (ii) Even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Codes are at https://github.com/Darkbblue/goca.
中文标题/摘要
标题:利用生成能力使无训练扩散分割器扩展
作为强大的生成模型,文本到图像的扩散模型最近被探索用于判别任务。一系列研究致力于在无需进一步训练的情况下,将预训练的扩散模型适应于语义分割,从而产生了无训练的扩散分割器。这些方法通常依赖于模型注意力层的交叉注意力图,这些图被认为捕捉了图像像素和文本标记之间的语义关系。理想情况下,此类方法应受益于更强大的扩散模型,即更强的生成能力应导致更好的分割。然而,我们观察到现有方法往往无法相应地扩展。为了理解这一问题,我们识别了两个潜在的差距:(i) 交叉注意力在多个头和层之间计算,但这些单独的注意力图与统一的全局表示之间存在差异。(ii) 即使有全局图,它也无法直接转化为准确的语义相关性,因为不同文本标记之间的评分不平衡。为了弥合这些差距,我们提出了两种技术:自动聚合和逐像素重新缩放,这两者共同使无训练分割能够更好地利用生成能力。我们在标准语义分割基准上评估了我们的方法,并进一步将其集成到生成技术中,展示了更好的性能和更广泛的适用性。代码在 https://github.com/Darkbblue/goca.
Summary / 总结
This paper addresses why training-free diffusion segmentors fail to scale with the generative power of diffusion models. It identifies two gaps in existing methods: a discrepancy between individual per-head, per-layer cross-attention maps and a unified global representation, and score imbalances among text tokens that keep even a global map from giving accurate semantic correlation. To bridge these gaps, the authors propose auto aggregation and per-pixel rescaling. The approach is evaluated on standard semantic segmentation benchmarks and integrated into a generative technique, showing improved performance and broad applicability.
本文探讨了如何利用扩散模型的生成能力来扩展训练-free 的扩散分割器。研究发现两个关键问题:个体注意力图与统一全局表示之间的差异,以及不同文本标记之间的评分不平衡。为此,作者提出了自动聚合和逐像素重新缩放的技术。在标准分割基准上的评估显示了性能的提升,并将方法集成到生成技术中,增强了其广泛应用性。代码可在 https://github.com/Darkbblue/goca 获取。
Reversible Inversion for Training-Free Exemplar-guided Image Editing
Authors: Yuke Li, Lianli Gao, Ji Zhang, Pengpeng Zeng, Lichuan Xiang, Hongkai Wen, Heng Tao Shen, Jingkuan Song
First: 2025-12-01T07:56:06+00:00 · Latest: 2026-03-06T11:23:13+00:00
Abstract
Exemplar-guided Image Editing (EIE) aims to modify a source image according to a visual reference. Existing approaches often require large-scale pre-training to learn relationships between the source and reference images, incurring high computational costs. As a training-free alternative, inversion techniques can be used to map the source image into a latent space for manipulation. However, our empirical study reveals that standard inversion is sub-optimal for EIE, leading to poor quality and inefficiency. To tackle this challenge, we introduce Reversible Inversion (ReInversion) for effective and efficient EIE. Specifically, ReInversion operates as a two-stage denoising process, which is first conditioned on the source image and subsequently on the reference. Besides, we introduce a Mask-Guided Selective Denoising (MSD) strategy to constrain edits to target regions, preserving the structural consistency of the background. Both qualitative and quantitative comparisons demonstrate that our ReInversion method achieves state-of-the-art EIE performance with the lowest computational overhead.
中文标题/摘要
标题:可逆反转以实现无需训练的示例引导图像编辑
示例引导图像编辑(EIE)旨在根据视觉参考修改源图像。现有方法通常需要大规模预训练以学习源图像和参考图像之间的关系,导致高计算成本。作为无需训练的替代方案,反转技术可以将源图像映射到潜在空间进行操作。然而,我们的实证研究表明,标准反转对于EIE来说是次优的,导致质量差且效率低。为了解决这一挑战,我们引入了**可逆反转(ReInversion)**以实现有效且高效的EIE。具体而言,ReInversion作为两阶段去噪过程,首先基于源图像,然后基于参考图像。此外,我们引入了掩码引导的选择性去噪(MSD)策略,以限制编辑仅限于目标区域,从而保持背景的结构一致性。定性和定量比较均表明,我们的ReInversion方法在最低计算开销的情况下实现了最先进的EIE性能。
Summary / 总结
The paper addresses the challenge of Exemplar-guided Image Editing (EIE) by proposing Reversible Inversion (ReInversion) as a training-free method. ReInversion uses a two-stage denoising process conditioned on the source and reference images, and includes a Mask-Guided Selective Denoising (MSD) strategy to focus edits on target regions. Experiments show that ReInversion outperforms existing methods in terms of both quality and computational efficiency, achieving state-of-the-art results in EIE.
论文提出了一种训练-free 方法 Reversible Inversion (ReInversion) 来解决 Exemplar-guided Image Editing (EIE) 的挑战。ReInversion 使用两阶段去噪过程,分别基于源图像和参考图像,并引入了 Mask-Guided Selective Denoising (MSD) 策略来集中编辑在目标区域。实验表明,ReInversion 在质量和计算效率上都优于现有方法,达到了 EIE 的最先进性能。
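The idea of constraining edits to a target region, as in the ReInversion entry above, is commonly realized by blending latents under a mask at each denoising step. The sketch below shows that generic blending rule only; whether Mask-Guided Selective Denoising works this way is an assumption, and `blend_latents` and the toy 2x2 "latents" are hypothetical.

```python
def blend_latents(edited, source, mask):
    """Keep edited values where mask is 1 and source values where it is 0.

    All arguments are equally shaped 2D lists. This is a generic masked
    blend, assumed here as a stand-in for the paper's MSD strategy.
    """
    return [
        [m * e + (1 - m) * s for e, s, m in zip(e_row, s_row, m_row)]
        for e_row, s_row, m_row in zip(edited, source, mask)
    ]

# Toy 2x2 latents: the mask selects the diagonal as the edit region.
source = [[1.0, 2.0], [3.0, 4.0]]
edited = [[9.0, 9.0], [9.0, 9.0]]
mask = [[1, 0], [0, 1]]
blended = blend_latents(edited, source, mask)
```

Outside the mask the source values pass through unchanged, which is what preserves the background structure in this kind of scheme.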
JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas
Authors: Sandeep Inuganti, Hideaki Kanayama, Kanta Shimizu, Mahdi Chamseddine, Soichiro Yokota, Didier Stricker, Jason Rambach
First: 2026-03-06T11:22:14+00:00 · Latest: 2026-03-06T11:22:14+00:00
Abstract
Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.
中文标题/摘要
标题:JOPP-3D:点云和全景图的联合开放词汇语义分割
跨视觉模态(如3D点云和全景图像)的语义分割仍然是一个具有挑战性的任务,主要由于标注数据的稀缺性和固定标签模型的有限适应性。在本文中,我们提出了JOPP-3D,这是一种联合利用全景图和点云数据的开放词汇语义分割框架,以实现基于语言的场景理解。我们将RGB-D全景图像转换为其相应的切线视角图像和3D点云,然后使用这些模态来提取和对齐基础的视觉-语言特征。这使得自然语言查询能够在输入的两种模态上生成语义掩码。在斯坦福-2D-3D-s和ToF-360数据集上的实验评估表明,JOPP-3D能够在全景和3D领域生成连贯且语义上有意义的分割。我们提出的方法在开放词汇和封闭词汇的2D和3D语义分割中取得了显著的改进。
Summary / 总结
The research aims to address the challenge of semantic segmentation across 3D point clouds and panoramic images by developing JOPP-3D, an open-vocabulary semantic segmentation framework. It converts RGB-D panoramic images into tangential perspective images and 3D point clouds to extract and align vision-language features, enabling natural language querying for semantic segmentation. Experiments on Stanford-2D-3D-s and ToF-360 datasets show that JOPP-3D produces coherent and semantically meaningful segmentations, outperforming state-of-the-art methods in both open and closed vocabulary settings.
研究动机是解决由于标注数据有限和固定标签模型限制而在3D点云和全景图像之间进行语义分割的挑战。主要方法是将RGB-D全景图像转换为切线视角图像和3D点云,然后联合利用这些模态来提取和对齐视觉-语言特征。关键实验发现表明,JOPP-3D可以在全景和3D领域之间生成连贯且语义上有意义的分割,相对于最先进的方法,在开放和封闭词汇语义分割方面取得了显著的改进。
A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement
Authors: Ruili Li, Jiayi Ding, Ruiyu Li, Yilun Jin, Shiwen Ge, Yuwen Zeng, Xiaoyong Zhang, Eichi Takaya, Jan Vrba, Noriyasu Homma
First: 2026-03-06T11:21:26+00:00 · Latest: 2026-03-06T11:21:26+00:00
Abstract
Semi-supervised learning (SSL) has emerged as a promising paradigm for breast ultrasound (BUS) image segmentation, but it often suffers from unstable pseudo labels under extremely limited annotations, leading to inaccurate supervision and degraded performance. Recent vision-language models (VLMs) provide a new opportunity for pseudo-label generation, yet their effectiveness on BUS images remains limited because domain-specific prompts are difficult to transfer. To address this issue, we propose a semi-supervised framework with training-free pseudo-label generation and label refinement. By leveraging simple appearance-based descriptions (e.g., dark oval), our method enables cross-domain structural transfer between natural and medical images, allowing VLMs to generate structurally consistent pseudo labels. These pseudo labels are used to warm up a static teacher that captures global structural priors of breast lesions. Combined with an exponential moving average teacher, we further introduce uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning to improve boundary discrimination. Experiments on four BUS datasets demonstrate that our method achieves performance comparable to fully supervised models even with only 2.5% labeled data, significantly outperforming existing SSL approaches. Moreover, the proposed paradigm is readily extensible: for other imaging modalities or diseases, only a global appearance description is required to obtain reliable pseudo supervision, enabling scalable semi-supervised medical image segmentation under limited annotations.
中文标题/摘要
标题:一种结合免训练伪标签生成与标签精炼的乳腺超声分割半监督框架
半监督学习(SSL)已成为乳腺超声(BUS)图像分割的一个有前途的范式,但其在极少量标注下往往遭受不稳定伪标签的困扰,导致监督不准确且性能下降。最近的视觉-语言模型(VLMs)为伪标签生成提供了新的机会,但由于领域特定提示难以转移,它们在BUS图像上的效果有限。 为了解决这一问题,我们提出了一种基于训练无监督生成伪标签和标签精炼的半监督框架。通过利用简单的外观描述(例如,深色椭圆),我们的方法在自然图像和医学图像之间实现了跨领域的结构转移,使VLMs能够生成结构一致的伪标签。这些伪标签用于预热一个静态教师,该教师捕捉乳腺病变的全局结构先验。结合指数移动平均教师,我们进一步引入了不确定性熵加权融合和自适应不确定性引导逆对比学习,以提高边界区分能力。 在四个BUS数据集上的实验表明,即使只有2.5%的标注数据,我们的方法也能达到与全监督模型相当的性能,显著优于现有SSL方法。此外,所提出的范式易于扩展:对于其他成像模态或疾病,只需一个全局外观描述即可获得可靠的伪监督,从而在有限标注下实现可扩展的半监督医学图像分割。
Summary / 总结
The paper proposes a semi-supervised framework for breast ultrasound image segmentation that uses training-free pseudo-label generation and label refinement. By leveraging simple appearance-based descriptions, the method enables cross-domain structural transfer between natural and medical images, allowing vision-language models to generate structurally consistent pseudo labels. The framework further improves boundary discrimination through uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning. Experiments show that the method achieves performance comparable to fully supervised models with only 2.5% labeled data, outperforming existing semi-supervised approaches.
研究旨在通过半监督框架提高乳腺超声图像分割的准确性。方法采用训练免费的伪标签生成和标签精炼,利用简单的外观描述生成结构一致的伪标签。结合指数移动平均教师和不确定性引导的学习,该方法增强了边界区分能力。实验表明,该方法在仅有2.5%标注数据的情况下,性能可与全监督模型媲美,显著优于现有半监督方法。
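Two ingredients named in the breast ultrasound entry above, an exponential moving average (EMA) teacher and entropy-weighted fusion of teacher predictions, can be sketched as follows. The EMA update is the standard one; the fusion rule (weighting each teacher by `exp(-entropy)` per pixel) and the function names are assumptions for illustration, since the abstract does not give the exact formula.

```python
import math

def ema_update(teacher, student, decay=0.99):
    """Standard EMA: move each teacher weight slightly toward the student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def binary_entropy(p, eps=1e-7):
    """Entropy of a foreground probability, clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def entropy_weighted_fusion(p_static, p_ema):
    """Fuse two per-pixel foreground probabilities.

    The less uncertain teacher gets the larger weight; using exp(-entropy)
    as the weight is an assumed choice, not the paper's stated rule.
    """
    fused = []
    for a, b in zip(p_static, p_ema):
        wa = math.exp(-binary_entropy(a))
        wb = math.exp(-binary_entropy(b))
        fused.append((wa * a + wb * b) / (wa + wb))
    return fused

new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
# Pixel 0: static teacher is confident, EMA teacher is maximally unsure.
fused = entropy_weighted_fusion([0.99, 0.50], [0.50, 0.50])
```

For the first pixel the fused value lies closer to the confident teacher's 0.99 than to the plain average of the two predictions, while the second pixel, where both teachers are equally unsure, stays at 0.5.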
FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models
Authors: Andrew Caunes, Thierry Chateau, Vincent Fremont
First: 2026-03-06T11:19:33+00:00 · Latest: 2026-03-06T11:19:33+00:00
Comments: 14 pages
Abstract
Semantic and panoptic occupancy prediction for road scene analysis provides a dense 3D representation of the ego vehicle's surroundings. Current camera-only approaches typically rely on costly dense 3D supervision or require training models on data from the target domain, limiting deployment in unseen environments. We propose FreeOcc, a training-free pipeline that leverages pretrained foundation models to recover both semantics and geometry from multi-view images. FreeOcc extracts per-view panoptic priors with a promptable foundation segmentation model and prompt-to-taxonomy rules, and reconstructs metric 3D points with a reconstruction foundation model. Depth- and confidence-aware filtering lifts reliable labels into 3D, which are fused over time and voxelized with a deterministic refinement stack. For panoptic occupancy, instances are recovered by fitting and merging robust current-view 3D box candidates, enabling instance-aware occupancy without any learned 3D model. On Occ3D-nuScenes, FreeOcc achieves 16.9 mIoU and 16.5 RayIoU train-free, on par with state-of-the-art weakly supervised methods. When employed as a pseudo-label generation pipeline for training downstream models, it achieves 21.1 RayIoU, surpassing the previous state-of-the-art weakly supervised baseline. Furthermore, FreeOcc sets new baselines for both train-free and weakly supervised panoptic occupancy prediction, achieving 3.1 RayPQ and 3.9 RayPQ, respectively. These results highlight foundation-model-driven perception as a practical route to training-free 3D scene understanding.
中文标题/摘要
标题:FreeOcc:基于基础模型的无需训练全景占用预测
语义和全景占用预测为道路场景分析提供了 ego 车辆周围环境的密集 3D 表示。当前仅基于相机的方法通常依赖于昂贵的密集 3D 监督或需要在目标域数据上训练模型,限制了在未见过的环境中部署。我们提出了 FreeOcc,这是一种无需训练的流水线,利用预训练的基础模型从多视图图像中恢复语义和几何。FreeOcc 使用可提示的基础分割模型和提示到分类规则提取每视图的全景先验,并使用重建基础模型重建度量 3D 点。深度和置信度感知过滤将可靠的标签提升到 3D,并在时间上融合并用确定性细化堆栈进行体素化。对于全景占用,通过拟合和合并当前视图的鲁棒 3D 盒候选对象恢复实例,无需任何学习的 3D 模型即可实现实例感知的占用。在 Occ3D-nuScenes 上,FreeOcc 达到了 16.9 mIoU 和 16.5 RayIoU 无需训练,与最先进的弱监督方法相当。当作为生成下游模型训练伪标签的流水线时,它达到了 21.1 RayIoU,超过了之前的弱监督基准。此外,FreeOcc 为无需训练和弱监督全景占用预测设定了新的基准,分别达到了 3.1 RayPQ 和 3.9 RayPQ。这些结果突显了基础模型驱动的感知作为无需训练的 3D 场景理解的实用途径。
Summary / 总结
FreeOcc is a training-free pipeline that uses pretrained foundation models to predict semantic and geometric details of road scenes from multi-view images. It extracts panoptic priors using a promptable segmentation model and reconstructs 3D points with a reconstruction model. The method filters and fuses these labels to achieve panoptic occupancy prediction. On Occ3D-nuScenes, FreeOcc achieves 16.9 mIoU and 16.5 RayIoU without training, matching state-of-the-art weakly supervised methods. When used as a pseudo-label generator, it improves RayIoU to 21.1, surpassing previous weakly supervised baselines. It also sets new baselines for both train-free and weakly supervised panoptic occupancy prediction, achieving 3.1 RayPQ and 3.9 RayPQ respectively.
FreeOcc 是一个无需训练的管道,利用预训练的基础模型从多视角图像中预测道路场景的语义和几何细节。该方法使用可提示的分割模型提取全景先验,并使用重建模型重建 3D 点。该方法将可靠的标签过滤到 3D 空间并进行时间融合。FreeOcc 在 Occ3D-nuScenes 上无需训练即达到 16.9 mIoU 和 16.5 RayIoU,与最先进的弱监督方法相当;用作伪标签生成管道训练下游模型时,达到 21.1 RayIoU,超过了此前的弱监督基线。此外,它还为无需训练和弱监督的全景占用预测设定了新的基准,分别达到 3.1 RayPQ 和 3.9 RayPQ。
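The filtering and voxelization step described in the FreeOcc abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `voxelize_labels`, the point tuple layout, the thresholds `min_conf` and `max_depth`, the voxel size, and the majority-vote fusion rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def voxelize_labels(points, voxel_size=0.4, min_conf=0.5, max_depth=50.0):
    """Fuse labeled 3D points into a sparse voxel grid by majority vote.

    points: iterable of (x, y, z, depth, confidence, label) tuples.
    All thresholds are hypothetical defaults, not values from the paper.
    """
    votes = defaultdict(Counter)
    for x, y, z, depth, conf, label in points:
        # Depth- and confidence-aware filtering: drop unreliable points.
        if conf < min_conf or depth > max_depth:
            continue
        # Map the metric coordinate to an integer voxel index.
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        votes[key][label] += 1
    # Each occupied voxel takes the most frequent surviving label.
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


demo_points = [
    (0.1, 0.1, 0.1, 5.0, 0.9, "car"),
    (0.2, 0.3, 0.1, 5.0, 0.8, "car"),
    (0.3, 0.2, 0.2, 5.0, 0.9, "road"),
    (0.1, 0.1, 0.1, 5.0, 0.1, "tree"),   # rejected: low confidence
    (1.0, 0.0, 0.0, 80.0, 0.9, "bus"),   # rejected: too far away
]
grid = voxelize_labels(demo_points)
```

Points observed at different times can be passed in one list, so temporal fusion reduces to accumulating votes in the same dictionary. The deterministic refinement stack mentioned in the abstract is not modeled here.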
Reflective Flow Sampling Enhancement
Authors: Zikai Zhou, Muyao Wang, Shitong Shao, Lichen Bai, Haoyi Xiong, Bo Han, Zeke Xie
First: 2026-03-06T11:17:37+00:00 · Latest: 2026-03-06T11:17:37+00:00
Abstract
The growing demand for text-to-image generation has led to rapid advances in generative modeling. Recently, text-to-image diffusion models trained with flow matching algorithms, such as FLUX, have achieved remarkable progress and emerged as strong alternatives to conventional diffusion models. At the same time, inference-time enhancement strategies have been shown to improve the generation quality and text-prompt alignment of text-to-image diffusion models. However, these techniques are mainly applicable to conventional diffusion models and usually fail to perform well on flow models. To bridge this gap, we propose Reflective Flow Sampling (RF-Sampling), a theoretically-grounded and training-free inference enhancement framework explicitly designed for flow models, especially for the CFG-distilled variants (i.e., models distilled from CFG guidance techniques), like FLUX. Departing from heuristic interpretations, we provide a formal derivation proving that RF-Sampling implicitly performs gradient ascent on the text-image alignment score. By leveraging a linear combination of textual representations and integrating them with flow inversion, RF-Sampling allows the model to explore noise spaces that are more consistent with the input prompt. Extensive experiments across multiple benchmarks demonstrate that RF-Sampling consistently improves both generation quality and prompt alignment. Moreover, RF-Sampling is also the first inference enhancement method that can exhibit test-time scaling ability to some extent on FLUX.
中文标题/摘要
标题:反射流采样增强
文本到图像生成需求的增长推动了生成模型的快速发展。最近,使用流匹配算法(如FLUX)训练的文本到图像扩散模型取得了显著进展,并成为传统扩散模型的强大替代品。同时,推理时增强策略已被证明可以提高文本到图像扩散模型的生成质量和文本提示对齐。然而,这些技术主要适用于传统扩散模型,通常在流模型上表现不佳。为弥合这一差距,我们提出了一种理论依据明确且无需训练的推理增强框架——反射流采样(RF-Sampling),专门设计用于流模型,特别是CFG提炼变体(即从CFG指导技术提炼的模型),如FLUX。不同于启发式解释,我们提供了一种形式化推导,证明RF-Sampling隐式执行了文本图像对齐得分的梯度上升。通过利用文本表示的线性组合并将其与流反转集成,RF-Sampling使模型能够探索与输入提示更一致的噪声空间。在多个基准上的广泛实验表明,RF-Sampling始终能够提高生成质量和提示对齐。此外,RF-Sampling也是首个在一定程度上能够展示测试时缩放能力的推理增强方法,适用于FLUX。
Summary / 总结
The paper proposes Reflective Flow Sampling (RF-Sampling), an inference enhancement framework designed for flow models, particularly CFG-distilled variants like FLUX. It improves text-to-image generation quality and prompt alignment by implicitly performing gradient ascent on the text-image alignment score. Experiments show consistent improvements in generation quality and prompt alignment across multiple benchmarks, and RF-Sampling is the first method to exhibit test-time scaling ability on FLUX.
随着对文本到图像生成的需求增长,生成模型取得了显著进展,其中流匹配算法如FLUX表现出色。为了提升流模型,尤其是CFG提炼变体的性能,作者提出了Reflective Flow Sampling (RF-Sampling),一种无需训练的推理增强框架。RF-Sampling通过隐式进行文本-图像对齐得分的梯度上升来提升生成质量和文本提示对齐。跨多个基准的实验显示,RF-Sampling在生成质量和提示对齐方面均表现出一致的改进,且是首个在FLUX上展示测试时扩展能力的方法。
VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models
Authors: Rohit Saxena, Alessandro Suglia, Pasquale Minervini
First: 2026-03-06T10:58:02+00:00 · Latest: 2026-03-06T10:58:02+00:00
Abstract
Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but we still do not fully understand how they perform under real-world image distortions. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severity glass_blur reduces MMBench accuracy by about 8 pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g., upsample, elastic_transform), reaching up to 34 pp. Overall, our findings suggest current VLMs are semantically strong but spatially fragile, motivating the definition of novel robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.
中文标题/摘要
标题:VLM-RobustBench:视觉语言模型鲁棒性综合基准
视觉语言模型(VLMs)在标准高质量数据集上表现出强大的性能,但我们仍然不清楚它们在真实世界图像失真下的表现如何。我们提出了VLM-RobustBench基准,该基准涵盖了49种增强类型,包括噪声、模糊、天气、数字和几何扰动,并在不同程度(低/中/高)和二元变换下进行评估,共产生了133种受污染的设置。我们对四个模型家族(Qwen、InternVL、Molmo、Gemma)在两个互补基准(MMBench(视觉接地)和MMMU-Pro(推理导向))上进行了评估。我们的结果表明,视觉严重程度只是难度的弱预测指标:低程度的空间扰动往往比视觉上严重的光度污染对性能的损害更大。特别是,低程度的glass_blur平均使MMBench的准确性降低了约8个百分点,而最大的下降来自于重采样和几何畸变(例如,upsample、elastic_transform),最高可达34个百分点。总体而言,我们的研究结果表明当前的VLMs在语义上很强但空间上很脆弱,这促使我们定义新的鲁棒性评估协议和训练方案,强调重采样和几何不变性。
Summary / 总结
VLM-RobustBench is a benchmark that evaluates the robustness of vision-language models (VLMs) under 49 types of real-world image distortions, including noise, blur, weather, digital, and geometric perturbations, across 133 corrupted settings. The study finds that visual severity is not a reliable predictor of difficulty, with low-severity spatial perturbations often degrading performance more than visually severe corruptions. Notably, low-severity glass_blur reduces MMBench accuracy by about 8 percentage points on average, while resampling and geometric distortions can cause up to a 34 percentage point drop. These results highlight the need for new robustness evaluation protocols and training methods to improve spatial invariances in VLMs.
VLM-RobustBench 通过133种不同的图像扭曲设置评估了视觉语言模型(VLMs)的鲁棒性,包括49种类型的图像失真。研究发现,视觉严重程度并不是难度的可靠指标;低程度的空间扭曲往往比视觉严重的失真更能降低性能。具体来说,低程度的glass_blur平均使MMBench的准确性降低约8个百分点,而重采样和几何扭曲可能导致高达34个百分点的下降。这表明当前的VLMs在语义上很强,但在空间上很脆弱,需要新的鲁棒性评估和训练方法。
Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Xiang An, Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang
Venue: CVPR 2026
First: 2025-10-21T13:36:58+00:00 · Latest: 2026-03-06T10:50:23+00:00
Comments: 25 pages, 17 figures
Abstract
Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploit the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code is available at https://github.com/zhangquanchen/3DThinker.
中文标题/摘要
标题:三维思考:基于有限视角的几何想象与空间推理
尽管近期视觉-语言模型(VLMs)在多种跨模态任务中取得了显著进展,但从有限视角理解三维空间关系仍然是一个重大挑战。以往的推理方法通常依赖纯文本(例如拓扑认知图)或二维视觉线索。然而,它们有限的表示能力阻碍了在需要三维空间想象的任务中的表现。为了解决这一限制,我们提出了3DThinker框架,该框架能够在推理过程中有效利用图像中嵌入的丰富几何信息,类似于人类的思考方式。我们的框架是首个在推理过程中启用三维思考而无需任何三维先验输入的框架,并且在训练过程中不依赖于明确标注的三维数据。具体而言,我们的训练分为两个阶段。首先,我们进行监督训练,以使VLM在推理过程中生成的三维潜在表示与三维基础模型(例如VGGT)生成的三维潜在表示对齐。然后,我们仅基于结果信号优化整个推理过程,从而细化底层的三维思考。在多个基准测试中的广泛实验表明,3DThinker始终优于强基线,并为将三维表示统一到跨模态推理中提供了新的视角。我们的代码可在https://github.com/zhangquanchen/3DThinker获取。
Summary / 总结
The research aims to improve the ability of vision-language models to understand 3D spatial relationships from limited views, which is challenging for existing methods. The proposed 3DThinker framework incorporates geometric information from images to enhance spatial reasoning, without requiring explicit 3D data. It consists of two stages: supervised training to align 3D latent representations and optimization based on outcome signals. Experiments show that 3DThinker outperforms strong baselines and provides a new approach to integrating 3D reasoning into multimodal tasks.
研究旨在解决从有限视角理解3D空间关系的难题,这给现有的视觉-语言模型带来了挑战。提出了一种名为3DThinker的框架,该框架通过利用图像中的几何信息增强推理能力,能够在无需3D数据的情况下实现3D空间想象。该框架分为两个阶段:监督训练以对齐3D潜在表示,并基于结果优化推理轨迹。实验表明3DThinker在多个基准测试中优于强基线,并提供了一种将3D表示统一到多模态推理的新方法。
Spatial Colour Mixing Illusions as a Perception Stress Test for Vision-Language Models
Authors: Nicoleta-Nina Basoc, Adrian Cosma, Emilian Radoi
First: 2026-03-06T10:50:04+00:00 · Latest: 2026-03-06T10:50:04+00:00
Abstract
Vision-language models (VLMs) achieve strong benchmark results, yet can exhibit systematic perceptual weaknesses: structured, large changes to pixel values can cause confident yet nonsensical predictions, even when the underlying scene remains easily recognizable to humans. We study this gap using Spatial Colour Mixing, a programmatic family of colour distortions that overlays structured patterns (in both RGB and Ostwald colour systems) onto natural images. We introduce a framework of eight spatial colour mixing variants and evaluate nine VLMs across three model families on four datasets. Across models and datasets, accuracy degrades sharply with increasing distortion, and scaling the language model does not reliably mitigate the failure. In a human study with 61 participants on an animal recognition dataset, humans substantially outperform VLMs under the same distortions. Finally, we show that a simple human-inspired preprocessing step recovers a meaningful portion of performance for several distortion types, motivating perception-aware preprocessing and tool-use as practical strategies for improving VLM robustness.
中文标题/摘要
标题:空间色彩混合错觉作为视觉-语言模型感知压力测试
视觉-语言模型(VLMs)在基准测试中表现出色,但可能会表现出系统性的感知弱点:结构化的、大规模的像素值变化会导致自信但无意义的预测,即使在人类仍然可以轻松识别场景的情况下也是如此。我们使用空间色彩混合来研究这一差距,这是一种色彩失真程序化家族,它在自然图像上叠加了结构化的模式(在RGB和奥斯特瓦德色彩系统中)。我们引入了八种空间色彩混合变体,并在三个模型家族的九种VLMs上对四个数据集进行了评估。在模型和数据集之间,准确率随着失真的增加而急剧下降,扩展语言模型并不能可靠地缓解这种失败。在针对动物识别数据集的61名参与者的人类研究中,人类在相同失真的情况下显著优于VLMs。最后,我们展示了简单的基于人类启发的预处理步骤可以恢复多种失真类型的部分性能,这激励了感知意识的预处理和工具使用作为提高VLM鲁棒性的实用策略。
Summary / 总结
This study investigates the perceptual weaknesses of vision-language models (VLMs) using spatial colour mixing illusions, which overlay structured patterns onto natural images. Nine VLMs from three model families were evaluated on four datasets, showing a sharp decline in accuracy with increasing distortion. Scaling the language model did not reliably improve performance. Human participants outperformed VLMs under the same distortions in a recognition task, highlighting the need for perception-aware preprocessing to enhance VLM robustness.
研究使用空间色彩混合错觉来考察视觉-语言模型(VLMs)的感知弱点,这种错觉涉及在自然图像上叠加结构化图案。评估了来自三个模型家族的九个VLMs在四个数据集上的表现,结果显示随着扭曲程度的增加,准确率急剧下降。增加语言模型的规模并不能可靠地改善性能。在一项动物识别任务中,人类参与者在相同扭曲下比VLMs表现更好,这强调了需要采用感知导向的预处理步骤来提升VLM的鲁棒性。
FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment
Authors: Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Helen Oleynikova, Stefan Leutenegger
First: 2025-04-11T15:12:05+00:00 · Latest: 2026-03-06T09:49:58+00:00
Comments: 11 pages, 5 figures
Abstract
Geometrically accurate and semantically expressive map representations have proven invaluable for robot deployment and task planning in unknown environments. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments still presents open challenges, mainly due to computational requirements. In this paper we present FindAnything, an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything combines pure geometric and open-vocabulary semantic information for a higher level of understanding. It proposes an efficient storage of open-vocabulary information through the aggregation of features at the object level. Pixelwise vision-language features are aggregated based on eSAM segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. We demonstrate that FindAnything performs on par with the state-of-the-art in terms of semantic accuracy while being substantially faster and more memory-efficient, allowing its deployment in large-scale environments and on resource-constrained devices, such as MAVs. We show that the real-time capabilities of FindAnything make it useful for downstream tasks, such as autonomous MAV exploration in a simulated Search and Rescue scenario. Project Page: https://ethz-mrl.github.io/findanything/.
中文标题/摘要
标题:FindAnything:任意词汇和对象中心的映射框架,用于机器人在任意环境中的探索
几何上精确且语义上丰富的地图表示已被证明对于在未知环境中部署机器人和任务规划至关重要。然而,实时地对大规模未知环境进行开放词汇的语义理解仍然存在挑战,主要原因是计算需求。本文提出了一种名为FindAnything的开放世界映射框架,该框架将视觉-语言信息整合到密集的体积子地图中。通过使用视觉-语言特征,FindAnything结合了纯几何和开放词汇的语义信息,提高了理解水平。它通过在对象级别聚合特征来高效存储开放词汇信息。基于eSAM片段的像素级视觉-语言特征被聚合,并整合到对象中心的体积子地图中,提供了一种从开放词汇查询到3D几何的映射,该映射在内存使用方面也具有可扩展性。我们证明FindAnything在语义准确性方面与最先进的技术相当,但在速度和内存效率方面更具优势,使其能够在大规模环境中部署,并在资源受限的设备上运行,如MAVs。我们展示了FindAnything的实时能力使其在下游任务中具有实用性,例如在模拟的搜索和救援场景中自主MAV探索。项目页面:https://ethz-mrl.github.io/findanything/
Summary / 总结
FindAnything is an open-world mapping framework that integrates vision-language information into dense volumetric submaps to achieve both geometric accuracy and open-vocabulary semantic understanding. It aggregates features at the object level and integrates them into object-centric volumetric submaps, enabling efficient storage and scalable memory usage. Experimental results show that FindAnything matches the state-of-the-art in semantic accuracy while being faster and more memory-efficient, suitable for large-scale environments and resource-constrained devices like MAVs. It also demonstrates real-time capabilities useful for downstream tasks such as autonomous MAV exploration in simulated Search and Rescue scenarios.
FindAnything 是一种开放世界的映射框架,通过将视觉-语言信息整合到密集的体素子地图中来解决大规模未知环境中的实时语义理解挑战。通过在对象级别聚合特征并将其整合到对象为中心的体素子地图中,FindAnything 实现了与最先进的方法相当的语义准确性,同时更快且更节省内存,使其适合部署在大规模环境和资源受限设备如 MAVs 中。它展示了实时能力,适用于下游任务如模拟搜索和救援场景中的自主 MAV 探索。
Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
Authors: Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang
Venue: CVPR 2026
First: 2026-02-26T06:13:33+00:00 · Latest: 2026-03-06T09:34:23+00:00
Comments: Accepted by CVPR 2026
Abstract
Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by +0.031 ImageReward at 4.87× speedup and even surpassing the full-step baseline by +0.028 ImageReward at 3.54× speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code is available at https://github.com/argsss/DPCache.
中文标题/摘要
标题:降噪作为路径规划:基于DPCache的无训练加速扩散模型
扩散模型在图像和视频生成方面取得了显著的成功,但其实际部署仍受到多步迭代采样带来的大量计算开销的阻碍。在加速策略中,基于缓存的方法提供了一种无训练且有效的解决方案,通过在时间步之间重用或预测特征来实现加速。然而,现有方法依赖于固定或局部适应的时间表,而不考虑去噪轨迹的全局结构,这通常会导致误差累积和视觉伪影。为克服这一限制,我们提出了一种名为DPCache的新型无训练加速框架,将扩散采样的加速问题表述为全局路径规划问题。DPCache从少量校准集中构建路径感知代价张量,以量化在给定前一关键时间步的情况下跳过时间步的路径依赖误差。利用该张量,DPCache采用动态规划选择一个最优的关键时间步序列,以最小化总路径成本同时保持轨迹保真度。在推理过程中,模型仅在这些关键时间步进行完整计算,而中间输出则通过缓存特征高效预测。在DiT、FLUX和HunyuanVideo上的大量实验表明,DPCache在保持最小质量损失的情况下实现了显著加速,与先前的加速方法相比,在4.87倍加速下提高了0.031 ImageReward,在FLUX上3.54倍加速下提高了0.028 ImageReward,甚至超过了全步基线,验证了我们路径感知全局调度框架的有效性。代码可在https://github.com/argsss/DPCache获取。
Summary / 总结
DPCache is a training-free acceleration framework for diffusion models that formulates diffusion sampling as a path planning problem. It constructs a Path-Aware Cost Tensor from a calibration set to select key timesteps that minimize path cost while preserving trajectory fidelity. Experiments show DPCache achieves strong acceleration with minimal quality loss, outperforming prior methods by +0.031 ImageReward at 4.87x speedup and +0.028 ImageReward at 3.54x speedup on FLUX.
DPCache 是一种无需训练的加速框架,将扩散模型的加速问题表述为路径规划任务。它构建路径感知成本张量来确定最优的关键时间步序列,以最小化路径成本和误差累积。实验表明,DPCache 可实现显著的加速(最高 4.87 倍),同时保持质量损失最小,优于之前的加速方法在 DiT、FLUX 和 HunyuanVideo 上的表现。
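The key-timestep selection described in the DPCache abstract can be sketched as a small dynamic program. This is a simplified illustration, not the authors' code: it uses a hypothetical pairwise cost matrix `cost[i][j]` (the error of computing fully at step i and predicting every step up to the next key j from cache), so it drops the conditioning on the preceding key timestep that the Path-Aware Cost Tensor captures. The function name and the toy cost are assumptions.

```python
from math import inf


def select_key_timesteps(cost, num_keys):
    """Choose num_keys key timesteps minimizing the summed segment cost.

    cost[i][j] (i < j) is a hypothetical error for jumping from key i to key j.
    The first and last timesteps are always selected as keys.
    """
    n = len(cost)
    if not 2 <= num_keys <= n:
        raise ValueError("num_keys must be between 2 and the number of timesteps")
    # best[k][j]: minimal cost of a path that uses k + 1 keys and ends at key j.
    best = [[inf] * n for _ in range(num_keys)]
    parent = [[-1] * n for _ in range(num_keys)]
    best[0][0] = 0.0
    for k in range(1, num_keys):
        for j in range(k, n):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost[i][j]
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    parent[k][j] = i
    # Backtrack from the final timestep to recover the chosen keys.
    path = [n - 1]
    for k in range(num_keys - 1, 0, -1):
        path.append(parent[k][path[-1]])
    path.reverse()
    return path, best[num_keys - 1][n - 1]


# Toy cost: skipping more steps between two keys is quadratically worse.
n_steps = 7
toy_cost = [[(j - i - 1) ** 2 if j > i else 0 for j in range(n_steps)]
            for i in range(n_steps)]
keys, total = select_key_timesteps(toy_cost, num_keys=3)
```

With the toy cost the program spreads the keys evenly, which matches the intuition that long cached stretches accumulate the most error. Extending the state to (previous key, current key) would recover the path-dependent formulation at the price of one more loop.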
SpecFuse: Ensembling Large Language Models via Next-Segment Prediction
Authors: Bo Lv, Nayu Liu, Chen Tang, Xin Liu, Yue Yu, Ping Luo
First: 2024-12-10T10:27:41+00:00 · Latest: 2026-03-06T09:33:09+00:00
Comments: 15 pages, 5 figures
Abstract
Ensembles of generative large language models (LLMs) are a promising way to compensate for individual model limitations, integrating the strengths of different LLMs. Existing LLM ensemble methods, however, face limitations such as first-token delay and challenges in long-range semantic collaboration between models. Moreover, they typically assume equal voting weights for all models during ensemble, ignoring task-specific performance differences among models. In this work, we propose SpecEM, a training-free, plug-and-play LLM ensemble framework that dynamically adjusts each model's contribution in real time based on task performance. Inspired by speculative decoding, SpecEM iteratively performs drafting and verification, allowing models to collaborate semantically at the segment level for integrated output. Furthermore, we introduce an online feedback mechanism with multiplicative weight updates, where each model's voting weight is adjusted on-the-fly according to how often it outperforms others during the verification stage, ensuring that stronger models exert greater influence during ensembling. Experimental results on five LLM families (ranging from 7B to 72B parameters) and six benchmark datasets, spanning open-domain instruction following, reasoning, and commonsense, demonstrate consistent performance improvements compared to state-of-the-art LLM ensemble methods. Our code is available at https://github.com/lvbotenbest/SpecEM.
中文标题/摘要
标题:SpecFuse:通过下一段预测集成大型语言模型
生成型大型语言模型(LLM)的集成是一种弥补单个模型局限性的有前途的方法,整合不同LLM的优势。现有的LLM集成方法面临诸如首词延迟和模型之间长距离语义协作的挑战,此外,它们通常假设所有模型在集成中的投票权重相等,忽略了模型之间的任务特定性能差异。在本工作中,我们提出了一种无需训练、即插即用的LLM集成框架SpecEM,该框架能够根据任务性能实时动态调整每个模型的贡献。受推测性解码的启发,SpecEM 通过迭代的起草和验证过程,允许模型在段级上进行语义协作,以生成综合输出。此外,我们引入了一种在线反馈机制,其中每个模型的投票权重根据验证阶段中其相对于其他模型的性能调整,确保更强的模型在集成中发挥更大的影响。在包含从7B到72B参数的五个LLM家族和六个基准数据集(涵盖开放域指令跟随、推理、常识等)的实验中,与最先进的LLM集成方法相比,我们的方法显示出一致的性能改进。我们的代码可在https://github.com/lvbotenbest/SpecEM 获取。
Summary / 总结
SpecEM is a training-free LLM ensemble framework that dynamically adjusts each model's contribution based on task performance. Inspired by speculative decoding, it iteratively drafts and verifies segments, enabling semantic collaboration at the segment level. An online feedback mechanism with multiplicative weight updates gives stronger models more influence. Experiments on five LLM families and six benchmark datasets show consistent performance improvements over state-of-the-art LLM ensemble methods.
SpecEM 是一种无需训练的 LLM 联合框架,根据任务性能动态调整每个模型的贡献。它使用推测性解码进行迭代的草稿和验证,使模型能够在段落级别进行语义协作。该框架包含一个基于在线反馈的乘性权重更新机制,可以实时调整每个模型的投票权重。实验结果显示,在五个 LLM 家族和六个基准数据集上,与现有方法相比具有一致的性能改进。
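The online feedback mechanism in the SpecEM abstract can be illustrated with a generic multiplicative weight update. The exact rule used by the authors is not given in the abstract, so this sketch assumes an exponential update with a hypothetical learning rate `eta`, applied to whichever model wins a verification round; the function name is also an assumption.

```python
import math


def update_weights(weights, winner, eta=0.5):
    """Boost the winning model's voting weight, then renormalize to sum to 1.

    weights: current voting weights, one per model.
    winner: index of the model whose draft was preferred in this round.
    eta: hypothetical learning rate controlling how fast weights shift.
    """
    scaled = [w * math.exp(eta) if i == winner else w
              for i, w in enumerate(weights)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Three models start with equal weight; model 0 wins three rounds in a row.
weights = [1 / 3, 1 / 3, 1 / 3]
for _ in range(3):
    weights = update_weights(weights, winner=0)
```

Because the update is multiplicative, a model that wins repeatedly gains influence quickly, while renormalization keeps the weights usable as voting shares. A smaller `eta` makes the ensemble slower to commit to one model.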