arXiv 论文速递

2026-01-30 03:41
Snapshot: 20260130_0341
Splat Feature Solver
Authors: Butian Xiong, Rong Liu, Kenneth Xu, Meida Chen, Andrew Feng
Venue: ICLR 2026
First: 2025-08-17T03:13:06+00:00 · Latest: 2026-01-28T18:51:46+00:00
Comments: ICLR 2026 Accepted
Abstract
Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations. The core challenge lies in optimally assigning rich general attributes to 3D primitives while addressing the inconsistency issues from multi-view images. We present a unified, kernel- and feature-agnostic formulation of the feature lifting problem as a sparse linear inverse problem, which can be solved efficiently in closed form. Our approach admits a provable upper bound on the global optimal error under convex losses for delivering high quality lifted features. To address inconsistencies and noise in multi-view observations, we introduce two complementary regularization strategies to stabilize the solution and enhance semantic fidelity. Tikhonov Guidance enforces numerical stability through soft diagonal dominance, while Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing lifted features in minutes. Our code is available on GitHub (https://github.com/saliteta/splat-distiller/tree/main). We also provide a website (https://splat-distiller.pages.dev/) with more visualizations, as well as a video (https://www.youtube.com/watch?v=CH-G5hbvArM).
中文标题/摘要
标题:Splat特征求解器
特征提升已成为3D场景理解中的关键组成部分,能够将丰富的图像特征描述符(例如DINO、CLIP)附着到基于splat的3D表示上。核心挑战在于如何在解决多视图图像不一致性问题的同时,最优地将丰富的一般属性分配给3D基本体。我们提出了一种统一的、内核和特征无关的特征提升问题的稀疏线性逆问题形式化方法,可以高效地以闭式形式求解。我们的方法在凸损失下提供了全局最优误差的可证明上界,以提供高质量的提升特征。为了解决多视图观测中的不一致性和噪声,我们引入了两种互补的正则化策略来稳定解并增强语义保真度。Tikhonov引导通过软对角占优确保数值稳定性,而后提升聚合通过特征聚类过滤掉噪声输入。大量实验表明,我们的方法在开放词汇3D分割基准测试中达到了最先进的性能,优于基于训练、基于分组和启发式前向的基线方法,同时在几分钟内生成提升特征。我们的代码可在GitHub(https://github.com/saliteta/splat-distiller/tree/main)获得。我们还提供了网站(https://splat-distiller.pages.dev/)进行更多可视化展示,以及视频(https://www.youtube.com/watch?v=CH-G5hbvArM)。
Summary / 总结
The paper formulates feature lifting for 3D scene understanding as a sparse linear inverse problem that admits an efficient closed-form solution. Two regularization strategies, Tikhonov Guidance and Post-Lifting Aggregation, are introduced to stabilize the solution and enhance semantic fidelity. The approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing lifted features in minutes.
论文通过将特征提升问题表述为稀疏线性逆问题来解决3D场景理解中的挑战,并引入了两种正则化策略,Tikhonov Guidance和Post-Lifting Aggregation,以稳定解决方案并增强语义保真度。实验表明,所提出的方法在开放词汇3D分割基准测试中优于现有基线,并能快速生成高质量的提升特征。
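To make the "linear inverse problem with Tikhonov regularization" idea concrete, here is a minimal sketch, not the authors' solver. It assumes a tiny dense matrix `weights` (hypothetical: row = pixel, column = splat, entry = blending weight), a single feature channel `pixel_feats`, and plain ridge regression solved through the normal equations. The paper's sparse closed-form solver and its exact Tikhonov Guidance term are not described in the abstract, so the names and the regularizer below are illustrative assumptions.

```python
def ridge_lift(weights, pixel_feats, lam=1e-3):
    """Solve (A^T A + lam * I) x = A^T b for per-splat features x.

    weights[p][s]  : assumed blending weight of splat s at pixel p (matrix A)
    pixel_feats[p] : observed 2D feature value at pixel p (vector b)
    lam            : Tikhonov (ridge) strength; larger values strengthen the diagonal
    """
    n_pix, n_splat = len(weights), len(weights[0])
    # Build the regularized normal-equation matrix A^T A + lam * I.
    ata = [[sum(weights[p][i] * weights[p][j] for p in range(n_pix))
            + (lam if i == j else 0.0)
            for j in range(n_splat)] for i in range(n_splat)]
    atb = [sum(weights[p][i] * pixel_feats[p] for p in range(n_pix))
           for i in range(n_splat)]
    # Gaussian elimination with partial pivoting on the augmented system.
    aug = [row + [rhs] for row, rhs in zip(ata, atb)]
    for col in range(n_splat):
        pivot = max(range(col, n_splat), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n_splat):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n_splat + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n_splat
    for i in reversed(range(n_splat)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n_splat))
        x[i] = (aug[i][n_splat] - tail) / aug[i][i]
    return x


# Toy scene: three pixels observe two splats with known blending weights.
toy_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_pixels = [2.0, 4.0, 3.0]
lifted = ridge_lift(toy_weights, toy_pixels, lam=1e-3)
```

With `lam=0.0` the toy system is solved exactly; a positive `lam` shrinks the solution slightly in exchange for a better-conditioned system, which is the usual trade-off of ridge-style regularization.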
BlindSight: Harnessing Sparsity for Efficient Vision-Language Models
Authors: Tharun Adithya Srikrishnan, Deval Shah, Timothy Hein, Ahmed Hasssan, Stephen Youn, Steven K. Reinhardt
First: 2025-07-11T23:15:30+00:00 · Latest: 2026-01-28T18:45:01+00:00
Abstract
Large vision-language models (VLMs) enable joint processing of text and images. However, incorporating vision data significantly increases the prompt length, resulting in a longer time to first token (TTFT). This bottleneck can be alleviated by leveraging the inherent sparsity in the attention computation. Analyzing these attention patterns in VLMs when processing a series of images, we observe the absence of inter-image attention in a substantial portion of layers. Based on this, we propose BlindSight: an approach to optimize multi-image VLM inference using an input-template-aware attention sparsity mask with no runtime overhead. We utilize a dataset to derive a prompt-agnostic categorization for attention heads: Dense, Sink, Intra-Image, and Intra-Image+Sink. We develop a Triton-based GPU kernel to leverage this sparsity. BlindSight achieves a 1.8-3.2x speedup in the attention computation (prompt length 36K-300K). BlindSight generalizes across VLMs (Qwen2-VL, Qwen2.5-VL, Gemma 3), with only a 0.78% absolute accuracy degradation on average on multi-image comprehension benchmarks. Finally, we advocate for the design of efficient VLMs that combine BlindSight-inspired sparse and dense layers.
中文标题/摘要
标题:BlindSight:利用稀疏性提高视觉语言模型效率
大型视觉语言模型(VLMs)能够联合处理文本和图像。然而,引入视觉数据显著增加了提示长度,导致首次生成标记的时间变长。通过利用注意力计算中的固有稀疏性,可以缓解这一瓶颈。在处理一系列图像时分析VLMs的注意力模式,我们观察到在大量层中不存在跨图像注意力。基于此,我们提出BlindSight:一种利用输入模板感知的注意力稀疏性掩码优化多图像VLM推理的方法,且无运行时开销。我们利用数据集为注意力头提供一种提示无关的分类:密集型、汇流型、图像内和图像内+汇流型。我们开发了一个基于Triton的GPU内核以利用这种稀疏性。BlindSight在注意力计算中实现了1.8-3.2倍的加速(提示长度36K-300K)。BlindSight在不同VLMs(Qwen2-VL、Qwen2.5-VL、Gemma 3)上具有良好的泛化能力,平均在多图像理解基准测试上的绝对准确率下降仅为0.78%。最后,我们提倡设计结合BlindSight启发式稀疏和密集层的高效VLMs。
Summary / 总结
BlindSight optimizes multi-image vision-language model inference by utilizing the inherent sparsity in attention computation. By categorizing attention heads and applying an input-template-aware sparsity mask, BlindSight achieves a 1.8-3.2x speedup in attention computation without runtime overhead. It generalizes across different VLMs with minimal accuracy degradation on multi-image comprehension benchmarks.
BlindSight 通过利用注意力计算中的固有稀疏性来优化多图像视觉-语言模型的推理,实现1.8-3.2倍的注意力计算加速,且不增加运行时开销。它将注意力头分类为密集型、汇流型、图像内型和图像内+汇流型,并利用基于Triton的GPU内核进一步加速过程。该方法在不同VLM上的表现具有普适性,并且在多图像理解基准测试上的绝对准确率下降仅为0.78%。
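The "Intra-Image+Sink" head category can be pictured as a boolean attention mask. The sketch below is an illustration under stated assumptions, not BlindSight's Triton kernel: `segment_ids` is a hypothetical per-token list (image index for image tokens, `None` for text tokens), the first `num_sink` tokens are treated as sinks, and text queries are assumed to keep dense causal attention. None of these details are given in the abstract.

```python
def intra_image_sink_mask(segment_ids, num_sink=1):
    """Return mask[q][k] == True where query q may attend to key k.

    segment_ids[i] : image index for image tokens, None for text tokens (assumed layout)
    num_sink       : number of leading tokens treated as attention sinks (assumption)
    """
    n = len(segment_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: keys at or before the query only
            is_sink = k < num_sink
            same_image = (segment_ids[q] is not None
                          and segment_ids[q] == segment_ids[k])
            text_query = segment_ids[q] is None  # assumption: text stays dense
            mask[q][k] = is_sink or same_image or text_query
    return mask


# One sink/text token, two images of two tokens each, then a text token.
layout = [None, 0, 0, 1, 1, None]
mask = intra_image_sink_mask(layout)
kept = sum(sum(row) for row in mask)
causal_total = len(layout) * (len(layout) + 1) // 2
```

Comparing `kept` with `causal_total` gives a rough sense of how much attention work such a mask skips; the savings grow with the number of images because cross-image blocks are dropped.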
Open-Vocabulary Functional 3D Human-Scene Interaction Generation
Authors: Jie Liu, Yu Sun, Alpar Cseke, Yao Feng, Nicolas Heron, Michael J. Black, Yan Zhang
First: 2026-01-28T18:34:25+00:00 · Latest: 2026-01-28T18:34:25+00:00
Comments: 18 pages
Abstract
Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. In this work, we propose FunHSI, a training-free, functionality-driven framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. In contrast to existing methods, FunHSI not only synthesizes more plausible general 3D interactions, such as "sitting on a sofa", but also supports fine-grained functional human-scene interactions, e.g., "increasing the room temperature". Extensive experiments demonstrate that FunHSI consistently generates functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.
中文标题/摘要
标题:开放词汇功能性的三维人类-场景交互生成
生成能够功能性地与三维场景交互的三维人类仍然是一个开放问题,具有在具身人工智能、机器人技术和交互内容创作中的应用。关键挑战在于既要推理三维场景中功能元素的语义,又要推理实现功能感知交互所需的三维人类姿态。不幸的是,现有方法通常缺乏对物体功能及其相应的人-场景接触的显式推理,导致交互不合理或功能不正确。在本工作中,我们提出了一种无需训练的功能驱动框架FunHSI,该框架能够从开放词汇的任务提示中生成功能性正确的交互。给定一个任务提示,FunHSI 进行功能感知的接触推理,以识别功能性的场景元素,重建它们的三维几何结构,并通过接触图建模高层次的交互。然后利用视觉语言模型合成执行任务的人类,并估计提出的三维身体和手部姿态。最后,通过阶段优化来细化提出的三维身体配置,以确保物理合理性与功能性正确性。与现有方法相比,FunHSI 不仅能够合成更合理的通用三维交互,如“坐在沙发上”,还支持细粒度的功能性人类-场景交互,例如“提高房间温度”。大量实验表明,FunHSI 能够在各种室内外场景中一致地生成功能性正确且物理合理的交互。
Summary / 总结
This work addresses the challenge of generating 3D humans that functionally interact with 3D scenes by proposing FunHSI, a training-free framework that performs functionality-aware contact reasoning to identify and model functional scene elements and high-level interactions. The method leverages vision-language models to synthesize humans performing tasks and refines the 3D body configuration to ensure physical plausibility and functional correctness. Experiments show that FunHSI generates more plausible and functionally correct interactions compared to existing methods across various scenes.
研究旨在生成能够与3D场景功能互动的3D人类,解决关于物体功能性和人类-场景接触的推理难题。FunHSI是一个无需训练的框架,能够识别功能性的场景元素,重构其3D几何结构,并通过接触图建模高层次的互动。该方法利用视觉-语言模型合成人类姿态,并通过逐步优化确保物理上的合理性。实验表明,FunHSI能够生成功能正确且物理上合理的互动,包括一般任务如'坐在沙发上'和精细任务如'增加房间温度'。
FAIRT2V: Training-Free Debiasing for Text-to-Video Diffusion Models
Authors: Haonan Zhong, Wei Song, Tingxu Han, Maurice Pagnucco, Jingling Xue, Yang Song
First: 2026-01-28T17:29:53+00:00 · Latest: 2026-01-28T17:29:53+00:00
Abstract
Text-to-video (T2V) diffusion models have achieved rapid progress, yet their demographic biases, particularly gender bias, remain largely unexplored. We present FairT2V, a training-free debiasing framework for text-to-video generation that mitigates encoder-induced bias without finetuning. We first analyze demographic bias in T2V models and show that it primarily originates from pretrained text encoders, which encode implicit gender associations even for neutral prompts. We quantify this effect with a gender-leaning score that correlates with bias in generated videos. Based on this insight, FairT2V mitigates demographic bias by neutralizing prompt embeddings via anchor-based spherical geodesic transformations while preserving semantics. To maintain temporal coherence, we apply debiasing only during early identity-forming steps through a dynamic denoising schedule. We further propose a video-level fairness evaluation protocol combining VideoLLM-based reasoning with human verification. Experiments on the modern T2V model Open-Sora show that FairT2V substantially reduces demographic bias across occupations with minimal impact on video quality.
中文标题/摘要
标题:FAIRT2V:无需训练的文本到视频扩散模型去偏方法
文本到视频(T2V)扩散模型取得了快速进展,但它们的人口统计偏差,尤其是性别偏差,仍然未被充分探索。我们提出了FairT2V,这是一种无需训练的去偏框架,用于文本到视频生成,可以在不微调的情况下减轻编码器引起的偏差。我们首先分析了T2V模型中的人口统计偏差,并表明它主要源自预训练的文本编码器,即使对于中性的提示,它们也会编码隐含的性别关联。我们通过与生成视频中的偏差相关联的性别倾向评分来量化这种效应。基于这一洞察,FairT2V 通过基于锚点的球面测地线变换来中和提示嵌入,同时保留语义,从而减轻人口统计偏差。为了保持时间连贯性,我们仅在早期身份形成步骤通过动态去噪调度应用去偏。我们还提出了一种结合VideoLLM推理和人工验证的视频级公平性评估协议。在现代T2V模型Open-Sora上的实验表明,FairT2V 在对视频质量影响极小的情况下,显著减少了跨职业的人口统计偏差。
Summary / 总结
The research aims to address the gender bias in text-to-video (T2V) diffusion models by presenting FairT2V, a training-free debiasing framework. It mitigates bias by neutralizing prompt embeddings through anchor-based spherical geodesic transformations while preserving semantics, and applies debiasing only during early steps to maintain temporal coherence. Experiments show that FairT2V significantly reduces demographic bias across occupations with minimal impact on video quality.
研究旨在通过提出FairT2V,一种无需训练的去偏见框架,解决文本到视频(T2V)扩散模型中的性别等人口统计学偏差问题。FairT2V通过使用锚点基于的球面测地变换来中和提示嵌入,同时保持语义。实验表明,FairT2V在保持视频质量的同时,显著减少了职业等领域的偏差。
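A spherical geodesic transformation between two embeddings is commonly implemented as spherical linear interpolation (slerp). The sketch below shows only that generic operation, moving a unit-normalized prompt embedding a fraction `t` of the way toward a neutral anchor along the great circle. How FairT2V chooses anchors, sets `t`, or schedules the step during denoising is not stated in the abstract, so `embedding`, `anchor`, and `t` are hypothetical inputs.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def slerp_toward_anchor(embedding, anchor, t):
    """Move `embedding` a fraction t along the unit-sphere geodesic toward `anchor`."""
    a, b = normalize(embedding), normalize(anchor)
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two unit vectors
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # Parallel or antipodal vectors: the geodesic is trivial or not unique.
        return a
    wa = math.sin((1.0 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]


# Hypothetical 2D example: rotate halfway from one axis toward the other.
halfway = slerp_toward_anchor([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike a straight linear blend, slerp keeps the result on the unit sphere, which matters when downstream layers are sensitive to embedding norm.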
Hashing-Baseline: Rethinking Hashing in the Age of Pretrained Models
Authors: Ilyass Moummad, Kawtar Zaher, Lukas Rauch, Alexis Joly
First: 2025-09-17T20:58:43+00:00 · Latest: 2026-01-28T17:01:31+00:00
Abstract
Information retrieval with compact binary embeddings, also referred to as hashing, is crucial for scalable fast search applications, yet state-of-the-art hashing methods require expensive, scenario-specific training. In this work, we introduce Hashing-Baseline, a strong training-free hashing method leveraging powerful pretrained encoders that produce rich pretrained embeddings. We revisit classical, training-free hashing techniques: principal component analysis, random orthogonal projection, and threshold binarization, to produce a strong baseline for hashing. Our approach combines these techniques with frozen embeddings from state-of-the-art vision and audio encoders to yield competitive retrieval performance without any additional learning or fine-tuning. To demonstrate the generality and effectiveness of this approach, we evaluate it on standard image retrieval benchmarks as well as a newly introduced benchmark for audio hashing.
中文标题/摘要
标题:哈希基线:在预训练模型时代重新思考哈希
使用紧凑二进制嵌入的信息检索,也称为哈希,在可扩展的快速搜索应用中至关重要,然而最先进的哈希方法需要昂贵的、特定场景的训练。在本文中,我们引入了哈希基线,这是一种强大的无需训练的哈希方法,利用强大的预训练编码器生成丰富的预训练嵌入。我们重新审视了经典的无需训练的哈希技术:主成分分析、随机正交投影和阈值二值化,以产生哈希的强基线。我们的方法将这些技术与最先进的视觉和音频编码器的冻结嵌入相结合,无需任何额外的学习或微调即可获得具有竞争力的检索性能。为了展示该方法的通用性和有效性,我们在标准图像检索基准以及新引入的音频哈希基准上进行了评估。
Summary / 总结
The research aims to address the need for efficient and scalable information retrieval methods using compact binary embeddings, or hashing, which are crucial for fast search applications. The study introduces Hashing-Baseline, a training-free hashing method that utilizes powerful pretrained encoders to generate rich embeddings. This approach combines classical hashing techniques with frozen embeddings from state-of-the-art vision and audio encoders, achieving competitive retrieval performance without additional learning or fine-tuning. The method is evaluated on standard image retrieval benchmarks and a new audio hashing benchmark, showcasing its effectiveness and generality.
研究旨在解决高效且可扩展的信息检索需求,使用哈希方法对于快速搜索应用至关重要。该研究引入了Hashing-Baseline,这是一种无需训练的哈希方法,利用强大的预训练编码器生成丰富的嵌入。该方法结合了经典的哈希技术与来自最新视觉和音频编码器的冻结嵌入,无需额外的学习或微调即可实现竞争性的检索性能。该方法在标准图像检索基准和新的音频哈希基准上进行了评估,展示了其有效性和通用性。
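Two of the classical steps named in the abstract, random orthogonal projection and threshold binarization, fit in a few lines. The sketch below is a simplified illustration, assuming embeddings are plain lists of floats from some frozen encoder. It omits the PCA step, builds the orthogonal rows with Gram-Schmidt on Gaussian vectors, and thresholds at zero after mean-centering; these choices are assumptions for the example, not the paper's exact recipe.

```python
import random


def random_orthogonal_rows(n_bits, dim, seed=0):
    """Return n_bits orthonormal vectors of length dim (requires n_bits <= dim)."""
    if n_bits > dim:
        raise ValueError("n_bits must not exceed the embedding dimension")
    rng = random.Random(seed)
    rows = []
    while len(rows) < n_bits:
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Gram-Schmidt: remove components along the rows accepted so far.
        for row in rows:
            proj = sum(a * b for a, b in zip(vec, row))
            vec = [a - proj * b for a, b in zip(vec, row)]
        norm = sum(a * a for a in vec) ** 0.5
        if norm > 1e-9:
            rows.append([a / norm for a in vec])
    return rows


def hash_codes(embeddings, n_bits, seed=0):
    """Mean-center, project onto random orthogonal directions, threshold at 0."""
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    rows = random_orthogonal_rows(n_bits, dim, seed)
    codes = []
    for emb in embeddings:
        centered = [x - m for x, m in zip(emb, mean)]
        codes.append([1 if sum(r * c for r, c in zip(row, centered)) > 0 else 0
                      for row in rows])
    return codes


def hamming(code_a, code_b):
    """Number of differing bits between two binary codes."""
    return sum(x != y for x, y in zip(code_a, code_b))


# Hypothetical 4-dimensional "frozen embeddings" for four items.
toy_embeddings = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0],
                  [-1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
toy_codes = hash_codes(toy_embeddings, n_bits=3)
```

Retrieval then ranks database items by `hamming` distance to the query code, which is cheap because it reduces to bit comparisons.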
FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering
Authors: Chaodong Tong, Qi Zhang, Chen Li, Lei Jiang, Yanbing Liu
First: 2026-01-01T09:19:39+00:00 · Latest: 2026-01-28T16:05:20+00:00
Comments: 21 pages, 13 figures, 8 tables
Abstract
Faithfulness hallucinations in VQA occur when vision-language models produce fluent yet visually ungrounded answers, severely undermining their reliability in safety-critical applications. Existing detection methods mainly fall into two categories: external verification approaches relying on auxiliary models or knowledge bases, and uncertainty-driven approaches using repeated sampling or uncertainty estimates. The former suffer from high computational overhead and are limited by external resource quality, while the latter capture only limited facets of model uncertainty and fail to sufficiently explore the rich internal signals associated with the diverse failure modes. Both paradigms thus have inherent limitations in efficiency, robustness, and detection performance. To address these challenges, we propose FaithSCAN: a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. We also extend the LLM-as-a-Judge paradigm to VQA hallucination and propose a low-cost strategy to automatically generate model-dependent supervision signals, enabling supervised training without costly human labels while maintaining high detection accuracy. Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. In-depth analysis shows hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding. Different internal signals provide complementary diagnostic cues, and hallucination patterns vary across VLM architectures, offering new insights into the underlying causes of multimodal hallucinations.
中文标题/摘要
标题:FaithSCAN:基于模型驱动的一次通过幻觉检测方法以实现忠实的视觉问答
视觉问答中的忠实性幻觉发生在视觉语言模型生成流畅但与视觉现实脱节的答案时,严重削弱了其在安全关键应用中的可靠性。现有检测方法主要分为两类:依赖辅助模型或知识库的外部验证方法,以及使用重复采样或不确定性估计的不确定性驱动方法。前者面临高计算开销的问题,并受限于外部资源质量,而后者仅捕捉模型不确定性的一小部分,并未能充分探索与多种失败模式相关的丰富内部信号。这两种范式在效率、鲁棒性和检测性能方面都存在固有的局限性。为应对这些挑战,我们提出FaithSCAN:一种轻量级网络,通过利用视觉语言模型的丰富内部信号(包括标记级解码不确定性、中间视觉表示和跨模态对齐特征)来检测幻觉。这些信号通过分支级证据编码和不确定性感知注意力进行融合。我们还将LLM作为裁判的范式扩展到视觉问答幻觉,并提出了一种低成本策略,自动生成模型依赖的监督信号,从而在无需昂贵的人工标签的情况下进行监督训练,同时保持高检测准确性。在多个视觉问答基准上的实验表明,FaithSCAN在效果和效率上均显著优于现有方法。深入分析表明,幻觉源于视觉感知、跨模态推理和语言解码中的系统性内部状态变化。不同的内部信号提供了互补的诊断线索,而幻觉模式在不同的视觉语言模型架构之间有所不同,为多模态幻觉的根本原因提供了新的见解。
Summary / 总结
FaithSCAN is a lightweight network designed to detect faithfulness hallucinations in Visual Question Answering (VQA) by leveraging rich internal signals from vision-language models, such as token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. It uses branch-wise evidence encoding and uncertainty-aware attention to fuse these signals. Additionally, FaithSCAN extends the LLM-as-a-Judge paradigm to automatically generate model-dependent supervision signals, enabling supervised training without human labels. Experiments on multiple VQA benchmarks show that FaithSCAN outperforms existing methods in both effectiveness and efficiency. Further analysis attributes hallucinations to systematic internal state variations in visual perception, cross-modal reasoning, and language decoding.
FaithSCAN 是一种轻量级网络,通过利用视觉语言模型(VLM)的丰富内部信号(包括 token 级解码不确定性、中间视觉表示和跨模态对齐特征)来检测视觉问答(VQA)中的忠实性幻觉。它使用分支级证据编码和不确定性感知注意力来融合这些信号。此外,FaithSCAN 还将 LLM 作为裁判的范式扩展到 VQA 幻觉检测,提出了一种低成本策略自动生成模型依赖的监督信号,从而在无需昂贵的人工标签的情况下实现监督训练,同时保持高检测准确性。在多个 VQA 基准上的实验表明,FaithSCAN 在有效性与效率上均优于现有方法,并揭示了幻觉模式在不同 VLM 架构下的变化,提供了关于多模态幻觉潜在原因的新见解。
AEDR: Training-Free AI-Generated Image Attribution via Autoencoder Double-Reconstruction
Authors: Chao Wang, Zijin Yang, Yaofei Wang, Weiming Zhang, Kejiang Chen
Venue: AAAI 2026 Oral
First: 2025-07-25T06:34:58+00:00 · Latest: 2026-01-28T15:53:05+00:00
Comments: 7 pages. Accepted by AAAI 2026 Oral
Abstract
The rapid advancement of image-generation technologies has made it possible for anyone to create photorealistic images using generative models, raising significant security concerns. To mitigate malicious use, tracing the origin of such images is essential. Reconstruction-based attribution methods offer a promising solution, but they often suffer from reduced accuracy and high computational costs when applied to state-of-the-art (SOTA) models. To address these challenges, we propose AEDR (AutoEncoder Double-Reconstruction), a novel training-free attribution method designed for generative models with continuous autoencoders. Unlike existing reconstruction-based approaches that rely on the value of a single reconstruction loss, AEDR performs two consecutive reconstructions using the model's autoencoder, and adopts the ratio of these two reconstruction losses as the attribution signal. This signal is further calibrated using the image homogeneity metric to improve accuracy, which inherently cancels out absolute biases caused by image complexity, with autoencoder-based reconstruction ensuring superior computational efficiency. Experiments on eight top latent diffusion models show that AEDR achieves 25.5% higher attribution accuracy than existing reconstruction-based methods, while requiring only 1% of the computational time.
中文标题/摘要
标题:AEDR:基于自编码器双重建的无训练AI生成图像归因
图像生成技术的迅速发展使得任何人都可以使用生成模型创建逼真的图像,这引发了重大的安全问题。为了遏制恶意使用,追踪此类图像的来源至关重要。基于重建的归因方法提供了有希望的解决方案,但在应用于最先进的(SOTA)模型时,它们通常会遭受准确性降低和高计算成本的问题。为了解决这些挑战,我们提出了AEDR(自编码器双重建),这是一种针对具有连续自编码器的生成模型的新型无训练归因方法。与现有的依赖单重建损失值的重建方法不同,AEDR 进行两次连续重建,并采用这两次重建损失的比率作为归因信号。该信号进一步使用图像均匀度度量进行校准以提高准确性,这本质上抵消了由于图像复杂性引起的绝对偏差,基于自编码器的重建确保了更高的计算效率。在八个顶级潜扩散模型上的实验表明,AEDR 的归因准确性比现有重建方法高25.5%,而计算时间仅需1%。
Summary / 总结
The paper introduces AEDR (AutoEncoder Double-Reconstruction), a training-free method for attributing AI-generated images. It uses two consecutive reconstructions with the autoencoder and the ratio of these losses as the attribution signal, which is then calibrated with an image homogeneity metric. Experiments on eight top latent diffusion models show AEDR achieves 25.5% higher accuracy than existing methods with only 1% of the computational time.
AEDR 是一种无需训练的图像归属方法,用于识别由生成模型生成的图像来源。它通过双重建计算两个重建损失的比率,并通过图像均匀性进行校准以提高准确性和降低计算成本。实验结果显示,AEDR 在八种顶级潜在扩散模型上的归属准确率比现有方法高 25.5%,且仅需 1% 的计算时间。
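The core signal described above, a ratio of two consecutive reconstruction losses, can be sketched generically. The example below is an assumption-laden illustration: `autoencoder` is any callable mapping an image (here a flat list of floats) to its reconstruction, the loss is mean squared error, and the ratio is taken as first loss over second loss. The abstract does not specify the loss, the direction of the ratio, or the homogeneity calibration, so those parts are hypothetical or omitted.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def double_reconstruction_ratio(image, autoencoder, eps=1e-12):
    """Ratio of first to second reconstruction loss (direction is an assumption).

    autoencoder : callable image -> reconstruction, standing in for a model's
                  encode-then-decode pass
    eps         : floor on the denominator to avoid division by zero
    """
    first = autoencoder(image)    # reconstruction of the input
    second = autoencoder(first)   # reconstruction of the reconstruction
    loss_first = mse(image, first)
    loss_second = mse(first, second)
    return loss_first / max(loss_second, eps)


def toy_autoencoder(image):
    """Hypothetical stand-in that halves every value (not a real model)."""
    return [0.5 * v for v in image]


signal = double_reconstruction_ratio([2.0, 4.0], toy_autoencoder)
```

Because both losses come from the same image, a ratio of this kind is less sensitive to how complex the image is than a single absolute loss, which is the motivation the abstract gives for using it.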
Bi-Modal Textual Prompt Learning for Vision-Language Models in Remote Sensing
Authors: Pankhi Kashyap, Mainak Singha, Biplab Banerjee
Venue: ICASSP 2026
First: 2026-01-28T14:58:14+00:00 · Latest: 2026-01-28T14:58:14+00:00
Comments: Accepted in ICASSP 2026
Abstract
Prompt learning (PL) has emerged as an effective strategy to adapt vision-language models (VLMs), such as CLIP, for downstream tasks under limited supervision. While PL has demonstrated strong generalization on natural image datasets, its transferability to remote sensing (RS) imagery remains underexplored. RS data present unique challenges, including multi-label scenes, high intra-class variability, and diverse spatial resolutions, that hinder the direct applicability of existing PL methods. In particular, current prompt-based approaches often struggle to identify dominant semantic cues and fail to generalize to novel classes in RS scenarios. To address these challenges, we propose BiMoRS, a lightweight bi-modal prompt learning framework tailored for RS tasks. BiMoRS employs a frozen image captioning model (e.g., BLIP-2) to extract textual semantic summaries from RS images. These captions are tokenized using a BERT tokenizer and fused with high-level visual features from the CLIP encoder. A lightweight cross-attention module then conditions a learnable query prompt on the fused textual-visual representation, yielding contextualized prompts without altering the CLIP backbone. We evaluate BiMoRS on four RS datasets across three domain generalization (DG) tasks and observe consistent performance gains, outperforming strong baselines by up to 2% on average. Codes are available at https://github.com/ipankhi/BiMoRS.
中文标题/摘要
标题:遥感领域双模态文本提示学习
提示学习(PL)已成为一种有效的策略,用于在有限监督下将视觉-语言模型(VLMs),如CLIP,适应下游任务。尽管PL在自然图像数据集上展示了强大的泛化能力,但其在遥感(RS)图像上的可转移性仍被忽视。RS数据带来了独特的挑战,包括多标签场景、高类内变异性以及多样化的空间分辨率,这阻碍了现有PL方法的直接适用性。特别是,当前基于提示的方法往往难以识别主导语义线索,并且无法在RS场景中泛化到新的类别。为了解决这些挑战,我们提出了一种名为BiMoRS的轻量级双模态提示学习框架,专门针对RS任务。BiMoRS利用冻结的图像描述模型(如BLIP-2)从RS图像中提取文本语义摘要。这些描述被使用BERT分词器分词,并与CLIP编码器的高层视觉特征融合。一个轻量级的交叉注意力模块然后根据融合的文本-视觉表示条件化一个可学习的查询提示,从而生成上下文化的提示,而不改变CLIP主干。我们在四个RS数据集上对BiMoRS进行了三种领域泛化(DG)任务的评估,并观察到一致的性能提升,平均比强基线高出2%。代码可在https://github.com/ipankhi/BiMoRS/获取。
Summary / 总结
The research aims to enhance the adaptability of vision-language models (VLMs) like CLIP for remote sensing (RS) tasks through prompt learning (PL). BiMoRS, a bi-modal prompt learning framework, is proposed to address the unique challenges of RS data, such as multi-label scenes and high intra-class variability. BiMoRS uses a frozen image captioning model to extract textual summaries from RS images, which are then fused with visual features and used to condition a learnable query prompt. Experiments on four RS datasets show consistent performance improvements, outperforming strong baselines by up to 2% on average.
研究旨在通过解决多标签场景和高类内变异性等独特挑战,增强视觉语言模型在遥感图像中的迁移能力。BiMoRS是一种双模态提示学习框架,使用冻结的图像描述模型从遥感图像中提取文本摘要,并与视觉特征融合。该方法在三个领域泛化任务中表现出色,平均性能提升高达2%,优于强基线。
Investigating the Development of Task-Oriented Communication in Vision-Language Models
Authors: Boaz Carmeli, Orr Paradise, Shafi Goldwasser, Yonatan Belinkov, Ron Meir
First: 2026-01-28T14:28:31+00:00 · Latest: 2026-01-28T14:28:31+00:00
Abstract
We investigate whether LLM-based agents can develop task-oriented communication protocols that differ from standard natural language in collaborative reasoning tasks. Our focus is on two core properties such task-oriented protocols may exhibit: Efficiency (conveying task-relevant information more concisely than natural language) and Covertness (becoming difficult for external observers to interpret, which raises concerns about transparency and control). To investigate these aspects, we use a referential-game framework in which vision-language model (VLM) agents communicate, providing a controlled, measurable setting for evaluating language variants. Experiments show that VLMs can develop effective, task-adapted communication patterns. At the same time, they can develop covert protocols that are difficult for humans and external agents to interpret. We also observe spontaneous coordination between similar models without explicitly shared protocols. These findings highlight both the potential and the risks of task-oriented communication, and position referential games as a valuable testbed for future work in this area.
中文标题/摘要
标题:探究视觉语言模型中任务导向通信的发展
我们研究基于LLM的代理是否能够发展出不同于标准自然语言的任务导向通信协议,特别是在协作推理任务中的表现。我们的重点在于此类任务导向协议可能表现出的两个核心特性:效率——比自然语言更简洁地传达任务相关信息,以及隐蔽性——对外部观察者难以解读,从而引发透明度和控制方面的担忧。为了研究这些方面,我们使用了一个参照游戏框架,其中视觉语言模型(VLM)代理进行沟通,提供了一个可控且可测量的环境来评估语言变体。实验表明,VLM可以发展出有效的、任务适应的通信模式。同时,它们还可以发展出难以由人类和外部代理解读的隐蔽协议。我们还观察到,相似模型之间自发的协调,而无需明确共享的协议。这些发现突显了任务导向通信的潜力和风险,并将参照游戏定位为未来在此领域研究中一个有价值的测试平台。
Summary / 总结
The study investigates whether language models can develop task-oriented communication protocols that are more efficient and covert than natural language. Using a referential-game framework, the research evaluates the communication patterns of vision-language models and finds that these models can create effective, task-specific communication strategies and covert protocols that are hard for humans to interpret. Spontaneous coordination between similar models without explicit protocols is also observed, highlighting both the benefits and risks of task-oriented communication.
研究探讨了语言模型是否能在协作推理任务中发展出比自然语言更高效且难以解读的任务导向通信协议。通过使用参照游戏框架进行评估,研究发现视觉语言模型能够创建简洁且难以被人类和外部代理解读的通信模式。此外,模型还表现出自发协调而无需明确共享协议,这既突显了任务导向通信的优势,也揭示了其潜在风险。
DeepSeek-OCR 2: Visual Causal Flow
Authors: Haoran Wei, Yaofeng Sun, Yukun Li
First: 2026-01-28T12:46:07+00:00 · Latest: 2026-01-28T12:46:07+00:00
Abstract
We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, capable of dynamically reordering visual tokens based on image semantics. Conventional vision-language models (VLMs) invariably process visual tokens in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding when fed into LLMs. However, this contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by inherent logical structures. Particularly for images with complex layouts, human vision exhibits causally-informed sequential processing. Inspired by this cognitive mechanism, DeepEncoder V2 is designed to endow the encoder with causal reasoning capabilities, enabling it to intelligently reorder visual tokens prior to LLM-based content interpretation. This work explores a novel paradigm: whether 2D image understanding can be effectively achieved through two cascaded 1D causal reasoning structures, thereby offering a new architectural approach with the potential to achieve genuine 2D reasoning. Codes and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR-2.
中文标题/摘要
标题:DeepSeek-OCR 2:视觉因果流
我们提出DeepSeek-OCR 2以探究一种新型编码器-DeepEncoder V2的可行性,该编码器能够根据图像语义动态重新排列视觉标记。传统的视觉-语言模型(VLMs)在输入LLMs时,不可避免地以固定的栅格扫描顺序(从左上到右下)处理视觉标记,并附带固定的位置编码。然而,这与人类视觉感知相矛盾,人类视觉遵循灵活且语义连贯的扫描模式,由内在的逻辑结构驱动。特别是对于具有复杂布局的图像,人类视觉表现出因果驱动的顺序处理。受这一认知机制的启发,DeepEncoder V2被设计成赋予编码器因果推理能力,使其能够在基于LLM的内容解释之前智能地重新排列视觉标记。本研究探索了一个新的范式:二维图像理解是否可以通过两个级联的一维因果推理结构有效实现,从而提供一种具有潜在真正二维推理能力的新架构。相关代码和模型权重可在http://github.com/deepseek-ai/DeepSeek-OCR-2获取。
Summary / 总结
DeepSeek-OCR 2 investigates the feasibility of a novel encoder called DeepEncoder V2, which can dynamically reorder visual tokens based on image semantics. This approach contrasts with conventional vision-language models that process visual tokens in a fixed raster-scan order. The work explores whether 2D image understanding can be achieved through two cascaded 1D causal reasoning structures, proposing this as a new architectural approach with the potential for genuine 2D reasoning.
DeepSeek-OCR 2 探索使用一种名为 DeepEncoder V2 的新型编码器,该编码器可以根据图像语义动态重新排序视觉标记,以提高视觉理解能力。这种方法与传统的视觉-语言模型不同,后者以固定的扫描顺序处理视觉标记。研究发现,这种因果推理机制可以有效地重新排序视觉标记,通过两个级联的 1D 因果推理结构,可能实现更好的 2D 图像理解。
AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
Authors: Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj
First: 2026-01-28T12:02:58+00:00 · Latest: 2026-01-28T12:02:58+00:00
Abstract
Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision-language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based purely on vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by a significant 3.3 percentage points. Project Page: https://maticfuc.github.io/anomaly_vfm/
中文标题/摘要
标题:AnomalyVFM -- 将视觉基础模型转化为零样本异常检测器
零样本异常检测旨在无需任何领域内训练图像的情况下,检测和定位图像中的异常区域。虽然最近的方法利用视觉-语言模型(VLMs),如CLIP,来转移高级概念知识,但基于纯粹视觉基础模型(VFMs)的方法,如DINOv2,在性能上落后。我们认为这种差距源于两个实际问题:(i) 现有辅助异常检测数据集的多样性有限,(ii) VFM的过度浅层适应策略。为了解决这两个挑战,我们提出了AnomalyVFM,这是一种通用且有效的框架,能够将任何预训练的VFM转化为强大的零样本异常检测器。我们的方法结合了一种稳健的三阶段合成数据集生成方案和一种参数高效的适应机制,利用低秩特征适配器和置信加权像素损失。这些组件共同使现代VFMs在性能上显著超越当前最先进的方法。具体而言,以RADIO作为骨干,AnomalyVFM在9个不同数据集上的平均图像级AUROC为94.1%,比之前的方法高出显著的3.3个百分点。项目页面:https://maticfuc.github.io/anomaly_vfm/
Summary / 总结
The research aims to enhance zero-shot anomaly detection using vision foundation models (VFMs) by addressing the limitations of existing methods. AnomalyVFM, a proposed framework, transforms any pretrained VFM into a strong zero-shot anomaly detector through a robust synthetic dataset generation and a parameter-efficient adaptation mechanism. The method achieves an average image-level AUROC of 94.1% across nine diverse datasets, outperforming previous methods by 3.3 percentage points.
AnomalyVFM旨在通过将预训练的视觉基础模型(VFMs)如DINOv2转化为有效的异常检测器来提升零样本异常检测。它通过引入一种稳健的三阶段合成数据集生成方案和参数高效的适应机制来解决现有方法的局限性。AnomalyVFM显著超越了当前最先进的方法,在九个不同的数据集上实现了平均图像级AUROC为94.1%,比之前的方法高出3.3个百分点。
Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
Authors: Xuyang Liu, Xiyan Gui, Yuchao Zhang, Linfeng Zhang
Venue: ICLR 2026
First: 2025-10-23T16:17:47+00:00 · Latest: 2026-01-28T10:49:58+00:00
Comments: Accepted by ICLR 2026. Our code is available at https://github.com/xuyang-liu16/MixKV
Abstract
Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.
中文标题/摘要
标题:结合重要性与多样性:在大型视觉-语言模型中联合优化KV缓存压缩
近期的大型视觉-语言模型(LVLMs)在处理扩展的多模态序列方面表现出色,但由此产生的KV缓存扩展造成了严重的内存瓶颈,从根本上限制了部署的可扩展性。虽然现有的KV缓存压缩方法侧重于保留高重要性的KV对以最小化存储,但它们往往忽略了多模态KV缓存中出现的独特的模态特定语义冗余模式。在本文中,我们首先分析了在简单的重要性之外,LVLMs中的KV缓存如何在不同的注意力头中表现出不同程度的冗余。我们展示了仅依赖重要性只能覆盖KV缓存信息分布的一部分,可能导致语义覆盖的潜在损失。为了解决这一问题,我们提出了MixKV,这是一种新颖的方法,用于在LVLMs中结合重要性和多样性以优化KV缓存压缩。MixKV根据注意力头的语义冗余进行调整,在压缩KV对时选择性地平衡多样性和重要性。大量实验表明,MixKV在多个LVLMs上始终能够提升现有方法的性能。在极端压缩(预算=64)下,MixKV在五个多模态理解基准测试中将基线方法平均提升了5.1%,并在GUI定位任务中使SnapKV和AdaKV分别提升8.0%和9.0%,同时保持了相当的推理效率。此外,MixKV可无缝扩展到LLMs,并取得相当的性能提升。我们的代码可在https://github.com/xuyang-liu16/MixKV获取。
Summary / 总结
This work addresses the memory bottleneck caused by key-value (KV) cache expansion in large vision-language models (LVLMs) by proposing MixKV, a method that mixes importance with diversity for KV cache compression. MixKV adapts to head-wise semantic redundancy and selectively balances diversity and importance when choosing which KV pairs to keep. Rather than replacing existing compression methods, it enhances them: under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and yields gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, while maintaining comparable inference efficiency.
该研究针对大型视觉-语言模型(LVLMs)中由于键值(KV)缓存扩展导致的内存瓶颈问题,提出了一种结合重要性和多样性的方法MixKV进行KV缓存压缩。该方法适应头部级别的语义冗余,并在压缩KV对时选择性地平衡多样性和重要性,从而在多个基准测试中一致地改进了现有方法。在极端压缩条件下,MixKV将基线方法的性能平均提高了5.1%,并在GUI定位任务中分别取得了8.0%和9.0%的显著提升,同时保持了相当的推理效率。
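To make the "importance mixed with diversity" idea concrete, here is a minimal sketch. It is not the MixKV algorithm: the abstract does not give the scoring rule, so the linear blend, the `mix` weight, and the diversity term (one minus mean cosine similarity to the other keys) are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0

def mixed_scores(keys, importance, mix):
    """Blend importance with a diversity term (hypothetical rule, mix in [0, 1]).

    Requires at least two keys so that every key has a neighbour to compare to.
    """
    scores = []
    for i, key in enumerate(keys):
        sims = [cosine(key, other) for j, other in enumerate(keys) if j != i]
        diversity = 1.0 - sum(sims) / len(sims)
        scores.append((1.0 - mix) * importance[i] + mix * diversity)
    return scores

def select(keys, importance, mix, budget):
    """Return the indices of the `budget` highest-scoring KV pairs, in order."""
    scores = mixed_scores(keys, importance, mix)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Two near-duplicate keys with high importance, one distinct key with low importance.
keys = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
importance = [0.5, 0.4, 0.1]

importance_only = select(keys, importance, mix=0.0, budget=2)
mixed = select(keys, importance, mix=0.8, budget=2)
```

With importance alone, both duplicate keys are kept and the distinct one is dropped. With a large `mix`, the distinct key displaces one duplicate, which is the loss of semantic coverage the abstract attributes to importance-only selection. In a head-wise scheme, `mix` would presumably be set per attention head from that head's measured redundancy.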
MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models
Authors: Wenbo Xu, Wei Lu, Xiangyang Luo, Jiantao Zhou
First: 2026-01-28T09:44:31+00:00 · Latest: 2026-01-28T09:44:31+00:00
Abstract
Deepfake detection is a widely researched topic that is crucial for combating the spread of malicious content, with existing methods mainly modeling the problem as classification or spatial localization. The rapid advancements in generative models impose new demands on Deepfake detection. In this paper, we propose multimodal alignment and reinforcement for explainable Deepfake detection via vision-language models, termed MARE, which aims to enhance the accuracy and reliability of Vision-Language Models (VLMs) in Deepfake detection and reasoning. Specifically, MARE designs comprehensive reward functions, incorporating reinforcement learning from human feedback (RLHF), to incentivize the generation of text-spatially aligned reasoning content that adheres to human preferences. Besides, MARE introduces a forgery disentanglement module to capture intrinsic forgery traces from high-level facial semantics, thereby improving its authenticity detection capability. We conduct thorough evaluations on the reasoning content generated by MARE. Both quantitative and qualitative experimental results demonstrate that MARE achieves state-of-the-art performance in terms of accuracy and reliability.
中文标题/摘要
标题:MARE:基于视觉-语言模型的多模态对齐与强化以实现可解释的深度假信息检测
深度假信息检测是一个广泛研究的主题,对于遏制恶意内容的传播至关重要,现有方法主要将问题建模为分类或空间定位。生成模型的快速发展对深度假信息检测提出了新的要求。在本文中,我们提出了一种基于视觉-语言模型的多模态对齐与强化以实现可解释的深度假信息检测方法,称为MARE,旨在提高视觉-语言模型(VLMs)在深度假信息检测和推理中的准确性和可靠性。具体而言,MARE 设计了全面的奖励函数,结合人类反馈的强化学习(RLHF),以激励生成符合人类偏好的文本-空间对齐的推理内容。此外,MARE 引入了一个伪造分离模块,以从高级面部语义中捕获内在的伪造痕迹,从而提高其真实性检测能力。我们对MARE生成的推理内容进行了全面评估。定量和定性的实验结果表明,MARE 在准确性和可靠性方面达到了最先进的性能。
Summary / 总结
MARE is a method that uses multimodal alignment and reinforcement learning to enhance the accuracy and reliability of deepfake detection via vision-language models. It incorporates human feedback to generate text-spatially aligned reasoning content and includes a forgery disentanglement module to capture intrinsic forgery traces. Experimental results show that MARE outperforms existing methods in terms of accuracy and reliability.
MARE 通过多模态对齐和强化学习来提升基于视觉-语言模型的深伪检测的准确性和可靠性。它结合了人类反馈生成文本-空间对齐的推理内容,并引入伪造分离模块来捕捉高级面部语义中的伪造痕迹。实验结果表明,MARE 在准确性和可靠性方面均优于现有方法。
Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models
Authors: Yuhao Sun, Chengyi Cai, Jiacheng Zhang, Zesheng Ye, Xingliang Yuan, Feng Liu
First: 2026-01-28T09:24:14+00:00 · Latest: 2026-01-28T09:24:14+00:00
Comments: 25 pages
Abstract
Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives: View Refinement and Description Refinement, together termed Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity in the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity to remove redundant information in visual-text alignment.
中文标题/摘要
标题:让我们滚一个BiFTA:细粒度文本视觉对齐中的双精炼
近期研究表明,将细粒度的文本描述与局部图像片段对齐可以显著提高预训练视觉-语言模型(例如CLIP)的零样本性能。然而,我们发现细粒度的文本描述和局部图像片段中往往包含冗余信息,使得文本-视觉对齐效果不佳。在本文中,我们从两个角度解决这一问题:视图精炼和描述精炼,合称为面向细粒度文本-视觉对齐的双重精炼(Bi-refinement for Fine-grained Text-visual Alignment,BiFTA)。视图精炼去除交并比(IoU)较高的冗余图像片段,使视觉样本更具区分度。描述精炼去除成对余弦相似度较高的冗余文本描述,确保剩余描述更加多样。BiFTA在6个基准数据集上对基于ViT和基于ResNet的CLIP均取得了更优的零样本性能,证明了在视觉-文本对齐中去除冗余信息的必要性。
Summary / 总结
This paper addresses the issue of redundant information in fine-grained text descriptions and localized image patches, which can hinder the effectiveness of text-visual alignment in vision-language models. The authors propose Bi-refinement for Fine-grained Text-visual Alignment (BiFTA), which includes view refinement to remove redundant image patches and description refinement to eliminate redundant text descriptions. BiFTA improves zero-shot performance on six benchmark datasets for both ViT-based and ResNet-based CLIP models, demonstrating the importance of removing redundant information in visual-text alignment.
本文针对细粒度文本描述和局部图像片段中冗余信息的问题,该问题可能阻碍视觉语言模型(如CLIP)中的文本-视觉对齐效果。作者提出了一种名为细粒度文本-视觉对齐的双向精炼方法(BiFTA),包括视图精炼以去除冗余图像片段和描述精炼以消除冗余文本描述。BiFTA在六个基准数据集上提高了基于ViT和ResNet的CLIP模型的零样本性能,证明了去除视觉-文本对齐中冗余信息的重要性。
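The two refinement steps described for BiFTA can be sketched with one shared filter. This is an illustrative sketch, not the authors' implementation: the greedy keep-first order, the thresholds (0.5 for IoU, 0.9 for cosine similarity), and the toy boxes and embeddings are assumptions not stated in the abstract.

```python
import math

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0

def greedy_filter(items, similarity, threshold):
    """Keep an item only if it is not too similar to any item already kept."""
    kept = []
    for item in items:
        if all(similarity(item, k) <= threshold for k in kept):
            kept.append(item)
    return kept

# View refinement: drop image patches that overlap heavily with a kept patch.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept_boxes = greedy_filter(boxes, iou, threshold=0.5)

# Description refinement: drop text embeddings that nearly duplicate a kept one.
embeddings = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]
kept_embeddings = greedy_filter(embeddings, cosine, threshold=0.9)
```

In this toy input the second box overlaps the first with an IoU of 81/119 (about 0.68), and the second embedding has a cosine similarity of about 0.995 with the first, so both are removed while the distinct third items survive.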
NLPrompt: Noise-Label Prompt Learning for Vision-Language Models
Authors: Bikang Pan, Qun Li, Xiaoying Tang, Wei Huang, Zhen Fang, Feng Liu, Jingya Wang, Jingyi Yu, Ye Shi
First: 2024-12-02T08:25:09+00:00 · Latest: 2026-01-28T08:33:08+00:00
Abstract
The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite its promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, named PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Though MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning due to its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method to enhance the robustness further. PromptOT employs text features in vision-language models as prototypes to construct an optimal transportation matrix. This matrix effectively partitions datasets into clean and noisy subsets, allowing for the application of cross-entropy loss to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representations and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.
中文标题/摘要
标题:NLPrompt:视觉语言模型中的噪声标签提示学习
视觉语言基础模型(如CLIP)的出现,已经彻底改变了图像-文本表示,通过提示学习使广泛的应用成为可能。尽管前景广阔,但现实世界的数据集往往包含噪声标签,这会降低提示学习的效果。本文展示了在提示学习中使用平均绝对误差(MAE)损失,称为PromptMAE,可以显著增强对噪声标签的鲁棒性,同时保持高精度。尽管MAE简单且因其鲁棒性而广受认可,但由于其收敛速度慢且在提示学习场景外表现不佳,因此在噪声标签学习中很少使用。为了阐明PromptMAE的鲁棒性,我们利用特征学习理论证明,MAE可以抑制噪声样本的影响,从而提高信噪比并增强整体鲁棒性。此外,我们引入了基于提示的最优传输数据净化方法PromptOT,以进一步增强鲁棒性。PromptOT 使用视觉语言模型中的文本特征作为原型,构建最优传输矩阵。该矩阵有效地将数据集划分为干净和噪声子集,从而可以对干净子集应用交叉熵损失,对噪声子集应用MAE损失。我们的噪声标签提示学习方法NLPrompt提供了一种简单而有效的方法,利用视觉语言模型的表达性和精确对齐能力进行鲁棒提示学习。我们通过在各种噪声设置下进行广泛实验验证了NLPrompt,展示了显著的性能改进。
Summary / 总结
NLPrompt is a method that uses mean absolute error (MAE) loss in prompt learning to enhance the robustness of vision-language models against noisy labels while maintaining high accuracy. It introduces PromptOT, a prompt-based optimal transport data purification method, to further improve robustness. Experiments show that NLPrompt significantly improves performance in various noise settings.
NLPrompt 是一种通过提示学习增强视觉-语言模型对噪声标签鲁棒性的方法。它使用了名为 PromptMAE 的平均绝对误差 (MAE) 损失,在提高鲁棒性的同时保持了准确性。此外,PromptOT 是一种基于提示的最优传输数据净化方法,通过将数据集划分为干净和噪声子集来进一步增强鲁棒性。实验结果表明,NLPrompt 在各种噪声设置下均取得了显著的性能提升。
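A small numeric sketch shows one standard reason MAE is considered robust to label noise: against a one-hot target it equals 2(1 - p_y), so it is bounded by 2, whereas cross-entropy grows without bound as the predicted probability of the given label shrinks. The logits and labels below are made-up values for illustration, and this is the generic loss comparison rather than the paper's feature-learning analysis.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-likelihood of the given label."""
    return -math.log(probs[label])

def mae(probs, label):
    """Sum of absolute differences between probs and the one-hot target."""
    return sum(abs(p - (1.0 if i == label else 0.0)) for i, p in enumerate(probs))

# A confident prediction for class 0; the noisy annotation says class 1.
probs = softmax([8.0, 0.0, 0.0])
clean_label, noisy_label = 0, 1

ce_noisy = cross_entropy(probs, noisy_label)   # large, grows with the logit gap
mae_noisy = mae(probs, noisy_label)            # stays below 2
```

Because the MAE penalty saturates, a mislabeled sample contributes a bounded loss instead of dominating the objective. This also fits the split described in the abstract, where cross-entropy is applied to the subset judged clean and MAE to the subset judged noisy.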
TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs
Authors: Minjae Lee, Wonjun Kang, Byeongkeun Ahn, Christian Classen, Kevin Galim, Seunghyuk Oh, Minghao Yan, Hyung Il Koo, Kangwook Lee
First: 2026-01-28T08:16:57+00:00 · Latest: 2026-01-28T08:16:57+00:00
Comments: Accepted to Findings of EACL 2026
Abstract
Speculative decoding (SD) has proven effective for accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision-Language Models (LVLMs), which extend LLMs to process both image and text prompts. To address this gap, we benchmark existing inference methods with small draft models on 11 datasets across diverse input scenarios and observe scenario-specific performance fluctuations. Motivated by these findings, we propose Test-time Adaptive Batched Ensemble Drafting (TABED), which dynamically ensembles multiple drafts obtained via batch inference by leveraging deviations from past ground truths available in the SD setting. The dynamic ensemble method achieves an average robust walltime speedup of 1.74x over autoregressive decoding and a 5% improvement over single drafting methods, while remaining training-free and keeping ensembling costs negligible through parameter sharing. With its plug-and-play compatibility, we further enhance TABED by integrating advanced verification and alternative drafting methods. Code and custom-trained models are available at https://github.com/furiosa-ai/TABED.
中文标题/摘要
标题:TABED:LVLMs中的测试时自适应批量化集成草稿生成以实现鲁棒推测解码
推测解码(SD)已被证明能通过快速生成草稿令牌并在并行中验证它们来加速LLM推理。然而,SD在大型视觉-语言模型(LVLMs)中尚未得到充分探索,LVLMs将LLM扩展到处理图像和文本提示。为解决这一问题,我们在11个具有不同输入场景的基准数据集上对现有推理方法进行了评估,并观察到性能波动。受这些发现的启发,我们提出了测试时自适应批量化集成草稿生成(TABED),该方法通过利用SD设置中可用的过去地面真实值的偏差动态地将多个草稿进行集成。动态集成方法在自回归解码上实现了平均1.74倍的鲁棒墙时间加速,并且比单一草稿方法提高了5%,同时保持无训练成本,并通过参数共享使集成成本微乎其微。凭借其即插即用兼容性,我们进一步通过集成高级验证和替代草稿方法增强了TABED。代码和自训练模型可在https://github.com/furiosa-ai/TABED上获得。
Summary / 总结
The research aims to improve speculative decoding for Large Vision-Language Models (LVLMs) by addressing scenario-specific performance fluctuations. The proposed Test-time Adaptive Batched Ensemble Drafting (TABED) dynamically ensembles multiple drafts using deviations from past ground truths, achieving an average robust walltime speedup of 1.74x over autoregressive decoding and a 5% improvement over single drafting methods. TABED remains training-free and cost-effective through parameter sharing.
研究旨在通过解决场景特定的性能波动问题,改进大型视觉-语言模型(LVLM)的推测解码。研究引入了Test-time Adaptive Batched Ensemble Drafting(TABED),该方法通过利用过去地面真实值的偏差动态组合多个草稿。TABED 实现了平均1.74倍的鲁棒墙时间加速,相比自回归解码,并且通过参数共享保持了可忽略的组合成本。该方法具有即插即用的兼容性,并可以与高级验证和替代草稿方法结合使用。
Improving Fine-Grained Control via Aggregation of Multiple Diffusion Models
Authors: Conghan Yue, Zhengwei Peng, Shiyan Du, Zhi Ji, Chuangjian Cai, Le Wan, Dongyu Zhang
First: 2024-10-02T06:16:06+00:00 · Latest: 2026-01-28T08:13:16+00:00
Abstract
While many diffusion models perform well when controlling particular aspects such as style, character, and interaction, they struggle with fine-grained control due to dataset limitations and intricate model architecture design. This paper introduces a novel training-free algorithm for fine-grained generation, called Aggregation of Multiple Diffusion Models (AMDM). The algorithm integrates features in the latent data space from multiple diffusion models within the same ecosystem into a specified model, thereby activating particular features and enabling fine-grained control. Experimental results demonstrate that AMDM significantly improves fine-grained control without training, validating its effectiveness. Additionally, it reveals that diffusion models initially focus on features such as position, attributes, and style, with later stages improving generation quality and consistency. AMDM offers a new perspective for tackling the challenges of fine-grained conditional generation in diffusion models. Specifically, it allows us to fully utilize existing or develop new conditional diffusion models that control specific aspects, and then aggregate them using the AMDM algorithm. This eliminates the need for constructing complex datasets, designing intricate model architectures, and incurring high training costs. Code is available at: https://github.com/Hammour-steak/AMDM.
中文标题/摘要
标题:通过多个扩散模型聚合提高细粒度控制
尽管许多扩散模型在控制特定方面如风格、角色和互动时表现良好,但在细粒度控制方面由于数据集限制和复杂的模型架构设计而遇到困难。本文提出了一种无需训练的细粒度生成新算法,称为多个扩散模型聚合(AMDM)。该算法将同一生态系统内多个扩散模型在潜在数据空间中的特征整合到指定模型中,从而激活特定特征并实现细粒度控制。实验结果表明,AMDM在无需训练的情况下显著提高了细粒度控制能力,验证了其有效性。此外,它还揭示了扩散模型最初关注位置、属性和风格等特征,后期阶段则提高生成质量和一致性。AMDM为解决扩散模型中的细粒度条件生成挑战提供了新视角。具体而言,它允许我们充分利用现有或开发新的控制特定方面的条件扩散模型,并使用AMDM算法进行聚合。这消除了构建复杂数据集、设计复杂模型架构和高训练成本的需要。代码可在:https://github.com/Hammour-steak/AMDM 获取。
Visual Instruction Pretraining for Domain-Specific Foundation Models
Authors: Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang
First: 2025-09-22T10:57:42+00:00 · Latest: 2026-01-28T07:15:09+00:00
Abstract
Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.
中文标题/摘要
标题:领域特定基础模型的视觉指令预训练
现代计算机视觉正在形成一个闭环,在这个闭环中,感知、推理和生成相互增强。然而,这个闭环仍然不完整:高层推理对低层感知特征基础学习的自上而下的影响尚未得到充分探索。本文通过提出一种新的预训练范式来解决这一差距,以在下游领域预训练基础模型。我们引入了视觉指令预训练(ViTP),这是一种新颖的方法,可以直接利用推理来增强感知。ViTP 将视觉变换器(ViT)主干嵌入到视觉语言模型中,并使用从目标下游领域收集的丰富视觉指令数据集对其进行端到端预训练。ViTP 由我们提出的视觉鲁棒性学习(VRL)驱动,促使 ViT 从稀疏的视觉标记集中学习稳健且领域相关的特征。在 16 个具有挑战性的遥感和医学成像基准测试上的广泛实验表明,ViTP 在多种下游任务中建立了新的最佳性能。代码可在 https://github.com/zcablii/ViTP 获取。
Summary / 总结
This paper aims to enhance the foundational learning of low-level perceptual features by incorporating high-level reasoning through a new pretraining paradigm called Visual insTruction Pretraining (ViTP). ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end on visual instruction data from target domains, with Visual Robustness Learning (VRL) pushing the ViT to learn from a sparse set of visual tokens. On 16 remote sensing and medical imaging benchmarks, ViTP establishes new state-of-the-art performance across a range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.
本文旨在通过整合高层次推理来增强计算机视觉中低级感知特征的基础学习。它引入了Visual insTruction Pretraining (ViTP),该方法使用Vision-Language模型和Vision Transformer骨干,在特定领域的视觉指令数据上进行端到端预训练。在16个基准测试上的实验表明,ViTP在遥感和医学成像任务中优于现有方法,建立了新的最佳性能。代码可在https://github.com/zcablii/ViTP获得。
Physically Guided Visual Mass Estimation from a Single RGB Image
Authors: Sungjae Lee, Junhan Jeong, Yeonjoo Hong, Kwang In Kim
First: 2026-01-28T06:53:36+00:00 · Latest: 2026-01-28T06:53:36+00:00
Abstract
Estimating object mass from visual input is challenging because mass depends jointly on geometric volume and material-dependent density, neither of which is directly observable from RGB appearance. Consequently, mass prediction from pixels is ill-posed and therefore benefits from physically meaningful representations to constrain the space of plausible solutions. We propose a physically structured framework for single-image mass estimation that addresses this ambiguity by aligning visual cues with the physical factors governing mass. From a single RGB image, we recover object-centric three-dimensional geometry via monocular depth estimation to inform volume and extract coarse material semantics using a vision-language model to guide density-related reasoning. These geometry, semantic, and appearance representations are fused through an instance-adaptive gating mechanism, and two physically guided latent factors (volume- and density-related) are predicted through separate regression heads under mass-only supervision. Experiments on image2mass and ABO-500 show that the proposed method consistently outperforms state-of-the-art methods.
中文标题/摘要
标题:基于物理引导的单张RGB图像物体质量估计
从视觉输入估计物体质量具有挑战性,因为质量同时依赖于几何体积和材料相关的密度,而这两者都无法从RGB外观中直接观测到。因此,从像素预测质量是一个病态问题,需要借助具有物理意义的表示来约束可能解的空间。我们提出了一种基于物理结构的单图像质量估计框架,通过将视觉线索与决定质量的物理因素对齐来解决这种歧义。从单张RGB图像中,我们通过单目深度估计恢复以物体为中心的三维几何形状,以提供体积信息,并使用视觉-语言模型提取粗略的材料语义,以引导密度相关的推理。这些几何、语义和外观表示通过实例自适应门控机制融合,并在仅有质量监督的条件下,由两个独立的回归头分别预测两个物理引导的潜在因素(体积相关和密度相关)。在image2mass和ABO-500上的实验表明,所提出的方法始终优于最先进的方法。
Summary / 总结
The research aims to estimate object mass from a single RGB image by addressing the challenge of jointly estimating geometric volume and material density, which are not directly observable. The method uses a physically structured framework to align visual cues with physical factors governing mass. It recovers object-centric 3D geometry and extracts coarse material semantics, which are then fused through an instance-adaptive gating mechanism. The method predicts two physically guided latent factors (volume- and density-related) under mass-only supervision. Experiments show that the proposed method outperforms existing state-of-the-art methods on image2mass and ABO-500 datasets.
研究旨在通过单张RGB图像估计物体质量,解决几何体积和材料密度联合估计的挑战。方法使用单目深度估计恢复物体几何,并用视觉语言模型提取材料语义,然后通过实例自适应门控机制融合这些表示。体积和密度相关的两个物理引导的潜在因素在质量监督下分别被预测。实验表明,该方法在image2mass和ABO-500数据集上优于现有方法。
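The ill-posedness mentioned in the abstract can be illustrated with the textbook relation mass = volume × density. The sketch below assumes the two latent factors are combined as a sum in log space and trained with a squared log-mass error. Both choices are assumptions for illustration, since the abstract does not state how the two regression heads are combined or which loss is used.

```python
import math

def predict_mass(log_volume, log_density):
    """Combine the two latent factors: m = V * rho, written in log space."""
    return math.exp(log_volume + log_density)

def mass_only_loss(log_volume, log_density, true_mass):
    """Squared error on log-mass; neither factor is supervised on its own."""
    return (log_volume + log_density - math.log(true_mass)) ** 2

# Example object: 2 litres (0.002 m^3) of a water-like material (1000 kg/m^3).
log_v = math.log(0.002)
log_rho = math.log(1000.0)
mass = predict_mass(log_v, log_rho)
loss = mass_only_loss(log_v, log_rho, true_mass=2.0)

# Shifting volume up and density down by the same amount leaves the loss unchanged.
shift = 1.5
loss_shifted = mass_only_loss(log_v + shift, log_rho - shift, true_mass=2.0)
```

The unchanged loss under the shift shows that mass-only supervision cannot separate volume from density by itself. That ambiguity is why the abstract pairs the volume-related factor with depth-derived geometry and the density-related factor with material semantics from a vision-language model.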
Hallucination Begins Where Saliency Drops
Authors: Xiaofeng Zhang, Yuanchao Zhu, Chaochen Gu, Xiaosong Yuan, Qiyan Zhao, Jiawei Cao, Feilong Tang, Sinan Fan, Yaomin Shen, Chen Shen, Hao Tang
Venue: ICLR 2026
First: 2026-01-28T05:50:52+00:00 · Latest: 2026-01-28T05:50:52+00:00
Comments: Accepted in ICLR 2026
Abstract
Recent studies have examined attention dynamics in large vision-language models (LVLMs) to detect hallucinations. However, existing approaches remain limited in reliably distinguishing hallucinated from factually grounded outputs, as they rely solely on forward-pass attention patterns and neglect gradient-based signals that reveal how token influence propagates through the network. To bridge this gap, we introduce LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token by fusing attention weights with their input gradients. Our analysis uncovers a decisive pattern: hallucinations frequently arise when preceding output tokens exhibit low saliency toward the prediction of the next token, signaling a breakdown in contextual memory retention. Leveraging this insight, we propose a dual-mechanism inference-time framework to mitigate hallucinations: (1) Saliency-Guided Rejection Sampling (SGRS), which dynamically filters candidate tokens during autoregressive decoding by rejecting those whose saliency falls below a context-adaptive threshold, thereby preventing coherence-breaking tokens from entering the output sequence; and (2) Local Coherence Reinforcement (LocoRE), a lightweight, plug-and-play module that strengthens attention from the current token to its most recent predecessors, actively counteracting the contextual forgetting behavior identified by LVLMs-Saliency. Extensive experiments across multiple LVLMs demonstrate that our method significantly reduces hallucination rates while preserving fluency and task performance, offering a robust and interpretable solution for enhancing model reliability. Code is available at: https://github.com/zhangbaijin/LVLMs-Saliency
中文标题/摘要
标题:幻觉始于显著性下降
近期研究探讨了大型视觉-语言模型(LVLM)中的注意力动态,以检测幻觉。然而,现有方法在可靠地区分幻觉输出与事实依据的输出方面仍有限制,因为它们仅依赖于前向传递的注意力模式,而忽略了揭示标记影响如何在网络中传播的梯度信号。为弥合这一差距,我们引入了LVLMs-显著性,这是一种梯度感知的诊断框架,通过将注意力权重与输入梯度融合来量化每个输出标记的视觉接地强度。我们的分析揭示了一个决定性的模式:幻觉经常在前一个输出标记对下一个标记预测的显著性较低时出现,这表明上下文记忆保留出现了故障。基于这一洞察,我们提出了一种双重机制的推理时框架来减轻幻觉:(1)显著性引导的拒绝采样(SGRS),该机制在自回归解码过程中动态过滤候选标记,通过拒绝那些显著性低于上下文自适应阈值的标记,从而防止不连贯的标记进入输出序列;(2)局部连贯强化(LocoRE),这是一种轻量级、即插即用模块,加强当前标记与其最近前驱的注意力,积极对抗LVLMs-显著性识别的上下文遗忘行为。在多个LVLM上的广泛实验表明,我们的方法显著降低了幻觉率,同时保持了流畅性和任务性能,提供了一种增强模型可靠性的稳健且可解释的解决方案。代码可在:https://github.com/zhangbaijin/LVLMs-Saliency 获取。
Summary / 总结
This study addresses the challenge of detecting hallucinations in large vision-language models (LVLMs) by introducing LVLMs-Saliency, a gradient-aware diagnostic framework. It quantifies the visual grounding strength of each output token by fusing attention weights with input gradients. The research reveals that hallucinations often occur when preceding tokens have low saliency towards the prediction of the next token. To mitigate hallucinations, the study proposes Saliency-Guided Rejection Sampling (SGRS) and Local Coherence Reinforcement (LocoRE), which dynamically filter and reinforce attention, respectively. Experiments show that the method effectively reduces hallucination rates while maintaining fluency and task performance.
该研究通过引入LVLMs-Saliency,一种基于梯度的诊断框架,解决了大型视觉-语言模型(LVLMs)中的幻觉检测问题。该框架量化每个输出词的视觉接地强度,并发现幻觉通常发生在先前词具有低显著性时。提出的双机制推理时框架,Saliency-Guided Rejection Sampling (SGRS) 和 Local Coherence Reinforcement (LocoRE),有效减少了幻觉现象,同时保持流畅性和任务性能。广泛的实验表明,在多个LVLMs中显著降低了幻觉率,提供了一种增强模型可靠性和可解释性的稳健解决方案。
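As a rough illustration of the two ideas in this entry, the sketch below fuses attention weights with gradients by taking the absolute elementwise product, then rejects candidate tokens whose saliency falls below a threshold derived from recently accepted tokens. The fusion rule, the `ratio` parameter, and all numbers are assumptions. The abstract only says that attention and input gradients are fused and that the threshold is context-adaptive.

```python
def fused_saliency(attention, gradient):
    """Elementwise |attention * gradient| over preceding tokens (assumed rule)."""
    return [abs(a * g) for a, g in zip(attention, gradient)]

def adaptive_threshold(accepted_scores, ratio=0.5):
    """Hypothetical context-adaptive threshold: a fraction of the running mean."""
    if not accepted_scores:
        return 0.0
    return ratio * sum(accepted_scores) / len(accepted_scores)

def filter_candidates(candidates, threshold):
    """Keep candidate tokens whose saliency score reaches the threshold."""
    return [token for token, score in candidates if score >= threshold]

# Attention from the current position to three earlier tokens, with gradients.
attention = [0.6, 0.3, 0.1]
gradient = [0.5, -0.2, 0.0]
saliency = fused_saliency(attention, gradient)

# Candidate next tokens with made-up aggregate saliency scores.
candidates = [("cat", 0.40), ("dog", 0.05)]
threshold = adaptive_threshold([0.30, 0.50])
kept = filter_candidates(candidates, threshold)
```

A token with high attention but zero gradient gets zero saliency here, which is the kind of case forward-pass attention alone would miss. In a real decoder the gradients would come from backpropagation through the model rather than from fixed lists.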
FusionLog: Cross-System Log-based Anomaly Detection via Fusion of General and Proprietary Knowledge
Authors: Xinlong Zhao, Tong Jia, Minghua He, Xixuan Yang, Ying Li
First: 2025-11-08T06:30:50+00:00 · Latest: 2026-01-28T05:23:53+00:00
Comments: 11 pages, 4 figures, and 2 tables
Abstract
Log-based anomaly detection is critical for ensuring the stability and reliability of web systems. One of the key problems in this task is the lack of sufficient labeled logs, which limits the rapid deployment in new systems. Existing works usually leverage large-scale labeled logs from a mature web system and a small amount of labeled logs from a new system, using transfer learning to extract and generalize general knowledge across both domains. However, these methods focus solely on the transfer of general knowledge and neglect the disparity and potential mismatch between such knowledge and the proprietary knowledge of target system, thus constraining performance. To address this limitation, we propose FusionLog, a novel zero-label cross-system log-based anomaly detection method that effectively achieves the fusion of general and proprietary knowledge, enabling cross-system generalization without any labeled target logs. Specifically, we first design a training-free router based on semantic similarity that dynamically partitions unlabeled target logs into 'general logs' and 'proprietary logs.' For general logs, FusionLog employs a small model based on system-agnostic representation meta-learning for direct training and inference, inheriting the general anomaly patterns shared between the source and target systems. For proprietary logs, we iteratively generate pseudo-labels and fine-tune the small model using multi-round collaborative knowledge distillation and fusion based on large language model (LLM) and small model (SM) to enhance its capability to recognize anomaly patterns specific to the target system. Experimental results on three public log datasets from different systems show that FusionLog achieves over 90% F1-score under a fully zero-label setting, significantly outperforming state-of-the-art cross-system log-based anomaly detection methods.
中文标题/摘要
标题:FusionLog:通过融合通用和专有知识的跨系统日志异常检测方法
基于日志的异常检测对于确保网络系统的稳定性和可靠性至关重要。这一任务中的一个关键问题是缺乏足够的标注日志,这限制了其在新系统中的快速部署。现有工作通常利用成熟网络系统的大规模标注日志和新系统的小规模标注日志,通过迁移学习提取和泛化跨两个领域的通用知识。然而,这些方法仅专注于通用知识的迁移,而忽视了此类知识与目标系统专有知识之间的差异和潜在不匹配,从而限制了性能。为解决这一局限,我们提出了一种名为FusionLog的新型零标注跨系统日志异常检测方法,该方法有效地实现了通用和专有知识的融合,能够在没有目标系统标注日志的情况下实现跨系统的泛化。具体而言,我们首先基于语义相似性设计了一种无需训练的路由器,动态地将未标注的目标日志划分为“通用日志”和“专有日志”。对于通用日志,FusionLog 使用基于系统无关表示元学习的小型模型进行直接训练和推理,继承了源系统和目标系统之间共享的通用异常模式。对于专有日志,我们通过多轮基于大型语言模型(LLM)和小型模型(SM)的协作知识蒸馏和融合迭代生成伪标签,并对小型模型进行微调,以增强其识别目标系统特定异常模式的能力。在三个不同系统的公开日志数据集上的实验结果显示,在完全零标注设置下,FusionLog 的 F1 分数超过 90%,显著优于最先进的跨系统日志异常检测方法。
Summary / 总结
FusionLog is a novel zero-label cross-system log-based anomaly detection method that integrates general and proprietary knowledge to improve performance without labeled target logs. It uses a training-free router based on semantic similarity to partition logs into general and proprietary categories. For general logs, it employs a small model for direct training and inference. For proprietary logs, it generates pseudo-labels and fine-tunes the model using collaborative knowledge distillation and fusion. Experiments on three public log datasets show FusionLog achieves over 90% F1-score, outperforming existing methods.
FusionLog 是一种新颖的零标签跨系统日志异常检测方法,它融合了通用知识和专有知识以提高性能。该方法使用一个基于语义相似性的免训练路由器将未标注的目标日志分为通用日志和专有日志两类:对于通用日志,直接使用小型模型进行训练和推理;对于专有日志,则通过多轮协作知识蒸馏和融合生成伪标签并微调小型模型。实验结果表明,FusionLog 在三个不同系统的公共日志数据集上实现了超过 90% 的 F1 分数,显著优于现有的跨系统日志异常检测方法。
Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review
Authors: Matthew Lisondra, Beno Benhabib, Goldie Nejat
First: 2025-05-26T20:08:09+00:00 · Latest: 2026-01-28T05:01:06+00:00
Comments: v2: Expanded systematic review; resubmitted to Robotics
Abstract
Rapid advancements in foundation models, including Large Language Models, Vision-Language Models, Multimodal Large Language Models, and Vision-Language-Action Models, have opened new avenues for embodied AI in mobile service robotics. By combining foundation models with the principles of embodied AI, where intelligent systems perceive, reason, and act through physical interaction, mobile service robots can achieve more flexible understanding, adaptive behavior, and robust task execution in dynamic real-world environments. Despite this progress, embodied AI for mobile service robots continues to face fundamental challenges related to the translation of natural language instructions into executable robot actions, multimodal perception in human-centered environments, uncertainty estimation for safe decision-making, and computational constraints for real-time onboard deployment. In this paper, we present the first systematic review focused specifically on the integration of foundation models in mobile service robotics. We analyze how recent advances in foundation models address these core challenges through language-conditioned control, multimodal sensor fusion, uncertainty-aware reasoning, and efficient model scaling. We further examine real-world applications in domestic assistance, healthcare, and service automation, highlighting how foundation models enable context-aware, socially responsive, and generalizable robot behaviors. Beyond technical considerations, we discuss ethical, societal, and human-interaction implications associated with deploying foundation model-enabled service robots in human environments. Finally, we outline future research directions emphasizing reliability and lifelong adaptation, privacy-aware and resource-constrained deployment, and governance and human-in-the-loop frameworks required for safe, scalable, and trustworthy mobile service robotics.
中文标题/摘要
标题:基于基础模型的移动服务机器人具身AI:系统综述
基础模型的迅速发展,包括大型语言模型、视觉-语言模型、多模态大型语言模型和视觉-语言-行动模型,为移动服务机器人的具身AI开辟了新的途径。通过将基础模型与具身AI的原则相结合,即智能系统通过物理交互感知、推理和行动,移动服务机器人可以在动态的现实环境中实现更灵活的理解、适应性行为和稳健的任务执行。尽管取得了这些进展,移动服务机器人的具身AI仍然面临着将自然语言指令转化为可执行的机器人动作、以人为中心的环境中的多模态感知、安全决策中的不确定性估计以及实时机载部署的计算约束等根本挑战。在本文中,我们首次针对基础模型在移动服务机器人中的集成进行了系统综述。我们分析了最近基础模型的进展如何通过语言条件控制、多模态传感器融合、不确定性感知推理和高效模型缩放来解决这些核心挑战。我们进一步探讨了在家庭辅助、医疗保健和服务自动化中的实际应用,突出了基础模型如何使机器人行为具有情境感知、社会响应性和泛化能力。除了技术考虑,我们还讨论了在人类环境中部署基础模型驱动的服务机器人所带来的伦理、社会和人机交互影响。最后,我们概述了未来研究方向,强调可靠性与终身适应、隐私保护和资源受限部署,以及实现安全、可扩展和值得信赖的移动服务机器人所需的治理和人在环框架。
Summary / 总结
This paper reviews the integration of foundation models, such as large language models and vision-language-action models, into mobile service robotics to enhance embodied AI. The study addresses challenges like translating natural language instructions into robot actions, multimodal perception, and computational constraints. Key findings include the use of language-conditioned control, multimodal sensor fusion, uncertainty-aware reasoning, and efficient model scaling to improve robot behaviors in domestic assistance, healthcare, and service automation. The review also discusses ethical and societal implications of deploying these robots in human environments.
本文回顾了基础模型在移动服务机器人中的集成,解决了自然语言指令转换为机器人动作和多模态感知等挑战。研究分析了语言条件控制和多模态传感器融合等近期进展如何克服这些挑战。主要发现包括使机器人在家庭辅助、医疗保健等应用中实现情境感知、社会响应和通用行为。文章还讨论了伦理和社会影响,并提出了可靠性、隐私和人类在环框架等未来研究方向。
The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning
Authors: Ren Zhuang, Ben Wang, Shuifa Sun
First: 2026-01-25T18:16:17+00:00 · Latest: 2026-01-28T03:14:37+00:00
Comments: 11 pages, 5 figures
Abstract
Scaling test-time compute enhances long chain-of-thought (CoT) reasoning, yet existing approaches face a fundamental trade-off between computational cost and coverage quality: either incurring high training expense or yielding redundant trajectories. We introduce The Geometric Reasoner (TGR), a training-free framework that performs manifold-informed latent foresight search under strict memory bounds. At each chunk boundary, TGR scores candidate latent anchors via a lightweight look-ahead estimate combined with soft geometric regularizers that encourage smooth trajectories and diverse exploration. Chunk-wise KV cache resets keep memory linear in chunk length. On challenging math and code benchmarks, TGR improves robust trajectory coverage, measured by the area under the Pass@k curve (AUC), by up to 13 points on Qwen3-8B, with negligible overhead of about 1.1–1.3 times.
中文标题/摘要
标题:几何推理器:流形导向的潜在先见搜索以实现长上下文推理
扩展测试时计算可以增强长思维链(CoT)推理,但现有方法在计算成本和覆盖质量之间存在根本性权衡:要么导致高昂的训练成本,要么产生冗余轨迹。我们引入了几何推理器(TGR),这是一种无需训练的框架,能够在严格内存限制下进行流形导向的潜在先见搜索。在每个块边界,TGR 通过轻量级的前瞻估计与软几何正则化相结合来为候选潜在锚点评分,这些正则化项鼓励平滑轨迹和多样探索。按块进行的KV缓存重置使内存与块长度成线性关系。在具有挑战性的数学和代码基准测试中,TGR 在 Qwen3-8B 上将以 Pass@k 曲线下面积(AUC)衡量的鲁棒轨迹覆盖率最多提升 13 个点,而开销可忽略,约为 1.1–1.3 倍。
Summary / 总结
The Geometric Reasoner (TGR) addresses the trade-off between computational cost and coverage quality in long chain-of-thought reasoning by performing training-free, manifold-informed latent foresight search. TGR scores candidate latent anchors using a lightweight look-ahead estimate and soft geometric regularizers, and resets the KV cache chunk by chunk to keep memory linear in chunk length. On math and code benchmarks, TGR improves robust trajectory coverage, measured by the area under the Pass@k curve, by up to 13 points on Qwen3-8B, with an overhead of about 1.1–1.3 times.
几何推理器(TGR)通过执行流形导向的潜在前瞻搜索来解决长思维链推理中计算成本与覆盖质量之间的权衡问题。它使用轻量级的前瞻估计和软几何正则化项为候选潜在锚点评分,并通过逐块的KV缓存重置使内存与块长度成线性关系。在数学和代码基准测试中,TGR 在 Qwen3-8B 模型上将稳健轨迹覆盖度最多提高 13 个点,开销仅为约 1.1 至 1.3 倍。
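The coverage metric named in this entry, the area under the Pass@$k$ curve, can be illustrated with a minimal Python sketch. It uses the widely used unbiased Pass@k estimator, 1 - C(n-c, k)/C(n, k), and averages it over k = 1..max_k. The abstract does not say how TGR samples or normalizes the curve, so the function names and the plain-mean normalization below are assumptions made for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n sampled trajectories, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must lie between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must lie between 1 and n")
    # Probability that a random size-k subset contains no correct sample.
    miss = comb(n - c, k) / comb(n, k)
    return 1.0 - miss


def pass_at_k_auc(n: int, c: int, max_k: int) -> float:
    """Assumed normalization: mean of Pass@k over k = 1..max_k."""
    values = [pass_at_k(n, c, k) for k in range(1, max_k + 1)]
    return sum(values) / len(values)
```

For example, `pass_at_k_auc(4, 1, 2)` averages Pass@1 and Pass@2 for a problem where one of four sampled trajectories is correct.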
VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing
Authors: Guanyuan Pan, Shuai Wang, Yugui Lin, Tiansheng Zhou, Pietro Liò, Zhenxin Zhao, Yaqi Wang
First: 2026-01-12T08:37:32+00:00 · Latest: 2026-01-28T02:59:53+00:00
Comments: 9 pages, 4 figures
Abstract
Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches rely solely on netlists, ignoring the circuit schematic, which hinders the cognitive link between the schematic and its performance. Furthermore, the black-box nature of machine learning methods and hallucination risks in large language models fail to provide the necessary ground-truth explainability required for industrial sign-off. To address these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing, and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-start from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experiment results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance while maintaining physics-based explainability. VLM-CAD meets all specification requirements while maintaining low power consumption in optimizing an amplifier with a complementary input and a class-AB output stage, with a total runtime under 66 minutes across all experiments on two amplifiers.
中文标题/摘要
标题:VLM-CAD:面向模拟电路尺寸设计的视觉语言模型优化协作智能体设计工作流
模拟混合信号电路尺寸设计涉及高维设计空间中的复杂权衡。现有的自动模拟电路尺寸设计方法仅依赖于网表,忽略了电路原理图,阻碍了原理图与其性能之间的认知联系。此外,机器学习方法的黑箱性质和大型语言模型中的幻觉风险,使其无法提供工业签核(sign-off)所需的基于真实依据的可解释性。为了解决这些挑战,我们提出了一种视觉语言模型优化的协作智能体设计工作流(VLM-CAD),该工作流分析电路、优化直流工作点、进行基于推理的尺寸设计并执行外部尺寸优化。我们整合了Image2Net来标注电路原理图并生成结构化的JSON描述,以便视觉语言模型精确解读。此外,我们提出了一种可解释的信赖域贝叶斯优化方法(ExTuRBO),该方法利用智能体生成的种子进行协作式热启动,并为外部尺寸优化提供双粒度灵敏度分析,从而支持全面的最终设计报告。在使用180nm、90nm和45nm预测技术模型(Predictive Technology Models)的放大器尺寸设计任务上的实验结果表明,VLM-CAD在保持基于物理的可解释性的同时有效平衡了功耗和性能。在优化一个具有互补输入级和AB类输出级的放大器时,VLM-CAD满足所有指标要求并保持低功耗,在两个放大器上的全部实验总运行时间低于66分钟。
Summary / 总结
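VLM-CAD is a collaborative agent workflow, optimized with Vision Language Models, for analog circuit sizing. It uses Image2Net to annotate schematics and produce a structured JSON description, and ExTuRBO, an explainable trust-region Bayesian optimization method warm-started from agent-generated seeds, to perform external sizing optimization with dual-granularity sensitivity analysis and a comprehensive design report. Experiments on amplifier sizing tasks with 180nm, 90nm, and 45nm Predictive Technology Models show that VLM-CAD balances power and performance while maintaining physics-based explainability, with a total runtime under 66 minutes across all experiments on two amplifiers.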
VLM-CAD 是一种结合了视觉语言模型和协作代理的工作流,用于优化模拟电路的尺寸。该方法分析电路、优化直流工作点,并进行基于推理的尺寸优化。使用 Image2Net 对电路图进行注释,并采用可解释的信任区域贝叶斯优化(ExTuRBO)进行外部尺寸优化,提供详细的灵敏度分析。实验结果表明,VLM-CAD 在不同技术模型的放大器尺寸任务中有效地平衡了功率和性能,同时保持了基于物理的可解释性和低功耗。
Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
Authors: Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Huan Gao, Mingkun Xu, Shangyang Li
Venue: ICLR
First: 2025-09-26T12:20:01+00:00 · Latest: 2026-01-28T02:56:44+00:00
Comments: 23 pages, 12 figures
Abstract
Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
中文标题/摘要
标题:超越分类准确率:Neural-MedBench和深入推理基准的需求
近期视觉-语言模型(VLMs)在标准医学基准测试中取得了显著的性能,但其真正的临床推理能力仍然不清楚。现有数据集主要强调分类准确率,导致一种评估错觉,即模型看似熟练但实际上在高风险诊断推理方面仍然失败。我们引入了Neural-MedBench,这是一个紧凑但推理密集的基准,专门设计用于探索神经学多模态临床推理的极限。Neural-MedBench 结合了多序列 MRI 扫描、结构化的电子健康记录和临床笔记,并涵盖了三个核心任务家族:鉴别诊断、病灶识别和理由生成。为了确保可靠的评估,我们开发了一种结合了基于LLM的评分员、临床验证和语义相似度度量的混合评分管道。通过对包括GPT-4o、Claude-4和MedGemma在内的最新VLMs进行系统评估,我们观察到与传统数据集相比,性能出现了显著下降。错误分析表明,推理失败而非感知错误是模型的主要缺陷。我们的研究结果强调了双重评估框架的必要性:广度导向的大数据集用于统计泛化,以及深度导向、紧凑的基准如Neural-MedBench用于推理准确性。我们通过https://neuromedbench.github.io/发布了Neural-MedBench,作为开放和可扩展的诊断测试平台,指导未来基准的扩展,并实现严格的成本效益评估。
Summary / 总结
The research evaluates the clinical reasoning ability of vision-language models (VLMs) beyond classification accuracy. Neural-MedBench, a compact neurology benchmark, integrates multi-sequence MRI scans, structured electronic health records, and clinical notes to test differential diagnosis, lesion recognition, and rationale generation, and is scored by a hybrid pipeline of LLM-based graders, clinician validation, and semantic similarity metrics. State-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, show a sharp performance drop relative to conventional datasets, with errors dominated by reasoning failures rather than perceptual errors. The authors argue for a Two-Axis Evaluation Framework that pairs breadth-oriented large datasets with depth-oriented compact benchmarks such as Neural-MedBench.
论文提出了Neural-MedBench,这是一个旨在评估神经学领域中视觉-语言模型推理能力的基准,重点关注鉴别诊断、病灶识别和推理生成。通过整合MRI扫描、电子健康记录和临床笔记,它旨在弥补当前评价方法主要关注分类准确性的不足。研究发现,最先进的模型在这一基准上表现不佳,表明它们在更深层次的推理任务上存在问题。作者提出了一个两轴评价框架,以指导未来基准的发展,确保AI系统的临床可信度。
The Ensemble Schrödinger Bridge filter for Nonlinear Data Assimilation
Authors: Hui Sun
First: 2025-12-22T00:06:49+00:00 · Latest: 2026-01-28T00:54:59+00:00
Abstract
This work puts forward a novel nonlinear optimal filter namely the Ensemble Schrödinger Bridge nonlinear filter. The proposed filter finds marriage of the standard prediction procedure and the diffusion generative modeling for the analysis procedure to realize one filtering step. The designed approach finds no structural model error, and it is derivative free, training free and highly parallelizable. Experimental results show that the designed algorithm performs well given highly nonlinear dynamics in (mildly) high dimension up to 40 or above under a chaotic environment. It also shows better performance than classical methods such as the ensemble Kalman filter and the Particle filter in numerous tests given different level of nonlinearity. Future work will focus on extending the proposed approach to practical meteorological applications and establishing a rigorous convergence analysis.
中文标题/摘要
标题:Schrödinger 桥集成滤波器在非线性数据同化中的应用
本文提出了一种新的非线性最优滤波器,即集成 Schrödinger 桥非线性滤波器。所提出的滤波器将标准预测过程与扩散生成建模相结合,以实现分析过程中的一步滤波。所设计的方法没有结构模型误差,且无需求导、无需训练,具有高度并行性。实验结果表明,在混沌环境中,该设计算法在(轻微)高维(40或以上)且高度非线性动力学条件下表现良好。此外,在不同非线性程度下,与经典方法(如集成卡尔曼滤波器和粒子滤波器)相比,该算法在多次测试中表现出更好的性能。未来的工作将致力于将所提出的方法扩展到实际气象应用,并建立严格的收敛分析。
Summary / 总结
This work introduces the Ensemble Schrödinger Bridge nonlinear filter, which combines a standard prediction step with diffusion generative modeling for the analysis step. The filter has no structural model error and is derivative-free, training-free, and highly parallelizable. It performs well on highly nonlinear dynamics in dimensions up to 40 or above under chaotic conditions, and it outperforms classical methods such as the ensemble Kalman filter and the particle filter in tests with different levels of nonlinearity.
该工作提出了一种新的非线性滤波器,即集成 Schrödinger 桥(Ensemble Schrödinger Bridge)非线性滤波器。它将标准预测步骤与用于分析步骤的扩散生成建模相结合,没有结构模型误差,且无需训练或求导。实验结果显示,在混沌环境中,该算法在维度达到40或以上的强非线性动力学条件下表现良好,并在不同非线性程度的测试中优于集成卡尔曼滤波器和粒子滤波器等经典方法。
PromptVFX: Text-Driven Fields for Open-World 3D Gaussian Animation
Authors: Mert Kiray, Paul Uhlenbruck, Nassir Navab, Benjamin Busam
First: 2025-06-01T17:22:59+00:00 · Latest: 2026-01-28T00:03:21+00:00
Abstract
Visual effects (VFX) are key to immersion in modern films, games, and AR/VR. Creating 3D effects requires specialized expertise and training in 3D animation software and can be time consuming. Generative solutions typically rely on computationally intense methods such as diffusion models which can be slow at 4D inference. We reformulate 3D animation as a field prediction task and introduce a text-driven framework that infers a time-varying 4D flow field acting on 3D Gaussians. By leveraging large language models (LLMs) and vision-language models (VLMs) for function generation, our approach interprets arbitrary prompts (e.g., "make the vase glow orange, then explode") and instantly updates color, opacity, and positions of 3D Gaussians in real time. This design avoids overheads such as mesh extraction, manual or physics-based simulations and allows both novice and expert users to animate volumetric scenes with minimal effort on a consumer device even in a web browser. Experimental results show that simple textual instructions suffice to generate compelling time-varying VFX, reducing the manual effort typically required for rigging or advanced modeling. We thus present a fast and accessible pathway to language-driven 3D content creation that can pave the way to democratize VFX further. Code available at https://obsphera.github.io/promptvfx/.
中文标题/摘要
标题:PromptVFX:用于开放世界3D高斯动画的文本驱动场
视觉效果(VFX)是现代电影、游戏和AR/VR中沉浸感的关键。创建3D效果需要专门的3D动画软件专业知识和培训,并且耗时。生成解决方案通常依赖于计算密集型方法,如扩散模型,这些方法在4D推理时可能很慢。我们将3D动画重新定义为一个场预测任务,并引入了一种基于文本的框架,该框架推断作用于3D高斯体的时间变化4D流场。通过利用大型语言模型(LLMs)和视觉语言模型(VLMs)进行函数生成,我们的方法可以解释任意提示(例如,“让花瓶发出橙光,然后爆炸”),并实时更新3D高斯体的颜色、透明度和位置。此设计避免了诸如网格提取、手动或基于物理的模拟等开销,使初学者和专家用户都能在消费设备上,甚至在网页浏览器中,以最小的努力动画化体积场景。实验结果表明,简单的文本指令足以生成引人入胜的时间变化VFX,减少了通常用于绑定或高级建模所需的手动努力。因此,我们提出了一种快速且易于访问的基于语言的3D内容创作途径,这可以为VFX的普及铺平道路。代码可在https://obsphera.github.io/promptvfx/获取。
Summary / 总结
The paper presents PromptVFX, a text-driven framework for generating 3D Gaussian animations. It reformulates 3D animation as a field prediction task and uses large language models and vision-language models to interpret textual prompts and update 3D Gaussians in real time. The approach avoids the need for mesh extraction and manual simulations, allowing users to create compelling VFX with minimal effort on a consumer device. Experiments show that simple textual instructions can generate time-varying VFX, reducing the manual effort required for advanced modeling and rigging tasks.
研究旨在通过文本驱动的方法简化3D视觉效果(VFX)的创建,解决传统3D动画复杂和耗时的问题。方法将3D动画重新表述为场预测任务,利用大型语言模型和视觉语言模型来推断作用于3D高斯体的时间变化4D流场。实验结果表明,简单的文本指令可以生成引人注目的时间变化VFX,大大减少了高级建模和绑定所需的手动努力。这种方法使初学者和专家用户都能在消费级设备上实时动画体积场景,包括网页浏览器,从而进一步普及VFX创作。代码可在https://obsphera.github.io/promptvfx/ 获取。
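To make the idea of a time-varying field acting on 3D Gaussians concrete, here is a minimal Python sketch of the kind of function a language model might be asked to generate for the prompt "make the vase glow orange, then explode". Everything in it is hypothetical: the `Gaussian` record, the `glow_then_explode` name, and the `glow_end` and `speed` parameters are illustrative assumptions, not the PromptVFX interface.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    position: tuple  # (x, y, z)
    color: tuple     # (r, g, b), each in [0, 1]
    opacity: float   # in [0, 1]


def lerp(a: float, b: float, w: float) -> float:
    """Linear interpolation from a (w = 0) to b (w = 1)."""
    return a + (b - a) * w


def glow_then_explode(g, t, center=(0.0, 0.0, 0.0), glow_end=1.0, speed=2.0):
    """Hypothetical field: blend color toward orange until glow_end,
    then push the Gaussian away from center and fade it out."""
    orange = (1.0, 0.5, 0.0)
    w = min(max(t / glow_end, 0.0), 1.0)          # glow progress in [0, 1]
    color = tuple(lerp(c, o, w) for c, o in zip(g.color, orange))
    after = max(t - glow_end, 0.0)                # time since explosion began
    offset = tuple(p - c for p, c in zip(g.position, center))
    position = tuple(p + speed * after * d for p, d in zip(g.position, offset))
    opacity = g.opacity * max(1.0 - 0.5 * after, 0.0)
    return Gaussian(position, color, opacity)
```

Evaluating such a function once per Gaussian per frame needs no mesh extraction or physics solver, which is the property the abstract emphasizes.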
Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration
Authors: Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu
First: 2025-12-08T05:15:41+00:00 · Latest: 2026-01-27T22:50:35+00:00
Comments: 9 pages, 3 figures. Preprint under review
Abstract
We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
中文标题/摘要
标题:通过无需训练的置信度感知校准提高基于扩散的大语言模型的吞吐量
我们提出了CadLLM,这是一种无需训练的方法,用于提升基于扩散的大语言模型(dLLMs)的推理吞吐量。我们首先研究了令牌去掩码置信度在不同块和步骤之间的动态特性。基于这一观察,我们提出了一种轻量级自适应方法,根据已去掩码令牌的平均置信度控制生成块大小、步长和阈值。我们进一步通过动态利用词汇表的子集来调节采样广度,从而减少softmax开销。CadLLM 是一种即插即用、模型无关的方法,适用于基于KV缓存的dLLMs。在四个流行任务上的大量实验表明,与最先进的基线相比,CadLLM 可以获得高达2.28倍的吞吐量提升,同时保持有竞争力的准确性。
Summary / 总结
CadLLM is a training-free method to enhance the inference throughput of diffusion-based large language models (dLLMs) by dynamically adjusting the generation block size, step size, and threshold based on the average confidence of unmasked tokens. It also reduces softmax overhead by sampling from a subset of the vocabulary. Experiments show that CadLLM can achieve up to 2.28x throughput improvement compared to the state-of-the-art baseline while maintaining competitive accuracy.
CadLLM 是一种无需训练的方法,根据已去掩码 token 的平均置信度动态调整生成块大小、步长和阈值,从而提升扩散型大语言模型(dLLMs)的推理吞吐量。它还通过动态使用词汇表的子集来减少 softmax 开销。实验表明,CadLLM 在四个常见任务上相比最先进的基线可实现最高 2.28 倍的吞吐量提升,并保持有竞争力的准确性。
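The confidence-driven control described in this entry can be sketched in a few lines of Python. The abstract only says that block size, step size, and threshold are set from the average confidence of unmasked tokens; it does not give the mapping. The linear policy below (higher confidence gives a larger block and a lower unmasking threshold), the function names, and all default bounds are therefore assumptions for illustration, not CadLLM's actual rule.

```python
def adapt_schedule(confidences, min_block=8, max_block=64,
                   min_thr=0.7, max_thr=0.95):
    """Map the mean confidence of already-unmasked tokens to a
    (block_size, threshold) pair using a hypothetical linear policy."""
    if not confidences:
        raise ValueError("need at least one confidence value")
    mean_conf = sum(confidences) / len(confidences)
    mean_conf = min(max(mean_conf, 0.0), 1.0)     # clamp to [0, 1]
    block_size = round(min_block + (max_block - min_block) * mean_conf)
    threshold = max_thr - (max_thr - min_thr) * mean_conf
    return block_size, threshold


def select_unmask(candidate_confidences, threshold):
    """Indices of masked positions confident enough to unmask this step."""
    return [i for i, c in enumerate(candidate_confidences) if c >= threshold]
```

A decoding loop would call `adapt_schedule` after each step and pass the resulting threshold to `select_unmask`, so that confident regions are decoded in larger, more aggressive steps.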
Look in the Middle: Structural Anchor Pruning for Scalable Visual RAG Indexing
Authors: Zhuchenyang Liu, Ziyu Hu, Yao Zhang, Yu Xiao
First: 2026-01-27T22:50:11+00:00 · Latest: 2026-01-27T22:50:11+00:00
Comments: 18 pages, 6 figures, 11 tables
Abstract
Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive index vector size overheads. Training-free pruning solutions (e.g., EOS-attention based methods) can reduce index vector size by approximately 60% without model adaptation, but often underperform random selection in high-compression scenarios (> 80%). Prior research (e.g., Light-ColPali) attributes this to the conclusion that visual token importance is inherently query-dependent, thereby questioning the feasibility of training-free pruning. In this work, we propose Structural Anchor Pruning (SAP), a training-free pruning method that identifies key visual patches from middle layers to achieve high performance compression. We also introduce Oracle Score Retention (OSR) protocol to evaluate how layer-wise information affects compression efficiency. Evaluations on the ViDoRe benchmark demonstrate that SAP reduces index vectors by over 90% while maintaining robust retrieval fidelity, providing a highly scalable solution for Visual RAG. Furthermore, our OSR-based analysis reveals that semantic structural anchor patches persist in the middle layers, unlike traditional pruning solutions that focus on the final layer where structural signals dissipate.
中文标题/摘要
标题:关注中间层:面向可扩展视觉RAG索引的结构锚点剪枝
近期的视觉语言模型(例如ColPali)能够实现精细的视觉文档检索(VDR),但会带来巨大的索引向量大小开销。无训练剪枝解决方案(例如基于EOS-attention的方法)可以在不进行模型适应的情况下将索引向量大小减少约60%,但在高压缩场景(>80%)中通常不如随机选择表现好。先前的研究(例如Light-ColPali)认为这是由于视觉标记的重要性本质上是查询依赖的,从而质疑无训练剪枝的可行性。在本文中,我们提出了一种无训练剪枝方法——结构锚点剪枝(SAP),该方法从中间层识别关键视觉片段以实现高效压缩。我们还引入了Oracle评分保留(OSR)协议来评估层间信息如何影响压缩效率。ViDoRe基准测试上的评估表明,SAP可以将索引向量减少超过90%的同时保持稳健的检索保真度,提供了一种高度可扩展的视觉RAG解决方案。此外,基于OSR的分析表明,语义结构锚点片段在中间层中持久存在,而传统的剪枝解决方案则集中在结构信号消散的最终层。
Summary / 总结
This paper addresses the challenge of reducing index vector size in Visual Document Retrieval (VDR) while maintaining retrieval performance. It introduces Structural Anchor Pruning (SAP), a training-free method that selects key visual patches from middle layers to achieve high compression. Experiments on the ViDoRe benchmark show that SAP reduces index vectors by over 90% and maintains robust retrieval fidelity, providing a scalable solution for Visual RAG. Additionally, the Oracle Score Retention (OSR) protocol reveals that semantic structural anchor patches persist in middle layers, unlike traditional methods that focus on the final layer.
本文旨在解决视觉文档检索(VDR)中减少索引向量大小的同时保持检索性能的问题。它提出了一种名为结构锚点剪枝(SAP)的无训练方法,从中间层选择关键视觉片段以实现高效压缩。ViDoRe基准上的实验表明,SAP可以将索引向量减少超过90%,同时保持稳健的检索精度,提供了一种可扩展的视觉RAG解决方案。此外,Oracle分数保留(OSR)协议揭示了语义结构锚点片段在中间层中持久存在,不同于传统方法集中在信号消散的最终层。
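The effect of pruning a multi-vector index can be sketched as follows. The sketch assumes a per-patch importance score is already available (in SAP it would come from middle-layer signals, whose exact form the abstract does not specify), keeps the top fraction of patch vectors, and scores a query with late-interaction MaxSim, the scoring style used by ColPali-type retrievers. The function names and the `keep_ratio` parameter are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prune_patches(patch_vecs, scores, keep_ratio):
    """Keep the highest-scoring patch vectors, preserving original order."""
    keep = max(1, round(len(patch_vecs) * keep_ratio))
    top = sorted(range(len(patch_vecs)), key=lambda i: scores[i], reverse=True)
    return [patch_vecs[i] for i in sorted(top[:keep])]


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: each query vector takes its best-matching
    document vector, and the maxima are summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

With `keep_ratio=0.1`, the stored index shrinks by 90%, and retrieval quality then depends on whether the retained patches still contain the best match for each query vector.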
Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery
Authors: Meng Xin, Sweta Priyadarshi, Jingyu Xin, Bilal Kartal, Aditya Vavre, Asma Kuriparambil Thekkumpate, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Ido Shahaf, Akhiad Bercovich, Kinjal Patel, Suguna Varshini Velury, Chenjie Luo, Zhiyu Cheng, Jenny Chen, Chen-Han Yu, Wei Ping, Oleg Rybakov, Nima Tajbakhsh, Oluwatobi Olabiyi, Dusan Stosic, Di Wu, Song Han, Eric Chung, Sharath Turuvekere Sreenivas, Bryan Catanzaro, Yoshi Suhara, Tijmen Blankevoort, Huizi Mao
First: 2026-01-27T22:14:47+00:00 · Latest: 2026-01-27T22:14:47+00:00
Abstract
This technical report presents quantization-aware distillation (QAD) and our best practices for recovering accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe key advantages of QAD for today's LLMs: 1. It shows remarkable effectiveness and stability for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantization-aware training (QAT) suffers from engineering complexity and training instability; 2. It is robust to data quality and coverage, enabling accuracy recovery without full training data. We evaluate QAD across multiple post-trained models including AceReason Nemotron, Nemotron 3 Nano, Nemotron Nano V2, Nemotron Nano V2 VL (VLM), and Llama Nemotron Super v1, showing consistent recovery to near-BF16 accuracy.
中文标题/摘要
标题:用于恢复NVFP4推理精度的量化感知蒸馏
本技术报告介绍了量化感知蒸馏(QAD)及其在恢复NVFP4量化大型语言模型(LLMs)和视觉-语言模型(VLMs)准确性的最佳实践。QAD使用KL散度损失将全精度教师模型蒸馏到量化学生模型中。虽然将蒸馏应用于量化模型不是新想法,但观察到QAD对当今的LLMs具有关键优势:1. 对于通过多阶段后训练管道训练的模型(包括监督微调(SFT)、强化学习(RL)和模型合并),它显示出显著的有效性和稳定性,而传统的量化感知训练(QAT)则因工程复杂性和训练不稳定性而受到影响;2. 它对数据质量和覆盖范围具有鲁棒性,能够在无需完整训练数据的情况下实现准确性的恢复。我们在包括AceReason Nemotron、Nemotron 3 Nano、Nemotron Nano V2、Nemotron Nano V2 VL(VLM)和Llama Nemotron Super v1在内的多个后训练模型上评估了QAD,展示了其恢复到接近BF16准确性的一致性。
Summary / 总结
This technical report introduces quantization-aware distillation (QAD) for recovering the inference accuracy of NVFP4-quantized large language models and vision-language models. QAD uses a KL divergence loss to distill a full-precision teacher model into a quantized student model. The method is particularly effective for models trained through multi-stage post-training pipelines, such as supervised fine-tuning, reinforcement learning, and model merging, where traditional quantization-aware training faces challenges. QAD also demonstrates robustness to data quality and coverage, enabling accuracy recovery without full training data. Experiments across various models show consistent recovery to near-BF16 accuracy.
该论文介绍了量化感知蒸馏(QAD)方法,用于恢复NVFP4量化的大语言模型(LLM)和视觉-语言模型(VLM)的推理准确性。QAD使用KL散度损失将知识从全精度教师模型转移到量化学生模型。研究显示,QAD在通过多阶段后训练管道训练的模型中表现出高度的有效性和稳定性,如监督微调、强化学习和模型合并,而传统的量化感知训练(QAT)则面临工程复杂性和训练不稳定性的问题。QAD还对数据质量和覆盖范围具有鲁棒性,能够在不使用完整训练数据的情况下实现准确性恢复。对包括AceReason Nemotron、Nemotron 3 Nano、Nemotron Nano V2、Nemotron Nano V2 VL和Llama Nemotron Super v1在内的多种模型的评估显示,其能够将准确性恢复到接近BF16的水平。
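The distillation objective in this entry can be illustrated with a small, dependency-free Python sketch of a KL divergence between teacher and student next-token distributions computed from logits. The report says only that a KL divergence loss is used; the direction KL(teacher || student), the per-position formulation, and the function names are assumptions, and the sketch does not model NVFP4 quantization itself.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_distill_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one token position (assumed direction)."""
    t = log_softmax(teacher_logits)   # full-precision teacher
    s = log_softmax(student_logits)   # quantized student
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))
```

The loss is zero when the quantized student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it pulls the student's outputs back toward the full-precision model without needing ground-truth labels.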