arXiv 论文速递

2026-03-03 03:48
Snapshot: 20260303_0348
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
Authors: Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou
First: 2026-02-27T18:58:05+00:00 · Latest: 2026-02-27T18:58:05+00:00
Abstract
GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques enabling stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering 100%, 100%, and 92% faster rate over torch.compile on KernelBench Level-1, Level-2, and Level-3 splits, outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
中文标题/摘要
标题:CUDA代理:大规模代理型强化学习在高性能CUDA内核生成中的应用
GPU内核优化是现代深度学习的基础,但仍然是一个高度专业化且需要深厚硬件知识的任务。尽管在通用编程方面表现出色,大型语言模型(LLMs)在CUDA内核生成方面仍无法与基于编译器的系统(如torch.compile)竞争。现有的CUDA代码生成方法要么依赖于无需训练的细化,要么在固定多轮执行反馈循环中微调模型,但这两种范式都无法从根本上提高模型的CUDA优化能力,导致性能提升有限。我们提出了CUDA代理,这是一种通过三个组件开发CUDA内核专业知识的大规模代理型强化学习系统:可扩展的数据合成管道、具有自动化验证和性能分析以提供可靠奖励信号的技能增强CUDA开发环境,以及实现稳定训练的强化学习算法技术。CUDA代理在KernelBench上取得了最先进的成果,在KernelBench Level-1、Level-2和Level-3分割上相对torch.compile的更快比例(faster rate)分别达到100%、100%和92%,并在最难的Level-3设置上比最强的专有模型Claude Opus 4.5和Gemini 3 Pro高出约40%。
Summary / 总结
The research aims to improve LLM-based CUDA kernel generation through large-scale agentic reinforcement learning. The system combines a scalable data synthesis pipeline, a skill-augmented development environment with automated verification and profiling, and reinforcement learning techniques for stable training. On KernelBench, CUDA Agent reports faster rates over torch.compile of 100%, 100%, and 92% on the Level-1, Level-2, and Level-3 splits, and outperforms proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
研究旨在通过大规模代理型强化学习提升大型语言模型的CUDA内核生成能力。方法包括一个可扩展的数据合成管道、一个具有自动验证和性能分析并可提供可靠奖励信号的技能增强开发环境,以及用于稳定训练的强化学习技术。关键结果表明,CUDA Agent 在 KernelBench Level-1、Level-2 和 Level-3 分割上相对 torch.compile 的更快比例分别为 100%、100% 和 92%,并在最难的 Level-3 设置上比 Claude Opus 4.5 和 Gemini 3 Pro 等专有模型高出约 40%。
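For the CUDA Agent entry above, the "faster rate" figures can be read as the share of benchmark tasks on which a generated kernel is both correct and faster than the torch.compile baseline. The sketch below illustrates that reading only. The function name, the tuple layout, and the timings are hypothetical and are not taken from the paper or from KernelBench.

```python
from typing import List, Tuple

# Each result is (is_correct, kernel_ms, baseline_ms); all values here are made up.
Result = Tuple[bool, float, float]


def faster_rate(results: List[Result]) -> float:
    """Fraction of tasks whose kernel is correct and beats the baseline time."""
    if not results:
        return 0.0
    wins = sum(1 for ok, kernel_ms, base_ms in results if ok and kernel_ms < base_ms)
    return wins / len(results)


sample = [
    (True, 1.0, 2.0),   # correct and faster: counts as a win
    (True, 3.0, 2.5),   # correct but slower
    (False, 0.5, 2.0),  # fast but incorrect: never counts
    (True, 0.9, 1.0),   # correct and faster: counts as a win
]
rate = faster_rate(sample)
print(f"faster rate: {rate:.0%}")
```

Gating on correctness first means an incorrect kernel can never be credited for speed, which is one plausible way a verification step could protect a speed-based reward signal.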
LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans
Authors: Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner, Joan Lasenby
Venue: www
First: 2025-07-03T17:59:55+00:00 · Latest: 2026-02-27T18:27:47+00:00
Comments: Project Page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c&feature=youtu.be; Camera-Ready Version
Abstract
We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines -- such as object individuality, articulation, high-quality physically based rendering materials, and physically based interaction. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and objects with the help of a structured scene graph. It then reconstructs the scene by retrieving the most visually similar 3D artist-crafted models from a curated asset database. Next, the Material Painting module enhances realism by recovering high-quality, spatially varying materials. Finally, the reconstructed scene is integrated into a simulation engine with basic physical properties to enable interactive behavior. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines, making them suitable for applications in AR/VR, gaming, robotics, and digital twins. In addition, LiteReality introduces a training-free object retrieval module that achieves state-of-the-art similarity performance on the Scan2CAD benchmark, along with a robust material painting module capable of transferring appearances from images of any style to 3D assets -- even under severe misalignment, occlusion, and poor lighting. We demonstrate the effectiveness of LiteReality on both real-life scans and public datasets. Project page: https://litereality.github.io; Video: https://www.youtube.com/watch?v=ecK9m3LXg2c
中文标题/摘要
标题:LiteReality:基于RGB-D扫描的图形就绪3D场景重建
我们提出LiteReality,一种将室内环境的RGB-D扫描转换为紧凑、逼真且可交互的3D虚拟复制品的新管道。LiteReality不仅重建了视觉上类似于现实的场景,还支持图形管道中必不可少的功能,如物体的独特性、关节运动、高质量的基于物理的渲染材料和基于物理的交互。其核心在于首先进行场景理解并将结果解析为一个连贯的3D布局和物体,借助结构化的场景图。然后通过检索精心策划的资产数据库中最相似的3D艺术家设计模型来重建场景。接下来,材质绘画模块通过恢复高质量的空间变化材质来增强现实感。最后,重建的场景被整合到具有基本物理属性的模拟引擎中,以实现交互行为。生成的场景紧凑、可编辑且完全兼容标准图形管道,适用于AR/VR、游戏、机器人技术和数字孪生等应用。此外,LiteReality引入了一种无需训练的对象检索模块,在Scan2CAD基准测试中实现了最先进的相似性性能,以及一个稳健的材质绘画模块,能够将任何风格的图像外观转移到3D资产上——即使在严重对齐不良、遮挡和照明不佳的情况下。我们在现实扫描和公共数据集上展示了LiteReality的有效性。项目页面:https://litereality.github.io;视频:https://www.youtube.com/watch?v=ecK9m3LXg2c&feature=youtu.be
Summary / 总结
LiteReality is a pipeline that converts RGB-D scans into realistic 3D virtual replicas with object individuality, articulation, and high-quality rendering materials. It first parses the scene into a coherent layout and objects using a structured scene graph, then retrieves similar 3D models from a curated database, enhances realism with a Material Painting module, and integrates the scene into a simulation engine. The resulting scenes are compact and compatible with standard graphics pipelines, suitable for AR/VR, gaming, robotics, and digital twins. Key findings include state-of-the-art object retrieval performance and robust material transfer capabilities.
LiteReality 是一个将 RGB-D 扫描转换为具有图形管道关键功能(如物体个体性和高质量渲染材料)的逼真 3D 虚拟复制品的管道。它使用结构化的场景图进行场景理解,从一个精心策划的数据库中检索相似的 3D 模型,通过材料绘画模块增强现实感,并将场景集成到一个仿真引擎中。结果紧凑、可编辑且与标准图形管道兼容,适用于 AR/VR、游戏、机器人技术和数字孪生。该管道引入了一个无需训练的对象检索模块和一个能够在各种挑战条件下(如严重错位、遮挡和不良照明)将图像风格转移至 3D 资产的稳健材料绘画模块。
SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
Authors: Yasaman Haghighi, Alexandre Alahi
First: 2026-02-27T17:36:09+00:00 · Latest: 2026-02-27T17:36:09+00:00
Abstract
Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoising steps. This has motivated a growing line of research on accelerating diffusion inference. Among training-free acceleration methods, caching reduces computation by reusing previously computed model outputs across timesteps. Existing caching methods rely on heuristic criteria to choose cache/reuse timesteps and require extensive tuning. We address this limitation with a principled sensitivity-aware caching framework. Specifically, we formalize the caching error through an analysis of the model output sensitivity to perturbations in the denoising inputs, i.e., the noisy latent and the timestep, and show that this sensitivity is a key predictor of caching error. Based on this analysis, we propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis. Our framework provides a theoretical basis for adaptive caching, explains why prior empirical heuristics can be partially effective, and extends them to a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX, and LTX-Video show that SenCache achieves better visual quality than existing caching methods under similar computational budgets.
中文标题/摘要
标题:SenCache:通过敏感性感知缓存加速扩散模型推理
扩散模型在视频生成质量上达到了最先进的水平,但由于需要大量的顺序去噪步骤,其推理仍然很昂贵。这激发了对加速扩散推理的研究。在无需训练的方法中,缓存通过在时间步之间重用之前计算的模型输出来减少计算量。现有的缓存方法依赖于启发式标准来选择缓存/重用的时间步,并需要大量的调优。我们通过一个基于模型输出对去噪输入(即噪声潜在变量和时间步)扰动的敏感性分析,提出了一个原理性的敏感性感知缓存框架来解决这一限制。具体来说,我们通过分析模型输出对去噪输入的敏感性来形式化缓存误差,并表明这种敏感性是预测缓存误差的关键指标。基于这一分析,我们提出了敏感性感知缓存(SenCache),这是一种动态缓存策略,能够根据每个样本自适应地选择缓存时间步。我们的框架为自适应缓存提供了理论基础,解释了为什么先前的经验启发式方法部分有效,并将它们扩展为一种动态的、样本特定的方法。在Wan 2.1、CogVideoX和LTX-Video上的实验表明,在相似的计算预算下,SenCache在视觉质量上优于现有的缓存方法。
Summary / 总结
SenCache accelerates diffusion model inference by using a sensitivity-aware caching framework. It formulates the caching error based on the model output sensitivity to perturbations in denoising inputs and proposes SenCache, a dynamic caching policy that selects caching timesteps adaptively. Experiments show SenCache outperforms existing methods in visual quality under similar computational budgets.
SenCache 通过使用敏感性感知缓存框架来加速扩散模型推理。它基于去噪输入的扰动对模型输出敏感性来公式化缓存误差,并提出了一种动态缓存策略 SenCache,该策略在每个样本基础上自适应选择缓存时间步。实验表明,SenCache 在相似计算预算下比现有缓存方法在视觉质量上表现更优。
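For the SenCache entry above, a sensitivity-driven caching policy can be sketched as follows: estimate how far the model output may have drifted since the last full evaluation, and recompute only when that estimate exceeds a tolerance. This is a minimal illustration that assumes a first-order error estimate (sensitivity times input change). The function name, the two sensitivity constants, and the tolerance are hypothetical and are not the paper's actual criterion.

```python
from typing import List


def plan_cache(latent_deltas: List[float], time_deltas: List[float],
               sens_latent: float, sens_time: float, tol: float) -> List[bool]:
    """Return one flag per step: True = reuse the cached output, False = recompute.

    latent_deltas[i] and time_deltas[i] are the (assumed) magnitudes of change in
    the noisy latent and the timestep since the previous step.
    """
    reuse: List[bool] = []
    drift = 0.0  # estimated output error accumulated since the last recompute
    for i, (dx, dt) in enumerate(zip(latent_deltas, time_deltas)):
        drift += sens_latent * dx + sens_time * dt
        if i == 0 or drift > tol:
            reuse.append(False)  # run the model and refresh the cache
            drift = 0.0
        else:
            reuse.append(True)   # estimated error is small enough to reuse
    return reuse


flags = plan_cache(
    latent_deltas=[0.0, 0.1, 0.1, 0.4, 0.1],
    time_deltas=[0.0, 1.0, 1.0, 1.0, 1.0],
    sens_latent=1.0, sens_time=0.05, tol=0.35,
)
print(flags)
```

Because the deltas depend on the sample being denoised, the resulting schedule differs per sample, which is the property the abstract attributes to a dynamic policy.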
Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
Authors: Gregory Kang Ruey Lau, Hieu Dao, Nicole Kan Hui Lin, Bryan Kian Hsiang Low
Venue: ICLR 2025
First: 2026-02-27T17:18:42+00:00 · Latest: 2026-02-27T17:18:42+00:00
Comments: Earlier versions presented at ICLR 2025 QUESTION workshop and ICML 2025 R2-FM workshop
Abstract
Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models' own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE's design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE's generalization to non-text output tasks, including image and audio generation.
中文标题/摘要
标题:基于不一致性调整语义体积的多模态大型语言模型不确定性量化
尽管具有强大的能力,多模态大型语言模型(MLLMs)可能会生成看似合理但实际上错误的输出,阻碍了可靠部署。准确的不确定性度量可以将不可靠的查询升级给人类专家或更大规模的模型以提高性能。然而,现有的不确定性度量存在实际限制,如仅针对特定模态设计、依赖外部工具或计算成本高昂。我们提出了UMPIRE,这是一种无需训练的MLLM不确定性量化框架,可以在各种输入和输出模态下高效工作,无需外部工具,仅依赖模型自身的内部模态特征。UMPIRE 通过计算给定任务实例中采样MLLM响应的不一致性调整语义体积,有效地捕捉样本的全局语义多样性和响应的局部不一致性,基于内部模型的信心。我们为MLLMs提出了不确定性期望,并提供了支持UMPIRE设计的理论分析。广泛的实验表明,UMPIRE 在图像、音频和视频文本基准测试中,包括对抗性和离分布设置中,始终优于基线度量在错误检测和不确定性校准方面的表现。我们还展示了UMPIRE 在非文本输出任务中的泛化能力,包括图像和音频生成。
Summary / 总结
The research aims to improve the reliability of Multimodal Large Language Models (MLLMs) by developing an efficient uncertainty quantification framework called UMPIRE. UMPIRE computes the incoherence-adjusted semantic volume of MLLM responses without additional training or external tools, making it versatile across different modalities. Experiments show that UMPIRE outperforms existing metrics in error detection and uncertainty calibration across various benchmarks, including adversarial and out-of-distribution scenarios.
研究通过引入UMPIRE,一种无需训练的不确定性量化框架,解决了多模态大型语言模型(MLLMs)产生不可靠输出的问题。UMPIRE通过计算MLLM响应的不一致性调整语义体积来捕捉全局语义多样性和局部不一致性,从而实现跨多种模态的准确错误检测和不确定性校准。实验表明,在包括对抗性和分布外设置在内的多种基准测试中,UMPIRE在错误检测和不确定性校准方面均优于基线指标。
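For the UMPIRE entry above, one way to picture a "semantic volume" adjusted by incoherence is a log-determinant over the Gram matrix of response embeddings, where each embedding is stretched when the model was less confident in that response. The sketch below is an assumption-laden illustration, not UMPIRE's actual formula. The scaling rule `2 - confidence`, the identity regularizer, and all names are hypothetical.

```python
import math
from typing import List


def log_det(matrix: List[List[float]]) -> float:
    """Log-determinant of a positive-definite matrix via Gaussian elimination."""
    a = [row[:] for row in matrix]
    n = len(a)
    total = 0.0
    for i in range(n):
        pivot = a[i][i]  # stays positive for positive-definite input
        total += math.log(pivot)
        for r in range(i + 1, n):
            factor = a[r][i] / pivot
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return total


def adjusted_volume(embeddings: List[List[float]], confidences: List[float]) -> float:
    """log det(I + G), with G the Gram matrix of confidence-scaled embeddings."""
    # Hypothetical adjustment: lower confidence stretches the embedding.
    scaled = [[(2.0 - p) * x for x in e] for e, p in zip(embeddings, confidences)]
    n = len(scaled)
    gram = [
        [sum(x * y for x, y in zip(scaled[i], scaled[j])) + (1.0 if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    return log_det(gram)


same = adjusted_volume([[1.0, 0.0], [1.0, 0.0]], [1.0, 1.0])       # identical, confident
diverse = adjusted_volume([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])    # semantically spread
uncertain = adjusted_volume([[1.0, 0.0], [1.0, 0.0]], [0.5, 0.5])  # identical, unsure
print(same, diverse, uncertain)
```

In this toy setup the score rises both when sampled responses disagree and when the model is unsure of responses that agree, which mirrors the global and local signals described in the abstract.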
Task-Centric Acceleration of Small-Language Models
Authors: Dor Tsur, Sharon Adar, Ran Levy
First: 2026-02-27T16:55:22+00:00 · Latest: 2026-02-27T16:55:22+00:00
Abstract
Small language models (SLMs) have emerged as efficient alternatives to large language models for task-specific applications. However, they are often employed in high-volume, low-latency settings, where efficiency is crucial. We propose TASC, Task-Adaptive Sequence Compression, a framework for SLM acceleration comprising two use-cases: When performing SLM fine-tuning, we propose TASC-ft, which iteratively enriches the tokenizer vocabulary with high-frequency output n-grams and then fine-tunes the model to utilize the expanded vocabulary. Next, we propose an inference-time method, termed TASC-spec. TASC-spec is a lightweight, training-free speculative decoding method that constructs an n-gram draft model from the task's output corpus, mixing task and context n-gram information. TASC-spec avoids any additional training, while bypassing draft-target vocabulary alignment constraints. We demonstrate the effectiveness of both methods across multiple low output-variability generation tasks. Our methods show consistent improvements in inference efficiency while maintaining task performance.
中文标题/摘要
标题:面向任务的小语言模型加速
小语言模型(SLMs)已成为特定任务应用中大型语言模型的高效替代品,但它们通常在高流量、低延迟的环境中使用,效率至关重要。我们提出了TASC,即任务自适应序列压缩,这是一种小语言模型加速框架,包含两种用例:在进行SLM微调时,我们提出了TASC-ft,该方法迭代地将高频输出n-gram加入分词器词汇表,随后微调模型以利用扩展后的词汇表。接下来,我们提出了一种推理时方法,称为TASC-spec。TASC-spec是一种轻量级、无需训练的推测性解码方法,从任务的输出语料库中构建n-gram草稿模型,混合任务和上下文的n-gram信息。TASC-spec避免了任何额外的训练,同时绕过了草稿模型与目标模型之间的词汇表对齐约束。我们在多个低输出变异性生成任务中展示了这两种方法的有效性。我们的方法在保持任务性能的同时,一致地提高了推理效率。
Summary / 总结
The research aims to enhance the efficiency of small language models (SLMs) for high-volume, low-latency applications. It introduces TASC, a framework that includes TASC-ft for fine-tuning SLMs by expanding the tokenizer vocabulary with high-frequency output n-grams, and TASC-spec, a speculative decoding method at inference time that constructs a draft model from the task's output corpus. Experiments across various low output-variability tasks show consistent improvements in inference efficiency without compromising task performance.
研究旨在提高小语言模型(SLMs)在特定任务中的效率,特别是在高流量和低延迟的应用场景中。提出了两种方法:TASC-ft 在微调过程中丰富分词器词汇表,然后进行微调;TASC-spec 是一种轻量级的推测性解码方法,从任务的输出语料库中构建草稿模型。实验表明,在各种低输出变异性生成任务中,这些方法在不牺牲任务性能的情况下,能够一致地提高推理效率。
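For the TASC entry above, the idea of drafting from an n-gram table built on the task's output corpus can be sketched as below. This is a simplified illustration. It uses only task-corpus n-grams (the mixing with context n-grams mentioned in the abstract is omitted), verifies greedily one token at a time instead of in a single batched pass, and replaces the target model with a hypothetical stub `target_next`. All names are made up for the example.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Table = Dict[Tuple[str, ...], str]


def build_ngram_table(corpus: List[List[str]], n: int = 2) -> Table:
    """Map each n-token context to its most frequent next token in the corpus."""
    counts: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)
    for seq in corpus:
        for i in range(len(seq) - n):
            counts[tuple(seq[i:i + n])][seq[i + n]] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def draft(table: Table, prefix: List[str], n: int = 2, max_tokens: int = 4) -> List[str]:
    """Propose up to max_tokens continuation tokens by table lookup."""
    out, proposed = list(prefix), []
    for _ in range(max_tokens):
        nxt = table.get(tuple(out[-n:]))
        if nxt is None:
            break
        proposed.append(nxt)
        out.append(nxt)
    return proposed


def accept(proposed: List[str], prefix: List[str],
           target_next: Callable[[List[str]], Optional[str]]) -> List[str]:
    """Keep the longest run of drafted tokens that the target model agrees with."""
    ctx, accepted = list(prefix), []
    for tok in proposed:
        if target_next(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted


corpus = [
    ["the", "answer", "is", "yes", "."],
    ["the", "answer", "is", "no", "."],
    ["the", "answer", "is", "yes", "."],
]
table = build_ngram_table(corpus)
proposed = draft(table, ["the", "answer"])

# Hypothetical stand-in for the target SLM: it always continues this fixed sequence.
truth = ["the", "answer", "is", "no", "."]
accepted = accept(proposed, ["the", "answer"],
                  lambda ctx: truth[len(ctx)] if len(ctx) < len(truth) else None)
print(proposed, accepted)
```

Because the draft is a lookup table over the target's own tokens, no separate draft network or vocabulary mapping is needed, which fits the training-free, alignment-free framing in the abstract.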
Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification
Authors: Vikash Singh, Debargha Ganguly, Haotian Yu, Chengwei Zhou, Prerna Singh, Brandon Lee, Vipin Chaudhary, Gourav Datta
First: 2026-02-27T15:49:59+00:00 · Latest: 2026-02-27T15:49:59+00:00
Abstract
Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic impressions unsupported by their own perceptual findings or missing logically entailed conclusions. Standard lexical metrics heavily penalize clinical paraphrasing and fail to capture these deductive failures in reference-free settings. Toward guarantees for clinical reasoning, we introduce a neurosymbolic verification framework that deterministically audits the internal consistency of VLM-generated reports. Our pipeline autoformalizes free-text radiographic findings into structured propositional evidence, utilizing an SMT solver (Z3) and a clinical knowledge base to verify whether each diagnostic claim is mathematically entailed, hallucinated, or omitted. Evaluating seven VLMs across five chest X-ray benchmarks, our verifier exposes distinct reasoning failure modes, such as conservative observation and stochastic hallucination, that remain invisible to traditional metrics. On labeled datasets, enforcing solver-backed entailment acts as a rigorous post-hoc guarantee, systematically eliminating unsupported hallucinations to significantly increase diagnostic soundness and precision in generative clinical assistants.
中文标题/摘要
标题:通过形式验证实现视觉语言模型在临床推理中的保证
视觉-语言模型(VLMs)在撰写放射学报告方面显示出潜力,但它们经常遭受逻辑不一致的困扰,生成未被自身感知发现支持的诊断印象,或者遗漏逻辑上必然的结论。标准的词汇度量会严厉惩罚临床同义替换,并且在无参考设置中无法捕捉这些演绎失败。为实现临床推理的保证,我们引入了一种神经符号验证框架,以确定性地审计VLM生成报告的内部一致性。我们的流水线将自由文本放射学发现自动形式化为结构化的命题证据,利用SMT求解器(Z3)和临床知识库来验证每个诊断声明是否被数学上蕴含、虚构或遗漏。在五个胸部X光基准测试中评估七种VLM,我们的验证器揭示了保守观察和随机虚构等不同的推理失败模式,这些模式对传统度量仍然不可见。在带标签的数据集上,强制求解器支持的蕴含作为严格的事后保证,系统地消除未被支持的虚构,显著提高生成临床助手的诊断准确性和精确度。
Summary / 总结
The paper addresses logical inconsistencies in vision-language models (VLMs) used for drafting radiology reports. It introduces a neurosymbolic verification framework that audits the internal consistency of VLM-generated reports by autoformalizing free-text radiographic findings into structured propositional evidence and using an SMT solver (Z3) with a clinical knowledge base to check whether each diagnostic claim is entailed, hallucinated, or omitted. The evaluation of seven VLMs on five chest X-ray benchmarks reveals distinct reasoning failure modes, such as conservative observation and stochastic hallucination, that traditional metrics fail to capture. On labeled datasets, enforcing solver-backed entailment as a post-hoc check eliminates unsupported hallucinations and increases diagnostic soundness and precision.
该研究针对视觉语言模型(VLMs)在生成放射学报告时存在的逻辑不一致问题,引入了一种神经符号验证框架,通过将自由文本放射学发现自动形式化为结构化的命题证据,并使用SMT求解器和临床知识库验证诊断声明。在七个VLMs对五个胸部X光基准的评估中,揭示了保守观察和随机幻觉等不同的推理失败模式,这些模式传统指标无法捕捉。通过强制执行求解器支持的蕴含,系统地消除了未支持的幻觉,从而显著提高了诊断的准确性和精确度。
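For the formal verification entry above, the three-way audit (entailed, hallucinated, omitted) can be illustrated with a small propositional check. The sketch below swaps the paper's SMT solver (Z3) for plain forward chaining over Horn-style rules so that it runs with the standard library alone. The rules and finding names are toy placeholders chosen for illustration, not the paper's knowledge base and not clinical guidance.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Rule = Tuple[FrozenSet[str], str]  # (premises, conclusion)


def closure(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Forward-chain until no rule adds a new conclusion."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def audit(findings: Set[str], claims: Set[str], rules: List[Rule],
          diagnoses: Set[str]) -> Dict[str, Set[str]]:
    """Split diagnostic claims into entailed, hallucinated, and omitted."""
    entailed = closure(findings, rules) & diagnoses
    return {
        "entailed": claims & entailed,      # claimed and supported by findings
        "hallucinated": claims - entailed,  # claimed without support
        "omitted": entailed - claims,       # supported but never claimed
    }


# Toy rules for illustration only.
rules: List[Rule] = [
    (frozenset({"enlarged_cardiac_silhouette"}), "cardiomegaly"),
    (frozenset({"blunted_costophrenic_angle"}), "pleural_effusion"),
]
report = audit(
    findings={"enlarged_cardiac_silhouette", "blunted_costophrenic_angle"},
    claims={"cardiomegaly", "pneumonia"},
    rules=rules,
    diagnoses={"cardiomegaly", "pleural_effusion", "pneumonia"},
)
print(report)
```

The check is deterministic and needs no reference report, since it compares a report's claims only against that same report's findings.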
AutoDebias: Automated Framework for Debiasing Text-to-Image Models
Authors: Hongyi Cai, Mohammad Mahdinur Rahman, Mingkang Dong, Muxin Pu, Moqyad Alqaily, Jie Li, Xinfeng Li, Jialie Shen, Meikang Qiu, Qingsong Wen
Venue: CVPR 2026
First: 2025-08-01T09:05:45+00:00 · Latest: 2026-02-27T15:45:24+00:00
Comments: Accepted to CVPR 2026
Abstract
Text-to-Image (T2I) models generate high-quality images but are vulnerable to malicious backdoor attacks that inject harmful biases (e.g., trigger-activated gender or racial stereotypes). Existing debiasing methods, often designed for natural statistical biases, struggle with these deliberately and subtly injected attacks. We propose AutoDebias, a framework that automatically identifies and mitigates these malicious biases in T2I models without prior knowledge of the specific attack types. Specifically, AutoDebias leverages vision-language models to detect trigger-activated visual patterns and constructs neutralization guides by generating counter-prompts. These guides drive a CLIP-guided training process that breaks the harmful associations while preserving the original model's image quality and diversity. Unlike methods designed for natural bias, AutoDebias effectively addresses subtle, injected stereotypes and multiple interacting attacks. We evaluate the framework on a new benchmark covering 17 distinct backdoor scenarios, including challenging cases where multiple backdoors co-exist. AutoDebias detects malicious patterns with 91.6% accuracy and reduces the backdoor success rate from 90% to negligible levels, while preserving the visual fidelity of the original model.
中文标题/摘要
标题:AutoDebias:文本到图像模型的自动化去偏框架
文本到图像(T2I)模型能够生成高质量的图像,但容易受到恶意后门攻击的影响,这些攻击会注入有害偏见(例如,触发激活的性别或种族刻板印象)。现有的去偏见方法,通常针对自然统计偏见,难以应对这些故意且微妙注入的攻击。我们提出了AutoDebias,这是一种无需了解特定攻击类型即可自动识别和减轻T2I模型中恶意偏见的框架。具体而言,AutoDebias 利用视觉语言模型检测触发激活的视觉模式,并通过生成反向提示构建中和指南。这些指南驱动CLIP引导的训练过程,打破有害关联,同时保持原始模型的图像质量和多样性。与针对自然偏见设计的方法不同,AutoDebias 有效地解决了微妙、注入的刻板印象和多个交互式攻击。我们在涵盖17种不同后门场景的新基准上评估了该框架,包括多个后门共存的具有挑战性的案例。AutoDebias 以91.6%的准确率检测恶意模式,并将后门成功率从90%降低到可忽略的水平,同时保持原始模型的视觉保真度。
Summary / 总结
AutoDebias is an automated framework designed to identify and mitigate malicious biases in Text-to-Image models, which are vulnerable to backdoor attacks that inject harmful stereotypes. It uses vision-language models to detect trigger-activated visual patterns and generates counter-prompts to neutralize these biases through CLIP-guided training. The framework successfully detects malicious patterns with 91.6% accuracy and reduces the backdoor success rate from 90% to negligible levels, maintaining the original model's image quality and diversity.
AutoDebias 是一个自动框架,旨在识别并缓解 Text-to-Image 模型中的恶意偏见,这些模型容易受到注入有害刻板印象的后门攻击。它使用视觉-语言模型检测触发激活的视觉模式,并生成反向提示,通过 CLIP 引导的训练打破有害关联,同时保持图像质量和多样性。评估结果显示,AutoDebias 在新基准上的检测准确率为 91.6%,并将后门成功率从 90% 降低到可忽略的水平,同时保持了原始模型的视觉保真度。
FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models
Authors: Jiyoon Pyo, Yuankun Jiao, Dongwon Jung, Zekun Li, Leeje Jang, Sofia Kirsanova, Jina Kim, Yijun Lin, Qin Liu, Junyi Xie, Hadi Askari, Nan Xu, Muhao Chen, Yao-Yi Chiang
Venue: ICLR 2026
First: 2025-12-08T20:18:15+00:00 · Latest: 2026-02-27T15:26:28+00:00
Comments: Accepted to ICLR 2026
Abstract
Cartographic reasoning is the skill of interpreting geographic relationships by aligning legends, map scales, compass directions, map texts, and geometries across one or more map images. Although essential as a concrete cognitive capability and for critical tasks such as disaster response and urban planning, it remains largely unevaluated. Building on progress in chart and infographic understanding, recent large vision language model studies on map visual question-answering often treat maps as a special case of charts. In contrast, map VQA demands comprehension of layered symbology (e.g., symbols, geometries, and text labels) as well as spatial relations tied to orientation and distance that often span multiple maps and are not captured by chart-style evaluations. To address this gap, we introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA sources real map images from documents and reports in various domains and geographical areas. Following classifications in Geographic Information System (GIS) literature, FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). All questions require multi-step inference, and many require cross-map grounding and reasoning. We evaluate eleven state-of-the-art LVLMs under two settings: (1) the direct setting, where we provide the maps relevant to the question, and (2) the contextual setting, where the model may have to identify the maps relevant to the question before reasoning. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, respectively, far below human performance of 84.87%. These results reveal a persistent gap in multi-step cartographic reasoning, positioning FRIEDA as a rigorous benchmark to drive progress on spatial intelligence in LVLMs.
中文标题/摘要
标题:FRIEDA:视觉语言模型中多步制图推理的基准测试
制图推理是指通过对图例、地图比例尺、指南针方向、地图文字和几何形状进行对齐来解释地理关系的能力。尽管对于具体的认知能力和关键任务(如灾害响应和城市规划)至关重要,但其仍主要未被评估。基于对图表和信息图理解的进展,最近的大规模视觉语言模型研究中的地图视觉问答往往将地图视为图表的一种特殊案例。相比之下,地图VQA需要理解分层符号(如符号、几何形状和文本标签)以及与方向和距离相关的空间关系,这些关系往往跨越多张地图且无法通过图表风格的评估捕捉到。为解决这一差距,我们引入了FRIEDA,一个用于测试LVLM中复杂开放性制图推理的基准。FRIEDA从各种领域和地理区域的文档和报告中获取真实地图图像。按照地理信息系统(GIS)文献中的分类,FRIEDA针对所有三种空间关系类别:拓扑(边界、相等、相交、包含)、度量(距离)和方向(方向)。所有问题都需要多步推理,许多问题还需要跨图推理和推理。我们以两种设置评估了十一个最先进的LVLM:(1)直接设置,我们提供与问题相关的地图;(2)上下文设置,模型可能需要先识别与问题相关的地图,然后再进行推理。即使最强的模型Gemini-2.5-Pro和GPT-5-Think的准确率也只有38.20%和37.20%,远低于人类的84.87%。这些结果揭示了多步制图推理中的持续差距,将FRIEDA定位为一个严格的基准,以推动LVLM中空间智能的进步。
Summary / 总结
FRIEDA is a benchmark for evaluating multi-step cartographic reasoning in large vision-language models (LVLMs). It addresses the gap in map understanding by covering three categories of spatial relations: topological, metric, and directional. The benchmark uses real map images from various domains and evaluates eleven models in both direct and contextual settings. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, far below human performance of 84.87%, highlighting the need for improved spatial intelligence in LVLMs.
研究旨在评估视觉语言模型在地理关系理解方面的能力,特别是它们在多张地图之间解释复杂地理关系的能力。方法是创建FRIEDA基准,测试这些模型在来自不同领域和地理区域的真实地图图像上的表现,重点关注拓扑、度量和方向空间关系。关键发现表明,即使是最强的模型也只能达到38.20%和37.20%的准确率,远低于人类84.87%的性能,这表明需要在多步地理关系推理方面改进视觉语言模型。
Reallocating Attention Across Layers to Reduce Multimodal Hallucination
Authors: Haolang Lu, Bolun Chu, WeiYe Fu, Guoshun Nan, Junning Liu, Minghui Pan, Qiankun Li, Yi Yu, Hua Wang, Kun Wang
First: 2025-10-11T16:54:41+00:00 · Latest: 2026-02-27T15:25:38+00:00
Comments: Accepted by CVPR 2026
Abstract
Multimodal large reasoning models (MLRMs) often suffer from hallucinations that stem not only from insufficient visual grounding but also from imbalanced allocation between perception and reasoning processes. Building upon recent interpretability findings suggesting a staged division of attention across layers, we analyze how this functional misalignment leads to two complementary failure modes: perceptual bias in shallow layers and reasoning drift in deeper layers. To alleviate these issues, we propose Functional Head Identification and Class-Conditioned Rescaling, a lightweight, training-free plugin that identifies perception- and reasoning-oriented heads and adaptively rebalances their layerwise contributions. Our method improves reasoning consistency and visual faithfulness without retraining or any architectural modification. Evaluations across three representative MLRMs and five multimodal reasoning benchmarks show an average gain of 4.2 percentage points, with less than 1% additional computation and only 9% baseline latency. Beyond empirical improvements, our study provides an interpretable perspective on regulating cross-layer functional dynamics to enhance the reliability of multimodal reasoning.
中文标题/摘要
标题:在层间重新分配注意力以减少多模态幻觉
多模态大型推理模型(MLRM)常常受到幻觉的影响,这些幻觉不仅源于视觉接地不足,还源于感知和推理过程之间的不平衡分配。基于最近关于注意力在层间分阶段分配的可解释性发现,我们分析了这种功能错位如何导致两种互补的失败模式:浅层的感知偏差和深层的推理漂移。为了解决这些问题,我们提出了一种轻量级、无需训练的插件——功能头部识别和类别条件重缩放,该方法能够识别感知和推理导向的头部,并适应性地重新平衡它们的层间贡献。我们的方法在无需重新训练或任何架构修改的情况下提高了推理一致性和视觉真实性。在三个代表性MLRM和五个多模态推理基准上的评估显示,平均提高了4.2个百分点,额外计算量不到1%,基线延迟仅增加9%。除了实证改进,我们的研究还提供了一种可解释的视角,用于调节跨层功能动态,以增强多模态推理的可靠性。
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
Authors: Chenwei Jia, Baoting Li, Xuchong Zhang, Mingzhuo Wei, Bochen Lin, Hongbin Sun
Venue: CVPR 2026
First: 2026-02-27T14:47:48+00:00 · Latest: 2026-02-27T14:47:48+00:00
Comments: 13 pages, 6 figures, including appendix, Accepted at CVPR 2026
Abstract
Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs) by compressing both weights and activations without retraining the full model. Existing PTQ methods primarily rely on static identification and global compensation of sensitive or outlier channels, yet they often overlook the distributional differences of these important channels across inputs, leading to unsatisfactory quantization. In this work, we observe that the distributions and occurrence frequencies of important channels vary significantly both across modalities and among tokens, even within the same modality. Accordingly, we propose Quant Experts (QE), a token-aware adaptive error compensation with mixture-of-experts for VLMs quantization. QE divides the important channels into token-independent and token-dependent groups. For the former, a shared expert is designed for most tokens to compensate for global quantization error using a low-rank adapter. For the latter, routed experts including multiple routed low-rank adapters are elaborated to compensate for local quantization error related to specific tokens. Extensive experiments demonstrate that QE consistently enhances task accuracy across various quantization settings and model scales, ranging from 2B to 70B parameters, while maintaining performance comparable to full-precision models.
中文标题/摘要
标题:量化专家:面向大型视觉-语言模型量化的基于专家混合的令牌感知自适应误差重构
后训练量化(PTQ)已成为一种有效技术,通过压缩视觉-语言模型(VLMs)的权重和激活,减轻其显著的计算和内存开销,而无需重新训练整个模型。现有PTQ方法主要依赖于静态识别和全局补偿敏感或异常通道,但往往忽略了这些重要通道在不同输入之间的分布差异,导致量化效果不佳。本文观察到,重要通道的分布和出现频率在不同模态之间以及在不同令牌之间变化显著,即使在同一模态内也是如此。因此,我们提出了量化专家(QE),一种具有专家混合的令牌感知自适应误差补偿方法,用于VLMs量化。QE将重要通道分为令牌独立和令牌依赖两组。对于前者,设计了一个共享专家,使用低秩适配器补偿大部分令牌的全局量化误差。对于后者,提出了包括多个路由低秩适配器的路由专家,以补偿与特定令牌相关的局部量化误差。广泛的实验表明,QE在各种量化设置和模型规模下(从20亿到700亿参数)都能一致地提高任务准确性,同时保持与全精度模型相当的性能。
Summary / 总结
This work addresses the limitations of existing Post-Training Quantization (PTQ) methods in Vision-Language Models (VLMs) by proposing Quant Experts (QE), which adaptively compensates for quantization errors based on token-aware mechanisms. QE divides important channels into token-independent and token-dependent groups, using a shared expert for global error compensation and multiple routed experts for local error compensation. Experiments show that QE improves task accuracy across different quantization settings and model scales, maintaining performance close to full-precision models.
本文针对现有视觉-语言模型(VLMs)后训练量化(PTQ)方法的不足,提出了Quant Experts(QE),通过基于令牌感知的策略自适应补偿量化误差。QE 将重要通道分为令牌独立和令牌依赖组,使用共享专家处理全局量化误差,并使用多个路由专家处理与特定令牌相关的局部量化误差。实验表明,QE 在不同量化设置和模型规模下提高了任务准确性,性能接近全精度模型。
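For the Quant Experts entry above, the split between a shared low-rank correction and token-routed low-rank corrections can be pictured as a forward pass of the form quantized output + shared adapter + one routed adapter. The sketch below uses rank-1 adapters and a dot-product router purely for illustration. The routing rule, the adapter shapes, and every name are assumptions and do not come from the paper.

```python
from typing import List, Tuple

Vector = List[float]
Adapter = Tuple[Vector, Vector]  # rank-1 adapter (u, v): x -> u * (v . x)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def rank1(adapter: Adapter, x: Vector) -> Vector:
    """Apply a rank-1 correction u * (v . x)."""
    u, v = adapter
    s = dot(v, x)
    return [ui * s for ui in u]


def qe_forward(w_q: List[Vector], shared: Adapter, routed: List[Adapter],
               keys: List[Vector], x: Vector) -> Tuple[Vector, int]:
    """Quantized matvec plus a shared correction and one token-routed correction."""
    y = [dot(row, x) for row in w_q]  # output of the quantized weights
    global_fix = rank1(shared, x)     # token-independent compensation
    # Hypothetical router: pick the expert whose key best matches the token.
    k = max(range(len(keys)), key=lambda i: dot(keys[i], x))
    local_fix = rank1(routed[k], x)   # token-dependent compensation
    return [a + b + c for a, b, c in zip(y, global_fix, local_fix)], k


w_q = [[1.0, 0.0], [0.0, 1.0]]
shared = ([0.5, 0.0], [1.0, 0.0])
routed = [([0.0, 0.25], [1.0, 0.0]), ([0.0, -0.25], [0.0, 1.0])]
keys = [[1.0, 0.0], [0.0, 1.0]]

out_a, expert_a = qe_forward(w_q, shared, routed, keys, [2.0, 1.0])
out_b, expert_b = qe_forward(w_q, shared, routed, keys, [0.0, 4.0])
print(out_a, expert_a, out_b, expert_b)
```

The two example tokens select different routed adapters while sharing the same global correction, which is the division of labor the abstract describes.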
Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Authors: Yuhan Liu, Lianhui Qin, Shengjie Wang
Venue: ICLR 2026
First: 2025-10-23T17:59:21+00:00 · Latest: 2026-02-27T14:25:56+00:00
Comments: Accepted to ICLR 2026
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.
中文标题/摘要
标题:小草图,大裁决:基于推测的信息密集型视觉推理
大型多模态视觉语言模型(VLMs)在多模态理解方面取得了显著进展,但在处理密集交织了文本注释和细粒度图形元素的信息密集型图像时,它们面临挑战。主要挑战在于在密集布局中精确定位关键线索以及进行多跳推理以整合分散的证据。我们提出了推测裁决(SV),这是一种无需训练的框架,灵感来源于推测解码,结合了多个轻量级草图专家和一个大型裁决模型。在草图阶段,小型VLM作为草图专家生成提供多样化定位候选的推理路径;在裁决阶段,强大的VLM综合这些路径生成最终答案,同时降低计算成本并恢复正确答案。为了进一步提高效率和准确性,SV引入了一种共识专家选择机制,仅将高一致性的推理路径转发到裁决阶段。实验证明,SV在InfographicVQA、ChartMuseum、ChartQAPro和HR-Bench 4K等具有挑战性的信息密集型和高分辨率视觉问答基准测试中取得了持续的改进。通过综合多个部分准确推理路径中的正确见解,SV在错误修正和成本效率方面优于大型专有模型或训练管道。代码可在https://github.com/Tinaliu0123/speculative-verdict获取。
Summary / 总结
The research addresses the challenge of visual reasoning in information-intensive images where dense textual and graphical elements complicate the task. It introduces Speculative Verdict (SV), a training-free framework that uses multiple lightweight draft experts to generate diverse reasoning paths, which are then synthesized by a strong VLM in the verdict stage. SV achieves consistent improvements on benchmarks like InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K, demonstrating both error correction and cost-efficiency compared to large models or training pipelines.
研究针对信息密集型图像中的视觉推理难题,大型VLM由于密集布局和分散的证据难以应对。提出了一种名为Speculative Verdict (SV)的无训练框架,使用轻量级的草案专家生成多样化的推理路径,然后由强大的VLM在判决阶段综合这些路径,以降低成本同时保持准确性。通过共识机制仅转发高一致性的路径,实验结果显示在InfographicVQA、ChartMuseum、ChartQAPro和HR-Bench 4K等基准测试上的一致改进,展示了与大型专有模型或训练管道相比的错误纠正和成本效率优势。
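For the Speculative Verdict entry above, consensus-based selection of draft reasoning paths can be sketched as ranking each draft by how many peer drafts reach the same answer and forwarding only the top few to the verdict model. This is a simplified illustration that assumes exact string matching between answers. The function name and the sample drafts are hypothetical and not taken from the released code.

```python
from typing import List, Tuple

Draft = Tuple[str, str]  # (reasoning_path, final_answer)


def select_by_consensus(drafts: List[Draft], keep: int) -> List[Draft]:
    """Keep the drafts whose final answer is shared by the most peers."""
    def agreement(i: int) -> int:
        return sum(1 for j, d in enumerate(drafts)
                   if j != i and d[1] == drafts[i][1])

    # sorted() is stable, so ties keep their original order.
    ranked = sorted(range(len(drafts)), key=lambda i: -agreement(i))
    return [drafts[i] for i in ranked[:keep]]


drafts = [
    ("path-1", "42"),
    ("path-2", "17"),
    ("path-3", "42"),
    ("path-4", "42"),
    ("path-5", "17"),
]
selected = select_by_consensus(drafts, keep=3)
print([path for path, _ in selected])
```

Forwarding fewer, higher-agreement paths shortens the verdict model's input, which is where the cost saving would come from under this reading.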
Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation
Authors: Xingyu Zhu, Kesen Zhao, Liang Yi, Shuo Wang, Zhicai Wang, Beier Zhu, Hanwang Zhang
Venue: ICLR 2026
First: 2026-02-27T14:18:51+00:00 · Latest: 2026-02-27T14:18:51+00:00
Comments: ICLR 2026
Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning, yet they remain vulnerable to hallucination, where generated content deviates from visual evidence. Existing mitigation strategies either require costly supervision during training or introduce additional latency at inference time. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding, but they typically inject all tokens indiscriminately, which causes interference from background regions and distracts the model from critical cues. To overcome this challenge, we propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs. AIR consists of two components. Prototype-based token reduction condenses the large pool of visual tokens into a compact subset to suppress redundancy. OT-guided patch reinforcement quantifies the alignment between hidden states and patch embeddings to selectively integrate the most consistent patches into feed-forward layers. As a result, AIR enhances the model's reliance on salient visual information and effectively mitigates hallucination. Extensive experiments across representative MLLMs demonstrate that AIR substantially reduces hallucination while preserving general capabilities, establishing it as an effective solution for building reliable MLLMs.
中文标题/摘要
标题:仔细观察:多模态大型语言模型中的自适应视觉强化以减轻幻觉
多模态大型语言模型(MLLMs)在视觉语言推理方面取得了显著进展,但仍然容易出现幻觉,即生成的内容与视觉证据不符。现有的缓解策略要么需要在训练过程中昂贵的监督,要么在推理时引入额外的延迟。最近的视觉增强方法试图通过在解码过程中强化视觉标记来解决这一问题,但它们通常会无差别地注入所有标记,这会导致背景区域的干扰并使模型偏离关键线索。为克服这一挑战,我们提出了一种无需训练的自适应视觉强化(AIR)框架,适用于MLLMs。AIR由两个组件组成。基于原型的标记减少将大量的视觉标记浓缩成一个紧凑的子集,以抑制冗余。OT引导的补丁强化量化隐藏状态与补丁嵌入之间的对齐,以选择性地将最一致的补丁整合到前馈层中。结果,AIR增强了模型对显著视觉信息的依赖性,并有效减轻了幻觉。广泛的实验表明,AIR在保持一般能力的同时显著减少了幻觉,确立了其作为构建可靠MLLMs的有效解决方案的地位。
Summary / 总结
The research addresses hallucination in multimodal large language models (MLLMs) by proposing Adaptive Visual Reinforcement (AIR), a training-free framework that strengthens the model's reliance on salient visual information. AIR combines prototype-based token reduction, which condenses visual tokens into a compact subset, with OT-guided patch reinforcement, which selectively integrates the patches most consistent with the hidden states into feed-forward layers. Experiments across representative MLLMs show that AIR substantially reduces hallucination while preserving general capabilities.
研究旨在通过提出自适应视觉强化(AIR)来解决多模态大型语言模型(MLLMs)中的幻觉问题,增强模型对显著视觉信息的依赖。AIR 包含两个组件:基于原型的标记缩减和OT引导的补丁强化。前者将视觉标记浓缩为紧凑子集以抑制冗余,后者则选择性地将最一致的补丁整合到前馈层中。实验表明,AIR 显著减少了幻觉现象,同时保持了模型的一般能力,使其成为构建可靠 MLLMs 的有效解决方案。
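To make the idea of OT-guided patch selection more concrete, here is a small sketch in plain Python. It assumes a cosine-distance cost, uniform marginals, and entropic regularization solved with Sinkhorn iterations, and it scores each patch by its transport-weighted similarity. All of these choices, and every function name, are assumptions for illustration; the abstract does not specify how AIR computes or uses the alignment.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def sinkhorn_plan(cost, eps=0.1, iters=200):
    """Entropic-OT transport plan between uniform marginals (Sinkhorn iterations)."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def select_patches(hidden_states, patches, top_k):
    """Return indices of the top_k patches best aligned with the hidden states."""
    sim = [[cosine(h, p) for p in patches] for h in hidden_states]
    cost = [[1.0 - s for s in row] for row in sim]
    plan = sinkhorn_plan(cost)
    # Hypothetical score: similarity weighted by the mass transported to each patch.
    scores = [
        sum(plan[i][j] * sim[i][j] for i in range(len(hidden_states)))
        for j in range(len(patches))
    ]
    ranked = sorted(range(len(patches)), key=lambda j: -scores[j])
    return sorted(ranked[:top_k]), scores


# Toy 2-D vectors, invented for illustration only.
hidden = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
chosen, scores = select_patches(hidden, patches, top_k=2)
```

In this toy case the third patch points away from both hidden states, so it receives a negative score and is left out of the selection.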
GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models
Authors: Xingyu Zhu, Beier Zhu, Junfeng Fang, Shuo Wang, Yin Zhang, Xiang Wang, Xiangnan He
Venue: ICLR 2026
First: 2026-02-27T13:52:52+00:00 · Latest: 2026-02-27T13:52:52+00:00
Comments: ICLR 2026
Abstract
Large vision-language models (LVLMs) have achieved remarkable progress in vision-language reasoning tasks, yet ensuring their safety remains a critical challenge. Recent input-side defenses detect unsafe images with CLIP and prepend safety prefixes to prompts, but they still suffer from inaccurate detection in complex scenes and unstable safety signals during decoding. To address these issues, we propose GuardAlign, a training-free defense framework that integrates two strategies. First, OT-enhanced safety detection leverages optimal transport to measure distribution distances between image patches and unsafe semantics, enabling accurate identification of malicious regions without additional computational cost. Second, cross-modal attentive calibration strengthens the influence of safety prefixes by adaptively reallocating attention across layers, ensuring that safety signals remain consistently activated throughout generation. Extensive evaluations on six representative MLLMs demonstrate that GuardAlign reduces unsafe response rates by up to 39% on SPA-VL, while preserving utility, achieving an improvement on VQAv2 from 78.51% to 79.21%.
中文标题/摘要
标题:GuardAlign:多模态大型语言模型测试时的安全对齐
大型视觉-语言模型(LVLMs)在视觉-语言推理任务中取得了显著进展,但确保其安全性仍然是一个关键挑战。最近的输入端防御使用CLIP检测不安全的图像,并在提示前添加安全前缀,但它们在复杂场景中仍存在检测不准确的问题,并且在解码过程中安全信号不稳定。为了解决这些问题,我们提出了一种无需训练的防御框架GuardAlign,该框架结合了两种策略。首先,OT增强的安全检测利用最优传输来测量图像块与不安全语义之间的分布距离,从而无需额外计算成本即可准确识别恶意区域。其次,跨模态注意校准通过跨层自适应地重新分配注意力来增强安全前缀的影响,确保安全信号在整个生成过程中持续激活。在六个代表性MLLM上的广泛评估表明,GuardAlign在SPA-VL上将不安全响应率降低了高达39%,同时保持了实用性,并将VQAv2上的结果从78.51%提升至79.21%。
Summary / 总结
GuardAlign is a training-free defense framework for ensuring the safety of multimodal large language models (MLLMs) in vision-language reasoning tasks. It uses OT-enhanced safety detection to accurately identify malicious regions and cross-modal attentive calibration to ensure consistent safety signals during generation. Evaluations show that GuardAlign reduces unsafe response rates by up to 39% on SPA-VL while maintaining utility, improving VQAv2 performance from 78.51% to 79.21%.
GuardAlign 是一个无需训练的防御框架,用于确保多模态大型语言模型(MLLMs)在视觉语言推理任务中的安全性。它通过 OT 增强的安全检测准确识别恶意区域,并通过跨模态注意校准确保生成过程中安全信号的一致激活。评估显示,GuardAlign 在 SPA-VL 上将不安全响应率最多降低 39%,同时保持了实用性,将 VQAv2 的准确率从 78.51% 提高到 79.21%。
Interpretable Debiasing of Vision-Language Models for Social Fairness
Authors: Na Min An, Yoonna Jang, Yusuke Hirota, Ryo Hachiuma, Isabelle Augenstein, Hyunjung Shim
Venue: CVPR 2026
First: 2026-02-27T13:37:11+00:00 · Latest: 2026-02-27T13:37:11+00:00
Comments: 25 pages, 30 figures, 13 tables. Accepted to CVPR 2026
Abstract
The rapid advancement of Vision-Language models (VLMs) has raised growing concerns that their black-box reasoning processes could lead to unintended forms of social bias. Current debiasing approaches focus on mitigating surface-level bias signals through post-hoc learning or test-time algorithms, while leaving the internal dynamics of the model largely unexplored. In this work, we introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in VLMs through sparse autoencoders (SAEs) applied to multimodal encoders. Building upon the disentanglement ability of SAEs, we train them on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics, including those that are underrepresented. By selectively deactivating the social neurons most strongly tied to bias for each group, we effectively mitigate socially biased behaviors of VLMs without degrading their semantic knowledge. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.
中文标题/摘要
标题:可解释的视觉-语言模型去偏见以实现社会公平
视觉-语言模型(VLMs)的快速发展引发了人们日益增长的担忧:其黑箱推理过程可能导致意想不到的社会偏见。当前的去偏见方法主要通过事后学习或测试时算法减轻表层的偏见信号,而对模型内部动态则鲜有探索。本文提出了一种可解释的、模型无关的偏见缓解框架DeBiasLens,通过将稀疏自编码器(SAEs)应用于多模态编码器来定位VLM中的社会属性神经元。基于SAEs的解耦能力,我们在没有对应社会属性标签的面部图像或描述数据集上训练它们,以发现对特定人口群体高度响应的神经元,包括代表性不足的群体。通过针对每个群体选择性地停用与偏见关联最强的社会神经元,我们有效地缓解了VLM的社会偏见行为,而不损害其语义知识。我们的研究为未来审计工具的开发奠定了基础,优先考虑新兴实际AI系统中的社会公平。
Summary / 总结
This work addresses the issue of social bias in Vision-Language models (VLMs) by introducing DeBiasLens, a model-agnostic framework that uses sparse autoencoders to identify and mitigate socially biased neurons. By training SAEs on facial images or captions without social attribute labels, the framework uncovers neurons associated with specific demographics, including underrepresented groups. Deactivating these neurons selectively reduces socially biased behaviors in VLMs without compromising their semantic knowledge. This approach provides a transparent and effective method for auditing and enhancing the fairness of VLMs.
该研究通过引入使用稀疏自编码器识别和缓解社会偏见的DeBiasLens框架,解决了Vision-Language模型(VLMs)中的社会偏见问题。通过在没有社会属性标签的面部图像或描述上进行训练,该框架能够发现与特定人口统计学相关的神经元。通过有选择地禁用这些与偏见紧密相关的神经元,可以减少VLMs中的社会偏见行为,而不影响其语义理解能力。
Empowering Small VLMs to Think with Dynamic Memorization and Exploration
Authors: Jiazhen Liu, Yuchuan Deng, Long Chen
Venue: ICLR 2026
First: 2025-06-29T02:19:51+00:00 · Latest: 2026-02-27T12:25:34+00:00
Comments: Accepted by ICLR 2026
Abstract
Small-scale Vision-Language Models (SVLMs) are exceptionally well-suited for proprietary tasks. Equipping them with thinking capabilities is a critical step to enhance their performance and reliability in these specific domains. However, existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), impose substantial demands on the base VLM, exceeding the capacity of SVLMs. Consequently, directly applying these paradigms to SVLMs fails to instill the desired thinking abilities. A natural solution is to combine SFT and RLVR, leveraging their complementarity to reduce the dependence on model capacity. Yet the core challenge lies in managing the inherent trade-off: excessive reliance on SFT can force the model to memorize pseudo thinking traces, while over-emphasizing RLVR can lead to unstable exploration (i.e., advantage collapse). To address this, we propose DyME, a novel training paradigm that Dynamically selects between Memorization (via SFT) and Exploration (via RLVR) at each optimization step. By ensuring that every update contributes to the trade-off, DyME serves as a robust, standalone strategy that stabilizes SVLM learning. Complementing this paradigm, we further introduce a synergistic Visual Supervision mechanism (comprising a visual checker and refiner) designed to inject dynamically enhanced, image-grounded guidance during optimization. Extensive experiments across diverse domains demonstrate that DyME consistently achieves this balance, and thus delivers substantial performance improvements on specialized tasks. These results establish DyME as a practical and effective solution for empowering SVLMs with reliable thinking capabilities. GitHub: https://github.com/HKUST-LongGroup/DyME
中文标题/摘要
标题:赋予小型VLM动态记忆与探索能力以增强思考
小型视觉-语言模型(SVLM)非常适合专用任务。赋予它们思考能力是提高其在这些特定领域性能和可靠性的关键步骤。然而,现有的训练范式,包括监督微调(SFT)和可验证奖励的强化学习(RLVR),对基础VLM提出了巨大的需求,超出了SVLM的能力范围。因此,直接将这些范式应用于SVLM无法培养所需的思考能力。一个自然的解决方案是结合SFT和RLVR,利用它们的互补性来减少对模型容量的依赖。然而,核心挑战在于管理固有的权衡:过度依赖SFT可能导致模型记忆伪思考痕迹,而过度强调RLVR可能导致不稳定探索(即优势崩溃)。为了解决这个问题,我们提出了DyME,这是一种新颖的训练范式,在每次优化步骤中动态选择记忆(通过SFT)和探索(通过RLVR)。通过确保每次更新都对权衡做出贡献,DyME作为一种稳健的独立策略,稳定了SVLM的学习。为了补充这一范式,我们还引入了一种协同视觉监督机制(包括视觉检查器和精炼器),旨在在优化过程中注入动态增强的、基于图像的指导。在多个领域的广泛实验表明,DyME能够实现这种平衡,并在专门任务上实现了显著的性能提升。这些结果确立了DyME作为赋予SVLM可靠思考能力的实用和有效解决方案的地位。GitHub: https://github.com/HKUST-LongGroup/DyME
Summary / 总结
The research aims to enhance the thinking capabilities of small-scale Vision-Language Models (SVLMs) by proposing DyME, a novel training paradigm that dynamically balances memorization and exploration. At each optimization step, DyME selects between Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Reward (RLVR), so that every update contributes to the trade-off, and it is complemented by a Visual Supervision mechanism (a visual checker and refiner) that injects image-grounded guidance. The study shows that DyME stabilizes SVLM learning and improves performance on specialized tasks, making it a practical solution for equipping SVLMs with reliable thinking abilities.
论文提出了DyME,一种新的训练范式,通过动态平衡记忆和探索来增强小规模VLM的思考能力。DyME确保每次优化步骤都对这种权衡有所贡献,从而稳定SVLM的学习。此外,还引入了一种协同视觉监督机制,以提供动态增强的、基于图像的指导。实验表明,DyME在专门任务上表现出显著的性能提升,使其成为一种实用有效的解决方案。
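One way to picture a per-step choice between memorization and exploration is sketched below. It assumes group-relative advantages (each rollout's reward minus the group mean) and treats an all-zero advantage vector as the "advantage collapse" signal that triggers a fallback to SFT. This switching rule and the names used are assumptions made for illustration; the abstract does not state DyME's actual criterion.

```python
def choose_update(rewards, tol=1e-8):
    """Pick the update type for one optimization step (illustrative rule only).

    `rewards` holds the verifiable rewards of the rollouts sampled for one prompt.
    """
    mean_reward = sum(rewards) / len(rewards)
    # Group-relative advantages: how much each rollout beats the group average.
    advantages = [r - mean_reward for r in rewards]
    if all(abs(a) <= tol for a in advantages):
        # Every rollout scored the same, so the RL gradient would vanish:
        # fall back to a supervised (memorization) update on a reference trace.
        return "memorize_sft", advantages
    # Rollouts differ, so there is a usable exploration signal.
    return "explore_rlvr", advantages


# Placeholder reward groups, invented for illustration only.
mode_all_wrong, _ = choose_update([0.0, 0.0, 0.0, 0.0])
mode_mixed, adv_mixed = choose_update([1.0, 0.0, 0.0, 1.0])
```

Under this rule a prompt where every rollout fails (or every rollout succeeds) is routed to SFT, while a mixed group is routed to RLVR.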
SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
Authors: Qianxun Xu, Chenxi Song, Yujun Cai, Chi Zhang
Venue: CVPR 2026
First: 2026-02-27T11:59:06+00:00 · Latest: 2026-02-27T11:59:06+00:00
Comments: CVPR 2026
Abstract
Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.
中文标题/摘要
标题:SwitchCraft:无需训练的多事件视频生成与注意力控制
近期在文本到视频扩散模型方面的进展使得高保真度和时间连贯的视频合成成为可能。然而,当前的模型主要针对单事件生成进行了优化。在处理多事件提示时,如果没有明确的时间定位,这些模型往往会生成混合或坍缩的场景,破坏了预期的叙事。为了解决这一限制,我们提出了SwitchCraft,一种无需训练的多事件视频生成框架。我们的核心见解是,在时间维度上均匀注入提示忽略了事件与帧之间的对应关系。为此,我们引入了事件对齐查询引导(EAQS),该方法引导帧级注意力与相关事件提示对齐。此外,我们提出了自动平衡强度求解器(ABSS),该方法自适应地平衡引导强度,以保持时间一致性和视觉保真度。广泛的实验表明,SwitchCraft在提示对齐、事件清晰度和场景一致性方面显著优于现有基线,提供了一种简单而有效的多事件视频生成解决方案。
Summary / 总结
SwitchCraft is a training-free framework for generating multi-event videos by aligning frame-level attention with relevant event prompts. It introduces Event-Aligned Query Steering (EAQS) to steer attention and Auto-Balance Strength Solver (ABSS) to balance steering strength, ensuring temporal consistency and visual fidelity. Experiments show that SwitchCraft significantly enhances prompt alignment, event clarity, and scene consistency compared to existing methods.
SwitchCraft 是一个无需训练的框架,通过将帧级注意力与相关事件提示对齐来生成多事件视频。它引入了事件对齐查询引导(EAQS)来引导注意力,并提出了自动平衡强度求解器(ABSS)来平衡引导强度,以确保时间一致性和视觉保真度。实验表明,SwitchCraft 在提示对齐、事件清晰度和场景一致性方面优于现有方法。
CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering
Authors: Yuyang Hong, Jiaqi Gu, Yujin Lou, Lubin Fan, Qi Yang, Ying Wang, Kun Ding, Yue Wu, Shiming Xiang, Jieping Ye
First: 2026-02-27T11:56:26+00:00 · Latest: 2026-02-27T11:56:26+00:00
Comments: Accepted by CVPR 2026
Abstract
Knowledge-based visual question answering (KB-VQA) demonstrates significant potential for handling knowledge-intensive tasks. However, because the parametric knowledge in vision language models (VLMs) is fixed at pre-training, it can conflict with dynamically retrieved information. The outputs either ignore retrieved contexts or exhibit inconsistent integration with parametric knowledge, posing substantial challenges for KB-VQA. Current knowledge conflict mitigation methods are primarily adapted from language-based approaches, focusing on context-level conflicts through engineered prompting strategies or context-aware decoding mechanisms. However, these methods neglect the critical role of visual information in conflicts and suffer from redundant retrieved contexts, which impair accurate conflict identification and effective mitigation. To address these limitations, we propose CC-VQA: a novel training-free, conflict- and correlation-aware method for KB-VQA. Our method comprises two core components: (1) Vision-Centric Contextual Conflict Reasoning, which performs visual-semantic conflict analysis across internal and external knowledge contexts; and (2) Correlation-Guided Encoding and Decoding, featuring positional encoding compression for low-correlation statements and adaptive decoding using correlation-weighted conflict scoring. Extensive evaluations on E-VQA, InfoSeek, and OK-VQA benchmarks demonstrate that CC-VQA achieves state-of-the-art performance, yielding absolute accuracy improvements of 3.3% to 6.4% compared to existing methods. Code is available at https://github.com/cqu-student/CC-VQA.
中文标题/摘要
标题:CC-VQA:面向基于知识的视觉问答、缓解知识冲突的冲突与相关性感知方法
基于知识的视觉问答(KB-VQA)在处理知识密集型任务方面显示出巨大的潜力。然而,由于视觉语言模型(VLMs)中的参数知识在预训练后即固定不变,它会与动态检索的信息产生冲突。输出要么忽略检索的上下文,要么与参数知识的整合不一致,这给KB-VQA带来了重大挑战。当前的知识冲突缓解方法主要借鉴语言领域的做法,侧重于通过工程化的提示策略或上下文感知解码机制解决上下文级别的冲突。然而,这些方法忽视了视觉信息在冲突中的关键作用,并且受到冗余检索上下文的影响,这妨碍了冲突的准确识别和有效缓解。为了解决这些局限性,我们提出了CC-VQA:一种新颖的、无需训练的、冲突与相关性感知的KB-VQA方法。该方法包括两个核心组件:(1)视觉中心的上下文冲突推理,进行内部和外部知识上下文之间的视觉语义冲突分析;(2)相关性引导的编码与解码,包括低相关性陈述的位置编码压缩和使用相关性加权冲突评分的自适应解码。在E-VQA、InfoSeek和OK-VQA基准上的广泛评估表明,CC-VQA达到了最先进的性能,与现有方法相比,绝对准确率提高了3.3%到6.4%。代码可在 https://github.com/cqu-student/CC-VQA 获取。
Summary / 总结
The paper proposes CC-VQA, a conflict- and correlation-aware method for mitigating knowledge conflicts in KB-VQA. It addresses the limitations of existing methods by incorporating visual information and using positional encoding compression and correlation-weighted conflict scoring. Experiments on E-VQA, InfoSeek, and OK-VQA show that CC-VQA outperforms existing methods with absolute accuracy improvements of 3.3% to 6.4%.
该论文提出了一种名为CC-VQA的新方法,用于缓解KB-VQA中的知识冲突。该方法包括视觉中心的上下文冲突分析和相关性引导的编码与解码,前者用于视觉语义冲突分析,后者通过压缩低相关性陈述和使用相关性加权冲突评分来实现。实验表明,CC-VQA在E-VQA、InfoSeek和OK-VQA基准测试中优于现有方法,绝对准确率提高了3.3%到6.4%。
The Geometry of Transfer: Unlocking Medical Vision Manifolds for Training-Free Model Ranking
Authors: Jiaqi Tang, Shaoyang Zhang, Xiaoqi Wang, Jiaying Zhou, Yang Liu, Qingchao Chen
First: 2026-02-27T11:04:15+00:00 · Latest: 2026-02-27T11:04:15+00:00
Abstract
The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal medical foundation models for specific segmentation tasks remains a computational bottleneck. Existing Transferability Estimation (TE) metrics, primarily designed for classification, rely on global statistical assumptions and fail to capture the topological complexity essential for dense prediction. We propose a novel Topology-Driven Transferability Estimation framework that evaluates manifold tractability rather than statistical overlap. Our approach introduces three components: (1) Global Representation Topology Divergence (GRTD), utilizing Minimum Spanning Trees to quantify feature-label structural isomorphism; (2) Local Boundary-Aware Topological Consistency (LBTC), which assesses manifold separability specifically at critical anatomical boundaries; and (3) Task-Adaptive Fusion, which dynamically integrates global and local metrics based on the semantic cardinality of the target task. Validated on the large-scale OpenMind benchmark across diverse anatomical targets and SSL foundation models, our approach significantly outperforms state-of-the-art baselines by around 31% relative improvement in the weighted Kendall, providing a robust, training-free proxy for efficient model selection without the cost of fine-tuning. The code will be made publicly available upon acceptance.
中文标题/摘要
标题:迁移的几何学:解锁医学视觉流形以实现无需训练的模型排序
大规模自监督学习(SSL)的出现催生了大量的医学基础模型。然而,为特定分割任务选择最优的医学基础模型仍然是一个计算瓶颈。现有的可迁移性估计(TE)指标主要针对分类任务设计,依赖于全局统计假设,无法捕捉密集预测中至关重要的拓扑复杂性。我们提出了一种新颖的拓扑驱动可迁移性估计框架,评估流形可处理性而非统计重叠。我们的方法引入了三个组成部分:(1)全局表示拓扑差异(GRTD),利用最小生成树量化特征-标签结构同构性;(2)局部边界感知拓扑一致性(LBTC),专门评估关键解剖边界处的流形可分性;(3)任务自适应融合,根据目标任务的语义基数动态整合全局和局部指标。在大规模OpenMind基准上,针对多种解剖目标和SSL基础模型的验证表明,我们的方法在加权Kendall指标上比最先进的基线相对提升约31%,提供了一种稳健的、无需训练的高效模型选择代理,免去了微调的成本。代码将在论文录用后公开发布。
Summary / 总结
This paper addresses the challenge of selecting optimal medical foundation models for specific segmentation tasks by proposing a Topology-Driven Transferability Estimation framework. It introduces three components: GRTD for quantifying feature-label structural isomorphism, LBTC for assessing manifold separability at critical anatomical boundaries, and Task-Adaptive Fusion for dynamically integrating global and local metrics. The approach significantly outperforms existing methods by around 31% relative improvement in weighted Kendall, offering a robust, training-free method for efficient model selection.
本文提出了一种拓扑驱动的可迁移性估计框架,以解决为特定分割任务选择最优医学基础模型的挑战。该框架包含三个组件:用于量化特征-标签结构同构性的GRTD、用于评估关键解剖边界处流形可分性的LBTC,以及根据目标任务语义基数动态整合全局和局部指标的任务自适应融合。该方法在OpenMind基准上显著优于现有方法,在加权Kendall指标上相对提升约31%,提供了一种稳健且无需训练的模型选择代理。
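Transferability estimators of this kind are usually judged by how well their scores rank models relative to the ranking obtained after actual fine-tuning. The sketch below shows that evaluation with the plain (unweighted) Kendall tau-a, which is simpler than the weighted Kendall variant reported in the abstract. The function name and all numbers are placeholders invented for illustration, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores, finetuned):
    """Unweighted Kendall tau-a between estimator scores and fine-tuned results."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores)), 2):
        # Positive product: both rankings order models i and j the same way.
        direction = (scores[i] - scores[j]) * (finetuned[i] - finetuned[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    pairs = len(scores) * (len(scores) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical transferability scores and fine-tuned Dice values for four models.
te_scores = [0.9, 0.4, 0.7, 0.1]
dice_after_finetuning = [0.82, 0.75, 0.71, 0.60]
tau = kendall_tau(te_scores, dice_after_finetuning)
```

A value near 1 means the training-free score orders the candidate models almost exactly as fine-tuning would, which is the property a model-selection proxy needs.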
Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition
Authors: Mohammadreza Heidarianbaei, Mareike Dorozynski, Hubert Kanyamahanga, Max Mehltretter, Franz Rottensteiner
First: 2026-02-27T10:11:12+00:00 · Latest: 2026-02-27T10:11:12+00:00
Comments: Published in the proceedings of the British Machine Vision Conference Workshops 2025
Abstract
In this paper, we propose ReSeg-CLIP, a new training-free Open-Vocabulary Semantic Segmentation method for remote sensing data. Vision-language models such as CLIP struggle in semantic segmentation because of inappropriate interactions within their self-attention layers; to compensate, we introduce a hierarchical scheme that uses masks generated by SAM to constrain these interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, taking advantage of a new weighting scheme that evaluates representational quality using varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.
中文标题/摘要
标题:基于层次注意力掩码与模型组合的遥感开放词汇语义分割
在本文中,我们提出了一种新的无需训练的开放词汇语义分割方法ReSeg-CLIP,适用于遥感数据。为了解决视觉语言模型,如CLIP在语义分割中由于自注意力层内的不当交互所引起的问题,我们引入了一种层次方案,利用SAM生成的掩码在多个尺度上约束交互。我们还提出了一种模型组合方法,通过平均多个RS特定的CLIP变体的参数,利用一种新的加权方案,该方案使用不同的文本提示来评估表示质量。我们的方法在三个RS基准上达到了最先进的结果,无需额外训练。
Summary / 总结
The research aims to address the limitations of vision language models in remote sensing semantic segmentation, particularly the issues arising from inappropriate interactions within self-attention layers. The proposed ReSeg-CLIP method uses a hierarchical scheme with masks generated by SAM to constrain interactions at multiple scales and a model composition approach that averages parameters of multiple RS-specific CLIP variants. This method achieves state-of-the-art results across three remote sensing benchmarks without additional training.
该研究提出了ReSeg-CLIP,一种无需训练的遥感开放词汇语义分割方法。它通过使用SAM生成的掩码来约束多尺度交互,解决视觉语言模型如CLIP的问题。此外,还提出了一种模型组合方法,通过多种特定于遥感的CLIP变体的参数平均值,并使用基于文本提示的新加权方案评估表示质量。该方法在三个遥感基准上达到了最先进的结果,无需额外训练。
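The model composition idea, averaging the parameters of several CLIP variants with per-model weights, can be illustrated with a tiny sketch. It assumes all variants share the same architecture and parameter names, and it takes the quality weights as given. How ReSeg-CLIP derives those weights from text prompts is not described in the abstract, so the weights, names, and values below are purely hypothetical.

```python
def compose_models(state_dicts, quality_weights):
    """Weighted average of parameters from models with identical architectures.

    `state_dicts` maps parameter names to flat lists of floats (a stand-in for
    real tensors); `quality_weights` are hypothetical per-model quality scores.
    """
    total = sum(quality_weights)
    normalized = [w / total for w in quality_weights]
    composed = {}
    for name, reference in state_dicts[0].items():
        composed[name] = [
            sum(w * sd[name][k] for w, sd in zip(normalized, state_dicts))
            for k in range(len(reference))
        ]
    return composed


# Two toy "models" with one parameter each, invented for illustration only.
variant_a = {"proj.weight": [0.0, 2.0]}
variant_b = {"proj.weight": [2.0, 4.0]}
merged = compose_models([variant_a, variant_b], quality_weights=[1.0, 3.0])
```

Because the result is a single set of parameters, the composed model costs the same at inference time as any one of the variants it was built from.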
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Authors: Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar, Mohsin Mahmud Topu, Sadia Tasnim Meem, Rahatun Nesa Priti, Sabrina Afroz Mitu, Md. Iqramul Hoque, Shahriyar Zaman Ridoy, Mohammed Eunus Ali, Majd Hawasly, Mohammad Raza, Md Rizwan Parvez
Venue: ICLR 2026
First: 2026-02-03T17:52:02+00:00 · Latest: 2026-02-27T09:56:46+00:00
Comments: Accepted to ICLR 2026 (https://openreview.net/forum?id=fWWUPOb0CT). 92 Pages. 42 Figures and 29 Tables
Abstract
Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.
中文标题/摘要
标题:SpatiaLab:视觉语言模型能在真实场景中进行空间推理吗?
空间推理是人类认知的基本方面,但对当前的视觉语言模型(VLMs)来说仍然是一个重大挑战。以往的工作主要依赖于合成或LLM生成的环境,任务设计有限且多为谜题式设置,未能捕捉到VLMs在现实世界中遇到的复杂性、视觉噪声和多样的空间关系。为解决这一问题,我们引入了SpatiaLab,这是一个全面的基准,用于评估VLMs在现实、不受限制的环境中的空间推理能力。SpatiaLab包含1,400个视觉问答对,涵盖六个主要类别:相对位置、深度与遮挡、方向、大小与比例、空间导航和三维几何,每个类别下有五个子类别,总共产生30种不同的任务类型。每个子类别至少包含25个问题,每个主要类别至少包含200个问题,支持多项选择和开放式评估。我们对多种最先进的VLMs进行了实验,包括开源和闭源模型、侧重推理的模型以及专门的空间推理模型,结果显示它们与人类相比在空间推理能力上存在显著差距。在多项选择设置中,InternVL3.5-72B的准确率为54.93%,而人类为87.57%。在开放式设置中,所有模型的性能下降约10-25%,GPT-5-mini得分最高,为40.93%,而人类为64.93%。这些结果突显了模型在处理复杂的空间关系、深度感知、导航和三维几何方面的关键局限性。通过提供一个多样化的、现实世界的评估框架,SpatiaLab揭示了提升VLMs空间推理能力的关键挑战和机遇,为未来研究提供了基准,以引导其走向稳健、与人类对齐的空间理解。SpatiaLab可在 https://spatialab-reasoning.github.io/ 获取。
Summary / 总结
SpatiaLab is a new benchmark for evaluating vision-language models' spatial reasoning in real-world scenarios. It includes 1,400 visual question-answer pairs covering six categories with 30 distinct task types. Experiments across various VLMs show significant gaps in spatial reasoning compared to humans: the best model reaches 54.93% accuracy in the multiple-choice setting (InternVL3.5-72B, versus 87.57% for humans) and 40.93% in the open-ended setting (GPT-5-mini, versus 64.93% for humans). These results highlight the need for better handling of complex spatial relationships and 3D geometry in VLMs.
SpatiaLab 是一个新基准,旨在评估视觉语言模型在真实世界、不受约束环境中的空间推理能力。它包含1,400个视觉问答对,涵盖六个类别共30种任务类型,每个类别至少有200个问题。实验显示,在多项选择设置中,最佳模型InternVL3.5-72B的准确率为54.93%,而人类为87.57%;在开放式设置中,所有模型的表现下降约10-25%,最佳模型GPT-5-mini得分40.93%,而人类为64.93%,这表明模型在处理复杂的空间关系和3D几何方面存在显著差距。
Radiologist Copilot: An Agentic Framework Orchestrating Specialized Tools for Reliable Radiology Reporting
Authors: Yongrui Yu, Zhongzhen Huang, Linjie Mu, Shaoting Zhang, Xiaofan Zhang
First: 2025-12-02T14:25:05+00:00 · Latest: 2026-02-27T09:24:21+00:00
Abstract
In clinical practice, radiology reporting is an essential yet complex, time-intensive, and error-prone task, particularly for 3D medical images. Existing automated approaches based on medical vision-language models primarily focus on isolated report generation. However, real-world radiology reporting extends far beyond report writing, which requires meticulous image observation and interpretation, appropriate template selection, and rigorous quality control to ensure adherence to clinical standards. This multi-stage, planning-intensive workflow fundamentally exceeds the capabilities of single-pass models. To bridge this gap, we propose Radiologist Copilot, an agentic system that autonomously orchestrates specialized tools to complete the entire radiology reporting workflow rather than isolated report writing. Radiologist Copilot enables region image localization and region analysis planning to support detailed visual reasoning, adopts strategic template selection for standardized report writing, and incorporates dedicated report quality control via quality assessment and feedback-driven iterative refinement. By integrating localization, interpretation, template selection, report composition, and quality control, Radiologist Copilot delivers a comprehensive and clinically aligned radiology reporting workflow. Experimental results demonstrate that it significantly outperforms state-of-the-art methods, supporting radiologists throughout the entire radiology reporting process. The code will be released upon acceptance.
中文标题/摘要
标题:放射科医生助手:协调专用工具以实现可靠放射学报告的代理框架
在临床实践中,放射学报告是一项重要但复杂、耗时且容易出错的任务,尤其是对于3D医学图像。现有的基于医学视觉-语言模型的自动化方法主要集中在孤立的报告生成上。然而,现实世界的放射学报告远不止报告撰写,它还需要细致的图像观察和解释、合适的模板选择以及严格的质量控制,以确保符合临床标准。这一多阶段、规划密集型的工作流程从根本上超出了单次前向模型的能力。为了弥合这一差距,我们提出了Radiologist Copilot(放射科医生助手),这是一个代理系统,能够自主协调专用工具以完成整个放射学报告工作流程,而不仅仅是孤立的报告撰写。Radiologist Copilot通过区域图像定位和区域分析规划来支持详细的视觉推理,采用策略性的模板选择进行标准化报告撰写,并通过质量评估和反馈驱动的迭代改进实现专门的报告质量控制。通过整合定位、解释、模板选择、报告撰写和质量控制,Radiologist Copilot提供了一个全面且符合临床标准的放射学报告工作流程。实验结果表明,它显著优于最先进的方法,能够在整个放射学报告过程中为放射科医生提供支持。代码将在论文录用后发布。
Summary / 总结
The research aims to address the complexity and time-consuming nature of radiology reporting, especially for 3D medical images. The proposed Radiologist Copilot system autonomously orchestrates specialized tools to handle the entire radiology reporting workflow, including image localization, analysis planning, template selection, report composition, and quality control. Experimental results show that Radiologist Copilot outperforms existing methods and supports radiologists throughout the reporting process.
论文针对放射学报告的复杂性和耗时性,特别是对于3D医学图像。提出了一种名为Radiologist Copilot的代理系统,该系统能够自主协调专门的工具来处理整个报告流程,包括图像定位、分析规划、模板选择、报告编写和质量控制。实验结果表明,Radiologist Copilot优于现有方法,并支持放射科医生完成整个报告过程。
Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving
Authors: Sheng Yang, Tong Zhan, Guancheng Chen, Yanfeng Lu, Jian Wang
First: 2025-09-29T05:14:18+00:00 · Latest: 2026-02-27T09:00:30+00:00
Abstract
In this work, we reconceptualize autonomous driving as a generalized language problem and formulate the trajectory planning task as next waypoint prediction. We introduce Max-V1, a novel framework for one-stage end-to-end autonomous driving, named in tribute to the renowned Dutch racing driver Max Verstappen. Our framework presents a single-pass generation paradigm that aligns with the inherent sequentiality of driving. This approach leverages the generative capacity of the Vision-Language Model (VLM) to enable end-to-end trajectory prediction directly from front-view camera input. The efficacy of this method is underpinned by a principled supervision strategy derived from statistical modeling. This provides a well-defined learning objective, which makes the framework highly amenable to mastering complex driving policies through imitation learning from large-scale expert demonstrations. Empirically, our method achieves state-of-the-art performance on the nuScenes dataset, delivering an overall improvement of over 30% compared to prior baselines. Furthermore, it exhibits superior generalization performance on cross-domain datasets acquired from diverse vehicles, demonstrating notable potential for cross-vehicle robustness and adaptability. With these empirical strengths, this work introduces a model that enables fundamental driving behaviors, laying the foundation for the development of more capable self-driving agents. Code will be available upon publication.
中文标题/摘要
标题:少即是多:简约而强大的视觉语言模型在自动驾驶中的应用
在本研究中,我们将自动驾驶重新概念化为一个广义的语言问题,并将轨迹规划任务表述为下一个航点的预测。我们引入了Max-V1,这是一种新颖的单阶段端到端自动驾驶框架,其命名是为了向著名荷兰赛车手Max Verstappen致敬。我们的框架采用单次前向生成的范式,这与驾驶固有的顺序性相吻合。该方法利用视觉语言模型(VLM)的生成能力,直接从前视摄像头输入中进行端到端的轨迹预测。该方法的有效性建立在从统计建模中得出的原则性监督策略之上,这提供了明确的学习目标,使框架非常适合通过对大规模专家演示的模仿学习来掌握复杂的驾驶策略。实验证明,我们的方法在nuScenes数据集上达到了最先进的性能,相比之前的基线方法,整体性能提高了超过30%。此外,它在来自不同车辆的跨域数据集上表现出优异的泛化性能,显示出跨车辆鲁棒性和适应性的显著潜力。凭借这些实证优势,本研究引入了一个能够实现基本驾驶行为的模型,为开发更强大的自动驾驶代理奠定了基础。代码将在发表后提供。
Summary / 总结
This work reimagines autonomous driving as a language problem, formulating trajectory planning as next waypoint prediction. The Max-V1 framework, named in tribute to racing driver Max Verstappen, uses a single-pass generation approach to predict trajectories directly from front-view camera input with a Vision-Language Model. The method achieves state-of-the-art performance on the nuScenes dataset, with an overall improvement of over 30% compared to prior baselines, and generalizes well to cross-domain datasets collected from different vehicles.
本文将自动驾驶重新定义为语言问题,将轨迹规划转化为下一目标点预测。引入了Max-V1框架,采用视觉语言模型直接从摄像头输入预测轨迹。该方法通过一个原则性的监督策略,在nuScenes数据集上取得了最先进的性能,比之前的方法提高了超过30%。此外,它在不同车辆的数据集上表现出色,显示出强大的鲁棒性和适应性,为更先进的自动驾驶代理的开发奠定了基础。
Footprint-Guided Exemplar-Free Continual Histopathology Report Generation
Authors: Pratibha Kumari, Daniel Reisenbüchler, Afshin Bozorgpour, Yousef Sadegheih, Priyankar Choudhary, Dorit Merhof
First: 2026-02-27T08:58:03+00:00 · Latest: 2026-02-27T08:58:03+00:00
Abstract
Rapid progress in vision-language modeling has enabled pathology report generation from gigapixel whole-slide images, but most approaches assume static training with simultaneous access to all data. In clinical deployment, however, new organs, institutions, and reporting conventions emerge over time, and sequential fine-tuning can cause catastrophic forgetting. We introduce an exemplar-free continual learning framework for WSI-to-report generation that avoids storing raw slides or patch exemplars. The core idea is a compact domain footprint built in a frozen patch-embedding space: a small codebook of representative morphology tokens together with slide-level co-occurrence summaries and lightweight patch-count priors. These footprints support generative replay by synthesizing pseudo-WSI representations that reflect domain-specific morphological mixtures, while a teacher snapshot provides pseudo-reports to supervise the updated model without retaining past data. To address shifting reporting conventions, we distill domain-specific linguistic characteristics into a compact style descriptor and use it to steer generation. At inference, the model identifies the most compatible descriptor directly from the slide signal, enabling domain-agnostic setup without requiring explicit domain identifiers. Evaluated across multiple public continual learning benchmarks, our approach outperforms exemplar-free and limited-buffer rehearsal baselines, highlighting footprint-based generative replay as a practical solution for deployment in evolving clinical settings.
中文标题/摘要
标题:基于足迹引导的无范例持续病理报告生成
视觉-语言模型的快速发展使得可以从千兆像素全切片图像中生成病理报告成为可能,但大多数方法假设静态训练并同时访问所有数据。然而,在临床部署中,随着时间的推移,新的器官、机构和报告规范不断出现,顺序微调会导致灾难性遗忘。我们提出了一种无范例的持续学习框架,用于全切片图像到报告的生成,避免存储原始切片或片段范例。核心思想是在冻结片段嵌入空间中构建紧凑的领域足迹:一个代表形态学标记的小码本,以及切片级别共现总结和轻量级片段计数先验。这些足迹通过合成反映领域特定形态学混合的伪全切片表示支持生成回放,而教师快照提供伪报告以监督更新模型,而不保留过去的数据。为应对变化的报告规范,我们将领域特定的语言特征提炼为紧凑的风格描述符,并使用它来引导生成。在推理时,模型直接从切片信号中识别最兼容的描述符,实现领域无关的设置,无需显式领域标识符。在多个公开的持续学习基准上评估,我们的方法优于无范例和有限缓冲复习基线,突显了基于足迹的生成回放作为在不断变化的临床环境中部署的实用解决方案。
Summary / 总结
The research addresses the challenge of continual learning in histopathology report generation from whole-slide images (WSIs) without storing raw data or exemplars. It introduces a framework that uses a compact domain footprint in a frozen patch-embedding space, including a codebook of morphology tokens and slide-level summaries, to support generative replay and adapt to changing reporting conventions. The approach outperforms other exemplar-free and limited-buffer rehearsal methods, demonstrating its effectiveness in evolving clinical settings.
研究针对从高分辨率全切片图像生成病理报告时面临的持续学习挑战,其中新数据和报告规范会随时间不断出现。提出了一种无示例的持续学习框架,该框架利用冻结的patch嵌入空间中的紧凑领域足迹来支持生成性重放和更新模型,而不保留过去的数据。该框架包括代表形态学标记的代码本、切片级别的共现总结和轻量级的patch计数先验,以及一个风格描述符来处理报告规范的变化。实验结果表明,该方法在多个持续学习基准测试中优于其他无示例和有限缓冲区复习基线方法。
See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent
Authors: Tianci Tang, Tielong Cai, Hongwei Wang, Gaoang Wang
First: 2026-02-27T08:44:47+00:00 · Latest: 2026-02-27T08:44:47+00:00
Abstract
Pre-trained perception models excel in generic image domains but degrade significantly in novel environments like indoor scenes. The conventional remedy is fine-tuning on downstream data, which incurs catastrophic forgetting of prior knowledge and demands costly, scene-specific annotations. We propose a paradigm shift through Sea$^2$ (See, Act, Adapt): rather than adapting the perception modules themselves, we adapt how they are deployed through an intelligent pose-control agent. Sea$^2$ keeps all perception modules frozen, requiring no downstream labels during training, and uses only scalar perceptual feedback to navigate the agent toward informative viewpoints. Specifically, we transform a vision-language model (VLM) into a low-level pose controller through a two-stage training pipeline: first fine-tuning it on rule-based exploration trajectories that systematically probe indoor scenes, and then refining the policy via unsupervised reinforcement learning that constructs rewards from the perception module's outputs and confidence. Unlike prior active perception methods that couple exploration with specific models or collect data for retraining them, Sea$^2$ directly leverages off-the-shelf perception models for various tasks without the need for retraining. We conducted experiments on three visual perception tasks, including visual grounding, segmentation and 3D box estimation, with performance improvements of 13.54%, 15.92% and 27.68% respectively on the ReplicaCAD dataset.
中文标题/摘要
标题:见、行动、适应:通过个性化VLM引导代理的主动感知在无监督跨域视觉适应中的应用
预训练的感知模型在通用图像域中表现出色,但在像室内场景这样的新环境中会显著退化。传统的解决方法是对下游数据进行微调,这会导致先验知识的灾难性遗忘,并需要昂贵的、场景特定的注释。我们提出了一种范式转变,通过Sea$^2$(见、行动、适应):而不是适应感知模块本身,我们通过一个智能姿态控制代理来适应它们的部署方式。Sea$^2$保持所有感知模块冻结,在训练过程中不需要下游标签,并仅使用标量感知反馈来引导代理向信息性视角移动。特别地,我们通过两阶段训练管道将一个视觉语言模型(VLM)转化为低级姿态控制器:首先在基于规则的探索轨迹上对其进行微调,系统地探测室内场景,然后通过无监督强化学习来细化策略,该策略从感知模块的输出和置信度构建奖励。与之前将探索与特定模型耦合或收集用于重新训练它们的数据的方法不同,Sea$^2$直接利用现成的感知模型进行各种任务,而无需重新训练。我们在包括视觉定位、分割和3D框估计在内的三个视觉感知任务上进行了实验,在ReplicaCAD数据集上分别取得了13.54%、15.92%和27.68%的性能提升。
Summary / 总结
The paper addresses the issue of pre-trained perception models degrading in novel indoor environments by proposing Sea$^2$ (See, Act, Adapt), which adapts an intelligent pose-control agent rather than the perception modules themselves. This approach uses a vision-language model as a low-level pose controller, trained through a two-stage process: first, fine-tuning on rule-based exploration trajectories, and then unsupervised reinforcement learning. Experiments on visual grounding, segmentation, and 3D box estimation showed improvements of 13.54%, 15.92%, and 27.68% respectively on the ReplicaCAD dataset.
论文针对预训练感知模型在新型室内环境中的退化问题,提出了Sea$^2$(See, Act, Adapt)框架,通过智能姿态控制代理来适应感知模块的部署。Sea$^2$保持感知模块不变,并使用感知反馈引导代理到达信息丰富的视角。该方法利用视觉语言模型(VLM)作为低级姿态控制器,通过两个阶段训练:首先在基于规则的探索轨迹上训练,然后通过无监督强化学习从感知模块的输出和置信度构建奖励。实验结果显示,在视觉定位、分割和3D框估计任务上分别取得了13.54%、15.92%和27.68%的性能提升,基于ReplicaCAD数据集。
ZOO-Prune: Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models
Authors: Youngeun Kim, Youjia Zhang, Huiling Liu, Aecheon Jung, Sunwoo Lee, Sungeun Hong
First: 2025-09-29T14:20:05+00:00 · Latest: 2026-02-27T08:02:05+00:00
Abstract
Large Vision-Language Models (VLMs) enable strong multimodal reasoning but incur heavy inference costs from redundant visual tokens. Token pruning alleviates this issue, yet existing approaches face limitations. Attention-based methods rely on raw attention scores, which are often unstable across layers and heads and can lead to redundant selections. Diversity-based methods improve robustness by selecting tokens far apart in feature space, but risk dropping regions needed for accurate prediction. We propose ZOO-Prune, a training-free framework built on the intuition that highly sensitive tokens have a stronger influence on the model's output and capture complementary visual cues rather than redundant ones. To achieve this, we estimate token sensitivity using zeroth-order perturbations at the lightweight projection layer. This measures how small random perturbations affect the projected features and enables efficient approximation of each token's influence without backpropagation. Extensive experiments across multiple VLMs and benchmarks show that ZOO-Prune consistently outperforms prior methods while pruning up to 94.4% of tokens without sacrificing accuracy. Our method also improves efficiency, reaching up to 2.30x faster end-to-end inference compared to the baseline.
中文标题/摘要
标题:ZOO-Prune: 无需训练的零阶梯度估计视觉-语言模型中令牌剪枝
大型视觉-语言模型(VLMs)能够实现强大的多模态推理,但冗余的视觉令牌会导致高昂的推理成本。令牌剪枝可以缓解这一问题,但现有方法存在局限性。基于注意力的方法依赖于原始的注意力分数,这些分数在不同层和头之间往往不稳定,可能导致冗余选择。基于多样性的方法通过在特征空间中选择相距较远的令牌来提高鲁棒性,但可能会遗漏对于准确预测必要的区域。我们提出了一种无需训练的ZOO-Prune框架,其基于直觉,即高度敏感的令牌对模型输出有更强的影响,并捕捉互补的视觉线索而非冗余的线索。为此,我们使用轻量级投影层的零阶扰动来估计令牌的敏感性。这种方法通过测量小的随机扰动如何影响投影特征来衡量每个令牌的影响,并在无需反向传播的情况下进行高效近似。在多个VLMs和基准上的广泛实验表明,ZOO-Prune在剪枝高达94.4%的令牌的同时,始终优于先前的方法,且不牺牲准确性。此外,我们的方法还提高了效率,与基线相比,端到端推理速度可提高至2.30倍。
Summary / 总结
ZOO-Prune is a training-free token pruning method for Vision-Language Models (VLMs) that uses zeroth-order gradient estimation to identify and prune redundant visual tokens. It measures token sensitivity by applying small random perturbations at a lightweight projection layer, avoiding the need for backpropagation. Experiments demonstrate that ZOO-Prune prunes up to 94.4% of tokens without accuracy loss and achieves up to 2.30x faster inference compared to baseline models.
ZOO-Prune 是一种无需训练的 Vision-Language 模型(VLMs)的 token 剪枝方法,通过零阶梯度估计来识别和剪枝冗余的视觉 token。它通过在轻量级投影层上应用小的随机扰动来测量 token 的敏感性,避免了反向传播的需求。实验表明,ZOO-Prune 可以剪枝高达 94.4% 的 token 同时不损失准确性,并且相比基线模型可实现高达 2.30 倍的端到端推理加速。
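The sensitivity estimate described in the ZOO-Prune abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the central-difference form, the number of probes, and the top-k selection rule are all assumptions made for illustration, and `project` is a stand-in for the model's lightweight projection layer.

```python
import numpy as np

def zo_token_sensitivity(tokens, project, num_probes=8, eps=1e-3, seed=0):
    """Estimate per-token sensitivity using forward passes only (no backprop).

    tokens:  (N, D) array of visual token features fed to the projection layer.
    project: callable mapping (N, D) -> (N, D_out), applied to each token row
             (hypothetical stand-in for the VLM projector).
    """
    rng = np.random.default_rng(seed)
    scores = np.zeros(tokens.shape[0])
    for _ in range(num_probes):
        # Random unit-norm perturbation direction for every token.
        u = rng.standard_normal(tokens.shape)
        u /= np.linalg.norm(u, axis=1, keepdims=True)
        # Central difference: change in projected features per unit step.
        delta = project(tokens + eps * u) - project(tokens - eps * u)
        scores += np.linalg.norm(delta, axis=1) / (2.0 * eps)
    return scores / num_probes

def prune_tokens(tokens, scores, keep_ratio):
    """Keep the most sensitive tokens, preserving their original order."""
    keep = max(1, int(round(keep_ratio * len(tokens))))
    idx = np.sort(np.argsort(-scores)[:keep])
    return tokens[idx], idx
```

With `project=np.tanh` as a toy projector, tokens near zero score higher than saturated tokens, because the same small perturbation changes their projected features more. The paper's actual scoring and selection rules may differ from this sketch.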
Shape vs. Context: Examining Human--AI Gaps in Ambiguous Japanese Character Recognition
Authors: Daichi Haraguchi
First: 2026-02-27T07:18:53+00:00 · Latest: 2026-02-27T07:18:53+00:00
Comments: Accepted to CHI 2026 Poster track
Abstract
High text recognition performance does not guarantee that Vision-Language Models (VLMs) share human-like decision patterns when resolving ambiguity. We investigate this behavioral gap by directly comparing humans and VLMs using continuously interpolated Japanese character shapes generated via a $β$-VAE. We estimate decision boundaries in a single-character recognition (shape-only task) and evaluate whether VLM responses align with human judgments under shape in context (i.e., embedding an ambiguous character near the human decision boundary in word-level context). We find that human and VLM decision boundaries differ in the shape-only task, and that shape in context can improve human alignment in some conditions. These results highlight qualitative behavioral differences, offering foundational insights toward human--VLM alignment benchmarking.
中文标题/摘要
标题:形状 vs. 上下文:探究人类与AI在模糊日文字形识别中的差异
高文本识别性能并不保证视觉-语言模型(VLMs)在解决歧义时会表现出与人类相似的决策模式。我们通过直接比较人类和VLMs,使用通过$β$-VAE生成的连续插值日文字形来研究这种行为差异。我们估计单字识别(仅形状任务)的决策边界,并评估在上下文(即,在单词级上下文中将模糊字符置于人类决策边界附近)中VLM的响应是否与人类判断一致。我们发现,在仅形状任务中,人类和VLM的决策边界存在差异,而在某些条件下,形状在上下文中的使用可以改善人类的一致性。这些结果突显了定性行为差异,为人类-VLM对齐基准测试提供了基础见解。
Summary / 总结
The study investigates the differences in decision-making between humans and Vision-Language Models (VLMs) when recognizing ambiguous Japanese characters. By using continuously interpolated character shapes generated via a $β$-VAE, the researchers compare human and VLM responses in both shape-only and shape-in-context tasks. Key findings include differences in decision boundaries between humans and VLMs in the shape-only task, and that embedding ambiguous characters in word-level context can improve VLM alignment with human judgments in some conditions.
研究探讨了人类和Vision-Language模型(VLM)在识别日文模糊字符时的决策差异。通过使用$β$-VAE生成的连续插值字符形状,研究人员在形状仅任务和形状在上下文任务中比较了人类和VLM的响应。主要发现表明,在形状仅任务中,人类和VLM的决策边界存在差异,而在单词级上下文中嵌入模糊字符可以改善某些条件下VLM与人类判断的一致性。
SAGE-LLM: Towards Safe and Generalizable LLM Controller with Fuzzy-CBF Verification and Graph-Structured Knowledge Retrieval for UAV Decision
Authors: Wenzhe Zhao, Yang Zhao, Ganchao Liu, Zhiyu Jiang, Dandan Ma, Zihao Li, Xuelong Li
First: 2026-02-27T06:41:04+00:00 · Latest: 2026-02-27T06:41:04+00:00
Abstract
In dynamic UAV decision-making, complex and variable hazardous factors pose severe challenges to the generalization capability of algorithms. Despite offering semantic understanding and scene generalization, Large Language Models (LLMs) lack domain-specific UAV control knowledge and formal safety assurances, restricting their direct applicability. To bridge this gap, this paper proposes a training-free two-layer decision architecture based on LLMs, integrating high-level safety planning with low-level precise control. The framework introduces three key contributions: 1) A fuzzy Control Barrier Function verification mechanism for semantically-augmented actions, providing provable safety certification for LLM outputs. 2) A star-hierarchical graph-based retrieval-augmented generation system, enabling efficient, elastic, and interpretable scene adaptation. 3) Systematic experimental validation in pursuit-evasion scenarios with unknown obstacles and emergent threats, demonstrating that our SAGE-LLM maintains performance while significantly enhancing safety and generalization without online training. The proposed framework demonstrates strong extensibility, suggesting its potential for generalization to broader embodied intelligence systems and safety-critical control domains.
中文标题/摘要
标题:SAGE-LLM:基于模糊-CBF验证和图结构知识检索的UAV决策安全通用控制器
在UAV动态决策中,复杂的多变的危险因素对算法的泛化能力提出了严峻挑战。尽管大型语言模型(LLM)能够提供语义理解和场景泛化,但它们缺乏特定领域的UAV控制知识和正式的安全保证,限制了它们的直接应用。为了解决这一问题,本文提出了一种基于LLM的无训练两层决策架构,结合了高层的安全规划和低层的精确控制。该框架提出了三个关键贡献:1)一种模糊控制屏障函数验证机制,为语义增强的动作提供可证明的安全认证。2)一种基于星形层次图的检索增强生成系统,实现高效、弹性且可解释的场景适应。3)在存在未知障碍和突发威胁的追逃场景中进行了系统的实验验证,证明了我们的SAGE-LLM在保持性能的同时,显著提高了安全性和泛化能力,无需在线训练。所提出的框架展示了强大的扩展性,表明其在更广泛的体态智能系统和安全关键控制领域中的应用潜力。
Summary / 总结
The paper addresses the challenge of integrating large language models (LLMs) into UAV decision-making by proposing SAGE-LLM, a training-free two-layer architecture that combines fuzzy Control Barrier Function verification with graph-structured knowledge retrieval. The verification step certifies the safety of semantically-augmented LLM actions, and the retrieval system supports efficient scene adaptation. In pursuit-evasion scenarios with unknown obstacles and emergent threats, SAGE-LLM maintains performance while significantly enhancing safety and generalization without online training.
本文提出了一种基于大型语言模型(LLM)的两层决策架构SAGE-LLM,以解决无人机决策中的安全和有效性问题。该架构引入了模糊控制障碍函数验证机制以确保安全性,以及基于星形层级图的检索增强生成系统以实现高效的场景适应。实验结果表明,SAGE-LLM在动态场景中保持了性能,同时显著增强了安全性和泛化能力,无需在线训练。
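To make the idea of verifying LLM-proposed actions with a Control Barrier Function more concrete, here is a minimal sketch of a plain (non-fuzzy) discrete-time CBF check. The abstract does not specify the paper's fuzzy formulation, so everything here is an assumption: the circular-obstacle barrier, the single-integrator motion model, the `gamma` decay rate, and all function and action names are hypothetical.

```python
import numpy as np

def barrier(pos, obstacle, radius):
    """h(x) >= 0 exactly when the UAV is outside the obstacle's safety radius."""
    return float(np.sum((pos - obstacle) ** 2) - radius ** 2)

def cbf_safe(pos, velocity, obstacle, radius, dt=0.1, gamma=0.5):
    """Discrete-time CBF condition: h(x_next) >= (1 - gamma) * h(x)."""
    nxt = pos + dt * velocity  # single-integrator model (assumption)
    return barrier(nxt, obstacle, radius) >= (1.0 - gamma) * barrier(pos, obstacle, radius)

def filter_actions(pos, candidates, obstacle, radius, **kwargs):
    """Return the names of candidate velocity commands that pass the CBF check."""
    return [
        name
        for name, v in candidates.items()
        if cbf_safe(pos, np.asarray(v, dtype=float), obstacle, radius, **kwargs)
    ]
```

In this sketch a high-level planner would propose named velocity commands, and only those that keep the barrier value from shrinking too quickly are passed on to the low-level controller. A fuzzy CBF would presumably replace the hard threshold with graded membership, which is not modeled here.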
BEV-VLM: Trajectory Planning via Unified BEV Abstraction
Authors: Guancheng Chen, Sheng Yang, Tong Zhan, Jian Wang
First: 2025-09-27T07:13:55+00:00 · Latest: 2026-02-27T06:27:46+00:00
Abstract
This paper introduces BEV-VLM, a novel approach for trajectory planning in autonomous driving that leverages Vision-Language Models (VLMs) with Bird's-Eye View (BEV) feature maps as visual input. Unlike conventional trajectory planning approaches that rely solely on raw visual data (e.g., camera images), our method utilizes a highly compressed and informative BEV representation generated by fusing camera and LiDAR data, with subsequent alignment to High-Definition (HD) maps. This unified BEV-HD map format provides a geometrically consistent and semantically rich scene description, which enables VLMs to perform accurate and robust trajectory planning. Experimental results on the nuScenes dataset demonstrate that, compared with state-of-the-art vision-only methods, our approach achieves a 53.1% improvement in planning accuracy and realizes complete collision avoidance in evaluation scenarios. Our work highlights that VLMs can effectively interpret processed visual representations such as BEV features, expanding their applicability beyond raw image inputs for the task of trajectory planning.
中文标题/摘要
标题:BEV-VLM:基于统一BEV抽象的轨迹规划
本文介绍了BEV-VLM,这是一种利用视觉语言模型(VLMs)和鸟瞰图(BEV)特征图作为视觉输入的自主驾驶轨迹规划新方法。与依赖于原始视觉数据(例如,相机图像)的传统轨迹规划方法不同,我们的方法利用了由融合相机和激光雷达数据生成的高度压缩且信息丰富的BEV表示,并将其与高精度(HD)地图对齐。这种统一的BEV-HD地图格式提供了几何上一致且语义丰富的场景描述,使VLMs能够进行准确且鲁棒的轨迹规划。在nuScenes数据集上的实验结果表明,与最先进的仅基于视觉的方法相比,我们的方法在规划准确性上提高了53.1%,并在评估场景中实现了完全的碰撞避免。我们的工作表明,VLMs可以有效地解释处理后的视觉表示,如BEV特征,从而扩展其在轨迹规划任务中的应用范围,而不仅仅是原始图像输入。
Summary / 总结
The paper proposes BEV-VLM, a trajectory planning method for autonomous driving that uses Vision-Language Models with Bird's-Eye View (BEV) feature maps. Unlike traditional methods relying on raw visual data, BEV-VLM integrates camera and LiDAR data to create a compressed and semantically rich BEV representation, aligned with HD maps. Experiments on the nuScenes dataset show a 53.1% improvement in planning accuracy and complete collision avoidance compared to state-of-the-art vision-only methods.
该论文提出了一种名为BEV-VLM的轨迹规划方法,利用Vision-Language模型和BEV特征图。与依赖原始视觉数据的传统方法不同,BEV-VLM将相机和LiDAR数据融合生成压缩且信息丰富的BEV表示,并与高精度地图对齐。这种方法在nuScenes数据集上的规划准确率提高了53.1%,并且实现了完全的碰撞避免,优于最先进的基于视觉的方法。
EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding
Authors: Shitong Sun, Ke Han, Yukai Huang, Weitong Cai, Jifei Song
First: 2026-02-27T06:20:58+00:00 · Latest: 2026-02-27T06:20:58+00:00
Comments: Under review
Abstract
Ultra-long egocentric videos spanning multiple days present significant challenges for video understanding. Existing approaches still rely on fragmented local processing and limited temporal modeling, restricting their ability to reason over such extended sequences. To address these limitations, we introduce EgoGraph, a training-free and dynamic knowledge-graph construction framework that explicitly encodes long-term, cross-entity dependencies in egocentric video streams. EgoGraph employs a novel egocentric schema that unifies the extraction and abstraction of core entities, such as people, objects, locations, and events, and structurally reasons about their attributes and interactions, yielding a significantly richer and more coherent semantic representation than traditional clip-based video models. Crucially, we develop a temporal relational modeling strategy that captures temporal dependencies across entities and accumulates stable long-term memory over multiple days, enabling complex temporal reasoning. Extensive experiments on the EgoLifeQA and EgoR1-bench benchmarks demonstrate that EgoGraph achieves state-of-the-art performance on long-term video question answering, validating its effectiveness as a new paradigm for ultra-long egocentric video understanding.
中文标题/摘要
标题:EgoGraph:用于第一人称视频理解的时序知识图谱
跨越多天的超长第一人称视频为视频理解带来了重大挑战。现有方法仍然依赖于片段化的局部处理和有限的时序建模,限制了它们在处理此类扩展序列时的能力。为解决这些限制,我们引入了EgoGraph,这是一种无需训练且动态的知识图构建框架,明确编码了第一人称视频流中的长期跨实体依赖关系。EgoGraph 使用一种新颖的第一人称模式,统一了核心实体(如人物、物体、地点和事件)的提取和抽象,并从结构上推理它们的属性和交互,从而比传统的基于片段的视频模型提供了更丰富和更连贯的语义表示。关键的是,我们开发了一种时序关系建模策略,捕捉实体之间的时序依赖关系,并在多天内累积稳定的长期记忆,从而实现复杂的时序推理。在EgoLifeQA和EgoR1-bench基准上的广泛实验表明,EgoGraph 在长期视频问答任务中达到了最先进的性能,验证了其作为超长第一人称视频理解新范式的有效性。
Summary / 总结
EgoGraph is a training-free framework that constructs a dynamic knowledge graph to address the challenges of understanding ultra-long egocentric videos spanning multiple days. It uses a novel egocentric schema to extract and abstract core entities and their interactions, providing a richer semantic representation compared to traditional clip-based models. Key findings show that EgoGraph outperforms existing methods on long-term video question answering benchmarks, demonstrating its effectiveness in handling extended video sequences.
EgoGraph 是一个无需训练的框架,通过构建动态知识图谱来解决超长时长的自我中心视频理解挑战。它使用新颖的自我中心模式来提取和抽象核心实体及其交互,提供比传统片段式视频模型更丰富的语义表示。关键发现表明,EgoGraph 在长期视频问答基准测试中优于现有方法,证明了其在处理扩展视频序列方面的有效性。
Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering
Authors: Ao Li, Rui Liu, Mingjie Li, Sheng Liu, Lei Wang, Xiaodan Liang, Lina Yao, Xiaojun Chang, Lei Xing
First: 2026-02-27T04:49:01+00:00 · Latest: 2026-02-27T04:49:01+00:00
Comments: 15 pages, 5 figures
Abstract
Automated radiology report generation using vision-language models (VLMs) is limited by the risk of prior-comparison hallucination, where the model generates historical findings unsupported by the current study. We address this challenge with a training-free, inference-time control framework termed Semantically Decoupled Latent Steering (SDLS). Unlike generic activation steering, which often suffers from semantic entanglement, our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition followed by $QR$-based orthogonalization. This orthogonalization step is critical. It leverages geometric constraints to filter out the clinical semantics often entangled in standard principal component analysis (PCA) directions, ensuring that the steering vector targets only the "historical comparison" axis. We validate our method on the BiomedGPT foundation model, demonstrating that it overcomes the trade-off between hallucination suppression and clinical accuracy. Extensive experiments on MIMIC-CXR and zero-shot transfer evaluation on CheXpert Plus and IU-Xray demonstrate the robustness of our approach. Quantitative evaluations on MIMIC-CXR show that our approach significantly reduces the probability of historical hallucinations (FilBERT score decreases from 0.2373 to 0.1889) and improves clinical label fidelity (CheXpert macro-F1 increases from 0.2242 to 0.3208). Supplementary evaluations confirm that the structural integrity of the clinical narrative is maintained.
中文标题/摘要
标题:通过语义解耦潜在引导抑制放射学报告生成中的先验比较幻觉
使用视觉-语言模型(VLMs)进行自动化放射学报告生成受到先验比较幻觉风险的限制,即模型生成未被当前研究支持的历史发现。我们提出了一种无需训练、在推理时控制的方法,称为语义解耦潜在引导(SDLS)。与通常存在语义纠缠的通用激活引导不同,我们的方法通过大型语言模型(LLM)驱动的语义分解,随后通过$QR$正交化构造了一个语义无关的干预向量。这一正交化步骤至关重要。它利用几何约束过滤掉标准主成分分析(PCA)方向中通常纠缠的临床语义,确保引导向量仅针对“历史比较”轴。我们在BiomedGPT基础模型上验证了该方法,证明了它克服了幻觉抑制与临床准确性的权衡。在MIMIC-CXR上的大量实验以及CheXpert Plus和IU-Xray上的零样本迁移评估表明,该方法具有鲁棒性。MIMIC-CXR上的定量评估显示,我们的方法显著降低了历史幻觉的概率(FilBERT得分从0.2373降至0.1889),并提高了临床标签的一致性(CheXpert宏F1从0.2242升至0.3208)。补充评估证实,临床叙述的结构完整性得以保持。
Summary / 总结
The paper addresses the issue of prior-comparison hallucinations in radiology report generation using vision-language models by proposing a training-free inference-time control framework called Semantically Decoupled Latent Steering (SDLS). SDLS uses a large language model for semantic decomposition and QR-based orthogonalization to create an intervention vector that targets the 'historical comparison' axis without semantic entanglement. Experiments on the BiomedGPT model and MIMIC-CXR dataset show that SDLS reduces the probability of historical hallucinations and improves clinical label fidelity compared to the baseline model.
本文旨在解决使用视觉语言模型进行自动化放射学报告生成时出现的先验比较幻觉问题。提出了一种名为Semantically Decoupled Latent Steering (SDLS)的无需训练的推理时控制框架,该框架通过大型语言模型驱动的语义分解和基于QR的正交化来构造一个语义无关的干预向量。在BiomedGPT基础模型以及MIMIC-CXR、CheXpert Plus和IU-Xray上的实验表明,SDLS能够有效减少幻觉并提高临床标签的一致性,同时不牺牲准确性。
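The QR-based orthogonalization step mentioned in the SDLS abstract can be read as projecting a candidate steering direction onto the orthogonal complement of a set of semantic directions. The NumPy sketch below shows that reading only. How the paper obtains the directions, which layer it intervenes on, and how it scales the vector are not stated in the abstract, so `decouple_steering_vector`, `steer`, and `alpha` are hypothetical names introduced for illustration.

```python
import numpy as np

def decouple_steering_vector(raw_direction, semantic_directions):
    """Remove the span of semantic directions from a raw steering direction.

    raw_direction:       (D,) candidate steering direction (e.g. a PCA direction).
    semantic_directions: (K, D) linearly independent directions assumed to
                         carry clinical semantics, with K <= D.
    """
    # Reduced QR of the (D, K) matrix yields an orthonormal basis Q of the semantic span.
    q, _ = np.linalg.qr(semantic_directions.T)
    # Subtract the projection onto that span, leaving only the orthogonal part.
    residual = raw_direction - q @ (q.T @ raw_direction)
    norm = np.linalg.norm(residual)
    if norm < 1e-12:
        raise ValueError("raw direction lies inside the semantic subspace")
    return residual / norm

def steer(hidden, steering_vector, alpha):
    """Shift hidden states along the steering vector; a negative alpha pushes away from it."""
    return hidden + alpha * steering_vector
```

Because the returned vector is orthogonal to every semantic direction, adding a multiple of it to a hidden state leaves the state's components along those directions unchanged, which is the geometric property the abstract attributes to the orthogonalization step.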