TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning
Authors: Maximilian von Klinski, Maximilian Schall
Venue: WACV 2026
First: 2026-03-04T18:45:35+00:00 · Latest: 2026-03-04T18:45:35+00:00
Comments: Accepted at WACV 2026
Abstract
Traditional vision-language models struggle with contrastive fine-grained taxonomic reasoning, particularly when distinguishing between visually similar species within the same genus or family. We introduce TaxonRL, a reinforcement learning approach using Group Relative Policy Optimization with intermediate rewards that decomposes the reasoning process into hierarchical taxonomic predictions. Our method incentivizes models to explicitly reason about species-level, genus-level, and family-level features before making final classifications. This structured approach is designed not only to boost accuracy but also to yield a transparent, verifiable decision-making process. On the challenging Birds-to-Words dataset, TaxonRL achieves 91.7% average accuracy, exceeding human performance (77.3%) while generating interpretable reasoning traces. We demonstrate strong cross-domain generalization, showing substantial gains in primate and marine species verification. Our results establish that enforcing structured, hierarchical reasoning provides a powerful and transferable framework for fine-grained visual discrimination.
中文标题/摘要
标题:TaxonRL:使用中间奖励的强化学习进行可解释的细粒度视觉推理
传统的视觉-语言模型在对比式细粒度分类学推理方面存在困难,尤其是在区分同一属或同一科中的视觉相似物种时。我们提出了TaxonRL,这是一种使用组相对策略优化的强化学习方法,并使用中间奖励将推理过程分解为分层分类学预测。我们的方法激励模型在最终分类之前明确地推理物种级、属级和科级特征。这种结构化方法不仅旨在提高准确性,还旨在产生透明且可验证的决策过程。在具有挑战性的 Birds-to-Words 数据集上,TaxonRL 达到了 91.7% 的平均准确率,超过了人类表现(77.3%),同时生成了可解释的推理轨迹。我们展示了强大的跨域泛化能力,在灵长类和海洋物种验证中取得了显著进步。我们的结果表明,强制执行结构化、分层推理为细粒度视觉区分提供了一个强大且可转移的框架。
Summary / 总结
TaxonRL is a reinforcement learning method that uses Group Relative Policy Optimization with intermediate rewards to improve fine-grained visual reasoning, especially for distinguishing visually similar species. It decomposes the reasoning process into hierarchical taxonomic predictions (family, genus, species), aiming at both higher accuracy and a transparent decision process. On the Birds-to-Words dataset, TaxonRL achieves 91.7% average accuracy, surpassing human performance (77.3%) while generating interpretable reasoning traces. It also generalizes across domains, with substantial gains in primate and marine species verification.
TaxonRL 采用强化学习并使用中间奖励来提升细粒度的分类推理能力,特别是在区分视觉相似的物种方面。它将推理过程分解为层次结构,提高准确率并提供透明的决策过程。在 Birds-to-Words 数据集上,TaxonRL 达到了 91.7% 的准确率,超过了人类的表现,并生成了可解释的推理痕迹。此外,它在不同物种领域中也表现出强大的泛化能力。
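The combination of intermediate rewards and group-relative optimization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the per-level weights, the exact-match reward, and the function names are assumptions made for illustration, since the abstract does not specify the reward design.

```python
from statistics import mean, pstdev

# Hypothetical per-level weights; the abstract does not state the actual values.
LEVEL_WEIGHTS = {"family": 0.2, "genus": 0.3, "species": 0.5}

def hierarchical_reward(predicted: dict, target: dict) -> float:
    """Sum the weights of the taxonomic levels that were predicted correctly."""
    return sum(
        weight
        for level, weight in LEVEL_WEIGHTS.items()
        if predicted.get(level) == target[level]
    )

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All samples scored the same, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

A response that gets the family and genus right but the species wrong still earns partial reward under this sketch, which is the sense in which intermediate rewards give credit for correct reasoning steps.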
FireANTs: Adaptive Riemannian Optimization for Multi-Scale Diffeomorphic Matching
Authors: Rohit Jena, Pratik Chaudhari, James C. Gee
First: 2024-04-01T17:12:47+00:00 · Latest: 2026-03-04T18:28:18+00:00
Comments: Accepted at Nature Communications
Abstract
The paper proposes FireANTs, a multi-scale Adaptive Riemannian Optimization algorithm for dense diffeomorphic image matching. Existing state-of-the-art methods for diffeomorphic image matching are slow, both because of inefficient implementations and because the ill-conditioned optimization problem converges slowly. Deep learning methods offer fast inference but require extensive training time and substantial inference memory, and they fail to generalize across long-tailed distributions or diverse image modalities, necessitating costly retraining. We address these challenges by proposing a training-free, GPU-accelerated multi-scale Adaptive Riemannian Optimization algorithm for fast and accurate dense diffeomorphic image matching. FireANTs runs about 2.5x faster than ANTs on a CPU, and up to 1200x faster on a GPU. On a single GPU, FireANTs performs competitively with deep learning methods on inference runtime while consuming up to 10x less memory. FireANTs shows remarkable robustness to a wide variety of matching problems across modalities, species, and organs without any domain-specific training or tuning. Our framework allows hyperparameter grid search studies with significantly fewer resources and less time than traditional and deep learning registration algorithms alike.
中文标题/摘要
标题:FireANTs:多尺度自适应黎曼优化的 diffeomorphic 图像配准
该论文提出了一种多尺度自适应黎曼优化算法 FireANTs,用于密集的 diffeomorphic 图像配准。现有的 diffeomorphic 图像配准的最先进方法由于实现效率低下和优化问题病态导致收敛速度慢。深度学习方法虽然可以提供快速推理,但需要大量的训练时间、大量的推理内存,并且无法很好地泛化到长尾分布或多种图像模态,需要昂贵的重新训练。我们通过提出一种无需训练、基于 GPU 加速的多尺度自适应黎曼优化算法来解决这些挑战,以实现快速准确的密集 diffeomorphic 图像配准。FireANTs 在 CPU 上比 ANTs 快约 2.5 倍,在 GPU 上最多快 1200 倍。在单个 GPU 上,FireANTs 在推理运行时间方面可与深度学习方法相竞争,同时消耗的内存最多可减少 10 倍。FireANTs 在多种模态、物种和器官的配准问题上表现出色,无需任何特定领域的训练或调整。我们的框架与传统和深度学习配准算法相比,可以显著减少网格搜索超参数研究所需的资源和时间。
Summary / 总结
FireANTs is a training-free, multi-scale Adaptive Riemannian Optimization algorithm for dense diffeomorphic image matching. It addresses the slow implementations and slow convergence of existing methods through GPU acceleration and adaptive optimization, running about 2.5x faster than ANTs on a CPU and up to 1200x faster on a GPU. On a single GPU, its inference runtime is competitive with deep learning methods while it uses up to 10x less memory. It is also robust across image modalities, species, and organs without domain-specific training or tuning.
论文提出了FireANTs,一种多尺度自适应黎曼优化算法,用于高效密集的 diffeomorphic 图像配准。该算法解决了现有方法的局限性,通过更快的收敛和减少计算需求来实现高效配准。FireANTs 在单个 GPU 上的处理速度比 ANTs 在 CPU 上快多达 1200 倍,并且比深度学习方法在 GPU 上使用多达 10 倍少的内存。它在各种模态和器官上表现出色,无需特定领域的训练或调整。
Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models
Authors: Niamul Hassan Samin, Md Arifur Rahman, Abdullah Ibne Hanif Arean, Juena Ahmed Noshin, Md Ashikur Rahman
First: 2026-02-25T23:08:31+00:00 · Latest: 2026-03-04T18:21:09+00:00
Abstract
Vision-Language Models (VLMs) often hallucinate objects that are not present in the input image. We identify a contributing cause of this behavior, which we term spatial credit collapse: in early transformer layers, hidden-state activation concentrates on a small number of visual patches, suppressing surrounding contextual evidence and increasing reliance on language priors. Across seven models we observe a strong correlation between visual attention entropy and hallucination rate (r = -0.65, p < 0.001), suggesting that reduced spatial credit diversity contributes to hallucination.
To address this issue we propose Spatial Credit Redistribution (SCR), a training-free inference-time method. SCR uses a lightweight two-pass procedure. A diagnostic pass identifies the top-K high-attention source patches and their spatial neighbors. A redistribution pass then scales each source by 1/lambda (~0.91) and injects a (lambda - 1) weighted copy of its hidden state into neighboring patches, restoring suppressed visual context without modifying model weights. Because the diagnostic pass is performed once per image and reused across the output sequence, the added latency is negligible (<0.5 ms per token for 100-token responses).
We evaluate SCR across seven model configurations from four VLM families (Chameleon, LLaVA-1.5, Qwen-VL/Qwen2-VL, and InternVL2) on five benchmarks: POPE, CHAIR, MME, HallusionBench, and AMBER. SCR reduces POPE-Adversarial hallucination by 4.6-6.0 percentage points and CHAIR-s by 41-51 percent while preserving caption quality (CIDEr drop <=0.8). Compared with prior inference-time methods including OPERA, VCD, OA-VCD, DoLa, VLI, SID, and CRoPS, SCR achieves a better trade-off between hallucination reduction, generation quality, and latency.
中文标题/摘要
标题:超越主导斑块:空间信用重分配以实现基于视觉-语言模型的扎根
视觉-语言模型(VLMs)经常虚构输入图像中不存在的对象。我们识别出这种行为的一个促成因素,称之为空间信用崩溃:在早期的变压器层中,隐藏状态激活集中在少量的视觉斑块上,抑制了周围的上下文证据,增加了对语言先验的依赖。在七个模型中,我们观察到视觉注意力熵和虚构率之间存在很强的相关性(r = -0.65,p < 0.001),表明空间信用多样性减少会促进虚构。
为解决这一问题,我们提出了空间信用重分配(SCR),这是一种无需训练的推理时方法。SCR 使用一种轻量级的两步程序。诊断步骤识别出高注意力的前K个源斑块及其空间邻居。然后,重分配步骤将每个源斑块的权重调整为1/λ(~0.91),并将(λ - 1)加权的隐藏状态副本注入到相邻斑块中,恢复被抑制的视觉上下文,而不修改模型权重。由于诊断步骤在每张图像上只执行一次,并在整个输出序列中重复使用,因此增加的延迟可以忽略不计(对于100个词的响应,每词<0.5毫秒)。
我们在四种VLM家族(Chameleon、LLaVA-1.5、Qwen-VL/Qwen2-VL、InternVL2)的七个模型配置上,对五个基准(POPE、CHAIR、MME、HallusionBench、AMBER)进行了评估。SCR将POPE-对抗虚构减少了4.6-6.0个百分点,将CHAIR-s降低了41-51%,同时保持了描述质量(CIDEr下降<=0.8)。与先前的推理时方法,包括OPERA、VCD、OA-VCD、DoLa、VLI、SID和CRoPS相比,SCR在减少虚构、生成质量和延迟之间实现了更好的权衡。
Summary / 总结
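The paper addresses hallucination in Vision-Language Models (VLMs) by identifying spatial credit collapse, the concentration of early-layer hidden-state activation on a few visual patches, as a contributing factor. It proposes Spatial Credit Redistribution (SCR), a training-free method applied at inference time. SCR uses a two-pass procedure: a diagnostic pass finds the top-K high-attention source patches and their neighbors, and a redistribution pass moves part of each source's hidden state into those neighbors, restoring suppressed visual context without altering model weights. Across seven VLM configurations and five benchmarks, SCR reduces POPE-Adversarial hallucination by 4.6-6.0 percentage points and CHAIR-s by 41-51 percent while keeping the CIDEr drop at or below 0.8, a better trade-off between hallucination reduction, generation quality, and latency than prior inference-time methods.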
论文通过识别空间信用坍塌作为幻觉问题的一个原因,提出了空间信用重分布(SCR)方法,这是一种在推理时无需训练的方法。SCR 使用两步程序来识别并重新分配注意力,从而恢复被抑制的视觉上下文。在七个 VLM 配置和五个基准测试上的评估表明,SCR 可以降低幻觉率,同时保持描述质量,并在幻觉减少、生成质量和延迟之间取得更好的权衡,优于其他方法。
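The redistribution pass described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the 4-connected neighborhood, the choice to give every neighbor the full (lambda - 1)-weighted copy rather than splitting it, and all function names are hypothetical. The value `lam=1.1` is inferred from the stated scale of 1/lambda (about 0.91).

```python
import numpy as np

def top_k_sources(attention: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k patches with the highest attention mass."""
    return np.argsort(attention)[::-1][:k]

def grid_neighbors(idx: int, height: int, width: int) -> list[int]:
    """4-connected neighbors on a height x width patch grid (an assumption)."""
    row, col = divmod(idx, width)
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [r * width + c for r, c in candidates if 0 <= r < height and 0 <= c < width]

def redistribute(hidden, attention, height, width, k=1, lam=1.1):
    """Scale each source patch by 1/lam and add a (lam - 1) copy to its neighbors."""
    sources = top_k_sources(attention, k)
    out = hidden.astype(float)  # astype returns a copy, so the input is untouched
    # First damp every source, then inject, so the result is order-independent.
    for s in sources:
        out[s] = hidden[s] / lam
    for s in sources:
        for n in grid_neighbors(int(s), height, width):
            out[n] += (lam - 1.0) * hidden[s]
    return out
```

Because the source indices depend only on the image, they could be computed once in the diagnostic pass and reused for every generated token, which is consistent with the low per-token overhead the abstract reports.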
FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering
Authors: Tatiana Zemskova, Solomon Andryushenko, Ilya Obrubov, Viktoriia Khoruzhaia, Ekaterina Eroshenko, Ekaterina Derevyanka, Dmitry Yudin
First: 2026-03-04T18:14:00+00:00 · Latest: 2026-03-04T18:14:00+00:00
Abstract
The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, organize, and leverage long-horizon perceptual memories. Recently, multimodal LLMs have been gaining popularity for solving the long video understanding task due to their general ability to understand natural language and to leverage world knowledge. However, as the number of frames provided to an MLLM increases, the quality of its responses tends to degrade, and inference time grows. Therefore, when using MLLMs for long video understanding, a crucial step is selecting key frames from the video to answer user queries.
In this work, we develop FocusGraph, a framework for keyframe selection for question answering over long egocentric videos. It leverages a lightweight trainable Scene-Caption LLM Selector that selects query-relevant clips based on their graph-based captions, and a training-free method for selecting keyframes from these clips. Unlike existing methods, the proposed Scene-Caption LLM Selector does not rely on the original sequence of low-resolution frames; instead, it operates on a compact textual representation of the scene. We then design a training-free Patch-wise Sparse-Flow Retention (PSFR) method to select keyframes from the resulting sequence of clips, which are fed into an MLLM to produce the final answer. Together, these components enable FocusGraph to achieve state-of-the-art results on challenging egocentric long-video question answering benchmarks, including FindingDory and HourVideo, while significantly reducing inference time relative to baseline approaches.
中文标题/摘要
标题:FocusGraph:基于图结构框架的选择性帧提取用于体感长视频问答
理解长视频的能力对于体感智能代理至关重要,因为它们的效果取决于能否有效地积累、组织和利用长期感知记忆。最近,由于其理解和利用世界知识的通用能力,多模态LLM因解决长视频理解任务而受到越来越多的关注。然而,随着提供给MLLM的帧数量增加,其响应质量往往会下降,推理时间也会增长。因此,在使用MLLM进行长视频理解时,一个关键步骤是从视频中选择关键帧以回答用户查询。
在这项工作中,我们开发了FocusGraph,这是一种用于长第一人称视角视频问答的关键帧选择框架。它利用一种轻量级可训练的场景-描述LLM选择器,该选择器基于图基描述选择与查询相关的片段,并且使用一种无需训练的方法从这些片段中选择关键帧。与现有方法不同,提出的场景-描述LLM选择器不依赖于原始的低分辨率帧序列,而是操作于场景的紧凑文本表示。然后,我们设计了一种无需训练的块级稀疏流保留(PSFR)方法,从生成的片段序列中选择关键帧,这些片段被输入到MLLM以生成最终答案。这些组件共同使FocusGraph在具有挑战性的第一人称视角长视频问答基准测试(包括FindingDory和HourVideo)中达到了最先进的性能,同时显著减少了相对于基线方法的推理时间。
Summary / 总结
FocusGraph is a framework for keyframe selection in long egocentric videos for question answering. It uses a lightweight Scene-Caption LLM Selector to select query-relevant clips based on graph-based captions, and a training-free PSFR method to select keyframes from these clips. This approach enables FocusGraph to achieve state-of-the-art results on benchmarks like FindingDory and HourVideo while reducing inference time compared to baselines.
FocusGraph 是一种用于长视频问答的关键帧选择框架,使用轻量级的 Scene-Caption LLM 选择器在场景的文本表示上操作以选择与查询相关的片段,然后使用 PSFR 方法从这些片段中选择关键帧。该方法在 egocentric 长视频问答基准测试中取得了最先进的结果,同时相比基线方法显著减少了推理时间。
Cognition Envelopes for Bounded Decision Making in Autonomous UAS Operations
Authors: Pedro Antonio Alarcon Granadeno, Arturo Miguel Bernal Russell, Sofia Nelson, Demetrius Hernandez, Maureen Petterson, Michael Murphy, Walter J. Scheirer, Jane Cleland-Huang
First: 2025-10-30T18:11:32+00:00 · Latest: 2026-03-04T18:07:51+00:00
Comments: 12 pages, 9 figures
Abstract
Cyber-physical systems increasingly rely on foundation models, such as Large Language Models (LLMs) and Vision-Language Models (VLMs), to increase autonomy through enhanced perception, inference, and planning. However, these models also introduce new types of errors, such as hallucinations, over-generalizations, and context misalignments, resulting in flawed decisions. To address this, we introduce the concept of Cognition Envelopes, designed to establish reasoning boundaries that constrain AI-generated decisions while complementing the use of meta-cognition and traditional safety envelopes. As with safety envelopes, Cognition Envelopes require practical guidelines and systematic processes for their definition, validation, and assurance. In this paper we describe an LLM/VLM-supported pipeline for dynamic clue analysis within the domain of small autonomous Uncrewed Aerial Systems deployed on Search and Rescue (SAR) missions, and a Cognition Envelope based on probabilistic reasoning and resource analysis. We evaluate the approach by assessing decisions made by our Clue Analysis Pipeline in a series of SAR missions. Finally, we identify key software engineering challenges for systematically designing, implementing, and validating Cognition Envelopes for AI-supported decisions in cyber-physical systems.
中文标题/摘要
标题:自主无人航空系统受限决策认知包
网络物理系统越来越多地依赖大型语言模型(LLMs)和视觉语言模型(VLMs)等基础模型,通过增强感知、推理和规划来提高自主性。然而,这些模型也会引入新的错误类型,如幻觉、过度概括和上下文错位,导致错误和有缺陷的决策。为了解决这一问题,我们提出了认知包的概念,旨在通过限制AI生成的决策来建立推理边界,同时补充元认知和传统安全包的使用。与安全包类似,认知包需要实用的指南和系统的过程来定义、验证和保证。在本文中,我们描述了一个由LLM/VLM支持的动态线索分析管道,用于小型自主无人航空系统在搜索和救援(SAR)任务中的领域,并基于概率推理和资源分析构建了认知包。我们通过评估在一系列SAR任务中由我们的线索分析管道做出的决策来评估该方法。最后,我们确定了系统设计、实现和验证支持AI决策的网络物理系统中认知包的关键软件工程挑战。
Summary / 总结
This paper introduces Cognition Envelopes, reasoning boundaries that constrain decisions generated by LLMs and VLMs, and applies them to small Uncrewed Aerial Systems (UAS) on Search and Rescue (SAR) missions. The method combines an LLM/VLM-supported pipeline for dynamic clue analysis with a Cognition Envelope based on probabilistic reasoning and resource analysis. The approach is evaluated by assessing the decisions the Clue Analysis Pipeline makes across a series of SAR missions, and the paper identifies key software engineering challenges in systematically designing, implementing, and validating Cognition Envelopes.
本文提出了认知包络(Cognition Envelopes),用于约束小型无人航空系统(UAS)在搜索和救援(SAR)任务中由大型语言模型(LLMs)和视觉语言模型(VLMs)生成的决策。方法包括由LLM/VLM支持的动态线索分析管道,以及基于概率推理和资源分析的认知包络。作者通过评估线索分析管道在一系列SAR任务中做出的决策来检验该方法,并指出了系统地设计、实现和验证认知包络所面临的关键软件工程挑战。
Scalable Evaluation of the Realism of Synthetic Environmental Augmentations in Images
Authors: Damian J. Ruck, Paul Vautravers, Oliver Chalkley, Jake Thomas
First: 2026-03-04T17:46:08+00:00 · Latest: 2026-03-04T17:46:08+00:00
Abstract
Evaluation of AI systems often requires synthetic test cases, particularly for rare or safety-critical conditions that are difficult to observe in operational data. Generative AI offers a promising approach for producing such data through controllable image editing, but its usefulness depends on whether the resulting images are sufficiently realistic to support meaningful evaluation.
We present a scalable framework for assessing the realism of synthetic image-editing methods and apply it to the task of adding environmental conditions (fog, rain, snow, and nighttime) to car-mounted camera images. Using 40 clear-day images, we compare rule-based augmentation libraries with generative AI image-editing models. Realism is evaluated using two complementary automated metrics: a vision-language model (VLM) jury for perceptual realism assessment, and embedding-based distributional analysis to measure similarity to genuine adverse-condition imagery.
Generative AI methods substantially outperform rule-based approaches, with the best generative method achieving approximately 3.6 times the acceptance rate of the best rule-based method. Performance varies across conditions: fog proves easiest to simulate, while nighttime transformations remain challenging. Notably, the VLM jury assigns imperfect acceptance even to real adverse-condition imagery, establishing practical ceilings against which synthetic methods can be judged. By this standard, leading generative methods match or exceed real-image performance for most conditions.
These results suggest that modern generative image-editing models can enable scalable generation of realistic adverse-condition imagery for evaluation pipelines. Our framework therefore provides a practical approach for scalable realism evaluation, though validation against human studies remains an important direction for future work.
中文标题/摘要
标题:合成环境增强在图像中的现实性可扩展评估
AI系统的评估通常需要合成测试案例,特别是对于在操作数据中难以观察的罕见或安全关键条件。生成式AI通过可控的图像编辑提供了一种有前景的数据生成方法,但其有用性取决于生成的图像是否足够现实,以支持有意义的评估。
我们提出了一种可扩展的框架来评估合成图像编辑方法的现实性,并将其应用于向汽车安装的相机图像中添加环境条件(雾、雨、雪和夜间)的任务。使用40张晴天图像,我们将基于规则的增强库与生成式AI图像编辑模型进行了比较。现实性通过两种互补的自动化度量标准进行评估:视觉语言模型(VLM)陪审团进行感知现实性评估,以及基于嵌入的分布分析来衡量与真实恶劣条件图像的相似性。
生成式AI方法显著优于基于规则的方法,最佳生成式方法的接受率大约是最佳基于规则方法的3.6倍。性能在不同条件下有所不同:雾是最容易模拟的,而夜间变换仍然具有挑战性。值得注意的是,VLM陪审团即使对真实恶劣条件的图像也赋予了不完美的接受度,这为合成方法设定了实际的上限。按照这一标准,领先的生成式方法在大多数条件下与真实图像的性能相当或超过。
这些结果表明,现代生成式图像编辑模型可以实现恶劣条件图像的可扩展生成,用于评估管道。因此,我们的框架提供了一种实用的方法来进行可扩展的现实性评估,尽管未来工作仍需通过人类研究进行验证。
Summary / 总结
This study evaluates the realism of synthetic environmental augmentations in images using a scalable framework. It compares rule-based augmentation libraries with generative AI models for adding fog, rain, snow, and nighttime conditions to car-mounted camera images. The generative AI methods outperform rule-based approaches, with the best generative method achieving about 3.6 times the acceptance rate of the best rule-based method. Fog is easiest to simulate, while nighttime transformations remain challenging. The VLM jury assigns imperfect acceptance even to real adverse-condition imagery, which sets a practical ceiling for synthetic methods; judged against that ceiling, leading generative methods match or exceed real-image performance for most conditions.
研究旨在评估合成环境增强在图像中的真实性,这对于测试AI系统在罕见或安全关键条件下的表现至关重要。研究使用了一个可扩展的框架,比较了基于规则的增强库与生成AI模型,重点关注在汽车摄像头图像中添加雾、雨、雪和夜间条件。结果显示,生成AI方法优于基于规则的方法,最佳生成方法的接受率约为最佳基于规则方法的3.6倍。研究发现,虽然雾最容易模拟,但夜间变换仍然具有挑战性。值得注意的是,视觉语言模型评委对真实不良条件图像的接受度也并不完美,这表明合成方法的实际限制。总体而言,现代生成图像编辑模型可以生成真实的不良条件图像,支持可扩展的评估管道,尽管还需要进一步的人类验证。
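The acceptance-rate comparison against a real-image ceiling can be made concrete with a short sketch. The abstract does not say how the VLM jury aggregates its members' verdicts, so the strict-majority rule and the function names below are assumptions for illustration only.

```python
def jury_accepts(votes: list[bool]) -> bool:
    """Accept an image when a strict majority of jurors judge it realistic (assumed rule)."""
    return sum(votes) * 2 > len(votes)

def acceptance_rate(per_image_votes: list[list[bool]]) -> float:
    """Fraction of images that the jury accepts."""
    accepted = sum(jury_accepts(votes) for votes in per_image_votes)
    return accepted / len(per_image_votes)

def relative_to_ceiling(method_rate: float, real_rate: float) -> float:
    """Express a method's acceptance rate as a fraction of the real-image rate."""
    return method_rate / real_rate
```

Under this sketch, a ratio of 1.0 or above would correspond to a synthetic method matching or exceeding the acceptance the jury gives to genuine adverse-condition images.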
EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
Authors: Seungjun Lee, Zihan Wang, Yunsong Wang, Gim Hee Lee
Venue: CVPR 2026
First: 2026-03-04T16:40:41+00:00 · Latest: 2026-03-04T16:40:41+00:00
Comments: CVPR 2026, Project Page: https://0nandon.github.io/EmbodiedSplat/
Abstract
Understanding a 3D scene as it is being explored is essential for embodied tasks, where an agent must construct and comprehend the scene online and in near real time. In this study, we propose EmbodiedSplat, an online feed-forward 3DGS for open-vocabulary scene understanding that enables simultaneous online 3D reconstruction and 3D semantic understanding from streaming images. Unlike existing open-vocabulary 3DGS methods, which are typically restricted to either offline or per-scene optimization settings, our objectives are two-fold: 1) reconstruct the semantic-embedded 3DGS of the entire scene from over 300 streaming images in an online manner; 2) generalize well to novel scenes through a feed-forward design and support nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficients Field with a CLIP Global Codebook, which binds the 2D CLIP embeddings to each 3D Gaussian while minimizing memory consumption and preserving the full semantic generalizability of CLIP. Furthermore, we generate 3D geometric-aware CLIP features by aggregating the partial point cloud of the 3DGS through a 3D U-Net, which adds a 3D geometric prior to the 2D-oriented language embeddings. Extensive experiments on diverse indoor datasets, including ScanNet, ScanNet++, and Replica, demonstrate both the effectiveness and efficiency of our method. Check out our project page at https://0nandon.github.io/EmbodiedSplat/.
中文标题/摘要
标题:EmbodiedSplat:在线前馈语义3DGS用于开放词汇3D场景理解
在探索过程中立即理解3D场景对于嵌入式任务至关重要,其中代理必须在线且几乎实时地构建和理解3D场景。在本研究中,我们提出了一种名为EmbodiedSplat的在线前馈3DGS,用于开放词汇场景理解,能够同时从流式图像中进行在线3D重建和3D语义理解。与通常局限于离线或单场景优化设置的现有开放词汇3DGS方法不同,我们的目标有两个:1)以在线方式从超过300张流式图像中重建整个场景的语义嵌入3DGS。2)通过前馈设计高度通用,结合实时2D模型时支持几乎实时的3D语义重建。为了实现这些目标,我们提出了一种在线稀疏系数场,其中使用CLIP全局码本将2D CLIP嵌入与每个3D高斯绑定,同时减少内存消耗并保留CLIP的完整语义通用性。此外,我们通过3D U-Net聚合3DGS的部分点云生成3D几何感知CLIP特征,以补偿2D导向的语言嵌入的3D几何先验。在包括ScanNet、ScanNet++和Replica在内的多种室内数据集上的广泛实验表明,我们的方法既有效又高效。请访问我们的项目页面:https://0nandon.github.io/EmbodiedSplat/
Summary / 总结
EmbodiedSplat is an online feed-forward semantic 3DGS (3D Gaussian Splatting) method for open-vocabulary 3D scene understanding in embodied tasks. It reconstructs a semantic-embedded 3DGS of the whole scene from streaming images and supports nearly real-time 3D semantic reconstruction when combined with real-time 2D models. The method uses an Online Sparse Coefficients Field with a CLIP Global Codebook to bind 2D CLIP embeddings to each 3D Gaussian, reducing memory consumption while preserving CLIP's semantic generalizability. Experiments on ScanNet, ScanNet++, and Replica show the effectiveness and efficiency of EmbodiedSplat.
EmbodiedSplat 是一种在线前馈语义 3DGS(3D 高斯泼溅)方法,用于具身任务中的开放词汇 3D 场景理解。该方法从流式图像中重建嵌入语义的整个场景 3DGS,并在结合实时 2D 模型时支持接近实时的 3D 语义重建。该方法使用在线稀疏系数场和 CLIP 全局码本,将 2D CLIP 嵌入绑定到每个 3D 高斯上,在降低内存消耗的同时保留 CLIP 的语义泛化能力。在 ScanNet、ScanNet++ 和 Replica 等室内数据集上的实验表明了该方法的有效性和效率。
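The idea of binding an embedding to each Gaussian through sparse coefficients over a shared codebook can be illustrated with a small sketch. This is a generic sparse-coding illustration, not the paper's Online Sparse Coefficients Field: the top-k selection, the softmax weighting, and the function names are all assumptions. The memory argument is that each Gaussian stores only k indices and k weights instead of a full embedding vector.

```python
import numpy as np

def encode_sparse(feature: np.ndarray, codebook: np.ndarray, k: int):
    """Keep the k codebook rows most similar to the feature, with softmax weights."""
    sims = codebook @ feature
    idx = np.argsort(sims)[::-1][:k]
    weights = np.exp(sims[idx] - sims[idx].max())  # shift for numerical stability
    weights /= weights.sum()
    return idx, weights

def decode_sparse(idx: np.ndarray, weights: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Rebuild a unit-norm embedding as a weighted sum of the selected codebook rows."""
    feature = weights @ codebook[idx]
    return feature / np.linalg.norm(feature)
```

Because the decoded vector lives in the same space as the codebook entries, it could still be compared with arbitrary text embeddings, which is one way to read the claim that CLIP's generalizability is preserved.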
Real5-OmniDocBench: A Full-Scale Physical Reconstruction Benchmark for Robust Document Parsing in the Wild
Authors: Changda Zhou, Ziyue Gao, Xueqing Wang, Tingquan Gao, Cheng Cui, Jing Tang, Yi Liu
First: 2026-03-04T15:49:06+00:00 · Latest: 2026-03-04T15:49:06+00:00
Abstract
While Vision-Language Models (VLMs) achieve near-perfect scores on digital document benchmarks like OmniDocBench, their performance in the unpredictable physical world remains largely unknown due to the lack of controlled yet realistic evaluations. We introduce Real5-OmniDocBench, the first benchmark that performs a full-scale, one-to-one physical reconstruction of the entire OmniDocBench v1.5 (1,355 images) across five critical real-world scenarios: Scanning, Warping, Screen-Photography, Illumination, and Skew. Unlike prior benchmarks that either lack digital correspondence or employ partial sampling, our complete ground-truth mapping enables, for the first time, rigorous factor-wise attribution of performance degradation, allowing us to pinpoint whether failures stem from geometric distortions, optical artifacts, or model limitations. Our benchmark establishes a challenging new standard for the community, demonstrating that the 'reality gap' in document parsing is far from closed, and provides a diagnostic tool to guide the development of truly resilient document intelligence.
中文标题/摘要
标题:Real5-OmniDocBench:面向野外鲁棒文档解析的全尺度物理重建基准
尽管视觉-语言模型(VLMs)在数字文档基准测试如OmniDocBench上取得了近乎完美的成绩,但在不可预测的物理世界中的表现仍然未知,因为缺乏受控且现实的评估。我们引入了Real5-OmniDocBench,这是首个对整个OmniDocBench v1.5(1,355张图像)进行全尺度、一对一物理重建的基准,跨越五个关键的现实世界场景:扫描、扭曲、屏幕摄影、照明和倾斜。与之前的基准相比,我们的基准要么缺乏数字对应,要么采用部分采样,而我们完整的地面真实映射首次使我们能够按因子严格归因性能下降,从而确定失败是源自几何失真、光学伪影还是模型限制。我们的基准为社区设定了一个具有挑战性的新标准,表明文档解析中的‘现实差距’远未关闭,并提供了一种诊断工具,以指导真正鲁棒的文档智能的发展。
Summary / 总结
Real5-OmniDocBench is a benchmark that reconstructs 1,355 images from the OmniDocBench v1.5 across five real-world scenarios to evaluate the performance of Vision-Language Models in the physical world. Unlike previous benchmarks, it provides a complete ground-truth mapping, allowing for the first time a detailed analysis of performance degradation due to geometric distortions, optical artifacts, or model limitations. This benchmark highlights the significant 'reality gap' in document parsing and serves as a diagnostic tool for improving robustness.
研究引入了Real5-OmniDocBench,该基准在五个真实世界场景中重新创建了来自OmniDocBench v1.5的1,355张图像,以评估VLMs的鲁棒性。这种全面的物理重建首次允许对性能退化因素,如几何失真和光学伪影,进行详细分析,而不是依赖部分采样或数字对应关系。研究突显了文档解析中的‘现实差距’,并提供了一个诊断工具,以指导提高模型在物理世界中的鲁棒性。
PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters
Authors: Yinghong Yu, Guangyuan Li, Jiancheng Yang
First: 2026-03-04T15:23:30+00:00 · Latest: 2026-03-04T15:23:30+00:00
Comments: Code is available at https://github.com/HINTLab/PlaneCycle
Abstract
Large-scale 2D foundation models exhibit strong transferable representations, yet extending them to 3D volumetric data typically requires retraining, adapters, or architectural redesign. We introduce PlaneCycle, a training-free, adapter-free operator for architecture-agnostic 2D-to-3D lifting of foundation models. PlaneCycle reuses the original pretrained 2D backbone by cyclically distributing spatial aggregation across orthogonal HW, DW, and DH planes throughout network depth, enabling progressive 3D fusion while preserving pretrained inductive biases. The method introduces no additional parameters and is applicable to arbitrary 2D networks. Using pretrained DINOv3 models, we evaluate PlaneCycle on six 3D classification and three 3D segmentation benchmarks. Without any training, the lifted models exhibit intrinsic 3D fusion capability and, under linear probing, outperform slice-wise 2D baselines and strong 3D counterparts, approaching the performance of fully trained models. With full fine-tuning, PlaneCycle matches standard 3D architectures, highlighting its potential as a seamless and practical 2D-to-3D lifting operator. These results demonstrate that 3D capability can be unlocked from pretrained 2D foundation models without structural modification or retraining. Code is available at https://github.com/HINTLab/PlaneCycle.
中文标题/摘要
标题:PlaneCycle:无需训练的2D到3D基础模型提升操作
大规模的2D基础模型表现出强大的可迁移表示,但将其扩展到3D体数据通常需要重新训练、适配器或架构重设计。我们引入了PlaneCycle,这是一种无需训练、无需适配器的操作符,用于基础模型的2D到3D提升,适用于任意架构。PlaneCycle 通过在网络深度中周期性地在正交的HW、DW和DH平面之间分配空间聚合,重用了原始预训练的2D主干,从而实现渐进的3D融合,同时保留预训练的归纳偏差。该方法不引入额外参数,并适用于任意2D网络。使用预训练的DINOv3模型,我们在六个3D分类和三个3D分割基准上评估了PlaneCycle。在无需训练的情况下,提升后的模型展示了内在的3D融合能力,并在线性探针下优于切片式的2D基线和强大的3D对应物,接近完全训练模型的性能。在完全微调后,PlaneCycle 达到了标准3D架构的性能,突显了其作为无缝且实用的2D到3D提升操作符的潜力。这些结果表明,3D能力可以从预训练的2D基础模型中解锁,无需结构修改或重新训练。代码可在 https://github.com/HINTLab/PlaneCycle 获取。
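The cyclic plane scheduling described in the abstract can be sketched with plain array reshaping. This is an illustration under assumptions, not the released PlaneCycle code: it assumes one plane per layer in the order HW, DW, DH, treats the remaining axis as a batch dimension, and uses a toy per-slice mean as a stand-in for a pretrained 2D layer.

```python
import numpy as np

PLANES = ("HW", "DW", "DH")  # plane order as listed in the abstract

def to_slices(vol: np.ndarray, plane: str) -> np.ndarray:
    """View a (D, H, W, C) volume as a batch of 2D slices lying in `plane`."""
    if plane == "HW":
        return vol                        # batch over D
    if plane == "DW":
        return vol.transpose(1, 0, 2, 3)  # batch over H -> (H, D, W, C)
    if plane == "DH":
        return vol.transpose(2, 0, 1, 3)  # batch over W -> (W, D, H, C)
    raise ValueError(plane)

def from_slices(slices: np.ndarray, plane: str) -> np.ndarray:
    """Inverse of to_slices: restore the (D, H, W, C) layout."""
    if plane == "HW":
        return slices
    if plane == "DW":
        return slices.transpose(1, 0, 2, 3)
    if plane == "DH":
        return slices.transpose(1, 2, 0, 3)
    raise ValueError(plane)

def lift(vol: np.ndarray, layers) -> np.ndarray:
    """Apply each 2D layer on the next plane in the cycle."""
    for i, layer in enumerate(layers):
        plane = PLANES[i % len(PLANES)]
        vol = from_slices(layer(to_slices(vol, plane)), plane)
    return vol

def slice_mean(slices: np.ndarray) -> np.ndarray:
    """Toy 2D layer: replace every slice by its spatial mean."""
    means = slices.mean(axis=(1, 2), keepdims=True)
    return np.broadcast_to(means, slices.shape).copy()
```

With the toy layer, one pass mixes information only within each HW slice, while cycling through further planes spreads it across the whole volume. No parameters are added, since the same 2D layers are reused on reshaped inputs.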
Real Eyes Realize Faster: Gaze Stability and Pupil Novelty for Efficient Egocentric Learning
Authors: Ajan Subramanian, Sumukh Bettadapura, Rohan Sathish
First: 2026-03-04T14:10:27+00:00 · Latest: 2026-03-04T14:10:27+00:00
Comments: 14 pages, 4 figures, 3 tables, plus supplementary material
Abstract
Always-on egocentric cameras are increasingly used to record demonstrations for embodied robotics, imitation learning, and assistive AR, but the resulting video streams are dominated by redundant and low-quality frames. Under the storage and battery constraints of wearable devices, choosing which frames to keep is as important as how to learn from them. We observe that modern eye-tracking headsets provide a continuous, training-free side channel that decomposes into two complementary axes: gaze fixation captures visual stability (quality), while pupil response captures arousal-linked moments (novelty). We operationalize this insight as a Dual-Criterion Frame Curator that first gates frames by gaze quality and then ranks the survivors by pupil-derived novelty. On the Visual Experience Dataset (VEDB), curated frames at a 10% budget match the classification performance of the full stream, and naive signal fusion consistently destroys both contributions. The benefit is task-dependent: pupil ranking improves activity recognition, while gaze-only selection already dominates for scene recognition, confirming that the two signals serve genuinely different roles. Our method requires no model inference and operates at capture time, offering a path toward efficient, always-on egocentric data curation.
中文标题/摘要
标题:真实视线更快实现:注视稳定性与瞳孔新颖性促进自我中心学习
始终开启的自我中心摄像头越来越多地被用作体现型机器人、模仿学习和辅助AR的演示,但由此产生的视频流主要由冗余和低质量的帧组成。在可穿戴设备的存储和电池限制下,选择保留哪些帧与如何从中学习一样重要。我们观察到,现代眼动追踪头戴设备提供了一个连续的、无需训练的侧通道,可以分解为两个互补的轴:注视固定捕捉视觉稳定性(质量),而瞳孔反应捕捉与唤醒相关的时刻(新颖性)。我们将这一洞察力操作化为一种双重标准帧管理器,首先通过注视质量筛选帧,然后按由瞳孔衍生的新颖性对幸存者进行排名。在视觉体验数据集(VEDB)上,以10%的预算筛选出的帧与完整流的分类性能相当,而简单的信号融合会破坏这两种贡献。这种好处取决于任务:瞳孔排名改善了活动识别,而仅注视选择已经对场景识别占主导地位,这证实了这两种信号确实发挥着不同的作用。我们的方法不需要模型推理,并在捕获时运行,为高效、始终开启的自我中心数据管理提供了一条途径。
Summary / 总结
The paper addresses the challenge of efficiently curating frames from egocentric camera streams to reduce redundancy and improve learning. It introduces a Dual-Criterion Frame Curator that uses gaze stability and pupil novelty as criteria. Experiments on the Visual Experience Dataset show that frames selected at 10% budget match full stream performance for classification tasks, with pupil ranking enhancing activity recognition and gaze-only selection excelling in scene recognition. The method operates at capture time without requiring model inference, offering a practical solution for efficient data curation in wearable devices.
论文针对从第一人称相机流中高效筛选帧以减少冗余并提高学习效率的问题。它提出了一种基于双标准的帧筛选器,使用注视稳定性和瞳孔新颖性作为筛选标准。在视觉体验数据集上的实验表明,以10%的预算筛选的帧与完整流的分类性能相当,瞳孔排序增强活动识别,而仅使用注视选择则在场景识别中占优,证实了两种信号确实服务于不同的作用。该方法在捕获时即可运行且无需模型推理,为穿戴设备中的高效数据筛选提供了一种实用方案。
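The gate-then-rank procedure of the Dual-Criterion Frame Curator can be sketched in a few lines. The field names, the threshold, and the way the budget is rounded are assumptions for illustration; the abstract does not give these details.

```python
def curate(frames: list[dict], gaze_threshold: float, budget: float) -> list[int]:
    """Gate frames by gaze quality, then keep the most novel ones within the budget.

    Each frame is a dict with hypothetical keys 'id', 'gaze_quality',
    and 'pupil_novelty'. `budget` is the fraction of all frames to keep.
    """
    # Stage 1: drop frames whose gaze signal indicates an unstable view.
    gated = [f for f in frames if f["gaze_quality"] >= gaze_threshold]
    # Stage 2: rank the survivors by pupil-derived novelty, highest first.
    ranked = sorted(gated, key=lambda f: f["pupil_novelty"], reverse=True)
    keep = max(1, int(len(frames) * budget))
    return [f["id"] for f in ranked[:keep]]
```

Keeping the two criteria as separate stages, rather than summing them into one score, means a highly novel frame with poor gaze quality is never selected, in line with the abstract's remark that naive signal fusion loses both contributions.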
Beyond Accuracy: What Matters in Designing Well-Behaved Image Classification Models?
Authors: Robin Hesse, Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Stefan Roth
First: 2025-03-21T12:54:18+00:00 · Latest: 2026-03-04T14:00:09+00:00
Comments: Published in TMLR (01/2026) | OpenReview: https://openreview.net/forum?id=E7HDtLCoT6 | Project page: https://visinf.github.io/beyond-accuracy/
Abstract
Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect these quality dimensions. We reveal several new insights, including that (i) vision-language models exhibit high class balance on ImageNet-1k classification and strong robustness against domain changes; (ii) training models initialized with weights obtained through self-supervised learning is an effective strategy to improve most considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
中文标题/摘要
标题:超越准确性:设计良好行为的图像分类模型需要考虑什么?
深度学习已成为计算机视觉不可或缺的一部分,深度神经网络(DNNs)在预测性能方面表现出色。然而,它们在其他关键质量维度,如鲁棒性、校准或公平性方面往往表现不佳。虽然现有研究集中在这些质量维度的一部分,但没有一项研究探讨DNNs的更广泛形式的“良好行为”。通过这项工作,我们填补了这一空白,同时研究了图像分类中的九种不同质量维度。通过大规模研究,我们通过分析326个骨干模型及其不同训练范式和模型架构对这些质量维度的影响,提供了宏观视角。我们揭示了各种新的见解,例如:(i) 视觉-语言模型在ImageNet-1k分类中表现出高类别平衡,并且在领域变化中具有很强的鲁棒性;(ii) 使用自监督学习获得的权重初始化模型是一种有效策略,可以提高大多数考虑的质量维度;(iii) 训练数据集大小是大多数质量维度的主要驱动因素。我们通过引入QUBA评分(超越准确性理解的质量),一种新型指标,对模型在多个质量维度上的排名,从而根据特定用户需求提供定制化建议,来结束我们的研究。
Summary / 总结
This study addresses the gap in evaluating the well-behavedness of deep neural networks in image classification by examining nine quality dimensions. Through a large-scale analysis of 326 backbone models, the research reveals that vision-language models have high class balance and robustness, and self-supervised learning initialization improves most quality dimensions. The study also finds that the training dataset size significantly influences most quality dimensions. A novel metric, QUBA score, is introduced to rank models across multiple quality dimensions, providing tailored recommendations based on user needs.
研究通过分析326个骨干模型,探讨了图像分类中深度神经网络的九个质量维度,揭示了视觉语言模型在类别平衡和鲁棒性方面表现出色,并且自我监督学习初始化可以提高大多数质量维度。研究还发现,训练数据集的大小对大多数质量维度有显著影响。研究引入了QUBA评分(质量理解超越准确性),这是一种新型指标,可以跨多个质量维度对模型进行排名,从而根据特定需求提供定制化建议。
Context Biasing for Pronunciation-Orthography Mismatch in Automatic Speech Recognition
Authors: Christian Huber, Alexander Waibel
First: 2025-06-23T14:42:03+00:00 · Latest: 2026-03-04T13:24:46+00:00
Abstract
Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition. When using appropriate modeling units, e.g., byte-pair encoding, these systems are in principle open vocabulary systems. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, acronyms, or domain-specific special words. To address this problem, many context biasing methods have been proposed; however, these methods may still struggle when they are unable to relate audio and corresponding text, e.g., in case of a pronunciation-orthography mismatch. We propose a method where corrections of substitution errors can be used to improve the recognition accuracy of such challenging words. Users can add corrections on the fly during inference. We show that with this method we get a relative improvement in biased word error rate between 22% and 34% compared to a text-based replacement method, while maintaining the overall performance.
中文标题/摘要
标题:自动语音识别中的发音-拼写匹配上下文偏差
神经序列到序列系统在自动语音识别中提供了最先进的性能。当使用适当的建模单元,例如字节对编码时,这些系统原则上是开放词汇系统。然而,在实践中,它们往往无法识别训练期间未见过的单词,例如专有名词、缩写或领域特定的特殊单词。为了解决这个问题,已经提出了许多上下文偏差方法;然而,当它们无法将音频与相应的文本相关联时,例如在发音-拼写匹配不匹配的情况下,这些方法仍然可能遇到困难。我们提出了一种方法,其中可以通过纠正替换错误来提高此类具有挑战性的单词的识别准确性。用户可以在推理过程中实时添加纠正。我们表明,与基于文本的替换方法相比,使用此方法可以将偏差单词错误率相对提高22%至34%,同时保持整体性能。
Summary / 总结
The paper addresses the issue of automatic speech recognition failing to recognize words not seen during training, particularly in cases of pronunciation-orthography mismatch. It proposes a context biasing method that uses corrections of substitution errors to improve recognition accuracy. Experimental results show a relative improvement in biased word error rate between 22% and 34% compared to a text-based replacement method, while maintaining overall performance.
论文针对自动语音识别系统在识别未在训练中出现的词语,尤其是发音与拼写不符的情况下识别不准确的问题,提出了一种上下文偏置方法,该方法利用替换错误的修正来提高识别准确性。在推理过程中,用户可以实时添加这些修正。该方法使得在偏误词错误率上相对提高了22%到34%,同时保持了整体性能。
Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy
Authors: Zekai Deng, Ye Shi, Kaiyang Ji, Lan Xu, Shaoli Huang, Jingya Wang
Venue: ICLR
First: 2025-03-24T05:18:04+00:00 · Latest: 2026-03-04T13:22:04+00:00
Comments: ICLR camera ready
Abstract
Human-object interaction (HOI) synthesis is crucial for applications in animation, simulation, and robotics. However, existing approaches either rely on expensive motion capture data or require manual reward engineering, limiting their scalability and generalizability. In this work, we introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs) to enable long-horizon interactions with diverse object types, including static, dynamic, and articulated objects. We introduce VLM-Guided Relative Movement Dynamics (RMD), a fine-grained spatio-temporal bipartite representation that automatically constructs goal states and reward functions for reinforcement learning. By encoding structured relationships between human and object parts, RMD enables VLMs to generate semantically grounded, interaction-aware motion guidance without manual reward tuning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans. Extensive experiments demonstrate that our framework outperforms existing methods in synthesizing natural, human-like motions across both simple single-task and complex multi-task scenarios. For more details, please refer to our project webpage: https://vlm-rmd.github.io/.
中文标题/摘要
标题:通过自动设计的VLM引导运动策略实现人-物交互
人-物交互(HOI)合成对于动画、模拟和机器人技术的应用至关重要。然而,现有方法要么依赖昂贵的运动捕捉数据,要么需要手动奖励工程,这限制了它们的可扩展性和通用性。在本文中,我们引入了第一个基于物理的HOI统一框架,该框架利用视觉语言模型(VLMs)实现与多种类型物体(包括静态、动态和可动物体)的长时交互。我们提出了VLM引导的相对运动动力学(RMD),这是一种细粒度的空间-时间二分表示,可以自动构建强化学习的目标状态和奖励函数。通过编码人类和物体部分之间的结构化关系,RMD使VLM能够生成语义上合理、交互感知的运动指导,而无需手动调整奖励。为了支持我们的方法,我们提出了Interplay,一个包含数千个长时交互计划的新数据集。广泛的实验表明,我们的框架在合成自然、人类样式的运动方面优于现有方法,无论是在简单的单任务还是复杂的多任务场景中。如需更多详情,请参阅我们的项目网页:https://vlm-rmd.github.io/
Summary / 总结
This work introduces a unified physics-based framework for human-object interaction synthesis that uses Vision-Language Models (VLMs) to enable long-horizon interactions with diverse object types. The framework employs VLM-Guided Relative Movement Dynamics (RMD) to automatically construct goal states and reward functions for reinforcement learning, allowing for semantically grounded, interaction-aware motion guidance without manual reward tuning. Experiments show that this approach outperforms existing methods in generating natural, human-like motions across simple and complex scenarios.
该研究提出了一种使用Vision-Language模型(VLM)引导运动策略的统一物理框架,用于人类与物体的交互合成。该框架采用VLM-Guided相对运动动力学(RMD)自动构建目标状态和奖励函数,使交互能够涵盖多种物体类型,并支持长时间的交互。实验表明,该方法在各种场景中生成自然、类人的运动表现优于现有方法。还引入了一个名为Interplay的新数据集来支持该方法。该框架无需手动调整奖励,增强了其可扩展性和通用性。更多信息请参阅项目网页:https://vlm-rmd.github.io/.
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
Authors: Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Yiqiu Ren, Licheng Yu, Ning Zhang, Yong Jae Lee, Miao Liu
First: 2025-01-08T08:15:29+00:00 · Latest: 2026-03-04T13:07:07+00:00
Abstract
Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB), to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.
中文标题/摘要
标题:构建思维宫殿:基于环境的语义图结构化以实现有效的长视频分析
使用大型视觉语言模型进行长视频理解面临挑战,需要在有限的上下文窗口内分析时间上分散但空间上集中的关键时刻。本文介绍了一种名为VideoMindPalace的新框架,该框架受到“思维宫殿”的启发,将关键视频时刻组织成拓扑结构化的语义图。VideoMindPalace通过(i) 手部-物体跟踪和交互,(ii) 聚类活动区域表示特定活动区域,以及(iii) 环境布局映射,组织关键信息,使大型语言模型能够通过自然语言解析提供基于空间-时间及3D上下文的见解。此外,我们提出了Video MindPalace基准(VMB),以评估包括空间定位、时间推理和布局感知序列理解在内的类人类推理能力。在VMB和EgoSchema、NExT-QA、IntentQA以及Active Memories基准等现有视频问答数据集上进行评估,VideoMindPalace在空间-时间连贯性和与人类对齐的推理方面表现出显著提升,推动了视觉语言模型在长视频分析能力上的进步。
Summary / 总结
This work addresses the challenge of analyzing long-form videos by introducing VideoMindPalace, a framework that organizes key video moments into a structured semantic graph. It uses hand-object tracking, clustered activity zones, and environment layout mapping to provide spatial and temporal context for large language models. VideoMindPalace was evaluated on the Video MindPalace Benchmark and established video QA datasets, showing improvements in spatio-temporal coherence and human-aligned reasoning, enhancing long-form video analysis capabilities in vision-language models.
该研究通过引入VideoMindPalace框架,将关键视频时刻组织成结构化的语义图来解决长视频分析的挑战。该框架利用手部和物体跟踪、活动区域聚类以及环境布局映射来为大型语言模型提供空间和时间上下文。VideoMindPalace在Video MindPalace基准和现有的视频问答数据集上进行了评估,展示了在时空连贯性和人类对齐的推理方面的改进,从而提升了视觉语言模型的长视频分析能力。
Training-Free Rate-Distortion-Perception Traversal With Diffusion
Authors: Yuhan Wang, Suzhi Bi, Ying-Jun Angela Zhang
First: 2026-03-04T12:49:13+00:00 · Latest: 2026-03-04T12:49:13+00:00
Comments: 40 pages, 17 figures
Abstract
The rate-distortion-perception (RDP) tradeoff characterizes the fundamental limits of lossy compression by jointly considering bitrate, reconstruction fidelity, and perceptual quality. While recent neural compression methods have improved perceptual performance, they typically operate at fixed points on the RDP surface, requiring retraining to target different tradeoffs. In this work, we propose a training-free framework that leverages pre-trained diffusion models to traverse the entire RDP surface. Our approach integrates a reverse channel coding (RCC) module with a novel score-scaled probability flow ODE decoder. We theoretically prove that the proposed diffusion decoder is optimal for the distortion-perception tradeoff under AWGN observations and that the overall framework with the RCC module achieves the optimal RDP function in the Gaussian case. Empirical results across multiple datasets demonstrate the framework's flexibility and effectiveness in navigating the ternary RDP tradeoff using pre-trained diffusion models. Our results establish a practical and theoretically grounded approach to adaptive, perception-aware compression.
中文标题/摘要
标题:无需训练的速率-失真-感知权衡扩散穿越
速率-失真-感知(RDP)权衡描述了通过同时考虑比特率、重建保真度和感知质量来确定有损压缩的基本限制。尽管最近的神经压缩方法提高了感知性能,但它们通常在RDP曲面上的固定点上运行,需要重新训练以针对不同的权衡。在本文中,我们提出了一种无需训练的框架,利用预训练的扩散模型在整个RDP曲面上进行穿越。我们的方法将逆信道编码(RCC)模块与新颖的分数缩放概率流ODE解码器相结合。我们理论上证明,在加性高斯白噪声(AWGN)观测下,提出的扩散解码器对于失真-感知权衡是最佳的,并且整体框架结合RCC模块在高斯情况下实现了最优的RDP函数。在多个数据集上的实验结果表明,该框架利用预训练的扩散模型在三元RDP权衡中具有灵活性和有效性。我们的结果确立了一种实用且理论基础的自适应、感知导向压缩方法。
Spectral Surgery: Training-Free Refinement of LoRA via Gradient-Guided Singular Value Reweighting
Authors: Zailong Tian, Yanzhe Chen, Zhuoheng Han, Lizi Liao
First: 2026-03-04T12:38:36+00:00 · Latest: 2026-03-04T12:38:36+00:00
Abstract
Low-Rank Adaptation (LoRA) improves downstream performance by restricting task updates to a low-rank parameter subspace, yet how this limited capacity is allocated within a trained adapter remains unclear. Through a geometric and empirical study across multiple tasks and backbones, we find that trained LoRA updates often exhibit an inefficient spectrum: task effects concentrate in a small subset of singular directions, while many remaining components are neutral or detrimental, motivating post-hoc refinement within the learned subspace. We propose Spectral Surgery, a training-free refinement that decomposes a LoRA update with SVD, estimates per-component sensitivity using gradients on a small calibration set, and reweights singular values under a magnitude constraint while keeping the learned directions fixed. Across Llama-3.1-8B and Qwen3-8B on four benchmarks, Spectral Surgery yields consistent gains (up to +4.4 points on CommonsenseQA and +2.4 pass@1 on HumanEval) by adjusting only $\approx 1{,}000$ scalar coefficients. These results demonstrate that SVD-structured, low-cost parameter editing can serve as a practical route to improving trained LoRA adapters in a purely post-hoc manner.
中文标题/摘要
标题:光谱手术:通过梯度引导的奇异值重新加权对LoRA的无训练精炼
低秩适应(LoRA)通过将任务更新限制在低秩参数子空间中来提高下游性能,但训练后的适配器内部如何分配这种有限的能力尚不清楚。通过在多个任务和骨干网络上的几何和经验研究,我们发现训练后的LoRA更新通常表现出低效的光谱:任务效果集中在少数奇异方向上,而许多剩余组件是中性的或有害的,这促使我们在学习的子空间内进行事后精炼。我们提出了光谱手术,这是一种无训练精炼方法,通过SVD分解LoRA更新,使用小校准集上的梯度估计每个组件的敏感性,并在保持学习方向不变的情况下,根据幅度约束重新加权奇异值。在Llama-3.1-8B和Qwen3-8B上,通过调整约1,000个标量系数,光谱手术在四个基准测试上取得了一致的改进(在CommonsenseQA上最多提高4.4分,在HumanEval上提高2.4分)。这些结果表明,SVD结构化、低成本的参数编辑可以作为一种实用的方法,以纯事后的方式改进训练后的LoRA适配器。
Summary / 总结
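The paper observes that trained LoRA updates often have an inefficient spectrum, with task effects concentrated in a few singular directions while many other components are neutral or detrimental. It proposes Spectral Surgery, a training-free method that decomposes the update with SVD, estimates per-component sensitivity from gradients on a small calibration set, and reweights the singular values under a magnitude constraint while keeping the learned directions fixed. Across Llama-3.1-8B and Qwen3-8B on four benchmarks, the method yields consistent gains (up to +4.4 points on CommonsenseQA and +2.4 pass@1 on HumanEval) by adjusting only about 1,000 scalar coefficients.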
论文针对LoRA更新中奇异值分配的低效问题,提出了Spectral Surgery进行后处理优化的方法。该方法通过SVD分解LoRA更新,使用梯度估计各分量的敏感性,并在幅度约束下重新加权奇异值。实验结果显示,在Llama-3.1-8B和Qwen3-8B上,该方法在CommonsenseQA (+4.4分) 和HumanEval (+2.4 pass@1) 上实现了稳定的性能提升,仅调整了约1000个标量系数。
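The Spectral Surgery pipeline described above (SVD of the LoRA update, gradient-based sensitivity per component, reweighting under a magnitude constraint) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `spectral_reweight`, the `strength` parameter, the linear reweighting rule, and the choice to preserve the Frobenius norm as the "magnitude constraint" are all assumptions made for the example.

```python
import numpy as np

def spectral_reweight(B, A, grad, strength=0.5, eps=1e-12):
    """Reweight the singular values of a LoRA update delta_W = B @ A.

    B: (d_out, r), A: (r, d_in). grad is dL/d(delta_W), assumed to be
    averaged over a small calibration set. All names are hypothetical.
    """
    r = B.shape[1]
    delta_w = B @ A
    U, s, Vt = np.linalg.svd(delta_w, full_matrices=False)
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]      # delta_W has rank <= r

    # First-order sensitivity of the loss to each singular value:
    # dL/ds_k = u_k^T G v_k.
    sens = np.einsum("ik,ij,kj->k", U, grad, Vt)

    # Assumed rule: boost components whose growth lowers the loss
    # (negative sensitivity) and shrink the others.
    scale = 1.0 - strength * sens / (np.abs(sens).max() + eps)
    new_s = s * scale

    # Assumed magnitude constraint: keep the Frobenius norm unchanged.
    norm_new = np.linalg.norm(new_s)
    if norm_new > 0:
        new_s = new_s * (np.linalg.norm(s) / norm_new)

    # Singular directions U and Vt stay fixed; only the scalars change.
    return U @ np.diag(new_s) @ Vt
```

Only `r` scalars per adapted matrix are edited here, which is consistent with the abstract's point that roughly 1,000 coefficients are adjusted in total across the model.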
When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models
Authors: Qianpu Chen, Derya Soydaner, Rob Saunders
First: 2026-03-04T12:33:36+00:00 · Latest: 2026-03-04T12:33:36+00:00
Abstract
When visual evidence is ambiguous, vision models must decide whether to interpret face-like patterns as meaningful. Face pareidolia, the perception of faces in non-face objects, provides a controlled probe of this behavior. We introduce a representation-level diagnostic framework that analyzes detection, localization, uncertainty, and bias across class, difficulty, and emotion in face pareidolia images. Under a unified protocol, we evaluate six models spanning four representational regimes: vision-language models (VLMs; CLIP-B/32, CLIP-L/14, LLaVA-1.5-7B), pure vision classification (ViT), general object detection (YOLOv8), and face detection (RetinaFace). Our analysis reveals three mechanisms of interpretation under ambiguity. VLMs exhibit semantic overactivation, systematically pulling ambiguous non-human regions toward the Human concept, with LLaVA-1.5-7B producing the strongest and most confident over-calls, especially for negative emotions. ViT instead follows an uncertainty-as-abstention strategy, remaining diffuse yet largely unbiased. Detection-based models achieve low bias through conservative priors that suppress pareidolia responses even when localization is controlled. These results show that behavior under ambiguity is governed more by representational choices than score thresholds, and that uncertainty and bias are decoupled: low uncertainty can signal either safe suppression, as in detectors, or extreme over-interpretation, as in VLMs. Pareidolia therefore provides a compact diagnostic and a source of ambiguity-aware hard negatives for probing and improving the semantic robustness of vision-language systems. Code will be released upon publication.
中文标题/摘要
标题:当视觉证据模棱两可时:幻视作为视觉模型诊断探针的应用
当视觉证据模棱两可时,视觉模型必须决定是否将类似人脸的模式解释为有意义的。人脸幻视,即在非人脸物体中感知人脸的现象,提供了一种控制这种行为的探针。我们引入了一种表示级诊断框架,该框架分析了人脸幻视图像中检测、定位、不确定性和偏见在类别、难度和情绪方面的表现。在统一的协议下,我们评估了六种模型,涵盖了四种表示范式:视觉-语言模型(VLMs;CLIP-B/32,CLIP-L/14,LLaVA-1.5-7B)、纯视觉分类(ViT)、通用物体检测(YOLOv8)和人脸检测(RetinaFace)。我们的分析揭示了三种在模棱两可情况下解释的机制。VLMs表现出语义过激活,系统地将模棱两可的非人类区域拉向“人类”概念,其中LLaVA-1.5-7B产生最强且最自信的过解释,尤其是在负面情绪方面。ViT则遵循一种不确定性作为避免策略,保持模糊但总体上无偏。基于检测的模型通过保守的先验抑制幻视响应,即使在定位受控的情况下也是如此。这些结果表明,行为在模棱两可情况下的表现更多由表示选择而非评分阈值决定,不确定性和偏见是分离的:低不确定性可以信号安全抑制,如在检测器中,也可以信号极端过解释,如在VLMs中。因此,幻视提供了一种紧凑的诊断工具和模棱两可感知的硬负例来源,用于探索和提高视觉-语言系统的语义鲁棒性。代码将在发表后发布。
Summary / 总结
This study investigates how vision models interpret ambiguous face-like patterns, using face pareidolia as a controlled probe. Six models from different representational regimes were evaluated: vision-language models (CLIP-B/32, CLIP-L/14, LLaVA-1.5-7B), pure vision classification (ViT), general object detection (YOLOv8), and face detection (RetinaFace). The analysis revealed that VLMs overactivate semantic concepts, particularly for negative emotions, while ViT follows an uncertainty-as-abstention strategy. Detection-based models suppress pareidolia responses through conservative priors, showing that behavior under ambiguity is more influenced by representational choices than score thresholds. Pareidolia thus serves as a diagnostic tool for improving the semantic robustness of vision-language systems.
研究通过使用面相错觉作为探针,探讨了视觉模型如何解释模棱两可的面部图案。六个来自不同表示范式的模型被评估,结果显示VLMs过度激活语义概念,而ViT模型则保持相对无偏。基于检测的模型通过保守的先验抑制错觉响应,表明模糊处理更多受表示选择影响而非评分阈值。面相错觉图像被提议作为诊断工具和难以处理的负面样本来源,以提高视觉语言系统的语义鲁棒性。
GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery
Authors: Lifan Jiang, Yuhang Pei, oxi Wu, Yan Zhao, Tianrun Wu, Shulong Yu, Lihui Zhang, Deng Cai
First: 2026-03-04T12:24:16+00:00 · Latest: 2026-03-04T12:24:16+00:00
Abstract
Recent advances in MLLMs are reframing segmentation from fixed-category prediction to instruction-grounded localization. While reasoning based segmentation has progressed rapidly in natural scenes, remote sensing lacks a generalizable solution due to the prohibitive cost of reasoning-oriented data and domain-specific challenges like overhead viewpoints. We present GeoSeg, a zero-shot, training-free framework that bypasses the supervision bottleneck for reasoning-driven remote sensing segmentation. GeoSeg couples MLLM reasoning with precise localization via: (i) bias-aware coordinate refinement to correct systematic grounding shifts and (ii) a dual-route prompting mechanism to fuse semantic intent with fine-grained spatial cues. We also introduce GeoSeg-Bench, a diagnostic benchmark of 810 image--query pairs with hierarchical difficulty levels. Experiments show that GeoSeg consistently outperforms all baselines, with extensive ablations confirming the effectiveness and necessity of each component.
中文标题/摘要
标题:GeoSeg:无需训练的推理驱动遥感影像分割
近期MLLM的进展正在重新定义分割任务,从固定类别的预测转变为基于指令的定位。虽然在自然场景中基于推理的分割已经取得了快速进展,但由于推理导向数据的成本高昂以及特定领域的挑战(如俯视视角),遥感领域缺乏通用解决方案。我们提出了GeoSeg,这是一种零样本、无需训练的框架,绕过了推理驱动遥感分割的监督瓶颈。GeoSeg 结合了MLLM推理与精确定位:(i) 偏差感知的坐标细化以纠正系统性定位偏差,(ii) 双路提示机制以融合语义意图与精细的空间线索。我们还引入了GeoSeg-Bench,这是一个包含810个图像-查询对的诊断基准,具有分层难度级别。实验表明,GeoSeg 在所有基线中表现最佳,广泛的消融实验验证了每个组件的有效性和必要性。
Summary / 总结
GeoSeg is a training-free framework for reasoning-driven segmentation in remote sensing imagery, addressing the lack of generalizable solutions due to high data acquisition costs and domain-specific challenges. It uses MLLM reasoning and precise localization through bias-aware coordinate refinement and a dual-route prompting mechanism. GeoSeg outperforms all baselines in experiments and demonstrates the effectiveness of its components through extensive ablations.
GeoSeg 是一个无需训练的框架,用于遥感图像中的推理驱动分割,解决了由于高数据获取成本和领域特定挑战导致的缺乏通用解决方案的问题。它通过 MLLM 推理和精确定位,结合系统定位偏移的偏差感知坐标细化和双路提示机制。实验表明,GeoSeg 在所有基线中表现最佳,消融研究验证了每个组件的有效性。
Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
Authors: Zehao Deng, Tianjie Ju, Zheng Wu, Zhuosheng Zhang, Gongshen Liu
Venue: CVPR 2026
First: 2025-11-27T09:01:38+00:00 · Latest: 2026-03-04T12:15:06+00:00
Comments: Accepted to CVPR 2026
Abstract
The rapid development of large vision-language models (VLMs) has greatly advanced research on GUI agents. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level capabilities and low-level execution capabilities, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Unlike training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. Based on this, we built the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model, assisting the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play module that significantly enhances the long-horizon capabilities of various Executors. Code is available at https://github.com/hehehahi4/CES.
中文标题/摘要
标题:使用执行反馈强化学习训练高阶调度器以实现长时GUI自动化
大型视觉-语言模型(VLM)的快速发展极大地推动了GUI代理的研究。然而,GUI代理在处理长时任务时仍面临重大挑战。首先,单个代理模型难以平衡高阶能力和低级执行能力,面临责任耦合和能力冲突的普遍问题。其次,代理缺乏对任务状态的意识,导致长时任务中进度损失。为解决这些挑战,我们提出了一种分阶段执行反馈强化学习算法。与训练统一策略模型不同,我们专注于训练高阶调度模型。具体来说,我们提出了并训练了两个代理:协调器,负责战略规划和任务分解;状态跟踪器,负责上下文压缩和信息管理,以保持任务的状态和连贯性。基于此,我们构建了协调器-执行器-状态跟踪器(CES)多代理框架,可以与任何低级执行器模型集成,通过任务调度和状态管理协助执行器解决长时任务。在长时任务基准上的实验表明,CES显著增强了系统的规划和状态管理能力。此外,分析证实,我们训练的高阶调度模块是一个可移植的通用模块,显著增强了各种执行器的长时能力。代码可在https://github.com/hehehahi4/CES/获取。
Summary / 总结
This paper addresses the challenges of handling long-horizon tasks in GUI agents by proposing a staged execution-feedback reinforcement learning algorithm. The method trains two high-level agents: a Coordinator for strategic planning and task decomposition, and a State Tracker for context compression and state management. Together with any low-level Executor, they form the CES multi-agent framework. Experiments show that CES significantly improves planning and state management capabilities, and that the trained high-level scheduling module is a generalizable, plug-and-play component that enhances the long-horizon capabilities of various Executors.
论文通过提出阶段化的执行反馈强化学习算法来解决GUI代理在处理长时任务时面临的挑战。它引入了一个多代理框架CES,包括负责战略规划的协调器、负责状态管理的状态跟踪器和负责任务执行的执行器。实验表明,CES提高了规划和状态管理能力,使其成为一个可通用的模块,能够显著增强各种执行器的长时任务处理能力。
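The division of labour in the CES framework above can be pictured as a simple control loop. The sketch below is only an illustration of that Coordinator / Executor / State Tracker split under assumed interfaces: `TaskState`, `run_ces`, and the three callables are hypothetical names, and the real system uses trained models rather than plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class TaskState:
    goal: str
    summary: str = ""                      # compressed context kept by the tracker
    history: List[str] = field(default_factory=list)
    done: bool = False

def run_ces(
    goal: str,
    coordinator: Callable[[TaskState], Optional[str]],
    executor: Callable[[str], str],
    state_tracker: Callable[[str, str, str], str],
    max_steps: int = 10,
) -> TaskState:
    """Hypothetical loop: plan a subtask, execute it, then update the state."""
    state = TaskState(goal=goal)
    for _ in range(max_steps):
        subtask = coordinator(state)       # high-level planning and decomposition
        if subtask is None:                # coordinator signals completion
            state.done = True
            break
        observation = executor(subtask)    # low-level GUI actions (any Executor)
        state.history.append(subtask)
        # The tracker compresses the new observation into the running summary.
        state.summary = state_tracker(state.summary, subtask, observation)
    return state
```

Because the Executor is passed in as an opaque callable, swapping it for a different low-level model leaves the loop unchanged, which mirrors the plug-and-play claim in the abstract.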
IROSA: Interactive Robot Skill Adaptation using Natural Language
Authors: Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp, João Silvério
First: 2026-03-04T09:54:09+00:00 · Latest: 2026-03-04T09:54:09+00:00
Comments: Accepted to the IEEE Robotics and Automation Letters (RA-L) journal, 8 pages, 5 figures, 3 tables, 1 listing
Abstract
Foundation models have demonstrated impressive capabilities across diverse domains, while imitation learning provides principled methods for robot skill adaptation from limited data. Combining these approaches holds significant promise for direct application to robotics, yet this combination has received limited attention, particularly for industrial deployment. We present a novel framework that enables open-vocabulary skill adaptation through a tool-based architecture, maintaining a protective abstraction layer between the language model and robot hardware. Our approach leverages pre-trained LLMs to select and parameterize specific tools for adapting robot skills without requiring fine-tuning or direct model-to-robot interaction. We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and obstacle avoidance while maintaining safety, transparency, and interpretability.
中文标题/摘要
标题:IROSA:使用自然语言的交互式机器人技能适应
基础模型在多个领域展示了令人印象深刻的性能,而模仿学习为从有限数据中通过原理方法适应机器人技能提供了可能。将这两种方法结合起来在直接应用于机器人技术方面具有巨大潜力,但这种结合在工业部署方面受到的关注有限。我们提出了一种新的框架,通过工具架构实现开放词汇量的技能适应,保持语言模型与机器人硬件之间的保护抽象层。我们的方法利用预训练的大规模语言模型来选择和参数化特定工具,以适应机器人技能,而无需微调或直接模型到机器人交互。我们在一个7自由度扭矩控制机器人上展示了该框架,该机器人执行工业轴承环插入任务,通过自然语言命令成功实现了技能适应,包括速度调整、轨迹校正和障碍物避免,同时保持了安全、透明和可解释性。
Summary / 总结
The research aims to combine the capabilities of foundation models and imitation learning to enable interactive robot skill adaptation using natural language. The method involves using a tool-based architecture with pre-trained language models to select and parameterize specific tools for skill adaptation without fine-tuning or direct interaction with the robot. Key experimental findings show successful skill adaptation for a 7-DoF robot in an industrial bearing ring insertion task, with adjustments made through natural language commands for speed, trajectory, and obstacle avoidance, while ensuring safety and interpretability.
研究旨在结合基础模型和模仿学习的能力,通过自然语言实现交互式的机器人技能适应。方法是使用工具基础架构,通过预训练的语言模型选择和参数化特定工具来进行技能适应,无需对模型进行微调或直接与机器人交互。实验结果表明,通过自然语言命令成功实现了速度调整、轨迹校正和避障技能适应,同时保持了工业轴承环插入任务中的安全性和可解释性。
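The "protective abstraction layer" idea in the IROSA entry above, where the language model only selects and parameterizes tools and never talks to the robot directly, might look like the following in its simplest form. Everything here is an assumption for illustration: the tool names, the parameter bounds, the JSON call format, and `validate_tool_call` are hypothetical and are not taken from the paper.

```python
import json

# Hypothetical tool registry: each tool exposes one bounded parameter.
TOOLS = {
    "set_speed": {"param": "scale", "min": 0.1, "max": 1.5},
    "shift_via_point": {"param": "offset_m", "min": -0.05, "max": 0.05},
}

def validate_tool_call(raw: str):
    """Parse an LLM-proposed tool call and reject anything outside the registry.

    Only a validated (tool, args) pair would ever be handed to the skill
    adaptation code; the raw model output never reaches the hardware.
    """
    call = json.loads(raw)
    name = call.get("tool")
    spec = TOOLS.get(name)
    if spec is None:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("args", {})
    if spec["param"] not in args:
        raise ValueError(f"missing parameter: {spec['param']}")
    value = float(args[spec["param"]])
    if not spec["min"] <= value <= spec["max"]:
        raise ValueError(f"{spec['param']}={value} is out of bounds")
    return name, {spec["param"]: value}
```

A command such as "go a bit slower" would then map to a call like `{"tool": "set_speed", "args": {"scale": 0.5}}`, which passes validation, whereas an out-of-range or unknown request is rejected before any motion is generated.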
Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset
Authors: Louis Blankemeier, Ashwin Kumar, Joseph Paul Cohen, Jiaming Liu, Longchao Liu, Dave Van Veen, Syed Jamal Safdar Gardezi, Hongkun Yu, Magdalini Paschali, Zhihong Chen, Jean-Benoit Delbrouck, Eduardo Reis, Robbie Holland, Cesar Truyts, Christian Bluethgen, Yufu Wu, Long Lian, Malte Engmann Kjeldskov Jensen, Sophie Ostmeier, Maya Varma, Jeya Maria Jose Valanarasu, Zhongnan Fang, Zepeng Huo, Zaid Nabulsi, Diego Ardila, Wei-Hung Weng, Edson Amaro Junior, Neera Ahuja, Jason Fries, Nigam H. Shah, Greg Zaharchuk, Marc Willis, Adam Yala, Andrew Johnston, Robert D. Boutin, Andrew Wentland, Curtis P. Langlotz, Jason Hom, Sergios Gatidis, Akshay S. Chaudhari
First: 2024-06-10T17:53:01+00:00 · Latest: 2026-03-04T09:13:44+00:00
Comments: Nature (2026)
Abstract
The large volume of abdominal computed tomography (CT) scans coupled with the shortage of radiologists has intensified the need for automated medical image analysis tools. Previous state-of-the-art approaches for automated analysis leverage vision-language models (VLMs) that jointly model images and radiology reports. However, current medical VLMs are generally limited to 2D images and short reports. Here, to overcome these shortcomings for abdominal CT interpretation, we introduce Merlin, a 3D VLM that learns from volumetric CT scans, electronic health record data and radiology reports. This approach is enabled by a multistage pretraining framework that does not require additional manual annotations. We trained Merlin using a high-quality clinical dataset of paired CT scans (>6 million images from 15,331 CT scans), diagnosis codes (>1.8 million codes) and radiology reports (>6 million tokens). We comprehensively evaluated Merlin on 6 task types and 752 individual tasks that covered diagnostic, prognostic and quality-related tasks. The non-adapted (off-the-shelf) tasks included zero-shot classification of findings (30 findings), phenotype classification (692 phenotypes) and zero-shot cross-modal retrieval (image-to-findings and image-to-impression). The model-adapted tasks included 5-year chronic disease prediction (6 diseases), radiology report generation and 3D semantic segmentation (20 organs). We validated Merlin at scale, with internal testing on 5,137 CT scans and external testing on 44,098 CT scans from 3 independent sites and 2 public datasets. The results demonstrated high generalization across institutions and anatomies. Merlin outperformed 2D VLMs, CT foundation models and off-the-shelf radiology models. We also release our trained models, code, and dataset, available at: https://github.com/StanfordMIMI/Merlin.
中文标题/摘要
标题:梅林:一种计算机断层扫描视觉-语言基础模型及数据集
腹部计算机断层扫描(CT)扫描的数量庞大,而放射科医生的短缺加剧了对自动化医学图像分析工具的需求。之前最先进的自动化分析方法利用视觉-语言模型(VLMs)联合建模图像和放射学报告。然而,当前的医学VLMs通常局限于2D图像和简短的报告。为了解决腹部CT解释的这些不足,我们引入了梅林,这是一种3D VLM,可以从体层CT扫描、电子健康记录数据和放射学报告中学习。这种方法得益于一个多阶段的预训练框架,不需要额外的手动注释。我们使用高质量的临床数据集训练梅林,该数据集包含配对的CT扫描(超过600万张图像,来自15,331个CT扫描),诊断代码(超过180万代码)和放射学报告(超过600万个标记)。我们在6种任务类型和752个个体任务上全面评估了梅林,这些任务涵盖了诊断、预后和质量相关的任务。非适应性任务包括零样本分类(30种发现)、表型分类(692种表型)和零样本跨模态检索(图像到发现和图像到印象)。模型适应性任务包括5年慢性疾病预测(6种疾病)、放射学报告生成和3D语义分割(20种器官)。我们大规模验证了梅林,在5,137个CT扫描的内部测试和来自3个独立站点和2个公共数据集的44,098个CT扫描的外部测试中进行了验证。结果表明,梅林在机构和解剖学方面具有高度的泛化能力。梅林优于2D VLMs、CT基础模型和现成的放射学模型。我们还发布了训练模型、代码和数据集,可在以下链接获取:https://github.com/StanfordMIMI/Merlin。
Summary / 总结
Merlin is a 3D vision-language model designed for the interpretation of abdominal computed tomography (CT) scans, which are volumetric and complex. It leverages a multistage pretraining framework without additional manual annotations, using a large clinical dataset. Merlin excelled in various tasks, including zero-shot classification, phenotype classification, and radiology report generation, outperforming 2D models and other foundation models. It was validated across multiple institutions and datasets, showing high generalization capabilities.
Merlin 是一种针对腹部计算机断层扫描 (CT) 图像的 3D 视觉语言模型,这些图像体积大且复杂。它利用多阶段预训练框架,无需额外的手动注释,使用了大规模的临床数据集。Merlin 在零样本分类、表型分类和放射学报告生成等多种任务中表现出色,优于 2D 模型和其他基础模型。它在多个机构和数据集上进行了验证,展示了高度的泛化能力。
DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
Authors: Yangfu Li, Hongjian Zhan, Jiawei Chen, Yuning Gong, Qi Liu, Yue Lu
Venue: CVPR
First: 2026-03-04T09:06:47+00:00 · Latest: 2026-03-04T09:06:47+00:00
Comments: 18 pages, 17 figures
Abstract
Humans can robustly localize visual evidence and provide grounded answers even in noisy environments by identifying critical cues and then relating them to the full context in a bottom-up manner. Inspired by this, we propose DeepScan, a training-free framework that combines Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning for visually grounded reasoning in Large Vision-Language Models (LVLMs). Unlike existing methods that pursue one-shot localization of complete evidence, Hierarchical Scanning performs local cue exploration and multi-scale evidence extraction to recover evidence in a bottom-up manner, effectively mitigating the impacts of distractive context. Refocusing then optimizes the localized evidence view through collaboration of LVLMs and visual experts. Finally, Evidence-Enhanced Reasoning aggregates multi-granular views via a hybrid evidence memory and yields accurate and interpretable answers. Experimental results demonstrate that DeepScan significantly boosts LVLMs in diverse visual tasks, especially in fine-grained visual understanding. It achieves 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B. Moreover, DeepScan provides consistent improvements for LVLMs across various architectures and model scales without additional adaptation cost.
中文标题/摘要
标题:DeepScan:大型视觉语言模型中基于视觉的推理无训练框架
人类能够在嘈杂环境中通过识别关键线索并从底部向上关联它们来稳健地定位视觉证据并提供基于视觉的答案。受此启发,我们提出了DeepScan,一种无训练框架,结合了层次扫描、聚焦和证据增强推理,以在大型视觉语言模型(LVLMs)中实现基于视觉的推理。与现有方法追求一次性定位完整证据不同,层次扫描通过局部线索探索和多尺度证据提取以自底向上的方式恢复证据,有效缓解了干扰性上下文的影响。聚焦则通过LVLMs和视觉专家的合作优化定位的证据视图。最后,证据增强推理通过混合证据记忆聚合多粒度视图并提供准确且可解释的答案。实验结果表明,DeepScan显著提升了LVLMs在各种视觉任务中的表现,特别是在细粒度视觉理解方面。当与Qwen2.5-VL-7B集成时,DeepScan在V*上的整体准确率达到90.6%。此外,DeepScan在各种架构和模型规模的LVLMs上提供了持续改进,而无需额外的适应成本。
Summary / 总结
DeepScan is a training-free framework that enhances visually grounded reasoning in large vision-language models by combining Hierarchical Scanning, Refocusing, and Evidence-Enhanced Reasoning. It performs local cue exploration and multi-scale evidence extraction to mitigate the impact of distractive context, optimizes the localized evidence view through collaboration between LVLMs and visual experts, and aggregates multi-granular views for accurate and interpretable answers. DeepScan significantly improves LVLMs in various visual tasks, achieving 90.6% overall accuracy on V* when integrated with Qwen2.5-VL-7B and providing consistent improvements across different model architectures and scales without additional adaptation cost.
DeepScan 是一个无需训练的框架,通过结合层次扫描、聚焦优化和证据增强推理来增强大型视觉-语言模型(LVLM)的视觉接地推理能力。它通过局部线索探索和多尺度证据提取来减轻干扰背景的影响,通过视觉专家和 LVLM 的协作优化局部化证据视图,并通过混合证据记忆聚合多粒度视图以获得准确且可解释的答案。DeepScan 在各种视觉任务中显著提升了 LVLM 的性能,与 Qwen2.5-VL-7B 集成后总体准确率达到 90.6%,并且在不同模型架构和规模上提供了持续改进,无需额外的适应成本。
Training-Free Reward-Guided Image Editing via Trajectory Optimal Control
Authors: Jinho Chang, Jaemin Kim, Jong Chul Ye
Venue: ICLR 2026 Poster
First: 2025-09-30T06:34:37+00:00 · Latest: 2026-03-04T08:45:18+00:00
Comments: Poster in ICLR 2026; 22 pages, 9 figures
Abstract
Recent advancements in diffusion and flow-matching models have demonstrated remarkable capabilities in high-fidelity image synthesis. A prominent line of research involves reward guidance, which steers the generation process during inference to align with specific objectives. However, applying this reward-guided approach to the task of image editing, which requires preserving the semantic content of the source image while enhancing a target reward, is largely unexplored. In this work, we introduce a novel framework for training-free, reward-guided image editing. We formulate the editing process as a trajectory optimal control problem where the reverse process of a diffusion model is treated as a controllable trajectory originating from the source image, and the adjoint states are iteratively updated to steer the editing process. Through extensive experiments across distinct editing tasks, we demonstrate that our approach significantly outperforms existing inversion-based training-free guidance baselines, achieving a superior balance between reward maximization and fidelity to the source image without reward hacking.
中文标题/摘要
标题:基于轨迹最优控制的无需训练奖励引导图像编辑
近期在扩散和流匹配模型方面的进展展示了其在高保真图像合成方面的卓越能力。研究的一个重要方向是奖励引导的指导,该方法在推理过程中引导生成过程以满足特定目标。然而,将这种奖励引导的方法应用于需要保留源图像语义内容并增强目标奖励的图像编辑任务,尚未得到充分探索。在本文中,我们提出了一种新的无需训练的奖励引导图像编辑框架。我们将编辑过程形式化为一个轨迹最优控制问题,其中扩散模型的逆过程被视为从源图像出发的可控轨迹,通过迭代更新伴随状态来引导编辑过程。通过在不同编辑任务上的广泛实验,我们证明了我们的方法在奖励最大化和对源图像保真度之间取得了显著优于现有基于反转的无需训练指导基线的平衡,同时没有出现奖励作弊。
Summary / 总结
This work addresses the challenge of training-free reward-guided image editing by formulating the editing process as a trajectory optimal control problem. The method uses the reverse process of a diffusion model to generate a controllable trajectory from the source image, with adjoint states iteratively updated to steer the editing process. Experiments show that the proposed approach outperforms existing inversion-based methods, balancing reward maximization and source image fidelity effectively.
该研究提出了一种无需训练的奖励导向图像编辑框架,将编辑过程表述为轨迹最优控制问题。扩散模型的逆过程作为从源图像出发的可控轨迹,通过迭代更新伴随状态来引导编辑过程。实验表明,该方法在最大化奖励和保持源图像保真度之间取得了更好的平衡,且没有出现奖励作弊现象,优于现有基于反演的方法。
From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning
Authors: Ruilin Luo, Chufan Shi, Yizhen Zhang, Cheng Yang, Songtao Jiang, Tongkun Guan, Ruizhe Chen, Ruihang Chu, Peng Wang, Mingkun Yang, Yujiu Yang, Junyang Lin, Zhibo Yang
Venue: ICLR 2026 Poster
First: 2026-03-04T08:22:27+00:00 · Latest: 2026-03-04T08:22:27+00:00
Comments: ICLR 2026 Poster
Abstract
The cold-start initialization stage plays a pivotal role in training Multimodal Large Reasoning Models (MLRMs), yet its mechanisms remain insufficiently understood. To analyze this stage, we introduce the Visual Attention Score (VAS), an attention-based metric that quantifies how much a model attends to visual tokens. We find that reasoning performance is strongly correlated with VAS (r=0.9616): models with higher VAS achieve substantially stronger multimodal reasoning. Surprisingly, multimodal cold-start fails to elevate VAS, resulting in attention distributions close to the base model, whereas text-only cold-start leads to a clear increase. We term this counter-intuitive phenomenon Lazy Attention Localization. To validate its causal role, we design training-free interventions that directly modulate attention allocation during inference, yielding performance gains of 1-2% without any retraining. Building on these insights, we further propose Attention-Guided Visual Anchoring and Reflection (AVAR), a comprehensive cold-start framework that integrates visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping. Applied to Qwen2.5-VL-7B, AVAR achieves an average gain of 7.0% across 7 multimodal reasoning benchmarks. Ablation studies further confirm that each component of AVAR contributes step-wise to the overall gains. The code, data, and models are available at https://github.com/lrlbbzl/Qwen-AVAR.
中文标题/摘要
标题:从窄视角到全景视图:注意力引导下的冷启动重塑多模态推理
冷启动初始化阶段在训练多模态大型推理模型(MLRM)中起着关键作用,但其机制尚不完全理解。为了分析这一阶段,我们引入了视觉注意力得分(VAS),这是一种基于注意力的度量,量化模型对视觉标记的关注程度。我们发现推理性能与VAS(r=0.9616)之间存在强烈相关性:VAS较高的模型实现更强的多模态推理。令人惊讶的是,多模态冷启动未能提高VAS,导致注意力分布接近基模型,而仅文本冷启动则导致明显增加。我们称这一反直觉现象为“懒惰注意力定位”。为了验证其因果作用,我们设计了无需训练的干预措施,直接在推理期间调节注意力分配,无需重新训练即可获得1%-2%的性能提升。基于这些见解,我们进一步提出了注意力引导视觉锚定与反思(AVAR)框架,该框架结合了视觉锚定数据合成、注意力引导目标和视觉锚定奖励塑造。应用于Qwen2.5-VL-7B,AVAR在7个多模态推理基准测试中平均实现了7.0%的提升。消融研究进一步证实了AVAR中每个组件逐步贡献于整体提升。代码、数据和模型可在https://github.com/lrlbbzl/Qwen-AVAR获取。
Summary / 总结
The paper investigates the cold-start initialization stage in training Multimodal Large Reasoning Models (MLRMs) by introducing the Visual Attention Score (VAS), which measures how much models attend to visual tokens. It finds that reasoning performance is strongly correlated with VAS, and that text-only cold-start increases VAS, while multimodal cold-start does not. The proposed Attention-Guided Visual Anchoring and Reflection (AVAR) framework, which includes visual-anchored data synthesis, attention-guided objectives, and visual-anchored reward shaping, improves performance by 7.0% on average across 7 benchmarks. Ablation studies confirm the contributions of each component of AVAR.
论文通过引入视觉注意力得分(VAS)来量化模型对视觉标记的关注程度,研究了多模态大型推理模型(MLRM)的冷启动初始化阶段。研究发现,推理性能与VAS高度相关,仅文本冷启动可以增加VAS,而多模态冷启动则不会。提出的注意力引导视觉锚定与反思(AVAR)框架在7个基准测试中平均提高了7.0%,每个组件对整体改进都有逐步贡献。
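The abstract above describes the Visual Attention Score only as "an attention-based metric that quantifies how much a model attends to visual tokens". A minimal sketch of one plausible reading follows: the share of attention mass that lands on visual key positions, averaged over heads and query positions. The function names, the averaging scheme, and the array layout are assumptions for illustration, not the authors' definition.

```python
import numpy as np

def visual_attention_score(attn, visual_mask):
    """Share of attention mass placed on visual tokens (illustrative definition).

    attn: array of shape (heads, query_len, key_len); each row sums to 1.
    visual_mask: boolean array of shape (key_len,), True where the key is a visual token.
    """
    attn = np.asarray(attn, dtype=float)
    visual_mask = np.asarray(visual_mask, dtype=bool)
    # Sum the attention weights that fall on visual keys, per head and query.
    mass_on_visual = attn[..., visual_mask].sum(axis=-1)
    # Average over heads and query positions to get one scalar per model/input.
    return float(mass_on_visual.mean())

def pearson_r(scores, accuracies):
    """Pearson correlation between per-model scores and benchmark accuracies."""
    return float(np.corrcoef(scores, accuracies)[0, 1])
```

With one such score per model, `pearson_r` is the kind of statistic that a correlation such as the reported r=0.9616 would summarize; how the paper aggregates over layers and inputs is not stated in the abstract.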
Structure-aware Prompt Adaptation from Seen to Unseen for Open-Vocabulary Compositional Zero-Shot Learning
Authors: Yihang Duan, Jiong Wang, Pengpeng Zeng, Ji Zhang, Lei Zhao, Chong Wang, Jingkuan Song, Lianli Gao
First: 2026-03-04T07:54:28+00:00 · Latest: 2026-03-04T07:54:28+00:00
Abstract
The goal of Open-Vocabulary Compositional Zero-Shot Learning (OV-CZSL) is to recognize attribute-object compositions in the open-vocabulary setting, where compositions of both seen and unseen attributes and objects are evaluated. Recently, prompt tuning methods have demonstrated strong generalization capabilities in the closed setting, where only compositions of seen attributes and objects are evaluated, i.e., Compositional Zero-Shot Learning (CZSL). However, directly applying these methods to OV-CZSL may not be sufficient to generalize to unseen attributes, objects and their compositions, as they are limited to seen attributes and objects. Normally, when faced with unseen concepts, humans draw analogies with seen concepts that have similar semantics, thereby inferring their meaning (e.g., "wet" and "damp", "shirt" and "jacket"). In this paper, we experimentally show that the distribution of semantically related attributes or objects tends to form consistent local structures in the embedding space. Based on the above structures, we propose the Structure-aware Prompt Adaptation (SPA) method, which enables models to generalize from seen to unseen attributes and objects. Specifically, in the training stage, we design a Structure-aware Consistency Loss (SCL) that encourages consistency of the local structure of seen attributes and objects in each iteration. In the inference stage, we devise a Structure-guided Adaptation Strategy (SAS) that adaptively aligns the structures of unseen attributes and objects with those of trained seen attributes and objects with similar semantics. Notably, SPA is a plug-and-play method that can be seamlessly integrated into existing CZSL prompt tuning methods. Extensive experiments on OV-CZSL benchmarks demonstrate that SPA achieves competitive closed-set performance while significantly improving open-vocabulary results.
中文标题/摘要
标题:从已见至未见的结构感知提示适应以实现开放词汇组合零样本学习
开放词汇组合零样本学习(OV-CZSL)的目标是在开放词汇设置中识别属性-对象组合,其中既包括已见属性和对象的组合,也包括未见属性和对象的组合。最近,提示调优方法在仅评估已见属性和对象组合的组合零样本学习(CZSL)的封闭设置中展示了强大的泛化能力。然而,直接将这些方法应用于OV-CZSL可能不足以泛化到未见的属性、对象及其组合,因为它们仅限于已见的属性和对象。通常,当面对未见的概念时,人类会采用与已见且具有相似语义的概念进行类比,从而推断其含义(例如,“湿”和“潮湿”,“衬衫”和“夹克”)。在本文中,我们实验证明语义相关属性或对象在嵌入空间中的分布倾向于形成一致的局部结构。基于上述结构,我们提出了结构感知提示适应(SPA)方法,使模型能够从已见扩展到未见的属性和对象。具体而言,在训练阶段,我们设计了一种结构感知一致性损失(SCL),以在每次迭代中鼓励已见属性和对象的局部结构一致性。在推理阶段,我们设计了一种结构导向适应策略(SAS),以适应性地将未见属性和对象的结构与训练过的具有相似语义的已见属性和对象的结构对齐。值得注意的是,SPA是一种即插即用的方法,可以无缝集成到现有的CZSL提示调优方法中。在OV-CZSL基准上的广泛实验表明,SPA在闭集性能方面达到了竞争力的同时,显著提高了开放词汇结果。
Summary / 总结
The goal of Open-Vocabulary Compositional Zero-Shot Learning (OV-CZSL) is to recognize attribute-object compositions built from both seen and unseen attributes and objects. This paper proposes Structure-aware Prompt Adaptation (SPA) to enhance the generalization from seen to unseen concepts. SPA includes a Structure-aware Consistency Loss during training to maintain local structure consistency and a Structure-guided Adaptation Strategy during inference to align unseen concepts with seen ones. Experiments show that SPA improves open-vocabulary results while maintaining competitive closed-set performance.
Open-Vocabulary Compositional Zero-Shot Learning (OV-CZSL) 的目标是在开放词汇量设置下识别属性-对象组合。本文提出了结构感知提示适应(SPA)方法,利用嵌入空间中的一致局部结构,使模型能够从已见过的属性和对象推广到未见过的属性和对象。SPA 包括训练阶段的结构感知一致性损失和推理阶段的结构导向适应策略。实验表明,SPA 在 OV-CZSL 基准上实现了与闭集性能相当的结果,并显著提高了开放词汇量结果。
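The inference-time idea in the abstract above, aligning an unseen concept with semantically similar seen concepts, can be illustrated with a small neighbour-transfer sketch. This is not the paper's Structure-guided Adaptation Strategy, whose details the abstract does not give; the function `adapt_unseen`, the top-k cosine neighbourhood, the softmax weighting, and the `temperature` parameter are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarities(query, bank):
    """Cosine similarity between one vector and each row of a matrix."""
    query = query / (np.linalg.norm(query) + 1e-12)
    bank = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-12)
    return bank @ query

def adapt_unseen(unseen_base, seen_base, seen_tuned, k=2, temperature=0.1):
    """Shift an unseen concept embedding using its nearest seen neighbours.

    unseen_base: (dim,) embedding of the unseen concept before any tuning.
    seen_base:   (n, dim) embeddings of seen concepts before tuning.
    seen_tuned:  (n, dim) embeddings of the same concepts after prompt tuning.
    """
    offsets = seen_tuned - seen_base            # what tuning changed per seen concept
    sims = cosine_similarities(unseen_base, seen_base)
    top = np.argsort(-sims)[:k]                 # indices of the k most similar seen concepts
    logits = sims[top] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                    # softmax over the neighbours
    return unseen_base + weights @ offsets[top]
```

For example, an unseen attribute such as "damp" would inherit most of the learned shift of a nearby seen attribute such as "wet", which is the analogy the abstract uses.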
TAP: A Token-Adaptive Predictor Framework for Training-Free Diffusion Acceleration
Authors: Haowei Zhu, Tingxuan Huang, Xing Wang, Tianyu Zhao, Jiexi Wang, Weifeng Chen, Xurui Peng, Fangmin Chen, Junhai Yong, Bin Wang
Venue: CVPR 2026
First: 2026-03-04T07:10:11+00:00 · Latest: 2026-03-04T07:10:11+00:00
Comments: Accepted by CVPR 2026
Abstract
Diffusion models achieve strong generative performance but remain slow at inference due to the need for repeated full-model denoising passes. We present Token-Adaptive Predictor (TAP), a training-free, probe-driven framework that adaptively selects a predictor for each token at every sampling step. TAP uses a single full evaluation of the model's first layer as a low-cost probe to compute proxy losses for a compact family of candidate predictors (instantiated primarily with Taylor expansions of varying order and horizon), then assigns each token the predictor with the smallest proxy error. This per-token "probe-then-select" strategy exploits heterogeneous temporal dynamics, requires no additional training, and is compatible with various predictor designs. TAP incurs negligible overhead while enabling large speedups with little or no perceptual quality loss. Extensive experiments across multiple diffusion architectures and generation tasks show that TAP substantially improves the accuracy-efficiency frontier compared to fixed global predictors and caching-only baselines.
中文标题/摘要
标题:TAP:一种基于令牌自适应预测框架的无训练扩散加速方法
扩散模型在生成性能上表现出色,但在推理时由于需要多次完整的去噪迭代而变得缓慢。我们提出了Token-Adaptive Predictor (TAP),一种无训练、探针驱动的框架,该框架在每次采样步骤中自适应地为每个令牌选择一个预测器。TAP 使用单次完整的模型第一层评估作为低成本探针,计算候选预测器(主要使用不同阶数和范围的泰勒展开)的代理损失,然后将每个令牌分配给代理误差最小的预测器。这种针对每个令牌的“探针-选择”策略利用了异质的时间动态,不需要额外的训练,并且与各种预测器设计兼容。TAP 在几乎不增加开销的情况下,能够实现显著的速度提升,同时几乎不会损失感知质量。在多个扩散架构和生成任务上的广泛实验表明,TAP 显著改善了固定全局预测器和仅缓存基线的准确性和效率前沿。
Summary / 总结
TAP is a training-free framework that adaptively selects predictors for each token during sampling steps in diffusion models. It uses a low-cost probe of the model's first layer to compute proxy losses for candidate predictors, then assigns the predictor with the smallest proxy error to each token. This approach improves the accuracy-efficiency trade-off, enabling large speedups with minimal perceptual quality loss across various diffusion architectures and tasks.
TAP 是一个无需训练的框架,能够在采样步骤中为每个 token 选择预测器。它通过一次完整的模型第一层评估作为探针来计算候选预测器的代理损失,然后将损失最小的预测器分配给每个 token。这种方法提高了准确性和效率,能够在各种扩散模型架构和任务中实现显著的速度提升,同时保持较低的感知质量损失。
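The per-token "probe-then-select" strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released TAP implementation: candidate predictors are finite-difference Taylor forecasts of order 0 to 2 with a unit step, the proxy loss is the L2 error of each forecast against the freshly computed first-layer output, and the names `taylor_forecast` and `probe_then_select` are hypothetical.

```python
import numpy as np

def taylor_forecast(history, order):
    """Forecast the next step from cached features using backward differences.

    history: list of arrays shaped (tokens, dim), oldest first, newest last.
    """
    newest = history[-1]
    if order == 0:
        return newest.copy()                     # reuse the cached feature as is
    d1 = history[-1] - history[-2]               # first backward difference
    if order == 1:
        return newest + d1
    d2 = history[-1] - 2 * history[-2] + history[-3]  # second backward difference
    return newest + d1 + 0.5 * d2

def probe_then_select(probe_history, probe_now, output_history, orders=(0, 1, 2)):
    """Pick, per token, the predictor whose forecast best matches the probe.

    probe_history: cached first-layer outputs from earlier steps.
    probe_now: first-layer output actually computed at the current step.
    output_history: cached final-layer outputs that the chosen predictor extrapolates.
    """
    # Proxy error of every candidate on the cheap probe, shape (num_orders, tokens).
    errors = np.stack([
        np.linalg.norm(taylor_forecast(probe_history, k) - probe_now, axis=-1)
        for k in orders
    ])
    choice = errors.argmin(axis=0)               # best candidate index per token
    candidates = np.stack([taylor_forecast(output_history, k) for k in orders])
    predicted = candidates[choice, np.arange(choice.shape[0])]
    return predicted, np.asarray(orders)[choice]
```

A token whose features are static is served by the order-0 predictor, while a token drifting linearly gets the order-1 forecast; ties go to the lowest order because `argmin` returns the first minimum.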
TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
Authors: Yi Han, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Lu Sheng, Shanghang Zhang
First: 2025-10-08T16:20:23+00:00 · Latest: 2026-03-04T07:07:00+00:00
Comments: 8 pages, 6 figures
Abstract
Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative precision and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric cues from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation-oriented dataset covering point transformations, pose estimation, and spatial compatibility verification, complete with tool invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves SOTA performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
中文标题/摘要
标题:TIGeR: 工具集成几何推理在视觉-语言模型中的应用以实现机器人技术
视觉-语言模型(VLMs)在空间推理方面表现出色,但它们本质上仍然局限于定性的精确度,缺乏实现真实世界机器人技术所需的计算精确度。当前的方法未能利用深度传感器和相机校准的度量线索,而是将几何问题简化为模式识别任务,无法提供机器人操作所需的厘米级精度。我们提出了TIGeR(Tool-Integrated Geometric Reasoning),这是一种新型框架,通过使VLMs能够生成和执行精确的几何计算,从而将它们从感知估计器转变为几何计算机。TIGeR 不试图在神经网络内部实现复杂的几何操作,而是使模型能够识别几何推理需求,合成适当的计算代码,并调用专门的库进行精确计算。为了支持这一范式,我们引入了TIGeR-300K,这是一个全面的工具调用导向数据集,涵盖了点变换、姿态估计和空间兼容性验证,包括工具调用序列和中间计算。通过结合监督微调(SFT)和强化微调(RFT)的两阶段训练管道,以及我们提出的分层奖励设计,TIGeR 在几何推理基准测试中达到了最佳性能,同时在真实世界的机器人操作任务中展示了厘米级的精度。
Summary / 总结
TIGeR is a novel framework that enhances Vision-Language Models (VLMs) for robotics by enabling them to perform precise geometric computations through external tools. This approach transforms VLMs from perceptual estimators to geometric computers, addressing the limitations of qualitative precision in VLMs. TIGeR uses a two-stage training pipeline combining supervised fine-tuning and reinforcement fine-tuning to achieve state-of-the-art performance on geometric reasoning benchmarks and demonstrates centimeter-level precision in real-world robotic manipulation tasks.
TIGeR 是一种新型框架,通过使视觉-语言模型(VLMs)能够生成并执行精确的几何计算,从而增强其在机器人中的几何推理能力。这种方法将 VLMs 从感知估计器转变为几何计算机,解决了 VLMs 在精确度和计算精度方面的局限性。TIGeR 使用结合监督微调和强化微调的两阶段训练管道,实现了几何推理基准的最先进性能,并在实际机器人操作任务中展示了厘米级的精确度。
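To make "metric cues from depth sensors and camera calibration" concrete, the sketch below shows the kind of exact computation a tool call could perform: back-projecting a pixel with known depth through pinhole intrinsics, then mapping the point into another frame with a homogeneous transform. The function names and the example values are illustrative assumptions; TIGeR's actual tool interface is not described in the abstract.

```python
import numpy as np

def backproject(u, v, depth_m, K):
    """Pixel (u, v) with depth in metres -> 3D point in the camera frame (pinhole model)."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def transform_point(T, p):
    """Apply a 4x4 homogeneous transform T to a 3D point p."""
    return (T @ np.append(p, 1.0))[:3]

# Example with assumed calibration values.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T_world_cam = np.eye(4)
T_world_cam[:3, 3] = [1.0, 0.0, 0.0]      # camera sits 1 m along the world x-axis

p_cam = backproject(420, 240, 2.0, K)     # 100 px right of the principal point, 2 m away
p_world = transform_point(T_world_cam, p_cam)
```

Delegating this arithmetic to code gives results limited only by sensor and calibration error, which is the contrast the abstract draws with pattern-recognition estimates.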
Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding
Authors: Junhan Chen, Zilu Zhou, Yujun Tong, Dongliang Chang, Yitao Luo, Zhanyu Ma
First: 2026-03-04T06:18:45+00:00 · Latest: 2026-03-04T06:18:45+00:00
Abstract
Fine-grained visual understanding is shifting from static classification to knowledge-augmented reasoning, where models must justify as well as recognise. Existing approaches remain limited by closed-set taxonomies and single-label prediction, leading to significant degradation under open-set or context-dependent conditions. We present the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA), a unified framework that transforms fine-grained perception into evidence-driven reasoning. KFRA operates through a three-stage closed reasoning loop that emulates expert analysis. It first performs open-vocabulary detection and web-scale retrieval to generate category hypotheses. It then conducts discriminative region localisation by aligning textual knowledge with visual evidence through a global-to-local focusing mechanism. Finally, it integrates all multimodal evidence within a large multimodal model to perform interpretable reasoning. Unlike existing agents that treat retrieval and reasoning as independent processes, KFRA establishes a retrieval-grounding coupling that converts retrieved knowledge into spatially grounded evidence for verification. This design enables factual, interpretable, and task-agnostic reasoning across diverse fine-grained scenarios. To evaluate this capability, we construct FGExpertBench, a benchmark designed to assess reasoning depth and cross-task generalisation across six knowledge dimensions. Extensive experiments demonstrate that KFRA consistently surpasses both standalone large multimodal models and current agent frameworks, achieving up to 19 percent improvement in reasoning accuracy and delivering evidence-grounded interpretability in open-set fine-grained visual understanding.
中文标题/摘要
标题:专家视角下的知识增强代理:开放集细粒度视觉理解
细粒度视觉理解正从静态分类转向知识增强推理,其中模型不仅要识别还要进行解释。现有方法仍受限于封闭集分类体系和单标签预测,导致在开放集或上下文依赖条件下性能显著下降。我们提出了知识增强细粒度推理代理(KFRA),这是一种统一框架,将细粒度感知转化为证据驱动的推理。KFRA 通过一个三阶段封闭推理循环来模拟专家分析。首先,它进行开放词汇检测和大规模网络检索以生成类别假设。然后,通过全局到局部聚焦机制将文本知识与视觉证据对齐,进行区分区域定位。最后,它在大型多模态模型中整合所有多模态证据进行可解释推理。与现有代理将检索和推理视为独立过程不同,KFRA 建立了检索-定位耦合,将检索到的知识转化为空间定位证据进行验证。这种设计使KFRA能够在多种细粒度场景中实现事实性、可解释性和任务无关的推理。为了评估这种能力,我们构建了FGExpertBench,这是一个旨在评估推理深度和跨任务泛化的基准,涵盖六个知识维度。大量实验表明,KFRA 一致地超越了独立的大型多模态模型和现有代理框架,推理准确率提高了高达19%,并在开放集细粒度视觉理解中提供了基于证据的可解释性。
Summary / 总结
The research aims to enhance fine-grained visual understanding by developing a knowledge-augmented agent that can reason and justify its predictions. The Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA) operates in a three-stage closed reasoning loop, starting with open-vocabulary detection and web-scale retrieval, followed by discriminative region localization using a global-to-local focusing mechanism, and finally integrating all evidence for interpretable reasoning. Experiments show that KFRA outperforms both standalone large multimodal models and current agent frameworks, achieving up to 19 percent improvement in reasoning accuracy and providing evidence-grounded interpretability in open-set scenarios.
研究旨在通过开发一个知识增强的代理来提升细粒度视觉理解,该代理能够进行推理和解释其预测,解决闭集分类和单标签预测的局限性。知识增强的细粒度推理代理(KFRA)在一个三阶段封闭推理循环中运行,首先进行类别检测和检索,然后定位区分性区域,最后整合多模态证据进行推理。实验表明,KFRA 在推理准确性和开放集场景中的解释性方面均优于现有模型和框架,最高可提高 19% 的推理准确性。
When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
Authors: Yongli Xiang, Ziming Hong, Zhaoqing Wang, Xiangyu Zhao, Bo Han, Tongliang Liu
Venue: CVPR 2026
First: 2026-02-24T13:20:31+00:00 · Latest: 2026-03-04T06:07:03+00:00
Comments: CVPR 2026; Code is released at https://github.com/tmllab/2026_CVPR_CASG
Abstract
Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to "harmful conflicts" where mitigating one type of harm may inadvertently amplify another, thus increasing overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model's evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
中文标题/摘要
标题:安全碰撞:通过自适应安全指导解决文本到图像扩散中的多类别有害冲突
文本到图像(T2I)扩散模型在生成高质量图像方面取得了显著进展,但同时也引发了关于有害内容生成的安全问题。基于安全指导的方法已被提出,通过引导生成远离预定义关键词定义的有害区域来减轻有害输出。然而,这些方法未能捕捉不同有害类别之间的复杂相互作用,导致“有害冲突”,即减轻一种有害类型的同时可能无意中放大另一种,从而增加整体有害率。为解决这一问题,我们提出了冲突感知自适应安全指导(CASG),这是一种无需训练的框架,在生成过程中动态识别并应用与模型生成状态最一致的有害类别方向。CASG 包含两个组件:(i) 冲突感知类别识别(CaCI),识别与模型生成状态最一致的有害类别,(ii) 冲突解决指导应用(CrGA),仅沿识别的类别应用安全引导,以避免多类别干扰。CASG 可应用于潜在空间和文本空间的安全保护。在 T2I 安全基准上的实验表明,CASG 达到了最先进的性能,与现有方法相比,有害率最多降低了 15.4%。
Summary / 总结
The paper addresses the challenge of harmful content generation in text-to-image diffusion models by proposing Conflict-aware Adaptive Safety Guidance (CASG), which dynamically identifies and applies the most relevant safety direction during generation to avoid multi-category harmful conflicts. Experiments show that CASG reduces the harmful rate by up to 15.4% compared to existing methods.
论文提出了一种冲突感知自适应安全指导(CASG)框架,该框架能够动态地识别并应用类别对齐的安全方向,以避免多类别有害冲突。CASG 包含冲突感知类别识别(CaCI)和冲突解决指导应用(CrGA)两个组件,并在 T2I 安全基准测试中表现出色,相比现有方法可将有害率降低高达 15.4%。
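The contrast between averaging safety directions and steering along a single identified category can be sketched as below. This is an illustration of the idea only: representing the "generative state" as a vector, using cosine similarity for alignment, and subtracting a scaled direction from the prediction are assumptions, and the function names are hypothetical rather than taken from the released CASG code.

```python
import numpy as np

def averaged_direction(category_dirs):
    """Baseline: one safety direction averaged over all harm categories."""
    return np.mean(category_dirs, axis=0)

def most_aligned_category(state_dir, category_dirs):
    """Index of the harm category whose direction best matches the current state."""
    dirs = np.asarray(category_dirs, dtype=float)
    state = np.asarray(state_dir, dtype=float)
    sims = dirs @ state / (np.linalg.norm(dirs, axis=1) * np.linalg.norm(state) + 1e-12)
    return int(np.argmax(sims))

def category_aligned_guidance(pred, state_dir, category_dirs, scale):
    """Steer the prediction away from the single most aligned harm direction."""
    idx = most_aligned_category(state_dir, category_dirs)
    steered = np.asarray(pred, dtype=float) - scale * np.asarray(category_dirs[idx], dtype=float)
    return steered, idx
```

Two opposing category directions such as `[1, 0]` and `[-1, 0]` average to zero, so the averaged baseline applies no steering at all in that case, whereas the category-aligned variant still pushes away from whichever direction the state currently matches.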
Stochastic Self-Guidance for Training-Free Enhancement of Diffusion Models
Authors: Chubin Chen, Jiashu Zhu, Xiaokun Feng, Nisha Huang, Chen Zhu, Meiqi Wu, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Xiu Li
Venue: ICLR 2026
First: 2025-08-18T12:31:20+00:00 · Latest: 2026-03-04T05:55:00+00:00
Comments: Accepted by ICLR 2026
Abstract
Classifier-free Guidance (CFG) is a widely used technique in modern diffusion models for enhancing sample quality and prompt adherence. However, through an empirical analysis on Gaussian mixture modeling with a closed-form solution, we observe a discrepancy between the suboptimal results produced by CFG and the ground truth. The model's excessive reliance on these suboptimal predictions often leads to semantic incoherence and low-quality outputs. To address this issue, we first empirically demonstrate that the model's suboptimal predictions can be effectively refined using sub-networks of the model itself. Building on this insight, we propose S$^2$-Guidance, a novel method that leverages stochastic block-dropping during the forward process to construct stochastic sub-networks, effectively guiding the model away from potential low-quality predictions and toward high-quality outputs. Extensive qualitative and quantitative experiments on text-to-image and text-to-video generation tasks demonstrate that S$^2$-Guidance delivers superior performance, consistently surpassing CFG and other advanced guidance strategies. Our code will be released.
中文标题/摘要
标题:随机自我引导以实现无需训练增强的扩散模型提升
无分类器引导(CFG)是现代扩散模型中广泛使用的一种技术,用于提高样本质量和指令一致性。然而,通过对具有闭式解的高斯混合模型的实证分析,我们观察到CFG产生的次优结果与真实值之间存在差异。模型对这些次优预测的过度依赖往往导致语义不一致和低质量的输出。为了解决这一问题,我们首先实证证明,可以使用模型自身的子网络有效精炼模型的次优预测。在此基础上,我们提出了一种新颖的方法S$^2$-引导,该方法在前向过程中利用随机块丢弃构建随机子网络,有效地引导模型远离潜在的低质量预测,趋向高质量输出。在文本到图像和文本到视频生成任务上的广泛定性和定量实验表明,S$^2$-引导提供了优越的性能,始终优于CFG和其他先进的引导策略。我们的代码将被发布。
Summary / 总结
The paper addresses the issue of suboptimal results produced by Classifier-free Guidance (CFG) in diffusion models, which often lead to semantic incoherence and low-quality outputs. To improve this, the authors propose S$^2$-Guidance, a method that uses stochastic block-dropping to construct sub-networks and guide the model towards high-quality outputs. Experiments show that S$^2$-Guidance outperforms CFG and other advanced guidance strategies in both text-to-image and text-to-video generation tasks.
论文针对Classifier-free Guidance (CFG) 在扩散模型中产生的次优结果导致的语义不一致和低质量输出问题,提出了S$^2$-Guidance 方法,该方法通过随机块丢弃来引导模型生成高质量的预测。实验结果显示,S$^2$-Guidance 在文本到图像和文本到视频生成任务上的定性和定量评估中均优于CFG和其他高级引导策略。
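For reference, standard Classifier-free Guidance combines the unconditional and conditional noise predictions as shown in `cfg` below. The second function is only a hedged illustration of guiding away from a weaker sub-network prediction: the abstract does not state the S$^2$-Guidance update rule, so the extra term, the `omega` weight, and the block-dropping helper are assumptions, not the authors' formula.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance with guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def sample_block_mask(num_blocks, drop_ratio, rng):
    """Randomly choose which blocks a stochastic sub-network keeps (True = keep)."""
    return rng.random(num_blocks) >= drop_ratio

def self_guided(eps_uncond, eps_cond, eps_cond_sub, w, omega):
    """CFG plus an assumed term that pushes away from the sub-network's prediction.

    eps_cond_sub: conditional prediction from a forward pass with some blocks dropped.
    With omega = 0 this reduces exactly to standard CFG.
    """
    return cfg(eps_uncond, eps_cond, w) + omega * (eps_cond - eps_cond_sub)
```

In a sampler, `eps_cond_sub` would come from rerunning the denoiser with the blocks marked False in `sample_block_mask` skipped; how many blocks to drop and how to weight the term are design choices the abstract leaves unspecified.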