arXiv 论文速递

Snapshot: 20260318_0405

Fast SAM 3D Body: Accelerating SAM 3D Body for Real-Time Full-Body Human Mesh Recovery

Authors: Timing Yang, Sicheng He, Hongyi Jing, Jiawei Yang, Zhijian Liu, Chuhang Zou, Yue Wang

First: 2026-03-16T17:54:40+00:00 · Latest: 2026-03-16T17:54:40+00:00

Abstract

SAM 3D Body (3DB) achieves state-of-the-art accuracy in monocular 3D human mesh recovery, yet its inference latency of several seconds per image precludes real-time application. We present Fast SAM 3D Body, a training-free acceleration framework that reformulates the 3DB inference pathway to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, we enable parallelized multi-crop feature extraction and streamlined transformer decoding. Moreover, to extract the joint-level kinematics (SMPL) compatible with existing humanoid control and policy learning frameworks, we replace the iterative mesh fitting with a direct feedforward mapping, accelerating this specific conversion by over 10,000x. Overall, our framework delivers up to a 10.9x end-to-end speedup while maintaining on-par reconstruction fidelity, even surpassing 3DB on benchmarks such as LSPET. We demonstrate its utility by deploying Fast SAM 3D Body in a vision-only teleoperation system that-unlike methods reliant on wearable IMUs-enables real-time humanoid control and the direct collection of manipulation policies from a single RGB stream.

中文标题/摘要

标题：Fast SAM 3D人体：加速SAM 3D人体以实现实时全身人体网格恢复

SAM 3D人体（3DB）在单目3D人体网格恢复方面达到了最先进的准确性，但由于其每张图像几秒的推理延迟，无法实现实时应用。我们提出了一种无需训练的加速框架Fast SAM 3D人体，重新构建了3DB的推理路径，以实现交互速率。通过解耦序列空间依赖性并应用架构感知剪枝，我们使多裁剪特征提取并行化，并简化了变压器解码。此外，为了提取与现有类人控制和策略学习框架兼容的关节级运动学（SMPL），我们用直接前馈映射替代了迭代网格拟合，这使得这种特定转换加速了超过10000倍。总体而言，我们的框架在保持重建保真度的同时，实现了高达10.9倍的端到端加速，甚至在LSPET等基准测试中超过了3DB。我们通过部署Fast SAM 3D人体在仅基于视觉的远程操作系统中展示了其实用性，该系统与依赖穿戴式IMU的方法不同，能够实现类人控制并直接从单个RGB流中收集操作策略。

Summary / 总结

Fast SAM 3D Body is a training-free acceleration framework that reformulates the inference pathway of SAM 3D Body to achieve interactive rates. By decoupling serial spatial dependencies and applying architecture-aware pruning, it enables parallelized multi-crop feature extraction and streamlined transformer decoding. This framework also replaces iterative mesh fitting with a direct feedforward mapping, accelerating joint-level kinematics extraction by over 10,000x. The result is up to a 10.9x end-to-end speedup while maintaining reconstruction fidelity, surpassing SAM 3D Body on benchmarks like LSPET. It is demonstrated to enable real-time humanoid control and direct policy learning from a single RGB stream in a vision-only teleoperation system.

Fast SAM 3D Body 是一种加速 SAM 3D Body 的框架，将推理速度从每张图像几秒提高到交互速率。它通过解耦空间依赖性、应用剪枝和启用并行特征提取和简化解码来实现这一目标。该框架还将关节级运动学提取加速了超过 10,000 倍，总体上实现了最高 10.9 倍的端到端加速，同时保持了相当的重建精度，甚至在 LSPET 等基准测试中超过了 SAM 3D Body。这使其适用于实时应用，如仅基于视觉的遥控系统。

Panoramic Affordance Prediction

Authors: Zixin Zhang, Chenfei Liao, Hongfei Zhang, Harold Haodong Chen, Kanghao Chen, Zichen Wen, Litao Guo, Bin Ren, Xu Zheng, Yinchuan Li, Xuming Hu, Nicu Sebe, Ying-Cong Chen

First: 2026-03-16T17:21:49+00:00 · Latest: 2026-03-16T17:21:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12k, 11904 x 5952) panoramic images with over 12k carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail due to the unique challenges of panoramic vision. In contrast, PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.

中文标题/摘要

标题：全景功能预测

功能预测是将感知与行动在具身人工智能中联系起来的关键桥梁。然而，现有的研究局限于针孔相机模型，这些模型视野狭窄且观察片段化，经常缺失关键的整体环境背景。在本文中，我们首次探索全景功能预测，利用360度图像捕捉全局空间关系和整体场景理解。为了促进这一新型任务，我们首先引入了PAP-12K，这是一个大规模基准数据集，包含超过1,000张超高分辨率（12k，11904 x 5952）全景图像，以及超过12k个仔细标注的问答对和功能掩码。此外，我们提出了PAP，一种无需训练、从粗到细的管道，灵感来源于人类的中心视觉系统，以应对全景图像中的超高清分辨率和严重的失真问题。PAP 通过递归视觉路由和网格提示逐步定位目标，应用自适应凝视机制校正局部几何失真，并利用级联定位管道提取精确的实例级掩码。在PAP-12K上的实验结果表明，现有的设计用于标准视角图像的功能预测方法在全景视觉的独特挑战面前表现严重下降并失败。相比之下，PAP框架有效地克服了这些障碍，显著优于最先进的基线方法，突显了全景感知在稳健的具身智能中的巨大潜力。

Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models

Authors: Lexiang Xiong, Qi Li, Jingwen Ye, Xinchao Wang

First: 2026-03-16T17:20:38+00:00 · Latest: 2026-03-16T17:20:38+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) frequently "hallucinate" - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model's computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM's generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory's geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection is counts as a geometric anomaly detection problem. Evaluated across diverse settings - from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO) - our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.

中文标题/摘要

标题：谎言剖析：视觉-语言模型中幻觉的多阶段诊断框架

视觉-语言模型（VLMs）经常“产生幻觉”——生成看似合理但实际上不正确的陈述，这构成了它们可靠部署的关键障碍。在本文中，我们提出了一种新的诊断幻觉的范式，将幻觉重新定义为模型计算认知动态病态。我们的框架基于计算理性规范原则，使我们能够将VLM的生成建模为动态认知轨迹。我们设计了一套信息论探针，将此轨迹投影到可解释的低维认知状态空间中。我们的主要发现是一种我们称之为几何-信息二元性的原则：此空间中认知轨迹的几何异常本质上等同于其高信息论惊讶度。幻觉检测被视为几何异常检测问题。在从严格的二元问答（POPE）和全面推理（MME）到不受限制的开放生成（MS-COCO）的各种场景中，我们的框架实现了最先进的性能。关键的是，它在弱监督下高效运行，并且即使校准数据严重污染，也保持高度稳健。这种方法使我们能够对失败进行因果归因，将可观察的错误映射到不同的病理状态：感知不稳定性（通过感知熵测量）、逻辑因果失败（通过推理冲突测量）和决策模糊性（通过决策熵测量）。最终，这为构建透明、可审计和可诊断的AI系统开辟了道路。

Summary / 总结

The paper addresses the issue of hallucinations in Vision-Language Models (VLMs), proposing a multi-stage diagnostic framework that models hallucinations as dynamic pathologies. The framework uses information-theoretic probes to project the model's cognitive trajectory into a low-dimensional space, identifying geometric anomalies as indicators of hallucinations. The approach achieves state-of-the-art performance across various tasks and is robust under weak supervision and contaminated calibration data, enabling detailed causal attribution of model failures into perceptual instability, logical-causal failure, and decisional ambiguity.

该论文针对视觉-语言模型（VLM）中的幻觉问题，提出了一种多阶段诊断框架，将幻觉视为模型计算认知过程中的动态病理。该框架使用信息论探针将模型的认知轨迹投影到低维空间中，通过几何异常来识别幻觉。该方法在各种任务中达到了最先进的性能，并且在弱监督和污染校准数据下仍然非常 robust，能够将模型失败详细归因于感知不稳定性、逻辑因果失败和决策模糊性。

Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models

Authors: Youwei Liu, Jian Wang, Hanlin Wang, Beichen Guo, Wenjie Li

First: 2026-01-13T19:49:58+00:00 · Latest: 2026-03-16T16:12:29+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (\texttt{ITP}), a unified framework for agent learning via lookahead imagination, where an agent's policy model interacts with the learned world model, yielding multi-step ``imagined'' trajectories. Since the imagination horizon may vary by tasks and stages, we introduce a novel adaptive lookahead mechanism by trading off the ultimate goal and task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts, which are fused with current observations, formulating a partially \textit{observable} and \textit{imaginable} Markov decision process to guide policy learning. We instantiate \texttt{ITP} with both training-free and reinforcement-trained variants. Extensive experiments across representative agent benchmarks demonstrate that \texttt{ITP} significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead largely enhances agents' reasoning capability, providing valuable insights into addressing broader, complex tasks. Our code and data will be publicly available at https://github.com/loyiv/ITP.

中文标题/摘要

标题：想象然后规划：基于世界模型的自适应前瞻学习

世界模型的最新进展表明，它们能够预测环境状态的未来动态，使代理能够在无需访问真实环境的情况下进行推理和行动。当前方法主要进行单步或固定时距的展开，其在复杂任务规划方面的潜力尚未得到充分利用。我们提出了想象然后规划（ exttt{ITP}），这是一种通过前瞻想象进行代理学习的统一框架，其中代理的策略模型与学习到的世界模型交互，生成多步“想象”轨迹。由于想象的时距可能因任务和阶段而异，我们引入了一种新颖的自适应前瞻机制，通过权衡最终目标和任务进展来实现。由此产生的想象轨迹提供了关于未来后果的丰富信号，如实现的进展和潜在的冲突，这些信号与当前观察结果融合，形成部分可观测和可想象的马尔可夫决策过程，以指导策略学习。我们使用训练无监督和强化训练的变体实例化了 exttt{ITP}。广泛的实验表明， exttt{ITP} 显著优于竞争性基线。进一步的分析验证了我们自适应前瞻机制大大增强了代理的推理能力，为解决更广泛、更复杂的任务提供了有价值的见解。我们的代码和数据将在 https://github.com/loyiv/ITP 公开发布。

Summary / 总结

The research aims to enhance agent learning through adaptive lookahead imagination using world models. The method, Imagine-then-Plan (ITP), allows agents to generate multi-step imagined trajectories by interacting with a learned world model, which helps in complex task planning. The adaptive lookahead mechanism adjusts the imagination horizon based on task requirements, providing rich signals for policy learning. Experiments show that ITP outperforms existing methods and improves agents' reasoning capabilities for complex tasks.

研究旨在通过世界模型实现自适应前瞻，提升代理学习能力，解决单一步骤或固定时距回放的局限性。Imagine-then-Plan (ITP)框架允许代理生成多步想象轨迹，用于指导策略学习。自适应前瞻机制根据任务需求调整想象时间范围，提供关于未来后果的丰富信号。实验表明，ITP在复杂任务中显著优于现有方法，通过提升代理的推理能力实现更优表现。

UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections

Authors: Zeyu Cai, Ziyang Li, Xiaoben Li, Boqian Li, Zeyu Wang, Zhenyu Zhang, Yuliang Xiu

Venue: ICLR

First: 2025-09-29T14:06:00+00:00 · Latest: 2026-03-16T15:56:20+00:00

Comments: Page: https://zcai0612.github.io/UP2You Code: https://github.com/zcai0612/UP2You

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

We present UP2You, the first tuning-free solution for reconstructing high-fidelity 3D clothed portraits from extremely unconstrained in-the-wild 2D photos. Unlike previous approaches that require "clean" inputs (e.g., full-body images with minimal occlusions, or well-calibrated cross-view captures), UP2You directly processes raw, unstructured photographs, which may vary significantly in pose, viewpoint, cropping, and occlusion. Instead of compressing data into tokens for slow online text-to-3D optimization, we introduce a data rectifier paradigm that efficiently converts unconstrained inputs into clean, orthogonal multi-view images in a single forward pass within seconds, simplifying the 3D reconstruction. Central to UP2You is a pose-correlated feature aggregation module (PCFA), that selectively fuses information from multiple reference images w.r.t. target poses, enabling better identity preservation and nearly constant memory footprint, with more observations. We also introduce a perceiver-based multi-reference shape predictor, removing the need for pre-captured body templates. Extensive experiments on 4D-Dress, PuzzleIOI, and in-the-wild captures demonstrate that UP2You consistently surpasses previous methods in both geometric accuracy (Chamfer-15%, P2S-18% on PuzzleIOI) and texture fidelity (PSNR-21%, LPIPS-46% on 4D-Dress). UP2You is efficient (1.5 minutes per person), and versatile (supports arbitrary pose control, and training-free multi-garment 3D virtual try-on), making it practical for real-world scenarios where humans are casually captured. Both models and code will be released to facilitate future research on this underexplored task. Project Page: https://zcai0612.github.io/UP2You

中文标题/摘要

标题：UP2You：从极不约束的个人照片集合快速重建自己

我们提出了UP2You，这是首个无需调优即可从野外极不约束的2D照片中重建高保真3D着装肖像的解决方案。与之前需要“干净”输入（例如，无遮挡的全身照片或跨视角的校准捕获）的方法不同，UP2You可以直接处理原始且未结构化的照片，这些照片在姿态、视角、裁剪和遮挡方面可能差异很大。我们没有将数据压缩成令牌以进行缓慢的在线文本到3D优化，而是引入了一种数据校正范式，在单次前向传递中在几秒内将不约束的输入高效地转换为干净的、正交的多视角图像，简化了3D重建。UP2You的核心是一个与姿态相关的特征聚合模块（PCFA），该模块根据目标姿态选择性地融合多个参考图像的信息，从而实现更好的身份保留和几乎恒定的内存占用，随着观测次数的增加。我们还引入了一种基于感知器的多参考形状预测器，消除了预先捕获的身体模板的需要。在4D-Dress、PuzzleIOI和野外捕获的数据集上的大量实验表明，UP2You在几何精度（Chamfer-15%，P2S-18%在PuzzleIOI上）和纹理保真度（PSNR-21%，LPIPS-46%在4D-Dress上）方面均优于先前的方法。UP2You高效（每人1.5分钟），且通用（支持任意姿态控制和无需训练的多件服装3D虚拟试穿），使其适用于人类随意被捕获的真实场景。模型和代码将被发布，以促进对该未充分探索任务的未来研究。项目页面：https://zcai0612.github.io/UP2You

MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings

Authors: Shahil Shaik, Aditya Parameshwaran, Anshul Nayak, Jonathon M. Smereka, Yue Wang

First: 2026-03-16T15:29:41+00:00 · Latest: 2026-03-16T15:29:41+00:00

Comments: 7 pages, 6 figures

Abs · PDF · Code1 · Code2

Abstract

Multi-agent reinforcement learning (MARL) commonly relies on a centralized critic to estimate the value function. However, learning such a critic from scratch is highly sample-inefficient and often lacks generalization across environments. At the same time, large vision-language-action models (VLAs) trained on internet-scale data exhibit strong multimodal reasoning and zero-shot generalization capabilities, yet directly deploying them for robotic execution remains computationally prohibitive, particularly in heterogeneous multi-robot systems with diverse embodiments and resource constraints. To address these challenges, we propose Multi-Agent Vision-Language-Critic Models (MA-VLCM), a framework that replaces the learned centralized critic in MARL with a pretrained vision-language model fine-tuned to evaluate multi-agent behavior. MA-VLCM acts as a centralized critic conditioned on natural language task descriptions, visual trajectory observations, and structured multi-agent state information. By eliminating critic learning during policy optimization, our approach significantly improves sample efficiency while producing compact execution policies suitable for deployment on resource-constrained robots. Results show good zero-shot return estimation on models with differing VLM backbones on in-distribution and out-of-distribution scenarios in multi-agent team settings

中文标题/摘要

标题：MA-VLCM：多智能体团队设置中政策价值估计的视觉语言批评模型

多智能体强化学习（MARL）通常依赖于中心化的批评家来估计价值函数。然而，从头学习这样的批评家是高度样本效率低下的，并且往往在不同环境中缺乏泛化能力。同时，大规模的视觉-语言-行动模型（VLAs）在互联网规模的数据上进行训练，表现出强大的多模态推理和零样本泛化能力，但在机器人执行中直接部署仍然是计算上不可行的，特别是在具有多样化机体和资源限制的异构多机器人系统中。为了解决这些挑战，我们提出了多智能体视觉语言批评模型（MA-VLCM），该框架用预训练的视觉语言模型替换MARL中的学习到的中心化批评家，该模型微调以评估多智能体行为。MA-VLCM作为基于自然语言任务描述、视觉轨迹观察和结构化多智能体状态信息的中心化批评家。通过在策略优化过程中消除批评家学习，我们的方法显著提高了样本效率，同时生成适合在资源受限的机器人上部署的紧凑执行策略。结果显示，在多智能体团队设置中的不同分布内和分布外场景中，使用不同VLB后端模型的模型具有良好的零样本回报估计

Summary / 总结

The paper addresses the challenges of learning a centralized critic in multi-agent reinforcement learning (MARL) by proposing MA-VLCM, which uses a pretrained vision-language model fine-tuned for evaluating multi-agent behavior. This approach improves sample efficiency and produces compact policies suitable for resource-constrained robots, demonstrating good zero-shot return estimation across different vision-language model backbones in various scenarios.

论文提出MA-VLCM框架，使用预训练的视觉-语言模型微调以评估多智能体行为，解决了MARL中学习集中式批评家的挑战。该方法提高了样本效率并生成适合资源受限机器人部署的紧凑策略，在不同场景中展示了良好的零样本回报估计能力。

Pointing-Based Object Recognition

Authors: Lukáš Hajdúch, Viktor Kocur

First: 2026-03-16T15:16:53+00:00 · Latest: 2026-03-16T15:16:53+00:00

Comments: Submitted to InnovAIte conference

Abs · PDF · Code1 · Code2

Abstract

This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are unavailable.

中文标题/摘要

标题：基于指针的物体识别

本文提出了一种全面的管道，用于识别人类指向手势所瞄准的物体，使用的是RGB图像。随着人机交互向更直观的界面发展，识别非言语交流目标的能力变得至关重要。我们提出的系统结合了多种现有的先进技术，包括物体检测、人体姿态估计、单目深度估计和视觉语言模型。我们评估了从单张图像重建的3D空间信息的影响，以及图像描述模型在纠正分类错误方面的实用性。在自定义数据集上的实验结果表明，结合深度信息显著提高了目标识别的准确性，尤其是在复杂场景中重叠物体较多的情况下。该方法的模块化特性使其能够在没有专门深度传感器的环境中部署。

Summary / 总结

This paper introduces a pipeline for recognizing objects pointed to by human gestures using RGB images. The system combines object detection, body pose estimation, monocular depth estimation, and vision-language models. The study evaluates the effectiveness of 3D spatial information and image captioning models in improving target identification, demonstrating that depth information enhances performance, particularly in complex scenes. The modular design enables deployment in environments without specialized depth sensors.

本文提出了一种使用RGB图像识别人类手势所指向物体的管道。该系统旨在促进更直观的人机交互，结合了物体检测、人体姿态估计、单目深度估计和视觉语言模型。研究评估了3D空间信息和图像描述模型的益处。在自定义数据集上的实验结果表明，集成深度信息可以提高目标识别能力，尤其是在杂乱场景中重叠物体较多的情况下。模块化的设计使得该系统能够在没有专用深度传感器的环境中部署。

MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction

Authors: Jiacheng Dong, Huan Li, Sicheng Zhou, Wenhao Hu, Weili Xu, Yan Wang

First: 2026-03-16T14:21:19+00:00 · Latest: 2026-03-16T14:21:19+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Reconstruction is a fundamental task in 3D vision and a fundamental capability for spatial intelligence. Particularly, streaming 3D reconstruction is central to real-time spatial perception, yet existing recurrent online models often suffer from progressive degradation on long sequences due to state drift and forgetting, motivating inference-time remedies. We present MeMix, a training-free, plug-and-play module that improves streaming reconstruction by recasting the recurrent state into a Memory Mixture. MeMix partitions the state into multiple independent memory patches and updates only the least-aligned memory patches while exactly preserving others. This selective update mitigates catastrophic forgetting while retaining $O(1)$ inference memory, and requires no fine-tuning or additional learnable parameters, making it directly applicable to existing recurrent reconstruction models. Across standard benchmarks (ScanNet, 7-Scenes, KITTI, etc.), under identical backbones and inference settings, MeMix reduces reconstruction completeness error by 15.3% on average (up to 40.0%) across 300--500 frame streams on 7-Scenes. The code is available at https://dongjiacheng06.github.io/MeMix/

中文标题/摘要

标题：MeMix：少写多记，用于流式3D重建

重建是3D视觉中的一个基本任务，也是空间智能的一项基本能力。特别是，流式3D重建对于实时空间感知至关重要，但现有的递归在线模型往往由于状态漂移和遗忘而在长序列上遭受性能退化，促使人们寻找推理时的补救措施。我们提出了MeMix，这是一种无需训练、即插即用的模块，通过将递归状态重新定义为记忆混合物来改进流式重建。MeMix将状态划分为多个独立的记忆片段，并仅更新最不对齐的记忆片段，而精确保留其他部分。这种选择性更新减轻了灾难性遗忘，同时保持了$O(1)$的推理内存需求，且无需微调或额外的学习参数，使其可以直接应用于现有的递归重建模型。在标准基准（ScanNet、7-Scenes、KITTI等）上，使用相同的骨干网络和推理设置，MeMix在7-Scenes的300-500帧流中平均将重建完整性误差降低了15.3%（最高40.0%）。代码可在https://dongjiacheng06.github.io/MeMix/ 获取

Summary / 总结

The motivation for this work is to address the issue of state drift and forgetting in long sequences of streaming 3D reconstruction. The method introduced is MeMix, a training-free module that partitions the recurrent state into multiple independent memory patches and selectively updates only the least-aligned patches. This approach mitigates catastrophic forgetting while maintaining constant inference memory. Experiments on standard benchmarks show that MeMix reduces reconstruction completeness error by 15.3% on average, up to 40.0% on 7-Scenes sequences with 300-500 frames.

MeMix 是一个无需训练的模块，通过将递归状态重新铸造成 Memory Mixture，并仅选择性地更新最不匹配的记忆片段，同时保留其他片段不变。这种方法可以缓解灾难性遗忘并保持恒定的推理内存。在各种基准测试中，MeMix 将重建完整性误差减少了平均 15.3%，最多 40.0% 的 7-Scenes 流中的 300-500 帧。

Ultralytics YOLO Evolution: An Overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 Object Detectors for Computer Vision and Pattern Recognition

Authors: Ranjan Sapkota, Manoj Karkee

First: 2025-10-06T23:28:44+00:00 · Latest: 2026-03-16T13:48:48+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper presents a comprehensive overview of the Ultralytics YOLO(You Only Look Once) family of object detectors, focusing the architectural evolution, benchmarking, deployment perspectives, and future challenges. The review begins with the most recent release, YOLO26 (or YOLOv26), which introduces key innovations including Distribution Focal Loss (DFL) removal, native NMS-free inference, Progressive Loss Balancing (ProgLoss), Small-Target-Aware Label Assignment (STAL), and the MuSGD optimizer for stable training. The progression is then traced through YOLO11, with its hybrid task assignment and efficiency-focused modules; YOLOv8, which advanced with a decoupled detection head and anchor-free predictions; and YOLOv5, which established the modular PyTorch foundation that enabled modern YOLO development. Benchmarking on the MS COCO dataset provides a detailed quantitative comparison of YOLOv5, YOLOv8, YOLO11, and YOLO26 (YOLOv26), alongside cross-comparisons with YOLOv12, YOLOv13, RT-DETR, and DEIM(DETR with Improved Matching). Metrics including precision, recall, F1 score, mean Average Precision, and inference speed are analyzed to highlight trade-offs between accuracy and efficiency. Deployment and application perspectives are further discussed, covering export formats, quantization strategies, and real-world use in robotics, agriculture, surveillance, and manufacturing. Finally, the paper identifies challenges and future directions, including dense-scene limitations, hybrid CNN-Transformer integration, open-vocabulary detection, and edge-aware training approaches. (Object Detection, YOLOv26, YOLO)

中文标题/摘要

标题：Ultralytics YOLO进化概述：YOLO26、YOLO11、YOLOv8和YOLOv5目标检测器在计算机视觉和模式识别中的应用

本文全面概述了Ultralytics YOLO(你只需看一次)家族的目标检测器，重点介绍了架构演变、基准测试、部署前景和未来挑战。回顾从最近发布的YOLO26（或YOLOv26）开始，它引入了关键创新，包括分布焦点损失（DFL）移除、原生无NMS推理、渐进损失平衡（ProgLoss）、小目标感知标签分配（STAL）和MuSGD优化器以实现稳定的训练。然后追溯到YOLO11，其混合任务分配和效率导向模块；YOLOv8，它通过解耦检测头和无锚预测推进；以及YOLOv5，它建立了模块化的PyTorch基础，使现代YOLO开发成为可能。在MS COCO数据集上的基准测试提供了YOLOv5、YOLOv8、YOLO11和YOLO26（YOLOv26）的详细定量比较，以及与YOLOv12、YOLOv13、RT-DETR和DEIM（改进匹配的DETR）的交叉比较。分析了包括精确度、召回率、F1分数、平均精度和推理速度在内的指标，以突出准确性和效率之间的权衡。进一步讨论了部署和应用前景，包括导出格式、量化策略以及在机器人技术、农业、监控和制造业中的实际应用。最后，本文指出了挑战和未来方向，包括密集场景限制、混合CNN-Transformer集成、开放词汇检测和边缘感知训练方法。（目标检测，YOLOv26，YOLO）

Flash-Unified: A Training-Free and Task-Aware Acceleration Framework for Native Unified Models

Authors: Junlong Ke, Zichen Wen, Boxue Yang, Yantai Yang, Xuyang Liu, Chenfei Liao, Zhaorun Chen, Shaobo Wang, Linfeng Zhang

Venue: CVPR 2026

First: 2026-03-16T13:37:55+00:00 · Latest: 2026-03-16T13:37:55+00:00

Comments: Accepted by CVPR 2026 Findings

Abs · PDF · Code1 · Code2 · Code3

Abstract

Native unified multimodal models, which integrate both generative and understanding capabilities, face substantial computational overhead that hinders their real-world deployment. Existing acceleration techniques typically employ a static, monolithic strategy, ignoring the fundamental divergence in computational profiles between iterative generation tasks (e.g., image generation) and single-pass understanding tasks (e.g., VQA). In this work, we present the first systematic analysis of unified models, revealing pronounced parameter specialization, where distinct neuron sets are critical for each task. This implies that, at the parameter level, unified models have implicitly internalized separate inference pathways for generation and understanding within a single architecture. Based on these insights, we introduce a training-free and task-aware acceleration framework, FlashU, that tailors optimization to each task's demands. Across both tasks, we introduce Task-Specific Network Pruning and Dynamic Layer Skipping, aiming to eliminate inter-layer and task-specific redundancy. For visual generation, we implement a time-varying control signal for the guidance scale and a temporal approximation for the diffusion head via Diffusion Head Cache. For multimodal understanding, building upon the pruned model, we introduce Dynamic Token Pruning via a V-Norm Proxy to exploit the spatial redundancy of visual inputs. Extensive experiments on Show-o2 demonstrate that FlashU achieves 1.78$\times$ to 2.01$\times$ inference acceleration across both understanding and generation tasks while maintaining SOTA performance, outperforming competing unified models and validating our task-aware acceleration paradigm. Our code is publicly available at https://github.com/Rirayh/FlashU.

中文标题/摘要

标题：Flash-Unified：一种无需训练且任务感知的加速框架，用于原生统一模型

原生统一多模态模型结合了生成和理解能力，面临巨大的计算开销，阻碍了其在实际中的部署。现有的加速技术通常采用静态、单一的策略，忽视了迭代生成任务（如图像生成）和单次通过理解任务（如VQA）在计算特征上的根本差异。在本文中，我们首次系统地分析了统一模型，揭示了显著的参数专业化现象，其中不同的神经元集对每个任务至关重要。这表明，在参数层面，统一模型已经隐式地在一个架构中内化了生成和理解的独立推理路径。基于这些见解，我们引入了一种无需训练且任务感知的加速框架FlashU，该框架针对每个任务的需求进行优化。在两个任务中，我们引入了任务特定网络剪枝和动态层跳过，旨在消除层间和任务特定的冗余。对于视觉生成，我们实现了一个时间变化的控制信号来控制指导尺度，并通过扩散头缓存实现扩散头的时序近似。对于多模态理解，基于剪枝模型，我们引入了通过V-范数代理实现的动态标记剪枝，以利用视觉输入的空间冗余。在Show-o2上的广泛实验表明，FlashU在理解任务和生成任务上分别实现了1.78$\times$到2.01$\times$的推理加速，同时保持了SOTA性能，优于竞争的统一模型，验证了我们的任务感知加速范式。我们的代码可在https://github.com/Rirayh/FlashU公开获取。

Exemplar Diffusion: Improving Medical Object Detection with Opportunistic Labels

Authors: Victor Wåhlstrand, Jennifer Alvén, Ida Häggström

Venue: MICCAI 2026

First: 2026-03-16T13:36:23+00:00 · Latest: 2026-03-16T13:36:23+00:00

Comments: Submitted to MICCAI 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

We present a framework to take advantage of existing labels at inference, called \textit{exemplars}, in order to improve the performance of object detection in medical images. The method, \textit{exemplar diffusion}, leverages existing diffusion methods for object detection to enable a training-free approach to adding information of known bounding boxes at test time. We demonstrate that for medical image datasets with clear spatial structure, the method yields an across-the-board increase in average precision and recall, and a robustness to exemplar quality, enabling non-expert annotation. Moreover, we demonstrate how our method may also be used to quantify predictive uncertainty in diffusion detection methods. Source code and data splits openly available online: https://github.com/waahlstrand/ExemplarDiffusion

中文标题/摘要

标题：范例扩散：利用机会性标签提高医学对象检测性能

我们提出了一种框架，利用推理时已有的标签，称为“范例”，以提高医学图像中对象检测的性能。该方法“范例扩散”利用现有的对象检测扩散方法，在测试时无需训练即可添加已知边界框的信息。我们证明，对于具有清晰空间结构的医学图像数据集，该方法可以全面提高平均精度和召回率，并且对范例质量具有鲁棒性，能够支持非专家注释。此外，我们还展示了该方法如何用于量化扩散检测方法中的预测不确定性。源代码和数据分割已公开在线：https://github.com/waahlstrand/ExemplarDiffusion

Summary / 总结

The research aims to enhance object detection in medical images by utilizing existing labels, known as exemplars, during inference. The method, exemplar diffusion, improves average precision and recall across various medical image datasets, showing robustness to the quality of exemplars. It also allows for the quantification of predictive uncertainty in diffusion detection methods. This approach enables non-expert annotation and is supported by open-source code and data splits available online.

研究旨在通过利用医学图像推理过程中的现有标签（称为示例）来提高目标检测性能。方法名为示例扩散，它在具有清晰空间结构的医学图像数据集上提高了平均精度和召回率，对示例质量具有鲁棒性。此外，该方法还可以用于量化扩散检测方法中的预测不确定性。该方法无需训练，非专家也可使用。源代码和数据分割已公开在线。

Directional Embedding Smoothing for Robust Vision Language Models

Authors: Ye Wang, Jing Liu, Toshiaki Koike-Akino

Venue: ICLR 2026

First: 2026-03-16T13:25:29+00:00 · Latest: 2026-03-16T13:25:29+00:00

Comments: Accepted at ICLR 2026 Workshop on Agents in the Wild

Abs · PDF · Code1 · Code2

Abstract

The safety and reliability of vision-language models (VLMs) are a crucial part of deploying trustworthy agentic AI systems. However, VLMs remain vulnerable to jailbreaking attacks that undermine their safety alignment to yield harmful outputs. In this work, we extend the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense to VLMs and evaluate its performance against the JailBreakV-28K benchmark of multi-modal jailbreaking attacks. We find that RESTA is effective in reducing attack success rate over this diverse corpus of attacks, in particular, when employing directional embedding noise, where the injected noise is aligned with the original token embedding vectors. Our results demonstrate that RESTA can contribute to securing VLMs within agentic systems, as a lightweight, inference-time defense layer of an overall security framework.

中文标题/摘要

标题：方向嵌入平滑以增强视觉语言模型的稳健性

视觉-语言模型（VLMs）的安全性和可靠性是部署可信赖的代理人工智能系统的关键部分。然而，VLMs仍然容易受到破坏其安全对齐并产生有害输出的监狱突破攻击。在本研究中，我们将随机嵌入平滑和标记聚合（RESTA）防御扩展到VLMs，并评估其在面对JailBreakV-28K多模态监狱突破攻击基准测试中的性能。我们发现，当使用方向嵌入噪声时，RESTA能够有效降低这些多样化的攻击集的成功率，特别是当注入的噪声与原始标记嵌入向量对齐时。我们的结果表明，RESTA可以作为代理系统整体安全框架中的轻量级、推理时的防御层，有助于保护VLMs。

Summary / 总结

The research aims to enhance the safety and reliability of vision-language models (VLMs) by defending against jailbreaking attacks. The study extends the RESTA defense method to VLMs and evaluates its effectiveness using the JailBreakV-28K benchmark. Key findings show that RESTA, especially when using directional embedding noise aligned with original token embeddings, significantly reduces the attack success rate, making VLMs more secure in agentic AI systems.

该研究旨在通过防御劫持攻击来增强视觉-语言模型（VLMs）的安全性和可靠性。研究将RESTA防御方法扩展到VLMs，并对其在JailBreakV-28K基准测试中的有效性进行了评估。关键发现表明，使用与原始令牌嵌入向量对齐的方向性嵌入噪声，可以显著降低攻击的成功率，从而使VLMs在代理AI系统中更加安全。

HalDec-Bench: Benchmarking Hallucination Detector in Image Captioning

Authors: Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku

First: 2026-03-16T13:21:55+00:00 · Latest: 2026-03-16T13:21:55+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Hallucination detection in captions (HalDec) assesses a vision-language model's ability to correctly align image content with text by identifying errors in captions that misrepresent the image. Beyond evaluation, effective hallucination detection is also essential for curating high-quality image-caption pairs used to train VLMs. However, the generalizability of VLMs as hallucination detectors across different captioning models and hallucination types remains unclear due to the lack of a comprehensive benchmark. In this work, we introduce HalDec-Bench, a benchmark designed to evaluate hallucination detectors in a principled and interpretable manner. HalDec-Bench contains captions generated by diverse VLMs together with human annotations indicating the presence of hallucinations, detailed hallucination-type categories, and segment-level labels. The benchmark provides tasks with a wide range of difficulty levels and reveals performance differences across models that are not visible in existing multimodal reasoning or alignment benchmarks. Our analysis further uncovers two key findings. First, detectors tend to recognize sentences appearing at the beginning of a response as correct, regardless of their actual correctness. Second, our experiments suggest that dataset noise can be substantially reduced by using strong VLMs as filters while employing recent VLMs as caption generators. Our project page is available at https://dahlian00.github.io/HalDec-Bench-Page/.

中文标题/摘要

标题：HalDec-Bench：图像描述中的幻觉检测基准

幻觉检测（HalDec）评估视觉-语言模型正确对齐图像内容与文本的能力，通过识别错误描述图像的描述错误。除了评估之外，有效的幻觉检测对于收集用于训练VLM的高质量图像-描述对也至关重要。然而，由于缺乏全面的基准，VLM作为幻觉检测器在不同描述模型和幻觉类型之间的通用性仍然不清楚。在此工作中，我们引入了HalDec-Bench，这是一种旨在以原则性和可解释性方式评估幻觉检测器的基准。HalDec-Bench包含由多种VLM生成的描述，以及人类注释表明幻觉的存在，详细的幻觉类型类别，以及段级标签。基准提供了不同难度级别的任务，并揭示了现有跨模态推理或对齐基准中不可见的模型性能差异。我们的分析进一步揭示了两个关键发现。首先，检测器倾向于将响应开头出现的句子识别为正确的，而不管它们的实际正确性。其次，我们的实验表明，通过使用强大的VLM作为过滤器并采用最近的VLM作为描述生成器，可以显著减少数据集噪声。我们的项目页面可在https://dahlian00.github.io/HalDec-Bench-Page/上找到。

Summary / 总结

The research aims to evaluate the ability of hallucination detectors in image captioning by introducing HalDec-Bench, a benchmark that includes captions from various vision-language models and human annotations. The method involves segment-level labeling and detailed hallucination-type categories. Key findings include that detectors often incorrectly identify sentences at the beginning of responses as correct and that using strong VLMs as filters can reduce dataset noise when recent VLMs are used for caption generation.

研究旨在通过引入HalDec-Bench基准来评估图像字幕中的幻觉检测器的有效性，该基准包含来自多种视觉-语言模型的字幕和人类注释。方法包括段级标注和详细的幻觉类型分类。关键发现包括，检测器往往错误地将响应开头的句子识别为正确，并且使用强大的VLM作为过滤器可以减少当使用最近的VLM进行字幕生成时的数据集噪声。

Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Authors: Yao Gu, Xiaohao Xu, Yingna Wu

First: 2026-03-16T13:11:47+00:00 · Latest: 2026-03-16T13:11:47+00:00

Comments: Accepted by IEEE ICASSP2026

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection--substantially outperforming prior SOTA (66.9%)--and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.

中文标题/摘要

标题：基于物理的多轮视觉语言模型在物理驱动异常检测中的应用

视觉语言模型（VLMs）展示了强大的通用推理能力，但在物理驱动的异常检测中仍受到限制，因为动态因果理解是必不可少的。现有的VLMs主要在外观相关性上进行训练，无法捕捉运动约束，导致在不规则旋转或违反机械运动等异常检测上表现不佳。我们提出了一种物理信息指令调优框架，明确地将物体属性、运动模式和动态约束编码到结构化提示中。通过多轮对话传递这些物理先验，我们的方法将因果推理分解为逐步步骤，从而形成对正常和异常动态的稳健内部表示。在Phys-AD基准测试上，我们的方法在视频级检测中达到了96.7%的AUROC，显著优于之前的SOTA（66.9%），并提供了更优的因果解释（0.777 LLM得分）。这项工作突显了结构化物理先验如何将VLMs转化为可靠的动态异常检测器。

Summary / 总结

The research aims to enhance Vision-Language Models (VLMs) for physics-grounded anomaly detection by incorporating physical priors. The method uses a physics-informed instruction tuning framework that encodes object properties and dynamic constraints into structured prompts, enabling multi-turn dialogues to decompose causal reasoning. The approach significantly outperforms previous state-of-the-art models on the Phys-AD benchmark, achieving 96.7% AUROC and better causal explanations. This demonstrates the effectiveness of structured physics priors in improving VLMs for detecting dynamic anomalies.

研究旨在通过解决视觉-语言模型（VLMs）在捕捉动力学约束方面的局限性，增强其在物理导向异常检测中的能力。方法引入了一个物理信息指令调优框架，通过多轮对话嵌入物理先验，将因果推理分解为逐步步骤。该方法显著提高了性能，在Phys-AD基准测试中实现了96.7%的AUROC，而之前的最佳模型仅为66.9%，并且提供了更好的因果解释。这项工作展示了结构化物理先验如何将VLMs转变为可靠的动态异常检测器。

RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation

Authors: Yue Chang, Rufeng Chen, Zhaofan Zhang, Yi Chen, Yifan Tian, Sihong Xie

First: 2026-01-15T08:15:01+00:00 · Latest: 2026-03-16T12:51:59+00:00

Abs · PDF · Code1 · Code2

Abstract

Open-vocabulary 3D Scene Graph (3DSG) can enhance various downstream tasks in robotics by leveraging structured semantic representations, yet current 3DSG construction methods suffer from semantic inconsistencies caused by noisy cross-image aggregation under occlusions and constrained viewpoints. To mitigate the impact of such inconsistency, we propose RAG-3DSG, which introduces re-shot guided uncertainty estimation. By measuring the semantic consistency between original limited viewpoints and re-shot optimal viewpoints, this method quantifies the underlying semantic ambiguity of each graph object. Based on this quantification, we devise an Object-level Retrieval-Augmented Generation (RAG) that leverages low-uncertainty objects as semantic anchors to retrieve more reliable contextual knowledge, enabling a Vision-Language Model to rectify the predictions of uncertain objects and optimize the final 3DSG. Extensive evaluations across three challenging benchmarks and real-world robot trials demonstrate that RAG-3DSG achieves superior recall and precision, effectively mitigating semantic noise to provide highly reliable scene representations for robotics tasks.

中文标题/摘要

标题：RAG-3DSG：基于重拍引导检索增强生成的3D场景图增强

开放词汇的3D场景图（3DSG）可以通过利用结构化的语义表示来增强机器人领域的各种下游任务，但当前的3DSG构建方法由于遮挡和受限视角下的嘈杂跨图像聚合而遭受语义不一致的影响。为了减轻这种不一致的影响，我们提出了RAG-3DSG，该方法引入了基于重拍的不确定性估计。通过测量原始有限视角和重拍最佳视角之间的语义一致性，该方法量化了每个图对象的潜在语义模糊性。基于这种量化，我们设计了一种基于对象的检索增强生成（RAG），利用低不确定性对象作为语义锚点来检索更可靠的上下文知识，使视觉语言模型能够纠正不确定对象的预测并优化最终的3DSG。在三个具有挑战性的基准和真实世界机器人试验中的广泛评估表明，RAG-3DSG实现了更高的召回率和精度，有效地减轻了语义噪声，为机器人任务提供了高度可靠的场景表示。

Summary / 总结

RAG-3DSG addresses the issue of semantic inconsistencies in 3D Scene Graphs by introducing re-shot guided uncertainty estimation. This method measures semantic consistency between original and re-shot viewpoints to quantify the ambiguity of each object. An Object-level Retrieval-Augmented Generation (RAG) is then used to retrieve more reliable contextual knowledge, improving the predictions of uncertain objects and optimizing the final 3D Scene Graph. Experiments show that RAG-3DSG outperforms existing methods in terms of recall and precision, providing more reliable scene representations for robotics tasks.

RAG-3DSG通过引入重新拍摄引导的不确定性估计来解决3D场景图（3DSG）中的语义不一致性问题。该方法通过测量原始视角和重新拍摄视角之间的语义一致性来量化每个对象的模糊性。然后使用对象级检索增强生成（RAG）来检索更可靠的上下文知识，帮助视觉-语言模型改进不确定对象的预测。评估结果显示，RAG-3DSG在召回率和精度上优于现有方法，提供了更可靠的场景表示，适用于机器人任务。

Efficient Document Parsing via Parallel Token Prediction

Authors: Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li

Venue: CVPR 2026

First: 2026-03-16T12:45:34+00:00 · Latest: 2026-03-16T12:45:34+00:00

Comments: Accepted by CVPR 2026 Findings

Abs · PDF · Code1 · Code2

Abstract

Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a plugable, model-agnostic and simple-yet-effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert some learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.

中文标题/摘要

标题：通过并行词预测提高文档解析效率

文档解析作为一项基本而重要的视觉任务，正被视觉-语言模型（VLMs）所革新。然而，VLMs固有的自回归（AR）解码方式成为了一个显著瓶颈，严重限制了解析速度。本文提出了一种名为并行词预测（PTP）的插件式、模型无关且简单有效的方法，使VLMs能够以提高采样效率的方式并行生成多个未来词。具体而言，我们在输入序列中插入一些可学习的词，并设计相应的训练目标，以赋予模型文档解析的并行解码能力。此外，为了支持有效的训练，我们开发了一整套高效生成大规模高质量文档解析训练数据的管道。在OmniDocBench和olmOCR-bench上的广泛实验表明，我们的方法不仅显著提高了解码速度（1.6倍-2.2倍），还减少了模型的幻觉，并展示了强大的泛化能力。

Summary / 总结

The paper addresses the bottleneck of autoregressive decoding in vision-language models (VLMs) for document parsing, proposing Parallel-Token Prediction (PTP) to enable parallel token prediction and improve sample efficiency. By inserting learnable tokens and designing training objectives, PTP allows VLMs to decode multiple tokens simultaneously. Experiments show that PTP enhances decoding speed by 1.6x to 2.2x, reduces model hallucinations, and demonstrates strong generalization capabilities.

本文针对视觉语言模型（VLMs）在文档解析中的自回归解码瓶颈，提出了并行令牌预测（PTP）方法，以实现并行令牌预测并提高样本效率。该方法在输入序列中插入可学习的令牌，并设计训练目标以支持并行解码。全面的数据生成确保了高质量的训练数据。实验表明，PTP可以将解码速度提高1.6倍至2.2倍，减少模型幻觉，并展示出强大的泛化能力。

DAIT: Distillation from Vision-Language Models to Lightweight Classifiers with Adaptive Intermediate Teacher Transfer

Authors: Zhengxu He, Jun Li, Zhijian Wu

First: 2026-03-16T12:00:31+00:00 · Latest: 2026-03-16T12:00:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Large-scale Vision-Language Models (VLMs) encode rich multimodal semantics that are highly beneficial for fine-grained visual categorization (FGVC). However, their prohibitive computational cost hinders practical deployment in resource-constrained environments. Although knowledge distillation contributes to transferring VLMs capacity to lightweight classifiers, conventional distillation mechanisms, which directly transfer from a generic VLM to a compact student, often yield suboptimal results due to severe architectural misalignment and introducing task-irrelevant information. To alleviate this limitation, we propose Distillation with Adaptive Intermediate Teacher transfer (DAIT) in this study, facilitating adaptive knowledge transfer from VLMs to lightweight students. DAIT introduces a trainable intermediate teacher that learns to transfer frozen VLMs representations under explicit supervision from the target fine-grained task. This intermediate teacher adaptively enhances discriminative visual cues, thereby producing compact and task-aligned knowledge that can be reliably distilled into lightweight models. Extensive evaluations on multiple FGVC benchmarks with diverse student architectures demonstrate that our method achieves respective performance gains of 12.63% and 8.34% on FGVC-Aircraft and CUB-200-2011 datasets, establishing DAIT as a principled paradigm for transferring from general-purpose VLMS to deployable fine-grained recognition models.

中文标题/摘要

标题：DAIT: 从视觉语言模型到轻量级分类器的自适应中间教师蒸馏

大规模的视觉语言模型（VLMs）编码了丰富的跨模态语义，这对细粒度视觉分类（FGVC）非常有益。然而，它们巨大的计算成本阻碍了在资源受限环境中实际部署。尽管知识蒸馏有助于将VLMs的能力转移到轻量级分类器上，但传统的蒸馏机制直接从通用VLM转移到紧凑的学生模型，往往由于严重的架构不匹配和引入与任务无关的信息而效果不佳。为了解决这一限制，我们在本文中提出了自适应中间教师蒸馏（DAIT），以促进从VLMs到轻量级学生的自适应知识转移。DAIT引入了一个可训练的中间教师，该教师在目标细粒度任务的显式监督下学习转移冻结的VLMs表示。该中间教师自适应地增强区分性视觉线索，从而产生紧凑且与任务对齐的知识，可以可靠地转移到轻量级模型中。在多个FGVC基准上的广泛评估表明，我们的方法在FGVC-Aircraft和CUB-200-2011数据集上分别实现了12.63%和8.34%的性能提升，确立了DAIT作为从通用视觉语言模型到可部署的细粒度识别模型的原理性范式。

Summary / 总结

The research aims to address the computational cost of large-scale Vision-Language Models (VLMs) for practical deployment in resource-constrained environments. The proposed method, DAIT, introduces an adaptive intermediate teacher that learns to transfer VLMs' representations to lightweight classifiers under explicit supervision from the target task. This approach enhances discriminative visual cues and produces task-aligned knowledge, leading to significant performance gains of 12.63% and 8.34% on FGVC-Aircraft and CUB-200-2011 datasets, respectively.

研究旨在通过提出DAIT方法解决大规模Vision-Language模型（VLMs）的计算成本问题，该方法促进从VLMs到轻量级分类器的自适应知识转移。DAIT引入了一个可训练的中间教师，在目标细粒度任务的显式监督下学习转移冻结的VLM表示，增强区分性视觉线索并生成任务对齐的知识。实验在FGVC基准数据集上显示，DAIT实现了显著的性能提升，分别在FGVC-Aircraft和CUB-200-2011数据集上取得了12.63%和8.34%的改进。

Vision-Language Model Based Multi-Expert Fusion for CT Image Classification

Authors: Jianfa Bai, Kejin Lu, Runtian Yuan, Qingqiu Li, Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng

First: 2026-03-16T11:48:49+00:00 · Latest: 2026-03-16T11:48:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Robust detection of COVID-19 from chest CT remains challenging in multi-institutional settings due to substantial source shift, source imbalance, and hidden test-source identities. In this work, we propose a three-stage source-aware multi-expert framework for multi-source COVID-19 CT classification. First, we build a lung-aware 3D expert by combining original CT volumes and lung-extracted CT volumes for volumetric classification. Second, we develop two MedSigLIP-based experts: a slice-wise representation and probability learning module, and a Transformer-based inter-slice context modeling module for capturing cross-slice dependency. Third, we train a source classifier to predict the latent source identity of each test scan. By leveraging the predicted source information, we perform model fusion and voting based on different experts. On the validation set covering all four sources, the Stage 1 model achieves the best macro-F1 of 0.9711, ACC of 0.9712, and AUC of 0.9791. Stage~2a and Stage~2b achieve the best AUC scores of 0.9864 and 0.9854, respectively. Stage~3 source classifier reaches 0.9107 ACC and 0.9114 F1. These results demonstrate that source-aware expert modeling and hierarchical voting provide an effective solution for robust COVID-19 CT classification under heterogeneous multi-source conditions.

中文标题/摘要

标题：基于视觉-语言模型的多专家融合方法用于CT图像分类

在多机构环境中，由于数据来源差异大、数据来源不平衡以及隐藏的测试数据来源身份，从胸部CT中稳健检测COVID-19仍然具有挑战性。本文提出了一种基于多专家的三阶段来源感知框架，用于多来源COVID-19 CT分类。首先，通过结合原始CT体积和提取肺部的CT体积构建一个肺部感知的3D专家，用于体积分类。其次，开发了两个基于MedSigLIP的专家：切片级表示和概率学习模块，以及基于Transformer的跨切片上下文建模模块，用于捕捉跨切片依赖性。第三，训练一个来源分类器来预测每个测试扫描的潜在来源身份。通过利用预测的来源信息，基于不同专家进行模型融合和投票。在涵盖所有四个来源的验证集上，第一阶段模型的宏F1值为0.9711，ACC值为0.9712，AUC值为0.9791。第二阶段a和第二阶段b分别达到最佳AUC值0.9864和0.9854。第三阶段来源分类器的ACC值为0.9107，F1值为0.9114。这些结果表明，基于来源的专家建模和分层投票在异质多来源条件下提供了有效的COVID-19 CT分类解决方案。

Summary / 总结

This work addresses the challenge of robust COVID-19 detection from chest CT scans across multiple institutions by proposing a three-stage source-aware multi-expert framework. The framework includes a lung-aware 3D expert, two MedSigLIP-based experts for slice-wise and inter-slice context modeling, and a source classifier to predict the source identity. The model achieves high performance, with Stage 1 reaching macro-F1 of 0.9711, ACC of 0.9712, and AUC of 0.9791, while Stages 2a and 2b achieve AUC scores of 0.9864 and 0.9854, respectively. The source classifier reaches 0.9107 ACC and 0.9114 F1, demonstrating the effectiveness of the proposed method in handling heterogeneous multi-source conditions.

本文提出了一种三级源感知多专家框架，以解决多机构环境下从胸部CT扫描中稳健检测COVID-19的挑战。该方法结合了肺感知3D分类、切片表示学习和跨切片上下文建模。框架在Stage 1中实现了0.9711的宏F1、0.9712的ACC和0.9791的AUC，在Stage 2a和Stage 2b中分别实现了0.9864和0.9854的AUC。Stage 3达到了0.9107的ACC和0.9114的F1，展示了源感知建模和分层投票在多源分类中的有效性。

Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing

Authors: Zijin Yin, Bing Li, Kongming Liang, Hao Sun, Zhongjiang He, Zhanyu Ma, Jun Guo

First: 2026-03-02T07:05:37+00:00 · Latest: 2026-03-16T10:48:26+00:00

Comments: Accepted to IEEE TPAMI 2026

Abs · PDF · Code1 · Code2

Abstract

Semantic segmentation takes pivotal roles in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behaviors in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline Gen4Seg to stress-test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, position, as well as image-level variations such as weather and style. To achieve this, we propose to edit visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, the existing segmentation labels can be reused for the edited images, which greatly reduces the labor costs. Using our pipeline, we construct two new benchmarks, Pascal-EA and COCO-EA. We benchmark a wide variety of semantic segmentation models, spanning from closed-set models to open-vocabulary large models. We have several key findings: 1) advanced open-vocabulary models do not exhibit greater robustness compared to closed-set methods under geometric variations; 2) data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our pipeline can also be employed as a data augmentation tool and improve both in-distribution and out-of-distribution performances. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.

中文标题/摘要

标题：基于外观和几何属性编辑的语义分割模型基准测试

语义分割在自动驾驶和医学图像分析等众多应用中发挥着关键作用。在实际部署分割模型时，提前测试其在各种复杂场景中的行为至关重要。本文构建了一个自动数据生成管道Gen4Seg，通过生成具有不同属性变化的各种具有挑战性的样本来对语义分割模型进行压力测试。除了之前仅关注全局天气和风格迁移的评估范式外，我们还研究了在对象和图像级别上外观和几何属性的变化。这些变化包括对象颜色、材质、大小、位置，以及图像级别的变化如天气和风格。为了实现这一点，我们提出使用精确控制结构信息的方法编辑现有真实图像的视觉属性，借助扩散模型。这样，现有的分割标签可以被重用于编辑后的图像，这大大减少了劳动力成本。使用我们的管道，我们构建了两个新的基准Pascal-EA和COCO-EA。我们对从封闭集模型到开放词汇大型模型的广泛语义分割模型进行了基准测试。我们有几个关键发现：1) 先进的开放词汇模型在几何变化下的鲁棒性并不优于封闭集方法；2) 数据增强技术，如CutOut和CutMix，在增强对外观变化的鲁棒性方面效果有限；3) 我们的管道也可以用作数据增强工具，提高分布内和分布外性能。我们的工作表明生成模型作为自动分析分割模型的有效工具的潜力，并希望我们的发现能帮助从业者和研究人员开发出更鲁棒和可靠的分割模型。

Summary / 总结

This paper aims to evaluate the robustness of semantic segmentation models in varied and complex scenes by generating challenging samples with attribute changes. The authors propose a data generation pipeline called Gen4Seg, which edits visual attributes of real images with precise control over structural information. Using this pipeline, they constructed two new benchmarks, Pascal-EA and COCO-EA, and found that advanced open-vocabulary models do not show greater robustness under geometric variations, data augmentation techniques are limited in enhancing robustness against appearance variations, and their pipeline can improve both in-distribution and out-of-distribution performances.

本文介绍了Gen4Seg，一个用于基准测试语义分割模型的自动数据生成管道。它通过生成具有各种属性变化（包括外观和几何）的挑战性样本来评估模型。关键发现包括先进的开放词汇模型在几何变化下并不比封闭集方法更 robust，数据增强技术在增强对外观变化的鲁棒性方面有限，而该管道还可以作为数据增强工具，提高分布内和分布外性能。这项工作表明生成模型作为自动分析分割模型的有效工具的潜力。

Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought

Authors: Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa

Venue: CVPR 2026

First: 2025-07-10T12:07:13+00:00 · Latest: 2026-03-16T09:48:20+00:00

Comments: Accepted to CVPR 2026 (Main); Code is available at https://github.com/yshinya6/red/

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs. While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct image-conditional and rationale-conditional next token distributions. Extensive experiments show that RED consistently and significantly improves reasoning over standard CoT and other decoding methods across multiple benchmarks and LVLMs. Our work offers a practical and effective approach to improve both the faithfulness and accuracy of CoT reasoning in LVLMs, paving the way for more reliable rationale-grounded multi-modal systems. Code is available at https://github.com/yshinya6/red/.

中文标题/摘要

标题：增强推理的多模态链式思考解码

大型视觉-语言模型（LVLM）通过结合预训练的视觉编码器和大型语言模型（LLM），展示了非凡的能力。类似单模态LLM，链式思考（CoT）提示已被调整用于LVLM，通过基于视觉和文本输入生成中间推理来增强多模态推理。尽管CoT被认为可以提高LVLM中的定位和准确性，但我们的实验揭示了一个关键挑战：现有的LVLM经常忽略生成的推理内容在CoT推理中的作用。为了解决这一问题，我们将多模态CoT推理重新表述为一个KL约束下的奖励最大化问题，重点关注基于推理的对数似然。作为最优解，我们提出了推理增强解码（RED），这是一种新颖的插拔式推理时解码策略。RED通过将图像条件和推理条件的下一个标记分布相乘来协调视觉和推理信息。广泛的实验表明，RED在多个基准和LVLM上一致且显著地提高了推理能力，超过了标准CoT和其他解码方法。我们的工作提供了一种实用且有效的方法，以提高LVLM中CoT推理的忠实度和准确性，为更可靠的基于推理的多模态系统铺平了道路。代码可在https://github.com/yshinya6/red/获取。

Summary / 总结

The research aims to enhance multi-modal chain-of-thought reasoning in large vision-language models (LVLMs) by addressing the issue of existing models often ignoring generated rationales. The method involves formulating multi-modal CoT reasoning as a KL-constrained reward maximization and proposing rationale-enhanced decoding (RED), a novel decoding strategy that harmonizes visual and rationale information. Experiments demonstrate that RED significantly improves reasoning accuracy over standard CoT and other decoding methods across various benchmarks and LVLMs.

本文解决了大型视觉-语言模型（LVLM）在链式思考（CoT）推理过程中经常忽略生成的推理内容的问题。提出了一种新颖的解码策略——推理增强解码（RED），该策略通过最大化推理条件下的对数似然性来协调视觉和推理信息。该方法在各种基准和LVLM上显著提高了推理准确性，超过了标准CoT和其他解码方法。

Open-World Motion Forecasting

Authors: Nicolas Schischka, Nikhil Gosala, B Ravi Kiran, Senthil Yogamani, Abhinav Valada

First: 2026-03-10T09:35:08+00:00 · Latest: 2026-03-16T09:42:11+00:00

Comments: V2: Adapt author affiliation

Abs · PDF · Code1 · Code2 · Project1

Abstract

Motion forecasting aims to predict the future trajectories of dynamic agents in the scene, enabling autonomous vehicles to effectively reason about scene evolution. Existing approaches operate under the closed-world regime and assume fixed object taxonomy as well as access to high-quality perception. Therefore, they struggle in real-world settings where perception is imperfect and object taxonomy evolves over time. In this work, we bridge this fundamental gap by introducing open-world motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are estimated directly from camera images. We tackle this setting by proposing the first end-to-end class-incremental motion forecasting framework to mitigate catastrophic forgetting while simultaneously learning to forecast newly introduced classes. When a new class is introduced, our framework employs a pseudo-labeling strategy to first generate motion forecasting pseudo-labels for all known classes which are then processed by a vision-language model to filter inconsistent and over-confident predictions. Parallelly, our approach further mitigates catastrophic forgetting by using a novel replay sampling strategy that leverages query feature variance to sample previous sequences with informative motion patterns. Extensive evaluation on the nuScenes and Argoverse 2 datasets demonstrates that our approach successfully resists catastrophic forgetting and maintains performance on previously learned classes while improving adaptation to novel ones. Further, we demonstrate that our approach supports zero-shot transfer to real-world driving and naturally extends to end-to-end class-incremental planning, enabling continual adaptation of the full autonomous driving system. We provide the code at https://omen.cs.uni-freiburg.de.

中文标题/摘要

标题：开放世界运动预测

运动预测旨在预测场景中动态代理未来轨迹，使自动驾驶车辆能够有效推理场景演变。现有方法在封闭世界框架下运行，并假设固定的对象分类以及高质量的感知。因此，它们在感知不完美且对象分类随时间演变的真实世界环境中表现不佳。在本文中，我们通过引入开放世界运动预测这一新颖设置来弥合这一根本差距，该设置中新的对象类别会随时间顺序引入，未来对象轨迹直接从相机图像中估计。我们通过提出第一个端到端类增量运动预测框架来缓解灾难性遗忘问题，同时学习预测新引入的类别。当引入新类别时，我们的框架采用伪标签策略首先为所有已知类别生成运动预测伪标签，然后通过视觉-语言模型过滤不一致和过于自信的预测。同时，我们的方法通过利用查询特征方差来选择具有信息性运动模式的先前序列，进一步缓解灾难性遗忘。在nuScenes和Argoverse 2数据集上的广泛评估表明，我们的方法成功地抵抗了灾难性遗忘，保持了对先前学习类别的性能，同时提高了对新类别的适应性。此外，我们展示了我们的方法支持零样本向真实世界驾驶的转移，并自然扩展到端到端类增量规划，使整个自动驾驶系统能够持续适应。代码可在https://omen.cs.uni-freiburg.de/获取。

Summary / 总结

The paper addresses the challenge of motion forecasting in real-world settings where perception is imperfect and object taxonomy evolves over time. It introduces open-world motion forecasting, a novel setting where new object classes are introduced sequentially, and proposes an end-to-end class-incremental motion forecasting framework to mitigate catastrophic forgetting. The framework uses pseudo-labeling and a replay sampling strategy to improve adaptation to novel classes while maintaining performance on previously learned classes. Extensive evaluation on nuScenes and Argoverse 2 datasets shows that the approach successfully resists catastrophic forgetting and enhances adaptation to new classes, supporting zero-shot transfer to real-world driving and end-to-end class-incremental planning.

研究旨在通过引入开放世界运动预测来解决现有封闭世界方法的局限性，该方法能够处理不断变化的对象类别和不完美的感知。方法提出了一种端到端的类别增量运动预测框架，以减轻灾难性遗忘并学习预测新类别。关键发现表明，该方法成功地在保持对先前学习类别性能的同时，提高了对新类别的适应性，并支持零样本转移到真实世界的驾驶以及端到端的类别增量规划，以实现自主驾驶系统的持续适应。

Frame Sampling Strategies Matter: A Benchmark for small vision language models

Authors: Marija Brkic, Anas Filali Razzouki, Yannis Tevissen, Khalil Guetari, Mounim A. El Yacoubi

First: 2025-09-18T09:18:42+00:00 · Latest: 2026-03-16T09:39:15+00:00

Abs · PDF · Code1 · Code2

Abstract

Comparing vision language models on videos is particularly complex, as the performances is jointly determined by the model's visual representation capacity and the frame-sampling strategy used to construct the input. Current video benchmarks are suspected to suffer from substantial frame-sampling bias, as models are evaluated with different frame selection strategies. In this work, we propose the first frame-accurate benchmark of state-of-the-art small VLMs for video question-answering, evaluated under controlled frame-sampling strategies. Our results confirm the suspected bias and highlight both data-specific and task-specific behaviors of SVLMs under different frame-sampling techniques. By open-sourcing our benchmarking code, we provide the community with a reproducible and unbiased protocol for evaluating video VLMs and emphasize the need for standardized frame-sampling strategies tailored to each benchmarking dataset in future research.

中文标题/摘要

标题：帧采样策略很重要：小型视觉语言模型基准测试

在视频上比较视觉语言模型特别复杂，因为模型的表现由其视觉表示能力和用于构建输入的帧采样策略共同决定。当前的视频基准可能受到显著的帧采样偏差影响，因为模型是使用不同的帧选择策略进行评估的。在本文中，我们提出了第一个针对视频问答的最先进的小型VLM的帧准确基准测试，在受控的帧采样策略下进行评估。我们的结果证实了这种偏差的存在，并突显了在不同帧采样技术下SVLMs的数据特异性和任务特异性行为。通过开源我们的基准测试代码，我们为社区提供了一个可重复且无偏见的协议，用于评估视频VLMs，并强调未来研究中为每个基准测试数据集制定标准化帧采样策略的必要性。

Summary / 总结

The research aims to address the complexity in comparing vision language models on videos, where model performance is influenced by both visual representation capacity and frame-sampling strategy. The study introduces a frame-accurate benchmark for small vision language models, evaluating them under controlled frame-sampling strategies. The findings confirm the presence of frame-sampling bias and reveal data-specific and task-specific behaviors of these models. The benchmarking code is open-sourced to ensure reproducibility and unbiased evaluation in future research.

该研究解决了在视频上评估视觉语言模型的复杂性，模型性能受到其视觉表示能力和帧采样策略的影响。作者提出了一种新的小型视觉语言模型基准，确保了控制的帧采样策略。他们的结果显示现有基准存在偏差，并且不同帧采样技术如何影响数据特定和任务特定的行为。开源的基准代码旨在促进未来研究中的可重复和无偏评估。

Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods

Authors: Omer Ben Hayun, Roy Betser, Meir Yossef Levi, Levi Kassel, Guy Gilboa

Venue: CVPR 2026

First: 2026-03-16T09:26:56+00:00 · Latest: 2026-03-16T09:26:56+00:00

Comments: Accepted to CVPR 2026

Abs · PDF · Code1 · Code2 · Project1

Abstract

Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce \emph{STALL}, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at https://omerbenhayun.github.io/stall-video.

中文标题/摘要

标题：基于空间-时间似然性的无训练检测生成视频

随着文本和图像生成技术的重大进展，视频领域迅速发展，产生了高度逼真且可控的序列。与此同时，这些模型也引发了严重的信息误导问题，使得可靠检测合成视频变得越来越重要。基于图像的检测器本质上是有限的，因为它们逐帧操作并忽略时间动态，而监督的视频检测器在面对未见过的生成器时泛化能力差，这是一个关键的缺点，因为新的模型正迅速涌现。这些挑战促使了零样本方法的发展，这些方法避免使用合成数据，而是将内容与真实数据统计进行评分，从而实现无训练、模型无关的检测。我们引入了\emph{STALL}，一种简单的、无训练的、理论上合理的检测器，它为视频提供基于似然性的评分，并在概率框架内联合建模空间和时间证据。我们使用两个公开基准对STALL进行了评估，并引入了ComGenVid，这是一个包含最新生成模型的新基准。STALL在所有先前的基于图像和视频的基线中表现优异。代码和数据可在https://omerbenhayun.github.io/stall-video/获取。

Summary / 总结

The paper addresses the challenge of detecting synthetic videos by introducing STALL, a training-free detector that scores video content based on spatial and temporal likelihoods. STALL avoids the limitations of image-based detectors and supervised video detectors by operating in a model-agnostic manner. The detector consistently outperforms existing baselines on two public benchmarks and a new benchmark with state-of-the-art generative models.

论文通过引入STALL，一种无需训练的检测器，基于空间和时间似然性对视频内容进行评分，解决了合成视频检测的挑战。STALL避免了基于图像的检测器和监督式视频检测器的局限性，以模型无关的方式运作。该检测器在两个公开基准和一个包含最新生成模型的新基准上，一致地优于现有基线。

Curing Semantic Drift: A Dynamic Approach to Grounding Generation in Large Vision-Language Models

Authors: Jiahe Chen, Jiaying He, Qiyuan Chen, Qian Shao, Jiahe Ying, Hongxia Xu, Jintai Chen, Jianwei Zheng, Jian Wu

First: 2025-06-26T17:35:40+00:00 · Latest: 2026-03-16T09:20:22+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Vision-Language Models (LVLMs) face a tug-of-war between powerful linguistic priors and visual evidence, often leading to \emph{semantic drift}: a progressive detachment from the input image that can abruptly emerge at specific decoding steps. Through a token-level diagnosis, we show that hallucination is frequently triggered not by the absence of grounded candidates, but by a failure of selection -- the model chooses a linguistically convenient yet visually unfaithful token even when better grounded alternatives exist. Motivated by this insight, we propose \textbf{D}ynamic \textbf{L}ogits \textbf{C}alibration (DLC), a training-free decoding framework that introduces a lightweight visual referee to intervene exactly when drift happens. At each step, DLC performs a dual-aspect grounding check on top-$k$ candidates: (1) it assesses the intrinsic visual relevance of a candidate token and (2) its contextual visual coherence. These signals are evaluated against an adaptive historical baseline to compute a relative visual advantage, which is then used to dynamically calibrate logits and favor grounded tokens. Extensive experiments on CHAIR, POPE, SHR, GPT-4o evaluation, and MME demonstrate that DLC consistently reduces hallucinations across multiple LVLMs while preserving response quality. Further analyses validate robustness to different vision backbones and demonstrate a favorable trade-off between output quality and computational cost as the candidate pool size varies. Code will be released on https://github.com/JiaheChen2002/DLC.

中文标题/摘要

标题：治愈语义漂移：一种在大型视觉-语言模型中接地生成的动态方法

大型视觉-语言模型（LVLMs）在强大的语言先验和视觉证据之间存在拉锯战，经常导致\emph{语义漂移}：输入图像的逐步脱离，可能在特定解码步骤中突然出现。通过在标记级别进行诊断，我们表明幻觉通常不是由于缺乏接地候选词，而是由于选择失败——模型选择了一个语言上方便但视觉上不忠实的标记，即使有更好的接地替代选项存在。受此见解的启发，我们提出了\textbf{D}ynamic \textbf{L}ogits \textbf{C}alibration（DLC），一种无需训练的解码框架，引入了一个轻量级的视觉裁判，在漂移发生时进行干预。在每一步，DLC 对 top-$k$ 候选词进行双重方面接地检查：（1）评估候选标记的内在视觉相关性，（2）评估其上下文视觉一致性。这些信号与自适应历史基线进行评估，以计算相对视觉优势，然后用于动态校准 logits 并倾向于接地标记。在 CHAIR、POPE、SHR、GPT-4o 评估和 MME 上的广泛实验表明，DLC 一致地减少了多个 LVLMs 中的幻觉，同时保持了响应质量。进一步的分析验证了其对不同视觉后端的鲁棒性，并展示了在候选池大小变化时输出质量和计算成本之间的有利权衡。代码将在 https://github.com/JiaheChen2002/DLC 上发布。

Summary / 总结

The paper addresses the issue of semantic drift in large Vision-Language Models (LVLMs), where the model's output becomes increasingly disconnected from the input image. It proposes DLC, a lightweight decoding framework that dynamically calibrates logits to favor more grounded tokens. Experiments show that DLC reduces hallucinations while maintaining response quality across different LVLMs and vision backbones, with a favorable trade-off between output quality and computational cost.

论文解决了大型视觉-语言模型（LVLM）中的语义漂移问题，即模型输出与输入图像逐渐脱节。提出了一种名为DLC的无训练解码框架，引入视觉裁判动态校准logits，偏好接地的词。实验表明，DLC在多个LVLM中减少了幻觉现象，保持了响应质量，并且对不同的视觉后端具有鲁棒性，随着候选池大小的变化，在输出质量和计算成本之间表现出有利的权衡。

Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing

Authors: Jiahe Song, Chuang Wang, Yinfan Wang, Hao Zheng, Rui Nie, Bowen Jiang, Xingjian Wei, Junyuan Gao, Yubin Wang, Bin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He

First: 2026-03-16T09:17:05+00:00 · Latest: 2026-03-16T09:17:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Reaction diagram parsing (RxnDP) is critical for extracting chemical synthesis information from literature. Although recent Vision-Language Models (VLMs) have emerged as a promising paradigm to automate this complex visual reasoning task, their application is fundamentally bottlenecked by the inability to align visual chemical entities with pre-trained knowledge, alongside the inherent discrepancy between token-level training and reaction-level evaluation. To address these dual challenges, this work enhances VLM-based RxnDP from two complementary perspectives: prompting representation and learning paradigms. First, we propose Identifier as Visual Prompting (IdtVP), which leverages naturally occurring molecule identifiers (e.g., bold numerals like 1a) to activate the chemical knowledge acquired during VLM pre-training. IdtVP enables powerful zero-shot and out-of-distribution capabilities, outperforming existing prompting strategies. Second, to further optimize performance within fine-tuning paradigms, we introduce Re3-DAPO, a reinforcement learning algorithm that leverages verifiable rewards to directly optimize reaction-level metrics, thereby achieving consistent gains over standard supervised fine-tuning. Additionally, we release the ScannedRxn benchmark, comprising scanned historical reaction diagrams with real-world artifacts, to rigorously assess model robustness and out-of-distribution ability. Our contributions advance the accuracy and generalization of VLM-based reaction diagram parsing. We will release data, models, and code on GitHub.

中文标题/摘要

标题：分子标识符视觉提示和可验证强化学习在化学反应图解析中的应用

反应图解析（RxnDP）对于从文献中提取化学合成信息至关重要。尽管最近的视觉-语言模型（VLMs）已经展现出自动化这一复杂视觉推理任务的潜力，但其应用受到无法将视觉化学实体与预训练知识对齐的限制，以及标记级别训练与反应级别评估之间的固有差异。为了解决这些双重挑战，本研究从两个互补的角度增强了基于VLM的RxnDP：提示表示和学习范式。首先，我们提出了分子标识符作为视觉提示（IdtVP），利用自然出现的分子标识符（例如，粗体数字如1a）来激活VLM预训练期间获得的化学知识。IdtVP使零样本和分布外能力变得强大，优于现有提示策略。其次，为了进一步优化微调范式内的性能，我们引入了Re3-DAPO，这是一种利用可验证奖励直接优化反应级别指标的强化学习算法，从而在标准监督微调上实现一致的改进。此外，我们发布了ScannedRxn基准，包含带有真实世界缺陷的扫描历史反应图，以严格评估模型的鲁棒性和分布外能力。我们的贡献推进了基于VLM的反应图解析的准确性和泛化能力。我们将在GitHub上发布数据、模型和代码。

Summary / 总结

This work addresses the challenges in Reaction Diagram Parsing (RxnDP) by enhancing Vision-Language Models (VLMs) from two perspectives: prompting representation and learning paradigms. It introduces Identifier as Visual Prompting (IdtVP) to leverage molecule identifiers for zero-shot and out-of-distribution capabilities, and Re3-DAPO, a reinforcement learning algorithm that optimizes reaction-level metrics using verifiable rewards. The study also releases the ScannedRxn benchmark to evaluate model robustness. Key findings show improved performance over existing methods in both zero-shot and fine-tuning scenarios.

该研究通过增强Vision-Language Models (VLMs)从两个方面解决了反应图解析（RxnDP）的挑战：提示表示和学习范式。它引入了Identifier作为视觉提示（IdtVP），利用分子标识符实现零样本和跨分布能力，并引入了Re3-DAPO，这是一种利用可验证奖励直接优化反应级别指标的强化学习算法。研究还发布了ScannedRxn基准，以评估模型的鲁棒性和跨分布能力。关键发现表明，在零样本和微调场景中，该方法的性能优于现有方法。

MMSpec: Benchmarking Speculative Decoding for Vision-Language Models

Authors: Hui Shen, Xin Wang, Ping Zhang, Yunta Hsieh, Qi Han, Zhongwei Wan, Ziheng Zhang, Jingxuan Zhang, Jing Xiong, Ziyuan Liu, Yifan Zhang, Hangrui Cao, Chenyang Zhao, Mi Zhang

First: 2026-03-16T08:55:42+00:00 · Latest: 2026-03-16T08:55:42+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.

中文标题/摘要

标题：MMSpec：视觉语言模型推测性解码基准测试

视觉语言模型（VLMs）在多模态任务中表现出色，但由于模型规模大和长的多模态上下文，推断延迟较高。推测性解码最近作为一种有效的加速技术出现，但在VLMs中的行为尚不完全理解。我们引入了MMSpec，这是第一个用于评估视觉语言模型中推测性解码的基准测试。MMSpec包含六个任务类别中的600个多模态样本，并在统一的评估框架下集成了十种代表性推测性解码算法。我们的研究揭示了三个关键发现：（1）设计用于仅文本的大规模语言模型的方法在多模态场景中表现下降，（2）在更大的批量大小下，视觉意识变得越来越重要，（3）吞吐量加速本身不能可靠地反映延迟性能。基于这些发现，我们提出了ViSkip，这是一种即插即用的推测性解码方法，能够动态适应视觉标记并实现了最先进的性能。

Summary / 总结

MMSpec is the first benchmark for evaluating speculative decoding in vision-language models, containing 600 multimodal samples across six task categories and integrating ten speculative decoding algorithms. The study reveals that methods designed for text-only LLMs perform poorly in multimodal scenarios, vision awareness is crucial at larger batch sizes, and throughput speedup does not always reflect latency performance. Based on these findings, the authors propose ViSkip, a speculative decoding method that dynamically adapts to vision tokens and achieves state-of-the-art performance.

MMSpec 是一个用于评估视觉语言模型中投机解码的基准，旨在解决 VLM 的高推理延迟问题。它包含 600 个多模态样本和十种投机解码算法。研究发现，针对文本的方法在多模态场景中表现不佳，视觉意识在大批次大小时变得越来越重要，而仅通过吞吐量加速无法可靠反映延迟性能。基于这些发现，作者提出了 ViSkip，这是一种动态适应视觉标记的投机解码方法，并取得了最佳性能。

MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation

Authors: Xun Huang, Shijia Zhao, Yunxiang Wang, Xin Lu, Wanfa Zhang, Rongsheng Qu, Weixin Li, Yunhong Wang, Chenglu Wen

Venue: CVPR 2026

First: 2025-11-13T14:51:21+00:00 · Latest: 2026-03-16T08:45:56+00:00

Comments: 18 pages, Accepted by CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Embodied navigation is a fundamental capability for robotic agents operating. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. Additionally, we further identify the last mile problem in zero-shot navigation determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on the challenging GOAT-Bench and HM3D-ObjNav benchmark. The code will be publicly available at https://github.com/ylwhxht/MSGNav.

中文标题/摘要

标题：MSGNav：多模态3D场景图释放零样本室内导航能力

具身导航是机器人代理执行的基本能力。实际部署需要开放词汇的一般化和低训练开销，因此推动了零样本方法而不是任务特定的强化学习训练。然而，现有的构建显式3D场景图的零样本方法通常将丰富的视觉观察压缩为文本关系，导致高构建成本、不可逆的视觉证据损失和受限的词汇量。为了解决这些限制，我们引入了多模态3D场景图（M3DSG），通过用动态分配的图像替换文本关系来保留视觉线索。基于M3DSG，我们提出了MSGNav，这是一种零样本导航系统，包括一个关键子图选择模块以实现高效推理、一个自适应词汇更新模块以支持开放词汇以及一个闭环推理模块以实现准确的探索推理。此外，我们进一步识别了零样本导航中的最后一英里问题，即确定合适的最终视点以确定可行的目标位置，并提出了一种基于可见性的视点决策模块以明确解决该问题。全面的实验结果表明，MSGNav在具有挑战性的GOAT-Bench和HM3D-ObjNav基准测试中达到了最先进的性能。代码将在https://github.com/ylwhxht/MSGNav公开。

Summary / 总结

The research aims to develop a zero-shot embodied navigation system that can handle open vocabulary generalization with low training overhead. MSGNav uses a Multi-modal 3D Scene Graph (M3DSG) to preserve visual cues and includes modules for efficient reasoning, open vocabulary support, and accurate exploration. The system also addresses the last mile problem by proposing a Visibility-based Viewpoint Decision module. Experimental results show that MSGNav outperforms existing methods on challenging benchmarks.

研究旨在开发一种零样本的体感导航系统，能够对未见过的环境和物体进行泛化，同时减少训练开销。MSGNav 使用多模态 3D 场景图 (M3DSG) 来保留视觉线索，包括用于高效推理的关键子图选择模块、用于开放词汇支持的自适应词汇更新模块以及用于准确探索推理的闭环推理模块。此外，还提出了一种基于可见性的视点决策模块来确定可行的目标位置。实验结果表明，MSGNav 在 GOAT-Bench 和 HM3D-ObjNav 基准测试中表现优于现有方法。

BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder

Authors: Siquan Huang, Yijiang Li, Ningzhi Gao, Xingfu Yan, Leyu Shi, Ying Gao

First: 2026-03-12T08:32:19+00:00 · Latest: 2026-03-16T08:21:21+00:00

Comments: 17 pages, 10 figures, 6 tables

Abs · PDF · Code1 · Code2

Abstract

Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time backdoor samples detection method for pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger's robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly across masking progress. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.

中文标题/摘要

标题：BackdoorIDS：预训练视觉编码器的零样本后门检测

自监督和多模态视觉编码器学习强大的视觉表示，广泛应用于下游视觉任务和大型视觉-语言模型（LVLMs）。然而，下游用户经常依赖来源不明的第三方预训练编码器，使其面临后门攻击的风险。在本工作中，我们提出了一种名为BackdoorIDS的简单而有效的零样本、推理时后门样本检测方法，用于预训练视觉编码器。BackdoorIDS受到两个观察结果的启发：注意力劫持和恢复。在渐进输入遮罩下，后门图像最初将注意力集中在恶意触发特征上。一旦遮罩比例超过触发的鲁棒性阈值，触发器被禁用，注意力迅速转向良性内容。这种转变导致图像嵌入产生显著变化，而干净图像的嵌入则在遮罩过程中更平滑地演变。BackdoorIDS通过沿遮罩轨迹提取嵌入序列并应用基于密度的聚类（如DBSCAN）来实现这一信号。如果输入的嵌入序列形成多个聚类，则将其标记为后门。大量实验表明，BackdoorIDS在各种攻击类型、数据集和模型家族中始终优于现有防御措施。值得注意的是，它是一种即插即用的方法，无需重新训练，并在推理时完全零样本运行，使其与各种编码器架构兼容，包括CNN、ViT、CLIP和LLaVA-1.5。

Summary / 总结

BackdoorIDS is a zero-shot backdoor detection method for pretrained vision encoders, motivated by the observations of Attention Hijacking and Restoration. It detects backdoor attacks by analyzing the change in image embeddings under progressive input masking. Extensive experiments demonstrate that BackdoorIDS outperforms existing defenses across various attack types, datasets, and model families, and it can be applied without retraining and operates fully at inference time.

BackdoorIDS 是一种针对预训练视觉编码器的零样本后门检测方法，基于注意力劫持和恢复的观察。它通过分析在渐进输入遮罩下的图像嵌入变化来检测后门攻击。大量实验表明，BackdoorIDS 在各种攻击类型、数据集和模型家族中均优于现有防御方法，并且无需重新训练或额外数据即可轻松应用，使其适用于不同的编码器架构。

Relevance Feedback in Text-to-Image Diffusion: A Training-Free And Model-Agnostic Interactive Framework

Authors: Wenxi Wang, Hongbin Liu, Mingqian Li, Junyan Yuan, Junqi Zhang

First: 2026-03-16T07:38:23+00:00 · Latest: 2026-03-16T07:38:23+00:00

Abs · PDF · Code1 · Code2

Abstract

Text-to-image generation using diffusion models has achieved remarkable success. However, users often possess clear visual intents but struggle to express them precisely in language, resulting in ambiguous prompts and misaligned images. Existing methods struggle to bridge this gap, typically relying on high-load textual dialogues, opaque black-box inferences, or expensive fine-tuning. They fail to simultaneously achieve low cognitive load, interpretable preference inference, and remain training-free and model-agnostic. To address this, we propose RFD, an interactive framework that adapts the relevance feedback mechanism from information retrieval to diffusion models. In RFD, users replace explicit textual dialogue with implicit, multi-select visual feedback to minimize cognitive load, easily expressing complex, multi-dimensional preferences. To translate feedback into precise generative guidance, we construct an expert-curated feature repository and introduce an information-theoretic weighted cumulative preference analysis. This white-box method calculates preferences from current-round feedback and incrementally accumulates them, avoiding the concatenation of historical interactions and preventing inference degradation caused by lengthy contexts. Furthermore, RFD employs a probabilistic sampling mechanism for prompt reconstruction to balance exploitation and exploration, preventing output homogenization. Crucially, RFD operates entirely within the external text space, making it strictly training-free and model-agnostic as a universal plug-and-play solution. Extensive experiments demonstrate that RFD effectively captures the user's true visual intent, significantly outperforming baselines in preference alignment.

中文标题/摘要

标题：文本到图像扩散中的相关反馈：一种无需训练且模型无关的交互框架

使用扩散模型进行文本到图像生成已经取得了显著的成功。然而，用户往往有明确的视觉意图，但在用语言表达时却难以精确描述，导致生成的提示语模糊不清，生成的图像与意图不一致。现有方法难以弥合这一差距，通常依赖于高负载的文字对话、不透明的黑盒推理或昂贵的微调。它们无法同时实现低认知负荷、可解释的偏好推理以及保持无需训练和模型无关。为了解决这一问题，我们提出了一种名为RFD的交互框架，将信息检索中的相关反馈机制应用于扩散模型。在RFD中，用户用隐式的多选视觉反馈代替显式的文字对话，以减少认知负荷，轻松表达复杂的多维偏好。为了将反馈转化为精确的生成指导，我们构建了一个专家编纂的特征库，并引入了基于信息论加权累积偏好分析。这是一种白盒方法，从当前轮次的反馈中计算偏好，并逐步累积，避免了历史交互的串联，防止了由于长上下文导致的推理退化。此外，RFD采用概率采样机制进行提示重构，平衡了探索和利用，防止了输出同质化。最关键的是，RFD完全在外部文本空间中运行，使其严格保持无需训练和模型无关，成为一种通用的即插即用解决方案。广泛的实验表明，RFD能够有效捕捉用户的真正视觉意图，在偏好对齐方面显著优于基线方法。

Summary / 总结

The paper addresses the challenge of users expressing clear visual intents through ambiguous textual prompts in text-to-image generation using diffusion models. It introduces RFD, a training-free and model-agnostic framework that uses implicit visual feedback to minimize cognitive load and improve preference inference. Key findings show that RFD outperforms existing methods in aligning generated images with user intent.

论文针对文本到图像生成中使用扩散模型时出现的模糊提示问题，提出了一种名为RFD的训练免费且模型无关的框架，通过使用隐式的视觉反馈来减少认知负担并使生成的图像与用户意图对齐。RFD构建了一个特征库，并采用信息论方法来分析偏好，避免了由于历史交互导致的推理退化。实验表明，RFD在偏好对齐方面优于现有方法。

Agentic Retoucher for Text-To-Image Generation

Authors: Shaocheng Shen, Jianfeng Liang, Chunlei Cai, Cong Geng, Huiyu Duan, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai

First: 2026-01-05T12:06:43+00:00 · Latest: 2026-03-16T06:59:43+00:00

Comments: Accepted by CVPR2026

Abs · PDF · Code1 · Code2

Abstract

Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, face, text and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.

中文标题/摘要

标题：文本到图像生成的代理修复器

文本到图像（T2I）扩散模型如SDXL和FLUX已经实现了令人印象深刻的写实效果，但在肢体、面部、文本等方面仍然普遍存在小规模失真。现有的精修方法要么进行昂贵的迭代重新生成，要么依赖于弱空间定位的视觉语言模型（VLMs），导致语义漂移和不可靠的局部编辑。为了解决这一问题，我们提出了一种名为代理修复器的分层决策驱动框架，将后生成修正重新构想为类似人类感知-推理-行动的循环。具体来说，我们设计了（1）一个感知代理，学习在文本-图像一致性线索下的细粒度失真定位的上下文显著性；（2）一个推理代理，通过逐步偏好对齐进行符合人类的推断诊断；（3）一个行动代理，根据用户偏好自适应地计划局部修复。该设计将感知证据、语言推理和可控修正整合到一个统一的、自我修正的决策过程中。为了实现精细监督和定量评估，我们进一步构建了包含6000张T2I图像和27000个注释缺陷区域的GenBlemish-27K数据集。广泛的实验表明，代理修复器在感知质量、失真定位和人类偏好对齐方面始终优于最先进的方法，建立了自我修正和感知可靠的T2I生成的新范式。

Summary / 总结

Agentic Retoucher is a hierarchical framework that addresses small-scale distortions in text-to-image generation by reformulating post-generation correction as a perception-reasoning-action loop. It includes a perception agent for fine-grained distortion localization, a reasoning agent for human-aligned inferential diagnosis, and an action agent for adaptive localized inpainting. The framework outperforms existing methods in perceptual quality, distortion localization, and human preference alignment, as demonstrated by extensive experiments. To support evaluation, the authors also created GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories.

Agentic Retoucher 是一个分层框架，旨在通过解决小尺度失真来提高文本到图像生成的质量。它包括一个用于精细失真定位的感知代理、一个进行人类对齐的推理诊断代理以及一个根据用户偏好进行自适应局部修复的行动代理。该框架将感知证据、语言推理和可控修正统一到一个自我纠正的决策过程中。实验表明，Agentic Retoucher 在感知质量、失真定位和人类偏好对齐方面优于现有方法，为自我纠正和感知可靠的 T2I 生成设定了新标准。

History

20260317_0401 20260316_0343 20260315_0341 20260314_0344 20260313_0352 20260312_0352 20260311_0347 20260310_0350 20260309_0338 20260308_0337 20260307_0347 20260306_0402 20260305_0348 20260304_0348 20260303_0348 20260302_0336 20260301_0339 20260228_0348 20260227_0354 20260226_0402 20260225_0404 20260224_0406 20260223_0338 20260222_0339 20260221_0345 20260220_0348 20260219_0358 20260218_0358 20260217_0343 20260216_0339 20260215_0338 20260213_0401 20260212_0404 20260210_0409 20260208_0339 20260207_0349 20260206_0347 20260205_0346 20260204_0354 20260202_0337 20260201_0333 20260131_0345 20260130_0341 20260129_0344 20260128_0341 20260127_0338 20260126_0330 20260125_0329 20260124_0337 20260123_0337 20260122_0343 20260121_0424 20260119_0329 20260118_0327 20260117_0332 20260116_0339 20260115_0334 20260114_0333 20260113_0334 20260112_0331 20260111_0329 20260110_0333 20260109_0334 20260108_0335 20260107_0330 20260106_0336 20260105_0328 20260104_0328 20260103_0325 20260102_0339 20260101_0329 20251231_0333 20251230_0332 20251229_0329 20251228_0332 20251227_0329 20251226_0330 20251225_0329 20251224_0331 20251223_0332 20251222_0328 20251221_0329 20251220_0330 20251219_0330 20251218_0345 20251217_0332 20251216_0333 20251215_0333 20251214_0327 20251212_0333 20251211_0331 20251210_0332 20251209_0331 20251208_0328 20251207_0327 20251206_0330 20251205_0331 20251204_0331 20251203_0333 20251202_0335 20251201_0328 20251130_0327 20251129_0328 20251128_0327 20251127_0327 20251126_0329 20251125_0327 20251124_0327 20251123_0326 20251122_0328 20251121_0328 20251120_0329 20251119_0328 20251118_0328 20251117_0326 20251116_0325 20251115_0327 20251114_0328 20251113_0330 20251112_0329 20251111_0328 20251110_0325 20251109_0326 20251108_0328 20251107_0328 20251106_0329 20251105_0326 20251104_0327 20251103_0324 20251102_0326 20251101_0324 20251031_0328 20251030_0330 20251029_0329 20251028_0329 20251027_0322 20251026_0327 20251025_0331 20251024_0329 20251023_0329 20251022_0330 20251021_0331 20251020_0328 20251019_0321 20251018_0327 20251017_0320 20251016_0328 20251015_0328 20251014_0323 20251011_0328 20251010_0330 20251009_0321 20251008_0343 20251007_0353 20251006_0325 20251005_0350 20251004_0352 20251003_0352 20251002_0356 20251001_0321 20250925_0335 20250924_0350 20250923_0348 20250922_0346 20250921_0345 20250920_0342 20250919_0346 20250918_0342 20250917_0336 20250916_0333 20250915_0333 20250914_0328 20250913_0322 20250912_0335 20250911_0337 20250910_0338 20250909_0341 20250908_0342 20250907_0333 20250906_0350 20250905_0319 20250904_0323 20250903_0355 20250902_0325 20250901_0355 20250831_0355 20250830_0356 20250829_0355 20250828_0333 20250827_1654 20250827_1602 20250827_1557 20250827_0320 20250826_0320 20250825_1752 20250825_1709 20250825_1652 20250825_1647 20250825_1645 20250825_1631 20250825_1606 20250825_1559 20250825_1558 20250825_1556 20250825_1531 20250825_1525 20250825_1516 20250825_1450 20250825_1444 20250825_1438 20250825_1414 20250825_1413 20250825_1410 20250825_1408 20250825_1405 20250825_1401 20250825_1355 20250825_1347 20250825_1345 20250825_1344 20250825_1343 20250825_1340 20250825_1339 20250825_1333 20250825_1323 20250825_1317 20250825_1243 20250824_0342 20250823_0343 20250823_0142 20250822_2331 20250822_2308 20250822_2258 20250822_2241 20250822_2228 20250822_2206 20250822_2147 20250822_2111 20250822_1259 20250822_1233 20250822_1229 20250822_1223 20250822_1210 20250822_1201 20250822_1111 20250822_1058 20250822_1052 20250822_1045 20250822_0657 20250822_0553