arXiv Paper Digest

GutenOCR: A Grounded Vision-Language Front-End for Documents

Authors: Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew

First: 2026-01-20T21:26:15+00:00 · Latest: 2026-01-22T18:58:24+00:00

Abstract

GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
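
The abstract does not spell out how the grounded OCR protocol scores predicted boxes. As a rough illustration of how a predicted line box can be compared with a reference box, the sketch below computes intersection-over-union, a common overlap measure. The `(x1, y1, x2, y2)` box format, the function name `box_iou`, and the sample coordinates are assumptions for illustration, not details from the paper.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and reference boxes for one text line.
predicted = (10, 10, 110, 30)
reference = (20, 10, 120, 30)
overlap = box_iou(predicted, reference)
print(f"IoU = {overlap:.3f}")
```

For the headline result, the reported change from 0.40 to 0.82 is a ratio of 0.82 / 0.40 = 2.05, which matches "more than doubles".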

Summary

GutenOCR is a family of vision-language models fine-tuned from Qwen2.5-VL-3B and Qwen2.5-VL-7B that provides reading, detection, and grounding through a single prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with bounding boxes and conditional queries. On 10.5K held-out pages, GutenOCR-7B reaches a composite grounded OCR score of 0.82, more than double the 0.40 of its backbone. On Fox and OmniDocBench v1.5, the models improve region- and line-level OCR and text-detection recall, with trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.

LLM-in-Sandbox Elicits General Agentic Intelligence

Authors: Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei

First: 2026-01-22T18:57:09+00:00 · Latest: 2026-01-22T18:57:09+00:00

Comments: Project Page: https://llm-in-sandbox.github.io

Abstract

We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.

Summary

The study introduces LLM-in-Sandbox, which lets large language models (LLMs) explore a code sandbox to elicit general intelligence in non-code domains. Strong LLMs, without additional training, use the sandbox for non-code tasks such as accessing external resources, handling long contexts through the file system, and executing scripts. LLM-in-Sandbox-RL strengthens these capabilities through reinforcement learning on non-agentic data only. Experiments show robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. The study also analyzes efficiency from computational and system perspectives and releases LLM-in-Sandbox as an open-source Python package for real-world deployment.

Training-Free Geospatial Place Representation Learning from Large-Scale Point-of-Interest Graph Data

Authors: Mohammad Hashemi, Hossein Amiri, Andreas Zufle

First: 2025-06-25T15:10:31+00:00 · Latest: 2026-01-22T18:46:50+00:00

Abstract

Learning effective representations of urban environments requires capturing spatial structure beyond fixed administrative boundaries. Existing geospatial representation learning approaches typically aggregate Points of Interest (POIs) into pre-defined administrative regions such as census units or ZIP code areas, assigning a single embedding to each region. However, POIs often form semantically meaningful groups that extend across, within, or beyond these boundaries, defining places that better reflect human activity and urban function. To address this limitation, we propose PlaceRep, a training-free geospatial representation learning method that constructs place-level representations by clustering spatially and semantically related POIs. PlaceRep summarizes large-scale POI graphs from U.S. Foursquare data to produce general-purpose urban region embeddings while automatically identifying places across multiple spatial scales. By eliminating model pre-training, PlaceRep provides a scalable and efficient solution for multi-granular geospatial analysis. Experiments with population density estimation and housing price prediction as downstream tasks show that PlaceRep outperforms most state-of-the-art graph-based geospatial representation learning methods and achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation of PlaceRep is available at https://github.com/mohammadhashemii/PlaceRep.
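
To make "clustering spatially and semantically related POIs" concrete, here is a minimal sketch that links POIs when they lie within a distance radius and share a category, then summarizes each group by its centroid. This is an illustrative stand-in, not PlaceRep's algorithm: the function names, the `(x, y, category)` tuple format, the same-category rule, and the sample data are all assumptions.

```python
from math import hypot


def cluster_pois(pois, radius):
    """Group POIs that lie within `radius` of each other and share a category.

    Each POI is (x, y, category). Returns clusters as sorted lists of POI indices.
    """
    parent = list(range(len(pois)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(pois)):
        for j in range(i + 1, len(pois)):
            xi, yi, ci = pois[i]
            xj, yj, cj = pois[j]
            if ci == cj and hypot(xi - xj, yi - yj) <= radius:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(pois)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def place_centroid(pois, members):
    """Summarize one cluster by the mean position of its POIs."""
    xs = [pois[i][0] for i in members]
    ys = [pois[i][1] for i in members]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


# Hypothetical POIs: (x, y, category).
pois = [(0.0, 0.0, "cafe"), (0.5, 0.0, "cafe"), (0.6, 0.1, "bank"), (5.0, 5.0, "cafe")]
fine_places = cluster_pois(pois, radius=1.0)     # small radius: fine-grained places
coarse_places = cluster_pois(pois, radius=10.0)  # large radius: coarser places
print(fine_places, coarse_places)
```

Varying `radius` is one simple way to obtain places at several spatial scales; the paper's multi-scale mechanism may work differently.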

Summary

The research develops a training-free method for learning geospatial place representations from large-scale POI graphs, addressing the limitation of existing approaches that aggregate POIs into fixed administrative regions. PlaceRep clusters spatially and semantically related POIs to generate place-level embeddings and automatically identifies places across multiple spatial scales. On population density estimation and housing price prediction, PlaceRep outperforms most state-of-the-art graph-based methods, with up to a 100x speedup in generating region-level representations on large-scale POI graphs.

Multimodal Climate Disinformation Detection: Integrating Vision-Language Models with External Knowledge Sources

Authors: Marzieh Adeli Shamsabad, Hamed Ghodrati

First: 2026-01-22T16:55:48+00:00 · Latest: 2026-01-22T16:55:48+00:00

Abstract

Climate disinformation has become a major challenge in today's digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay action on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image search results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model's ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.

Summary

The research addresses climate disinformation, particularly misleading images and videos on social media, by integrating vision-language models (VLMs) with external knowledge sources. Because VLMs know only what was available at training time, the system retrieves up-to-date information such as reverse image search results, online fact-checks, and trusted expert content. The authors report that this improves the model's ability to assess whether an image and its claim are accurate, misleading, false, or unverifiable, and that it supports efforts to protect public understanding of climate science.

DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models

Authors: Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan, Yangfan He, Yuchen Li, Jingqun Tang

First: 2026-01-22T16:02:56+00:00 · Latest: 2026-01-22T16:02:56+00:00

Abstract

Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in task-irrelevant regions, which we describe as "distracting tokens". This behavior can distract the model from generating the desired action tokens at each step, lowering the task success rate. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate and to explore the model's performance upper bound without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: https://anonymous.4open.science/r/CBD3.
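
The pruning idea can be sketched in a few lines. The abstract does not say how DTP decides which region is task-relevant or which tokens count as distracting, so the relevance mask, the fixed threshold, and both function names below are assumptions made purely for illustration.

```python
def prune_distracting_tokens(attention, relevant, threshold):
    """Return the indices of image tokens to keep.

    attention: per-token attention weights (floats).
    relevant:  per-token booleans, True if the token lies in the task-relevant region.
    A token is treated as distracting when it is outside the relevant region
    and its attention weight exceeds `threshold`.
    """
    keep = []
    for idx, (weight, is_relevant) in enumerate(zip(attention, relevant)):
        distracting = (not is_relevant) and weight > threshold
        if not distracting:
            keep.append(idx)
    return keep


def irrelevant_attention_share(attention, relevant):
    """Fraction of total attention that falls on task-irrelevant tokens."""
    total = sum(attention)
    off_task = sum(w for w, r in zip(attention, relevant) if not r)
    return off_task / total if total else 0.0


# Hypothetical attention over four image tokens.
attention = [0.05, 0.40, 0.30, 0.25]
relevant = [False, True, False, True]
kept = prune_distracting_tokens(attention, relevant, threshold=0.2)
share = irrelevant_attention_share(attention, relevant)
print(kept, round(share, 2))
```

The second helper mirrors the quantity the abstract correlates with task success: the amount of attention that lands in the task-irrelevant region.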

Summary

This paper addresses distracting tokens in Vision-Language Action (VLA) models: image tokens in task-irrelevant regions that receive excessive attention and can lower the success rate of robotic manipulation tasks. The authors propose a plug-and-play Distracting Token Pruning (DTP) framework that dynamically detects and prunes these tokens, correcting the model's visual attention patterns. Experiments on the SIMPLER Benchmark show consistent relative improvements in task success rate across several VLA models, without altering the model architecture or adding new inputs.

DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning

Authors: Junha Lee, Eunha Park, Minsu Cho

First: 2026-01-22T15:23:35+00:00 · Latest: 2026-01-22T15:23:35+00:00

Abstract

Language-driven dexterous grasp generation requires models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves a 67.14% success rate, outperforming the state of the art by 3.83 percentage points, with a 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
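
The reported margin is in percentage points, which is not the same as a relative percentage. A short worked check follows, using only the two success-rate figures from the abstract. The variable names are illustrative, and the implied prior result assumes the margin is measured on the same success-rate metric.

```python
success_rate = 67.14  # DextER success rate on DexGYS, in percent
gain_points = 3.83    # reported margin over the prior best, in percentage points

# Implied prior success rate, in percent.
prior_best = success_rate - gain_points
# The same margin expressed as a relative improvement, in percent.
relative_gain = gain_points / prior_best * 100

print(f"implied prior best: {prior_best:.2f}%")       # 63.31%
print(f"relative improvement: {relative_gain:.2f}%")  # 6.05%
```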

SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration

Authors: Yan Zhuang, Jiawei Ren, Xiaokang Ye, Jianzhi Shen, Ruixuan Zhang, Tianai Yue, Muhammad Faayez, Xuhong He, Ziqiao Ma, Lianhui Qin, Zhiting Hu, Tianmin Shu

Venue: NeurIPS 2025

First: 2025-12-10T20:04:08+00:00 · Latest: 2026-01-22T14:26:01+00:00

Comments: Conference: NeurIPS 2025 (main)

Abstract

Recent advances in foundation models have shown promising results in developing generalist robots that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has mainly focused on indoor, household scenarios. In this work, we present SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) multimodal instruction grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation with people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking the robust perception, reasoning, and planning abilities necessary for urban environments.

Summary

The research presents SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments, addressing the indoor focus of current robotics work. Built on Unreal Engine 5, SWR procedurally generates dynamic urban scenes with pedestrians and traffic systems and supports multi-robot control and communication. Two benchmarks are introduced: a multimodal instruction-following task and a multi-agent search task. Together they evaluate multimodal instruction grounding, 3D spatial reasoning, safe long-range navigation, multi-robot collaboration, and grounded communication. Experiments show that state-of-the-art models, including vision-language models, struggle with these tasks and lack the robust perception, reasoning, and planning abilities that urban environments require.

A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery

Authors: Valery Fischer, Alan Magdaleno, Anna-Katharina Calek, Nicola Cavalcanti, Nathan Hoffman, Christoph Germann, Joschua Wüthrich, Max Krähenmann, Mazda Farshad, Philipp Fürnstahl, Lilian Calvet

First: 2026-01-22T12:48:24+00:00 · Latest: 2026-01-22T12:48:24+00:00

Abstract

Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
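
Mean per-joint position error is the average Euclidean distance between predicted and reference joints. The sketch below shows that computation and how a relative reduction such as the reported 76% is formed. The joint coordinates and the baseline error values are hypothetical, and the function names are illustrative.

```python
from math import sqrt


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error: average Euclidean distance between matched 3D joints."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("joint lists must be non-empty and of equal length")
    total = 0.0
    for (px, py, pz), (gx, gy, gz) in zip(predicted, ground_truth):
        total += sqrt((px - gx) ** 2 + (py - gy) ** 2 + (pz - gz) ** 2)
    return total / len(predicted)


def relative_reduction(baseline_error, new_error):
    """Fraction by which `new_error` is lower than `baseline_error`."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical two-joint example: one joint exact, one joint off by 5 units.
error = mpjpe([(0, 0, 0), (3, 4, 0)], [(0, 0, 0), (0, 0, 0)])
# Hypothetical errors: a baseline at 10.0 and a method at 2.4 give a 76% reduction.
reduction = relative_reduction(10.0, 2.4)
print(error, round(reduction, 2))
```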

Summary

The study targets 3D hand pose estimation in surgical environments, which supports applications such as skill assessment and robot-assisted interventions. It proposes a multi-view pipeline that uses off-the-shelf pretrained models for person detection, whole-body pose estimation, and 2D hand keypoint prediction, followed by constrained 3D optimization. The pipeline is evaluated on a new surgical benchmark dataset with over 68,000 frames and 3,000 manually annotated 2D hand poses. Compared with baselines, it reduces 2D mean joint error by 31% and 3D mean per-joint position error by 76%.

RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture

Authors: Anas Anwarul Haq Khan, Mariam Husain, Kshitij Jadhav

First: 2026-01-22T12:11:53+00:00 · Latest: 2026-01-22T12:11:53+00:00

Abstract

Recent medical vision-language models use language supervision to guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image-text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image-text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
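
The objective described above, predicting latent representations of masked regions rather than pixels, can be illustrated with a small loss function. The abstract states only that latents of masked regions are predicted; the mean-squared-error form, the function name, and the toy vectors here are assumptions.

```python
def masked_latent_loss(predicted, target, mask):
    """Mean squared error between predicted and target latents, over masked patches only.

    predicted, target: one latent vector (list of floats) per image patch.
    mask: one boolean per patch, True where the patch was hidden from the encoder.
    """
    total, count = 0.0, 0
    for pred_vec, tgt_vec, is_masked in zip(predicted, target, mask):
        if not is_masked:
            continue  # visible patches do not contribute to the loss
        total += sum((p - t) ** 2 for p, t in zip(pred_vec, tgt_vec))
        count += len(pred_vec)
    return total / count if count else 0.0


# Toy example with three patches and 2-dimensional latents.
predicted = [[1.0, 0.0], [0.0, 0.0], [2.0, 2.0]]
target = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
mask = [False, True, True]
loss = masked_latent_loss(predicted, target, mask)
print(loss)
```

In a JEPA-style setup the targets would come from a separate target encoder applied to the full image; that part is omitted here.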

Summary

The research asks whether a robust radiology encoder for chest X-rays can be learned without paired image-text data, which is limited in availability. RadJEPA is a self-supervised framework that pre-trains on unlabeled chest X-ray images by predicting latent representations of masked image regions. Across benchmarks for disease classification, semantic segmentation, and report generation, it outperforms existing methods, including Rad-DINO.

Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Authors: Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei

First: 2026-01-21T08:09:25+00:00 · Latest: 2026-01-22T12:09:02+00:00

Abstract

Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT

Summary

The paper targets two problems with Chain-of-Thought (CoT) prompting in LLMs: the computational overhead of verbose reasoning and the lack of supervision on intermediate steps. Render-of-Thought (RoT) renders textual reasoning steps as images, making the latent reasoning explicit and traceable. It uses the vision encoders of existing Vision Language Models (VLMs) to align visual embeddings with the textual space, which allows plug-and-play use. Experiments on mathematical and logical reasoning benchmarks show 3-4x token compression and substantial inference acceleration compared to explicit CoT, with competitive performance.

VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing

Authors: Guanyuan Pan, Shuai Wang, Yugui Lin, Tiansheng Zhou, Pietro Liò, Yaqi Wang, Zhenxin Zhao

First: 2026-01-12T08:37:32+00:00 · Latest: 2026-01-22T11:46:08+00:00

Comments: 9 pages, 4 figures, submitted to the 10th International Conference on Control, Automation and Diagnosis (ICCAD'26)

Abstract

Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches rely solely on netlists, ignoring the circuit schematic, which hinders the cognitive link between the schematic and its performance. Furthermore, the black-box nature of machine learning methods and hallucination risks in large language models fail to provide the necessary ground-truth explainability required for industrial sign-off. To address these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing, and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-start from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experiment results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance while maintaining physics-based explainability. VLM-CAD meets all specification requirements while maintaining low power consumption in optimizing an amplifier with a complementary input and a class-AB output stage, with a total runtime under 66 minutes across all experiments on two amplifiers.

Summary

VLM-CAD is a Vision Language Model-optimized collaborative agent workflow for analog circuit sizing. It analyzes circuits, optimizes DC operating points, and performs inference-based sizing, using Image2Net to annotate circuit schematics and Explainable Trust Region Bayesian Optimization (ExTuRBO) for external sizing optimization. Experiments on amplifier sizing tasks with 180nm, 90nm, and 45nm Predictive Technology Models show that VLM-CAD balances power and performance while maintaining physics-based explainability and low power consumption.

MMP-A*: Multimodal Perception Enhanced Incremental Heuristic Search on Path Planning

Authors: Minh Hieu Ha, Khanh Ly Ta, Hung Phan, Tung Doan, Tung Dao, Dao Tran, Huynh Thi Thanh Binh

First: 2026-01-05T08:55:27+00:00 · Latest: 2026-01-22T10:24:37+00:00

Abstract

Autonomous path planning requires a synergy between global reasoning and geometric precision, especially in complex or cluttered environments. While classical A* is valued for its optimality, it incurs prohibitive computational and memory costs in large-scale scenarios. Recent attempts to mitigate these limitations by using Large Language Models for waypoint guidance remain insufficient, as they rely only on text-based reasoning without spatial grounding. As a result, such models often produce incorrect waypoints in topologically complex environments with dead ends, and lack the perceptual capacity to interpret ambiguous physical boundaries. These inconsistencies lead to costly corrective expansions and undermine the intended computational efficiency. We introduce MMP-A*, a multimodal framework that integrates the spatial grounding capabilities of vision-language models with a novel adaptive decay mechanism. By anchoring high-level reasoning in physical geometry, the framework produces coherent waypoint guidance that addresses the limitations of text-only planners. The adaptive decay mechanism dynamically regulates the influence of uncertain waypoints within the heuristic, ensuring geometric validity while substantially reducing memory overhead. To evaluate robustness, we test the framework in challenging environments characterized by severe clutter and topological complexity. Experimental results show that MMP-A* achieves near-optimal trajectories with significantly reduced operational costs, demonstrating its potential as a perception-grounded and computationally efficient paradigm for autonomous navigation.
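
One way to picture waypoint guidance whose influence fades is an A* search in which a waypoint bias is added to the heuristic and shrinks with each node expansion. This is a toy sketch only: the abstract does not define the adaptive decay mechanism, so the geometric decay schedule, the parameters `alpha` and `decay`, the function name, and the grid are all assumptions. With the bias active the heuristic is no longer admissible, so this sketch does not guarantee a shortest path.

```python
import heapq


def guided_astar(grid, start, goal, waypoint, alpha=1.0, decay=0.9):
    """A* on a 4-connected grid (0 = free, 1 = obstacle) with a decaying waypoint bias.

    Returns (path, expansions), or (None, expansions) if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    best_cost = {start: 0}
    parent = {start: None}
    frontier = [(manhattan(start, goal), start)]
    closed = set()
    expansions = 0

    while frontier:
        _, node = heapq.heappop(frontier)
        if node in closed:
            continue
        closed.add(node)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], expansions
        expansions += 1
        # Waypoint influence shrinks geometrically as the search proceeds.
        weight = alpha * decay ** expansions
        r, c = node
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols) or grid[nxt[0]][nxt[1]]:
                continue  # outside the grid or blocked
            if nxt in closed:
                continue
            cost = best_cost[node] + 1
            if cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = cost
                parent[nxt] = node
                priority = cost + manhattan(nxt, goal) + weight * manhattan(nxt, waypoint)
                heapq.heappush(frontier, (priority, nxt))
    return None, expansions


# Hypothetical map: a wall with a single gap at (1, 2), which also serves as the waypoint.
grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path, expansions = guided_astar(grid, start=(0, 0), goal=(2, 0), waypoint=(1, 2))
print(path)
```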

中文标题/摘要

标题：MMP-A*: 多模态感知增强的增量启发式搜索路径规划

自主路径规划需要全局推理与几何精度之间的协同作用，尤其是在复杂或拥挤的环境中。虽然经典的A*因其最优性而受到重视，但在大规模场景中会带来巨大的计算和内存成本。最近通过使用大型语言模型进行航点指导来缓解这些限制的努力仍然不足，因为它们仅依赖于基于文本的推理而缺乏空间定位能力。因此，这些模型在拓扑复杂且有死胡同的环境中经常生成错误的航点，并缺乏感知能力来解释模糊的物理边界。这些不一致导致昂贵的修正扩展，并削弱了预期的计算效率。我们引入了MMP-A*，这是一种结合了视觉语言模型的空间定位能力和新颖的自适应衰减机制的多模态框架。通过将高层次推理锚定在物理几何上，该框架生成连贯的航点指导，解决了纯文本规划器的局限性。自适应衰减机制动态调节启发式中不确定航点的影响，确保几何有效性同时大幅减少内存开销。为了评估鲁棒性，我们在严重拥挤和拓扑复杂性的环境中测试了该框架。实验结果表明，MMP-A*在显著降低操作成本的同时实现了接近最优的轨迹，证明了其作为感知导向和计算高效的自主导航范式的潜力。

Summary / 总结

MMP-A* is a multimodal framework that combines the spatial grounding of vision-language models with an adaptive decay mechanism to enhance path planning in complex environments. It addresses the limitations of classical A* and text-only planners by producing coherent waypoints and ensuring geometric validity. Experimental results show that MMP-A* achieves near-optimal trajectories with reduced operational costs, making it a promising approach for autonomous navigation.

MMP-A* 是一种结合了视觉语言模型的空间定位能力和自适应衰减机制的多模态框架，以增强 A* 路径规划。该方法通过整合几何推理解决了纯文本规划器的局限性，提高了复杂环境中路径点的准确性。实验结果表明，MMP-A* 能够以显著降低的计算和内存成本实现接近最优的轨迹，使其成为自主导航中一种感知导向且计算高效的范式。

Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video

Authors: Pascal Benschop, Justin Dauwels, Jan van Gemert

First: 2026-01-22T09:14:11+00:00 · Latest: 2026-01-22T09:14:11+00:00

Abs · PDF · Code1 · Code2

Abstract

Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.

中文标题/摘要

标题：基于合成生成视频评估VLMs的情境意识和空间意识

视觉语言模型（VLMs）中的空间推理在依赖于微妙的时间或几何线索时仍然脆弱。我们引入了一个合成基准，以探测两种互补的能力：情境意识（识别互动是否有害或无害）和空间意识（追踪谁对谁做了什么，并推理相对位置和运动）。通过最小的视频对，我们测试了三个挑战：区分暴力行为与良性活动、跨视角绑定攻击者角色以及判断细粒度轨迹对齐。虽然我们在无训练设置下评估了近期的VLMs，但该基准适用于任何视频分类模型。结果显示，各任务上的性能仅略高于随机猜测。一个简单的辅助，稳定的颜色线索，部分减少了攻击者角色的混淆，但未能解决根本缺陷。通过发布数据和代码，我们旨在提供可重复的诊断并激发对轻量级空间先验的研究，以补充大规模预训练。

Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization

Authors: Jiwei Guan, Haibo Jin, Haohan Wang

First: 2026-01-05T02:49:33+00:00 · Latest: 2026-01-22T09:09:47+00:00

Comments: EACL

Abs · PDF · Code1 · Code2

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have shown groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attack methods require full model accessibility, suffer from high computing costs and exhibit insufficient adversarial transferability, making them impractical for real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation by input-output interactions without requiring model knowledge, (ii) model-agnostic optimization without the surrogate model and (iii) lower resource requirements with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs, including InstructBLIP, LLaVA and MiniGPT-4, achieving the highest jailbreak success rate of 83.0% on InstructBLIP, while maintaining imperceptible perturbations comparable to white-box methods. Moreover, adversarial examples generated from MiniGPT-4 exhibit strong transferability to other LVLMs, with ASR reaching 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs.

中文标题/摘要

标题：使用黑盒优化构建针对大型视觉-语言模型的对抗输入

近期，大型视觉-语言模型（LVLMs）在多种多模态任务中展现了突破性的能力。然而，这些模型仍然容易受到对抗性越狱攻击的影响，攻击者通过构造细微的扰动来绕过安全机制并触发有害输出。现有的白盒攻击方法需要完全访问模型，计算成本高且对抗迁移性不足，使其在实际的黑盒环境中不切实际。为了解决这些限制，我们提出了一种基于同时扰动随机近似的零阶优化（ZO-SPSA）对LVLMs进行黑盒越狱攻击的方法。ZO-SPSA提供了三个关键优势：（i）仅通过输入-输出交互进行无梯度近似，无需模型知识，（ii）无需代理模型的模型无关优化，（iii）资源需求更低，GPU内存消耗更少。我们在包括InstructBLIP、LLaVA和MiniGPT-4在内的三个LVLMs上评估了ZO-SPSA，在InstructBLIP上取得了最高83.0%的越狱成功率，同时保持与白盒方法相当的不可感知扰动。此外，由MiniGPT-4生成的对抗性示例对其他LVLMs表现出强大的迁移性，ASR达到64.18%。这些发现强调了黑盒越狱攻击在实际环境中的可行性，并揭示了当前LVLMs安全机制中的关键弱点。

Summary / 总结

This study exposes the vulnerability of Large Vision-Language Models (LVLMs) to adversarial jailbreaks by proposing a black-box attack based on Zeroth-Order optimization with Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). The method requires no model knowledge or surrogate model and has lower resource requirements, including reduced GPU memory consumption. Across InstructBLIP, LLaVA, and MiniGPT-4, the highest jailbreak success rate is 83.0% (on InstructBLIP), and adversarial examples generated from MiniGPT-4 transfer strongly to other LVLMs, with an ASR reaching 64.18%. These results highlight the real-world feasibility of black-box attacks and the need for improved safety mechanisms in LVLMs.

该研究提出了一种基于同时扰动随机近似的零阶优化（ZO-SPSA）黑盒越狱攻击方法，揭示了大型视觉-语言模型（LVLMs）面对对抗性越狱攻击的脆弱性。该方法无需模型知识，具有模型无关性且资源需求较低。在InstructBLIP、LLaVA和MiniGPT-4上的实验显示，InstructBLIP上的越狱成功率最高，达到83.0%；由MiniGPT-4生成的对抗性示例对其他LVLMs表现出强大的迁移性，ASR达到64.18%。这些结果强调了黑盒越狱攻击在现实环境中的可行性，并揭示了当前LVLMs中关键的安全弱点。
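
For readers unfamiliar with SPSA, the sketch below shows the generic two-query gradient estimate that zeroth-order methods of this kind build on. It is textbook SPSA applied to a toy quadratic, not the paper's attack: `toy_loss`, `spsa_gradient`, `minimize`, and all step sizes are illustrative assumptions, and the black-box objective used in the paper is not described in the abstract.

```python
import random


def spsa_gradient(loss, x, c=1e-2, rng=random):
    """Estimate the gradient of `loss` at `x` from only two loss evaluations."""
    # Rademacher perturbation: every coordinate is +1 or -1 with equal probability.
    delta = [rng.choice((-1.0, 1.0)) for _ in x]
    x_plus = [xi + c * di for xi, di in zip(x, delta)]
    x_minus = [xi - c * di for xi, di in zip(x, delta)]
    diff = loss(x_plus) - loss(x_minus)
    # The same scalar difference is reused for every coordinate.
    return [diff / (2.0 * c * di) for di in delta]


def toy_loss(x):
    # Hypothetical stand-in for a black-box objective: only its output is observed.
    return sum((xi - 1.0) ** 2 for xi in x)


def minimize(loss, x0, steps=500, lr=0.05, c=1e-2, seed=0):
    """Plain gradient descent driven by SPSA estimates (no true gradients)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        grad = spsa_gradient(loss, x, c=c, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


if __name__ == "__main__":
    print(minimize(toy_loss, [0.0, 2.0, -3.0]))
```

Each step costs two evaluations of the objective regardless of the input dimension, which is why this family of estimators suits settings where only input-output access is available.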

VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Authors: Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin

Venue: NeurIPS 2025

First: 2025-06-10T17:59:44+00:00 · Latest: 2026-01-22T08:52:35+00:00

Comments: Accepted by NeurIPS 2025 Track on Datasets and Benchmarks. Project page: https://faceong.github.io/VIKI-R/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.

中文标题/摘要

标题：VIKI-R：通过强化学习协调具身多智能体合作

在动态环境中协调多个具身智能体仍然是人工智能的核心挑战，需要感知驱动的推理和可扩展的合作策略。虽然最近的工作利用了大型语言模型（LLMs）进行多智能体规划，但有少数开始探索视觉语言模型（VLMs）进行视觉推理。然而，这些基于VLM的方法在支持多种具身类型方面仍然有限。在本文中，我们介绍了VIKI-Bench，这是第一个针对具身多智能体合作的分层基准，包含三个结构化层次：智能体激活、任务规划和轨迹感知。VIKI-Bench 包含多种机器人具身、多视角视觉观察和结构化的监督信号，以评估基于视觉输入的推理。为了展示VIKI-Bench 的实用性，我们提出了VIKI-R，这是一种两阶段框架，首先使用带有Chain-of-Thought注释的演示对预训练的视觉语言模型（VLM）进行微调，然后在多级奖励信号下使用强化学习。我们的大量实验表明，VIKI-R 在所有任务级别上显著优于基线方法。此外，我们展示了强化学习使异构智能体之间出现组合合作模式。总体而言，VIKI-Bench 和 VIKI-R 提供了一个统一的测试平台和方法，以推进具身人工智能系统中的多智能体、视觉驱动的合作。

Summary / 总结

This work addresses the challenge of coordinating multiple embodied agents in dynamic environments by introducing VIKI-Bench, a hierarchical benchmark for embodied multi-agent cooperation. VIKI-R, a two-stage framework, fine-tunes a pretrained vision-language model using Chain-of-Thought annotated demonstrations and then applies reinforcement learning with multi-level reward signals. Experiments demonstrate that VIKI-R outperforms baseline methods across all task levels and enables the emergence of compositional cooperation patterns among heterogeneous agents.

该研究通过引入VIKI-Bench，一个面向多智能体合作的层次化基准，解决了在动态环境中协调多个实体智能体的挑战。VIKI-R是一个两阶段框架，首先使用带有思维链注释的演示数据微调预训练的视觉语言模型，然后应用多级奖励信号的强化学习。实验表明，VIKI-R在所有任务级别上均优于基线方法，并且能够促进异构智能体之间的组合性合作模式的出现。

Zero-Shot Product Attribute Labeling with Vision-Language Models: A Three-Tier Evaluation Framework

Authors: Shubham Shukla, Kunal Sonalkar

Venue: WACV 2026

First: 2026-01-22T07:33:41+00:00 · Latest: 2026-01-22T07:33:41+00:00

Comments: Accepted to WACV 2026 Workshop on Physical Retail AI (PRAW)

Abs · PDF · Code1 · Code2

Abstract

Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, "outer fabric" is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes (including NA class: suggesting attribute is not applicable) for all attributes, (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning attribute doesn't exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings on 5,000 images across 18 attributes. Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.

中文标题/摘要

标题：使用视觉语言模型的零样本产品属性标注：三层评估框架

细粒度属性预测对于时尚零售应用（包括目录丰富、视觉搜索和推荐系统）至关重要。视觉语言模型（VLMs）可以在无需特定任务训练的情况下实现零样本预测，但它们在多属性时尚任务上的系统评估仍未得到充分探索。一个关键挑战是时尚属性往往是条件性的。例如，当没有外衣可见时，“外层面料”是未定义的。这要求模型在尝试分类之前检测属性的适用性。我们引入了一个三层评估框架来分解这一挑战：（1）所有属性在所有类别（包括NA类：表明属性不适用）上的整体任务性能，（2）属性适用性检测，以及（3）当属性可确定时的细粒度分类。使用DeepFashion-MultiModal（其属性标签空间中明确定义了NA，表示属性不存在或不可见），我们在5,000张图像、18个属性上对九种VLMs（包括旗舰级（GPT-5, Gemini 2.5 Pro）、高效级（GPT-5 Mini, Gemini 2.5 Flash）和超高效级（GPT-5 Nano, Gemini 2.5 Flash-Lite））进行了基准测试，并与基于预训练Fashion-CLIP嵌入训练的分类器进行比较。我们的发现表明：（1）零样本VLMs实现了64.0%的宏F1，是预训练Fashion-CLIP嵌入上逻辑回归的三倍；（2）VLMs在细粒度分类（第3级：70.8% F1）方面表现出色，但在适用性检测（第2级：34.1% NA-F1）方面存在困难，指出了一个关键瓶颈；（3）高效模型在较低成本下实现了旗舰级性能的90%以上，提供了实际部署路径。此诊断框架使从业者能够确定错误是源自可见性检测还是分类，从而指导生产系统的针对性改进。

Summary / 总结

The study addresses fine-grained attribute prediction in fashion retail, focusing on zero-shot prediction with Vision-Language Models (VLMs). It introduces a three-tier evaluation framework covering overall task performance, attribute applicability detection, and fine-grained classification. Nine VLMs, ranging from flagship to ultra-efficient tiers, were benchmarked against classifiers trained on pretrained Fashion-CLIP embeddings, using 5,000 images across 18 attributes. Zero-shot VLMs reach 64.0% macro-F1, a threefold improvement over logistic regression on those embeddings. They are strong at fine-grained classification (Tier 3: 70.8% F1) but weak at applicability detection (Tier 2: 34.1% NA-F1), which is the key bottleneck, and efficient models retain over 90% of flagship performance at lower cost.

研究旨在评估Vision-Language模型（VLMs）在零样本预测细粒度时尚属性方面的表现，解决属性适用性的问题。开发了三层评估框架来评估整体任务性能、属性适用性检测和细粒度分类。九种VLMs在5,000张图像和18个属性上进行了基准测试，结果显示VLMs的宏F1达到64.0%，优于逻辑回归，并且高效模型可以实现旗舰模型90%以上的性能，同时成本更低。
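
The three tiers can be illustrated for a single attribute with a small scoring sketch. This is one plausible reading of the abstract, not the authors' evaluation code: the function names, the `"NA"` label string, and the choice to score Tier 3 only on samples whose ground truth is not NA are assumptions.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of one class treated as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def three_tier_scores(y_true, y_pred, na_label="NA"):
    classes = sorted(set(y_true))
    # Tier 1: macro-F1 over every class, NA included.
    tier1 = macro_f1(y_true, y_pred, classes)
    # Tier 2: how well the NA (not applicable) class itself is detected.
    tier2 = f1_for_class(y_true, y_pred, na_label)
    # Tier 3 (assumed): macro-F1 restricted to samples whose true label is not NA.
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != na_label]
    value_classes = [c for c in classes if c != na_label]
    tier3 = macro_f1([t for t, _ in kept], [p for _, p in kept], value_classes)
    return tier1, tier2, tier3


if __name__ == "__main__":
    truth = ["cotton", "denim", "NA", "NA", "cotton", "denim"]
    preds = ["cotton", "denim", "cotton", "NA", "cotton", "cotton"]
    print(three_tier_scores(truth, preds))
```

Separating the scores this way is what lets an error be attributed either to deciding whether the attribute applies at all (Tier 2) or to picking the right value once it does (Tier 3).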

Agentic Uncertainty Quantification

Authors: Jiaxin Zhang, Prafulla Kumar Choubey, Kung-Hsiang Huang, Caiming Xiong, Chien-Sheng Wu

First: 2026-01-22T07:16:26+00:00 · Latest: 2026-01-22T07:16:26+00:00

Comments: 36 pages, 9 figures, 9 tables

Abs · PDF · Code1 · Code2

Abstract

Although AI agents have demonstrated impressive capabilities in long-horizon reasoning, their reliability is severely hampered by the ``Spiral of Hallucination,'' where early epistemic errors propagate irreversibly. Existing methods face a dilemma: uncertainty quantification (UQ) methods typically act as passive sensors, only diagnosing risks without addressing them, while self-reflection mechanisms suffer from continuous or aimless corrections. To bridge this gap, we propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals. Our architecture comprises two complementary mechanisms: System 1 (Uncertainty-Aware Memory, UAM), which implicitly propagates verbalized confidence and semantic explanations to prevent blind decision-making; and System 2 (Uncertainty-Aware Reflection, UAR), which utilizes these explanations as rational cues to trigger targeted inference-time resolution only when necessary. This enables the agent to balance efficient execution and deep deliberation dynamically. Extensive experiments on closed-loop benchmarks and open-ended deep research tasks demonstrate that our training-free approach achieves superior performance and trajectory-level calibration. We believe this principled framework AUQ represents a significant step towards reliable agents.

中文标题/摘要

标题：代理不确定性量化

尽管AI代理在长程推理方面表现出色，但它们的可靠性因“幻觉螺旋”而严重受损，即早期的认知错误不可逆地传播。现有方法面临困境：不确定性量化（UQ）方法通常作为被动传感器，仅诊断风险而不解决问题，而自我反思机制则遭受持续或盲目修正。为弥合这一差距，我们提出了一种统一的双重过程代理不确定性量化（AUQ）框架，将口头表达的不确定性转化为主动的双向控制信号。我们的架构包括两个互补机制：系统1（不确定性感知记忆，UAM），它隐式地传播口头表达的置信度和语义解释，以防止盲目决策；系统2（不确定性感知反思，UAR），它利用这些解释作为理性的提示，仅在必要时触发有针对性的推理时消解。这使代理能够动态平衡高效执行和深入的反思。在闭环基准和开放性深度研究任务上的广泛实验表明，我们的无需训练方法在性能和轨迹级校准方面表现出色。我们认为，这一原理性的框架AUQ代表了迈向可靠代理的重要一步。

Summary / 总结

The research aims to address the issue of reliability in AI agents by proposing a Dual-Process Agentic Uncertainty Quantification (AUQ) framework. This framework transforms verbalized uncertainty into active control signals through two mechanisms: Uncertainty-Aware Memory (UAM) and Uncertainty-Aware Reflection (UAR). UAM prevents blind decision-making by implicitly propagating confidence and explanations, while UAR uses these explanations to trigger targeted inference-time resolution when necessary. Experiments show that this approach improves performance and trajectory-level calibration without training. The AUQ framework is seen as a significant step towards more reliable AI agents.

研究旨在通过提出一种双重过程代理不确定性量化（AUQ）框架来解决AI代理在长期推理中的可靠性问题。该框架通过两种机制将口头表达的不确定性转化为主动控制信号：不确定性感知记忆（UAM）和不确定性感知反思（UAR）。实验表明，AUQ在无需训练的情况下提高了性能和轨迹级校准，代表了更可靠AI代理的重要一步。

Multi-event Video-Text Retrieval

Authors: Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp

First: 2023-08-22T16:32:46+00:00 · Latest: 2026-01-22T06:58:13+00:00

Comments: [fixed typos in equations] accepted to ICCV2023 Poster; some figures are not supported when viewed online, please download the file and view locally

Abs · PDF · Code1 · Code2 · Code3

Abstract

Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work characterized by using a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs has become a prominent approach for the VTR task. However, these models operate under the assumption of bijective video-text correspondences and neglect a more practical scenario where video content usually encompasses multiple events, while texts like user queries or webpage metadata tend to be specific and correspond to single events. This establishes a gap between the previous training objective and real-world applications, leading to the potential performance degradation of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval Task. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, effectively establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR.

中文标题/摘要

标题：多事件视频-文本检索

视频-文本检索（VTR）是互联网上大规模视频-文本数据时代的一项重要多模态任务。使用两流视觉-语言模型架构来学习视频-文本对的联合表示已成为VTR任务的主要方法。然而，这些模型假设视频-文本对应关系是一一对应的，并忽略了视频内容通常包含多个事件，而文本如用户查询或网页元数据通常特定于单一事件的更实际场景。这导致了先前训练目标与实际应用之间的差距，使得早期模型在推理时可能性能下降。在本研究中，我们提出了多事件视频-文本检索（MeVTR）任务，以解决每个视频包含多个不同事件的场景，这是传统视频-文本检索任务的一个特殊场景。我们提出了一种简单的模型Me-Retriever，该模型结合了关键事件视频表示和新的MeVTR损失函数。全面的实验表明，该简单框架在视频到文本和文本到视频任务中优于其他模型，有效地为MeVTR任务建立了稳健的基线。我们相信这项工作为未来的研究奠定了坚实的基础。代码可在https://github.com/gengyuanmax/MeVTR 获取。

Summary / 总结

The study addresses the limitation of existing Video-Text Retrieval (VTR) models by introducing the Multi-event Video-Text Retrieval (MeVTR) task, where videos contain multiple events while texts correspond to single events. The researchers propose Me-Retriever, a simple model that incorporates key event video representation and a new MeVTR loss function. Experiments show that Me-Retriever outperforms other models in both Video-to-Text and Text-to-Video tasks, providing a robust baseline for the MeVTR task.

研究引入了多事件视频-文本检索（MeVTR）任务，以应对视频中包含多个事件的情况，不同于传统的VTR任务，后者假设视频-文本对应是双射的。Me-Retriever模型结合了关键事件视频表示和新的MeVTR损失，优于其他模型，在视频到文本和文本到视频任务中表现出色，为MeVTR任务提供了稳健的基线。全面的实验表明，它在处理多事件场景时的有效性，填补了训练目标与实际应用之间的差距。

PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection

Authors: Po-Han Huang, Jeng-Lin Li, Po-Hsuan Huang, Ming-Ching Chang, Wei-Chao Chen

Venue: WACV 2026

First: 2025-09-30T06:52:08+00:00 · Latest: 2026-01-22T06:50:23+00:00

Comments: 10 pages, 5 figures. WACV 2026 (Accepted)

Abs · PDF · Code1 · Code2

Abstract

Industrial anomaly detection is increasingly relying on foundation models, aiming for strong out-of-distribution generalization and rapid adaptation in real-world deployments. Notably, past studies have primarily focused on textual prompt tuning, leaving the intrinsic visual counterpart fragmented into processing steps specific to each foundation model. We aim to address this limitation by proposing a unified patch-focused framework, Patch-Exclusive Anomaly Detection (PatchEAD), enabling training-free anomaly detection that is compatible with diverse foundation models. The framework constructs visual prompting techniques, including an alignment module and foreground masking. Our experiments show superior few-shot and batch zero-shot performance compared to prior work, despite the absence of textual features. Our study further examines how backbone structure and pretrained characteristics affect patch-similarity robustness, providing actionable guidance for selecting and configuring foundation models for real-world visual inspection. These results confirm that a well-unified patch-only framework can enable quick, calibration-light deployment without the need for carefully engineered textual prompts.

中文标题/摘要

标题：PatchEAD：统一工业视觉提示框架以实现仅补丁异常检测

工业异常检测越来越多地依赖于基础模型，旨在实现强大的离分布泛化和在实际部署中的快速适应。值得注意的是，以往的研究主要集中在文本提示调优上，而视觉方面的内在对应物则被分割成每个基础模型特有的处理步骤。我们旨在通过提出一个统一的专用于补丁的框架——Patch-Exclusive Anomaly Detection (PatchEAD)，来解决这一局限性，该框架能够实现无需训练的异常检测，并兼容多种基础模型。该框架构建了视觉提示技术，包括对齐模块和前景遮罩。我们的实验表明，与先前的工作相比，尽管没有使用文本特征，但其在少量样本和批量零样本检测方面的性能更优。我们的研究进一步探讨了基础模型的结构和预训练特性如何影响补丁相似性鲁棒性，为选择和配置适用于实际视觉检查的基础模型提供了可操作的指导。这些结果证实，一个良好统一的仅补丁框架可以实现快速、校准轻量的部署，无需精心设计的文本提示。

Summary / 总结

The research aims to improve the visual prompting techniques for anomaly detection in industrial settings by proposing PatchEAD, a unified framework that enhances out-of-distribution generalization and rapid adaptation. PatchEAD uses an alignment module and foreground masking to enable training-free anomaly detection compatible with various foundation models. Experiments demonstrate that PatchEAD outperforms previous methods in few-shot and batch zero-shot scenarios without relying on textual features. The study also explores how backbone structure and pretrained characteristics impact patch-similarity robustness, offering practical guidance for real-world applications.

研究旨在通过提出PatchEAD统一框架，改进工业环境中的异常检测的视觉提示技术，增强其分布外泛化能力和快速适应能力。PatchEAD使用对齐模块和前景遮罩来实现无需训练的异常检测，并兼容多种基础模型。实验表明，PatchEAD在少量样本和批量零样本场景中优于先前方法，且不依赖于文本特征。研究还探讨了基础模型结构和预训练特性对patch相似性鲁棒性的影响，为实际应用提供实用指导。

VIOLA: Towards Video In-Context Learning with Minimal Annotations

Authors: Ryo Fujii, Hideo Saito, Ryo Hachiuma

First: 2026-01-22T00:35:30+00:00 · Latest: 2026-01-22T00:35:30+00:00

Abs · PDF · Code1 · Code2

Abstract

Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments like industrial or surgical settings since they require the experts' annotations. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that are simultaneously diverse, representative, and informative. Second, to utilize the remaining unlabeled data without noise propagation, we construct a hybrid pool and introduce confidence-aware retrieval and confidence-aware prompting. These mechanisms explicitly model label reliability, retrieving demonstrations based on a composite score of similarity and confidence while enabling the MLLM to adaptively distinguish between verified ground truths and noisy pseudo-labels. Extensive experiments across nine diverse benchmarks using four MLLMs demonstrate that our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.

中文标题/摘要

标题：VIOLA：面向最少标注的视频上下文学习

将多模态大型语言模型（MLLMs）推广到新的视频领域对于实际部署至关重要，但由于标注数据稀缺而充满挑战。虽然上下文学习（ICL）提供了一种无需训练的适应路径，但标准方法依赖于大规模的标注数据池，这在工业或手术等专业环境中往往是不切实际的，因为这需要专家的标注。为了解决这一问题，我们引入了VIOLA（视频上下文学习与最少标注），这是一种标签高效的框架，将最少的专家监督与大量的未标注数据相结合。首先，为了最大化严格的标注预算效率，我们提出了密度不确定性加权采样。与标准的多样性和不确定性策略不同，我们的方法利用密度估计来识别同时具有多样性和代表性的样本，这些样本还具有信息性。其次，为了在不传播噪声的情况下利用剩余的未标注数据，我们构建了一个混合池，并引入了置信度感知检索和置信度感知提示。这些机制明确建模了标签的可靠性，根据相似性和置信度的复合得分检索示例，同时使MLLM能够自适应地区分验证的真实标签和嘈杂的伪标签。在四个MLLMs和九个不同基准上的广泛实验表明，我们的框架在低资源设置中显著优于各种基线，实现了在最少标注成本下的稳健适应。

Summary / 总结

VIOLA is a label-efficient framework for video in-context learning that minimizes the need for expert annotations. It uses density-uncertainty-weighted sampling to select diverse and informative samples, and a hybrid pool with confidence-aware mechanisms to utilize unlabeled data without noise propagation. Experiments show that VIOLA outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.

VIOLA 是一种结合少量专家监督和大量未标注数据的高效框架，旨在增强多模态大型语言模型在新型视频领域的泛化能力。该框架引入了基于密度不确定性加权的采样方法来选择多样且信息丰富的样本，并通过置信度感知的机制利用未标注数据而不传播噪声。实验表明，VIOLA 在低资源设置中显著优于各种基线，实现了在少量标注成本下的稳健适应。

MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification

Authors: Jingwei Song, Xinyu Wang, Hanbin Wang, Xiaoxuan Lei, Bill Shi, Shixin Han, Eric Yang, Xiao-Wen Chang, Lynn Ai

First: 2026-01-21T22:03:06+00:00 · Latest: 2026-01-21T22:03:06+00:00

Comments: 11 pages, 5 figures

Abs · PDF · Code1 · Code2

Abstract

Speculative Decoding (SD) accelerates autoregressive large language model (LLM) inference by decoupling generation and verification. While recent methods improve draft quality by tightly coupling the drafter with the target model, the verification mechanism itself remains largely unchanged, relying on strict token-level rejection sampling. In practice, modern LLMs frequently operate in low-margin regimes where the target model exhibits weak preference among top candidates. In such cases, rejecting plausible runner-up tokens yields negligible information gain while incurring substantial rollback cost, leading to a fundamental inefficiency in verification. We propose Margin-Aware Speculative Verification, a training-free and domain-agnostic verification strategy that adapts to the target model's local decisiveness. Our method conditions verification on decision stability measured directly from the target logits and relaxes rejection only when strict verification provides minimal benefit. Importantly, the approach modifies only the verification rule and is fully compatible with existing target-coupled speculative decoding frameworks. Extensive experiments across model scales ranging from 8B to 235B demonstrate that our method delivers consistent and significant inference speedups over state-of-the-art baselines while preserving generation quality across diverse benchmarks.

中文标题/摘要

标题：MARS：通过 Margin-Aware 验证释放推测解码的潜力

推测解码（SD）通过解耦生成和验证来加速自回归大型语言模型（LLM）的推理。虽然最近的方法通过紧密耦合草稿生成器和目标模型来提高草稿质量，但验证机制本身几乎没有变化，仍然依赖于严格的令牌级拒绝采样。实际上，现代LLM经常在低边际区域运行，目标模型对顶级候选者之间表现出较弱的偏好。在这种情况下，拒绝可能的亚军令牌几乎没有信息增益，但会带来显著的回滚成本，导致验证中的根本性低效。我们提出了Margin-Aware推测验证，这是一种无需训练且领域通用的验证策略，能够适应目标模型的局部决断性。该方法根据直接从目标对数中测量的决策稳定性进行验证，并仅在严格的验证提供最小益处时才放松拒绝。重要的是，该方法仅修改验证规则，并完全兼容现有的目标耦合推测解码框架。在从8B到235B的模型规模上进行的广泛实验表明，我们的方法在保持生成质量的同时，相对于最先进的基线方法提供了持续且显著的推理加速。

Summary / 总结

The paper introduces Margin-Aware Speculative Verification (MARS), a verification strategy for Speculative Decoding (SD) in autoregressive LLM inference. It addresses the inefficiency of strict token-level rejection sampling in low-margin regimes by adapting verification based on the target model's local decisiveness. Experiments show that MARS provides consistent and significant inference speedups while maintaining generation quality across various model scales.

论文提出了MARS，通过引入基于边际感知的验证来增强LLM中的推测解码。该方法根据目标模型的局部决断性调整验证，仅在严格验证提供微小益处时才放松拒绝。实验表明，该方法在不牺牲生成质量的情况下，能够一致地显著提高推理速度。
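
The idea of relaxing rejection only in low-margin positions can be shown with a simplified greedy-decoding sketch. The abstract does not give the paper's acceptance rule, so everything here is an assumption for illustration: the function name `verify_draft`, the use of the raw top-1 minus top-2 logit gap as the margin, the restriction to the runner-up token, and the threshold value.

```python
def verify_draft(draft_tokens, target_logits, margin_threshold=0.5):
    """Return how many leading draft tokens are accepted.

    draft_tokens: token ids proposed by the drafter.
    target_logits: one list of target-model logits per draft position.
    """
    accepted = 0
    for token, logits in zip(draft_tokens, target_logits):
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top1, top2 = ranked[0], ranked[1]
        margin = logits[top1] - logits[top2]
        if token == top1:
            accepted += 1  # strict verification: draft matches the target's choice
        elif token == top2 and margin < margin_threshold:
            accepted += 1  # relaxed: the target barely prefers top1 over the draft
        else:
            break  # first rejection ends the accepted prefix
    return accepted


if __name__ == "__main__":
    logits = [[2.0, 1.9, 0.1], [3.0, 0.5, 0.2], [1.0, 0.9, 0.8]]
    draft = [1, 0, 2]
    print(verify_draft(draft, logits))                        # relaxed rule
    print(verify_draft(draft, logits, margin_threshold=0.0))  # strict rule
```

Setting `margin_threshold=0.0` recovers strict greedy matching in this sketch, so the threshold controls how much rollback is traded for tolerance of near-ties.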

DevPrompt: Deviation-Based Prompt Learning for One-Normal Shot Image Anomaly Detection

Authors: Morteza Poudineh, Marc Lalonde

First: 2026-01-21T20:35:51+00:00 · Latest: 2026-01-21T20:35:51+00:00

Comments: 8 pages

Abs · PDF · Code1 · Code2

Abstract

Few-normal shot anomaly detection (FNSAD) aims to detect abnormal regions in images using only a few normal training samples, making the task highly challenging due to limited supervision and the diversity of potential defects. Recent approaches leverage vision-language models such as CLIP with prompt-based learning to align image and text features. However, existing methods often exhibit weak discriminability between normal and abnormal prompts and lack principled scoring mechanisms for patch-level anomalies. We propose a deviation-guided prompt learning framework that integrates the semantic power of vision-language models with the statistical reliability of deviation-based scoring. Specifically, we replace fixed prompt prefixes with learnable context vectors shared across normal and abnormal prompts, while anomaly-specific suffix tokens enable class-aware alignment. To enhance separability, we introduce a deviation loss with Top-K Multiple Instance Learning (MIL), modeling patch-level features as Gaussian deviations from the normal distribution. This allows the network to assign higher anomaly scores to patches with statistically significant deviations, improving localization and interpretability. Experiments on the MVTecAD and VISA benchmarks demonstrate superior pixel-level detection performance compared to PromptAD and other baselines. Ablation studies further validate the effectiveness of learnable prompts, deviation-based scoring, and the Top-K MIL strategy.

中文标题/摘要

标题：DevPrompt：基于偏差的提示学习用于单正常样本图像异常检测

少量正常样本异常检测（FNSAD）旨在仅使用少量正常训练样本检测图像中的异常区域，由于监督有限且潜在缺陷多样，任务极具挑战性。最近的方法利用如CLIP等视觉语言模型结合提示学习来对齐图像和文本特征。然而，现有方法在正常和异常提示之间的区分能力较弱，并缺乏针对块级异常的原理性评分机制。我们提出了一种基于偏差的提示学习框架，将视觉语言模型的语义能力与基于偏差的评分的统计可靠性相结合。具体而言，我们用在正常和异常提示之间共享的学习上下文向量替换固定的提示前缀，而特定于异常的后缀标记使类别感知对齐成为可能。为了增强可分性，我们引入了一种基于偏差的Top-K多实例学习（MIL）损失，将块级特征建模为与正常分布的高斯偏差。这使网络能够将更高的异常评分赋予统计上显著偏差的块，从而提高定位和可解释性。在MVTecAD和VISA基准上的实验表明，与PromptAD和其他基线相比，该方法在像素级检测性能上表现出优越性。消融研究进一步验证了可学习提示、基于偏差的评分和Top-K MIL策略的有效性。

Summary / 总结

DevPrompt is a deviation-guided prompt learning framework designed for few-normal shot image anomaly detection. It uses learnable context vectors and anomaly-specific suffix tokens to enhance the discriminability between normal and abnormal prompts. A deviation loss with Top-K Multiple Instance Learning is introduced to model patch-level features as Gaussian deviations from the normal distribution, improving anomaly score assignment and localization. Experiments show superior pixel-level detection performance compared to existing methods like PromptAD and baselines on MVTecAD and VISA benchmarks.

DevPrompt 是一种用于少量正常样本图像异常检测的偏差引导提示学习框架。它使用可学习的上下文向量和异常特定的后缀标记来增强正常和异常提示之间的可区分性。引入了基于 Top-K 多实例学习的偏差损失来将 patch 级别特征建模为从正常分布的高斯偏差，从而提高异常评分和定位。实验显示其在像素级检测性能上优于 PromptAD 和其他基线方法。
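
A rough sketch of how Top-K pooling and a deviation-style loss fit together is given below. It follows the general form of the deviation loss from deviation networks (a z-score against reference scores drawn from a standard Gaussian, pushed toward zero for normal samples and beyond a margin for anomalies). How DevPrompt actually combines these pieces is not specified in the abstract, so the function names, `k`, `margin`, and the reference sample size are assumptions.

```python
import random
import statistics


def topk_mean(patch_scores, k):
    """Top-K MIL pooling: average the k highest patch-level anomaly scores."""
    top = sorted(patch_scores, reverse=True)[:k]
    return sum(top) / len(top)


def deviation_loss(patch_scores, label, k=3, margin=5.0, n_ref=5000, seed=0):
    """Deviation-style loss for one image (label 0 = normal, 1 = anomalous)."""
    rng = random.Random(seed)
    # Reference scores sampled from a standard Gaussian prior (assumed).
    ref = [rng.gauss(0.0, 1.0) for _ in range(n_ref)]
    mu, sigma = statistics.fmean(ref), statistics.pstdev(ref)
    # Deviation of the pooled score from the reference distribution, as a z-score.
    dev = (topk_mean(patch_scores, k) - mu) / sigma
    if label == 0:
        return abs(dev)  # normal images should sit close to the reference mean
    return max(0.0, margin - dev)  # anomalies should deviate by at least `margin`


if __name__ == "__main__":
    normal_patches = [0.1, -0.2, 0.0, 0.05]
    anomalous_patches = [0.0, 0.1, 8.0, 9.0, 8.5]
    print(deviation_loss(normal_patches, label=0))
    print(deviation_loss(anomalous_patches, label=1))
```

Pooling only the top-K patches means a few strongly deviating patches can drive the image-level score, which matches the intuition that defects usually occupy a small part of the image.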

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

Authors: Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem

Venue: CVPR 2026 (submitted, under review)

First: 2026-01-21T19:19:41+00:00 · Latest: 2026-01-21T19:19:41+00:00

Comments: 31 pages, 7 figures, submitted to CVPR 2026 (under review)

Abs · PDF · Code1 · Code2 · Code3

Abstract

Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure

中文标题/摘要

标题：CURE：基于课程指导的多任务训练以生成可靠的解剖学导向报告

医学视觉-语言模型可以自动化放射学报告的生成，但难以实现准确的视觉定位和事实一致性。现有模型经常将文本发现与视觉证据对齐不当，导致不可靠或弱定位的预测。我们提出了CURE，一种错误感知的课程学习框架，可以在不使用额外数据的情况下提高定位和报告质量。CURE在公共数据集上对多模态指令模型进行微调，用于短语定位、定位报告生成和解剖学定位报告生成。该方法根据模型性能动态调整采样，强调更难的样本以提高空间和文本对齐。CURE将定位准确度提高了0.37 IoU，提升了报告质量0.188 CXRFEScore，并减少了18.6%的幻觉。CURE是一种数据高效的框架，能够同时提高定位准确度和报告可靠性。代码可在https://github.com/PabloMessina/CURE 获取，模型权重可在https://huggingface.co/pamessina/medgemma-4b-it-cure 获取

Summary / 总结

CURE is an error-aware curriculum learning framework that improves the accuracy of visual grounding and report quality in medical vision-language models without additional data. It fine-tunes a multimodal model on phrase grounding, grounded report generation, and anatomy-grounded report generation, dynamically adjusting sampling based on model performance. CURE enhances grounding accuracy by 0.37 IoU, improves report quality by 0.188 CXRFEScore, and reduces hallucinations by 18.6%. This data-efficient approach boosts both grounding accuracy and report reliability.

CURE 是一种错误感知的课程学习框架，无需额外数据即可提高医疗视觉-语言模型的视觉定位准确性和报告质量。它在三个任务上微调多模态模型，并根据性能动态调整采样，重点关注更难的样本。CURE 将定位准确率提高了 0.37 IoU，提高了报告质量 0.188 CXRFEScore，并减少了 18.6% 的幻觉。这种方法提高了定位准确性和报告可靠性。
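
The performance-driven sampling described above can be sketched as follows. This is an assumption-laden illustration, not the authors' implementation: the error values, the `temperature` and `floor` parameters, and the function names are hypothetical. Tasks with higher current error are drawn more often, and a floor keeps easy tasks from vanishing.

```python
import random

def update_sampling_weights(errors, temperature=1.0, floor=0.05):
    """Map per-task error rates in [0, 1] to sampling probabilities.

    Higher error means the task is sampled more often; `floor` keeps
    every task at a non-zero probability.
    """
    raw = {task: max(err, floor) ** (1.0 / temperature) for task, err in errors.items()}
    total = sum(raw.values())
    return {task: w / total for task, w in raw.items()}

def sample_task(weights, rng):
    """Draw one task name according to the current sampling probabilities."""
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=1)[0]

# Hypothetical validation error per task after some training step.
errors = {
    "phrase_grounding": 0.6,
    "grounded_report": 0.3,
    "anatomy_grounded_report": 0.1,
}
weights = update_sampling_weights(errors)
next_task = sample_task(weights, random.Random(0))
print(weights, next_task)
```

Recomputing `weights` every few hundred steps from fresh validation errors would give the dynamic behaviour the abstract describes; how often CURE actually does this is not stated in the summary.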

Towards Understanding Best Practices for Quantization of Vision-Language Models

Authors: Gautom Das, Vincent La, Ethan Lau, Abhinav Shrivastava, Matthew Gwilliam

First: 2026-01-21T18:59:51+00:00 · Latest: 2026-01-21T18:59:51+00:00

Comments: 15 pages, 12 figures, 1 table

Abstract

Large language models (LLMs) deliver impressive results for a variety of tasks, but state-of-the-art systems require fast GPUs with large amounts of memory. To reduce both the memory and latency of these systems, practitioners quantize their learned parameters, typically at half precision. A growing body of research focuses on preserving the model performance with more aggressive bit widths, and some work has been done to apply these strategies to other models, like vision transformers. In our study we investigate how a variety of quantization methods, including state-of-the-art GPTQ and AWQ, can be applied effectively to multimodal pipelines comprised of vision models, language models, and their connectors. We address how performance on captioning, retrieval, and question answering can be affected by bit width, quantization method, and which portion of the pipeline the quantization is used for. Results reveal that ViT and LLM exhibit comparable importance in model performance, despite significant differences in parameter size, and that lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw). These findings provide practical insights for efficient deployment of MLLMs and highlight the value of exploration for understanding component sensitivities in multimodal models. Our code is available at https://github.com/gautomdas/mmq.

中文标题/摘要

标题：理解视觉-语言模型量化最佳实践

大型语言模型（LLMs）在各种任务中表现出色，但最先进的系统需要快速的GPU和大量的内存。为了减少这些系统的内存和延迟，从业者通常会将它们学习到的参数量化为半精度。越来越多的研究集中在使用更激进的位宽来保持模型性能，一些工作已经将这些策略应用于其他模型，如视觉变换器。在我们的研究中，我们探讨了如何有效地将包括最先进的GPTQ和AWQ在内的各种量化方法应用于由视觉模型、语言模型及其连接器组成的多模态管道。我们研究了位宽、量化方法以及量化在管道中的使用位置如何影响字幕生成、检索和问答任务的性能。结果表明，尽管参数规模存在显著差异，ViT和LLM在模型性能中具有相当的重要性，并且LLM的低位量化可以在减少每个权重位数（bpw）的情况下实现高精度。这些发现为高效部署多模态大语言模型提供了实用见解，并突显了探索多模态模型组件敏感性的价值。我们的代码可在https://github.com/gautomdas/mmq获取。

Summary / 总结

This study investigates how various quantization methods, including GPTQ and AWQ, apply to multimodal pipelines composed of vision models, language models, and their connectors. It examines how bit width, quantization method, and the quantized portion of the pipeline affect captioning, retrieval, and question answering. The key findings are that the ViT and the LLM are comparably important to model performance despite their large difference in parameter count, and that lower-bit quantization of the LLM retains high accuracy at reduced bits per weight (bpw).

研究探讨了将GPTQ和AWQ等不同量化方法应用于包含视觉模型、语言模型及其连接器的多模态管道中。研究旨在了解不同的位宽和量化技术如何影响诸如图像字幕、检索和问答等任务的性能。主要发现包括：尽管参数量存在显著差异，视觉Transformer（ViT）和语言模型（LLM）对模型性能的重要性相当；对LLM进行较低位宽的量化可以在减少每权重位数（bpw）的同时保持高精度。
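
As a rough illustration of what bit width means for reconstruction error, the sketch below applies plain symmetric round-to-nearest quantization to a small weight vector. This is deliberately the simplest baseline, not GPTQ or AWQ, and the weight values are made up for the example.

```python
def quantize_rtn(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization.

    Returns the integer codes and the scale needed to map them back.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical weights; real layers hold millions of values.
weights = [0.12, -0.53, 0.97, -0.08, 0.41, -0.76]
for bits in (8, 4, 2):
    codes, scale = quantize_rtn(weights, bits)
    err = mean_abs_error(weights, dequantize(codes, scale))
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The error grows as the bit width shrinks, which is why the study asks which pipeline component (ViT, connector, or LLM) tolerates aggressive bit widths best.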

Iterative Refinement Improves Compositional Image Generation

Authors: Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj, Zheyang Qin, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak

First: 2026-01-21T18:59:40+00:00 · Latest: 2026-01-21T18:59:40+00:00

Comments: Project webpage: https://iterative-img-gen.github.io/

Abstract

Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/

中文标题/摘要

标题：迭代优化提升组合图像生成

文本到图像（T2I）模型已经取得了显著进展，但仍难以处理需要同时处理多个对象、关系和属性的复杂提示。现有的推理时策略，如并行采样带验证器或简单增加去噪步骤，可以改善提示对齐，但在许多约束必须满足的丰富组合场景中仍然不足。受大型语言模型中链式思考推理成功的启发，我们提出了一种迭代的测试时策略，在该策略中，T2I模型在多个步骤中逐步优化其生成，由循环中的视觉语言模型作为批评者提供反馈。我们的方法简单，不需要外部工具或先验知识，并且可以灵活应用于各种图像生成器和视觉语言模型。实验证明，我们的方法在基准测试中的一致改进：在ConceptMix（k=7）上提高了16.9%的全正确率，在T2I-CompBench（3D-空间类别）上提高了13.8%，在视觉积木场景分解上提高了12.5%，与计算匹配的并行采样相比。除了定量改进，迭代优化还通过将复杂提示分解为顺序修正，生成更忠实的图像，人类评估者中有58.7%的人更偏好我们的方法，而并行基线为41.3%。这些发现共同强调了迭代自我修正作为组合图像生成广泛适用原则的重要性。结果和可视化可在https://iterative-img-gen.github.io/获取

Summary / 总结

The paper addresses the challenge of generating images from richly compositional text prompts by proposing an iterative refinement strategy. The text-to-image model refines its output across multiple steps, guided by feedback from a vision-language model acting as the critic. Compared to compute-matched parallel sampling, the approach improves the all-correct rate on ConceptMix (k=7) by 16.9%, T2I-CompBench (3D-Spatial category) by 13.8%, and Visual Jenga scene decomposition by 12.5%. It also produces more faithful images: human evaluators preferred it 58.7% of the time, versus 41.3% for the parallel baseline.

论文提出了一种迭代精炼策略来解决从文本提示生成复杂图像的挑战。该方法涉及文本到图像模型在来自视觉语言模型的反馈下多次迭代改进其输出。与计算量匹配的并行采样相比，该方法在多个基准测试中表现出一致的改进，包括在ConceptMix（k=7）上提高了16.9%的全部正确率，在T2I-CompBench（3D-空间类别）上提高了13.8%，在Visual Jenga场景分解上提高了12.5%。迭代精炼还能生成更忠实的图像，人类评估者更偏好该方法，偏好率为58.7%，而并行基线为41.3%。
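
The generate-critique-refine loop can be sketched with toy stand-ins for the two models. Everything here is hypothetical: `toy_generate` and `toy_critique` replace a real T2I model and a real vision-language critic, and an "image" is just the set of constraints it satisfies.

```python
REQUIRED = ["red cube", "blue sphere", "cube left of sphere"]

def toy_generate(prompt, previous, feedback):
    """Stand-in generator: the first draft meets one constraint,
    each later call fixes the first issue the critic reported."""
    if previous is None:
        return frozenset(REQUIRED[:1])
    return previous | {feedback[0]}

def toy_critique(prompt, image):
    """Stand-in critic: list the constraints the image still misses."""
    return [c for c in REQUIRED if c not in image]

def iterative_refine(prompt, generate, critique, max_steps=4):
    """Refine one sample sequentially until the critic is satisfied
    or the step budget runs out. Returns the image and steps used."""
    image = generate(prompt, None, [])
    steps = 1
    while steps < max_steps:
        feedback = critique(prompt, image)
        if not feedback:
            break
        image = generate(prompt, image, feedback)
        steps += 1
    return image, steps

final_image, steps_used = iterative_refine(
    "a red cube left of a blue sphere", toy_generate, toy_critique
)
print(sorted(final_image), steps_used)
```

A compute-matched parallel baseline would instead spend the same `max_steps` budget on independent first drafts and pick the best one, so no draft ever benefits from the critic's feedback on an earlier one.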

Improving MoE Compute Efficiency by Composing Weight and Data Sparsity

Authors: Maciej Kilian, Oleg Mkrtchyan, Luke Zettlemoyer, Akshat Shrivastava, Armen Aghajanyan

First: 2026-01-21T18:53:58+00:00 · Latest: 2026-01-21T18:53:58+00:00

Abstract

Mixture-of-Experts layers achieve compute efficiency through weight sparsity: each token activates only a subset of experts. Data sparsity, where each expert processes only a subset of tokens, offers a complementary axis. Expert-choice routing implements data sparsity directly but violates causality in autoregressive models, creating train-inference mismatch. We recover data sparsity within causal token-choice MoE by leveraging zero-compute (null) experts within the routing pool. When a token routes to null experts, those slots consume no compute. The standard load balancing objective trains the model to uniformly use all experts (real and null) therefore creating data sparsity in expectation without the causality violations. We evaluate on vision-language model training, where data heterogeneity is pronounced: vision encoders produce many low-information tokens while text tokens are denser. At matched expected FLOPs, composing weight and data sparsity yields a more compute-efficient frontier than weight sparsity alone, with gains in training loss and downstream performance. The model learns implicit modality-aware allocation, routing vision tokens to null experts more aggressively than text, without explicit modality routing.

中文标题/摘要

标题：通过组合权重和数据稀疏性提高专家混合层的计算效率

专家混合层通过权重稀疏性实现计算效率：每个标记仅激活专家子集。数据稀疏性，其中每个专家仅处理标记子集，提供了互补的维度。专家选择路由直接实现数据稀疏性，但在自回归模型中违反了因果性，导致训练与推理不匹配。我们通过在路由池中利用零计算（空）专家来在因果标记选择的专家混合层中恢复数据稀疏性。当标记路由到空专家时，这些槽位不消耗计算。标准负载均衡目标训练模型均匀使用所有专家（真实和空的），从而在期望中创建数据稀疏性，而不违反因果性。我们在视觉-语言模型训练中进行评估，其中数据异质性明显：视觉编码器产生许多低信息量标记，而文本标记更密集。在匹配预期FLOPs的情况下，组合权重和数据稀疏性比单独使用权重稀疏性提供了更高效的计算边界，且在训练损失和下游性能上有所提升。模型学习隐式的模态感知分配，更积极地将视觉标记路由到空专家，而无需显式的模态路由。

Summary / 总结

The paper aims to enhance the computational efficiency of Mixture-of-Experts (MoE) layers by combining weight sparsity and data sparsity. Weight sparsity ensures that each token activates only a subset of experts, while data sparsity allows each expert to process only a subset of tokens. To avoid causality issues, the authors introduce null experts that do not consume compute when selected. This approach, combined with a standard load balancing objective, uniformly uses all experts, including null ones, leading to better computational efficiency. Experiments on vision-language model training show that this method outperforms weight sparsity alone in terms of training loss and downstream performance, especially in handling data heterogeneity where vision tokens are less informative than text tokens.

论文旨在通过结合权重稀疏性和数据稀疏性来提高Mixture-of-Experts（MoE）层的计算效率。权重稀疏性确保每个token只激活一部分专家，而数据稀疏性则让每个专家只处理一部分token。为了避免因果性问题，作者引入了零计算（null）专家，在选择时不会消耗计算资源。结合标准负载均衡目标，这种方法可以均匀使用所有专家，包括null专家，从而提高计算效率。实验表明，在处理数据异质性（视觉token比文本token信息量少）的视觉-语言模型训练中，该方法在训练损失和下游性能上优于单独使用权重稀疏性。
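
A small simulation shows why null experts yield data sparsity in expectation. It assumes a router whose usage has already been balanced across all experts, modelled here by uniformly random logits; the expert counts, `TOP_K`, and function names are illustrative and not taken from the paper.

```python
import random

NUM_REAL, NUM_NULL, TOP_K = 4, 4, 2  # hypothetical pool sizes

def route_top_k(logits, k):
    """Token-choice routing: each token picks its k highest-scoring experts."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def real_slot_fraction(token_logits, num_real, k):
    """Fraction of all routing slots that land on real (compute-bearing) experts.

    Experts with index >= num_real are null and cost no compute.
    """
    used = 0
    for logits in token_logits:
        used += sum(1 for e in route_top_k(logits, k) if e < num_real)
    return used / (k * len(token_logits))

rng = random.Random(0)
# Uniform random logits stand in for a load-balanced router (an assumption).
tokens = [[rng.random() for _ in range(NUM_REAL + NUM_NULL)] for _ in range(2000)]
fraction = real_slot_fraction(tokens, NUM_REAL, TOP_K)
print(f"slots landing on real experts: {fraction:.2f}")
```

With equal numbers of real and null experts and balanced usage, about half of the slots land on real experts, so the layer does roughly half the expert compute while each token still makes its own causal routing decision.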

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Authors: Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu

First: 2026-01-21T17:56:59+00:00 · Latest: 2026-01-21T17:56:59+00:00

Comments: Website: https://progresslm.github.io/ProgressLM/

Abstract

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and training-based approach based on curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.

中文标题/摘要

标题：PROGRESSLM：迈向视觉语言模型中的进度推理

估计任务进度需要推理长时动态，而不仅仅是识别静态视觉内容。尽管现代视觉语言模型（VLMs）在描述可见内容方面表现出色，但尚不清楚它们是否可以从部分观察中推断出任务的进展情况。为此，我们引入了Progress-Bench，用于系统评估VLMs的进度推理能力。除了基准测试外，我们还通过无训练提示和基于精心构建的数据集ProgressLM-45K的训练方法，进一步探索了灵感来源于人类的两阶段进度推理范式。在14个VLMs上的实验表明，大多数模型尚未准备好进行任务进度估计，表现出对演示模态和视角变化的敏感性，以及对无法回答的情况处理不佳。虽然无训练提示强制结构化的进度推理仅能带来有限且模型依赖的收益，但基于训练的ProgressLM-3B即使在小型模型规模下也能实现一致的改进，尽管其训练任务集与评估任务集完全不重叠。进一步的分析揭示了特征错误模式，并阐明了进度推理何时以及为何成功或失败。

Summary / 总结

The research aims to evaluate the ability of Vision-Language Models (VLMs) to estimate task progress by introducing Progress-Bench, a benchmark for progress reasoning. The study explores a two-stage human-inspired approach through both training-free prompting and a training-based model based on the ProgressLM-45K dataset. Experiments on 14 VLMs reveal that most models struggle with task progress estimation, showing sensitivity to changes in demonstration modality and viewpoint, and difficulty with unanswerable cases. While training-free prompting provides limited gains, the training-based ProgressLM-3B model shows consistent improvements, even at a small scale, despite being trained on a disjoint task set.

研究旨在通过引入Progress-Bench基准来评估Vision-Language模型（VLMs）在任务进度估计方面的能力。研究探索了一种基于人类启发的两阶段方法，通过无训练提示和基于ProgressLM-45K数据集的训练方法。实验结果显示，大多数模型在任务进度估计方面存在困难，表现出对演示模态和视角变化的敏感性，以及处理不可回答情况的困难。虽然无训练提示提供了有限的改进，但训练后的ProgressLM-3B模型在小规模下表现出一致的改进，尽管其训练任务集与评估任务集完全不重合。

CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation

Authors: V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin

First: 2025-12-23T13:44:41+00:00 · Latest: 2026-01-21T16:42:28+00:00

Comments: 37 pages, 42 figures

Abstract

Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of thinking based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free and model-agnostic framework for multimodal image generation. CRAFT transforms a user prompt into a set of explicit, dependency-structured visual constraints, verifies generated images using a vision-language model, and performs targeted prompt updates only when specific constraints are violated. This iterative process includes an explicit stopping criterion, resulting in an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.

中文标题/摘要

标题：CRAFT：连续推理和自主反馈调优的多模态文本到图像生成

近期研究表明，在不重新训练的情况下，推理时的推理和反思可以提高文本到图像生成的效果。然而，现有方法往往依赖于隐式的、整体的批评或不受限制的提示重写，这使得它们的行为难以解释、控制或可靠地停止。相比之下，大型语言模型得益于基于验证、目标修正和早期停止的明确、结构化的思考形式。我们提出了CRAFT（连续推理和自主反馈调优），这是一种无需训练且模型无关的多模态图像生成框架。CRAFT 将用户提示转换为一组明确的、依赖结构化的视觉约束，使用视觉语言模型验证生成的图像，并仅在特定约束被违反时进行有针对性的提示更新。这个迭代过程包括一个明确的停止标准，从而形成一个可解释且可控的推理时细化循环。在多个模型家族和具有挑战性的基准测试中，CRAFT 一致地提高了组合准确性、文本呈现和基于偏好的评估，特别是在轻量级生成器方面取得了显著的改进。重要的是，这些改进仅带来了微不足道的推理时开销，使得较小或更便宜的模型能够接近更昂贵系统的质量。我们的结果表明，明确结构化的、基于约束的推理是提高多模态生成模型可靠性的关键成分。

Summary / 总结

CRAFT is a training-free and model-agnostic framework for multimodal text-to-image generation that transforms user prompts into explicit visual constraints, verifies generated images using a vision-language model, and updates prompts only when constraints are violated. This iterative process includes an explicit stopping criterion, leading to improved compositional accuracy, text rendering, and preference-based evaluations, especially for lightweight generators, with minimal inference-time overhead.

CRAFT 是一个无需训练且模型无关的框架，将用户提示转换为显式的视觉约束，使用视觉语言模型验证图像，并仅在约束被违反时更新提示。这一迭代过程包含明确的停止标准，提高了组合准确性、文本渲染和偏好评价，尤其是对于轻量级生成器，同时几乎不增加推理时间开销。
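
The verify-then-update loop with dependency-structured constraints and an explicit stop can be sketched as below. The constraint names, the "Make sure:" prompt suffix, and the toy generator and verifier are all hypothetical stand-ins; the summary does not specify how CRAFT encodes constraints or rewrites prompts.

```python
# Each constraint lists its prerequisites; prerequisites appear first.
CONSTRAINTS = {
    "a dog": [],
    "dog wears a hat": ["a dog"],
    "hat is red": ["dog wears a hat"],
}

def check_constraints(constraints, verify):
    """Return constraints that fail directly. A constraint whose
    prerequisite did not pass is skipped, not reported."""
    passed, violated = set(), []
    for name, deps in constraints.items():
        if not all(d in passed for d in deps):
            continue
        if verify(name):
            passed.add(name)
        else:
            violated.append(name)
    return violated

def toy_generate(prompt):
    """Stand-in generator: always draws the dog, plus anything the
    prompt explicitly emphasises after 'Make sure:'."""
    emphasised = " ".join(prompt.split("Make sure:")[1:])
    return {"a dog"} | {c for c in CONSTRAINTS if c in emphasised}

def refine(prompt, constraints, generate, max_rounds=5):
    """Regenerate with targeted prompt updates; stop as soon as no
    constraint is violated or the round budget is exhausted."""
    for round_idx in range(max_rounds):
        image = generate(prompt)
        violated = check_constraints(constraints, lambda c: c in image)
        if not violated:
            return image, prompt, round_idx + 1
        prompt = prompt + " Make sure: " + "; ".join(violated) + "."
    return image, prompt, max_rounds

image, final_prompt, rounds = refine("a dog wearing a red hat", CONSTRAINTS, toy_generate)
print(rounds, final_prompt)
```

Skipping dependents of a failed constraint keeps each update targeted: there is little point asking whether the hat is red before the hat exists.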

Training-Free and Interpretable Hateful Video Detection via Multi-stage Adversarial Reasoning

Authors: Shuonan Yang, Yuchen Zhang, Zeyu Fu

Venue: ICASSP 2026

First: 2026-01-21T15:52:26+00:00 · Latest: 2026-01-21T15:52:26+00:00

Comments: Accepted at ICASSP 2026. © 2026 IEEE. This is the author accepted manuscript. The final published version will be available via IEEE Xplore

Abstract

Hateful videos pose serious risks by amplifying discrimination, inciting violence, and undermining online safety. Existing training-based hateful video detection methods are constrained by limited training data and lack of interpretability, while directly prompting large vision-language models often struggles to deliver reliable hate detection. To address these challenges, this paper introduces MARS, a training-free Multi-stage Adversarial ReaSoning framework that enables reliable and interpretable hateful content detection. MARS begins with the objective description of video content, establishing a neutral foundation for subsequent analysis. Building on this, it develops evidence-based reasoning that supports potential hateful interpretations, while in parallel incorporating counter-evidence reasoning to capture plausible non-hateful perspectives. Finally, these perspectives are synthesized into a conclusive and explainable decision. Extensive evaluation on two real-world datasets shows that MARS achieves up to 10% improvement under certain backbones and settings compared to other training-free approaches and outperforms state-of-the-art training-based methods on one dataset. In addition, MARS produces human-understandable justifications, thereby supporting compliance oversight and enhancing the transparency of content moderation workflows. The code is available at https://github.com/Multimodal-Intelligence-Lab-MIL/MARS.

中文标题/摘要

标题：基于多阶段对抗推理的无训练可解释仇恨视频检测

仇恨视频通过放大歧视、煽动暴力和破坏在线安全等方式带来严重风险。现有的基于训练的仇恨视频检测方法受限于训练数据有限且缺乏可解释性，而直接对大型视觉-语言模型进行提示往往难以提供可靠的仇恨检测。为解决这些挑战，本文提出了一种无训练的多阶段对抗推理框架MARS，以实现可靠且可解释的仇恨内容检测。MARS从客观描述视频内容开始，建立后续分析的中立基础。在此基础上，它发展了基于证据的推理，支持潜在的仇恨解释，同时并行地纳入反证据推理以捕捉可能的非仇恨视角。最后，这些视角被综合成一个明确且可解释的决策。在两个真实世界数据集上的广泛评估表明，MARS在某些骨干网络和设置下比其他无训练方法最多提高10%，并在其中一个数据集上优于最先进的基于训练的方法。此外，MARS生成了人类可理解的解释，从而支持合规监督并增强内容审核流程的透明度。代码可在https://github.com/Multimodal-Intelligence-Lab-MIL/MARS 获取。

Summary / 总结

This paper addresses the challenges of detecting hateful videos through a training-free and interpretable framework called MARS, which uses multi-stage adversarial reasoning. Starting from a neutral description of the video content, MARS develops evidence-based reasoning for potential hateful interpretations and, in parallel, counter-evidence reasoning that captures non-hateful perspectives, then synthesizes both into a conclusive and explainable decision. Evaluation on two real-world datasets shows that MARS improves on other training-free approaches by up to 10% under certain backbones and settings, outperforms state-of-the-art training-based methods on one dataset, and provides human-understandable justifications for content moderation.

本文提出了MARS，一种无需训练的多阶段对抗推理框架，以解决检测仇恨视频的挑战。MARS 从中立的视频内容描述开始，然后发展基于证据的推理来支持潜在的仇恨解释，同时结合反证据来捕捉非仇恨视角。该框架将这些视角综合成一个结论性的、可解释的决策。实验结果显示，在两个真实世界数据集上，MARS 在某些骨干网络和设置下比其他无需训练的方法最多提高 10%，并在其中一个数据集上优于最先进的基于训练的方法，同时提供可理解的解释以支持内容审核的合规性和透明度。

Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization

Authors: Rui Chen, Bin Liu, Changtao Miao, Xinghao Wang, Yi Li, Tao Gong, Qi Chu, Nenghai Yu

First: 2025-10-11T08:42:31+00:00 · Latest: 2026-01-21T15:39:57+00:00

Comments: This version was uploaded in error and contains misleading information found in an early draft. The manuscript requires extensive and long-term revisions

Abstract

Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates an objectified rule construction with adaptive filtering to build a reliable knowledge base and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained forensics results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.

中文标题/摘要

标题：无需训练的上下文法医链用于图像篡改检测与定位

图像篡改技术的进步带来了严重的安全威胁，突显了有效图像篡改定位（IML）的必要性。虽然监督IML能够取得优异性能，但它依赖于昂贵的像素级注释。现有的弱监督或无需训练的替代方案往往表现不佳且缺乏可解释性。我们提出了一种无需训练的框架——上下文法医链（ICFC），该框架利用多模态大型语言模型（MLLMs）进行可解释的IML任务。ICFC 结合了对象化规则构建与自适应过滤，构建了一个可靠的知识库，并采用多步渐进推理管道，模拟了从粗略提案到精细法医结果的专家法医工作流程。该设计使MLLM推理在图像级分类、像素级定位和文本级可解释性方面的系统利用成为可能。在多个基准测试中，ICFC 不仅超越了最先进的无需训练方法，而且在弱监督和完全监督方法方面也取得了竞争性或更优的性能。

Summary / 总结

The research aims to address the security threats posed by image tampering by developing a training-free framework for image manipulation localization (IML). The In-Context Forensic Chain (ICFC) leverages multi-modal large language models to create a reliable knowledge base and a multi-step reasoning pipeline. This framework outperforms existing training-free methods and achieves competitive or superior performance compared to weakly and fully supervised approaches across multiple benchmarks.

研究旨在通过开发一种无需训练的图像篡改定位（IML）框架来应对图像篡改带来的安全威胁。In-Context Forensic Chain (ICFC) 利用多模态大语言模型构建可靠的知识库和多步推理管道。该方法在多个基准测试中不仅超越了现有的无需训练的方法，而且与弱监督和完全监督方法相比也取得了有竞争力或更优的性能。