arXiv 论文速递

2026-01-25 03:29
Snapshot: 20260125_0329
GutenOCR: A Grounded Vision-Language Front-End for Documents
Authors: Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew
First: 2026-01-20T21:26:15+00:00 · Latest: 2026-01-22T18:58:24+00:00
Abstract
GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
中文标题/摘要
标题:GutenOCR:面向文档的具备定位能力(grounded)的视觉语言前端
GutenOCR 是通过微调 Qwen2.5-VL-3B 和 Qwen2.5-VL-7B 获得的一系列具备定位能力(grounded)的 OCR 前端。由此得到的单检查点视觉语言模型通过统一的、基于提示的接口提供阅读、检测和定位功能。模型在商业文档、科学文章和合成定位数据上进行训练,支持全页和局部阅读,可输出行级和段落级边界框,并支持条件式“x 在哪里?”查询。我们提出了一种带定位的 OCR 评估协议,并表明在 10.5K 张留出的商业和科学页面上,GutenOCR-7B 的综合定位 OCR 分数是其 Qwen2.5-VL-7B 主干的两倍以上(从 0.40 提高到 0.82)。在 Fox 和 OmniDocBench v1.5 上,我们的方法显著提高了区域级和行级 OCR 以及文本检测召回率,但也暴露出在页面级线性化、颜色引导 OCR 和公式密集布局方面的权衡。
Summary / 总结
GutenOCR is a family of vision-language models fine-tuned from Qwen2.5-VL-3B and Qwen2.5-VL-7B that exposes reading, detection, and grounding through a single prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, GutenOCR-7B raises the composite grounded OCR score from 0.40 (its Qwen2.5-VL-7B backbone) to 0.82 on 10.5K held-out pages, more than doubling it. On Fox and OmniDocBench v1.5 it also improves region- and line-level OCR and text-detection recall, but shows trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
GutenOCR 是由 Qwen2.5-VL-3B 和 Qwen2.5-VL-7B 微调而来的一系列视觉-语言模型,通过基于提示的统一接口提供阅读、检测和定位功能。这些模型在商业文档、科学文章和合成定位数据上训练,支持全页和局部阅读,并带有边界框和条件查询。在留出的商业和科学页面上,GutenOCR-7B 的综合定位 OCR 分数达到 0.82,而其主干模型为 0.40。在 Fox 和 OmniDocBench v1.5 上,GutenOCR 提升了区域级和行级 OCR 以及文本检测召回率,但在页面级线性化、颜色引导 OCR 和公式密集布局方面存在一些权衡。
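The "more than doubles" claim can be checked with a little arithmetic. The sketch below uses only the two composite scores quoted in the abstract; the helper name `improvement` is made up for illustration and is not part of any GutenOCR tooling.

```python
# Composite grounded OCR scores quoted in the abstract.
backbone_score = 0.40   # Qwen2.5-VL-7B
gutenocr_score = 0.82   # GutenOCR-7B


def improvement(before, after):
    """Return (ratio, relative gain in percent) between two scores."""
    ratio = after / before
    return ratio, (ratio - 1.0) * 100.0


ratio, gain_pct = improvement(backbone_score, gutenocr_score)
print(f"ratio = {ratio:.2f}x, relative gain = {gain_pct:.0f}%")
```

The ratio is about 2.05, i.e. a relative gain of roughly 105%, which is what "more than doubles" refers to.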
LLM-in-Sandbox Elicits General Agentic Intelligence
Authors: Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei
First: 2026-01-22T18:57:09+00:00 · Latest: 2026-01-22T18:57:09+00:00
Comments: Project Page: https://llm-in-sandbox.github.io
Abstract
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
中文标题/摘要
标题:LLM-in-Sandbox 激发通用代理智能
我们介绍了 LLM-in-Sandbox,使大语言模型能够在代码沙盒(即虚拟计算机)中探索,以激发非代码领域的通用智能。我们首先展示了强大的大语言模型在无需额外训练的情况下,能够利用代码沙盒来执行非代码任务的一般化能力。例如,大语言模型会自发地访问外部资源以获取新知识,利用文件系统处理长文本,并执行脚本以满足格式要求。我们进一步表明,通过仅使用非代理数据训练用于沙盒探索的模型,LLM-in-Sandbox 强化学习(LLM-in-Sandbox-RL)可以增强这些代理能力。实验表明,无论是在无训练模式还是在训练后模式下,LLM-in-Sandbox 都能够实现涵盖数学、物理、化学、生物医学、长文本理解以及指令遵循的稳健泛化。最后,我们从计算和系统角度分析了 LLM-in-Sandbox 的效率,并将其开源为 Python 包,以促进其实用部署。
Summary / 总结
The paper introduces LLM-in-Sandbox, which lets LLMs explore a code sandbox (a virtual computer) to elicit general intelligence in non-code domains. Without additional training, strong LLMs already use the sandbox for non-code tasks: they access external resources, use the file system to handle long contexts, and execute scripts to meet formatting requirements. LLM-in-Sandbox-RL further strengthens these agentic capabilities with reinforcement learning that uses only non-agentic data. Experiments show robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following, in both training-free and post-trained settings. The authors also analyze efficiency from computational and system perspectives and open-source LLM-in-Sandbox as a Python package.
研究介绍了LLM-in-Sandbox,该方法允许大型语言模型(LLMs)探索代码沙箱以在非代码领域发展一般智能。研究展示了强大的LLMs可以泛化并在非代码任务中利用沙箱,例如访问外部资源和执行脚本。此外,LLM-in-Sandbox强化学习进一步增强了这些能力。实验表明,LLM-in-Sandbox在数学、物理和生物医学等多个领域实现了稳健的泛化。研究还分析了LLM-in-Sandbox的效率,并将其作为Python包开源以促进实际部署。
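To make "leverage the file system to handle long contexts" concrete, here is a minimal sketch of the idea: park a long document in a file inside the sandbox and search it, rather than keeping it all in the prompt. The helpers `stash_context` and `search_context` are hypothetical names invented for this illustration; they are not the API of the released LLM-in-Sandbox package, and a temporary directory merely stands in for the sandbox.

```python
import tempfile
from pathlib import Path


def stash_context(text, workdir):
    """Write a long context to a file inside the (stand-in) sandbox directory."""
    path = Path(workdir) / "context.txt"
    path.write_text(text, encoding="utf-8")
    return path


def search_context(path, keyword, limit=5):
    """Return up to `limit` lines that contain `keyword` (case-insensitive)."""
    hits = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if keyword.lower() in line.lower():
                hits.append(line.strip())
                if len(hits) == limit:
                    break
    return hits


def demo():
    # 10,000 filler lines with one relevant line hidden in the middle.
    long_text = "\n".join(
        "line 4242: the access code is 7319" if i == 4242 else f"line {i}: filler"
        for i in range(10_000)
    )
    with tempfile.TemporaryDirectory() as tmp:  # stands in for the sandbox
        path = stash_context(long_text, tmp)
        return search_context(path, "access code")


result = demo()
print(result)
```

Only the matching line needs to return to the model's context, which is the point of offloading the long text to disk.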
Training-Free Geospatial Place Representation Learning from Large-Scale Point-of-Interest Graph Data
Authors: Mohammad Hashemi, Hossein Amiri, Andreas Zufle
First: 2025-06-25T15:10:31+00:00 · Latest: 2026-01-22T18:46:50+00:00
Abstract
Learning effective representations of urban environments requires capturing spatial structure beyond fixed administrative boundaries. Existing geospatial representation learning approaches typically aggregate Points of Interest (POIs) into pre-defined administrative regions such as census units or ZIP code areas, assigning a single embedding to each region. However, POIs often form semantically meaningful groups that extend across, within, or beyond these boundaries, defining places that better reflect human activity and urban function. To address this limitation, we propose PlaceRep, a training-free geospatial representation learning method that constructs place-level representations by clustering spatially and semantically related POIs. PlaceRep summarizes large-scale POI graphs from U.S. Foursquare data to produce general-purpose urban region embeddings while automatically identifying places across multiple spatial scales. By eliminating model pre-training, PlaceRep provides a scalable and efficient solution for multi-granular geospatial analysis. Experiments on two downstream tasks, population density estimation and housing price prediction, show that PlaceRep outperforms most state-of-the-art graph-based geospatial representation learning methods and achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation of PlaceRep is available at https://github.com/mohammadhashemii/PlaceRep.
中文标题/摘要
标题:基于大规模兴趣点图数据的免训练地理空间地点表示学习
学习有效的城市环境表示需要捕捉超出固定行政边界的空间结构。现有的地理空间表示学习方法通常将兴趣点(POI)聚合到预先定义的行政区域中,如普查单位或邮政编码区域,并为每个区域分配一个单一的嵌入。然而,POI往往形成具有语义意义的群体,跨越、位于或超出这些边界,定义了更好地反映人类活动和城市功能的地点。为了解决这一局限性,我们提出了一种无需训练的地理空间表示学习方法PlaceRep,该方法通过聚类空间上和语义上相关的POI来构建地点级表示。PlaceRep从美国Foursquare数据中的大规模POI图中进行总结,生成通用的城市区域嵌入,并自动识别跨多个空间尺度的地点。通过消除模型预训练,PlaceRep提供了一种可扩展且高效的多粒度地理空间分析解决方案。使用人口密度估计和房价预测等下游任务进行的实验表明,PlaceRep在大多数基于图的地理空间表示学习方法中表现更优,并且在生成大规模POI图的区域级表示时可实现高达100倍的速度提升。PlaceRep的实现可在https://github.com/mohammadhashemii/PlaceRep获取。
Summary / 总结
The paper targets urban representations that capture spatial structure beyond fixed administrative boundaries. PlaceRep, a training-free method, clusters spatially and semantically related POIs to build place-level representations, summarizing large-scale POI graphs from U.S. Foursquare data into general-purpose region embeddings across multiple spatial scales. On population density estimation and housing price prediction, PlaceRep outperforms most state-of-the-art graph-based methods and generates region-level representations up to 100x faster on large-scale POI graphs.
研究旨在通过捕捉超越行政边界的空间结构来学习有效的城市环境表示。PlaceRep 是一种无需训练的方法,通过聚类空间和语义相关的 POI 生成地点级表示。实验表明,PlaceRep 在人口密度估计和房价预测任务上优于大多数最先进的基于图的方法,并且在生成大规模 POI 图的区域级表示时可实现高达 100 倍的加速。
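As a rough illustration of "clustering spatially and semantically related POIs" without any training, the sketch below links POIs that are both close together and share a category, then averages their embeddings per place. This is an assumption-laden toy: the abstract does not describe PlaceRep's actual clustering procedure, and the field names (`xy`, `category`, `emb`), the category-equality test, and the `radius` parameter are all invented for the example.

```python
from math import dist


def cluster_pois(pois, radius):
    """Union-find grouping: link POIs that are close AND share a category."""
    parent = list(range(len(pois)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(pois)):
        for j in range(i + 1, len(pois)):
            same_category = pois[i]["category"] == pois[j]["category"]
            if same_category and dist(pois[i]["xy"], pois[j]["xy"]) <= radius:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(pois)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def place_embedding(pois, members):
    """Mean of the member embeddings; no trainable parameters are involved."""
    dim = len(pois[members[0]]["emb"])
    return [sum(pois[m]["emb"][d] for m in members) / len(members) for d in range(dim)]


pois = [
    {"xy": (0.0, 0.0), "category": "cafe", "emb": [1.0, 0.0]},
    {"xy": (0.1, 0.0), "category": "cafe", "emb": [0.0, 1.0]},
    {"xy": (5.0, 5.0), "category": "cafe", "emb": [1.0, 1.0]},
    {"xy": (0.05, 0.0), "category": "bank", "emb": [0.5, 0.5]},
]
places = cluster_pois(pois, radius=0.5)
embeddings = [place_embedding(pois, members) for members in places]
print(places, embeddings)
```

Re-running `cluster_pois` with several `radius` values would give places at several spatial scales, loosely mirroring the multi-granular analysis the abstract mentions.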
Multimodal Climate Disinformation Detection: Integrating Vision-Language Models with External Knowledge Sources
Authors: Marzieh Adeli Shamsabad, Hamed Ghodrati
First: 2026-01-22T16:55:48+00:00 · Latest: 2026-01-22T16:55:48+00:00
Abstract
Climate disinformation has become a major challenge in today's digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay action on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image search results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model's ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.
中文标题/摘要
标题:多模态气候虚假信息检测:结合视觉-语言模型与外部知识源
气候虚假信息已成为当今数字世界的主要挑战,尤其是在社交媒体上广泛传播误导性图片和视频的情况下。这些虚假声明往往令人信服且难以识别,这可能会延迟应对气候变化的行动。虽然视觉-语言模型(VLMs)已被用于识别视觉虚假信息,但它们仅依赖于训练时可用的知识。这限制了它们对近期事件或更新进行推理的能力。本文的主要目标是通过结合 VLMs 与外部知识来克服这一限制。通过检索最新的信息,如反向图像搜索结果、在线事实核查和可信专家内容,该系统可以更好地评估图片及其声明是准确、误导、虚假还是无法验证。这种方法提高了模型处理真实世界气候虚假信息的能力,并支持在快速变化的信息环境中保护公众对科学的理解。
Summary / 总结
The paper addresses climate disinformation, particularly misleading images and videos shared on social media. Because VLMs rely only on knowledge available at training time, the authors combine them with retrieved external knowledge: reverse image search results, online fact-checks, and trusted expert content. The system then assesses whether an image and its claim are accurate, misleading, false, or unverifiable. The abstract states that this improves the model's ability to handle real-world climate disinformation.
本文通过将视觉语言模型与外部知识源结合,解决气候虚假信息的检测问题。方法包括检索最新的信息,如反向图像搜索结果、在线事实核查和可信专家内容,以评估图像及其声明是准确、误导、虚假还是无法验证。摘要指出,这种方法增强了模型在快速变化的信息环境中处理真实世界气候虚假信息的能力。
DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models
Authors: Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan, Yangfan He, Yuchen Li, Jingqun Tang
First: 2026-01-22T16:02:56+00:00 · Latest: 2026-01-22T16:02:56+00:00
Abstract
Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may attend excessively to image tokens in task-irrelevant regions, which we call 'distracting tokens'. This behavior can divert the model from generating the desired action tokens at each step, lowering task success rates. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate and to explore the model's performance upper bound without altering its original architecture or adding inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: https://anonymous.4open.science/r/CBD3.
中文标题/摘要
标题:DTP:一种简单有效的视觉-语言-动作模型干扰标记剪枝框架
视觉-语言-动作(VLA)模型通过利用视觉-语言模型(VLM)的强大感知能力来理解环境并直接输出动作,已经在机器人操作方面取得了显著进展。然而,默认情况下,VLA模型可能会过度关注任务无关区域的图像标记,我们将其称为“干扰标记”。这种行为会干扰模型在每一步生成所需动作标记的过程,影响任务的成功率。在本文中,我们介绍了一种简单有效的即插即用干扰标记剪枝(DTP)框架,该框架能够动态检测并剪枝这些干扰图像标记。通过纠正模型的视觉注意力模式,我们旨在提高任务成功率,并在不改变其原始架构或添加额外输入的情况下探索模型的性能上限。在SIMPLER基准(Li等,2024)上的实验表明,我们的方法在不同类型的新型VLA模型中一致地提高了任务成功率,展示了其对基于变换器的VLA模型的通用性。进一步的分析揭示了所有测试模型的任务成功率与其任务无关区域注意力量之间的负相关关系,突显了VLA模型中的一种常见现象,这可以指导未来的研究。我们还发布了我们的代码:https://anonymous.4open.science/r/CBD3.
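The pruning idea in the DTP abstract can be sketched in a few lines: drop image tokens that sit outside the task-relevant region yet receive a lot of attention. This is a toy under stated assumptions, not the authors' implementation: the abstract does not say how relevance is detected or which pruning rule is used, so the per-token `relevant` mask, the fixed `threshold`, and both function names are hypothetical.

```python
def prune_distracting_tokens(attention, relevant, threshold):
    """Return the indices of image tokens to keep.

    attention: attention weight per image token (hypothetical input)
    relevant:  True where the token lies in the task-relevant region
    threshold: task-irrelevant tokens with attention above this are pruned
    """
    keep = []
    for idx, (weight, is_relevant) in enumerate(zip(attention, relevant)):
        distracting = (not is_relevant) and weight > threshold
        if not distracting:
            keep.append(idx)
    return keep


def irrelevant_attention_share(attention, relevant):
    """Fraction of total attention spent on task-irrelevant tokens."""
    total = sum(attention)
    off_task = sum(w for w, r in zip(attention, relevant) if not r)
    return off_task / total if total else 0.0


attention = [0.05, 0.40, 0.10, 0.30, 0.15]
relevant = [True, False, True, True, False]
kept = prune_distracting_tokens(attention, relevant, threshold=0.20)
share = irrelevant_attention_share(attention, relevant)
print(kept, round(share, 2))
```

The second helper corresponds to the quantity the abstract reports as negatively correlated with task success: the amount of attention falling in the task-irrelevant region.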
DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
Authors: Junha Lee, Eunha Park, Minsu Cho
First: 2026-01-22T15:23:35+00:00 · Latest: 2026-01-22T15:23:35+00:00
Abstract
Language-driven dexterous grasp generation requires models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves a 67.14% success rate, outperforming the state of the art by 3.83 percentage points, with a 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
中文标题/摘要
标题:DextER:基于语言的灵巧抓取生成与具身推理
基于语言的灵巧抓取生成要求模型理解任务语义、3D几何和复杂的手物交互。尽管视觉语言模型已被应用于此问题,现有方法直接将观察结果映射为抓取参数,而没有对物理交互进行中间推理。我们提出了DextER(Dexterous Grasp Generation with Embodied Reasoning,带具身推理的灵巧抓取生成),为多指操作引入了基于接触的具身推理。我们的关键见解是,预测手的哪些连杆(link)接触物体表面的哪些位置,可以提供一种具身感知的中间表示,将任务语义与物理约束联系起来。DextER 自回归地生成具身接触标记,指定哪些手指连杆接触物体表面的哪些位置,随后生成编码手部配置的抓取标记。在DexGYS上,DextER 达到了67.14%的成功率,比最先进的方法高出3.83个百分点,意图对齐提升了96.4%。我们还展示了通过指定部分接触实现可引导的生成,从而对抓取合成进行精细控制。
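The abstract's "3.83%p" is a gain in percentage points, which differs from a relative gain. The short calculation below makes the distinction explicit. It assumes the 3.83 points are measured against the best prior method on the same success-rate metric; the implied baseline is derived here and is not a number quoted in the abstract.

```python
dexter_success = 67.14   # percent, quoted in the abstract
gain_points = 3.83       # percentage points over the prior best (assumed same metric)

prior_best = dexter_success - gain_points          # implied baseline, in percent
relative_gain = gain_points / prior_best * 100.0   # relative improvement, in percent

print(f"implied prior best: {prior_best:.2f}%")
print(f"relative gain: {relative_gain:.2f}%")
```

Under that assumption the prior best is about 63.31%, so a 3.83-point gain corresponds to a relative improvement of roughly 6%.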
SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration
Authors: Yan Zhuang, Jiawei Ren, Xiaokang Ye, Jianzhi Shen, Ruixuan Zhang, Tianai Yue, Muhammad Faayez, Xuhong He, Ziqiao Ma, Lianhui Qin, Zhiting Hu, Tianmin Shu
Venue: NeurIPS 2025
First: 2025-12-10T20:04:08+00:00 · Latest: 2026-01-22T14:26:01+00:00
Comments: Conference: NeurIPS 2025 (main)
Abstract
Recent advances in foundation models have shown promising results in developing generalist robots that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has focused mainly on indoor, household scenarios. In this work, we present SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) multimodal instruction grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation with people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking the robust perception, reasoning, and planning abilities necessary for urban environments.
中文标题/摘要
标题:SimWorld-Robotics: 合成逼真且动态的城市环境以实现多模态机器人导航与协作
基础模型的最新进展表明,在给定多模态输入的情况下,通用机器人可以在开放场景中执行多种任务,显示出有希望的结果。然而,当前的工作主要集中在室内家庭场景。在本工作中,我们提出了SimWorld-Robotics (SWR),一个用于大规模、逼真城市环境的模拟平台。基于Unreal Engine 5,SWR 通过生成无限的逼真城市场景,其中包含动态元素如行人和交通系统,超越了先前的城市模拟在逼真度、复杂性和可扩展性方面的表现。它还支持多机器人控制和通信。凭借这些关键功能,我们构建了两个具有挑战性的机器人基准测试:(1) 多模态指令跟随任务,其中机器人必须遵循视觉-语言导航指令,在行人和交通的环境中到达目的地;(2) 多智能体搜索任务,其中两个机器人必须通过通信合作找到并会合。与现有基准不同,这两个新基准全面评估了机器人在现实场景中的广泛关键能力,包括(1) 多模态指令语义理解,(2) 大环境中的三维空间推理,(3) 与行人和交通的安全、长距离导航,(4) 多机器人协作,以及(5) 基于语义的通信。我们的实验结果表明,最先进的模型,包括视觉-语言模型 (VLMs),在我们的任务中表现不佳,缺乏在城市环境中所需的稳健感知、推理和规划能力。
Summary / 总结
The research aims to develop a simulation platform for embodied AI in photorealistic urban environments, addressing the limitations of current indoor-focused robotics. SimWorld-Robotics (SWR) uses Unreal Engine 5 to generate dynamic urban scenes with pedestrians and traffic, supporting multi-robot control and communication. The platform introduces two benchmarks: a multimodal instruction-following task and a multi-agent search task, which comprehensively evaluate robots' abilities in realistic scenarios, including multimodal grounding, 3D spatial reasoning, safe navigation, multi-robot collaboration, and grounded communication. State-of-the-art models, including vision-language models, struggle with these tasks, highlighting the need for improved perception, reasoning, and planning abilities in urban environments.
研究旨在开发一个面向逼真城市环境中具身AI的模拟平台,以弥补当前机器人研究主要集中于室内场景的局限。SimWorld-Robotics (SWR) 利用Unreal Engine 5生成包含行人和交通的动态城市场景,支持多机器人控制和通信。该平台引入了两个基准测试:多模态指令跟随任务和多智能体搜索任务,全面评估机器人在现实场景中的能力,包括多模态指令定位(grounding)、三维空间推理、安全导航、多机器人协作和有依据的(grounded)通信。最先进的模型,包括视觉语言模型,在这些任务中表现不佳,突显了在城市环境中需要改进感知、推理和规划能力。
A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery
Authors: Valery Fischer, Alan Magdaleno, Anna-Katharina Calek, Nicola Cavalcanti, Nathan Hoffman, Christoph Germann, Joschua Wüthrich, Max Krähenmann, Mazda Farshad, Philipp Fürnstahl, Lilian Calvet
First: 2026-01-22T12:48:24+00:00 · Latest: 2026-01-22T12:48:24+00:00
Abstract
Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
中文标题/摘要
标题:手术环境下3D手部姿态估计的多视图管道和基准数据集
目的:准确的3D手部姿态估计支持手术应用,如技能评估、机器人辅助干预和几何感知工作流程分析。然而,手术环境带来了严重挑战,包括强烈的局部照明、频繁的器械或人员遮挡、手套导致的手部均匀外观,以及缺乏用于可靠模型训练的标注数据集。 方法:我们提出了一种适用于手术环境的鲁棒多视图管道,无需特定领域微调,仅依赖现成的预训练模型。该管道结合了可靠的人体检测、全身姿态估计和基于跟踪手部区域的最新2D手部关键点预测,随后进行约束3D优化。此外,我们还引入了一个新的手术基准数据集,包含超过68,000帧和3,000个手动标注的2D手部姿态,具有三角化3D地面真值,在不同场景复杂度下记录在一个复现的手术室中。 结果:定量实验表明,我们的方法在2D平均关节误差上比基线方法降低了31%,在3D平均每个关节位置误差上降低了76%。 结论:我们的工作为手术中的3D手部姿态估计设定了一个强大的基线,提供了无需训练的管道和全面标注的数据集,以促进未来手术计算机视觉研究。
Summary / 总结
The study aims to improve 3D hand pose estimation in surgical environments, addressing challenges like lighting and occlusions. It introduces a multi-view pipeline using off-the-shelf models for person detection, whole-body pose estimation, and 2D hand keypoint prediction, followed by 3D optimization. The pipeline is validated on a new benchmark dataset with over 68,000 frames and achieves significant reductions in 2D and 3D joint errors compared to baselines.
研究旨在提高手术环境中的3D手部姿态估计,这对于技能评估和机器人辅助干预等应用至关重要。方法包括使用现成的预训练模型进行人体检测、全身姿态估计和手部关键点预测,然后进行3D优化。该方法通过一个包含超过68,000帧和3,000个标注手部姿态的新基准数据集进行了验证。结果表明,所提出的方法显著优于现有基线,2D平均关节误差减少了31%,3D平均每个关节位置误差减少了76%。
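The dataset's "triangulated 3D ground truth" relies on recovering a 3D point from 2D observations in several calibrated views. The abstract does not say which triangulation method was used, so the sketch below swaps in the simple two-view midpoint method: each camera contributes a ray (center plus direction through the 2D keypoint), and the estimate is the midpoint of the shortest segment between the two rays. Camera centers, directions, and the function name are illustrative assumptions.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the point cannot be triangulated")

    s = (b * e - c * d) / denom   # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom   # parameter of the closest point on ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Two hypothetical cameras looking at the same hand joint at (1, 2, 5).
joint = triangulate_midpoint(
    c1=(0.0, 0.0, 0.0), d1=(1.0, 2.0, 5.0),
    c2=(4.0, 0.0, 0.0), d2=(-3.0, 2.0, 5.0),
)
print(joint)
```

With noisy 2D keypoints the two rays no longer intersect, and the midpoint gives a reasonable compromise; pipelines with more than two views typically solve a least-squares problem instead.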
RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture
Authors: Anas Anwarul Haq Khan, Mariam Husain, Kshitij Jadhav
First: 2026-01-22T12:11:53+00:00 · Latest: 2026-01-22T12:11:53+00:00
Abstract
Recent advances in medical vision language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
中文标题/摘要
标题:RadJEPA:通过联合嵌入预测架构的胸部X光影像编码器
近期医学视觉语言模型的进步指导了视觉表示的学习;然而,这种监督形式受限于配对的图像文本数据的可用性,引发了是否可以在不依赖语言监督的情况下学习稳健的放射学编码器的问题。在本文中,我们引入了RadJEPA,这是一种基于联合嵌入预测架构的自监督框架,该框架在不依赖语言监督的情况下进行学习。该模型仅在未标记的胸部X光图像上进行预训练,学习预测遮罩图像区域的潜在表示。这种预测目标与图像文本预训练和DINO风格的自我蒸馏完全不同:RadJEPA不是在视图或模态之间对齐全局表示,而是明确建模潜在空间预测。我们在疾病分类、语义分割和报告生成任务上评估了所学习的编码器。在各个基准测试中,RadJEPA的性能超过了最先进的方法,包括Rad-DINO。
Summary / 总结
The research asks whether a robust radiology encoder for chest X-rays can be learned without paired image-text data, which is limited in availability. RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture, is pre-trained solely on unlabeled chest X-ray images and learns to predict latent representations of masked image regions rather than aligning global representations across views or modalities. On disease classification, semantic segmentation, and report generation benchmarks, the abstract reports performance exceeding state-of-the-art approaches, including Rad-DINO.
研究旨在开发一种无需依赖图像-文本配对数据的胸部X光放射学编码器。RadJEPA 是一个自监督框架,学习预测遮罩图像区域的潜在表示。该模型在疾病分类、语义分割和报告生成等任务的各种基准测试中超过了最先进的方法。
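The key point of the RadJEPA objective is that the prediction target lives in latent space: the model predicts the representation of a masked patch, not its pixels. The sketch below shows one way such a loss could be computed, assuming a mean squared error over masked patches only. The choice of squared error, the per-patch vectors, and the function name are assumptions for illustration; the abstract does not specify the exact loss.

```python
def masked_latent_loss(predicted, target, masked):
    """Mean squared error between predicted and target latents, masked patches only."""
    total, count = 0.0, 0
    for pred_vec, tgt_vec, is_masked in zip(predicted, target, masked):
        if not is_masked:
            continue  # visible patches provide context, not supervision
        total += sum((p - t) ** 2 for p, t in zip(pred_vec, tgt_vec))
        count += len(pred_vec)
    return total / count if count else 0.0


# Toy latents for three patches (2-dimensional for readability).
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # would come from a target encoder
predicted = [[9.0, 9.0], [0.0, 0.5], [1.0, 1.0]]   # would come from a predictor network
masked = [False, True, True]

loss = masked_latent_loss(predicted, target, masked)
print(loss)
```

The first patch is visible, so its large prediction error is ignored; only the two masked patches contribute to the loss.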
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
Authors: Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei
First: 2026-01-21T08:09:25+00:00 · Latest: 2026-01-22T12:09:02+00:00
Abstract
Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT
中文标题/摘要
标题:Render-of-Thought: 将文本推理链渲染为图像以进行视觉潜在推理
文本推理链(CoT)提示在解锁大型语言模型(LLMs)的推理能力方面取得了显著成功。尽管CoT提示增强了推理能力,但其冗长性带来了巨大的计算开销。近期工作往往仅关注结果对齐,而缺乏对中间推理过程的监督。这些不足之处模糊了潜在推理链的可分析性。为解决这些挑战,我们引入了Render-of-Thought(RoT),这是第一个通过将文本步骤渲染为图像来实现推理链具体化的框架,使潜在的推理理由变得明确且可追踪。具体而言,我们利用现有视觉语言模型(VLMs)的视觉编码器作为语义锚点,将视觉嵌入与文本空间对齐。此设计确保了即插即用的实现,无需额外的预训练开销。在数学和逻辑推理基准测试上的广泛实验表明,与显式CoT相比,我们的方法实现了3-4倍的令牌压缩和显著的推理加速。此外,与其他方法相比,它保持了竞争力,验证了此范式的可行性。我们的代码可在https://github.com/TencentBAC/RoT 获取
Summary / 总结
The paper introduces Render-of-Thought (RoT), a framework that converts textual reasoning steps into images to make latent reasoning explicit and traceable. RoT leverages vision encoders from existing Vision Language Models to align visual embeddings with textual space, enabling plug-and-play implementation. Experiments show that RoT achieves 3-4x token compression and significant inference speedup compared to explicit CoT, while maintaining competitive performance on mathematical and logical reasoning benchmarks.
论文提出了Render-of-Thought (RoT)框架,将文本推理步骤转化为图像,使潜在的推理过程变得明确和可追踪。该框架利用现有视觉语言模型的视觉编码器,将视觉嵌入与文本空间对齐,实现了3-4倍的令牌压缩和显著的推理加速,同时在推理基准测试中保持了竞争力。
VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing
Authors: Guanyuan Pan, Shuai Wang, Yugui Lin, Tiansheng Zhou, Pietro Liò, Yaqi Wang, Zhenxin Zhao
First: 2026-01-12T08:37:32+00:00 · Latest: 2026-01-22T11:46:08+00:00
Comments: 9 pages, 4 figures, submitted to the 10th International Conference on Control, Automation and Diagnosis (ICCAD'26)
Abstract
Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches rely solely on netlists, ignoring the circuit schematic, which hinders the cognitive link between the schematic and its performance. Furthermore, the black-box nature of machine learning methods and hallucination risks in large language models fail to provide the necessary ground-truth explainability required for industrial sign-off. To address these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing, and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-start from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experiment results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance while maintaining physics-based explainability. VLM-CAD meets all specification requirements while maintaining low power consumption in optimizing an amplifier with a complementary input and a class-AB output stage, with a total runtime under 66 minutes across all experiments on two amplifiers.
中文标题/摘要
标题:VLM-CAD:面向模拟电路尺寸设计的视觉语言模型优化的协作智能体设计工作流
模拟混合信号电路尺寸设计涉及高维设计空间中的复杂权衡。现有的自动模拟电路尺寸设计方法仅依赖网表,忽略了电路原理图,阻碍了原理图与其性能之间的认知联系。此外,机器学习方法的黑箱性质和大型语言模型的幻觉风险,使其无法提供工业签核(sign-off)所需的有据可依的可解释性。为了解决这些挑战,我们提出了一种由视觉语言模型优化的协作智能体设计工作流(VLM-CAD),该工作流分析电路、优化直流工作点、进行基于推理的尺寸设计并执行外部尺寸优化。我们集成了Image2Net来标注电路原理图并生成结构化的JSON描述,以便视觉语言模型精确解读。此外,我们提出了一种可解释的信赖域贝叶斯优化方法(ExTuRBO),该方法利用智能体生成的种子进行协作式热启动,并为外部尺寸优化提供双粒度灵敏度分析,支持生成全面的最终设计报告。在使用180nm、90nm和45nm预测技术模型(Predictive Technology Models)的放大器尺寸设计任务上的实验结果表明,VLM-CAD在保持基于物理的可解释性的同时有效平衡了功耗和性能。在优化一个具有互补输入级和AB类输出级的放大器时,VLM-CAD满足所有规格要求并保持低功耗,在两个放大器上的全部实验总运行时间低于66分钟。
Summary / 总结
VLM-CAD is a Vision Language Model-optimized collaborative agent design workflow for analog circuit sizing that addresses the limitations of existing approaches by integrating Image2Net for schematic annotation and proposing an Explainable Trust Region Bayesian Optimization method (ExTuRBO) for detailed sensitivity analysis. The workflow effectively balances power and performance while maintaining physics-based explainability, as demonstrated by experiments on amplifier sizing tasks using different technology nodes.
VLM-CAD 是一种通过结合 Vision Language Models 和协作代理优化模拟电路尺寸的工作流。它分析电路、优化直流工作点并进行推理尺寸优化。该方法使用 Image2Net 对电路图进行注释,并采用可解释的信任区域贝叶斯优化 (ExTuRBO) 进行外部尺寸优化,提供详细的灵敏度分析。实验结果表明,VLM-CAD 在不同技术节点的放大器尺寸任务中有效平衡了功率和性能,同时保持了基于物理的可解释性和低功耗。
MMP-A*: Multimodal Perception Enhanced Incremental Heuristic Search on Path Planning
Authors: Minh Hieu Ha, Khanh Ly Ta, Hung Phan, Tung Doan, Tung Dao, Dao Tran, Huynh Thi Thanh Binh
First: 2026-01-05T08:55:27+00:00 · Latest: 2026-01-22T10:24:37+00:00
Abstract
Autonomous path planning requires a synergy between global reasoning and geometric precision, especially in complex or cluttered environments. While classical A* is valued for its optimality, it incurs prohibitive computational and memory costs in large-scale scenarios. Recent attempts to mitigate these limitations by using Large Language Models for waypoint guidance remain insufficient, as they rely only on text-based reasoning without spatial grounding. As a result, such models often produce incorrect waypoints in topologically complex environments with dead ends, and lack the perceptual capacity to interpret ambiguous physical boundaries. These inconsistencies lead to costly corrective expansions and undermine the intended computational efficiency. We introduce MMP-A*, a multimodal framework that integrates the spatial grounding capabilities of vision-language models with a novel adaptive decay mechanism. By anchoring high-level reasoning in physical geometry, the framework produces coherent waypoint guidance that addresses the limitations of text-only planners. The adaptive decay mechanism dynamically regulates the influence of uncertain waypoints within the heuristic, ensuring geometric validity while substantially reducing memory overhead. To evaluate robustness, we test the framework in challenging environments characterized by severe clutter and topological complexity. Experimental results show that MMP-A* achieves near-optimal trajectories with significantly reduced operational costs, demonstrating its potential as a perception-grounded and computationally efficient paradigm for autonomous navigation.
中文标题/摘要
标题:MMP-A*: 多模态感知增强的增量启发式搜索路径规划
自主路径规划需要在全局推理和几何精度之间实现协同作用,尤其是在复杂或拥挤的环境中。虽然经典的A*因其最优性而受到重视,但在大规模场景中会带来巨大的计算和内存成本。最近通过使用大型语言模型进行航点指导来缓解这些限制的努力仍然不足,因为它们仅依赖于基于文本的推理而缺乏空间定位能力。因此,这些模型在拓扑复杂且有死胡同的环境中经常生成错误的航点,并且缺乏感知能力来解释模糊的物理边界。这些不一致导致昂贵的修正扩展,并削弱了预期的计算效率。我们引入了MMP-A*,这是一种结合了视觉语言模型的空间定位能力和新颖的自适应衰减机制的多模态框架。通过将高层次推理锚定在物理几何上,该框架生成连贯的航点指导,解决了纯文本规划器的局限性。自适应衰减机制动态调节启发式中不确定航点的影响,确保几何有效性同时大幅减少内存开销。为了评估鲁棒性,我们在严重拥挤和拓扑复杂性的环境中测试了该框架。实验结果表明,MMP-A*在显著降低操作成本的同时实现了接近最优的轨迹,展示了其作为感知导向和计算高效的自主导航范式的潜力。
Summary / 总结
MMP-A* is a multimodal framework that combines the spatial grounding of vision-language models with an adaptive decay mechanism to enhance path planning in complex environments. It addresses the limitations of text-only planners by producing coherent waypoints and ensuring geometric validity. Experimental results show that MMP-A* achieves near-optimal trajectories with reduced computational and memory costs, making it a promising approach for autonomous navigation.
MMP-A* 是一种结合视觉语言模型的空间定位能力和自适应衰减机制的多模态路径规划框架,旨在提高复杂环境下的路径规划效率和准确性。该框架解决了传统 A* 和纯文本规划器的局限性,通过生成连贯的航点指导并减少内存开销,实现了在严重杂乱和拓扑复杂环境中的近最优轨迹,并显著降低了操作成本。
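The idea of a waypoint whose influence on the heuristic fades over time can be sketched in a few lines of Python. The abstract does not state the decay rule, so the exponential schedule, the `weight` and `decay_rate` parameters, the grid encoding, and the name `guided_astar` are all assumptions made for illustration, not the authors' implementation. Because the bias term makes the heuristic inadmissible, this sketch does not guarantee optimal paths.

```python
import heapq
import math


def manhattan(a, b):
    """Grid distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def guided_astar(grid, start, goal, waypoint, weight=1.0, decay_rate=0.1):
    """A* on a 4-connected grid (1 = obstacle) with a decaying waypoint bias.

    Hypothetical scheme: the pull toward `waypoint` shrinks exponentially
    with the number of expanded nodes, so a misleading waypoint loses
    influence as the search proceeds.
    """
    rows, cols = len(grid), len(grid[0])
    g_cost = {start: 0}
    parent = {start: None}
    closed = set()
    open_heap = [(manhattan(start, goal), start)]
    expansions = 0
    while open_heap:
        _, node = heapq.heappop(open_heap)
        if node in closed:
            continue  # stale heap entry for an already expanded cell
        closed.add(node)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        expansions += 1
        bias = weight * math.exp(-decay_rate * expansions)  # assumed decay rule
        r, c = node
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if grid[nr][nc] == 1 or nxt in closed:
                continue
            new_g = g_cost[node] + 1
            if new_g < g_cost.get(nxt, math.inf):
                g_cost[nxt] = new_g
                parent[nxt] = node
                f = new_g + manhattan(nxt, goal) + bias * manhattan(nxt, waypoint)
                heapq.heappush(open_heap, (f, nxt))
    return None  # goal unreachable
```

Setting `weight=0` reduces the sketch to plain A* with a Manhattan heuristic, which gives a simple baseline for comparing node expansions.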
Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video
Authors: Pascal Benschop, Justin Dauwels, Jan van Gemert
First: 2026-01-22T09:14:11+00:00 · Latest: 2026-01-22T09:14:11+00:00
Abstract
Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.
中文标题/摘要
标题:基于合成生成视频评估VLMs的情境意识和空间意识
视觉语言模型(VLMs)中的空间推理在依赖于微妙的时间或几何线索时仍然脆弱。我们引入了一个合成基准,以探测两种互补的能力:情境意识(识别互动是否有害或无害)和空间意识(追踪谁对谁做了什么,并推理相对位置和运动)。通过最小的视频对,我们测试了三个挑战:区分暴力行为与良性活动、跨视角绑定攻击者角色以及判断细粒度轨迹对齐。虽然我们在无训练设置下评估了最近的VLMs,但该基准适用于任何视频分类模型。结果显示,各任务的性能仅略高于随机猜测。一个简单的辅助,稳定的颜色线索,部分减少了攻击者角色的混淆,但并未解决根本弱点。通过发布数据和代码,我们旨在提供可重复的诊断并激发对轻量级空间先验的研究,以补充大规模预训练。
Summary / 总结
This study evaluates the spatial reasoning capabilities of vision language models (VLMs) using a synthetic benchmark that tests situational and spatial awareness. The benchmark includes challenges such as distinguishing violent from benign activities and tracking roles and movements. Despite recent advancements, VLMs perform only slightly above chance. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. The authors release data and code to provide reproducible diagnostics and encourage the exploration of lightweight spatial priors that complement large-scale pretraining.
该研究使用合成基准测试视觉语言模型(VLMs)的空间推理能力,包括情境意识和空间意识的挑战,如区分暴力和良性活动以及追踪角色和运动。尽管有最新进展,VLMs的表现仅略高于随机水平。一种简单的辅助手段,稳定的颜色线索,可部分减少攻击者角色的混淆,但并不能解决根本弱点。作者通过发布数据和代码提供可重复的诊断工具,并鼓励探索轻量级的空间先验以补充大规模预训练。
Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization
Authors: Jiwei Guan, Haibo Jin, Haohan Wang
First: 2026-01-05T02:49:33+00:00 · Latest: 2026-01-22T09:09:47+00:00
Comments: EACL
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have shown groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attack methods require full model accessibility, suffer from high computing costs, and exhibit insufficient adversarial transferability, making them impractical for real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation by input-output interactions without requiring model knowledge, (ii) model-agnostic optimization without a surrogate model, and (iii) lower resource requirements with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs, including InstructBLIP, LLaVA and MiniGPT-4, achieving the highest jailbreak success rate of 83.0% on InstructBLIP, while maintaining imperceptible perturbations comparable to white-box methods. Moreover, adversarial examples generated from MiniGPT-4 exhibit strong transferability to other LVLMs, with ASR reaching 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs.
中文标题/摘要
标题:使用黑盒优化构建针对大型视觉-语言模型的对抗输入
大型视觉-语言模型(LVLM)在多种跨模态任务中展现了突破性的能力。然而,这些模型仍然容易受到对抗性越狱攻击的影响,攻击者通过施加微妙的扰动来绕过安全机制并触发有害输出。现有的白盒攻击方法需要完全访问模型,计算成本高且对抗性迁移性不足,使其在实际的黑盒环境中不切实际。为了解决这些限制,我们提出了一种使用零阶优化和同时扰动随机近似(ZO-SPSA)对LVLM进行黑盒越狱攻击的方法。ZO-SPSA提供了三个关键优势:(i)无需模型知识的输入-输出交互的无梯度近似,(ii)无需代理模型的模型无关优化,(iii)降低资源需求,减少GPU内存消耗。我们在三个LVLM上评估了ZO-SPSA,包括InstructBLIP、LLaVA和MiniGPT-4,在InstructBLIP上实现了最高的越狱攻击成功率83.0%,同时保持与白盒方法相当的不可感知扰动。此外,从MiniGPT-4生成的对抗性示例在其他LVLM上具有很强的迁移性,ASR达到64.18%。这些发现强调了黑盒越狱攻击在实际环境中的可行性,并揭示了当前LVLM安全机制中的关键弱点。
Summary / 总结
This study exposes the vulnerability of Large Vision-Language Models (LVLMs) to adversarial attacks by proposing a black-box jailbreak attack using Zeroth-Order optimization with Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA offers gradient-free approximation, model-agnostic optimization, and reduced resource requirements. Across InstructBLIP, LLaVA, and MiniGPT-4, the method achieves its highest jailbreak success rate, 83.0%, on InstructBLIP. Adversarial examples generated from MiniGPT-4 also transfer strongly to other LVLMs, with an attack success rate (ASR) reaching 64.18%. These results highlight the real-world feasibility of black-box attacks and reveal critical safety weaknesses in LVLMs.
该研究提出了一种基于零阶优化与同时扰动随机近似(ZO-SPSA)的黑盒越狱攻击,以揭示大型视觉-语言模型(LVLMs)面对对抗攻击的脆弱性。该方法无需模型知识、具有模型无关性且资源需求较低。实验在InstructBLIP、LLaVA和MiniGPT-4上进行,其中在InstructBLIP上的越狱成功率最高,达到83.0%;由MiniGPT-4生成的对抗样本在其他模型上具有较强的迁移性,攻击成功率(ASR)达到64.18%,这表明需要改进LVLMs的安全机制。
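The gradient-free estimator behind SPSA can be illustrated on a toy problem. The sketch below is generic textbook SPSA in Python, applied to a small quadratic rather than to any model. The function names, step sizes, and the toy objective are assumptions for illustration and are not taken from the paper.

```python
import random


def spsa_gradient(loss, x, c, rng):
    """Estimate the gradient of `loss` at `x` from two function evaluations."""
    delta = [rng.choice((-1.0, 1.0)) for _ in x]  # Rademacher perturbation
    x_plus = [xi + c * di for xi, di in zip(x, delta)]
    x_minus = [xi - c * di for xi, di in zip(x, delta)]
    slope = (loss(x_plus) - loss(x_minus)) / (2.0 * c)
    # Dividing by delta_i equals multiplying by it, since delta_i is +1 or -1.
    return [slope * di for di in delta]


def spsa_minimize(loss, x0, steps=200, lr=0.05, c=0.01, seed=0):
    """Plain descent on the SPSA estimate; only loss values are queried."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        grad = spsa_gradient(loss, x, c, rng)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


def toy_loss(x):
    """Hypothetical black-box objective: squared distance to a fixed target."""
    target = [1.0, -2.0, 0.5]
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


solution = spsa_minimize(toy_loss, [0.0, 0.0, 0.0])
```

Each step costs two loss evaluations regardless of the input dimension, which is why the estimator suits settings where only input-output access is available.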
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Authors: Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin
Venue: NeurIPS 2025
First: 2025-06-10T17:59:44+00:00 · Latest: 2026-01-22T08:52:35+00:00
Comments: Accepted by NeurIPS 2025 Track on Datasets and Benchmarks. Project page: https://faceong.github.io/VIKI-R/
Abstract
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.
中文标题/摘要
标题:VIKI-R:通过强化学习协调具身多智能体合作
在动态环境中协调多个具身智能体仍然是人工智能的核心挑战,需要感知驱动的推理和可扩展的合作策略。虽然最近的工作利用了大型语言模型(LLMs)进行多智能体规划,但有少数开始探索视觉语言模型(VLMs)进行视觉推理。然而,这些基于VLM的方法在支持多种具身类型方面仍然有限。在本文中,我们介绍了VIKI-Bench,这是第一个针对具身多智能体合作的分层基准,包含三个结构化层次:智能体激活、任务规划和轨迹感知。VIKI-Bench 包括多种机器人具身、多视角视觉观察和结构化的监督信号,以评估基于视觉输入的推理。为了展示VIKI-Bench 的实用性,我们提出了VIKI-R,这是一种两阶段框架,首先使用带有Chain-of-Thought注释的演示对预训练的视觉语言模型(VLM)进行微调,然后在多层次奖励信号下使用强化学习。我们的大量实验表明,VIKI-R 在所有任务层次上显著优于基线方法。此外,我们展示了强化学习使异构智能体之间出现组合合作模式。总体而言,VIKI-Bench 和 VIKI-R 提供了一个统一的测试平台和方法,以推进具身人工智能系统中的多智能体、视觉驱动的合作。
Summary / 总结
This work addresses the challenge of coordinating multiple embodied agents in dynamic environments by introducing VIKI-Bench, a hierarchical benchmark for embodied multi-agent cooperation. VIKI-R, a two-stage framework, fine-tunes a pretrained vision-language model with Chain-of-Thought annotated demonstrations and then uses reinforcement learning with multi-level reward signals. Experiments demonstrate that VIKI-R outperforms baseline methods across all task levels and enables compositional cooperation among heterogeneous agents.
该研究通过引入VIKI-Bench,一个用于多机器人合作的层次化基准,解决了动态环境中协调多个实体代理的问题。VIKI-R是一个两阶段框架,首先对预训练的视觉语言模型进行微调,使用带有链式思考注释的演示,然后使用多层次奖励信号进行强化学习。实验表明,VIKI-R在所有任务级别上都优于基线方法,并且能够使异构代理之间产生组合性的合作模式。
Zero-Shot Product Attribute Labeling with Vision-Language Models: A Three-Tier Evaluation Framework
Authors: Shubham Shukla, Kunal Sonalkar
Venue: WACV 2026
First: 2026-01-22T07:33:41+00:00 · Latest: 2026-01-22T07:33:41+00:00
Comments: Accepted to WACV 2026 Workshop on Physical Retail AI (PRAW)
Abstract
Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, "outer fabric" is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes (including NA class: suggesting attribute is not applicable) for all attributes, (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning attribute doesn't exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings on 5,000 images across 18 attributes. Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.
中文标题/摘要
标题:使用视觉语言模型的零样本产品属性标签化:三层评估框架
细粒度属性预测对于时尚零售应用(包括目录丰富、视觉搜索和推荐系统)至关重要。视觉语言模型(VLMs)可以在无需特定任务训练的情况下实现零样本预测,但它们在多属性时尚任务上的系统评估仍被忽视。一个关键挑战是时尚属性往往是条件性的。例如,“外层织物”在没有外衣的情况下是未定义的。这要求模型在尝试分类之前检测属性的适用性。我们引入了一个三层评估框架来分解这一挑战:(1)所有属性(包括NA类:表明属性不适用)在所有类别的整体任务性能,(2)属性适用性检测,以及(3)当属性可确定时的细粒度分类。使用DeepFashion-MultiModal,其中明确定义了NA(表示属性不存在或不可见),我们使用5,000张图像和18个属性,将九种VLMs(包括旗舰级(GPT-5, Gemini 2.5 Pro)、高效级(GPT-5 Mini, Gemini 2.5 Flash)和超高效级(GPT-5 Nano, Gemini 2.5 Flash-Lite))与基于预训练Fashion-CLIP嵌入的分类器进行基准测试。我们的发现表明:(1)零样本VLMs实现了64.0%的宏F1,比基于预训练Fashion-CLIP嵌入的逻辑回归提高了三倍;(2)VLMs在细粒度分类(第3级:70.8% F1)方面表现出色,但在适用性检测(第2级:34.1% NA-F1)方面存在困难,揭示了一个关键瓶颈;(3)高效模型在较低成本下实现了旗舰模型90%以上的性能,提供了实际部署路径。此诊断框架使从业者能够确定错误是源自可见性检测还是分类,从而指导生产系统的针对性改进。
Summary / 总结
The paper introduces a three-tier evaluation framework for zero-shot product attribute labeling using Vision-Language Models (VLMs) in fashion retail applications. It evaluates nine VLMs across different efficiency tiers on 18 attributes using the DeepFashion-MultiModal dataset. Key findings include a macro-F1 score of 64.0% for zero-shot VLMs, a threefold improvement over logistic regression, and a significant disparity in applicability detection (34.1% NA-F1) compared to fine-grained classification (70.8% F1). Efficient models achieve over 90% of flagship performance, making them practical for deployment.
研究旨在评估Vision-Language模型(VLMs)在零样本预测细粒度时尚属性方面的表现,特别是解决属性适用性的问题。开发了一种三阶段评估框架,分别评估整体任务性能、属性适用性检测和细粒度分类。九种VLMs在包含18个属性的5,000张图像上进行了基准测试,结果显示VLMs的宏F1值达到64.0%,显著优于逻辑回归,且高效模型在较低成本下实现了旗舰模型90%以上的性能。然而,VLMs在适用性检测方面表现不佳,揭示了一个关键瓶颈。
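The three tiers described in the abstract can be made concrete with a small scoring sketch in Python. The details are assumptions for illustration: Tier 2 is computed here as the F1 of the NA class, and Tier 3 as macro-F1 over non-NA classes restricted to samples whose ground truth is not NA. The paper may define the tiers differently, and the helper names are hypothetical.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of a single class, treating it as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def three_tier_scores(y_true, y_pred, na_label="NA"):
    """Return (tier1, tier2, tier3) for one attribute."""
    classes = sorted(set(y_true))
    tier1 = macro_f1(y_true, y_pred, classes)          # all classes, NA included
    tier2 = f1_for_class(y_true, y_pred, na_label)     # applicability detection
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != na_label]
    kept_true = [t for t, _ in kept]
    kept_pred = [p for _, p in kept]
    fine_classes = [c for c in classes if c != na_label]
    tier3 = macro_f1(kept_true, kept_pred, fine_classes)  # determinable samples only
    return tier1, tier2, tier3


truth = ["cotton", "NA", "denim", "NA", "cotton", "denim"]
preds = ["cotton", "cotton", "denim", "NA", "cotton", "cotton"]
tier1, tier2, tier3 = three_tier_scores(truth, preds)
```

Separating the scores this way shows whether a low overall number comes from missed NA cases or from confusion among the fine-grained classes.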
Agentic Uncertainty Quantification
Authors: Jiaxin Zhang, Prafulla Kumar Choubey, Kung-Hsiang Huang, Caiming Xiong, Chien-Sheng Wu
First: 2026-01-22T07:16:26+00:00 · Latest: 2026-01-22T07:16:26+00:00
Comments: 36 pages, 9 figures, 9 tables
Abstract
Although AI agents have demonstrated impressive capabilities in long-horizon reasoning, their reliability is severely hampered by the ``Spiral of Hallucination,'' where early epistemic errors propagate irreversibly. Existing methods face a dilemma: uncertainty quantification (UQ) methods typically act as passive sensors, only diagnosing risks without addressing them, while self-reflection mechanisms suffer from continuous or aimless corrections. To bridge this gap, we propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals. Our architecture comprises two complementary mechanisms: System 1 (Uncertainty-Aware Memory, UAM), which implicitly propagates verbalized confidence and semantic explanations to prevent blind decision-making; and System 2 (Uncertainty-Aware Reflection, UAR), which utilizes these explanations as rational cues to trigger targeted inference-time resolution only when necessary. This enables the agent to balance efficient execution and deep deliberation dynamically. Extensive experiments on closed-loop benchmarks and open-ended deep research tasks demonstrate that our training-free approach achieves superior performance and trajectory-level calibration. We believe this principled framework AUQ represents a significant step towards reliable agents.
中文标题/摘要
标题:代理不确定性量化
尽管AI代理在长期推理方面表现出色,但它们的可靠性因“幻觉螺旋”而严重受损,早期的知识错误不可逆地传播。现有方法面临困境:不确定性量化(UQ)方法通常作为被动传感器,仅诊断风险而不解决问题,而自我反思机制则遭受持续或盲目修正。为弥合这一差距,我们提出了一种统一的双重过程代理不确定性量化(AUQ)框架,将口头表达的不确定性转化为双向控制信号。我们的架构包括两个互补机制:系统1(不确定性感知记忆,UAM),它隐式地传播口头表达的信心和语义解释,以防止盲目决策;系统2(不确定性感知反思,UAR),它利用这些解释作为理性的提示,在必要时触发目标推理时的解决。这使代理能够动态平衡高效执行和深入的反思。在闭环基准和开放性深度研究任务上的广泛实验表明,我们的无训练方法在性能和轨迹级校准方面表现出色。我们认为,这一原理性的框架AUQ代表了迈向可靠代理的重要一步。
Summary / 总结
The research aims to address the issue of reliability in AI agents by proposing a unified Dual-Process Agentic Uncertainty Quantification (AUQ) framework. This framework transforms verbalized uncertainty into active control signals through two mechanisms: Uncertainty-Aware Memory (UAM) and Uncertainty-Aware Reflection (UAR). UAM prevents blind decision-making by propagating confidence and semantic explanations, while UAR uses these explanations to trigger targeted inference-time resolution when necessary. Experiments show that this approach improves performance and trajectory-level calibration without training.
研究旨在通过提出一种双重过程代理不确定性量化(AUQ)框架来解决AI代理在长期推理中的可靠性问题。该框架通过两种机制将口头表达的不确定性转化为主动控制信号:不确定性意识记忆(UAM)通过传播信心和语义解释来防止盲目决策;不确定性意识反思(UAR)利用这些解释在必要时触发目标推理时的解决。实验表明,这种方法在不进行训练的情况下提高了性能和轨迹级校准。
Multi-event Video-Text Retrieval
Authors: Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp
First: 2023-08-22T16:32:46+00:00 · Latest: 2026-01-22T06:58:13+00:00
Comments: [fixed typos in equations] accepted to ICCV2023 Poster; some figures are not supported when viewed online, please download the file and view locally
Abstract
Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work characterized by using a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs has become a prominent approach for the VTR task. However, these models operate under the assumption of bijective video-text correspondences and neglect a more practical scenario where video content usually encompasses multiple events, while texts like user queries or webpage metadata tend to be specific and correspond to single events. This establishes a gap between the previous training objective and real-world applications, leading to the potential performance degradation of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval Task. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, effectively establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR.
中文标题/摘要
标题:多事件视频-文本检索
视频-文本检索(VTR)是互联网上大规模视频-文本数据时代的一项重要多模态任务。以两流视觉-语言模型架构为特征的工作,通过学习视频-文本对的联合表示,已成为VTR任务的主要方法。然而,这些模型假设视频-文本对应关系是一一对应的,并忽略了视频内容通常包含多个事件,而文本如用户查询或网页元数据通常特定于单一事件的更实际场景。这导致了先前训练目标与实际应用之间的差距,使得早期模型在推理时可能性能下降。在本研究中,我们提出了多事件视频-文本检索(MeVTR)任务,以解决视频中包含多个不同事件的场景,作为传统视频-文本检索任务的一个细分场景。我们提出了一种简单的模型Me-Retriever,该模型结合了关键事件视频表示和新的MeVTR损失函数。全面的实验表明,该简单框架在视频到文本和文本到视频任务中优于其他模型,有效地为MeVTR任务建立了稳健的基础。我们认为这项工作为未来的研究奠定了坚实的基础。代码可在https://github.com/gengyuanmax/MeVTR/ 获取。
Summary / 总结
This study introduces the Multi-event Video-Text Retrieval (MeVTR) task, a variant of Video-Text Retrieval (VTR) in which videos contain multiple events while texts correspond to single events. The authors propose Me-Retriever, a simple model that incorporates key event video representation and a new MeVTR loss. Experiments show that Me-Retriever outperforms other models in both Video-to-Text and Text-to-Video tasks, establishing a robust baseline for MeVTR. The task targets the gap between earlier training objectives, which assume one-to-one video-text correspondence, and real-world applications, where that mismatch can degrade inference performance.
该研究通过引入多事件视频文本检索(MeVTR)任务,解决了视频包含多个事件而文本对应单一事件的问题。作者提出了一种简单的模型Me-Retriever,该模型结合了关键事件的视频表示和新的MeVTR损失。实验表明,Me-Retriever在视频到文本和文本到视频任务中均优于其他模型,为MeVTR任务建立了稳健的基线。该任务针对的是先前训练目标(假设视频与文本一一对应)与实际应用之间的差距,这种差距可能导致早期模型在推理时性能下降。
PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection
Authors: Po-Han Huang, Jeng-Lin Li, Po-Hsuan Huang, Ming-Ching Chang, Wei-Chao Chen
Venue: WACV 2026
First: 2025-09-30T06:52:08+00:00 · Latest: 2026-01-22T06:50:23+00:00
Comments: 10 pages, 5 figures. WACV 2026 (Accepted)
Abstract
Industrial anomaly detection is increasingly relying on foundation models, aiming for strong out-of-distribution generalization and rapid adaptation in real-world deployments. Notably, past studies have primarily focused on textual prompt tuning, leaving the intrinsic visual counterpart fragmented into processing steps specific to each foundation model. We aim to address this limitation by proposing a unified patch-focused framework, Patch-Exclusive Anomaly Detection (PatchEAD), enabling training-free anomaly detection that is compatible with diverse foundation models. The framework constructs visual prompting techniques, including an alignment module and foreground masking. Our experiments show superior few-shot and batch zero-shot performance compared to prior work, despite the absence of textual features. Our study further examines how backbone structure and pretrained characteristics affect patch-similarity robustness, providing actionable guidance for selecting and configuring foundation models for real-world visual inspection. These results confirm that a well-unified patch-only framework can enable quick, calibration-light deployment without the need for carefully engineered textual prompts.
中文标题/摘要
标题:PatchEAD:面向仅补丁异常检测的统一工业视觉提示框架
工业异常检测越来越多地依赖于基础模型,旨在实现强大的离分布泛化和在实际部署中的快速适应。值得注意的是,以往的研究主要集中在文本提示调优上,而视觉方面的内在对应物则被分割成每个基础模型特有的处理步骤。我们旨在通过提出一个统一的专用于补丁的框架——Patch-Exclusive Anomaly Detection (PatchEAD),来解决这一局限性,该框架能够实现无需训练的异常检测,并与多种基础模型兼容。该框架构建了视觉提示技术,包括对齐模块和前景遮罩。我们的实验表明,与先前的工作相比,尽管没有使用文本特征,但其在少量样本和批量零样本检测方面的性能更优。我们的研究进一步探讨了基础模型的结构和预训练特性如何影响补丁相似性鲁棒性,为选择和配置适用于实际视觉检查的基础模型提供了可操作的指导。这些结果证实,一个良好统一的仅补丁框架可以实现快速、校准轻量的部署,无需精心设计的文本提示。
Summary / 总结
The research aims to improve industrial anomaly detection by addressing the fragmented visual prompting techniques for different foundation models. PatchEAD, a unified patch-focused framework, is proposed to enable training-free anomaly detection compatible with various models. The framework includes an alignment module and foreground masking. Experiments demonstrate that PatchEAD outperforms previous methods in few-shot and batch zero-shot scenarios without relying on textual features. The study also explores how backbone structure and pretrained characteristics impact patch-similarity robustness, offering practical guidance for real-world applications.
研究旨在通过解决不同基础模型视觉提示技术碎片化的问题,提升工业异常检测。提出了一个统一的基于补丁的框架PatchEAD,使其能够在多种模型上实现无需训练的异常检测。该框架包含对齐模块和前景遮罩。实验表明,PatchEAD 在少量样本和批量零样本场景中优于先前方法,且不依赖于文本特征。研究还探讨了基础模型结构和预训练特性对补丁相似性鲁棒性的影响,为实际应用提供了实用指导。
VIOLA: Towards Video In-Context Learning with Minimal Annotations
Authors: Ryo Fujii, Hideo Saito, Ryo Hachiuma
First: 2026-01-22T00:35:30+00:00 · Latest: 2026-01-22T00:35:30+00:00
Abstract
Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments like industrial or surgical settings since they require the experts' annotations. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that are simultaneously diverse, representative, and informative. Second, to utilize the remaining unlabeled data without noise propagation, we construct a hybrid pool and introduce confidence-aware retrieval and confidence-aware prompting. These mechanisms explicitly model label reliability, retrieving demonstrations based on a composite score of similarity and confidence while enabling the MLLM to adaptively distinguish between verified ground truths and noisy pseudo-labels. Extensive experiments across nine diverse benchmarks using four MLLMs demonstrate that our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
中文标题/摘要
标题:VIOLA:面向最少标注的视频上下文学习
将多模态大型语言模型(MLLMs)推广到新的视频领域对于实际部署至关重要,但由于标注数据稀缺而充满挑战。虽然上下文学习(ICL)提供了一条无需训练的适应路径,但标准方法依赖于大规模标注数据池,这在工业或手术等专业环境中往往不切实际,因为需要专家的标注。为了解决这一问题,我们提出了VIOLA(视频上下文学习与最少标注),这是一种高效标签框架,将最少的专家监督与大量的未标注数据相结合。首先,为了最大化严格的标注预算效率,我们提出了密度不确定性加权采样。与标准的多样性和不确定性策略不同,我们的方法利用密度估计来识别同时具有多样性和代表性且信息丰富的样本。其次,为了在不传播噪声的情况下利用剩余的未标注数据,我们构建了一个混合池,并引入了可信度感知检索和可信度感知提示。这些机制明确建模了标签的可靠性,根据相似性和可信度的复合得分检索示例,使MLLM能够自适应地区分验证的真实标签和嘈杂的伪标签。在四个MLLMs和九个不同基准上的广泛实验表明,我们的框架在低资源设置中显著优于各种基线,实现了在最少标注成本下的稳健适应。
Summary / 总结
VIOLA is a label-efficient framework that combines minimal expert supervision with abundant unlabeled data to enhance the generalization of multimodal large language models in video domains. It introduces density-uncertainty-weighted sampling to select diverse and informative samples and a hybrid pool with confidence-aware mechanisms to utilize unlabeled data effectively. Experiments across nine benchmarks show that VIOLA outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
VIOLA 是一种结合少量专家监督和大量未标注数据的视频上下文学习框架。它使用密度-不确定性加权采样来选择多样且信息丰富的样本,并引入了可信度感知检索和提示来处理噪声伪标签。实验表明,VIOLA 在低资源环境中优于各种基线,实现了低成本的鲁棒适应。
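One way to read density-uncertainty-weighted sampling is as a greedy selection over a score that multiplies local density by model uncertainty. The Python sketch below follows that reading under stated assumptions: the Gaussian kernel density, the multiplicative score, the neighbour discount used for diversity, and all names are illustrative choices, not the authors' formulation.

```python
import math


def gaussian_similarity(a, b, bandwidth):
    """Gaussian kernel on the Euclidean distance between two feature vectors."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * bandwidth ** 2))


def select_for_annotation(features, uncertainties, budget, bandwidth=1.0):
    """Greedily pick indices that are dense, uncertain, and mutually distant."""
    n = len(features)
    density = [
        sum(gaussian_similarity(features[i], features[j], bandwidth) for j in range(n)) / n
        for i in range(n)
    ]
    scores = [d * u for d, u in zip(density, uncertainties)]
    chosen = []
    for _ in range(min(budget, n)):
        best = max((i for i in range(n) if i not in chosen), key=lambda i: scores[i])
        chosen.append(best)
        # Discount samples close to the pick so later picks cover other regions.
        for i in range(n):
            scores[i] *= 1.0 - gaussian_similarity(features[i], features[best], bandwidth)
    return chosen


# Two clusters plus one isolated, highly uncertain outlier (hypothetical data).
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (20.0, 20.0)]
uncertainties = [0.5, 0.6, 0.5, 0.7, 0.6, 0.9]
picked = select_for_annotation(features, uncertainties, budget=2)
```

With a budget of two, the sketch picks one sample from each cluster and skips the outlier, even though the outlier has the highest uncertainty.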
MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification
Authors: Jingwei Song, Xinyu Wang, Hanbin Wang, Xiaoxuan Lei, Bill Shi, Shixin Han, Eric Yang, Xiao-Wen Chang, Lynn Ai
First: 2026-01-21T22:03:06+00:00 · Latest: 2026-01-21T22:03:06+00:00
Comments: 11 pages, 5 figures
Abstract
Speculative Decoding (SD) accelerates autoregressive large language model (LLM) inference by decoupling generation and verification. While recent methods improve draft quality by tightly coupling the drafter with the target model, the verification mechanism itself remains largely unchanged, relying on strict token-level rejection sampling. In practice, modern LLMs frequently operate in low-margin regimes where the target model exhibits weak preference among top candidates. In such cases, rejecting plausible runner-up tokens yields negligible information gain while incurring substantial rollback cost, leading to a fundamental inefficiency in verification. We propose Margin-Aware Speculative Verification, a training-free and domain-agnostic verification strategy that adapts to the target model's local decisiveness. Our method conditions verification on decision stability measured directly from the target logits and relaxes rejection only when strict verification provides minimal benefit. Importantly, the approach modifies only the verification rule and is fully compatible with existing target-coupled speculative decoding frameworks. Extensive experiments across model scales ranging from 8B to 235B demonstrate that our method delivers consistent and significant inference speedups over state-of-the-art baselines while preserving generation quality across diverse benchmarks.
中文标题/摘要
标题:MARS:通过 Margin-Aware 验证释放推测解码的潜力
推测解码(SD)通过解耦生成和验证来加速自回归大型语言模型(LLM)的推理。虽然最近的方法通过紧密耦合草稿生成者和目标模型来提高草稿质量,但验证机制本身变化不大,仍然依赖于严格的令牌级拒绝采样。实际上,现代LLM经常在低边际区域运行,目标模型对顶级候选者之间表现出较弱的偏好。在这种情况下,拒绝可能的亚军令牌几乎不会获得信息增益,但却会带来显著的回滚成本,导致验证中的根本性低效。我们提出了Margin-Aware推测验证,这是一种无需训练且领域通用的验证策略,能够适应目标模型的局部决断性。该方法根据直接从目标对数中测量的决策稳定性进行验证,并仅在严格的验证提供最小益处时才放松拒绝。重要的是,该方法仅修改验证规则,并与现有的目标耦合推测解码框架完全兼容。在从8B到235B的模型规模上进行的广泛实验表明,我们的方法在保持生成质量的同时,相对于最先进的基线方法提供了持续且显著的推理加速。
Summary / 总结
The paper introduces Margin-Aware Speculative Verification (MARS) to improve the efficiency of Speculative Decoding (SD) in autoregressive large language model inference. MARS adapts to the target model's local decisiveness by conditioning verification on decision stability measured from the target logits, thus reducing unnecessary token-level rejection sampling. Experiments show that MARS provides consistent and significant inference speedups over existing methods while maintaining generation quality across various model scales.
论文提出了Margin-Aware Speculative Verification (MARS) 方法,这是一种适应目标模型局部决断性的验证策略。通过直接从目标模型的logits中测量决策稳定性来条件化验证,并仅在严格验证提供最小益处时才放松拒绝。这种方法无需训练且适用于多种领域,在不同模型规模下都能实现一致且显著的推理加速,同时保持生成质量。
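A margin-conditioned acceptance rule can be sketched for the greedy-decoding case. The rule below is an assumption made for illustration: a drafted token is accepted if it is the target model's top choice, or if it is the runner-up and the top-two logit gap falls below a threshold. The abstract does not give the paper's exact criterion, and the names `margin_aware_accept`, `verify_draft`, and `margin_threshold` are hypothetical.

```python
def margin_aware_accept(target_logits, draft_token, margin_threshold=0.5):
    """Decide whether one drafted token passes verification."""
    ranked = sorted(range(len(target_logits)), key=lambda i: target_logits[i], reverse=True)
    top1, top2 = ranked[0], ranked[1]
    if draft_token == top1:
        return True  # strict greedy verification already accepts this token
    margin = target_logits[top1] - target_logits[top2]
    # Relax only when the target model barely prefers its top choice.
    return draft_token == top2 and margin < margin_threshold


def verify_draft(logits_per_position, draft_tokens, margin_threshold=0.5):
    """Accept the longest prefix of the draft that passes the rule."""
    accepted = []
    for logits, token in zip(logits_per_position, draft_tokens):
        if not margin_aware_accept(logits, token, margin_threshold):
            break  # the first rejection discards the remaining draft tokens
        accepted.append(token)
    return accepted


# Hypothetical target logits for three drafted positions over a 3-token vocabulary.
logits = [[2.0, 1.9, 0.1], [3.0, 0.0, 0.0], [1.0, 5.0, 0.0]]
draft = [1, 0, 0]
relaxed = verify_draft(logits, draft, margin_threshold=0.5)
strict = verify_draft(logits, draft, margin_threshold=0.0)
```

A threshold of zero recovers strict greedy verification. Any positive threshold can change the generated sequence, so quality has to be checked empirically.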
DevPrompt: Deviation-Based Prompt Learning for One-Normal Shot Image Anomaly Detection
Authors: Morteza Poudineh, Marc Lalonde
First: 2026-01-21T20:35:51+00:00 · Latest: 2026-01-21T20:35:51+00:00
Comments: 8 pages
Abstract
Few-normal shot anomaly detection (FNSAD) aims to detect abnormal regions in images using only a few normal training samples, making the task highly challenging due to limited supervision and the diversity of potential defects. Recent approaches leverage vision-language models such as CLIP with prompt-based learning to align image and text features. However, existing methods often exhibit weak discriminability between normal and abnormal prompts and lack principled scoring mechanisms for patch-level anomalies. We propose a deviation-guided prompt learning framework that integrates the semantic power of vision-language models with the statistical reliability of deviation-based scoring. Specifically, we replace fixed prompt prefixes with learnable context vectors shared across normal and abnormal prompts, while anomaly-specific suffix tokens enable class-aware alignment. To enhance separability, we introduce a deviation loss with Top-K Multiple Instance Learning (MIL), modeling patch-level features as Gaussian deviations from the normal distribution. This allows the network to assign higher anomaly scores to patches with statistically significant deviations, improving localization and interpretability. Experiments on the MVTecAD and VISA benchmarks demonstrate superior pixel-level detection performance compared to PromptAD and other baselines. Ablation studies further validate the effectiveness of learnable prompts, deviation-based scoring, and the Top-K MIL strategy.
中文标题/摘要
标题:DevPrompt:基于偏差的提示学习在单张正常样本图像异常检测中的应用
少量正常样本异常检测(FNSAD)旨在仅使用少量正常训练样本检测图像中的异常区域,由于监督有限且潜在缺陷多样,任务极具挑战性。最近的方法利用如CLIP等视觉语言模型结合提示学习来对齐图像和文本特征。然而,现有方法在正常和异常提示之间的区分能力较弱,并缺乏针对块级异常的原理性评分机制。我们提出了一种基于偏差的提示学习框架,将视觉语言模型的语义能力与基于偏差的评分的统计可靠性相结合。具体而言,我们用可学习的上下文向量替换固定提示前缀,这些向量在正常和异常提示之间共享,而特定于异常的后缀标记使类感知对齐成为可能。为了增强可分性,我们引入了一种基于Top-K多实例学习(MIL)的偏差损失,将块级特征建模为与正常分布的高斯偏差。这使网络能够将更高的异常评分分配给统计上显著偏差的块,从而提高定位和可解释性。在MVTecAD和VISA基准上的实验表明,与PromptAD和其他基线相比,像素级检测性能更优。消融研究进一步验证了可学习提示、基于偏差的评分和Top-K MIL策略的有效性。
Summary / 总结
DevPrompt is a deviation-guided prompt learning framework for few-normal shot image anomaly detection. It uses learnable context vectors and anomaly-specific suffix tokens to enhance the discriminability between normal and abnormal prompts. A deviation loss with Top-K Multiple Instance Learning is introduced to model patch-level features as Gaussian deviations from the normal distribution, improving anomaly score assignment and localization. Experiments show superior pixel-level detection performance compared to PromptAD and other baselines.
DevPrompt 是一种用于少量正常样本图像异常检测的偏差引导提示学习框架。它使用可学习的上下文向量和异常特定的后缀标记来增强正常和异常提示之间的可区分性。引入了基于偏差的 Top-K 多实例学习损失来将补丁级特征建模为从正常分布的高斯偏差,从而提高异常评分分配和定位。实验表明,与 PromptAD 及其他基线相比,其在像素级检测性能上表现出优越性。
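The combination of Top-K pooling and a deviation loss can be sketched in Python as follows. The loss form is borrowed from the general deviation-network formulation, in which a score is standardized against samples from a standard Gaussian prior; whether DevPrompt uses exactly this form is an assumption. The parameter values and function names are illustrative.

```python
import random
import statistics


def topk_mean(patch_scores, k):
    """MIL pooling: average the k highest patch-level anomaly scores."""
    return sum(sorted(patch_scores, reverse=True)[:k]) / k


def deviation_loss(patch_scores, label, k=3, margin=5.0, n_ref=5000, seed=0):
    """Deviation loss for one image; label is 0 (normal) or 1 (anomalous)."""
    rng = random.Random(seed)
    reference = [rng.gauss(0.0, 1.0) for _ in range(n_ref)]  # Gaussian prior scores
    mu = statistics.fmean(reference)
    sigma = statistics.pstdev(reference)
    deviation = (topk_mean(patch_scores, k) - mu) / sigma
    # Normal images are pulled toward zero deviation; anomalous images are
    # pushed to at least `margin` standard deviations above the reference mean.
    return (1 - label) * abs(deviation) + label * max(0.0, margin - deviation)


normal_loss = deviation_loss([0.0] * 10, label=0)
anomaly_loss = deviation_loss([9.0, 8.0, 7.0, 0.1, 0.0], label=1)
missed_loss = deviation_loss([0.0] * 10, label=1)
```

Pooling only the top k patches lets a small defect dominate the image-level score instead of being averaged away by the many normal patches around it.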
CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
Authors: Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem
Venue: CVPR 2026 (submitted, under review)
First: 2026-01-21T19:19:41+00:00 · Latest: 2026-01-21T19:19:41+00:00
Comments: 31 pages, 7 figures, submitted to CVPR 2026 (under review)
Abstract
Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
中文标题/摘要
标题:CURE:基于课程指导的多任务训练以生成可靠的解剖学导向报告
医学视觉-语言模型可以自动化生成放射学报告,但难以实现准确的视觉定位和事实一致性。现有模型经常将文本发现与视觉证据对齐不当,导致不可靠或弱定位的预测。我们提出了CURE,一种错误感知的课程学习框架,可以在不使用额外数据的情况下提高定位和报告质量。CURE在公共数据集上对多模态指令模型进行微调,用于短语定位、定位报告生成和解剖学定位报告生成。该方法根据模型性能动态调整采样,强调更难的样本以提高空间和文本对齐。CURE将定位准确度提高了0.37 IoU,提升了报告质量0.188 CXRFEScore,并减少了18.6%的幻觉。CURE是一种数据高效的框架,能够同时提高定位准确度和报告可靠性。代码可在https://github.com/PabloMessina/CURE 获取,模型权重可在https://huggingface.co/pamessina/medgemma-4b-it-cure 获取
Summary / 总结
CURE is an error-aware curriculum learning framework that enhances the grounding accuracy and report quality of medical vision-language models without additional data. It fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation. CURE dynamically adjusts sampling based on model performance, focusing on harder samples to improve spatial and textual alignment. The results show an improvement of +0.37 IoU in grounding accuracy, +0.188 in CXRFEScore for report quality, and an 18.6% reduction in hallucinations.
CURE 是一种基于课程的学习多任务训练框架,通过提高视觉定位和事实一致性来增强解剖导向的放射学报告生成的准确性和可靠性。它在短语定位、定位报告生成和解剖导向报告生成上微调多模态模型,并根据模型性能动态调整采样。关键发现包括定位准确性的 0.37 IoU 提升、报告质量的 CXRFEScore 提高 0.188 以及幻觉减少 18.6%。CURE 是一种数据高效的框架,能够在不使用额外数据的情况下同时提高定位准确性和报告可靠性。
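The error-aware sampling idea described above can be illustrated with a small sketch. This is not the authors' implementation: the per-sample error values (for example, 1 - IoU on each training sample), the `floor` parameter, and the function names are assumptions made for illustration. It only shows one plausible way to make harder samples more likely to be drawn.

```python
import random

def error_aware_weights(errors, floor=0.05):
    """Map per-sample errors (0 = solved, 1 = failed) to sampling probabilities.

    `floor` (an assumed hyperparameter) keeps already-solved samples in
    rotation so they are never dropped entirely.
    """
    raw = [max(e, floor) for e in errors]
    total = sum(raw)
    return [r / total for r in raw]

def sample_batch(sample_ids, errors, batch_size, rng):
    """Draw a batch with replacement, favouring samples with higher error."""
    weights = error_aware_weights(errors)
    return rng.choices(sample_ids, weights=weights, k=batch_size)

# Hypothetical running errors for four training samples.
sample_ids = ["a", "b", "c", "d"]
errors = [0.9, 0.1, 0.5, 0.0]
weights = error_aware_weights(errors)
batch = sample_batch(sample_ids, errors, batch_size=8, rng=random.Random(0))
```

In a training loop, the `errors` list would be refreshed from recent evaluation so that the sampling distribution tracks what the model currently finds hard.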
Towards Understanding Best Practices for Quantization of Vision-Language Models
Authors: Gautom Das, Vincent La, Ethan Lau, Abhinav Shrivastava, Matthew Gwilliam
First: 2026-01-21T18:59:51+00:00 · Latest: 2026-01-21T18:59:51+00:00
Comments: 15 pages, 12 figures, 1 table
Abstract
Large language models (LLMs) deliver impressive results for a variety of tasks, but state-of-the-art systems require fast GPUs with large amounts of memory. To reduce both the memory and latency of these systems, practitioners quantize their learned parameters, typically at half precision. A growing body of research focuses on preserving the model performance with more aggressive bit widths, and some work has been done to apply these strategies to other models, like vision transformers. In our study we investigate how a variety of quantization methods, including state-of-the-art GPTQ and AWQ, can be applied effectively to multimodal pipelines comprised of vision models, language models, and their connectors. We address how performance on captioning, retrieval, and question answering can be affected by bit width, quantization method, and which portion of the pipeline the quantization is used for. Results reveal that ViT and LLM exhibit comparable importance in model performance, despite significant differences in parameter size, and that lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw). These findings provide practical insights for efficient deployment of MLLMs and highlight the value of exploration for understanding component sensitivities in multimodal models. Our code is available at https://github.com/gautomdas/mmq.
中文标题/摘要
标题:理解视觉-语言模型量化最佳实践
大型语言模型(LLMs)在各种任务中表现出色,但最先进的系统需要快速的GPU和大量的内存。为了减少这些系统的内存和延迟,实践者通常会将它们的学习参数量化为半精度。越来越多的研究集中在使用更激进的位宽来保持模型性能,并且已经有一些工作将这些策略应用于其他模型,如视觉变换器。在我们的研究中,我们探讨了如何有效地将包括最先进的GPTQ和AWQ在内的各种量化方法应用于由视觉模型、语言模型及其连接器组成的多模态管道。我们研究了位宽、量化方法以及量化在管道中的使用位置如何影响字幕生成、检索和问答的性能。结果表明,尽管参数规模存在显著差异,ViT和LLM在模型性能中具有相当的重要性,并且LLM的低位量化可以在减少每个权重位数(bpw)的情况下实现高精度。这些发现为高效部署多模态大语言模型提供了实用见解,并突显了探索多模态模型组件敏感性的价值。我们的代码可在https://github.com/gautomdas/mmq/获取。
Summary / 总结
This study investigates the application of various quantization methods, including GPTQ and AWQ, to multimodal pipelines involving vision models, language models, and their connectors. The research aims to understand how bit width, quantization method, and the portion of the pipeline being quantized affect performance on captioning, retrieval, and question answering. Key findings show that the vision transformer (ViT) and the large language model (LLM) are comparably important to model performance despite large differences in parameter count, and that the LLM retains high accuracy under lower-bit quantization, at reduced bits per weight (bpw). These insights offer practical guidance for deploying efficient multimodal models.
研究探讨了GPTQ和AWQ等不同量化方法在包含视觉和语言模型的多模态管道中的应用。研究旨在了解不同量化技术和位宽如何影响诸如图像字幕、检索和问答等任务的性能。关键发现表明,视觉变压器(ViT)和语言模型(LLM)对于模型性能都至关重要,且LLM即使在较低位宽量化的情况下也能保持高精度。这些发现对于多模态大型语言模型(MLLM)的高效部署具有重要意义。
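To make the bit-width trade-off concrete, here is a minimal sketch of symmetric round-to-nearest weight quantization. It is deliberately not GPTQ or AWQ, which use calibration data and more elaborate procedures. The weight values and function names are made up for illustration.

```python
def quantize_rtn(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization to signed integers."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Illustrative weights only; real layers hold millions of values.
weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.60, -0.12, 0.01]
errors_by_bits = {}
for bits in (8, 4, 2):
    codes, scale = quantize_rtn(weights, bits)
    errors_by_bits[bits] = mean_abs_error(weights, dequantize(codes, scale))
```

On this toy tensor the reconstruction error grows as the bit width shrinks. How much of that error a pipeline component can tolerate is the kind of sensitivity the study measures.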
Iterative Refinement Improves Compositional Image Generation
Authors: Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj, Zheyang Qin, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak
First: 2026-01-21T18:59:40+00:00 · Latest: 2026-01-21T18:59:40+00:00
Comments: Project webpage: https://iterative-img-gen.github.io/
Abstract
Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/
中文标题/摘要
标题:迭代优化提升组合图像生成
文本到图像(T2I)模型取得了显著进展,但仍难以处理需要同时处理多个对象、关系和属性的复杂提示。现有的推理时策略,如并行采样带验证器或简单增加去噪步骤,可以改善提示对齐,但在许多约束必须满足的丰富组合场景中仍然不足。受大型语言模型中链式思考推理成功的启发,我们提出了一种迭代测试时策略,在该策略中,T2I模型在多个步骤中逐步细化其生成,由循环中的视觉语言模型作为批评者提供反馈。我们的方法简单,不需要外部工具或先验知识,并且可以灵活应用于各种图像生成器和视觉语言模型。实验证明,我们的方法在基准测试中的一致改进:在ConceptMix(k=7)上提高了16.9%的全正确率,在T2I-CompBench(3D-空间类别)上提高了13.8%,在视觉积木场景分解上提高了12.5%,与计算匹配的并行采样相比。除了定量改进,迭代优化生成更忠实的图像,通过将复杂提示分解为顺序修正,人类评估者中有58.7%的人更偏好我们的方法,而并行基线为41.3%。这些发现共同强调了迭代自我修正作为组合图像生成广泛适用原则的重要性。结果和可视化可在https://iterative-img-gen.github.io/获取
Summary / 总结
The paper proposes an iterative refinement strategy for text-to-image generation to handle complex prompts. By iteratively refining generations and using a vision-language model as a critic, the method achieves consistent improvements across benchmarks, with a 16.9% increase in the all-correct rate on ConceptMix (k=7), and higher fidelity generations preferred by human evaluators. This approach is simple, flexible, and does not require external tools or priors.
本文提出了一种迭代细化策略,以解决从文本提示生成复杂图像的挑战。该方法涉及文本到图像模型在多个步骤中逐步改进其输出,并由视觉语言模型提供反馈。实验结果显示,在多个基准测试中的一致改进,包括在ConceptMix(k=7)上提高了16.9%的全部正确率,在T2I-CompBench(3D-Spatial类别)上提高了13.8%,在Visual Jenga场景分解上提高了12.5%。人类评估者在58.7%的情况下也更偏好迭代方法,而非并行基线。该方法简单、灵活,不需要外部工具或先验知识,适用于各种图像生成器和视觉语言模型。
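The generate, critique, and refine loop described above can be sketched as follows. The `generate` and `critique` callables stand in for a T2I model and a vision-language critic; their signatures, and the toy stubs used here, are assumptions for illustration rather than the paper's interface.

```python
def refine_loop(prompt, generate, critique, max_steps=4):
    """Generate once, then keep refining while the critic reports problems."""
    image = generate(prompt, previous=None, feedback=None)
    history = [image]
    for _ in range(max_steps):
        feedback = critique(prompt, image)
        if not feedback:                  # critic is satisfied: stop early
            break
        image = generate(prompt, previous=image, feedback=feedback)
        history.append(image)
    return image, history

# Toy stand-ins: an "image" is just the set of prompt elements it renders.
def toy_generate(prompt, previous, feedback):
    if previous is None:
        return frozenset(prompt[:1])      # first draft gets only one element
    return previous | {feedback[0]}       # each refinement fixes one issue

def toy_critique(prompt, image):
    return sorted(set(prompt) - image)    # list of missing elements

prompt = ["red cube", "blue sphere", "cube left of sphere"]
final_image, history = refine_loop(prompt, toy_generate, toy_critique)
```

Each pass fixes one reported issue, so the complex prompt is satisfied through a short sequence of corrections rather than in a single attempt.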
Improving MoE Compute Efficiency by Composing Weight and Data Sparsity
Authors: Maciej Kilian, Oleg Mkrtchyan, Luke Zettlemoyer, Akshat Shrivastava, Armen Aghajanyan
First: 2026-01-21T18:53:58+00:00 · Latest: 2026-01-21T18:53:58+00:00
Abstract
Mixture-of-Experts layers achieve compute efficiency through weight sparsity: each token activates only a subset of experts. Data sparsity, where each expert processes only a subset of tokens, offers a complementary axis. Expert-choice routing implements data sparsity directly but violates causality in autoregressive models, creating train-inference mismatch. We recover data sparsity within causal token-choice MoE by leveraging zero-compute (null) experts within the routing pool. When a token routes to null experts, those slots consume no compute. The standard load balancing objective trains the model to use all experts (real and null) uniformly, thereby creating data sparsity in expectation without the causality violations. We evaluate on vision-language model training, where data heterogeneity is pronounced: vision encoders produce many low-information tokens while text tokens are denser. At matched expected FLOPs, composing weight and data sparsity yields a more compute-efficient frontier than weight sparsity alone, with gains in training loss and downstream performance. The model learns implicit modality-aware allocation, routing vision tokens to null experts more aggressively than text, without explicit modality routing.
中文标题/摘要
标题:通过组合权重和数据稀疏性提高专家混合层的计算效率
专家混合层通过权重稀疏性实现计算效率:每个标记仅激活专家子集。数据稀疏性,其中每个专家仅处理标记子集,提供了互补的维度。专家选择路由直接实现数据稀疏性,但在自回归模型中违反了因果性,导致训练与推理不匹配。我们通过在路由池中利用零计算(空)专家来在因果标记选择的专家混合层中恢复数据稀疏性。当标记路由到空专家时,这些槽位不消耗计算资源。标准负载均衡目标训练模型均匀使用所有专家(真实和空的),因此在期望中创建数据稀疏性,而不违反因果性。我们在视觉-语言模型训练中进行评估,其中数据异质性明显:视觉编码器产生许多低信息量标记,而文本标记更密集。在匹配预期FLOPs的情况下,组合权重和数据稀疏性比单独使用权重稀疏性提供了更高效的计算边界,训练损失和下游性能都有所提升。模型学习隐式的模态感知分配,更积极地将视觉标记路由到空专家,而无需显式的模态路由。
Summary / 总结
The research aims to enhance the computational efficiency of Mixture-of-Experts (MoE) layers by combining weight sparsity and data sparsity. The method uses zero-compute (null) experts to implement data sparsity within causal token-choice MoE, avoiding the causality violations of direct expert-choice routing. The evaluation on vision-language model training shows that, at matched expected FLOPs, combining weight and data sparsity yields a more compute-efficient frontier than weight sparsity alone, with improvements in training loss and downstream performance. The model implicitly routes vision tokens to null experts more aggressively than text tokens, without explicit modality routing.
研究旨在通过结合权重稀疏性和数据稀疏性来提高Mixture-of-Experts(MoE)层的计算效率。方法是使用零计算(空闲)专家在因果令牌选择的MoE中实现数据稀疏性,避免了直接专家选择路由中存在的因果性问题。在视觉-语言模型训练上的评估表明,结合权重稀疏性和数据稀疏性比单独使用权重稀疏性更高效,能够提高训练损失和下游性能。模型隐式地将视觉令牌分配给空闲专家的频率高于文本令牌,而无需显式的模态路由。
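A small sketch may help show how null experts turn token-choice routing into data sparsity. The expert counts, logits, and function names below are illustrative assumptions. The expectation formula assumes perfectly uniform expert usage, which the load-balancing objective only encourages.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(logits, num_real, top_k):
    """Token-choice top-k routing; indices >= num_real are null experts.

    Returns the chosen real experts with their gate probabilities, plus the
    number of slots that landed on null experts and so cost no compute.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    real = [(i, probs[i]) for i in chosen if i < num_real]
    return real, top_k - len(real)

def expected_real_slots(top_k, num_real, num_null):
    """Expected compute-bearing slots per token under uniform expert usage."""
    return top_k * num_real / (num_real + num_null)

# Four real experts (0-3) and four null experts (4-7); logits are made up.
text_logits = [2.0, 1.5, 0.1, 0.0, -1.0, -1.0, -1.0, -1.0]
vision_logits = [0.2, 0.0, 0.0, 0.0, 2.0, 1.0, 0.5, 0.1]
text_real, text_null = route_token(text_logits, num_real=4, top_k=2)
vision_real, vision_null = route_token(vision_logits, num_real=4, top_k=2)
```

Here the text-like token spends both slots on real experts while the vision-like token spends none, which mirrors the modality-aware allocation reported in the abstract.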
PROGRESSLM: Towards Progress Reasoning in Vision-Language Models
Authors: Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu
First: 2026-01-21T17:56:59+00:00 · Latest: 2026-01-21T17:56:59+00:00
Comments: Website: https://progresslm.github.io/ProgressLM/
Abstract
Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.
中文标题/摘要
标题:PROGRESSLM:迈向视觉语言模型中的进度推理
估计任务进度需要推理长时动态,而不仅仅是识别静态视觉内容。尽管现代视觉语言模型(VLMs)在描述可见内容方面表现出色,但尚不清楚它们是否能够从部分观察中推断出任务的进展情况。为此,我们引入了Progress-Bench,用于系统评估VLMs中的进度推理。除了基准测试外,我们还通过无训练提示和基于精心构建的数据集ProgressLM-45K的训练方法,进一步探索了受人类启发的两阶段进度推理范式。在14个VLMs上的实验表明,大多数模型尚未准备好进行任务进度估计,表现出对演示模态和视角变化的敏感性,以及对无法回答的情况处理不佳。虽然无训练提示强制结构化的进度推理仅能带来有限且模型依赖的收益,但基于训练的ProgressLM-3B即使在小型模型规模下也能实现一致的改进,尽管其训练任务集与评估任务完全不重叠。进一步的分析揭示了特征错误模式,并阐明了进度推理何时以及为何成功或失败。
Summary / 总结
The research aims to evaluate the ability of Vision-Language Models (VLMs) to estimate task progress, which involves reasoning over long-term dynamics rather than recognizing static visual content. To this end, the authors introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. They explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach using the curated dataset ProgressLM-45K. Experiments on 14 VLMs reveal that most models struggle with task progress estimation, showing sensitivity to changes in demonstration modality and viewpoint, and poor handling of unanswerable cases. While training-free prompting provides limited gains, the training-based ProgressLM-3B model achieves consistent improvements even at a small model scale.
研究旨在评估和提升视觉语言模型(Vision-Language Models, VLMs)在任务进度推理方面的能力,这需要理解长期动态而非仅仅识别静态视觉内容。为此,作者引入了Progress-Bench,用于系统性地评估VLMs的进度推理能力。他们还通过训练-free提示和基于ProgressLM-45K数据集的训练方法探索了启发式两阶段进度推理框架。实验结果显示,大多数模型在任务进度估计方面存在困难,表现出对演示模态和视角变化的敏感性,以及处理无法回答情况的困难。然而,基于训练的ProgressLM-3B模型即使在小规模下也显示出了持续改进,尽管其训练任务集与评估任务集完全不重合。
CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation
Authors: V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin
First: 2025-12-23T13:44:41+00:00 · Latest: 2026-01-21T16:42:28+00:00
Comments: 37 pages, 42 figures
Abstract
Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of **thinking** based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free and model-agnostic framework for multimodal image generation. CRAFT transforms a user prompt into a set of explicit, dependency-structured visual constraints, verifies generated images using a vision-language model, and performs targeted prompt updates only when specific constraints are violated. This iterative process includes an explicit stopping criterion, resulting in an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.
中文标题/摘要
标题:CRAFT:连续推理和自主反馈调优的多模态文本到图像生成
近期研究表明,在不重新训练的情况下,推理时间和反思可以提高文本到图像生成的效果。然而,现有方法往往依赖于隐式的、整体的批评或不受限制的提示重写,这使得它们的行为难以解释、控制或可靠地停止。相比之下,大型语言模型得益于基于验证、目标修正和早期停止的明确、结构化的**思考**形式。我们提出了CRAFT(连续推理和自主反馈调优),这是一种无需训练且模型无关的多模态图像生成框架。CRAFT 将用户提示转换为一组明确的、依赖结构化的视觉约束,使用视觉语言模型验证生成的图像,并仅在特定约束被违反时进行有针对性的提示更新。这个迭代过程包括一个明确的停止标准,从而形成一个可解释且可控的推理时精炼循环。在多个模型家族和具有挑战性的基准测试中,CRAFT 一致地提高了组合准确性、文本呈现和基于偏好的评估,特别是在轻量级生成器方面取得了显著的改进。重要的是,这些改进仅带来了微不足道的推理时开销,使得较小或更便宜的模型能够接近更昂贵系统的质量。我们的结果表明,明确结构化的、基于约束的推理是提高多模态生成模型可靠性的关键成分。
Summary / 总结
CRAFT is a training-free and model-agnostic framework for multimodal text-to-image generation. It transforms user prompts into explicit visual constraints, verifies generated images using a vision-language model, and updates prompts only when constraints are violated. This iterative process, with an explicit stopping criterion, leads to improved compositional accuracy, text rendering, and preference-based evaluations, especially for lightweight generators, with minimal inference-time overhead.
CRAFT 是一个无需训练且模型无关的框架,将用户提示转化为显式的视觉约束,使用视觉-语言模型验证生成的图像,并仅在约束被违反时更新提示。这一迭代过程包含明确的停止标准,从而提高了组合准确性、文本渲染和基于偏好的评估,尤其是对于轻量级生成器,同时几乎不增加推理时间开销。
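The constraint-driven loop described above can be sketched in a few lines. The `Constraint` class, the toy generator and verifier, and the way the prompt is updated are all hypothetical stand-ins, not CRAFT's actual components. The sketch only shows dependency-aware checking, targeted updates, and an explicit stopping rule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    name: str
    depends_on: tuple = ()   # names of constraints that must hold first

def actionable_violations(constraints, passed):
    """Violated constraints whose dependencies all hold (worth fixing now)."""
    return [c for c in constraints
            if not passed[c.name] and all(passed[d] for d in c.depends_on)]

def constraint_loop(prompt, constraints, generate, verify, max_rounds=5):
    for round_idx in range(max_rounds):
        image = generate(prompt)
        passed = verify(image, constraints)
        todo = actionable_violations(constraints, passed)
        if not todo:                       # explicit stopping criterion
            return image, prompt, round_idx + 1
        # Targeted update: mention only the constraints that failed.
        prompt += "; ensure " + ", ".join(c.name for c in todo)
    return image, prompt, max_rounds

# Toy stand-ins: the "generator" renders a cat plus anything it was told to ensure.
def toy_generate(prompt):
    rendered = {"cat"}
    for clause in prompt.split("; ensure ")[1:]:
        rendered.update(clause.split(", "))
    return rendered

def toy_verify(image, constraints):
    return {c.name: c.name in image for c in constraints}

constraints = [Constraint("cat"), Constraint("hat"),
               Constraint("cat wears hat", depends_on=("cat", "hat"))]
image, final_prompt, rounds = constraint_loop(
    "a cat wearing a hat", constraints, toy_generate, toy_verify)
```

Because a relational constraint is only acted on once its prerequisites hold, the loop fixes the missing hat before it tries to fix the relation between cat and hat.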
Training-Free and Interpretable Hateful Video Detection via Multi-stage Adversarial Reasoning
Authors: Shuonan Yang, Yuchen Zhang, Zeyu Fu
Venue: ICASSP 2026
First: 2026-01-21T15:52:26+00:00 · Latest: 2026-01-21T15:52:26+00:00
Comments: Accepted at ICASSP 2026. © 2026 IEEE. This is the author accepted manuscript. The final published version will be available via IEEE Xplore
Abstract
Hateful videos pose serious risks by amplifying discrimination, inciting violence, and undermining online safety. Existing training-based hateful video detection methods are constrained by limited training data and lack of interpretability, while directly prompting large vision-language models often struggles to deliver reliable hate detection. To address these challenges, this paper introduces MARS, a training-free Multi-stage Adversarial ReaSoning framework that enables reliable and interpretable hateful content detection. MARS begins with the objective description of video content, establishing a neutral foundation for subsequent analysis. Building on this, it develops evidence-based reasoning that supports potential hateful interpretations, while in parallel incorporating counter-evidence reasoning to capture plausible non-hateful perspectives. Finally, these perspectives are synthesized into a conclusive and explainable decision. Extensive evaluation on two real-world datasets shows that MARS achieves up to 10% improvement under certain backbones and settings compared to other training-free approaches and outperforms state-of-the-art training-based methods on one dataset. In addition, MARS produces human-understandable justifications, thereby supporting compliance oversight and enhancing the transparency of content moderation workflows. The code is available at https://github.com/Multimodal-Intelligence-Lab-MIL/MARS.
中文标题/摘要
标题:基于多阶段对抗推理的无训练可解释仇恨视频检测
仇恨视频通过放大歧视、煽动暴力和破坏在线安全等方式带来严重风险。现有的基于训练的仇恨视频检测方法受限于训练数据有限且缺乏可解释性,而直接对大型视觉-语言模型进行提示往往难以提供可靠的仇恨检测。为解决这些挑战,本文提出了一种无训练的多阶段对抗推理框架MARS,以实现可靠且可解释的仇恨内容检测。MARS从客观描述视频内容开始,建立后续分析的中立基础。在此基础上,它发展了基于证据的推理,支持潜在的仇恨解释,同时并行地纳入反证据推理以捕捉可能的非仇恨视角。最后,这些视角被综合成一个明确且可解释的决策。在两个真实世界数据集上的广泛评估表明,MARS在某些骨干网络和设置下比其他无训练方法最多提高了10%,并在其中一个数据集上优于最先进的基于训练的方法。此外,MARS生成了人类可理解的解释,从而支持合规监督并增强内容审核流程的透明度。代码可在https://github.com/Multimodal-Intelligence-Lab-MIL/MARS/ 获取。
Summary / 总结
This paper addresses the challenges of detecting hateful videos by introducing MARS, a training-free Multi-stage Adversarial ReaSoning framework. MARS starts with objective video content description, then develops reasoning that supports potential hateful interpretations while incorporating counter-evidence to capture non-hateful perspectives. The framework synthesizes these perspectives into a conclusive and explainable decision. Experimental results on two real-world datasets show that MARS improves on other training-free approaches by up to 10% under certain backbones and settings and outperforms state-of-the-art training-based methods on one dataset. It also provides human-understandable justifications for content moderation.
本文通过引入MARS,一种无训练的多阶段对抗推理框架,解决了检测仇恨视频的挑战。MARS从中立的视频内容分析开始,然后发展出基于证据的潜在仇恨解释和反证据推理以捕捉非仇恨观点,最终综合形成一个明确且可解释的决策。该方法在性能上超越了现有的无训练方法和最先进的基于训练的方法,最高可提高10%的准确率,并提供可理解的解释支持内容审核的合规性和透明度。在两个真实世界数据集上的评估证明了其有效性和可解释性。
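The four stages described above can be laid out as a simple prompt pipeline. The prompt wording, the `ask` callable (standing in for a vision-language model call), and the returned dictionary keys are assumptions for illustration; the authors' actual prompts are in their repository. The two evidence stages run one after the other here, but neither sees the other's output, which keeps them independent as in the parallel design.

```python
def mars_style_pipeline(video_input, ask):
    """Run description, supporting evidence, counter-evidence, and synthesis."""
    description = ask(
        "Describe the video content objectively, without judgement:\n" + video_input)
    evidence_for = ask(
        "Based only on this description, list evidence that supports a "
        "hateful interpretation:\n" + description)
    evidence_against = ask(
        "Based only on this description, list evidence that supports a "
        "non-hateful interpretation:\n" + description)
    decision = ask(
        "Weigh both sides and give a final label with a short explanation.\n"
        "Supporting evidence:\n" + evidence_for +
        "\nCounter-evidence:\n" + evidence_against)
    return {
        "description": description,
        "evidence_for": evidence_for,
        "evidence_against": evidence_against,
        "decision": decision,
    }

# A recording stub in place of a real model, to show the data flow.
calls = []
def fake_ask(prompt):
    calls.append(prompt)
    return f"reply-{len(calls)}"

result = mars_style_pipeline("frame captions and transcript", fake_ask)
```

Keeping every intermediate output in the result is what makes the final decision inspectable by a human reviewer.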
Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization
Authors: Rui Chen, Bin Liu, Changtao Miao, Xinghao Wang, Yi Li, Tao Gong, Qi Chu, Nenghai Yu
First: 2025-10-11T08:42:31+00:00 · Latest: 2026-01-21T15:39:57+00:00
Comments: This version was uploaded in error and contains misleading information found in an early draft. The manuscript requires extensive and long-term revisions
Abstract
Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates an objectified rule construction with adaptive filtering to build a reliable knowledge base and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained forensics results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.
中文标题/摘要
标题:无需训练的上下文法医链用于图像篡改检测与定位
图像篡改技术的进步带来了严重的安全威胁,突显了有效图像篡改定位(IML)的必要性。虽然监督IML能够取得优异性能,但它依赖于昂贵的像素级注释。现有的弱监督或无需训练的替代方案往往表现不佳且缺乏可解释性。我们提出了一种无需训练的框架——上下文法医链(ICFC),该框架利用多模态大型语言模型(MLLMs)进行可解释的IML任务。ICFC 结合了对象化规则构建与自适应过滤,构建了一个可靠的知识库,并采用多步骤渐进推理管道,模拟专家法医工作流程,从粗略提案到精细的法医结果。此设计使MLLM推理在图像级分类、像素级定位和文本级可解释性方面的系统利用成为可能。在多个基准测试中,ICFC 不仅超越了最先进的无需训练方法,而且在弱监督和完全监督方法方面也取得了竞争性或更优的性能。
Summary / 总结
The paper addresses the challenge of image manipulation localization (IML) by proposing the In-Context Forensic Chain (ICFC), a training-free framework that uses multi-modal large language models to construct a knowledge base and a reasoning pipeline. ICFC integrates rule construction and adaptive filtering to achieve image-level classification, pixel-level localization, and text-level interpretability. The framework outperforms existing training-free methods and matches or exceeds the performance of weakly and fully supervised approaches on multiple benchmarks.
研究提出了一种名为In-Context Forensic Chain (ICFC)的训练-free框架,利用多模态大型语言模型构建知识库和推理管道。该方法结合规则构建和自适应过滤,实现图像级别分类、像素级别定位和文本级别可解释性。实验表明,ICFC 在多个基准测试中不仅超越了现有的训练-free 方法,而且在弱监督和全监督方法的性能上达到了竞争或优越的水平。