arXiv 论文速递

2026-01-24 03:37
Snapshot: 20260124_0337
GutenOCR: A Grounded Vision-Language Front-End for Documents
Authors: Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew
First: 2026-01-20T21:26:15+00:00 · Latest: 2026-01-22T18:58:24+00:00
Abstract
GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
中文标题/摘要
标题:GutenOCR:面向文档的定位式视觉-语言前端
GutenOCR 是通过微调 Qwen2.5-VL-3B 和 Qwen2.5-VL-7B 获得的一系列具备定位能力的 OCR 前端。所得的单检查点视觉-语言模型通过统一的、基于提示的接口提供阅读、检测和定位功能。这些模型在商业文档、科学文章和合成定位数据上训练,支持全页和局部阅读,输出行级和段落级边界框,并支持“x 在哪里?”式的条件查询。我们提出了一种定位式 OCR 评估协议,并表明在 10.5K 个留出的商业和科学页面上,GutenOCR-7B 的综合定位 OCR 分数是其 Qwen2.5-VL-7B 主干的两倍多(从 0.40 提升到 0.82)。在 Fox 和 OmniDocBench v1.5 上,我们的方法显著提升了区域级和行级 OCR 以及文本检测召回率,但也暴露出在页面级线性化、颜色引导 OCR 和公式密集版面上的权衡。
Summary / 总结
GutenOCR is a family of vision-language models fine-tuned from Qwen2.5-VL-3B and Qwen2.5-VL-7B that provide unified reading, detection, and grounding through a prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, these models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional queries. On 10.5K held-out business and scientific pages, GutenOCR-7B more than doubles the composite grounded OCR score of its backbone (0.40 to 0.82). However, there are trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
GutenOCR 是从 Qwen2.5-VL-3B 和 Qwen2.5-VL-7B 微调而来的视觉-语言模型,通过统一的提示式接口提供阅读、检测和定位功能。该模型经过商业文档和科学文章的训练,在10,500个保留页面上将接地OCR得分翻了一番。它提高了区域和行级OCR以及文本检测召回率,但在页面级线性化、颜色引导OCR和公式密集布局方面存在一些权衡。
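The GutenOCR abstract reports text-detection recall over line- and paragraph-level bounding boxes but does not spell out the matching rule. The following is a minimal sketch of one common way to compute such a recall, assuming axis-aligned boxes and an intersection-over-union (IoU) threshold of 0.5. The function names, the threshold, and the sample boxes are illustrative assumptions, not the paper's evaluation protocol.

```python
from typing import List, Tuple

# (x1, y1, x2, y2) with x1 < x2 and y1 < y2; this box format is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_recall(truth: List[Box], predicted: List[Box],
                     threshold: float = 0.5) -> float:
    """Fraction of ground-truth boxes overlapped by at least one prediction.

    Simplification: matching is not one-to-one, so a single prediction
    may cover several ground-truth boxes.
    """
    if not truth:
        return 0.0
    hits = sum(1 for t in truth
               if any(iou(t, p) >= threshold for p in predicted))
    return hits / len(truth)


# Toy boxes, made up for illustration.
truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [(1, 1, 10, 10), (50, 50, 60, 60)]
print(detection_recall(truth, predicted))
```

A stricter protocol would enforce one-to-one matching between predictions and ground truth. The looser rule is used here only to keep the sketch short.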
LLM-in-Sandbox Elicits General Agentic Intelligence
Authors: Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei
First: 2026-01-22T18:57:09+00:00 · Latest: 2026-01-22T18:57:09+00:00
Comments: Project Page: https://llm-in-sandbox.github.io
Abstract
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
中文标题/摘要
标题:LLM-in-Sandbox 激发通用代理智能
我们介绍了 LLM-in-Sandbox,使大语言模型能够在代码沙盒(即虚拟计算机)中探索,以激发非代码领域的通用智能。我们首先展示了强大的大语言模型在无需额外训练的情况下,能够利用代码沙盒来执行非代码任务的一般化能力。例如,大语言模型自发地访问外部资源以获取新知识,利用文件系统处理长文本,并执行脚本以满足格式要求。我们进一步表明,通过仅使用非代理数据训练用于沙盒探索的模型,LLM-in-Sandbox 强化学习(LLM-in-Sandbox-RL)可以增强这些代理能力。实验表明,无论是在无训练还是后训练设置下,LLM-in-Sandbox 都能够实现涵盖数学、物理、化学、生物医学、长文本理解以及指令遵循的稳健泛化。最后,我们从计算和系统角度分析了 LLM-in-Sandbox 的效率,并将其开源为 Python 包,以促进其实用部署。
Summary / 总结
The research introduces LLM-in-Sandbox, which lets large language models (LLMs) explore a code sandbox to elicit general intelligence in non-code domains. The study demonstrates that strong LLMs, without additional training, can use the sandbox for non-code tasks, for example by accessing external resources, using the file system to handle long contexts, and executing scripts. LLM-in-Sandbox Reinforcement Learning, trained only on non-agentic data, further enhances these capabilities. Experiments show robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. The research also analyzes the efficiency of LLM-in-Sandbox from computational and system perspectives and open-sources it as a Python package for real-world deployment.
研究引入了LLM-in-Sandbox,使大型语言模型(LLMs)能够在代码沙箱中探索,以在非代码领域发展一般智能。研究展示了强大的LLMs能够泛化并在非代码任务中使用沙箱,例如访问外部资源、处理长文本和执行脚本。通过LLM-in-Sandbox强化学习进一步增强了沙箱探索能力。实验表明,LLM-in-Sandbox在数学、物理、化学、生物医学和指令遵循等多个领域实现了稳健的泛化。研究还从计算和系统角度分析了LLM-in-Sandbox的效率,并将其作为Python包开源,以促进实际部署。
Training-Free Geospatial Place Representation Learning from Large-Scale Point-of-Interest Graph Data
Authors: Mohammad Hashemi, Hossein Amiri, Andreas Zufle
First: 2025-06-25T15:10:31+00:00 · Latest: 2026-01-22T18:46:50+00:00
Abstract
Learning effective representations of urban environments requires capturing spatial structure beyond fixed administrative boundaries. Existing geospatial representation learning approaches typically aggregate Points of Interest (POIs) into pre-defined administrative regions such as census units or ZIP code areas, assigning a single embedding to each region. However, POIs often form semantically meaningful groups that extend across, within, or beyond these boundaries, defining places that better reflect human activity and urban function. To address this limitation, we propose PlaceRep, a training-free geospatial representation learning method that constructs place-level representations by clustering spatially and semantically related POIs. PlaceRep summarizes large-scale POI graphs from U.S. Foursquare data to produce general-purpose urban region embeddings while automatically identifying places across multiple spatial scales. By eliminating model pre-training, PlaceRep provides a scalable and efficient solution for multi-granular geospatial analysis. Experiments with population density estimation and housing price prediction as downstream tasks show that PlaceRep outperforms most state-of-the-art graph-based geospatial representation learning methods and achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation of PlaceRep is available at https://github.com/mohammadhashemii/PlaceRep.
中文标题/摘要
标题:基于大规模兴趣点图数据的免训练地理空间地点表示学习
学习有效的城市环境表示需要捕捉超越固定行政边界的空间结构。现有的地理空间表示学习方法通常将兴趣点(POI)聚合到预先定义的行政区域中,如普查单位或邮政编码区域,并为每个区域分配一个单一的嵌入。然而,POI往往形成具有语义意义的群体,跨越、位于或超出这些边界,定义了更好地反映人类活动和城市功能的地点。为了解决这一局限性,我们提出了一种无需训练的地理空间表示学习方法PlaceRep,该方法通过聚类空间上和语义上相关的POI来构建地点级表示。PlaceRep从美国Foursquare数据中总结大规模POI图,生成通用的城市区域嵌入,同时自动识别跨多个空间尺度的地点。通过消除模型预训练,PlaceRep提供了一种可扩展且高效的多粒度地理空间分析解决方案。使用人口密度估计和房价预测作为下游任务的实验表明,PlaceRep在大多数基于图的地理空间表示学习方法中表现更优,并在生成大规模POI图的区域级表示时实现了高达100倍的速度提升。PlaceRep的实现可在https://github.com/mohammadhashemii/PlaceRep获取。
Summary / 总结
The research aims to improve geospatial representation learning by capturing spatial structure beyond administrative boundaries. PlaceRep, a training-free method, clusters spatially and semantically related Points of Interest (POIs) to create place-level representations, which are evaluated on population density estimation and housing price prediction. Experiments show that PlaceRep outperforms most state-of-the-art graph-based methods and provides up to a 100x speedup in generating region-level representations.
研究旨在通过聚类语义相关的兴趣点(POI)来学习有效的地理空间表示,而不需要预训练。PlaceRep通过识别语义上有意义的POI组来构建地方级表示,更好地反映人类活动和城市功能。实验表明,PlaceRep在人口密度估计和房价预测等任务上优于现有方法,并且在生成大规模POI图的区域级表示时可提供高达100倍的速度提升。
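The PlaceRep abstract describes clustering spatially and semantically related POIs and summarizing each cluster into a place-level representation, without giving the algorithm. The sketch below shows one simple, training-free way to do something similar. It assumes two POIs belong together when they share a category and lie within a given radius, and that a place embedding is the mean of its members' feature vectors. These rules, the function names, and the sample data are all hypothetical and are not taken from the PlaceRep implementation.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

# (x, y, category, feature_vector); all values below are made up.
POI = Tuple[float, float, str, List[float]]


def cluster_pois(pois: List[POI], radius: float) -> List[List[int]]:
    """Single-linkage grouping: link POIs with the same category
    that lie within `radius` of each other (union-find, O(n^2) pairs)."""
    parent = list(range(len(pois)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(pois)):
        for j in range(i + 1, len(pois)):
            same_category = pois[i][2] == pois[j][2]
            close = math.dist(pois[i][:2], pois[j][:2]) <= radius
            if same_category and close:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = defaultdict(list)
    for i in range(len(pois)):
        groups[find(i)].append(i)
    return sorted(groups.values())


def place_embedding(pois: List[POI], members: List[int]) -> List[float]:
    """Mean of the member POIs' feature vectors (no training involved)."""
    dim = len(pois[members[0]][3])
    return [sum(pois[m][3][d] for m in members) / len(members)
            for d in range(dim)]


pois = [
    (0.0, 0.0, "cafe", [1.0, 0.0]),
    (0.5, 0.0, "cafe", [0.0, 1.0]),
    (0.6, 0.0, "bank", [1.0, 1.0]),
    (10.0, 10.0, "cafe", [0.5, 0.5]),
]
places = cluster_pois(pois, radius=1.0)
print(places)
print(place_embedding(pois, places[0]))
```

Rerunning `cluster_pois` with a larger `radius` yields coarser places, which is one way to read the abstract's "multiple spatial scales". The pairwise loop is quadratic and would need a spatial index to scale to large POI graphs.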
Multimodal Climate Disinformation Detection: Integrating Vision-Language Models with External Knowledge Sources
Authors: Marzieh Adeli Shamsabad, Hamed Ghodrati
First: 2026-01-22T16:55:48+00:00 · Latest: 2026-01-22T16:55:48+00:00
Abstract
Climate disinformation has become a major challenge in today's digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay action on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image search results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model's ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.
中文标题/摘要
标题:多模态气候虚假信息检测:结合视觉-语言模型与外部知识源
气候虚假信息已成为当今数字世界的主要挑战,尤其是在社交媒体上广泛传播误导性的图片和视频的情况下。这些虚假声明往往令人信服且难以识别,这可能会延迟应对气候变化的行动。虽然视觉-语言模型(VLMs)已被用于识别视觉虚假信息,但它们仅依赖于训练时可用的知识。这限制了它们对近期事件或更新进行推理的能力。本文的主要目标是通过结合 VLMs 与外部知识来克服这一限制。通过检索最新的信息,如逆向图像搜索结果、在线事实核查和可信专家内容,该系统可以更好地评估图片及其声明是否准确、误导、虚假或无法验证。这种方法提高了模型处理真实世界气候虚假信息的能力,并支持在快速变化的信息环境中保护公众对科学的理解的努力。
Summary / 总结
The paper addresses the challenge of detecting climate disinformation by integrating vision-language models with external knowledge sources. It aims to enhance the models' ability to reason about recent events and updates, which traditional models trained on static knowledge cannot do. The system retrieves up-to-date information like reverse image results, online fact-checks, and expert content to assess the accuracy of images and their claims, thereby improving the detection of climate disinformation.
研究旨在通过将视觉语言模型与外部知识源结合,应对社交媒体上误导性的气候信息,特别是误导性的图片和视频。方法包括检索最新的信息,如逆向图像结果、在线事实核查和专家内容,以增强模型评估声明准确性的能力。关键发现表明,这种方法提高了模型处理现实世界气候误导信息的能力,支持了保护公众对气候科学理解的努力。
DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models
Authors: Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan, Yangfan He, Yuchen Li, Jingqun Tang
First: 2026-01-22T16:02:56+00:00 · Latest: 2026-01-22T16:02:56+00:00
Abstract
Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as 'distracting tokens'. This behavior can distract the model from generating the desired action tokens at each step, lowering the task success rate. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate and to explore the performance upper bound of the model without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: https://anonymous.4open.science/r/CBD3.
中文标题/摘要
标题:DTP:一种简单有效的视觉-语言-动作模型分散令牌剪枝框架
视觉-语言-动作(VLA)模型通过利用视觉-语言模型(VLM)的强大感知能力来理解环境并直接输出动作,已经在机器人操作方面取得了显著进展。然而,默认情况下,VLA模型可能会过度关注任务无关区域的图像令牌,我们将其称为“分散令牌”。这种行为会干扰模型在每一步生成所需动作令牌的能力,影响任务的成功率。在本文中,我们介绍了一种简单有效的即插即用分散令牌剪枝(DTP)框架,该框架能够动态检测并剪枝这些分散的图像令牌。通过纠正模型的视觉注意力模式,我们旨在提高任务成功率,并探索模型的性能上限,而不改变其原始架构或添加额外输入。在SIMPLER基准(Li等,2024)上的实验表明,我们的方法在不同类型的新型VLA模型中一致地提高了任务成功率,展示了其对基于变换器的VLA模型的通用性。进一步的分析揭示了所有测试模型的任务成功率与其任务无关区域注意力量之间存在负相关关系,突显了VLA模型中的一种常见现象,这可以指导未来的研究。我们还发布了我们的代码:https://anonymous.4open.science/r/CBD3.
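The DTP abstract says the framework detects and prunes image tokens that draw too much attention in the task-irrelevant region, but it does not give the detection rule. The sketch below illustrates the general idea on plain Python lists. It assumes a per-token attention weight, a precomputed boolean mask marking task-relevant tokens, and a fixed threshold. The mask, the threshold, and both function names are hypothetical and are not the authors' method.

```python
from typing import List


def prune_distracting_tokens(attention: List[float], relevant: List[bool],
                             threshold: float) -> List[int]:
    """Return the indices of image tokens to keep.

    A token is dropped only if it lies outside the task-relevant region
    AND its attention weight exceeds `threshold`.
    """
    return [i for i, (a, r) in enumerate(zip(attention, relevant))
            if r or a <= threshold]


def irrelevant_attention_share(attention: List[float],
                               relevant: List[bool]) -> float:
    """Share of total attention that falls on task-irrelevant tokens."""
    total = sum(attention)
    if total == 0:
        return 0.0
    return sum(a for a, r in zip(attention, relevant) if not r) / total


# Toy values, made up for illustration.
attention = [0.05, 0.40, 0.30, 0.25]
relevant = [False, False, True, True]
keep = prune_distracting_tokens(attention, relevant, threshold=0.1)
print(keep)
print(irrelevant_attention_share(attention, relevant))
```

The second function corresponds loosely to the quantity the abstract correlates with task success: the amount of attention spent in the task-irrelevant region.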
DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
Authors: Junha Lee, Eunha Park, Minsu Cho
First: 2026-01-22T15:23:35+00:00 · Latest: 2026-01-22T15:23:35+00:00
Abstract
Language-driven dexterous grasp generation requires the models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves 67.14% success rate, outperforming state-of-the-art by 3.83%p with 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
中文标题/摘要
标题:DextER:基于语言的灵巧抓取生成与具身推理
基于语言的灵巧抓取生成要求模型理解任务语义、3D几何和复杂的手物交互。尽管视觉语言模型已被应用于此问题,但现有方法直接将观察结果映射为抓取参数,而没有关于物理交互的中间推理。我们提出了DextER,灵巧抓取生成与具身推理,引入了基于接触的具身推理进行多指操作。我们的关键见解是,预测手部的哪些连杆与物体表面的哪些位置接触,可以提供一种具身感知的中间表示,将任务语义与物理约束联系起来。DextER 自回归地生成具身接触标记,指定哪些手指连杆与物体表面的哪些位置接触,随后生成编码手部配置的抓取标记。在DexGYS上,DextER 达到了67.14%的成功率,比最先进的方法高出3.83个百分点,意图对齐提高了96.4%。我们还展示了通过部分接触指定实现可引导的生成,提供了对抓取合成的精细控制。
Summary / 总结
DextER is designed to address the challenge of language-driven dexterous grasp generation by incorporating embodied reasoning. It predicts contact points between finger links and the object surface, bridging task semantics and physical constraints. On the DexGYS dataset, DextER achieves a 67.14% success rate, surpassing the previous state of the art by 3.83 percentage points and improving intention alignment by 96.4%. Additionally, DextER supports steerable generation through partial contact specification, offering fine-grained control over grasp synthesis.
DextER旨在通过引入具身推理来解决语言驱动的灵巧抓取生成问题。它预测手指连杆与物体表面之间的接触点,将任务语义与物理约束联系起来。在DexGYS数据集上,DextER的成功率为67.14%,比先前的最先进方法高出3.83个百分点,并且在意图对齐方面提高了96.4%。此外,DextER还支持通过部分接触指定进行可控生成,提供精细的抓取合成控制。
SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration
Authors: Yan Zhuang, Jiawei Ren, Xiaokang Ye, Jianzhi Shen, Ruixuan Zhang, Tianai Yue, Muhammad Faayez, Xuhong He, Ziqiao Ma, Lianhui Qin, Zhiting Hu, Tianmin Shu
Venue: NeurIPS 2025
First: 2025-12-10T20:04:08+00:00 · Latest: 2026-01-22T14:26:01+00:00
Comments: Conference: NeurIPS 2025 (main)
Abstract
Recent advances in foundation models have shown promising results in developing generalist robots that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has been mainly focused on indoor, household scenarios. In this work, we present SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) multimodal instruction grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation with people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking the robust perception, reasoning, and planning abilities necessary for urban environments.
中文标题/摘要
标题:SimWorld-Robotics: 合成逼真且动态的城市环境以供多模态机器人导航与协作
基础模型的最新进展表明,在给定多模态输入的情况下,通用机器人可以在开放场景中执行多种任务,显示出有希望的结果。然而,当前的工作主要集中在室内家庭场景。在本研究中,我们介绍了SimWorld-Robotics (SWR),一个用于大规模逼真城市环境的模拟平台,支持具身AI。SWR基于Unreal Engine 5构建,可以生成无限的逼真城市场景,包含动态元素如行人和交通系统,超越了先前的城市模拟在逼真度、复杂性和可扩展性方面的表现。它还支持多机器人控制和通信。凭借这些关键功能,我们构建了两个具有挑战性的机器人基准测试:(1)多模态指令跟随任务,机器人必须在行人和交通的环境中根据视觉-语言导航指令到达目的地;(2)多智能体搜索任务,两个机器人必须通过通信合作找到并会合。与现有基准不同,这两个新基准全面评估了机器人在现实场景中的多种关键能力,包括(1)多模态指令语义理解,(2)大型环境中的三维空间推理,(3)与行人和交通的安全、长距离导航,(4)多机器人协作,以及(5)基于环境的通信。我们的实验结果表明,最先进的模型,包括视觉-语言模型(VLMs),在我们的任务中表现不佳,缺乏在城市环境中所需的稳健感知、推理和规划能力。
Summary / 总结
The research aims to develop a simulation platform for embodied AI in photorealistic urban environments to test robots' capabilities in diverse tasks. The platform, SimWorld-Robotics (SWR), uses Unreal Engine 5 to generate dynamic urban scenes with pedestrians and traffic systems. Two benchmarks are introduced: a multimodal instruction-following task and a multi-agent search task. The results show that state-of-the-art models, including vision-language models, face challenges in tasks involving multimodal grounding, 3D spatial reasoning, safe navigation, and multi-robot collaboration due to insufficient perception and reasoning abilities in urban settings.
研究旨在开发一个用于多模态机器人导航和协作的仿真平台,模拟真实的都市环境。该平台SimWorld-Robotics (SWR) 生成包含行人和交通系统的动态都市场景,支持多机器人控制和通信。引入了两个基准测试:多模态指令跟随任务和多智能体搜索任务。实验结果表明,最先进的模型,包括视觉语言模型,在这些任务中表现不佳,因为它们在复杂都市环境中的感知、推理和规划能力有限。
A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery
Authors: Valery Fischer, Alan Magdaleno, Anna-Katharina Calek, Nicola Cavalcanti, Nathan Hoffman, Christoph Germann, Joschua Wüthrich, Max Krähenmann, Mazda Farshad, Philipp Fürnstahl, Lilian Calvet
First: 2026-01-22T12:48:24+00:00 · Latest: 2026-01-22T12:48:24+00:00
Abstract
Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training. Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity. Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error. Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
中文标题/摘要
标题:手术中3D手部姿态估计的多视图管道和基准数据集
目的:准确的3D手部姿态估计支持手术应用,如技能评估、机器人辅助干预和几何感知工作流程分析。然而,手术环境带来了严重挑战,包括强烈的局部照明、频繁的器械或人员遮挡、手套导致的手部均匀外观,以及可靠的模型训练所需的标注数据稀缺性。 方法:我们提出了一种鲁棒的多视图管道,用于手术环境下的3D手部姿态估计,该管道无需特定领域的微调,仅依赖于现成的预训练模型。该管道结合了可靠的人体检测、全身姿态估计和最先进的跟踪手部区域的2D手部关键点预测,随后进行约束3D优化。此外,我们引入了一个新的手术基准数据集,包含超过68,000帧和3,000个手动标注的2D手部姿态,具有三角化3D地面真值,记录在一个复现的手术室中,场景复杂度不同。 结果:定量实验表明,我们的方法在2D平均关节误差上比基线方法降低了31%,在3D平均每个关节位置误差上降低了76%。 结论:我们的工作为手术中的3D手部姿态估计建立了强大的基线,提供了无需训练的管道和全面标注的数据集,以促进未来手术计算机视觉的研究。
Summary / 总结
The study aims to improve 3D hand pose estimation in surgical settings, addressing challenges like lighting and occlusions. It proposes a multi-view pipeline using off-the-shelf models for person detection, whole-body pose estimation, and 2D hand keypoint prediction, followed by 3D optimization. The method achieves significant improvements, reducing 2D mean joint error by 31% and 3D mean per-joint position error by 76%. Additionally, a new benchmark dataset with over 68,000 frames and 3,000 annotated hand poses is introduced.
研究旨在改善手术场景中的3D手部姿态估计,这对于各种手术应用至关重要。提出的多视图管道使用现成的预训练模型进行人体检测、全身姿态估计和2D手部关键点预测,然后进行3D优化。该方法在包含超过68,000帧的新手术基准数据集上得到验证,并在2D和3D关节误差方面显著优于基线方法。
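The surgical hand-pose abstract reports a 76% reduction in 3D mean per-joint position error (MPJPE). As a worked illustration of how such a figure is computed, the sketch below calculates MPJPE as the average Euclidean distance between predicted and ground-truth joints, then the relative reduction between two methods. The joint coordinates are toy values chosen for easy arithmetic and do not come from the paper's dataset.

```python
import math
from typing import List, Sequence

Joint = Sequence[float]  # (x, y, z) in any consistent unit, e.g. millimetres


def mpjpe(pred: List[Joint], truth: List[Joint]) -> float:
    """Mean per-joint position error: average Euclidean joint distance."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Fractional error reduction of the new method versus the baseline."""
    return (baseline_error - new_error) / baseline_error


# Toy two-joint example (made-up coordinates).
truth = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
baseline_pred = [(0.0, 0.0, 10.0), (10.0, 10.0, 0.0)]
improved_pred = [(0.0, 0.0, 5.0), (10.0, 5.0, 0.0)]

baseline_error = mpjpe(baseline_pred, truth)
improved_error = mpjpe(improved_pred, truth)
print(baseline_error, improved_error)
print(relative_reduction(baseline_error, improved_error))
```

In this toy case the error falls from 10 to 5 units, a 50% reduction. The paper's 76% figure would correspond to the same ratio computed over its benchmark frames.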
RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture
Authors: Anas Anwarul Haq Khan, Mariam Husain, Kshitij Jadhav
First: 2026-01-22T12:11:53+00:00 · Latest: 2026-01-22T12:11:53+00:00
Abstract
Recent advances in medical vision-language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image-text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image-text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
中文标题/摘要
标题:RadJEPA:通过联合嵌入预测架构的胸部X光影像编码器
近期医学视觉语言模型的进步指导了视觉表示的学习;然而,这种监督形式受限于成对的图像文本数据的可用性,引发了是否可以在不依赖语言监督的情况下学习稳健的放射学编码器的问题。在本文中,我们引入了RadJEPA,这是一种基于联合嵌入预测架构的自监督框架,该框架在不依赖语言监督的情况下进行学习。该模型仅在未标记的胸部X光图像上进行预训练,学习预测被遮盖的图像区域的潜在表示。这种预测目标与图像文本预训练和DINO风格的自我蒸馏完全不同:RadJEPA不是在视图或模态之间对齐全局表示,而是明确地建模潜在空间预测。我们在疾病分类、语义分割和报告生成任务上评估了所学习的编码器。在各个基准测试中,RadJEPA的性能超过了最先进的方法,包括Rad-DINO。
Summary / 总结
The research aims to develop a robust radiology encoder for chest X-rays without relying on language supervision. RadJEPA, a self-supervised framework, is introduced, which learns to predict latent representations of masked image regions. The model is pre-trained on unlabeled chest X-ray images and outperforms state-of-the-art approaches in disease classification, semantic segmentation, and report generation tasks.
研究旨在开发一种无需依赖图像-文本配对数据的胸部X光放射学编码器。引入了RadJEPA,这是一种自监督框架,能够学习预测被遮掩图像区域的潜在表示。该方法在疾病分类、语义分割和报告生成等任务上超过了现有方法如Rad-DINO,展示了其在无需语言监督的情况下学习的有效性。
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
Authors: Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei
First: 2026-01-21T08:09:25+00:00 · Latest: 2026-01-22T12:09:02+00:00
Abstract
Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT
中文标题/摘要
标题:思维渲染:将文本推理链渲染为图像以实现视觉潜在推理
思维链(CoT)提示在解锁大型语言模型(LLMs)的推理能力方面取得了显著成功。尽管CoT提示增强了推理能力,但其冗长性带来了巨大的计算开销。近期工作往往专注于结果对齐,而缺乏对中间推理过程的监督。这些不足之处模糊了潜在推理链的可分析性。为解决这些挑战,我们引入了思维渲染(RoT),这是第一个通过将文本步骤渲染为图像来实现推理链实体化的框架,使潜在的推理理由变得明确和可追踪。具体而言,我们利用现有视觉语言模型(VLMs)的视觉编码器作为语义锚点,将视觉嵌入与文本空间对齐。此设计确保了即插即用的实现,无需额外的预训练开销。在数学和逻辑推理基准测试上的广泛实验表明,与显式CoT相比,我们的方法实现了3-4倍的令牌压缩和显著的推理加速。此外,与其他方法相比,它保持了竞争力,验证了此范式的可行性。我们的代码可在https://github.com/TencentBAC/RoT获取
Summary / 总结
The paper introduces Render-of-Thought (RoT), a framework that converts textual reasoning steps into images to make the latent reasoning process explicit and traceable. By leveraging vision encoders from existing Vision Language Models, RoT aligns visual embeddings with textual space, enabling plug-and-play implementation. Experiments show that RoT achieves 3-4x token compression and significant inference acceleration compared to explicit CoT, while maintaining competitive performance on mathematical and logical reasoning benchmarks.
研究旨在通过引入Render-of-Thought (RoT)框架解决Chain-of-Thought (CoT)提示中的计算开销和透明度不足问题,RoT将文本推理步骤渲染成图像。RoT利用现有Vision Language Models (VLMs)的视觉编码器将视觉和文本空间对齐,实现显式的可追踪推理。实验表明,RoT实现了3-4倍的令牌压缩和显著的推理加速,同时在数学和逻辑推理基准测试中保持了竞争力。
VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing
Authors: Guanyuan Pan, Shuai Wang, Yugui Lin, Tiansheng Zhou, Pietro Liò, Yaqi Wang, Zhenxin Zhao
First: 2026-01-12T08:37:32+00:00 · Latest: 2026-01-22T11:46:08+00:00
Comments: 9 pages, 4 figures, submitted to the 10th International Conference on Control, Automation and Diagnosis (ICCAD'26)
Abstract
Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches rely solely on netlists, ignoring the circuit schematic, which hinders the cognitive link between the schematic and its performance. Furthermore, the black-box nature of machine learning methods and hallucination risks in large language models fail to provide the necessary ground-truth explainability required for industrial sign-off. To address these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing, and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-start from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experimental results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance while maintaining physics-based explainability. VLM-CAD meets all specification requirements while maintaining low power consumption in optimizing an amplifier with a complementary input and a class-AB output stage, with a total runtime under 66 minutes across all experiments on two amplifiers.
中文标题/摘要
标题:VLM-CAD:优化视觉语言模型协作代理设计工作流以实现模拟电路尺寸优化
模拟混合信号电路尺寸优化涉及高维设计空间中的复杂权衡。现有的自动模拟电路尺寸优化方法仅依赖于网表,忽略了电路原理图,阻碍了原理图与其性能之间的认知联系。此外,机器学习方法的黑箱性质和大型语言模型中的幻觉风险无法提供工业签收所需的必要的地面真相可解释性。为了解决这些挑战,我们提出了一种视觉语言模型优化的协作代理设计工作流(VLM-CAD),该工作流分析电路、优化直流工作点、进行基于推理的尺寸优化并执行外部尺寸优化。我们整合了Image2Net来标注电路原理图并生成结构化的JSON描述,以便视觉语言模型精确解释。此外,我们提出了一种可解释的信任区域贝叶斯优化方法(ExTuRBO),该方法采用代理生成的种子进行协作预热,并提供外部尺寸优化的双粒度灵敏度分析,支持全面的最终设计报告。使用180nm、90nm和45nm预测技术模型进行放大器尺寸优化任务的实验结果表明,VLM-CAD在保持物理基础可解释性的同时有效平衡了功率和性能。VLM-CAD在优化具有互补输入和类AB输出级的放大器时满足所有规范要求,同时保持低功耗,在两次放大器实验中总运行时间低于66分钟。
Summary / 总结
VLM-CAD is a workflow that optimizes analog circuit sizing by integrating Vision Language Models and collaborative agents. It analyzes circuits, optimizes DC operating points, and performs inference-based sizing, while using Image2Net for schematic annotation and ExTuRBO for explainable optimization. Experiments on amplifier sizing tasks with different technology models show that VLM-CAD effectively balances power and performance while maintaining physics-based explainability and low power consumption.
VLM-CAD 是一种通过结合 Vision Language Models 和协作代理优化模拟电路尺寸的工作流。它分析电路、优化 DC 工作点并执行基于推理的尺寸优化。该方法使用 Image2Net 对电路图进行注释,并使用 ExTuRBO 进行可解释的优化。实验表明,VLM-CAD 在保持物理基础的可解释性及低功耗的同时,有效平衡了功率和性能,并满足了所有放大器尺寸任务的规格要求。
MMP-A*: Multimodal Perception Enhanced Incremental Heuristic Search on Path Planning
Authors: Minh Hieu Ha, Khanh Ly Ta, Hung Phan, Tung Doan, Tung Dao, Dao Tran, Huynh Thi Thanh Binh
First: 2026-01-05T08:55:27+00:00 · Latest: 2026-01-22T10:24:37+00:00
Abstract
Autonomous path planning requires a synergy between global reasoning and geometric precision, especially in complex or cluttered environments. While classical A* is valued for its optimality, it incurs prohibitive computational and memory costs in large-scale scenarios. Recent attempts to mitigate these limitations by using Large Language Models for waypoint guidance remain insufficient, as they rely only on text-based reasoning without spatial grounding. As a result, such models often produce incorrect waypoints in topologically complex environments with dead ends, and lack the perceptual capacity to interpret ambiguous physical boundaries. These inconsistencies lead to costly corrective expansions and undermine the intended computational efficiency. We introduce MMP-A*, a multimodal framework that integrates the spatial grounding capabilities of vision-language models with a novel adaptive decay mechanism. By anchoring high-level reasoning in physical geometry, the framework produces coherent waypoint guidance that addresses the limitations of text-only planners. The adaptive decay mechanism dynamically regulates the influence of uncertain waypoints within the heuristic, ensuring geometric validity while substantially reducing memory overhead. To evaluate robustness, we test the framework in challenging environments characterized by severe clutter and topological complexity. Experimental results show that MMP-A* achieves near-optimal trajectories with significantly reduced operational costs, demonstrating its potential as a perception-grounded and computationally efficient paradigm for autonomous navigation.
中文标题/摘要
标题:MMP-A*: 多模态感知增强的增量启发式搜索路径规划
自主路径规划需要全局推理与几何精度之间的协同作用,尤其是在复杂或拥挤的环境中。虽然经典的A*因其最优性而受到重视,但在大规模场景中会带来巨大的计算和内存成本。最近通过使用大型语言模型进行航点指导来缓解这些限制的努力仍然不足,因为它们仅依赖于基于文本的推理而缺乏空间定位能力。因此,这些模型在拓扑复杂且有死胡同的环境中经常生成错误的航点,并缺乏感知能力来解释模糊的物理边界。这些不一致导致昂贵的修正扩展,并削弱了预期的计算效率。我们引入了MMP-A*,这是一种结合了视觉语言模型的空间定位能力和新颖的自适应衰减机制的多模态框架。通过将高层次推理锚定在物理几何上,该框架生成连贯的航点指导,解决了纯文本规划器的局限性。自适应衰减机制动态调节启发式中不确定航点的影响,确保几何有效性同时大幅减少内存开销。为了评估鲁棒性,我们在严重拥挤和拓扑复杂性的环境中测试了该框架。实验结果表明,MMP-A*在显著降低操作成本的同时实现了接近最优的轨迹,展示了其作为感知导向和计算高效的自主导航范式的潜力。
Summary / 总结
The research addresses the challenge of efficient and accurate path planning in complex environments by integrating vision-language models with a novel adaptive decay mechanism. The MMP-A* framework enhances classical A* by providing spatially grounded waypoint guidance, reducing computational and memory costs. Experiments in cluttered and topologically complex environments show that MMP-A* achieves near-optimal trajectories with significantly reduced operational costs.
MMP-A* 是一种结合视觉语言模型的空间定位能力和自适应衰减机制的多模态框架,以增强 A* 路径规划。它通过生成连贯的航点并确保几何有效性来克服纯文本规划器的局限性。实验结果表明,MMP-A* 在复杂环境中实现了接近最优的轨迹,同时显著降低了计算和内存成本。
Assessing Situational and Spatial Awareness of VLMs with Synthetically Generated Video
Authors: Pascal Benschop, Justin Dauwels, Jan van Gemert
First: 2026-01-22T09:14:11+00:00 · Latest: 2026-01-22T09:14:11+00:00
Abstract
Spatial reasoning in vision language models (VLMs) remains fragile when semantics hinge on subtle temporal or geometric cues. We introduce a synthetic benchmark that probes two complementary skills: situational awareness (recognizing whether an interaction is harmful or benign) and spatial awareness (tracking who does what to whom, and reasoning about relative positions and motion). Through minimal video pairs, we test three challenges: distinguishing violence from benign activity, binding assailant roles across viewpoints, and judging fine-grained trajectory alignment. While we evaluate recent VLMs in a training-free setting, the benchmark is applicable to any video classification model. Results show performance only slightly above chance across tasks. A simple aid, stable color cues, partly reduces assailant role confusions but does not resolve the underlying weakness. By releasing data and code, we aim to provide reproducible diagnostics and seed exploration of lightweight spatial priors to complement large-scale pretraining.
中文标题/摘要
标题:基于合成生成视频评估VLMs的情境意识和空间意识
视觉语言模型(VLMs)中的空间推理在依赖微妙的时间或几何线索时仍然脆弱。我们引入了一个合成基准,测试两种互补的能力:情境意识(识别互动是否有害或无害)和空间意识(追踪谁对谁做了什么,并推理相对位置和运动)。通过最小的视频对,我们测试了三个挑战:区分暴力与良性活动、跨视角绑定攻击者角色以及判断细粒度轨迹对齐。虽然我们在无训练设置下评估了最近的VLMs,但该基准适用于任何视频分类模型。结果显示,各任务的性能仅略高于随机猜测。一个简单的辅助,稳定的颜色线索,部分减少了攻击者角色的混淆,但未能解决根本弱点。通过发布数据和代码,我们旨在提供可重复的诊断并激发对轻量级空间先验的研究,以补充大规模预训练。
Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization
Authors: Jiwei Guan, Haibo Jin, Haohan Wang
First: 2026-01-05T02:49:33+00:00 · Latest: 2026-01-22T09:09:47+00:00
Comments: EACL
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have shown groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attack methods require full model access, incur high computing costs, and exhibit insufficient adversarial transferability, making them impractical for real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation by input-output interactions without requiring model knowledge, (ii) model-agnostic optimization without a surrogate model, and (iii) lower resource requirements with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs, including InstructBLIP, LLaVA and MiniGPT-4, achieving the highest jailbreak success rate of 83.0% on InstructBLIP, while maintaining imperceptible perturbations comparable to white-box methods. Moreover, adversarial examples generated from MiniGPT-4 exhibit strong transferability to other LVLMs, with the attack success rate (ASR) reaching 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs.
中文标题/摘要
标题:使用黑盒优化构建针对大型视觉-语言模型的对抗输入
大型视觉-语言模型(LVLMs)在多种跨模态任务中展现了突破性的能力。然而,这些模型仍然容易受到对抗性越狱攻击的影响,攻击者通过施加微妙的扰动来绕过安全机制并触发有害输出。现有的白盒攻击方法需要完全访问模型,计算成本高且对抗性转移性不足,使其在实际的黑盒环境中不切实际。为了解决这些限制,我们提出了一种通过零阶优化(ZO-SPSA)使用同时扰动随机近似(Simultaneous Perturbation Stochastic Approximation)对LVLMs进行黑盒越狱攻击的方法。ZO-SPSA提供了三个关键优势:(i) 无需模型知识的输入-输出交互的无梯度近似,(ii) 无需代理模型的模型无关优化,(iii) 降低资源需求,减少GPU内存消耗。我们在三个LVLMs上评估了ZO-SPSA,包括InstructBLIP、LLaVA和MiniGPT-4,在InstructBLIP上实现了最高的越狱攻击成功率83.0%,同时保持与白盒方法相当的不可感知扰动。此外,从MiniGPT-4生成的对抗性示例在其他LVLMs上表现出强大的转移性,ASR达到64.18%。这些发现强调了黑盒越狱攻击在实际环境中的可行性,并揭示了当前LVLMs安全机制中的关键弱点。
Summary / 总结
This paper addresses the vulnerability of Large Vision-Language Models (LVLMs) to adversarial attacks by proposing a black-box jailbreak attack using Zeroth-Order optimization with Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). The method does not require model knowledge, is model-agnostic, and has lower resource requirements. Experiments on InstructBLIP, LLaVA, and MiniGPT-4 showed a jailbreak success rate of 83.0% on InstructBLIP, and adversarial examples generated from MiniGPT-4 transferred to other LVLMs with an attack success rate (ASR) of up to 64.18%.
论文提出了一种针对大型视觉-语言模型(LVLM)的黑盒越狱攻击,采用基于同时扰动随机近似(Simultaneous Perturbation Stochastic Approximation)的零阶优化方法(ZO-SPSA)。该方法无需模型知识,具有模型无关性,并且资源需求较低。在InstructBLIP、LLaVA和MiniGPT-4上的实验显示,InstructBLIP上的越狱成功率最高,达到83.0%;从MiniGPT-4生成的对抗样本迁移到其他LVLM时,攻击成功率(ASR)可达64.18%,显示出较强的迁移性。
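As a rough illustration of the zeroth-order idea in the entry above, the sketch below shows standard SPSA: a gradient estimate built from only two loss queries under a random ±1 perturbation, with no access to model internals. The toy quadratic loss, the step sizes, and all function names are assumptions made for illustration; this is not the paper's attack objective or code.

```python
import random

def spsa_gradient(loss, x, c=1e-2, rng=random):
    """Estimate the gradient of `loss` at `x` from two function evaluations."""
    # Rademacher perturbation: each coordinate is +1 or -1 with equal probability.
    delta = [rng.choice((-1.0, 1.0)) for _ in x]
    x_plus = [xi + c * di for xi, di in zip(x, delta)]
    x_minus = [xi - c * di for xi, di in zip(x, delta)]
    # One scalar difference is shared by every coordinate of the estimate.
    diff = loss(x_plus) - loss(x_minus)
    return [diff / (2.0 * c * di) for di in delta]

def spsa_minimize(loss, x0, steps=500, lr=0.05, c=1e-2, seed=0):
    """Plain gradient descent driven by SPSA estimates (query-only access)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = spsa_gradient(loss, x, c=c, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def toy_loss(x):
    # Hypothetical stand-in for a black-box objective:
    # squared distance to the point (1, -2, 3).
    target = (1.0, -2.0, 3.0)
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))

if __name__ == "__main__":
    solution = spsa_minimize(toy_loss, [0.0, 0.0, 0.0])
    print(solution, toy_loss(solution))
```

Each step costs two queries regardless of the input dimension, which is what makes this family of estimators attractive when only input-output access is available.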
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Authors: Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin
Venue: NeurIPS 2025
First: 2025-06-10T17:59:44+00:00 · Latest: 2026-01-22T08:52:35+00:00
Comments: Accepted by NeurIPS 2025 Track on Datasets and Benchmarks. Project page: https://faceong.github.io/VIKI-R/
Abstract
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.
中文标题/摘要
标题:VIKI-R:通过强化学习协调具身多智能体合作
在动态环境中协调多个具身智能体仍然是人工智能的核心挑战,需要感知驱动的推理和可扩展的合作策略。虽然最近的工作利用了大型语言模型(LLMs)进行多智能体规划,但有少数开始探索视觉语言模型(VLMs)进行视觉推理。然而,这些基于VLM的方法在支持多种具身类型方面仍然有限。在本文中,我们介绍了VIKI-Bench,这是第一个针对具身多智能体合作的分层基准,包含三个结构化层次:智能体激活、任务规划和轨迹感知。VIKI-Bench 包含多种机器人具身、多视角视觉观察和结构化的监督信号,以评估基于视觉输入的推理。为了展示VIKI-Bench 的实用性,我们提出了VIKI-R,这是一种两阶段框架,首先使用带有Chain-of-Thought注释的演示对预训练的视觉语言模型(VLM)进行微调,然后在多层次奖励信号下进行强化学习。我们的大量实验表明,VIKI-R 在所有任务层次上都显著优于基线方法。此外,我们展示了强化学习使异构智能体之间出现组合合作模式。总体而言,VIKI-Bench 和 VIKI-R 提供了一个统一的测试平台和方法,以推进具身人工智能系统中的多智能体、视觉驱动的合作。
Summary / 总结
This work addresses the challenge of coordinating multiple embodied agents in dynamic environments by introducing VIKI-Bench, a hierarchical benchmark for embodied multi-agent cooperation. VIKI-R, a two-stage framework, fine-tunes a pretrained vision-language model with Chain-of-Thought annotated demonstrations and then uses reinforcement learning with multi-level reward signals. The experiments demonstrate that VIKI-R outperforms baseline methods across all task levels and enables compositional cooperation among heterogeneous agents.
该研究通过引入VIKI-Bench,一个用于具身多智能体合作的分层基准,解决了在动态环境中协调多个具身智能体的挑战。VIKI-R是一个两阶段框架,首先使用带有推理链标注的演示数据微调预训练的视觉语言模型,然后在多级奖励信号下应用强化学习。实验表明,VIKI-R在所有任务级别上均优于基线方法,并且能够使异构智能体之间产生组合性的合作模式。
Zero-Shot Product Attribute Labeling with Vision-Language Models: A Three-Tier Evaluation Framework
Authors: Shubham Shukla, Kunal Sonalkar
Venue: WACV 2026
First: 2026-01-22T07:33:41+00:00 · Latest: 2026-01-22T07:33:41+00:00
Comments: Accepted to WACV 2026 Workshop on Physical Retail AI (PRAW)
Abstract
Fine-grained attribute prediction is essential for fashion retail applications including catalog enrichment, visual search, and recommendation systems. Vision-Language Models (VLMs) offer zero-shot prediction without task-specific training, yet their systematic evaluation on multi-attribute fashion tasks remains underexplored. A key challenge is that fashion attributes are often conditional. For example, "outer fabric" is undefined when no outer garment is visible. This requires models to detect attribute applicability before attempting classification. We introduce a three-tier evaluation framework that decomposes this challenge: (1) overall task performance across all classes (including NA class: suggesting attribute is not applicable) for all attributes, (2) attribute applicability detection, and (3) fine-grained classification when attributes are determinable. Using DeepFashion-MultiModal, which explicitly defines NA (meaning attribute doesn't exist or is not visible) within attribute label spaces, we benchmark nine VLMs spanning flagship (GPT-5, Gemini 2.5 Pro), efficient (GPT-5 Mini, Gemini 2.5 Flash), and ultra-efficient tiers (GPT-5 Nano, Gemini 2.5 Flash-Lite) against classifiers trained on pretrained Fashion-CLIP embeddings on 5,000 images across 18 attributes. Our findings reveal that: (1) zero-shot VLMs achieve 64.0% macro-F1, a threefold improvement over logistic regression on pretrained Fashion-CLIP embeddings; (2) VLMs excel at fine-grained classification (Tier 3: 70.8% F1) but struggle with applicability detection (Tier 2: 34.1% NA-F1), identifying a key bottleneck; (3) efficient models achieve over 90% of flagship performance at lower cost, offering practical deployment paths. This diagnostic framework enables practitioners to pinpoint whether errors stem from visibility detection or classification, guiding targeted improvements for production systems.
中文标题/摘要
标题:使用视觉语言模型的零样本产品属性标签化:三层评估框架
细粒度属性预测对于时尚零售应用(包括目录丰富、视觉搜索和推荐系统)至关重要。视觉语言模型(VLMs)可以在无需特定任务训练的情况下实现零样本预测,但它们在多属性时尚任务上的系统评估仍未得到充分探索。一个关键挑战是时尚属性往往是条件性的。例如,“外层织物”在看不到外衣的情况下是未定义的。这要求模型在尝试分类之前检测属性的适用性。我们引入了一个三层评估框架来分解这一挑战:(1)所有属性在所有类别(包括NA类:表示属性不适用)上的整体任务性能,(2)属性适用性检测,以及(3)当属性可确定时的细粒度分类。使用在属性标签空间中明确定义了NA(表示属性不存在或不可见)的DeepFashion-MultiModal,我们对九种VLMs(涵盖旗舰级(GPT-5, Gemini 2.5 Pro)、高效级(GPT-5 Mini, Gemini 2.5 Flash)和超高效级(GPT-5 Nano, Gemini 2.5 Flash-Lite))进行了基准测试,并在5,000张图像、18个属性上与基于预训练Fashion-CLIP嵌入训练的分类器进行比较。我们的发现表明:(1)零样本VLMs实现了64.0%的宏F1,是预训练Fashion-CLIP嵌入上逻辑回归的三倍;(2)VLMs在细粒度分类(第3级:70.8% F1)方面表现出色,但在适用性检测(第2级:34.1% NA-F1)方面存在困难,揭示了一个关键瓶颈;(3)高效模型在较低成本下实现了旗舰模型90%以上的性能,提供了实际部署路径。此诊断框架使从业者能够确定错误是源自可见性检测还是分类,从而指导生产系统的针对性改进。
Summary / 总结
The study aims to evaluate Vision-Language Models (VLMs) for zero-shot prediction of fine-grained fashion attributes, addressing the challenge of attribute applicability. A three-tier evaluation framework was developed to assess overall task performance, attribute applicability detection, and fine-grained classification. Nine VLMs, ranging from flagship to ultra-efficient models, were benchmarked against classifiers on 5,000 images across 18 attributes. Key findings include a macro-F1 score of 64.0% for zero-shot VLMs, a threefold improvement over logistic regression, and the identification of a bottleneck in applicability detection, with efficient models achieving over 90% of flagship performance at lower cost.
研究旨在评估视觉语言模型(VLMs)在零样本预测细粒度时尚属性方面的表现,解决属性适用性的问题。引入了三层评估框架来评估整体任务性能、属性适用性检测和细粒度分类。在涵盖18个属性的5,000张图像上,对从旗舰级到超高效级的九种VLMs与分类器进行了基准比较。关键发现包括:零样本VLMs达到64.0%的宏F1分数,是逻辑回归的三倍;适用性检测是瓶颈;高效模型以较低成本达到了旗舰模型90%以上的性能。
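The three tiers described in the entry above can be sketched for a single attribute as follows. This is a minimal reading of the abstract, not the paper's evaluation code: the tier definitions (macro-F1 over all classes including NA, F1 of the NA class, macro-F1 restricted to samples whose true label is not NA), the label names, and the function names are assumptions.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score of one class, computed from true/false positives and negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 over the given classes."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

def three_tier_scores(y_true, y_pred, na_label="NA"):
    # Tier 1: overall macro-F1 across every class, NA included.
    tier1 = macro_f1(y_true, y_pred, sorted(set(y_true)))
    # Tier 2: applicability detection, scored here as F1 of the NA class.
    tier2 = f1_for_class(y_true, y_pred, na_label)
    # Tier 3: fine-grained classification on samples where the attribute applies.
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != na_label]
    kept_true = [t for t, _ in kept]
    kept_pred = [p for _, p in kept]
    fine_classes = sorted(set(kept_true))
    tier3 = macro_f1(kept_true, kept_pred, fine_classes) if fine_classes else 0.0
    return {"tier1_macro_f1": tier1, "tier2_na_f1": tier2, "tier3_macro_f1": tier3}

if __name__ == "__main__":
    truth = ["cotton", "denim", "NA", "NA", "cotton", "denim"]
    preds = ["cotton", "denim", "cotton", "NA", "cotton", "cotton"]
    print(three_tier_scores(truth, preds))
```

Separating the tiers this way makes it visible whether a low overall score comes from missed NA cases or from confusion among the applicable classes.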
Agentic Uncertainty Quantification
Authors: Jiaxin Zhang, Prafulla Kumar Choubey, Kung-Hsiang Huang, Caiming Xiong, Chien-Sheng Wu
First: 2026-01-22T07:16:26+00:00 · Latest: 2026-01-22T07:16:26+00:00
Comments: 36 pages, 9 figures, 9 tables
Abstract
Although AI agents have demonstrated impressive capabilities in long-horizon reasoning, their reliability is severely hampered by the "Spiral of Hallucination," where early epistemic errors propagate irreversibly. Existing methods face a dilemma: uncertainty quantification (UQ) methods typically act as passive sensors, only diagnosing risks without addressing them, while self-reflection mechanisms suffer from continuous or aimless corrections. To bridge this gap, we propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals. Our architecture comprises two complementary mechanisms: System 1 (Uncertainty-Aware Memory, UAM), which implicitly propagates verbalized confidence and semantic explanations to prevent blind decision-making; and System 2 (Uncertainty-Aware Reflection, UAR), which utilizes these explanations as rational cues to trigger targeted inference-time resolution only when necessary. This enables the agent to balance efficient execution and deep deliberation dynamically. Extensive experiments on closed-loop benchmarks and open-ended deep research tasks demonstrate that our training-free approach achieves superior performance and trajectory-level calibration. We believe this principled framework AUQ represents a significant step towards reliable agents.
中文标题/摘要
标题:代理不确定性量化
尽管AI代理在长期推理方面表现出色,但它们的可靠性因“幻觉螺旋”而严重受损,早期的知识错误不可逆地传播。现有方法面临困境:不确定性量化(UQ)方法通常作为被动传感器,仅诊断风险而不解决问题,而自我反思机制则遭受持续或盲目修正。为弥合这一差距,我们提出了一种统一的双重过程代理不确定性量化(AUQ)框架,将口头表达的不确定性转化为双向控制信号。我们的架构包括两个互补机制:系统1(不确定性感知记忆,UAM),它隐式地传播口头表达的信心和语义解释,以防止盲目决策;系统2(不确定性感知反思,UAR),它利用这些解释作为理性的提示,在必要时触发目标推理时的解决。这使代理能够动态平衡高效执行和深入的反思。在闭环基准和开放性深度研究任务上的广泛实验表明,我们的无训练方法在性能和轨迹级校准方面表现出色。我们认为,这一原理性的框架AUQ代表了迈向可靠代理的重要一步。
Summary / 总结
The research addresses the issue of AI agents' reliability in long-horizon reasoning, particularly the 'Spiral of Hallucination' where early errors propagate irreversibly. It proposes a Dual-Process Agentic Uncertainty Quantification (AUQ) framework that transforms verbalized uncertainty into active control signals. The framework includes two mechanisms: Uncertainty-Aware Memory (UAM) for implicit propagation of confidence and explanations to prevent blind decisions, and Uncertainty-Aware Reflection (UAR) for targeted inference-time resolution. Experiments show that this approach improves performance and trajectory-level calibration without training.
研究旨在通过提出一种双重过程代理不确定性量化(AUQ)框架来解决AI代理在长期推理中的可靠性问题。该框架结合了两种机制:不确定性感知记忆(UAM)和不确定性感知反思(UAR)。UAM通过隐式传播口头表达的信心和语义解释来防止盲目决策,而UAR则利用这些解释在必要时触发目标推理时的解决。实验表明,这种方法在性能和轨迹级校准方面优于现有方法。
Multi-event Video-Text Retrieval
Authors: Gengyuan Zhang, Jisen Ren, Jindong Gu, Volker Tresp
First: 2023-08-22T16:32:46+00:00 · Latest: 2026-01-22T06:58:13+00:00
Comments: [fixed typos in equations] accepted to ICCV2023 Poster; some figures are not supported when viewed online, please download the file and view locally
Abstract
Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work characterized by using a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs has become a prominent approach for the VTR task. However, these models operate under the assumption of bijective video-text correspondences and neglect a more practical scenario where video content usually encompasses multiple events, while texts like user queries or webpage metadata tend to be specific and correspond to single events. This establishes a gap between the previous training objective and real-world applications, leading to the potential performance degradation of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval Task. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, effectively establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR.
中文标题/摘要
标题:多事件视频-文本检索
视频-文本检索(VTR)是互联网上大规模视频-文本数据时代的一项重要多模态任务。使用两流视觉-语言模型架构来学习视频-文本对的联合表示已成为VTR任务的主要方法。然而,这些模型假设视频-文本对应关系是一一对应的,并忽略了视频内容通常包含多个事件,而文本如用户查询或网页元数据通常特定于单个事件的实际情况。这导致了先前训练目标与实际应用之间的差距,使得早期模型在推理时可能性能下降。在本研究中,我们提出了多事件视频-文本检索(MeVTR)任务,以解决视频中包含多个不同事件的场景,这是传统视频-文本检索任务的一个特殊场景。我们提出了一种简单的模型Me-Retriever,该模型结合了关键事件视频表示和新的MeVTR损失函数。全面的实验表明,该简单框架在视频到文本和文本到视频任务中优于其他模型,有效地为MeVTR任务建立了稳健的基础。我们相信这项工作为未来的研究奠定了坚实的基础。代码可在https://github.com/gengyuanmax/MeVTR/ 获取。
Summary / 总结
The study addresses the Video-Text Retrieval (VTR) task by introducing the Multi-event Video-Text Retrieval (MeVTR) task, where videos contain multiple events while texts correspond to single events. The researchers propose Me-Retriever, a model that includes key event video representation and a new MeVTR loss. Experiments show that Me-Retriever outperforms other models in both Video-to-Text and Text-to-Video tasks, providing a robust baseline for MeVTR. This work fills a gap in the VTR task by handling multi-event scenarios more effectively. Code is available at https://github.com/gengyuanmax/MeVTR.
研究引入了多事件视频-文本检索(MeVTR)任务,处理视频包含多个事件而文本对应单一事件的情况。提出了一种简单模型Me-Retriever,包含关键事件视频表示和新的MeVTR损失。实验表明,Me-Retriever在视频到文本和文本到视频任务中均优于其他模型,为MeVTR任务提供了稳健的基础。这项工作旨在弥合先前模型与实际应用之间的差距,提高视频-文本检索系统的性能。
PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection
Authors: Po-Han Huang, Jeng-Lin Li, Po-Hsuan Huang, Ming-Ching Chang, Wei-Chao Chen
Venue: WACV 2026
First: 2025-09-30T06:52:08+00:00 · Latest: 2026-01-22T06:50:23+00:00
Comments: 10 pages, 5 figures. WACV 2026 (Accepted)
Abstract
Industrial anomaly detection is increasingly relying on foundation models, aiming for strong out-of-distribution generalization and rapid adaptation in real-world deployments. Notably, past studies have primarily focused on textual prompt tuning, leaving the intrinsic visual counterpart fragmented into processing steps specific to each foundation model. We aim to address this limitation by proposing a unified patch-focused framework, Patch-Exclusive Anomaly Detection (PatchEAD), enabling training-free anomaly detection that is compatible with diverse foundation models. The framework constructs visual prompting techniques, including an alignment module and foreground masking. Our experiments show superior few-shot and batch zero-shot performance compared to prior work, despite the absence of textual features. Our study further examines how backbone structure and pretrained characteristics affect patch-similarity robustness, providing actionable guidance for selecting and configuring foundation models for real-world visual inspection. These results confirm that a well-unified patch-only framework can enable quick, calibration-light deployment without the need for carefully engineered textual prompts.
中文标题/摘要
标题:PatchEAD:统一工业视觉提示框架,实现仅基于补丁的异常检测
工业异常检测越来越多地依赖于基础模型,旨在实现强大的离分布泛化和在实际部署中的快速适应。值得注意的是,以往的研究主要集中在文本提示调优上,而视觉方面的内在对应物则被分割成针对每个基础模型的具体处理步骤。我们旨在通过提出一个统一的专用于补丁的框架——Patch-Exclusive Anomaly Detection (PatchEAD),来解决这一局限性,该框架能够实现无需训练的异常检测,并兼容多种基础模型。该框架构建了视觉提示技术,包括对齐模块和前景遮罩。我们的实验表明,与先前的工作相比,尽管没有使用文本特征,但其在少量样本和批量零样本检测方面的性能更优。我们的研究进一步探讨了基础模型的结构和预训练特性如何影响补丁相似性鲁棒性,为选择和配置适用于实际视觉检查的基础模型提供了可操作的指导。这些结果证实,一个良好统一的仅补丁框架可以实现快速、校准轻量的部署,无需精心设计的文本提示。
VIOLA: Towards Video In-Context Learning with Minimal Annotations
Authors: Ryo Fujii, Hideo Saito, Ryo Hachiuma
First: 2026-01-22T00:35:30+00:00 · Latest: 2026-01-22T00:35:30+00:00
Abstract
Generalizing Multimodal Large Language Models (MLLMs) to novel video domains is essential for real-world deployment but remains challenging due to the scarcity of labeled data. While In-Context Learning (ICL) offers a training-free adaptation path, standard methods rely on large annotated pools, which are often impractical in specialized environments like industrial or surgical settings since they require the experts' annotations. To bridge this gap, we introduce VIOLA (Video In-cOntext Learning with minimal Annotation), a label-efficient framework that synergizes minimal expert supervision with abundant unlabeled data. First, to maximize the efficiency of a strict annotation budget, we propose density-uncertainty-weighted sampling. Unlike standard diversity or uncertainty strategies that risk selecting visual outliers, our method leverages density estimation to identify samples that are simultaneously diverse, representative, and informative. Second, to utilize the remaining unlabeled data without noise propagation, we construct a hybrid pool and introduce confidence-aware retrieval and confidence-aware prompting. These mechanisms explicitly model label reliability, retrieving demonstrations based on a composite score of similarity and confidence while enabling the MLLM to adaptively distinguish between verified ground truths and noisy pseudo-labels. Extensive experiments across nine diverse benchmarks using four MLLMs demonstrate that our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
中文标题/摘要
标题:VIOLA:面向最少标注的视频上下文学习
将通用多模态大型语言模型(MLLMs)推广到新的视频领域对于实际部署至关重要,但由于标注数据稀缺而充满挑战。虽然上下文学习(ICL)提供了一种无需训练的适应路径,但标准方法依赖于大规模标注数据池,这在如工业或手术等专业环境中往往是不切实际的,因为这需要专家的标注。为了解决这一问题,我们提出了VIOLA(视频上下文学习与最少标注),这是一种标签高效的框架,将最少的专家监督与大量的未标注数据相结合。首先,为了最大化严格的标注预算效率,我们提出了密度不确定性加权采样。与标准的多样性和不确定性策略不同,我们的方法利用密度估计来识别同时具有多样性和代表性且信息丰富的样本。其次,为了在不传播噪声的情况下利用剩余的未标注数据,我们构建了一个混合池,并引入了置信度感知检索和置信度感知提示。这些机制明确建模了标签的可靠性,根据相似性和置信度的复合得分检索演示,使MLLM能够自适应地区分验证的真实标签和嘈杂的伪标签。在四个MLLMs和九个不同基准上的广泛实验表明,我们的框架在低资源设置中显著优于各种基线,实现了在最少标注成本下的稳健适应。
Summary / 总结
VIOLA is a label-efficient framework for video in-context learning that minimizes the need for expert annotations. It uses density-uncertainty-weighted sampling to select diverse and informative samples and constructs a hybrid pool with confidence-aware mechanisms to handle unlabeled data. Experiments show that VIOLA outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
VIOLA 是一种标签高效的视频在上下文学习框架,旨在减少对专家标注的需求。它使用密度-不确定性加权采样来选择多样且信息丰富的样本,并通过混合池和置信度感知机制有效利用未标注数据。实验表明,VIOLA 在低资源环境中优于各种基线,实现了低成本的鲁棒适应。
MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification
Authors: Jingwei Song, Xinyu Wang, Hanbin Wang, Xiaoxuan Lei, Bill Shi, Shixin Han, Eric Yang, Xiao-Wen Chang, Lynn Ai
First: 2026-01-21T22:03:06+00:00 · Latest: 2026-01-21T22:03:06+00:00
Comments: 11 pages, 5 figures
Abstract
Speculative Decoding (SD) accelerates autoregressive large language model (LLM) inference by decoupling generation and verification. While recent methods improve draft quality by tightly coupling the drafter with the target model, the verification mechanism itself remains largely unchanged, relying on strict token-level rejection sampling. In practice, modern LLMs frequently operate in low-margin regimes where the target model exhibits weak preference among top candidates. In such cases, rejecting plausible runner-up tokens yields negligible information gain while incurring substantial rollback cost, leading to a fundamental inefficiency in verification. We propose Margin-Aware Speculative Verification, a training-free and domain-agnostic verification strategy that adapts to the target model's local decisiveness. Our method conditions verification on decision stability measured directly from the target logits and relaxes rejection only when strict verification provides minimal benefit. Importantly, the approach modifies only the verification rule and is fully compatible with existing target-coupled speculative decoding frameworks. Extensive experiments across model scales ranging from 8B to 235B demonstrate that our method delivers consistent and significant inference speedups over state-of-the-art baselines while preserving generation quality across diverse benchmarks.
中文标题/摘要
标题:MARS:通过 Margin-Aware 验证释放推测解码的潜力
推测解码(SD)通过解耦生成和验证来加速自回归大型语言模型(LLM)的推理。虽然最近的方法通过紧密耦合草稿生成者和目标模型来提高草稿质量,但验证机制本身几乎没有变化,仍然依赖于严格的令牌级拒绝采样。实际上,现代LLM经常在低边际区域运行,目标模型对顶级候选者之间表现出较弱的偏好。在这种情况下,拒绝可能的亚军令牌几乎没有信息增益,但会带来显著的回滚成本,导致验证中的基本低效。我们提出了Margin-Aware推测验证,这是一种无需训练且领域通用的验证策略,能够适应目标模型的局部决断性。该方法根据直接从目标对数概率中测量的决策稳定性进行验证,并仅在严格的验证提供最小益处时才放松拒绝。重要的是,该方法仅修改验证规则,并完全兼容现有的目标耦合推测解码框架。在从8B到235B的模型规模上进行的广泛实验表明,我们的方法在保持生成质量的同时,相对于最先进的基线方法提供了持续且显著的推理加速。
Summary / 总结
The paper introduces Margin-Aware Speculative Verification (MARS) to improve the efficiency of Speculative Decoding (SD) in autoregressive large language model inference. MARS adapts to the target model's local decisiveness by verifying based on decision stability measured from the target logits, relaxing rejection only when strict verification provides minimal benefit. Experiments show that MARS consistently and significantly speeds up inference while maintaining generation quality across various model scales.
论文提出了Margin-Aware Speculative Verification (MARS),这是一种用于Speculative Decoding (SD)的验证策略,旨在提高自回归LLM推理的效率。通过适应目标模型的局部决断性,MARS仅在严格验证提供最小益处时才放松拒绝,从而在各种模型规模下实现一致且显著的推理加速,同时不牺牲生成质量。
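To make the idea in the entry above concrete, here is a toy greedy-verification sketch: a draft token is accepted if it is the target model's top choice, or if it is the runner-up and the gap between the top two target logits is small. The acceptance rule, the threshold value, and the function names are assumptions for illustration only; the paper's actual verification rule may differ.

```python
def margin_aware_accept(target_logits, draft_token, margin_threshold=0.5):
    """Decide whether one draft token passes a relaxed, margin-based check."""
    # Rank vocabulary indices by target logit, highest first.
    ranked = sorted(range(len(target_logits)),
                    key=lambda i: target_logits[i], reverse=True)
    top1, top2 = ranked[0], ranked[1]
    if draft_token == top1:
        return True  # strict verification would accept this as well
    # Low margin means the target model is nearly indifferent between top-1 and top-2.
    margin = target_logits[top1] - target_logits[top2]
    return draft_token == top2 and margin < margin_threshold

def count_accepted(target_logits_per_step, draft_tokens, margin_threshold=0.5):
    """Return how many leading draft tokens are accepted before the first rejection."""
    accepted = 0
    for logits, token in zip(target_logits_per_step, draft_tokens):
        if not margin_aware_accept(logits, token, margin_threshold):
            break  # everything after the first rejection is rolled back
        accepted += 1
    return accepted

if __name__ == "__main__":
    steps = [
        [2.0, 0.1, -1.0],  # draft token 0 is top-1
        [1.0, 0.9, -2.0],  # draft token 1 is runner-up, margin 0.1
        [3.0, 1.0, 0.0],   # draft token 1 is runner-up, margin 2.0
    ]
    print(count_accepted(steps, [0, 1, 1]))
```

Setting `margin_threshold` to 0 recovers strict top-1 matching in this toy setup, so the threshold directly trades rollback cost against fidelity to the target model's greedy output.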
DevPrompt: Deviation-Based Prompt Learning for One-Normal Shot Image Anomaly Detection
Authors: Morteza Poudineh, Marc Lalonde
First: 2026-01-21T20:35:51+00:00 · Latest: 2026-01-21T20:35:51+00:00
Comments: 8 pages
Abstract
Few-normal shot anomaly detection (FNSAD) aims to detect abnormal regions in images using only a few normal training samples, making the task highly challenging due to limited supervision and the diversity of potential defects. Recent approaches leverage vision-language models such as CLIP with prompt-based learning to align image and text features. However, existing methods often exhibit weak discriminability between normal and abnormal prompts and lack principled scoring mechanisms for patch-level anomalies. We propose a deviation-guided prompt learning framework that integrates the semantic power of vision-language models with the statistical reliability of deviation-based scoring. Specifically, we replace fixed prompt prefixes with learnable context vectors shared across normal and abnormal prompts, while anomaly-specific suffix tokens enable class-aware alignment. To enhance separability, we introduce a deviation loss with Top-K Multiple Instance Learning (MIL), modeling patch-level features as Gaussian deviations from the normal distribution. This allows the network to assign higher anomaly scores to patches with statistically significant deviations, improving localization and interpretability. Experiments on the MVTecAD and VISA benchmarks demonstrate superior pixel-level detection performance compared to PromptAD and other baselines. Ablation studies further validate the effectiveness of learnable prompts, deviation-based scoring, and the Top-K MIL strategy.
中文标题/摘要
标题:DevPrompt:基于偏差的提示学习在少量正常样本图像异常检测中的应用
少量正常样本异常检测(FNSAD)旨在仅使用少量正常训练样本检测图像中的异常区域,由于监督有限且潜在缺陷多样,任务极具挑战性。最近的方法利用如CLIP等视觉语言模型结合提示式学习对图像和文本特征进行对齐。然而,现有方法在正常和异常提示之间的区分能力较弱,并缺乏针对块级异常的原理性评分机制。我们提出了一种基于偏差的提示学习框架,将视觉语言模型的语义能力与基于偏差的评分的统计可靠性相结合。具体而言,我们用可学习的上下文向量替换固定提示前缀,并在正常和异常提示之间共享,而异常特定的后缀标记使类别感知对齐成为可能。为了增强可分性,我们引入了基于Top-K多实例学习(MIL)的偏差损失,将块级特征建模为与正常分布的高斯偏差。这使网络能够将更高的异常评分赋予统计上显著偏差的块,从而提高定位和可解释性。在MVTecAD和VISA基准上的实验表明,与PromptAD和其他基线相比,像素级检测性能更优。消融研究进一步验证了可学习提示、基于偏差的评分和Top-K MIL策略的有效性。
Summary / 总结
DevPrompt is a deviation-based prompt learning framework designed for few-normal shot image anomaly detection, addressing the challenge of limited supervision and diverse defects. It uses learnable context vectors and anomaly-specific suffix tokens to enhance the discriminability between normal and abnormal prompts. A deviation loss with Top-K Multiple Instance Learning is introduced to model patch-level features as Gaussian deviations from the normal distribution, improving anomaly detection performance. Experiments show that DevPrompt outperforms PromptAD and other baselines on the MVTecAD and VISA benchmarks.
DevPrompt 是一种用于少量正常样本图像异常检测的偏差导向提示学习框架。它使用可学习的上下文向量和异常特定的后缀标记来增强正常和异常提示之间的可区分性。引入了基于 Top-K 多实例学习的偏差损失来将补丁级特征建模为从正常分布的高斯偏差,从而提高异常检测性能。实验表明,DevPrompt 在 MVTecAD 和 VISA 基准上的像素级检测准确性优于现有方法如 PromptAD。
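The scoring idea in the entry above (Top-K pooling of patch scores followed by a deviation from a Gaussian reference) can be sketched as below. The loss form follows the generic deviation-loss recipe of standardizing a score against samples from a standard normal prior; whether DevPrompt uses exactly this form, as well as the margin, the value of K, and all names, are assumptions.

```python
import random
import statistics

def topk_mean(patch_scores, k):
    """Top-K MIL pooling: average the K highest patch-level anomaly scores."""
    top = sorted(patch_scores, reverse=True)[:k]
    return sum(top) / len(top)

def deviation_loss(patch_scores, label, k=3, margin=5.0, n_ref=5000, seed=0):
    """Deviation-style loss for one image (label 0 = normal, 1 = anomalous)."""
    # Reference scores drawn from a standard normal prior (assumed here).
    rng = random.Random(seed)
    ref = [rng.gauss(0.0, 1.0) for _ in range(n_ref)]
    mu, sigma = statistics.fmean(ref), statistics.stdev(ref)
    # Standardized deviation of the pooled image score from the reference.
    dev = (topk_mean(patch_scores, k) - mu) / sigma
    if label == 0:
        return abs(dev)            # pull normal images toward the reference mean
    return max(0.0, margin - dev)  # push anomalies at least `margin` deviations away

if __name__ == "__main__":
    print(deviation_loss([0.0] * 6, label=0))
    print(deviation_loss([0.0, 0.0, 6.0, 6.0, 6.0, 0.0], label=1))
```

Pooling only the top K patches lets a small defective region dominate the image-level score instead of being averaged away by the many normal patches.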
CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
Authors: Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem
Venue: CVPR 2026 (submitted; under review)
First: 2026-01-21T19:19:41+00:00 · Latest: 2026-01-21T19:19:41+00:00
Comments: 31 pages, 7 figures, submitted to CVPR 2026 (under review)
Abstract
Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
中文标题/摘要
标题:CURE:基于课程指导的多任务训练以生成可靠的解剖学导向报告
医学视觉-语言模型可以自动化放射学报告的生成,但难以实现准确的视觉定位和事实一致性。现有模型经常将文本发现与视觉证据对齐不当,导致不可靠或弱定位的预测。我们提出了CURE,一种错误感知的课程学习框架,该框架在不使用额外数据的情况下提高了定位和报告质量。CURE在公共数据集上对多模态指令模型进行微调,用于短语定位、定位报告生成和解剖学定位报告生成。该方法根据模型性能动态调整采样,强调更难的样本以提高空间和文本对齐。CURE将定位准确度提高了+0.37 IoU,提升了报告质量+0.188 CXRFEScore,并减少了18.6%的幻觉。CURE是一个数据高效的框架,能够同时提高定位准确度和报告可靠性。代码可在https://github.com/PabloMessina/CURE 获取,模型权重可在https://huggingface.co/pamessina/medgemma-4b-it-cure 获取
Summary / 总结
CURE is a curriculum-guided multi-task training framework that enhances the accuracy and reliability of anatomy-grounded radiology reports by improving visual grounding and factual consistency. It fine-tunes a multimodal model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. CURE dynamically adjusts sampling based on model performance, focusing on harder samples to improve alignment. Key findings include a 0.37 IoU improvement in grounding accuracy, a 0.188 boost in CXRFEScore for report quality, and an 18.6% reduction in hallucinations. CURE is data-efficient and improves both grounding accuracy and report reliability without additional data.
CURE 是一种课程引导的多任务训练框架,通过提高视觉定位和事实一致性来增强解剖导向的放射学报告的准确性和可靠性。它使用公共数据集对多模态模型进行微调,进行短语定位、定位报告生成和解剖导向报告生成。CURE 根据模型性能动态调整采样,专注于更难的样本以提高对齐。关键发现包括 0.37 的 IoU 定位准确性的提高、CXRFEScore 报告质量的 0.188 提升以及幻觉的 18.6% 减少。CURE 是一种数据高效的框架,能够在不使用额外数据的情况下提高定位准确性和报告可靠性。
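One simple way to picture the error-aware sampling described in the entry above is to draw training items with probability proportional to their current error. The weighting rule, the floor value, the item names, and the scores below are hypothetical and only illustrate the general idea; they are not taken from the CURE code.

```python
import random

def curriculum_weights(scores, floor=0.05):
    """Map per-item scores in [0, 1] to sampling probabilities that favor hard items."""
    # Error = 1 - score; the floor keeps already-solved items from vanishing entirely.
    errors = {name: max(1.0 - s, floor) for name, s in scores.items()}
    total = sum(errors.values())
    return {name: e / total for name, e in errors.items()}

def sample_batch(scores, batch_size, seed=0):
    """Draw a batch of item names according to the error-based weights."""
    weights = curriculum_weights(scores)
    names = list(weights)
    rng = random.Random(seed)
    return rng.choices(names, weights=[weights[n] for n in names], k=batch_size)

if __name__ == "__main__":
    # Hypothetical current scores for three training items or tasks.
    current = {"item_a": 0.8, "item_b": 0.5, "item_c": 0.3}
    print(curriculum_weights(current))
    print(sample_batch(current, batch_size=8))
```

Recomputing the weights after each evaluation round shifts the sampling toward whatever the model currently handles worst.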
Towards Understanding Best Practices for Quantization of Vision-Language Models
Authors: Gautom Das, Vincent La, Ethan Lau, Abhinav Shrivastava, Matthew Gwilliam
First: 2026-01-21T18:59:51+00:00 · Latest: 2026-01-21T18:59:51+00:00
Comments: 15 pages, 12 figures, 1 table
Abstract
Large language models (LLMs) deliver impressive results for a variety of tasks, but state-of-the-art systems require fast GPUs with large amounts of memory. To reduce both the memory and latency of these systems, practitioners quantize their learned parameters, typically at half precision. A growing body of research focuses on preserving the model performance with more aggressive bit widths, and some work has been done to apply these strategies to other models, like vision transformers. In our study we investigate how a variety of quantization methods, including state-of-the-art GPTQ and AWQ, can be applied effectively to multimodal pipelines comprised of vision models, language models, and their connectors. We address how performance on captioning, retrieval, and question answering can be affected by bit width, quantization method, and which portion of the pipeline the quantization is used for. Results reveal that ViT and LLM exhibit comparable importance in model performance, despite significant differences in parameter size, and that lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw). These findings provide practical insights for efficient deployment of MLLMs and highlight the value of exploration for understanding component sensitivities in multimodal models. Our code is available at https://github.com/gautomdas/mmq.
中文标题/摘要
标题:理解视觉-语言模型量化最佳实践
大型语言模型(LLMs)在各种任务中表现出色,但最先进的系统需要快速的GPU和大量的内存。为了减少这些系统的内存和延迟,从业者通常会将它们的学习参数量化为半精度。越来越多的研究集中在使用更激进的位宽来保持模型性能,并且已经有一些工作将这些策略应用于其他模型,如视觉变换器。在我们的研究中,我们探讨了如何有效地将包括最先进的GPTQ和AWQ在内的各种量化方法应用于由视觉模型、语言模型及其连接器组成的多模态管道。我们研究了位宽、量化方法以及量化在管道中的使用位置如何影响字幕生成、检索和问答任务的性能。结果表明,尽管参数规模存在显著差异,ViT和LLM在模型性能中具有相当的重要性,并且LLM的低位量化可以在减少每个权重位数(bpw)的情况下实现高精度。这些发现为高效部署多模态大语言模型提供了实用见解,并突显了探索多模态模型组件敏感性的价值。我们的代码可在 https://github.com/gautomdas/mmq 获取。
Summary / 总结
This study investigates the application of various quantization methods, including GPTQ and AWQ, to multimodal pipelines involving vision transformers and large language models. The research aims to understand how different bit widths and quantization techniques impact performance in tasks such as captioning, retrieval, and question answering. Key findings include the comparable importance of ViT and LLMs in model performance despite their size differences, and the effectiveness of lower-bit quantization of LLMs in achieving high accuracy with reduced memory usage.
研究探讨了GPTQ和AWQ等不同量化方法在包含视觉模型、语言模型及其连接器的多模态管道中的应用。研究旨在了解不同位宽和量化技术对诸如图像字幕、检索和问答等任务性能的影响。关键发现表明,视觉变换器(ViT)和大型语言模型(LLMs)对于模型性能都至关重要,且LLMs即使在较低位量化的情况下也能实现高精度,从而减少每个权重的位数。这些发现为高效部署多模态模型提供了实用指导。
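To make the bit-width trade-off discussed in this entry concrete, here is a minimal sketch of plain round-to-nearest symmetric quantization. It is deliberately not GPTQ or AWQ, both of which use calibration data and more elaborate error compensation; the weight values and function names below are made up for illustration.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric uniform quantization.

    Returns the dequantized values so they can be compared with the originals.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for w in weights:
        q = round(w / scale)              # nearest integer level
        q = max(-qmax, min(qmax, q))      # clamp to the representable range
        dequantized.append(q * scale)
    return dequantized

def mean_abs_error(original, approx):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(original, approx)) / len(original)

# Illustrative weights only; real layers hold millions of values.
weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.6, -0.12, 0.0]
errors_by_bits = {
    bits: mean_abs_error(weights, quantize_symmetric(weights, bits))
    for bits in (8, 4, 2)
}
```

On this toy vector the reconstruction error grows as the bit width shrinks. Error growth of this kind is what methods such as GPTQ and AWQ aim to limit at low bits per weight.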
Iterative Refinement Improves Compositional Image Generation
Authors: Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj, Zheyang Qin, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak
First: 2026-01-21T18:59:40+00:00 · Latest: 2026-01-21T18:59:40+00:00
Comments: Project webpage: https://iterative-img-gen.github.io/
Abstract
Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/
中文标题/摘要
标题:迭代细化提高组合图像生成
文本到图像(T2I)模型已经取得了显著的进步,但仍难以处理复杂的提示,这些提示需要同时处理多个对象、关系和属性。现有的推理时策略,如并行采样带验证器或简单增加去噪步骤,可以改善提示对齐,但在许多约束必须满足的丰富组合场景中仍然不足。受大型语言模型中链式思考推理成功的启发,我们提出了一种迭代的测试时策略,在该策略中,T2I模型在多个步骤中逐步细化其生成,受到循环中的视觉语言模型作为批评者的反馈指导。我们的方法简单,不需要外部工具或先验知识,并且可以灵活应用于各种图像生成器和视觉语言模型。实验证明,我们的方法在基准测试中的一致改进:在ConceptMix(k=7)上提高了16.9%的全正确率,在T2I-CompBench(3D-空间类别)上提高了13.8%,在视觉积木场景分解上提高了12.5%,与计算匹配的并行采样相比。除了定量的改进,迭代细化通过将复杂提示分解为顺序修正,生成更忠实的图像,人类评估者中有58.7%的人更偏好我们的方法,而并行基线为41.3%。这些发现共同强调了迭代自我修正作为组合图像生成的广泛适用原则。结果和可视化可在https://iterative-img-gen.github.io/获取
Summary / 总结
The paper addresses the challenge of generating complex images from text prompts by proposing an iterative refinement strategy. This method involves a text-to-image model refining its output across multiple steps based on feedback from a vision-language model. The approach shows consistent improvements across various benchmarks, with a 16.9% increase in the all-correct rate on ConceptMix (k=7), and higher fidelity generations preferred by human evaluators. The method is simple, does not require external tools, and can be applied to a wide range of image generators and vision-language models.
论文解决了从文本提示生成复杂图像的挑战,现有方法在处理多个对象和属性时存在困难。研究提出了一种迭代细化策略,该策略使文本到图像模型在多步中逐步细化输出,并由视觉语言模型提供反馈。这种方法在各种基准测试中取得了显著的性能提升,在准确性和对提示的忠实度方面表现更好,人类评估者中有58.7%的人更偏好迭代方法而非并行基线。
Improving MoE Compute Efficiency by Composing Weight and Data Sparsity
Authors: Maciej Kilian, Oleg Mkrtchyan, Luke Zettlemoyer, Akshat Shrivastava, Armen Aghajanyan
First: 2026-01-21T18:53:58+00:00 · Latest: 2026-01-21T18:53:58+00:00
Abstract
Mixture-of-Experts layers achieve compute efficiency through weight sparsity: each token activates only a subset of experts. Data sparsity, where each expert processes only a subset of tokens, offers a complementary axis. Expert-choice routing implements data sparsity directly but violates causality in autoregressive models, creating train-inference mismatch. We recover data sparsity within causal token-choice MoE by leveraging zero-compute (null) experts within the routing pool. When a token routes to null experts, those slots consume no compute. The standard load balancing objective trains the model to uniformly use all experts (real and null), thereby creating data sparsity in expectation without the causality violations. We evaluate on vision-language model training, where data heterogeneity is pronounced: vision encoders produce many low-information tokens while text tokens are denser. At matched expected FLOPs, composing weight and data sparsity yields a more compute-efficient frontier than weight sparsity alone, with gains in training loss and downstream performance. The model learns implicit modality-aware allocation, routing vision tokens to null experts more aggressively than text, without explicit modality routing.
中文标题/摘要
标题:通过组合权重和数据稀疏性提高专家混合层的计算效率
专家混合层通过权重稀疏性实现计算效率:每个标记仅激活专家子集。数据稀疏性,其中每个专家仅处理标记子集,提供了互补的维度。专家选择路由直接实现数据稀疏性,但在自回归模型中违反了因果性,导致训练与推理不匹配。我们通过在路由池中利用零计算(空)专家来在因果标记选择的专家混合层中恢复数据稀疏性。当标记路由到空专家时,这些槽位不消耗计算。标准负载均衡目标训练模型均匀使用所有专家(真实和空的),因此在期望中创建数据稀疏性而不违反因果性。我们在视觉-语言模型训练中进行评估,其中数据异质性明显:视觉编码器产生许多低信息量标记,而文本标记更密集。在匹配的预期FLOPs下,组合权重和数据稀疏性比单独使用权重稀疏性提供了更高效的计算边界,训练损失和下游性能都有所提升。模型学习隐式的模态感知分配,更积极地将视觉标记路由到空专家,而无需显式的模态路由。
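The "matched expected FLOPs" comparison in this entry can be made concrete with a little arithmetic. The sketch below assumes perfectly uniform routing over all experts and that each token picks `top_k` distinct experts; under those assumptions the expected number of real (compute-consuming) experts per token is the hypergeometric mean. The function name and the expert counts are illustrative and not taken from the paper.

```python
def expected_real_experts(num_real, num_null, top_k):
    """Expected number of real experts activated per token.

    Assumes each token selects `top_k` distinct experts uniformly at random
    from a pool of `num_real + num_null` experts (hypergeometric mean).
    """
    total = num_real + num_null
    if not 0 < top_k <= total:
        raise ValueError("top_k must be between 1 and the pool size")
    return top_k * num_real / total

# Weight sparsity only: 8 real experts, top-2 routing.
baseline = expected_real_experts(num_real=8, num_null=0, top_k=2)

# Weight plus data sparsity: same 8 real experts, 8 null experts, top-4 routing.
with_null = expected_real_experts(num_real=8, num_null=8, top_k=4)
```

Both configurations activate two real experts per token in expectation, so they are compute-matched on average. With null experts, however, an individual token can use anywhere from zero to four real experts, which is the per-token flexibility the abstract describes as data sparsity.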
PROGRESSLM: Towards Progress Reasoning in Vision-Language Models
Authors: Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu
First: 2026-01-21T17:56:59+00:00 · Latest: 2026-01-21T17:56:59+00:00
Comments: Website: https://progresslm.github.io/ProgressLM/
Abstract
Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.
中文标题/摘要
标题:PROGRESSLM:迈向视觉语言模型中的进度推理
估计任务进度需要对长期动态进行推理,而不仅仅是识别静态视觉内容。尽管现代视觉语言模型(VLMs)在描述可见内容方面表现出色,但尚不清楚它们是否能够从部分观察中推断出任务的进展情况。为此,我们引入了Progress-Bench,用于系统评估VLMs的进度推理能力。除了基准测试外,我们还通过无训练提示和基于精心策划的数据集ProgressLM-45K的训练方法,进一步探索了灵感来源于人类的两阶段进度推理范式。在14个VLMs上的实验表明,大多数模型尚未准备好进行任务进度估计,表现出对演示模态和视角变化的敏感性,以及对无法回答的情况处理不佳。虽然无训练提示强制结构化的进度推理仅能带来有限且模型依赖的收益,但基于训练的ProgressLM-3B即使在小型模型规模下也能实现一致的改进,尽管其训练任务集与评估任务集完全不重叠。进一步的分析揭示了特征错误模式,并阐明了进度推理何时以及为何成功或失败。
Summary / 总结
The research aims to evaluate the ability of Vision-Language Models (VLMs) to estimate task progress, which involves reasoning over long-horizon dynamics rather than recognizing static visual content. The study introduces Progress-Bench, a benchmark for evaluating progress reasoning in VLMs, and explores a human-inspired two-stage paradigm through both training-free prompting and a training-based approach using the ProgressLM-45K dataset. Experiments on 14 VLMs reveal that most models struggle with task progress estimation, showing sensitivity to demonstration modality and viewpoint changes, and poor handling of unanswerable cases. While training-free prompting provides limited and model-dependent gains, the training-based ProgressLM-3B model achieves consistent improvements even at a small model scale.
研究旨在评估视觉语言模型(VLMs)估计任务进度的能力,这需要对长期动态进行推理,而不是识别静态视觉内容。研究引入了Progress-Bench,用于评估VLMs的进度推理能力,并通过免训练提示和基于ProgressLM-45K数据集的训练方法探索了一种受人类启发的两阶段推理范式。实验结果显示,大多数模型在任务进度估计方面存在困难,表现出对演示模态和视角变化的敏感性,以及对无法回答的情况处理不佳。虽然免训练提示提供了有限且模型依赖的改进,但基于训练的ProgressLM-3B模型即使在小型模型规模下也能实现一致的改进。
CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation
Authors: V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin
First: 2025-12-23T13:44:41+00:00 · Latest: 2026-01-21T16:42:28+00:00
Comments: 37 pages, 42 figures
Abstract
Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of **thinking** based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free and model-agnostic framework for multimodal image generation. CRAFT transforms a user prompt into a set of explicit, dependency-structured visual constraints, verifies generated images using a vision-language model, and performs targeted prompt updates only when specific constraints are violated. This iterative process includes an explicit stopping criterion, resulting in an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.
中文标题/摘要
标题:CRAFT:连续推理和自主反馈调优的多模态文本到图像生成
近期研究表明,在不重新训练的情况下,推理时的推理和反思可以提高文本到图像的生成效果。然而,现有方法往往依赖于隐式的、整体的批评或不受限制的提示重写,这使得它们的行为难以解释、控制或可靠地停止。相比之下,大型语言模型得益于基于验证、目标修正和早期停止的明确、结构化的**思考**形式。我们提出了CRAFT(连续推理和自主反馈调优),这是一种无需训练且模型无关的多模态图像生成框架。CRAFT 将用户提示转换为一组明确的、依赖结构化的视觉约束,使用视觉语言模型验证生成的图像,并仅在特定约束被违反时进行有针对性的提示更新。这一迭代过程包括一个明确的停止标准,从而形成一个可解释且可控的推理时细化循环。在多个模型家族和具有挑战性的基准测试中,CRAFT 一致地提高了组合准确性、文本呈现和基于偏好的评估,特别是在轻量级生成器方面取得了显著的改进。重要的是,这些改进仅带来了微不足道的推理时开销,使得较小或更便宜的模型能够接近更昂贵系统的质量。我们的结果表明,明确结构化的、基于约束的推理时推理是提高多模态生成模型可靠性的关键成分。
Summary / 总结
CRAFT is a training-free and model-agnostic framework that transforms user prompts into explicit visual constraints, verifies generated images using a vision-language model, and updates prompts only when constraints are violated. This iterative process, with an explicit stopping criterion, leads to improved compositional accuracy, text rendering, and preference-based evaluations, especially for lightweight generators, with minimal inference-time overhead.
CRAFT 是一个无需训练且模型无关的框架,将用户提示转换为显式的视觉约束,使用视觉语言模型验证图像,并仅在约束被违反时更新提示。这一迭代过程包含明确的停止标准,提高了组合准确性、文本渲染和基于偏好的评估,尤其是对于轻量级生成器,且几乎不增加推理时间开销。
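The verify-then-update loop described in the CRAFT abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: constraints are treated as a flat list (the dependency structure mentioned in the abstract is not modeled), and `generate`, `verify`, and `update` are hypothetical callables standing in for an image generator, a vision-language verifier, and a prompt editor.

```python
def refine(prompt, constraints, generate, verify, update, max_rounds=4):
    """Iteratively regenerate until all constraints pass or the budget runs out.

    Returns the last generated image and a history of (prompt, violated) pairs.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    history = []
    for _ in range(max_rounds):
        image = generate(prompt)
        violated = [c for c in constraints if not verify(image, c)]
        history.append((prompt, violated))
        if not violated:          # explicit stopping criterion
            break
        prompt = update(prompt, violated)  # targeted update, only on violation
    return image, history

# Toy stand-ins so the loop can run without any model.
def toy_generate(prompt):
    return prompt                 # the "image" is just the prompt text

def toy_verify(image, constraint):
    return constraint in image    # a constraint holds if its text appears

def toy_update(prompt, violated):
    return prompt + ", " + violated[0]  # address one violation per round

image, history = refine(
    "a cat", ["red hat", "blue ball"], toy_generate, toy_verify, toy_update
)
```

With the toy stand-ins, the loop fixes one violated constraint per round and stops on the third round, when no violations remain. In a real system, each round costs one generation call plus one verification call per constraint.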
Training-Free and Interpretable Hateful Video Detection via Multi-stage Adversarial Reasoning
Authors: Shuonan Yang, Yuchen Zhang, Zeyu Fu
Venue: ICASSP 2026
First: 2026-01-21T15:52:26+00:00 · Latest: 2026-01-21T15:52:26+00:00
Comments: Accepted at ICASSP 2026. © 2026 IEEE. This is the author accepted manuscript. The final published version will be available via IEEE Xplore
Abstract
Hateful videos pose serious risks by amplifying discrimination, inciting violence, and undermining online safety. Existing training-based hateful video detection methods are constrained by limited training data and lack of interpretability, while directly prompting large vision-language models often struggles to deliver reliable hate detection. To address these challenges, this paper introduces MARS, a training-free Multi-stage Adversarial ReaSoning framework that enables reliable and interpretable hateful content detection. MARS begins with the objective description of video content, establishing a neutral foundation for subsequent analysis. Building on this, it develops evidence-based reasoning that supports potential hateful interpretations, while in parallel incorporating counter-evidence reasoning to capture plausible non-hateful perspectives. Finally, these perspectives are synthesized into a conclusive and explainable decision. Extensive evaluation on two real-world datasets shows that MARS achieves up to 10% improvement under certain backbones and settings compared to other training-free approaches and outperforms state-of-the-art training-based methods on one dataset. In addition, MARS produces human-understandable justifications, thereby supporting compliance oversight and enhancing the transparency of content moderation workflows. The code is available at https://github.com/Multimodal-Intelligence-Lab-MIL/MARS.
中文标题/摘要
标题:基于多阶段对抗推理的无训练可解释仇恨视频检测
仇恨视频通过放大歧视、煽动暴力和破坏在线安全等方式带来严重风险。现有的基于训练的仇恨视频检测方法受限于训练数据有限且缺乏可解释性,而直接对大型视觉-语言模型进行提示往往难以提供可靠的仇恨检测。为解决这些挑战,本文提出了一种无训练的多阶段对抗推理框架MARS,以实现可靠且可解释的仇恨内容检测。MARS从客观描述视频内容开始,建立后续分析的中立基础。在此基础上,它发展了基于证据的推理,支持潜在的仇恨解释,同时并行地纳入反证据推理以捕捉可能的非仇恨视角。最后,这些视角被综合成一个明确且可解释的决策。在两个真实世界数据集上的广泛评估表明,在某些骨干网络和设置下,MARS比其他无训练方法最高可提升10%,并在其中一个数据集上优于最先进的基于训练的方法。此外,MARS生成了人类可理解的解释,从而支持合规监督并增强内容审核流程的透明度。代码可在 https://github.com/Multimodal-Intelligence-Lab-MIL/MARS 获取。
Summary / 总结
This paper addresses the challenges of detecting hateful videos by introducing MARS, a training-free Multi-stage Adversarial ReaSoning framework. MARS starts with an objective description of the video content, then develops evidence-based reasoning for potential hateful interpretations while incorporating counter-evidence to capture non-hateful perspectives. The framework synthesizes these perspectives into a conclusive and explainable decision. Experiments on two real-world datasets show that MARS improves on other training-free approaches by up to 10% under certain backbones and settings and outperforms state-of-the-art training-based methods on one dataset, while providing human-understandable justifications for content moderation.
本文提出了一种名为MARS的无训练且可解释的框架,以解决检测仇恨视频的挑战。MARS从视频内容的中立描述开始,然后使用多阶段对抗推理来发展基于证据的推理和反证据推理,最终得出一个明确且可解释的决策。在两个真实世界数据集上的评估表明,在某些骨干网络和设置下,该框架比其他无训练方法最高可提升10%,并在其中一个数据集上优于最先进的基于训练的方法。此外,MARS提供了易于理解的解释,增强了内容审核工作流程的透明度。
Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization
Authors: Rui Chen, Bin Liu, Changtao Miao, Xinghao Wang, Yi Li, Tao Gong, Qi Chu, Nenghai Yu
First: 2025-10-11T08:42:31+00:00 · Latest: 2026-01-21T15:39:57+00:00
Comments: This version was uploaded in error and contains misleading information found in an early draft. The manuscript requires extensive and long-term revisions
Abstract
Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates an objectified rule construction with adaptive filtering to build a reliable knowledge base and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained forensics results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.
中文标题/摘要
标题:无需训练的上下文法医链用于图像篡改检测与定位
图像篡改技术的进步带来了严重的安全威胁,突显了有效图像篡改定位(IML)的必要性。虽然监督IML能够取得优异性能,但它依赖于昂贵的像素级注释。现有的弱监督或无需训练的替代方案往往表现不佳且缺乏可解释性。我们提出了一种无需训练的框架——上下文法医链(ICFC),该框架利用多模态大型语言模型(MLLMs)进行可解释的IML任务。ICFC 结合了对象化规则构建与自适应过滤,构建了一个可靠的知识库,并采用多步骤渐进推理管道,模拟专家法医工作流程,从粗略提案到精细的法医结果。此设计使MLLM推理在图像级分类、像素级定位和文本级可解释性方面的系统利用成为可能。在多个基准测试中,ICFC 不仅超越了最先进的无需训练方法,而且在弱监督和完全监督方法方面也取得了竞争性或更优的性能。
Summary / 总结
The research aims to address the security threats posed by image tampering by developing a training-free framework for image manipulation localization (IML). The In-Context Forensic Chain (ICFC) leverages multi-modal large language models to construct a reliable knowledge base and a multi-step reasoning pipeline, enabling systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. The ICFC outperforms existing training-free methods and achieves competitive or superior performance compared to weakly and fully supervised approaches across multiple benchmarks.
论文提出了一种名为In-Context Forensic Chain (ICFC)的无需训练的框架,利用多模态大型语言模型构建知识库和推理管道,以系统地利用MLLM推理进行图像级分类、像素级定位和文本级解释。ICFC在多个基准测试中不仅超越了现有的无需训练的方法,而且与弱监督和全监督方法相比也达到了相当或更优的性能。