VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
Authors: Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, XuDong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez
First: 2026-01-23T18:43:34+00:00 · Latest: 2026-01-23T18:43:34+00:00
Comments: Project page: https://visgym.github.io/
Abstract
Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.
中文标题/摘要
标题:VisGym:多模态代理的多样化、可定制、可扩展环境
现代视觉-语言模型(VLMs)在多步骤视觉交互中仍然缺乏充分的表征,特别是在它们如何在长时程内整合感知、记忆和行动方面。我们引入了VisGym,这是一个包含17个环境的测试场,用于评估和训练VLMs。该套件涵盖了符号谜题、真实图像理解、导航和操作,并提供了对难度、输入表示、规划时程和反馈的灵活控制。我们还提供了多步骤求解器,生成结构化的演示,以实现监督微调。我们的评估表明,所有前沿模型在交互设置中都面临挑战,在简单(46.6%)和困难(26.0%)配置中成功率都很低。我们的实验揭示了一些显著的局限性:模型难以有效利用长上下文,在无界历史记录下表现不如在截断窗口下。此外,我们发现,一旦以视觉形式呈现,几种基于文本的符号任务变得显著更难。然而,在部分可观测或未知动力学设置中,通过明确的目标观察、文本反馈和探索性演示进行监督微调可以实现一致的改进,突显了多步骤视觉决策的具体失败模式和改进路径。代码、数据和模型可在:https://visgym.github.io/ 获取。
Summary / 总结
VisGym is a suite of 17 environments for evaluating and training Vision-Language Models (VLMs) in multi-step visual interactions, spanning symbolic puzzles, real-image understanding, navigation, and manipulation. It offers flexible controls over difficulty, input representation, planning horizon, and feedback. Experiments show that frontier models struggle in interactive settings, with low success rates in both the easy (46.6%) and hard (26.0%) configurations. Models fail to leverage long context effectively, performing worse with an unbounded history than with truncated windows. However, explicit goal observations, textual feedback, and supervised finetuning on exploratory demonstrations in partially observable or unknown-dynamics settings yield consistent gains.
VisGym 旨在评估和训练视觉-语言模型(VLMs)在多步视觉交互中的表现,涵盖符号谜题、真实图像理解、导航和操作等多种任务。它提供了对难度、输入表示、规划时间范围和反馈的灵活控制。实验显示,当前模型在交互式设置中的表现较差,在简单配置中成功率也很低。研究还指出,模型在处理长上下文时存在困难,而明确的目标观察、文本反馈和在部分可观测或未知动力学设置中的探索性演示可以提高性能。
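The contrast between unbounded and truncated history can be made concrete with a small rollout loop. The sketch below is only an illustration under stated assumptions: `CountdownEnv`, `run_episode`, and `always_dec` are hypothetical names, not part of the VisGym API, and the toy task stands in for a real visual environment.

```python
from collections import deque


class CountdownEnv:
    """Toy stand-in for a multi-step environment (hypothetical, not a VisGym task)."""

    def __init__(self, start=3):
        self.start = start
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        # "dec" moves toward the goal state 0; any other action is a no-op.
        if action == "dec":
            self.state -= 1
        feedback = "ok" if action == "dec" else "invalid action"
        done = self.state == 0
        return self.state, feedback, done


def run_episode(env, policy, history_window=None, max_steps=10):
    """Roll out a policy; history_window=None keeps the full (unbounded) history."""
    history = deque(maxlen=history_window)
    obs = env.reset()
    for step in range(max_steps):
        # The policy sees the current observation plus the retained history only.
        action = policy(obs, list(history))
        next_obs, feedback, done = env.step(action)
        history.append((obs, action, feedback))
        obs = next_obs
        if done:
            return True, step + 1, list(history)
    return False, max_steps, list(history)


def always_dec(obs, history):
    # Trivial policy used only to exercise the loop.
    return "dec"


success, steps, kept = run_episode(CountdownEnv(start=3), always_dec, history_window=2)
```

Passing `history_window=None` reproduces the unbounded setting, so the same loop can compare both regimes on one policy.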
LoL: Longer than Longer, Scaling Video Generation to Hour
Authors: Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, Cho-Jui Hsieh
First: 2026-01-23T17:21:35+00:00 · Latest: 2026-01-23T17:21:35+00:00
Comments: preprint
Abstract
Recent research in long-form video generation has shifted from bidirectional to autoregressive models, yet these methods commonly suffer from error accumulation and a loss of long-term coherence. While attention sink frames have been introduced to mitigate this performance decay, they often induce a critical failure mode we term sink-collapse: the generated content repeatedly reverts to the sink frame, resulting in abrupt scene resets and cyclic motion patterns. Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. Extensive experiments show that our method successfully alleviates sink-collapse while preserving generation quality. To the best of our knowledge, this work achieves the first demonstration of real-time, streaming, and infinite-length video generation with little quality decay. As an illustration of this robustness, we generate continuous videos up to 12 hours in length, which, to our knowledge, is among the longest publicly demonstrated results in streaming video generation.
中文标题/摘要
标题:LoL:更长的视频生成,扩展到一小时
近期长视频生成的研究从双向模型转向了自回归模型,但这些方法通常会遭受错误累积和长期连贯性丧失的问题。虽然已经引入了注意力汇流帧来缓解这种性能衰减,但它们往往会导致我们称之为汇流崩溃的关键失败模式:生成的内容反复回到汇流帧,导致场景突然重置和循环运动模式。我们的分析表明,汇流崩溃源于旋转位置嵌入(RoPE)的周期结构与当前生成模型中普遍存在的多头注意力机制之间的固有冲突。为了解决这个问题,我们提出了一种轻量级、无需训练的方法,通过引入多头RoPE抖动来有效抑制这种行为,打破多头之间的注意力同质化,缓解长期崩溃。大量实验表明,我们的方法成功缓解了汇流崩溃,同时保持了生成质量。据我们所知,这项工作实现了实时、流式和无限长度视频生成的第一个演示,几乎没有质量衰减。作为这一鲁棒性的示例,我们生成了长达12小时的连续视频,据我们所知,这是已公开演示的最长流式视频生成结果之一。
Summary / 总结
The research addresses error accumulation and loss of long-term coherence in autoregressive long-form video generation. The authors propose multi-head RoPE jitter, a lightweight, training-free approach that mitigates sink-collapse, a failure mode they attribute to a conflict between the periodic structure of Rotary Position Embedding and multi-head attention mechanisms. Experiments show that the method alleviates sink-collapse while maintaining generation quality, enabling real-time, streaming video generation with little quality decay, demonstrated on continuous videos up to 12 hours long.
该论文解决了长视频生成中错误累积和长期连贯性丧失的问题,特别是在自回归模型中。它提出了一种轻量级、无需训练的方法——多头RoPE抖动,以缓解生成内容反复回到基帧的sink-collapse问题。实验表明,该方法成功缓解了sink-collapse现象,同时保持了视频质量,实现了实时、流式和无限长度的视频生成,质量衰减极小,最长可达12小时。
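The abstract does not spell out how the jitter is applied, so the following is a minimal sketch under one assumption: each attention head gets its own small multiplicative perturbation of the standard RoPE frequencies, so that heads no longer share an identical periodic structure. The function names and the `jitter` parameter are hypothetical.

```python
import math
import random


def rope_frequencies(head_dim, base=10000.0):
    # Standard RoPE: one frequency per pair of channels.
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def jittered_frequencies(head_dim, num_heads, jitter=0.05, seed=0):
    # Assumption: each head scales all of its frequencies by (1 + eps_h),
    # with eps_h drawn uniformly from [-jitter, jitter].
    rng = random.Random(seed)
    freqs = rope_frequencies(head_dim)
    per_head = []
    for _ in range(num_heads):
        scale = 1.0 + rng.uniform(-jitter, jitter)
        per_head.append([f * scale for f in freqs])
    return per_head


def apply_rope(vec, position, freqs):
    # Rotate each channel pair (x, y) by the angle position * frequency.
    out = []
    for i, f in enumerate(freqs):
        angle = position * f
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


head_freqs = jittered_frequencies(head_dim=4, num_heads=4)
rotated = apply_rope([1.0, 2.0, 3.0, 4.0], position=7, freqs=head_freqs[0])
```

Because the change only touches the position frequencies, it needs no retraining, which is consistent with the training-free claim; the actual jitter scheme in the paper may differ.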
Evaluating Large Vision-language Models for Surgical Tool Detection
Authors: Nakul Poudel, Richard Simon, Cristian A. Linte
First: 2026-01-23T17:00:46+00:00 · Latest: 2026-01-23T17:00:46+00:00
Abstract
Surgery is a highly complex process, and artificial intelligence has emerged as a transformative force in supporting surgical guidance and decision-making. However, the unimodal nature of most current AI systems limits their ability to achieve a holistic understanding of surgical workflows. This highlights the need for general-purpose surgical AI systems capable of comprehensively modeling the interrelated components of surgical scenes. Recent advances in large vision-language models that integrate multimodal data processing offer strong potential for modeling surgical tasks and providing human-like scene reasoning and understanding. Despite their promise, systematic investigations of VLMs in surgical applications remain limited. In this study, we evaluate the effectiveness of large VLMs for the fundamental surgical vision task of detecting surgical tools. Specifically, we investigate three state-of-the-art VLMs, Qwen2.5, LLaVA1.5, and InternVL3.5, on the GraSP robotic surgery dataset under both zero-shot and parameter-efficient LoRA fine-tuning settings. Our results demonstrate that Qwen2.5 consistently achieves superior detection performance in both configurations among the evaluated VLMs. Furthermore, compared with the open-set detection baseline Grounding DINO, Qwen2.5 exhibits stronger zero-shot generalization and comparable fine-tuned performance. Notably, Qwen2.5 shows superior instrument recognition, while Grounding DINO demonstrates stronger localization.
中文标题/摘要
标题:评估大型视觉语言模型在手术工具检测中的效果
手术是一个高度复杂的过程,人工智能已经成为了支持手术指导和决策的变革性力量。然而,大多数当前AI系统的单模态性质限制了它们实现对手术工作流程的全面理解的能力。这突显了需要能够全面建模手术场景中相关组件的一般用途手术AI系统的需求。最近在多模态数据处理方面取得的大型视觉语言模型的进步为建模手术任务和提供类人场景推理和理解提供了强大的潜力。尽管它们具有潜力,但在手术应用中的系统性研究仍然有限。在本研究中,我们评估了大型视觉语言模型在基本的手术视觉任务——检测手术工具中的效果。具体而言,我们在GraSP机器人手术数据集上研究了三种最先进的视觉语言模型Qwen2.5、LLaVA1.5和InternVL3.5,在零样本和参数高效LoRA微调设置下。我们的结果表明,在评估的视觉语言模型中,Qwen2.5在两种配置下都始终表现出更优的检测性能。此外,与开放集检测基准Grounding DINO相比,Qwen2.5在零样本泛化方面表现出更强的能力,并且在微调性能方面具有可比性。值得注意的是,Qwen2.5在器械识别方面表现出更优的效果,而Grounding DINO在定位方面表现更强。
Summary / 总结
This study evaluates large vision-language models (VLMs) for surgical tool detection, comparing Qwen2.5, LLaVA1.5, and InternVL3.5 on the GraSP robotic surgery dataset. Qwen2.5 outperforms the other VLMs in both zero-shot and LoRA fine-tuned settings. Compared with the open-set detection baseline Grounding DINO, it shows stronger zero-shot generalization and comparable fine-tuned performance. Notably, Qwen2.5 excels at instrument recognition, while Grounding DINO is better at localization.
研究评估了大型视觉语言模型(VLMs)在手术工具检测任务中的效果,重点关注Qwen2.5、LLaVA1.5和InternVL3.5。结果显示,Qwen2.5在零样本和微调设置中均表现出色,展现出强大的零样本泛化能力和可比的微调性能。值得注意的是,Qwen2.5在器械识别方面表现出色,而Grounding DINO在定位方面表现更佳。
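The distinction between recognition and localization can be illustrated with intersection-over-union (IoU), a common way to score how well a predicted box matches a ground-truth box. This is a generic sketch; whether the study uses IoU, and at what threshold, is an assumption here.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_label, pred_box, gt_label, gt_box, iou_threshold=0.5):
    # Recognition: the class label matches. Localization: the boxes overlap enough.
    return pred_label == gt_label and box_iou(pred_box, gt_box) >= iou_threshold


iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A model can name the right instrument and still fail the IoU test, which is one way a recognition-strong model can trail a localization-strong one.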
VMMU: A Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark
Authors: Vy Tuong Dang, An Vo, Emilio Villa-Cueva, Quang Tau, Duc Dm, Thamar Solorio, Daeyoung Kim
First: 2025-08-19T09:31:18+00:00 · Latest: 2026-01-23T15:16:58+00:00
Abstract
We introduce VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark designed to evaluate how vision-language models (VLMs) interpret and reason over visual and textual information beyond English. VMMU consists of 2.5k multimodal questions across 7 tasks, covering a diverse range of problem contexts, including STEM problem solving, data interpretation, rule-governed visual reasoning, and abstract visual reasoning. All questions require genuine multimodal integration, rather than reliance on text-only cues or OCR-based shortcuts. We evaluate a diverse set of state-of-the-art proprietary and open-source VLMs on VMMU. Despite strong Vietnamese OCR performance, proprietary models achieve only 66% mean accuracy. Further analysis shows that the primary source of failure is not OCR, but instead multimodal grounding and reasoning over text and visual evidence. Code and data are available at https://vmmu-bench.github.io/
中文标题/摘要
标题:VMMU:越南多任务多模态理解与推理基准
我们介绍了VMMU,一个越南多任务多模态理解与推理基准,旨在评估视觉语言模型(VLMs)如何超越英语解释和推理视觉和文本信息。VMMU 包含7个任务中的2500个多模态问题,涵盖了从STEM问题解决到数据解释、规则指导的视觉推理和抽象视觉推理等多种问题情境。所有问题都需要真正的多模态整合,而不是依赖于纯文本线索或OCR捷径。我们对VMMU上的一系列最先进的专有和开源VLMs进行了评估。尽管越南OCR表现出色,但专有模型的平均准确率仅为66%。进一步的分析表明,主要的失败原因不是OCR,而是文本和视觉证据的多模态定位和推理。代码和数据可在https://vmmu-bench.github.io/ 获取
Summary / 总结
The VMMU benchmark evaluates vision-language models (VLMs) in interpreting and reasoning over Vietnamese multimodal data, covering tasks like STEM problem solving and abstract reasoning. It includes 2,500 questions across 7 diverse tasks, requiring genuine multimodal integration. Despite strong OCR performance, proprietary models achieve only 66% mean accuracy, indicating challenges in multimodal grounding and reasoning.
VMMU 是一个越南语的多任务多模态理解和推理基准,旨在评估视觉语言模型在处理超出英语的任务时的能力。它包含2500个多模态问题,覆盖七个不同的任务。尽管OCR性能很强,但专用模型的平均准确率仅为66%,表明在多模态定位和推理方面存在挑战。该基准旨在推动视觉语言模型在处理复杂视觉和文本信息方面的边界。
Data Matters Most: Auditing Social Bias in Contrastive Vision Language Models
Authors: Zahraa Al Sahili, Ioannis Patras, Matthew Purver
First: 2025-01-22T21:08:30+00:00 · Latest: 2026-01-23T15:12:35+00:00
Comments: Published at TMLR; updated version
Abstract
Vision-language models (VLMs) deliver strong zero-shot recognition but frequently inherit social biases from their training data. We systematically disentangle three design factors -- model size, training-data scale, and training-data source -- by comparing CLIP and OpenCLIP, two models that share an identical contrastive objective yet differ in encoder width and in the image-text corpora on which they are pre-trained (400M proprietary pairs vs. 400M/2B LAION). Across balanced face-analysis benchmarks, enlarging the encoder reduces gender skew in CLIP but amplifies both gender and racial skew in OpenCLIP; increasing the LAION corpus from 400M to 2B further increases OpenCLIP bias. At matched model and data budgets, substituting proprietary data with LAION improves gender fairness while increasing racial skew, underscoring data source as the primary driver of bias patterns. We also evaluate three post-hoc, test-time debiasing strategies -- Bias Prompts, Prompt Array, and SANER. Debiasing reduces but does not eliminate harm, and its effectiveness is source- and size-dependent: Bias Prompts most effectively reduce gender skew in CLIP at smaller model sizes, whereas Prompt Array and SANER more reliably reduce racial skew in OpenCLIP; scaling LAION reconfigures which method is most fair. Taken together, these findings challenge the assumption that bigger models or datasets are automatically fairer and foreground training data source as the key determinant of both bias and mitigation efficacy. We release code and evaluation scripts to enable transparent, reproducible auditing of future VLMs.
中文标题/摘要
标题:数据最重要:审计对比视觉语言模型中的社会偏见
视觉-语言模型(VLMs)在零样本识别方面表现出色,但经常从训练数据中继承社会偏见。我们系统地拆分了三个设计因素——模型大小、训练数据规模和训练数据来源,通过比较CLIP和OpenCLIP两种模型,这两种模型具有相同的对比目标,但编码器宽度不同,且预训练数据集不同(4亿私有配对数据 vs. 4亿/20亿LAION)。在平衡的人脸分析基准测试中,增大编码器减少了CLIP中的性别偏差,但在OpenCLIP中放大了性别和种族偏差;将LAION数据集从4亿增加到20亿进一步增加了OpenCLIP的偏见。在匹配的模型和数据预算下,用LAION替换私有数据提高了性别公平性,但增加了种族偏见,突显了数据来源是偏见模式的主要驱动因素。我们还评估了三种事后测试时去偏策略——偏见提示、提示阵列和SANER。去偏减少了但并未消除伤害,其有效性取决于来源和规模:偏见提示在较小的模型规模下最有效地减少了CLIP中的性别偏差,而提示阵列和SANER更可靠地减少了OpenCLIP中的种族偏差;扩大LAION重新配置了哪种方法最公平。综合来看,这些发现挑战了更大的模型或数据集自动更公平的假设,并将训练数据来源置于偏见和缓解效果的关键决定因素之上。我们发布了代码和评估脚本,以实现未来VLMs的透明、可重复审计。
Summary / 总结
This study investigates the impact of model size, training data scale, and data source on social bias in vision-language models. By comparing CLIP and OpenCLIP, which share the same contrastive objective but differ in encoder width and training data, the research finds that increasing the encoder size reduces gender bias in CLIP but amplifies both gender and racial bias in OpenCLIP. Expanding the training data from 400M to 2B further increases bias in OpenCLIP. Substituting proprietary data with LAION improves gender fairness but increases racial bias. Post-hoc debiasing strategies reduce bias but are source- and size-dependent, highlighting the importance of training data source in determining bias patterns and mitigation efficacy.
研究通过使用CLIP和OpenCLIP探讨了模型大小、训练数据规模和数据来源对视觉语言模型(VLMs)社会偏见的影响。研究发现,增加模型大小可以减少CLIP中的性别偏见,但在OpenCLIP中则会放大性别和种族偏见。将训练数据从400M扩展到2B会增加OpenCLIP中的偏见。用LAION替换专有数据可以改善性别公平性,但会增加种族偏见。后处理去偏方法可以减少偏见,但其效果取决于数据来源和模型大小,突显了训练数据来源在决定偏见模式和缓解效果中的关键作用。
A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model
Authors: Muhammad Tayyab Khan, Zane Yong, Lequn Chen, Wenhe Feng, Nicholas Yew Jin Tan, Seung Ki Moon
First: 2025-10-23T09:07:31+00:00 · Latest: 2026-01-23T11:39:05+00:00
Comments: This draft has been accepted in the 13th International Conference on Industrial Engineering and Applications (ICIEA 2026)
Abstract
Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations remains challenging using manual methods, generic optical character recognition (OCR) systems, or traditional deep learning approaches, due to varied layouts, orientations, and mixed symbolic-textual content. To address these challenges, this paper proposes a three-stage hybrid framework for the automated interpretation of 2D multi-view engineering drawings using modern detection and vision language models (VLMs). In the first stage, YOLOv11-det performs layout segmentation to localize key regions such as views, title blocks, and notes. The second stage uses YOLOv11-obb for orientation-aware, fine-grained detection of annotations, including measures, GD&T symbols, and surface roughness indicators. The third stage employs two Donut-based, OCR-free VLMs for semantic content parsing: the Alphabetical VLM extracts textual and categorical information from title blocks and notes, while the Numerical VLM interprets quantitative data such as measures, GD&T frames, and surface roughness. Two specialized datasets were developed to ensure robustness and generalization: 1,000 drawings for layout detection and 1,406 for annotation-level training. The Alphabetical VLM achieved an overall F1 score of 0.672, while the Numerical VLM reached 0.963, demonstrating strong performance in textual and quantitative interpretation, respectively. The unified JSON output enables seamless integration with CAD and manufacturing databases, providing a scalable solution for intelligent engineering drawing analysis.
中文标题/摘要
标题:一种基于视觉语言模型的多阶段混合框架,用于多视图工程图纸的自动化解读
工程图纸是制造沟通的基础,是传达设计意图、公差和生产细节的主要媒介。然而,使用手动方法、通用光学字符识别(OCR)系统或传统的深度学习方法解读具有密集注释的复杂多视图图纸仍然具有挑战性,原因在于布局、方向和混合符号-文本内容的多样性。为了解决这些挑战,本文提出了一种基于现代检测和视觉语言模型(VLMs)的三阶段混合框架,用于自动化解读2D多视图工程图纸。第一阶段使用YOLOv11-det进行布局分割,以定位视图、标题块和注释等关键区域。第二阶段使用YOLOv11-obb进行方向感知的细粒度注释检测,包括尺寸、GD&T符号和表面粗糙度指示。第三阶段使用两个基于Donut的、无需OCR的VLMs进行语义内容解析:Alphabetical VLM从标题块和注释中提取文本和分类信息,而Numerical VLM解释诸如尺寸、GD&T框架和表面粗糙度等定量数据。为了确保鲁棒性和泛化能力,开发了两个专用数据集:用于布局检测的1000张图纸和用于注释级训练的1406张图纸。Alphabetical VLM的整体F1得分为0.672,而Numerical VLM达到了0.963,分别在文本和定量解释方面表现出色。统一的JSON输出可以无缝集成到CAD和制造数据库中,提供了一种可扩展的智能工程图纸分析解决方案。
Summary / 总结
This paper addresses the challenge of interpreting complex multi-view engineering drawings with dense annotations by proposing a three-stage hybrid framework. The first stage uses YOLOv11-det for layout segmentation, the second uses YOLOv11-obb for orientation-aware detection of annotations, and the third employs two Donut-based, OCR-free VLMs to parse semantic content. The Alphabetical VLM achieved an overall F1 score of 0.672, while the Numerical VLM reached 0.963, on textual and quantitative interpretation, respectively. The unified JSON output is intended to integrate with CAD and manufacturing databases.
本文旨在解决复杂多视图工程图纸中密集注释的解读难题。提出了一种三阶段混合框架,使用现代检测和视觉语言模型。第一阶段进行布局分割,第二阶段对注释进行方向感知的精细检测,第三阶段使用无OCR视觉语言模型解析语义内容。字母视觉语言模型的F1得分为0.672,数字视觉语言模型达到了0.963,分别展示了在文本和定量解释方面的强大性能。
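The three-stage flow described above can be sketched as a small pipeline that routes regions to the right parser and emits one JSON document. Everything below is a hypothetical stand-in: the stub functions return fixed placeholder values and do not call YOLOv11 or Donut, and the JSON field names are assumptions rather than the paper's schema.

```python
import json


# Hypothetical stubs; real detectors and parsers would replace them.
def detect_layout(drawing):
    # Stage 1: localize coarse regions such as views, title blocks, and notes.
    return [
        {"region": "title_block", "box": [0, 0, 100, 20]},
        {"region": "view", "box": [0, 20, 100, 100]},
    ]


def detect_annotations(drawing, view_region):
    # Stage 2: oriented boxes (cx, cy, w, h, angle_deg) for fine-grained annotations.
    return [
        {"kind": "measure", "obb": [50, 60, 20, 5, 90.0]},
        {"kind": "gdt", "obb": [30, 40, 15, 5, 0.0]},
    ]


def parse_alphabetical(region):
    # Stage 3a: textual and categorical content (title blocks, notes).
    return {"part_name": "<text parsed from region>"}


def parse_numerical(annotation):
    # Stage 3b: quantitative content (measures, GD&T frames, surface roughness).
    return {"kind": annotation["kind"], "value": "<parsed value>"}


def interpret(drawing):
    result = {"title_block": {}, "notes": [], "annotations": []}
    for region in detect_layout(drawing):
        if region["region"] == "title_block":
            result["title_block"] = parse_alphabetical(region)
        elif region["region"] == "notes":
            result["notes"].append(parse_alphabetical(region))
        elif region["region"] == "view":
            for ann in detect_annotations(drawing, region):
                result["annotations"].append(parse_numerical(ann))
    return json.dumps(result, indent=2)


report = interpret(drawing=None)
```

Keeping the routing logic separate from the models makes it straightforward to swap either parser without changing the output format.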
Edge-Aware Image Manipulation via Diffusion Models with a Novel Structure-Preservation Loss
Authors: Minsu Gong, Nuri Ryu, Jungseul Ok, Sunghyun Cho
Venue: WACV 2026
First: 2026-01-23T11:06:51+00:00 · Latest: 2026-01-23T11:06:51+00:00
Comments: Accepted to WACV 2026
Abstract
Recent advances in image editing leverage latent diffusion models (LDMs) for versatile, text-prompt-driven edits across diverse tasks. Yet maintaining pixel-level edge structures, which is crucial for tasks such as photorealistic style transfer or image tone adjustment, remains a challenge for latent-diffusion-based editing. To overcome this limitation, we propose a novel Structure Preservation Loss (SPL) that leverages local linear models to quantify structural differences between input and edited images. Our training-free approach integrates SPL directly into the diffusion model's generative process to ensure structural fidelity. This core mechanism is complemented by a post-processing step to mitigate LDM decoding distortions, a masking strategy for precise edit localization, and a color preservation loss to preserve hues in unedited areas. Experiments confirm SPL enhances structural fidelity, delivering state-of-the-art performance in latent-diffusion-based image editing. Our code will be publicly released at https://github.com/gongms00/SPL.
中文标题/摘要
标题:基于扩散模型与新型结构保持损失的边缘感知图像编辑
近期图像编辑的进展利用潜在扩散模型(LDMs)实现多样任务中多样的、由文本提示驱动的编辑。然而,保持像素级边缘结构——这对于照片现实风格转移或图像色调调整等任务至关重要——仍然是基于潜在扩散的编辑的挑战。为克服这一限制,我们提出了一种新型结构保持损失(SPL),利用局部线性模型量化输入图像和编辑图像之间的结构差异。我们的无训练方法将SPL直接整合到扩散模型的生成过程中,以确保结构保真度。该核心机制由后处理步骤、减轻LDM解码失真的策略、精确编辑定位的掩码策略以及保持未编辑区域色调的颜色保持损失来补充。实验结果证实,SPL提高了结构保真度,实现了基于潜在扩散的图像编辑的最新性能。我们的代码将在https://github.com/gongms00/SPL公开发布。
Summary / 总结
The research aims to improve the structural fidelity in latent diffusion models for image editing tasks. The authors introduce a novel Structure Preservation Loss (SPL) that uses local linear models to maintain edge structures during image manipulation. The method integrates SPL into the diffusion model's generative process and includes additional steps for post-processing, precise edit localization, and color preservation. Experiments show that SPL significantly enhances structural fidelity, achieving state-of-the-art performance in latent-diffusion-based image editing.
研究旨在提高图像编辑任务中潜扩散模型的结构保真度。作者提出了一种新型结构保存损失(SPL),利用局部线性模型在图像操作过程中保持边缘结构。该方法将SPL直接集成到扩散模型的生成过程中,并包含后处理步骤、精确编辑定位和颜色保存等额外步骤。实验表明,SPL显著提高了结构保真度,实现了在基于潜扩散的图像编辑中的最先进性能。
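One way to read "local linear models" is the following sketch: within each small window, fit the edited pixels as an affine function of the input pixels and treat the leftover residual as structural change. This is an assumption about the general idea, not the paper's exact SPL formulation, and the function names are hypothetical. Images are plain nested lists of grayscale values.

```python
def local_linear_residual(inp, out, eps=1e-8):
    """Mean squared residual of the least-squares fit out ~ a * inp + b in one window."""
    n = len(inp)
    mean_in = sum(inp) / n
    mean_out = sum(out) / n
    var = sum((x - mean_in) ** 2 for x in inp) / n
    cov = sum((x - mean_in) * (y - mean_out) for x, y in zip(inp, out)) / n
    a = cov / (var + eps)  # eps keeps flat windows well defined
    b = mean_out - a * mean_in
    return sum((a * x + b - y) ** 2 for x, y in zip(inp, out)) / n


def structure_loss(inp_img, out_img, k=3):
    """Average window residual over all k x k sliding windows."""
    h, w = len(inp_img), len(inp_img[0])
    total, count = 0.0, 0
    for r in range(h - k + 1):
        for c in range(w - k + 1):
            win_in = [inp_img[r + dr][c + dc] for dr in range(k) for dc in range(k)]
            win_out = [out_img[r + dr][c + dc] for dr in range(k) for dc in range(k)]
            total += local_linear_residual(win_in, win_out)
            count += 1
    return total / count


source = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]               # vertical step edge
tone_shift = [[0.5 * v + 0.2 for v in row] for row in source]   # affine tone change
moved_edge = [[0.0, 1.0, 1.0, 1.0] for _ in range(4)]           # edge shifted by one pixel

loss_tone = structure_loss(source, tone_shift)
loss_moved = structure_loss(source, moved_edge)
```

Under this reading, a pure tone adjustment leaves the loss near zero while a displaced edge is penalized, which matches the stated goal of preserving edge structure during tone or style edits.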
UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval
Authors: Hongyu Guo, Xiangzhao Hao, Jiarui Guo, Haiyun Guo, Jinqiao Wang, Tat-Seng Chua
First: 2025-08-06T07:02:39+00:00 · Latest: 2026-01-23T10:16:53+00:00
Abstract
Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent works mostly fine-tune pre-trained vision-language models to achieve performance gains, yet they suffer from overfitting and weak generalization. To address this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner), which exploits the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance the discriminability of generated captions. With it, we convert each image into an image-description pair, enabling more comprehensive feature representation, and construct multimodal category templates from the few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLM-based approaches.
中文标题/摘要
标题:UniFGVC: 基于属性感知多模态检索的通用无训练少样本细粒度视觉分类
少样本细粒度视觉分类(FGVC)旨在利用有限的数据使模型能够区分细微不同的类别。近期的工作主要通过微调预训练的视觉语言模型来实现性能提升,但容易出现过拟合和泛化能力弱的问题。为了解决这一问题,我们提出了UniFGVC,这是一种通用的无训练框架,将少样本FGVC重新定义为多模态检索。首先,我们提出了类别区分视觉描述生成器(CDV-描述生成器)来利用多模态大型语言模型(MLLMs)的开放世界知识,生成结构化的文本描述,捕捉细粒度的属性特征,区分密切相关的类别。CDV-描述生成器使用链式思考提示和视觉相似的参考图像来减少幻觉并增强生成描述的区分度。使用它,我们可以将每张图像转换为图像-描述对,从而实现更全面的特征表示,并使用少量样本构建多模态类别模板,用于后续的检索管道。然后,现成的视觉和文本编码器嵌入查询和模板对,FGVC通过在联合空间中检索最近的模板来完成。UniFGVC确保了与各种MLLMs和编码器的广泛兼容性,提供了在少样本FGVC场景中可靠的泛化能力和适应性。在12个FGVC基准上的广泛实验表明,它在少样本CLIP基线方法上具有一致的优越性,甚至超过了几个完全监督的MLLMs基线方法。
Summary / 总结
UniFGVC is a training-free framework for few-shot fine-grained visual classification that leverages attribute-aware multimodal retrieval. It uses a Category-Discriminative Visual Captioner (CDV-Captioner) to generate structured text descriptions that capture fine-grained attribute features, enhancing the discrimination of generated captions. These descriptions are paired with images to construct multimodal category templates, which are then used for retrieval. Experiments on 12 FGVC benchmarks show that UniFGVC outperforms previous few-shot CLIP-based methods and some fully-supervised MLLMs-based approaches.
UniFGVC 是一个无需训练的框架,用于少量样本细粒度视觉分类,利用属性感知的多模态检索。它使用 Category-Discriminative Visual Captioner (CDV-Captioner) 生成结构化的文本描述,捕捉细粒度的属性,增强生成描述的区分度。这些描述与图像配对形成多模态类别模板,然后在视觉和文本编码器的联合空间中检索最近的模板。在12个细粒度视觉分类基准上的实验表明,UniFGVC 在性能上优于之前的少量样本 CLIP 基础方法和一些完全监督的 MLLMs 基础方法。
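The retrieval step can be sketched as a nearest-neighbour search over joint image-text embeddings. The example below is a minimal illustration under assumptions: the joint embedding is formed by simple concatenation, similarity is cosine, and the vectors are made-up toy values rather than encoder outputs.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def joint_embedding(image_vec, text_vec):
    # Assumption: the joint space is the concatenation of both modalities.
    return list(image_vec) + list(text_vec)


def classify(query, templates):
    """query: (image_vec, text_vec); templates: list of (label, image_vec, text_vec)."""
    q = joint_embedding(*query)
    # Pick the template whose joint embedding is most similar to the query.
    best = max(templates, key=lambda t: cosine(q, joint_embedding(t[1], t[2])))
    return best[0]


templates = [
    ("sparrow", [1.0, 0.0], [0.9, 0.1]),
    ("finch", [0.0, 1.0], [0.2, 0.8]),
]
label = classify(([0.8, 0.2], [0.7, 0.3]), templates)
```

Since nothing is trained, adding a class only requires adding templates, which is what makes the approach training-free.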
X-Aligner: Composed Visual Retrieval without the Bells and Whistles
Authors: Yuqian Zheng, Mariana-Iuliana Georgescu
First: 2026-01-23T09:33:38+00:00 · Latest: 2026-01-23T09:33:38+00:00
Comments: 8 pages
Abstract
Composed Video Retrieval (CoVR) facilitates video retrieval by combining visual and textual queries. However, existing CoVR frameworks typically fuse multimodal inputs in a single stage, achieving only marginal gains over the initial baseline. To address this, we propose a novel CoVR framework that leverages the representational power of Vision Language Models (VLMs). Our framework incorporates a novel cross-attention module, X-Aligner, composed of cross-attention layers that progressively fuse visual and textual inputs and align their multimodal representation with that of the target video. To further enhance the representation of the multimodal query, we incorporate the caption of the visual query as an additional input. The framework is trained in two stages to preserve the pretrained VLM representation. In the first stage, only the newly introduced module is trained, while in the second stage, the textual query encoder is also fine-tuned. We implement our framework on top of BLIP-family architectures, namely BLIP and BLIP-2, and train it on the Webvid-CoVR data set. In addition to in-domain evaluation on Webvid-CoVR-Test, we perform zero-shot evaluations on the Composed Image Retrieval (CIR) data sets CIRCO and Fashion-IQ. Our framework achieves state-of-the-art performance on CoVR, obtaining a Recall@1 of 63.93% on Webvid-CoVR-Test, and demonstrates strong zero-shot generalization on CIR tasks.
中文标题/摘要
标题:X-Aligner:无花哨的组合视觉检索
组合视频检索(CoVR)通过结合视觉和文本查询来促进视频检索。然而,现有的CoVR框架通常在单一阶段融合多模态输入,仅在初始基线基础上取得微小改进。为解决这一问题,我们提出了一种新的CoVR框架,利用视觉语言模型(VLM)的表示能力。该框架包含一种新颖的交叉注意力模块X-Aligner,由逐步融合视觉和文本输入并将其多模态表示与目标视频的表示对齐的交叉注意力层组成。为了进一步增强多模态查询的表示,我们将视觉查询的描述作为额外输入纳入其中。该框架分两阶段训练以保留预训练的VLM表示。在第一阶段,仅训练新引入的模块,而在第二阶段,也微调文本查询编码器。我们在BLIP家族架构之上实现该框架,即BLIP和BLIP-2,并在Webvid-CoVR数据集上进行训练。除了在Webvid-CoVR-Test上的领域内评估,我们还在组合图像检索(CIR)数据集CIRCO和Fashion-IQ上进行了零样本评估。我们的框架在CoVR上取得了最先进的性能,获得Webvid-CoVR-Test上Recall@1为63.93%,并在CIR任务上展示了强大的零样本泛化能力。
Summary / 总结
The research aims to improve Composed Video Retrieval (CoVR) by addressing the limitations of existing frameworks that fuse multimodal inputs in a single stage. The proposed X-Aligner framework uses a novel cross-attention module to progressively fuse and align visual and textual inputs, enhancing the multimodal query representation. The framework is trained in two stages, with the first stage focusing on the new module and the second stage fine-tuning the textual query encoder. Experimental results show that the framework achieves state-of-the-art performance on CoVR, with a Recall@1 of 63.93% on Webvid-CoVR-Test, and strong zero-shot generalization on CIR tasks.
该论文提出了一种名为X-Aligner的新框架,用于结合视觉和文本查询的Composed Video Retrieval (CoVR),该框架利用Vision Language Models (VLMs)和交叉注意力模块逐步融合和对齐视觉和文本输入。该框架分为两个阶段进行训练,第一阶段专注于新引入的X-Aligner模块,第二阶段微调文本查询编码器。实验结果表明,该方法在Webvid-CoVR-Test集上取得了63.93%的Recall@1性能,并在CIR任务上展示了强大的零样本泛化能力。
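Recall@K, the metric reported above, is the fraction of queries whose target appears among the top K retrieved items. A minimal sketch with made-up item identifiers:

```python
def recall_at_k(ranked_lists, targets, k=1):
    """ranked_lists[i] is the ranked retrieval result for query i; targets[i] is its target."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)


ranked = [["a", "b"], ["c", "d"], ["e", "f"]]
targets = ["a", "d", "x"]

r1 = recall_at_k(ranked, targets, k=1)  # only the first query hits at rank 1
r2 = recall_at_k(ranked, targets, k=2)  # the second query hits at rank 2
```

A Recall@1 of 63.93% therefore means the target video was ranked first for roughly 64 of every 100 test queries.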
LLM is Not All You Need: A Systematic Evaluation of ML vs. Foundation Models for text and image based Medical Classification
Authors: Meet Raval, Tejul Pandit, Dhvani Upadhyay
First: 2026-01-23T08:35:53+00:00 · Latest: 2026-01-23T08:35:53+00:00
Comments: 9 pages, 5 figures, 3 tables, paper accepted in AAIML'26 conference
Abstract
The combination of multimodal Vision-Language Models (VLMs) and Large Language Models (LLMs) opens up new possibilities for medical classification. This work offers a rigorous, unified benchmark by using four publicly available datasets covering text and image modalities (binary and multiclass complexity) that contrasts traditional Machine Learning (ML) with contemporary transformer-based techniques. We evaluated three model classes for each task: Classical ML (LR, LightGBM, ResNet-50), Prompt-Based LLMs/VLMs (Gemini 2.5), and Fine-Tuned PEFT Models (LoRA-adapted Gemma3 variants). All experiments used consistent data splits and aligned metrics. According to our results, traditional machine learning (ML) models set a high standard by consistently achieving the best overall performance across most medical categorization tasks. This was especially true for structured text-based datasets, where the classical models performed exceptionally well. In stark contrast, the LoRA-tuned Gemma variants consistently showed the worst performance across all text and image experiments, failing to generalize from the minimal fine-tuning provided. However, the zero-shot LLM/VLM pipelines (Gemini 2.5) had mixed results; they performed poorly on text-based tasks, but demonstrated competitive performance on the multiclass image task, matching the classical ResNet-50 baseline. These results demonstrate that in many medical categorization scenarios, established machine learning models continue to be the most reliable option. The experiment suggests that foundation models are not universally superior and that the effectiveness of Parameter-Efficient Fine-Tuning (PEFT) is highly dependent on the adaptation strategy, as minimal fine-tuning proved detrimental in this study.
中文标题/摘要
标题:LLM并非万能:机器学习与基础模型在医学分类中的系统评估
多模态视觉-语言模型(VLMs)与大型语言模型(LLMs)的结合为医学分类开辟了新的可能性。本研究通过使用四个涵盖文本和图像模态(二分类和多分类复杂性)的公开数据集,提供了一个严格的统一基准,对比了传统机器学习(ML)与现代基于变换器的技术。我们为每个任务评估了三种模型类别:经典机器学习(LR,LightGBM,ResNet-50),提示基础LLM/VLM(Gemini 2.5),以及微调PEFT模型(LoRA-适应Gemma3变体)。所有实验使用了统一的数据分割和一致的度量标准。根据我们的结果,传统机器学习(ML)模型设定了高标准,一致地在大多数医学分类任务中表现出最佳的整体性能。特别是在结构化文本数据集上,经典模型表现尤为出色。相比之下,LoRA调优的Gemma变体在所有文本和图像实验中始终表现出最差的性能,无法从提供的最小微调中泛化。然而,零样本LLM/VLM流水线(Gemini 2.5)的结果参差不齐;它们在文本任务上表现不佳,但在多分类图像任务上表现出竞争力,与经典的ResNet-50基线相当。这些结果表明,在许多医学分类场景中,传统的机器学习模型仍然是最可靠的选择。实验表明,基础模型并非普遍优越,参数高效微调(PEFT)的有效性高度依赖于适应策略,在本研究中,最小微调证明是有害的。
Summary / 总结
This study compares classical machine learning (ML) models, prompt-based large language models (LLMs)/vision-language models (VLMs), and LoRA-adapted parameter-efficient fine-tuning (PEFT) models on medical classification tasks, using four publicly available datasets spanning text and image modalities. Classical ML models achieve the best overall performance across most tasks, particularly on structured text-based datasets, while the LoRA-tuned Gemma variants perform worst across all text and image experiments. Zero-shot LLM/VLM pipelines (Gemini 2.5) show mixed results, performing poorly on text-based tasks but matching the ResNet-50 baseline on the multiclass image task. The study suggests that foundation models are not universally superior and that minimal fine-tuning can be detrimental.
该研究评估了传统机器学习(ML)模型、基于提示的大语言模型(LLM)/视觉语言模型(VLM)以及参数高效微调(PEFT)模型在医学分类任务中的表现。使用四个数据集,研究对比了ML与基于变换器的技术。结果显示,经典ML模型在大多数任务中表现优于LLM和PEFT模型,特别是在结构化文本数据集上表现尤为出色。零样本LLM/VLM流水线在文本任务上表现不佳,但在多类图像任务上表现出色,与经典ResNet-50基线相当。研究指出,基础模型并非在所有场景下都更优,且参数高效微调的有效性高度依赖于微调策略,本研究中少量微调效果较差。
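For readers unfamiliar with LoRA, the adaptation used for the PEFT models, the core idea is to keep a pretrained weight matrix W frozen and learn a low-rank update, giving an effective weight W + (alpha / r) * B A. The sketch below shows that arithmetic in plain Python with toy values; it is not the study's training setup.

```python
def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, a, b, alpha):
    """W' = W + (alpha / r) * B @ A, with A of shape (r, d_in) and B of shape (d_out, r)."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_params(d_out, d_in, r):
    # Only A and B are trained; W itself stays frozen.
    return r * (d_in + d_out)


w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen weight, d_out=2, d_in=3
a = [[0.1, 0.2, 0.3]]                   # rank r=1
b_zero = [[0.0], [0.0]]                 # B starts at zero, so W' equals W
b_trained = [[1.0], [0.0]]              # toy "trained" value

w_init = lora_effective_weight(w, a, b_zero, alpha=2.0)
w_tuned = lora_effective_weight(w, a, b_trained, alpha=2.0)
```

With few update steps B stays close to zero, so the adapted model can remain close to the base model, which is one plausible reading of why minimal fine-tuning did not help here.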
SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
Authors: Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, Yu Wang
First: 2026-01-23T07:28:53+00:00 · Latest: 2026-01-23T07:28:53+00:00
Abstract
Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, the long input sequences result in high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free sparse attention is constrained by limited sparsity and thus offers modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation for training. In this work, we propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and 1.72x inference speedup, while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps with a batch size of 8.
中文标题/摘要
标题:SALAD:通过高效线性注意力调优实现高稀疏度注意力以提高视频扩散变换器性能
扩散变换器在视频生成方面最近表现出色。然而,长输入序列导致全注意力的计算延迟较高,由于其二次复杂度。已经提出了多种稀疏注意力机制。无训练的稀疏注意力受限于稀疏度有限,因此只能提供适度的加速,而基于训练的方法可以达到更高的稀疏度,但需要大量的数据和计算进行训练。在本工作中,我们提出了SALAD,引入了一个与稀疏注意力并行的轻量级线性注意力分支。通过引入输入依赖的门控机制精细平衡两个分支,我们的方法实现了90%的稀疏度和1.72倍的推理加速,同时保持与全注意力基线相当的生成质量。此外,我们的微调过程非常高效,只需要2,000个视频样本和1,600个训练步骤,批量大小为8。
Summary / 总结
The research aims to address the high computational latency in video generation using Diffusion Transformers due to the quadratic complexity of full attention. SALAD introduces a lightweight linear attention branch alongside sparse attention, achieving 90% sparsity and 1.72x inference speedup while maintaining comparable generation quality. The method uses an input-dependent gating mechanism to balance the two branches and requires only 2,000 video samples and 1,600 training steps for efficient fine-tuning.
本文提出了SALAD方法,该方法结合了稀疏和线性注意力,以减少使用Diffusion Transformers进行视频生成时的计算延迟。通过使用输入依赖的门控机制平衡两种注意力机制,SALAD实现了90%的稀疏性和1.72倍的推理加速,同时保持与全注意力基线相当的生成质量。微调过程非常高效,仅需2,000个视频样本和1,600个训练步骤,批量大小为8。
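The two-branch idea described above can be illustrated with a minimal single-query sketch. Everything here is an assumption made for illustration, not the authors' implementation: the top-k selection used for the sparse branch, the elu(x) + 1 feature map for the linear branch, the scalar sigmoid gate, and the additive way the branches are combined. All function and parameter names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def sparse_attention(q, keys, values, keep):
    """Softmax attention restricted to the `keep` highest-scoring keys (assumed top-k sparsity)."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    weights = softmax([scores[i] for i in top])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]

def feature_map(x):
    """Positive feature map elu(x) + 1, a common choice for linear attention."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(q, keys, values):
    """Kernelized attention over all keys.

    Written in its explicit per-key form for readability; a real implementation
    would precompute the key/value summaries once to get linear cost.
    """
    phi_q = feature_map(q)
    sims = [dot(phi_q, feature_map(k)) for k in keys]  # always positive
    total = sum(sims)
    dim = len(values[0])
    return [sum(s * v[d] for s, v in zip(sims, values)) / total for d in range(dim)]

def gated_output(q, keys, values, gate_weights, keep):
    """Sparse branch plus a linear branch scaled by an input-dependent gate (assumed form)."""
    gate = 1.0 / (1.0 + math.exp(-dot(gate_weights, q)))  # scalar in (0, 1)
    sparse = sparse_attention(q, keys, values, keep)
    linear = linear_attention(q, keys, values)
    return [s + gate * l for s, l in zip(sparse, linear)]
```

With 90% sparsity, `keep` would be roughly one tenth of the number of keys; the linear branch is what lets the dropped keys still contribute a coarse summary.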
AlphaFace: High Fidelity and Real-time Face Swapper Robust to Facial Pose
Authors: Jongmin Yu, Hyeontaek Oh, Zhongtian Sun, Angelica I Aviles-Rivero, Moongu Jeon, Jinhong Yang
First: 2026-01-23T04:01:49+00:00 · Latest: 2026-01-23T04:01:49+00:00
Abstract
Existing face-swapping methods often deliver competitive results in constrained settings but exhibit substantial quality degradation when handling extreme facial poses. To improve facial pose robustness, explicit geometric features are applied, but this approach remains problematic since it introduces additional dependencies and increases computational cost. Diffusion-based methods have achieved remarkable results; however, they are impractical for real-time processing. We introduce AlphaFace, which leverages an open-source vision-language model and CLIP image and text embeddings to apply novel visual and textual semantic contrastive losses. AlphaFace enables stronger identity representation and more precise attribute preservation, all while maintaining real-time performance. Comprehensive experiments across FF++, MPIE, and LPFF demonstrate that AlphaFace surpasses state-of-the-art methods in pose-challenging cases. The project is publicly available at https://github.com/andrewyu90/Alphaface_Official.git.
中文标题/摘要
标题:AlphaFace:高保真度和实时面部互换,具备面部姿态鲁棒性
现有面部互换方法在受限环境中通常能提供竞争力的结果,但在处理极端面部姿态时表现出显著的质量下降。为了提高面部姿态鲁棒性,应用了显式的几何特征,但这种方法仍然存在问题,因为它引入了额外的依赖性和增加了计算成本。基于扩散的方法取得了显著成果;然而,它们不适用于实时处理。我们引入了AlphaFace,它利用开源的视觉-语言模型和CLIP图像和文本嵌入,应用了新颖的视觉和文本语义对比损失。AlphaFace 能够提供更强的身份表示和更精确的属性保留,同时保持实时性能。跨FF++、MPIE和LPFF的全面实验表明,AlphaFace 在姿态挑战性情况下超越了最先进的方法。该项目可在 https://github.com/andrewyu90/Alphaface_Official.git 公开获取。
Summary / 总结
AlphaFace is designed to improve the robustness of face-swapping methods to extreme facial poses by using an open-source vision-language model and CLIP embeddings to apply semantic contrastive losses. This approach enhances identity representation and attribute preservation while maintaining real-time performance. Experimental results show that AlphaFace outperforms existing methods in handling pose-challenging cases across various datasets including FF++, MPIE, and LPFF.
AlphaFace 旨在通过使用来自开源视觉语言模型和CLIP图像及文本嵌入的视觉和文本语义对比损失,增强极端面部姿态下的换脸效果。这种方法提高了身份表示和属性保留,并保持了实时性能。FF++、MPIE 和 LPFF 的实验结果表明,AlphaFace 在处理姿态挑战性情况下优于现有最先进的方法。
Hierarchy-Aware Multimodal Unlearning for Medical AI
Authors: Fengli Wu, Vaidehi Patil, Jaehong Yoon, Yue Zhang, Mohit Bansal
First: 2025-12-10T17:55:06+00:00 · Latest: 2026-01-23T02:43:59+00:00
Comments: Dataset and Code: https://github.com/fengli-wu/MedForget
Abstract
Pretrained Multimodal Large Language Models (MLLMs) are increasingly used in sensitive domains such as medical AI, where privacy regulations like HIPAA and GDPR require specific removal of individuals' or institutions' data. This motivates machine unlearning, which aims to remove the influence of target data from a trained model. However, existing unlearning benchmarks fail to reflect the hierarchical and multimodal structure of real-world medical data, limiting their ability to properly evaluate unlearning in practice. Therefore, we introduce MedForget, a hierarchy-aware multimodal unlearning benchmark that models hospital data as a nested structure, enabling fine-grained evaluation of multimodal unlearning across retain and forget splits. Experiments with current unlearning methods show that existing approaches struggle to achieve effective hierarchy-aware forgetting without degrading downstream medical utility. To address this limitation, we propose Cross-modal Hierarchy-Informed Projection for unlearning (CHIP), a training-free, hierarchy-aware multimodal unlearning method that deletes information by selectively removing target-specific weight subspaces while preserving sibling-shared information. Experiments show that CHIP achieves the highest forget-retain performance gap across all hierarchy levels while maintaining competitive downstream utility compared to existing methods. Overall, MedForget provides a practical, HIPAA-aligned benchmark for evaluating structured multimodal unlearning for medical data, and CHIP offers an effective and general solution for hierarchy-aware forgetting that balances deletion with utility.
中文标题/摘要
标题:医疗AI中的层次感知多模态遗忘
预训练多模态大型语言模型(MLLMs)在医疗AI等敏感领域中越来越被使用,而HIPAA和GDPR等隐私法规要求特定地移除个人或机构的数据。这促使了机器遗忘的发展,其目标是从训练模型中移除目标数据的影响。然而,现有的遗忘基准未能反映现实世界医疗数据的层次和多模态结构,限制了它们在实际中评估遗忘的能力。因此,我们引入了MedForget,一种层次感知的多模态遗忘基准,将医院数据建模为嵌套结构,从而在保留和遗忘分割中实现多模态遗忘的精细评估。实验表明,现有方法在实现有效的层次感知遗忘时难以避免对下游医疗效用的退化。为解决这一局限,我们提出了跨模态层次启发式投影遗忘方法(CHIP),这是一种无需训练、层次感知的多模态遗忘方法,通过选择性地删除目标特定的权重子空间同时保留兄弟共享的信息来删除信息。实验表明,CHIP在所有层次级别上实现了最高的遗忘-保留性能差距,同时保持与现有方法相当的下游效用。总体而言,MedForget提供了一个实用的、符合HIPAA的基准,用于评估结构化的多模态遗忘,而CHIP提供了一种有效的、通用的层次感知遗忘解决方案,平衡了删除与效用。
Summary / 总结
The paper introduces MedForget, a hierarchy-aware multimodal unlearning benchmark for medical AI, addressing the limitations of existing benchmarks in handling the hierarchical and multimodal structure of medical data. CHIP, a proposed method, achieves the best performance in forgetting while maintaining utility, outperforming existing approaches. MedForget provides a practical benchmark for evaluating unlearning in medical AI, aligned with privacy regulations like HIPAA and GDPR.
该论文通过引入MedForget,一种面向层次结构的多模态遗忘基准,解决了医疗AI中的遗忘问题。它评估了现有遗忘方法的性能,并提出了CHIP,一种无需训练的方法,可以有选择地删除目标特定的权重子空间同时保留共享信息。CHIP在实现有效的层次结构遗忘的同时保持医疗效用方面优于现有方法。
Unified Multimodal Interleaved Document Representation for Retrieval
Authors: Jaewoo Lee, Joonho Ko, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang
First: 2024-10-03T17:49:09+00:00 · Latest: 2026-01-23T02:42:40+00:00
Comments: EACL Findings 2026
Abstract
Information Retrieval (IR) methods aim to identify documents relevant to a query, which have been widely applied in various natural language tasks. However, existing approaches typically consider only the textual content within documents, overlooking the fact that documents can contain multiple modalities, including images and tables. Also, they often segment each long document into multiple discrete passages for embedding, which prevents them from capturing the overall document context and interactions between paragraphs. To address these two challenges, we propose a method that holistically embeds documents interleaved with multiple modalities by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we further merge the representations of segmented passages into one single document representation, while we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse IR scenarios considering both the textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information within documents.
中文标题/摘要
标题:统一多模态交织文档表示以进行检索
信息检索(IR)方法旨在识别与查询相关的文档,已在各种自然语言任务中广泛应用。然而,现有的方法通常只考虑文档中的文本内容,忽略了文档可以包含多种模态,包括图像和表格的事实。此外,它们通常将长文档分割成多个离散段落进行嵌入,这妨碍了它们捕捉文档的整体上下文和段落之间的交互。为了解决这两个挑战,我们提出了一种方法,通过利用最近的视觉-语言模型的能力,将文档与多种模态交织在一起进行整体嵌入,从而实现文本、图像和表格的统一处理和表示。此外,为了减轻将文档分割成段落时的信息损失,我们不仅合并分割段落的表示为单一文档表示,还引入了一种重新排序策略,必要时解耦并识别文档中的相关段落。通过在考虑文本和多模态查询的多种IR场景下进行广泛的实验,我们展示了我们的方法在考虑文档中的多模态信息方面显著优于相关基线。
Summary / 总结
The paper addresses the limitations of traditional Information Retrieval methods that focus solely on textual content and ignore multimodal elements like images and tables. It proposes a unified multimodal interleaved document representation method using recent vision-language models to process and integrate text, images, and tables into a single representation. The method merges the representations of segmented passages into one document representation and introduces a reranking strategy. Experiments show that this approach significantly outperforms existing baselines in various IR scenarios, including those with multimodal queries.
本文针对传统信息检索方法仅关注文本内容并分割文档为离散段落,从而忽视多模态信息和文档整体上下文的局限性。作者提出了一种统一的多模态交织文档表示方法,利用最新的视觉-语言模型来处理和整合文本、图像和表格为单一表示。他们还引入了一种重排序策略以增强对相关段落的检索。在各种考虑文本和多模态查询的IR场景实验中,他们的方法显著优于现有方法,这得益于对文档中多模态信息的利用和整体文档上下文的保持。
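The two-stage flow in this entry (one merged representation per document, then passage identification inside the retrieved document) can be sketched as follows. This is only an illustration under stated assumptions: mean pooling stands in for the merging step, cosine similarity stands in for both the retriever and the reranking strategy, and the embeddings are toy vectors rather than outputs of a vision-language model. All names are hypothetical.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def mean_pool(vectors):
    """Merge passage embeddings into one document embedding (assumed merge rule)."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def retrieve(query, docs):
    """docs maps a document id to the list of its passage embeddings.

    Stage 1 picks the document whose merged embedding best matches the query.
    Stage 2 reranks that document's passages to locate the relevant one.
    """
    doc_vecs = {doc_id: mean_pool(passages) for doc_id, passages in docs.items()}
    best_doc = max(doc_vecs, key=lambda d: cosine(query, doc_vecs[d]))
    passages = docs[best_doc]
    best_passage = max(range(len(passages)), key=lambda i: cosine(query, passages[i]))
    return best_doc, best_passage
```

For example, `retrieve([1.0, 0.0], {"a": [[1.0, 0.0], [0.8, 0.2]], "b": [[0.0, 1.0], [0.1, 0.9]]})` selects document `"a"` and its first passage.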
Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization
Authors: Qianqi Yan, Huy Nguyen, Sumana Srivatsa, Hari Bandi, Xin Eric Wang, Krishnaram Kenthapadi
First: 2026-01-23T02:01:43+00:00 · Latest: 2026-01-23T02:01:43+00:00
Abstract
Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images, overcoming the limitations of post-hoc or retraining-based methods. We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains: clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution achieves competitive performance with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
中文标题/摘要
标题:Cite-While-You-Generate: 无需训练的多模态临床总结证据归属
可靠的临床总结不仅需要流畅的生成,还需要透明地说明每个陈述的来源。我们提出了一种无需训练的生成时来源归属框架,该框架利用解码器注意力直接引用支持的文本片段或图像,克服了事后或重新训练方法的局限性。我们介绍了两种多模态归属策略:一种是原始图像模式,直接使用图像片段注意力;另一种是带有说明的片段模式,用生成的说明替换图像,以实现纯文本对齐。在两个代表性领域(Clinician-Patient Dialogues (CliConSummation) 和 Radiology Reports (MIMIC-CXR))上的评估表明,我们的方法在文本级别和多模态归属准确性方面始终优于基于嵌入和自我归属的基线(例如,相对于嵌入基线的F1分数提高15%)。基于说明的归属在性能上与原始图像注意力相当,但更为轻量级和实用。这些发现突出了注意力引导归属作为可解释和可部署临床总结系统的一个有前景的步骤。
Summary / 总结
The research aims to enhance the transparency of clinical summarization by attributing each statement to its source. It proposes a training-free framework that uses decoder attentions to cite evidence directly from text or images. Evaluations on clinician-patient dialogues and radiology reports show that this approach outperforms existing methods, improving both text-level and multimodal attribution accuracy. Caption-based attribution is found to be competitive with raw-image attention while being more practical.
研究旨在通过归因每个陈述的来源来提高临床摘要的透明度。提出了一种无需训练的框架,利用解码器注意力直接引用支持的文本片段或图像,超越了后处理和重新训练的方法。研究介绍了两种多模态归因策略:原始图像模式和标题作为片段模式。在临床对话和放射报告上的评估表明,该方法优于基于嵌入和自我归因的基线,标题归因在性能上与原始图像注意力相当,但更为轻量和实用。这表明注意力引导的归因可能是可解释和可部署的临床摘要系统的一个有前景的步骤。
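A minimal sketch of attention-guided attribution, assuming the decoder attention weights for one generated sentence are already available as a matrix. The aggregation rule used here (average over generated tokens, then sum within each candidate source span) is an assumption for illustration; the abstract does not specify how attentions are pooled. In caption-as-span mode, an image would simply appear as one more text span holding its generated caption.

```python
def attribute_sentence(attn, spans):
    """Pick the source span that a generated sentence attends to most.

    attn[t][s] is the attention from generated token t to source token s.
    spans maps a span name to a half-open (start, end) range of source tokens.
    Returns the best span name and the score of every span.
    """
    n_src = len(attn[0])
    # Average attention each source token receives across the sentence.
    mean = [sum(row[s] for row in attn) / len(attn) for s in range(n_src)]
    # Summing favors longer spans; dividing by span length is a possible alternative.
    scores = {name: sum(mean[start:end]) for name, (start, end) in spans.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Because the scores come directly from attention computed during generation, no extra model or retraining pass is needed, which is the property the abstract emphasizes.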
Cross-Lingual Activation Steering for Multilingual Language Models
Authors: Rhitabrat Pokharel, Ameeta Agrawal, Tanay Nagar
First: 2026-01-23T01:41:17+00:00 · Latest: 2026-01-23T01:41:17+00:00
Comments: Under review
Abstract
Large language models exhibit strong multilingual capabilities, yet significant performance gaps persist between dominant and non-dominant languages. Prior work attributes this gap to imbalances between shared and language-specific neurons in multilingual representations. We propose Cross-Lingual Activation Steering (CLAS), a training-free inference-time intervention that selectively modulates neuron activations. We evaluate CLAS on classification and generation benchmarks, achieving average improvements of 2.3% (Acc.) and 3.4% (F1) respectively, while maintaining high-resource language performance. We discover that effective transfer operates through functional divergence rather than strict alignment; performance gains correlate with increased language cluster separation. Our results demonstrate that targeted activation steering can unlock latent multilingual capacity in existing models without modification to model weights.
中文标题/摘要
标题:跨语言激活引导多语言语言模型
大型语言模型表现出强大的多语言能力,但主导语言和非主导语言之间仍存在显著的性能差距。先前的工作将这一差距归因于多语言表示中共享神经元和语言特定神经元之间的不平衡。我们提出了一种名为跨语言激活引导(CLAS)的无训练推理时干预方法,该方法选择性地调节神经元激活。我们在分类和生成基准上评估了CLAS,分别实现了2.3%(准确率)和3.4%(F1值)的平均改进,同时保持了高资源语言的性能。我们发现有效的迁移是通过功能分化而非严格的对齐来实现的;性能提升与语言簇的分离度增加相关。我们的结果表明,有针对性的激活引导可以在不修改模型权重的情况下解锁现有模型中的潜在多语言能力。
Summary / 总结
The research aims to address the performance disparity between dominant and non-dominant languages in large multilingual language models. It introduces Cross-Lingual Activation Steering (CLAS), an inference-time method that selectively modulates neuron activations without altering model weights. Evaluations on classification and generation tasks show average improvements of 2.3% in accuracy and 3.4% in F1 score, with high-resource language performance maintained. The study finds that effective transfer occurs through functional divergence, and performance gains are linked to increased separation between language clusters.
研究旨在解决多语言语言模型中主流语言和非主流语言之间的性能差距。提出了跨语言激活引导(CLAS),这是一种在推理时无需训练的方法,可以有选择地调节神经元激活。在分类和生成基准上的评估显示,平均准确率提高了2.3%,F1分数提高了3.4%,同时保持了高资源语言的性能。研究发现,有效的转移通过功能差异实现,性能提升与语言簇之间的分离度增加相关。
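Inference-time activation steering can be sketched as shifting a chosen subset of neurons in a hidden state while leaving the weights untouched. The abstract does not say how CLAS selects neurons or computes the shift, so the mean-difference direction, the neuron index list, and the `alpha` strength below are all assumptions, and the function names are hypothetical.

```python
def mean_difference(target_acts, source_acts):
    """Per-neuron difference between the mean activations of two sets of inputs."""
    n = len(target_acts[0])
    target_mean = [sum(a[i] for a in target_acts) / len(target_acts) for i in range(n)]
    source_mean = [sum(a[i] for a in source_acts) / len(source_acts) for i in range(n)]
    return [t - s for t, s in zip(target_mean, source_mean)]

def steer(hidden, neuron_ids, direction, alpha=1.0):
    """Return a copy of `hidden` with only the selected neurons shifted.

    The model weights are never modified; the intervention is applied to
    activations at inference time.
    """
    steered = list(hidden)
    for i in neuron_ids:
        steered[i] += alpha * direction[i]
    return steered
```

In a real model this function would run inside a forward hook on the chosen layer; the list-based version only shows the arithmetic.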
SimWorld: An Open-ended Realistic Simulator for Autonomous Agents in Physical and Social Worlds
Authors: Jiawei Ren, Yan Zhuang, Xiaokang Ye, Lingjun Mao, Xuhong He, Jianzhi Shen, Mrinaal Dogra, Yiming Liang, Ruixuan Zhang, Tianai Yue, Yiqing Yang, Eric Liu, Ryan Wu, Kevin Benavente, Rajiv Mandya Nagaraju, Muhammad Faayez, Xiyan Zhang, Dhruv Vivek Sharma, Xianrui Zhong, Ziqiao Ma, Tianmin Shu, Zhiting Hu, Lianhui Qin
First: 2025-11-30T20:58:13+00:00 · Latest: 2026-01-22T21:44:26+00:00
Abstract
While LLM/VLM-powered AI agents have advanced rapidly in math, coding, and computer use, their applications in complex physical and social environments remain challenging. Building agents that can survive and thrive in the real world (for example, by autonomously earning income or running a business) requires massive-scale interaction, reasoning, training, and evaluation across diverse embodied scenarios. However, existing world simulators for such development fall short: they often rely on limited hand-crafted environments, simulate simplified game-like physics and social rules, and lack native support for LLM/VLM agents. We introduce SimWorld, a new simulator built on Unreal Engine 5, designed for developing and evaluating LLM/VLM agents in rich, real-world-like settings. SimWorld offers three core capabilities: (1) realistic, open-ended world simulation, including accurate physical and social dynamics and language-driven procedural environment generation; (2) a rich interface for LLM/VLM agents, with multimodal world inputs and open-vocabulary actions at varying levels of abstraction; and (3) diverse and extensible physical and social reasoning scenarios that are easily customizable by users. We demonstrate SimWorld by deploying frontier LLM agents (e.g., GPT-4o, Gemini-2.5-Flash, Claude-3.5, and DeepSeek-Prover-V2) on long-horizon multi-agent delivery tasks involving strategic cooperation and competition. The results reveal distinct reasoning patterns and limitations across models. We open-source SimWorld and hope it becomes a foundational platform for advancing real-world agent intelligence across disciplines: https://simworld.org.
中文标题/摘要
标题:SimWorld:一种用于物理和社会世界中自主代理的开放性现实模拟器
尽管基于LLM/VLM的AI代理在数学、编程和计算机使用方面取得了快速进展,但在复杂物理和社会环境中的应用仍然具有挑战性。要开发能够在现实世界中生存和繁荣的代理(例如,通过自主赚取收入或经营业务),需要大规模的交互、推理、训练和评估,涵盖多种多样的具身场景。然而,现有的世界模拟器在这方面存在不足:它们通常依赖于有限的手工制作环境,模拟简化的游戏物理和社会规则,并缺乏对LLM/VLM代理的原生支持。我们介绍了SimWorld,这是一种基于Unreal Engine 5的新模拟器,旨在为LLM/VLM代理在丰富的真实世界场景中进行开发和评估提供支持。SimWorld提供了三种核心能力:(1)现实、开放的环境模拟,包括准确的物理和社会动力学以及语言驱动的程序化环境生成;(2)丰富的LLM/VLM代理界面,具有多模态世界输入和不同抽象层次的开放词汇动作;(3)多样且可扩展的物理和社会推理场景,用户可以轻松自定义。我们通过部署前沿的LLM代理(例如GPT-4o、Gemini-2.5-Flash、Claude-3.5和DeepSeek-Prover-V2)在涉及战略合作和竞争的长期多代理交付任务中展示了SimWorld。结果显示了不同模型在推理模式和限制方面的差异。我们开源了SimWorld,希望它成为跨学科推进现实世界代理智能的基础平台:https://simworld.org/
Summary / 总结
The research aims to address the challenges of applying LLM/VLM-powered AI agents in complex physical and social environments. SimWorld, a new simulator built on Unreal Engine 5, is introduced to support the development and evaluation of these agents in realistic scenarios. Key features include realistic, open-ended world simulation, a rich interface for LLM/VLM agents, and diverse reasoning scenarios. Experimental results show distinct reasoning patterns and limitations among different models in long-horizon multi-agent delivery tasks involving strategic cooperation and competition.
研究旨在解决LLM/VLM驱动的AI代理在复杂物理和社会环境中的应用挑战。介绍了基于Unreal Engine 5的新模拟器SimWorld,以支持这些代理在现实场景中的开发和评估。关键功能包括现实的、开放的环境模拟,丰富的接口供LLM/VLM代理使用,以及多样化的推理场景。实验结果表明,在涉及战略合作和竞争的长期多代理交付任务中,不同模型表现出不同的推理模式和局限性。
The Spatial Blindspot of Vision-Language Models
Authors: Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A, Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna
First: 2026-01-15T00:30:34+00:00 · Latest: 2026-01-22T19:05:41+00:00
Comments: Work done as part of the EleutherAI SOAR Program
Abstract
Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.
中文标题/摘要
标题:视觉语言模型的空间盲点
视觉语言模型(VLMs)已经取得了快速的进步,但它们捕捉空间关系的能力仍然是一个盲点。当前的VLMs通常使用CLIP风格的图像编码器进行对比语言-图像预训练。训练配方通常将图像扁平化为1D的块序列,从而丢弃了进行空间推理所必需的2D结构。我们认为,这种缺乏空间意识是VLM设计中缺失的一个维度,并且是需要空间定位的应用(如机器人技术和具身AI)的瓶颈。为了应对这一问题,我们研究了(i)使用其他目标训练的图像编码器和(ii)2D位置编码。我们的实验表明,这些架构选择可以在多个基准上提高空间推理能力。
Summary / 总结
The research addresses the limitation of vision-language models (VLMs) in capturing spatial relationships, which is crucial for applications like robotics. The study explores alternative image encoder training objectives and 2D positional encodings to enhance spatial awareness. Experiments demonstrate that these modifications improve spatial reasoning capabilities on various benchmarks.
研究旨在解决视觉语言模型(VLMs)在捕捉空间关系方面的不足,这对于机器人等应用至关重要。研究探索了替代的图像编码训练目标和2D位置编码,以增强空间意识。实验表明,这些修改可以在各种基准测试中提高空间推理能力。
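To make the 1D-versus-2D distinction concrete, the sketch below builds a sinusoidal 2D positional encoding by concatenating a row encoding and a column encoding. This is one common construction and is assumed here for illustration; the abstract does not state which 2D encoding the authors evaluate.

```python
import math

def sincos_1d(pos, dim):
    """Standard sinusoidal encoding of a scalar position (dim must be even)."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        out += [math.sin(pos * freq), math.cos(pos * freq)]
    return out

def sincos_2d(row, col, dim):
    """Encode a patch by its grid coordinates (dim must be divisible by 4).

    The first half depends only on the row and the second half only on the
    column, so vertical and horizontal neighbors stay distinguishable.
    """
    half = dim // 2
    return sincos_1d(row, half) + sincos_1d(col, half)

def flat_index(row, col, width):
    """Row-major position a patch gets when the image is flattened to 1D."""
    return row * width + col
```

In a 14-wide grid, the patch directly below (0, 0) gets flat index 14, the same offset a 1D scheme would assign to a patch far along the same row in a wider image; the 2D encoding instead shares the column half between the two vertically adjacent patches.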
GutenOCR: A Grounded Vision-Language Front-End for Documents
Authors: Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew
First: 2026-01-20T21:26:15+00:00 · Latest: 2026-01-22T18:58:24+00:00
Abstract
GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
中文标题/摘要
标题:GutenOCR:文档的基于视觉语言的前端
GutenOCR 是通过微调 Qwen2.5-VL-3B 和 Qwen2.5-VL-7B 获得的一系列基于视觉语言的 OCR 前端。生成的单模型视觉语言模型通过统一的提示界面暴露了阅读、检测和定位。该模型在商业文档、科学文章和合成定位数据上进行训练,支持全页和局部阅读,具有行级和段落级的边界框,并支持条件“x 在哪里?”查询。我们引入了一种基于视觉语言的 OCR 评估协议,并展示了 GutenOCR-7B 在 10.5K 保留的商业和科学页面上将 Qwen2.5-VL-7B 主干的综合基于视觉语言的 OCR 分数提高了一倍以上(从 0.40 到 0.82)。在 Fox 和 OmniDocBench v1.5 上,我们的方法显著提高了区域级和行级 OCR 以及文本检测召回率,但揭示了页面级线性化、颜色引导 OCR 和公式密集布局方面的权衡。
Summary / 总结
GutenOCR is a family of vision-language models fine-tuned from Qwen2.5-VL-3B and Qwen2.5-VL-7B that provides unified reading, detection, and grounding through a prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, GutenOCR-7B reaches a composite grounded OCR score of 0.82 on 10,500 held-out pages, more than doubling the 0.40 score of its backbone model. It also improves region- and line-level OCR and text-detection recall, but shows trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
GutenOCR 是从 Qwen2.5-VL-3B 和 Qwen2.5-VL-7B 微调而来的一系列视觉语言模型,通过提示提供统一接口进行阅读、检测和定位。该模型经过商务文档和科学文章的训练,支持全页和局部阅读,并带有边界框。评估结果显示,GutenOCR-7B 相比其基础模型显著提高了定位(grounded)OCR得分,特别是在商务和科学页面上。然而,它在页面级线性化和公式密集布局方面面临一些挑战。
LLM-in-Sandbox Elicits General Agentic Intelligence
Authors: Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei
First: 2026-01-22T18:57:09+00:00 · Latest: 2026-01-22T18:57:09+00:00
Comments: Project Page: https://llm-in-sandbox.github.io
Abstract
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
中文标题/摘要
标题:LLM-in-Sandbox 激发通用代理智能
我们介绍了 LLM-in-Sandbox,使大语言模型能够在代码沙盒(即虚拟计算机)中探索,以激发非代码领域的通用智能。我们首先展示了强大的大语言模型在无需额外训练的情况下,能够利用代码沙盒来执行非代码任务的一般化能力。例如,大语言模型自发地访问外部资源以获取新知识,利用文件系统处理长文本,并执行脚本以满足格式要求。我们进一步表明,通过仅使用非代理数据训练用于沙盒探索的模型,LLM-in-Sandbox 强化学习(LLM-in-Sandbox-RL)可以增强这些代理能力。实验表明,无论是在无训练模式还是在训练后模式下,LLM-in-Sandbox 都能够实现涵盖数学、物理、化学、生物医学、长文本理解以及指令遵循的稳健泛化。最后,我们从计算和系统角度分析了 LLM-in-Sandbox 的效率,并将其开源为 Python 包,以促进其实用部署。
Summary / 总结
The study introduces LLM-in-Sandbox, which lets large language models (LLMs) explore a code sandbox to elicit general intelligence in non-code domains. Without additional training, strong LLMs generalize to using the sandbox for non-code tasks, for example by accessing external resources, using the file system to handle long contexts, and executing scripts to meet formatting requirements. These capabilities are further enhanced through LLM-in-Sandbox Reinforcement Learning, which trains models using only non-agentic data. Experiments show that LLM-in-Sandbox achieves robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. The study also analyzes the efficiency of LLM-in-Sandbox from computational and system perspectives and open-sources it as a Python package for real-world deployment.
研究引入了LLM-in-Sandbox,使大型语言模型(LLM)能够在代码沙箱中探索,从而在非代码领域发展出一般智能。研究展示了强大的LLM能够泛化并在非代码任务中使用代码沙箱,例如访问外部资源和处理长文本。该方法进一步通过仅使用非代理数据训练模型来增强这些能力,即LLM-in-Sandbox强化学习。实验表明,LLM-in-Sandbox在数学、物理、化学、生物医学等多个领域以及指令遵循方面实现了稳健的泛化。研究还从计算和系统角度分析了LLM-in-Sandbox的效率,并将其作为Python包开源以促进实际部署。
Training-Free Geospatial Place Representation Learning from Large-Scale Point-of-Interest Graph Data
Authors: Mohammad Hashemi, Hossein Amiri, Andreas Zufle
First: 2025-06-25T15:10:31+00:00 · Latest: 2026-01-22T18:46:50+00:00
Abstract
Learning effective representations of urban environments requires capturing spatial structure beyond fixed administrative boundaries. Existing geospatial representation learning approaches typically aggregate Points of Interest (POI) into pre-defined administrative regions such as census units or ZIP code areas, assigning a single embedding to each region. However, POIs often form semantically meaningful groups that extend across, within, or beyond these boundaries, defining places that better reflect human activity and urban function. To address this limitation, we propose PlaceRep, a training-free geospatial representation learning method that constructs place-level representations by clustering spatially and semantically related POIs. PlaceRep summarizes large-scale POI graphs from U.S. Foursquare data to produce general-purpose urban region embeddings while automatically identifying places across multiple spatial scales. By eliminating model pre-training, PlaceRep provides a scalable and efficient solution for multi-granular geospatial analysis. Experiments using the tasks of population density estimation and housing price prediction as downstream tasks show that PlaceRep outperforms most state-of-the-art graph-based geospatial representation learning methods and achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation of PlaceRep is available at https://github.com/mohammadhashemii/PlaceRep.
中文标题/摘要
标题:从大规模兴趣点图数据中学习无训练的地理空间场所表示
学习有效的城市环境表示需要捕捉超越固定行政边界的空间结构。现有的地理空间表示学习方法通常将兴趣点(POI)聚合到预先定义的行政区域中,如人口普查单位或邮政编码区域,并为每个区域分配一个单一的嵌入。然而,POI往往形成具有语义意义的群体,跨越、位于或超出这些边界,定义了更好地反映人类活动和城市功能的场所。为了解决这一局限性,我们提出了一种名为PlaceRep的无训练地理空间表示学习方法,通过聚类空间上和语义上相关的POI来构建场所级表示。PlaceRep从美国Foursquare数据中的大规模POI图中总结出通用的城市区域嵌入,并自动识别跨多个空间尺度的场所。通过消除模型预训练,PlaceRep提供了一种可扩展且高效的多粒度地理空间分析解决方案。使用人口密度估计和房价预测等下游任务进行的实验表明,PlaceRep在大规模POI图中生成区域级表示时比大多数最先进的基于图的地理空间表示学习方法性能更优,并且速度提高了多达100倍。PlaceRep的实现可在https://github.com/mohammadhashemii/PlaceRep获取。
Summary / 总结
The research aims to develop effective geospatial representations of urban environments by capturing spatial structures beyond administrative boundaries. PlaceRep, a training-free method, clusters spatially and semantically related Points of Interest to generate place-level representations, which are then used for tasks like population density estimation and housing price prediction. Experiments show that PlaceRep outperforms most state-of-the-art graph-based methods and provides up to a 100x speedup in generating region-level representations.
研究旨在开发一种无需训练的方法,从大规模POI图中学习地理空间场所表示,解决现有方法将POI聚合到预定义行政区域的局限性。PlaceRep通过聚类空间和语义相关的POI生成场所级表示,用于人口密度估计和房价预测等任务。该方法在大规模POI图上生成区域级表示时最多可提速100倍,并且表现优于大多数最先进的基于图的地理空间表示学习方法。
Multimodal Climate Disinformation Detection: Integrating Vision-Language Models with External Knowledge Sources
Authors: Marzieh Adeli Shamsabad, Hamed Ghodrati
First: 2026-01-22T16:55:48+00:00 · Latest: 2026-01-22T16:55:48+00:00
Abstract
Climate disinformation has become a major challenge in today's digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay action on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image search results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model's ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.
中文标题/摘要
标题:多模态气候虚假信息检测:结合视觉-语言模型与外部知识源
气候虚假信息已成为当今数字世界的主要挑战,尤其是在社交媒体上广泛传播误导性图片和视频的情况下。这些虚假声明往往令人信服且难以识别,这可能会延迟应对气候变化的行动。虽然视觉-语言模型(VLMs)已被用于识别视觉虚假信息,但它们仅依赖于训练时可用的知识。这限制了它们对近期事件或更新进行推理的能力。本文的主要目标是通过结合 VLMs 与外部知识来克服这一限制。通过检索最新的信息,如逆向图像搜索结果、在线事实核查和可信专家内容,该系统可以更好地评估图片及其声明是否准确、误导、虚假或无法验证。这种方法提高了模型处理真实世界气候虚假信息的能力,并支持在快速变化的信息环境中保护公众对科学的理解的努力。
Summary / 总结
The paper addresses the challenge of detecting climate disinformation by integrating vision-language models with external knowledge sources. It aims to enhance the models' ability to reason about recent events and updates, which traditional models trained on static data cannot handle. The system retrieves up-to-date information such as reverse image results, online fact-checks, and expert content to assess the accuracy of visual claims. Key findings show that this approach improves the model's capability to handle real-world climate disinformation and supports public understanding of climate science.
论文针对社交媒体上广泛传播的气候误导信息,特别是误导性的图片和视频,提出了将视觉语言模型与外部知识源结合的方法,以增强检测能力。通过整合最新的信息,如逆向图像搜索、在线事实核查和专家内容,系统可以更准确地评估视觉声明的准确性。主要发现表明,该方法在识别和区分准确、误导、虚假和无法验证的气候相关内容方面表现出更好的性能。
DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models
Authors: Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan, Yangfan He, Yuchen Li, Jingqun Tang
First: 2026-01-22T16:02:56+00:00 · Latest: 2026-01-22T16:02:56+00:00
Abstract
Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in the task-irrelevant region, which we describe as 'distracting tokens'. This behavior can distract the model from generating the desired action tokens at each step, lowering the task success rate. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate and to explore the upper bound of the model's performance without altering its original architecture or adding additional inputs. Experiments on the SIMPLER Benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between the task success rate and the amount of attention in the task-irrelevant region for all models tested, highlighting a common phenomenon of VLA models that could guide future research. We also publish our code at: https://anonymous.4open.science/r/CBD3.
中文标题/摘要
标题:DTP:一种简单有效的视觉-语言-动作模型分散令牌剪枝框架
视觉-语言-动作(VLA)模型通过利用视觉-语言模型(VLM)的强大感知能力来理解环境并直接输出动作,已经在机器人操作方面取得了显著进展。然而,默认情况下,VLA模型可能会过度关注任务无关区域的图像令牌,我们将其称为“分散令牌”。这种行为会干扰模型在每一步生成所需动作令牌的能力,影响任务的成功率。在本文中,我们介绍了一种简单有效的即插即用分散令牌剪枝(DTP)框架,该框架能够动态检测并剪枝这些分散的图像令牌。通过纠正模型的视觉注意力模式,我们旨在提高任务成功率,并在不改变其原始架构或添加额外输入的情况下探索模型的性能上限。在SIMPLER基准(Li等,2024)上的实验表明,我们的方法在不同类型的新型VLA模型中一致地提高了任务成功率,展示了其对基于变换器的VLA模型的通用性。进一步的分析揭示了所有测试模型的任务成功率与其任务无关区域注意力量之间的负相关关系,突显了VLA模型中的一种常见现象,这可以指导未来的研究。我们还发布了我们的代码:https://anonymous.4open.science/r/CBD3.
Summary / 总结
This paper introduces DTP, a simple yet effective framework for pruning distracting image tokens in Vision-Language Action (VLA) models, which helps improve the model's task success rate by correcting its visual attention patterns. Experiments on the SIMPLER Benchmark show consistent improvements in task success rates across various VLA models, indicating the framework's generalizability to transformer-based VLA models. The analysis also reveals a negative correlation between task success rate and attention on task-irrelevant regions, highlighting a common issue in VLA models.
研究旨在解决VLA模型过度关注任务无关图像令牌的问题,这会妨碍生成所需的动作令牌。提出了DTP框架,动态检测并消除这些分散注意力的令牌,从而提高任务成功率。在SIMPLER基准上的实验显示,该方法在各种VLA模型上表现出一致的改进,表明其通用性和有效性,且无需修改原始模型架构或添加新输入。
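The pruning idea described in the DTP abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that one attention weight per image token and a boolean mask of task-relevant tokens are already available; the function name `prune_distracting_tokens`, the `ratio` parameter, and the thresholding rule are illustrative assumptions, not the authors' detection procedure.

```python
from typing import List


def prune_distracting_tokens(
    attention: List[float],
    relevant: List[bool],
    ratio: float = 1.0,
) -> List[int]:
    """Return the indices of image tokens to keep.

    Hypothetical rule: a token outside the task-relevant region counts as
    distracting when its attention exceeds `ratio` times the mean attention
    of the task-relevant tokens.
    """
    if len(attention) != len(relevant):
        raise ValueError("attention and relevant must have the same length")
    relevant_scores = [a for a, r in zip(attention, relevant) if r]
    if not relevant_scores:
        return list(range(len(attention)))  # no reference region: keep everything
    threshold = ratio * sum(relevant_scores) / len(relevant_scores)
    return [
        i
        for i, (a, r) in enumerate(zip(attention, relevant))
        if r or a <= threshold
    ]


def irrelevant_share(attention: List[float], relevant: List[bool]) -> float:
    """Fraction of total attention that lands on task-irrelevant tokens."""
    return sum(a for a, r in zip(attention, relevant) if not r) / sum(attention)


# Toy example: token 3 is outside the relevant region but draws the most attention.
attention = [0.05, 0.30, 0.10, 0.40, 0.15]
relevant = [False, True, True, False, False]
keep = prune_distracting_tokens(attention, relevant)
share_before = irrelevant_share(attention, relevant)
share_after = irrelevant_share([attention[i] for i in keep], [relevant[i] for i in keep])
```

In this toy case only the high-attention irrelevant token is dropped, so the share of attention on the task-irrelevant region falls, which is the quantity the abstract reports as negatively correlated with task success.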
DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning
Authors: Junha Lee, Eunha Park, Minsu Cho
First: 2026-01-22T15:23:35+00:00 · Latest: 2026-01-22T15:23:35+00:00
Abstract
Language-driven dexterous grasp generation requires models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves a 67.14% success rate, outperforming the state of the art by 3.83 percentage points, with a 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
中文标题/摘要
标题:DextER:基于语言的灵巧抓取生成与具身推理
基于语言的灵巧抓取生成要求模型理解任务语义、3D几何和复杂的手物交互。尽管视觉语言模型已被应用于此问题,现有方法直接将观察结果映射为抓取参数,而没有关于物理交互的中间推理。我们提出了DextER,灵巧抓取生成与具身推理,引入了基于接触的具身推理进行多指操作。我们的关键见解是,预测哪只手在物体表面接触哪里提供了一种任务语义与物理约束之间的具身感知中间表示。DextER 自回归生成具身接触标记,指定哪只手指在物体表面接触哪里,随后生成抓取标记编码手的配置。在DexGYS上,DextER 达到了67.14%的成功率,比最先进的方法高出3.83%,意图对齐改进了96.4%。我们还展示了通过部分接触指定实现可引导的生成,提供了对抓取合成的精细控制。
Summary / 总结
DextER is designed to generate dexterous grasps by understanding task semantics, 3D geometry, and hand-object interactions. It introduces contact-based embodied reasoning for multi-finger manipulation, predicting which hand links contact where on the object surface. On DexGYS, DextER achieves a 67.14% success rate, outperforming state-of-the-art methods by 3.83 percentage points with a 96.4% improvement in intention alignment. It also supports steerable generation through partial contact specification, offering fine-grained control over grasp synthesis.
DextER 通过结合语言理解和实体推理来生成灵巧的抓取动作。它预测物体上的接触点,采用多指操作,将任务语义与物理约束联系起来。在 DexGYS 数据集上,DextER 达到了 67.14% 的成功率,比之前的方法高出 3.83%,并且意图对齐提高了 96.4%。此外,它还支持通过部分接触点指定进行可调节生成,提供对抓取合成的精细控制。
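The DextER success-rate margin is stated in percentage points, which is easy to confuse with a relative gain. The short calculation below shows the difference, assuming the margin is an absolute difference against a single prior method. The baseline it derives is implied by the two reported numbers, not quoted from the paper.

```python
dexter_success = 67.14  # success rate on DexGYS reported in the abstract (%)
gain_points = 3.83      # reported margin over the prior state of the art (percentage points)

# Implied prior success rate: subtract the absolute margin.
baseline_success = round(dexter_success - gain_points, 2)

# The same margin expressed as a relative improvement over that implied baseline.
relative_gain_pct = round(100 * gain_points / baseline_success, 2)

print(f"implied baseline: {baseline_success}%")
print(f"relative gain over baseline: {relative_gain_pct}%")
```

The relative gain comes out larger than the percentage-point margin because it is measured against the baseline rather than against 100%.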
SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration
Authors: Yan Zhuang, Jiawei Ren, Xiaokang Ye, Jianzhi Shen, Ruixuan Zhang, Tianai Yue, Muhammad Faayez, Xuhong He, Ziqiao Ma, Lianhui Qin, Zhiting Hu, Tianmin Shu
Venue: NeurIPS 2025
First: 2025-12-10T20:04:08+00:00 · Latest: 2026-01-22T14:26:01+00:00
Comments: Conference: NeurIPS 2025 (main)
Abstract
Recent advances in foundation models have shown promising results in developing generalist robots that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has mainly focused on indoor, household scenarios. In this work, we present SimWorld-Robotics (SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) multimodal instruction grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation with people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking the robust perception, reasoning, and planning abilities necessary for urban environments.
Summary / 总结
The research aims to develop a simulation platform for embodied AI in large-scale, photorealistic urban environments to test generalist robots on diverse tasks. SimWorld-Robotics (SWR) uses Unreal Engine 5 to generate dynamic urban scenes with pedestrians and traffic, supporting multi-robot control and communication. The platform introduces two benchmarks: a multimodal instruction-following task and a multi-agent search task, which comprehensively evaluate robots' multimodal grounding, 3D spatial reasoning, safe navigation, multi-robot collaboration, and grounded communication. Experiments show that state-of-the-art models, including vision-language models, struggle with these tasks due to limitations in perception, reasoning, and planning abilities in urban settings.
研究旨在开发一个用于多模态机器人在逼真城市环境中的导航和协作的模拟平台。该平台SimWorld-Robotics (SWR) 使用Unreal Engine 5生成包含行人和交通系统的动态城市场景,超越了之前的模拟在真实感和可扩展性方面。引入了两个基准测试:多模态指令跟随任务和多智能体搜索任务,全面评估了机器人在现实场景中的多模态语义理解、三维空间推理、安全导航、多机器人协作和基于语义的通信能力。实验结果表明,当前最先进的模型,包括视觉语言模型,在这些任务中表现不佳,缺乏在城市环境中所需的感知、推理和规划能力。
A Multi-View Pipeline and Benchmark Dataset for 3D Hand Pose Estimation in Surgery
Authors: Valery Fischer, Alan Magdaleno, Anna-Katharina Calek, Nicola Cavalcanti, Nathan Hoffman, Christoph Germann, Joschua Wüthrich, Max Krähenmann, Mazda Farshad, Philipp Fürnstahl, Lilian Calvet
First: 2026-01-22T12:48:24+00:00 · Latest: 2026-01-22T12:48:24+00:00
Abstract
Purpose: Accurate 3D hand pose estimation supports surgical applications such as skill assessment, robot-assisted interventions, and geometry-aware workflow analysis. However, surgical environments pose severe challenges, including intense and localized lighting, frequent occlusions by instruments or staff, and uniform hand appearance due to gloves, combined with a scarcity of annotated datasets for reliable model training.
Method: We propose a robust multi-view pipeline for 3D hand pose estimation in surgical contexts that requires no domain-specific fine-tuning and relies solely on off-the-shelf pretrained models. The pipeline integrates reliable person detection, whole-body pose estimation, and state-of-the-art 2D hand keypoint prediction on tracked hand crops, followed by a constrained 3D optimization. In addition, we introduce a novel surgical benchmark dataset comprising over 68,000 frames and 3,000 manually annotated 2D hand poses with triangulated 3D ground truth, recorded in a replica operating room under varying levels of scene complexity.
Results: Quantitative experiments demonstrate that our method consistently outperforms baselines, achieving a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error.
Conclusion: Our work establishes a strong baseline for 3D hand pose estimation in surgery, providing both a training-free pipeline and a comprehensive annotated dataset to facilitate future research in surgical computer vision.
中文标题/摘要
标题:手术中3D手部姿态估计的多视图管道和基准数据集
目的:准确的3D手部姿态估计支持手术应用,如技能评估、机器人辅助干预和几何感知工作流程分析。然而,手术环境带来了严重挑战,包括强烈的局部照明、频繁的器械或人员遮挡、手套导致的手部均匀外观,以及可靠的模型训练所需的标注数据稀缺性。
方法:我们提出了一种鲁棒的多视图管道,用于手术环境下的3D手部姿态估计,该管道无需特定领域的微调,仅依赖于现成的预训练模型。该管道结合了可靠的人体检测、全身姿态估计以及在跟踪的手部裁剪上使用最先进的2D手部关键点预测,随后进行约束3D优化。此外,我们还引入了一个新的手术基准数据集,包含超过68,000帧和3,000个手动标注的2D手部姿态,具有三角化3D地面真值,数据集在不同场景复杂度下记录在一个复现的手术室中。
结果:定量实验表明,我们的方法在2D平均关节误差上比基线方法降低了31%,在3D平均每个关节位置误差上降低了76%。
结论:我们的工作为手术中的3D手部姿态估计建立了强大的基线,提供了无需训练的管道和全面标注的数据集,以促进未来手术计算机视觉的研究。
Summary / 总结
The study aims to improve 3D hand pose estimation in surgical settings, which is crucial for various surgical applications. The proposed multi-view pipeline uses off-the-shelf pretrained models for person detection, whole-body pose estimation, and 2D hand keypoint prediction, followed by 3D optimization. The pipeline is validated on a new surgical benchmark dataset with over 68,000 frames and achieves a 31% reduction in 2D mean joint error and a 76% reduction in 3D mean per-joint position error compared to baselines.
研究旨在改善手术环境下的3D手部姿态估计,解决光照和遮挡等挑战。提出了一种使用现成模型进行人体检测、全身姿态估计和手部关键点预测的多视图管道,随后进行3D优化。该管道在包含超过68,000帧和3,000个标注手部姿态的新基准数据集上进行了验证,相比基线方法,显示了2D平均关节误差减少31%和3D平均每个关节位置误差减少76%的结果。
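Two quantities in the hand-pose entry, triangulated 3D ground truth and mean per-joint position error, can be made concrete with a small sketch. It uses the classic two-ray midpoint method as a stand-in, since the abstract does not say which triangulation or optimization the authors use. The camera centers, ray directions, and function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(c1: Vec3, d1: Vec3, c2: Vec3, d2: Vec3) -> Vec3:
    """Midpoint of the shortest segment between two viewing rays c + s * d."""
    w0 = tuple(a - b for a, b in zip(c1, c2))
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w0), _dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is not recoverable")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((x + y) / 2 for x, y in zip(p1, p2))


def mpjpe(pred: Sequence[Vec3], gt: Sequence[Vec3]) -> float:
    """Mean per-joint position error: average Euclidean distance over joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


# Toy check: two cameras one unit apart observe the same joint without noise.
joint = (0.5, 0.2, 2.0)
cam1, cam2 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
ray1 = tuple(j - c for j, c in zip(joint, cam1))
ray2 = tuple(j - c for j, c in zip(joint, cam2))
estimate = triangulate_midpoint(cam1, ray1, cam2, ray2)
error = mpjpe([estimate], [joint])
```

With noisy 2D keypoints the two rays no longer intersect, and the midpoint gives a simple least-squares compromise. A multi-view pipeline with more than two cameras would typically solve a joint optimization instead.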
RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture
Authors: Anas Anwarul Haq Khan, Mariam Husain, Kshitij Jadhav
First: 2026-01-22T12:11:53+00:00 · Latest: 2026-01-22T12:11:53+00:00
Abstract
Recent medical vision-language models use language supervision to guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image-text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image-text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.
中文标题/摘要
标题:RadJEPA:通过联合嵌入预测架构的胸部X光影像编码器
近期医学视觉语言模型的进步指导了视觉表示的学习;然而,这种监督形式受限于配对的图像文本数据的可用性,引发了是否可以在不依赖语言监督的情况下学习稳健的放射学编码器的问题。在本文中,我们引入了RadJEPA,这是一种基于联合嵌入预测架构的自监督框架,该框架在没有语言监督的情况下学习。该模型仅在未标记的胸部X光图像上进行预训练,学习预测遮罩图像区域的潜在表示。这种预测目标与图像文本预训练和DINO风格的自我蒸馏完全不同:RadJEPA不是在视图或模态之间对齐全局表示,而是明确建模潜在空间预测。我们在疾病分类、语义分割和报告生成任务上评估了所学习的编码器。在各个基准测试中,RadJEPA的性能超过了最先进的方法,包括Rad-DINO。
Summary / 总结
The research aims to develop a robust radiology encoder for chest X-rays without relying on paired image-text data. RadJEPA, a self-supervised framework, is introduced, which learns to predict latent representations of masked image regions. This model outperforms state-of-the-art methods, including Rad-DINO, on disease classification, semantic segmentation, and report generation tasks.
研究旨在开发一种不依赖图像-文本配对数据的胸部X射线稳健放射学编码器。引入了RadJEPA,这是一种自监督框架,能够预测遮蔽图像区域的潜在表示。该模型在疾病分类、语义分割和报告生成等任务上超过了包括Rad-DINO在内的最新方法。
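The latent-prediction objective mentioned for RadJEPA can be sketched in a few lines. This is a toy illustration under strong assumptions: the encoders are single linear maps, the predictor is a mean-pooled context plus a position embedding, and all names (`jepa_loss`, `w_context`, `w_target`, `pos_embed`) are hypothetical. The actual RadJEPA architecture is not described in the abstract.

```python
import random
from typing import List

Vector = List[float]


def linear(weights: List[Vector], x: Vector) -> Vector:
    """Apply a weight matrix (one row per output dimension) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def jepa_loss(patches, masked_idx, w_context, w_target, pos_embed) -> float:
    """Mean squared error between predicted and target latents of masked patches."""
    masked = set(masked_idx)
    visible = [p for i, p in enumerate(patches) if i not in masked]
    # Context encoder sees only visible patches (toy: per-patch linear map, mean pooled).
    context = mean([linear(w_context, p) for p in visible])
    total, count = 0.0, 0
    for i in masked_idx:
        # Toy predictor: context summary shifted by the masked slot's position embedding.
        predicted = [c + e for c, e in zip(context, pos_embed[i])]
        # Target encoder embeds the true masked patch; the loss lives in latent space.
        target = linear(w_target, patches[i])
        total += sum((p - t) ** 2 for p, t in zip(predicted, target))
        count += len(target)
    return total / count


rng = random.Random(0)
num_patches, patch_dim, latent_dim = 6, 4, 3
patches = [[rng.gauss(0, 1) for _ in range(patch_dim)] for _ in range(num_patches)]
w_context = [[rng.gauss(0, 0.5) for _ in range(patch_dim)] for _ in range(latent_dim)]
w_target = [row[:] for row in w_context]  # stand-in for a separately maintained target encoder
pos_embed = [[0.0] * latent_dim for _ in range(num_patches)]
loss = jepa_loss(patches, [1, 4], w_context, w_target, pos_embed)
```

The point of the sketch is that the error is measured between latent vectors rather than pixels, which is what separates this objective from pixel reconstruction and from aligning global image-text embeddings.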
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
Authors: Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei
First: 2026-01-21T08:09:25+00:00 · Latest: 2026-01-22T12:09:02+00:00
Abstract
Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT
中文标题/摘要
标题:思维渲染:将文本推理链渲染为图像以实现视觉潜在推理
思维链(CoT)提示在解锁大型语言模型(LLMs)的推理能力方面取得了显著成功。尽管CoT提示增强了推理能力,但其冗长性带来了巨大的计算开销。近期的工作往往专注于结果对齐,而缺乏对中间推理过程的监督。这些不足之处模糊了潜在推理链的可分析性。为了解决这些挑战,我们引入了思维渲染(RoT),这是第一个通过将文本步骤渲染为图像来实现推理链具体化的框架,使潜在的推理理由变得明确和可追踪。具体而言,我们利用现有视觉语言模型(VLMs)的视觉编码器作为语义锚点,将视觉嵌入与文本空间对齐。此设计确保了即插即用的实现,无需额外的预训练开销。在数学和逻辑推理基准测试上的广泛实验表明,与显式CoT相比,我们的方法实现了3-4倍的令牌压缩和显著的推理加速。此外,它在与其他方法的竞争中保持了竞争力,验证了此范式的可行性。我们的代码可在https://github.com/TencentBAC/RoT 获取
Summary / 总结
The paper introduces Render-of-Thought (RoT), a framework that converts textual reasoning steps into images to make latent reasoning explicit and traceable. This addresses the computational overhead of Chain-of-Thought (CoT) prompting by leveraging vision encoders of existing Vision Language Models (VLMs) for semantic alignment. Experiments show RoT achieves 3-4x token compression and significant inference acceleration while maintaining competitive performance on reasoning benchmarks.
论文提出了Render-of-Thought (RoT)框架,将文本推理步骤转换为图像,使潜在的推理过程变得明确和可追踪。通过利用现有Vision Language Models的视觉编码器,RoT将视觉嵌入与文本空间对齐,实现即插即用的实施方式。实验表明,RoT在数学和逻辑推理基准测试中实现了3-4倍的令牌压缩和显著的推理加速,同时保持了与其他方法相当的性能。
VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing
Authors: Guanyuan Pan, Shuai Wang, Yugui Lin, Tiansheng Zhou, Pietro Liò, Yaqi Wang, Zhenxin Zhao
First: 2026-01-12T08:37:32+00:00 · Latest: 2026-01-22T11:46:08+00:00
Comments: 9 pages, 4 figures, submitted to the 10th International Conference on Control, Automation and Diagnosis (ICCAD'26)
Abstract
Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches rely solely on netlists, ignoring the circuit schematic, which hinders the cognitive link between the schematic and its performance. Furthermore, machine learning methods are black boxes and large language models risk hallucination, so neither provides the ground-truth explainability required for industrial sign-off. To address these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing, and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-start from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experimental results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance while maintaining physics-based explainability. When optimizing an amplifier with a complementary input and a class-AB output stage, VLM-CAD meets all specification requirements while maintaining low power consumption, and the total runtime across all experiments on two amplifiers is under 66 minutes.
中文标题/摘要
标题:VLM-CAD:优化视觉语言模型协作代理设计工作流以实现模拟电路尺寸优化
模拟混合信号电路尺寸优化涉及高维设计空间中的复杂权衡。现有的自动模拟电路尺寸优化方法仅依赖于网表,忽略了电路原理图,阻碍了原理图与其性能之间的认知联系。此外,机器学习方法的黑箱性质和大型语言模型中的幻觉风险无法提供工业签收所需的必要的事实可解释性。为了解决这些挑战,我们提出了一种视觉语言模型优化的协作代理设计工作流(VLM-CAD),该工作流分析电路、优化直流工作点、进行基于推理的尺寸优化并执行外部尺寸优化。我们整合了Image2Net来标注电路原理图并生成结构化的JSON描述,以便视觉语言模型精确解释。此外,我们提出了一种可解释的信任区域贝叶斯优化方法(ExTuRBO),该方法采用代理生成的种子进行协作预热,并提供外部尺寸优化的双重粒度灵敏度分析,支持全面的最终设计报告。使用180nm、90nm和45nm预测技术模型进行放大器尺寸优化任务的实验结果表明,VLM-CAD在保持基于物理的可解释性的同时,有效地平衡了功率和性能。VLM-CAD在优化具有互补输入和类AB输出级的放大器时满足所有规范要求,同时保持低功耗,在两次放大器实验中总运行时间低于66分钟。
Summary / 总结
VLM-CAD is a workflow that optimizes analog circuit sizing by integrating Vision Language Models and collaborative agents. It analyzes circuits, optimizes DC operating points, and performs inference-based sizing. The method uses Image2Net for schematic annotation and an Explainable Trust Region Bayesian Optimization (ExTuRBO) for external sizing optimization, providing detailed sensitivity analysis. Experiments on amplifier sizing tasks with different technology nodes show that VLM-CAD effectively balances power and performance while maintaining physics-based explainability and low power consumption.
VLM-CAD 是一种通过结合 Vision Language Models 和协作代理优化模拟电路尺寸的工作流。它分析电路、优化直流工作点,并使用 Explainable Trust Region Bayesian Optimization (ExTuRBO) 进行详细的灵敏度分析。实验结果表明,VLM-CAD 在不同技术模型下的放大器尺寸任务中有效平衡了功率和性能,同时保持了基于物理的可解释性和低功耗。
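The warm-started trust-region idea behind ExTuRBO can be illustrated with a simplified loop. This sketch replaces the Gaussian-process surrogate of trust-region Bayesian optimization with plain uniform sampling, and it omits the sensitivity analysis entirely. The function name, tolerances, and toy objective are assumptions made for illustration only.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = List[float]


def trust_region_search(
    objective: Callable[[Point], float],
    seeds: Sequence[Point],
    bounds: Sequence[Tuple[float, float]],
    iters: int = 200,
    length: float = 0.4,
    succ_tol: int = 3,
    fail_tol: int = 5,
    min_length: float = 1e-3,
    rng_seed: int = 0,
) -> Tuple[Point, float]:
    """Minimize `objective` in a box that expands on success and shrinks on failure."""
    rng = random.Random(rng_seed)
    # Warm start: center the region on the best seed rather than a random point.
    best = list(min(seeds, key=objective))
    best_f = objective(best)
    successes = failures = 0
    for _ in range(iters):
        # Sample a candidate inside the current trust region, clipped to the bounds.
        cand = []
        for x, (lo, hi) in zip(best, bounds):
            half = 0.5 * length * (hi - lo)
            cand.append(min(hi, max(lo, x + rng.uniform(-half, half))))
        f = objective(cand)
        if f < best_f:
            best, best_f = cand, f
            successes, failures = successes + 1, 0
        else:
            successes, failures = 0, failures + 1
        if successes >= succ_tol:    # repeated progress: expand the region
            length, successes = min(2.0 * length, 1.0), 0
        elif failures >= fail_tol:   # repeated misses: contract the region
            length, failures = 0.5 * length, 0
        if length < min_length:
            break
    return best, best_f


def toy_cost(x: Point) -> float:
    """Hypothetical stand-in for a simulated circuit cost (lower is better)."""
    return (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2


seeds = [[0.9, 0.1], [0.5, 0.5]]  # stand-ins for agent-generated starting points
bounds = [(0.0, 1.0), (0.0, 1.0)]
best_x, best_cost = trust_region_search(toy_cost, seeds, bounds)
```

Starting from agent-proposed seeds means the first trust region is already centered on a plausible design, so fewer evaluations are spent on regions a designer would rule out immediately.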
MMP-A*: Multimodal Perception Enhanced Incremental Heuristic Search on Path Planning
Authors: Minh Hieu Ha, Khanh Ly Ta, Hung Phan, Tung Doan, Tung Dao, Dao Tran, Huynh Thi Thanh Binh
First: 2026-01-05T08:55:27+00:00 · Latest: 2026-01-22T10:24:37+00:00
Abstract
Autonomous path planning requires a synergy between global reasoning and geometric precision, especially in complex or cluttered environments. While classical A* is valued for its optimality, it incurs prohibitive computational and memory costs in large-scale scenarios. Recent attempts to mitigate these limitations by using Large Language Models for waypoint guidance remain insufficient, as they rely only on text-based reasoning without spatial grounding. As a result, such models often produce incorrect waypoints in topologically complex environments with dead ends, and lack the perceptual capacity to interpret ambiguous physical boundaries. These inconsistencies lead to costly corrective expansions and undermine the intended computational efficiency.
We introduce MMP-A*, a multimodal framework that integrates the spatial grounding capabilities of vision-language models with a novel adaptive decay mechanism. By anchoring high-level reasoning in physical geometry, the framework produces coherent waypoint guidance that addresses the limitations of text-only planners. The adaptive decay mechanism dynamically regulates the influence of uncertain waypoints within the heuristic, ensuring geometric validity while substantially reducing memory overhead. To evaluate robustness, we test the framework in challenging environments characterized by severe clutter and topological complexity. Experimental results show that MMP-A* achieves near-optimal trajectories with significantly reduced operational costs, demonstrating its potential as a perception-grounded and computationally efficient paradigm for autonomous navigation.
Summary / 总结
MMP-A* is a multimodal framework that combines the spatial grounding of vision-language models with an adaptive decay mechanism to enhance path planning in complex environments. It addresses the limitations of text-only planners by producing coherent waypoints and ensuring geometric validity. Experimental results show that MMP-A* achieves near-optimal trajectories with reduced operational costs, making it a promising approach for autonomous navigation.
论文提出了MMP-A*,一种结合视觉语言模型的空间定位能力和自适应衰减机制的多模态路径规划框架,以提高自主导航的效率和准确性。该框架通过生成连贯的航点指导并减少内存开销,解决了经典A*和基于文本的规划器的局限性。在复杂且充满障碍的环境中进行的实验结果显示,MMP-A*能够实现接近最优的轨迹,并且具有较低的操作成本。
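The interplay between waypoint guidance and a decaying influence term in MMP-A* can be sketched on a grid. The guidance rule below, a waypoint distance scaled by `weight * decay ** g`, is a hypothetical choice for illustration, since the abstract does not define the adaptive decay mechanism. The waypoint is hard-coded here, whereas in the paper it comes from a vision-language model.

```python
import heapq
from typing import List, Optional, Tuple

Cell = Tuple[int, int]


def manhattan(a: Cell, b: Cell) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def guided_astar(
    grid: List[List[int]],
    start: Cell,
    goal: Cell,
    waypoint: Optional[Cell] = None,
    weight: float = 0.5,
    decay: float = 0.9,
) -> Optional[List[Cell]]:
    """A* on a 4-connected grid (1 = obstacle) with an optional waypoint bias.

    The bias weight * decay**g * distance-to-waypoint fades as path cost g grows,
    so an unreliable waypoint matters less and less deep into the search.
    """
    rows, cols = len(grid), len(grid[0])

    def priority(cell: Cell, g: int) -> float:
        h = float(manhattan(cell, goal))
        if waypoint is not None:
            h += weight * (decay ** g) * manhattan(cell, waypoint)
        return g + h

    open_heap = [(priority(start, 0), 0, start)]
    came_from = {}
    best_g = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if g > best_g[cell]:
            continue  # stale heap entry superseded by a cheaper route
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (priority(nxt, ng), ng, nxt))
    return None


# Serpentine map: the only route passes the gap at (1, 4) and then the gap at (3, 0).
grid = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
plain_path = guided_astar(grid, (0, 0), (4, 4))
guided_path = guided_astar(grid, (0, 0), (4, 4), waypoint=(1, 4))
```

Adding the bias makes the heuristic inadmissible in general, so the guided search trades a strict optimality guarantee for fewer expansions. That trade-off is consistent with the near-optimal trajectories the abstract reports.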