Towards Understanding Best Practices for Quantization of Vision-Language Models
Authors: Gautom Das, Vincent La, Ethan Lau, Abhinav Shrivastava, Matthew Gwilliam
First: 2026-01-21T18:59:51+00:00 · Latest: 2026-01-21T18:59:51+00:00
Comments: 15 pages, 12 figures, 1 table
Abstract
Large language models (LLMs) deliver impressive results for a variety of tasks, but state-of-the-art systems require fast GPUs with large amounts of memory. To reduce both the memory footprint and latency of these systems, practitioners quantize their learned parameters, typically at half precision. A growing body of research focuses on preserving the model performance with more aggressive bit widths, and some work has been done to apply these strategies to other models, like vision transformers. In our study we investigate how a variety of quantization methods, including state-of-the-art GPTQ and AWQ, can be applied effectively to multimodal pipelines composed of vision models, language models, and their connectors. We address how performance on captioning, retrieval, and question answering can be affected by bit width, quantization method, and which portion of the pipeline is quantized. Results reveal that the ViT and the LLM exhibit comparable importance in model performance, despite significant differences in parameter count, and that lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw). These findings provide practical insights for efficient deployment of multimodal LLMs (MLLMs) and highlight the value of exploration for understanding component sensitivities in multimodal models. Our code is available at https://github.com/gautomdas/mmq.
中文标题/摘要
标题:理解视觉-语言模型量化最佳实践
大型语言模型(LLMs)在各种任务中表现出色,但最先进的系统需要快速的GPU和大量的内存。为了减少这些系统的内存和延迟,从业者通常会将它们学习到的参数量化为半精度。越来越多的研究集中在使用更激进的位宽来保持模型性能,一些工作已经将这些策略应用于其他模型,如视觉变换器。在我们的研究中,我们探讨了如何有效地将包括最先进的GPTQ和AWQ在内的各种量化方法应用于由视觉模型、语言模型及其连接器组成的多模态管道。我们研究了位宽、量化方法以及量化在管道中的使用位置如何影响字幕生成、检索和问答任务的性能。结果表明,尽管参数规模存在显著差异,ViT和LLM在模型性能中具有相当的重要性,并且LLM的低位量化可以在减少每个权重位数(bpw)的情况下实现高精度。这些发现为高效部署多模态大语言模型提供了实用见解,并突显了探索多模态模型组件敏感性的价值。我们的代码可在https://github.com/gautomdas/mmq/获取。
Summary / 总结
This study investigates the application of various quantization methods, including GPTQ and AWQ, to multimodal pipelines involving vision models, language models, and their connectors. The research aims to understand how different bit widths and quantization techniques affect performance in tasks such as captioning, retrieval, and question answering. Key findings include the comparable importance of ViT and LLMs in model performance despite their size differences, and the effectiveness of lower-bit quantization of LLMs in achieving high accuracy with reduced memory usage.
研究探讨了GPTQ和AWQ等不同量化方法在包含视觉和语言模型的多模态管道中的应用。研究旨在了解不同位宽和量化技术对任务如图像字幕、检索和问答性能的影响。关键发现表明,视觉变换器(ViT)和大型语言模型(LLM)对于模型性能都至关重要,且LLM即使在较低位量化下也能保持高精度,减少每个权重的位数。这些发现对于高效部署多模态大型语言模型(MLLM)具有重要意义。
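As a rough illustration of the bit-width and bits-per-weight (bpw) terms used in this entry, the sketch below applies plain symmetric round-to-nearest quantization and computes a parameter-weighted bpw for a mixed pipeline. This is an assumption for illustration only: it is not GPTQ or AWQ, it is not the authors' code, and the function names and parameter counts are hypothetical.

```python
from typing import List, Tuple


def quantize_rtn(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization to signed `bits`-bit codes.

    Illustrative baseline only; GPTQ and AWQ use more elaborate procedures.
    """
    if bits < 2:
        raise ValueError("bits must be at least 2 for a signed symmetric grid")
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest grid point and clamp to the valid range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def average_bpw(components: List[Tuple[int, int]]) -> float:
    """Parameter-weighted mean bits per weight over (num_params, bits) pairs."""
    total_params = sum(n for n, _ in components)
    return sum(n * b for n, b in components) / total_params


if __name__ == "__main__":
    codes, scale = quantize_rtn([0.5, -1.0, 0.25, 0.0], bits=4)
    print(codes, dequantize(codes, scale))
    # Hypothetical pipeline: a small vision tower kept at 16 bits, a larger LLM at 4 bits.
    print(average_bpw([(1_000, 16), (3_000, 4)]))
```

Because the LLM usually holds most of the parameters, its bit width dominates the average, which is one reason the per-component sensitivity studied in the paper matters.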
Iterative Refinement Improves Compositional Image Generation
Authors: Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj, Zheyang Qin, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak
First: 2026-01-21T18:59:40+00:00 · Latest: 2026-01-21T18:59:40+00:00
Comments: Project webpage: https://iterative-img-gen.github.io/
Abstract
Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/
中文标题/摘要
标题:迭代优化提升组合图像生成
文本到图像(T2I)模型取得了显著进展,但仍难以处理需要同时处理多个对象、关系和属性的复杂提示。现有的推理时策略,如并行采样带验证器或简单增加去噪步骤,可以改善提示对齐,但在许多约束必须满足的丰富组合场景中仍不充分。受大型语言模型中链式思考推理成功的启发,我们提出了一种迭代测试时策略,在该策略中,T2I模型在多个步骤中逐步优化其生成,由循环中的视觉语言模型作为批评者提供反馈。我们的方法简单,无需外部工具或先验知识,可以灵活应用于各种图像生成器和视觉语言模型。实验证明,我们的方法在基准测试中的一致改进:在ConceptMix(k=7)上提高了16.9%的全正确率,在T2I-CompBench(3D-空间类别)上提高了13.8%,在视觉积木场景分解上提高了12.5%,与计算匹配的并行采样相比。除了定量改进,迭代优化生成更忠实的图像,通过将复杂提示分解为顺序修正,人类评估者更偏好我们的方法,占58.7%,而并行基线为41.3%。这些发现共同强调了迭代自我修正作为组合图像生成广泛适用原则的重要性。结果和可视化可在https://iterative-img-gen.github.io/获取
Summary / 总结
The paper addresses the challenge of generating complex images from multi-object and multi-attribute prompts, which current text-to-image models struggle with. It introduces an iterative refinement strategy where a T2I model generates images in multiple steps, receiving feedback from a vision-language model. This approach consistently improves generation quality across benchmarks, showing a 16.9% increase in all-correct rate on ConceptMix (k=7), 13.8% on T2I-CompBench (3D-Spatial category), and 12.5% on Visual Jenga scene decomposition compared to parallel sampling. Human evaluators also prefer the iterative method over the parallel baseline. The method is simple and can be applied to various image generators and vision-language models without additional tools or priors.
本文提出了一种迭代细化策略,以解决从文本提示生成复杂图像的挑战。该方法涉及文本到图像模型在多步骤中逐步改进其输出,并由视觉语言模型提供反馈。实验结果显示,在基准测试中的一致改进,包括在ConceptMix(k=7)上的16.9%提高正确率,在T2I-CompBench(3D-Spatial类别)上的13.8%提高,以及在Visual Jenga场景分解上的12.5%提高。迭代细化还生成了更忠实的图像,人类评估者中有58.7%的时间更偏好这种方法,而不是平行基线。
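The contrast between compute-matched parallel sampling and critic-guided iterative refinement can be sketched as below. This is a minimal sketch under stated assumptions: `sample` and `critic` are hypothetical callables standing in for a T2I model and a vision-language critic, and the score threshold of 1.0 is an illustrative stopping rule, not the paper's procedure.

```python
from typing import Callable, Optional, Tuple, TypeVar

Image = TypeVar("Image")
# Hypothetical interfaces (assumptions, not from the paper):
#   sample(prompt, previous_image, feedback) -> image
#   critic(prompt, image) -> (score in [0, 1], textual feedback)
Sampler = Callable[[str, Optional[Image], Optional[str]], Image]
Critic = Callable[[str, Image], Tuple[float, str]]


def best_of_n(prompt: str, sample: Sampler, critic: Critic, budget: int) -> Image:
    """Parallel baseline: draw `budget` independent samples and keep the best-scored one."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    candidates = [sample(prompt, None, None) for _ in range(budget)]
    return max(candidates, key=lambda img: critic(prompt, img)[0])


def iterative_refine(prompt: str, sample: Sampler, critic: Critic, budget: int) -> Image:
    """Sequential variant: each generation sees the previous image and the critic's feedback."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    image: Optional[Image] = None
    feedback: Optional[str] = None
    for _ in range(budget):
        image = sample(prompt, image, feedback)
        score, feedback = critic(prompt, image)
        if score >= 1.0:  # illustrative stopping rule: critic reports nothing left to fix
            break
    return image
```

Both functions call the generator at most `budget` times, which is the sense in which the comparison is compute-matched here; the critic's cost is ignored in this simplification.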
PROGRESSLM: Towards Progress Reasoning in Vision-Language Models
Authors: Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu
First: 2026-01-21T17:56:59+00:00 · Latest: 2026-01-21T17:56:59+00:00
Comments: Website: https://progresslm.github.io/ProgressLM/
Abstract
Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.
中文标题/摘要
标题:PROGRESSLM:迈向视觉语言模型中的进度推理
估计任务进度需要推理长时动态,而不仅仅是识别静态视觉内容。尽管现代视觉语言模型(VLMs)在描述可见内容方面表现出色,但尚不清楚它们是否能够从部分观察中推断出任务的进展情况。为此,我们引入了Progress-Bench,用于系统评估VLMs的进度推理能力。除了基准测试外,我们还通过无训练提示和基于精心构建的数据集ProgressLM-45K的训练方法,进一步探索了灵感来源于人类的两阶段进度推理范式。在14个VLMs上的实验表明,大多数模型尚未准备好进行任务进度估计,表现出对演示模态和视角变化的敏感性,以及对无法回答的情况处理不佳。虽然无训练提示强制结构化的进度推理仅能带来有限且模型依赖的收益,但基于训练的ProgressLM-3B即使在小型模型规模下也能实现一致的改进,尽管其训练任务集与评估任务集完全不重叠。进一步的分析揭示了特征错误模式,并阐明了进度推理何时以及为何成功或失败。
CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation
Authors: V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin
First: 2025-12-23T13:44:41+00:00 · Latest: 2026-01-21T16:42:28+00:00
Comments: 37 pages, 42 figures
Abstract
Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of thinking based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free and model-agnostic framework for multimodal image generation. CRAFT transforms a user prompt into a set of explicit, dependency-structured visual constraints, verifies generated images using a vision-language model, and performs targeted prompt updates only when specific constraints are violated. This iterative process includes an explicit stopping criterion, resulting in an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.
中文标题/摘要
标题:CRAFT:连续推理和自主反馈调优的多模态文本到图像生成
近期研究表明,在不重新训练的情况下,推理时的推理和反思可以提高文本到图像生成的效果。然而,现有方法往往依赖于隐式的、整体的批评或不受限制的提示重写,这使得它们的行为难以解释、控制或可靠地停止。相比之下,大型语言模型得益于基于验证、目标修正和早期停止的明确、结构化的思考形式。我们提出了CRAFT(连续推理和自主反馈调优),这是一种无需训练且模型无关的多模态图像生成框架。CRAFT 将用户提示转换为一组明确的、依赖结构化的视觉约束,使用视觉语言模型验证生成的图像,并仅在特定约束被违反时进行有针对性的提示更新。这个迭代过程包括一个明确的停止标准,从而形成一个可解释且可控的推理时细化循环。在多个模型家族和具有挑战性的基准测试中,CRAFT 一致地提高了组合准确性、文本呈现和基于偏好的评估,特别是在轻量级生成器方面取得了显著的改进。重要的是,这些改进仅带来了微不足道的推理时开销,使得较小或更便宜的模型能够接近更昂贵系统的质量。我们的结果表明,明确结构化的、基于约束的推理时推理是提高多模态生成模型可靠性的关键成分。
Summary / 总结
CRAFT is a training-free and model-agnostic framework for multimodal text-to-image generation that transforms user prompts into explicit visual constraints, verifies generated images using a vision-language model, and updates prompts only when constraints are violated. This iterative process, with an explicit stopping criterion, leads to improved compositional accuracy, text rendering, and preference-based evaluations, especially for lightweight generators, with minimal inference-time overhead.
CRAFT 是一种无需训练且模型无关的框架,将用户提示转换为显式的视觉约束,使用视觉-语言模型验证图像,并仅在约束被违反时更新提示。这一迭代过程包含明确的停止标准,从而提高了组合准确性、文本渲染和基于偏好的评估,尤其是对于轻量级生成器,且几乎不增加推理时间开销。
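The loop described in this entry (explicit constraints with dependencies, verification, targeted prompt updates, explicit stopping) can be sketched roughly as follows. This is an illustrative reading of the abstract, not the CRAFT implementation: the `Constraint` class, the `render` and `verify` callables, the "Ensure ..." update format, and the `max_updates` budget are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple, TypeVar

Image = TypeVar("Image")


@dataclass(frozen=True)
class Constraint:
    """One explicit visual requirement; `depends_on` names a parent constraint (hypothetical schema)."""
    name: str
    description: str
    depends_on: Optional[str] = None


def violated(constraints: List[Constraint], check: Callable[[Constraint], bool]) -> List[Constraint]:
    """Return failing constraints; a child is skipped while its parent is unmet.

    Assumes `constraints` lists parents before their children.
    """
    status: Dict[str, bool] = {}
    failed: List[Constraint] = []
    for c in constraints:
        if c.depends_on is not None and not status.get(c.depends_on, False):
            status[c.name] = False  # not checked yet: the parent must be fixed first
            continue
        status[c.name] = check(c)
        if not status[c.name]:
            failed.append(c)
    return failed


def constraint_loop(
    prompt: str,
    constraints: List[Constraint],
    render: Callable[[str], Image],              # stand-in for an image generator
    verify: Callable[[Image, Constraint], bool],  # stand-in for a VLM verifier
    max_updates: int = 3,
) -> Tuple[str, Image, List[Constraint]]:
    """Regenerate only while some constraint fails and the update budget remains."""
    current = prompt
    image = render(current)
    failed = violated(constraints, lambda c: verify(image, c))
    updates = 0
    while failed and updates < max_updates:  # explicit stopping criterion
        fixes = " ".join(f"Ensure {c.description}." for c in failed)
        current = f"{current} {fixes}"  # targeted update: mentions only violated constraints
        image = render(current)
        failed = violated(constraints, lambda c: verify(image, c))
        updates += 1
    return current, image, failed
```

Returning the remaining `failed` list alongside the image keeps the outcome inspectable: an empty list means the loop stopped because every constraint passed, a non-empty one means the budget ran out.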
Training-Free and Interpretable Hateful Video Detection via Multi-stage Adversarial Reasoning
Authors: Shuonan Yang, Yuchen Zhang, Zeyu Fu
Venue: ICASSP 2026
First: 2026-01-21T15:52:26+00:00 · Latest: 2026-01-21T15:52:26+00:00
Comments: Accepted at ICASSP 2026. © 2026 IEEE. This is the author accepted manuscript. The final published version will be available via IEEE Xplore
Abstract
Hateful videos pose serious risks by amplifying discrimination, inciting violence, and undermining online safety. Existing training-based hateful video detection methods are constrained by limited training data and lack of interpretability, while directly prompting large vision-language models often struggles to deliver reliable hate detection. To address these challenges, this paper introduces MARS, a training-free Multi-stage Adversarial ReaSoning framework that enables reliable and interpretable hateful content detection. MARS begins with the objective description of video content, establishing a neutral foundation for subsequent analysis. Building on this, it develops evidence-based reasoning that supports potential hateful interpretations, while in parallel incorporating counter-evidence reasoning to capture plausible non-hateful perspectives. Finally, these perspectives are synthesized into a conclusive and explainable decision. Extensive evaluation on two real-world datasets shows that MARS achieves up to 10% improvement under certain backbones and settings compared to other training-free approaches and outperforms state-of-the-art training-based methods on one dataset. In addition, MARS produces human-understandable justifications, thereby supporting compliance oversight and enhancing the transparency of content moderation workflows. The code is available at https://github.com/Multimodal-Intelligence-Lab-MIL/MARS.
中文标题/摘要
标题:基于多阶段对抗推理的无训练可解释仇恨视频检测
仇恨视频通过放大歧视、煽动暴力和破坏在线安全等方式带来严重风险。现有的基于训练的仇恨视频检测方法受限于训练数据有限且缺乏可解释性,而直接对大型视觉-语言模型进行提示往往难以提供可靠的仇恨检测。为解决这些挑战,本文提出了一种无训练的多阶段对抗推理框架MARS,以实现可靠且可解释的仇恨内容检测。MARS从客观描述视频内容开始,建立后续分析的中立基础。在此基础上,它发展了基于证据的推理,支持潜在的仇恨解释,同时并行地纳入反证据推理以捕捉可能的非仇恨视角。最后,这些视角被综合成一个明确且可解释的决策。在两个真实世界数据集上的广泛评估表明,MARS在某些骨干网络和设置下比其他无训练方法最多提高了10%,并在其中一个数据集上优于最先进的基于训练的方法。此外,MARS生成了人类可理解的解释,从而支持合规监督并增强内容审核流程的透明度。代码可在https://github.com/Multimodal-Intelligence-Lab-MIL/MARS/ 获取。
Summary / 总结
This paper addresses the challenges of detecting hateful videos by introducing MARS, a training-free Multi-stage Adversarial ReaSoning framework. MARS starts with a neutral description of the video content, then develops evidence-based reasoning for potential hateful interpretations while incorporating counter-evidence reasoning to capture non-hateful perspectives. The framework synthesizes these perspectives into a conclusive and explainable decision. Experimental results on two real-world datasets show that MARS achieves up to 10% improvement over other training-free approaches under certain backbones and settings, outperforms state-of-the-art training-based methods on one dataset, and provides human-understandable justifications for content moderation.
本文提出了一种无需训练的多阶段对抗推理框架 MARS,以解决仇恨视频检测的挑战。MARS 从中立的视频内容描述开始,然后发展出基于证据的推理来支持潜在的仇恨解释,同时结合反证据来捕捉非仇恨视角。该框架将这些视角综合成一个结论性的、可解释的决策。实验结果表明,在两个真实世界数据集上,MARS 在某些骨干网络和设置下比其他无需训练的方法最多提高10%,并在其中一个数据集上优于最先进的基于训练的方法,同时提供了可理解的解释,支持内容审核的合规性和透明度。
Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization
Authors: Rui Chen, Bin Liu, Changtao Miao, Xinghao Wang, Yi Li, Tao Gong, Qi Chu, Nenghai Yu
First: 2025-10-11T08:42:31+00:00 · Latest: 2026-01-21T15:39:57+00:00
Comments: This version was uploaded in error and contains misleading information found in an early draft. The manuscript requires extensive and long-term revisions
Abstract
Advances in image tampering pose serious security threats, underscoring the need for effective image manipulation localization (IML). While supervised IML achieves strong performance, it depends on costly pixel-level annotations. Existing weakly supervised or training-free alternatives often underperform and lack interpretability. We propose the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML tasks. ICFC integrates an objectified rule construction with adaptive filtering to build a reliable knowledge base and a multi-step progressive reasoning pipeline that mirrors expert forensic workflows from coarse proposals to fine-grained forensics results. This design enables systematic exploitation of MLLM reasoning for image-level classification, pixel-level localization, and text-level interpretability. Across multiple benchmarks, ICFC not only surpasses state-of-the-art training-free methods but also achieves competitive or superior performance compared to weakly and fully supervised approaches.
Summary / 总结
The research aims to address the security threats posed by image tampering by developing a training-free framework for image manipulation localization. The In-Context Forensic Chain (ICFC) leverages multi-modal large language models to construct a reliable knowledge base and a multi-step reasoning pipeline, which surpasses state-of-the-art training-free methods and performs competitively with weakly and fully supervised approaches in multiple benchmarks.
研究旨在通过开发一个无需训练的框架来解决图像篡改带来的安全威胁,该框架名为In-Context Forensic Chain (ICFC),利用多模态大型语言模型构建可靠的知识库和多步推理管道,模拟专家的法医工作流程。该框架在多个基准测试中不仅超越了现有的无需训练的方法,而且在弱监督和完全监督的方法中也达到了竞争或更优的性能。
Unified Multi-Dataset Training for TBPS
Authors: Nilanjana Chatterjee, Sidharatha Garg, A V Subramanyam, Brejesh Lall
First: 2026-01-21T13:26:28+00:00 · Latest: 2026-01-21T13:26:28+00:00
Abstract
Text-Based Person Search (TBPS) has seen significant progress with vision-language models (VLMs), yet it remains constrained by limited training data and the fact that VLMs are not inherently pre-trained for pedestrian-centric recognition. Existing TBPS methods therefore rely on dataset-centric fine-tuning to handle distribution shift, resulting in multiple independently trained models for different datasets. While synthetic data can increase the scale needed to fine-tune VLMs, it does not eliminate dataset-specific adaptation. This motivates a fundamental question: can we train a single unified TBPS model across multiple datasets? We show that naive joint training over all datasets remains sub-optimal because current training paradigms do not scale to a large number of unique person identities and are vulnerable to noisy image-text pairs. To address these challenges, we propose Scale-TBPS with two contributions: (i) a noise-aware unified dataset curation strategy that cohesively merges diverse TBPS datasets; and (ii) a scalable discriminative identity learning framework that remains effective under a large number of unique identities. Extensive experiments on CUHK-PEDES, ICFG-PEDES, RSTPReid, IIITD-20K, and UFine6926 demonstrate that a single Scale-TBPS model outperforms dataset-centric optimized models and naive joint training.
中文标题/摘要
标题:TBPS的统一多数据集训练
基于文本的人体搜索(TBPS)在视觉-语言模型(VLMs)的帮助下取得了显著进展,但仍然受到训练数据有限的限制,且VLMs本身并不天然预训练用于行人中心识别。因此,现有的TBPS方法依赖于数据集中心的微调来处理分布偏移,导致为不同的数据集训练多个独立的模型。虽然合成数据可以增加用于微调VLMs所需的规模,但它并不能消除数据集特定的适应性。这促使了一个基本问题:我们能否训练一个跨越多个数据集的统一TBPS模型?我们展示了对所有数据集进行简单的联合训练仍然是次优的,因为当前的训练范式无法扩展到大量的独特行人身份,并且容易受到噪声的图像-文本对的影响。为了解决这些挑战,我们提出了Scale-TBPS,并提出了两个贡献:(i)一种噪声感知的统一数据集编纂策略,将多样化的TBPS数据集综合合并;(ii)一种可扩展的区分性身份学习框架,即使在大量的独特身份下也能保持有效性。在CUHK-PEDES、ICFG-PEDES、RSTPReid、IIITD-20K和UFine6926上的广泛实验表明,一个单一的Scale-TBPS模型优于数据集中心优化的模型和简单的联合训练。
Summary / 总结
The research aims to address the limitations of Text-Based Person Search (TBPS) models, particularly their reliance on dataset-specific fine-tuning and vulnerability to noisy data. The authors propose Scale-TBPS, which includes a noise-aware dataset curation strategy and a scalable identity learning framework. Experimental results on multiple datasets show that Scale-TBPS outperforms dataset-centric models and naive joint training approaches.
研究旨在解决基于文本的人体搜索(TBPS)模型的局限性,特别是需要针对特定数据集进行微调以及易受噪声数据影响的问题。为克服这些问题,作者提出了Scale-TBPS,其中包括一种噪声感知的统一数据集整理策略和一种在大量唯一身份下仍有效的可扩展鉴别身份学习框架。实验结果显示,单个Scale-TBPS模型在CUHK-PEDES、ICFG-PEDES、RSTPReid、IIITD-20K和UFine6926等多个数据集上优于针对特定数据集优化的模型和简单的联合训练。
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis
Authors: Angelos Zavras, Dimitrios Michail, Xiao Xiang Zhu, Begüm Demir, Ioannis Papoutsis
First: 2025-02-13T18:52:14+00:00 · Latest: 2026-01-21T12:51:46+00:00
Comments: 26 pages, 14 figures
Abstract
Existing Vision-Language Models (VLMs) are predominantly trained on web-scraped, noisy image-text data and have limited exposure to the specialized domain of remote sensing (RS). This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead focus solely on attributes like date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises 201,005 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset is spatially and temporally balanced: it spans the globe and covers the last 25 years of observations. GAIA's construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions for each image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval and image captioning tasks. We make our dataset, automated processing framework and fine-tuned model weights publicly available on our project's GitHub repository: https://github.com/Orion-AI-Lab/GAIA.
中文标题/摘要
标题:GAIA:面向遥感图像分析的全球多模态多尺度视觉语言数据集
现有的视觉语言模型(VLMs)主要在网页抓取的嘈杂图像-文本数据上进行训练,对遥感(RS)专业领域的暴露有限,这导致其在RS特定任务上的表现不佳。由于常用数据集往往缺乏详细的、科学准确的文字描述,而更侧重于日期和地点等属性。为解决这一关键问题,我们引入了GAIA,一个专为多尺度、多传感器和多模态RS图像分析设计的新数据集。GAIA包含201,005个精心策划的RS图像-文本对,涵盖了不同空间分辨率的多种RS模态。与现有RS领域的视觉语言数据集不同,GAIA特别关注捕捉各种RS应用的多样性,提供关于环境变化、自然灾害和其他动态现象的独特信息。该数据集在全球范围内提供了空间和时间上的平衡分布,覆盖了过去25年,并且观测时间分布平衡。GAIA的构建涉及两个阶段:(1)针对遥感相关来源的网页抓取,获取图像及其配套文本;(2)使用精心设计的提示生成每个图像的五个高质量、科学依据充足的合成描述,利用GPT-4o的高级视觉语言能力。我们的实验表明,GAIA在RS图像分类、跨模态检索和图像描述任务上显著提高了性能。我们已将数据集、自动化处理框架和微调模型权重公开发布在我们的GitHub项目库:https://github.com/Orion-AI-Lab/GAIA。
Summary / 总结
GAIA is a new dataset for remote sensing (RS) image analysis, addressing the limitations of existing vision-language models by providing a global, multi-modal, and multi-scale dataset. It includes 201,005 RS image-text pairs with detailed and scientifically accurate descriptions, covering various RS applications and environmental changes. Experiments show that fine-tuning models like CLIP and BLIP2 with GAIA improves performance in RS image classification, cross-modal retrieval, and image captioning tasks.
GAIA 是一个针对遥感 (RS) 图像分析的新数据集,通过提供详细且科学准确的文字描述来解决现有视觉-语言数据集的局限性。它包含了201,005个RS图像-文本对,涵盖了各种空间分辨率和多种RS应用。数据集通过网络抓取和使用GPT-4o生成高质量的合成描述构建而成。实验表明,GAIA在RS图像分类、跨模态检索和图像描述任务中显著提高了模型性能。
Vision-Language Models on the Edge for Real-Time Robotic Perception
Authors: Sarat Ahmad, Maryam Hafeez, Syed Ali Raza Zaidi
First: 2026-01-21T12:09:48+00:00 · Latest: 2026-01-21T12:09:48+00:00
Abstract
Vision-Language Models (VLMs) enable multimodal reasoning for robotic perception and interaction, but their deployment in real-world systems remains constrained by latency, limited onboard resources, and privacy risks of cloud offloading. Edge intelligence within 6G, particularly Open RAN and Multi-access Edge Computing (MEC), offers a pathway to address these challenges by bringing computation closer to the data source. This work investigates the deployment of VLMs on ORAN/MEC infrastructure using the Unitree G1 humanoid robot as an embodied testbed. We design a WebRTC-based pipeline that streams multimodal data to an edge node and evaluate LLaMA-3.2-11B-Vision-Instruct deployed at the edge versus in the cloud under real-time conditions. Our results show that edge deployment preserves near-cloud accuracy while reducing end-to-end latency by 5%. We further evaluate Qwen2-VL-2B-Instruct, a compact model optimized for resource-constrained environments, which achieves sub-second responsiveness, cutting latency by more than half but at the cost of accuracy.
中文标题/摘要
标题:边缘端的视觉-语言模型用于实时机器人感知
视觉-语言模型(VLMs)能够实现多模态推理,用于机器人感知和交互,但在实际系统中的部署受到延迟、有限的机载资源以及云卸载隐私风险的限制。6G中的边缘智能,特别是Open RAN和多接入边缘计算(MEC),提供了一种解决这些挑战的途径,通过将计算带向数据源。本研究探讨了在ORAN/MEC基础设施上部署VLMs的方法,使用Unitree G1人形机器人作为具身测试平台。我们设计了一个基于WebRTC的流水线,将多模态数据流式传输到边缘节点,并在实时条件下评估在边缘和云中部署的LLaMA-3.2-11B-Vision-Instruct的性能。结果显示,边缘部署保持了接近云的准确性,同时将端到端延迟降低了5%。我们还评估了Qwen2-VL-2B-Instruct,这是一种针对资源受限环境优化的紧凑型模型,实现了亚秒级响应,将延迟降低了超过一半,但准确性有所下降。
Summary / 总结
This work investigates the deployment of Vision-Language Models (VLMs) on Open RAN and Multi-access Edge Computing (MEC) infrastructure to address latency and resource constraints in robotic perception. Using the Unitree G1 humanoid robot as a testbed, the study evaluates LLaMA-3.2-11B-Vision-Instruct deployed at the edge versus in the cloud under real-time conditions. The results demonstrate that edge deployment maintains near-cloud accuracy while reducing end-to-end latency by 5%. A compact model optimized for resource-constrained environments, Qwen2-VL-2B-Instruct, achieves sub-second responsiveness and cuts latency by more than half, but at the cost of accuracy.
这项工作旨在通过部署在Open RAN和Multi-access Edge Computing (MEC)等边缘智能基础设施上,解决Vision-Language Models (VLMs)在实际机器人感知中的局限性。研究使用Unitree G1人形机器人作为测试平台,评估了LLaMA-3.2-11B-Vision-Instruct和Qwen2-VL-2B-Instruct模型的性能。结果表明,边缘部署可以保持接近云端的准确性,同时将端到端延迟降低5%,而一个紧凑型模型可以实现亚秒级响应,但准确性有所下降。
SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval
Authors: Xinyi Zheng, Yunze Liu, Chi-Hao Wu, Fan Zhang, Hao Zheng, Wenqi Zhou, Walterio W. Mayol-Cuevas, Junxiao Shen
First: 2026-01-21T11:32:24+00:00 · Latest: 2026-01-21T11:32:24+00:00
Abstract
We present SpatialMem, a memory-centric system that unifies 3D geometry, semantics, and language into a single, queryable representation. Starting from casually captured egocentric RGB video, SpatialMem reconstructs metrically scaled indoor environments, detects structural 3D anchors (walls, doors, windows) as the first-layer scaffold, and populates a hierarchical memory with open-vocabulary object nodes -- linking evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates -- for compact storage and fast retrieval. This design enables interpretable reasoning over spatial relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided navigation and object retrieval without specialized sensors. Experiments across three real-life indoor scenes demonstrate that SpatialMem maintains strong anchor-description-level navigation completion and hierarchical retrieval accuracy under increasing clutter and occlusion, offering an efficient and extensible framework for embodied spatial intelligence.
中文标题/摘要
标题:SpatialMem:统一的3D记忆系统,结合度量锚定和快速检索
我们提出了SpatialMem,这是一种以存储为中心的系统,将3D几何、语义和语言统一为单一的可查询表示。从随意拍摄的主观RGB视频开始,SpatialMem重建了度量标定的室内环境,检测结构化的3D锚点(墙壁、门、窗户)作为第一层框架,并用开放词汇的物体节点填充层次化的记忆——将证据片段、视觉嵌入和两层文本描述链接到3D坐标,以实现紧凑存储和快速检索。此设计使空间关系(如距离、方向、可见性)的可解释推理成为可能,并支持诸如语言引导导航和物体检索等下游任务,无需专用传感器。在三个真实室内场景中的实验表明,SpatialMem在增加杂乱和遮挡的情况下,保持了强大的锚点-描述级别的导航完成度和层次化检索准确性,提供了一种高效且可扩展的体感空间智能框架。
Summary / 总结
SpatialMem is a memory-centric system that integrates 3D geometry, semantics, and language into a unified, queryable representation. Starting from casual egocentric RGB video, it reconstructs indoor environments, detects structural 3D anchors, and populates a hierarchical memory with object nodes linked to visual and textual evidence. Experiments show that SpatialMem maintains strong navigation completion and retrieval accuracy even under clutter and occlusion, supporting tasks like language-guided navigation and object retrieval.
SpatialMem 是一个以记忆为中心的系统,将 3D 几何、语义和语言统一到一个可查询的表示中。从随意拍摄的主观 RGB 视频开始,它重建室内环境,检测结构化的 3D 锚点,并填充一个分层记忆,其中包含与视觉和文本证据相连的对象节点。实验表明,即使在杂乱和遮挡的情况下,SpatialMem 也能保持强大的导航和检索准确性,支持如语言引导导航和物体检索等任务,无需专用传感器。
Beyond Functional Correctness: Exploring Hallucinations in LLM-Generated Code
Authors: Fang Liu, Yang Liu, Lin Shi, Zhen Yang, Li Zhang, Xiaoli Lian, Zhongqi Li, Yuchi Ma
First: 2024-04-01T07:31:45+00:00 · Latest: 2026-01-21T11:08:06+00:00
Comments: Accepted by Transactions on Software Engineering (TSE)
Abstract
The rise of Large Language Models (LLMs) has significantly advanced various applications in software engineering, particularly code generation. Despite their promising performance, LLMs are prone to hallucination: they may produce outputs that deviate from users' intent, exhibit internal inconsistencies, or are misaligned with real-world knowledge, making the deployment of LLMs potentially risky in a wide range of applications. Existing work mainly focuses on investigating hallucination in the domain of Natural Language Generation (NLG), leaving a gap in comprehensively understanding the types, causes, and impacts of hallucinations in the context of code generation. To bridge the gap, we conducted a thematic analysis of LLM-generated code to summarize and categorize the hallucinations, as well as their causes and impacts. Our study established a comprehensive taxonomy of code hallucinations, encompassing 3 primary categories and 12 specific categories. Furthermore, we systematically analyzed the distribution of hallucinations, exploring variations among different LLMs and benchmarks. Moreover, we performed an in-depth analysis of the causes and impacts of various hallucinations, aiming to provide valuable insights into hallucination mitigation. Finally, to enhance the correctness and reliability of LLM-generated code in a lightweight manner, we explore training-free hallucination mitigation approaches based on prompt-enhancement techniques. We believe our findings will shed light on future research about code hallucination evaluation and mitigation, ultimately paving the way for building more effective and reliable code LLMs. The replication package is available at https://github.com/Lorien1128/code_hallucination
中文标题/摘要
标题:超越功能正确性:探索LLM生成代码中的幻觉
大型语言模型(LLMs)的兴起在软件工程任务中显著推进了各种应用,特别是在代码生成方面。尽管表现出色,但LLMs容易生成幻觉,即LLMs可能会产生与用户意图不符、内部不一致或与现实知识不一致的输出,这使得在广泛的应用中部署LLMs具有潜在风险。现有研究主要集中在自然语言生成(NLG)领域幻觉的研究上,忽略了代码生成背景下幻觉类型、原因和影响的全面理解。为了填补这一空白,我们对LLM生成的代码进行了主题分析,总结和分类了幻觉及其原因和影响。我们的研究建立了一个全面的代码幻觉分类体系,包括3个主要类别和12个具体类别。此外,我们系统地分析了幻觉的分布,探索了不同LLM和基准之间的差异。此外,我们对各种幻觉的原因和影响进行了深入分析,旨在提供幻觉缓解的宝贵见解。最后,为了以轻量级方式增强LLM生成代码的正确性和可靠性,我们通过提示增强技术探索了无需训练的幻觉缓解方法。我们相信我们的发现将为未来关于代码幻觉评估和缓解的研究提供启示,最终为未来构建更有效和可靠的代码LLM铺平道路。复制包可在https://github.com/Lorien1128/code_hallucination 获取。
Summary / 总结
This study investigates hallucinations in code generated by Large Language Models (LLMs), addressing the gap in understanding hallucinations in code generation. Through thematic analysis, the research establishes a comprehensive taxonomy of code hallucinations with 3 primary categories and 12 specific categories, analyzes their causes and impacts, and explores their distribution across different LLMs and benchmarks. The study also explores training-free, prompt-enhancement techniques for lightweight hallucination mitigation, aiming to enhance the correctness and reliability of LLM-generated code.
研究探讨了大型语言模型(LLMs)生成代码中的幻觉问题,填补了代码生成中幻觉理解的空白。通过主题分析,研究将幻觉分为3个主要类别和12个具体类别,并分析了它们在不同LLMs中的分布及其原因和影响。研究还提出了一种基于提示增强的轻量级幻觉缓解方法,旨在提高LLMs生成代码的正确性和可靠性,而无需额外训练。研究结果为未来关于代码幻觉评估和缓解的研究提供了贡献。
Measuring and Aligning Abstraction in Vision-Language Models with Medical Taxonomies
Authors: Ben Schaper, Maxime Di Folco, Bernhard Kainz, Julia A. Schnabel, Cosmin I. Bercea
First: 2026-01-21T09:58:50+00:00 · Latest: 2026-01-21T09:58:50+00:00
Abstract
Vision-Language Models show strong zero-shot performance for chest X-ray classification, but standard flat metrics fail to distinguish between clinically minor and severe errors. This work investigates how to quantify and mitigate abstraction errors by leveraging medical taxonomies. We benchmark several state-of-the-art VLMs using hierarchical metrics and introduce Catastrophic Abstraction Errors to capture cross-branch mistakes. Our results reveal substantial misalignment of VLMs with clinical taxonomies despite high flat performance. To address this, we propose risk-constrained thresholding and taxonomy-aware fine-tuning with radial embeddings, which reduce severe abstraction errors to below 2 per cent while maintaining competitive performance. These findings highlight the importance of hierarchical evaluation and representation-level alignment for safer and more clinically meaningful deployment of VLMs.
中文标题/摘要
标题:基于医学分类学衡量与对齐视觉-语言模型中的抽象
视觉-语言模型在胸部X光分类中表现出强大的零样本性能,但标准的扁平度量无法区分临床轻微和严重错误。这项工作研究了如何通过利用医学分类学来量化和减轻抽象错误。我们使用分层度量对几种最先进的VLM进行基准测试,并引入灾难性抽象错误来捕捉跨分支错误。我们的结果揭示了尽管具有高扁平性能,VLM与临床分类学之间仍存在显著的不一致。为了解决这一问题,我们提出了风险约束阈值化和基于径向嵌入的分类学感知微调,这将严重抽象错误降低到2%以下,同时保持竞争力的性能。这些发现强调了分层评估和表示级对齐对于更安全和更具临床意义的VLM部署的重要性。
Summary / 总结
This work aims to measure and mitigate abstraction errors in Vision-Language Models (VLMs) for chest X-ray classification by using medical taxonomies. The study introduces hierarchical metrics and Catastrophic Abstraction Errors to evaluate VLMs, revealing significant misalignment with clinical taxonomies despite high flat performance. The authors propose risk-constrained thresholding and taxonomy-aware fine-tuning, which reduce severe abstraction errors to below 2 percent while maintaining competitive performance. This highlights the need for hierarchical evaluation and representation-level alignment for safer and more clinically meaningful VLMs.
这项工作旨在通过使用医学分类学来衡量和缓解Vision-Language Models (VLMs)在胸部X光分类中的抽象错误。研究引入了层次化指标和灾难性抽象错误来评估VLMs,揭示了尽管扁平性能高,但VLMs与临床分类学存在显著的不一致。作者提出了风险约束阈值化和分类学感知微调,这将严重的抽象错误减少到低于2%,同时保持竞争力的性能。这强调了在更安全和更具临床意义的VLMs部署中进行层次化评估和表示级对齐的重要性。
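The contrast between flat metrics and cross-branch mistakes can be illustrated with a toy example. This is only a sketch under stated assumptions: the abstract does not give the exact definition of Catastrophic Abstraction Errors, so the code assumes an error is "cross-branch" when the predicted and true labels fall under different top-level branches. The taxonomy, label names, and function names are hypothetical.

```python
# Toy taxonomy (hypothetical labels): child -> parent; "root" is the top node.
PARENT = {
    "lung_finding": "root",
    "cardiac_finding": "root",
    "pneumonia": "lung_finding",
    "atelectasis": "lung_finding",
    "cardiomegaly": "cardiac_finding",
}


def ancestors(label):
    """Return the path from `label` up to its top-level branch (root excluded)."""
    path = [label]
    while PARENT[path[-1]] != "root":
        path.append(PARENT[path[-1]])
    return path


def is_cross_branch(pred, truth):
    """Assumed criterion: prediction and truth sit under different top-level branches."""
    return ancestors(pred)[-1] != ancestors(truth)[-1]


def error_rates(preds, truths):
    """Return (flat error rate, cross-branch error rate) over paired labels."""
    n = len(truths)
    flat = sum(p != t for p, t in zip(preds, truths)) / n
    cross = sum(is_cross_branch(p, t) for p, t in zip(preds, truths)) / n
    return flat, cross


preds = ["atelectasis", "cardiomegaly", "pneumonia", "pneumonia"]
truths = ["pneumonia", "pneumonia", "pneumonia", "cardiomegaly"]
flat_rate, cross_rate = error_rates(preds, truths)
```

In this toy data three of four predictions are wrong under the flat metric, but only two of them cross a top-level branch. A hierarchical evaluation is meant to expose exactly this distinction.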
Does medical specialization of VLMs enhance discriminative power?: A comprehensive investigation through feature distribution analysis
Authors: Keita Takeda, Tomoya Sakai
Venue: ISBI short
First: 2026-01-21T08:53:40+00:00 · Latest: 2026-01-21T08:53:40+00:00
Comments: A short version paper of this research has been accepted for The IEEE International Symposium on Biomedical Imaging (ISBI) 2026
Abstract
This study investigates the feature representations produced by publicly available open-source medical vision-language models (VLMs). While medical VLMs are expected to capture diagnostically relevant features, their learned representations remain underexplored, and standard evaluations like classification accuracy do not fully reveal whether they acquire truly discriminative, lesion-specific features. Understanding these representations is crucial for revealing medical image structures and improving downstream tasks in medical image analysis. This study aims to investigate the feature distributions learned by medical VLMs and evaluate the impact of medical specialization. We analyze the feature distributions extracted by several representative medical VLMs from lesion classification datasets spanning multiple image modalities. We compare these distributions with those of non-medical VLMs to assess the effect of domain-specific medical training. Our experiments show that medical VLMs can extract discriminative features that are effective for medical classification tasks. Moreover, non-medical VLMs recently improved through contextual enrichment, such as LLM2CLIP, produce more refined feature representations. Our results imply that enhancing the text encoder is more crucial than training intensively on medical images when developing medical VLMs. Notably, non-medical models are particularly vulnerable to biases introduced by text strings overlaid on images. These findings underscore the need to select models carefully according to the downstream task and to account for potential inference risks from background biases such as textual information in images.
中文标题/摘要
标题:医学专业化的大规模视觉-语言模型是否增强辨别力?:通过特征分布分析的全面调查
本研究调查了公开可用的开源医学视觉-语言模型(VLMs)生成的特征表示。尽管医学VLMs被期望捕捉到诊断相关的特征,但它们学习到的表示仍然未被充分探索,而标准评估如分类准确率并不能完全揭示它们是否获得了真正具有辨别力、病变特异性的特征。理解这些表示对于揭示医学图像结构和改进医学图像分析中的下游任务至关重要。本研究旨在调查医学VLMs学习到的特征分布,并评估医学专业化的影响。我们分析了由一些代表性医学VLMs在多模态病变分类数据集中提取的多个图像模态的特征分布,并将这些分布与非医学VLMs进行了比较,以评估其领域特定的医学训练。我们的实验表明,医学VLMs可以提取出有效的医学分类任务特征。此外,我们发现,具有上下文丰富改进的非医学VLMs,如LLM2CLIP,生成了更精细的特征表示。我们的结果表明,在开发医学VLMs时,增强文本编码器比在医学图像上进行密集训练更为重要。值得注意的是,非医学模型特别容易受到图像上叠加文本字符串引入的偏差的影响。这些发现强调了在选择模型时除了潜在的背景偏差风险外,还需要根据下游任务进行仔细考虑。
Summary / 总结
This study investigates the feature representations of medical vision-language models (VLMs) by analyzing their learned distributions across multiple medical image modalities. The research compares medical VLMs with non-medical VLMs, finding that medical VLMs can extract discriminative features for medical classification tasks. However, non-medical VLMs with contextual enrichment produce more refined feature representations, suggesting that enhancing text encoders is more critical than extensive medical image training. The study highlights the importance of model selection based on downstream tasks and the potential risks of background biases in images.
该研究通过分析医学视觉-语言模型(VLMs)在多种图像模态下的特征分布,并将其与非医学VLMs进行比较,探究了其学习的特征表示。研究发现,医学VLMs能够提取出有效的医学分类任务特征,而非医学VLMs通过上下文增强后能够生成更精细的特征表示。研究指出,增强文本编码器比在医学图像上进行大量训练更为重要,并强调了非医学模型对图像上叠加文本字符串引入的偏差的脆弱性。
A Training-Free Guess What Vision Language Model from Snippets to Open-Vocabulary Object Detection
Authors: Guiying Zhu, Bowen Yang, Yin Zhuang, Tong Zhang, Guanqun Wang, Zhihao Che, He Chen, Lianlin Li
First: 2026-01-17T05:14:42+00:00 · Latest: 2026-01-21T08:41:03+00:00
Abstract
Open-Vocabulary Object Detection (OVOD) aims to develop the capability to detect anything. Although many large-scale pre-training efforts have built versatile foundation models that exhibit impressive zero-shot capabilities to facilitate OVOD, the necessity of creating a universal understanding for any object cognition according to already pretrained foundation models is usually overlooked. Therefore, in this paper, a training-free Guess What Vision Language Model, called GW-VLM, is proposed to form a universal understanding paradigm based on our carefully designed Multi-Scale Visual Language Searching (MS-VLS) coupled with Contextual Concept Prompt (CCP) for OVOD. This approach can engage a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in the game of "guess what". Here, MS-VLS leverages multi-scale visual-language soft-alignment for the VLM to generate snippets from the results of class-agnostic object detection, while CCP can form the concept of flow referring to MS-VLS and then make the LLM understand snippets for OVOD. Finally, extensive experiments are carried out on natural and remote sensing datasets, including COCO val, Pascal VOC, DIOR, and NWPU-10, and the results indicate that our proposed GW-VLM can achieve superior OVOD performance compared to state-of-the-art methods without any training step.
中文标题/摘要
标题:无需训练的猜什么视觉语言模型:从片段到开放词汇目标检测
开放词汇目标检测(OVOD)旨在开发检测任何事物的能力。尽管大量的大规模预训练工作构建了具备广泛功能的基础模型,这些模型在零样本情况下表现出色,以促进OVOD,但根据已预训练的基础模型创建通用理解以适应任何物体认知的需求通常被忽视。因此,在本文中,提出了一种无需训练的猜什么视觉语言模型(GW-VLM),基于我们精心设计的多尺度视觉语言搜索(MS-VLS)与上下文概念提示(CCP)来构建通用理解范式。该方法可以将预训练的视觉语言模型(VLM)和大型语言模型(LLM)参与“猜什么”的游戏。其中,MS-VLS 利用多尺度视觉语言软对齐,使VLM从无类别的目标检测结果中生成片段,而CCP可以形成与MS-VLS相关的概念流,然后让LLM理解这些片段以实现OVOD。最后,在自然和遥感数据集上进行了广泛的实验,包括COCO val、Pascal VOC、DIOR和NWPU-10,结果表明,我们提出的GW-VLM在无需任何训练步骤的情况下,可以实现优于现有方法的OVOD性能。
Summary / 总结
This paper proposes a training-free Guess What Vision Language Model (GW-VLM) for Open-Vocabulary Object Detection (OVOD). It leverages Multi-Scale Visual Language Searching (MS-VLS) and Contextual Concept Prompt (CCP) to engage a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in the game of 'guess what'. The model generates snippets from class-agnostic object detection results and uses these to form a universal understanding for OVOD. Experiments on various datasets show that GW-VLM outperforms state-of-the-art methods without requiring any training steps.
本文提出了一种无需训练的方法GW-VLM,利用预训练的Vision Language Model (VLM)和Large Language Model (LLM)在“猜什么”的游戏中进行开放词汇目标检测(OVOD)。该方法使用Multi-Scale Visual Language Searching (MS-VLS)从无类别目标检测结果中生成片段,并使用Contextual Concept Prompt (CCP)形成概念流供LLM理解。在多个数据集上的实验表明,GW-VLM在无需任何训练步骤的情况下,比现有最佳方法在OVOD上表现更优。
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
Authors: Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei
First: 2026-01-21T08:09:25+00:00 · Latest: 2026-01-21T08:09:25+00:00
Abstract
Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT
中文标题/摘要
标题:Render-of-Thought: 将文本推理链渲染为图像以进行视觉潜在推理
文本推理链(CoT)提示在解锁大型语言模型(LLMs)的推理能力方面取得了显著成功。尽管CoT提示增强了推理能力,但其冗长性带来了巨大的计算开销。近期工作往往仅关注结果对齐,而缺乏对中间推理过程的监督。这些不足之处模糊了潜在推理链的可分析性。为解决这些挑战,我们引入了Render-of-Thought(RoT),这是第一个通过将文本步骤渲染为图像来实现推理链具体化的框架,使潜在的推理理由变得明确且可追踪。具体而言,我们利用现有视觉语言模型(VLMs)的视觉编码器作为语义锚点,将视觉嵌入与文本空间对齐。此设计确保了即插即用的实现方式,无需额外的预训练开销。在数学和逻辑推理基准测试上的广泛实验表明,与显式CoT相比,我们的方法实现了3-4倍的令牌压缩和显著的推理加速。此外,与其他方法相比,它保持了竞争力,验证了此范式的可行性。我们的代码可在https://github.com/TencentBAC/RoT 获取
Summary / 总结
The motivation for this work is to address the computational overhead and lack of transparency in Chain-of-Thought (CoT) prompting for Large Language Models (LLMs). The proposed Render-of-Thought (RoT) framework converts textual reasoning steps into images, making the latent reasoning process explicit and traceable. Experiments show that RoT achieves 3-4x token compression and significant inference speedup compared to explicit CoT, while maintaining competitive performance on mathematical and logical reasoning benchmarks.
论文提出了Render-of-Thought (RoT)框架,将文本推理步骤转化为图像,使潜在的推理过程变得明确和可追踪。该方法利用现有Vision Language Models的视觉编码器对视觉嵌入与文本空间进行对齐,实现即插即用的实施。实验表明,RoT在数学和逻辑推理基准上实现了3-4倍的令牌压缩和显著的推理加速,同时保持了与其他方法相当的性能。
Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
Authors: Yuatyong Chaichana, Pittawat Taveekitworachai, Warit Sirichotedumrong, Potsawee Manakul, Kunat Pipatanakul
First: 2025-10-17T01:44:28+00:00 · Latest: 2026-01-21T08:00:24+00:00
Comments: EACL 2026. Code and dataset are available at: https://github.com/yophis/partial-yarn
Abstract
Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g. YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, modality-decoupled extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training. Our experiments on SALMONN and Qwen2-Audio confirm that Partial YaRN outperforms the original models across a wide range of settings, and VLAT provides substantial performance improvement on long audio of unseen lengths.
中文标题/摘要
标题:扩展音频上下文以增强大型音频语言模型的长格式理解
大型音频语言模型(LALMs)通常受限于短音频上下文窗口,即使其文本后端支持长上下文,也限制了长格式音频的理解能力。先前的工作在单模态LLMs上引入了上下文扩展方法(例如YaRN),但其在LALMs上的应用尚未被探索。首先,基于RoPE的上下文扩展,我们引入了Partial YaRN,这是一种无需训练、模态解耦的扩展方法,仅修改音频标记位置,保留基LLM的文本能力。其次,我们提出了虚拟长格式音频训练(VLAT),这是一种训练策略,将Partial YaRN扩展为训练时的位置增强。VLAT在训练过程中模拟多种音频长度,使模型能够泛化到远长于训练中所见的输入。我们在SALMONN和Qwen2-Audio上的实验表明,Partial YaRN在各种设置中均优于原始模型,而VLAT在未见过的长音频上提供了显著的性能提升。
Summary / 总结
This paper addresses the limitation of short audio context windows in Large Audio-Language Models (LALMs) by introducing Partial YaRN, a training-free method that extends audio context without affecting text capabilities. VLAT, a training strategy, further enhances this by simulating diverse audio lengths during training. Experiments show that Partial YaRN improves performance across various settings, and VLAT significantly boosts performance on long, unseen audio inputs.
该研究通过引入Partial YaRN,一种无需训练的方法,来扩展音频上下文而不影响文本能力,并提出了VLAT,一种训练策略,增强Partial YaRN以更好地适应长音频输入。实验结果表明,Partial YaRN在各种设置中提高了性能,而VLAT在未见过的长音频输入上提供了显著的性能提升。
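The modality-decoupled idea described above (rescale only audio token positions, keep text steps untouched) can be sketched in a few lines. This is a simplified illustration, not the authors' method: it swaps in plain linear position interpolation instead of YaRN's frequency-dependent RoPE scaling. The function name `partial_positions`, the `audio_window` parameter, and the way trailing text positions are assigned are all assumptions made for this example.

```python
def partial_positions(modalities, audio_window):
    """Assign a position to each token, compressing only the audio span.

    modalities   -- list of "text" / "audio" tags, one per token (hypothetical format)
    audio_window -- number of audio positions assumed to be seen during training

    Assumption: text tokens always advance the position by 1, while audio tokens
    advance by a fractional step so the whole audio span fits in `audio_window`.
    """
    n_audio = sum(m == "audio" for m in modalities)
    step = 1.0 if n_audio <= audio_window else audio_window / n_audio
    positions, pos = [], 0.0
    for m in modalities:
        positions.append(pos)
        pos += step if m == "audio" else 1.0
    return positions


tokens = ["text", "audio", "audio", "audio", "audio", "text"]
positions = partial_positions(tokens, audio_window=2)
```

With four audio tokens and an assumed window of two, each audio token advances the position by 0.5. The trailing text token therefore lands where it would have if the audio had fit the original window.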
DeepMoLM: Leveraging Visual and Geometric Structural Information for Molecule-Text Modeling
Authors: Jing Lan, Hexiao Ding, Hongzhao Chen, Yufeng Jiang, Nga-Chun Ng, Gwing Kei Yip, Gerald W. Y. Cheng, Yunlin Mao, Jing Cai, Liang-ting Lin, Jung Sun Yoo
First: 2026-01-21T07:41:59+00:00 · Latest: 2026-01-21T07:41:59+00:00
Comments: Under review
Abstract
AI models for drug discovery and chemical literature mining must interpret molecular images and generate outputs consistent with 3D geometry and stereochemistry. Most molecular language models rely on strings or graphs, while vision-language models often miss stereochemical details and struggle to map continuous 3D structures into discrete tokens. We propose DeepMoLM: Deep Molecular Language Modeling, a dual-view framework that grounds high-resolution molecular images in geometric invariants derived from molecular conformations. DeepMoLM preserves high-frequency evidence from 1024 $\times$ 1024 inputs, encodes conformer neighborhoods as discrete Extended 3-Dimensional Fingerprints, and fuses visual and geometric streams with cross-attention, enabling physically grounded generation without atom coordinates. DeepMoLM improves PubChem captioning with a 12.3% relative METEOR gain over the strongest generalist baseline while staying competitive with specialist methods. It produces valid numeric outputs for all property queries and attains MAE 13.64 g/mol on Molecular Weight and 37.89 on Complexity in the specialist setting. On ChEBI-20 description generation from images, it exceeds generalist baselines and matches state-of-the-art vision-language models. Code is available at https://github.com/1anj/DeepMoLM.
中文标题/摘要
标题:DeepMoLM:利用分子图像和几何结构信息进行分子-文本建模
用于药物发现和化学文献挖掘的AI模型必须解释分子图像并生成与三维几何和立体化学一致的输出。大多数分子语言模型依赖于字符串或图,而视觉-语言模型往往忽略立体化学细节,难以将连续的三维结构映射为离散的标记。我们提出了DeepMoLM:深度分子语言建模,这是一种双视图框架,将高分辨率分子图像与从分子构象中导出的几何不变量联系起来。DeepMoLM 保留了来自1024×1024输入的高频证据,将构象邻域编码为扩展的三维指纹,并通过交叉注意力融合视觉和几何流,从而实现基于物理的生成而无需原子坐标。DeepMoLM 在PubChem图注中相对于最强的一般基线获得了12.3%的相对METEOR增益,同时在专家模式下与专家方法保持竞争力。它对所有属性查询生成有效的数值输出,并在专家模式下分别在分子量和复杂性上达到MAE 13.64 g/mol和37.89。在ChEBI-20图像描述生成中,它超过了通用基线并匹配最先进的视觉-语言模型。代码可在https://github.com/1anj/DeepMoLM/ 获取。
Summary / 总结
DeepMoLM is designed to interpret molecular images and generate outputs consistent with 3D geometry and stereochemistry. It uses a dual-view framework that combines high-resolution molecular images with geometric invariants from molecular conformations. The model improves PubChem captioning with a 12.3% relative METEOR gain over the strongest generalist baseline, produces valid numeric outputs for all property queries, and matches state-of-the-art vision-language models on ChEBI-20 description generation from images.
DeepMoLM旨在解析分子图像并生成与三维几何和立体化学一致的输出。它采用了一种结合高分辨率分子图像和分子构象几何不变量的双视图框架。该模型在PubChem图注生成中相对METEOR增益提高了12.3%,能够生成有效的数值输出,并在ChEBI-20图像描述生成中与最先进的视觉语言模型相当。
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
Authors: Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu
First: 2026-01-21T07:26:15+00:00 · Latest: 2026-01-21T07:26:15+00:00
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions and achieving 10$\times$ faster TTFT compared to prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
中文标题/摘要
标题:HERMES: 作为分层内存的KV缓存以实现高效的流式视频理解
近期多模态大型语言模型(MLLMs)在离线视频理解方面取得了显著进步。然而,将这些能力扩展到流式视频输入仍然具有挑战性,因为现有模型难以同时保持稳定的理解性能、实时响应和低GPU内存开销。为了解决这一挑战,我们提出了一种名为HERMES的新型无训练架构,用于实时和准确地理解视频流。基于机制性注意力调查,我们将KV缓存概念化为一种分层内存框架,以跨多个粒度封装视频信息。在推理过程中,HERMES重用紧凑的KV缓存,能够在资源受限的情况下实现高效的流式理解。值得注意的是,HERMES在用户查询到达时不需要额外的辅助计算,从而保证了连续视频流交互的实时响应,TTFT比之前最先进的技术快10倍。即使与均匀采样相比,将视频标记减少高达68%,HERMES在所有基准测试中仍能实现优于或可比的准确性,流式数据集上最高可获得11.4%的提升。
Summary / 总结
The research aims to improve real-time and accurate understanding of streaming video inputs by addressing the limitations of existing models in maintaining performance, real-time responses, and low GPU memory usage. HERMES, a training-free architecture, treats the KV cache as a hierarchical memory framework and reuses a compact cache that holds video information at multiple granularities during inference. This approach ensures real-time responses and achieves 10 times faster TTFT compared to the prior state of the art. Even with up to a 68% reduction in video tokens relative to uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with gains of up to 11.4% on streaming datasets.
研究旨在通过解决现有模型在保持性能、实时响应和低GPU内存使用方面的局限性,提高对流视频的理解能力。HERMES是一种无需训练的架构,利用基于KV缓存的层次化内存框架来高效处理视频流。关键实验发现是,HERMES可以比最先进的方法快10倍的时间到达第一个令牌(TTFT),同时保持或提高准确性,即使视频令牌减少了高达68%。
Can Synthetic Images Serve as Effective and Efficient Class Prototypes?
Authors: Dianxing Shi, Dingjie Fu, Yuqiao Liu, Jun Wang
First: 2025-12-19T01:39:43+00:00 · Latest: 2026-01-21T07:00:03+00:00
Comments: Accepted by IEEE ICASSP2026
Abstract
Vision-Language Models (VLMs) have shown strong performance in zero-shot image classification tasks. However, existing methods, including Contrastive Language-Image Pre-training (CLIP), all rely on annotated text-to-image pairs for aligning visual and textual modalities. This dependency introduces substantial cost and accuracy requirements in preparing high-quality datasets. At the same time, processing data from two modalities requires dual-tower encoders for most models, which makes them harder to keep lightweight. To address these limitations, we introduce a "Contrastive Language-Image Pre-training via Large-Language-Model-based Generation (LGCLIP)" framework. LGCLIP leverages a Large Language Model (LLM) to generate class-specific prompts that guide a diffusion model in synthesizing reference images. These generated images then serve as visual prototypes, and the visual features of real images are extracted and compared with the visual features of these prototypes to achieve comparative prediction. By optimizing prompt generation through the LLM and employing only a visual encoder, LGCLIP remains lightweight and efficient. Crucially, our framework requires only class labels as input throughout the entire experimental procedure, eliminating the need for manually annotated image-text pairs and extra pre-processing. Experimental results validate the feasibility and efficiency of LGCLIP, demonstrating great performance in zero-shot classification tasks and establishing a novel paradigm for classification.
中文标题/摘要
标题:合成图像能否作为有效的和高效的类别原型?
视觉-语言模型(VLMs)在零样本图像分类任务中表现出强大的性能。然而,现有的方法,包括对比语言-图像预训练(CLIP),都依赖于标注的图文对来对齐视觉和文本模态。这种依赖性在准备高质量数据集时引入了巨大的成本和准确度要求。同时,处理两种模式的数据还需要大多数模型使用双塔编码器,这也阻碍了它们的轻量化。为了解决这些限制,我们引入了“基于大型语言模型生成的对比语言-图像预训练(LGCLIP)”框架。LGCLIP 利用大型语言模型(LLM)生成类特定的提示,引导扩散模型合成参考图像。之后,这些生成的图像作为视觉原型,从真实图像中提取的视觉特征与这些原型的视觉特征进行比较,以实现相对预测。通过通过 LLM 优化提示生成并仅使用视觉编码器,LGCLIP 保持了轻量化和高效性。至关重要的是,我们的框架在整个实验过程中只需要类别标签作为输入,消除了手动标注的图文对和额外预处理的需求。实验结果验证了 LGCLIP 的可行性和高效性,展示了其在零样本分类任务中的出色性能,并建立了分类的新范式。
Summary / 总结
The research aims to address the high cost and accuracy requirements of preparing the annotated text-to-image pairs that existing vision-language models rely on to align visual and textual modalities. The proposed LGCLIP framework uses a Large Language Model to generate class-specific prompts for a diffusion model to synthesize reference images, which serve as visual prototypes. Visual features of real images are then extracted and compared with the features of these prototypes to make predictions in zero-shot classification tasks. The framework is lightweight and efficient, requiring only class labels as input and eliminating the need for manually annotated image-text pairs and extra pre-processing.
论文针对现有方法在视觉和文本模态对齐中高昂的成本和高准确度要求,提出了LGCLIP框架。该框架利用大型语言模型生成类特定的提示,指导扩散模型合成参考图像。这些图像作为视觉原型,用于比较真实图像的视觉特征以实现分类。LGCLIP轻量高效,仅需类标签即可,展示了在零样本分类任务中的出色性能。
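The prototype-comparison step described above can be sketched as a nearest-prototype classifier. This is a minimal illustration under assumptions, not the LGCLIP implementation: it assumes each class has one prototype feature vector (for example, an average over features of the synthesized reference images) and that cosine similarity is the comparison measure. The abstract specifies neither, and the names `cosine` and `classify` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(feature, prototypes):
    """Return the class label whose prototype is most similar to `feature`.

    prototypes -- dict mapping class label -> prototype feature vector
                  (assumed to come from a visual encoder applied to synthetic images)
    """
    return max(prototypes, key=lambda label: cosine(feature, prototypes[label]))


# Toy 2-D features standing in for visual-encoder outputs.
prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
predicted = classify([0.9, 0.2], prototypes)
```

Only a visual encoder is needed at inference time in this setup, since both the prototypes and the query live in the same visual feature space.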
AutoDriDM: An Explainable Benchmark for Decision-Making of Vision-Language Models in Autonomous Driving
Authors: Zecong Tang, Zixu Wang, Yifei Wang, Weitong Lian, Tianjian Gao, Haoran Li, Tengju Ru, Lingyi Meng, Zhejun Cui, Yichen Zhu, Qi Kang, Kaixuan Wang, Yu Zhang
Venue: ACL
First: 2026-01-21T06:29:09+00:00 · Latest: 2026-01-21T06:29:09+00:00
Comments: 23 pages. Submitted to ACL ARR 2026 January
Abstract
Autonomous driving is a highly challenging domain that requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonstrate reasoning and generalization abilities, opening new possibilities for autonomous driving; however, existing benchmarks and metrics overemphasize perceptual competence and fail to adequately assess decision-making processes. In this work, we present AutoDriDM, a decision-centric, progressive benchmark with 6,650 questions across three dimensions - Object, Scene, and Decision. We evaluate mainstream VLMs to delineate the perception-to-decision capability boundary in autonomous driving, and our correlation analysis reveals weak alignment between perception and decision-making performance. We further conduct explainability analyses of models' reasoning processes, identifying key failure modes such as logical reasoning errors, and introduce an analyzer model to automate large-scale annotation. AutoDriDM bridges the gap between perception-centered and decision-centered evaluation, providing guidance toward safer and more reliable VLMs for real-world autonomous driving.
中文标题/摘要
标题:AutoDriDM:面向自动驾驶中视觉-语言模型决策的可解释基准
自动驾驶是一个极具挑战性的领域,需要在复杂场景中实现可靠的感知和安全的决策。近期的视觉-语言模型(VLMs)展示了推理和泛化能力,为自动驾驶开辟了新的可能性;然而,现有的基准和度量标准过于强调感知能力,未能充分评估决策过程。在本文中,我们提出了AutoDriDM,这是一个以决策为中心、分阶段的基准,包含6,650个问题,涵盖三个维度:对象、场景和决策。我们评估了主流的VLMs,以界定自动驾驶中的感知到决策能力边界,并通过相关性分析发现感知与决策性能之间存在弱对齐。我们进一步对模型的推理过程进行了可解释性分析,识别出关键的失败模式,如逻辑推理错误,并引入了分析器模型以自动化大规模标注。AutoDriDM弥合了以感知为中心和以决策为中心评估之间的差距,为更安全、更可靠的VLMs在实际自动驾驶中的应用提供了指导。
Summary / 总结
AutoDriDM is a decision-centric benchmark for evaluating vision-language models in autonomous driving, addressing the limitations of existing benchmarks by focusing on decision-making processes. It includes 6,650 questions across three dimensions: Object, Scene, and Decision. The study finds weak alignment between perception and decision-making performance and identifies key failure modes such as logical reasoning errors. The benchmark also introduces an analyzer model for automated large-scale annotation, aiming to improve the safety and reliability of VLMs in real-world autonomous driving scenarios.
AutoDriDM 是一个决策导向的基准,用于评估视觉-语言模型在自动驾驶中的表现,通过关注决策过程来弥补现有基准的不足。该基准包含6,650个问题,涵盖对象、场景和决策三个维度。主流 VLM 的评估结果显示,感知能力和决策性能之间存在弱关联。研究还识别了逻辑推理错误等关键失败模式,并引入了自动化大规模标注的分析模型,旨在提高 VLM 在实际自动驾驶中的安全性和可靠性。
T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
Authors: Shao-Jun Xia, Huixin Zhang, Zhengzhong Tu
First: 2025-11-20T07:02:06+00:00 · Latest: 2026-01-21T06:18:19+00:00
Abstract
In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks with unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In this paper, we propose a fully collaborative pipeline, i.e. T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across twelve cross-task scenarios and second-tier performance in nine additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.
中文标题/摘要
标题:T2T-VICL: 通过隐式文本驱动的VLM突破跨任务视觉在上下文学习的界限
在大型语言模型(LLM)中,上下文学习(ICL)是指通过输入上下文中的小规模示范来执行新任务。视觉在上下文学习(VICL)的最新进展展示了统一的视觉语言模型(VLM)解决下游任务的有希望的能力。当视觉提示和目标图像来自不同的视觉任务时,VLM 是否仍能实现 VICL?在论文中,我们提出了一种完全协作的管道,即 T2T-VICL,用于研究 VLM 的跨任务 VICL 潜力。本质上,我们设计了一种机制来生成和选择最佳地隐式描述两个不同低级视觉任务之间差异的文本提示,并构建了首个跨任务 VICL 数据集。在此基础上,我们提出了一种新颖的推理框架,结合感知得分推理与传统评估指标来执行跨任务 VICL。我们的方法在十二个跨任务场景中取得了顶级结果,并在九个额外场景中取得了第二级性能,突破了 VLM 中跨任务 VICL 的界限。
Summary / 总结
The paper explores the potential of cross-task visual in-context learning (VICL) by proposing T2T-VICL, a fully collaborative pipeline. It introduces a mechanism to generate and select text prompts that describe the differences between two distinct low-level vision tasks and constructs the first cross-task VICL dataset. The approach combines perceptual score-based reasoning with traditional evaluation metrics and achieves top-tier results in twelve cross-task scenarios and second-tier performance in nine additional scenarios, significantly advancing the boundaries of cross-task VICL within vision-language models.
论文旨在探索使用视觉语言模型(VLMs)进行跨任务视觉在上下文学习(VICL)的潜力。提出了T2T-VICL,一个完全协作的管道,生成和选择文本提示以描述不同低级视觉任务之间的差异。该方法结合了感知得分推理和传统评估指标,并在十二个跨任务场景中取得了顶级结果,在九个额外场景中取得了第二级性能,展示了VLMs在跨任务VICL中的能力。
Neural Honeytrace: Plug&Play Watermarking Framework against Model Extraction Attacks
Authors: Yixiao Xu, Binxing Fang, Rui Wang, Yinghai Zhou, Yuan Liu, Mohan Li, Zhihong Tian
First: 2025-01-16T06:59:20+00:00 · Latest: 2026-01-21T05:34:49+00:00
Abstract
Triggerable watermarking enables model owners to assert ownership against model extraction attacks. However, most existing approaches require additional training, which limits post-deployment flexibility, and the lack of clear theoretical foundations makes them vulnerable to adaptive attacks. In this paper, we propose Neural Honeytrace, a plug-and-play watermarking framework that operates without retraining. We redefine the watermark transmission mechanism from an information perspective, designing a training-free multi-step transmission strategy that leverages the long-tailed effect of backdoor learning to achieve efficient and robust watermark embedding. Extensive experiments demonstrate that Neural Honeytrace reduces the average number of queries required for a worst-case t-test-based ownership verification to as low as $2\%$ of existing methods, while incurring zero training cost.
中文标题/摘要
标题:神经蜜踪:无需重新训练的即插即用水印框架对抗模型提取攻击
触发式水印使模型所有者能够在模型提取攻击中主张所有权。然而,大多数现有方法需要额外的训练,这限制了部署后的灵活性,并且缺乏清晰的理论基础使它们容易受到适应性攻击。在本文中,我们提出了一种无需重新训练的即插即用水印框架——神经蜜踪。我们从信息的角度重新定义了水印传输机制,设计了一种基于后门学习长尾效应的无训练多步传输策略,以实现高效且稳健的水印嵌入。广泛的实验表明,与现有方法相比,神经蜜踪将最坏情况下的 t 检验基所有权验证所需的平均查询次数降低至其 2% 以下,同时不产生任何训练成本。
Summary / 总结
Neural Honeytrace is a watermarking framework designed to protect against model extraction attacks without requiring additional training, thus maintaining post-deployment flexibility. It leverages the long-tailed effect of backdoor learning to embed watermarks efficiently and robustly. Experimental results show that Neural Honeytrace significantly reduces the number of queries needed for ownership verification compared to existing methods, while avoiding any training costs.
Neural Honeytrace 是一种无需额外训练即可保护模型免受模型提取攻击的水印框架,保持部署后的灵活性。它通过利用后门学习的长尾效应重新定义水印传输机制,实现高效且稳健的水印嵌入。实验结果表明,与现有方法相比,Neural Honeytrace 显著减少了所需的所有权验证查询次数,同时不产生训练成本。
LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks
Authors: Fei Kong
First: 2025-07-27T08:31:24+00:00 · Latest: 2026-01-21T05:06:39+00:00
Abstract
Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. Specifically, we categorize spatial understanding into two main types: absolute spatial understanding, which involves querying the absolute spatial position (e.g., left, right) of an object within an image, and 3D spatial understanding, which includes movement and rotation. Notably, our dataset is entirely synthetic, enabling the generation of test samples at a low cost while also preventing dataset contamination. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement in their spatial understanding abilities. Explicitly, in our experiments, humans achieve near-perfect performance on all tasks, whereas current VLMs attain human-level performance only on the two simplest tasks. For the remaining tasks, the performance of VLMs is distinctly lower than that of humans. In fact, the best-performing Vision-Language Models even achieve near-zero scores on multiple tasks. The dataset and code are available on https://github.com/kong13661/LRR-Bench.
中文标题/摘要
标题:LRR-Bench:左、右还是旋转?视觉-语言模型在空间理解任务上仍存在困难
现实世界的应用,如自动驾驶和类人机器人操作,需要精确的空间感知。然而,视觉-语言模型(VLMs)如何识别空间关系和感知空间运动仍处于探索阶段。在本研究中,我们引入了一个空间评估管道并构建了相应的基准。具体而言,我们将空间理解分为两类:绝对空间理解,涉及查询图像中物体的绝对空间位置(例如,左、右);以及3D空间理解,包括运动和旋转。值得注意的是,我们的数据集完全是合成的,这使得测试样本的生成成本较低,同时也能防止数据集污染。我们在多个最先进的VLMs上进行了实验,并观察到它们在空间理解能力方面有很大的改进空间。具体而言,在我们的实验中,人类在所有任务上的表现几乎完美,而当前的VLMs仅在两个最简单的任务上达到人类水平。对于剩余的任务,VLMs的表现明显低于人类。事实上,表现最好的视觉-语言模型甚至在多个任务上得分接近零。数据集和代码可在https://github.com/kong13661/LRR-Bench上获取。
Summary / 总结
This work addresses the underexplored question of how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement, which is crucial for real-world applications like autonomous driving. The authors introduce a spatial evaluation pipeline and a fully synthetic benchmark, categorizing spatial understanding into absolute and 3D spatial understanding. Experiments on multiple state-of-the-art VLMs show that while humans perform near-perfectly on all tasks, current VLMs reach human-level performance only on the two simplest tasks and score near zero on several others, indicating substantial room for improvement in spatial understanding.
这项工作引入了LRR-Bench,一个用于评估视觉-语言模型(VLMs)空间理解能力的空间评估管道和基准。研究将空间理解分为绝对和三维类型,发现当前的VLMs在识别空间关系和运动方面存在显著困难,仅在最简单的任务上达到人类水平的表现。人类在更复杂的任务上表现更优,有些模型在多个任务上的得分接近零。该数据集是合成的,允许以低成本生成测试样本且不受污染。
Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis
Authors: James Brock, Ce Zhang, Nantheera Anantrasirichai
First: 2026-01-21T04:23:33+00:00 · Latest: 2026-01-21T04:23:33+00:00
Comments: 22 pages, 8 figures, 7 tables, Submitted to Ecological Informatics
Abstract
The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for enhancing forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. We introduce Forest-Chat, an LLM-driven agent designed for integrated forest change analysis. The proposed framework enables natural language querying and supports multiple RSICI tasks, including change detection, change captioning, object counting, deforestation percentage estimation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, and incorporates zero-shot change detection via a foundation change detection model together with an interactive point-prompt interface to support fine-grained user guidance. To facilitate adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated through a combination of human annotation and rule-based methods. Experimental results demonstrate that Forest-Chat achieves strong performance on Forest-Change and on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI, for joint change detection and captioning, highlighting the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and analytical efficiency in forest change analysis.
中文标题/摘要
标题:Forest-Chat:适应交互式森林变化分析的视觉语言代理
高分辨率卫星图像的日益可用性,结合深度学习的进步,为增强森林监测工作流程创造了新的机会。该领域中的两个主要挑战是像素级变化检测和语义变化解释,特别是对于复杂的森林动态。虽然大型语言模型(LLMs)越来越多地用于数据探索,但它们与遥感图像变化解释(RSICI)中的视觉语言模型(VLMs)的集成仍然未被充分探索,尤其是在城市环境之外。我们介绍了Forest-Chat,这是一种基于LLM的代理,旨在进行集成森林变化分析。所提出的框架支持自然语言查询,并支持多种RSICI任务,包括变化检测、变化描述、对象计数、森林砍伐比例估计和变化推理。Forest-Chat 基于多级变化解释(MCI)视觉语言骨干,并通过基于LLM的编排结合基础变化检测模型和交互式点提示界面实现零样本变化检测。为了在森林环境中促进适应和评估,我们引入了Forest-Change数据集,该数据集包含双时相卫星图像、像素级变化掩码和通过结合人工注释和基于规则的方法生成的多粒度语义变化描述。实验结果表明,Forest-Chat 在Forest-Change和LEVIR-MCI-Trees(LEVIR-MCI的一个树焦点子集)上的联合变化检测和描述任务中表现出色,突显了交互式、基于LLM的RSICI系统在提高森林变化分析的可访问性、可解释性和分析效率方面的潜力。
Summary / 总结
Forest-Chat is an LLM-driven agent designed for integrated forest change analysis, addressing pixel-level change detection and semantic change interpretation. It uses a multi-level change interpretation vision-language backbone and incorporates zero-shot change detection with an interactive point-prompt interface. The system demonstrates strong performance on the Forest-Change dataset and LEVIR-MCI-Trees, showing the potential of interactive, LLM-driven RSICI systems for forest change analysis.
Forest-Chat 是一个基于 LLM 的代理,用于集成森林变化分析,解决像素级变化检测和语义变化解释问题。它使用多级变化解释视觉语言骨干和基于 LLM 的编排,并通过交互式点提示界面提供精细的用户指导。实验结果显示,Forest-Chat 在 Forest-Change 和 LEVIR-MCI-Trees 数据集上表现出色,展示了交互式、基于 LLM 的 RSICI 系统在森林变化分析中的潜力。
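To make the deforestation percentage estimation task concrete, here is a minimal sketch. It is not Forest-Chat's implementation. It assumes the pixel-level change mask is a 2D grid of 0/1 values where 1 marks a changed pixel, and that the percentage is taken over all pixels in the tile; the function name `deforestation_percentage` is hypothetical.

```python
def deforestation_percentage(mask):
    """Return the share of pixels flagged as changed, in percent.

    `mask` is a list of rows, each a list of 0/1 values (assumed format).
    The denominator is every pixel in the tile (an assumption, since the
    abstract does not say whether only forested pixels are counted).
    """
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    changed = sum(1 for row in mask for value in row if value)
    return 100.0 * changed / total


# Toy 4x4 mask: the top-right 2x2 block is marked as changed.
example_mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(deforestation_percentage(example_mask))
```

If the percentage should instead be relative to the forested area in the earlier image, the denominator would need a separate forest mask, which this sketch does not model.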
Learning Consistent Taxonomic Classification through Hierarchical Reasoning
Authors: Zhenghong Li, Kecheng Zheng, Haibin Ling
First: 2026-01-21T03:00:00+00:00 · Latest: 2026-01-21T03:00:00+00:00
Comments: 12 pages, 4 figures
Abstract
While Vision-Language Models (VLMs) excel at visual understanding, they often fail to grasp hierarchical knowledge. This leads to common errors where VLMs misclassify coarser taxonomic levels even when correctly identifying the most specific level (leaf level). Existing approaches largely overlook this issue by failing to model hierarchical reasoning. To address this gap, we propose VL-Taxon, a two-stage, hierarchy-based reasoning framework designed to improve both leaf-level accuracy and hierarchical consistency in taxonomic classification. The first stage employs a top-down process to enhance leaf-level classification accuracy. The second stage then leverages this accurate leaf-level output to ensure consistency throughout the entire taxonomic hierarchy. Each stage is initially trained with supervised fine-tuning to instill taxonomy knowledge, followed by reinforcement learning to refine the model's reasoning and generalization capabilities. Extensive experiments reveal a remarkable result: our VL-Taxon framework, implemented on the Qwen2.5-VL-7B model, outperforms its original 72B counterpart by over 10% in both leaf-level and hierarchical consistency accuracy on average on the iNaturalist-2021 dataset. Notably, this significant gain was achieved by fine-tuning on just a small subset of data, without relying on any examples generated by other VLMs.
中文标题/摘要
标题:通过层次推理实现一致的分类学分类学习
视觉-语言模型(VLMs)在视觉理解方面表现出色,但在掌握层次知识方面却常常力不从心。这导致VLMs在正确识别最具体层次(叶节点层次)时,仍会犯错,错误地分类更粗略的分类学层次。现有方法大多忽视了这一问题,未能建模层次推理。为解决这一差距,我们提出了VL-Taxon,这是一种基于层次的两阶段推理框架,旨在提高分类学分类中的叶节点准确性和层次一致性。第一阶段采用自上而下的过程来提高叶节点分类的准确性。第二阶段则利用准确的叶节点输出,确保整个分类学层次的一致性。每个阶段最初通过监督微调来灌输分类学知识,然后通过强化学习来完善模型的推理和泛化能力。广泛的实验表明,我们的VL-Taxon框架在Qwen2.5-VL-7B模型上实施后,在iNaturalist-2021数据集上的叶节点和层次一致性准确性平均提高了超过10%,其性能优于其原始的72B版本。值得注意的是,这一显著的提升仅通过微调一小部分数据实现,而无需依赖其他VLM生成的任何示例。
Summary / 总结
The research aims to improve the hierarchical consistency of Vision-Language Models (VLMs), which often misclassify coarser taxonomic levels even when the leaf level is correct. The proposed VL-Taxon framework uses a two-stage, hierarchy-based reasoning approach: the first stage enhances leaf-level classification accuracy through a top-down process, and the second uses that leaf-level output to ensure consistency throughout the hierarchy. Each stage is trained with supervised fine-tuning followed by reinforcement learning. Implemented on Qwen2.5-VL-7B and fine-tuned on a small subset of data, VL-Taxon outperforms the original 72B counterpart by over 10% on average in both leaf-level and hierarchical consistency accuracy on the iNaturalist-2021 dataset.
研究旨在提高视觉语言模型(VLMs)的层次推理能力,以解决其在分类学分类中的常见错误。提出了VL-Taxon框架,该框架首先提高叶级分类准确性,然后确保整个分类学层次结构的一致性。该模型在iNaturalist-2021数据集上的叶级和层次一致性准确性均比原72B模型高出超过10%,仅使用少量数据进行微调。
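The difference between leaf-level accuracy and hierarchical consistency can be illustrated with a short sketch. The abstract does not define the metrics, so this assumes a prediction is hierarchically consistent only when every level of the predicted path matches the ground truth; the function names and the toy taxonomy paths are hypothetical.

```python
def leaf_accuracy(predictions, targets):
    """Fraction of samples whose most specific (last) level is correct."""
    hits = sum(1 for p, t in zip(predictions, targets) if p[-1] == t[-1])
    return hits / len(targets)


def hierarchical_consistency(predictions, targets):
    """Fraction of samples whose whole predicted path matches the target.

    Assumed definition: every level, coarse to fine, must be correct.
    """
    hits = sum(1 for p, t in zip(predictions, targets) if tuple(p) == tuple(t))
    return hits / len(targets)


# Paths are (kingdom, class, species); shortened for illustration.
targets = [
    ("Animalia", "Aves", "Corvus corax"),
    ("Animalia", "Mammalia", "Vulpes vulpes"),
]
predictions = [
    ("Animalia", "Aves", "Corvus corax"),    # correct at every level
    ("Animalia", "Aves", "Vulpes vulpes"),   # correct leaf, wrong class
]

print(leaf_accuracy(predictions, targets))
print(hierarchical_consistency(predictions, targets))
```

The second prediction is the failure mode the abstract describes: the leaf is right while a coarser level is wrong, so leaf accuracy stays high and consistency drops.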
3D Space as a Scratchpad for Editable Text-to-Image Generation
Authors: Oindrila Saha, Vojtech Krs, Radomir Mech, Subhransu Maji, Matheus Gadelha, Kevin Blackburn-Matzen
First: 2026-01-21T02:40:19+00:00 · Latest: 2026-01-21T02:40:19+00:00
Abstract
Recent progress in large language models (LLMs) has shown that reasoning improves when intermediate thoughts are externalized into explicit workspaces, such as chain-of-thought traces or tool-augmented reasoning. Yet, visual language models (VLMs) lack an analogous mechanism for spatial reasoning, limiting their ability to generate images that accurately reflect geometric relations, object identities, and compositional intent. We introduce the concept of a spatial scratchpad -- a 3D reasoning substrate that bridges linguistic intent and image synthesis. Given a text prompt, our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection. The resulting 3D arrangement is rendered back into the image domain with identity-preserving cues, enabling the VLM to generate spatially consistent and visually coherent outputs. Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images. Empirically, it achieves a 32% improvement in text alignment on GenAI-Bench, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation. Our results highlight a new paradigm for vision-language models that deliberate not only in language, but also in space. Code and visualizations at https://oindrilasaha.github.io/3DScratchpad/
中文标题/摘要
标题:三维空间作为可编辑文本到图像生成的草稿板
大型语言模型(LLMs)的最新进展表明,当中间想法被外部化到显式的工件空间,如链式思考痕迹或工具增强推理时,推理会得到改善。然而,视觉语言模型(VLMs)缺乏类似的空间推理机制,限制了它们生成准确反映几何关系、物体身份和组合意图的图像的能力。我们引入了空间草稿板的概念——一种将语言意图与图像合成联系起来的三维推理基础结构。给定一个文本提示,我们的框架解析主题和背景元素,将它们实例化为可编辑的3D网格,并采用代理场景规划进行放置、方向和视点选择。生成的3D布局以保持身份的线索重新渲染回图像域,使VLM能够生成空间一致且视觉连贯的输出。与基于2D布局的先前方法不同,我们的方法支持直观的3D编辑,这些编辑可靠地传播到最终图像中。实验证明,它在GenAI-Bench上的文本对齐提高了32%,证明了显式3D推理对于精确可控的图像生成的好处。我们的结果突显了一种新的视觉语言模型范式,不仅在语言中进行推理,还在空间中进行推理。代码和可视化可在https://oindrilasaha.github.io/3DScratchpad/获取。
Summary / 总结
This paper addresses the limited spatial reasoning of visual language models by introducing a spatial scratchpad, a 3D reasoning substrate that bridges linguistic intent and image synthesis. The method parses a text prompt into subjects and background elements, instantiates them as editable 3D meshes, and uses agentic scene planning for placement, orientation, and viewpoint selection. The 3D arrangement is then rendered back into the image domain with identity-preserving cues, yielding spatially consistent and visually coherent outputs. Unlike prior 2D layout-based methods, the approach supports 3D edits that propagate reliably into the final image. It achieves a 32% improvement in text alignment on GenAI-Bench, highlighting the benefit of explicit 3D reasoning for precise, controllable image generation.
本文通过引入一种称为空间擦除板的3D推理子结构,解决了视觉语言模型在处理空间推理方面的局限性。该方法包括解析文本提示以创建可编辑的3D网格,并使用代理场景规划进行放置和视角选择。最终的3D布局被渲染回图像域,与基于2D布局的方法相比,在GenAI-Bench上文本对齐的准确度提高了32%,突显了显式3D推理对精确图像生成的好处。
Coding the Visual World: From Image to Simulation Using Vision Language Models
Authors: Sagi Eppel
First: 2026-01-08T19:49:05+00:00 · Latest: 2026-01-20T21:37:57+00:00
Abstract
The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations. Through analysis of the models and images generated by the VLMs, we examine their understanding of the systems in images. The results show that leading VLMs (GPT, Gemini) have the ability to understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, the VLMs exhibit limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine high-level, deep visual understanding of images with limited perception of fine details.
中文标题/摘要
标题:编码可视世界:从图像到模拟的视觉语言模型
构建世界的心理模型是理解的核心方面。同样,视觉理解可以被视为构建图像中所描绘系统代表模型的能力。这项工作探讨了视觉语言模型(VLMs)使用Im2Sim方法识别和模拟图像中所描绘的系统和机制的能力。VLM被给予一个真实世界的自然图像(例如,城市、云、植被),并被要求描述该系统并编写模拟和生成它的代码。然后执行生成的代码以产生合成图像,并将其与原始图像进行比较。这种方法在各种复杂的涌现系统上进行了测试,从物理系统(波、光、云)到植被、城市、材料和地质构造。通过分析VLM生成的模型和图像,我们研究了它们对图像中系统的理解。结果表明,领先的VLM(GPT、Gemini)能够在多个抽象层次和多个领域中理解和建模复杂的多组件系统。同时,VLM在复制图像中的细部和低级模式排列方面表现出有限的能力。这些发现揭示了一个有趣的不对称性:VLM结合了对图像的高层次、深入的视觉理解,但对细部感知有限。
Summary / 总结
This work investigates the capability of Vision Language Models (VLMs) to recognize and simulate complex systems depicted in images using the Im2Sim methodology. VLMs are given natural images and asked to describe the system and write code to simulate it. The synthetic images generated by executing this code are compared to the original images. The study shows that leading VLMs like GPT and Gemini can understand and model complex, multi-component systems across various domains but struggle with replicating fine details and low-level patterns in the images. This reveals an interesting asymmetry in VLMs' visual understanding capabilities.
该研究探讨了视觉语言模型(VLMs)使用Im2Sim方法识别和模拟图像中复杂系统的能力。VLMs被给定自然图像并要求描述系统并编写模拟代码,该代码执行后生成合成图像。结果表明,领先的VLMs如GPT和Gemini能够理解并跨多个领域建模复杂的多组件系统,但在复制图像中的细节点和模式布局方面存在局限性。这揭示了VLMs视觉理解的一种有趣不对称性,即结合了高层次的视觉理解能力与对细节点的感知限制。
GutenOCR: A Grounded Vision-Language Front-End for Documents
Authors: Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew
First: 2026-01-20T21:26:15+00:00 · Latest: 2026-01-20T21:26:15+00:00
Abstract
GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
中文标题/摘要
标题:GutenOCR:一种基于文档的视觉-语言前端
GutenOCR 是通过微调 Qwen2.5-VL-3B 和 Qwen2.5-VL-7B 获得的一系列基于文档的 OCR 前端。生成的单模型视觉-语言模型通过统一的提示界面展示了阅读、检测和定位。该模型在商业文档、科学文章和合成定位数据上进行训练,支持全页和局部阅读,具有行级和段落级的边界框,并支持“x 在哪里?”的条件查询。我们引入了一种基于文档的 OCR 评估协议,并展示了 GutenOCR-7B 在 10.5K 保留的商业和科学页面上将 Qwen2.5-VL-7B 主干的综合基于文档的 OCR 分数提高了 1.05(从 0.40 到 0.82)。在 Fox 和 OmniDocBench v1.5 上,我们的方法显著提高了区域级和行级 OCR 以及文本检测召回率,但显示了页面级线性化、颜色引导 OCR 和公式密集布局方面的权衡。
Summary / 总结
GutenOCR is a family of grounded OCR vision-language models fine-tuned from Qwen2.5-VL-3B and Qwen2.5-VL-7B, exposing reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and can answer "where is x?" queries. On 10.5K held-out business and scientific pages, GutenOCR-7B more than doubles the composite grounded OCR score of its backbone (0.40 to 0.82). On Fox and OmniDocBench v1.5 it improves region- and line-level OCR and text-detection recall, but shows trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.
GutenOCR 是从 Qwen2.5-VL-3B 和 Qwen2.5-VL-7B 精调而来的视觉语言模型,通过提示式接口提供统一的阅读、检测和定位功能。经过商业文档和科学文章的训练,GutenOCR-7B 显著提高了地面OCR得分,复合得分为0.82,而其基础模型的得分为0.40。它增强了区域和行级OCR以及文本检测召回率,但在页面级线性化、颜色引导OCR和公式密集布局方面显示出一些局限性。
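Text-detection recall over bounding boxes can be sketched as follows. This is an illustration, not the paper's evaluation protocol: it assumes boxes are `(x1, y1, x2, y2)` tuples, that a ground-truth box counts as found when any prediction overlaps it with IoU of at least 0.5, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_recall(predicted, ground_truth, threshold=0.5):
    """Share of ground-truth boxes overlapped by at least one prediction.

    No one-to-one matching is enforced; the threshold is an assumption.
    """
    if not ground_truth:
        raise ValueError("ground_truth is empty")
    found = sum(
        1 for g in ground_truth
        if any(iou(g, p) >= threshold for p in predicted)
    )
    return found / len(ground_truth)


ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [(0, 0, 10, 8), (50, 50, 60, 60)]
print(detection_recall(predicted, ground_truth))
```

A stricter protocol would match each prediction to at most one ground-truth box; the simpler rule here keeps the sketch short.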
VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration
Authors: Saeed Khaki, Ashudeep Singh, Nima Safaei, Kamal Ginotra
First: 2026-01-20T19:54:49+00:00 · Latest: 2026-01-20T19:54:49+00:00
Abstract
Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We empirically characterize this as a modality gap: the same question in text form yields markedly higher accuracy than its visually typeset counterpart, due to compounded failures in reading dense formulas, layout, and mixed symbolic-diagrammatic context. First, we introduce VisTIRA (Vision and Tool-Integrated Reasoning Agent), a tool-integrated reasoning framework that enables structured problem solving by iteratively decomposing a given math problem (as an image) into natural language rationales and executable Python steps to determine the final answer. Second, we build a framework to measure and improve visual math reasoning: a LaTeX-based pipeline that converts chain-of-thought math corpora (e.g., NuminaMath) into challenging image counterparts, and a large set of synthetic tool-use trajectories derived from a real-world, homework-style image dataset (called SnapAsk) for fine-tuning VLMs. Our experiments show that tool-integrated supervision improves image-based reasoning, and OCR grounding can further narrow the gap for smaller models, although its benefit diminishes at scale. These findings highlight that modality gap severity inversely correlates with model size, and that structured reasoning and OCR-based grounding are complementary strategies for advancing visual mathematical reasoning.
中文标题/摘要
标题:VisTIRA:通过结构化工具集成缩小视觉数学推理中的图像-文本模态差距
视觉语言模型(VLMs)在以图像形式而非文本形式呈现相同问题时,在数学推理方面落后于仅文本的语言模型。我们实证地将这种差距归因于模态差距:文本形式的问题比其视觉排版的对应物具有明显更高的准确性,原因在于阅读密集公式、布局和混合符号-图表上下文时的复合失败。首先,我们引入了VisTIRA(视觉和工具集成推理代理),这是一种工具集成的推理框架,通过迭代地将给定的数学问题(作为图像)分解为自然语言推理和可执行的Python步骤来确定最终答案,从而实现结构化问题解决。其次,我们构建了一个衡量和提高视觉数学推理能力的框架:一个基于LaTeX的流水线,将链式思维数学语料库(例如,NuminaMath)转换为具有挑战性的图像对应物,并从一种真实世界的、类似于家庭作业的图像数据集(称为SnapAsk)中生成大量合成的工具使用轨迹,用于微调VLMs。我们的实验表明,工具集成的监督可以提高基于图像的推理能力,而OCR定位可以进一步缩小较小模型的差距,尽管其益处随着规模的扩大而减弱。这些发现表明,模态差距的严重程度与模型大小呈反比,而结构化推理和基于OCR的定位是促进视觉数学推理的互补策略。
Summary / 总结
The research aims to address the modality gap in visual math reasoning where vision-language models perform poorly compared to text-only models when presented with images of math problems. The study introduces VisTIRA, a tool-integrated reasoning framework that decomposes math problems into natural language rationales and executable steps, and a framework for measuring and improving visual math reasoning through a LaTeX-based pipeline and synthetic tool-use trajectories. Key findings show that tool-integrated supervision enhances image-based reasoning, and OCR grounding helps smaller models, though its effectiveness diminishes with larger models.
研究旨在解决视觉数学推理中的模态差距,即视觉语言模型在面对图像时的表现不如文本模型。研究引入了VisTIRA,一种工具集成推理框架,将数学问题分解为自然语言推理和可执行步骤。实验表明,工具集成监督和OCR定位可以改善基于图像的推理,尽管后者对大模型的益处会减少。研究发现,结构化推理和基于OCR的定位是推进视觉数学推理的有效策略。
Large-Scale Label Quality Assessment for Medical Segmentation via a Vision-Language Judge and Synthetic Data
Authors: Yixiong Chen, Zongwei Zhou, Wenxuan Li, Alan Yuille
Venue: ISBI 2026
First: 2026-01-20T19:09:12+00:00 · Latest: 2026-01-20T19:09:12+00:00
Comments: ISBI 2026 accepted
Abstract
Large-scale medical segmentation datasets often combine manual and pseudo-labels of uneven quality, which can compromise training and evaluation. Low-quality labels may hamper performance and make the model training less robust. To address this issue, we propose SegAE (Segmentation Assessment Engine), a lightweight vision-language model (VLM) that automatically predicts label quality across 142 anatomical structures. Trained on over four million image-label pairs with quality scores, SegAE achieves a high correlation coefficient of 0.902 with ground-truth Dice similarity and evaluates a 3D mask in 0.06s. SegAE shows several practical benefits: (I) Our analysis reveals widespread low-quality labeling across public datasets; (II) SegAE improves data efficiency and training performance in active and semi-supervised learning, reducing dataset annotation cost by one-third and quality-checking time by 70% per label. This tool provides a simple and effective solution for quality control in large-scale medical segmentation datasets. The dataset, model weights, and codes are released at https://github.com/Schuture/SegAE.
中文标题/摘要
标题:通过视觉语言裁判和合成数据对大规模医学分割标签质量进行评估
大规模医学分割数据集通常结合了质量参差不齐的手动和伪标签,这可能会影响训练和评估。低质量的标签可能阻碍性能并使模型训练不够稳健。为了解决这一问题,我们提出了SegAE(分割评估引擎),这是一种轻量级的视觉语言模型(VLM),可以自动预测142种解剖结构的标签质量。SegAE在超过四百万张图像-标签对的质量评分上进行训练,与真实Dice相似度的相关系数达到0.902,并能在0.06秒内评估一个3D掩码。SegAE展示了几个实际优势:(I) 我们的分析揭示了公共数据集中普遍存在低质量标注;(II) SegAE在主动学习和半监督学习中提高了数据效率和训练性能,将数据标注成本降低了三分之一,并将每标签的质量检查时间减少了70%。该工具为大规模医学分割数据集的质量控制提供了一个简单有效的解决方案。数据集、模型权重和代码在https://github.com/Schuture/SegAE上发布。
Summary / 总结
The paper addresses the issue of low-quality labels in large-scale medical segmentation datasets by proposing SegAE, a lightweight vision-language model that predicts label quality across 142 anatomical structures. Trained on over four million image-label pairs with quality scores, SegAE reaches a correlation coefficient of 0.902 with ground-truth Dice similarity and evaluates a 3D mask in 0.06 seconds. The analysis reveals widespread low-quality labeling across public datasets, and in active and semi-supervised learning SegAE improves data efficiency and training performance, reducing annotation cost by one-third and quality-checking time by 70% per label. This tool offers a simple and effective solution for quality control in medical segmentation datasets.
论文提出了一种名为SegAE的视觉-语言模型,用于评估医学分割数据集中的标签质量。该模型在超过四百万张图像-标签对上进行训练,预测标签质量的皮尔逊相关系数达到0.902,并能在0.06秒内评估一个3D掩码。SegAE识别出公共数据集中普遍存在低质量标签的问题,并提高了数据效率和训练性能,将标注成本和每标签的质量检查时间分别降低了三分之一和70%。该工具为大型医学分割数据集的质量控制提供了一个简单而有效的解决方案。
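The two quantities behind the 0.902 figure, Dice similarity and a correlation coefficient, can be sketched in plain Python. This is an illustration rather than SegAE's code: it assumes masks are flat lists of 0/1 values and that the correlation is Pearson's, which the English abstract does not specify; the function names are hypothetical.

```python
import math


def dice(pred, truth):
    """Dice similarity between two flat binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as a perfect match (a convention).
    return 2.0 * inter / size if size else 1.0


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# One label compared against its reference mask.
print(dice([1, 1, 0, 0], [1, 0, 0, 0]))

# Hypothetical predicted quality scores vs. true Dice for three labels.
print(pearson([0.2, 0.5, 0.9], [0.25, 0.55, 0.85]))
```

A quality predictor is useful to the extent that its scores rank labels the same way true Dice does, which is what the correlation measures.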