Lost in Translation? Vocabulary Alignment for Source-Free Domain Adaptation in Open-Vocabulary Semantic Segmentation
Authors: Silvio Mazzucco, Carl Persson, Mattia Segu, Pier Luigi Dovesi, Federico Tombari, Luc Van Gool, Matteo Poggi
First: 2025-09-18T17:59:58+00:00 · Latest: 2025-09-18T17:59:58+00:00
Comments: BMVC 2025 - Project Page: https://thegoodailab.org/blog/vocalign - Code: https://github.com/Sisso16/VocAlign
Abstract
We introduce VocAlign, a novel source-free domain adaptation framework
specifically designed for VLMs in open-vocabulary semantic segmentation. Our
method adopts a student-teacher paradigm enhanced with a vocabulary alignment
strategy, which improves pseudo-label generation by incorporating additional
class concepts. To ensure efficiency, we use Low-Rank Adaptation (LoRA) to
fine-tune the model, preserving its original capabilities while minimizing
computational overhead. In addition, we propose a Top-K class selection
mechanism for the student model, which significantly reduces memory
requirements while further improving adaptation performance. Our approach
achieves a notable 6.11 mIoU improvement on the CityScapes dataset and
demonstrates superior performance on zero-shot segmentation benchmarks, setting
a new standard for source-free adaptation in the open-vocabulary setting.
中文标题/摘要
标题:迷失于翻译?开放词汇语义分割中无源领域适应的词汇对齐
我们引入了VocAlign,一种专为开放词汇语义分割中的VLM设计的无源领域适应框架。该方法采用学生-教师范式,并结合了词汇对齐策略,通过引入额外的类别概念来改进伪标签生成。为了确保效率,我们使用低秩适应(LoRA)对模型进行微调,同时保留其原始功能并最小化计算开销。此外,我们还提出了学生模型的Top-K类别选择机制,这显著减少了内存需求并进一步提高了适应性能。我们的方法在CityScapes数据集上实现了显著的6.11 mIoU改进,并在零样本分割基准测试中表现出色,为开放词汇设置中的无源适应设定了新标准。
Summary / 总结
The research introduces VocAlign, a source-free domain adaptation framework for VLMs in open-vocabulary semantic segmentation. It uses a student-teacher paradigm with vocabulary alignment and Low-Rank Adaptation (LoRA) to fine-tune the model efficiently. The approach also includes a Top-K class selection mechanism to reduce memory usage. The method achieves a 6.11 mIoU improvement on CityScapes and outperforms existing methods on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in open-vocabulary settings.
研究引入了VocAlign,这是一种针对开放词汇语义分割中VLM的无源领域适应框架。该方法采用学生-教师范式并结合词汇对齐策略来增强伪标签生成,并采用低秩适应(LoRA)进行高效微调。此外,该方法还包含一个Top-K类别选择机制以减少内存使用。VocAlign在CityScapes上实现了6.11 mIoU的改进,并在零样本分割基准测试中表现出色,为开放词汇设置中的无源适应设定了新标准。
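The student-teacher loop and Top-K selection described for VocAlign can be illustrated with a toy sketch. The abstract does not give the update rule or the selection criterion, so the exponential-moving-average teacher update, the per-image class scores, and the names `ema_update` and `select_top_k_classes` are all assumptions made for illustration.

```python
from typing import Dict, List


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               momentum: float = 0.99) -> Dict[str, float]:
    """Move each teacher weight a small step toward the matching student weight."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}


def select_top_k_classes(class_scores: Dict[str, float], k: int) -> List[str]:
    """Keep the k class names with the highest (assumed) teacher score."""
    ranked = sorted(class_scores, key=lambda name: class_scores[name], reverse=True)
    return ranked[:k]


# Toy scalar "weights" standing in for the trainable low-rank adapter parameters.
teacher = {"lora_a": 1.0, "lora_b": 0.0}
student = {"lora_a": 0.0, "lora_b": 1.0}
teacher = ema_update(teacher, student, momentum=0.9)

# Hypothetical per-image class scores; the student would only process the kept classes.
scores = {"road": 0.61, "car": 0.22, "sky": 0.12, "train": 0.03, "rider": 0.02}
kept = select_top_k_classes(scores, k=3)
print(teacher, kept)
```

Restricting the student to a few classes per image is one plausible way such a mechanism could reduce memory, since fewer class embeddings need to be scored per pixel.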
Calibration-Aware Prompt Learning for Medical Vision-Language Models
Authors: Abhishek Basu, Fahad Shamshad, Ashshak Sharifdeen, Karthik Nandakumar, Muhammad Haris Khan
First: 2025-09-18T17:59:58+00:00 · Latest: 2025-09-18T17:59:58+00:00
Comments: Accepted in BMVC 2025
Abstract
Medical Vision-Language Models (Med-VLMs) have demonstrated remarkable
performance across diverse medical imaging tasks by leveraging large-scale
image-text pretraining. However, their confidence calibration is largely
unexplored and remains a significant challenge: miscalibrated
predictions can lead to overconfident errors, undermining clinical trust and
decision-making reliability. To address this, we introduce CalibPrompt, the
first framework to calibrate Med-VLMs during prompt tuning. CalibPrompt
optimizes a small set of learnable prompts with carefully designed calibration
objectives in a scarce labeled data regime. First, we study a regularizer that
attempts to align the smoothed accuracy with the predicted model confidences.
Second, we introduce an angular separation loss to maximize textual feature
proximity, with the aim of improving the reliability of the confidence estimates of
multimodal Med-VLMs. Extensive experiments on four publicly available Med-VLMs
and five diverse medical imaging datasets reveal that CalibPrompt consistently
improves calibration without drastically affecting clean accuracy. Our code is
available at https://github.com/iabh1shekbasu/CalibPrompt.
中文标题/摘要
标题:医疗视觉语言模型的校准感知提示学习
医疗视觉语言模型(Med-VLMs)通过大规模图像文本预训练,在多种医疗成像任务中表现出色。然而,它们的置信度校准尚未得到充分探索,仍然是一个重大挑战。因此,未校准的预测可能导致过度自信的错误,削弱临床信任和决策可靠性。为了解决这一问题,我们引入了CalibPrompt,这是第一个在提示调优过程中校准Med-VLMs的框架。CalibPrompt在少量标注数据条件下,通过精心设计的校准目标优化一小组可学习的提示。首先,我们研究了一个正则化器,试图使平滑后的准确率与预测模型置信度对齐。其次,我们引入了角度分离损失,以最大化文本特征的接近度,从而提高多模态Med-VLMs置信度估计的可靠性。在四个公开的Med-VLMs和五个多样化的医疗成像数据集上的广泛实验表明,CalibPrompt在不大幅影响干净准确率的情况下,始终能够提高校准。我们的代码可在https://github.com/iabh1shekbasu/CalibPrompt获取。
Summary / 总结
The paper introduces CalibPrompt, a framework for calibrating Medical Vision-Language Models (Med-VLMs) during prompt tuning. It optimizes learnable prompts with calibration objectives under limited labeled data. CalibPrompt uses a regularizer to align smoothed accuracy with predicted model confidences and an angular separation loss to enhance textual feature proximity. Experiments show that CalibPrompt improves calibration without significantly affecting clean accuracy across four Med-VLMs and five medical imaging datasets.
论文提出了CalibPrompt框架,在提示调优过程中校准医疗视觉-语言模型(Med-VLMs),解决预测失准的问题。它在有限标注数据下优化可学习提示,并通过校准目标实现这一目标。实验表明,CalibPrompt在四个Med-VLMs和五个医学影像数据集上提高了校准效果,同时对干净准确率影响不大。
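To make "calibration" concrete, the sketch below computes expected calibration error (ECE), a widely used measure of the gap between confidence and accuracy. The abstract does not say which metric CalibPrompt reports, so ECE, the equal-width binning, and the toy predictions are assumptions.

```python
from typing import List


def expected_calibration_error(confidences: List[float], correct: List[bool],
                               n_bins: int = 10) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; a confidence of exactly 0.0 goes into the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(1 for i in idx if correct[i]) / len(idx)
        ece += (len(idx) / total) * abs(avg_conf - accuracy)
    return ece


# Toy predictions: the 0.95-confidence bin is right only half the time (overconfident).
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [True, False, True, False]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 4))
```

A perfectly calibrated model would have an ECE of zero; overconfident errors of the kind the abstract warns about push it up.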
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Authors: Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang
First: 2025-09-18T17:59:22+00:00 · Latest: 2025-09-18T17:59:22+00:00
Abstract
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that
operate GUIs autonomously, showing great potential, yet progress is limited by
the lack of large-scale, open-source computer use data and foundation models.
In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It
offers a large-scale dataset spanning 6 operating systems and 3 task domains,
built via a closed-loop pipeline uniting automated agents with human experts.
Trained on this scaled-up data, ScaleCUA can operate seamlessly across
platforms. Specifically, it delivers strong gains over baselines (+26.6 on
WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art
results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on
WebArena-Lite-v2). These findings underscore the power of data-driven scaling
for general-purpose computer use agents. We will release data, models, and code
to advance future research: https://github.com/OpenGVLab/ScaleCUA.
中文标题/摘要
标题:ScaleCUA:跨平台数据扩展开源计算机使用代理
视觉-语言模型(VLMs)使计算机使用代理(CUAs)能够自主操作GUI,展现出巨大的潜力,但进展受限于缺乏大规模、开源的计算机使用数据和基础模型。在本项工作中,我们介绍了ScaleCUA,这是迈向扩展开源CUA的一个步骤。它提供了一个跨越6个操作系统和3个任务领域的大型数据集,通过结合自动化代理和人类专家的闭环管道构建而成。在这些扩展的数据上训练后,ScaleCUA可以在不同平台之间无缝操作。具体而言,它在WebArena-Lite-v2上比基线模型提高了26.6%,在ScreenSpot-Pro上提高了10.7%,并在MMBench-GUI L1-Hard上达到了94.4%的新最佳结果,在OSWorld-G上达到了60.6%,在WebArena-Lite-v2上达到了47.4%。这些发现强调了数据驱动扩展对于通用计算机使用代理的强大作用。我们将发布数据、模型和代码以促进未来研究:https://github.com/OpenGVLab/ScaleCUA。
Summary / 总结
ScaleCUA addresses the scarcity of open-source computer use data and foundation models by introducing a large-scale dataset spanning 6 operating systems and 3 task domains. Built with a closed-loop pipeline that combines automated agents and human experts, the dataset yields models with strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and new state-of-the-art results on MMBench-GUI L1-Hard, OSWorld-G, and WebArena-Lite-v2. The work highlights the importance of data-driven scaling for general-purpose computer use agents, and the authors plan to release data, models, and code for further research.
ScaleCUA通过引入跨越多个操作系统和任务领域的大型数据集来解决开源计算机使用代理的限制。利用结合自动化代理和人工专家的闭环管道,ScaleCUA在MMBench-GUI L1-Hard、OSWorld-G和WebArena-Lite-v2上实现了显著的改进,达到了新的最先进结果。这项工作强调了数据驱动扩展对于通用计算机使用代理的重要性,并将发布数据、模型和代码以促进进一步研究。
MedFact-R1: Towards Factual Medical Reasoning via Pseudo-Label Augmentation
Authors: Gengliang Li, Rongyu Chen, Bin Li, Linlin Yang, Guodong Ding
First: 2025-09-18T16:59:59+00:00 · Latest: 2025-09-18T16:59:59+00:00
Comments: Tech report
Abstract
Ensuring factual consistency and reliable reasoning remains a critical
challenge for medical vision-language models. We introduce MEDFACT-R1, a
two-stage framework that integrates external knowledge grounding with
reinforcement learning to improve factual medical reasoning. The first
stage uses pseudo-label supervised fine-tuning (SFT) to incorporate external
factual expertise, while the second stage applies Group Relative Policy
Optimization (GRPO) with four tailored factual reward signals to encourage
self-consistent reasoning. Across three public medical QA benchmarks,
MEDFACT-R1 delivers up to 22.5% absolute improvement in factual accuracy over
previous state-of-the-art methods. Ablation studies highlight the necessity of
pseudo-label SFT cold start and validate the contribution of each GRPO reward,
underscoring the synergy between knowledge grounding and RL-driven reasoning
for trustworthy medical AI. Code is released at
https://github.com/Garfieldgengliang/MEDFACT-R1.
中文标题/摘要
标题:MedFact-R1:通过伪标签增强实现医学事实推理
确保事实一致性与可靠推理仍然是医学视觉-语言模型的关键挑战。我们引入了MEDFACT-R1,这是一种两阶段框架,结合了外部知识接地与强化学习,以提高医学事实推理。第一阶段使用伪标签监督微调(SFT)来整合外部事实专业知识;而第二阶段则应用组相对策略优化(GRPO)和四个定制的事实奖励信号,以促进自我一致的推理。在三个公开的医学问答基准测试中,MEDFACT-R1在事实准确性上比之前最先进的方法提高了高达22.5%。消融研究强调了伪标签SFT冷启动的必要性,并验证了每个GRPO奖励的贡献,突显了知识接地与基于RL的推理之间的协同作用对于可信赖的医学AI的重要性。代码已发布于https://github.com/Garfieldgengliang/MEDFACT-R1。
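The core idea of the GRPO stage is that each sampled answer is scored relative to the other answers drawn for the same question. The sketch below shows that group-relative normalization only. Each reward stands in for the combined value of the four factual reward signals, whose definitions the abstract does not give, and the population standard deviation and `eps` term are assumptions.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and spread of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical total rewards for four answers sampled for one medical question.
group_rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(group_rewards)
print([round(a, 3) for a in advantages])
```

Answers that beat their group average get a positive advantage and are reinforced; if every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal.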
WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance
Authors: Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang
First: 2025-09-18T16:40:47+00:00 · Latest: 2025-09-18T16:40:47+00:00
Comments: Project Webpage: https://worldforge-agi.github.io/
Abstract
Recent video diffusion models demonstrate strong potential in spatial
intelligence tasks due to their rich latent world priors. However, this
potential is hindered by their limited controllability and geometric
inconsistency, creating a gap between their strong priors and their practical
use in 3D/4D tasks. As a result, current approaches often rely on retraining or
fine-tuning, which risks degrading pretrained knowledge and incurs high
computational costs. To address this, we propose WorldForge, a training-free,
inference-time framework composed of three tightly coupled modules. Intra-Step
Recursive Refinement introduces a recursive refinement mechanism during
inference, which repeatedly optimizes network predictions within each denoising
step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages
optical flow similarity to decouple motion from appearance in the latent space
and selectively inject trajectory guidance into motion-related channels.
Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths
to adaptively correct trajectory drift caused by noisy or misaligned structural
signals. Together, these components inject fine-grained, trajectory-aligned
guidance without training, achieving both accurate motion control and
photorealistic content generation. Extensive experiments across diverse
benchmarks validate our method's superiority in realism, trajectory
consistency, and visual fidelity. This work introduces a novel plug-and-play
paradigm for controllable video synthesis, offering a new perspective on
leveraging generative priors for spatial intelligence.
Summary / 总结
WorldForge is a training-free framework that enhances the controllability and geometric consistency of video diffusion models for 3D/4D generation. It consists of three modules: Intra-Step Recursive Refinement, Flow-Gated Latent Fusion, and Dual-Path Self-Corrective Guidance. These modules enable precise trajectory injection, selective motion guidance, and adaptive correction of trajectory drift, respectively. Experiments show that WorldForge outperforms existing methods in terms of realism, trajectory consistency, and visual fidelity across various benchmarks.
WorldForge 是一个无需训练的框架,旨在增强视频扩散模型在3D/4D任务中的可控性和几何一致性。它包含三个模块:Intra-Step Recursive Refinement、Flow-Gated Latent Fusion 和 Dual-Path Self-Corrective Guidance,在推理时注入轨迹对齐的指导。实验表明,WorldForge 在提高真实感、轨迹一致性及视觉保真度方面优于现有方法,且无需重新训练模型。
Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models
Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Chengyuan Yu, Mengshu Sun, Qiang Zhang, Yijie Guo, Kaidi Xu, Jize Zhang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu
First: 2024-09-20T03:02:05+00:00 · Latest: 2025-09-18T16:36:42+00:00
Abstract
Recently, driven by advancements in Multimodal Large Language Models (MLLMs),
Vision Language Action Models (VLAMs) are being proposed to achieve better
performance in open-vocabulary scenarios for robotic manipulation tasks. Since
manipulation tasks involve direct interaction with the physical world, ensuring
robustness and safety during task execution is a critical
issue. In this paper, by synthesizing current safety research on MLLMs
and the specific application scenarios of the manipulation task in the physical
world, we comprehensively evaluate VLAMs in the face of potential physical
threats. Specifically, we propose the Physical Vulnerability Evaluating
Pipeline (PVEP) that can incorporate as many visual modal physical threats as
possible for evaluating the physical robustness of VLAMs. The physical threats
in PVEP specifically include Out-of-Distribution, Typography-based Visual
Prompt, and Adversarial Patch Attacks. By comparing the performance
fluctuations of VLAMs before and after being attacked, we provide generalizable
analyses of how VLAMs respond to different physical threats.
中文标题/摘要
标题:面对威胁的操控:评估端到端视觉语言动作模型的物理脆弱性
近年来,随着多模态大型语言模型(MLLMs)的进步,视觉语言动作模型(VLAMs)被提出,以在机器人操控任务的开放词汇场景中实现更好的性能。由于操控任务涉及直接与物理世界交互,因此在执行此任务时确保稳健性和安全性始终是一个非常关键的问题。在本文中,通过综合当前MLLMs的安全研究以及操控任务在物理世界中的具体应用场景,我们全面评估了VLAMs在面对潜在物理威胁时的表现。具体地,我们提出了物理脆弱性评估管道(PVEP),它可以尽可能多地纳入视觉模态的物理威胁,以评估VLAMs的物理鲁棒性。PVEP中的物理威胁具体包括离分布、基于字体的视觉提示和对抗性补丁攻击。通过比较VLAMs在攻击前后性能的变化,我们提供了关于VLAMs如何应对不同物理威胁的可推广的分析。
Summary / 总结
This paper evaluates the physical robustness of Vision Language Action Models (VLAMs) in robotic manipulation tasks, which involve direct interaction with the physical world. The authors propose the Physical Vulnerability Evaluating Pipeline (PVEP) to assess how VLAMs handle physical threats such as out-of-distribution inputs, typography-based visual prompts, and adversarial patches. By comparing performance before and after each attack, the study analyzes how VLAMs respond to the different threats, providing insight into their physical vulnerabilities.
本文评估了在直接与物理世界交互的机器人操作任务中,视觉语言动作模型(VLAMs)的物理鲁棒性。作者提出了物理脆弱性评估管道(PVEP),以评估VLAMs在面对诸如分布外、基于字体的视觉提示和对抗性补丁等物理威胁时的表现。研究发现,当VLAMs受到这些攻击时,其性能会表现出不同的波动,提供了它们对物理威胁的脆弱性的见解。
Debias your Large Multi-Modal Model at Test-Time via Non-Contrastive Visual Attribute Steering
Authors: Neale Ratzlaff, Matthew Lyle Olson, Musashi Hinck, Estelle Aflalo, Shao-Yen Tseng, Vasudev Lal, Phillip Howard
First: 2024-11-15T20:06:09+00:00 · Latest: 2025-09-18T15:58:56+00:00
Comments: 10 pages, 6 Figures, 8 Tables. arXiv admin note: text overlap with arXiv:2410.13976
Abstract
Large Multi-Modal Models (LMMs) have demonstrated impressive capabilities as
general-purpose chatbots able to engage in conversations about visual inputs.
However, their responses are influenced by societal biases present in their
training datasets, leading to undesirable differences in how the model responds
when presented with images depicting people of different demographics. In this
work, we propose a training-free debiasing framework for LMMs that intervenes
on the model's representations during text generation by constructing a
steering vector that reduces reliance on protected attributes. Our framework
introduces two complementary methods: (1) a dataset-based approach that
constructs a steering vector by contrasting model activations on biased and
neutral inputs, and (2) a novel optimization-based approach designed for
low-resource settings, which constructs the steering vector using a single step
of gradient-based perturbation without requiring additional data. Our
experiments show that these interventions effectively reduce the propensity of
LMMs to generate text related to protected attributes while maintaining
sentiment and fluency. Furthermore, we demonstrate that debiased LMMs achieve
comparable accuracy to their unmodified counterparts on downstream tasks,
indicating that bias mitigation can be achieved without sacrificing model
performance.
中文标题/摘要
标题:在测试时通过非对比视觉属性引导消除大型多模态模型的偏差
大型多模态模型(LMMs)展示了作为通用聊天机器人的出色能力,能够就视觉输入进行对话。然而,它们的响应受到其训练数据集中存在的社会偏见的影响,导致在展示不同人口统计学特征的人像时,模型的响应存在不希望的差异。在本工作中,我们提出了一种无需训练的去偏见框架,该框架在文本生成过程中干预模型的表示,通过构建减少对受保护属性依赖的引导向量来实现。我们的框架引入了两种互补的方法:(1)基于数据的方法,通过对比模型在有偏和中性输入上的激活来构建引导向量;(2)一种针对资源有限环境的新颖优化方法,使用单步梯度扰动构建引导向量,无需额外数据。我们的实验表明,这些干预措施有效地减少了LMMs生成与受保护属性相关的文本的倾向,同时保持了情感和流畅性。此外,我们证明去偏见的LMMs在下游任务上的准确度与未修改的版本相当,表明可以在不牺牲模型性能的情况下实现偏见缓解。
Summary / 总结
This work addresses the issue of societal biases in Large Multi-Modal Models (LMMs) by proposing a training-free debiasing framework. The framework uses two methods: a dataset-based approach that constructs a steering vector by contrasting model activations on biased and neutral inputs, and an optimization-based approach that uses a single step of gradient-based perturbation. Experiments show that these interventions reduce the model's propensity to generate text related to protected attributes while maintaining sentiment and fluency, and debiased LMMs perform comparably to their unmodified counterparts on downstream tasks.
本文提出了一种无需训练的去偏见框架,用于解决大型多模态模型(LMMs)中的社会偏见问题。该框架使用两种方法:基于数据集的方法通过对比模型在偏见和中性输入上的激活来构建引导向量,以及一种优化方法,使用单步梯度扰动来构建引导向量。实验表明,这些干预措施可以减少模型生成与保护属性相关文本的倾向,同时保持情感和流畅性,去偏见的LMMs在下游任务上的性能与未修改的模型相当。
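The dataset-based variant can be pictured as a difference of mean activations that is then subtracted from a hidden state at generation time. The abstract only says the vector is built "by contrasting" activations, so the mean-difference form, the scaling factor `alpha`, and the toy two-dimensional activations below are assumptions.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def steering_vector(biased: List[Vector], neutral: List[Vector]) -> Vector:
    """Direction pointing from neutral activations toward biased ones."""
    mean_biased, mean_neutral = mean_vector(biased), mean_vector(neutral)
    return [b - n for b, n in zip(mean_biased, mean_neutral)]


def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state away from the attribute direction."""
    return [h - alpha * d for h, d in zip(hidden, direction)]


# Toy activations: only the first dimension carries the protected attribute.
biased_acts = [[2.0, 0.0], [4.0, 0.0]]
neutral_acts = [[1.0, 0.0], [1.0, 0.0]]
direction = steering_vector(biased_acts, neutral_acts)
steered = steer([5.0, 1.0], direction, alpha=0.5)
print(direction, steered)
```

Because the intervention is a vector addition at inference time, no model weights change, which matches the training-free framing of the abstract.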
Forecasting and Visualizing Air Quality from Sky Images with Vision-Language Models
Authors: Mohammad Saleh Vahdatpour, Maryam Eyvazi, Yanqing Zhang
First: 2025-09-18T15:36:38+00:00 · Latest: 2025-09-18T15:36:38+00:00
Comments: Published at ICCVW 2025
Abstract
Air pollution remains a critical threat to public health and environmental
sustainability, yet conventional monitoring systems are often constrained by
limited spatial coverage and accessibility. This paper proposes an AI-driven
agent that predicts ambient air pollution levels from sky images and
synthesizes realistic visualizations of pollution scenarios using generative
modeling. Our approach combines statistical texture analysis with supervised
learning for pollution classification, and leverages vision-language model
(VLM)-guided image generation to produce interpretable representations of air
quality conditions. The generated visuals simulate varying degrees of
pollution, offering a foundation for user-facing interfaces that improve
transparency and support informed environmental decision-making. These outputs
can be seamlessly integrated into intelligent applications aimed at enhancing
situational awareness and encouraging behavioral responses based on real-time
forecasts. We validate our method using a dataset of urban sky images and
demonstrate its effectiveness in both pollution level estimation and
semantically consistent visual synthesis. The system design further
incorporates human-centered user experience principles to ensure accessibility,
clarity, and public engagement in air quality forecasting. To support scalable
and energy-efficient deployment, future iterations will incorporate a green CNN
architecture enhanced with FPGA-based incremental learning, enabling real-time
inference on edge platforms.
中文标题/摘要
标题:基于视觉语言模型的天空图像空气质量预测与可视化
空气污染仍然是公共健康和环境可持续性的重大威胁,但传统的监测系统往往受限于有限的空间覆盖范围和可访问性。本文提出了一种基于人工智能的代理,可以从天空图像中预测大气污染水平,并使用生成模型合成现实的污染场景可视化。我们的方法结合了统计纹理分析和监督学习进行污染分类,并利用视觉语言模型(VLM)指导的图像生成来生成可解释的空气质量条件表示。生成的视觉效果模拟了不同程度的污染,为面向用户的界面提供了基础,以提高透明度并支持基于实时预测的环境决策。这些输出可以无缝集成到旨在增强态势感知并鼓励基于实时预测的行为响应的智能应用中。我们使用城市天空图像数据集验证了该方法,并证明了其在污染水平估计和语义一致的视觉合成方面的有效性。系统设计进一步融入了以用户为中心的人机交互原则,以确保空气质量预测的可访问性、清晰性和公众参与。为了实现可扩展和节能部署,未来的迭代将结合增强的FPGA基于增量学习的绿色CNN架构,以实现边缘平台上的实时推理。
QuizRank: Picking Images by Quizzing VLMs
Authors: Tenghao Ji, Eytan Adar
First: 2025-09-18T15:22:33+00:00 · Latest: 2025-09-18T15:22:33+00:00
Abstract
Images play a vital role in improving the readability and comprehension of
Wikipedia articles by serving as 'illustrative aids.' However, not all images
are equally effective and not all Wikipedia editors are trained in their
selection. We propose QuizRank, a novel method of image selection that
leverages large language models (LLMs) and vision language models (VLMs) to
rank images as learning interventions. Our approach transforms textual
descriptions of the article's subject into multiple-choice questions about
important visual characteristics of the concept. We utilize these questions to
quiz the VLM: the better an image can help answer questions, the higher it is
ranked. To further improve discrimination between visually similar items, we
introduce a Contrastive QuizRank that leverages differences in the features of
target (e.g., a Western Bluebird) and distractor concepts (e.g., Mountain
Bluebird) to generate questions. We demonstrate the potential of VLMs as
effective visual evaluators by showing a high congruence with human quiz-takers
and an effective discriminative ranking of images.
中文标题/摘要
标题:QuizRank:通过问答VLMs挑选图像
图像在提高维基百科文章的可读性和理解性方面起着至关重要的作用,作为‘说明性辅助工具’。然而,并非所有图像都同样有效,也不是所有维基百科编辑都受过图像选择的培训。我们提出QuizRank,一种新颖的图像选择方法,利用大型语言模型(LLMs)和视觉语言模型(VLMs)对图像进行排名,作为学习干预措施。我们的方法将文章主题的文本描述转化为关于概念重要视觉特征的多项选择题。我们利用这些问题来测试VLM:图像越能帮助回答问题,排名越高。为了进一步提高对视觉相似项目的区分度,我们引入了对比QuizRank,利用目标(如 Western Bluebird)和干扰概念(如 Mountain Bluebird)的特征差异来生成问题。我们通过展示VLMs与人类问答者的高一致性以及对图像的有效区分排名,证明了VLMs作为有效的视觉评估工具的潜力。
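The ranking rule in the abstract (the better an image helps answer the quiz, the higher it ranks) can be sketched as scoring each image by its share of correct answers. The VLM call is replaced by a canned lookup table, and the question wording, option letters, and image names are made-up toy data.

```python
from typing import Callable, Dict, List, Tuple

Question = Tuple[str, str]  # (question text, correct option letter)


def quiz_score(image_id: str, questions: List[Question],
               answer_fn: Callable[[str, str], str]) -> float:
    """Fraction of questions answered correctly when looking at this image."""
    right = sum(1 for text, key in questions if answer_fn(image_id, text) == key)
    return right / len(questions)


def rank_images(image_ids: List[str], questions: List[Question],
                answer_fn: Callable[[str, str], str]) -> List[Tuple[str, float]]:
    """Sort images from most to least helpful for answering the quiz."""
    scored = [(img, quiz_score(img, questions, answer_fn)) for img in image_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


questions = [("Q1: throat color?", "A"), ("Q2: breast color?", "B")]

# Stand-in for a real VLM: canned answers per (image, question) pair.
canned: Dict[Tuple[str, str], str] = {
    ("img_close_up", "Q1: throat color?"): "A",
    ("img_close_up", "Q2: breast color?"): "B",
    ("img_far_away", "Q1: throat color?"): "A",
    ("img_far_away", "Q2: breast color?"): "C",
}
ranking = rank_images(["img_far_away", "img_close_up"], questions,
                      lambda img, q: canned[(img, q)])
print(ranking)
```

In the contrastive variant, the same scoring would apply, but the questions would be generated from features that separate the target from a visually similar distractor.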
PRISM: Product Retrieval In Shopping Carts using Hybrid Matching
Authors: Arda Kabadayi, Senem Velipasalar, Jiajing Chen
First: 2025-09-18T14:15:37+00:00 · Latest: 2025-09-18T14:15:37+00:00
Abstract
Compared to traditional image retrieval tasks, product retrieval in retail
settings is even more challenging. Products of the same type from different
brands may have highly similar visual appearances, and the query image may be
taken from an angle that differs significantly from view angles of the stored
catalog images. Foundational models, such as CLIP and SigLIP, often struggle to
distinguish these subtle but important local differences. Pixel-wise matching
methods, on the other hand, are computationally expensive and incur
prohibitively high matching times. In this paper, we propose a new, hybrid
method, called PRISM, for product retrieval in retail settings by leveraging
the advantages of both vision-language model-based and pixel-wise matching
approaches. To provide both speed and fine-grained retrieval
accuracy, PRISM consists of three stages: 1) A vision-language model (SigLIP)
is employed first to retrieve the top 35 most semantically similar products
from a fixed gallery, thereby narrowing the search space significantly; 2) a
segmentation model (YOLO-E) is applied to eliminate background clutter; 3)
fine-grained pixel-level matching is performed using LightGlue across the
filtered candidates. This framework enables more accurate discrimination
between products with high inter-class similarity by focusing on subtle visual
cues often missed by global models. Experiments performed on the ABV dataset
show that our proposed PRISM outperforms the state-of-the-art image retrieval
methods by 4.21% in top-1 accuracy while still remaining within the bounds of
real-time processing for practical retail deployments.
中文标题/摘要
标题:PRISM:购物车中产品检索的混合匹配方法
与传统的图像检索任务相比,零售环境中的产品检索更具挑战性。同一类型的不同品牌产品可能具有高度相似的视觉外观,查询图像的角度可能与存储目录图像的角度相差很大。基础模型如CLIP和SigLIP往往难以区分这些细微但重要的局部差异。像素级匹配方法则计算成本高昂,匹配时间难以接受。本文提出了一种新的混合方法PRISM,通过结合基于视觉语言模型的方法和像素级匹配方法的优势,用于零售环境中的产品检索。PRISM由三个阶段组成:1) 使用视觉语言模型(SigLIP)从固定画廊中检索出最相似的35个产品,显著缩小搜索范围;2) 应用分割模型(YOLO-E)去除背景杂乱;3) 在筛选出的候选产品中使用LightGlue进行精细的像素级匹配。该框架通过关注全局模型常忽略的细微视觉线索,使具有高类间相似性的产品之间能够更准确地区分。在ABV数据集上的实验表明,我们的PRISM在top-1准确率上比最先进的图像检索方法高出4.21%,同时仍保持在实时处理的范围内,适用于实际零售部署。
Summary / 总结
PRISM is a hybrid method for product retrieval in retail settings, combining the strengths of vision-language models and pixel-wise matching. It consists of three stages: first, SigLIP retrieves the top 35 semantically similar products, then YOLO-E removes background clutter, and finally, LightGlue performs fine-grained pixel-level matching. Experiments show that PRISM outperforms state-of-the-art methods by 4.21% in top-1 accuracy while maintaining real-time processing capabilities.
PRISM 是一种结合视觉语言模型(SigLIP)和像素级匹配的混合方法,用于零售环境下的产品检索。它首先检索最相似的35个产品,然后使用分割模型去除背景杂乱,最后在筛选出的候选产品中进行精细的像素级匹配。实验表明,PRISM 在 top-1 准确率上比最先进的方法高出 4.21%,同时保持实时处理能力。
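The coarse-to-fine structure of PRISM can be illustrated with a two-step toy: an embedding shortlist followed by a re-rank on local match counts. This is a sketch under stated assumptions, not the authors' pipeline. The embeddings and match counts are made up, the segmentation stage is omitted, and the shortlist size is 2 instead of the 35 used in the paper.

```python
import math
from typing import Callable, Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def shortlist(query: List[float], gallery: Dict[str, List[float]], top_n: int) -> List[str]:
    """Stage 1: keep the top_n gallery items closest to the query embedding."""
    ranked = sorted(gallery, key=lambda pid: cosine(query, gallery[pid]), reverse=True)
    return ranked[:top_n]


def rerank(candidates: List[str], match_count: Callable[[str], int]) -> List[str]:
    """Final stage: order the shortlist by the number of local keypoint matches."""
    return sorted(candidates, key=match_count, reverse=True)


# Toy global embeddings: two look-alike colas and one unrelated product.
gallery = {"cola_a": [1.0, 0.0], "cola_b": [0.9, 0.1], "soap": [0.0, 1.0]}
candidates = shortlist([1.0, 0.05], gallery, top_n=2)

# Hypothetical keypoint-match counts, standing in for a pixel-level matcher.
fake_matches = {"cola_a": 12, "cola_b": 48}
best = rerank(candidates, lambda pid: fake_matches[pid])[0]
print(candidates, best)
```

The toy shows why the hybrid helps: the global embedding narrows the search cheaply but slightly prefers the wrong cola, and the expensive local matching only has to run on the shortlist to correct it.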
EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
Authors: Chaoyin She, Ruifang Lu, Lida Chen, Wei Wang, Qinghua Huang
First: 2025-09-18T14:07:53+00:00 · Latest: 2025-09-18T14:07:53+00:00
Abstract
Ultrasound imaging has become the preferred imaging modality for early cancer
screening due to its advantages of non-ionizing radiation, low cost, and
real-time imaging capabilities. However, conventional ultrasound diagnosis
heavily relies on physician expertise, presenting challenges of high
subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer
promising solutions for this issue, but existing general-purpose models
demonstrate limited knowledge in ultrasound medical tasks, with poor
generalization in multi-organ lesion recognition and low efficiency across
multi-task diagnostics. To address these limitations, we propose EchoVLM, a
vision-language model specifically designed for ultrasound medical imaging. The
model employs a Mixture of Experts (MoE) architecture trained on data spanning
seven anatomical regions. This design enables the model to perform multiple
tasks, including ultrasound report generation, diagnosis and visual
question-answering (VQA). Experimental results show that EchoVLM
improves BLEU-1 and ROUGE-1 scores by 10.15 and 4.77 points,
respectively, over Qwen2-VL on the ultrasound report
generation task. These findings suggest that EchoVLM has substantial potential
to enhance diagnostic accuracy in ultrasound imaging, thereby providing a
viable technical solution for future clinical applications. Source code and
model weights are available at https://github.com/Asunatan/EchoVLM.
中文标题/摘要
标题:EchoVLM:用于通用超声智能的动态专家混合视觉语言模型
超声成像已成为早期癌症筛查的首选成像技术,因其无辐射、低成本和实时成像能力。然而,传统的超声诊断高度依赖于医生的专业知识,存在主观性高和诊断效率低的挑战。视觉语言模型(VLMs)为这一问题提供了有前景的解决方案,但现有的通用模型在超声医学任务中的知识有限,在多器官病灶识别上的泛化能力差,且在多任务诊断中的效率低。为解决这些局限性,我们提出了一种专门针对超声医学成像的视觉语言模型EchoVLM。该模型采用跨七个解剖区域数据训练的专家混合(MoE)架构,能够执行包括超声报告生成、诊断和视觉问答(VQA)在内的多种任务。实验结果表明,与Qwen2-VL相比,EchoVLM在超声报告生成任务中的BLEU-1分数和ROUGE-1分数分别提高了10.15和4.77分。这些发现表明,EchoVLM在提高超声成像诊断准确性方面具有巨大潜力,从而为未来的临床应用提供可行的技术解决方案。源代码和模型权重可在https://github.com/Asunatan/EchoVLM/ 获取。
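For readers unfamiliar with Mixture-of-Experts layers, the sketch below shows generic top-k routing: a gate scores the experts, the best k are kept, and their outputs are mixed with renormalized weights. The abstract does not describe EchoVLM's router, so the top-k softmax gate, the scalar "experts", and the value `k=2` are assumptions.

```python
import math
from typing import Callable, Dict, List


def top_k_gate(gate_logits: List[float], k: int) -> Dict[int, float]:
    """Softmax over the k highest-scoring experts; all others get no weight."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_output(x: float, experts: List[Callable[[float], float]],
               gate_logits: List[float], k: int = 2) -> float:
    """Weighted sum of the selected experts' outputs."""
    weights = top_k_gate(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy scalar experts; real experts would be feed-forward sub-networks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
gate_logits = [1.0, 1.0, -5.0]  # the gate strongly disfavors the third expert
weights = top_k_gate(gate_logits, k=2)
y = moe_output(3.0, experts, gate_logits, k=2)
print(weights, y)
```

Only the selected experts run for a given input, which is how such layers add capacity across tasks or anatomical regions without a matching increase in per-input compute.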
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion
Authors: Pei Liu, Haipeng Liu, Haichao Liu, Xin Liu, Jinxin Ni, Jun Ma
First: 2025-02-25T10:02:12+00:00 · Latest: 2025-09-18T11:55:02+00:00
Abstract
Human drivers adeptly navigate complex scenarios by utilizing rich
attentional semantics, but the current autonomous systems struggle to replicate
this ability, as they often lose critical semantic information when converting
2D observations into 3D space. This loss hinders their effective
deployment in dynamic and complex environments. Leveraging the superior scene
understanding and reasoning abilities of Vision-Language Models (VLMs), we
propose VLM-E2E, a novel framework that uses the VLMs to enhance training by
providing attentional cues. Our method integrates textual representations into
Bird's-Eye-View (BEV) features for semantic supervision, which enables the
model to learn richer feature representations that explicitly capture the
driver's attentional semantics. By focusing on attentional semantics, VLM-E2E
better aligns with human-like driving behavior, which is critical for
navigating dynamic and complex environments. Furthermore, we introduce a
BEV-Text learnable weighted fusion strategy to address the issue of modality
importance imbalance in fusing multimodal information. This approach
dynamically balances the contributions of BEV and text features, ensuring that
the complementary information from visual and textual modalities is effectively
utilized. By explicitly addressing the imbalance in multimodal fusion, our
method facilitates a more holistic and robust representation of driving
environments. We evaluate VLM-E2E on the nuScenes dataset and achieve
significant improvements in perception, prediction, and planning over the
baseline end-to-end model, showcasing the effectiveness of our
attention-enhanced BEV representation in enabling more accurate and reliable
autonomous driving tasks.
中文标题/摘要
标题:VLM-E2E:利用多模态驾驶员注意力融合提升端到端自动驾驶
人类驾驶员能够通过丰富的注意力语义来应对复杂的场景,但当前的自动驾驶系统在将2D观察转换为3D空间时往往会丢失关键的语义信息,这阻碍了它们在动态和复杂环境中的有效部署。利用视觉语言模型(VLMs)的优越场景理解和推理能力,我们提出了VLM-E2E这一新型框架,通过VLMs提供注意力提示来增强训练。我们的方法将文本表示整合到鸟瞰视图(BEV)特征中,用于语义监督,使模型能够学习更丰富的特征表示,明确捕捉驾驶员的注意力语义。通过关注注意力语义,VLM-E2E更好地与类似人类的驾驶行为对齐,这对于在动态和复杂环境中导航至关重要。此外,我们引入了一种可学习的BEV-Text加权融合策略,以解决多模态信息融合中模态重要性不平衡的问题。该方法动态平衡BEV和文本特征的贡献,确保视觉和文本模态互补信息的有效利用。通过明确解决多模态融合中的不平衡问题,我们的方法促进了更全面和稳健的驾驶环境表示。我们在nuScenes数据集上评估了VLM-E2E,并在感知、预测和规划方面显著优于基线端到端模型,展示了我们增强的BEV表示在实现更准确和可靠的自动驾驶任务方面的有效性。
Summary / 总结
The research aims to enhance end-to-end autonomous driving by leveraging Vision-Language Models (VLMs) to provide attentional cues, which helps in better understanding and reasoning about complex driving scenarios. The method integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision, enabling richer feature representations. Experimental results show significant improvements in perception, prediction, and planning tasks over the baseline end-to-end model, demonstrating the effectiveness of the attention-enhanced BEV representation.
研究旨在通过利用Vision-Language模型(VLMs)提供注意力提示来增强端到端的自动驾驶,以更好地理解和推理复杂的驾驶场景。方法将文本表示集成到鸟瞰图(BEV)特征中,使模型能够学习更丰富的特征表示,明确捕捉驾驶员的注意力语义。实验在nuScenes数据集上表明,与基线端到端模型相比,在感知、预测和规划方面取得了显著改进,突显了注意力增强的BEV表示的有效性。
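A learnable weighted fusion of two modalities can be sketched as a softmax over two trainable scalars that weights the BEV and text features. The abstract does not state the exact form of the BEV-Text fusion, so the softmax parameterization and the toy feature vectors are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(bev: List[float], text: List[float], weight_logits: List[float]) -> List[float]:
    """Convex combination of BEV and text features with learnable weights."""
    w_bev, w_text = softmax(weight_logits)
    return [w_bev * b + w_text * t for b, t in zip(bev, text)]


# With equal logits both modalities contribute equally; training would adjust them.
bev_feature = [1.0, 3.0]
text_feature = [3.0, 1.0]
fused = fuse(bev_feature, text_feature, weight_logits=[0.0, 0.0])
print(fused)
```

Because the weights sum to one, neither modality can dominate purely through feature scale, which is one way to address the importance imbalance the abstract mentions.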
MARIC: Multi-Agent Reasoning for Image Classification
Authors: Wonduk Seo, Minhyeong Yu, Hyunjin An, Seunghyun Lee
First: 2025-09-18T11:27:00+00:00 · Latest: 2025-09-18T11:27:00+00:00
Comments: Preprint
Abstract
Image classification has traditionally relied on parameter-intensive model
training, requiring large-scale annotated datasets and extensive fine-tuning to
achieve competitive performance. While recent vision language models (VLMs)
alleviate some of these constraints, they remain limited by their reliance on
single-pass representations, often failing to capture complementary aspects of
visual content. In this paper, we introduce Multi-Agent based Reasoning for
Image Classification (MARIC), a multi-agent framework that reformulates image
classification as a collaborative reasoning process. MARIC first utilizes an
Outliner Agent to analyze the global theme of the image and generate targeted
prompts. Based on these prompts, three Aspect Agents extract fine-grained
descriptions along distinct visual dimensions. Finally, a Reasoning Agent
synthesizes these complementary outputs through an integrated reflection step,
producing a unified representation for classification. By explicitly
decomposing the task into multiple perspectives and encouraging reflective
synthesis, MARIC mitigates the shortcomings of both parameter-heavy training
and monolithic VLM reasoning. Experiments on 4 diverse image classification
benchmark datasets demonstrate that MARIC significantly outperforms baselines,
highlighting the effectiveness of multi-agent visual reasoning for robust and
interpretable image classification.
中文标题/摘要
标题:MARIC:多智能体推理在图像分类中的应用
图像分类传统上依赖于参数密集型模型的训练,需要大规模标注数据集和广泛的微调才能达到竞争性性能。尽管最近的视觉语言模型(VLMs)在一定程度上缓解了这些限制,但它们仍然受限于单次表示,往往无法捕捉视觉内容的互补方面。在本文中,我们提出了基于多智能体的图像分类(MARIC)框架,将图像分类重新定义为协作推理过程。MARIC 首先利用一个大纲智能体(Outliner Agent)分析图像的全局主题并生成针对性提示。基于这些提示,三个方面智能体沿不同的视觉维度提取细粒度描述。最后,一个推理智能体通过综合反思步骤综合这些互补输出,生成用于分类的统一表示。通过明确将任务分解为多个视角并促进反思综合,MARIC 缓解了参数密集型训练和单一 VLM 推理的不足。在 4 个不同的图像分类基准数据集上的实验表明,MARIC 显著优于基线,突显了多智能体视觉推理在稳健和可解释图像分类中的有效性。
Summary / 总结
MARIC is a multi-agent framework for image classification that decomposes the task into multiple perspectives and encourages reflective synthesis. It uses an Outliner Agent to analyze the global theme and generate targeted prompts, followed by three Aspect Agents that extract fine-grained descriptions along distinct visual dimensions. The Reasoning Agent then synthesizes these outputs to produce a unified representation. Experiments show MARIC outperforms baselines on four diverse datasets, demonstrating the effectiveness of multi-agent visual reasoning for robust and interpretable image classification.
MARIC 是一种多代理框架,用于图像分类,旨在解决传统参数密集型训练和单一视觉语言模型的局限性。它将分类任务分解为多个视角,使用轮廓代理、三个方面代理和推理代理。实验表明,MARIC 在四个不同的数据集上优于基线,展示了多代理视觉推理在稳健和可解释图像分类中的有效性。
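The three-stage flow described in the MARIC entry above (outline, then aspect descriptions, then reasoning) could be sketched roughly as follows. This is only an illustration of the control flow, not the authors' implementation: every name here (`classify`, `toy_outliner`, `toy_aspect`, `toy_reasoner`) is hypothetical, and the agents are plain callables standing in for VLM calls.

```python
from typing import Callable, List

# Hypothetical signatures: each agent is a callable standing in for a VLM call.
Outliner = Callable[[str], List[str]]              # image -> targeted prompts
AspectAgent = Callable[[str, str], str]            # (image, prompt) -> description
Reasoner = Callable[[List[str], List[str]], str]   # (descriptions, labels) -> label


def classify(image: str, labels: List[str], outliner: Outliner,
             aspect_agents: List[AspectAgent], reasoner: Reasoner) -> str:
    """Run outline -> aspect descriptions -> reasoning, in that order."""
    prompts = outliner(image)
    if len(prompts) != len(aspect_agents):
        raise ValueError("expected one prompt per aspect agent")
    descriptions = [agent(image, prompt)
                    for agent, prompt in zip(aspect_agents, prompts)]
    return reasoner(descriptions, labels)


# Toy stand-ins so the sketch runs without any model.
def toy_outliner(image: str) -> List[str]:
    return ["describe shape", "describe texture", "describe context"]


def toy_aspect(image: str, prompt: str) -> str:
    return f"{prompt}: {image}"


def toy_reasoner(descriptions: List[str], labels: List[str]) -> str:
    text = " ".join(descriptions)
    # Pick the label mentioned most often in the combined descriptions.
    return max(labels, key=text.count)


prediction = classify("a striped cat on a sofa", ["dog", "cat"],
                      toy_outliner, [toy_aspect] * 3, toy_reasoner)
```

Passing the agents in as callables keeps the orchestration separate from whichever model backs each role, so a real VLM client could be substituted for the toy functions.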
The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs
Authors: Asif Azad, Mohammad Sadat Hossain, MD Sadik Hossain Shanto, M Saifur Rahman, Md Rizwan Parvez
First: 2025-09-16T08:17:39+00:00 · Latest: 2025-09-18T10:10:19+00:00
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in complex
visual understanding across scientific and reasoning tasks. While performance
benchmarking has advanced our understanding of these capabilities, the critical
dimension of uncertainty quantification has received insufficient attention.
Therefore, unlike prior conformal prediction studies that focused on limited
settings, we conduct a comprehensive uncertainty benchmarking study, evaluating
16 state-of-the-art VLMs (open and closed-source) across 6 multimodal datasets
with 3 distinct scoring functions. Our findings demonstrate that larger models
consistently exhibit better uncertainty quantification; models that know more
also know better what they don't know. More certain models achieve higher
accuracy, while mathematical and reasoning tasks elicit poorer uncertainty
performance across all models compared to other domains. This work establishes
a foundation for reliable uncertainty evaluation in multimodal systems.
中文标题/摘要
标题:说“可能”的艺术:一种用于VLMs不确定性基准测试的同形透镜
视觉-语言模型(VLMs)在跨科学和推理任务的复杂视觉理解方面取得了显著进展。虽然性能基准测试提高了我们对这些能力的理解,但不确定性量化这一关键维度却受到了不足的关注。因此,不同于以往专注于有限场景的同形预测研究,我们进行了全面的不确定性基准测试研究,评估了16个最先进的VLMs(开源和闭源)在6个多模态数据集上的表现,使用了3种不同的评分函数。我们的研究结果表明,较大的模型在不确定性量化方面表现更一致;知道得越多的模型也更清楚自己不知道什么。更确定的模型实现了更高的准确性,而数学和推理任务在所有模型中的不确定性表现普遍低于其他领域。这项工作为多模态系统的可靠不确定性评估奠定了基础。
Summary / 总结
This study aims to address the underexplored area of uncertainty quantification in Vision-Language Models (VLMs) by conducting a comprehensive benchmarking evaluation. The research evaluates 16 state-of-the-art VLMs using 6 multimodal datasets and three scoring functions. Key findings include that larger models generally provide better uncertainty quantification, and models that are more certain tend to achieve higher accuracy. The study also reveals that mathematical and reasoning tasks are more challenging for uncertainty evaluation compared to other domains.
该研究旨在通过全面的基准测试来解决视觉-语言模型(VLMs)中不确定性量化不足的问题。研究评估了16个最先进的VLMs,使用了6个多模态数据集和三种评分函数。主要发现包括:较大的模型通常提供更好的不确定性量化,更确定的模型往往能获得更高的准确性。研究还揭示了数学和推理任务在不确定性评估方面比其他领域更具挑战性。
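For readers unfamiliar with conformal prediction, the basic split-conformal recipe behind this kind of benchmark can be sketched as below. The abstract does not name its three scoring functions; the score used here, one minus the probability of the true answer, is a common choice and is an assumption of this sketch, as are the function names.

```python
import math
from typing import Dict, List


def conformal_threshold(cal_probs: List[Dict[str, float]],
                        cal_labels: List[str], alpha: float) -> float:
    """Quantile of calibration scores, where score = 1 - p(true label)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every option
    return scores[rank - 1]


def prediction_set(probs: Dict[str, float], qhat: float) -> List[str]:
    """All options whose score 1 - p does not exceed the threshold."""
    return sorted(k for k, p in probs.items() if 1.0 - p <= qhat)


# Toy calibration data (hypothetical multiple-choice probabilities).
cal_probs = [{"A": 0.875, "B": 0.125}, {"A": 0.25, "B": 0.75},
             {"A": 0.5, "B": 0.5}, {"A": 0.375, "B": 0.625}]
cal_labels = ["A", "B", "A", "B"]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
```

A confident model yields small prediction sets at a given coverage level, which is why set size is a natural uncertainty measure in this setting.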
Frame Sampling Strategies Matter: A Benchmark for small vision language models
Authors: Marija Brkic, Anas Filali Razzouki, Yannis Tevissen, Khalil Guetari, Mounim A. El Yacoubi
First: 2025-09-18T09:18:42+00:00 · Latest: 2025-09-18T09:18:42+00:00
Abstract
Comparing vision language models on videos is particularly complex, as
performance is jointly determined by the model's visual representation
capacity and the frame-sampling strategy used to construct the input. Current
video benchmarks are suspected to suffer from substantial frame-sampling bias,
as models are evaluated with different frame selection strategies. In this
work, we propose the first frame-accurate benchmark of state-of-the-art small
VLMs for video question-answering, evaluated under controlled frame-sampling
strategies. Our results confirm the suspected bias and highlight both
data-specific and task-specific behaviors of SVLMs under different
frame-sampling techniques. By open-sourcing our benchmarking code, we provide
the community with a reproducible and unbiased protocol for evaluating video
VLMs and emphasize the need for standardized frame-sampling strategies tailored
to each benchmarking dataset in future research.
中文标题/摘要
标题:帧采样策略很重要:小型视觉语言模型基准测试
在视频上比较视觉语言模型特别复杂,因为模型的表现由其视觉表示能力和用于构建输入的帧采样策略共同决定。当前的视频基准可能受到显著的帧采样偏差影响,因为模型使用了不同的帧选择策略进行评估。在本文中,我们提出了第一个针对视频问答的最先进的小型VLM基准测试,该基准测试在受控的帧采样策略下进行评估。我们的结果证实了存在的偏差,并突显了在不同帧采样技术下SVLMs的数据特异性和任务特异性行为。通过开源我们的基准测试代码,我们为社区提供了可重复且无偏见的评估视频VLMs的协议,并强调未来研究中需要为每个基准测试数据集量身定制标准化的帧采样策略。
Summary / 总结
This work addresses the complexity of comparing vision language models on videos, where model performance is influenced by both visual representation capacity and frame-sampling strategies. The authors propose a frame-accurate benchmark for small vision language models, evaluating them under controlled frame-sampling techniques. The results confirm the presence of frame-sampling bias and reveal task-specific and data-specific behaviors of these models. The study emphasizes the need for standardized frame-sampling strategies in future research to ensure unbiased evaluation.
该研究通过提出一个帧准确的基准,解决了在视频上评估视觉语言模型的复杂性,专门用于视频问答任务的小型视觉语言模型(VLM)。该基准控制帧采样策略以减轻偏差。关键发现表明,不同的帧采样技术显著影响模型性能,揭示了数据特定和任务特定的行为。作者强调了未来研究中标准化帧采样策略的重要性。
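To make the notion of a frame-sampling strategy concrete, here is a minimal sketch of two common ones: uniform sampling of a fixed number of frames, and fixed-rate sampling. The abstract does not list the strategies the benchmark actually controls for, so both functions and their names are assumptions chosen for illustration.

```python
from typing import List


def uniform_indices(num_frames: int, k: int) -> List[int]:
    """k frame indices spread evenly, taking the centre of each segment."""
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def fixed_rate_indices(num_frames: int, video_fps: float,
                       sample_fps: float) -> List[int]:
    """One frame every 1/sample_fps seconds, starting at frame 0."""
    if num_frames <= 0 or video_fps <= 0 or sample_fps <= 0:
        return []
    step = max(1, round(video_fps / sample_fps))
    return list(range(0, num_frames, step))
```

The two strategies hand the model different inputs for the same video: uniform sampling fixes the frame count regardless of duration, while fixed-rate sampling lets the count grow with video length.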
PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution
Authors: Yaning Zhang, Jiahe Zhang, Chunjie Ma, Weili Guan, Tian Gan, Zan Gao
First: 2025-04-19T01:11:46+00:00 · Latest: 2025-09-18T08:24:25+00:00
Abstract
The challenge of tracing the source attribution of forged faces has gained
significant attention due to the rapid advancement of generative models.
However, existing deepfake attribution (DFA) works primarily focus on the
interaction among various domains in vision modality, and other modalities such
as texts and face parsing are not fully explored. Besides, they tend to fail to
assess the generalization performance of deepfake attributors to unseen
advanced generators like diffusion in a fine-grained manner. In this paper, we
propose a novel parsing-aware vision language model with dynamic contrastive
learning (PVLM) method for zero-shot deepfake attribution (ZS-DFA), which
facilitates effective and fine-grained traceability to unseen advanced
generators. Specifically, we conduct a novel and fine-grained ZS-DFA benchmark
to evaluate the attribution performance of deepfake attributors to unseen
advanced generators like diffusion. Besides, we propose an innovative
parsing-guided vision language model with dynamic contrastive learning (PVLM)
method to capture general and diverse attribution features. We are motivated by
the observation that the preservation of source face attributes in facial
images generated by GAN and diffusion models varies significantly. We employ
the inherent face attributes preservation differences to capture face
parsing-aware forgery representations. Therefore, we devise a novel parsing
encoder to focus on global face attribute embeddings, enabling parsing-guided
DFA representation learning via dynamic vision-parsing matching. Additionally,
we present a novel deepfake attribution contrastive center loss to pull
relevant generators closer and push irrelevant ones away, which can be
introduced into DFA models to enhance traceability. Experimental results show
that our model exceeds the state-of-the-art on the ZS-DFA benchmark via various
protocol evaluations.
Summary / 总结
This paper addresses the challenge of tracing the source attribution of forged faces by proposing a novel parsing-aware vision language model with dynamic contrastive learning (PVLM) for zero-shot deepfake attribution (ZS-DFA). The authors introduce a fine-grained ZS-DFA benchmark to evaluate the performance of deepfake attributors against unseen advanced generators. They also propose a parsing-guided vision language model with dynamic contrastive learning to capture general and diverse attribution features. The model includes a parsing encoder that focuses on global face attribute embeddings and a deepfake attribution contrastive center loss to improve traceability. Experimental results demonstrate that the proposed PVLM outperforms existing methods on the ZS-DFA benchmark.
本文提出了一种新的基于解析的视觉语言模型动态对比学习(PVLM),用于零样本深度伪造归属(ZS-DFA),以解决伪造人脸来源归属的挑战。作者引入了一个细粒度的ZS-DFA基准来评估深度伪造检测器对未见过的高级生成器的性能。他们还提出了一种基于解析的视觉语言模型动态对比学习方法,以捕捉通用和多样化的归属特征。该模型包括一个专注于全局面部属性嵌入的解析编码器和一个深度伪造归属对比中心损失,以提高可追溯性。实验结果表明,所提出的PVLM在ZS-DFA基准上优于现有方法。
BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching
Authors: Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia
First: 2025-09-17T07:58:36+00:00 · Latest: 2025-09-18T04:57:32+00:00
Abstract
Recent advancements in Diffusion Transformers (DiTs) have established them as
the state-of-the-art method for video generation. However, their inherently
sequential denoising process results in inevitable latency, limiting real-world
applicability. Existing acceleration methods either compromise visual quality
due to architectural modifications or fail to reuse intermediate features at
proper granularity. Our analysis reveals that DiT blocks are the primary
contributors to inference latency. Across diffusion timesteps, the feature
variations of DiT blocks exhibit a U-shaped pattern with high similarity during
intermediate timesteps, which suggests substantial computational redundancy. In
this paper, we propose Block-Wise Caching (BWCache), a training-free method to
accelerate DiT-based video generation. BWCache dynamically caches and reuses
features from DiT blocks across diffusion timesteps. Furthermore, we introduce
a similarity indicator that triggers feature reuse only when the differences
between block features at adjacent timesteps fall below a threshold, thereby
minimizing redundant computations while maintaining visual fidelity. Extensive
experiments on several video diffusion models demonstrate that BWCache achieves
up to 2.24$\times$ speedup with comparable visual quality.
中文标题/摘要
标题:BWCache:通过块级缓存加速视频扩散变换器
近期扩散变换器(DiTs)的发展已将其确立为视频生成的最新方法。然而,其固有的顺序去噪过程不可避免地导致了延迟,限制了其实用性。现有的加速方法要么因架构修改而牺牲视觉质量,要么无法在适当粒度上重用中间特征。我们的分析表明,DiT块是推理延迟的主要来源。在扩散时间步中,DiT块的特征变化呈现出U形模式,在中间时间步具有高度相似性,这表明存在大量的计算冗余。在本文中,我们提出了一种无需训练的块级缓存(BWCache)方法,以加速基于DiT的视频生成。BWCache动态地跨扩散时间步缓存和重用DiT块的特征。此外,我们引入了一个相似性指标,仅在相邻时间步块特征之间的差异低于阈值时触发特征重用,从而在保持视觉保真度的同时最小化冗余计算。在几种视频扩散模型上的广泛实验表明,BWCache实现了最高2.24倍的加速,视觉质量相当。
Summary / 总结
BWCache accelerates Diffusion Transformers (DiTs) for video generation by caching and reusing features from DiT blocks across diffusion timesteps. This method reduces computational redundancy without compromising visual quality, achieving up to 2.24 times speedup. The key is a similarity indicator that triggers feature reuse only when differences between adjacent timesteps are below a threshold.
BWCache 是一种无需训练的方法,通过在扩散时间步之间缓存和重用 DiT 块的特征来加速 DiT。它使用相似性指标来最小化冗余计算并保持视觉质量,实现了最高 2.24 倍的加速。特征变化的 U 型模式表明存在显著的计算冗余,推动了 BWCache 的开发以解决 DiT 的固有延迟问题。
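The threshold-gated reuse idea in the BWCache entry above can be illustrated with a small sketch. This is not the paper's algorithm: the relative L1 indicator, the `max_reuse` cap, and all names are assumptions made for illustration, and a "block" is reduced to a function from timestep to a feature vector.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def relative_l1(a: Vector, b: Vector) -> float:
    """Sum of absolute differences between a and b, relative to |b|."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1.0
    return num / den


def run_with_cache(block: Callable[[int], Vector], timesteps: int,
                   threshold: float, max_reuse: int = 1
                   ) -> Tuple[List[Vector], int]:
    """Evaluate `block` per timestep, reusing the cached output while the
    last two computed outputs differ by less than `threshold`."""
    outputs: List[Vector] = []
    prev: Optional[Vector] = None     # output from the computation before last
    cached: Optional[Vector] = None   # most recently computed output
    reused_in_a_row = 0
    computed = 0
    for t in range(timesteps):
        similar = prev is not None and relative_l1(cached, prev) < threshold
        if similar and reused_in_a_row < max_reuse:
            reused_in_a_row += 1      # skip the block and reuse the cache
        else:
            prev, cached = cached, block(t)
            computed += 1
            reused_in_a_row = 0
        outputs.append(cached)
    return outputs, computed


def toy_block(t: int) -> Vector:
    # Hypothetical features: stable in the middle timesteps, changing at the ends.
    return [1.0, 2.0] if 2 <= t <= 5 else [1.0 + t, 2.0]


outputs, computed = run_with_cache(toy_block, timesteps=8, threshold=0.05)
```

In this toy run the block is evaluated 6 times for 8 timesteps. The reuse at the last stable step also shows the cost of caching: a reused feature can lag behind a real change, which is why the threshold (and here the reuse cap) trades speed against fidelity.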
Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark
Authors: Rashid Mushkani
First: 2025-09-18T03:21:10+00:00 · Latest: 2025-09-18T03:21:10+00:00
Abstract
Understanding how people read city scenes can inform design and planning. We
introduce a small benchmark for testing vision-language models (VLMs) on urban
perception using 100 Montreal street images, evenly split between photographs
and photorealistic synthetic scenes. Twelve participants from seven community
groups supplied 230 annotation forms across 30 dimensions mixing physical
attributes and subjective impressions. French responses were normalized to
English. We evaluated seven VLMs in a zero-shot setup with a structured prompt
and deterministic parser. We use accuracy for single-choice items and Jaccard
overlap for multi-label items; human agreement uses Krippendorff's alpha and
pairwise Jaccard. Results suggest stronger model alignment on visible,
objective properties than subjective appraisals. The top system (claude-sonnet)
reaches macro 0.31 and mean Jaccard 0.48 on multi-label items. Higher human
agreement coincides with better model scores. Synthetic images slightly lower
scores. We release the benchmark, prompts, and harness for reproducible,
uncertainty-aware evaluation in participatory urban analysis.
中文标题/摘要
标题:视觉-语言模型如何理解城市场景?一种城市感知基准
理解人们如何阅读城市场景可以指导设计和规划。我们引入了一个小型基准,使用蒙特利尔100条街道图像测试视觉-语言模型(VLMs)的城市感知能力,这些图像平分了照片和写实合成场景。来自七个社区团体的12名参与者提供了涵盖30个维度的230份注释表,这些维度混合了物理属性和主观印象。法语回答被标准化为英语。我们在零样本设置下评估了七种VLMs,使用结构化提示和确定性解析器。我们使用准确率评估单选题,使用Jaccard重叠评估多标签题;人类一致性使用Krippendorff的alpha和成对Jaccard。结果表明,模型在可见的、客观的属性上比主观评价有更好的对齐。顶级系统(claude-sonnet)在多标签题上的宏平均为0.31,平均Jaccard为0.48。更高的人类一致性与更好的模型得分相关。合成图像略微降低了得分。我们发布了基准、提示和框架,以实现可重复的、考虑不确定性的评估,用于参与式城市分析。
Summary / 总结
The study aims to understand how vision-language models perceive urban scenes by comparing their annotations with human perceptions. Using a benchmark of 100 Montreal street images, the researchers evaluated seven VLMs in a zero-shot setup. The models showed better alignment with visible and objective properties than subjective appraisals. The top system, claude-sonnet, achieved a macro accuracy of 0.31 and a mean Jaccard overlap of 0.48 on multi-label items. Higher human agreement correlated with better model scores, and synthetic images slightly reduced model performance.
研究旨在通过比较视觉-语言模型与人类对城市场景的感知,了解模型的感知能力。使用蒙特利尔100张街道图像作为基准,研究人员在零样本设置下评估了七种VLM的表现。模型的评估基于单选题的准确性和多标签项的Jaccard重叠。结果显示,模型在可见和客观属性上的表现优于主观评价,最佳系统在多标签项上的宏分数达到0.31,平均Jaccard重叠为0.48。人类一致性较高与模型得分较高相关,而合成图像则略微降低了性能。
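The two item-level metrics named in this entry, accuracy for single-choice items and Jaccard overlap for multi-label items, are simple to state in code. The sketch below is a generic implementation rather than the released harness; treating two empty selections as full agreement is an assumption.

```python
from typing import Iterable, List


def jaccard(pred: Iterable[str], gold: Iterable[str]) -> float:
    """Size of the intersection divided by size of the union."""
    p, g = set(pred), set(gold)
    if not p and not g:
        return 1.0  # assumption: two empty selections count as full agreement
    return len(p & g) / len(p | g)


def accuracy(preds: List[str], golds: List[str]) -> float:
    """Fraction of single-choice items answered identically."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```

For example, a model that selects {trees, benches} where annotators selected {trees, cars, benches} scores a Jaccard overlap of 2/3 on that item.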
VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models
Authors: Huanchen Wang, Wencheng Zhang, Zhiqiang Wang, Zhicong Lu, Yuxin Ma
First: 2025-09-18T03:15:00+00:00 · Latest: 2025-09-18T03:15:00+00:00
Comments: 11 pages, 7 figures, 1 table, accepted to IEEE VIS 2025 (IEEE
Transactions on Visualization and Computer Graphics)
Abstract
Vision-language (VL) models have shown transformative potential across
various critical domains due to their capability to comprehend multi-modal
information. However, their performance frequently degrades under distribution
shifts, making it crucial to assess and improve robustness against real-world
data corruption encountered in practical applications. While advancements in VL
benchmark datasets and data augmentation (DA) have contributed to robustness
evaluation and improvement, there remain challenges due to a lack of in-depth
comprehension of model behavior as well as the need for expertise and iterative
efforts to explore data patterns. Given the success of visualization in
explaining complex models and exploring large-scale data, understanding the
impact of various data corruption on VL models aligns naturally with a visual
analytics approach. To address these challenges, we introduce VisMoDAl, a
visual analytics framework designed to evaluate VL model robustness against
various corruption types and identify underperforming samples to guide the
development of effective DA strategies. Grounded in the literature review and
expert discussions, VisMoDAl supports multi-level analysis, ranging from
examining performance under specific corruptions to task-driven inspection of
model behavior and corresponding data slice. Unlike conventional works,
VisMoDAl enables users to reason about the effects of corruption on VL models,
facilitating both model behavior understanding and DA strategy formulation. The
utility of our system is demonstrated through case studies and quantitative
evaluations focused on corruption robustness in the image captioning task.
中文标题/摘要
标题:VisMoDAl:评估和提高视觉语言模型抗腐败鲁棒性的可视化分析
视觉语言(VL)模型因其能够理解多模态信息而在各个关键领域展现了变革性的潜力。然而,它们在分布变化下的性能经常下降,因此评估和提高其在实际应用中遇到的真实数据腐败情况下的鲁棒性变得至关重要。尽管VL基准数据集和数据增强(DA)的进步有助于鲁棒性评估和改进,但由于对模型行为缺乏深入理解以及需要专业知识和迭代探索数据模式,仍存在挑战。鉴于可视化在解释复杂模型和探索大规模数据方面的成就,理解各种数据腐败对VL模型的影响自然与可视化分析方法相契合。为了解决这些挑战,我们提出了VisMoDAl,这是一种可视化分析框架,旨在评估VL模型在各种腐败类型下的鲁棒性,并识别表现不佳的样本以指导有效的数据增强策略的开发。VisMoDAl基于文献综述和专家讨论,支持多级分析,从特定腐败下的性能检查到任务驱动的模型行为和相应数据切片的检查。与传统工作不同,VisMoDAl使用户能够推理数据腐败对VL模型的影响,促进模型行为理解和数据增强策略的制定。通过针对图像字幕任务的抗腐败鲁棒性的案例研究和定量评估,展示了我们系统的实用性。
Summary / 总结
VisMoDAl is a visual analytics framework designed to evaluate and improve the robustness of vision-language models against data corruption. It supports multi-level analysis, from specific corruption types to task-driven model behavior inspection. Key findings show that VisMoDAl helps identify underperforming samples and guides the development of effective data augmentation strategies, particularly in image captioning tasks.
VisMoDAl 是一个视觉分析框架,旨在评估和提高视觉语言模型在各种数据腐蚀类型下的鲁棒性。它支持多层次分析,从特定腐蚀下的性能检查到任务驱动的模型行为和数据切片检查。关键发现表明,VisMoDAl 通过识别表现不佳的样本有效指导数据增强策略的开发,从而增强模型在实际应用中的鲁棒性。
VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought
Authors: Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
First: 2024-06-20T17:45:02+00:00 · Latest: 2025-09-18T02:44:34+00:00
Comments: Project website: https://ical-learning.github.io/
Abstract
Large-scale generative language and vision-language models (LLMs and VLMs)
excel in few-shot learning but require high-quality demonstrations. We propose
In-Context Abstraction Learning (ICAL), enabling VLM agents to transform
suboptimal trajectories into high-quality training data through self-reflection
and human feedback. Given imperfect task demonstrations, a VLM abstracts
trajectories into generalized strategies and action annotations by correcting
inefficiencies and annotating cognitive abstractions: causal relationships,
object state changes, temporal subgoals, and task-relevant visual elements.
These annotations are iteratively refined through human feedback during
execution in similar environments. The resulting examples significantly improve
decision-making when used for retrieval-augmented generation or fine-tuning. As
the agent's example library grows, it becomes more efficient at abstracting new
examples, requiring less human feedback and fewer environment interactions.
ICAL achieves state-of-the-art results across multiple benchmarks. In TEACh
dialogue-based instruction following, combining fine-tuning and retrieval on
ICAL examples outperforms raw human demonstrations and expert examples by 17.5%
in goal-condition success. In VisualWebArena, retrieval-augmented GPT-4V with
ICAL improves task success 1.6x, while fine-tuned Qwen2-VL achieves 2.8x
improvement over the base model. In Ego4D action forecasting, we surpass
few-shot GPT-4V and remain competitive with supervised models. Our approach
scales 2x better than raw demonstrations and significantly reduces manual
prompt engineering requirements.
中文标题/摘要
标题:VLM智能体生成自己的记忆:将经验提炼为具身思维程序
大规模生成语言和跨模态语言模型(LLMs和VLMs)在少量示例学习方面表现出色,但需要高质量的演示。我们提出了上下文抽象学习(ICAL),使VLM智能体能够通过自我反思和人类反馈将次优轨迹转化为高质量的训练数据。给定不完美的任务演示,VLM将轨迹抽象为通用策略和动作注释,通过纠正低效性和标注认知抽象:因果关系、物体状态变化、时间子目标和任务相关的视觉元素。这些注释在类似环境中执行期间通过人类反馈迭代优化。生成的示例在用于检索增强生成或微调时显著改善了决策。随着智能体示例库的增长,它在抽象新示例方面变得更加高效,需要更少的人类反馈和环境交互。ICAL在多个基准测试中取得了最先进的成果。在TEACh对话式指令跟随中,结合ICAL示例的微调和检索优于原始人类演示和专家示例17.5%的目标条件成功率。在VisualWebArena中,使用ICAL的检索增强GPT-4V将任务成功率提高了1.6倍,而微调后的Qwen2-VL将基线模型提高了2.8倍。在Ego4D动作预测中,我们超越了少量示例的GPT-4V,并在监督模型中保持竞争力。我们的方法比原始演示扩展速度快2倍,并显著减少了手动提示工程的需求。
Summary / 总结
The research aims to enhance the few-shot learning capabilities of vision-language models (VLMs) by enabling them to generate high-quality training data through self-reflection and human feedback. The method involves In-Context Abstraction Learning (ICAL), where VLMs abstract suboptimal trajectories into generalized strategies and action annotations. Key findings show that ICAL improves decision-making in various benchmarks, outperforming raw human demonstrations and expert examples. In TEACh, ICAL examples improve goal-condition success by 17.5%. In VisualWebArena, ICAL enhances task success by 1.6x and 2.8x for retrieval-augmented GPT-4V and fine-tuned Qwen2-VL, respectively. In Ego4D, ICAL surpasses few-shot GPT-4V and remains competitive with supervised models.
论文提出了In-Context Abstraction Learning (ICAL),该方法使VLM代理能够通过自我反思和人类反馈将次优轨迹转化为高质量的训练数据。通过将轨迹抽象为通用策略和动作注释,ICAL在多个基准测试中提高了决策能力。例如,在TEACh对话式指令遵循中,ICAL在目标条件成功方面比原始人类演示高出17.5%。在VisualWebArena中,ICAL与检索增强的GPT-4V结合使用时,任务成功率提高了1.6倍;在Ego4D动作预测中,ICAL超越了少量示例的GPT-4V,并且与监督模型保持竞争力。
An Empirical Study of Federated Prompt Learning for Vision Language Model
Authors: Zhihao Wang, Wenke Huang, Tian Chen, Zekun Shi, Guancheng Wan, Yu Qiao, Bin Yang, Jian Wang, Bing Li, Mang Ye
First: 2025-05-29T03:09:15+00:00 · Latest: 2025-09-18T02:36:50+00:00
Abstract
The Vision Language Model (VLM) excels in aligning vision and language
representations, and prompt learning has emerged as a key technique for
adapting such models to downstream tasks. However, the application of prompt
learning with VLM in federated learning (FL) scenarios remains underexplored.
This paper systematically investigates the behavioral differences between
language prompt learning (LPT) and vision prompt learning (VPT) under data
heterogeneity challenges, including label skew and domain shift. We conduct
extensive experiments to evaluate the impact of various FL and prompt
configurations, such as client scale, aggregation strategies, and prompt
length, to assess the robustness of Federated Prompt Learning (FPL).
Furthermore, we explore strategies for enhancing prompt learning in complex
scenarios where label skew and domain shift coexist, including leveraging both
prompt types when computational resources allow. Our findings offer practical
insights into optimizing prompt learning in federated settings, contributing to
the broader deployment of VLMs in privacy-preserving environments.
中文标题/摘要
标题:联邦提示学习在视觉语言模型中的实证研究
视觉语言模型(VLM)在视觉和语言表示对齐方面表现出色,提示学习已成为将此类模型适应下游任务的关键技术。然而,提示学习在联邦学习(FL)场景中的应用尚未得到充分探索。本文系统地研究了在数据异质性挑战(包括标签偏斜和领域偏移)下语言提示学习(LPT)和视觉提示学习(VPT)的行为差异。我们进行了广泛的实验,评估了各种FL和提示配置(如客户端规模、聚合策略和提示长度)对联邦提示学习(FPL)鲁棒性的影响。此外,我们探讨了在标签偏斜和领域偏移共存的复杂场景中增强提示学习的策略,包括在计算资源允许时利用两种提示类型。我们的发现为优化联邦设置中的提示学习提供了实用见解,有助于在隐私保护环境中更广泛地部署VLMs。
Summary / 总结
This paper explores the differences between language prompt learning (LPT) and vision prompt learning (VPT) in federated learning (FL) scenarios, focusing on data heterogeneity challenges like label skew and domain shift. Through extensive experiments, the study evaluates the impact of various FL and prompt configurations, such as client scale and aggregation strategies, to assess the robustness of Federated Prompt Learning (FPL). Key findings include the importance of using both prompt types when computational resources permit to enhance learning in complex scenarios.
本文研究了在联邦学习(FL)场景下视觉语言模型(VLM)中语言提示学习(LPT)和视觉提示学习(VPT)之间的差异。研究通过改变客户端规模、聚合策略和提示长度,评估了Federated Prompt Learning(FPL)在标签偏斜和领域偏移等数据异构性下的鲁棒性。关键发现包括,在计算资源允许时使用两种提示类型的有效性,从而增强模型在复杂场景中的适应性。
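As a concrete picture of what aggregating prompts across clients means, here is a minimal sketch of size-weighted averaging in the style of FedAvg. The abstract does not say which aggregation strategies the study compares, so FedAvg is used here only as a familiar baseline, and the function name and flat-list prompt representation are assumptions.

```python
from typing import List, Sequence


def aggregate_prompts(client_prompts: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Average client prompt vectors, weighted by local dataset size."""
    if not client_prompts or len(client_prompts) != len(client_sizes):
        raise ValueError("need one dataset size per client prompt")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total dataset size must be positive")
    dim = len(client_prompts[0])
    if any(len(p) != dim for p in client_prompts):
        raise ValueError("all prompts must have the same length")
    return [sum(n * p[j] for p, n in zip(client_prompts, client_sizes)) / total
            for j in range(dim)]
```

Only the small learnable prompt vectors are exchanged and averaged; the VLM backbone stays frozen on every client, which is what makes prompt learning attractive in a federated setting.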
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation
Authors: Yuxuan Jiang, Zehua Chen, Zeqian Ju, Chang Li, Weibei Dou, Jun Zhu
Venue: ACM MM 2025
First: 2025-07-11T12:57:51+00:00 · Latest: 2025-09-18T02:19:36+00:00
Comments: Accepted at ACM MM 2025
Abstract
Text-to-audio (T2A) generation has achieved promising results with the recent
advances in generative models. However, because of the limited quality and
quantity of temporally-aligned audio-text pairs, existing T2A methods struggle
to handle the complex text prompts that contain precise timing control, e.g.,
"owl hooted at 2.4s-5.2s". Recent works have explored data augmentation
techniques or introduced timing conditions as model inputs to enable
timing-conditioned 10-second T2A generation, but their synthesis quality is
still limited. In this work, we propose a novel training-free timing-controlled
T2A framework, FreeAudio, making the first attempt to enable timing-controlled
long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping
at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time
windows and recaption each with a refined natural language description, based
on the input text and timing prompts. Then we introduce: 1) Decoupling and
Aggregating Attention Control for precise timing control; 2) Contextual Latent
Composition for local smoothness and Reference Guidance for global consistency.
Extensive experiments show that: 1) FreeAudio achieves state-of-the-art
timing-conditioned T2A synthesis quality among training-free methods and is
comparable to leading training-based methods; 2) FreeAudio demonstrates
comparable long-form generation quality with training-based Stable Audio and
paves the way for timing-controlled long-form T2A synthesis. Demo samples are
available at: https://freeaudio.github.io/FreeAudio/
Summary / 总结
FreeAudio is a training-free framework for timing-controlled text-to-audio generation, addressing the challenge of handling complex text prompts with precise timing. It uses an LLM to plan non-overlapping time windows and refine descriptions, and introduces techniques like decoupling and aggregating attention control, contextual latent composition, and reference guidance. Experiments show that FreeAudio achieves high-quality timing-conditioned synthesis comparable to training-based methods and enables long-form generation with timing control, paving the way for future research in this area.
FreeAudio 是一个无需训练的框架,用于实现具有精确时间控制的文本转音频生成,解决处理包含精确时间控制的复杂文本提示的挑战。它使用LLM规划非重叠的时间窗口并细化描述,并引入了解耦和聚合注意力控制、上下文潜在组成和参考指导等技术。实验表明,FreeAudio 在时间条件下的合成质量达到了训练免费方法的最高水平,与训练基线方法相当,并且能够实现长段落生成的时间控制,为未来的研究开辟了道路。
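To show what planning non-overlapping time windows amounts to, the sketch below splits timed events at every start and end point and records which events are active in each window. FreeAudio delegates this planning (and the recaptioning) to an LLM; the deterministic sweep here is only an assumed stand-in for illustration, with hypothetical names throughout.

```python
from typing import List, Tuple

Event = Tuple[str, float, float]          # (description, start_s, end_s)
Window = Tuple[float, float, List[str]]   # (start_s, end_s, active descriptions)


def plan_windows(events: List[Event]) -> List[Window]:
    """Cut the timeline at every event boundary and list active events."""
    cuts = sorted({t for _, start, end in events for t in (start, end)})
    windows: List[Window] = []
    for start, end in zip(cuts, cuts[1:]):
        active = [desc for desc, s, e in events if s <= start and end <= e]
        if active:  # drop silent gaps between events
            windows.append((start, end, active))
    return windows


# The long-form prompt quoted in the abstract.
plan = plan_windows([("owl hooted", 2.4, 5.2),
                     ("crickets chirping", 0.0, 24.0)])
```

For the abstract's example this yields three windows: crickets alone from 0 s to 2.4 s, owl and crickets together from 2.4 s to 5.2 s, and crickets alone from 5.2 s to 24 s. Each window would then receive its own refined caption.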
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling
Authors: Bingxuan Li, Yiwei Wang, Jiuxiang Gu, Kai-Wei Chang, Nanyun Peng
First: 2025-02-24T21:01:39+00:00 · Latest: 2025-09-17T21:47:50+00:00
Comments: ACL2025 Main
Abstract
Chart generation aims to generate code to produce charts satisfying the
desired visual properties, e.g., texts, layout, color, and type. It has great
potential to empower automatic professional report generation in financial
analysis, research presentation, education, and healthcare. In this work, we
build a vision-language model (VLM) based multi-agent framework for effective
automatic chart generation. Generating high-quality charts requires both strong
visual design skills and precise coding capabilities that embed the desired
visual properties into code. Such a complex multi-modal reasoning process is
difficult for direct prompting of VLMs. To resolve these challenges, we propose
METAL, a multi-agent framework that decomposes the task of chart generation
into the iterative collaboration among specialized agents. METAL achieves 5.2%
improvement over the current best result in the chart generation task. The
METAL framework exhibits the phenomenon of test-time scaling: its performance
increases monotonically as the logarithmic computational budget grows from 512
to 8192 tokens. In addition, we find that separating different modalities
during the critique process of METAL boosts the self-correction capability of
VLMs in the multimodal context.
中文标题/摘要
标题:METAL:一种用于图表生成的多智能体框架(带测试时扩展)
图表生成旨在生成代码以生成满足所需视觉属性的图表,例如文本、布局、颜色和类型。它在金融分析、研究展示、教育和医疗保健中的自动专业报告生成方面具有巨大的潜力。在本工作中,我们构建了一个基于视觉语言模型(VLM)的有效自动图表生成多智能体框架。生成高质量的图表需要强大的视觉设计技能和精确的编码能力,将所需的视觉属性嵌入到代码中。这种复杂的多模态推理过程难以直接对VLM进行提示。为了解决这些挑战,我们提出了METAL,一种将图表生成任务分解为专业智能体之间迭代协作的多智能体框架。METAL在图表生成任务中的表现优于当前最佳结果5.2%。METAL框架展示了测试时扩展的现象:其性能随着计算预算从512增长到8192个令牌呈单调增长。此外,我们发现,在METAL的批评过程中分离不同的模态增强了VLM在多模态环境中的自我纠正能力。
Summary / 总结
The research aims to develop an effective multi-agent framework for automatic chart generation, which is crucial for professional report generation in various fields. METAL, a vision-language model-based framework, decomposes the chart generation task into iterative collaboration among specialized agents, addressing the need for both visual design and coding skills. The framework shows a 5.2% improvement over the current best results and demonstrates test-time scaling, with performance increasing as computational budget grows. Separating modalities during the critique process enhances the self-correction capability of VLMs.
研究旨在开发一个有效的多智能体框架,用于自动图表生成,这对于金融分析、研究展示、教育和医疗保健等领域的专业报告生成至关重要。METAL框架利用视觉-语言模型将图表生成任务分解为多个专门智能体的迭代协作,相比之前的最佳结果提高了5.2%的图表质量。此外,该框架展示了测试时扩展的现象,随着计算预算的增长,性能逐渐提升;并且在多模态评论过程中分离不同模态可以增强VLMs的自我纠正能力。
Hashing-Baseline: Rethinking Hashing in the Age of Pretrained Models
Authors: Ilyass Moummad, Kawtar Zaher, Lukas Rauch, Alexis Joly
First: 2025-09-17T20:58:43+00:00 · Latest: 2025-09-17T20:58:43+00:00
Abstract
Information retrieval with compact binary embeddings, also referred to as
hashing, is crucial for scalable fast search applications, yet state-of-the-art
hashing methods require expensive, scenario-specific training. In this work, we
introduce Hashing-Baseline, a strong training-free hashing method leveraging
powerful pretrained encoders that produce rich pretrained embeddings. We
revisit classical, training-free hashing techniques: principal component
analysis, random orthogonal projection, and threshold binarization, to produce
a strong baseline for hashing. Our approach combines these techniques with
frozen embeddings from state-of-the-art vision and audio encoders to yield
competitive retrieval performance without any additional learning or
fine-tuning. To demonstrate the generality and effectiveness of this approach,
we evaluate it on standard image retrieval benchmarks as well as a newly
introduced benchmark for audio hashing.
中文标题/摘要
标题:哈希-基线:在预训练模型时代重新思考哈希
使用紧凑的二进制嵌入进行信息检索,也称为哈希,在可扩展的快速搜索应用中至关重要,但最先进的哈希方法需要昂贵的、特定场景的训练。在本文中,我们引入了哈希-基线,这是一种强大的无需训练的哈希方法,利用强大的预训练编码器生成丰富的预训练嵌入。我们回顾了经典的无需训练的哈希技术:主成分分析、随机正交投影和阈值二值化,以生成哈希的强基线。我们的方法将这些技术与最先进的视觉和音频编码器的冻结嵌入相结合,无需任何额外的学习或微调即可获得竞争力的检索性能。为了证明该方法的通用性和有效性,我们在标准图像检索基准以及新引入的音频哈希基准上进行了评估。
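A rough sketch of training-free hashing on frozen embeddings is given below: centre the embedding, project it onto random orthonormal directions, and threshold at zero to obtain bits. For brevity it omits the PCA step the abstract mentions and uses only the standard library; the function names and the Gram-Schmidt construction are assumptions of this sketch, not the authors' code.

```python
import random
from typing import List

Vector = List[float]


def random_orthogonal(dim: int, bits: int, seed: int = 0) -> List[Vector]:
    """`bits` orthonormal directions in R^dim via Gram-Schmidt on Gaussians."""
    if bits > dim:
        raise ValueError("cannot have more orthogonal directions than dims")
    rng = random.Random(seed)
    basis: List[Vector] = []
    while len(basis) < bits:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:  # remove the components along earlier directions
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-9:
            basis.append([x / norm for x in v])
    return basis


def hash_code(embedding: Vector, mean: Vector, basis: List[Vector]) -> List[int]:
    """Centre, project, and binarize at zero: one bit per direction."""
    centered = [x - m for x, m in zip(embedding, mean)]
    return [1 if sum(c * b for c, b in zip(centered, row)) > 0 else 0
            for row in basis]


def hamming(a: List[int], b: List[int]) -> int:
    """Number of differing bits, used as the retrieval distance."""
    return sum(x != y for x, y in zip(a, b))
```

Retrieval then reduces to ranking database items by Hamming distance to the query code, which is cheap to compute on compact binary codes.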
Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis
Authors: Chenjun Li, Laurin Lux, Alexander H. Berger, Martin J. Menten, Mert R. Sabuncu, Johannes C. Paetzold
First: 2025-03-12T20:19:07+00:00 · Latest: 2025-09-17T20:08:48+00:00
Comments: 11 pages, 3 figures
Abstract
Accurate staging of Diabetic Retinopathy (DR) is essential for guiding timely
interventions and preventing vision loss. However, current staging models are
hardly interpretable, and most public datasets contain no clinical reasoning or
interpretation beyond image-level labels. In this paper, we present a novel
method that integrates graph representation learning with vision-language
models (VLMs) to deliver explainable DR diagnosis. Our approach leverages
optical coherence tomography angiography (OCTA) images by constructing
biologically informed graphs that encode key retinal vascular features such as
vessel morphology and spatial connectivity. A graph neural network (GNN) then
performs DR staging while integrated gradients highlight critical nodes and
edges and their individual features that drive the classification decisions. We
collect this graph-based knowledge which attributes the model's prediction to
physiological structures and their characteristics. We then transform it into
textual descriptions for VLMs. We perform instruction-tuning with these textual
descriptions and the corresponding image to train a student VLM. This final
agent can classify the disease and explain its decision in a human
interpretable way solely based on a single image input. Experimental
evaluations on both proprietary and public datasets demonstrate that our method
not only improves classification accuracy but also offers more clinically
interpretable results. An expert study further demonstrates that our method
provides more accurate diagnostic explanations and paves the way for precise
localization of pathologies in OCTA images.
中文标题/摘要
标题:基于图知识微调视觉语言模型以实现可解释的糖尿病视网膜病变医学图像分析
准确的糖尿病视网膜病变(DR)分期对于指导及时干预和预防视力丧失至关重要。然而,当前的分期模型几乎不具备可解释性,而且大多数公开的数据集仅包含图像级别的标签,而没有临床推理或解释。在本文中,我们提出了一种新颖的方法,将图表示学习与视觉-语言模型(VLMs)结合,以提供可解释的DR诊断。我们的方法通过构建生物启发的图来利用光学相干断层扫描血管成像(OCTA)图像,这些图编码了关键的视网膜血管特征,如血管形态和空间连接性。然后,图神经网络(GNN)执行DR分期,集成梯度突出显示驱动分类决策的关键节点和边及其个体特征。我们收集了基于图的知识,将模型的预测归因于生理结构及其特征。然后将其转换为文本描述,供VLMs使用。我们使用这些文本描述和相应的图像进行指令微调,训练一个学生VLM。最终的代理可以根据单张图像输入进行疾病分类,并以人类可解释的方式解释其决策。在私有和公开数据集上的实验评估表明,我们的方法不仅提高了分类准确性,还提供了更具临床解释性的结果。专家研究进一步表明,我们的方法提供了更准确的诊断解释,并为OCTA图像中病理学的精确定位铺平了道路。
Summary / 总结
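This paper proposes a method that makes Diabetic Retinopathy (DR) staging more interpretable by combining graph representation learning with vision-language models (VLMs). It builds biologically informed graphs from optical coherence tomography angiography (OCTA) images, which a graph neural network (GNN) processes to perform DR staging, while integrated gradients highlight the critical nodes and edges behind each decision. This graph-based knowledge is converted into textual descriptions and used to instruction-tune a student VLM. The method improves classification accuracy and yields more clinically interpretable results, and an expert study further supports the accuracy of its diagnostic explanations and points toward precise localization of pathologies in OCTA images.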
本文提出了一种通过将图表示学习与视觉语言模型结合来提高糖尿病视网膜病变(DR)分期可解释性的方法。该方法使用光学相干断层扫描血管成像(OCTA)图像构建生物启发的图,然后由图神经网络(GNN)处理以执行DR分期。集成梯度突出关键节点和边,为模型的决策提供依据。该方法提高了分类准确性,并提供了比现有模型更具临床解释性的结果。专家研究进一步证实了诊断解释的准确性以及在OCTA图像中对病理的精确定位。