TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models
Authors: Junyi Zhang, Jia-Chen Gu, Wenbo Hu, Yu Zhou, Robinson Piramuthu, Nanyun Peng
First: 2025-09-29T17:51:26+00:00 · Latest: 2025-09-29T17:51:26+00:00
Abstract
Existing medical reasoning benchmarks for vision-language models primarily
focus on analyzing a patient's condition based on an image from a single visit.
However, this setting deviates significantly from real-world clinical practice,
where doctors typically refer to a patient's historical conditions to provide a
comprehensive assessment by tracking their changes over time. In this paper, we
introduce TemMed-Bench, the first benchmark designed for analyzing changes in
patients' conditions between different clinical visits, which challenges large
vision-language models (LVLMs) to reason over temporal medical images.
TemMed-Bench consists of a test set comprising three tasks - visual
question-answering (VQA), report generation, and image-pair selection - and a
supplementary knowledge corpus of over 17,000 instances. With TemMed-Bench, we
evaluate six proprietary and six open-source LVLMs. Our results
show that most LVLMs lack the ability to analyze patients' condition changes
over temporal medical images, and a large proportion perform only at a
random-guessing level in the closed-book setting. In contrast, GPT o3, o4-mini
and Claude 3.5 Sonnet demonstrate comparatively decent performance, though they
have yet to reach the desired level. Furthermore, we explore augmenting the
input with both retrieved visual and textual modalities in the medical domain.
We show that multi-modal retrieval augmentation yields notably higher
performance than either no retrieval or text-only retrieval across most
models on our benchmark, with the VQA task showing an average improvement of
2.59%. Overall, we construct a benchmark grounded in real-world clinical
practice; it reveals LVLMs' limitations in temporal medical image
reasoning and highlights multi-modal retrieval augmentation as a
promising direction for addressing this challenge.