MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis

Asma Alkhaldi¹,², Raneem Alnajim¹,², Layan Alabdullatef¹,², Rawan Alyahya¹,
Jun Chen², Deyao Zhu², Ahmed Alsinan¹, Mohamed Elhoseiny²

¹Saudi Data and Artificial Intelligence Authority (SDAIA),
²King Abdullah University of Science and Technology (KAUST)
{asma.alkhaldi,mohamed.elhoseiny}@kaust.edu.sa

Abstract

Recent advancements in artificial intelligence (AI) have precipitated significant breakthroughs in healthcare, particularly in refining diagnostic procedures. However, previous studies have often been constrained to limited functionalities. This study introduces MiniGPT-Med, a vision-language model derived from large-scale language models and tailored for medical applications. MiniGPT-Med demonstrates remarkable versatility across various imaging modalities, including X-rays, CT scans, and MRIs, enhancing its utility. The model is capable of performing tasks such as medical report generation, visual question answering (VQA), and disease identification within medical imagery. Its integrated processing of both image and textual clinical data markedly improves diagnostic accuracy. Our empirical assessments confirm MiniGPT-Med’s superior performance in disease grounding, medical report generation, and VQA benchmarks, representing a significant step towards reducing the gap in assisting radiology practice. Furthermore, it achieves state-of-the-art performance on medical report generation, outperforming the previous best model by 19 points in BERT-Sim. MiniGPT-Med promises to become a general interface for radiology diagnoses, enhancing diagnostic efficiency across a wide range of medical imaging applications. Our model and code have been made publicly available at https://github.com/Vision-CAIR/MiniGPT-Med


Figure 1: The diverse capabilities of MiniGPT-Med. It can perform disease detection, medical visual question answering, and medical report generation. MiniGPT-Med effectively works with a wide range of radiological data (X-rays, CT scans, and MRIs) and is adept at diagnosing many diseases.

1 Introduction

The unprecedented surge in both the quantity of image-text data across diverse fields and the strides made in vision-language modeling have paved the way for groundbreaking research in generative pretraining. This era of innovation is marked by the emergence of multimodal models such as GPT-4 Achiam et al. (2023) and Gemini Team et al. (2023). These advancements signify a leap forward in our ability to process and understand complex data. Despite this progress, the adoption of multimodal large language models (LLMs) within the medical sector remains limited. The medical field’s unique requirements for data complexity, sensitivity, and specificity highlight the need for tailored approaches to harness the potential of LLMs in transforming healthcare research and practice.

Numerous models designed for medical applications have been introduced, yet they are often highly specialized for specific tasks, which limits their versatility across diverse medical applications. For instance, models like Med-Flamingo Moor et al. (2023) and XrayGPT Thawkar et al. (2023a) are primarily tailored for tasks such as medical report generation and medical visual question answering. However, they lack capabilities in essential areas like disease detection, which requires visual grounding skills, a crucial component in the medical field.

To address this deficiency, we introduce MiniGPT-Med, a unified model capable of handling both grounding and non-grounding tasks in the medical domain, including but not limited to medical report generation, medical visual question answering, and disease identification. MiniGPT-Med builds upon the architecture of large language models (LLMs), which have demonstrated exceptional generative capabilities and extensive linguistic knowledge, including medical knowledge. Drawing on the successes of LLMs in a wide range of vision-language applications, as evidenced in recent studies Zhu et al. (2023); Chen et al. (2023); Li et al. (2024), our model adopts a design similar to MiniGPT-v2 Chen et al. (2023), utilizing the LLaMA-2 language model as a universal interface. Additionally, we incorporate distinct task identifiers to enhance the model’s ability to accurately perform various medical vision-language skills.

Through extensive experimentation, we demonstrate that our model exhibits strong performance across a range of medical vision-language tasks, including medical report generation, medical visual question answering, and disease detection. We benchmarked our model against both specialized and generalized baseline models, and our approach achieves strong results across all evaluated tasks. Notably, in medical report generation, our model attained state-of-the-art performance, surpassing the best baseline models by 19 points in BERT-Sim and 5.2 points in CheXbert-Sim. This indicates that our model has strong generation capabilities on diverse medical vision-language tasks.

Our contributions are as follows:

1. We introduce MiniGPT-Med, a model tailored for the heterogeneous nature of radiological imagery, encompassing X-rays, CT scans, and MRIs. This model is adept at handling a variety of vision-language tasks, including disease identification, medical visual question answering, and the generation of medical reports.

2. We comprehensively evaluate our model across both grounding and non-grounding tasks, complemented by expert manual assessments. The findings demonstrate that MiniGPT-Med delivers competitive performance across a majority of benchmarks, surpassing both generalized and specialized models, notably achieving state-of-the-art results in medical report generation and surpassing the best baseline by 19.0 points in BERT-Sim.

2 Background

Aligning Visual Data with Large Language Models: Recent advancements in the domain of large language models, such as the release of GPT-4, have enhanced the interpretative and generative capabilities of LLMs. This progress is exemplified by models such as LLaVA Liu et al. (2023), Flamingo Alayrac et al. (2022), and MiniGPT-v2 Chen et al. (2023). LLaVA is designed to augment the understanding of visual content in large language models through diverse multimodal instructions. This enhancement in comprehension is critical for integrating different forms of data input. In contrast, Flamingo demonstrates remarkable proficiency in quick adaptation to novel tasks with minimal data. This model effectively manages sequences that incorporate both visual and textual elements. MiniGPT-v2, on the other hand, displays enhanced multimodal capabilities within a single model framework. This is achieved through task-specific training and a specialized architecture that combines visual tokens with a large language model, aligning well with the objectives of LLaVA and Flamingo.

Integration of Vision Language Models for Enhanced Medical Diagnostics: Recent work in vision-language models (VLMs) has led to significant improvements in healthcare applications, especially in medical image analysis and diagnostic report generation. These models combine computer vision and language processing to better analyze medical images such as X-rays, computed tomography (CT) scans, and MRIs. More specialized applications in the medical field such as LLaVA-Med Li et al. (2024) and Med-BERT Rasmy et al. (2020) have shown promise in incorporating structured electronic health records for improvements in disease prediction tasks. MedVQA Canepa et al. (2023) has demonstrated medical visual question-answering and image analysis capabilities. Furthermore, for classification and interpretation tasks, Med-Flamingo Moor et al. (2023), MedVis Shen et al. (2008), and MedMCQA Pal et al. (2022) have shown the importance of few-shot learning, visual interpretation, and domain-specific question-answering in medical AI. Both LLaVA-Med and Med-Flamingo focus on multimodal conversational AI and few-shot learning in medical contexts, utilizing large-scale datasets and showcasing proficiency in visual question answering. BioViL Bannur et al. (2023), BioBERT Lee et al. (2019), and BioGPT Luo et al. (2022) have all tackled more domain-specific language model pretraining. BioViL emphasizes text semantics for enhanced biomedical vision-language processing. Specialized models for radiology applications have also been presented in MedKLIP Wu et al. (2023a), XrayGPT Thawkar et al. (2023b), and BERTHop Monajatipoor et al. (2021), all demonstrating the challenge of achieving high diagnostic accuracy. MedKLIP in particular innovates by integrating medical knowledge into vision-language pre-training for improved disease classification. 
XrayGPT integrated a medical visual encoder with a large language model to combine visual and textual analysis to generate precise summaries from radiological data, while BERTHop showed diagnostic performance with smaller datasets on chest X-rays. Moreover, the contributions of CheXagent Chen et al. (2024), CheXNeXt Rajpurkar et al. (2018), and CheXpert Irvin et al. (2019) have set benchmarks in chest pathologies detection. While each work presents unique approaches, their common goal is to enhance radiological analysis through sophisticated AI models.

3 Method

3.1 Model architecture

Our model architecture, illustrated in Fig. 2, is composed of three key components: a visual backbone, a linear projection layer, and a large language model. The details of each component are described as follows:

Vision Encoder. We use EVA Sun et al. (2023) as the visual backbone of our model. EVA, a high-performing vision encoder, can be particularly effective when applied to radiological data due to its ability to handle complex image structures and variations. This visual backbone remains frozen throughout the entire training process. Because radiological images are usually high resolution, we train the model with an image resolution of 448 $\times$ 448 and interpolate the positional encoding to adapt to this higher resolution.

Large Language Model (LLM). We use LLaMA2-chat (7B) Touvron et al. (2023), an open-source language model, as the language model backbone. The LLM has already acquired extensive medical knowledge from its large-scale language pretraining, and we treat it as a unified interface for processing many medical vision-language tasks. For example, the LLM can help generate detailed medical reports and precisely localize tumors in medical images.

Vision Language Alignment. We adopt the architecture of MiniGPT-v2 Chen et al. (2023) and enhance efficiency by concatenating visual tokens from the vision encoder, a technique particularly beneficial for processing high-resolution medical images. This method involves merging four adjacent visual tokens into a single embedding, which is then mapped into the language model’s feature space via a linear projection layer.
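The token merging and projection described above can be illustrated with a minimal sketch in plain Python. It assumes that "adjacent" means consecutive in the token sequence and uses toy dimensions and stand-in weights; the function names `merge_adjacent_tokens` and `linear_project` are hypothetical and are not taken from the released code.

```python
from typing import List

Vector = List[float]


def merge_adjacent_tokens(tokens: List[Vector], group: int = 4) -> List[Vector]:
    """Concatenate every `group` consecutive token vectors into one longer vector."""
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by the group size")
    merged: List[Vector] = []
    for start in range(0, len(tokens), group):
        combined: Vector = []
        for token in tokens[start:start + group]:
            combined.extend(token)
        merged.append(combined)
    return merged


def linear_project(vectors: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Apply y = W x + b to each vector; `weight` has one row per output feature."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]


# Toy sizes (hypothetical): 8 visual tokens of width 2, projected to width 3.
visual_tokens = [[float(i), float(i) + 0.5] for i in range(8)]
merged = merge_adjacent_tokens(visual_tokens)       # 2 tokens of width 8
weight = [[0.1] * 8, [0.0] * 8, [-0.1] * 8]         # stand-in for learned weights
bias = [0.0, 1.0, 0.0]
projected = linear_project(merged, weight, bias)    # 2 tokens of width 3
print(len(visual_tokens), "->", len(merged), "tokens; width", len(projected[0]))
```

Merging by concatenation cuts the number of tokens passed to the language model by a factor of four while keeping all of the per-token features, at the cost of a wider input to the projection layer.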

3.2 Prompt Template.

We employ a prompt template that allows our model to handle many diverse medical vision-language skills, such as visual question answering, image captioning, referring expression comprehension (REC), referring expression generation (REG), disease detection, and grounded image captioning. A language model can suffer from hallucination and confusion when dealing with many diverse vision-language tasks. For instance, when asked to identify a potential lung tumor, it could mistakenly focus on and describe areas of calcification in the blood vessels or heart. Therefore, to avoid ambiguity in these multi-task environments, we add task-specific tokens to our training framework. Our instruction template follows a design similar to that of MiniGPT-v2 Chen et al. (2023) and is presented as follows:

[INST] <Img><ImageFeature></Img> [Task Identifier] Instruction [/INST]

We present diverse prompt templates in Table 1 to demonstrate how our model effectively deals with the different tasks through task identifiers.
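As an illustration, the template can be filled in with a few lines of Python. This is a sketch only: the helper name `build_prompt` and its validation are assumptions, the six identifiers are those listed in Table 1, and `<ImageFeature>` is kept as a literal placeholder for the projected image features.

```python
# Task identifiers as listed in Table 1.
TASK_IDENTIFIERS = ("caption", "vqa", "detection", "refer", "grounding", "identify")


def build_prompt(task: str, instruction: str, image_placeholder: str = "<ImageFeature>") -> str:
    """Fill the instruction template with a task identifier and an instruction."""
    if task not in TASK_IDENTIFIERS:
        raise ValueError(f"unknown task identifier: {task!r}")
    return f"[INST] <Img>{image_placeholder}</Img> [{task}] {instruction} [/INST]"


print(build_prompt("vqa", "What plane is the image in?"))
print(build_prompt("detection", "pneumonia"))
```

Keeping the identifier in a fixed position right after the image block gives the model an unambiguous signal about which skill is being requested.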

3.3 Region grounding representation.

For grounding skills that involve the spatial location of objects, such as disease detection and grounded image captioning, we employ a textual representation for bounding boxes. This representation allows us to integrate spatial locations into the text fed into the language model. We normalize the bounding box coordinates within the [0,100] range. Each spatial location is expressed in the format:

$\{<X_{\text{left}}><Y_{\text{top}}><X_{\text{right}}><Y_{\text{bottom}}>\}$
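A minimal sketch of this textual encoding is shown below. It assumes pixel-space input boxes and rounding to the nearest integer; the rounding rule and the helper names `bbox_to_text` and `text_to_bbox` are assumptions rather than details stated in the paper.

```python
import re
from typing import Tuple


def bbox_to_text(box: Tuple[float, float, float, float], width: int, height: int) -> str:
    """Normalize a pixel-space box to the [0, 100] range and render it as text."""
    x_left, y_top, x_right, y_bottom = box

    def scale(value: float, size: int) -> int:
        # Rounding to the nearest integer is an assumption of this sketch.
        return max(0, min(100, round(100 * value / size)))

    coords = (scale(x_left, width), scale(y_top, height),
              scale(x_right, width), scale(y_bottom, height))
    return "{" + "".join(f"<{c}>" for c in coords) + "}"


def text_to_bbox(text: str) -> Tuple[int, ...]:
    """Parse the first textual box found in `text` back into four integers."""
    match = re.search(r"\{\s*<(\d+)><(\d+)><(\d+)><(\d+)>\s*\}", text)
    if match is None:
        raise ValueError("no bounding box found in text")
    return tuple(int(group) for group in match.groups())


encoded = bbox_to_text((251, 72, 376, 260), width=448, height=448)
print(encoded)                # {<56><16><84><58>}
print(text_to_bbox(encoded))  # (56, 16, 84, 58)
```

Because the box is plain text, the language model can both read it in an instruction and emit it in an answer without any extra detection head.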

4 Experiments

The experiment aimed to evaluate the efficacy of MiniGPT-Med with a focus on its ability to accurately analyze and describe complex medical imaging data for applications like lung cancer detection, report generation, and question-and-answer capabilities. We fine-tuned stage 3 of MiniGPT-v2 using a comprehensive dataset of radiological images, including X-rays, MRIs, and CT scans, covering a wide range of medical conditions for a variety of skills.

4.1 Dataset Setup

The lack of quality medical datasets is a significant challenge in the field of deep learning for medical imaging. To address this issue, we have prioritized the collection of a comprehensive dataset focusing on radiology, specifically lung diseases, as well as general medical information. Our goal is to gather a diverse and extensive range of medical images, including X-rays, CT scans, and MRI images. Furthermore, we aim to enhance the dataset by incorporating images with bounding boxes, datasets featuring question-and-answer formats, and datasets for report generation. These additions will support all the necessary skills for the model training and development.

Task Types	[Identifier] Instruction
Caption	[caption] Could you describe the contents of this image for me?
VQA	[vqa] What plane is the image in?
Detection	[detection] pneumonia
Refer	[refer] the nodule in the left lung
Grounding	[grounding] describe this image in detail
Identify	[identify] what is this {<56><16><84><58>}

Table 1: Task-specific instruction format. Each instruction is placed after the image features (<ImageFeature>) in the prompt template. During model training, we used six different types of task identifiers for diverse grounding and non-grounding tasks.

The collected datasets include MIMIC Johnson et al. (2019), NLST The Cancer Imaging Archive (2023), SLAKE Medical Visual Question Answering (2023) (Med-VQA), RSNA Radiological Society of North America (2018), and RadVQA OSF (2023s). The details of these datasets are as follows.

MIMIC. The dataset comprises 377,110 images and 227,835 medical reports. In our study, we obtained the preprocessed MIMIC dataset from XrayGPT Thawkar et al. (2023a), which includes 114,539 de-identified chest X-ray images in JPG format, each accompanied by a corresponding radiology report. Of these, 171,085 images and reports are allocated for training, while 43,454 images and reports are designated for testing. This dataset is utilized for the task of report generation.

NLST. This dataset is employed for the detection task. It encompasses 7,625 annotated low-dose CT scan images for lung cancer, marked to pinpoint nodule locations. From the complete 3D volume, we extracted the 2D CT slice displaying the nodule. These annotations, used for training, were sourced from the work of Sybil Mikhael et al. (2023).

SLAKE. This dataset is used for training the grounding and VQA tasks. It comprises 579 radiology images delineating various body organs, coupled with 3,543 question-answer pairs used for training.

RSNA. We use the RSNA dataset to evaluate the pneumonia detection task. The dataset comprises 1,218 patients with at least one pneumonia condition. We perform zero-shot evaluation on this dataset for the disease detection task.

RadVQA. This dataset includes 315 radiology images evenly spread across the head, chest, and abdomen, each paired with multiple questions, for a total of 2,248 question-answer pairs. These questions fall into 11 distinct categories: abnormality, attribute, modality, organ system, color, counting, presence of objects or conditions, size, plane, and positional reasoning. Half of the responses are closed-ended (i.e., yes/no), while the remaining are open-ended, typically requiring one-word or short-phrase replies. We perform zero-shot evaluation on the RadVQA dataset.

4.2 Training Details

In our experiment, we initialize our model with MiniGPT-v2 Chen et al. (2023) pre-trained weights (after stage 3) and keep the vision encoder frozen throughout the whole training process. We finetune the linear projection layer and use LoRA (Low-Rank Adaptation) Hu et al. (2021) to finetune the LLaMA-2 Touvron et al. (2023) large language model. The model is trained using the cross-entropy loss function, which is optimized using the AdamW optimizer. Our dataset comprises 124,276 medical images, each with a resolution of 448x448 pixels, and no data augmentation is applied. The entire training was performed on a single NVIDIA A100 GPU over 100 epochs, with a maximum learning rate of 1e-5. The training duration was approximately 22 hours.
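For readers unfamiliar with LoRA, the idea of Hu et al. (2021) is to keep a pretrained weight matrix $W_0$ frozen and learn a low-rank update, so that a layer computes $W_0 x + \frac{\alpha}{r} B A x$. The sketch below illustrates only that formula in plain Python; the rank, scaling factor, and matrix values are toy assumptions, since the paper does not report the LoRA hyperparameters it used.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w0: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W0 x + (alpha / r) * B (A x), where r is the number of rows of A."""
    rank = len(a)
    base = matvec(w0, x)               # frozen pretrained path
    update = matvec(b, matvec(a, x))   # trainable low-rank path
    scale = alpha / rank
    return [y + scale * d for y, d in zip(base, update)]


# Toy example (hypothetical values): 2x2 frozen weight, rank-1 adapter.
w0 = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]          # shape (r=1, in=2)
b_zero = [[0.0], [0.0]]   # B starts at zero, so the adapter is initially a no-op
b_trained = [[1.0], [0.0]]
x = [2.0, 3.0]

print(lora_forward(w0, a, b_zero, x, alpha=2.0))     # [2.0, 3.0]
print(lora_forward(w0, a, b_trained, x, alpha=2.0))  # [12.0, 3.0]
```

Only `a` and `b` would receive gradients in such a setup, which is what makes fine-tuning a 7B-parameter language model feasible on a single GPU.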

4.3 Baseline models

We assessed MiniGPT-Med’s performance on three distinct tasks: medical report generation, disease detection, and medical visual question answering (VQA). We compared our model to both specialist and generalist models. Specialist models are those that can perform only grounding or only non-grounding tasks. Generalist models are those that can perform various tasks, including both grounding and non-grounding tasks.

– For the medical report generation task, we compared MiniGPT-Med with specialist models including Med-Flamingo Moor et al. (2023) and LLaVA-Med Li et al. (2024), known for their prowess in vision-language tasks and contextual learning abilities. Additionally, we compared MiniGPT-Med with RadFM Wu et al. (2023b), which is specifically tailored for radiology, and XrayGPT Thawkar et al. (2023a), a novel vision-language model designed for chest radiograph analysis. Furthermore, we evaluated MiniGPT-Med against CheXagent Chen et al. (2024), a foundation model focused on improving chest X-ray interpretation. Moreover, comparisons were made with generalist models like MiniGPT-v2 and Qwen-VL Bai et al. (2023), trained on the general vision-language data, showcasing exceptional performance across various vision-focused comprehension benchmarks.

– In the disease detection task, MiniGPT-Med was compared against specialist models including BioVil Bannur et al. (2023), MedKLIP Wu et al. (2023a), and GLoRIA Huang et al. (2021), all pre-trained on vision-language medical datasets, as well as generalist models including MiniGPT-v2 and Qwen-VL.

– In the medical VQA task, we compared MiniGPT-Med with specialized models like MedVInT Zhang et al. (2023), OpenFlamingo Awadalla et al. (2023), and Med-Flamingo Moor et al. (2023), which are tailored to address the challenges of medical VQA, particularly in zero-shot scenarios, using the RadVQA dataset. Additionally, we compared our model with generalist models such as MiniGPT-v2 and Qwen-VL to provide a comprehensive evaluation of MiniGPT-Med’s performance.

Method	Model type	BERT-Sim (MIMIC-CXR)	CheXbert-Sim (MIMIC-CXR)
Med-Flamingo	Specialist	10.4	3.2
LLaVA-Med	Specialist	6.2	17.5
RadFM	Specialist	45.7	17.5
XrayGPT	Specialist	44.0	24.2
CheXagent	Specialist	50.4	24.9
MiniGPT-v2	Generalist	53.0	21.1
Qwen-VL	Generalist	51.9	20.3
Ours	Generalist	72.0	30.1

Table 2: Evaluation of Medical Report Generation: MiniGPT-Med versus Generalist and Specialist Models. MiniGPT-Med is contrasted with a generalist model capable of executing a wide range of grounding and non-grounding tasks, alongside specialist models limited to non-grounding tasks. The highest performance metrics for both specialist and general models are highlighted in bold.

Method	Model type	RSNA IoU
BioViL	Specialist	0.30
MedKLIP	Specialist	0.31
GLoRIA	Specialist	0.21
Qwen-VL	Generalist	0.10
MiniGPT-v2	Generalist	0.13
Ours	Generalist	0.26

Table 3: Zero-shot evaluation of disease detection on the RSNA benchmark: a comparison of our model with generalist and specialist models. The top performance metrics for both specialist and generalist models are highlighted in bold.

4.4 Evaluation Metrics

We adapted our evaluation approach to the distinct skills required for interpreting radiology images with MiniGPT-Med. To assess the model’s ability to generate radiological reports, we used two metrics: BERT Similarity (BERT-Sim) and CheXbert Similarity (CheXbert-Sim). BERT-Sim evaluates the semantic similarity between the model-generated descriptions of radiological images and the expert-provided ground truth annotations. A BERT model embeds both the ground truth and the generated sentences, and the cosine similarity between these embeddings is then computed. CheXbert-Sim, in turn, was selected for its relevance in assessing how well the model replicates professional medical report standards. It relies on a specialized version of the BERT model, fine-tuned on clinical texts, to encode each corresponding sentence pair and then computes the cosine similarity between the resulting embeddings. For Visual Question Answering (VQA), we used only BERT-Sim to measure the semantic accuracy of the model’s responses. For grounding, we used Intersection over Union (IoU), a metric that quantitatively measures the model’s precision in localizing specific features or abnormalities within radiology images, such as pneumonia in the RSNA dataset.
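The two computations underlying these metrics, cosine similarity between embeddings and IoU between boxes, can be sketched as follows. The embedding step itself (BERT or its clinically fine-tuned variant) is not reproduced here; the vectors and boxes below are toy values chosen purely for illustration, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_left, y_top, x_right, y_bottom)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def box_iou(pred: Box, truth: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(pred[2], truth[2]) - max(pred[0], truth[0]))
    inter_h = max(0.0, min(pred[3], truth[3]) - max(pred[1], truth[1]))
    intersection = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = area_pred + area_truth - intersection
    return intersection / union if union > 0 else 0.0


# Toy embeddings standing in for encoded report sentences.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0

# Two 10x10 boxes overlapping in a 5x5 region: 25 / (100 + 100 - 25).
print(round(box_iou((0, 0, 10, 10), (5, 5, 15, 15)), 4))  # 0.1429
```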

4.5 Medical Report Generation

We evaluated MiniGPT-Med on medical report generation using the MIMIC dataset Johnson et al. (2019). The results, outlined in Table 2, show that MiniGPT-Med surpasses both specialized and generalized baseline models. Most notably, it outperforms the leading specialized model, CheXagent Chen et al. (2024), by 21.6 and 5.2 points on the BERT-Sim and CheXbert-Sim metrics, respectively. It also outpaces the top generalist model, MiniGPT-v2, by 19 points on BERT-Sim and 9 points on CheXbert-Sim. These findings demonstrate MiniGPT-Med’s effectiveness in medical report generation.

4.6 Disease Detection

The results in Table 3 show that MiniGPT-Med performs competitively against a comprehensive range of baseline models. With an Intersection over Union (IoU) score of 0.26, MiniGPT-Med exceeds the generalist baselines, Qwen-VL (0.10) and MiniGPT-v2 (0.13), by 0.16 and 0.13 IoU, respectively, and approaches the specialist models, whose peak IoU score is 0.31 (MedKLIP). These results highlight its potential as a versatile and effective tool in the medical domain.

4.7 Medical Visual Question Answering

We evaluate MiniGPT-Med against various baseline models on the RadVQA OSF (2023s) benchmark, as presented in Table 4. MiniGPT-Med achieves a BERT-Sim score of 0.58, surpassing both generalist models such as MiniGPT-v2 Chen et al. (2023) and specialist models like OpenFlamingo Awadalla et al. (2023) and Med-Flamingo Moor et al. (2023). It also comes close to the leading specialist model, MedVInT Zhang et al. (2023), which scores 0.62. The ability of MiniGPT-Med to outperform or approach several specialized and generalist models underscores its potential as a foundation for the development of advanced medical visual question-answering models.

Method	Model type	BERT-Sim (RadVQA)
MedVInT	Specialist	0.62
OpenFlamingo	Specialist	0.49
Med-Flamingo	Specialist	0.48
Qwen-VL	Generalist	0.13
MiniGPT-v2	Generalist	0.55
Ours	Generalist	0.58

Table 4: Zero-shot evaluation of visual question answering on the RadVQA benchmark: a comparison of our model with generalist and specialist models. The top performance metrics for both specialist and generalist models are highlighted in bold.

4.8 Radiology Expert Evaluation

We evaluated MiniGPT-Med using a human subjective protocol with two senior radiologists. They assessed 50 random samples from the MIMIC test set, focusing on the model’s robustness, granularity, and accuracy. The evaluation centered on three questions. Q1: How closely does the generated report align with your expert judgment? Q2: How detailed is the medical content of the generated report? Q3: How accurate is the generated report in diagnosing pathologies? We present the results in Table 5. The results show that 76% of the generated medical reports were judged to be of high quality. A further 19% were classified as medium quality, while 5% were deemed to be of poor quality. This distribution indicates that the model can synthesize medical reports that are consistent with professional standards and exhibit a high degree of detail and diagnostic accuracy. These findings suggest that MiniGPT-Med can be a valuable tool for augmenting medical reporting processes.

	Radiologist Evaluation
Quality	Percentage
Good	76%
Medium	19%
Poor	5%

Table 5: Expert Manual Evaluation for Medical Report Generation. We evaluate the model in terms of robustness, granularity, and accuracy. The table presents the percentage of votes for different quality categories.

4.9 Qualitative Evaluation

In this section, we provide comprehensive demonstrations of MiniGPT-Med’s capabilities in generating medical reports and performing interpretative tasks. First, Figure 3(a) illustrates the model’s ability to produce detailed medical reports from imagery data. Also, the model can accurately identify and delineate specified abnormalities with bounding boxes as shown in Figure 3(b). Additionally, Figure 4(a) demonstrates the grounding skill, where the model explains each generated word and draws a bounding box around the object. Furthermore, Figure 4(b) details the model’s precision in referencing, and pinpointing abnormalities as specified by users. Moreover, the identification feature is showcased in Figure 4(c), where the model provides elaborate medical descriptions utilizing object coordinates. Finally, Figure 4(d) presents the model’s visual question-answering (VQA) functionality, underscoring its effectiveness in providing precise answers to medical questions.

5 Limitation

MiniGPT-Med faces challenges due to a lack of diverse and high-quality training datasets, which limits its coverage to a narrow range of diseases. Improving it will require richer and more diverse datasets, along with advanced vision backbones and enhancements in the underlying large language model. The model occasionally generates inaccurate medical reports and improperly connects symptoms to diseases, a phenomenon known as hallucination. Additionally, it struggles to distinguish abnormalities from device implants in the human body. Fig. 5 shows a sample in which MiniGPT-Med failed to correctly identify the pneumonia location. The green bounding box marks the ground truth and the red bounding box marks the false detection. When the model encounters X-rays or MRIs featuring implants, it may incorrectly identify these as abnormalities, which can result in misdiagnosed conditions.

6 Conclusions

In this study, we introduce MiniGPT-Med, a specialized multi-modal model designed for radiology diagnosis applications. It handles various medical vision-language tasks such as generating medical reports, detecting diseases, and answering visually-based medical questions, using distinct task identifiers to navigate these tasks efficiently. MiniGPT-Med performs competitively with baseline models in both grounding and non-grounding tasks and achieves state-of-the-art performance on the MIMIC-CXR medical report generation task. Radiologist evaluations show that approximately 76% of the generated reports are of good quality. Future plans include incorporating more diverse medical datasets, improving the understanding of complex medical terminology, enhancing interpretability and dependability, and conducting extensive clinical validation studies to ensure effectiveness and safety in real healthcare environments.

Acknowledgments

This work was supported by KAUST-SDAIA funding. Asma, Raneem, and Layan started working on this project as visiting research engineers at KAUST VisionCAIR. We extend our sincere gratitude to the KAUST HPC/Ibex cluster Team for their invaluable assistance and support during this research project to train the model.
