Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering

J Biomed Inform. 2024 Nov 11:104748. doi: 10.1016/j.jbi.2024.104748. Online ahead of print.

Abstract

Medical Visual Question Answering (VQA) is a task that aims to answer questions about medical images, utilizing both visual and textual information in the reasoning process. The absence of large-scale annotated medical VQA datasets presents a formidable obstacle to training a medical VQA model from scratch in an end-to-end manner. Existing works use image captioning datasets in the pre-training stage and then fine-tune on downstream VQA tasks. Following the same paradigm, we use a collection of public medical image captioning datasets to pre-train multimodal models in a self-supervised setup, and fine-tune them on downstream medical VQA tasks. In this work, we propose Cross-Modal pre-training with Multiple Objectives (CMMO), which combines masked image modelling, masked language modelling, image-text matching, and image-text contrastive learning. The proposed method is designed to associate the visual features of medical images with the corresponding medical concepts in captions, learning aligned vision-language feature representations and multi-modal interactions. The experimental results show that our proposed CMMO method outperforms state-of-the-art methods on three public medical VQA datasets, with absolute improvements of 2.6%, 0.9%, and 4.0% on the VQA-RAD, PathVQA, and SLAKE datasets, respectively. We also conduct comprehensive ablation studies to validate our method, and visualize the attention maps, which demonstrate strong interpretability. The code and pre-trained weights will be released at https://github.com/pengfeiliHEU/CMMO.
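The abstract names four self-supervised pre-training objectives (masked image modelling, masked language modelling, image-text matching, and image-text contrastive learning). The sketch below is only an illustration of how such objectives are commonly combined into a single training loss; it is not the authors' implementation, and the function names, the contrastive formulation, the temperature, and the equal loss weights are all assumptions.

```python
# Illustrative sketch (not the authors' code): combining MIM, MLM, ITM, and
# ITC losses into one multi-objective pre-training loss. Shapes, weights,
# and the InfoNCE-style contrastive loss below are assumptions.
import torch
import torch.nn.functional as F


def itc_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric image-text contrastive loss over a batch of paired embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def total_pretraining_loss(loss_mim, loss_mlm, loss_itm, loss_itc,
                           weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four objectives; equal weights are an assumption."""
    return (weights[0] * loss_mim + weights[1] * loss_mlm
            + weights[2] * loss_itm + weights[3] * loss_itc)


if __name__ == "__main__":
    # Random embeddings stand in for image/text encoder outputs.
    B, D = 8, 256
    img_emb = torch.randn(B, D, requires_grad=True)
    txt_emb = torch.randn(B, D)
    l_itc = itc_loss(img_emb, txt_emb)
    # MIM / MLM / ITM losses would come from reconstruction, masked-token
    # prediction, and binary matching heads; scalar placeholders used here.
    loss = total_pretraining_loss(torch.tensor(0.5), torch.tensor(0.7),
                                  torch.tensor(0.3), l_itc)
    loss.backward()
    print(loss.item())
```

In practice, each of the four losses would be produced by its own head on top of shared vision and language encoders, and the weighting of the terms is a design choice that the abstract does not specify.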

Keywords: Contrastive learning; Cross-modal self-supervised; Medical VQA; Medical vision-language pre-training.