Multi-modal medical image segmentation is a crucial task in oncology, enabling precise localization and quantification of tumors. This work presents a meta-analysis of multi-modal medical Transformers for image segmentation in oncology, focusing on multi-parametric MR brain tumor segmentation (BraTS2021) and head and neck tumor segmentation from PET-CT images (HECKTOR2021). The multi-modal Transformer architectures presented here exploit modality interaction schemes inspired by visio-linguistic representations: (i) single-stream, where all modalities are jointly processed by one Transformer encoder, and (ii) multiple-stream, where the inputs are encoded separately before being jointly modeled. A total of fourteen multi-modal architectures are evaluated using different ranking strategies based on the Dice similarity coefficient (DSC) and average symmetric surface distance (ASSD). In addition, cost indicators such as the number of trainable parameters and the number of multiply-accumulate operations (MACs) are reported. The results show that multi-path hybrid CNN-Transformer models improve segmentation accuracy over traditional methods, but at the cost of increased computation time and potentially larger model size.
Keywords: CNN; Medical imaging; Multi-modality; Oncology; Tumor segmentation; Vision transformers.
Copyright © 2023 Elsevier Ltd. All rights reserved.