MDA: An Interpretable Multi-Modal Fusion with Missing Modalities and Intrinsic Noise

Lin Fan¹, Yafei Ou²∗, Cenyang Zheng¹, Pengyu Dai², Tamotsu Kamishima³,
Masayuki Ikebe³, Kenji Suzuki², Xun Gong¹†
¹ Southwest Jiaotong University, Chengdu, China
² Tokyo Institute of Technology, Yokohama, Japan
³ Hokkaido University, Sapporo, Japan
∗ Equal Contribution. † Corresponding Author ([email protected], [email protected])

Abstract

Multi-modal fusion is crucial in medical data research, enabling a comprehensive understanding of diseases and improving diagnostic performance by combining diverse modalities. However, multi-modal fusion faces challenges, including capturing interactions between modalities, addressing missing modalities, handling erroneous modal information, and ensuring interpretability. Existing studies tend to design separate solutions for these problems, often overlooking the commonalities among them. This paper proposes a novel multi-modal fusion framework that adaptively adjusts the weight of each modality by introducing Modal-Domain Attention (MDA). It aims to facilitate the fusion of multi-modal information while tolerating missing modalities or intrinsic noise, thereby enhancing the representation of multi-modal data. By observing the process of modal fusion, we provide visualizations of accuracy changes and MDA weights, offering a comprehensive analysis of its interpretability. In extensive experiments on various gastrointestinal disease benchmarks, the proposed MDA maintains high accuracy even in the presence of missing modalities and intrinsic noise. Notably, the visualization of MDA is highly consistent with the conclusions of existing clinical studies on the dependence of different diseases on various modalities. Code and dataset will be made available.

1 Introduction

Medical multi-modal fusion has gained increasing attention as it integrates medical information from different modalities, providing comprehensive diagnostic evidence for healthcare professionals [1]. For instance, in diagnosing gastrointestinal disorders, white-light endoscopy (WLE) provides information regarding the shape, size, and surface characteristics of lesions, aiding in the exclusion of visually typical abnormalities [2]. Endoscopic ultrasonography (EUS) can delineate individual histologic layers and accurately define the most relevant site of tumor origin [3]. Integrating these image modalities and combining their respective imaging reports can contribute to a more accurate and comprehensive understanding of lesion attributes, potentially leading to improved clinical outcomes [4]. Driven by the dual forces of data acquisition and technological advancements in the field of medicine, multi-modal learning has been widely applied to enhance the performance of clinical prediction tasks, including disease-assisted diagnosis [5] and prognostic forecasting [6].

Figure 1: A unified multi-modal learning strategy involves learning with different multi-modal configurations. (a) Train and test with full modality. (b) The model will reduce its attention when learning with missing modalities or intrinsic noise.

Integrating multi-modal data has emerged as a new trend in medical data research. It enables the effective utilization of clinical data and provides strong support for tasks such as assisted diagnosis and prediction. Despite recent efforts in this domain, fundamental and challenging issues remain due to the complexity of multi-modal clinical data and real-world application scenarios:

Challenge 1: The difficulty of multi-modal fusion arising from modality heterogeneity. Despite the individual strengths of each modality, integrating different modalities, including images and text, to improve disease diagnosis is still hindered by modality heterogeneity, which arises from the diverse data representations across modalities. Various methods have been proposed for fusing heterogeneous modal data. Early fusion methods relied on multi-scale transforms [7, 8, 9] and sparse representation [10, 11]. With the rise of deep learning, recent years have witnessed the emergence of deep-learning-based fusion methods. However, many of these techniques fail to exploit the synergy between modalities, often resorting to basic concatenation of latent features [12, 13, 14], which limits the benefit gained from integrating multi-modal data. More recently, methods leveraging multi-modal attention to enhance modality fusion have emerged [15, 16]. However, they often lack specific analysis of attention patterns across modalities, which is crucial for preventing modal data misuse and promoting medical interpretability. There is also a lack of in-depth exploration of highly specific multi-modal data, such as integrating multi-perspective images and textual information and assessing their contributions.
This constitutes a prevalent research gap in the field of medical applications. Hence, effectively capturing intricate interactions among vastly diverse modalities remains an unresolved obstacle.

Challenge 2: Missing modalities in real-world scenarios. Many existing multi-modal fusion methods assume that all modalities are available for all training and testing samples, which is unrealistic in real-world applications. For instance, in tumor segmentation and classification tasks on multi-modal medical images [17, 18], generative models are used to synthesize missing modalities. However, accurately synthesizing images from the text modality is infeasible, and generative methods typically generate specific modalities or use specific modalities for synthesis, so they cannot flexibly handle varying numbers of missing modalities. There is still a need for solutions that perform adaptive modality feature fusion and can handle arbitrary missing modalities.

Challenge 3: Modality inconsistency and intrinsic noise. In diagnosing gastrointestinal diseases, different cases focus on different modalities. For example, using WLE alone achieves 99% specificity in diagnosing lipomas, whereas it is not reliable for diagnosing ectopic pancreas or gastrointestinal stromal tumors, which depend more on EUS [19]. If the inconsistency between modalities across different diseases is not appropriately considered, it can significantly reduce the accuracy of model predictions, leading to suboptimal clinical outcomes. Furthermore, multi-modal learning with diagnostic reports faces intrinsic noise due to its heavy reliance on subjective annotations by clinical experts [20]. Intrinsic noise refers to information hidden in the text that contradicts the knowledge conveyed by other modalities and can affect the final predictions.
To the best of our knowledge, we are the first to explore the presence of intrinsic noise in diagnostic texts. Selectively filtering accurate and advantageous multi-modal information, while avoiding errors that can compromise diagnostic accuracy, remains an unresolved research challenge in multi-modal feature fusion.

Challenge 4: The comprehensive interpretability of multi-modal fusion. Understanding AI model decisions correctly has been a persistent challenge in achieving interpretability in multi-modal fusion [21, 22]. Currently, the interpretability of medical-assisted diagnostic models primarily relies on heatmaps generated with techniques such as Grad-CAM [23] or Score-CAM [24] to visualize the network's attention [4, 25]. However, heatmaps are visualizations of network parameters and cannot explain modality specificity in different cases, i.e., they cannot reflect modality weights for different cases. Further research is needed to achieve comprehensive interpretability of multi-modal fusion in terms of its efficacy.

To address the aforementioned challenges, we propose a novel modality fusion approach in which we introduce Modal-Domain Attention (MDA) to adaptively adjust the weight of each modality. MDA leverages continuous attention between different modality features to seek the optimal allocation of attention across multiple modalities. With MDA, we also achieve interpretability of modality domains, and we reduce the impact of missing modalities and intrinsic noise on the final results by decreasing their weights (as shown in Fig. 1). Our main contributions are summarized as follows:

• We propose a novel multi-modal fusion framework that achieves adaptive control over the weights of multiple modalities by incorporating the MDA.

• MDA provides a unified solution to several challenges in multi-modal fusion, including multi-modal integration, modality missing, and learning with intrinsic noise.

• Based on MDA, we comprehensively analyze the sources of multi-modal efficacy from a macroscopic perspective (across different diseases) and a microscopic perspective (within individual cases) while providing interpretability to the model from a clinical experience perspective.

• Supplementary experiments demonstrate that we achieve state-of-the-art (SOTA) performance in multi-modal fusion, missing modalities, and learning with intrinsic noise.

2 Method

2.1 Problem setting

Classifying submucosal tumors is crucial for surgical decision-making in clinical practice [26]. This study performs multi-class learning on submucosal tumors, including Gastrointestinal stromal tumors (GISTs), Gastric leiomyomas (GLMs), Neuroendocrine tumors (NETs), Ectopic pancreas (EPs), Lipomas, Gastrointestinal schwannomas (GSs), Pneumatosis cystoid (PCs) and Inflammatory fibroid polyps (IFPs). This study focuses on clinical prediction using three modalities: EUS, WLE, and imaging reports. Imaging reports provide descriptions of lesion attributes observed in the two imaging modalities. We denote the multi-modal data as ${\mathcal{X}=\{\mathbf{X}_{i}^{eus},\mathbf{X}_{i}^{wle},\mathbf{X}_{i}^{report}\}_{i=1}^{N}}$ , where ${i}$ indexes the ${i^{th}}$ sample, and the corresponding labels as ${Y_{i}}$ . Therefore, the entire classification task can be defined as ${\mathcal{T}_{multi}=\{\mathbf{X}_{i}^{eus},\mathbf{X}_{i}^{wle},\mathbf{X}_{i}^{report},Y_{i}\}_{i=1}^{N}}$ , where ${N}$ represents the sample size. The label ${Y_{i}}$ is represented as a one-hot vector for this multi-class task.
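As a small illustration of the label encoding described above, the sketch below builds a one-hot vector over the eight tumor classes. The class order, the `one_hot` helper, and the placeholder file names are assumptions made for illustration only; the paper does not specify them.

```python
# The eight submucosal tumor classes named in Section 2.1.
# The ordering is an assumption; the paper does not fix one.
CLASSES = ["GIST", "GLM", "NET", "EP", "Lipoma", "GS", "PC", "IFP"]


def one_hot(label):
    """Return the one-hot vector (list of 0/1) for a class label."""
    index = CLASSES.index(label)
    return [1 if k == index else 0 for k in range(len(CLASSES))]


# A hypothetical sample: the three modality entries are placeholders.
sample = {
    "eus": "eus_0001.png",
    "wle": "wle_0001.png",
    "report": "free-text imaging report",
    "label": one_hot("NET"),
}
```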

2.2 Multi-modal fusion framework

2.2.1 Overview

An overview of the proposed method is depicted in Fig. 2. Specifically, we first construct individual multi-disease classification tasks ${\mathcal{T}_{eus}={\{\mathbf{X}_{i}^{eus},Y_{i}}\}_{i=1}^{N}}$ , ${\mathcal{T}_{wle}={\{\mathbf{X}_{i}^{wle},Y_{i}}\}_{i=1}^{N}}$ and ${\mathcal{T}_{report}={\{\mathbf{X}_{i}^{report},Y_{i}}\}_{i=1}^{N}}$ for each modality separately. Then, we freeze the single-modality structures and feed the extracted features into the MDA to calculate the adaptive weights for each modality. Finally, we perform multi-modal feature fusion based on the inter-modality weights and feed the fused features into the classifier to accomplish the classification task ${\mathcal{T}_{multi}}$ .

2.2.2 Building uni-modal networks

The proposed framework utilizes pre-trained convolutional neural networks (CNNs), such as ResNet, as feature extractors for the two image modalities; such networks are well known for their proficiency in extracting high-level image features. The vectorized latent features obtained from the feature extractors are denoted as:

$$\mathbf{f}_{eus}^{i}=\mathbf{G}_{eus}\left(\mathbf{X}_{i}^{eus}\right),\quad \mathbf{f}_{wle}^{i}=\mathbf{G}_{wle}\left(\mathbf{X}_{i}^{wle}\right) \tag{1}$$

where ${\mathbf{G}_{eus}}$ and ${\mathbf{G}_{wle}}$ represent two different feature extractors. We employ a self-attention module to capture the interactions within each modality of a sample.

Additionally, the self-attention module is kept separate from inter-modal weight learning, which reduces ambiguity in interpreting the reasons behind observed performance improvements. Specifically, the self-attention module does not participate in the subsequent weight adjustments of the inter-modal module, so improvements obtained at the fusion stage cannot be attributed to its intervention. This design facilitates a clearer analysis of the true causes of performance enhancement.

Notably, the implementation of the self-attention module is identical in both the EUS and WLE modalities. The calculation of self-attention scores is as follows:

$$\begin{aligned}
\mathbf{f}_{IS-eus}^{i} &= \operatorname{softmax}\left(\frac{query_{eus}^{i}\cdot(key_{eus}^{i})^{T}}{\sqrt{dim}}\right)\cdot value_{eus}^{i} \\
\mathbf{f}_{IS-wle}^{i} &= \operatorname{softmax}\left(\frac{query_{wle}^{i}\cdot(key_{wle}^{i})^{T}}{\sqrt{dim}}\right)\cdot value_{wle}^{i}
\end{aligned} \tag{2}$$

where ${query^{i}}$ , ${key^{i}}$ , and ${value^{i}}$ are linear transformations of the latent feature ${\mathbf{f}^{i}}$ . We employ BERT [27] to encode the input text. BERT learns contextual representations of words or subwords in a text by using a self-attention mechanism, which is consistent with the self-attention module used in image feature extraction. To adapt BERT to our specific task, we unfreeze the last four layers of BERT for training. The feature extraction representation for textual reports is as follows:

$$\mathbf{f}_{IS-report}^{i}=\mathbf{BERT}\left(\mathbf{X}_{i}^{report}\right) \tag{3}$$

Then the ${\mathbf{f}_{IS}^{i}}$ are fed into the classification network, and three single-modality multi-disease classification networks are trained.
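The self-attention step of Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the token count, feature dimension, random projection matrices, and the names `softmax` and `self_attention` are all assumptions, and a random array stands in for the CNN output of Eq. (1).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(f, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one modality's features.

    f: (n_tokens, dim) latent features; w_q, w_k, w_v: (dim, dim) projections.
    Returns the attended features and the attention matrix.
    """
    query, key, value = f @ w_q, f @ w_k, f @ w_v
    dim = query.shape[-1]
    scores = softmax(query @ key.T / np.sqrt(dim))  # (n_tokens, n_tokens)
    return scores @ value, scores


rng = np.random.default_rng(0)
n_tokens, dim = 4, 8                      # hypothetical sizes
f_eus = rng.normal(size=(n_tokens, dim))  # stand-in for G_eus(X_eus)
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
f_is_eus, attn = self_attention(f_eus, w_q, w_k, w_v)
```

The same function would be applied unchanged to the WLE features, matching the statement that the module is identical for both image modalities.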

2.2.3 Modal-domain attention

The MDA weighting between modalities not only enhances the performance of modality fusion but also helps counteract the effects of missing modalities and intrinsic noise, as it plays a role in selecting modality-specific information that is more advantageous for the final outcome and weighting the fusion accordingly.

In the MDA, we aim to compute the correlations among all modalities. This involves computations of the intricate dependencies among all modal features when dealing with more than two modalities. Therefore, we use a continuous attention mechanism to compute the attention weights for any given modality while simultaneously considering multiple other modalities. The specific computation method is represented as follows:

$$\mathbf{f}_{MA-e}^{i}=\operatorname{softmax}\left(\frac{\operatorname{softmax}\left(\frac{query_{eus}^{i}\cdot(key_{wle}^{i})^{T}}{\sqrt{dim}}\right)\cdot value_{eus-1}^{i}\cdot(key_{report}^{i})^{T}}{\sqrt{dim}}\right)\cdot value_{eus-2}^{i} \tag{4}$$

where ${{query}^{i}}$ , ${key^{i}}$ , and ${value^{i}}$ represent the linear transformations of the latent features of the corresponding modalities for the ${i^{th}}$ sample. ${value_{eus-1}^{i}}$ and ${value_{eus-2}^{i}}$ represent two distinct EUS value projections, which are needed because the query state changes after the first attention step. The weight matrices for the WLE and report modalities are computed analogously.
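The chained form of Eq. (4) may be easier to follow as two successive attention steps. The NumPy sketch below follows the equation term by term for the EUS branch; the shapes, the random inputs, and the function name `mda_eus` are assumptions for illustration, and the linear projections that would produce the queries, keys, and values are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def mda_eus(q_eus, k_wle, k_report, v_eus_1, v_eus_2):
    """Continuous attention for the EUS branch, following the form of Eq. (4)."""
    dim = q_eus.shape[-1]
    # Step 1: EUS queries attend to WLE keys and re-weight the first EUS values.
    attn_wle = softmax(q_eus @ k_wle.T / np.sqrt(dim))               # (n, n)
    intermediate = attn_wle @ v_eus_1                                 # (n, dim)
    # Step 2: the intermediate state acts as the query against report keys.
    attn_report = softmax(intermediate @ k_report.T / np.sqrt(dim))  # (n, n)
    return attn_report @ v_eus_2, attn_report


rng = np.random.default_rng(0)
n, dim = 4, 8  # hypothetical token count and feature dimension
q_eus, k_wle, k_report, v1, v2 = (rng.normal(size=(n, dim)) for _ in range(5))
f_ma_e, weights = mda_eus(q_eus, k_wle, k_report, v1, v2)
```

Under this reading, the WLE and report branches would reuse the same function with the roles of the three modalities permuted.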

2.2.4 Objective function and optimization

For the classification training of single-modality models, given the features ${\mathbf{f}_{IS}}$ (represented uniformly for three different modalities) obtained after the self-attention module, we employ a multi-layer perceptron classifier ${\mathcal{C}_{s}}$ for disease prediction. The model training is guided by the cross-entropy loss, defined as follows:

$$\ell_{c-single}=\operatorname{CrossEntropy}\left(\mathcal{C}_{s}\left(\mathbf{f}_{IS}\right),Y\right) \tag{5}$$

For the multi-modal classification training, we obtain the fused features:

$$\mathbf{f}_{MA-O}^{i}=\oplus\left(\mathbf{f}_{MA-e}^{i},\mathbf{f}_{MA-w}^{i},\mathbf{f}_{MA-r}^{i}\right) \tag{6}$$

where ${\oplus}$ represents the concatenation operation. Then, the classification loss is defined as:

$$\ell_{c-fusion}=\operatorname{CrossEntropy}\left(\mathcal{C}_{f}\left(\mathbf{f}_{MA-O}\right),Y\right) \tag{7}$$

We employ the Adam optimizer with a weight decay rate of 1e-4 to optimize the model parameters.
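The fusion and loss in Eqs. (6) and (7) can be illustrated with a short NumPy sketch. It is a hedged example only: the feature size is arbitrary, the random vectors stand in for the three MDA outputs, and a single linear layer replaces the classifier ${\mathcal{C}_{f}}$, whose architecture the text does not specify.

```python
import numpy as np


def cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(one_hot_label * log_probs).sum())


rng = np.random.default_rng(0)
dim, n_classes = 8, 8  # hypothetical feature size; eight tumor classes
f_ma_e, f_ma_w, f_ma_r = (rng.normal(size=dim) for _ in range(3))

# Eq. (6): concatenate the three modality features.
f_ma_o = np.concatenate([f_ma_e, f_ma_w, f_ma_r])  # shape (3 * dim,)

# A single linear layer stands in for the classifier C_f (an assumption).
w_f = rng.normal(size=(3 * dim, n_classes)) * 0.1
logits = f_ma_o @ w_f

label = np.zeros(n_classes)
label[2] = 1.0
loss = cross_entropy(logits, label)  # Eq. (7)
```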

3 Experiments and discussion

We aim to develop a multi-modal fusion model that can effectively handle modality heterogeneity, adapt to varying degrees of missing modalities, filter out inconsistent information and intrinsic noise, and mitigate their negative impact on the results. To this end, we conducted experiments on three different multi-modal datasets, covering medical data, natural images, movies, and audio, to comprehensively assess the generality of the proposed model. The first dataset consists of multi-center gastrointestinal disease data with three modalities: EUS, WLE, and imaging reports. All experiments were conducted using the PyTorch framework on a GeForce RTX 4090 GPU.

In Section 3.1, we present the results of our proposed method in addressing the challenges of modality heterogeneity, missing modalities, and learning with intrinsic noise on the gastrointestinal disease dataset. We comprehensively validate the effectiveness of our proposed method by analyzing three different fusion scenarios. In Section 3.2, we conduct experiments on publicly available datasets to evaluate the performance of the proposed method in multi-modal fusion and its fusion efficacy in the presence of missing modalities, and we compare our method with state-of-the-art methods for handling missing modalities. The interpretability analysis of MDA is presented in Section 3.3.

3.1 The efficacy of MDA in confronting the three key challenges of multi-modal fusion

We conducted an in-depth analysis of the roles played by EUS, WLE, and imaging reports in multi-modal fusion under three challenges: modality specificity, missing modalities, and learning with intrinsic noise. Specifically, we perturbed the input and observed the performance changes in single-modality analysis, direct fusion of multiple modalities, and the utilization of the MDA. This step-by-step approach allowed us to decompose the importance of each component in the multi-modal fusion process. The results of all scenarios are shown in Table 1. The experimental settings and result analysis for the three challenges are as follows:

Modality Heterogeneity Challenge. The training of the three uni-modal models was conducted independently, and the training and testing sets were sourced from different centers. The experimental results show that the direct concatenation-based fusion of multi-modal features achieved higher accuracy than any individual uni-modal testing accuracy. This indicates the effectiveness of multi-modal fusion in improving disease diagnosis accuracy on this dataset. Finally, by employing the MDA, which adaptively learns the weight relationships between modalities, the fusion capability of the multi-modal model was enhanced, resulting in a significant improvement in diagnostic accuracy compared to direct concatenation (acc: 91.2% < 98.9%). This demonstrates the effectiveness of the proposed model in addressing modality heterogeneity.

Missing Modality Challenge. To examine the robustness of the proposed model against missing modalities, we randomly discarded the data features of one modality by setting the input data to one. The results showed that the accuracy of direct concatenation fusion was lower than the accuracy of concatenation fusion using all modalities (acc: 88.1% < 91.2%), indicating that a missing modality indeed degrades the performance of multi-modal fusion, even when the missingness is partial. However, when using MDA, even with only two modalities involved, the accuracy under attentional adjustment far exceeds that obtained with concatenation (acc: 95.4% > 88.1%). This finding suggests that MDA can counteract most of the negative effects of missing modalities on feature fusion by shifting attention, as evidenced by the fact that its accuracy is only about 3.5 percentage points lower than the fusion accuracy using all modalities (acc: 95.4% < 98.9%).
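The perturbation described above can be sketched as follows. This is an illustrative reading, not the authors' code: the constant fill value of one follows the text, while applying it to short feature lists, the modality names, and the helper `drop_one_modality` are assumptions.

```python
import random

MODALITIES = ("eus", "wle", "report")


def drop_one_modality(features, rng, fill_value=1.0):
    """Return a copy with one randomly chosen modality replaced by a constant."""
    dropped = rng.choice(MODALITIES)
    perturbed = {name: list(vec) for name, vec in features.items()}
    perturbed[dropped] = [fill_value] * len(features[dropped])
    return perturbed, dropped


rng = random.Random(0)
features = {  # hypothetical per-modality feature vectors
    "eus": [0.3, -1.2, 0.8],
    "wle": [0.5, 0.1, -0.4],
    "report": [1.1, 0.0, 0.7],
}
perturbed, dropped = drop_one_modality(features, rng)
```

The original `features` dictionary is left untouched, so the same sample can be evaluated both with and without the missing modality.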

Learning with Intrinsic Noise Challenge. Intrinsic noise refers to information in a report that contradicts the information conveyed by other modalities. For example, it could occur when the report describes the echogenicity of a tumor in an EUS image as "hyperechoic" while the image itself reveals that the echogenicity is "heterogeneous." To validate the effectiveness of the proposed method in handling intrinsic noise, we corrupted the report by replacing the attribute "tumor origin level" with another random label of the samples. The uni-modal testing was performed with this corruption applied to all test data. The results showed that the attribute "origin level" significantly impacted the diagnostic accuracy of the model, as the accuracy decreased from 87.4% to 80.5%. After applying the concatenation-based multi-modal fusion, the accuracy improved to 89.3%. Although the image information helped mitigate the impact of the erroneous modality, the accuracy was still lower than that achieved with correct modalities (acc: 89.3% < 91.2%). Upon incorporating the MDA, which establishes the correspondence between the report and image modalities, the model gained the ability to identify misleading information. As a result, the diagnostic accuracy was essentially restored to the level achieved with the correct modalities (acc: 97.9% vs. 98.9%).
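One possible way to inject this kind of intrinsic noise is sketched below. It assumes, purely for illustration, that each report has been parsed into attribute-value pairs; the attribute names, the example values, and the helper `corrupt_attribute` are hypothetical and not taken from the paper's data.

```python
import random


def corrupt_attribute(reports, attribute, rng):
    """Replace `attribute` in every report with a different observed value."""
    observed = sorted({report[attribute] for report in reports})
    noisy = []
    for report in reports:
        # Candidates are the values seen in other samples, excluding the true one.
        alternatives = [v for v in observed if v != report[attribute]]
        corrupted = dict(report)
        corrupted[attribute] = rng.choice(alternatives)
        noisy.append(corrupted)
    return noisy


reports = [  # hypothetical structured reports
    {"origin_layer": "second layer", "echogenicity": "hypoechoic"},
    {"origin_layer": "third layer", "echogenicity": "hyperechoic"},
    {"origin_layer": "fourth layer", "echogenicity": "hypoechoic"},
]
noisy_reports = corrupt_attribute(reports, "origin_layer", random.Random(0))
```

Only the chosen attribute changes; every other field of the report is kept, so the corrupted text contradicts the images in exactly one respect.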

Table 1: Multi-modal fusion results under perturbed inputs.

Setting	Type	Modality	Fusion Method	Accuracy (%)
Pre-Trained	Uni-modal	EUS	-	85.3
	Uni-modal	WLE	-	81.7
	Uni-modal	Report	-	87.4
Full Modalities	Multi-modal	EUS+WLE+Report	concat	91.2
	Multi-modal	EUS+WLE+Report	FusionM4Net [28]	69.4
	Multi-modal	EUS+WLE+Report	SCT Fusion [29]	82.8
	Multi-modal	EUS+WLE+Report	concat+MDA	98.9
Missing Modalities	Multi-modal	Randomly missing 1 modal	concat	88.1
	Multi-modal	Fixed missing 1 modal∗	MMD [30]	88.4
	Multi-modal	Randomly missing 1 modal	concat+MDA	95.4
Learning with Intrinsic Noise	Uni-modal	Report	-	80.5
	Multi-modal	EUS+WLE+Report	concat	89.3
	Multi-modal	EUS+WLE+Report	concat+MDA	97.9

∗ Following the method described in the original paper [30], we simulated missing modalities by applying dropout to three different modalities. During training, the full set of modalities was used, while during testing, only the modality with dropout was missing. The final result was obtained by averaging the predictions from the three models.

We compared recent multi-modal fusion methods on the gastrointestinal dataset under the full-modality setting. The results are shown in Table 1. When using the FusionM4Net method [28], we replaced the meta-data in the second stage with the report. We performed an eight-class classification while keeping the other configurations the same as stated in the paper. For the SCT Fusion method [29], we computed the cross-entropy loss between the classification results and the ground truth of the eight-class labels without modifying the other configurations. The results demonstrate that the proposed method significantly outperforms recent multi-modal approaches in terms of fusion performance across multiple modalities (acc: 98.9% > 82.8% > 69.4%).

3.2 Experiments in missing modalities

Because missing modalities have been studied extensively, we conducted an extended investigation into the performance of MDA in this setting. We conducted experiments on the Multi-modal IMDb (MM-IMDb) [31] and the Audiovision-MNIST (avMNIST) [32] datasets. We discard a certain percentage of the data when training the model on these datasets. Specifically, during training, we constructed datasets in which 20%, 50%, and 70% of the samples have full modalities, with the missing modalities chosen at random. During the evaluation and testing phases, we compared the accuracy of single-modal models, direct multi-modal fusion, and the proposed method. Additionally, we conducted experiments using 100% of the images and n% of the audio during training and tested with both images and audio to verify the proficiency of the proposed model in multi-modal fusion when certain modalities are absent, and we compared this result with two recent methods, SMIL [33] and ShaSpec [34].
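The construction of a training set with a given multi-modal rate could look like the sketch below. It is one plausible reading of the protocol rather than the authors' procedure: the rounding rule, the modality names, and the helper `assign_missing` are assumptions.

```python
import random


def assign_missing(n_samples, full_rate, modalities, rng):
    """For each sample, return the modality to drop, or None to keep it complete."""
    n_full = round(n_samples * full_rate)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    full = set(indices[:n_full])  # samples that keep all modalities
    return [None if i in full else rng.choice(modalities) for i in range(n_samples)]


rng = random.Random(0)
modalities = ("image", "audio")  # avMNIST-style pair, named here for illustration
plan = assign_missing(10, 0.7, modalities, rng)
```

With a 70% rate and ten samples, seven entries of `plan` are `None` (complete samples) and the remaining three name the modality to remove.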

Table 2: Comparison of classification accuracy (%) under missing modalities (by setting different available audio rates) on the avMNIST dataset. All models are trained for 30 epochs. 'M-m Concat' refers to multi-modal fusion using concatenation, while 'Ours-m' represents our proposed fusion method that handles missing modalities during testing. 'Ours-f' denotes testing with full modalities, while only having n% of the audio modality available in the training data.

Multi-Modal Rate	Uni-Image	Uni-Audio	M-m Concat	Ours-m	SMIL [33]	ShaSpec [34]	Ours-f
70%	87.3	82.5	87.9	95.8	96.4	97.3	98.1
50%	87.3	82.5	84.3	86.3	95.6	96.0	96.2
20%	87.3	82.5	83.6	86.7	94.4	95.3	95.9

Table 3: Comparison of classification performance under missing modalities (by setting different multi-modal rates) on the MM-IMDb dataset. Performance is evaluated using F1-samples scores after training for 5 epochs.

Multi-Modal Rate	Uni-Image	Uni-Text	M-m Concat	SMIL [33]∗	Ours-m
70%	0.361	0.398	0.637	-	0.701
50%	0.361	0.398	0.533	-	0.578
20%	0.361	0.398	0.496	0.541	0.546

∗ Result for the case where only text data is missing; our approach involves randomly missing modalities.
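For readers unfamiliar with the metric in Table 3, the sketch below computes a sample-averaged F1 score for multi-label predictions. It uses the standard definition (per-sample F1, then the mean over samples); the genre labels are made-up examples, and scoring a sample with empty true and predicted sets as 0 is an assumption of this sketch.

```python
def f1_samples(y_true, y_pred):
    """Mean of per-sample F1 scores for multi-label predictions given as sets."""
    scores = []
    for true, pred in zip(y_true, y_pred):
        overlap = len(true & pred)
        denom = len(true) + len(pred)
        # F1 = 2 * |true ∩ pred| / (|true| + |pred|); defined as 0 if both are empty.
        scores.append(2 * overlap / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical movie-genre labels for three samples.
y_true = [{"Drama", "Romance"}, {"Comedy"}, {"Action", "Thriller"}]
y_pred = [{"Drama"}, {"Comedy"}, {"Drama"}]
score = f1_samples(y_true, y_pred)
```

Here the three per-sample scores are 2/3, 1, and 0, so `score` is their mean.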

When training on the MM-IMDb dataset, we follow the same setup as for avMNIST and set the rates for the text and image modalities to 20%, 50%, and 70%. The dropped modalities are randomly selected, resulting in a training dataset with missing modalities. During the evaluation and testing phases, as with the avMNIST dataset, we compare the performance of single-modal models, direct multi-modal fusion, and the proposed method.

Table 2 and Table 3 present the experimental results on the avMNIST and MM-IMDb datasets. The results show that the proposed method may yield lower performance than single-modality results when one modality is missing. This could be attributed to the limitation of our method in generating information from the missing modality, as it relies solely on strengthening the relationships between modalities. Moreover, a significant amount of single-modality information is trained solely on its own modality, which prevents the model from learning an adequate number of dependencies between modalities. However, in cases where the proportion of missing modalities is not substantial, our method outperforms the conventional fusion approach (acc: 95.8% > 87.9% in Table 2 and F1: 0.701 > 0.637 in Table 3). Furthermore, our method demonstrates increased resilience against the decline in model accuracy caused by a significant number of missing modalities. Specifically, when only 20% of the modalities are complete, our accuracy degrades significantly less than that of conventional fusion methods. Similar results were observed on the MM-IMDb dataset (Table 3), indicating that the proposed method is more adept at learning complementary information and removing redundant information from the complex relationships among multiple modalities to enhance the effectiveness of multi-modal fusion. Furthermore, the method demonstrates improved accuracy and stability of the network even in the presence of missing modalities. The results of Ours-f in Table 2 represent the performance achieved when testing with both audio and image modalities, while only having n% of the audio modality available in the training data. We compare these results with SMIL and ShaSpec. Our proposed method outperforms the others across the audio-missing settings, especially when 70% of the audio data is missing.
This is because our proposed method can effectively learn the interrelationships between different modalities, thereby achieving optimal multi-modal fusion performance on the test set.

3.3 Interpretability analysis

In current medical research, attention visualization methods such as Class Activation Mapping (CAM) and its variants have been widely applied for interpretability analysis. These methods highlight regions of interest in medical images, helping healthcare professionals and researchers understand model decision-making in diagnosis and disease classification. However, they have limitations at both the macroscopic and microscopic levels. Macroscopically, they cannot analyze in depth how model attention varies across modalities and disease categories. Such analysis is crucial for clinical relevance and reliability, because it allows the model's attention to be checked against existing clinical knowledge. Microscopically, the CAM method fails to analyze attention variations at specific feature points and their impact on diagnostic outcomes. Different anatomical structures and pathological features play varying roles in diagnosis, making it important to examine attention variations in the absence of modalities or under intrinsic noise. To address this, we leverage adaptive attention maps across modalities to comprehensively analyze the model's performance in multi-modal fusion, modality absence, and handling intrinsic noise at both the macroscopic level of disease categories and the microscopic level of feature points.

Fig. 3 illustrates the average modal attention and standard deviation (SD) of the proposed method for different disease categories. Firstly, the model exhibits notable specificity in modal attention for different disease categories. It is important to note that no constraints were imposed during model training regarding which modalities should be attended to for recognizing different diseases. We attribute this specificity to the proposed modality weight adaptation module working in conjunction with multi-disease classification. Secondly, the SD was employed to measure the deviation between data points, and the results (Fig. 3(a)) indicate low SD values for the average modal attention across all disease categories. This suggests that the model's specificity in attending to different modalities for different diseases is stable and underscores the reliability of the predictions. Furthermore, we analyzed the alignment between the model's specificity in modal attention for different disease categories and clinical priors. According to a study on gastrointestinal diseases by Jacobson et al. [19], using WLE alone to diagnose lipomas achieves a specificity of 99% in clinical practice. Similar findings hold true for NET. Conversely, GIST and EP rely more heavily on EUS. The proposed model's MDA is highly consistent with this clinical expertise (Fig. 3(b)), providing interpretability of modal attention for each disease category. The variation of modal-domain attention for different diseases throughout the training iterations is illustrated in the Appendix.
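The per-disease statistics discussed above amount to grouping per-sample modality weights by disease and computing a mean and SD for each group. The sketch below shows this with Python's standard library; the sample weights are invented for illustration, and the use of the population SD (`pstdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def summarize_attention(records):
    """Group per-sample modality weights by disease and return (mean, SD)."""
    grouped = {}
    for disease, weights in records:
        for modality, weight in weights.items():
            grouped.setdefault((disease, modality), []).append(weight)
    return {key: (mean(vals), pstdev(vals)) for key, vals in grouped.items()}


# Hypothetical per-sample attention weights (not taken from the paper).
records = [
    ("Lipoma", {"EUS": 0.10, "WLE": 0.10, "Report": 0.80}),
    ("Lipoma", {"EUS": 0.05, "WLE": 0.05, "Report": 0.90}),
    ("GS", {"EUS": 0.40, "WLE": 0.30, "Report": 0.30}),
]
summary = summarize_attention(records)
```

A low SD for a (disease, modality) pair indicates that the model assigns a similar weight to that modality across samples of the same disease.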

Table 4: Variations in adaptive weights for different modalities in the absence of certain modalities and learning with intrinsic noise.

Disease	Modal	Baseline	Missing EUS	Missing WLE	Missing Report
GIST	EUS	0.08 $\pm$ 0.05	0.11 $\pm$ 0.06	0.06 $\pm$ 0.06	0.27 $\pm$ 0.12
	WLE	0.21 $\pm$ 0.11	0.39 $\pm$ 0.14	0.05 $\pm$ 0.06	0.40 $\pm$ 0.15
	Bericht	0.72 $\pm$ 0.13	0.49 $\pm$ 0.13	0.90 $\pm$ 0.10	0.33 $\pm$ 0.13
GLM	EUS	0.17 $\pm$ 0.10	0.08 $\pm$ 0.05	0.31 $\pm$ 0.17	0.37 $\pm$ 0.16
	WLE	0.26 $\pm$ 0.12	0.40 $\pm$ 0.13	0.12 $\pm$ 0.05	0.46 $\pm$ 0.15
	Bericht	0.56 $\pm$ 0.14	0.52 $\pm$ 0.11	0.56 $\pm$ 0.14	0.17 $\pm$ 0.09
NET	EUS	0.33 $\pm$ 0.10	0.11 $\pm$ 0.04	0.26 $\pm$ 0.10	0.71 $\pm$ 0.13
	WLE	0.10 $\pm$ 0.07	0.35 $\pm$ 0.11	0.10 $\pm$ 0.05	0.16 $\pm$ 0.11
	Bericht	0.57 $\pm$ 0.10	0.54 $\pm$ 0.10	0.64 $\pm$ 0.10	0.13 $\pm$ 0.07
EP	EUS	0.16 $\pm$ 0.08	0.07 $\pm$ 0.07	0.29 $\pm$ 0.12	0.42 $\pm$ 0.15
	WLE	0.38 $\pm$ 0.17	0.34 $\pm$ 0.17	0.17 $\pm$ 0.07	0.49 $\pm$ 0.16
	Report	0.46 $\pm$ 0.16	0.59 $\pm$ 0.16	0.54 $\pm$ 0.11	0.09 $\pm$ 0.06
Lipomas	EUS	0.10 $\pm$ 0.06	0.02 $\pm$ 0.01	0.02 $\pm$ 0.02	0.61 $\pm$ 0.14
	WLE	0.04 $\pm$ 0.03	0.09 $\pm$ 0.07	0.01 $\pm$ 0.00	0.15 $\pm$ 0.11
	Report	0.86 $\pm$ 0.07	0.89 $\pm$ 0.07	0.97 $\pm$ 0.02	0.24 $\pm$ 0.11
GS	EUS	0.35 $\pm$ 0.13	0.13 $\pm$ 0.08	0.33 $\pm$ 0.12	0.77 $\pm$ 0.09
	WLE	0.27 $\pm$ 0.13	0.33 $\pm$ 0.13	0.27 $\pm$ 0.06	0.12 $\pm$ 0.08
	Report	0.39 $\pm$ 0.13	0.54 $\pm$ 0.13	0.40 $\pm$ 0.14	0.11 $\pm$ 0.06
PC	EUS	0.18 $\pm$ 0.08	0.07 $\pm$ 0.03	0.12 $\pm$ 0.06	0.62 $\pm$ 0.12
	WLE	0.15 $\pm$ 0.07	0.33 $\pm$ 0.13	0.06 $\pm$ 0.05	0.25 $\pm$ 0.09
	Report	0.67 $\pm$ 0.12	0.59 $\pm$ 0.12	0.82 $\pm$ 0.05	0.13 $\pm$ 0.12
IFP	EUS	0.14 $\pm$ 0.08	0.04 $\pm$ 0.04	0.09 $\pm$ 0.07	0.57 $\pm$ 0.16
	WLE	0.13 $\pm$ 0.10	0.19 $\pm$ 0.10	0.02 $\pm$ 0.02	0.31 $\pm$ 0.15
	Report	0.72 $\pm$ 0.13	0.77 $\pm$ 0.10	0.89 $\pm$ 0.08	0.12 $\pm$ 0.06

Table 4 reports the attention weights of the three modalities before and after introducing missing modalities and intrinsic noise into the network. The model is highly sensitive to these changes, as shown by the rapid and substantial shifts in attention weights across modalities when a modality is missing. When EUS is missing, a significant decrease in EUS weights is first observed (0.34 to 0.05, 0.53 to 0.11); an analogous decrease in WLE weights is observed when WLE is missing. When WLE is missing, attention weights shift almost entirely towards the report modality. This is expected: the surface information of the tumor, reflected in WLE, cannot be found in EUS, so the model cannot learn the relevant knowledge from the EUS modality alone. The report modality, however, contains descriptions of WLE and can serve as a substitute for the missing image modality, so the weights assigned to it increase. Similarly, when EUS is missing, the emphasis shifts more towards the report modality than towards the WLE modality. When the report modality is missing, the weight it loses is redistributed approximately evenly between WLE and EUS to compensate for its absence. This redistribution helps preserve the accuracy and comprehensiveness of the diagnosis even without the report modality.
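The redistribution described here can be checked by subtracting the baseline weight from the weight in each missing-modality scenario. The sketch below does this for three diseases, using mean values copied from Table 4 with the SDs omitted. The dictionary layout and the helper name `weight_shift` are illustrative assumptions.

```python
# Mean attention weights copied from Table 4 (SDs omitted).
# Tuple order: (Baseline, Missing EUS, Missing WLE, Missing Report).
SCENARIOS = ("Baseline", "Missing EUS", "Missing WLE", "Missing Report")

TABLE4_MEANS = {
    "GIST": {
        "EUS": (0.08, 0.11, 0.06, 0.27),
        "WLE": (0.21, 0.39, 0.05, 0.40),
        "Report": (0.72, 0.49, 0.90, 0.33),
    },
    "NET": {
        "EUS": (0.33, 0.11, 0.26, 0.71),
        "WLE": (0.10, 0.35, 0.10, 0.16),
        "Report": (0.57, 0.54, 0.64, 0.13),
    },
    "Lipomas": {
        "EUS": (0.10, 0.02, 0.02, 0.61),
        "WLE": (0.04, 0.09, 0.01, 0.15),
        "Report": (0.86, 0.89, 0.97, 0.24),
    },
}


def weight_shift(disease, scenario):
    """Change in each modality's mean weight relative to the baseline."""
    idx = SCENARIOS.index(scenario)
    return {
        modality: round(values[idx] - values[0], 2)
        for modality, values in TABLE4_MEANS[disease].items()
    }


if __name__ == "__main__":
    for disease in TABLE4_MEANS:
        for scenario in SCENARIOS[1:]:
            print(disease, scenario, weight_shift(disease, scenario))
```

For Lipomas with WLE missing, for example, the report weight rises from 0.86 to 0.97 while the EUS and WLE weights both fall, matching the shift towards the report modality described in the text.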

4 Conclusion

This paper introduces the Modal-Domain Attention (MDA), which utilizes continuous attention mechanisms to capture interactions between multiple modalities. MDA exhibits the capability to effectively handle multi-modal information, even when dealing with missing modalities and intrinsic noise, eliminating the need for separate solutions for each scenario. To the best of our knowledge, this study is the first to investigate the handling of randomly missing modalities, and the first to explore the presence of intrinsic noise in diagnostic texts.

The interpretability of MDA in medical diagnosis is comprehensively analyzed. At the macroscopic level, we investigate the attention specificity of MDA towards different disease categories, demonstrating its alignment with clinical experience. At the microscopic level, we examine the significant changes in attention exhibited by MDA when handling missing modalities and intrinsic noise for the same sample. The experimental results provide evidence of the efficacy of MDA in improving multi-modal fusion and its robustness in the presence of missing modalities or intrinsic noise.

References

[1] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.
[2] Jan L Hedenbro, Mats Ekelund, and Peter Wetterberg. Endoscopic diagnosis of submucosal gastric lesions: the results after routine endoscopy. Surgical endoscopy, 5:20–23, 1991.
[3] Ioannis S Papanikolaou, Konstantinos Triantafyllou, Anastasia Kourikou, and Thomas Rösch. Endoscopic ultrasonography for gastric submucosal lesions. World journal of gastrointestinal endoscopy, 3(5):86, 2011.
[4] Van T Manh, Jianqiao Zhou, Xiaohong Jia, Zehui Lin, Wenwen Xu, Zihan Mei, Yijie Dong, Xin Yang, Ruobing Huang, and Dong Ni. Multi-attribute attention network for interpretable diagnosis of thyroid nodules in ultrasound images. IEEE transactions on ultrasonics, ferroelectrics, and frequency control, 69(9):2611–2620, 2022.
[5] Cui-Na Jiao, Ying-Lian Gao, Dao-Hui Ge, Junliang Shang, and Jin-Xing Liu. Multi-modal imaging genetics data fusion by deep auto-encoder and self-representation network for Alzheimer’s disease diagnosis and biomarkers extraction. Engineering Applications of Artificial Intelligence, 130:107782, 2024.
[6] Chuan-Xian Ren, Geng-Xin Xu, Dao-Qing Dai, Li Lin, Ying Sun, and Qing-Shan Liu. Cross-site prognosis prediction for nasopharyngeal carcinoma from incomplete multi-modal data. Medical Image Analysis, page 103103, 2024.
[7] Yuchan Jie, Fuqiang Zhou, Haishu Tan, Gao Wang, Xiaoqi Cheng, and Xiaosong Li. Tri-modal medical image fusion based on adaptive energy choosing scheme and sparse representation. Measurement, 204:112038, 2022.
[8] Xiaosong Li, Weijun Wan, Fuqiang Zhou, Xiaoqi Cheng, Yuchan Jie, and Haishu Tan. Medical image fusion based on sparse representation and neighbor energy activity. Biomedical Signal Processing and Control, 80:104353, 2023.
[9] Yan Mo, Xudong Kang, Puhong Duan, Bin Sun, and Shutao Li. Attribute filter based infrared and visible image fusion. Information Fusion, 75:41–54, 2021.
[10] Huafeng Li, Moyuan Yang, and Zhengtao Yu. Joint image fusion and super-resolution for enhanced visualization via semi-coupled discriminative dictionary learning and advantage embedding. Neurocomputing, 422:62–84, 2021.
[11] Amit Vishwakarma and Manas Kamal Bhuyan. Image fusion using adjustable non-subsampled shearlet transform. IEEE Transactions on Instrumentation and Measurement, 68(9):3367–3378, 2018.
[12] Shaozhuang Ye, Tuo Wang, Mingyue Ding, and Xuming Zhang. F-darts: Foveated differentiable architecture search based multimodal medical image fusion. IEEE Transactions on Medical Imaging, 2023.
[13] Gucheng Zhang, Rencan Nie, Jinde Cao, Luping Chen, and Ya Zhu. Fdgnet: A pair feature difference guided network for multimodal medical image fusion. Biomedical Signal Processing and Control, 81:104545, 2023.
[14] Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Shuang Xu, Zudi Lin, Radu Timofte, and Luc Van Gool. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5906–5916, 2023.
[15] Xi Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, and Feng Wu. Multi-modality cross attention network for image and sentence matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10941–10950, 2020.
[16] Jin Zhang, Xiaohai He, Yan Liu, Qingyan Cai, Honggang Chen, and Linbo Qing. Multi-modal cross-attention network for Alzheimer’s disease diagnosis with multi-modality data. Computers in Biology and Medicine, 162:107050, 2023.
[17] Sanaz Karimijafarbigloo, Reza Azad, Amirhossein Kazerouni, Saeed Ebadollahi, and Dorit Merhof. Mmcformer: Missing modality compensation transformer for brain tumor segmentation. In Medical Imaging with Deep Learning, pages 1144–1162. PMLR, 2024.
[18] Anmol Sharma and Ghassan Hamarneh. Missing MRI pulse sequence synthesis using multi-modal generative adversarial network. IEEE transactions on medical imaging, 39(4):1170–1183, 2019.
[19] Brian C Jacobson, Amit Bhatt, Katarina B Greer, Linda S Lee, Walter G Park, Bryan G Sauer, and Vanessa M Shami. ACG clinical guideline: diagnosis and management of gastrointestinal subepithelial lesions. Official journal of the American College of Gastroenterology, 118(1):46–58, 2023.
[20] Pietro Fusaroli, Mohamad Eloubeidi, Claudio Calvanese, Christoph Dietrich, Christian Jenssen, Adrian Saftoiu, Claudio De Angelis, Shyam Varadarajulu, Bertrand Napoleon, Andrea Lisotti, et al. Quality of reporting in endoscopic ultrasound: Results of an international multicenter survey (the quoreus study). Endoscopy International Open, 9(07):E1171–E1177, 2021.
[21] Daniel T Huff, Amy J Weisman, and Robert Jeraj. Interpretation and visualization techniques for deep learning models in medical imaging. Physics in Medicine & Biology, 66(4):04TR01, 2021.
[22] Zohaib Salahuddin, Henry C Woodruff, Avishek Chatterjee, and Philippe Lambin. Transparency of deep neural networks for medical image analysis: A review of interpretability methods. Computers in biology and medicine, 140:105111, 2022.
[23] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
[24] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-CAM: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 24–25, 2020.
[25] Ting-Wei Wu, Jia-Hong Huang, Joseph Lin, and Marcel Worring. Expert-defined keywords improve interpretability of retinal image captioning. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1859–1868, 2023.
[26] Chengcheng Liu, Mengyun Qiao, Fei Jiang, Yi Guo, Zhendong Jin, and Yuanyuan Wang. Tn-usma net: Triple normalization-based gastrointestinal stromal tumors classification on multicenter eus images with ultrasound-specific pretraining and meta attention. Medical Physics, 48(11):7199–7214, 2021.
[27] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[28] Peng Tang, Xintong Yan, Yang Nan, Shao Xiang, Sebastian Krammer, and Tobias Lasser. Fusionm4net: A multi-stage multi-modal learning algorithm for multi-label skin lesion classification. Medical Image Analysis, 76:102307, 2022.
[29] David Sebastian Hoffmann, Kai Norman Clasen, and Begüm Demir. Transformer-based multi-modal learning for multi-label remote sensing image classification. In IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, pages 4891–4894. IEEE, 2023.
[30] Can Cui, Han Liu, Quan Liu, Ruining Deng, Zuhayr Asad, Yaohong Wang, Shilin Zhao, Haichun Yang, Bennett A Landman, and Yuankai Huo. Survival prediction of brain cancer with incomplete radiology, pathology, genomic, and demographic data. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 626–635. Springer, 2022.
[31] John Arevalo, Thamar Solorio, Manuel Montes-y Gómez, and Fabio A González. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992, 2017.
[32] Valentin Vielzeuf, Alexis Lechervy, Stéphane Pateux, and Frédéric Jurie. Centralnet: a multilayer approach for multimodal fusion. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.
[33] Mengmeng Ma, Jian Ren, Long Zhao, Sergey Tulyakov, Cathy Wu, and Xi Peng. Smil: Multimodal learning with severely missing modality. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 2302–2310, 2021.
[34] Hu Wang, Yuanhong Chen, Congbo Ma, Jodie Avery, Louise Hull, and Gustavo Carneiro. Multi-modal learning with missing modality via shared-specific feature modelling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15878–15887, 2023.

Appendix A Supplemental material

A.1 Variation of modal-domain attention weights over training for each disease

Fig. 4 shows how the attention weights of the different modalities change for each disease during training after applying MDA. The results indicate strong specificity among diseases, manifested in their attention to different modalities, the time needed for training to stabilize, and the variations during training. It is important to emphasize that these findings reflect the response of a single model to the multi-modal data of different diseases, rather than the results of multiple models trained separately on different disease datasets. Regarding the attention to different modalities, common diseases such as GIST, PC, and Lipomas exhibit clear and strong attention to EUS or WLE, which aligns well with clinical experience. Regarding training stability, we consider training stable once the general trend between modalities no longer changes. For example, for GS, although the attention fluctuates between EUS and WLE during training, the overall trend shows consistently higher attention to EUS. Regarding the changes during training, the attention curves of almost all diseases cross during training, and the recognition accuracy for the disease oscillates at the location of each crossing. This suggests that incorrect attention weights may cause a drop in accuracy, and it indirectly reveals how MDA corrects its attention to the modalities of each disease through training.
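The crossings and the stability criterion described here can be located programmatically from logged attention curves. The following is a minimal sketch, assuming one mean attention value per modality is recorded at each training epoch for a given disease. The function names `crossing_epochs` and `stable_from` are hypothetical, and treating the last crossing as the start of the stable phase is one possible reading of the criterion, not the authors' stated procedure.

```python
import numpy as np


def crossing_epochs(curve_a, curve_b):
    """Epoch indices at which two attention curves swap order.

    An index i is reported when the sign of (a - b) at epoch i differs
    from the sign at epoch i - 1. Exact ties (a == b) are not counted.
    """
    diff = np.asarray(curve_a, dtype=float) - np.asarray(curve_b, dtype=float)
    sign = np.sign(diff)
    return [i + 1 for i in range(len(sign) - 1) if sign[i] * sign[i + 1] < 0]


def stable_from(curve_a, curve_b):
    """First epoch after which the ordering of the two curves never flips.

    Returns 0 when the curves never cross (assumed reading of "stable").
    """
    crossings = crossing_epochs(curve_a, curve_b)
    return crossings[-1] if crossings else 0
```

Comparing the returned epoch indices with the per-epoch accuracy curve of the same disease is one way to check whether accuracy oscillations coincide with attention crossings.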