License: arXiv.org perpetual non-exclusive license
arXiv:2401.01553v1 [eess.IV] 03 Jan 2024

Multi-modal Learning with Missing Modality in Predicting Axillary Lymph Node Metastasis
This study was partially supported by the National Natural Science Foundation of China (Grant no. 92270108) and the Zhejiang Province Natural Science Foundation of China (Grant no. XHD23F0201).

Shichuan Zhang$^{\star\dagger}$, Sunyi Zheng$^{\dagger}$, Zhongyi Shui$^{\star\dagger}$, Honglin Li$^{\star\dagger}$, Lin Yang$^{\dagger}$
$^{\star}$ Zhejiang University, Hangzhou, Zhejiang, China
$^{\dagger}$ School of Engineering, Westlake University, Hangzhou, Zhejiang, China
Abstract

Multi-modal learning has attracted widespread attention in medical image analysis. Using multi-modal data, namely whole slide images (WSIs) and clinical information, can improve the performance of deep learning models in the diagnosis of axillary lymph node metastasis. However, clinical information is not easy to collect in clinical practice due to privacy concerns, limited resources, lack of interoperability, etc. Although patient selection can ensure that the training set contains multi-modal data for model development, the clinical modality can be missing at test time. This normally leads to performance degradation, which limits the use of multi-modal models in the clinic. To alleviate this problem, we propose a bidirectional distillation framework consisting of a multi-modal branch and a single-modal branch. The single-modal branch acquires the complete multi-modal knowledge from the multi-modal branch, while the multi-modal branch learns robust WSI features from the single-modal branch. We conduct experiments on a public dataset of Lymph Node Metastasis in Early Breast Cancer to validate the method. Our approach not only achieves state-of-the-art performance with an AUC of 0.861 on the test set without missing data, but also yields an AUC of 0.842 when the rate of missing modality is 80%. This shows the effectiveness of the approach in dealing with multi-modal data and missing modality. Such a model has the potential to improve treatment decision-making for early breast cancer patients by predicting their axillary lymph node metastatic status.

Index Terms:
Missing modality, Whole slide image, Clinical data.

I Introduction

Breast cancer has become one of the most deadly diseases for women worldwide. Predicting axillary lymph node metastasis (ALNM) can guide treatment and is therefore crucial for improving the survival rate of early breast cancer patients. Previous works [1, 2, 3] have been devoted to the prediction of lymph node metastasis (LNM). Li et al. [4] combine histopathological images and tabular clinical data, including age, gender and tumor location, to improve the performance of ALNM prediction. Besides, multi-modal learning [5, 6, 7] in other medical fields has also achieved remarkable results. A multi-modal Transformer [8] is introduced for the survival prediction of nasopharyngeal carcinoma patients. Hong et al. [9] combine clinical features and histological images to predict molecular subtypes and mutation status. Generally, existing research efforts mainly focus on how to fuse multi-modal data effectively.

Figure 1: Model performance at different missing rates of clinical information. The multi-modal model [10] outperforms the single-modal model trained on images alone when the missing rate of clinical data is low, but the situation is reversed when the missing rate is high. By using bidirectional distillation, our multi-modal method with the same backbone achieves good performance regardless of the missing rate.
Figure 2: An overview of the bidirectional distillation framework.

Several studies have been designed to tackle the missing modality problem. Transfer learning [11, 12, 13, 14] is effective in handling the fully missing modality problem. However, previous methods ignore the practical situation in which one of the modalities is only partly missing, which poses a different challenge from the fully missing case [15]. Ma et al. [16] consider various missing situations and leverage a generative model to produce missing text during training and testing. Nevertheless, generative models require a large number of training pairs. Histology and molecular markers have been combined for the classification of diffuse glioma through multi-task learning [17]. However, the performance gaps attributed to distinct modalities cannot be avoided. To the best of our knowledge, there have been no efforts on multi-modal learning with both partly and fully missing modality for pathology images and clinical data.

Making full use of comprehensive information such as histopathological images and clinical data can effectively improve the performance of deep learning models [18, 19, 20]. However, clinical data are not always available in the real world due to privacy concerns or limited resources, especially at test time. The question therefore remains whether fusing multi-modal information during training helps when the test-time input is single-modal or has a partly missing modality. As shown in Fig. 1, a missing modality in the testing phase can seriously affect model performance. The model [10] that learns from multi-modal data even performs worse than the model learning from single-modal data when clinical data are severely missing. Thus, the problem to be solved is how to learn a multi-modal model from a complete training dataset that remains robust to a fully or partly missing modality during testing.

To take full advantage of the clinical data in the training set and to flexibly handle the various missing patterns (partly missing and fully missing) during testing, we propose a bidirectional distillation (BD) framework, as shown in Fig. 2. Our contributions can be summarized as follows:

  • We propose a BD framework consisting of a single-modal branch and a multi-modal branch, which can flexibly handle modality-complete or modality-incomplete inputs in a unified manner by turning the single-modal branch off or on at test time.

  • In order to transfer the knowledge of clinical information to the single-modal branch, we introduce a learnable prompt during the distillation from the multi-modal branch to the single-modal branch.

  • Learning complicated fused features may lead to overfitting in the feature learning of WSIs [21], which is verified in our experiments. To tackle this challenge, we leverage distillation from the single-modal branch to the multi-modal branch to extract robust WSI features in the multi-modal branch.

  • We additionally study the case in which the WSI modality is missing. The experimental results demonstrate the strong performance of our method regardless of which modality is missing.

II Methodology

II-A Problem Formulation.

This paper considers the case in which clinical data are missing at test time. The dataset is divided into a training set and a test set: $\mathcal{D}=\{\mathcal{D}^{d},\mathcal{D}^{v}\}$. We consider the training set $\mathcal{D}^{d}=\{(x^{w}_{i},x^{c}_{i},y_{i})\}_{i=0}^{n-1}$ as a modality-complete dataset, where $x^{w}_{i}$ and $x^{c}_{i}$ represent the two modalities (whole slide image (WSI) and clinical information) of the $i$-th sample, $y_{i}$ is the corresponding label and $n$ is the total number of samples in the training set.
The test set $\mathcal{D}^{v}=\{(x^{w}_{0},x^{c}_{0},y_{0}),(x^{w}_{1},y_{1}),\ldots\}$ is a modality-incomplete dataset: some samples in $\mathcal{D}^{v}$ do not contain the clinical information. In this paper, we aim to make full use of the multi-modal information in the training set to improve model performance and to flexibly deal with the modality-missing problem in the test set.
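A modality-incomplete test set of this kind can be mimicked by dropping the clinical entry of a chosen fraction of test samples. The following is a minimal sketch under stated assumptions: the function name `drop_clinical`, the dictionary keys `wsi`, `clinical` and `label`, and the toy values are all hypothetical, and the paper does not describe how the missing samples are selected beyond a random rate.

```python
import random

def drop_clinical(samples, missing_rate, seed=0):
    """Return a copy of `samples` in which a fraction of clinical entries is None.

    `samples` is a list of dicts with keys 'wsi', 'clinical', 'label'
    (a hypothetical layout, not taken from the paper).
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must lie in [0, 1]")
    rng = random.Random(seed)
    n_missing = round(missing_rate * len(samples))
    missing_idx = set(rng.sample(range(len(samples)), n_missing))
    out = []
    for i, s in enumerate(samples):
        s = dict(s)  # shallow copy so the input list is left untouched
        if i in missing_idx:
            s["clinical"] = None  # the sample becomes (x^w, y) only
        out.append(s)
    return out

# Toy modality-complete set of 10 samples.
test_set = [{"wsi": f"slide_{i}", "clinical": [50 + i, 2.0], "label": i % 2}
            for i in range(10)]
incomplete = drop_clinical(test_set, missing_rate=0.8)
n_without = sum(s["clinical"] is None for s in incomplete)
```

With a missing rate of 0.8 and ten samples, eight samples lose their clinical entry while the original list stays complete.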

II-B Multi-modal Branch Learning.

Specifically, a WSI $x^{w}_{i}$ is divided into $t$ small patches, which are fed into an encoder. In the multi-modal branch, the deep features $\{f_{i,j}^{wm}\}_{j=0}^{t-1}$ from the encoder are aggregated into the fused feature $f_{i}^{wm}$ by simple attention [22]:

$$f_{i}^{wm}=\sum_{j}\{f_{i,j}^{wm}\}_{j=0}^{t-1}\cdot\mathcal{H}(\{f_{i,j}^{wm}\}_{j=0}^{t-1}), \tag{1}$$

where $\mathcal{H}(\cdot)$ is a non-linear projection function whose parameters are learnable. The output of $\mathcal{H}(\cdot)$ is a 1D vector of length $t$, whose $j$-th element is the weight of the patch feature $f_{i,j}^{wm}$ in the summation. We combine $f_{i}^{wm}$ and the mapped feature $f_{i}^{c}$ of the clinical data to calculate the final classification loss $\mathcal{L}_{mul}^{c}$:
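The attention aggregation of Eq. (1) can be illustrated with a small sketch. It is not the authors' implementation: the function name, the toy parameters and the single hidden layer are simplifications, and the softmax normalisation of the weights is an assumption borrowed from attention-based multiple instance learning, since the text only states that the projection returns a vector of length $t$.

```python
import math

def attention_pool(patch_feats, w_hidden, w_out):
    """Aggregate t patch features into one slide-level feature.

    patch_feats: list of t feature vectors, each of length d.
    w_hidden:    h x d weight matrix of one hidden layer (assumed shape).
    w_out:       length-h weight vector giving one score per patch.
    """
    scores = []
    for f in patch_feats:
        # Hidden layer with ReLU activation.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, f))) for row in w_hidden]
        scores.append(sum(w * h for w, h in zip(w_out, hidden)))
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # assumed softmax over the t patches
    d = len(patch_feats[0])
    pooled = [sum(a * f[k] for a, f in zip(weights, patch_feats)) for k in range(d)]
    return pooled, weights

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # t = 3 patches, d = 2
w_hidden = [[0.5, -0.5], [0.1, 0.3]]          # toy parameters
w_out = [1.0, 1.0]
slide_feat, attn = attention_pool(feats, w_hidden, w_out)
```

The returned `attn` holds one weight per patch and `slide_feat` is the weighted sum of the patch features, which plays the role of the fused WSI feature.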

$$\mathcal{L}_{mul}^{c}=-\sum_{i}y_{i}\log(\mathcal{G}_{mul}([f_{i}^{wm},f_{i}^{c}])), \tag{2}$$

where $\mathcal{G}_{mul}(\cdot)$ is the final classifier in the multi-modal branch and $[\cdot,\cdot]$ is a feature splicing operation.

To prevent the feature learning of WSIs from being affected by clinical information, we transfer knowledge from the single-modal branch to the multi-modal branch. We denote the intermediate outputs of the deep layers in the multi-modal branch by $\{f_{k}^{mul}\}_{k=0}^{l-1}$ and those in the single-modal branch by $\{f_{k}^{sgl}\}_{k=0}^{l-1}$, where $l$ is the number of deep feature layers in the network. In this paper, we choose the WSI features from the final layer, $f_{l-1}^{mul}$ and $f_{l-1}^{sgl}$ (the outputs of the attention module, $f_{i}^{wm}$ and $f_{i}^{ws}$ in Fig. 2), which exhibit the most robust semantics of WSIs. The knowledge distillation for the WSI features can be represented as follows:

$$\mathcal{L}_{mul}^{f}=\sum_{i}\mathcal{D}(\mathcal{M}_{mul}(f_{i}^{wm}),\mathcal{M}_{sgl}(f_{i}^{ws})), \tag{3}$$

where $\mathcal{D}$ is a distance function that measures the gap between the features of the single-modal and multi-modal branches. We choose the mean square error as the distance function. $\mathcal{M}_{mul}$ and $\mathcal{M}_{sgl}$ are the projection modules that map the intermediate output features to the target representation. The total loss function for the learning of the multi-modal branch is

$$\mathcal{L}_{mul}=\mathcal{L}_{mul}^{c}+\lambda_{m}*\mathcal{L}_{mul}^{f}, \tag{4}$$

where $\lambda_{m}$ is a hyper-parameter that weighs the two terms. We utilize the classification loss $\mathcal{L}_{mul}^{c}$ and the distillation loss $\mathcal{L}_{mul}^{f}$ to update the multi-modal branch simultaneously.
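For a single sample, the multi-modal objective amounts to a cross-entropy term plus a weighted mean square error between the two projected WSI features. The sketch below works through that arithmetic; the function names and the toy inputs are hypothetical, the projected features are passed in as plain lists rather than computed by real projection modules, and the weight 0.6 is the value reported in the experimental settings.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class for one sample."""
    return -math.log(probs[label])

def mse(u, v):
    """Mean square error between two equally long feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def multimodal_loss(probs_mul, label, proj_wm, proj_ws, lambda_m=0.6):
    """Classification loss plus lambda_m times the feature distillation loss.

    proj_wm and proj_ws stand for the projected WSI features of the
    multi-modal and single-modal branches (hypothetical inputs).
    """
    return cross_entropy(probs_mul, label) + lambda_m * mse(proj_wm, proj_ws)

loss = multimodal_loss(
    probs_mul=[0.25, 0.75],  # classifier output of the multi-modal branch
    label=1,
    proj_wm=[1.0, 2.0],
    proj_ws=[1.0, 0.0],
)
```

Here the classification term is $-\log 0.75$, the distillation term is $(0^2 + 2^2)/2 = 2$, and the total is $-\log 0.75 + 0.6 \times 2$.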

II-C Single-modal Branch Learning.

Following the patch fusion steps described in Multi-modal Branch Learning, we convert the deep features $\{f_{i,j}^{ws}\}_{j=0}^{t-1}$ to $f_{i}^{ws}$. We employ a learnable prompt [23] $x_{i}^{p}$ to signal to the single-modal branch that a modality is missing and to memorize the missing information of the clinical data $x_{i}^{c}$. We map $x_{i}^{p}$ to a feature $f_{i}^{p}$ by a non-linear function. The dimension of $f_{i}^{p}$ is the same as that of the feature $f_{i}^{c}$ in the multi-modal branch.
We then combine the WSI feature $f_{i}^{ws}$ and the prompt feature $f_{i}^{p}$. Afterward, the knowledge of clinical data is transferred from the multi-modal branch to the single-modal branch based on the distillation loss:

$$\begin{aligned}\mathcal{L}_{sgl}^{f}={}&\mathcal{D}([f_{i}^{ws},f_{i}^{p}],[f_{i}^{wm},f_{i}^{c}])+\\&KL(\mathcal{G}_{sgl}([f_{i}^{ws},f_{i}^{p}]),\mathcal{G}_{mul}([f_{i}^{wm},f_{i}^{c}])).\end{aligned} \tag{5}$$

We also apply the mean square error for $\mathcal{D}$. $KL(\cdot,\cdot)$ [24] is the KL divergence function for the predicted confidence. $\mathcal{G}_{sgl}$ and $\mathcal{G}_{mul}$ are the final classifiers of the single-modal branch and the multi-modal branch, respectively. The loss function $\mathcal{L}_{sgl}^{f}$ is only used for the learning of the prompt, as shown in Fig. 2.
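The distillation loss of the single-modal branch combines a feature-level mean square error with a KL divergence between the two classifiers' predictions. The sketch below shows one way to compute it for a single sample and rests on two assumptions that the text does not settle: the multi-modal prediction is treated as the target distribution of the KL term, and a softmax temperature `tau` is applied to the logits. All function names and inputs are hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; tau > 1 flattens the distribution."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mse(u, v):
    """Mean square error between two equally long feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def single_distill_loss(f_ws, f_p, f_wm, f_c, logits_sgl, logits_mul, tau=1.2):
    student_feat = f_ws + f_p  # list concatenation plays the role of [f^ws, f^p]
    teacher_feat = f_wm + f_c  # [f^wm, f^c] from the multi-modal branch
    p_student = softmax(logits_sgl, tau)
    p_teacher = softmax(logits_mul, tau)
    # Assumed direction: the multi-modal output is the target distribution.
    return mse(student_feat, teacher_feat) + kl_divergence(p_teacher, p_student)

same = single_distill_loss([1.0, 2.0], [0.5], [1.0, 2.0], [0.5],
                           [0.3, -0.3], [0.3, -0.3])
diff = single_distill_loss([1.0, 2.0], [0.0], [1.0, 2.0], [0.5],
                           [0.0, 0.0], [2.0, -2.0])
```

The loss vanishes when the single-modal branch reproduces both the features and the predictions of the multi-modal branch, and grows as either of them drifts apart.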

Similarly, there is also a classification loss $\mathcal{L}_{sgl}^{c}$ for the learning of WSIs in the single-modal branch. Consequently, the total loss function for the training of the single-modal branch is presented as follows:

$$\mathcal{L}_{sgl}=\mathcal{L}_{sgl}^{c}+\lambda_{s}*\mathcal{L}_{sgl}^{f} \tag{6}$$

This loss function is used to update the single-modal branch while the multi-modal branch is frozen. During testing, the BD framework can handle modality-complete or modality-incomplete inputs in a unified manner by turning the single-modal branch off or on.
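The test-time switching between branches can be pictured as a simple routing rule. The sketch below is illustrative only: `bd_predict`, the sample layout and the stand-in branches are hypothetical names, and real branches would be trained networks rather than lambdas.

```python
def bd_predict(sample, multi_branch, single_branch):
    """Pick the branch according to whether clinical data is present."""
    if sample.get("clinical") is not None:
        # Complete input: the single-modal branch is switched off.
        return multi_branch(sample["wsi"], sample["clinical"])
    # Clinical data missing: the single-modal branch (with its prompt) is switched on.
    return single_branch(sample["wsi"])

# Stand-in branches that only report which path was taken (hypothetical).
multi = lambda wsi, clinical: ("multi", wsi)
single = lambda wsi: ("single", wsi)

complete = bd_predict({"wsi": "slide_a", "clinical": [61, 2.3]}, multi, single)
incomplete = bd_predict({"wsi": "slide_b", "clinical": None}, multi, single)
```

Because the decision is made per sample, the same routine covers fully complete, partly missing and fully missing test sets.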

III Experiments and Results

Figure 3: Structures of the methods for multi-modal learning with missing modality: (a) Filling, (b) AE, (c) Ensemble.

III-A Dataset and Experimental Settings.

The experimental dataset comes from a grand challenge named Early Breast Cancer Core-Needle Biopsy WSI (BCNB) [10]. The dataset provides paired multi-modal data containing WSIs and clinical information. All WSIs are hematoxylin and eosin stained, and the clinical data consist of age, tumor size, ER, PR and HER2. We use this information to predict the metastatic status ($N_{0}$ and $N_{+}$) of axillary lymph nodes. Since this is a binary classification task, we use the Area Under Curve (AUC) and F1-score (F1) metrics to validate the proposed method. F1 represents the averaged results in the prediction of metastatic status.

We randomly split the dataset into a training set and a test set containing 80% and 20% of the data, respectively. A subset of 20% is separated from the training set for validation. We assume that the training set is complete with paired modalities, while the clinical data in the test set can be missing at a random rate. In the training process of our method, stochastic gradient descent with a momentum of 0.3 and a weight decay rate of $1\times 10^{-3}$ serves as the optimizer. The learning rate is initialized at $1\times 10^{-4}$. The hyper-parameters $\tau$, $\lambda_{s}$ and $\lambda_{m}$ are set to 1.2, 0.5 and 0.6, respectively. We initialize the learnable prompt with a length of 50. Early stopping is used to avoid overfitting by monitoring the F1 scores in the training set. The code is implemented in Python 3 with PyTorch 1.9, and all experiments are conducted using NVIDIA A100 GPUs.
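The data split can be sketched as follows. This is a minimal illustration, assuming that the validation subset is 20% of the training portion (the text leaves open whether the percentage refers to the training portion or the whole dataset); the function name, the seed and the dataset size of 1000 are hypothetical.

```python
import random

def split_indices(n, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle n sample indices and split them into train / validation / test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    test, rest = idx[:n_test], idx[n_test:]
    # Assumption: the validation share is taken from the remaining training portion.
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split_indices(1000)
```

For 1000 samples this gives 640 training, 160 validation and 200 test indices with no overlap between the three subsets.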

All non-linear and linear projection modules are composed of fully connected layers and the ReLU non-linear activation function. The function $\mathcal{H}(\cdot)$ within the attention module consists of two hidden layers with corresponding activation functions. We employ two layers with hidden sizes of 100 and 50 to map the learnable prompt $x_{i}^{p}$ to $f_{i}^{p}$. Both $\mathcal{M}_{mul}$ and $\mathcal{M}_{sgl}$ are fully connected layers used to map features to a dimension of 64.
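The mapping from the learnable prompt to the prompt feature can be sketched as a two-layer perceptron with sizes 50, 100 and 50. The snippet is only a shape-level illustration: the random initialisation, the ReLU after the last layer and all helper names are assumptions, and no training is performed.

```python
import random

def linear(x, weights, bias):
    """y = W x + b for a dense layer stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero bias (the initialisation scheme is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
prompt = [rng.gauss(0.0, 1.0) for _ in range(50)]  # learnable prompt of length 50
w1, b1 = init_layer(50, 100, rng)                  # first layer, hidden size 100
w2, b2 = init_layer(100, 50, rng)                  # second layer, hidden size 50

def map_prompt(x_p):
    """Map the prompt to the prompt feature with two ReLU layers."""
    return relu(linear(relu(linear(x_p, w1, b1)), w2, b2))

f_p = map_prompt(prompt)
```

The output has 50 entries, which in this sketch would also have to be the dimension of the mapped clinical feature so that the two can be compared during distillation.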

TABLE I: Results of our method (BD) and other methods. We regard the method Filling as a baseline and calculate the changes ($\Delta$) at various missing ratios.

| Missing rate (%) | Methods | AUC | $\Delta$ AUC | F1 | $\Delta$ F1 |
|---|---|---|---|---|---|
| 0 | image only | 82.3 | - | 72.2 | - |
| | clinical only | 71.6 | - | 62.2 | - |
| 0 | Filling | 84.1 | | 73.6 | |
| | AE | 84.1 | 0.0 | 73.6 | 0.0 |
| | Ensemble | 85.1 | 1.0 | 74.0 | 0.4 |
| | SMIL | 82.6 | -1.5 | 72.7 | -0.9 |
| | BD | 86.1 | 2.0 | 75.8 | 2.2 |
| 50 | Filling | 81.8 | | 71.5 | |
| | AE | 81.3 | -0.5 | 71.5 | 0.0 |
| | Ensemble | 83.7 | 1.9 | 73.0 | 1.5 |
| | SMIL | 82.2 | 0.4 | 73.2 | 1.7 |
| | BD | 85.0 | 3.2 | 74.1 | 2.6 |
| 80 | Filling | 79.1 | | 70.7 | |
| | AE | 79.9 | 0.8 | 70.6 | -0.1 |
| | Ensemble | 82.8 | 3.7 | 71.7 | 1.0 |
| | SMIL | 80.0 | 0.9 | 71.5 | 0.8 |
| | BD | 84.2 | 5.1 | 74.9 | 4.2 |
| 100 | Filling | 78.9 | | 68.7 | |
| | AE | 79.6 | 0.7 | 69.4 | 0.7 |
| | Ensemble | 82.3 | 3.4 | 72.2 | 3.5 |
| | SMIL | 78.8 | -0.1 | 69.7 | 1.0 |
| | BD | 82.7 | 3.8 | 72.7 | 4.0 |

TABLE II: Ablation study on the effect of the two parts $\mathcal{S}\rightarrow\mathcal{M}$ (learning from single-modal branch) and $\mathcal{M}\rightarrow\mathcal{S}$ (learning from multi-modal branch).

| Missing rate (%) | $\mathcal{S}\rightarrow\mathcal{M}$ | $\mathcal{M}\rightarrow\mathcal{S}$ | F1-score |
|---|---|---|---|
| 0 | | | 74.2 |
| | ✓ | | 75.8 |
| | | ✓ | 74.2 |
| 80 | | | 72.0 |
| | ✓ | | 72.1 |
| | | ✓ | 73.8 |

III-B Comparison with other methods.

We compare our proposed approach with representative methods (AE [25], Ensemble [26], Filling, SMIL [16]) for dealing with the missing modality problem in multi-modal learning. The mechanisms of the three intuitive methods (Filling, AE and Ensemble) are shown in Fig. 3.

  • Filling fills the missing clinical data with zero vectors. The model structure is based on the LNMP model [10], and Filling is identical to LNMP when the modalities are complete at test time.

  • AE is designed to generate the missing deep features of clinical data automatically. This model is trained in two stages. First, we train an LNMP model with the modality-complete training set. Then, an auto-encoder is trained to generate the missing features; its input and output are the features of the WSIs and of the clinical data, respectively.

  • Ensemble consists of two individual networks. One is the WSI recognition network, whose output is the predicted probability. The other is the classification network for clinical data. We obtain the final prediction by fusing the probabilities from the two networks, and we use only the first network if there is no clinical data input.
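Two of the baselines listed above are simple enough to sketch directly. The snippet is a hedged illustration rather than the compared implementations: the function names are hypothetical, and fusing the two probabilities by a weighted average is an assumption, since the description only says that the probabilities are fused.

```python
def fill_missing(clinical, dim):
    """Filling baseline: replace absent clinical data with a zero vector."""
    return list(clinical) if clinical is not None else [0.0] * dim

def ensemble_predict(p_wsi, p_clinical=None, weight=0.5):
    """Ensemble baseline: fuse the positive-class probabilities of two networks.

    When no clinical data is available, only the WSI network is used.
    The weighted average is an assumed fusion rule.
    """
    if p_clinical is None:
        return p_wsi
    return weight * p_wsi + (1.0 - weight) * p_clinical

filled = fill_missing(None, dim=5)            # five clinical fields, all zero
kept = fill_missing([61, 2.3, 1, 1, 0], dim=5)
fused = ensemble_predict(0.8, 0.4)            # both modalities present
wsi_only = ensemble_predict(0.8)              # clinical data missing
```

In both baselines the missing case carries no learned information about the clinical data, which is the gap that the distillation-based prompt is meant to close.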

TABLE III: Ablation study on $\lambda_{m}$ and $\lambda_{s}$. The last column gives the epoch at which the best model was saved during training.

| Missing rate (%) | $\lambda_{m}$ | $\lambda_{s}$ | F1-score | Epoch |
|---|---|---|---|---|
| 100 | 0.2 | 0.5 | 71.7 | 30 |
| | 0.4 | 0.5 | 70.3 | 26 |
| | 0.6 | 0.5 | 72.7 | 31 |
| | 0.8 | 0.5 | 72.2 | 28 |
| | 0.6 | 0.2 | 71.0 | 33 |
| | 0.6 | 0.4 | 72.3 | 31 |
| | 0.6 | 0.6 | 71.3 | 29 |
| | 0.6 | 0.8 | 70.8 | 17 |

Figure 4: (a)-(b) Model performance (AUC and F1 score) at different missing rates of whole slide images. The green line is the result of the baseline method Filling. (c) The ROC curve, AUC value, and the confidence interval (97.5%) under different initial lengths of the learnable prompt.

Figure 5: Distributions of WSI features: (a) WSI features of the single-modal model, (b) expanding 20 times, (c) expanding 40 times. The numbers ‘0’ and ‘1’ represent the negative and positive classes, respectively. Stars mark the class centers. WSI features from the multi-modal model are perturbed by the learning of clinical data compared with those from the single-modal model.

The comparison results are shown in Table I. Our method achieves the best performance (bold) in terms of both F1 score and AUC. Compared with the others, AE yields relatively worse performance. This method often requires a large amount of paired training data and therefore has difficulty generating accurate clinical features for the prediction. The direct filling method meets the requirement of test flexibility but does not provide valuable information about the missing clinical data. Thus, its performance decreases greatly as the missing ratio increases and even becomes worse than that of the image-only method. Among the three intuitive approaches, the Ensemble method, which integrates two separate networks, performs best on both complete and incomplete modalities. However, the two networks are independent, and the complete modality in the training set is underutilized. For instance, a test sample with only the WSI modality is not helped by the clinical data in the training set. Our method is inspired by this finding and further improves the performance over Ensemble. SMIL can also be regarded as a generative model. In contrast to AE, it is trained end-to-end, but the shared encoder may be perturbed by clinical data.
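The $\Delta$ columns of Table I are differences with respect to the Filling baseline at the same missing rate. As a small worked check, the snippet below recomputes the AUC changes at a missing rate of 80% from the values reported in the table; the helper name `delta_vs_baseline` is hypothetical.

```python
# AUC values at a missing rate of 80%, copied from Table I.
auc_80 = {"Filling": 79.1, "AE": 79.9, "Ensemble": 82.8, "SMIL": 80.0, "BD": 84.2}

def delta_vs_baseline(scores, baseline="Filling"):
    """Difference of every method to the baseline, rounded to one decimal."""
    base = scores[baseline]
    return {m: round(v - base, 1) for m, v in scores.items() if m != baseline}

deltas = delta_vs_baseline(auc_80)
```

The result reproduces the reported changes of 0.8 (AE), 3.7 (Ensemble), 0.9 (SMIL) and 5.1 (BD).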

III-C Ablation Study.

Ablation study on distillation directions. We split the BD framework into two parts: the single-modal branch learning from the multi-modal branch ($\mathcal{M}\rightarrow\mathcal{S}$) and the multi-modal branch learning from the single-modal branch ($\mathcal{S}\rightarrow\mathcal{M}$). We then design an ablation study to verify the effectiveness of each part, testing under two settings: a missing ratio of 0% (complete modality) and a missing ratio of 80%. F1 scores of the model with and without each part are presented in Table II. The baseline is the two independent branches without distillation. With complete modality, adding $\mathcal{M}\rightarrow\mathcal{S}$ leaves the performance unchanged because the single-modal branch is turned off at test time, whereas adding $\mathcal{S}\rightarrow\mathcal{M}$ brings a substantial improvement. With missing modality, the situation is reversed: $\mathcal{M}\rightarrow\mathcal{S}$ is crucial to the performance of the model, while $\mathcal{S}\rightarrow\mathcal{M}$ has little effect on it. Thus, $\mathcal{M}\rightarrow\mathcal{S}$ and $\mathcal{S}\rightarrow\mathcal{M}$ are necessary for the incomplete and the complete modality, respectively.
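The two distillation directions can be read as two optional terms added to the classification losses of the two branches. The following is a minimal sketch only: it assumes a mean-squared-error feature-matching term, and the names `mse`, `branch_losses`, `use_m_to_s`, and `use_s_to_m` are hypothetical. The exact distillation losses are those defined in the method section and may differ.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def branch_losses(cls_multi, cls_single,
                  fused_multi, fused_single,
                  wsi_multi, wsi_single,
                  lambda_m=1.0, lambda_s=1.0,
                  use_m_to_s=True, use_s_to_m=True):
    """Return (multi-modal branch loss, single-modal branch loss).

    cls_multi / cls_single are the classification losses of each branch.
    Which features are matched in each direction is an assumption of
    this sketch, not a statement of the paper's exact formulation.
    """
    loss_multi = cls_multi
    loss_single = cls_single
    if use_s_to_m:
        # S -> M: the multi-modal branch is pulled toward the WSI
        # features of the single-modal branch, weighted by lambda_m.
        loss_multi += lambda_m * mse(wsi_multi, wsi_single)
    if use_m_to_s:
        # M -> S: the single-modal branch is pulled toward the fused
        # features of the multi-modal branch, weighted by lambda_s.
        loss_single += lambda_s * mse(fused_single, fused_multi)
    return loss_multi, loss_single
```

Setting both flags to `False` reproduces the ablation baseline of two independent branches. In a real training loop, the features acting as the target in each direction would normally be excluded from gradient computation.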

Ablation study on the initial length of learnable prompt. We consider the scenario in which 100% of the clinical data is missing. As shown in Fig. 4(c), the model performs best when the initial length is 50 (the green line). An overly long prompt may memorize redundant information about the missing modality. A shorter prompt may convey less information, yet we believe it can still remind the model that the modality is absent, so the performance does not drop significantly.

Ablation study on $\lambda_{m}$ and $\lambda_{s}$. This experiment considers the case in which 100% of the clinical data is missing. We first fix $\lambda_{s}$ and vary $\lambda_{m}$ to record model performance. We then choose the best $\lambda_{m}$ and vary $\lambda_{s}$. As shown in Table III, larger values of $\lambda_{m}$ lead to better performance, illustrating the necessity of distilling WSI features from the single-modal branch. As $\lambda_{s}$ increases, however, the distillation term may overwhelm the classification loss $\mathcal{L}_{sgl}^{c}$, resulting in performance degradation. Judging from the epoch at which the best model was saved, the model converges faster when $\lambda_{s}$ is larger.
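The tuning procedure described here (fix one weight, sweep the other, then swap) is a coordinate-wise search rather than a full grid search. A minimal sketch follows, assuming a hypothetical callable `evaluate(lambda_m, lambda_s)` that trains a model and returns a validation score where higher is better. The grids and the toy scoring function are illustrative stand-ins, not values or results from the paper.

```python
def two_stage_sweep(evaluate, lambda_m_grid, lambda_s_grid, lambda_s_init):
    """Tune lambda_m with lambda_s fixed, then tune lambda_s.

    Returns (best_lambda_m, best_lambda_s, best_score).
    """
    scores = {}

    def score(m, s):
        # Cache results so that each configuration is trained only once.
        if (m, s) not in scores:
            scores[(m, s)] = evaluate(m, s)
        return scores[(m, s)]

    # Stage 1: hold lambda_s fixed and vary lambda_m.
    best_m = max(lambda_m_grid, key=lambda m: score(m, lambda_s_init))
    # Stage 2: keep the best lambda_m and vary lambda_s.
    best_s = max(lambda_s_grid, key=lambda s: score(best_m, s))
    return best_m, best_s, score(best_m, best_s)


def toy_score(lambda_m, lambda_s):
    """Toy stand-in for a training run (hypothetical, peak at (1.0, 0.5))."""
    return -(lambda_m - 1.0) ** 2 - (lambda_s - 0.5) ** 2


best_m, best_s, best = two_stage_sweep(
    toy_score, [0.1, 0.5, 1.0], [0.1, 0.5, 1.0], lambda_s_init=0.1
)
```

This procedure evaluates roughly the sum of the two grid sizes instead of their product, at the cost of possibly missing interactions between the two weights.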

III-D Feature Analysis between Single-modal and Multi-modal Models.

For further study, we analyze the deep WSI features before and after the addition of clinical information. We train a single-modal model with only WSIs and a multi-modal model with completely paired data (WSIs and clinical data), and collect the deep WSI features from both models. We reduce the feature dimensionality with t-SNE [27] and visualize the features on a two-dimensional plane, as shown in Fig. 5. In (a), the WSI features are extracted by the model trained with only WSIs. In (b), the WSI features are the intermediate output of a trained multi-modal model in which the deep features of clinical data are expanded by 20 times compared to the original dimension. In (c), the WSI features are also collected from the multi-modal model, with the deep features of clinical data expanded by 40 times.

We find that the WSI features from the single-modal model are more tightly clustered and more separable. There are also signs that the WSI features become worse as the feature dimension of the clinical data increases. We therefore conclude that the addition of clinical information might affect the representation learning of WSIs. Inspired by this finding, we retain the part in which the multi-modal branch learns from the single-modal branch ($\mathcal{S}\rightarrow\mathcal{M}$).
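The visual impression of how clustered and separable the two classes are could be complemented by a simple numeric score. The sketch below is an illustration under stated assumptions, not part of the paper's analysis: it computes the class centers (the stars in Fig. 5) and divides the distance between them by the average within-class spread. The function names are hypothetical.

```python
import math


def class_center(points):
    """Coordinate-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def mean_distance_to_center(points, center):
    """Average Euclidean distance from each point to the class center."""
    return sum(math.dist(p, center) for p in points) / len(points)


def separability_ratio(neg_points, pos_points):
    """Between-center distance divided by average within-class spread.

    Larger values indicate tighter and better separated classes.
    """
    c0 = class_center(neg_points)
    c1 = class_center(pos_points)
    between = math.dist(c0, c1)
    within = 0.5 * (mean_distance_to_center(neg_points, c0)
                    + mean_distance_to_center(pos_points, c1))
    if within == 0.0:
        return math.inf  # every point coincides with its class center
    return between / within
```

Because t-SNE does not preserve global distances, such a score is more meaningful when computed on the original deep features than on the two-dimensional embedding.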

III-E Further Investigation into the Absence of WSIs

To further validate the efficacy of our model, we consider the scenario in which WSIs are absent. We employ the learnable prompt to alert the model to the missing modality and to memorize the WSI information from the multi-modal branch. The image encoder is removed from the single-modal branch, and the prompt is mapped directly to the deep feature. Fig. 4(a) and (b) show that our model far outperforms the baseline (Filling) and consistently outperforms the single-modal model at various missing rates. This demonstrates the effectiveness of our model regardless of which modality is missing.
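Evaluating at a given missing rate requires removing one modality from a fixed fraction of the test samples. The sketch below shows one way this could be simulated, assuming that samples are dictionaries and that the samples to mask are drawn uniformly at random. The function name `mask_modality` and the dictionary keys are hypothetical, and the protocol actually used to select samples is not specified in this section.

```python
import random


def mask_modality(samples, key, missing_rate, seed=0):
    """Return copies of `samples` with `key` set to None for a fraction of them.

    The input list and its dictionaries are left unmodified.
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed for a reproducible split
    n_missing = round(missing_rate * len(samples))
    missing_idx = set(rng.sample(range(len(samples)), n_missing))
    return [
        {**s, key: None} if i in missing_idx else dict(s)
        for i, s in enumerate(samples)
    ]


# Toy test set: ten samples, each with both modalities present.
test_set = [{"wsi": f"slide_{i}", "clinical": [i, i + 1]} for i in range(10)]
masked = mask_modality(test_set, key="wsi", missing_rate=0.8)
```

Passing `key="clinical"` instead simulates the missing-clinical-data setting with the same helper.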

IV Conclusion

Combining modalities can improve the performance of deep learning models in the diagnosis of axillary lymph node metastasis. However, a modality is often missing at test time. In this paper, we propose a bidirectional distillation framework that copes flexibly with missing clinical data. Our model makes full use of the complete-modality data in the training set through the interaction of the single-modal and multi-modal branches. The experimental results show that our model yields significant improvements at different missing rates of clinical information. Our method is model- and task-agnostic, and in future work we will explore its effectiveness in other multi-modal tasks.

References

  • [1] Y. Hu, F. Su, K. Dong, X. Wang, X. Zhao, Y. Jiang, J. Li, J. Ji, and Y. Sun, “Deep learning system for lymph node quantification and metastatic cancer identification from whole-slide pathology images,” Gastric Cancer, vol. 24, pp. 868–877, 2021.
  • [2] Y. Zhao, F. Yang, Y. Fang, H. Liu, N. Zhou, J. Zhang, J. Sun, S. Yang, B. Menze, X. Fan et al., “Predicting lymph node metastasis using histopathological images based on multiple instance learning with deep graph convolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4837–4846.
  • [3] S. A. Harmon, T. H. Sanford, G. T. Brown, C. Yang, S. Mehralivand, J. M. Jacob, V. A. Valera, J. H. Shih, P. K. Agarwal, P. L. Choyke et al., “Multiresolution application of artificial intelligence in digital pathology for prediction of positive lymph nodes from primary tumors in bladder cancer,” JCO clinical cancer informatics, vol. 4, pp. 367–382, 2020.
  • [4] H. Li, F. Yang, X. Xing, Y. Zhao, J. Zhang, Y. Liu, M. Han, J. Huang, L. Wang, and J. Yao, “Multi-modal multi-instance learning using weakly correlated histopathological images and tabular clinical information,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VIII 24.   Springer, 2021, pp. 529–539.
  • [5] O. Dalmaz, M. Yurt, and T. Çukur, “Resvit: residual vision transformers for multimodal medical image synthesis,” IEEE Transactions on Medical Imaging, vol. 41, no. 10, pp. 2598–2614, 2022.
  • [6] S. Zhang, J. Zhang, B. Tian, T. Lukasiewicz, and Z. Xu, “Multi-modal contrastive mutual learning and pseudo-label re-learning for semi-supervised medical image segmentation,” Medical Image Analysis, vol. 83, p. 102656, 2023.
  • [7] J. N. Acosta, G. J. Falcone, P. Rajpurkar, and E. J. Topol, “Multimodal biomedical ai,” Nature Medicine, vol. 28, no. 9, pp. 1773–1784, 2022.
  • [8] H. Zheng, Z. Lin, Q. Zhou, X. Peng, J. Xiao, C. Zu, Z. Jiao, and Y. Wang, “Multi-transsp: Multimodal transformer for survival prediction of nasopharyngeal carcinoma patients,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII.   Springer, 2022, pp. 234–243.
  • [9] R. Hong, W. Liu, D. DeLair, N. Razavian, and D. Fenyö, “Predicting endometrial cancer subtypes and molecular features from histopathology images using multi-resolution deep learning models,” Cell Reports Medicine, vol. 2, no. 9, p. 100400, 2021.
  • [10] F. Xu, C. Zhu, W. Tang, Y. Wang, Y. Zhang, J. Li, H. Jiang, Z. Shi, J. Liu, and M. Jin, “Predicting axillary lymph node metastasis in early breast cancer using deep learning on primary tumor biopsy slides,” Frontiers in oncology, vol. 11, p. 759007, 2021.
  • [11] Z. Zheng, A. Ma, L. Zhang, and Y. Zhong, “Deep multisensor learning for missing-modality all-weather mapping,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 174, pp. 254–264, 2021.
  • [12] N. C. Garcia, P. Morerio, and V. Murino, “Modality distillation with multiple stream networks for action recognition,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 103–118.
  • [13] X. Xing, Z. Chen, M. Zhu, Y. Hou, Z. Gao, and Y. Yuan, “Discrepancy and gradient-guided multi-modal knowledge distillation for pathological glioma grading,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2022, pp. 636–646.
  • [14] Y. Zhang, J. Yang, J. Tian, Z. Shi, C. Zhong, Y. Zhang, and Z. He, “Modality-aware mutual learning for multi-modal medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24.   Springer, 2021, pp. 589–599.
  • [15] A. Rahate, R. Walambe, S. Ramanna, and K. Kotecha, “Multimodal co-learning: challenges, applications with datasets, recent advances and future directions,” Information Fusion, vol. 81, pp. 203–239, 2022.
  • [16] M. Ma, J. Ren, L. Zhao, S. Tulyakov, C. Wu, and X. Peng, “Smil: Multimodal learning with severely missing modality,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2302–2310.
  • [17] X. Wang, S. Price, and C. Li, “Multi-task learning of histology and molecular markers for classifying diffuse glioma,” arXiv preprint arXiv:2303.14845, 2023.
  • [18] J. Höhn, E. Krieghoff-Henning, T. B. Jutzi, C. von Kalle, J. S. Utikal, F. Meier, F. F. Gellrich, S. Hobelsberger, A. Hauschild, J. G. Schlager et al., “Combining cnn-based histologic whole slide image analysis and patient data to improve skin cancer classification,” European Journal of Cancer, vol. 149, pp. 94–101, 2021.
  • [19] J. Yang, J. Ju, L. Guo, B. Ji, S. Shi, Z. Yang, S. Gao, X. Yuan, G. Tian, Y. Liang et al., “Prediction of her2-positive breast cancer recurrence and metastasis risk from histopathological images and clinical information via multimodal deep learning,” Computational and structural biotechnology journal, vol. 20, pp. 333–342, 2022.
  • [20] K. Huang, B. Lin, J. Liu, Y. Liu, J. Li, G. Tian, and J. Yang, “Predicting colorectal cancer tumor mutational burden from histopathological images and clinical information using multi-modal deep learning,” Bioinformatics, vol. 38, no. 22, pp. 5108–5115, 2022.
  • [21] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, “Tensor fusion network for multimodal sentiment analysis,” arXiv preprint arXiv:1707.07250, 2017.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [23] X. Chen, N. Zhang, X. Xie, S. Deng, Y. Yao, C. Tan, F. Huang, L. Si, and H. Chen, “Knowprompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction,” in Proceedings of the ACM Web Conference 2022, 2022, pp. 2778–2788.
  • [24] L. Beyer, X. Zhai, A. Royer, L. Markeeva, R. Anil, and A. Kolesnikov, “Knowledge distillation: A good teacher is patient and consistent,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 925–10 934.
  • [25] S. H. Dumpala, I. Sheikh, R. Chakraborty, and S. K. Kopparapu, “Audio-visual fusion for sentiment classification using cross-modal autoencoder,” in 32nd conference on neural information processing systems (NIPS 2018), 2019, pp. 1–4.
  • [26] S. Zhang, Z. Tang, H. Pan, X. Wei, and J. Huang, “A hierarchical framework with improved loss for large-scale multi-modal video identification,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 2539–2542.
  • [27] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. 11, 2008.