Multi-modal Learning with Missing Modality in Predicting Axillary Lymph Node Metastasis
This study was partially supported by the National Natural Science Foundation of China (Grant no. 92270108) and the Zhejiang Province Natural Science Foundation of China (Grant no. XHD23F0201).
Abstract
Multi-modal learning has attracted widespread attention in medical image analysis. Using multi-modal data, namely whole slide images (WSIs) and clinical information, can improve the performance of deep learning models in the diagnosis of axillary lymph node metastasis. However, clinical information is not easy to collect in clinical practice due to privacy concerns, limited resources, lack of interoperability, etc. Although patient selection can ensure that the training set has multi-modal data for model development, the clinical information modality can still be missing during testing. This normally leads to performance degradation, which limits the use of multi-modal models in the clinic. To alleviate this problem, we propose a bidirectional distillation framework consisting of a multi-modal branch and a single-modal branch. The single-modal branch acquires the complete multi-modal knowledge from the multi-modal branch, while the multi-modal branch learns the robust features of WSIs from the single-modal branch. We conduct experiments on a public dataset of lymph node metastasis in early breast cancer to validate the method. Our approach not only achieves state-of-the-art performance with an AUC of 0.861 on the test set without missing data, but also yields an AUC of 0.842 when the rate of missing modality is 80%. This shows the effectiveness of the approach in dealing with multi-modal data and missing modality. Such a model has the potential to improve treatment decision-making for early breast cancer patients based on their axillary lymph node metastatic status.
Index Terms:
Missing modality, Whole slide image, Clinical data.
I Introduction
Breast cancer has become one of the deadliest cancers for women worldwide. The prediction of axillary lymph node metastasis (ALNM) can guide treatment and is therefore crucial to improving the survival rate of early breast cancer patients. Previous works [1, 2, 3] have been devoted to the prediction of lymph node metastasis (LNM). Li et al. [4] combine histopathological images and tabular clinical data, including age, gender and tumor location, to improve the performance of ALNM prediction. Besides, multi-modal learning [5, 6, 7] has also achieved remarkable results in other medical fields. A multi-modal Transformer [8] is introduced for the survival prediction of nasopharyngeal carcinoma patients. Hong et al. [9] combine clinical features and histological images to predict molecular subtypes and mutation status. Generally, existing research efforts mainly focus on how to fuse multi-modal data effectively.
Studies have been designed to tackle the missing modality problem. Transfer learning [11, 12, 13, 14] is effective in handling the fully missing modality problem. However, previous methods ignore the practical situation in which one of the modalities is often partly missing, which poses a different challenge from the fully missing modality [15]. Ma et al. [16] consider various missing situations and leverage a generative model to produce missing text during training and testing. Nevertheless, generative models require a large number of training pairs. Histology and molecular markers have been combined for the classification of diffuse glioma through multi-task learning [17]. However, the gaps in method performance attributed to distinct modalities cannot be avoided. To the best of our knowledge, there have been no efforts on multi-modal learning with both partly and fully missing modality for pathology images and clinical data.
Making full use of comprehensive information such as histopathological images and clinical data can effectively improve the performance of deep learning models [18, 19, 20]. However, clinical data are not always available in the real world due to privacy concerns or limited resources, especially at test time. Therefore, the question remains whether fusing multi-modal information during training helps even when the task is single-modal or has a partly missing modality at test time. As shown in Fig. 1, a missing modality at the testing phase can seriously affect model performance. When clinical data are severely missing, the performance of the model [10] that learns from multi-modal data even becomes worse than that of the model learning from single-modal data. Thus, the problem to be solved is how to learn a multi-modal model from a complete training dataset such that it is robust to a fully or partly missing modality during testing.
To take full advantage of the clinical data in the training set and to flexibly handle the various missing patterns (partly missing and fully missing) during testing, we propose a bidirectional distillation (BD) framework, as shown in Fig. 2. Our contributions can be summarized as follows:
- We propose a BD framework consisting of a single-modal branch and a multi-modal branch, which can flexibly handle modality-complete or modality-incomplete inputs in a unified manner by turning the single-modal branch off or on when testing.
- In order to transfer the knowledge of clinical information to the single-modal branch, we introduce a learnable prompt during the distillation from the multi-modal branch to the single-modal branch.
- The learning of complicated fused features may lead to overfitting in the feature learning of WSIs [21], which is verified in the experiments. To tackle this challenge, we leverage the distillation from the single-modal branch to the multi-modal branch to extract robust features of WSIs in the multi-modal branch.
- We additionally investigate the case in which the WSI modality is missing. The experimental results demonstrate the strong performance of our method regardless of which modality is missing.
II Methodology
II-A Problem Formulation.
In this paper, we consider the case in which clinical data are missing at test time. The dataset is divided into a training set and a test set: $D = \{D_{train}, D_{test}\}$. We consider the training set $D_{train} = \{(x^{w}_{i}, x^{c}_{i}, y_{i})\}_{i=1}^{N}$ as a modality-complete dataset, where $x^{w}_{i}$ and $x^{c}_{i}$ represent two different modalities (whole slide image (WSI) and clinical information) of the $i$-th sample, $y_{i}$ is the corresponding label and $N$ is the total number of samples in the training set. The test set $D_{test}$ is a modality-incomplete dataset: some of its samples do not contain the clinical information. In this paper, we aim to make full use of the multi-modal information in the training set to improve model performance and to flexibly deal with the modality-missing problem in the test set.
II-B Multi-modal Branch Learning.
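Specifically, a WSI is divided into small patches, which are fed into an encoder. In the multi-modal branch, the deep patch features $h_{1}, \dots, h_{K}$ from the encoder are aggregated into the fused feature $z^{m}$ by simple attention [22]:
$$z^{m} = \sum_{k=1}^{K} a_{k} h_{k}, \qquad a = g(h_{1}, \dots, h_{K}) \tag{1}$$
where $g$ is a non-linear projection function whose parameters are learnable. The output of $g$ is a 1D vector $a$ of length $K$, the number of patches. The $k$-th element $a_{k}$ corresponds to the patch feature $h_{k}$ during summation. We combine $z^{m}$ and the mapped feature $f^{c}$ of the clinical data to calculate the final classification loss $\mathcal{L}^{m}_{cls}$:
$$\mathcal{L}^{m}_{cls} = \ell\big(\phi^{m}(z^{m} \oplus f^{c}),\, y\big) \tag{2}$$
where $\ell$ denotes the classification loss with respect to the label $y$, $\phi^{m}$ is the final classifier in the multi-modal branch and $\oplus$ is a feature splicing operation.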
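The aggregation and splicing steps described above can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the tanh scoring layer, the softmax normalization of the attention scores, and the toy weights are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def attention_pool(
    patch_feats: Sequence[Sequence[float]],
    w_hidden: Sequence[Sequence[float]],
    w_out: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Aggregate K patch features into one slide-level feature.

    Assumed scoring layer: score_k = w_out . tanh(w_hidden @ h_k).
    """
    scores = []
    for h in patch_feats:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in w_hidden]
        scores.append(sum(w * u for w, u in zip(w_out, hidden)))

    # Softmax over the K patches (assumed normalization), shifted for stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the patch features: one attention weight per patch.
    dim = len(patch_feats[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, patch_feats)) for j in range(dim)]
    return pooled, weights


def splice(wsi_feat: Sequence[float], clinical_feat: Sequence[float]) -> List[float]:
    """Feature splicing, taken here to mean concatenation."""
    return list(wsi_feat) + list(clinical_feat)


# Toy example: three patches with 2-D features and a 3-D clinical feature.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(
    patches, w_hidden=[[0.5, -0.5], [0.2, 0.3]], w_out=[1.0, 1.0]
)
fused = splice(pooled, [0.3, 0.7, 0.1])
```

The fused vector would then be passed to the final classifier of the multi-modal branch.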
To prevent the feature learning of WSIs from being affected by clinical information, we transfer knowledge from the single-modal branch to the multi-modal branch. We define the intermediate outputs of the deep layers as $\{F^{m}_{l}\}_{l=1}^{L}$ in the multi-modal branch and $\{F^{s}_{l}\}_{l=1}^{L}$ in the single-modal branch, where $L$ is the number of deep feature layers in the network. In this paper, we choose the WSI features from the final layer, $F^{m}_{L}$ and $F^{s}_{L}$ (the outputs of the attention modules in Fig. 2), which exhibit the most robust semantics of WSIs. The knowledge distillation for the WSI features can be represented as follows:
$$\mathcal{L}^{m}_{kd} = d\big(P^{m}(F^{m}_{L}),\, P^{s}(F^{s}_{L})\big) \tag{3}$$
where $d(\cdot,\cdot)$ is a distance function that measures the gap between the features of the single-modal and multi-modal branches. We choose the mean square error as the distance function. $P^{m}$ and $P^{s}$ are the projection modules that transfer the intermediate output features to the target representation. The total loss function for the learning of the multi-modal branch is
$$\mathcal{L}^{m} = \mathcal{L}^{m}_{cls} + \lambda\, \mathcal{L}^{m}_{kd} \tag{4}$$
where $\lambda$ is a hyper-parameter that weighs the two terms. We utilize the classification loss $\mathcal{L}^{m}_{cls}$ and the distillation loss $\mathcal{L}^{m}_{kd}$ to update the multi-modal branch simultaneously.
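As a rough illustration of how the feature distillation term and the weighted total loss of the multi-modal branch fit together, consider the sketch below. It assumes the mean square error as the distance, as stated in the text; the function names, the identity projections, and the weight value 0.25 are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def mse(u: Vector, v: Vector) -> float:
    """Mean square error between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def multimodal_branch_loss(
    cls_loss: float,
    wsi_feat_multi: Vector,
    wsi_feat_single: Vector,
    project_multi: Callable[[Vector], Vector],
    project_single: Callable[[Vector], Vector],
    kd_weight: float,
) -> Tuple[float, float]:
    """Return (total loss, distillation loss) for the multi-modal branch."""
    # Distance between the projected WSI features of the two branches.
    kd_loss = mse(project_multi(wsi_feat_multi), project_single(wsi_feat_single))
    # Classification loss plus the weighted distillation loss.
    return cls_loss + kd_weight * kd_loss, kd_loss


def identity(x: Vector) -> List[float]:
    """Stand-in for a learned projection module (hypothetical)."""
    return list(x)


total, kd = multimodal_branch_loss(
    cls_loss=0.5,
    wsi_feat_multi=[1.0, 2.0],
    wsi_feat_single=[1.0, 4.0],
    project_multi=identity,
    project_single=identity,
    kd_weight=0.25,  # arbitrary illustrative value, not a setting from the paper
)
```

With these toy inputs the distillation term is (0 + 4) / 2 = 2.0, so the total is 0.5 + 0.25 × 2.0 = 1.0.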
II-C Single-modal Branch Learning.
Following the patch fusion steps described in Section II-B, we convert the deep patch features to the fused WSI feature $z^{s}$. We employ a learnable prompt $p$ [23] to signal to the single-modal branch that a modality is missing and to memorize the missing information of the clinical data. We map $p$ to a feature $f^{p}$ by a non-linear function. The dimension of $f^{p}$ is the same as that of the mapped clinical feature in the multi-modal branch. We then combine the WSI feature $z^{s}$ and the prompt feature $f^{p}$. Afterward, the knowledge of clinical data is transferred from the multi-modal branch to the single-modal branch based on the distillation loss:
(5) | ||||
We also apply the mean square error as the feature distance. The KL divergence [24] is applied to the predicted confidence produced by the two final classifiers of the single-modal branch and the multi-modal branch, respectively. The loss function is only used for the learning of the prompt, as shown in Fig. 2.
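The KL term on the predicted confidence can be illustrated with a small, self-contained sketch. The direction of the divergence (multi-modal prediction as the reference distribution, single-modal prediction as the approximation) and the example logits are assumptions for illustration; the text does not fix them.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert classifier logits into predicted confidences."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i), with eps guarding against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical logits from the two final classifiers for one sample.
multi_conf = softmax([2.0, 0.5])   # multi-modal branch (assumed reference)
single_conf = softmax([1.0, 1.0])  # single-modal branch (assumed approximation)
kd_confidence_loss = kl_divergence(multi_conf, single_conf)
```

The loss is zero when the two branches predict identical confidences and grows as the single-modal prediction drifts away from the multi-modal one.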
Similarly, there is also a classification loss for the learning of WSIs in the single-modal branch. Consequently, the total loss function for the training of a single-modal branch is presented as follows:
(6) |
The loss function is used to update the single-modal branch while the multi-modal branch is frozen. During testing, the BD framework can handle modality-complete or modality-incomplete inputs in a unified manner by turning the single-modal branch off or on.
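The test-time behaviour described above amounts to a simple routing rule, sketched below. The function name and the two stub branches are hypothetical placeholders for the trained networks; only the routing logic is the point of the example.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def bd_predict(
    wsi_feat: Vector,
    clinical_feat: Optional[Vector],
    multi_branch: Callable[[Vector, Vector], float],
    single_branch: Callable[[Vector], float],
) -> float:
    """Route one test sample to the branch that matches its available modalities."""
    if clinical_feat is None:
        # Clinical data missing: turn on the single-modal branch, which
        # relies on its learned prompt in place of the clinical feature.
        return single_branch(wsi_feat)
    # Both modalities present: the single-modal branch stays off.
    return multi_branch(wsi_feat, clinical_feat)


# Stub branches returning fixed probabilities (for illustration only).
def stub_multi(wsi: Vector, clinical: Vector) -> float:
    return 0.9


def stub_single(wsi: Vector) -> float:
    return 0.6


complete = bd_predict([0.1, 0.2], [55.0, 2.1], stub_multi, stub_single)
incomplete = bd_predict([0.1, 0.2], None, stub_multi, stub_single)
```

Because the decision is made per sample, the same routine covers complete, partly missing, and fully missing test sets.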
III Experiments and Results
III-A Dataset and Experimental Settings.
The experimental dataset is from a grand challenge named Early Breast Cancer Core-Needle Biopsy WSI (BCNB) [10]. The dataset provides paired multi-modal data containing WSIs and clinical information. All WSIs are hematoxylin and eosin stained, and the clinical data consist of age, tumor size, ER, PR and HER2. We use this information to predict the metastatic status of axillary lymph nodes. Since it is a binary classification task, we use the area under the curve (AUC) and the F1-score (F1) as metrics to validate the proposed method. F1 represents the averaged result in the prediction of metastatic status.
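For readers who want to reproduce the two metrics, the following sketch computes AUC from pairwise comparisons and an F1-score averaged over the two classes. Reading "averaged" as a macro-average over the classes is an assumption, and the labels and scores are made-up toy values.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_for_class(labels: Sequence[int], preds: Sequence[int], cls: int) -> float:
    """F1 = 2TP / (2TP + FP + FN) with `cls` treated as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(labels: Sequence[int], preds: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1-scores (assumed averaging scheme)."""
    classes = sorted(set(labels))
    return sum(f1_for_class(labels, preds, c) for c in classes) / len(classes)


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_preds = [0, 1, 1, 1]
auc = roc_auc(toy_labels, toy_scores)
f1 = macro_f1(toy_labels, toy_preds)
```

In this toy case three of the four positive-negative pairs are ranked correctly, and the per-class F1-scores are 2/3 and 0.8.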
We randomly split the dataset into a training set (80%) and a test set (20%). A subset comprising 20% of the training set is held out for validation. We assume that the training set is complete with paired modalities, while the clinical data in the test set can be missing at a random rate. In the training process of our method, stochastic gradient descent with a momentum of 0.3, a weight decay rate of serves as the optimizer. The learning rate is initialized at . The hyper-parameters , and are set to 1.2, 0.5 and 0.6, respectively. We initialize the learnable prompt with a length of 50. Early stopping is used to avoid overfitting by monitoring the F1 scores in the training set. The code is implemented in Python 3 and PyTorch 1.9, and all experiments are conducted using NVIDIA A100 GPUs.
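One simple way to simulate a given missing rate on a modality-complete test set is sketched below. The dictionary layout, the function name, and the choice to drop an exact fraction of samples (rather than dropping each sample independently) are assumptions; the text only states that clinical data are missing at a given rate.

```python
import random
from typing import Any, Dict, List, Sequence


def drop_clinical(
    samples: Sequence[Dict[str, Any]], missing_rate: float, seed: int = 0
) -> List[Dict[str, Any]]:
    """Return a copy of `samples` with clinical data removed from a random subset."""
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the simulated test set reproducible
    n_missing = round(missing_rate * len(samples))
    missing_idx = set(rng.sample(range(len(samples)), n_missing))
    # Copy every sample so the original, complete test set stays untouched.
    return [
        {**s, "clinical": None} if i in missing_idx else dict(s)
        for i, s in enumerate(samples)
    ]


# Toy test set of ten samples with placeholder clinical vectors.
test_set = [{"wsi": f"slide_{i}", "clinical": [50.0 + i, 1.5]} for i in range(10)]
test_80 = drop_clinical(test_set, missing_rate=0.8)
n_dropped = sum(1 for s in test_80 if s["clinical"] is None)
```

Setting the rate to 1.0 reproduces the fully missing case, and 0.0 leaves the test set complete.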
All non-linear and linear projection modules are composed of fully connected layers and the ReLU non-linear activation function. The non-linear projection function within the attention module consists of two hidden layers with corresponding activation functions. We employ two layers with hidden sizes of 100 and 50 to map the learnable prompt to its feature. Both projection modules are fully connected layers used to map features to a dimension of 64.
Table I: AUC and F1 (%) of different methods at different missing rates of clinical data. ΔAUC and ΔF1 are the differences from Filling at the same missing rate. The best results are in bold.
| Missing rate (%) | Methods | AUC | ΔAUC | F1 | ΔF1 |
|---|---|---|---|---|---|
| 0 | image only | 82.3 | - | 72.2 | - |
| | clinical only | 71.6 | - | 62.2 | - |
| 0 | Filling | 84.1 | | 73.6 | |
| | AE | 84.1 | 0.0 | 73.6 | 0.0 |
| | Ensemble | 85.1 | 1.0 | 74.0 | 0.4 |
| | SMIL | 82.6 | -1.5 | 72.7 | -0.9 |
| | BD | **86.1** | 2.0 | **75.8** | 2.2 |
| 50 | Filling | 81.8 | | 71.5 | |
| | AE | 81.3 | -0.5 | 71.5 | 0.0 |
| | Ensemble | 83.7 | 1.9 | 73.0 | 1.5 |
| | SMIL | 82.2 | 0.4 | 73.2 | 1.7 |
| | BD | **85.0** | 3.2 | **74.1** | 2.6 |
| 80 | Filling | 79.1 | | 70.7 | |
| | AE | 79.9 | 0.8 | 70.6 | -0.1 |
| | Ensemble | 82.8 | 3.7 | 71.7 | 1.0 |
| | SMIL | 80.0 | 0.9 | 71.5 | 0.8 |
| | BD | **84.2** | 5.1 | **74.9** | 4.2 |
| 100 | Filling | 78.9 | | 68.7 | |
| | AE | 79.6 | 0.7 | 69.4 | 0.7 |
| | Ensemble | 82.3 | 3.4 | 72.2 | 3.5 |
| | SMIL | 78.8 | -0.1 | 69.7 | 1.0 |
| | BD | **82.7** | 3.8 | **72.7** | 4.0 |
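Table II: Ablation study on the distillation directions.
| Missing rate (%) | Multi-modal branch learns from single-modal branch | Single-modal branch learns from multi-modal branch | F1-score |
|---|---|---|---|
| 0 | | | 74.2 |
| | ✓ | | 75.8 |
| | | ✓ | 74.2 |
| 80 | | | 72.0 |
| | ✓ | | 72.1 |
| | | ✓ | 73.8 |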
III-B Comparison with other methods.
We compare our proposed approach with representative methods (AE [25], Ensemble [26], Filling, SMIL [16]) for dealing with the missing modality problem in multi-modal learning. The mechanisms of the first three intuitive methods are shown in Fig. 3.
- Filling fills the missing clinical data with zero vectors. The model structure is based on the model LNMP [10], and the method is the same as LNMP when the modalities are complete during testing.
- AE is designed to generate the missing deep features of clinical data automatically. This model is trained in two stages. First, we train an LNMP model with the modality-complete training set. Then, an auto-encoder is trained to generate the missing features; its input and output are the features of the WSIs and of the clinical data, respectively.
- Ensemble is a model that has two individual networks. One is the WSI recognition network, whose output is the predicted probability. The other is the classification network for clinical data. We obtain the final prediction by fusing the probabilities from the two networks. We use only the first network if there is no clinical data input.
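The Filling and Ensemble baselines described above can be summarized in a few lines. This is a hedged sketch: the function names are hypothetical, the clinical dimension of 5 is only an example, and averaging is assumed as the fusion rule for Ensemble since the text does not specify how the probabilities are fused.

```python
from typing import List, Optional, Sequence


def fill_missing(clinical_feat: Optional[Sequence[float]], dim: int) -> List[float]:
    """Filling baseline: replace missing clinical data with a zero vector."""
    if clinical_feat is None:
        return [0.0] * dim
    return list(clinical_feat)


def ensemble_predict(p_wsi: float, p_clinical: Optional[float]) -> float:
    """Ensemble baseline: fuse the probabilities of the two separate networks."""
    if p_clinical is None:
        # No clinical input: fall back to the WSI network alone.
        return p_wsi
    # Simple averaging (assumed fusion rule).
    return 0.5 * (p_wsi + p_clinical)


filled = fill_missing(None, dim=5)
kept = fill_missing([61.0, 2.3, 1.0, 0.0, 1.0], dim=5)
wsi_only = ensemble_predict(0.8, None)
fused_prob = ensemble_predict(0.8, 0.4)
```

Note that in both baselines a test sample without clinical data gains nothing from the clinical data seen during training, which is the limitation the comparison below discusses.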
Table III: Ablation study on the hyper-parameters.
| Missing rate (%) | | | F1-score | Epoch |
|---|---|---|---|---|
| 100 | 0.2 | 0.5 | 71.7 | 30 |
| | 0.4 | 0.5 | 70.3 | 26 |
| | 0.6 | 0.5 | 72.7 | 31 |
| | 0.8 | 0.5 | 72.2 | 28 |
| | 0.6 | 0.2 | 71.0 | 33 |
| | 0.6 | 0.4 | 72.3 | 31 |
| | 0.6 | 0.6 | 71.3 | 29 |
| | 0.6 | 0.8 | 70.8 | 17 |
The comparison results are shown in Table I. Our method achieves the best performance (bold) in terms of both F1 scores and AUC. Compared to the others, AE yields relatively worse performance. This method often requires a large amount of paired training data, so it is difficult for it to generate accurate clinical features for the prediction. The direct filling method meets the requirement of test flexibility but does not provide valuable information about the missing clinical data. Thus, its performance decreases greatly as the missing ratio increases and even becomes worse than that of the method with only images. Among the three intuitive approaches, the Ensemble method, which integrates two separate networks, performs best on both the complete and the incomplete modality. However, the two networks are independent, and the complete modality in the training set is underutilized. For instance, a test sample with only the WSI modality is not helped by the clinical data in the training set. Our method is inspired by this finding and further improves on the performance of Ensemble. SMIL can also be regarded as a generative model. Unlike AE, it is trained end-to-end, but the shared encoder may be perturbed by clinical data.
III-C Ablation Study.
Ablation study on distillation directions. We split the BD framework into two parts: the single-modal branch learning from the multi-modal branch, and the multi-modal branch learning from the single-modal branch. We then design an ablation study to verify the effectiveness of each part, testing in two situations with missing ratios of 0% (complete modality) and 80%. The F1-scores of the model with or without each part are presented in Table II. We regard the two independent branches without distillation as the baseline. In the complete-modality test case, the performance remains the same after adding the part in which the single-modal branch learns from the multi-modal branch, because the single-modal branch is turned off during testing. However, there is a substantial improvement after adding the part in which the multi-modal branch learns from the single-modal branch. In the case of missing modality, it is exactly the other way around: the distillation from the multi-modal branch to the single-modal branch is crucial to the performance of the model, while the distillation in the opposite direction has little effect. Thus, the former is necessary for the incomplete modality and the latter for the complete modality.
Ablation study on the initial length of the learnable prompt. We use the scenario in which 100% of the clinical data is missing for comparison. As shown in Fig. 4(c), the model performs best when the initialization length is 50 (the green line). Too long an initialization may cause the prompt to memorize redundant information about the missing modality. We believe that shorter initializations might convey less information, yet the prompt can still serve as a reminder to the model that the modality is absent, so the performance does not drop significantly.
Ablation study on and . Missing 100% clinical data is considered in this experiment. We first fix and vary to record model performance. Then we choose the best and change to various values. As shown in TABLE III, larger values of lead to better performance, illustrating the necessity of distilling WSI features from the single-modal branch. It may overwhelm the classification loss as increases, resulting in performance degradation. From the values of the saved epoch, the model converges faster when is larger.
III-D Feature Analysis between Single-modal and Multi-modal Models.
For further study, we analyze the deep features of WSIs before and after the addition of clinical information. We train a single-modal model with only WSIs and a multi-modal model with completely paired data, including WSIs and clinical data. Then, the deep WSI features from the two models are collected. We reduce the feature dimensionality with t-SNE [27] and visualize the features on a two-dimensional plane, as shown in Fig. 5. (a): The WSI features are extracted by the model trained with only WSIs. (b): The WSI features are from the intermediate output of a trained multi-modal model, in which the deep features of clinical data are expanded to 20 times their original dimension. (c): The WSI features are also collected from the multi-modal model, where the deep features of clinical data are expanded to 40 times their original dimension.
We find that the WSI features from the single-modal model are more compact and separable. There are also signs that the WSI features become worse as the feature dimension of clinical data increases. Thus, we conclude that the addition of clinical information might affect the representation learning of WSIs. Inspired by this finding, we keep the part in which the multi-modal branch learns from the single-modal branch.
III-E Further Investigation into the Absence of WSIs
To further validate the efficacy of our model, we consider the scenario in which WSIs are absent. We employ the learnable prompt to alert the model to the missing modality and to memorize the information of WSIs from the multi-modal branch. The image encoder is removed from the single-modal branch, and the prompt is directly mapped to the deep feature. Fig. 4(a) and (b) illustrate that our model far outperforms the base model and consistently outperforms the single-modal model at various missing rates. This demonstrates the effectiveness of our model no matter which modality is missing.
IV Conclusion
Combining modalities can improve the performance of deep learning models in the diagnosis of axillary lymph node metastasis. However, a modality is often missing during testing. In this paper, we propose a bidirectional distillation framework to cope flexibly with the problem of missing clinical data. Our model makes full use of the complete modality in the training set via the interaction of the two branches (single-modal and multi-modal). The experimental results show that our model achieves significant improvements at different missing rates of clinical information. Our method is model- and task-agnostic. In the future, we will further explore the effectiveness of our model in other multi-modal tasks.
References
- [1] Y. Hu, F. Su, K. Dong, X. Wang, X. Zhao, Y. Jiang, J. Li, J. Ji, and Y. Sun, “Deep learning system for lymph node quantification and metastatic cancer identification from whole-slide pathology images,” Gastric Cancer, vol. 24, pp. 868–877, 2021.
- [2] Y. Zhao, F. Yang, Y. Fang, H. Liu, N. Zhou, J. Zhang, J. Sun, S. Yang, B. Menze, X. Fan et al., “Predicting lymph node metastasis using histopathological images based on multiple instance learning with deep graph convolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4837–4846.
- [3] S. A. Harmon, T. H. Sanford, G. T. Brown, C. Yang, S. Mehralivand, J. M. Jacob, V. A. Valera, J. H. Shih, P. K. Agarwal, P. L. Choyke et al., “Multiresolution application of artificial intelligence in digital pathology for prediction of positive lymph nodes from primary tumors in bladder cancer,” JCO clinical cancer informatics, vol. 4, pp. 367–382, 2020.
- [4] H. Li, F. Yang, X. Xing, Y. Zhao, J. Zhang, Y. Liu, M. Han, J. Huang, L. Wang, and J. Yao, “Multi-modal multi-instance learning using weakly correlated histopathological images and tabular clinical information,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VIII 24. Springer, 2021, pp. 529–539.
- [5] O. Dalmaz, M. Yurt, and T. Çukur, “Resvit: residual vision transformers for multimodal medical image synthesis,” IEEE Transactions on Medical Imaging, vol. 41, no. 10, pp. 2598–2614, 2022.
- [6] S. Zhang, J. Zhang, B. Tian, T. Lukasiewicz, and Z. Xu, “Multi-modal contrastive mutual learning and pseudo-label re-learning for semi-supervised medical image segmentation,” Medical Image Analysis, vol. 83, p. 102656, 2023.
- [7] J. N. Acosta, G. J. Falcone, P. Rajpurkar, and E. J. Topol, “Multimodal biomedical ai,” Nature Medicine, vol. 28, no. 9, pp. 1773–1784, 2022.
- [8] H. Zheng, Z. Lin, Q. Zhou, X. Peng, J. Xiao, C. Zu, Z. Jiao, and Y. Wang, “Multi-transsp: Multimodal transformer for survival prediction of nasopharyngeal carcinoma patients,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII. Springer, 2022, pp. 234–243.
- [9] R. Hong, W. Liu, D. DeLair, N. Razavian, and D. Fenyö, “Predicting endometrial cancer subtypes and molecular features from histopathology images using multi-resolution deep learning models,” Cell Reports Medicine, vol. 2, no. 9, p. 100400, 2021.
- [10] F. Xu, C. Zhu, W. Tang, Y. Wang, Y. Zhang, J. Li, H. Jiang, Z. Shi, J. Liu, and M. Jin, “Predicting axillary lymph node metastasis in early breast cancer using deep learning on primary tumor biopsy slides,” Frontiers in oncology, vol. 11, p. 759007, 2021.
- [11] Z. Zheng, A. Ma, L. Zhang, and Y. Zhong, “Deep multisensor learning for missing-modality all-weather mapping,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 174, pp. 254–264, 2021.
- [12] N. C. Garcia, P. Morerio, and V. Murino, “Modality distillation with multiple stream networks for action recognition,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 103–118.
- [13] X. Xing, Z. Chen, M. Zhu, Y. Hou, Z. Gao, and Y. Yuan, “Discrepancy and gradient-guided multi-modal knowledge distillation for pathological glioma grading,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 636–646.
- [14] Y. Zhang, J. Yang, J. Tian, Z. Shi, C. Zhong, Y. Zhang, and Z. He, “Modality-aware mutual learning for multi-modal medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24. Springer, 2021, pp. 589–599.
- [15] A. Rahate, R. Walambe, S. Ramanna, and K. Kotecha, “Multimodal co-learning: challenges, applications with datasets, recent advances and future directions,” Information Fusion, vol. 81, pp. 203–239, 2022.
- [16] M. Ma, J. Ren, L. Zhao, S. Tulyakov, C. Wu, and X. Peng, “Smil: Multimodal learning with severely missing modality,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2302–2310.
- [17] X. Wang, S. Price, and C. Li, “Multi-task learning of histology and molecular markers for classifying diffuse glioma,” arXiv preprint arXiv:2303.14845, 2023.
- [18] J. Höhn, E. Krieghoff-Henning, T. B. Jutzi, C. von Kalle, J. S. Utikal, F. Meier, F. F. Gellrich, S. Hobelsberger, A. Hauschild, J. G. Schlager et al., “Combining cnn-based histologic whole slide image analysis and patient data to improve skin cancer classification,” European Journal of Cancer, vol. 149, pp. 94–101, 2021.
- [19] J. Yang, J. Ju, L. Guo, B. Ji, S. Shi, Z. Yang, S. Gao, X. Yuan, G. Tian, Y. Liang et al., “Prediction of her2-positive breast cancer recurrence and metastasis risk from histopathological images and clinical information via multimodal deep learning,” Computational and structural biotechnology journal, vol. 20, pp. 333–342, 2022.
- [20] K. Huang, B. Lin, J. Liu, Y. Liu, J. Li, G. Tian, and J. Yang, “Predicting colorectal cancer tumor mutational burden from histopathological images and clinical information using multi-modal deep learning,” Bioinformatics, vol. 38, no. 22, pp. 5108–5115, 2022.
- [21] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, “Tensor fusion network for multimodal sentiment analysis,” arXiv preprint arXiv:1707.07250, 2017.
- [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [23] X. Chen, N. Zhang, X. Xie, S. Deng, Y. Yao, C. Tan, F. Huang, L. Si, and H. Chen, “Knowprompt: Knowledge-aware prompt-tuning with synergistic optimization for relation extraction,” in Proceedings of the ACM Web Conference 2022, 2022, pp. 2778–2788.
- [24] L. Beyer, X. Zhai, A. Royer, L. Markeeva, R. Anil, and A. Kolesnikov, “Knowledge distillation: A good teacher is patient and consistent,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10925–10934.
- [25] S. H. Dumpala, I. Sheikh, R. Chakraborty, and S. K. Kopparapu, “Audio-visual fusion for sentiment classification using cross-modal autoencoder,” in 32nd conference on neural information processing systems (NIPS 2018), 2019, pp. 1–4.
- [26] S. Zhang, Z. Tang, H. Pan, X. Wei, and J. Huang, “A hierarchical framwork with improved loss for large-scale multi-modal video identification,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 2539–2542.
- [27] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. 11, 2008.