Supervised Pretraining through Contrastive Categorical Positive Samplings to Improve COVID-19 Mortality Prediction

Tingyi Wanyan; Mingquan Lin; Eyal Klang; Kartikeya M Menon; Faris F Gulamali; Ariful Azad; Yiye Zhang; Ying Ding; Zhangyang Wang; Fei Wang; Benjamin Glicksberg; Yifan Peng

doi:10.1145/3535508.3545541

ACM BCB. 2022 Aug:2022:9. doi: 10.1145/3535508.3545541. Epub 2022 Aug 7.

Affiliations

¹ Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.
² Icahn School of Medicine at Mount Sinai, New York, NY, USA.
³ Intelligent Systems Engineering, Indiana University, Bloomington, Bloomington, IN, USA.
⁴ School of Information, University of Texas at Austin, Austin, TX, USA.
⁵ Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA.

Abstract

Clinical EHR data are naturally heterogeneous and contain abundant sub-phenotypes. This diversity leads to high intra-class variance, which makes outcome prediction with machine learning models challenging. To address this issue, we propose a supervised pre-training model with a unique embedded k-nearest-neighbor positive sampling strategy. We demonstrate the value of this framework theoretically and show that it yields highly competitive experimental results in predicting patient mortality in real-world COVID-19 EHR data from over 7,000 patients admitted to a large, urban health system. Our method achieves an AUROC of 0.872, outperforming the alternative pre-training models and traditional machine learning methods. Additionally, our method performs much better when the training data size is small (345 training instances).
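
The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one plausible reading: for each patient, positives are restricted to the k nearest neighbors in embedding space that share the same outcome label, and a softmax-style contrastive loss pulls each patient toward those positives. The function names `knn_positive_indices` and `contrastive_loss`, the Euclidean distance, the cosine similarity, and the `temperature` value are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def knn_positive_indices(embeddings, labels, k):
    """For each sample, return indices of up to k nearest same-label samples.

    Hypothetical helper: distance metric (squared Euclidean) is an assumption.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances, shape (n, n).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # Mask out the sample itself and every sample with a different label.
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    dist = np.where(same, dist, np.inf)
    positives = []
    for i in range(len(labels)):
        order = np.argsort(dist[i], kind="stable")[:k]
        # Drop masked entries when a class has fewer than k other members.
        positives.append(order[np.isfinite(dist[i, order])])
    return positives

def contrastive_loss(embeddings, positives, temperature=0.1):
    """Mean negative log-probability of the sampled positives.

    Assumes non-zero embeddings; similarity is cosine / temperature.
    """
    z = np.asarray(embeddings, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own candidate
    # Numerically stable log-softmax over all other samples in the batch.
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    losses = [-log_prob[i, pos].mean()
              for i, pos in enumerate(positives) if len(pos) > 0]
    return float(np.mean(losses))

# Toy batch: label 1 contains two separated sub-groups (a stand-in for
# sub-phenotypes); label 0 has one sample near each sub-group.
emb = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0],
                [0.1, 1.0], [1.0, 0.3], [0.3, 1.0]])
lab = np.array([1, 1, 1, 1, 0, 0])
pos = knn_positive_indices(emb, lab, k=1)
print([p.tolist() for p in pos])
print(contrastive_loss(emb, pos))
```

With k=1 in this toy batch, each label-1 sample is paired only with the neighbor from its own sub-group rather than with every same-label sample. This illustrates how local positive sampling could avoid forcing distinct sub-phenotypes together.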

Keywords: Intra-class variance; Pre-training; Self-supervised learning; Sub-phenotype; Supervised contrastive learning; Mortality prediction.

Grants and funding

R00 LM013001/LM/NLM NIH HHS/United States