Background: Consistent empirical evidence in the literature points to ample heterogeneity in the clinical evolution of patients with COVID-19. Identifying specific phenotypes underlying the patient population might contribute to a better understanding and characterization of the different courses of the disease. The aim of this study was to identify distinct clinical phenotypes among hospitalized patients with SARS-CoV-2 pneumonia using machine learning clustering, and to study their association with subsequent clinical outcomes, namely severity and mortality.
Methods: Multicenter, observational, prospective, longitudinal cohort study conducted in four hospitals in Spain. We included adult patients admitted to hospital for SARS-CoV-2 pneumonia. We collected a broad spectrum of variables to describe each case exhaustively: patient demographics, comorbidities, symptoms, physiological status, and baseline examinations (blood tests, arterial blood gas test), among others. For the development and internal validation of the clustering/phenotype models, the dataset was split into training and test sets (50% each). We proposed a sequence of machine learning stages: feature scaling, missing data imputation, dimensionality reduction via Kernel Principal Component Analysis (KPCA), and clustering with the k-means algorithm. The optimal cluster model parameters, including k, the number of phenotypes, were chosen automatically by maximizing the average Silhouette score on the training set.
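The sequence of stages described in the Methods can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the synthetic data, the RBF kernel, the median imputation strategy, the candidate grids for k and gamma, and the helper name `fit_phenotype_model` are all assumptions made for the example. Silhouette scores computed in embeddings with different kernel widths are only roughly comparable, so the joint search over gamma is itself a simplification.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import KernelPCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_phenotype_model(X_train, k_values=(2, 3, 4, 5), gammas=(0.01, 0.1),
                        n_components=3, seed=0):
    """Hypothetical helper: pick k and the kernel width by mean Silhouette score."""
    best = None
    for gamma in gammas:
        embed = make_pipeline(
            StandardScaler(),                  # scaling; NaNs are ignored when fitting
            SimpleImputer(strategy="median"),  # assumed imputation strategy
            KernelPCA(n_components=n_components, kernel="rbf", gamma=gamma),
        )
        Z = embed.fit_transform(X_train)
        for k in k_values:
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Z)
            score = silhouette_score(Z, km.labels_)  # average over training samples
            if best is None or score > best["score"]:
                best = {"score": score, "k": k, "gamma": gamma,
                        "embed": embed, "kmeans": km}
    return best


# Synthetic stand-in for the clinical table (NOT the study data):
# three Gaussian groups in four features, with 5% of values missing.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [6, 6, 0, 0], [0, 6, 6, 6]], dtype=float)
X = np.vstack([c + rng.normal(size=(100, 4)) for c in centers])
X[rng.random(X.shape) < 0.05] = np.nan

# 50/50 split into training and test sets, as in the Methods.
idx = rng.permutation(len(X))
X_train, X_test = X[idx[:150]], X[idx[150:]]

model = fit_phenotype_model(X_train)
# Assign held-out cases to the phenotypes learned on the training set.
test_labels = model["kmeans"].predict(model["embed"].transform(X_test))
print("selected k:", model["k"], "gamma:", model["gamma"])
```

Keeping the preprocessing and KPCA steps in one fitted pipeline means that test cases are scaled, imputed, and projected with parameters learned on the training set only, which is what makes the held-out half usable for internal validation.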
Results: We enrolled 1548 patients, each characterized by 92 clinical attributes (d=109 features after variable encoding). Our clustering algorithm identified k=3 distinct phenotypes and 18 strongly informative variables (values reported below are within-cluster medians): phenotype A (788 cases [50.9% prevalence]; age 57, Charlson comorbidity index 1, pneumonia CURB-65 score 0 to 1, respiratory rate at admission 18 min⁻¹, FiO₂ 21%, C-reactive protein [CRP] 49.5 mg/dL); phenotype B (620 cases [40.0%]; age 75, Charlson 5, CURB-65 1 to 2, respiratory rate 20 min⁻¹, FiO₂ 21%, CRP 101.5 mg/dL); and phenotype C (140 cases [9.0%]; age 71, Charlson 4, CURB-65 0 to 2, respiratory rate 30 min⁻¹, FiO₂ 38%, CRP 152.3 mg/dL). Hypothesis testing provided solid statistical evidence of an association between phenotype and each clinical outcome: severity and mortality. The corresponding odds ratios showed a clear trend: unfavourable evolution was more frequent in phenotype C than in B, and more frequent in phenotype B than in A.
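The pairwise odds-ratio comparison between phenotypes can be illustrated with a short sketch. The cluster sizes are those reported above, but the abstract does not give per-phenotype outcome counts, so the death counts below are hypothetical placeholders chosen only to make the example run. The function name `odds_ratio` and the Woolf (log-normal) confidence interval are likewise assumptions, not necessarily the method used in the study.

```python
import math


def odds_ratio(events_a, total_a, events_b, total_b, z=1.96):
    """Odds ratio of group a versus group b with a Woolf (log-normal) 95% CI."""
    a, b = events_a, total_a - events_a   # group a: events, non-events
    c, d = events_b, total_b - events_b   # group b: events, non-events
    if min(a, b, c, d) <= 0:
        raise ValueError("every cell of the 2x2 table must be positive")
    estimate = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper


# Cluster sizes reported in the Results.
SIZES = {"A": 788, "B": 620, "C": 140}
# HYPOTHETICAL death counts, NOT taken from the study.
HYPOTHETICAL_DEATHS = {"A": 20, "B": 90, "C": 40}

or_b_vs_a = odds_ratio(HYPOTHETICAL_DEATHS["B"], SIZES["B"],
                       HYPOTHETICAL_DEATHS["A"], SIZES["A"])
or_c_vs_b = odds_ratio(HYPOTHETICAL_DEATHS["C"], SIZES["C"],
                       HYPOTHETICAL_DEATHS["B"], SIZES["B"])
print("B vs A: OR=%.2f (95%% CI %.2f to %.2f)" % or_b_vs_a)
print("C vs B: OR=%.2f (95%% CI %.2f to %.2f)" % or_c_vs_b)
```

An odds ratio above 1 with a confidence interval that excludes 1 would indicate a higher frequency of the outcome in the first group of each pair, which is the kind of ordering (C worse than B, B worse than A) reported in the Results.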
Conclusion: A compound unsupervised clustering technique, including fully automated optimization of its internal parameters, revealed three distinct groups of patients (phenotypes). These phenotypes showed strong associations with the clinical severity of pneumonia progression and with mortality.
Keywords: COVID-19; Clustering; Phenotypes; SARS-CoV-2 pneumonia; Unsupervised machine learning.
© 2024. The Author(s).