Comparison of methods for early-readmission prediction in a high-dimensional heterogeneous covariates and time-to-event outcome framework

Simon Bussy; Raphaël Veil; Vincent Looten; Anita Burgun; Stéphane Gaïffas; Agathe Guilloux; Brigitte Ranque; Anne-Sophie Jannot

doi:10.1186/s12874-019-0673-4

BMC Med Res Methodol. 2019 Mar 6;19(1):50. doi: 10.1186/s12874-019-0673-4.

Authors

Simon Bussy¹, Raphaël Veil²,³, Vincent Looten²,³, Anita Burgun²,³, Stéphane Gaïffas⁴,⁵, Agathe Guilloux⁶, Brigitte Ranque⁷,⁸, Anne-Sophie Jannot²,³

Affiliations

¹ Laboratoire de Probabilités Statistique et Modélisation (LPSM), UMR 8001, Sorbonne University, 4 Place Jussieu, Paris, 75005, France.
² Assistance Publique-Hôpitaux de Paris, Biomedical Informatics and Public Health Department, European Georges Pompidou Hospital, 20 Rue Leblanc, Paris, 75015, France.
³ INSERM UMRS 1138, Eq22, Centre de Recherche des Cordeliers, Université Paris Descartes, 15 Rue de l'École de Médecine, Paris, 75006, France.
⁴ Laboratoire de Probabilités Statistique et Modélisation (LPSM), UMR 8001, Sorbonne University, 4 Place Jussieu, Paris, 75005, France.
⁵ CMAP, UMR 7641 École Polytechnique CNRS, Route de Saclay, Palaiseau, 91128, France.
⁶ LAMME, Univ Evry, CNRS, Université Paris-Saclay, 23 boulevard de France, Evry, 91025, France.
⁷ INSERM UMRS 970, Université Paris Descartes, 56 rue Leblanc, Paris, 75015, France.
⁸ Assistance Publique-Hôpitaux de Paris, Internal Medicine Department, Georges Pompidou European Hospital, 20 Rue Leblanc, Paris, 75015, France.

Abstract

Background: Choosing the best-performing method in terms of outcome prediction or variable selection is a recurring problem in prognosis studies, leading to many publications on method comparison. However, some aspects have received little attention. First, most comparison studies treat prediction performance and variable selection separately. Second, methods are compared either within a binary outcome setting (where we want to predict whether readmission will occur within an arbitrarily chosen delay) or within a survival analysis setting (where the outcomes are the censored times themselves), but not both. In this paper, we propose a comparison methodology to weigh up these two settings in terms of both prediction and variable selection, while incorporating advanced machine learning strategies.

Methods: Using a high-dimensional case study on a sickle-cell disease (SCD) cohort, we compare 8 statistical methods. In the binary outcome setting, we consider logistic regression (LR), support vector machine (SVM), random forest (RF), gradient boosting (GB) and neural network (NN); in the survival analysis setting, we consider the Cox Proportional Hazards (PH), the CURE and the C-mix models. We also propose a method using Gaussian Processes to extract meaningful structured covariates from longitudinal data.
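As an illustration of the Gaussian Process idea mentioned in the Methods, the sketch below summarises one patient's irregularly sampled measurements as a fixed-length vector by evaluating a GP posterior mean on a common time grid. This is only one possible reading and not necessarily the authors' procedure: the RBF kernel, the fixed hyperparameters, the function names and the example values are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale, variance):
    """Squared-exponential covariance between two 1-D arrays of time points."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_features(times, values, grid, length_scale=5.0, variance=1.0, noise=0.1):
    """GP posterior mean of one patient's series, evaluated on a shared grid.

    Hyperparameters are fixed by hand here (an assumption); in practice they
    would usually be estimated, e.g. by maximising the marginal likelihood.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    grid = np.asarray(grid, dtype=float)
    offset = values.mean()  # constant prior mean, taken as the empirical mean
    K = rbf_kernel(times, times, length_scale, variance)
    K += noise ** 2 * np.eye(times.size)  # observation noise on the diagonal
    alpha = np.linalg.solve(K, values - offset)
    return offset + rbf_kernel(grid, times, length_scale, variance) @ alpha

# Hypothetical measurements (days since admission, lab value) for one patient.
times = [0.0, 2.0, 3.0, 7.0, 12.0]
values = [8.1, 7.9, 8.4, 9.0, 8.2]
grid = np.linspace(0.0, 12.0, 5)  # same grid for every patient
features = gp_features(times, values, grid)
```

Because every patient is evaluated on the same grid, the resulting vectors have equal length and can be stacked into a covariate matrix alongside the static covariates.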

Results: Among all assessed statistical methods, the survival analysis ones obtain the best results. In particular, the C-mix model yields the best performance in both settings (AUC = 0.94 in the binary outcome setting), as well as interesting interpretation aspects. There is some consistency in the selected covariates across methods within a setting, but not much across the two settings.

Conclusions: It appears that learning within the survival analysis setting first (thus using all the temporal information), and then going back to a binary prediction using the survival estimates, gives significantly better prediction performance than that obtained by models trained "directly" within the binary outcome setting.
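The step of "going back to a binary prediction using the survival estimates" can be sketched as follows, assuming a proportional-hazards model for concreteness: the score for readmission within a delay is 1 - S(delay | x), which can then be evaluated with the AUC. The baseline cumulative hazard value, the linear predictors and the labels below are hypothetical numbers for illustration, not results from the study, and the function names are not from the paper.

```python
import numpy as np

def early_event_probability(linear_predictor, baseline_cum_hazard_at_delay):
    """P(T <= delay | x) = 1 - S(delay | x) under a proportional-hazards model,
    where S(delay | x) = exp(-H0(delay) * exp(x . beta))."""
    lp = np.asarray(linear_predictor, dtype=float)
    survival_at_delay = np.exp(-baseline_cum_hazard_at_delay * np.exp(lp))
    return 1.0 - survival_at_delay

def auc(scores, labels):
    """Probability that a random positive outranks a random negative
    (ties count for one half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Hypothetical fitted quantities: x . beta per patient and H0 at the delay.
lp = np.array([-1.0, 0.0, 0.5, 2.0])
labels = np.array([0, 0, 1, 1])  # 1 = readmitted within the delay
probs = early_event_probability(lp, baseline_cum_hazard_at_delay=0.1)
score = auc(probs, labels)
```

Note that patients censored before the delay have no well-defined binary label, so they would need to be handled separately before computing the AUC.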

Keywords: High-dimensional prediction; Hospital readmission risk; Machine learning methods; Sickle-cell disease; Survival analysis.

Publication types

Comparative Study

MeSH terms

Anemia, Sickle Cell / diagnosis*
Anemia, Sickle Cell / therapy*
Cohort Studies
Humans
Logistic Models
Machine Learning
Multivariate Analysis
Neural Networks, Computer
Outcome Assessment, Health Care / methods
Outcome Assessment, Health Care / statistics & numerical data*
Patient Readmission / statistics & numerical data*
Prognosis
Proportional Hazards Models
Reproducibility of Results
Support Vector Machine
Survival Analysis