Limitations in Evaluating Machine Learning Models for Imbalanced Binary Outcome Classification in Spine Surgery: A Systematic Review

Marc Ghanem; Abdul Karim Ghaith; Victor Gabriel El-Hajj; Archis Bhandarkar; Andrea de Giorgio; Adrian Elmi-Terander; Mohamad Bydon

doi:10.3390/brainsci13121723

Brain Sci. 2023 Dec 16;13(12):1723. doi: 10.3390/brainsci13121723.

Authors

Marc Ghanem^{1,2,3}, Abdul Karim Ghaith^{1,2}, Victor Gabriel El-Hajj^{1,2,4}, Archis Bhandarkar^{1,2}, Andrea de Giorgio⁵, Adrian Elmi-Terander^{4,6}, Mohamad Bydon^{1,2}

Affiliations

¹ Mayo Clinic Neuro-Informatics Laboratory, Mayo Clinic, Rochester, MN 55902, USA.
² Department of Neurological Surgery, Mayo Clinic, Rochester, MN 55902, USA.
³ School of Medicine, Lebanese American University, Byblos 4504, Lebanon.
⁴ Department of Clinical Neuroscience, Karolinska Institutet, 17177 Stockholm, Sweden.
⁵ Artificial Engineering, Via del Rione Sirignano, 80121 Naples, Italy.
⁶ Department of Surgical Sciences, Uppsala University, 75236 Uppsala, Sweden.

Abstract

Clinical prediction models for spine surgery are on the rise, with increasing reliance on machine learning (ML) and deep learning (DL). Many of the predicted outcomes are uncommon; proper evaluation is therefore crucial to ensuring that the models are effective in clinical practice. This systematic review aims to identify and evaluate current research-based ML and DL models applied to spine surgery, specifically those predicting binary outcomes, with a focus on their evaluation metrics. Overall, 60 papers were included, and the findings were reported according to the PRISMA guidelines. A total of 13 papers focused on length of stay (LOS), 12 on readmissions, 12 on non-home discharge, 6 on mortality, and 5 on reoperations. The target outcomes exhibited data imbalances ranging from 0.44% to 42.4%. A total of 59 papers reported the model's area under the receiver operating characteristic curve (AUROC), 28 reported accuracy, 33 sensitivity, 29 specificity, 28 positive predictive value (PPV), 24 negative predictive value (NPV), 25 the Brier score (10 of which also provided a null-model Brier score), and 8 the F1 score. Additionally, data visualization varied among the included papers. This review discusses the use of appropriate evaluation schemes in ML and identifies several common errors and potential sources of bias in the literature. Embracing these recommendations as the field advances may facilitate the integration of reliable and effective ML models in clinical settings.
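
As an illustration of why the choice of metric matters for uncommon outcomes, the following minimal Python sketch computes the metrics named above on synthetic data. It is not taken from the review: the function names, the 0.5 threshold, and the 5% event rate are assumptions chosen for illustration, and the null-model Brier score is taken here to be the Brier score of a model that always predicts the observed event rate.

```python
from typing import Dict, Sequence


def threshold_metrics(y_true: Sequence[int], y_prob: Sequence[float],
                      threshold: float = 0.5) -> Dict[str, float]:
    """Confusion-matrix metrics at a fixed probability threshold (assumed 0.5)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def null_brier_score(y_true: Sequence[int]) -> float:
    """Brier score of a model that always predicts the observed event rate."""
    prevalence = sum(y_true) / len(y_true)
    return brier_score(y_true, [prevalence] * len(y_true))


def auroc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """AUROC as the fraction of (event, non-event) pairs ranked correctly.

    Ties count as half. Quadratic in sample size, which is fine for a sketch.
    """
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Synthetic cohort: 100 patients, 5 events (5% event rate, an assumption).
    y_true = [1] * 5 + [0] * 95
    # An uninformative model that assigns every patient the same low risk.
    y_prob = [0.1] * 100

    for name, value in threshold_metrics(y_true, y_prob).items():
        print(f"{name}: {value:.3f}")
    print(f"auroc: {auroc(y_true, y_prob):.3f}")
    print(f"brier: {brier_score(y_true, y_prob):.4f}")
    print(f"null brier: {null_brier_score(y_true):.4f}")
```

In this synthetic example the uninformative model reaches high accuracy while identifying no events, and its Brier score is no better than that of the null model, which is why a Brier score is easier to interpret when the null-model value is reported alongside it.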

Keywords: artificial intelligence; deep learning; machine learning; predictive modeling; spine surgery.

Publication types

Review

Grants and funding

This research received no external funding.