Development and Validation of Machine Learning Models for Identifying Prediabetes and Diabetes in Normoglycemia

Diabetes Metab Res Rev. 2024 Nov;40(8):e70003. doi: 10.1002/dmrr.70003.

Abstract

Background: Prediabetes and diabetes are both states of abnormal glucose metabolism (AGM) that can lead to severe complications. Early detection of AGM is crucial for timely intervention and treatment. However, fasting blood glucose (FBG), when used as a mass population screening method, may fail to identify individuals who have AGM despite a normal FBG (normoglycemia). This study aimed to develop and validate machine learning (ML) models to identify AGM among individuals with normoglycemia using routine health check-up indicators.

Methods: Participants with normoglycemia according to the American Diabetes Association (ADA) criteria (FBG ≤ 5.6 mmol/L) were included from 2019 to 2023 and divided into AGM and Normal groups using a glycosylated haemoglobin (HbA1c) level of 5.7% as the threshold. Data from 2019 to 2022 were split into training and internal validation sets at a 7:3 ratio, while data from 2023 served as the external validation set. Seven ML algorithms were used to build models for identifying AGM in the normoglycemic population: logistic regression (LR), random forest (RF), support vector machine (SVM), extreme gradient boosting machine (XGBoost), multilayer perceptron (MLP), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost). Model performance was evaluated using the area under the receiver operating characteristic curve (auROC) and the area under the precision-recall curve (auPR). Feature contributions to the optimal model were visualised using SHapley Additive exPlanations (SHAP). Finally, an intuitive, user-friendly interactive interface was developed.
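The cohort construction and the auROC metric described in the Methods can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline. The record keys (`year`, `fbg`, `hba1c`), the function names, the random seed, and the choice to treat HbA1c at or above 5.7% as AGM are all assumptions made for illustration; the abstract only states that 5.7% was the threshold.

```python
import random

FBG_NORMAL_MAX = 5.6  # mmol/L, normoglycemia cut-off as stated in the Methods
HBA1C_AGM_MIN = 5.7   # %, HbA1c threshold; ">=" means AGM is an assumption here


def label_agm(hba1c_percent):
    """Return 1 (AGM) if HbA1c is at or above the threshold, else 0 (Normal)."""
    return 1 if hba1c_percent >= HBA1C_AGM_MIN else 0


def build_cohorts(records, seed=0, train_fraction=0.7):
    """Split normoglycemic records into training, internal and external sets.

    `records` is an iterable of dicts with the hypothetical keys
    'year', 'fbg' (mmol/L) and 'hba1c' (%).
    """
    # Keep only participants whose fasting glucose is in the normal range,
    # and attach the HbA1c-based label.
    eligible = [
        dict(r, label=label_agm(r["hba1c"]))
        for r in records
        if r["fbg"] <= FBG_NORMAL_MAX
    ]
    # 2019-2022 form the development data; 2023 is held out as external data.
    development = [r for r in eligible if 2019 <= r["year"] <= 2022]
    external = [r for r in eligible if r["year"] == 2023]

    # Random 7:3 split of the development data (seeded for reproducibility).
    rng = random.Random(seed)
    rng.shuffle(development)
    cut = round(len(development) * train_fraction)
    return development[:cut], development[cut:], external


def au_roc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    Equals the probability that a randomly chosen positive is scored above
    a randomly chosen negative, counting ties as one half. Quadratic in the
    sample size, so suitable for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice a library routine (for example a gradient boosting classifier plus a vectorised auROC implementation) would replace the pairwise loop; the sketch only makes the labelling rule, the temporal hold-out, and the metric explicit.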

Results: A total of 59,259 participants were enrolled and divided into a training set of 32,810, an internal validation set of 14,060, and an external validation set of 12,389. The CatBoost model outperformed the others, with auROCs of 0.806 and 0.794 on the internal and external validation sets, respectively. Age was the most important feature contributing to the CatBoost model's predictions, followed by fasting blood glucose, red blood cells, haemoglobin, body mass index, and the triglyceride-glucose index.
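The abstract does not say how the triglyceride-glucose feature was computed. As a hedged illustration, the sketch below uses the commonly cited definition of the triglyceride-glucose (TyG) index, ln(fasting triglycerides [mg/dL] × fasting glucose [mg/dL] / 2), together with standard unit conversions from mmol/L. The function names are hypothetical, and the study may have used a different formulation.

```python
import math

TG_MMOL_TO_MGDL = 88.57        # triglycerides: 1 mmol/L is about 88.57 mg/dL
GLUCOSE_MMOL_TO_MGDL = 18.016  # glucose: 1 mmol/L is about 18.016 mg/dL


def tyg_index(tg_mgdl, fbg_mgdl):
    """TyG index from fasting triglycerides and glucose, both in mg/dL."""
    if tg_mgdl <= 0 or fbg_mgdl <= 0:
        raise ValueError("triglycerides and glucose must be positive")
    return math.log(tg_mgdl * fbg_mgdl / 2.0)


def tyg_index_from_mmol(tg_mmol, fbg_mmol):
    """Same index for inputs in mmol/L, converted to mg/dL first."""
    return tyg_index(tg_mmol * TG_MMOL_TO_MGDL, fbg_mmol * GLUCOSE_MMOL_TO_MGDL)
```

Because the index is a function of FBG and triglycerides, it is correlated with the FBG feature listed above, which is worth keeping in mind when reading feature-importance rankings.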

Conclusion: A well-performing ML model for identifying AGM in the normoglycemic population was built, offering significant potential for early intervention and treatment of AGM cases that would otherwise have been missed.

Keywords: diabetes; machine learning; missed diagnosis; normoglycemia; prediabetes.

Publication types

  • Validation Study

MeSH terms

  • Adult
  • Aged
  • Algorithms
  • Biomarkers / analysis
  • Biomarkers / blood
  • Blood Glucose* / analysis
  • Diabetes Mellitus / blood
  • Diabetes Mellitus / diagnosis
  • Diabetes Mellitus, Type 2 / diagnosis
  • Female
  • Follow-Up Studies
  • Glycated Hemoglobin / analysis
  • Humans
  • Machine Learning*
  • Male
  • Middle Aged
  • Prediabetic State* / diagnosis
  • Prognosis
  • ROC Curve

Substances

  • Blood Glucose
  • Biomarkers
  • Glycated Hemoglobin