Development and Validation of Machine Learning Models for Identifying Prediabetes and Diabetes in Normoglycemia

Xiaodong Zhang; Weidong Yao; Dawei Wang; Wenqi Hu; Guang Zhang; Yongsheng Zhang

doi:10.1002/dmrr.70003

Diabetes Metab Res Rev. 2024 Nov;40(8):e70003. doi: 10.1002/dmrr.70003.

Authors

Xiaodong Zhang¹, Weidong Yao², Dawei Wang³, Wenqi Hu⁴, Guang Zhang⁴, Yongsheng Zhang⁴

Affiliations

¹ Postgraduate Department, Shandong First Medical University (Shandong Academy of Medical Sciences), Jinan, China.
² Department of Anesthesiology, Second Affiliated Hospital of Shandong University of Traditional Chinese Medicine, Jinan, China.
³ Department of Radiology, the First Affiliated Hospital of Shandong First Medical University & Shandong Provincial Qianfoshan Hospital, Jinan, China.
⁴ Department of Health Management, Shandong Engineering Research Center of Health Management, Shandong Institute of Health Management, the First Affiliated Hospital of Shandong First Medical University & Shandong Provincial Qianfoshan Hospital, Jinan, China.

PMID: 39497474
DOI: 10.1002/dmrr.70003

Abstract

Background: Prediabetes and diabetes are both states of abnormal glucose metabolism (AGM) that can lead to severe complications. Early detection of AGM is crucial for timely intervention and treatment. However, fasting blood glucose (FBG), when used as a mass population screening method, may fail to identify individuals who have AGM despite normoglycemia. This study aimed to develop and validate machine learning (ML) models to identify AGM among individuals with normoglycemia using routine health check-up indicators.

Methods: Following the American Diabetes Association (ADA) criteria, participants with normoglycemia (FBG ≤ 5.6 mmol/L) were collected from 2019 to 2023 and divided into AGM and Normal groups using a glycosylated haemoglobin (HbA1c) threshold of 5.7%. Data from 2019 to 2022 were divided into training and internal validation sets at a 7:3 ratio, while data from 2023 were used as the external validation set. Seven ML algorithms, namely logistic regression (LR), random forest (RF), support vector machine (SVM), extreme gradient boosting machine (XGBoost), multilayer perceptron (MLP), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost), were used to build models for identifying AGM in the normoglycemic population. Model performance was evaluated using the area under the receiver operating characteristic curve (auROC) and the area under the precision-recall curve (auPR). The feature contributions to the optimal model were visualised using SHapley Additive exPlanations (SHAP). Finally, an intuitive and user-friendly interactive interface was developed.
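The labelling, splitting, and evaluation steps described in the Methods can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the record fields (`year`, `fbg`, `hba1c`), the function names, and the random seed are hypothetical, and treating an HbA1c of exactly 5.7% as AGM is an assumption, since the abstract only names 5.7% as the threshold. The two metrics are implemented directly so the sketch needs only the standard library.

```python
import random

HBA1C_THRESHOLD = 5.7  # percent; threshold named in the Methods
FBG_UPPER = 5.6        # mmol/L; normoglycemia cut-off as stated in the Methods


def label_agm(hba1c):
    """Return 1 (AGM) if HbA1c >= threshold, else 0 (Normal). The >= is an assumption."""
    return 1 if hba1c >= HBA1C_THRESHOLD else 0


def split_cohort(records, train_fraction=0.7, seed=0):
    """Keep normoglycemic records; 2019-2022 -> train/internal (random 7:3), 2023 -> external."""
    eligible = [r for r in records if r["fbg"] <= FBG_UPPER]
    development = [r for r in eligible if 2019 <= r["year"] <= 2022]
    external = [r for r in eligible if r["year"] == 2023]
    rng = random.Random(seed)  # fixed seed (hypothetical) for a reproducible shuffle
    rng.shuffle(development)
    cut = round(len(development) * train_fraction)
    return development[:cut], development[cut:], external


def auroc(labels, scores):
    """auROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (no special handling of tied scores)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


# Tiny synthetic example (made-up values, not study data).
demo_labels = [label_agm(h) for h in (5.2, 5.5, 5.8, 6.1)]
demo_scores = [0.10, 0.40, 0.35, 0.80]  # hypothetical model outputs
demo_auroc = auroc(demo_labels, demo_scores)
demo_aupr = average_precision(demo_labels, demo_scores)
```

In practice a library implementation of both metrics would normally be preferred; the hand-written versions here only make explicit what auROC and auPR measure on a labelled validation set.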

Results: A total of 59,259 participants were enrolled in this study and divided into a training set of 32,810, an internal validation set of 14,060, and an external validation set of 12,389. The CatBoost model outperformed the others, with auROCs of 0.806 and 0.794 for the internal and external validation sets, respectively. Age was the most important feature influencing the performance of the CatBoost model, followed by fasting blood glucose, red blood cells, haemoglobin, body mass index, and triglyceride-glucose.
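As a quick consistency check on the reported sample sizes, the short snippet below confirms that the three sets sum to the stated total and that the training set makes up about 70% of the 2019 to 2022 data, in line with the 7:3 split. The variable names are illustrative only.

```python
# Sample sizes as reported in the Results.
train, internal, external = 32_810, 14_060, 12_389

total = train + internal + external  # all enrolled participants
development = train + internal       # 2019-2022 data before the 7:3 split
train_share = train / development    # share of development data used for training

print(total, round(train_share, 3))  # prints: 59259 0.7
```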

Conclusion: A well-performing ML model for identifying AGM in the normoglycemic population was built, offering significant potential for early intervention and treatment of AGM that would otherwise have been missed.

Keywords: diabetes; machine learning; missed diagnosis; normoglycemia; prediabetes.

Publication types

Validation Study

MeSH terms

Adult
Aged
Algorithms
Biomarkers / analysis
Biomarkers / blood
Blood Glucose* / analysis
Diabetes Mellitus / blood
Diabetes Mellitus / diagnosis
Diabetes Mellitus, Type 2 / diagnosis
Female
Follow-Up Studies
Glycated Hemoglobin / analysis
Humans
Machine Learning*
Male
Middle Aged
Prediabetic State* / diagnosis
Prognosis
ROC Curve

Substances

Blood Glucose
Biomarkers
Glycated Hemoglobin
