Aims: This study aimed to develop a machine learning-based prediction model for gestational diabetes mellitus (GDM) in early pregnancy in Chinese women.
Materials and methods: We used an established population-based prospective cohort of 19,331 women who registered their pregnancy before the 15th gestational week in Tianjin, China, from October 2010 to August 2012. The dataset was randomly divided into a training set (70%) and a test set (30%). Risk factors collected at registration were examined and used to construct the prediction model in the training set. The model was developed with a machine learning method, extreme gradient boosting (XGBoost); a traditional logistic regression model was also developed for comparison. In the test set, model performance was assessed by calibration plots for calibration and by the area under the receiver operating characteristic curve (AUR) for discrimination.
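As an illustration of the evaluation design described above, the following is a minimal sketch, not the authors' code, of a random 70/30 split and of the area under the ROC curve computed from the Mann-Whitney rank statistic. The function names `train_test_split` and `roc_auc`, the seed, and the toy inputs are assumptions introduced here; the study's actual software and settings are not stated in this abstract.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Randomly split rows into (train, test) lists (hypothetical helper)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = rows[:]                 # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def roc_auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney rank statistic.

    labels are 0/1 outcomes, scores are predicted probabilities.
    Tied scores receive the average rank of their tie group.
    Assumes at least one positive and one negative label.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1     # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` returns 0.75: three of the four positive-negative pairs are ranked correctly. In practice the scores would be the predicted GDM probabilities of the fitted model on the held-out 30%.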
Results: In total, 1,484 (7.6%) women developed GDM. Pre-pregnancy body mass index, maternal age, fasting plasma glucose at registration, and alanine aminotransferase were selected as risk factors. In the test dataset, the probability of GDM predicted by the XGBoost model was similar to the observed probability, while the logistic model tended to overestimate the risk at the highest risk level (Hosmer-Lemeshow test p value: 0.243 vs. 0.099). The XGBoost model achieved a higher AUR than the logistic model (0.742 vs. 0.663, p < 0.001). The XGBoost model was deployed through a free, publicly available software interface (https://liuhongwei.shinyapps.io/gdm_risk_calculator/).
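The Hosmer-Lemeshow comparison reported above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' procedure: the number of risk groups (10) and the degrees of freedom (groups minus 2) are conventional choices that this abstract does not specify, and the function names are hypothetical. Predicted probabilities are assumed to lie strictly between 0 and 1, with at least as many observations as groups.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i               # accumulates (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(labels, probs, groups=10):
    """Return (chi-square statistic, p value) for the Hosmer-Lemeshow test.

    Observations are sorted by predicted probability and cut into
    `groups` bins of near-equal size (an assumed, conventional choice).
    """
    pairs = sorted(zip(probs, labels), key=lambda t: t[0])
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected number of events
        observed = sum(y for _, y in chunk)   # observed number of events
        # group contribution: (O - E)^2 / (E * (1 - E / n_g))
        stat += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return stat, chi2_sf_even_df(stat, groups - 2)
```

A larger p value indicates less evidence of a mismatch between predicted and observed risk, which is how the reported values of 0.243 (XGBoost) and 0.099 (logistic) are read.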
Conclusion: The XGBoost model achieved better calibration and discrimination than the logistic model for predicting GDM in early pregnancy.
Keywords: extreme gradient boosting; gestational diabetes mellitus; machine learning; prognostic prediction model.
© 2020 John Wiley & Sons Ltd.