Development and Validation of Machine Learning Algorithms for Prediction of Colorectal Polyps Based on Electronic Health Records

Biomedicines. 2024 Aug 27;12(9):1955. doi: 10.3390/biomedicines12091955.

Abstract

Background: Colorectal Polyps are the main source of precancerous lesions in colorectal cancer. To increase the early diagnosis of tumors and improve their screening, we aimed to develop a simple and non-invasive diagnostic prediction model for colorectal polyps based on machine learning (ML) and using accessible health examination records.

Methods: We conducted a single-center observational retrospective study in China. The derivation cohort, consisting of 5426 individuals who underwent colonoscopy screening from January 2021 to January 2024, was separated for training (cohort 1) and validation (cohort 2). The variables considered in this study included demographic data, vital signs, and laboratory results recorded by health examination records. With features selected by univariate analysis and Lasso regression analysis, nine machine learning methods were utilized to develop a colorectal polyp diagnostic model. Several evaluation indexes, including the area under the receiver-operating-characteristic curve (AUC), were used to compare the predictive performance. The SHapley additive explanation method (SHAP) was used to rank the feature importance and explain the final model.

Results: 14 independent predictors were identified as the most valuable features to establish the models. The adaptive boosting machine (AdaBoost) model exhibited the best performance among the 9 ML models in cohort 1, with accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and AUC (95% CI) of 0.632 (0.618-0.646), 0.635 (0.550-0.721), 0.674 (0.591-0.758), 0.593 (0.576-0.611), 0.673 (0.654-0.691), 0.608 (0.560-0.655) and 0.687 (0.626-0.749), respectively. The final model gave an AUC of 0.675 in cohort 2. Additionally, the precision recall (PR) curve for the AdaBoost model reached the highest AUPR of 0.648, positioning it nearest to the upper right corner. SHAP analysis provided visualized explanations, reaffirming the critical factors associated with the risk of colorectal polyps in the asymptomatic population.

Conclusions: This study integrated the clinical and laboratory indicators with machine learning techniques to establish the predictive model for colorectal polyps, providing non-invasive, cost-effective screening strategies for asymptomatic individuals and guiding decisions for further examination and treatment.

Keywords: SHAP; colorectal polyps; health examination record; machine learning; prediction model.

Grants and funding

This research received no external funding.