Utility of machine learning in developing a predictive model for early-age-onset colorectal neoplasia using electronic health records

PLoS One. 2022 Mar 10;17(3):e0265209. doi: 10.1371/journal.pone.0265209. eCollection 2022.

Abstract

Background and aims: The incidence of colorectal cancer (CRC) is increasing in adults younger than 50, and early screening remains challenging due to cost and under-utilization. To identify individuals aged 35-50 years who may benefit from early screening, we developed a prediction model using machine learning and electronic health record (EHR)-derived factors.

Methods: We enrolled 3,116 adults aged 35-50 at average risk for CRC who underwent colonoscopy between 2017 and 2020 at a single center. Prediction outcomes were (1) CRC and (2) CRC or high-risk polyps. We derived our predictors from EHRs (e.g., demographics, obesity, laboratory values, medications, and zip code-derived factors). We constructed four machine learning-based models using a training set (random sample of 70% of participants): regularized discriminant analysis, random forest, neural network, and gradient boosting decision tree. In the testing set (remaining 30% of participants), we measured predictive performance by comparing C-statistics to a reference model (logistic regression).
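The evaluation design described in the Methods (a random 70/30 split, then a C-statistic on the held-out 30%) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `split_70_30` and `c_statistic`, the seed, and the rounding rule for the split are assumptions made for illustration. The C-statistic is computed here by its pairwise definition, the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case, with ties counted as one half.

```python
import random


def split_70_30(records, seed=0):
    """Randomly assign about 70% of records to training, the rest to testing.

    The seed and the rounding rule are illustrative assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def c_statistic(labels, scores):
    """C-statistic (area under the ROC curve) by pairwise comparison.

    labels: 1 for a case, 0 for a non-case.
    scores: predicted risks, higher meaning more likely to be a case.
    """
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for c in case_scores:
        for n in control_scores:
            if c > n:
                concordant += 1.0   # case ranked above non-case
            elif c == n:
                concordant += 0.5   # ties count as half
    return concordant / (len(case_scores) * len(control_scores))


if __name__ == "__main__":
    # Participant indices stand in for EHR records (hypothetical data).
    train, test = split_70_30(range(3116))
    print(len(train), len(test))
    # Toy predicted risks for four held-out participants.
    print(c_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, which is the scale on which each model is compared with the logistic regression reference.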

Results: The study sample was 55.1% female, 32.8% non-white, and included 16 (0.5%) CRC cases and 478 (15.3%) cases of CRC or high-risk polyps. All machine learning models predicted CRC with higher discriminative ability compared to the reference model [e.g., C-statistics (95%CI); neural network: 0.75 (0.48-1.00) vs. reference: 0.43 (0.18-0.67); P = 0.07]. Furthermore, all machine learning approaches, except for gradient boosting, predicted CRC or high-risk polyps significantly better than the reference model [e.g., C-statistics (95%CI); regularized discriminant analysis: 0.64 (0.59-0.69) vs. reference: 0.55 (0.50-0.59); P<0.0015]. The most important predictive variables in the regularized discriminant analysis model for CRC or high-risk polyps were income per zip code, the colonoscopy indication, and body mass index quartiles.
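The confidence interval reported for the CRC outcome is much wider than the one for CRC or high-risk polyps, which is what one would expect when an outcome has very few events. The sketch below shows one common way to attach a 95% interval to a C-statistic, a percentile bootstrap. The abstract does not state how its intervals were computed, so the bootstrap, the function name `bootstrap_ci`, the number of resamples, and the seed are all assumptions for illustration rather than the authors' method.

```python
import random


def c_statistic(labels, scores):
    """Pairwise C-statistic; ties between a case and a non-case count 0.5."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for c in cases:
        for n in controls:
            total += 1.0 if c > n else 0.5 if c == n else 0.0
    return total / (len(cases) * len(controls))


def bootstrap_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the C-statistic (illustrative)."""
    n = len(labels)
    if sum(labels) in (0, n):
        raise ValueError("need at least one case and one non-case")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_boot:
        # Resample participants with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Skip resamples that contain only cases or only non-cases.
        if 0 < sum(ys) < n:
            stats.append(c_statistic(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical held-out data: 300 participants, 1 in 30 is a case.
    rng = random.Random(42)
    labels = [1 if i % 30 == 0 else 0 for i in range(300)]
    scores = [rng.random() + 0.3 * y for y in labels]
    print(c_statistic(labels, scores), bootstrap_ci(labels, scores))
```

With only a handful of cases in a resample, each case contributes a large share of the case/non-case pairs, so the resampled C-statistics vary widely and the interval tends to be broad.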

Discussion: Machine learning can predict CRC risk in adults aged 35-50 from EHR-derived factors with better discrimination than logistic regression. Further development of our model is needed, followed by validation in a primary-care setting, before clinical application.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Adult
  • Colorectal Neoplasms* / diagnosis
  • Colorectal Neoplasms* / epidemiology
  • Electronic Health Records*
  • Female
  • Humans
  • Logistic Models
  • Machine Learning
  • Male