Purpose: The risk of colorectal cancer (CRC) recurrence after primary treatment varies across individuals and over time. Using patients' most up-to-date information, including carcinoembryonic antigen (CEA) biomarker profiles, to predict risk could improve personalized decision making.
Methods: We used electronic health record data from an integrated health system for a cohort of patients diagnosed with American Joint Committee on Cancer stage I-III CRC between 2008 and 2013 (N = 3,970) and followed until recurrence or end of follow-up. We addressed missingness in recurrence outcomes and longitudinal CEA measures, and engineered CEA features from current and past biomarker values for inclusion in a risk prediction model. We used a discrete-time Super Learner model to evaluate various algorithms for predicting recurrence. We evaluated the time-varying discrimination and calibration of the algorithms and assessed the role of individual predictors.
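As an illustration of the kind of CEA feature described above, the following minimal sketch computes a change in log CEA over roughly 6 months from a patient's dated measurements. It is not the study's pipeline: the function name `log_cea_change`, the 183-day window, the use of the natural logarithm, and the rule of carrying the most recent value forward are all assumptions made for the example.

```python
import math
from datetime import date, timedelta


def log_cea_change(measurements, as_of, window_days=183):
    """Change in log CEA over roughly the last 6 months (illustrative only).

    measurements: list of (date, cea_value) pairs for one patient.
    as_of: the date at which the prediction is made.
    Returns None when no usable current or past value exists.
    Assumptions: natural log, a 183-day window, and last value carried forward.
    """
    # Keep only positive values observed on or before the prediction date.
    observed = [(d, v) for d, v in measurements if d <= as_of and v > 0]
    if not observed:
        return None

    # Values measured at least `window_days` before the prediction date.
    cutoff = as_of - timedelta(days=window_days)
    past = [(d, v) for d, v in observed if d <= cutoff]
    if not past:
        return None

    _, latest = max(observed, key=lambda m: m[0])   # most recent value
    _, baseline = max(past, key=lambda m: m[0])     # most recent value before the window
    return math.log(latest) - math.log(baseline)


# Hypothetical patient: CEA rises from 2.0 to 3.0 over about 7 months.
history = [
    (date(2010, 1, 10), 2.0),
    (date(2010, 4, 15), 2.4),
    (date(2010, 8, 1), 3.0),
]
change = log_cea_change(history, as_of=date(2010, 8, 15))
print(round(change, 3))
```

In this sketch a patient with no new measurement inside the window gets a change of zero, which is one of several possible ways to handle sparse monitoring and is not necessarily how missing CEA values were handled in the study.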
Results: Recurrence was documented in 448 (11.3%) patients. XGBoost with depth = 1 (XGB-D1) predicted recurrence substantially better than all other algorithms at all time points, with AUC ranging from 0.87 (95% CI, 0.86 to 0.88) at 6 months to 0.94 (95% CI, 0.92 to 0.96) at 54 months. The only variable used by XGB-D1 was 6-month change in log CEA. Predicted 1-year risk of recurrence was nearly zero for patients whose log CEA did not increase in the previous 6 months, between 12.2% and 34.1% for patients whose log CEA increased by 0.10 to 0.40, and 43.6% for those with a log CEA increase >0.40. Compared with XGB, penalized regression approaches (lasso, ridge, and elastic net) performed poorly, with AUCs ranging from 0.58 to 0.69.
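The risk bands reported above can be written as a simple lookup, sketched below. The function name `reported_risk_band` is hypothetical, and the function only restates the figures in this abstract: it is not the fitted XGB-D1 model and should not be used to estimate risk for a patient. Increases between 0 and 0.10 are not reported here, so the sketch returns `None` for that range rather than guessing.

```python
def reported_risk_band(delta_log_cea):
    """Map a 6-month change in log CEA to the 1-year risk band quoted in the abstract.

    Returns a descriptive string, or None where the abstract gives no figure.
    This is a restatement of summary results, not a prediction model.
    """
    if delta_log_cea <= 0:
        return "nearly zero"        # log CEA did not increase
    if delta_log_cea > 0.40:
        return "43.6%"              # increase greater than 0.40
    if delta_log_cea >= 0.10:
        return "12.2% to 34.1%"     # increase of 0.10 to 0.40
    return None                     # 0 < increase < 0.10: not reported


for delta in (-0.20, 0.05, 0.25, 0.50):
    print(delta, reported_risk_band(delta))
```

If the logarithm is natural, which the abstract does not state, an increase of 0.40 in log CEA corresponds to roughly a 1.5-fold rise in CEA, and an increase of 0.10 to roughly a 1.1-fold rise.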
Conclusion: A flexible machine learning approach that incorporated longitudinal CEA information yielded a simple, high-performing model that predicts recurrence on the basis of 6-month change in log CEA.