Construction of a prognostic prediction model for colorectal cancer based on 5-year clinical follow-up data

Sci Rep. 2025 Jan 21;15(1):2701. doi: 10.1038/s41598-025-86872-5.

Abstract

Colorectal cancer (CRC) is a prevalent malignant tumor that presents significant challenges to both public health and healthcare systems. The aim of this study was to develop a machine learning model based on five years of clinical follow-up data from CRC patients to accurately identify individuals at risk of poor prognosis. This study included 411 CRC patients who underwent surgery at Yixing Hospital and completed the follow-up process. A modeling dataset containing 73 characteristic variables was established by collecting demographic information, clinical blood test indicators, histopathological results, and additional treatment-related information. Decision tree, random forest, support vector machine, and extreme gradient boosting (XGBoost) models were built on the features selected through recursive feature elimination (RFE). The Cox proportional hazards model was used as the baseline for model comparison. During model training, hyperparameters were optimized using a grid search. Model performance was assessed using multiple metrics, including accuracy, F1 score, Brier score, sensitivity, specificity, positive predictive value, negative predictive value, the receiver operating characteristic (ROC) curve, the calibration curve, and decision curve analysis. For the selected optimal model, the decision-making process was interpreted using the SHapley Additive exPlanations (SHAP) method. The optimal RFE-XGBoost model achieved an accuracy of 0.83 (95% CI 0.76-0.90), an F1 score of 0.81 (95% CI 0.72-0.88), and an area under the ROC curve of 0.89 (95% CI 0.82-0.94). Furthermore, the model exhibited superior calibration and clinical utility. SHAP analysis revealed that increased perioperative transfusion quantity, higher tumor AJCC stage, elevated carcinoembryonic antigen (CEA) level, elevated carbohydrate antigen 19-9 (CA19-9) level, advanced age, and elevated carbohydrate antigen 125 (CA125) level were correlated with increased individual mortality risk. The RFE-XGBoost model demonstrated excellent performance in predicting CRC patient prognosis, and the application of the SHAP method bolstered the model's credibility and utility.
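
As a reading aid for the evaluation metrics named in the abstract, the following is a minimal sketch of how the threshold-based metrics, the Brier score, and the net benefit used in decision curve analysis can be computed from true outcomes and predicted probabilities. It is not the authors' code: the function names (`evaluate`, `net_benefit`), the 0.5 classification threshold, and the toy data are assumptions made purely for illustration, and the numbers are not from the study.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    f1: float
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    brier: float


def _ratio(num, den):
    # Return NaN instead of raising when a denominator is zero.
    return num / den if den else float("nan")


def _confusion(y_true, y_prob, threshold):
    # Count TP, FP, TN, FN, classifying as positive when p >= threshold.
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_prob, threshold=0.5):
    # Hypothetical helper: the 0.5 threshold is an assumption, not the paper's.
    tp, fp, tn, fn = _confusion(y_true, y_prob, threshold)
    n = len(y_true)
    # Brier score: mean squared difference between probability and outcome.
    brier = sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / n
    return BinaryMetrics(
        accuracy=_ratio(tp + tn, n),
        f1=_ratio(2 * tp, 2 * tp + fp + fn),
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
        brier=brier,
    )


def net_benefit(y_true, y_prob, pt):
    # Net benefit at threshold probability pt (0 < pt < 1):
    # TP/n - FP/n * pt / (1 - pt), the quantity plotted in a decision curve.
    tp, fp, _, _ = _confusion(y_true, y_prob, pt)
    n = len(y_true)
    return tp / n - fp / n * pt / (1 - pt)


# Toy data (illustrative only): 1 = poor prognosis, 0 = otherwise.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
metrics = evaluate(y_true, y_prob)
nb = net_benefit(y_true, y_prob, pt=0.5)
```

A decision curve is obtained by evaluating `net_benefit` over a range of threshold probabilities and comparing the result with the treat-all and treat-none strategies.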

Keywords: Colorectal cancer; Follow-up studies; Machine learning; Prognosis; Risk factors.

MeSH terms

  • Adult
  • Aged
  • Colorectal Neoplasms* / blood
  • Colorectal Neoplasms* / mortality
  • Colorectal Neoplasms* / pathology
  • Colorectal Neoplasms* / surgery
  • Female
  • Follow-Up Studies
  • Humans
  • Machine Learning
  • Male
  • Middle Aged
  • Prognosis
  • Proportional Hazards Models
  • ROC Curve
  • Support Vector Machine