Personalized three-year survival prediction and prognosis forecast by interpretable machine learning for pancreatic cancer patients: a population-based study and an external validation

Front Oncol. 2024 Oct 21:14:1488118. doi: 10.3389/fonc.2024.1488118. eCollection 2024.

Abstract

Purpose: The overall survival of patients with pancreatic cancer is extremely low. We aimed to establish machine learning (ML) based model to accurately predict three-year survival and prognosis of pancreatic cancer patients.

Methods: We analyzed pancreatic cancer patients from the Surveillance, Epidemiology, and End Results (SEER) database between 2000 and 2021. Univariate and multivariate logistic analysis were employed to select variables. Recursive Feature Elimination (RFE) method based on 6 ML algorithms was utilized in feature selection. To construct predictive model, 13 ML algorithms were evaluated by area under the curve (AUC), area under precision-recall curve (PRAUC), accuracy, sensitivity, specificity, precision, cross-entropy, Brier scores and Balanced Accuracy (bacc) and F Beta Score (fbeta). An optimal ML model was constructed to predict three-year survival, and the predictive results were explained by SHapley Additive exPlanations (SHAP) framework. Meanwhile, 101 ML algorithm combinations were developed to select the best model with highest C-index to predict prognosis of pancreatic cancer patients.

Results: A total of 20,064 pancreatic cancer patients from SEER database was consecutively enrolled. We utilized eight clinical variables to establish prediction model for three-year survival. CatBoost model was selected as the best prediction model, and AUC was 0.932 [0.924, 0.939], 0.899 [0.873, 0.934] and 0.826 [0.735, 0.919] in training, internal test and external test sets, with 0.839 [0.831, 0.847] accuracy, 0.872 [0.858, 0.887] sensitivity, 0.803 [0.784, 0.825] specificity and 0.832 [0.821, 0.853] precision. Surgery type had the greatest effects on three-year survival according to SHAP results. For prognosis prediction, "RSF+GBM" algorithm was the best prognostic model with C-index of 0.774, 0.722 and 0.674 in training, internal test and external test sets.

Conclusions: Our ML models demonstrate excellent accuracy and reliability, offering more precise personalized prognostic prediction to pancreatic cancer patients.

Keywords: SEER; machine learning; pancreatic cancer; prognosis prediction; three-year survival.

Grants and funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.