Development and validation of machine learning models for predicting venous thromboembolism in colorectal cancer patients: A cohort study in China

Int J Med Inform. 2024 Dec 19:195:105770. doi: 10.1016/j.ijmedinf.2024.105770. Online ahead of print.

Abstract

Background: With advancements in healthcare, traditional VTE risk assessment tools are increasingly insufficient to meet the demands of high-quality care, underscoring the need for innovative and specialized assessment methods.

Objective: Given the success of machine learning in supervised learning and disease prediction, we aimed to develop a reliable and efficient model for assessing VTE risk, using the baseline data and clinical characteristics of colorectal cancer patients at our medical facility.

Methods: Six commonly used machine learning algorithms were used to predict the occurrence of VTE in patients with rectal cancer. During modeling, LASSO regression was employed to identify and exclude variables not associated with VTE. Hyperparameters were tuned via 5-fold cross-validation to mitigate overfitting, and 200 bootstrap samples were used to adjust the apparent performance on the training set. The final VTE assessment model was selected by comparing performance metrics, including the AUC, ACC and F1 score.
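The bootstrap adjustment described in the Methods can be illustrated with a minimal sketch. The abstract does not state the exact procedure, so this assumes a Harrell-style optimism correction: refit on each bootstrap sample, measure how much better the refit model scores on its own sample than on the original data, and subtract the mean gap from the apparent score. The data are synthetic, and the `fit` and `predict` helpers are hypothetical stand-ins for the study's tuned models, not the authors' code.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit(X, y):
    """Toy stand-in model (hypothetical): weights are the difference of class means."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return [sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
            for j in range(len(X[0]))]

def predict(w, X):
    """Linear score for each row."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in X]

def optimism_corrected_auc(X, y, n_boot=200, seed=0):
    """Return (apparent AUC, optimism-adjusted AUC) using n_boot bootstrap refits."""
    rng = random.Random(seed)
    n = len(y)
    apparent = auc(predict(fit(X, y), X), y)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # redraw if the bootstrap sample lost one class
        Xb = [X[i] for i in idx]
        w = fit(Xb, yb)
        boot_auc = auc(predict(w, Xb), yb)  # performance on the bootstrap sample
        orig_auc = auc(predict(w, X), y)    # same model scored on the original data
        optimism.append(boot_auc - orig_auc)
    return apparent, apparent - sum(optimism) / n_boot

# Synthetic, imbalanced data: one informative feature and four noise features.
data_rng = random.Random(42)
y = [1 if data_rng.random() < 0.15 else 0 for _ in range(200)]
X = [[data_rng.gauss(0.8 * t, 1.0)] + [data_rng.gauss(0.0, 1.0) for _ in range(4)]
     for t in y]

apparent, adjusted = optimism_corrected_auc(X, y)
print(f"apparent AUC = {apparent:.3f}, adjusted AUC = {adjusted:.3f}")
```

In a fuller pipeline, the variable selection and hyperparameter tuning steps would normally be repeated inside each bootstrap refit so that the correction reflects the whole modeling process.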

Results: The RF model exhibited consistent and efficient performance. Specifically, on the internal validation dataset, where generalizability was adjusted, the RF model achieved the highest scores across multiple metrics: AD-AUC (0.895), AD-ACC (0.871), AD-F1 (0.311), AD-MCC (0.316), AD-Precision (0.241), AD-Specificity (0.888). For external validation on unseen colon cancer data, the RF model also performed best in terms of ACC (0.728), F1 (0.292), MCC (0.225), Precision (0.192), and Specificity (0.740), with a suboptimal AUC of 0.745 and a Sensitivity (Recall) of 0.615. Additionally, the RF model demonstrated strong performance not only on the original dataset but also on datasets processed via alternative imbalance handling techniques.
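The external validation figures can be cross-checked with simple arithmetic. The sketch below recomputes F1 from the reported Precision and Sensitivity, then backs out the VTE event rate implied by Precision, Sensitivity and Specificity. That event rate is an inference from rounded values, not a figure stated in the abstract, so all outputs are approximate.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def implied_prevalence(precision, sensitivity, specificity):
    """Solve precision = s*p / (s*p + (1 - spec)*(1 - p)) for the event rate p."""
    fpr = 1 - specificity
    return precision * fpr / (sensitivity * (1 - precision) + precision * fpr)

def accuracy(sensitivity, specificity, prevalence):
    """Accuracy is the prevalence-weighted mix of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1 - prevalence)

# External validation values for the RF model, as reported in the Results.
precision, sensitivity, specificity = 0.192, 0.615, 0.740

f1 = f1_score(precision, sensitivity)
prev = implied_prevalence(precision, sensitivity, specificity)
acc = accuracy(sensitivity, specificity, prev)

print(f"F1 from precision and recall: {f1:.3f}")
print(f"implied event rate:           {prev:.3f}")
print(f"accuracy at that event rate:  {acc:.3f}")
```

Under these assumptions, the recomputed F1 and accuracy land within rounding error of the reported 0.292 and 0.728. The low implied event rate also suggests why Precision and F1 stay modest while accuracy and Specificity are comparatively high.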

Conclusions: We established and validated a model for assessing the risk of VTE in colorectal cancer patients.

Keywords: Bootstrap; Colorectal cancer; Machine learning; VTE.