Background: This study aimed to develop predictive models with robust generalization capabilities for assessing the risk of pulmonary embolism in patients with tuberculosis using machine learning algorithms.
Methods: Data were collected from two centers and divided into a development cohort and a validation cohort. In the development cohort, candidate variables were selected by Recursive Feature Elimination (RFE). Five machine learning algorithms were used to construct the predictive models: logistic regression (LR), random forest (RF), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and support vector machine (SVM). Model performance was evaluated by nested cross-validation using the area under the curve (AUC), supplemented by SHapley Additive exPlanations (SHAP) for interpretation and by line charts of AUC values. The models were then externally validated in the independent validation cohort, with the aim of supporting early identification and management of pulmonary embolism risk in tuberculosis patients.
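The selection and evaluation steps described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the study's code: it runs on synthetic data rather than the cohort data, and the RFE base estimator, the hyperparameter grids, and the fold counts are all assumptions. XGBoost and LightGBM are left out only to keep the sketch dependent on scikit-learn alone. RFE is placed inside the pipeline so that feature selection is refit within each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a development cohort (NOT the study data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6, random_state=0
)


def build_pipeline(estimator):
    """Scale, select features with RFE, then fit the classifier."""
    return Pipeline([
        ("scale", StandardScaler()),
        # Assumption: logistic regression ranks features; 14 are kept to
        # mirror the size of the variable subset reported in the Results.
        ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=14)),
        ("clf", estimator),
    ])


# Assumed, deliberately small hyperparameter grids.
candidates = {
    "LR": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0]}),
    "RF": (RandomForestClassifier(n_estimators=50, random_state=0),
           {"clf__max_depth": [3, None]}),
    "SVM": (SVC(), {"clf__C": [0.1, 1.0]}),
}

# Inner loop tunes hyperparameters; outer loop estimates performance.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(build_pipeline(estimator), grid,
                          scoring="roc_auc", cv=inner_cv)
    scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

Because tuning happens only inside the inner folds, the outer-fold AUC values are not biased by hyperparameter selection, which is the point of nesting the two loops.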
Results: Data from 694 patients were used for model development, and 236 patients in the validation group met the enrollment criteria. The optimal variable subset comprised D-dimer, smoking status, dyspnea, age, sex, diabetes, platelet count, cough, fibrinogen, hemoglobin, hemoptysis, hypertension, chronic obstructive pulmonary disease (COPD), and chest pain. The RF model outperformed the other models, achieving an AUC of 0.839 (95% CI 0.780-0.899) and maintaining the highest average performance in external fivefold cross-validation (AUC: 0.906 ± 0.041).
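The abstract does not state how the 95% confidence interval for the AUC was obtained. As one common option, a percentile bootstrap can be sketched as below. The function names `auc` and `bootstrap_auc_ci` are hypothetical, the data are randomly generated toy values rather than study results, and the number of resamples is an assumption.

```python
import random


def auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) for the AUC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        # Resample patients with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y_true[i] for i in idx]
        if len(set(labels)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auc(labels, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return auc(y_true, y_score), lower, upper


# Toy data: 20 positives with higher average scores than 40 negatives.
toy_rng = random.Random(42)
y_true = [1] * 20 + [0] * 40
y_score = ([toy_rng.gauss(1.0, 1.0) for _ in range(20)]
           + [toy_rng.gauss(0.0, 1.0) for _ in range(40)])

point, lower, upper = bootstrap_auc_ci(y_true, y_score)
print(f"AUC {point:.3f} (95% CI {lower:.3f}-{upper:.3f})")
```

Other interval methods, such as DeLong's, would generally give somewhat different bounds on the same data.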
Conclusions: The RF model demonstrates high and consistent effectiveness in predicting pulmonary embolism risk in tuberculosis patients.
Keywords: Machine learning; Pulmonary embolism; Pulmonary tuberculosis; Risk prediction.
© 2024. The Author(s).