Revolutionizing cardiovascular disease classification through machine learning and statistical methods

J Biopharm Stat. 2024 Nov 24:1-23. doi: 10.1080/10543406.2024.2429524. Online ahead of print.

Abstract

Background: Cardiovascular diseases (CVDs) include abnormal conditions of the heart, diseased blood vessels, structural problems of the heart, and blood clots. Traditionally, CVD has been diagnosed by clinical experts, physicians, and medical specialists, a process that is expensive, time-consuming, and dependent on expert intervention. The emergence of machine learning (ML) and statistical techniques now makes cost-effective digital diagnosis of CVD possible.

Method: In this research, extensive studies were carried out to classify CVD using 19 promising ML models. To evaluate and rank the ML models for CVD classification, two benchmark CVD datasets were taken from well-known sources, such as Kaggle and the UCI repository. The results were analysed for each dataset individually and for their combination, to assess the efficiency and reliability of the ML models on the basis of various performance measures, such as precision, kappa, accuracy, recall, and the F1 score. Because some of the ML models are stochastic, we repeated the simulation 50 times for each dataset with each model and applied nonparametric statistical tests to draw decisive conclusions.
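As an illustration of the performance measures named above, the following minimal sketch computes accuracy, precision, recall, the F1 score, and Cohen's kappa for a binary classifier from its confusion-matrix counts. The function name `binary_metrics`, the 0/1 label encoding, and the toy labels are assumptions made for this example; they are not taken from the study.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, F1 and Cohen's kappa for 0/1 labels.

    Hypothetical helper for illustration; label 1 is assumed to mean "CVD present".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Toy labels, invented for the example (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

In a repeated-run design such as the one described, a function of this kind would be called once per run and per model, yielding one score table per measure for the subsequent statistical tests.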

Results: The nonparametric Friedman-Nemenyi hypothesis test suggests that the Extra Tree Classifier provides statistically superior accuracy and precision compared with all other models. However, the Extreme Gradient Boost (XGBoost) classifier provides statistically superior recall, kappa, and F1 scores compared with all other models. Additionally, the XGBRF classifier ranks statistically second-best in terms of recall.
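The Friedman-Nemenyi procedure can be sketched as follows: each run ranks the models (rank 1 for the best score), the Friedman statistic tests whether the mean ranks differ, and the Nemenyi critical difference (CD) gives the minimum gap in mean rank for two models to be considered significantly different. This is a minimal sketch, not the authors' implementation: the function names and the score matrix are hypothetical, the statistic omits the tie correction, and `q_alpha` must be supplied by the caller from a table of critical values (2.343 is the commonly tabulated value for three models at the 0.05 level).

```python
import math


def rank_row(scores):
    """Rank one run's scores: highest score gets rank 1, ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend the group while the next score equals the current one.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = average_rank
        i = j + 1
    return ranks


def friedman_mean_ranks(score_matrix):
    """Return (mean ranks per model, Friedman chi-square without tie correction).

    score_matrix[r][j] is the score of model j in run r.
    """
    n = len(score_matrix)      # number of runs (blocks)
    k = len(score_matrix[0])   # number of models
    rank_rows = [rank_row(row) for row in score_matrix]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2


def nemenyi_cd(k, n, q_alpha):
    """Nemenyi critical difference in mean rank for k models over n runs."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Invented accuracy scores for 3 hypothetical models over 4 runs.
scores = [
    [0.90, 0.80, 0.70],
    [0.92, 0.81, 0.70],
    [0.91, 0.85, 0.75],
    [0.88, 0.82, 0.71],
]
mean_ranks, chi2 = friedman_mean_ranks(scores)
cd = nemenyi_cd(k=3, n=4, q_alpha=2.343)
```

Two models whose mean ranks differ by more than `cd` would be declared significantly different under this test; with 19 models and 50 runs, the same functions apply with the corresponding `q_alpha`.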

Keywords: Cardiovascular disease (CVD); Gaussian Naïve Bayes (GNB); K-nearest neighbor (KNN); adaptive boosting (ADB); artificial intelligence (AI); artificial neural network (ANN); bagging classifier (BC); decision tree (DT); extra tree (TC); extra trees (ETC); extreme gradient boost (XGBoost); extreme gradient boost random forest (XGBRF); gradient boosting (GB); linear support vector classifier (LSVC); logistic regression (LR); machine learning (ML); multilayer perceptron (MLP); passive aggressive classifier (PAC); random forest (RF); ridge classifier (RC); stochastic gradient descent (SGD); support vector classifier (SVC); voting classifier (VC).