Integrating Metabolomics Domain Knowledge with Explainable Machine Learning in Atherosclerotic Cardiovascular Disease Classification

Everton Santana; Eliana Ibrahimi; Evangelos Ntalianis; Nicholas Cauwenberghs; Tatiana Kuznetsova

doi:10.3390/ijms252312905

Int J Mol Sci. 2024 Nov 30;25(23):12905. doi: 10.3390/ijms252312905.

Authors

Everton Santana¹, Eliana Ibrahimi², Evangelos Ntalianis¹, Nicholas Cauwenberghs¹, Tatiana Kuznetsova¹

Affiliations

¹ Research Unit Hypertension and Cardiovascular Epidemiology, KU Leuven Department of Cardiovascular Sciences, University of Leuven, 3000 Leuven, Belgium.
² Department of Biology, University of Tirana, 1001 Tirana, Albania.

Abstract

Metabolomic data often present challenges due to high dimensionality, collinearity, and variability in metabolite concentrations. Applying machine learning (ML) to metabolomic analyses enables the extraction of meaningful information from such complex data. Combining metabolomics domain knowledge with explainable ML methods can improve both the predictive performance and the interpretability of models used in atherosclerosis research. In this work, we aimed to identify the most impactful metabolites associated with the presence of atherosclerotic cardiovascular disease (ASCVD) in cross-sectional case-control studies using explainable ML methods integrated with metabolomics domain knowledge. To this end, a subset of the FLEMENGHO cohort with available metabolomic data was used as the training cohort; it included 63 patients with a history of ASCVD and 52 non-smoking controls from the same population, matched by age, sex, and body mass index. First, Partial Least Squares Discriminant Analysis (PLS-DA) was applied for dimensionality reduction. Correlations among the selected metabolites were then analyzed in light of their chemical categorization. Next, eXtreme Gradient Boosting (XGBoost) was used to identify metabolites that characterize ASCVD. Finally, the selected metabolites were evaluated in an external cohort to determine how effectively they distinguish cases from controls. PLS-DA selected a total of 56 metabolites for ASCVD discrimination. The main superclasses of the identified metabolites were lipids, organic acids, and organic oxygen compounds. An XGBoost model built on these metabolites yielded a test area under the curve (AUC) of 0.75. SHapley Additive exPlanations (SHAP) analyses ranked cholesterol, 3-methylhistidine, and glucuronic acid among the most impactful features and showed the diversity of metabolites considered for building the ASCVD discriminator. Also using XGBoost, the selected metabolites achieved an AUC of 0.93 in an independent external validation cohort. In conclusion, combinations of different metabolites have the potential to serve as a basis for ASCVD classifiers. Integrating metabolite categorization within the SHAP analysis further enhanced the interpretability of the model, offering insights into metabolite-specific contributions to ASCVD risk.
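
Two quantities mentioned in the abstract, the AUC and the grouping of per-metabolite attributions by chemical superclass, can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the toy labels, scores, and attribution values, and the metabolite-to-superclass mapping are all assumptions made for illustration, and the AUC is computed with the standard rank-based (Mann-Whitney) formulation rather than with any specific library.

```python
from collections import defaultdict


def auc_from_scores(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a win (Mann-Whitney U formulation).
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def mean_abs_by_superclass(attributions, superclass_of):
    """Sum the mean absolute per-feature attribution within each superclass.

    attributions: list of dicts, one per sample, mapping feature -> attribution
    superclass_of: dict mapping feature -> superclass label
    """
    n = len(attributions)
    totals = defaultdict(float)
    for row in attributions:
        for feature, value in row.items():
            totals[superclass_of[feature]] += abs(value) / n
    return dict(totals)


# Toy inputs, invented for illustration only (not data from the study).
labels = [1, 1, 1, 0, 0]            # 1 = case, 0 = control
scores = [0.9, 0.6, 0.4, 0.5, 0.2]  # hypothetical classifier outputs

# Assumed mapping of the three named metabolites to the superclasses
# listed in the abstract.
superclass_of = {
    "cholesterol": "lipids",
    "3-methylhistidine": "organic acids",
    "glucuronic acid": "organic oxygen compounds",
}

# Hypothetical SHAP-style attributions for two samples.
attributions = [
    {"cholesterol": 0.30, "3-methylhistidine": -0.10, "glucuronic acid": 0.05},
    {"cholesterol": -0.20, "3-methylhistidine": 0.20, "glucuronic acid": -0.15},
]

toy_auc = auc_from_scores(labels, scores)
by_class = mean_abs_by_superclass(attributions, superclass_of)
```

In this toy case, five of the six case-control pairs are ranked correctly, so `toy_auc` is 5/6. The per-superclass totals show how attributions for individual metabolites could be rolled up into the chemical categories used for interpretation.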

Keywords: atherosclerotic cardiovascular diseases; domain knowledge; explainable machine learning; metabolomics.

MeSH terms

Aged
Atherosclerosis* / metabolism
Biomarkers
Cardiovascular Diseases / metabolism
Case-Control Studies
Cross-Sectional Studies
Female
Humans
Machine Learning*
Male
Metabolomics* / methods
Middle Aged

Substances

Biomarkers
