Interpreting Lung Cancer Health Disparity between African American Males and European American Males

Masrur Sobhan; Md Mezbahul Islam; Ananda Mohan Mondal

doi:10.1109/bibm62325.2024.10822014

Proceedings (IEEE Int Conf Bioinformatics Biomed). 2024 Dec:2024:7141-7143. doi: 10.1109/bibm62325.2024.10822014.

Authors

Masrur Sobhan¹, Md Mezbahul Islam¹, Ananda Mohan Mondal¹

Affiliation

¹ Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, USA.

Abstract

Lung cancer remains a predominant cause of cancer-related deaths, with notable disparities in incidence and outcomes across racial and gender groups. This study addresses these disparities by developing a computational framework that leverages explainable artificial intelligence (XAI) to identify both patient- and cohort-specific biomarker genes in lung cancer. Specifically, we focus on two lung cancer subtypes, Lung Adenocarcinoma (LUAD) and Lung Squamous Cell Carcinoma (LUSC), and examine two race- and sex-specific cohorts: African American males (AAMs) and European American males (EAMs). The study structures its classification tasks by disease condition rather than by racial label to avoid race-specific imbalance. We constructed four classification tasks, one three-class problem (LUAD-LUSC-HEALTHY) and three two-class problems (LUAD-LUSC, LUAD-HEALTHY, LUSC-HEALTHY), to interpret the disease behavior of the patients in terms of genes and pathways. This methodology allows a LUAD or LUSC patient to be analyzed via multiple classifications, yielding robust disparity information for every patient. This preliminary work reports the disparity information for LUAD only. Using transcriptome data from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) projects, we processed samples for the LUAD, LUSC, and HEALTHY cohorts. We applied machine learning models, including a convolutional neural network (CNN), logistic regression (LR), a naïve Bayesian classifier (NB), a support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost), for the classification. The SHapley Additive exPlanation (SHAP)-based interpretation of the best-performing classification model uncovered cohort-specific genes and pathways related to health disparities between the LUAD-AAM and LUAD-EAM cohorts.
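The task layout described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`build_tasks`, `subset_for_task`, `tasks_covering`) and the toy sample records are hypothetical, and only the three class labels and the four task names come from the abstract. The sketch shows that labels are defined by disease condition, so the race cohort (AAM or EAM) is carried as metadata and never used as a classification target, and that a LUAD patient falls into three of the four tasks.

```python
from itertools import combinations

# Disease-condition labels named in the abstract.
CLASSES = ("LUAD", "LUSC", "HEALTHY")


def build_tasks(classes=CLASSES):
    """Return the one three-class task followed by every two-class task."""
    tasks = [tuple(classes)]
    tasks.extend(combinations(classes, 2))
    return tasks


def task_name(task):
    """Join the class labels of a task, e.g. ('LUAD', 'LUSC') -> 'LUAD-LUSC'."""
    return "-".join(task)


def subset_for_task(samples, task):
    """Keep only the samples whose disease label belongs to the task."""
    return [s for s in samples if s["label"] in task]


def tasks_covering(label, tasks):
    """List the names of the tasks in which a given disease label takes part."""
    return [task_name(t) for t in tasks if label in t]


# Hypothetical toy records: the cohort field is metadata, not a class label.
SAMPLES = [
    {"id": "s1", "label": "LUAD", "cohort": "AAM"},
    {"id": "s2", "label": "LUAD", "cohort": "EAM"},
    {"id": "s3", "label": "LUSC", "cohort": "EAM"},
    {"id": "s4", "label": "HEALTHY", "cohort": "AAM"},
]

TASKS = build_tasks()
TASK_NAMES = [task_name(t) for t in TASKS]
LUAD_TASKS = tasks_covering("LUAD", TASKS)
LUAD_VS_HEALTHY = subset_for_task(SAMPLES, ("LUAD", "HEALTHY"))

if __name__ == "__main__":
    print(TASK_NAMES)
    print(LUAD_TASKS)
    print([s["id"] for s in LUAD_VS_HEALTHY])
```

Under this layout, each LUAD sample could receive one model explanation per task it belongs to, which is one way to read the abstract's statement that a patient is analyzed via multiple classifications.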

Keywords: SHAP; explainable AI; health disparity; lung cancer; patient-specific biomarker.