Lung cancer remains a leading cause of cancer-related death, with notable disparities in incidence and outcomes across racial and gender groups. This study addresses these disparities by developing a computational framework that uses explainable artificial intelligence (XAI) to identify both patient-specific and cohort-specific biomarker genes in lung cancer. We focus on two lung cancer subtypes, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), and on two race- and sex-specific cohorts: African American males (AAMs) and European American males (EAMs). To avoid race-specific imbalance, we structure the classification tasks by disease condition rather than by racial label. We constructed four classification tasks, one three-class problem (LUAD-LUSC-HEALTHY) and three two-class problems (LUAD-LUSC, LUAD-HEALTHY, LUSC-HEALTHY), to interpret the disease behavior of patients in terms of genes and pathways. Under this design, each LUAD or LUSC patient is analyzed in multiple classification tasks, which yields robust disparity information for every patient. This preliminary work reports the disparity information for LUAD only. Using transcriptome data from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) projects, we processed samples for the LUAD, LUSC, and HEALTHY cohorts. For classification, we applied several machine learning models: a convolutional neural network (CNN), logistic regression (LR), a naïve Bayes classifier (NB), a support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost). SHapley Additive exPlanations (SHAP)-based interpretation of the best-performing classification model uncovered cohort-specific genes and pathways related to health disparities between the LUAD-AAM and LUAD-EAM cohorts.
Keywords: SHAP; explainable AI; health disparity; lung cancer; patient-specific biomarker.
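The four condition-based classification tasks described in the abstract can be illustrated with a minimal Python sketch. It is not the study's implementation: the helper names (`build_tasks`, `tasks_for_condition`) and the toy sample identifiers are hypothetical, and only the task names and condition labels come from the text. The sketch shows how each task is defined by the conditions it includes, and why a LUAD patient falls into three of the four tasks.

```python
from typing import Dict, List, Tuple

# The four tasks named in the abstract, each defined by the conditions it includes.
TASKS: Dict[str, Tuple[str, ...]] = {
    "LUAD-LUSC-HEALTHY": ("LUAD", "LUSC", "HEALTHY"),
    "LUAD-LUSC": ("LUAD", "LUSC"),
    "LUAD-HEALTHY": ("LUAD", "HEALTHY"),
    "LUSC-HEALTHY": ("LUSC", "HEALTHY"),
}


def build_tasks(samples: List[Tuple[str, str]]) -> Dict[str, List[Tuple[str, int]]]:
    """Split (sample_id, condition) pairs into per-task lists of
    (sample_id, class_index). Race is deliberately not used as a label.
    Hypothetical helper, not taken from the study."""
    tasks: Dict[str, List[Tuple[str, int]]] = {}
    for name, classes in TASKS.items():
        tasks[name] = [
            (sample_id, classes.index(condition))
            for sample_id, condition in samples
            if condition in classes
        ]
    return tasks


def tasks_for_condition(condition: str) -> List[str]:
    """Return the tasks in which a sample with this condition participates."""
    return [name for name, classes in TASKS.items() if condition in classes]


# Toy sample identifiers (assumed for illustration only).
samples = [("s1", "LUAD"), ("s2", "LUSC"), ("s3", "HEALTHY"), ("s4", "LUAD")]
tasks = build_tasks(samples)
luad_tasks = tasks_for_condition("LUAD")
```

Because every sample of a given condition appears in several tasks, per-patient explanations from each task's model could in principle be compared across tasks; how the study combines them is not specified in the abstract.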