Explainable machine learning approach for cancer prediction through binarilization of RNA sequencing data

Tianjie Chen; Md Faisal Kabir

doi:10.1371/journal.pone.0302947

PLoS One. 2024 May 10;19(5):e0302947. doi: 10.1371/journal.pone.0302947. eCollection 2024.

Authors

Tianjie Chen¹, Md Faisal Kabir¹

Affiliation

¹ Department of Computer Science, Pennsylvania State University Harrisburg, Middletown, Pennsylvania, United States of America.

Abstract

In recent years, researchers have demonstrated the effectiveness and speed of machine learning-based cancer diagnosis models. However, it is difficult to explain the results generated by machine learning models, especially ones that use complex high-dimensional data such as RNA sequencing data. In this study, we propose the binarilization technique as a novel way to treat RNA sequencing data and use it to construct explainable cancer prediction models. We tested our proposed data processing technique on five different models, namely neural network, random forest, XGBoost, support vector machine, and decision tree, using four cancer datasets collected from the National Cancer Institute Genomic Data Commons. Since our datasets are imbalanced, we evaluated the performance of all models using metrics suited to imbalanced data, such as geometric mean, Matthews correlation coefficient, F-measure, and area under the receiver operating characteristic curve. Our approach showed comparable performance while relying on fewer features. Additionally, we demonstrated that data binarilization offers higher explainability by revealing how each feature affects the prediction. These results demonstrate the potential of the data binarilization technique to improve the performance and explainability of RNA sequencing-based cancer prediction models.
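The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions for binary classification, not the authors' code. The function names, the toy labels, and the scores are all hypothetical and chosen only to show why these metrics are more informative than accuracy when classes are imbalanced.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def g_mean(tp, tn, fp, fn):
    """Geometric mean of sensitivity and specificity."""
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sens * spec)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if any margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def f_measure(tp, tn, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative (ties count one half). Quadratic time; fine for a toy example."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical imbalanced toy data: 8 positive samples, 2 negative samples.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
scores = [0.90, 0.80, 0.85, 0.70, 0.95, 0.60, 0.75, 0.40, 0.65, 0.20]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts[0] + counts[1]) / len(y_true)
print("accuracy :", accuracy)
print("G-mean   :", round(g_mean(*counts), 4))
print("MCC      :", mcc(*counts))
print("F-measure:", f_measure(*counts))
print("ROC AUC  :", roc_auc(y_true, scores))
```

On this toy data the accuracy is 0.8, yet the Matthews correlation coefficient is only 0.375 and the geometric mean is about 0.66, because one of the two minority-class samples is misclassified. That gap is the reason imbalance-aware metrics are reported alongside, or instead of, plain accuracy.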

Copyright: © 2024 Chen, Kabir. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Decision Trees
Humans
Machine Learning*
Neoplasms* / genetics
Neural Networks, Computer
ROC Curve
Sequence Analysis, RNA* / methods
Support Vector Machine

Grants and funding

The author(s) received no specific funding for this work.