Efficient feature extraction from highly sparse binary genotype data for cancer prognosis prediction using an auto-encoder

Front Oncol. 2023 Jan 10;12:1091767. doi: 10.3389/fonc.2022.1091767. eCollection 2022.

Abstract

The genome, involving tens of thousands of genes, is a complex system that determines phenotype. An interesting and vital issue is how to integrate highly sparse genomic data, with its mass of minor effects, into a prediction model to improve predictive power. We find that deep learning can extract features effectively by transforming highly sparse dichotomous data into lower-dimensional continuous data in a non-linear way. This may benefit risk prediction based on genotype data. We developed a multi-stage strategy to extract information from highly sparse binary genotype data and applied it to cancer prognosis. Specifically, we first reduced the number of binary biomarkers to a moderate size using a univariable regression model. Then, a trainable auto-encoder was used to learn compact features from the reduced data. Next, we applied LASSO regression to select the optimal combination of extracted features. Lastly, we applied this feature combination to real cancer prognostic models and evaluated the raw predictive effect of the models. The results indicated that these compressed, transformed features could improve the models' original predictive performance and might help avoid overfitting. This idea may be informative for those involved in cancer research, risk reduction, treatment, and patient care through the integration of genomics data.
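
The multi-stage strategy described above (univariable screening, auto-encoder compression, LASSO selection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It makes several simplifying assumptions: the data are synthetic, the outcome is continuous rather than a survival endpoint, screening uses absolute Pearson correlation as a stand-in for a univariable regression model, the auto-encoder has a single hidden layer, and LASSO is fitted by plain coordinate descent. All function names (`screen`, `train_autoencoder`, `lasso_cd`) and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for sparse binary genotype data (assumption, not the paper's data).
n, p = 200, 500
X = (rng.random((n, p)) < 0.05).astype(float)       # roughly 5% of entries are 1
y = X[:, :10].sum(axis=1) + rng.normal(0, 0.5, n)   # continuous outcome for simplicity


def screen(X, y, k):
    """Stage 1: keep the k features with the largest |Pearson correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    score = np.abs(Xc.T @ yc) / denom
    return np.argsort(score)[::-1][:k]


def train_autoencoder(X, n_hidden, epochs=300, lr=0.5, seed=0):
    """Stage 2: one-hidden-layer auto-encoder (tanh encoder, sigmoid decoder),
    trained by full-batch gradient descent on binary cross-entropy."""
    r = np.random.default_rng(seed)
    n_obs, n_in = X.shape
    W1 = r.normal(0, 0.1, (n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = r.normal(0, 0.1, (n_hidden, n_in))
    b2 = np.zeros(n_in)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)                    # compact continuous code
        R = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))    # reconstruction in (0, 1)
        dZ2 = (R - X) / n_obs                       # cross-entropy gradient at decoder input
        dZ1 = (dZ2 @ W2.T) * (1.0 - H ** 2)         # back-propagate through tanh
        W2 -= lr * (H.T @ dZ2)
        b2 -= lr * dZ2.sum(axis=0)
        W1 -= lr * (X.T @ dZ1)
        b1 -= lr * dZ1.sum(axis=0)

    def encode(A):
        return np.tanh(A @ W1 + b1)

    return encode


def lasso_cd(Z, y, alpha, n_iter=200):
    """Stage 3: LASSO by coordinate descent, minimising
    (1/2n) * ||y - b0 - Z b||^2 + alpha * ||b||_1."""
    n_obs, k = Z.shape
    mu = Z.mean(axis=0)
    Zc = Z - mu
    y_mean = y.mean()
    resid = y - y_mean
    beta = np.zeros(k)
    col_sq = (Zc ** 2).sum(axis=0) / n_obs
    for _ in range(n_iter):
        for j in range(k):
            if col_sq[j] == 0:
                continue
            resid += Zc[:, j] * beta[j]             # remove feature j's contribution
            rho = Zc[:, j] @ resid / n_obs
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]  # soft threshold
            resid -= Zc[:, j] * beta[j]
    return beta, y_mean - mu @ beta


keep = screen(X, y, k=50)                           # 500 -> 50 binary features
encode = train_autoencoder(X[:, keep], n_hidden=8)  # 50 binary -> 8 continuous features
Z = encode(X[:, keep])
beta, intercept = lasso_cd(Z, y, alpha=0.01)
selected = np.flatnonzero(beta)                     # indices of extracted features LASSO retains
```

In a real analysis, screening, auto-encoder training, and LASSO fitting would all be done on training data only and evaluated on held-out samples. The sketch omits that split for brevity.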

Keywords: LASSO; auto-encoder; feature extraction; highly sparse binary data; risk prediction.

Grants and funding

This work was supported in part by the National Natural Science Foundation of China (81773541), funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions at Soochow University and the State Key Laboratory of Radiation Medicine and Protection (GZK1201919) to ZT, the National Natural Science Foundation of China (81872552 and U1967220) to JC, and the National Natural Science Foundation of China (82172441) and Suzhou Key Clinical Diagnosis and Treatment Technology Project (LCZX201925) to KL. The funding bodies played no role in the design of the study, in the collection, analysis, or interpretation of data, or in writing the manuscript.