Imputation of Gene Expression Data in Blood Cancer and Its Significance in Inferring Biological Pathways

Front Oncol. 2020 Jan 8:9:1442. doi: 10.3389/fonc.2019.01442. eCollection 2019.

Abstract

Purpose: Gene expression data generated from microarray technology is often analyzed for disease diagnostics and treatment. However, this data suffers with missing values that may lead to inaccurate findings. Since data capture is expensive, time consuming, and is required to be collected from subjects, it is worthwhile to recover missing values instead of re-collecting the data. In this paper, a novel but simple method, namely, DSNN (Doubly Sparse DCT domain with Nuclear Norm minimization) has been proposed for imputing missing values in microarray data. Extensive experiments including pathway enrichment have been carried out on four blood cancer dataset to validate the method as well as to establish the significance of imputation. Methods: A new method, namely, DSNN, was proposed for missing value imputation on gene expression data. Method was validated on four dataset, CLL, AML, MM (Spanish data), and MM (Indian data). All the dataset were downloaded from GEO repository. Missing values were introduced in the original data from 10 to 90% in steps of 10% because method validation requires ground truth. Quantitative results on normalized mean square error (NMSE) between the ground truth and imputed data were computed. To further validate and establish the significance of the proposed imputation method, two experiments were carried out on the data imputed with the proposed method, data imputed with the state-of-art methods, and data with missing values. In the first experiment, classification of normal vs. cancer subjects was carried out. In the second experiment, biological significance of imputation was ascertained by identifying top candidate tumor drivers using the existing state-of-the-art SPARROW algorithm, followed by gene list enrichment analysis on top candidate drivers. Results: Quantitative NMSE results of the DSNN method were compared with three state-of-the-art imputation methods. DSNN method was observed to perform better compared to these other methods both at high as well as low observable data. Experiment-1 demonstrated superior results on classification with imputation compared to that performed on missing data matrix as well as compared to classification on imputed data with existing methods. In experiment-2, cancer affected pathways were discovered with higher significance in the data imputed with the proposed method compared to those discovered with the missing data matrix. Conclusion: Missing value problem in microarray data is a serious problem and can adversely influence downstream analysis. A novel method, namely, DSNN is proposed for missing value imputation. The method is validated quantitatively on the application of classification and biologically by performing pathway enrichment analysis.

Keywords: AML; CLL; MM; blood cancer; compressive sensing; gene enrichment analysis; machine learning; matrix imputation.