Regulatory network-based imputation of dropouts in single-cell RNA sequencing data

Ana Carolina Leote; Xiaohui Wu; Andreas Beyer

doi:10.1371/journal.pcbi.1009849

PLoS Comput Biol. 2022 Feb 17;18(2):e1009849. doi: 10.1371/journal.pcbi.1009849. eCollection 2022 Feb.

Authors

Ana Carolina Leote ¹ ², Xiaohui Wu ¹ ³ ⁴, Andreas Beyer ¹ ² ⁵ ⁶

Affiliations

¹ Cluster of Excellence Cellular Stress Responses in Aging-associated Diseases (CECAD), Cologne, Germany.
² University of Cologne, Faculty of Medicine and Cologne University Hospital, Cologne, Germany.
³ Department of Automation, Xiamen University, Xiamen, China.
⁴ Pasteurien College, Soochow University, Suzhou, China.
⁵ Center for Molecular Medicine Cologne (CMMC), Cologne, Germany.
⁶ Cologne School for Computational Biology & Center for Data Science and Simulation, University of Cologne, Cologne, Germany.

Abstract

Single-cell RNA sequencing (scRNA-seq) methods are typically unable to quantify the expression levels of all genes in a cell, creating a need for the computational prediction of missing values ('dropout imputation'). Most existing dropout imputation methods are limited in the sense that they exclusively use the scRNA-seq dataset at hand and do not exploit external gene-gene relationship information. Further, it is unknown if all genes equally benefit from imputation or which imputation method works best for a given gene. Here, we show that a transcriptional regulatory network learned from external, independent gene expression data improves dropout imputation. Using a variety of human scRNA-seq datasets we demonstrate that our network-based approach outperforms published state-of-the-art methods. The network-based approach performs particularly well for lowly expressed genes, including cell-type-specific transcriptional regulators. Further, the cell-to-cell variation of 11.3% to 48.8% of the genes could not be adequately imputed by any of the methods that we tested. In those cases gene expression levels were best predicted by the mean expression across all cells, i.e. assuming no measurable expression variation between cells. These findings suggest that different imputation methods are optimal for different genes. We thus implemented an R-package called ADImpute (available via Bioconductor https://bioconductor.org/packages/release/bioc/html/ADImpute.html) that automatically determines the best imputation method for each gene in a dataset. Our work represents a paradigm shift by demonstrating that there is no single best imputation method. Instead, we propose that imputation should maximally exploit external information and be adapted to gene-specific features, such as expression level and expression variation across cells.
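The idea of choosing an imputation method separately for each gene can be illustrated with a minimal Python sketch. This is not ADImpute's implementation (ADImpute is an R package), and the abstract does not state how the best method is determined. The sketch assumes a simple scheme: hide a fraction of each gene's observed values, predict them with each candidate method, and keep the method with the lowest mean squared error. All function names, the toy data, and the regulator coefficient are hypothetical.

```python
import random

def baseline_mean(expr, gene, cell, hidden):
    """Predict a gene by its mean over the cells that were not hidden."""
    visible = [v for c, v in enumerate(expr[gene]) if c not in hidden]
    return sum(visible) / len(visible)

def make_network_predictor(weights):
    """Build a predictor from weights[target][regulator] = coefficient.

    The coefficients are assumed to have been fitted on external data.
    """
    def predict(expr, gene, cell, hidden):
        regulators = weights.get(gene)
        if not regulators:
            return None  # no network model exists for this gene
        return sum(coef * expr[reg][cell] for reg, coef in regulators.items())
    return predict

def select_method_per_gene(expr, methods, mask_fraction=0.25, seed=0):
    """Return {gene: name of the method with the lowest masked-value error}."""
    rng = random.Random(seed)
    best = {}
    for gene, values in expr.items():
        # Only non-zero entries are treated as observed and eligible for hiding.
        observed = [c for c, v in enumerate(values) if v > 0]
        if len(observed) < 2:
            continue  # too few observed values to evaluate this gene
        k = max(1, int(mask_fraction * len(observed)))
        hidden = sorted(rng.sample(observed, k))
        errors = {}
        for name, predict in methods.items():
            preds = [predict(expr, gene, c, hidden) for c in hidden]
            if any(p is None for p in preds):
                continue  # this method cannot predict this gene
            errors[name] = sum((p - values[c]) ** 2
                               for p, c in zip(preds, hidden)) / k
        if errors:
            best[gene] = min(errors, key=errors.get)
    return best

# Toy data: target gene "T" is exactly twice its regulator "R" in every cell.
regulator = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
expr = {"R": regulator, "T": [2.0 * r for r in regulator]}
methods = {
    "baseline": baseline_mean,
    "network": make_network_predictor({"T": {"R": 2.0}}),
}
best = select_method_per_gene(expr, methods)
print(best)
```

In this toy setting the network predictor is selected for the regulated gene, while the gene without a network model falls back to the mean across cells, mirroring the abstract's observation that some genes are best predicted by their mean expression.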

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Exome Sequencing
Gene Expression Profiling
Gene Regulatory Networks / genetics
Humans
RNA
Sequence Analysis, RNA
Single-Cell Analysis* / methods
Software*

Substances

RNA

Grants and funding

A.C.L. received support from the Cologne Graduate School of Ageing Research, funded by the Deutsche Forschungsgemeinschaft (DFG), German Research Foundation, under Germany's Excellence Strategy - EXC 2030/1 - 390661388. X.W. received financial support from the National Natural Science Foundation of China (61871463) and Natural Science Foundation of Fujian Province of China (2017J01068). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.