Robust classification of single-cell transcriptome data by nonnegative matrix factorization

Bioinformatics. 2017 Jan 15;33(2):235-242. doi: 10.1093/bioinformatics/btw607. Epub 2016 Sep 23.

Abstract

Motivation: Single-cell transcriptome data provide unprecedented resolution to study heterogeneity in cell populations and present a challenge for unsupervised classification. Popular methods, like principal component analysis (PCA), often suffer from the high level of noise in the data.

Results: Here we adapt Nonnegative Matrix Factorization (NMF) to the problem of identifying subpopulations in single-cell transcriptome data. In contrast to the conventional gene-centered view of NMF, which identifies metagenes, we used NMF in a cell-centered direction, identifying cell subtypes ('metacells'). Using three different datasets (based on RT-qPCR and single-cell RNA-seq data), we show that NMF outperforms PCA in identifying subpopulations accurately and robustly, without the need for prior feature selection; moreover, NMF successfully recovered the broad classes on a large dataset (thousands of single-cell transcriptomes), as identified by a computationally sophisticated method. NMF also allows feature genes to be identified in a direct, unbiased manner. We propose novel approaches for determining a biologically meaningful number of subpopulations based on minimizing the ambiguity of classification. In conclusion, our study shows that NMF is a robust, informative and simple method for the unsupervised learning of cell subtypes from single-cell gene expression data.
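The cell-centered use of NMF described above can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the nimfa API. It assumes a nonnegative genes-by-cells matrix V, factors it as V ≈ W H with the standard Lee-Seung multiplicative updates, and assigns each cell to the component with the largest coefficient in H. The function names, the toy data and the choice of rank 2 are all hypothetical and chosen only for illustration.

```python
import numpy as np


def nmf_multiplicative(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (genes x cells) as W @ H.

    Uses Lee-Seung multiplicative updates for the Frobenius objective.
    W is genes x rank (gene loadings per component) and
    H is rank x cells (component weights per cell).
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = V.shape
    W = rng.random((n_genes, rank)) + eps
    H = rng.random((rank, n_cells)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # Rescale so each column of W sums to 1; H absorbs the scale.
    # This makes the rows of H comparable before taking an argmax.
    scale = W.sum(axis=0)
    W = W / scale
    H = H * scale[:, None]
    return W, H


def assign_cells(H):
    """Assign each cell (column of H) to its dominant component."""
    return np.argmax(H, axis=0)


# Toy data (assumption): 20 genes, 12 cells, two cell groups,
# each with its own block of highly expressed genes plus weak noise.
rng = np.random.default_rng(1)
V = rng.random((20, 12)) * 0.1
V[:10, :6] += 5.0   # genes 0-9 are high in cells 0-5
V[10:, 6:] += 5.0   # genes 10-19 are high in cells 6-11

W, H = nmf_multiplicative(V, rank=2)
labels = assign_cells(H)

# Genes with the largest loading in each column of W are candidate
# feature genes for that component.
top_genes = np.argsort(-W, axis=0)[:5]
print(labels)
print(top_genes.T)
```

In this sketch the component numbering is arbitrary, so only the grouping of cells is meaningful, not the label values themselves. Choosing the rank, which the abstract addresses by minimizing the ambiguity of classification, is not covered here.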

Availability and implementation: https://github.com/ccshao/nimfa

Contacts: [email protected] or [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

  • Animals
  • Cerebral Cortex / cytology
  • Cerebral Cortex / metabolism
  • Computational Biology / methods
  • Gene Expression Profiling / methods*
  • Mice
  • Sequence Analysis, RNA / methods
  • Single-Cell Analysis / methods*
  • Software*
  • Statistics as Topic / methods*