Non-Gaussian distributions affect identification of expression patterns, functional annotation, and prospective classification in human cancer genomes

PLoS One. 2012;7(10):e46935. doi: 10.1371/journal.pone.0046935. Epub 2012 Oct 31.

Abstract

Introduction: Gene expression data are often assumed to be normally distributed, but this assumption has not been tested rigorously. We investigate the distribution of expression data in human cancer genomes and study the implications of deviations from the normal distribution for translational molecular oncology research.

Methods: We conducted a central moments analysis of five cancer genomes and performed empirical distribution fitting to examine the true distribution of expression data at both the complete-experiment and the individual-gene levels. We then used a variety of parametric and nonparametric methods to test the effects of deviations from normality on gene calling, functional annotation, and prospective molecular classification in a sixth cancer genome.
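To illustrate what a central moments analysis involves, the following minimal sketch computes sample skewness and excess kurtosis and combines them into a Jarque-Bera statistic. It is not the authors' pipeline: the function names, the simulated data, and the choice of the Jarque-Bera test are assumptions made here for illustration only.

```python
import math
import random


def central_moment(values, order):
    """Return the order-th central moment (population form, dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def skewness(values):
    """Third standardized moment; 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


def jarque_bera(values):
    """Jarque-Bera statistic and its asymptotic p-value.

    Under normality the statistic is approximately chi-square with 2 degrees
    of freedom, whose survival function is exactly exp(-x / 2).
    """
    n = len(values)
    s = skewness(values)
    k = excess_kurtosis(values)
    stat = n / 6.0 * (s ** 2 + k ** 2 / 4.0)
    return stat, math.exp(-stat / 2.0)


# Simulated (hypothetical) expression values, not data from the study.
rng = random.Random(0)
gaussian_like = [rng.gauss(8.0, 1.0) for _ in range(2000)]
heavy_tailed = [rng.lognormvariate(2.0, 0.75) for _ in range(2000)]

for name, sample in (("gaussian_like", gaussian_like), ("heavy_tailed", heavy_tailed)):
    stat, p_value = jarque_bera(sample)
    print(name, round(skewness(sample), 3), round(excess_kurtosis(sample), 3), p_value)
```

In this sketch, the same functions could be applied either to all values in an experiment or to each gene's values across samples, mirroring the two levels of analysis described above.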

Results: Central moments analyses reveal statistically significant deviations from normality in all of the analyzed cancer genomes. We observe as much as 37% variability in gene calling, 39% variability in functional annotation, and 30% variability in prospective molecular tumor subclassification associated with this effect.

Conclusions: Cancer gene expression profiles are not normally distributed, either on the complete-experiment or on the individual-gene level. Instead, they exhibit complex, heavy-tailed distributions characterized by statistically significant skewness and kurtosis. The non-Gaussian distribution of these data affects identification of differentially expressed genes, functional annotation, and prospective molecular classification. These effects may be reduced in some circumstances, although not completely eliminated, by using nonparametric analytics. This analysis highlights two unreliable assumptions of translational cancer gene expression analysis: that "small" departures from normality in the expression data distributions are analytically insignificant and that "robust" gene-calling algorithms can fully compensate for these effects.
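A toy example may help show why a parametric and a nonparametric statistic can disagree when the data are heavy-tailed. The sketch below compares Welch's t statistic with the Mann-Whitney U statistic for a single gene, before and after one extreme value is introduced. The expression values are invented for illustration, and the abstract does not name these two tests; they stand in here for "parametric" and "nonparametric" methods in general.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def mann_whitney_u(a, b):
    """Mann-Whitney U for sample a: count of pairs with a > b, ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical log-scale expression values for one gene in two tumor groups.
tumor_a = [7.1, 7.4, 7.6, 7.9, 8.0, 8.2, 8.3, 8.5]
tumor_b = [6.0, 6.3, 6.5, 6.8, 7.0, 7.2, 7.5, 7.7]

# Same data, except one group B sample is replaced by an extreme value.
tumor_b_outlier = tumor_b[:-1] + [14.0]

print("clean:  ", welch_t(tumor_a, tumor_b), mann_whitney_u(tumor_a, tumor_b))
print("outlier:", welch_t(tumor_a, tumor_b_outlier), mann_whitney_u(tumor_a, tumor_b_outlier))
```

With these invented values, the single extreme observation shrinks the t statistic from roughly 3.7 to roughly 0.2, while U moves only from 58 to 53 out of a possible 64. The rank-based statistic is less sensitive to the outlier but still shifts, which is consistent with the statement that nonparametric methods reduce, rather than eliminate, the effect.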

Publication types

  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Computational Biology
  • Gene Expression Profiling
  • Genome, Human*
  • Humans
  • Molecular Sequence Annotation
  • Neoplasms / classification
  • Neoplasms / genetics*
  • Normal Distribution*

Grants and funding

NFM is supported by a grant from the American Association of Neurological Surgeons' William P. VanWagenen Fellowship program. RJW is supported in part by Grant No. W81XWH-062-0033 from the United States Department of Defense Breast Cancer Research Program, by the Melvin Burkhardt chair in neurosurgical oncology, and by the Karen Colina Wilson research endowment within the Brain Tumor and Neuro-oncology Center at the Cleveland Clinic Foundation. No additional external funding was received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.