Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data

Bioinformatics. 2018 Jun 1;34(11):1868-1874. doi: 10.1093/bioinformatics/bty026.

Abstract

Motivation: Molecular subtypes of cancers and autoimmune diseases, defined by transcriptomic profiling, have provided insight into disease pathogenesis, molecular heterogeneity and therapeutic responses. However, technical biases inherent to different gene expression profiling platforms make it difficult to jointly analyze data generated by different studies. Effective methods designed to eliminate platform-based bias are currently lacking. We present a method that normalizes RNA-seq data so that it can be classified by machine learning classifiers trained on DNA microarray data, and we evaluate it on molecular subtypes in two datasets: breast invasive carcinoma (BRCA) and colorectal cancer (CRC).

Results: Multiple analyses show that feature specific quantile normalization (FSQN) successfully removes platform-based bias from RNA-seq data, regardless of feature scaling or machine learning algorithm. We achieve up to 98% accuracy for BRCA data and 97% accuracy for CRC data in assigning molecular subtypes to RNA-seq data normalized using FSQN and a support vector machine trained exclusively on DNA microarray data. Maximum accuracy is achieved when the RNA-seq dataset being normalized contains at least 25 samples. FSQN allows comparison of RNA-seq data to existing DNA microarray datasets. Using these techniques, we can successfully leverage information from existing gene expression data in new analyses despite the different platforms used for gene expression profiling.
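
The idea of normalizing each feature separately can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "feature specific" means each gene's values in the RNA-seq (test) set are mapped, by rank, onto the quantiles of the same gene in the microarray (target) set. The function names `normalize_feature` and `fsqn_sketch`, the row-per-sample layout, the linear interpolation between target quantiles and the arbitrary ordering of tied values are all assumptions made for illustration.

```python
from typing import List, Sequence


def _quantile(sorted_vals: Sequence[float], q: float) -> float:
    """Linearly interpolated quantile (q in [0, 1]) of an already-sorted sequence."""
    if len(sorted_vals) == 1:
        return float(sorted_vals[0])
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def normalize_feature(values: Sequence[float], target: Sequence[float]) -> List[float]:
    """Map one feature's test values onto the target distribution of that feature.

    Each test value is replaced by the target quantile matching its rank.
    Ties are ordered arbitrarily here (assumption; no tie averaging).
    """
    n = len(values)
    target_sorted = sorted(target)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        q = rank / (n - 1) if n > 1 else 0.5
        out[idx] = _quantile(target_sorted, q)
    return out


def fsqn_sketch(test: Sequence[Sequence[float]],
                target: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply normalize_feature column by column (rows = samples, columns = genes)."""
    n_features = len(test[0])
    if any(len(row) != n_features for row in list(test) + list(target)):
        raise ValueError("test and target must share the same features")
    columns = []
    for j in range(n_features):
        test_col = [row[j] for row in test]
        target_col = [row[j] for row in target]
        columns.append(normalize_feature(test_col, target_col))
    # Transpose back to one row per sample.
    return [list(row) for row in zip(*columns)]


# Toy example: 3 RNA-seq samples and 3 microarray samples, 2 genes each.
rnaseq = [[100.0, 5.0], [300.0, 1.0], [200.0, 3.0]]
microarray = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = fsqn_sketch(rnaseq, microarray)
```

In this sketch the normalized values depend only on the rank of each sample within the test set, so a very small test set gives a coarse estimate of each gene's distribution. That may be one reason a minimum sample count matters, though the abstract does not state the cause.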

Availability and implementation: FSQN has been submitted as an R package to CRAN. All code used for this study is available on GitHub (https://github.com/jenniferfranks/FSQN).

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Breast Neoplasms / genetics
  • Colonic Neoplasms / genetics
  • Colorectal Neoplasms / genetics
  • Female
  • Gene Expression Profiling / methods*
  • Gene Expression Regulation, Neoplastic
  • Humans
  • Neoplasms / genetics*
  • Oligonucleotide Array Sequence Analysis / methods*
  • Sequence Analysis, RNA / methods
  • Software
  • Support Vector Machine*