A merged microarray meta-dataset for transcriptionally profiling colorectal neoplasm formation and progression

Sci Data. 2021 Aug 11;8(1):214. doi: 10.1038/s41597-021-00998-5.

Abstract

Transcriptional profiling of pre- and post-malignant colorectal cancer (CRC) lesions enables temporal monitoring of the molecular events underlying neoplastic progression. However, the most widely used transcriptomic dataset for CRC, TCGA-COAD, is devoid of adenoma samples, which increases reliance on an assortment of disparate microarray studies and hinders consensus building. To address this, we developed a microarray meta-dataset comprising 231 healthy, 132 adenoma, and 342 CRC tissue samples from twelve independent studies. Using a stringent analytic framework, selected datasets were downloaded from the Gene Expression Omnibus, normalized by frozen robust multiarray averaging, and subsequently merged. Batch effects were then identified and removed by empirical Bayes estimation (ComBat). Finally, the meta-dataset was filtered for low-variance probes, enabling downstream differential expression analysis as well as quantitative and functional validation through cross-platform correlation and enrichment analyses, respectively. Overall, our meta-dataset provides a robust tool for investigating colorectal adenoma formation and malignant transformation at the transcriptional level, with a pipeline that is modular and readily adaptable for similar analyses in other cancer types.
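
The merge, batch-adjustment, and probe-filtering steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy data layout (one dict of probe ID to per-sample values per study), and the variance threshold are all assumptions made for illustration. In particular, the per-batch mean centering shown here is a simplified, location-only stand-in and is not ComBat, which additionally models batch-specific scale and shrinks its estimates by empirical Bayes.

```python
from statistics import mean, pvariance


def merge_on_common_probes(datasets):
    """Merge studies on the probes they all share.

    datasets: list of dicts mapping probe_id -> list of expression values
    (one value per sample). Returns the merged dict and a list giving the
    batch (study index) of each merged sample.
    """
    common = set(datasets[0])
    for d in datasets[1:]:
        common &= set(d)
    merged = {p: [v for d in datasets for v in d[p]] for p in sorted(common)}
    batch = []
    for i, d in enumerate(datasets):
        n_samples = len(next(iter(d.values())))
        batch.extend([i] * n_samples)
    return merged, batch


def center_batches(merged, batch):
    """Shift each batch so its per-probe mean equals the overall probe mean.

    Simplified stand-in for batch correction (assumption); NOT ComBat.
    """
    adjusted = {}
    for probe, values in merged.items():
        grand = mean(values)
        batch_means = {
            b: mean(v for v, vb in zip(values, batch) if vb == b)
            for b in set(batch)
        }
        adjusted[probe] = [v - batch_means[b] + grand for v, b in zip(values, batch)]
    return adjusted


def filter_low_variance(merged, min_variance):
    """Keep probes whose variance across all samples is at least min_variance."""
    return {p: v for p, v in merged.items() if pvariance(v) >= min_variance}


# Toy example with two hypothetical studies; "p3" is absent from study_b.
study_a = {"p1": [5.0, 7.0], "p2": [3.0, 3.0], "p3": [1.0, 2.0]}
study_b = {"p1": [8.0, 10.0], "p2": [4.0, 4.0]}

merged, batch = merge_on_common_probes([study_a, study_b])
adjusted = center_batches(merged, batch)
kept = filter_low_variance(adjusted, min_variance=0.5)
print(sorted(kept))
```

In this toy case the apparent variation in probe "p2" is entirely a between-study offset, so it disappears after centering and the probe is filtered out, whereas "p1" retains its within-study variation. Centering of this kind can also remove real biological signal when sample types are unevenly distributed across studies, which is one reason covariate-aware methods such as ComBat are used in practice.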

Publication types

  • Dataset
  • Research Support, N.I.H., Extramural

MeSH terms

  • Adenoma / genetics*
  • Adenoma / pathology*
  • Aged
  • Cell Transformation, Neoplastic / genetics*
  • Colorectal Neoplasms / genetics*
  • Colorectal Neoplasms / pathology*
  • Female
  • Gene Expression Profiling*
  • Humans
  • Male
  • Metadata
  • Middle Aged
  • Oligonucleotide Array Sequence Analysis
  • Transcriptome