Large-scale atlas of microarray data reveals the distinct expression landscape of different tissues in Arabidopsis

Fei He; Shinjae Yoo; Daifeng Wang; Sunita Kumari; Mark Gerstein; Doreen Ware; Sergei Maslov

doi:10.1111/tpj.13175

Plant J. 2016 Jun;86(6):472-80. doi: 10.1111/tpj.13175.

Authors

Fei He¹, Shinjae Yoo²,³, Daifeng Wang⁴, Sunita Kumari⁵, Mark Gerstein⁴, Doreen Ware⁵,⁶, Sergei Maslov¹,⁷,⁸

Affiliations

¹ Biology Department, Brookhaven National Laboratory, Upton, NY, 11973, USA.
² Computational Science Center, Brookhaven National Laboratory, Upton, NY, 11973, USA.
³ Institute of Advanced Computational Science at Stony Brook University, Stony Brook, NY, 11794, USA.
⁴ Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA.
⁵ Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA.
⁶ USDA ARS NEA Plant, Soil & Nutrition Laboratory Research Unit, USDA-ARS, Ithaca, NY, 14853, USA.
⁷ Department of Bioengineering, Carl R. Woese Institute for Genomic Biology, Urbana, IL, 61801, USA.
⁸ National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA.

PMID: 27015116
DOI: 10.1111/tpj.13175

Abstract

Transcriptome data sets from thousands of samples of the model plant Arabidopsis thaliana have been collectively generated by multiple individual labs. Although integration and meta-analysis of these samples have become routine in the plant research community, they are often hampered by a lack of metadata or by differences in annotation style between labs. In this study, we carefully selected and integrated 6057 Arabidopsis microarray expression samples from 304 experiments deposited in the Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI). Metadata such as tissue type, growth conditions and developmental stage were manually curated for each sample. We then studied the global expression landscape of the integrated data set and found that samples of the same tissue tend to be more similar to each other than to samples of other tissues, even in different growth conditions or developmental stages. Root has the most distinct transcriptome compared with aerial tissues, but the transcriptome of cultured root is more similar to that of aerial tissues, as the cultured root samples lost their cellular identity. Using a simple computational classification method, we showed that the tissue type of a sample can be successfully predicted based on its expression profile, opening the door for automatic metadata extraction and facilitating the re-use of plant transcriptome data. As a proof of principle, we applied our automated annotation pipeline to 708 RNA-seq samples from public repositories and verified the accuracy of our predictions with sample metadata provided by the authors.

Keywords: Arabidopsis; automatic reconstruction of missing metadata; expression data integration; global transcriptional landscape; metadata annotation; re-use of public expression data.
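
The abstract does not name the classification method used, so the following is only an illustrative sketch of how tissue type might be predicted from an expression profile. It assumes a nearest-centroid classifier scored by Pearson correlation, which is a stand-in technique and not necessarily the authors' approach. The data are synthetic, and every name here (TISSUES, make_signatures, sample_profiles, fit_centroids, predict_tissue) is hypothetical.

```python
import math
import random

# Hypothetical tissue labels; the real data set uses manually curated metadata.
TISSUES = ["root", "leaf", "flower"]


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def make_signatures(tissues, n_genes, rng):
    """Synthetic per-tissue expression signatures (an assumption, not real data)."""
    return {t: [rng.gauss(0, 1) for _ in range(n_genes)] for t in tissues}


def sample_profiles(signatures, n_per_tissue, noise, rng):
    """Draw noisy samples around each tissue signature."""
    profiles, labels = [], []
    for tissue, signature in signatures.items():
        for _ in range(n_per_tissue):
            profiles.append([g + rng.gauss(0, noise) for g in signature])
            labels.append(tissue)
    return profiles, labels


def fit_centroids(profiles, labels):
    """Average the training profiles of each tissue into one centroid."""
    grouped = {}
    for profile, tissue in zip(profiles, labels):
        grouped.setdefault(tissue, []).append(profile)
    return {
        tissue: [sum(col) / len(col) for col in zip(*members)]
        for tissue, members in grouped.items()
    }


def predict_tissue(profile, centroids):
    """Assign the tissue whose centroid correlates best with the profile."""
    return max(centroids, key=lambda t: pearson(profile, centroids[t]))


rng = random.Random(0)
signatures = make_signatures(TISSUES, n_genes=50, rng=rng)
train_x, train_y = sample_profiles(signatures, n_per_tissue=20, noise=0.5, rng=rng)
test_x, test_y = sample_profiles(signatures, n_per_tissue=10, noise=0.5, rng=rng)

centroids = fit_centroids(train_x, train_y)
predicted = [predict_tissue(p, centroids) for p in test_x]
accuracy = sum(p == t for p, t in zip(predicted, test_y)) / len(test_y)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Correlation is used as the similarity score because it is insensitive to per-sample shifts and scaling, which loosely mirrors the observation that samples of the same tissue resemble each other across growth conditions. On real microarray data, normalization and batch effects would need to be handled before any such comparison.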

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Arabidopsis / genetics*
Arabidopsis Proteins / genetics
Gene Expression Regulation, Plant / genetics
Oligonucleotide Array Sequence Analysis / methods*

Substances

Arabidopsis Proteins