YuGene: a simple approach to scale gene expression data derived from different platforms for integrated analyses

Kim-Anh Lê Cao; Florian Rohart; Leo McHugh; Othmar Korn; Christine A Wells

doi:10.1016/j.ygeno.2014.03.001

Genomics. 2014 Apr;103(4):239-51. doi: 10.1016/j.ygeno.2014.03.001. Epub 2014 Mar 22.

Authors

Kim-Anh Lê Cao¹, Florian Rohart², Leo McHugh³, Othmar Korn², Christine A Wells⁴

Affiliations

¹ Queensland Facility for Advanced Bioinformatics, The University of Queensland, St Lucia 4072, Australia; Institute for Molecular Biology, The University of Queensland, St Lucia 4072, Australia.
² Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, St. Lucia 4072, Australia.
³ Queensland Facility for Advanced Bioinformatics, The University of Queensland, St Lucia 4072, Australia.
⁴ Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, St. Lucia 4072, Australia; The Institute for Infection, Immunity & Inflammation, College of Medical, Veterinary and Life Sciences, Glasgow University, G12 8TA, UK.

PMID: 24667244
DOI: 10.1016/j.ygeno.2014.03.001

Abstract

Gene expression databases contain invaluable information about a range of cell states, but the question "Where is my gene of interest expressed?" remains one of the most difficult to assess systematically when relevant data is derived on different platforms. Barriers to integrating this data include disparities in data formats and scale, a lack of common identifiers, and the disproportionate contribution of a platform to the 'batch effect'. There are few purpose-built cross-platform normalization strategies, and most of these fit data to an idealized data structure, which in turn may compromise gene expression comparisons between different platforms. YuGene addresses this gap by providing a simple transform that assigns a modified cumulative proportion value to each measurement, without losing essential underlying information on data distributions or experimental correlates. The YuGene transform is applied to individual samples and is suitable for data with different distributions. YuGene is robust to combining datasets of different sizes, does not require global renormalization as new data is added, and does not require a common identifier. YuGene was benchmarked against commonly used normalization approaches and performed favorably in comparison to quantile (RMA), Z-score or rank methods. Implementation in the www.stemformatics.org resource provides users with expression queries across stem cell related datasets. Probe performance statistics, including poorly performing (never expressed) probes, and examples of probes/genes expressed in a sample-restricted manner are provided. The YuGene software is implemented as an R package available from CRAN.
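
The abstract does not give the formula for the "modified cumulative proportion", so the following Python sketch only illustrates the general idea of a per-sample cumulative-proportion scaling. It assumes non-negative expression values, ranks them in decreasing order within one sample, and scores each value as one minus its cumulative share of the sample total. The function names and the tie handling are hypothetical and are not taken from the YuGene R package.

```python
def cumulative_proportion_transform(sample):
    """Scale one sample's non-negative expression values to [0, 1).

    Illustrative assumption, not the published YuGene definition:
    score = 1 - (cumulative sum in decreasing order) / (sample total).
    """
    if any(value < 0 for value in sample):
        raise ValueError("expression values must be non-negative")
    total = sum(sample)
    if total <= 0:
        raise ValueError("sample must contain at least one positive value")

    # Indices ordered from the highest to the lowest expression value.
    # Ties keep their input order, so tied values get different scores.
    order = sorted(range(len(sample)), key=lambda i: sample[i], reverse=True)

    scaled = [0.0] * len(sample)
    running = 0.0
    for i in order:
        running += sample[i]
        scaled[i] = 1.0 - running / total
    return scaled


def transform_samples(samples):
    """Apply the transform to each sample independently.

    `samples` maps a sample name to its list of expression values.
    No information is shared between samples, so adding a new sample
    never changes the scores already computed for the others.
    """
    return {name: cumulative_proportion_transform(values)
            for name, values in samples.items()}


if __name__ == "__main__":
    # Two hypothetical samples measured on very different scales.
    data = {
        "platform_A": [4.0, 1.0, 3.0, 2.0],
        "platform_B": [400.0, 100.0, 300.0, 200.0],
    }
    for name, scores in transform_samples(data).items():
        print(name, [round(s, 3) for s in scores])
```

Because each value is divided by its own sample total, multiplying a whole sample by a constant leaves the scores unchanged, which is one way a per-sample transform can sidestep scale differences between platforms. The two hypothetical samples above therefore receive the same scores.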

Keywords: Cross platform normalization; Gene expression; Microarray.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Computational Biology / methods
Databases, Genetic*
Gene Expression Profiling / methods*
Humans
Internet
Oligonucleotide Array Sequence Analysis
Software*
Stem Cells