Microarray-based transcriptomics experiments often suffer from limited statistical power due to small sample sizes. Quantile discretization (QD) maps the expression values of a sample into a series of equally sized 'bins' that represent a discrete numerical range, e.g. $-4$ to $+4$, which enables normalized data from multiple experiments and/or expression platforms to be combined for re-analysis. We found, however, that informal selection of the number of bins often resulted in loss of the underlying correlation structure in the data, because the same numerical value was assigned to genes that are in reality expressed at significantly different levels within a sample. Here we report a procedure for determining an optimal bin number for a dataset. Applying this procedure to integrated public breast cancer datasets enabled statistical identification of several differentially expressed tumorigenesis-related genes that were not found when analyzing the individual datasets, as well as several cancer biomarkers not previously indicated as having utility in the disease. Notably, differential modulation of translational control and protein synthesis via multiple pathways was found to potentially have a central role in breast cancer development and progression. These findings suggest that our protocol has significant utility for making meaningful novel biomedical discoveries by leveraging large public expression data repositories.
Keywords: Microarray; cross-platform integration; differential expression; quantile discretization; statistical analysis.
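To make the QD mapping described in the abstract concrete, the following is a minimal sketch of rank-based binning for a single sample. It is an illustration under stated assumptions, not the authors' implementation: the function name `quantile_discretize`, the parameter `n_bins`, and the zero-centred integer labels are hypothetical choices, and the procedure for selecting the optimal bin number is not shown.

```python
import numpy as np


def quantile_discretize(values, n_bins=9):
    """Map one sample's expression values to equally sized, rank-based bins.

    Hypothetical helper for illustration only. With an odd n_bins the
    labels are centred on zero, e.g. n_bins=9 gives labels -4 to +4.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    # Rank genes within the sample (0 = lowest expression).
    # A stable sort keeps tied values in their input order.
    order = np.argsort(values, kind="stable")
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)
    # Split the ranks into n_bins groups of (near-)equal size.
    bins = (ranks * n_bins) // n
    # Shift the labels so that they are centred on zero.
    return bins - (n_bins - 1) // 2


# Example: 18 genes in 9 bins, so each label from -4 to +4 is used twice.
sample = np.array([7.1, 2.3, 9.8, 4.4, 5.0, 1.2, 8.7, 3.3, 6.6,
                   2.9, 9.1, 0.5, 4.9, 7.7, 5.8, 1.9, 8.0, 3.8])
labels = quantile_discretize(sample, n_bins=9)
```

Because the labels depend only on within-sample ranks, samples from different platforms end up on a common scale. The same sketch also shows the risk noted in the abstract: with a small `n_bins`, genes with clearly different expression levels receive the same label.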