Shrinkage-based similarity metric for cluster analysis of microarray data

Vera Cherepinsky; Jiawu Feng; Marc Rejali; Bud Mishra

doi:10.1073/pnas.1633770100

Proc Natl Acad Sci U S A. 2003 Aug 19;100(17):9668-73. doi: 10.1073/pnas.1633770100. Epub 2003 Aug 5.

Authors

Vera Cherepinsky¹, Jiawu Feng, Marc Rejali, Bud Mishra

Affiliation

¹ Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10012, USA.

Abstract

The current standard correlation coefficient used in the analysis of microarray data was introduced by M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein [(1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868]. Its formulation is rather arbitrary. We give a mathematically rigorous correlation coefficient of two data vectors based on James-Stein shrinkage estimators. We use the assumptions described by Eisen et al., also using the fact that the data can be treated as transformed into normal distributions. While Eisen et al. use zero as an estimator for the expression vector mean μ, we start with the assumption that for each gene, μ is itself a zero-mean normal random variable [with a priori distribution N(0, τ²)], and use Bayesian analysis to obtain a posteriori distribution of μ in terms of the data. The shrunk estimator for μ differs from the mean of the data vectors and ultimately leads to a statistically robust estimator for correlation coefficients. To evaluate the effectiveness of shrinkage, we conducted in silico experiments and also compared similarity metrics on a biological example by using the data set from Eisen et al. For the latter, we classified genes involved in the regulation of yeast cell-cycle functions by computing clusters based on various definitions of correlation coefficients and contrasting them against clusters based on the activators known in the literature. The estimated false positives and false negatives from this study indicate that using the shrinkage metric improves the accuracy of the analysis.
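
The shrinkage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the textbook normal-normal model, in which observations x_i ~ N(μ, β²) with prior μ ~ N(0, τ²) give the posterior mean τ² / (τ² + β²/n) times the sample mean, and it treats τ² and β² as known inputs rather than estimating them from the data as a James-Stein procedure would. The function names `shrunk_mean` and `offset_correlation`, and the symbol β² for the observation variance, are hypothetical choices made for this illustration.

```python
import numpy as np


def shrunk_mean(x, tau2, beta2):
    """Posterior mean of mu when x_i ~ N(mu, beta2) and mu ~ N(0, tau2).

    Assumption for illustration: tau2 and beta2 are supplied by the caller.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # Shrinkage factor: 0 pulls the estimate to zero, values near 1 keep
    # the sample mean almost unchanged.
    gamma = tau2 / (tau2 + beta2 / n)
    return gamma * x.mean()


def offset_correlation(x, y, x_offset, y_offset):
    """Correlation of x and y after subtracting the given offsets.

    Zero offsets give an uncentered correlation; sample-mean offsets give
    the Pearson correlation; shrunk means fall in between.
    """
    dx = np.asarray(x, dtype=float) - x_offset
    dy = np.asarray(y, dtype=float) - y_offset
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx * dx) * np.sum(dy * dy)))


# Usage: compare the three choices of offset on two short, made-up vectors.
x = [1.0, 2.0, 4.0, 3.0]
y = [2.0, 1.0, 5.0, 4.0]
r_zero = offset_correlation(x, y, 0.0, 0.0)
r_pearson = offset_correlation(x, y, np.mean(x), np.mean(y))
r_shrunk = offset_correlation(
    x, y, shrunk_mean(x, tau2=1.0, beta2=1.0), shrunk_mean(y, tau2=1.0, beta2=1.0)
)
```

In this sketch, a prior variance of zero reproduces the zero offset attributed to Eisen et al., while a very large prior variance recovers the ordinary sample mean, so the shrunk offset interpolates between the two.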

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
Research Support, U.S. Gov't, P.H.S.

MeSH terms

Algorithms
Cell Cycle / genetics
Cluster Analysis
Data Interpretation, Statistical
Gene Expression Profiling / statistics & numerical data*
Genes, Fungal
Models, Statistical
Oligonucleotide Array Sequence Analysis / statistics & numerical data*
Saccharomyces cerevisiae / cytology
Saccharomyces cerevisiae / genetics