Improved bolstering error estimation for gene ranking

Annu Int Conf IEEE Eng Med Biol Soc. 2007:2007:4633-6. doi: 10.1109/IEMBS.2007.4353372.

Abstract

Many methods have been proposed to identify differentially expressed genes in diseased tissues. The performance of such a method depends closely on the metric used to evaluate it. We examine several error estimation algorithms (cross validation, bootstrap, resubstitution, and resubstitution with bolstering) for three classifiers (support vector machine, Fisher's discriminant, and signed distance function). To control classifier overfitting, which in many real datasets is caused by small sample size, we generate synthetic datasets based on real data. This lets us monitor the impact of sample size when evaluating the metrics. We find that resubstitution with bolstering gives the best results, especially with respect to computational efficiency. However, classical bolstering tends to be biased in high dimensions. We therefore investigate ways to reduce the bias of the bolstered estimate without increasing computational cost. Our results indicate that the estimator tends to become unbiased as the sample size increases. We also find that modified bolstering is the best among all metrics in terms of estimation accuracy and computational efficiency.
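
As a rough illustration of how resubstitution with bolstering differs from plain resubstitution, the sketch below treats the simplest case: a linear classifier with a spherical Gaussian kernel centered on each training point. It is a minimal sketch under stated assumptions, not the method evaluated in the paper. The function names, the labels in {-1, +1}, and the single user-supplied kernel width `sigma` are all hypothetical choices made here; the abstract does not say how the kernel width is chosen or how the modified estimator changes it.

```python
import math
import numpy as np


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def resubstitution_error(X, y, w, b):
    """Fraction of training points misclassified by sign(w.x + b)."""
    pred = np.where(X @ w + b >= 0, 1, -1)
    return float(np.mean(pred != y))


def bolstered_resubstitution_error(X, y, w, b, sigma):
    """Bolstered resubstitution for a linear classifier (illustrative).

    Each training point is replaced by a spherical Gaussian with standard
    deviation `sigma` (an assumed, user-supplied value). The estimate is the
    average kernel mass that falls on the wrong side of the hyperplane,
    which for a linear boundary is Phi(-signed_distance / sigma).
    """
    # Signed distance to the hyperplane; positive means correctly classified.
    signed_dist = y * (X @ w + b) / np.linalg.norm(w)
    return float(np.mean([normal_cdf(-d / sigma) for d in signed_dist]))
```

On a separable toy set, plain resubstitution reports zero error, while the bolstered estimate stays positive because part of each kernel's mass crosses the decision boundary. This is the sense in which bolstering counteracts the optimism of resubstitution, and it needs no retraining, which fits the computational efficiency noted above.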

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Computer Simulation
  • Gene Expression Profiling / methods*
  • Gene Expression Regulation*
  • Humans
  • Oligonucleotide Array Sequence Analysis / methods*
  • Selection Bias
  • Sensitivity and Specificity
  • Software*