A prediction-based resampling method for estimating the number of clusters in a dataset

Genome Biol. 2002 Jun 25;3(7):RESEARCH0036. doi: 10.1186/gb-2002-3-7-research0036. Epub 2002 Jun 25.

Abstract

Background: Microarray technology is increasingly being applied in biological and medical research to address a wide range of problems, such as the classification of tumors. An important statistical problem associated with tumor classification is the identification of new tumor classes using gene-expression profiles. Two essential aspects of this clustering problem are estimating the number of clusters, if any, in a dataset, and allocating tumor samples to these clusters while assessing the confidence of cluster assignments for individual samples. Here we address the first of these problems.

Results: We have developed a new prediction-based resampling method, Clest, to estimate the number of clusters in a dataset. The performance of the new and existing methods was compared using simulated data and gene-expression data from four recently published cancer microarray studies. Clest was generally found to be more accurate and robust than the six existing methods considered in the study.

Conclusions: Focusing on prediction accuracy in conjunction with resampling produces accurate and robust estimates of the number of clusters.
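
The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea of combining prediction accuracy with resampling, not the published Clest algorithm. Everything in it is an assumption made for illustration: k-means as the clustering method, nearest-centroid assignment as the predictor, the Rand index as the agreement score, a uniform reference dataset as the null, and the names split_agreement, estimate_k, and d_min. For each candidate number of clusters k, the data are repeatedly split into a learning set and a test set, the learning-set clustering is used to predict labels for the test set, and those predictions are compared with a direct clustering of the test set.

```python
import random
import statistics


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (the assumed prediction rule)."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, rng, iters=50, restarts=3):
    """Lloyd's k-means with random restarts; returns (centroids, labels)."""
    best = None
    for _ in range(restarts):
        centroids = rng.sample(points, k)
        for _ in range(iters):
            labels = [nearest(p, centroids) for p in points]
            updated = []
            for j in range(k):
                members = [p for p, lab in zip(points, labels) if lab == j]
                # Keep the old centroid if a cluster ends up empty.
                updated.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                    if members else centroids[j]
                )
            if updated == centroids:
                break
            centroids = updated
        labels = [nearest(p, centroids) for p in points]
        cost = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        if best is None or cost < best[0]:
            best = (cost, centroids, labels)
    return best[1], best[2]


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same/different)."""
    n = len(a)
    agree = sum(
        (a[i] == a[j]) == (b[i] == b[j])
        for i in range(n) for j in range(i + 1, n)
    )
    return agree / (n * (n - 1) / 2)


def split_agreement(points, k, rng, n_splits=10):
    """Median agreement between predicted and clustered test labels."""
    scores = []
    for _ in range(n_splits):
        shuffled = points[:]
        rng.shuffle(shuffled)
        cut = (2 * len(shuffled)) // 3           # assumed 2:1 learning/test split
        learn, test = shuffled[:cut], shuffled[cut:]
        centroids, _ = kmeans(learn, k, rng)      # cluster the learning set
        predicted = [nearest(p, centroids) for p in test]
        _, clustered = kmeans(test, k, rng)       # cluster the test set directly
        scores.append(rand_index(predicted, clustered))
    return statistics.median(scores)


def uniform_reference(points, rng):
    """Reference data drawn uniformly over the bounding box of `points`."""
    lows = [min(c) for c in zip(*points)]
    highs = [max(c) for c in zip(*points)]
    return [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
            for _ in points]


def estimate_k(points, max_k=4, n_refs=5, n_splits=10, d_min=0.05, seed=0):
    """Pick the k whose agreement most exceeds the reference; default to 1."""
    rng = random.Random(seed)
    best_k, best_d = 1, d_min
    for k in range(2, max_k + 1):
        observed = split_agreement(points, k, rng, n_splits)
        null = [split_agreement(uniform_reference(points, rng), k, rng, n_splits)
                for _ in range(n_refs)]
        d = observed - statistics.mean(null)
        if d > best_d:
            best_k, best_d = k, d
    return best_k
```

Calling estimate_k on a list of equal-length numeric tuples returns an integer between 1 and max_k. Returning 1 when no candidate k beats the reference by at least d_min reflects the "if any" in the problem statement; the threshold value itself is arbitrary here.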

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Cluster Analysis
  • Computational Biology / methods
  • Computer Simulation
  • Gene Expression Profiling / methods*
  • Humans
  • Neoplasms / classification*
  • Neoplasms / genetics*
  • Neoplasms / metabolism
  • Oligonucleotide Array Sequence Analysis / methods*
  • Reproducibility of Results