A prediction-based resampling method for estimating the number of clusters in a dataset

Sandrine Dudoit; Jane Fridlyand

doi:10.1186/gb-2002-3-7-research0036

Genome Biol. 2002 Jun 25;3(7):RESEARCH0036. doi: 10.1186/gb-2002-3-7-research0036. Epub 2002 Jun 25.

Authors

Sandrine Dudoit¹, Jane Fridlyand

Affiliation

¹ Division of Biostatistics, School of Public Health, University of California Berkeley, 140 Earl Warren Hall, Berkeley, CA 94720-7360, USA.

Abstract

Background: Microarray technology is increasingly being applied in biological and medical research to address a wide range of problems, such as the classification of tumors. An important statistical problem associated with tumor classification is the identification of new tumor classes using gene-expression profiles. This clustering problem has two essential aspects: estimating the number of clusters, if any, in a dataset; and allocating tumor samples to these clusters while assessing the confidence of cluster assignments for individual samples. Here we address the first of these problems.

Results: We have developed a new prediction-based resampling method, Clest, to estimate the number of clusters in a dataset. The performance of the new and existing methods was compared using simulated data and gene-expression data from four recently published cancer microarray studies. Clest was generally found to be more accurate and robust than the six existing methods considered in the study.

Conclusions: Focusing on prediction accuracy in conjunction with resampling produces accurate and robust estimates of the number of clusters.
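
The general idea of combining prediction accuracy with resampling can be illustrated with a minimal sketch. It is not Clest itself: the abstract does not specify the clustering algorithm, the classifier, the agreement statistic, or how scores are turned into an estimate, so k-means, a nearest-centroid predictor, the Rand index, the 2/3 learning split, and all function names below (`kmeans`, `rand_index`, `prediction_agreement`, `two_blobs`) are assumptions chosen for illustration. For a candidate number of clusters k, the data are repeatedly split into a learning set and a test set; the learning-set clustering is used to predict labels for the test set, and those predictions are compared with a clustering of the test set on its own.

```python
import random
from itertools import combinations


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])),
    )


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means on a list of tuples; returns (centroids, labels).

    Assumption: k-means stands in for whatever clustering procedure is used.
    """
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centroids[j]
            )
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


def rand_index(a, b):
    """Fraction of sample pairs on which two labelings agree (same/different)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def prediction_agreement(points, k, n_splits=20, seed=0):
    """Median agreement between predicted and clustered test labels for a given k.

    Requires the test set (about one third of the data) to hold at least k points.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = points[:]
        rng.shuffle(shuffled)
        cut = 2 * len(shuffled) // 3          # assumed 2/3 learning, 1/3 test
        learn, test = shuffled[:cut], shuffled[cut:]
        centroids, _ = kmeans(learn, k, rng)  # cluster the learning set
        predicted = [nearest(p, centroids) for p in test]  # predict test labels
        _, clustered = kmeans(test, k, rng)   # cluster the test set directly
        scores.append(rand_index(predicted, clustered))
    scores.sort()
    return scores[len(scores) // 2]


def two_blobs(n_per_blob=30, seed=1):
    """Toy data: two well-separated 2-D Gaussian blobs (hypothetical example)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 10.0)]
    return [
        (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
        for cx, cy in centers
        for _ in range(n_per_blob)
    ]
```

Calling `prediction_agreement(two_blobs(), k)` for several values of k gives one agreement score per candidate. How such scores are calibrated and converted into an estimate of the number of clusters is not described in the abstract and is not attempted here.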

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Cluster Analysis
Computational Biology / methods
Computer Simulation
Gene Expression Profiling / methods*
Humans
Neoplasms / classification*
Neoplasms / genetics*
Neoplasms / metabolism
Oligonucleotide Array Sequence Analysis / methods*
Reproducibility of Results