Identifying clusters in genomics data by recursive partitioning

Stat Appl Genet Mol Biol. 2013 Oct 1;12(5):637-52. doi: 10.1515/sagmb-2013-0016.

Abstract

Genomics studies frequently involve clustering of molecular data to identify groups, but common clustering methods such as K-means clustering and hierarchical clustering do not determine the number of clusters. Methods for estimating the number of clusters typically focus on identifying the global structure in the data; however, the discovery of substructures within clusters may also be of great biological interest. We propose a novel method, Partitioning Algorithm based on Recursive Thresholding (PART), that recursively uncovers distinct subgroups within the groups already identified. Outliers are common in high-dimensional genomics data and may mask the presence of substructure within a cluster. A crucial feature of the algorithm is the introduction of tentative splits of clusters to isolate outliers that might otherwise halt the recursion prematurely. The method is demonstrated on simulated data and on a wide range of real gene expression microarray data sets for which the correct clusters were known in advance. On simulated data, the proposed method performs better than two established global methods when subclusters are present and the variance is large or varies between the clusters. On the real data sets, the overall performance of PART is superior to that of the global methods when used in combination with hierarchical clustering. The method is implemented in the R package clusterGenomics and is freely available from CRAN (The Comprehensive R Archive Network).
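The recursive idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation and does not use the clusterGenomics package. It assumes one-dimensional data, an exact two-group split, and a simple variance-ratio stopping rule standing in for the global method that the paper applies at each step. The names `part_sketch`, `best_split`, `is_significant`, `min_size`, and `ratio` are hypothetical, and the rule for keeping or reverting a tentative split is likewise an assumption made for illustration.

```python
from statistics import mean


def sse(xs):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not xs:
        return 0.0
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


def best_split(xs):
    """Exact two-group split of 1-D data: try every cut of the sorted values."""
    xs = sorted(xs)
    k = min(range(1, len(xs)), key=lambda i: sse(xs[:i]) + sse(xs[i:]))
    return xs[:k], xs[k:]


def is_significant(xs, left, right, ratio):
    """Stand-in stopping rule (an assumption, not the paper's criterion):
    accept a split only if it leaves little within-group variation."""
    total = sse(xs)
    return total > 0 and (sse(left) + sse(right)) / total < ratio


def part_sketch(xs, min_size=3, ratio=0.2):
    """Recursively partition 1-D data; returns (clusters, outliers)."""
    if len(xs) < 2 * min_size:
        return [sorted(xs)], []  # too small to split further
    left, right = best_split(xs)
    small, large = sorted((left, right), key=len)
    if len(small) < min_size:
        # Tentative split: set the tiny group aside and look for
        # structure in the remainder before giving up on this cluster.
        clusters, outliers = part_sketch(large, min_size, ratio)
        if len(clusters) > 1:
            return clusters, sorted(small + outliers)  # keep the split
        return [sorted(xs)], []  # nothing found: revert the tentative split
    if not is_significant(xs, left, right, ratio):
        return [sorted(xs)], []
    left_clusters, left_out = part_sketch(left, min_size, ratio)
    right_clusters, right_out = part_sketch(right, min_size, ratio)
    return left_clusters + right_clusters, sorted(left_out + right_out)


# Two tight groups plus one extreme value that dominates the first split.
data = [1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 5.2, 50.0]
clusters, outliers = part_sketch(data)
print(clusters)  # [[0.9, 1.0, 1.1, 1.2], [4.9, 5.0, 5.1, 5.2]]
print(outliers)  # [50.0]
```

In this toy input the first split only separates the extreme value from everything else. Without the tentative step the recursion would stop there and the two tight groups would go unnoticed, which is the masking effect the abstract attributes to outliers.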

MeSH terms

  • Algorithms
  • Cluster Analysis
  • Computer Simulation
  • Data Interpretation, Statistical
  • Gene Expression Profiling*
  • Genomics
  • Humans
  • Models, Biological
  • Models, Statistical
  • Neoplasms / genetics*
  • Neoplasms / metabolism
  • Software*
  • Transcriptome