densityCut: an efficient and versatile topological approach for automatic clustering of biological data

Jiarui Ding; Sohrab Shah; Anne Condon

doi:10.1093/bioinformatics/btw227

densityCut: an efficient and versatile topological approach for automatic clustering of biological data

Bioinformatics. 2016 Sep 1;32(17):2567-76. doi: 10.1093/bioinformatics/btw227. Epub 2016 Apr 23.

Authors

Jiarui Ding¹, Sohrab Shah¹, Anne Condon²

Affiliations

¹ Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada Department of Molecular Oncology, BC Cancer Research Centre, Vancouver, BC V5Z 1L3, Canada.
² Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.

Abstract

Motivation: Many biological data processing problems can be formalized as clustering problems to partition data points into sensible and biologically interpretable groups.

Results: This article introduces densityCut, a novel density-based clustering algorithm, which is both time- and space-efficient and proceeds as follows: densityCut first roughly estimates the densities of data points from a K-nearest neighbour graph and then refines the densities via a random walk. A cluster consists of points falling into the basin of attraction of an estimated mode of the underlining density function. A post-processing step merges clusters and generates a hierarchical cluster tree. The number of clusters is selected from the most stable clustering in the hierarchical cluster tree. Experimental results on ten synthetic benchmark datasets and two microarray gene expression datasets demonstrate that densityCut performs better than state-of-the-art algorithms for clustering biological datasets. For applications, we focus on the recent cancer mutation clustering and single cell data analyses, namely to cluster variant allele frequencies of somatic mutations to reveal clonal architectures of individual tumours, to cluster single-cell gene expression data to uncover cell population compositions, and to cluster single-cell mass cytometry data to detect communities of cells of the same functional states or types. densityCut performs better than competing algorithms and is scalable to large datasets.

Availability and implementation: Data and the densityCut R package is available from https://bitbucket.org/jerry00/densitycut_dev

Contact: : [email protected] or [email protected] or [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

Algorithms*
Cluster Analysis*
Gene Expression
Gene Expression Profiling*
Humans
Oligonucleotide Array Sequence Analysis