Estimating the number of clusters via a corrected clustering instability

Comput Stat. 2020;35(4):1879-1894. doi: 10.1007/s00180-020-00981-5. Epub 2020 May 18.

Abstract

We improve instability-based methods for selecting the number of clusters k in cluster analysis by developing a corrected clustering distance that removes the unwanted influence of the distribution of cluster sizes on cluster instability. We show that our corrected instability measure outperforms current instability-based measures across the whole sequence of possible k, overcoming the limitations of current instability-based methods for large k. We also compare, for the first time, model-based and model-free approaches to determining cluster instability and find their performance to be comparable. We make our method available in the R package cstab.
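
For readers unfamiliar with instability-based selection, the following is a minimal Python sketch of the general idea only, not the method of this paper and not the cstab interface. It assumes a common pairwise definition of the clustering distance (the fraction of object pairs on which two clusterings disagree) and averages that distance over pairs of clusterings, for example those obtained from resampled data. The function names are hypothetical, and the sketch computes the uncorrected distance; the cluster-size correction proposed here is not reproduced.

```python
from itertools import combinations


def clustering_distance(labels_a, labels_b):
    """Fraction of object pairs on which two clusterings disagree.

    A pair disagrees when one clustering places the two objects in the
    same cluster and the other places them in different clusters.
    This is the uncorrected distance (an assumed, common definition).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 0.0
    disagreements = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)


def instability(clustering_pairs):
    """Average clustering distance over a list of (labels_a, labels_b) pairs."""
    distances = [clustering_distance(a, b) for a, b in clustering_pairs]
    return sum(distances) / len(distances)
```

Under this scheme one would compute the instability for each candidate k and prefer the k with the lowest value. The distance depends only on co-membership, so relabeling the clusters does not change it.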

Keywords: Cluster analysis; Resampling; Stability; k-means.