Clusters that are not there: An R tutorial and a Shiny app to quantify a priori inferential risks when using clustering methods

Int J Psychol. 2024 Dec;59(6):1183-1198. doi: 10.1002/ijop.13246. Epub 2024 Sep 19.

Abstract

Clustering methods are increasingly used in social science research. Researchers generally use them to infer the existence of qualitatively different types of individuals within a larger population, thus unveiling previously "hidden" heterogeneity. Valid inference, however, rests on conditions and assumptions that depend on the clustering technique. Common risks include not only failing to detect existing clusters because of low power but also revealing clusters that do not exist in the population. Simple data simulations suggest that, with the sample sizes and the number, correlation and skewness of indicators frequently encountered in applied psychological research, commonly used clustering methods are at high risk of detecting clusters that are not there. This is generally due to violations of assumptions that are not usually considered critical in psychology. This article presents a simple R tutorial and a Shiny app (for those who are not familiar with R) that allow researchers to quantify a priori the inferential risks of applying clustering methods to their own data. Doing so is suggested as a much-needed preliminary sanity check, because conditions that inflate the number of detected clusters are very common in applied psychological research scenarios.
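The kind of a priori check described in the abstract can be illustrated with a minimal simulation sketch. It is not the article's R tutorial or Shiny app: it is written in Python with the standard library only, and every name and setting in it (spurious_rate, bic_two_classes, n = 200, a single lognormal indicator, selection by BIC, the variance floor) is an assumption chosen for illustration. It draws samples from one homogeneous population, fits a one-class and a two-class univariate Gaussian mixture to each sample, and counts how often BIC prefers two classes, once for a normal indicator and once for a skewed one.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def mean_sd(values):
    """Mean and maximum-likelihood standard deviation of a list."""
    n = len(values)
    m = sum(values) / n
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / n)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * LOG_2PI - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def bic_one_class(data):
    """BIC of a single Gaussian (2 free parameters)."""
    mu, sigma = mean_sd(data)
    loglik = sum(log_normal_pdf(x, mu, sigma) for x in data)
    return -2.0 * loglik + 2 * math.log(len(data))

def e_step(data, w, mu, sigma):
    """Log-likelihood and each point's posterior probability of class 0."""
    loglik, resp0 = 0.0, []
    for x in data:
        a = math.log(w[0]) + log_normal_pdf(x, mu[0], sigma[0])
        b = math.log(w[1]) + log_normal_pdf(x, mu[1], sigma[1])
        top = max(a, b)  # log-sum-exp for numerical stability
        total = top + math.log(math.exp(a - top) + math.exp(b - top))
        loglik += total
        resp0.append(math.exp(a - total))
    return loglik, resp0

def bic_two_classes(data, n_iter=50):
    """BIC of a two-class Gaussian mixture fitted by EM (5 free parameters)."""
    n = len(data)
    ordered = sorted(data)
    # Assumption: floor on class SDs to keep the likelihood bounded.
    floor = 0.05 * mean_sd(data)[1]
    # Assumption: start EM from the lower and upper halves of the sample.
    halves = [ordered[: n // 2], ordered[n // 2:]]
    mu = [mean_sd(h)[0] for h in halves]
    sigma = [max(mean_sd(h)[1], floor) for h in halves]
    w = [0.5, 0.5]
    for _ in range(n_iter):
        _, resp0 = e_step(data, w, mu, sigma)
        for k in (0, 1):
            r = resp0 if k == 0 else [1.0 - p for p in resp0]
            nk = max(sum(r), 1e-9)
            mu[k] = sum(ri * x for ri, x in zip(r, data)) / nk
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, data)) / nk
            sigma[k] = max(math.sqrt(var), floor)
            w[k] = min(max(nk / n, 1e-6), 1.0 - 1e-6)
    loglik, _ = e_step(data, w, mu, sigma)
    return -2.0 * loglik + 5 * math.log(n)

def spurious_rate(draw, n=200, n_sims=30, seed=1):
    """Share of single-population samples for which BIC prefers two classes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        data = [draw(rng) for _ in range(n)]
        if bic_two_classes(data) < bic_one_class(data):
            hits += 1
    return hits / n_sims

# Both generators describe ONE population; any second class is spurious.
rate_normal = spurious_rate(lambda rng: rng.gauss(0.0, 1.0))
rate_skewed = spurious_rate(lambda rng: math.exp(rng.gauss(0.0, 0.75)))
print("normal indicator:", rate_normal)
print("skewed indicator:", rate_skewed)
```

Replacing the two generators with distributions that mimic one's own indicators, and n with the planned sample size, gives a rough estimate of how often a second class would be "found" when none exists. A fuller check would also vary the number and correlation of indicators, which this univariate sketch does not cover.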

Keywords: Cluster analysis; Data simulation; Machine learning; Mixture models; k‐means.

MeSH terms

  • Cluster Analysis
  • Data Interpretation, Statistical
  • Humans
  • Mobile Applications*