Clusters that are not there: An R tutorial and a Shiny app to quantify a priori inferential risks when using clustering methods

Enrico Toffalini; Filippo Gambarota; Ambra Perugini; Paolo Girardi; Valentina Tobia; Gianmarco Altoè; David Giofrè; Psicostat Core Team; Tommaso Feraco

doi:10.1002/ijop.13246

Int J Psychol. 2024 Dec;59(6):1183-1198. doi: 10.1002/ijop.13246. Epub 2024 Sep 19.

Authors

Enrico Toffalini¹, Filippo Gambarota², Ambra Perugini², Paolo Girardi³, Valentina Tobia⁴, Gianmarco Altoè², David Giofrè⁵, Psicostat Core Team², Tommaso Feraco¹

Affiliations

¹ Department of General Psychology, University of Padova, Padova, Italy.
² Department of Developmental Psychology and Socialization, University of Padova, Padova, Italy.
³ Department of Environmental Sciences, Informatics and Statistics-University Ca' Foscari, Venice, Italy.
⁴ Department of Psychology, University Vita-Salute San Raffaele, Milan, Italy.
⁵ DISFOR-University of Genova, Italy.

PMID: 39300789
DOI: 10.1002/ijop.13246

Abstract

Clustering methods are increasingly used in social science research. Generally, researchers use them to infer the existence of qualitatively different types of individuals within a larger population, thus unveiling previously "hidden" heterogeneity. Depending on the clustering technique, however, valid inference requires some conditions and assumptions. Common risks include not only failing to detect existing clusters due to a lack of power but also revealing clusters that do not exist in the population. Simple data simulations suggest that under conditions of sample size, number, correlation and skewness of indicators that are frequently encountered in applied psychological research, commonly used clustering methods are at a high risk of detecting clusters that are not there. Generally, this is due to some violations of assumptions that are not usually considered critical in psychology. The present article illustrates a simple R tutorial and a Shiny app (for those who are not familiar with R) that allow researchers to quantify a priori inferential risks when performing clustering methods on their own data. Doing so is suggested as a much-needed preliminary sanity check, because conditions that inflate the number of detected clusters are very common in applied psychological research scenarios.
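
The kind of a priori check described in the abstract can be illustrated with a minimal simulation sketch. This is not the authors' R tutorial or Shiny app. It is a simplified Python illustration under stated assumptions: a single indicator, a single homogeneous population (so no true clusters exist), a basic two-component Gaussian mixture fitted by EM, and BIC as the selection criterion. The function names, the lognormal shape parameter (0.5), the sample size, and the number of replications are all hypothetical choices made for illustration.

```python
import math
import random


def bic_single_normal(x):
    """BIC (lower is better) of one normal distribution fit by maximum likelihood."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return -2 * loglik + 2 * math.log(n)  # 2 free parameters: mean, variance


def _e_step(x, w, mu, var):
    """Return the log-likelihood and each point's responsibility for component 0."""
    loglik, resp = 0.0, []
    for v in x:
        la = math.log(w[0]) - 0.5 * math.log(2 * math.pi * var[0]) - (v - mu[0]) ** 2 / (2 * var[0])
        lb = math.log(w[1]) - 0.5 * math.log(2 * math.pi * var[1]) - (v - mu[1]) ** 2 / (2 * var[1])
        m = max(la, lb)  # log-sum-exp for numerical stability
        lse = m + math.log(math.exp(la - m) + math.exp(lb - m))
        loglik += lse
        resp.append(math.exp(la - lse))
    return loglik, resp


def bic_two_normals(x, n_iter=100, tol=1e-6):
    """BIC of a two-component normal mixture fit by a basic EM algorithm."""
    n = len(x)
    xs = sorted(x)
    grand_mean = sum(x) / n
    grand_var = sum((v - grand_mean) ** 2 for v in x) / n
    floor = 1e-3 * grand_var  # variance floor to avoid degenerate components
    half = n // 2
    # Assumed initialisation: split the sample at the median.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    var = [grand_var, grand_var]
    w = [0.5, 0.5]
    prev = -math.inf
    for _ in range(n_iter):
        loglik, resp = _e_step(x, w, mu, var)
        if loglik - prev < tol:
            break
        prev = loglik
        # M-step: update weights, means, and variances from the responsibilities.
        n0 = min(max(sum(resp), 1e-6), n - 1e-6)
        n1 = n - n0
        w = [n0 / n, n1 / n]
        mu = [sum(r * v for r, v in zip(resp, x)) / n0,
              sum((1 - r) * v for r, v in zip(resp, x)) / n1]
        var = [max(sum(r * (v - mu[0]) ** 2 for r, v in zip(resp, x)) / n0, floor),
               max(sum((1 - r) * (v - mu[1]) ** 2 for r, v in zip(resp, x)) / n1, floor)]
    loglik, _ = _e_step(x, w, mu, var)
    return -2 * loglik + 5 * math.log(n)  # 5 free parameters: 2 means, 2 variances, 1 weight


def spurious_cluster_rate(n, skewed, n_sims=50, seed=1):
    """Share of simulated one-population samples in which BIC prefers two clusters."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        if skewed:
            x = [rng.lognormvariate(0.0, 0.5) for _ in range(n)]  # skewed, still one population
        else:
            x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if bic_two_normals(x) < bic_single_normal(x):
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    print("normal indicator:", spurious_cluster_rate(200, skewed=False))
    print("skewed indicator:", spurious_cluster_rate(200, skewed=True))
```

Because every simulated sample comes from one population, any preference for the two-component model counts as a detected cluster that is not there. Comparing the rate for the normal and the skewed indicator shows how a violation of the normality assumption alone can inflate the number of detected clusters. A fuller check along the lines of the abstract would also vary the number of indicators and their correlation.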

Keywords: Cluster analysis; Data simulation; Machine learning; Mixture models; k‐means.

MeSH terms

Cluster Analysis
Data Interpretation, Statistical
Humans
Mobile Applications*

Grants and funding

C53D23004210006/Finanziamento Ministero dell'Università e della Ricerca Direzione Generale della Ricerca Ufficio III 15 dell'Unione Europea - NextGenerationEU - missione 4, componente 2, investimento 1.1