Numero: a statistical framework to define multivariable subgroups in complex population-based datasets

Song Gao; Stefan Mutter; Aaron Casey; Ville-Petteri Mäkinen

doi:10.1093/ije/dyy113

Int J Epidemiol. 2019 Apr 1;48(2):369-374. doi: 10.1093/ije/dyy113.

Authors

Song Gao¹, Stefan Mutter¹, Aaron Casey¹, Ville-Petteri Mäkinen¹,²,³

Affiliations

¹ Heart Health Theme, South Australian Health and Medical Research Institute, Adelaide, SA, Australia.
² School of Biological Sciences, University of Adelaide, Adelaide, SA, Australia.
³ Computational Medicine, University of Oulu and Biocenter Oulu, Oulu, Finland.

PMID: 29947762
DOI: 10.1093/ije/dyy113

Abstract

Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms may not produce the best results if the dataset does not have a clustered structure. For this reason, we propose a framework (the R-library Numero) that combines the self-organizing map algorithm, permutation analysis for statistical evidence and a final expert-driven subgrouping step. We used Numero to define subgroups in two examples without an obvious clustering structure: a biomedical dataset of kidney disease and another dataset of community-level socioeconomic indicators. We benchmarked the Numero subgroupings against popular clustering algorithms (principal components, K-means and hierarchical clustering). The Numero subgroupings were more intuitive and easier to interpret without losing mathematical quality. Therefore, we expect Numero to be useful for exploratory analyses of population-based epidemiological datasets.
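The first two stages named in the abstract, a self-organizing map followed by permutation analysis, can be illustrated with a minimal sketch. This is not the Numero API: it is plain Python using only the standard library, and the function names (`train_som`, `best_unit`, `map_variation`, `permutation_p`), the synthetic data and the test statistic (variance of per-unit means) are assumptions chosen for illustration. The final expert-driven subgrouping step is a manual judgement and is not automated here.

```python
import math
import random

def best_unit(units, x):
    """Return the grid position whose prototype is closest to sample x."""
    return min(units, key=lambda pos: sum((a - b) ** 2 for a, b in zip(units[pos], x)))

def train_som(data, rows, cols, epochs=20, seed=0):
    """Train a small rectangular SOM with online updates (illustrative only)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialise each unit's prototype with a copy of a randomly chosen sample.
    units = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total
            lr = 0.5 * (1.0 - frac)                               # learning rate decays linearly
            radius = max(rows, cols) / 2.0 * (1.0 - frac) + 0.5   # shrinks linearly, ending at 0.5
            bmu = best_unit(units, x)
            for pos, w in units.items():
                d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * radius ** 2))           # Gaussian neighbourhood weight
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return units

def map_variation(positions, values):
    """Variance of per-unit means: larger when a variable shows regional patterns on the map."""
    groups = {}
    for pos, v in zip(positions, values):
        groups.setdefault(pos, []).append(v)
    means = [sum(g) / len(g) for g in groups.values()]
    grand = sum(means) / len(means)
    return sum((m - grand) ** 2 for m in means) / len(means)

def permutation_p(positions, values, n_perm=200, seed=1):
    """Empirical P-value: shuffle values across samples and recompute the statistic."""
    rng = random.Random(seed)
    observed = map_variation(positions, values)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if map_variation(positions, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Synthetic data without a clustered structure: one continuous 2-D Gaussian cloud.
rng = random.Random(42)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(120)]
units = train_som(data, rows=3, cols=3)
positions = [best_unit(units, x) for x in data]

linked = [x[0] + rng.gauss(0, 0.3) for x in data]   # variable related to the map layout
noise = [rng.gauss(0, 1) for _ in data]             # variable unrelated to the map layout
p_linked = permutation_p(positions, linked)
p_noise = permutation_p(positions, noise)
print("P (linked variable):", p_linked)
print("P (noise variable):", p_noise)
```

In this sketch the map places similar samples in nearby units even though the data contain no distinct clusters, and the permutation step indicates which variables vary across the map more than expected by chance. An analyst would then inspect those patterns and draw subgroup boundaries by hand.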

Keywords: Multivariable statistics; data-driven subgrouping; population data; self-organizing map.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Datasets as Topic
Epidemiologic Methods*
Humans
Statistics as Topic*