Unsupervised learning for medical data: A review of probabilistic factorization methods

Dorien Neijzen; Gerton Lunter

doi:10.1002/sim.9924

Stat Med. 2023 Dec 30;42(30):5541-5554. doi: 10.1002/sim.9924. Epub 2023 Oct 18.

Authors

Dorien Neijzen¹, Gerton Lunter¹,²

Affiliations

¹ Department of Epidemiology, University of Groningen, University Medical Center Groningen, Groningen, the Netherlands.
² Weatherall Institute of Molecular Medicine, Oxford University, Oxford, UK.

PMID: 37850249
DOI: 10.1002/sim.9924

Abstract

We review popular unsupervised learning methods for the analysis of high-dimensional data encountered in, for example, genomics, medical imaging, cohort studies, and biobanks. We show that four commonly used methods, principal component analysis, K-means clustering, nonnegative matrix factorization, and latent Dirichlet allocation, can be written as probabilistic models underpinned by a low-rank matrix factorization. In addition to highlighting their similarities, this formulation clarifies the assumptions and restrictions of each approach, which helps applied medical researchers identify the appropriate method for a specific application. We also touch upon the most important aspects of inference and model selection for the application of these methods to health data.
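To make the shared low-rank structure concrete, the following is a minimal NumPy sketch, not the authors' implementation. It fits three of the four methods to the same synthetic matrix and writes each result as a product of two factors. The names `Z` (per-sample factors) and `W` (per-feature factors), the helper functions, and the gamma-distributed toy data are all assumptions made for illustration. The fits are the familiar deterministic algorithms (SVD, Lloyd's algorithm, multiplicative updates) rather than the probabilistic formulations the review describes, and latent Dirichlet allocation is omitted for brevity.

```python
import numpy as np

def pca_factors(X, k):
    """Rank-k PCA via SVD: X - mu ≈ Z @ W, with unconstrained real-valued factors."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    Z = U[:, :k] * s[:k]          # scores, shape (n, k)
    W = Vt[:k]                    # loadings, shape (k, d)
    return Z, W, mu

def kmeans_factors(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm: X ≈ Z @ W, where each row of Z is one-hot."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (an illustrative choice).
    W = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):   # leave an empty cluster's centroid unchanged
                W[j] = X[labels == j].mean(axis=0)
    Z = np.eye(k)[labels]         # one-hot assignments, shape (n, k)
    return Z, W

def nmf_factors(X, k, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative updates for squared error: X ≈ Z @ W with Z, W >= 0."""
    rng = np.random.default_rng(seed)
    Z = rng.random((X.shape[0], k)) + eps
    W = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        W *= (Z.T @ X) / (Z.T @ Z @ W + eps)
        Z *= (X @ W.T) / (Z @ W @ W.T + eps)
    return Z, W

# Synthetic nonnegative data standing in for, e.g., a samples-by-features table.
rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=1.0, size=(60, 8))
k = 3

Z_pca, W_pca, mu = pca_factors(X, k)
Z_km, W_km = kmeans_factors(X, k)
Z_nmf, W_nmf = nmf_factors(X, k)

err_pca = np.linalg.norm(X - (mu + Z_pca @ W_pca))
err_km = np.linalg.norm(X - Z_km @ W_km)
err_nmf = np.linalg.norm(X - Z_nmf @ W_nmf)
print("reconstruction error (Frobenius):")
print("  PCA    ", err_pca)
print("  K-means", err_km)
print("  NMF    ", err_nmf)
```

In this sketch the three methods differ only in the constraints placed on the factors: none for PCA, one-hot rows of `Z` for K-means, and nonnegativity of both factors for NMF. This is one way to read the abstract's point that writing the methods as factorizations exposes their assumptions and restrictions.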

Keywords: clustering; dimension reduction; health-care research; latent variable discovery; probabilistic matrix factorization; topic model; unsupervised learning.

Publication types

Review

MeSH terms

Algorithms*
Cluster Analysis
Genomics
Humans
Models, Statistical
Unsupervised Machine Learning*