LDAPrototype: a model selection algorithm to improve reliability of latent Dirichlet allocation

Jonas Rieger; Carsten Jentsch; Jörg Rahnenführer

doi:10.7717/peerj-cs.2279

PeerJ Comput Sci. 2024 Sep 20:10:e2279. doi: 10.7717/peerj-cs.2279. eCollection 2024.

Authors

Jonas Rieger¹, Carsten Jentsch¹, Jörg Rahnenführer¹

Affiliation

¹ Department of Statistics, TU Dortmund University, Dortmund, Germany.

Abstract

Latent Dirichlet allocation (LDA) is a popular method for analyzing large text corpora, but it is unstable because it relies on random initialization. Replicated runs on the same data therefore produce different results, which hinders reproducibility. To address this, we introduce LDAPrototype, a new approach for selecting the most representative LDA run from multiple replications on the same dataset. LDAPrototype improves the reliability of conclusions drawn from LDA: the selected models are more similar across replications than individual LDA runs or models chosen by perplexity or NPMI. A key feature of LDAPrototype is its use of a novel model similarity measure called S-CLOP (Similarity of multiple sets by Clustering with LOcal Pruning). S-CLOP is based on topic similarities, for which we compare the thresholded Jaccard coefficient, cosine similarity, Jensen-Shannon divergence, and rank-biased overlap. We demonstrate the effectiveness of LDAPrototype by applying it to six real datasets, including newspaper articles and tweets. The results show improved reproducibility and reliability of topic modeling outcomes. LDAPrototype is practically applicable, comprehensible, easy to implement, and computationally efficient. Furthermore, the concept of the algorithm can be generalized to other topic modeling procedures that characterize topics through word distributions, making it a versatile tool in text data analysis.
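The selection idea described in the abstract (pick the run that is most similar to all other replications, i.e. a medoid) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, topics are assumed to be given as word-count dictionaries, the Jaccard threshold is an assumed absolute count, and the model similarity used here is a simple best-match average standing in for S-CLOP, whose clustering and pruning steps are not specified in this abstract.

```python
from itertools import combinations


def thresholded_jaccard(counts_a, counts_b, threshold):
    """Jaccard coefficient of the word sets whose counts exceed `threshold`.

    Assumption: topics are dicts mapping word -> count, and the threshold
    is an absolute count (the abstract does not define the threshold).
    """
    words_a = {w for w, c in counts_a.items() if c > threshold}
    words_b = {w for w, c in counts_b.items() if c > threshold}
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


def model_similarity(model_a, model_b, threshold):
    """Stand-in model similarity (NOT S-CLOP).

    Averages, in both directions, the similarity of each topic to its
    best-matching topic in the other model.
    """
    def directed(src, dst):
        best = [max(thresholded_jaccard(t, u, threshold) for u in dst) for t in src]
        return sum(best) / len(src)

    return 0.5 * (directed(model_a, model_b) + directed(model_b, model_a))


def select_prototype(models, threshold=1):
    """Return the index of the run with the highest total similarity
    to all other runs (the medoid of the replications)."""
    totals = [0.0] * len(models)
    for i, j in combinations(range(len(models)), 2):
        s = model_similarity(models[i], models[j], threshold)
        totals[i] += s
        totals[j] += s
    return max(range(len(models)), key=lambda i: totals[i])


# Three toy "replications", each with two topics (made-up counts).
runs = [
    [{"ball": 5, "goal": 4, "team": 3}, {"vote": 5, "party": 4, "law": 3}],
    [{"ball": 5, "goal": 4, "team": 3, "coach": 2}, {"vote": 5, "party": 4, "law": 3}],
    [{"ball": 5, "goal": 4, "coach": 3, "fan": 2}, {"vote": 5, "tax": 4, "law": 3}],
]

print(select_prototype(runs))  # prints 1
```

In this toy example the second run is selected because it shares most of its topic vocabulary with both other runs. Swapping `thresholded_jaccard` for cosine similarity, Jensen-Shannon divergence, or rank-biased overlap would only change the topic-level function; the medoid selection step stays the same.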

Keywords: Clustering; Medoid; Replications; Similarity; Stability; Topic model.

Grants and funding

The authors had access to computing time on the Linux HPC cluster at TU Dortmund University (LiDO3), funded in the course of the Large-Scale Equipment Initiative by the German Research Foundation (DFG) as project 271512359. No additional external funding was received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.