Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics

PeerJ Comput Sci. 2024 Jan 3;10:e1758. doi: 10.7717/peerj-cs.1758. eCollection 2024.

Abstract

Topic modeling is a widely used instrument for the analysis of large text collections. In the last few years, neural topic models and models with word embeddings have been proposed to increase the quality of topic solutions. However, these models have not been extensively tested in terms of stability and interpretability. Moreover, selecting the number of topics (a model parameter) remains a challenging task. We aim to partially fill this gap by testing four well-known, widely available topic models: the embedded topic model (ETM), the Gaussian Softmax distribution model (GSM), Wasserstein autoencoders with Dirichlet prior (W-LDA), and Wasserstein autoencoders with Gaussian Mixture prior (WTM-GMM). We demonstrate that W-LDA, WTM-GMM, and GSM possess poor stability, which complicates their application in practice. The ETM model with additionally trained embeddings demonstrates high coherence and rather good stability for large datasets, but the question of the number of topics remains unsolved for this model. We also propose a new topic model based on granulated sampling with word embeddings (GLDAW), which demonstrates the highest stability and good coherence among the considered models. Moreover, the optimal number of topics in a dataset can be determined for this model.

Keywords: Coherence; Neural topic models; Optimal number of topics; Renyi entropy; Stability; Topic modeling; Word embeddings.

Grants and funding

The results of the project “Modeling the structure and socio-psychological factors of news perception”, carried out within the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE University) in 2022, are presented in this work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.