A topic modeling approach for analyzing and categorizing electronic healthcare documents in Afaan Oromo without label information

Sci Rep. 2024 Dec 30;14(1):32051. doi: 10.1038/s41598-024-83743-3.

Abstract

Afaan Oromo is a resource-scarce language with few tools developed for its processing, which poses significant challenges for natural language tasks. Tools designed for English do not work efficiently for Afaan Oromo because of linguistic differences and the lack of well-structured resources. To address this challenge, this work proposes a topic modeling framework for unstructured health-related documents in Afaan Oromo using latent Dirichlet allocation (LDA). None of the collected documents has label information, which makes it difficult to categorize them or to apply supervised learning methods. We therefore use the LDA model, which discovers the latent topics of the documents without requiring predefined labels. The model takes a word dictionary and extracts hidden topics by evaluating word patterns and distributions across the dataset. It then extracts the most relevant topics for each document and generates a weight for each word per topic. Next, we classify the topics using their representative keywords as input and assign class labels based on human evaluation of topic coherence. This model could be applied to classifying medical documents and, from the information obtained, to finding the specialists best suited to patients' requests. In our experiments, topic modeling with LDA achieved promising results: 79.17% accuracy and a 79.66% F1 score on the test documents of the dataset.
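The pipeline described above (word dictionary, latent topics, per-topic word weights, per-document topic mixture) can be illustrated with a minimal sketch. The abstract does not say how the authors implemented or trained LDA, so everything here is an assumption made for illustration: the choice of collapsed Gibbs sampling for inference, the function names `build_dictionary`, `fit_lda` and `top_words`, the hyperparameters `alpha` and `beta`, and the toy documents, which use placeholder English tokens rather than data from the paper.

```python
import random


def build_dictionary(docs):
    """Map each distinct token to an integer id (the 'word dictionary')."""
    vocab = sorted({w for doc in docs for w in doc})
    return {w: i for i, w in enumerate(vocab)}


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling (an assumed inference method).

    Returns (word2id, phi, theta), where phi[k][w] is the weight of word w
    in topic k and theta[d][k] is the share of topic k in document d.
    """
    rng = random.Random(seed)
    word2id = build_dictionary(docs)
    vocab_size = len(word2id)
    corpus = [[word2id[w] for w in doc] for doc in docs]

    doc_topic = [[0] * num_topics for _ in corpus]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(corpus):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(corpus):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed estimates of topic-word and document-topic distributions.
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(corpus)
    ]
    return word2id, phi, theta


def top_words(phi, word2id, topic, n=5):
    """Return the n highest-weighted (word, weight) pairs for one topic."""
    id2word = {i: w for w, i in word2id.items()}
    ranked = sorted(range(len(phi[topic])),
                    key=lambda w: phi[topic][w], reverse=True)
    return [(id2word[w], phi[topic][w]) for w in ranked[:n]]


# Toy, already-tokenized documents (placeholder tokens, not paper data).
toy_docs = [
    ["heart", "blood", "pressure", "heart"],
    ["skin", "rash", "itch", "skin"],
    ["blood", "pressure", "heart", "pulse"],
    ["rash", "skin", "allergy", "itch"],
]
word2id, phi, theta = fit_lda(toy_docs, num_topics=2, iters=50)
for k in range(2):
    print(k, top_words(phi, word2id, k, n=3))
```

In a workflow like the one the abstract describes, a human reviewer would read the top keywords of each topic and assign it a class label; a new document could then be categorized by its dominant entry in `theta`. Real Afaan Oromo text would first need language-specific tokenization and stop-word removal, which this sketch does not attempt.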

Keywords: Afaan Oromo; Classification; Information retrieval; Latent Dirichlet allocation; Text analysis; Topic modeling.

MeSH terms

  • Algorithms*
  • Electronic Health Records
  • Humans
  • Models, Theoretical
  • Natural Language Processing*