Objectives: Traditionally, summarization of research themes and trends within a given discipline was accomplished by manual review of scientific works in the field. However, with the ushering in of the age of "big data," new methods for discovery of such information become necessary as traditional techniques become increasingly difficult to apply due to the exponential growth of document repositories. Our objectives are to develop a pipeline for unsupervised theme extraction and summarization of thematic trends in document repositories, and to test it by applying it to a specific domain.
Methods: To that end, we detail a pipeline, which utilizes machine learning and natural language processing for unsupervised theme extraction, and a novel method for summarization of thematic trends, and network mapping for visualization of thematic relations. We then apply this pipeline to a collection of anesthesiology abstracts.
Results: We demonstrate how this pipeline enables discovery of major themes and temporal trends in anesthesiology research and facilitates document classification and corpus exploration.
Discussion: The relation of prevalent topics and extracted trends to recent events in both anesthesiology, and healthcare in general, demonstrates the pipeline's utility. Furthermore, the agreement between the unsupervised thematic grouping and human-assigned classification validates the pipeline's accuracy and demonstrates another potential use.
Conclusion: The described pipeline enables summarization and exploration of large document repositories, facilitates classification, aids in trend identification. A more robust and user-friendly interface will facilitate the expansion of this methodology to other domains. This will be the focus of future work for our group.
Keywords: data mining; machine learning; natural language processing.