Objectives: Development of search queries for systematic reviews (SRs) is time-consuming. In this work, we capitalize on recent advances in large language models (LLMs) and a relatively large dataset of natural language descriptions of reviews and corresponding Boolean searches to generate Boolean search queries from SR titles and key questions.
Materials and methods: We curated a training dataset of 10 346 SR search queries registered in PROSPERO. We used this dataset to fine-tune a set of models based on Mistral-Instruct-7b to generate Boolean search queries. We evaluated the models quantitatively using an evaluation dataset of 57 SRs and qualitatively through semi-structured interviews with 8 experienced medical librarians.
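As an illustration only (not the authors' released pipeline), the sketch below shows how PROSPERO records pairing a review description with its Boolean search might be serialized into instruction-tuning examples for a model such as Mistral-Instruct-7b. The record fields, prompt template, and output file name are assumptions, not details from the paper.

```python
import json

# Hypothetical record structure; actual PROSPERO export fields may differ.
records = [
    {
        "title": "Statins for primary prevention of cardiovascular disease",
        "key_questions": "Do statins reduce cardiovascular events in adults without prior CVD?",
        "boolean_query": '("hydroxymethylglutaryl-coa reductase inhibitors"[MeSH] OR statin*[tiab]) AND "primary prevention"[tiab]',
    },
]

def to_example(record):
    """Format one review-description -> Boolean-query pair as an
    instruction-tuning example (prompt template is an assumption)."""
    prompt = (
        "Write a Boolean search query for the systematic review below.\n"
        f"Title: {record['title']}\n"
        f"Key questions: {record['key_questions']}\n"
        "Query:"
    )
    return {"prompt": prompt, "completion": " " + record["boolean_query"]}

with open("sr_query_train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(to_example(record)) + "\n")
```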
Results: The model-generated search queries had a median sensitivity of 85% (interquartile range [IQR] 40%-100%) and a median number needed to read of 1206 citations (IQR 205-5810). The interviews suggested that the models lack both the necessary sensitivity and precision to be used without scrutiny but could be useful for topic scoping or as initial queries to be refined.
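For reference, a minimal sketch of how the two reported metrics can be computed from a generated query's retrieved citations against the review's included studies; the function and variable names are illustrative, not taken from the paper.

```python
def evaluate_query(retrieved_ids: set[str], relevant_ids: set[str]):
    """Sensitivity: share of relevant (included) citations the query retrieves.
    Number needed to read (NNR): citations screened per relevant citation
    found, i.e. the reciprocal of precision."""
    hits = retrieved_ids & relevant_ids
    sensitivity = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    nnr = len(retrieved_ids) / len(hits) if hits else float("inf")
    return sensitivity, nnr

# Toy example: a query retrieving 5000 citations that captures 10 of the
# review's 12 included studies has sensitivity ~83% and an NNR of 500.
retrieved = {f"c{i}" for i in range(5000)}
relevant = {f"c{i}" for i in range(10)} | {"x1", "x2"}
sens, nnr = evaluate_query(retrieved, relevant)
print(f"sensitivity={sens:.0%}, NNR={nnr:.0f}")
```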
Discussion: Future research should focus on improving the dataset with more high-quality search queries; assessing whether fine-tuning on additional fields, such as population and intervention, improves performance; and exploring the addition of interactivity to the interface.
Conclusions: The datasets developed for this project can be used to train and evaluate LLMs that map review descriptions to Boolean search queries. The models cannot replace thoughtful search query design but may be useful for suggesting keywords and providing a framework for the query.
Keywords: artificial intelligence; systematic reviews as topic/methods.