Objectives: Extracting the sample size from randomized controlled trials (RCTs) remains a challenge for developing better search functionalities and for automating systematic reviews. Most current approaches rely on the sample size being explicitly stated in the abstract. The objective of this study was, therefore, to develop and validate additional approaches.
Materials and methods: A total of 847 RCTs from high-impact medical journals were tagged with 6 different entities that could indicate the sample size. A named entity recognition (NER) model was trained to extract these entities and was then deployed on a test set of 150 RCTs. Each entity's performance in predicting the actual number of randomized trial participants was assessed, and possible combinations of the entities were evaluated to create predictive models. The test set was also used to evaluate the performance of GPT-4o on the same task.
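As a rough illustration of the extraction step, the sketch below shows how a fine-tuned token-classification model could be applied to an abstract using the Hugging Face transformers pipeline; the model name, label set, and example text are placeholders and not the resources used in the study.

```python
# Minimal sketch of entity extraction from an RCT abstract, assuming a
# token-classification model fine-tuned on sample-size entities
# (model name and labels below are illustrative, not the authors').
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="my-org/rct-sample-size-ner",  # hypothetical fine-tuned model
    aggregation_strategy="simple",       # merge word pieces into full spans
)

abstract = (
    "We randomly assigned 424 patients to receive drug A (n=212) "
    "or placebo (n=212)."
)

# Each result carries the entity label, the matched text, and a confidence score.
for ent in ner(abstract):
    print(ent["entity_group"], ent["word"], f"{ent['score']:.3f}")
```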
Results: The most accurate model could make a prediction for 64.7% of trials in the test set, and its predictions matched the ground truth in 93.8% of cases. GPT-4o made a prediction for 94.7% of trials, and its predictions matched the ground truth in 90.8% of cases.
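For comparison, a GPT-4o query for the same task could look roughly like the following sketch with the OpenAI Python client; the prompt wording is an assumption rather than the study's exact instruction.

```python
# Hedged sketch of asking GPT-4o for the number of randomized participants;
# the system prompt here is illustrative, not the study's prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

abstract = (
    "We randomly assigned 424 patients to receive drug A (n=212) "
    "or placebo (n=212)."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Return only the total number of randomized participants, "
                    "or 'unknown' if the abstract does not state it."},
        {"role": "user", "content": abstract},
    ],
    temperature=0,
)

print(response.choices[0].message.content)
```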
Discussion: This study presents an NER model that extracts entities from the abstract of an RCT from which the sample size can be predicted. The entities can be combined in different ways to obtain models with different characteristics.
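One plausible way to combine such entities into a single prediction, shown purely as an illustration and not necessarily the authors' scheme, is a fallback rule that prefers an explicitly stated total and otherwise sums per-arm counts.

```python
# Illustrative combination of extracted entities into a sample-size prediction;
# the entity names and fallback order are assumptions.
from typing import Optional


def predict_sample_size(entities: dict[str, list[int]]) -> Optional[int]:
    """Prefer an explicitly stated total; otherwise sum per-arm counts."""
    if entities.get("TOTAL_RANDOMIZED"):
        return entities["TOTAL_RANDOMIZED"][0]
    if entities.get("ARM_SIZE"):
        return sum(entities["ARM_SIZE"])
    return None  # abstain when no usable entity was found


print(predict_sample_size({"ARM_SIZE": [212, 212]}))  # -> 424
```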
Conclusion: Training an NER model to predict the sample size of RCTs is feasible. Large language models can deliver similar performance without prior training on the task, although at a higher cost due to proprietary technology and/or the required computational power.
Keywords: GPT-4; evidence-based medicine; machine learning; natural language processing; randomized controlled trial; text mining; transformer.