Establishing best practices in large language model research: an application to repeat prompting

J Am Med Inform Assoc. 2024 Dec 4:ocae294. doi: 10.1093/jamia/ocae294. Online ahead of print.

Abstract

Objectives: We aimed to demonstrate the importance of establishing best practices in large language model research, using repeat prompting as an illustrative example.

Materials and methods: Using data from a prior study investigating potential model bias in peer review of medical abstracts, we compared methods that ignore the correlation in model outputs induced by repeated prompting with a random-effects method that accounts for this correlation.
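As a minimal illustration of the random-effects approach (a sketch under assumed data, not the study's actual code; the column names score, author_group, and abstract_id are hypothetical placeholders), a mixed model with a random intercept per abstract can absorb the correlation among repeated prompts:

    # Mixed model with a random intercept per abstract, so repeated prompts
    # of the same abstract are not treated as independent observations.
    # Column names and the input file are hypothetical placeholders.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("repeat_prompt_scores.csv")  # one row per model output (assumed layout)

    model = smf.mixedlm("score ~ author_group", data=df, groups=df["abstract_id"])
    result = model.fit()
    print(result.summary())

    # Intraclass correlation from the fitted variance components:
    # between-abstract variance / (between-abstract + residual variance).
    var_between = result.cov_re.iloc[0, 0]
    icc = var_between / (var_between + result.scale)
    print(f"ICC = {icc:.2f}")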

Results: High correlation within groups was found when repeatedly prompting the model, with an intraclass correlation coefficient of 0.69. Ignoring this inherent correlation in the data led to more than 100-fold inflation of the effective sample size. After appropriately accounting for this issue, the authors' results reversed from a small but highly significant finding to no evidence of model bias.
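To illustrate why a high intraclass correlation matters (the design-effect formula below is a standard result for clustered data, not necessarily the exact calculation used in the study): with n abstracts each prompted m times and intraclass correlation rho, the effective sample size is

\[
n_{\mathrm{eff}} = \frac{n\,m}{1 + (m - 1)\,\rho}
\]

so at rho = 0.69 the n*m repeated outputs carry far less information than n*m independent observations, and treating them as independent overstates the sample size by roughly the design effect 1 + (m - 1)*rho.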

Discussion: Best practices for LLM research are urgently needed, as demonstrated by this case, in which accounting for repeat prompting in the analysis was critical to reaching accurate study conclusions.

Keywords: large language model; multilevel analysis; peer review.