Objectives: We aimed to demonstrate the importance of establishing best practices in large language model research, using repeat prompting as an illustrative example.
Materials and methods: Using data from a prior study investigating potential model bias in peer review of medical abstracts, we compared analytic methods that ignore the correlation among model outputs induced by repeated prompting with a random-effects method that accounts for this correlation.
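For illustration, a minimal sketch of the two analytic approaches is given below. This is not the original study's code; the data file and column names (score, author_group, abstract_id) are hypothetical, and it assumes one row per model output with repeated outputs sharing an abstract_id.

```python
# Minimal sketch: naive analysis vs a random-intercept (random-effects) analysis
# of repeated prompts. Column names and file are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("review_scores.csv")  # hypothetical: one row per model output

# Naive analysis: treats every output as independent, ignoring that outputs
# generated for the same abstract are correlated.
naive = smf.ols("score ~ author_group", data=df).fit()

# Random-effects analysis: a random intercept per abstract absorbs the
# within-abstract correlation induced by repeated prompting.
mixed = smf.mixedlm("score ~ author_group", data=df, groups=df["abstract_id"]).fit()

print(naive.summary())
print(mixed.summary())
```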
Results: High within-group correlation was found when repeatedly prompting the model, with an intraclass correlation coefficient of 0.69. Ignoring this inherent correlation in the data led to over 100-fold inflation of the effective sample size. After appropriately accounting for this issue, the original authors' results reversed from a small but highly significant finding to no evidence of model bias.
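The scale of this inflation can be illustrated with the standard cluster design effect. The sketch below uses the reported intraclass correlation of 0.69; the number of abstracts and of repeat prompts per abstract are hypothetical values chosen only to show the arithmetic, not the study's actual design.

```python
# Hypothetical illustration of the cluster design effect: how much the apparent
# sample size is overstated when correlated repeat-prompt outputs are treated
# as independent observations.
icc = 0.69          # intraclass correlation coefficient reported in the abstract
m = 150             # hypothetical number of repeat prompts per abstract
n_abstracts = 50    # hypothetical number of distinct abstracts

n_nominal = n_abstracts * m               # outputs counted as if independent
design_effect = 1 + (m - 1) * icc         # variance inflation per cluster
n_effective = n_nominal / design_effect   # information actually contributed

print(f"nominal n = {n_nominal}, effective n = {n_effective:.0f}, "
      f"inflation = {design_effect:.0f}x")
```

Under these assumed values the nominal sample of 7500 outputs carries roughly the information of 72 independent observations, an inflation factor of about 104.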
Discussion: Best practices for LLM research are urgently needed, as demonstrated by this case, in which accounting for repeated prompting in the analysis was critical to drawing accurate study conclusions.
Keywords: large language model; multilevel analysis; peer review.
© The Author(s) 2024. Published by Oxford University Press on behalf of the American Medical Informatics Association.