Objectives: We aimed to demonstrate the importance of establishing best practices in large language model research, using repeat prompting as an illustrative example.
Materials and methods: Using data from a prior study investigating potential model bias in peer review of medical abstracts, we compared analytic methods that ignore the correlation among model outputs induced by repeated prompting with a random-effects method that accounts for this correlation.
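For illustration, a minimal sketch of the two analytic approaches is given below. This is not the original study's code; the data file and column names (score, author_group, abstract_id) are hypothetical, and it assumes one row per model output with repeated outputs sharing an abstract_id.

```python
# Minimal sketch: naive analysis vs a random-intercept (random-effects) analysis
# of repeated prompts. Column names and file are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("review_scores.csv")  # hypothetical: one row per model output

# Naive analysis: treats every output as independent, ignoring that outputs
# generated for the same abstract are correlated.
naive = smf.ols("score ~ author_group", data=df).fit()

# Random-effects analysis: a random intercept per abstract absorbs the
# within-abstract correlation induced by repeated prompting.
mixed = smf.mixedlm("score ~ author_group", data=df, groups=df["abstract_id"]).fit()

print(naive.summary())
print(mixed.summary())
```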
Results: High within-group correlation was found when repeatedly prompting the model, with an intraclass correlation coefficient of 0.69. Ignoring this inherent correlation in the data led to over 100-fold inflation of the effective sample size. After appropriately accounting for this issue, the original authors' results reversed from a small but highly significant finding to no evidence of model bias.
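The scale of this inflation can be illustrated with the standard cluster design effect. The sketch below uses the reported intraclass correlation of 0.69; the number of abstracts and of repeat prompts per abstract are hypothetical values chosen only to show the arithmetic, not the study's actual design.

```python
# Hypothetical illustration of the cluster design effect: how much the apparent
# sample size is overstated when correlated repeat-prompt outputs are treated
# as independent observations.
icc = 0.69          # intraclass correlation coefficient reported in the abstract
m = 150             # hypothetical number of repeat prompts per abstract
n_abstracts = 50    # hypothetical number of distinct abstracts

n_nominal = n_abstracts * m               # outputs counted as if independent
design_effect = 1 + (m - 1) * icc         # variance inflation per cluster
n_effective = n_nominal / design_effect   # information actually contributed

print(f"nominal n = {n_nominal}, effective n = {n_effective:.0f}, "
      f"inflation = {design_effect:.0f}x")
```

Under these assumed values the nominal sample of 7500 outputs carries roughly the information of 72 independent observations, an inflation factor of about 104.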
Discussion: Best practices for LLM research are urgently needed, as demonstrated by this case, in which accounting for repeated prompting in the analysis was critical to drawing accurate study conclusions.
Keywords: large language model; multilevel analysis; peer review.
© The Author(s) 2024. Published by Oxford University Press on behalf of the American Medical Informatics Association.