Effectiveness of a large language model for clinical information retrieval regarding shoulder arthroplasty

Jacob F Oeding; Amy Z Lu; Michael Mazzucco; Michael C Fu; David M Dines; Russell F Warren; Lawrence V Gulotta; Joshua S Dines; Kyle N Kunze

doi:10.1002/jeo2.70114

J Exp Orthop. 2024 Dec 17;11(4):e70114. doi: 10.1002/jeo2.70114. eCollection 2024 Oct.

Authors

Jacob F Oeding¹, Amy Z Lu², Michael Mazzucco², Michael C Fu³,⁴, David M Dines³,⁴, Russell F Warren³,⁴, Lawrence V Gulotta³,⁴, Joshua S Dines³,⁴, Kyle N Kunze³,⁴

Affiliations

¹ Department of Orthopaedics, Institute of Clinical Sciences, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden.
² Weill Cornell Medical College, New York, New York, USA.
³ Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA.
⁴ Sports Medicine and Shoulder Institute, Hospital for Special Surgery, New York, New York, USA.

Abstract

Purpose: To determine the scope and accuracy of medical information provided by ChatGPT-4 in response to clinical queries concerning total shoulder arthroplasty (TSA), and to compare these results to those of the Google search engine.

Methods: A patient-replicated query for 'total shoulder replacement' was performed using both Google Web Search (the most frequently used search engine worldwide) and ChatGPT-4. The top 10 frequently asked questions (FAQs), their answers, and the associated sources were extracted. The search was then repeated independently to identify the top 10 FAQs requiring numerical responses, so that the concordance of answers between Google and ChatGPT-4 could be compared. The clinical relevance and accuracy of the provided information were graded by two blinded orthopaedic shoulder surgeons.

Results: Concerning FAQs with numeric responses, 8 out of 10 (80%) had identical answers or substantial overlap between ChatGPT-4 and Google. Accuracy of information was not significantly different (p = 0.32). Google sources included 40% medical practices, 30% academic, 20% single-surgeon practice, and 10% social media, while ChatGPT-4 used 100% academic sources, representing a statistically significant difference (p = 0.001). Only 3 out of 10 (30%) FAQs with open-ended answers were identical between ChatGPT-4 and Google. The clinical relevance of FAQs was not significantly different (p = 0.18). Google sources for open-ended questions included academic (60%), social media (20%), medical practice (10%) and single-surgeon practice (10%), while 100% of sources for ChatGPT-4 were academic, representing a statistically significant difference (p = 0.0025).
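The abstract does not name the statistical test behind the source-type comparisons. As an illustration only, the sketch below assumes the sources are collapsed into academic versus non-academic and compared with a Pearson chi-square test without continuity correction. Both the collapsing and the choice of test are assumptions, not the authors' stated method, and the function name `chi_square_2x2` is hypothetical. With only 10 sources per group the expected cell counts are small, so an exact test would often be preferred in practice.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / denom
    # With 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Numeric-response FAQs, counts taken from the Results:
# Google: 3 of 10 sources academic; ChatGPT-4: 10 of 10 academic.
# Assumed table layout: rows = search tool, columns = (academic, non-academic).
stat, p = chi_square_2x2(3, 7, 10, 0)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")
```

The same function can be applied to the open-ended FAQs by passing the corresponding counts (6 academic and 4 non-academic sources for Google, 10 and 0 for ChatGPT-4).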

Conclusion: ChatGPT-4 provided trustworthy academic sources for medical information retrieval concerning TSA, while sources used by Google were heterogeneous. Accuracy and clinical relevance of information were not significantly different between ChatGPT-4 and Google.

Level of evidence: Level IV cross-sectional.

Keywords: ChatGPT; LLM; information retrieval; large language model; total shoulder arthroplasty.