Can Artificial Intelligence Deceive Residency Committees? A Randomized Multicenter Analysis of Letters of Recommendation

Samuel K Simister; Eric G Huish; Eugene Y Tsai; Hai V Le; Andrea Halim; Dominick Tuason; John P Meehan; Holly B Leshikar; Augustine M Saiz; Zachary C Lum

doi:10.5435/JAAOS-D-24-00438

J Am Acad Orthop Surg. 2024 Dec 12. doi: 10.5435/JAAOS-D-24-00438. Online ahead of print.

Affiliation

¹ From the University of California, Davis, Sacramento, CA (Simister, Le, Meehan, Leshikar, Saiz, and Lum), San Joaquin General Hospital, French Camp, CA (Huish), Cedars-Sinai, Los Angeles, CA (Tsai), and Yale University, New Haven, CT (Halim and Tuason).

PMID: 39693540

Abstract

Introduction: The introduction of generative artificial intelligence (AI) may have a profound effect on residency applications. In this study, we explore the abilities of AI-generated letters of recommendation (LORs) by evaluating the accuracy of orthopaedic surgery residency selection committee members to identify LORs written by human or AI authors.

Methods: In a multicenter, single-blind trial, a total of 45 LORs (15 human, 15 ChatGPT, and 15 Google BARD) were curated. Seven faculty reviewers from four residency programs were asked to grade each of the 45 LORs, presented in random order, on the 11 characteristics outlined in the American Orthopaedic Association's standardized LOR. Reviewers also rated on a 1 to 10 scale how they would rank the applicant and their desire to have the applicant in the program, and indicated whether they thought the letter was generated by a human or AI author. Analysis included descriptive statistics, ordinal regression, and a receiver operating characteristic curve to compare accuracy based on the number of letters reviewed.

Results: Faculty reviewers correctly identified 40% (42/105) of human-generated and 63% (132/210) of AI-generated letters (P < 0.001), and accuracy did not improve as reviewers read more letters (AUC 0.451, P = 0.102). When analyzed by perceived author, letters marked as human-generated had significantly higher means for all variables (P = 0.01). BARD did markedly better than human authors in accuracy (3.25 [1.79 to 5.92], P < 0.001), adaptability (1.29 [1.02 to 1.65], P = 0.034), and perceived commitment (1.56 [0.99 to 2.47], P < 0.055). Additional analysis controlling for reviewer background showed no differences in outcomes based on experience or familiarity with the AI programs.
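
The identification rates above can be re-derived from the reported counts. The short sketch below is illustrative only and is not the authors' analysis code: it assumes that every one of the seven reviewers graded all 45 letters, as the Methods describe, and the names `rating_totals` and `accuracy` are hypothetical. The pooled overall rate it prints is derived here from the reported counts and is not a figure stated in the abstract.

```python
from fractions import Fraction

# Counts taken from the abstract; everything else is derived.
REVIEWERS = 7
LETTERS = {"human": 15, "ChatGPT": 15, "BARD": 15}
CORRECT = {"human": 42, "ai": 132}  # correctly identified ratings


def rating_totals(reviewers, letters):
    """Number of ratings per true author type, assuming each
    reviewer graded every letter (assumption, see lead-in)."""
    human = reviewers * letters["human"]
    ai = reviewers * (letters["ChatGPT"] + letters["BARD"])
    return {"human": human, "ai": ai}


def accuracy(correct, totals):
    """Per-group and pooled proportion of correct identifications."""
    rates = {k: Fraction(correct[k], totals[k]) for k in totals}
    rates["overall"] = Fraction(sum(correct.values()), sum(totals.values()))
    return rates


totals = rating_totals(REVIEWERS, LETTERS)
rates = accuracy(CORRECT, totals)

for name, rate in rates.items():
    print(f"{name}: {float(rate):.1%}")
```

Under that assumption the denominators come out to 105 human-letter ratings and 210 AI-letter ratings, matching the fractions reported in the Results.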

Conclusion: Faculty members were unable to distinguish human-generated from AI-generated LORs 50% of the time, which suggests that AI can generate LORs comparable to those written by human authors. This highlights the need for selection committees to reconsider the role and influence of LORs in residency applications.