Can Artificial Intelligence Deceive Residency Committees? A Randomized Multicenter Analysis of Letters of Recommendation

J Am Acad Orthop Surg. 2024 Dec 12. doi: 10.5435/JAAOS-D-24-00438. Online ahead of print.

Abstract

Introduction: The advent of generative artificial intelligence (AI) may have a profound effect on residency applications. In this study, we explore the capabilities of AI-generated letters of recommendation (LORs) by evaluating how accurately orthopaedic surgery residency selection committee members can distinguish LORs written by human authors from those written by AI.

Methods: In a multicenter, single-blind trial, a total of 45 LORs (15 human, 15 ChatGPT, and 15 Google BARD) were curated. Seven faculty reviewers from four residency programs were presented with the letters in random order and asked to grade each of the 45 LORs on the 11 characteristics outlined in the American Orthopaedic Association's standardized LOR, to rate on a 1 to 10 scale how they would rank the applicant and their desire to have the applicant in the program, and to indicate whether they thought the letter was written by a human or an AI author. Analysis included descriptive statistics, ordinal regression, and a receiver operating characteristic curve to compare accuracy based on the number of letters reviewed.

Results: Faculty reviewers correctly identified 40% (42/105) of human-generated and 63% (132/210) of AI-generated letters (P < 0.001), and accuracy did not increase with the number of letters reviewed (AUC 0.451, P = 0.102). When analyzed by perceived author, letters marked as human-generated had significantly higher means for all variables (P = 0.01). BARD performed markedly better than human authors in accuracy (3.25 [1.79 to 5.92], P < 0.001), adaptability (1.29 [1.02 to 1.65], P = 0.034), and perceived commitment (1.56 [0.99 to 2.47], P < 0.055). Additional analysis controlling for reviewer background showed no differences in outcomes based on experience or familiarity with the AI programs.
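
The identification rates above can be checked from the counts given in the Methods and Results. The following is a minimal sketch rather than the authors' analysis: the variable names are illustrative, and the pooled accuracy it prints is derived here from the reported counts, not a figure stated in the abstract.

```python
# Counts taken from the Methods and Results; variable names are illustrative.
reviewers = 7
human_letters = 15
ai_letters = 15 + 15  # ChatGPT + Google BARD

# Each reviewer graded every letter, so ratings = reviewers x letters.
human_ratings = reviewers * human_letters  # 105
ai_ratings = reviewers * ai_letters        # 210

# Correct identifications as reported in the Results.
human_correct = 42
ai_correct = 132

human_accuracy = human_correct / human_ratings
ai_accuracy = ai_correct / ai_ratings

# Pooled accuracy across all ratings (derived here, not reported in the abstract).
overall_accuracy = (human_correct + ai_correct) / (human_ratings + ai_ratings)

print(f"Human-generated letters identified: {human_accuracy:.1%}")
print(f"AI-generated letters identified:    {ai_accuracy:.1%}")
print(f"All letters identified:             {overall_accuracy:.1%}")
```

Under these counts, the denominators of 105 and 210 follow directly from seven reviewers each grading 15 human and 30 AI letters, and the pooled rate works out to 174 of 315 ratings, close to the chance-level performance described in the Conclusion.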

Conclusion: Faculty members failed to distinguish human-generated from AI-generated LORs 50% of the time, which suggests that AI can generate LORs comparable to those written by human authors. This highlights the need for selection committees to reconsider the role and influence of LORs in residency applications.