Importance: Medical ethics is inherently complex, shaped by a broad spectrum of opinions, experiences, and cultural perspectives. The integration of large language models (LLMs) into healthcare is recent, and understanding how consistently they adhere to ethical standards is essential.
Objective: To compare agreement rates in answering questions on ethically ambiguous situations between three frontier LLMs (GPT-4, Gemini-pro-1.5, and Llama-3-70b) and a multidisciplinary physician group.
Methods: In this cross-sectional study, three LLMs generated 1,248 medical ethics questions derived from the principles outlined in the American College of Physicians Ethics Manual. The topics spanned traditional, inclusive, interdisciplinary, and contemporary themes. Each model was then tasked with answering all generated questions. Twelve practicing physicians evaluated and responded to a randomly selected 10% subset of these questions. We compared agreement rates in question answering among the physicians, between the physicians and the LLMs, and among the LLMs.
Results: The models generated a total of 3,744 answers. Although physicians rated the questions' complexity as moderate (scores between 2 and 3 on a 5-point scale), their agreement rate was only 55.9%. The agreement rate between physicians and LLMs was similarly low at 57.9%. In contrast, the agreement rate among LLMs was notably higher at 76.8% (p < 0.001), underscoring the greater consistency of LLM responses compared with both physician-physician and physician-LLM agreement.
Conclusions: LLMs demonstrate higher agreement rates than physicians in ethically complex scenarios, suggesting their potential utility as consultants in ambiguous ethical situations. Future research should explore how LLMs can enhance consistency while adapting to the complexities of real-world ethical dilemmas.