Importance: Medical ethics is inherently complex, shaped by a broad spectrum of opinions, experiences, and cultural perspectives. The integration of large language models (LLMs) into healthcare is recent, and understanding how consistently they adhere to ethical standards is essential.
Objective: To compare agreement rates in answering questions on ethically ambiguous situations between three frontier LLMs (GPT-4, Gemini-pro-1.5, and Llama-3-70b) and a multidisciplinary physician group.
Methods: In this cross-sectional study, three LLMs generated 1,248 medical ethics questions derived from the principles outlined in the American College of Physicians Ethics Manual. The topics spanned traditional, inclusive, interdisciplinary, and contemporary themes. Each model was then tasked with answering all generated questions. Twelve practicing physicians evaluated and responded to a randomly selected 10% subset of these questions. We compared agreement rates in question answering among the physicians, between the physicians and the LLMs, and among the LLMs.
Results: The models generated a total of 3,744 answers. Although physicians rated the questions' complexity as moderate (scores between 2 and 3 on a 5-point scale), their agreement rate was only 55.9%. The agreement rate between physicians and LLMs was similarly low at 57.9%. In contrast, the agreement rate among LLMs was notably higher at 76.8% (p < 0.001), underscoring the greater consistency of LLM responses compared with both physician-physician and physician-LLM agreement.
Conclusions: LLMs demonstrate higher agreement rates than physicians in ethically complex scenarios, suggesting their potential utility as consultants in ambiguous ethical situations. Future research should explore how LLMs can enhance consistency while adapting to the complexities of real-world ethical dilemmas.