Utilization of ChatGPT for Rhinology Patient Education: Limitations in a Surgical Sub-Specialty

OTO Open. 2025 Jan 7;9(1):e70065. doi: 10.1002/oto2.70065. eCollection 2025 Jan-Mar.

Abstract

Objective: To analyze the accuracy of ChatGPT-generated responses to common rhinologic patient questions.

Methods: Ten common questions from rhinology patients were compiled by a panel of 4 rhinology fellowship-trained surgeons based on clinical patient experience. This panel (Panel 1) developed consensus "expert" responses to each question. Questions were individually posed to ChatGPT (version 3.5) and its responses recorded. ChatGPT-generated responses were individually graded by Panel 1 on a scale of 0 (incorrect) to 3 (correct and exceeding the quality of expert responses). A 2nd panel was given the consensus and ChatGPT responses to each question and asked to guess which response corresponded to which source. They then graded ChatGPT responses using the same criteria as Panel 1. Question-specific and overall mean grades for ChatGPT responses, as well as interclass correlation coefficient (ICC) as a measure of interrater reliability, were calculated.

Results: The overall mean grade for ChatGPT responses was 1.65/3. For 2 out of 10 questions, ChatGPT responses were equal to or better than expert responses. However, for 8 out of 10 questions, ChatGPT provided responses that were incorrect, false, or incomplete based on mean rater grades. Overall ICC was 0.526, indicating moderate reliability among raters of ChatGPT responses. Reviewers were able to discern ChatGPT from human responses with 97.5% accuracy.

Conclusion: This preliminary study demonstrates overall near-complete and variably accurate responses provided by ChatGPT to common rhinologic questions, demonstrating important limitations in nuanced subspecialty fields.

Keywords: artificial intelligence; chatbot; education technology; machine learning; neural network; patient education; rhinology.