ChatGPT often "hallucinates" or produces misleading content, underscoring the need for formal validation at the professional level before it can be relied on in nursing education. We evaluated two free chatbots (Google Gemini and GPT-3.5) and a commercial version (GPT-4) on 250 standardized questions from a simulated nursing licensure exam that closely matches the content and complexity of the actual exam. Gemini answered 73.2% (183/250) of the questions correctly, GPT-3.5 answered 72.0% (180/250), and GPT-4 reached a notably higher 92.4% (231/250). GPT-4's highest error rate (13.3%) occurred in the psychosocial integrity category.
Copyright © 2024 National League for Nursing.