Purpose: Our study aimed to: i) Assess the readability of textbook explanations using established readability indexes; ii) Compare these scores with those of GPT-4's default explanations, ensuring similar word counts for direct comparison; iii) Evaluate GPT-4's adaptability by having it simplify high-complexity explanations; iv) Determine the reliability of GPT-3.5 and GPT-4 in providing accurate answers.
Material and methods: We used a textbook designed for American Board of Physical Medicine and Rehabilitation (ABPMR) certification. Our analysis covered 50 multiple-choice questions on non-traumatic spinal cord injury (NTSCI), each with a detailed explanation.
Results: Our analysis revealed statistically significant differences in readability, with the textbook explanations achieving a mean score of 14.5 (SD = 2.5) versus 17.3 (SD = 1.9) for GPT-4, indicating that GPT-4's explanations are generally more complex (p < 0.001). According to the Flesch Reading Ease Score, 86% of GPT-4's explanations fell into the 'Very difficult' category, significantly more than the textbook's 58% (p = 0.006). GPT-4 demonstrated adaptability by reducing the mean readability score of the nine most complex explanations while maintaining their word count. Regarding reliability, GPT-3.5 and GPT-4 answered 84% and 96% of questions correctly, respectively, with GPT-4 outperforming GPT-3.5 (p = 0.046).
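(For context, a minimal sketch of how such a comparison could be run, assuming Python with the textstat and scipy packages; the choice of the Flesch-Kincaid grade for the numeric scores, the cut-off of 30 for 'Very difficult', and the placeholder text lists are illustrative assumptions, not the study's actual pipeline.)

    import textstat
    from scipy import stats

    textbook_expls = ["..."]  # placeholder: 50 textbook explanations
    gpt4_expls = ["..."]      # placeholder: 50 GPT-4 explanations

    # Grade-level readability per explanation (assumed index; higher = harder to read)
    tb_grades = [textstat.flesch_kincaid_grade(t) for t in textbook_expls]
    g4_grades = [textstat.flesch_kincaid_grade(t) for t in gpt4_expls]

    # Compare mean readability between the two sources (cf. 14.5 vs. 17.3, p < 0.001)
    t_stat, p_mean = stats.ttest_ind(tb_grades, g4_grades)

    # Flesch Reading Ease below 30 is conventionally classed as 'Very difficult'
    tb_vd = sum(textstat.flesch_reading_ease(t) < 30 for t in textbook_expls)
    g4_vd = sum(textstat.flesch_reading_ease(t) < 30 for t in gpt4_expls)

    # Compare proportions of 'Very difficult' explanations (cf. 58% vs. 86%, p = 0.006)
    table = [[tb_vd, len(textbook_expls) - tb_vd],
             [g4_vd, len(gpt4_expls) - g4_vd]]
    odds_ratio, p_prop = stats.fisher_exact(table)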
Conclusions: Our results confirmed GPT-4's potential in medical education: it provided highly accurate yet often complex explanations for NTSCI, which it successfully simplified without losing accuracy.
Keywords: ChatGPT; chatbot; readability; reliability.