ChatGPT-4 Turbo and Meta's LLaMA 3.1: A Relative Analysis of Answering Radiology Text-Based Questions

Cureus. 2024 Nov 24;16(11):e74359. doi: 10.7759/cureus.74359. eCollection 2024 Nov.

Abstract

Aims and objectives: This study aimed to compare the accuracy of two AI models - OpenAI's GPT-4 Turbo (San Francisco, CA) and Meta's LLaMA 3.1 (Menlo Park, CA) - when answering a standardized set of pediatric radiology questions. The primary objective was to evaluate the overall accuracy of each model, while the secondary objective was to assess their performance within subsections.

Methods and materials: A total of 79 text-based pediatric radiology questions were selected from a pool of 302 questions for this comparison. The questions covered seven subsections, including musculoskeletal, chest, and neuroradiology. Image-based questions were excluded to focus on text interpretation and to minimize sampling bias within each model. Each model was tested independently on the same question set, and percent accuracy was calculated both for overall performance and for each individual subsection.

Results: GPT-4 Turbo achieved an overall accuracy of 88.6% (70/79 questions), outperforming LLaMA 3.1 at 77.2% (61/79). Within subsections, GPT-4 Turbo had higher accuracy in most areas; the exception was neuroradiology, where the two models were equally accurate. The subsections with the greatest accuracy for GPT-4 Turbo, in descending order, were chest and cardiac radiology (100%), musculoskeletal system (93.3%), and genitourinary system (92.9%). LLaMA 3.1's highest accuracy was 86.7% in the musculoskeletal system, while its lowest was 50.0% in chest radiology.
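
The overall accuracy figures can be checked with a short calculation. The following is a minimal sketch, not the authors' analysis code: the function name `percent_accuracy` and the dictionary layout are illustrative assumptions, and only the overall counts reported in the abstract (70/79 and 61/79) are used.

```python
def percent_accuracy(correct: int, total: int) -> float:
    """Return percent accuracy rounded to one decimal place.

    Hypothetical helper for illustration; not taken from the study.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of questions")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return round(100.0 * correct / total, 1)


# Overall counts as reported in the abstract: (correct answers, questions asked).
overall_counts = {
    "GPT-4 Turbo": (70, 79),
    "LLaMA 3.1": (61, 79),
}

for model, (correct, total) in overall_counts.items():
    print(f"{model}: {percent_accuracy(correct, total)}% ({correct}/{total})")
```

The same function would apply to each subsection, given its per-subsection counts of correct answers and questions.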

Conclusion: GPT-4 Turbo outperformed LLaMA 3.1 in answering pediatric radiology questions, both overall and within most subsections. These findings suggest that GPT-4 Turbo may offer more accurate responses than LLaMA 3.1 in specialized medical education, notwithstanding LLaMA 3.1's efficient performance. Future research should evaluate the performance of AI models in other fields.

Keywords: ai in medical education; chatgpt; large language models (llms); llama; pediatric radiology.