Limitations of GPT-3.5 and GPT-4 in Applying Fleischner Society Guidelines to Incidental Lung Nodules

Can Assoc Radiol J. 2024 May;75(2):412-416. doi: 10.1177/08465371231218250. Epub 2023 Dec 25.

Abstract

Purpose: To evaluate the accuracy of GPT-3.5, GPT-4, and a fine-tuned GPT-3.5 model in applying Fleischner Society recommendations to lung nodules. Methods: We generated 10 lung nodule descriptions for each of the 12 nodule categories from the Fleischner Society guidelines, incorporating them into a single fictitious report (n = 120). GPT-3.5 and GPT-4 were prompted to make follow-up recommendations based on the reports. We then incorporated the full guidelines into the prompts and re-submitted them. Finally, we re-submitted the prompts to a fine-tuned GPT-3.5 model. Results were analyzed using binary accuracy analysis in R. Results: GPT-3.5 accuracy in applying Fleischner Society guidelines was 0.058 (95% CI: 0.02, 0.12). GPT-4 accuracy was improved at 0.15 (95% CI: 0.09, 0.23; P = .02 for accuracy comparison). In recommending PET-CT and/or biopsy, both GPT-3.5 and GPT-4 had an F-score of 0.00. After explicitly including the Fleischner Society guidelines in the prompt, GPT-3.5 and GPT-4 significantly improved their accuracy to 0.42 (95% CI: 0.33, 0.51; P < .001) and to 0.66 (95% CI: 0.57, 0.74; P < .001), respectively. GPT-4 remained significantly better than GPT-3.5 (P < .001). The fine-tuned GPT-3.5 model accuracy was 0.46 (95% CI: 0.37, 0.55), not different from the GPT-3.5 model with guidelines included (P = .53). Conclusion: GPT-3.5 and GPT-4 performed poorly in applying widely known guidelines and never correctly recommended biopsy. Flawed knowledge and reasoning both contributed to their poor performance. While GPT-4 was more accurate than GPT-3.5, its inaccuracy rate was unacceptable for clinical practice. These results underscore the limitations of large language models for knowledge and reasoning-based tasks.

Keywords: GPT; guidelines; large language models; lung nodules; machine learning.

MeSH terms

  • Humans
  • Incidental Findings
  • Lung / diagnostic imaging
  • Lung Neoplasms* / diagnostic imaging
  • Practice Guidelines as Topic*
  • Reproducibility of Results
  • Solitary Pulmonary Nodule / diagnostic imaging