Evaluating ChatGPT's competency in radiation oncology: A comprehensive assessment across clinical scenarios

Sherif Ramadan; Adam Mutsaers; Po-Hsuan Cameron Chen; Glenn Bauman; Vikram Velker; Belal Ahmad; Andrew J Arifin; Timothy K Nguyen; David Palma; Christopher D Goodman

doi:10.1016/j.radonc.2024.110645

Radiother Oncol. 2025 Jan:202:110645. doi: 10.1016/j.radonc.2024.110645. Epub 2024 Nov 19.

Affiliations

¹ Department of Radiation Oncology, London Health Sciences Centre, London, ON, Canada.
² Department of Radiation Oncology, London Health Sciences Centre, London, ON, Canada.
³ Need Inc., Santa Monica, CA, USA.
⁴ Department of Radiation Oncology, London Health Sciences Centre, London, ON, Canada; Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, USA.

PMID: 39571686
DOI: 10.1016/j.radonc.2024.110645

Abstract

Purpose: Artificial intelligence (AI) and machine learning present an opportunity to enhance clinical decision-making in radiation oncology. This study aims to evaluate the competency of ChatGPT, an AI language model, in interpreting clinical scenarios and to assess its oncology knowledge.

Methods and materials: A series of clinical cases were designed covering 12 disease sites. Questions were grouped into domains: epidemiology, staging and workup, clinical management, treatment planning, cancer biology, physics, and surveillance. Royal College-certified radiation oncologists (ROs) reviewed cases and provided solutions. ROs scored responses on 3 criteria: conciseness (focused answers), completeness (addressing all aspects of the question), and correctness (answer aligns with expert opinion) using a standardized rubric. Scores ranged from 0 to 5 for each criterion for a total possible score of 15.

Results: Across 12 cases, 182 questions were answered with a total AI score of 2317/2730 (84 %). Scores by criteria were: completeness (79 %, range: 70-99 %), conciseness (92 %, range: 83-99 %), and correctness (81 %, range: 72-92 %). AI performed best in the domains of epidemiology (93 %) and cancer biology (93 %) and reasonably well in staging and workup (89 %), physics (86 %) and surveillance (82 %). Weaker domains included treatment planning (78 %) and clinical management (81 %). Statistical differences were driven by variations in the completeness (p < 0.01) and correctness (p = 0.04) criteria, whereas conciseness scored universally high (p = 0.91). These trends were consistent across disease sites.
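
The scoring arithmetic described in the Methods and Results can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the names `QuestionScore`, `max_total`, and `percent` are hypothetical, and the only values taken from the abstract are the three criteria, the 0 to 5 range per criterion, the 182 questions, and the 2317/2730 total.

```python
from dataclasses import dataclass

# Rubric as described in the abstract: three criteria, each scored 0 to 5.
CRITERIA = ("conciseness", "completeness", "correctness")
MAX_PER_CRITERION = 5


@dataclass
class QuestionScore:
    """Hypothetical container for one reviewer-scored answer."""
    conciseness: int
    completeness: int
    correctness: int

    def __post_init__(self) -> None:
        # Reject any score outside the 0-5 range stated in the rubric.
        for name in CRITERIA:
            value = getattr(self, name)
            if not 0 <= value <= MAX_PER_CRITERION:
                raise ValueError(f"{name} must be between 0 and {MAX_PER_CRITERION}")

    def total(self) -> int:
        # Per-question total, out of a possible 15.
        return sum(getattr(self, name) for name in CRITERIA)


def max_total(n_questions: int) -> int:
    """Maximum achievable score over n_questions questions."""
    return n_questions * len(CRITERIA) * MAX_PER_CRITERION


def percent(awarded: int, possible: int) -> float:
    """Awarded score as a percentage of the possible score."""
    return 100.0 * awarded / possible
```

With the abstract's figures, `max_total(182)` gives 2730, matching the reported denominator, and `percent(2317, 2730)` evaluates to roughly 84.9, which the abstract reports as 84 %.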

Conclusions: ChatGPT showed potential as a tool in radiation oncology, demonstrating a high degree of accuracy in several oncologic domains. However, this study highlights limitations with incorrect and incomplete answers in complex cases.

Keywords: Artificial intelligence; Large language models; Radiation oncology.

MeSH terms

Artificial Intelligence*
Clinical Competence*
Clinical Decision-Making
Humans
Machine Learning
Neoplasms / radiotherapy
Radiation Oncologists
Radiation Oncology* / standards