Purpose: No consensus exists on performance standards for evaluating responses to medical questions generated by generative artificial intelligence (AI). The purpose of this study was to assess the ability of Chat Generative Pre-trained Transformer (ChatGPT) to address medical questions about prostate cancer.
Materials and methods: A global online survey was conducted from April to June 2023 among >700 medical oncologists and urologists who treat patients with prostate cancer. Participants were unaware that the survey evaluated AI. In component 1, responses to nine questions were written independently by medical writers (MW; curated from medical websites) and by ChatGPT-4.0 (AI-generated from publicly available information). Respondents were randomly presented with AI-generated and MW-curated responses while blinded to their source; ratings against evaluation criteria and overall preference were recorded. Exploratory component 2 evaluated AI-generated responses to five complex questions with nuanced answers in the medical literature. Responses were evaluated on a 5-point Likert scale. Statistical significance was denoted by P < .05.
Results: In component 1, respondents (N = 602) consistently preferred the clarity of AI-generated responses over that of MW-curated responses for 7 of 9 questions (P < .05). Despite favoring AI-generated responses when blinded to the response source, respondents considered medical websites a more credible source (52%-67%) than ChatGPT (14%). Respondents in component 2 (N = 98) likewise considered medical websites more credible than ChatGPT, yet rated AI-generated responses highly across all evaluation criteria, despite the nuanced answers in the medical literature.
Conclusions: These findings provide insight into how clinicians rate AI-generated and MW-curated responses, using evaluation criteria that can be applied in future AI validation studies.
Keywords: artificial intelligence; medical oncology; proof of concept study; surveys and questionnaires; urology.