Assessing the performance of large language models (LLMs) in answering medical questions regarding breast cancer in the Chinese context

Ying Piao; Hongtao Chen; Shihai Wu; Xianming Li; Zihuang Li; Dong Yang

doi:10.1177/20552076241284771

Digit Health. 2024 Oct 7:10:20552076241284771. doi: 10.1177/20552076241284771. eCollection 2024 Jan-Dec.

Authors

Ying Piao¹, Hongtao Chen¹, Shihai Wu¹, Xianming Li¹, Zihuang Li¹, Dong Yang¹

Affiliation

¹ Department of Radiation Oncology, Shenzhen People's Hospital (The Second Clinical Medical College, Jinan University; The First Affiliated Hospital, Southern University of Science and Technology), Shenzhen, Guangdong, People's Republic of China.

Abstract

Purpose: Large language models (LLMs), deep learning models designed to comprehend and generate meaningful responses, have gained public attention in recent years. This study evaluates and compares the performance of LLMs in answering questions regarding breast cancer in the Chinese context.

Material and methods: ChatGPT, ERNIE Bot, and ChatGLM were chosen to answer 60 questions related to breast cancer posed by two oncologists. Responses were scored as comprehensive, correct but inadequate, mixed with correct and incorrect data, completely incorrect, or unanswered. The accuracy, length, and readability of the answers were compared across the models using statistical software.

Results: ChatGPT answered 60 questions, with 40 (66.7%) comprehensive answers and six (10.0%) correct but inadequate answers. ERNIE Bot answered 60 questions, with 34 (56.7%) comprehensive answers and seven (11.7%) correct but inadequate answers. ChatGLM generated 60 answers, with 35 (58.3%) comprehensive answers and six (10.0%) correct but inadequate answers. The differences in the chosen accuracy metrics among the three LLMs did not reach statistical significance, but only ChatGPT demonstrated a sense of human compassion. All three models were least accurate on questions regarding breast cancer treatment, with an average accuracy of 44.4%. ERNIE Bot's responses were significantly shorter than those of ChatGPT and ChatGLM (p < .001 for both). The readability scores of the three models did not differ significantly.
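The reported proportions can be reproduced from the counts above. The following is a minimal illustrative sketch, not the authors' analysis: the abstract does not say which statistical test or software was used, so the Pearson chi-square test here, the collapsing of scores into "comprehensive" versus "not comprehensive", and the helper names `percent` and `chi_square_2xk` are all assumptions made for illustration.

```python
import math

# Counts of "comprehensive" answers out of 60 questions, as reported in the Results.
comprehensive = {"ChatGPT": 40, "ERNIE Bot": 34, "ChatGLM": 35}
TOTAL_QUESTIONS = 60


def percent(count, total=TOTAL_QUESTIONS):
    """Share of answers in a category, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


def chi_square_2xk(successes, total_per_group):
    """Pearson chi-square statistic for a 2 x k table with equal group sizes.

    Assumption: this test is chosen for illustration only; the abstract
    does not name the test the authors applied.
    """
    k = len(successes)
    pooled_rate = sum(successes) / (k * total_per_group)
    expected_yes = pooled_rate * total_per_group
    expected_no = total_per_group - expected_yes
    stat = 0.0
    for observed_yes in successes:
        observed_no = total_per_group - observed_yes
        stat += (observed_yes - expected_yes) ** 2 / expected_yes
        stat += (observed_no - expected_no) ** 2 / expected_no
    return stat


percentages = {model: percent(n) for model, n in comprehensive.items()}
stat = chi_square_2xk(list(comprehensive.values()), TOTAL_QUESTIONS)

# With three groups the test has 2 degrees of freedom, and the chi-square
# survival function for 2 degrees of freedom is exactly exp(-x / 2).
p_value = math.exp(-stat / 2)

print(percentages)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
```

Under these assumptions the p-value is well above .05, which is consistent with the statement that the accuracy differences did not reach statistical significance, although the authors' own test and category grouping may differ.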

Conclusions: In the Chinese context, ChatGPT, ERNIE Bot, and ChatGLM currently show similar capabilities in answering breast cancer-related questions. These three LLMs may serve as adjunct informational tools for breast cancer patients in the Chinese context, offering guidance for general inquiries. However, for highly specialized issues, particularly in the realm of breast cancer treatment, LLMs cannot deliver reliable performance. They should be used under the supervision of healthcare professionals.

Keywords: Breast cancer; ChatGLM; ChatGPT; Chinese context; ERNIE Bot; large language model.