Background: Large language models (LLMs) have a potential role in providing adequate patient information.
Objectives: To compare the quality of LLM responses with that of established Dutch patient information resources (PIRs) in answering patient questions regarding melanoma.
Methods: Responses from ChatGPT versions 3.5 and 4.0, Gemini, and three leading Dutch melanoma PIRs to 50 melanoma-specific questions were examined at baseline and, for the LLMs, again after 8 months. Outcomes included medical accuracy, completeness, personalization and readability and, for the LLMs, reproducibility. Comparative analyses were performed within LLMs and within PIRs using Friedman's ANOVA, and between the best-performing LLMs and the gold-standard (GS) PIR using the Wilcoxon signed-rank test.
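To make the statistical design concrete, the sketch below shows how such comparisons could be run in Python with SciPy. This is purely illustrative and not the authors' analysis code: the per-question scores, the 1-5 rating scale and the variable names are hypothetical assumptions; in the study, real scores would come from expert raters.

```python
# Illustrative sketch only: Friedman's ANOVA within LLMs and a Wilcoxon
# signed-rank test of the best-performing LLM against the gold-standard
# PIR, on hypothetical per-question quality scores.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
n_questions = 50  # 50 melanoma-specific questions

# Hypothetical completeness ratings (1-5 scale) per question for each
# of the three LLMs; real data would come from expert assessment.
chatgpt35 = rng.integers(1, 6, n_questions)
chatgpt4 = rng.integers(1, 6, n_questions)
gemini = rng.integers(2, 6, n_questions)

# Within-LLM comparison across three related samples:
# Friedman's ANOVA (nonparametric repeated-measures test).
stat, p = friedmanchisquare(chatgpt35, chatgpt4, gemini)
print(f"Friedman: chi2 = {stat:.2f}, P = {p:.3f}")

# Paired comparison of the best-performing LLM on this outcome
# against the gold-standard PIR: Wilcoxon signed-rank test.
gs_pir = rng.integers(1, 6, n_questions)
stat, p = wilcoxon(gemini, gs_pir)
print(f"Wilcoxon: W = {stat:.1f}, P = {p:.3f}")
```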
Results: Within LLMs, ChatGPT-3.5 demonstrated the highest accuracy (P = 0.009). Gemini performed best in completeness (P < 0.001), personalization (P = 0.007) and readability (P < 0.001). PIRs were consistent in accuracy and completeness, with the general practitioner's website excelling in personalization (P = 0.013) and readability (P < 0.001). The best-performing LLMs outperformed the GS-PIR on completeness and personalization, yet were less accurate and less readable. Over time, response reproducibility decreased for all LLMs, with variability across outcomes.
Conclusions: Although LLMs show potential in providing highly personalized and complete responses to patient questions regarding melanoma, improving and safeguarding accuracy, reproducibility and accessibility is crucial before they can replace or complement conventional PIRs.
Large language models (LLMs) are a type of artificial intelligence that can process large amounts of information, and could play a role in providing patients with accurate information about melanoma. This study, conducted in the Netherlands, aimed to compare the quality of responses from LLMs and established Dutch patient information resources (PIRs) to questions from patients about melanoma. We evaluated the responses of ChatGPT versions 3.5 and 4.0, Gemini, and three leading Dutch melanoma PIRs to 50 specific melanoma-related questions. We looked at these responses initially and put the questions to the LLMs again after 8 months. We assessed medical accuracy, completeness, personalization and readability and, for the LLMs, reproducibility. We used statistical tests to compare results within LLMs, within PIRs, and between the best-performing LLMs and the gold-standard PIR (leaflet from dermatologists). We found that within the LLMs, ChatGPT-3.5 had the highest accuracy. Gemini performed best in terms of completeness, personalization and readability. The PIRs consistently showed high accuracy and completeness, with the general practitioner's website excelling in personalization and readability. The top-performing LLMs outperformed the best PIR on completeness and personalization, but were less accurate and less readable. Over time, the reproducibility of LLM responses decreased, with variability across outcomes. In conclusion, although LLMs have the potential to provide highly personalized and complete answers to patient questions about melanoma, their accuracy, reproducibility and accessibility need to be improved before they can be a reliable source of information or complement existing PIRs.
© The Author(s) 2024. Published by Oxford University Press on behalf of British Association of Dermatologists.