Purpose: Prior literature has suggested that large language models (LLMs) perform worse in languages other than English, including Arabic and French. However, no study has tested the multimodal performance of ChatGPT on French ophthalmology cases or compared it with the results reported in the English literature. We compared the performance of ChatGPT-4 in French and English on open-ended prompts using multimodal input data from retinal cases.
Methods: GPT-4 was prompted in English and French using a public dataset of 67 retinal cases from the ophthalmology education website OCTCases.com. Each prompt comprised the clinical case and accompanying ophthalmic images, along with the open-ended question: "What is the most likely diagnosis?" Systematic prompting was used to identify and compare the factors contributing to correct and incorrect responses. The primary outcome was diagnostic accuracy, defined as the proportion of correctly diagnosed cases in each language; responses were classified as correct or incorrect by comparison with the answer key on OCTCases. Secondary endpoints were the clinically relevant factors that the LLM reported as contributing to its decision-making.
Results: The diagnostic accuracies of GPT-4 in English and French were 35.8% and 28.4%, respectively (χ², P=0.36). Imaging findings were reported as most influential for correct diagnosis in English (37.5%) and French (42.1%) (P=0.76). In incorrectly diagnosed cases, imaging findings were primarily implicated in English (35.6%) and French (33.3%) (P=0.81). In incorrectly diagnosed cases, the differential diagnosis list contained the correct diagnosis in 39.5% of English cases and 41.7% of French cases (P=0.83).
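As an illustration of the primary comparison, the sketch below recomputes a Pearson χ² test on a 2×2 table of correct versus incorrect diagnoses by language. The counts are assumptions inferred from the reported percentages and the 67 cases (24/67 ≈ 35.8%, 19/67 ≈ 28.4%); the abstract does not state them, nor whether a continuity correction was applied, so the result should only be expected to fall near the reported P=0.36. The helper name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (chi2, p), with p taken from the chi-square distribution with 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


N_CASES = 67      # cases per language, from the Methods
correct_en = 24   # assumed count: 24/67 rounds to 35.8%
correct_fr = 19   # assumed count: 19/67 rounds to 28.4%

chi2, p = chi_square_2x2(
    correct_en, N_CASES - correct_en,
    correct_fr, N_CASES - correct_fr,
)
print(f"English accuracy: {100 * correct_en / N_CASES:.1f}%")
print(f"French accuracy:  {100 * correct_fr / N_CASES:.1f}%")
print(f"chi2 = {chi2:.3f}, P = {p:.3f}")
```

The same function could be applied to the secondary comparisons if the underlying counts were available; they are not given here, so no attempt is made to reconstruct them.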
Conclusion: Our results suggest that GPT-4 performed similarly in English and French on all quantitative performance metrics measured. Ophthalmic images were identified in both languages as critical for correct diagnosis. Future research should assess LLM comprehension through the clarity and the grammatical, cultural, and idiomatic accuracy of its responses.
Keywords: Artificial intelligence; French language; Grands modèles linguistiques; Image processing; Intelligence artificielle; Langue française; Large language models; Natural language processing; Optical coherence tomography; Tomographie par cohérence optique; Traitement du langage naturel; Traitement d’images.
Copyright © 2024 Elsevier Masson SAS. All rights reserved.