Evaluation of the ability of large language models to self-diagnose oral diseases

iScience. 2024 Nov 29;27(12):111495. doi: 10.1016/j.isci.2024.111495. eCollection 2024 Dec 20.

Abstract

Large language models (LLMs) offer potential in primary dental care. We evaluated the diagnostic capabilities of LLMs across various oral diseases and contexts. All LLMs showed diagnostic capability for temporomandibular joint disorders, periodontal disease, dental caries, and malocclusion. The prompts did not affect the performance of ChatGPT 3.5. When Chinese was used, the diagnostic ability of ChatGPT 3.5 improved for pulpitis (0% vs. 61.7%, p < 0.001) but decreased for pericoronitis (8% vs. 0%, p < 0.001). For ChatGPT 4.0, using Chinese improved the diagnosis of both pulpitis and pericoronitis (0% vs. 92% and 8% vs. 72%, respectively; p < 0.001). Claude 2 exhibited the highest accuracy in diagnosing pulpitis (36%, p = 0.048), and ChatGPT 4.0 showed complete diagnostic capability for pericoronitis. Llama 2 and Claude 3.5 Sonnet exhibited complete diagnostic capability for oral cancer. In conclusion, LLMs may be a useful tool for daily dental care but need further updates.

Keywords: Artificial intelligence; Computing methodology; Health sciences.