Performance Assessment of Large Language Models in Medical Consultation: A Comparative Study

JMIR Med Inform. 2025 Jan 4. doi: 10.2196/64318. Online ahead of print.

Abstract

Background: The recent introduction of generative artificial intelligence (AI) as an interactive consultant has prompted interest in assessing its applicability to medical discussions and consultations, particularly in the domain of depression.

Objective: This study assessed the capability of large language models (LLMs) to generate responses to depression-related queries.

Methods: Using the PubMedQA and QuoraQA datasets, we compared several LLMs (BioGPT, PMC-Llama, GPT-3.5, and Llama2) by measuring the similarity between each model's generated answers and the original reference answers.
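The abstract does not name the similarity metric used to score generated answers against reference answers. As a minimal illustration only, the sketch below computes a generated-vs-reference similarity score with Python's standard-library difflib.SequenceMatcher; the metric choice, the example strings, and the function name are placeholders, not the study's actual evaluation pipeline.

```python
from difflib import SequenceMatcher

def answer_similarity(generated: str, reference: str) -> float:
    """Return a similarity score in [0, 1] between two answer strings.

    SequenceMatcher.ratio() is a simple character-level measure used
    here purely for illustration; the study's actual metric is not
    specified in the abstract.
    """
    return SequenceMatcher(None, generated.lower(), reference.lower()).ratio()

# Hypothetical example: compare one model output to a reference answer.
reference = "Cognitive behavioral therapy is an effective treatment for depression."
generated = "Cognitive behavioural therapy is effective for treating depression."

score = answer_similarity(generated, reference)
print(f"similarity: {score:.2f}")
```

In a real evaluation, a semantic measure (for example, cosine similarity between sentence embeddings) would typically be preferred over character-level matching, since paraphrased answers can be semantically equivalent while sharing few surface characters.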

Results: The newer general-purpose LLMs, GPT-3.5 and Llama2, outperformed the biomedical-domain models, particularly in generating responses to medical questions from the PubMedQA dataset.

Conclusions: Given the rapid progress in LLM development in recent years, we hypothesize that version upgrades of general-purpose LLMs improve their ability to generate "knowledge text" in the biomedical domain more than domain-specific fine-tuning does. These findings are expected to contribute to the advancement of AI-based medical counseling systems.