Evaluating and Enhancing Large Language Models' Performance in Domain-Specific Medicine: Development and Usability Study With DocOA

J Med Internet Res. 2024 Jul 22;26:e58158. doi: 10.2196/58158.

Abstract

Background: The efficacy of large language models (LLMs) in domain-specific medicine, particularly for managing complex diseases such as osteoarthritis (OA), remains largely unexplored.

Objective: This study focused on evaluating and enhancing the clinical capabilities and explainability of LLMs in specific domains, using OA management as a case study.

Methods: A domain-specific benchmark framework was developed to evaluate LLMs across a spectrum ranging from domain-specific knowledge to applications in real-world clinical scenarios. DocOA, a specialized LLM for OA management that integrates retrieval-augmented generation (RAG) and instructional prompts, was then developed. Through RAG, DocOA identifies the clinical evidence on which its answers are based, making those answers explainable. The study compared the performance of GPT-3.5, GPT-4, and DocOA using objective and human evaluations.
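The abstract does not publish DocOA's implementation; the sketch below is a minimal, hypothetical illustration of the RAG-with-instructional-prompt pattern it describes. The corpus snippets, identifiers (Evidence, retrieve, build_prompt), and prompt wording are all placeholders, not the authors' code; a real system would index clinical literature with a learned retriever rather than token overlap.

```python
# Hypothetical sketch of the RAG pattern the abstract attributes to DocOA.
# All identifiers and sample snippets below are illustrative placeholders.
import re
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str  # citation label returned alongside the answer
    text: str    # guideline/trial excerpt used as grounding context


# Placeholder corpus; a production system would index clinical literature.
CORPUS = [
    Evidence("Guideline-A", "Exercise therapy is a first-line treatment for knee osteoarthritis."),
    Evidence("Guideline-B", "Topical NSAIDs are recommended before oral NSAIDs for knee OA."),
    Evidence("Trial-C", "Weight loss reduces pain and improves function in overweight patients with knee OA."),
]


def tokens(s: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z]+", s.lower()))


def retrieve(query: str, k: int = 2) -> list[Evidence]:
    """Toy retriever: rank snippets by shared-token count with the query."""
    return sorted(CORPUS, key=lambda doc: len(tokens(query) & tokens(doc.text)), reverse=True)[:k]


def build_prompt(question: str) -> str:
    """Instructional prompt that grounds the model in retrieved evidence and
    asks it to cite source labels, mirroring the explainability goal."""
    context = "\n".join(f"[{e.source}] {e.text}" for e in retrieve(question))
    return (
        "You are an assistant for osteoarthritis management.\n"
        "Answer ONLY from the evidence below and cite each source label you use.\n\n"
        f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


if __name__ == "__main__":
    # The assembled prompt would then be sent to the backing LLM via its API.
    print(build_prompt("What non-drug treatment helps knee osteoarthritis pain?"))
```

Because the answer is constrained to the retrieved, labeled excerpts, the cited source labels can be surfaced to clinicians alongside the response, which is the explainability mechanism the Methods section describes.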

Results: General LLMs such as GPT-3.5 and GPT-4 were less effective in the specialized domain of OA management, particularly in providing personalized treatment recommendations, whereas DocOA showed significant improvements.

Conclusions: This study introduces a novel benchmark framework that assesses multiple aspects of LLMs' domain-specific abilities, highlights the limitations of generalized LLMs in clinical contexts, and demonstrates the potential of tailored approaches for developing domain-specific medical LLMs.

Keywords: domain-specific benchmark framework; large language model; osteoarthritis management; retrieval-augmented generation.

MeSH terms

  • Humans
  • Machine Learning*
  • Osteoarthritis* / therapy