Development of a liver disease-specific large language model chat interface using retrieval-augmented generation

Jin Ge; Steve Sun; Joseph Owens; Victor Galvez; Oksana Gologorskaya; Jennifer C Lai; Mark J Pletcher; Ki Lai

doi:10.1097/HEP.0000000000000834

Hepatology. 2024 Nov 1;80(5):1158-1168. doi: 10.1097/HEP.0000000000000834. Epub 2024 Mar 7.

Authors

Jin Ge¹, Steve Sun², Joseph Owens², Victor Galvez², Oksana Gologorskaya² ³, Jennifer C Lai¹, Mark J Pletcher⁴, Ki Lai²

Affiliations

¹ Department of Medicine, Division of Gastroenterology and Hepatology, University of California-San Francisco, San Francisco, California, USA.
² UCSF Health Information Technology, University of California-San Francisco, San Francisco, California, USA.
³ Bakar Computational Health Sciences Institute, University of California-San Francisco, San Francisco, California, USA.
⁴ Department of Epidemiology and Biostatistics, University of California-San Francisco, San Francisco, California, USA.

PMID: 38451962
DOI: 10.1097/HEP.0000000000000834

Abstract

Background and aims: Large language models (LLMs) have significant capabilities in clinical information processing tasks. Commercially available LLMs, however, are not optimized for clinical uses and are prone to generating hallucinatory information. Retrieval-augmented generation (RAG) is an enterprise architecture that allows the embedding of customized data into LLMs. This approach "specializes" the LLMs and is thought to reduce hallucinations.
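As a rough illustration of the retrieval step that RAG adds in front of an LLM, the sketch below ranks a few passages against a question and builds a prompt from the best match. It is a toy under stated assumptions, not the LiVersa or Versa implementation: the bag-of-words cosine similarity stands in for a real text-embedding model, the function names (`retrieve`, `build_prompt`) are hypothetical, and the passages are invented placeholders rather than guidance text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, passages, k=2):
    """Return up to k passages with non-zero similarity to the question.

    Assumption: word counts stand in for the embeddings a real system
    would obtain from an embedding model.
    """
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(question, passages, k=2):
    """Assemble the text that would be sent to the LLM (no model call here)."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Hypothetical placeholder passages, not actual guidance document text.
PASSAGES = [
    "placeholder passage about screening intervals for liver cancer",
    "placeholder passage about management of ascites",
    "placeholder passage about vaccination schedules",
]

if __name__ == "__main__":
    print(build_prompt("How is ascites managed?", PASSAGES))
```

Restricting the prompt to retrieved passages is what "specializes" the model in this picture: the LLM is asked to answer from the supplied context rather than from its general training data.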

Approach and results: We developed "LiVersa," a liver disease-specific LLM, by using our institution's protected health information-compliant text embedding and LLM platform, "Versa." We conducted RAG on 30 publicly available American Association for the Study of Liver Diseases guidance documents to be incorporated into LiVersa. We evaluated LiVersa's performance by conducting 2 rounds of testing. First, we compared LiVersa's outputs versus those of trainees from a previously published knowledge assessment. LiVersa answered all 10 questions correctly. Second, we asked 15 hepatologists to evaluate the outputs of 10 hepatology topic questions generated by LiVersa, OpenAI's ChatGPT 4, and Meta's Large Language Model Meta AI 2. LiVersa's outputs were more accurate but were rated less comprehensive and less safe than those of ChatGPT 4.

Conclusions: In this demonstration, we built disease-specific and protected health information-compliant LLMs using RAG. While LiVersa demonstrated higher accuracy in answering questions related to hepatology, there were some deficiencies due to limitations set by the number of documents used for RAG. LiVersa will likely require further refinement before potential live deployment. The LiVersa prototype, however, is a proof of concept for utilizing RAG to customize LLMs for clinical use cases.

MeSH terms

Humans
Information Storage and Retrieval / methods
Liver Diseases*
Natural Language Processing

Grants and funding

P30 DK026743/DK/NIDDK NIH HHS/United States