Enhancing Large Language Model Reliability: Minimizing Hallucinations with Dual Retrieval-Augmented Generation Based on the Latest Diabetes Guidelines

Jaedong Lee; Hyosoung Cha; Yul Hwangbo; Wonjoong Cheon

doi:10.3390/jpm14121131

J Pers Med. 2024 Nov 30;14(12):1131. doi: 10.3390/jpm14121131.

Authors

Jaedong Lee¹ ², Hyosoung Cha¹, Yul Hwangbo¹ ², Wonjoong Cheon³

Affiliations

¹ Healthcare AI Team, National Cancer Center, Goyang-si 10408, Gyeonggi-do, Republic of Korea.
² Department of Cancer AI & Digital Health, Graduate School of Cancer Science and Policy, National Cancer Center, Goyang-si 10408, Gyeonggi-do, Republic of Korea.
³ Department of Radiation Oncology, Seoul St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul 06591, Republic of Korea.

Abstract

Background/Objectives: Large language models (LLMs) show promise in healthcare but are prone to hallucinations, particularly in rapidly evolving fields such as diabetes management. Traditional methods for updating LLMs are resource-intensive, so new approaches are needed to deliver reliable, current medical information. This study aimed to develop and evaluate a novel retrieval system to enhance LLM reliability in diabetes management across different languages and guidelines.

Methods: We developed a dual retrieval-augmented generation (RAG) system integrating both the Korean Diabetes Association and the American Diabetes Association 2023 guidelines. The system employed dense retrieval with 11 embedding models (including OpenAI, Upstage, and multilingual models) and sparse retrieval using the BM25 algorithm with language-specific tokenizers. Performance was evaluated across different top-k values, leading to optimized ensemble retrievers for each guideline.

Results: For dense retrievers, Upstage's Solar Embedding-1-large and OpenAI's text-embedding-3-large showed superior performance for the Korean and English guidelines, respectively. Multilingual models outperformed language-specific models in both cases. For sparse retrievers, the ko_kiwi tokenizer demonstrated superior performance for Korean text, while both ko_kiwi and porter_stemmer showed comparable effectiveness for English text. The ensemble retrievers, combining the optimal dense and sparse configurations, demonstrated enhanced coverage while maintaining precision.

Conclusions: This study presents an effective dual RAG system that enhances LLM reliability in diabetes management across different languages. The successful implementation with both Korean and American guidelines demonstrates the system's cross-regional capability, laying a foundation for more trustworthy AI-assisted healthcare applications.
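As an illustration of the ensemble idea described in the abstract, the following is a minimal sketch of how a sparse (BM25) ranking and a dense (embedding-similarity) ranking might be merged into one top-k list. The abstract does not state how the two retrievers were fused, so the weighted reciprocal rank fusion used here, the constant c = 60, the equal weights, the whitespace tokenizer, and the hand-made three-dimensional "embeddings" are all assumptions chosen for illustration, not the authors' configuration. All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization as a stand-in for the
    # language-specific tokenizers (ko_kiwi, porter_stemmer) named in the study.
    return text.lower().split()


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices ordered by Okapi BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: -scores[i])


def dense_rank(query_vec, doc_vecs):
    """Return document indices ordered by cosine similarity, best first."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(a * a for a in v))
        return dot / (nu * nv) if nu and nv else 0.0
    return sorted(range(len(doc_vecs)), key=lambda i: -cosine(query_vec, doc_vecs[i]))


def rrf_fuse(rankings, weights, k=3, c=60):
    """Weighted reciprocal rank fusion: merge ranked lists and keep the top k."""
    fused = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(fused, key=lambda d: -fused[d])[:k]


# Toy corpus standing in for guideline passages (invented for this sketch).
docs = [
    "metformin is first-line therapy for type 2 diabetes",
    "insulin therapy is required for type 1 diabetes",
    "regular exercise improves glycemic control",
]
# Hand-made vectors standing in for real embedding-model output.
doc_vecs = [[0.9, 0.2, 0.0], [0.4, 0.9, 0.1], [0.0, 0.2, 0.9]]
query_vec = [1.0, 0.1, 0.0]

sparse = bm25_rank("metformin type 2", docs)
dense = dense_rank(query_vec, doc_vecs)
top = rrf_fuse([sparse, dense], weights=[0.5, 0.5], k=2)
print("sparse:", sparse, "dense:", dense, "fused top-2:", top)
```

Rank-based fusion is used in this sketch because BM25 scores and cosine similarities live on different scales; combining ranks avoids having to normalize the raw scores. In a real pipeline the dense vectors would come from an embedding model and the value of k would be tuned, as the study did across different top-k settings.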

Keywords: diabetes management; ensemble retriever; large language models; medical information retrieval; retrieval-augmented generation.

Grants and funding

RS-2022-KH125204/the Ministry of Health & Welfare, Republic of Korea