Assessing Retrieval-Augmented Large Language Model Performance in Emergency Department ICD-10-CM Coding Compared to Human Coders

medRxiv [Preprint]. 2024 Oct 17:2024.10.15.24315526. doi: 10.1101/2024.10.15.24315526.

Abstract

Background: Accurate medical coding is essential for clinical and administrative purposes but complicated, time-consuming, and biased. This study compares Retrieval-Augmented Generation (RAG)-enhanced LLMs to provider-assigned codes in producing ICD-10-CM codes from emergency department (ED) clinical records.

Methods: Retrospective cohort study using 500 ED visits randomly selected from the Mount Sinai Health System between January and April 2024. The RAG system integrated past 1,038,066 ED visits data (2021-2023) into the LLMs' predictions to improve coding accuracy. Nine commercial and open-source LLMs were evaluated. The primary outcome was a head-to-head comparison of the ICD-10-CM codes generated by the RAG-enhanced LLMs and those assigned by the original providers. A panel of four physicians and two LLMs blindly reviewed the codes, comparing the RAG-enhanced LLM and provider-assigned codes on accuracy and specificity.

Findings: RAG-enhanced LLMs demonstrated superior performance to provider coders in both the accuracy and specificity of code assignments. In a targeted evaluation of 200 cases where discrepancies existed between GPT-4 and provider-assigned codes, human reviewers favored GPT-4 for accuracy in 447 instances, compared to 277 instances where providers' codes were preferred (p<0.001). Similarly, GPT-4 was selected for its superior specificity in 509 cases, whereas human coders were preferred in only 181 cases (p<0.001). Smaller open-access models, such as Llama-3.1-70B, also demonstrated substantial scalability when enhanced with RAG, with 218 instances of accuracy preference compared to 90 for providers' codes. Furthermore, across all models, the exact match rate between LLM-generated and provider-assigned codes significantly improved following RAG integration, with Qwen-2-7B increasing from 0.8% to 17.6% and Gemma-2-9b-it improving from 7.2% to 26.4%.

Interpretation: RAG-enhanced LLMs improve medical coding accuracy in EDs, suggesting clinical workflow applications. These findings show that generative AI can improve clinical outcomes and reduce administrative burdens.

Funding: This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award number S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.

Twitter summary: A study showed AI models with retrieval-augmented generation outperformed human doctors in ED diagnostic coding accuracy and specificity. Even smaller AI models perform favorably when using RAG. This suggests potential for reducing administrative burden in healthcare, improving coding efficiency, and enhancing clinical documentation.

Publication types

  • Preprint