Large Language Models are good medical coders,
if provided with tools

Keith Kwan
AI Native Health
[email protected]
Abstract

This study presents a novel two-stage Retrieve-Rank system for automated ICD-10-CM medical coding, comparing its performance against a Vanilla Large Language Model (LLM) approach. Evaluating both systems on a dataset of 100 single-term medical conditions, the Retrieve-Rank system achieved 100% accuracy in predicting correct ICD-10-CM codes, significantly outperforming the Vanilla LLM (GPT-3.5-turbo), which achieved only 6% accuracy. Our analysis demonstrates the Retrieve-Rank system’s superior precision in handling various medical terms across different specialties. While these results are promising, we acknowledge the limitations of using simplified inputs and the need for further testing on more complex, realistic medical cases. This research contributes to the ongoing effort to improve the efficiency and accuracy of medical coding, highlighting the importance of retrieval-based approaches.

Keywords ICD-10-CM coding  \cdot Medical informatics  \cdot Natural Language Processing  \cdot Retrieve-Rank system  \cdot Automated diagnosis coding  \cdot Machine learning in healthcare  \cdot Clinical text classification

1 Introduction

Medical coding is a critical process in healthcare systems, essential for accurate billing, epidemiological studies, and healthcare quality assessment [1, 2]. The recent paper “Large Language Models Are Poor Medical Coders — Benchmarking of Medical Code Querying” published in NEJM AI [3] highlighted significant limitations in the ability of large language models (LLMs) to accurately generate medical codes.

The application of artificial intelligence (AI) and machine learning (ML) in healthcare, particularly in clinical coding, has been a subject of increasing interest in recent years [4, 5]. Previous studies have explored various approaches to automate medical coding, including rule-based systems [6], traditional machine learning methods [7], and more recently, deep learning techniques [8].

Soroush and colleagues evaluated the performance of several prominent LLMs, including GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b Chat, in querying medical billing codes. Their study encompassed a comprehensive dataset of ICD-9-CM, ICD-10-CM, and CPT codes extracted from the Mount Sinai Health System electronic health record. The authors found that even the best-performing model, GPT-4, achieved exact match rates of only 45.9% for ICD-9-CM, 33.9% for ICD-10-CM, and 49.8% for CPT codes. These results led to the conclusion that LLMs are currently not suitable for direct use in medical coding tasks.

However, we hypothesized that the performance of LLMs in medical coding could be significantly improved by providing them with appropriate tools and retrieval mechanisms. This approach aligns with recent advancements in retrieval-augmented generation [9] and the use of external knowledge bases to enhance LLM performance [10].

To test this hypothesis, we designed an experiment using a combination of the Colbert-V2 retriever [11] and GPT-3.5-turbo for reranking. Our approach aimed to address the limitations observed in the direct code generation method used in the NEJM study, drawing inspiration from successful applications of similar techniques in other domains [12].

In this paper, we present our methodology and results, which demonstrate a substantial improvement in medical coding accuracy. By achieving a 100% exact match rate on a sample of 100 codes, our findings suggest that LLMs, when equipped with the right tools, can indeed be effective in medical coding tasks. This study not only challenges the conclusions of the NEJM paper but also opens new avenues for the application of AI in healthcare information management, potentially addressing long-standing challenges in medical coding efficiency and accuracy [2, 13].

Control Group Accuracy: 6% Experiment Group Accuracy: 100% Control Group (GPT-3.5) Single-term input (e.g. "Asthma") Direct LLM Prediction Predicted ICD-10 CM code Experiment Group (Retrieve-Rank) Single-term input (e.g. "Asthma") ColBERT-V2 RAG Retrieval Top-k ICD-10 codes retrieved GPT-3.5 Turbo Reranking Final Predicted ICD-10 CM code
Figure 1: Comparison of Control Group and Experiment Group methodologies and results

Figure 1 illustrates the workflow and results of our proposed Retrieve-Rank system compared to the control group. This visual representation highlights the significant improvement in accuracy achieved by our approach.

The following sections will detail our experimental setup, results, and discuss the implications of our findings for the future of automated medical coding.

2 Methodology

Our study employs a methodology similar to that used in the "Large Language Models Are Poor Medical Coders — Benchmarking of Medical Code Querying" paper, utilizing single-term medical conditions as inputs for ICD-10-CM code prediction. However, we introduce a novel two-stage Retrieve-Rank system, inspired by Doosterlinck’s Infer-Retrieve-Rank framework[14], which significantly improves upon previous approaches.

Our approach involves the following steps:

  1. 1.

    Retrieval: Given a single-term medical condition, we use ColBERT-V2 to retrieve the top-k (k=15) most relevant ICD-10-CM codes from our trained index.

  2. 2.

    Reranking: We use GPT-3.5-turbo to rerank the retrieved codes and select the most likely ICD-10-CM code for the given condition.

We utilized Ragatouille to train and develop a ColBERT-V2 RAG (Retrieval-Augmented Generation) system based on ICD-10-CM data downloaded from the CDC website111https://www.cdc.gov/nchs/icd/icd-10-cm/files.html.

It’s important to note that while we use simplified single-term inputs similar to previous studies, our two-stage approach allows for more nuanced and accurate code prediction. This methodology, while not fully representative of the complexity found in real-world medical coding scenarios, allows for a direct comparison with previous findings and provides insights into the improved capabilities of our Retrieve-Rank system.

3 Experiment Setup

We designed an experiment to evaluate the performance of our ColBERT-V2 RAG system against a control group. The experiment was implemented using Python, with the following key components:

  • Data Preparation: We used a CSV file containing single-term medical conditions and their corresponding ICD-10-CM codes.

  • Sampling: The experiment randomly sampled 100 entries from the dataset.

  • Code Normalization: ICD-10-CM codes were normalized by removing periods and converting to uppercase to ensure consistent comparison.

  • Prediction: For each sampled entry, we used our RAG system to predict the ICD-10-CM code based on the single-term medical condition.

  • Evaluation Metrics: We focused on the top-one accuracy, comparing the predicted code with the true code. A match was considered successful if the main part of the predicted code (before any subdivisions) matched the true code.

  • Control Group: We implemented a control group using GPT-3.5-turbo to provide a baseline for comparison. This model was prompted with "You are a medical coding expert that can suggest an ICD-10-CM code for a given query." followed by the single-term medical condition.

  • Results Logging: The experiment results, including the conditions, true codes, predicted codes, and match results, were logged to a CSV file for further analysis.

This experimental setup allowed us to directly compare the performance of our ColBERT-V2 RAG system against a simpler baseline model, providing insights into the effectiveness of our approach for ICD-10-CM code prediction, even with simplified inputs.

While our results show significant improvement over the Vanilla LLM approach, we acknowledge that further research using more complex, realistic medical cases is necessary to fully evaluate the potential of the Retrieve-Rank system in practical applications.

4 Results

We evaluated our two-stage Retrieve-Rank system against a Vanilla LLM using GPT-3.5-turbo on a dataset of 100 diagnosis description with corresponding ICD-10-CM codes. The results demonstrate a significant performance improvement over the baseline method.

4.1 Accuracy Metrics

The Retrieve-Rank system achieved perfect accuracy in predictions, correctly identifying the exact ICD-10-CM code for all 100 samples. In contrast, the Vanilla LLM using GPT-3.5-turbo achieved only 6

Table 1: Accuracy Results
System Accuracy
Retrieve-Rank System 100%
Vanilla LLM (GPT-3.5-turbo) 6%

4.2 Comparative Analysis

To illustrate the performance difference, we present a sample of predictions from both systems in Table 2. This table shows the diagnosis description, reference ICD-10-CM code, and predictions from both systems, highlighting the superior accuracy of the Retrieve-Rank system.

Table 2: Comparison of Predictions
Diagnosis Description Reference Code Retrieve-Rank Vanilla GPT-3.5-turbo Correct System
Salter-Harris Type II physeal fracture of lower end of humerus, unspecified arm, subsequent encounter for fracture with malunion S49129P S49129P S59102P Retrieve-Rank
Nondisplaced fracture of proximal third of navicular [scaphoid] bone of unspecified wrist, initial encounter for closed fracture S62036A S62036A S62002A Retrieve-Rank
Glaucoma secondary to eye inflammation, right eye, indeterminate stage H4041X4 H4041X4 H4060X4 Retrieve-Rank
Poisoning by aspirin, accidental (unintentional), initial encounter T39011A T39011A T39011A Both
Other specified injury of right renal vein, subsequent encounter S35494D S35494D S35602D Retrieve-Rank
Other specified fracture of right acetabulum, initial encounter for open fracture S32491B S32491B S32431B Retrieve-Rank
Displacement of biological heart valve graft, sequela T82222S T82222S T82590S Retrieve-Rank
Open bite of unspecified thumb with damage to nail, sequela S61159S S61159S S61049S Retrieve-Rank
Burn of unspecified degree of trunk, unspecified site, sequela T2100XS T2100XS T310 Retrieve-Rank
Other specified injury of peroneal artery, unspecified leg, subsequent encounter S85299D S85299D S951XXA Retrieve-Rank
Other complications of anesthesia, subsequent encounter T8859XD T8859XD T8859XD Both
Follicular lymphoma, unspecified, lymph nodes of axilla and upper limb C8294 C8294 C8211 Retrieve-Rank
Other injury of flexor muscle, fascia and tendon of other finger at wrist and hand level, sequela S66198S S66198S S66299S Retrieve-Rank
Contusion and laceration of cerebrum, unspecified, with loss of consciousness greater than 24 hours with return to pre-existing conscious level, initial encounter S06335A S06335A S069X0A Retrieve-Rank

4.3 Performance Analysis

As shown in Table 2, the Retrieve-Rank system consistently predicts the correct ICD-10-CM code across a variety of complex diagnosis descriptions. The Vanilla LLM, while occasionally correct, often predicts codes that are similar but incorrect.

Key observations from the comparison:

1. Precision in anatomical details: The Retrieve-Rank system accurately captures specific anatomical locations (e.g., "proximal third of navicular bone" in S62036A), while the Vanilla LLM sometimes misses these details.

2. Accuracy in encounter specifics: The Retrieve-Rank system correctly identifies encounter types (e.g., "subsequent encounter" in S49129P), which the Vanilla LLM often misses.

3. Handling of complex conditions: For intricate cases like "Contusion and laceration of cerebrum, unspecified, with loss of consciousness greater than 24 hours" (S06335A), the Retrieve-Rank system provides the exact code, while the Vanilla LLM defaults to a more general code.

4. Consistency across various medical domains: The Retrieve-Rank system demonstrates high accuracy across different medical specialties, including orthopedics, ophthalmology, cardiology, and oncology.

The Vanilla LLM’s errors often involve predicting codes that are in the same general category but miss crucial details. For example, in the case of the Salter-Harris fracture (S49129P), the Vanilla LLM predicts a code for the lower leg (S59102P) instead of the arm.

4.4 Limitations

While the results are promising, it’s important to note that this evaluation was conducted on a relatively small dataset of 100 samples. The perfect accuracy of the Retrieve-Rank system, while impressive, raises questions about the diversity and complexity of the test set. Further testing on larger, more diverse datasets would be beneficial to confirm the system’s generalizability and robustness across a wider range of medical conditions and code categories.

Additionally, it would be valuable to analyze the system’s performance on more challenging cases or edge cases that may not have been represented in this sample set. This could provide insights into potential areas for improvement and further refinement of the Retrieve-Rank system.

Furthermore, while the Vanilla LLM’s performance was significantly lower, it’s worth noting that it was not specifically trained for this task. Future work could explore fine-tuning approaches for the Vanilla LLM to see if its performance on ICD-10-CM coding tasks can be improved without the need for a retrieval step.

5 Conclusion

Our study demonstrates the significant potential of the two-stage Retrieve-Rank system in automating ICD-10-CM medical coding. The system’s perfect accuracy across a diverse set of 100 diagnosis descriptions, compared to the 6% accuracy of a Vanilla LLM, underscores the effectiveness of combining retrieval and ranking mechanisms in tackling complex coding tasks.

The Retrieve-Rank system exhibited remarkable precision in capturing crucial details such as specific anatomical locations, encounter types, and intricate medical conditions. Its consistency across various medical specialties further highlights its versatility and potential for broad application in healthcare settings.

While these results are encouraging, we acknowledge the limitations of our study, particularly the relatively small sample size. Future research should focus on validating these findings with larger, more diverse datasets and exploring the system’s performance on edge cases and rare conditions.

The implications of this research are significant for the healthcare industry. An accurate, automated coding system could substantially reduce the workload on medical coders, minimize coding errors, and improve the overall quality of medical records. This, in turn, could lead to more efficient healthcare administration, more accurate billing processes, and potentially better patient care through improved data quality for medical research and decision-making.

As we move forward, it will be crucial to continue refining and testing the Retrieve-Rank system, possibly incorporating advances in language models and retrieval techniques. Additionally, exploring ways to make the system interpretable and adaptable to evolving medical knowledge will be key to its practical implementation in healthcare settings.

In conclusion, while further research is needed, our study presents a promising step towards more efficient and accurate automated medical coding, contributing to the ongoing digital transformation of healthcare administration.

6 Data Availability

The complete dataset of 100 medical cases, including predictions from both systems, is available as an ancillary file with this arXiv submission. Additional materials, including detailed methodology and error analysis, are also provided as ancillary files.

The code used to conduct the experiments and analyze the results is publicly available on GitHub at https://github.com/ainativehealth/GoodMedicalCoder. This repository contains:

  • Python scripts for running the ICD-10 code prediction experiment (experiment.py)

  • Code for creating the index using the RAG model (index.py)

  • ICD-10 code datasets (ICD-10.csv and ICD-10_formatted.csv)

  • Requirements file listing all necessary Python dependencies (requirements.txt)

  • Detailed instructions for reproducing the experiments

Researchers interested in replicating or building upon this work can access all necessary code and data through this GitHub repository. The repository is open-source and licensed under the Apache-2.0 license, allowing for broad use and adaptation of the materials.

References

  • [1] Sue Bowman. Impact of electronic health record systems on information integrity: quality and safety implications. Perspectives in health information management, 10, 2013.
  • [2] Kimberly J O’malley, Karon F Cook, Matt D Price, Kimberly R Wildes, John F Hurdle, and Carol M Ashton. Measuring diagnoses: Icd code accuracy. Health services research, 40(5p2):1620–1639, 2005.
  • [3] Ali Soroush, Benjamin S Glicksberg, Eyal Zimlichman, Yiftach Barash, Robert Freeman, Alexander W Charney, Girish N Nadkarni, and Eyal Klang. Large language models are poor medical coders — benchmarking of medical code querying. NEJM AI, 1(5), 2024.
  • [4] Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Hajaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine, 1(1):1–10, 2018.
  • [5] Benjamin Shickel, Patrick J Tighe, Azra Bihorac, and Parisa Rashidi. Deep ehr: A survey of recent advances in deep learning techniques for electronic health record (ehr) analysis. IEEE journal of biomedical and health informatics, 22(5):1589–1604, 2018.
  • [6] Richárd Farkas and György Szarvas. Automatic construction of rule-based icd-9-cm coding systems. In BMC bioinformatics, volume 9, pages 1–9. BioMed Central, 2008.
  • [7] Adler Perotte, Rimma Pivovarov, Karthik Natarajan, Nicole Weiskopf, Frank Wood, and Noémie Elhadad. Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association, 21(2):231–237, 2014.
  • [8] Keyang Xu, Mike Lam, Jingzhi Pang, Xin Gao, Charlotte Band, Priyanka Mathur, Frank Papay, Ashish K Khanna, Jacek B Cywinski, Kamal Maheshwari, et al. Multimodal machine learning for automated icd coding. arXiv preprint arXiv:1912.10049, 2019.
  • [9] Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020.
  • [10] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938. PMLR, 2020.
  • [11] Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. Colbertv2: Effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488, 2022.
  • [12] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, 2020.
  • [13] David C Hsia, W Mark Krushat, Ann B Fagan, Jane A Tebbutt, and Richard P Kusserow. Accuracy of diagnostic coding for medicare patients under the prospective-payment system. New England Journal of Medicine, 318(6):352–355, 1988.
  • [14] Karel D’Oosterlinck, Omar Khattab, François Remy, Thomas Demeester, Chris Develder, and Christopher Potts. In-context learning for extreme multi-label classification, 2024.