Background: Accurate classification of medical records is crucial for clinical documentation, particularly when using the 10th revision of the International Classification of Diseases (ICD-10) coding system. The use of machine learning algorithms and Systematized Nomenclature of Medicine (SNOMED) mapping has shown promise in performing these classifications. However, challenges remain, particularly in reducing false negatives, where certain diagnoses are not correctly identified by either approach.
Objective: This study explores whether advanced large language models can improve the accuracy of ICD-10 classification in challenging medical records where both machine learning and SNOMED mapping fail.
Methods: We evaluated the performance of ChatGPT 3.5 and ChatGPT 4 in classifying ICD-10 codes from discharge summaries in selected records of the Medical Information Mart for Intensive Care IV (MIMIC-IV) dataset. These records comprised 802 discharge summaries identified as false negatives by both the machine learning and SNOMED mapping methods, confirming their status as challenging cases. Each summary was assessed by ChatGPT 3.5 and ChatGPT 4 using a classification prompt, and the results were compared with human coder evaluations. Five human coders, with a combined experience of over 30 years, independently classified a stratified sample of 100 summaries to validate ChatGPT's performance.
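For illustration, the sketch below shows one way such a classification prompt could be sent to a chat model through the OpenAI Python client. The abstract does not report the study's actual prompt wording, model identifiers, or decoding settings, so the prompt text, model name, and temperature here are assumptions, not the authors' configuration.

```python
# Illustrative sketch only: prompt text, model name, and settings are
# placeholders; they are not the configuration used in the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CLASSIFICATION_PROMPT = (
    "You are a clinical coder. Read the discharge summary below and list the "
    "ICD-10 codes that apply, one per line, with a brief justification."
)

def classify_summary(summary_text: str, model: str = "gpt-4") -> str:
    """Send one discharge summary to the model with a classification prompt."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # favor run-to-run consistency
        messages=[
            {"role": "system", "content": CLASSIFICATION_PROMPT},
            {"role": "user", "content": summary_text},
        ],
    )
    return response.choices[0].message.content
```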
Results: ChatGPT 4 demonstrated markedly better consistency than ChatGPT 3.5, with between-run agreement of 86% to 89%, compared with 57% to 67% for ChatGPT 3.5. The classification accuracy of ChatGPT 4 varied across ICD-10 codes. Overall, the human coders outperformed ChatGPT; however, ChatGPT 4 matched the median performance of the human coders, achieving an accuracy of 22%.
Conclusion: This study underscores the potential of integrating advanced language models into clinical coding workflows to improve documentation accuracy. ChatGPT 4 showed improved consistency and performance comparable to the median human coder, achieving 22% accuracy on these challenging cases. Combining ChatGPT with methods such as SNOMED mapping could further enhance clinical coding accuracy, particularly in complex scenarios.
Keywords: ChatGPT; Clinical coding; Clinical document improvement; Large language model; Machine learning; SNOMED.