Utilizing RAG and GPT-4 for Extraction of Substance Use Information from Clinical Notes

Stud Health Technol Inform. 2024 Nov 22:321:94-98. doi: 10.3233/SHTI241070.

Abstract

This research investigates a hybrid Retrieval-Augmented Generation (RAG) and Generative Pre-trained Transformer (GPT) pipeline for extracting and categorizing substance use information from unstructured clinical notes. The aim is to improve the accuracy and efficiency of identifying substance use mentions and determining their status in patient documentation. RAG is used to pre-filter the input for GPT, narrowing the analysis to the most relevant text segments and thereby improving the precision and recall of the extraction. The pipeline was evaluated on the Medical Information Mart for Intensive Care III (MIMIC-III) dataset through manual verification, using recall, precision, F1-score, and accuracy. The results showed high precision (up to 0.99 for drug and alcohol mentions) and substantial recall (0.88 across all substances for usage status).
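The pre-filtering idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not describe the retriever, the queries, or the prompt, so the query string, the bag-of-words cosine score (a stand-in for whatever retrieval method the pipeline actually uses), the sentence splitter, and all function names below are assumptions made for illustration. The GPT call itself is omitted; the sketch stops at building the narrowed prompt.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical query vocabulary; the abstract does not list the retrieval queries.
QUERY = "alcohol drug tobacco smoking substance use"

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def split_segments(note):
    """Naive sentence split; real clinical text needs a more careful splitter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]

def retrieve(note, query=QUERY, k=2):
    """Return up to k segments with a non-zero similarity to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(Counter(tokenize(s)), q), s) for s in split_segments(note)]
    ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: p[0], reverse=True)
    return [s for _, s in ranked[:k]]

def build_prompt(segments):
    """Assemble the narrowed input that would be passed to the language model."""
    context = "\n".join(f"- {s}" for s in segments)
    return ("Extract each substance mentioned and its use status "
            "from these excerpts:\n" + context)

if __name__ == "__main__":
    note = ("Patient admitted with chest pain. Denies alcohol use. "
            "Former smoking history, quit 2010. Family history of diabetes.")
    print(build_prompt(retrieve(note)))
```

Only the segments that mention substance use reach the prompt, which is the narrowing effect the abstract describes; sentences about chest pain or family history are dropped before the model sees the note.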

Keywords: GPT; Retrieval-augmented generation (RAG); Substance use information.

MeSH terms

  • Data Mining / methods
  • Electronic Health Records
  • Humans
  • Information Storage and Retrieval / methods
  • Natural Language Processing*
  • Substance-Related Disorders*