Constructing synthetic datasets with generative artificial intelligence to train large language models to classify acute renal failure from clinical notes

J Am Med Inform Assoc. 2024 May 20;31(6):1404-1410. doi: 10.1093/jamia/ocae081.

Abstract

Objectives: To compare the performance of a language model-based classifier when trained on synthetic versus authentic clinical notes.

Materials and methods: A classifier using language models was developed to identify acute renal failure. Four types of training data were compared: (1) authentic notes from MIMIC-III; and (2, 3, and 4) synthetic notes generated by ChatGPT with lengths of 15 (GPT-15 sentences), 30 (GPT-30 sentences), and 45 (GPT-45 sentences) sentences, respectively. The area under the receiver operating characteristic curve (AUC) was calculated on a test set from MIMIC-III.
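As an illustration of the evaluation metric only, the AUC can be read as the probability that a randomly chosen positive note receives a higher classifier score than a randomly chosen negative note, with ties counted as one half. The sketch below computes it that way in plain Python. The function name `roc_auc` and the labels and scores are hypothetical and are not taken from the study, which does not describe its implementation here.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values (1 = acute renal failure).
    scores: iterable of classifier scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels and model scores, for illustration only.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.65, 0.7, 0.8, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of notes, which is fine for a small example; for a full test set a rank-based or library implementation would be the more practical choice.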

Results: With RoBERTa, the AUCs were 0.84, 0.80, 0.84, and 0.76 for the MIMIC-III, GPT-15, GPT-30, and GPT-45 sentences training sets, respectively.

Discussion: Training language models to detect acute renal failure from clinical notes resulted in similar performance when using synthetic versus authentic training data.

Conclusion: The use of training data derived from protected health information may not be needed.

Keywords: ChatGPT; artificial intelligence; generative AI; large language models.

Publication types

  • Comparative Study

MeSH terms

  • Acute Kidney Injury* / classification
  • Acute Kidney Injury* / diagnosis
  • Area Under Curve
  • Artificial Intelligence*
  • Datasets as Topic
  • Electronic Health Records*
  • Humans
  • Natural Language Processing
  • ROC Curve