Objectives: To compare the performance of a classifier that leverages language models when trained on synthetic versus authentic clinical notes.
Materials and methods: A classifier using language models was developed to identify acute renal failure. Four types of training data were compared: (1) authentic notes from MIMIC-III and (2-4) synthetic notes generated by ChatGPT with lengths of 15 (GPT-15), 30 (GPT-30), and 45 (GPT-45) sentences, respectively. The area under the receiver operating characteristic curve (AUC) was calculated on a test set drawn from MIMIC-III.
Results: With RoBERTa, the AUCs were 0.84, 0.80, 0.84, and 0.76 for the MIMIC-III, GPT-15, GPT-30, and GPT-45 training sets, respectively.
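The abstract does not report implementation details. As a minimal sketch only, the following shows one common way such a RoBERTa note classifier could be fine-tuned and scored by AUC, using Hugging Face transformers and scikit-learn; the placeholder data, model checkpoint, and hyperparameters are assumptions for illustration, not the authors' setup.

```python
# Hypothetical sketch: fine-tune RoBERTa as a binary note classifier and
# compute AUC on a held-out test set. Data here is a stand-in; the paper's
# actual preprocessing and hyperparameters are not described in the abstract.
import numpy as np
from datasets import Dataset
from sklearn.metrics import roc_auc_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2  # 1 = acute renal failure, 0 = not
)

# Placeholder notes; in the study these would be authentic (MIMIC-III)
# or ChatGPT-generated training notes.
train = Dataset.from_dict({"text": ["...clinical note..."] * 8,
                           "label": [0, 1] * 4})
test = Dataset.from_dict({"text": ["...clinical note..."] * 4,
                          "label": [0, 1, 0, 1]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

train = train.map(tokenize, batched=True)
test = test.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="arf-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()

# Score the held-out test set: softmax probability of the positive class.
logits = trainer.predict(test).predictions
probs = np.exp(logits)[:, 1] / np.exp(logits).sum(axis=1)
print("AUC:", roc_auc_score(test["label"], probs))
```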
Discussion: Training language models to detect acute renal failure from clinical notes resulted in similar performance with synthetic and authentic training data.
Conclusion: Training data derived from protected health information may not be needed to develop such classifiers.
Keywords: ChatGPT; artificial intelligence; generative AI; large language models.