Objectives: To investigate the feasibility and performance of Chat Generative Pre-trained Transformer (ChatGPT) in converting free-text symptom narratives into structured symptom labels.
Methods: We extracted symptoms from 300 deidentified symptom narratives of COVID-19 patients using a computer-based matching algorithm (the standard) and prompt engineering in ChatGPT. Common symptoms were those with a prevalence >10% according to the standard; less common symptoms were those with a prevalence of 2-10%. ChatGPT was prompted without examples (zero-shot prompting) and with examples (few-shot prompting). Its agreement with the standard was quantified by sensitivity and specificity with 95% exact binomial CIs (95% binCIs).
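The exact binomial (Clopper-Pearson) interval referred to above as a 95% binCI can be sketched with the standard library alone; the function names, bisection approach, and the symptom counts shown are illustrative assumptions, not the study's actual code or data.

```python
import math


def _binom_cdf(k: int, n: int, p: float) -> float:
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(f, lo: float = 0.0, hi: float = 1.0, iters: int = 60) -> float:
    # Bisection for the root of an increasing function f on [lo, hi].
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, conf: float = 0.95) -> tuple[float, float]:
    """Exact (Clopper-Pearson) binomial CI for k successes out of n trials."""
    alpha = 1 - conf
    # Lower bound: smallest p with P(X >= k | p) = alpha/2 (0 when k == 0).
    low = 0.0 if k == 0 else _solve(lambda p: (1 - _binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: largest p with P(X <= k | p) = alpha/2 (1 when k == n).
    high = 1.0 if k == n else _solve(lambda p: alpha / 2 - _binom_cdf(k, n, p))
    return low, high


# Hypothetical confusion-matrix counts for one symptom label (not the study's data):
tp, fn, tn, fp = 29, 5, 252, 14
sensitivity, sens_ci = tp / (tp + fn), clopper_pearson(tp, tp + fn)
specificity, spec_ci = tn / (tn + fp), clopper_pearson(tn, tn + fp)
```

Libraries such as scipy (`binomtest(...).proportion_ci(method="exact")`) provide the same interval; the pure-Python version is shown only to make the definition explicit.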
Results: In zero-shot prompting, GPT-4 achieved high specificity for all symptoms (0.947 [95% binCI: 0.894-0.978]-1.000 [95% binCI: 0.965-0.988, 1.000]), high sensitivity for common symptoms (0.853 [95% binCI: 0.689-0.950]-1.000 [95% binCI: 0.951-1.000]), and moderate sensitivity for less common symptoms (0.200 [95% binCI: 0.043-0.481]-1.000 [95% binCI: 0.590-0.815, 1.000]). Few-shot prompting increased both sensitivity and specificity. GPT-4 outperformed GPT-3.5 in response accuracy and labelling consistency.
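The difference between the two prompting conditions can be illustrated schematically: a zero-shot prompt states only the task, while a few-shot prompt prepends worked narrative-to-label examples. The prompt wording, label set, and example narrative below are hypothetical, not the study's actual prompts.

```python
SYMPTOMS = ["fever", "cough", "sore throat"]  # illustrative label set


def build_prompt(narrative, examples=None):
    """Build an extraction prompt: zero-shot if examples is None/empty, else few-shot."""
    header = (
        "Extract the symptoms mentioned in the narrative as a comma-separated "
        f"subset of: {', '.join(SYMPTOMS)}. Answer 'none' if none apply.\n"
    )
    shots = ""
    for text, labels in examples or []:
        shots += f"Narrative: {text}\nSymptoms: {labels}\n"
    return header + shots + f"Narrative: {narrative}\nSymptoms:"


zero_shot = build_prompt("Patient reports a dry cough since Monday.")
few_shot = build_prompt(
    "Patient reports a dry cough since Monday.",
    examples=[("Mild fever and aching throat.", "fever, sore throat")],
)
```

Constraining the answer to a fixed label vocabulary is one simple way to obtain structured, machine-comparable output from a free-text model.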
Discussion: This work substantiates ChatGPT's role as a research tool in medical fields. Its performance in converting symptom narratives into structured symptom labels was encouraging, saving the time and effort of compiling task-specific training data. It could accelerate free-text data compilation and synthesis in future disease outbreaks and improve the accuracy of symptom checkers. Focused prompt engineering that addresses ambiguous symptom descriptions would further strengthen its value to medical research.
Keywords: ChatGPT; Entity recognition; Large language model; Symptom extraction; Symptom narratives; Symptom science.
Copyright © 2023 European Society of Clinical Microbiology and Infectious Diseases. Published by Elsevier Ltd. All rights reserved.