The growing impact of large language models (LLMs), such as ChatGPT, raises questions about the reliability of their application in public health. We compared GPT-4's assessments of drug toxicity for the liver, heart, and kidney against expert assessments drawn from US Food and Drug Administration (FDA) drug-labeling documents. Two approaches were assessed: a 'General prompt', mimicking the conversational style used by the general public, and an 'Expert prompt', engineered to represent the approach of an expert. The Expert prompt achieved higher accuracy (64-75%) than the General prompt (48-72%), but overall performance was moderate, indicating that caution is needed when using GPT-4 for public health. To improve reliability, an advanced framework, such as Retrieval-Augmented Generation (RAG), might be required to better leverage the knowledge embedded in GPT-4.
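A minimal sketch of the two prompting approaches contrasted in the study is shown below; the drug, organ, prompt wording, model identifier, and OpenAI SDK usage are illustrative assumptions, not the authors' exact protocol.

```python
# Illustrative sketch only: prompt wording, drug, and model name are
# assumptions; the paper does not publish its exact prompts here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DRUG = "acetaminophen"  # hypothetical example drug
ORGAN = "liver"

# 'General prompt': conversational style a member of the public might use.
general_prompt = f"Is {DRUG} bad for the {ORGAN}?"

# 'Expert prompt': engineered to elicit a structured, expert-style judgment.
expert_prompt = (
    f"You are a toxicologist. Based on FDA drug-labeling information, "
    f"classify the {ORGAN} toxicity of {DRUG} as 'toxic' or 'non-toxic', "
    f"and answer with that single word."
)

for label, prompt in [("General", general_prompt), ("Expert", expert_prompt)]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for a reproducible comparison
    )
    print(label, "->", response.choices[0].message.content)
```

In a study of this design, responses from each prompt style would then be scored for accuracy against the expert toxicity annotations derived from the FDA drug-labeling documents.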
Keywords: GPT-4; artificial intelligence (AI); drug toxicity and safety; heart; kidney; large language models (LLMs); liver; organ toxicity; prompt engineering; public health.
Copyright © 2025. Published by Elsevier Ltd.