Is ChatGPT ready for public use in organ-specific drug toxicity research?

Drug Discov Today. 2025 Jan 20:104297. doi: 10.1016/j.drudis.2025.104297. Online ahead of print.

Abstract

The growing impact of large language models (LLMs), such as ChatGPT, prompts questions about the reliability of their application in public health. We compared drug toxicity assessments by GPT-4 for liver, heart, and kidney against expert assessments based on US Food and Drug Administration (FDA) drug-labeling documents. Two approaches were assessed: a 'General prompt', mimicking the conversational style used by the general public, and an 'Expert prompt' engineered to represent an expert's approach. The Expert prompt achieved higher accuracy (64-75%) than the General prompt (48-72%), but overall performance was moderate, indicating that caution is needed when using GPT-4 for public health. To improve reliability, an advanced framework, such as Retrieval-Augmented Generation (RAG), might be required to leverage the knowledge embedded in GPT-4.

Keywords: GPT-4; artificial intelligence (AI); drug toxicity and safety; heart; kidney; large language models (LLMs); liver; organ toxicity; prompt engineering; public health.