De novo generation of colorectal patient educational materials using large language models: Prompt engineering key to improved readability

Surgery. 2025 Jan 4:180:109024. doi: 10.1016/j.surg.2024.109024. Online ahead of print.

Abstract

Background: Improving patient education has been shown to improve clinical outcomes and reduce disparities, though such efforts can be labor intensive. Large language models may serve as an accessible method to improve patient educational material. The aim of this study was to compare readability between existing educational materials and those generated by large language models.

Methods: Baseline colorectal surgery educational materials were gathered from a large academic institution (n = 52). Three prompts were entered into Perplexity and ChatGPT 3.5 for each topic: a Basic prompt that simply requested patient educational information on the topic, an Iterative prompt that repeated an instruction asking for the information to be more health literate, and a Metric-based prompt that requested a sixth-grade reading level, short sentences, and short words. Flesch-Kincaid Grade Level (Grade Level), Flesch-Kincaid Reading Ease (Ease), and Modified Grade Level scores were calculated for all materials, and unpaired t tests were used to compare mean scores between baseline materials and documents generated by the artificial intelligence platforms.
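For orientation, the two Flesch-Kincaid scores named in the Methods can be computed from word, sentence, and syllable counts using the standard published formulas. The sketch below is illustrative only and is not the scoring tool used in the study: the function names are hypothetical, the syllable counter is a rough vowel-group heuristic (an assumption; dedicated readability software typically uses more careful rules), and the Modified Grade Level score is omitted because the abstract does not define it.

```python
import re


def count_syllables(word):
    """Rough syllable estimate (assumed heuristic): count vowel groups."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as 'simple'.
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def text_counts(text):
    """Return (words, sentences, syllables) for a plain-text passage."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: lower means easier to read."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


if __name__ == "__main__":
    # Made-up sample sentence, not taken from the study materials.
    sample = "Drink clear fluids the day before your surgery. Do not eat solid food."
    counts = text_counts(sample)
    print("Grade Level:", round(flesch_kincaid_grade(*counts), 1))
    print("Ease:", round(flesch_reading_ease(*counts), 1))
```

Both formulas reward short sentences (fewer words per sentence) and short words (fewer syllables per word), which are the two levers the Metric-based prompt asks the model to pull.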

Results: Overall, existing materials were longer than materials generated by the large language models across categories and prompts: 863-956 words vs 170-265 words (ChatGPT) and 220-313 words (Perplexity), all P < .01. Baseline materials did not meet sixth-grade readability guidelines based on grade level (Grade Level 7.0-9.8 and Modified Grade Level 9.6-11.5) or ease of readability (Ease 53.1-65.0). Readability of materials generated by a large language model varied by prompt and platform. Overall, ChatGPT materials were more readable than baseline materials with the Metric-based prompt: Grade Level 5.2 vs 8.1, Modified Grade Level 7.3 vs 10.3, and Ease 70.5 vs 60.4, all P < .01. In contrast, Perplexity-generated materials were significantly less readable than baseline materials, except for those generated with the Metric-based prompt, which did not differ statistically from baseline.

Conclusion: Both existing materials and the majority of educational materials created by large language models did not meet readability recommendations. The exception was ChatGPT materials generated with the Metric-based prompt, which consistently improved readability scores from baseline and met recommendations in terms of the average Grade Level score. The variability in performance highlights the importance of the prompt used with large language models.