Evaluating Text-to-Image Generated Photorealistic Images of Human Anatomy

Cureus. 2024 Nov 21;16(11):e74193. doi: 10.7759/cureus.74193. eCollection 2024 Nov.

Abstract

Background: Generative artificial intelligence (AI) models that can produce photorealistic images from text descriptions have many applications in medicine, including medical education and the generation of synthetic data. However, their heterogeneous outputs are challenging to evaluate, and outputs are difficult to compare across models. There is a need for a systematic approach that enables image and model comparisons.

Method: To address this gap, we developed an error classification system for annotating errors in AI-generated photorealistic images of humans and applied our method to a corpus of 240 images generated with three different models (DALL-E 3, Stable Diffusion XL, and Stable Cascade) using 10 prompts, with eight images per prompt per model.
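As a hedged illustration of how such a corpus could be produced (this is not the authors' generation script; the model ID, device, prompt, and file names are illustrative assumptions), eight images per prompt can be generated with Stable Diffusion XL via the Hugging Face diffusers library:

```python
# Sketch: generate eight images for one prompt with Stable Diffusion XL.
# Requires a CUDA GPU and the diffusers, transformers, and torch packages.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Illustrative prompt; the study's 10 prompts are defined in its repository.
prompt = "A photorealistic full-body photograph of a person standing"
result = pipe(prompt, num_images_per_prompt=8)
for i, image in enumerate(result.images):
    image.save(f"sdxl_prompt01_{i}.png")
```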

Results: The error classification system identifies five error types at three severity levels across five anatomical regions and specifies an associated quantitative scoring method based on the aggregated proportion of errors per expected count of anatomical components in the generated image. We assessed inter-rater agreement by double-annotating 25% of the images and calculating Krippendorff's alpha, and we compared results quantitatively across the three models and 10 prompts using a cumulative score per image. The error classification system, accompanying training manual, generated image collection, annotations, and all associated scripts are available from our GitHub repository at https://github.com/hastingslab-org/ai-human-images. Inter-rater agreement was relatively poor, reflecting the subjectivity of the error classification task. Model comparisons revealed that DALL-E 3 performed consistently better than Stable Diffusion; however, the latter generated images reflecting greater diversity in personal attributes. Images depicting groups of people were more challenging for all models than those of individuals or pairs, and some prompts were challenging for all models.
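As a rough sketch of the scoring described above (the precise weighting scheme is specified in the error classification system and training manual in the repository, not in this abstract), a per-image cumulative score could take the form

\[ S = \sum_{r \in R} \frac{\sum_{e \in E_r} w_{\mathrm{sev}(e)}}{n_r}, \]

where \(R\) is the set of five anatomical regions, \(E_r\) is the set of errors annotated in region \(r\), \(w_{\mathrm{sev}(e)}\) is a weight reflecting the severity of error \(e\) (the weights here are an assumption), and \(n_r\) is the expected count of anatomical components in region \(r\); higher scores indicate more, or more severe, errors.

For the agreement analysis, a minimal sketch of computing Krippendorff's alpha over the double-annotated subset, assuming interval-level per-image scores and using the PyPI krippendorff package (not necessarily the authors' tooling; the scores shown are placeholders):

```python
# Minimal sketch: Krippendorff's alpha for two raters over the
# double-annotated images.
import numpy as np
import krippendorff

# One row per rater, one column per double-annotated image;
# np.nan would mark any image a rater did not score.
scores = np.array([
    [0.0, 1.5, 0.5, 2.0, 1.0],  # rater A's cumulative scores
    [0.5, 1.0, 0.5, 3.0, 1.0],  # rater B's cumulative scores
])

alpha = krippendorff.alpha(reliability_data=scores,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha: {alpha:.3f}")
```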

Conclusion: Our method enables systematic comparison of AI-generated photorealistic images of humans; our results can serve to catalyse improvements in these models for medical applications.

Keywords: ai in medicine; anatomical education; diffusion model; generative ai; large multi-modal model; medical image; photorealistic synthetic anatomical image; text-to-image generation.

Grants and funding

The project was supported by funding from the University of Zurich’s Digital Society Initiative and the Swiss National Science Foundation under grant agreement 209510.