Generating and evaluating synthetic data in digital pathology through diffusion models

Matteo Pozzi; Shahryar Noei; Erich Robbi; Luca Cima; Monica Moroni; Enrico Munari; Evelin Torresani; Giuseppe Jurman

doi:10.1038/s41598-024-79602-w

Sci Rep. 2024 Nov 18;14(1):28435. doi: 10.1038/s41598-024-79602-w.

Authors

Matteo Pozzi^# ¹ ², Shahryar Noei^# ¹, Erich Robbi¹ ³, Luca Cima⁴, Monica Moroni¹, Enrico Munari⁴, Evelin Torresani⁵, Giuseppe Jurman⁶

Affiliations

¹ Data Science for Health Unit, Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, 38123, Italy.
² Department for Computational and Integrative Biology, Università degli Studi di Trento, Via Sommarive, 9, Povo, Trento, 38123, Italy.
³ Department of Information Engineering and Computer Science, Università degli Studi di Trento, Via Sommarive, 9, Povo, Trento, 38123, Italy.
⁴ Department of Diagnostic and Public Health, Section of Pathology, University and Hospital Trust of Verona, Verona, Italy.
⁵ Pathology Unit, Department of Laboratory Medicine, Santa Chiara Hospital, APSS, Trento, Italy.
⁶ Data Science for Health Unit, Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, 38123, Italy.

^# Contributed equally.

Abstract

Synthetic data is becoming a valuable tool for computational pathologists, aiding in tasks like data augmentation and addressing data scarcity and privacy. However, its use necessitates careful planning and evaluation to prevent the creation of clinically irrelevant artifacts. This manuscript introduces a comprehensive pipeline for generating and evaluating synthetic pathology data using a diffusion model. The pipeline features a multifaceted evaluation strategy with an integrated explainability procedure, addressing two key aspects of synthetic data use in the medical domain. The evaluation of the generated data employs an ensemble-like approach. The first step assesses the similarity between real and synthetic data using established metrics. The second step evaluates the usability of the generated images in deep learning models, accompanied by explainable AI methods. The final step verifies their histopathological realism through questionnaires answered by professional pathologists. We show that each of these evaluation steps is necessary, as they provide complementary information on the generated data's quality. The pipeline is demonstrated on the public GTEx dataset of 650 Whole Slide Images (WSIs), including five different tissues. An equal number of tiles is generated for each tissue, and their reliability is assessed using the proposed evaluation pipeline, yielding promising results. In summary, the proposed workflow offers a comprehensive solution for generative AI in digital pathology, potentially aiding the community in their transition towards digitalization and data-driven modeling.

MeSH terms

Deep Learning*
Humans
Image Processing, Computer-Assisted / methods
Pathology, Clinical / methods
Reproducibility of Results
