Recommendations on compiling test datasets for evaluating artificial intelligence solutions in pathology

André Homeyer; Christian Geißler; Lars Ole Schwen; Falk Zakrzewski; Theodore Evans; Klaus Strohmenger; Max Westphal; Roman David Bülow; Michaela Kargl; Aray Karjauv; Isidre Munné-Bertran; Carl Orge Retzlaff; Adrià Romero-López; Tomasz Sołtysiński; Markus Plass; Rita Carvalho; Peter Steinbach; Yu-Chia Lan; Nassim Bouteldja; David Haber; Mateo Rojas-Carulla; Alireza Vafaei Sadr; Matthias Kraft; Daniel Krüger; Rutger Fick; Tobias Lang; Peter Boor; Heimo Müller; Peter Hufnagl; Norman Zerbe

doi:10.1038/s41379-022-01147-y

Mod Pathol. 2022 Dec;35(12):1759-1769. doi: 10.1038/s41379-022-01147-y. Epub 2022 Sep 10.

Authors

André Homeyer^#¹, Christian Geißler^#², Lars Ole Schwen^#³, Falk Zakrzewski^#⁴, Theodore Evans^#², Klaus Strohmenger^#⁵, Max Westphal^#³, Roman David Bülow^#⁶, Michaela Kargl⁷, Aray Karjauv², Isidre Munné-Bertran⁸, Carl Orge Retzlaff², Adrià Romero-López⁹, Tomasz Sołtysiński¹⁰, Markus Plass⁷, Rita Carvalho⁵, Peter Steinbach¹¹, Yu-Chia Lan⁶, Nassim Bouteldja⁶, David Haber⁹, Mateo Rojas-Carulla⁹, Alireza Vafaei Sadr⁶, Matthias Kraft⁹, Daniel Krüger¹², Rutger Fick¹³, Tobias Lang¹⁴, Peter Boor⁶, Heimo Müller⁷, Peter Hufnagl⁵, Norman Zerbe⁵

Affiliations

¹ Fraunhofer Institute for Digital Medicine MEVIS, Max-von-Laue-Straße 2, 28359, Bremen, Germany.
² Technische Universität Berlin, DAI-Labor, Ernst-Reuter-Platz 7, 10587, Berlin, Germany.
³ Fraunhofer Institute for Digital Medicine MEVIS, Max-von-Laue-Straße 2, 28359, Bremen, Germany.
⁴ Institute of Pathology, Carl Gustav Carus University Hospital Dresden (UKD), TU Dresden (TUD), Fetscherstrasse 74, 01307, Dresden, Germany.
⁵ Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Institute of Pathology, Charitéplatz 1, 10117, Berlin, Germany.
⁶ Institute of Pathology, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074, Aachen, Germany.
⁷ Medical University of Graz, Diagnostic and Research Center for Molecular BioMedicine, Diagnostic & Research Institute of Pathology, Neue Stiftingtalstrasse 6, 8010, Graz, Austria.
⁸ MoticEurope, S.L.U., C. Les Corts, 12 Poligono Industrial, 08349, Barcelona, Spain.
⁹ Lakera AI AG, Zelgstrasse 7, 8003, Zürich, Switzerland.
¹⁰ QuIP GmbH, Reinhardtstraße 1, 10117, Berlin, Germany.
¹¹ Helmholtz-Zentrum Dresden Rossendorf, Bautzner Landstraße 400, 01328, Dresden, Germany.
¹² Olympus Soft Imaging Solutions GmbH, Johann-Krane-Weg 39, 48149, Münster, Germany.
¹³ Tribun Health, 2 Rue du Capitaine Scott, 75015, Paris, France.
¹⁴ Mindpeak GmbH, Zirkusweg 2, 20359, Hamburg, Germany.

^# Contributed equally.

Abstract

Artificial intelligence (AI) solutions that automatically extract information from digital histology images have shown great promise for improving pathological diagnosis. Prior to routine use, it is important to evaluate their predictive performance and obtain regulatory approval. This assessment requires appropriate test datasets. However, compiling such datasets is challenging and specific recommendations are missing. A committee of various stakeholders, including commercial AI developers, pathologists, and researchers, discussed key aspects and conducted extensive literature reviews on test datasets in pathology. Here, we summarize the results and derive general recommendations on compiling test datasets. We address several questions: Which and how many images are needed? How should low-prevalence subsets be handled? How can potential bias be detected? How should datasets be reported? What are the regulatory requirements in different countries? The recommendations are intended to help AI developers demonstrate the utility of their products and to help pathologists and regulatory agencies verify reported performance measures. Further research is needed to formulate criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.

Publication types

Review
Research Support, Non-U.S. Gov't

MeSH terms

Artificial Intelligence*
Datasets as Topic
Forecasting
Humans
Pathology*