Heatmap analysis for artificial intelligence explainability in diabetic retinopathy detection: illuminating the rationale of deep learning decisions

Fernando Korn Malerbi; Luis Filipe Nakayama; Paulo Prado; Fernando Yamanaka; Gustavo Barreto Melo; Caio Vinicius Regatieri; José Augusto Stuchi

doi:10.21037/atm-24-73

Ann Transl Med. 2024 Oct 20;12(5):89. doi: 10.21037/atm-24-73. Epub 2024 Oct 12.

Authors

Fernando Korn Malerbi¹ ² ³, Luis Filipe Nakayama¹, Paulo Prado², Fernando Yamanaka², Gustavo Barreto Melo¹ ⁴, Caio Vinicius Regatieri¹, José Augusto Stuchi²

Affiliations

¹ Department of Ophthalmology and Visual Sciences, Federal University of Sao Paulo, Sao Paulo, Brazil.
² Phelcom Technologies, Sao Carlos, Brazil.
³ Diabetes Center, Federal University of Sao Paulo, Sao Paulo, Brazil.
⁴ Sergipe Eye Hospital (Hospital de Olhos de Sergipe), Aracaju, Sergipe, Brazil.

Abstract

Background: The opaqueness of the decision processes of artificial intelligence (AI) algorithms limits their application in healthcare. Our objective was to explore discrepancies between heatmaps originating from slightly different retinal images of the same eyes of individuals with diabetes, in order to gain insights into the deep learning (DL) decision process.

Methods: Pairs of retinal images from the same eyes of individuals with diabetes, each pair composed of one image obtained before and one obtained after pupil dilation, underwent automatic analysis by a convolutional neural network for the presence of diabetic retinopathy (DR), with the output being a score ranging from 0 to 1. Gradient-based Class Activation Maps (GradCam) allowed visualization of activated areas. Pairs of images with discordant DL scores or discordant DL outputs within the pair were objectively compared with the concordant pairs with regard to the sum of activations of the Class Activation Mapping (CAM), the number of activated areas, and DL score differences. Heatmaps of discordant pairs were also qualitatively assessed.
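The objective heatmap variables named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the activation threshold of 0.5, the decision threshold of 0.5, the score-difference cut-off of 0.3, the use of 4-connectivity, and the choice to sum activations over the whole map are all assumptions made for illustration, since the abstract does not state them.

```python
import numpy as np


def heatmap_metrics(cam, activation_threshold=0.5):
    """Return (sum of activations, number of activated areas) for a 2-D CAM.

    Assumptions (not from the source): the CAM is normalised to [0, 1],
    the sum is taken over the whole map, and an "activated area" is a
    4-connected region of pixels at or above `activation_threshold`.
    """
    cam = np.asarray(cam, dtype=float)
    mask = cam >= activation_threshold
    total_activation = float(cam.sum())

    # Count connected regions with an explicit-stack flood fill.
    seen = np.zeros(mask.shape, dtype=bool)
    rows, cols = mask.shape
    n_areas = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r, c] and not seen[r, c]:
                n_areas += 1
                seen[r, c] = True
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            stack.append((ny, nx))
    return total_activation, n_areas


def pair_discordance(score_a, score_b, decision_threshold=0.5, max_score_diff=0.3):
    """Compare the DL scores of two images of the same eye.

    Returns (score difference, discordant by score, discordant by output).
    Both thresholds are hypothetical placeholders.
    """
    diff = abs(score_a - score_b)
    discordant_by_score = diff > max_score_diff
    discordant_by_output = (score_a >= decision_threshold) != (score_b >= decision_threshold)
    return diff, discordant_by_score, discordant_by_output
```

With these helpers, each image of a pair would yield a sum of activations and a count of activated areas, and each pair would yield a score difference plus two discordance flags, which could then be compared between concordant and discordant groups.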

Results: Algorithmic performance for the detection of DR attained 89.8% sensitivity, 96.3% specificity and an area under the receiver operating characteristic (ROC) curve of 0.95. Out of 210 comparable pairs of images, 20 eyes and 10 eyes were considered discordant according to DL score difference and DL output, respectively. Comparison of concordant versus discordant groups showed statistically significant differences for all objective variables. Qualitative analysis pointed to subtle differences in image quality within discordant pairs.

Conclusions: The relationship established between objective parameters extracted from heatmaps and DL output discrepancies reinforces the role of heatmaps in DL explainability, fostering acceptance of DL systems for clinical use.

Keywords: Artificial intelligence (AI); deep learning (DL); diabetic retinopathy (DR); explainability; retina.