Purpose: To compare the performance of an FDA-approved and CE-certified deep learning (DL) software application with that of human radiologists in detecting intracranial hemorrhages (ICH).
Methods: Within a 20-week trial from January to May 2020, 2210 adult non-contrast head CT scans were performed in a single center and automatically analyzed by an artificial intelligence (AI) solution with workflow integration. After excluding 22 scans due to severe motion artifacts, images were retrospectively assessed for the presence of ICHs by a second-year resident and a certified radiologist under simulated time pressure. Disagreements were resolved by a subspecialized neuroradiologist serving as the reference standard. We calculated interrater agreement and diagnostic performance parameters and compared readers using the Breslow-Day and Cochran-Mantel-Haenszel tests.
Results: An ICH was present in 214 out of 2188 scans. The interrater agreement between the resident and the certified radiologist was very high (κ = 0.89) and even higher (κ = 0.93) between the resident and the reference standard. The software produced 64 false-positive and 68 false-negative results, giving an overall sensitivity, specificity, positive predictive value, negative predictive value, and accuracy of 68.2%, 96.8%, 69.5%, 96.6%, and 94.0%, respectively. Corresponding values for the resident were 94.9%, 99.2%, 93.1%, 99.4%, and 98.8%. The accuracy of the DL application was inferior (p < 0.001) to that of both the resident and the certified neuroradiologist.
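The reported performance values for the software follow arithmetically from the counts given above (2188 scans, 214 with ICH, 64 false positives, 68 false negatives). The following is a minimal sketch of that calculation, not the authors' analysis code; the class name `ConfusionMatrix` and its methods are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Hypothetical helper holding the four cells of a 2x2 diagnostic table."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    @classmethod
    def from_reported_counts(cls, total: int, positives: int, fp: int, fn: int):
        # True positives are the ICH scans that were not missed;
        # true negatives are the ICH-free scans that were not flagged.
        tp = positives - fn
        tn = (total - positives) - fp
        return cls(tp=tp, fp=fp, fn=fn, tn=tn)

    def metrics(self) -> dict:
        total = self.tp + self.fp + self.fn + self.tn
        return {
            "sensitivity": self.tp / (self.tp + self.fn),
            "specificity": self.tn / (self.tn + self.fp),
            "ppv": self.tp / (self.tp + self.fp),
            "npv": self.tn / (self.tn + self.fn),
            "accuracy": (self.tp + self.tn) / total,
        }


# Counts for the DL software as stated in the Results.
dl = ConfusionMatrix.from_reported_counts(total=2188, positives=214, fp=64, fn=68)

for name, value in dl.metrics().items():
    print(f"{name}: {value:.1%}")
```

With these inputs the sketch yields 146 true positives and 1910 true negatives, and the printed percentages match the values reported for the software (68.2%, 96.8%, 69.5%, 96.6%, and 94.0%).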
Conclusion: A resident under time pressure outperformed an FDA-approved DL program in detecting ICH in CT scans. Our results underline the importance of thoughtful workflow integration and post-approval validation of AI applications in various clinical environments.
Keywords: Artificial intelligence; Computed tomography; Deep learning; Diagnostic accuracy; Intracranial hemorrhage.
© 2021. The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.