Background and objectives: The multiple chest x-ray datasets released in the last years have ground-truth labels intended for different computer vision tasks, suggesting that performance in automated chest x-ray interpretation might improve by using a method that can exploit diverse types of annotations. This work presents a Deep Learning method based on the late fusion of different convolutional architectures, that allows training with heterogeneous data with a simple implementation, and evaluates its performance on independent test data. We focused on obtaining a clinically useful tool that could be successfully integrated into a hospital workflow.
Materials and methods: Based on expert opinion, we selected four target chest x-ray findings, namely lung opacities, fractures, pneumothorax and pleural effusion. For each finding we defined the most suitable type of ground-truth label, and built four training datasets combining images from public chest x-ray datasets and our institutional archive. We trained four different Deep Learning architectures and combined their outputs with a late fusion strategy, obtaining a unified tool. The performance was measured on two test datasets: an external openly-available dataset, and a retrospective institutional dataset, to estimate performance on the local population.
Results: The external and local test sets had 4376 and 1064 images, respectively, for which the model showed an area under the Receiver Operating Characteristics curve of 0.75 (95%CI: 0.74-0.76) and 0.87 (95%CI: 0.86-0.89) in the detection of abnormal chest x-rays. For the local population, a sensitivity of 86% (95%CI: 84-90), and a specificity of 88% (95%CI: 86-90) were obtained, with no significant differences between demographic subgroups. We present examples of heatmaps to show the accomplished level of interpretability, examining true and false positives.
Conclusion: This study presents a new approach for exploiting heterogeneous labels from different chest x-ray datasets, by choosing Deep Learning architectures according to the radiological characteristics of each pathological finding. We estimated the tool's performance on the local population, obtaining results comparable to state-of-the-art metrics. We believe this approach is closer to the actual reading process of chest x-rays by professionals, and therefore more likely to be successful in a real clinical setting.
Keywords: Artificial intelligence; Chest; Clinical decision support systems; Deep Learning; Radiography.
Copyright © 2021 Elsevier B.V. All rights reserved.