Objectives: We hypothesized that published performance figures for artificial intelligence (AI) algorithms detecting pneumothorax (PTX) in chest radiographs (CXRs) do not sufficiently account for the influence of PTX size and for confounding effects caused by thoracic tubes (TTs). Therefore, we established a radiologically annotated benchmarking cohort (n = 6434) that allows for a detailed subgroup analysis.
Materials and methods: We retrospectively identified 6434 supine CXRs, among them 1652 PTX-positive cases and 4782 PTX-negative cases. Supine CXRs were radiologically annotated for PTX size, PTX location, and inserted TTs. We evaluated the diagnostic performance of 2 AI algorithms ("AI_CheXNet" [Rajpurkar et al], "AI_1.5" [Guendel et al]), both trained on publicly available datasets with labels obtained from automatic report interpretation. The algorithms' discriminative power for PTX detection was quantified by the area under the receiver operating characteristic curve (AUROC), and significance analysis was based on the corresponding 95% confidence interval. A detailed subgroup analysis was performed to quantify the influence of PTX size and the confounding effects caused by inserted TTs.
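The evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary keys (`score`, `ptx`, `tube`), and the toy scores are hypothetical, and the percentile bootstrap is only one assumed way of obtaining a 95% confidence interval, since the method used in the study is not stated here.

```python
import random

def auroc(scores_pos, scores_neg):
    """AUROC as the probability that a PTX-positive case receives a higher
    score than a PTX-negative case (ties count as 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def bootstrap_ci(scores_pos, scores_neg, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUROC (an assumed CI method)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample positives and negatives separately, with replacement.
        pos = rng.choices(scores_pos, k=len(scores_pos))
        neg = rng.choices(scores_neg, k=len(scores_neg))
        stats.append(auroc(pos, neg))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def subgroup_auroc(cases, pos_filter, neg_filter):
    """AUROC for a chosen subgroup of positives referenced to a chosen
    subgroup of negative controls."""
    pos = [c["score"] for c in cases if c["ptx"] and pos_filter(c)]
    neg = [c["score"] for c in cases if not c["ptx"] and neg_filter(c)]
    return auroc(pos, neg)

# Toy, made-up cases for a hypothetical model whose score reacts to tubes.
cases = [
    {"score": 0.9, "ptx": True,  "tube": True},
    {"score": 0.4, "ptx": True,  "tube": False},
    {"score": 0.8, "ptx": False, "tube": True},
    {"score": 0.1, "ptx": False, "tube": False},
]

overall = subgroup_auroc(cases, lambda c: True, lambda c: True)
# PTX without tube, referenced to controls with tube.
no_tube_vs_tube = subgroup_auroc(cases, lambda c: not c["tube"], lambda c: c["tube"])
# PTX with tube, referenced to controls without tube.
tube_vs_no_tube = subgroup_auroc(cases, lambda c: c["tube"], lambda c: not c["tube"])
```

Choosing the positive and negative subgroups independently is what makes a confounder visible: a model that partly scores the tube rather than the pneumothorax looks acceptable overall but changes sharply once tube status differs between cases and controls.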
Results: Algorithm performance was quantified as follows: overall performance with AUROCs of 0.704 (AI_1.5) / 0.765 (AI_CheXNet) for unilateral PTXs, AUROCs of 0.666 (AI_1.5) / 0.722 (AI_CheXNet) for unilateral PTXs smaller than 1 cm, and AUROCs of 0.735 (AI_1.5) / 0.818 (AI_CheXNet) for unilateral PTXs larger than 2 cm. Subgroup analysis identified TTs as strong confounders that significantly influence algorithm performance: discriminative power was completely eliminated when PTX-positive cases without TTs were referenced to PTX-negative control cases with inserted TTs. Conversely, AUROCs increased up to 0.875 (AI_CheXNet) for large PTX-positive cases with inserted TTs referenced to control cases without TTs.
Conclusions: Our detailed subgroup analysis demonstrated that the performance of established AI algorithms for PTX detection trained on public datasets strongly depends on PTX size and is significantly biased by confounding image features, such as inserted TTs. Our clinically relevant, radiologically annotated benchmarking cohort might be of great benefit for ongoing algorithm development.