Background: A single learning algorithm can produce deep learning-based image segmentation models that vary in performance purely due to random effects during training. This study assessed the effect of these random performance fluctuations on the reliability of standard methods of comparing segmentation models.
Methods: The influence of random effects during training was assessed by running a single learning algorithm (nnU-Net) with 50 different random seeds on three multiclass 3D medical image segmentation problems: brain tumour, hippocampus, and cardiac segmentation. Recent literature was sampled to identify the most common methods for estimating and comparing the performance of deep learning segmentation models. Based on this, segmentation performance was estimated using both hold-out validation and 5-fold cross-validation, and the statistical significance of performance differences was assessed using the paired t-test and the Wilcoxon signed-rank test on Dice scores.
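The comparison described in the Methods can be illustrated with a minimal, standard-library-only sketch. It is not the study's code: the function names are hypothetical, the empty-class convention is an assumption, and the Wilcoxon p-value uses a plain normal approximation without tie or continuity correction. In practice, library routines such as `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` would normally be used instead.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def dice_score(pred, target, label):
    """Dice = 2|A ∩ B| / (|A| + |B|) for one class over flattened label maps."""
    a = [p == label for p in pred]
    b = [t == label for t in target]
    intersection = sum(x and y for x, y in zip(a, b))
    denom = sum(a) + sum(b)
    # Assumption: a class absent from both maps is scored as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic on per-case score differences.

    Assumes the differences are not all identical (non-zero standard deviation).
    Turning this into a p-value needs the Student t distribution with n - 1
    degrees of freedom, which the standard library does not provide.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def wilcoxon_signed_rank(scores_a, scores_b):
    """Return (W+, two-sided p) using a normal approximation.

    Zero differences are dropped and tied absolute differences share the
    average rank. No tie or continuity correction is applied (simplification).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return w_plus, 2 * (1 - NormalDist().cdf(abs(z)))
```

Both tests are paired: each model must be scored on the same test cases in the same order, so that the per-case Dice differences are meaningful.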
Results: For the different segmentation problems, the seed producing the highest mean Dice score statistically significantly outperformed between 0 % and 76 % of the remaining seeds when estimating performance using hold-out validation, and between 10 % and 38 % when estimating performance using 5-fold cross-validation.
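The quantity reported in the Results, the share of remaining seeds that the best seed significantly outperforms, could be computed along the following lines. This is a sketch under stated assumptions, not the study's implementation: the function names, the dictionary layout, and the significance level of 0.05 are all assumptions, and the exact sign test shown is only a dependency-free stand-in for the paired t-test and Wilcoxon signed-rank test used in the study.

```python
from math import comb
from statistics import mean


def sign_test_p(scores_a, scores_b):
    """Two-sided exact sign test on paired scores (stand-in test, see lead-in)."""
    pos = sum(a > b for a, b in zip(scores_a, scores_b))
    neg = sum(a < b for a, b in zip(scores_a, scores_b))
    n = pos + neg  # ties carry no sign information and are ignored
    if n == 0:
        return 1.0
    k = min(pos, neg)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def fraction_outperformed(dice_by_seed, p_value_fn, alpha=0.05):
    """Return (best_seed, fraction of other seeds it significantly outperforms).

    dice_by_seed maps each seed to its per-case Dice scores, with all seeds
    evaluated on the same test cases in the same order. Because the best seed
    has the highest mean Dice, any significant two-sided difference is counted
    as the best seed outperforming the other seed.
    """
    best = max(dice_by_seed, key=lambda s: mean(dice_by_seed[s]))
    others = [s for s in dice_by_seed if s != best]
    significant = sum(
        p_value_fn(dice_by_seed[best], dice_by_seed[s]) < alpha for s in others
    )
    return best, significant / len(others)


# Toy scores, invented purely to exercise the functions.
toy = {
    0: [0.90, 0.90, 0.90, 0.90, 0.90, 0.90],
    1: [0.80, 0.80, 0.80, 0.80, 0.80, 0.80],
    2: [0.95, 0.85, 0.95, 0.85, 0.95, 0.84],
}
best_seed, fraction = fraction_outperformed(toy, sign_test_p)
```

Passing the test as `p_value_fn` keeps the counting logic independent of the choice of significance test, so the same function can be reused with either test named in the Methods.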
Conclusion: Random effects during training can cause high rates of statistically significant performance differences between segmentation models from the same learning algorithm. Whilst statistical testing is widely used in contemporary literature, our results indicate that a statistically significant difference in segmentation performance is a weak and unreliable indicator of a true performance difference between two learning algorithms.
Keywords: Deep learning; Medical image segmentation; Performance comparisons; Random seeds; Randomness.
Copyright © 2024. Published by Elsevier Ltd.