A cautionary analysis of STAPLE using direct inference of segmentation truth

Med Image Comput Comput Assist Interv. 2014;17(Pt 1):398-406. doi: 10.1007/978-3-319-10404-1_50.

Abstract

In this paper we analyze the properties of the well-known segmentation fusion algorithm STAPLE, using a novel inference technique that analytically marginalizes out all model parameters. We demonstrate both theoretically and empirically that when the number of raters is large, or when consensus regions are included in the model, STAPLE devolves into thresholding the average of the input segmentations. We further show that when the number of raters is small, the STAPLE result may not be the optimal segmentation truth estimate, and its model parameter estimates might not reflect the individual raters' actual segmentation performance. Our experiments indicate that these intrinsic weaknesses are frequently exacerbated by the presence of undesirable global optima and convergence issues. Together these results cast doubt on the soundness and usefulness of typical STAPLE outcomes.
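
To make concrete the baseline that STAPLE is said to reduce to, thresholding the average of the input segmentations, here is a minimal sketch in plain Python. It is not the authors' inference technique and not STAPLE itself. The function name `threshold_average`, the `threshold` parameter, and the toy rater data are hypothetical. The abstract does not state which threshold applies, so the default of 0.5 (majority voting) is an assumption.

```python
def threshold_average(segmentations, threshold=0.5):
    """Fuse binary label maps by thresholding their voxel-wise mean.

    segmentations: list of equal-length lists of 0/1 labels, one list per
    rater (a flattened image). threshold is a hypothetical parameter;
    0.5 corresponds to majority voting and is an assumption, not a value
    taken from the paper.
    """
    if not segmentations:
        raise ValueError("need at least one segmentation")
    n_raters = len(segmentations)
    n_voxels = len(segmentations[0])
    if any(len(s) != n_voxels for s in segmentations):
        raise ValueError("all segmentations must have the same length")

    fused = []
    # zip(*segmentations) yields, for each voxel, the labels from all raters
    for voxel_labels in zip(*segmentations):
        mean = sum(voxel_labels) / n_raters
        fused.append(1 if mean > threshold else 0)
    return fused


# Toy data (made up for illustration): three raters, five voxels
raters = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
fused = threshold_average(raters)
print(fused)
```

With the default threshold, a voxel is labeled foreground only when more than half of the raters marked it. Raising the threshold requires broader agreement among raters.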

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Artificial Intelligence
  • Brain / pathology*
  • Humans
  • Image Enhancement / methods
  • Image Interpretation, Computer-Assisted / methods*
  • Information Storage and Retrieval / methods*
  • Magnetic Resonance Imaging / methods*
  • Models, Biological
  • Models, Statistical
  • Pattern Recognition, Automated / methods*
  • Reproducibility of Results
  • Sensitivity and Specificity
  • Software Validation
  • Software*