Introducing New Measures of Inter- and Intra-Rater Agreement to Assess the Reliability of Medical Ground Truth

Stud Health Technol Inform. 2020 Jun 16;270:282-286. doi: 10.3233/SHTI200167.

Abstract

In this paper, we present and discuss two new measures of inter- and intra-rater agreement to assess the reliability of the raters, and hence of their labeling, in multi-rater settings, which are common in the production of ground truth for machine learning models. Our proposal is more conservative than other existing agreement measures, as it considers a more articulated notion of agreement by chance, based on an empirical estimation of the precision (or reliability) of the single raters involved. We discuss the measures in light of a realistic annotation task that involved 13 expert radiologists in labeling the MRNet dataset.
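As an illustration of the general idea described in the abstract (a chance-corrected agreement statistic whose chance term is informed by per-rater precision estimates rather than label marginals alone), the following is a minimal sketch. The specific chance model, the function names, and the example figures are assumptions for illustration only; the abstract does not give the authors' exact formulas.

```python
import numpy as np

# Illustrative sketch only: a kappa-style agreement statistic in which the
# probability of agreement by chance is derived from each rater's estimated
# precision (reliability). The chance model below is an assumption, not the
# paper's definition.

def observed_agreement(labels_a, labels_b):
    """Fraction of items on which the two raters assign the same label."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    return float(np.mean(labels_a == labels_b))

def chance_agreement_from_precision(p_a, p_b, n_classes):
    """Chance agreement, assuming each rater is correct with probability
    p (their estimated precision) and otherwise picks uniformly among the
    remaining n_classes - 1 labels."""
    both_correct = p_a * p_b
    both_wrong_same = (1 - p_a) * (1 - p_b) / (n_classes - 1)
    return both_correct + both_wrong_same

def chance_corrected_agreement(labels_a, labels_b, p_a, p_b, n_classes):
    """Kappa-style statistic: (Po - Pe) / (1 - Pe)."""
    po = observed_agreement(labels_a, labels_b)
    pe = chance_agreement_from_precision(p_a, p_b, n_classes)
    return (po - pe) / (1 - pe)

# Hypothetical example: two raters labeling 8 knee MRI exams as 0/1,
# with previously estimated precisions of 0.85 and 0.80.
rater1 = [1, 0, 1, 1, 0, 0, 1, 0]
rater2 = [1, 0, 1, 0, 0, 0, 1, 1]
print(chance_corrected_agreement(rater1, rater2, p_a=0.85, p_b=0.80, n_classes=2))
```

Because the chance term grows with the raters' estimated precision, the resulting statistic tends to be more conservative than one based on label marginals alone, which is the behavior the abstract attributes to the proposed measures.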

Keywords: Ground Truth; Machine Learning; inter-rater agreement; reliability.

MeSH terms

  • Electronic Health Records
  • Humans
  • Information Storage and Retrieval / methods*
  • Machine Learning*
  • Observer Variation*
  • Radiologists*
  • Radiology
  • Reproducibility of Results