Inter-observer variability in the classification of lumbar foraminal stenosis in magnetic resonance imaging using different evaluation scales

Eur Spine J. 2024 Dec 20. doi: 10.1007/s00586-024-08612-z. Online ahead of print.

Abstract

Background: The evaluation of lumbar spine degeneration on magnetic resonance imaging (MRI) is prone to inter-reader variability, including when assessing foraminal changes. This variability, often due to subjective criteria and inconsistent terminology, may affect clinical correlations. Standardized criteria could help improve agreement among readers.

Materials and methods: Lumbar spine MRI examinations of 50 randomly selected patients were evaluated by 12 independent readers. For each patient, foraminal stenosis was assessed using four different rating scales. The first scale classified stenosis as presence/absence of neurologic compromise of the spinal nerve root at the foramen; the second as absent/mild/moderate/severe; the third as normal/contact of disk or osteophyte with the nerve root/deviation of the nerve root/compression of the nerve root; and the fourth used the Lee et al. criteria. Agreement was analyzed using Fleiss' kappa coefficients.
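For readers unfamiliar with the statistic, the following is a minimal sketch of how an overall Fleiss' kappa can be computed from a table of rating counts. It is an illustration only, not the authors' analysis code; the function name `fleiss_kappa` and the toy data are hypothetical, and the sketch assumes every rated unit was scored by the same number of readers.

```python
def fleiss_kappa(counts):
    """Overall Fleiss' kappa.

    counts[i][j] is the number of readers who placed rated unit i in
    category j. Every row must sum to the same number of readers.
    (Hypothetical helper for illustration.)
    """
    n_units = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each unit needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    total = n_units * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement per unit: fraction of reader pairs that agree.
    pairs = n_raters * (n_raters - 1)
    p_i = [(sum(c * c for c in row) - n_raters) / pairs for row in counts]

    p_bar = sum(p_i) / n_units        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)


# Toy data (invented): 4 units, 3 readers, dichotomous scale.
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(toy), 3))
```

In this toy table two units are rated unanimously and two are split 2 to 1, so observed agreement is 2/3 against a chance level of 1/2.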

Results: Agreement was moderate with the first scale (k = 0.439) and significantly lower with the second, third, and fourth scales (k = 0.310, k = 0.311, and k = 0.295, respectively). Comparing agreement among board-certified neuroradiologists with agreement among neuroradiology residents showed statistically significant differences for the third and fourth scales: agreement was higher among board-certified neuroradiologists, but still only fair. Individual kappas showed that, for the second, third, and fourth scales, agreement was higher at the extremes of the scale, namely when there was no stenosis or when stenosis was maximal with nerve compression.
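As a rough illustration of the individual (per-category) kappas and of the verbal labels used above, the sketch below applies Fleiss' per-category formula and the commonly cited Landis and Koch bands. The abstract does not state which interpretation scheme was used, so the band cut-offs are an assumption; the function names and toy data are hypothetical.

```python
def per_category_kappa(counts):
    """Fleiss' kappa for each category taken on its own.

    counts[i][j] is the number of readers who placed rated unit i in
    category j; all rows must sum to the same number of readers.
    Returns NaN for a category that is never used or always used.
    (Hypothetical helper for illustration.)
    """
    n_units = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each unit needs the same number (>= 2) of ratings")

    kappas = []
    for j in range(len(counts[0])):
        p = sum(row[j] for row in counts) / (n_units * n_raters)
        q = 1 - p
        if p == 0 or q == 0:
            kappas.append(float("nan"))
            continue
        # Count of disagreeing reader pairs for this category, over units.
        disagree = sum(row[j] * (n_raters - row[j]) for row in counts)
        kappas.append(1 - disagree / (n_units * n_raters * (n_raters - 1) * p * q))
    return kappas


def landis_koch_label(kappa):
    """Verbal label under the Landis and Koch bands (assumed scheme)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Overall values reported in the abstract, labelled under the assumed bands.
for k in (0.439, 0.310, 0.311, 0.295):
    print(k, landis_koch_label(k))
```

Under these assumed bands, the reported overall values fall into the same "moderate" and "fair" ranges that the abstract describes.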

Conclusions: Levels of agreement can differ depending on the scale used. Simpler dichotomous scales may yield higher levels of agreement than more complex ones. Among the non-dichotomous scales, the choice of scale may not change the overall level of agreement. Given the overall low inter-rater agreement observed, there is probably considerable potential to improve agreement through more rigorous training and consensus-building.

Keywords: Foraminal stenosis; Inter-observer agreement; Inter-observer variability; Lumbar spine; Magnetic resonance; Rating scale; Spine degenerative disease.