Quantifying lumbar paraspinal intramuscular fat: Accuracy and reliability of automated thresholding models

N Am Spine Soc J. 2024 Jan 24:17:100313. doi: 10.1016/j.xnsj.2024.100313. eCollection 2024 Mar.

Abstract

Background: The reported level of lumbar paraspinal intramuscular fat (IMF) in people with low back pain (LBP) varies considerably across studies using conventional T1- and T2-weighted magnetic resonance imaging (MRI) sequences. This may be due to the different thresholding models employed to quantify IMF. In this study we investigated the accuracy and reliability of established (two-component) and novel (three-component) thresholding models to measure lumbar paraspinal IMF from T2-weighted MRI.

Methods: In this cross-sectional study, we included MRI scans from 30 people with LBP (50% female; mean (SD) age: 46.3 (15.0) years). Gaussian mixture modelling (GMM) and K-means clustering were used to quantify IMF bilaterally from the lumbar multifidus, erector spinae, and psoas major using two- and three-component thresholding approaches (GMM2C; K-means2C; GMM3C; and K-means3C). Dixon fat-water MRI was used as the reference for IMF. Accuracy was measured using Bland-Altman analyses, and reliability was measured using ICC3,1. The mean absolute error between thresholding models was compared using repeated-measures ANOVA and post-hoc paired-samples t-tests (α = 0.05).
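To illustrate the difference between two- and three-component thresholding, the following is a minimal sketch of one-dimensional K-means applied to pixel intensities inside a muscle region of interest. It is not the authors' implementation. The function names (`kmeans_1d`, `fat_percent`), the quantile-based initialisation, and the synthetic intensities are all hypothetical. The sketch also assumes that IMF is taken as the percentage of pixels assigned to the brightest cluster, which the abstract does not specify.

```python
import random

def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on scalar intensities; returns sorted cluster centres."""
    data = sorted(values)
    n = len(data)
    # Assumption: deterministic initialisation at evenly spaced quantiles.
    centres = [data[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        counts = [0] * k
        for v in data:
            # Assign each pixel to its nearest centre.
            j = min(range(k), key=lambda c: abs(v - centres[c]))
            sums[j] += v
            counts[j] += 1
        new = [sums[j] / counts[j] if counts[j] else centres[j] for j in range(k)]
        if new == centres:  # assignments stopped changing
            break
        centres = new
    return sorted(centres)

def fat_percent(values, k):
    """Percentage of pixels whose nearest centre is the brightest cluster.

    Assumption: the brightest cluster is treated as fat.
    """
    centres = kmeans_1d(values, k)
    fat = sum(
        1 for v in values
        if min(range(k), key=lambda c: abs(v - centres[c])) == k - 1
    )
    return 100.0 * fat / len(values)

# Synthetic ROI (made-up intensities, not study data): muscle, an
# intermediate "mixed" population, and fat.
rng = random.Random(0)
roi = (
    [rng.gauss(100, 10) for _ in range(800)]    # muscle-like pixels
    + [rng.gauss(180, 15) for _ in range(120)]  # intermediate pixels
    + [rng.gauss(300, 20) for _ in range(80)]   # fat-like pixels
)

imf_2c = fat_percent(roi, 2)  # two-component estimate
imf_3c = fat_percent(roi, 3)  # three-component estimate
print(f"2C: {imf_2c:.1f}%  3C: {imf_3c:.1f}%")
```

In this synthetic example the two-component model has to place the intermediate pixels in either the muscle or the fat cluster, whereas the three-component model can give them a cluster of their own, so the two estimates differ. A Gaussian mixture variant would replace the nearest-centre assignment with posterior probabilities from fitted component distributions.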

Results: We found poor reliability for K-means2C (ICC3,1 ≤ 0.38), moderate to good reliability for K-means3C (ICC3,1 ≥ 0.68), moderate reliability for GMM2C (ICC3,1 ≥ 0.63) and good reliability for GMM3C (ICC3,1 ≥ 0.77). The GMM (p < .001) and three-component models (p < .001) had smaller mean absolute errors than K-means and two-component models, respectively. None of the investigated models adequately quantified IMF for psoas major (ICC3,1 ≤ 0.01).
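For readers unfamiliar with the agreement statistics reported here, the sketch below shows how ICC3,1 (the two-way mixed, consistency, single-measure form, often written ICC(3,1)) and Bland-Altman bias with 95% limits of agreement can be computed. The function names and the five paired values are hypothetical and chosen only for illustration. They are not study data.

```python
from statistics import mean, stdev

def icc_3_1(ratings):
    """ICC(3,1) from rows of measurements (one row per subject, one column per method)."""
    n = len(ratings)
    k = len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    # Two-way ANOVA sums of squares.
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)              # between-subjects mean square
    ems = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)

def bland_altman(test, reference):
    """Return (bias, lower limit, upper limit) for test minus reference."""
    diffs = [t - r for t, r in zip(test, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Made-up IMF percentages: the "model" reads exactly 1 point above the reference.
reference = [8.0, 10.0, 12.0, 15.0, 20.0]
model = [9.0, 11.0, 13.0, 16.0, 21.0]

icc = icc_3_1([[r, m] for r, m in zip(reference, model)])
bias, lower, upper = bland_altman(model, reference)
print(icc, bias, lower, upper)
```

Because ICC3,1 is a consistency measure, a constant offset between methods leaves it at 1.0 in this toy case, while the Bland-Altman bias exposes the 1-point overestimate. This is one reason to report both statistics.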

Conclusions: The performance of automated thresholding models depends strongly on the choice of algorithm, the number of components, and the muscle assessed. Compared to Dixon MRI, the GMM performed better than K-means, and three-component models performed better than two-component models, for quantifying lumbar multifidus and erector spinae IMF. None of the investigated models accurately quantified IMF for psoas major. Future research is needed to investigate the performance of thresholding models in a more heterogeneous clinical dataset and across different sites and vendors.

Keywords: Adiposity; Back muscles; Low back pain; Machine learning; Magnetic resonance imaging; Thresholding.