Objective: To determine the degree of precision and the sources of variability among experts in scoring radiographic abnormalities in rheumatoid arthritis.
Methods: Radiographic scores from 6 datasets in which 2 or more readers had scored film sets were analyzed. The datasets included scores by 11 different readers, 6 of whom scored films by both the Larsen (global) and Sharp (composite) methods. Scores from every possible pair of readers were compared to calculate the smallest detectable difference (SDD), both on raw scores and on scores normalized for each individual reader (nSDD). The intraclass correlation (ICC), Pearson's r, and the correlation between differences in score and their mean scores were determined. Agreement on progression of radiographic damage scores was also examined.
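The pairwise SDD comparison described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not state the SDD formula or the normalization used, so the sketch assumes the common limits-of-agreement form (1.96 times the standard deviation of the paired differences) and assumes that normalization rescales each reader's scores to 0-100 over that reader's own observed range. The function names and the scores are hypothetical.

```python
import statistics
from itertools import combinations


def sdd(scores_a, scores_b):
    """SDD for two sets of paired scores.

    Assumed definition: 1.96 x sample SD of the paired differences
    (the abstract does not give the exact formula).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return 1.96 * statistics.stdev(diffs)


def normalize(scores):
    """Rescale one reader's scores to 0-100 over that reader's observed range.

    This normalization is an assumption made for illustration only.
    """
    lo, hi = min(scores), max(scores)
    return [100.0 * (s - lo) / (hi - lo) for s in scores]


def pairwise_sdd(readers, normalized=False):
    """Compute the SDD (or nSDD) for every possible pair of readers."""
    results = {}
    for (name_a, a), (name_b, b) in combinations(readers.items(), 2):
        if normalized:
            a, b = normalize(a), normalize(b)
        results[(name_a, name_b)] = sdd(a, b)
    return results


# Made-up scores for five films; not data from the study.
readers = {
    "reader_1": [10, 20, 30, 40, 50],
    "reader_2": [12, 19, 33, 41, 55],
    "reader_3": [20, 40, 60, 80, 100],
}

raw = pairwise_sdd(readers)                    # SDD on raw scores
norm = pairwise_sdd(readers, normalized=True)  # nSDD on normalized scores
```

In this made-up example, reader_3 uses twice the scale of reader_1, so their raw SDD is large while their nSDD is zero. This mirrors the point that normalization removes differences due purely to the range each reader uses.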
Results: Variability among readers was greater than previous studies suggested. Agreement was better for intra- than interreader comparisons; average intrareader SDD was 24.4 for the composite method and 9.0 for the global. The larger SDD for the composite method reflects its greater range of possible scores. When normalized scores were used to adjust for the range difference, there was minimal difference in the SDD; nSDD was 10.1 for the composite method and 8.0 for the global. Interreader variability was larger: SDD was 53.7 for the composite method and 23.3 for the global; nSDD was 12.9 and 14.4, respectively. ICC varied between 0.465 and 0.999, with all but one value below 0.925 occurring in composite scores with a range below 100. Differences in repeated scores were frequently associated with the mean of those scores, and this association was stronger for inter- than for intrareader comparisons. Agreement between progression scores showed a similar pattern. The SDD was smaller for intrareader comparisons and for global scores: compare 13.7 (composite, intrareader) and 5.4 (global, intrareader) to 18.1 (composite, interreader) and 8.7 (global, interreader). The ICC was lower for progression scores than for raw scores, with averages ranging from 0.661 to 0.885.
Conclusion: The variability in scoring radiographic abnormalities is considerable among this group of 11 expert readers. This has important implications for power calculations in comparison studies such as therapeutic trials and for cross-trial comparisons. The correlation between the difference in repeated scores and their means indicates systematic error (bias), which, if corrected, may improve the detection of treatment effects when using a responder-type analysis. These and other design and analysis issues are discussed.
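The check for systematic error mentioned above, the correlation between the difference in repeated scores and their mean, can be sketched as follows. This is an illustrative sketch under assumptions: it uses Pearson's r computed by hand, the function names are hypothetical, and the scores are made up rather than taken from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


def difference_mean_correlation(scores_a, scores_b):
    """Correlate each pair's difference (a - b) with the pair's mean.

    A correlation far from zero suggests the disagreement depends on the
    size of the score, i.e. a systematic rather than a random error.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    return pearson_r(means, diffs)


# Made-up scores for five films; not data from the study.
reader_a = [10, 20, 30, 40, 50]
reader_scaled = [20, 40, 60, 80, 100]  # always scores twice as high as reader_a
reader_noisy = [12, 18, 32, 38, 52]    # differs from reader_a by +/-2, no trend

r_scaled = difference_mean_correlation(reader_a, reader_scaled)
r_noisy = difference_mean_correlation(reader_a, reader_noisy)
```

Here the scaled reader gives a difference-mean correlation of essentially -1, the signature of a proportional bias, while the noisy reader gives a correlation near zero, as expected when disagreement is unrelated to the size of the score.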