Objectives: This project examined the Agency for Healthcare Research and Quality (AHRQ) methods guidance to its Evidence-based Practice Center (EPC) program on grading the strength of evidence (SOE) for therapeutic interventions. Specifically, we tested the inter-rater reliability of the two main components of the AHRQ approach to grading SOE for specific outcomes: (1) scoring evidence on the four required domains (risk of bias, consistency, directness, and precision), separately for randomized controlled trials (RCTs) and observational studies, and (2) assigning an overall SOE grade, given the scores for the individual domains.
Data Sources and Methods: We conducted inter-rater reliability testing using data obtained from two published comparative effectiveness reviews (CERs). We designed 10 exercises (5 on positive outcomes [benefits] and 5 on harms [adverse effects]); all 10 included RCTs, and 6 of the 10 included one or more observational studies.
Eleven reviewer pairs (22 participants) took part in the exercises. Each reviewer independently completed all 10 exercises; each pair then reconciled their independent responses.
We calculated summary statistics to describe agreement among reviewers and the difficulty they reported in making each rating. We used logistic regression to describe the relationship between domain scores and the final SOE grade, both in relation to the specific grade selected and the level of agreement among reviewers. We also examined how independent reviewer ratings changed after reconciliation within reviewer pairs.
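To make these analyses concrete, the sketch below is a minimal illustration in Python using entirely hypothetical ratings and variable names; the report itself does not specify an implementation. It assumes agreement was summarized with a chance-corrected statistic such as Cohen's kappa (which the qualitative labels used in the Results, from slight to substantial, suggest) and fits a simple logistic regression of a binary moderate-or-high SOE indicator on the four domain scores.

```python
# Illustrative sketch only: hypothetical data, not the report's actual analysis.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.linear_model import LogisticRegression

# Hypothetical domain ratings from two independent reviewers across exercises
# (e.g., consistency scored 1 = consistent, 0 = inconsistent).
rater_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
rater_b = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1])

# Cohen's kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for
# chance agreement. Conventional (Landis & Koch) interpretation:
# <= 0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial.
print(f"kappa = {cohen_kappa_score(rater_a, rater_b):.2f}")

# Hypothetical design matrix of domain scores (risk of bias, consistency,
# directness, precision; 1 = favorable) and a binary outcome indicating
# whether the reviewer assigned an overall SOE grade of moderate or high.
domain_scores = np.array([
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
])
soe_moderate_or_high = np.array([1, 0, 0, 1, 0, 1])

# Fit the regression and inspect the direction and size of each domain's
# association with assigning a better SOE grade.
model = LogisticRegression().fit(domain_scores, soe_moderate_or_high)
print("domain coefficients:", model.coef_.round(2))
```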
Results: The level of inter-rater agreement among independent reviewers on domain scores varied considerably, from substantial for RCT risk of bias and directness to slight for observational study risk of bias; agreement on all other domains was moderate or fair. Agreement was generally better for RCTs than for observational studies, and agreement among reconciled reviewer pairs was as good as or better than that among individual independent reviewers.
Agreement on independent reviewers’ SOE grades was generally poorer than agreement on domain scores. Overall agreement was slight, and it was not appreciably better when the analysis was limited to exercises that included only RCTs. Neither agreement on domain scores nor agreement about the level of difficulty in evaluating particular domains predicted the overall SOE grades.
When the evidence was limited to RCTs, better SOE grades (moderate or high) were associated with RCT domain scores rated consistent and precise. The inclusion of observational studies in addition to RCTs in an exercise was a strong predictor of a poorer SOE grade, namely insufficient or low.
Conclusions: Our findings demonstrate that experienced reviewers presented with the same evidence can reach very different conclusions, particularly when they face bodies of evidence that do not lend themselves to meta-analysis and must rely more heavily on their own judgment. Of particular concern is how to deal with (a) outcomes evaluated through a combination of RCTs and observational studies, (b) outcomes evaluated through more than one measure, and (c) evidence that appears to show no difference.
We conclude that additional methodological guidance is needed, including more detail and examples, supported by more training, particularly on how best to evaluate the “thornier” bodies of evidence discussed above. However, some potential for disagreement will always exist, even among experienced reviewers. EPC reviewer teams therefore need to be transparent about how they conducted this task; such transparency will help ensure that stakeholders can be confident in the teams’ interpretation of the evidence.
Our study provided only a first approximation of reviewers’ rationales for differences in SOE decisions. Additional research is needed to identify gaps in the guidance that should be filled, areas where the guidance itself is insufficiently understood (and how best to overcome that deficit), and complex decisions that may still need to be left to the review team’s substantive expertise.