Background: The assumption of consistency, defined as agreement between direct and indirect sources of evidence, underlies the increasingly popular method of network meta-analysis. This assumption is often evaluated by statistically testing for a difference between direct and indirect estimates within each loop of evidence. However, the test is believed to be underpowered. We aim to evaluate its properties when applied to a loop typically found in published networks.
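To make the test concrete: on the log odds ratio scale, for a triangular loop of treatments A, B and C, it can be sketched as follows. This is a minimal formulation in the spirit of the Bucher approach; the notation is ours, not the paper's.

```latex
% Indirect estimate of the A-B effect from the A-C and C-B comparisons,
% the inconsistency factor w, and the z statistic used to test w = 0:
\hat{\mu}_{AB}^{\mathrm{ind}} = \hat{\mu}_{AC}^{\mathrm{dir}} + \hat{\mu}_{CB}^{\mathrm{dir}},
\qquad
w = \hat{\mu}_{AB}^{\mathrm{dir}} - \hat{\mu}_{AB}^{\mathrm{ind}},
\qquad
z = \frac{w}{\sqrt{\widehat{\operatorname{Var}}\!\left(\hat{\mu}_{AB}^{\mathrm{dir}}\right)
      + \widehat{\operatorname{Var}}\!\left(\hat{\mu}_{AB}^{\mathrm{ind}}\right)}}
```

Under consistency w = 0, and z is compared with a standard normal distribution.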
Methods: In a simulation study, we estimate the type I error, power and coverage probability of the inconsistency test for dichotomous outcomes, using realistic scenarios informed by previous empirical studies. We evaluate the test's properties in the presence and absence of heterogeneity, with different estimators of heterogeneity and with different methods for inference about the pairwise summary effects (the Knapp-Hartung and inverse variance methods).
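The sketch below illustrates one such simulation scenario; it is not the authors' code, and the helper names (`pooled_log_or`, `simulate_comparison`, `bucher_test`) and all scenario parameters (trial counts, arm sizes, control risk, tau^2) are illustrative assumptions. It simulates a triangular loop with dichotomous outcomes, pools each direct comparison by inverse-variance random effects with the DerSimonian-Laird estimator of heterogeneity, and applies the loop inconsistency z-test.

```python
# Minimal sketch of one simulation scenario for a loop A-B-C (assumed setup).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def pooled_log_or(events, n):
    """Inverse-variance random-effects pooling of log odds ratios.
    `events` and `n` are (k, 2) arrays: one row per trial, columns = arms.
    A 0.5 continuity correction is applied to every cell (a common choice)."""
    a = events[:, 0] + 0.5; b = n[:, 0] - events[:, 0] + 0.5
    c = events[:, 1] + 0.5; d = n[:, 1] - events[:, 1] + 0.5
    y = np.log(a * d / (b * c))            # per-trial log odds ratios
    v = 1 / a + 1 / b + 1 / c + 1 / d      # per-trial variances
    w = 1 / v
    q = np.sum(w * (y - np.sum(w * y) / w.sum()) ** 2)
    tau2 = max(0.0, (q - (len(y) - 1)) /
               (w.sum() - np.sum(w ** 2) / w.sum()))  # DerSimonian-Laird
    w_star = 1 / (v + tau2)
    mu = np.sum(w_star * y) / w_star.sum()
    return mu, 1 / w_star.sum()            # pooled effect and its variance

def simulate_comparison(k, n_per_arm, p_ctrl, log_or, tau2):
    """Simulate k two-arm trials of one pairwise comparison."""
    theta = rng.normal(log_or, np.sqrt(tau2), size=k)  # trial-specific effects
    odds = p_ctrl / (1 - p_ctrl) * np.exp(theta)
    e_trt = rng.binomial(n_per_arm, odds / (1 + odds))
    e_ctrl = rng.binomial(n_per_arm, p_ctrl, size=k)
    events = np.column_stack([e_trt, e_ctrl])
    return pooled_log_or(events, np.full((k, 2), n_per_arm))

def bucher_test(loop):
    """z-test comparing direct and indirect estimates of the A-B contrast."""
    (m_ab, v_ab), (m_ac, v_ac), (m_cb, v_cb) = loop
    w = m_ab - (m_ac + m_cb)   # inconsistency factor; indirect = AC + CB
    return 2 * stats.norm.sf(abs(w) / np.sqrt(v_ab + v_ac + v_cb))

# Illustrative run: effects are consistent (0.5 = 0.3 + 0.2), so the
# rejection rate estimates the type I error of the test.
n_reps, alpha = 2000, 0.05
pvals = [bucher_test([simulate_comparison(k=3, n_per_arm=80, p_ctrl=0.2,
                                          log_or=lo, tau2=0.1)
                      for lo in (0.5, 0.3, 0.2)])   # AB, AC, CB
         for _ in range(n_reps)]
print("empirical type I error:", np.mean(np.array(pvals) < alpha))
```

Power under inconsistency can be estimated the same way by setting the true A-B effect away from the sum of the A-C and C-B effects.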
Results: As expected, power is positively associated with sample size and the frequency of the outcome, and negatively associated with the presence of heterogeneity. The type I error converges to the nominal level as the total number of individuals in the loop increases, and coverage is close to the nominal level in most cases. Different estimators of heterogeneity have little impact on test performance, but the method used to derive the variances of the direct estimates does affect inference about inconsistency: the Knapp-Hartung method is more powerful, especially in the absence of heterogeneity, but exhibits a larger type I error. For a 'typical' loop (comprising 8 trials and about 2000 participants), the power to detect a 35% relative difference between the direct and indirect estimates of the odds ratio was 14% for the inverse variance method and 21% for the Knapp-Hartung method, with type I errors of 5% and 11%, respectively.
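As a sketch of why the two methods differ (again illustrative, not the paper's code): the Knapp-Hartung approach replaces the inverse-variance quantity 1/sum(w*) with a variance estimator based on the observed dispersion of the trial effects, with inference for the pairwise effect referred to a t distribution with k-1 degrees of freedom.

```python
# Knapp-Hartung variance of a pooled log odds ratio (assumed helper,
# complementing pooled_log_or above).
import numpy as np

def kh_variance(y, v, tau2, mu):
    """y: per-trial log ORs; v: their variances; mu: the pooled estimate."""
    w_star = 1.0 / (v + tau2)                        # random-effects weights
    return np.sum(w_star * (y - mu) ** 2) / ((len(y) - 1) * w_star.sum())
```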
Conclusions: The study gives insight into the conditions under which the statistical test can detect important inconsistency in a loop of evidence. Although alternative methods for estimating the uncertainty of the mean effect may improve test performance, this study suggests that the test has low power for the 'typical' loop. Investigators should therefore interpret its results with caution and should always consider the comparability of the studies in terms of potential effect modifiers.