Background: Quality improvement in health care requires identifying areas in need of improvement by comparing processes and patient outcomes within and between health care providers. It is critical to adjust for differences in case-mix and outcome risk between patient populations, but it is currently unclear which approach has the highest validity and how its limitations should be dealt with. Our aim was to compare 3 approaches to risk adjustment for 7 different major quality indicators in neonatal intensive care (21 models).
Methods: We compared indirect standardization, logistic regression and a multilevel approach. Parameters for risk adjustment were chosen according to the literature and on the condition that they must not depend on processes performed by the treating clinics. Predictive validity was tested using the mean Brier score and by comparing the area under the curve (AUC), using high-quality population-based data separated into training and validation sets. Changes in attributional validity were analysed by comparing the effect of the models on the observed-to-expected ratios of the clinics in standardized mortality/morbidity ratio charts.
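The three quantities named in the Methods (mean Brier score, AUC and a clinic's observed-to-expected ratio) can be illustrated with a minimal sketch. The function names and the toy data below are hypothetical and are not taken from the study; the sketch assumes binary outcomes coded 0/1 and predicted risks that come from some already fitted risk-adjustment model.

```python
from typing import Sequence


def mean_brier_score(outcomes: Sequence[int], risks: Sequence[float]) -> float:
    """Mean squared difference between predicted risk and observed 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, risks)) / len(outcomes)


def auc(outcomes: Sequence[int], risks: Sequence[float]) -> float:
    """Area under the ROC curve (c-statistic) via pairwise comparison.

    Fraction of (event, non-event) pairs in which the event has the
    higher predicted risk; ties count as one half.
    """
    events = [p for y, p in zip(outcomes, risks) if y == 1]
    non_events = [p for y, p in zip(outcomes, risks) if y == 0]
    if not events or not non_events:
        raise ValueError("AUC needs at least one event and one non-event")
    concordant = sum(
        1.0 if pe > pn else 0.5 if pe == pn else 0.0
        for pe in events
        for pn in non_events
    )
    return concordant / (len(events) * len(non_events))


def observed_to_expected(outcomes: Sequence[int], risks: Sequence[float]) -> float:
    """Observed event count divided by the sum of predicted risks.

    Applied to one clinic's patients, this is that clinic's standardized
    mortality/morbidity ratio under the chosen risk model.
    """
    expected = sum(risks)
    if expected == 0:
        raise ValueError("expected event count is zero")
    return sum(outcomes) / expected


# Invented toy data for a single hypothetical clinic (not study data).
toy_outcomes = [0, 0, 1, 1, 0, 1]
toy_risks = [0.1, 0.2, 0.7, 0.6, 0.4, 0.3]

print("mean Brier score:", mean_brier_score(toy_outcomes, toy_risks))
print("AUC:", auc(toy_outcomes, toy_risks))
print("O/E ratio:", observed_to_expected(toy_outcomes, toy_risks))
```

A lower Brier score and a higher AUC indicate better predictive validity, while an observed-to-expected ratio above 1 means the clinic had more events than the risk model predicted for its case-mix.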
Results: Risk adjustment based on indirect standardization yielded inferior c-statistics but superior Brier scores for 3 of 7 outcomes. Logistic regression and multilevel modelling were equivalent to one another. C-statistics showed that predictive validity was high for 8 and acceptable for 11 of the 21 models. Yet the effect of all forms of risk adjustment on any clinic's comparison with the standard was small, even though there was clear risk heterogeneity between clinics.
Conclusions: All three approaches to risk adjustment produced comparable results. The limited effect of risk adjustment on clinic comparisons indicates a small case-mix influence on observed outcomes, but also a limited ability to isolate quality improvement potential based on risk-adjustment models. Rather than relying on methodological approaches alone, we recommend that clinics build small collaboratives and compare their indicators in both risk-adjusted and unadjusted form. This allows the residual risk differences to be investigated and discussed qualitatively within networks. Predictive validity should be quantified and reported, and stratification into risk groups should be more widely used to correct for confounding.
Keywords: Effectiveness; Indirect standardization; Logistic regression; Mean Brier score; Multilevel; Neonatology; Quality improvement; Risk adjustment; ROC area under curve.