Background: The P-POSSUM score, the best known of the predictive scores for postoperative mortality, requires validation for each population and setting.
Methods: Validation comprised discrimination (C-index), the observed:expected (O:E) ratio, calibration with the Hosmer-Lemeshow test, and subgroup analysis (emergency surgery, cancer, age, organ). The study included 3,881 patients from multiple sites undergoing major digestive surgery in France.
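As an illustration only, the three validation measures named above can be sketched in a few lines of Python. This is a minimal sketch, not the study's analysis code: the function names (`c_index`, `oe_ratio`, `hosmer_lemeshow`) and the choice of equal-size risk groups are assumptions, and the inputs are taken to be each patient's predicted probability of death and observed outcome (1 = died, 0 = survived).

```python
from math import fsum


def c_index(probs, outcomes):
    """Discrimination: fraction of (death, survivor) pairs in which the
    patient who died had the higher predicted risk; ties count as 0.5."""
    died = [p for p, y in zip(probs, outcomes) if y == 1]
    survived = [p for p, y in zip(probs, outcomes) if y == 0]
    if not died or not survived:
        raise ValueError("need at least one death and one survivor")
    concordant = 0.0
    for p_died in died:
        for p_surv in survived:
            if p_died > p_surv:
                concordant += 1.0
            elif p_died == p_surv:
                concordant += 0.5
    return concordant / (len(died) * len(survived))


def oe_ratio(probs, outcomes):
    """Observed deaths divided by the sum of predicted probabilities."""
    return sum(outcomes) / fsum(probs)


def hosmer_lemeshow(probs, outcomes, groups=10):
    """Hosmer-Lemeshow chi-square statistic over groups of (nearly) equal
    size, formed after sorting patients by predicted risk (an assumption;
    fixed risk bands are another common grouping)."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = fsum(p for p, _ in chunk)
        observed = sum(y for _, y in chunk)
        # (O - E)^2 / (n_g * mean_p * (1 - mean_p)), with mean_p = E / n_g
        denominator = expected * (1.0 - expected / size)
        if denominator > 0:
            statistic += (observed - expected) ** 2 / denominator
    return statistic
```

The Hosmer-Lemeshow statistic is conventionally compared with a chi-square distribution on (groups - 2) degrees of freedom to obtain a p value; that step is omitted here because it needs a statistics library.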
Results: Discrimination, assessed with the receiver operating characteristic curve, was good (C-index = 0.87). The overall O:E ratio was 1 (95% confidence interval [95% CI]: 0.88-1.13), indicating that surgical performance was within the expected range. The O:E ratio, calculated by risk range, showed overestimation in the low-risk range, especially in the 3% to 6% and 6% to 10% ranges. Calibration was poor (p < 0.001). The model deviated from the normal pattern of calibration, with mortality lower than expected in the high-risk range. Subgroup analysis found reasonable to good discrimination of populations (C-index ranging from 0.78 to 0.93, except for liver surgery [0.67]), while calibration of individuals remained poor (p < 0.001 to 0.02).
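The O:E ratio by risk range can be sketched in the same way. This is a hedged illustration: the function name `oe_by_band` is hypothetical, and the confidence interval uses a simple normal approximation to a Poisson count of observed deaths, which is an assumption and not necessarily the method used in the study. A ratio below 1 means fewer deaths than predicted (the score overestimates risk); a ratio above 1 means the opposite.

```python
from math import sqrt


def oe_by_band(probs, outcomes, edges):
    """Return one (low, high, observed, expected, ratio, ci_low, ci_high)
    tuple per risk band. Bands are [low, high), except the last band,
    which also includes its upper edge."""
    rows = []
    for low, high in zip(edges, edges[1:]):
        is_last = high == edges[-1]
        members = [
            (p, y) for p, y in zip(probs, outcomes)
            if low <= p < high or (is_last and p == high)
        ]
        observed = sum(y for _, y in members)
        expected = sum(p for p, _ in members)
        if expected == 0:
            # Empty band: the ratio is undefined.
            rows.append((low, high, observed, expected, None, None, None))
            continue
        ratio = observed / expected
        # Assumed interval: O/E +/- 1.96 * sqrt(O) / E, floored at zero.
        half_width = 1.96 * sqrt(observed) / expected
        rows.append((low, high, observed, expected, ratio,
                     max(0.0, ratio - half_width), ratio + half_width))
    return rows
```

With band edges such as `[0.0, 0.03, 0.06, 0.10, 1.0]`, the second and third rows correspond to the 3% to 6% and 6% to 10% ranges discussed above.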
Conclusions: Good discrimination and an overall O:E ratio not significantly different from 1 make P-POSSUM a valuable tool for surgical audit comparing mortality between populations undergoing major digestive surgery. Conversely, poor calibration (goodness of fit), especially in subgroup analysis, and underestimation or overestimation in the O:E ratios considerably limit the value of P-POSSUM for predicting mortality in individuals. Therefore, P-POSSUM should not be used to predict the outcome for an individual patient.