Although Apgar scores are commonly used as proxy outcomes, little evidence supports the most common cutpoints (<7, <4). We used 2 data sets to explore this issue: one contained planned community births from across the United States (n = 52,877; 2012-2016), and the other contained hospital births from California (n = 428,877; 2010). We treated 5-minute Apgar scores as clinical "tests," comparing them against 18 known outcomes; for each, we calculated sensitivity, specificity, positive and negative predictive values, and the area under the receiver operating characteristic curve. We used 3 different criteria to determine optimal cutpoints. Results were very consistent across data sets, outcomes, and all subgroups: The cutpoint that maximizes the trade-off between sensitivity and specificity is universally <9. However, extremely low positive predictive values for all outcomes at <9 indicate more misclassification than is acceptable for research. The areas under the receiver operating characteristic curves (which treat Apgar scores as quasicontinuous) generally indicated adequate discrimination between infants destined to experience poor outcomes and those who were not; comparing median Apgar scores between groups might be an analytical alternative to dichotomizing. Nonetheless, because Apgar scores are not clearly on any causal pathway of interest, we discourage researchers from using them unless the motivation for doing so is clear.
Keywords: Apgar score; ROC curve; infant health.
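To make the evaluation concrete, the sketch below (not the authors' code) computes the test characteristics described in the abstract on synthetic data: it dichotomizes toy 5-minute Apgar scores at each candidate cutpoint, reports sensitivity, specificity, and positive and negative predictive values, computes the area under the receiver operating characteristic curve, and flags the cutpoint maximizing Youden's J, one common optimality criterion (the 3 criteria actually used in the study are not named in the abstract). The simulated prevalence, score distributions, and use of scikit-learn are all illustrative assumptions.

```python
# Illustrative only: synthetic data, not the study cohorts.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 10_000
outcome = rng.random(n) < 0.02            # ~2% adverse outcomes (assumed)
# Toy 0-10 Apgar scores, shifted lower for infants with the outcome (assumed)
apgar = np.clip(np.round(rng.normal(np.where(outcome, 7.0, 8.8), 1.2)), 0, 10)

def test_metrics(scores, y, cutpoint):
    """Treat 'Apgar < cutpoint' as a positive test for outcome y."""
    positive = scores < cutpoint
    tp = np.sum(positive & y)
    fp = np.sum(positive & ~y)
    fn = np.sum(~positive & y)
    tn = np.sum(~positive & ~y)
    sens = tp / (tp + fn)                 # true-positive rate
    spec = tn / (tn + fp)                 # true-negative rate
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return sens, spec, ppv, npv

# AUC treats the score as quasicontinuous; negate so lower Apgar = higher risk
print(f"AUC = {roc_auc_score(outcome, -apgar):.3f}")

# Youden's J (sensitivity + specificity - 1): one common cutpoint criterion
best = max(range(1, 11),
           key=lambda c: sum(test_metrics(apgar, outcome, c)[:2]) - 1)
for c in range(1, 11):
    sens, spec, ppv, npv = test_metrics(apgar, outcome, c)
    flag = "  <- max Youden J" if c == best else ""
    print(f"<{c:2d}: sens={sens:.2f} spec={spec:.2f} "
          f"ppv={ppv:.3f} npv={npv:.3f}{flag}")
```

With a rare outcome, even the Youden-optimal cutpoint yields a low positive predictive value, which mirrors the pattern the abstract reports at <9.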