Purpose: To compare methodologies for imputing the ethnicity of patients in an urban ophthalmology clinic.
Methods: Using data from 19,165 patients with self-reported ethnicity, surname, and home address, we compared the accuracy of three methodologies for imputing ethnicity: (1) a surname method based on surname tabulations from the 2000 US Census; (2) a geocoding method based on tract-level data from the 2010 US Census; and (3) a combined surname-geocoding method that merges the two sources using Bayes' theorem.
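The combined method can be illustrated with a minimal sketch of a Bayesian update. This is not the study's implementation: it assumes that surname and Census tract are conditionally independent given ethnicity, and the function name and all probabilities below are hypothetical values chosen for illustration, not Census figures.

```python
def combine_surname_and_tract(p_eth_given_surname, p_tract_given_eth):
    """Return P(ethnicity | surname, tract) by Bayes' theorem.

    Assumes surname and tract are conditionally independent given ethnicity,
    so the posterior is proportional to
    P(ethnicity | surname) * P(tract | ethnicity).
    """
    unnormalized = {
        eth: p_eth_given_surname[eth] * p_tract_given_eth[eth]
        for eth in p_eth_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("No ethnicity has non-zero probability for this patient")
    return {eth: value / total for eth, value in unnormalized.items()}


# Hypothetical prior from a surname table: P(ethnicity | surname).
surname_prior = {"white": 0.60, "black": 0.30, "hispanic": 0.05, "asian": 0.05}

# Hypothetical likelihood from tract data: the share of each group's total
# population that lives in the patient's tract, P(tract | ethnicity).
tract_likelihood = {
    "white": 0.00001,
    "black": 0.00008,
    "hispanic": 0.00002,
    "asian": 0.00001,
}

posterior = combine_surname_and_tract(surname_prior, tract_likelihood)

# Impute the single most probable category.
imputed = max(posterior, key=posterior.get)
```

In this toy case the surname alone favors one category, but the tract evidence shifts the posterior toward another, which is the behavior a combined method relies on.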
Results: The combined surname-geocoding method had the highest accuracy of the three methodologies, imputing black ethnicity with a sensitivity of 84% and positive predictive value (PPV) of 94%, white ethnicity with a sensitivity of 92% and PPV of 82%, Hispanic ethnicity with a sensitivity of 77% and PPV of 71%, and Asian ethnicity with a sensitivity of 83% and PPV of 79%. Overall agreement between imputed and self-reported ethnicity was fair for the surname method (κ = 0.23), moderate for the geocoding method (κ = 0.58), and strong for the combined method (κ = 0.76).
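For readers who want to reproduce this kind of comparison, the sketch below shows one way to compute per-category sensitivity and PPV and Cohen's κ from paired self-reported and imputed labels. The helper names and the eight-patient toy data set are assumptions made for illustration and do not come from the study.

```python
from collections import Counter


def sensitivity_and_ppv(self_reported, imputed, category):
    """Sensitivity and PPV for one category, treating self-report as truth."""
    pairs = list(zip(self_reported, imputed))
    tp = sum(1 for t, i in pairs if t == category and i == category)
    fn = sum(1 for t, i in pairs if t == category and i != category)
    fp = sum(1 for t, i in pairs if t != category and i == category)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


def cohens_kappa(self_reported, imputed):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    n = len(self_reported)
    observed = sum(t == i for t, i in zip(self_reported, imputed)) / n
    true_counts = Counter(self_reported)
    imputed_counts = Counter(imputed)
    # Chance agreement from the product of the marginal frequencies.
    expected = sum(true_counts[c] * imputed_counts[c] for c in true_counts) / (n * n)
    if expected == 1:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels, not study data.
self_reported = ["black", "black", "black", "white", "white",
                 "hispanic", "asian", "asian"]
imputed_labels = ["black", "black", "white", "white", "white",
                  "hispanic", "asian", "white"]

black_sensitivity, black_ppv = sensitivity_and_ppv(
    self_reported, imputed_labels, "black"
)
kappa = cohens_kappa(self_reported, imputed_labels)
```

Sensitivity and PPV are reported per category because a method can be accurate for one group and poor for another, whereas κ summarizes agreement across all categories at once.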
Conclusion: A methodology that combines surname analysis and Census tract data using Bayes' theorem to impute ethnicity is superior to the other methods tested and is ideally suited for research using clinical and administrative data.