Balancing Inferential Integrity and Disclosure Risk via Model Targeted Masking and Multiple Imputation

J Am Stat Assoc. 2022;117(537):52-66. doi: 10.1080/01621459.2021.1909597. Epub 2021 May 4.

Abstract

There is a growing expectation that data collected by government-funded studies should be openly available to ensure research reproducibility, and with it a growing concern about data privacy. One strategy for protecting individuals' identities is to release multiply imputed (MI) synthetic datasets in which sensitive values are masked (Rubin, 1993). However, information loss or incorrectly specified imputation models can weaken or invalidate the inferences obtained from the MI datasets. Studying a restricted-use Canadian Scleroderma Research Group (CSRG) dataset, the authors investigate a new masking framework with a data-augmentation (DA) component and a tuning mechanism that balances protection against identity disclosure with preservation of data utility. In a work-disability study and an interstitial lung disease study, this DA-MI strategy reached 0% identity-disclosure risk, preserved all inferential conclusions, and produced average confidence interval (CI) overlaps of 98.5% and 95.5%, respectively, relative to the 95% CIs constructed from the genuine CSRG dataset; the lowest CI-overlap value was 91%. The currently used methods did not perform as well: their average CI-overlap values ranged from 73.9% to 91.8%, and the lowest value was 28.1%. These findings indicate that the DA-MI masking framework facilitates sharing of useful research data while protecting participants' identities.
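The CI-overlap figures above can be made concrete with a small sketch. The abstract does not state the formula it uses, so the code below assumes one definition that is common in the synthetic-data literature: the length of the intersection of the two intervals, expressed as a fraction of each interval's length, then averaged. The function name `ci_overlap` and the choice to report 0 for disjoint intervals are illustrative assumptions, not the authors' implementation.

```python
def ci_overlap(genuine_ci, masked_ci):
    """Average share of each interval that is covered by their intersection.

    Assumed definition (not taken from the abstract): with intersection
    length I, return 0.5 * (I / len(genuine_ci) + I / len(masked_ci)).
    """
    lo_g, hi_g = genuine_ci
    lo_m, hi_m = masked_ci
    if hi_g <= lo_g or hi_m <= lo_m:
        raise ValueError("each interval must have lower < upper")

    # Length of the intersection; clamped at 0 for disjoint intervals
    # (an assumption, since some definitions allow negative values).
    intersection = max(min(hi_g, hi_m) - max(lo_g, lo_m), 0.0)

    return 0.5 * (intersection / (hi_g - lo_g) + intersection / (hi_m - lo_m))


if __name__ == "__main__":
    # Hypothetical intervals, for illustration only.
    print(ci_overlap((0.0, 2.0), (0.0, 2.0)))  # identical intervals
    print(ci_overlap((0.0, 2.0), (1.0, 3.0)))  # half of each interval shared
```

Under this definition, identical intervals score 1 (100%) and disjoint intervals score 0, so an overlap near 100% means the masked data would lead an analyst to nearly the same interval as the genuine data.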

Keywords: Data augmentation; Disclosure control; Joint modeling; Rare disease; Synthetic data.