CC BY 4.0 · Methods Inf Med
DOI: 10.1055/a-2385-1355
Original Article

Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?

Ileana Montoya Perez, Parisa Movahedi, Valtteri Nieminen, Antti Airola, Tapio Pahikkala

1   Department of Computing, University of Turku, Turku, Finland
Funding This work has received funding from Business Finland (grant number 37428/31/2020) and the European Union's Horizon Europe research and innovation programme (grant number 101095384). Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or the European Health and Digital Executive Agency (HADEA). Neither the European Union nor the granting authority can be held responsible for them.

Abstract

Background Synthetic data have been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential Privacy (DP) is currently considered the gold standard approach for balancing this trade-off.

Objectives The aim of this study is to investigate how trustworthy the group differences discovered by independent sample tests from DP-synthetic data are. The evaluation is carried out in terms of the tests' Type I and Type II errors: the former quantifies the tests' validity, i.e., whether the probability of false discoveries indeed stays below the significance level, and the latter indicates the tests' power to make real discoveries.
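These two error types can be illustrated with a small simulation: the Type I error of a test is estimated empirically by repeatedly applying it to two groups drawn from the same distribution and counting how often it rejects. The sketch below is illustrative only, not the study's pipeline; it uses Welch's two-sample t-statistic with a normal-approximation critical value in place of an exact p-value.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's two-sample t-statistic; |t| > 1.96 approximates alpha = 0.05
    for large samples via the normal approximation."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

random.seed(0)
trials, rejections = 1000, 0
for _ in range(trials):
    # H0 is true by construction: both groups come from the same distribution,
    # so every rejection is a false discovery (a Type I error).
    g1 = [random.gauss(0, 1) for _ in range(100)]
    g2 = [random.gauss(0, 1) for _ in range(100)]
    if abs(welch_t(g1, g2)) > 1.96:
        rejections += 1
print(rejections / trials)  # empirical Type I error, close to the nominal 0.05
```

A valid test keeps this empirical rate at or below the significance level; the study asks whether the same holds once the test is applied to DP-synthetic data instead of the original samples.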

Methods We evaluate the Mann–Whitney U test, Student's t-test, chi-squared test, and median test on DP-synthetic data. The DP-synthetic datasets are generated from real-world data, including a prostate cancer dataset (n = 500) and a cardiovascular dataset (n = 70,000), as well as from bivariate and multivariate simulated data. Five DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and the MWEM, Private-PGM, and DP GAN algorithms.
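The simplest of the generation methods named above, a basic DP histogram release, can be sketched as follows. This is an illustrative stdlib-only implementation, not the study's code: it assumes Laplace noise calibrated to the histogram's sensitivity of 1, clamps negative noisy counts to zero, and samples synthetic values uniformly within each bin. All function names and parameters are our own.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_histogram_synthetic(data, bins, epsilon, n_synthetic):
    """Release an epsilon-DP histogram of `data` and sample synthetic
    records from it. Adding/removing one record changes exactly one bin
    count by 1, so Laplace(1/epsilon) noise per bin suffices for epsilon-DP."""
    counts = [0] * (len(bins) - 1)
    for x in data:
        for i in range(len(bins) - 1):
            if bins[i] <= x < bins[i + 1]:
                counts[i] += 1
                break
    # Noisy counts, clamped at zero, then normalized into bin probabilities.
    noisy = [max(0.0, c + laplace_noise(1.0 / epsilon)) for c in counts]
    total = sum(noisy) or 1.0
    probs = [c / total for c in noisy]
    # Sample a bin per synthetic record, then a uniform value inside it.
    out = []
    for _ in range(n_synthetic):
        i = random.choices(range(len(probs)), weights=probs)[0]
        out.append(random.uniform(bins[i], bins[i + 1]))
    return out

random.seed(0)
real = [random.gauss(50, 10) for _ in range(500)]
synthetic = dp_histogram_synthetic(real, bins=list(range(0, 101, 10)),
                                   epsilon=1.0, n_synthetic=500)
print(len(synthetic))  # 500
```

Only the noisy histogram, never the raw data, feeds the sampler, so the synthetic records inherit the DP guarantee by post-processing.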

Conclusion A large portion of the evaluation results exhibited dramatically inflated Type I errors, especially at privacy budgets of ϵ ≤ 1. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP Smoothed Histogram-based synthetic data generation method was shown to produce valid Type I error rates at all privacy levels tested, but required a large original dataset and a relaxed privacy budget (ϵ ≥ 5) in order to reach reasonable Type II error levels.
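The inflation mechanism can be reproduced in miniature: when two groups drawn from the same distribution are each synthesized through a DP histogram at small ϵ, the independent noise shifts the two synthetic samples apart, and a standard test mistakes that shift for a real group difference. The sketch below is illustrative only; Welch's t-statistic stands in for the study's four tests, and the synthesizer is a toy Laplace-noised histogram of our own.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's two-sample t-statistic (|t| > 1.96 ~ alpha = 0.05)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def laplace(scale):
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_sample(data, bins, epsilon, n):
    """Toy DP-histogram synthesizer: noisy bin counts, then uniform sampling."""
    counts = [sum(bins[i] <= x < bins[i + 1] for x in data)
              for i in range(len(bins) - 1)]
    noisy = [max(0.0, c + laplace(1.0 / epsilon)) for c in counts]
    idx = random.choices(range(len(noisy)), weights=noisy, k=n)
    return [random.uniform(bins[i], bins[i + 1]) for i in idx]

random.seed(1)
bins = [i * 5 for i in range(21)]  # edges 0, 5, ..., 100
trials, rejections = 200, 0
for _ in range(trials):
    # H0 true: both groups come from the same distribution, so any
    # rejection on the synthetic data is a false discovery.
    g1 = [random.gauss(50, 10) for _ in range(200)]
    g2 = [random.gauss(50, 10) for _ in range(200)]
    s1 = noisy_sample(g1, bins, epsilon=0.1, n=200)
    s2 = noisy_sample(g2, bins, epsilon=0.1, n=200)
    if abs(welch_t(s1, s2)) > 1.96:
        rejections += 1
print(rejections / trials)  # empirical Type I error; a rate well above 0.05 signals inflation
```

At such a tight budget the per-bin noise rivals the true counts, so the noise-induced separation between the two synthetic samples dominates ordinary sampling variability, which is exactly the false-discovery hazard the Conclusion warns about.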

Publication History

Received: 31 May 2023

Accepted: 25 July 2024

Accepted Manuscript online:
13 August 2024

Article published online:
09 September 2024

© 2024. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

 
  • References

  • 1 El Emam K, Rodgers S, Malin B. Anonymising and sharing individual patient data. BMJ 2015; 350 (01) h1139
  • 2 Rubin DB. Statistical disclosure limitation. J Off Stat 1993; 9 (02) 461-468
  • 3 Chen RJ, Lu MY, Chen TY, Williamson DFK, Mahmood F. Synthetic data in machine learning for medicine and healthcare. Nat Biomed Eng 2021; 5 (06) 493-497
  • 4 Hernadez M, Epelde G, Alberdi A, Cilla R, Rankin D. Synthetic tabular data evaluation in the health domain covering resemblance, utility, and privacy dimensions. Methods Inf Med 2023; 62 (S 01): e19-e38
  • 5 Jordon J, Szpruch L, Houssiau F. et al. Synthetic data: what, why and how? arXiv preprint 2022. Accessed May 17, 2023 at: http://arxiv.org/abs/2205.03257
  • 6 Chen D, Yu N, Zhang Y, Fritz M. GAN-Leaks: a taxonomy of membership inference attacks against generative models. Paper presented at: Proceedings of the ACM Conference on Computer and Communications Security. Virtual event, United States: ACM; 2020: 343-362
  • 7 Hayes J, Melis L, Danezis G, De Cristofaro E. LOGAN: membership inference attacks against generative models. arXiv preprint 2018. Accessed May 22, 2023 at: https://arxiv.org/abs/1705.07663v4
  • 8 Stadler T, Oprisanu B, Troncoso C. Synthetic data–anonymisation groundhog day. arXiv preprint 2022. Accessed May 9, 2023 at: https://arxiv.org/abs/2011.07018
  • 9 Carlini N, Brain G, Liu C, Erlingsson Ú, Kos J, Song D. The secret sharer: evaluating and testing unintended memorization in neural networks. Paper presented at: 28th USENIX Security Symposium (USENIX Security 19), Santa Clara, California, United States; 2019: 267-284. Accessed May 17, 2023 at: https://www.usenix.org/conference/usenixsecurity19/presentation/carlini
  • 10 Karr AF, Kohnen CN, Oganian A, Reiter JP, Sanil AP. A framework for evaluating the utility of data altered to protect confidentiality. Am Stat 2006; 60 (03) 224-232
  • 11 Boedihardjo M, Strohmer T, Vershynin R. Covariance's loss is privacy's gain: computationally efficient, private and accurate synthetic data. Found Comput Math 2024; 24 (01) 179-226
  • 12 Dwork C, McSherry F, Nissim K, Smith A. Calibrating noise to sensitivity in private data analysis. In: Halevi S, Rabin T. eds. Theory of Cryptography Conference. Berlin: Springer Berlin Heidelberg; 2006: 265-284
  • 13 Dwork C, Roth A. The algorithmic foundations of differential privacy. Found Trends Theor Comput Sci 2014; 9 (3–4): 211-487
  • 14 Wasserman L, Zhou S. A statistical framework for differential privacy. J Am Stat Assoc 2010; 105 (489) 375-389
  • 15 Gong M, Xie Y, Pan K, Feng K, Qin AK. A survey on differentially private machine learning. IEEE Comput Intell Mag 2020; 15 (02) 49-64
  • 16 Xu J, Zhang Z, Xiao X, Yang Y, Yu G, Winslett M. Differentially private histogram publication. VLDB J 2013; 22 (06) 797-822
  • 17 Gaboardi M, Lim HW, Rogers R, Vadhan SP. Differentially private chi-squared hypothesis testing: goodness of fit and independence testing. Paper presented at: Proceedings of the 33rd International Conference on Machine Learning, New York, United States. PMLR; 2016: 2111-2120
  • 18 Task C, Clifton C. Differentially private significance testing on paired-sample data. Paper presented at: 16th SIAM International Conference on Data Mining, Miami, Florida, United States, May 5–7, 2016; SDM; 2016: 153-161
  • 19 Couch S, Kazan Z, Shi K, Bray A, Groce A. Differentially private nonparametric hypothesis testing. Paper presented at: Proceedings of the ACM Conference on Computer and Communications Security. ACM; 2019: 737-751
  • 20 Ferrando C, Wang S, Sheldon D. Parametric bootstrap for differentially private confidence intervals. arXiv preprint 2021. Accessed February 5, 2023 at: https://arxiv.org/abs/2006.07749
  • 21 Chaudhuri K, Monteleoni C, Sarwate AD. Differentially private empirical risk minimization. J Mach Learn Res 2011; 12: 1069-1109
  • 22 Hardt M, Ligett K, McSherry F. A simple and practical algorithm for differentially private data release. Adv Neural Inf Process Syst 2012; 3: 2339-2347
  • 23 Ping H, Stoyanovich J, Howe B. DataSynthesizer: privacy-preserving synthetic datasets. Paper presented at: 29th International Conference on Scientific and Statistical Database Management; June 27, 2017; Chicago, Illinois, United States. ACM; 2017: 1-5
  • 24 Snoke J, Slavković A. pMSE mechanism: differentially private synthetic data with maximal distributional similarity. arXiv preprint 2018. Accessed October 5, 2022 at: https://arxiv.org/abs/1805.09392
  • 25 Chen D, Orekondy T, Fritz M. GS-WGAN: a gradient-sanitized approach for learning differentially private generators. Adv Neural Inf Process Syst 2020; 33: 12673-12684
  • 26 McKenna R, Miklau G, Sheldon D. Winning the NIST Contest: a scalable and general approach to differentially private synthetic data. J Priv Confid 2021; 11 (03). DOI: 10.29012/jpc.778
  • 27 Nachar N. The Mann-Whitney U: a test for assessing whether two independent samples come from the same distribution. Tutor Quant Methods Psychol 2008; 4 (01) 13-20
  • 28 Zar JH. Biostatistical Analysis. 5th ed. New Jersey: Pearson Prentice Hall; 2010
  • 29 Okeh UM. Statistical analysis of the application of Wilcoxon and Mann-Whitney U test in medical research studies. Biotechnol Mol Biol Rev 2009; 4 (06) 128-131
  • 30 Fay MP, Proschan MA. Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules. Stat Surv 2010; 4: 1-39
  • 31 Kim TK. T test as a parametric statistic. Korean J Anesthesiol 2015; 68 (06) 540-546
  • 32 Conover WJ. Practical Nonparametric Statistics. New York, NY: Wiley; 1999
  • 33 McHugh ML. The chi-square test of independence. Biochem Med (Zagreb) 2013; 23 (02) 143-149
  • 34 Casella G, Berger RL. Statistical Inference. 2nd ed. Pacific Grove, CA, USA: Duxbury Press; 2002
  • 35 Arnold C, Neunhoeffer M. Really useful synthetic data–a framework to evaluate the quality of differentially private synthetic data. arXiv preprint 2020. Accessed May 19, 2023 at: https://arxiv.org/abs/2004.07740v2
  • 36 Abadi M, McMahan HB, Chu A. et al. Deep learning with differential privacy. Paper presented at: 2016 ACM SIGSAC Conference on Computer and Communications Security; October 24, 2016; Vienna, Austria. ACM; 2016: 308-318
  • 37 Abay NC, Zhou Y, Kantarcioglu M, Thuraisingham B, Sweeney L. Privacy preserving synthetic data release using deep learning. Paper presented at: European Conference on Machine Learning and Knowledge Discovery in Databases; September 10, 2018; Dublin, Ireland. Springer; 2018: 510-526
  • 38 Jordon J, Yoon J, Van Der Schaar M. PATE-GAN: Generating synthetic data with differential privacy guarantees. Paper presented at: International Conference on Learning Representations; May 6, 2019; New Orleans, Louisiana. ICLR; 2019
  • 39 Bowen CM, Snoke J. Comparative study of differentially private synthetic data algorithms from the NIST PSCR differential privacy synthetic data challenge. J Priv Confid 2021; 11 (01). DOI: 10.29012/jpc.748
  • 40 Bowen CM, Liu F. Comparative study of differentially private data synthesis methods. Stat Sci 2020; 35 (02) 280-307
  • 41 Cai K, Lei X, Wei J, Xiao X. Data synthesis via differentially private markov random fields. Proc VLDB Endow 2021; 14 (11) 2190-2202
  • 42 Zhang J, Cormode G, Procopiuc CM, Srivastava D, Xiao X. Privbayes: Private data release via bayesian networks. ACM Trans Database Syst (TODS) 2017; 42 (04) 1-41
  • 43 McKenna R, Mullins B, Sheldon D, Miklau G. AIM: an adaptive and iterative mechanism for differentially private synthetic data. Proc VLDB Endow 2022; 15 (11) 2599-2612
  • 44 Wang T, Yang X, Ren X, Yu W, Yang S. Locally private high-dimensional crowdsourced data release based on copula functions. IEEE Trans Serv Comput 2022; 15 (02) 778-792
  • 45 Ren X, Yu CM, Yu W. et al. LoPub: high-dimensional crowdsourced data publication with local differential privacy. IEEE Trans Inf Forensics Security 2018; 13 (09) 2151-2166
  • 46 Chen R, Li H, Qin AK, Kasiviswanathan SP, Jin H. Private spatial data aggregation in the local setting. Paper presented at: 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016; 2016: 289-300
  • 47 Goodfellow I, Pouget-Abadie J, Mirza M. et al. Generative adversarial networks. Commun ACM 2020; 63 (11) 139-144
  • 48 Goodfellow I. NIPS 2016 tutorial: generative adversarial networks. arXiv preprint 2016. Accessed May 19, 2023 at: https://arxiv.org/abs/1701.00160v4
  • 49 Wilcoxon F. Individual comparisons by ranking methods. Biom Bull 1945; 1 (06) 80-83
  • 50 Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 1947; 18 (01) 50-60
  • 51 Freidlin B, Gastwirth JL. Should the median test be retired from general use? Am Stat 2000; 54 (03) 161-164
  • 52 Jambor I, Boström PJ, Taimen P. et al. Novel biparametric MRI and targeted biopsy improves risk stratification in men with a clinical suspicion of prostate cancer (IMPROD Trial). J Magn Reson Imaging 2017; 46 (04) 1089-1095
  • 53 Jambor I, Verho J, Ettala O. et al. Validation of IMPROD biparametric MRI in men with clinically suspected prostate cancer: a prospective multi-institutional trial. PLoS Med 2019; 16 (06) e1002813
  • 54 Stamey TA, Yang N, Hay AR, McNeal JE, Freiha FS, Redwine E. Prostate-specific antigen as a serum marker for adenocarcinoma of the prostate. N Engl J Med 1987; 317 (15) 909-916
  • 55 Catalona WJ, Smith DS, Ratliff TL. et al. Measurement of prostate-specific antigen in serum as a screening test for prostate cancer. N Engl J Med 1991; 324 (17) 1156-1161
  • 56 Ulianova S. Cardiovascular Disease dataset | Kaggle; 2019. Accessed October 12, 2022 at: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset
  • 57 Larsson SC, Bäck M, Rees JMB, Mason AM, Burgess S. Body mass index and body composition in relation to 14 cardiovascular conditions in UK Biobank: a Mendelian randomization study. Eur Heart J 2020; 41 (02) 221-226
  • 58 Patki N, Wedge R, Veeramachaneni K. The synthetic data vault. Paper presented at: Proceedings - 3rd IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016; 2016: 399-410
  • 59 Virtanen P, Gommers R, Oliphant TE. et al; SciPy 1.0 Contributors. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 2020; 17 (03) 261-272
  • 60 Canonne CL, Kamath G, Steinke T. The discrete Gaussian for differential privacy. J Priv Confid 2022; 12 (01). DOI: 10.29012/jpc.784
  • 61 McKenna R, Miklau G, Sheldon D. Private-PGM. GitHub 2021. Accessed April 8, 2022 at: https://github.com/ryan112358/private-pgm
  • 62 Hardt M, Ligett K, McSherry F. Private Multiplicative Weights (MWEM). GitHub 2020. Accessed November 7, 2022 at: https://github.com/mrtzh/PrivateMultiplicativeWeights.jl
  • 63 Chen D. GS-WGAN. GitHub 2020. Accessed October 8, 2022 at: https://github.com/DingfanChen/GS-WGAN
  • 64 Paszke A, Gross S, Massa F. et al. PyTorch: an imperative style, high-performance deep learning library. Paper presented at: Proceedings of the 33rd International Conference on Neural Information Processing Systems; December 8, 2019; Vancouver, Canada. Curran Associates Inc.; 2019: 8026-8037
  • 65 Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC. Improved training of Wasserstein GANs. Paper presented at: Proceedings of the 31st International Conference on Neural Information Processing Systems; December 4, 2017; Long Beach, California. Curran Associates Inc.; 2017: 5769-5779
  • 66 Charest A-S. How can we analyze differentially-private synthetic datasets?. J Priv Confid 2011; 2 (02) 21-33
  • 67 Charest A-S. Empirical evaluation of statistical inference from differentially-private contingency tables. Paper presented at: International Conference on Privacy in Statistical Databases; September 26, 2012; Palermo, Italy. Springer-Verlag; 2012: 257-272
  • 68 Giles O, Hosseini K, Mingas G. et al. Faking feature importance: a cautionary tale on the use of differentially-private synthetic data. arXiv preprint 2022. Accessed May 19, 2023 at: https://arxiv.org/abs/2203.01363
  • 69 Räisä O, Jälkö J, Kaski S, Honkela A. Noise-aware statistical inference with differentially private synthetic data. Paper presented at: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics; April 25, 2023; Valencia, Spain. PMLR; 2023: 3620-3643
  • 70 Su D, Cao J, Li N, Lyu M. PrivPfC: differentially private data publication for classification. VLDB J 2018; 27 (02) 201-223