Identification of individuals by trait prediction using whole-genome sequencing data

Proc Natl Acad Sci U S A. 2017 Sep 19;114(38):10166-10171. doi: 10.1073/pnas.1711125114. Epub 2017 Sep 5.

Abstract

Prediction of human physical traits and demographic information from genomic data challenges privacy and data deidentification in personalized medicine. To explore the current capabilities of phenotype-based genomic identification, we applied whole-genome sequencing, detailed phenotyping, and statistical modeling to predict biometric traits in a cohort of 1,061 participants of diverse ancestry. Taken individually, a large fraction of the traits can be predicted with only limited accuracy beyond what ancestry and demographic information provide. However, we have developed a maximum entropy algorithm that integrates multiple predictions to determine which genomic samples and phenotype measurements originate from the same person. Using this algorithm, we reidentified an average of >8 of 10 held-out individuals in an ethnically mixed cohort and an average of 5 of 10 when the held-out individuals were all African American or all European. This work challenges current conceptions of personal privacy and may have far-reaching ethical and legal implications.
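
The matching task described in the abstract (deciding which phenotype record belongs to which genomic sample within a small held-out pool) can be illustrated with a minimal sketch. This is not the authors' maximum entropy algorithm, whose details the abstract does not give. It substitutes a simple Gaussian log-likelihood score summed over traits, and every name, trait, and number below is a hypothetical toy value chosen for illustration only.

```python
# Toy illustration only: NOT the published maximum entropy algorithm.
# Assumption: each trait prediction has an independent Gaussian residual
# with a known standard deviation (residual_sd). All values are made up.

def match_score(predicted, observed, residual_sd):
    """Log-likelihood (up to a constant) that `observed` traits belong
    to the genome that produced the `predicted` traits."""
    return sum(
        -0.5 * ((o - p) / s) ** 2
        for p, o, s in zip(predicted, observed, residual_sd)
    )


def select_best(predicted_by_genome, observed_by_person, residual_sd):
    """For each genome, return the index of the phenotype record
    with the highest combined score across all traits."""
    picks = []
    for predicted in predicted_by_genome:
        scores = [
            match_score(predicted, observed, residual_sd)
            for observed in observed_by_person
        ]
        picks.append(max(range(len(scores)), key=scores.__getitem__))
    return picks


# Hypothetical pool of three people; traits are (height in cm, age in years).
predicted_by_genome = [[170.0, 35.0], [182.0, 52.0], [160.0, 24.0]]
observed_by_person = [[181.0, 49.0], [162.0, 27.0], [171.0, 38.0]]
residual_sd = [6.0, 8.0]  # assumed prediction error per trait

picks = select_best(predicted_by_genome, observed_by_person, residual_sd)
print(picks)  # [2, 0, 1]
```

The sketch shows why integrating several weak predictions can help: no single trait needs to be decisive, because the per-trait scores add up and the best combined score selects the match. Dividing each difference by its assumed residual standard deviation keeps an imprecisely predicted trait from dominating the total.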

Keywords: DNA phenotyping; genome sequencing; genomic privacy; phenotype prediction; reidentification.

MeSH terms

  • Adult
  • Age Factors
  • Algorithms
  • Body Size
  • Cohort Studies
  • Confidentiality*
  • DNA Fingerprinting*
  • Data Anonymization
  • Female
  • Humans
  • Male
  • Middle Aged
  • Models, Genetic*
  • Phenotype*
  • Pigmentation / genetics
  • Whole Genome Sequencing*
  • Young Adult