PopCluster: an algorithm to identify genetic variants with ethnicity-dependent effects

Anastasia Gurinovich; Harold Bae; John J Farrell; Stacy L Andersen; Stefano Monti; Annibale Puca; Gil Atzmon; Nir Barzilai; Thomas T Perls; Paola Sebastiani

doi:10.1093/bioinformatics/btz017

Bioinformatics. 2019 Sep 1;35(17):3046-3054. doi: 10.1093/bioinformatics/btz017.

Affiliations

¹ Bioinformatics Program, Boston University, Boston, MA, USA.
² College of Public Health and Human Sciences, Oregon State University, Corvallis, OR, USA.
³ Department of Medicine, Boston University School of Medicine, Boston, MA, USA.
⁴ Department of Medicine and Surgery, University of Salerno, Fisciano, Italy.
⁵ Cardiovascular Research Unit, IRCCS MultiMedica, Sesto San Giovanni, Italy.
⁶ Department of Medicine and Department of Genetics, Albert Einstein College of Medicine, Bronx, NY, USA.
⁷ Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA.

Abstract

Motivation: Over the last decade, more diverse populations have been included in genome-wide association studies. If a genetic variant has a varying effect on a phenotype in different populations, genome-wide association studies applied to a dataset as a whole may not pinpoint such differences. It is especially important to be able to identify population-specific effects of genetic variants in studies that would eventually lead to development of diagnostic tests or drug discovery.

Results: In this paper, we propose PopCluster: an algorithm to automatically discover subsets of individuals in which the genetic effects of a variant are statistically different. PopCluster provides a simple framework to directly analyze genotype data without prior knowledge of subjects' ethnicities. PopCluster combines logistic regression modeling, principal component analysis, hierarchical clustering and a recursive bottom-up tree parsing procedure. The evaluation of PopCluster suggests that the algorithm has a stable low false positive rate (∼4%) and high true positive rate (>80%) in simulations with large differences in allele frequencies between cases and controls. Application of PopCluster to data from genetic studies of longevity discovers ethnicity-dependent heterogeneity in the association of rs3764814 (USP42) with the phenotype.
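To make the combination of steps concrete, the following is a minimal Python sketch of the general idea, not the PopCluster implementation (which, per the abstract, uses R, PLINK and Eigensoft). It rests on several assumptions that are not in the source: the data are simulated for two hypothetical populations, the dendrogram is cut once into two clusters rather than parsed recursively from the bottom up, and an allelic odds-ratio Wald test stands in for the logistic regression models. All function names and parameter values are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

def simulate(n_per_pop=500, n_markers=200):
    """Two hypothetical populations; the target variant raises risk only in the first."""
    freqs = rng.uniform(0.1, 0.9, size=(2, n_markers))   # per-population allele frequencies
    pop = np.repeat([0, 1], n_per_pop)
    markers = rng.binomial(2, freqs[pop])                 # genotypes coded 0/1/2
    target = rng.binomial(2, 0.3, size=pop.size)          # variant under test
    log_odds = np.where(pop == 0, -1.0 + 1.2 * target, -0.5)
    status = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))  # 1 = case, 0 = control
    return markers, target, status, pop

def top_pcs(genotypes, k=2):
    """Principal component scores from an SVD of the column-centered genotype matrix."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]

def allelic_log_or(target, status):
    """Allelic log odds ratio (cases vs controls) and its standard error.

    Adds 0.5 to each cell of the 2x2 allele table to avoid division by zero.
    """
    case, ctrl = target[status == 1], target[status == 0]
    a = case.sum() + 0.5                      # tested alleles in cases
    b = 2 * case.size - case.sum() + 0.5      # other alleles in cases
    c = ctrl.sum() + 0.5                      # tested alleles in controls
    d = 2 * ctrl.size - ctrl.sum() + 0.5      # other alleles in controls
    return np.log(a * d / (b * c)), np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

def heterogeneity_z(target, status, labels):
    """Wald z statistic for a difference in log odds ratio between clusters 1 and 2."""
    (b1, se1), (b2, se2) = (
        allelic_log_or(target[labels == g], status[labels == g]) for g in (1, 2)
    )
    return (b1 - b2) / np.sqrt(se1 ** 2 + se2 ** 2)

markers, target, status, pop = simulate()
pcs = top_pcs(markers)

# Ward hierarchical clustering on the PC scores, cut into two clusters (labels 1 and 2).
labels = fcluster(linkage(pcs, method="ward"), t=2, criterion="maxclust")

# Fraction of individuals whose cluster matches their simulated population.
agreement = np.mean((labels == 1) == (pop == 0))
purity = max(agreement, 1 - agreement)

z = heterogeneity_z(target, status, labels)
print(f"cluster purity: {purity:.3f}, heterogeneity z: {z:.2f}")
```

The clustering step never sees the population labels; they are used only afterwards to check how well the clusters recover the simulated groups. A large absolute z value indicates that the variant's effect differs between the two discovered clusters, which is the kind of signal a pooled analysis can miss.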

Availability and implementation: PopCluster was implemented using the R programming language, PLINK and Eigensoft software, and can be found at the following GitHub repository: https://github.com/gurinovich/PopCluster with instructions on its installation and usage.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Ethnicity*
Genome-Wide Association Study*
Humans
Programming Languages
Software
Thiolester Hydrolases

Substances

USP42 protein, human
Thiolester Hydrolases
