Genetic variant pathogenicity prediction trained using disease-specific clinical sequencing data sets

Perry Evans; Chao Wu; Amanda Lindy; Dianalee A McKnight; Matthew Lebo; Mahdi Sarmady; Ahmad N Abou Tayoun

doi:10.1101/gr.240994.118

Genome Res. 2019 Jul;29(7):1144-1151. doi: 10.1101/gr.240994.118. Epub 2019 Jun 24.

Authors

Perry Evans¹, Chao Wu², Amanda Lindy³, Dianalee A McKnight³, Matthew Lebo⁴ ⁵, Mahdi Sarmady² ⁶, Ahmad N Abou Tayoun² ⁶ ⁷

Affiliations

¹ Department of Biomedical and Health Informatics, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA.
² Division of Genomic Diagnostics, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA.
³ GeneDx, Gaithersburg, Maryland 20877, USA.
⁴ Laboratory for Molecular Medicine, Partners HealthCare Personalized Medicine, Cambridge, Massachusetts 02139, USA.
⁵ Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts 02115, USA.
⁶ Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania 19104, USA.
⁷ Al Jalila Children's Specialty Hospital, Dubai, United Arab Emirates.

Abstract

Recent advances in DNA sequencing have expanded our understanding of the molecular basis of genetic disorders and increased the utilization of clinical genomic tests. Given the paucity of evidence to accurately classify each variant and the difficulty of experimentally evaluating its clinical significance, a large number of variants generated by clinical tests are reported as variants of unknown clinical significance. Population-scale variant databases can improve clinical interpretation. Specifically, pathogenicity prediction for novel missense variants can use features describing regional variant constraint. Constrained genomic regions are those that have an unusually low variant count in the general population. Computational methods have been introduced to capture these regions and incorporate them into pathogenicity classifiers, but these methods have yet to be compared on an independent clinical variant data set. Here, we introduce a variant data set derived from clinical sequencing panels and use it to compare the ability of different genomic constraint metrics to determine missense variant pathogenicity. This data set is compiled from 17,071 patients surveyed with clinical genomic sequencing for cardiomyopathy, epilepsy, or RASopathies. We further use this data set to demonstrate the necessity of disease-specific classifiers and to train PathoPredictor, a disease-specific ensemble classifier of pathogenicity based on regional constraint and variant-level features. PathoPredictor achieves an average precision >90% for variants from all 99 tested disease genes while approaching 100% accuracy for some genes. The accumulation of larger clinical variant training data sets can significantly enhance the performance of such classifiers in a disease- and gene-specific manner.
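The abstract reports classifier performance as average precision, evaluated in a disease-specific manner. As a minimal sketch of how that metric can be computed per disease panel, the Python below ranks variants by predicted score and averages the precision at each rank that holds a pathogenic variant. It is an illustration only, not the authors' evaluation code: the function names, the record layout `(disease, label, score)`, and the toy values are all assumptions made up for this example, and tied scores are simply kept in input order.

```python
from collections import defaultdict


def average_precision(labels, scores):
    """Mean of precision@k over the ranks k that hold a pathogenic (label 1) variant.

    Variants are ranked by descending score; ties keep their input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("average precision needs at least one pathogenic variant")
    return precision_sum / hits


def average_precision_by_disease(records):
    """Group (disease, label, score) records and score each disease separately."""
    grouped = defaultdict(lambda: ([], []))
    for disease, label, score in records:
        grouped[disease][0].append(label)
        grouped[disease][1].append(score)
    return {
        disease: average_precision(labels, scores)
        for disease, (labels, scores) in grouped.items()
    }


# Hypothetical toy data: label 1 = pathogenic, 0 = benign; scores are invented.
toy_records = [
    ("cardiomyopathy", 1, 0.95),
    ("cardiomyopathy", 0, 0.80),
    ("cardiomyopathy", 1, 0.70),
    ("cardiomyopathy", 0, 0.10),
    ("epilepsy", 1, 0.90),
    ("epilepsy", 1, 0.60),
    ("epilepsy", 0, 0.30),
]

results = average_precision_by_disease(toy_records)
for disease, ap in sorted(results.items()):
    print(f"{disease}: average precision = {ap:.3f}")
```

Scoring each disease panel on its own, rather than pooling all variants, makes it visible when a classifier ranks variants well for one disease but poorly for another, which is the kind of difference the disease-specific comparison in the abstract is concerned with.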

Publication types

Evaluation Study

MeSH terms

Cardiomyopathies / genetics*
Datasets as Topic*
Epilepsy / genetics*
Genetic Variation*
Humans
Mutation, Missense
ras Proteins / genetics*

Substances

ras Proteins