We developed a variant-annotation method that combines sequence-based machine-learning classification with a context-dependent algorithm for selecting splice variants. Our approach is distinctive in that it compares the splice potential of a sequence bearing a variant with the splice potential of the reference sequence. After training, classification accurately identified 168 of 180 (93.3%) canonical splice sites of five genes. The combined method, CryptSplice, identified and correctly predicted the effect of 18 of 21 (86%) known splice-altering variants in CFTR, a well-studied gene whose loss-of-function variants cause cystic fibrosis (CF). Among 1,423 unannotated CFTR disease-associated variants, the method identified 32 potential exonic cryptic splice variants, two of which were experimentally evaluated and confirmed. After complete CFTR sequencing, the method found three cryptic intronic splice variants (one known and two experimentally verified) that completed the molecular diagnosis of CF in 6 of 14 individuals. CryptSplice interrogation of sequence data from six individuals with X-linked dyskeratosis congenita caused by an unknown disease-causing variant in DKC1 identified two splice-altering variants that were experimentally verified. To assess the extent to which disease-associated variants might activate cryptic splicing, we selected 458 pathogenic variants and 348 variants of uncertain significance (VUSs) classified as high confidence from ClinVar. Splice-site activation was predicted for 129 (28%) of the pathogenic variants and 75 (22%) of the VUSs. Our findings suggest that cryptic splice-site activation is more common than previously thought and should be routinely considered for all variants within the transcribed regions of genes.
Keywords: cryptic splicing; cystic fibrosis; machine learning; minigene; pseudoexon; splice acceptor; splice donor; splice variant; splicing.
Copyright © 2017 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.