De novo mutations (DNMs), and among them clustered DNMs within 20 bp of each other (cDNMs), are known to be a potential cause of genetic disorders. However, identifying DNMs in whole genome sequencing (WGS) data often suffers from low specificity. We propose a deep learning framework for DNM and cDNM detection in WGS data based on Google's DeepTrio variant-calling software, which considers regions of 110 bp upstream and downstream of each candidate variant to take information from the surrounding sequence into account. We trained one model each for the DNM and cDNM detection tasks and tested them on data generated on the HiSeq and NovaSeq platforms. In total, training used 82 WGS trios generated on the NovaSeq and 16 generated on the HiSeq. For the DNM detection task, our model achieves a sensitivity of 95.7% and a precision of 89.6%. The extended model adds confidence information for cDNMs in addition to standard variant classes and DNMs. While this lowers DNM sensitivity slightly (91.96%, at a precision of 90.5%), on HG002 all cDNMs (5 out of 5) can be isolated from other variant classes, with a precision of 76.9%. Because the model emits a confidence probability for each variant class, users can tune cutoff thresholds to select a desired trade-off between sensitivity and specificity. These results show that DeepTrio can be retrained to identify complex mutational signatures with little modification effort.
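The threshold trade-off described above can be illustrated with a minimal sketch. It assumes that each candidate variant carries a single DNM confidence probability and a known truth label. The function names, the toy probabilities and the toy labels are hypothetical and are not taken from DeepTrio or from the data reported here.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy data: per-candidate DNM probabilities and truth labels.
TOY_PROBS: List[float] = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
TOY_LABELS: List[bool] = [True, True, False, True, False, False]


def sensitivity_precision(
    probs: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Call a candidate a DNM when its probability is >= threshold,
    then return (sensitivity, precision) against the truth labels."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and not y)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


def sweep(
    probs: Sequence[float], labels: Sequence[bool], thresholds: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, sensitivity, precision) for each cutoff."""
    return [(t, *sensitivity_precision(probs, labels, t)) for t in thresholds]


if __name__ == "__main__":
    for t, sens, prec in sweep(TOY_PROBS, TOY_LABELS, [0.35, 0.5, 0.7]):
        print(f"threshold={t:.2f} sensitivity={sens:.3f} precision={prec:.3f}")
```

On the toy data, raising the cutoff from 0.35 to 0.7 removes the one false positive but also loses one true DNM, which is the kind of trade-off a user would weigh when choosing a threshold.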
© The Author(s) 2024. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.