Current approaches for parsing true variation (i.e., signal) from noise broadly involve estimating a baseline value of the latter, below which all sequence data are ignored. In an effort to deliver a more objective criterion for setting such thresholds, a novel approach based on phylogenetic principles is presented here. Our method separates a special category of noise, namely nuclear insertions of mitochondrial DNA (Numts), from true mitochondrial genome data. This bioinformatic approach leverages the relationships among massively parallel sequencing reads and is capable of discovering putative Numts (pNumts) in the absence of a reference genome. The new method was tested on a whole mitochondrial genome dataset (n = 41 individuals from an admixed population sample from Rio de Janeiro) and led to the discovery of 451 pNumt variants. Comparison of these pNumt haplotypes against an existing Numt database revealed 147 exact matches to previously discovered Numts, while 122 haplotypes differed by only a single base pair and none matched the mitochondrial genome exclusively. In general, these sequences were considerably more divergent from the mitochondrial genome than from the sequences in the Numt database, supporting the interpretation that the novel pNumts are probably hitherto uncatalogued variants. Unlike previous techniques, our method appears to be able to detect both polymorphic and fixed Numt sequences. It was also found that the region containing the D-Loop and associated Promoters (DLP) in the human mitochondrial genome, which harbors markers of importance in forensic genetics, is the origin of several Numts. Though currently designed for the mitochondrial genome, our novel approach has the potential to be expanded to other scenarios that require distinguishing signal from noise, including the deconvolution of mixtures, thus significantly improving how analytical thresholds may be established.
Keywords: Analytical threshold; Bioinformatics; Massively parallel sequencing; Mitochondrial haplotypes; PCR errors; Phylogenetic networks; Randomized minimum spanning trees; pNumts.
Published by Elsevier B.V.