Inference of the Biogeographical Ancestry (BGA) of a person or trace relies on three ingredients: (1) a reference database of DNA samples that includes BGA information; (2) a statistical clustering method; and (3) a set of loci that segregate according to geographical location, i.e. a set of so-called Ancestry Informative Markers (AIMs). We used the theory of feature selection from statistical learning to obtain AIMsets for BGA inference. Using simulations, we show that this learning procedure works in various cases and outperforms ad hoc methods that choose AIMs based on statistics such as FST or informativeness. Applying our method to data from the 1000 Genomes Project (excluding Admixed Americans), we identified an AIMset of 12 SNPs that gives a vanishing misclassification error on a continental scale, as do other published AIMsets. In fact, cross-validation shows that there is a multitude of sets whose performance is comparable to that of the optimal AIMset. On a sub-continental scale, we found a set of 55 SNPs for distinguishing the five European populations. With this set, the misclassification error is reduced by a factor of two relative to published AIMsets, but it is still 30% and therefore too large to be useful in forensic applications.
Keywords: 1000 genomes dataset; Ancestry Informative Markers; Biogeographical Ancestry; Classification; Coalescent simulation.
Copyright © 2020 Elsevier B.V. All rights reserved.
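For readers unfamiliar with the ad hoc baseline mentioned in the abstract, the following minimal sketch ranks SNPs by a per-locus FST-type statistic and keeps the top-scoring ones. It is not the feature-selection procedure of the paper. The estimator (a heterozygosity-based FST for a biallelic SNP with equally weighted populations), the function names `fst_per_snp` and `rank_snps_by_fst`, and the toy allele frequencies are all assumptions made for illustration; the paper may use a different estimator.

```python
def fst_per_snp(freqs):
    """Heterozygosity-based FST for one biallelic SNP (illustrative assumption).

    freqs: allele frequencies of the same allele, one value per population.
    Populations are weighted equally.
    """
    k = len(freqs)
    p_bar = sum(freqs) / k
    # Expected heterozygosity in the pooled population.
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        # SNP is monomorphic across all populations: no information.
        return 0.0
    # Mean expected heterozygosity within populations.
    h_within = sum(2 * p * (1 - p) for p in freqs) / k
    return (h_total - h_within) / h_total


def rank_snps_by_fst(freq_table, n_top):
    """Return the n_top SNP names with the largest FST, best first."""
    scores = {snp: fst_per_snp(freqs) for snp, freqs in freq_table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_top]


# Hypothetical allele frequencies in three populations (not real data).
toy_frequencies = {
    "snp_a": [0.1, 0.9, 0.5],
    "snp_b": [0.5, 0.5, 0.5],
    "snp_c": [0.0, 1.0, 0.0],
}

selected = rank_snps_by_fst(toy_frequencies, n_top=2)
```

A ranking of this kind scores each SNP in isolation, so it can select markers that carry redundant information. This is one plausible reason why a selection procedure that evaluates sets of markers jointly could do better.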