Ancestry-informative markers (AIMs) can be used to infer an individual's ancestry and thereby reduce the inaccuracy of self-reported ethnicity in biomedical research. In this study, we describe three methods for selecting AIM SNPs for the Malay population (Malay AIM panels), based on pairwise FST, informativeness for assignment (In), and PCA-correlated SNPs (PCAIMs). These Malay AIM panels were extracted from SNP array genotype data hosted by the Malaysian node of the Human Variome Project (MyHVP) and the Singapore Genome Variation Project (SGVP). Genotype data from a total of 165 Malay individuals were analyzed: 117 individuals genotyped on the Affymetrix SNP-6 array platform and 48 individuals genotyped on the Illumina OMNI 2.5 array platform. The HapMap phase 3 database (1397 individuals from 11 populations) was used as a reference for comparison with the Malay genotype data. The accuracy of each resulting Malay AIM panel was evaluated using a machine learning "ancestry-predictive model" constructed with WEKA, a comprehensive machine learning platform written in Java. A total of 1250 SNPs were finally selected; these distinguished Malay individuals from other world populations with an accuracy of 90%. The accuracy decreased to 80% when 157 SNPs were used according to the pairwise FST method, while a panel of 200 SNPs selected using In and PCAIMs identified Malay individuals with an accuracy of approximately 80%.
Keywords: Admixture; Ancestry; Ancestry-informative markers; Malay; Population; SNP.
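
As an illustration of one of the three selection criteria named in the abstract, the following minimal sketch computes informativeness for assignment (In) for a biallelic SNP and ranks SNPs by it. It assumes the standard definition of In by Rosenberg et al. (2003) with equally weighted populations; the abstract does not state the exact formula or software used in the study, so this should not be read as the authors' implementation. The function names, SNP identifiers, and allele frequencies are hypothetical.

```python
import math


def informativeness_for_assignment(ref_allele_freqs):
    """Return In for one biallelic SNP.

    ref_allele_freqs holds the reference-allele frequency in each of K
    populations. Populations are weighted equally, as in the standard
    definition (an assumption; the abstract does not give the formula).
    """
    k = len(ref_allele_freqs)
    if k < 2:
        raise ValueError("at least two populations are required")

    def xlogx(x):
        # By convention 0 * ln(0) is taken to be 0.
        return x * math.log(x) if x > 0.0 else 0.0

    total = 0.0
    # Sum over both alleles: the reference allele and its complement.
    for allele_freqs in (ref_allele_freqs, [1.0 - p for p in ref_allele_freqs]):
        mean_freq = sum(allele_freqs) / k
        total += -xlogx(mean_freq) + sum(xlogx(p) for p in allele_freqs) / k
    return total


def top_snps(freq_table, n):
    """Return the n SNP identifiers with the highest In, best first."""
    ranked = sorted(
        freq_table,
        key=lambda snp: informativeness_for_assignment(freq_table[snp]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical frequencies in two populations (illustrative values, not study data).
example_freqs = {
    "snp_fixed_difference": [1.0, 0.0],
    "snp_no_difference": [0.5, 0.5],
    "snp_moderate_difference": [0.9, 0.2],
}
panel = top_snps(example_freqs, 2)
```

For two populations, In ranges from 0 (identical allele frequencies) to ln 2 (a fixed difference), so ranking by In favors SNPs whose frequencies differ most between the populations being compared.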