Efficient algorithms for the discovery of DNA oligonucleotide barcodes from sequence databases

M Zahariev; V Dahl; W Chen; C A Lévesque

doi:10.1111/j.1755-0998.2009.02651.x

Mol Ecol Resour. 2009 May;9 Suppl s1:58-64. doi: 10.1111/j.1755-0998.2009.02651.x.

Authors

M Zahariev¹, V Dahl, W Chen, C A Lévesque

Affiliation

¹ School of Computing Science, Simon Fraser University, 8888 University Drive, Burnaby, BC, Canada V5A 1S6; Agriculture & Agri-Food Canada, Ottawa, ON, Canada K1A 0C6; Department of Biology, Carleton University, Ottawa, Ontario, Canada K1S 5B6.

PMID: 21564965
DOI: 10.1111/j.1755-0998.2009.02651.x

Abstract

Efficient design of barcode oligonucleotides can lead to significant cost reductions in the manufacturing of DNA arrays. Previous methods rely on a preliminary alignment, which reduces their efficiency for intron-rich regions; on a brute-force approach, which is not feasible for large-scale problems; or on data structures with very poor worst-case performance. One of the algorithms we propose uses 'oligonucleotide sorting' to discover oligonucleotide barcodes of given sizes, with good asymptotic performance. For each individual sequence, it finds specific barcode oligonucleotides that have at least one base difference from the other sequences in the database. With another algorithm, specific oligonucleotides can also be found for groups or clades in the database: these have 100% homology across all sequences within the group or clade while differing from the rest of the data. By re-organizing the sequences or groups in the database, oligonucleotides for different hierarchical levels can be found. The oligonucleotides or polymorphism locations identified as species- or clade-specific by the new algorithm are then refined and screened for hybridization thermodynamic properties with third-party software.
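The abstract does not spell out the algorithms, so the following is only a minimal sketch of one way a sorting-based search could work, not the authors' implementation. It assumes that a "specific" oligonucleotide of size k is a k-mer that occurs exactly in the target sequence (or in every member of a target group) and in no other sequence. All names (`sorted_kmer_runs`, `sequence_barcodes`, `group_barcodes`) and the toy sequences are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sorted_kmer_runs(seqs, k):
    """Sort all (k-mer, sequence name) pairs and yield each distinct k-mer
    together with the set of sequences that contain it."""
    tagged = [
        (seq[i:i + k], name)
        for name, seq in seqs.items()
        for i in range(len(seq) - k + 1)
    ]
    tagged.sort()  # identical k-mers become adjacent after sorting
    for kmer, run in groupby(tagged, key=itemgetter(0)):
        yield kmer, {name for _, name in run}


def sequence_barcodes(seqs, k):
    """k-mers found in exactly one sequence (assumed meaning of 'specific')."""
    barcodes = {name: [] for name in seqs}
    for kmer, owners in sorted_kmer_runs(seqs, k):
        if len(owners) == 1:
            (name,) = owners
            barcodes[name].append(kmer)
    return barcodes


def group_barcodes(seqs, groups, k):
    """k-mers present in every member of a group and in no other sequence."""
    by_members = {frozenset(members): g for g, members in groups.items()}
    result = {g: [] for g in groups}
    for kmer, owners in sorted_kmer_runs(seqs, k):
        group = by_members.get(frozenset(owners))
        if group is not None:
            result[group].append(kmer)
    return result


# Toy data, invented for illustration only.
seqs = {"sp1": "ACGTAC", "sp2": "ACGTTC", "sp3": "GGGTTC"}
groups = {"cladeA": ["sp1", "sp2"], "cladeB": ["sp2", "sp3"]}

print(sequence_barcodes(seqs, 4))
# {'sp1': ['CGTA', 'GTAC'], 'sp2': ['CGTT'], 'sp3': ['GGGT', 'GGTT']}
print(group_barcodes(seqs, groups, 4))
# {'cladeA': ['ACGT'], 'cladeB': ['GTTC']}
```

In this sketch the cost is dominated by sorting the k-mers, and a single linear scan over the sorted list then classifies every k-mer. Regrouping the sequences (for example by genus instead of species) and rerunning `group_barcodes` would give candidates for a different hierarchical level; thermodynamic screening is outside its scope.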