Efficient Mining Multi-mers in a Variety of Biological Sequences

IEEE/ACM Trans Comput Biol Bioinform. 2018 Apr 19. doi: 10.1109/TCBB.2018.2828313. Online ahead of print.

Abstract

Counting the occurrence frequency of each k-mer in a biological sequence is a preliminary yet important step in many bioinformatics applications. However, most k-mer counting algorithms rely on a given k to produce single-length k-mers, which is inefficient when a sequence must be analyzed for several values of k. Moreover, existing k-mer counters focus more on DNA and RNA sequences and less on protein ones. In practice, the analysis of k-mers in protein sequences can provide substantial biological insights into structure, function, and evolution. To this end, an efficient algorithm, called MulMer (Multiple-Mer mining), is proposed to mine k-mers of various lengths, termed multi-mers, via an inverted-index technique, which is orders of magnitude faster than the conventional forward-index methods. Moreover, to the best of our knowledge, MulMer is the first algorithm able to mine multi-mers in a variety of sequences, including DNA, RNA, and protein sequences.
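
To make the idea of mining multi-mers through an inverted index more concrete, the following is a minimal Python sketch. It is not the MulMer algorithm, whose details the abstract does not give. It only illustrates the general technique under stated assumptions: the index maps each k-mer to the list of its start positions, and the index for length k is derived from the index for length k - 1 rather than by rescanning the sequence. The function name `mine_multimers` and its parameters `k_min` and `k_max` are hypothetical.

```python
from collections import defaultdict


def mine_multimers(seq, k_min, k_max):
    """Return {k: {kmer: [start positions]}} for k_min <= k <= k_max.

    Illustrative sketch only; not the published MulMer implementation.
    """
    # Build the inverted index for the shortest length by one pass over seq.
    base = defaultdict(list)
    for i in range(len(seq) - k_min + 1):
        base[seq[i:i + k_min]].append(i)
    result = {k_min: dict(base)}

    # Derive each longer length from the previous index: every occurrence
    # of a (k-1)-mer at position p yields a k-mer at p if one more
    # character is available.
    for k in range(k_min + 1, k_max + 1):
        longer = defaultdict(list)
        for kmer, positions in result[k - 1].items():
            for p in positions:
                if p + k <= len(seq):
                    longer[kmer + seq[p + k - 1]].append(p)
        result[k] = dict(longer)
    return result


def counts(index):
    """Collapse a {kmer: positions} index into {kmer: frequency}."""
    return {kmer: len(positions) for kmer, positions in index.items()}
```

Because the sketch treats the sequence as a plain string, the same code applies to any alphabet, for example `mine_multimers("ACGACG", 2, 3)` for DNA or `mine_multimers("MKMK", 2, 2)` for a protein fragment. Frequencies for a given k are obtained with `counts(result[k])`.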