MISSH: Fast Hashing of Multiple Spaced Seeds

IEEE/ACM Trans Comput Biol Bioinform. 2024 Sep 25:PP. doi: 10.1109/TCBB.2024.3467368. Online ahead of print.

Abstract

Alignment-free sequence analysis has revolutionized the high-throughput processing of sequencing data in numerous bioinformatics pipelines. Hashing k-mers is a common operation across alignment-free applications, serving as a crucial tool for indexing, querying, and rapid similarity searching. More recently, spaced seeds, specialized patterns that accommodate errors or mutations, have become a standard choice over traditional k-mers, offering enhanced sensitivity in many applications. However, hashing spaced seeds significantly increases computational time. Using multiple spaced seeds improves accuracy further, at the expense of even longer processing times. This paper addresses the challenge of efficiently hashing multiple spaced seeds. The proposed algorithms exploit the similarity of the hash values of adjacent spaced seeds within an input sequence, allowing subsequent hashes to be computed quickly. Our experimental results, conducted across various tests, show a substantial performance improvement over previously proposed algorithms, with speedups of up to 20 times. Additionally, we apply these efficient spaced seed hashing algorithms to a metagenomic application, specifically the classification of reads using Clark-S [Ounit and Lonardi, 2016]. Our findings reveal a substantial speedup, effectively mitigating the slowdown caused by the use of multiple spaced seeds.
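To make the idea of reusing adjacent hash values concrete, the following is a minimal illustrative sketch and not the MISSH implementation. It assumes a 2-bit nucleotide encoding, a seed written as a string of '1' (care) and '0' (don't care) positions, and a hash formed by concatenating the encoded symbols at the care positions. The function names (`naive_hashes`, `reuse_hashes`) and the restriction to the immediately preceding window are assumptions made for illustration only.

```python
# Hypothetical sketch: hashing a spaced seed over a DNA sequence.
# Assumption: the hash is the concatenation of 2-bit codes at care positions.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def care_positions(mask):
    """Offsets of the '1' (care) positions in the seed mask."""
    return [i for i, c in enumerate(mask) if c == "1"]


def naive_hashes(seq, mask):
    """Recompute every hash from scratch, one symbol lookup per care position."""
    pos = care_positions(mask)
    out = []
    for i in range(len(seq) - len(mask) + 1):
        h = 0
        for rank, p in enumerate(pos):
            h |= ENC[seq[i + p]] << (2 * rank)
        out.append(h)
    return out


def reuse_hashes(seq, mask):
    """Copy symbols already encoded in the previous window's hash.

    The symbol at care offset p of window i sits at offset p + 1 of
    window i - 1. If p + 1 is also a care position, its 2-bit code is
    already stored in the previous hash and need not be re-encoded.
    """
    pos = care_positions(mask)
    rank_of = {p: r for r, p in enumerate(pos)}
    # (rank in new hash, rank in previous hash) for recoverable symbols
    reusable = [(rank_of[p], rank_of[p + 1]) for p in pos if p + 1 in rank_of]
    # (rank in new hash, offset in window) for symbols read from the sequence
    fresh = [(rank_of[p], p) for p in pos if p + 1 not in rank_of]

    out = []
    prev = None
    for i in range(len(seq) - len(mask) + 1):
        h = 0
        if prev is None:
            # First window: nothing to reuse, encode every care position.
            for rank, p in enumerate(pos):
                h |= ENC[seq[i + p]] << (2 * rank)
        else:
            for new_rank, old_rank in reusable:
                h |= ((prev >> (2 * old_rank)) & 3) << (2 * new_rank)
            for rank, p in fresh:
                h |= ENC[seq[i + p]] << (2 * rank)
        out.append(h)
        prev = h
    return out


def hash_multiple_seeds(seq, masks):
    """Apply the same procedure independently to each seed in a set."""
    return {mask: reuse_hashes(seq, mask) for mask in masks}
```

Both functions return identical hash lists; the second only changes where each symbol comes from. This sketch copies one symbol at a time for readability, so it is not itself faster than the naive loop. A practical implementation would presumably move whole groups of bits with shifts and masks and could draw on more than one earlier window.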