MTSv: rapid alignment-based taxonomic classification and high-confidence metagenomic analysis

PeerJ. 2022 Nov 8;10:e14292. doi: 10.7717/peerj.14292. eCollection 2022.

Abstract

As reference sequence databases and high-throughput sequencing datasets continue to grow, it is becoming computationally infeasible to use traditional alignment against large genome databases for taxonomic classification of metagenomic reads. Exact matching approaches can rapidly assign taxonomy and summarize the composition of microbial communities, but they sacrifice accuracy and can lead to false positives. Full alignment tools provide higher-confidence assignments and can assign sequences from genomes that diverge from reference sequences; however, they are computationally intensive. To address this, we designed MTSv specifically for alignment-based taxonomic assignment in metagenomic analysis. This tool implements an FM-index-assisted q-gram filter and a SIMD-accelerated Smith-Waterman algorithm to find alignments. However, unlike traditional aligners, MTSv will not attempt to make additional alignments to a TaxID once an alignment of sufficient quality has been found. This improves efficiency when many reference sequences are available per taxon. MTSv was designed to be flexible and can be modified to run on either memory-constrained or processor-constrained systems. Although MTSv cannot compete with the speeds of exact k-mer matching approaches, it is reasonably fast and has higher precision than popular exact matching approaches. Because MTSv performs a full alignment, it can classify reads even when the genomes share low similarity with reference sequences, and it provides a tool for high-confidence pathogen detection with few off-target assignments to near-neighbor species.
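
The filter-then-align strategy with a per-TaxID early stop can be illustrated with a minimal Python sketch. This is not MTSv's implementation: it assumes a plain set intersection in place of the FM-index-assisted q-gram filter, a scalar Smith-Waterman score in place of the SIMD-accelerated one, and hypothetical names and thresholds (`classify`, `q`, `min_shared`, `min_score`) chosen only for illustration.

```python
def qgrams(seq, q):
    """Return the set of all length-q substrings of seq."""
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a vs. b (scalar, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def classify(read, references, q=4, min_shared=2, min_score=8):
    """Assign a read to TaxIDs.

    references: iterable of (taxid, sequence) pairs; several sequences
    may share one taxid. Returns (set of matched taxids, number of
    full alignments actually computed).
    """
    read_qgrams = qgrams(read, q)
    hits = set()
    alignments = 0
    for taxid, ref in references:
        # Early stop: this TaxID already has a sufficient alignment,
        # so its remaining reference sequences are skipped.
        if taxid in hits:
            continue
        # Cheap filter (assumed stand-in for the q-gram filter):
        # require a minimum number of shared q-grams before aligning.
        if len(read_qgrams & qgrams(ref, q)) < min_shared:
            continue
        alignments += 1
        if smith_waterman_score(read, ref) >= min_score:
            hits.add(taxid)
    return hits, alignments


# Toy data; the TaxIDs are arbitrary integers used as labels.
refs = [
    (562, "TTTACGTACGTGATTT"),   # contains the read exactly
    (562, "GGGACGTACGTGAGGG"),   # same TaxID: skipped by the early stop
    (573, "CCCCCCCCCCCCCCCC"),   # rejected by the q-gram filter
]
print(classify("ACGTACGTGA", refs))
```

With this toy input, only one full alignment is computed even though two reference sequences for TaxID 562 would align, which is the source of the efficiency gain when many reference sequences are available per taxon.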

Keywords: Alignment; Metagenomics; Pathogen detection; Taxonomic classification.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Databases, Nucleic Acid
  • Metagenome* / genetics
  • Metagenomics
  • Sequence Analysis, DNA

Grants and funding

This work was supported by a Department of Homeland Security grant (HSHQDC-17-C-B0008/BAA14-003). Software development, benchmarking, and analysis were performed on Northern Arizona University’s Monsoon computing cluster, funded by Arizona’s Technology and Research Initiative Fund. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.