Large-scale sequence comparisons with sourmash

N Tessa Pierce; Luiz Irber; Taylor Reiter; Phillip Brooks; C Titus Brown

doi:10.12688/f1000research.19675.1

F1000Res. 2019 Jul 4:8:1006. doi: 10.12688/f1000research.19675.1. eCollection 2019.

Authors

N Tessa Pierce^#¹, Luiz Irber^#¹, Taylor Reiter^#¹, Phillip Brooks^#¹, C Titus Brown¹

Affiliation

¹ Department of Population Health and Reproduction, University of California, Davis, Davis, California, 95616, USA.

^# Contributed equally.

Abstract

The sourmash software package uses MinHash-based sketching to create "signatures", compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under the BSD license at http://github.com/dib-lab/sourmash.
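
As a rough illustration of the MinHash idea summarized above, the following Python sketch builds bottom-k sketches from k-mer hashes and estimates Jaccard similarity from them. It is not sourmash's implementation or API: the function names, the BLAKE2b hash, and the defaults k=21 and num=500 are assumptions chosen only for this example.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer.

    BLAKE2b is an assumption for this example, not sourmash's hash function.
    """
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, num=500):
    """Bottom-k MinHash sketch: the `num` smallest distinct k-mer hashes."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:num]


def jaccard_estimate(a, b, num=500):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the `num` smallest hashes of the combined sketches and report the
    fraction of them that occur in both inputs.
    """
    merged = sorted(set(a) | set(b))[:num]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Each sketch holds at most `num` integers regardless of input length, which is why comparisons of this kind can stay small in memory; comparing a sketch with itself returns 1.0, and sketches with no hashes in common return 0.0.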

Keywords: MinHash; bioinformatics; k-mer; sequence analysis; sourmash.

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Databases, Factual
Genome*
Software*

Grants and funding

This work is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative [GBMF4551 to CTB]. NTP was supported by a National Science Foundation Postdoctoral Fellowship in Biology [1711984].