KmerKeys: a web resource for searching indexed genome assemblies and variants

Dmitri S Pavlichin; HoJoon Lee; Stephanie U Greer; Susan M Grimes; Tsachy Weissman; Hanlee P Ji

doi:10.1093/nar/gkac266

Nucleic Acids Res. 2022 Jul 5;50(W1):W448-W453. doi: 10.1093/nar/gkac266.

Authors

Dmitri S Pavlichin¹, HoJoon Lee¹, Stephanie U Greer¹, Susan M Grimes², Tsachy Weissman³, Hanlee P Ji¹,²

Affiliations

¹ Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA.
² Stanford Genome Technology Center West, Stanford University, Palo Alto, CA, 94304, USA.
³ Department of Electrical Engineering, Stanford University, Palo Alto, CA, 94304, USA.

Abstract

K-mers are short DNA sequences that are used for genome sequence analysis. Applications that use k-mers include genome assembly and alignment. However, the wider bioinformatic use of these short sequences faces challenges related to the massive scale of genomic sequence data. A single human genome assembly has billions of k-mers. As a result, the computational requirements for analyzing k-mer information are enormous, particularly when complete genome assemblies are involved. To address these issues, we developed a new indexing data structure based on a hash table tuned for the lookup of short sequence keys. This web application, referred to as KmerKeys, provides rapid query speeds for cloud computation on genome assemblies. We enable fuzzy as well as exact sequence searches of assemblies. To enable robust and speedy performance, the website implements cache-friendly hash tables, memory mapping and massive parallel processing. Our method employs a scalable and efficient data structure that can be used to jointly index and search a large collection of human genome assembly information. One can include variant databases and their associated metadata, such as the gnomAD population variant catalogue. This feature enables the incorporation of future genomic information into sequencing analysis. KmerKeys is freely accessible at https://kmerkeys.dgi-stanford.org.
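
To make the idea of a hash-table k-mer index with exact and fuzzy lookup concrete, the following is a minimal Python sketch. It is not the KmerKeys implementation: the function names (`encode_kmer`, `build_index`, `exact_search`, `fuzzy_search`), the 2-bit packing of bases, and the choice of Hamming distance 1 as the meaning of "fuzzy" are all assumptions made for illustration, and the sketch uses an ordinary in-memory dictionary rather than the cache-friendly, memory-mapped tables the abstract describes.

```python
from collections import defaultdict

# Hypothetical 2-bit base encoding (an assumption, not taken from the paper).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a k-mer into a single integer, using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def build_index(sequence, k):
    """Map each encoded k-mer to the 0-based start positions where it occurs."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Skip windows containing ambiguous bases such as N.
        if all(base in ENCODE for base in kmer):
            index[encode_kmer(kmer)].append(start)
    return index


def exact_search(index, kmer):
    """Return the positions of an exact k-mer match (empty list if absent)."""
    return index.get(encode_kmer(kmer), [])


def fuzzy_search(index, kmer):
    """Return {k-mer: positions} for indexed k-mers within Hamming distance 1.

    The query itself is included among the candidates, so an exact match
    is also reported.
    """
    candidates = {kmer}
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                candidates.add(kmer[:i] + alt + kmer[i + 1:])
    hits = {}
    for candidate in candidates:
        positions = index.get(encode_kmer(candidate))
        if positions:
            hits[candidate] = positions
    return hits
```

For example, `build_index("ACGTACGTTACG", 4)` followed by `exact_search(index, "ACGT")` looks up a single integer key, while `fuzzy_search(index, "ACGA")` probes the query plus its 3k single-substitution neighbours. Enumerating neighbours keeps every probe an exact hash lookup, at the cost of a candidate set that grows quickly with the allowed number of mismatches.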

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Genome, Human
Genomics / methods
Humans
Sequence Analysis, DNA* / methods
Software*
