dna2bit: high performance genomic distance estimation software for microbial genome analysis

Front Microbiol. 2024 Dec 23:15:1521181. doi: 10.3389/fmicb.2024.1521181. eCollection 2024.

Abstract

dna2bit is an ultra-fast software tool engineered specifically for microbial genome analysis and is particularly suited to calculating genome distances within metagenome and single amplified genome datasets. Unlike existing software such as Mash and Dashing, dna2bit employs a feature hashing technique and Hamming distance to improve speed and memory utilization without sacrificing the accuracy of average nucleotide identity calculations. dna2bit has promising applications in domains such as average nucleotide identity approximation, metagenomic sequence clustering, and homology querying. It significantly boosts computational efficiency when handling large datasets, including single amplified genomes, thereby facilitating a better understanding of the population heterogeneity and comparative genomics of microorganisms. dna2bit is available at https://github.com/lijuzeng/dna2bit.
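
To make the combination of feature hashing and Hamming distance concrete, the following is a minimal Python sketch of the general idea, not dna2bit's actual implementation. It assumes that each k-mer is hashed into one of a fixed number of signed buckets and that the bucket totals are then binarized by sign. The function names (`kmers`, `sketch`, `hamming`), the k-mer length, the sketch width, and the choice of BLAKE2b as the hash are all hypothetical choices made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of an upper-cased sequence."""
    seq = seq.upper()
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def sketch(seq, k=11, n_bits=256):
    """Hash k-mers into n_bits signed buckets, then keep only the signs.

    Illustrative assumption: the lowest hash bit picks the sign and the
    remaining bits pick the bucket (the usual "hashing trick" layout).
    The result is an n_bits-wide bit vector stored in a Python int.
    """
    counts = [0] * n_bits
    for kmer in kmers(seq, k):
        digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        sign = 1 if h & 1 else -1
        bucket = (h >> 1) % n_bits
        counts[bucket] += sign
    bits = 0
    for i, total in enumerate(counts):
        if total > 0:  # bit i is set when the bucket total is positive
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Count the bit positions at which two sketches differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    genome_a = "ACGTACGTTAGCCGATAGCTAGGCTTAACGGATCGATCGGATTACA" * 20
    genome_b = genome_a.replace("GATTACA", "GATTTCA")
    print(hamming(sketch(genome_a), sketch(genome_b)))
```

In a sketch of this kind, each genome is reduced to a fixed-width bit vector regardless of its length, and comparing two genomes costs one XOR and a bit count. Converting the resulting Hamming distance into an average nucleotide identity estimate requires a calibration step that is not shown here.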

Keywords: Hamming distance; average nucleotide identity; genome distance; metagenomic clustering; single amplified genomes.

Grants and funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This study was supported by the National Key R&D Program of China (2023YFA1801200), the National Natural Science Foundation of China (32288101), the Shanghai Municipal Science and Technology Major Project (2023SHZDZX02 and 2017SHZDZX01), and the CAMS Innovation Fund for Medical Science (2019-I2M-5-066).