Advancing microbial diagnostics: a universal phylogeny guided computational algorithm to find unique sequences for precise microorganism detection

Gulshan Kumar Sharma; Rakesh Sharma; Kavita Joshi; Sameer Qureshi; Shubhita Mathur; Sharad Sinha; Samit Chatterjee; Vandana Nunia

doi:10.1093/bib/bbae545

Brief Bioinform. 2024 Sep 23;25(6):bbae545. doi: 10.1093/bib/bbae545.

Authors

Gulshan Kumar Sharma¹, Rakesh Sharma², Kavita Joshi³, Sameer Qureshi³, Shubhita Mathur³, Sharad Sinha⁴, Samit Chatterjee³, Vandana Nunia³

Affiliations

¹ Malaviya National Institute of Technology, Jawahar Lal Nehru Marg, Jhalana Gram, Malviya Nagar, Jaipur, Rajasthan 302017, India.
² Centre for Converging Technologies, University of Rajasthan, Jawahar Lal Nehru Marg, Talvandi, Jaipur, Rajasthan 302004, India.
³ Department of Zoology, University of Rajasthan, Jawahar Lal Nehru Marg, Talvandi, Jaipur, Rajasthan 302004, India.
⁴ Department of Mathematics, University of Rajasthan, Jawahar Lal Nehru Marg, Talvandi, Jaipur, Rajasthan 302004, India.

Abstract

Sequences derived from organisms that share a common evolutionary origin are similar, whereas unique sequences, absent from related organisms, are good candidates for diagnostic markers. However, identifying dissimilar regions among closely related organisms is challenging because it requires complex multiple sequence alignments, which make computation and parsing difficult. To address this, we have developed NAUniSeq, a biologically inspired universal algorithm that finds unique sequences for microorganism diagnosis by traversing the phylogeny of life. Mapping through a phylogenetic tree keeps cross-contamination and false positives low. We downloaded the complete taxonomy data from the Taxadb database and sequence data from the National Center for Biotechnology Information Reference Sequence Database (NCBI-RefSeq) and built a phylogenetic tree with NetworkX. Sequences were assigned to the graph nodes, k-mers were created for target and non-target nodes, and the graph was searched using the depth-first search algorithm. In a memory-efficient alternative NoSQL approach, we created a collection of RefSeq sequences in a MongoDB database, indexed by tax-id and the path of the FASTA files, and queried this collection for the target and non-target sequences. In both approaches, we used an alignment-free, sliding-window, k-mer-based procedure that quickly compares the k-mers of target and non-target sequences and returns the unique sequences that are not present in the non-target sequences. We validated the algorithm with the target nodes Mycobacterium tuberculosis, Neisseria gonorrhoeae, and Monkeypox and generated unique sequences for each. This universal algorithm is a powerful tool for generating diagnostic sequences, enabling the accurate identification of microbial strains with high phylogenetic precision.
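The graph search and k-mer comparison described in the abstract can be illustrated with a minimal Python sketch. This is not the NAUniSeq implementation: the toy taxonomy, the sequences, the function names, and the choice of k = 4 are all hypothetical, and a plain dictionary stands in for the NetworkX graph so that the example needs only the standard library. It collects sequence-bearing nodes by depth-first search, builds a hash set of non-target k-mers, and slides a window over the target to report regions absent from every non-target sequence.

```python
from typing import Dict, List, Set

# Hypothetical toy taxonomy (parent -> children); all names are made up.
TREE: Dict[str, List[str]] = {
    "root": ["genusA", "genusB"],
    "genusA": ["target_sp", "sibling_sp"],
    "genusB": ["distant_sp"],
}
# Hypothetical sequences assigned to the leaf nodes.
SEQS: Dict[str, str] = {
    "target_sp": "ACGTACGTTTGACCA",
    "sibling_sp": "ACGTACGTAAGACCA",
    "distant_sp": "GGGCCCGGGCCC",
}


def dfs_sequence_nodes(tree: Dict[str, List[str]], start: str) -> List[str]:
    """Iterative depth-first search; returns nodes below `start` that carry a sequence."""
    stack, seen, found = [start], set(), []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in SEQS:
            found.append(node)
        stack.extend(reversed(tree.get(node, [])))
    return found


def kmers(seq: str, k: int) -> Set[str]:
    """All k-mers of `seq`, produced by a sliding window of width k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_regions(target: str, non_targets: List[str], k: int) -> List[str]:
    """Merge consecutive target k-mers that never occur in any non-target sequence."""
    background: Set[str] = set()
    for seq in non_targets:
        background |= kmers(seq, k)
    regions, start = [], None
    for i in range(len(target) - k + 1):
        is_unique = target[i:i + k] not in background
        if is_unique and start is None:
            start = i                      # a unique run begins here
        elif not is_unique and start is not None:
            regions.append(target[start:i - 1 + k])  # run ended at window i - 1
            start = None
    if start is not None:                  # run extends to the end of the target
        regions.append(target[start:])
    return regions


def find_unique(target_node: str, k: int) -> List[str]:
    """Treat every sequence outside the target's subtree as non-target."""
    target_nodes = set(dfs_sequence_nodes(TREE, target_node))
    all_nodes = dfs_sequence_nodes(TREE, "root")
    non_targets = [SEQS[n] for n in all_nodes if n not in target_nodes]
    results: List[str] = []
    for node in sorted(target_nodes):
        results.extend(unique_regions(SEQS[node], non_targets, k))
    return results


if __name__ == "__main__":
    print(find_unique("target_sp", 4))
```

With these toy inputs, the only region reported is the stretch of the target that spans its two-base difference from the sibling sequence. Real genomes would call for a much larger k and, most likely, handling of reverse complements, neither of which is attempted here.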

Keywords: depth first search; hash map; k-mer; phylogenetic tree; taxonomy; unique sequences.

MeSH terms

Algorithms*
Bacteria / classification
Bacteria / genetics
Computational Biology / methods
Humans
Phylogeny*
Sequence Alignment
Sequence Analysis, DNA / methods
Software