SpecGMM: Integrating Spectral analysis and Gaussian Mixture Models for taxonomic classification and identification of discriminative DNA regions

Bioinform Adv. 2024 Nov 5;4(1):vbae171. doi: 10.1093/bioadv/vbae171. eCollection 2024.

Abstract

Motivation: Genomic signal processing (GSP), which transforms biomolecular sequences into discrete signals for spectral analysis, has provided valuable insights into DNA sequence, structure, and evolution. However, challenges persist with spectral representations of variable-length sequences for tasks like species classification and in interpreting these spectra to identify discriminative DNA regions.

Results: We introduce SpecGMM, a novel framework that integrates sliding window-based Spectral analysis with a Gaussian Mixture Model to transform variable-length DNA sequences into fixed-dimensional spectral representations for taxonomic classification. SpecGMM's hyperparameters were selected using a dataset of plant sequences, and applied unchanged across diverse datasets, including mitochondrial DNA, viral and bacterial genome, and 16S rRNA sequences. Across these datasets, SpecGMM outperformed a baseline method, with 9.45% average and 35.55% maximum improvement in test accuracies for a Linear Discriminant classifier. Regarding interpretability, SpecGMM revealed discriminative hypervariable regions in 16S rRNA sequences-particularly V3/V4 for discriminating higher taxa and V2/V3 for lower taxa-corroborating their known classification relevance. SpecGMM's spectrogram video analysis helped visualize species-specific DNA signatures. SpecGMM thus provides a robust and interpretable method for spectral DNA analysis, opening new avenues in GSP research.

Availability and implementation: SpecGMM's source code is available at https://github.com/BIRDSgroup/SpecGMM.