Machine learning classification by fitting amplicon sequences to existing OTUs

Courtney R Armour; Kelly L Sovacool; William L Close; Begüm D Topçuoğlu; Jenna Wiens; Patrick D Schloss

doi:10.1128/msphere.00336-23

mSphere. 2023 Oct 24;8(5):e0033623. doi: 10.1128/msphere.00336-23. Epub 2023 Aug 24.

Authors

Courtney R Armour¹, Kelly L Sovacool², William L Close¹, Begüm D Topçuoğlu¹, Jenna Wiens³, Patrick D Schloss¹

Affiliations

¹ Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA.
² Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA.
³ Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, USA.

Abstract

The ability to use 16S rRNA gene sequence data to train machine learning classification models offers the opportunity to diagnose patients based on the composition of their microbiome. In some applications, the taxonomic resolution that provides the best models may require the use of de novo operational taxonomic units (OTUs) whose composition changes when new data are added. We previously developed a new reference-based approach, OptiFit, that fits new sequence data to existing de novo OTUs without changing the composition of the original OTUs. While OptiFit produces OTUs that are as high quality as de novo OTUs, it is unclear whether this method for fitting new sequence data into existing OTUs will impact the performance of classification models relative to models trained and tested only using de novo OTUs. We used OptiFit to cluster sequences into existing OTUs and evaluated model performance in classifying a dataset containing samples from patients with and without colonic screen relevant neoplasia (SRN). We compared the performance of this model to standard methods including de novo and database-reference-based clustering. We found that using OptiFit performed as well or better in classifying SRNs. OptiFit can streamline the process of classifying new samples by avoiding the need to retrain models using reclustered sequences.

IMPORTANCE There is great potential for using microbiome data to aid in diagnosis. A challenge with de novo operational taxonomic unit (OTU)-based classification models is that 16S rRNA gene sequences are often assigned to OTUs based on similarity to other sequences in the dataset. If data are generated from new patients, the old and new sequences must be reclustered to OTUs and the classification model retrained. Yet there is a desire to have a single, validated model that can be widely deployed. To overcome this obstacle, we applied the OptiFit clustering algorithm to fit new sequence data to existing OTUs allowing for reuse of the model. A random forest model implemented using OptiFit performed as well as the traditional reassign and retrain approach. This result shows that it is possible to train and apply machine learning models based on OTU relative abundance data that do not require retraining or the use of a reference database.
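The reason fitting to existing OTUs allows a trained model to be reused can be illustrated with a minimal sketch. It is not the OptiFit algorithm and is not taken from the paper: it assumes each read in a new sample has already been given an existing OTU label (or `None` if it could not be fitted), and it only shows how a feature vector in the training OTU order might then be built. The function name, the OTU labels, and the choice to drop unfitted reads before normalising are all hypothetical.

```python
from collections import Counter


def feature_vector(read_otus, reference_otus):
    """Relative abundance of each reference OTU, in the reference order.

    read_otus: one OTU label per read; None marks an unfitted read.
    reference_otus: the OTU labels the model was trained on, in column order.
    Assumption for illustration: unfitted reads are dropped before normalising.
    """
    known = set(reference_otus)
    counts = Counter(otu for otu in read_otus if otu in known)
    total = sum(counts.values())
    if total == 0:
        # No read was fitted to any reference OTU.
        return [0.0] * len(reference_otus)
    return [counts[otu] / total for otu in reference_otus]


# Hypothetical training-time OTU columns and one new sample's read labels.
reference_otus = ["Otu001", "Otu002", "Otu003"]
new_sample = ["Otu001", "Otu003", "Otu001", None, "Otu003", "Otu001"]
features = feature_vector(new_sample, reference_otus)
```

Because the vector always has one entry per training OTU in the same order, a model trained on the original OTU table could in principle score it directly, which is the property the abstract relies on.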

Keywords: bioinformatics; diagnostics; machine learning; microbial ecology; microbiome.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms
Humans
Metagenomics* / methods
Microbiota* / genetics
RNA, Ribosomal, 16S / genetics
Sequence Analysis, DNA / methods

Substances

RNA, Ribosomal, 16S
