Performance of methods for SARS-CoV-2 variant detection and abundance estimation within mixed population samples

Tunc Kayikcioglu; Jasmine Amirzadegan; Hugh Rand; Bereket Tesfaldet; Ruth E Timme; James B Pettengill

doi:10.7717/peerj.14596

PeerJ. 2023 Jan 26:11:e14596. doi: 10.7717/peerj.14596. eCollection 2023.

Authors

Tunc Kayikcioglu¹ ², Jasmine Amirzadegan¹ ³, Hugh Rand¹, Bereket Tesfaldet¹, Ruth E Timme⁴, James B Pettengill¹

Affiliations

¹ Biostatistics and Bioinformatics Staff, Office of Analytics and Outreach, Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, United States of America.
² Joint Institute for Food Safety and Applied Nutrition, University of Maryland College Park, College Park, MD, United States of America.
³ Oak Ridge Institute for Science and Education, Oak Ridge, TN, United States of America.
⁴ Division of Microbiology, Office of Regulatory Science, Center for Food Safety and Applied Nutrition, United States Food and Drug Administration, College Park, MD, United States of America.

Abstract

Background: Accurate identification of SARS-CoV-2 (SC2) variants and estimation of their abundance in mixed population samples (e.g., air or wastewater) are imperative for successful surveillance of community-level trends. Assessing the performance of SC2 variant composition estimators (VCEs) should improve our confidence in public health decision making. Here, we introduce a linear regression-based VCE and compare its performance to four other VCEs: two re-purposed DNA sequence read classifiers (Kallisto and Kraken2), a maximum-likelihood-based method (Lineage deComposition for Sars-Cov-2 pooled samples (LCS)), and a regression-based method (Freyja).
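As an illustration of what a regression-based VCE does in general, the following minimal sketch estimates variant abundances by non-negative least squares: observed mutation frequencies are modeled as a weighted sum of per-variant mutation signatures. The abstract does not describe the authors' regression, so this is not their implementation. The function name `estimate_abundances`, the binary signature matrix, the projected-gradient solver, and the toy data are all assumptions made for illustration.

```python
def estimate_abundances(signatures, observed, iterations=5000):
    """Hypothetical sketch: solve min ||S a - f||^2 subject to a >= 0.

    signatures[i][j] is 1 if mutation i is characteristic of variant j, else 0.
    observed[i] is the observed frequency of mutation i in the mixed sample.
    Uses projected gradient descent (an assumed solver, chosen for brevity).
    """
    n_sites = len(signatures)
    n_variants = len(signatures[0])
    # Step size 1/trace(S^T S), which never exceeds 1/(largest eigenvalue).
    step = 1.0 / sum(v * v for row in signatures for v in row)
    a = [1.0 / n_variants] * n_variants  # start from a uniform mixture
    for _ in range(iterations):
        # Residual of the linear model at every mutation site.
        residual = [
            sum(row[j] * a[j] for j in range(n_variants)) - f
            for row, f in zip(signatures, observed)
        ]
        # Gradient of 0.5 * ||S a - f||^2 with respect to each abundance.
        gradient = [
            sum(signatures[i][j] * residual[i] for i in range(n_sites))
            for j in range(n_variants)
        ]
        # Gradient step, then projection onto the non-negative orthant.
        a = [max(0.0, a[j] - step * gradient[j]) for j in range(n_variants)]
    return a


# Toy data (invented): 5 mutation sites, 3 variants mixed at 0.6 / 0.3 / 0.1.
SIGNATURES = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
]
OBSERVED = [0.6, 0.9, 0.3, 0.1, 0.4]

print(estimate_abundances(SIGNATURES, OBSERVED))
```

Because the toy frequencies are noise-free, the estimate should land close to the mixing proportions used to build them. Real sequencing data would add coverage-dependent noise and shared mutations between lineages, which this sketch does not model.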

Methods: We simulated DNA sequence datasets of known variant composition from both Illumina and Oxford Nanopore Technologies (ONT) platforms and assessed the performance of each VCE. We also evaluated VCE performance using publicly available empirical wastewater samples collected for SC2 surveillance efforts. Bioinformatic analyses were performed with a custom Nextflow workflow (C-WAP, CFSAN Wastewater Analysis Pipeline). Relative root mean squared error (RRMSE) was used to measure performance with respect to the known abundance, and the concordance correlation coefficient (CCC) was used to measure agreement between pairs of estimators.
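The two metrics named above can be sketched in a few lines. The abstract does not give the exact RRMSE formula, so the normalization used here (RMSE divided by the mean of the true abundances) is an assumption, and other conventions exist. The CCC follows Lin's standard definition with population (1/n) variances. The function names `rrmse` and `ccc` are illustrative only.

```python
from math import sqrt


def rrmse(estimated, true):
    """RMSE of the estimates divided by the mean true abundance.

    Assumed normalization; the paper may define RRMSE differently.
    """
    n = len(true)
    rmse = sqrt(sum((e - t) ** 2 for e, t in zip(estimated, true)) / n)
    return rmse / (sum(true) / n)


def ccc(x, y):
    """Lin's concordance correlation coefficient between two estimators."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Penalizes both poor correlation and systematic shifts between x and y.
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)
```

Unlike Pearson correlation, the CCC drops below 1 when two estimators are perfectly correlated but offset from each other, which is why it suits comparisons of abundance estimates that should agree in absolute value.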

Results: On simulated data, Kallisto was the most accurate estimator, with the lowest RRMSE, followed by Freyja. Kallisto and Freyja had the most similar predictions, reflected by the highest CCC values. We also found that accuracy was platform and amplicon panel dependent. For example, the accuracy of Freyja was significantly higher with Illumina data than with ONT data; performance of Kallisto was best with ARTICv4. However, when analyzing empirical data, agreement among methods was poor and the number of variants detected varied (e.g., Freyja ARTICv4 had a mean of 2.2 variants while Kallisto ARTICv4 had a mean of 10.1 variants).

Conclusion: This work characterizes how several VCEs differ in performance and how accurately they capture the relative abundance of SC2 variants within a mixed sample (e.g., wastewater). Such information should help officials gauge the confidence they can have in such data for informing public health decisions.

Keywords: Bioinformatics; Deconvolution; SARS-CoV-2; Wastewater surveillance.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.
Research Support, U.S. Gov't, P.H.S.

MeSH terms

COVID-19* / diagnosis
Humans
Likelihood Functions
SARS-CoV-2 / genetics
Wastewater

Substances

Wastewater

Supplementary concepts

SARS-CoV-2 variants

Grants and funding

U01 FD001418/FD/FDA HHS/United States