AmpSeqR: an R package for amplicon deep sequencing data analysis

F1000Res. 2023 Mar 23:12:327. doi: 10.12688/f1000research.129581.1. eCollection 2023.

Abstract

Amplicon sequencing (AmpSeq) is a methodology that targets specific genomic regions of interest for polymerase chain reaction (PCR) amplification so that they can be sequenced to a high depth of coverage. Amplicons are typically chosen to be highly polymorphic, usually with several highly informative, high-frequency single nucleotide polymorphisms (SNPs) segregating in an amplicon of 100-200 base pairs (bp). This allows high-sensitivity detection and quantification of the frequency of each sequence within each sample, making AmpSeq suitable for applications such as detection of low-frequency somatic mosaicism or of minor clones in mixed samples. AmpSeq is increasingly applied in both biological and medical studies, in areas such as cancer, infectious diseases, and brain mosaicism. Current bioinformatics pipelines for AmpSeq data processing lack downstream analysis, have difficulty distinguishing true sequences from PCR and sequencing errors and artifacts, and often require bioinformatics expertise. We present AmpSeqR, a new R package designed for the processing of deep short-read amplicon sequencing data, with a focus on infectious diseases. The pipeline integrates several existing R packages, combining them with newly developed functions to perform optimal filtering of reads, which removes noise and improves the accuracy of the detected sequences, permitting detection of very low-frequency clones in mixed samples. The package provides functions for data pre-processing, amplicon sequence variant (ASV) estimation, data post-processing, and data visualization, and automatically generates a comprehensive R Markdown report containing all essential results, which facilitates their inclusion in reports and publications. AmpSeqR is publicly available at https://github.com/bahlolab/AmpSeqR.
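
To make the idea of within-sample frequency quantification concrete, the following is a minimal Python sketch. It is not AmpSeqR's implementation or API. The function name `haplotype_frequencies`, the thresholds `min_count` and `min_freq`, and the toy reads are all hypothetical and chosen only for illustration. The sketch counts identical sequences in one sample, converts counts to frequencies, and drops sequences that fall below simple count and frequency cut-offs, which is one naive way to separate likely minor clones from likely errors.

```python
from collections import Counter


def haplotype_frequencies(reads, min_count=2, min_freq=0.01):
    """Return {sequence: frequency} for one sample (illustrative only).

    Frequencies are computed against the total number of reads in the
    sample. Sequences seen fewer than `min_count` times, or at a
    frequency below `min_freq`, are treated as likely noise and dropped.
    Both thresholds are hypothetical defaults, not AmpSeqR settings.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return {}
    kept = {}
    for seq, n in counts.items():
        freq = n / total
        if n >= min_count and freq >= min_freq:
            kept[seq] = freq
    return kept


# Toy sample: one dominant clone, one minor clone, one singleton read.
reads = ["ACGT"] * 95 + ["ACGA"] * 4 + ["TCGA"] * 1
freqs = haplotype_frequencies(reads, min_count=2, min_freq=0.01)
for seq, freq in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{seq}\t{freq:.3f}")
```

In this toy sample the singleton read is discarded, while the minor clone supported by four reads is retained. A real pipeline would additionally model PCR and sequencing error rather than rely on fixed thresholds alone.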

Keywords: R package; amplicon sequencing; data visualization; summary report.

MeSH terms

  • Computational Biology / methods
  • Data Analysis
  • High-Throughput Nucleotide Sequencing* / methods
  • Humans
  • Polymerase Chain Reaction / methods
  • Polymorphism, Single Nucleotide*
  • Sequence Analysis, DNA / methods
  • Software*

Associated data

  • figshare/10.6084/m9.figshare.21739121.v2

Grants and funding

This work was made possible through the Victorian State Government Operational Infrastructure Support and the Australian Government National Health and Medical Research Council (NHMRC) Independent Research Institutes Infrastructure Support Scheme (IRIISS). Melanie Bahlo was supported by an NHMRC Investigator grant (ID: 1195236). Jiru Han was supported by a Melbourne Research Scholarship (The University of Melbourne, https://www.unimelb.edu.au) and a WEHI PhD Scholarship (The Walter and Eliza Hall Institute of Medical Research, https://www.wehi.edu.au).