xGAP: a python based efficient, modular, extensible and fault tolerant genomic analysis pipeline for variant discovery

Aditya Gorla; Brandon Jew; Luke Zhang; Jae Hoon Sul

doi:10.1093/bioinformatics/btaa1097

Bioinformatics. 2021 Apr 9;37(1):9-16. doi: 10.1093/bioinformatics/btaa1097.

Authors

Aditya Gorla¹, Brandon Jew², Luke Zhang³, Jae Hoon Sul⁴

Affiliations

¹ Department of Bioengineering, University of California, Los Angeles, CA 90095, USA.
² Bioinformatics Interdepartmental Program, University of California, Los Angeles, CA 90095, USA.
³ Undergraduate Neuroscience Interdepartmental Program, University of California, Los Angeles, CA 90095, USA.
⁴ Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, CA 90095, USA.

Abstract

Motivation: Since the first human genome was sequenced in 2001, there has been rapid growth in the number of bioinformatic methods to process and analyze next-generation sequencing (NGS) data for research and clinical studies that aim to identify genetic variants influencing diseases and traits. To achieve this goal, one first needs to call genetic variants from NGS data, which requires multiple computationally intensive analysis steps. Unfortunately, there is a lack of an open-source pipeline that can perform all these steps on NGS data in a manner that is fully automated, efficient, rapid, scalable, modular, user-friendly and fault tolerant. To address this, we introduce xGAP, an extensible Genome Analysis Pipeline, which implements a modified GATK best practice to analyze DNA-seq data with the aforementioned functionalities.

Results: xGAP implements massive parallelization of the modified GATK best practice pipeline by splitting a genome into many smaller regions with efficient load-balancing to achieve high scalability. It can process 30× coverage whole-genome sequencing (WGS) data in ∼90 min. In terms of accuracy of discovered variants, xGAP achieves average F1 scores of 99.37% for single nucleotide variants and 99.20% for insertions/deletions across seven benchmark WGS datasets. We achieve highly consistent results across multiple on-premises (SGE and SLURM) high-performance clusters. Compared to the Churchill pipeline, with similar parallelization, xGAP is 20% faster when analyzing 50× coverage WGS on Amazon Web Services. Finally, xGAP is user-friendly and fault tolerant: it can automatically re-initiate failed processes to minimize required user intervention.
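
The region-splitting, load-balancing and automatic-restart ideas described above can be illustrated with a minimal Python sketch. It is not taken from the xGAP code base: the function names (split_into_regions, balance_regions, run_with_retries), the fixed window size, the use of region length as a proxy for workload, and the retry limit are all assumptions made for illustration only.

```python
import heapq


def split_into_regions(chrom_lengths, region_size):
    """Cut each chromosome into half-open [start, end) windows of at most region_size bases."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, region_size):
            regions.append((chrom, start, min(start + region_size, length)))
    return regions


def balance_regions(regions, n_workers):
    """Greedy longest-first assignment: each region goes to the least-loaded worker so far.

    Region length stands in for workload here (an assumption, not xGAP's documented metric).
    """
    heap = [(0, worker) for worker in range(n_workers)]  # (current load, worker index)
    heapq.heapify(heap)
    bins = [[] for _ in range(n_workers)]
    for region in sorted(regions, key=lambda r: r[2] - r[1], reverse=True):
        load, worker = heapq.heappop(heap)
        bins[worker].append(region)
        heapq.heappush(heap, (load + region[2] - region[1], worker))
    return bins


def run_with_retries(task, region, max_attempts=3):
    """Re-run a failed task on the same region, re-raising after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(region)
        except Exception:
            if attempt == max_attempts:
                raise


# Toy usage with made-up chromosome lengths (hypothetical values, not real genome sizes).
toy_regions = split_into_regions({"chrA": 250, "chrB": 100}, region_size=100)
toy_bins = balance_regions(toy_regions, n_workers=2)
```

Sorting regions longest-first before assigning them keeps the most expensive pieces from landing on an already busy worker, which is one simple way to approximate even load; a production pipeline would more likely weight regions by read depth or observed runtime.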

Availability and implementation: xGAP is available at https://github.com/Adigorla/xgap.

Supplementary information: Supplementary data are available at Bioinformatics online.
