Analysis-ready VCF at Biobank scale using Zarr

Eric Czech; Timothy R Millar; Will Tyler; Tom White; Ben Jeffery; Alistair Miles; Sam Tallman; Rafal Wojdyla; Shadi Zabad; Jeff Hammerbacher; Jerome Kelleher

doi:10.1101/2024.06.11.598241


bioRxiv [Preprint]. 2024 Nov 15:2024.06.11.598241. doi: 10.1101/2024.06.11.598241.

Authors

Eric Czech¹, Timothy R Millar²,³, Will Tyler⁴, Tom White⁵, Ben Jeffery⁶, Alistair Miles⁷, Sam Tallman⁸, Rafal Wojdyla¹, Shadi Zabad⁹, Jeff Hammerbacher¹, Jerome Kelleher⁶

Affiliations

¹ Related Sciences, Lincoln, New Zealand.
² The New Zealand Institute for Plant & Food Research Ltd, Lincoln, New Zealand.
³ Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand.
⁴ Independent researcher, University of Oxford, UK.
⁵ Tom White Consulting Ltd., University of Oxford, UK.
⁶ Big Data Institute, University of Oxford, UK.
⁷ Wellcome Sanger Institute, McGill University, Montreal, QC, Canada.
⁸ Genomics England, McGill University, Montreal, QC, Canada.
⁹ School of Computer Science, McGill University, Montreal, QC, Canada.

Abstract

Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable at this scale, and a more scalable approach is needed.

Results: We present the VCF Zarr specification, an encoding of the VCF data model using Zarr that makes retrieving subsets of the data much more efficient. Zarr is a cloud-native format for storing multi-dimensional data, widely used in scientific computing. We show that this format is far more efficient than standard VCF-based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and calculation performance. We demonstrate the VCF Zarr format (and the vcf2zarr conversion utility) on a subset of the Genomics England aggV2 dataset comprising 78,195 samples and 59,880,903 variants, with a 5X reduction in storage and greater than 300X reduction in CPU usage in some representative benchmarks.
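The contrast between row-wise records and per-field chunked storage can be illustrated with a small sketch. This is a toy model using only the Python standard library, not the Zarr API or the VCF Zarr specification: the field names (`pos`, `qual`, `gt`), the chunk size, and the JSON-plus-zlib encoding are all assumptions chosen for illustration. It shows that reading a single field from a row-wise store requires decompressing every complete record, whereas a columnar store only touches the chunks belonging to that field.

```python
import json
import zlib

# Toy data (hypothetical): 6 variants x 4 samples, each with a position,
# a quality score, and one genotype code per sample.
variants = [
    {"pos": 100 + i, "qual": 30 + i, "gt": [(i + s) % 3 for s in range(4)]}
    for i in range(6)
]


def encode(obj):
    """Serialise an object to JSON and compress it."""
    return zlib.compress(json.dumps(obj).encode())


def decode(blob):
    """Inverse of encode()."""
    return json.loads(zlib.decompress(blob))


# Row-wise layout: one compressed record per variant, holding all fields.
row_store = [encode(v) for v in variants]

# Columnar layout: each field is stored separately, split into chunks of
# CHUNK variants that are compressed independently.
CHUNK = 3
col_store = {
    field: [
        encode([v[field] for v in variants[i:i + CHUNK]])
        for i in range(0, len(variants), CHUNK)
    ]
    for field in ("pos", "qual", "gt")
}


def positions_row_wise():
    """Read every position; every full record must be decompressed."""
    values = [decode(blob)["pos"] for blob in row_store]
    bytes_read = sum(len(blob) for blob in row_store)
    return values, bytes_read


def positions_column_wise():
    """Read every position; only the 'pos' chunks are decompressed."""
    blobs = col_store["pos"]
    values = [p for blob in blobs for p in decode(blob)]
    bytes_read = sum(len(blob) for blob in blobs)
    return values, bytes_read


row_values, row_bytes = positions_row_wise()
col_values, col_bytes = positions_column_wise()
print("row-wise bytes read:", row_bytes)
print("column-wise bytes read:", col_bytes)
```

Both functions return the same positions, but the columnar version reads fewer compressed bytes because the quality and genotype data are never touched. The sketch says nothing about the magnitude of the savings reported for real datasets.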

Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.

Keywords: Analysis ready data; Variant Call Format; Zarr.

Publication types

Preprint
