GSC: efficient lossless compression of VCF files with fast query

Gigascience. 2024 Jan 2:13:giae046. doi: 10.1093/gigascience/giae046.

Abstract

Background: With the rise of large-scale genome sequencing projects, genotyping of thousands of samples has produced immense variant call format (VCF) files. It is becoming increasingly challenging to store, transfer, and analyze these voluminous files. Compression methods have been used to tackle these issues, aiming for both high compression ratio and fast random access. However, existing methods have not yet achieved a satisfactory compromise between these 2 objectives.

Findings: To address the aforementioned issue, we introduce GSC (Genotype Sparse Compression), a specialized and refined lossless compression tool for VCF files. In benchmark tests conducted across various open-source datasets, GSC showcased exceptional performance in genotype data compression. Compared with the industry's most advanced tools (namely, GBC and GTC), GSC achieved compression ratios that were higher by 26.9% to 82.4% over GBC and GTC on the datasets, respectively. In lossless compression scenarios, GSC also demonstrated robust performance, with compression ratios 1.5× to 6.5× greater than general-purpose tools like gzip, zstd, and BCFtools-a mode not supported by either GBC or GTC. Achieving such high compression ratios did require some reasonable trade-offs, including longer decompression times, with GSC being 1.2× to 2× slower than GBC, yet 1.1× to 1.4× faster than GTC. Moreover, GSC maintained decompression query speeds that were equivalent to its competitors. In terms of RAM usage, GSC outperformed both counterparts. Overall, GSC's comprehensive performance surpasses that of the most advanced technologies.

Conclusion: GSC balances high compression ratios with rapid data access, enhancing genomic data management. It supports seamless PLINK binary format conversion, simplifying downstream analysis.

Keywords: VCF/BCF files; lossless compression; rapid random access.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computational Biology / methods
  • Data Compression* / methods
  • Genotype
  • High-Throughput Nucleotide Sequencing / methods
  • Humans
  • Software*