ntsm: an alignment-free, ultra-low-coverage, sequencing technology agnostic, intraspecies sample comparison tool for sample swap detection

Gigascience. 2024 Jan 2:13:giae024. doi: 10.1093/gigascience/giae024.

Abstract

Background: Due to human error, sample swapping in large cohort studies with heterogeneous data types (e.g., mix of Oxford Nanopore Technologies, Pacific Bioscience, Illumina data, etc.) remains a common issue plaguing large-scale studies. At present, all sample swapping detection methods require costly and unnecessary (e.g., if data are only used for genome assembly) alignment, positional sorting, and indexing of the data in order to compare similarly. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.

Findings: The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio-based test. Per sample error rate, and coverage bias (i.e., missing sites) can also be estimated with this information, which can be used to determine if a spatially indexed principal component analysis (PCA)-based prescreening method can be used, which can greatly speed up analysis by preventing exhaustive all-to-all comparisons.

Conclusions: Because this tool processes raw data, is faster than alignment, and can be used on very low-coverage data, it can save an immense degree of computational resources in standard quality control (QC) pipelines. It is robust enough to be used on different sequencing data types, important in studies that leverage the strengths of different sequencing technologies. In addition to its primary use case of sample swap detection, this method also provides information useful in QC, such as error rate and coverage bias, as well as population-level PCA ancestry analysis visualization.

Keywords: PCA-based population analysis; alignment-free; error rate estimation; low-sequence coverage analysis; quality control; sample mixup; sample-swap detection; sequencing technology agnostic.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms
  • Computational Biology / methods
  • High-Throughput Nucleotide Sequencing* / methods
  • Humans
  • Principal Component Analysis
  • Sequence Analysis, DNA* / methods
  • Software