DeepSSV: detecting somatic small variants in paired tumor and normal sequencing data with convolutional neural network

Brief Bioinform. 2021 Jul 20;22(4):bbaa272. doi: 10.1093/bib/bbaa272.

Abstract

Detecting somatic mutations in paired tumor and normal sequencing data is of considerable interest. A number of callers based on statistical or machine learning approaches have been developed to detect somatic small variants. However, they consider only limited information about the reference and potential variant alleles in the tumor and normal samples at a candidate somatic site. They also differ in how they address biological and technological noise, and are therefore expected to produce divergent outputs. To overcome these drawbacks of existing somatic callers, we developed a deep learning-based tool called DeepSSV, which employs a convolutional neural network (CNN) model to learn increasingly abstract feature representations from the raw data in higher feature layers. DeepSSV creates a spatially oriented representation of the read alignments around each candidate somatic site, adapted for the convolutional architecture, which enables the model to widen its context and gather scattered evidence effectively. Moreover, DeepSSV incorporates the mapping information of both reference allele-supporting and variant allele-supporting reads in the tumor and normal samples at a genomic site, which is readily available in the pileup format file. Together, these inputs allow the CNN model to process the whole alignment information. Such representational richness allows the model to capture dependencies in the sequence and identify context-based sequencing artifacts. We trained the model on ground truth somatic mutations and benchmarked it on simulated and real tumors. The benchmarking results demonstrate that DeepSSV outperforms its state-of-the-art competitors in overall F1 score.
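To make the pileup-derived evidence concrete, the following is a minimal sketch of how reference-supporting and variant-supporting reads might be tallied, per strand, from the read-bases column of one pileup line. It is not DeepSSV's actual feature encoding, which the abstract does not specify. The function name `count_allele_support` and the output keys are hypothetical, and the sketch assumes the standard samtools pileup symbols (`.` and `,` for reference matches on the forward and reverse strand, upper- and lower-case letters for mismatches, `^` plus a mapping-quality character at a read start, `$` at a read end, and `+N`/`-N` followed by N bases for indels).

```python
from collections import Counter


def count_allele_support(bases: str, alt: str) -> Counter:
    """Tally ref- and alt-supporting reads in one pileup read-bases column.

    Hypothetical helper for illustration; handles a single-base variant only.
    """
    counts = Counter()
    alt = alt.upper()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # Read-start marker: the next character encodes mapping quality,
            # so skip both characters.
            i += 2
            continue
        if ch in "+-":
            # Indel: sign, decimal length, then that many bases. Skip it all.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        if ch == ".":
            counts["ref_fwd"] += 1
        elif ch == ",":
            counts["ref_rev"] += 1
        elif ch.upper() == alt:
            counts["alt_fwd" if ch.isupper() else "alt_rev"] += 1
        # Read ends ($), deletion placeholders (*), reference skips (< >),
        # and other mismatches are ignored in this sketch.
        i += 1
    return counts


# Example: one tumor pileup column, candidate variant allele "A".
tumor = count_allele_support("..,^I.A$a-2ctT+1gC", "A")
print(dict(tumor))
```

Running the same function on the tumor and the matched normal column at a site gives the four counts per sample that a caller could compare; a real pipeline would also carry base and mapping qualities, which this sketch omits.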

Keywords: convolutional neural network; deep learning; mapping information; paired tumor/normal sequencing data; somatic small variants.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Genomics
  • High-Throughput Nucleotide Sequencing*
  • Humans
  • Mutation*
  • Neoplasms / genetics*
  • Neoplasms / metabolism
  • Neural Networks, Computer*
  • Sequence Analysis, DNA*
  • Software*