A generative adversarial network for multiple reads reconstruction in DNA storage

Sci Rep. 2024 Dec 30;14(1):32071. doi: 10.1038/s41598-024-83806-5.

Abstract

DNA storage is widely considered a promising solution to the data explosion problem. However, the synthesis, PCR, and sequencing processes usually produce erroneous reads containing base insertions, deletions, and substitutions. The problem is even more severe with third-generation sequencing technologies. Unlike previous error-correction and multiple sequence alignment methods, we first transform the multiple reads into a noisy image, and then construct a conditional generative adversarial network to produce a "smooth" image that corresponds to the consensus sequence. Results on two real datasets demonstrate that our model can completely reconstruct the tested sequences with error rates as high as 5.9%. This means that the proposed DNA-GAN can be applied to third-generation nanopore sequencing environments, whereas the transformer-based models have only been tested on next-generation sequencing datasets. Furthermore, DNA-GAN exhibits excellent robustness even when as many as 20% of the clusters are contaminated with irrelevant reads. To the best of our knowledge, this work is the first to use a GAN for multi-read reconstruction in DNA-based storage.
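The abstract does not specify how reads are encoded as an image, so the following is only a minimal sketch of one plausible encoding, not the authors' method. It assumes each read in a cluster becomes one image row, each base becomes a one-hot vector over four channels, and reads are zero-padded or truncated to a fixed width. The function name `reads_to_image`, the `width` parameter, and the example cluster are all hypothetical.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order; the paper may use a different encoding


def reads_to_image(reads, width):
    """Stack a cluster of reads into a (num_reads, width, 4) one-hot array.

    Hypothetical helper: reads shorter than `width` are zero-padded and
    longer reads are truncated, so insertions and deletions show up as
    horizontal shifts between rows of the resulting "noisy image".
    """
    image = np.zeros((len(reads), width, len(BASES)), dtype=np.float32)
    for row, read in enumerate(reads):
        for col, base in enumerate(read[:width]):
            idx = BASES.find(base)
            if idx >= 0:  # unknown symbols (e.g. 'N') stay all-zero
                image[row, col, idx] = 1.0
    return image


# Illustrative cluster: one clean read, one with a deletion, one with an insertion.
cluster = ["ACGTAC", "ACTAC", "ACGGTAC"]
image = reads_to_image(cluster, width=6)
print(image.shape)
```

Under this assumption, a conditional GAN would take such an array as its conditioning input and output a cleaned array of the same width, from which the consensus sequence could be read off channel by channel.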

MeSH terms

  • Algorithms
  • DNA* / genetics
  • High-Throughput Nucleotide Sequencing* / methods
  • Humans
  • Neural Networks, Computer
  • Sequence Analysis, DNA* / methods

Substances

  • DNA