A Sequence-Based Novel Approach for Quality Evaluation of Third-Generation Sequencing Reads

Genes (Basel). 2019 Jan 14;10(1):44. doi: 10.3390/genes10010044.

Abstract

The advent of third-generation sequencing (TGS) technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, provides new possibilities for contig assembly, scaffolding, and high-performance computing in bioinformatics due to its long reads. However, the high error rate and poor quality of TGS reads provide new challenges for accurate genome assembly and long-read alignment. Efficient processing methods are in need to prioritize high-quality reads for improving the results of error correction and assembly. In this study, we proposed a novel Read Quality Evaluation and Selection Tool (REQUEST) for evaluating the quality of third-generation long reads. REQUEST generates training data of high-quality and low-quality reads which are characterized by their nucleotide combinations. A linear regression model was built to score the quality of reads. The method was tested on three datasets of different species. The results showed that the top-scored reads prioritized by REQUEST achieved higher alignment accuracies. The contig assembly results based on the top-scored reads also outperformed conventional approaches that use all reads. REQUEST is able to distinguish high-quality reads from low-quality ones without using reference genomes, making it a promising alternative sequence-quality evaluation method to alignment-based algorithms.

Keywords: genomics; read quality assessment; third-generation sequencing.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Drosophila melanogaster
  • Escherichia coli
  • High-Throughput Nucleotide Sequencing / methods
  • High-Throughput Nucleotide Sequencing / standards*
  • Quality Control*
  • Sequence Analysis, DNA / methods
  • Sequence Analysis, DNA / standards*
  • Software*
  • Yersinia pestis