SuRankCo: supervised ranking of contigs in de novo assemblies

Mathias Kuhring; Piotr Wojtek Dabrowski; Vitor C Piro; Andreas Nitsche; Bernhard Y Renard

doi:10.1186/s12859-015-0644-7

BMC Bioinformatics. 2015 Jul 30;16:240. doi: 10.1186/s12859-015-0644-7.

Authors

Mathias Kuhring¹ ², Piotr Wojtek Dabrowski³ ⁴, Vitor C Piro⁵, Andreas Nitsche⁶, Bernhard Y Renard⁷

Affiliations

¹ Central Administration 4 (IT), Robert Koch Institute, Berlin, Germany.
² Centre for Biological Threats and Special Pathogens (ZBS 1), Robert Koch Institute, Berlin, Germany.
³ Research Group Bioinformatics (NG4), Robert Koch Institute, Berlin, Germany.
⁴ CAPES Foundation, Ministry of Education of Brazil, Brasília - DF, 70040-020, Brazil.
⁵ Centre for Biological Threats and Special Pathogens (ZBS 1), Robert Koch Institute, Berlin, Germany.
⁶ Research Group Bioinformatics (NG4), Robert Koch Institute, Berlin, Germany.
⁷ Research Group Bioinformatics (NG4), Robert Koch Institute, Berlin, Germany.

Abstract

Background: Evaluating the quality and reliability of a de novo assembly, and of single contigs in particular, is challenging because a ground truth is usually not available and numerous factors may influence the results. Currently available procedures provide assembly scores but lack a comparative quality ranking of the contigs within an assembly.

Results: We present SuRankCo, which relies on a machine learning approach to predict quality scores for contigs and to enable the ranking of contigs within an assembly. The result is a sorted contig set which allows selective contig usage in downstream analysis. Benchmarking on datasets with known ground truth shows promising sensitivity and specificity and favorable comparison to existing methodology.
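The idea of ranking contigs by a predicted score and then checking sensitivity and specificity against a known ground truth can be illustrated with a minimal Python sketch. This is not SuRankCo's implementation: the `Contig` type, the scores, the ground-truth labels, and the 0.5 threshold are all hypothetical placeholders, and the scores are typed in by hand rather than produced by a trained model.

```python
from typing import List, NamedTuple, Tuple


class Contig(NamedTuple):
    name: str
    score: float   # hypothetical predicted quality score in [0, 1]
    is_good: bool  # hypothetical ground-truth label (only known in a benchmark)


def rank_contigs(contigs: List[Contig]) -> List[Contig]:
    """Return the contigs sorted from highest to lowest predicted score."""
    return sorted(contigs, key=lambda c: c.score, reverse=True)


def sensitivity_specificity(contigs: List[Contig], threshold: float) -> Tuple[float, float]:
    """Treat score >= threshold as a 'good' prediction and compare to the labels."""
    tp = sum(1 for c in contigs if c.score >= threshold and c.is_good)
    fn = sum(1 for c in contigs if c.score < threshold and c.is_good)
    tn = sum(1 for c in contigs if c.score < threshold and not c.is_good)
    fp = sum(1 for c in contigs if c.score >= threshold and not c.is_good)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Made-up example data, for illustration only.
contigs = [
    Contig("contig_1", 0.91, True),
    Contig("contig_2", 0.35, False),
    Contig("contig_3", 0.78, True),
    Contig("contig_4", 0.62, False),
    Contig("contig_5", 0.48, True),
]

ranked = rank_contigs(contigs)
sens, spec = sensitivity_specificity(contigs, threshold=0.5)

for c in ranked:
    print(f"{c.name}\t{c.score:.2f}")
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In a setting like the one described, the sorted list is what a downstream analysis would consume, for example by keeping only the top-ranked contigs. Sweeping the threshold over all observed scores would give the points of a ROC curve.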

Conclusions: SuRankCo analyzes the reliability of de novo assemblies on the contig level and thereby allows quality control and ranking prior to further downstream and validation experiments.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Contig Mapping / methods*
Escherichia coli / genetics
Escherichia coli / metabolism
ROC Curve
Software*