Resolving the problem of multiple accessions of the same transcript deposited across various public databases

Tyler Weirick; David John; Shizuka Uchida

doi:10.1093/bib/bbw017

Brief Bioinform. 2017 Mar 1;18(2):226-235. doi: 10.1093/bib/bbw017.

Authors

Tyler Weirick¹, David John¹, Shizuka Uchida¹

Affiliation

¹ Institute of Cardiovascular Regeneration, Centre for Molecular Medicine, Goethe University Frankfurt and German Center for Cardiovascular Research, Partner Site Rhein-Main, Frankfurt am Main, Germany.

PMID: 26921280
DOI: 10.1093/bib/bbw017

Abstract

Maintaining the consistency of genomic annotations is an increasingly complex task because of the iterative and dynamic nature of assembly and annotation, growing numbers of biological databases and insufficient integration of annotations across databases. Because information exchange among databases is poor, a sequence reported as 'novel' relative to one reference annotation may already be annotated in another. Furthermore, relationships to nearby or overlapping annotated transcripts become even more complicated when different genome assemblies are used. To better understand these problems, we surveyed current and previous versions of genomic assemblies and annotations across a number of public databases containing long noncoding RNAs. We identified numerous discrepancies among transcripts in their genomic locations, lengths and identifiers. Further investigation showed that positional differences between reference annotations of essentially the same transcript could lead to differences in its measured expression at the RNA level. To aid in resolving these problems, we present the 'Universal Genomic Accession Hash (UGAHash)' algorithm and an open-source web tool to encourage its use. The UGAHash web tool (http://ugahash.uni-frankfurt.de) can be accessed freely without registration. The web tool allows researchers to generate Universal Genomic Accessions for genomic features or to explore annotations deposited in past and present versions of the public databases. We anticipate that the UGAHash web tool will be valuable for checking whether a transcript already exists before newly discovered transcripts are judged to be novel.

Keywords: accession numbers; accession system; annotation scheme; databases; hashing algorithm; lncRNA; novel transcripts.
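
The abstract does not spell out how UGAHash derives an accession, so the following is only a minimal sketch of the general idea of hashing a feature's genomic coordinates into a database-independent identifier. It is not the published UGAHash algorithm. The function name coordinate_accession, the "DEMO" prefix, the choice of SHA-256, the key layout and the example coordinates are all assumptions made for illustration.

```python
import hashlib


def coordinate_accession(assembly, chrom, strand, exons, prefix="DEMO"):
    """Hypothetical helper (not UGAHash itself): build a deterministic ID
    from the assembly name, chromosome, strand and exon coordinates."""
    # Sort exons so the ID does not depend on the order they were listed in.
    ordered = sorted((int(start), int(end)) for start, end in exons)
    exon_text = ",".join(f"{start}-{end}" for start, end in ordered)
    # The assembly is part of the key: the same coordinates on a different
    # assembly describe a different feature and should get a different ID.
    key = "|".join([assembly, chrom, strand, exon_text])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{prefix}_{digest[:16]}"


# Made-up coordinates standing in for one transcript as it might be
# recorded, under different local accessions, in two databases.
record_a = coordinate_accession("hg19", "chr1", "+", [(100, 200), (300, 400)])
record_b = coordinate_accession("hg19", "chr1", "+", [(300, 400), (100, 200)])
other_assembly = coordinate_accession("hg38", "chr1", "+", [(100, 200), (300, 400)])

print(record_a == record_b)        # identical coordinates, identical ID
print(record_a == other_assembly)  # different assembly, different ID
```

Under these assumptions, two databases that record the same coordinates on the same assembly would arrive at the same identifier without exchanging any information. A transcript whose boundaries differ by even one base would receive a different identifier, so a scheme like this detects exact positional matches only.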

MeSH terms

Algorithms
Databases, Genetic*
Genome
Genomics
Molecular Sequence Annotation
RNA, Long Noncoding
Software

Substances

RNA, Long Noncoding