Resolving the problem of multiple accessions of the same transcript deposited across various public databases

Brief Bioinform. 2017 Mar 1;18(2):226-235. doi: 10.1093/bib/bbw017.

Abstract

Maintaining the consistency of genomic annotations is an increasingly complex task because of the iterative and dynamic nature of assembly and annotation, the growing number of biological databases and insufficient integration of annotations across databases. Because information exchange among databases is poor, a sequence reported as 'novel' relative to one reference annotation may already be annotated in another. Furthermore, relationships to nearby or overlapping annotated transcripts become even more complicated when different genome assemblies are used. To better understand these problems, we surveyed current and previous versions of genomic assemblies and annotations across a number of public databases containing long noncoding RNAs. We identified numerous discrepancies among transcripts in their genomic locations, transcript lengths and identifiers. Further investigation showed that positional differences between reference annotations of essentially the same transcript could lead to differences in its measured expression at the RNA level. To aid in resolving these problems, we present the 'Universal Genomic Accession Hash (UGAHash)' algorithm and an open-source web tool created to encourage its use. The UGAHash web tool (http://ugahash.uni-frankfurt.de) can be accessed freely without registration. It allows researchers to generate Universal Genomic Accessions for genomic features or to explore annotations deposited in past and present versions of public databases. We anticipate that the UGAHash web tool will be valuable for checking whether a transcript already exists in these databases before a newly discovered transcript is judged to be novel.
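
The abstract does not describe how UGAHash computes an accession, so the following is only a hypothetical sketch of the general idea: a deterministic identifier derived from a feature's genomic coordinates, so that the same transcript yields the same identifier regardless of which database reports it. The function name `coordinate_accession`, the key layout, the use of SHA-1 and the 12-character length are all assumptions made for illustration and are not taken from the published algorithm.

```python
import hashlib


def coordinate_accession(assembly, chrom, strand, exons, length=12):
    """Hypothetical coordinate-based identifier (NOT the published UGAHash).

    assembly: genome assembly name, e.g. "hg19"
    chrom:    chromosome name, e.g. "chr1"
    strand:   "+" or "-"
    exons:    iterable of (start, end) integer pairs
    """
    # Sort exons so that the order in which they are listed does not matter.
    blocks = ",".join(f"{start}-{end}" for start, end in sorted(exons))
    # Assumed key layout: every field that defines the feature's position.
    key = f"{assembly}|{chrom}|{strand}|{blocks}"
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # Truncate for readability; a shorter ID raises the collision risk.
    return digest[:length].upper()


# The same transcript listed with exons in a different order.
id_a = coordinate_accession("hg19", "chr1", "+", [(100, 200), (300, 400)])
id_b = coordinate_accession("hg19", "chr1", "+", [(300, 400), (100, 200)])
# The same coordinates reported on a different assembly.
id_c = coordinate_accession("hg38", "chr1", "+", [(100, 200), (300, 400)])
```

In this sketch the assembly name is part of the key, so identical coordinates on two assemblies produce different identifiers. That reflects the point made above that positions are only comparable within a single assembly; comparing across assemblies would first require converting the coordinates.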

Keywords: accession numbers; accession system; annotation scheme; databases; hashing algorithm; lncRNA; novel transcripts.

MeSH terms

  • Algorithms
  • Databases, Genetic*
  • Genome
  • Genomics
  • Molecular Sequence Annotation
  • RNA, Long Noncoding
  • Software

Substances

  • RNA, Long Noncoding