vALId: validation of protein sequence quality based on multiple alignment data

J Bioinform Comput Biol. 2005 Aug;3(4):929-47. doi: 10.1142/s0219720005001326.

Abstract

Sequence validation is essential for accurate phylogenetic and structure/function analyses. However, among the thousands of protein sequences available in public databases, most have been predicted in silico and have not systematically undergone quality verification. It has recently become evident that they often contain sequence errors. To address the problem of automatic protein quality control, we have developed vALId, an interactive software tool with a web interface. Taking advantage of high-quality multiple alignments of complete protein sequences (MACS), vALId first warns about the presence of suspicious insertions and deletions (indels) and divergent segments, and then proposes corrections based on transcripts and genome contigs. In a first evaluation test, hundreds of indels and divergent segments were randomly generated in a manually refined MACS. The sensitivity (Sn) and specificity (Sp) of indel detection were excellent (0.96), while the mean Sn (0.49) and Sp (0.56) of divergent segment delineation depended on the percent identity between sequence neighbors. In a second test, 6195 sequences in 100 MACS corresponding to different functional and structural protein families were analyzed. Of these sequences, 65% were in silico predictions, and 44% of the predicted eukaryotic proteins were partially incorrect, with at least one suspicious indel or divergent segment.
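
The abstract reports Sn and Sp without defining them. As a minimal sketch, assuming the convention often used in sequence-prediction benchmarks, where Sn = TP / (TP + FN) and Sp = TP / (TP + FP), the scores for a set of randomly planted errors could be computed as follows. The function names and the toy data are hypothetical and are not taken from vALId; the paper itself may define Sp differently.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of planted errors that were detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity_ppv(tp: int, fp: int) -> float:
    """Fraction of reported errors that are real: TP / (TP + FP).

    Assumption: this is the predictive-value form of Sp, not TN / (TN + FP).
    """
    return tp / (tp + fp) if (tp + fp) else 0.0


def score_predictions(planted: set, predicted: set) -> tuple:
    """Compare planted errors with reported errors and return (Sn, Sp)."""
    tp = len(planted & predicted)   # planted and reported
    fn = len(planted - predicted)   # planted but missed
    fp = len(predicted - planted)   # reported but never planted
    return sensitivity(tp, fn), specificity_ppv(tp, fp)


# Hypothetical toy data: each indel is identified by (sequence id, alignment column).
planted = {("seq1", 10), ("seq2", 42), ("seq3", 7), ("seq4", 88)}
predicted = {("seq1", 10), ("seq2", 42), ("seq3", 7), ("seq5", 3)}

sn, sp = score_predictions(planted, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")  # prints "Sn = 0.75, Sp = 0.75"
```

In this toy case three of the four planted indels are recovered and one reported indel is spurious, so both scores are 0.75.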

Publication types

  • Evaluation Study
  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • Algorithms*
  • Amino Acid Sequence
  • Molecular Sequence Data
  • Proteins / analysis*
  • Proteins / chemistry*
  • Quality Control
  • Sequence Alignment / methods*
  • Sequence Analysis, Protein / methods*
  • Software*
  • User-Computer Interface*

Substances

  • Proteins