Towards a reliable objective function for multiple sequence alignments

J Mol Biol. 2001 Dec 7;314(4):937-51. doi: 10.1006/jmbi.2001.5187.

Abstract

Multiple sequence alignment is a fundamental tool in many areas of modern molecular biology, including functional and evolutionary studies of a protein family. Multiple alignments also play an essential role in the new integrated systems for genome annotation and analysis. Thus, the development of new multiple alignment scores and statistics is essential, in the spirit of the work dedicated to the evaluation of pairwise sequence alignments for database searching techniques. We present here norMD, a new objective scoring function for multiple sequence alignments. NorMD combines the advantages of column-scoring techniques with the sensitivity of methods incorporating residue similarity scores. In addition, norMD incorporates ab initio sequence information, such as the number, length and similarity of the sequences to be aligned. The sensitivity and reliability of the norMD objective function are demonstrated using structural alignments in the SCOP and BAliBASE databases. The norMD scores are then applied to the multiple alignments of the complete sequences (MACS) detected by BlastP with E-value < 10, for a set of 734 hypothetical proteins encoded by the Vibrio cholerae genome. Unrelated or badly aligned sequences were automatically removed from the MACS, leaving a high-quality multiple alignment that could be reliably exploited in a subsequent functional and/or structural annotation process. After removal of unreliable sequences, 176 (24%) of the alignments contained at least one sequence with a functional annotation. Of these new matches, 103 were supported by significant hits to the InterPro domain and motif database.
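
The abstract does not give the norMD formula, so the following is only a minimal illustrative sketch of the general idea it describes: score each alignment column from pairwise residue similarities, average over columns, and then drop sequences until the score reaches a cutoff. Everything here is an assumption made for illustration, not the published method: the function names (`pair_score`, `column_score`, `alignment_score`, `drop_worst_sequences`), the toy match/mismatch similarity standing in for a real substitution matrix, the greedy removal strategy, and the threshold value.

```python
from itertools import combinations

GAP = "-"  # gap symbol assumed for this sketch


def pair_score(a, b, match=1.0, mismatch=0.0):
    """Toy residue similarity (hypothetical stand-in for a substitution matrix)."""
    if a == GAP or b == GAP:
        return 0.0
    return match if a == b else mismatch


def column_score(column):
    """Mean pairwise similarity of the residues in one alignment column."""
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0
    return sum(pair_score(a, b) for a, b in pairs) / len(pairs)


def alignment_score(rows):
    """Average column score over an alignment given as equal-length strings."""
    if not rows or not rows[0]:
        return 0.0
    columns = zip(*rows)  # transpose rows into columns
    return sum(column_score(col) for col in columns) / len(rows[0])


def drop_worst_sequences(rows, threshold):
    """Greedily remove sequences until the score reaches the threshold.

    At each step, the sequence whose removal yields the highest score
    is dropped. At least two sequences are always kept.
    """
    rows = list(rows)
    while len(rows) > 2 and alignment_score(rows) < threshold:
        drop_idx = max(
            range(len(rows)),
            key=lambda i: alignment_score(rows[:i] + rows[i + 1:]),
        )
        del rows[drop_idx]
    return rows


if __name__ == "__main__":
    alignment = ["ACDE", "ACDE", "ACDF", "WWWW"]
    print(alignment_score(alignment))
    print(drop_worst_sequences(alignment, threshold=0.8))
```

In this toy input the unrelated sequence `WWWW` lowers every column score, so it is the first one removed. A real objective function such as norMD would additionally account for the number, length and similarity of the sequences, as the abstract notes.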

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Motifs
  • Amino Acid Sequence
  • Archaeal Proteins / chemistry
  • Archaeal Proteins / genetics
  • Bacterial Proteins / chemistry*
  • Bacterial Proteins / genetics
  • Bacterial Proteins / metabolism*
  • Computational Biology / methods*
  • Databases, Genetic
  • Eukaryotic Cells / metabolism
  • Genome, Bacterial
  • Genomics / methods
  • Models, Molecular
  • Molecular Sequence Data
  • Protein Structure, Tertiary
  • Reproducibility of Results
  • Research Design
  • Sensitivity and Specificity
  • Sequence Alignment / methods*
  • Software
  • Vibrio cholerae / chemistry*
  • Vibrio cholerae / genetics
  • Vibrio cholerae / pathogenicity

Substances

  • Archaeal Proteins
  • Bacterial Proteins