PSI-BLAST pseudocounts and the minimum description length principle

Nucleic Acids Res. 2009 Feb;37(3):815-24. doi: 10.1093/nar/gkn981. Epub 2008 Dec 16.

Abstract

Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
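As a rough illustration of how pseudocounts enter a column score, the sketch below mixes observed counts with background-proportional pseudocounts and converts the result to log-odds scores. It is a generic textbook-style scheme and not PSI-BLAST's actual procedure, which the abstract does not spell out. The function names, the four-letter toy alphabet, the uniform background, and the pseudocount totals are all assumptions made for the example.

```python
import math

# Hypothetical toy alphabet with a uniform background; real PSSMs use
# the 20 amino acids and empirically derived background frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def column_probabilities(counts, background, pseudocount_total):
    """Estimate residue probabilities for one alignment column.

    Each residue receives pseudocount_total * background[residue]
    pseudocounts on top of its observed count (an assumed, simple
    additive scheme).
    """
    observed_total = sum(counts.values())
    denominator = observed_total + pseudocount_total
    return {
        residue: (counts.get(residue, 0) + pseudocount_total * p) / denominator
        for residue, p in background.items()
    }


def column_scores(counts, background, pseudocount_total):
    """Log-odds scores (in bits) of the estimated probabilities vs. background."""
    probs = column_probabilities(counts, background, pseudocount_total)
    return {r: math.log2(probs[r] / background[r]) for r in background}


# A perfectly conserved column of eight sequences, scored with a larger
# and a smaller (assumed) pseudocount total.
conserved = {"A": 8}
scores_many = column_scores(conserved, BACKGROUND, pseudocount_total=4.0)
scores_few = column_scores(conserved, BACKGROUND, pseudocount_total=1.0)
```

In this toy setting, lowering the pseudocount total for the conserved column raises the score of the conserved residue and lowers the scores of the others, which is the kind of increased contrast the abstract describes for highly conserved columns.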

Publication types

  • Research Support, N.I.H., Intramural

MeSH terms

  • Databases, Protein
  • Sequence Alignment / methods*
  • Sequence Analysis, Protein / methods*