PSI-BLAST pseudocounts and the minimum description length principle

Stephen F Altschul; E Michael Gertz; Richa Agarwala; Alejandro A Schäffer; Yi-Kuo Yu

doi:10.1093/nar/gkn981

Nucleic Acids Res. 2009 Feb;37(3):815-24. doi: 10.1093/nar/gkn981. Epub 2008 Dec 16.

Authors

Stephen F Altschul¹, E Michael Gertz, Richa Agarwala, Alejandro A Schäffer, Yi-Kuo Yu

Affiliation

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

Abstract

Position-specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
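
As a rough illustration of how the number of pseudocounts affects a column's scores, the sketch below blends observed and prior frequencies as Q = (alpha*f + beta*g) / (alpha + beta), where beta plays the role of the number of pseudocounts. It is a toy under stated assumptions, not the method introduced in the article: the three-letter alphabet, the values of alpha and beta, the function names, and the use of background frequencies as the pseudocount prior are all hypothetical simplifications chosen for illustration.

```python
import math

def column_estimate(observed_freqs, prior_freqs, alpha, beta):
    """Blend observed and prior frequencies: Q = (alpha*f + beta*g) / (alpha + beta).

    alpha: weight of the observed data (assumed here to be a fixed number).
    beta:  number of pseudocounts (illustrative values only).
    """
    total = alpha + beta
    return [(alpha * f + beta * g) / total
            for f, g in zip(observed_freqs, prior_freqs)]

def log_odds_scores(estimates, background):
    """Unscaled log-odds score (in bits) for each letter in the column."""
    return [math.log2(q / p) for q, p in zip(estimates, background)]

# Hypothetical three-letter alphabet; real PSSM columns cover 20 amino acids.
background = [0.5, 0.3, 0.2]
# A perfectly conserved column: only the first letter is observed.
observed = [1.0, 0.0, 0.0]
alpha = 9.0  # assumed weight of the observations

# Simplification: the background frequencies stand in for the pseudocount prior.
many = column_estimate(observed, background, alpha, beta=10.0)
few = column_estimate(observed, background, alpha, beta=2.0)

scores_many = log_odds_scores(many, background)
scores_few = log_odds_scores(few, background)

print("beta=10:", [round(s, 3) for s in scores_many])
print("beta=2: ", [round(s, 3) for s in scores_few])
```

With fewer pseudocounts the estimate stays closer to the observed frequencies, so the conserved letter scores higher and the unobserved letters score lower. This is the sense in which fewer pseudocounts for conserved columns would increase contrast.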

Publication types

Research Support, N.I.H., Intramural

MeSH terms

Databases, Protein
Sequence Alignment / methods*
Sequence Analysis, Protein / methods*

Grants and funding

Intramural NIH HHS/United States