Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold

William R Pearson; Weizhong Li; Rodrigo Lopez

doi:10.1093/nar/gkw1207

Nucleic Acids Res. 2017 Apr 20;45(7):e46. doi: 10.1093/nar/gkw1207.

Authors

William R Pearson¹, Weizhong Li², Rodrigo Lopez²

Affiliations

¹ Dept. of Biochemistry and Molecular Genetics, University of Virginia, School of Medicine, Charlottesville, VA 22908, USA.
² European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

Abstract

Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position-specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic of a protein family. But these models are subject to contamination; once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resulting PSSM rarely produces alignment over-extensions or alignments to unrelated sequences. This simple step, which tends to anchor the PSSM to the original query sequence and slightly increase target percent identity, can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast and jackhmmer, with little loss in search sensitivity.
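The query-seeding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `query_seed_row` and the example sequences are hypothetical, and the sketch assumes a single gapped pairwise alignment given as two equal-length strings with `-` as the gap character. It covers only gapped positions inside the aligned region, not unaligned ends.

```python
def query_seed_row(aligned_query: str, aligned_subject: str, gap: str = "-") -> str:
    """Build one query-seeded row of a query-based MSA (illustrative only).

    Columns where the query has a gap are dropped, because a query-based
    MSA has exactly one column per query residue. Columns where the
    subject has a gap are filled with the query residue, which is the
    "seeding" step.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")

    row = []
    for q, s in zip(aligned_query, aligned_subject):
        if q == gap:
            # Subject insertion relative to the query: no MSA column exists.
            continue
        # Gap in the subject: seed the position with the query residue.
        row.append(q if s == gap else s)
    return "".join(row)


# Hypothetical toy alignment, not taken from the paper.
query = "MKT-AYIAKQR"
subject = "MRTGA--SKQK"
print(query_seed_row(query, subject))  # prints MRTAYISKQK
```

In this toy case the two gapped subject positions receive the query residues `Y` and `I`, so the row contributed to the PSSM is slightly more query-like than the unseeded alignment would be, which matches the anchoring effect the abstract describes.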

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Protein Domains
Sequence Alignment / methods*
Sequence Analysis, Protein / methods*
Software