Dynamic sequence databank searching with templates and multiple alignment

W R Taylor

doi:10.1006/jmbi.1998.1853

J Mol Biol. 1998 Jul 17;280(3):375-406. doi: 10.1006/jmbi.1998.1853.

Author

W R Taylor¹

Affiliation

¹ Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, London, NW7 1AA, UK.

PMID: 9665844
DOI: 10.1006/jmbi.1998.1853

Abstract

Sequence databank searches are often performed iteratively, taking the results of a search to form a probe (either a pattern or profile) for a subsequent scan of the databank. The advantage of this approach is that, as more sequences are drawn into the probe, it should, in principle, be possible to detect increasingly distant members of the family. This approach works well when supervised by an "expert" who has a good "eye" for the quality of the sequence alignment and whether novel matches should be rejected or incorporated into the probe. However, all attempts to automate the process have proved difficult, as the process is inherently unstable. Errors in the alignment, or the misalignment of a non-family member, lead to a deterioration of the probe specificity, so allowing further incorrect sequences to be identified. Here, a combination of two methods is used to provide a check on such instability. A pattern-matching (template) search method is used (with a BLAST-like pre-filter for speed) to return sequence segments for alignment in a standard multiple alignment program (MULTAL). Sequences are aligned only to a fixed limit of similarity and any sequences or sub-families that have not joined the original "seed" family are rejected. The remaining core family then provides the basis for a subsequent pattern derivation and databank search. The constant check by the multiple alignment phase allows the search phase to be pushed continually towards the boundary of similarity. This is maintained by lowering the cutoff on the scores of acceptable sequences each time the family remains the same over successive search cycles. The procedure was observed to be stable under misalignments and to have an ability to recognise distantly related family members across super-families that was comparable to PSI-BLAST. The method is applied to the analysis of the hormone-binding domains of the insulin and related growth-factor receptors.
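
The control loop described in the abstract (search, align to a fixed similarity limit, reject whatever has not joined the seed, lower the score cutoff when the family is unchanged) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`score`, `search`, `join_seed`, `dynamic_search`), the identity-count score that stands in for the template search, the single-linkage growth that stands in for MULTAL, and the toy databank `BANK` with its parameter values.

```python
from typing import Dict, Set, Tuple

def score(a: str, b: str) -> int:
    # Toy similarity: number of identical positions (equal-length strings assumed).
    return sum(x == y for x, y in zip(a, b))

def search(bank: Dict[str, str], family: Set[str], cutoff: int) -> Set[str]:
    # Stand-in for the template search: accept any sequence whose best
    # score against a current family member reaches the cutoff.
    return {name for name, seq in bank.items()
            if max(score(seq, bank[m]) for m in family) >= cutoff}

def join_seed(bank: Dict[str, str], candidates: Set[str],
              seed: Set[str], limit: int) -> Set[str]:
    # Stand-in for the alignment check: grow outwards from the seed by
    # single linkage at a fixed similarity limit. Candidates (or whole
    # sub-families) that never link to the seed are left out.
    joined = set(seed)
    grew = True
    while grew:
        grew = False
        for name in sorted(candidates - joined):
            if any(score(bank[name], bank[m]) >= limit for m in joined):
                joined.add(name)
                grew = True
    return joined

def dynamic_search(bank: Dict[str, str], seed: Set[str], limit: int,
                   cutoff: int, min_cutoff: int, step: int,
                   max_cycles: int = 50) -> Tuple[Set[str], int]:
    family = set(seed)
    for _ in range(max_cycles):
        hits = search(bank, family, cutoff)
        core = join_seed(bank, family | hits, seed, limit)
        if core == family:
            # Family unchanged over this cycle: push the search further
            # out by lowering the cutoff, or stop at the floor.
            if cutoff <= min_cutoff:
                break
            cutoff = max(min_cutoff, cutoff - step)
        family = core
    return family, cutoff

# Hypothetical databank: s1..s4 form a chain of related sequences,
# x1 and x2 resemble each other but not the seed family.
BANK = {
    "s1": "AAAAAAAAAA",
    "s2": "AAAAAAAACC",
    "s3": "AAAAAACCCC",
    "s4": "AAAACCCCCC",
    "x1": "GGGGGGGGGA",
    "x2": "AAGGGGGGGG",
}

family, final_cutoff = dynamic_search(BANK, {"s1"}, limit=7,
                                      cutoff=9, min_cutoff=1, step=2)
print(sorted(family), final_cutoff)
```

In this toy run the family grows from `s1` to include `s2`, `s3` and `s4` one cycle at a time. Once the cutoff has dropped to its floor, `x1` and `x2` are returned by the search but rejected by the alignment check, because they link only to each other and not to the seed family.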

MeSH terms

Azurin / chemistry
Databases as Topic*
GTP-Binding Proteins / chemistry
Globins / chemistry
Hemoglobins / chemistry
Information Storage and Retrieval*
Leghemoglobin / chemistry
Myoglobin / chemistry
Plastocyanin / chemistry
Sequence Alignment*
ras Proteins / chemistry

Substances

Hemoglobins
Leghemoglobin
Myoglobin
Azurin
Globins
Plastocyanin
GTP-Binding Proteins
ras Proteins