Sequence databank searches are often performed iteratively, taking the results of one search to form a probe (either a pattern or a profile) for a subsequent scan of the databank. The advantage of this approach is that, as more sequences are drawn into the probe, it should, in principle, be possible to detect increasingly distant members of the family. This approach works well when supervised by an "expert" who has a good "eye" for the quality of the sequence alignment and for whether novel matches should be rejected or incorporated into the probe. However, attempts to automate the process have proved difficult because it is inherently unstable. Errors in the alignment, or the misalignment of a non-family member, degrade the specificity of the probe, which in turn allows further incorrect sequences to be identified. Here, a combination of two methods is used to provide a check on such instability. A pattern-matching (template) search method is used (with a BLAST-like pre-filter for speed) to return sequence segments for alignment in a standard multiple alignment program (MULTAL). Sequences are aligned only to a fixed limit of similarity, and any sequences or sub-families that have not joined the original "seed" family are rejected. The remaining core family then provides the basis for a subsequent pattern derivation and databank search. The constant check by the multiple alignment phase allows the search phase to be pushed continually towards the boundary of similarity. This pressure is maintained by lowering the cutoff on the scores of acceptable sequences each time the family remains the same over successive search cycles. The procedure was observed to be stable under misalignments and to have an ability to recognise distantly related family members across super-families that was comparable to PSI-BLAST. The method is applied to the analysis of the hormone-binding domains of the insulin and related growth-factor receptors.
Copyright 1998 Academic Press.
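
The search/align cycle described in the abstract can be illustrated with a minimal Python sketch of the control loop alone. It is not the published implementation: the function names (iterate_search, toy_search, toy_align_core), the parameters (start_cutoff, min_cutoff, step, max_cycles) and the toy scores are all hypothetical. The template search and the MULTAL alignment are replaced by trivial stand-ins so that the example runs on its own.

```python
from typing import Callable, FrozenSet, Set

# Hypothetical interfaces (assumptions, not taken from the paper):
#   search(family, cutoff)       -> names of databank hits scoring >= cutoff
#   align_core(seed, candidates) -> candidates that join the seed family
Search = Callable[[FrozenSet[str], float], Set[str]]
AlignCore = Callable[[FrozenSet[str], Set[str]], Set[str]]


def iterate_search(seed: Set[str], search: Search, align_core: AlignCore,
                   start_cutoff: float, min_cutoff: float, step: float,
                   max_cycles: int = 50) -> FrozenSet[str]:
    """Alternate search and alignment check, lowering the cutoff when stable."""
    seed_family = frozenset(seed)
    family = seed_family
    cutoff = start_cutoff
    for _ in range(max_cycles):
        hits = search(family, cutoff)
        # Alignment check: keep only sequences that join the seed family.
        core = frozenset(align_core(seed_family, set(family) | hits))
        if core == family:
            # Family unchanged over this cycle: lower the score cutoff.
            cutoff -= step
            if cutoff < min_cutoff:
                break
        else:
            family = core
    return family


# Toy stand-ins, purely illustrative. Scores are invented numbers.
TOY_SCORES = {"a": 9.0, "b": 7.0, "c": 5.0, "x": 6.0, "d": 3.0}
TOY_TRUE_FAMILY = {"seed", "a", "b", "c", "d"}  # "x" is a false positive


def toy_search(family: FrozenSet[str], cutoff: float) -> Set[str]:
    # Stand-in for the pattern search: ignores the probe, thresholds fixed scores.
    return {name for name, score in TOY_SCORES.items() if score >= cutoff}


def toy_align_core(seed: FrozenSet[str], candidates: Set[str]) -> Set[str]:
    # Stand-in for the alignment phase: rejects anything outside the toy family.
    return {name for name in candidates if name in TOY_TRUE_FAMILY} | set(seed)


if __name__ == "__main__":
    result = iterate_search({"seed"}, toy_search, toy_align_core,
                            start_cutoff=8.0, min_cutoff=4.0, step=1.0)
    print(sorted(result))
```

In this toy run the false positive "x" is returned by the search once the cutoff drops far enough, but it is rejected at every alignment check and so never enters the probe. The weak member "d" is never found because its score lies below the minimum cutoff.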