Measurement of the effectiveness of transitive sequence comparison, through a third 'intermediate' sequence

M Gerstein

doi:10.1093/bioinformatics/14.8.707

Bioinformatics. 1998;14(8):707-14. doi: 10.1093/bioinformatics/14.8.707.

Author

M Gerstein¹

Affiliation

¹ Department of Molecular Biophysics and Biochemistry, 266 Whitney Avenue, Yale University, PO Box 208114, New Haven, CT 06520, USA. [email protected]

PMID: 9789096
DOI: 10.1093/bioinformatics/14.8.707

Abstract

Motivation: Transitive sequence matching expands the scope of sequence comparison by re-running the results of a given query against the databank as a new query. This sometimes results in the initial query sequence (Q) being related to a final match (M) indirectly, through a third, 'intermediate' sequence (Q --> I --> M). This approach has often been suggested as providing greater sensitivity in sequence comparison; however, it has not yet been possible to gauge its improvement precisely.
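
The Q --> I --> M linkage described above can be illustrated with a minimal sketch. It assumes that pairwise comparison scores are already available as (sequence, sequence, e-value) triples and that a match counts in both directions, which is a simplification. The sequence names, the e-values and the function names are hypothetical and are not taken from the paper; only the 0.001 cut-off comes from the abstract.

```python
from collections import defaultdict

E_CUTOFF = 1e-3  # cut-off quoted in the abstract; applied here to toy scores


def build_hit_index(scored_pairs, cutoff=E_CUTOFF):
    """Map each sequence id to the ids it matches directly at or below the cut-off.

    Assumption: matches are treated as symmetric (a hits b implies b hits a).
    """
    index = defaultdict(set)
    for a, b, evalue in scored_pairs:
        if evalue <= cutoff:
            index[a].add(b)
            index[b].add(a)
    return index


def transitive_matches(query, index):
    """Return {match: [intermediates]} for sequences reached only via Q -> I -> M."""
    direct = index.get(query, set())
    found = defaultdict(set)
    for intermediate in direct:
        # Re-run each direct hit as a new query against the same index.
        for match in index.get(intermediate, set()):
            if match != query and match not in direct:
                found[match].add(intermediate)
    return {m: sorted(i) for m, i in found.items()}


# Hypothetical toy scores: Q and M1 are not directly linked (e-value 0.5),
# but both match I1 significantly.
scored_pairs = [
    ("Q", "I1", 1e-8),
    ("I1", "M1", 1e-5),
    ("Q", "M1", 0.5),
    ("Q", "D", 1e-10),
    ("D", "I1", 1e-6),
    ("I1", "X", 0.2),
]
index = build_hit_index(scored_pairs)
result = transitive_matches("Q", index)
print(result)
```

In this toy case only M1 is reported, with I1 as its intermediate: D is already a direct hit of Q, and X never passes the cut-off.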

Results: Here, this improvement is comprehensively measured by seeing what fraction of the known structural relationships transitive sequence matching can uncover beyond that found by normal pairwise comparison (i.e. direct linkage). The structural relationships are taken from a well-characterized test set, the scop classification of protein structure. Specifically, 2055 known structural similarities (called 'pairs') between distantly related proteins constitute the basic test set. To make the measurement of transitive matching properly, special data sets, called 'baseline sets', are derived from this. They consist of pairs of sequences that have a clear structural relationship that cannot be found by normal sequence comparison (i.e. they cannot be directly linked). Specifically, using standard sequence comparison protocols (FASTA with an e-value cut-off of 0.001), it is found that the baseline set consists of 1742 pairs. A third intermediate sequence can link 86 of these indirectly (5%), where this third sequence is drawn from the entire, current universe of protein sequences. The number of false positives is minimal. Furthermore, when one considers only the relationships within the test set that correspond to a close structural alignment, the coverage increases considerably. In particular, 862 of the baseline set pairs fit to better than 2.6 Å RMS, and transitive matching can find 62 of these (9%).
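
The construction of a baseline set and the coverage measurement described above might be sketched as follows. This is a toy illustration under stated assumptions, not the paper's pipeline: the sequence ids, the neighbour lists and all function names are hypothetical, and a pair is taken to be transitively linked when its two members share at least one direct match.

```python
def pair(a, b):
    """Unordered pair of sequence ids (hypothetical helper)."""
    return frozenset((a, b))


def baseline_set(structural_pairs, direct_links):
    """Known structural pairs that direct sequence comparison fails to link."""
    return {p for p in structural_pairs if p not in direct_links}


def transitive_coverage(baseline, neighbours):
    """Return (linked pairs, fraction of the baseline they represent).

    neighbours maps each id to the ids it matches directly in the full
    sequence database; a shared neighbour acts as the intermediate.
    """
    linked = set()
    for p in baseline:
        a, b = tuple(p)
        if neighbours.get(a, set()) & neighbours.get(b, set()):
            linked.add(p)
    fraction = len(linked) / len(baseline) if baseline else 0.0
    return linked, fraction


# Hypothetical toy data: four structural pairs, one of them directly linked.
structural_pairs = {pair("a", "b"), pair("a", "c"), pair("b", "d"), pair("c", "d")}
direct_links = {pair("a", "b")}
neighbours = {"a": {"b", "u1"}, "b": {"a"}, "c": {"u1"}, "d": {"u2"}}

baseline = baseline_set(structural_pairs, direct_links)
linked, cov = transitive_coverage(baseline, neighbours)
print(len(baseline), len(linked), round(cov, 3))

# The same ratio applied to the counts quoted in the abstract:
# 86 transitively linked pairs out of a baseline of 1742.
reported = 86 / 1742
print(f"{reported:.1%}")
```

Here the pair (a, c) is recovered through the shared neighbour u1, giving a coverage of one in three on the toy baseline. For the quoted counts, 86 of 1742 is about 4.9%, which rounds to the 5% stated above.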

Availability: All the test data, including precise similarity values calculated from structural alignment, are available in tabular format over the Web from http://bioinfo.mbb.yale.edu/align.

Contact: [email protected]

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Amino Acid Sequence
Databases, Factual*
Protein Conformation
Proteins / chemistry*
Sequence Alignment*

Substances

Proteins