A set-theoretic approach to database searching and clustering

A Krause; M Vingron

doi:10.1093/bioinformatics/14.5.430

Bioinformatics. 1998 Jun;14(5):430-8. doi: 10.1093/bioinformatics/14.5.430.

Authors

A Krause¹, M Vingron

Affiliation

¹ Deutsches Krebsforschungszentrum (DKFZ), Theoretische Bioinformatik, Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany.

PMID: 9682056
DOI: 10.1093/bioinformatics/14.5.430

Abstract

Motivation: In this paper, we introduce an iterative method of database searching and use it to design a clustering algorithm applicable to an entire protein database. The clustering procedure relies on the quality of the database searching routine and further improves its results through a set-theoretic analysis of a cluster system that is highly redundant yet efficient to generate.

Results: Overall, we achieve unambiguous assignment of 80% of SWISS-PROT sequences to non-overlapping sequence clusters in an entirely automatic fashion. Our results are compared to an expert-generated clustering for validation. The database searching method is fast, and the clustering technique does not require a time-consuming all-against-all comparison. This allows fast clustering of large numbers of sequences.
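
The abstract gives no algorithmic detail, so the following is only an illustrative sketch of what a set-theoretic analysis of a redundant cluster system could look like. It is not the authors' procedure. The function names, the sequence identifiers, and the rule that every member of two partially overlapping clusters counts as ambiguous are all assumptions made for this example.

```python
from itertools import combinations


def relation(a: frozenset, b: frozenset) -> str:
    """Classify two clusters by their set relationship (hypothetical helper)."""
    if a == b:
        return "equal"
    if a.isdisjoint(b):
        return "disjoint"
    if a < b or b < a:
        return "nested"
    return "overlap"  # the clusters share members but neither contains the other


def unambiguous_members(clusters):
    """Return the sequences that occur in no partially overlapping cluster pair.

    Assumption for illustration: if two clusters overlap partially, every
    member of both clusters is treated as not unambiguously assigned.
    """
    all_members = set().union(*clusters)
    ambiguous = set()
    for a, b in combinations(clusters, 2):
        if relation(a, b) == "overlap":
            ambiguous |= a | b
    return all_members - ambiguous


# Made-up identifiers; a redundant system may contain nested clusters.
clusters = [
    frozenset({"P1", "P2", "P3"}),
    frozenset({"P1", "P2"}),   # nested inside the first cluster
    frozenset({"P4", "P5"}),
    frozenset({"P5", "P6"}),   # partially overlaps the previous cluster
    frozenset({"P7"}),
]

resolved = unambiguous_members(clusters)
total = len(set().union(*clusters))
print(sorted(resolved), f"{len(resolved)}/{total} unambiguous")
```

In this toy input, nested and disjoint clusters are treated as consistent with one another, and only the partial overlap between the third and fourth clusters leaves sequences unresolved.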

Availability: The resulting clustering for the PIR1 (Release 51) and SWISS-PROT (Release 34) databases is available over the Internet from http://www.dkfz-heidelberg.de/tbi/services/modest/browsesysters.pl.

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Cluster Analysis
Computational Biology
Databases, Factual*
Proteins / genetics*
Sequence Alignment / methods*
Sequence Alignment / statistics & numerical data
Software

Substances

Proteins