Motivation: Integration of molecular biology databases remains limited in practice despite its importance and considerable research effort. The problem is complex enough that an experimental approach is mandatory, yet this very complexity makes it hard to design definitive experiments. This dilemma is common in science, and one tried-and-true strategy is to work with model systems. We propose a model system for this problem, namely a database of genes integrating diverse data across organisms, and describe an experiment using this model.
Results: We attempted to construct a database of human and mouse genes integrating data from GenBank and the human and mouse genome databases. We discovered numerous errors in these well-respected databases: approximately 15% of genes are apparently missing from the genome databases; links between the sequence and genome databases are missing in another 5-10% of cases; about a third of likely homology links between the genome databases are missing; and 10-20% of entries classified as 'genes' are apparently misclassified. By using a model system, we were able to study the problems caused by anomalous data without having to face all the hard problems of database integration.
Contact: [email protected]