GEMINI: a computationally-efficient search engine for large gene expression datasets

Timothy DeFreitas; Hachem Saddiki; Patrick Flaherty

doi:10.1186/s12859-016-0934-8

BMC Bioinformatics. 2016 Feb 24:17:102. doi: 10.1186/s12859-016-0934-8.

Authors

Timothy DeFreitas¹ ², Hachem Saddiki³, Patrick Flaherty⁴ ⁵ ⁶

Affiliations

¹ Computer Science Department, Worcester Polytechnic Institute, 100 Institute Rd, Worcester, 01609, USA.
² Program in Bioinformatics and Computational Biology, 100 Institute Rd, Worcester, 01609, USA.
³ Department of Mathematics and Statistics, University of Massachusetts, Amherst, 710 N. Pleasant St, Amherst, 01003, USA.
⁴ Program in Bioinformatics and Computational Biology, 100 Institute Rd, Worcester, 01609, USA.
⁵ Biomedical Engineering Department, Worcester Polytechnic Institute, 100 Institute Rd, Worcester, 01609, USA.
⁶ Department of Mathematics and Statistics, University of Massachusetts, Amherst, 710 N. Pleasant St, Amherst, 01003, USA.

Abstract

Background: Low-cost DNA sequencing allows organizations to accumulate massive amounts of genomic data and use that data to answer a diverse range of research questions. Presently, users must search for relevant genomic data using a keyword, accession number or meta-data tag. However, in this search paradigm the form of the query - a text-based string - is mismatched with the form of the target - a genomic profile.

Results: To improve access to massive genomic data resources, we have developed a fast search engine, GEMINI, that uses a genomic profile as a query to search for similar genomic profiles. GEMINI implements a nearest-neighbor search algorithm using a vantage-point tree to store a database of n profiles and in certain circumstances achieves an O(log n) expected query time in the limit. We tested GEMINI on breast and ovarian cancer gene expression data from The Cancer Genome Atlas project and show that, in practice, its query time on genomic data scales as the logarithm of the number of records. In a database with 10⁵ samples, GEMINI identifies the nearest neighbor in 0.05 sec, compared to a brute-force search time of 0.6 sec.
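
To illustrate the data structure named in the Results, the following is a minimal Python sketch of a vantage-point tree with nearest-neighbor search. It is not GEMINI's implementation: the names `VPNode`, `build_vp_tree`, `nearest` and `euclidean` are hypothetical, the Euclidean metric is an assumption (the abstract does not state which distance GEMINI uses), and vantage points are chosen at random for simplicity. Each node stores one profile and the median distance from it to the remaining profiles; a query descends into the more promising half first and uses the triangle inequality to skip the other half when it cannot hold a closer profile.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class VPNode:
    """One tree node: a vantage point, a split radius, and two subtrees."""

    def __init__(self, point, radius, inside, outside):
        self.point = point      # the vantage profile stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # profiles with distance < radius
        self.outside = outside  # profiles with distance >= radius


def build_vp_tree(points, dist=euclidean, rng=None):
    """Recursively build a vantage-point tree from a list of profiles."""
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    # Pick a random vantage point and remove it from the working set.
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0.0, None, None)
    dists = [dist(vantage, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]  # median distance
    inside = [p for p, d in zip(points, dists) if d < radius]
    outside = [p for p, d in zip(points, dists) if d >= radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist, rng),
                  build_vp_tree(outside, dist, rng))


def nearest(node, query, dist=euclidean, best=None):
    """Return (distance, profile) of the nearest stored profile to query."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    if d < node.radius:
        # Query lies inside the ball: search inside first.
        best = nearest(node.inside, query, dist, best)
        # The outside half can only help if the search ball crosses the boundary.
        if d + best[0] >= node.radius:
            best = nearest(node.outside, query, dist, best)
    else:
        # Query lies outside the ball: search outside first.
        best = nearest(node.outside, query, dist, best)
        if d - best[0] <= node.radius:
            best = nearest(node.inside, query, dist, best)
    return best


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic stand-ins for expression profiles: 1000 samples, 20 features.
    profiles = [tuple(rng.random() for _ in range(20)) for _ in range(1000)]
    tree = build_vp_tree(profiles)
    query = tuple(rng.random() for _ in range(20))
    distance, match = nearest(tree, query)
    print(f"nearest profile is at distance {distance:.4f}")
```

The pruning tests are what allow a query to avoid visiting every profile. How often a branch can actually be skipped depends on the data and the metric, which is consistent with the abstract's statement that the logarithmic expected query time holds only in certain circumstances.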

Conclusions: GEMINI is a fast search engine that uses a query genomic profile to search for similar profiles in a very large genomic database. It enables users to identify similar profiles independent of sample label, data origin or other meta-data information.

MeSH terms

Cluster Analysis
Databases, Factual
Gene Expression / genetics*
Genomics
Humans
Search Engine / statistics & numerical data*
Sequence Analysis, DNA / methods*

Grants and funding

T32 CA121940/CA/NCI NIH HHS/United States