Mathematical correction for fingerprint similarity measures to improve chemical retrieval

S Joshua Swamidass; Pierre Baldi

doi:10.1021/ci600526a

J Chem Inf Model. 2007 May-Jun;47(3):952-64. doi: 10.1021/ci600526a. Epub 2007 Apr 20.

Authors

S Joshua Swamidass¹, Pierre Baldi

Affiliation

¹ Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, Irvine, California 92697-3435, USA.

PMID: 17444629
DOI: 10.1021/ci600526a

Abstract

In many modern chemoinformatics systems, molecules are represented by long binary fingerprint vectors recording the presence or absence of particular features or substructures, such as labeled paths or trees, in the molecular graphs. These long fingerprints are often compressed to much shorter fingerprints using a simple modulo operation. As the length of the fingerprints decreases, their typical density and overlap tend to increase, and so does any similarity measure based on overlap, such as the widely used Tanimoto similarity. Here we show that this correlation between shorter fingerprints and higher similarity can be thought of as a systematic error introduced by the fingerprint folding algorithm and that this systematic error can be corrected mathematically. More precisely, given two molecules and their compressed fingerprints of a given length, we show how a better estimate of their uncompressed overlap, hence of their similarity, can be derived to correct for this bias. We show how the correction can be implemented not only for the Tanimoto measure but also for all other commonly used measures. Experiments on various data sets and fingerprint sizes demonstrate how, with a negligible computational overhead, the correction noticeably improves the sensitivity and specificity of chemical retrieval.
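
The folding step and the resulting change in similarity can be illustrated with a minimal sketch. It assumes a fingerprint is stored as a set of "on" bit positions and that folding maps position i to i mod n; the function names and the two toy fingerprints are hypothetical, chosen only for illustration. The sketch does not implement the correction itself, since the abstract does not state its formula.

```python
def fold_fingerprint(on_bits, length):
    """Compress a fingerprint by mapping each on-bit index i to i % length."""
    return {i % length for i in on_bits}


def tanimoto(a, b):
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(a & b) / len(union)


# Two hypothetical molecules with a single feature (bit 3) in common.
mol_a = {3, 17, 42, 100, 259}
mol_b = {3, 21, 46, 104, 515}

sim_unfolded = tanimoto(mol_a, mol_b)
sim_fold_8 = tanimoto(fold_fingerprint(mol_a, 8), fold_fingerprint(mol_b, 8))
sim_fold_4 = tanimoto(fold_fingerprint(mol_a, 4), fold_fingerprint(mol_b, 4))

print(f"unfolded:  {sim_unfolded:.3f}")
print(f"length 8:  {sim_fold_8:.3f}")
print(f"length 4:  {sim_fold_4:.3f}")
```

For this particular pair the similarity rises from 1/9 to 1/7 at length 8 and to 1.0 at length 4, because unrelated bits collide after folding. This is an example of the typical tendency the abstract describes, not a guarantee for every pair of fingerprints.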

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Chemistry / methods*
Databases, Factual*
Informatics / methods*
Information Storage and Retrieval / methods*
Mathematics

Grants and funding

LM-07443-01/LM/NLM NIH HHS/United States