Nearest Neighbor Gaussian Process for Quantitative Structure-Activity Relationships

Anthony DiFranzo; Robert P Sheridan; Andy Liaw; Matthew Tudor

doi:10.1021/acs.jcim.0c00678

J Chem Inf Model. 2020 Oct 26;60(10):4653-4663. doi: 10.1021/acs.jcim.0c00678. Epub 2020 Oct 6.

Authors

Anthony DiFranzo¹, Robert P Sheridan², Andy Liaw³, Matthew Tudor¹

Affiliations

¹ Computational and Structural Chemistry, Merck & Company, Inc., West Point, Pennsylvania 19486, United States.
² Computational and Structural Chemistry, Merck & Company, Inc., Kenilworth, New Jersey 07033, United States.
³ Biometrics Research, Merck & Company, Inc., Rahway, New Jersey 07065, United States.

PMID: 33022174

Abstract

Gaussian process models are typically restricted to smaller data sets. We propose a variation that extends their applicability to the larger data sets common in industrial drug discovery, an approach that is relatively novel in the quantitative structure-activity relationship (QSAR) field. By incorporating locality-sensitive hashing for fast nearest neighbor searches, the nearest neighbor Gaussian process model makes predictions with time complexity that is sublinear in the sample size. The model can be built efficiently, permitting rapid updates that prevent degradation as new data are collected. Because it has few hyperparameters, it is robust against overfitting and generalizes about as well as other common QSAR models. Like the standard Gaussian process model, it natively produces principled, well-calibrated uncertainty estimates for its predictions. We compare the new model with implementations of random forest, light gradient boosting, and k-nearest neighbors to highlight these advantages. The code for the nearest neighbor Gaussian process is available at https://github.com/Merck/nngp.
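
As a rough illustration of the idea summarized in the abstract, the sketch below predicts a query point's activity by conditioning a small Gaussian process on only its k nearest training points, returning both a mean and a variance. It is not the authors' implementation (the repository linked above holds that). Several things here are assumptions made for illustration: the function names (`rbf`, `nngp_predict`), the squared-exponential kernel, the zero prior mean, the hyperparameter values, and the toy one-dimensional data. A brute-force neighbor search stands in for locality-sensitive hashing, so this sketch does not have the sublinear prediction cost the abstract describes.

```python
import math


def rbf(a, b, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between two feature vectors (assumed kernel)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return signal_var * math.exp(-0.5 * sq_dist / length_scale ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L z = b by forward substitution."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][m] * z[m] for m in range(i))) / L[i][i])
    return z


def solve_lower_transposed(L, b):
    """Solve L^T x = b by back substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[m][i] * x[m] for m in range(i + 1, n))
        x[i] = (b[i] - s) / L[i][i]
    return x


def nngp_predict(X_train, y_train, x_query, k=3, noise_var=1e-2):
    """Zero-mean GP prediction conditioned on the k nearest training points."""
    # Brute-force neighbor search: a stand-in for locality-sensitive hashing.
    order = sorted(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x_query)),
    )
    idx = order[:k]
    Xn = [X_train[i] for i in idx]
    yn = [y_train[i] for i in idx]

    # Local k-by-k covariance matrix with observation noise on the diagonal.
    K = [
        [rbf(a, b) + (noise_var if i == j else 0.0) for j, b in enumerate(Xn)]
        for i, a in enumerate(Xn)
    ]
    k_star = [rbf(a, x_query) for a in Xn]

    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, yn))  # K^{-1} y
    mean = sum(ks * a for ks, a in zip(k_star, alpha))

    v = solve_lower(L, k_star)
    # Predictive variance of a noisy observation at the query point.
    var = rbf(x_query, x_query) + noise_var - sum(t * t for t in v)
    return mean, var


# Toy data (hypothetical): one descriptor per compound, activity equal to it.
X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
y = [0.0, 1.0, 2.0, 3.0, 10.0]

mean_near, var_near = nngp_predict(X, y, [1.0])  # query inside the data
mean_far, var_far = nngp_predict(X, y, [6.0])    # query far from any neighbor
print(mean_near, var_near)
print(mean_far, var_far)
```

In this sketch the predictive variance grows for the query that lies far from its neighbors, which is the kind of uncertainty signal the abstract attributes to Gaussian process models. Because each prediction solves only a k-by-k system, the remaining cost is dominated by the neighbor search, which is the step a hashing-based search is meant to speed up.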

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Cluster Analysis
Drug Discovery*
Normal Distribution
Quantitative Structure-Activity Relationship*