Detecting novel associations in large data sets

Science. 2011 Dec 16;334(6062):1518-24. doi: 10.1126/science.1205438.

Abstract

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.
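
To make the idea of a grid-based dependence score concrete, here is a minimal sketch in Python. It is not the authors' algorithm: it assumes equal-frequency grids only, whereas the published method searches over grid partitions. The function names (`grid_mutual_information`, `approx_mic`), the exponent `alpha=0.6` for the grid-size budget, and the use of `<=` for that budget are assumptions made for illustration. The score is the largest mutual information found on any allowed grid, divided by the log of the smaller grid dimension so that it falls between 0 and 1.

```python
import numpy as np

def grid_mutual_information(x, y, nx, ny):
    """Mutual information (in nats) of x and y on an nx-by-ny equal-frequency grid."""
    n = len(x)
    # Ranks 0..n-1; ties are broken arbitrarily by argsort.
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    # Map ranks to bins so each bin holds roughly the same number of points.
    bin_x = rank_x * nx // n
    bin_y = rank_y * ny // n
    # Joint distribution over grid cells.
    joint = np.zeros((nx, ny))
    np.add.at(joint, (bin_x, bin_y), 1)
    joint /= joint.sum()
    # Marginals and their outer product (the independence baseline).
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    indep = px @ py
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / indep[nonzero])))

def approx_mic(x, y, alpha=0.6):
    """Simplified MIC-style score: best normalized grid mutual information.

    Assumption: only equal-frequency grids with nx * ny <= n**alpha are tried.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    budget = len(x) ** alpha
    best = 0.0
    for nx in range(2, int(budget // 2) + 1):
        for ny in range(2, int(budget // nx) + 1):
            mi = grid_mutual_information(x, y, nx, ny)
            # Normalize by the maximum attainable value on this grid.
            best = max(best, mi / np.log(min(nx, ny)))
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 500)
    print("noiseless linear:", approx_mic(x, x))
    print("independent noise:", approx_mic(rng.random(500), rng.random(500)))
```

Because the sketch never optimizes where the grid lines fall, it can understate the score for relationships that need uneven partitions; it is meant only to show how a normalized, grid-based mutual information behaves on a noiseless relationship versus independent noise.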

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Animals
  • Baseball / statistics & numerical data
  • Data Interpretation, Statistical*
  • Female
  • Gene Expression
  • Genes, Fungal
  • Genomics / methods
  • Humans
  • Intestines / microbiology
  • Male
  • Metagenome
  • Mice
  • Obesity
  • Saccharomyces cerevisiae / genetics