Signals Among Signals: Prioritizing Nongenetic Associations in Massive Data Sets

Arjun K Manrai; John P A Ioannidis; Chirag J Patel

doi:10.1093/aje/kwz031

Am J Epidemiol. 2019 May 1;188(5):846-850. doi: 10.1093/aje/kwz031.

Authors

Arjun K Manrai¹ ² ³, John P A Ioannidis⁴ ⁵ ⁶ ⁷, Chirag J Patel²

Affiliations

¹ Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts.
² Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts.
³ Department of Pediatrics, Harvard Medical School, Boston, Massachusetts.
⁴ Stanford Prevention Research Center, Department of Medicine, Stanford University, Stanford, California.
⁵ Department of Health Research and Policy, Stanford University, Stanford, California.
⁶ Department of Biomedical Data Science, Stanford University, Stanford, California.
⁷ Department of Statistics, Stanford University, Stanford, California.

Abstract

Massive data sets are often regarded as a panacea for the underpowered studies of the past. At the same time, it is becoming clear that in many of these data sets, in which thousands of variables are measured across hundreds of thousands or millions of individuals, almost any desired relationship can be inferred with a suitable combination of covariates or analytic choices. Inspired by the genome-wide association study analysis paradigm that has transformed human genetics, X-wide association studies or "XWAS" have emerged as a popular approach to systematically analyzing nongenetic data sets and guarding against false positives. However, these studies often yield hundreds or thousands of associations characterized by modest effect sizes and minuscule P values. Many of these associations will be spurious and emerge due to confounding and other biases. One way of characterizing confounding in the genomics paradigm is the genomic inflation factor. An analogous "X-wide inflation factor," denoted λX, can be defined and applied to published XWAS. Effects that arise in XWAS may be prioritized using replication, triangulation, quantification of measurement error, contextualization of each effect in the distribution of all effect sizes within a field, and pre-registration. Criteria like those of Bradford Hill need to be reconsidered in light of exposure-wide epidemiology to prioritize signals among signals.
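
To make the inflation factor concrete, the following is a minimal sketch of how such a quantity is commonly computed in genomics: the median of the observed 1-degree-of-freedom chi-square statistics divided by the median expected under the null (about 0.455). The abstract does not give a formula for λX, so applying the same ratio to XWAS P values is an assumption here, and the function name `inflation_factor` is hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 df, obtained as the squared
# 75th percentile of the standard normal (approximately 0.455).
EXPECTED_MEDIAN_CHI2 = _STD_NORMAL.inv_cdf(0.75) ** 2


def inflation_factor(p_values):
    """Ratio of the observed to the expected median chi-square statistic.

    Assumption: each entry is a two-sided P value from a 1-df test, as in
    the usual genomic inflation factor; the abstract does not state that
    the X-wide version is computed exactly this way.
    """
    p_values = list(p_values)
    if not p_values:
        raise ValueError("at least one P value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P values must lie in (0, 1]")
    # A two-sided P value p corresponds to |z| = -inv_cdf(p / 2), and
    # z squared is the equivalent 1-df chi-square statistic.
    chi2_stats = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / EXPECTED_MEDIAN_CHI2


if __name__ == "__main__":
    n = 1001
    # Evenly spaced P values mimic a null scan with no inflation.
    null_like = [(i - 0.5) / n for i in range(1, n + 1)]
    # Squaring shrinks every P value, mimicking systematic inflation.
    inflated = [p ** 2 for p in null_like]
    print(round(inflation_factor(null_like), 3))
    print(inflation_factor(inflated) > 1.0)
```

Under this definition, a value near 1 suggests that the bulk of the test statistics behaves as expected under the null, whereas values well above 1 indicate systematic departure, which may reflect confounding, other biases, or a large number of true associations.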

Keywords: P values; X-wide association study; big data; inflation factor; machine learning.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Big Data*
Biostatistics / methods*
Confounding Factors, Epidemiologic
Data Interpretation, Statistical*
Epidemiologic Research Design*
Humans
Machine Learning*
Models, Statistical
