Advances in protein tagging and mass spectrometry have enabled the generation of large quantitative proteome and phosphoproteome data sets for identifying differentially expressed targets in case-control studies. Power analysis of the statistical tests involved is critical for designing effective target-identification strategies and for controlling experimental cost. Here, we develop a simulation framework that generates realistic phospho-peptide data with known changes between cases and controls. Using this framework, we quantify the performance of traditional t-tests, Bayesian tests, and ranking by fold change. Bayesian tests, which share variance information among peptides, outperform traditional t-tests. Although ranking by fold change has power similar to that of the Bayesian tests, its type I error rate cannot be controlled without a proper permutation analysis; relying on the ranking alone is therefore likely to yield false positives. Two-sample Bayesian tests that model the dependence between intensity and variance are superior for data sets with complex variance structure. Increasing the sample size improves the performance of all tests, and balanced cases and controls are preferable to a design weighted toward one group. Further, peptides with higher standard deviations require larger fold changes to achieve the same statistical power. Together, these results highlight the importance of model-informed experimental design and principled statistical analysis when working with large-scale proteomics and phosphoproteomics data.
Keywords: Bayesian statistics; bioinformatics; empirical variance; hierarchical simulation; multiplex; neuroproteomics; proteomics; quantitative phosphoproteomics; sample size; two-sample.
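To make the power-simulation idea concrete, the following is a minimal sketch of the kind of Monte Carlo power estimate the abstract describes: peptide log2 intensities are drawn from a normal distribution, cases are shifted by log2 of a known fold change, and the fraction of peptides flagged by a two-sample t-test gives the empirical power. The distributional assumptions, parameter names, and the hard-coded critical value are illustrative only and are not taken from the paper's actual model.

```python
import math
import random
import statistics

def two_sample_t(x, y):
    """Pooled-variance (Student) two-sample t statistic."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x)
           + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

def empirical_power(n_per_group=5, fold_change=2.0, sd=0.5,
                    n_peptides=2000, t_crit=2.306, seed=1):
    """Fraction of simulated peptides whose |t| exceeds the two-sided
    critical value; 2.306 is t_{0.975} at df = 8 (n_per_group = 5).
    All parameters here are hypothetical illustration values."""
    rng = random.Random(seed)
    shift = math.log2(fold_change)  # true case-vs-control shift on log2 scale
    hits = 0
    for _ in range(n_peptides):
        ctrl = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        case = [rng.gauss(shift, sd) for _ in range(n_per_group)]
        if abs(two_sample_t(case, ctrl)) > t_crit:
            hits += 1
    return hits / n_peptides
```

Comparing, say, `empirical_power(sd=0.5)` with `empirical_power(sd=1.0)` reproduces the abstract's qualitative point that a higher peptide standard deviation requires a larger fold change (or larger sample size) to reach the same power.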