Assessing the statistical validity of proteomics based biomarkers

Suzanne Smit; Mariëlle J van Breemen; Huub C J Hoefsloot; Age K Smilde; Johannes M F G Aerts; Chris G de Koster

doi:10.1016/j.aca.2007.04.043

Anal Chim Acta. 2007 Jun 5;592(2):210-7. doi: 10.1016/j.aca.2007.04.043. Epub 2007 Apr 27.

Authors

Suzanne Smit¹, Mariëlle J van Breemen, Huub C J Hoefsloot, Age K Smilde, Johannes M F G Aerts, Chris G de Koster

Affiliation

¹ Swammerdam Institute for Life Sciences, Universiteit van Amsterdam, Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands.

PMID: 17512828
DOI: 10.1016/j.aca.2007.04.043

Abstract

A strategy is presented for the statistical validation of discrimination models in proteomics studies. Several existing tools are combined to form a solid statistical basis for biomarker discovery that should precede the biochemical validation of any biomarker. These tools are permutation tests and single and double cross-validation. The cross-validation steps can readily be combined with a new variable selection method called rank products. The strategy is especially suited to the case of a low samples-to-variables ratio (undersampling), which is often encountered in proteomics and metabolomics studies. Principal component discriminant analysis is used as the classification method; however, the methodology can be used with any classifier. A dataset containing serum samples from Gaucher patients and healthy controls serves as a test case. Double cross-validation shows that the sensitivity of the model is 89% and the specificity 90%. Putative biomarkers are identified using the novel variable selection method. Results from permutation tests support the choice of double cross-validation as the tool for determining error rates when the modelling procedure involves a tuneable parameter. This shows that cross-validation by itself does not guarantee unbiased results. Validating discrimination models with a combination of permutation tests and double cross-validation helps to avoid erroneous results that may arise from undersampling.
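The validation scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes synthetic data in place of the Gaucher serum spectra, a simplified principal component discriminant analysis (PCA by SVD followed by two-class linear discriminant analysis on the scores), the number of principal components as the tuneable parameter, and balanced accuracy as the permutation test statistic. All function names, fold counts and grid values are hypothetical, and the rank products variable selection is not shown.

```python
import numpy as np

def fit_pcda(X, y, n_pc):
    """Simplified PCDA: PCA via SVD, then two-class LDA on the PC scores."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_pc].T                               # loadings (variables x n_pc)
    T = (X - mean) @ P                            # scores of training samples
    m0, m1 = T[y == 0].mean(axis=0), T[y == 1].mean(axis=0)
    R = np.vstack([T[y == 0] - m0, T[y == 1] - m1])
    S = R.T @ R / (len(y) - 2) + 1e-8 * np.eye(n_pc)  # pooled covariance
    w = np.linalg.solve(S, m1 - m0)               # discriminant direction
    b = -0.5 * w @ (m0 + m1)                      # threshold midway between class means
    return mean, P, w, b

def predict_pcda(model, X):
    mean, P, w, b = model
    return (((X - mean) @ P) @ w + b > 0).astype(int)

def stratified_folds(y, k, rng):
    """Split sample indices into k folds, keeping class proportions similar."""
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        for i, j in enumerate(rng.permutation(np.flatnonzero(y == cls))):
            folds[i % k].append(j)
    return [np.array(f) for f in folds]

def cv_errors(X, y, n_pc, k, rng):
    """Single cross-validation: count misclassified left-out samples."""
    wrong = 0
    for test in stratified_folds(y, k, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        model = fit_pcda(X[train], y[train], n_pc)
        wrong += np.sum(predict_pcda(model, X[test]) != y[test])
    return wrong

def double_cv(X, y, pc_grid=(1, 2, 3, 4, 5), k_outer=5, k_inner=4, seed=0):
    """Double cross-validation: tune n_pc in an inner loop, test in an outer loop."""
    rng = np.random.default_rng(seed)
    y_pred = np.empty_like(y)
    for test in stratified_folds(y, k_outer, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        # The outer test samples play no part in choosing n_pc.
        inner = [cv_errors(X[train], y[train], n, k_inner, rng) for n in pc_grid]
        best = pc_grid[int(np.argmin(inner))]
        model = fit_pcda(X[train], y[train], best)
        y_pred[test] = predict_pcda(model, X[test])
    sensitivity = np.mean(y_pred[y == 1] == 1)    # class 1 = patients (assumed)
    specificity = np.mean(y_pred[y == 0] == 0)    # class 0 = controls (assumed)
    return sensitivity, specificity

def permutation_p_value(X, y, n_perm=20, seed=1):
    """Compare the observed result with results obtained for shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = np.mean(double_cv(X, y))
    null = [np.mean(double_cv(X, rng.permutation(y))) for _ in range(n_perm)]
    return (1 + sum(v >= observed for v in null)) / (1 + n_perm)

# Synthetic undersampled data: 40 samples, 200 variables, 10 informative ones.
data_rng = np.random.default_rng(42)
y = np.repeat([0, 1], 20)
X = data_rng.normal(size=(40, 200))
X[y == 1, :10] += 3.0

sens, spec = double_cv(X, y)
p = permutation_p_value(X, y)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} permutation p={p:.3f}")
```

The point of the nested loops is that the samples used to report sensitivity and specificity never influence the choice of the tuneable parameter. Reporting the inner-loop error of the selected model instead would reuse the same samples for tuning and for error estimation, which is the source of the optimistic bias the abstract refers to.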

MeSH terms

Adolescent
Adult
Aged
Biomarkers / blood
Biomarkers / chemistry
Female
Humans
Male
Mass Spectrometry
Middle Aged
Proteomics / classification
Proteomics / methods*
Proteomics / standards*
Reproducibility of Results
Statistics as Topic

Substances

Biomarkers