Variable selection using iterative reformulation of training set models for discrimination of samples: application to gas chromatography/mass spectrometry of mouse urinary metabolites

Kanet Wongravee; Nina Heinrich; Maria Holmboe; Michele L Schaefer; Randall R Reed; Jose Trevejo; Richard G Brereton

doi:10.1021/ac900251c

Anal Chem. 2009 Jul 1;81(13):5204-17. doi: 10.1021/ac900251c.

Authors

Kanet Wongravee¹, Nina Heinrich, Maria Holmboe, Michele L Schaefer, Randall R Reed, Jose Trevejo, Richard G Brereton

Affiliation

¹ Centre for Chemometrics, School of Chemistry, University of Bristol, Cantocks Close, Bristol BS8 1TS, UK.

Abstract

This paper discusses variable selection as used in large metabolomic studies, exemplified by mouse urinary gas chromatography of 441 mice in three experiments designed to detect the influence of age, diet, and stress on their chemosignal. Partial least squares discriminant analysis (PLS-DA) was applied to obtain class models, using a procedure of 20,000 iterations that includes the bootstrap for model optimization and random splits into test and training sets for validation. Variables are selected from the PLS regression coefficients of the training set model, built with the optimized number of components obtained from the bootstrap. The variables are ranked in order of significance, and the overall optimal variables are those that appear as highly significant over 100 different test and training set splits. A cost/benefit analysis of building the model on a reduced number of variables is also illustrated. The paper provides a properly validated strategy for determining which variables are most significant for discriminating between two groups in large metabolomic data sets. The strategy avoids the common pitfall of overfitting that arises when variables are selected on a combined training and test set, and, through iterative procedures, it takes into account that different variables may be selected each time the samples are split into training and test sets.
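
The selection loop described in the abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`pls1_fit`, `choose_components`, `stable_variables`), the top-k ranking rule, the 50% frequency cutoff, the one-third test fraction, and the small default iteration counts are all hypothetical choices made for the example. It also assumes two classes coded as -1 and +1, a decision threshold of zero, and columns of `X` that are already on comparable scales. The key point it shows is that variables are ranked on the training portion only, and that the ranking is repeated over many random splits.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; return (coefficients, column means of X, mean of y)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                      # weight vector: covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:                   # nothing left to model
            break
        w = w / norm
        t = Xr @ w                         # scores
        tt = t @ t
        p = Xr.T @ t / tt                  # X loadings
        c = (yr @ t) / tt                  # y loading
        Xr = Xr - np.outer(t, p)           # deflate X and y
        yr = yr - c * t
        W.append(w); P.append(p); q.append(c)
    if not W:
        return np.zeros(X.shape[1]), x_mean, y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)    # b = W (P'W)^-1 q
    return b, x_mean, y_mean

def choose_components(X, y, max_components, n_boot, rng):
    """Pick the number of components by out-of-bag accuracy over bootstrap resamples."""
    score = np.zeros(max_components)
    n = len(y)
    for _ in range(n_boot):
        inbag = rng.integers(0, n, size=n)
        oob = np.setdiff1d(np.arange(n), inbag)
        if oob.size == 0 or np.unique(y[inbag]).size < 2:
            continue                       # unusable resample, skip it
        for a in range(1, max_components + 1):
            b, xm, ym = pls1_fit(X[inbag], y[inbag], a)
            pred = np.sign((X[oob] - xm) @ b + ym)
            score[a - 1] += np.mean(pred == y[oob])
    return int(np.argmax(score)) + 1

def stable_variables(X, y, n_splits=100, top_k=10, min_frequency=0.5,
                     max_components=3, n_boot=20, test_fraction=1 / 3, seed=0):
    """Count how often each variable ranks in the top_k over random train/test splits.

    The test samples of each split are never used for ranking; they stay
    available for an independent assessment of the reduced model.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    counts = np.zeros(m, dtype=int)
    n_test = int(round(test_fraction * n))
    for _ in range(n_splits):
        train = rng.permutation(n)[n_test:]
        Xtr, ytr = X[train], y[train]
        a = choose_components(Xtr, ytr, max_components, n_boot, rng)
        b, _, _ = pls1_fit(Xtr, ytr, a)
        counts[np.argsort(-np.abs(b))[:top_k]] += 1   # rank by |coefficient|
    freq = counts / n_splits
    return np.flatnonzero(freq >= min_frequency), freq
```

Calling `stable_variables(X, y)` with a samples-by-variables array `X` and a vector `y` of -1/+1 labels returns the indices of the variables that ranked highly in at least half of the splits, together with the selection frequency of every variable. Ranking by absolute coefficient is one simple choice; any other significance measure computed on the training set alone would fit the same loop.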

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Animals
Area Under Curve
Discriminant Analysis
Gas Chromatography-Mass Spectrometry / methods*
Least-Squares Analysis
Metabolome
Metabolomics / economics
Metabolomics / methods*
Mice
Models, Statistical
Models, Theoretical
Urinalysis / economics
