Data reduction for prediction: a case study on robust coding of age and family history for the risk of having a genetic mutation

Ewout W Steyerberg; Judith Balmaña; David H Stockwell; Sapna Syngal

doi:10.1002/sim.3119

Stat Med. 2007 Dec 30;26(30):5545-56. doi: 10.1002/sim.3119.

Authors

Ewout W Steyerberg¹, Judith Balmaña, David H Stockwell, Sapna Syngal

Affiliation

¹ Department of Public Health, Erasmus MC, Rotterdam, The Netherlands.

PMID: 17948867

Abstract

Data reduction is often desirable when developing a prediction model, for example when modeling the effects of age and family history to identify subjects who carry a genetic mutation. We aimed to evaluate a strategy for model simplification by robust coding of related predictors. We considered 898 patients suspected of having Lynch syndrome, which is caused primarily by mutations in the mismatch repair genes MLH1 or MSH2. The presence of colorectal cancer (CRC) and endometrial cancer in patients and their relatives was related to mutation prevalence with logistic regression analysis. The performance of simplified and more complex models was quantified with a concordance statistic (c), which was corrected for optimism by cross-validation and bootstrapping. External validation was performed in 1016 patients. The first challenge was the coding of age at diagnosis of CRC, where we forced effects to be identical in patients, in 1st degree relatives and in 2nd degree relatives by taking the sum of the ages at diagnosis. As a further simplification, CRC diagnosis in 2nd degree relatives was weighted half that of 1st degree relatives. These data reduction approaches were also followed for endometrial cancer. The simplified model used 7 degrees of freedom (df), instead of the 17 df of a more complex model incorporating individual predictor effects. The optimism-corrected c was higher for the simplified model (0.79 instead of 0.77), but the external c was similar (0.78 for both the simplified and the more complex model). A stepwise selected model performed slightly worse (external c=0.77). In conclusion, a prediction model could be developed with relatively few df that captured the effects of age at diagnosis across patients and relatives per type of cancer in the family. Such robust coding may be especially relevant for modeling in relatively small data sets.
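The coding strategy and the concordance statistic described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `FamilyCRC` record, the function names, and the example values are hypothetical, and applying the half weight to the summed ages of 2nd degree relatives (rather than only to their diagnosis count) is an assumption, since the abstract does not spell out the exact coding. The c statistic is computed as the proportion of carrier/non-carrier pairs in which the carrier has the higher predicted risk, with ties counted as one half.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class FamilyCRC:
    """Hypothetical record of CRC diagnoses in one family (ages in years)."""
    proband_ages: List[float] = field(default_factory=list)
    first_degree_ages: List[float] = field(default_factory=list)
    second_degree_ages: List[float] = field(default_factory=list)


def robust_coding(fam: FamilyCRC, second_degree_weight: float = 0.5) -> Tuple[float, float]:
    """Collapse per-person predictors into two summary predictors.

    Returns (weighted number of CRC diagnoses, weighted sum of ages at diagnosis).
    Assumption: the same half weight is applied to 2nd degree counts and ages.
    """
    w = second_degree_weight
    count = (len(fam.proband_ages)
             + len(fam.first_degree_ages)
             + w * len(fam.second_degree_ages))
    age_sum = (sum(fam.proband_ages)
               + sum(fam.first_degree_ages)
               + w * sum(fam.second_degree_ages))
    return count, age_sum


def concordance(predictions: Sequence[float], outcomes: Sequence[int]) -> float:
    """Concordance statistic (c) for a binary outcome coded 1 (carrier) or 0."""
    carriers = [p for p, y in zip(predictions, outcomes) if y == 1]
    non_carriers = [p for p, y in zip(predictions, outcomes) if y == 0]
    if not carriers or not non_carriers:
        raise ValueError("need at least one carrier and one non-carrier")
    score = 0.0
    for p_carrier in carriers:
        for p_non in non_carriers:
            if p_carrier > p_non:
                score += 1.0      # concordant pair
            elif p_carrier == p_non:
                score += 0.5      # tied pair counts as one half
    return score / (len(carriers) * len(non_carriers))


# Illustrative values only, not data from the study.
family = FamilyCRC(proband_ages=[45], first_degree_ages=[50, 60], second_degree_ages=[70])
weighted_count, weighted_age_sum = robust_coding(family)
c_example = concordance([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

In this sketch the two summary predictors would enter a logistic regression in place of separate terms for the patient, 1st degree relatives and 2nd degree relatives, which is how the number of df is reduced. The optimism correction by cross-validation and bootstrapping mentioned in the abstract is not shown.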

MeSH terms

Adaptor Proteins, Signal Transducing / genetics
Age Factors*
Aged
Aged, 80 and over
Colorectal Neoplasms, Hereditary Nonpolyposis / epidemiology
Colorectal Neoplasms, Hereditary Nonpolyposis / genetics
DNA Mismatch Repair
Family Health
Genetic Carrier Screening / methods
Genetic Predisposition to Disease / etiology
Genetic Testing / methods*
Humans
Likelihood Functions
Logistic Models
Middle Aged
MutL Protein Homolog 1
MutL Proteins
Mutation*
Neoplasm Proteins / genetics
Nuclear Proteins / genetics
Pedigree*
Predictive Value of Tests*
Reproducibility of Results
Risk Factors

Substances

Adaptor Proteins, Signal Transducing
MLH1 protein, human
Neoplasm Proteins
Nuclear Proteins
PMS1 protein, human
MutL Protein Homolog 1
MutL Proteins