Data reduction is often desirable when developing a prediction model, for example when modeling the effects of age and family history to identify subjects who carry a genetic mutation. We aimed to evaluate a strategy for model simplification by robust coding of related predictors. We considered 898 patients suspected of having Lynch syndrome, which is caused primarily by mutations in the mismatch repair genes MLH1 or MSH2. The presence of colorectal cancer (CRC) and endometrial cancer in patients and their relatives was related to mutation prevalence with logistic regression analysis. The performance of simplified and more complex models was quantified with the concordance statistic (c), which was corrected for optimism by cross-validation and bootstrapping. External validation was performed in 1016 patients. The first challenge was the coding of age at diagnosis of CRC: we forced the effects to be identical in patients, in 1st degree relatives, and in 2nd degree relatives by taking the sum of the ages at diagnosis. As a further simplification, a CRC diagnosis in a 2nd degree relative was given half the weight of one in a 1st degree relative. The same data reduction approaches were applied to endometrial cancer. The simplified model used 7 degrees of freedom (df), compared with 17 df for a more complex model incorporating individual predictor effects. The optimism-corrected c was higher for the simplified model (0.79 versus 0.77), but the external c was similar (0.78 for both the simplified and the more complex model). A stepwise selected model performed slightly worse (external c=0.77). In conclusion, a prediction model could be developed with relatively few df that captured the effects of age at diagnosis across patients and relatives per type of cancer in the family. Such robust coding may be especially relevant for modeling in relatively small data sets.
Copyright (c) 2007 John Wiley & Sons, Ltd.
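
The coding of family history described in the abstract can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the authors' implementation: the names `Diagnosis`, `WEIGHTS`, and `code_cancer_history` are hypothetical, the patient weight of 1 is an assumption, and applying the half weight to the summed ages as well as to the diagnosis count is also an assumption. The example ages are invented for illustration only.

```python
from dataclasses import dataclass

# Assumed weights: the abstract states only that a diagnosis in a 2nd degree
# relative counts half as much as one in a 1st degree relative. Giving the
# patient the same weight as a 1st degree relative is an assumption.
WEIGHTS = {"patient": 1.0, "first": 1.0, "second": 0.5}


@dataclass
class Diagnosis:
    relation: str  # "patient", "first", or "second" (degree of relative)
    age: float     # age at diagnosis in years


def code_cancer_history(diagnoses):
    """Collapse per-person diagnoses of one cancer type into two predictors.

    Returns (weighted number of diagnoses, weighted sum of ages at diagnosis),
    so that a single pair of regression coefficients is shared by the patient
    and all relatives instead of one coefficient per group.
    """
    weighted_count = sum(WEIGHTS[d.relation] for d in diagnoses)
    weighted_age_sum = sum(WEIGHTS[d.relation] * d.age for d in diagnoses)
    return weighted_count, weighted_age_sum


# Illustrative family with made-up ages: the patient, one 1st degree relative,
# and one 2nd degree relative, each diagnosed with the same cancer type.
family = [Diagnosis("patient", 45), Diagnosis("first", 50), Diagnosis("second", 60)]
crc_count, crc_age_sum = code_cancer_history(family)
```

Under these assumptions, each cancer type contributes two pooled predictors to the logistic regression model, which is how a coding of this kind would reduce the number of df relative to a model with separate effects for patients, 1st degree relatives, and 2nd degree relatives.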