Predictive models for breast cancer susceptibility from multiple single nucleotide polymorphisms

Clin Cancer Res. 2004 Apr 15;10(8):2725-37. doi: 10.1158/1078-0432.ccr-1115-03.

Abstract

Hereditary predisposition and causative environmental exposures have long been recognized in human malignancies. In most instances, cancer cases occur sporadically, suggesting that environmental influences are critical in determining cancer risk. To test the influence of genetic polymorphisms on breast cancer risk, we have measured 98 single nucleotide polymorphisms (SNPs) distributed over 45 genes of potential relevance to breast cancer etiology in 174 patients and have compared these with matched normal controls. Using machine learning techniques such as support vector machines (SVMs), decision trees, and naïve Bayes, we identified a subset of three SNPs as key discriminators between breast cancer and controls. The SVMs performed maximally among predictive models, achieving 69% predictive power in distinguishing between the two groups, compared with a 50% baseline predictive power obtained from the data after repeated random permutation of class labels (individuals with cancer or controls). However, the simpler naïve Bayes model as well as the decision tree model performed quite similarly to the SVM. The three SNP sites most useful in this model were (a) the +4536T/C site of the aldosterone synthase gene CYP11B2 at amino acid residue 386 Val/Ala (T/C) (rs4541); (b) the +4328C/G site of the aryl hydrocarbon hydroxylase CYP1B1 at amino acid residue 293 Leu/Val (C/G) (rs5292); and (c) the +4449C/T site of the transcription factor BCL6 at amino acid 387 Asp/Asp (rs1056932). No single SNP site on its own could achieve more than 60% in predictive accuracy. We have shown that multiple SNP sites from different genes over distant parts of the genome are better at identifying breast cancer patients than any one SNP alone. As high-throughput technology for SNPs improves and as more SNPs are identified, it is likely that much higher predictive accuracy will be achieved and a useful clinical tool developed.
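The comparison against a permuted-label baseline described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the genotype data are synthetic, the classifier is a hand-written categorical naïve Bayes (one of the three model families the abstract names), and the 5-fold cross-validation, the allele frequencies, and all function names (`make_synthetic_data`, `train_nb`, `predict_nb`, `cv_accuracy`, `permutation_baseline`) are assumptions made for illustration. The idea is that accuracy on the real labels is only meaningful relative to the accuracy obtained after class labels are repeatedly shuffled, which should hover near 50% for two balanced groups.

```python
import math
import random
from collections import Counter


def make_synthetic_data(n_per_class=174, n_snps=10, n_informative=3, seed=0):
    """Synthetic genotypes coded 0/1/2 (minor-allele copies). NOT the study data."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):  # 0 = control, 1 = case
        for _ in range(n_per_class):
            row = []
            for j in range(n_snps):
                # Assumed allele frequencies: informative SNPs are shifted in cases.
                freq = 0.3 + (0.2 * label if j < n_informative else 0.0)
                row.append(sum(rng.random() < freq for _ in range(2)))
            X.append(row)
            y.append(label)
    return X, y


def train_nb(X, y, n_values=3, alpha=1.0):
    """Fit a categorical naive Bayes model with Laplace smoothing."""
    classes = sorted(set(y))
    class_counts = Counter(y)
    log_prior = {c: math.log(class_counts[c] / len(y)) for c in classes}
    log_lik = {}
    for c in classes:
        rows = [x for x, lab in zip(X, y) if lab == c]
        total = len(rows) + alpha * n_values
        per_snp = []
        for j in range(len(X[0])):
            counts = Counter(r[j] for r in rows)
            per_snp.append([math.log((counts[v] + alpha) / total)
                            for v in range(n_values)])
        log_lik[c] = per_snp
    return log_prior, log_lik


def predict_nb(model, x):
    """Return the class with the highest posterior log-score for genotype row x."""
    log_prior, log_lik = model
    scores = {c: log_prior[c] + sum(log_lik[c][j][v] for j, v in enumerate(x))
              for c in log_prior}
    return max(scores, key=scores.get)


def cv_accuracy(X, y, k=5, seed=0):
    """k-fold cross-validated accuracy (the fold scheme is an assumption)."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = train_nb([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict_nb(model, X[i]) == y[i] for i in fold)
    return correct / len(y)


def permutation_baseline(X, y, n_perm=20, seed=1):
    """Mean cross-validated accuracy after repeatedly shuffling the class labels."""
    rng = random.Random(seed)
    accuracies = []
    for p in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)
        accuracies.append(cv_accuracy(X, shuffled, seed=p))
    return sum(accuracies) / len(accuracies)


if __name__ == "__main__":
    X, y = make_synthetic_data()
    print("accuracy, true labels:    ", round(cv_accuracy(X, y), 3))
    print("accuracy, permuted labels:", round(permutation_baseline(X, y), 3))
```

Because the data are simulated, the printed accuracies say nothing about the study's reported figures; the sketch only shows how a permuted-label baseline gives the reference point against which a model's predictive power is judged.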

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Artificial Intelligence
  • Aryl Hydrocarbon Hydroxylases / genetics
  • Bayes Theorem
  • Breast Neoplasms / diagnosis*
  • Breast Neoplasms / genetics*
  • Computational Biology
  • Cytochrome P-450 CYP11B2 / genetics
  • Cytochrome P-450 CYP1B1
  • DNA-Binding Proteins / genetics
  • Diagnosis, Computer-Assisted / methods*
  • Disease Susceptibility
  • Female
  • Genetic Predisposition to Disease*
  • Genome
  • Humans
  • Models, Theoretical
  • Odds Ratio
  • Polymorphism, Single Nucleotide*
  • Proto-Oncogene Proteins / genetics
  • Proto-Oncogene Proteins c-bcl-6
  • Risk
  • Transcription Factors / genetics

Substances

  • DNA-Binding Proteins
  • Proto-Oncogene Proteins
  • Proto-Oncogene Proteins c-bcl-6
  • Transcription Factors
  • Aryl Hydrocarbon Hydroxylases
  • CYP1B1 protein, human
  • Cytochrome P-450 CYP1B1
  • Cytochrome P-450 CYP11B2