A method for predicting disease subtypes in presence of misclassification among training samples using gene expression: application to human breast cancer

Bioinformatics. 2006 Feb 1;22(3):317-25. doi: 10.1093/bioinformatics/bti738. Epub 2005 Nov 2.

Abstract

Motivation: Accurate diagnosis and prediction cannot be achieved unless the disease subtype status of every training sample used in the supervised learning step is accurately known. Such an assumption requires a perfect tool for disease diagnosis and classification, which is seldom available. Thus, the supervised learning step has to be conducted with a statistical model that accounts for potential mislabeling in the input data.

Results: A procedure for handling potential mislabeling among training samples in the prediction of disease subtypes from gene expression data was proposed. A simulation study based on real data on the estrogen receptor status (ER+/ER-) of breast cancer patients was conducted. The results demonstrated that when 1-4 of the training samples (N = 30) were artificially mislabeled, the proposed method was able not only to correct the ER status of the mislabeled training samples but also, more importantly, to predict the ER status of validation samples as well as when the 'true' training data were used.
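
The label-corruption step of such a simulation can be illustrated with a minimal sketch. It only shows how 1-4 of 30 binary training labels might be flipped at random; it is not the proposed method itself, which the abstract does not specify. The function name `mislabel`, the 15/15 split between ER+ and ER-, and the seeds are assumptions made for illustration.

```python
import random


def mislabel(labels, k, seed=None):
    """Return a copy of `labels` with k randomly chosen binary entries flipped,
    together with the sorted indices that were flipped (hypothetical helper)."""
    rng = random.Random(seed)
    # Sample k distinct positions so exactly k labels change.
    flipped = sorted(rng.sample(range(len(labels)), k))
    noisy = list(labels)
    for i in flipped:
        noisy[i] = 1 - noisy[i]  # 1 = ER+, 0 = ER-
    return noisy, flipped


# Assumed ER status for N = 30 training samples; the real class balance
# is not given in the abstract.
true_labels = [1] * 15 + [0] * 15

for k in range(1, 5):  # 1 to 4 mislabeled samples, as in the study
    noisy, flipped = mislabel(true_labels, k, seed=k)
    n_changed = sum(a != b for a, b in zip(true_labels, noisy))
    print(f"k={k}: flipped indices {flipped}, labels changed: {n_changed}")
```

Keeping the flipped indices makes it possible to check afterwards whether a classifier trained on the noisy labels recovers the original status of exactly those samples.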

Publication types

  • Evaluation Study

MeSH terms

  • Algorithms*
  • Artificial Intelligence
  • Biomarkers, Tumor / metabolism*
  • Breast Neoplasms / classification*
  • Breast Neoplasms / diagnosis
  • Breast Neoplasms / metabolism*
  • Diagnosis, Computer-Assisted / methods
  • Gene Expression Profiling / methods*
  • Humans
  • Neoplasm Proteins / metabolism*
  • Oligonucleotide Array Sequence Analysis / methods*
  • Pattern Recognition, Automated / methods
  • Reproducibility of Results
  • Sample Size
  • Sensitivity and Specificity

Substances

  • Biomarkers, Tumor
  • Neoplasm Proteins