Marker identification and classification of cancer types using gene expression data and SIMCA

Methods Inf Med. 2004;43(1):4-8.

Abstract

Objectives: High-throughput technologies are radically boosting the understanding of living systems, thus creating enormous opportunities to elucidate the biological processes of cells in different physiological states. In particular, the application of DNA microarrays to monitor expression profiles from tumor cells is improving cancer analysis to levels that classical methods have been unable to reach. However, molecular diagnostics based on expression profiling requires addressing computational issues such as the overwhelming number of variables and the complex, multi-class nature of tumor samples. Thus, the objective of the present research has been the development of a computational procedure for feature extraction and classification of gene expression data.

Methods: The Soft Independent Modeling of Class Analogy (SIMCA) approach has been implemented in a data mining scheme, which allows the identification of those genes that are most likely to confer robust and accurate classification of samples from multiple tumor types.

Results: The proposed method has been tested on two different microarray data sets, namely Golub's analysis of acute human leukemia and the small round blue cell tumors study presented by Khan et al. The identified features represent a rational and dimensionally reduced basis for understanding the biology of diseases, defining targets of therapeutic intervention, and developing diagnostic tools for classification of pathological states.

Conclusions: The analysis of the SIMCA model residuals allows the identification of specific phenotype markers. At the same time, the class analogy approach provides the assignment to multiple classes, such as different pathological conditions or tissue samples, for previously unseen instances.
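The two roles of the SIMCA model described above, classification by class analogy and marker identification from residuals, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one PCA model per class, assigns a sample to the class whose model leaves the smallest residual (the classical method uses an F-test on residual variances instead), and ranks genes by the textbook SIMCA discriminatory power without degrees-of-freedom corrections. The function names, the synthetic data, and the choice of two components are all illustrative assumptions.

```python
import numpy as np

def fit_class_model(X, n_components):
    """Fit a PCA model to one class (rows = samples, columns = genes)."""
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def residuals(X, model):
    """Part of each sample that the class model does not reconstruct."""
    mean, axes = model
    Xc = X - mean
    return Xc - Xc @ axes.T @ axes

def classify(X, models):
    """Assign each sample to the class whose model leaves the smallest residual."""
    labels = list(models)
    dist = np.column_stack(
        [np.linalg.norm(residuals(X, models[k]), axis=1) for k in labels]
    )
    return [labels[i] for i in dist.argmin(axis=1)]

def discriminatory_power(X_by_class, models, a, b):
    """Per-gene ratio of cross-fit to self-fit residual variance (simplified)."""
    cross = (residuals(X_by_class[a], models[b]) ** 2).mean(axis=0) \
          + (residuals(X_by_class[b], models[a]) ** 2).mean(axis=0)
    own = (residuals(X_by_class[a], models[a]) ** 2).mean(axis=0) \
        + (residuals(X_by_class[b], models[b]) ** 2).mean(axis=0)
    return np.sqrt(cross / (own + 1e-12))  # small constant avoids division by zero

# Synthetic stand-in for expression data: gene 0 is a planted marker.
rng = np.random.default_rng(0)
n_genes = 50
X_by_class = {
    "A": rng.normal(size=(20, n_genes)),
    "B": rng.normal(size=(20, n_genes)),
}
X_by_class["A"][:, 0] += 5.0
X_by_class["B"][:, 0] -= 5.0
models = {k: fit_class_model(X, n_components=2) for k, X in X_by_class.items()}

# Previously unseen samples: the first two resemble class A, the last two class B.
new_samples = rng.normal(size=(4, n_genes))
new_samples[:2, 0] += 5.0
new_samples[2:, 0] -= 5.0
predicted = classify(new_samples, models)

power = discriminatory_power(X_by_class, models, "A", "B")
print(predicted, int(power.argmax()))
```

Genes with a high discriminatory power are those whose expression is poorly reconstructed by the other class's model, which is the sense in which residual analysis points to candidate markers.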

MeSH terms

  • Biomarkers, Tumor / genetics
  • Biomarkers, Tumor / physiology*
  • Computational Biology
  • DNA, Neoplasm / classification
  • DNA, Neoplasm / physiology
  • Data Interpretation, Statistical
  • Databases, Genetic*
  • Gene Expression Profiling / methods*
  • Gene Expression Profiling / statistics & numerical data
  • Humans
  • Leukemia / classification*
  • Leukemia / genetics*
  • Oligonucleotide Array Sequence Analysis / classification*
  • Pattern Recognition, Automated*
  • Phenotype
  • Principal Component Analysis*
  • Sequence Analysis, DNA

Substances

  • Biomarkers, Tumor
  • DNA, Neoplasm