Objective: Accurate cancer prognosis prediction is critical to cancer treatment. There have been many prognosis models based on clinical markers, but few of them are satisfied in clinical applications. And with the development of microarray technologies, cancer researchers have discovered many genes as new markers from the gene expression data and have further developed powerful prognosis models based on these so-called genetic biomarkers. However, the application of such biomarkers still suffers from some problems. The first one is there are a great number of genes and a few samples in the gene expression data so that it is difficult to select a unified gene set to establish a stable classifier for prognosis. The second one is that, due to the experimental and technical reasons, there are existing noises and redundancies in gene expression data, which may lead to building a prognosis predictor with poor performance. The last but not the least one is the microarray experiments are so expensive currently that it is hard to obtain abundant samples. Therefore, it is practical to develop prognosis methods mainly based on conventional clinical markers in real cancer treatment applications. This paper aims to establish an accurate classification model for cancer prognosis, in order to make full use of the invaluable information in clinical data, especially which is usually ignored by most of the existing methods when they aim for high prediction accuracies.
Methods: First, this paper gives the formal description of general classification problem, and presents a novel mixture classification model to make full use of the invaluable information in clinical data, which is similar to the traditional ensemble classification models except for putting strict constraints on the construction of mapping functions to avoid voting process. Then, a two-layer instance of the proposed model, named as MRS (Mixture of Rough set and Support vector machine), is constructed by integrating rough set and support vector machine (SVM) classification methods, in which, the rough set classifier acts as the first layer to identify some singular samples in data, and the SVM classifier acts as the second layer to classify the remaining samples. Finally, MRS is used to make prognosis prediction on two open breast cancer datasets. One dataset, denoted as BRC-1 hereafter, is a high quality, publicly available dataset of 97 breast cancer tumors of node-negative patients. The other, denoted as BRC-2 hereafter, uses baseline human primary breast tumor data from LBL breast cancer cell collection containing 174 samples.
Results: We have done two experiments on BRC-1 and BRC-2, respectively. In the first experiment, the BRC-1 dataset is divided into train set with 78 patients (34 ones belonging to poor prognosis group and 44 ones belonging to good prognosis group) and test set with 19 patients (12 ones belonging to poor prognosis group and 7 ones belonging to good prognosis). After trained on the train set, the MRS can correctly classify all the 12 patients with poor prognosis, and 6 of 7 patients with good prognosis in the test set. The results are better than previous researches, even better than the 70-gene based biomarkers. And in the second experiment, we construct the classifiers using BRC-2 dataset, and compare MRS with other representative methods in Weka software by 5-fold cross-validation, and comparison results show that MRS has higher prediction accuracy than those methods.
Conclusions: The proposed mixture classification model can easily integrate methods with different characteristics. It can overcome the shortcomings of traditional voting-based ensemble models and thus can make full use of the information in clinical data. The experimental results illustrate that our implemented MRS classifier can predict the breast cancer prognosis more accurately than previous prognostic methods.
2009 Elsevier B.V. All rights reserved.