A predictive risk probability approach for microarray data with survival as an endpoint

J Biopharm Stat. 2008;18(5):841-52. doi: 10.1080/10543400802277967.

Abstract

Gene expression profiling has played an important role in cancer risk classification and has shown promising results. Because profiling often involves selecting a set of top-ranked genes for analysis, it is important to evaluate how model performance varies with the number of top-ranked genes included in the model. As an example, we used a colon cancer data set collected at Moffitt Cancer Center and ranked genes using the univariate Cox proportional hazards model. Sets of top-ranked genes were selected for evaluation by taking the top k ranked genes for k = 1 to 12,500. The analysis showed considerable variation in classification outcomes as the number of top-ranked genes changed. We developed a predictive risk probability approach that accommodates this variation by identifying a range for the number of top-ranked genes. For each number of top-ranked genes, the procedure classifies each patient as high risk (score = 1) or low risk (score = 0). The classifications are then averaged, giving a risk score between 0 and 1 that ranks the patient's need for further treatment. We applied this approach to the colon cancer data set and demonstrated its strength by three criteria. First, a univariate Cox proportional hazards model showed that the predictive risk probability classification was highly statistically significant (log-rank χ² statistic = 110, p-value < 10⁻¹⁶). Second, a survival tree model used the risk probability to partition patients into five risk groups with well-separated survival curves (log-rank χ² statistic = 215). In addition, use of the risk group status identified a small set of risk genes that may be practical for biological validation. Third, a resampling analysis of the risk probability suggested that the variation pattern of the log-rank χ² in the colon cancer data set was unlikely to be due to chance.
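
The averaging step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how a patient is classified as high or low risk for a given number of genes, so the per-k rule below (a signed sum of standardized expression over the top k genes, split at the median) is purely an assumption for illustration, as are the function name `risk_probability` and its parameters. Only the final step, averaging the 0/1 classifications across the chosen values of k, follows the description above.

```python
import numpy as np

def risk_probability(expr, gene_order, coef_sign, k_values):
    """Average binary high/low risk calls over several top-k gene sets.

    expr       : (n_patients, n_genes) expression matrix
    gene_order : gene indices sorted from top ranked to lowest ranked
                 (e.g. by univariate Cox p-value, computed elsewhere)
    coef_sign  : +1/-1 per gene, the sign of its univariate Cox coefficient
    k_values   : numbers of top-ranked genes to average over
    """
    expr = np.asarray(expr, dtype=float)
    gene_order = np.asarray(gene_order)
    coef_sign = np.asarray(coef_sign, dtype=float)

    # Standardize each gene (assumes no gene has zero variance).
    z = (expr - expr.mean(axis=0)) / expr.std(axis=0)

    # Signed contributions in rank order; column k-1 of the cumulative
    # sum is the (assumed) risk score built from the top k genes.
    signed = z[:, gene_order] * coef_sign[gene_order]
    cum_score = np.cumsum(signed, axis=1)

    calls = []
    for k in k_values:
        score = cum_score[:, k - 1]
        # Assumed rule: high risk (1) if above the cohort median, else low (0).
        calls.append((score > np.median(score)).astype(int))

    # Risk probability: mean of the 0/1 calls across all k.
    return np.mean(calls, axis=0)

# Toy usage: 4 patients, 2 genes, averaging over k = 1 and k = 2.
toy_expr = [[1, 0], [2, 0], [3, 1], [4, 1]]
toy_risk = risk_probability(toy_expr, gene_order=[0, 1],
                            coef_sign=[1, -1], k_values=[1, 2])
print(toy_risk)
```

In this toy case the second and third patients are classified differently by the one-gene and two-gene rules, so they receive intermediate risk probabilities, which is the kind of variation across k that the averaging is meant to absorb.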

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Colonic Neoplasms / etiology*
  • Colonic Neoplasms / genetics
  • Colonic Neoplasms / mortality
  • Endpoint Determination*
  • Gene Expression Profiling*
  • Humans
  • Oligonucleotide Array Sequence Analysis*
  • Probability
  • Proportional Hazards Models
  • Risk