A predictive risk probability approach for microarray data with survival as an endpoint

Dung-Tsa Chen; Michael J Schell; James J Chen; William J Fulp; Steven Eschrich; Timothy Yeatman

doi:10.1080/10543400802277967

J Biopharm Stat. 2008;18(5):841-52. doi: 10.1080/10543400802277967.

Authors

Dung-Tsa Chen¹, Michael J Schell, James J Chen, William J Fulp, Steven Eschrich, Timothy Yeatman

Affiliation

¹ Biostatistics Division, Moffitt Cancer Center & Research Institute, University of South Florida, Tampa, Florida 33612, USA.

Abstract

Gene expression profiling has played an important role in cancer risk classification and has shown promising results. Because gene expression profiling often involves selecting a set of top-ranked genes for analysis, it is important to evaluate how modeling performance varies with the number of top-ranked genes incorporated in the model. We used a colon cancer data set collected at Moffitt Cancer Center as an example and ranked genes based on the univariate Cox proportional hazards model. Sets of top-ranked genes were selected for evaluation by choosing the top k ranked genes for k = 1 to 12,500. The analysis indicated considerable variation in classification outcomes as the number of top-ranked genes changed. We developed a predictive risk probability approach to accommodate this variation by identifying a range of numbers of top-ranked genes. For each number of top-ranked genes, the procedure classifies each patient as having high risk (score = 1) or low risk (score = 0). The categorizations are then averaged, giving a risk score between 0 and 1 and thus a ranking of each patient's need for further treatment. Applied to the colon cancer data set, the approach demonstrated its strength by three criteria. First, a univariate Cox proportional hazards model showed that the predictive risk probability classification was highly statistically significant (log-rank χ² statistic = 110, p-value < 10⁻¹⁶). Second, a survival tree model used the risk probability to partition patients into five risk groups with well-separated survival curves (log-rank χ² statistic = 215). In addition, the risk group status identified a small set of risk genes that may be practical for biological validation. Third, a resampling analysis of the risk probability suggested that the variation pattern of the log-rank χ² in the colon cancer data set was unlikely to be caused by chance.
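The averaging step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how each patient is classified for a given k, so the per-k rule here (a sign-weighted sum of standardized expression over the top k genes, split at the median) is an assumption made purely for illustration. The function name `predictive_risk_probability` and its parameters are likewise hypothetical, and the gene ranking and hazard directions are assumed to have been computed beforehand, for example from univariate Cox models.

```python
import numpy as np

def predictive_risk_probability(expr, ranked_genes, signs, k_values):
    """Average binary high/low risk calls over several top-k gene sets.

    expr         : (n_patients, n_genes) expression matrix.
    ranked_genes : gene column indices, best-ranked first (assumed precomputed).
    signs        : +1/-1 per gene, assumed direction of association with hazard.
    k_values     : iterable of numbers of top-ranked genes to use.
    """
    expr = np.asarray(expr, dtype=float)
    ranked_genes = np.asarray(ranked_genes)
    signs = np.asarray(signs, dtype=float)

    # Standardize each gene; constant genes keep a unit divisor to avoid 0/0.
    std = expr.std(axis=0)
    std[std == 0] = 1.0
    z = (expr - expr.mean(axis=0)) / std

    # Signed contribution of each gene, arranged in rank order.
    contrib = z[:, ranked_genes] * signs[ranked_genes]
    # cumulative[:, k - 1] is the summed signed score over the top k genes.
    cumulative = np.cumsum(contrib, axis=1)

    calls = []
    for k in k_values:
        score = cumulative[:, k - 1]
        # Assumed rule: above-median score -> high risk (1), otherwise low risk (0).
        calls.append((score > np.median(score)).astype(int))

    # Averaging the 0/1 calls over all k gives a risk score in [0, 1] per patient.
    return np.mean(calls, axis=0)
```

Calling the function with `k_values=range(1, n_genes + 1)` would average over every possible k; restricting `k_values` to a narrower range corresponds to the abstract's idea of identifying a range of numbers of top-ranked genes.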

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Colonic Neoplasms / etiology*
Colonic Neoplasms / genetics
Colonic Neoplasms / mortality
Endpoint Determination*
Gene Expression Profiling*
Humans
Oligonucleotide Array Sequence Analysis*
Probability
Proportional Hazards Models
Risk
