Cancers are caused by the accumulation of genomic alterations. Driver mutations are required for the cancer phenotype, whereas passenger mutations are irrelevant to tumor development and accumulate through DNA replication. A major challenge facing the field of cancer genome sequencing is to identify cancer-associated genes with mutations that drive the cancer phenotype. Here, we describe a powerful and flexible statistical framework for identifying driver genes and driver signaling pathways in cancer genome-sequencing studies. Biological knowledge of the mutational process in tumors is fully integrated into our statistical models and includes such variables as the length of protein-coding regions, transcript isoforms, variation in mutation types, differences in background mutation rates, the redundancy of the genetic code, and multiple mutations in one gene. This framework provides several important features that previous methods either do not address or handle only naively. In particular, motivated by the observed low prevalence of somatic mutations in individual tumors, we propose a heuristic strategy to estimate the mixture proportion of the chi-square distribution of the likelihood ratio test (LRT) statistic. This provides significantly increased statistical power compared to the regular LRT. Through a combination of simulation and analysis of TCGA cancer sequencing study data, we demonstrate the high accuracy and sensitivity of our methods. Our statistical methods and several auxiliary bioinformatics tools have been incorporated into a computational tool, DrGaP. The newly developed tool is immediately applicable to cancer genome-sequencing studies and will lead to a more complete identification of altered driver genes and driver signaling pathways in cancer.
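To illustrate why a mixture chi-square reference distribution can increase power relative to a regular LRT, the following minimal sketch computes a p-value when the null distribution of the LRT statistic is a mixture of a point mass at zero and a chi-square distribution. It is not DrGaP's implementation. The one-degree-of-freedom component, the function names, and the `weight_chi2` parameter are assumptions made for illustration; the abstract does not state how the mixture proportion is estimated, so the sketch takes it as a given input.

```python
import math


def chi2_1df_sf(x: float) -> float:
    """Upper tail P(X > x) of a chi-square distribution with 1 degree of freedom.

    Uses the identity P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    """
    if x <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


def mixture_lrt_pvalue(lrt_stat: float, weight_chi2: float = 0.5) -> float:
    """P-value of an LRT statistic under the assumed null mixture
    (1 - weight_chi2) * [point mass at 0] + weight_chi2 * chi2(1 df).

    weight_chi2 is a hypothetical input here; estimating it is the part
    the abstract attributes to a heuristic strategy and is not reproduced.
    """
    if not 0.0 <= weight_chi2 <= 1.0:
        raise ValueError("weight_chi2 must lie in [0, 1]")
    if lrt_stat <= 0.0:
        # A statistic of zero is never evidence against the null.
        return 1.0
    # The point mass at zero contributes nothing to the upper tail.
    return weight_chi2 * chi2_1df_sf(lrt_stat)


if __name__ == "__main__":
    stat = 5.0  # made-up LRT statistic, for illustration only
    for w in (1.0, 0.5, 0.1):
        print(f"weight={w:.1f}  p-value={mixture_lrt_pvalue(stat, w):.6f}")
```

With `weight_chi2 = 1.0` the function reduces to the regular one-degree-of-freedom LRT p-value. Any smaller weight scales the p-value down for the same statistic, which is how a mixture reference distribution can yield more power when most of the null mass sits at zero.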
Copyright © 2013 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.