Prediction of inherited genomic susceptibility to 20 common cancer types by a supervised machine-learning method

Byung-Ju Kim; Sung-Hou Kim

doi:10.1073/pnas.1717960115

Proc Natl Acad Sci U S A. 2018 Feb 6;115(6):1322-1327. doi: 10.1073/pnas.1717960115. Epub 2018 Jan 22.

Authors

Byung-Ju Kim ¹ ² ³, Sung-Hou Kim ⁴ ² ³ ⁵

Affiliations

¹ Department of Chemistry, University of California, Berkeley, CA 94720.
² Department of Integrative Omics for Biomedical Sciences, Yonsei University Graduate School, Seoul, Korea.
³ Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720.
⁴ Department of Chemistry, University of California, Berkeley, CA 94720; [email protected].
⁵ Center for Computational Biology, University of California, Berkeley, CA 94720.

Abstract

Prevention and early intervention are the most effective ways of avoiding or minimizing psychological, physical, and financial suffering from cancer. However, such proactive action requires the ability to predict the individual's susceptibility to cancer with a measure of probability. Of the triad of cancer-causing factors (inherited genomic susceptibility, environmental factors, and lifestyle factors), the inherited genomic component may be derivable from the recent public availability of a large body of whole-genome variation data. However, genome-wide association studies have so far shown limited success in predicting the inherited susceptibility to common cancers. We present here a multiple classification approach for predicting individuals' inherited genomic susceptibility to acquire the most likely phenotype among a panel of 20 major common cancer types plus 1 "healthy" type by application of a supervised machine-learning method under competing conditions among the cohorts of the 21 types. This approach suggests that, depending on the phenotypes of 5,919 individuals of "white" ethnic population in this study, (i) the portion of the cohort of a cancer type who acquired the observed type due to mostly inherited genomic susceptibility factors ranges from about 33 to 88% (or its corollary: the portion due to mostly environmental and lifestyle factors ranges from 12 to 67%), and (ii) on an individual level, the method also predicts individuals' inherited genomic susceptibility to acquire the other types ranked with associated probabilities. These probabilities may provide practical information for individuals, health professionals, and health policymakers related to prevention and/or early intervention of cancer.

Keywords: SNP syntax; cancer risk; genomic/environmental factors; k nearest neighbor method; multiple assortment model.
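
As a rough illustration of the idea named in the abstract and keywords (a k nearest neighbor classifier that ranks several competing phenotype labels with associated probabilities), the following is a minimal sketch only. It is not the authors' implementation: the genotype encoding, the Hamming distance, the value of k, the label names, and the toy data are all assumptions made for this example, and the paper's "SNP syntax" features are not reproduced here.

```python
from collections import Counter


def hamming(a, b):
    """Count the positions at which two equal-length genotype vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def rank_phenotypes(query, training, k=3):
    """Return (label, fraction of the k nearest neighbors) pairs, most likely first."""
    # Sort labeled individuals by distance to the query and keep the k closest.
    # sorted() is stable, so equidistant individuals keep their input order.
    nearest = sorted(training, key=lambda item: hamming(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # The vote share among the k neighbors serves as a crude probability estimate.
    return sorted(
        ((label, count / k) for label, count in votes.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Made-up toy data (assumption): each vector holds minor-allele counts (0/1/2)
# at six hypothetical SNP sites; the labels are placeholders, not real cohorts.
training = [
    ((0, 0, 1, 2, 0, 1), "healthy"),
    ((0, 0, 1, 2, 0, 0), "healthy"),
    ((2, 1, 0, 0, 1, 1), "type_A"),
    ((2, 1, 0, 0, 2, 1), "type_A"),
    ((1, 2, 2, 0, 0, 2), "type_B"),
    ((1, 2, 2, 1, 0, 2), "type_B"),
]

query = (2, 1, 0, 0, 1, 0)
ranking = rank_phenotypes(query, training, k=3)
for label, probability in ranking:
    print(f"{label}: {probability:.2f}")
```

In this toy setting the labels compete for the same k neighbors, so a phenotype's share can only grow at the expense of the others, which loosely mirrors the "competing conditions" among cohorts described in the abstract.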

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Genetic Predisposition to Disease*
Genome, Human
Humans
Life Style
Machine Learning*
Neoplasms / genetics*
Polymorphism, Single Nucleotide*
Probability