A machine learning framework to trace tumor tissue-of-origin of 13 types of cancer based on DNA somatic mutation

Biochim Biophys Acta Mol Basis Dis. 2020 Nov 1;1866(11):165916. doi: 10.1016/j.bbadis.2020.165916. Epub 2020 Aug 7.

Abstract

Carcinoma of unknown primary (CUP), defined as metastatic cancer whose site of origin is unknown, occurs in 3-5 per 100 cancer patients in the United States. The heterogeneity and metastasis of cancer bring great difficulties to the follow-up diagnosis and treatment of CUP. Multiple methods have been proposed to find the tissue-of-origin (TOO) of CUP. However, the accuracies of computed tomography (CT) and positron emission tomography (PET) in identifying the TOO were 20%-27% and 24%-40%, respectively, which is not enough for determining targeted therapies. In this study, we provide a machine learning framework to trace tumor tissue origin using gene length-normalized somatic mutation sequencing data. Somatic mutation data were downloaded from the Data Portal (Release 28) of the International Cancer Genome Consortium (ICGC), and 4909 samples from 13 cancers were used to identify the primary site of cancers. Optimal results were obtained with a 600-gene set using the random forest algorithm with 10-fold cross-validation; the average accuracy and F1-score across the 13 types of cancer were 0.8822 and 0.8886, respectively. In conclusion, we provide an effective computational framework to infer cancer tissue-of-origin by combining DNA sequencing and machine learning techniques, which is promising for assisting the clinical diagnosis of cancers.
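To make two steps of the pipeline above concrete, the following is a minimal sketch, not the authors' implementation. It shows one plausible way to build gene length-normalized mutation features and to compute accuracy and F1-score from predicted labels. The abstract does not give the exact normalization formula or the F1 averaging scheme, so mutations per kilobase and macro-averaging are assumptions here. The gene names, gene lengths, sample IDs, and labels are hypothetical toy values. The random forest itself is omitted; in practice it would typically come from a library such as scikit-learn's `RandomForestClassifier`, evaluated over 10 cross-validation folds.

```python
from collections import Counter


def length_normalized_features(mutations, gene_lengths, genes):
    """Build per-sample feature vectors of mutations per kilobase of gene.

    mutations: list of (sample_id, gene) pairs, one per somatic mutation.
    gene_lengths: dict mapping gene -> length in base pairs.
    genes: ordered list of genes defining the feature columns.
    Assumption: normalization is count / gene length, scaled to 1 kb.
    """
    counts = Counter(mutations)  # (sample_id, gene) -> mutation count
    samples = sorted({sample_id for sample_id, _ in mutations})
    return {
        s: [1000.0 * counts[(s, g)] / gene_lengths[g] for g in genes]
        for s in samples
    }


def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for multi-class labels."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Hypothetical toy data: two samples, two genes with made-up lengths.
toy_mutations = [
    ("S1", "GENE_A"), ("S1", "GENE_A"), ("S1", "GENE_B"),
    ("S2", "GENE_B"),
]
toy_lengths = {"GENE_A": 2000, "GENE_B": 500}  # base pairs, illustrative only
features = length_normalized_features(
    toy_mutations, toy_lengths, ["GENE_A", "GENE_B"]
)

# Hypothetical true and predicted tissue labels for four samples.
acc, macro_f1 = accuracy_and_macro_f1(
    ["lung", "lung", "liver", "liver"],
    ["lung", "liver", "liver", "liver"],
)
```

Dividing by gene length keeps long genes, which accumulate more mutations by chance, from dominating the feature vector. In a 10-fold setting, the metric function would be applied to the held-out predictions of each fold and the results averaged.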

Keywords: Cancers of unknown primary; Cross-validation; Gene length; Random forest; Somatic mutation; Tissue-of-origin.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • DNA / genetics*
  • Machine Learning*
  • Mutation / genetics
  • Neoplasms, Unknown Primary / genetics*
  • Positron-Emission Tomography
  • Sequence Analysis, DNA

Substances

  • DNA