Evaluating the Predictability of Cancer Types from 536 Somatic Mutations: A New Dataset

Annu Int Conf IEEE Eng Med Biol Soc. 2020 Jul:2020:5308-5311. doi: 10.1109/EMBC44109.2020.9176699.

Abstract

In this paper, we introduce a new dataset for cancer research containing the somatic mutation states of 536 genes from the Cancer Gene Census (CGC). We created the dataset from somatic mutation information in The Cancer Genome Atlas (TCGA) projects. As a preliminary investigation, we employed machine learning techniques, including k-Nearest Neighbors, Decision Tree, Random Forest, and Artificial Neural Networks (ANNs), to evaluate the potential of these somatic mutations for classifying cancer types. We compared the models on accuracy, precision, recall, and F1-score. ANNs outperformed the other models, with an F1-score of 0.36, an overall classification accuracy of 40%, and precision ranging from 12% to 92% across cancer types. The 40% accuracy is significantly higher than random guessing, which would have yielded an overall classification accuracy of about 3% (1 in 33 cancer types). Although the model has relatively low overall accuracy, it has an average classification specificity of 98%. The ANN achieved high precision scores (> 0.7) for 5 of the 33 cancer types. The introduced dataset can be used for research on TCGA data, such as survival analysis, histopathology image analysis, and content-based image retrieval. The dataset is available online for download: https://kimialab.uwaterloo.ca/kimia/.
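
The metrics named in the abstract (accuracy, precision, recall, F1-score, specificity) can all be derived from a multi-class confusion matrix, and the 3% random-guessing figure corresponds to 1/33. The sketch below shows one common way to compute them in a one-vs-rest fashion. It is an illustration only: the function names, the toy 3-class matrix, and the choice of macro (unweighted) averaging are assumptions, since the abstract does not state how the reported averages were computed.

```python
from typing import Dict, List

Matrix = List[List[int]]  # cm[i][j]: samples of true class i predicted as class j


def accuracy(cm: Matrix) -> float:
    """Fraction of all samples that lie on the diagonal of the matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total if total else 0.0


def per_class_metrics(cm: Matrix) -> List[Dict[str, float]]:
    """One-vs-rest precision, recall, specificity, and F1 for each class."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                         # true k, predicted other
        fp = sum(cm[i][k] for i in range(n)) - tp    # true other, predicted k
        tn = total - tp - fn - fp                    # everything else
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append({
            "precision": precision,
            "recall": recall,
            "specificity": specificity,
            "f1": f1,
        })
    return results


def macro_average(metrics: List[Dict[str, float]], key: str) -> float:
    """Unweighted mean of one metric over classes (an assumed averaging scheme)."""
    return sum(m[key] for m in metrics) / len(metrics)


def uniform_random_baseline(n_classes: int) -> float:
    """Expected accuracy when each class is guessed with equal probability."""
    return 1.0 / n_classes


# Hypothetical 3-class confusion matrix, not data from the paper.
toy_cm = [
    [5, 3, 2],
    [2, 6, 2],
    [1, 3, 6],
]
toy_metrics = per_class_metrics(toy_cm)
```

With many classes, each one-vs-rest comparison is dominated by true negatives, which is one reason a high average specificity can coexist with a modest overall accuracy; calling `uniform_random_baseline(33)` gives the roughly 3% reference point quoted above.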

MeSH terms

  • Humans
  • Machine Learning
  • Mutation
  • Neoplasms* / genetics
  • Neural Networks, Computer*
  • Sensitivity and Specificity