High-Throughput Omics and Statistical Learning Integration for the Discovery and Validation of Novel Diagnostic Signatures in Colorectal Cancer

Int J Mol Sci. 2019 Jan 12;20(2):296. doi: 10.3390/ijms20020296.

Abstract

Advances in bioinformatics and machine learning have facilitated the discovery and validation of omics-based biomarkers. This study employed a novel approach combining multi-platform transcriptomics and cutting-edge algorithms to introduce novel signatures for the accurate diagnosis of colorectal cancer (CRC). Several random forest (RF)-based feature selection methods, including area under the curve (AUC)-RF, Boruta, and Vita, were used, and the diagnostic performance of the proposed biosignatures was benchmarked using RF, logistic regression, naïve Bayes, and k-nearest neighbors models. All models performed satisfactorily, with RF appearing to perform best. For the RF model, mean accuracy was 0.998 (standard deviation (SD) < 0.003), mean specificity was 0.999 (SD < 0.003), and mean sensitivity was 0.998 (SD < 0.004). Moreover, the proposed biomarker signatures were strongly associated with multiple hallmarks of cancer. Some biomarkers were enriched in epithelial cell signaling in Helicobacter pylori infection and in inflammatory processes. Overexpression of TGFBI and S100A2 was associated with poor disease-free survival, while down-regulation of NR5A2, SLC4A4, and CD177 was linked to worse overall survival. In conclusion, novel transcriptome signatures intended to improve diagnostic accuracy in CRC are introduced for further validation in various clinical settings.
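The performance figures above are means and standard deviations of accuracy, sensitivity, and specificity. As a minimal sketch of how such summaries can be computed from repeated evaluation runs, the Python below uses only the standard library. The function names (`diagnostic_metrics`, `summarize`) and the toy labels are hypothetical illustrations, not the study's code or data, and the sketch assumes binary labels with 1 for CRC and 0 for control.

```python
from statistics import mean, stdev


def diagnostic_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary labels.

    Assumption: 1 = CRC (positive), 0 = control (negative), and each
    class appears at least once in y_true (otherwise a division by
    zero would occur).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)  # fraction of CRC samples detected
    specificity = tn / (tn + fp)  # fraction of controls correctly cleared
    return accuracy, sensitivity, specificity


def summarize(runs):
    """Summarize metrics over repeated runs as (mean, sample SD).

    runs: list of (y_true, y_pred) pairs, one per evaluation run;
    at least two runs are needed for the sample standard deviation.
    """
    per_run = [diagnostic_metrics(t, p) for t, p in runs]
    names = ("accuracy", "sensitivity", "specificity")
    return {
        name: (mean(values), stdev(values))
        for name, values in zip(names, zip(*per_run))
    }


# Toy, made-up labels for three runs (illustration only).
toy_runs = [
    ([1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0]),
    ([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]),
    ([1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 1]),
]
summary = summarize(toy_runs)
for name, (m, sd) in summary.items():
    print(f"mean {name} {m:.3f} (SD {sd:.3f})")
```

In practice the predictions for each run would come from a classifier such as RF evaluated on held-out samples. The abstract does not specify the resampling scheme, so none is assumed here.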

Keywords: biomarker; colorectal cancer; diagnosis; machine learning; transcriptomics; variable selection.

MeSH terms

  • Algorithms
  • Area Under Curve
  • Bayes Theorem
  • Biomarkers, Tumor / genetics*
  • Chemotactic Factors / genetics
  • Colorectal Neoplasms / diagnosis*
  • Colorectal Neoplasms / genetics
  • Computational Biology / methods*
  • Female
  • GPI-Linked Proteins / genetics
  • Gene Expression Profiling / methods*
  • Gene Expression Regulation, Neoplastic
  • Humans
  • Isoantigens / genetics
  • Logistic Models
  • Machine Learning
  • Oligonucleotide Array Sequence Analysis / methods*
  • Prognosis
  • Receptors, Cell Surface / genetics
  • Receptors, Cytoplasmic and Nuclear / genetics
  • S100 Proteins / genetics
  • Sensitivity and Specificity
  • Sodium-Bicarbonate Symporters / genetics
  • Survival Analysis
  • Transforming Growth Factor beta1 / genetics

Substances

  • Biomarkers, Tumor
  • CD177 protein, human
  • Chemotactic Factors
  • GPI-Linked Proteins
  • Isoantigens
  • NR5A2 protein, human
  • Receptors, Cell Surface
  • Receptors, Cytoplasmic and Nuclear
  • S100 Proteins
  • S100A2 protein, human
  • SLC4A4 protein, human
  • Sodium-Bicarbonate Symporters
  • TGFB1 protein, human
  • Transforming Growth Factor beta1