PDP-CON: prediction of domain/linker residues in protein sequences using a consensus approach

J Mol Model. 2016 Apr;22(4):72. doi: 10.1007/s00894-016-2933-0. Epub 2016 Mar 11.

Abstract

The prediction of domain/linker residues in protein sequences is a crucial task in the functional classification of proteins, homology-based protein structure prediction, and high-throughput structural genomics. In this work, a novel consensus-based machine-learning technique was applied for residue-level prediction of the domain/linker annotations in protein sequences using ordered/disordered regions along protein chains and a set of physicochemical properties. Six different classifiers (decision tree, Gaussian naïve Bayes, linear discriminant analysis, support vector machine, random forest, and multilayer perceptron) were exhaustively explored for the residue-level prediction of domain/linker regions. The protein sequences from the curated CATH database were used for training and cross-validation experiments. Test results obtained by applying the developed PDP-CON tool to the mutually exclusive, independent proteins of the CASP-8, CASP-9, and CASP-10 databases are reported. An n-star quality consensus approach was used to combine the results yielded by different classifiers. The average PDP-CON accuracy and F-measure values for the CASP targets were found to be 0.86 and 0.91, respectively. The dataset, source code, and all supplementary materials for this work are available at https://cmaterju.org/cmaterbioinfo/ for noncommercial use.
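The abstract does not spell out the n-star consensus rule, so the following is only an illustrative sketch under stated assumptions, not the PDP-CON implementation. It assumes each classifier emits one binary label per residue (1 = linker, 0 = domain) and that a residue is called a linker when at least n classifiers agree. The function names, the label encoding, and the choice of linker as the positive class for the F-measure are all hypothetical.

```python
from typing import List, Sequence, Tuple


def n_star_consensus(votes: Sequence[Sequence[int]], n: int) -> List[int]:
    """Combine per-residue labels from several classifiers.

    votes[c][i] is classifier c's label for residue i
    (assumed encoding: 1 = linker, 0 = domain).
    A residue is labelled 1 when at least n classifiers voted 1.
    """
    if not votes:
        raise ValueError("need at least one classifier")
    length = len(votes[0])
    if any(len(v) != length for v in votes):
        raise ValueError("all classifiers must label the same residues")
    if not 1 <= n <= len(votes):
        raise ValueError("n must be between 1 and the number of classifiers")
    # zip(*votes) yields, for each residue, the tuple of all classifier votes.
    return [1 if sum(column) >= n else 0 for column in zip(*votes)]


def accuracy_and_f_measure(
    truth: Sequence[int], predicted: Sequence[int]
) -> Tuple[float, float]:
    """Return (accuracy, F-measure), treating label 1 as the positive class."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("truth and predicted must be non-empty and equal length")
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    denominator = 2 * tp + fp + fn
    # F-measure is defined as 0.0 here when there are no positives at all.
    f_measure = (2 * tp / denominator) if denominator else 0.0
    return accuracy, f_measure


# Made-up votes from six classifiers over a five-residue toy sequence.
example_votes = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
]
example_truth = [0, 1, 1, 1, 0]  # invented reference labels

consensus = n_star_consensus(example_votes, n=3)
print(consensus)
print(accuracy_and_f_measure(example_truth, consensus))
```

Raising n makes the consensus stricter (fewer residues called linker), while lowering it makes the consensus more permissive; the toy data above are invented purely to exercise the functions and say nothing about the reported CASP results.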

Keywords: Consensus strategy; Domain boundary prediction; Machine-learning approaches; Ordered-disordered regions in protein sequence; Physicochemical properties; Protein domain/linker prediction.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Bayes Theorem
  • Caspase 10 / chemistry*
  • Caspase 8 / chemistry*
  • Caspase 9 / chemistry*
  • Computational Biology / methods*
  • Databases, Protein
  • Decision Trees
  • Discriminant Analysis
  • Humans
  • Neural Networks, Computer
  • Protein Domains
  • Sequence Analysis, Protein
  • Structural Homology, Protein
  • Support Vector Machine*

Substances

  • CASP8 protein, human
  • CASP9 protein, human
  • Caspase 10
  • Caspase 8
  • Caspase 9
  • CASP10 protein, human