Fast Prediction of Protein Methylation Sites Using a Sequence-Based Feature Selection Technique

IEEE/ACM Trans Comput Biol Bioinform. 2019 Jul-Aug;16(4):1264-1273. doi: 10.1109/TCBB.2017.2670558. Epub 2017 Feb 16.

Abstract

Protein methylation, an important post-translational modification, plays crucial roles in many cellular processes. Accurate prediction of protein methylation sites is fundamentally important for revealing the molecular mechanisms underlying methylation. In recent years, computational prediction based on machine learning algorithms has emerged as a powerful and robust approach for identifying methylation sites, and much progress has been made in improving predictive performance. However, the overall accuracy of existing methods is still unsatisfactory. Motivated by this, we propose a novel random-forest-based predictor called MePred-RF, which integrates several discriminative sequence-based feature descriptors and improves feature representation capability using a powerful feature selection technique. Importantly, unlike other methods that rely on multiple, complex information inputs, MePred-RF is based on sequence information alone. Comparative studies on benchmark datasets via rigorous jackknife tests indicate that MePred-RF remarkably outperforms other state-of-the-art predictors, improving overall accuracy by an average of 4.5 percent. A user-friendly webserver that implements the proposed method has been established for researchers' convenience and is freely available for public use at http://server.malab.cn/MePred-RF. We anticipate that our research tool will be useful for the large-scale prediction and analysis of protein methylation sites.
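
As a rough illustration of two ideas named in the abstract, a sequence-only feature descriptor and a jackknife (leave-one-out) test, the following minimal sketch encodes peptide windows by amino acid composition and scores a classifier by holding out one sample at a time. It is not the MePred-RF implementation: the abstract does not list the descriptors or the feature selection technique, so amino acid composition is an assumed stand-in descriptor, a nearest-centroid rule is swapped in for the random forest to keep the example dependency-free, and the toy peptide windows and labels are made up for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(peptide):
    """Return the 20-dimensional amino acid composition of a peptide."""
    counts = Counter(peptide)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def predict(train, x):
    """Nearest-centroid stand-in classifier (NOT the paper's random forest)."""
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append(features)
    return min(by_label, key=lambda lab: sq_dist(centroid(by_label[lab]), x))

def jackknife_accuracy(samples):
    """Leave each sample out once, train on the rest, score the held-out one."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        correct += predict(train, features) == label
    return correct / len(samples)

# Hypothetical 9-residue windows centered on arginine (R); the sequences and
# labels (1 = methylated, 0 = not) are invented purely for illustration.
toy = [
    ("GGRGRGGGF", 1),
    ("RGGFRGGRG", 1),
    ("GRGGRGGYG", 1),
    ("LKEARLSEV", 0),
    ("DVLERAKSL", 0),
    ("ELKSRVDAL", 0),
]
samples = [(composition(p), y) for p, y in toy]
accuracy = jackknife_accuracy(samples)
print(f"jackknife accuracy: {accuracy:.2f}")
```

In a jackknife test every sample serves as the test case exactly once, so the resulting estimate does not depend on a random train/test split; the cost is one model fit per sample.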

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • Databases, Protein
  • Humans
  • Methylation
  • Peptides / chemistry
  • Predictive Value of Tests
  • Protein Processing, Post-Translational*
  • Proteins / chemistry*
  • Reproducibility of Results
  • Sequence Analysis, Protein / methods*
  • Support Vector Machine

Substances

  • Peptides
  • Proteins