Mining protein phosphorylation information from biomedical literature using NLP parsing and Support Vector Machines

Comput Methods Programs Biomed. 2018 Jul:160:57-64. doi: 10.1016/j.cmpb.2018.03.022. Epub 2018 Mar 22.

Abstract

Background: Extraction of protein phosphorylation information from biomedical literature has gained much attention because of the importance of phosphorylation in numerous biological processes.

Objective: In this study, we propose a text mining methodology consisting of two phases, NLP parsing and SVM classification, to extract phosphorylation information from the literature.

Methods: First, using NLP parsing, we divide the data into three base-forms according to the biomedical entities related to phosphorylation, and further classify these into ten sub-forms based on how the entities are distributed relative to the phosphorylation keyword. Next, we extract the phosphorylation entity singles, pairs and triplets and classify them with an SVM, using a set of features applicable to each sub-form.
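
The two phases described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the role labels (kinase, substrate, site), the two features, and the toy training data are all hypothetical, since the abstract names neither the entity types nor the features. The classifier is a plain linear SVM trained by batch sub-gradient descent on the hinge loss, written with the standard library only; the abstract does not say which SVM variant or kernel was used.

```python
from itertools import combinations

# Hypothetical tagged entities for one sentence; role names are assumptions.
ENTITIES = [("kinase", "PKA"), ("substrate", "CREB"), ("site", "Ser133")]

def extract_candidates(entities):
    """Return singles, pairs and triplets of entities with distinct roles."""
    found = []
    for size in (1, 2, 3):
        for combo in combinations(entities, size):
            roles = {role for role, _ in combo}
            if len(roles) == size:  # skip combos that repeat a role
                found.append(combo)
    return found

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2*||w||^2 + mean hinge loss; labels must be +1 or -1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge loss is active for this example
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 (true phosphorylation relation) or -1 (not a relation)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical features per candidate:
# [scaled token distance to the phosphorylation keyword, keyword-between flag]
X = [[0.1, 1.0], [0.2, 1.0], [0.1, 1.0], [0.9, 0.0], [0.8, 0.0], [1.0, 0.0]]
y = [1, 1, 1, -1, -1, -1]

candidates = extract_candidates(ENTITIES)
w, b = train_linear_svm(X, y)
```

In a setup like the one the abstract describes, one such classifier would presumably be trained per sub-form, each with its own feature set.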

Results: The performance of our methodology was evaluated on three corpora, namely the PLC, iProLink and hPP corpora. In cross-validation on the training datasets, we obtained F-scores above 85% for all ten sub-forms. Our system achieved an overall F-score of 93.0% on the iProLink test dataset and 96.3% on the hPP corpus test dataset. Furthermore, our proposed system achieved the best performance in cross-corpus evaluation and outperformed the existing system, with a recall of 90.1%.
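
For readers unfamiliar with the metrics reported above, the following minimal sketch shows how precision, recall and F-score relate. It assumes the balanced F-score (F1, the harmonic mean of precision and recall), which the abstract does not state explicitly, and the counts are hypothetical values chosen only for illustration, not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and balanced F-score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90 true positives, 10 false positives, 10 false negatives
p, r, f = precision_recall_f1(tp=90, fp=10, fn=10)
```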

Conclusions: The performance analysis on three corpora shows that our system extracts protein phosphorylation information efficiently from both general, non-organism-specific datasets such as PLC and iProLink and human-specific datasets such as the hPP corpus.

Keywords: Human protein phosphorylation; Information extraction; Natural language processing; Post-translational modification; Support Vector Machines; hPP corpus.

Publication types

  • Comparative Study
  • Evaluation Study
  • Validation Study

MeSH terms

  • Data Mining / methods*
  • Data Mining / statistics & numerical data
  • Databases, Protein / statistics & numerical data
  • Humans
  • Natural Language Processing
  • Phosphorylation
  • Protein Modification, Translational
  • Proteins / metabolism*
  • Support Vector Machine

Substances

  • Proteins