Background: Extraction of protein phosphorylation information from biomedical literature has gained much attention because of the importance in numerous biological processes.
Objective: In this study, we propose a text mining methodology which consists of two phases, NLP parsing and SVM classification to extract phosphorylation information from literature.
Methods: First, using NLP parsing we divide the data into three base-forms depending on the biomedical entities related to phosphorylation and further classify into ten sub-forms based on their distribution with phosphorylation keyword. Next, we extract the phosphorylation entity singles/pairs/triplets and apply SVM to classify the extracted singles/pairs/triplets using a set of features applicable to each sub-form.
Results: The performance of our methodology was evaluated on three corpora namely PLC, iProLink and hPP corpus. We obtained promising results of >85% F-score on ten sub-forms of training datasets on cross validation test. Our system achieved overall F-score of 93.0% on iProLink and 96.3% on hPP corpus test datasets. Furthermore, our proposed system achieved best performance on cross corpus evaluation and outperformed the existing system with recall of 90.1%.
Conclusions: The performance analysis of our unique system on three corpora reveals that it extracts protein phosphorylation information efficiently in both non-organism specific general datasets such as PLC and iProLink, and human specific dataset such as hPP corpus.
Keywords: Human protein phosphorylation; Information extraction; Natural language processing; Post transcriptional modification; Support Vector Machines; hPP corpus.
Copyright © 2018 Elsevier B.V. All rights reserved.