Automated recognition of functional compound-protein relationships in literature

PLoS One. 2020 Mar 3;15(3):e0220925. doi: 10.1371/journal.pone.0220925. eCollection 2020.

Abstract

Motivation: Much effort has been invested in the identification of protein-protein interactions using text mining and machine learning methods. The extraction of functional relationships between chemical compounds and proteins from literature has received much less attention, and no ready-to-use open-source software is so far available for this task.

Method: We created a new benchmark dataset of 2,613 sentences from abstracts containing annotations of proteins, small molecules, and their relationships. Two kernel methods, the shallow linguistic kernel and the all-paths graph kernel, were applied to classify these relationships as functional or non-functional. Furthermore, the benefit of interaction verbs in sentences was evaluated.
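
To give a rough idea of what a kernel over sentences computes, the following minimal sketch defines a normalized n-gram overlap kernel between tokenized sentences. It is not the shallow linguistic or the all-paths graph kernel from this work; the function names, the placeholder tokens COMPOUND and PROTEIN, and the example sentences are assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_counts(tokens, n_max=2):
    """Count all contiguous n-grams of length 1..n_max in a token list."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def ngram_kernel(a, b, n_max=2):
    """Unnormalized kernel: dot product of the two n-gram count vectors."""
    ca, cb = ngram_counts(a, n_max), ngram_counts(b, n_max)
    return sum(ca[g] * cb[g] for g in ca if g in cb)


def normalized_kernel(a, b, n_max=2):
    """Cosine-normalized kernel, so that k(x, x) == 1 for non-empty x."""
    denom = sqrt(ngram_kernel(a, a, n_max) * ngram_kernel(b, b, n_max))
    return ngram_kernel(a, b, n_max) / denom if denom else 0.0


# Hypothetical sentences with entity mentions replaced by placeholders.
s1 = "COMPOUND inhibits PROTEIN".split()
s2 = "COMPOUND strongly inhibits PROTEIN".split()
s3 = "PROTEIN was measured after COMPOUND treatment".split()

sim_close = normalized_kernel(s1, s2)  # shares the "inhibits PROTEIN" pattern
sim_far = normalized_kernel(s1, s3)    # shares only the entity placeholders
```

A kernel-based classifier such as a support vector machine would consume the matrix of pairwise kernel values instead of explicit feature vectors; in this toy case `sim_close` is larger than `sim_far` because the first two sentences share an interaction pattern.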

Results: The cross-validation of the all-paths graph kernel (AUC value: 84.6%, F1 score: 79.0%) shows slightly better results than the shallow linguistic kernel (AUC value: 82.5%, F1 score: 77.2%) on our benchmark dataset. Both models achieve state-of-the-art performance in the research area of relation extraction. In addition, combining the shallow linguistic and all-paths graph kernels could slightly increase the overall performance. We used each of the two kernels to identify functional relationships in all PubMed abstracts (29 million) and provide the results, including recorded processing time.
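
For readers unfamiliar with the two reported metrics, the sketch below shows how an F1 score and an AUC value can be computed for a binary functional/non-functional classifier. The labels and scores are made-up toy values, not data from the benchmark, and the helper names and the 0.5 threshold are assumptions; this is not the evaluation code of the published pipeline.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = functional relationship, 0 = non-functional.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]         # hypothetical classifier outputs
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The F1 score depends on the chosen decision threshold, whereas the AUC value summarizes ranking quality over all thresholds, which is one reason to report both.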

Availability: The software for the tested kernels, the benchmark, the processed 29 million PubMed abstracts, all evaluation scripts, as well as the scripts for processing the complete PubMed database are freely available at https://github.com/KerstenDoering/CPI-Pipeline.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Automation
  • Databases, Factual
  • Linguistics
  • Machine Learning
  • Proteins / chemistry*
  • Publications*

Substances

  • Proteins

Grants and funding

  • KD received funding for this project from the German National Research Foundation (DFG, Lis45).
  • AFAM was funded by a doctoral research grant from the German Academic Exchange Service [DAAD, Award No. 91653768].
  • MG was funded by the China Scholarship Council [Award No. 2 201908080143].
  • JL was supported by the German National Research Foundation [DFG, Research Training Group 1976] and funded by the Baden-Württemberg Foundation [BWST_WSF-043].
  • PT received funding from the German Research Center for Artificial Intelligence (DFKI). The company DFKI is a non-profit public-private partnership.
  • Calculations were partly performed on a cluster hosted by the High Performance and Cloud Computing Group at the Zentrum für Datenverarbeitung of the University of Tübingen. The cluster is funded by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no INST 37/935-1 FUGG.

The article processing charge was funded by the German Research Foundation (DFG) and the Albert Ludwigs University Freiburg in the funding programme Open Access Publishing. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.