EnzChemRED, a rich enzyme chemistry relation extraction dataset

Po-Ting Lai; Elisabeth Coudert; Lucila Aimo; Kristian Axelsen; Lionel Breuza; Edouard de Castro; Marc Feuermann; Anne Morgat; Lucille Pourcel; Ivo Pedruzzi; Sylvain Poux; Nicole Redaschi; Catherine Rivoire; Anastasia Sveshnikova; Chih-Hsuan Wei; Robert Leaman; Ling Luo; Zhiyong Lu; Alan Bridge

doi:10.1038/s41597-024-03835-7

Sci Data. 2024 Sep 9;11(1):982. doi: 10.1038/s41597-024-03835-7.

Authors

Po-Ting Lai^#¹, Elisabeth Coudert^#², Lucila Aimo², Kristian Axelsen², Lionel Breuza², Edouard de Castro², Marc Feuermann², Anne Morgat², Lucille Pourcel², Ivo Pedruzzi², Sylvain Poux², Nicole Redaschi², Catherine Rivoire², Anastasia Sveshnikova², Chih-Hsuan Wei¹, Robert Leaman¹, Ling Luo³, Zhiyong Lu⁴, Alan Bridge⁵

Affiliations

¹ National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA.
² Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland.
³ School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China.
⁴ National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA.
⁵ Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland.

^# Contributed equally.

Abstract

Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases, but it cannot keep pace with the rate of new discoveries and new publications. In this work, we present EnzChemRED (Enzyme Chemistry Relation Extraction Dataset), a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods, such as (large) language models, that can assist enzyme curation. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F₁ score) and to extract the chemical conversions (86.66% F₁ score) and the enzymes that catalyze those conversions (83.79% F₁ score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in the literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.
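
For readers unfamiliar with the metric, the F₁ scores quoted above are the harmonic mean of precision and recall. The following is a minimal sketch of how such a score could be computed for extracted chemical conversions, assuming each conversion is represented as a (substrate, product) pair and that a prediction counts as correct only on an exact match. The function name, the pair representation, and the placeholder identifiers are illustrative assumptions and are not taken from the paper's evaluation code.

```python
def f1_score(gold, predicted):
    """Return (precision, recall, F1) for two collections of relation tuples.

    Assumption: a predicted relation is a true positive only if it matches
    a gold relation exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (substrate, product) pairs; the identifiers are placeholders,
# not real ChEBI accessions.
gold = {("CHEM:A", "CHEM:B"), ("CHEM:C", "CHEM:D"), ("CHEM:E", "CHEM:F")}
predicted = {("CHEM:A", "CHEM:B"), ("CHEM:C", "CHEM:D"), ("CHEM:X", "CHEM:Y")}

precision, recall, f1 = f1_score(gold, predicted)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

In this toy case two of the three predictions match the gold set, so precision, recall and F₁ all equal 2/3.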

Publication types

Dataset

MeSH terms

Databases, Protein
Enzymes* / chemistry
Knowledge Bases
Natural Language Processing*
PubMed

Substances

Enzymes

Grants and funding

U24 HG007822/HG/NHGRI NIH HHS/United States