This paper addresses the challenge of binary relation classification in biomedical Natural Language Processing (NLP), focusing on diverse domains including gene-disease associations, compound-protein interactions, and social determinants of health (SDOH). We evaluate different approaches, including fine-tuning Bidirectional Encoder Representations from Transformers (BERT) models and generative Large Language Models (LLMs), and examine their performance in zero- and few-shot settings. We also introduce a novel dataset of biomedical text annotated with social and clinical entities to facilitate research into relation classification. Our results underscore the continued complexity of this task for both humans and models. BERT-based models trained on domain-specific data excelled in certain domains and, in others, achieved performance and generalization comparable to generative LLMs. Despite these encouraging results, the models remain far from human-level performance. We also highlight the impact of high-quality training data and domain-specific fine-tuning on the performance of all the considered models.