License: CC BY 4.0
arXiv:2312.13881v1 [cs.CL] 21 Dec 2023

Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs

Juraj Vladika, Alexander Fichtl, Florian Matthes
Department of Computer Science, Technical University of Munich, Boltzmannstraße 3, 85748 Garching bei München, Germany
{juraj.vladika, alexander.fichtl, matthes}@tum.de
Abstract

Recent advances in natural language processing (NLP) owe their success to pre-training language models on large amounts of unstructured data. Still, there is an increasing effort to combine the unstructured nature of LMs with structured knowledge and reasoning. Particularly in the rapidly evolving field of biomedical NLP, knowledge-enhanced language models (KELMs) have emerged as promising tools to bridge the gap between large language models and domain-specific knowledge, considering the available biomedical knowledge graphs (KGs) curated by experts over the decades. In this paper, we develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models (PLMs). We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical ontology OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT. The approach includes partitioning knowledge graphs into smaller subgraphs, fine-tuning adapter modules for each subgraph, and combining the knowledge in a fusion layer. We test the performance on three downstream tasks: document classification, question answering, and natural language inference. We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low. Finally, we provide a detailed interpretation of the results and report valuable insights for future work.

1 INTRODUCTION

The field of natural language processing (NLP) has been marked by impressive advancements in recent years. The appearance of new model architectures, including the emergence of generative transformers and pre-trained language models (PLMs), has brought along with it widespread usage and attention. Still, most of these models were trained on large amounts of web content, and while they excel at tasks in a general-purpose setting, there is still a performance gap when it comes to domain-specific challenges.

One of these challenging domains is biomedicine, which centers around the study of the human body, diseases, drugs, and treatments. Biomedical text is often characterized as highly complex because of its advanced terminology, which frequently includes names of chemical compounds, long-spanning relations, and other jargon not commonly used in everyday language. For NLP models trained on general corpora to work well in the biomedical domain, researchers have turned to transfer learning methods and domain adaptation. The most common approach to domain adaptation is to continue the initial general pre-training of language models with data from domain-specific medical corpora. Examples of models adapted in this way are BioBERT Lee et al., (2019) and SciBERT Beltagy et al., (2019), which drew the additional training data from biomedical and computer science research abstracts. Dropping the mixed-domain approach from previous frameworks, models like PubMedBERT Gu et al., (2020) and BioLinkBERT Yasunaga et al., (2022) were instead trained solely on PubMed research articles, with BioLinkBERT even leveraging links (citations) to other research articles.

While domain fine-tuning of whole PLMs has proven to increase the performance on downstream biomedical NLP tasks, additional pre-training can often be resource-intensive and infeasible for smaller research groups and situations where computing power is limited. A promising research direction has emerged in the form of knowledge-enhanced language models (KELMs) Hu et al., (2023). It refers to any set of methods that try to incorporate external knowledge into language models, usually by injecting it into the model’s input, architecture, or output. In a sea of knowledge-enhancement methods, an especially interesting one is the utilization of adapters.

Broadly speaking, adapters are small bottleneck feed-forward layers inserted within each layer of a transformer-based language model Houlsby et al., (2019); Pfeiffer et al., 2020b. The small number of additional parameters allows for the injection of new data or knowledge without requiring the whole model to be fine-tuned. Adapters plugged into large language models will often have only around 1% of the number of trainable parameters of the transformer. The transformer model’s learned parameters (weights) are frozen and left unchanged, and only the adapter is fine-tuned. Other than being lightweight on resources, this approach also helps avoid the problem of catastrophic forgetting, where language models forget their existing knowledge from the pre-training corpora when they are fine-tuned on a new, smaller corpus Colon-Hernandez et al., (2021).

This paper specifically focuses on using adapters to inject structured biomedical knowledge from large knowledge graphs into PLMs. We provide an overview of existing adapter approaches for the biomedical domain, as well as existing biomedical language models. We perform extensive experiments to test the performance of knowledge-enhanced, adapter-based biomedical language models on a number of representative biomedical classification tasks (document classification, question answering, natural language inference). We show that the model performance is improved in several instances on downstream tasks and provide a deeper look into the resulting change in model predictions. Finally, our experiments demonstrate that the OntoChem ontology Irmer et al., (2013), which has not been used for knowledge enhancement yet, is a viable alternative to other prominent knowledge sources.

2 RELATED WORK

2.1 Knowledge-Enhanced PLMs

PLMs are trained on enormous corpora of training data, ranging from 3.3 billion tokens in the case of the original BERT Devlin et al., (2019), all the way to 3.5 trillion tokens in the case of the recent Falcon-180B model Almazrouei et al., (2023). The power of the model architecture, combined with transfer learning, has led to these models showing impressive capabilities on most NLP tasks. While the textual data used for the model training is usually completely unstructured in nature, research has shown that models like BERT do encode, to some extent, syntactic structures, hierarchical concepts, and certain semantic conceptual relations Rogers et al., (2021). Still, other studies have shown weakness in modeling tasks dealing with structured knowledge, such as hyponymy relations Ravichander et al., (2020) or preserving the association between text and meaning Di Sciullo, (2018).

In most cases, the knowledge we find and gather, especially scientific knowledge, can be represented in a structured manner. This is the underlying idea of knowledge graphs (KGs), a data structure that models concepts (entities) and relations between them in a graph-like format Ji et al., (2021). KGs have been used in the field of NLP to enhance the performance of NLP models in many downstream NLP tasks Schneider et al., (2022). There are multiple ways to combine KGs with PLMs. The knowledge triples from KGs can be embedded as vector representations such as TransE Wang et al., (2014) or TuckER Balazevic et al., (2019) and then combined with the vectors encoding text. Alternatively, the triples from KGs can be converted to sentences, and, in turn, these textual representations can then be used to fine-tune PLMs in the same way as with any other text. This approach was followed by COMET Bosselut et al., (2019), which utilized the knowledge graph ConceptNet Speer et al., (2017) to enhance the performance on commonsense reasoning tasks. Besides knowledge graphs, lexicons are sometimes used for knowledge enhancement Hoang et al., (2022).

While there are numerous ways to inject structured knowledge into PLMs, such as adding it to the input and output of models Wei et al., (2021), an especially promising approach is adding adapters to the architecture of the model Colon-Hernandez et al., (2021). Adapters are small layers that are inserted within a language model and are subsequently fine-tuned to a specific task. The major benefit of adapters is that they add a minimal number of additional parameters, thus significantly reducing the needed training time. Combined with freezing the original model weights, adapters can avoid catastrophic forgetting, where the PLM’s performance deteriorates when all of its weights are fine-tuned with a new knowledge source. Adapters have been used for numerous purposes, such as learning hierarchical representations Chronopoulou et al., (2022), transferring models trained on English to low-resource languages Wang et al., (2021), and in the domain of efficient transformers as low-rank adapters (LoRA) Hu et al., (2022). General knowledge-enhanced PLMs utilizing adapters include, for example, KnowBERT Peters et al., (2019) and K-Adapter Wang et al., (2020). A practical tool that combines well-known adapter architectures in one place is AdapterHub Pfeiffer et al., 2020b.

2.2 Biomedical Knowledge-Enhanced PLMs

A major focus of knowledge enhancement in PLMs is domain adaptation to expert domains such as the biomedical domain. So far, most of the advancements have focused on utilizing the knowledge graph UMLS Bodenreider, (2004) for this purpose. Examples include BERT-MK He et al., (2020) and KeBioLM Yuan et al., (2021), which both fine-tune all weights of the base language model by using masked language modeling of triples from UMLS. Biomedical PLMs can then be used for various NLP tasks, such as biomedical text summarization Abacha et al., (2021), named entity recognition Sung et al., (2022), medical fact-checking Vladika and Matthes, (2023), information retrieval Luo et al., (2022), or health question answering Vladika et al., (2023).

There are also existing approaches using adapters for biomedical knowledge enhancement. Representative works are DAKI Lu et al., (2021), which fine-tunes the adapters with an entity prediction task, and KEBLM Lai et al., (2023), which fine-tunes the adapters on three different knowledge types from UMLS and PubChem Kim et al., (2019), namely entity descriptions, entity-entity relations, and entity synonyms. The most similar approach to ours and a direct inspiration was the Mixture-of-Partitions (MoP) approach Meng et al., (2021), where the adapters were fine-tuned on smaller subgraphs of UMLS.

Figure 1: Triplet from the OntoChem Fact Finder (https://sciwalker.com/analytics/factfinder)

In essence, our work builds on the present foundations of adapter-based biomedical models and uses the yet unexplored knowledge graph OntoChem, which is rich with chemical knowledge. For our experiments, we use the well-known biomedical PLM PubMedBERT as well as the yet unexplored but powerful BioLinkBERT base model. Following the suggestions of Meng et al., (2021), we use only the triplets corresponding to the 20 most frequent relations of OntoChem for the knowledge injection. An example of an OntoChem triplet can be seen in Figure 1. Finally, we provide a deeper qualitative analysis of learned structured knowledge on a specific dataset. Notably, our work achieves the SOTA (averaged) performance on the question-answering BioASQ-7b dataset.

3 METHODOLOGY

In this section, we will explain the training methodology we used for the experiments in this paper. It is depicted in Figure 2.

Figure 2: Methodology used to construct the final model and run the experiments

3.1 Knowledge Graph Representation

A central element of our method is the knowledge graph (KG). This KG is a structured representation of information denoted as a collection of ordered triples Ji et al., (2021). We denote these triples as $(s, r, o)$, where $s$ is a subject, $r$ is a relation, and $o$ is an object. Both $s$ and $o$ are entities that come from an entity set $E$, while relations come from a relation set $R$. Each entity and relation in the KG is associated with its corresponding textual surface form. This form can take the shape of a single word or a compound term (e.g., for names of chemicals) or even a concise phrase, especially in the case of relations. This textual association is critical as it bridges the gap between the structured KG and natural language, allowing for easier injection of KG knowledge into the language models and associated fine-tuning.

The primary objective is to enhance the capabilities of a pre-trained language model, denoted as LM, by integrating the knowledge contained within the KG. To achieve this, we need a training objective that effectively incorporates the KG knowledge into the model. Most encoder-only PLMs based on the original BERT use masked language modeling (MLM) as one of their pre-training objectives. This task consists of masking a certain word in a given sentence and having the model predict which word would best fit in the place of the masked token. We follow the established approach of using an entity prediction objective, where we mask one of the entities and have the model predict which entity would best fit. In this way, the model incorporates the structured knowledge of $(s, r, o)$ triples into its weights.

3.2 Graph Partitioning

During the prediction of the masked token, the model produces a probability distribution (with a softmax function) over all of the entities from the KG’s entity set E. Given the massive size of the biomedical KGs we use in this paper, computing the softmax function over all entities is computationally expensive and can significantly slow down model training and inference. Several approaches have been suggested in the literature to address this challenge. We opt for the approach of Meng et al., (2021), which partitions the KG into smaller subgraphs, trains on each of them independently, and later combines their knowledge into a unified knowledge representation.
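To make the cost argument concrete, the entity prediction distribution can be written out as below. This is only a sketch: the symbols $\mathbf{h}$ (the encoder representation of the masked input) and $\mathbf{w}_e$ (an output embedding for entity $e$) are assumptions introduced for illustration and are not defined in the text above.

```latex
p(o \mid s, r) = \frac{\exp\left(\mathbf{w}_o^{\top} \mathbf{h}\right)}{\sum_{e \in E} \exp\left(\mathbf{w}_e^{\top} \mathbf{h}\right)}
```

The denominator sums over every entity in $E$, so the cost of a single prediction grows linearly with $|E|$. If prediction is restricted to the entities of one subgraph, the sum runs only over that subgraph's entity set, which is the motivation for partitioning.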

The process of dividing a KG yields smaller subgraphs that we denote as $G_1, G_2, \ldots, G_k$. We set $k = 20$ in the final experiments, following empirical observations and previous literature, as this value balances efficiency and graph coverage well. Ideally, these 20 subgraphs should be of almost equal size, meaning nodes are balanced across partitions. Additionally, the capacity of edges between different components should be minimized to maximize the retention of factual knowledge. This problem is known as balanced graph partitioning and is NP-complete Andreev and Räcke, (2004). Because computing an exact solution is intractable for large graphs, several algorithms have been developed that find good approximate solutions. We opt for the METIS algorithm Karypis and Kumar, (1997), which was used in other works dealing with large-scale KG partitioning Zheng et al., (2020).
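The two criteria named above, balanced partition sizes and a small number of edges crossing partitions, can be measured with a few lines of code. The sketch below does not implement METIS; it only scores a given node-to-partition assignment. The toy graph, the node names, and the function `partition_quality` are hypothetical and serve purely as illustration.

```python
from collections import Counter


def partition_quality(edges, assignment, k):
    """Return (edge_cut, sizes) for a node -> partition-index assignment.

    edge_cut counts edges whose endpoints lie in different partitions;
    sizes lists the number of nodes in each of the k partitions.
    """
    sizes = Counter(assignment.values())
    edge_cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    return edge_cut, [sizes.get(i, 0) for i in range(k)]


# Hypothetical toy graph: two triangles joined by a single edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

# Both assignments are perfectly balanced, but they differ in edge cut.
good = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
bad = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

good_cut, good_sizes = partition_quality(edges, good, k=2)
bad_cut, bad_sizes = partition_quality(edges, bad, k=2)
```

Both assignments place three nodes in each partition, yet the first cuts a single edge while the second cuts five. In a KG, every cut edge corresponds to a triple whose subject and object end up in different subgraphs, which is why a low edge cut helps retain factual knowledge.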

3.3 Adapter Module Learning

Once the KG is appropriately partitioned, the process of fine-tuning the LM can be started. We deploy adapter modules for this purpose. As mentioned previously, adapters are newly initialized feed-forward networks inserted between the transformer model’s layers. Notably, the training of adapter modules does not require fine-tuning the existing parameters of the pre-trained model. Instead, it focuses solely on updating the parameters within the adapters. This strategy ensures that the pre-trained model’s core knowledge remains intact while enabling the model to specialize in the biomedical domain by adapting to the specific knowledge contained in the KG.

There are multiple adapter module configurations, such as those of Houlsby et al., (2019) and Bapna and Firat, (2019). The adapter module configuration used in this paper is based on the one by Pfeiffer et al., 2020a, the so-called Pfeiffer architecture. In this configuration, only one adapter module, consisting of a down-projection and an up-projection, is added to each transformer layer, unlike the Houlsby architecture, which adds two such modules per layer. While the Houlsby architecture has more learning capacity, it comes with training and inference speed costs. Previous studies showed no significant difference in performance between the two architectures, making the Pfeiffer architecture a lightweight choice that retains powerful learning capabilities.
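A bottleneck adapter of this kind can be sketched in plain Python as a down-projection, a nonlinearity, an up-projection, and a residual connection. The sketch is illustrative only: the class name, the ReLU nonlinearity, the zero-initialized up-projection, the hidden size of 768, the bottleneck size of 48, and the count of 12 layers are all assumptions, and details of the actual Pfeiffer configuration (such as layer normalization placement) are omitted.

```python
import random


def relu(x):
    return [max(0.0, v) for v in x]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class BottleneckAdapter:
    """Down-projection, ReLU, up-projection, plus a residual connection."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        self.W_down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                       for _ in range(bottleneck_size)]
        self.b_down = [0.0] * bottleneck_size
        # Assumption: the up-projection starts at zero, so the untrained
        # adapter passes the hidden state through unchanged.
        self.W_up = [[0.0] * bottleneck_size for _ in range(hidden_size)]
        self.b_up = [0.0] * hidden_size

    def forward(self, h):
        z = relu([a + b for a, b in zip(matvec(self.W_down, h), self.b_down)])
        out = [a + b for a, b in zip(matvec(self.W_up, z), self.b_up)]
        return [x + y for x, y in zip(h, out)]  # residual connection

    def num_parameters(self):
        d, m = len(self.W_up), len(self.W_down)
        return 2 * d * m + d + m  # two weight matrices and two bias vectors


adapter = BottleneckAdapter(hidden_size=768, bottleneck_size=48)
per_layer = adapter.num_parameters()  # 2 * 768 * 48 + 768 + 48
total = 12 * per_layer                # one adapter in each of 12 layers
```

Under these assumed sizes, one adapter has 74,544 parameters and twelve of them have 894,528 in total, which is under 1% of the roughly 110 million parameters of a BERT-base-sized encoder. Only these parameters would be updated during training, while the encoder weights stay frozen.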

As already mentioned, masked language modeling is used to fine-tune the adapter modules. More precisely, it is a task of entity prediction since a missing entity from the graph triple is being predicted. Given a subgraph $G_k$ and its triples $(s, r, o)$, each triple has a textual representation. The object entity $o$ is removed from each triple, and the remaining two elements of the triple are transformed into a textual representation like: “[CLS] s [SEP] r [SEP]”. The adapter module is then trained to predict the missing object entity using the representation of the [CLS] token. The parameters of the adapter module are optimized by minimizing the cross-entropy loss.
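The construction of training examples described above can be sketched as follows. The triples, the helper names `build_examples` and `cross_entropy`, and the entity-to-index mapping are hypothetical; in practice a tokenizer would normally insert the special tokens itself, and the logits would come from a classification head on top of the [CLS] representation rather than being set by hand.

```python
import math


def build_examples(triples, entity_to_id):
    """Turn (s, r, o) triples into (input_text, object_label_id) pairs."""
    examples = []
    for s, r, o in triples:
        text = f"[CLS] {s} [SEP] {r} [SEP]"  # the object is left out
        examples.append((text, entity_to_id[o]))
    return examples


def cross_entropy(logits, label_id):
    """Negative log-probability of the gold entity under a softmax."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[label_id]


# Hypothetical triples, not taken from UMLS or OntoChem.
triples = [("aspirin", "treats", "headache"),
           ("insulin", "treats", "diabetes")]
entities = sorted({e for s, _, o in triples for e in (s, o)})
entity_to_id = {e: i for i, e in enumerate(entities)}

examples = build_examples(triples, entity_to_id)

# With uniform logits the loss equals log(number of entities).
uniform_loss = cross_entropy([0.0] * len(entity_to_id), examples[0][1])
```

The label space is the entity set of the subgraph at hand, so a smaller subgraph directly means a smaller output layer and a cheaper loss computation.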

3.4 Knowledge Fusion

Finally, with a set of knowledge-encapsulated adapter modules at hand, we need to fuse their knowledge together into a final representation. For this, we use the so-called AdapterFusion mixture layers Pfeiffer et al., 2020a. These layers serve the purpose of combining knowledge from various adapters to enhance the model’s performance on downstream tasks. AdapterFusion is a relatively recent approach designed to effectively learn how to combine information from a set of task-specific adapters. It does so by employing a softmax attention mechanism that assigns contextual mixture weights over the adapters. These weights are then used to predict task labels in the final layer. The composition of these layers and their interactions ultimately contribute to the model’s ability to generalize and perform well on a range of tasks.
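The softmax attention over adapters can be illustrated with a minimal sketch for a single token. It is a simplification under stated assumptions: AdapterFusion learns query, key, and value projections, whereas here they are replaced by identity maps, and the vectors and the function `fuse` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse(hidden, adapter_outputs):
    """Attention-weighted mixture of adapter outputs for one token.

    The transformer hidden state acts as the query; each adapter output
    acts as both key and value (identity projections are an assumption).
    """
    weights = softmax([dot(hidden, out) for out in adapter_outputs])
    fused = [sum(w * out[i] for w, out in zip(weights, adapter_outputs))
             for i in range(len(hidden))]
    return weights, fused


# Hypothetical 2-dimensional example with three adapters.
hidden = [1.0, 0.0]
adapter_outputs = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
weights, fused = fuse(hidden, adapter_outputs)
```

Adapters whose output aligns with the current hidden state receive larger mixture weights, so the contribution of each subgraph's adapter can change from token to token and from input to input.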

4 EXPERIMENTS

In this section, we describe our approach to leveraging data from OntoChem’s SciWalker platform together with adapters to improve existing approaches to biomedical KELMs. For reproducibility, we made the code for the experiment runs available on GitHub (https://github.com/alexander-fichtl/diversifying_KELMs.git).

4.1 Datasets

All of our datasets, with the exception of MedNLI, originate from a collection of common biomedical NLP tasks known as BLURB, the Biomedical Language Understanding and Reasoning Benchmark (https://microsoft.github.io/BLURB/index.html). Inspired by a similar suite of tasks for general-purpose natural language understanding (NLU) known as GLUE Wang et al., (2018), BLURB covers a wide range of tasks related to biomedical NLU. None of the tasks involves text generation; all are essentially classification tasks, which makes them convenient to evaluate with common classification metrics such as precision, recall, accuracy, and F1 score. The four datasets are described below.

MedNLI Romanov and Shivade, (2018) is a dataset for natural language inference (NLI). It consists of 14,049 unique sentence pairs, where one sentence is a hypothesis, and the other one is a premise. The task is to infer whether the premise entails the hypothesis, contradicts it, or is in a neutral relation with respect to it. The premises were collected from MIMIC-III Johnson et al., (2016), the largest repository of publicly available clinical data (patient notes).

BioASQ-7b Nentidis et al., (2020) is a biomedical question answering (QA) benchmark dataset containing questions in English, along with gold standard (reference) answers and related material. It has been designed to reflect real information needs of biomedical experts. In addition to exact answers, the BioASQ dataset also includes ideal answers (summaries). Researchers working on paraphrasing and textual entailment can also measure the degree to which their methods improve the performance of biomedical QA systems. The dataset is part of the ongoing shared challenge of the same name Tsatsaronis et al., (2015); the version we use (7b) is from the 2019 challenge.

PubMedQA Jin et al., (2019) is a different QA dataset collected from PubMed abstracts, the largest collection of biomedical research papers White, (2020). The task of PubMedQA is to answer research questions with yes/no/maybe using the corresponding abstracts. The dataset has 1,000 expert-annotated instances of question-answer pairs. Each PubMedQA instance is composed of a question, a context (abstract without the conclusion), a long answer (conclusion of the abstract), and a yes/no/maybe label that summarizes the conclusion.

The Hallmarks of Cancer (HoC) Corpus Baker et al., (2015) consists of 1,852 PubMed publication abstracts manually annotated by experts according to a taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus. These hallmarks refer to the alterations in cell behavior that characterize the cancer cell. Proposed as a strategy to capture the complexity of cancer in a few basic principles, the taxonomy provides an organized framework comprising ten hallmarks Baker et al., (2017).
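Because zero or more labels can be assigned to each item, HoC is a multi-label task, and it is scored with micro F1 (see Table 3). A minimal sketch of micro-averaged F1 over label sets is given below. The label names and the function `micro_f1` are hypothetical, and the exact aggregation used by the benchmark may differ from this simplified version.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for multi-label predictions given as label sets.

    True positives, false positives, and false negatives are pooled over
    all items and all labels before precision and recall are computed.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: one exact match, one spurious label, one miss.
gold = [{"hallmark_a"}, set(), {"hallmark_a", "hallmark_b"}]
pred = [{"hallmark_a"}, {"hallmark_b"}, {"hallmark_a"}]
score = micro_f1(gold, pred)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, and micro F1 all equal 2/3.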

 

UMLS20 | #Triples | Onto20Fused | #Triples | Onto20Type | #Triples
has finding site | 367,237 | relates to | 708,076 | [protein] relates to [disease] | 295,841
has method | 275,398 | induces | 502,512 | [substance] induces [physiology] | 282,721
has associated morphology | 269,729 | modulates | 326,534 | [food] contains [compound] | 269,211
has procedure site | 129,686 | treats | 225,279 | [substance] treats [disease] | 247,348
possibly equivalent to | 91,446 | inhibits | 219,720 | [biomarker] of [disease] | 205,604
has causative agent | 86,780 | is analyzed by | 195,291 | [substance] is analyzed by [method] | 130,275
interprets | 84,533 | produces | 173,979 | [plant] produces [compound] | 102,270
has direct procedure site | 83,749 | increases activity of | 148,673 | [protein] induces [physiology] | 85,411
has active ingredient | 63,792 | contains | 133,241 | [compound] increases activity of [protein] | 85,196
has pathological process | 54,639 | increases | 110,803 | [compound] decreases activity of [protein] | 72,311
has occurrence | 40,154 | detects | 93,373 | [substance] inhibits [physiology] | 68,728
has dose form | 30,940 | decreases activity of | 85,425 | [protein] is a [biomarker] | 65,558
has direct morphology | 29,667 | prevents | 82,574 | [anatomy] produces [protein] | 64,206
has intent | 25,907 | increases expression of | 80,771 | [substance] prevents [disease] | 60,260
has interpretation | 24,624 | expresses | 62,142 | [protein] induces [disease] | 59,577
has direct substance | 23,042 | attenuates | 54,865 | [substance] modulates [protein] | 54,533
has direct device | 17,726 | decreases expression of | 51,152 | [protein] is analyzed by [method] | 54,250
moved to | 17,507 | binds to | 49,206 | [method] treats [disease] | 35,768
has temporal context | 17,195 | is a | 47,435 | [method] detects [physiology] | 33,504
has subject relationship context | 16,926 | affects expression of | 37,399 | [protein] modulates [physiology] | 24,332
Total | 1,750,677 | | 3,388,450 | | 2,296,904

 

Table 1: Twenty most common relations in each of the three KGs used in the experiments

4.2 Knowledge Sources

The Unified Medical Language System (UMLS) is a set of resources and tools developed by the US National Library of Medicine (NLM) to facilitate the integration and retrieval of biomedical and clinical information from various sources Bodenreider, (2004). Created in 1986 and continuously developed over the decades, it can be viewed as a comprehensive thesaurus and ontology of biomedical concepts, making it easier to connect and use medical terminology in research, clinical practice, and healthcare information systems. We use the most recent SNOMED CT, US Edition vocabulary from September 2023 (https://www.nlm.nih.gov/healthit/snomedct/us_edition.html).

The second knowledge graph, or more precisely ontology, that we use is the OntoChem Ontology Irmer et al., (2013). The ontology contains more than 900 complex relationships between two or more named entities. Entities include chemical compounds, diseases, drug combinations, chemical reactions, biological activities, adverse reactions, etc. Relationships can be downloaded as RDF files. The data originates from MedLine (https://www.nlm.nih.gov/medline/index.html), a bibliographic database of the US National Library of Medicine (NLM) that contains more than 30 million journal articles focusing on medicine and life sciences. The KG triples can be interactively queried and also downloaded from the SciWalker platform with the Fact Finder tool (https://sciwalker.com/analytics/factfinder).

4.3 KG Subsets

The versions of the KGs from the two knowledge sources we use in this work are subsets of their respective full KGs – we use versions including only the top 20 most common relations. This was done to increase the efficiency of training but also because initial experiments showed this smaller version does not hurt the performance on downstream tasks. For UMLS, the list of most common relations was taken from MoP and SNOMED, a systematically organized collection of medical terms providing codes, terms, synonyms and definitions used in clinical documentation and reporting. We label this KG as UMLS20.

The relations provided by OntoChem are unique to the type of entities that the relation connects, so there can be several types of the same relation. For example, the relation “induces” can have a “substance” as a subject and a “disease” as an object, so the full relation becomes “[substance] induces [disease]”, while another variant has a “physiology” as a subject and a “disease” as an object, producing “[physiology] induces [disease]”. To compare these two variants, we produce both a KG with the top 20 fused relations (independent of entity types) and one with the top 20 typed relations (dependent on entity types). We call these two KGs Onto20Fused and Onto20Type.
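The difference between fused and typed relations comes down to which key is used when counting triples. The sketch below illustrates this with a handful of hypothetical typed triples; the function `top_relations` and its arguments are assumptions for illustration and do not reflect the format of the OntoChem export.

```python
from collections import Counter


def top_relations(typed_triples, k, fused):
    """Return the k most frequent relations.

    fused=True counts by relation name only; fused=False keeps the
    subject and object entity types as part of the relation.
    """
    counts = Counter()
    for subj_type, relation, obj_type in typed_triples:
        if fused:
            key = relation
        else:
            key = f"[{subj_type}] {relation} [{obj_type}]"
        counts[key] += 1
    return counts.most_common(k)


# Hypothetical triples given as (subject type, relation, object type).
typed_triples = [
    ("substance", "induces", "disease"),
    ("physiology", "induces", "disease"),
    ("substance", "induces", "disease"),
    ("substance", "treats", "disease"),
]

fused_top = top_relations(typed_triples, k=2, fused=True)
typed_top = top_relations(typed_triples, k=2, fused=False)
```

Fusing pools all three “induces” triples into one relation, whereas the typed view splits them by entity types. This is why the fused top 20 can cover more triples in total than the typed top 20, as the totals in Table 1 show.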

The top 20 relations in each of the three KGs are shown in Table 1. This also gives a good insight into what kind of structured knowledge is actually contained in these manually curated biomedical knowledge bases. While there are certain overlaps between the top relations of UMLS and OntoChem, many of them refer to different types of interactions between entities. Therefore, a promising research avenue that we did not explore in this work would be to merge these two knowledge bases into a unified KG and use both to fine-tune the adapters.

4.4 Setup

Task-specific fine-tuning is carried out for the four chosen benchmark downstream tasks. We aligned our hyperparameters with the settings recommended by the BLURB creators Gu et al., (2020): we deploy the Adam optimizer Zhang, (2018) alongside the typical slanted triangular learning rate schedule, with a warm-up over the initial 10 percent of steps and a cool-down over the remaining 90 percent, and set the dropout probability to 0.1. Furthermore, we followed Pfeiffer et al., 2020a and Meng et al., (2021) by introducing mixture layers and AdapterFusion to route valuable knowledge from the adapters to downstream tasks automatically. Given the random initialization of the task-specific model and dropout, outcomes can fluctuate based on different random seeds, particularly for the small PubMedQA and BioASQ7b datasets. For a more accurate representation, we present average results over ten runs for BioASQ7b and PubMedQA, five runs for HoC, and three runs for MedNLI, as done in related biomedical NLP papers benchmarking these tasks.
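The learning rate schedule described above can be sketched as a function of the training step. The text specifies only the 10 percent warm-up and the 90 percent cool-down; the linear shape of both phases, the decay to exactly zero, the function name, and the total of 1,000 steps are assumptions. The peak value of 1e-5 is the HoC learning rate from Table 2.

```python
def slanted_triangular_lr(step, total_steps, peak_lr, warmup_fraction=0.1):
    """Linear warm-up to peak_lr, followed by a linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Learning rate at every step of a hypothetical 1,000-step run.
schedule = [slanted_triangular_lr(s, total_steps=1000, peak_lr=1e-5)
            for s in range(1001)]
```

With these values, the learning rate rises from zero to its peak at step 100 and then falls steadily until the final step. The short warm-up keeps early updates to the randomly initialized task head small, and the long decay lets the later updates settle.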

The training was carried out on Google Colab, with V100 and T4 GPUs provided on the platform. Specific hyperparameters and settings used in our experiments are shown in Table 2. Run seeds are reported on GitHub.

Setting/Task | HoC | PubMedQA | BioASQ7b | MedNLI
repeat runs | 5 | 10 | 10 | 3
epochs | 20 | 30 | 25 | 20
patience | 3 | 4 | 5 | 3
batch size | 16 | 4 | 4 | 8
learning rate | 1e-5 | 0.5e-5 | 0.5e-5 | 0.5e-5
max. seq. len. | 128 | 512 | 512 | 256
Table 2: Settings and hyperparameters used for training each of the datasets of the downstream tasks

5 RESULTS

This section describes the detailed experiment results. We provide both a numerical analysis and a qualitative analysis of the results.

↓ model / dataset → | HoC | PubMedQA | BioASQ7b | MedNLI
SciBERT-base | 80.52 ± 0.60 | 57.38 ± 4.22 | 75.93 ± 4.20 | 81.19 ± 0.54
+ MoP | 81.79 ± 0.66 † ↑ | 54.66 ± 3.10 | 78.50 ± 4.06 † ↑ | 81.20 ± 0.37
+ KEBLM | / | 59.0 | / | 82.14
BioBERT-base | 81.41 ± 0.59 | 60.24 ± 2.32 | 77.50 ± 2.92 | 82.42 ± 0.59
+ MoP | 82.53 ± 1.08 † ↑ | 61.04 ± 4.81 ↑ | 80.79 ± 4.40 † ↑ | 82.93 ± 0.55 ↑
+ KEBLM | / | 68.00 ↑ | / | 84.24 ↑
+ DAKI | / | / | / | 83.41 ↑
PubMedBERT-base | 82.25 ± 0.46 | 55.84 ± 1.78 | 87.71 ± 4.25 | 84.18 ± 0.19
+ UMLS20 | 83.26 ± 0.32 † ↑ | 62.84 ± 2.71 † ↑ | 90.64 ± 2.43 † ↑ | 84.70 ± 0.19 ↑
+ Onto20Type | 82.17 ± 0.62 | 55.40 ± 5.57 | 86.36 ± 3.07 | 83.94 ± 0.63
+ Onto20Fused | 82.39 ± 0.65 ↑ | 56.12 ± 2.91 ↑ | 84.36 ± 4.73 | 83.97 ± 0.59
BioLinkBERT-base | 82.21 ± 0.87 | 56.76 ± 3.00 | 91.29 ± 3.18 | 84.1 ± 0.03
+ UMLS20 | 82.36 ± 0.57 ↑ | 63.62 ± 5.31 † ↑ | 91.50 ± 2.25 ↑ | 83.78 ± 0.09
+ Onto20Type | 82.37 ± 0.42 ↑ | 60.46 ± 5.81 ↑ | 92.14 ± 2.30 ↑ | 82.84 ± 0.34
+ Onto20Fused | 82.24 ± 1.25 ↑ | 63.28 ± 4.46 † ↑ | 90.57 ± 3.14 | 83.69 ± 0.55
 
Table 3: Final results of the model experiments. The metric for HoC is Micro F1, while for the other three datasets it is accuracy. The best results for every task are in bold. "↑" denotes an improvement over the base model. "†" denotes a statistically significant improvement over the base model (t-test, p < 0.05). The results in italics are taken from previous works, while the rest come from our experiments.

5.1 Numerical Analysis

Table 3 shows the final results of the experiments. Each section first shows the performance of the base biomedical model on its own, namely SciBERT (Beltagy et al., 2019), BioBERT (Lee et al., 2019), PubMedBERT (Gu et al., 2020), and BioLinkBERT (Yasunaga et al., 2022). The indented rows that follow show the performance of the knowledge-enhanced versions of each model. For SciBERT and BioBERT, we report on competing approaches that use structured knowledge integration: MoP (Meng et al., 2021), DAKI (Lu et al., 2021), and KEBLM (Lai et al., 2023). For PubMedBERT and BioLinkBERT, we report on the knowledge-enhanced versions described in this paper, augmented with structured knowledge from the knowledge graphs UMLS20, Onto20Fused, and Onto20Type. Note that our BioLinkBERT results differ from those in the original publication because we report results averaged over multiple runs, whereas the original paper reports the best single run.

The results demonstrate that our knowledge enhancement approach improved PubMedBERT in six instances and BioLinkBERT in eight instances, either with the UMLS data or the OntoChem data. Notably, the margin of improvement differs between the datasets. For HoC, the improvement is either negligible or about one percentage point in the best case. This suggests that classifying document abstracts according to cancer properties depends mostly on the document context itself and does not noticeably benefit from external knowledge. The same holds for MedNLI, which either deteriorates or improves by less than one percentage point, suggesting that entailment recognition is tied mostly to the reasoning capabilities of a language model rather than to deeper medical knowledge.
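As a rough illustration of how a gap between a base model and its enhanced variant can be weighed against run-to-run noise, the sketch below computes a Welch t statistic from the summary statistics in Table 3. It is only a sketch under stated assumptions: the function name `welch_t` is hypothetical, the reported "±" values are assumed to be sample standard deviations, every configuration is assumed to have 10 runs, and Welch's variant is our choice since the exact form of the t-test is not specified here.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of
    freedom for two groups given only mean, sample std, and run count."""
    var_a = std_a ** 2 / n_a  # squared standard error of group A
    var_b = std_b ** 2 / n_b  # squared standard error of group B
    t_stat = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


N_RUNS = 10  # assumption: number of runs per configuration

# PubMedQA accuracy from Table 3: PubMedBERT-base vs. PubMedBERT + UMLS20.
t_stat, dof = welch_t(55.84, 1.78, N_RUNS, 62.84, 2.71, N_RUNS)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

The resulting statistic would then be compared against a Student t distribution with the computed degrees of freedom (for example via a statistics library) to obtain a p-value.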

On the other hand, the two question-answering datasets show noticeable improvements. This makes sense considering the knowledge-intensive nature of QA, where factual knowledge is at its core. Especially for PubMedQA, both base PLMs gain roughly seven percentage points in accuracy with different KGs. An impressive result is the BioLinkBERT-base + Onto20Type model achieving state-of-the-art performance on the BioASQ7b dataset (when looking at the averaged performance over 10 runs). Comparing the two styles of OntoChem relations on BioLinkBERT, the fused version was superior for PubMedQA (by about 3 points), while the more detailed, typed version performed better for BioASQ (by about 1.5 points). We attribute this to the slight difference in the domain of these two datasets: BioASQ contains more questions relating to chemical knowledge, where specific types could come into play, while PubMedQA covers diverse medical diagnoses and treatments.
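The margins quoted above can be re-derived directly from the mean accuracies in Table 3. The short sketch below does this arithmetic; the dictionary layout and the helper name `gain` are illustrative choices, and only the numbers themselves come from the table.

```python
# Mean accuracies copied from Table 3 (PubMedQA and BioASQ7b columns only).
pubmedqa = {
    "PubMedBERT-base": 55.84,
    "PubMedBERT+UMLS20": 62.84,
    "BioLinkBERT-base": 56.76,
    "BioLinkBERT+UMLS20": 63.62,
    "BioLinkBERT+Onto20Type": 60.46,
    "BioLinkBERT+Onto20Fused": 63.28,
}
bioasq = {
    "BioLinkBERT+Onto20Type": 92.14,
    "BioLinkBERT+Onto20Fused": 90.57,
}


def gain(scores, better, worse):
    """Difference in percentage points between two table entries."""
    return round(scores[better] - scores[worse], 2)


# Best PubMedQA gain over each base model.
print(gain(pubmedqa, "PubMedBERT+UMLS20", "PubMedBERT-base"))
print(gain(pubmedqa, "BioLinkBERT+UMLS20", "BioLinkBERT-base"))

# Fused vs. typed OntoChem relations on BioLinkBERT.
print(gain(pubmedqa, "BioLinkBERT+Onto20Fused", "BioLinkBERT+Onto20Type"))
print(gain(bioasq, "BioLinkBERT+Onto20Type", "BioLinkBERT+Onto20Fused"))
```

These differences are in absolute percentage points of accuracy, not relative changes.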

An interesting result that we have to investigate further is the relatively worse performance of our approach with OntoChem KGs on PubMedBERT compared to BioLinkBERT, even when factoring in the stronger base performance of BioLinkBERT. When the base models do not match, it is hard to distinguish whether performance gains or losses come from the difference in base models or the difference in the adapter-based approaches. Here, the BioLinkBERT base model generally performs better than PubMedBERT or SciBERT over a variety of tasks. Therefore, whenever we use BioLinkBERT, we cannot say how much of the performance gain comes from our approach and how much from the stronger base model.

5.2 Qualitative Analysis

To investigate the performance of our knowledge-enhanced models on a deeper level, we looked at the classification performance on an instance level and singled out some interesting examples. Table 4 shows two instances from the BioASQ dataset where our knowledge-enhanced model predicted the answer correctly, unlike the base model. Instances in BioASQ consist of a question and context, and the goal is to answer the question with a yes/no verdict.

Table 4: Examples of two instances from the BioASQ dataset (with a question, context, and verdict) where the knowledge-enhanced model performed correctly, unlike its vanilla counterpart.

 

Question: Can Diazepam be beneficial in the treatment of traumatic brain injury?
Context: The present experiment examined the effects of diazepam, a positive modulator at the GABA(A) receptor, on survival and cognitive performance in traumatically brain-injured animals.
Predictions: BioLinkBERT: no; BLBERT+Onto20Type: yes
Gold Label: yes

Question: Does axitinib prolong the survival of pancreatic cancer patients?
Context: Axitinib/gemcitabine, while tolerated, did not provide survival benefit over gemcitabine alone in patients with advanced pancreatic cancer from Japan or other regions […].
Predictions: BioLinkBERT: yes; BLBERT+Onto20Type: no
Gold Label: no

 

The first row contains a question on the relationship between Diazepam and traumatic brain injury. While the vanilla BioLinkBERT answered the question incorrectly, our knowledge-enhanced BioLinkBERT + Onto20Type model gave the correct answer. Diazepam (first marketed as Valium) is listed as an entity in the OntoChem KG, where it has a direct relation to brain injuries: the full triple is "diazepam [substance] treats [disease] brain injury" (see also Figure 1). It is likely that, thanks to the injection of this knowledge, the enhanced model was able to deduce the answer, while the base model was not.

The second row shows a question about axitinib and its relation to pancreatic cancer. Here, the base version of BioLinkBERT incorrectly predicted that axitinib does prolong the survival of pancreatic cancer patients, while our BioLinkBERT + Onto20Type model gave the correct negative answer. This time, there is no relation between axitinib and any form of cancer listed in the KG. Therefore, our enhanced model might have been able to rely on its injected knowledge and deduce that there are no such connections between the entities in question.
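To make the two cases concrete, the toy sketch below represents a typed triple in the form quoted above and checks whether any relation links two entities. This is purely illustrative: the `Triple` class, the `relations_between` helper, and the one-entry `TOY_KG` are hypothetical, and the enhanced model does not perform an explicit lookup of this kind, since the knowledge is injected into adapter weights during training.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    head: str
    head_type: str
    relation: str
    tail_type: str
    tail: str


# One-entry toy graph holding only the triple quoted in the text.
TOY_KG = [
    Triple("diazepam", "substance", "treats", "disease", "brain injury"),
]


def relations_between(kg: List[Triple], head: str, tail_keyword: str) -> List[Triple]:
    """Return triples whose head matches `head` (case-insensitive) and
    whose tail contains `tail_keyword`."""
    head = head.lower()
    tail_keyword = tail_keyword.lower()
    return [t for t in kg if t.head == head and tail_keyword in t.tail]


print(relations_between(TOY_KG, "Diazepam", "brain injury"))  # one match
print(relations_between(TOY_KG, "axitinib", "cancer"))        # no match
```

In this toy setting, the first query mirrors the diazepam example, where a supporting triple exists, and the second mirrors the axitinib example, where none is found.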
