Enhancing Biomedical Knowledge Discovery for Diseases: An End-To-End Open-Source Framework

Christos Theodoropoulos
KU Leuven
[email protected]
Andrei Catalin Coman
EPFL, Idiap Research Institute
[email protected]
James Henderson
Idiap Research Institute
[email protected]
Marie-Francine Moens
KU Leuven
[email protected]
Abstract

The ever-growing volume of biomedical publications creates a critical need for efficient knowledge discovery. In this context, we introduce an open-source end-to-end framework designed to construct knowledge around specific diseases directly from raw text. To facilitate research in disease-related knowledge discovery, we create two annotated datasets focused on Rett syndrome and Alzheimer’s disease, enabling the identification of semantic relations between biomedical entities. Extensive benchmarking explores various ways to represent relations and entities, offering insights into optimal modeling strategies for semantic relation detection and highlighting language models’ competence in knowledge discovery. We also conduct probing experiments using different layer representations and attention scores to explore transformers’ ability to capture semantic relations. (Data and code: publicly available upon acceptance.)

1 Introduction

Knowledge discovery (Wang et al., 2023; Shu and Ye, 2023) is a pivotal research domain: the surge in publications makes keeping up with new findings challenging and necessitates automated knowledge extraction and processing. Of particular concern is the biomedical literature, where updates occur with ever-accelerating frequency (Fig. 1). Despite advances in healthcare, many diseases, such as Alzheimer’s disease (AD) (Trejo-Lopez et al., 2023; Scheltens et al., 2021) and multiple sclerosis (McGinley et al., 2021; Attfield et al., 2022), lack effective cures. Additionally, over 1,200 rare disorders have limited or no cures according to the National Organization for Rare Disorders (https://rarediseases.org/rare-diseases/). Discovering new scientific insights from research papers can expedite disease understanding and accelerate cure development.

Figure 1: Publication Trends: RS and AD

This paper presents an end-to-end framework for detecting medical entities in unstructured text and annotating semantic relations, enabling automated knowledge discovery for diseases. We employ a multi-stage methodology for data acquisition, annotation, and model evaluation. The process starts with gathering relevant abstracts from PubMed to form the corpus. Entities are identified and extracted, followed by the generation of a co-occurrence graph that models the intra-sentence co-occurrence of the entities across the corpus. Leveraging the processed text and co-occurrence graph, an algorithm samples sentences to create gold-standard datasets. Medical experts label the semantic relations between entities within these sentences via an annotation portal. The framework’s versatility allows application across various diseases and enables expansion to encompass knowledge about symptoms, genes, and more. This study focuses on two diseases of particular research interest: Rett syndrome (RS) (Petriti et al., 2023) and AD. These diseases are selected due to their significant impact and the absence of a cure, highlighting the urgency for advancements in understanding and treatment. We introduce two curated datasets tailored for detecting semantic relations between entities in biomedical text related to RS and AD. The datasets are used for benchmarking, testing techniques for representing relations and entities, and assessing language models’ capabilities in knowledge discovery. This work also probes the layer outputs of transformer models (Vaswani et al., 2017) and their attention patterns to reveal their ability to implicitly capture semantic relations in biomedical text.

RS (Sandweiss et al., 2020) poses challenges due to its sporadic nature and rare expression across diverse racial groups. The disorder’s elusive nature hinders its comprehension and underscores the pressing need for a cure. Rare diseases collectively affect a substantial portion of the population, with over 30 million affected people in Europe alone (Pakter, 2024). AD is prevalent among older populations, with millions of patients worldwide, as it is the most common type of dementia (60-70% of cases) (Alzheimer’s-Association, 2024). With life expectancy on the rise, the projected increase in Alzheimer’s cases accentuates the urgency of finding a cure.

In summary, the paper’s key contributions are:

  • Development of an open-source end-to-end framework to build disease knowledge directly from raw text.

  • Two annotated datasets for RS and AD provide gold labels for semantic relations, aiding disease knowledge discovery research. (The description of distantly supervised datasets for weakly supervised scenarios is included in Appendix F.)

  • Benchmarking on the datasets examines methods for relation and entity representation, offering insights into optimal approaches for semantic relation detection and emphasizing language models’ knowledge discovery capabilities.

  • Probing experiments with different layer representations and attention scores assess transformers’ inherent ability to capture semantic relations.

2 Data Pipeline

We focus on developing a robust data pipeline (Fig. 2) to annotate sentences with entities associated with the Unified Medical Language System (UMLS) (Bodenreider, 2004; Elkin and Brown, 2023). The first step involves the retrieval of the textual abstracts, followed by the mention extraction that includes entity detection and linking to UMLS. We construct a co-occurrence graph to highlight interconnections between entities in the text. The processed text and co-occurrence graph are then used to develop two curated datasets with precise entity annotations and semantic relations between detected entity pairs. (Additional information regarding the data pipeline is provided in Appendix A.)

Figure 2: The pipeline starts with abstract retrieval using a natural language query. Next, entities are detected and linked to UMLS, followed by the co-occurrence graph generation. The final step is the dataset creation using the processed text and co-occurrence graph.

Abstract retrieval. We retrieve PubMed (https://pubmed.ncbi.nlm.nih.gov/) article IDs based on a query (e.g., Rett syndrome) and extract their open-access abstracts. To accomplish this, we leverage the official Entrez Programming Utilities (Kans, 2024) and the Biopython API (Cock et al., 2009) (BSD 3-Clause License), ensuring access to the vast repository of biomedical literature. After obtaining the PubMed IDs (PMIDs), we retrieve the abstracts of the specified articles and tokenize the text into sentences using NLTK (Bird et al., 2009) (Apache License 2.0).
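
As an illustration of the retrieval step, the sketch below builds the two E-utilities requests (ESearch to obtain PMIDs for a query, EFetch to obtain abstracts for those PMIDs) with the Python standard library only. It is not the paper's code: the paper uses Biopython, and the helper names `esearch_url` and `efetch_url` as well as the default `retmax` value are assumptions made for this example.

```python
from typing import List
from urllib.parse import urlencode

# Base address of the NCBI Entrez Programming Utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(query: str, retmax: int = 100) -> str:
    """Build an ESearch request that returns PMIDs matching the query (hypothetical helper)."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids: List[str]) -> str:
    """Build an EFetch request that returns plain-text abstracts for the PMIDs (hypothetical helper)."""
    params = {"db": "pubmed", "id": ",".join(pmids), "rettype": "abstract", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url("Rett syndrome"))
    print(efetch_url(["1", "2"]))  # placeholder PMIDs for illustration only
```

The returned URLs can be fetched with any HTTP client; sentence splitting of the downloaded abstracts is then a separate step (NLTK in the paper).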

Mention extraction. MetaMapLite (Aronson, 2001) (open-source BSD License) is provided by the National Library of Medicine (NLM) for extracting biomedical entities and mapping them to Concept Unique Identifiers (CUIs) within UMLS. The tool is updated every two years to incorporate the latest medical terminology and to ensure its accuracy in extraction and mapping. MetaMapLite extracts mentions and links them to UMLS in one step, efficiently associating mentions with their corresponding CUIs. We detect a diverse range of entities, spanning 82 unique semantic types and covering a broad spectrum of biomedical concepts, including diseases, biologically active substances, anatomical structures, genes, and more. Detailed entity detection often leads to overlapping or successive entities in the text. To address this, our pipeline incorporates a merging strategy that consolidates overlapping or successive entities into cohesive units. For example, in the sentence: "To test norepinephrine augmentation as a potential disease-modifying therapy, we performed a biomarker-driven phase II trial of atomoxetine, a clinically-approved norepinephrine transporter inhibitor, in subjects with mild cognitive impairment due to AD.", the successive relevant mentions norepinephrine transporter and inhibitor are merged into one entity.
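
The merging strategy can be pictured with a small interval-merging sketch. The exact criterion the pipeline uses to decide which mentions are related is not given here, so the example below simply assumes that mentions are token spans and that two spans are merged when they overlap or are directly adjacent; `merge_mentions` and `max_gap` are hypothetical names.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def merge_mentions(spans: List[Span], max_gap: int = 0) -> List[Span]:
    """Merge spans that overlap or follow each other with at most `max_gap` tokens in between."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            # Overlapping or successive mention: extend the previous span.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # "norepinephrine transporter" = tokens 0-1, "inhibitor" = token 2.
    print(merge_mentions([(0, 2), (2, 3)]))  # [(0, 3)]
```

In the example sentence above, the spans for norepinephrine transporter and inhibitor are adjacent, so they collapse into a single span.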

Co-occurrence graph generation. We model the intra-sentence co-occurrence between the entities. Each node in the graph corresponds to a unique CUI and contains metadata including the semantic type and the list of sentence IDs where the corresponding entity is detected. An edge between two nodes signifies that the corresponding entities co-occur within the same sentence. The edge weight represents the number of times two entities co-occur in a sentence throughout the text corpus.
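
A minimal sketch of this graph, assuming the input is a mapping from sentence IDs to the CUIs detected in each sentence. The function name `build_cooccurrence` and the placeholder identifiers are illustrative; semantic-type metadata is omitted, and each pair is counted once per sentence, which is one possible reading of the edge weight.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def build_cooccurrence(
    sentences: Dict[str, List[str]]
) -> Tuple[Dict[str, List[str]], Counter]:
    """Return (nodes, edges): nodes map CUI -> sentence IDs, edges map CUI pair -> weight."""
    nodes: Dict[str, List[str]] = defaultdict(list)
    edges: Counter = Counter()
    for sentence_id, cuis in sentences.items():
        unique = sorted(set(cuis))  # ignore repeated mentions within one sentence
        for cui in unique:
            nodes[cui].append(sentence_id)
        for pair in combinations(unique, 2):
            edges[pair] += 1  # pair is stored in sorted order, so edges are undirected
    return dict(nodes), edges


if __name__ == "__main__":
    # Placeholder CUIs and sentence IDs, not real UMLS identifiers.
    corpus = {"s1": ["C1", "C2"], "s2": ["C2", "C1", "C3"]}
    nodes, edges = build_cooccurrence(corpus)
    print(nodes["C1"], edges[("C1", "C2")])
```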

2.1 Dataset Creation

Leveraging the extracted co-occurrence graph, we define two distinct probability distributions to select sentences for manual annotation. The first distribution $\mathcal{P}$ focuses on common pairs of co-occurring entities, with higher frequency in the co-occurrence graph resulting in a higher likelihood of sampling. The second distribution $\mathcal{IP}$ prioritizes novel or rare pairs of co-occurring entities, selecting sentences where the entities have a lower frequency in the co-occurrence graph. We sample 50% of sentences using $\mathcal{P}$ and 50% using $\mathcal{IP}$ to ensure a balance of common and potentially novel pairs of co-occurring entities in the datasets. (Details of the sentence sampling algorithm are given in Appendix A.)
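
The following sketch illustrates the 50/50 sampling idea under simplifying assumptions: each candidate sentence is summarized by a single positive pair frequency, the first distribution weights sentences proportionally to that frequency, and the second weights them by its inverse. The actual algorithm is described in Appendix A and may differ; `sample_sentences` and its parameters are hypothetical.

```python
import random
from typing import Dict, List


def sample_sentences(pair_freq: Dict[str, int], k: int, seed: int = 0) -> List[str]:
    """Pick k sentence IDs: half favouring frequent pairs, half favouring rare pairs."""
    rng = random.Random(seed)

    def draw(weights: Dict[str, float], n: int) -> List[str]:
        # Weighted sampling without replacement.
        pool = dict(weights)
        chosen: List[str] = []
        for _ in range(min(n, len(pool))):
            ids = list(pool)
            pick = rng.choices(ids, weights=[pool[i] for i in ids], k=1)[0]
            chosen.append(pick)
            del pool[pick]
        return chosen

    # Frequency-proportional weights (plays the role of P).
    common = draw({s: float(f) for s, f in pair_freq.items()}, k // 2)
    # Inverse-frequency weights over the remaining sentences (plays the role of IP).
    remaining = {s: 1.0 / f for s, f in pair_freq.items() if s not in common}
    rare = draw(remaining, k - len(common))
    return common + rare


if __name__ == "__main__":
    freq = {"s1": 50, "s2": 40, "s3": 2, "s4": 1}  # made-up frequencies
    print(sample_sentences(freq, k=2))
```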

Then, we develop an annotation portal using the streamlit library (https://streamlit.io/), providing a user-friendly interface for annotators. Annotators are presented with a sentence containing two highlighted entities and are prompted to categorize the semantic relation between them. Options include positive (direct semantic connection), negative (negative semantic connection where negative words like "no" and "absence" are present), complex (semantic connection with complex reasoning), and no relation. The annotation portal offers additional functions such as sentence removal (for non-informative sentences), entity removal (for incorrect entity types or spans), and context addition (for providing additional text to aid in relation type determination). We enlist the expertise of three medical experts to ensure the accuracy and reliability of the annotation process.

| Dataset | Sentences | Instances | Unique CUIs | Semantic Types |
|---|---|---|---|---|
| ReDReS | 601 | 5,259 | 1,148 | 73 |
| Train set | 409 | 3,573 | 887 | 73 |
| Dev. set | 72 | 749 | 249 | 56 |
| Test set | 120 | 937 | 349 | 57 |
| ReDAD | 641 | 8,565 | 1,480 | 82 |
| Train set | 437 | 5,502 | 1,114 | 78 |
| Dev. set | 76 | 1,188 | 321 | 60 |
| Test set | 128 | 1,875 | 452 | 58 |

Labels - Type of Relation

| Dataset | Positive | Complex | Negative | No Relation |
|---|---|---|---|---|
| ReDReS | 1,732 (32.9%) | 1,491 (28.4%) | 97 (1.8%) | 1,945 (36.9%) |
| Train set | 1,176 (32.9%) | 996 (27.9%) | 69 (1.9%) | 1,332 (37.3%) |
| Dev. set | 241 (32.2%) | 213 (28.4%) | 7 (0.9%) | 288 (38.5%) |
| Test set | 313 (33.3%) | 282 (30.1%) | 21 (2.2%) | 321 (34.4%) |
| ReDAD | 2,496 (29.1%) | 2,874 (33.6%) | 125 (1.5%) | 3,070 (35.8%) |
| Train set | 1,718 (31.2%) | 1,923 (34.9%) | 68 (1.2%) | 1,793 (32.7%) |
| Dev. set | 286 (24.1%) | 373 (32.4%) | 18 (1.5%) | 511 (42%) |
| Test set | 492 (26.2%) | 578 (30.8%) | 39 (2.1%) | 766 (40.9%) |

Table 1: Datasets: Statistics of the RS and AD datasets and their label distribution.

The expert annotation yields two curated datasets. The Relation Detection dataset for Rett Syndrome (ReDReS) contains 601 sentences with 5,259 instances and 1,148 unique CUIs (Tab. 1). The inter-annotator agreement is measured using the Fleiss kappa score (McHugh, 2012), resulting in 0.6143 in the multi-class setup (4 classes) and indicating substantial agreement among annotators (Landis and Koch, 1977). In the binary setup (relation or no relation), the Fleiss kappa score is 0.7139. The Relation Detection dataset for Alzheimer’s Disease (ReDAD) comprises 641 sentences with 8,565 instances and 1,480 unique CUIs (Tab. 1). The Fleiss kappa score is 0.6403 in the multi-class setup and 0.7064 in the binary setup, showing substantial consensus among annotators. The final labels are determined through majority voting over the labels provided by the experts. While the label distribution across classes is relatively balanced, the negative class is under-represented, with 97 and 125 instances in ReDReS and ReDAD respectively (Tab. 1). Each dataset is randomly split into train, development, and test sets.
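
For readers unfamiliar with the agreement measure, the sketch below computes the Fleiss kappa from per-item label lists and resolves the final label by majority vote. It is a generic illustration, not the authors' script. It assumes every item is rated by the same number of annotators and that at least two distinct labels occur overall; how ties among three annotators are resolved is not stated in the text, so `majority_vote` simply returns `None` in that case.

```python
from collections import Counter
from typing import List, Optional


def fleiss_kappa(ratings: List[List[str]]) -> float:
    """ratings[i] holds the labels assigned to item i by the annotators."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    label_totals: Counter = Counter()
    mean_agreement = 0.0
    for labels in ratings:
        counts = Counter(labels)
        label_totals.update(counts)
        # Proportion of agreeing annotator pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        mean_agreement += agreeing / (n_raters * (n_raters - 1))
    mean_agreement /= n_items
    # Agreement expected by chance from the overall label proportions.
    expected = sum((c / (n_items * n_raters)) ** 2 for c in label_totals.values())
    return (mean_agreement - expected) / (1 - expected)


def majority_vote(labels: List[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators, or None if there is none."""
    top, count = Counter(labels).most_common(1)[0]
    return top if count > len(labels) / 2 else None


if __name__ == "__main__":
    toy = [["positive", "positive", "complex"], ["no relation"] * 3]  # made-up annotations
    print(fleiss_kappa(toy), [majority_vote(item) for item in toy])
```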

3 Models

Figure 3: Model Architecture of LaMReDA, LaMReDM (left), and LaMEL (right): Each model encodes the input sequence using PubMedBERT (large or base). For LaMReDA and LaMReDM, different tokens define the relation representation (A-P), passed through a linear projection layer, a dropout layer, and then a classification layer for prediction. The symbol # denotes element-wise addition and multiplication for LaMReDA and LaMReDM, respectively. For LaMEL, different tokens construct the entity representation (A-H), which are sent through a dropout layer and a linear layer to extract the projected entity representations.

In this section, we introduce two main models, the Language-Model Embedding Learning (LaMEL) model and the Language-Model Relation Detection (LaMReD) model (Fig. 3), to benchmark datasets and establish robust baselines.

Task formulation. Given a sentence containing two identified entities $e_1$ and $e_2$, we predict the semantic relation $sem_r$ between them. In the multi-class setup, the labels are: positive, negative, complex, and no relation. In the binary setup, the goal is to determine if any relation exists. Special tokens [ent] and [/ent] mark the start and end of each entity within the sentence, ensuring consistent identification and processing of entity boundaries.
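
A small sketch of how the special tokens could be inserted around the two entities before encoding. It assumes whitespace-tokenized input and non-overlapping token spans with exclusive end offsets; `mark_entities` is a hypothetical helper and the sentence is an illustrative example, not a dataset instance.

```python
from typing import List, Tuple


def mark_entities(tokens: List[str], e1: Tuple[int, int], e2: Tuple[int, int]) -> List[str]:
    """Wrap two non-overlapping token spans (start, end exclusive) in [ent] ... [/ent]."""
    starts = {e1[0], e2[0]}
    ends = {e1[1], e2[1]}
    marked: List[str] = []
    for i, token in enumerate(tokens):
        if i in ends:
            marked.append("[/ent]")  # close the entity that ended just before this token
        if i in starts:
            marked.append("[ent]")
        marked.append(token)
    if len(tokens) in ends:
        marked.append("[/ent]")  # entity that runs to the end of the sentence
    return marked


if __name__ == "__main__":
    words = "atomoxetine inhibits the norepinephrine transporter".split()
    print(" ".join(mark_entities(words, (0, 1), (3, 5))))
```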

3.1 LaMEL model

LaMEL learns an embedding space optimized for relation detection (Fig. 3). As the backbone language model (LM), we opt for PubMedBERT (Gu et al., 2021; Tinn et al., 2023) (MIT License), available in both uncased base and uncased large versions (HuggingFace’s Transformers library; Wolf et al., 2019). PubMedBERT is pretrained on the PubMed corpus, making it well-suited for our task as the curated datasets consist of sentences from abstracts of PubMed papers. Leveraging PubMedBERT ensures that the model can capture the language patterns prevalent in biomedical text. Following the LM encoding, we construct the representation of each entity $e_i$ by extracting its contextualized embedding $E_i$ from the encoded sequence. Subsequently, the entity representations are projected to the embedding space using a linear layer without changing the embedding dimension. The final prediction is based on cosine similarity between the two projected entity representations. If the cosine similarity exceeds a predefined threshold, the model predicts that there is a semantic relation between the two entities. We experiment with diverse strategies for learning entity representations (Fig. 3), aiming to optimize the effectiveness of the embedding space for the relation detection task. The explored types of entity representation $E$ are:

  • A, B, C - Special Tokens:

    $E_A = t_{[ent]}$, (1)
    $E_B = t_{[/ent]}$, (2)
    $E_C = t_{[ent]}; t_{[/ent]}$, (3)
  • D - Entity Pool:

    $E_D = [t_E]$, (4)
  • E - Entity & Middle Pool:

    $E_E = [t_E] * [t_{Inter}]$, (5)
  • F, G, H - Special Tokens & Middle Pool:

    $E_F = t_{[ent]} * [t_{Inter}]$, (6)
    $E_G = t_{[/ent]} * [t_{Inter}]$, (7)
    $E_H = t_{[ent]} * t_{[/ent]} * [t_{Inter}]$, (8)

where $\{E_A, E_B, E_D, E_E, E_F, E_G, E_H\} \in \mathbb{R}^{d}$ and $E_C \in \mathbb{R}^{2d}$, $d$ is the embedding size of PubMedBERT base (768) and PubMedBERT large (1024), $;$ defines the concatenation, $*$ holds for the element-wise multiplication, $t_{[ent]}$ and $t_{[/ent]}$ are the embeddings of the start and end special tokens of the entities, and $[t_E]$ and $[t_{Inter}]$ are the averaged pooled representations of the entities and of the intermediate tokens between the entities, respectively.
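
To make Eqs. 1-8 and the similarity-based prediction concrete, the sketch below builds the eight entity representations from plain Python lists and applies a cosine threshold. It is illustrative only: the linear projection and dropout are omitted, the intermediate span is assumed to be non-empty, and the function names and the threshold value of 0.5 are assumptions, not values taken from the paper.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty sequence of token embeddings dimension-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def hadamard(*vectors: Vector) -> Vector:
    """Element-wise product of equally sized vectors (the * operator in Eqs. 5-8)."""
    return [math.prod(column) for column in zip(*vectors)]


def entity_representations(
    t_start: Vector, t_end: Vector,
    entity_tokens: Sequence[Vector], inter_tokens: Sequence[Vector],
) -> Dict[str, Vector]:
    """Return the representations A-H for one entity."""
    pooled_entity = mean_pool(entity_tokens)
    pooled_inter = mean_pool(inter_tokens)
    return {
        "A": t_start,
        "B": t_end,
        "C": t_start + t_end,  # list concatenation, so the size is 2d
        "D": pooled_entity,
        "E": hadamard(pooled_entity, pooled_inter),
        "F": hadamard(t_start, pooled_inter),
        "G": hadamard(t_end, pooled_inter),
        "H": hadamard(t_start, t_end, pooled_inter),
    }


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def predict_relation(e1: Vector, e2: Vector, threshold: float = 0.5) -> bool:
    """Predict a relation when the cosine similarity exceeds the (assumed) threshold."""
    return cosine(e1, e2) > threshold
```

In the model itself these vectors would come from the PubMedBERT encoder and pass through the projection layer before the similarity is computed.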

3.2 LaMReD model

LaMReD provides two variations that differ in information synthesis (Fig. 3), aiming to explore the potential effect of different aggregations (Theodoropoulos and Moens, 2023). LaMReDA utilizes element-wise addition to aggregate the entities’ representations, while LaMReDM employs element-wise multiplication. The input text is encoded using PubMedBERT (base or large). Following LM encoding, we construct the relation representation by sampling and aggregating tokens from the input sequence. This step enables the model to capture essential features and contextual information relevant to semantic relation classification. To mitigate the risk of overfitting and enhance model generalization, we incorporate a dropout layer (Srivastava et al., 2014) with a probability of 0.3. The linear classification layer takes the aggregated representation and outputs the predicted label.

Following the paradigm proposed by Baldini Soares et al. (2019) and Hogan et al. (2021), we experiment with various approaches for learning relation representations tailored to the relation detection task to empirically ascertain the effectiveness of each strategy (Fig. 3). The explored types of relation representation R are the following:

  • A, B, C - Special Tokens:

    $R_A = f(l(t_{[ent]_1}), l(t_{[ent]_2}))$, (9)
    $R_B = f(l(t_{[/ent]_1}), l(t_{[/ent]_2}))$, (10)
    $R_C = f(l(t_{[ent]_1}), l(t_{[/ent]_1}), l(t_{[ent]_2}), l(t_{[/ent]_2}))$, (11)
  • D - Entity Pool:

    $R_D = f(l([t_{E1}]), l([t_{E2}]))$, (12)
  • E - Middle Pool:

    $R_E = l([t_{Inter}])$, (13)
  • F - [CLS] token & Entity Pool:

    $R_F = f(l(t_{[CLS]}), l([t_{E1}]), l([t_{E2}]))$, (14)
  • G, H, I - [CLS] token & Special Tokens:

    $R_G = f(l(t_{[CLS]}), l(t_{[ent]_1}), l(t_{[ent]_2}))$, (15)
    $R_H = f(l(t_{[CLS]}), l(t_{[/ent]_1}), l(t_{[/ent]_2}))$, (16)
    $R_I = f(l(t_{[CLS]}), l(t_{[ent]_1}), l(t_{[/ent]_1}), l(t_{[ent]_2}), l(t_{[/ent]_2}))$, (17)
  • J - [CLS] token & Middle Pool:

    $R_J = f(l(t_{[CLS]}), l([t_{Inter}]))$, (18)
  • K, L, M - Special tokens & Middle Pool:

    $R_K = f(l(t_{[ent]_1}), l([t_{Inter}]), l(t_{[ent]_2}))$, (19)
    $R_L = f(l(t_{[/ent]_1}), l([t_{Inter}]), l(t_{[/ent]_2}))$, (20)
    $R_M = f(l(t_{[ent]_1}), l(t_{[/ent]_1}), l([t_{Inter}]), l(t_{[ent]_2}), l(t_{[/ent]_2}))$, (21)
  • N - Entity & Middle Pool:

    $R_N = f(l([t_{E1}]), l([t_{Inter}]), l([t_{E2}]))$, (22)
  • O, P - Context Vector & Entity Pool:

    $R_O = l(cv)$, (23)
    $R_P = f(l([t_{E1}]), l([t_{E2}]), l(cv))$, (24)

where $\{R_A, R_B, R_C, R_D, R_E, R_F, R_G, R_H, R_I, R_J, R_K, R_L, R_M, R_N, R_O, R_P\} \in \mathbb{R}^{d}$, $d$ is the embedding size of PubMedBERT base (768) and PubMedBERT large (1024), $f()$ is the aggregation function, element-wise addition for LaMReDA and element-wise multiplication for LaMReDM, $l()$ is a linear projection layer with dimension equal to the embedding size, $t_{[ent]_1}$, $t_{[/ent]_1}$, $t_{[ent]_2}$, and $t_{[/ent]_2}$ are the embeddings of the start and end special tokens of the first and second entity, and $t_{[CLS]}$ is the representation of the special token [CLS]. We define the averaged pooled representation of the entities and the intermediate tokens between the entities as $[t_{E1}]$, $[t_{E2}]$, and $[t_{Inter}]$ correspondingly. In equations 23 and 24, we utilize the localized context vector $cv$ (additional information is provided in Appendix D) that utilizes the attention heads to locate relevant context for the entity pair and was introduced in ATLOP (Zhou et al., 2021), a state-of-the-art model in document-level relation extraction.
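
The two LaMReD variants differ only in the aggregation function $f$. The sketch below shows this for $R_A$ (Eq. 9) with plain Python lists. It is a simplified illustration: a single projection `l` shared across inputs is an assumption (the text does not say whether the projection weights are shared), the dropout and classification layers are left out, and all names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def linear(weight: List[Vector], bias: Vector) -> Callable[[Vector], Vector]:
    """Return a projection l(x) = W x + b with fixed (here hand-set) parameters."""
    def apply(x: Vector) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]
    return apply


def aggregate(parts: Sequence[Vector], mode: str) -> Vector:
    """Aggregation f: element-wise addition (LaMReDA) or multiplication (LaMReDM)."""
    if mode == "add":
        return [sum(column) for column in zip(*parts)]
    if mode == "mul":
        return [math.prod(column) for column in zip(*parts)]
    raise ValueError(f"unknown aggregation mode: {mode}")


def relation_a(l: Callable[[Vector], Vector], t_ent1: Vector, t_ent2: Vector, mode: str) -> Vector:
    """R_A = f(l(t_[ent]_1), l(t_[ent]_2)), following Eq. 9."""
    return aggregate([l(t_ent1), l(t_ent2)], mode)


if __name__ == "__main__":
    identity = linear([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    print(relation_a(identity, [1.0, 2.0], [3.0, 4.0], "add"))  # LaMReDA-style
    print(relation_a(identity, [1.0, 2.0], [3.0, 4.0], "mul"))  # LaMReDM-style
```

The other representations (Eqs. 10-24) follow the same pattern with different inputs passed to `aggregate`.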

3.3 Experimental setup

The models are trained for 50 epochs, and the best checkpoints are retained based on the F1-score on the development set. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of $10^{-5}$ and a batch size of 16. We conduct experiments in two distinct setups. In the multi-class setup, we evaluate performance using micro and macro F1-scores over four relation types: positive, negative, complex, and no relation. In the binary setup, the objective is to predict whether a relation is present. LaMEL is specifically designed for the binary setup. We use the official splits of ReDReS and ReDAD (Tab. 1) and repeat the experiments 10 times with different seeds. To ensure the robustness of the results, we also employ 5-fold cross-validation. To explore the cross-disease capabilities of our approach, we train the models on one dataset (e.g., ReDReS) and evaluate on the other (e.g., ReDAD), and vice versa. In these cross-disease experiments, we use the relation representation $R_A$ (Eq. 9) for LaMReDA and LaMReDM and the entity representation $E_A$ (Eq. 1) for LaMEL. These experiments are repeated 10 times with different seeds, and 15% of the training data is held out as the development set (footnote 10: hardware: a single NVIDIA RTX 3090 GPU with 24 GB).
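The difference between the two multi-class metrics can be illustrated with a short sketch. Micro F1 pools the counts of all classes before computing the score, whereas macro F1 averages the per-class scores, so rare classes weigh more in the macro value. The function name, the toy labels, and the choice to include all four classes in both averages are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter
from typing import List, Tuple


def f1_scores(gold: List[str], pred: List[str], labels: List[str]) -> Tuple[float, float]:
    """Return (micro F1, macro F1) computed over the given label set."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    def f1(true_pos: int, false_pos: int, false_neg: int) -> float:
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    micro = f1(
        sum(tp[c] for c in labels),
        sum(fp[c] for c in labels),
        sum(fn[c] for c in labels),
    )
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy example (hypothetical predictions).
LABELS = ["positive", "negative", "complex", "no relation"]
gold = ["positive", "positive", "negative", "complex", "no relation", "no relation"]
pred = ["positive", "negative", "negative", "no relation", "no relation", "no relation"]

micro_f1, macro_f1 = f1_scores(gold, pred, LABELS)
```

In this toy case the missed "complex" instance lowers the macro score much more than the micro score, which is why both values are worth reporting for imbalanced relation types.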

The cross-entropy loss function is used to train LaMReDA and LaMReDM. For LaMEL, the following cosine embedding loss function is used:

$$l(x_1, x_2, y) = \begin{cases} 1 - \cos(x_1, x_2), & \text{if } y = 1 \\ \max(0, \cos(x_1, x_2) - m), & \text{if } y = -1 \end{cases}, \tag{25}$$

where $x_1$ and $x_2$ are the projected representations of the two entities, $y$ is the ground-truth label (1 if the entities are correlated, -1 if they are not), $\cos()$ is the cosine similarity in the embedding space, and $m$ is the margin parameter, which is set to 0. At inference time, the threshold on the cosine similarity of the two entity representations for predicting the presence of a relation is set to 0.5.
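The loss of Eq. 25 and the thresholded decision rule can be written out in a few lines. This is a dependency-free sketch rather than the training code of the paper: the function names are hypothetical, the zero-vector guard is an added convenience, and the strict comparison against the threshold is an assumption, since the text does not state how a similarity of exactly 0.5 is handled.

```python
import math
from typing import Sequence


def cosine(x1: Sequence[float], x2: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm = math.sqrt(sum(a * a for a in x1)) * math.sqrt(sum(b * b for b in x2))
    return dot / norm if norm else 0.0


def cosine_embedding_loss(
    x1: Sequence[float], x2: Sequence[float], y: int, margin: float = 0.0
) -> float:
    """Eq. 25: pull related pairs together, push unrelated pairs below the margin."""
    sim = cosine(x1, x2)
    if y == 1:
        return 1.0 - sim
    if y == -1:
        return max(0.0, sim - margin)
    raise ValueError("y must be 1 or -1")


def predict_relation(
    x1: Sequence[float], x2: Sequence[float], threshold: float = 0.5
) -> bool:
    # Assumption: a relation is predicted when the similarity is strictly above 0.5.
    return cosine(x1, x2) > threshold


aligned = ([1.0, 0.0], [1.0, 0.0])
orthogonal = ([1.0, 0.0], [0.0, 1.0])

loss_related_aligned = cosine_embedding_loss(*aligned, y=1)
loss_unrelated_aligned = cosine_embedding_loss(*aligned, y=-1)
loss_unrelated_orthogonal = cosine_embedding_loss(*orthogonal, y=-1)
```

With $m = 0$, an unrelated pair is penalized only while its similarity is positive, so orthogonal or opposed embeddings incur no loss.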

4 Results

Tables 2 and 3 report the F1-scores of the LaMEL, LaMReDA, and LaMReDM models on the ReDReS and ReDAD datasets. Each cell (except for the cross-disease experiments) displays two values: the average F1-score from 10 runs on the original test set (Tab. 1) and the average F1-score from 5-fold cross-validation. The models perform well across all relation (A-P) and entity (A-H) representations, showing their ability to learn meaningful representations for the semantic relation task regardless of the initial token selection. However, we observe patterns regarding the relation representations. In the binary setup, relation representation $R_G$ (Eq. 15) yields strong results for both datasets, suggesting that including the [CLS] token representation might be beneficial. In the multi-class setup, relation representations $R_L$ (Eq. 20), $R_J$ (Eq. 18), and $R_O$ (Eq. 23) are effective for both datasets, indicating that the surrounding context is crucial for the more complex task: $R_L$ and $R_J$ include the average-pooled representation of the intermediate tokens between the entities, and $R_O$ leverages the context vector (Zhou et al., 2021). The intra-model comparison reveals that over-parameterization tends to be useful. Using PubMedBERT large generally results in better performance than the base alternative. PubMedBERT base is superior mainly in the experiments that use the original splits of ReDReS (Tab. 1). LaMEL is highly competitive with LaMReDA and LaMReDM, indicating that learning entity embedding spaces optimized for relation detection is promising. LaMEL achieves the highest performance in the 5-fold setup of ReDAD and the original setup of ReDReS, with F1-scores of 91.25% and 91.03%, respectively.

| Type¹ | ReDReS F1 □ | ReDReS F1 ■ | ReDAD F1 □ | ReDAD F1 ■ |
|---|---|---|---|---|
| A | 90.25/89.43 | 90.88/90.01 | 86.73/88.75 | 88.9/90.17 |
| B | 90.29/89.01 | 90.73/89.41 | 86.89/89.29 | 88.22/89.15 |
| C | 90.51/89.44 | 90.71/89.67 | 87.49/90.02 | 88.57/90.65 |
| D | 90.47/88.9 | 91.03/90.07 | 86.29/88.88 | 88.22/90.64 |
| E | 90.61/89.1 | 90.54/89.55 | 86.03/88.96 | 88.74/90.35 |
| F | 90.48/89.37 | 90.88/90.29 | 87.27/89.18 | 89.44/90.57 |
| G | 90.32/89.71 | 90.35/89.43 | 86.97/89.46 | 89.12/91.25 |
| H | 89.68/89.24 | 90.13/89.29 | 87.29/89.91 | 88.77/90.67 |
| CD² | 86.2 | 89.14 | 88.92 | 88.56 |

¹ Type of entity representation.

² Cross-disease experiments using the entity representation $E_A$: training on ReDReS and evaluation on ReDAD, and vice versa.

Table 2: LaMEL results (%) in the binary setup (PubMedBERT □: base, ■: large). Each cell (except for the cross-disease experiments) shows the average F1-score from 10 runs (original test set) and from the 5-fold cross-validation setup.

The inter-model comparison across the same relation representations indicates that the aggregation function does not significantly impact relation detection. Neither LaMReDA (element-wise addition) nor LaMReDM (element-wise multiplication) shows a clear advantage over the other. This suggests that the transformer layers of PubMedBERT and the projection layer $l()$ preceding the aggregation are effectively trained in both models to encode the essential information for relation detection, regardless of the aggregation function used. The cross-disease experiments underscore the robustness of the models in both the binary and multi-class setups. This robustness supports transfer learning (Zhuang et al., 2020) for semantic relation detection in other diseases and highlights the potential for broader applications and research endeavors in knowledge discovery.

| Data | Type¹ | Binary LaMReDA F1 □ | Binary LaMReDA F1 ■ | Binary LaMReDM F1 □ | Binary LaMReDM F1 ■ | Micro LaMReDA F1 □ | Micro LaMReDA F1 ■ | Micro LaMReDM F1 □ | Micro LaMReDM F1 ■ | Macro LaMReDA F1 □ | Macro LaMReDA F1 ■ | Macro LaMReDM F1 □ | Macro LaMReDM F1 ■ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReDReS | A | 90.72/89.95 | 90.74/90.57 | 90.42/89.15 | 90.71/89.53 | 74.49/73.91 | 73.96/75.01 | 74.36/73.31 | 74.35/74.91 | 74.52/74.5 | 73.66/74.48 | 74.3/73.06 | 72.81/75.07 |
| ReDReS | B | 90.4/88.79 | 90.28/89.54 | 90.47/89.33 | 90.06/89.74 | 74.27/74.14 | 73.72/74.79 | 74.26/74.45 | 73.57/75.34 | 74.32/74.15 | 73.65/75.19 | 74.38/74.31 | 73.11/75.74 |
| ReDReS | C | 90.85/89.69 | 90.75/89.75 | 90.51/88.84 | 89.14/89.16 | 74.93/72.98 | 73.54/74.59 | 74.31/72.71 | 73.69/73.56 | 74.96/73.75 | 73.44/74.74 | 74.1/72.83 | 73.49/73.88 |
| ReDReS | D | 90.55/89.29 | 90.93/89.25 | 90.61/89.47 | 90.53/88.96 | 73.61/73.85 | 73.5/74.36 | 73.02/74.96 | 73.5/75.77 | 73.71/74.54 | 73.7/74.62 | 73.24/75.12 | 73.9/76.24 |
| ReDReS | E | 89.57/89.39 | 89.43/88.89 | 89.57/89.39 | 89.43/88.89 | 73.73/75.67 | 73.68/74.68 | 73.73/75.67 | 73.21/74.68 | 73.95/74.9 | 74.01/75.1 | 73.95/74.9 | 74.01/75.1 |
| ReDReS | F | 90.48/89.09 | 90.62/89.56 | 90.41/89.19 | 90.43/89.94 | 72.86/74.18 | 73.82/76.55 | 73.51/74.07 | 73.33/75.26 | 72.62/72.82 | 73.94/76.66 | 73.32/74.08 | 74.5/75.84 |
| ReDReS | G | 90.78/89.32 | 90.76/90.26 | 90.91/89.82 | 89.47/89.6 | 74.33/73.35 | 73.63/73.59 | 74.05/75.05 | 73.22/74.34 | 74.57/73.78 | 73.31/74.02 | 74.21/75.13 | 73.87/74.8 |
| ReDReS | H | 90.91/88.88 | 90.45/88.98 | 90.29/88.93 | 89.99/89.14 | 74.43/73.68 | 73.62/74.56 | 73.59/74.06 | 73.36/74.71 | 74.48/73.96 | 73.65/74.62 | 73.9/74.12 | 73.42/75.04 |
| ReDReS | I | 90.86/89.07 | 90.47/89.38 | 90.62/89.35 | 89.55/89.18 | 74.75/73.19 | 73.3/74.78 | 74.29/74.14 | 74/74.28 | 74.8/73.48 | 73.26/74.9 | 73.88/74.6 | 73.91/74.16 |
| ReDReS | J | 89.43/89.23 | 89.65/89.3 | 89.53/88.99 | 89.89/89.43 | 73.75/76.05 | 74.28/74.55 | 73.47/75.04 | 74.43/74.95 | 74.05/75.09 | 75.06/74.97 | 73.52/75.91 | 74.7/75.53 |
| ReDReS | K | 90.1/89.63 | 90.05/89.26 | 89.7/89.18 | 89.64/89.54 | 74.43/74.4 | 74.07/75.22 | 74.47/75.89 | 74.38/74.97 | 74.44/74.8 | 74.23/75.42 | 74.3/75.81 | 74.02/74.67 |
| ReDReS | L | 89.6/89.86 | 89.85/89.95 | 89.82/88.61 | 90.33/88.93 | 73.35/74.27 | 74.32/75.42 | 73.68/76.5 | 73.9/76.15 | 73.16/74.55 | 74.04/75.52 | 73.66/75.08 | 73.52/76.02 |
| ReDReS | M | 90.81/90.27 | 90.07/89.85 | 90.01/89.3 | 89.75/89.73 | 74.21/74.59 | 74.29/74.85 | 73.96/74.94 | 74.32/74.87 | 74.03/74.04 | 74.36/74.68 | 73.72/75.1 | 73.95/75.59 |
| ReDReS | N | 90.73/89.37 | 90.6/89.71 | 90.72/88.77 | 90.63/89.49 | 74.55/73.49 | 73.83/73.47 | 73.86/74.81 | 73.38/74.58 | 74.66/74.81 | 73.97/74.94 | 74.13/74.82 | 73.53/74.76 |
| ReDReS | O | 90.9/89.94 | 90.5/89.35 | 90.9/89.94 | 90.5/89.35 | 73.99/74.51 | 73.77/73.79 | 73.99/74.51 | 73.77/73.79 | 73.83/74.71 | 73.62/74.27 | 73.83/74.71 | 73.62/74.27 |
| ReDReS | P | 89.72/89.8 | 90.31/90.08 | 89.13/89.19 | 90.3/89.92 | 73.37/75.03 | 73.87/75.02 | 73.57/75.24 | 74.41/75.44 | 73.48/74.79 | 74.66/75.18 | 73.54/74.84 | 73.52/75.79 |
| ReDReS | CD² | 87.42 | 88.93 | 87.76 | 88.1 | 73.09 | 75.04 | 74.15 | 75.35 | 73.64 | 74.94 | 74.38 | 75.44 |
| ReDAD | A | 88.31/90.15 | 89.55/91.07 | 87.98/90.37 | 89.14/89.92 | 77.64/77.07 | 79.47/78.07 | 78.34/76.14 | 80.21/78.17 | 77.34/77.21 | 79.26/77.83 | 78.44/76.4 | 80.13/78.39 |
| ReDAD | B | 87.82/90.57 | 89.11/90.52 | 87.66/88.83 | 88.64/87.11 | 77.74/77.56 | 78.65/78.24 | 78.61/76.43 | 78.91/78.26 | 77.13/77.74 | 78.72/78.31 | 77.76/76.55 | 78.98/77.56 |
| ReDAD | C | 88.3/89.64 | 89.21/87.01 | 88.11/88.79 | 89.17/89.87 | 77.14/76.7 | 79.67/77.89 | 78.19/76.41 | 79.32/77.59 | 77.08/77.05 | 79.35/77.92 | 77.82/76.62 | 79.26/77.78 |
| ReDAD | D | 87.33/88.99 | 89.82/89.25 | 88.18/89.61 | 88.8/90.05 | 78.28/76.73 | 79.54/75.64 | 76.81/76.58 | 78.68/78.37 | 78.26/76.8 | 78.47/76.12 | 76.67/76.84 | 78.73/78.87 |
| ReDAD | E | 88.03/88.63 | 89.37/90.91 | 88.03/88.63 | 89.37/90.91 | 77.83/77.45 | 77.54/78.3 | 77.83/77.45 | 77.54/78.3 | 77.75/77.34 | 77.69/78.41 | 77.75/77.34 | 77.69/78.41 |
| ReDAD | F | 87.71/89.54 | 88.45/89.45 | 87.87/90.11 | 88.54/90.99 | 77.59/76.5 | 79.74/79.23 | 76.94/76.75 | 79.68/77.94 | 77.35/76.33 | 79.47/79.31 | 76.95/76.95 | 79.21/77.38 |
| ReDAD | G | 88.17/90.06 | 89.83/88.96 | 88.22/89.75 | 89.55/90.15 | 77.83/77.64 | 79.39/78.61 | 78.13/77.04 | 79.09/77.91 | 77.5/77.69 | 79.58/78.76 | 77.88/77.56 | 78.74/77.73 |
| ReDAD | H | 88.01/89.14 | 88.76/90.78 | 87.73/88.99 | 88.99/90.39 | 77.12/76.36 | 79.57/78.32 | 78.11/77.81 | 79.4/78.13 | 77.09/76.08 | 78.76/78.88 | 78.14/77.81 | 79.18/78.15 |
| ReDAD | I | 87.56/88.64 | 88.05/89.67 | 87.86/90.14 | 89.45/90.13 | 77.77/76.29 | 79.23/77.7 | 78.4/76.08 | 78.99/78.42 | 77.11/76.56 | 79.66/78.31 | 78.49/76.24 | 78.78/77.97 |
| ReDAD | J | 87.99/89.91 | 88.89/91.12 | 87.79/90.5 | 89.06/89.05 | 77.69/77.48 | 78.4/78.92 | 77.14/78.35 | 78.4/77.55 | 77.71/77.57 | 77.94/78.87 | 76.59/77.92 | 78.18/77.43 |
| ReDAD | K | 88.36/91.01 | 89.33/90.94 | 88.01/90.09 | 89.05/90.89 | 78.3/78.24 | 78.25/76.19 | 78.54/78.13 | 77.73/77.5 | 78.11/78.17 | 78.49/76.1 | 78.29/78.6 | 77.42/77.46 |
| ReDAD | L | 88.25/90.53 | 89.25/91.09 | 87.87/90.03 | 89.13/90.02 | 78.48/77.87 | 78.94/77.67 | 77.88/77.59 | 77.91/77.97 | 78.52/78.05 | 78.16/78.47 | 77.85/77.67 | 77.37/78.22 |
| ReDAD | M | 88.42/90.07 | 89.57/90.8 | 88.12/90.52 | 89.4/90.46 | 77/77.31 | 78.85/78.42 | 78.02/78.5 | 77.12/77.66 | 76.62/77.2 | 79.66/77.85 | 78.02/78.38 | 77.08/77.66 |
| ReDAD | N | 87.98/90.71 | 88.94/90.82 | 88.08/90.43 | 89.47/90.68 | 78.21/77.97 | 78.92/78.24 | 78.03/77.63 | 78.78/78.06 | 78.07/77.3 | 77.91/78.39 | 77.74/77.73 | 78.6/78.1 |
| ReDAD | O | 88.27/90.59 | 87.71/91.06 | 88.27/90.59 | 87.71/91.06 | 77.08/79.02 | 78.96/78.78 | 77.08/79.02 | 78.96/78.78 | 76.78/78.95 | 79.03/78.97 | 76.78/78.95 | 79.03/78.97 |
| ReDAD | P | 88.02/89.41 | 88.86/89.93 | 88.33/90.26 | 89.51/87.85 | 78.43/77.45 | 79.38/76.67 | 78.44/77.19 | 79.12/77.7 | 78.37/76.25 | 79.44/77.27 | 78.04/77.31 | 78.91/77.76 |
| ReDAD | CD² | 88.4 | 89.33 | 89.01 | 89.16 | 73.69 | 74.29 | 72.67 | 72.81 | 74.13 | 74.82 | 72.91 | 73.76 |

Binary: binary setup. Micro and Macro: micro and macro evaluation in the multi-class setup.

¹ Type of relation representation.

² Cross-disease experiments using the relation representation $R_A$: training on ReDReS and evaluation on ReDAD, and vice versa.

Table 3: LaMReDA and LaMReDM results (%) in the binary and multi-class setups (PubMedBERT □: base, ■: large). Each cell (except for the cross-disease experiments) shows the average F1-score from 10 runs (original test set) and from the 5-fold cross-validation setup.

Human Performance. To estimate human performance and compare the models against it, two additional experts identify the relation type in a random sample of 300 instances from the test set of each dataset (Tab. 1). The original test set labels serve as the ground truth. In the binary setup, the average F1-score is 92.14 for ReDReS and 91.87 for ReDAD. The LaMReDA, LaMReDM, and LaMEL models achieve performance comparable to that of the human experts, indicating a high ability to detect semantic relations. In the multi-class setup, the macro F1-score is 85.23 (micro: 85.45) for ReDReS and 85.76 (micro: 85.87) for ReDAD. Here, all models show a performance gap relative to the human experts, highlighting that identifying more complex aspects of semantic relations is a challenging task.

Baseline performance - lower bound. We randomly assign labels based on the class distribution of the training data (Tab. 1). In the binary setup, the baseline achieves F1-scores of 54% (ReDReS) and 53.16% (ReDAD). In the multi-class setup, the macro F1-scores range from 32.05% to 32.43%, underscoring the difficulty of the task, particularly of distinguishing among the different semantic relation types (footnote 11: more information is available in Appendix E).
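A distribution-matched random baseline of this kind can be sketched as follows. The label counts are invented for the example and do not reproduce the statistics of Tab. 1; the function name and the fixed seed are likewise assumptions.

```python
import random
from collections import Counter
from typing import List


def random_baseline(train_labels: List[str], n_test: int, seed: int = 0) -> List[str]:
    """Draw n_test labels at random, following the training class distribution."""
    counts = Counter(train_labels)
    classes = sorted(counts)  # fixed order keeps the draw reproducible
    weights = [counts[c] for c in classes]
    rng = random.Random(seed)
    return rng.choices(classes, weights=weights, k=n_test)


# Hypothetical class counts, used only to exercise the function.
train_labels = (
    ["no relation"] * 50 + ["positive"] * 30 + ["negative"] * 15 + ["complex"] * 5
)
predictions = random_baseline(train_labels, n_test=200, seed=0)
```

Scoring such predictions with the same F1 metrics as the models gives a lower bound that already accounts for class imbalance.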

5 Probing

This study probes PubMedBERT’s ability to capture semantic relations between entities. We explore different transformer layer representations and attention scores per layer and attention head. Average-pooled entity representations are extracted from each layer, followed by training a linear classification layer. We test relation representations $R_D$, $R_O$, and $R_P$ (Eq. 12, 23, 24) of LaMReDA and LaMReDM to assess the impact of the context vector. Out-of-the-box representations are evaluated without the linear projection layer $l()$. We also extract the average attention scores of the tokens of each entity towards the other entity across each layer and head, concatenating these into a feature vector for training a linear classification layer. Following Chizhikova et al. (2022), we also train the classification layer using the average attention scores between the two entities across all layers.
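One way to read the attention-based probing features is sketched below. The nested-list layout `[layer][head][query][key]`, the function names, and the choice to store both directions (first entity to second, and second to first) per head are assumptions made for illustration; the toy attention values are invented.

```python
from typing import List, Sequence

# attention[layer][head][query][key]: weight that token `query` puts on token `key`
Attention = List[List[List[List[float]]]]


def mean_attention(
    matrix: List[List[float]], sources: Sequence[int], targets: Sequence[int]
) -> float:
    """Average attention from the source tokens towards the target tokens."""
    total = sum(matrix[i][j] for i in sources for j in targets)
    return total / (len(sources) * len(targets))


def attention_features(
    attention: Attention, ent1: Sequence[int], ent2: Sequence[int]
) -> List[float]:
    """Concatenate the entity-to-entity averages over every layer and head."""
    features: List[float] = []
    for layer in attention:
        for head in layer:
            features.append(mean_attention(head, ent1, ent2))  # entity 1 -> entity 2
            features.append(mean_attention(head, ent2, ent1))  # entity 2 -> entity 1
    return features


# Toy input: 1 layer, 2 heads, 3 tokens; entity 1 is token 0, entity 2 is tokens 1-2.
toy_attention: Attention = [
    [
        [[0.2, 0.3, 0.5], [0.6, 0.2, 0.2], [0.4, 0.4, 0.2]],
        [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    ]
]
features = attention_features(toy_attention, ent1=[0], ent2=[1, 2])
```

Under this reading, a model with 12 layers and 12 heads would yield 288 features per entity pair, which a linear classifier can consume directly.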

Figure 4: ReDReS probing (binary setup). The figure examines the LaMReDA/LaMReDM relation representations ($R_D$, $R_O$, $R_P$) and attention scores from each layer, and explores the average attention scores of the tokens of each entity towards the other entity across attention heads. Top boundary: best LaMReDA and LaMReDM performance (Tab. 3). Second boundary: classifier with average attention scores across all layers as input.

Figure 4 shows the results of the probing experiments in the binary setup using ReDReS and PubMedBERT base (footnote 12: additional probing experiments are presented in Appendix G). The 10th and 11th layers provide the most informative representations for the relation types ($R_D$, $R_O$, and $R_P$). The $R_D$ representation, using element-wise multiplication, outperforms the other representations in intra-layer comparisons, suggesting its effectiveness without end-to-end training. However, as highlighted in Section 4, inter-model comparisons indicate that the transformer layers and the projection layer $l()$ capture crucial information for relation detection, regardless of the aggregation function. Using context vectors with $R_O$ and $R_P$ generally offers no advantage, though $R_O$ from the 10th and 11th layers performs well, indicating possibly meaningful localized context. Attention scores between entities in the 12th layer yield the best performance, surpassing the baseline that uses scores from all layers, which indicates strong attention between the entities in the last layer. Figure 4 reveals that the 6th and 9th attention heads are the most informative for relation detection.

6 Related Work

Information Extraction Datasets. Several biomedical datasets aim to enhance Information Extraction (IE) system development (Huang et al., 2024; Detroja et al., 2023; Nasar et al., 2021; Theodoropoulos et al., 2021), typically focusing on one or a few entity types and their interactions. AIMed (Bunescu et al., 2005), BioInfer (Pyysalo et al., 2007), and BioCreative II PPI IPS (Krallinger et al., 2008) formulate protein-protein interactions. The chemical-protein and chemical-disease interactions are modeled by DrugProt (Miranda et al., 2021) and BC5CDR (Li et al., 2016), respectively. ADE (Gurulingappa et al., 2012), DDI13 (Herrero-Zazo et al., 2013), and n2c2 2018 ADE (Henry et al., 2020) include drug-ADE (adverse drug effect) and drug-drug interactions. EMU (Doughty et al., 2011), GAD (Bravo et al., 2015) and RENET2 (Su et al., 2021) contain relations between genes and diseases. N-ary (Peng et al., 2017) incorporates drug-gene mutation interactions. The task of event extraction is illustrated by GE09 (Kim et al., 2009), GE11 (Kim et al., 2011), and CG (Pyysalo et al., 2013). DDAE (Lai et al., 2019) includes disease-disease associations. BioRED (Luo et al., 2022) focuses on document-level relations for various entities. Unlike these datasets, ReDReS and ReDAD focus on RS and AD, include entities of up to 82 different semantic types, and model the semantic relation between them.

Knowledge Discovery. Gottlieb et al. (2011) present PREDICT, a method for ranking potential drug-disease associations to predict drug indications. Romano et al. (2024) release AlzKB, a heterogeneous graph knowledge base for AD, constructed using external data sources and describing various medical entities (e.g., chemicals, genes). Other graph-based efforts model knowledge around AD for tasks such as drug repurposing (Hsieh et al., 2023; Daluwatumulle et al., 2022; Nian et al., 2022), gene identification (Binder et al., 2022), or as general knowledge repositories (Sügis et al., 2019). Another paradigm for knowledge discovery is the open information extraction (OIE) setup (Mausam et al., 2012; Etzioni et al., 2008), which faces challenges such as data consistency, performance evaluation, and semantic drift (Zhou et al., 2022). Research efforts (Wang et al., 2018; de Silva et al., 2017; Nebot and Berlanga, 2014; Movshovitz-Attias and Cohen, 2012; Nebot and Berlanga, 2011) aim to address these issues and extract knowledge with little or no supervision. Advances in literature-based discovery (Gopalakrishnan et al., 2019; Thilakaratne et al., 2019) try to identify novel medical entity relations using graph-based (Kilicoglu et al., 2020; Nicholson and Greene, 2020), machine learning (Zhao et al., 2021; Lardos et al., 2022), and co-occurrence methods (Kuusisto et al., 2020; Millikin et al., 2023). Tian et al. (2024) stress the potential of large language models (LLMs) to summarize, simplify, and synthesize medical evidence (Peng et al., 2023; Tang et al., 2023; Shaib et al., 2023), suggesting that LLMs may have encoded biomedical knowledge (Singhal et al., 2023). To exploit this potential, we explore constructing LM representations for knowledge discovery. To the best of our knowledge, no systematic approach assembles knowledge about RS. 
Unlike previous work, we introduce a framework that is disease-agnostic in principle and acquires knowledge about RS and AD starting from raw text.

7 Conclusion

This work presents an open-source framework for disease knowledge discovery from raw text. We contribute two new annotated datasets for RS (ReDReS) and AD (ReDAD), facilitating further research. Extensive evaluation explores various methods for representing relations and entities, yielding insights into optimal modeling approaches for semantic relation detection, and emphasizing language models’ potential in knowledge discovery.

Limitations

One limitation of the paper is that the data pipeline relies on an external mention extractor/linker. However, this aspect introduces flexibility, allowing researchers and practitioners to integrate custom models suited to their specific applications. The creation of gold-standard datasets requires the manual work of medical experts. This process is time-consuming and resource-intensive, potentially limiting the scalability of the approach. Nevertheless, the experiments demonstrate that the supervised models of the study achieve strong performance in semantic relation detection without needing a large training set. Additionally, the cross-disease experiments highlight the robustness of the models in both binary and multi-class setups. This finding enables transfer learning scenarios in semantic relation detection, which can be applied to other diseases or medical aspects, indicating a potential for broader applications and research opportunities.

Ethics Statement

All recruited medical experts provided informed consent before participating in the annotation process. The compensation provided to the annotators was adequate and took into account their demographics, particularly their country of residence.

References

Appendix A Data Pipeline: Additional Information

To facilitate effective abstract retrieval, we implement an iterative approach to circumvent the API’s limitation of retrieving only 10,000 article IDs per query. This iterative process enables us to access a comprehensive set of PubMed IDs (PMIDs) related to the query. The detailed list of the 82 semantic types of the MetaMapLite-based pipeline is presented in Table 4. In addition to the MetaMapLite-based pipeline, we propose a second pipeline that is based on ScispaCy (Apache License 2.0) (Fig. 5). The two pipelines differ in the selection of entity extractors and linkers that map the extracted entities to knowledge schemes. MetaMapLite adopts an integrated approach in which mention extraction and linking are performed simultaneously in a single step, and it focuses on UMLS mapping, allowing for more precise and targeted extraction of entities. ScispaCy, in contrast, serves a broader range of Natural Language Processing (NLP) tasks. After the retrieval of the abstracts, which is described in subsection 2.1, the following steps are executed: knowledge schema and linker generation, mention extraction, entity linking, and sampling of linked identifiers.

Figure 5: The pipeline starts with abstract retrieval using a natural language query. Next, entities are detected using the 4 different ScispaCy extractors (CRAFT, JNLPBA, BC5CDR, BIONLP13CG). The entity linking is executed, utilizing the knowledge schema and linker generation step that creates the updated linkers tailored to a range of knowledge schemes. To address the scenario of multiple concept unique identifiers (CUIs) due to the utilization of multiple linkers, we select the most relevant CUIs, using a prioritized sampling strategy. Then, the co-occurrence graph generation step models the intra-sentence co-occurrence of the extracted entities. The final step is the dataset creation using the processed text and co-occurrence graph.

Knowledge schema and linker generation. ScispaCy (Neumann et al., 2019) harnesses an older version of UMLS (2020AA). This version serves as the foundation upon which ScispaCy trains and constructs its linkers, which operate on a search mechanism based on character 3-gram string overlap, facilitating efficient and accurate entity recognition and linking. Following the paradigm of ScispaCy, we provide scripts for generating updated linkers tailored to a range of knowledge schemes. These include UMLS (Bodenreider, 2004), Gene Ontology (GO) (Consortium, 2004), National Center for Biotechnology Information (NCBI) taxonomy (Schoch et al., 2020), RxNorm (Nelson et al., 2011), SNOMED Clinical Terms (SNOMEDCT_US) (Stearns et al., 2001), Human Phenotype Ontology (HPO) (Köhler et al., 2021), Medical Subject Headings (MeSH) (Lipscomb, 2000), DrugBank (Knox et al., 2024), and Gold Standard Drug Database (GS) (footnote 13: https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/GS/index.html). Of particular note is the inclusion of UMLS, a unified system encompassing various knowledge bases, vocabularies, taxonomies, and ontologies pertinent to the biomedical domain. Any supported linker maps the concepts to UMLS CUIs, enhancing the standardization of medical terminology. Notably, the flexibility of ScispaCy’s implementation allows for seamless expansion to incorporate additional knowledge bases, thereby enhancing its versatility and applicability across diverse research needs.

Mention extraction. ScispaCy boasts four distinct entity extractors, each trained on different corpora, collectively encompassing a range of entity types. These extractors include named entity recognition (NER) models trained on the CRAFT corpus (with 6 entity types) (Bada et al., 2012), JNLPBA corpus (with 5 entity types) (Collier et al., 2004), BC5CDR corpus (with 2 entity types) (Li et al., 2016), and BIONLP13CG corpus (with 16 entity types) (Kim et al., 2013). To maximize the range of the entity extraction, we leverage these diverse extractors in tandem, allowing us to capture mentions of 18 unique entity types (gene or protein, cell, chemical, organism, disease, organ, DNA, RNA, tissue, cancer, cellular component, anatomical system, multi-tissue structure, organism subdivision, developing anatomical structure, pathological formation, organism substance, and immaterial anatomical entity).

Entity linking. This process enhances the semantic understanding of the extracted entities, facilitates standardization, which is a key issue in the biomedical field (Bettencourt-Silva et al., 2012; Theodoropoulos et al., 2023), and promotes interoperability with external resources by associating the entities with specific concepts in supported knowledge schemes. Each entity is subjected to a linking process where we attempt to map it to concepts within supported knowledge bases or vocabularies. If a match is found, the entity is assigned a unique identifier, referred to as a CUI, corresponding to the specific concept in the knowledge schema. As entities may be linked to multiple knowledge sources, we merge the extracted CUIs obtained from the different linkers. This consolidation process ensures that each entity is associated with a comprehensive set of identifiers, encompassing diverse perspectives and representations across various knowledge schemes.

Sampling of linked identifiers. We address the scenario where multiple CUIs can be extracted for each entity due to the utilization of multiple linkers. We propose a prioritized sampling strategy (Fig. 6) to manage this situation and select the most relevant CUIs effectively. This strategy samples CUIs based on the predicted type of the entity (e.g., disease, gene, or chemical/drug) by prioritizing mapped CUIs from the knowledge schemes that focus on that entity type. For example, if an entity is predicted to be a chemical/drug, the sampling strategy first checks whether any linked CUIs exist in the RxNorm linker, a knowledge schema tailored to chemicals. If linked CUIs are found, they are sampled for inclusion in the final set of linked concepts associated with the entity; otherwise, the search continues in a prioritized way (Fig. 6). We stress that the sampling strategy can be easily modified by the user based on the requirements of the research or the application.
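The ordered matching search can be sketched as a lookup over a per-type priority list. Only the fact that RxNorm is checked first for chemicals comes from the text; the remaining order, the fallback list, the placeholder CUI strings, and all names in the code are hypothetical.

```python
from typing import Dict, List, Set

# Priority of knowledge schemes per predicted entity type.
# Only "RxNorm first for chemicals" is stated in the text; the rest is hypothetical.
PRIORITY: Dict[str, List[str]] = {
    "chemical": ["RxNorm", "DrugBank", "UMLS"],
}
FALLBACK: List[str] = ["UMLS"]  # hypothetical default for other entity types


def sample_cuis(entity_type: str, linked: Dict[str, Set[str]]) -> List[str]:
    """Return the CUIs of the first knowledge schema, in priority order, that has any."""
    for schema in PRIORITY.get(entity_type, FALLBACK):
        cuis = linked.get(schema)
        if cuis:
            return sorted(cuis)  # stop at the first schema with a match
    return []


# Placeholder identifiers, not real CUIs.
with_rxnorm = {"UMLS": {"CUI_1", "CUI_2"}, "RxNorm": {"CUI_2"}}
without_rxnorm = {"UMLS": {"CUI_1"}, "RxNorm": set()}

picked_first = sample_cuis("chemical", with_rxnorm)
picked_later = sample_cuis("chemical", without_rxnorm)
picked_other = sample_cuis("organ", with_rxnorm)
```

Because the order lives in a plain dictionary, adapting the strategy to a different application amounts to editing `PRIORITY`.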

The co-occurrence graph generation step is described in subsection 2.1.

Figure 6: Prioritized Concept Unique Identifier (CUI) sampling strategy. An ordered matching search for linked CUIs is designed based on the predicted type of the entity. For example, if an entity is predicted to be a chemical, the sampling strategy first checks whether any linked CUIs exist in the RxNorm linker, a knowledge schema tailored to chemicals. If linked CUIs are found, they are sampled for inclusion in the final set of CUIs associated with the entity, and the process finishes; otherwise, the search continues in a prioritized way. The order of the checks is based on the potential relevance and coverage of the knowledge schema given the predicted type of the entity.
Semantic Types
Amino Acid, Peptide, or Protein; Acquired Abnormality; Amino Acid Sequence; Amphibian; Anatomical Abnormality
Animal; Anatomical Structure; Antibiotic; Archaeon; Biologically Active Substance
Bacterium; Body Substance; Body System; Behavior; Biologic Function
Body Location or Region; Biomedical or Dental Material; Body Part, Organ, or Organ Component; Body Space or Junction; Cell Component
Cell Function; Cell; Congenital Abnormality; Chemical; Chemical Viewed Functionally
Chemical Viewed Structurally; Clinical Attribute; Clinical Drug; Cell or Molecular Dysfunction; Carbohydrate Sequence
Diagnostic Procedure; Daily or Recreational Activity; Disease or Syndrome; Environmental Effect of Humans; Element, Ion, or Isotope
Experimental Model of Disease; Embryonic Structure; Enzyme; Eukaryote; Fully Formed Anatomical Structure
Fungus; Food; Genetic Function; Gene or Genome; Human-caused Phenomenon or Process
Health Care Activity; Hazardous or Poisonous Substance; Hormone; Immunologic Factor; Individual Behavior
Inorganic Chemical; Injury or Poisoning; Indicator, Reagent, or Diagnostic Aid; Laboratory Procedure; Laboratory or Test Result
Mammal; Molecular Biology Research Technique; Mental Process; Mental or Behavioral Dysfunction; Molecular Sequence
Neoplastic Process; Nucleic Acid, Nucleoside, or Nucleotide; Nucleotide Sequence; Organic Chemical; Organism Attribute
Organism Function; Organism; Organ or Tissue Function; Pathologic Function; Pharmacologic Substance
Plant; Organism; Population Group; Receptor; Reptile
Substance; Social Behavior; Sign or Symptom; Tissue; Therapeutic or Preventive Procedure
Virus; Vitamin; Vertebrate
Table 4: List of the 82 semantic types of the MetaMapLite-based pipeline.

Sentence Sampling Algorithm. Given a set of sentences with defined CUIs $sent\_c$ and the co-occurrence frequency graph $co\_g$, sample $n$ sentences (Alg. 1). Initialize a dictionary $f\_d$ and, for each sentence, save the extracted CUI pairs $c\_p$ ($extract\_conc(sent)$), the frequencies $f\_p$ of each pair extracted from $co\_g$ ($extract\_freq(c\_p, co\_g)$), and the sum of the frequencies $t\_f$. Retrieve the sentence ids, the summed frequencies, and the inverted summed frequencies from the dictionary and append them to the lists $ids$, $f\_l$, and $inv\_f\_l$, respectively. Calculate the total sums of the frequencies, $t\_f\_sum$ and $inv\_t\_f\_sum$, and then use them to define the probability distributions $\mathcal{P}$ and $\mathcal{IP}$. Sample 50% of the sentences from $\mathcal{P}$ ($sample(\mathcal{P}, n/2)$) and 50% from $\mathcal{IP}$ ($sample(\mathcal{IP}, n/2)$) to ensure a balance of common and potentially novel pairs of co-occurring entities in the dataset.

Algorithm 1 Sentence Sampling.
Input: $sent\_c$, $co\_g$, $n$
Initialize dictionary $f\_d$
for $s\_id$, $sent$ in $sent\_c$ do
    $c\_p \leftarrow extract\_conc(sent)$
    $f\_p \leftarrow extract\_freq(c\_p, co\_g)$
    $t\_f \leftarrow sum(f\_p)$
    $f\_d \leftarrow save(s\_id, c\_p, f\_p, t\_f)$
end for
Initialize lists $ids$, $f\_l$, $inv\_f\_l$
for $s\_id$ in $f\_d$ do
    $ids \leftarrow append(s\_id)$
    $t\_f \leftarrow get(f\_d, s\_id)$
    $f\_l \leftarrow append(t\_f)$
    $inv\_f\_l \leftarrow append(1/t\_f)$
end for
$t\_f\_sum \leftarrow sum(f\_l)$
$inv\_t\_f\_sum \leftarrow sum(inv\_f\_l)$
Initialize lists $prob$, $inv\_prob$
for $f$ in $f\_l$ do
    $p = f / t\_f\_sum$
    $prob \leftarrow append(p)$
end for
for $f$ in $inv\_f\_l$ do
    $p = f / inv\_t\_f\_sum$
    $inv\_prob \leftarrow append(p)$
end for
$\mathcal{P} \leftarrow prob\_distr(ids, prob)$
$\mathcal{IP} \leftarrow prob\_distr(ids, inv\_prob)$
$sam\_sent\_1 \leftarrow sample(\mathcal{P}, n/2)$
$sam\_sent\_2 \leftarrow sample(\mathcal{IP}, n/2)$
Return $sam\_sent\_1$, $sam\_sent\_2$
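Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: the input layouts (a dictionary from sentence id to CUI list, and a dictionary from unordered CUI pair to frequency), the helper names, the choice to skip sentences whose total frequency is zero, and sampling without replacement within each half are all assumptions that the algorithm does not specify. The same sentence may still appear in both halves.

```python
import itertools
import random


def build_distributions(sent_c, co_g):
    """Return sentence ids with their P and IP probabilities (Alg. 1, before sampling)."""
    ids, f_l, inv_f_l = [], [], []
    for s_id, cuis in sent_c.items():
        c_p = itertools.combinations(sorted(set(cuis)), 2)  # CUI pairs of the sentence
        t_f = sum(co_g.get(frozenset(pair), 0) for pair in c_p)
        if t_f == 0:
            continue  # assumption: skip such sentences so that 1 / t_f is defined
        ids.append(s_id)
        f_l.append(t_f)
        inv_f_l.append(1 / t_f)
    t_f_sum, inv_t_f_sum = sum(f_l), sum(inv_f_l)
    prob = [f / t_f_sum for f in f_l]
    inv_prob = [f / inv_t_f_sum for f in inv_f_l]
    return ids, prob, inv_prob


def weighted_sample(ids, weights, k, rng):
    """Draw up to k distinct ids, each draw proportional to the remaining weights."""
    ids, weights = list(ids), list(weights)
    chosen = []
    for _ in range(min(k, len(ids))):
        i = rng.choices(range(len(ids)), weights=weights, k=1)[0]
        chosen.append(ids.pop(i))
        weights.pop(i)
    return chosen


def sample_sentences(sent_c, co_g, n, seed=0):
    """Sample n/2 sentences from P (common pairs) and n/2 from IP (rare pairs)."""
    rng = random.Random(seed)
    ids, prob, inv_prob = build_distributions(sent_c, co_g)
    sam_sent_1 = weighted_sample(ids, prob, n // 2, rng)
    sam_sent_2 = weighted_sample(ids, inv_prob, n // 2, rng)
    return sam_sent_1, sam_sent_2


# Toy, made-up inputs for illustration only.
sent_c = {"s1": ["C1", "C2"], "s2": ["C1", "C3"], "s3": ["C1", "C2", "C3"]}
co_g = {
    frozenset(("C1", "C2")): 10,
    frozenset(("C1", "C3")): 1,
    frozenset(("C2", "C3")): 4,
}
print(sample_sentences(sent_c, co_g, n=2))
```

In the toy input, the sentence with the rarest pair ("s2") receives the largest probability under $\mathcal{IP}$, which is how the second half of the sample favors potentially novel co-occurrences.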

Appendix B Annotation Portal

Figure 7 presents the annotation portal with an example from the ReDReS dataset. The annotator’s task is to identify the semantic relation between the two highlighted entities, classifying it as either a Positive Relation, Negative Relation, Complex Relation, or No Relation. If the sentence is considered uninformative or if there are errors in entity detection, type, or span, the annotator can remove the sentence or the entities. Furthermore, the annotator is encouraged to provide feedback, including any additional text that can clarify or elaborate on the relationship between the entities. By providing this supplementary information, annotators can contribute to a richer and more nuanced understanding of the relations within the data.

Figure 7: Annotation portal: The annotator should define the semantic relation (Positive Relation, Negative Relation, Complex Relation, and No Relation) between the two highlighted entities: Rett syndrome and MECP2 gene. The annotator can remove the sentence if it is not informative and the entities if the entity detection or the type/span is incorrect. Additionally, the annotator can provide feedback, adding the text that is useful to define the relation between the entities.

Appendix C Dataset Instances

In this section, we present some instances of different relation types in the datasets. In each example, we highlight the two detected entities.

Positive Relation:

  • Amyloid fibrils are found in many fatal neurodegenerative diseases such as Alzheimer’s disease, Parkinson’s disease, type II diabetes, and prion disease.

  • AChE has become an important drug target because partial inhibition of AChE results in modest increase in ACh levels that can have therapeutic benefits, thus AChE inhibitors have proved useful in the symptomatic treatment of Alzheimer’s disease.

Complex Relation:

  • When the brain’s antioxidant defenses are overwhelmed by IR, it produces an abundance of reactive oxygen species (ROS) that can lead to oxidative stress, mitochondrial dysfunction, loss of synaptic plasticity, altered neuronal structure and microvascular impairment that have been identified as early signs of neurodegeneration in Alzheimer’s disease, Parkinson’s, amyotrophic lateral sclerosis, vascular dementia and other diseases that progressively damage the brain and central nervous system.

  • Autophagy inhibitor 3-methyladenine (3-MA) attenuated the neuroprotective effect of CA, suggesting that autophagy was involved in the neuroprotection of CA.

Negative Relation:

  • It was not observed in synaptopodin-deficient mice, which lack spine apparatus organelles.

  • Furthermore, the use of some kinds of antihypertensive medication has been suggested to reduce the incidence of dementia including Alzheimer’s disease.

No Relation:

  • Peripheral immune cells can cross the intact BBB, CNS neurons and glia actively regulate macrophage and lymphocyte responses, and microglia are immunocompetent but differ from other macrophage/dendritic cells in their ability to direct neuroprotective lymphocyte responses.

  • These techniques have thus provided morphological and functional brain alterations mapping of Alzheimer’s disease: on one hand grey matter atrophy first concerns the medial temporal lobe before extending to the temporal neocortex and then other neocortical areas; on the other hand, metabolic alterations are first located within the posterior cingulate cortex and then reach the temporo-parietal area as well as the prefrontal cortex, especially in its medial part.

Appendix D Localized Context Vector

The localized context vector is computed as follows:

  • Extract the attention scores of the two entities in the last encoding layer of the language model.

  • Calculate the Hadamard product of the attention vectors.

  • Calculate the average of the Hadamard product over the attention heads.

  • Normalize to extract the distribution over the sequence.

  • Extract the localized context vector by multiplying the token representations of the last encoding layer with the distribution vector.
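The five steps above can be sketched with NumPy for a single sentence. This is an illustrative sketch rather than the framework's code: the function name, the argument layout, the small `eps` added for numerical safety, and the averaging of attention rows over the tokens of a multi-token entity are assumptions.

```python
import numpy as np


def localized_context_vector(attn, hidden, span_1, span_2, eps=1e-12):
    """Sketch of the localized context vector for one sentence.

    attn:   (heads, seq_len, seq_len) attention scores of the last layer
    hidden: (seq_len, dim) token representations of the last layer
    span_1, span_2: token indices of the two entities (hypothetical inputs)
    """
    # Step 1: attention of each entity over the sequence.
    # Assumption: multi-token entities are averaged over their tokens.
    a_1 = attn[:, span_1, :].mean(axis=1)  # (heads, seq_len)
    a_2 = attn[:, span_2, :].mean(axis=1)  # (heads, seq_len)
    # Step 2: Hadamard (element-wise) product of the attention vectors.
    prod = a_1 * a_2
    # Step 3: average over the attention heads.
    q = prod.mean(axis=0)  # (seq_len,)
    # Step 4: normalize to a distribution over the sequence.
    q = q / (q.sum() + eps)
    # Step 5: weight the token representations by the distribution.
    return hidden.T @ q  # (dim,)


# Random stand-ins for model outputs, for illustration only.
rng = np.random.default_rng(0)
heads, seq_len, dim = 4, 6, 8
scores = rng.random((heads, seq_len, seq_len))
attn = scores / scores.sum(axis=-1, keepdims=True)  # each row sums to 1
hidden = rng.standard_normal((seq_len, dim))
lcv = localized_context_vector(attn, hidden, [1, 2], [4])
print(lcv.shape)  # (8,)
```

Because the Hadamard product is large only for tokens that both entities attend to, the resulting vector emphasizes the context shared by the pair.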

Appendix E Baseline Performance: Lower Bound

To establish a baseline performance (lower bound) for comparison, contrasting with the human evaluation that serves as an upper bound, we randomly assign class labels to each instance of the test set based on the prior class distribution in the training set (Tab. 1). This simulates a classifier with no ability to learn relations between entities. We repeat this experiment 1 million times for robustness and report the average F1-score.

In the binary setup, the baseline achieves average F1-scores of 54% (ReDReS) and 53.16% (ReDAD). For the multi-class setup, the average macro F1-scores are 32.05% (micro: 32.21%) for ReDReS and 32.43% (micro: 32.33%) for ReDAD. Despite the simplicity of the baseline, the low performance highlights the challenge of the task, especially in the multi-class scenario where the model needs to distinguish between nuanced semantic relations.
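The prior-based random baseline can be simulated along the following lines. This is a sketch under assumptions: the function names are hypothetical, the label counts are made-up placeholders rather than the statistics of Tab. 1, and the number of repetitions is kept far below 1 million so the example runs quickly.

```python
import random
from collections import Counter


def f1_scores(gold, pred, labels):
    """Return (macro F1, micro F1) for single-label predictions."""
    per_class, tp_all, fp_all, fn_all = [], 0, 0, 0
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    denom = 2 * tp_all + fp_all + fn_all
    micro = 2 * tp_all / denom if denom else 0.0
    return sum(per_class) / len(per_class), micro


def random_baseline(train_labels, test_labels, repeats=1000, seed=0):
    """Average F1 of labels drawn from the class prior of the training set."""
    rng = random.Random(seed)
    prior = Counter(train_labels)
    labels = sorted(prior)
    weights = [prior[c] for c in labels]
    macro_sum = micro_sum = 0.0
    for _ in range(repeats):
        pred = rng.choices(labels, weights=weights, k=len(test_labels))
        macro, micro = f1_scores(test_labels, pred, labels)
        macro_sum += macro
        micro_sum += micro
    return macro_sum / repeats, micro_sum / repeats


# Made-up label counts, for illustration only.
train = ["Positive"] * 40 + ["Complex"] * 30 + ["Negative"] * 5 + ["No Relation"] * 25
test = ["Positive"] * 8 + ["Complex"] * 6 + ["Negative"] * 1 + ["No Relation"] * 5
macro_f1, micro_f1 = random_baseline(train, test, repeats=200)
print(f"macro F1: {macro_f1:.4f}, micro F1: {micro_f1:.4f}")
```

Averaging over many repetitions removes the variance of a single random assignment, so the reported score reflects only the class prior.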

Appendix F Distantly Supervised Datasets

ReDReS and ReDAD include gold annotations for a small fraction of the extracted sentences. The pre-processed text consists of 28,622 and 1,301,429 additional sentences related to RS and AD respectively, without annotations about the semantic relation between the detected entities. Observing that the supervised models achieve performance levels comparable to human experts, we leverage the best-performing models to generate silver labels for the unannotated instances. For the binary setup, we employ:

  • LaMReDA (PubMedBERT large) with the relation representation $R_A$ (Eq. 9) for the RS corpus.

  • LaMReDA (PubMedBERT large) with the relation representation $R_J$ (Eq. 18) for the AD corpus.

For the multi-class setup, we use:

  • LaMReDA (PubMedBERT large) with the relation representation $R_F$ (Eq. 14) for the RS corpus.

  • LaMReDA (PubMedBERT large) with the relation representation $R_F$ (Eq. 14) for the AD corpus.

We stress that the selection of the models relies on the performance in the 5-fold cross-validation setup to avoid choosing based on the model performance on the original test set. Each model is trained 10 times with different seeds using the original splits (Tab. 1) of ReDReS and ReDAD, respectively. The best model weights are saved based on the performance on the development set. Every trained model provides predictions for the unannotated instances, and the final silver labels are obtained through majority voting. The Distantly Supervised Relation Detection dataset for Rett Syndrome (DiSReDReS) contains 304,008 instances with 8,611 unique CUIs and 80 semantic types (Tab. 5). The Distantly Supervised Relation Detection dataset for Alzheimer’s Disease (DiSReDAD) comprises 13,608,175 instances with 53,750 unique CUIs and 82 semantic types (Tab. 5). As noisy labeling is inevitable in distantly supervised data and poses challenges for knowledge extraction scenarios, the two extensive datasets can promote weakly supervised learning.
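The majority-voting step over the predictions of the independently seeded models can be sketched as follows. The function name and the tie-breaking rule (the tied label that appears first in model order) are assumptions, since the text does not state how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists (all of equal length) into silver labels."""
    silver = []
    for votes in zip(*predictions):  # votes of all models for one instance
        counts = Counter(votes)
        top = max(counts.values())
        # Assumption: a tie goes to the tied label that appears first in model order.
        silver.append(next(v for v in votes if counts[v] == top))
    return silver


# Three hypothetical models labeling two instances.
preds = [
    ["Positive", "No Relation"],
    ["Positive", "Negative"],
    ["Negative", "Negative"],
]
print(majority_vote(preds))  # ['Positive', 'Negative']
```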

| Data | Sentences | Instances | CUIs (1) | S.T. (2) | Benchmark (3): Binary | Benchmark (3): Multi-Class Micro | Benchmark (3): Multi-Class Macro |
|---|---|---|---|---|---|---|---|
| DiSReDReS | 28,622 | 304,008 | 8,611 | 80 | 91.53 | 75.1 | 75.19 |
| DiSReDAD | 1,301,429 | 13,608,175 | 53,750 | 82 | 88.99 | 80.56 | 80.69 |

  • (1) The total number of unique CUIs.

  • (2) Semantic Types.

  • (3) The benchmark performance (F1-score %) in the weakly supervised setup.

Table 5: DiSReDReS & DiSReDAD: Statistics and performance (%)

| Dataset | Positive | Complex | Negative | No Relation |
|---|---|---|---|---|
| DiSReDReS | 97,099 (31.9%) | 105,861 (34.8%) | 3,242 (1.1%) | 97,806 (32.2%) |
| DiSReDAD | 4,468,110 (32.8%) | 5,755,884 (42.3%) | 120,267 (0.9%) | 3,263,914 (24%) |

Table 6: DiSReDReS & DiSReDAD: Label distribution by type of relation

Weakly Supervised Setup. The task formulation remains the same as described in section 3 of the paper. The train sets of ReDReS and ReDAD are replaced by DiSReDReS and DiSReDAD, respectively. The development and test sets remain the same (Tab. 1). To provide a benchmark, we train LaMReDA (PubMedBERT base) with the relation representation $R_A$ (Eq. 9) for 10 epochs using the Adam optimizer with a learning rate of $10^{-5}$. The batch size is set to 32. The experiments are repeated 10 times with different seeds and the best scores are retained based on the performance on the development set.

In the supervised setup (Tab. 1), LaMReDA (PubMedBERT base) with the $R_A$ representation achieves 90.72% and 88.31% F1-score in the binary setup on ReDReS and ReDAD, respectively (Tab. 3). In the multi-class setup, the macro F1-scores are 74.52% (micro: 74.49%) for ReDReS and 77.34% (micro: 77.64%) for ReDAD (Tab. 3). Table 5 presents the benchmark performance in the binary and multi-class setup for both datasets. Notably, the performance improves in the weakly supervised setup on both datasets, indicating the robustness of LaMReDA when trained with noisy data and highlighting the quality of the silver labels of DiSReDReS and DiSReDAD.

Appendix G Probing: Additional Experiments

We use the same experimental setup as described in subsection 3.3 and the experiments are conducted in the 5-fold cross-validation setting. To provide an inclusive probing analysis on ReDReS, we incorporate additional probing results in this section. Figures 8 and 9 present the experiments in the multi-class setup using PubMedBERT base. Additionally, aiming to explore the probing capabilities of PubMedBERT large, we include the results of further experiments in Figures 10, 11, and 12. These experiments investigate the model’s performance in detecting semantic relations, comparing the representations and attention mechanisms at different layers and heads to understand how well the larger LM can discern complex relationships in the biomedical text.

Figure 8: ReDReS Probing (Multi-class setup, Macro evaluation) (PubMedBERT base): Examines LaMReDA/LaMReDM relation representations ($R_D$, $R_O$, $R_P$) and attention scores from each layer and explores average attention scores of tokens corresponding to each entity towards the other entity across attention heads. Top boundary: best LaMReDA and LaMReDM performance (Tab. 3). Second boundary: classifier with average attention scores across all layers as input.

Figure 9: ReDReS Probing (Multi-class setup, Micro evaluation) (PubMedBERT base): Examines LaMReDA/LaMReDM relation representations ($R_D$, $R_O$, $R_P$) and attention scores from each layer and explores average attention scores of tokens corresponding to each entity towards the other entity across attention heads. Top boundary: best LaMReDA and LaMReDM performance (Tab. 3). Second boundary: classifier with average attention scores across all layers as input.

Figure 10: ReDReS Probing (Binary setup) (PubMedBERT large): Examines LaMReDA/LaMReDM relation representations ($R_D$, $R_O$, $R_P$) and attention scores from each layer and explores average attention scores of tokens corresponding to each entity towards the other entity across attention heads. Top boundary: best LaMReDA and LaMReDM performance (Tab. 3). Second boundary: classifier with average attention scores across all layers as input.

Figure 11: ReDReS Probing (Multi-class setup, Macro evaluation) (PubMedBERT large): Examines LaMReDA/LaMReDM relation representations ($R_D$, $R_O$, $R_P$) and attention scores from each layer and explores average attention scores of tokens corresponding to each entity towards the other entity across attention heads. Top boundary: best LaMReDA and LaMReDM performance (Tab. 3). Second boundary: classifier with average attention scores across all layers as input.

Figure 12: ReDReS Probing (Multi-class setup, Micro evaluation) (PubMedBERT large): Examines LaMReDA/LaMReDM relation representations ($R_D$, $R_O$, $R_P$) and attention scores from each layer and explores average attention scores of tokens corresponding to each entity towards the other entity across attention heads. Top boundary: best LaMReDA and LaMReDM performance (Tab. 3). Second boundary: classifier with average attention scores across all layers as input.

Appendix H Detailed Annotation Guidelines

Task. Determine if there is any semantic relation between the two colored entities in the sentence.

General Instructions:

  • Use Sentence Information Only: Base your annotation solely on the information provided within the sentence. Do not use external knowledge or prior information.

  • Entity Check: Examine the entities and their types. If an entity is incorrect, if the entity span is inaccurate (includes irrelevant words), or if the entity type is incorrect (e.g., "Rett syndrome" categorized as part of the human body), click the "Remove First Entity" or "Remove Second Entity" button, corresponding to the error.

  • Removing a Sentence: If a sentence lacks informative content, you have the option to remove it. Use this option if you are confident the sentence is uninformative.

Relation Categories:

  • No Relation: Use this label if there’s no semantic relation between the entities in the sentence.

  • Positive Relation: The two entities are directly, semantically connected.

  • Negative Relation: The two entities are negatively correlated. This is a rare case, and negative words or phrases (e.g., "no," "absence") often indicate this.

  • Complex Relation: Entities are related but not straightforwardly positive or negative. Complex reasoning might be needed to determine the semantic relation.

Annotation Process:

  1. When presented with a pair, choose the relevant relation category label.

  2. If you change your choice, you can adjust it by clicking a new button corresponding to the revised label.

  3. Important: Once you press "Done", the instance can’t be retrieved, so ensure your decision is accurate.

  4. Provide Relation Context: First, you need to finalize your choice for the relation labeling and then provide (if any) the related piece of text. If classifying a pair as related, specify the word or phrase in the sentence that influenced your decision. Use the text box provided and preferably copy-paste to avoid spelling errors. Press "Enter" after inputting the text to store it.