Enhancing Biomedical Knowledge Discovery for Diseases: An End-To-End Open-Source Framework

Christos Theodoropoulos
KU Leuven
[email protected]
Andrei Catalin Coman
EPFL, Idiap Research Institute
[email protected]
James Henderson
Idiap Research Institute
[email protected]
Marie-Francine Moens
KU Leuven
[email protected]

Abstract

The ever-growing volume of biomedical publications creates a critical need for efficient knowledge discovery. In this context, we introduce an open-source end-to-end framework designed to construct knowledge around specific diseases directly from raw text. To facilitate research in disease-related knowledge discovery, we create two annotated datasets focused on Rett syndrome and Alzheimer’s disease, enabling the identification of semantic relations between biomedical entities. Extensive benchmarking explores various ways to represent relations and entities, offering insights into optimal modeling strategies for semantic relation detection and highlighting language models’ competence in knowledge discovery. We also conduct probing experiments using different layer representations and attention scores to explore transformers’ ability to capture semantic relations.¹

¹ Data and code: publicly available upon acceptance.

1 Introduction

Knowledge discovery (Wang et al., 2023; Shu and Ye, 2023) is a pivotal research domain due to the surge in publications, which makes keeping up with new findings challenging, necessitating automated knowledge extraction and processing. Of particular concern is the biomedical literature, where updates occur with ever-accelerating frequency (Fig. 1). Despite advances in healthcare, many diseases, such as Alzheimer’s disease (AD) (Trejo-Lopez et al., 2023; Scheltens et al., 2021) and multiple sclerosis (McGinley et al., 2021; Attfield et al., 2022), lack effective cures. Additionally, over 1,200 rare disorders have limited or no cures according to the National Organization for Rare Disorders.² Discovering new scientific insights from research papers can expedite disease understanding and accelerate cure development.

² https://rarediseases.org/rare-diseases/

Figure 1: Publication Trends: RS and AD

This paper presents an end-to-end framework for detecting medical entities in unstructured text and annotating semantic relations, enabling automated knowledge discovery for diseases. We employ a multi-stage methodology for data acquisition, annotation, and model evaluation. The process starts with gathering relevant abstracts from PubMed to form the corpus. Entities are identified and extracted, followed by the generation of a co-occurrence graph that models the intra-sentence co-occurrence of the entities across the corpus. Leveraging the processed text and co-occurrence graph, an algorithm samples sentences to create gold-standard datasets. Medical experts label the semantic relations between entities within these sentences via an annotation portal. The framework’s versatility allows application across various diseases and enables expansion to encompass knowledge about symptoms, genes, and more. This study focuses on two diseases of particular research interest: Rett syndrome (RS) (Petriti et al., 2023) and AD. These diseases are selected due to their significant impact and the absence of a cure, highlighting the urgency for advancements in understanding and treatment. We introduce two curated datasets tailored for detecting semantic relations between entities in biomedical text related to RS and AD. The datasets are used for benchmarking, testing techniques for representing relations and entities, and assessing language models’ capabilities in knowledge discovery. This work also probes the layer outputs of transformer models (Vaswani et al., 2017) and their attention patterns to reveal their ability to implicitly capture semantic relations in biomedical text.

RS (Sandweiss et al., 2020) poses challenges due to its sporadic nature and rare expression across diverse racial groups. The disorder’s elusive nature undermines its comprehension and stresses the pressing need for a cure. Rare diseases collectively affect a substantial portion of the population, with over 30 million affected people in Europe alone (Pakter, 2024). AD is prevalent among older populations, with millions of patients worldwide, as it is the most common type of dementia (60-70% of cases) (Alzheimer’s-Association, 2024). With life expectancy on the rise, the projected increase in Alzheimer’s cases accentuates the urgency of finding a cure.

In summary, the paper’s key contributions are:

• Development of an open-source end-to-end framework to build disease knowledge directly from raw text.

• Two annotated datasets for RS and AD provide gold labels for semantic relations, aiding disease knowledge discovery research.³

• Benchmarking on the datasets examines methods for relation and entity representation, offering insights into optimal approaches for semantic relation detection and emphasizing language models’ knowledge discovery capabilities.

• Probing experiments with different layer representations and attention scores assess transformers’ inherent ability to capture semantic relations.

³ The description of distantly supervised datasets for weakly supervised scenarios is included in Appendix F.

2 Data Pipeline

We focus on developing a robust data pipeline (Fig. 2) to annotate sentences with entities associated with the Unified Medical Language System (UMLS) (Bodenreider, 2004; Elkin and Brown, 2023). The first step involves the retrieval of the textual abstracts, followed by the mention extraction that includes entity detection and linking to UMLS. We construct a co-occurrence graph to highlight interconnections between entities in the text. The processed text and co-occurrence graph are then used to develop two curated datasets with precise entity annotations and semantic relations between detected entity pairs.⁴

⁴ Additional information regarding the data pipeline is incorporated in Appendix A.

Abstract retrieval. We retrieve PubMed⁵ article IDs based on a query (e.g., Rett syndrome) and extract their open-access abstracts. To accomplish this, we leverage the official Entrez Programming Utilities (Kans, 2024) and the Biopython API (Cock et al., 2009) (BSD 3-Clause License), ensuring access to the vast repository of biomedical literature. After obtaining the PubMed IDs (PMIDs), we retrieve the abstracts of the specified articles and tokenize the text into sentences using NLTK (Bird et al., 2009) (Apache License 2.0).

⁵ https://pubmed.ncbi.nlm.nih.gov/

Mention extraction. MetaMapLite (Aronson, 2001) (open-source BSD License) is provided by the National Library of Medicine (NLM) for extracting biomedical entities and mapping them to Concept Unique Identifiers (CUIs) within UMLS. The tool is updated every two years to incorporate the latest medical terminology and to ensure its accuracy in extraction and mapping. MetaMapLite extracts mentions and links them to UMLS in one step, efficiently associating mentions with their corresponding CUIs. We detect a diverse range of entities, spanning 82 unique semantic types and covering a broad spectrum of biomedical concepts, including diseases, biologically active substances, anatomical structures, genes, and more. Detailed entity detection often leads to overlapping or successive entities in the text. To address this, our pipeline incorporates a merging strategy that consolidates overlapping or successive entities into cohesive units. For example, in the sentence: "To test norepinephrine augmentation as a potential disease-modifying therapy, we performed a biomarker-driven phase II trial of atomoxetine, a clinically-approved norepinephrine transporter inhibitor, in subjects with mild cognitive impairment due to AD.", the successive relevant mentions norepinephrine transporter and inhibitor are merged into one entity.
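The merging step can be illustrated with a minimal sketch. It assumes mentions are given as character spans and treats two mentions as successive when at most `max_gap` characters separate them. The `Mention` class, the `merge_mentions` helper, the `max_gap` parameter, and the placeholder CUI strings are hypothetical and not part of MetaMapLite or the released pipeline. How the CUI of a merged entity is chosen is not specified above, so the sketch simply keeps all contributing CUIs.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    cui: str    # placeholder for a UMLS Concept Unique Identifier


def merge_mentions(mentions, max_gap=1):
    """Merge mentions that overlap or are separated by at most `max_gap` characters.

    Returns a list of (start, end, [cuis]) tuples sorted by position.
    """
    merged = []
    for m in sorted(mentions, key=lambda m: (m.start, m.end)):
        if merged and m.start - merged[-1][1] <= max_gap:
            # Overlapping or adjacent: extend the previous span.
            start, end, cuis = merged[-1]
            merged[-1] = (start, max(end, m.end), cuis + [m.cui])
        else:
            merged.append((m.start, m.end, [m.cui]))
    return merged


text = "norepinephrine transporter inhibitor"
spans = merge_mentions([Mention(0, 26, "CUI_A"), Mention(27, 36, "CUI_B")])
print([text[s:e] for s, e, _ in spans])
```

With the default gap of one character, the two mentions in the example collapse into a single span covering the whole phrase, which mirrors the atomoxetine example in the text.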

Co-occurrence graph generation. We model the intra-sentence co-occurrence between the entities. Each node in the graph corresponds to a unique CUI and contains metadata including the semantic type and the list of sentence IDs where the corresponding entity is detected. An edge between two nodes signifies that the corresponding entities co-occur within the same sentence. The edge weight represents the number of times two entities co-occur in a sentence throughout the text corpus.
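A minimal sketch of the graph construction described above, using plain dictionaries. The input format (a mapping from sentence ID to a list of `(CUI, semantic type)` pairs) and the function name `build_cooccurrence_graph` are assumptions for illustration. The sketch also assumes that a pair is counted once per sentence even if an entity is mentioned several times in it.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Build nodes and weighted edges from per-sentence entity annotations.

    sentences: dict mapping sentence_id -> list of (cui, semantic_type).
    Returns (nodes, edges): nodes maps cui -> metadata, edges maps an
    ordered (cui_a, cui_b) pair -> number of sentences where both occur.
    """
    nodes = {}
    edges = Counter()
    for sent_id, entities in sentences.items():
        unique = dict(entities)  # de-duplicate repeated CUIs within a sentence
        for cui, sem_type in unique.items():
            node = nodes.setdefault(
                cui, {"semantic_type": sem_type, "sentence_ids": []}
            )
            node["sentence_ids"].append(sent_id)
        for a, b in combinations(sorted(unique), 2):
            edges[(a, b)] += 1
    return nodes, edges


nodes, edges = build_cooccurrence_graph({
    0: [("C1", "disease"), ("C2", "gene")],
    1: [("C2", "gene"), ("C1", "disease"), ("C3", "drug")],
})
print(edges[("C1", "C2")], nodes["C1"]["sentence_ids"])
```

Sorting the CUIs before pairing keeps each undirected edge under a single key, so the edge weight accumulates regardless of mention order.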

2.1 Dataset Creation

Leveraging the extracted co-occurrence graph, we define two distinct probability distributions to select sentences for manual annotation. The first distribution $\mathcal{P}$ focuses on common pairs of co-occurred entities, with higher frequency in the co-occurrence graph resulting in a higher likelihood of sampling. The second distribution $\mathcal{IP}$ prioritizes novel/rare pairs of co-occurred entities, selecting sentences where the entities have a lower frequency in the co-occurrence graph. We sample 50% of sentences using $\mathcal{P}$ and 50% using $\mathcal{IP}$ to ensure a balance of common and potentially novel pairs of co-occurring entities in the datasets.⁶

⁶ Details of the sentence sampling algorithm are given in Appendix A.
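One plausible reading of the two distributions is sketched below, under strong assumptions: $\mathcal{P}$ is taken to be proportional to the edge weight and $\mathcal{IP}$ proportional to its inverse. The exact definitions are in Appendix A and may differ. The sketch samples entity pairs (with replacement) rather than sentences, and the names `pair_distributions` and `sample_pairs` are hypothetical.

```python
import random


def pair_distributions(edge_weights):
    """Return (p, ip): frequency-proportional and inverse-frequency distributions."""
    total = sum(edge_weights.values())
    p = {pair: w / total for pair, w in edge_weights.items()}
    inverse = {pair: 1.0 / w for pair, w in edge_weights.items()}
    inverse_total = sum(inverse.values())
    ip = {pair: v / inverse_total for pair, v in inverse.items()}
    return p, ip


def sample_pairs(edge_weights, n, seed=0):
    """Draw half of the pairs from p (common) and the rest from ip (rare)."""
    rng = random.Random(seed)
    p, ip = pair_distributions(edge_weights)
    pairs = list(edge_weights)
    half = n // 2
    common = rng.choices(pairs, weights=[p[x] for x in pairs], k=half)
    rare = rng.choices(pairs, weights=[ip[x] for x in pairs], k=n - half)
    return common, rare


weights = {("C1", "C2"): 8, ("C1", "C3"): 2}
p, ip = pair_distributions(weights)
print(p, ip)
```

In this toy graph the frequent pair has probability 0.8 under the first distribution and 0.2 under the second, which is the intended inversion between common and rare pairs.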

Then, we develop an annotation portal using the Streamlit⁷ library, providing a user-friendly interface for annotators. Annotators are presented with a sentence containing two highlighted entities and are prompted to categorize the semantic relation between them. Options include positive (direct semantic connection), negative (negative semantic connection where negative words like "no" and "absence" are present), complex (semantic connection with complex reasoning), and no relation. The annotation portal offers additional functions such as sentence removal (for non-informative sentences), entity removal (for incorrect entity types or spans), and context addition (for providing additional text to aid in relation type determination). We enlist the expertise of three medical experts to ensure the accuracy and reliability of the annotation process.

⁷ https://streamlit.io/

Dataset	Sentences	Instances	Unique CUIs	Semantic Types
ReDReS	601	5,259	1,148	73
Train set	409	3,573	887	73
Dev. set	72	749	249	56
Test set	120	937	349	57
ReDAD	641	8,565	1,480	82
Train set	437	5,502	1,114	78
Dev. set	76	1,188	321	60
Test set	128	1,875	452	58
Dataset	Labels - Type of Relation
Dataset	Positive	Complex	Negative	No Relation
ReDReS	1,732 (32.9%)	1,491 (28.4%)	97 (1.8%)	1,945 (36.9%)
Train set	1,176 (32.9%)	996 (27.9%)	69 (1.9%)	1,332 (37.3%)
Dev. set	241 (32.2%)	213 (28.4%)	7 (0.9%)	288 (38.5%)
Test set	313 (33.3%)	282 (30.1%)	21 (2.2%)	321 (34.4%)
ReDAD	2,496 (29.1%)	2,874 (33.6%)	125 (1.5%)	3,070 (35.8%)
Train set	1,718 (31.2%)	1,923 (34.9%)	68 (1.2%)	1,793 (32.7%)
Dev. set	286 (24.1%)	373 (32.4%)	18 (1.5%)	511 (42%)
Test set	492 (26.2%)	578 (30.8%)	39 (2.1%)	766 (40.9%)

Table 1: Datasets: Statistics of the RS and AD datasets and their label distribution.

The result of the expert annotation yields two curated datasets. The Relation Detection dataset for Rett Syndrome (ReDReS) contains 601 sentences with 5,259 instances and 1,148 unique CUIs (Tab. 1). The inter-annotator agreement is measured using the Fleiss kappa score (McHugh, 2012), resulting in 0.6143 in the multi-class setup (4 classes) and indicating substantial agreement among annotators (Landis and Koch, 1977). In the binary setup (relation or no relation), the Fleiss kappa score is 0.7139. The Relation Detection dataset for Alzheimer’s Disease (ReDAD) comprises 641 sentences with 8,565 instances and 1,480 unique CUIs (Tab. 1). The Fleiss kappa score is 0.6403 in the multi-class setup and 0.7064 in the binary setup, showing substantial consensus among annotators. The final labels are determined through majority voting, leveraging the labels provided by each expert. While the label distribution across classes is relatively balanced, the negative class is under-represented with 97 and 125 instances in ReDReS and ReDAD respectively (Tab. 1). Each dataset is randomly split into train, development, and test sets.
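For reference, Fleiss' kappa and majority voting can be computed as in the sketch below, which uses the standard textbook definition of the statistic. The function names are hypothetical, and the toy labels are not taken from the datasets. Tie-breaking for majority voting is not described above; the sketch returns the first label among the most frequent ones, which is an assumption.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists, one list of labels per item.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Expected agreement by chance from the overall category proportions.
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_e = sum(p ** 2 for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)


def majority_vote(labels):
    """Most frequent label; ties resolve to the first label encountered."""
    return Counter(labels).most_common(1)[0][0]


toy = [["positive", "positive", "no relation"],
       ["no relation", "no relation", "positive"]]
print(fleiss_kappa(toy, ["positive", "no relation"]), majority_vote(toy[0]))
```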

3 Models

In this section, we introduce two main models, the Language-Model Embedding Learning (LaMEL) model and the Language-Model Relation Detection (LaMReD) model (Fig. 3), to benchmark datasets and establish robust baselines.

Task formulation. Given a sentence containing two identified entities $e_{1}$ and $e_{2}$, we predict the semantic relation $sem_{r}$ between them. In the multi-class setup, the labels are: positive, negative, complex, and no relation. In the binary setup, the goal is to determine if any relation exists. Special tokens [ent] and [/ent] mark the start and end of each entity within the sentence, ensuring consistent identification and processing of entity boundaries.
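The marker insertion can be sketched as follows, assuming entities are given as non-overlapping character spans. The helper `mark_entities` and the spacing around the markers are illustrative choices, not the released preprocessing code.

```python
def mark_entities(sentence, span1, span2):
    """Wrap two non-overlapping character spans with [ent] ... [/ent] markers.

    Each span is a (start, end) pair with an exclusive end offset.
    """
    (s1, e1), (s2, e2) = sorted([span1, span2])
    if e1 > s2:
        raise ValueError("entity spans must not overlap")
    return (
        sentence[:s1]
        + "[ent] " + sentence[s1:e1] + " [/ent]"
        + sentence[e1:s2]
        + "[ent] " + sentence[s2:e2] + " [/ent]"
        + sentence[e2:]
    )


print(mark_entities("Atomoxetine treats AD.", (0, 11), (19, 21)))
```

With a Hugging Face tokenizer, such markers are typically registered as additional special tokens via `add_special_tokens`, followed by `resize_token_embeddings` on the model, so that each marker maps to a single token; whether the authors follow exactly this procedure is not stated.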

3.1 LaMEL model

LaMEL learns an embedding space optimized for relation detection (Fig. 3). As the backbone language model (LM), we opt for PubMedBERT (Gu et al., 2021; Tinn et al., 2023) (MIT License), available in both uncased base and uncased large versions.⁸ PubMedBERT is pretrained on the PubMed corpus, making it well-suited for our task as the curated datasets consist of sentences of abstracts from PubMed papers. Leveraging PubMedBERT ensures that the model can capture the language patterns prevalent in biomedical text. Following the LM encoding, we construct the representation of each entity $e_{i}$ by extracting its contextualized embedding $E_{i}$ from the encoded sequence. Subsequently, the entity representations are projected to the embedding space using a linear layer without changing the embedding dimension. The final prediction is based on cosine similarity between the two projected entity representations. If the cosine similarity exceeds a predefined threshold, the model predicts that there is a semantic relation between the two entities. We experiment with diverse strategies for learning entity representations (Fig. 3), aiming to optimize the effectiveness of the embedding space for the relation detection task. The explored types of entity representation $E$ are:

⁸ HuggingFace’s Transformers library (Wolf et al., 2019)

• A, B, C - Special Tokens:

$E_{A}=t_{[ent]},$ (1)

$E_{B}=t_{[/ent]},$ (2)

$E_{C}=t_{[ent]};t_{[/ent]},$ (3)

• D - Entity Pool:

$E_{D}=[t_{E}],$ (4)

• E - Entity & Middle Pool:

$E_{E}=[t_{E}]*[t_{Inter}],$ (5)

• F, G, H - Special Tokens & Middle Pool:

$E_{F}=t_{[ent]}*[t_{Inter}],$ (6)

$E_{G}=t_{[/ent]}*[t_{Inter}],$ (7)

$E_{H}=t_{[ent]}*t_{[/ent]}*[t_{Inter}],$ (8)

where $\{E_{A},E_{B},E_{D},E_{E},E_{F},E_{G},E_{H}\}\in\mathbb{R}^{d}$ and $E_{C}\in\mathbb{R}^{2d}$ , $d$ is the embedding size of PubMedBERT base (768) and PubMedBERT large (1024), ; defines the concatenation, $*$ holds for the element-wise multiplication, $t_{[ent]}$ , $t_{[/ent]}$ are the embeddings of the start and end special tokens of the entities, $[t_{E}]$ and $[t_{Inter}]$ are the averaged pooled representation of the entities and the intermediate tokens between the entities respectively.
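Equations 1 to 8 can be made concrete with a small sketch over plain Python lists standing in for token embeddings. The function names, the index conventions, and the toy two-dimensional vectors are assumptions for illustration. The sketch also assumes at least one token between the two entities, since pooling over an empty span is undefined.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def hadamard(*vectors):
    """Element-wise product of one or more equal-length vectors."""
    out = [1.0] * len(vectors[0])
    for v in vectors:
        out = [a * b for a, b in zip(out, v)]
    return out


def entity_representations(tokens, ent_start, ent_end, inter):
    """Compute E_A ... E_H for one entity.

    tokens: list of token embeddings for the whole sentence.
    ent_start, ent_end: positions of the entity's [ent] and [/ent] tokens.
    inter: (start, end) slice of the tokens between the two entities.
    """
    t_open, t_close = tokens[ent_start], tokens[ent_end]
    t_entity = mean_pool(tokens[ent_start + 1:ent_end])
    t_inter = mean_pool(tokens[inter[0]:inter[1]])
    return {
        "A": t_open,
        "B": t_close,
        "C": t_open + t_close,  # list concatenation, size 2d
        "D": t_entity,
        "E": hadamard(t_entity, t_inter),
        "F": hadamard(t_open, t_inter),
        "G": hadamard(t_close, t_inter),
        "H": hadamard(t_open, t_close, t_inter),
    }


toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [2.0, 2.0], [1.0, 1.0]]
print(entity_representations(toy, 0, 2, (3, 5))["H"])
```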

3.2 LaMReD model

LaMReD provides two variations that differ in information synthesis (Fig. 3), aiming to explore the potential effect of different aggregations (Theodoropoulos and Moens, 2023). LaMReDA utilizes element-wise addition to aggregate the entities’ representations, while LaMReDM employs element-wise multiplication. The input text is encoded using PubMedBERT (base or large). Following LM encoding, we construct the relation representation by sampling and aggregating tokens from the input sequence. This step enables the model to capture essential features and contextual information relevant to semantic relation classification. To mitigate the risk of overfitting and enhance model generalization, we incorporate a dropout layer (Srivastava et al., 2014) with a probability of 0.3. The linear classification layer takes the aggregated representation and outputs the predicted label.

Following the paradigm proposed by Baldini Soares et al. (2019) and Hogan et al. (2021), we experiment with various approaches for learning relation representations tailored to the relation detection task to empirically ascertain the effectiveness of each strategy (Fig. 3). The explored types of relation representation R are the following:

• A, B, C - Special Tokens:

$R_{A}=f(l(t_{[ent]_{1}}),l(t_{[ent]_{2}})),$ (9)

$R_{B}=f(l(t_{[/ent]_{1}}),l(t_{[/ent]_{2}})),$ (10)

$R_{C}=f(l(t_{[ent]_{1}}),l(t_{[/ent]_{1}}),l(t_{[ent]_{2}}),l(t_{[/ent]_{2}})),$ (11)

• D - Entity Pool:

$R_{D}=f(l([t_{E1}]),l([t_{E2}])),$ (12)

• E - Middle Pool:

$R_{E}=l([t_{Inter}]),$ (13)

• F - [CLS] token & Entity Pool:

$R_{F}=f(l(t_{[CLS]}),l([t_{E1}]),l([t_{E2}])),$ (14)

• G, H, I - [CLS] token & Special Tokens:

$R_{G}=f(l(t_{[CLS]}),l(t_{[ent]_{1}}),l(t_{[ent]_{2}})),$ (15)

$R_{H}=f(l(t_{[CLS]}),l(t_{[/ent]_{1}}),l(t_{[/ent]_{2}})),$ (16)

$R_{I}=f(l(t_{[CLS]}),l(t_{[ent]_{1}}),l(t_{[/ent]_{1}}),l(t_{[ent]_{2}}),l(t_{[/ent]_{2}})),$ (17)

• J - [CLS] token & Middle Pool:

$R_{J}=f(l(t_{[CLS]}),l([t_{Inter}])),$ (18)

• K, L, M - Special tokens & Middle Pool:

$R_{K}=f(l(t_{[ent]_{1}}),l([t_{Inter}]),l(t_{[ent]_{2}})),$ (19)

$R_{L}=f(l(t_{[/ent]_{1}}),l([t_{Inter}]),l(t_{[/ent]_{2}})),$ (20)

$R_{M}=f(l(t_{[ent]_{1}}),l(t_{[/ent]_{1}}),l([t_{Inter}]),l(t_{[ent]_{2}}),l(t_{[/ent]_{2}})),$ (21)

• N - Entity & Middle Pool:

$R_{N}=f(l([t_{E1}]),l([t_{Inter}]),l([t_{E2}])),$ (22)

• O, P - Context Vector & Entity Pool:

$R_{O}=l(cv),$ (23)

$R_{P}=f(l([t_{E1}]),l([t_{E2}]),l(cv)),$ (24)

where $\{R_{A},R_{B},R_{C},R_{D},R_{E},R_{F},R_{G},R_{H},R_{I},R_{J},R_{K},R_{L},R_{M},R_{N},R_{O},R_{P}\}\in\mathbb{R}^{d}$ , $d$ is the embedding size of PubMedBERT base (768) and PubMedBERT large (1024), $f()$ is the aggregation function, element-wise addition for LaMReDA and element-wise multiplication for LaMReDM, $l()$ is a linear projection layer with dimension equal to the embedding size, $t_{[ent]_{1}}$ , $t_{[/ent]_{1}}$ , $t_{[ent]_{2}}$ , and $t_{[/ent]_{2}}$ are the embeddings of the start and end special tokens of the first and second entity and $t_{[CLS]}$ is the representation of the special token [CLS]. We define the averaged pooled representation of the entities and the intermediate tokens between the entities as $[t_{E1}]$ , $[t_{E2}]$ , and $[t_{Inter}]$ correspondingly. In equations 23 and 24, we use the localized context vector $cv$ ,⁹ which relies on the attention heads to locate relevant context for the entity pair and was introduced in ATLOP (Zhou et al., 2021), a state-of-the-art model in document-level relation extraction.

⁹ Additional information is provided in Appendix D.
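The difference between LaMReDA and LaMReDM reduces to the choice of $f()$. The sketch below applies a projection $l()$ to each part and then aggregates by addition or multiplication. The helper names are hypothetical, and the sketch assumes one projection shared by all parts, which the text does not confirm. The identity weights in the usage example only make the arithmetic easy to follow; in the models the projection is learned.

```python
from functools import reduce
import operator


def linear(x, weight, bias):
    """Apply y = W x + b, with `weight` given as a list of rows."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weight, bias)
    ]


def aggregate(parts, mode):
    """Element-wise addition (LaMReDA) or multiplication (LaMReDM)."""
    if mode not in ("add", "mul"):
        raise ValueError("mode must be 'add' or 'mul'")
    op = operator.add if mode == "add" else operator.mul
    return [reduce(op, column) for column in zip(*parts)]


def relation_representation(parts, weight, bias, mode):
    """Project each part with l() and aggregate with f(), e.g. R_N in Eq. 22."""
    return aggregate([linear(p, weight, bias) for p in parts], mode)


identity, zero = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
parts = [[1.0, 2.0], [3.0, 4.0], [0.5, 2.0]]  # toy [t_E1], [t_Inter], [t_E2]
print(relation_representation(parts, identity, zero, "add"))
print(relation_representation(parts, identity, zero, "mul"))
```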

3.3 Experimental setup

The models are trained for 50 epochs and the best checkpoints are retained based on the performance on the development set, measured using the F1-score. We utilize the Adam (Kingma and Ba, 2014) optimizer with a learning rate of $10^{-5}$. The batch size is set to 16. We conduct experiments in two distinct setups. In the multi-class setup, we evaluate performance using micro and macro F1-score, considering four relation types: positive, negative, complex, and no relation. In the binary setup, the objective is the prediction of the presence of relation. LaMEL is specifically designed for the binary setup. We utilize the official splits of ReDReS and ReDAD (Tab. 1) and repeat the experiments 10 times with different seeds. To ensure robustness of results, we also employ a 5-fold cross-validation approach. To explore the cross-disease capabilities of our approach, we train the models using one dataset (e.g., ReDReS) and evaluate on the other (e.g., ReDAD), and vice versa. We utilize the relation representation $R_{A}$ (Eq. 9) for LaMReDA and LaMReDM and the entity representation $E_{A}$ (Eq. 1) for LaMEL. These experiments are repeated 10 times with different seeds, and 15% of the training data is excluded to define the development set.¹⁰

¹⁰ Hardware: single NVIDIA RTX 3090 GPU 24GB.
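As a reminder of how the two multi-class metrics differ, the sketch below computes micro and macro F1 from gold and predicted labels using the standard definitions. The function name and the toy labels are illustrative and do not come from the evaluation code of this work.

```python
def micro_macro_f1(gold, pred, labels):
    """Return (micro_f1, macro_f1) for single-label classification."""
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    def f1(t, p_, n):
        denom = 2 * t + p_ + n
        return 2 * t / denom if denom else 0.0

    # Macro: average of per-class F1, so rare classes weigh as much as common ones.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: F1 over pooled counts, dominated by frequent classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


gold = ["positive", "positive", "negative", "no relation"]
pred = ["positive", "no relation", "negative", "no relation"]
print(micro_macro_f1(gold, pred, ["positive", "negative", "no relation"]))
```

Because the negative class is under-represented in both datasets (Tab. 1), the macro score is the more sensitive of the two to errors on that class.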

The cross-entropy loss function is used to train LaMReDA and LaMReDM. For LaMEL, the following cosine embedding loss function is used:

$l(x_{1},x_{2},y)=\begin{cases}1-cos(x_{1},x_{2}),&\text{if $y=1$}\\ max(0,cos(x_{1},x_{2})-m),&\text{if $y=-1$}\end{cases},$ (25)

where $x_{1}$ and $x_{2}$ are the projected representations of the two entities, $y$ is the ground-truth label (1 if the entities are related, -1 if they are not), $cos()$ is the cosine similarity in the embedding space, and $m$ is the margin parameter that is set to 0. In the inference step, the threshold to predict the presence of relation based on the cosine similarity of the two entity representations is set to 0.5.
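Equation 25 and the inference rule can be written out in a few lines. This is a plain-Python sketch for a single pair of vectors; the function names are hypothetical, and a real training loop would use a batched implementation (the loss has the same form as PyTorch's `CosineEmbeddingLoss`).

```python
import math


def cosine(x1, x2):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm = math.sqrt(sum(a * a for a in x1)) * math.sqrt(sum(b * b for b in x2))
    return dot / norm


def cosine_embedding_loss(x1, x2, y, margin=0.0):
    """Eq. 25: pull related pairs together, push unrelated pairs below the margin."""
    c = cosine(x1, x2)
    return 1.0 - c if y == 1 else max(0.0, c - margin)


def predict_relation(x1, x2, threshold=0.5):
    """Inference rule: a relation is predicted if the similarity exceeds the threshold."""
    return cosine(x1, x2) > threshold


print(cosine_embedding_loss([1.0, 0.0], [0.0, 1.0], y=1))
print(predict_relation([1.0, 0.0], [1.0, 0.0]))
```

With the margin set to 0, unrelated pairs are only penalized while their similarity is positive, so orthogonal or opposed representations incur no loss.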

4 Results

Tables 2 and 3 report the F1-scores of the LaMEL, LaMReDA, and LaMReDM models on the ReDReS and ReDAD datasets. Each cell (except for cross-disease experiments) displays two values: the average F1-score from 10 runs on the original test set (Tab. 1) and the average F1-score from a 5-fold cross-validation. The models perform well across all relation (A-P) and entity (A-H) representations, showing their ability to learn meaningful representations for the semantic relation task regardless of initial token selection. However, we observe patterns regarding the relation representations. In the binary setup, relation representation $R_{G}$ (Eq. 15) yields strong results for both datasets, suggesting that including the [CLS] token representation might be beneficial. In the multi-class setup, relation representations $R_{L}$ (Eq. 20), $R_{J}$ (Eq. 18), and $R_{O}$ (Eq. 23) are effective for both datasets, indicating that the surrounding context is crucial for the more complex task, as $R_{L}$ and $R_{J}$ include the averaged pooled representation of intermediate tokens between entities, and $R_{O}$ leverages the context vector (Zhou et al., 2021). The intra-model comparison reveals that over-parameterization tends to be useful: using PubMedBERT large generally results in better performance than the base alternative. PubMedBERT base shows superior performance mainly in experiments using the original splits of ReDReS (Tab. 1). LaMEL is highly competitive with LaMReDA and LaMReDM, indicating that learning entity embedding spaces optimized for relation detection is promising. LaMEL achieves the highest performance in the 5-fold setup of ReDAD and the original setup of ReDReS, with F1-scores of 91.25% and 91.03% respectively.

Type¹	ReDReS		ReDAD
Type¹	F₁^$\square$	F₁^{$\blacksquare$}	F₁^$\square$	F₁^{$\blacksquare$}
A	90.25/89.43	90.88/90.01	86.73/88.75	88.9/90.17
B	90.29/89.01	90.73/89.41	86.89/89.29	88.22/89.15
C	90.51/89.44	90.71/89.67	87.49/90.02	88.57/90.65
D	90.47/88.9	91.03/90.07	86.29/88.88	88.22/90.64
E	90.61/89.1	90.54/89.55	86.03/88.96	88.74/90.35
F	90.48/89.37	90.88/90.29	87.27/89.18	89.44/90.57
G	90.32/89.71	90.35/89.43	86.97/89.46	89.12/91.25
H	89.68/89.24	90.13/89.29	87.29/89.91	88.77/90.67
CD²	86.2	89.14	88.92	88.56

¹ Type of Entity Representation.
² Cross-disease experiments utilizing the entity representation $E_{A}$ : Training on ReDReS, evaluation on ReDAD, and vice versa.

Table 2: LaMEL Results (%) in binary setup (PubMedBERT $\square$: base, $\blacksquare$: large): Each cell (except for cross-disease experiments) shows the average F1-score from 10 runs (original test set) and from the 5-fold cross-validation setup.

The inter-model comparison across the same relation representations indicates that the aggregation function does not significantly impact relation detection. Neither LaMReDA (element-wise addition) nor LaMReDM (element-wise multiplication) shows a clear advantage over the other. This suggests that the transformer layers of PubMedBERT and the projection layer $l()$ preceding the aggregation are effectively trained in both models to encode the essential information for relation detection, regardless of the aggregation function used. The cross-disease experiments underscore the robustness of the models in both binary and multi-class setups. This robustness supports transfer learning (Zhuang et al., 2020) in semantic relation detection across diseases, highlighting the potential for broader applications and research endeavors in knowledge discovery.

Data	Type¹	Binary setup				Multi-class setup - Micro Evaluation				Multi-class setup - Macro Evaluation			
Data	Type¹	LaMReDA		LaMReDM		LaMReDA		LaMReDM		LaMReDA		LaMReDM	
Data	Type¹	F₁^$\square$	F₁^{$\blacksquare$}	F₁^$\square$	F₁^{$\blacksquare$}	F₁^$\square$	F₁^{$\blacksquare$}	F₁^$\square$	F₁^{$\blacksquare$}	F₁^$\square$	F₁^{$\blacksquare$}	F₁^$\square$	F₁^{$\blacksquare$}
ReDReS	A	90.72/89.95	90.74/90.57	90.42/89.15	90.71/89.53	74.49/73.91	73.96/75.01	74.36/73.31	74.35/74.91	74.52/74.5	73.66/74.48	74.3/73.06	72.81/75.07
ReDReS	B	90.4/88.79	90.28/89.54	90.47/89.33	90.06/89.74	74.27/74.14	73.72/74.79	74.26/74.45	73.57/75.34	74.32/74.15	73.65/75.19	74.38/74.31	73.11/75.74
ReDReS	C	90.85/89.69	90.75/89.75	90.51/88.84	89.14/89.16	74.93/72.98	73.54/74.59	74.31/72.71	73.69/73.56	74.96/73.75	73.44/74.74	74.1/72.83	73.49/73.88
ReDReS	D	90.55/89.29	90.93/89.25	90.61/89.47	90.53/88.96	73.61/73.85	73.5/74.36	73.02/74.96	73.5/75.77	73.71/74.54	73.7/74.62	73.24/75.12	73.9/76.24
ReDReS	E	89.57/89.39	89.43/88.89	89.57/89.39	89.43/88.89	73.73/75.67	73.68/74.68	73.73/75.67	73.21/74.68	73.95/74.9	74.01/75.1	73.95/74.9	74.01/75.1
ReDReS	F	90.48/89.09	90.62/89.56	90.41/89.19	90.43/89.94	72.86/74.18	73.82/76.55	73.51/74.07	73.33/75.26	72.62/72.82	73.94/76.66	73.32/74.08	74.5/75.84
ReDReS	G	90.78/89.32	90.76/90.26	90.91/89.82	89.47/89.6	74.33/73.35	73.63/73.59	74.05/75.05	73.22/74.34	74.57/73.78	73.31/74.02	74.21/75.13	73.87/74.8
ReDReS	H	90.91/88.88	90.45/88.98	90.29/88.93	89.99/89.14	74.43/73.68	73.62/74.56	73.59/74.06	73.36/74.71	74.48/73.96	73.65/74.62	73.9/74.12	73.42/75.04
ReDReS	I	90.86/89.07	90.47/89.38	90.62/89.35	89.55/89.18	74.75/73.19	73.3/74.78	74.29/74.14	74/74.28	74.8/73.48	73.26/74.9	73.88/74.6	73.91/74.16
ReDReS	J	89.43/89.23	89.65/89.3	89.53/88.99	89.89/89.43	73.75/76.05	74.28/74.55	73.47/75.04	74.43/74.95	74.05/75.09	75.06/74.97	73.52/75.91	74.7/75.53
ReDReS	K	90.1/89.63	90.05/89.26	89.7/89.18	89.64/89.54	74.43/74.4	74.07/75.22	74.47/75.89	74.38/74.97	74.44/74.8	74.23/75.42	74.3/75.81	74.02/74.67
ReDReS	L	89.6/89.86	89.85/89.95	89.82/88.61	90.33/88.93	73.35/74.27	74.32/75.42	73.68/76.5	73.9/76.15	73.16/74.55	74.04/75.52	73.66/75.08	73.52/76.02
ReDReS	M	90.81/90.27	90.07/89.85	90.01/89.3	89.75/89.73	74.21/74.59	74.29/74.85	73.96/74.94	74.32/74.87	74.03/74.04	74.36/74.68	73.72/75.1	73.95/75.59
ReDReS	N	90.73/89.37	90.6/89.71	90.72/88.77	90.63/89.49	74.55/73.49	73.83/73.47	73.86/74.81	73.38/74.58	74.66/74.81	73.97/74.94	74.13/74.82	73.53/74.76
ReDReS	O	90.9/89.94	90.5/89.35	90.9/89.94	90.5/89.35	73.99/74.51	73.77/73.79	73.99/74.51	73.77/73.79	73.83/74.71	73.62/74.27	73.83/74.71	73.62/74.27
ReDReS	P	89.72/89.8	90.31/90.08	89.13/89.19	90.3/89.92	73.37/75.03	73.87/75.02	73.57/75.24	74.41/75.44	73.48/74.79	74.66/75.18	73.54/74.84	73.52/75.79
ReDReS	CD²	87.42	88.93	87.76	88.1	73.09	75.04	74.15	75.35	73.64	74.94	74.38	75.44
ReDAD	A	88.31/90.15	89.55/91.07	87.98/90.37	89.14/89.92	77.64/77.07	79.47/78.07	78.34/76.14	80.21/78.17	77.34/77.21	79.26/77.83	78.44/76.4	80.13/78.39
ReDAD	B	87.82/90.57	89.11/90.52	87.66/88.83	88.64/87.11	77.74/77.56	78.65/78.24	78.61/76.43	78.91/78.26	77.13/77.74	78.72/78.31	77.76/76.55	78.98/77.56
ReDAD	C	88.3/89.64	89.21/87.01	88.11/88.79	89.17/89.87	77.14/76.7	79.67/77.89	78.19/76.41	79.32/77.59	77.08/77.05	79.35/77.92	77.82/76.62	79.26/77.78
ReDAD	D	87.33/88.99	89.82/89.25	88.18/89.61	88.8/90.05	78.28/76.73	79.54/75.64	76.81/76.58	78.68/78.37	78.26/76.8	78.47/76.12	76.67/76.84	78.73/78.87
ReDAD	E	88.03/88.63	89.37/90.91	88.03/88.63	89.37/90.91	77.83/77.45	77.54/78.3	77.83/77.45	77.54/78.3	77.75/77.34	77.69/78.41	77.75/77.34	77.69/78.41
ReDAD	F	87.71/89.54	88.45/89.45	87.87/90.11	88.54/90.99	77.59/76.5	79.74/79.23	76.94/76.75	79.68/77.94	77.35/76.33	79.47/79.31	76.95/76.95	79.21/77.38
ReDAD	G	88.17/90.06	89.83/88.96	88.22/89.75	89.55/90.15	77.83/77.64	79.39/78.61	78.13/77.04	79.09/77.91	77.5/77.69	79.58/78.76	77.88/77.56	78.74/77.73
ReDAD	H	88.01/89.14	88.76/90.78	87.73/88.99	88.99/90.39	77.12/76.36	79.57/78.32	78.11/77.81	79.4/78.13	77.09/76.08	78.76/78.88	78.14/77.81	79.18/78.15
ReDAD	I	87.56/88.64	88.05/89.67	87.86/90.14	89.45/90.13	77.77/76.29	79.23/77.7	78.4/76.08	78.99/78.42	77.11/76.56	79.66/78.31	78.49/76.24	78.78/77.97
ReDAD	J	87.99/89.91	88.89/91.12	87.79/90.5	89.06/89.05	77.69/77.48	78.4/78.92	77.14/78.35	78.4/77.55	77.71/77.57	77.94/78.87	76.59/77.92	78.18/77.43
ReDAD	K	88.36/91.01	89.33/90.94	88.01/90.09	89.05/90.89	78.3/78.24	78.25/76.19	78.54/78.13	77.73/77.5	78.11/78.17	78.49/76.1	78.29/78.6	77.42/77.46
ReDAD	L	88.25/90.53	89.25/91.09	87.87/90.03	89.13/90.02	78.48/77.87	78.94/77.67	77.88/77.59	77.91/77.97	78.52/78.05	78.16/78.47	77.85/77.67	77.37/78.22
ReDAD	M	88.42/90.07	89.57/90.8	88.12/90.52	89.4/90.46	77/77.31	78.85/78.42	78.02/78.5	77.12/77.66	76.62/77.2	79.66/77.85	78.02/78.38	77.08/77.66
ReDAD	N	87.98/90.71	88.94/90.82	88.08/90.43	89.47/90.68	78.21/77.97	78.92/78.24	78.03/77.63	78.78/78.06	78.07/77.3	77.91/78.39	77.74/77.73	78.6/78.1
ReDAD	O	88.27/90.59	87.71/91.06	88.27/90.59	87.71/91.06	77.08/79.02	78.96/78.78	77.08/79.02	78.96/78.78	76.78/78.95	79.03/78.97	76.78/78.95	79.03/78.97
ReDAD	P	88.02/89.41	88.86/89.93	88.33/90.26	89.51/87.85	78.43/77.45	79.38/76.67	78.44/77.19	79.12/77.7	78.37/76.25	79.44/77.27	78.04/77.31	78.91/77.76
ReDAD	CD²	88.4	89.33	89.01	89.16	73.69	74.29	72.67	72.81	74.13	74.82	72.91	73.76

¹ Type of Relation Representation.
² Cross-disease experiments utilizing the relation representation $R_{A}$ : Training on ReDReS, evaluation on ReDAD, and vice versa.

Table 3: LaMReDA and LaMReDM Results (%) in binary and multi-class setup (PubMedBERT $\square$: base, $\blacksquare$: large): Each cell (except for cross-disease experiments) shows the average F1-score from 10 runs (original test set) and from the 5-fold cross-validation setup.

Human Performance. To assess and compare to human performance, two additional experts identify the relation type in a random sample of 300 instances from the test set of each dataset (Tab. 1). The evaluation ground truth is based on the original test set labels. In the binary setup, the average F1-score is 92.14 for ReDReS and 91.87 for ReDAD. The LaMReDA, LaMReDM, and LaMEL models achieve performance comparable to human experts, indicating a high ability to detect semantic relations. Multi-class macro F1-scores are 85.23 (micro: 85.45) for ReDReS and 85.76 (micro: 85.87) for ReDAD. Compared to human experts, all models show a performance gap in the multi-class setup, highlighting that identifying more complex aspects of semantic relations is a challenging task.

Baseline performance - lower bound. We randomly assign labels based on the training data’s class distribution (Tab. 1). In the binary setup, the baseline achieves F1-scores of 54% (ReDReS) and 53.16% (ReDAD). For the multi-class setup, the macro F1-scores range from 32.05% to 32.43%, stressing the task’s difficulty, particularly for distinguishing various semantic relations (multi-class).¹¹

¹¹ More information is available in Appendix E.

5 Probing

This study probes PubMedBERT’s ability to capture semantic relations between entities. We explore different transformer layer representations and attention scores per layer and attention head. Averaged pooled entity representations are extracted from each layer, followed by training a linear classification layer. We test relation representations $R_{D}$ , $R_{O}$ , and $R_{P}$ (Eq. 12, 23, 24) of LaMReDA and LaMReDM to assess the impact of the context vector. Out-of-the-box representations are evaluated without the projection linear layer $l()$ . We also extract average attention scores of tokens for each entity towards the other across each layer and head, concatenating these into a feature vector for training a linear classification layer. Following Chizhikova et al. (2022), we also train the classification layer using average attention scores between the two entities across all layers.
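The attention-based probing features can be sketched as follows, assuming the attention tensor is available as nested lists indexed as `attn[layer][head][i][j]` (attention from token `i` to token `j`). The function name and the toy values are illustrative. For PubMedBERT base, with 12 layers and 12 heads, this layout would give a feature vector of 12 × 12 × 2 = 288 values per entity pair.

```python
from statistics import fmean


def attention_features(attn, ent1, ent2):
    """Average attention between two entities for every layer and head.

    attn: nested lists with shape [layers][heads][tokens][tokens].
    ent1, ent2: lists of token indices belonging to each entity.
    Returns a flat feature vector with two values per (layer, head):
    entity 1 -> entity 2 and entity 2 -> entity 1.
    """
    features = []
    for layer in attn:
        for head in layer:
            e1_to_e2 = fmean(head[i][j] for i in ent1 for j in ent2)
            e2_to_e1 = fmean(head[i][j] for i in ent2 for j in ent1)
            features.extend([e1_to_e2, e2_to_e1])
    return features


toy_attn = [[  # one layer with two heads over three tokens
    [[0.2, 0.3, 0.5], [0.1, 0.8, 0.1], [0.6, 0.2, 0.2]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]]
print(attention_features(toy_attn, ent1=[0], ent2=[2]))
```

The resulting vector is what a linear classification layer would consume in the probing setup; restricting the loop to a single layer gives the per-layer variant discussed in the text.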

Figure 4 shows the results of probing experiments in the binary setup using ReDReS and PubMedBERT base. Additional probing experiments are presented in Appendix G. The 10th and 11th layers provide the most informative representations for relation types ($R_{D}$, $R_{O}$, and $R_{P}$). The $R_{D}$ representation, using element-wise multiplication, outperforms other representations in intra-layer comparisons, suggesting its effectiveness without end-to-end training. However, as highlighted in Section 4, inter-model comparisons indicate that the transformer layers and projection layer $l()$ capture crucial information for relation detection, regardless of the aggregation function. Using context vectors with $R_{O}$ and $R_{P}$ generally offers no advantage, though $R_{O}$ from the 10th and 11th layers performs well, indicating possibly meaningful localized context. Attention scores between entities in the 12th layer yield the best performance, surpassing the baseline of using scores from all layers, indicating strong attention between the entities in the last layer. Figure 4 reveals that the 6th and 9th attention heads are most informative for relation detection.
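As a minimal sketch of the out-of-the-box $R_{D}$ probe, the snippet below average-pools the token vectors of each entity and multiplies the two pooled vectors element-wise. It reads $R_{D}$ as exactly this product, which is an assumption based on the description above; the definition in Eq. 12 takes precedence, and the function names and toy vectors are illustrative.

```python
def average_pool(token_vectors):
    """Mean of the token vectors that belong to one entity mention."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def relation_product(entity_a_tokens, entity_b_tokens):
    """Element-wise product of the two pooled entity vectors."""
    a = average_pool(entity_a_tokens)
    b = average_pool(entity_b_tokens)
    return [x * y for x, y in zip(a, b)]


# Toy hidden states with 3 dimensions (real PubMedBERT base states have 768).
entity_a = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]   # two sub-word tokens
entity_b = [[0.5, 1.0, -1.0]]                   # one token
print(relation_product(entity_a, entity_b))  # [1.0, 2.0, -2.0]
```

In the probing setup, a vector of this kind is computed from the hidden states of a single layer and fed to a linear classifier, with the encoder itself left untrained.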

6 Related Work

Information Extraction Datasets. Several biomedical datasets aim to enhance Information Extraction (IE) system development (Huang et al., 2024; Detroja et al., 2023; Nasar et al., 2021; Theodoropoulos et al., 2021), typically focusing on one or a few entity types and their interactions. AIMed (Bunescu et al., 2005), BioInfer (Pyysalo et al., 2007), and BioCreative II PPI IPS (Krallinger et al., 2008) formulate protein-protein interactions. The chemical-protein and chemical-disease interactions are modeled by DrugProt (Miranda et al., 2021) and BC5CDR (Li et al., 2016), respectively. ADE (Gurulingappa et al., 2012), DDI13 (Herrero-Zazo et al., 2013), and n2c2 2018 ADE (Henry et al., 2020) include drug-ADE (adverse drug effect) and drug-drug interactions. EMU (Doughty et al., 2011), GAD (Bravo et al., 2015) and RENET2 (Su et al., 2021) contain relations between genes and diseases. N-ary (Peng et al., 2017) incorporates drug-gene mutation interactions. The task of event extraction is illustrated by GE09 (Kim et al., 2009), GE11 (Kim et al., 2011), and CG (Pyysalo et al., 2013). DDAE (Lai et al., 2019) includes disease-disease associations. BioRED (Luo et al., 2022) focuses on document-level relations for various entities. Unlike these datasets, ReDReS and ReDAD focus on RS and AD, include entities of up to 82 different semantic types, and model the semantic relation between them.

Knowledge Discovery. Gottlieb et al. (2011) present PREDICT, a method for ranking potential drug-disease associations to predict drug indications. Romano et al. (2024) release AlzKB, a heterogeneous graph knowledge base for AD, constructed using external data sources and describing various medical entities (e.g., chemicals, genes). Other graph-based efforts model knowledge around AD for tasks such as drug repurposing (Hsieh et al., 2023; Daluwatumulle et al., 2022; Nian et al., 2022), gene identification (Binder et al., 2022), or as general knowledge repositories (Sügis et al., 2019). Another paradigm for knowledge discovery is the open information extraction (OIE) setup (Mausam et al., 2012; Etzioni et al., 2008), which faces challenges such as data consistency, performance evaluation, and semantic drift (Zhou et al., 2022). Research efforts (Wang et al., 2018; de Silva et al., 2017; Nebot and Berlanga, 2014; Movshovitz-Attias and Cohen, 2012; Nebot and Berlanga, 2011) aim to address these issues and extract knowledge with little or no supervision. Advances in literature-based discovery (Gopalakrishnan et al., 2019; Thilakaratne et al., 2019) try to identify novel medical entity relations using graph-based (Kilicoglu et al., 2020; Nicholson and Greene, 2020), machine learning (Zhao et al., 2021; Lardos et al., 2022), and co-occurrence methods (Kuusisto et al., 2020; Millikin et al., 2023). Tian et al. (2024) stress the potential of large language models (LLMs) to summarize, simplify, and synthesize medical evidence (Peng et al., 2023; Tang et al., 2023; Shaib et al., 2023), suggesting that LLMs may have encoded biomedical knowledge (Singhal et al., 2023). To exploit this potential, we explore constructing LM representations for knowledge discovery. To the best of our knowledge, no systematic approach assembles knowledge about RS. 
Unlike previous work, we introduce a framework that is disease-agnostic in principle and use it to acquire knowledge about RS and AD starting from raw text.

7 Conclusion

This work presents an open-source framework for disease knowledge discovery from raw text. We contribute two new annotated datasets for RS (ReDReS) and AD (ReDAD), facilitating further research. Extensive evaluation explores various methods for representing relations and entities, yielding insights into optimal modeling approaches for semantic relation detection, and emphasizing language models’ potential in knowledge discovery.

Limitations

One limitation of the paper is that the data pipeline relies on an external mention extractor/linker. However, this aspect introduces flexibility, allowing researchers and practitioners to integrate custom models suited to their specific applications. The creation of gold-standard datasets requires the manual work of medical experts. This process is time-consuming and resource-intensive, potentially limiting the scalability of the approach. Nevertheless, the experiments demonstrate that the supervised models of the study achieve strong performance in semantic relation detection without needing a large training set. Additionally, the cross-disease experiments highlight the robustness of the models in both binary and multi-class setups. This finding enables transfer learning scenarios in semantic relation detection, which can be applied to other diseases or medical aspects, indicating a potential for broader applications and research opportunities.

Ethics Statement

All recruited medical experts provided informed consent before participating in the annotation process. The compensation provided to the annotators was adequate and took into account their demographics, particularly their country of residence.

References

Alzheimer’s-Association (2024) Alzheimer’s-Association. 2024. 2024 alzheimer’s disease facts and figures. Alzheimer’s & dementia: the journal of the Alzheimer’s Association.
Aronson (2001) Alan R Aronson. 2001. Effective mapping of biomedical text to the umls metathesaurus: the metamap program. In Proceedings of the AMIA Symposium, page 17. American Medical Informatics Association.
Attfield et al. (2022) Kathrine E Attfield, Lise Torp Jensen, Max Kaufmann, Manuel A Friese, and Lars Fugger. 2022. The immunology of multiple sclerosis. Nature Reviews Immunology, 22(12):734–750.
Bada et al. (2012) Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Bretonnel Cohen, Karin Verspoor, Judith A Blake, et al. 2012. Concept annotation in the craft corpus. BMC bioinformatics, 13:1–20.
Baldini Soares et al. (2019) Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.
Bettencourt-Silva et al. (2012) J Bettencourt-Silva, B De La Iglesia, S Donell, and V Rayward-Smith. 2012. On creating a patient-centric database from multiple hospital information systems. Methods of information in medicine, 51(03):210–220.
Binder et al. (2022) Jessica Binder, Oleg Ursu, Cristian Bologa, Shanya Jiang, Nicole Maphis, Somayeh Dadras, Devon Chisholm, Jason Weick, Orrin Myers, Praveen Kumar, et al. 2022. Machine learning prediction and tau-based screening identifies potential alzheimer’s disease genes relevant to immunity. Communications Biology, 5(1):125.
Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
Bodenreider (2004) Olivier Bodenreider. 2004. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270.
Bravo et al. (2015) Àlex Bravo, Janet Piñero, Núria Queralt-Rosinach, Michael Rautschka, and Laura I Furlong. 2015. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC bioinformatics, 16:1–17.
Bunescu et al. (2005) Razvan Bunescu, Ruifang Ge, Rohit J Kate, Edward M Marcotte, Raymond J Mooney, Arun K Ramani, and Yuk Wah Wong. 2005. Comparative experiments on learning information extractors for proteins and their interactions. Artificial intelligence in medicine, 33(2):139–155.
Chizhikova et al. (2022) Anastasia Chizhikova, Sanzhar Murzakhmetov, Oleg Serikov, Tatiana Shavrina, and Mikhail Burtsev. 2022. Attention understands semantic relations. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4040–4050, Marseille, France. European Language Resources Association.
Cock et al. (2009) Peter JA Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, et al. 2009. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422.
Collier et al. (2004) Nigel Collier, Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi, and Jin-Dong Kim. 2004. Introduction to the bio-entity recognition task at JNLPBA. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), pages 73–78, Geneva, Switzerland. COLING.
Consortium (2004) Gene Ontology Consortium. 2004. The gene ontology (go) database and informatics resource. Nucleic acids research, 32(suppl_1):D258–D261.
Daluwatumulle et al. (2022) Geesa Daluwatumulle, Rupika Wijesinghe, and Ruvan Weerasinghe. 2022. In silico drug repurposing using knowledge graph embeddings for alzheimer’s disease. In Proceedings of the 9th International Conference on Bioinformatics Research and Applications, pages 61–66.
de Silva et al. (2017) Nisansa de Silva, Dejing Dou, and Jingshan Huang. 2017. Discovering inconsistencies in pubmed abstracts through ontology-based information extraction. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 362–371.
Detroja et al. (2023) Kartik Detroja, CK Bhensdadia, and Brijesh S Bhatt. 2023. A survey on relation extraction. Intelligent Systems with Applications, 19:200244.
Doughty et al. (2011) Emily Doughty, Attila Kertesz-Farkas, Olivier Bodenreider, Gary Thompson, Asa Adadey, Thomas Peterson, and Maricel G Kann. 2011. Toward an automatic method for extracting cancer-and other disease-related point mutations from the biomedical literature. Bioinformatics, 27(3):408–415.
Elkin and Brown (2023) Peter L Elkin and Steven H Brown. 2023. Unified medical language system (umls). In Terminology, Ontology and their Implementations, pages 463–474. Springer.
Etzioni et al. (2008) Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S Weld. 2008. Open information extraction from the web. Communications of the ACM, 51(12):68–74.
Gopalakrishnan et al. (2019) Vishrawas Gopalakrishnan, Kishlay Jha, Wei Jin, and Aidong Zhang. 2019. A survey on literature based discovery approaches in biomedical domain. Journal of biomedical informatics, 93:103141.
Gottlieb et al. (2011) Assaf Gottlieb, Gideon Y Stein, Eytan Ruppin, and Roded Sharan. 2011. Predict: a method for inferring novel drug indications with application to personalized medicine. Molecular systems biology, 7(1):496.
Gu et al. (2021) Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23.
Gurulingappa et al. (2012) Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of biomedical informatics, 45(5):885–892.
Henry et al. (2020) Sam Henry, Kevin Buchan, Michele Filannino, Amber Stubbs, and Ozlem Uzuner. 2020. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association, 27(1):3–12.
Herrero-Zazo et al. (2013) María Herrero-Zazo, Isabel Segura-Bedmar, Paloma Martínez, and Thierry Declerck. 2013. The ddi corpus: An annotated corpus with pharmacological substances and drug–drug interactions. Journal of biomedical informatics, 46(5):914–920.
Hogan et al. (2021) William P Hogan, Molly Huang, Yannis Katsis, Tyler Baldwin, Ho-Cheol Kim, Yoshiki Baeza, Andrew Bartko, and Chun-Nan Hsu. 2021. Abstractified multi-instance learning (amil) for biomedical relation extraction. In 3rd Conference on Automated Knowledge Base Construction.
Hsieh et al. (2023) Kang-Lin Hsieh, German Plascencia-Villa, Ko-Hong Lin, George Perry, Xiaoqian Jiang, and Yejin Kim. 2023. Synthesize heterogeneous biological knowledge via representation learning for alzheimer’s disease drug repurposing. Iscience, 26(1).
Huang et al. (2024) Ming-Siang Huang, Jen-Chieh Han, Pei-Yen Lin, Yu-Ting You, Richard Tzong-Han Tsai, and Wen-Lian Hsu. 2024. Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource. Briefings in Bioinformatics, 25(3):bbae132.
Kans (2024) Jonathan Kans. 2024. Entrez direct: E-utilities on the unix command line. In Entrez programming utilities help [Internet]. National Center for Biotechnology Information (US).
Kilicoglu et al. (2020) Halil Kilicoglu, Graciela Rosemblat, Marcelo Fiszman, and Dongwook Shin. 2020. Broad-coverage biomedical relation extraction with semrep. BMC bioinformatics, 21:1–28.
Kim et al. (2009) Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun’ichi Tsujii. 2009. Overview of BioNLP’09 shared task on event extraction. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, pages 1–9, Boulder, Colorado. Association for Computational Linguistics.
Kim et al. (2011) Jin-Dong Kim, Yue Wang, Toshihisa Takagi, and Akinori Yonezawa. 2011. Overview of Genia event task in BioNLP shared task 2011. In Proceedings of BioNLP Shared Task 2011 Workshop, pages 7–15, Portland, Oregon, USA. Association for Computational Linguistics.
Kim et al. (2013) Jin-Dong Kim, Yue Wang, and Yamamoto Yasunori. 2013. The Genia event extraction shared task, 2013 edition - overview. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 8–15, Sofia, Bulgaria. Association for Computational Linguistics.
Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Knox et al. (2024) Craig Knox, Mike Wilson, Christen M Klinger, Mark Franklin, Eponine Oler, Alex Wilson, Allison Pon, Jordan Cox, Na Eun Chin, Seth A Strawbridge, et al. 2024. Drugbank 6.0: the drugbank knowledgebase for 2024. Nucleic Acids Research, 52(D1):D1265–D1275.
Köhler et al. (2021) Sebastian Köhler, Michael Gargano, Nicolas Matentzoglu, Leigh C Carmody, David Lewis-Smith, Nicole A Vasilevsky, Daniel Danis, Ganna Balagura, Gareth Baynam, Amy M Brower, et al. 2021. The human phenotype ontology in 2021. Nucleic acids research, 49(D1):D1207–D1217.
Krallinger et al. (2008) Martin Krallinger, Florian Leitner, Carlos Rodriguez-Penagos, and Alfonso Valencia. 2008. Overview of the protein-protein interaction annotation extraction task of biocreative ii. Genome biology, 9:1–19.
Kuusisto et al. (2020) Finn Kuusisto, Daniel Ng, John Steill, Ian Ross, Miron Livny, James Thomson, David Page, and Ron Stewart. 2020. Kinderminer web: a simple web tool for ranking pairwise associations in biomedical applications. F1000Research, 9.
Lai et al. (2019) Po-Ting Lai, Wei-Liang Lu, Ting-Rung Kuo, Chia-Ru Chung, Jen-Chieh Han, Richard Tzong-Han Tsai, Jorng-Tzong Horng, et al. 2019. Using a large margin context-aware convolutional neural network to automatically extract disease-disease association from literature: comparative analytic study. JMIR Medical Informatics, 7(4):e14502.
Landis and Koch (1977) J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. biometrics, pages 159–174.
Lardos et al. (2022) Andreas Lardos, Ahmad Aghaebrahimian, Anna Koroleva, Julia Sidorova, Evelyn Wolfram, Maria Anisimova, and Manuel Gil. 2022. Computational literature-based discovery for natural products research: current state and future prospects. Frontiers in Bioinformatics, 2:827207.
Li et al. (2016) Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016. Biocreative v cdr task corpus: a resource for chemical disease relation extraction. Database, 2016.
Lipscomb (2000) Carolyn E Lipscomb. 2000. Medical subject headings (mesh). Bulletin of the Medical Library Association, 88(3):265.
Luo et al. (2022) Ling Luo, Po-Ting Lai, Chih-Hsuan Wei, Cecilia N Arighi, and Zhiyong Lu. 2022. Biored: a rich biomedical relation extraction dataset. Briefings in Bioinformatics, 23(5):bbac282.
Mausam et al. (2012) Mausam, Michael Schmitz, Stephen Soderland, Robert Bart, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 523–534, Jeju Island, Korea. Association for Computational Linguistics.
McGinley et al. (2021) Marisa P McGinley, Carolyn H Goldschmidt, and Alexander D Rae-Grant. 2021. Diagnosis and treatment of multiple sclerosis: a review. Jama, 325(8):765–779.
McHugh (2012) Mary L McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia Medica, 22(3):276.
Millikin et al. (2023) Robert J Millikin, Kalpana Raja, John Steill, Cannon Lock, Xuancheng Tu, Ian Ross, Lam C Tsoi, Finn Kuusisto, Zijian Ni, Miron Livny, et al. 2023. Serial kinderminer (skim) discovers and annotates biomedical knowledge using co-occurrence and transformer models. BMC bioinformatics, 24(1):412.
Miranda et al. (2021) Antonio Miranda, Farrokh Mehryary, Jouni Luoma, Sampo Pyysalo, Alfonso Valencia, and Martin Krallinger. 2021. Overview of drugprot biocreative vii track: quality evaluation and large scale text mining of drug-gene/protein relations. In Proceedings of the seventh BioCreative challenge evaluation workshop, pages 11–21.
Movshovitz-Attias and Cohen (2012) Dana Movshovitz-Attias and William Cohen. 2012. Bootstrapping biomedical ontologies for scientific text using nell. In BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, pages 11–19.
Nasar et al. (2021) Zara Nasar, Syed Waqar Jaffry, and Muhammad Kamran Malik. 2021. Named entity recognition and relation extraction: State-of-the-art. ACM Computing Surveys (CSUR), 54(1):1–39.
Nebot and Berlanga (2011) Victoria Nebot and Rafael Berlanga. 2011. Semantics-aware open information extraction in the biomedical domain. In Proceedings of the 4th International Workshop on Semantic Web Applications and Tools for the Life Sciences, pages 84–91.
Nebot and Berlanga (2014) Victoria Nebot and Rafael Berlanga. 2014. Exploiting semantic annotations for open information extraction: an experience in the biomedical domain. Knowledge and information Systems, 38:365–389.
Nelson et al. (2011) Stuart J Nelson, Kelly Zeng, John Kilbourne, Tammy Powell, and Robin Moore. 2011. Normalized names for clinical drugs: Rxnorm at 6 years. Journal of the American Medical Informatics Association, 18(4):441–448.
Neumann et al. (2019) Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 319–327, Florence, Italy. Association for Computational Linguistics.
Nian et al. (2022) Yi Nian, Xinyue Hu, Rui Zhang, Jingna Feng, Jingcheng Du, Fang Li, Larry Bu, Yuji Zhang, Yong Chen, and Cui Tao. 2022. Mining on alzheimer’s diseases related knowledge graph to identity potential ad-related semantic triples for drug repurposing. BMC bioinformatics, 23(Suppl 6):407.
Nicholson and Greene (2020) David N Nicholson and Casey S Greene. 2020. Constructing knowledge graphs and their biomedical applications. Computational and structural biotechnology journal, 18:1414–1428.
Pakter (2024) Philippe Pakter. 2024. Rare disease care in europe – gaping unmet needs. Rare, 2:100018.
Peng et al. (2017) Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115.
Peng et al. (2023) Yifan Peng, Justin F Rousseau, Edward H Shortliffe, and Chunhua Weng. 2023. Ai-generated text may have a role in evidence-based medicine. Nature medicine, 29(7):1593–1594.
Petriti et al. (2023) Uarda Petriti, Daniel C Dudman, Emil Scosyrev, and Sandra Lopez-Leon. 2023. Global prevalence of rett syndrome: systematic review and meta-analysis. Systematic Reviews, 12(1):5.
Pyysalo et al. (2007) Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari Björne, Jorma Boberg, Jouni Järvinen, and Tapio Salakoski. 2007. Bioinfer: a corpus for information extraction in the biomedical domain. BMC bioinformatics, 8:1–24.
Pyysalo et al. (2013) Sampo Pyysalo, Tomoko Ohta, and Sophia Ananiadou. 2013. Overview of the cancer genetics (CG) task of BioNLP shared task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 58–66, Sofia, Bulgaria. Association for Computational Linguistics.
Romano et al. (2024) Joseph D Romano, Van Truong, Rachit Kumar, Mythreye Venkatesan, Britney E Graham, Yun Hao, Nick Matsumoto, Xi Li, Zhiping Wang, Marylyn D Ritchie, et al. 2024. The alzheimer’s knowledge base: A knowledge graph for alzheimer disease research. Journal of Medical Internet Research, 26:e46777.
Sandweiss et al. (2020) Alexander J Sandweiss, Vicky L Brandt, and Huda Y Zoghbi. 2020. Advances in understanding of rett syndrome and mecp2 duplication syndrome: prospects for future therapies. The Lancet Neurology, 19(8):689–698.
Scheltens et al. (2021) Philip Scheltens, Bart De Strooper, Miia Kivipelto, Henne Holstege, Gael Chételat, Charlotte E Teunissen, Jeffrey Cummings, and Wiesje M van der Flier. 2021. Alzheimer’s disease. The Lancet, 397(10284):1577–1590.
Schoch et al. (2020) Conrad L Schoch, Stacy Ciufo, Mikhail Domrachev, Carol L Hotton, Sivakumar Kannan, Rogneda Khovanskaya, Detlef Leipe, Richard Mcveigh, Kathleen O’Neill, Barbara Robbertse, et al. 2020. Ncbi taxonomy: a comprehensive update on curation, resources and tools. Database, 2020:baaa062.
Shaib et al. (2023) Chantal Shaib, Millicent Li, Sebastian Joseph, Iain Marshall, Junyi Jessy Li, and Byron Wallace. 2023. Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success). In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1387–1407, Toronto, Canada. Association for Computational Linguistics.
Shu and Ye (2023) Xiaoling Shu and Yiwan Ye. 2023. Knowledge discovery: Methods from data mining and machine learning. Social Science Research, 110:102817.
Singhal et al. (2023) Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2023. Large language models encode clinical knowledge. Nature, 620(7972):172–180.
Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958.
Stearns et al. (2001) Michael Q Stearns, Colin Price, Kent A Spackman, and Amy Y Wang. 2001. Snomed clinical terms: overview of the development process and project status. In Proceedings of the AMIA Symposium, page 662. American Medical Informatics Association.
Su et al. (2021) Junhao Su, Ye Wu, Hing-Fung Ting, Tak-Wah Lam, and Ruibang Luo. 2021. Renet2: high-performance full-text gene–disease relation extraction with iterative training data expansion. NAR Genomics and Bioinformatics, 3(3):lqab062.
Sügis et al. (2019) Elena Sügis, Jerome Dauvillier, Anna Leontjeva, Priit Adler, Valerie Hindie, Thomas Moncion, Vincent Collura, Rachel Daudin, Yann Loe-Mie, Yann Herault, et al. 2019. Hena, heterogeneous network-based data set for alzheimer’s disease. Scientific data, 6(1):151.
Tang et al. (2023) Liyan Tang, Zhaoyi Sun, Betina Idnay, Jordan G Nestor, Ali Soroush, Pierre A Elias, Ziyang Xu, Ying Ding, Greg Durrett, Justin F Rousseau, et al. 2023. Evaluating large language models on medical evidence summarization. npj Digital Medicine, 6(1):158.
Theodoropoulos et al. (2021) Christos Theodoropoulos, James Henderson, Andrei Catalin Coman, and Marie-Francine Moens. 2021. Imposing relation structure in language-model embeddings using contrastive learning. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 337–348, Online. Association for Computational Linguistics.
Theodoropoulos and Moens (2023) Christos Theodoropoulos and Marie-Francine Moens. 2023. An information extraction study: Take in mind the tokenization! In Conference of the European Society for Fuzzy Logic and Technology, pages 593–606. Springer Nature Switzerland.
Theodoropoulos et al. (2023) Christos Theodoropoulos, Natalia Mulligan, Thaddeus Stappenbeck, and Joao Bettencourt-Silva. 2023. Representation learning for person or entity-centric knowledge graphs: An application in healthcare. In Proceedings of the 12th Knowledge Capture Conference 2023, K-CAP ’23, page 225–233, New York, NY, USA. Association for Computing Machinery.
Thilakaratne et al. (2019) Menasha Thilakaratne, Katrina Falkner, and Thushari Atapattu. 2019. A systematic review on literature-based discovery: general overview, methodology, & statistical analysis. ACM Computing Surveys (CSUR), 52(6):1–34.
Tian et al. (2024) Shubo Tian, Qiao Jin, Lana Yeganova, Po-Ting Lai, Qingqing Zhu, Xiuying Chen, Yifan Yang, Qingyu Chen, Won Kim, Donald C Comeau, et al. 2024. Opportunities and challenges for chatgpt and large language models in biomedicine and health. Briefings in Bioinformatics, 25(1):bbad493.
Tinn et al. (2023) Robert Tinn, Hao Cheng, Yu Gu, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2023. Fine-tuning large neural language models for biomedical natural language processing. Patterns, 4(4).
Trejo-Lopez et al. (2023) Jorge A Trejo-Lopez, Anthony T Yachnis, and Stefan Prokop. 2023. Neuropathology of alzheimer’s disease. Neurotherapeutics, 19(1):173–185.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
Wang et al. (2023) Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. 2023. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60.
Wang et al. (2018) Xuan Wang, Yu Zhang, Qi Li, Yinyin Chen, and Jiawei Han. 2018. Open information extraction with meta-pattern discovery in biomedical literature. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 291–300.
Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Zhao et al. (2021) Sendong Zhao, Chang Su, Zhiyong Lu, and Fei Wang. 2021. Recent advances in biomedical literature mining. Briefings in Bioinformatics, 22(3):bbaa057.
Zhou et al. (2022) Shaowen Zhou, Bowen Yu, Aixin Sun, Cheng Long, Jingyang Li, and Jian Sun. 2022. A survey on neural open information extraction: Current status and future directions. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 5694–5701. International Joint Conferences on Artificial Intelligence Organization. Survey Track.
Zhou et al. (2021) Wenxuan Zhou, Kevin Huang, Tengyu Ma, and Jing Huang. 2021. Document-level relation extraction with adaptive thresholding and localized context pooling. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 14612–14620.
Zhuang et al. (2020) Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. 2020. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76.

Appendix A Data Pipeline: Additional Information

To facilitate effective abstract retrieval, we implement an iterative approach to circumvent the API’s limitation of retrieving only 10,000 article IDs per query. This iterative process enables us to access a comprehensive set of PubMed IDs (PMIDs) related to the query. The detailed list of the 82 semantic types of the MetaMapLite-based pipeline is presented in Table 4. In addition to the MetaMapLite-based pipeline, we propose a second pipeline that is based on ScispaCy (Apache License 2.0) (Fig. 5). The difference lies in the selection of entity extractors and linkers that map the extracted entities to knowledge schemes. MetaMapLite adopts an integrated approach in which mention extraction and linking are performed in a single step, and it focuses on UMLS mapping, allowing for more precise and targeted extraction of entities. ScispaCy, in contrast, serves a broader range of Natural Language Processing (NLP) tasks. After the retrieval of the abstracts described in subsection 2.1, the following steps are executed: knowledge schema and linker generation, mention extraction, entity linking, and sampling of linked identifiers.
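The text does not specify how the iteration is organized. One common way to stay under a per-query cap is to split the publication-date range until every window returns at most 10,000 IDs, which the sketch below illustrates. The date-window strategy, the `count_ids` and `fetch_ids` callables, and the synthetic backend are all assumptions introduced for illustration; no real API is called.

```python
from datetime import date, timedelta

MAX_IDS_PER_QUERY = 10_000  # limit stated in the text


def collect_ids(start, end, count_ids, fetch_ids):
    """Recursively halve [start, end] until each window fits under the limit.

    count_ids(start, end) -> int and fetch_ids(start, end) -> list are
    hypothetical callables wrapping whatever search client is in use.
    """
    if count_ids(start, end) <= MAX_IDS_PER_QUERY or start == end:
        return list(fetch_ids(start, end))
    middle = start + (end - start) // 2
    left = collect_ids(start, middle, count_ids, fetch_ids)
    right = collect_ids(middle + timedelta(days=1), end, count_ids, fetch_ids)
    return left + right


# Stand-in backend: 25 synthetic IDs per day (no network access involved).
def fake_count(start, end):
    return 25 * ((end - start).days + 1)


def fake_fetch(start, end):
    ids = [f"{start + timedelta(days=d)}-{k}"
           for d in range((end - start).days + 1) for k in range(25)]
    return ids[:MAX_IDS_PER_QUERY]  # mimic the per-query cap


ids = collect_ids(date(2020, 1, 1), date(2022, 12, 31), fake_count, fake_fetch)
print(len(ids), len(set(ids)))
```

The windows are disjoint, so no deduplication is needed. A single day that exceeds the cap would still be truncated in this sketch and would need a finer split, for example by an additional query term.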

Knowledge schema and linker generation. ScispaCy (Neumann et al., 2019) harnesses an older version of UMLS (2020AA). This version serves as the foundation upon which ScispaCy trains and constructs its linkers, which operate on a character 3-gram string-overlap search mechanism, facilitating efficient and accurate entity recognition and linking. Following the paradigm of ScispaCy, we provide scripts for generating updated linkers tailored to a range of knowledge schemes. These include UMLS (Bodenreider, 2004), Gene Ontology (GO) (Consortium, 2004), National Center for Biotechnology Information (NCBI) taxonomy (Schoch et al., 2020), RxNorm (Nelson et al., 2011), SNOMED Clinical Terms (SNOMEDCT_US) (Stearns et al., 2001), Human Phenotype Ontology (HPO) (Köhler et al., 2021), Medical Subject Headings (MeSH) (Lipscomb, 2000), DrugBank (Knox et al., 2024), and Gold Standard Drug Database (GS) (https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/GS/index.html). Of particular note is the inclusion of UMLS, a unified system encompassing various knowledge bases, vocabularies, taxonomies, and ontologies pertinent to the biomedical domain. Any supported linker maps the concepts to UMLS CUIs, enhancing the standardization of medical terminology. Notably, the flexibility of ScispaCy’s implementation allows for seamless expansion to incorporate additional knowledge bases, thereby enhancing its versatility and applicability across diverse research needs.
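To illustrate the idea behind a character 3-gram overlap search, the sketch below ranks knowledge-base names by the Jaccard overlap of their 3-gram sets with a mention. Jaccard overlap is a simplification chosen for illustration and ScispaCy's actual candidate generator is not reproduced here; the function names, the padding scheme, and the toy knowledge base with placeholder identifiers are assumptions.

```python
def char_trigrams(text):
    """Set of character 3-grams of the lower-cased, padded string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def rank_candidates(mention, concept_names, top_k=3):
    """Rank concept IDs by Jaccard overlap between 3-gram sets."""
    mention_grams = char_trigrams(mention)
    scored = []
    for concept_id, name in concept_names.items():
        name_grams = char_trigrams(name)
        overlap = len(mention_grams & name_grams) / len(mention_grams | name_grams)
        scored.append((overlap, concept_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(concept_id, round(score, 3)) for score, concept_id in scored[:top_k]]


# Toy knowledge base; the identifiers are placeholders, not real CUIs.
toy_kb = {
    "ID_1": "Rett syndrome",
    "ID_2": "Alzheimer disease",
    "ID_3": "Down syndrome",
}
print(rank_candidates("rett syndrom", toy_kb))
```

Matching on sub-word fragments makes the lookup tolerant to spelling variants and truncated mentions, which is why a misspelled mention can still retrieve the intended concept.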

Mention extraction. ScispaCy boasts four distinct entity extractors, each trained on different corpora, collectively encompassing a range of entity types. These extractors include named entity recognition (NER) models trained on the CRAFT corpus (with 6 entity types) (Bada et al., 2012), JNLPBA corpus (with 5 entity types) (Collier et al., 2004), BC5CDR corpus (with 2 entity types) (Li et al., 2016), and BIONLP13CG corpus (with 16 entity types) (Kim et al., 2013). To maximize the range of the entity extraction, we leverage these diverse extractors in tandem, allowing us to capture mentions of 18 unique entity types (gene or protein, cell, chemical, organism, disease, organ, DNA, RNA, tissue, cancer, cellular component, anatomical system, multi-tissue structure, organism subdivision, developing anatomical structure, pathological formation, organism substance, and immaterial anatomical entity).

Entity linking. This process enhances the semantic understanding of the extracted entities, facilitates standardization, which is a key issue in the biomedical field (Bettencourt-Silva et al., 2012; Theodoropoulos et al., 2023), and promotes interoperability with external resources by associating the entities with specific concepts in supported knowledge schemes. Each entity is subjected to a linking process where we attempt to map it to concepts within supported knowledge bases or vocabularies. If a match is found, the entity is assigned a unique identifier, referred to as a CUI, corresponding to the specific concept in the knowledge schema. As entities may be linked to multiple knowledge sources, we merge the extracted CUIs obtained from the different linkers. This consolidation process ensures that each entity is associated with a comprehensive set of identifiers, encompassing diverse perspectives and representations across various knowledge schemes.

Sampling of linked identifiers. We address the scenario where multiple CUIs can be extracted for each entity due to the utilization of multiple linkers. We propose a prioritized sampling strategy (Fig. 6) to manage this situation and select the most relevant CUIs effectively. This strategy is designed to sample CUIs based on the predicted type of the entity (e.g., disease, gene, or chemical/drug) by prioritizing mapped CUIs from specific knowledge schemes focused on the entity being processed. For example, if an entity is predicted to be a chemical/drug, the sampling strategy first checks whether any linked CUIs exist in the RxNorm linker, a specific knowledge schema tailored for chemicals. If linked CUIs are found, they are sampled for inclusion in the final set of linked concepts associated with the entity. Otherwise, the search continues in a prioritized way (Fig. 6). We stress that the sampling strategy can be easily modified by the user based on the requirements of the research or the application.
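The prioritized lookup can be sketched as follows. Only the rule "check RxNorm first for chemicals/drugs" comes from the text; every other ordering in the `PRIORITY` table, the fallback behaviour, and the names `PRIORITY`, `DEFAULT_PRIORITY`, and `sample_cuis` are hypothetical placeholders, and the actual order should follow Fig. 6.

```python
# Hypothetical priority lists: only "RxNorm first for chemicals" is stated in the
# text; the remaining orderings are placeholders for the order shown in Fig. 6.
PRIORITY = {
    "chemical": ["RxNorm", "DrugBank", "UMLS"],
    "disease": ["HPO", "SNOMEDCT_US", "UMLS"],
    "gene": ["GO", "UMLS"],
}
DEFAULT_PRIORITY = ["UMLS"]


def sample_cuis(entity_type, linked_cuis):
    """Return (linker, CUIs) from the highest-priority linker that found a match.

    linked_cuis maps a linker name to the list of CUIs it returned for the entity.
    """
    for linker in PRIORITY.get(entity_type, DEFAULT_PRIORITY):
        cuis = linked_cuis.get(linker, [])
        if cuis:
            return linker, list(cuis)
    # Assumed fallback: merge the CUIs of all linkers if no prioritized one matched.
    merged = sorted({c for cuis in linked_cuis.values() for c in cuis})
    return None, merged


# Placeholder identifiers, not real CUIs.
linked = {"UMLS": ["CUI_A", "CUI_B"], "RxNorm": ["CUI_A"], "GO": []}
chosen_linker, chosen_cuis = sample_cuis("chemical", linked)
```

Keeping the priorities in a plain dictionary mirrors the remark that users can adapt the strategy: changing the behaviour only requires editing the table.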

The co-occurrence graph generation step is described in subsection 2.1.

Semantic Types
Amino Acid, Peptide, or Protein	Acquired Abnormality	Amino Acid Sequence	Amphibian	Anatomical Abnormality
Animal	Anatomical Structure	Antibiotic	Archaeon	Biologically Active Substance
Bacterium	Body Substance	Body System	Behavior	Biologic Function
Body Location or Region	Biomedical or Dental Material	Body Part, Organ, or Organ Component	Body Space or Junction	Cell Component
Cell Function	Cell	Congenital Abnormality	Chemical	Chemical Viewed Functionally
Chemical Viewed Structurally	Clinical Attribute	Clinical Drug	Cell or Molecular Dysfunction	Carbohydrate Sequence
Diagnostic Procedure	Daily or Recreational Activity	Disease or Syndrome	Environmental Effect of Humans	Element, Ion, or Isotope
Experimental Model of Disease	Embryonic Structure	Enzyme	Eukaryote	Fully Formed Anatomical Structure
Fungus	Food	Genetic Function	Gene or Genome	Human-caused Phenomenon or Process
Health Care Activity	Hazardous or Poisonous Substance	Hormone	Immunologic Factor	Individual Behavior
Inorganic Chemical	Injury or Poisoning	Indicator, Reagent, or Diagnostic Aid	Laboratory Procedure	Laboratory or Test Result
Mammal	Molecular Biology Research Technique	Mental Process	Mental or Behavioral Dysfunction	Molecular Sequence
Neoplastic Process	Nucleic Acid, Nucleoside, or Nucleotide	Nucleotide Sequence	Organic Chemical	Organism Attribute
Organism Function	Organism	Organ or Tissue Function	Pathologic Function	Pharmacologic Substance
Plant	Organism	Population Group	Receptor	Reptile
Substance	Social Behavior	Sign or Symptom	Tissue	Therapeutic or Preventive Procedure
Virus	Vitamin	Vertebrate

Table 4: List of the 82 semantic types of the MetaMapLite-based pipeline.

Sentence Sampling Algorithm. Given a set of sentences with defined CUIs $sent\_c$ and the co-occurrence frequency graph $co\_g$ , sample $n$ number of sentences (Alg. 1). Initialize a dictionary $f\_d$ and for each sentence save the extracted CUIs pairs $c\_p$ ( $extract\_conc(sent)$ ), the frequencies $f\_p$ of each pair extracted from the $co\_g$ ( $extract\_freq(c\_p,co\_g)$ ) and the summation of the frequencies $t\_f$ . Retrieve the sentence ids, summed frequencies, and the inverted summed frequencies from the dictionary and append them in $ids$ , $f\_l$ , and $inv\_f\_l$ lists respectively. Calculate the total sums of the frequencies $t\_f\_sum$ , $inv\_t\_f\_sum$ and then utilize them to define the probability distributions $\mathcal{P}$ and $\mathcal{IP}$ . Sample 50% of the sentences from $\mathcal{P}$ ( $sample(\mathcal{P},n/2)$ ) and 50% from $\mathcal{IP}$ ( $sample(\mathcal{IP},n/2)$ ) to ensure a balance of common and potentially novel pairs of co-occurred entities in the dataset.

Algorithm 1 Sentence Sampling.

Input: $sent\_c$, $co\_g$, $n$
Initialize dictionary $f\_d$
for $s\_id$, $sent$ in $sent\_c$ do
    $c\_p\leftarrow extract\_conc(sent)$
    $f\_p\leftarrow extract\_freq(c\_p,co\_g)$
    $t\_f\leftarrow sum(f\_p)$
    $f\_d\leftarrow save(s\_id,c\_p,f\_p,t\_f)$
end for
Initialize lists $ids$, $f\_l$, $inv\_f\_l$
for $s\_id$ in $f\_d$ do
    $ids\leftarrow append(s\_id)$
    $t\_f\leftarrow get(f\_d,s\_id)$
    $f\_l\leftarrow append(t\_f)$
    $inv\_f\_l\leftarrow append(1/t\_f)$
end for
$t\_f\_sum\leftarrow sum(f\_l)$
$inv\_t\_f\_sum\leftarrow sum(inv\_f\_l)$
Initialize lists $prob$, $inv\_prob$
for $f$ in $f\_l$ do
    $p=f/t\_f\_sum$
    $prob\leftarrow append(p)$
end for
for $f$ in $inv\_f\_l$ do
    $p=f/inv\_t\_f\_sum$
    $inv\_prob\leftarrow append(p)$
end for
$\mathcal{P}\leftarrow prob\_distr(ids,prob)$
$\mathcal{IP}\leftarrow prob\_distr(ids,inv\_prob)$
$sam\_sent\_1\leftarrow sample(\mathcal{P},n/2)$
$sam\_sent\_2\leftarrow sample(\mathcal{IP},n/2)$
Return $sam\_sent\_1$, $sam\_sent\_2$
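As an illustration, Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under several assumptions that the text does not specify: each sentence is given as a list of CUIs, concept pairs are all unordered CUI pairs within a sentence, the co-occurrence graph is a dictionary keyed by `frozenset` pairs, sentences whose summed frequency is zero are skipped (to avoid dividing by zero), and each half is drawn without replacement. The helper `weighted_sample` and the `seed` parameter are hypothetical additions.

```python
import random
from itertools import combinations


def weighted_sample(ids, weights, k, rng):
    """Draw up to k distinct ids, each draw proportional to the remaining weights."""
    pool = list(zip(ids, weights))
    chosen = []
    for _ in range(min(k, len(pool))):
        idx = rng.choices(range(len(pool)), weights=[w for _, w in pool], k=1)[0]
        chosen.append(pool.pop(idx)[0])
    return chosen


def sample_sentences(sent_c, co_g, n, seed=0):
    """Sample n/2 sentences from P (frequent pairs) and n/2 from IP (rare pairs)."""
    rng = random.Random(seed)
    f_d = {}
    for s_id, cuis in sent_c.items():
        c_p = list(combinations(sorted(set(cuis)), 2))       # concept pairs
        f_p = [co_g.get(frozenset(pair), 0) for pair in c_p]  # pair frequencies
        t_f = sum(f_p)
        if t_f > 0:  # assumption: skip sentences with no known co-occurrence
            f_d[s_id] = t_f
    ids = list(f_d)
    f_l = [f_d[s_id] for s_id in ids]
    inv_f_l = [1.0 / f for f in f_l]
    t_f_sum, inv_t_f_sum = sum(f_l), sum(inv_f_l)
    prob = [f / t_f_sum for f in f_l]
    inv_prob = [f / inv_t_f_sum for f in inv_f_l]
    k = n // 2
    sam_sent_1 = weighted_sample(ids, prob, k, rng)
    sam_sent_2 = weighted_sample(ids, inv_prob, k, rng)
    return sam_sent_1, sam_sent_2


# Toy input with placeholder identifiers (not real CUIs or frequencies).
toy_sentences = {"s1": ["A", "B"], "s2": ["A", "C"], "s3": ["B", "C"], "s4": ["D"]}
toy_graph = {frozenset(("A", "B")): 50, frozenset(("A", "C")): 5, frozenset(("B", "C")): 1}
common, rare = sample_sentences(toy_sentences, toy_graph, n=4)
```

As in Algorithm 1, the two draws are independent, so the same sentence can appear in both halves; a deduplication step would be a further design choice.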

Appendix B Annotation Portal

Figure 7 presents the annotation portal with an example from the ReDReS dataset. The annotator’s task is to identify the semantic relation between the two highlighted entities, classifying it as either a Positive Relation, Negative Relation, Complex Relation, or No Relation. If the sentence is considered uninformative or if there are errors in entity detection, type, or span, the annotator can remove the sentence or the entities. Furthermore, the annotator is encouraged to provide feedback, including any additional text that can clarify or elaborate on the relationship between the entities. By providing this supplementary information, annotators can contribute to a richer and more nuanced understanding of the relations within the data.

Appendix C Dataset Instances

In this section, we present some instances of different relation types in the datasets. In each example, we highlight the two detected entities.

Positive Relation:

• Amyloid fibrils are found in many fatal neurodegenerative diseases such as Alzheimer’s disease, Parkinson’s disease, type II diabetes, and prion disease.
• AChE has become an important drug target because partial inhibition of AChE results in modest increase in ACh levels that can have therapeutic benefits, thus AChE inhibitors have proved useful in the symptomatic treatment of Alzheimer’s disease.

Complex Relation:

• When the brain’s antioxidant defenses are overwhelmed by IR, it produces an abundance of reactive oxygen species (ROS) that can lead to oxidative stress, mitochondrial dysfunction, loss of synaptic plasticity, altered neuronal structure and microvascular impairment that have been identified as early signs of neurodegeneration in Alzheimer’s disease, Parkinson’s, amyotrophic lateral sclerosis, vascular dementia and other diseases that progressively damage the brain and central nervous system.
• Autophagy inhibitor 3-methyladenine (3-MA) attenuated the neuroprotective effect of CA, suggesting that autophagy was involved in the neuroprotection of CA.

Negative Relation:

• It was not observed in synaptopodin-deficient mice, which lack spine apparatus organelles.
• Furthermore, the use of some kinds of antihypertensive medication has been suggested to reduce the incidence of dementia including Alzheimer’s disease.

No Relation:

• Peripheral immune cells can cross the intact BBB, CNS neurons and glia actively regulate macrophage and lymphocyte responses, and microglia are immunocompetent but differ from other macrophage/dendritic cells in their ability to direct neuroprotective lymphocyte responses.
• These techniques have thus provided morphological and functional brain alterations mapping of Alzheimer’s disease: on one hand grey matter atrophy first concerns the medial temporal lobe before extending to the temporal neocortex and then other neocortical areas; on the other hand, metabolic alterations are first located within the posterior cingulate cortex and then reach the temporo-parietal area as well as the prefrontal cortex, especially in its medial part.

Appendix D Localized Context Vector

The localized context vector is computed as follows:

• Extract the attention scores of the two entities in the last encoding layer of the language model.
• Calculate the Hadamard product of the attention vectors.
• Calculate the average of the Hadamard product over the attention heads.
• Normalize to extract the distribution over the sequence.
• Extract the localized context vector by multiplying the token representations of the last encoding layer with the distribution vector.
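The steps above can be sketched with NumPy as follows. This is a minimal sketch, assuming that the last-layer attention is available as an array of shape (heads, sequence, sequence), that each entity is represented by a single token position (for instance its start marker), and that the attention rows are softmax outputs so the normalizing sum is positive. The function and argument names are hypothetical.

```python
import numpy as np


def localized_context_vector(attn, hidden, idx_a, idx_b):
    """Compute a localized context vector for two entity positions.

    attn:   (heads, seq, seq) attention scores of the last encoding layer
    hidden: (seq, dim) token representations of the last encoding layer
    idx_a, idx_b: token positions representing the two entities (assumption)
    """
    attn_a = attn[:, idx_a, :]               # (heads, seq) attention of entity A
    attn_b = attn[:, idx_b, :]               # (heads, seq) attention of entity B
    product = attn_a * attn_b                # Hadamard product per head
    averaged = product.mean(axis=0)          # average over the attention heads
    distribution = averaged / averaged.sum() # normalize over the sequence
    return hidden.T @ distribution           # (dim,) weighted sum of token vectors


# Toy example with random inputs (shapes only; no real model is involved).
rng = np.random.default_rng(0)
heads, seq, dim = 4, 6, 8
logits = rng.normal(size=(heads, seq, seq))
toy_attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
toy_hidden = rng.normal(size=(seq, dim))
context = localized_context_vector(toy_attn, toy_hidden, idx_a=1, idx_b=4)
```

Because the Hadamard product is large only where both entities attend strongly, the resulting vector emphasizes tokens that matter to both entities at once.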

Appendix E Baseline Performance: Lower Bound

To establish a baseline performance (lower bound) for comparison, contrasting with the human evaluation that serves as an upper bound, we randomly assign class labels to each instance of the test set based on the prior class distribution in the training set (Tab. 1). This simulates a classifier with no ability to learn relations between entities. We repeat this experiment 1 million times for robustness and report the average F1-score.

In the binary setup, the baseline achieves average F1-scores of 54% (ReDReS) and 53.16% (ReDAD). For the multi-class setup, the average macro F1-scores are 32.05% (micro: 32.21%) for ReDReS and 32.43% (micro: 32.33%) for ReDAD. The low performance of this simple baseline highlights the challenge of the task, especially in the multi-class scenario, where the model needs to distinguish between nuanced semantic relations.
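The random baseline can be simulated along the following lines. This is a minimal sketch, assuming macro-averaged F1 as the metric and using a small number of trials and toy labels instead of the 1 million repetitions and the real splits. The function names and the toy label lists are hypothetical and do not reproduce the reported numbers.

```python
import random
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1 computed from per-class true/false positives and negatives."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def random_prior_baseline(y_train, y_test, trials=1000, seed=0):
    """Average macro F1 when test labels are drawn from the training class prior."""
    rng = random.Random(seed)
    counts = Counter(y_train)
    labels = sorted(counts)
    weights = [counts[label] for label in labels]
    total = 0.0
    for _ in range(trials):
        y_pred = rng.choices(labels, weights=weights, k=len(y_test))
        total += macro_f1(y_test, y_pred, labels)
    return total / trials


# Toy balanced binary labels (illustrative only, not the ReDReS/ReDAD splits).
toy_train = [0] * 50 + [1] * 50
toy_test = [0] * 100 + [1] * 100
baseline_score = random_prior_baseline(toy_train, toy_test, trials=200)
```

Averaging over many trials removes the variance of a single random assignment, which is the reason for the large number of repetitions.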

Appendix F Distantly Supervised Datasets

ReDReS and ReDAD include gold annotations for a small fraction of the extracted sentences. The pre-processed text consists of 28,622 and 1,301,429 additional sentences related to RS and AD respectively, without annotations about the semantic relation between the detected entities. Observing that the supervised models achieve performance levels comparable to human experts, we leverage the best-performing models to generate silver labels for the unannotated instances. For the binary setup, we employ:

• LaMReDA (PubMedBERT large) with the relation representation $R_{A}$ (Eq. 9) for the RS corpus.
• LaMReDA (PubMedBERT large) with the relation representation $R_{J}$ (Eq. 18) for the AD corpus.

For the multi-class setup, we use:

• LaMReDA (PubMedBERT large) with the relation representation $R_{F}$ (Eq. 14) for the RS corpus.
• LaMReDA (PubMedBERT large) with the relation representation $R_{F}$ (Eq. 14) for the AD corpus.

We stress that the selection of the models relies on the performance in the 5-fold cross-validation setup to avoid choosing based on the model performance on the original test set. Each model is trained 10 times with different seeds using the original splits (Tab. 1) of ReDReS and ReDAD correspondingly. The best model weights are saved based on the performance on the development set. Every trained model provides the predictions for the unannotated instances and the final silver labels are extracted through majority voting. The Distantly Supervised Relation Detection dataset for Rett Syndrome (DiSReDReS) contains 304,008 instances with 8,611 unique CUIs and 80 semantic types (Tab. 5). The Distantly Supervised Relation Detection dataset for Alzheimer’s Disease (DiSReDAD) comprises 13,608,175 instances with 53,750 unique CUIs and 82 semantic types (Tab. 5). As noisy labeling is inevitable in distantly supervised data and imposes challenges for knowledge extraction scenarios, the two extensive datasets can promote weakly supervised learning.
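The majority-voting step can be illustrated with a short sketch. It assumes that each trained model outputs one label per unannotated instance in the same order; the tie-breaking rule (choosing the smallest label among the most frequent ones) is an assumption, since the text does not specify how ties are resolved, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one silver label per instance.

    predictions: list of label lists, one list per trained model (same length).
    """
    n_instances = len(predictions[0])
    if any(len(p) != n_instances for p in predictions):
        raise ValueError("all models must predict the same number of instances")
    silver = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # Assumed tie-break: smallest label among the most frequent ones.
        silver.append(min(label for label, c in counts.items() if c == top))
    return silver


# Toy predictions from three hypothetical runs over four instances.
toy_predictions = [
    [0, 1, 2, 3],
    [0, 1, 1, 3],
    [1, 1, 2, 0],
]
silver_labels = majority_vote(toy_predictions)
```

Using an odd number of voters reduces ties in the binary setup; with 10 runs, as described above, an explicit tie-breaking rule is needed.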

Data	Sentences	Instances	CUIs¹	S.T.²	Binary F1³	Multi-Class Micro F1³	Multi-Class Macro F1³
DiSReDReS	28,622	304,008	8,611	80	91.53	75.1	75.19
DiSReDAD	1,301,429	13,608,175	53,750	82	88.99	80.56	80.69

¹ The total number of unique CUIs.
² Semantic Types.
³ The benchmark performance (F1-score %) in the weakly supervised setup.

Table 5: DiSReDReS & DiSReDAD: Statistics and performance (%)

Dataset	Positive	Complex	Negative	No Relation
DiSReDReS	97,099 (31.9%)	105,861 (34.8%)	3,242 (1.1%)	97,806 (32.2%)
DiSReDAD	4,468,110 (32.8%)	5,755,884 (42.3%)	120,267 (0.9%)	3,263,914 (24%)

Table 6: DiSReDReS & DiSReDAD: Label Distribution

Weakly Supervised Setup. The task formulation remains the same as described in section 3 of the paper. The train sets of ReDReS and ReDAD are replaced by DiSReDReS and DiSReDAD, respectively. The development and test sets remain the same (Tab. 1). To provide a benchmark, we train LaMReDA (PubMedBERT base) with the relation representation $R_{A}$ (Eq. 9) for 10 epochs using the Adam optimizer with a learning rate of $10^{-5}$. The batch size is set to 32. The experiments are repeated 10 times with different seeds and the best scores are retained based on the performance on the development set.

In the supervised setup (Tab. 1), LaMReDA (PubMedBERT base) with the $R_{A}$ representation achieves 90.72% and 88.31% F1-score in the binary setup on ReDReS and ReDAD, respectively (Tab. 3). The multi-class macro F1-scores are 74.52% (micro: 74.49%) for ReDReS and 77.34% (micro: 77.64%) for ReDAD (Tab. 3). Table 5 presents the benchmark performance in the binary and multi-class setup for both datasets. Notably, the performance improves in the weakly supervised setup, indicating the robustness of LaMReDA when trained with noisy data and highlighting the quality of the silver labels of DiSReDReS and DiSReDAD.

Appendix G Probing: Additional Experiments

We use the same experimental setup as described in subsection 3.3 and the experiments are conducted in the 5-fold cross-validation setting. To provide an inclusive probing analysis on ReDReS, we incorporate additional probing results in this section. Figures 8 and 9 present the experiments in the multi-class setup using PubMedBERT base. Additionally, aiming to explore the probing capabilities of PubMedBERT large, we include the results of further experiments in Figures 10, 11, and 12. These experiments investigate the model’s performance in detecting semantic relations, comparing the representations and attention mechanisms at different layers and heads to understand how well the larger LM can discern complex relationships in the biomedical text.

Appendix H Detailed Annotation Guidelines

Task. Determine if there is any semantic relation between the two colored entities in the sentence.

General Instructions:

• Use Sentence Information Only: Base your annotation solely on the information provided within the sentence. Do not use external knowledge or prior information.
• Entity Check: Examine the entities and their types. If an entity is incorrect, if the entity span is inaccurate (includes irrelevant words), or if the entity type is incorrect (e.g., "Rett syndrome" categorized as part of the human body), click the "Remove First Entity" or "Remove Second Entity" button, corresponding to the error.
• Removing a Sentence: If a sentence lacks informative content, you have the option to remove it. Use this option if you are confident the sentence is uninformative.

Relation Categories:

• No Relation: Use this label if there’s no semantic relation between the entities in the sentence.
• Positive Relation: The two entities are directly, semantically connected.
• Negative Relation: The two entities are negatively correlated. This is a rare case, and negative words or phrases (e.g., "no," "absence") often indicate this.
• Complex Relation: Entities are related but not straightforwardly positive or negative. Complex reasoning might be needed to determine the semantic relation.

Annotation Process:

1. When presented with a pair, choose the relevant relation category label.
2. If you change your choice, you can adjust it by clicking a new button corresponding to the revised label.
3. Important: Once you press "Done", the instance can’t be retrieved, so ensure your decision is accurate.
4. Provide Relation Context: First, you need to finalize your choice for the relation labeling and then provide (if any) the related piece of text. If classifying a pair as related, specify the word or phrase in the sentence that influenced your decision. Use the text box provided and preferably copy-paste to avoid spelling errors. Press "Enter" after inputting the text to store it.