Causality extraction from medical text using Large Language Models (LLMs)

Seethalakshmi Gopalakrishnan (ORCID 0009-0006-8331-3476), University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, North Carolina, USA 28223; Luciana Garbayo, University of Central Florida, 6850 Lake Nona Blvd., Orlando, Florida, USA; and Wlodek Zadrozny, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, North Carolina, USA


Abstract.

This study explores the potential of natural language models, including large language models, to extract causal relations from medical texts, specifically from Clinical Practice Guidelines (CPGs). The outcomes of causality extraction from Clinical Practice Guidelines for gestational diabetes are presented, marking a first in the field. We report on a set of experiments using variants of BERT (BioBERT, DistilBERT, and BERT) and using Large Language Models (LLMs), namely GPT-4 and LLAMA2. Our experiments show that BioBERT performed better than other models, including the Large Language Models, with an average F1-score of 0.72. GPT-4 and LLAMA2 results show similar performance but less consistency. We also release the code and an annotated corpus of causal statements within the Clinical Practice Guidelines for gestational diabetes.

Causality extraction, Large Language Models, GPT-4, LLAMA2


1. Introduction

Clinical Practice Guidelines (CPGs) are a set of expert guidelines developed to guide physicians in navigating the complexities of the medical decision-making process. Various medical societies provide numerous such guidelines, based on their focus (e.g. cardiology vs. family medicine). This variability can lead to inconsistencies in the comparison and application of guidelines, as noted in our previous work (Hematialam et al., 2020, 2021). Recognizing these discrepancies is crucial for effective communication between patients and physicians.

Pre-trained language models like BERT (Devlin et al., 2018), which dynamically adjusts the weightings between each part of the output and all elements of the input based on their connection (attention), have demonstrated remarkable effectiveness on numerous natural language processing tasks (Devlin et al., 2018), including causality extraction (Gopalakrishnan et al., 2023).

More recent improvements come from Large Language Models (LLMs) like GPT-4 (OpenAI, 2023), which are pre-trained on extensive data and later enhanced through reinforcement learning feedback from both humans and AI to ensure adherence to human principles and policy compliance. Another recent model, also used in this article, is the open source LLAMA2 (Touvron et al., 2023). It was trained on 2 trillion tokens and is available in three different sizes (7B, 13B, and 70B). It is widely used in numerous tasks, particularly in information extraction (Wiest et al., 2023). Perhaps more importantly, LLAMA is a focus of research on understanding capabilities and structures in large language models (Chen et al., 2023; Gurnee and Tegmark, 2023).

Despite the emergence of LLMs, BERT continues to be one of the top-performing models for various applications, including causality extraction (Khetan et al., 2020; Lyu et al., 2022; Gopalakrishnan et al., 2023; Peng et al., 2019). This study aims to extract causal relations (‘causalities’) from the medical text in Clinical Practice Guidelines.

In medicine, the best explanations are causal, and they imply the opportunity for better recommendations. Causal reasoning is therefore used not only in producing medical guidelines but also in evaluating their impact over time in learning system loops. The mechanistic model of explanation of biological phenomena is preferred in biomedicine, even if it is probabilistic. Automated analysis is necessary for applications such as comparing differences in guidelines and diagnostic support, since there are currently over 37,000 medical guidelines indexed on PubMed as “practice guidelines” and two orders of magnitude more articles that are used to produce the guidelines. Most of them use causal statements. The main contributions of this study are:

• An entirely new type of public dataset of cause/effect relationships for Clinical Practice Guidelines. For medical text, only a very few causality extraction datasets are available (Mihăilă et al., 2013; Reklos and Meroño-Peñuela, 2022), and none of them focus on Clinical Practice Guidelines (CPGs).

• A performance evaluation of several known Large Language Models (LLMs) on the corpus of CPGs for the causality extraction task. The results indicate that the performance of GPT-4 does not increase when the number of prompt examples grows beyond 10. The LLAMA2 performance improves only marginally when the number of epochs is increased beyond five.

From the experiments and evaluation we conclude that variants of BERT might still be preferred for this task, given the ease of fine-tuning and consistent performance. With BioBERT, we obtained an average F1 score of 72%, whereas GPT-4 gave an average F1 score of 60%. LLAMA2 shows promise, in that an average F1 score of 76% was obtained on the subset for which it made predictions; however, LLAMA2 did not generate predictions for 20-35+% of the data.

2. Related Work

Causality extraction is the task of automatically extracting the cause/effect relationships from the text. In this section, we briefly discuss studies related to causality extraction.

2.1. Work related to automatic information extraction from Clinical Practice Guidelines (CPG)

This section summarizes the prior work related to information extraction from Clinical Practice Guidelines. An early system for extracting clinical findings from outpatient progress notes was described by (Ertle et al., 1996). Later work (Taboada et al., 2013) targeted the automated extraction of diagnosis and treatment procedures from clinical guidelines. A similar work (Kaiser and Miksch, 2010) introduced a method for automatically collecting useful information using rules rooted in both syntactic and semantic information. A pattern-based approach was further used by (Chunhua et al., 2014), which contrasted a manually developed ontology for CPG eligibility criteria with a top-level ontology stemming from a semantic pattern-based approach. A more recent work (Fazlic et al., 2019) introduced a system that blends the methodologies of Natural Language Processing (NLP) and Fuzzy Logic. A supervised machine learning methodology was used by another similar work (Graham et al., 2022) to extract and categorize Conflicts Of Interest (COIs) from disclosure statements indexed in PubMed.

Recently, Large Language Models (LLMs) have also been employed for a variety of NLP tasks, including those involving information extraction. LLMs can be fine-tuned to cater to a specific dataset, or a prompt-based approach can be utilized. An illustrative study (Zhao et al., 2021) measures the few-shot learning performance of GPT-3 in tasks related to text classification and information extraction. A recent survey article (Landolsi et al., 2023) provides a summary of the methods and solutions employed for information extraction. It also highlights the challenges encountered when extracting information from medical documents (such as ambiguities when a named entity belongs to more than one class, phrase boundary detection, name variations, and others).

2.2. Recent work on causality extraction from non-medical text

The study by (Li et al., 2021) directly extracts cause and effect from text without separately extracting candidate pairs and their relations. A work on event extraction (Man et al., 2022) focuses on identifying the causal relationship between pairs of event mentions, also known as ‘Event Causality Identification’ (ECI). Balashankar et al. (2019) propose an event extraction approach that seeks to uncover the hidden relationships between events mentioned in news streams by creating a Predictive Causal Graph (PCG). Prompt tuning has been proposed to bridge the gap between pre-training and fine-tuning on many of the mainstream NLP tasks, such as text classification (Schick and Schütze, 2020; Zhang et al., 2021) and information extraction (Chen et al., 2021; Cui et al., 2021). In (Liu et al., 2023), Knowledge Enhanced Prompt Tuning (KEPT) employs external knowledge sourced from knowledge bases (KBs) to fine-tune pre-trained language models through the design of an attention mechanism.

A recent article (Chan et al., 2023) describes the use of ChatGPT to extract cause/effect relationships from text on three datasets: (1) Choice of Plausible Alternatives (COPA) (Gordon et al., 2012), which is a collection of premises, along with two questions related to each premise, that requires causal reasoning in order to solve the inference; (2) e-CARE (Du et al., 2022), which is an explainable causal reasoning dataset with cause, effect, and two possible explanations; (3) the Headline Cause dataset (Gusev and Tikhonov, 2021), which aims to identify the implicit causal relations between pairs of texts. On COPA, ChatGPT in-context learning reached 97% accuracy; on the e-CARE dataset, 79.6% accuracy was obtained using prompt engineering; and 72.7% accuracy was recorded on Headline Cause.

2.3. Causality extraction from the medical text

In 2013 the task of automatic detection of conditional statements in medical guidelines was first introduced (Wenzina and Kaiser, 2013). The article used a rule-based approach, focusing on the presence of connectives such as “if”, and a collection of word-based syntactic patterns. Subsequent works on detecting condition-action statements in CPGs (Hematialam and Zadrozny, 2017; Hematialam, 2021) apply supervised machine learning techniques to classify sentences according to whether they express conditions and actions. Another study (Hussain et al., 2018) used heuristic patterns to identify recommendation statements in Clinical Practice Guidelines (CPGs). A review article (Fu et al., 2020) documents the existing methods and tools for clinical concept extraction. The summarization of biomedical literature is addressed in (Xie et al., 2022) using pre-trained language models. A more recent study (Tang et al., 2023) explores the use of ChatGPT for clinical text mining, specifically for extracting structured data from unstructured healthcare texts, focusing on biological named entity recognition and relation extraction by identifying and extracting medical entities from text related to diseases and drugs, symptoms and treatments, etc.

3. Data

We annotated seven documents of gestational diabetes clinical practice guidelines from various societies and medical entities, such as the American Diabetes Association (ADA) ((ame, 2020), (Metzger et al., 2010)), US Preventive Services Task Force (USPSTF) ((Davidson et al., 2021), (Pillay et al., 2021)), American College of Obstetricians and Gynecologists (ACOG) (mel, 2018), American Academy of Family Physicians (AAFP) (Mills and Mohnot, 2021), and Endocrine Society (Blumer et al., 2013).

The decision to annotate gestational diabetes clinical practice guidelines was based on the opportunity to explore causality inference in the future with situated learning models with prediction (team member Dr. Garbayo worked on safety and quality database development for mothers and children in maternities, resulting in the creation of the largest maternity database in Latin America (Leal et al., 2004)).

Two annotators were recruited. Given a medical text document, their task was to read the document and mark the cause, effect, condition, action, modal, and degree of influence with tags. The cause was marked as C, effect as E, condition as CO, and action as A. The beginning and end of each such phrase are marked with an opening and a closing tag; for example, the beginning of a cause phrase is marked as <C> and the end as </C>. For example:

Example 3.1.

<C>Pregnant persons with gestational diabetes</C> are at <E>increased risk for maternal and fetal complications</E> and may benefit from <A>early identification and treatment</A>.

3.1. Inter-annotator agreement for the medical data

Due to the intricacy of causality extraction, which involves annotators labeling varying text spans as “cause,” “effect,” and so on, computing agreement between two annotators can be challenging as it requires comparing two spans of text. Traditional methods of inter-annotator agreement, such as the Kappa statistic (Fleiss et al., 2013), are inadequate due to their need for classifications to fit into mutually exclusive and discrete categories. Therefore, we decided to assess agreement using both exact match and relaxed match criteria. The F-measure is used for the exact match (Hripcsak and Rothschild, 2005; Thompson et al., 2009; Mihăilă et al., 2013) between the labels. In the case of the relaxed match, the average distance between phrases is computed. Initially, the annotated phrases, their corresponding labels, and the full sentence they are derived from are extracted from the entire annotated document. These annotations, originating from both annotators, are then compared and amalgamated based on the sentence. The resulting merged table thus features the sentence, the extracted phrase, and the labels as marked by Annotator 1 and Annotator 2. In total, 514 matching phrases have been identified. An overall agreement computed as a Jaccard similarity of 0.66 was obtained. Details of the inter-annotator agreement computation are given below.

Figure 1. Distribution of the labels in the corpus. The percentage of almost all the labels is around 24%.

From the merged data table, the inter-annotator agreement was computed. This is done by computing the match between the annotations as follows.

• Relaxed match – The two annotators’ phrases overlap with each other but are not necessarily an exact match.

• Exact match – The two annotators’ phrases match exactly.

To execute the relaxed match, we employed the Levenshtein distance (Miller et al., 2009) and the Jaccard distance (Real and Vargas, 1996). The Levenshtein distance quantifies the difference between two strings as the minimum number of single-character edits required to transform one string into the other. Jaccard similarity computes the degree of relatedness between two finite samples by dividing the size of their intersection by the size of their union. The Jaccard distance is subsequently calculated by subtracting the Jaccard similarity from 1. The Python library Levenshtein (https://pypi.org/project/python-Levenshtein/) is used to compute the Levenshtein distance. The Jaccard index was computed using the Python library textdistance (https://pypi.org/project/textdistance/). The Levenshtein distance and the Jaccard distance between the annotators are summarized in Table 1.
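The two relaxed-match measures can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the functions are hand-written stand-ins for the Levenshtein and textdistance libraries, and both the normalization of the edit distance by the longer string's length and the use of lowercase word sets for the Jaccard distance are assumptions made here so that both values fall between 0 and 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over lowercase word sets (assumed tokenization)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Two annotators marking the same cause with different span lengths.
phrase_1 = "Pregnant persons with gestational diabetes"
phrase_2 = "gestational diabetes"
print(normalized_levenshtein(phrase_1, phrase_2))
print(jaccard_distance(phrase_1, phrase_2))
```

The example pair mirrors the situation described in the text, where both annotators pick the same concept but disagree on where the phrase starts.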

	Levenshtein distance	Jaccard distance
Cause	0.22	0.27
Condition	0.34	0.21
Effect	0.37	0.31
Action	0.87	0.48

Table 1. Relaxed match between the annotated phrases. Levenshtein distance is the minimum number of edits required to transform one phrase into another, whereas Jaccard distance is the amount of non-overlap between phrases. The lower the distance, the higher the agreement. The distance is highest for action. In most of the cases where there is a mismatch, the two annotators marked phrases of different lengths.

Overall, the average Levenshtein distance is 0.41 and the average Jaccard distance is 0.34. In most cases, both annotators annotated the same sentence with the same labels, but the length of the phrase was different. The exact match between the phrases is computed by finding the exact string match between the phrases of the two annotators. Out of the 514 phrases, 112 phrases are exact matches. The match between the labels assigned to the same phrase by both annotators is also computed, with an average F1 score of 0.78. The match between the labels for each subcategory is given in Table 2.

	Precision	Recall	F1-score
Cause	0.86	0.71	0.77
Condition	0.56	0.85	0.67
Effect	0.85	0.90	0.88
Action	0.89	0.70	0.78

Table 2. For a given phrase, the labels annotated by annotators 1 and 2 are compared. An average F1 score of 0.78 was obtained. The F1-scores show that the annotators agree on most of the categories; agreement is lowest for condition (F1-score of 0.67).

3.2. Data preparation and preprocessing

Seven documents on gestational diabetes guidelines provided by different societies are downloaded as PDF documents. The PDFs are converted into a document format, and the documents are given to the annotators for annotating them manually. The annotators used tags to annotate the documents.

After annotation, the NLTK sentence tokenizer is used to extract sentences from all the documents. The sentences from all the documents are appended together and converted into a data frame. Regular expressions are used to extract the causal sentences: any sentence that contains a tag (text enclosed in < and >) is extracted as a causal sentence. Regular expressions are then used again to extract the phrases of cause, effect, action, signal, and condition from the sentences. The extracted phrases are used for computing inter-annotator agreement.
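The tag-based extraction step can be sketched as below. This is an illustrative sketch rather than the released code: the pattern name and helper functions are hypothetical, sentence splitting with NLTK is left out, and only the four tags defined in Section 3 (C, E, CO, A) are handled, since the text does not state which tag marks a signal.

```python
import re

# Matches <C>...</C>, <E>...</E>, <CO>...</CO>, <A>...</A>.
# "CO" is listed before "C" only for readability; the closing ">" disambiguates.
TAG_PATTERN = re.compile(r"<(CO|C|E|A)>(.*?)</\1>", re.DOTALL)


def is_causal(sentence: str) -> bool:
    """A sentence counts as causal if it contains at least one tagged phrase."""
    return TAG_PATTERN.search(sentence) is not None


def extract_phrases(sentence: str) -> list:
    """Return (tag, phrase) pairs in the order they appear in the sentence."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(sentence)]


sentence = (
    "<C>Pregnant persons with gestational diabetes</C> are at "
    "<E>increased risk for maternal and fetal complications</E> and may "
    "benefit from <A>early identification and treatment</A>."
)
for tag, phrase in extract_phrases(sentence):
    print(tag, "|", phrase)
```

The sample sentence is Example 3.1; running the two annotators' versions of a document through the same function yields the phrase lists that are merged for the agreement computation.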

4. Methodology

4.1. Causality extraction using BERT

Given the good performance of DistilBERT with organizational data (Gopalakrishnan et al., 2023), this model was also applied to the medical data. Considering the limited sample size in medical data, we attempted to improve the learning process by increasing the number of epochs. This approach allows for more refined fine-tuning of the model.

In order to decide on the correct number of epochs and to avoid overfitting, we tried running the model for 100 epochs and plotted the validation loss and the training loss. The graph showing the train and validation loss for our highest performing model, BioBERT, is given in Figure 2.

The graph shows that as the number of epochs increases, the training loss keeps decreasing and approaches 0. The validation loss decreases until 18 epochs and then starts to increase. Based on this, we fine-tuned DistilBERT for 18 epochs, BERT (BERT-base-uncased) for 20 epochs, and BioBERT for 16 epochs.
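Choosing the number of epochs from the validation curve amounts to finding the epoch with the lowest validation loss. The sketch below shows that selection rule under simple assumptions; the function name and the loss values are hypothetical and are not taken from the experiments.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss (first one on ties)."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Hypothetical validation losses for epochs 1..6: they fall, then rise again
# once the model starts to overfit.
val_losses = [0.90, 0.71, 0.60, 0.55, 0.58, 0.66]
print(best_epoch(val_losses))  # → 4
```

Training loss is deliberately ignored here: it keeps falling even after the model has begun to overfit, so it cannot indicate where to stop.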

The data is split into train and test. DistilBERT for token classification is fine-tuned on the training data for 18 epochs. On the test data, the model obtained an average F1-score of 0.57. Similarly, we fine-tuned BioBERT for 16 epochs and BERT for 20 epochs. Out of these three models, BioBERT (Lee et al., 2020) gave the highest average F1-score. BioBERT gave an average F1 score of 0.61, and BERT gave an average F1 score of 0.60. The detailed results of fine-tuning BioBERT on the test data are given in Table 3; for comparison, the summary of the results of using variants of BERT for the causality extraction task is given in Table 4.

	Precision	Recall	F1-score	Support
E	0.82	0.75	0.78	696
C	0.62	0.69	0.65	411
CO	0.80	0.63	0.71	717
A	0.65	0.85	0.73	838
Macro average	0.72	0.73	0.72	2662

Table 3. Causality extraction results on the medical data using BioBERT, the highest performing model. Each token in the text was assigned a label Effect(E), Cause(C), Condition(CO), and Action(A). The results are obtained by splitting the manually annotated data into train and test data.

	Precision	Recall	F1-score
DistilBERT	0.69	0.68	0.68
BERT	0.72	0.72	0.71
BioBERT	0.72	0.73	0.72

Table 4. Summary of the results of causality extraction on medical text using the Pre-trained Language Model (BERT) and its variants. The gestational diabetes data is split into train and test data. All the models are fine-tuned on train data and tested on test data.

4.2. Observations on using GPT-4 for causality extraction from medical guidelines

Generative Pre-trained Transformer 4 (GPT-4) (OpenAI, 2023) outperforms most state-of-the-art models on the traditional NLP benchmark datasets. In this section, we discuss our results of prompting GPT-4-0314, with a context window of 8,192 tokens, for the causality extraction task. We explored various prompt sizes (zero, four, six, eight, ten, and twenty-shot prompting).

As an initial step, we used sentences with token-level labels for each word as prompt examples. For the test data, the model is expected to predict a label for each word in the sentence. However, the model hallucinated by predicting more labels than there are words in the given sentence, that is, long sequences of non-existent “non-causal” labels.

Since GPT-4 hallucinated on the token-level predictions, we instead had it extract the phrases of cause/effect relationships from the text and then converted them to token level by assigning a label to each token. We started with four-shot prompting. Annotated data with the tags is given as examples in the prompt, and the model is expected to produce predictions in the same format. A sample is given in Example 4.1.

Example 4.1.

<C>Gestational diabetes</C> has also been associated with an <E>increased risk of several long-term health outcomes in pregnant persons and intermediate outcomes in their offspring</E>

We tried converting the predictions with the tags into a token-level format in order to compute the F1 score. However, since the tags are placed in different positions in some of the gold annotations and predictions, the number of tokens in the gold data and in the predictions does not match. An example is given below.

Example 4.2.

Gold: Importance<C>Gestational diabetes</C> is diabetes that develops during pregnancy.1-3 Prevalence of gestational diabetes in the US has been estimated at 5.8% to 9.2%, based on traditional diagnostic criteria, although it may be higher if more inclusive criteria are used.4-8 <C>Pregnant persons with gestational diabetes</C> <E>increased risk for maternal and fetal complications, including preeclampsia, fetal macrosomia (which can cause shoulder dystocia and birth injury), and neonatal hypoglycemia</E> .3,9-11 <C>Gestational diabetes</C> has also been associated with an <E>increased risk of several long-term health outcomes in pregnant persons and intermediate outcomes in their offspring</E> .12-16Table 1.

Prediction: Importance Gestational diabetes is diabetes that develops during pregnancy. 1-3 Prevalence of gestational diabetes in the US has been estimated at 5.8% to 9.2%, based on traditional diagnostic criteria, although it may be higher if more inclusive criteria are used.4-8 <C>Pregnant persons with gestational diabetes</C> <E>increased risk for maternal and fetal complications, including preeclampsia, fetal macrosomia (which can cause shoulder dystocia and birth injury), and neonatal hypoglycemia. 3,9-11</E> <C>Gestational diabetes</C> has also been associated with an <E>increased risk of several long-term health outcomes in pregnant persons and intermediate outcomes in their offspring.12-16Table 1.</E>

In Example 4.2, the tags are placed differently in the gold data and in the prediction, which adds or removes spaces and leads to a difference in the number of tokens between the two. In the gold data, “neonatal hypoglycemia</E> .3,9-11” has a space after the closing tag, whereas in the prediction the closing tag is placed after the reference numbers (“neonatal hypoglycemia. 3,9-11</E>”). In some cases, GPT-4 also omits words that are not part of a causal relation (it drops some of the ‘O’ labels). This mismatch between the gold data and the predictions impedes the token-level comparison and the reporting of the F1 score. An example is given below:

Example 4.3.

Gold:Race/Ethnicity/Hemoglobinopathies<C>Hemoglobin variants </C> can <E>interfere with the measurement of A1C</E>, although most assays in use in the U.S. are unaffected by the most common variants.

Prediction: <C>Race/Ethnicity/Hemoglobinopathies variants</C> can interfere with the measurement of A1C, although most assays in use in the U.S. are unaffected by the most common variants.

In Example 4.3, the keyword “Hemoglobin”, which is present in the gold data, is missing from the prediction. In some places, such inconsistencies lead to a token mismatch between the gold and predicted data.

To compare the performance of GPT-4 with other models, the four-shot predictions are converted to token level and manually checked so that the gold data and the predictions have the same number of tokens. Tokens that are missing from the predictions are added back and labeled “O” (O indicates tokens that are not cause, effect, condition, action, or signal). After converting the data to token level, we computed the F1 score. With GPT-4, we got an average F1 score of 0.39 with four-shot prompting.
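The conversion from tagged text to token-level labels, and the way a shifted closing tag changes the token count, can be sketched as follows. This is a simplified illustration, assuming whitespace tokenization and the four tags C, E, CO, and A; the function name is hypothetical, and the manual alignment step described above is not reproduced.

```python
import re

TAG_RE = re.compile(r"<(/?)(CO|C|E|A)>")


def tagged_to_token_labels(text: str) -> list:
    """Return (token, label) pairs; tokens outside any tag are labeled 'O'."""
    pairs = []
    current = "O"  # label applied to tokens until the next tag is seen
    pos = 0
    for match in TAG_RE.finditer(text):
        for token in text[pos:match.start()].split():
            pairs.append((token, current))
        # A closing tag (group 1 == "/") returns to 'O'; an opening tag sets the label.
        current = "O" if match.group(1) else match.group(2)
        pos = match.end()
    for token in text[pos:].split():
        pairs.append((token, current))
    return pairs


# Tail of Example 4.2: the gold closing tag precedes the reference numbers,
# the predicted closing tag follows them.
gold = "<E>intermediate outcomes in their offspring</E> .12-16Table 1."
pred = "<E>intermediate outcomes in their offspring.12-16Table 1.</E>"

gold_pairs = tagged_to_token_labels(gold)
pred_pairs = tagged_to_token_labels(pred)
print(len(gold_pairs), len(pred_pairs))  # → 7 6
```

Because the two sequences differ in length, a position-by-position F1 computation is undefined until the sequences are aligned, which is why the predictions had to be checked by hand.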

5. Results & Experiments

Because the predictions of GPT-4 can be unreliable, and missing tokens in a sentence lead to a token mismatch between the gold data and the predicted data, the Jaccard distance is proposed as an alternative to the traditional F1 score as the evaluation criterion. The Jaccard similarity was computed using the textdistance Python library (https://pypi.org/project/textdistance/). Another alternative measure is the cosine similarity. It is obtained by encoding both the gold and the predicted phrases as vectors with the Universal Sentence Encoder (Cer et al., 2018) and then computing the pairwise cosine similarity between the vectors using Scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html). The cause, effect, signal, condition, and action phrases are extracted from the predictions using regular expressions on the tags. The extracted prediction phrases and the gold annotated phrases are merged. We perform two types of matching on the gold and predicted phrases.

• Jaccard similarity: To measure the lexical overlap between the gold data and the predictions.

• Cosine similarity: To measure the semantic similarity between the gold data and the predictions.
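For the second measure, the cosine similarity between two embedding vectors can be sketched in plain Python as below. This is only an illustration of the formula: in the study the vectors come from the Universal Sentence Encoder and the similarity is computed with Scikit-learn, whereas the short vectors here are hypothetical placeholders.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings" of a gold and a predicted phrase.
gold_vec = [0.2, 0.7, 0.1]
pred_vec = [0.1, 0.8, 0.3]
print(cosine_similarity(gold_vec, pred_vec))
```

The value depends only on the angle between the vectors, not on their lengths, so scaling an embedding leaves the score unchanged.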

The results of phrase level similarity between the gold annotated data and predictions of GPT-4 using various prompt sizes are summarized in Table 5.

	Jaccard similarity	Cosine similarity	F1 (labels)
Zero-shot	0.42	0.22	0.27
Four-shot	0.44	0.22	0.35
Six-shot	0.57	0.22	0.52
Eight-shot	0.52	0.23	0.55
Ten-shot	0.57	0.20	0.60
Twenty-shot	0.46	0.20	0.28

Table 5. The phrase-level comparison results of few-shot prompting using GPT-4. We tried various prompt sizes (zero, four, six, eight, ten, and twenty-shot prompting). At ten-shot prompting, the Jaccard similarity is at its highest value, 0.57, which is also reached at six-shot (the higher the similarity, the greater the overlap between the gold and predicted spans), and the F1-score between the labels is highest; both decrease at twenty-shot. The cosine similarity, which gives the semantic similarity between gold and predictions, stays between 0.20 and 0.23 for all prompt sizes. Here the F1-score is computed by comparing the gold labels and the predicted labels.

The results for the various prompt sizes on the medical data show that ten-shot prompting gives the highest label F1 score, together with the highest Jaccard similarity (shared with six-shot prompting).

The Jaccard similarity gives the similarity score based on the overlap between the gold and the predictions. The cosine similarity gives the semantic similarity between the gold and predicted phrases. There is not much difference in the cosine similarity with various prompt sizes, indicating that it may not be the right measure for this task. The F1-scores are computed by comparing the gold labels with the predicted labels (Jaccard and cosine similarity for the predicted phrases, F1-score for the labels). The detailed F1-score for the ten-shot prompting label match between gold and predictions is given in Table 6. In particular, we can see that the F1 score for cause, effect, and action is higher compared to the other labels. (This result is comparable with a recent work (Chan et al., 2023), indicating a strong performance of ChatGPT for extracting cause/effect relationships).

	Precision	Recall	F1 score
Action	0.56	0.90	0.69
Cause	0.60	0.74	0.66
Condition	0.95	0.17	0.29
Effect	0.74	0.79	0.76
Macro average	0.71	0.65	0.60

Table 6. Summary of the results of the GPT-4 predictions of ten-shot prompting on our medical data. Here the F1-score is computed by comparing the gold labels and the predicted labels. The F1 score for cause, effect, and action is higher compared to the condition. Many of the conditions are predicted as causes.
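As a worked check, the F1 column and the macro average of Table 6 can be recomputed from its precision and recall columns. The sketch assumes the standard definition of F1 as the harmonic mean of precision and recall and an unweighted (macro) average over the four labels; the numbers are copied from Table 6, and the helper names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values) -> float:
    """Unweighted mean over the labels."""
    return sum(values) / len(values)


# (precision, recall) per label, copied from Table 6.
table_6 = {
    "Action": (0.56, 0.90),
    "Cause": (0.60, 0.74),
    "Condition": (0.95, 0.17),
    "Effect": (0.74, 0.79),
}

per_label_f1 = {label: round(f1_score(p, r), 2) for label, (p, r) in table_6.items()}
macro_f1 = round(macro_average(list(per_label_f1.values())), 2)
print(per_label_f1)
print(macro_f1)
```

The condition row shows why the harmonic mean matters: a precision of 0.95 cannot compensate for a recall of 0.17, so its F1 drops to 0.29 and pulls the macro average down to 0.60.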

5.1. LLAMA2 for causality extraction from medical guidelines

LLAMA2 (Touvron et al., 2023) is a pre-trained and fine-tuned Large Language Model. Three publicly available variants of LLAMA2 differ in the number of parameters: 7B, 13B, and 70B. LLAMA2 is trained on two trillion tokens of data. In our experiments, the 7B-parameter LLAMA2 model is fine-tuned on the medical data using HuggingFace AutoTrain.

To fine-tune LLAMA2, the first step is to prepare the data. At first, when the model was fine-tuned and tested at the token level, as was done for BERT, LLAMA2 predicted long runs of “O” (other) labels, as GPT-4 did. We therefore treated the task as a phrase-level extraction problem. The data is prepared in three parts: instruction, input, and output. A sample training record is given in Example 5.1.

Example 5.1.

###Instruction: Extract the cause, condition, effect, signal, and action from the given sentence. ###Input: Pregnant persons with gestational diabetes are at increased risk for maternal and fetal complications, including preeclampsia, fetal macrosomia (which can cause shoulder dystocia and birth injury), and neonatal hypoglycemia. ###Output: [’Pregnant persons-signal’, ’with gestational diabetes -cause’, ’increased risk for maternal and fetal complications, including preeclampsia, fetal macrosomia (which can cause shoulder dystocia and birth injury), and neonatal hypoglycemia-effect’]
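A record in this instruction/input/output layout can be assembled with a small helper such as the one below. This is a sketch under stated assumptions: the function name is hypothetical, the single-line layout and the `phrase-label` item format are copied from Example 5.1, and nothing here reflects the exact preprocessing script used in the study.

```python
INSTRUCTION = (
    "Extract the cause, condition, effect, signal, and action "
    "from the given sentence."
)


def build_record(sentence, labeled_phrases=None):
    """Build one prompt record.

    labeled_phrases is a list of (phrase, label) pairs for training records;
    passing None leaves the output part empty, as needed for test records.
    """
    if labeled_phrases is None:
        output = ""
    else:
        output = str([f"{phrase}-{label}" for phrase, label in labeled_phrases])
    record = f"###Instruction: {INSTRUCTION} ###Input: {sentence} ###Output: {output}"
    return record.rstrip()


train_record = build_record(
    "Pregnant persons with gestational diabetes are at increased risk.",
    [("Pregnant persons", "signal"), ("with gestational diabetes", "cause")],
)
test_record = build_record("Hemoglobin variants can interfere with the measurement of A1C.")
print(train_record)
print(test_record)
```

Keeping the instruction text identical across training and test records means the only difference the model sees at inference time is the empty output part it has to complete.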

The test data has the same format as the training data, except that the output part is left empty. The annotated gestational diabetes data was split into train and test data. HuggingFace AutoTrain for LLM fine-tuning (https://huggingface.co/docs/autotrain/llm_finetuning) was used to fine-tune the model. The fine-tuned weights are pushed to HuggingFace for inference. This experiment was done using Google Colab Pro+ with a High-RAM A100 GPU. As with GPT-4, the predictions of LLAMA2 were at the phrase level, so a similar evaluation strategy is followed for LLAMA2. We present the results with three types of measures.

The predictions are split into phrases and then compared with the gold data. The Jaccard similarity was computed using the textdistance Python library (https://pypi.org/project/textdistance/). The cosine similarity is obtained by encoding both the gold and the predicted phrases as vectors with the Universal Sentence Encoder (Cer et al., 2018) and then computing the pairwise cosine similarity between the vectors using Scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).

Initially, we split the data into train and test sets using Scikit-learn's train_test_split() and converted the phrase-level predictions to the token level. The test data contained 59 samples, but LLAMA2 predicted labels for only 29 of them, so the evaluation covers only those sentences. With LLAMA2, we obtained an average F1-score of 0.36, which is lower than that of all the other models.
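The conversion from phrase-level predictions to token-level labels, followed by an F1 computation, can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: tokens are whitespace-separated, each phrase is matched at its first occurrence in the sentence, and the F1-score is macro-averaged over the non-"O" labels (the averaging used for the reported scores is not specified here).

```python
def phrases_to_token_labels(sentence, labeled_phrases, other="O"):
    """Project (phrase, label) pairs onto the whitespace tokens of a sentence."""
    tokens = sentence.split()
    labels = [other] * len(tokens)
    for phrase, label in labeled_phrases:
        words = phrase.split()
        n = len(words)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == words:
                labels[start:start + n] = [label] * n
                break  # label only the first match of the phrase
    return labels


def macro_f1(gold, predicted, other="O"):
    """Macro-averaged F1 over all labels except `other`."""
    classes = sorted((set(gold) | set(predicted)) - {other})
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, predicted))
        fp = sum(g != c and p == c for g, p in zip(gold, predicted))
        fn = sum(g == c and p != c for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


sentence = "Smoking causes lung cancer"
gold = phrases_to_token_labels(sentence, [("Smoking", "cause"), ("lung cancer", "effect")])
pred = phrases_to_token_labels(sentence, [("Smoking", "cause"), ("cancer", "effect")])
f1 = macro_f1(gold, pred)
```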

Since the test set is very small, we also ran four-fold cross-validation on this data. The results of fine-tuning LLAMA2 using four-fold cross-validation with 3, 5, and 10 epochs are given in Table 7.
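For illustration, a four-fold split can be produced with a few lines of standard-library Python. This is a sketch only: the shuffling, the seed, and the way samples are assigned to folds are assumptions, and the function name `k_fold_splits` is hypothetical.

```python
import random


def k_fold_splits(samples, k=4, seed=0):
    """Yield (train, test) lists for k-fold cross-validation.

    Every sample appears in exactly one test fold.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield [samples[i] for i in train_idx], [samples[i] for i in test_idx]


sentences = [f"sentence {i}" for i in range(10)]
splits = list(k_fold_splits(sentences, k=4))
```

Each fold would then be used once as test data, with the model fine-tuned on the remaining three folds and the metrics averaged across folds.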

As the number of epochs increases, both the Jaccard similarity and the F1-score increase. However, many LLAMA2 predictions lacked labels: the model extracted phrases without assigning a label. With three epochs, LLAMA2 missed 38% of the labels; with five epochs, 21%; and with ten epochs, 26%. We omitted the predictions with no labels (108, 60, and 76 predictions, respectively). The causality extraction results in Table 7 are computed after omitting these predictions.
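The filtering step described above can be sketched as follows. The parsing rule is an assumption based on the output format in Example 5.1: a prediction counts as labeled only if the text after its last hyphen is one of the five labels. The names `LABELS`, `parse_prediction`, and `drop_unlabeled` are hypothetical.

```python
LABELS = {"cause", "condition", "effect", "signal", "action"}


def parse_prediction(item):
    """Split 'phrase-label' on the last hyphen; return (phrase, label or None)."""
    phrase, sep, suffix = item.rpartition("-")
    label = suffix.strip().lower()
    if sep and label in LABELS:
        return phrase.strip(), label
    # No hyphen, or the suffix is not a known label: treat as unlabeled.
    return item.strip(), None


def drop_unlabeled(predictions):
    """Return the labeled predictions and the fraction that had no label."""
    parsed = [parse_prediction(p) for p in predictions]
    labeled = [(phrase, label) for phrase, label in parsed if label is not None]
    missed = len(parsed) - len(labeled)
    rate = missed / len(parsed) if parsed else 0.0
    return labeled, rate


predictions = [
    "Pregnant persons-signal",
    "with gestational diabetes -cause",
    "neonatal hypoglycemia",  # phrase extracted without a label
    "follow-up visit",        # hyphen inside the phrase, but no label
]
labeled, missed_rate = drop_unlabeled(predictions)
```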

The results show that the Jaccard similarity, cosine similarity, and F1-score all increase with the number of epochs. However, the share of missed labels rose again beyond five epochs (from 21% to 26%).

Model	Jaccard similarity	Cosine similarity	F1-score
LLAMA2 (3 epochs)	0.73	0.19	0.70
LLAMA2 (5 epochs)	0.888	0.20	0.75
LLAMA2 (10 epochs)	0.90	0.21	0.76

Table 7. Phrase-level comparison results for LLAMA2 using four-fold cross-validation. Jaccard similarity and cosine similarity are the average similarity between the gold phrases and the predictions. The F1-score compares the gold labels with the predicted labels.

6. Discussion

Above, we presented results on causality extraction from medical guidelines using the recently introduced large language models LLAMA2 and GPT-4, and compared them with the performance of BERT, an older and smaller LLM. The annotated data and the code are publicly available on GitHub: https://github.com/gseetha04/LLMs-Medicaldata.git.

We observed that GPT-4 shows strong performance on cause-effect relationships in medical data and generally has a good understanding of medical text without fine-tuning. However, in contrast with GPT-3.5, GPT-4 cannot deal with token classification, which limits the traditional way of finding cause and effect phrases, as discussed, e.g., in our previous work (Gopalakrishnan et al., 2023).

Even though LLAMA2 seems to perform well for causality extraction, it fails to assign labels in many cases, which limits its practical application. This was perhaps caused by fine-tuning LLAMA2 on our small dataset, so increasing the size of the dataset before fine-tuning may improve performance. However, large annotated datasets for CPGs are not available, and further experiments would require annotating more data. Since we focused on the accuracy of the actual predictions, we omitted the unlabeled predictions: 38% of labels with three epochs, 21% with five epochs, and 26% with ten epochs.

Given their relatively high performance and ease of use, BERT-based models continue to be state of the art for causality extraction tasks, even in the age of LLMs.

7. Conclusion

We developed an automated technique for extracting causalities from annotated corpora of medical guidelines. Additionally, we demonstrated the practicality of employing new Large Language Models for causality extraction tasks. With BioBERT, we obtained an average F1-score of 0.72, whereas with LLAMA2 we obtained an average Jaccard distance of 0.40. We showed the potential for extracting causalities from medical guidelines using a small annotated corpus. The next logical step could involve expanding the corpus by annotating more data and creating a benchmark dataset for causality extraction from medical guidelines.

This research opens up new directions for the health domain, as causality extraction from medical guidelines can enhance clinical decision-making and patient care. This work explored both machine learning and natural language processing techniques for causality extraction. Despite the abundance of causal sentences within these guidelines, their automatic extraction is an unexplored field of research. Moreover, machine learning models often fail in clinical applications (Schmidt, 2017) because of the gap between training and testing data. To close this gap, more realistic tests are needed so that the models can be applied to real-world data.

Credit authorship contribution statement

S.G. performed the majority of the experiments and writing. She also supervised the annotation process. L.G. chose the clinical guidelines data for annotations and participated in discussions and writing. W.Z. designed some of the experiments, provided feedback, and contributed to writing.

Acknowledgements.

This research was partly funded by the National Science Foundation (NSF) grant number 2141124. We would like to thank Nikhil Vundela for his contributions to data annotation. We would like to thank Dr. Wenwen Dou and Dr. Victor Zitian Chen for their feedback on the causality extraction tasks.

References

mel (2018) 2018. ACOG practice bulletin, Mellitus, Gestational Diabetes. ACOG: Washington, DC, USA (2018).
ame (2020) 2020. 2. Classification and diagnosis of diabetes: Standards of Medical Care in Diabetes—2020, American Diabetes Association. Diabetes care 43, Supplement_1 (2020), S14–S31.
Balashankar et al. (2019) Ananth Balashankar, Sunandan Chakraborty, Samuel Fraiberger, and Lakshminarayanan Subramanian. 2019. Identifying predictive causal factors from news streams. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2338–2348.
Blumer et al. (2013) Ian Blumer, Eran Hadar, David R Hadden, Lois Jovanovič, Jorge H Mestman, M Hassan Murad, and Yariv Yogev. 2013. Diabetes and pregnancy: An Endocrine society clinical practice guideline. The journal of clinical endocrinology & Metabolism 98, 11 (2013), 4227–4249.
Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018).
Chan et al. (2023) Chunkit Chan, Jiayang Cheng, Weiqi Wang, Yuxin Jiang, Tianqing Fang, Xin Liu, and Yangqiu Song. 2023. ChatGPT evaluation on sentence level relations: A focus on temporal, causal, and discourse relations. arXiv preprint arXiv:2304.14827 (2023).
Chen et al. (2023) Nuo Chen, Ning Wu, Shining Liang, Ming Gong, Linjun Shou, Dongmei Zhang, and Jia Li. 2023. Beyond Surface: Probing LLaMA Across Scales and Layers. arXiv preprint arXiv:2312.04333 (2023).
Chen et al. (2021) Xiang Chen, Ningyu Zhang, Lei Li, Xin Xie, Shumin Deng, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. 2021. Lightner: A lightweight generative framework with prompt-guided attention for low-resource NER. arXiv preprint arXiv:2109.00720 (2021).
Chunhua et al. (2014) WENG Chunhua, Philip RO Payne, Mark Velez, Stephen B Johnson, and Suzanne Bakken. 2014. Towards symbiosis in knowledge representation and natural language processing for structuring clinical practice guidelines. Studies in health technology and informatics 201 (2014), 461.
Cui et al. (2021) Leyang Cui, Yu Wu, Jian Liu, Sen Yang, and Yue Zhang. 2021. Template-based named entity recognition using BART. arXiv preprint arXiv:2106.01760 (2021).
Davidson et al. (2021) Karina W Davidson, Michael J Barry, Carol M Mangione, Michael Cabana, Aaron B Caughey, Esa M Davis, Katrina E Donahue, Chyke A Doubeni, Martha Kubik, Li Li, et al. 2021. Screening for gestational diabetes: US Preventive Services Task Force recommendation statement. JAMA 326, 6 (2021), 531–538.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Du et al. (2022) Li Du, Xiao Ding, Kai Xiong, Ting Liu, and Bing Qin. 2022. e-CARE: a new dataset for exploring explainable causal reasoning. arXiv preprint arXiv:2205.05849 (2022).
Ertle et al. (1996) Alan R Ertle, EM Campbell, and William R Hersh. 1996. Automated application of clinical practice guidelines for asthma management.. In Proceedings of the AMIA Annual Fall Symposium. American Medical Informatics Association, 552.
Fazlic et al. (2019) Lejla Begic Fazlic, Ahmed Hallawa, Anke Schmeink, Arne Peine, Lukas Martin, and Guido Dartmann. 2019. A novel NLP-fuzzy system prototype for information extraction from medical guidelines. In 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). IEEE, 1025–1030.
Fleiss et al. (2013) Joseph L Fleiss, Bruce Levin, and Myunghee Cho Paik. 2013. Statistical methods for rates and proportions. John Wiley & Sons.
Fu et al. (2020) Sunyang Fu, David Chen, Huan He, Sijia Liu, Sungrim Moon, Kevin J Peterson, Feichen Shen, Liwei Wang, Yanshan Wang, Andrew Wen, et al. 2020. Clinical concept extraction: a methodology review. Journal of Biomedical Informatics 109 (2020), 103526.
Gopalakrishnan et al. (2023) Seethalakshmi Gopalakrishnan, Victor Zitian Chen, Wenwen Dou, Gus Hahn-Powell, Sreekar Nedunuri, and Wlodek Zadrozny. 2023. Text to Causal Knowledge Graph: A Framework to Synthesize Knowledge from Unstructured Business Texts into Causal Graphs. Information 14, 7 (2023), 367.
Gordon et al. (2012) Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2012. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012). 394–398.
Graham et al. (2022) S Scott Graham, Zoltan P Majdik, Johua B Barbour, and Justin F Rousseau. 2022. Associations Between Aggregate NLP-extracted Conflicts of Interest and Adverse Events By Drug Product. Studies in health technology and informatics 290 (2022), 405.
Gurnee and Tegmark (2023) Wes Gurnee and Max Tegmark. 2023. Language models represent space and time. arXiv preprint arXiv:2310.02207 (2023).
Gusev and Tikhonov (2021) Ilya Gusev and Alexey Tikhonov. 2021. HeadlineCause: A Dataset of News Headlines for Detecting Causalities. arXiv preprint arXiv:2108.12626 (2021).
Hematialam (2021) Hossein Hematialam. 2021. Knowledge Extraction and Analysis of Medical Text with Particular Emphasis on Medical Guidelines. Ph. D. Dissertation. The University of North Carolina at Charlotte.
Hematialam et al. (2020) Hossein Hematialam, Luciana Garbayo, Seethalakshmi Gopalakrishnan, and Wlodek Zadrozny. 2020. Computing Conceptual Distances between Breast Cancer Screening Guidelines: An Implementation of a Near-Peer Epistemic Model of Medical Disagreement. arXiv preprint arXiv:2007.00709 (2020).
Hematialam et al. (2021) Hossein Hematialam, Luciana Garbayo, Seethalakshmi Gopalakrishnan, and Wlodek W Zadrozny. 2021. A Method for Computing Conceptual Distances between Medical Recommendations: Experiments in Modeling Medical Disagreement. Applied Sciences 11, 5 (2021), 2045.
Hematialam and Zadrozny (2017) Hossein Hematialam and Wlodek Zadrozny. 2017. Identifying condition-action statements in medical guidelines using domain-independent features. arXiv preprint arXiv:1706.04206 (2017).
Hripcsak and Rothschild (2005) George Hripcsak and Adam S Rothschild. 2005. Agreement, the f-measure, and reliability in information retrieval. Journal of the American medical informatics association 12, 3 (2005), 296–298.
Hussain et al. (2018) Musarrat Hussain, Jamil Hussain, Muhammad Sadiq, Anees Ul Hassan, and Sungyoung Lee. 2018. Recommendation statements identification in clinical practice guidelines using heuristic patterns. In 2018 19th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). IEEE, 152–156.
Kaiser and Miksch (2010) Katharina Kaiser and Silvia Miksch. 2010. Supporting the abstraction of clinical practice guidelines using information extraction. In International Conference on Application of Natural Language to Information Systems. Springer, 304–311.
Khetan et al. (2020) Vivek Khetan, Roshni Ramnani, Mayuresh Anand, Shubhashis Sengupta, and Andrew E Fano. 2020. Causal BERT: Language models for causality detection between events expressed in text. arXiv preprint arXiv:2012.05453 (2020).
Landolsi et al. (2023) Mohamed Yassine Landolsi, Lobna Hlaoua, and Lotfi Ben Romdhane. 2023. Information extraction from electronic medical documents: state of the art and future research directions. Knowledge and Information Systems 65, 2 (2023), 463–516.
Leal et al. (2004) Maria do Carmo Leal, Silvana Granado Nogueira da Gama, Mônica Rodrigues Campos, Luciana Tricai Cavalini, Luciana Sarmento Garbayo, Carla Lopes Porto Brasil, and Célia Landmann Szwarcwald. 2004. Factors associated with perinatal morbidity and mortality in a sample of public and private maternity centers in the City of Rio de Janeiro, 1999-2001. Cadernos de Saúde Pública 20 (2004), S20–S33.
Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.
Li et al. (2021) Zhaoning Li, Qi Li, Xiaotian Zou, and Jiangtao Ren. 2021. Causality extraction based on self-attentive BiLSTM-CRF with transferred embeddings. Neurocomputing 423 (2021), 207–219.
Liu et al. (2023) Jintao Liu, Zequn Zhang, Zhi Guo, Li Jin, Xiaoyu Li, Kaiwen Wei, and Xian Sun. 2023. KEPT: Knowledge Enhanced Prompt Tuning for event causality identification. Knowledge-Based Systems 259 (2023), 110064.
Lyu et al. (2022) Chenyang Lyu, Tianbo Ji, Quanwei Sun, and Liting Zhou. 2022. DCU-Lorcan at FinCausal 2022: Span-based Causality Extraction from Financial Documents using Pre-trained Language Models. In Proceedings of the 4th Financial Narrative Processing Workshop@ LREC2022. 116–120.
Man et al. (2022) Hieu Man, Minh Nguyen, and Thien Nguyen. 2022. Event Causality Identification via Generation of Important Context Words. In Proceedings of the 11th Joint Conference on Lexical and Computational Semantics. 323–330.
Metzger et al. (2010) Boyd E Metzger, Steven G Gabbe, Bengt Persson, Lynn P Lowe, Alan R Dyer, Jeremy JN Oats, and Thomas A Buchanan. 2010. International association of diabetes and pregnancy study groups recommendations on the diagnosis and classification of hyperglycemia in pregnancy: response to Weinert. Diabetes care 33, 7 (2010), e98–e98.
Mihăilă et al. (2013) Claudiu Mihăilă, Tomoko Ohta, Sampo Pyysalo, and Sophia Ananiadou. 2013. BioCause: Annotating and analysing causality in the biomedical domain. BMC bioinformatics 14 (2013), 1–18.
Miller et al. (2009) Frederic P Miller, Agnes F Vandome, and John McBrewster. 2009. Levenshtein distance: Information theory, computer science, string (computer science), string metric, Damerau–Levenshtein distance, spell checker, Hamming distance.
Mills and Mohnot (2021) Justin Mills and Sopan Mohnot. 2021. Screening for Gestational Diabetes. American Family Physician 104, 6 (2021), 641–642.
OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
Peng et al. (2019) Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474 (2019).
Pillay et al. (2021) Jennifer Pillay, Lois Donovan, Samantha Guitard, Bernadette Zakher, Michelle Gates, Allison Gates, Ben Vandermeer, Christina Bougatsos, Roger Chou, and Lisa Hartling. 2021. Screening for gestational diabetes: updated evidence report and systematic review for the US preventive services task force. JAMA 326, 6 (2021), 539–562.
Real and Vargas (1996) Raimundo Real and Juan M Vargas. 1996. The probabilistic basis of Jaccard’s index of similarity. Systematic biology 45, 3 (1996), 380–385.
Reklos and Meroño-Peñuela (2022) Ioannis Reklos and Albert Meroño-Peñuela. 2022. Medicause: Causal relation modelling and extraction from medical publications. In Proceedings of the 1st International Workshop on Knowledge Graph Generation From Text co-located with 19th Extended Semantic Conference (ESWC 2022), Hersonissos, Greece, Vol. 3184. 1–18.
Schick and Schütze (2020) Timo Schick and Hinrich Schütze. 2020. Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676 (2020).
Schmidt (2017) Charlie Schmidt. 2017. MD Anderson breaks with IBM Watson, raising questions about artificial intelligence in oncology.
Taboada et al. (2013) Maria Taboada, Maria Meizoso, D Martínez, David Riano, and Albert Alonso. 2013. Combining open-source natural language processing tools to parse clinical practice guidelines. Expert Systems 30, 1 (2013), 3–11.
Tang et al. (2023) Ruixiang Tang, Xiaotian Han, Xiaoqian Jiang, and Xia Hu. 2023. Does synthetic data generation of llms help clinical text mining? arXiv preprint arXiv:2303.04360 (2023).
Thompson et al. (2009) Paul Thompson, Syed A Iqbal, John McNaught, and Sophia Ananiadou. 2009. Construction of an annotated corpus to support biomedical information extraction. BMC bioinformatics 10 (2009), 1–19.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
Wenzina and Kaiser (2013) Reinhardt Wenzina and Katharina Kaiser. 2013. Identifying condition-action sentences using a heuristic-based information extraction method. In Process Support and Knowledge Representation in Health Care. Springer, 26–38.
Wiest et al. (2023) Isabella Catharina Wiest, Dyke Ferber, Jiefu Zhu, Marko Van Treeck, Sonja Katharina Meyer, Radhika Juglan, Zunamys I Carrero, Daniel Paech, Jens Kleesiek, Matthias P Ebert, et al. 2023. From text to tables: a local privacy preserving large language model for structured information retrieval from medical documents. medRxiv (2023), 2023–12.
Xie et al. (2022) Qianqian Xie, Jennifer Amy Bishop, Prayag Tiwari, and Sophia Ananiadou. 2022. Pre-trained language models with domain knowledge for Biomedical extractive summarization. Knowledge-Based Systems (2022), 109460.
Zhang et al. (2021) Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, and Huajun Chen. 2021. Differentiable prompt makes pre-trained language models better few-shot learners. arXiv preprint arXiv:2108.13161 (2021).
Zhao et al. (2021) Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning. PMLR, 12697–12706.