Rapid Biomedical Research Classification:
The Pandemic PACT Advanced Categorisation Engine

Omid Rohanian1,2, Mohammadmahdi Nouriborji1,7, Olena Seminog3,
Rodrigo Furst3, Thomas Mendy3, Shanthi Levanita3, Zaharat Kadri-Alab3,
Nusrat Jabin3, Daniela Toale3, Georgina Humphreys5,
Emilia Antonio3, Adrian Bucher6, Alice Norton3, David A. Clifton2,4

1NLPie Research, Oxford, UK
2Department of Engineering Science, University of Oxford, Oxford, UK
3Pandemic Sciences Institute, University of Oxford, Oxford, England, UK
4Oxford-Suzhou Centre for Advanced Research, Suzhou, China
5Green Templeton College, University of Oxford, Oxford, England, UK
6UK Collaborative on Development Research, London, UK
7Sharif University of Technology, Tehran, Iran

[email protected]
Abstract

This paper introduces the Pandemic PACT Advanced Categorisation Engine (PPACE) along with its associated dataset. PPACE is a fine-tuned model developed to automatically classify research abstracts from funded biomedical projects according to WHO-aligned research priorities. This task is crucial for monitoring research trends and identifying gaps in global health preparedness and response. Our approach builds on human-annotated projects, which are allocated one or more categories from a predefined list. A large language model is then used to generate ‘rationales’ explaining the reasoning behind these annotations. This augmented data, comprising expert annotations and rationales, is subsequently used to fine-tune a smaller, more efficient model. Developed as part of the Pandemic PACT project, which aims to track and analyse research funding and clinical evidence for a wide range of diseases with outbreak potential, PPACE supports informed decision-making by research funders, policymakers, and independent researchers. We introduce and release both the trained model (https://huggingface.co/nlpie/ppace-v1.0) and the instruction-based dataset used for its training (https://huggingface.co/datasets/nlpie/pandemic_pact). Our evaluation shows that PPACE significantly outperforms its baselines. The release of PPACE and its associated dataset offers valuable resources for researchers in multilabel biomedical document classification and supports advancements in aligning biomedical research with key global health priorities.



1 Introduction

The surveillance and monitoring of emerging and re-emerging pathogens are vital to global health security. Infectious diseases with the potential to cause pandemics represent a significant threat to public health, economies, and societies worldwide. Efficiently tracking these threats and coordinating research efforts is essential to mitigate their impact. Traditional approaches to research funding and coordination during health crises, such as the COVID-19 pandemic, often exhibit limitations including slow activation of research projects, duplication of efforts, and fragmented funding landscapes. These issues highlight the need for improved systems to manage and analyse research activities (Carroll et al., 2021; McLean et al., 2022).

The rapid identification and response to infectious disease threats require a well-coordinated and systematic approach to research funding and project tracking. Accurate categorisation and analysis of research projects are crucial for identifying trends, understanding research gaps, and ensuring that resources are allocated effectively. This task is inherently complex, given the diverse and interdisciplinary nature of biomedical research (Seminog et al., 2024).

Artificial intelligence (AI), particularly large language models (LLMs), offers a promising solution to enhance the efficiency and accuracy of research categorisation. LLMs can be fine-tuned to assist in the classification of research abstracts, providing valuable support to human annotators. These models not only streamline the categorisation process but also generate rationales for their decisions, adding a layer of interpretability and transparency that is essential for gaining the trust of researchers and policymakers. (LLMs are typically seen as complex and opaque, and their interpretability is a nuanced, ongoing research topic (Luo and Specia, 2024). Here, we use “interpretable” to mean that medical practitioners find the process more engaging and understandable than with encoder-only models, as LLMs can explain their reasoning in human language.)

This paper introduces the Pandemic PACT Advanced Categorisation Engine (PPACE), a fine-tuned LLM designed to classify biomedical research abstracts according to WHO-aligned research priorities. PPACE leverages human-annotated data and employs generative AI to produce rationales for each classification. By automating the categorisation process, PPACE aims to enhance the monitoring of research trends and the identification of critical gaps in global health preparedness and response.

In the remainder of the paper, we first provide an overview of the literature on the use of LLMs in biomedical document classification (Section 2). Next, we introduce the Pandemic PACT project which this work builds on, and describe the details of the dataset and the annotation procedure involved in its creation (Section 3). Section 4 describes the methodology for fine-tuning the PPACE model, Section 5 presents the results, and Section 6 concludes the paper. The contributions of this work are as follows:

  1. We contribute to the task of biomedical document classification by publicly releasing a carefully annotated dataset of research projects (each project containing a title and a PubMed-style abstract) gathered as part of the Pandemic PACT project and further preprocessed to include rationales generated by a 70B LLM. The augmented dataset is formatted as an instruction-based dataset and can be used by the research community to train similar models.

  2. We fine-tune an 8B model on the aforementioned dataset and publicly release the model weights.

  3. We perform a range of analyses on the dataset to shed light on the complexities of the data and run a number of evaluations to compare the model against its baselines.

2 Biomedical Document Classification and the Use of LLMs

Biomedical document classification is an active area of research that has attracted considerable attention in recent years (Laza et al., 2011). The PubMed 200k RCT dataset, for instance, focuses on classifying different sections of randomized controlled trial abstracts into categories like objectives, methods, results, and conclusions (Dernoncourt and Lee, 2017). Another notable task in this area is the Hallmarks of Cancer (HoC), which presents a multi-label classification challenge and aims to identify key cancer-related research themes from PubMed abstracts (Baker et al., 2015). The LitCovid dataset (Chen et al., 2020; Jimenez Gutierrez et al., 2020), which comprises over 30,000 COVID-19-related articles, each annotated with one or more topics relevant to the pandemic, is another major biomedical document classification benchmark that has been studied in the literature. Automated topic annotation tasks like this can significantly reduce the manual curation burden during pandemics, and the present work can be considered a more generalized effort to automate the classification of biomedical literature into research themes of interest. Additional datasets relevant to biomedical document classification include the BioCreative Corpus III (BC3, Arighi et al., 2011) and TREC (Hersh et al., 2006). Behera et al. (2019) provide an overview of this task and the various deep learning algorithms used to address it.

Large Language Models (LLMs) have become ubiquitous in various text processing and classification tasks, including document classification. Their ability to handle a wide range of text-related tasks makes them particularly appealing for numerous applications. LLMs can be instructed to perform specific tasks via few-shot examples or through fine-tuning with detailed instructions. For biomedical researchers, generative LLMs are especially valuable because they can be interfaced with using natural human language, facilitating more intuitive and effective interactions.

SciFive (Phan et al., 2021) is a domain-specific T5 model that has been pretrained to address a number of biomedical tasks, including document classification. Rohanian et al. (2023) is the first attempt to use generative language models to address classical biomedical text processing tasks like HoC via instruction tuning. Chen et al. (2023) and Tian et al. (2024) have studied the use of LLMs in a number of biomedical text processing tasks, including document classification, although the focus is mostly on closed-source frozen models like GPT-4.

Various techniques are observed in the literature regarding the use of LLMs when addressing this task. Several studies use parameter-efficient fine-tuning methods (Hu et al., 2021; Taylor et al., 2024; Jiang et al., 2024) which have become very prevalent due to the ease of use and faster training time they offer. Our work not only employs instruction tuning and LoRA as a parameter-efficient fine-tuning technique, but also draws inspiration from Hsieh et al. (2023) in that it utilises ‘rationales’ generated by a larger model to augment the labelled dataset and then fine-tunes a smaller one trained on the expanded dataset.

3 Dataset Overview and Sources

3.1 About Pandemic PACT

The Pandemic Preparedness Analytical Capacity and Funding Tracking Programme (Pandemic PACT) operates under the auspices of the Global Research Collaboration for Infectious Disease Preparedness (GloPID-R, Norton et al., 2020; https://www.glopid-r.org/) at the University of Oxford’s Pandemic Sciences Institute. This initiative aims to enhance global response capabilities by tracking and analysing research funding for diseases with pandemic potential and other significant public health threats. By aligning with the WHO priority diseases, Pandemic PACT focuses on dynamic data collection and rigorous analysis to inform critical policy and funding decisions across the health system and public health domains (Norton et al., 2024; Seminog et al., 2024).

3.2 The Pandemic PACT Funding Tracker

The Pandemic PACT Funding Tracker is an integral component of this initiative. It has collected detailed information on research grants from GloPID-R and UKCDR members since January 2020 and has since expanded to include a much broader set of international funding bodies. (The UK Collaborative on Development Research, UKCDR, coordinates development research funding in the UK to optimise effectiveness and strategic alignment; more information is available at https://www.ukcdr.org.uk.) This tool maps the alignment of funding to critical research categories and priorities, displaying the data through an interactive dashboard that visualises funding trends and evidence gaps. The database includes diseases listed on the WHO R&D Blueprint priority list, such as pandemic influenza, mpox, and plague, among others. This comprehensive and evolving tool not only aids in real-time decision-making but also provides downloadable data for broader analysis, accessible via the official Pandemic PACT website at http://www.pandemicpact.org/.

3.3 Annotation Procedure

The Pandemic PACT database expands upon the previous database co-developed by UKCDR and GloPID-R as part of the COVID-19 Research Coordination and Learning initiative (COVID CIRCLE). Pandemic PACT funding data on other diseases and additional COVID-19 research projects are collected either through direct data provision by funders (using a standardised template) or by scraping funder websites. The scraping process is based on search terms including disease-specific keywords, acronyms, virus, and virus family names. For the detailed search protocol, inclusion criteria, and transformation of the COVID CIRCLE data into the new standardised schema, see Seminog et al. (2024). Only grants that include a minimum level of essential information are included, such as grant award or start date, publication date, funder name, grant ID or another form of identifier, and grant title. The data encompass funding information from January 2020 onwards for the relevant diseases.

While all Pandemic PACT search terms are in English, the search does not exclude grants in other languages. If the search returns relevant grants in other languages, their titles and abstracts are translated using Google Translate and then included in the database. All collected data are stored in their original format as retrieved from the funding source, with basic data cleaning performed to remove special characters from textual fields.

All collected data are reviewed by a team of trained researchers from broad public health backgrounds to determine their relevance, classified against a research categorisation framework developed under Pandemic PACT, and assigned other relevant tags using manual annotation. The number of team members has varied over time, starting with three, increasing to ten before the launch of the Pandemic PACT tracker, then decreasing to six, and currently stabilising at four. The team size is subject to change based on project needs. Over the first months of the project, Pandemic PACT developed a standard approach to training and preparing new members of the data collection team through a series of training steps. First, they were exposed to tutorials of training material and videos that explained how to interpret data and submit contributions through the online interface. After that, data coders were expected to attend a weekly all-contributor meeting, at which point they started being included in the regular coding allocation. These meetings were used for expanding comprehension of the coding schema and processes, facilitating a collective consensus on interpretations of codes, and effectively probing coding disagreements.

Dataset Number of Projects
Training Set 5142
Test Set 1450
Table 1: Composition of the dataset used for training and testing the classification model.

After entries are added, they are marked as ‘unverified’ in the back-end database portal used by Pandemic PACT if any issues arise or if the coder is unsure how to code them. This flags them for the review process. Conversely, entries are marked as ‘complete’ if no concerns are raised. To ensure data reliability, Pandemic PACT mandates peer review of all new data by at least two annotators, ensuring each grant undergoes scrutiny and confirmation by an independent coder. In cases of inter-annotator disagreement, discussions are held to reach joint decisions. Alternatively, judgments from a designated coder, such as the Principal Investigator or a more experienced researcher, take precedence over others. Going forward, Pandemic PACT plans to implement a comprehensive approach where initial coding is performed by an LLM, followed by manual verification and final annotation.

Category Number Research Category
1 Pathogen: Natural History, Transmission, and Diagnostics
2 Animal and Environmental Research & Research on Diseases Vectors
3 Epidemiological Studies
4 Clinical Characterisation and Management in Humans
5 Infection Prevention and Control
6 Therapeutics Research, Development, and Implementation
7 Vaccines Research, Development, and Implementation
8 Research to Inform Ethical Issues
9 Policies for Public Health, Disease Control, and Community Resilience
10 Secondary Impacts of Disease, Response, and Control Measures
11 Health Systems Research
12 Capacity Strengthening
Table 2: The full list of research categories used to annotate the dataset.

3.4 Dataset Description

Our study employs a carefully selected sample from the Pandemic PACT database. Each row in the data represents a funded research project and includes a title and an abstract which provides a concise description of each project’s aims, methods, and potential impacts. The data is randomly divided into an approximate 80/20 split with the number of rows shown in Table 1.
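As a rough illustration of the random split described above, the following sketch shuffles a list of projects and holds out a fixed fraction for testing. The function name, the seed, and the toy records are assumptions made for this example; the paper does not specify how the split was implemented. For reference, the counts in Table 1 correspond to roughly 78% training and 22% test data.

```python
import random

def split_projects(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test).

    `seed` and `test_fraction` are illustrative defaults, not values
    reported in the paper.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-ins for funded projects (hypothetical records).
projects = [{"id": i} for i in range(10)]
train, test = split_projects(projects)
```

Fixing the seed keeps the split reproducible, which matters when the same test set is reused to compare several models.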

To gain insights into the training set, we analysed the lengths of the project titles and abstracts as well as the distribution of research categories. The statistics for the lengths of project titles and abstracts are presented in Table 3. During finetuning, to keep the computation manageable, the abstract length is capped at 512 tokens. The numbers in the table reflect the lengths as seen in the dataset before this truncation is applied.

| Unit | Measure | Title | Abstract |
| Characters | Average Length | 98.24 | 1940.37 |
| Characters | Max Length | 850 | 6817 |
| Words | Average Length | 13.10 | 279.72 |
| Words | Max Length | 133 | 1036 |
Table 3: Statistics of project titles and abstracts in the training set. The measure of words is an approximation based on space separation.
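The statistics in Table 3 can be reproduced in spirit with a few lines of code. The sketch below is a minimal version that counts characters directly and approximates words by splitting on whitespace; the function name and the toy titles are assumptions for illustration, not part of the released dataset tooling.

```python
def length_stats(texts):
    """Average and maximum length in characters and in whitespace-separated words."""
    char_lens = [len(t) for t in texts]
    # Splitting on whitespace only approximates a word count.
    word_lens = [len(t.split()) for t in texts]
    return {
        "char_avg": sum(char_lens) / len(char_lens),
        "char_max": max(char_lens),
        "word_avg": sum(word_lens) / len(word_lens),
        "word_max": max(word_lens),
    }

# Hypothetical titles used only to exercise the function.
titles = ["Vaccine trial", "A study of mpox transmission"]
stats = length_stats(titles)
```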

The distribution of individual research categories (see Table 2) assigned to the projects in the training set is depicted in Figure 1. This figure shows that the most frequent research categories are Pathogen: Natural History, Transmission, and Diagnostics (Category 1), Secondary Impacts of Disease, Response, and Control Measures (Category 10), and Clinical Characterisation and Management in Humans (Category 4). The least frequent categories are Research to Inform Ethical Issues (Category 8), Capacity Strengthening (Category 12), and Infection Prevention and Control (Category 5). These categories are expected to pose more challenges for the model because they have fewer labelled examples.

Figure 1: Individual Label Distribution in the Training Set.
Figure 2: Top 12 Combined Label Distribution in the Training Set.

3.5 Combined Label Distributions

We also examined the combined label distributions to understand the most common combinations of research categories assigned to the projects. Figure 2 shows the top 12 most frequent label clusters, indicating that combinations such as Pathogen & Clinical Characterisation in Humans (1, 4), and Policies for Public Health & Secondary Impacts of Disease (9, 10) are prevalent. Table 2 provides the mapping of category numbers to their respective research categories.
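Counting label combinations of this kind is straightforward once each project's categories are treated as an unordered set. The sketch below is one possible way to do it, assuming each project is represented by a set of category numbers from Table 2; the function name and the toy data are hypothetical.

```python
from collections import Counter

def top_label_combinations(label_sets, k=12):
    """Return the k most frequent label combinations with their counts."""
    # Sorting makes {4, 1} and {1, 4} map to the same key.
    counts = Counter(tuple(sorted(s)) for s in label_sets)
    return counts.most_common(k)

# Toy annotations (hypothetical), one set of category numbers per project.
toy_labels = [{1, 4}, {4, 1}, {9, 10}, {1}, {9, 10}, {1, 4}]
top = top_label_combinations(toy_labels)
```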

3.6 Label Correlations

Understanding the correlations between different research categories provides insights into interdisciplinary research trends visible in the training set. These correlations highlight how different fields of study intersect, helping us identify areas where models might struggle or easily pick up patterns.

Apart from the frequent combinations shown in Figure 2, we also found notable intersections between Epidemiological Studies (Category 3) and Clinical Characterisation in Humans (Category 4), and between Pathogen (Category 1) and Therapeutics Research (Category 6). A properly trained model should be able to detect these patterns while also recognising instances where these correlations do not hold.

The conditional probabilities for the top five most frequent pairs of research categories are shown in Table 4. For example, the highest conditional probability for the combination (1, 3) is for label 4 at 0.39, and for (1, 4), the highest is label 3 at 0.17. These findings suggest that certain third-label correlations exist, but they are not overwhelmingly strong.

| Combination | Top Conditional Probability | Probability |
| {1, 4} | P(3 | {1, 4}) | 0.17 |
| {3, 4} | P(1 | {3, 4}) | 0.35 |
| {10, 9} | P(11 | {10, 9}) | 0.12 |
| {1, 6} | P(4 | {1, 6}) | 0.27 |
| {1, 3} | P(4 | {1, 3}) | 0.39 |
Table 4: Top conditional probabilities for the most frequent pairs of research categories in the training set.
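A conditional probability such as P(4 | {1, 3}) can be estimated as the share of projects carrying both labels of the pair that also carry the third label. The sketch below follows that reading; whether the paper conditions on projects containing at least the pair (as assumed here) or exactly the pair is not stated, and the function name and toy data are illustrative.

```python
def third_label_probability(label_sets, pair, third):
    """Estimate P(third | pair) among projects whose labels include `pair`."""
    pair = set(pair)
    with_pair = [s for s in label_sets if pair <= set(s)]
    if not with_pair:
        return 0.0
    return sum(third in s for s in with_pair) / len(with_pair)

# Toy annotations (hypothetical).
toy_labels = [{1, 3, 4}, {1, 3}, {1, 3, 4}, {1, 3, 11}, {1, 4}]
p_4_given_1_3 = third_label_probability(toy_labels, (1, 3), 4)
```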
Figure 3: Correlation Heatmap of Research Categories in the Training Set.

The heatmap in Figure 3 provides a visual representation of the strength of correlations between different research categories.
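The paper does not state which correlation measure underlies the heatmap. One common choice, assumed here purely for illustration, is the Pearson correlation between binary indicator columns (one column per category, one row per project). The helper names and toy data below are hypothetical.

```python
from math import sqrt

def indicator_columns(label_sets, num_categories):
    """One 0/1 column per category, aligned with the order of the projects."""
    return [[1 if c in s else 0 for s in label_sets]
            for c in range(1, num_categories + 1)]

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when a category never varies
    return cov / sqrt(vx * vy)

def correlation_matrix(label_sets, num_categories):
    cols = indicator_columns(label_sets, num_categories)
    return [[pearson(a, b) for b in cols] for a in cols]

# Toy annotations with three categories (hypothetical).
toy_labels = [{1, 2}, {1, 2}, {3}, {3}, {1}]
corr = correlation_matrix(toy_labels, num_categories=3)
```

In this toy data categories 1 and 3 never co-occur and together cover every project, so their correlation is -1.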

4 Methodology

In this work, we initially use a manually labelled dataset to generate ‘rationales’ for the labels using a Llama-3 70B model (https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct). These rationales explain why each category label is chosen, pinpointing the reason by referencing the abstract. We add these rationales to the labels, constructing an expanded dataset that includes both the prompt and the rationales. For details on the prompt template used, refer to Appendix Section B.

We subsequently explore adapting a smaller model to the classification task using this dataset. The fine-tuned model is expected not only to predict the labels but also to explain its reasoning. Based on insights from McCoy et al. (2023) regarding the limitations of autoregressive language models, we structured the prompt such that the rationales are provided first, before the model determines the category labels. This ordering is intended to prevent errors from prematurely predicted, irrelevant categories from propagating into the rest of the model’s output. Given the substantial computing resources required to train the full weights of the 8B model, we explored the use of efficient fine-tuning via LoRA (Section 4.2).
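To make the rationale-first ordering concrete, the sketch below assembles one instruction-style training example in which the explanations precede the final category list. The wording of the prompt, the field names, and the "Categories:" line are assumptions invented for this example; the actual template is the one given in Appendix Section B.

```python
def build_completion(rationales, categories):
    """Write the rationales first and the category list last."""
    lines = [f"Rationale for category {c}: {rationales[c]}" for c in categories]
    lines.append("Categories: " + ", ".join(str(c) for c in categories))
    return "\n".join(lines)

def build_example(title, abstract, rationales, categories):
    # Hypothetical prompt wording, not the template used in the paper.
    prompt = (
        f"Title: {title}\n"
        f"Abstract: {abstract}\n"
        "Explain your reasoning, then list the research categories.\n"
    )
    return {"prompt": prompt,
            "completion": build_completion(rationales, sorted(categories))}

example = build_example(
    title="A toy vaccine study",
    abstract="We characterise the pathogen and test a candidate vaccine.",
    rationales={1: "The abstract studies the pathogen itself.",
                7: "The abstract evaluates a candidate vaccine."},
    categories={7, 1},
)
```

Because the model generates tokens left to right, placing the labels last means they are produced conditioned on the rationales that came before them.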

We have chosen to restrict the experiments and the baselines to a single decoder-only transformer. As of this writing, an LLM with around 10 billion parameters is considered relatively small but can be performant enough to rival state-of-the-art. A representative LLM with a good starting performance on this complex task provides us with a foundation to improve upon while avoiding potential saturation. Additionally, a very large model like Llama-3 70B would be impractical for independent researchers to fine-tune or utilise on less powerful machines. Section 4.1 details how this LLM was chosen.

4.1 Adjudicating between Outputs of Candidate LLMs

In order to fine-tune a smaller model using the augmented dataset of human annotations and rationales generated by Llama-3 70B, we evaluated several candidate models. The best-performing ones were Mixtral-8x7B Instruct (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) and Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). We empirically found that the outputs produced by these models were significantly more relevant compared to other models of similar or smaller size such as Phi3.

To determine which model to fine-tune for developing PPACE, we conducted a detailed comparison. We randomly selected 10 projects from the dataset and used a few-shot learning template (Appendix Section B) to obtain inferences from both the Mixtral and Llama-3 models, utilising them as frozen language models. Each model’s output was then recorded and passed to GPT-4o for adjudication. GPT-4o was tasked with evaluating the responses, comparing them to the available human judgments, and providing a verdict favouring one model output over the other or declaring a tie.

This adjudication process involved a thorough analysis of each model’s outputs against human labels and was further verified by our annotators. The results showed that both models performed well, but Llama-3 was deemed the better model by a small margin. One of the key advantages of Llama-3 was that extracting the output labels from the generated text was significantly easier compared to Mixtral, which occasionally deviated from the specified format, complicating the extraction process. Additionally, Mixtral has 46.7B total parameters but only uses 12.9B parameters per token. In other words, Llama-3 also has the advantage of being smaller and more performant for this task. This consistency and ease of label extraction made Llama-3 the preferred choice for fine-tuning PPACE.
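The importance of a consistent output format becomes clear when the labels have to be parsed back out of free text. The sketch below shows one way such extraction could look, assuming (hypothetically) that the model ends its answer with a line of the form "Categories: 1, 7". The marker string and function name are illustrative and are not taken from the paper's template.

```python
import re

def extract_categories(generated, num_categories=12):
    """Parse category numbers from a line such as 'Categories: 1, 7'."""
    match = re.search(r"Categories:\s*([0-9, ]+)", generated)
    if match is None:
        return set()  # the model deviated from the expected format
    numbers = {int(tok) for tok in re.findall(r"\d+", match.group(1))}
    # Discard numbers outside the valid category range.
    return {n for n in numbers if 1 <= n <= num_categories}

well_formed = "Rationale: the project tests a vaccine.\nCategories: 1, 7"
malformed = "This project is mostly about vaccines."
```

A model that drifts from the requested format yields an empty set here, which would be scored as if no category had been predicted.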

4.2 Low-Rank Adaptation (LoRA)

Low-Rank Adaptation (LoRA, Hu et al., 2021) is a parameter-efficient approach for adapting large pre-trained models without modifying the original weights. This method can be particularly beneficial for maintaining memory efficiency and reducing computational overhead. LoRA introduces two small matrices, $\mathbf{A}$ and $\mathbf{B}$, into the transformer architecture, which project the high-dimensional parameter space into a lower-dimensional space and back. The original transformation in a transformer layer, typically a matrix multiplication involving a weight matrix $\mathbf{W}$, is modified as follows:

$$\mathbf{W}' = \mathbf{W} + \frac{\alpha}{r}\,\mathbf{A}\mathbf{B} \qquad (1)$$

Here, $\mathbf{W}$ is the original weight matrix of the transformer, and $\mathbf{W}'$ is the adapted weight matrix. The matrices $\mathbf{A}$ and $\mathbf{B}$ are of dimensions $d \times r$ and $r \times d$ respectively, where $d$ is the dimensionality of $\mathbf{W}$ and $r$ is much smaller than $d$. Finally, $\alpha$ is a scaling hyperparameter whose effect is similar to adjusting the learning rate of the trainable weights. This low-rank structure ensures that the number of additional parameters introduced by $\mathbf{A}$ and $\mathbf{B}$ is significantly lower than the number of parameters in $\mathbf{W}$, leading to substantial savings in terms of memory and computational resources.

The low-rank projection effectively captures the essential transformations needed for task-specific adaptation while preserving the original model’s capabilities. This approach is particularly advantageous when computational resources are limited or when adaptation must be achieved with minimal disturbance to the original model structure, as is often required in real-world applications where both efficiency and performance are critical. In practical applications, LoRA has been shown to enable effective fine-tuning of large models on specific tasks without the need for extensive retraining of the original parameters.
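A small numerical sketch may help make the low-rank update tangible. It applies the update W' = W + (α/r)·AB, using the scaling factor α/r from Hu et al. (2021), to a tiny matrix and then compares parameter counts for a square projection. The value d = 4096 is the hidden size of Llama-3 8B and is used only for illustration, since not every adapted matrix in the model is square; the helper names are hypothetical.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_adapt(W, A, B, alpha, r):
    """Return W + (alpha / r) * A @ B without modifying W."""
    scale = alpha / r
    AB = matmul(A, B)
    return [[w + scale * ab for w, ab in zip(w_row, ab_row)]
            for w_row, ab_row in zip(W, AB)]

def lora_param_counts(d, r):
    """(parameters in a d x d weight, parameters in its rank-r adapter)."""
    return d * d, 2 * d * r

# Tiny worked example: d = 2, r = 1, alpha = 2, so the scale is 2.
W = [[1, 0], [0, 1]]
A = [[1], [2]]      # d x r
B = [[3, 4]]        # r x d
W_adapted = lora_adapt(W, A, B, alpha=2, r=1)

full, adapter = lora_param_counts(d=4096, r=128)
```

With the rank of 128 reported in this paper, the adapter for such a square matrix holds one sixteenth of the parameters of the frozen weight it modifies.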

4.3 Training Strategy and Hyperparameters

We used Supervised Finetuning (SFT) to adapt the Llama-3 8B model to the classification task. The model was trained for 2 epochs on the training samples using 8 A100 GPUs, with a batch size of 1 per GPU and 4 gradient accumulation steps. The LoRA modules were attached to all trainable layers in the self-attention and MLP blocks of the Llama model.

During initial experiments, we found that focusing the loss calculation on the completion tokens (related to the explanations and categories) and ignoring the loss for the prompt tokens improved the model’s performance. This approach was more effective than using the autoregressive language modelling loss on all tokens, including the prompt.
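Completion-only loss of this kind is commonly implemented by masking the prompt positions in the label sequence so that they are skipped by the loss. The sketch below assumes the PyTorch convention in which a label of -100 is ignored by the cross-entropy loss; the token IDs are made up, and the paper does not describe its own implementation at this level of detail.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion, masking the prompt in the labels."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Hypothetical token IDs for a short prompt and its completion.
prompt_ids = [101, 102, 103]
completion_ids = [201, 202]
input_ids, labels = build_labels(prompt_ids, completion_ids)
```

The model still attends to the prompt tokens as context; they simply contribute nothing to the gradient.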

To maximise model performance, we used a beam search decoding strategy with 4 beams for the smaller models, which appeared to markedly improve generation quality and output structure. However, due to computational constraints, beam search was not employed for the 70B version.

The hyperparameters used for training the model in our experiments are listed in Table 5.

Table 5: The hyperparameters used for training the model
| Hyperparameter | Value |
| Total Batch Size | 8 |
| Gradient Accumulation Steps | 4 |
| Learning Rate | 2e-4 |
| LR Scheduler | Linear |
| Epochs | 2 |
| LoRA rank | 128 |
| LoRA α | 256 |
| LoRA Dropout | 0.05 |
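A few quantities follow from these hyperparameters by simple arithmetic, sketched below. Two assumptions are made and should be read as such: the LoRA scaling factor is taken to be α/r as in Hu et al. (2021), and the total batch size of 8 is read as 8 GPUs with one sequence each, so that 4 accumulation steps give 32 sequences per optimiser update. The class and field names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    # Values reported in the paper.
    num_gpus: int = 8
    per_gpu_batch_size: int = 1
    grad_accum_steps: int = 4
    epochs: int = 2
    lora_rank: int = 128
    lora_alpha: int = 256

    @property
    def lora_scale(self):
        # Assumes the standard LoRA scaling alpha / r.
        return self.lora_alpha / self.lora_rank

    @property
    def sequences_per_update(self):
        # Assumes plain data parallelism across all GPUs.
        return self.num_gpus * self.per_gpu_batch_size * self.grad_accum_steps

    def updates_per_epoch(self, num_examples):
        return math.ceil(num_examples / self.sequences_per_update)

cfg = TrainingConfig()
updates = cfg.updates_per_epoch(5142)  # training set size from Table 1
```

Under these assumptions a full run of 2 epochs amounts to only a few hundred optimiser updates.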
Table 6: The results of different models on the test set
| Model | Precision (Macro/Micro) | Recall (Macro/Micro) | F1 (Macro/Micro) |
| Llama3-8b | 0.2710/0.2631 | 0.5812/0.6042 | 0.3293/0.3666 |
| Llama3-70b | 0.3163/0.3524 | 0.6320/0.6515 | 0.3898/0.4574 |
| PPACE (ours) | 0.6927/0.7497 | 0.5625/0.7113 | 0.5914/0.7300 |

5 Results and Analysis

Table 7: Evaluation results on the test set for baseline Llama3 (Base) and finetuned PPACE models (Fine) in terms of Precision, Recall, and F-score for each individual category. Improvements (Imp) are also reported for these measures.
| Category | Base P | Base R | Base F1 | Fine P | Fine R | Fine F1 | P Imp | R Imp | F1 Imp |
| Pathogen | 0.576 | 0.856 | 0.689 | 0.765 | 0.854 | 0.807 | 0.189 | -0.002 | 0.118 |
| Animal & Dis. Vectors | 0.484 | 0.804 | 0.604 | 0.915 | 0.768 | 0.835 | 0.431 | -0.036 | 0.231 |
| Epidemiological | 0.565 | 0.476 | 0.517 | 0.721 | 0.646 | 0.682 | 0.156 | 0.171 | 0.165 |
| Clinical Char. in Humans | 0.399 | 0.808 | 0.535 | 0.775 | 0.603 | 0.678 | 0.376 | -0.205 | 0.144 |
| Infection Prev. & Control | 0.150 | 0.696 | 0.247 | 0.677 | 0.457 | 0.545 | 0.527 | -0.239 | 0.298 |
| Therapeutics | 0.226 | 0.775 | 0.350 | 0.749 | 0.856 | 0.799 | 0.522 | 0.081 | 0.449 |
| Vaccines | 0.130 | 0.776 | 0.222 | 0.748 | 0.793 | 0.770 | 0.618 | 0.017 | 0.548 |
| Ethics | 0.133 | 0.200 | 0.160 | 1.000 | 0.150 | 0.261 | 0.867 | -0.050 | 0.101 |
| Public Health | 0.000 | 0.000 | 0.000 | 0.802 | 0.552 | 0.654 | 0.802 | 0.552 | 0.654 |
| Secondary Impacts | 0.391 | 0.391 | 0.391 | 0.777 | 0.904 | 0.836 | 0.386 | 0.513 | 0.445 |
| Health Systems | 0.133 | 0.730 | 0.225 | 0.429 | 0.169 | 0.242 | 0.296 | -0.562 | 0.017 |
| Capacity Strengthening | 0.015 | 0.600 | 0.028 | 0.000 | 0.000 | 0.000 | -0.015 | -0.600 | -0.028 |

Table 6 shows the results of the baseline Llama3-8B model, the larger Llama3-70b model, and the proposed PPACE model, respectively. As can be seen, PPACE outperforms the baselines on all the metrics with the exception of macro-averaged recall.
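For readers less familiar with the two averaging schemes, the sketch below computes macro- and micro-averaged precision, recall, and F1 for multilabel predictions. Macro averaging takes the unweighted mean of the per-category scores, whereas micro averaging pools true positives, false positives, and false negatives across categories first. This is a generic implementation under those standard definitions, with undefined ratios set to 0; it is not the authors' evaluation script, and the toy data are invented.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def multilabel_scores(gold, pred, num_categories):
    """Return (macro, micro), each a (precision, recall, F1) tuple."""
    counts = []
    for c in range(1, num_categories + 1):
        tp = sum(c in g and c in p for g, p in zip(gold, pred))
        fp = sum(c not in g and c in p for g, p in zip(gold, pred))
        fn = sum(c in g and c not in p for g, p in zip(gold, pred))
        counts.append((tp, fp, fn))
    per_category = [prf(*c) for c in counts]
    macro = tuple(sum(s[i] for s in per_category) / num_categories
                  for i in range(3))
    micro = prf(*(sum(c[i] for c in counts) for i in range(3)))
    return macro, micro

# Toy gold and predicted label sets for three projects (hypothetical).
gold = [{1}, {1, 2}, {2}]
pred = [{1}, {1}, {1, 2}]
macro, micro = multilabel_scores(gold, pred, num_categories=2)
```

Macro averages weight every category equally, so rare categories with poor scores pull them down much more than they affect the micro averages.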

The finetuned model demonstrates significant improvements in F1 scores across most categories, indicating the effectiveness of the finetuning process. Notable improvements are seen in categories like ‘Infection Prevention & Control’, ‘Therapeutics’, and ‘Vaccines’, where the finetuned model’s precision and F1 scores show substantial gains. Categories with low representation in the dataset, such as ‘Capacity Strengthening’ and ‘Health Systems’, see mixed results, with some metrics decreasing: recall for ‘Health Systems’ falls from 0.730 to 0.169, and the F1 score for ‘Capacity Strengthening’ drops from 0.028 to 0 (the distribution of each category in the test set is provided in Appendix Section C). This suggests that while finetuning enhances the model’s ability to generalise, it may still struggle with categories that have very few examples. Figure 4 shows the changes in F-scores between the fine-tuned and the base model across the different categories, sorted from the least frequent to the most frequent as seen in the test set.

Figure 4: F1 Score Comparison by Category between the baseline Llama3 8B and the finetuned PPACE model. The categories are sorted from least to most frequent as seen in the test set.

Overall, the finetuned model generally performs better in terms of precision compared to recall. This trend indicates that the model has become more conservative in its predictions, leading to fewer false positives but potentially more false negatives. Categories like ‘Vaccines’, ‘Public Health’ and ‘Secondary Impacts’ show remarkable improvements in precision and F1 scores, demonstrating PPACE’s enhanced capability to identify relevant instances within these categories. The dramatic increase in all metrics for ‘Public Health’ is particularly noteworthy, with the F-score jumping from 0 to 0.65. However, there remains room for improvement, especially for categories with minimal representation in the dataset. The results highlight the strengths of the finetuning approach while also pointing out the difficulty of the task.

6 Conclusion

In this work, we introduced the Pandemic PACT Advanced Categorisation Engine (PPACE), a fine-tuned 8B language model for biomedical research classification developed as part of the Pandemic PACT initiative. PPACE is capable of accurately categorising research abstracts according to WHO-aligned priorities. This tool can be a valuable asset for identifying biomedical research trends and gaps in a multilabel classification scenario. The model was built on a robust foundation of human-annotated data, enhanced with LLM-generated rationales, so that the model’s predictions are not only accurate but also interpretable. The use of efficient fine-tuning has enabled us to adapt the model effectively while maintaining computational efficiency.

Our evaluation demonstrated that PPACE outperforms its baselines, offering significant improvements in the context of multilabel classification. The model and the instruction-based dataset used for training are released publicly, providing a valuable resource for the research community. These contributions facilitate further advancements in aligning biomedical research with critical global health priorities.

Looking ahead, the integration of LLMs in the annotation process promises to streamline data collection and categorisation, potentially reducing the burden on human annotators and improving the scalability of such initiatives. The evolution of PPACE can play a crucial role in enhancing the efficiency and effectiveness of global health research, ultimately contributing to better preparedness for future outbreaks.

Limitations

Our work has several limitations. First, the dataset used for training and evaluation, while extensive, may not encompass the full diversity of biomedical research projects globally, potentially limiting the generalisability of our model for prospective analyses of research on newly emerging pathogens. Additionally, the research categories might become outdated at some point, requiring updates and subsequent retraining of the model. Second, some projects can be categorised in different ways, introducing a degree of subjectivity in certain assignments. While our use of expert human annotations aims to minimise this issue, it does not completely eliminate it.

Furthermore, despite using efficient fine-tuning methods like LoRA, the 8-billion parameter model is still sizable. Researchers with limited computational resources would need reliable GPUs for inference, as running solely on CPU can be very slow. Future iterations of this work will aim to fine-tune smaller models to improve accessibility.

Lastly, we did not experiment heavily with advanced prompting techniques or invest significant time in crafting the best possible prompts. There is potential for further improvement in the reported results for the frozen language models through optimised prompts, which might narrow the performance gap with the fine-tuned model.

Acknowledgments

This work was primarily funded by Wellcome [226543] and the EDCTP2 Programme supported by the European Union. The Pandemic PACT Programme is also supported by the following grants: This research was funded by the National Institute for Health Research (NIHR) (CSA2022GloPID-R-3387) using UK Aid from the UK Government to support global health research. This work was carried out with the aid of a grant from the International Development Research Centre, Ottawa, Canada (109910-001). This work was supported by UK Research & Innovation (UKRI) under the UK Government’s Horizon Europe Guarantee under GloPID-R SEC 3 Grant Agreement no. 10061268. Whilst the funders of Pandemic PACT are engaged through the Pandemic PACT Advisory Group and have a role in the provision of funding data, they are not involved in the analysis and presentation of related findings.

This work was supported in part by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC), and in part by an InnoHK Project at the Hong Kong Centre for Cerebro-cardiovascular Health Engineering (COCHE). OR acknowledges the support of the Medical Research Council (grant number MR/W01761X/). DAC was supported by an NIHR Research Professorship, an RAEng Research Chair, COCHE, the UKRI, and the Pandemic Sciences Institute at the University of Oxford. The views expressed are those of the authors and not necessarily those of the NIHR, MRC, COCHE, UKRI, or the University of Oxford.

References

  • Arighi et al. (2011) Cecilia N Arighi, Phoebe M Roberts, Shashank Agarwal, Sanmitra Bhattacharya, Gianni Cesareni, Andrew Chatr-Aryamontri, Simon Clematide, Pascale Gaudet, Michelle Gwinn Giglio, Ian Harrow, et al. 2011. Biocreative iii interactive task: an overview. BMC bioinformatics, 12:1–21.
  • Baker et al. (2015) Simon Baker, Ilona Silins, Yufan Guo, Imran Ali, Johan Högberg, Ulla Stenius, and Anna Korhonen. 2015. Automatic semantic classification of scientific literature according to the hallmarks of cancer. Bioinformatics, 32(3):432–440.
  • Behera et al. (2019) Bichitrananda Behera, G Kumaravelan, and Prem Kumar. 2019. Performance evaluation of deep learning algorithms in biomedical document classification. In 2019 11th international conference on advanced computing (ICoAC), pages 220–224. IEEE.
  • Carroll et al. (2021) Dennis Carroll, Subhash Morzaria, Sylvie Briand, Christine Kreuder Johnson, David Morens, Keith Sumption, Oyewale Tomori, and Supaporn Wacharphaueasadee. 2021. Preventing the next pandemic: the power of a global viral surveillance network. BMJ, 372.
  • Chen et al. (2020) Qingyu Chen, Alexis Allot, and Zhiyong Lu. 2020. LitCovid: an open database of COVID-19 literature. Nucleic Acids Research, 49(D1):D1534–D1540.
  • Chen et al. (2023) Qingyu Chen, Jingcheng Du, Yan Hu, Vipina Kuttichi Keloth, Xueqing Peng, Kalpana Raja, Rui Zhang, Zhiyong Lu, and Hua Xu. 2023. Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations. arXiv preprint arXiv:2305.16326.
  • Dernoncourt and Lee (2017) Franck Dernoncourt and Ji Young Lee. 2017. Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 308–313.
  • Hersh et al. (2006) William R Hersh, Aaron M Cohen, Phoebe M Roberts, and Hari Krishna Rekapalli. 2006. Trec 2006 genomics track overview. In TREC, volume 7, pages 500–274.
  • Hsieh et al. (2023) Cheng-Yu Hsieh, Chun-Liang Li, Chih-kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. 2023. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8003–8017.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • Jiang et al. (2024) Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, et al. 2024. Mora: High-rank updating for parameter-efficient fine-tuning. arXiv preprint arXiv:2405.12130.
  • Jimenez Gutierrez et al. (2020) Bernal Jimenez Gutierrez, Jucheng Zeng, Dongdong Zhang, Ping Zhang, and Yu Su. 2020. Document classification for COVID-19 literature. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3715–3722, Online. Association for Computational Linguistics.
  • Laza et al. (2011) Rosalía Laza, Reyes Pavón, Miguel Reboiro-Jato, and Florentino Fdez-Riverola. 2011. Evaluating the effect of unbalanced data in biomedical document classification. Journal of integrative bioinformatics, 8(3):105–117.
  • Luo and Specia (2024) Haoyan Luo and Lucia Specia. 2024. From understanding to utilization: A survey on explainability for large language models. arXiv preprint arXiv:2401.12874.
  • McCoy et al. (2023) R Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L Griffiths. 2023. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638.
  • McLean et al. (2022) Alistair RD McLean, Sumayyah Rashan, Lien Tran, Lorenzo Arena, AbdulAzeez Lawal, Brittany J Maguire, Sandra Adele, Emilia Sitsofe Antonio, Matthew Brack, Fiona Caldwell, et al. 2022. The fragmented covid-19 therapeutics research landscape: a living systematic review of clinical trial registrations evaluating priority pharmacological interventions. Wellcome Open Research, 7(24):24.
  • Norton et al. (2020) Alice Norton, Louise Sigfrid, Adeniyi Aderoba, Naima Nasir, Peter G Bannister, Shelui Collinson, James Lee, Geneviève Boily-Larouche, Josephine P Golding, Evelyn Depoortere, et al. 2020. Preparing for a pandemic: highlighting themes for research funding and practice—perspectives from the global research collaboration for infectious disease preparedness (glopid-r). BMC medicine, 18:1–4.
  • Norton et al. (2024) Alice Norton, Louise Sigfrid, Emilia Antonio, Adrian Bucher, Duduzile Ndwandwe, and Pandemic PACT Advisory Group. 2024. Improving coherence of global research funding: Pandemic pact. Lancet (London, England), 403(10433):1233.
  • Phan et al. (2021) Long N Phan, James T Anibal, Hieu Tran, Shaurya Chanana, Erol Bahadroglu, Alec Peltekian, and Grégoire Altan-Bonnet. 2021. Scifive: a text-to-text transformer model for biomedical literature. arXiv preprint arXiv:2106.03598.
  • Rohanian et al. (2023) Omid Rohanian, Mohammadmahdi Nouriborji, and David A Clifton. 2023. Exploring the effectiveness of instruction tuning in biomedical language processing. arXiv preprint arXiv:2401.00579.
  • Seminog et al. (2024) O Seminog, R Furst, T Mendy, et al. 2024. A protocol for a living mapping review of global research funding for infectious diseases with a pandemic potential – pandemic pact. Wellcome Open Research, 9:156. Version 1; peer review: awaiting peer review.
  • Taylor et al. (2024) Niall Taylor, Upamanyu Ghose, Omid Rohanian, Mohammadmahdi Nouriborji, Andrey Kormilitzin, David Clifton, and Alejo Nevado-Holgado. 2024. Efficiency at scale: Investigating the performance of diminutive language models in clinical tasks. arXiv preprint arXiv:2402.10597.
  • Tian et al. (2024) Shubo Tian, Qiao Jin, Lana Yeganova, Po-Ting Lai, Qingqing Zhu, Xiuying Chen, Yifan Yang, Qingyu Chen, Won Kim, Donald C Comeau, et al. 2024. Opportunities and challenges for chatgpt and large language models in biomedicine and health. Briefings in Bioinformatics, 25(1):bbad493.

Appendix

Appendix A List of Research Categories Along with Definitions

The following is the list of categories used in the prompt along with corresponding explanations of what each category entails:

  1. Pathogen: Natural History, Transmission, and Diagnostics: Development of diagnostic tools, understanding pathogen morphology, genomics, and genotyping, studying immunity, using disease models, and assessing the environmental stability of pathogens.
  2. Animal and Environmental Research & Research on Diseases Vectors: Animal sources, transmission routes, vector biology, and control strategies for vectors.
  3. Epidemiological Studies: Research on disease transmission dynamics, susceptibility, control measure effectiveness, and disease mapping through surveillance and reporting.
  4. Clinical Characterisation and Management in Humans: Prognostic factors for disease severity, disease pathogenesis, supportive care and management, long-term health consequences, and clinical trials for disease management.
  5. Infection Prevention and Control: Research on community restriction measures, barriers and PPE, infection control in healthcare settings, and measures at the human-animal interface.
  6. Therapeutics Research, Development, and Implementation: Pre-clinical studies for therapeutic development, clinical trials for therapeutic safety and efficacy, development of prophylactic treatments, logistics and supply chain management for therapeutics, clinical trial design for therapeutics, and research on adverse events related to therapeutic administration.
  7. Vaccines Research, Development, and Implementation: Pre-clinical studies for vaccine development, clinical trials for vaccine safety and efficacy, logistics and distribution strategies for vaccines, vaccine design and administration, clinical trial design for vaccines, research on adverse events related to immunisation, and characterisation of vaccine-induced immunity.
  8. Research to Inform Ethical Issues: Ethical considerations in research design, ethical issues in public health measures, ethical clinical decision-making, ethical resource allocation, ethical governance, and ethical considerations in social determinants of health.
  9. Policies for Public Health, Disease Control and Community Resilience: Approaches to public health interventions, community engagement, communication and infodemic management, vaccine/therapeutic hesitancy, and policy research and interventions.
  10. Secondary Impacts of Disease, Response, and Control Measures: Indirect health impacts, social impacts, economic impacts, and other secondary impacts such as environmental effects, food security, and infrastructure.
  11. Health Systems Research: Health service delivery, health financing, access to medicines and technologies, health information systems, health leadership and governance, and health workforce management.
  12. Capacity Strengthening: Individual capacity building, institutional capacity strengthening, systemic/environmental components, and cross-cutting activities across all levels of capacity building.

Appendix B Prompt Design for Model Inference

To generate high-quality inferences from the models during the few-shot learning experiments, a carefully constructed prompt was employed. This prompt was designed based on feedback from expert human annotators to ensure precision and discourage spurious associations. Below is the detailed explanation and the code used for generating the prompt.

The prompt includes guidelines and examples to steer the model towards accurate classification. The guidelines ensure that the model focuses on relevant categories and avoids unnecessary implications or speculative guesses. The examples provided demonstrate the expected structure and reasoning for categorisation.

The {guideline} variable in the prompt is replaced with the list of research categories given in Appendix A, prefaced with the following line:

We have a project in the area of biomedical research which we want to classify in terms of the research priorities it relates to. We have 12 possible research priorities and a project can be mapped to 1 or more of these priorities. The following is a guide on what each of these 12 categories are alongside the specific areas that they cover.
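As an illustration of how the {guideline} string could be assembled from this preface and the category list in Appendix A, here is a minimal sketch. The names `PREFACE`, `CATEGORIES`, and `build_guideline`, and the "number. name: definition" line layout, are assumptions made for this example; the exact formatting used in this work is not specified.

```python
PREFACE = (
    "We have a project in the area of biomedical research which we want to "
    "classify in terms of the research priorities it relates to. We have 12 "
    "possible research priorities and a project can be mapped to 1 or more of "
    "these priorities. The following is a guide on what each of these 12 "
    "categories are alongside the specific areas that they cover."
)

# Only the first two of the 12 categories from Appendix A are shown here for
# brevity; a full version would list all 12 in order.
CATEGORIES = [
    ("Pathogen: Natural History, Transmission, and Diagnostics",
     "Development of diagnostic tools, understanding pathogen morphology, "
     "genomics, and genotyping, studying immunity, using disease models, and "
     "assessing the environmental stability of pathogens."),
    ("Animal and Environmental Research & Research on Diseases Vectors",
     "Animal sources, transmission routes, vector biology, and control "
     "strategies for vectors."),
]


def build_guideline(categories, preface=PREFACE):
    """Join the preface and the numbered category definitions into one string."""
    lines = [preface]
    for number, (name, definition) in enumerate(categories, start=1):
        lines.append(f"{number}. {name}: {definition}")
    return "\n".join(lines)


guideline = build_guideline(CATEGORIES)
```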

Listing 1: Function to generate the prompt for classifying research projects
# Function to generate the prompt for classifying research projects
def generate_classification_prompt(row, guideline):
    # Assume row is a pandas Series with project info and guideline is a string containing the guidelines.
    print(guideline)
    prompt = f"""
    [INST] Based on the research categorization guidelines, classify the following project into the appropriate primary research priorities using only the top-level categories 1 to 12. Structure your response clearly, providing the category numbers enclosed in single quotation marks.
    {guideline}
    Examples:
    - For a study on investigating the genetic mutations of a pathogen and its resistance to current vaccines.
      ### Reasoning: Categories '1' and '7' are chosen for their focus on pathogen genomics and vaccine development, respectively.
      ### Categories: '1', '7'
    - For a study on examining the effectiveness of new therapeutic treatments in Phase 3 clinical trials and the ethical considerations in conducting these trials.
      ### Reasoning: Categories '6' and '8' are chosen for their focus on Phase 3 clinical trials and ethical research issues, respectively.
      ### Categories: '6', '8'
    - For a study on the social determinants of disease spread in urban environments, the efficacy of non-pharmaceutical interventions, and the long-term mental health impacts on survivors.
      ### Reasoning: Categories '3', '9', and '10' are selected for their relevance to disease transmission, public health interventions, and indirect health impacts.
      ### Categories: '3', '9', '10'
    Note 1: Use category '2' only for explicit references to animals (this is a rare category).
    Note 2: Research Collaboration is distinct from epidemiological studies.
    Note 3: Don't categorize solely on study population.
    Note 4: Therapeutics Research pertains to drugs.
    Note 5: Stay logical and factual in your analysis. Avoid making unnecessary implications or speculative guesses beyond the explicit information provided.
    Based on this information, identify the relevant research categories for this project. Provide clear but succinct reasoning for your choices similar to the above examples. Section your response in the following format:
    ### Reasoning: ...
    ### Categories: ...
    Project Information:
    Title: {row['GrantTitleEng']}
    Abstract: {row['AbstractEng']}
    [/INST]""".strip()
    return prompt
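Because the prompt asks for category numbers enclosed in single quotation marks after a "### Categories:" header, a model response can be converted back into labels with a small parser. The following is a minimal sketch under that assumption; `parse_categories` is a hypothetical helper and not part of the released code.

```python
import re


def parse_categories(response):
    """Extract category numbers from the '### Categories:' line of a response."""
    match = re.search(r"###\s*Categories:\s*(.*)", response)
    if match is None:
        return []
    # Accept straight or typographic single quotes around each number.
    numbers = re.findall(r"['‘’](\d{1,2})['‘’]", match.group(1))
    # Keep only valid top-level categories 1 to 12, dropping duplicates.
    categories = []
    for n in numbers:
        value = int(n)
        if 1 <= value <= 12 and value not in categories:
            categories.append(value)
    return categories


# Hypothetical model output in the requested format.
response = "### Reasoning: ...\n### Categories: '3', '9', '10'"
labels = parse_categories(response)
```

Returning an empty list when the header is missing lets the caller decide how to treat malformed outputs, for example by counting them as predicting no category.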

When fine-tuning, to ensure the model actually learns from the labels rather than relying on extra information in the prompt, we omit both the few-shot examples and the five additional notes. Every other detail in the template above stays the same.
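A minimal sketch of what this fine-tuning variant could look like is given below: it is Listing 1 with the examples and notes removed. The function name `generate_finetuning_prompt` is hypothetical, and the sentence referring to "the above examples" is kept only because the text states that every other detail is unchanged; whether that wording was adjusted is an assumption left open here. A plain dict stands in for the pandas Series.

```python
def generate_finetuning_prompt(row, guideline):
    """Sketch of the fine-tuning prompt: Listing 1 without examples or notes."""
    prompt = f"""
    [INST] Based on the research categorization guidelines, classify the following project into the appropriate primary research priorities using only the top-level categories 1 to 12. Structure your response clearly, providing the category numbers enclosed in single quotation marks.
    {guideline}
    Based on this information, identify the relevant research categories for this project. Provide clear but succinct reasoning for your choices similar to the above examples. Section your response in the following format:
    ### Reasoning: ...
    ### Categories: ...
    Project Information:
    Title: {row['GrantTitleEng']}
    Abstract: {row['AbstractEng']}
    [/INST]""".strip()
    return prompt


# Hypothetical project record; any mapping with these two keys works.
example_row = {"GrantTitleEng": "T1", "AbstractEng": "A1"}
prompt = generate_finetuning_prompt(example_row, "GUIDE")
```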

Appendix C Distribution of Categories in the Test set

Figure 5: Individual Label Distribution in the Test Set.