TrialEnroll: Predicting Clinical Trial Enrollment Success with Deep & Cross Network and Large Language Models

Ling Yue [email protected] Rensselaer Polytechnic InstituteTroyNYUSA Sixue Xing [email protected] Rensselaer Polytechnic InstituteTroyNYUSA Jintai Chen [email protected] University of Illinois Urbana-ChampaignChempaignILUSA  and  Tianfan Fu [email protected] Rensselaer Polytechnic InstituteTroyNYUSA
Abstract.

Clinical trials need to recruit a sufficient number of volunteer patients to demonstrate the statistical power of the treatment (e.g., a new drug) in curing a certain disease. Clinical trial recruitment has a significant impact on trial success. Forecasting whether the recruitment process would be successful before we run the trial would save many resources and time. This paper develops a novel deep & cross network with large language model (LLM)-augmented text feature that learns semantic information from trial eligibility criteria and predicts enrollment success. The proposed method enables interpretability by understanding which sentence/word in eligibility criteria contributes heavily to prediction. We also demonstrate the empirical superiority of the proposed method (0.7002 PR-AUC) over a bunch of well-established machine learning methods. The code and curated dataset are publicly available at https://anonymous.4open.science/r/TrialEnroll-7E12.

Clinical Trial, Drug Development, Drug Discovery, Large Language Model, Clinical Trial Recruitment, Clinical Trial Enrollment
conference: 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics; November 22, 2024; Shenzhen, Guangdong Province, P.R. Chinaccs: Computing methodologies Supervised learning by classification

1. Introduction

Clinical trials play an indispensable role in developing new treatments by assessing their safety and efficacy on human subjects (Friedman et al., 2015). These trials serve as critical checkpoints in the drug development process, ensuring that medications are not only effective but also safe for public use (Hill, 1952). To conduct a robust evaluation, it is imperative to enroll a sufficient number of participants who meet specific eligibility criteria (Haddad et al., 2015). This ensures that the statistical power of the trial is adequate to detect any significant differences between the treatment group and the control group.

However, recruiting the right number of participants is challenging (Patel et al., 2003; Haddad et al., 2015). The process is often time-consuming and costly, which can delay the entire drug development timeline. One of the main reasons for this difficulty is the stringent eligibility criteria that must be met by potential participants (Peto, 1978). These criteria are designed to ensure that the study population is representative of the intended patient population and to minimize confounding variables that could skew the results.

To address this issue, there has been growing interest in leveraging machine learning algorithms to predict patient enrollment in clinical trials more accurately (Lo et al., 2019; Fu et al., 2023). By analyzing historical data from previous trials, these algorithms can identify patterns and factors that influence recruitment rates. This predictive modeling can help researchers better plan and design their trials, leading to more efficient and effective recruitment strategies.

However, the prediction of clinical trial enrollment encounters some data and technical challenges, as shown below.

  • Lack of high-quality data. Clinical trial data are usually highly noisy and sensitive and not AI-ready, which hinders AI’s deployment.

  • Lack of ability to learn from the multimodal heterogeneous features. Clinical trials usually involve multimodal heterogeneous features, such as biomedical entities (drug, disease), and demographic features (e.g., gender and age). It is challenging for the current machine learning model to learn from them.

  • Lack of ability to learn from unstructured text data. Clinical trial involves a great amount of unstructured text data. For example, eligibility criteria consist of multiple natural language inclusion and exclusion criteria, which specify what is wanted and unwanted during the patient recruitment process. It is challenging to capture semantic information from the unstructured text data.

To address these challenges, we formally define the clinical trial enrollment prediction problem, curate AI-ready datasets, and customize Deep & Cross Network (Wang et al., 2017) using large language model-augmented features to learn semantic information from unstructured text data (e.g., eligibility criteria, LLM-generated feature), where large language model is used to augment the text feature of biomedical entities like drug and disease.

For ease of exposition, the major contribution of this manuscript can be summarized as follows.

  • Problem. To the best of our knowledge, We are the first to identify clinical trial enrollment as a predictable problem and formulate it into an AI-solvable task.

  • Data. We curate a ready-to-use dataset specialized for clinical trial enrollment prediction. The dataset contains 31,094 trials with binary labels for enrollment success.

  • Method. We develop a deep & cross network with large language model enhanced text feature tailored to enrollment prediction. Specifically, we design a hierarchical attention mechanism to learn the word- and sentence-level importance in an end-to-end manner.

  • Results. We conduct extensive experiments to validate the effectiveness of the proposed method. Specifically, the proposed method obtains 0.7002 PR-AUC score and achieves 0.0229 improvement over the best baseline method. Also, our method exhibits desirable interpretability that could help clinicians understand the AI predictions.

The rest of the paper is organized as follows. First, Section 2 briefly reviews the related works in using AI for predictive clinical trial tasks. Then, we elaborate on our method in Section 3. After that, empirical studies are described in Section 4. Finally, we conclude the paper in Section 5.

2. Related Works

In this section, we discuss the works that use AI methods for automatic clinical trial planning and management from two perspectives: clinical trial problems and specific AI methodologies for these problems.

AI-solvable clinical trial problems.

AI, especially deep learning methods, has great potential in aiding many clinical trial problems. Specifically, they focus on the following clinical trial problems.

  • Clinical trial outcome/approval prediction: The costs of conducting clinical trials are extremely expensive (up to hundreds of millions of dollars (Martin et al., 2017)), and the time to run a trial is very long (7-11 years on average) with low success probability (Peto, 1978; Ledford, 2011). However, many factors, such as the inefficacy of the drug, drug safety issues, and poor trial eligibility criteria design, can cause the failure of a clinical trial (Friedman et al., 2015). If we were better at predicting the results of clinical trials, we could avoid running trials that will inevitably fail — more resources could be devoted to trials that are more likely to succeed. (Fu et al., 2022; Chen et al., 2024b, a) propose to predict clinical trial approval based on drug molecule structure, disease code, and trial eligibility criteria.

  • Patient-trial matching: Patient recruitment is a key step in running clinical trials. Given the trial’s eligibility criteria, matching the appropriate patients based on their electronic health records is time- and labor-intensive (Zhang et al., 2021; Fu et al., 2024). Patient-trial matching is formulated as a machine learning task to automate the process that selects appropriate patients for the target trial and alleviates the burden of patient recruitment. (Zhang et al., 2020; Gao et al., 2020) predict patient-trial matching based on trial eligibility criteria and patient electronic health records (EHR);

  • Digital twin of clinical trial: Digital twins in the context of clinical trials refer to virtual representations of real-world patients or systems that can be used to simulate and predict outcomes under various conditions. Digital twins can simulate how different patient populations might respond to new treatments, potentially reducing the need for lengthy and costly physical trials. This can significantly speed up the drug development process. (Wang et al., 2024) design a TWIN-GPT model by finetuning standard GPT model to synthesize patient visit history to mimic the procedure of clinical trials and predict trial outcomes.

  • Integration of multi-omics data. Multi-omics data enables the characterization of individual patients at a molecular level, which is crucial for precision medicine approaches. By understanding the genetic (Lu et al., 2021, 2022), transcriptomic (Lu et al., 2023), and other molecular profiles of patients (Chen et al., 2021), treatments can be tailored to match individual disease mechanisms, potentially leading to more effective and personalized therapies.

  • Clinical trial duration prediction: Predicting the duration of clinical trials accurately offers significant benefits for trial management. By predicting trial duration, resource allocation such as staffing, budget, and facilities can be optimized, ensuring resources are available when needed to prevent inefficiencies and bottlenecks (Kerali, 2018). (Yue et al., 2024) predicts clinical trial duration using textual information of various trial features (disease, drug, eligibility criteria) with a pretrained BioBERT (Lee et al., 2020) model as a text feature enhancement.

  • Clinical trial site selection: The site of the clinical trial, also known as the investigators, is the physical place where clinical trials are carried out and is the key to the success of clinical trials. The selection of clinical trial sites is complex and laborious work. Traditional ways usually depend heavily on human experts, who manually assign the clinical trial sites to the specific clinical trials. The process is time-consuming, error-prone, and expensive. To reduce the time, resources, and cost, (Srinivasa et al., 2022) designed a policy-based reinforcement learning method to select trial sites automatically.

  • Dataset: TrialBench (Chen et al., 2024c) identifies 8 AI solvable clinical trial problems (prediction of trial duration, patient dropout rate, serious adverse event, mortality rate, trial approval outcome, trial failure reason, drug dose-finding, design of eligibility criteria) and curates 24 AI-ready corresponding datasets to facilitate the AI for the clinical trial community.

AI methodologies tailored to clinical trial.

Clinical trials produce valuable multimodal data that can be used for machine learning modeling. Patient-level trial data contain individual patient measurements and adverse events during the trial period. The trial summary data contains multi-modal information related to the trial, including drug molecules, target diseases, eligibility criteria for recruiting patients, sponsors (e.g., some specific pharmaceutical company or academic institute), physical trial sites (geographical locations to conduct the trial), trial aims and trial results. A series of deep learning methods were developed to represent these multimodal clinical trial features. For example, DeepEnroll (Zhang et al., 2020) also leveraged a Bidirectional Encoder Representations from Transformers (BERT (Devlin et al., 2019)) model to encode clinical trial information. COMPOSE (Gao et al., 2020) used pretrained BERT to generate contextualized word embedding for each word of the trial protocol and then applied multiple one-dimensional convolutional layers with different kernel sizes to generate trial embedding to capture semantics at different granularity levels. (Qi and Tang, 2019) designs a Residual Semi-Recurrent Neural Network (RS-RNN) to predict the phase 3 trial results based on phase 2 results. There are a great deal of missing features in clinical trials. To handle this issue, (Lo et al., 2019) explored various imputation techniques (Wu et al., 2022; Lu et al., 2019; Lu, 2018) (such as mean imputation, random imputation, and k-nearest neighbor) and applied a series of conventional machine learning models (including logistic regression (LaValley, 2008), random forest (Breiman, 2001), SVM (Jakkula, 2006)) to predict the outcome of clinical trial within 15 disease groups. However, they do not consider drug molecule features and trial protocol information and thus could not differentiate the outcome for different drugs focusing on disease. (Fu et al., 2022, 2023) designed a hierarchical interaction network (HINT) to encode multimodal trial features and capture their interaction (including drug molecules, disease code, and eligibility criteria). Based on this work, (Chen et al., 2024a, b) extend its scope by quantifying uncertainty and studying the explainability/interpretability of the HINT model. (Yue et al., 2024) design a hierarchical attention mechanism for learning word- and sentence-level semantic information from trial eligibility criteria. (Yue and Fu, 2024) design a multi-agent large language model-based reasoning method for clinical trial outcome prediction. Specifically, they decompose clinical trial outcome prediction into several simple sub-tasks, e.g., trial enrollment success, drug safety, and drug efficacy. To predict the clinical trial outcome, (Gao et al., 2024) designed a large language model-based interaction network (LINT), which uses a large language model (LLM) to extract meaningful text embedding and provide fruitful features, followed by a small-scale interaction network (to be finetuned) to make the prediction. (Srinivasa et al., 2022) design a policy-based reinforcement learning and design fairness-aware reward function to enhance the fairness of clinical trials over different races, especially for minority groups.

3. Methodology

Overview. In this section, we demonstrate the methodology. First, we discuss the broad impact of trial enrollment success prediction in Section 3.1. Then, we formulate the clinical trial enrollment success prediction problem in Section 3.2. Then, we discuss how to conduct feature engineering to produce informative features specifically for enrollment success prediction in Section 3.3. After that, we describe how to leverage large language model (LLM) to augment the text feature in Section 3.4. Then, we describe the customized Deep & Cross Network in Section 3.5. For ease of understanding, the architecture of the whole model is illustrated in Figure 1.

Refer to caption
Figure 1. Overview of TrialEnroll. Our model takes multimodal clinical trial features (e.g., drug, disease, eligibility criteria, geographical location of the trial, age, and target gender) as the input (detailed in Section 3.3), augmented by large language model (Section 3.4), leverages deep & cross network (DCN) as neural architecture (Section 3.5), and predicts whether the trial enrollment will succeed.

3.1. Broad Impact

Predicting clinical trial enrollment success is crucial for pharmaceutical companies (Haddad et al., 2015). Accurate predictions enable effective resource allocation, guiding investments in time, money, and personnel. Enrollment delays or failures can significantly increase costs, but predictive models can identify potential issues early, allowing for adjustments that save substantial resources. These models also inform trial design, optimizing inclusion/exclusion criteria and trial locations to attract and retain participants.

Reliable enrollment predictions enhance stakeholder confidence (Cruz Rivera et al., 2020), including investors, partners, and regulatory bodies. Demonstrating a high likelihood of successful enrollment reinforces trust in the company’s capabilities and commitment to delivering on its pipeline.

Accurate forecasting improves resource allocation and financial planning. By predicting enrollment outcomes, trial managers can optimize staffing, budget, and facilities, ensuring resources are available when needed and preventing inefficiencies and bottlenecks (Kerali, 2018). This enhances budget accuracy and ensures efficient capital use, reducing waste and improving trial efficiency (Baskin, 2019).

Predicting enrollment success is vital for proactive risk management. Early identification of potential recruitment challenges and delays allows for the development of contingency plans, minimizing impacts on the trial timeline and outcomes (Prasad, 2024). Retrospective analyses of past trials and prospective data collection during ongoing trials further support this proactive approach (Charles A. Knirsch, 2012).

Effective communication with stakeholders is another critical benefit. Setting realistic expectations regarding trial timelines and outcomes fosters trust and collaboration with sponsors, regulatory bodies, and participants (Yu et al., 2024). Reliable enrollment forecasts also help secure funding from sponsors and grant committees by providing accurate and trustworthy data.

For pharmaceutical companies, accurate forecasting of trial enrollment is essential for strategic planning, including market entry and product launch strategies. Enrollment predictions determine the timing of drug approval and market availability, crucial for competitive positioning and financial planning (DiMasi et al., 2016). These forecasts support regulatory submission planning, facilitating smoother interactions with regulatory bodies and a more efficient approval process, ultimately speeding up market access for new treatments and providing a significant competitive advantage (Alsultan et al., 2020).

3.2. Formulation of Clinical Trial Enrollment Success Prediction

A clinical trial is a systematic effort to assess the safety and effectiveness of a specific set of treatment set designed to address a particular group of target disease set, This evaluation is conducted according to predefined trial eligibility criteria for a chosen group of patients.

Definition 3.1 (Drug Set).

The drug set consists of a range of drug molecule candidates, denoted as 𝒟={d1,d2,,dN}𝒟subscript𝑑1subscript𝑑2subscript𝑑𝑁\mathcal{D}=\{d_{1},d_{2},\ldots,d_{N}\}caligraphic_D = { italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_d start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT }, where d1,d2,,dNsubscript𝑑1subscript𝑑2subscript𝑑𝑁d_{1},d_{2},\ldots,d_{N}italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_d start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT are the names of the N𝑁Nitalic_N drug molecules involved in this trial. This study focuses on trials that aimed at discovering new uses for these drug candidates while excluding trials that involve non-drug interventions such as surgery or medical devices.

(1) 𝒟={d1,d2,,dN}.𝒟subscript𝑑1subscript𝑑2subscript𝑑𝑁\mathcal{D}=\{d_{1},d_{2},\ldots,d_{N}\}.caligraphic_D = { italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_d start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT } .
Definition 3.2 (Target Disease Set).

For a trial addressing Kδsubscript𝐾𝛿K_{\delta}italic_K start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT diseases, the Target Disease Set is represented by 𝒯={t1,t2,,tKδ}𝒯subscript𝑡1subscript𝑡2subscript𝑡subscript𝐾𝛿\mathcal{T}=\{t_{1},t_{2},\ldots,t_{K_{\delta}}\}caligraphic_T = { italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_t start_POSTSUBSCRIPT italic_K start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT end_POSTSUBSCRIPT }, with each tisubscript𝑡𝑖t_{i}italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT being the disease name for the i𝑖iitalic_i-th disease.

(2) 𝒯={t1,t2,,tKδ}.𝒯subscript𝑡1subscript𝑡2subscript𝑡subscript𝐾𝛿\mathcal{T}=\{t_{1},t_{2},\ldots,t_{K_{\delta}}\}.caligraphic_T = { italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_t start_POSTSUBSCRIPT italic_K start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT end_POSTSUBSCRIPT } .
Definition 3.3 (Trial Eligibility Criteria).

The trial eligibility criteria encompass both inclusion (+) and exclusion (-) criteria, which respectively outline the desired and undesirable attributes of potential participants. These criteria provide details on various key parameters such as age, gender, location, medical background, the status of the target disease, and the present health condition.

(3) 𝒞=[𝝍1+,,𝝍Q+,𝝍1,,𝝍R],𝝍k+/is a criterion,𝒞superscriptsubscript𝝍1superscriptsubscript𝝍𝑄superscriptsubscript𝝍1superscriptsubscript𝝍𝑅superscriptsubscript𝝍𝑘absentis a criterion\mathcal{C}=[\bm{\psi}_{1}^{+},\ldots,\bm{\psi}_{Q}^{+},\bm{\psi}_{1}^{-},% \ldots,\bm{\psi}_{R}^{-}],\ \ \ \ \ \ \bm{\psi}_{k}^{+/-}\ \text{is a % criterion},caligraphic_C = [ bold_italic_ψ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT , … , bold_italic_ψ start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT , … , bold_italic_ψ start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ] , bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + / - end_POSTSUPERSCRIPT is a criterion ,

where Q𝑄Qitalic_Q (R𝑅Ritalic_R) is the number of inclusion (exclusion) criteria in the trial. The term 𝝍k+superscriptsubscript𝝍𝑘\bm{\psi}_{k}^{+}bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT + end_POSTSUPERSCRIPT (𝝍ksuperscriptsubscript𝝍𝑘\bm{\psi}_{k}^{-}bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT) designates the k𝑘kitalic_k-th inclusion (exclusion) criterion within the eligibility criteria. Each criterion 𝝍𝝍\bm{\psi}bold_italic_ψ is a sentence in unstructured natural language.

Definition 3.4 (Clinical Trial Categorical/Numerical Feature).

Other clinical trial features also have considerable impacts on trial enrollment, including (1) demographic features, for example, the gender of the recruited patients (male or female or both), age of recruited patients (maximum and minimum age); (2) the trial phase (phase I, II, III, or IV); and (3) the geographical feature of the trial. These features are mostly categorical or numerical, denoted 𝒲𝒲\mathcal{W}caligraphic_W.

Definition 3.5 (Clinical Trial Enrollment Success).

The trial enrollment success is the groundtruth of our model, denoted y{0,1}𝑦01y\in\{0,1\}italic_y ∈ { 0 , 1 }, a binary variable indicating the success of trial enrollment.

Problem (Clinical Trial Enrollment Success Prediction). The estimation of y𝑦yitalic_y, represented as y^^𝑦\widehat{y}over^ start_ARG italic_y end_ARG, can be formulated through the machine learning model fΘsubscript𝑓Θf_{\Theta}italic_f start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT, such that

(4) y^=fΘ(𝒟,𝒯,𝒞,𝒲),^𝑦subscript𝑓Θ𝒟𝒯𝒞𝒲\widehat{y}=f_{\Theta}(\mathcal{D},\mathcal{T},\mathcal{C},\mathcal{W}),over^ start_ARG italic_y end_ARG = italic_f start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT ( caligraphic_D , caligraphic_T , caligraphic_C , caligraphic_W ) ,

where y^^𝑦\widehat{y}over^ start_ARG italic_y end_ARG denotes the predicted enrollment state of a trial; fΘsubscript𝑓Θf_{\Theta}italic_f start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT refers to the parameterized machine learning model, ΘΘ\Thetaroman_Θ denotes the learnable parameter. In this context, 𝒟𝒟\mathcal{D}caligraphic_D, 𝒯𝒯\mathcal{T}caligraphic_T, 𝒞𝒞\mathcal{C}caligraphic_C and 𝒲𝒲\mathcal{W}caligraphic_W refer to the drug set, the target disease set, the trial eligibility criteria, and other categorical features, respectively. For ease of exposition, Table 1 shows an example of a real clinical trial and all the related features.

Table 1. A real example of a clinical trial record.
Feature Descriptions
NCTID NCT00610792 (trial identifier)
drug bortezomib and pegylated liposomal doxorubicin
disease Ovarian Cancer
phase II
country Italy, Switzerland
gender Female
study type interventional
title Phase 2 Study of Twice Weekly VELCADE and CAELYX in Patients With Ovarian Cancer Failing Platinum Containing Regimens
summary This is a Phase 2, multicenter open-label, uncontrolled 2-step design. Patients will be arranged in two groups based on the response to their last platinum containing therapy.
The two groups are, 1) Platinum-Resistant Patients: patients with the progressive disease while on platinum-containing therapy or stable disease after at least 4 cycles; patients relapsing following an objective response while still receiving treatment; patients relapsing after an objective response within 6 months from the discontinuation of the last chemotherapy and 2) Platinum-Sensitive Patients: patients who relapsed following an objective response
inclusion criteria ECOG performance status grade 0 or 1 ; Age \geq 18 and \leq 75 yrs; Life expectancy of at least 3 months; LVEF must be within normal limits; …
exclusion Criteria Chemotherapy, hormonal, radiation or immunotherapy or participation in any investigational drug study within 4 weeks of study entry; Pre-existing peripheral neuropathy >>> Grade 1; Presence of cirrhosis or active or chronic hepatitis; … Pregnancy or lactation or unwillingness to use adequate method of birth control; Active infection; Known history of allergy to mannitol, boron or liposomally formulated drugs.
start date July 2006
completed date September 2009
duration 3.2 years
sponsor Millennium Pharmaceuticals, Inc.
outcome withdrawn

3.3. Feature Engineering

In this section, we detail the handcrafted feature engineering process employed to transform raw clinical trial data into structured inputs suitable for machine learning models. The features are derived from various aspects of the clinical trial records, including drug information, disease characteristics, eligibility criteria, demographic details, and geographical information. For each categorical feature, we use one-hot encoding to represent it.

3.3.1. Drug Embedding

To represent the drug information, we utilize the pre-trained BioBERT model (Lee et al., 2020), which is specifically trained on biological texts. A single clinical trial may involve several drugs. Each drug name is converted into an embedding vector using BioBERT. We then apply mean pooling to aggregate the token embeddings into a single fixed-size vector representing the drug. This approach captures the semantic information inherent in the drug names.

3.3.2. Disease Embedding

Similar to the drug embedding process, we use the pre-trained BioBERT model (Lee et al., 2020) to obtain embeddings for the disease names. A single clinical trial may involve several diseases. Each disease name is transformed into an embedding vector, and mean pooling is applied to generate a final disease embedding. This method ensures that the semantic nuances of the disease names are effectively captured.

3.3.3. Eligibility Criteria Embedding

The eligibility criteria, which include both inclusion and exclusion criteria, are processed at the sentence level. Each criterion sentence is embedded using the pre-trained BioBERT model (Lee et al., 2020), specifically utilizing the [CLS] token to obtain the sentence embedding. This token is designed to capture the overall meaning of the sentence, making it suitable for representing the criteria.

3.3.4. Demographic Features

Gender:

The gender of the participants is included as categorical features, with possible values being “female”, “male”, and “all”. This categorical representation allows the model to distinguish between different gender requirements.

Age:

The minimum and maximum age requirements are included as numerical features. These values are critical as they influence the difficulty of enrolling participants within the specified age range.

3.3.5. Trial Phase

The phase of the clinical trial is treated as a categorical feature. There are four phases: Phase I (safety), Phase II (efficacy and side effects), Phase III (comparative effectiveness), and Phase IV (post-market surveillance). Each phase is encoded as a separate category, allowing the model to differentiate between the distinct stages of clinical development.

3.3.6. Criteria Count

We introduce a feature representing the count of inclusion and exclusion criteria, as well as the total number of criteria. The hypothesis is that a higher number of criteria may correlate with increased difficulty in enrolling participants.

3.3.7. Geographical Features

Country, State, City:

The geographical location of the trial is a categorical feature. Different regions may have varying distributions of disease prevalence and patient availability, making this a valuable feature.

3.4. Large Language Model-based Feature Enhancement

Table 2. Prompts that are used in LLM-based feature enhancement (Section 3.4).
Kategorie Prompt Type Description
Drug System Prompt You are a highly knowledgeable clinical pharmacologist. Given a string that contains the name of a drug, please: - Provide the name of the drug. - Offer a comprehensive description of the drug, including: – Mechanism of action – Common uses – Notable side effects – Discuss the difficulty of recruiting patients for clinical trials involving this drug, including the reasons behind these challenges. Noted instruction: Please respond with fewer than 100 words.
User Prompt The following string contains the name of a drug: ¡string¿{drug}¡/string¿
Disease System Prompt You are a highly knowledgeable clinical epidemiologist. Given a string that contains the name of a disease, please: - Provide the name of the disease. - Offer a comprehensive description of the disease, including: – Pathogenesis (mechanism of disease development) – Common symptoms – Typical treatment options – Discuss the difficulty of recruiting patients for clinical trials involving this disease, including the reasons behind these challenges. Noted instruction: Please respond with fewer than 100 words.
User Prompt The following string contains the name of a disease: ¡string¿{disease}¡/string¿

Large Language Models (LLMs) are advanced artificial intelligence systems (Zhao et al., 2023) designed to process, understand, generate, and manipulate natural language text. Leveraging the vast amounts of data they are trained on, LLMs can significantly enhance feature representation in various applications, including clinical trial data analysis.

In our approach, we utilize LLMs to enrich the representations of drug and disease information beyond the capabilities of traditional embeddings. The process involves two main steps: generating detailed contextual information and embedding this information using BioBERT (Lee et al., 2020).

3.4.1. Enhanced Drug and Disease Representation

To enhance the representation of drugs and diseases, we first generate comprehensive introductory paragraphs for each entity using a large language model (LLM) (Jiang et al., 2023)111Model version: mistralai/Mistral-7B-Instruct-v0.3.. This approach leverages the LLM’s ability to synthesize relevant knowledge from its extensive training data, providing a richer context for each drug and disease.

Step 1: Context Generation

For each drug and disease, we prompt the LLM with a specific query to generate an introductory paragraph. The prompt is designed to elicit detailed information that captures the essential characteristics and context of the drug or disease. Table 2 displays the prompt.

Step 2: Embedding Generation

Once the introductory paragraphs are generated, we use the pre-trained BioBERT model to obtain embeddings for these paragraphs. Specifically, we utilize the [CLS] token to capture the overall meaning of the text. This token is designed to aggregate the contextual information of the entire paragraph into a single embedding vector.

Step 3: Integration into the Model

The resulting embeddings from the introductory paragraphs are then integrated into our model. These enhanced embeddings provide a richer and more nuanced representation of the drugs and diseases, potentially improving the model’s performance in predicting the success of clinical trial enrollment.

By integrating LLM-generated contextual information and leveraging their analytical capabilities, we significantly enhance the feature representation and analysis processes, leading to more accurate and insightful predictions for predicting the success of clinical trial enrollment.

Refer to caption
Figure 2. The Deep & Cross Network.

3.5. Deep & Cross Network

After obtaining the features, we customize Deep & Cross Network (DCN) (Wang et al., 2017) to effectively learn and integrate semantic information from both handcrafted features and embeddings derived from natural language processing (NLP) models. The model architecture is shown in Figure 2. The DCN consists of two essential components: a deep network and a cross-network. DCN explicitly models feature interactions at different levels. The cross-network component captures feature crosses efficiently, which can be more effective than the implicit feature interactions learned by MLPs.

3.5.1. Deep Network: Hierarchical Attention Network

The deep network component of the DCN is implemented as a Hierarchical Attention Network (HAN), inspired by the architecture used in (Yue et al., 2024). The HAN is designed to capture hierarchical structures in the eligibility criteria data, such as the relationships between words and sentences. This is particularly useful for processing the rich textual information embedded in clinical eligibility trial data.

The hierarchical attention mechanism operates at two levels:

  • Word-level attention: This layer captures the importance of each word within a sentence, allowing the model to focus on the most relevant words.

  • Sentence-level attention: This layer captures the importance of each sentence within the inclusion (or exclusion) criteria, enabling the model to focus on the most relevant criteria sentences. Sentence-level attention is of particular importance in clinical trials. For example, considering two inclusion criteria: one is to recruit patients with Amyotrophic Lateral Sclerosis (ALS), and the other is to recruit patients aged 18 or older. Clearly, due to the rarity of ALS, the criterion related to ALS has a more significant impact.

Formally, let 𝐡itsubscript𝐡𝑖𝑡\mathbf{h}_{it}bold_h start_POSTSUBSCRIPT italic_i italic_t end_POSTSUBSCRIPT be the hidden state of the t𝑡titalic_t-th word in the i𝑖iitalic_i-th sentence. The word-level attention mechanism computes a context vector 𝐮isubscript𝐮𝑖\mathbf{u}_{i}bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for each sentence as follows:

(5) 𝐮i=tαit𝐡it,subscript𝐮𝑖subscript𝑡subscript𝛼𝑖𝑡subscript𝐡𝑖𝑡\mathbf{u}_{i}=\sum_{t}\alpha_{it}\mathbf{h}_{it},bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_α start_POSTSUBSCRIPT italic_i italic_t end_POSTSUBSCRIPT bold_h start_POSTSUBSCRIPT italic_i italic_t end_POSTSUBSCRIPT ,

where αitsubscript𝛼𝑖𝑡\alpha_{it}italic_α start_POSTSUBSCRIPT italic_i italic_t end_POSTSUBSCRIPT are the attention weights that determine the importance of each word in the sentence.

Similarly, the sentence-level attention mechanism computes a document-level context vector 𝐮𝐮\mathbf{u}bold_u as:

(6) 𝐮=iβi𝐮i,𝐮subscript𝑖subscript𝛽𝑖subscript𝐮𝑖\mathbf{u}=\sum_{i}\beta_{i}\mathbf{u}_{i},bold_u = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bold_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ,

where βisubscript𝛽𝑖\beta_{i}italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are the attention weights that determine the importance of each sentence in the document.

3.5.2. Cross Network

The cross network component is designed to explicitly learn feature interactions, particularly those involving handcrafted features. Unlike MLPs, which may struggle to capture higher-order interactions, the cross network efficiently models these interactions through a series of cross layers.

Given an input feature vector 𝐱d𝐱superscript𝑑\mathbf{x}\in\mathbb{R}^{d}bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, the cross network computes the l𝑙litalic_l-th cross layer as:

(7) 𝐱(l+1)=𝐱(0)𝐱(l)𝐰(l)+𝐛(l)+𝐱(l),superscript𝐱𝑙1superscript𝐱0superscript𝐱superscript𝑙topsuperscript𝐰𝑙superscript𝐛𝑙superscript𝐱𝑙\mathbf{x}^{(l+1)}=\mathbf{x}^{(0)}\mathbf{x}^{(l)^{\top}}\mathbf{w}^{(l)}+% \mathbf{b}^{(l)}+\mathbf{x}^{(l)},bold_x start_POSTSUPERSCRIPT ( italic_l + 1 ) end_POSTSUPERSCRIPT = bold_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT bold_x start_POSTSUPERSCRIPT ( italic_l ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT bold_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT + bold_b start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT + bold_x start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT ,

where 𝐱(0)superscript𝐱0\mathbf{x}^{(0)}bold_x start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT is the original input vector, 𝐰(l)superscript𝐰𝑙\mathbf{w}^{(l)}bold_w start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT and 𝐛(l)superscript𝐛𝑙\mathbf{b}^{(l)}bold_b start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT are the weights and biases of the l𝑙litalic_l-th cross layer, respectively. This formulation allows the model to learn explicit feature interactions iteratively.

3.5.3. Integration and Training

The outputs of the deep network (HAN) and the cross network are concatenated and fed into a final fully connected layer. This combined representation leverages both the hierarchical semantic information captured by the deep network and the explicit feature interactions learned by the cross network.

The final prediction y^^𝑦\widehat{y}over^ start_ARG italic_y end_ARG is computed as:

(8) y^=σ(𝐰p[𝐮;𝐱(L)]+bp),^𝑦𝜎superscriptsubscript𝐰𝑝top𝐮superscript𝐱𝐿subscript𝑏𝑝\widehat{y}=\sigma\big{(}\mathbf{w}_{p}^{\top}[\mathbf{u};\mathbf{x}^{(L)}]+b_% {p}\big{)},over^ start_ARG italic_y end_ARG = italic_σ ( bold_w start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT [ bold_u ; bold_x start_POSTSUPERSCRIPT ( italic_L ) end_POSTSUPERSCRIPT ] + italic_b start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) ,

where σ()𝜎\sigma(\cdot)italic_σ ( ⋅ ) is the activation function (e.g., sigmoid for binary classification), 𝐰psubscript𝐰𝑝\mathbf{w}_{p}bold_w start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT and bpsubscript𝑏𝑝b_{p}italic_b start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT are the weights and bias of the final fully connected layer, and L𝐿Litalic_L is the number of cross layers. We utilize binary cross-entropy loss as the loss criterion. The objective function is

(9) i=1Nyilogy^i(1yi)log(1y^i),superscriptsubscript𝑖1𝑁subscript𝑦𝑖subscript^𝑦𝑖1subscript𝑦𝑖1subscript^𝑦𝑖\sum_{i=1}^{N}-y_{i}\log\widehat{y}_{i}-(1-y_{i})\log(1-\widehat{y}_{i}),∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT - italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_log over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - ( 1 - italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) roman_log ( 1 - over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ,

where N𝑁Nitalic_N is the number of data points in training set, yisubscript𝑦𝑖y_{i}italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and y^isubscript^𝑦𝑖\widehat{y}_{i}over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are groundtruth and prediction of the i𝑖iitalic_i-th data point, respectively. AdamW (Loshchilov and Hutter, 2017) is used as the numerical optimizer to minimize the objective function.

By integrating the hierarchical attention network and the cross network, the customized DCN effectively captures both deep semantic information and explicit feature interactions, leading to improved performance in predicting the enrollment success of a clinical trial.

4. Experiment

In this section, we present the empirical studies. We first describe the data curation process in Section 4.1. Then, Section 4.2 briefly describes the experimental setup. After that, we present the experimental results and ablation studies in Section 4.3 and 4.4.

4.1. Data Curation

For this study, we utilized a dataset from ClinicalTrials.gov (Zarin et al., 2011), a comprehensive global registry of clinical trials. Our objective was to analyze the factors influencing the enrollment success of these trials. We specifically focused on trials with complete datasets to ensure the reliability of our analysis.

Each clinical trial in the dataset was represented as an XML file. From these files, we extracted a variety of pertinent information, including the National Clinical Trial (NCT) ID, geographical location, gender, age, trial phase, disease name, drug name, and eligibility criteria (both inclusion and exclusion criteria). Following data extraction, we conducted feature engineering and incorporated features enhanced by large language models (LLMs).

To prevent data leakage, we divided the dataset into training and testing subsets based on a temporal cutoff of January 1, 2015. Trials that concluded before this date were included in the training set, while those that commenced after this date were allocated to the testing set. This temporal split ensured that the training data did not contain information from the future relative to the testing data. Our final dataset comprised 31,094 records, 22,579 of which were allocated to the training set and 8,515 to the testing set. The distribution of trial completion counts by year range is detailed in Table 3.

Table 3. Distribution of trial completion counts by year range.
Year range Before 2000 2000-2004 2005-2009 2010-2014 2015-2019 After 2020
# trials 52 1,055 10,133 11,643 7,969 242

Given the imbalanced nature of the dataset, where the distribution is heavily skewed towards class 0 (non-enrollment) as shown in Table 4, it is crucial to comprehensively assess the performance metrics to understand the strengths and weaknesses of each model. To address this imbalance, we employed oversampling techniques to increase the number of samples in the minority class (class 1) to match the number of samples in the majority class (class 0).

Table 4. Distribution of clinical trial enrollment success by phase.
Phase I II III IV
# negative 5,564 11,297 7,227 3,698
# positive 426 1,284 422 496

The pie chart in Figure 3 illustrates the distribution of clinical trial records by country. The United States has the largest share of records, comprising 20.6% of the total. This is followed by Germany (5.1%), Canada (4.6%), the United Kingdom (4.0%), and France (3.9%). Other notable countries include Spain, Italy, Belgium, Poland, and several others, each contributing between 1.4% and 3.5%. The remaining countries are grouped into the “Others” category, which makes up 30.8% of the records. This distribution highlights the global nature of clinical trials but also indicates a concentration in certain countries.

Refer to caption
Figure 3. Distribution of geographical locations of clinical trial records (country-level).

Table 5 presents a summary of the statistics for the age-related features used in our model. The features include Min-age, Max-age, and Age-span. Min-age represents the minimum age recorded, Max-age denotes the maximum age recorded, and Age-span is the range between the minimum and maximum ages. Note that the minimum value recorded is -1, which indicates missing or undefined values.

Table 5. Summary statistics of age-related features.
Statistic Min-age Max-age Age-span
Mean 19.26 29.42 20.24
Min -1.00 -1.00 -66.00
25% 18.00 -1.00 -1.00
50% 18.00 -1.00 -1.00
75% 18.00 65.00 47.00
Max 83.00 365.00 132.00

Table 6 provides a summary of the statistics for the sentence-related features used in our clinical trial enrollment prediction model. The features include Inclusion Criteria Count, Exclusion Criteria Count, and Total Criteria Count, which is the sum of the inclusion and exclusion criteria.

Table 6. Summary statistics of Criteria Count.
Statistic Inclusion Exclusion Total
Mean 13.16 18.20 31.36
Min 0.00 0.00 0.00
25% 4.00 6.00 12.00
50% 9.00 13.00 23.00
75% 17.00 25.00 44.00
Max 163.00 174.00 215.00

Table 7 shows the distribution of gender within the dataset. “All genders” means both female and male participants can be recruited.

Table 7. Distribution of gender in the dataset.
Gender Count Proportion (%)
All genders 27,316 87.85
Female 2,450 7.87
Male 1,328 4.27
Table 8. Summary of model performance for clinical trial enrollment success prediction. Results are averaged over three independent runs; corresponding standard deviations are also shown.
Model PR-AUC ROC-AUC F1 score Precision Recall Accuracy
LR 0.6499 ±plus-or-minus\pm± 0.0027 0.6814 ±plus-or-minus\pm± 0.0035 0.4463 ±plus-or-minus\pm± 0.0058 0.6816 ±plus-or-minus\pm± 0.0023 0.3318 ±plus-or-minus\pm± 0.0060 0.5884 ±plus-or-minus\pm± 0.0023
GBDT 0.6660 ±plus-or-minus\pm± 0.0102 0.6317 ±plus-or-minus\pm± 0.0010 0.6191 ±plus-or-minus\pm± 0.0135 0.5960 ±plus-or-minus\pm± 0.0025 0.6445 ±plus-or-minus\pm± 0.0259 0.6039 ±plus-or-minus\pm± 0.0064
AdaBoost 0.6525 ±plus-or-minus\pm± 0.0022 0.6816 ±plus-or-minus\pm± 0.0029 0.5631 ±plus-or-minus\pm± 0.0062 0.6552 ±plus-or-minus\pm± 0.0021 0.4937 ±plus-or-minus\pm± 0.0085 0.6169 ±plus-or-minus\pm± 0.0031
RF 0.6725 ±plus-or-minus\pm± 0.0042 0.6796 ±plus-or-minus\pm± 0.0039 0.3739 ±plus-or-minus\pm± 0.0167 0.7591 ±plus-or-minus\pm± 0.0058 0.2482 ±plus-or-minus\pm± 0.0142 0.5848 ±plus-or-minus\pm± 0.0060
MLP 0.6773 ±plus-or-minus\pm± 0.0060 0.7146 ±plus-or-minus\pm± 0.0028 0.4938 ±plus-or-minus\pm± 0.0907 0.6975 ±plus-or-minus\pm± 0.0259 0.4002 ±plus-or-minus\pm± 0.1316 0.6094 ±plus-or-minus\pm± 0.0239
TrialEnroll 0.7002 ±plus-or-minus\pm± 0.0013 0.7352 ±plus-or-minus\pm± 0.0021 0.4507 ±plus-or-minus\pm± 0.0529 0.7412 ±plus-or-minus\pm± 0.0102 0.3275 ±plus-or-minus\pm± 0.0567 0.6060 ±plus-or-minus\pm± 0.0160

4.2. Experimental Setup

In this section, we briefly describe the experimental setup, including evaluation metrics, baseline methods and implementation details. A detailed description is available in the Appendix.

Evaluation metrics. Clinical trial enrollment success prediction is formulated as a binary classification in this paper. We use six different evaluation metrics to measure performance comprehensively, including Precision-Recall Area Under Curve (PR-AUC), Area Under the Receiver Operating Characteristic Curve (ROC-AUC), F1 score, precision, recall, and accuracy. The scores for all six metrics range from 0 to 1; a higher value represents better performance.

Baseline methods. We selected multiple widely recognized models as baselines, including Logistic Regression (LR), Gradient Boosting Decision Tree (Ke et al., 2017) (GBDT), Adaptive Boosting (Rätsch et al., 2001) (AdaBoost), Random Forest (Breiman, 2001) (RF), and Multi-Layer Perceptron (Popescu et al., 2009) (MLP). The data fed into each model is the same to ensure a fair comparison.

Implementation details are elaborated in Section A.3 in Appendix.

4.3. Experimental Results

As illustrated in Table 8, the performance of various predictive models for clinical trial enrollment success prediction is evaluated using metrics such as PR-AUC, ROC-AUC, F1 Score, Accuracy, Precision, and Recall.

Among all the models evaluated, TrialEnroll demonstrates the best performance on the PR-AUC, which is the most critical metric in clinical trials. This model achieves a PR-AUC of 0.7002 and an ROC-AUC of 0.7352, both of which surpass those of the baseline models, indicating its robustness in accurately predicting trial enrollment success.

However, it is noteworthy that GBDT achieves the highest scores in F1 Score and Recall, while Random Forest excels in the Precision metric, and AdaBoost stands out in terms of Accuracy. GBDT, AdaBoost, and Random Forest are particularly well-suited for tabular data, and their superior performance further underscores the effectiveness of our handcrafted feature engineering and the incorporation of LLM-enhanced features. TrialEnroll does not outperform all models across every metric. Therefore, in practical applications, users should consider model complexity and specific task metrics to select the most appropriate model.

Importantly, this paper is the first to identify the problem of predicting trial enrollment success, design a novel feature engineering approach incorporating LLM features, and provide a benchmark for several well-established machine learning models. While the novel model TrialEnroll excels in PR-AUC and ROC-AUC, it does not achieve the best performance across all metrics. Future work will focus on designing an improved model to address this limitation.

To further demonstrate the importance of handcrafted features, we used permutation importance with a simple logistic regression model. This method involves shuffling the values of each feature and measuring the drop in model performance (using PR-AUC) compared to the baseline. The difference in PR-AUC before and after shuffling quantifies the feature’s importance. We repeated this process three times to obtain the mean importance scores and their standard deviations.

The results are shown in Figure 4. From the results, we can see that “inclusion criteria count” and “max age” are the most impactful features. This finding makes intuitive sense: the more inclusion criteria there are, the harder it is to enroll participants. Additionally, a higher maximum age may facilitate easier enrollment.

Table 9. Comparison of enhanced features using MLP model.
Method PR-AUC ROC-AUC Accuracy F1 Precision Recall
Origin 0.6497±0.0173plus-or-minus0.64970.01730.6497\pm 0.01730.6497 ± 0.0173 0.6864±0.0138plus-or-minus0.68640.01380.6864\pm 0.01380.6864 ± 0.0138 0.6070±0.0365plus-or-minus0.60700.03650.6070\pm 0.03650.6070 ± 0.0365 0.5240±0.1428plus-or-minus0.52400.14280.5240\pm 0.14280.5240 ± 0.1428 0.6549±0.0325plus-or-minus0.65490.03250.6549\pm 0.03250.6549 ± 0.0325 0.4792±0.1945plus-or-minus0.47920.19450.4792\pm 0.19450.4792 ± 0.1945
Origin + LLM 0.6544±0.0116plus-or-minus0.65440.01160.6544\pm 0.01160.6544 ± 0.0116 0.6887±0.0076plus-or-minus0.68870.00760.6887\pm 0.00760.6887 ± 0.0076 0.6108±0.0244plus-or-minus0.61080.02440.6108\pm 0.02440.6108 ± 0.0244 0.5231±0.0960plus-or-minus0.52310.09600.5231\pm 0.09600.5231 ± 0.0960 0.6715±0.0256plus-or-minus0.67150.02560.6715\pm 0.02560.6715 ± 0.0256 0.4514±0.1464plus-or-minus0.45140.14640.4514\pm 0.14640.4514 ± 0.1464
Origin + LLM + Handcrafted 0.6773 ±plus-or-minus\pm± 0.0060 0.7146 ±plus-or-minus\pm± 0.0028 0.6094±0.0239plus-or-minus0.60940.02390.6094\pm 0.02390.6094 ± 0.0239 0.4938±0.0907plus-or-minus0.49380.09070.4938\pm 0.09070.4938 ± 0.0907 0.6975±0.0259plus-or-minus0.69750.02590.6975\pm 0.02590.6975 ± 0.0259 0.4002±0.1316plus-or-minus0.40020.13160.4002\pm 0.13160.4002 ± 0.1316

4.4. Ablation Study

Refer to caption
Figure 4. Ablation study on feature. Permutation importance of features using PR-AUC.

Table 9 presents an ablation study comparing the performance of different feature sets. The feature sets evaluated are:

  • Origin: Includes embeddings for drug names, disease names, and criteria (both inclusion and exclusion).

  • Origin + LLM: Combines the Origin embeddings with additional features enhanced by a Large Language Model (LLM).

  • Origin + LLM + Handcrafted: Integrates the Origin and LLM-enhanced features with additional handcrafted features.

The results demonstrate that both LLM-enhanced features and handcrafted features contribute to significant improvements. The Origin feature set achieves a PR AUC of 0.6497. Adding LLM-enhanced features increases the PR AUC to 0.6544, and incorporating handcrafted features further improves it to 0.6773.

5. Conclusion

In this paper, we have addressed the problem of predicting clinical trial enrollment success by establishing a benchmark to evaluate well-known machine learning models and designing a customized Deep & Cross Network (DCN) model named TrialEnroll. Our approach also involved effective handcrafted feature engineering techniques.

Technical Contributions

Our work presents several key technical contributions:

  • Benchmark Establishment: We created a robust benchmark for evaluating machine learning models on clinical trial enrollment success prediction.

  • Customized DCN Model: We designed TrialEnroll, a DCN model that integrates a Hierarchical Attention Network (HAN) and a cross network to combine deep semantic information with handcrafted feature interactions.

  • Feature Engineering: We developed a detailed handcrafted feature engineering process, including the use of pre-trained BioBERT embeddings for drug and disease information.

  • LLM-based Feature Enhancement: We leveraged large language models (LLMs) to enrich drug and disease representations, improving model performance.

Clinical Implications

The findings of this study have significant clinical implications:

  • Improved Enrollment Predictions: Our model can help researchers and sponsors identify potential challenges early in the trial design process.

  • Resource Optimization: Enhanced prediction capabilities enable more efficient allocation of resources, leading to more successful and cost-effective clinical trials.

  • Patient Recruitment: Our model can assist in identifying trials that are more likely to succeed in enrolling participants.

Limitations and Future Work

While our study presents promising results, there are several limitations and areas for future work:

  • Data Limitations: Future work could involve expanding the dataset to include a more diverse range of trials.

  • Feature Expansion: Incorporating additional features, such as patient-level data, could enhance predictive capabilities.

In summary, our study provides a comprehensive approach to predicting clinical trial enrollment success, offering valuable insights and tools for researchers and sponsors. The proposed TrialEnroll model, along with our benchmark and feature engineering techniques, represents a significant step forward in the field of clinical trial optimization.

References

  • (1)
  • Alsultan et al. (2020) Abdullah Alsultan, Wael A. Alghamdi, Jahad Alghamdi, Abeer F. Alharbi, Abdullah Aljutayli, Ahmed Albassam, Omar Almazroo, and Saeed Alqahtani. 2020. Clinical pharmacology applications in clinical drug development and clinical care: A focus on Saudi Arabia. Saudi Pharmaceutical Journal 28, 10 (2020), 1217–1227. https://doi.org/10.1016/j.jsps.2020.08.012
  • Baskin (2019) Kara Baskin. 2019. Using machine learning to better predict clinical trial outcomes. https://mitsloan.mit.edu/ideas-made-to-matter/using-machine-learning-to-better-predict-clinical-trial-outcomes [Accessed: (9 March 2024)].
  • Breiman (2001) Leo Breiman. 2001. Random forests. Machine learning 45 (2001), 5–32.
  • Charles A. Knirsch (2012) Phillip B. Chappell Charles A. Knirsch. 2012. Risk Assessment and Mitigation. https://www.appliedclinicaltrialsonline.com/view/risk-assessment-and-mitigation [Accessed: (9 March 2024)].
  • Chen et al. (2024c) Jintai Chen, Yaojun Hu, Yue Wang, Yingzhou Lu, Xu Cao, Miao Lin, Hongxia Xu, Jian Wu, Cao Xiao, Jimeng Sun, et al. 2024c. TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets. arXiv preprint arXiv:2407.00631 (2024).
  • Chen et al. (2021) Lulu Chen, Lu Lu, Chiung-Ting Wu, Robert Clarke, Guoqiang Yu, Jennifer E Van Eyk, David M Herrington, and Yue Wang. 2021. Data-driven detection of subtype-specific differentially expressed genes. Scientific reports 11, 1 (2021), 332.
  • Chen et al. (2024a) Tianyi Chen, Nan Hao, Yingzhou Lu, and Capucine Van Rechem. 2024a. Uncertainty Quantification on Clinical Trial Outcome Prediction. arXiv preprint arXiv:2401.03482 (2024).
  • Chen et al. (2024b) Tianyi Chen, Nan Hao, Capucine Van Rechem, Jintai Chen, and Tianfan Fu. 2024b. Uncertainty quantification and interpretability for clinical trial approval prediction. Health Data Science 4 (2024), 0126.
  • Cruz Rivera et al. (2020) Samantha Cruz Rivera, Christel McMullan, Laura Jones, Derek Kyte, Anita Slade, and Melanie Calvert. 2020. The impact of patient-reported outcome data from clinical trials: perspectives from international stakeholders. Journal of patient-reported outcomes 4 (2020), 1–14.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
  • DiMasi et al. (2016) Joseph A DiMasi, Henry G Grabowski, and Ronald W Hansen. 2016. Innovation in the pharmaceutical industry: new estimates of R&D costs. Journal of health economics 47 (2016), 20–33.
  • Friedman et al. (2015) Lawrence M Friedman, Curt D Furberg, David L DeMets, David M Reboussin, and Christopher B Granger. 2015. Fundamentals of clinical trials. Springer.
  • Fu et al. (2023) Tianfan Fu, Kexin Huang, and Jimeng Sun. 2023. Automated prediction of clinical trial outcome. US Patent App. 17/749,065.
  • Fu et al. (2022) Tianfan Fu, Kexin Huang, Cao Xiao, Lucas M Glass, and Jimeng Sun. 2022. HINT: Hierarchical interaction network for clinical-trial-outcome predictions. Patterns 3, 4 (2022), 100445.
  • Fu et al. (2024) Yi Fu, Yizhi Wang, Bai Zhang, Zhen Zhang, Guoqiang Yu, Chunyu Liu, Robert Clarke, David M Herrington, and Yue Wang. 2024. DDN3. 0: Determining significant rewiring of biological network structure with differential dependency networks. Bioinformatics (2024), btae376.
  • Gao et al. (2024) Chufan Gao, Tianfan Fu, and Jimeng Sun. 2024. Language Interaction Network for Clinical Trial Approval Estimation. arXiv preprint arXiv:2405.06662 (2024).
  • Gao et al. (2020) Junyi Gao et al. 2020. COMPOSE: Cross-Modal Pseudo-Siamese Network for Patient Trial Matching. In KDD.
  • Haddad et al. (2015) Robert I Haddad, Anthony TC Chan, and Jan B Vermorken. 2015. Barriers to clinical trial recruitment in head and neck cancer. Oral oncology 51, 3 (2015), 203–211.
  • Hill (1952) A Bradford Hill. 1952. The clinical trial. New England Journal of Medicine 247, 4 (1952), 113–119.
  • Jakkula (2006) Vikramaditya Jakkula. 2006. Tutorial on support vector machine (svm). School of EECS, Washington State University 37, 2.5 (2006), 3.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
  • Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems 30 (2017).
  • Kerali (2018) Henry Kerali. 2018. Forecasting Clinical Trials: The Essential Checklist. https://www.clinicaltrialsarena.com/comment/how-to-forecast-a-clinical-trial-5783321-2/ [Accessed: (9 March 2024)].
  • LaValley (2008) Michael P LaValley. 2008. Logistic regression. Circulation 117, 18 (2008), 2395–2399.
  • Ledford (2011) Heidi Ledford. 2011. 4 Ways to fix the clinical trial: clinical trials are crumbling under modern economic and scientific pressures. Nature looks at ways they might be saved. Nature (2011).
  • Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.
  • Lo et al. (2019) Andrew W Lo, Kien Wei Siah, and Chi Heem Wong. 2019. Machine learning with statistical imputation for predicting drug approvals. Vol. 60. SSRN.
  • Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
  • Lu (2018) Yingzhou Lu. 2018. Multi-omics Data Integration for Identifying Disease Specific Biological Pathways. Ph. D. Dissertation. Virginia Tech.
  • Lu et al. (2019) Yingzhou Lu, Yi-Tan Chang, Eric P Hoffman, Guoqiang Yu, David M Herrington, Robert Clarke, Chiung-Ting Wu, Lulu Chen, and Yue Wang. 2019. Integrated identification of disease specific pathways using multi-omics data. bioRxiv (2019), 666065.
  • Lu et al. (2023) Yingzhou Lu, Kosaku Sato, and Jialu Wang. 2023. Deep Learning based Multi-Label Image Classification of Protest Activities. arXiv preprint arXiv:2301.04212 (2023).
  • Lu et al. (2021) Yingzhou Lu, Chiung-Ting Wu, Sarah J Parker, Lulu Chen, Georgia Saylor, Jennifer E Van Eyk, David M Herrington, and Yue Wang. 2021. COT: an efficient Python tool for detecting marker genes among many subtypes. bioRxiv (2021), 2021–01.
  • Lu et al. (2022) Yingzhou Lu, Chiung-Ting Wu, Sarah J Parker, Zuolin Cheng, Georgia Saylor, Jennifer E Van Eyk, Guoqiang Yu, Robert Clarke, David M Herrington, and Yue Wang. 2022. COT: an efficient and accurate method for detecting marker genes among many subtypes. Bioinformatics Advances 2, 1 (2022), vbac037.
  • Martin et al. (2017) Linda Martin, Melissa Hutchens, Conrad Hawkins, and Alaina Radnov. 2017. How much do clinical trials cost? Nat. Rev. Drug Discov. (2017).
  • Patel et al. (2003) Maxine X Patel, Victor Doku, and Lakshika Tennakoon. 2003. Challenges in recruitment of research participants. Advances in Psychiatric Treatment 9, 3 (2003), 229–238.
  • Peto (1978) Richard Peto. 1978. Clinical trial methodology. Nature 272, 5648 (1978), 15–16.
  • Popescu et al. (2009) Marius-Constantin Popescu, Valentina E Balas, Liliana Perescu-Popescu, and Nikos Mastorakis. 2009. Multilayer perceptron and neural networks. WSEAS Transactions on Circuits and Systems 8, 7 (2009), 579–588.
  • Prasad (2024) Rajiv Prasad. 2024. Navigating the Complexities of Clinical Trial Budget Forecasting and Payments. https://www.appliedclinicaltrialsonline.com/view/navigating-the-complexities-of-clinical-trial-budget-forecasting-and-payments [Accessed: (9 March 2024)].
  • Qi and Tang (2019) Youran Qi and Qi Tang. 2019. Predicting phase 3 clinical trial results by modeling phase 2 clinical trial subject level data using deep learning. In Machine Learning for Healthcare Conference. PMLR, 288–303.
  • Rätsch et al. (2001) Gunnar Rätsch, Takashi Onoda, and K-R Müller. 2001. Soft margins for AdaBoost. Machine learning 42 (2001), 287–320.
  • Srinivasa et al. (2022) Rakshith S Srinivasa, Cheng Qian, Brandon Theodorou, Jeffrey Spaeder, Cao Xiao, Lucas Glass, and Jimeng Sun. 2022. Clinical trial site matching with improved diversity using fair policy learning. arXiv preprint arXiv:2204.06501 (2022).
  • Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17. 1–7.
  • Wang et al. (2024) Yue Wang, Yinlong Xu, Zihan Ma, Hongxia Xu, Bang Du, Honghao Gao, and Jian Wu. 2024. TWIN-GPT: Digital Twins for Clinical Trials via Large Language Model. arXiv preprint arXiv:2404.01273 (2024).
  • Wu et al. (2022) Chiung-Ting Wu, Minjie Shen, Dongping Du, Zuolin Cheng, Sarah J Parker, Jennifer E Van Eyk, Guoqiang Yu, Robert Clarke, David M Herrington, et al. 2022. Cosbin: cosine score-based iterative normalization of biologically diverse samples. Bioinformatics Advances 2, 1 (2022), vbac076.
  • Yu et al. (2024) Mengjia Yu, Sheng Zhong, Yunzhao Xing, and Li Wang. 2024. Enrollment Forecast for Clinical Trials at the Planning Phase with Study-Level Historical Data. Therapeutic Innovation & Regulatory Science 58 (2024), 42–52. https://doi.org/s43441-023-00564-8
  • Yue and Fu (2024) Ling Yue and Tianfan Fu. 2024. CT-Agent: Clinical Trial Multi-Agent with Large Language Model-based Reasoning. arXiv preprint arXiv:2404.14777 (2024).
  • Yue et al. (2024) Ling Yue, Jonathan Li, Md Zabirul Islam, Bolun Xia, Tianfan Fu, and Jintai Chen. 2024. TrialDura: Hierarchical Attention Transformer for Interpretable Clinical Trial Duration Prediction. arXiv preprint arXiv:2404.13235 (2024).
  • Zarin et al. (2011) Deborah A Zarin, Tony Tse, Rebecca J Williams, Robert M Califf, and Nicholas C Ide. 2011. The ClinicalTrials. gov results database—update and key issues. New England Journal of Medicine 364, 9 (2011), 852–860.
  • Zhang et al. (2021) Bai Zhang, Yi Fu, Yingzhou Lu, Zhen Zhang, Robert Clarke, Jennifer E Van Eyk, David M Herrington, and Yue Wang. 2021. DDN2.0: R and Python packages for differential dependency network analysis of biological systems. bioRxiv (2021), 2021–04.
  • Zhang et al. (2020) Xingyao Zhang, Cao Xiao, Lucas M Glass, and Jimeng Sun. 2020. Deepenroll: Patient-trial matching with deep embedding and entailment prediction. In Proceedings of The Web Conference 2020. 1029–1037.
  • Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).

Appendix A Additional Experimental Details

In this section, we present the additional experimental details to enhance the reproducibility. Concretely, Section A.1 details the evaluation metrics to assess model performance. Section A.2 elaborates on baseline methods. Section A.3 describes the implementation details.

A.1. Evaluation Metrics

Clinical trial enrollment success prediction is formulated as a binary classification in this paper. In binary classification, there are four kinds of test data points based on their ground truth and the model’s prediction, (1) positive sample and is correctly predicted as positive, also known as True Positive (TP); (2) negative samples and is wrongly predicted as positive samples, also known as False Positive (FP); (3) negative samples and is correctly predicted as negative samples, also known as True Negative (TN); (4) positive samples and is wrongly predicted as negative samples, also known as False Negative (FN).

We use different evaluation metrics as follows. (1) PR-AUC (Precision-Recall Area Under Curve). The area under the Precision-Recall curve summarizes the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds. (2) ROC-AUC. Area Under the Receiver Operating Characteristic Curve (ROC-AUC) summarizes the trade-off between the true positive rate and the false positive rate for a predictive model using different probability thresholds. ROC-AUC is also known as the Area Under the Receiver Operating Characteristic curve (AUROC) in some literature. (3) F1. The F1 score is the harmonic mean of the precision and recall, defined as F1=21precision +1recallF121precision 1recall\text{F1}=\frac{2}{\frac{1}{\text{precision }}+\frac{1}{\text{recall}}}F1 = divide start_ARG 2 end_ARG start_ARG divide start_ARG 1 end_ARG start_ARG precision end_ARG + divide start_ARG 1 end_ARG start_ARG recall end_ARG end_ARG. (4) Precision. The precision is the performance of a classifier on the samples that are predicted as positive. It is formally defined as precision=TPTP+FPprecision𝑇𝑃𝑇𝑃𝐹𝑃\text{precision}=\frac{TP}{TP+FP}precision = divide start_ARG italic_T italic_P end_ARG start_ARG italic_T italic_P + italic_F italic_P end_ARG. (5) Recall. The recall score measures the performance of the classifier to find all the positive samples. It is formally defined as recall=TPTP+FNrecall𝑇𝑃𝑇𝑃𝐹𝑁\text{recall}=\frac{TP}{TP+FN}recall = divide start_ARG italic_T italic_P end_ARG start_ARG italic_T italic_P + italic_F italic_N end_ARG. (6) Accuracy. Accuracy is the fraction of correctly predicted/classified samples. It is formally defined as accuracy=TP+TNTP+TN+FP+FNaccuracy𝑇𝑃𝑇𝑁𝑇𝑃𝑇𝑁𝐹𝑃𝐹𝑁\text{accuracy}=\frac{TP+TN}{TP+TN+FP+FN}accuracy = divide start_ARG italic_T italic_P + italic_T italic_N end_ARG start_ARG italic_T italic_P + italic_T italic_N + italic_F italic_P + italic_F italic_N end_ARG.

For all these metrics, the numerical values range from 0 to 1, a higher value represents better performance. We report multiple metrics to measure the performance comprehensively.

A.2. Baseline Methods

We are the first to identify the Enrollment Success prediction problem. To address this problem, we propose a benchmark to evaluate performance using several widely recognized models alongside our customized Deep Cross Network (Wang et al., 2017) (DCN), which we refer to as TrialEnroll.

To establish this benchmark, we selected multiple widely recognized models as baselines: Logistic Regression (LaValley, 2008) (LR), Gradient Boosting Decision Tree (Ke et al., 2017) (GBDT), Adaptive Boosting (Rätsch et al., 2001) (AdaBoost), Random Forest (Breiman, 2001) (RF), and Multi-Layer Perceptron (Popescu et al., 2009) (MLP). The data fed into each model is the same to ensure a fair comparison. These models have been successfully applied to clinical trial outcome prediction in previous studies (Fu et al., 2022, 2023; Chen et al., 2024b).

  • Logistic Regression (LR): Logistic Regression is a simple and widely used statistical method for modeling the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the input features and the output, making it easy to interpret but potentially limited in capturing complex patterns.

  • Gradient Boosting Decision Tree (GBDT): GBDT is an ensemble learning technique that builds a series of decision trees, where each tree corrects the errors of the previous ones. It combines the predictions of multiple weak learners to produce a strong learner, making it highly effective for both regression and classification tasks.

  • Adaptive Boosting (AdaBoost): AdaBoost is another ensemble learning method that combines multiple weak classifiers to form a strong classifier. It works by iteratively adjusting the weights of incorrectly classified instances, focusing more on difficult cases in subsequent iterations. This adaptive approach helps improve the overall model performance.

  • Random Forest (RF): Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes (classification) or mean prediction (regression) of the individual trees. It reduces overfitting by averaging multiple trees, providing robust and accurate predictions.

  • Multi-Layer Perceptron (MLP): MLP is a type of artificial neural network that consists of multiple layers of nodes, including an input layer, one or more hidden layers, and an output layer. Each node (neuron) in a layer is connected to every node in the subsequent layer, allowing the network to learn complex, non-linear relationships in the data.

We aim to demonstrate TrialEnroll’s effectiveness by comparing It with these well-established models. This benchmark will also provide insights into the relationship between model complexity and performance.

A.3. Implementation Details

To ensure reproducibility, we provide a detailed description of our experimental framework and training process. The code and step-by-step instructions can be found at https://anonymous.4open.science/r/TrialEnroll-7E12.

Hardware and Software Configuration
  • CPU: Intel(R) Xeon(R) Gold 6248

  • RAM: 128GB

  • GPU: NVIDIA RTX A5000

  • Operating System: Ubuntu 20.04

  • Python Version: 3.10

  • PyTorch Version: 2.3

Training Parameters

During the training process, we set the batch size to 256 to balance memory usage and computational efficiency. The training was conducted over 100 epochs, with an early stopping mechanism to prevent overfitting. The early stopping patience was set to 5 epochs, meaning training would halt if no improvement in the validation loss was observed for 5 consecutive epochs.

Optimization

We employed the AdamW optimizer (Loshchilov and Hutter, 2017) to minimize the objective function. The initial learning rate was set to 0.001, which we found to be effective for our model and dataset. The learning rate was not dynamically adjusted during training.

The entire training process took approximately 4.3 hours to complete. This duration may vary depending on the specific hardware configuration and the complexity of the model.