
Machine Learning Driven Biomarker Selection for Medical Diagnosis

Divyagna Bavikadi1*, Ayushi Agarwal1, Shashank Ganta1, Yunro Chung2,3, Lusheng Song2, Ji Qiu2, Paulo Shakarian1


1 Fulton Schools of Engineering, Arizona State University, Tempe, AZ, USA

2 Biodesign Center for Personalized Diagnostics, Arizona State University, Tempe, AZ, USA

3 College of Health Solutions, Arizona State University, Phoenix, AZ, USA


* [email protected]

Abstract

Recent advances in experimental methods have enabled researchers to collect data on thousands of analytes simultaneously. This has led to correlational studies that associate molecular measurements with diseases such as Alzheimer’s disease, liver cancer, and gastric cancer. However, using thousands of biomarkers selected from the analytes is not practical for real-world medical diagnosis and is likely undesirable because of the risk of spurious correlations. In this study, we evaluate 4 different methods for biomarker selection and 4 different machine learning (ML) classifiers for identifying correlations, evaluating 16 approaches in all. We found that contemporary methods outperform previously reported logistic regression in cases where 3 and 10 biomarkers are permitted. When specificity is fixed at 0.9, ML approaches produced a sensitivity of 0.240 (3 biomarkers) and 0.520 (10 biomarkers), while standard logistic regression provided a sensitivity of 0.000 (3 biomarkers) and 0.040 (10 biomarkers). We also noted that causal-based methods for biomarker selection were the most performant when fewer biomarkers were permitted, while univariate feature selection was the most performant when a greater number of biomarkers were permitted.

Introduction

Recent advances in experimental methods have enabled researchers to collect data on thousands of biological analytes simultaneously (Rosado et al. [1], Topkaya et al. [2]). This has led to correlational studies that associate these molecular measurements with diseases such as Alzheimer’s disease (Blennow et al. [3]), liver cancer (Ahn Joseph C et al. [4]), and gastric cancer (Lin et al. [5]). However, it is generally considered undesirable to use thousands of biomarkers selected from the analytes for medical diagnosis, for several reasons. First, large numbers of biomarkers increase the likelihood of spurious correlation. Second, the use of many biomarkers increases model complexity and hinders the interpretability of results. Third, from a practical standpoint, fewer biomarkers are preferable for creating cost-effective diagnostic products.
As a result, previous studies have conducted two operations in tandem: the selection of candidate biomarkers thought to be individually associated with a given disease, and the identification of correlations between the combination of selected candidate biomarkers and the target medical condition. The most commonly reported methodology in the literature has been logistic regression, often accompanied by a variant of univariate feature selection (Bursac et al. [6], Direkvand-Moghadam et al. [7], Islam et al. [8]). This paper looks to augment existing work by studying the effect of the feature selection method and model type. In particular, we examine causal-based feature selection (Kleinberg et al. [9]) and a variety of machine-learning approaches, including gradient-boosted decision trees and neural networks. In all, we study 16 different combinations of feature selection and classification models in tests where the number of biomarkers $K$ is restricted to the set of values $\{1, 3, 4, 10, 15, 30\}$ on a gastric cancer dataset that includes measurements from 3440 biological analytes (Song et al. [10]). We perform a cross-validation study and report results on training and test sets, and we examine hyperparameter sensitivity for the causal-based approaches. We found that contemporary machine learning methods outperform previously reported logistic regression in these experiments. When specificity is fixed at 0.9, ML approaches produced a sensitivity of 0.240 (3 biomarkers) and 0.520 (10 biomarkers), while standard logistic regression provided a sensitivity of 0.000 (3 biomarkers) and 0.040 (10 biomarkers).
The rest of the paper is organized as follows: We first provide a brief overview of related work, a description of the gastric cancer dataset, and machine learning methods. This is followed by reporting of the experimental results on the gastric cancer dataset and associated discussion. Finally, we conclude by discussing our findings.

Related Work

Machine learning models, such as logistic regression, have been utilized with biological data for association purposes. In (Islam et al. [8]), the correlation coefficients of three biomarkers (body temperature, heart rate, and probable blood glucose level) were evaluated and associated with malaria detection using logistic regression. Similarly, in (Direkvand-Moghadam et al. [7]), univariate logistic regression demonstrated a substantial association between female sexual dysfunction and biomarkers such as age, gravidity, and menarche age. Additionally, in (Bursac et al. [6]), the application of feature selection prior to model training showed the potential to maintain confounding variables, especially when dealing with macro biological data sets. Note that none of this prior work analyzes a range of machine learning classifiers, such as gradient-boosted trees or neural networks, in combination with causal-based and other feature selection methods.
More specifically, machine learning models paired with feature selection have proved significantly beneficial for disease detection. In (Sorino et al. [11]), numerous machine learning techniques similar to ours, such as a random forest classifier and a boosted tree classifier, were used with cross-validation to diagnose non-alcoholic fatty liver disease. Similarly, in (Díaz Álvarez et al. [12]), feature selection evaluated on the chi-squared statistic was paired with a Naive Bayes classifier to aid the diagnosis and classification of neurodegenerative disorders. Moreover, vision-based machine learning techniques such as convolutional neural networks have been applied to a wide variety of medical diagnostic use cases (Yadav et al. [13], Shaban et al. [14], Heenaye-Mamode et al. [15], Lopez-Garnier et al. [16], Kundu et al. [17]). Such diagnosis based on imagery would be complementary to biomarker-based diagnosis. However, to our knowledge, the application of such techniques to the use of biomarkers, specifically proteins, for the purposes of medical diagnosis has not been studied in the literature. Causal-based methods, such as the one used in our study, have been applied in a variety of medical settings (Kleinberg et al. [18]). For example, in (Richens et al. [19]), the application of causal machine learning increased clinical accuracy from the top 48% to the top 25% of doctors. However, to date, such methods have not been combined with recent advances in biomarker experimentation (Kleinbaum et al. [20]) for medical diagnosis based on biomarker measurements.

Gastric Cancer Dataset

The dataset (Song et al. [21]) used for biomarker discovery contains 100 samples, each labeled as a case or a control to indicate the presence or absence of gastric cancer. The dataset is balanced, with 50 samples labeled case and 50 labeled control, and cases and controls are matched on age and gender. Each sample is represented by 3440 molecular measurement values, which are used to assess the risk of gastric cancer and provide insight into the disease. The measurement values range from 0.00 to 260.65. Molecular measurements were recorded for IgG and IgA antibodies against the same set of proteins. The dataset contains data on clinical features, antibody reactions against Helicobacter pylori proteins, and demographic variables. Using the Nucleic Acid Programmable Protein Array (NAPPA) technology, the study assessed humoral responses to 1527 proteins, which make up nearly the complete H. pylori proteome. Measurement values were assessed for seropositivity, defined as a median normalized intensity $\geq 2$ on NAPPA. Table 1 shows the breakdown of the dataset.

Table 1: Breakdown of Gastric Cancer Dataset
Data Samples      Quantity    Analytes Data                           Quantity
Total Samples     100         Total measurements*                     3440
Cancer Cases      50          Organism: H. pylori                     3054
Cancer Controls   50          Organism: EBV                           178
                              Organism: Streptococcus_gallolyticus    92
                              Organism: Fusobacterium_nucleatum       84
                              Organism: Other ($\leq 5$ occurrences)  32
* indicates that it includes IgG and IgA antibodies

For the training data, each sample has a vector of real values, one per analyte measurement, and a ground-truth label indicating the actual presence of the disease, which is used to distinguish gastric cancer patients from healthy controls.

Machine Learning and Feature Selection Methods

Overview of approaches

We employ a two-step process for each method, feature selection followed by classification, and discuss each in turn. We use the symbol $K$, with $K \in \{1, 3, 4, 10, 15, 30\}$, to denote the maximum number of biomarkers permitted after the feature selection step. The best $K$ biomarkers are then used to classify a sample. We also explore the effect of binarizing biomarker inputs: rather than considering the biomarker measurement directly, we only consider whether the biomarker exceeds some threshold $\gamma$, with $\gamma \in \{0.6, 1.0, 1.4, 1.8\}$, which is specified as a hyperparameter.
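The binarization step can be illustrated with a minimal sketch. It assumes the measurements are held in a NumPy array with one row per sample and one column per analyte; the helper name `binarize` and the toy values are hypothetical and are not taken from the dataset.

```python
import numpy as np

def binarize(measurements: np.ndarray, gamma: float) -> np.ndarray:
    """Return 1 where a measurement strictly exceeds gamma, else 0."""
    return (measurements > gamma).astype(np.int8)

# Toy matrix: 3 samples x 4 analytes (illustrative values only).
X = np.array([
    [0.20, 1.50, 0.90, 2.10],
    [1.45, 0.30, 1.40, 0.00],
    [0.70, 1.90, 1.39, 1.41],
])

# Sweep the threshold grid described in the text and count positive entries.
for gamma in (0.6, 1.0, 1.4, 1.8):
    print(gamma, int(binarize(X, gamma).sum()))
```

A measurement equal to the threshold is treated as not exceeding it here; that tie-breaking choice is an assumption of the sketch.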

Feature selection methods

We consider two types of feature selection methods: univariate selection and a causal metric. Univariate feature selection evaluates the strength of the relationship between each feature and the response variable; in this paper, we use a univariate method based on the chi-square statistic. By contrast, the causal-based method examines the effect of a single analyte in the context of other analytes that may have a co-occurring measurement. A contribution of this work is an adaptation of the causal measure of (Kleinberg et al. [18]) for biomarker selection. While (Kleinberg et al. [18]) computes causality as the average increase in the probability of the effect when the cause is present, here we propose a new metric based on the intuition of (Gardner et al. [22]) but adapted for biomarker selection as follows:

$$\mathrm{causal}(i) = \frac{\sum_{j \in R_i} \big( f(i,j) - f(\neg i,j) \big)}{\mathrm{size}(R_i)} \qquad (1)$$

Here we still examine the average increase of a function when the biomarker is present, based on co-occurring biomarkers. However, unlike (Kleinberg et al. [18]), we do not use probability but a measure better suited to our domain. In Equation 1, $\mathrm{causal}(i)$ is the causal metric for analyte $i$, $R_i$ is the set of analytes related to analyte $i$, and $f$ is a measure computed as the product of sensitivity and specificity for each pair of an analyte $i$ and a related analyte $j$. This makes the metric more suitable for the kind of protein biomarkers in the dataset. We provide details as to how we derived this measure in the supporting information.
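One possible reading of Equation 1 is sketched below in Python. Several details are assumptions made only for illustration, since the text does not fix them: inputs are binarized, $R_i$ is taken to be the analytes that are positive together with $i$ in at least `min_cooccur` samples, $f(i,j)$ is the sensitivity times specificity of the rule "predict case when both $i$ and $j$ are positive", and $f(\neg i,j)$ uses the rule "$i$ negative and $j$ positive". The function names are hypothetical.

```python
import numpy as np

def sens_times_spec(pred: np.ndarray, y: np.ndarray) -> float:
    """Product of sensitivity and specificity of a 0/1 prediction vector."""
    cases, controls = (y == 1), (y == 0)
    sensitivity = pred[cases].mean() if cases.any() else 0.0
    specificity = (1 - pred[controls]).mean() if controls.any() else 0.0
    return float(sensitivity * specificity)

def causal_score(B: np.ndarray, y: np.ndarray, i: int, min_cooccur: int = 1) -> float:
    """Average of f(i, j) - f(not i, j) over the assumed related set R_i."""
    present_i = B[:, i] == 1
    # Assumed definition of R_i: analytes co-occurring with i often enough.
    related = [j for j in range(B.shape[1])
               if j != i and np.sum(present_i & (B[:, j] == 1)) >= min_cooccur]
    if not related:
        return 0.0
    total = 0.0
    for j in related:
        present_j = B[:, j] == 1
        f_i_j = sens_times_spec((present_i & present_j).astype(int), y)
        f_not_i_j = sens_times_spec((~present_i & present_j).astype(int), y)
        total += f_i_j - f_not_i_j
    return total / len(related)

# Toy binarized data: 6 samples x 3 analytes; analyte 0 tracks the label exactly.
B = np.array([[1, 1, 0], [1, 1, 1], [1, 0, 1],
              [0, 1, 0], [0, 1, 1], [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])

scores = [causal_score(B, y, i) for i in range(B.shape[1])]
K = 1
top_k = sorted(range(B.shape[1]), key=lambda i: scores[i], reverse=True)[:K]
print(scores, top_k)
```

Ranking analytes by this score and keeping the top $K$ then plays the role of the feature selection step; the derivation in the supporting information remains the authoritative definition.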

Machine learning classification methods

We examine four families of machine learning methods: logistic regression, random forest, deep neural networks (DNN), and boosted trees in two variants, gradient-boosted decision trees (Pedregosa et al. [23]) and XGBoost (Chen et al. [24]). We use logistic regression to establish a baseline, as it was used in previous biomarker studies (Direkvand-Moghadam et al. [7], Ravi et al. [25]); random forest for its ability to provide accurate results with minimal hyperparameter tuning; a DNN because of the state-of-the-art performance of DNNs in a variety of other tasks; and the two variants of boosted trees because they have been shown to provide state-of-the-art performance on tasks involving tabular data. For the DNN, we employ a dense multi-layer perceptron with 4 layers, the ReLU activation function, and a softmax output layer, using the PyTorch (Paszke et al. [26]) software package. For the boosted decision trees, we use the Scikit-learn implementation of gradient-boosted trees and the standard implementation of XGBoost. Summaries of these methods, along with hyperparameter settings, can be found in the supporting information.
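To make the dense architecture concrete, the forward pass of such a network is sketched below in plain NumPy rather than PyTorch, so the sketch is not the model used in the experiments. Reading "4 layers" as four dense weight layers, the hidden widths, and the random initialization are all assumptions for illustration.

```python
import numpy as np

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(z, 0.0)

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_mlp(layer_sizes, seed=0):
    """One (weights, bias) pair per dense layer, small random weights."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, X: np.ndarray) -> np.ndarray:
    """ReLU on every hidden layer, softmax over the two classes at the output."""
    h = X
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W, b = params[-1]
    return softmax(h @ W + b)

# K = 3 biomarkers in, 2 classes out; hidden widths 16, 16, 8 are hypothetical.
params = init_mlp([3, 16, 16, 8, 2])
probs = forward(params, np.array([[0.4, 1.7, 0.9], [1.2, 0.1, 2.3]]))
print(probs)
```

Each output row is a pair of class probabilities (control, case) that sums to one; training by backpropagation is omitted here.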

Results

Setup

We conducted experiments using an NVIDIA GTX 1080 (2560 CUDA cores, 10 Gbps memory speed). For evaluation, we used leave-one-out cross-validation (LOOCV) and examined the Area Under the Curve (AUC) for both training and test data, as well as sensitivity on the test data with specificity fixed at 0.8 (Sen@80) and at 0.9 (Sen@90). These metrics follow standards used in assessing diagnostic biomarkers; they also give an overall picture of performance across multiple confidence thresholds and indicate how well a model discriminates between case and control. Experiments are evaluated against these metrics across models and hyperparameter settings. Throughout the discussion, we treat logistic regression with univariate selection as the baseline, as logistic regression was employed in prior work (Direkvand-Moghadam et al. [7], Islam et al. [8]).
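The metrics above can be computed from a list of held-out scores as in the following sketch. It assumes that the one score produced for each left-out sample is pooled across LOOCV folds before the ROC quantities are computed, which is one common convention and not necessarily the exact procedure used here; the function names and toy scores are hypothetical.

```python
import numpy as np

def sensitivity_at_specificity(y_true, scores, target_specificity: float) -> float:
    """Highest sensitivity over thresholds whose specificity meets the target."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    best = 0.0
    # Candidate thresholds: each observed score, plus +inf (nobody predicted positive).
    for t in np.append(np.unique(scores), np.inf):
        pred = scores >= t
        specificity = np.mean(~pred[y_true == 0])
        sensitivity = np.mean(pred[y_true == 1])
        if specificity >= target_specificity:
            best = max(best, float(sensitivity))
    return best

def auc(y_true, scores) -> float:
    """Probability that a random case outscores a random control (ties count 1/2)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true == 1][:, None] - scores[y_true == 0][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Toy pooled scores: 5 cases followed by 5 controls.
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.4, 0.3, 0.75, 0.5, 0.35, 0.2, 0.1]

sen90 = sensitivity_at_specificity(y, s, 0.9)
sen80 = sensitivity_at_specificity(y, s, 0.8)
auc_value = auc(y, s)
print(sen90, sen80, auc_value)
```

With only five controls in the toy example, a specificity of at least 0.9 allows no false positives, which is why Sen@90 is lower than Sen@80.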

Selection of 3 Biomarkers

Overall, the best performance in terms of test AUC was observed for the deep neural multilayer perceptron (MLP) classifier with the causal metric for biomarker selection, which outperformed the baseline by 0.114, as shown in Table 2. For sensitivity at a specificity of 0.9, XGB with the causal metric (Fig 1) outperformed the baseline (Fig 2) by 0.240. Notably, the use of causal feature selection improved performance irrespective of classifier, providing a minimum improvement of 0.120 (binarized) over univariate feature selection for each classifier for Sen@90 (Table 2). Comparable results were noted for sensitivity when specificity was set to 0.8, along with test AUC.

Fig 1: ROC Curve for XGB model with causality measure (3 Biomarkers).
Fig 2: ROC Curve for the baseline (3 Biomarkers).

We note that training AUC was strongest for random forest with univariate selection, with a value of 0.997; however, this drops to 0.558 for testing. This is surprising, as random forest generally does not overfit (Breiman [27]). It may indicate that univariate feature selection causes overfitting when used with more complex models, as we observed large discrepancies between training and testing AUCs whenever univariate feature selection was used, in all cases except logistic regression. By contrast, the drop from training to test AUC for the causality measure averages 0.118 with a maximum of 0.186, while for univariate feature selection it averages 0.260 with a maximum of 0.439, which indicates a possibility of overfitting when causality is ablated.

Table 2: Results for 3 biomarkers using 5 models with causal-based and univariate feature selection
Model   Method          Train AUC   Test AUC   Sen@90   Sen@80
MLP     Univariate      0.937       0.581      0.080    0.140
        Univariate(B)   0.738       0.527      0.000    0.000
        Causal          0.720       0.695*     0.220    0.420
        Causal(B)       0.774       0.588      0.200    0.300
XGB     Univariate      0.969       0.613      0.200    0.260
        Univariate(B)   0.754       0.538      0.000    0.000
        Causal*         0.719*      0.633*     0.240*   0.480*
        Causal(B)       0.611       0.463      0.200    0.340
LR      Univariate      0.699       0.612      0.000    0.180
        Univariate(B)   0.756       0.560      0.000    0.000
        Causal          0.678       0.510      0.180    0.280
        Causal(B)       0.771       0.594      0.200    0.200
GBT     Univariate      0.984       0.571      0.120    0.280
        Univariate(B)   0.738       0.527      0.000    0.000
        Causal          0.722       0.659      0.140    0.420
        Causal(B)       0.613       0.496      0.220    0.360
RF      Univariate      0.997*      0.558      0.120    0.200
        Univariate(B)   0.736       0.620      0.060    0.080
        Causal          0.719       0.593      0.120    0.120
        Causal(B)       0.662       0.583      0.180    0.540*
(B) denotes binarized data. Bold values indicate better performance; underlined values, marked here with *, indicate the best performance.
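The train-to-test AUC gaps quoted in the discussion of Table 2 can be recomputed directly from the table. The sketch below transcribes the (train AUC, test AUC) pairs for all five models, binarized and non-binarized, grouped by feature selection method; the helper name `gap_summary` is hypothetical.

```python
# (train AUC, test AUC) pairs transcribed from Table 2, in model order
# MLP, XGB, LR, GBT, RF, each as (non-binarized, binarized).
table2 = {
    "causal": [(0.720, 0.695), (0.774, 0.588), (0.719, 0.633), (0.611, 0.463),
               (0.678, 0.510), (0.771, 0.594), (0.722, 0.659), (0.613, 0.496),
               (0.719, 0.593), (0.662, 0.583)],
    "univariate": [(0.937, 0.581), (0.738, 0.527), (0.969, 0.613), (0.754, 0.538),
                   (0.699, 0.612), (0.756, 0.560), (0.984, 0.571), (0.738, 0.527),
                   (0.997, 0.558), (0.736, 0.620)],
}

def gap_summary(rows):
    """Mean and maximum of (train AUC - test AUC) over the given rows."""
    gaps = [train - test for train, test in rows]
    return sum(gaps) / len(gaps), max(gaps)

causal_mean, causal_max = gap_summary(table2["causal"])
univariate_mean, univariate_max = gap_summary(table2["univariate"])
print(f"causal:     mean gap {causal_mean:.4f}, max gap {causal_max:.3f}")
print(f"univariate: mean gap {univariate_mean:.4f}, max gap {univariate_max:.3f}")
```

The means come out at about 0.118 and 0.260 and the maxima at 0.186 and 0.439, matching the figures reported in the text.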

Selection of 10 Biomarkers

With 10 biomarkers, in contrast, the best-performing model with respect to test AUC was MLP with univariate feature selection, which outperformed MLP with the causality measure by 0.286, as shown in Table 3. Furthermore, GBT with univariate feature selection (Fig 3) reported the highest sensitivity at a specificity of 0.9, namely 0.520, while GBT with the causality measure reported 0.220. The baseline (Fig 4) gave a moderate test AUC of 0.599 but a low Sen@90 value. We found that, with a higher number of biomarkers, univariate feature selection outperforms the causality measure with respect to test AUC for all methods, by a minimum of 0.025 (binarized) and 0.029 (non-binarized).

For a higher number of biomarkers, a more generic method such as univariate selection seems to suffice. While increasing the amount of historical data might help improve the performance of other approaches, the less data-hungry causal approach already performs well, with consistent sensitivity at specificities of 0.9 and 0.8.
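For readers unfamiliar with univariate selection, a chi-square ranking on non-negative features can be sketched in NumPy as follows. The statistic compares class-wise feature sums with their expected values under independence; whether this is exactly the statistic used in the experiments is an assumption, and the function names and toy data are hypothetical.

```python
import numpy as np

def chi2_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Chi-square score of each non-negative feature against a binary label."""
    Y = np.stack([(y == 0), (y == 1)], axis=1).astype(float)  # one-hot classes, (n, 2)
    observed = Y.T @ X                                        # class-wise feature sums, (2, p)
    feature_total = X.sum(axis=0, keepdims=True)              # (1, p)
    class_prob = Y.mean(axis=0, keepdims=True).T              # (2, 1)
    expected = class_prob @ feature_total                     # (2, p)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = (observed - expected) ** 2 / expected
    return np.nan_to_num(terms).sum(axis=0)                   # all-zero features score 0

def top_k(X: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-scoring features."""
    return np.argsort(-chi2_scores(X, y), kind="stable")[:k]

# Toy data: 6 samples x 3 features; only feature 0 differs between classes.
X = np.array([[2.0, 1.0, 0.5], [3.0, 1.0, 0.0], [2.5, 1.0, 1.0],
              [0.5, 1.0, 0.5], [0.0, 1.0, 1.0], [0.5, 1.0, 0.0]])
y = np.array([1, 1, 1, 0, 0, 0])

scores = chi2_scores(X, y)
selected = top_k(X, y, 1)
print(scores, selected)
```

Because each feature is scored on its own, the ranking ignores co-occurrence between analytes, which is the main contrast with the causal metric.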

Table 3: Results for 10 biomarkers using 5 models with causal-based and univariate feature selection
Model   Method          Train AUC   Test AUC   Sen@90   Sen@80
MLP     Univariate      1.000       0.669      0.140    0.340
        Univariate(B)   0.926       0.764      0.480    0.480
        Causal          0.909       0.551      0.200    0.260
        Causal(B)       0.918       0.478      0.120    0.200
XGB     Univariate      0.998       0.701      0.300    0.420
        Univariate(B)   0.890       0.684      0.460    0.460
        Causal          0.816       0.575      0.200    0.340
        Causal(B)       0.879       0.659      0.220    0.360
LR      Univariate      0.811       0.599      0.040    0.180
        Univariate(B)   0.878       0.746      0.460    0.480
        Causal          0.734       0.569      0.080    0.220
        Causal(B)       0.830       0.681      0.320    0.360
GBT     Univariate*     1.000*      0.721*     0.520*   0.620*
        Univariate(B)   0.919       0.746      0.480    0.500
        Causal          0.852       0.588      0.220    0.260
        Causal(B)       0.875       0.540      0.180    0.220
RF      Univariate      0.999*      0.649      0.140    0.380
        Univariate(B)   0.926       0.708      0.420    0.440
        Causal          0.894       0.594      0.140    0.260
        Causal(B)       0.904       0.538      0.120    0.320
(B) denotes binarized data. Bold values indicate better performance; underlined values, marked here with *, indicate the best performance.
Fig 3: ROC Curve for GBT model with univariate feature selection (10 Biomarkers).
Fig 4: ROC Curve for the Baseline (10 Biomarkers).

Hyperparameter Study

As shown in Table 2 and Table 3, some methods were trained on biomarker values that were binarized before model training, indicated by (B); for example, Causal(B) means the causality method with binarized inputs. We discretize all input measurements for a given sample based on a threshold $\gamma \in \{0.6, 1.0, 1.4, 1.8\}$. Tables 2 and 3 report results for the threshold value $\gamma = 1.4$. However, there is little variance in AUC across most thresholds, showing the stability of the selected biomarkers, as seen in Fig 5(a). Consistency is also observed in the frequency of biomarker selection. Furthermore, raising the value of $K$ significantly yields diminishing returns, suggesting a saturation point for choosing the number of biomarkers $K$.

We found IgG antibodies against DNA-directed RNA polymerase subunit alpha (HP1293), recombinase RecA (recA), and trigger factor (tig) to be the most frequently selected biomarkers related to gastric cancer. Fig 5(b) shows the high frequency of the biomarkers recA IgG, HP1293 IgG, and tig IgG, which appear in more than 90% of all folds when evaluating with LOOCV, supporting the stability of the model. These are the biomarkers that were consistently picked by the causality measure.
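The selection-frequency analysis can be sketched as follows: feature selection is rerun on each LOOCV training fold and the share of folds in which each feature is chosen is recorded. The scoring function `mean_difference` is a simple placeholder standing in for the causal metric, and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np

def mean_difference(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Placeholder score: |mean(cases) - mean(controls)| for each feature."""
    return np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))

def selection_frequency(X: np.ndarray, y: np.ndarray, k: int, score_fn) -> dict:
    """Fraction of leave-one-out folds in which each feature is in the top k."""
    n = X.shape[0]
    counts = Counter()
    for held_out in range(n):
        train = np.arange(n) != held_out           # drop exactly one sample
        scores = score_fn(X[train], y[train])
        counts.update(np.argsort(-scores, kind="stable")[:k].tolist())
    return {feature: count / n for feature, count in counts.items()}

# Toy data: 8 samples x 4 features; features 0 and 1 separate the classes.
X = np.array([[3.0, 2.0, 1.0, 0.9], [3.2, 1.8, 1.1, 1.0],
              [2.9, 2.1, 0.9, 1.1], [3.1, 2.2, 1.0, 1.0],
              [0.1, 1.0, 1.0, 1.0], [0.0, 1.1, 1.1, 0.9],
              [0.2, 0.9, 0.9, 1.1], [0.1, 1.2, 1.0, 1.0]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

frequency = selection_frequency(X, y, k=2, score_fn=mean_difference)
print(frequency)
```

A feature with a frequency close to 1 is selected in nearly every fold, which is the sense of stability used in the paragraph above.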

Notably, the test AUC increases with $K$ and saturates after $K = 10$, as seen in Fig 5(c). However, $K$ had a limited impact on the biologically relevant Sen@90 measure. Initially, increasing $K$ increased the test AUC by about 0.2. As we gradually increased $K$, the test AUC leveled out at around 0.7, but the Sen@90 measure tended to become more sparse. We see diminishing returns from adding further biomarkers. This relationship is relevant to the target application of making inexpensive diagnostic kits.

Fig 5: Hyperparameter sensitivity. (a) ROC curve with multiple thresholds ($\gamma$) for the XGB model with causal-based biomarker selection. (b) Frequency of selected biomarkers, where $K = 3$. (c) Effect of $K$ for threshold 1.4 for the GBT model with univariate selection.

Discussion

Table 2 shows the effect of ablating the causality measure by replacing it with univariate feature selection. We observe a higher AUC and consistent sensitivity values for the causal method as we decrease the number of biomarkers, and these benefits disappear otherwise. This is beneficial for industrial application, since a method that performs better with fewer biomarkers requires less computation and is less expensive. The approach may also be applicable to similar domains for the prediction of other diseases. Additionally, the experiments with the causal metric can be extended by adding a combinatorial way of picking the ranked causal biomarkers.

Conclusion

In this paper, we use a causality measure to select biomarkers, paired with ML-based classifiers, on a gastric cancer dataset for disease detection. We pre-select biomarkers to reduce the number considered, which is more practical, reduces overfitting, and helps in understanding the causal effect of the set of biomarkers. With respect to Sen@90 and Sen@80, the XGB model with the causality measure performed better than the baseline for 3 biomarkers and has an increase of 0.114 in AUC. We found that approaches with the causal metric performed better when handling a smaller number of biomarkers, while conventional techniques like univariate feature selection performed better with a larger number of biomarkers. Because the causality measure compares co-occurring biomarkers, it could provide biological intuition enabling further empirical studies. We see evidence that this approach likely generalizes to the prediction of other diseases based on biomarkers, as our machine learning methods perform well across a variety of diseases.

References

  •  1. Rosado M, Silva R, Bexiga M. G, Jones J. G, Manadas B, Anjo S. I. Chapter Four - Advances in biomarker detection: Alternative approaches for blood-based biomarker detection. Advances in Clinical Chemistry. 2019; 92(4):141-199.
  •  2. Topkaya S. N., Azimzadeh M, Ozsoz M. Electrochemical Biosensors for Cancer Biomarkers Detection: Recent Advances and Challenges. Electroanalysis. 2016; 28:1402-1419.
  •  3. Blennow K, Zetterberg H. Biomarkers for Alzheimer’s disease: current status and prospects for the future. J Intern Med. 2018; 284:643–663.
  •  4. Ahn Joseph C, Teng Pai‐Chi, Chen Pin‐Jung, Posadas Edwin, Tseng Hsian‐Rong, Lu Shelly C, Yang Ju Dong. Detection of Circulating Tumor Cells and Their Implications as a Biomarker for Diagnosis, Prognostication, and Therapeutic Monitoring in Hepatocellular Carcinoma. Hepatology 73(1):p 422-436, January 2021.
  •  5. Lin L, Huang H, Juan H. Discovery of biomarkers for gastric cancer: A proteomics approach. Journal of Proteomics. 2012; 75(11):3081-3097.
  •  6. Bursac Z, Gauss C. H, Williams D. K, Hosmer D. W. Purposeful selection of variables in logistic regression. Source Code Biol Med 3, 17 (2008).
  •  7. Direkvand-Moghadam A, Suhrabi Z, Akbari M, Direkvand-Moghadam A. Prevalence and Predictive Factors of Sexual Dysfunction in Iranian Women: Univariate and Multivariate Logistic Regression Analyses. Korean J Fam Med. 2016 Sep;37(5):293-8.
  •  8. Islam M, Islam R. Exploring the Impact of Univariate Feature Selection Method on Machine Learning Algorithms for Heart Disease Prediction. 2023 International Conference on Next-Generation Computing, IoT and Machine Learning (NCIM), Gazipur, Bangladesh. 2023;
  •  9. Kleinberg S, Hripcsak G. A review of causal inference for biomedical informatics. Journal of Biomedical Informatics. 2011; 44(6):1102-1112.
  •  10. Song L, Song M, Rabkin C. S, Williams S, Chung Y, Van Duine J, Liao L. M, Karthikeyan K, Gao W, Park JG, Tang Y, Lissowska J, Qiu J, LaBaer J, Camargo M. C. Helicobacter pylori Immunoproteomic Profiles in Gastric Cancer. Journal of Proteome Res. 2021; 20(1):409–419.
  •  11. Sorino P, Caruso M. G, Misciagna G, Bonfiglio C, Campanella A, Mirizzi A, Franco I, Bianco A, Buongiorno C, Liuzzi R, Cisternino AM, Notarnicola M, Chiloiro M, Pascoschi G, Osella A. R. Selecting the best machine learning algorithm to support the diagnosis of Non-Alcoholic Fatty Liver Disease: A meta learner study. PLOS ONE, Public Library of Science. 2020; 15(10).
  •  12. Díaz Álvarez J. D, Matias-Guiu J. A, Cabrera-Martín M. N, Risco-Martín J. L, Ayala J. L. An application of machine learning with feature selection to improve diagnosis and classification of neurodegenerative disorders. BMC Bioinformatics. 2019; 20:491.
  •  13. Yadav S. S, Jadhav S. M. Deep convolutional neural network based medical image classification for disease diagnosis. J Big Data. 2019; 6:113.
  •  14. Shaban M, Ogur Z, Mahmoud A, Switala A, Shalaby A, Abu Khalifeh H, Ghazal M, Fraiwan L, Giridharan G, Sandhu H, El-Baz A. S. A convolutional neural network for the screening and staging of diabetic retinopathy. PLOS ONE. 2020; 15(6).
  •  15. Heenaye-Mamode Khan M, Boodoo-Jahangeer N, Dullull W, Nathire S, Gao X, Sinha G. R, Nagwanshi KK. Multi-class classification of breast cancer abnormalities using Deep Convolutional Neural Network (CNN). PLOS ONE. 2021; 16(8).
  •  16. Lopez-Garnier S, Sheen P, Zimic M. Automatic diagnostics of tuberculosis using convolutional neural networks analysis of MODS digital images. PLOS ONE. 2019; 14(2).
  •  17. Kundu R, Das R, Geem Z. W, Han GT, Sarkar R. Pneumonia detection in chest X-ray images using an ensemble of deep learning models. PLOS ONE. 2021; 16(9).
  •  18. Kleinberg S, Mishra B. The temporal logic of causal structures. arXiv preprint arXiv:1205.2634. 2012 May 9.
  •  19. Richens J. G, Lee C. M, Johri S. Improving the accuracy of medical diagnosis with causal machine learning. Nat Commun. 2020 Aug 11;11(1):3923.
  •  20. Kleinbaum DG, Dietz K, Gail M, Klein M, Klein M. Logistic regression. New York: Springer-Verlag; 2002 Aug.
  •  21. Song L, Song M, Rabkin CS, Williams S, Chung Y, Van Duine J, Liao LM, Karthikeyan K, Gao W, Park JG, Tang Y. Helicobacter pylori immunoproteomic profiles in gastric cancer. Journal of Proteome Research. 2021 Jan 1;20(1):409-19.
  •  22. Gardner M. W, Dorling S. R. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric environment. 1998 Aug 1;32(14-15):2627-36.
  •  23. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Müller A, Nothman J, Louppe G, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É. Scikit-learn: Machine Learning in Python. JMLR. 2011; 12(1):2825-2830.
  •  24. Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, Chen K, Mitchell R, Cano I, Zhou T. Xgboost: extreme gradient boosting. R package version 0.4-2. 2015; 1(4):1-4.
  •  25. Ravi A, Gopal V, Roselyn J. P, Devaraj D, Chandran P, Madhura R. S. Detection of Infectious Disease using Non-Invasive Logistic Regression Technique. IEEE International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS). 2019; 1-5
  •  26. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Köpf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc. 2019; 8024-8035
  •  27. Breiman L. Random Forests. Machine Learning. 2001; 45:5-32
  •  28. Chen T, Guestrin C. A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016 Aug 13.

Supporting information

S1 Appendix

Learning models

Multi layer perceptron

A multi-layer perceptron (MLP) is a type of artificial neural network (Gardner et al. [22]). MLPs are composed of multiple layers of interconnected nodes or neurons, each of which performs a non-linear operation on the input data. The input layer of an MLP receives the input data, and the output layer produces the predicted output. In between the input and output layers, there can be one or more hidden layers, each of which contains multiple neurons. The neurons in each layer are connected to the neurons in the next layer, forming a dense, fully connected network. During training, an MLP adjusts the weights of the connections between neurons to minimize the difference between the predicted output and the actual output. This is done using backpropagation, which propagates the error backward through the network and updates the weights using gradient descent.

Extreme Gradient Boosting/XGBoost

XGBoost (Chen et al. [28]) is designed to improve the performance of gradient-boosted trees, especially in terms of speed and model accuracy. Like other gradient-boosting algorithms, XGBoost builds an ensemble of decision trees to make predictions. However, XGBoost introduces several improvements to the gradient boosting algorithm, such as a novel tree construction algorithm, parallel processing, and a regularized learning objective.

Logistic regression

Logistic regression is a type of regression analysis that predicts the probability of an outcome based on one or more predictor variables (Kleinbaum et al. [20]). In logistic regression, the dependent variable is binary, meaning it can take only two possible values. The algorithm models the relationship between the independent variables and the binary outcome using the logistic function, also known as the sigmoid function. The output of the sigmoid function is a value between 0 and 1, which represents the predicted probability of the positive class.
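As an illustration of the sigmoid mapping described above, the following minimal sketch computes a predicted probability for one sample. The coefficient values, the sample, and the function names `sigmoid` and `predict_proba` are hypothetical and chosen only for this example; they are not taken from our experiments.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function, mapping any real number into [0, 1]."""
    # Two-branch form avoids overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, bias):
    """Predicted probability of the positive class for a single sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for a model restricted to three biomarkers.
weights = [0.8, -0.5, 1.2]
bias = -0.3
sample = [1.0, 2.0, 0.5]

p = predict_proba(sample, weights, bias)
print(f"predicted probability of the positive class: {p:.3f}")
```

A sample is then assigned to the positive class when this probability exceeds a chosen decision threshold; moving that threshold trades sensitivity against specificity.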

Gradient-boosted trees

Gradient-boosted trees combine the predictions of multiple decision trees to improve the accuracy of predictions. The decision trees are created in sequence, and each new tree is built to correct the errors of the trees before it: it is fit to the residual errors (more generally, the negative gradient of the loss) of the current ensemble, so that samples that are still poorly predicted have the most influence on the next tree. This process continues until the algorithm reaches a predefined stopping point, such as a maximum number of iterations, or until the accuracy of the model stops improving.

Random forest

Random forest is an ensemble learning method that combines multiple decision trees to make predictions (Breiman [27]). Unlike a single decision tree, a random forest builds multiple decision trees on randomly selected subsets of the training data and randomly selected subsets of the features. Each decision tree in the forest is constructed independently, and the final prediction is made by combining the predictions of all the trees in the forest, typically by taking the average or the majority vote.

Univariate Selection (UV)

A standard logistic regression model with univariate feature selection was used as the baseline. This baseline identified the most relevant features by selecting $K$ features, where $K$ was set to 3 and 10, and provided a reference point against which to compare the performance of our model. The hyperparameter values used in our experiments for each model are listed in Table S1.

Table S1: Hyperparameters used for each model

MLP                                    | XGB, GBT
hidden_layer_sizes: 256, 128, 64, 32   | max_depth: 2.0
activation: relu                       | learning_rate: 1.0
random_state: 1.0                      | n_estimators: 10.0
                                       | random_state: 0.0

LR                                     | RF
solver: lbfgs                          | n_estimators: 10.0

S2 Appendix

Derivation of Causal Metric

Our method, the process of selecting the top $K$ biomarkers, uses only the training data. Here is an overview:

  1. For each individual biomarker, we compute the sensitivity and specificity with respect to the training data by classifying samples solely according to whether the reading for that single biomarker exceeds the threshold $\gamma$.

  2. For each biomarker $i$, we compute the $s2$ metric value, which is simply the specificity multiplied by the sensitivity obtained with the threshold $\gamma$ in the step above.

  3. Using the causal computation (Equation 4), we compute $causal_\gamma(i)$ for every biomarker. We then rank all biomarkers by this metric and select the top $K$.

Here are further technical details on the derivation of the causal metric used to rank biomarkers:

For hyperparameter $\gamma$ we will use the notation $X^{(\gamma)}$ and $x_i^{(\gamma)}$ to denote a matrix (or vector, respectively) consisting of zeros and ones based on the real-valued threshold $\gamma$ (values are set to 1 if the feature value is greater than or equal to $\gamma$). For a given feature vector $x_i^{(\gamma)}$, we will use the notation $Spec(x_i^{(\gamma)})$ and $Sens(x_i^{(\gamma)})$ to denote the specificity and sensitivity obtained if only that binarized feature vector is used to make a prediction (in other words, we predict a sample of $y$ to be positive if its binarized value for feature $i$ is 1). We will use the notation:

$$s2(x_i^{(\gamma)}) = Spec(x_i^{(\gamma)}) \times Sens(x_i^{(\gamma)}) \qquad (2)$$
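Equation 2 can be illustrated with a short sketch for a single biomarker. The readings, labels, threshold value, and helper names (`binarize`, `sens_spec`, `s2`) are hypothetical and serve only as an example; returning 0 when a class is empty is an assumption of this sketch rather than part of the definition.

```python
def binarize(values, gamma):
    """Return 1 where the reading is greater than or equal to gamma, else 0."""
    return [1 if v >= gamma else 0 for v in values]


def sens_spec(pred, y):
    """Sensitivity and specificity of binary predictions against labels y."""
    tp = sum(1 for p, t in zip(pred, y) if p == 1 and t == 1)
    fn = sum(1 for p, t in zip(pred, y) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, y) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, y) if p == 1 and t == 0)
    # Assumption: a class with no samples contributes a value of 0.
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


def s2(pred, y):
    """s2 = specificity * sensitivity, as in Equation 2."""
    sens, spec = sens_spec(pred, y)
    return spec * sens


# Hypothetical readings of one biomarker for six training samples.
readings = [0.2, 0.9, 0.7, 0.1, 0.8, 0.3]
labels = [0, 1, 1, 0, 0, 1]  # 1 = case, 0 = control
gamma = 0.5

x = binarize(readings, gamma)
print(x, s2(x, labels))
```

In this toy example two of the three cases and two of the three controls are classified correctly, so both sensitivity and specificity are 2/3 and the $s2$ value is 4/9.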

Sensitivity and specificity are standard measures used in biomedicine. We use the $s2$ measure to indicate the average influence of the presence or absence of a biomarker on samples with the disease, and the influence of the presence of a biomarker on samples without the disease. We examine the average increase of this causal effect of the biomarkers on the disease over all possible co-occurring biomarkers.

We will consider the samples whose $s2$ metric value is greater than the average $s2$ metric value. The biomarkers in $R_i$ are considered related biomarkers for $i$ when there is an overlap of at least one sample between the subsets of case samples where each biomarker's value exceeds the threshold. In other words, for a given biomarker $i$ that has a value of 1 in a case sample, $R_i$ is the set of all other biomarkers that also have a value of 1 when the sample is a case.

We will use the logical operations of negation and disjunction to take binarized vectors and form new ones, following the usual intuition. The technical definitions are given below:

  • Negation. For a given binarized vector $x_i^{(\gamma)}$, we define $not_\gamma(i)$ as the vector in which each component equals one minus the corresponding component of $x_i^{(\gamma)}$ (i.e., the zeros and ones are switched). When referring to features by their index (e.g., $j$), we will use the notation $\neg j$ to refer to the “index” pointing to the vector $not_\gamma(j)$.

  • Disjunction. For two binarized vectors $x_i^{(\gamma)}, x_j^{(\gamma)}$ we define $disj_\gamma(i,j)$ as the vector whose $k$th position is equal to $\min(1, x_i^{(\gamma)}[k] + x_j^{(\gamma)}[k])$. In other words, it is a vector (of the same size as both inputs) where each position is the sum of the corresponding components of the two vectors clipped to 1 (i.e., each component is 1 if the corresponding component in either $x_i^{(\gamma)}$ or $x_j^{(\gamma)}$ is 1, and zero otherwise).
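The two operations can be sketched in a few lines. The function names `negate` and `disj` and the example vectors are hypothetical, chosen only to mirror the definitions above.

```python
def negate(x):
    """not_gamma: switch the zeros and ones of a binarized vector."""
    return [1 - v for v in x]


def disj(xi, xj):
    """disj_gamma: component-wise sum of two binarized vectors, clipped to 1."""
    if len(xi) != len(xj):
        raise ValueError("both vectors must have the same length")
    return [min(1, a + b) for a, b in zip(xi, xj)]


# Hypothetical binarized vectors for two biomarkers over four samples.
xi = [1, 0, 0, 1]
xj = [0, 0, 1, 1]

print(negate(xi))            # [0, 1, 1, 0]
print(disj(xi, xj))          # [1, 0, 1, 1]
print(disj(negate(xi), xj))  # [0, 1, 1, 1]
```

The last line corresponds to the disjunction of $not_\gamma(i)$ with $x_j^{(\gamma)}$, that is, the vector indexed by $(\neg i, j)$.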

Each hyperparameter $\gamma$ will have different related biomarker sets. For a given vector $x_i^{(\gamma)}$, the set of related biomarkers for the threshold $\gamma$, $R_\gamma(i)$, is the set of all other vector indices $j$ such that $\sum disj_\gamma(i,j) > 0$.
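A minimal sketch of this condition follows. It applies the formal condition literally, summing the disjunction over all samples; restricting the sum to case samples would be a variation and is not assumed here. The function names and the toy matrix are hypothetical.

```python
def disj(xi, xj):
    """Component-wise sum of two binarized vectors, clipped to 1."""
    return [min(1, a + b) for a, b in zip(xi, xj)]


def related(X, i):
    """Indices j != i whose disjunction with biomarker i has a positive sum.

    X is a list of binarized vectors, one per biomarker (an assumed layout).
    """
    return [j for j in range(len(X)) if j != i and sum(disj(X[i], X[j])) > 0]


# Hypothetical binarized biomarkers over three samples.
X = [
    [0, 0, 0],  # biomarker 0: never reaches the threshold
    [0, 0, 0],  # biomarker 1: never reaches the threshold
    [1, 0, 0],  # biomarker 2: reaches the threshold in the first sample
]

print(related(X, 0))  # [2]
print(related(X, 2))  # [0, 1]
```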

We now introduce a disjunctive causal ranking metric for features. It is defined as follows:

$$causal(i) = \frac{\sum_{j \in R_i} f(i,j) - f(\neg i,j)}{size(R_i)} \qquad (3)$$

Intuitively, $causal_\gamma(i)$ tells us the average increase in the $s2$ metric obtained when feature $i$ is used vs. when it is not. We compute this metric in training for all features and then, based on the hyperparameter $K$, select the $K$ features with the greatest value of $causal(i)$ and use those to train the model. Note that the same samples are used for computing each $causal(i)$ and for training the model. This identifies the causal factors that most affect the model's decision and excludes the factors that make only a small difference. With $f(i,j)$ instantiated as $s2(disj_\gamma(i,j))$ for a threshold $\gamma$, the metric becomes:

$$causal_\gamma(i) = \frac{\sum_{j \in R_\gamma(i)} s2(disj_\gamma(i,j)) - s2(disj_\gamma(\neg i,j))}{size(R_\gamma(i))} \qquad (4)$$
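The ranking procedure built on Equation 4 can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not our experimental code: the data layout (one list of readings per biomarker), the toy values, the function names, and the choice to return 0 for an empty related set or an empty class are all hypothetical.

```python
def s2(pred, y):
    """Specificity times sensitivity of binary predictions against labels."""
    tp = sum(1 for p, t in zip(pred, y) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, y) if p == 0 and t == 0)
    pos = sum(y)
    neg = len(y) - pos
    if pos == 0 or neg == 0:  # assumption: undefined rates count as 0
        return 0.0
    return (tn / neg) * (tp / pos)


def negate(x):
    return [1 - v for v in x]


def disj(xi, xj):
    return [min(1, a + b) for a, b in zip(xi, xj)]


def causal(X, y, i):
    """Average of s2(disj(i, j)) - s2(disj(not i, j)) over related j."""
    xi, not_xi = X[i], negate(X[i])
    rel = [j for j in range(len(X)) if j != i and sum(disj(xi, X[j])) > 0]
    if not rel:  # assumption: no related biomarkers gives a score of 0
        return 0.0
    total = sum(s2(disj(xi, X[j]), y) - s2(disj(not_xi, X[j]), y) for j in rel)
    return total / len(rel)


def top_k(readings, y, gamma, k):
    """Binarize at gamma, score every biomarker, return the k best indices."""
    X = [[1 if v >= gamma else 0 for v in col] for col in readings]
    scores = [causal(X, y, i) for i in range(len(X))]
    ranked = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Hypothetical training data: three biomarkers, six samples.
y = [1, 1, 1, 0, 0, 0]
readings = [
    [0.9, 0.8, 0.7, 0.1, 0.2, 0.3],  # elevated in cases only
    [0.9, 0.1, 0.1, 0.9, 0.1, 0.1],  # unrelated to the label
    [0.1, 0.2, 0.1, 0.9, 0.8, 0.7],  # elevated in controls only
]

selected, scores = top_k(readings, y, gamma=0.5, k=2)
print(selected, [round(s, 3) for s in scores])
```

On this toy data the biomarker that is elevated only in cases receives the highest score and the biomarker that is elevated only in controls receives a negative score, which matches the intuition that the metric rewards features whose presence, rather than absence, improves $s2$.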

We rank the biomarkers based on the causal measure. Suppose a certain combination of biomarkers $B_1, B_2, B_3$ is consistently present for the cancer class and the three have approximately the same values; they are then associated with the same causal measure. When we pick biomarkers in order of the causal measure, we would therefore pick all three. Because $B_1, B_2, B_3$ have the same causal measure, the probability that a sample showing $B_1$ has cancer is the same as for a sample showing $B_2$ or $B_3$, and the absence of any of these biomarkers does not contribute to the prediction of whether or not the sample has cancer.