Unmasking Bias: A Framework for Evaluating Treatment Benefit Predictors Using Observational Studies
Yuan Xia, Mohsen Sadatsafavi, and Paul Gustafson
Abstract
Treatment benefit predictors (TBPs) map patient characteristics into an estimate of the treatment benefit tailored to individual patients, which can support optimizing treatment decisions.
However, assessing their performance can be challenging when treatment assignment is non-random.
This study conducts a conceptual analysis, which can be applied to finite-sample studies.
We present a framework for evaluating TBPs using observational data from a target population of interest.
We then explore the impact of confounding bias on TBP evaluation using measures of calibration and discrimination, namely moderate calibration and the concentration of the benefit index ($C_b$), respectively.
We illustrate that failure to control for confounding can lead to misleading values of performance metrics, and we establish how confounding bias propagates to evaluation bias, yielding explicit bias expressions for the performance metrics.
These findings underscore the necessity of accounting for confounding factors when evaluating TBPs, ensuring more reliable and contextually appropriate treatment decisions.
1 Introduction
Precision medicine aims to optimize medical care by tailoring treatment decisions to the unique characteristics of each patient.
This objective naturally falls in the intersection between predictive analytics and causal inference; the former aims at predicting the outcome of interest, and the latter seeks to answer counterfactual “what if” questions about the outcome.
Most of the progress in predictive analytics has centred around predicting risks.
To customize medical treatments, we must shift our focus to predicting treatment benefits.
Such prediction is often termed “causal prediction” or “counterfactual prediction” (Prosperi et al., 2020).
Many studies have investigated whether and how a specific covariate or a set of covariates modifies the treatment benefit, such as Abrevaya et al. (2015), Robertson et al. (2021), and Zhou and Zhu (2021).
We refer to such a function that maps patient characteristics to an estimate of treatment benefit as a treatment benefit predictor (TBP).
Before being adopted in patient care, a pre-specified TBP needs to be evaluated (validated) in the target population of interest (la Roi-Teeuw et al., 2024).
The validation process for TBPs is currently an active area of research (Kent et al., 2020).
Traditionally, performance metrics for risk prediction are categorized into measures of overall fit, discrimination, calibration, and clinical utility (net benefit) (Riley et al., 2019; Steyerberg, 2019).
Discrimination pertains to the predictive capacity of distinguishing individuals with and without the outcome of interest.
Calibration focuses on the proximity of predicted and actual risks.
Net benefit assesses the clinical usefulness of a risk prediction algorithm by quantifying the trade-off between the benefits of a true positive classification versus the harms of a false positive one.
In the context of treatment benefit prediction, various performance measures for TBPs have been formulated by extending the concepts from risk prediction to the treatment-benefit paradigm.
For instance, Vickers et al. (2007) provided an extension of net benefit for TBPs, and van Klaveren et al. (2019) and Hoogland et al. (2022) discussed extensions of calibration and discrimination.
Efthimiou et al. (2023) amalgamated calibration and discrimination into measures for decision accuracy.
However, extending these methods is not straightforward, since we cannot observe the outcome (treatment benefit) due to the unavailability of the counterfactual.
Thus, assessing the performance of TBPs poses a significant challenge.
TBPs can be validated using data from randomized controlled trials (RCTs), where treatment assignment is not systematically confounded.
Nevertheless, RCTs from the target population of interest are not always available.
Even if available, they are often underpowered to evaluate a TBP or lack sufficient follow-up time to elucidate treatment effects on relevant outcomes.
For some interventions where equipoise is not established, conducting an RCT might be unethical.
Hence, observational studies might give the only opportunity to examine the performance of a TBP in the target population.
Using observational studies adds complexity primarily due to the potential presence of confounding bias, which hinders the identification of treatment benefits.
Confounding bias and how it influences estimation of estimands have been extensively studied.
For instance, Imbens (2003) and Veitch and Zaveri (2020) have investigated the influence of confounding bias on the average treatment effect (ATE).
However, its influence has received less scrutiny in TBP evaluation.
In this study, we show the impact of failing to fully control for confounding on TBP evaluation and offer a comprehensive conceptual evaluation framework applicable to any performance metric.
We consider calibration and discrimination and focus on two specific performance metrics as illustrative examples of assessing pre-specified TBPs, via conducting a conceptual analysis.
2 Notation and Assumptions
Each individual in the target population is described by with joint distribution ,
where is the treatment chosen indicator with denoting the absence of the treatment and being the presence of the treatment;
is the counterfactual outcome that would be observed under treatment ;
is the set of pre-treatment covariates observable in routine clinical practice and also in observational studies that will be used to predict treatment benefit;
and is a distinct set of additional covariates that might be only available in the observational study, and might be needed to control for confounding.
For instance, can be blood pressure and age available at the point of care, which are used to predict the benefit of statin therapy for cardiovascular diseases.
Meanwhile, , socioeconomic status, is a confounding variable but not often used for predicting benefit from statins.
Individual treatment benefit is quantified as , which is unobservable.
For instance, when an individual has received , the corresponding outcome remains unobserved (and therefore counterfactual).
The conditional mean outcome under treatment is denoted as .
We also denote and .
Typically, is referred to as the conditional average treatment effect (CATE), as is the entire input covariate space from .
However, when our focus is on a subset of covariates , we concentrate on .
The ATE is .
A TBP denoted as predicts the benefit of an active treatment of interest based on known patient characteristics in routine clinical practice.
It can guide treatment decision-making, for example the care provider offering treatment only to those with .
We denote the predicted treatment benefit from as and its cumulative distribution function (CDF) as .
The best possible TBP is itself, and the corresponding prediction is with CDF .
This study is motivated by the question: how can be evaluated using representative observational data from the target population where treatment is confounded?
In an observational study, we can observe iid draws of from the joint distribution , which is a consequence of .
To evaluate , the following three assumptions are universally required:
(1) no interference: between any two individuals, the treatment taken by one does not affect the counterfactual outcomes of the other;
(2) consistency: the counterfactual outcome under the observed treatment assignment equals the observed outcome , i.e., ;
and (3) conditional exchangeability: the treatment assignment is independent of the counterfactual outcomes, given the set of variables , i.e., .
The first two assumptions are known as the stable-unit-treatment-value assumption (SUTVA) (Rubin, 1980).
The last one assumes no unmeasured confounders given and .
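In conventional potential-outcomes notation, the last two assumptions are often written as below. The symbols are assumptions made for this illustration ($A$ for the treatment, $Y$ for the observed outcome, $Y^{(a)}$ for the counterfactual outcome under treatment $a$, and $X$, $Z$ for the two covariate sets); no interference has no comparably compact formula and is left in words.

```latex
\begin{align*}
\text{Consistency:} \quad & Y = A\,Y^{(1)} + (1 - A)\,Y^{(0)}, \\
\text{Conditional exchangeability:} \quad & A \perp\!\!\!\perp \bigl(Y^{(0)}, Y^{(1)}\bigr) \mid (X, Z).
\end{align*}
```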
3 Performance Metrics
In this study, we explore two specific metrics, each corresponding to one of these aspects of performance within the population-level framework.
This framework enables us to conceptually understand how observational data can identify the performance of and to explore the extent of potential misguidance when failing to control for confounding.
3.1 Calibration
Van Calster et al. (2016) proposed a hierarchical definition of calibration for risk prediction models.
In what follows, we focus on what they named ‘moderate calibration’: that the expected value of the outcome among individuals with the same predicted risk is equal to the predicted risk.
They argue that moderate calibration is the most desired form of calibration.
Similarly, in treatment benefit prediction, a TBP can be considered moderately calibrated if .
It says that the average treatment benefit among all patients with predicted treatment benefit equals , for any .
For example, if is moderately calibrated and predicts a group of individuals to have , we should expect that the average treatment benefit within the group is also . Furthermore, is strongly calibrated if .
Calibration of TBPs can also be visualized in a calibration plot (Van Calster et al., 2019).
The calibration plot compares against , with a moderately calibrated TBP showing points aligned around the diagonal identity line.
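As a minimal sketch of the moderate-calibration check for a discrete TBP, the snippet below groups a toy population by predicted benefit and compares each prediction with the average true benefit in that group. The population, its benefit values, and the names `population`, `tbp_coarse`, and `tbp_inflated` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical discrete population: covariate profile -> (probability, true mean benefit).
population = {
    "x00": (0.25, 0.05),
    "x01": (0.25, 0.10),
    "x10": (0.25, 0.20),
    "x11": (0.25, 0.30),
}

def calibration_curve(tbp):
    """Map each predicted value to the average true benefit among patients with that prediction."""
    mass = defaultdict(float)
    benefit = defaultdict(float)
    for x, (p, tau) in population.items():
        mass[tbp[x]] += p
        benefit[tbp[x]] += p * tau
    return {h: benefit[h] / mass[h] for h in mass}

def is_moderately_calibrated(tbp, tol=1e-12):
    """True when every predicted value equals the average true benefit of its group."""
    return all(abs(h - avg) <= tol for h, avg in calibration_curve(tbp).items())

# Merges the two middle profiles but predicts their average benefit.
tbp_coarse = {"x00": 0.05, "x01": 0.15, "x10": 0.15, "x11": 0.30}
# Ranks patients correctly but overstates every benefit.
tbp_inflated = {"x00": 0.10, "x01": 0.20, "x10": 0.40, "x11": 0.60}

print(is_moderately_calibrated(tbp_coarse), is_moderately_calibrated(tbp_inflated))
```

In this toy setting `tbp_coarse` is moderately but not strongly calibrated, because it assigns the same prediction to two profiles with different true benefits.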
3.2 Discrimination
In risk prediction, we assess discrimination using either concordance measures or measures of disparity.
The c-statistic and the Gini index are examples of concordance and disparity measures, respectively.
Both metrics have been extended to the field of treatment benefit prediction.
However, it has been established that the c-for-benefit (van Klaveren et al., 2018), analogous to the c-statistic for TBPs, does not qualify as a proper scoring rule (Xia et al., 2023).
Therefore, we shift our focus to the concentration of the benefit index ($C_b$), a single-value summary of the difference in average treatment benefit between two treatment assignment rules: ‘treat at random’ and ‘treat greater ’ (Sadatsafavi et al., 2020).
With i.i.d. copies of , the of is defined as:
(1)
where is an indicator function.
The denominator in (1) operationalizes the strategy of ‘treat greater ’ among two patients randomly selected from the population.
If the two patients have the same , we assign treatment to one of them at random.
When and is at least not worse than ‘treat at random,’ the value ranges from to .
If , ‘treat at random’ is associated with a reduction in expected benefit compared with ‘treat greater .’
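The pairwise construction can be sketched by brute-force enumeration over a small discrete population. The strata and the name `cb_pairwise` are hypothetical; each stratum carries its mean benefit rather than individual benefits, which leaves the expectation unchanged because the two patients are drawn independently.

```python
from itertools import product

# Hypothetical population: (probability, mean benefit, TBP prediction) per stratum.
strata = [
    (0.4, 0.02, 0.0),
    (0.3, 0.10, 0.1),
    (0.3, 0.30, 0.1),
]

def cb_pairwise(strata):
    """One minus the ratio of mean benefit to expected benefit when treating the higher-ranked of two patients."""
    mean_benefit = sum(p * b for p, b, _ in strata)
    treat_greater = 0.0
    for (p1, b1, h1), (p2, b2, h2) in product(strata, repeat=2):
        if h1 > h2:
            gain = b1
        elif h1 < h2:
            gain = b2
        else:
            gain = 0.5 * (b1 + b2)  # tie: treat one of the two patients at random
        treat_greater += p1 * p2 * gain
    return 1.0 - mean_benefit / treat_greater

print(round(cb_pairwise(strata), 4))
```

A TBP that gives every patient the same prediction produces only ties, so the expected benefit equals the population mean and the index is zero.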
The connects to a Gini-like coefficient determined by twice the area between the line of independence and the relative concentration curve (RCC) of concerning .
The RCC orders patients by and plots the cumulative value divided by (Yitzhaki and Olkin, 1991).
With the Gini-like coefficient denoted as , can be alternately defined as:
To eliminate the necessity of contemplating patient pairs to ascertain the expectation in (1), we establish that
(2)
where , denoting the probability of taking the specific value .
Thus, when is continuous, .
Although is continuous in most applications, it is helpful to derive the general expression to create simple, illustrative examples where is discrete.
As the original publication on did not discuss estimation in the presence of ties, we elaborate on this point in Appendix B.
Note that (2) enables us to concentrate on and (and if needed) for computing , where acts as a ranking variable.
Consequently, provides insights into the effectiveness of in mimicking the CATE function through its ranking ability.
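One way to see why the pairwise expectation collapses to a single-patient expectation is sketched below. The notation is an assumption made for illustration: $B$ is the individual benefit, $H$ the predicted benefit with CDF $F_H$ and point mass $p_H(h) = P(H = h)$, and $(B_1, H_1)$, $(B_2, H_2)$ are independent copies.

```latex
\begin{align*}
E\bigl[B_1\, I(H_1 > H_2)\bigr]
  &= E\bigl[B_1\, P(H_2 < H_1 \mid B_1, H_1)\bigr]
   = E\bigl[B\,\{F_H(H) - p_H(H)\}\bigr], \\
E\bigl[\tfrac{1}{2}(B_1 + B_2)\, I(H_1 = H_2)\bigr]
  &= E\bigl[B\, p_H(H)\bigr], \\
E\bigl[B_1 I(H_1 > H_2) + B_2 I(H_1 < H_2) + \tfrac{1}{2}(B_1 + B_2) I(H_1 = H_2)\bigr]
  &= 2\,E\bigl[B\,\{F_H(H) - p_H(H)\}\bigr] + E\bigl[B\, p_H(H)\bigr] \\
  &= E\bigl[B\,\{2F_H(H) - p_H(H)\}\bigr].
\end{align*}
```

The factor 2 comes from the symmetry between the two patients. When $H$ is continuous, $p_H \equiv 0$ and the last line reduces to $2E[B\,F_H(H)]$.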
4 Evaluating TBP Performance in Presence of Confounding
When using observational data to evaluate a pre-specified TBP , we consider both and to adjust for confounding, even though is solely a function of .
To evaluate the moderate calibration of any in the target population, we need to determine (i.e., the calibration curve).
We address the confounding variables by initially focussing on instead of .
Afterward, can be obtained by taking the average of conditional on :
(3)
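The averaging step can be illustrated on a small discrete population. The joint distribution `pmf`, the values in `tau`, and the function name `tau_s` are hypothetical; the sketch only shows how a benefit function of both covariate sets is averaged over the additional covariates within each level of the prediction covariates.

```python
# Hypothetical inputs: joint pmf of (x, z) and the benefit function of both covariate sets.
pmf = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
tau = {(0, 0): 0.10, (0, 1): 0.20, (1, 0): 0.30, (1, 1): 0.40}

def tau_s(x):
    """Average tau(x, z) over the conditional distribution of z given x."""
    px = sum(p for (xi, _), p in pmf.items() if xi == x)
    return sum(p * tau[(xi, z)] for (xi, z), p in pmf.items() if xi == x) / px

for x in (0, 1):
    print(x, round(tau_s(x), 4))
```

Averaging `tau_s` over the marginal distribution of the prediction covariates recovers the population average benefit, as the law of total expectation requires.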
Similarly, to compute of , we determine and (and if needed) in (2) by assessing as well.
Consequently, can be expressed as
(4)
where .
Note that plays a vital role in the determination of both and .
Various approaches are available to determine , with two main methods being outcome regressions and inverse probability weighting methods (Rosenbaum and Rubin, 1983).
For instance, with the outcome models , we have .
With the propensity score , we have .
These two approaches are equivalent in the population-level framework as long as the overlap assumption holds.
The overlap assumption says that the conditional probability of receiving the active treatment or not is bounded away from and , i.e., , for all possible and .
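The population-level agreement of the two approaches can be checked within a single covariate stratum. The sketch assumes a binary outcome and uses hypothetical values for the propensity score and the two conditional outcome means; the function names are illustrative, not part of any library.

```python
def stratum_joint(e, mu1, mu0):
    """Joint pmf of (treatment, binary outcome) within one covariate stratum."""
    return {
        (1, 1): e * mu1, (1, 0): e * (1 - mu1),
        (0, 1): (1 - e) * mu0, (0, 0): (1 - e) * (1 - mu0),
    }

def tau_outcome_regression(joint):
    """Difference of the conditional outcome means in the treated and untreated."""
    p_a1 = joint[(1, 1)] + joint[(1, 0)]
    p_a0 = joint[(0, 1)] + joint[(0, 0)]
    return joint[(1, 1)] / p_a1 - joint[(0, 1)] / p_a0

def tau_ipw(joint, e):
    """Expectation of the inverse-probability-weighted outcome contrast."""
    return sum(p * (a * y / e - (1 - a) * y / (1 - e)) for (a, y), p in joint.items())

joint = stratum_joint(e=0.7, mu1=0.35, mu0=0.20)
print(round(tau_outcome_regression(joint), 4), round(tau_ipw(joint, 0.7), 4))
```

Both functionals return the same value here because the propensity score lies strictly between 0 and 1; if it reached either boundary, one of the divisions would be undefined.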
However, variations may emerge when considering specific finite-sample estimating techniques associated with each.
We will return to the finite-sample estimation of the calibration curve and in Section 7.
If we treat the observational data as if they arose from an RCT, or if we do not sufficiently control for confounding, confounding bias may emerge.
Thus, it is essential to investigate the potential confounding biases and grasp how lack of full control might affect the accuracy of our evaluations.
In this study, we focus on the confounding bias that occurs when alone is not sufficient to control for confounding and denote the confounding bias as a function of .
For , we have
(5)
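A small numerical sketch may help fix ideas. It assumes the confounding bias takes the usual form of the unadjusted treated-versus-untreated contrast within a level of the prediction covariate minus the true conditional benefit, and that adjusting for both covariate sets suffices. All numbers and names (`p_z_given_x`, `mu`, `confounded`, `unconfounded`) are hypothetical.

```python
# Hypothetical population with one binary prediction covariate x and one binary extra covariate z.
p_z_given_x = {0: 0.5, 1: 0.5}  # P(z = 1 | x)
mu = {  # mu[a][(x, z)]: mean outcome under treatment a in stratum (x, z)
    1: {(0, 0): 0.30, (0, 1): 0.60, (1, 0): 0.40, (1, 1): 0.70},
    0: {(0, 0): 0.20, (0, 1): 0.50, (1, 0): 0.20, (1, 1): 0.50},
}

def z_weights(x, a, propensity):
    """Distribution of z among patients with covariate x who received treatment a (Bayes' rule)."""
    raw = {}
    for z in (0, 1):
        pz = p_z_given_x[x] if z == 1 else 1 - p_z_given_x[x]
        e = propensity[(x, z)]
        raw[z] = pz * (e if a == 1 else 1 - e)
    total = raw[0] + raw[1]
    return {z: w / total for z, w in raw.items()}

def bias(x, propensity):
    """Unadjusted contrast given x minus the true conditional benefit given x."""
    naive = 0.0
    for a, sign in ((1, 1.0), (0, -1.0)):
        w = z_weights(x, a, propensity)
        naive += sign * sum(w[z] * mu[a][(x, z)] for z in (0, 1))
    pz1 = p_z_given_x[x]
    truth = sum((pz1 if z else 1 - pz1) * (mu[1][(x, z)] - mu[0][(x, z)]) for z in (0, 1))
    return naive - truth

confounded = {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.2, (1, 1): 0.8}    # treatment depends on z
unconfounded = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.5}  # treatment ignores z

print(round(bias(0, confounded), 4), round(bias(0, unconfounded), 4))
```

The bias vanishes in the second scenario because the extra covariate no longer influences treatment assignment, even though it still affects the outcome.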
To illustrate the propagation of to performance metrics, we denote the inaccurate and calculated without controlling for as and , respectively.
The bias function of the calibration curve and the bias of are influenced by and could be different from (5).
For the calibration curve, the deviation from the accurate assessment can be expressed as
(6)
which is a function of .
It depends on and the association between and .
For , the confounding bias affects the calculation of both and .
The discrepancy from actual is , while the deviation from is .
However, expressing the deviation from the true is complex as it involves the difference between two ratios.
The deviation is in the form of:
(7)
This value not only depends on but also on , , and .
According to the biases (6) and (7), the deviations in moderate calibration and may yield zero value(s) even with non-zero .
In particular, zero deviations in moderate calibration would occur when for all , and zero deviation in would occur when .
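The possibility of a zero deviation despite non-zero confounding bias can be illustrated with hypothetical numbers. The sketch assumes that the deviation of the calibration curve at a prediction level is the average bias among patients with that prediction, and it computes the discrimination index from the single-patient expression with a tie correction. The strata and function names are illustrative only.

```python
# Hypothetical strata: (probability, true conditional benefit, confounding bias, TBP prediction).
strata = [
    (0.25, 0.10, +0.04, 0.1),
    (0.25, 0.10, -0.04, 0.1),
    (0.50, 0.30, +0.06, 0.3),
]

def calibration_deviation(strata):
    """Average bias among patients sharing each prediction level."""
    out = {}
    for h in {s[3] for s in strata}:
        group = [(p, b) for p, _, b, hh in strata if hh == h]
        out[h] = sum(p * b for p, b in group) / sum(p for p, _ in group)
    return out

def cb(strata, use_bias):
    """Index from E[T] / E[T (2 F(H) - p(H))], with T the benefit, optionally contaminated by bias."""
    levels = sorted({s[3] for s in strata})
    mass = {h: sum(p for p, _, _, hh in strata if hh == h) for h in levels}
    cdf, running = {}, 0.0
    for h in levels:
        running += mass[h]
        cdf[h] = running
    num = den = 0.0
    for p, tau, b, h in strata:
        t = tau + (b if use_bias else 0.0)
        num += p * t
        den += p * t * (2 * cdf[h] - mass[h])
    return 1 - num / den

print(calibration_deviation(strata))
print(round(cb(strata, use_bias=False), 4), round(cb(strata, use_bias=True), 4))
```

At the lower prediction level the two opposite biases cancel, so the calibration curve is unaffected there, while the higher level and the discrimination index are both distorted. The direction of the distortion in this toy case should not be read as general.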
In the next section, we further investigate these biases in several illustrative examples to demonstrate how influences the evaluation results.
5 Examples Relevant to Confounding Bias in Evaluation
In this section, we establish two synthetic populations to illustrate the impact of confounding bias on evaluating given TBPs in the population-level framework.
The first population describes a linear function with binary outcome and covariates, which enables exploring to what extent the strength of confounding affects the bias of both metrics.
The second population has a non-linear function with continuous outcome and covariates.
This offers flexibility in defining , and propensity score function .
Unlike prior confounding bias studies, we investigate the propagation of confounding bias to and for both populations, where both are determined in closed-form.
5.1 Population 1. Binary Outcome and Covariates
Assume the dimensions of two sets of covariates are and , and all and are binary.
In this binary outcome setup, the individual treatment benefit .
We assume and the distribution is in the form of
where is a linear combination of , , and .
Distribution leads to a linear , which is
and .
There are parameters to capture the relationship between outcome, treatment, and covariates, with constraints imposed on all , , and to ensure legitimate distributions.
The linear propensity score function is .
The strength of confounding is determined by the values of and .
We formulate three TBPs: is the mean of covariates, is designed to be moderately calibrated, and is designed to be strongly calibrated, by carefully choosing coefficients.
The expressions for these three TBPs are as follows:
where and , and are coefficients.
Moreover, , which is a bijective function, uniquely mapping the four distinct values of to four unique prediction values, exhibiting strong calibration, and thus surpassing all potential TBPs.
(See Appendix A for the detailed definitions of the coefficients.)
5.2 Population 2. Continuous Outcome and Covariates
We still assume and , but let be independent, with each following the uniform distribution on the interval .
Adopting the setup proposed by Foster and Syrgkanis (2023),
where the conditional independence is assumed, i.e., .
Note that and contribute to explaining the outcome and treatment assignment.
Let and consider simple functions: propensity score function and base response function .
We define the ensuing CATE function and TBP as a selected exemplification:
The predicted treatment benefit is a sum of two i.i.d uniform random variables on , which follows a triangular distribution with parameters: lower limit , upper limit and mode .
The variable is independent of conditioning on (i.e., ).
Therefore, is not an effect modifier but a confounding variable.
Note that , the maximum of these two uniform random variables, follows .
Hence, the population average treatment benefit is .
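The two distributional facts used here can be checked numerically. The sketch assumes, purely for illustration, that the two covariates are independent and uniform on the unit interval; under that assumption the maximum has mean 2/3 and the sum is symmetric around 1. It uses a deterministic midpoint grid rather than random draws.

```python
n = 400
grid = [(i + 0.5) / n for i in range(n)]  # cell midpoints of an n-by-n grid on the unit square

# Mean of the maximum of the two covariates (assumed Uniform(0, 1) each).
mean_max = sum(max(x1, x2) for x1 in grid for x2 in grid) / n**2

# Share of the population whose covariate sum falls below 1, the mode of the triangular law.
share_sum_below_one = sum(1 for x1 in grid for x2 in grid if x1 + x2 < 1) / n**2

print(round(mean_max, 4), round(share_sum_below_one, 3))
```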
6 Metrics Performance in Two Synthetic Populations
For the first synthetic population, we employed a specific set of parameters to compare evaluation results for TBPs with and without controlling for .
This selection serves as just one instance among numerous potential examples:
where values in were randomly generated but represent a valid joint distribution of .
Upon establishing these parameters, we define the target population and consequently determine ,
which is greater than for all and a non-linear function of .
In this setup, the value of is influenced not only by the three parameters that determine confounding strength but also by other additional parameters; see Appendix A for further discussion of under various “strength of confounding.”
We then evaluate the three pre-specified TBPs using the calibration curve and with and without confounding bias.
Note that and exhibit analogous mapping patterns: each maps four unique combinations of to three distinct values.
We compute and through closed-form expressions and calculate and either via (6) and (7), or by using the inaccurate instead of in (3) and (4).
When , the evaluation results are illustrated in Figure 1, where the three plots on the left display the moderate calibration curves of , , and .
We see that and are moderately calibrated, aligning closely with the 45-degree line.
Additionally, while lacks moderate calibration, its predictions are positively associated with .
Figure 1 further highlights distinct disparities between and , particularly noting that for all three TBPs due to .
Consequently, the failure to control for confounding variables results in an inaccurate calibration assessment.
The three plots on the right in Figure 1 show the RCCs and values for the three TBPs.
The RCCs and for and are identical.
This is because the CDFs of and are the same, resulting in the two TBPs identically ranking patients.
Note that the optimal TBP yields a slightly larger and , compared to and .
This reflects that yields a more effective treatment assignment rule, leading to a larger average treatment benefit.
However, is smaller than for all three TBPs.
When , all blue dotted curves align with the red dashed curves for both performance metrics because is no longer a confounding variable.
Particularly in the calibration plots for and , all curves lie on the 45-degree diagonal line.
(See Figure 5 in Appendix A for the corresponding calibration plots, , and RCCs.)
The findings highlight the importance of controlling confounding variables when conducting evaluations in observational studies to obtain accurate results.
Ignoring confounding variables can produce misleading patterns, to varying extents depending on the strength and direction of the associations.
In the second synthetic population, with the actual , we compute the calibration curve and by initially deriving the closed-form joint distribution of and then analyzing the corresponding closed-form conditional distribution of .
Similarly, we compare and with and .
(See Appendix C for calculation details.)
In this setup, is a function of :
, which is illustrated in Figure 2.
The average treatment benefit conditioned on the predicted treatment benefit from is .
Using the second treatment assignment rule based on , the average treatment benefit is , slightly exceeding the population average .
The for is .
When assigning treatment based on the predicted treatment benefit from , the average treatment benefit from the second treatment assignment rule is .
In other words, the second treatment assignment rule, based on the actual , does not exhibit a significant improvement in average treatment benefit compared to .
The corresponding for is .
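The comparison of the two ranking rules can be reproduced numerically under stated assumptions: covariates independent and uniform on the unit interval, the conditional benefit equal to the maximum of the two covariates, and the TBP equal to their sum. These choices, and the names `triangular_cdf` and `cb_continuous`, are assumptions made for this sketch.

```python
def triangular_cdf(s):
    """CDF of the sum of two independent Uniform(0, 1) variables."""
    return 0.5 * s * s if s <= 1 else 1 - 0.5 * (2 - s) ** 2

def cb_continuous(rank_cdf, n=300):
    """Index 1 - E[tau] / (2 E[tau * F_H(H)]) by the midpoint rule on the unit square."""
    grid = [(i + 0.5) / n for i in range(n)]
    mean_tau = weighted = 0.0
    for x1 in grid:
        for x2 in grid:
            tau = max(x1, x2)  # assumed conditional benefit
            mean_tau += tau
            weighted += tau * rank_cdf(x1, x2)
    return 1 - mean_tau / (2 * weighted)

cb_sum = cb_continuous(lambda x1, x2: triangular_cdf(x1 + x2))  # rank by the sum
cb_max = cb_continuous(lambda x1, x2: max(x1, x2) ** 2)         # rank by the benefit itself

print(round(cb_sum, 3), round(cb_max, 3))
```

Under these assumptions a direct calculation gives 7/47 (about 0.149) when ranking by the sum and 1/6 (about 0.167) when ranking by the benefit itself, so the optimal ranking improves the index only modestly.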
However, confounding bias causes a deviation from the actual , as depicted by the step function:
It also causes an overestimation of by and an inaccurate calculated as .
Consequently, for , the of is lower than the of .
Figure 3 illustrates the calibration plot and RCC for with and without controlling .
It shows the distinct influence of confounding bias on moderate calibration, RCC, and assessments.
In the calibration plot, the red curve significantly deviates from the blue curve as the value of approaches both extremes, near and .
Moreover, confounding bias reduces the area between the independence line and the RCC by roughly half.
7 Discussion
In clinical settings, TBPs derived from prior studies offer valuable guidance for physicians and patients in making informed treatment decisions.
These TBPs, which may have been developed in various populations, should be evaluated in the target population before implementation (Riley et al., 2024).
Observational data, where treatment assignment is not at random, might be the only opportunity for such evaluation.
Consequently, addressing confounding bias is crucial when assessing treatment benefits on observational data.
This study evaluated pre-specified TBPs using observational studies and explored how confounding bias influences the evaluation of TBPs in a population-level framework.
We delved into two specific metrics, one focusing on calibration and the other on discrimination, and we proposed two bias expressions of calibration and .
We demonstrated that the failure to control for confounding variables leads to inaccurate assessments of moderate calibration and .
The impact of confounding bias on the assessment of moderate calibration and differs.
The two synthetic populations demonstrated lead to two positive functions, which are two examples of many other possible functions.
These two functions result ; nevertheless, .
In other words, positive confounding bias may lead to overestimation of , and but underestimation of , at least for these choices of the TBP .
This study conducted a conceptual analysis, which lays the groundwork for finite-sample estimation.
To evaluate pre-determined TBPs using real-world observational data, the primary challenge shifts to estimating and then from the sample.
As previously discussed, can be estimated through outcome regression, inverse probability weighting, or a combination of both, such as the doubly robust method (Bang and Robins, 2005).
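As a hedged illustration of the doubly robust idea, the sketch below uses the standard augmented inverse probability weighting (AIPW) pseudo-outcome, whose mean within a covariate stratum equals the stratum benefit when either the outcome models or the propensity model is correct. The function names and the numbers are hypothetical, and this is not presented as the estimator used by the authors.

```python
def aipw_pseudo_outcome(y, a, e_hat, mu1_hat, mu0_hat):
    """Standard AIPW pseudo-outcome for one patient."""
    return (mu1_hat - mu0_hat
            + a * (y - mu1_hat) / e_hat
            - (1 - a) * (y - mu0_hat) / (1 - e_hat))

def stratum_mean(e, mu1, mu0, e_hat, mu1_hat, mu0_hat):
    """Population mean of the pseudo-outcome in one stratum with a binary outcome."""
    joint = {(1, 1): e * mu1, (1, 0): e * (1 - mu1),
             (0, 1): (1 - e) * mu0, (0, 0): (1 - e) * (1 - mu0)}
    return sum(p * aipw_pseudo_outcome(y, a, e_hat, mu1_hat, mu0_hat)
               for (a, y), p in joint.items())

# True stratum quantities (hypothetical): propensity 0.7, outcome means 0.35 and 0.20.
wrong_outcome_model = stratum_mean(0.7, 0.35, 0.20, e_hat=0.7, mu1_hat=0.5, mu0_hat=0.5)
wrong_propensity = stratum_mean(0.7, 0.35, 0.20, e_hat=0.4, mu1_hat=0.35, mu0_hat=0.20)

print(round(wrong_outcome_model, 4), round(wrong_propensity, 4))
```

Both calls recover the true stratum benefit of 0.15, one with a misspecified outcome model and the other with a misspecified propensity score.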
Estimating each performance metric might have its challenges.
For instance, for the calibration curve, we need to estimate the conditional expectation of estimated given .
When is discrete, the estimation can rely on the sample average within groups sharing the same .
However, the estimation for continuous is non-trivial.
For , estimating the CDF of and possibly its can be achieved through either the empirical distribution or modelling .
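A plug-in version based on the empirical distribution might look like the sketch below. It takes per-patient benefit estimates and TBP predictions as given, handles ties through the empirical point mass, and is only one of several possible estimators; the function name `cb_plugin` and the sample values are hypothetical.

```python
from collections import Counter

def cb_plugin(tau_hat, h):
    """Plug-in index from estimated benefits tau_hat[i] and predictions h[i], ties allowed."""
    n = len(h)
    counts = Counter(h)
    cdf, running = {}, 0
    for value in sorted(counts):
        running += counts[value]
        cdf[value] = running / n  # empirical CDF at each distinct prediction
    num = sum(tau_hat) / n
    den = sum(t * (2 * cdf[v] - counts[v] / n) for t, v in zip(tau_hat, h)) / n
    return 1 - num / den

# Hypothetical sample of ten patients with two distinct prediction values.
tau_hat = [0.02] * 4 + [0.10] * 3 + [0.30] * 3
h = [0.0] * 4 + [0.1] * 6

print(round(cb_plugin(tau_hat, h), 4))
```

Because the empirical distribution pairs each patient with itself as well as with others, this sketch behaves like a V-statistic; a pairwise version that excludes self-pairs would differ slightly in small samples.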
The preceding discussion has identified several areas for future research.
One might wonder if there is a performance metric for TBPs that is less sensitive to the influence of confounding bias.
We examined two performance metrics; however, an investigation of more existing performance metrics is needed to answer this question.
Additionally, the provided two synthetic populations assume independent counterfactual outcomes because it is a commonly used assumption in applications.
However, real-world target populations can be far more complex.
Further research is needed to examine populations with correlated counterfactual outcomes.
Moreover, our conceptual analysis provides a better understanding of the impact of the confounding bias on the TBP evaluation, and the proposed framework can be applied to real data sets.
It is then natural to explore which of the existing CATE estimation methods is flexible enough to handle complex functions and which gives a more precise TBP assessment for making treatment decisions.
Ultimately, the final building blocks for finite-sample estimation are methods that address the challenges of estimating the various performance metrics.
References
Abrevaya et al., (2015)
Abrevaya, J., Hsu, Y.-C., and Lieli, R. P. (2015).
Estimating conditional average treatment effects.
Journal of Business & Economic Statistics, 33(4):485–505.
Bang and Robins, (2005)
Bang, H. and Robins, J. M. (2005).
Doubly robust estimation in missing data and causal inference models.
Biometrics, 61(4):962–973.
Efthimiou et al., (2023)
Efthimiou, O., Hoogland, J., Debray, T. P., Seo, M., Furukawa, T. A., Egger, M., and White, I. R. (2023).
Measuring the performance of prediction models to personalize treatment choice.
Statistics in Medicine, 42(8):1188–1206.
Foster and Syrgkanis, (2023)
Foster, D. J. and Syrgkanis, V. (2023).
Orthogonal statistical learning.
Hoogland et al., (2022)
Hoogland, J., Efthimiou, O., Nguyen, T.-L., and Debray, T. P. (2022).
Evaluating individualized treatment effect predictions: a new perspective on discrimination and calibration assessment.
arXiv preprint, (arXiv:2209.06101).
Imbens, (2003)
Imbens, G. W. (2003).
Sensitivity to exogeneity assumptions in program evaluation.
American Economic Review, 93(2):126–132.
Kent et al., (2020)
Kent, D. M., Paulus, J. K., Van Klaveren, D., D’Agostino, R., Goodman, S., Hayward, R., Ioannidis, J. P., Patrick-Lake, B., Morton, S., Pencina, M., et al. (2020).
The Predictive Approaches to Treatment effect Heterogeneity (PATH) statement.
Annals of Internal Medicine, 172:35–45.
la Roi-Teeuw et al., (2024)
la Roi-Teeuw, H. M., van Royen, F. S., de Hond, A., Zahra, A., de Vries, S., Bartels, R., Carriero, A. J., van Doorn, S., Dunias, Z. S., Kant, I., et al. (2024).
Don’t be misled: Three misconceptions about external validation of clinical prediction models.
Journal of Clinical Epidemiology, page 111387.
Prosperi et al., (2020)
Prosperi, M., Guo, Y., Sperrin, M., Koopman, J. S., Min, J. S., He, X., Rich, S., Wang, M., Buchan, I. E., and Bian, J. (2020).
Causal inference and counterfactual prediction in machine learning for actionable healthcare.
Nature Machine Intelligence, 2(7):369–375.
Riley et al., (2024)
Riley, R. D., Archer, L., Snell, K. I., Ensor, J., Dhiman, P., Martin, G. P., Bonnett, L. J., and Collins, G. S. (2024).
Evaluation of clinical prediction models (part 2): how to undertake an external validation study.
BMJ, 384.
Riley et al., (2019)
Riley, R. D., Snell, K. I., Moons, K. G., and Debray, T. P. (2019).
Fundamental statistical methods for prognosis research.
In Prognosis Research in Health Care, chapter 3, pages 37–68. Oxford University Press.
Robertson et al., (2021)
Robertson, S. E., Leith, A., Schmid, C. H., and Dahabreh, I. J. (2021).
Assessing heterogeneity of treatment effects in observational studies.
American Journal of Epidemiology, 190(6):1088–1100.
Rosenbaum and Rubin, (1983)
Rosenbaum, P. R. and Rubin, D. B. (1983).
The central role of the propensity score in observational studies for causal effects.
Biometrika, 70(1):41–55.
Rubin, (1980)
Rubin, D. B. (1980).
Randomization analysis of experimental data: The Fisher randomization test comment.
Journal of the American Statistical Association, 75(371):591–593.
Sadatsafavi et al., (2020)
Sadatsafavi, M., Mansournia, M. A., and Gustafson, P. (2020).
A threshold-free summary index for quantifying the capacity of covariates to yield efficient treatment rules.
Statistics in Medicine, 39:1362–1373.
Steyerberg, (2019)
Steyerberg, E. W. (2019).
Clinical Prediction Models.
Springer International Publishing.
Van Calster et al., (2019)
Van Calster, B., McLernon, D. J., Van Smeden, M., Wynants, L., and Steyerberg, E. W. (2019).
Calibration: the Achilles heel of predictive analytics.
BMC Medicine, 17(1):230.
Van Calster et al., (2016)
Van Calster, B., Nieboer, D., Vergouwe, Y., De Cock, B., Pencina, M. J., and Steyerberg, E. W. (2016).
A calibration hierarchy for risk models was defined: from utopia to empirical data.
Journal of Clinical Epidemiology, 74:167–176.
van Klaveren et al., (2019)
van Klaveren, D., Balan, T. A., Steyerberg, E. W., and Kent, D. M. (2019).
Models with interactions overestimated heterogeneity of treatment effects and were prone to treatment mistargeting.
Journal of Clinical Epidemiology, 114:72–83.
van Klaveren et al., (2018)
van Klaveren, D., Steyerberg, E. W., Serruys, P. W., and Kent, D. M. (2018).
The proposed ‘concordance-statistic for benefit’ provided a useful metric when modeling heterogeneous treatment effects.
Journal of Clinical Epidemiology, 94:59–68.
Veitch and Zaveri, (2020)
Veitch, V. and Zaveri, A. (2020).
Sense and sensitivity analysis: Simple post-hoc analysis of bias due to unobserved confounding.
Advances in Neural Information Processing Systems, 33:10999–11009.
Vickers et al., (2007)
Vickers, A. J., Kattan, M. W., and Sargent, D. J. (2007).
Method for evaluating prediction models that apply the results of randomized trials to individual patients.
Trials, 8:1–11.
Xia et al., (2023)
Xia, Y., Gustafson, P., and Sadatsafavi, M. (2023).
Methodological concerns about “concordance-statistic for benefit” as a measure of discrimination in predicting treatment benefit.
Diagnostic and Prognostic Research, 7(1):10.
Yitzhaki and Olkin, (1991)
Yitzhaki, S. and Olkin, I. (1991).
Concentration indices and concentration curves.
Lecture Notes-Monograph Series, pages 380–392.
Zhou and Zhu, (2021)
Zhou, N. and Zhu, L. (2021).
On IPW-based estimation of conditional average treatment effects.
Journal of Statistical Planning and Inference, 215:1–22.
Appendix A: Extra Results in Population 1
Confounding Bias Function
The first population is generated using simple linear functions, yet the confounding bias bias(X), which is determined by 18 parameters (coefficients) and the covariate , is complex.
When is fixed at a specific value , we depict the bias as a function of the coefficients, observing how the bias fluctuates as these coefficients vary.
When , and allow to take any value within the range .
This interval is determined by for a valid distribution.
Figure 4 shows the bias(X) across varying strengths of confounding, revealing a complex relationship between the bias and the three parameters .
When and , we have , , , and (with all values rounded to four decimal places).
Selected Coefficients and Performance for TBPs
In population 1, we define a distribution and a linear function with a total of 18 parameters.
To design a moderately calibrated and a strongly calibrated , we specify
and
When , the variable in population 1 ceases to be a confounding variable, and .
Consequently, , resulting in zero bias for both the and .
Appendix B: Propositions for Calculation
The expectation of Maximum-like (continuous). For continuous variables and , we have two independent copies denoted as .
The expectation of Maximum-like follows
where is the CDF of .
Proof. We demonstrate that the expected value of the Maximum-like for two patient pairs is twice the expected value of B, weighted by its CDF value.
The Gini-like index (continuous). For continuous variables and , the Gini-like index, representing twice the area () between the line of independence () and the relative concentration curve (), is defined as
We assume that .
Proof. Note that , where represents the -th quantile concerning the value of .
To find twice the area between line of independence and the RCC , we start with:
where represents value at the -th quantile.
Therefore, we have
The expectation of Maximum-like (discrete). For discrete variables and , we have two independent copies denoted as .
The expectation of Maximum-like follows
where denotes the probability mass function (PMF) of .
Proof.
The Gini-like index (discrete). For discrete variables and , the Gini-like index, representing twice the area () between the line of independence () and the relative concentration curve (), is defined as
We assume that .
Proof. For a discrete variable with distinct values, patients are ranked by their value of in ascending order to plot the RCC.
Assume that with the probability corresponding.
Note that , and we can express as
If , area would be bounded between and .
We calculate the area as minus the sum of the area of one triangle and trapezoids, which is
We then express each as the sum of treatment benefit averages for disjoint groups of patients, with each group having no overlap.
For instance,
It is possible to show that for
If some patients have , the area can take a value greater than , but the expression of stays the same.
In other words, the Gini-like index could be greater than with some negative treatment benefit values in the target population.
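The trapezoid construction can be checked numerically. The sketch below builds the relative concentration curve for a hypothetical discrete population, computes the Gini-like index as one minus twice the area under the curve, and compares it with the single-patient expression for the discrimination index. The relation printed at the end is derived for this sketch under those definitions, and the function name and inputs are illustrative.

```python
def gini_like_and_cb(levels):
    """levels: (probability, mean benefit) per distinct prediction, sorted by ascending prediction."""
    mean_b = sum(p * b for p, b in levels)
    area_under_rcc = cum_share = cdf = weighted = 0.0
    for p, b in levels:
        new_share = cum_share + p * b / mean_b
        area_under_rcc += p * (cum_share + new_share) / 2  # one trapezoid per level
        cum_share = new_share
        cdf += p
        weighted += p * b * (2 * cdf - p)  # benefit weighted by 2 F(H) - p(H)
    gini = 1 - 2 * area_under_rcc  # twice the area between the diagonal and the curve
    cb = 1 - mean_b / weighted
    return gini, cb

gini, cb = gini_like_and_cb([(0.4, 0.02), (0.6, 0.2)])
print(round(gini, 4), round(cb, 4), round(gini / (1 + gini), 4))
```

In this example the discrimination index equals the Gini-like index divided by one plus itself, which follows from the algebra of the trapezoid areas under the definitions assumed here.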
Aside:
For the univariate case, the properties of the expectation of maximum follow similar patterns.
For a continuous variable () with two independent copies , we observe that
where represents the CDF of .
Similarly, for a discrete variable with two independent copies , the equation becomes
where denotes the PMF of .
Moreover, for the Gini coefficient (), we have
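The discrete identity for the expected maximum can be verified exactly with a fair six-sided die, chosen here only as a convenient example. The single-variable form weights each value by twice its CDF minus its point mass.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)  # point mass of each face of a fair die

# Direct enumeration over two independent rolls.
direct = sum(Fraction(max(a, b)) * p * p for a, b in product(faces, repeat=2))

# Single-variable form: sum of y * (2 F(y) - p(y)) * p(y), with F(y) = y / 6.
closed_form = sum(y * (2 * Fraction(y, 6) - p) * p for y in faces)

print(direct, closed_form)
```

Exact rational arithmetic avoids any floating-point tolerance, so the two expressions can be compared with strict equality.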
Appendix C: Closed-form Expressions in Setting 2
With Control for Confounding
Recall that .
As is not a one-to-one map in population 2, we begin by calculating the pre-image of the point .
The pre-image of any point consists of two points given by the set
which exists when and .
Observe that the pre-image of any point comprises two points achieved by interchanging the values of and .
These two points correspond to two scenarios: and .
Fortunately, in each scenario, the max operator functions as a one-to-one map.
Overall, the joint density of is the sum of the two parts, which is
Now, we can recalculate the marginal PDFs from joint :
The CDF of is
With the joint PDF of , we calculate the conditional density function of by definition:
Finally, we can compute by the definition of conditional expectation:
Then, we calculate the for by computing
Furthermore, we can calculate the for by computing
Without Control for Confounding
The confounding bias is defined as
We want to figure out the expression of given by integrating out of
Overall, we have
We denote as , and we have
where is the confounding bias.
We compute
Then, we calculate , which is improperly calculated with confounding bias
To calculate , we need to figure out the joint distribution of and then the conditional distribution of .
As is an injective function, we can get
where , , , and .
The conditional distribution should be