Unmasking Bias: A Framework for Evaluating Treatment Benefit Predictors Using Observational Studies

Yuan Xia, Mohsen Sadatsafavi, and Paul Gustafson

Abstract

Treatment benefit predictors (TBPs) map patient characteristics into an estimate of the treatment benefit tailored to individual patients, which can support optimizing treatment decisions. However, assessing their performance can be challenging when treatment assignment is non-random. This study conducts a conceptual analysis, which can be applied to finite-sample studies. We present a framework for evaluating TBPs using observational data from a target population of interest. We then explore the impact of confounding bias on TBP evaluation using measures of calibration and discrimination, namely moderate calibration and the concentration of the benefit index ( $C_{b}$ ), respectively. We illustrate that failure to control for confounding can lead to misleading values of performance metrics, and we establish how confounding bias propagates to evaluation bias by deriving explicit bias expressions for the performance metrics. These findings underscore the necessity of accounting for confounding factors when evaluating TBPs, ensuring more reliable and contextually appropriate treatment decisions.

Keywords: calibration; discrimination; confounding bias; precision medicine.

1 Introduction

Precision medicine aims to optimize medical care by tailoring treatment decisions to the unique characteristics of each patient. This objective naturally falls in the intersection between predictive analytics and causal inference; the former aims at predicting the outcome of interest, and the latter seeks to answer counterfactual “what if” questions about the outcome. Most of the progress in predictive analytics has centred around predicting risks. To customize medical treatments, we must shift our focus to predicting treatment benefits. Such prediction is often termed “causal prediction” or “counterfactual prediction” (Prosperi et al., 2020). Many studies have investigated whether and how a specific covariate or a set of covariates modifies the treatment benefit, such as Abrevaya et al. (2015), Robertson et al. (2021), and Zhou and Zhu (2021). We refer to such a function that maps patient characteristics to an estimate of treatment benefit as a treatment benefit predictor (TBP). Before being adopted in patient care, a pre-specified TBP needs to be evaluated (validated) in the target population of interest (la Roi-Teeuw et al., 2024).

The validation process for TBPs is currently an active area of research (Kent et al., 2020). Traditionally, performance metrics for risk prediction are categorized into measures of overall fit, discrimination, calibration, and clinical utility (net benefit) (Riley et al., 2019; Steyerberg, 2019). Discrimination pertains to the predictive capacity of distinguishing individuals with and without the outcome of interest. Calibration focuses on the proximity of predicted and actual risks. Net benefit assesses the clinical usefulness of a risk prediction algorithm by quantifying the trade-off between the benefits of a true positive classification versus the harms of a false positive one. In the context of treatment benefit prediction, various performance measures for TBPs have been formulated by extending the concepts from risk prediction to the treatment-benefit paradigm. For instance, Vickers et al. (2007) provided an extension of net benefit for TBPs, and van Klaveren et al. (2019) and Hoogland et al. (2022) discussed extensions of calibration and discrimination. Efthimiou et al. (2023) combined calibration and discrimination into measures of decision accuracy. However, extending these methods isn’t straightforward, since we cannot observe the outcome (treatment benefit) due to the unavailability of the counterfactual. Thus, assessing the performance of TBPs poses a significant challenge.

TBPs can be validated using data from randomized controlled trials (RCTs), where treatment assignment is not systematically confounded. Nevertheless, RCTs from the target population of interest are not always available. Even if available, they are often underpowered to evaluate a TBP or lack sufficient follow-up time to elucidate treatment effects on relevant outcomes. For some interventions where equipoise is not established, conducting an RCT might be unethical. Hence, observational studies might give the only opportunity to examine the performance of a TBP in the target population.

Using observational studies adds complexity primarily due to the potential presence of confounding bias, which hinders the identification of treatment benefits. Confounding bias and how it influences the estimation of estimands have been extensively studied. For instance, Imbens (2003) and Veitch and Zaveri (2020) have investigated the influence of confounding bias on the average treatment effect (ATE). However, confounding bias has received less scrutiny in TBP evaluation. In this study, we show the impact of failing to fully control for confounding on TBP evaluation and offer a comprehensive conceptual evaluation framework applicable to any performance metric. We consider calibration and discrimination and, through a conceptual analysis, focus on two specific performance metrics as illustrative examples of assessing pre-specified TBPs.

2 Notation and Assumptions

Each individual in the target population is described by $(Y^{(0)},Y^{(1)},A,X,Z)$ with joint distribution $\mathbb{P}$ , where $A\in\{0,1\}$ is the treatment chosen indicator with $A=0$ denoting the absence of the treatment and $A=1$ being the presence of the treatment; $Y^{(a)}$ is the counterfactual outcome that would be observed under treatment $a$ ; $X\in\mathbb{R}^{d}$ is the set of pre-treatment covariates observable in routine clinical practice and also in observational studies that will be used to predict treatment benefit; and $Z\in\mathbb{R}^{p}$ is a distinct set of additional covariates that might be only available in the observational study, and might be needed to control for confounding. For instance, $X$ can be blood pressure and age available at the point of care, which are used to predict the benefit of statin therapy for cardiovascular diseases. Meanwhile, $Z$ , socioeconomic status, is a confounding variable but not often used for predicting benefit from statins.

Individual treatment benefit is quantified as $B=Y^{(1)}-Y^{(0)}$ , which is unobservable. For instance, when an individual has received $A=0$ , the corresponding outcome $Y^{(1)}$ remains unobserved (and therefore counterfactual). The conditional mean outcome under treatment $a$ is denoted as $\mu_{a}(x,z)=\operatorname{E}[Y^{(a)}\mid X=x,Z=z]$ . We also denote $\tau(x,z)=\operatorname{E}[B\mid X=x,Z=z]$ and $\tau_{s}(x)=\operatorname{E}[B\mid X=x]$ . Typically, $\tau(x,z)$ is referred to as the conditional average treatment effect (CATE), as $\{X,Z\}$ is the entire input covariate space from $\mathbb{P}$ . However, when our focus is on a subset of covariates $X\subset\{X,Z\}$ , we concentrate on $\tau_{s}(X)$ . The ATE is $\tau^{*}=\operatorname{E}[B]$ .

A TBP denoted as $h(x)$ predicts the benefit of an active treatment of interest based on known patient characteristics $X$ in routine clinical practice. It can guide treatment decision-making, for example the care provider offering treatment only to those with $h(x)>0$ . We denote the predicted treatment benefit from $h(x)$ as $H:=h(X)$ and its cumulative distribution function (CDF) as $F_{H}(\cdot)$ . The best possible TBP is $\tau_{s}(x)$ itself, and the corresponding prediction is $\tau_{s}(X)$ with CDF $F_{\tau_{s}(X)}(\cdot)$ .

This study is motivated by the question: how can $h(x)$ be evaluated using representative observational data from the target population where treatment is confounded? In an observational study, we can observe i.i.d. draws of $(Y,A,X,Z)$ from the joint distribution $\mathbb{P}_{obs}$ , which is a consequence of $\mathbb{P}$ . To evaluate $h(x)$ , the following three assumptions are universally required: (1) no interference: between any two individuals, the treatment taken by one does not affect the counterfactual outcomes of the other; (2) consistency: the counterfactual outcome under the observed treatment assignment equals the observed outcome $Y$ , i.e., $Y=Y^{(1)}A+Y^{(0)}(1-A)$ ; and (3) conditional exchangeability: the treatment assignment is independent of the counterfactual outcomes, given the set of variables $X\cup Z$ , i.e., $A\perp\!\!\!\perp Y^{(0)},Y^{(1)}\mid X\cup Z$ . The first two assumptions are known as the stable-unit-treatment-value assumption (SUTVA) (Rubin, 1980). The last one assumes no unmeasured confounders given $X$ and $Z$ .

3 Performance Metrics

In this study, we explore two specific metrics, one for calibration and one for discrimination, within the population-level framework. This framework enables us to conceptually understand how observational data can identify the performance of $h(\cdot)$ and to explore the extent of potential misguidance when failing to control for confounding.

3.1 Calibration

Van Calster et al. (2016) proposed a hierarchical definition of calibration for risk prediction models. In what follows, we focus on what they named ‘moderate calibration’: that the expected value of the outcome among individuals with the same predicted risk is equal to the predicted risk. They argue that moderate calibration is the most desired form of calibration. Similarly, in treatment benefit prediction, a TBP $h(x)$ can be considered moderately calibrated if $\operatorname{E}[B\mid H]=H$ . That is, the average treatment benefit among all patients with predicted treatment benefit $H=h$ equals $h$ , for any $h$ . For example, if $h(x)$ is moderately calibrated and predicts a group of individuals to have $H=0.5$ , we should expect that the average treatment benefit within the group is also $0.5$ . Furthermore, $h(x)$ is strongly calibrated if $\tau_{s}(X)=H$ .

Calibration of TBPs can also be visualized in a calibration plot (Van Calster et al., 2019). The calibration plot compares $\operatorname{E}[B\mid H]$ against $H$ , with a moderately calibrated TBP showing points aligned around the diagonal identity line.

3.2 Discrimination

In risk prediction, we assess discrimination using either concordance measures or measures of disparity. The c-statistic and the Gini index are examples of concordance and disparity measures, respectively. Both metrics have been extended to the field of treatment benefit prediction. However, it has been established that the c-for-benefit (van Klaveren et al., 2018), analogous to the c-statistic for TBPs, does not qualify as a proper scoring rule (Xia et al., 2023). Therefore, we shift our focus to the concentration of the benefit index ( $C_{b}$ ), a single-value summary of the difference in average treatment benefit between two treatment assignment rules: ‘treat at random’ and ‘treat greater $H$ ’ (Sadatsafavi et al., 2020).

With i.i.d. copies $\{(B_{1},H_{1}),(B_{2},H_{2})\}$ of $(B,H)$ , the $C_{b}$ of $H$ is defined as:

$$
C_{b}=1-\frac{\operatorname{E}[B_{1}]}{\operatorname{E}[B_{1}I(H_{1}\geq H_{2})+B_{2}I(H_{1}<H_{2})]},
\tag{1}
$$

where $I(\cdot)$ is an indicator function. The denominator in (1) operationalizes the strategy of ‘treat greater $H$ ’ among two patients randomly selected from the population. If the two patients have the same $H$ , we randomly assign treatment to one of them. When $\tau^{*}>0$ and $h(x)$ is at least not worse than ‘treat at random,’ the $C_{b}$ value ranges from $0$ to $1$ . If $0\leq C_{b}\leq 1$ , ‘treat at random’ is associated with a $C_{b}\times 100\%$ reduction in expected benefit compared with ‘treat greater $H$ .’
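As a hedged illustration of definition (1), the minimal sketch below enumerates ordered patient pairs in a toy discrete population. The toy population and the helper name `cb_by_pairs` are hypothetical and not taken from this study; each group is summarized by its probability, its mean benefit, and its predicted benefit $H$ .

```python
def cb_by_pairs(groups):
    """C_b from definition (1), enumerating ordered pairs of groups.

    groups: list of (probability, mean_benefit, predicted_benefit) tuples
    (a hypothetical, illustrative representation of a discrete population).
    """
    # Numerator of the ratio: expected benefit under 'treat at random', E[B].
    mean_benefit = sum(p * b for p, b, _ in groups)
    # Denominator: expected benefit under 'treat greater H' for two independent
    # draws. Ties go to the first patient, which matches a random pick because
    # the two draws are exchangeable.
    treat_greater = 0.0
    for p1, b1, h1 in groups:
        for p2, b2, h2 in groups:
            treated_benefit = b1 if h1 >= h2 else b2
            treat_greater += p1 * p2 * treated_benefit
    return 1.0 - mean_benefit / treat_greater


# Two equally likely groups with mean benefits 0.1 and 0.5, ranked correctly by H.
toy = [(0.5, 0.1, 0.1), (0.5, 0.5, 0.5)]
print(round(cb_by_pairs(toy), 4))
```

In this toy population, ‘treat at random’ yields an expected benefit of $0.3$ and ‘treat greater $H$ ’ yields $0.4$ , so the sketch returns $C_{b}=0.25$ .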

The $C_{b}$ connects to a Gini-like coefficient determined by twice the area between the line of independence and the relative concentration curve (RCC) of $B$ with respect to $H$ . The RCC orders patients by $H$ and plots the cumulative $B$ value divided by $\tau^{*}$ (Yitzhaki and Olkin, 1991). With the Gini-like coefficient denoted as $\text{Gini}_{b}$ , $C_{b}$ can be alternately defined as:

$$
C_{b}=\frac{\text{Gini}_{b}}{1+\text{Gini}_{b}}.
$$

To avoid reasoning about patient pairs when computing the expectation in (1), we establish that

$$
\operatorname{E}[B_{1}I(H_{1}\geq H_{2})+B_{2}I(H_{1}<H_{2})]=2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[Bf_{H}(H)],
\tag{2}
$$

where $f_{H}(h):=\operatorname{P}(H=h)$ , denoting the probability of $H$ taking the specific value $h$ . Thus, when $H$ is continuous, $\operatorname{E}[Bf_{H}(H)]=0$ . Although $H$ is continuous in most applications, it is helpful to derive the general expression to create simple, illustrative examples where $H$ is discrete. As the original publication on $C_{b}$ did not discuss estimation in the presence of ties, we elaborate on this point in Appendix B. Note that (2) enables us to concentrate on $B$ and $F_{H}$ (and $f_{H}$ if needed) for computing $C_{b}$ , where $H$ acts as a ranking variable. Consequently, $C_{b}$ provides insights into the effectiveness of $h(\cdot)$ in mimicking the CATE function through its ranking ability.
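A short sketch of why (2) holds, using only the independence of the two copies and the definitions of $F_{H}$ and $f_{H}$ given above (the intermediate conditioning steps are our own elaboration):

$$
\begin{aligned}
\operatorname{E}[B_{1}I(H_{1}\geq H_{2})]
&=\operatorname{E}\{B_{1}\operatorname{P}(H_{2}\leq H_{1}\mid B_{1},H_{1})\}
=\operatorname{E}[BF_{H}(H)],\\
\operatorname{E}[B_{2}I(H_{1}<H_{2})]
&=\operatorname{E}\{B_{2}\operatorname{P}(H_{1}<H_{2}\mid B_{2},H_{2})\}
=\operatorname{E}[B\{F_{H}(H)-f_{H}(H)\}].
\end{aligned}
$$

Adding the two lines gives $2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[Bf_{H}(H)]$ . The term involving $f_{H}$ arises only from the strict inequality in the second indicator, which is why it vanishes when $H$ has no ties.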

4 Evaluating TBP Performance in Presence of Confounding

When using observational data to evaluate a pre-specified TBP $h(x)$ , we consider both $X$ and $Z$ to adjust for confounding, even though $h(X)$ is solely a function of $X$ . To evaluate the moderate calibration of any $h(X)$ in the target population, we need to determine $\operatorname{E}[B\mid H]$ (i.e., the calibration curve). We address the confounding variables $\{X,Z\}$ by initially focussing on $\tau(X,Z)$ instead of $\tau_{s}(X)$ . Afterward, $\operatorname{E}[B\mid H]$ can be obtained by taking the average of $\tau(X,Z)$ conditional on $H$ :

$$
\operatorname{E}[B\mid H]=\operatorname{E}[\tau(X,Z)\mid H].
\tag{3}
$$

Similarly, to compute $C_{b}$ of $h(x)$ , we determine $\tau^{*}$ and $\operatorname{E}[BF_{H}(H)]$ (and $\operatorname{E}[Bf_{H}(H)]$ if needed) in (2) by assessing $\tau(X,Z)$ as well. Consequently, $C_{b}$ can be expressed as

$$
C_{b}=1-\frac{\operatorname{E}[\tau(X,Z)]}{\operatorname{E}[\tau(X,Z)\eta(H)]},
\tag{4}
$$

where $\eta(H)=2F_{H}(H)-f_{H}(H)$ .

Note that $\tau(X,Z)$ plays a vital role in the determination of both $\operatorname{E}[B\mid H]$ and $C_{b}$ . Various approaches are available to determine $\tau(X,Z)$ , with two main methods being outcome regressions and inverse probability weighting methods (Rosenbaum and Rubin, 1983). For instance, with the outcome models $\mu_{a}(x,z)=\operatorname{E}[Y\mid A=a,X=x,Z=z]$ , we have $\tau(x,z)=\mu_{1}(x,z)-\mu_{0}(x,z)$ . With the propensity score $e(x,z)=\operatorname{P}(A=1\mid X=x,Z=z)$ , we have $\tau(x,z)=\operatorname{E}\left[\frac{Y(A-e(x,z))}{e(x,z)(1-e(x,z))}\mid X=x,Z=z\right]$ . These two approaches are equivalent in the population-level framework as long as the overlap assumption holds. The overlap assumption says that the conditional probability of receiving the active treatment or not is bounded away from $0$ and $1$ , i.e., $0<e(x,z)<1$ , for all possible $x$ and $z$ . However, variations may emerge when considering specific finite-sample estimating techniques associated with each. We will return to the finite-sample estimation of the calibration curve and $C_{b}$ in Section 7.
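The population-level equivalence of the two identification formulas can be checked within a single covariate cell. The sketch below is illustrative only: the function names and the numerical inputs are hypothetical, and `mu1`, `mu0`, and `e` stand for $\operatorname{E}[Y\mid A=1,X=x,Z=z]$ , $\operatorname{E}[Y\mid A=0,X=x,Z=z]$ , and $e(x,z)$ at one fixed $(x,z)$ .

```python
def tau_outcome_regression(mu1, mu0):
    """tau(x, z) as the difference of the two conditional outcome means."""
    return mu1 - mu0


def tau_ipw(mu1, mu0, e):
    """E[Y (A - e) / (e (1 - e)) | X = x, Z = z], enumerating A in {0, 1}."""
    if not 0.0 < e < 1.0:
        raise ValueError("overlap requires 0 < e(x, z) < 1")
    weight_treated = (1.0 - e) / (e * (1.0 - e))   # (A - e)/(e(1 - e)) at A = 1
    weight_control = (0.0 - e) / (e * (1.0 - e))   # (A - e)/(e(1 - e)) at A = 0
    # P(A = 1) = e and P(A = 0) = 1 - e within the cell.
    return e * mu1 * weight_treated + (1.0 - e) * mu0 * weight_control


# Hypothetical values for one (x, z) cell.
print(tau_outcome_regression(0.6, 0.4), tau_ipw(0.6, 0.4, 0.3))
```

Both functions return the same contrast up to floating-point error for any `e` strictly between 0 and 1; the weighting form fails when overlap is violated, which the guard makes explicit.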

If we treat the observational data as if they arose from an RCT, or if we do not sufficiently control for confounding, confounding bias may emerge. Thus, it is essential to investigate the potential confounding biases and grasp how lack of full control might affect the accuracy of our evaluations. In this study, we focus on the confounding bias that occurs when $X$ alone is not sufficient to control for confounding and denote the confounding bias as a function of $X$ . For $X=x$ , we have

$$
\text{bias}(x)=\left(\operatorname{E}[Y\mid A=1,X=x]-\operatorname{E}[Y\mid A=0,X=x]\right)-\tau_{s}(x).
\tag{5}
$$

To illustrate the propagation of $\text{bias}(X)$ to performance metrics, we denote the inaccurate $\operatorname{E}[B\mid H]$ and $C_{b}$ calculated without controlling for $Z$ as $\tilde{\operatorname{E}}[B\mid H]$ and $\tilde{C}_{b}$ , respectively.

The bias function of the calibration curve and the bias of $C_{b}$ are influenced by, and could be different from, $\text{bias}(X)$ in (5). For the calibration curve, the deviation from the accurate assessment can be expressed as

$$
\tilde{\operatorname{E}}[B\mid H=h]-\operatorname{E}[B\mid H=h]=\operatorname{E}[\text{bias}(X)\mid H=h],
\tag{6}
$$

which is a function of $H$ . It depends on $\text{bias}(X)$ and the association between $H$ and $X$ .
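The step behind (6) can be sketched as follows, assuming the unadjusted curve is formed by averaging the contrast $\operatorname{E}[Y\mid A=1,X]-\operatorname{E}[Y\mid A=0,X]$ conditional on $H$ (our reading of “calculated without controlling for $Z$ ”):

$$
\begin{aligned}
\tilde{\operatorname{E}}[B\mid H=h]
&=\operatorname{E}\{\operatorname{E}[Y\mid A=1,X]-\operatorname{E}[Y\mid A=0,X]\mid H=h\}\\
&=\operatorname{E}[\tau_{s}(X)+\text{bias}(X)\mid H=h]\\
&=\operatorname{E}[B\mid H=h]+\operatorname{E}[\text{bias}(X)\mid H=h].
\end{aligned}
$$

The second line uses definition (5), and the third uses $\operatorname{E}[\tau_{s}(X)\mid H]=\operatorname{E}[B\mid H]$ , which holds because $H=h(X)$ is a function of $X$ .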

For $C_{b}$ , the confounding bias affects the calculation of both $\tau^{*}$ and $\operatorname{E}[B\eta(H)]$ . The discrepancy from the actual $\tau^{*}$ is $\operatorname{E}[\text{bias}(X)]$ , while the deviation from $\operatorname{E}[B\eta(H)]$ is $\operatorname{E}[\text{bias}(X)\eta(H)]$ . However, expressing the deviation from the true $C_{b}$ is complex as it involves the difference between two ratios. The deviation is in the form of:

$$
\tilde{C}_{b}-C_{b}=\frac{\tau^{*}\operatorname{E}[\text{bias}(X)\eta(H)]-\operatorname{E}[B\eta(H)]\operatorname{E}[\text{bias}(X)]}{\operatorname{E}[B\eta(H)](\operatorname{E}[B\eta(H)]+\operatorname{E}[\text{bias}(X)\eta(H)])}.
\tag{7}
$$

This value not only depends on $\text{bias}(X)$ but also on $\tau^{*}$ , $\eta(H)$ , and $\operatorname{E}[B\eta(H)]$ .
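The algebra behind (7) is short. Writing, as shorthand introduced only for this sketch, $D=\operatorname{E}[B\eta(H)]$ , $\delta=\operatorname{E}[\text{bias}(X)]$ , and $\delta_{\eta}=\operatorname{E}[\text{bias}(X)\eta(H)]$ , the unadjusted index is $\tilde{C}_{b}=1-(\tau^{*}+\delta)/(D+\delta_{\eta})$ , so that

$$
\tilde{C}_{b}-C_{b}
=\frac{\tau^{*}}{D}-\frac{\tau^{*}+\delta}{D+\delta_{\eta}}
=\frac{\tau^{*}(D+\delta_{\eta})-D(\tau^{*}+\delta)}{D(D+\delta_{\eta})}
=\frac{\tau^{*}\delta_{\eta}-D\delta}{D(D+\delta_{\eta})}.
$$

The sign of the deviation is therefore governed by the numerator $\tau^{*}\delta_{\eta}-D\delta$ whenever the denominator is positive.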

According to the biases (6) and (7), the deviations in moderate calibration and $C_{b}$ may yield zero value(s) even with non-zero $\text{bias}(X)$ . In particular, zero deviations in moderate calibration would occur when $\operatorname{E}[\text{bias}(X)\mid H=h]=0$ for all $h$ , and zero deviation in $C_{b}$ would occur when $\tau^{*}\operatorname{E}[\text{bias}(X)\eta(H)]=\operatorname{E}[\text{bias}(X)]\operatorname{E}[B\eta(H)]$ . In the next section, we further investigate these biases in several illustrative examples to demonstrate how $\text{bias}(X)$ influences the evaluation results.

5 Examples Relevant to Confounding Bias in Evaluation

In this section, we establish two synthetic populations to illustrate the impact of confounding bias on evaluating given TBPs in the population-level framework. The first population describes a linear $\tau(x,z)$ function with binary outcome and covariates, which enables exploring to what extent the strength of confounding affects the bias of both metrics. The second population has a non-linear $\tau(x,z)$ function with continuous outcome and covariates. This offers flexibility in defining $\tau_{s}(x)$ , $\tau(x,z)$ and propensity score function $e(x,z)$ . Unlike prior confounding bias studies, we investigate the propagation of confounding bias to $\operatorname{E}[B\mid H]$ and $C_{b}$ for both populations, where both are determined in closed-form.

5.1 Population 1. Binary Outcome and Covariates

Assume the dimensions of two sets of covariates are $d=2$ and $p=1$ , and all $Y,A,X_{1},X_{2}$ and $Z$ are binary. In this binary outcome setup, the individual treatment benefit $B\in\{-1,0,1\}$ . We assume $Y^{(0)}\perp\!\!\!\perp Y^{(1)}\mid X,Z$ and the distribution $\mathbb{P}$ is in the form of

$$
\begin{aligned}
(Y^{(0)}\mid X,Z;\alpha_{0})&\sim\text{Bernoulli}(\alpha_{00}+\alpha_{01}X_{1}+\alpha_{02}X_{2}+\alpha_{03}Z),\\
(Y^{(1)}\mid X,Z;\alpha_{1})&\sim\text{Bernoulli}(\alpha_{10}+\alpha_{11}X_{1}+\alpha_{12}X_{2}+\alpha_{13}Z),\\
(A\mid X,Z;\beta)&\sim\text{Bernoulli}(\beta_{0}+\beta_{1}Z),\\
(X,Z\mid p)&\sim\text{Multivariate Bernoulli}\big(p_{000}^{(1-x_{1})(1-x_{2})(1-z)}+p_{001}^{(1-x_{1})(1-x_{2})z}+p_{010}^{(1-x_{1})x_{2}(1-z)}+p_{100}^{x_{1}(1-x_{2})(1-z)}+\\
&\qquad p_{011}^{(1-x_{1})x_{2}z}+p_{101}^{x_{1}(1-x_{2})z}+p_{110}^{x_{1}x_{2}(1-z)}+p_{111}^{x_{1}x_{2}z}\big),
\end{aligned}
$$

where $\mu_{a}(X,Z)$ is a linear combination of $A$ , $X$ , and $Z$ . Distribution $\mathbb{P}$ leads to a linear $\tau(x,z)$ , which is

$$
\tau(X,Z)=(\alpha_{10}-\alpha_{00})+(\alpha_{11}-\alpha_{01})X_{1}+(\alpha_{12}-\alpha_{02})X_{2}+(\alpha_{13}-\alpha_{03})Z,
$$

and $\tau_{s}(X)=\operatorname{E}[\tau(X,Z)\mid X]$ .

There are $18$ parameters to capture the relationship between outcome, treatment, and covariates, with constraints imposed on all $\alpha$ , $\beta$ , and $p$ to ensure legitimate distributions. The linear propensity score function is $e(X,Z)=\beta_{0}+\beta_{1}Z$ . The strength of confounding is determined by the values of $\beta_{1},\alpha_{03}$ and $\alpha_{13}$ .

We formulate three TBPs: $h_{1}(x_{1},x_{2})$ is the mean of covariates, $h_{2}(x_{1},x_{2})$ is designed to be moderately calibrated, and $h_{3}(x_{1},x_{2})$ is designed to be strongly calibrated, by carefully choosing coefficients. The expressions for these three TBPs are as follows:

$$
\begin{aligned}
h_{1}(x_{1},x_{2})&=\frac{x_{1}+x_{2}}{2},\\
h_{2}(x_{1},x_{2})&=b_{0}+b_{1}(x_{1}+x_{2})+b_{2}x_{1}x_{2},\\
h_{3}(x_{1},x_{2})&=c_{0}+c_{1}x_{1}+c_{2}x_{2}+c_{3}x_{1}x_{2},
\end{aligned}
$$

where $b_{l}$ and $c_{k}$ , $l=0,1,2$ and $k=0,1,2,3$ , are coefficients. Moreover, $h_{3}(x_{1},x_{2}):=\tau_{s}(x)$ , which is a bijective function, uniquely mapping the four distinct values of $(x_{1},x_{2})$ to four unique prediction values, exhibiting strong calibration, and thus surpassing all potential TBPs. (See Appendix A for the detailed definitions of the coefficients.)

5.2 Population 2. Continuous Outcome and Covariates

We still assume $d=2$ and $p=1$ , but let $X_{1},X_{2},Z$ be independent, with each following the uniform distribution on the interval $[0,1]$ . Adopting the setup proposed by Foster and Syrgkanis (2023),

$$
\begin{aligned}
(A\mid X,Z)&\sim\text{Bernoulli}(e(X,Z)),\\
(Y^{(a)}\mid X,Z)&\sim\text{N}\left(\tau_{s}(X)\left(a-0.5\right)+b(X,Z),\sigma^{2}\right),
\end{aligned}
$$

where conditional independence is assumed, i.e., $Y^{(0)}\perp\!\!\!\perp Y^{(1)}\mid X,Z$ . Note that both $X$ and $Z$ contribute to explaining the outcome and the treatment assignment. Let $\sigma=0.1$ and consider simple functions: the propensity score function $e(X,Z)=Z$ and the base response function $b(X,Z)=\max\left(Z,X_{2}\right)+0.1X_{1}$ . As one illustrative choice, we define the CATE function and the TBP as:

$$
\begin{aligned}
\tau_{s}(X)&=\max\left(X_{1},X_{2}\right),\\
h(X)&=X_{1}+X_{2}.
\end{aligned}
$$

The predicted treatment benefit $H$ is the sum of two i.i.d. uniform random variables on $[0,1]$ , which follows a triangular distribution with lower limit $0$ , upper limit $2$ , and mode $1$ .

The variable $Z$ is independent of $B$ conditioning on $X$ (i.e., $\tau(X,Z)=\tau_{s}(X)$ ). Therefore, $Z$ is not an effect modifier but a confounding variable. Note that $\tau_{s}(X)$ , the maximum of these two uniform random variables, follows $\text{Beta}(2,1)$ . Hence, the population average treatment benefit is $\tau^{*}=2/3$ .

6 Metrics Performance in Two Synthetic Populations

Figure 1: Calibration plots and relative concentration curves (RCC) for $h_{1}(X_{1},X_{2})$ , $h_{2}(X_{1},X_{2})$ , and $h_{3}(X_{1},X_{2})$ when $\beta_{1}=0.7621$ . The three plots on the left-hand side demonstrate calibration plots. The three plots on the right-hand side are RCCs. The blue dotted curves refer to the $\operatorname{E}[B\mid H]$ and $C_{b}$ , and the red dashed curves refer to $\tilde{\operatorname{E}}[B\mid H]$ and $\tilde{C}_{b}$ .

For the first synthetic population, we employed a specific set of parameters to compare evaluation results for TBPs with and without controlling for $Z$ . This selection serves as just one instance among numerous potential examples:

$$
\begin{aligned}
\alpha_{0}&=(0.629,0.143,-0.479,-0.058),\\
\alpha_{1}&=(0.335,0.304,-0.334,0.314),\\
p&=(p_{111},p_{110},p_{101},p_{100},p_{011},p_{010},p_{001},p_{000})\\
&=(0.181,0.100,0.035,0.148,0.174,0.087,0.121,0.153),\\
\beta&=(0.120,\beta_{1}),
\end{aligned}
$$

where values in $p$ were randomly generated but represent a valid joint distribution of $(X_{1},X_{2},Z)$ . Upon establishing these parameters, we define the target population and consequently determine $\text{bias}(X)$ , which is greater than $0$ for all $X=x$ and a non-linear function of $X$ . In this setup, the value of $\text{bias}(X)$ is influenced not only by the three parameters that determine confounding strength but also by other additional parameters; see Appendix A for further discussion of $\text{bias}(X)$ under various “strength of confounding.” We then evaluate the three pre-specified TBPs using the calibration curve and $C_{b}$ with and without confounding bias. Note that $h_{1}(x_{1},x_{2})$ and $h_{2}(x_{1},x_{2})$ exhibit analogous mapping patterns: each maps four unique combinations of $(x_{1},x_{2})$ to three distinct $H$ values. We compute $\operatorname{E}[B\mid H]$ and $C_{b}$ through closed-form expressions and calculate $\tilde{\operatorname{E}}[B\mid H]$ and $\tilde{C}_{b}$ either via (6) and (7), or by using the inaccurate $\tau_{s}(X)$ instead of $\tau(X,Z)$ in (3) and (4).

When $\beta_{1}=0.762$ , the evaluation results are illustrated in Figure 1, where the three plots on the left display the moderate calibration curves of $h_{1}(X_{1},X_{2})$ , $h_{2}(X_{1},X_{2})$ , and $h_{3}(X_{1},X_{2})$ . We see that $h_{2}$ and $h_{3}$ are moderately calibrated, aligning closely with the 45-degree line. Additionally, while $h_{1}$ lacks moderate calibration, its predictions are positively associated with $\operatorname{E}[B\mid H]$ . Figure 1 further highlights distinct disparities between $\operatorname{E}[B\mid H]$ and $\tilde{\operatorname{E}}[B\mid H]$ , particularly noting that $\operatorname{E}[\text{bias}(X)\mid H]>0$ for all three TBPs due to $\text{bias}(X)>0$ . Consequently, the failure to control for confounding variables results in an inaccurate calibration assessment.

The three plots on the right in Figure 1 show the RCCs and $C_{b}$ values for the three TBPs. The RCCs and $C_{b}$ for $h_{1}(X_{1},X_{2})$ and $h_{2}(X_{1},X_{2})$ are identical. It is because the CDFs of $H_{1}$ and $H_{2}$ are the same, resulting in the two TBPs identically ranking patients. Note that the optimal TBP $h_{3}(X_{1},X_{2})$ yields a slightly larger $C_{b}$ and $\text{Gini}_{b}$ , compared to $h_{1}(X_{1},X_{2})$ and $h_{2}(X_{1},X_{2})$ . It reflects that $h_{3}(X_{1},X_{2})$ yields a more effective treatment assignment rule, leading to a larger average treatment benefit. However, $\tilde{C}_{b}$ is smaller than $C_{b}$ for all three TBPs.

When $\beta_{1}=0$ , all blue dotted curves align with the red dashed curves for both performance metrics because $Z$ is no longer a confounding variable. Particularly in the calibration plots for $h_{2}(X_{1},X_{2})$ and $h_{3}(X_{1},X_{2})$ , all curves lie on the 45-degree diagonal line. (See Figure 5 in Appendix A for the corresponding calibration plots, $C_{b}$ , and RCCs.) The findings highlight the importance of controlling confounding variables when conducting evaluations in observational studies to obtain accurate results. Ignoring confounding variables can produce misleading patterns with different extents depending on the strength and direction of associations.
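The closed-form calculations for the first population can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the reported $p$ values (which sum to $0.999$ after rounding) are renormalized, and only $h_{1}$ and $h_{3}=\tau_{s}$ are evaluated because the coefficients of $h_{2}$ are defined in Appendix A.

```python
from itertools import product

# Parameters reported for Population 1 (rounded to three decimals in the text).
ALPHA0 = (0.629, 0.143, -0.479, -0.058)
ALPHA1 = (0.335, 0.304, -0.334, 0.314)
P = {(1, 1, 1): 0.181, (1, 1, 0): 0.100, (1, 0, 1): 0.035, (1, 0, 0): 0.148,
     (0, 1, 1): 0.174, (0, 1, 0): 0.087, (0, 0, 1): 0.121, (0, 0, 0): 0.153}
TOTAL = sum(P.values())
P = {k: v / TOTAL for k, v in P.items()}   # assumption: renormalize rounded p


def mu(alpha, x1, x2, z):
    """Linear conditional mean of Y^(a) given (x1, x2, z)."""
    return alpha[0] + alpha[1] * x1 + alpha[2] * x2 + alpha[3] * z


def contrasts(beta0, beta1):
    """Return {x: (p(x), tau_s(x), unadjusted contrast)} for the four x cells."""
    out = {}
    for x1, x2 in product((0, 1), repeat=2):
        px = P[(x1, x2, 0)] + P[(x1, x2, 1)]
        tau_s = sum(P[(x1, x2, z)] * (mu(ALPHA1, x1, x2, z) - mu(ALPHA0, x1, x2, z))
                    for z in (0, 1)) / px
        # E[Y | A = a, X = x]: average mu_a over Z given (A = a, X = x).
        w1 = {z: P[(x1, x2, z)] * (beta0 + beta1 * z) for z in (0, 1)}
        w0 = {z: P[(x1, x2, z)] * (1 - beta0 - beta1 * z) for z in (0, 1)}
        ey1 = sum(w1[z] * mu(ALPHA1, x1, x2, z) for z in (0, 1)) / sum(w1.values())
        ey0 = sum(w0[z] * mu(ALPHA0, x1, x2, z) for z in (0, 1)) / sum(w0.values())
        out[(x1, x2)] = (px, tau_s, ey1 - ey0)
    return out


def evaluate(h, cells, column):
    """Calibration curve and C_b; column 1 = adjusted, column 2 = unadjusted."""
    prob, total = {}, {}
    for x, cell in cells.items():
        prob[h(*x)] = prob.get(h(*x), 0.0) + cell[0]
        total[h(*x)] = total.get(h(*x), 0.0) + cell[0] * cell[column]
    curve = {v: total[v] / prob[v] for v in sorted(prob)}
    cum, eta = 0.0, {}
    for v in sorted(prob):
        cum += prob[v]
        eta[v] = 2 * cum - prob[v]          # eta(h) = 2 F_H(h) - f_H(h)
    mean_b = sum(prob[v] * curve[v] for v in prob)
    mean_b_eta = sum(prob[v] * curve[v] * eta[v] for v in prob)
    return curve, 1 - mean_b / mean_b_eta


cells = contrasts(0.120, 0.762)
h1 = lambda x1, x2: (x1 + x2) / 2
h3 = lambda x1, x2: cells[(x1, x2)][1]      # tau_s(x) itself
for name, h in (("h1", h1), ("h3", h3)):
    _, cb = evaluate(h, cells, 1)
    _, cb_tilde = evaluate(h, cells, 2)
    print(name, round(cb, 3), round(cb_tilde, 3))
```

Calling `contrasts(0.120, 0.0)` instead removes the dependence of treatment on $Z$ , in which case the adjusted and unadjusted contrasts coincide in every cell.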

In the second synthetic population, with the actual $\tau_{s}(X)$ , we compute the calibration curve $\operatorname{E}[B\mid H]$ and $C_{b}$ by initially deriving the closed-form joint distribution of $(\tau_{s}(X),H)$ and then analyzing the corresponding closed-form conditional distribution of $(\tau_{s}(X)\mid H)$ . Similarly, we compare $\operatorname{E}[B\mid H]$ and $C_{b}$ with $\tilde{\operatorname{E}}[B\mid H]$ and $\tilde{C}_{b}$ . (See Appendix C for calculation details.) In this setup, $\text{bias}(X)$ is a function of $X_{2}$ : $\text{bias}(x)=\frac{1}{3}-x_{2}^{2}+\frac{2}{3}x_{2}^{3}$ , which is illustrated in Figure 2.
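The stated form of $\text{bias}(x)$ can be checked numerically from the population definition. The sketch below is illustrative; the function names and the midpoint-rule integration are our own choices. It uses the fact that, given $X=x$ , the density of $Z$ is proportional to $e(x,z)=z$ among the treated and to $1-z$ among the untreated, and that the $\tau_{s}(x)$ terms cancel in (5), leaving only the contrast in the base response $b(x,z)$ .

```python
def naive_bias(x2, n=20000):
    """bias(x) for Population 2 via a midpoint rule over z in [0, 1]."""
    step = 1.0 / n
    num1 = den1 = num0 = den0 = 0.0
    for i in range(n):
        z = (i + 0.5) * step
        base = max(z, x2)          # the 0.1 * x1 term of b(x, z) cancels
        num1 += z * base           # weight e(x, z) = z among the treated
        den1 += z
        num0 += (1.0 - z) * base   # weight 1 - e(x, z) among the untreated
        den0 += 1.0 - z
    return num1 / den1 - num0 / den0


def closed_form_bias(x2):
    """The expression for bias(x) stated in the text."""
    return 1.0 / 3.0 - x2 ** 2 + (2.0 / 3.0) * x2 ** 3


for x2 in (0.0, 0.5, 1.0):
    print(x2, round(naive_bias(x2), 6), round(closed_form_bias(x2), 6))
```

The bias is largest at $x_{2}=0$ , where it equals $1/3$ , and vanishes at $x_{2}=1$ .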

The average treatment benefit conditioned on the predicted treatment benefit from $h(X_{1},X_{2})$ is $\operatorname{E}[B\mid H=h]=3h/4$ for $0\leq h\leq 1$ and $\operatorname{E}[B\mid H=h]=1/2+h/4$ for $1<h\leq 2$ , since $\tau_{s}(X)=\max(X_{1},X_{2})$ cannot exceed $1$ . Using the second treatment assignment rule based on $H$ , the average treatment benefit is $2\operatorname{E}[BF_{H}(H)]=0.7833$ , slightly exceeding the population average $\tau^{*}=2/3$ . The $C_{b}$ for $h(X)$ is $0.1489$ . When assigning treatment based on the predicted treatment benefit from $\tau_{s}(X)$ , the average treatment benefit from the second treatment assignment rule is $2\operatorname{E}[BF_{\tau_{s}(X)}(\tau_{s}(X))]=0.8$ . In other words, the second treatment assignment rule based on the actual $\operatorname{E}[B\mid X]$ yields only a small improvement in average treatment benefit over the rule based on $H$ ( $0.8$ versus $0.7833$ ). The corresponding $C_{b}$ for $\tau_{s}(X)$ is $0.1667$ .

However, confounding bias causes a deviation from the actual $\operatorname{E}[B\mid H]$ , given by the following piecewise function, obtained by averaging $\text{bias}(X)$ over $X_{2}$ conditional on $H=h$ :

$$
\tilde{\operatorname{E}}[B\mid H=h]-\operatorname{E}[B\mid H=h]=\begin{cases}(2-2h^{2}+h^{3})/6,&0<h\leq 1,\\ h(2-h)^{2}/6,&1<h<2.\end{cases}
$$

It also causes an overestimation of $\tau^{*}$ by $\operatorname{E}[\text{bias}(X_{2})]=1/6$ and an inaccurate $2\operatorname{E}[BF_{H}(H)]$ calculated as $0.9032$ . Consequently, for $h(X)$ , the $\tilde{C}_{b}$ of $0.0773$ is lower than the $C_{b}$ of $0.1489$ .
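The reported quantities for the second population can be reproduced approximately by numerical integration over $(x_{1},x_{2})$ . The sketch below is a hedged illustration, not the closed-form derivation of Appendix C: the function names and the midpoint rule are our own choices, and because $H$ is continuous here, $\eta(H)=2F_{H}(H)$ .

```python
def triangular_cdf(h):
    """CDF of H = X1 + X2 for independent Uniform(0, 1) covariates."""
    return 0.5 * h * h if h <= 1.0 else 1.0 - 0.5 * (2.0 - h) ** 2


def population2_summary(n=400):
    """Midpoint-rule approximations of the Population 2 summaries."""
    step = 1.0 / n
    weight = step * step
    tau_star = b_f = bias_mean = bias_f = 0.0
    for i in range(n):
        x1 = (i + 0.5) * step
        for j in range(n):
            x2 = (j + 0.5) * step
            tau = max(x1, x2)                                   # tau_s(x)
            bias = 1.0 / 3.0 - x2 ** 2 + (2.0 / 3.0) * x2 ** 3  # bias(x)
            cdf = triangular_cdf(x1 + x2)                       # F_H(h(x))
            tau_star += weight * tau
            b_f += weight * tau * cdf
            bias_mean += weight * bias
            bias_f += weight * bias * cdf
    cb = 1.0 - tau_star / (2.0 * b_f)
    cb_tilde = 1.0 - (tau_star + bias_mean) / (2.0 * (b_f + bias_f))
    return {
        "tau_star": tau_star,
        "treat_greater_H": 2.0 * b_f,
        "C_b": cb,
        "mean_bias": bias_mean,
        "treat_greater_H_unadjusted": 2.0 * (b_f + bias_f),
        "C_b_unadjusted": cb_tilde,
    }


for key, value in population2_summary().items():
    print(key, round(value, 4))
```

The printed values can be compared with those reported in the text ( $\tau^{*}=2/3$ , $0.7833$ , $0.1489$ , $1/6$ , $0.9032$ , and $0.0773$ ); small discrepancies in the last digit would reflect the integration grid rather than the framework.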

Figure 3 illustrates the calibration plot and RCC for $h(X_{1},X_{2})$ with and without controlling for $Z$ . It shows the distinct influence of confounding bias on moderate calibration, RCC, and $C_{b}$ assessments. In the calibration plot, the red curve deviates most from the blue curve as $H$ approaches $0$ , where $\text{bias}(X)$ approaches its maximum of $1/3$ , and the two curves converge as $H$ approaches $2$ , where $\text{bias}(X)$ vanishes. Moreover, confounding bias reduces the area between the independence line and the RCC by roughly half.

7 Discussion

In clinical settings, TBPs derived from prior studies offer valuable guidance for physicians and patients in making informed treatment decisions. These TBPs, which may have been developed in various populations, should be evaluated in the target population before implementation (Riley et al., 2024). Observational data, where treatment assignment is not at random, might be the only opportunity for such evaluation. Consequently, addressing confounding bias is crucial when assessing treatment benefits using observational data.

This study evaluated pre-specified TBPs using observational studies and explored how confounding bias influences the evaluation of TBPs in a population-level framework. We delved into two specific metrics, one focusing on calibration and the other on discrimination, and we derived bias expressions for the calibration curve and $C_{b}$ . We demonstrated that the failure to control for confounding variables leads to inaccurate assessments of moderate calibration and $C_{b}$ . The impact of confounding bias on the assessment of moderate calibration and $C_{b}$ differs. The two synthetic populations lead to two positive $\text{bias}(X)$ functions, which are two examples of many possible $\text{bias}(X)$ functions. These two $\text{bias}(X)$ functions result in $\tilde{\operatorname{E}}[B\mid H]>\operatorname{E}[B\mid H]$ ; nevertheless, $\tilde{C}_{b}<C_{b}$ . In other words, positive confounding bias may lead to overestimation of $\tau^{*}$ , $\tau_{s}(X)$ and $\operatorname{E}[B\mid H]$ but underestimation of $C_{b}$ , at least for these choices of the TBP $h(x)$ .

This study conducted a conceptual analysis, which lays the groundwork for finite-sample estimation. To evaluate pre-determined TBPs using real-world observational data, the primary challenge shifts to estimating $\tau(X,Z)$ and then $\tau_{s}(X)$ from the sample. As previously discussed, $\tau(X,Z)$ can be estimated through outcome regression, inverse probability weighting, or a combination of both, such as the doubly robust method (Bang and Robins, 2005). Estimating each performance metric might have its challenges. For instance, for the calibration curve, we need to estimate the conditional expectation of estimated $\tau(X,Z)$ given $H$ . When $H$ is discrete, the estimation can rely on the sample average within groups sharing the same $H$ . However, the estimation for continuous $H$ is non-trivial. For $C_{b}$ , estimating the CDF of $H$ and possibly its $f_{H}(h)$ can be achieved through either the empirical distribution or modelling $H$ .
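For discrete $H$ , the plug-in steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended estimator: the function names are hypothetical, `tau_hat` is assumed to hold already-computed estimates of $\tau(X,Z)$ for each sampled individual (from any of the methods mentioned), and `h` holds the corresponding predictions.

```python
from collections import defaultdict


def calibration_by_group(tau_hat, h):
    """Sample average of the estimated tau(X, Z) within each distinct H value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for t, v in zip(tau_hat, h):
        sums[v] += t
        counts[v] += 1
    return {v: sums[v] / counts[v] for v in sorted(sums)}


def empirical_cb(tau_hat, h):
    """Plug-in C_b from (4), with empirical F_H and f_H so that ties are handled."""
    n = len(h)
    counts = defaultdict(int)
    for v in h:
        counts[v] += 1
    cumulative, eta = 0, {}
    for v in sorted(counts):
        cumulative += counts[v]
        # eta(h) = 2 F_H(h) - f_H(h), both taken from the empirical distribution.
        eta[v] = 2 * cumulative / n - counts[v] / n
    mean_tau = sum(tau_hat) / n
    mean_tau_eta = sum(t * eta[v] for t, v in zip(tau_hat, h)) / n
    return 1 - mean_tau / mean_tau_eta


# Hypothetical estimates for four individuals in two prediction groups.
tau_hat = [0.0, 0.2, 0.4, 0.6]
h = [0.1, 0.1, 0.5, 0.5]
print(calibration_by_group(tau_hat, h))
print(round(empirical_cb(tau_hat, h), 4))
```

In this hypothetical sample the group averages equal the predictions, so the predictor appears moderately calibrated, and the plug-in $C_{b}$ is $0.25$ . The sketch ignores the sampling uncertainty of `tau_hat`, which a real analysis would need to account for.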

The preceding discussion has identified several areas for future research. One might wonder whether there is a performance metric for TBPs that is less sensitive to confounding bias. We examined two performance metrics; answering this question requires investigating other existing performance metrics. Additionally, the two synthetic populations assume independent counterfactual outcomes because this assumption is commonly used in applications. However, real-world target populations can be far more complex, and further research is needed to examine populations with correlated counterfactual outcomes. Moreover, our conceptual analysis provides a better understanding of the impact of confounding bias on TBP evaluation, and the proposed framework can be applied to real data sets. It is then natural to explore which of the existing CATE estimation methods is more flexible in handling a complex $\tau(X,Z)$ function and which gives a more precise TBP assessment for making treatment decisions. Ultimately, the final building blocks for finite-sample estimation are solutions to the challenges of estimating the various performance metrics.

References

Abrevaya et al., (2015) Abrevaya, J., Hsu, Y.-C., and Lieli, R. P. (2015). Estimating conditional average treatment effects. Journal of Business & Economic Statistics, 33(4):485–505.
Bang and Robins, (2005) Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.
Efthimiou et al., (2023) Efthimiou, O., Hoogland, J., Debray, T. P., Seo, M., Furukawa, T. A., Egger, M., and White, I. R. (2023). Measuring the performance of prediction models to personalize treatment choice. Statistics in medicine, 42(8):1188–1206.
Foster and Syrgkanis, (2023) Foster, D. J. and Syrgkanis, V. (2023). Orthogonal statistical learning.
Hoogland et al., (2022) Hoogland, J., Efthimiou, O., Nguyen, T.-L., and Debray, T. P. (2022). Evaluating individualized treatment effect predictions: a new perspective on discrimination and calibration assessment. arXiv preprint, (arXiv:2209.06101).
Imbens, (2003) Imbens, G. W. (2003). Sensitivity to exogeneity assumptions in program evaluation. American Economic Review, 93(2):126–132.
Kent et al., (2020) Kent, D. M., Paulus, J. K., Van Klaveren, D., D’Agostino, R., Goodman, S., Hayward, R., Ioannidis, J. P., Patrick-Lake, B., Morton, S., Pencina, M., et al. (2020). The predictive approaches to treatment effect heterogeneity (path) statement. Annals of Internal Medicine, 172:35–45.
la Roi-Teeuw et al., (2024) la Roi-Teeuw, H. M., van Royen, F. S., de Hond, A., Zahra, A., de Vries, S., Bartels, R., Carriero, A. J., van Doorn, S., Dunias, Z. S., Kant, I., et al. (2024). Don’t be misled: Three misconceptions about external validation of clinical prediction models. Journal of Clinical Epidemiology, page 111387.
Prosperi et al., (2020) Prosperi, M., Guo, Y., Sperrin, M., Koopman, J. S., Min, J. S., He, X., Rich, S., Wang, M., Buchan, I. E., and Bian, J. (2020). Causal inference and counterfactual prediction in machine learning for actionable healthcare. Nature Machine Intelligence, 2(7):369–375.
Riley et al., (2024) Riley, R. D., Archer, L., Snell, K. I., Ensor, J., Dhiman, P., Martin, G. P., Bonnett, L. J., and Collins, G. S. (2024). Evaluation of clinical prediction models (part 2): how to undertake an external validation study. bmj, 384.
Riley et al., (2019) Riley, R. D., Snell, K. I., Moons, K. G., and Debray, T. P. (2019). Fundamental statistical methods for prognosis research. In Prognosis Research in Health Care, chapter 3, pages 37–68. Oxford University Press.
Robertson et al., (2021) Robertson, S. E., Leith, A., Schmid, C. H., and Dahabreh, I. J. (2021). Assessing heterogeneity of treatment effects in observational studies. American Journal of Epidemiology, 190(6):1088–1100.
Rosenbaum and Rubin, (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
Rubin, (1980) Rubin, D. B. (1980). Randomization analysis of experimental data: The fisher randomization test comment. Journal of the American statistical association, 75(371):591–593.
Sadatsafavi et al., (2020) Sadatsafavi, M., Mansournia, M. A., and Gustafson, P. (2020). A threshold-free summary index for quantifying the capacity of covariates to yield efficient treatment rules. Statistics in Medicine, 39:1362–1373.
Steyerberg, (2019) Steyerberg, E. W. (2019). Clinical Prediction Models. Springer International Publishing.
Van Calster et al., (2019) Van Calster, B., McLernon, D. J., Van Smeden, M., Wynants, L., and Steyerberg, E. W. (2019). Calibration: the achilles heel of predictive analytics. BMC medicine, 17(1):230.
Van Calster et al., (2016) Van Calster, B., Nieboer, D., Vergouwe, Y., De Cock, B., Pencina, M. J., and Steyerberg, E. W. (2016). A calibration hierarchy for risk models was defined: from utopia to empirical data. Journal of Clinical Epidemiology, 74:167–176.
van Klaveren et al., (2019) van Klaveren, D., Balan, T. A., Steyerberg, E. W., and Kent, D. M. (2019). Models with interactions overestimated heterogeneity of treatment effects and were prone to treatment mistargeting. Journal of Clinical Epidemiology, 114:72–83.
van Klaveren et al., (2018) van Klaveren, D., Steyerberg, E. W., Serruys, P. W., and Kent, D. M. (2018). The proposed ‘concordance-statistic for benefit’ provided a useful metric when modeling heterogeneous treatment effects. Journal of Clinical Epidemiology, 94:59–68.
Veitch and Zaveri, (2020) Veitch, V. and Zaveri, A. (2020). Sense and sensitivity analysis: Simple post-hoc analysis of bias due to unobserved confounding. Advances in neural information processing systems, 33:10999–11009.
Vickers et al., (2007) Vickers, A. J., Kattan, M. W., and Sargent, D. J. (2007). Method for evaluating prediction models that apply the results of randomized trials to individual patients. Trials, 8:1–11.
Xia et al., (2023) Xia, Y., Gustafson, P., and Sadatsafavi, M. (2023). Methodological concerns about “concordance-statistic for benefit” as a measure of discrimination in predicting treatment benefit. Diagnostic and Prognostic Research, 7(1):10.
Yitzhaki and Olkin, (1991) Yitzhaki, S. and Olkin, I. (1991). Concentration indices and concentration curves. Lecture Notes-Monograph Series, pages 380–392.
Zhou and Zhu, (2021) Zhou, N. and Zhu, L. (2021). On ipw-based estimation of conditional average treatment effects. Journal of Statistical Planning and Inference, 215:1–22.

Appendix A: Extra Results in population 1

Confounding Bias Function

The first population is generated using simple linear functions, yet the confounding bias $\text{bias}(X)$, which is determined by 18 parameters (coefficients) and the covariate $X$, is complex. With $X$ fixed at a specific value $x$, we depict the bias as a function of the coefficients and observe how it changes as these coefficients vary. We fix $\alpha_{03}=-0.058$ and allow $\alpha_{13}-\alpha_{03}$ to take any value within the range $(0.0557,0.4194)$. This interval is determined by the constraint $\alpha_{13}\in(-0.0011,0.3614)$, which is required for a valid distribution.

Figure 4 shows $\text{bias}(X)$ across varying strengths of confounding, revealing a complex relationship between the bias and the three parameters $(\alpha_{03},\alpha_{13},\beta_{1})$. When $\beta_{1}=0.7621$ and $\alpha_{13}-\alpha_{03}=0.3717$, we have $\text{bias}(X_{1}=1,X_{2}=1)=0.0640$, $\text{bias}(X_{1}=1,X_{2}=0)=0.1298$, $\text{bias}(X_{1}=0,X_{2}=1)=0.0581$, and $\text{bias}(X_{1}=0,X_{2}=0)=0.1090$ (with all values rounded to four decimal places).

Selected Coefficients and Performance for TBPs

In population 1, we define a distribution $\mathbb{P}$ and a linear function $\tau(X,Z)$ with a total of 18 parameters. To design a moderately calibrated $h_{2}(x_{1},x_{2})$ and a strongly calibrated $h_{3}(x_{1},x_{2})$ , we specify

$$
\begin{aligned}
b_{0}&=(\alpha_{10}-\alpha_{00})+(\alpha_{13}-\alpha_{03})\frac{p_{001}}{p_{001}+p_{000}},\\
b_{1}&=(\alpha_{11}-\alpha_{01})\frac{p_{101}+p_{100}}{p_{101}+p_{100}+p_{011}+p_{010}}+(\alpha_{12}-\alpha_{02})\frac{p_{011}+p_{010}}{p_{101}+p_{100}+p_{011}+p_{010}}\\
&\quad+(\alpha_{13}-\alpha_{03})\left(\frac{p_{101}+p_{011}}{p_{101}+p_{100}+p_{011}+p_{010}}-\frac{p_{001}}{p_{001}+p_{000}}\right),\\
b_{2}&=(\alpha_{11}-\alpha_{01})\left(1-2\frac{p_{101}+p_{100}}{p_{101}+p_{100}+p_{011}+p_{010}}\right)\\
&\quad+(\alpha_{12}-\alpha_{02})\left(1-2\frac{p_{011}+p_{010}}{p_{101}+p_{100}+p_{011}+p_{010}}\right)\\
&\quad+(\alpha_{13}-\alpha_{03})\left(\frac{p_{111}}{p_{111}+p_{110}}-2\frac{p_{101}+p_{011}}{p_{101}+p_{100}+p_{011}+p_{010}}+\frac{p_{001}}{p_{001}+p_{000}}\right),
\end{aligned}
$$

and

$$
\begin{aligned}
c_{0}&=(\alpha_{10}-\alpha_{00})+(\alpha_{13}-\alpha_{03})\frac{p_{001}}{p_{001}+p_{000}},\\
c_{1}&=(\alpha_{11}-\alpha_{01})+(\alpha_{13}-\alpha_{03})\left(\frac{p_{101}}{p_{101}+p_{100}}-\frac{p_{001}}{p_{001}+p_{000}}\right),\\
c_{2}&=(\alpha_{12}-\alpha_{02})+(\alpha_{13}-\alpha_{03})\left(\frac{p_{011}}{p_{011}+p_{010}}-\frac{p_{001}}{p_{001}+p_{000}}\right),\\
c_{3}&=(\alpha_{13}-\alpha_{03})\left(\frac{p_{111}}{p_{111}+p_{110}}-\frac{p_{101}}{p_{101}+p_{100}}-\frac{p_{011}}{p_{011}+p_{010}}+\frac{p_{001}}{p_{001}+p_{000}}\right).
\end{aligned}
$$

When $\beta_{1}=0$ , the variable $Z$ in population 1 ceases to be a confounding variable, and $A\perp\!\!\!\perp X,Z$ . Consequently, $\text{bias}(X)=0$ , resulting in zero bias for both the $\operatorname{E}[B\mid H]$ and $C_{b}$ .

Appendix B: Propositions for $C_{b}$ Calculation

The expectation of Maximum-like (continuous). For continuous variables $B$ and $H$, let $\{(B_{1},H_{1}),(B_{2},H_{2})\}$ denote two independent copies of $(B,H)$. The expectation of Maximum-like satisfies

$$\operatorname{E}[B_{1}I(H_{1}\geq H_{2})+B_{2}I(H_{1}<H_{2})]=2\operatorname{E}[BF_{H}(H)],$$

where $F_{H}$ is the CDF of $H$.

Proof. We show that the expected value of the Maximum-like for a pair of patients is twice the expected value of $B$ weighted by the CDF of $H$ evaluated at the patient's own value of $H$. The first equality below uses the symmetry of the two copies together with $\operatorname{P}(H_{1}=H_{2})=0$ for continuous $H$.

$$
\begin{aligned}
&\operatorname{E}[B_{1}I(H_{1}\geq H_{2})+B_{2}I(H_{1}<H_{2})]\\
&=2\operatorname{E}[B_{1}I(H_{1}\geq H_{2})]=2\operatorname{E}_{H_{1},H_{2}}\big[\operatorname{E}[B_{1}I(H_{1}\geq H_{2})\mid H_{1},H_{2}]\big]\\
&=2\int^{\infty}_{-\infty}\int^{\infty}_{-\infty}\left(\int^{\infty}_{-\infty}b_{1}I(h_{1}\geq h_{2})f_{B_{1}\mid H_{1},H_{2}}(b_{1}\mid h_{1},h_{2})\,db_{1}\right)f_{H_{2}}(h_{2})\,dh_{2}\,f_{H_{1}}(h_{1})\,dh_{1}\\
&=2\int^{\infty}_{-\infty}\operatorname{E}[B_{1}\mid H_{1}=h_{1}]\left(\int^{h_{1}}_{-\infty}f_{H_{2}}(h_{2})\,dh_{2}\right)f_{H_{1}}(h_{1})\,dh_{1}\\
&=2\int^{\infty}_{-\infty}\int^{\infty}_{-\infty}b_{1}F_{H_{1}}(h_{1})f_{B_{1},H_{1}}(b_{1},h_{1})\,db_{1}\,dh_{1}\\
&=2\operatorname{E}[BF_{H}(H)].
\end{aligned}
$$
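A quick Monte Carlo sanity check of this identity can be sketched as follows. The toy model is an assumption made purely for illustration and is not one of the paper's populations: $H\sim\text{Uniform}(0,1)$, so that $F_{H}(h)=h$, and $B=H^{2}+\varepsilon$ with mean-zero Gaussian noise. Under this model both sides equal $1/2$ in expectation.

```python
import random

random.seed(0)
n = 100_000


def draw():
    """One patient from the toy model: H ~ Uniform(0,1), B = H^2 + noise."""
    h = random.random()
    return h * h + random.gauss(0.0, 0.1), h


# Left-hand side: average Maximum-like over independent pairs of patients.
lhs = 0.0
for _ in range(n):
    (b1, h1), (b2, h2) = draw(), draw()
    lhs += b1 if h1 >= h2 else b2
lhs /= n

# Right-hand side: 2 E[B F_H(H)], with F_H(h) = h for a Uniform(0,1) predictor.
rhs = 0.0
for _ in range(n):
    b, h = draw()
    rhs += 2.0 * b * h
rhs /= n
```

The two averages agree up to simulation error, which shrinks at the usual $1/\sqrt{n}$ rate.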

The Gini-like index (continuous). For continuous variables $B$ and $H$, the Gini-like index, representing twice the area ($A$) between the line of independence ($p$) and the relative concentration curve ($R(p)$), is given by

$$2A=\frac{2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[B]}{\operatorname{E}[B]}.$$

We assume that $\operatorname{E}[B]>0$.

Proof. Note that $R(p)=\frac{\operatorname{E}[BI(H\leq h_{p})]}{\operatorname{E}[B]}$, where $h_{p}$ denotes the $p$-th quantile of $H$. To find twice the area between the line of independence $p$ and the RCC $R(p)$, we start with:

$$
\begin{aligned}
2A\operatorname{E}[B]&=2\int^{1}_{0}\left(p\operatorname{E}[B]-\int^{h_{p}}_{-\infty}\operatorname{E}[B\mid H=h]f_{H}(h)\,dh\right)dp\\
&=\operatorname{E}[B]-2\int^{1}_{0}\int^{h_{p}}_{-\infty}\operatorname{E}[B\mid H=h]f_{H}(h)\,dh\,dp\\
&=\operatorname{E}[B]-2\int^{\infty}_{-\infty}\operatorname{E}[B\mid H=h]f_{H}(h)(1-F_{H}(h))\,dh\\
&=2\int^{\infty}_{-\infty}\operatorname{E}[B\mid H=h]f_{H}(h)F_{H}(h)\,dh-\operatorname{E}[B]\\
&=2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[B].
\end{aligned}
$$

The third equality exchanges the order of integration, using that $h\leq h_{p}$ if and only if $p\geq F_{H}(h)$. Therefore, we have

$$2A=\frac{2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[B]}{\operatorname{E}[B]}.$$

The expectation of Maximum-like (discrete). For discrete variables $B$ and $H$, let $\{(B_{1},H_{1}),(B_{2},H_{2})\}$ denote two independent copies of $(B,H)$. The expectation of Maximum-like satisfies

$$\operatorname{E}[B_{1}I(H_{1}\geq H_{2})+B_{2}I(H_{1}<H_{2})]=2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[Bf_{H}(H)],$$

where $f_{H}$ denotes the probability mass function (PMF) of $H$.

Proof.

$$
\begin{aligned}
&\operatorname{E}[B_{1}I(H_{1}\geq H_{2})+B_{2}I(H_{1}<H_{2})]\\
&=2\operatorname{E}\big[\operatorname{E}[B_{1}I(H_{1}>H_{2})\mid H_{1},H_{2}]\big]+\operatorname{E}\big[\operatorname{E}[B_{1}I(H_{1}=H_{2})\mid H_{1},H_{2}]\big]\\
&=2\sum_{h_{1}}\left[\sum_{b_{1}}b_{1}\operatorname{P}(B_{1}=b_{1}\mid H_{1}=h_{1})\right]\left(\sum_{h_{2}}I(h_{2}<h_{1})\operatorname{P}(H_{2}=h_{2})\right)\operatorname{P}(H_{1}=h_{1})\\
&\quad+\sum_{h_{1}}\left[\sum_{b_{1}}b_{1}\operatorname{P}(B_{1}=b_{1}\mid H_{1}=h_{1})\right]\left(\sum_{h_{2}}I(h_{2}=h_{1})\operatorname{P}(H_{2}=h_{2})\right)\operatorname{P}(H_{1}=h_{1})\\
&=2\sum_{h_{1}}\sum_{b_{1}}b_{1}\operatorname{P}(H_{2}<h_{1})\operatorname{P}(B_{1}=b_{1},H_{1}=h_{1})+\sum_{h_{1}}\sum_{b_{1}}b_{1}\operatorname{P}(H_{2}=h_{1})\operatorname{P}(B_{1}=b_{1},H_{1}=h_{1})\\
&=2\operatorname{E}[B(F_{H}(H)-f_{H}(H))]+\operatorname{E}[Bf_{H}(H)]\\
&=2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[Bf_{H}(H)].
\end{aligned}
$$
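The discrete identity can be checked exactly on a small example. The sketch below uses a made-up joint PMF of $(B,H)$ (the values are hypothetical and chosen only so that ties in $H$ occur) and exact rational arithmetic, enumerating all pairs of independent copies for the left-hand side.

```python
from fractions import Fraction as Fr
from itertools import product

# Hypothetical joint PMF of (B, H); keys are (b, h), values are probabilities.
pmf = {
    (Fr(0), 1): Fr(1, 4),
    (Fr(1), 1): Fr(1, 4),
    (Fr(1), 2): Fr(1, 8),
    (Fr(3), 2): Fr(3, 8),
}


def f_H(h):
    """PMF of H."""
    return sum(p for (_, hh), p in pmf.items() if hh == h)


def F_H(h):
    """CDF of H."""
    return sum(p for (_, hh), p in pmf.items() if hh <= h)


# Left-hand side: enumerate all pairs of independent copies.
lhs = sum(
    p1 * p2 * (b1 if h1 >= h2 else b2)
    for ((b1, h1), p1), ((b2, h2), p2) in product(pmf.items(), repeat=2)
)

# Right-hand side: 2 E[B F_H(H)] - E[B f_H(H)].
rhs = sum(p * b * (2 * F_H(h) - f_H(h)) for (b, h), p in pmf.items())
```

Both sides evaluate to exactly 2 for this PMF. Dropping the $\operatorname{E}[Bf_{H}(H)]$ term would overstate the left-hand side, which is why the correction matters whenever $H$ has ties.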

The Gini-like index (discrete). For discrete variables $B$ and $H$, the Gini-like index, representing twice the area ($A$) between the line of independence ($p$) and the relative concentration curve ($R(p)$), is given by

$$2A=\frac{2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[Bf_{H}(H)]-\operatorname{E}[B]}{\operatorname{E}[B]}.$$

We assume that $\operatorname{E}[B]>0$.

Proof. For a discrete variable $H$ with $k$ distinct values, patients are ranked by their value of $H$ in ascending order to plot the RCC. Assume that $h_{1}<h_{2}<h_{3}<\cdots<h_{k}$ with corresponding probabilities $p_{1},p_{2},p_{3},\cdots,p_{k}$. Note that $\sum_{i=1}^{k}p_{i}=1$, and we can express $2\operatorname{E}[BF_{H}(H)]$ as

$$2\operatorname{E}[BF_{H}(H)]=2\left(p_{1}\operatorname{E}[BI(H=h_{1})]+(p_{1}+p_{2})\operatorname{E}[BI(H=h_{2})]+\cdots+\operatorname{E}[BI(H=h_{k})]\right).$$

If $B\geq 0$, the area $A$ is bounded between $0$ and $0.5$. We calculate the area $A$ as $0.5$ minus the sum of the areas of one triangle and $(k-1)$ trapezoids under the RCC, which gives

$$2A=\frac{1}{\operatorname{E}[B]}\left((1-p_{k})\operatorname{E}[B]-(p_{1}+p_{2})\operatorname{E}[BI(H\leq h_{1})]-\cdots-(p_{k-1}+p_{k})\operatorname{E}[BI(H\leq h_{k-1})]\right).$$

We then express each $\operatorname{E}[BI(H\leq h_{i})]$ as a sum of treatment benefit averages over disjoint groups of patients. For instance,

$$\operatorname{E}[BI(H\leq h_{3})]=\operatorname{E}[BI(H=h_{1})]+\operatorname{E}[BI(H=h_{2})]+\operatorname{E}[BI(H=h_{3})].$$

It is possible to show that for $k<\infty$

$$
\begin{aligned}
\frac{2\operatorname{E}[BF_{H}(H)]-\operatorname{E}[B]}{\operatorname{E}[B]}-2A&=\frac{\sum_{i=1}^{k}p_{i}\operatorname{E}[BI(H=h_{i})]}{\operatorname{E}[B]}\\
&=\frac{\operatorname{E}[Bf_{H}(H)]}{\operatorname{E}[B]}.
\end{aligned}
$$

If some patients have $B<0$ , the area $A$ can take a value greater than $0.5$ , but the expression of $A$ stays the same. In other words, the Gini-like index could be greater than $1$ with some negative treatment benefit values in the target population.
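The geometric construction (one triangle plus $(k-1)$ trapezoids) and the closed-form expression can be compared exactly on a small case. The sketch below assumes a hypothetical $H$ with $k=3$ levels; the probabilities and conditional means are made up for illustration.

```python
from fractions import Fraction as Fr

# Hypothetical levels of H in ascending order: (P(H = h_i), E[B | H = h_i]).
levels = [(Fr(1, 2), Fr(1)), (Fr(1, 4), Fr(2)), (Fr(1, 4), Fr(4))]

mean_b = sum(p * m for p, m in levels)  # E[B]

area_under = Fr(0)   # area under the piecewise-linear RCC
cum_p = Fr(0)        # F_H at the current level
cum_r = Fr(0)        # R(p) at the current knot
EBF = Fr(0)          # E[B F_H(H)]
EBf = Fr(0)          # E[B f_H(H)]
for p, m in levels:
    new_r = cum_r + p * m / mean_b
    area_under += p * (cum_r + new_r) / 2  # triangle first, then trapezoids
    cum_p, cum_r = cum_p + p, new_r
    EBF += p * m * cum_p
    EBf += p * m * p

two_A_geometric = 1 - 2 * area_under
two_A_formula = (2 * EBF - EBf - mean_b) / mean_b
```

For these values both computations give $2A=5/16$, matching the discrete Gini-like index stated above.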

Aside: For the univariate case, the properties of the expectation of the maximum follow similar patterns. For a continuous variable $X$ ($X\geq 0$) with two independent copies $\{X_{1},X_{2}\}$, we observe that

$$\operatorname{E}[\max(X_{1},X_{2})]=2\operatorname{E}[XF_{X}(X)],$$

where $F_{X}$ represents the CDF of $X$. Similarly, for a discrete variable $X$ with two independent copies $\{X_{1},X_{2}\}$, the equation becomes

$$\operatorname{E}[\max(X_{1},X_{2})]=2\operatorname{E}[XF_{X}(X)]-\operatorname{E}[Xf_{X}(X)],$$

where $f_{X}$ denotes the PMF of $X$. Moreover, for the Gini coefficient ($2A^{\prime}$) in the discrete case, we have

$$\frac{2\operatorname{E}[XF_{X}(X)]-\operatorname{E}[X]}{\operatorname{E}[X]}-2A^{\prime}=\frac{\operatorname{E}[Xf_{X}(X)]}{\operatorname{E}[X]}.$$
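As a small worked example of the discrete identity for $\operatorname{E}[\max(X_{1},X_{2})]$, take $X$ to be the outcome of a fair six-sided die (an illustrative choice, not part of the paper's populations), so that $F_{X}(x)=x/6$ and $f_{X}(x)=1/6$. Both routes give the same value:

```latex
\begin{aligned}
2\operatorname{E}[XF_{X}(X)]-\operatorname{E}[Xf_{X}(X)]
&=2\sum_{x=1}^{6}x\cdot\frac{x}{6}\cdot\frac{1}{6}-\sum_{x=1}^{6}x\cdot\frac{1}{6}\cdot\frac{1}{6}
=\frac{182}{36}-\frac{21}{36}=\frac{161}{36},\\
\operatorname{E}[\max(X_{1},X_{2})]
&=\sum_{m=1}^{6}m\cdot\frac{2m-1}{36}=\frac{161}{36}.
\end{aligned}
```

The second line uses $\operatorname{P}(\max(X_{1},X_{2})=m)=\frac{m^{2}-(m-1)^{2}}{36}=\frac{2m-1}{36}$.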

Appendix C: Closed-form Expressions in Setting 2

With Control for Confounding

Recall that $H:=h(X)$. As the map $x\mapsto(\tau_{s}(x),h(x))$ is not one-to-one in population 2, we begin by calculating the pre-image of the point $(\tau_{s}(x),h)$. The pre-image of any point $(\tau_{s}(x),h)$ consists of two points given by the set

$$\left\{(x_{1}=\tau_{s}(x),x_{2}=h-\tau_{s}(x)),(x_{1}=h-\tau_{s}(x),x_{2}=\tau_{s}(x))\right\},$$

which exists when $2\tau_{s}(x)\geq h$ and $h\geq\tau_{s}(x)$ .

Observe that the pre-image of any point $(\tau_{s}(x),h)$ comprises two points achieved by interchanging the values of $x_{1}$ and $x_{2}$ . These two points correspond to two scenarios: $x_{1}\geq x_{2}$ and $x_{1}<x_{2}$ . Fortunately, in each scenario, the max operator functions as a one-to-one map. Overall, the joint density of $(\tau_{s}(x),H)$ is the sum of the two parts, which is

$$f_{\tau_{s}(x),H}(\tau_{s}(x),h)=\begin{cases}2,&\tau_{s}(x)\leq h,\ h\leq 2\tau_{s}(x),\ 0\leq\tau_{s}(x)\leq 1,\\ 0,&\text{otherwise}.\end{cases}$$

Now, we can calculate the marginal PDFs from the joint density $f_{\tau_{s}(x),H}(\tau_{s}(x),h)$:

	$\displaystyle f_{\tau_{s}(x)}(\tau_{s}(x))$	$\displaystyle=\begin{cases}2\tau_{s}(x),&0\leq\tau_{s}(x)\leq 1,\\ 0,&\text{otherwise},\end{cases}$
	$\displaystyle f_{H}(h)$	$\displaystyle=\begin{cases}h,&0\leq h<1,\\ 2-h,&1\leq h\leq 2,\\ 0,&\text{otherwise}.\end{cases}$

The CDF of $H$ is

\displaystyle F_{H}(h)=\begin{cases}0,&h<0,\\ h^{2}/2,&0\leq h<1,\\ 1-(2-h)^{2}/2,&1\leq h<2,\\ 1,&2\leq h.\end{cases}

With the joint PDF of $(\tau_{s}(X),H)$ , we calculate the conditional density function of $(\tau_{s}(X)\mid H)$ by definition:

$$f_{\tau_{s}(X)\mid H}(\tau_{s}(x)\mid h)=\begin{cases}2/h,&\tau_{s}(x)\leq h,\ h\leq 2\tau_{s}(x),\ 0\leq\tau_{s}(x)\leq 1,\ 0<h\leq 1,\\ 2/(2-h),&\tau_{s}(x)\leq h,\ h\leq 2\tau_{s}(x),\ 0\leq\tau_{s}(x)\leq 1,\ 1<h<2,\\ 0,&\text{otherwise}.\end{cases}$$

Finally, we can compute $\operatorname{E}[\tau_{s}(X)\mid H=h]$ by the definition of conditional expectation:

$$
\begin{aligned}
\operatorname{E}[\tau_{s}(X)\mid H=h]&=\int^{\infty}_{-\infty}\tau_{s}(x)f_{\tau_{s}(X)\mid H}(\tau_{s}(x)\mid h)\,d\tau_{s}(x)\\
&=\begin{cases}\frac{3h}{4},&0<h\leq 1,\\ \frac{1-h^{2}/4}{2-h},&1<h<2,\\ 0,&\text{otherwise}.\end{cases}
\end{aligned}
$$

Then, we calculate the $C_{b}$ for $H$ by computing

$$
\begin{aligned}
\operatorname{E}[\tau_{s}(X)F_{H}(H)]&=\int_{h}\int_{\tau_{s}(x)}\tau_{s}(x)F_{H}(h)f_{\tau_{s}(X),H}(\tau_{s}(x),h)\,d\tau_{s}(x)\,dh\\
&=\int^{1}_{0}\int^{h}_{h/2}2\tau_{s}(x)\frac{h^{2}}{2}\,d\tau_{s}(x)\,dh+\int^{2}_{1}\int^{1}_{h/2}2\tau_{s}(x)\left(1-\frac{(2-h)^{2}}{2}\right)d\tau_{s}(x)\,dh\\
&=2\left(\frac{3}{80}+\frac{19}{120}\right)=\frac{47}{120}=\operatorname{E}[BF_{H}(H)],\\
C_{b,h}&=1-\frac{\tau^{*}}{2\operatorname{E}[BF_{H}(H)]}=1-\frac{2/3}{4\left(\frac{3}{80}+\frac{19}{120}\right)}=\frac{7}{47}\approx 0.1489362.
\end{aligned}
$$

Furthermore, we can calculate the $C_{b}$ for $\tau_{s}(X)$ by computing

$$
\begin{aligned}
\operatorname{E}[\tau_{s}(X)F_{\tau_{s}(X)}(\tau_{s}(X))]&=\int_{h}\int_{\tau_{s}(x)}\tau_{s}(x)F_{\tau_{s}(X)}(\tau_{s}(x))f_{\tau_{s}(X),H}(\tau_{s}(x),h)\,d\tau_{s}(x)\,dh\\
&=\int^{1}_{0}\int^{2\tau_{s}(x)}_{\tau_{s}(x)}2\tau_{s}(x)^{3}\,dh\,d\tau_{s}(x)\\
&=\frac{2}{5}=\operatorname{E}[BF_{\tau_{s}(X)}(\tau_{s}(X))],\\
C_{b,\tau_{s}(x)}&=1-\frac{\tau^{*}}{2\operatorname{E}[BF_{\tau_{s}(X)}(\tau_{s}(X))]}=1-\frac{2/3}{4/5}=\frac{1}{6}\approx 0.1667.
\end{aligned}
$$
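These closed-form values can be cross-checked by simulation. The sketch below assumes, as the unit joint density used later in this appendix suggests, that $X_{1}$ and $X_{2}$ are independent $\text{Uniform}(0,1)$ variables, with $\tau_{s}(X)=\max(X_{1},X_{2})$ and $H=X_{1}+X_{2}$. It is a rough numerical check under those assumptions, not part of the derivation.

```python
import random

random.seed(1)
n = 200_000


def cdf_H(h):
    """Closed-form CDF of H = X1 + X2 for independent Uniform(0,1) covariates."""
    if h < 0:
        return 0.0
    if h < 1:
        return h * h / 2
    if h < 2:
        return 1 - (2 - h) ** 2 / 2
    return 1.0


sum_tau = sum_tau_FH = sum_tau_Ftau = 0.0
for _ in range(n):
    x1, x2 = random.random(), random.random()
    tau = max(x1, x2)                  # tau_s(X) in population 2
    sum_tau += tau
    sum_tau_FH += tau * cdf_H(x1 + x2)
    sum_tau_Ftau += tau * tau ** 2     # F_tau(t) = t^2 on [0, 1]

cb_h = 1 - sum_tau / (2 * sum_tau_FH)      # compare with 7/47
cb_tau = 1 - sum_tau / (2 * sum_tau_Ftau)  # compare with 1/6
```

Up to Monte Carlo error, `cb_h` should be close to $7/47$ and `cb_tau` close to $1/6$, reflecting that ranking patients by $\tau_{s}(X)$ itself concentrates benefit more than ranking by $H$.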

Without Control for Confounding

The confounding bias is defined as

$$\text{bias}(X)=\left(\operatorname{E}[Y\mid A=1,X]-\operatorname{E}[Y\mid A=0,X]\right)-\tau_{s}(X).$$

We obtain $\operatorname{E}[Y\mid A=1,X]-\operatorname{E}[Y\mid A=0,X]$ from $\mu_{a}(x,z)$, $a\in\{0,1\}$, by integrating $Z$ out of $\mu_{a}(x,z)$:

$$
\begin{aligned}
\operatorname{E}[Y\mid A=1,X=x]&=\int_{z}\operatorname{E}[Y\mid A=1,X=x,Z=z]f_{Z\mid X,A}(z\mid x,1)\,dz\\
&=\int_{0}^{1}2z\left(\frac{1}{2}\max(x_{1},x_{2})+\max(x_{2},z)+\frac{1}{10}x_{1}\right)dz\\
&=\frac{1}{2}\max(x_{1},x_{2})+\frac{1}{3}x_{2}^{3}+\frac{2}{3}+\frac{1}{10}x_{1},\\
\operatorname{E}[Y\mid A=0,X=x]&=\int_{z}\operatorname{E}[Y\mid A=0,X=x,Z=z]f_{Z\mid X,A}(z\mid x,0)\,dz\\
&=\int_{0}^{1}2(1-z)\left(-\frac{1}{2}\max(x_{1},x_{2})+\max(x_{2},z)+\frac{1}{10}x_{1}\right)dz\\
&=-\frac{1}{2}\max(x_{1},x_{2})+\frac{1}{3}+x_{2}^{2}-\frac{1}{3}x_{2}^{3}+\frac{1}{10}x_{1}.
\end{aligned}
$$

Overall, we have

$$\operatorname{E}[Y\mid A=1,X=x]-\operatorname{E}[Y\mid A=0,X=x]=\tau_{s}(x)+\frac{1}{3}-x_{2}^{2}+\frac{2}{3}x_{2}^{3}.$$

We denote $\operatorname{E}[Y\mid A=1,X=x]-\operatorname{E}[Y\mid A=0,X=x]$ as $D$ , and we have

	$\displaystyle\tau_{s}(X)=\max(X_{1},X_{2}),$
	$\displaystyle H=X_{1}+X_{2},$
	$\displaystyle D=\tau_{s}(X)+\text{bias}(X),$

where $\text{bias}(x_{1},x_{2})=\frac{1}{3}-x_{2}^{2}+\frac{2}{3}x_{2}^{3}$ is the confounding bias.
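The closed-form bias can be checked by integrating $Z$ out numerically. The sketch below takes the outcome means and the conditional densities of $Z$ directly from the integrands above ($2z$ for the treated and $2(1-z)$ for the untreated on $(0,1)$) and applies a midpoint rule. The function names are hypothetical.

```python
def mu(a, x1, x2, z):
    """E[Y | A=a, X=x, Z=z] as it appears in the integrands above."""
    sign = 0.5 if a == 1 else -0.5
    return sign * max(x1, x2) + max(x2, z) + 0.1 * x1


def density_z(a, z):
    """Density of Z given (X, A=a): 2z if treated, 2(1 - z) if untreated."""
    return 2 * z if a == 1 else 2 * (1 - z)


def naive_contrast(x1, x2, steps=20_000):
    """Midpoint-rule value of E[Y | A=1, X=x] - E[Y | A=0, X=x]."""
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) / steps
        total += mu(1, x1, x2, z) * density_z(1, z)
        total -= mu(0, x1, x2, z) * density_z(0, z)
    return total / steps


def bias_closed_form(x2):
    """bias(x1, x2) = 1/3 - x2^2 + (2/3) x2^3, which does not depend on x1."""
    return 1 / 3 - x2 ** 2 + (2 / 3) * x2 ** 3


x1, x2 = 0.3, 0.6
numeric_bias = naive_contrast(x1, x2) - max(x1, x2)  # subtract tau_s(x)
```

The numerical and closed-form values agree to within the quadrature error, and the bias decreases from $1/3$ at $x_{2}=0$ to $0$ at $x_{2}=1$.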

We compute

	$\displaystyle\operatorname{E}[D]$	$\displaystyle=\operatorname{E}[\tau_{s}(X)]+\operatorname{E}[\text{bias}(X)]$
		$\displaystyle=\frac{2}{3}+\int_{x_{2}}\left(\frac{1}{3}-x_{2}^{2}+\frac{2}{3}x_{2}^{3}\right)f_{X_{2}}(x_{2})dx_{2}$
		$\displaystyle=\frac{2}{3}+\frac{1}{6}=\frac{5}{6}.$

Then, we calculate $\operatorname{E}[D\mid H]$, which is the version of $\operatorname{E}[B\mid H]$ obtained without control for confounding and is therefore contaminated by confounding bias:

$$\operatorname{E}[D\mid H]=\operatorname{E}[\tau_{s}(X)\mid H]+\operatorname{E}[\text{bias}(X)\mid H].$$

To calculate $\operatorname{E}[\text{bias}(X)\mid H]$, we need the joint distribution of $(X_{2},H)$ and then the conditional distribution of $X_{2}$ given $H$. As the map $(x_{1},x_{2})\mapsto(x_{2},x_{1}+x_{2})$ is one-to-one with $|J|=1$, we get

$$f_{X_{2},H}(x_{2},h)=f_{X_{2},X_{1}+X_{2}}(x_{2},x_{1}+x_{2})=f_{X_{2},X_{1}}(x_{2},x_{1})|J|=1,$$

where $0\leq x_{2}\leq 1$, $0\leq h\leq 2$, $h-1\leq x_{2}$, and $x_{2}\leq h$. The conditional distribution is

$$f_{X_{2}\mid H}(x_{2}\mid h)=\begin{cases}\frac{1}{h},&0\leq x_{2}\leq 1,\ 0<h\leq 1,\ h-1\leq x_{2},\ x_{2}\leq h,\\ \frac{1}{2-h},&0\leq x_{2}\leq 1,\ 1\leq h<2,\ h-1\leq x_{2},\ x_{2}\leq h,\\ 0,&\text{otherwise}.\end{cases}$$

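Integrating $\text{bias}(x_{2})=\frac{1}{3}-x_{2}^{2}+\frac{2}{3}x_{2}^{3}$ against this conditional density, over $0\leq x_{2}\leq h$ when $0<h\leq 1$ and over $h-1\leq x_{2}\leq 1$ when $1<h<2$, gives

$$\operatorname{E}[\text{bias}(X_{2})\mid H=h]=\begin{cases}\frac{1}{3}-\frac{h^{2}}{3}+\frac{h^{3}}{6},&0<h\leq 1,\\ \frac{h(2-h)^{2}}{6},&1<h<2,\\ 0,&\text{otherwise},\end{cases}$$

which averages to $\operatorname{E}[\text{bias}(X)]=\frac{1}{6}$ over the distribution of $H$. Therefore,

$$
\begin{aligned}
\tilde{\operatorname{E}}[B\mid H=h]&=\operatorname{E}[\tau_{s}(X)\mid H=h]+\operatorname{E}[\text{bias}(X_{2})\mid H=h]\\
&=\begin{cases}\frac{3h}{4}+\frac{1}{3}-\frac{h^{2}}{3}+\frac{h^{3}}{6},&0<h\leq 1,\\ \frac{1-h^{2}/4}{2-h}+\frac{h(2-h)^{2}}{6},&1<h<2,\\ 0,&\text{otherwise}.\end{cases}
\end{aligned}
$$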

Finally, we calculate the $C_{b}$ for $H$ by computing

$$
\begin{aligned}
\operatorname{E}[\text{bias}(X_{2})F_{H}(H)]&=\int_{h}\int_{x_{2}}\left(\frac{1}{3}-x_{2}^{2}+\frac{2}{3}x_{2}^{3}\right)F_{H}(h)f_{X_{2},H}(x_{2},h)\,dx_{2}\,dh\\
&=\frac{13}{504}+\frac{43}{1260}=\frac{151}{2520},
\end{aligned}
$$

and

$$
\begin{aligned}
\operatorname{E}[DF_{H}(H)]&=\operatorname{E}[\tau_{s}(X)F_{H}(H)]+\operatorname{E}[\text{bias}(X_{2})F_{H}(H)]\\
&=2\left(\frac{3}{80}+\frac{19}{120}\right)+\left(\frac{13}{504}+\frac{43}{1260}\right)=\frac{569}{1260}.
\end{aligned}
$$

Therefore,

$$\tilde{C}_{b,h}=1-\frac{\operatorname{E}[D]}{2\operatorname{E}[DF_{H}(H)]}=1-\frac{5/6}{2\left(2\left(\frac{3}{80}+\frac{19}{120}\right)+\left(\frac{13}{504}+\frac{43}{1260}\right)\right)}=\frac{44}{569}\approx 0.07732865.$$
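The arithmetic behind the two $C_{b}$ values for $H$ can be reproduced with exact rational numbers. The short sketch below only recombines the component expectations reported in this appendix; the variable names are hypothetical.

```python
from fractions import Fraction as Fr

E_tau = Fr(2, 3)                             # E[tau_s(X)] = tau*
E_bias = Fr(1, 6)                            # E[bias(X)]
E_tau_FH = 2 * (Fr(3, 80) + Fr(19, 120))     # E[tau_s(X) F_H(H)]
E_bias_FH = Fr(13, 504) + Fr(43, 1260)       # E[bias(X_2) F_H(H)]

# C_b for H with control for confounding.
cb_adjusted = 1 - E_tau / (2 * E_tau_FH)

# C_b for H without control for confounding, where D = tau_s(X) + bias(X).
cb_naive = 1 - (E_tau + E_bias) / (2 * (E_tau_FH + E_bias_FH))
```

The results are $7/47\approx 0.1489$ with control for confounding and $44/569\approx 0.0773$ without, so the positive bias in this population roughly halves the apparent discrimination of $H$.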