License: CC BY 4.0
arXiv:2307.15176v3 [cs.AI] 31 Jan 2024

RCT Rejection Sampling for Causal Estimation Evaluation

   Katherine A. Keith [email protected]
Williams College
Sergey Feldman [email protected]
Allen Institute for Artificial Intelligence
David Jurgens [email protected]
University of Michigan
Jonathan Bragg [email protected]
Allen Institute for Artificial Intelligence
Rohit Bhattacharya [email protected]
Williams College
Abstract

Confounding is a significant obstacle to unbiased estimation of causal effects from observational data. For settings with high-dimensional covariates—such as text data, genomics, or the behavioral social sciences—researchers have proposed methods to adjust for confounding by adapting machine learning methods to the goal of causal estimation. However, empirical evaluation of these adjustment methods has been challenging and limited. In this work, we build on a promising empirical evaluation strategy that simplifies evaluation design and uses real data: subsampling randomized controlled trials (RCTs) to create confounded observational datasets while using the average causal effects from the RCTs as ground-truth. We contribute a new sampling algorithm, which we call RCT rejection sampling, and provide theoretical guarantees that causal identification holds in the observational data to allow for valid comparisons to the ground-truth RCT. Using synthetic data, we show our algorithm indeed results in low bias when oracle estimators are evaluated on the confounded samples, which is not always the case for a previously proposed algorithm. In addition to this identification result, we highlight several finite data considerations for evaluation designers who plan to use RCT rejection sampling on their own datasets. As a proof of concept, we implement an example evaluation pipeline and walk through these finite data considerations with a novel, real-world RCT—which we release publicly—consisting of approximately 70k observations and text data as high-dimensional covariates. Together, these contributions build towards a broader agenda of improved empirical evaluation for causal estimation.

1 Introduction

Across the empirical sciences, confounding is a significant obstacle to unbiased estimation of causal effects from observational data. Covariate adjustment on a relevant set of confounders aka backdoor adjustment (Pearl, 2009) is a popular technique for mitigating such confounding bias. In settings with only a few covariates, simple estimation strategies—e.g., parametric models or contingency tables—often suffice to compute the adjusted estimates. However, modern applications of causal inference have had to contend with thousands of covariates in fields like natural language processing (Keith et al., 2020; Feder et al., 2022), genetics (Stekhoven et al., 2012), or the behavioral social sciences (Li et al., 2016; Eckles & Bakshy, 2021). In these high-dimensional scenarios, more sophisticated methods are needed and often involve machine learning. Recent approaches include non-parametric and semi-parametric estimators (Hill, 2011; Chernozhukov et al., 2018; Athey et al., 2018; Farrell et al., 2021; Bhattacharya et al., 2022), causally-informed covariate selection (Maathuis et al., 2009; Belloni et al., 2014; Shortreed & Ertefaie, 2017), proxy measurement and correction (Kuroki & Pearl, 2014; Wood-Doughty et al., 2018), and causal representation learning (Johansson et al., 2016; Shi et al., 2019; Veitch et al., 2020).

Despite all this recent work targeted at high-dimensional confounding, these methods have not been systematically and empirically benchmarked. Such evaluations are essential in determining which methods work well in practice and under what conditions. However, unlike supervised learning problems which have ground-truth labels available for evaluating predictive performance on a held-out test set, analogous causal estimation problems require ground-truth labels for counterfactual outcomes of an individual under multiple versions of the treatment, data that is generally impossible to measure (Holland, 1986).

A promising evaluation strategy is to directly subsample data from a randomized controlled trial (RCT) in a way that induces confounding. Causal effect estimates obtained using the confounded observational samples can then be compared against the ground-truth estimates from the RCT to assess the performance of different causal estimators. This idea has appeared in works like Hill (2011) and Zhang & Bareinboim (2021) and was recently formalized by Gentzel et al. (2021).

We contribute to this evaluation strategy — which we subsequently refer to as RCT subsampling — via theory that clarifies why and how RCT subsampling algorithms should be constrained in order to produce valid downstream empirical comparisons. In particular, we prove previous subsampling algorithms can produce observational samples from which the causal effect is provably not identified, which makes recovery of the RCT ground-truth impossible (even with infinite samples). To address this issue, we present a new RCT subsampling algorithm, which we call RCT rejection sampling, that appropriately constrains the subsampling such that the observed data distribution permits identification.

In addition to improving the theoretical foundations of RCT subsampling, we provide evaluation designers a scaffolding to apply the theory. We implement a proof of concept evaluation pipeline with a novel, real-world RCT dataset—which we release publicly—consisting of approximately 70k observations and text data as high-dimensional covariates. We highlight important finite data considerations: selecting an RCT dataset and examining when empirical evaluation is appropriate; empirically verifying a necessary precondition for RCT subsampling; specifying and diagnosing an appropriate confounding function using finite samples; applying baseline estimation models; and briefly speculate on additional challenges that could arise. For each of these considerations, we walk through specific approaches we take in our proof of concept pipeline.

In summary, our contributions are

  • We provide a proof using existing results in causal graphical models showing that previous RCT subsampling procedures (e.g., Gentzel et al. (2021)) may draw observational data in a way that prevents non-parametric identification of the causal effect due to selection bias (§3.3).

  • We propose a new subsampling algorithm, which we call RCT rejection sampling, that is theoretically guaranteed to produce an observational dataset where samples are drawn according to a distribution where the effect is identified via a backdoor functional (§3.4). Using three settings of synthetic data, we show our algorithm results in low bias, which is not always the case for a previous algorithm (§3.5).

  • For evaluation designers who plan to use RCT rejection sampling for their own datasets, we highlight several finite data considerations and implement a proof of concept pipeline with a novel, real-world RCT dataset and application of baseline estimation models (§4).

  • We release this novel, real-world RCT dataset of approximately 70k observations that has text as covariates (§4.1.1). We also release our code.111 Code and data at https://github.com/kakeith/rct_rejection_sampling.

These contributions build towards a more extensive future research agenda in empirical evaluation for causal estimation (§5).

2 Related Work in Empirical Evaluation of Causal Estimators

Allgemein Application to Backdoor Adjustment
Dataset Eval. strategy DoF Data avail. DGP realism Covariates (num.) Data public?
Simulation, normal assmpts. (D’Amour & Franks, 2021) Synthetic  Many  High  Low  High (1000+)  Yes
IHDP-ACIC 2016 (Dorie et al., 2019) Semi-synthetic  Many  High  Medium  Medium (58)  Yes
PeerRead theorems (Veitch et al., 2020) Semi-synthetic  Many  High  Medium  High (Text vocab)  Yes
RCT repositories (Gentzel et al., 2021) RCT subsampling  Few  High-RCTs  High  Low (1-2)  Yes
Job training (LaLonde, 1986) COS  Few  Low  High  Low (4)  Yes
Facebook peer effects (Eckles & Bakshy, 2021) COS  Few  Low  High  High (3700)  No
This work RCT subsampling  Few  High-RCTs  High  High (Text vocab)  Yes
Table 1: Select related work in empirical evaluation of causal estimators compared on general desiderata of:  few degrees of freedom (DoF) for the evaluation designer,  high data availability, and  realistic data-generating processes (DGP). We also examine the accompanying datasets presented for evaluating backdoor adjustment. Here, we want a  high number of covariates to make the evaluation non-trivial and  public availability of the data for reuse and reproducibility.

As we discussed briefly in Section 1, empirical evaluation of causal estimation methods for observational data is difficult but important.

We argue an evaluation strategy should in general (i) reduce the evaluation designers’ degrees of freedom i.e. limit the number of choices researchers have to (inadvertently) pick an evaluation that favors their own method (Gentzel et al., 2019); (ii) has the necessary data (e.g., RCTs) available, (iii) ensure the data generating process (DGP) reflects the real world, and (iv) make the data publicly available for reuse and reproducibility. For applications of backdoor adjustment, we argue non-trivial evaluation should additionally (v) include a high number of covariates that could be used in the adjustment set. Table 1 compares select previous work (and our own) according to the above desiderata. We briefly discuss these and other related work, and make a qualitative argument for the RCT subsampling strategy we contribute to.

Synthetic evaluations are ones in which researchers specify the entire DGP, e.g., D’Amour & Franks (2021); Schmidt et al. (2022). This allows for infinite data availability, but is prone to encoding researcher preferences and can lead to over-simplification (or overly complex DGPs) compared to real-world observational scenarios.

Semi-synthetic evaluations use some real data but specify the rest of the synthetic DGP. This approach has been used in causal inference competitions (Dorie et al., 2019; Shimoni et al., 2018) and settings with text-data as confounding variables (Roberts et al., 2020; Veitch et al., 2020; Weld et al., 2022). Other semi-synthetic work fits generative models to real-world data (Neal et al., 2020; Parikh et al., 2022) or uses pre-trained language models to generate high-dimensional confounders from variables in a synthetic DGP (Wood-Doughty et al., 2021). Although more realistic than synthetic data, semi-synthetic DGPs can also make unrealistic assumptions; for example, Reisach et al. (2021) demonstrate this issue in the context of evaluating causal discovery algorithms.

Constructed observational studies (COSs) start with RCTs and then find non-experimental control samples that come from a similar population (LaLonde, 1986; Hill et al., 2004; Arceneaux et al., 2006; Shadish et al., 2008; Jaciw, 2016; Gordon et al., 2019; Eckles & Bakshy, 2021; Zeng et al., 2022; Gordon et al., 2022).222For example, LaLonde (1986) adjusts for four covariates—age, years of schooling, high school drop-out status, and race (Table 4, Footnote C in LaLonde)—in his non-experimental group in a study of the effects of job training on earnings. The advantage of COSs over (semi-)synthetic data is that they have few researcher degrees of freedom; however, non-experimental control groups often do not exist or do not come from similar-enough populations; see Dahabreh et al. (2022) for more details on identification from COSs.

Subsampling RCTs uses an RCT as ground-truth and then subsamples the RCT data to create a confounded observational dataset. For example, Zhang & Bareinboim (2021) subsample from the International Stroke Trial (IST) of roughly 20k patients to estimate the treatment effect of aspirin allocation. This strategy also appears in prior work (Hill, 2011; Kallus et al., 2018) and was recently formalized by Gentzel et al. (2021). While this approach is limited by the availability of RCTs and sampling decreases the number of units available to the estimation methods, it does not require the comparable non-experimental control group required by COSs, resulting in greater data availability. There are also fewer researcher degrees of freedom compared to synthetic or semi-synthetic approaches. Because of these tradeoffs, we believe this is one of the most promising strategies for empirically evaluating causal estimation methods and we build upon this strategy in the remainder of this work.

3 Subsampling from RCTs

We preface this section with a brief description of causal graphs, a prerequisite to understanding subsequent results. Then we provide identification and non-identification proofs for RCT subsampling algorithms and evidence from synthetic data.

3.1 Background: Causal graphical models

A causal model of a directed acyclic graph (causal DAG) 𝒢(V)𝒢𝑉{\mathcal{G}}(V)caligraphic_G ( italic_V ) can be viewed as the set of distributions induced by a system of structural equations: For each variable ViVsubscript𝑉𝑖𝑉V_{i}\in Vitalic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_V there exists a structural equation Vifi(pai,ϵi)subscript𝑉𝑖subscript𝑓𝑖subscriptpa𝑖subscriptitalic-ϵ𝑖V_{i}\leftarrow f_{i}(\operatorname{pa}_{i},\epsilon_{i})italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ← italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( roman_pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) (Pearl, 2009). This function maps the variable’s parents’ values—paisubscriptpa𝑖\operatorname{pa}_{i}roman_pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT of Visubscript𝑉𝑖V_{i}italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in 𝒢(V)𝒢𝑉{\mathcal{G}}(V)caligraphic_G ( italic_V )—and an exogenous noise term333Typically these noise terms are assumed to be mutually independent, but this assumption is not strictly necessary (Richardson & Robins, 2013). Our results are non-parametric in the sense that we do not require any distributional assumptions on the noise terms or the specific form (e.g., linearity) of the structural equations fi()subscript𝑓𝑖f_{i}(\cdot)italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( ⋅ )., ϵisubscriptitalic-ϵ𝑖\epsilon_{i}italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, to values of Visubscript𝑉𝑖V_{i}italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The system of equations induces a joint distribution P(V)𝑃𝑉P(V)italic_P ( italic_V ) that is Markov relative to the DAG 𝒢(V)𝒢𝑉{\mathcal{G}}(V)caligraphic_G ( italic_V ), i.e., P(V=v)=ViVP(Vi=viPai=pai)𝑃𝑉𝑣subscriptproductsubscript𝑉𝑖𝑉𝑃subscript𝑉𝑖conditionalsubscript𝑣𝑖subscriptPa𝑖subscriptpa𝑖P(V=v)=\prod_{V_{i}\in V}P(V_{i}=v_{i}\mid\operatorname{Pa}_{i}=\operatorname{% pa}_{i})italic_P ( italic_V = italic_v ) = ∏ start_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_V end_POSTSUBSCRIPT italic_P ( italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ roman_Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = roman_pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). Independences in the distribution can be read off from 𝒢𝒢{\mathcal{G}}caligraphic_G via the well-known d-separation criterion (Pearl, 2009). Interventions in the model are typically formalized using the do-operator (Pearl, 2009), where Y|do(T=t)conditional𝑌do𝑇𝑡Y|\operatorname{do}(T=t)italic_Y | roman_do ( italic_T = italic_t ) denotes the value of an outcome Y𝑌Yitalic_Y under an intervention that sets the treatment T𝑇Titalic_T to value t𝑡titalic_t.

Here, our causal estimand of interest is the average treatment effect (ATE) defined as,

ATE𝔼[Y|do(T=t)]𝔼[Y|do(T=t)],ATE𝔼delimited-[]conditional𝑌do𝑇𝑡𝔼delimited-[]conditional𝑌do𝑇superscript𝑡\displaystyle\text{ATE}\equiv\mathbb{E}[Y|\operatorname{do}(T=t)]-\mathbb{E}[Y% |\operatorname{do}(T=t^{\prime})],ATE ≡ blackboard_E [ italic_Y | roman_do ( italic_T = italic_t ) ] - blackboard_E [ italic_Y | roman_do ( italic_T = italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ] , (1)

where t𝑡titalic_t and tsuperscript𝑡t^{\prime}italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT denote distinct values of T𝑇Titalic_T. A causal parameter is said to be identified if it can be expressed as a function of the observed data P(V)𝑃𝑉P(V)italic_P ( italic_V ). Given a set of variables ZV𝑍𝑉Z\subset Vitalic_Z ⊂ italic_V that satisfy the backdoor criterion w.r.t T𝑇Titalic_T and Y𝑌Yitalic_Y444The set Z𝑍Zitalic_Z satisfies the backdoor criterion if no variable in Z𝑍Zitalic_Z is a causal descendant of T𝑇Titalic_T, and Z𝑍Zitalic_Z blocks all backdoor paths between T𝑇Titalic_T and Y𝑌Yitalic_Y, i.e., all paths of the form TY𝑇𝑌T\leftarrow\cdots\rightarrow Yitalic_T ← ⋯ → italic_Y., the ATE is identified via the well-known backdoor adjustment functional (Pearl, 1995).

T𝑇Titalic_TY𝑌Yitalic_YC𝐶Citalic_C(a)T𝑇Titalic_TY𝑌Yitalic_YC𝐶Citalic_CS=1𝑆1S=1italic_S = 1(b)T𝑇Titalic_TY𝑌Yitalic_YC𝐶Citalic_C(c)
Figure 1: Causal DAGs (a) corresponding to an RCT; (b) representing a sampling procedure; (c) corresponding to an observational study where C𝐶Citalic_C satisfies the backdoor criterion.

3.2 RCT subsampling: Setup and conditions

We now describe the specific setup and objectives of RCT subsampling.555Here our target of interest is the ATE, though similar principles apply to other causal parameters like conditional average treatment effects. We start with a dataset DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT consisting of n𝑛nitalic_n iid draws from an RCT of pre-treatment covariates C={C1,,Ck}𝐶subscript𝐶1subscript𝐶𝑘C=\{C_{1},\dots,C_{k}\}italic_C = { italic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT }, treatment T𝑇Titalic_T, and outcome Y𝑌Yitalic_Y. Since this data is assumed to come from an RCT, the observed data distribution P(C,T,Y)𝑃𝐶𝑇𝑌P(C,T,Y)italic_P ( italic_C , italic_T , italic_Y ) is Markov relative to the causal DAG shown in Fig. 1(a) where TCT\perp\!\!\!\perp Citalic_T ⟂ ⟂ italic_C. The goal of RCT subsampling is to construct an observational dataset DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT that consists of mn𝑚𝑛m\leq nitalic_m ≤ italic_n iid draws from DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT that satisfies the following conditions which enable appropriate evaluation of causal estimation methods:

  1. (I)

    Dependence induced. DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT consists of samples drawn according to a new distribution P*(C,T,Y)superscript𝑃𝐶𝑇𝑌P^{*}(C,T,Y)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) that satisfies the dependence relation T⟂̸CT\not\perp\!\!\!\perp Citalic_T ⟂̸ ⟂ italic_C.

  2. (II)

    ATE identified. There exists a functional g𝑔gitalic_g of the RCT distribution and a functional hhitalic_h of the subsampled data distribution such that ATE=g(P(C,T,Y))=h(P*(C,T,Y))ATE𝑔𝑃𝐶𝑇𝑌superscript𝑃𝐶𝑇𝑌\text{ATE}=g(P(C,T,Y))=h(P^{*}(C,T,Y))ATE = italic_g ( italic_P ( italic_C , italic_T , italic_Y ) ) = italic_h ( italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) ).

Here, (II) is an important identification pre-condition that ensures that it is possible, at least in theory, to compute estimates of the ATE from DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT to match the ATE from DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT, the latter of which is treated as ground-truth in evaluation. From Fig. 1(a) it is clear that two sets of variables satisfy the backdoor criterion: the set C𝐶Citalic_C and the empty set. Thus, the ATE is identified from the RCT distribution P(C,T,Y)𝑃𝐶𝑇𝑌P(C,T,Y)italic_P ( italic_C , italic_T , italic_Y ) via the following two backdoor adjustment functionals,

ATE =cP(c)×(𝔼[Yt,c]𝔼[Yt,c])absentsubscript𝑐𝑃𝑐𝔼delimited-[]conditional𝑌𝑡𝑐𝔼delimited-[]conditional𝑌superscript𝑡𝑐\displaystyle=\sum_{c}P(c)\times\big{(}\mathbb{E}[Y\mid t,c]-\mathbb{E}[Y\mid t% ^{\prime},c]\big{)}= ∑ start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT italic_P ( italic_c ) × ( blackboard_E [ italic_Y ∣ italic_t , italic_c ] - blackboard_E [ italic_Y ∣ italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_c ] ) (2)
=𝔼[Yt]𝔼[Yt].absent𝔼delimited-[]conditional𝑌𝑡𝔼delimited-[]conditional𝑌superscript𝑡\displaystyle=\mathbb{E}[Y\mid t]-\mathbb{E}[Y\mid t^{\prime}].= blackboard_E [ italic_Y ∣ italic_t ] - blackboard_E [ italic_Y ∣ italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ] . (3)

Thus, a subsampling algorithm satisfies (II) if there is a functional h(P*(C,T,Y))superscript𝑃𝐶𝑇𝑌h(P^{*}(C,T,Y))italic_h ( italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) ) that is equal to equation 2 or equation 3.

For our purposes, we add the condition (I) so that estimation in the observational data does not reduce to equation 3. That is, we aim to produce samples according to a distribution P*superscript𝑃P^{*}italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT such that some adjustment is in fact necessary to produce unbiased ATE estimates. We note that (I) by itself is not sufficient to guarantee this; RCT subsampling procedures also require that there exists at least one pre-treatment covariate correlated with the outcome, i.e., CiCsubscript𝐶𝑖𝐶\exists\ C_{i}\in C∃ italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_C such that Ci⟂̸YC_{i}\not\perp\!\!\!\perp Yitalic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⟂̸ ⟂ italic_Y in P(C,T,Y)𝑃𝐶𝑇𝑌P(C,T,Y)italic_P ( italic_C , italic_T , italic_Y ) (Gentzel et al., 2021). However, this condition is easily testable, and we implement these checks in our synthetic experiments and real-world proof of concept (§4.2).

We now show a theoretical gap in existing approaches to subsampling RCTs, and propose a new algorithm that is theoretically guaranteed to satisfy conditions (I) and (II).

3.3 Non-identification in prior work

We claim that prior work that proposes RCT subsampling can result in observational samples from which the causal effect is not identified non-parametrically unless additional constraints are placed on the subsampling process. We consider Algorithm 2 in Gentzel et al. (2021) which does not explicitly impose such constraints and can be summarized as follows. Let S𝑆Sitalic_S be a binary variable indicating selection into the observational data from DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT. A structural equation S𝟙(T=Bernoulli(f(C)))𝑆1𝑇Bernoulli𝑓𝐶S\leftarrow\mathbbm{1}(T=\text{Bernoulli}(f(C)))italic_S ← blackboard_1 ( italic_T = Bernoulli ( italic_f ( italic_C ) ) ) is used to generate the selection variable, where f𝑓fitalic_f is a function defined by the researcher and 𝟙1\mathbbm{1}blackboard_1 corresponds to the indicator function. DOBSsubscript𝐷OBSD_{\text{{OBS}}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT is created by retaining only samples from DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT where S=1𝑆1S=1italic_S = 1. This results in P*(C,T,Y)=P(C,T,YS=1)superscript𝑃𝐶𝑇𝑌𝑃𝐶𝑇conditional𝑌𝑆1P^{*}(C,T,Y)=P(C,T,Y\mid S=1)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) = italic_P ( italic_C , italic_T , italic_Y ∣ italic_S = 1 ) which is Markov relative to the causal DAG in Fig. 1(b). From this DAG, it is easy to check via d-separation that condition (I) is satisfied as T⟂̸CS=1T\not\perp\!\!\!\perp C\mid S=1italic_T ⟂̸ ⟂ italic_C ∣ italic_S = 1. However, the following proposition shows that condition (II) is not satisfied.

Proposition 3.1.

Given n𝑛nitalic_n iid samples from a distribution P𝑃Pitalic_P that is Markov relative to Fig. 1(a), Algorithm 2 in Gentzel et al. (2021) draws samples according to a distribution P*superscript𝑃P^{*}italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT such that condition (II) is not satisfied.

We provide a proof in Appendix A.

The intuition behind the proof of Proposition 3.1 is as follows. Identification of the ATE relies on two pieces: the conditional mean of the outcome given treatment and covariates and the marginal distribution of covariates. From Fig. 1(b), we have 𝔼[Y|T,C]=𝔼[Y|T,C,S=1]𝔼delimited-[]conditional𝑌𝑇𝐶𝔼delimited-[]conditional𝑌𝑇𝐶𝑆1\mathbb{E}[Y|T,C]=\mathbb{E}[Y|T,C,S=1]blackboard_E [ italic_Y | italic_T , italic_C ] = blackboard_E [ italic_Y | italic_T , italic_C , italic_S = 1 ], but P(C)P(C|S=1)𝑃𝐶𝑃conditional𝐶𝑆1P(C)\not=P(C|S=1)italic_P ( italic_C ) ≠ italic_P ( italic_C | italic_S = 1 ). Indeed this marginal distribution cannot be identified via any non-parametric functional of the subsampled distribution P*(C,T,Y)superscript𝑃𝐶𝑇𝑌P^{*}(C,T,Y)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) (Bareinboim & Tian, 2015). However, this non-identification result holds assuming that there is no additional knowledge/constraints on how P*superscript𝑃P^{*}italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT is generated; in the next section we modify the sampling to place constraints on the generated distribution P*superscript𝑃P^{*}italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT that mitigate this issue.

3.4 RCT rejection sampling

We propose Algorithm 1, which uses a rejection sampling procedure to subsample RCTs. Rejection sampling is useful when the target distribution is difficult to sample from but there exists a proposal distribution which is easier to sample from and the proposal distribution (times a constant) forms an “upper envelope” for the target distribution (Murphy, 2012, Chapter 23.2). Similar ideas on resampling data based on ratios of propensity scores appear in Thams et al. (2023) and Bhattacharya & Nabi (2022) in the context of testing independence constraints in post-intervention distributions. Though the rejection sampler also selects samples based on a function of T𝑇Titalic_T and C𝐶Citalic_C, as in Fig. 1(b), we prove that additional constraints placed by the sampling strategy ensure identification holds in the new observed data distribution.

The intuition behind our algorithm is as follows. Sufficient constraints for maintaining identifiability of the ATE in P*(C,T,Y)superscript𝑃𝐶𝑇𝑌P^{*}(C,T,Y)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) via the functional in equation 2 are to ensure that P*(C)=P(C)superscript𝑃𝐶𝑃𝐶P^{*}(C)=P(C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C ) = italic_P ( italic_C ) and P*(YT,C)=P(YT,C)superscript𝑃conditional𝑌𝑇𝐶𝑃conditional𝑌𝑇𝐶P^{*}(Y\mid T,C)=P(Y\mid T,C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_Y ∣ italic_T , italic_C ) = italic_P ( italic_Y ∣ italic_T , italic_C ).666One could also consider maintaining equality of just the conditional mean of Y𝑌Yitalic_Y rather than the full conditional density. When this holds, it follows that equation 2 is equivalent to the adjustment functional h(P*(C,T,Y))=cP*(c)×(𝔼*[YT=t,c]𝔼*[YT=t,c])superscript𝑃𝐶𝑇𝑌subscript𝑐superscript𝑃𝑐superscript𝔼delimited-[]conditional𝑌𝑇𝑡𝑐superscript𝔼delimited-[]conditional𝑌𝑇superscript𝑡𝑐h(P^{*}(C,T,Y))=\sum_{c}P^{*}(c)\times(\mathbb{E}^{*}[Y\mid T=t,c]-\mathbb{E}^% {*}[Y\mid T=t^{\prime},c])italic_h ( italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) ) = ∑ start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_c ) × ( blackboard_E start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT [ italic_Y ∣ italic_T = italic_t , italic_c ] - blackboard_E start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT [ italic_Y ∣ italic_T = italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_c ] ), where 𝔼*superscript𝔼\mathbb{E}^{*}blackboard_E start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT denotes the expectation taken w.r.t P*(YT,C)superscript𝑃conditional𝑌𝑇𝐶P^{*}(Y\mid T,C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_Y ∣ italic_T , italic_C ). To also satisfy (I), we propose resampling with weights that modify P(T)𝑃𝑇P(T)italic_P ( italic_T ) to a new conditional distribution P*(TC)superscript𝑃conditional𝑇𝐶P^{*}(T\mid C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ).

The considerations listed in the prior paragraph inform our choice of an acceptance probability of 1M×P*(TC)P(T)1𝑀superscript𝑃conditional𝑇𝐶𝑃𝑇\frac{1}{M}\times\frac{P^{*}(T\mid C)}{P(T)}divide start_ARG 1 end_ARG start_ARG italic_M end_ARG × divide start_ARG italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ) end_ARG start_ARG italic_P ( italic_T ) end_ARG in the rejection sampler, where M𝑀Mitalic_M is the usual upper bound on the likelihood ratio used in the rejection sampler, which in our case is P*(T|C)P(T)superscript𝑃conditional𝑇𝐶𝑃𝑇\frac{P^{*}(T|C)}{P(T)}divide start_ARG italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) end_ARG start_ARG italic_P ( italic_T ) end_ARG.777In practice, we approximate M𝑀Mitalic_M from DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT as maxi{1,,n}P*(T=ti|Ci)mini{1,,n}P^(T=ti)subscript𝑖1𝑛superscript𝑃𝑇conditionalsubscript𝑡𝑖subscript𝐶𝑖subscript𝑖1𝑛^𝑃𝑇subscript𝑡𝑖\frac{\max_{i}\in\{1,\dots,n\}P^{*}(T=t_{i}|C_{i})}{\min_{i\in\{1,\dots,n\}}% \widehat{P}(T=t_{i})}divide start_ARG roman_max start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 1 , … , italic_n } italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T = italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG start_ARG roman_min start_POSTSUBSCRIPT italic_i ∈ { 1 , … , italic_n } end_POSTSUBSCRIPT over^ start_ARG italic_P end_ARG ( italic_T = italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG. Here, P*(TC)superscript𝑃conditional𝑇𝐶P^{*}(T\mid C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ) is a function specified by the evaluation designer that satisfies positivity (c,0<P*(TC=c)<1for-all𝑐0superscript𝑃conditional𝑇𝐶𝑐1\forall c,0<P^{*}(T\mid C=c)<1∀ italic_c , 0 < italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C = italic_c ) < 1 almost surely), and is a non-trivial function of C𝐶Citalic_C in the sense that P*(TC)P*(T)superscript𝑃conditional𝑇𝐶superscript𝑃𝑇P^{*}(T\mid C)\not=P^{*}(T)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ) ≠ italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ) for at least some values of T𝑇Titalic_T and C𝐶Citalic_C.

1:  Inputs: DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT consisting of n𝑛nitalic_n i.i.d. draws from P(C,T,Y)𝑃𝐶𝑇𝑌P(C,T,Y)italic_P ( italic_C , italic_T , italic_Y ); P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ), a function specified by evaluation designers; MsupP*(T|C)P(T)𝑀supremumsuperscript𝑃conditional𝑇𝐶𝑃𝑇M\geq\sup\frac{P^{*}(T|C)}{P(T)}italic_M ≥ roman_sup divide start_ARG italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) end_ARG start_ARG italic_P ( italic_T ) end_ARG, a constant computed empirically
2:  Output: DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT, a subset of DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT constructed according to a distribution P*(C,T,Y)superscript𝑃𝐶𝑇𝑌P^{*}(C,T,Y)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) which satisfies conditions (I) and (II)
3:  
4:  for each unit iDRCT𝑖subscript𝐷RCTi\in D_{\text{RCT}}italic_i ∈ italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT do
5:     Sample Uisubscript𝑈𝑖U_{i}italic_U start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT uniform on (0,1)01(0,1)( 0 , 1 )
6:     if Ui>P*(T=ti|Ci)P^(T=ti)Msubscript𝑈𝑖superscript𝑃𝑇conditionalsubscript𝑡𝑖subscript𝐶𝑖^𝑃𝑇subscript𝑡𝑖𝑀U_{i}>\frac{P^{*}(T=t_{i}|C_{i})}{\hat{P}(T=t_{i})M}italic_U start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT > divide start_ARG italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T = italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG start_ARG over^ start_ARG italic_P end_ARG ( italic_T = italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_M end_ARG then
7:        Ablegen i𝑖iitalic_i
8:     end
9:  end for
10:  Return: DOBSDRCT{discarded units}subscript𝐷OBSsubscript𝐷RCTdiscarded unitsD_{\text{OBS}}\leftarrow D_{\text{RCT}}-\{\text{discarded units}\}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT ← italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT - { discarded units }
Algorithm 1 RCT rejection sampling
Theorem 3.2.

Given n𝑛nitalic_n iid samples from a distribution P𝑃Pitalic_P that is Markov relative to Fig. 1(a), a confounding function P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) satisfying positivity, and MsupP*(T|C)P(T)𝑀supremumsuperscript𝑃conditional𝑇𝐶𝑃𝑇M\geq\sup\frac{P^{*}(T|C)}{P(T)}italic_M ≥ roman_sup divide start_ARG italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) end_ARG start_ARG italic_P ( italic_T ) end_ARG, the rejection sampler in Algorithm 1 draws samples from a distribution P*superscript𝑃P^{*}italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT, such that conditions (I) and (II) are satisfied.

Proof.

Rejection sampling generates samples from a target distribution P*(V1,,Vk)superscript𝑃subscript𝑉1subscript𝑉𝑘P^{*}(V_{1},\dots,V_{k})italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_V start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_V start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) by accepting samples from a proposal distribution P(V1,,VK)𝑃subscript𝑉1subscript𝑉𝐾P(V_{1},\dots,V_{K})italic_P ( italic_V start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_V start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ) with probability

1M×P*(V1,,Vk)P(V1,,Vk),1𝑀superscript𝑃subscript𝑉1subscript𝑉𝑘𝑃subscript𝑉1subscript𝑉𝑘\displaystyle\frac{1}{M}\times\frac{P^{*}(V_{1},\dots,V_{k})}{P(V_{1},\dots,V_% {k})},divide start_ARG 1 end_ARG start_ARG italic_M end_ARG × divide start_ARG italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_V start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_V start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_ARG start_ARG italic_P ( italic_V start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_V start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_ARG ,

where M𝑀Mitalic_M is a finite upper bound on the likelihood ratio P*/Psuperscript𝑃𝑃P^{*}/Pitalic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT / italic_P over the support of V1,,Vksubscript𝑉1subscript𝑉𝑘V_{1},\dots,V_{k}italic_V start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_V start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT.

We start with samples from an RCT, so our proposal distribution factorizes according to the causal DAG in Fig. 1(a): P(C,T,Y)=P(C)×P(T)×P(YT,C)𝑃𝐶𝑇𝑌𝑃𝐶𝑃𝑇𝑃conditional𝑌𝑇𝐶P(C,T,Y)=P(C)\times P(T)\times P(Y\mid T,C)italic_P ( italic_C , italic_T , italic_Y ) = italic_P ( italic_C ) × italic_P ( italic_T ) × italic_P ( italic_Y ∣ italic_T , italic_C ).

Our target distribution is one where T⟂̸CT\not\perp\!\!\!\perp Citalic_T ⟂̸ ⟂ italic_C, and factorizes as P*(C,T,Y)=P*(C)×P*(TC)×P*(YT,C)superscript𝑃𝐶𝑇𝑌superscript𝑃𝐶superscript𝑃conditional𝑇𝐶superscript𝑃conditional𝑌𝑇𝐶P^{*}(C,T,Y)=P^{*}(C)\times P^{*}(T\mid C)\times P^{*}(Y\mid T,C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) = italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C ) × italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ) × italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_Y ∣ italic_T , italic_C ), with additional constraints that P*(C)=P(C)superscript𝑃𝐶𝑃𝐶P^{*}(C)=P(C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C ) = italic_P ( italic_C ) and P*(YT,C)=P(YT,C)superscript𝑃conditional𝑌𝑇𝐶𝑃conditional𝑌𝑇𝐶P^{*}(Y\mid T,C)=P(Y\mid T,C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_Y ∣ italic_T , italic_C ) = italic_P ( italic_Y ∣ italic_T , italic_C ). This establishes the likelihood ratio,

P*(C,T,Y)P(C,T,Y)superscript𝑃𝐶𝑇𝑌𝑃𝐶𝑇𝑌\displaystyle\frac{P^{*}(C,T,Y)}{P(C,T,Y)}divide start_ARG italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) end_ARG start_ARG italic_P ( italic_C , italic_T , italic_Y ) end_ARG =P(C)×P*(TC)×P(YT,C)P(C)×P(T)×P(YT,C)absent𝑃𝐶superscript𝑃conditional𝑇𝐶𝑃conditional𝑌𝑇𝐶𝑃𝐶𝑃𝑇𝑃conditional𝑌𝑇𝐶\displaystyle=\frac{P(C)\times P^{*}(T\mid C)\times P(Y\mid T,C)}{P(C)\times P% (T)\times P(Y\mid T,C)}= divide start_ARG italic_P ( italic_C ) × italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ) × italic_P ( italic_Y ∣ italic_T , italic_C ) end_ARG start_ARG italic_P ( italic_C ) × italic_P ( italic_T ) × italic_P ( italic_Y ∣ italic_T , italic_C ) end_ARG
=P*(TC)P(T),absentsuperscript𝑃conditional𝑇𝐶𝑃𝑇\displaystyle=\frac{P^{*}(T\mid C)}{P(T)},= divide start_ARG italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ) end_ARG start_ARG italic_P ( italic_T ) end_ARG ,

and any choice of MsupP*(T|C)P(T)𝑀supremumsuperscript𝑃conditional𝑇𝐶𝑃𝑇M\geq\sup\frac{P^{*}(T|C)}{P(T)}italic_M ≥ roman_sup divide start_ARG italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) end_ARG start_ARG italic_P ( italic_T ) end_ARG used in the rejection sampler in Algorithm 1 produces samples from the desired distribution P*superscript𝑃P^{*}italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT, where the additional constraints satisfy the identification condition (II) and specification of P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) such that it truly depends on C𝐶Citalic_C satisfies condition (I). ∎

Since P*superscript𝑃P^{*}italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT satisfies T⟂̸CT\not\perp\!\!\!\perp Citalic_T ⟂̸ ⟂ italic_C and yields identification via the usual adjustment functional obtained in a conditionally ignorable causal model, Algorithm 1 can be thought of as producing samples exhibiting confounding bias similar to the causal DAG in Fig. 1(c), despite the selection mechanism. A longer argument for this qualitative claim is in Appendix B.

We conclude this section by noting that similar to prior works on RCT subsampling algorithms, the subsampling strategy in Algorithm 1 only requires researchers to specify a single function, P*(TC)superscript𝑃conditional𝑇𝐶P^{*}(T\mid C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ). Hence, our procedure satisfies our original desideratum of limited researcher degrees of freedom, while providing stronger theoretical guarantees for downstream empirical evaluation. However, specification of P*(TC)superscript𝑃conditional𝑇𝐶P^{*}(T\mid C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ) may still be challenging when C𝐶Citalic_C is high-dimensional. In Section 4.4, we discuss this finite data consideration and we use a proxy strategy for our proof of concept in which we have a low-dimensional confounding set C𝐶Citalic_C along with a set of high-dimensional covariates X𝑋Xitalic_X that serve as proxies of this confounding set.

3.5 Evidence from synthetic data

Synthetic DGP Setting Sampling Algorithm Abs. Bias (std.) Rel. Abs. Bias (std.) CI Cov.
1 |C|=1𝐶1|C|=1| italic_C | = 1, P(T=1)=0.3𝑃𝑇10.3P(T=1)=0.3italic_P ( italic_T = 1 ) = 0.3 Algorithm 2 from Gentzel et al. (2021) 0.222 (0.010) 0.089 (0.004) 0.00
RCT rejection sampling (This work) 0.009 (0.007) 0.004 (0.003) 0.97
2 |C|=1𝐶1|C|=1| italic_C | = 1, P(T=1)=0.5𝑃𝑇10.5P(T=1)=0.5italic_P ( italic_T = 1 ) = 0.5 Algorithm 2 from Gentzel et al. (2021) 0.009 (0.006) 0.003 (0.003) 0.98
RCT rejection sampling 0.007 (0.005) 0.003 (0.002) 0.98
3 |C|=5𝐶5|C|=5| italic_C | = 5, Nonlinear Algorithm 2 from Gentzel et al. (2021) 0.252 (0.010) 0.979 (0.037) 0.00
RCT rejection sampling 0.012 (0.009) 0.046 (0.034) 0.98
Table 2: Absolute bias (abs. bias) between ATE from DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT and the estimated ATE via backdoor adjustment on DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT created by each sampling algorithm. We also report abs. bias relative to the RCT ATE (rel. abs. bias) and the mean and standard deviation (std.) across samples from 1000 random seeds. In the final column, we report the confidence interval coverage (CI Cov.)—the proportion of 1000 random seeds for which the 95% confidence interval contains the true (RCT) ATE. The DGPs for Settings 1-3 are given in Appendix C.

Using synthetic DGPs for DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT, we produce DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT using Algorithm 2 from Gentzel et al. (2021) and separately via our RCT rejection sampler. We then compute ATE estimates using equation 2 for DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT and compare it to the ground-truth estimates using equation 3 in DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT. Section C in the appendix gives the full details of the data-generating processes (DGPs) for three settings. Briefly, the DGP in Setting 1 has a single confounding covariate C𝐶Citalic_C, sets P(T=1)=0.3𝑃𝑇10.3P(T=1)=0.3italic_P ( italic_T = 1 ) = 0.3, and has an interaction term TC𝑇𝐶TCitalic_T italic_C in the structural equation for Y𝑌Yitalic_Y. Setting 2 is the same as Setting 1 except we set P(T=1)=0.5𝑃𝑇10.5P(T=1)=0.5italic_P ( italic_T = 1 ) = 0.5. Setting 3 is a non-linear DGP with five covariates, C1,,C5subscript𝐶1subscript𝐶5C_{1},\dots,C_{5}italic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_C start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT. All methods are provided with the true adjustment set and functional form for the outcome regression, i.e., our experiments here use oracle estimators to validate the identification theory proposed in the previous subsection.

We construct 95% confidence intervals via bootstrapping and the percentile method Wasserman (2004) and report confidence interval coverage. After obtaining a sample DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT from the RCT via either RCT rejection sampling or Algorithm 2 from Gentzel et al., we resample DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT with replacement and calculate the ATE for that bootstrap sample. We repeat this for 1000 bootstrap samples and obtain a 95% confidence interval by taking the 2.5% and 97.5% points of the bootstrap distribution as the endpoints of the confidence interval. Across our 1000 random seeds of the synthetic DGPs, we measure the proportion of these confidence intervals that contain the true (RCT) ATE and report this metric as confidence interval coverage. See Appendix Section H for additional confidence interval plots.

Table 2 shows that our proposed RCT rejection sampler results in a reduction of absolute bias compared to Algorithm 2 from  Gentzel et al. (2021) by a factor of over 24 for Setting 1 (0.22/0.009) and a factor of 21 in Setting 3 (0.252/0.012). For Setting 3, Gentzel et al.’s procedure results in almost a 100% increase in bias relative to the gold RCT ATE of 0.260.26-0.26- 0.26. In Setting 2 where P(T=1)=0.5𝑃𝑇10.5P(T=1)=0.5italic_P ( italic_T = 1 ) = 0.5, the differences in absolute bias between the two algorithms is less pronounced.888We note Gentzel et al. (2021) primarily focus on the setting for which P(T=1)=P(T=0)=0.5𝑃𝑇1𝑃𝑇00.5P(T=1)=P(T=0)=0.5italic_P ( italic_T = 1 ) = italic_P ( italic_T = 0 ) = 0.5. However, their approach does not seem to generalize well outside of this setting, theoretically and empirically. In Settings 1 and 2, the confidence interval coverage for Gentzel et al.’s procedure is 0 whereas our RCT rejection sampling algorithm results in coverage of 0.97 and 0.98 for Settings 1 and 2 respectively, both slightly above the nominal 0.95 coverage. The results of the simulation are consistent with our theoretical findings that our algorithm permits identifiability under more general settings than prior work.

4 Finite Data Considerations and Proof of Concept

In the previous section, we provided theoretical guarantees for RCT rejection sampling and confirmed the algorithm results in low bias on synthetic data. In this section, we demonstrate how to put this proposed theory into practice and highlight considerations when working with finite real-world data. Our goal is to surface questions that must be asked and answered in creating useful and high-quality causal evaluation.

We also describe our specific approach towards each consideration as we create a proof of concept pipeline for empirical evaluation of high-dimensional backdoor adjustment methods. For this proof of concept, we use a large-scale, real-world RCT dataset with text as covariates. Although our approaches are specific to our proof of concept dataset, we believe other evaluation designers will benefit from a real-world example of how to put the theory and considerations into practice.

4.1 Considerations prior to using a specific RCT dataset

A necessary component for RCT subsampling is obtaining a real-world RCT dataset. This ensures a more realistic data generating processes compared to synthetic or semi-synthetic approaches (see Table 1). As Gentzel et al. (2021) note, there are many RCT repositories from a variety of disciplines from which evaluation designers could gather data. However, Gentzel et al. find many of these existing datasets only have one or two covariates that satisfy C⟂̸YC\not\perp\!\!\!\perp Yitalic_C ⟂̸ ⟂ italic_Y (see Consideration #1 below).

As we briefly mentioned in Section 1, for settings with just a few covariates one can often use simple estimation strategies with theoretical guarantees—e.g., parametric models or contingency tables—and empirical evaluation may not be particularly informative in this setting. Along these lines, we recommend that evaluation designers first ask themselves, Is empirical evaluation of causal estimators appropriate and necessary for this setting? Not all settings are in need of empirical evaluation.

A potentially untapped resource for RCT rejection sampling data is A/B tests from large online platforms. Other work, e.g., Eckles & Bakshy (2021), have used these types of experiments for constructed observational studies and we use such a dataset in our proof of concept. The large scale of these experiments can be advantageous since RCT subsampling reduces the number of units in the observational dataset by roughly half. Further, they often contain rich metadata and many covariates, which can be used to induce confounding in a way that emulates a high-dimensional setting.

4.1.1 Proof of concept approach

For our proof of concept, we choose a setting for which empirical evaluation is appropriate and needed: high-dimensional backdoor adjustment. Our high-dimensional covariates are the thousands of vocabulary words from text data, an application area that has generated a large amount of interest from applied practitioners, see Keith et al. (2020); Feder et al. (2022).

We use and publicly release a large, novel, real-world RCT (approximately 70k observations) that was run on an online scholarly search engine.999 The RCT was conducted on the Allen Institute for Artificial Intelligence’s Semantic Scholar platform https://www.semanticscholar.org/. Owners of the website conducted this experiment and gave us permission to use and release this data. Users arrive on a webpage that hosts metadata about a single academic paper and proceed to interact with this page. The RCT’s randomized binary treatment is swapping the ordering of two buttons—a PDF reader and a new “enhanced reader”. We set T=1𝑇1T=1italic_T = 1 as the setting where the “enhanced reader” is displayed first. The outcome of interest is a user clicking (Y=1𝑌1Y=1italic_Y = 1) or not clicking (Y=0𝑌0Y=0italic_Y = 0) on the enhanced reader button. The former action transports the user to a different webpage that provides a more interactive view of the publication. The RCT suggests that the treatment has a positive causal effect with an ATE of 0.113 computed using a simple difference of conditional means in the treated and untreated populations. See Appendix D for more details about the RCT.

4.2 Consideration #1: Checking necessary precondition C⟂̸YC\not\perp\!\!\!\perp Yitalic_C ⟂̸ ⟂ italic_Y

As we mentioned in Section 3, a necessary precondition for RCT subsampling in general is the existence of a causal edge between C𝐶Citalic_C and Y𝑌Yitalic_Y, implying C⟂̸YC\not\perp\!\!\!\perp Yitalic_C ⟂̸ ⟂ italic_Y. The relationship between C𝐶Citalic_C and Y𝑌Yitalic_Y is naturally occurring (not modified by evaluation designers) and the amount of confounding induced by sampling is, in part, contingent on this relationship (Gentzel et al., 2021). One can empirically check C⟂̸YC\not\perp\!\!\!\perp Yitalic_C ⟂̸ ⟂ italic_Y via independence tests, e.g., evaluating the odds ratio when both C𝐶Citalic_C and Y𝑌Yitalic_Y are binary variables. If there do not exist covariates, C𝐶Citalic_C such that C⟂̸YC\not\perp\!\!\!\perp Yitalic_C ⟂̸ ⟂ italic_Y, one cannot move forward in the evaluation pipeline using RCT subsampling.

4.2.1 Proof of concept approach

For our proof of concept, we use a subpopulation strategy to ensure the precondition C⟂̸YC\not\perp\!\!\!\perp Yitalic_C ⟂̸ ⟂ italic_Y is satisfied. We choose a single interpretable covariate to induce confounding: the field of study of manuscripts. Since CYC\perp\!\!\!\perp Yitalic_C ⟂ ⟂ italic_Y if and only if the odds ratio between C𝐶Citalic_C and Y𝑌Yitalic_Y is 1, we choose subpopulations of the full RCT that have high a odds ratio between a subset of the categorical field of study variable and the outcome Y𝑌Yitalic_Y. Specifically, we choose C𝐶Citalic_C to be a binary covariate representing one of two fields of study; for Subpopulation A, the field is either Physics or Medicine. In Appendix G, we implement the evaluation pipeline for an additional subpopulation with C𝐶Citalic_C chosen as the articles with Engineering or Business as the field of study. Substantively, one can interpret this high odds ratio as natural differences in click rates from users viewing articles from different fields of study.

We combine this subpopulation strategy with a proxy strategy in the next section to ensure that the estimation procedures only have access to high-dimensional covariates instead of our low-dimensional C𝐶Citalic_C. This has the benefit of simplifying the process of specifying P*(TC)superscript𝑃conditional𝑇𝐶P^{*}(T\mid C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ) while still ensuring that downstream modeling must still contend with high-dimensional adjustment.

T𝑇Titalic_TY𝑌Yitalic_YC𝐶Citalic_CX𝑋Xitalic_X
RCT Dataset C𝐶Citalic_C categories n RCT ATE OR(C𝐶Citalic_C, Y𝑌Yitalic_Y)
Full - 69,675 0.113 -
Subpopulation A Physics, Medicine 4,379 0.096 1.8
Figure 2: Proof of concept approach. Left figure. Causal DAG for the proxy strategy. The blue edges are confirmed to empirically exist in the finite dataset. The red edge is selected by the evaluation designers via P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ). Right table. RCT dataset descriptive statistics including the number of units in the population/subpopulation (n𝑛nitalic_n) and the odds ratio, OR(C,Y)𝑂𝑅𝐶𝑌OR(C,Y)italic_O italic_R ( italic_C , italic_Y ).

4.3 Consideration #2: Specification of P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C )

Evaluation designers using RCT rejection sampling have one degree of freedom: specification of P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ). We describe one specific approach in our proof of concept to choose P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ), but we anticipate evaluation designers using RCT rejection sampling to create large empirical benchmarks may want to include many different parameterizations of P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) to evaluate empirical performance of methods under numerous settings. Consideration #3 describes approaches to diagnosing the choice of P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) for a specific finite dataset.

4.3.1 Proof of concept approach

Proxy strategy. In Section 3, we briefly mentioned that specifying a suitable confounding function P*(TC)superscript𝑃conditional𝑇𝐶P^{*}(T\mid C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ) may be difficult when C𝐶Citalic_C is high-dimensional. A key property of our RCT is that it has high-dimensional text data, X𝑋Xitalic_X, that is a proxy (with almost perfect predictive accuracy) for low-dimensional structured metadata—categories of scientific articles, e.g., Physics or Medicine. We use this structured metadata as the covariates C𝐶Citalic_C in our RCT rejection sampler, but provide the causal estimation methods only X,Y𝑋𝑌X,Yitalic_X , italic_Y and T𝑇Titalic_T. Note that as evaluation designers, we still have access to C𝐶Citalic_C to run diagnostics. This proxy strategy helps simplify the specification of P*(TC)superscript𝑃conditional𝑇𝐶P^{*}(T\mid C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ) we use in the rejection sampler and avoids direct specification of a function involving high-dimensional covariates. Others have used similar proxy strategies for text in semi-synthetic evaluations, e.g., Roberts et al. (2020); Veitch et al. (2020). Such a technique may also be applied in other RCTs, e.g., healthcare studies where one or two important biomarkers serve as low-dimensional confounding variables, and the larger electronic health record data serves as the proxy X𝑋Xitalic_X.

Using X𝑋Xitalic_X. For each document i𝑖iitalic_i, the high-dimensional covariates Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a bag-of-words representation of the document’s concatenated title and abstract given a 2000-unigram vocabulary. A new vocabulary is created for each RCT subpopulation; see Appendix E for details. We check that there is high predictive accuracy of P(C|X)𝑃conditional𝐶𝑋P(C|X)italic_P ( italic_C | italic_X ) to ensure the plausibility of causal estimation models only having access to X𝑋Xitalic_X.101010We leave to future work correcting for measurement error with noisy proxies X𝑋Xitalic_X; see Wood-Doughty et al. (2018). To measure this predictive accuracy, we model P(C|X)𝑃conditional𝐶𝑋P(C|X)italic_P ( italic_C | italic_X ) with a logistic regression111111 Using scikit-learn Pedregosa et al. (2011) and an elasticnet penalty, L1 ratio 0.1, class-weight balanced, and SAGA solver. classifier. Averaged across held-out test folds, the F1 score is 0.98 and the average precision is 0.99 for Subpopulation A (Physics, Medicine).

Specifying P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ). Since C𝐶Citalic_C is binary in our proof of concept pipeline, we choose a simple interpretable piece-wise function,

P*(Ti=1|Ci)={ζ0 if Ci=0ζ1 if Ci=1superscript𝑃subscript𝑇𝑖conditional1subscript𝐶𝑖casessubscript𝜁0 if subscript𝐶𝑖0𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒subscript𝜁1 if subscript𝐶𝑖1𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒\displaystyle P^{*}(T_{i}=1|C_{i})=\begin{cases}\zeta_{0}\text{ if }C_{i}=0\\ \zeta_{1}\text{ if }C_{i}=1\end{cases}italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 | italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = { start_ROW start_CELL italic_ζ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT if italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT if italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 end_CELL start_CELL end_CELL end_ROW (4)

for each document i𝑖iitalic_i, where 0<ζ0<10subscript𝜁010<\zeta_{0}<10 < italic_ζ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT < 1 and 0<ζ1<10subscript𝜁110<\zeta_{1}<10 < italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < 1 are parameters chosen by the evaluation designers. We choose ζ0,ζ1subscript𝜁0subscript𝜁1\zeta_{0},\zeta_{1}italic_ζ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT for the remainder of the proof of concept pipeline via the diagnostics in Consideration #3.

4.4 Consideration #3: Diagnostics for P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C )

We recommend running diagnostics on the sampling procedure because, although Theorem 3.2 proves that our RCT rejection sampling permits identification, this is an asymptotic statement, and the sampler can produce finite samples where confounding bias is induced, but difficult to correct for.

4.4.1 Proof of concept approach

Refer to caption
Figure 3: Diagnostic plots for proof of concept pipeline. For Subpopulation A data, each plot is the parameterization of P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) in Equation 4, which is specified by the evaluation designer. Each blue circle is a different random seed (100 seeds total per plot/parameterization).

For our proof of concept pipeline, we step through the following diagnostics on the sampling procedure. Recall the notation from Section 3, DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT is our RCT dataset (here, from Subpopulation A) and DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT is the resulting observational dataset after RCT rejection sampling. First, we check empirically that overlap is satisfied for C𝐶Citalic_C in DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT, i.e., 0<P^(T=1|C=c)<10^𝑃𝑇conditional1𝐶𝑐10<\hat{P}(T=1|C=c)<10 < over^ start_ARG italic_P end_ARG ( italic_T = 1 | italic_C = italic_c ) < 1 for all c𝑐citalic_c.

Second, in Figure 3 we compare the amount of confounding induced to the error in the oracle adjustment for different ζ0,ζ1subscript𝜁0subscript𝜁1\zeta_{0},\zeta_{1}italic_ζ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT in Equation 4 across 100 random seeds (blue dots). On the y-axis, we plot the absolute difference between the ATE for DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT (GoldATE) and exact backdoor estimates121212The exact backdoor equation is tractable because T,C𝑇𝐶T,Citalic_T , italic_C and Y𝑌Yitalic_Y are all binary. obtained from using the oracle adjustment set C𝐶Citalic_C. On the x-axis, we plot the absolute difference between the ATE on DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT (GoldATE) and the unadjusted naive estimate on DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT. In general, we want more samples to fall below the y=x𝑦𝑥y=xitalic_y = italic_x line (shown in red) since this means that more confounding was induced than there is error in estimation. Samples above the y=x𝑦𝑥y=xitalic_y = italic_x have more error from the sampling process than the amount of confounding induced, and thus are not useful in benchmarking methods that adjust for confounding. Of the settings in Figure 3, ζ0=0.85subscript𝜁00.85\zeta_{0}=0.85italic_ζ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0.85 and ζ1=0.15subscript𝜁10.15\zeta_{1}=0.15italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.15 had the best proportion of sampled datasets lying below the red line, so we choose these parameters for our proof of concept pipeline. We leave to future work providing more guidance on choosing settings of P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) for a comprehensive benchmark.

4.5 Consideration #4: Modeling

The primary goal of this work is to create clear steps to follow during the evaluation design phase. Although this stage precedes a thorough modeling effort, we recommend that one runs baseline models to check for potential issues.

4.5.1 Proof of concept approach

As a proof of concept, we apply baseline causal estimation models to the resulting DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT datasets after RCT rejection sampling (with many random seeds); as we mention above. We implement131313We attempted to use the EconML package Microsoft Research (2019), but at of the time of our experiments it did not support using sparse matrices, which is required for our high-dimensional datasets. commonly-used causal estimation methods via two steps: (1) fitting base learners and (2) using causal estimators that combine the base learners via plug-in principles or second-stage regression. We took care to ensure we use the same pre-trained base learners—same functional form and learned weights—as inputs into any appropriate causal estimator.

Base learners. We implement base learners for:141414Note for a binary outcome Y𝑌Yitalic_Y, we can rewrite the above equations with probabilities, as 𝔼[Y]=P(Y=1)𝔼delimited-[]conditional𝑌𝑃𝑌conditional1\mathbb{E}[Y\mid\cdot]=P(Y=1\mid\cdot)blackboard_E [ italic_Y ∣ ⋅ ] = italic_P ( italic_Y = 1 ∣ ⋅ ).

QT0(x)subscript𝑄subscript𝑇0𝑥\displaystyle Q_{T_{0}}(x)italic_Q start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x ) :=𝔼[Y|T=0,X=x]assignabsent𝔼delimited-[]formulae-sequenceconditional𝑌𝑇0𝑋𝑥\displaystyle:=\mathbb{E}[Y|T=0,X=x]:= blackboard_E [ italic_Y | italic_T = 0 , italic_X = italic_x ] (5)
QT1(x)subscript𝑄subscript𝑇1𝑥\displaystyle Q_{T_{1}}(x)italic_Q start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x ) :=𝔼[Y|T=1,X=x]assignabsent𝔼delimited-[]formulae-sequenceconditional𝑌𝑇1𝑋𝑥\displaystyle:=\mathbb{E}[Y|T=1,X=x]:= blackboard_E [ italic_Y | italic_T = 1 , italic_X = italic_x ] (6)
g(x)𝑔𝑥\displaystyle g(x)italic_g ( italic_x ) :=P(T=1|X=x)assignabsent𝑃𝑇conditional1𝑋𝑥\displaystyle:=P(T=1|X=x):= italic_P ( italic_T = 1 | italic_X = italic_x ) (7)
QX(x)subscript𝑄𝑋𝑥\displaystyle Q_{X}(x)italic_Q start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x ) :=𝔼[Y|X=x]assignabsent𝔼delimited-[]conditional𝑌𝑋𝑥\displaystyle:=\mathbb{E}[Y|X=x]:= blackboard_E [ italic_Y | italic_X = italic_x ] (8)

In our application both T𝑇Titalic_T and Y𝑌Yitalic_Y are binary variables, so we use an ensemble of gradient boosted decision trees (catboost) 151515Using CatBoost (Dorogush et al., 2018) with default parameters and without cross-validation. and logistic regression161616 Using scikit-learn (Pedregosa et al., 2011) and an elasticnet penalty, L1 ratio 0.1, balanced class weights, and SAGA solver. We tune the regularization parameter C𝐶Citalic_C via cross-validation over the set C1e4,1e3,1e2,1e1,1e0,1e1𝐶1superscript𝑒41superscript𝑒31superscript𝑒21superscript𝑒11superscript𝑒01superscript𝑒1C\in{1e^{-4},1e^{-3},1e^{-2},1e^{-1},1e^{0},1e^{1}}italic_C ∈ 1 italic_e start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT , 1 italic_e start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT , 1 italic_e start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT , 1 italic_e start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT , 1 italic_e start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , 1 italic_e start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT. for our base learners. We fit our models using cross-fitting (Hansen, 2000; Newey & Robins, 2018) and cross-validation; see Appendix F for more details.

Causal estimators. After training base learners with cross-fitting, we implement the following plug-in causal estimators: backdoor adjustment (outcome regression) (Q), inverse propensity of treatment weighting (IPTW), and augmented inverse propensity of treatment weighting (AIPTW) (Robins et al., 1994). We also use DoubleML (Chernozhukov et al., 2018) which applies an ordinary least squares on residuals from the base learners. See Appendix F for exact estimation equations.

g^(x)^𝑔𝑥\hat{g}(x)over^ start_ARG italic_g end_ARG ( italic_x ) Q^T0(x)subscript^𝑄subscript𝑇0𝑥\hat{Q}_{T_{0}}(x)over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x ) Q^T1(x)subscript^𝑄subscript𝑇1𝑥\hat{Q}_{T_{1}}(x)over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x ) Q^X(x)subscript^𝑄𝑋𝑥\hat{Q}_{X}(x)over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x )
Prediction Ave. Prec. (\uparrow better) train inference train inference train inference train inference
linear 0.89 (0.07) 0.59 (0.02) 0.63 (0.32) 0.03 (0.01) 0.85 (0.17) 0.13 (0.03) 0.71 (0.2) 0.06 (0.01)
catboost (nonlinear) 0.97 (0.0) 0.60 (0.02) 1.0 (0.0) 0.03 (0.01) 0.99 (0.01) 0.13 (0.02) 0.98 (0.01) 0.05 (0.01)
Causal Rel. Abs. Error (\downarrow better) Unadjusted (baseline) Backdoor C𝐶Citalic_C (oracle) τ^Qsubscript^𝜏𝑄\hat{\tau}_{Q}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT τ^IPTWsubscript^𝜏IPTW\hat{\tau}_{\text{IPTW}}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT IPTW end_POSTSUBSCRIPT τ^AIPTWsubscript^𝜏AIPTW\hat{\tau}_{\text{AIPTW}}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT AIPTW end_POSTSUBSCRIPT τ^DMLsubscript^𝜏DML\hat{\tau}_{\text{DML}}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT DML end_POSTSUBSCRIPT
linear 0.21 (0.08) 0.12 (0.09) 1.46 (1.04) 0.47 (0.16) 1.6 (0.66) 1.91 (0.9)
catboost (nonlinear) 0.21 (0.08) 0.12 (0.09) 0.24 (0.1) 0.14 (0.11) 0.11 (0.1) 0.13 (0.1)
Table 3: Modeling results for subpopulation A. Top: Predictive models’ average precision (ave. prec.) for training (yellow) and inference (green) data splits. Bottom: Causal estimation models’ relative absolute error (rel. abs. error) between the models’ estimated ATE and the RCT ATE. Here, darker shades of red indicate worse causal estimates. Baselines, unadjusted conditional mean on the samples (unadjusted) and the backdoor adjustment with the oracle C𝐶Citalic_C (backdoor C), are uncolored. We use two baselearner settings: linear and catboost (nonlinear). We report both the average and standard deviation (in parentheses) over 100 random seeds during sampling. All settings use P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) in equation 4 parameterized by ζ0=0.85,ζ1=0.15formulae-sequencesubscript𝜁00.85subscript𝜁10.15\zeta_{0}=0.85,\zeta_{1}=0.15italic_ζ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0.85 , italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.15.

Modeling results. Table 3 shows results for Subpopulation A. Since both T𝑇Titalic_T and Y𝑌Yitalic_Y are binary, we report average precision (AP) for the base learners on both the training and inference folds; this metric is only calculated for observed (not counterfactual) observations. We also report the relative absolute error (RAE) between estimators on DOBSsubscript𝐷OBSD_{\text{OBS}}italic_D start_POSTSUBSCRIPT OBS end_POSTSUBSCRIPT and the ATE for DRCTsubscript𝐷RCTD_{\text{RCT}}italic_D start_POSTSUBSCRIPT RCT end_POSTSUBSCRIPT. Comparing predictive base learners, the propensity score model, g(x)𝑔𝑥g(x)italic_g ( italic_x ), has much higher AP on inference folds than models that involve the outcome. As we previously mentioned, RCT subsampling allows us to set the relationship between X𝑋Xitalic_X (via proxy for C𝐶Citalic_C) and T𝑇Titalic_T but not X𝑋Xitalic_X and Y𝑌Yitalic_Y so the low AP for outcome models could reflect the difficulty in estimating this “natural” relationship between X𝑋Xitalic_X and Y𝑌Yitalic_Y. See Section 4.6.1 for additional discussion on low average precision for the outcome models (Q𝑄Qitalic_Q).

For causal estimators, we see that the doubly robust estimator AIPTW using catboost has the lowest estimation error—on par with estimates obtained using the oracle backdoor adjustment set C𝐶Citalic_C. It appears the doubly robust estimators using linear models do not recover from the poor predictive performance of the outcome models and IPTW is better in this setting. Though seemingly counterintuitive that the linear and catboost models have similar predictive performance but large differences in the causal estimation error, this discrepancy between predictive performance and causal estimation error is consistent with theoretical results on fitting nuisance models in causal inference (Tsiatis, 2007; Shortreed & Ertefaie, 2017) and empirical results from semi-synthetic evaluation (Shi et al., 2019; Wood-Doughty et al., 2021). Although we used best practices from machine learning to fit our base learners, these results suggest future work is needed to adapt machine learning practices to the goals of causal estimation.

4.6 Consideration #5: Additional finite data challenges

The broader purpose of this line of empirical evaluation is to understand the real-world settings for which certain estimation approaches are successful or unsuccessful. We believe bridging the gap between theory that holds asymptotically and finite data is important for drawing valid causal inference, but evaluation designers might encounter unforeseen challenges particular to finite data that need to be examined carefully.

4.6.1 Proof of concept approach

In our proof of concept pipeline, we hypothesize the outcome models have very low average precision (Table 3) because of finite data issues with class imbalance. In particular, for Subpopulation A, 82% of our data is C=1𝐶1C=1italic_C = 1 and 𝔼[Y]=0.07𝔼delimited-[]𝑌0.07\mathbb{E}[Y]=0.07blackboard_E [ italic_Y ] = 0.07 so there are few examples to learn from in the smallest category (C=0𝐶0C=0italic_C = 0, Y=1𝑌1Y=1italic_Y = 1): only 34 documents. This shows that even with our RCT that has a relatively large size (roughly 4k units) compared to other real-world RCTs, there are challenges with having sufficient support.

5 Discussion and Future Work

Unlike predictive evaluation, empirical evaluation for causal estimation is challenging and still at a nascent stage. In this work, we argue that one of the most promising paths forward is to use a RCT subsampling strategy, and to this line of work, we contribute an RCT rejection sampler with theoretical guarantees. We showed the utility of this algorithm in a proof of concept pipeline with a novel, real-world RCT to empirically evaluate high-dimensional backdoor adjustment methods.

Of course, there are critics of emprical evaluation. Hernán (2019) pejoratively compared a competition for causal estimation to evaluating “spherical cows in a vacuum” and claimed this discounted necessary subject-matter expertise. Even in the machine learning community, researchers warn against “mindless bake-offs” of methods (Langley, 2011), and in some cases the creation of benchmarks has led to the community overfitting to benchmarks, e.g., Recht et al. (2019). However, in the absence of theory or when theoretical assumptions do not match reality, we see empirical evaluation as a necessary, but not exclusive, part of the broader field of causal inference.

A fruitful future direction is for evaluation designers to use our RCT rejection sampler to create comprehensive benchmarks for various phenomena of interest: not only high-dimensional confounding but also heterogeneous treatment effects, unmeasured confounding, missing data, etc. This would involve gathering more RCTs and establishing interesting ways to set P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ). Our proof of concept evaluation pipeline demonstrated the utility of RCT subsampling but there were many avenues we chose not to pursue such as: measurement error, causal null hypothesis tests, or moving to more sophisticated natural language processing approaches beyond bag-of-words, e.g., the CausalBERT model (Veitch et al., 2020).

In another direction, applied practitioners need guidance on which causal estimation method to use given their specific observational data. Although other work has attempted to link observational data to experimental data (real or synthetic) in which the ground-truth is known (Neal et al., 2020; Kallus et al., 2018), we believe RCT subsampling could help with meta analyses of which combination of techniques work best under which settings. Overall, we see this work as contributing to a much larger research agenda on empirical evaluation for causal estimation.

Broader Impact Statement

We conducted this research with ethical due diligence. Our real-world RCT dataset was implemented by owners of the online platform and in full compliance with the platform’s user agreement. The platform owners gave us explicit permission to use and access this dataset. Our dataset contains paper titles and abstracts, which are already publicly available from many sources, and we have removed any potentially personally identifiable information from the dataset, e.g., author names, user ids, user IP addresses, or session ids. By releasing this data, we do not anticipate any harm to authors or users.

Like any technological innovation, our proposed RCT rejection sampling algorithm and evaluation pipeline have the potential for dual use—to both benefit or harm society depending on the actions of the humans using the technology. We anticipate there could be substantial societal benefit from more accurate estimation of causal effects of treatments in the medical or public policy spheres. However, other applications of causal inference could potentially harm society by controlling or manipulating individuals. Despite these tradeoffs in downstream applications, we feel strongly this paper’s contributions will result in net overall benefit to the research community and society at large.

Author Contributions

KK conceived the original idea of the project and managed the project. RB contributed the ideas behind Algorithm 1 as well as the proofs in Section 3 and the Appendix. RB and KK implemented the synthetic experiments in Section 3. KK gathered and cleaned the data for the proof of concept pipeline in Section 4. KK and SF implemented the proof of concept empirical pipeline in Section 4. KK and RB wrote the first draft of the manuscript. KK, SF, DJ, JB, and RB guided the research ideas and experiments and edited the manuscript.

Acknowledgments

The authors gratefully thank David Jensen, Amanda Gentzel, Purva Pruthi, Doug Downey, Brandon Stewart, Zach Wood-Doughty and Jacob Eisenstein for comments on earlier drafts of this manuscript. The authors also thank anonymous reviewers from ICML and TMLR for helpful comments. Special thanks to the Semantic Scholar team at the Allen Institute for Artificial Intelligence for help gathering the real-world RCT dataset.

References

  • Arceneaux et al. (2006) Kevin Arceneaux, Alan S Gerber, and Donald P Green. Comparing experimental and matching methods using a large-scale voter mobilization experiment. Political Analysis, 14(1):37–62, 2006.
  • Athey et al. (2018) Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623, 2018.
  • Bareinboim & Tian (2015) Elias Bareinboim and Jin Tian. Recovering causal effects from selection bias. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
  • Belloni et al. (2014) Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 2014.
  • Bhattacharya & Nabi (2022) Rohit Bhattacharya and Razieh Nabi. On testability of the front-door model via Verma constraints. In The 38th Conference on Uncertainty in Artificial Intelligence, 2022.
  • Bhattacharya et al. (2022) Rohit Bhattacharya, Razieh Nabi, and Ilya Shpitser. Semiparametric inference for causal effects in graphical models with hidden variables. Journal of Machine Learning Research, 23:1–76, 2022.
  • Chernozhukov et al. (2018) Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James M Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
  • Dahabreh et al. (2022) Issa J Dahabreh, Jon A Steingrimsson, James M Robins, and Miguel A Hernán. Randomized trials and their observational emulations: A framework for benchmarking and joint analysis. arXiv preprint arXiv:2203.14857, 2022.
  • D’Amour & Franks (2021) Alexander D’Amour and Alexander Franks. Deconfounding scores: Feature representations for causal effect estimation with weak overlap. arXiv preprint arXiv:2104.05762, 2021.
  • Dorie et al. (2019) Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, and Dan Cervone. Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1):43–68, 2019.
  • Dorogush et al. (2018) Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363, 2018.
  • Eckles & Bakshy (2021) Dean Eckles and Eytan Bakshy. Bias and high-dimensional adjustment in observational studies of peer effects. Journal of the American Statistical Association, 116(534):507–517, 2021.
  • Farrell et al. (2021) Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. Econometrica, 89(1):181–213, 2021.
  • Feder et al. (2022) Amir Feder, Katherine A Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E Roberts, et al. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. Transactions of the Association for Computational Linguistics, 10:1138–1158, 2022.
  • Gentzel et al. (2019) Amanda M Gentzel, Dan Garant, and David Jensen. The case for evaluating causal models using interventional measures and empirical data. Advances in Neural Information Processing Systems, 32, 2019.
  • Gentzel et al. (2021) Amanda M Gentzel, Purva Pruthi, and David Jensen. How and why to use experimental data to evaluate methods for observational causal inference. In International Conference on Machine Learning, pp. 3660–3671. PMLR, 2021.
  • Gordon et al. (2019) Brett R Gordon, Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky. A comparison of approaches to advertising measurement: Evidence from big field experiments at Facebook. Marketing Science, 38(2):193–225, 2019.
  • Gordon et al. (2022) Brett R Gordon, Robert Moakler, and Florian Zettelmeyer. Close enough? A large-scale exploration of non-experimental approaches to advertising measurement. arXiv preprint arXiv:2201.07055, 2022.
  • Hansen (2000) Bruce E Hansen. Sample splitting and threshold estimation. Econometrica, 68(3):575–603, 2000.
  • Hernán (2019) Miguel A Hernán. Comment: Spherical cows in a vacuum: data analysis competitions for causal inference. Statistical Science, 34(1):69–71, 2019.
  • Hill (2011) Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
  • Hill et al. (2004) Jennifer L Hill, Jerome P Reiter, and Elaine L Zanutto. A comparison of experimental and observational data analyses. Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives: An Essential Journey with Donald Rubin’s Statistical Family, pp.  49–60, 2004.
  • Holland (1986) Paul W Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.
  • Jaciw (2016) Andrew P Jaciw. Assessing the accuracy of generalized inferences from comparison group studies using a within-study comparison approach: The methodology. Evaluation Review, 40(3):199–240, 2016.
  • Johansson et al. (2016) Fredrik Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning, pp. 3020–3029. ICML, 2016.
  • Kallus et al. (2018) Nathan Kallus, Aahlad Manas Puli, and Uri Shalit. Removing hidden confounding by experimental grounding. Advances in Neural Information Processing Systems, 31, 2018.
  • Keith et al. (2020) Katherine Keith, David Jensen, and Brendan O’Connor. Text and causal inference: A review of using text to remove confounding from causal estimates. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
  • Künzel et al. (2019) Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019.
  • Kuroki & Pearl (2014) Manabu Kuroki and Judea Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.
  • LaLonde (1986) Robert J LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pp.  604–620, 1986.
  • Langley (2011) Pat Langley. The changing science of machine learning. Machine Learning, 82(3):275–279, 2011.
  • Li et al. (2016) Sheng Li, Nikos Vlassis, Jaya Kawale, and Yun Fu. Matching via dimensionality reduction for estimation of treatment effects in digital marketing campaigns. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp.  3768–3774, 2016.
  • Maathuis et al. (2009) Marloes H Maathuis, Markus Kalisch, and Peter Bühlmann. Estimating high-dimensional intervention effects from observational data. The Annals of Statistics, 37(6A):3133–3164, 2009.
  • Microsoft Research (2019) Microsoft Research. EconML: A Python package for ML-based heterogeneous treatment effects estimation, 2019. URL https://github.com/microsoft/EconML.
  • Murphy (2012) Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT press, 2012.
  • Neal et al. (2020) Brady Neal, Chin-Wei Huang, and Sunand Raghupathi. RealCause: Realistic causal inference benchmarking. arXiv preprint arXiv:2011.15007, 2020.
  • Newey & Robins (2018) Whitney K Newey and James M Robins. Cross-fitting and fast remainder rates for semiparametric estimation. arXiv preprint arXiv:1801.09138, 2018.
  • Parikh et al. (2022) Harsh Parikh, Carlos Varjao, Louise Xu, and Eric J Tchetgen Tchetgen. Validating causal inference methods. In International Conference on Machine Learning, pp. 17346–17358. PMLR, 2022.
  • Pearl (1995) Judea Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.
  • Pearl (2009) Judea Pearl. Causality. Cambridge University Press, 2009.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Recht et al. (2019) Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pp. 5389–5400. PMLR, 2019.
  • Reisach et al. (2021) Alexander Reisach, Christof Seiler, and Sebastian Weichwald. Beware of the simulated DAG! causal discovery benchmarks may be easy to game. Advances in Neural Information Processing Systems, 34:27772–27784, 2021.
  • Richardson & Robins (2013) Thomas S Richardson and James M Robins. Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Center for the Statistics and the Social Sciences, University of Washington Working Paper 128, pp.  1–146, 2013.
  • Roberts et al. (2020) Margaret E Roberts, Brandon M Stewart, and Richard A Nielsen. Adjusting for confounding with text matching. American Journal of Political Science, 64(4):887–903, 2020.
  • Robins (1986) James M Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393–1512, 1986.
  • Robins et al. (1994) James M Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89:846–866, 1994.
  • Rotnitzky & Smucler (2020) Andrea Rotnitzky and Ezequiel Smucler. Efficient adjustment sets for population average causal treatment effect estimation in graphical models. Journal of Machine Learning Research, 21:188–1, 2020.
  • Schmidt et al. (2022) Aurora C Schmidt, Christopher J Cameron, Corey Lowman, Joshua Brulé, Amruta J Deshpande, Seyyed A Fatemi, Vladimir Barash, Ariel M Greenberg, Cash J Costello, Eli S Sherman, et al. Searching for explanations: Testing social scientific methods in synthetic ground-truthed worlds. Computational and Mathematical Organization Theory, pp. 1–32, 2022.
  • Shadish et al. (2008) William R Shadish, Margaret H Clark, and Peter M Steiner. Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. Journal of the American Statistical Association, 103(484):1334–1344, 2008.
  • Shi et al. (2019) Claudia Shi, David Blei, and Victor Veitch. Adapting neural networks for the estimation of treatment effects. Advances in Neural Information Processing Systems, 32, 2019.
  • Shimoni et al. (2018) Yishai Shimoni, Chen Yanover, Ehud Karavani, and Yaara Goldschmnidt. Benchmarking framework for performance-evaluation of causal inference analysis. arXiv preprint arXiv:1802.05046, 2018.
  • Shortreed & Ertefaie (2017) Susan M Shortreed and Ashkan Ertefaie. Outcome-adaptive lasso: variable selection for causal inference. Biometrics, 73(4):1111–1122, 2017.
  • Spirtes et al. (2000) Peter L Spirtes, Clark N Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT press, 2000.
  • Stekhoven et al. (2012) Daniel J Stekhoven, Izabel Moraes, Gardar Sveinbjörnsson, Lars Hennig, Marloes H Maathuis, and Peter Bühlmann. Causal stability ranking. Bioinformatics, 28(21):2819–2823, 2012.
  • Thams et al. (2023) Nikolaj Thams, Sorawit Saengkyongam, Niklas Pfister, and Jonas Peters. Statistical testing under distributional shifts. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(3):597–663, 2023.
  • Tsiatis (2007) Anastasios Tsiatis. Semiparametric Theory and Missing Data. Springer Science & Business Media, 2007.
  • Veitch et al. (2020) Victor Veitch, Dhanya Sridhar, and David Blei. Adapting text embeddings for causal inference. In Conference on Uncertainty in Artificial Intelligence, pp. 919–928. PMLR, 2020.
  • Wasserman (2004) Larry Wasserman. All of statistics: a concise course in statistical inference, volume 26. Springer, 2004.
  • Weld et al. (2022) Galen Weld, Peter West, Maria Glenski, David Arbour, Ryan A Rossi, and Tim Althoff. Adjusting for confounders with text: Challenges and an empirical evaluation framework for causal inference. In Proceedings of the International AAAI Conference on Web and Social Media, volume 16, pp.  1109–1120, 2022.
  • Wood-Doughty et al. (2018) Zach Wood-Doughty, Ilya Shpitser, and Mark Dredze. Challenges of using text classifiers for causal inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, volume 2018, pp.  4586. NIH Public Access, 2018.
  • Wood-Doughty et al. (2021) Zach Wood-Doughty, Ilya Shpitser, and Mark Dredze. Generating synthetic text data to evaluate causal inference methods. arXiv preprint arXiv:2102.05638, 2021.
  • Zeng et al. (2022) Jiaming Zeng, Michael F Gensheimer, Daniel L Rubin, Susan Athey, and Ross D Shachter. Uncovering interpretable potential confounders in electronic medical records. Nature Communications, 13(1):1–14, 2022.
  • Zhang & Bareinboim (2021) Junzhe Zhang and Elias Bareinboim. Bounding causal effects on continuous outcome. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  12207–12215, 2021.

Appendix A Proof of Proposition 3.1

Proof.

Since the selection variable S𝑆Sitalic_S is created using a structural equation dependent on T𝑇Titalic_T and C𝐶Citalic_C and samples are retained only when S=1𝑆1S=1italic_S = 1, the distribution of the selected samples P*(C,T,Y)=P(C,T,YS=1)superscript𝑃𝐶𝑇𝑌𝑃𝐶𝑇conditional𝑌𝑆1P^{*}(C,T,Y)=P(C,T,Y\mid S=1)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) = italic_P ( italic_C , italic_T , italic_Y ∣ italic_S = 1 ) and is Markov relative to Fig. 1(b). In the absence of additional constraints on P*superscript𝑃P^{*}italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT (e.g., linearity assumptions) or the relation between P𝑃Pitalic_P and P*superscript𝑃P^{*}italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT, it is known that the ATE is non-parametrically identified in the presence of selection bias if and only if no variable that is a causal ancestor of the outcome in a graph where one deletes the treatment variable is also a causal ancestor of the selection variable (Bareinboim & Tian, 2015, Theorem 2). In this case, if we delete T𝑇Titalic_T from Fig. 1(b), C𝐶Citalic_C remains a causal ancestor of Y𝑌Yitalic_Y and S𝑆Sitalic_S. Hence, the effect is not identified via any non-parametric functional of the observed data and condition (II) cannot be satisfied. ∎

Appendix B Viewing Algorithm 1 as drawing from a conditionally ignorable causal model

A causal model of a DAG 𝒢(V)𝒢𝑉{\mathcal{G}}(V)caligraphic_G ( italic_V ) can be interpreted as (i) a set of statistical distributions P(V)𝑃𝑉P(V)italic_P ( italic_V ) that factorize according to 𝒢𝒢{\mathcal{G}}caligraphic_G: P(V)=ViVP(ViPai)𝑃𝑉subscriptproductsubscript𝑉𝑖𝑉𝑃conditionalsubscript𝑉𝑖subscriptPa𝑖P(V)=\prod_{V_{i}\in V}P(V_{i}\mid\operatorname{Pa}_{i})italic_P ( italic_V ) = ∏ start_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_V end_POSTSUBSCRIPT italic_P ( italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ roman_Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ); and (ii) a set of post intervention distributions given by the g-formula aka truncated factorization (Robins, 1986; Spirtes et al., 2000; Pearl, 2009): for every AV𝐴𝑉A\subset Vitalic_A ⊂ italic_V, we have P(VAdo(A=a))=ViVAP(ViPai)|A=a𝑃𝑉conditional𝐴do𝐴𝑎evaluated-atsubscriptproductsubscript𝑉𝑖𝑉𝐴𝑃conditionalsubscript𝑉𝑖subscriptPa𝑖𝐴𝑎P(V\setminus A\mid\operatorname{do}(A=a))=\prod_{V_{i}\in V\setminus A}P(V_{i}% \mid\operatorname{Pa}_{i})\ |_{A=a}italic_P ( italic_V ∖ italic_A ∣ roman_do ( italic_A = italic_a ) ) = ∏ start_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_V ∖ italic_A end_POSTSUBSCRIPT italic_P ( italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∣ roman_Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) | start_POSTSUBSCRIPT italic_A = italic_a end_POSTSUBSCRIPT.

If it were possible to recollect data through a conditionally randomized experiment where treatment is assigned with probability P*(TC)superscript𝑃conditional𝑇𝐶P^{*}(T\mid C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ) instead of P(T)𝑃𝑇P(T)italic_P ( italic_T ), the observed data distribution over C,T,Y𝐶𝑇𝑌C,T,Yitalic_C , italic_T , italic_Y would factorize according to the standard conditionally ignorable model shown in Fig. 1(c). That is, the distribution factorizes as P*(C,T,Y)=P(C)×P*(TC)×P(YT,C)superscript𝑃𝐶𝑇𝑌𝑃𝐶superscript𝑃conditional𝑇𝐶𝑃conditional𝑌𝑇𝐶P^{*}(C,T,Y)=P(C)\times P^{*}(T\mid C)\times P(Y\mid T,C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) = italic_P ( italic_C ) × italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C ) × italic_P ( italic_Y ∣ italic_T , italic_C ) and implies no independence constraints on the observed data. The distribution of samples P*(C,T,Y)superscript𝑃𝐶𝑇𝑌P^{*}(C,T,Y)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_T , italic_Y ) output by the rejection sampler in Algorithm 1 also implies no independence constraints on C,T,Y𝐶𝑇𝑌C,T,Yitalic_C , italic_T , italic_Y and thus has the exact same factorization. This establishes statistical equivalence of the conditionally ignorable model and the one obtained via our rejection sampler.

However, statistical equivalence alone is insufficient, as it is statistically equivalent to any complete DAG on the variables C,T,Y𝐶𝑇𝑌C,T,Yitalic_C , italic_T , italic_Y that imply no independence constraints on the observed data. However, it is easy to see that all post-intervention distributions obtained via truncated factorization in the conditionally ignorable model also have the same identifying functionals in the model obtained through the rejection sampler due to the extra constraints imposed in Algorithm 1. For example, P*(T,Ydo(C=c))=P*(TC=c)×P(YT,C)superscript𝑃𝑇conditional𝑌do𝐶𝑐superscript𝑃conditional𝑇𝐶𝑐𝑃conditional𝑌𝑇𝐶P^{*}(T,Y\mid\operatorname{do}(C=c))=P^{*}(T\mid C=c)\times P(Y\mid T,C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T , italic_Y ∣ roman_do ( italic_C = italic_c ) ) = italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T ∣ italic_C = italic_c ) × italic_P ( italic_Y ∣ italic_T , italic_C ) and P*(C,Ydo(T=t))=P(C)×P(YT=t,C)superscript𝑃𝐶conditional𝑌do𝑇𝑡𝑃𝐶𝑃conditional𝑌𝑇𝑡𝐶P^{*}(C,Y\mid\operatorname{do}(T=t))=P(C)\times P(Y\mid T=t,C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_C , italic_Y ∣ roman_do ( italic_T = italic_t ) ) = italic_P ( italic_C ) × italic_P ( italic_Y ∣ italic_T = italic_t , italic_C ) in both the conditionally ignorable model and the one obtained via rejection sampling. Since the distribution obtained from the rejection sampler is indistinguishable from a conditionally ignorable model both in terms of the statistical model and all post-intervention distributions, applying Algorithm 1 can be viewed as essentially drawing samples from a standard conditionally ignorable model. That is, despite the selection mechanism, the resulting distribution is indistinguishable from Figure 1(c), the regular causal DAG model we use to represent observed confounding.

Appendix C Synthetic DGPs to evaluate sampling algorithms

We use the following data-generating process (DGP) to create synthetic data with 100k units for which the true ATE is equal to 2.5. We call this Setting 1:

C𝐶\displaystyle Citalic_C Binomial(0.5)similar-toabsentBinomial0.5\displaystyle\sim\text{Binomial}(0.5)∼ Binomial ( 0.5 )
T𝑇\displaystyle Titalic_T Binomial(0.3)similar-toabsentBinomial0.3\displaystyle\sim\text{Binomial}(0.3)∼ Binomial ( 0.3 )
Y𝑌\displaystyle Yitalic_Y 0.5C+1.5T+2TC+Normal(0,1)similar-toabsent0.5𝐶1.5𝑇2𝑇𝐶Normal01\displaystyle\sim 0.5C+1.5T+2TC+\text{Normal}(0,1)∼ 0.5 italic_C + 1.5 italic_T + 2 italic_T italic_C + Normal ( 0 , 1 )

We set the researcher-specified confounding function P*(T=1|C)=σ(1+2.5C)superscript𝑃𝑇conditional1𝐶𝜎12.5𝐶P^{*}(T=1|C)=\sigma(-1+2.5C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T = 1 | italic_C ) = italic_σ ( - 1 + 2.5 italic_C ), where σ𝜎\sigmaitalic_σ is the logistic (expit) function, for RCT Rejection sampling and the same function as f𝑓fitalic_f in Gentzel et al. (2021)’s Algorithm 2.

Setting 2 has the same DGP as Setting 1 but we change TBinomial(0.5)similar-to𝑇Binomial0.5T\sim\text{Binomial}(0.5)italic_T ∼ Binomial ( 0.5 ). For Settings 1 and 2, we provide the backdoor adjustment estimator with the true adjustment set, such that it adjusts for both C𝐶Citalic_C and the interaction term, TC𝑇𝐶TCitalic_T italic_C.

Setting 3 involves more covariates that are combined in non-linear ways:

C1similar-tosubscript𝐶1absent\displaystyle C_{1}\simitalic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∼ Binomial(0.5)Binomial0.5\displaystyle\ \text{Binomial}(0.5)Binomial ( 0.5 )
C2similar-tosubscript𝐶2absent\displaystyle C_{2}\simitalic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∼ C1Uniform(0.5,1)subscript𝐶1Uniform0.51\displaystyle\ C_{1}-\text{Uniform}(-0.5,1)italic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - Uniform ( - 0.5 , 1 )
C3similar-tosubscript𝐶3absent\displaystyle C_{3}\simitalic_C start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∼ Normal(0,1)Normal01\displaystyle\ \text{Normal}(0,1)Normal ( 0 , 1 )
C4similar-tosubscript𝐶4absent\displaystyle C_{4}\simitalic_C start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT ∼ Normal(0,1)Normal01\displaystyle\ \text{Normal}(0,1)Normal ( 0 , 1 )
C5similar-tosubscript𝐶5absent\displaystyle C_{5}\simitalic_C start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT ∼ C3+C4+Normal(0,1)subscript𝐶3subscript𝐶4Normal01\displaystyle\ C_{3}+C_{4}+\text{Normal}(0,1)italic_C start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + italic_C start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT + Normal ( 0 , 1 )
Tsimilar-to𝑇absent\displaystyle T\simitalic_T ∼ Binomial(0.3)Binomial0.3\displaystyle\ \text{Binomial}(0.3)Binomial ( 0.3 )
Ysimilar-to𝑌absent\displaystyle Y\simitalic_Y ∼ 0.5C4+2TC1C21.5T+C2C3+C5+Normal(0,1)0.5subscript𝐶42𝑇subscript𝐶1subscript𝐶21.5𝑇subscript𝐶2subscript𝐶3subscript𝐶5Normal01\displaystyle\ 0.5C_{4}+2TC_{1}C_{2}-1.5T+C_{2}C_{3}+C_{5}+\text{Normal}(0,1)0.5 italic_C start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT + 2 italic_T italic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - 1.5 italic_T + italic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_C start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + italic_C start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT + Normal ( 0 , 1 )

In this setting, we set P*(T=1|C)=σ(0.5C1+0.7C2+1.2C3+1.5C4+1.2C5+0.5C1C2)P^{*}(T=1|C)=\sigma(0.5C_{1}+-0.7C_{2}+1.2C_{3}+1.5C_{4}+-1.2C_{5}+0.5C_{1}C_{% 2})italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T = 1 | italic_C ) = italic_σ ( 0.5 italic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + - 0.7 italic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + 1.2 italic_C start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + 1.5 italic_C start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT + - 1.2 italic_C start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT + 0.5 italic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ). We provide the parametric backdoor estimator with the true adjustment set: C4,TC1C2,C2C3,C5subscript𝐶4𝑇subscript𝐶1subscript𝐶2subscript𝐶2subscript𝐶3subscript𝐶5C_{4},TC_{1}C_{2},C_{2}C_{3},C_{5}italic_C start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT , italic_T italic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_C start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT.

See Table 4 for the ATEs for each of the three settings.

DGP Setting RCT ATE
1 - Linear, P(T=1)=0.3𝑃𝑇10.3P(T=1)=0.3italic_P ( italic_T = 1 ) = 0.3 2.48
2 - Linear, P(T=1)=0.5𝑃𝑇10.5P(T=1)=0.5italic_P ( italic_T = 1 ) = 0.5 2.49
3 - Nonlinear, 5 C𝐶Citalic_C’s -0.26
Table 4: RCT ATEs for the synthetic DGPs.

Appendix D RCT dataset details

We expand on the details of the RCT described in Section 4.1.1. The RCT was conducted on the Allen Institute for Artificial Intelligence’s Semantic Scholar171717https://www.semanticscholar.org/ platform.

The experiment ran for 25 days from 2022-06-04 to 2022-06-28 and resulted in 53,281 unique users arriving on 50,833 unique paper pages. This is after filtering to users who are “active”, meaning prior to the experiment they clicked somewhere on the website at least once. Treatment is randomized for the combination of a unique user’s browser plus user’s device. We then post-process to recognize logged-in users across devices/browsers and remove them from the results if they switched treatments. The outcome of interest is if a user clicks on the “enhanced reader” button at least once during a session (Y=1𝑌1Y=1italic_Y = 1).

Intuitively, a positive causal effect is expected since we expect advertising a feature of the website to result in increased click rate. The final ATE was 0.113 as computed by a simple difference in conditional means 𝔼[YT=1]𝔼[YT=0]𝔼delimited-[]conditional𝑌𝑇1𝔼delimited-[]conditional𝑌𝑇0\mathbb{E}[Y\mid T=1]-\mathbb{E}[Y\mid T=0]blackboard_E [ italic_Y ∣ italic_T = 1 ] - blackboard_E [ italic_Y ∣ italic_T = 0 ].

Appendix E Creating vocabulary for X𝑋Xitalic_X

A new vocabulary is created for each RCT subpopulation. To create each vocabulary, we use binary indicators, remove stopwords, remove numbers and strip accents. Words must occur in at least 5 documents. We ignore terms that have a document frequency strictly lower than 10%.

Appendix F Modeling

Base learners. Base learners QT0subscript𝑄subscript𝑇0Q_{T_{0}}italic_Q start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT and QT1subscript𝑄subscript𝑇1Q_{T_{1}}italic_Q start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT in Equations 5 and 6 respectively are fit as described by Künzel et al. (2019) as “T-learners” (not “S-learners”), i.e. we take all samples for which we have observed T=0𝑇0T=0italic_T = 0 and then regress X𝑋Xitalic_X on Y𝑌Yitalic_Y to get a trained model for QT0subscript𝑄subscript𝑇0Q_{T_{0}}italic_Q start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT and likewise for units with observed T=1𝑇1T=1italic_T = 1 and QT1subscript𝑄subscript𝑇1Q_{T_{1}}italic_Q start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT. In preliminary experiments, we used the “S-learner” but found high-dimensional X𝑋Xitalic_X dominated T𝑇Titalic_T and there were no differences learned between observed and counterfactual T𝑇Titalic_T settings.

Cross-fitting with cross-validation. We fit our models using cross-fitting (Newey & Robins, 2018) which is also called sample-splitting (Hansen, 2000). Here, we divide the data into K𝐾Kitalic_K folds. For each inference fold j𝑗jitalic_j, the other K1𝐾1K-1italic_K - 1 folds (shorthand j𝑗-j- italic_j) are used as the training set to fit the base learners—e.g., Q^T0jsuperscriptsubscript^𝑄subscript𝑇0𝑗\hat{Q}_{T_{0}}^{-j}over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - italic_j end_POSTSUPERSCRIPT oder g^jsuperscript^𝑔𝑗\hat{g}^{-j}over^ start_ARG italic_g end_ARG start_POSTSUPERSCRIPT - italic_j end_POSTSUPERSCRIPT—where the superscript here indicates the data the model is fit on. The single hyperparameter for logistic regression is selected via cross-validation, where the training set is again split into folds. No cross-validation is performed for CatBoost. Then for each unit i𝑖iitalic_i in the inference set, we use the trained models to infer Q^T0(xi)=Q^T0j(xi)subscript^𝑄subscript𝑇0subscript𝑥𝑖subscriptsuperscript^𝑄𝑗subscript𝑇0subscript𝑥𝑖\hat{Q}_{T_{0}}(x_{i})=\hat{Q}^{-j}_{T_{0}}(x_{i})over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = over^ start_ARG italic_Q end_ARG start_POSTSUPERSCRIPT - italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). Then these are inserted into the plug-in estimators to compute the average treatment effect, τ𝜏\tauitalic_τ for each estimator below.

Causal estimators. After training base learners with cross-fitting, we implement the following plug-in causal estimators: backdoor adjustment (outcome regression) (Q), inverse propensity of treatment weighting (IPTW), adjusted inverse propensity of treatment weighting (AIPTW) (Robins et al., 1994).

τ^Qsubscript^𝜏𝑄\displaystyle\hat{\tau}_{Q}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT :=1ni(Q^T1(xi)Q^T0(xi))assignabsent1𝑛subscript𝑖subscript^𝑄subscript𝑇1subscript𝑥𝑖subscript^𝑄subscript𝑇0subscript𝑥𝑖\displaystyle:=\frac{1}{n}\sum_{i}\bigg{(}\hat{Q}_{T_{1}}(x_{i})-\hat{Q}_{T_{0% }}(x_{i})\bigg{)}:= divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) (9)
τ^IPTWsubscript^𝜏IPTW\displaystyle\hat{\tau}_{\text{IPTW}}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT IPTW end_POSTSUBSCRIPT :=1ni(yitig^(xi)yi(1ti)1g^(xi))assignabsent1𝑛subscript𝑖subscript𝑦𝑖subscript𝑡𝑖^𝑔subscript𝑥𝑖subscript𝑦𝑖1subscript𝑡𝑖1^𝑔subscript𝑥𝑖\displaystyle:=\frac{1}{n}\sum_{i}\bigg{(}\frac{y_{i}t_{i}}{\hat{g}(x_{i})}-% \frac{y_{i}(1-t_{i})}{1-\hat{g}(x_{i})}\bigg{)}:= divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( divide start_ARG italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG over^ start_ARG italic_g end_ARG ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG - divide start_ARG italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( 1 - italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG start_ARG 1 - over^ start_ARG italic_g end_ARG ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG ) (10)
τ^AIPTWsubscript^𝜏AIPTW\displaystyle\hat{\tau}_{\text{AIPTW}}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT AIPTW end_POSTSUBSCRIPT :=1ni(Q^T1(xi)Q^T0(xi)+tiyiQ^T1(xi)g^(xi)\displaystyle:=\frac{1}{n}\sum_{i}\bigg{(}\hat{Q}_{T_{1}}(x_{i})-\hat{Q}_{T_{0% }}(x_{i})+t_{i}\frac{y_{i}-\hat{Q}_{T_{1}}(x_{i})}{\hat{g}(x_{i})}:= divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT divide start_ARG italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG start_ARG over^ start_ARG italic_g end_ARG ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG
(1ti)yiQ^T0(xi)1g^(xi))\displaystyle\hskip 85.35826pt-(1-t_{i})\frac{y_{i}-\hat{Q}_{T_{0}}(x_{i})}{1-% \hat{g}(x_{i})}\bigg{)}- ( 1 - italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) divide start_ARG italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG start_ARG 1 - over^ start_ARG italic_g end_ARG ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG ) (11)

We also use DoubleML (Chernozhukov et al., 2018) which applies an ordinary least squares on residuals from the base learners

τ^DMLsubscript^𝜏DML\displaystyle\hat{\tau}_{\text{DML}}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT DML end_POSTSUBSCRIPT :=E^[(yQ^X(x))|(tg^(x))]assignabsent^𝐸delimited-[]conditional𝑦subscript^𝑄𝑋𝑥𝑡^𝑔𝑥\displaystyle:=\hat{E}\big{[}(y-\hat{Q}_{X}(x))\big{|}(t-\hat{g}(x))\big{]}:= over^ start_ARG italic_E end_ARG [ ( italic_y - over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x ) ) | ( italic_t - over^ start_ARG italic_g end_ARG ( italic_x ) ) ] (12)

Appendix G Proof of concept pipeline for an additional subpopulation

As an additional proof of concept, we follow the same steps for the proof of concept in Section 4 but we use a different subpopulation—Subpopulation B for which the covariates C𝐶Citalic_C are Engineering and Business document categories. Table 5 provides the descriptive statistics for this subpopulation. To measure the predictive accuracy for this subpopulation, we again model P(C|X)𝑃conditional𝐶𝑋P(C|X)italic_P ( italic_C | italic_X ) with a logistic regression classifier. Averaged across held-out test folds, the F1 score is 0.92 and the average precision is 0.97.

Addressing consideration #4, we also create diagnostic plots for Subpopulation B in Figure 4. In the subsequent pipeline, we use P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) in equation 4 parameterized by ζ0=0.85,ζ1=0.15formulae-sequencesubscript𝜁00.85subscript𝜁10.15\zeta_{0}=0.85,\zeta_{1}=0.15italic_ζ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0.85 , italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.15. Examing the modeling results in Table 6, we hypothesize that the slight performance improvement in estimators using catboost over adjustment with the oracle set C𝐶Citalic_C is due to inclusion of extra covariates in X𝑋Xitalic_X granting additional statistical efficiency (Rotnitzky & Smucler, 2020). Regarding class imbalance, Subpopulation B is slightly more balanced (compared to Subpopulation A) with 55% business, 𝔼[Y]=0.05𝔼delimited-[]𝑌0.05\mathbb{E}[Y]=0.05blackboard_E [ italic_Y ] = 0.05 so that smallest category (C=0𝐶0C=0italic_C = 0, Y=1𝑌1Y=1italic_Y = 1) has 49 documents. However, this subpopulation also suffers from finite data and class imbalance issues, as evidenced by the low average precision for the inference folds of the outcome models.

RCT Dataset C𝐶Citalic_C categories n RCT ATE OR(C𝐶Citalic_C, Y𝑌Yitalic_Y)
Subpopulation B Engineering, Business 2,238 0.075 1.4
Table 5: For Subpopulation B, RCT dataset descriptive statistics including the number of units in the subpopulation (n𝑛nitalic_n) and the odds ratio, OR(C,Y)𝑂𝑅𝐶𝑌OR(C,Y)italic_O italic_R ( italic_C , italic_Y ).
Refer to caption
Figure 4: Diagnostic plot for Subpopulation B. Each plot is the parameterization of the researcher-specified confounding functions, P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) in Equation 4. Each blue dot is a different random seed (100 seeds total per plot/parameterization).
g^(x)^𝑔𝑥\hat{g}(x)over^ start_ARG italic_g end_ARG ( italic_x ) Q^T0(x)subscript^𝑄subscript𝑇0𝑥\hat{Q}_{T_{0}}(x)over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x ) Q^T1(x)subscript^𝑄subscript𝑇1𝑥\hat{Q}_{T_{1}}(x)over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_x ) Q^X(x)subscript^𝑄𝑋𝑥\hat{Q}_{X}(x)over^ start_ARG italic_Q end_ARG start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT ( italic_x ) Prediction Ave. Prec. (\uparrow better) train inference train inference train inference train inference linear 0.98 (0.01) 0.78 (0.02) 0.67 (0.24) 0.03 (0.02) 0.56 (0.22) 0.17 (0.03) 0.59 (0.17) 0.14 (0.02) catboost (nonlinear) 0.99 (0.0) 0.79 (0.02) 0.97 (0.02) 0.03 (0.01) 0.99 (0.01) 0.21 (0.04) 0.99 (0.0) 0.15 (0.03) Causal Rel. Abs. Error (\downarrow better) Unadjusted Backdoor C𝐶Citalic_C τ^Qsubscript^𝜏𝑄\hat{\tau}_{Q}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT τ^IPTWsubscript^𝜏IPTW\hat{\tau}_{\text{IPTW}}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT IPTW end_POSTSUBSCRIPT τ^AIPTWsubscript^𝜏AIPTW\hat{\tau}_{\text{AIPTW}}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT AIPTW end_POSTSUBSCRIPT τ^DMLsubscript^𝜏DML\hat{\tau}_{\text{DML}}over^ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT DML end_POSTSUBSCRIPT linear 0.14 (0.06) 0.14 (0.1) 2.84 (1.43) 0.43 (0.94) 1.51 (2.28) 0.48 (0.2) catboost (nonlinear) 0.14 (0.06) 0.14 (0.1) 0.11 (0.08) 0.11 (0.08) 0.12 (0.09) 0.14 (0.1)
Table 6: Modeling results for Subpopulation B. Top: Predictive models’ average precision (ave. prec.) for training (yellow) and inference (green) data splits. Bottom: Causal estimation models’ relative absolute error (rel. abs. error) between the models’ estimated ATE and the RCT ATE. Here, darker shades of red indicate worse causal estimates. Baselines, unadjusted conditional mean on the samples (unadjusted) and the backdoor adjustment with the oracle C𝐶Citalic_C (backdoor C), are uncolored. We use two baselearner settings: linear and catboost (nonlinear). We report both the average and standard deviation (in parentheses) over 100 random seeds during sampling. All settings use P*(T|C)superscript𝑃conditional𝑇𝐶P^{*}(T|C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T | italic_C ) in equation 4 parameterized by ζ0=0.85,ζ1=0.15formulae-sequencesubscript𝜁00.85subscript𝜁10.15\zeta_{0}=0.85,\zeta_{1}=0.15italic_ζ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0.85 , italic_ζ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.15.

Appendix H Confidence Interval Plots

Refer to caption
Refer to caption
Figure 5: For Synthetic DGP #1 (single random seed) and 1000 bootstrap samples, we plot the 95% confidence intervals for the original RCT (difference in means estimator), RCT rejection sampling with a parametric adjustment (with knowledge of the oracle adjustment), and Algorithm 2 from Gentzel et al. (2021) with a parametric adjustment (with knowledge of the oracle adjustment). The mean of the bootstrap samples is denoted by the dot.

In Figure 5, we plot the confidence intervals given by the bootstrap percentile method (see Section 3.5 for more details). First, we construct confidence intervals for the original RCT data with a difference in means estimator. Then we plot the confidence intervals for the two sampling procedures—RCT rejection sampling and Algorithm 2 from Gentzel et al.—applied to that same RCT data. We examine synthetic DGP #1 (see Section C for details on the synthetic DGPs) for a single random seed across two different sizes of synthetic RCT data, 100K samples and 3K samples.

In the plots, the dot is the mean of the 1000 bootstrap samples. The horizontal bars indicate the endpoints of the 95% confidence interval. We note that the confidence intervals for the sampling procedures are wider because sampling reduces the size of the dataset (roughly by half). Also as expected, all the confidence intervals are wider for the smaller (3K) RCT data setting. Note, in both settings, our RCT rejection sampling contains the true (RCT) ACE (red line in the plot) while Gentzel et. al’s algorithm does not.

3cm

Appendix I Varying Confounding Strength

Refer to caption
Figure 6: For Synthetic DGP #1, we alter the confounding strength x𝑥xitalic_x in the function we specify, P*(T=1|C)=σ(1+xC)superscript𝑃𝑇conditional1𝐶𝜎1𝑥𝐶P^{*}(T=1|C)=\sigma(-1+x\cdot C)italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T = 1 | italic_C ) = italic_σ ( - 1 + italic_x ⋅ italic_C ). We choose the lower and upper limits of x𝑥xitalic_x such that 0.05<P*(T=1|C)<0.950.05superscript𝑃𝑇conditional1𝐶0.950.05<P^{*}(T=1|C)<0.950.05 < italic_P start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_T = 1 | italic_C ) < 0.95 to ensure overlap is satisfied. For each level of confounding strength, we run the rejection sampling algorithm with 1000 different random seeds. We plot the estimated ATE for the observational (confounded) dataset that is created after RCT rejection sampling. We find there are no substantial differences in the estimates due to changing the confounding strength.