Benchmarking for Deep Uplift Modeling in Online Marketing

Dugang Liu Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)ShenzhenChina [email protected] , Xing Tang , Yang Qiao Financial Technology (FiT), TencentShenzhenChina [email protected] [email protected] , Miao Liu Shenzhen UniversityShenzhenChina [email protected] , Zexu Sun Renmin University of ChinaBeijingChina [email protected] , Xiuqiang He Financial Technology (FiT), TencentShenzhenChina [email protected] and Zhong Ming Shenzhen Technology UniversityShenzhenChina [email protected]

(2018)

Abstract.

Online marketing is critical for many industrial platforms and business applications, aiming to increase user engagement and platform revenue by identifying corresponding delivery-sensitive groups for specific incentives, such as coupons and bonuses. As the scale and complexity of features in industrial scenarios increase, deep uplift modeling (DUM) as a promising technique has attracted increased research from academia and industry, resulting in various predictive models. However, current DUM still lacks some standardized benchmarks and unified evaluation protocols, which limit the reproducibility of experimental results in existing studies and the practical value and potential impact in this direction. In this paper, we provide an open benchmark for DUM and present comparison results of existing models in a reproducible and uniform manner. To this end, we conduct extensive experiments on two representative industrial datasets with different preprocessing settings to re-evaluate 13 existing models. Surprisingly, our experimental results show that the most recent work differs less than expected from traditional work in many cases. In addition, our experiments also reveal the limitations of DUM in generalization, especially for different preprocessing and test distributions. Our benchmarking work allows researchers to evaluate the performance of new models quickly but also reasonably demonstrates fair comparison results with existing models. It also gives practitioners valuable insights into often overlooked considerations when deploying DUM. We will make this benchmarking library, evaluation protocol, and experimental setup available on GitHub.

Open benchmarking, Deep uplift modeling, Online marketing, Reproducible research

^†^†copyright: acmlicensed^†^†journalyear: 2018^†^†doi: XXXXXXX.XXXXXXX^†^†conference: Make sure to enter the correct conference title from your rights confirmation emai; June 03–05, 2018; Woodstock, NY^†^†isbn: 978-1-4503-XXXX-X/18/06^†^†ccs: Information systems Personalization^†^†ccs: Applied computing Economics

1. Introduction

To increase user engagement and platform revenue, online marketing is an essential task for many industrial platforms and business applications, aiming to provide certain specific incentives to the target users, such as coupons (Zhao et al., 2019), discounts (Lin et al., 2017), and bonuses (Ai et al., 2022). The success of online marketing depends largely on being able to accurately identify the sensitive user groups for each incentive (He et al., 2022; Xu et al., 2022). Uplift modeling is an effective tool to address this problem (Devriendt et al., 2020; Betlei et al., 2021; Chen et al., 2022), which aims to estimate changes in outcomes due to individual-level treatments, i.e., accurately captures the difference between a user’s response to various incentives and their responses without incentives. This difference is called uplift, and practitioners can naturally specify more effective marketing delivery strategies based on the estimated uplift. Because uplift modeling typically targets a large user base, even a slight improvement in its predictive performance can significantly increase overall revenue. For example, in Tencent’s Licaitong financial scenario (Liu et al., 2023), a slight absolute improvement in prediction performance will significantly reduce marketing costs and increase commission conversion. Compared with response tasks like click-through rate prediction, uplift modeling tasks usually contain multiple incentives and lack direct supervision objectives. In other words, we can only observe the user’s response but not the degree of its change. Therefore, it is a great challenge to improve uplift modeling significantly.

Given the importance and unique challenges of uplift modeling, researchers in academia and industry have done extensive research in recent years. Uplift prediction models have evolved from meta-learners-based (Künzel et al., 2019; Nie and Wager, 2021; Bang and Robins, 2005), tree and forest-based (Radcliffe and Surry, 2011; Chipman et al., 2010; Wager and Athey, 2018; Athey and Imbens, 2016; Wan et al., 2022; Ai et al., 2022), Knapsack problem-based (Goldenberg et al., 2020; Albert and Goldenberg, 2021) to deep neural networks-based architecture. Notably, many recent works have focused on developing new neural network architectures to better adapt to uplift modeling in industrial scenarios and shown remarkable performance improvements, such as EUEN (Ke et al., 2021), DESCN (Zhong et al., 2022) and EFIN (Liu et al., 2023) and so on. We call this deep neural network-based line deep uplift modeling (DUM). The trend of technical lines gradually moving closer to DUM largely follows the development of online marketing forms and models in industrial scenarios. On the one hand, in real online marketing scenarios, incentives are becoming increasingly diverse and bring more helpful feature information (such as text and pictures displayed with bonuses, etc.). This implies the need to develop new techniques, including modeling the incentive-related features and the interaction between these features and the user and contextual features. DUM can more flexibly adapt to this goal than other lines. On the other hand, since deep network-based models are more commonly deployed in industrial scenarios, it is also easier to integrate them with DUM techniques than other lines. Therefore, we follow this trend and prioritize DUM in this benchmark work, which is more urgently needed in current industrial scenarios.

Despite the promising results from these aforementioned studies, there is a lack of standardized benchmarks and unified evaluation protocols for DUM tasks. For example, although industrial benchmark datasets such as Criteo (Diemert et al., 2021) and Lazada (Zhong et al., 2022) are used in existing studies, these works often use different random seeds to perform unknown train-test splits. In addition, there is no consistent standard for data preprocessing steps, such as feature normalization, instance deduplication, etc, and discussion of their potential impacts is neglected. This issue leads to non-reproducible and inconsistent experiment results between different studies and confuses practitioners. Each study shows that they achieve better performance than state-of-the-art methods. However, results from the same model in different studies cannot be compared due to different evaluation protocols. Due to the lack of open benchmarking results, practitioners may scrutinize whether the baseline model has been correctly implemented and rigorously tuned. Only a few works present detailed implementations of baseline models, and some do not include open-source code. Therefore, in the research field of uplift modeling, especially for DUM tasks, practitioners are constantly forced to re-implement many baselines and evaluate the results on their own data partition without knowing their correctness, which leads to inefficient development and redundant work. In addition, some factors that may need to be considered when deploying DUM have not received enough attention, such as the data preprocessing settings.

Inspired by the success of CausalML (Chen et al., 2020), the most popular uplift modeling and causal inference benchmark and mainly focuses on traditional tree-based uplift modeling methods and rarely covers DUM, this paper proposes an open benchmarking framework for DUM, i.e., DUMOM. Our work standardizes the open benchmarking pipeline for such modeling and thoroughly compares different DUM models for reproducible studies. We extensively evaluate 13 existing representative models in multiple data processing settings on two widely-used representative industrial datasets, including Criteo (Diemert et al., 2021) and Lazada (Zhong et al., 2022). The former contains training and test subsets with instance selection bias, and the latter contains a training subset with instance selection bias and a test subset without bias. Therefore, we prioritize these two datasets in this benchmark work so that we can provide more comprehensive insights. The comparison results between different models in our experiments reveal some valuable conclusions and can motivate attention to related research. Specifically, there is no stable state-of-the-art DUM model under different datasets and data preprocessing steps after sufficient hyperparameter search and model tuning. In particular, when the dataset’s training and test distributions are inconsistent, the differences between recent DUM models and traditional models are more minor than expected, and different DUM models are also sensitive to different data preprocessing steps. In other words, current DUM research has apparent limitations in model generalization, and it is necessary to carefully consider the test distribution environment and appropriate data preprocessing when deploying DUM and conduct more research in these areas.

2. Deep Uplift Modeling

In this section, we first define the uplift modeling task and architectural modules with the necessary notations and then briefly introduce some representative DUM models used in our experiments. Finally, to avoid confusion, we emphasize the similarities and differences between the current DUM and related individual treatment effect (ITE) estimation studies.

2.1. Architecture Overview

2.1.1. Objective of Uplift Modeling

Uplift modeling seeks to determine the incremental impact of an intervention (or treatment) on an individual’s behavior, where the intervention (or treatment) corresponds to an incentive in online marketing. This is achieved by estimating an individual’s potential outcome under different scenarios, such as with or without intervention. In this paper, without loss of generality, we use the binary treatment indicator $T\in\{0,1\}$ , where $T=0$ represents the control group (i.e., without intervention) and $T=1$ represents the treatment group (i.e., with intervention). In line with the potential outcome framework (Rubin, 2005), each individual $i$ has two possible outcomes: $Y_{i}(0)$ in the absence of intervention and $Y_{i}(1)$ in the presence of intervention. The uplift of intervention $T$ on an individual $i$ , represented by $\tau_{i}$ , is equal to the difference between the two potential outcomes:

(1)

\tau_{i}=Y_{i}(1)-Y_{i}(0).

However, individuals can only observe one of the two potential outcomes in real-world scenarios. The unobserved outcome is known as the counterfactual outcome, while the observed outcome can be represented as:

(2)

Y_{i}=TY_{i}(1)+(1-T)Y_{i}(0).

Since we can not observe both potential outcomes for each individual $i$ , the uplift cannot be directly estimated. However, with some adequate assumptions, it can be obtained in a manner similar to the conditional average treatment effect (CATE), which is the average treatment effect conditioning on a set of covariates. Specially, let $X=\boldsymbol{x}$ be a $k$ -dimensional covariate vector, then the CATE $\tau(\boldsymbol{x})$ can be derived as:

(3)

\tau(\boldsymbol{x})=\mathbb{E}[\tau\mid X=\boldsymbol{x}]=\mathbb{E}[Y(1)-Y(0% )\mid X=\boldsymbol{x}].

The objective of uplift modeling in Eq.(3) involves estimating two conditional expectations of the observed outcomes, which is feasible with available data. Next, we describe the critical components of a DUM model architecture.

2.1.2. Response Modeling of the Control Group

Based on the input covariate vector $\mathbf{x}$ , DUM usually transforms it into a low-dimensional dense vector $\mathbf{e}$ through an embedding layer. Then, a network architecture with predictive capabilities will be used to obtain the response of the control group $\hat{Y}_{i}(0)$ from this low-dimensional dense vector.

2.1.3. Response Modeling of the Treatment Group

Similarly, there is usually another network architecture with predictive capabilities responsible for predicting the response of the treatment group $\hat{Y}_{i}(1)$ . Note that in some DUM models, a shared module shares information at the bottom of the above two prediction networks.

2.1.4. Predictive Modeling of Treatment Indicator Variables

Since accurately predicting whether a user belongs to the control group or the treatment group helps to make the model robust to the distribution imbalance of the group to a certain extent, a crucial technical branch in DUM is whether to additionally predict the indicator variable of the treatment $\hat{T}_{i}$ . An independent prediction network will perform this process.

2.1.5. Loss Function

The least-square error (MSE) loss is widely used in DUM tasks, which is defined as follows:

(4)

\mathcal{L}=\frac{1}{N}\sum_{\mathcal{D}}\left(Y_{i}-\hat{Y}_{i}\right)^{2},

where $\mathcal{D}$ is the training set with $N$ instances. Depending on the group to which each instance $X$ belongs, the prediction of $Y_{i}$ may be generated by $\hat{Y}_{i}(0)=f(\mathbf{e}|T_{i}=0,\theta_{0})$ oder $\hat{Y}_{i}(1)=f(\mathbf{e}|T_{i}=1,\theta_{1})$ , where $\theta_{0}$ and $\theta_{1}$ are respective model parameters. Note that for brevity, we omit the optimization loss term for $T$ .

2.1.6. Inference Stage

In practical applications, DUM pays more attention to the response changes that each incentive may bring to the user than the response itself. Therefore, in the inference stage, we will make predictions for each instance based on the trained model on its response in the control group and the intervention group and then use the difference between the two to perform a ranking of delivery strategies, i.e.,

(5)

\hat{\tau_{i}}=\hat{Y}_{i}(1)-\hat{Y}_{i}(0).

2.2. Representative Models

Next, we summarize the representative models evaluated and benchmarked in this work. Note that although we only enumerate a subset of existing DUM models, they cover a broad range of representative research lines on DUM. To facilitate understanding and comparison, we show the architecture diagrams of these representative models in Figure 1. As shown in Figure 1, at a high level, we divide the existing methods into two categories based on how the treatment indicator is used.

Refer to caption — Figure 1. The architectural illustration of representative baselines in neural network-based uplift modeling. For more information on the meaning of some baseline-specific symbols, please refer to their original papers.

2.2.1. Treatment as a Branch Switch

Since it is necessary to determine the group to which each instance belongs during loss optimization, the corresponding prediction branch is selected to obtain the prediction label. Therefore, the methods of this category only use the treatment indicator as a switch to switch the model branch and do not let it participate in the model’s training. Representative models in this category include T-Learner (Künzel et al., 2019), TARNet (Shalit et al., 2017), CFRNet (Shalit et al., 2017), FlexTENet (Curth and van der Schaar, 2021b), and EUEN (Ke et al., 2021).

2.2.2. Treatments as Model Features

Since the intervention itself may also carry helpful information or needs to be exploited to perform some steps that help alleviate the distribution shift, the methods in this category incorporate the treatment indicators as features into the model’s training. Representative models in this category include BNN (Johansson et al., 2016), S-Learner (Künzel et al., 2019), DragonNet (Shi et al., 2019), SNet (Curth and van der Schaar, 2021a), GANITE (Yoon et al., 2018), CEVAE (Louizos et al., 2017), DESCN (Zhong et al., 2022), and EFIN (Liu et al., 2023)

2.2.3. Overview of Representative Models

For models in the first category,

•

T-Learner (Künzel et al., 2019): it is similar to S-Learner, which uses two estimators for the treatment and control groups.
•

TARNet (Shalit et al., 2017): a deep learning-based method that extends the two-model approach, which can be seen as the extension of BNN.
•

CFRNet (Shalit et al., 2017): it is built upon the TARNet, which adds the minimization of the discrepancy measured by either the maximum mean discrepancy or the Wasserstein distance between the two distributions.
•

FlexTENet (Curth and van der Schaar, 2021b): it builds on ideas from multi-task learning to design a new architecture for CATE estimation, which adaptively learns what to share between the potential outcome functions.
•

EUEN (Ke et al., 2021): it is an uplift model for large-scale online advertising, which uses two sub-networks to estimate the control conversion probability and the uplift effect, respectively.

For models in the second category,

•

BNN (Johansson et al., 2016): it leverages deep learning for counterfactual inference with theoretical guarantees, combining ideas from domain adaptation and representation learning.
•

S-Learner (Künzel et al., 2019): it is a kind of meta-learner method that uses a single estimator to estimate the outcome without giving the treatment a special role. Varying the treatment while fixing other features yields different variants and uses their differences to estimate CATE.
•

DragonNet (Shi et al., 2019): it designs a new architecture with the propensity score for estimation adjustment and uses a targeted regularization procedure to induce a bias toward models.
•

SNet (Curth and van der Schaar, 2021a): it is a new classification of meta-learners inspired by the ATE estimator taxonomy, which estimates unbiased pseudo-outcomes based on regression adjustment, propensity weighting, or both.
•

GANITE (Yoon et al., 2018): it utilizes a generative adversarial network (GAN) to model treatment effect heterogeneity, which consists of two components, a counterfactual block that generates the counterfactual outcome and an ITE block that generates the ITE.
•

CEVAE (Louizos et al., 2017): it is a variational autoencoder (VAE)-based treatment effect heterogeneity modeling method, which uses a VAE to learn a latent confounding set from the observed covariates and then uses it to estimate the treatment effect.
•

DESCN (Zhong et al., 2022): it jointly learns the treatment and response functions in the entire sample space to avoid treatment bias and employs an intermediate pseudo treatment effect prediction network to relieve sample imbalance.
•

EFIN (Liu et al., 2023): it is a feature interaction-aware uplift method, which consists of four customized modules: a feature encoding module, a self-interaction module, a treatment-aware interaction module, and an intervention-constrained module.

2.3. The Connection between DUM and ITE

Because both aimed to understand how treatments affect individuals, early developments in DUM borrowed heavily from ideas about estimating individual treatment effects (ITE) in causal inference, although they focus on marketing campaigns and medical treatments, respectively. However, the development of the online marketing scenario has made the research and goals in DUM more unique, and we argue that they already differ significantly in evaluation metrics, feature complexity, and treatment types.

1) The Evaluation Metrics. DUM focuses on accurately distinguishing the response difference estimates of user groups that tend to respond from other groups, i.e., ranking their response differences as high as possible, without paying attention to the specific values of the corresponding differences. ITE focuses on estimating the specific difference in potential outcomes for each individual with and without treatment. Therefore, DUM tends to discover the user groups that tend to convert, and ITE aims to obtain accurate ITE results on the individual level. The core metrics are the Area Under Uplift Curve (AUUC) and Qini coefficient (QINI) (Betlei et al., 2021), etc., which are different from the commonly used $\epsilon_{PEHE}$ and $\epsilon_{ATE}$ (Shalit et al., 2017) in ITE. These metrics draw on ranking metrics from traditional click-through rate prediction tasks, such as AUC (Zhu et al., 2021). 2) The Features Complexity. DUM is particularly suitable for optimizing resource allocation, for instance, determining which customers are most likely to increase purchases due to a marketing campaign. This requirement may invoke the deployment of learned models capable of navigating the high-dimensional feature space to capture the diverse impacts of features on the uplift prediction. ITE is more focused on personalized decision-making, such as medical plans. Therefore, it typically does not consider too many feature dimensions, and the model’s training process is more focused on finding which features best identify individuals who will benefit. 3) The Treatment Type. DUM’s flexibility lies in its ability to handle various treatment types. This means it can assess not only single binary treatments (e.g., whether a customer received a marketing email) but also analyze the impact of different treatments (e.g., various marketing strategies, product discounts) on individual behavior. In addition, DUM can be better compatible with displaying discount coupons in marketing scenarios where treatments often contain helpful information, such as text and images. In contrast, ITE estimates usually focus on simple treatment situations, such as binary treatments (e.g., whether a patient received a drug). Providing deep insights into different types and degrees of treatment effects like DUM is not easy. Considering these unique features of DUM, the DUM benchmark will differ from some existing ITE benchmarks in different dimensions and should not be conflated, especially the adopted datasets, baseline methods, and evaluation metrics.

3. Open DUM Benchmark (DUMOM)

Although the various DUM models mentioned above have achieved promising results, current research lacks standardized benchmarks and unified evaluation protocols. These problems cause most studies to face obstacles of non-reproducibility and inconsistent results, hindering the progress of DUM model development. In this section, We first compare our work to other related benchmarks or libraries to illustrate the necessity of an open benchmark for DUM. Then, we provide the implementation details of our open DUM benchmark.

3.1. Comparison with Existing Work

Based on a review of existing research, we identify several essential requirements that impact the reproducibility and value of DUM benchmarks. These include three aspects that affect the difficulty of use: benchmark results, hyperparameter search, and optimal parameters provided; two aspects of data preprocessing settings, instance deduplication and feature normalization; and one aspect that measures the degree of compatibility with DUM. As shown in Table 1, we evaluate a set of existing benchmarks and libraries for uplift modeling and causal inference that are powerfully relevant to our work against these considerations, including Criteo-ITE-Benchmark (Diemert et al., 2021), CATENets (Curth and van der Schaar, 2021a), DoWhy (Sharma et al., 2019), EconML (Battocchi et al., 2019), CausalML (Chen et al., 2020), DECI¹¹1 https://github.com/microsoft/causica, ShowWhy²²2 https://github.com/microsoft/showwhy. We find that none of them fully meet these requirements, which motivates us to propose a new public benchmark for DUM in this paper.

Table 1. Comparison with other relevant benchmarks or libraries on reproducible configurations, where

\checkmark

— – —

\times

mean fully supported, partially supported, and not supported, respectively.

Benchmark or Libararys	Criteo	CATENets	DoWhy	EconML	causalML	DECI	ShowWhy	Ours
Benchmark Results	$\checkmark$	-	-	-	-	$\times$	$\times$	$\checkmark$
Hyperparameter Search	$\checkmark$	$\times$	$\checkmark$	$\checkmark$	$\checkmark$	$\times$	$\times$	$\checkmark$
Optimal Parameters	$\times$	$\times$	$\times$	$\times$	$\times$	$\times$	$\times$	$\checkmark$
Feature Normalization	$\checkmark$	$\checkmark$	$\checkmark$	$\checkmark$	$\checkmark$	$\times$	$\times$	$\checkmark$
Instance Deduplication	$\times$	$\times$	$\times$	$\times$	$\times$	$\times$	$\times$	$\checkmark$
Comprehensive DUM Compatibility	-	-	-	-	-	$\times$	$\times$	$\checkmark$

Specifically, for the first three aspects: 1) Benchmark results. Although many uplift modeling studies provide partial baseline results, they often lack important details such as baseline model implementation, training code, and hyperparameters. Furthermore, they have insufficient models, datasets, and evaluation metrics coverage. 2) Hyperparameter search. A thorough hyperparameter search is a prerequisite for evaluating the model’s true performance. Most existing benchmarks or libraries provide hyperparameter search capabilities based on scikit-learn or other toolkits. However, they do not provide a reasonable hyperparameter search range, which makes it difficult for researchers to find the optimal parameters for different models quickly. 3) Optimal parameters provided. Providing optimal parameters can speed up model replication for practitioners and prevent wasted resources and inconsistent results. Unfortunately, most existing libraries do not provide optimal parameters, which complicates reproducibility.

For the latter three aspects: 1) Instance deduplication. In particular, we note that existing industrial datasets often contain duplicate instances, i.e., users with the same profile appear multiple times in the dataset. Our empirical validation shows that removing these duplicates can significantly impact performance evaluation. Unfortunately, this is often overlooked by existing research and libraries. 2) Feature normalization. Feature processing plays a crucial role in the performance of the model. Our analysis shows that feature normalization has a significant impact on improving estimation results. However, many existing libraries provide some feature transformation function modules but ignore key details, such as how to handle numerical features and whether to perform feature normalization. 3) Comprehensive DUM compatibility. Different from other routes, the DUM model requires various complex deep network modules, and there are many reusable structures. Therefore, the targeted development of a set of pluggable and reusable network components similar to those processed in traditional click-through rate prediction benchmarks (Zhu et al., 2021) is necessary and beneficial for DUM. However, since existing libraries or benchmarks do not pay much attention to DUM, they lack sufficient support.

3.2. Evaluation Protocol

Next, we provide a detailed introduction to the evaluation protocol adopted in our benchmark.

3.2.1. Datasets

Our experiments adopt two representative industrial datasets to better adapt to different marketing scenarios: 1) Lazada (Zhong et al., 2022). It is a dataset collected on the Lazada platform, containing 83 continuous features, a processing indicator, and a label. It provides data sets from two different sources, where the training set is collected from a production environment with a treatment bias, the treatment allocation is selective due to the operation target policy, and the test set is collected from the users who are not affected by the targeting strategy and the treatment assignment follows the randomized controlled trials (RCT). 2) Criteo (Diemert et al., 2021) is a dataset from real advertising scenarios provided by Criteo AI Labs. It contains nearly 14 million instances with similar treatment bias, 12 continuous features, a treatment indicator, and 2 labels (i.e., visits and conversions). The statistics of the two public datasets are shown in Table 2.

Table 2. Statistics of the two public datasets.

Dataset	Criteo	Lazada
Typ	Train	Train	Test
Size	13,979,592	926,669	181,669
Ratio of Treatment to Control	5.67:1	0.28:1	1.09:1
Positive Sample Ratio	4.70%	1.99%	3.52%
Relative Average Uplift	27.07%	499.8%	11.2%
Average Uplift	1.03%	4.72%	0.37%
Features Number	12	83	83
Conversion Target	Visit	Label	Label

3.2.2. Data Splitting

Following most existing studies (Liu et al., 2023), we randomly split the Criteo dataset into three sets, with a ratio of 8:1:1 for the training, validation, and testing. For the Lazada dataset, to make it accurately reproducible and easy to compare with existing work, we follow the splitting strategy reported by DESCN (Zhong et al., 2022): we randomly split the training set into two subsets with a ratio of 9:1 for training and validation, and the RCT test data is directly used as the test set. We set the random seed to 0-9, respectively, to ensure reproducibility when performing random splits.

3.2.3. Data Preprocessing

Since all feature values are continuous and do not require any special processing, we use all features from both datasets. Considering that our benchmark investigates the effect of data normalization on the results, we perform batch normalization (Ioffe and Szegedy, 2015) to the input features when choosing to perform normalization on the dataset. The expression can be referenced with Eq.(6).

(6)

\hat{x}_{i}=\frac{x_{i}-E[x]}{\sqrt{Var[x]+\epsilon}}\rightarrow y_{i}=\gamma% \hat{x}_{i}+\beta,

where $\gamma$ is the scaling factor and $\beta$ is the bias term. During the network’s training process, both $\gamma$ and $\beta$ are learnable parameters that can be updated by the back-propagation algorithm.

3.2.4. Evaluation Metrics

To evaluate each model’s effectiveness comprehensively, we adopt four widely used metrics in DUM: AUUC, QINI, WAU, and the uplift score at the first $k$ percentile (LIFT@ $k$ ). We report the results with $k$ set to 30 and compute these metrics using a standard python package scikit-uplift³³3https://www.uplift-modeling.com.

3.2.5. Benchmark Models and Implementation Details

We select all representative methods described in Section 2 as benchmark models. Because PyTorch can avoid non-determinism on GPU devices better than Tensorflow (Zhu et al., 2021), we choose it to implement all models. We utilized the AdamW optimizer and set a maximum of 20 iterations. Our model training is performed on V100 GPU clusters. Due to the different workloads of the GPU cluster, each model’s epoch training time is for reference only. We record all data settings and model hyperparameters in configuration files to make configurations more convenient. To promote more reproducible research, we plan to open-source the implementation and settings of all models to the community after the paper is accepted.

3.2.6. Hyperparameter Tuning

To ensure a fair and objective evaluation of all models’ true performance, we conduct a thorough search for hyperparameters using QINI as the main evaluation metric. In addition, we adopt an early-stopping mechanism with a patience of 5 to avoid overfitting the training set. To improve efficiency, we use the hyperparameter search library Optuna⁴⁴4https://optuna.org/ to speed up the tuning process. The search range of all hyperparameters is shown in Table 3. Note that different models may have different numbers of auxiliary losses, and we use the same parameter search range across all these losses.

Table 3. Hyperparameters and their values tuned in the experiments.

Name	Range	Functionality
$rank$	$\left\{2^{5},2^{6},2^{7}\right\}$	Hidden units
$bs$	$\left\{2^{8},2^{9},2^{10},2^{11}\right\}$	Batch size
$lr$	$\left\{1e^{-4},5e^{-4},1e^{-3},5e^{-3},1e^{-2}\right\}$	Learning rate
$\lambda$	$\left\{1e^{-5},1e^{-4},1e^{-3},1e^{-2}\right\}$	Weight decay
$\alpha$	$\left\{0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9\right\}$	Auxiliary losses weight

4. Experiment Results and Discussion

In this section, we provide valuable discussions and insights into the benchmark results and offer new and instructive experiences for the practical deployment of DUM. Again, we emphasize that the training and testing distributions are not significantly inconsistent on Criteo, but the opposite is true on Ladaza.

4.1. Full Benchmark Results

Benchmark results using different data processing combinations on all datasets are shown in Tables 4, 5, 6, and 7.

Table 4. Benchmark results on Criteo and Lazada dataset with instance deduplication and without feature normalization. We highlight the top 3 best results, where the best results are marked in bold.

Dataset	Year	Model	Valid Metrics				Test Metrics				Train Infos
Dataset	Year	Model	QINI	AUUC	WAU	LIFT@30	QINI	AUUC	WAU	LIFT@30	Time(s)	Epochs	Params
Criteo	2012	S-Learner	0.0946	0.0370	0.0089	0.0329	0.0951	0.0375	0.0087	0.0326	34	4	28697
	2016	T-Learner	0.1035	0.0407	0.0093	0.0334	0.1053	0.0416	0.0091	0.0334	71	1	4314
	2016	BNN	0.1047	0.0411	0.0098	0.0341	0.1017	0.0402	0.0098	0.0333	36	13	2201
	2017	TARNet	0.1024	0.0406	0.0090	0.0439	0.1063	0.0425	0.0087	0.0442	111	10	14426
	2017	CFRNet	0.1075	0.0426	0.0087	0.0472	0.1085	0.0434	0.0088	0.0466	212	7	14426
	2017	CEVAE	0.0975	0.0383	0.0112	0.0355	0.0881	0.0348	0.0111	0.0341	113	12	99270
	2018	GANITE	0.1209	0.0482	0.0097	0.0489	0.1192	0.0478	0.0097	0.0474	331	12	139933
	2019	DragonNet	0.1085	0.0430	0.0087	0.0460	0.1099	0.0439	0.0089	0.0463	133	7	18652
	2021	FlexTENet	0.1146	0.0460	0.0093	0.0496	0.1164	0.0470	0.0094	0.0494	72	3	18619
	2021	SNet	0.1016	0.0401	0.0086	0.0407	0.1044	0.0414	0.0083	0.0405	61	4	61339
	2021	EUEN	0.1158	0.0459	0.0086	0.0462	0.1143	0.0456	0.0087	0.0448	63	9	15292
	2022	DESCN	0.0817	0.0316	0.0086	0.0335	0.0795	0.0309	0.0085	0.0321	72	10	25437
	2023	EFIN	0.1255	0.0502	0.0110	0.0478	0.1220	0.0491	0.0110	0.0473	517	11	39950
Lazada	2012	S-Learner	0.4192	0.0013	0.0292	0.0470	0.0178	0.0024	0.0037	0.0082	7	4	12391
	2016	T-Learner	0.4431	0.0076	0.0372	0.0533	0.0231	0.0032	0.0041	0.0064	7	5	75432
	2016	BNN	0.4741	0.0064	0.0394	0.0579	0.0199	0.0028	0.0037	0.0077	19	10	12391
	2017	TARNet	0.4394	0.0047	0.0369	0.0541	0.0177	0.0024	0.0040	0.0051	7	8	64680
	2017	CFRNet	0.4595	0.0084	0.0405	0.0585	0.0234	0.0032	0.0036	0.0079	9	11	64680
	2017	CEVAE	0.4424	0.0049	0.0386	0.0536	0.0078	0.0010	0.0035	0.0052	35	13	234226
	2018	GANITE	0.4946	0.0138	0.0194	0.0680	0.0187	0.0025	0.0035	0.0075	28	12	50475
	2019	DragonNet	0.4961	0.0110	0.0386	0.0604	0.0177	0.0024	0.0038	0.0069	9	11	81322
	2021	FlexTENet	0.4995	0.0110	0.0402	0.0662	0.0215	0.0030	0.0036	0.0085	21	10	166377
	2021	SNet	0.4502	0.0185	0.0310	0.0652	0.0190	0.0026	0.0039	0.0076	20	6	157353
	2021	EUEN	0.4513	0.0075	0.0377	0.0550	0.0205	0.0028	0.0040	0.0076	7	4	16770
	2022	DESCN	0.4671	0.0070	0.0357	0.0533	0.0252	0.0034	0.0038	0.0077	8	3	30123
	2023	EFIN	0.5181	0.0117	0.0433	0.0771	0.0169	0.0023	0.0036	0.0074	152	11	186754

Table 5. Benchmark results on Criteo and Lazada dataset with instance deduplication and feature normalization. We highlight the top 3 best results, where the best results are marked in bold.

Dataset	Year	Model	Valid Metrics				Test Metrics				Train Infos
Dataset	Year	Model	QINI	AUUC	WAU	LIFT@30	QINI	AUUC	WAU	LIFT@30	Time(s)	Epochs	Params
Criteo	2012	S-Learner	0.0885	0.0342	0.0085	0.0332	0.0825	0.0321	0.0083	0.0320	97	15	7705
	2016	T-Learner	0.1029	0.0406	0.0083	0.0429	0.1034	0.0411	0.0082	0.0421	68	6	15258
	2016	BNN	0.0939	0.0369	0.0097	0.0335	0.0911	0.0360	0.0096	0.0323	191	3	2201
	2017	TARNet	0.1107	0.0438	0.0083	0.0442	0.1087	0.0433	0.0081	0.0434	207	6	55450
	2017	CFRNet	0.1140	0.0454	0.0088	0.0485	0.1194	0.0478	0.0088	0.0485	43	5	3898
	2017	CEVAE	0.0315	0.0124	0.0148	0.0217	0.0334	0.0132	0.0146	0.0231	194	1	26238
	2018	GANITE	0.1286	0.0514	0.0101	0.0499	0.1272	0.0512	0.0100	0.0491	558	7	139933
	2019	DragonNet	0.1073	0.0423	0.0088	0.0379	0.1057	0.0420	0.0088	0.0371	98	2	72092
	2021	FlexTENet	0.1110	0.0441	0.0087	0.0463	0.1125	0.0450	0.0084	0.0459	71	8	18619
	2021	SNet	0.1118	0.0444	0.0089	0.0460	0.1127	0.0451	0.0088	0.0456	74	7	125403
	2021	EUEN	0.1195	0.0475	0.0101	0.0492	0.1186	0.0475	0.0099	0.0486	67	8	15292
	2022	DESCN	0.0906	0.0360	0.0100	0.0382	0.0827	0.0330	0.0100	0.0358	118	6	25437
	2023	EFIN	0.1238	0.0491	0.0096	0.0488	0.1208	0.0483	0.0095	0.0476	547	15	21216
Lazada	2012	S-Learner	0.4022	0.0006	0.0277	0.0451	0.0246	0.0034	0.0035	0.0079	7	3	37927
	2016	T-Learner	0.4348	0.0047	0.0359	0.0534	0.0163	0.0022	0.0038	0.0066	13	1	75432
	2016	BNN	0.4583	0.0045	0.0379	0.0526	0.0292	0.0040	0.0039	0.0078	8	3	37927
	2017	TARNet	0.4527	0.0070	0.0370	0.0569	0.0266	0.0037	0.0038	0.0078	20	1	6312
	2017	CFRNet	0.4714	0.0073	0.0357	0.0561	0.0303	0.0042	0.0037	0.0090	6	2	64680
	2017	CEVAE	0.4388	0.0066	0.0398	0.0579	0.0121	0.0017	0.0037	0.0064	31	4	201330
	2018	GANITE	0.4944	0.0089	0.0241	0.0572	0.0194	0.0026	0.0036	0.0074	41	4	167339
	2019	DragonNet	0.5023	0.0173	0.0432	0.0766	0.0209	0.0029	0.0038	0.0079	14	3	23338
	2021	FlexTENet	0.4413	0.0079	0.0326	0.0549	0.0220	0.0030	0.0036	0.0088	9	10	60009
	2021	SNet	0.4371	0.0035	0.0335	0.0523	0.0231	0.0032	0.0035	0.0089	13	3	132905
	2021	EUEN	0.4528	0.0075	0.0384	0.0580	0.0206	0.0028	0.0036	0.0076	7	1	50010
	2022	DESCN	0.4734	0.0104	0.0266	0.0573	0.0197	0.0027	0.0036	0.0069	9	6	9131
	2023	EFIN	0.5510	0.0148	0.0446	0.0748	0.0131	0.0018	0.0037	0.0084	83	12	376914

Table 6. Benchmark results on Criteo and Lazada dataset without instance deduplication and feature normalization. We highlight the top 3 best results, where the best results are marked in bold.

Dataset	Year	Model	Valid Metrics				Test Metrics				Train Infos
Dataset	Year	Model	QINI	AUUC	WAU	LIFT@30	QINI	AUUC	WAU	LIFT@30	Time(s)	Epochs	Params
Criteo	2012	S-Learner	0.0988	0.0387	0.0080	0.0306	0.0902	0.0354	0.0078	0.0298	100	11	7705
	2016	T-Learner	0.0888	0.0348	0.0078	0.0284	0.0790	0.0310	0.0077	0.0277	49	1	4314
	2016	BNN	0.0979	0.0384	0.0078	0.0309	0.0888	0.0349	0.0077	0.0286	68	12	28697
	2017	TARNet	0.0952	0.0374	0.0080	0.0308	0.0847	0.0333	0.0079	0.0288	215	15	3898
	2017	CFRNet	0.0994	0.0390	0.0078	0.0317	0.0901	0.0354	0.0079	0.0302	139	15	14426
	2017	CEVAE	0.0950	0.0373	0.0083	0.0306	0.0867	0.0341	0.0082	0.0290	389	10	155606
	2018	GANITE	0.0850	0.0334	0.0072	0.0291	0.0818	0.0322	0.0072	0.0283	967	2	139933
	2019	DragonNet	0.0947	0.0371	0.0083	0.0303	0.0851	0.0335	0.0080	0.0289	281	1	72092
	2021	FlexTENet	0.1015	0.0398	0.0080	0.0317	0.0924	0.0363	0.0077	0.0298	475	9	36219
	2021	SNet	0.0973	0.0382	0.0083	0.0311	0.0843	0.0331	0.0081	0.0284	237	4	27803
	2021	EUEN	0.0949	0.0373	0.0079	0.0308	0.0898	0.0353	0.0077	0.0300	122	12	30756
	2022	DESCN	0.0831	0.0327	0.0074	0.0284	0.0803	0.0316	0.0074	0.0274	234	10	98973
	2023	EFIN	0.0949	0.0372	0.0077	0.0308	0.0859	0.0337	0.0074	0.0287	168	2	3523
Lazada	2012	S-Learner	0.4581	0.0030	0.0404	0.0568	0.0207	0.0029	0.0038	0.0072	14	1	75432
	2016	T-Learner	0.4648	0.0036	0.0386	0.0549	0.0217	0.0031	0.0038	0.0084	7	2	4615
	2016	BNN	0.4558	0.0057	0.0416	0.0608	0.0230	0.0033	0.0039	0.0073	11	4	12391
	2017	TARNet	0.4635	0.0053	0.0416	0.0612	0.0214	0.0030	0.0040	0.0072	19	4	6312
	2017	CFRNet	0.4609	0.0078	0.0421	0.0631	0.0231	0.0033	0.0038	0.0087	8	6	64680
	2017	CEVAE	0.4734	0.0053	0.0420	0.0564	0.0158	0.0022	0.0037	0.0074	18	7	72738
	2018	GANITE	0.4592	0.0065	0.0181	0.0567	0.0174	0.0024	0.0036	0.0078	17	15	17003
	2019	DragonNet	0.5051	0.0196	0.0475	0.0800	0.0176	0.0025	0.0036	0.0077	9	9	23338
	2021	FlexTENet	0.4576	0.0029	0.0385	0.0540	0.0185	0.0026	0.0040	0.0079	8	1	49993
	2021	SNet	0.4767	0.0110	0.0317	0.0582	0.0238	0.0034	0.0040	0.0076	12	15	97833
	2021	EUEN	0.4758	0.0108	0.0412	0.0687	0.0156	0.0022	0.0038	0.0073	20	5	50010
	2022	DESCN	0.4741	0.0056	0.0385	0.0579	0.0223	0.0031	0.0039	0.0075	9	5	9131
	2023	EFIN	0.5131	0.0084	0.0388	0.0655	0.0141	0.0020	0.0036	0.0077	83	1	376914

Table 7. Benchmark results on Criteo and Lazada dataset without instance deduplication and with feature normalization. We highlight the top 3 best results, where the best results are marked in bold.

Dataset	Year	Model	Valid Metrics				Test Metrics				Train Infos
Dataset	Year	Model	QINI	AUUC	WAU	LIFT@30	QINI	AUUC	WAU	LIFT@30	Time(s)	Epochs	Params
Criteo	2012	S-Learner	0.1020	0.0400	0.0083	0.0312	0.0868	0.0341	0.0081	0.0288	100	8	7705
	2016	T-Learner	0.0973	0.0382	0.0081	0.0309	0.0843	0.0331	0.0081	0.0285	113	8	15258
	2016	BNN	0.0997	0.0391	0.0082	0.0309	0.0828	0.0326	0.0080	0.0282	76	10	28697
	2017	TARNet	0.0981	0.0385	0.0079	0.0304	0.0834	0.0327	0.0078	0.0283	110	10	14426
	2017	CFRNet	0.0984	0.0386	0.0083	0.0309	0.0892	0.0351	0.0083	0.0297	85	7	3898
	2017	CEVAE	0.0713	0.0281	0.0084	0.0261	0.0660	0.0260	0.0084	0.0248	120	1	42878
	2018	GANITE	0.0829	0.0326	0.0072	0.0285	0.0807	0.0318	0.0071	0.0278	890	1	10045
	2019	DragonNet	0.0992	0.0390	0.0082	0.0307	0.0886	0.0348	0.0081	0.0288	58	9	4988
	2021	FlexTENet	0.0994	0.0390	0.0082	0.0306	0.0894	0.0351	0.0079	0.0290	484	15	138971
	2021	SNet	0.0977	0.0383	0.0080	0.0307	0.0851	0.0334	0.0080	0.0283	78	7	125403
	2021	EUEN	0.0987	0.0388	0.0084	0.0305	0.0899	0.0353	0.0082	0.0295	74	1	15292
	2022	DESCN	0.0827	0.0325	0.0069	0.0287	0.0803	0.0316	0.0070	0.0279	132	7	25437
	2023	EFIN	0.0961	0.0377	0.0084	0.0308	0.0857	0.0337	0.0083	0.0294	586	1	21216
Lazada	2012	S-Learner	0.4288	0.0017	0.0355	0.0520	0.0218	0.0031	0.0036	0.0081	13	15	37927
	2016	T-Learner	0.4455	0.0046	0.0335	0.0542	0.0237	0.0034	0.0040	0.0074	14	5	75432
	2016	BNN	0.4688	0.0050	0.0401	0.0576	0.0184	0.0026	0.0041	0.0074	11	3	4615
	2017	TARNet	0.4381	0.0016	0.0353	0.0510	0.0195	0.0028	0.0040	0.0069	14	3	64680
	2017	CFRNet	0.4624	0.0029	0.0407	0.0546	0.0219	0.0031	0.0041	0.0082	10	5	6312
	2017	CEVAE	0.4509	0.0069	0.0425	0.0616	0.0220	0.0031	0.0036	0.0069	33	4	201330
	2018	GANITE	0.4873	0.0067	0.0193	0.0566	0.0167	0.0023	0.0037	0.0075	34	15	50475
	2019	DragonNet	0.5103	0.0164	0.0459	0.0794	0.0204	0.0029	0.0038	0.0074	11	12	23338
	2021	FlexTENet	0.4374	0.0030	0.0323	0.0510	0.0238	0.0034	0.0038	0.0076	14	3	93177
	2021	SNet	0.4605	0.0091	0.0391	0.0586	0.0272	0.0039	0.0042	0.0075	9	4	157353
	2021	EUEN	0.4758	0.0079	0.0402	0.0596	0.0245	0.0035	0.0038	0.0079	9	3	50010
	2022	DESCN	0.5134	0.0131	0.0444	0.0747	0.0185	0.0026	0.0036	0.0079	10	2	108203
	2023	EFIN	0.5364	0.0135	0.0475	0.0699	0.0163	0.0023	0.0036	0.0070	27	2	95198

4.2. Sensitivity of Instance Deduplication

We first conduct two corresponding experiments to evaluate the impact of the instance deduplication operation on DUM, considering the case where the feature normalization operation is performed or not performed.

4.2.1. No Feature Normalization Operation is Performed

We report the corresponding results in Figure 2, where the left and right represent those obtained on Criteo and Lazada, respectively. Based on the results of Figure 2, we can have the following observations: 1) Without performing feature normalization operations, by performing instance deduplication operations on Criteo, i.e., when there is no apparent inconsistency between the training distribution and the test distribution, the performance of the DUM model can be significantly improved in most cases, except for CEVAE and DESCN, and 2) However, by performing instance deduplication on Lazada, i.e., when there is an apparent inconsistency between the training distribution and the test distribution, most DUM models will suffer, although a small number of DUM models can benefit from it.

4.2.2. Feature Normalization Operation is Performed

We report the corresponding results in Figure 3, where the left and right represent those obtained on Criteo and Lazada, respectively. Based on the results of Figure 3, we can have the following observations: 1) Compared with the case without feature normalization (i.e., Figure 2), the trends of most DUM models on Criteo are similar, and 2) However, most DUM models on Ladaza will have an opposite trend compared to the case without feature normalization (i.e., Figure 2).

4.2.3. Remark

Based on the above results, we can find that when deploying a specific DUM model in practical applications, practitioners must consider its compatibility with duplicate instances to determine whether to perform instance deduplication. In addition, it also shows that most DUM models are not robust to the presence of duplicate instances, which requires more research attention on data evaluation, denoising, and selection of DUM models.

4.3. Sensitivity of Feature Normalization

To evaluate the impact of the feature normalization operation on DUM, we conduct two corresponding experiments, respectively, considering the case where the instance deduplication operation is performed or not performed.

4.3.1. No Instance Deduplication Operation is Performed

We report the corresponding results in Figure 4, where the left and right represent those obtained on Criteo and Lazada, respectively. Based on the results of Figure 4, we can have the following observations: 1) On Criteo, performing feature normalization operations has almost no obvious impact without performing instance deduplication operations, except for a slight decrease in the performance of some DUM models. 2) However, we can find that the impact of performing feature normalization operations on Ladaza is bipolar, with significant improvements to S-Learner, BNN, TARNet, and CFRNet but damage to others.

4.3.2. Instance Deduplication Operation is Performed

We report the corresponding results in Figure 5, where the left and right represent those obtained on Criteo and Lazada, respectively. Based on the results of Figure 5, we can have the following observations: 1) Compared to the case where instance deduplication is not performed (i.e., Figure 4), the trends are similar on most DUMs on Criteo, except for more significant fluctuations on several models, such as CFRNet and GANITE. 2) The performance trend changes on Ladaza are similar to those observed on Criteo.

4.3.3. Remark

The above results suggest that practitioners also need to make more considerations in feature processing to determine the best operation when deploying a specific DUM model. Furthermore, it also shows that most DUM models are not robust to different feature processing, requiring more research on feature selection and feature embedding learning of DUM.

4.4. Sensitivity of Test Distribution

By comparing the results in Tables 4, 5, 6, and 7, we can find that no matter which data preprocessing combination is used, most DUM models often have different performance on Criteo and Lazada. Taking the results of a recent DUM model, EFIN, as an example, it has excellent performance in most cases on the Criteo dataset. However, its performance on the Lazada dataset is particularly weak. This primarily reflects that most existing DUM models are not robust enough when facing different test distributions, where Criteo has a selection-biased test environment, and Lazada has an unbiased test environment. This requires practitioners to reasonably select an appropriate DUM model for deployment based on the nature of different business scenarios. This can also motivate more attention to robust or out-of-distribution learning of DUM models and the automatic selection of the best DUM model in offline scenarios.

5. Limitations and Future Work

In this section, we will address the limitations of our study and explore potential directions for future research.

Dataset. In our open benchmark, we evaluated the performance of various uplift models using two real-world industrial datasets, Criteo (Diemert et al., 2021) and Lazada (Zhong et al., 2022). To the best of our knowledge, EC-LIFT (Ke et al., 2021) is another large-scale industrial dataset for different brands in a large-scale advertising scene, which is open-sourced by Alimama and contains billions of instances. We plan to extend more datasets to make it a more comprehensive open benchmark for uplift modeling.

Instance and Feature Processing. In the previous sections, we discussed the impact of data normalization and instance deduplication on the accuracy of uplift prediction. However, other factors can also affect the performance of uplift modeling. In our subsequent research, we plan to offer a more comprehensive range of instance and feature processing methods to evaluate the impact of different data processing techniques on uplift prediction accuracy.

Models. At present, our open benchmark only compares the performance of binary intervention uplift models. In our subsequent work, we plan to expand the benchmark to support multi-value intervention, continuous value intervention, and time-variant intervention uplift modeling. We also plan to add more representative models to further enhance its comprehensiveness and value.

6. Conclusions

This paper presents the first open benchmark, DUMOM, for deep uplift modeling (DUM) tasks. Our goal is to address the problem of non-reproducible and inconsistent results in this area of research. We design a standardized evaluation protocol to reevaluate 13 existing models by employing four preprocessing settings on two widely used real-world datasets. We rigorously compare existing models and provide a fair benchmark result. The results show that achieving substantial performance improvements in DUM studies is challenging and requires special consideration of generalization issues. In addition, we provide some valuable experience when deploying specific deep boosting models, i.e., practitioners must consider more about the impact of different data preprocessing on specific DUM models. We believe our benchmark helps drive further developments in this research area and for evaluating new studies’ fair performance.

References

(1)
Ai et al. (2022) Meng Ai, Biao Li, Heyang Gong, Qingwei Yu, Shengjie Xue, Yuan Zhang, Yunzhou Zhang, and Peng Jiang. 2022. LBCF: A large-scale budget-constrained causal forest algorithm. In Proceedings of the ACM Web Conference 2022. 2310–2319.
Albert and Goldenberg (2021) Javier Albert and Dmitri Goldenberg. 2021. E-commerce promotions personalization via online multiple-choice knapsack with uplift modeling. arXiv preprint arXiv:2108.13298 (2021).
Athey and Imbens (2016) Susan Athey and Guido Imbens. 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113, 27 (2016), 7353–7360.
Bang and Robins (2005) Heejung Bang and James M Robins. 2005. Doubly robust estimation in missing data and causal inference models. Biometrics 61, 4 (2005), 962–973.
Battocchi et al. (2019) Keith Battocchi, Eleanor Dillon, Maggie Hei, Greg Lewis, Paul Oka, Miruna Oprescu, and Vasilis Syrgkanis. 2019. EconML: A Python Package for ML-Based Heterogeneous Treatment Effects Estimation. https://github.com/py-why/EconML. Version 0.x.
Betlei et al. (2021) Artem Betlei, Eustache Diemert, and Massih-Reza Amini. 2021. Uplift modeling with generalization guarantees. In Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 55–65.
Chen et al. (2020) Huigang Chen, Totte Harinen, Jeong-Yoon Lee, Mike Yung, and Zhenyu Zhao. 2020. CausalML: Python package for causal machine learning. arXiv preprint arXiv:2002.11631 (2020).
Chen et al. (2022) Xuanying Chen, Zhining Liu, Li Yu, Liuyi Yao, Wenpeng Zhang, Yi Dong, Lihong Gu, Xiaodong Zeng, Yize Tan, and Jinjie Gu. 2022. Imbalance-aware uplift modeling for observational data. In Proceedings of the 36th AAAI Conference on Artificial Intelligence. 6313–6321.
Chipman et al. (2010) Hugh A Chipman, Edward I George, and Robert E McCulloch. 2010. BART: Bayesian additive regression trees. The Annals of Applied Statistics 4, 1 (2010).
Curth and van der Schaar (2021a) Alicia Curth and Mihaela van der Schaar. 2021a. Nonparametric estimation of heterogeneous treatment effects: From theory to learning algorithms. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics. 1810–1818.
Curth and van der Schaar (2021b) Alicia Curth and Mihaela van der Schaar. 2021b. On inductive biases for heterogeneous treatment effect estimation. Proceedings of the 35th International Conference on Neural Information Processing Systems, 15883–15894.
Devriendt et al. (2020) Floris Devriendt, Jente Van Belle, Tias Guns, and Wouter Verbeke. 2020. Learning to rank for uplift modeling. IEEE Transactions on Knowledge and Data Engineering 34, 10 (2020), 4888–4904.
Diemert et al. (2021) Eustache Diemert, Artem Betlei, Christophe Renaudin, Massih-Reza Amini, Théophane Gregoir, and Thibaud Rahier. 2021. A large scale benchmark for individual treatment effect prediction and uplift modeling. arXiv preprint arXiv:2111.10106 (2021).
Goldenberg et al. (2020) Dmitri Goldenberg, Javier Albert, Lucas Bernardi, and Pablo Estevez. 2020. Free lunch! retrospective uplift modeling for dynamic promotions recommendation within roi constraints. In Proceedings of the 14th ACM Conference on Recommender Systems. 486–491.
He et al. (2022) Xiaofeng He, Guoqiang Xu, Cunxiang Yin, Zhongyu Wei, Yuncong Li, Yancheng He, and Jing Cai. 2022. Causal enhanced uplift model. In Proceedings of the 26th Pacific-Asia Conference on Knowledge Discovery and Data Mining. 119–131.
Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning. 448–456.
Johansson et al. (2016) Fredrik D Johansson, Uri Shalit, and David Sontag. 2016. Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning. 3020–3029.
Ke et al. (2021) Wenwei Ke, Chuanren Liu, Xiangfu Shi, Yiqiao Dai, S Yu Philip, and Xiaoqiang Zhu. 2021. Addressing exposure bias in uplift modeling for large-scale online advertising. In Proceedings of the 2021 IEEE International Conference on Data Mining. 1156–1161.
Künzel et al. (2019) Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. 2019. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences 116, 10 (2019), 4156–4165.
Lin et al. (2017) Ying-Chun Lin, Chi-Hsuan Huang, Chu-Cheng Hsieh, Yu-Chen Shu, and Kun-Ta Chuang. 2017. Monetary discount strategies for real-time promotion campaign. In Proceedings of the ACM Web Conference 2017. 1123–1132.
Liu et al. (2023) Dugang Liu, Xing Tang, Han Gao, Fuyuan Lyu, and Xiuqiang He. 2023. Explicit feature interaction-aware uplift network for online marketing. In Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 4507–4515.
Louizos et al. (2017) Christos Louizos, Uri Shalit, Joris Mooij, David Sontag, Richard Zemel, and Max Welling. 2017. Causal effect inference with deep latent-variable models. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6449–6459.
Nie and Wager (2021) X Nie and S Wager. 2021. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika 108, 2 (2021), 299–319.
Radcliffe and Surry (2011) Nicholas J Radcliffe and Patrick D Surry. 2011. Real-world uplift modeling with significance-based uplift trees. White Paper TR-2011-1, Stochastic Solutions (2011), 1–33.
Rubin (2005) Donald B Rubin. 2005. Causal inference using potential outcomes: Design, modeling, decisions. J. Amer. Statist. Assoc. 100, 469 (2005), 322–331.
Shalit et al. (2017) Uri Shalit, Fredrik D Johansson, and David Sontag. 2017. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning. 3076–3085.
Sharma et al. (2019) Amit Sharma, Emre Kiciman, et al. 2019. DoWhy: A Python package for causal inference. https://github.com/microsoft/dowhy.
Shi et al. (2019) Claudia Shi, David M Blei, and Victor Veitch. 2019. Adapting neural networks for the estimation of treatment effects. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2507–2517.
Wager and Athey (2018) Stefan Wager and Susan Athey. 2018. Estimation and inference of heterogeneous treatment effects using random forests. J. Amer. Statist. Assoc. 113, 523 (2018), 1228–1242.
Wan et al. (2022) Shu Wan, Chen Zheng, Zhonggen Sun, Mengfan Xu, Xiaoqing Yang, Hongtu Zhu, and Jiecheng Guo. 2022. GCF: Generalized causal forest for heterogeneous treatment effect estimation in online marketplace. arXiv preprint arXiv:2203.10975 (2022).
Xu et al. (2022) Guoqiang Xu, Cunxiang Yin, Yuchen Zhang, Yuncong Li, Yancheng He, Jing Cai, and Zhongyu Wei. 2022. Learning discriminative representation base on attention for uplift. In Proceedings of the 26th Pacific-Asia Conference on Knowledge Discovery and Data Mining. 200–211.
Yoon et al. (2018) Jinsung Yoon, James Jordon, and Mihaela Van Der Schaar. 2018. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In Proceedings of the 6th International Conference on Learning Representations.
Zhao et al. (2019) Kui Zhao, Junhao Hua, Ling Yan, Qi Zhang, Huan Xu, and Cheng Yang. 2019. A unified framework for marketing budget allocation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1820–1830.
Zhong et al. (2022) Kailiang Zhong, Fengtong Xiao, Yan Ren, Yaorong Liang, Wenqing Yao, Xiaofeng Yang, and Ling Cen. 2022. DESCN: Deep entire space cross networks for individual treatment effect estimation. In Proceedings of the 28th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 4612–4620.
Zhu et al. (2021) Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, and Xiuqiang He. 2021. Open benchmarking for click-through rate prediction. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management. 2759–2769.