Recall-Augmented Ranking: Enhancing Click-Through Rate Prediction Accuracy with Cross-Stage Data
Abstract.
Click-through rate (CTR) prediction plays an indispensable role in online platforms. Numerous models have been proposed to capture users’ shifting preferences by leveraging user behavior sequences. However, these historical sequences often suffer from severe homogeneity and scarcity compared to the extensive item pool. Relying solely on such sequences for user representations is inherently restrictive, as user interests extend beyond the scope of items they have previously engaged with. To address this challenge, we propose a data-driven approach to enrich user representations. We recognize user profiling and recall items as two ideal data sources within the cross-stage framework, encompassing the u2u (user-to-user) and i2i (item-to-item) aspects respectively. In this paper, we propose a novel architecture named Recall-Augmented Ranking (RAR). RAR consists of two key sub-modules, which synergistically gather information from a vast pool of look-alike users and recall items, resulting in enriched user representations. Notably, RAR is orthogonal to many existing CTR models, allowing for consistent performance improvements in a plug-and-play manner. Extensive experiments are conducted, which verify the efficacy and compatibility of RAR against the SOTA methods.
1. Introduction
Recommender systems have been widely deployed to save users from information overload. Among them, CTR prediction is an essential task, which is to predict the probability that a user will click on an item under a particular context, enhancing both user experience and platform revenue.
Recently, many models have been proposed to extract user interest based on historical behavior sequences. However, items in user behavior sequences often exhibit homogeneity and scarcity versus the large-scale item pool, which is detailed in Section 2. Moreover, existing models often rely on target attention mechanisms, assigning higher scores to repetitive, similar items, reinforcing a cycle of homogeneity. In Figure 1, we provide an example. When a user buys lipstick, existing models often suggest similar products. However, the user might prefer exploring related items such as perfume or earrings, seeking variety beyond her initial purchase, even without previous interactions with these items. Therefore, we aim to enrich user representations from a data-driven perspective, incorporating diverse sources of information to enhance accuracy.
Moreover, CTR prediction traditionally focuses on single user-item interactions and often overlooks the interrelationships across various users and items, resulting in inadequate long-tail modeling. In contrast, the recall stage inherently generates similar user-item lists, providing cross-instance modeling capability. We recognize user profiling and recall items as two ideal data sources within the cross-stage framework, encompassing the u2u and i2i aspects respectively. In this paper, we are interested in how to leverage these cross-stage data to enhance CTR prediction accuracy rather than how to construct the two sets.
In this paper, we propose a novel architecture named Recall-Augmented Ranking (RAR) to enhance model accuracy based on cross-stage data. RAR consists of two key components: the Cross-Stage User/Item Selection Module and the Co-Interaction Module. These sub-modules efficiently gather information from a broad spectrum of look-alike users and recall items, thereby enriching user representations. Note that the Co-Interaction Module performs set-to-set modeling, which has not been previously explored in the CTR prediction task.
In summary, the contributions of the paper are as follows:
- We shed light on the limitations of relying solely on user behavior sequences to model user preferences. To address this inadequacy, we propose a novel architecture RAR, which leverages cross-stage data to enrich user representation.
- RAR contains two extra data sources, namely the look-alike user set and recall item set. It is the first work that incorporates set-to-set modeling into CTR prediction to the best of our knowledge.
- RAR serves as a framework capable of enhancing the performance of numerous existing CTR prediction models. Comprehensive experiments show RAR’s outperformance, effectiveness and compatibility with a wide variety of models.
2. Background
- Motivation of RAR: We provide an analysis of user behavior in Taobao (https://tianchi.aliyun.com/dataset/649). Figure 2(a) illustrates the scarcity of user historical sequences, with the majority of users having interacted with only a minuscule fraction of the total number of available items. Figure 2(b) highlights user behavior’s homogeneity, with most activity of a specific user concentrated in four to five categories out of thousands. Furthermore, traditional CTR models focus on single user-item interactions, yet overlook broader interrelationships. Conversely, the recall stage facilitates cross-instance modeling by linking similar user-item lists.
- Cascade Ranking System and User Profiling: In modern information retrieval applications, a cascade ranking system is often used to balance the efficiency and effectiveness. The system includes a variety of rankers. Each stage selects the top-k items it receives and feeds them to the next stage. Among them, recall and ranking are two common stages. Besides, look-alike methods have become a core component of online advertising and marketing, which are intended to identify similar users from a small user set.
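The cascade described above can be sketched in a few lines of Python. This is only an illustration of the top-k hand-off between stages, not the system used in the paper: the function `cascade_rank`, the two toy scorers, and the cutoffs are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str], float]


def cascade_rank(items: Sequence[str],
                 stages: Sequence[Tuple[Scorer, int]]) -> List[str]:
    """Run items through successive stages; each stage keeps its top-k.

    `stages` holds (scorer, k) pairs ordered from the cheapest ranker
    (e.g., recall) to the most expensive one (e.g., ranking). Scorers
    and cutoffs are illustrative assumptions.
    """
    survivors = list(items)
    for scorer, k in stages:
        # Sort by descending score and pass only the top-k downstream.
        survivors = sorted(survivors, key=scorer, reverse=True)[:k]
    return survivors


def recall_score(item: str) -> float:
    """Hypothetical cheap score used by the recall stage."""
    return int(item[4:]) % 7


def ranking_score(item: str) -> float:
    """Hypothetical costlier score used by the ranking stage."""
    return -int(item[4:])


pool = [f"item{i}" for i in range(10)]
print(cascade_rank(pool, [(recall_score, 4), (ranking_score, 2)]))
```

The ranking stage only ever sees what recall passed on, which is why the recall list is a natural source of extra candidate information for the ranker.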
Table 1. Key notations.
Notation | Description |
 | The number of selected look-alike users and recall items respectively. |
 | Similarity score matrix of look-alike users and recall items. |
 | Embedding matrix of look-alike users and recall items respectively. |
 | Embedding matrix of selected look-alike users and recall items. |
 | High-order representation of selected look-alike users and recall items. |
 | Hash projection matrix in selection modules. |
3. Approach
3.1. Cross-Stage User/Item Selection Module
The Cross-Stage User/Item Selection Module selects the most similar users and the most relevant items. The selection process can be abstracted into two steps; we take the selection of recall items as an example. First, similarity is measured between the target item and each recall item by a similarity function. Then the top-k relevant recall items are selected based on the similarity score, as formalized in Equations 1 and 2. We summarize the key notations and their descriptions in Table 1.
(1) |
(2) |
An intuitive idea is to use the embeddings and search for the k nearest neighbors by inner product. However, the huge number of multiplications makes real-world deployment impractical. Considering the selection complexity, we use the SimHash function in our experiments.
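For reference, the brute-force alternative mentioned above can be sketched as follows. The function name and the toy embeddings are hypothetical. Scoring N candidates of dimension d this way costs N·d multiplications per target.

```python
import heapq
from typing import List, Sequence


def topk_by_inner_product(target: Sequence[float],
                          candidates: Sequence[Sequence[float]],
                          k: int) -> List[int]:
    """Return indices of the k candidates with the largest inner product."""
    scores = [sum(t * c for t, c in zip(target, cand)) for cand in candidates]
    # nlargest returns the k best indices, ordered from best to worst.
    return heapq.nlargest(k, range(len(candidates)), key=scores.__getitem__)


target_item = [1.0, 0.0]
recall_items = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [-1.0, 0.0]]
print(topk_by_inner_product(target_item, recall_items, k=2))
```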
SimHash, leveraging locality-sensitive properties, ensures similar outputs for similar inputs through random projection and signed axes, simplifying embeddings to binary fingerprints. This process is detailed in Equations 3 and 4, where stands for the embedding of the recall item and is the hash function in the hash function set. It reduces storage and speeds up selection by using Hamming distance for efficient comparison.
(3) |
(4) |
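A minimal sketch of SimHash-based selection, assuming Gaussian random hyperplanes as the hash projection matrix and integer-packed fingerprints, is given below. The helper names (`make_projections`, `simhash`, `hamming`, `select_topk`), the number of bits, and the seed are illustrative choices, not details taken from the paper.

```python
import random
from typing import List, Sequence


def make_projections(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw n_bits random Gaussian hyperplanes, one per fingerprint bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def simhash(embedding: Sequence[float],
            projections: Sequence[Sequence[float]]) -> int:
    """Pack the signs of the random projections into an integer fingerprint."""
    fingerprint = 0
    for row in projections:
        dot = sum(w * x for w, x in zip(row, embedding))
        fingerprint = (fingerprint << 1) | (1 if dot >= 0 else 0)
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits: XOR followed by a count of the 1 bits."""
    return bin(a ^ b).count("1")


def select_topk(target: Sequence[float],
                candidates: Sequence[Sequence[float]],
                projections: Sequence[Sequence[float]],
                k: int) -> List[int]:
    """Indices of the k candidates whose fingerprints are closest to the target's."""
    t = simhash(target, projections)
    dists = [hamming(t, simhash(c, projections)) for c in candidates]
    return sorted(range(len(candidates)), key=dists.__getitem__)[:k]
```

In practice the candidate fingerprints would be computed once and cached, so that selection at serving time only involves the XOR and bit count.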
3.2. Co-Interaction Module
The Co-Interaction Module provides fine-grained set-to-set modeling. It improves upon the simplistic equal weighting of all selected recall items, which overlooks hierarchical information. We introduce a matching matrix to assess user-item interest compatibility. The matching score is represented as the inner product of high-level latent vectors, as shown in Equation 5. Then we compute the matching matrix in Equation 6, where is used to map the matching scores to (0,1).
(5) |
(6) |
To provide the model with a clearer indication of which recall items are more important, the signal is utilized to supervise the training of the matching matrix. As the exposed signal is very sparse, we define in Equation 8.
(7) |
(8) |
Finally, the matching matrix is averaged by row and by column to obtain the item and user weighting vectors. We obtain user common interest and user diverse interest by multiplying the weighting vectors with the corresponding embeddings. The enriched user representation is then obtained by concatenating and .
(9) |
(10) |
(11) |
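The data flow of the Co-Interaction Module can be sketched as below. This is a simplified illustration under stated assumptions: the matching matrix is laid out with look-alike users as rows and recall items as columns, the mapping to (0,1) is taken to be a sigmoid, and the high-order representations `h_users` and `h_items` are given as inputs rather than produced by a learned network. All function and variable names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matching_matrix(h_users: Matrix, h_items: Matrix) -> List[List[float]]:
    """M[u][i] = sigmoid(<h_users[u], h_items[i]>), so every entry lies in (0, 1)."""
    return [[sigmoid(sum(a * b for a, b in zip(hu, hi))) for hi in h_items]
            for hu in h_users]


def weighted_sum(weights: Sequence[float], embeddings: Matrix) -> List[float]:
    """Sum of embedding rows, each scaled by its weight."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]


def co_interaction(h_users: Matrix, h_items: Matrix,
                   e_users: Matrix, e_items: Matrix) -> List[float]:
    """Return the concatenation of the common-interest and diverse-interest vectors."""
    m = matching_matrix(h_users, h_items)
    n_users, n_items = len(m), len(m[0])
    # Averaging each row (over items) gives one weight per look-alike user.
    user_w = [sum(row) / n_items for row in m]
    # Averaging each column (over users) gives one weight per recall item.
    item_w = [sum(m[u][i] for u in range(n_users)) / n_users
              for i in range(n_items)]
    common = weighted_sum(user_w, e_users)    # user common interest
    diverse = weighted_sum(item_w, e_items)   # user diverse interest
    return common + diverse                   # list concatenation
```

The output length is the sum of the user and item embedding dimensions, and it can be appended to the input of any base CTR model.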
3.3. Objective function
The loss function of RAR is given in Equation 12, where aims to predict CTR accurately and aims to provide a clearer indication to the model of which recall items are most important. is a tunable parameter for balancing the two losses. Both losses are cross-entropy losses and supervise the training process in a point-wise manner. All modules of RAR are trained jointly by minimizing the joint loss function on the training dataset.
(12) |
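A joint objective of this form can be sketched as follows. The sketch assumes the balancing parameter multiplies the auxiliary matching loss and that both terms are mean binary cross-entropy. The names `joint_loss` and `lam`, and the default value of `lam`, are hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(labels: Sequence[int], probs: Sequence[float],
                         eps: float = 1e-7) -> float:
    """Mean point-wise binary cross-entropy, with clipping for stability."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def joint_loss(ctr_labels: Sequence[int], ctr_probs: Sequence[float],
               match_labels: Sequence[int], match_probs: Sequence[float],
               lam: float = 0.1) -> float:
    """CTR loss plus lam times the auxiliary matching loss (assumed form)."""
    l_ctr = binary_cross_entropy(ctr_labels, ctr_probs)
    l_match = binary_cross_entropy(match_labels, match_probs)
    return l_ctr + lam * l_match
```

Here `match_probs` would be the flattened entries of the matching matrix and `match_labels` the corresponding supervision signal.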
3.4. Complexity Analysis
We analyze the efficiency of RAR in this section. The Co-Interaction Module first gets the matching matrix and weighting scores by multiplying the embedding vectors of selected look-alike users and recall items, followed by a weighted sum using the weighting vectors. Therefore, three matrix multiplications are needed, and the time complexity is . As for the selection module, the time complexity depends on the selection method utilized. Since conducting and counting the number of bits set to 1 in SimHash can be accomplished in , the overall time complexity of RAR is , where . Since , , it can be approximated as .
4. Experiments
4.1. Experimental Setup
Datasets. We conduct experiments on three public datasets. KKBox is a challenge dataset for music recommendation. MovieLens contains users’ tagging records on movies. CandiCTR-Pub is a publicly available industrial dataset that is both practical and large-scale. Apart from CandiCTR-Pub, which already includes recall item sets, we manually construct recall item sets and look-alike user sets for KKBox and MovieLens by employing a pretrained matching model (e.g., DSSM) for inner product calculations between user-item and user-user pairs. Our implementation builds upon FuxiCTR and follows the public benchmark (Zhu et al., 2022) and previous works (Wang et al., 2022; Zheng et al., 2022; Lin et al., 2023).
Base models. We consider both high-order feature interaction and ensemble models. We choose IPNN, WDL, DeepFM, DCN, xDeepFM, AutoInt+, DeepIM, and DCN-V2 as our base models, which have been evaluated in the BARS benchmark; see references in (Zhu et al., 2022, 2021).
Baselines. FRNet (Wang et al., 2022) learns context-aware feature representations by capturing cross-feature relationships, becoming the new SOTA. CIM (Zheng et al., 2022) encodes all candidate items into a context vector with a transformer to characterize users’ implicit awareness.
Metrics. We apply the most popular metrics AUC and gAUC (weighted sum AUC, grouped by users) to evaluate the performance.
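For concreteness, one common way to compute gAUC is sketched below. It assumes that each user's AUC is weighted by that user's number of samples and that users with only positive or only negative samples are skipped. Neither convention is spelled out in the text, so both are assumptions. The pairwise AUC is quadratic and meant for clarity rather than speed.

```python
from collections import defaultdict
from typing import Hashable, Optional, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """Probability that a random positive outranks a random negative.

    Ties count as half. Returns None when only one class is present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(user_ids: Sequence[Hashable], labels: Sequence[int],
         scores: Sequence[float]) -> float:
    """Per-user AUC averaged with weights equal to each user's sample count."""
    groups = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        groups[u][0].append(y)
        groups[u][1].append(s)
    weighted, total = 0.0, 0
    for ys, ss in groups.values():
        a = auc(ys, ss)
        if a is None:
            continue  # AUC is undefined for single-class users
        weighted += len(ys) * a
        total += len(ys)
    if total == 0:
        raise ValueError("no user has both positive and negative samples")
    return weighted / total
```

Because gAUC only compares items shown to the same user, it can differ noticeably from the global AUC when score scales vary across users.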
4.2. Performance Evaluation with SOTA Models
Table 2. Performance of base models with FRNet, CIM, and RAR on three datasets.
Datasets | KKBox | |||||||
Modules | Raw | +FRNet | +CIM | +RAR | ||||
Models | gAUC(%) | AUC(%) | gAUC(%) | AUC(%) | gAUC(%) | AUC(%) | gAUC(%) | AUC(%) |
IPNN | 78.75 | 85.25 | 78.27 | 84.94 | 78.31 | 85.25 | 80.15 | 86.45 |
WDL | 78.44 | 85.02 | 78.26 | 84.85 | 78.67 | 85.36 | 79.74 | 86.23 |
DeepFM | 78.76 | 85.32 | 78.70 | 85.26 | 78.90 | 85.68 | 80.14 | 86.51 |
DCN | 78.66 | 85.25 | 78.69 | 85.26 | 79.16 | 85.74 | 80.22 | 86.58 |
xDeepFM | 78.60 | 85.25 | 78.65 | 85.22 | 78.72 | 85.56 | 80.12 | 86.50 |
AutoInt+ | 78.78 | 85.34 | 78.71 | 85.28 | 78.94 | 85.64 | 80.20 | 86.55 |
DeepIM | 78.79 | 85.29 | 78.56 | 85.16 | 78.92 | 85.63 | 80.26 | 86.59 |
DCN-V2 | 78.64 | 85.17 | 78.62 | 85.22 | 79.45 | 85.77 | 80.12 | 86.49 |
Best RelImp | 0.0% | 0.0% | 0.1% | 0.1% | 1.0% | 0.7% | 2.0% | 1.6% |
Datasets | MovieLens | |||||||
Modules | Raw | +FRNet | +CIM | +RAR | ||||
Models | gAUC(%) | AUC(%) | gAUC(%) | AUC(%) | gAUC(%) | AUC(%) | gAUC(%) | AUC(%) |
IPNN | 95.53 | 96.53 | 95.14 | 96.15 | 95.28 | 96.38 | 95.92 | 97.02 |
WDL | 95.29 | 96.23 | 95.17 | 96.19 | 95.36 | 96.44 | 95.73 | 96.67 |
DeepFM | 94.84 | 95.90 | 94.65 | 96.11 | 95.06 | 96.23 | 95.40 | 96.40 |
DCN | 95.32 | 96.35 | 95.21 | 96.33 | 95.30 | 96.37 | 95.51 | 96.54 |
xDeepFM | 95.27 | 96.20 | 95.21 | 96.26 | 95.16 | 96.29 | 95.82 | 96.81 |
AutoInt+ | 95.22 | 96.24 | 95.26 | 96.28 | 95.31 | 96.38 | 95.61 | 96.58 |
DeepIM | 95.29 | 96.29 | 95.21 | 96.28 | 95.30 | 96.39 | 95.51 | 96.61 |
DCN-V2 | 94.95 | 96.00 | 95.23 | 96.25 | 95.35 | 96.39 | 95.66 | 96.63 |
Best RelImp | 0.0% | 0.0% | 0.3% | 0.3% | 0.4% | 0.4% | 0.7% | 0.7% |
Datasets | CandiCTR-Pub | |||||||
Modules | Raw | +FRNet | +CIM | +RAR | ||||
Models | gAUC(%) | AUC(%) | gAUC(%) | AUC(%) | gAUC(%) | AUC(%) | gAUC(%) | AUC(%) |
IPNN | 52.87 | 60.92 | 52.35 | 61.08 | 53.76 | 61.86 | 54.35 | 62.53 |
WDL | 52.82 | 60.90 | 52.48 | 60.82 | 53.92 | 62.73 | 54.43 | 63.92 |
DeepFM | 52.82 | 60.99 | 52.48 | 60.89 | 53.94 | 62.80 | 54.50 | 63.92 |
DCN | 52.78 | 60.95 | 52.55 | 60.61 | 53.75 | 61.75 | 54.50 | 62.53 |
xDeepFM | 52.87 | 61.19 | 52.48 | 60.78 | 53.93 | 62.79 | 54.40 | 64.06 |
AutoInt+ | 52.61 | 61.10 | 52.73 | 60.95 | 53.82 | 62.85 | 54.47 | 63.86 |
DeepIM | 52.72 | 61.23 | 52.42 | 60.94 | 53.48 | 61.65 | 54.44 | 62.72 |
DCN-V2 | 52.65 | 61.06 | 52.64 | 60.79 | 53.32 | 61.69 | 54.30 | 62.70 |
Best RelImp | 0.0% | 0.0% | 0.2% | 0.3% | 2.6% | 3.0% | 3.5% | 5.0% |
We evaluate RAR on existing models, including many SOTA methods; the results are shown in Table 2. RAR notably surpasses other methods, with xDeepFM+RAR improving AUC by up to 4.7% across datasets, demonstrating the efficacy of using cross-stage data for richer user representations.
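The reported improvements can be reproduced from the entries of Table 2 if RelImp is read as the plain relative change of the metric. That reading is inferred from the numbers rather than stated in the text, and the helper name `rel_imp` is hypothetical.

```python
def rel_imp(base: float, new: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# xDeepFM on CandiCTR-Pub: raw AUC 61.19, +RAR AUC 64.06.
print(round(rel_imp(61.19, 64.06), 1))  # 4.7

# WDL on CandiCTR-Pub: raw AUC 60.90, +RAR AUC 63.92 (the "Best RelImp" entry).
print(round(rel_imp(60.90, 63.92), 1))  # 5.0
```

Under this reading, the "Best RelImp" row corresponds to the largest per-model improvement in each column.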
4.3. Ablation Study
We investigate the effectiveness of different components of RAR in Table 3. Three typical base models are selected to ensure generalizability and fairness.
- Removing the channel of look-alike users: RAR-user replaces the channel of look-alike users with the target user only. Table 3 shows a 2.6% gAUC and 2.5% AUC increase over raw CTR models, highlighting the benefit of incorporating recall items into ranking models for diversified user representations. However, RAR-user’s comparison to RAR indicates further potential by leveraging user common interest introduced by look-alike users.
- Removing the User and Item Selection Module: RAR-select, removing the Cross-Stage Selection Module, truncates the look-alike user set and recall item set to match RAR’s scale. Table 3 reveals a 0.4% gAUC and 2.3% AUC boost over base CTR models, which indicates the importance of a careful selection for further accuracy improvement.
- Removing the Co-Interaction Module: RAR-aux-wght, omitting the Co-Interaction Module and related losses, uses simple sum pooling for user interest representation and shows weaker performance. RAR-wght, dropping weighting vectors but keeping auxiliary loss, significantly outperforms RAR-aux-wght, highlighting the auxiliary loss’s role in guiding hierarchical information of recall items. RAR’s superiority over RAR-wght underscores the value of both matching loss and weighting vectors.
4.4. Training Efficiency
We present a wall-time comparison of RAR and CIM, the latter of which has proven effective in a real-world search advertising system. Findings detailed in Table 4 reveal that CIM_short, utilizing a truncated context input of 50 recall items, exhibits the least time consumption, whereas CIM_long, processing the full set of 305 recall items, incurs the most time. RAR’s selection modules effectively filter noise from extensive recall pools and look-alike user sets without reducing information. Additionally, as Section 4.2 demonstrates, the performance gain of RAR is substantial.
Table 3. Ablation study of RAR components.
Model | DCN-V2 | xDeepFM | DeepIM | Average RelImp | ||||
 | gAUC(%) | AUC(%) | gAUC(%) | AUC(%) | gAUC(%) | AUC(%) | gAUC(%) | AUC(%) |
Raw | 52.65 | 61.06 | 52.87 | 61.19 | 52.72 | 61.23 | 0% | 0% |
RAR-user | 54.07 | 62.35 | 54.09 | 63.45 | 54.19 | 62.23 | 2.6% | 2.5% |
RAR-select | 52.97 | 62.11 | 52.99 | 63.53 | 52.96 | 62.03 | 0.4% | 2.3% |
RAR-aux-wght | 52.48 | 61.14 | 52.57 | 62.52 | 52.49 | 61.10 | -0.4% | 0.7% |
RAR-wght | 53.89 | 62.47 | 53.87 | 63.78 | 54.06 | 62.35 | 2.3% | 2.8% |
RAR | 54.30 | 62.70 | 54.40 | 64.06 | 54.44 | 62.72 | 3.1% | 3.3% |
Table 4. Training and inference time of CIM and RAR.
Model | Training Time | Rel.Inc | Time per inference step | Rel.Inc |
CIM_short | 14.5 min | 0% | 18 ms | 0% |
CIM_long | 29 min | 100% | 48 ms | 167% |
RAR | 18 min | 24% | 27.5 ms | 53% |
5. Conclusion
In this paper, we first shed light on the limitations of relying solely on homogeneous user behavior sequences to model user preferences and then we propose a novel architecture called RAR which utilizes cross-stage data to improve the cross-instance modeling capability of the models. RAR consists of two key sub-modules, which synergistically gather information from a vast pool of look-alike users and recall items, resulting in enriched user representations. RAR is a general framework that demonstrates great performance and compatibility through our in-depth experiments.
Acknowledgements.
The Shanghai Jiao Tong University team is partially supported by National Natural Science Foundation of China (62177033). We also gratefully acknowledge the support of MindSpore (https://www.mindspore.cn/), which is a new deep learning computing framework used for this research.
References
- Lin et al. (2023) Jianghao Lin, Yanru Qu, Wei Guo, Xinyi Dai, Ruiming Tang, Yong Yu, and Weinan Zhang. 2023. MAP: A Model-agnostic Pretraining Framework for Click-through Rate Prediction. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1384–1395.
- Wang et al. (2022) Fangye Wang, Yingxu Wang, Dongsheng Li, Hansu Gu, Tun Lu, Peng Zhang, and Ning Gu. 2022. Enhancing CTR prediction with context-aware feature representation learning. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 343–352.
- Zheng et al. (2022) Kaifu Zheng, Lu Wang, Yu Li, Xusong Chen, Hu Liu, Jing Lu, Xiwei Zhao, Changping Peng, Zhangang Lin, and Jingping Shao. 2022. Implicit User Awareness Modeling via Candidate Items for CTR Prediction in Search Ads. In Proceedings of the ACM Web Conference 2022. 246–255.
- Zhu et al. (2021) Jieming Zhu, Jinyang Liu, Shuai Yang, Qi Zhang, and Xiuqiang He. 2021. Open benchmarking for click-through rate prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2759–2769.
- Zhu et al. (2022) Jieming Zhu, Kelong Mao, Quanyu Dai, Liangcai Su, Rong Ma, Jinyang Liu, Guohao Cai, Zhicheng Dou, Xi Xiao, and Rui Zhang. 2022. BARS: Towards Open Benchmarking for Recommender Systems. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).