Deep Bag-of-Words Model: An Efficient and Interpretable Relevance Architecture for Chinese E-Commerce

Zhe Lin, Jiwei Tan, Dan Ou, Xi Chen, Shaowei Yao, and Bo Zheng
Alibaba Group, Hangzhou, China
(2024)
Abstract.

Text relevance, or text matching between query and product, is an essential technique for e-commerce search systems: it ensures that the displayed products match the intent of the query. Many studies focus on improving the performance of the relevance model in search systems. Recently, pre-trained language models like BERT have achieved promising performance on the text relevance task. While these models perform well on offline test datasets, their high latency remains an obstacle to deploying them in online systems. The two-tower model is extensively employed in industrial scenarios, owing to its ability to balance performance with computational efficiency. Regrettably, such models have an opaque “black box” nature, which prevents developers from making targeted optimizations. In this paper, we propose the deep Bag-of-Words (DeepBoW) model, an efficient and interpretable relevance architecture for Chinese e-commerce. Our approach encodes the query and the product into sparse BoW representations, each of which is a set of word-weight pairs. The weight indicates the importance or relevance score of the corresponding word with respect to the raw text. The relevance score is computed by accumulating the weights of the words matched between the sparse BoW representations of the query and the product. Compared to the popular dense distributed representations, which usually suffer from the black-box drawback, the main advantage of the proposed representation is that it is highly explainable and open to intervention, which is a significant advantage for the deployment and operation of online search engines. Moreover, the online efficiency of the proposed model is even better than that of the most efficient inner-product form of dense representation. The proposed model is trained on three different datasets for learning the sparse BoW representations: the human-annotation set, the search-log set and the click-through set. The models are then evaluated by experienced human annotators. Both the automatic metrics and the online evaluations show that our DeepBoW model achieves competitive performance while its online inference is much more efficient than that of the other models. Our DeepBoW model has already been deployed to the largest Chinese e-commerce search engine, Taobao, and has served the entire search traffic for over 6 months.

E-Commerce, Text Matching, Relevance
Journal year: 2024. Copyright: ACM licensed. Conference: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), August 25–29, 2024, Barcelona, Spain. DOI: 10.1145/3637528.3671559. ISBN: 979-8-4007-0490-1/24/08. CCS concepts: Information systems → Similarity measures; Information systems → Document representation; Information systems → Query representation.

1. Introduction

The popularization of mobile internet has significantly elevated the prominence of online commerce in daily life. Hundreds of millions of customers purchase products they want on large e-commerce portals, such as Taobao and Amazon. The search engine emerges as the essential technology in assisting users to discover products that are in accord with their preferences. Different from general search engines like Google, commercial e-commerce search engines are usually designed to improve the user’s engagement and conversion, possibly at the cost of relevance in some cases (Carmel et al., 2020). The exhibition of products in search results that are inconsistent with the query intent has the potential to diminish the customer experience and hamper customers’ long-term trust and engagement. Consequently, measuring relevance between the text of search query and products to filter the irrelevant products is an indispensable process in the e-commerce search engine.

Text relevance has been a long-standing research topic due to its importance in information retrieval and search engines. Researchers and engineers have long pursued an efficient and robust model that accurately measures the text relevance between the query and the product in the e-commerce scenario. Conventional methods use features such as the word matching ratio, Term Frequency-Inverse Document Frequency (TF-IDF, notably BM25), or cosine similarity as the relevance score, and often yield strong baseline performance. Nevertheless, these word-matching approaches may be inaccurate when identical meanings are expressed in different words, such as synonyms. This issue is particularly pronounced in the e-commerce scenario, since queries are usually written by users in everyday language while products are described by sellers in professional language. Thus a severe vocabulary gap may exist between queries and products (Xiao et al., 2019).

With the development of deep learning technology, neural models have shown their advantage in addressing the semantic matching problem for the text relevance task (Hu et al., 2014; Wan et al., 2016; Pang et al., 2016). Recently, pre-trained language models like BERT (Devlin et al., 2019) have achieved excellent results in various NLP tasks, including text matching. Unfortunately, the typical paradigm of the BERT-based relevance model is the interaction-based structure, which needs to encode the query-document pair in real time to measure their relevance. This makes it difficult to deploy to online systems with large traffic due to the high computation cost and latency. Consequently, it is usually impractical to deploy the pre-trained model directly to search systems. To address this problem, the representation-based model, also known as the two-tower model, has been proposed and is widely applied in industrial search systems. It usually pre-computes the embeddings of the query and the document separately, and measures the relevance online from the embeddings. SiameseBERT (Reimers and Gurevych, 2019a) leverages BERT as the encoder and calculates the cosine similarity between the dense embeddings of query and document as the relevance score. Some studies, like ReprBERT (Yao et al., 2022), explore a more complex MLP classifier to compute the relevance score between two dense embeddings, which can achieve improved performance.

However, the representation-based model with dense embeddings still faces two major issues. First, the dense embedding may lose the detailed semantic information of the text, especially for low-frequency words like product models, entity names, or even brand identifiers. These words are essential in the e-commerce relevance task. Second, the dense embedding has an opaque “black box” nature, which prevents developers from fully understanding how the model calculates relevance scores. Developers often find it difficult to analyze the reasons for bad cases in the online system and to implement targeted optimizations. In contrast, traditional word-matching algorithms like BM25 continue to be favored in numerous industrial applications (Thakur et al., 2021) due to their high efficiency and robust interpretability. Such word-based algorithms can capture matches of words that are low-frequency but essential for the text-relevance task. Unfortunately, these methods have their own limitations. They fall short in recognizing different linguistic expressions that convey identical meanings, such as synonyms, which limits their effectiveness in semantic matching tasks.

Is it possible to combine the advantages of both deep semantic models and word-matching methods? In this paper, we propose Deep Bag-of-Words (DeepBoW), which leverages a language model pre-trained on large corpora to improve semantic modeling while preserving the computational efficiency and interpretability of the word-matching method. We realize this by learning sparse bag-of-words representations through deep neural networks. Our model generates a high-dimensional representation of the query/product (called the BoW representation) instead of a low-dimensional distributed representation (i.e., an embedding). Its dimensionality equals the size of the vocabulary. Each position in this high-dimensional representation corresponds to a word in the vocabulary, and its value represents the weight of this word in the BoW representation, just like the BoW vector of TF-IDF. The proposed DeepBoW model faces two main challenges. First, due to the opaque nature of neural network models, often colloquially referred to as “black boxes”, it is challenging to align positions within the high-dimensional representation with specific words in the vocabulary. Second, the vocabulary size is usually much larger than the dimension of a dense embedding, so expanding the dimensionality to the vocabulary size may cause the computation and storage costs to explode. For the first challenge, we carefully design the model architecture and the loss function to align each position of the high-dimensional representation with the corresponding word in the vocabulary. For the second challenge, we add a sparsity constraint to the loss function to reduce the number of valid positions in the high-dimensional representation, since a query/product should not include all words of the vocabulary. Finally, we sample the high-dimensional representation into a small set of non-zero word-weight pairs, which we call the sparse BoW representation.
In addition, although queries and product descriptions in e-commerce primarily consist of keywords, there still exist semantic dependencies among word combinations, which unigrams may not capture adequately. For example, multi-word brand names may be incorrectly matched if their words are separated. Consequently, we propose to model n-grams in our DeepBoW model, and introduce an n-gram hashing vocabulary strategy to avoid the explosion of the vocabulary size. Finally, with the sparse BoW representation, the relevance score is computed simply as the accumulated weight of the words matched between the sparse BoW representations of the query and the product, which makes it highly efficient and interpretable.
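To make the data structure concrete, the following minimal sketch turns a vocabulary-sized weight vector into a small set of non-zero word-weight pairs. The top-k selection rule, the function name `sparsify`, and the parameters `k` and `min_weight` are assumptions made for illustration; this passage does not specify how the paper performs the sampling.

```python
import heapq


def sparsify(dense_weights, k, min_weight=0.0):
    """Keep at most k entries of a vocabulary-sized weight vector.

    dense_weights[i] is the weight of vocabulary word i. Entries with a
    weight <= min_weight are treated as absent. The result is a list of
    (word_id, weight) pairs sorted by word_id.
    Top-k selection is an assumption of this sketch, not the paper's rule.
    """
    candidates = ((w, idx) for idx, w in enumerate(dense_weights) if w > min_weight)
    top = heapq.nlargest(k, candidates)
    return sorted((idx, w) for w, idx in top)


# Hypothetical 5-word vocabulary; only three positions are non-zero.
dense = [0.0, 0.9, 0.0, 0.2, 0.6]
sparse = sparsify(dense, k=2)
```

Storing only the surviving pairs keeps the per-text footprint proportional to the number of retained words rather than to the vocabulary size.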

The proposed DeepBoW model is evaluated on three industrial datasets. The results show that our DeepBoW model achieves more than 2.1% AUC improvement compared to the state-of-the-art two-tower relevance model on Chinese e-commerce. The sparse BoW representation generated by our DeepBoW model is interpretable and supports easy problem investigation and intervention in online systems. The time complexity of the online relevance score computation can be optimized to $\mathcal{O}(N)$ by leveraging the two-pointer algorithm, which is faster than previous state-of-the-art relevance models. The contributions of this paper are summarized as follows:
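The linear-time scoring mentioned above can be illustrated with a minimal two-pointer sketch. It assumes that each sparse BoW representation is stored as a list of (word_id, weight) pairs sorted by word_id, and that the weights of a matched word are combined by multiplication; both the storage layout and the combination rule are assumptions of this sketch, as is the function name `bow_relevance`.

```python
def bow_relevance(query_bow, product_bow):
    """Accumulate the weights of words present in both sparse BoW lists.

    Each argument is a list of (word_id, weight) pairs sorted by word_id.
    Multiplying the two weights of a matched word is an assumption of
    this sketch. Each step advances at least one pointer, so the loop
    runs in time linear in the total number of pairs.
    """
    i = j = 0
    score = 0.0
    while i < len(query_bow) and j < len(product_bow):
        q_id, q_weight = query_bow[i]
        p_id, p_weight = product_bow[j]
        if q_id == p_id:
            score += q_weight * p_weight
            i += 1
            j += 1
        elif q_id < p_id:
            i += 1
        else:
            j += 1
    return score


# Hypothetical word ids and weights.
query = [(3, 0.5), (7, 0.25), (12, 0.25)]
product = [(1, 0.5), (3, 0.5), (12, 1.0)]
score = bow_relevance(query, product)  # words 3 and 12 match
```

Because every matched word contributes a visible term to the sum, a bad case can be traced back to the individual word-weight pairs that produced the score.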

  • We reveal that the semantics of query and product in the e-commerce scenario can be represented by the bag-of-words vectors with importance weight. We show the BoW representation can be more suitable for the industrial e-commerce search system as it’s more interpretable and flexible than the dense embedding from the BERT-based model.

  • We introduce an innovative architecture designed to encode query/product into two distinct sparse Bag-of-Words (BoW) representations. We elucidate a methodology by which relevance scoring, based on these sparse BoW representations, can reduce the latency of the online relevance model while preserving competitive performance.

  • Our proposed DeepBoW model is evaluated both on offline human-annotation datasets in Chinese and online A/B testing, achieving strong performance and efficiency. The model has already been deployed on the largest Chinese e-commerce platform Taobao, and has been serving the entire search traffic for over six months.

The rest of this paper is organized as follows. In Section 2 we introduce related work. The proposed method is detailed in Section 3, and experimental results are presented in Section 4. Finally, in Section 5 we conclude this work and discuss the future work.

2. Related Work

2.1. Text Matching

The text matching task takes textual sequences as input and predicts a numerical value or a category indicating their relationship. Text matching is a long-standing problem and a hot research topic, as it is important in information retrieval and search systems. E-commerce relevance learning can be regarded as a text-matching task. Early work mostly performs keyword-based matching that relies on manually defined features, such as TF-IDF similarity and BM25 (Robertson et al., 1994). These methods cannot effectively utilize raw text features and usually fail to evaluate semantic relevance.

Recently, with the development of deep learning, neural text-matching models have been employed to solve the semantic matching task and have achieved promising performance. The architectures of neural text-matching models can be roughly divided into interaction-based models and representation-based (two-tower) models. The interaction-based model (Hu et al., 2014; Pang et al., 2016, 2016; Wan et al., 2016; Parikh et al., 2016) usually takes all candidate texts together as the input. The model can use the full textual features to compute matching features in the lower layers, and then aggregate the partial evidence of relevance to make the final decision. Interaction-based models can therefore leverage sophisticated techniques in the aggregation procedure and achieve better performance. More recent studies are built upon pre-trained language models like BERT (Devlin et al., 2019). With an extremely large corpus for pre-training, these methods achieve new state-of-the-art performance on various benchmarks. The architecture of these models is the pre-trained bidirectional Transformer (Vaswani et al., 2017), which can also be regarded as an interaction-based model. The typical paradigm of the BERT-based relevance model is to feed the text pair into BERT and then build a non-linear classifier upon BERT’s [CLS] output token to predict the relevance score (Nogueira et al., 2019; Nogueira and Cho, 2020).

Despite their excellent performance on the text-matching task, interaction-based models are still hard to deploy to practical online services, as they are mostly time-consuming and the features of queries and documents cannot be pre-computed offline. Two-tower models are widely used in many online search systems. The two-tower model consists of two identical neural networks, each taking one of the two inputs. DSSM (Huang et al., 2013; Shen et al., 2014) is a two-tower model that employs two separate deep fully-connected networks to encode the candidate texts. Meanwhile, more sophisticated architectures can be adopted to enhance the ability to learn semantic representations. LSTM-DSSM (Palangi et al., 2015) and LSTM-RNN (Palangi et al., 2016) use RNNs to explicitly model word dependencies in the sentences. Typically, dot-product, cosine, or parameterized non-linear layers are used to measure the similarity between the representations of the candidate texts. Since queries and documents are encoded individually, their embeddings can be pre-computed offline. Therefore, representation-based methods are efficient online and are widely used in industrial search engines. However, the two inputs are encoded independently of each other, which makes it hard for the final classifier to predict their relationship.

2.2. Search Relevance Matching

Relevance in search engines is a special sub-task of text matching that computes the relevance score between the query and the product (as the document). Unlike the typical text-matching task, in which all input texts are semantically similar and homogeneous (i.e., of comparable lengths), the query may be much shorter than the document, and the query only needs to match part of the semantics of the document.

A large number of models are proposed for conducting matching in search. Neural Tensor Network (NTN) (Socher et al., 2013) is originally proposed to explicitly model multiple interactions of relational data. It achieves powerful representation ability that can represent multiple similarity functions, including cosine similarity, dot product, and bilinear product, etc. Qiao et al. (2019) apply the BERT model to ad-hoc retrieval and passage retrieval. Reimers and Gurevych (2019b) propose Sentence-BERT for reducing the computational overhead for text matching. Bai et al. (2020) conduct a pilot study to map the frequency-based and BoW representation of a document to a sparse term importance distribution for text retrieval.

E-commerce search is a special scenario of the Web search system. Both tasks model the semantic matching between query and candidate and require high efficiency and low latency in the online search system. Differently, in Web search the query and document are usually very different in length, making most methods not feasible for the e-commerce relevance task. Currently, there is not a commonly-used public benchmark for the Chinese e-commerce relevance task, so previous works usually evaluate their models on the online service and the real-world dataset constructed from the online platforms. Guo et al. (2016) introduce a typical framework for e-commerce relevance learning. A Siamese network is adopted to learn pair-wise relevance of two products to a query. They investigate training the model with user clicks and batch negatives, followed by finetuning with human supervision to calibrate the score by pair-wise learning. Xiao et al. (2019) propose a co-training framework to address the data sparseness problem by investigating the instinctive connection between query rewriting and semantic matching. Yao et al. (2021) propose to learn a two-tower relevance model from click-through data in e-commerce by designing a point-wise loss function. Zhang et al. (2019) also find the weakness of training with click signals, and address this problem by proposing a multi-task learning framework of query intent classification and semantic textual similarity to improve semantic matching efficiency. Nigam et al. (2019) introduce a 3-part hinge loss to differentiate multiple types of training data. They classified training instances into three categories: random negative examples, impressed but not purchased examples, and purchased items. Recently Yao et al. (2022) propose ReprBERT, which has the advantages of both excellent performance and low latency, by distilling the interaction-based BERT model to a representation-based architecture. 
This framework is taken as the baseline of our model.

Figure 1. An overview of the DeepBoW model. Figure (a) shows the architecture that encodes the input text into the Term-Weighting BoW representation, which gathers the attention weight of each word as its weight in the term-weighting BoW representation. Figure (b) shows the architecture that encodes the input text into the Synonym-Expansion BoW representation, which generates sparse BoW representation from character embedding and word embedding respectively, and aggregates these two representations as the synonym-expansion BoW representation.

3. Methodology

3.1. Overview

The proposed DeepBoW model is based on the two-tower architecture, which encodes the query and document separately and computes the semantic relevance score with the representations of query and document. Different from other text relevance models with dense embeddings, our model encodes the query and document into the Bag-of-Words vectors and calculates the relevance score from sparse BoW representations.

In this section, we introduce the components of our model in detail. We first describe the multi-granularity encoder, which aggregates the character-grained and word-grained features. Next, we introduce two different sparse BoW representations: the term-weighting BoW and the synonym-expansion BoW. Then, we show how to use n-gram hashing to reduce the semantic loss from word segmentation and enhance the quality of the sparse BoW representation. Finally, we describe the training process of our model in detail and show the deployment of our DeepBoW model to the online e-commerce search system. Figure 1 shows an overview of the DeepBoW model.

3.2. Multi-Granularity Encoder

The text encoder aims to obtain the contextual representations of the input text. We choose the Transformer encoder (Vaswani et al., 2017) as our sentence encoder because of its excellent performance in many tasks. The Transformer encoder is a stack of $L$ identical layers, and each layer includes a multi-head self-attention and a fully connected feed-forward network. For the input sequence $S$, we obtain the output encoding matrix of the $i$-th layer as $H^{i}=\{h^{i}_{1},h^{i}_{2},\cdots,h^{i}_{l}\}$, where $h^{i}_{j}\in\mathbb{R}^{d}$ is the word embedding vector and $l$ is the number of words in $S$. As in ReprBERT (Yao et al., 2022), we aggregate the output of each layer as the text encoding representation according to:

(1)  $\tilde{h}^{i} = \left(\frac{1}{l}\sum_{h^{i}_{j}\in H^{i}} h^{i}_{j}\right)\mathbf{W_{m}} + \mathbf{b_{m}}$
     $h = \left[\bigparallel_{i=1}^{L}\tilde{h}^{i}\right]\mathbf{W_{agg}} + \mathbf{b_{agg}}$

where $\mathbf{W_{m}}\in\mathbb{R}^{d\times d}$, $\mathbf{W_{agg}}\in\mathbb{R}^{L\cdot d\times d}$, $\mathbf{b_{m}},\mathbf{b_{agg}}\in\mathbb{R}^{d}$, and $\bigparallel$ is the concatenation operation.
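The aggregation in Eq. (1) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the per-layer outputs are taken as given lists of token vectors (the Transformer itself is not implemented), the average is taken over the tokens of each layer, and the helper names `mean_pool`, `affine`, and `aggregate_layers` are hypothetical.

```python
def mean_pool(H):
    """Average the token vectors of one layer (H is a list of d-dim lists)."""
    n_tokens = len(H)
    return [sum(column) / n_tokens for column in zip(*H)]


def affine(x, W, b):
    """Compute x W + b for a row vector x, a d_in x d_out matrix W, and bias b."""
    return [sum(x_k * w_k for x_k, w_k in zip(x, column)) + b_j
            for column, b_j in zip(zip(*W), b)]


def aggregate_layers(layer_outputs, W_m, b_m, W_agg, b_agg):
    """Pool and project each layer, concatenate the results, project again."""
    pooled = [affine(mean_pool(H), W_m, b_m) for H in layer_outputs]
    concatenated = [value for h_tilde in pooled for value in h_tilde]
    return affine(concatenated, W_agg, b_agg)


# Toy setting with L = 2 layers, d = 2, and 2 tokens per layer.
layers = [[[1.0, 2.0], [3.0, 4.0]],
          [[0.0, 0.0], [2.0, 2.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_agg = identity + identity  # shape (L*d) x d; here it sums the pooled layers
h = aggregate_layers(layers, identity, [0.0, 0.0], W_agg, [0.0, 0.0])
```

With identity projections and zero biases, the result is simply the sum of the mean-pooled layer outputs, which makes the data flow easy to check by hand.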

Usually the character-based model performs better than the word-based model in Chinese NLP tasks (Li et al., 2019), and most Chinese BERT models use a character-based Transformer architecture. However, since our model intends to capture the relationship between the semantics and the words of the sentence, we propose to encode both the character-segmentation sequence and the word-segmentation sequence, each separately. For convenience, we denote the character-segmentation sequence and the word-segmentation sequence of the input text as $S_{c}=\{c_{1},c_{2},\cdots,c_{m}\}$ and $S_{w}=\{w_{1},w_{2},\cdots,w_{n}\}$, respectively, where $c_{i}$ and $w_{i}$ are the indices of the tokens in the vocabulary. The output encoding matrices of the character-segmentation sequence and the word-segmentation sequence are $H_{c}=\{h_{c}^{1},h_{c}^{2},\cdots,h_{c}^{m}\}$ and $H_{w}=\{h_{w}^{1},h_{w}^{2},\cdots,h_{w}^{n}\}$, where $m$ and $n$ are the lengths of the character-segmentation and the word-segmentation sequences. The text encoding representations of these two sequences are $h_{c}$ and $h_{w}$.

3.3. Sparse BoW Representation

Unlike traditional two-tower models, which encode text into an “embedding”, i.e., a dense distributed representation, our model encodes the text into the sparse BoW representation. The sparse BoW representation is a set of word-weight pairs, where each word has a weight that indicates its importance or relevance to the input text. In this section, we introduce two different sparse BoW representations, the term-weighting BoW representation and the synonym-expansion BoW representation, and describe the modules that generate them in detail.

3.3.1. Term-Weighting BoW Representation


In the e-commerce search system, the query entered by the user may contain redundant or unrelated words. These words can be removed with negligible impact on the semantics of the text. For example, in a Taobao query like “2024年 夏季 适合 准妈妈 孕妇 套装” (meaning “2024 Summer Pregnant Women’s Clothing”), “准妈妈” and “孕妇” both mean a pregnant woman, but “孕妇” is semantically more precise than “准妈妈”, as the latter is polysemous and more colloquial. “适合”, which means suitable, can be regarded as a stop word in the e-commerce scenario. So “准妈妈” and “适合” can be discarded, and the other words should be retained.

Term-Weighting BoW includes all words of the input text, and each word is assigned a weight that indicates its significance within the text’s semantics. Key words like brand and category should have greater importance weights than the other words. Figure 1 (a) shows the architecture to generate the term-weighting BoW representation. Then, the term-weighting BoW representation can be produced as follows:

(2)  $p_{i} = \frac{\exp\left(h_{c}^{\top}h_{w}^{i}\right)}{\sum_{h_{w}^{j}\in H_{w}}\exp\left(h_{c}^{\top}h_{w}^{j}\right)}$
     $\operatorname{BoW_{tw}}(S_{w}) := \{w_{i}:p_{i},\ w_{i}\in S_{w}\}$

We define $\operatorname{BoW_{tw}}(\cdot)$ as the term-weighting BoW representation of its argument. $p_{i}$ is the importance weight of $w_{i}$ in $S_{w}$, and $\sum_{i}p_{i}=1$.
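As a rough illustration of Eq. (2), the sketch below computes the term weights as a softmax over the dot products between the character-level text encoding $h_c$ and each word encoding $h_w^i$. The function name `term_weighting_bow`, the toy vectors, and the English word labels are hypothetical, and the sketch assumes the words in the input are distinct so that a dictionary can hold the result.

```python
import math


def term_weighting_bow(words, h_c, H_w):
    """Return {word: p_i}, where p_i = softmax_i(h_c . h_w^i) as in Eq. (2).

    words: the n tokens of the word-segmentation sequence (assumed distinct).
    h_c:   character-level text encoding, a list of d floats.
    H_w:   n word encodings, each a list of d floats.
    """
    logits = [sum(a * b for a, b in zip(h_c, h_w)) for h_w in H_w]
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return {word: e / total for word, e in zip(words, exps)}


# Toy encodings with d = 2; the values are made up for illustration only.
h_c = [1.0, 0.0]
H_w = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = term_weighting_bow(["brand", "stopword", "category"], h_c, H_w)
```

Words whose encodings align more closely with the text encoding receive larger weights, and the weights sum to one by construction.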

3.3.2. Synonym-Expansion BoW Representation


Since queries in e-commerce search systems are entered by lay users, they may differ from the product descriptions and include colloquialisms and polysemes. Some adjectives or category words may also have synonyms. Term-weighting BoW representation can only compute the importance weight of each word in the text, but is unable to add relevant words and synonyms. Synonym expansion can greatly improve the performance of the e-commerce search system. Therefore, we propose the synonym-expansion BoW representation to enhance the retrieval performance of the sparse BoW representation. Figure 1 (b) shows the architecture to generate the synonym-expansion BoW representation.

We sample $v$ words from the training corpora as the vocabulary $\mathbb{V}$ according to the frequency of the word. Our model leverages the relevance between these words and the input text to represent the semantics of the query and the products. First, our model aggregates the word-based text encoding representation and the character-based text encoding representation as follows:

$$
\begin{aligned}
\tilde{h}_w &= \sum_{i=1}^{n} p_i \cdot h_w^{i} \\
V_c &= \sigma\left(h_c \mathbf{W_c} + \mathbf{b_c}\right) \\
V_w &= \sigma\left(\left[h_c \,||\, \tilde{h}_w\right] \mathbf{W_w} + \mathbf{b_w}\right)
\end{aligned}
\tag{3}
$$

where $\mathbf{W_c} \in \mathbb{R}^{d \times v}$, $\mathbf{W_w} \in \mathbb{R}^{2d \times v}$ and $\mathbf{b_c}, \mathbf{b_w} \in \mathbb{R}^{v}$. $\sigma$ is the sigmoid function. Then, the synonym-expansion BoW representation of the input text is as follows:

$$
\begin{aligned}
p_g &= \sigma\left(\left[h_c \,||\, \tilde{h}_w\right] \mathbf{W_g} + \mathbf{b_g}\right) \\
\operatorname{BoW_{SE}}(S_w) &:= \{ t : V_c(t),\ t \in \mathbb{V} - S_w \} \,\cup \\
&\quad\ \{ t : p_g V_c(t) + (1 - p_g) V_w(t),\ t \in S_w \cap \mathbb{V} \}
\end{aligned}
\tag{4}
$$

where $V(i)$ denotes the $i$-th value of $V$. We define $\operatorname{BoW_{SE}}(\cdot)$ as the synonym-expansion BoW representation of $\cdot$. $t$ is a word (strictly speaking, the index of the word) in $\mathbb{V}$. The corresponding weight lies in $[0,1]$ and can be regarded as the relevance score between this word and the input text.
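The merge rule of Eq. (4) can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the implementation used in this work: the score vectors `v_c` and `v_w` are taken as already computed, the gate `p_g` is assumed to be a single scalar, and the function name `synonym_expansion_bow` is hypothetical.

```python
def synonym_expansion_bow(v_c, v_w, p_g, text_word_ids):
    """Merge per-vocabulary scores following the rule of Eq. (4).

    v_c, v_w      : lists of scores in [0, 1], one entry per vocabulary index.
    p_g           : gate in [0, 1] (assumed to be a scalar in this sketch).
    text_word_ids : set of vocabulary indices that occur in the input text.
    Returns a dict mapping vocabulary index -> weight.
    """
    bow = {}
    for t, (c_score, w_score) in enumerate(zip(v_c, v_w)):
        if t in text_word_ids:
            # Word occurs in the text: gated mix of the two scores.
            bow[t] = p_g * c_score + (1.0 - p_g) * w_score
        else:
            # Expanded word (not in the text): character-based score only.
            bow[t] = c_score
    return bow


# Toy vocabulary of three words; only word 1 occurs in the input text.
example = synonym_expansion_bow([0.2, 0.8, 0.1], [0.6, 0.4, 0.9], 0.5, {1})
```

In this toy example, words 0 and 2 keep their character-based scores, while word 1 receives the gated mixture of both scores.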

3.4. N-gram Hashing Vocabulary

In the preceding section, we describe the sparse BoW representation in detail. Unfortunately, because of the limit on the model’s parameter size, the vocabulary can only contain a limited number of words. Using “[UNK]” to replace all Out-Of-Vocabulary (OOV) words may lead to significant semantic loss. To mitigate this issue, we introduce an ensemble of hashing tokens into the vocabulary, so that an OOV word can be replaced with its hashing token. (For a word $w$, we use $\operatorname{MD5}(w) \,\%\, B$ as its hashing token, where $B$ is the number of hashing buckets.)

Semantic loss may occur between the raw text and its BoW representation, particularly when syntactically cohesive phrases are fragmented during the word segmentation process. This issue can lead to misalignment for the essential semantics such as product types, entity names, or brand identifiers in query/product. For example, the brand name “L’ORÉAL Paris” could be inaccurately divided into separate tokens during word segmentation. To address this problem, we introduce an N-gram hashing vocabulary strategy. Concretely, N-gram phrases are incorporated into the text’s BoW and are subsequently replaced with their respective hashing tokens, analogous to the treatment of OOV words. The significance of a particular N-gram phrase is directly proportional to the frequency of its occurrence within relevant query-product pairs in the corpora. Our model is equipped to ascertain the importance of these N-gram hashing tokens through the analysis of large-scale corpora. Consequently, the semantics of these N-gram phrases are retained within the sparse BoW representation.
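A small Python sketch can make the hashing-token idea concrete. Only the rule "MD5 of the text modulo the bucket number" comes from the description above; everything else is an assumption for illustration: the token format `[HASH_k]`, reading the MD5 digest as one hexadecimal integer, forming N-grams by concatenating adjacent words, hashing every N-gram rather than a selected subset, and the function names themselves.

```python
import hashlib


def hashing_token(text, num_buckets):
    """Map a string to one of `num_buckets` hashing tokens via MD5."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    # Assumption: the digest is interpreted as a single hexadecimal integer.
    return "[HASH_%d]" % (int(digest, 16) % num_buckets)


def ngrams(words, n):
    """Concatenate every run of n adjacent words into one phrase."""
    return ["".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def to_bow_tokens(words, vocab, num_buckets, max_n=2):
    """Replace OOV words by hashing tokens and append hashed N-gram phrases."""
    tokens = [w if w in vocab else hashing_token(w, num_buckets) for w in words]
    for n in range(2, max_n + 1):
        tokens.extend(hashing_token(g, num_buckets) for g in ngrams(words, n))
    return tokens


# Hypothetical segmented title; only the last word is in the vocabulary.
tokens = to_bow_tokens(["欧莱雅", "小黑瓶", "精华"], {"精华"}, 10000)
```

Because the hash is deterministic, the same OOV word or phrase always lands in the same bucket for queries and products, which is what allows the two representations to match on it. Distinct strings can collide in one bucket; this sketch makes no attempt to handle that.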

3.5. DeepBoW Relevance Model

In this section, we describe the method to compute the relevance score between the query and the product from the sparse BoW representations. Note that in the search engine scenario the product should match all the semantics of the query, whereas the query does not need to match all the semantics of the product. Accordingly, we encode the query as the term-weighting BoW representation and encode the product as the synonym-expansion BoW representation. The relevance score of the query/product pair is calculated as follows:

$$
R_t(Q, D) = \sum_{\substack{(w:p) \in \operatorname{BoW_{tw}}(Q) \\ (t:g) \in \operatorname{BoW_{SE}}(D)}} \sum_{w=t} p \times g \tag{5}
$$

where $Q$ is the query, $D$ is the product, and $R_t(Q,D)$ is the relevance score between $Q$ and $D$. We refer to the DeepBoW model with this relevance score as DeepBoW(Q-Weight).
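As a worked illustration of Eq. (5), the following Python sketch computes the score from two word-to-weight dictionaries. The words and weights are invented for the example, and the function name `relevance_q_weight` is hypothetical.

```python
def relevance_q_weight(query_bow, product_bow):
    """Eq. (5): sum p * g over the words shared by query and product."""
    return sum(p * product_bow[w] for w, p in query_bow.items() if w in product_bow)


# Query weights sum to 1 (term-weighting BoW); product weights lie in [0, 1].
query = {"red": 0.3, "dress": 0.7}
full_match = {"dress": 0.9, "skirt": 0.4, "red": 0.8}
partial_match = {"dress": 0.9, "skirt": 0.4}

score_full = relevance_q_weight(query, full_match)        # 0.3*0.8 + 0.7*0.9
score_partial = relevance_q_weight(query, partial_match)  # 0.7*0.9
```

The product that covers both query words scores higher than the one that misses "red", while the extra product word "skirt" has no effect. This reflects the asymmetry described above: the product has to cover the query, not the reverse.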

We leverage the cross-entropy loss between the output score and the ground truth to train our model. In addition, we also optimize the L2 norm of $\operatorname{BoW_{SE}}(D)$ to enhance the sparsity of the synonym-expansion BoW representation, so that only the most relevant words can get high scores. The loss function is as follows:

$$
\begin{aligned}
norm &= \left( \sum_{(t:g) \in \operatorname{BoW_{SE}}(D)} g^{2} \right)^{\frac{1}{2}} \\
loss_t &= \operatorname{CE}\left(R_t, label\right) + \frac{1}{v}\, norm
\end{aligned}
\tag{6}
$$

where $\operatorname{CE}$ is the cross-entropy loss and $label$ is the ground truth of the training data. We can also encode the query as the synonym-expansion BoW representation to improve the performance of recall. The relevance score is as follows:

$$
\begin{aligned}
C &= \sum_{(w:p) \in \operatorname{BoW_{SE}}(Q)} p \\
R_s(Q, D) &= \frac{1}{C} \sum_{\substack{(w:p) \in \operatorname{BoW_{SE}}(Q) \\ (t:g) \in \operatorname{BoW_{SE}}(D)}} \sum_{w=t} p \times g
\end{aligned}
\tag{7}
$$

We refer to the DeepBoW model with this relevance score as DeepBoW(Q-Synonym). Unlike in $loss_t$, we also leverage the bag-of-words of the query to train the product’s synonym-expansion BoW representation. The loss is modified as follows:

$$
\begin{aligned}
\operatorname{BoW_{avg}}(S_w) &:= \{ w_i : 1/n,\ w_i \in S_w \} \\
R_{avg}(Q, D) &= \sum_{\substack{(w:p) \in \operatorname{BoW_{avg}}(Q) \\ (t:g) \in \operatorname{BoW_{SE}}(D)}} \sum_{w=t} p \times g \\
loss_s &= \operatorname{CE}\left(R_s, label\right) + \operatorname{CE}\left(R_{avg}, label\right) + \frac{1}{v}\, norm
\end{aligned}
\tag{8}
$$

For both $loss_t$ and $loss_s$, we optimize the difference between the query’s sparse BoW representation and the product’s sparse BoW representation. This aligns the vocabulary between the query and the product.
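The structure of $loss_s$ can be sketched in plain Python for a single training pair. This is only a sketch under assumptions, not the training code: $\operatorname{CE}$ is taken to be binary cross-entropy applied directly to the score (the scores lie in $[0,1]$ by construction), the query words are assumed to be distinct and non-empty, and all function names are hypothetical. A real implementation would use a deep learning framework with automatic differentiation.

```python
import math


def match_score(query_bow, product_bow):
    """Sum p * g over the words shared by two word -> weight dicts."""
    return sum(p * product_bow[w] for w, p in query_bow.items() if w in product_bow)


def binary_cross_entropy(score, label, eps=1e-7):
    """Binary cross-entropy for one example; score is clamped away from 0 and 1."""
    score = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(score) + (1 - label) * math.log(1.0 - score))


def loss_q_synonym(query_words, query_se_bow, product_se_bow, label, vocab_size):
    # R_s (Eq. 7): match score normalized by the total query weight C.
    c = sum(query_se_bow.values())
    r_s = match_score(query_se_bow, product_se_bow) / c if c > 0 else 0.0

    # R_avg (Eq. 8): every query word gets the uniform weight 1/n.
    n = len(query_words)
    avg_bow = {w: 1.0 / n for w in query_words}
    r_avg = match_score(avg_bow, product_se_bow)

    # L2 norm of the product representation, scaled by 1/v (Eq. 6).
    norm = math.sqrt(sum(g * g for g in product_se_bow.values()))
    return (binary_cross_entropy(r_s, label)
            + binary_cross_entropy(r_avg, label)
            + norm / vocab_size)


q_words = ["a", "b"]
q_se = {"a": 0.5, "b": 0.5}
d_se = {"a": 1.0, "b": 1.0}
loss_relevant = loss_q_synonym(q_words, q_se, d_se, 1, 4)
loss_irrelevant = loss_q_synonym(q_words, q_se, d_se, 0, 4)
```

When the product fully covers the query and the pair is labeled relevant, both cross-entropy terms are close to zero and only the norm penalty remains. Labeling the same pair as irrelevant yields a much larger loss.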

3.6. Online Deployment

Since most online search systems have strict latency limitations, we pre-compute the sparse BoW representations of the queries and the products offline. To optimize memory usage, we discard the word-weight pairs whose weights are lower than a given threshold. We also sort the word-weight pairs offline by word index, so that the relevance score of two sparse BoW representations can be calculated with the two-pointer algorithm. The time complexity of Eq. 5 and Eq. 7 is thereby reduced to $\mathcal{O}(N)$, which is much faster than the state-of-the-art deep relevance model (Yao et al., 2022). Although some deep relevance models with cosine similarity can achieve comparable efficiency, their performance is much lower than that of our model, as shown in Section 4.5.
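The two-pointer computation can be sketched as follows. The sketch assumes that each representation is stored as a list of `(word_index, weight)` pairs sorted by word index with no repeated index; the function name and the example values are illustrative only.

```python
def two_pointer_score(query_pairs, product_pairs):
    """Sum weight products over matching word indices in one linear pass.

    Both inputs are lists of (word_index, weight) sorted by word_index.
    """
    i = j = 0
    score = 0.0
    while i < len(query_pairs) and j < len(product_pairs):
        q_idx, q_weight = query_pairs[i]
        d_idx, d_weight = product_pairs[j]
        if q_idx == d_idx:
            # Same word on both sides: accumulate and advance both pointers.
            score += q_weight * d_weight
            i += 1
            j += 1
        elif q_idx < d_idx:
            i += 1  # query word is absent from the product
        else:
            j += 1  # product word is absent from the query
    return score


query_pairs = [(3, 0.3), (17, 0.7)]
product_pairs = [(3, 0.8), (9, 0.4), (17, 0.9)]
score = two_pointer_score(query_pairs, product_pairs)  # 0.3*0.8 + 0.7*0.9
```

Each step advances at least one pointer, so the loop runs at most as many times as the two lists have entries in total. No hashing or dense vector arithmetic is needed at query time.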

4. EXPERIMENTS

4.1. Dataset

There is no public dataset or benchmark for the Chinese e-commerce relevance task, so we conduct experiments on three different types of industrial datasets to learn the DeepBoW model. The first is a large-scale Human-Annotation dataset which contains query-product pairs sampled from the Taobao search logs. Each query-product pair is labeled as Good (relevant) or Bad (irrelevant) by experienced human annotators. This is a daily task running in Taobao, which has accumulated more than 8 million labeled samples. We split the human-annotated dataset into training, validation and test sets, as detailed in Table 1.

| Dataset | Samples | Queries | Products | Good | Bad |
|---|---|---|---|---|---|
| Train | 7,439,823 | 463,387 | 6,068,089 | 6,092,745 | 1,347,078 |
| Valid | 372,981 | 98,254 | 202,192 | 305,447 | 67,534 |
| Test | 984,175 | 134,564 | 883,691 | 745,524 | 238,651 |

Table 1. Statistics of the human-annotation dataset.

The second training dataset is built by knowledge distillation, similar to Yao et al. (2022). We use the training set of the human-annotation dataset to fine-tune the StructBERT (Wang et al., 2020) model, which results in an interaction-based teacher model with strong performance. The teacher model then predicts the relevance scores of a large set of unlabeled query-product pairs sampled from one year of Taobao search logs. This training dataset is denoted as “Search-Logs” in Table 2. Third, we also sample click-through data from the search logs and investigate the performance of our model on this training set. We denote this dataset as “Click-Through” in Table 2. Although the training datasets come from different sources, we always use the human-annotation validation and test sets to evaluate model performance.

4.2. Training Details

We employ the Transformer as both the character-based encoder and the word-based encoder. To improve efficiency, we reduce the encoder depth from 12 layers. After balancing effectiveness and efficiency, our model adopts 2 layers, which still achieves competitive performance. We select the top 50000 words as the vocabulary $\mathbb{V}$ according to word frequency in the corpora, and we add another 10000 hashing tokens to the vocabulary for the OOV words and the N-gram phrases.

We implement our model in PyTorch and train it with the Adam optimizer. The hyper-parameters of the Adam optimizer are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and the learning rate is set to 0.0001. Query-document pairs are batched together by approximate sequence length. Each training batch contains a set of sentence pairs with about 50000 tokens. The hyper-parameters and the best model are chosen according to the results on the validation set. We train our model on 2 Tesla V100 GPUs, and it usually takes about 3 days for the model to converge. Convergence is reached when the ROC-AUC no longer improves on the validation set.

| Model | Human-Annotation ROC-AUC | Human-Annotation Neg PR-AUC | Search-Logs ROC-AUC | Search-Logs Neg PR-AUC | Click-Through ROC-AUC | Click-Through Neg PR-AUC |
|---|---|---|---|---|---|---|
| Interaction-Based Models: | | | | | | |
| BERT (Devlin et al., 2019) | 0.850 | 0.662 | - | - | - | - |
| RoBERTa (Liu et al., 2019) | 0.906 | 0.692 | - | - | - | - |
| StructBERT (Wang et al., 2020) | **0.923** | **0.721** | - | - | - | - |
| Two-Tower Models: | | | | | | |
| Siamese BERT | 0.765 | 0.565 | 0.821 | 0.648 | - | - |
| MASM (Yao et al., 2021) | 0.795 | 0.484 | 0.793 | 0.582 | 0.615 | 0.283 |
| Poly-Encoder (Humeau et al., 2019) | 0.808 | 0.623 | 0.846 | 0.605 | - | - |
| DSSM RoBERTa | 0.873 | 0.673 | - | - | - | - |
| DSSM StructBERT | 0.858 | 0.658 | - | - | - | - |
| ReprBERT (Yao et al., 2022) | 0.832 | 0.543 | 0.894 | 0.702 | 0.798 | 0.521 |
| ReprBERT +Cosine Similarity | 0.798 | 0.452 | 0.847 | 0.601 | 0.727 | 0.399 |
| Ours: | | | | | | |
| DeepBoW(Q-Weight) | 0.874 | 0.665 | 0.908 | 0.705 | 0.803 | 0.579 |
| DeepBoW(Q-Weight) +128-Trunc | 0.865 | 0.645 | 0.899 | 0.698 | 0.787 | 0.566 |
| DeepBoW(Q-Weight) +0.4-Trunc | 0.869 | 0.658 | 0.906 | 0.701 | 0.796 | 0.572 |
| DeepBoW(Q-Synonym) | 0.880 | 0.674 | **0.914** | **0.712** | **0.812** | **0.585** |
| DeepBoW(Q-Synonym) +128-Trunc | 0.873 | 0.670 | 0.906 | 0.705 | 0.799 | 0.571 |
| DeepBoW(Q-Synonym) +0.4-Trunc | 0.877 | 0.672 | 0.911 | 0.710 | 0.807 | 0.575 |

Table 2. Comparison results on the test set. Best scores are in bold. +128-Trunc means keeping the 128 largest terms according to the value of the word-weight pair. +0.4-Trunc means discarding the terms whose value is smaller than 0.4. We only fine-tune the pre-trained models on the Human-Annotation dataset and do not evaluate them on the other two training sets, since we leverage StructBERT as the teacher model to label the Search-Logs dataset. We do not evaluate the performance of MASM and Poly-Encoder on Click-Through training dataset, because the two models do not converge on the other two training datasets.

4.3. Baseline

We explore the performance of DeepBoW(Q-Weight) and DeepBoW(Q-Synonym) respectively. The main difference between the two methods is that DeepBoW(Q-Synonym) uses the synonym-expansion BoW representation instead of the term-weighting BoW representation for the query. To reduce memory usage and computation, we truncate the BoW representation to make it sparse. There are two ways to truncate the sparse BoW representation: one is to keep the $k$ words with the largest values, and the other is to discard the terms whose values are smaller than a given threshold.
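The two truncation strategies can be sketched in a few lines of Python over a word-to-weight dictionary. The function names and example weights are illustrative assumptions; how ties are broken in the top-$k$ variant is not specified above and here simply follows the sort order.

```python
def truncate_top_k(bow, k):
    """Keep the k word-weight pairs with the largest weights."""
    ranked = sorted(bow.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def truncate_threshold(bow, threshold):
    """Discard the pairs whose weight is smaller than the threshold."""
    return {word: weight for word, weight in bow.items() if weight >= threshold}


bow = {"dress": 0.9, "skirt": 0.45, "red": 0.8, "shoe": 0.05}
top2 = truncate_top_k(bow, 2)             # {"dress": 0.9, "red": 0.8}
above_04 = truncate_threshold(bow, 0.4)   # drops only "shoe"
```

Top-$k$ truncation gives every text the same fixed memory budget, whereas threshold truncation lets the length vary with how many words are actually relevant to the text.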

In addition, we adopt several state-of-the-art methods for comparison. BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and StructBERT (Wang et al., 2020) belong to the interaction-based architecture, which is also known as the cross-encoder architecture. Siamese BERT (Jiang et al., 2019b), MASM (Yao et al., 2021), Poly-encoders (Humeau et al., 2019) and ReprBERT (Yao et al., 2022) belong to the two-tower architecture, which is also known as the bi-encoder architecture. Besides, we investigate the performance of ReprBERT with a cosine similarity score instead of an MLP for the online computation of relevance from query/product embeddings. For a fair comparison, we also use pre-trained models such as RoBERTa and StructBERT as the encoder of the two-tower model. These models also serve as baselines in our experiments and are denoted as DSSM RoBERTa and DSSM StructBERT.

4.4. Evaluation Metrics

We evaluate our model on both offline and online metrics. In offline evaluation, since the human annotation is binary, the task is evaluated as a classification task. The Receiver Operating Characteristic Area Under Curve (ROC-AUC) is widely adopted in text relevance tasks (Jiang et al., 2019a; Wan et al., 2016; Davis and Goadrich, 2006). Note that in the e-commerce relevance scenario, most instances are positive and we are more concerned about negative instances. Therefore, the PR-AUC used in this paper is the negative PR-AUC, which treats Bad as 1 and Good as 0, following Yao et al. (2021, 2022). This metric is denoted as Neg PR-AUC.
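The label convention behind Neg PR-AUC can be illustrated with a small Python sketch. It is not the evaluation code used in the experiments and rests on several assumptions: labels are 1 for Good and 0 for Bad, the model score estimates the probability of Good so that `1 - score` serves as the score for Bad, the area under the precision-recall curve is approximated by average precision, and both classes occur at least once.

```python
def roc_auc(labels, scores):
    """Probability that a random Good pair is scored above a random Bad pair."""
    good = [s for y, s in zip(labels, scores) if y == 1]
    bad = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if g > b else 0.5 if g == b else 0.0 for g in good for b in bad)
    return wins / (len(good) * len(bad))


def average_precision(labels, scores):
    """Mean of precision@rank taken at the rank of every positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def neg_pr_auc(labels, scores):
    """Treat Bad as the positive class: flip the labels and invert the scores."""
    flipped = [1 - y for y in labels]
    bad_scores = [1.0 - s for s in scores]
    return average_precision(flipped, bad_scores)


labels = [1, 1, 0, 1, 0]            # 1 = Good, 0 = Bad
scores = [0.9, 0.8, 0.7, 0.6, 0.2]  # hypothetical relevance scores
auc = roc_auc(labels, scores)
neg_ap = neg_pr_auc(labels, scores)
```

Flipping the labels makes the metric sensitive to how well the model ranks the rare Bad pairs, which ROC-AUC alone can mask when most pairs are Good.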

Besides, we also compare the models in terms of parameter count and online computation efficiency. The FLOPs / token value is computed according to Molchanov et al. (2017) and gives the number of floating-point operations (FLOPs) required when only 1 token is considered. The plus sign separates the offline and online FLOPs: the part before the plus sign can be pre-computed offline. Memory indicates the online memory overhead for storing the pre-computed query and product vectors, where we use the vector size for comparison. In online evaluation, we use the rate of Good annotated by human annotators and the number of transactions as the evaluation metrics. The query-product pairs for human relevance judgment are randomly sampled from the online search logs, with the amount of Page View (PV) as the sample weight.

4.5. Results

Table 2 compares the performance of the various models. Our DeepBoW model demonstrates robust performance across the three training sets. On the human-annotation dataset, StructBERT has the best performance, and interaction-based models outperform two-tower models. Unfortunately, it is infeasible to deploy an interaction-based model in an industrial system because of its prohibitive computation and resource requirements. The ROC-AUC and Neg PR-AUC of two-tower models are much lower than those of the pre-trained models, because the human-annotation data is insufficient and the pre-trained models can introduce extra knowledge. Nonetheless, our DeepBoW model still outperforms the other two-tower models.

The augmented data sampled from the search logs and labeled by the teacher model greatly improves the performance of two-tower models. Our model achieves the best performance among the two-tower models on the search-logs training set. Click-through data is used to train the relevance model in some cases where human-annotated training data is lacking. The models trained on the click-through data perform weakly. The main reason is that click-through data in e-commerce is much noisier and more misleading: it is affected not only by query-product relevance but also by many other factors, including price, attractive titles, or images. Even so, our model still performs better than the other models, since it explicitly encodes the semantics into the sparse bag-of-words, while the other models may capture personalized information beyond the textual features.

Our proposed architecture can truncate the sparse BoW representation to reduce memory usage in the online search system. We can either truncate the sparse BoW representation to a fixed length or discard the word-weight pairs whose values are smaller than a given threshold. Benefiting from the interpretability of our sparse BoW representation, both truncation methods remain competitive, with only a slight loss of performance. We further explore the distribution of the sparse BoW representation as a case study in Appendix A.

| Model | Params | FLOPs / Token | Mem |
|---|---|---|---|
| BERT | 101.2M | 182M | 0 |
| RoBERTa | 101.2M | 182M | 0 |
| StructBERT | 101.2M | 182M | 0 |
| Siamese BERT | 101.2M | 91M + 1.5K | 768 |
| MASM | 76.8M | 674K | 640 |
| Poly-Encoder | 101.2M | 182M + 97.5K | 768 |
| ReprBERT | 30.6M | 30.4M + 296K | 256 |
| Ours: | | | |
| DeepBoW +128-Trunc | 33.4M + 126M | 159.4M + 128 | 128 |
| DeepBoW +0.4-Trunc | 33.4M + 126M | 159.4M + 28 | 28/144 |

Table 3. The models’ efficiency. The Params of our models include the encoder and the final vocabulary projection ($\mathbf{W_c}$ and $\mathbf{W_w}$ in Eq. 3). The Mem of DeepBoW +0.4-Trunc is the mean value throughout the whole corpora (query/product).

Table 3 shows the parameter count, computation and memory consumption of each model. Although our model has a considerable number of parameters because we project the dense vector onto the vocabulary (78.8% of the parameters, about 126M, come from $\mathbf{W_c}$ and $\mathbf{W_w}$ in Eq. 3), DeepBoW is the most efficient at online inference. Our model only uses CPUs to compute the relevance score, while the other models need GPUs to speed up inference. Table 4 shows the inference time of each model. The experiments are performed on a local CPU platform. We report the average inference time for a model to score 1000 products per query. The results show that our model is much faster than ReprBERT (Yao et al., 2022), which has been deployed in the Taobao search system.

4.6. Ablation Study

We perform the ablation study on the human-annotation dataset to investigate the influence of different modules in our model. We investigate the performance of the 2-layer encoder and the 6-layer encoder. Besides, we remove the word encoder and the character encoder separately to show the importance of encoding at both word-level and character-level. We also employ different capacities of vocabulary and hashing tokens in the sparse BoW representation. To further explore the sparsity of BoW representation, we remove the $l_2$ norm from the loss function. This may lead to a sparsity deterioration in the sparse BoW representation.

Table 5 shows the results of the ablation study. We can see that each module in our model does contribute to the overall performance. The performance of our model has a significant deterioration if either the word encoder or the character encoder is removed. Increasing the capacity of vocabulary or hashing tokens cannot improve the model’s performance, because it may lead to insufficient training for each token. Optimizing our model without the $l_2$ norm loss can lead to slight performance decline. There is almost no difference between the 2-layer and 6-layer encoders in our model.

| Model | Inference Time (ms) |
|---|---|
| StructBERT | 321,408 |
| ReprBERT | 29,164 |
| DeepBoW | 0.732 |

Table 4. Inference time of different models for scoring 1000 query-product pairs.

| DeepBoW | ROC-AUC | Neg PR-AUC |
|---|---|---|
| w/ 2 layers | 0.914 | 0.712 |
| w/ 6 layers | 0.911 | 0.717 |
| w/ 100K vocab | 0.907 | 0.701 |
| w/ 50K hashing tokens | 0.899 | 0.693 |
| w/o $l_2$ norm loss | 0.886 | 0.685 |
| w/o word encoder | 0.847 | 0.611 |
| w/o character encoder | 0.821 | 0.583 |

Table 5. Ablation study of the components of DeepBoW.

4.7. Online Evaluation

Online A/B testing is also conducted to evaluate our DeepBoW model, by replacing the online ReprBERT model with the DeepBoW model for comparison. Each of the two experiments takes about 2.5% of Taobao search traffic, and the A/B testing lasts for a month. As a result, DeepBoW improves the number of transactions by about 0.4% on average. The daily human annotation results show that DeepBoW also improves the rate of relevance by 0.5%. Online A/B testing verifies that the proposed DeepBoW is superior to previous state-of-the-art models and can achieve significant online profit considering the extremely large daily traffic of Taobao.

Our DeepBoW model has already served the entire Taobao search traffic. After pre-computing the representations of queries and products, the online serving latency can be optimized to as low as 4ms on the distributed computing system with CPUs. This is much faster than the previous online relevance serving model ReprBERT (Yao et al., 2022) of 10ms with GPUs and can satisfy the extremely large traffic of Taobao.

5. Conclusion and Future Work

In this paper, we study an industrial task of measuring the semantic relevance of queries and products. We propose the DeepBoW relevance model, which is an efficient and interpretable relevance architecture for Chinese e-commerce search systems. Our model encodes the query and product as a set of word-weight pairs, which is called the sparse BoW representation. The model is evaluated on three different training datasets, and the results show that our model achieves promising performance and efficiency. The model has been deployed in the Taobao search system.

In future work, we will explore integrating external knowledge into the DeepBoW relevance model to improve the performance. The proposed model can also be evaluated on datasets in other languages.

References

  • Bai et al. (2020) Yang Bai, Xiaoguang Li, Gang Wang, Chaoliang Zhang, Lifeng Shang, Jun Xu, Zhaowei Wang, Fangshan Wang, and Qun Liu. 2020. SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval. ArXiv abs/2010.00768 (2020). https://api.semanticscholar.org/CorpusID:222125038
  • Carmel et al. (2020) David Carmel, Elad Haramaty, Arnon Lazerson, Liane Lewin-Eytan, and Yoelle Maarek. 2020. Why Do People Buy Seemingly Irrelevant Items in Voice Product Search? On the Relation between Product Relevance and Customer Satisfaction in eCommerce. In Proceedings of the 13th International Conference on Web Search and Data Mining (Houston, TX, USA) (WSDM ’20). Association for Computing Machinery, New York, NY, USA, 79–87. https://doi.org/10.1145/3336191.3371780
  • Davis and Goadrich (2006) Jesse Davis and Mark Goadrich. 2006. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (Pittsburgh, Pennsylvania, USA) (ICML ’06). Association for Computing Machinery, New York, NY, USA, 233–240. https://doi.org/10.1145/1143844.1143874
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
  • Guo et al. (2016) Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A Deep Relevance Matching Model for Ad-hoc Retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (Indianapolis, Indiana, USA) (CIKM ’16). Association for Computing Machinery, New York, NY, USA, 55–64. https://doi.org/10.1145/2983323.2983769
  • Hu et al. (2014) Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional Neural Network Architectures for Matching Natural Language Sentences. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger (Eds.), Vol. 27. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2014/file/b9d487a30398d42ecff55c228ed5652b-Paper.pdf
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (San Francisco, California, USA) (CIKM ’13). Association for Computing Machinery, New York, NY, USA, 2333–2338. https://doi.org/10.1145/2505515.2505665
  • Humeau et al. (2019) Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. In International Conference on Learning Representations. https://api.semanticscholar.org/CorpusID:210063976
  • Jiang et al. (2019b) Jyun-Yu Jiang, Mingyang Zhang, Cheng Li, Michael Bendersky, Nadav Golbandi, and Marc Najork. 2019b. Semantic Text Matching for Long-Form Documents. In The World Wide Web Conference (San Francisco, CA, USA) (WWW ’19). Association for Computing Machinery, New York, NY, USA, 795–806. https://doi.org/10.1145/3308558.3313707
  • Jiang et al. (2019a) Yunjiang Jiang, Yue Shang, Rui Li, Wen-Yun Yang, Guoyu Tang, Chaoyi Ma, Yun Xiao, and Eric Zhao. 2019a. A unified neural network approach to e-commerce relevance learning. In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data (Anchorage, Alaska) (DLP-KDD ’19). Association for Computing Machinery, New York, NY, USA, Article 10, 7 pages. https://doi.org/10.1145/3326937.3341259
  • Li et al. (2019) Xiaoya Li, Yuxian Meng, Xiaofei Sun, Qinghong Han, Arianna Yuan, and Jiwei Li. 2019. Is Word Segmentation Necessary for Deep Learning of Chinese Representations?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 3242–3252. https://doi.org/10.18653/v1/P19-1314
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv abs/1907.11692 (2019). https://api.semanticscholar.org/CorpusID:198953378
  • Molchanov et al. (2017) Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017. Pruning Convolutional Neural Networks for Resource Efficient Inference. In International Conference on Learning Representations. https://openreview.net/forum?id=SJGCiw5gl
  • Nigam et al. (2019) Priyanka Nigam, Yiwei Song, Vijai Mohan, Vihan Lakshman, Weitian (Allen) Ding, Ankit Shingavi, Choon Hui Teo, Hao Gu, and Bing Yin. 2019. Semantic Product Search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 2876–2885. https://doi.org/10.1145/3292500.3330759
  • Nogueira and Cho (2020) Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage Re-ranking with BERT. arXiv:1901.04085 [cs.IR]
  • Nogueira et al. (2019) Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-Stage Document Ranking with BERT. arXiv:1910.14424 [cs.IR]
  • Palangi et al. (2015) H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward. 2015. Semantic Modelling with Long-Short-Term Memory for Information Retrieval. arXiv:1412.6629 [cs.IR]
  • Palangi et al. (2016) Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. 2016. Deep sentence embedding using long short-term memory networks: analysis and application to information retrieval. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 24, 4 (apr 2016), 694–707. https://doi.org/10.1109/TASLP.2016.2520371
  • Pang et al. (2016) Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text matching as image recognition. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (Phoenix, Arizona) (AAAI’16). AAAI Press, 2793–2799.
  • Parikh et al. (2016) Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A Decomposable Attention Model for Natural Language Inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Jian Su, Kevin Duh, and Xavier Carreras (Eds.). Association for Computational Linguistics, Austin, Texas, 2249–2255. https://doi.org/10.18653/v1/D16-1244
  • Qiao et al. (2019) Yifan Qiao, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. 2019. Understanding the Behaviors of BERT in Ranking. arXiv:1904.07531 [cs.IR]
  • Reimers and Gurevych (2019a) Nils Reimers and Iryna Gurevych. 2019a. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. http://arxiv.org/abs/1908.10084
  • Reimers and Gurevych (2019b) Nils Reimers and Iryna Gurevych. 2019b. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, Hong Kong, China, 3982–3992. https://doi.org/10.18653/v1/D19-1410
  • Robertson et al. (1994) Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at TREC-3. In Proceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994 (NIST Special Publication, Vol. 500-225), Donna K. Harman (Ed.). National Institute of Standards and Technology (NIST), 109–126. http://trec.nist.gov/pubs/trec3/papers/city.ps.gz
  • Shen et al. (2014) Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (Shanghai, China) (CIKM ’14). Association for Computing Machinery, New York, NY, USA, 101–110. https://doi.org/10.1145/2661829.2661935
  • Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1 (Lake Tahoe, Nevada) (NIPS’13). Curran Associates Inc., Red Hook, NY, USA, 926–934.
  • Thakur et al. (2021) Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=wCu6T5xFjeJ
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  • Wan et al. (2016) Shengxian Wan, Yanyan Lan, Jun Xu, Jiafeng Guo, Liang Pang, and Xueqi Cheng. 2016. Match-SRNN: modeling the recursive matching structure with spatial RNN. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (New York, New York, USA) (IJCAI’16). AAAI Press, 2922–2928.
  • Wang et al. (2020) Wei Wang, Bin Bi, Ming Yan, Chen Wu, Jiangnan Xia, Zuyi Bao, Liwei Peng, and Luo Si. 2020. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding. In International Conference on Learning Representations. https://openreview.net/forum?id=BJgQ4lSFPH
  • Xiao et al. (2019) Rong Xiao, Jianhui Ji, Baoliang Cui, Haihong Tang, Wenwu Ou, Yanghua Xiao, Jiwei Tan, and Xuan Ju. 2019. Weakly Supervised Co-Training of Query Rewriting and Semantic Matching for e-Commerce. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (Melbourne VIC, Australia) (WSDM ’19). Association for Computing Machinery, New York, NY, USA, 402–410. https://doi.org/10.1145/3289600.3291039
  • Yao et al. (2021) Shaowei Yao, Jiwei Tan, Xi Chen, Keping Yang, Rong Xiao, Hongbo Deng, and Xiaojun Wan. 2021. Learning a Product Relevance Model from Click-Through Data in E-Commerce. In Proceedings of the Web Conference 2021 (Ljubljana, Slovenia) (WWW ’21). Association for Computing Machinery, New York, NY, USA, 2890–2899. https://doi.org/10.1145/3442381.3450129
  • Yao et al. (2022) Shaowei Yao, Jiwei Tan, Xi Chen, Juhao Zhang, Xiaoyi Zeng, and Keping Yang. 2022. ReprBERT: Distilling BERT to an Efficient Representation-Based Relevance Model for E-Commerce. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Washington DC, USA) (KDD ’22). Association for Computing Machinery, New York, NY, USA, 4363–4371. https://doi.org/10.1145/3534678.3539090
  • Zhang et al. (2019) Hongchun Zhang, Tianyi Wang, Xiaonan Meng, and Yi Hu. 2019. Improving Semantic Matching via Multi-Task Learning in E-Commerce. In eCOM@SIGIR. https://api.semanticscholar.org/CorpusID:198120237
Cases of the sparse BoW representation

Query: 小香风连衣裙
$\operatorname{BoW_{SE}(Query)}$: [(‘连衣裙’, 0.30148), (‘高级感’, 0.25785), (‘小香风’, 0.2277), (‘新款’, 0.21297)]
Product: 高级感秋冬装小香风长袖V领针织连衣裙通勤复古赫本小黑裙打底裙 连衣裙 18-24周岁 针织布 纯色 通勤 V领 高腰 套头 黑色 酒红色 粉红色 常规 单件 复古 A字裙 秋冬 2023年秋季 长袖 中裙 链条
$\operatorname{BoW_{SE}(Product)}$: [(‘品质’, 1.0), (‘v领’, 1.0), (‘秋’, 0.99999), (‘高腰’, 0.99999), (‘高级感’, 0.99999), (‘通勤’, 0.99999), (‘秋季’, 0.99998), (‘秋冬’, 0.99998), (‘秋冬装’, 0.99998), (‘黑色’, 0.99995), (‘赫本’, 0.99993), (‘单件’, 0.99993), (‘18’, 0.99992), (‘中裙’, 0.99991), (‘针织布’, 0.9999), (‘长袖’, 0.99985), (‘打底’, 0.99981), (‘复古’, 0.9998), (‘黑裙’, 0.9998), (‘小黑裙’, 0.99977), (‘针织’, 0.99972), (‘纯色’, 0.9997), (‘常规’, 0.99968), (‘a字裙’, 0.99961), (‘链条’, 0.99957), (‘24’, 0.99954), (‘字’, 0.99944), (‘粉红’, 0.9994), (‘新款’, 0.99934), (‘粉红色’, 0.9992), (‘时尚’, 0.99897), (‘气质’, 0.99862), (‘打底裙’, 0.99824), (‘秋款’, 0.9981), (‘女士’, 0.99805), (‘女裙’, 0.99726), (‘2023’, 0.99721), (‘早秋’, 0.99687), (‘小香风’, 0.99657), (‘春秋’, 0.99627), (‘2023年’, 0.99622), (‘女装’, 0.99621), (‘套头’, 0.99618), (‘酒红色’, 0.99616), (‘黑色连衣裙’, 0.99583), (‘针织连衣裙’, 0.99577), (‘新款连衣裙’, 0.99536), (‘装’, 0.99512), (‘冬装’, 0.99502), (‘秋装’, 0.9946), (‘连身’, 0.9943), (‘秋冬连衣裙’, 0.99379), (‘针织裙’, 0.99247), (‘气质连衣裙’, 0.99237), (‘18-24’, 0.99174), (‘酒红’, 0.99117), (‘女’, 0.99094), (‘袖子’, 0.99014), (‘自动充气垫’, 0.98937), (‘轻熟’, 0.98916), (‘2019’, 0.98725), (‘韩版’, 0.98703), (‘春季’, 0.98666), (‘布’, 0.98649), (‘内搭’, 0.98606), (‘高腰连衣裙’, 0.98587), (‘黑’, 0.98563), (‘a字’, 0.98546), (‘修身’, 0.98471), (‘连衣裙’, 0.98461), (‘裙子’, 0.98451), (‘裙’, 0.98445), (‘初秋’, 0.98441), (‘休闲连衣裙’, 0.98398), (‘流行’, 0.98365), (‘红色连衣裙’, 0.98294), (‘宽松连衣裙’, 0.9822), (‘衣服’, 0.9794), (‘显瘦’, 0.97937), (‘秋冬新款’, 0.97758), (‘冬季’, 0.97752), (‘女连衣裙’, 0.97674), (‘个子’, 0.97423), (‘红色裙子’, 0.9712), (‘名媛’, 0.96755), (‘名媛连衣裙’, 0.96739), (‘女生’, 0.96676), (‘打底衫’, 0.96633), (‘连身裙’, 0.9661), (‘洋气’, 0.96411), (‘名媛气质连衣裙’, 0.96329), (‘修身连衣裙’, 0.96265), (‘衣’, 0.96137), (‘打底连衣裙’, 0.95868), (‘冬款’, 0.95856), (‘毛衣裙’, 0.95822), (‘长袖连衣裙’, 0.95057), (‘连衣裙子’, 0.95046), (‘搭’, 0.94854), (‘秋冬裙子’, 0.94758), (‘休闲’, 0.94602), (‘设计感’, 0.94598)]
Relevance Score: $0.30148\times 0.98461+0.25785\times 0.9999+0.2277\times 0.99657+0.21297\times 0.99934=0.9944$

Query: 秋冬床上四件套
$\operatorname{BoW_{SE}(Query)}$: [(‘四件’, 0.343), (‘床上’, 0.13778), (‘套’, 0.06248), (‘秋冬’, 0.09616), (‘床上四件套’, 0.10872), (‘四件套’, 0.202)]
Product: 秋冬季加厚牛奶绒床上四件套珊瑚绒被套双面法兰绒加绒床单三件套 卡丝迪尔家纺旗舰店 床品套件/四件套/多件套 Kiss Dear/卡丝迪尔 蓄热保暖 牛奶绒 卡通动漫 卡通 中国大陆 2021年秋季 床单式 床笠式 大众
$\operatorname{BoW_{SE}(Product)}$: [(‘珊瑚绒被套’, 1.0), (‘珊瑚绒’, 1.0), (‘江苏’, 1.0), (‘2021年’, 1.0), (‘大陆’, 0.99999), (‘多件套’, 0.99999), (‘蓄热’, 0.99999), (‘床笠式’, 0.99998), (‘冬季’, 0.99997), (‘江苏省’, 0.99997), (‘秋冬季’, 0.99993), (‘床品’, 0.99988), (‘三件套’, 0.99979), (‘保暖’, 0.99979), (‘床上’, 0.99979), (‘加厚’, 0.99975), (‘被套’, 0.99975), (‘四件’, 0.99965), (‘南通’, 0.99955), (‘床品套件’, 0.99948), (‘套’, 0.99926), (‘件套’, 0.99919), (‘牛奶’, 0.99906), (‘秋’, 0.99891), (‘床单三件套’, 0.9984), (‘卡通’, 0.99832), (‘动漫’, 0.9981), (‘dear’, 0.99762), (‘床上三件套’, 0.99754), (‘双面’, 0.99744), (‘床上四件套’, 0.99703), (‘加绒’, 0.99627), (‘套件’, 0.99615), (‘kiss’, 0.99586), (‘床单’, 0.9952), (‘珊瑚’, 0.99498), (‘自动充气垫’, 0.9935), (‘床单四件套’, 0.98934), (‘被罩’, 0.98758), (‘床笠’, 0.98725), (‘2021’, 0.98496), (‘床盖’, 0.978), (‘双面绒’, 0.96437), (‘毛绒’, 0.95704), (‘秋季’, 0.94864), (‘冬天’, 0.93963), (‘毛毛’, 0.9209), (‘秋冬’, 0.90725), (‘用品’, 0.89087), (‘四件套’, 0.88608), (‘儿童四件套’, 0.87247), (‘珊瑚绒床单’, 0.86876), (‘冬’, 0.85713), (‘冬款’, 0.85588), (‘床品四件套’, 0.82923), (‘单人床’, 0.82689), (‘绒面’, 0.82627), (‘单人’, 0.81626), (‘潮牌’, 0.78414), (‘绒’, 0.77402), (‘宿舍’, 0.76135), (‘少女’, 0.6812), (‘简约’, 0.60279), (‘被子’, 0.59161), (‘家纺’, 0.56128), (‘秋冬款’, 0.55233), (‘絲’, 0.54342), (‘旗舰店’, 0.54124), (‘冬装’, 0.52361), (‘厚’, 0.51875), (‘高档’, 0.51707), (‘单件’, 0.51121), (‘薄绒’, 0.50859), (‘卡通四件套’, 0.48766), (‘床上用品’, 0.48612), (‘枕套’, 0.46126), (‘双人’, 0.45578), (‘4件套’, 0.44235), (‘套装’, 0.43786), (‘保暖衣’, 0.42136), (‘全套’, 0.40858), (‘睡’, 0.39326), (‘珊瑚绒睡衣’, 0.38609), (‘床套’, 0.35582), (‘丝绒’, 0.34883), (‘加绒裤’, 0.32625), (‘1.5’, 0.30884), (‘学生’, 0.29069), (‘神器’, 0.28852), (‘红色系’, 0.28552), (‘网红’, 0.28548), (‘毛茸茸’, 0.28526), (‘必备’, 0.28513), (‘装饰品’, 0.28248), (‘冬季睡衣’, 0.27353), (‘加厚外套’, 0.26865), (‘家居’, 0.26277), (‘床’, 0.25459)]
Relevance Score: $0.343\times 0.99965+0.13778\times 0.99979+0.09616\times 0.90725+0.10872\times 0.99703+0.202\times 0.88608=0.8553$

Table 6. Two examples of the DeepBoW model. Both the query and the product are encoded into the synonym-expansion BoW representation. The relevance score is calculated as shown in the table.

Appendix A Case Study

Table 6 shows two examples of our DeepBoW model. Both the query and the product are encoded into the synonym-expansion BoW representation. The sparse BoW representation consists of a collection of word-weight pairs, which can be regarded as a bag of words with soft weights. The synonym-expansion representation not only captures the importance of the words in the original text but also incorporates pertinent synonymous terms. The relevance score is calculated by aggregating the terms that appear in both the query's and the product's sparse BoW representations: for each matched term, the query weight is multiplied by the product weight, and the results are summed.
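The aggregation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `relevance_score` and the dictionary layout are hypothetical, and the weights are copied from the first example in Table 6, with the product representation truncated to the matched terms plus one unmatched term.

```python
def relevance_score(query_bow, product_bow):
    """Sum query_weight * product_weight over terms present in both representations."""
    return sum(
        weight * product_bow[term]
        for term, weight in query_bow.items()
        if term in product_bow
    )


# Query "小香风连衣裙" from Table 6 (full sparse BoW representation).
query_bow = {"连衣裙": 0.30148, "高级感": 0.25785, "小香风": 0.2277, "新款": 0.21297}

# Product representation from Table 6, truncated for brevity (assumption:
# only the matched terms and one unmatched term are kept here).
product_bow = {
    "连衣裙": 0.98461,
    "高级感": 0.99999,
    "小香风": 0.99657,
    "新款": 0.99934,
    "长袖": 0.99985,  # unmatched term, contributes nothing to the score
}

score = relevance_score(query_bow, product_bow)
print(round(score, 4))  # 0.9944
```

Because both representations are sparse, the score depends only on the terms they share, so the cost of scoring grows with the number of query terms rather than with an embedding dimension.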

These two examples show that the proposed sparse BoW representation is readily interpretable: developers can analyze bad cases from the online search system and implement targeted optimizations. Furthermore, developers can directly modify the terms in the sparse BoW representation to achieve the expected result. In short, our DeepBoW model surpasses other deep relevance modeling approaches in interpretability and flexibility, which makes it well suited to e-commerce search systems.
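As a hedged illustration of such an intervention, the sketch below removes one term from a product's sparse BoW representation and recomputes the score. The helper `apply_overrides`, the override convention (a weight of `None` drops the term), and the single-term query are all hypothetical; the paper does not describe how interventions are applied in production. The product weights are taken from the first example in Table 6, where the expansion term ‘自动充气垫’ does not occur in the product text, and the sketch assumes a developer has judged it to be an unwanted expansion.

```python
def relevance_score(query_bow, product_bow):
    """Sum query_weight * product_weight over terms present in both representations."""
    return sum(w * product_bow[t] for t, w in query_bow.items() if t in product_bow)


def apply_overrides(bow, overrides):
    """Return an edited copy of bow; a weight of None removes the term (hypothetical convention)."""
    edited = dict(bow)
    for term, weight in overrides.items():
        if weight is None:
            edited.pop(term, None)
        else:
            edited[term] = weight
    return edited


# Truncated product representation from the first example in Table 6.
product_bow = {"连衣裙": 0.98461, "小香风": 0.99657, "自动充气垫": 0.98937}

# Hypothetical query representation that matches only the unwanted term.
query_bow = {"自动充气垫": 1.0}

before = relevance_score(query_bow, product_bow)
fixed_bow = apply_overrides(product_bow, {"自动充气垫": None})
after = relevance_score(query_bow, fixed_bow)

print(before, after)
```

The edit is local: only scores for queries containing the removed term change, and the original representation is left untouched so the override can be reverted.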