Learning Retrieval Augmentation for Personalized Dialogue Generation

Qiushi Huang1,2, Shuai Fu2, Xubo Liu1, Wenwu Wang1,
Tom Ko3, Yu Zhang2, Lilian Tang1*
1University of Surrey, 2Southern University of Science and Technology, 3ByteDance AI Lab
{qiushi.huang, xubo.liu, w.wang, h.tang}@surrey.ac.uk,
{fus.akhasi, tomkocse, yu.zhang.ust}@gmail.com
  Corresponding authors.
Abstract

Personalized dialogue generation, which aims to generate highly tailored responses by leveraging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, the persona profiles in current personalized dialogue datasets are typically composed of merely four to five sentences and may not offer a comprehensive description of the agent's persona, which makes it challenging to generate truly personalized dialogues. To handle this problem, we propose Learning Retrieval Augmentation for Personalized DialOgue Generation (LAPDOG), which studies the potential of leveraging external knowledge for persona dialogue generation. Specifically, the proposed LAPDOG model consists of a story retriever and a dialogue generator. The story retriever uses a given persona profile as the query to retrieve relevant information from the story documents, which serves as a supplementary context to augment the persona profile. The dialogue generator utilizes both the dialogue history and the augmented persona profile to generate personalized responses. For optimization, we adopt a joint training framework that collaboratively learns the story retriever and dialogue generator, where the story retriever is optimized towards the desired ultimate metrics (e.g., BLEU) to retrieve content for the dialogue generator to generate personalized responses. Experiments conducted on the CONVAI2 dataset with ROCStory as a supplementary data source show that the proposed LAPDOG method substantially outperforms the baselines, indicating the effectiveness of the proposed method. The LAPDOG model code is publicly available for further exploration at https://github.com/hqsiswiliam/LAPDOG.

1 Introduction

Personalized dialogue generation Zhang et al. (2018); Dinan et al. (2019), which prompts an agent to generate consistent responses based on historical dialogue context and given persona profiles, has recently drawn substantial attention in many applications. For instance, such an agent could effectively adapt to different roles such as a customer service representative by tailoring its responses to specific customer needs based on its persona and improving customer interaction and satisfaction. Besides, personalized responses can foster a sense of human-like interaction in social platforms, thereby enriching the user experience.

The persona profiles contain background sentences describing the agent (e.g., I like to go hunting.) and play a crucial role in customizing the dialogue. Ideally, a persona profile should be as comprehensive as possible, containing diverse and detailed descriptions of an agent. However, these persona profiles, typically consisting of only four to five sentences, do not provide comprehensive descriptions for the persona of the agent. Such lack of depth and diversity in the persona descriptions impedes existing methods Liu et al. (2020); Song et al. (2021); Huang et al. (2022) from generating highly personalized and contextually rich responses, though they have shown capabilities in producing grammatically correct and human-like responses. In essence, those models are restricted by the static and limited persona profile. Hence, those models fail to dynamically incorporate more intensive extra personalized profiles when decoding the responses.

Though the given persona profile is limited, many external textual resources describe personality and daily life circumstances. Hence, it is intuitive to ask: can we use other related datasets to enrich the details of the persona profile? This key question has not been thoroughly explored in existing methods, which primarily rely on the persona profile and dialogue context alone. An immediate issue is which types of external datasets could be used. A promising source is story data, since stories encompass diverse life events, personality traits, motivations, and experiences, which can contribute to a more detailed and realistic persona. For example, given the persona sentence "I like to work on vintage cars.", the titles of potentially retrieved stories include "Antique Car Show" and "Mechanic"; the story contents can be found in the appendix (Table 7). Furthermore, the clear and inherent structure in stories can enhance the consistency of the persona. In this work, we choose story data to facilitate the generation of more engaging and contextually meaningful dialogues.

Given the external knowledge (e.g., story data), how to infuse it into the process of personalized dialogue generation remains challenging. The first hurdle is the lack of explicit annotations for retrieval, which are the key to selecting relevant and helpful content to augment persona profiles. In addition, the criterion for assessing the efficacy of the retrieved content remains unclear. For instance, retrieval-augmented generation (RAG) Lewis et al. (2020) updates the retriever based on the predicted probability distribution, which may not directly align with the objective of generating personalized responses. Moreover, simply tuning a dense retriever Karpukhin et al. (2020) in the RAG paradigm may result in suboptimal retrieval outcomes, as the retriever is inclined to consistently select similar passages for all queries, which may impede further expansion of the persona profile details.

In this paper, we make the first attempt to use story data as external knowledge for the personalized dialogue generation task and propose the Learning Retrieval Augmentation for Personalized DialOgue Generation (LAPDOG) framework. Specifically, LAPDOG consists of a retriever, which retrieves helpful information to enrich the persona, and a generator, which generates dialogues; together they form an end-to-end learnable retrieval methodology for integrating additional contextual information into personalized dialogue generation. LAPDOG utilizes non-differentiable metrics (e.g., BLEU, F1, and ROUGE-L) to guide the training of the retriever by aligning the retriever scores with these desired metrics, thereby facilitating the generation of relevant and diverse personalized responses. To ensure diversity in the retrieval process, we design a retrieval candidate augmentation during training, which prevents the retriever from consistently selecting similar passages for all queries and provides a broader range of contextual inputs for the dialogue generator. In addition to the retrieved content, the persona information and dialogue context are also integrated into the dialogue generator. Furthermore, LAPDOG adopts a cooperative framework wherein the retriever and generator are jointly trained. This process enables LAPDOG to generate personalized responses that are coherent, contextually rich, and in line with the persona of the agent. Unlike other retrieval models Zhou et al. (2022); Santhanam et al. (2022) that rely on annotated retrieval datasets, our method retrieves the supplementary context in an end-to-end, unsupervised manner, which can be seamlessly extended to other suitable text sources.

We conduct experiments on the CONVAI2 dataset Dinan et al. (2019), which is widely recognized and extensively studied in the field of personalized dialogue generation Huang et al. (2022); Song et al. (2021); Liu et al. (2020), and the ROCStory dataset Mostafazadeh et al. (2016) acts as external knowledge. Experiments demonstrate the positive impact of learnable retrieval augmentation on performance. Quantitatively, the proposed LAPDOG method consistently yields improvements over the baseline models with varying model sizes. Moreover, the retrieved contents offer insights into the rationale behind the generation ability of the generator. Comprehensive ablation studies demonstrate that joint objective guidance outperforms each individual objective and provides insights into the size of retrieval candidates and the use of different metrics.

Overall, our contributions can be summarized as follows.

  • We present a novel LAPDOG model for personalized dialogue generation, which retrieves content relevant to the persona from external knowledge and trains the retriever with non-differentiable objectives.

  • We introduce candidate augmentation as a means to enhance learning retrieval augmentation, resulting in improved performance and increased diversity of candidate selections during the inference process.

  • The proposed LAPDOG framework significantly enhances the performance over baselines, showing promising potential for learnable retrieval augmentation on personalized dialogue generation. Our code and pre-trained model will be open-sourced.

2 Related Work

2.1 Personalized Dialogue Generation

Based on the PersonaChat dataset Zhang et al. (2018), Dinan et al. curate the CONVAI2 dataset, which contains a brief persona with four to five sentences for each interlocutor. This dataset has become a standard benchmark for the personalized dialogue generation task, and numerous studies built on it approach personalized dialogue generation from diverse perspectives. For example, Wolf et al. propose a fine-tuning model based on the GPT2 model Radford et al. (2019). Song et al. integrate three BERT models Devlin et al. (2019) via reinforcement learning to generate responses. Liu et al. propose a transmitter and receiver model for the personalized dialogue generation task, which is further refined by reinforcement learning with manually designed rewards. Cao et al. adopt model-agnostic data augmentation, using language models such as GPT2 and BERT Devlin et al. (2019) to augment the training set with pseudo training data. Huang et al. devise an adaptive attention mechanism to integrate information from persona and context encoders seamlessly. In contrast to the aforementioned models, the proposed LAPDOG method introduces an end-to-end dense retriever framework to simultaneously augment the input of the generator from an external data source and tune the retriever.

2.2 Retrieval-Augmented Text Generation

Several works incorporate retrievers into their models via different integration strategies to enhance text generation. For instance, DocPrompting Zhou et al. (2022) curates a retrieval annotation dataset to train a retriever that augments the input for code generation. Toolformer Schick et al. (2023) bootstraps retrieval-annotated data and fine-tunes language models on it to acquire the retrieval-augmentation ability. FLARE Jiang et al. (2023) extends Toolformer to large language models such as Ouyang et al. (2022) with specially designed prompts. RePlug Shi et al. (2023) further refines the retriever by distilling the knowledge from the language model's probability. RAG Lewis et al. (2020) jointly trains the retriever and language model, updating the retriever by the language model's probability. Different from those models, the proposed LAPDOG model is designed specifically for personalized dialogue generation with a focus on optimizing desired objectives rather than the language model's probability distribution. Since RePlug, Toolformer, and FLARE are based on large language models or their API calls, we do not include them in the comparison to LAPDOG. Compared with other models, we do not rely on retrieval annotations or bootstrapped retrieval annotations. The training objectives are directly computed from a comparison between the generated text and ground truth, rather than relying on training probabilities that are not always aligned with the desired metrics. Additionally, we introduce a candidate augmentation to avoid the limitations of a confined candidate set. This broadens the scope of potential dialogues and better captures the richness and diversity of an agent's persona.

Figure 1: An illustration of the two-stage training process in the proposed LAPDOG model.

3 Methodology

In this section, we introduce the proposed LAPDOG model.

3.1 Task Formulation

In a persona-based conversation session denoted by $C=\{P,U\}$, the persona $P=\{p_1,\ldots,p_e\}$ consists of $e$ profile sentences providing background information about a machine interlocutor $m$, and the dialogue context $U=\{u_{h,1},u_{m,1},\ldots,u_{h,n}\}$ encompasses the exchange of utterances between a human interlocutor $h$ and the machine interlocutor $m$. In the task of persona-based dialogue generation, the persona $P$ is used to characterize the machine interlocutor $m$, but it contains only four to five sentences, i.e., $4\leq e\leq 5$. The conversation is always started by the human interlocutor $h$. The primary objective of this task is to generate the response $r=u_{m,n}$ based on the given persona $P$ and dialogue context $U$.

The persona $P$ is short and hence cannot fully characterize the background information. To enrich the persona, we utilize a retrieval corpus $\mathcal{D}$ consisting of stories from a story dataset (e.g., ROCStory Mostafazadeh et al. (2016)). Note that there is no explicit annotation between $\mathcal{D}$ and $P$, necessitating an alternative approach to evaluate the usefulness of the retrieval content.

3.2 The Architecture

As shown in Figure 1, the architecture of the LAPDOG model consists of a generator, which adopts a transformer-based encoder-decoder structure, to generate dialogues and a dual-encoder retriever to efficiently obtain relevant information from an external story corpus.

Retriever

Based on Karpukhin et al. (2020), the retriever adopts a transformer-based encoder to embed the query and the story corpus, respectively. The retriever then calculates the dot product similarity score between the query and each story via their average pooled embeddings. Stories with the $K$ highest similarity scores are retrieved.
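The retrieval step described above can be sketched in a few lines. This is a minimal illustration only: it assumes token embeddings are already available as plain lists of floats (in LAPDOG they would come from the transformer encoder), and all function names are illustrative rather than taken from the released code.

```python
from typing import List, Sequence, Tuple


def mean_pool(token_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average token embeddings into a single fixed-size vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[j] for tok in token_embeddings) / n for j in range(dim)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(
    query_vec: Sequence[float],
    story_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (story index, score) pairs for the k highest-scoring stories."""
    scored = [(i, dot(query_vec, vec)) for i, vec in enumerate(story_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (made up for illustration): 2-dimensional token embeddings.
query = mean_pool([[1.0, 0.0], [0.0, 1.0]])      # pooled query: [0.5, 0.5]
stories = [
    mean_pool([[2.0, 2.0]]),                     # pooled: [2.0, 2.0]
    mean_pool([[-1.0, 0.0]]),                    # pooled: [-1.0, 0.0]
    mean_pool([[1.0, 0.0], [1.0, 2.0]]),         # pooled: [1.0, 1.0]
]
top2 = retrieve_top_k(query, stories, k=2)
```

In practice the corpus embeddings would be precomputed and searched with an approximate nearest-neighbour index rather than a full sort; the exhaustive version is used here only to keep the logic visible.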

Generator

The generator takes a transformer-based encoder-decoder architecture to generate the response from the persona, dialogue history, and retrieved contents. To integrate the retrieved contents with the persona and dialogue history, we leverage the Fusion-in-Decoder (FiD) technique Izacard and Grave (2021). Specifically, each retrieved story is combined with the persona and context and individually encoded. The resulting encoded contexts are concatenated and cross-attended in the decoder to generate the final response.
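The Fusion-in-Decoder input layout described above can be illustrated with a small sketch. It is not the authors' implementation: the field prefixes, the whitespace "encoder", and the example stories are all assumptions made for illustration. The point is only that each story is paired with the same persona and context, encoded on its own, and that the encoded passages are concatenated before decoding.

```python
from typing import List


def build_fid_inputs(persona: List[str], context: List[str], stories: List[str]) -> List[str]:
    """Pair every retrieved story with the same persona and dialogue context."""
    persona_text = " ".join(persona)
    context_text = " ".join(context)
    # The "story:/persona:/context:" prefixes are hypothetical field markers.
    return [
        f"story: {story} persona: {persona_text} context: {context_text}"
        for story in stories
    ]


def toy_encode(text: str) -> List[str]:
    """Stand-in for the transformer encoder: one 'state' per whitespace token."""
    return text.split()


def fuse(encoded: List[List[str]]) -> List[str]:
    """Concatenate per-passage encoder states along the sequence axis."""
    return [state for passage in encoded for state in passage]


persona = ["I like to work on vintage cars."]
context = ["Hi, what do you do for fun?"]
# Made-up story snippets, used only to exercise the functions.
stories = ["I went to an antique car show.", "I fixed an engine at the shop."]

fid_inputs = build_fid_inputs(persona, context, stories)
encoded = [toy_encode(text) for text in fid_inputs]  # each passage encoded independently
decoder_memory = fuse(encoded)                       # what the decoder would cross-attend over
```

Because the passages are encoded separately, encoder cost grows linearly with the number of retrieved stories, while the decoder still sees all of them at once.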

3.3 Training Process

It is straightforward to directly train the generator and retriever using the generator's probability distribution in a way similar to the RAG method. However, this strategy does not work well, since the retriever would become trapped in a fixed candidate set and the predicted probability distribution is not always aligned with the desired objectives in the personalized dialogue generation task. Hence, as depicted in Figure 1, the LAPDOG model adopts a two-stage training procedure. In the first stage, the training process starts with supervised training for the generator (refer to Section 3.3.1). In the second stage, the framework tunes the retriever and learns the retrieval augmentation jointly. To learn retrieval augmentation (refer to Section 3.3.2), the retriever's loss is computed from the evaluation metrics between the output of the generator and the ground truth. During the process of learning retrieval augmentation, to prevent the retriever from stagnating around a limited set of candidates, we design the retrieval candidate augmentation (refer to Section 3.3.3), a method ensuring diversity in the retrieval process. Afterward, we enrich the input of the generator with retrieval-enhanced data and compute a generator loss based on the augmented input (refer to Section 3.3.4). Finally, we combine losses from both the generator and retriever to jointly train the two components (refer to Section 3.3.5). In the following sections, we introduce each part in detail.

3.3.1 Supervised Training

First, we train a generator that takes the persona $P$ and context $U$ as input and the ground-truth response $r$ as the target, without involving any retrieval results. Hence, this stage minimizes the negative log-likelihood, which is formulated as

\begin{equation}
\begin{aligned}
\mathcal{L}_{NLL} &= -\log(\mathcal{G}_{\theta}(r|P,U)) \\
&= -\sum_{t=1}^{|r|}\log(\mathcal{G}_{\theta}(r_t|P,U,r_{<t})),
\end{aligned}
\tag{1}
\end{equation}

where $r_t$ denotes the $t$-th token in $r$, $r_{<t}$ denotes the sequence containing the first to $(t-1)$-th tokens in $r$, $\mathcal{G}_{\theta}(\cdot)$ denotes the predicted probability distribution of the generator, and $\theta$ denotes parameters of the generator.

After supervised training, we obtain a supervised tuned generator denoted by $\mathcal{G}_{sup}$.
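As a small numerical illustration of Eq. (1), the sketch below computes the sequence-level negative log-likelihood from per-token probabilities. The probabilities are made-up numbers standing in for the generator's outputs at each decoding step; a real implementation would obtain them from the model's softmax over the vocabulary.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of a response given the probability the
    generator assigns to each ground-truth token (teacher forcing)."""
    return -sum(math.log(p) for p in token_probs)


# Hypothetical probabilities for a three-token response r = (r_1, r_2, r_3).
probs = [0.5, 0.25, 0.8]
nll = sequence_nll(probs)

# The two forms in Eq. (1) agree: the sum of token-level log-probabilities
# equals the log of the probability of the whole sequence.
sequence_prob = math.prod(probs)
```

Here the sequence probability is $0.5 \times 0.25 \times 0.8 = 0.1$, so both forms give $-\log 0.1$.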

3.3.2 Learning Retrieval Augmentation

Intuitively, with the retrieval content as an augmentation, the goal is to improve the generated content in terms of desired metrics. However, it is hard to build direct connections between retrieval contents and the quality of the final generated response to update the retriever. To achieve that, we use the trained generator $\mathcal{G}_{sup}$ as an evaluator to give feedback.

Specifically, we use the metric values obtained from the trained generator $\mathcal{G}_{sup}$ as feedback to guide the update of the retriever. In other words, if the generator $\mathcal{G}_{sup}$ finds that the retrieved story $d_i\in D_q$ is useful for improving the performance in terms of the given metrics, we should encourage the retriever to assign a higher score to $d_i$. In this way, we let the model decide the usefulness of the retrieval content and avoid relying on retrieval annotations between the query $q$ and story $d_i$. However, since the whole generation and metric calculation process is hard or even impossible to differentiate, we cannot directly perform gradient descent with respect to the calculated metrics to update the retriever. To solve this problem, we instead transform the metric values into a probability distribution as

\[
p_i=\frac{\exp\left(\frac{1}{\tau_g}\text{M}(y,\text{Gen}(\mathcal{G}_{sup},(d_i,P,U)))\right)}{\sum_{c=1}^{K}\exp\left(\frac{1}{\tau_g}\text{M}(y,\text{Gen}(\mathcal{G}_{sup},(d_c,P,U)))\right)},
\]

where $\text{M}(y,\hat{y})$ denotes a metric function to evaluate the quality of the generated text $\hat{y}$ given the ground truth $y$, $\text{Gen}(\mathcal{G}_{sup},(d_i,P,U))$ denotes the decoded text generated by $\mathcal{G}_{sup}$ given $(d_i,P,U)$ as the input, and $\tau_g$ is a temperature hyperparameter to control the sensitivity of the metric. Here the metric function satisfies that a higher value of $\text{M}(\cdot,\cdot)$ indicates better performance. If a smaller value of $\text{M}(\cdot,\cdot)$ indicates better performance, we can replace $\text{M}(\cdot,\cdot)$ with $-\text{M}(\cdot,\cdot)$ in the calculation of $p_i$. It is easy to see that a useful $d_i$ will have a large $p_i$ and hence $p_i$ can be used as a supervised signal to guide the learning of the retriever.
That is, we could make the similarity scores returned by the retriever close to $P_{\mathcal{R}}=\{p_i\}_{i=1}^{K}$. Formally, suppose we have the top-$K$ retrieval stories $D_q$ with retrieval scores $S_q\in\mathbb{R}^{K}$ with respect to the query $q$; we can then minimize the KL divergence between $P_{\mathcal{R}}$ and the softmax-normalized $S_q$ as

\begin{equation}
\mathcal{L}_{\mathcal{R}}=\mathrm{KL}(P_{\mathcal{R}},\sigma(S_q/\tau_s)), \tag{2}
\end{equation}

where $\mathrm{KL}(\cdot,\cdot)$ denotes the KL divergence, $\sigma(\cdot)$ denotes the softmax function, and $\tau_s$ is a temperature hyperparameter to control the sensitivity of the similarity scores. Combining $\mathcal{L}_{\mathcal{R}}$ with the retrieval candidate augmentation introduced in the next section, we can update the retriever.
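To make the two steps of this subsection concrete, the sketch below turns per-candidate metric values into the target distribution $P_{\mathcal{R}}$ via a temperature softmax and then evaluates the KL loss of Eq. (2) against the softmax of the retrieval scores. Several things here are assumptions for illustration: the unigram F1 is a simplified stand-in for the paper's metrics, the metric values and retrieval scores are made-up numbers, the temperatures are arbitrary, and the KL is computed as $\sum_i p_i \log(p_i/q_i)$ with $P_{\mathcal{R}}$ as the first argument.

```python
import math
from collections import Counter
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def unigram_f1(reference: str, hypothesis: str) -> float:
    """Simplified word-overlap F1 (stand-in for the metric function M)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i); eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def retriever_loss(metric_values, retrieval_scores, tau_g: float, tau_s: float) -> float:
    """Eq. (2): KL between the metric-derived target and the retriever's distribution."""
    target = softmax(metric_values, tau_g)         # P_R, treated as a fixed target
    predicted = softmax(retrieval_scores, tau_s)   # sigma(S_q / tau_s)
    return kl_divergence(target, predicted)


# One metric evaluation: ground truth vs. a hypothetical generated response.
f1 = unigram_f1("i love fixing old cars", "i love old cars")

# Hypothetical metric values for K = 3 candidates, and two retrievers:
# one that ranks the candidates like the metric does and one that does not.
metrics = [0.9, 0.1, 0.4]
loss_aligned = retriever_loss(metrics, [3.0, 0.5, 1.5], tau_g=0.1, tau_s=1.0)
loss_misaligned = retriever_loss(metrics, [0.5, 3.0, 1.5], tau_g=0.1, tau_s=1.0)
```

The loss is smaller when the retriever's scores order the candidates the same way the metric does, which is the signal that pushes useful stories up the ranking. Only the retrieval scores carry gradients in this setup; the metric values enter as constants, which is why non-differentiable metrics can be used.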

3.3.3 Retrieval Candidate Augmentation

During the training process, there is a risk that the retriever gets stuck in a local optimum and consistently retrieves a fixed set or a narrow range of candidates. Consequently, the generator fails to learn from the retriever and disregards the retrieved content. To address this challenge, we design retrieval candidate augmentation to incorporate randomly sampled stories to encourage the framework to explore a wider range of candidates. Specifically, we first replace each $d_i$ with a randomly selected candidate $d_i^{aug}$ with probability $\rho$ as

\[
d_i^{aug}=\text{CandAug}(d_i,\rho);\; d_i\in D_q,
\]

where $D_q$ denotes the set of retrieval stories. The resulting candidates form a perturbed set $D_q^{aug}=\{d_i^{aug}\}_{i=1}^{K}$. Then we compute the dot product similarity between the query $q$ and each $d_i^{aug}$ as the retrieval scores $S_q^{aug}=\{s_{q,i}^{aug}\}_{i=1}^{K}$, where $s_{q,i}^{aug}$ denotes the dot product similarity between $q$ and $d_i^{aug}$.
Then we apply the learning retrieval augmentation to $S_q^{aug}$ and, based on Eq. (2), minimize the following loss to update the retriever:

\[
\mathcal{L}_{\mathcal{R}}^{aug}=\mathrm{KL}(P_{\mathcal{R}},\sigma(S_q^{aug}/\tau_s)).
\]
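A minimal sketch of the candidate augmentation step follows. It assumes stories are referred to by integer ids and that a replacement is drawn uniformly from the whole corpus; the paper does not specify the sampling distribution in this section, so the uniform choice and the function name are assumptions.

```python
import random
from typing import List, Sequence


def cand_aug(
    retrieved: Sequence[int],
    corpus_size: int,
    rho: float,
    rng: random.Random,
) -> List[int]:
    """Replace each retrieved story id with a random corpus id with probability rho.

    This sketch does not deduplicate: a random draw may coincide with a story
    that is already in the candidate set.
    """
    augmented = []
    for story_id in retrieved:
        if rng.random() < rho:
            augmented.append(rng.randrange(corpus_size))  # random replacement
        else:
            augmented.append(story_id)                    # keep the retrieved story
    return augmented


rng = random.Random(0)                 # fixed seed so the sketch is reproducible
top_k = [12, 7, 4051, 88, 19]          # hypothetical ids of the top-K stories
corpus_size = 98161                    # number of stories reported for the corpus

unchanged = cand_aug(top_k, corpus_size, rho=0.0, rng=rng)
all_random = cand_aug(top_k, corpus_size, rho=1.0, rng=rng)
mixed = cand_aug(top_k, corpus_size, rho=0.5, rng=rng)
```

After augmentation, the retrieval scores are recomputed for the perturbed set so that the loss covers candidates the retriever would not have proposed on its own. On average, $\rho K$ of the $K$ candidates are replaced at each step.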

3.3.4 Training Retrieval-Augmented Generator

With the retrieval content obtained by the retriever, we hope to generate the response more accurately; hence, we further train the generator in a supervised manner similar to the first stage (i.e., Section 3.3.1). Specifically, we minimize the negative log-likelihood of the response given the persona, dialogue context, and retrieval content as

\begin{equation}
\begin{aligned}
\mathcal{L}_{\mathcal{G}} &= -\log(\mathcal{G}_{\theta}(r|P,U,D_q^{aug})) \\
&= -\sum_{t=1}^{|r|}\log(\mathcal{G}_{\theta}(r_t|P,U,D_q^{aug},r_{<t})).
\end{aligned}
\tag{3}
\end{equation}

It is easy to see that Eq. (3) is similar to Eq. (1), with the retrieval content as an additional input.

3.3.5 Retriever-Generator Joint Training

At the final stage, we aim to jointly train the retriever and generator to further improve them. Specifically, we minimize the sum of the losses of the two components as

\begin{equation}
\mathcal{L}=\mathcal{L}_{\mathcal{R}}^{aug}+\mathcal{L}_{\mathcal{G}}. \tag{4}
\end{equation}

In Eq. (4), the two loss functions are treated equally. Generally speaking, introducing a weighting hyperparameter between the two losses may result in better performance, but tuning it incurs additional computational cost. For simplicity, we do not introduce this hyperparameter and leave it for future study.

To summarize, Algorithm 2 in the appendix describes the complete two-stage training process.

3.4 Inference Process

During inference, stories from the ROCStory dataset are fetched in alignment with the provided persona and then integrated into the dialogue context using the Fusion-in-Decoder (FiD) technique Izacard and Grave (2021). Each combination of story, persona, and context is individually encoded. These encoded contexts are concatenated and processed via cross-attention in the decoder to produce the final response in an auto-regressive fashion. Additional experiments evaluating the effects of various query combinations, such as persona+context and context alone, are detailed in Appendix A.2 to highlight their impact on performance.
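The input construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `story:`/`persona:`/`context:` prefixes, the helper names, and the toy encoder are all hypothetical stand-ins, and a real system would use the T5 encoder and decoder instead.

```python
def build_fid_inputs(stories, persona, context):
    """Pair every retrieved story with the same persona and dialogue context."""
    persona_text = " ".join(persona)
    context_text = " ".join(context)
    return [
        f"story: {story} persona: {persona_text} context: {context_text}"
        for story in stories
    ]


def toy_encode(text):
    """Stand-in encoder: one (length, vowel count) vector per whitespace token."""
    return [
        (len(tok), sum(ch in "aeiou" for ch in tok.lower()))
        for tok in text.split()
    ]


def fuse_for_decoder(encoded_passages):
    """Concatenate the per-passage encodings along the sequence axis."""
    fused = []
    for states in encoded_passages:
        fused.extend(states)
    return fused


stories = ["I like fixing cars.", "I work in a shop three days a week."]
persona = ["I am a mechanic.", "I love antique cars."]
context = ["Hi, what do you do for a living?"]

inputs = build_fid_inputs(stories, persona, context)
encoded = [toy_encode(text) for text in inputs]  # each passage is encoded on its own
fused = fuse_for_decoder(encoded)  # the decoder would cross-attend over this sequence
```

Encoding each passage separately keeps the encoder cost linear in the number of retrieved stories, while the decoder still sees all of them jointly through cross-attention.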

4 Experiment

In this section, we empirically evaluate the proposed LAPDOG model.

4.1 Dataset

ConvAI2 Dinan et al. (2019) is a crowdsourced dialogue dataset featuring 8939/1000 multi-turn conversations grounded in 1155/100 persona descriptions for the train/dev splits. Each persona is succinctly described by approximately 5 profile sentences. Paired workers engaged in interactive conversations based on predefined personas.

4.2 Retrieval Corpus

Given the absence of a paired annotated retrieval corpus connected to ConvAI2, we employ ROCStory Mostafazadeh et al. (2016) as an auxiliary retrieval dataset. Our aim is for the narratives within this dataset to serve as supplemental content to the existing personas within the dialogue. We pre-process ROCStory to align its narrative style more closely with persona representation, including changes such as transforming ‘he’ to ‘I’ and ‘does’ to ‘do’. The detailed pre-processing is listed in Appendix A.7. Statistically, there are 98,161 stories within the corpus, and each story is composed of $5$ sentences.

4.3 Experimental Settings

We employ T5 series models Raffel et al. (2020) (small, base, XL) as the foundation model for the generator. We initialize our generator with pre-trained weights from T5 and subsequently fine-tune it on the CONVAI2 dataset as $\mathcal{G}_{sup}$. The dense retriever is initialized with Contriever Izacard et al. (2021), a dual-encoder retriever that shares a similar encoder structure to BERT Devlin et al. (2019). We performed fine-tuning on both the retriever and generator using $\mathcal{G}_{sup}$, with a learning rate of $5 \times 10^{-4}$ and $\rho = 0.5$. We further tune for learning retrieval augmentation based on the supervised foundation models for one epoch. We use the persona profile as the query to retrieve the relevant stories.

4.4 Evaluation Metric

LAPDOG aims to optimize generation metrics directly to enhance dialogue quality. The evaluation comprises three metrics. The first metric is F1, which computes the harmonic mean of precision and recall at the word level between the generated text and the ground truth. The second metric is BLEU Papineni et al. (2002); Post (2018), an $n$-gram precision-based measure that quantifies the overlap between the generated text and the ground truth, with a brevity penalty for overly short outputs. The third metric is ROUGE-L Lin (2004), a variant of ROUGE that considers the longest common subsequence between the generated text and the ground truth to measure sentence-level structural similarity. Together, these metrics provide a comprehensive assessment of the quality of generated dialogues. To improve all three, we use their sum as the overall metric for training the LAPDOG model.
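To make the word-level metrics concrete, the following sketch implements a word-level F1 and a simplified ROUGE-L, and sums the three metrics into one score. It is an illustration under assumptions, not the evaluation code used in the experiments: tokenization is plain whitespace splitting, ROUGE-L uses a balanced F-measure over the longest common subsequence, BLEU is passed in as a precomputed value, and all function names are hypothetical.

```python
from collections import Counter


def word_f1(prediction, reference):
    """Harmonic mean of word-level precision and recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence (rolling-row dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(prediction, reference):
    """Simplified ROUGE-L: balanced F-measure over the LCS length."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def combined_metric(f1, bleu, rouge_l):
    """Overall training signal: the plain sum of the three metrics."""
    return f1 + bleu + rouge_l


prediction = "i love fixing old cars"
reference = "i love to fix old cars"
f1 = word_f1(prediction, reference)
rouge_l = rouge_l_f(prediction, reference)
```

In this example four of the five predicted words appear in the six-word reference in the same order, so both scores come out to 8/11.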

4.5 Baseline

To compare different retrieval-augmentation approaches, we evaluate against the following baselines. First, we compare LAPDOG with the T5$^{S/B/XL}_{sup}$ models, where $S/B/XL$ indicates the model size (small, base, and XL, respectively). T5$^{S/B/XL}_{sup}$ serves as the foundation model $\mathcal{G}_{sup}$. We also add a fixed retriever $\mathcal{R}_{fix}$ initialized from Contriever to compare tuned and untuned retrievers. Meanwhile, we use reinforcement learning tuning (T5$^{S}_{Sup}$+$\mathcal{R}_{fix}$+RL) as one baseline, where the reward is set as the desired objective. Lastly, we introduce RAG tuning, which updates the retriever based on the generator’s training output probabilities instead of the desired metric, to compare direct and indirect metric tuning for the objective.

\begin{tabular}{lrrrrrrr}
\hline
Model & F1 & BLEU & ROUGE-L & $\Delta$F1 & $\Delta$BLEU & $\Delta$ROUGE-L & $\Delta$AVG \\
\hline
T5$^{S}_{Sup}$ & 13.90 & 3.03 & 14.15 & 0.00\% & 0.00\% & 0.00\% & 0.00\% \\
T5$^{S}_{Sup}$+$\mathcal{R}_{fix}$ & 14.09 & 3.08 & 14.00 & 1.37\% & 1.59\% & -1.03\% & 0.64\% \\
T5$^{S}_{Sup}$+$\mathcal{R}_{fix}$+RL & 2.50 & 0.21 & 5.49 & -81.98\% & -93.07\% & -61.22\% & -78.76\% \\
T5$^{S}_{Sup}$+RAG & 14.20 & 2.94 & 14.10 & 2.14\% & -2.84\% & -0.35\% & -0.35\% \\
T5$^{S}_{Sup}$+LAPDOG & 14.62 & 3.23 & 14.44 & 5.17\% & 6.57\% & 2.07\% & 4.60\% \\
\hline
T5$^{B}_{Sup}$ & 15.47 & 3.42 & 14.93 & 0.00\% & 0.00\% & 0.00\% & 0.00\% \\
T5$^{B}_{Sup}$+$\mathcal{R}_{fix}$ & 14.36 & 3.34 & 15.05 & -7.18\% & -2.31\% & 0.82\% & -2.89\% \\
T5$^{B}_{Sup}$+LAPDOG & 16.08 & 3.53 & 15.33 & 3.94\% & 3.21\% & 2.67\% & 3.27\% \\
\hline
T5$^{XL}_{Sup}$ & 16.22 & 3.55 & 15.55 & 0.00\% & 0.00\% & 0.00\% & 0.00\% \\
T5$^{XL}_{Sup}$+$\mathcal{R}_{fix}$ & 16.27 & 3.36 & 15.02 & 0.31\% & -5.32\% & -3.41\% & -2.81\% \\
T5$^{XL}_{Sup}$+LAPDOG & 17.11 & 3.56 & 15.64 & 5.49\% & 0.30\% & 0.56\% & 2.11\% \\
\hline
\end{tabular}
Table 1: Experimental results of various methods based on language models with varying sizes. $\Delta$ denotes the relative improvement over the supervised foundation model. The best result under each setting is shown in bold.
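The relative improvements in Table 1 can be approximately recomputed from the reported scores, as sketched below for the small model. The formula (percentage change over the supervised baseline, then an unweighted mean across the three metrics) is an assumption inferred from the table values. Because the reported scores are rounded to two decimals, the recomputed percentages may differ from the table in the last digit.

```python
def relative_improvement(score, baseline):
    """Percentage change of `score` relative to `baseline`."""
    return 100.0 * (score - baseline) / baseline


# Rounded scores copied from Table 1 (small model).
baseline = {"F1": 13.90, "BLEU": 3.03, "ROUGE-L": 14.15}
lapdog = {"F1": 14.62, "BLEU": 3.23, "ROUGE-L": 14.44}

deltas = {
    name: relative_improvement(lapdog[name], baseline[name])
    for name in baseline
}
# Unweighted mean over the three metrics, mirroring the AVG column.
average_delta = sum(deltas.values()) / len(deltas)

for name, value in deltas.items():
    print(f"{name}: {value:+.2f}%")
print(f"AVG: {average_delta:+.2f}%")
```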

4.6 Results

The T5$^{S}_{Sup}$ model forms our baseline. Augmenting it with a fixed retriever (T5$^{S}_{Sup}$+$\mathcal{R}_{fix}$) yields a slight improvement in F1 and BLEU but a small decrease in ROUGE-L, which suggests a trade-off in terms of capturing long-distance dependencies.

The results for reinforcement learning tuning, T5$^{S}_{Sup}$+$\mathcal{R}_{fix}$+RL, exhibit a significant degradation across all metrics, indicating that this method is not effective for this task. This could be due to the challenge of setting an appropriate reward function for reinforcement learning.

The T5$^{S}_{Sup}$+RAG model slightly outperforms the baseline model in terms of F1 but performs worse in terms of BLEU and ROUGE-L. This suggests that while the model seems to generate more correct words, there may be a compromise on the overall grammatical and semantic quality of the generated text. In contrast, the LAPDOG-enhanced model, T5$^{S}_{Sup}$+LAPDOG, shows the highest improvements in all metrics among small models. This indicates that LAPDOG significantly enhances the ability to generate high-quality text and captures the desired metrics more effectively than other models.

For larger models, similar phenomena are observed. LAPDOG consistently delivers the best improvements over the base model, whether it is T5$^{B}_{Sup}$ or T5$^{XL}_{Sup}$. This suggests that the efficacy of LAPDOG is not confined to smaller models and scales well with the model size.

\begin{tabular}{lrrr}
\hline
Method & BLEU & ROUGE-L & F1 \\
\hline
LAPDOG & 3.23 & 14.44 & 14.62 \\
w/o BLEU & 3.07 & 14.35 & 14.29 \\
w/o F1 & 2.87 & 13.88 & 13.88 \\
w/o ROUGE-L & 2.96 & 14.12 & 13.99 \\
\hline
\end{tabular}
Table 2: Ablation study with respect to the use of metrics in the LAPDOG model.

5 Ablation Studies

We conduct ablation studies with respect to the metrics, the number of candidates, candidate augmentation, and training strategy, respectively.

5.1 Analysis on Metrics

To understand the individual contribution of each metric, we perform ablation experiments by successively removing one metric from the combined optimization process with results shown in Table 2. When we remove BLEU, the performance experiences a slight drop across all metrics with BLEU decreasing to 3.07, ROUGE-L to 14.35, and F1 to 14.29, suggesting that BLEU contributes to a more precise matching of response generation. Moreover, when we exclude F1, we see a more significant reduction in the performance, indicating the crucial role of F1 to ensure the overlap of words between the generated responses and the ground truth. Lastly, the removal of ROUGE-L also results in a decrease in the performance across all three metrics, showing its essential role in evaluating the coherence of generated dialogues. In summary, each metric contributes uniquely to the optimization process, and their combination in LAPDOG provides a more comprehensive guide for the generation of high-quality, personalized dialogues.

Figure 2: Ablation study with respect to the number of candidates $K$.

5.2 Number of the Candidates

As shown in Figure 2, as $K$ increases from 2 to 6, we observe an overall improvement in all metrics (i.e., BLEU, ROUGE-L, and F1), so increasing the number of retrieval candidates generally improves the performance of the model. Interestingly, the performance does not increase monotonically with the number of candidates: it fluctuates as $K$ varies, implying that the number of retrieval candidates needs to be carefully selected. Too few candidates may not provide enough information for generating responses, while too many may introduce irrelevant information that could confuse the model.

Figure 3: Comparison of the unique number of stories retrieved by different models.
\begin{tabular}{lrrr}
\hline
Method & BLEU & ROUGE-L & F1 \\
\hline
LAPDOG & 3.23 & 14.44 & 14.62 \\
w/o CandAug & 3.14 & 14.19 & 14.43 \\
LAPDOG$_{scratch}$ & 2.95 & 13.90 & 14.23 \\
\hline
\end{tabular}
Table 3: Ablation study on the candidate augmentation and training strategy.

5.3 Candidate Augmentation

The influence of candidate augmentation is explored in two aspects: quantitative performance and the diversity of retrieved stories. Table 3 provides a comparison between the performance of the LAPDOG model with and without candidate augmentation. The incorporation of candidate augmentation leads to superior performance across all three evaluation metrics. Specifically, LAPDOG without candidate augmentation attains slightly lower scores in all three metrics. This indicates that the inclusion of candidate augmentation enhances the overall performance of our model, confirming its crucial role in the proposed LAPDOG model.

To further investigate the impact of candidate augmentation on retrieval diversity, we count the number of unique stories retrieved during testing. As shown in Figure 3, the LAPDOG method with candidate augmentation retrieves 1570 unique stories, whereas without it, the model retrieves only 738 unique stories. This result implies that candidate augmentation contributes significantly to retrieval diversity. Other methods, namely RAG and the fixed retriever, retrieve 25 and 1351 unique stories, respectively. This underlines the effectiveness of the proposed candidate augmentation approach in enhancing the diversity of the retrieval process, which in turn can help generate more personalized and contextually rich responses.
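The diversity count described above amounts to taking the size of the union of retrieved story identifiers over all test queries. A minimal sketch, assuming each test query yields a list of integer story IDs (the IDs and the function name below are illustrative only):

```python
def count_unique_stories(retrieved_per_query):
    """Number of distinct story IDs retrieved across all test queries."""
    unique_ids = set()
    for story_ids in retrieved_per_query:
        unique_ids.update(story_ids)
    return len(unique_ids)


# Hypothetical retrieval results for three test queries.
retrieved = [[3, 17, 42], [17, 42, 99], [3, 3, 5]]
n_unique = count_unique_stories(retrieved)
```

A retriever that keeps returning the same few stories for every query yields a small count, which is the collapse that the low RAG figure points to.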

5.4 Training Strategy

In contrast to the two-stage training process, we perform an ablation study by training LAPDOG from scratch, bypassing the first stage. As shown in Table 3, the results exhibit a significant decrease in all the metrics compared with the two-stage approach. Additionally, training directly from scratch requires more time to converge when compared with the two-stage training process. Overall, the two-stage training process is essential from both performance and efficiency standpoints.

6 Conclusion

In this paper, we introduced LAPDOG, an end-to-end learnable retrieval-augmented framework for personalized dialogue generation. We show that LAPDOG jointly tunes the retriever with the generator to retrieve useful stories from the ROCStory dataset, improving the desired metrics on the CONVAI2 dataset. LAPDOG achieves consistent performance gains over language models of varying sizes.

Acknowledgements

This work is supported by NSFC general grant 62076118 and Shenzhen fundamental research program JCYJ20210324105000003.

Limitations

Given resource constraints, in this paper, we employ language models such as T5 and do not conduct experiments based on currently prevalent large language models Brown et al. (2020); Ouyang et al. (2022). Recognizing the enhanced reasoning capabilities of large language models, we posit that tuning the retrieval content with such models could yield significant advantages. Additionally, due to resource limitations, we study a small number of extracted passages (i.e., 2-6) and a short context length (i.e., 512 tokens). Nevertheless, we anticipate that incorporating a larger set of integrated stories and a longer context would further enhance the performance. In addition, a more diverse objective than the summation of F1, ROUGE, and BLEU might be more helpful for training an engaging conversational AI system. Finally, the generator is a conventional T5 model; a generator explicitly designed for personalized dialogue could further improve the performance of the proposed LAPDOG model.

Ethics Statement

This work proposes a novel LAPDOG model for personalized dialogue generation, focusing on generating highly tailored responses by leveraging persona profiles and dialogue context. As with all machine learning applications, it is crucial to consider the ethical implications. The use of personal information in our model is limited to fictional persona profiles, and we do not handle or store any real personal data in our experiments. However, when applying our model to real-world applications, careful consideration should be given to data privacy and consent. It is essential to ensure that all personal information used to generate personalized dialogues is obtained ethically and used with the individuals’ informed consent. Moreover, the generated content should respect user privacy, dignity, and cultural sensitivities.

References

  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. Henighan, R. Child, A. Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. arXiv:2005.14165.
  • Cao et al. (2022) Yu Cao, Wei Bi, Meng Fang, Shuming Shi, and Dacheng Tao. 2022. A model-agnostic data manipulation method for persona-based dialogue generation. arXiv:2204.09867.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, pages 4171–4186.
  • Dinan et al. (2019) Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander H. Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur D. Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander I. Rudnicky, Jason Williams, Joelle Pineau, Mikhail S. Burtsev, and Jason Weston. 2019. The second conversational intelligence challenge (convai2). arXiv:1902.00098.
  • Huang et al. (2022) Qiushi Huang, Yu Zhang, Tom Ko, Xubo Liu, Bo Wu, Wenwu Wang, and Lilian Tang. 2022. Personalized dialogue generation with persona-adaptive attention. arXiv preprint arXiv:2210.15088.
  • Izacard et al. (2021) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning.
  • Izacard and Grave (2021) Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, Online. Association for Computational Linguistics.
  • Jiang et al. (2023) Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
  • Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA. Curran Associates Inc.
  • Li et al. (2019) Zekang Li, Cheng Niu, Fandong Meng, Yang Feng, Qian Li, and Jie Zhou. 2019. Incremental transformer with deliberation decoder for document grounded conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 12–21, Florence, Italy. Association for Computational Linguistics.
  • Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Liu et al. (2020) Qian Liu, Yihong Chen, Bei Chen, Jian-Guang Lou, Zixuan Chen, Bin Zhou, and Dongmei Zhang. 2020. You impress me: Dialogue generation via mutual persona perception. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1417–1427, Online. Association for Computational Linguistics.
  • Mostafazadeh et al. (2016) Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839–849, San Diego, California. Association for Computational Linguistics.
  • Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. 2022. Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, USA. Association for Computational Linguistics.
  • Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  • Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Santhanam et al. (2022) Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3715–3734, Seattle, United States. Association for Computational Linguistics.
  • Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools.
  • Shi et al. (2023) Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models. ArXiv, abs/2301.12652.
  • Song et al. (2021) Haoyu Song, Yan Wang, Kaiyan Zhang, Weinan Zhang, and Ting Liu. 2021. Bob: Bert over bert for training persona-based dialogue models from limited personalized data. In Association for Computational Linguistics.
  • Wolf et al. (2019) Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. Transfertransfo: A transfer learning approach for neural network based conversational agents. arXiv:1901.08149.
  • Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur D. Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Association for Computational Linguistics.
  • Zhao et al. (2020) Xueliang Zhao, Wei Wu, Chongyang Tao, Can Xu, Dongyan Zhao, and Rui Yan. 2020. Low-resource knowledge-grounded dialogue generation. arXiv preprint arXiv:2002.10348.
  • Zhou et al. (2022) Shuyan Zhou, Uri Alon, Frank F Xu, Zhengbao Jiang, and Graham Neubig. 2022. Doccoder: Generating code by retrieving and reading docs. arXiv preprint arXiv:2207.05987.

Appendix A Appendix

A.1 Detail Settings for Training

\begin{tabular}{ll}
\hline
Setting & Value \\
\hline
$\tau_{g}$ & 0.85 \\
$\tau_{s}$ & 0.8 \\
$dialog_{maxturns}$ & 3 \\
Dropout & 0.1 \\
LR & 5.00E-04 \\
Optimizer & Adam \\
WeightDecay & 0.01 \\
RetrievalCandidate & 6 \\
$\rho$ & 0.5 \\
BatchSize & 240 \\
\hline
\end{tabular}
Table 4: Detailed experimental settings for training LAPDOG.
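\begin{tabular}{lrrr}
\hline
Query & BLEU & ROUGE-L & F1 \\
\hline
Persona & 3.23 & 14.44 & 14.62 \\
Persona+Dialogue & 3.19 & 14.83 & 15.06 \\
Generated & 3.05 & 14.22 & 14.32 \\
One Persona & 3.10 & 14.29 & 14.39 \\
\hline
\end{tabular}
Table 5: The evaluation on different queries.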

A.2 Query Analysis

We analyze the performance impact of different retriever queries, with results shown in Table 5. Using the Persona alone as a query achieves the highest BLEU score, albeit with a slight trade-off in ROUGE-L and F1 scores. When combining Persona with the Dialogue, the ROUGE-L and F1 scores improve marginally, but at the expense of a slight decrease in BLEU. Following the idea of forward retrieval Jiang et al. (2023), we experimented with using the generator’s output (Generated) as a query, but observed less competitive performance. Lastly, using a single sentence chosen at random from the persona profile (One Persona) resulted in worse performance than the full persona on all metrics.
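The query variants compared above can be sketched as a single helper. This is an illustrative assumption about how such queries might be assembled (plain string concatenation); the function name and mode labels are hypothetical, and the Generated variant is omitted because it requires a trained generator.

```python
import random


def build_query(persona, dialogue, mode, rng=None):
    """Assemble the retriever query for one of the compared settings."""
    if mode == "persona":
        return " ".join(persona)
    if mode == "persona+dialogue":
        return " ".join(persona + dialogue)
    if mode == "one_persona":
        # A single persona sentence chosen at random.
        rng = rng or random.Random()
        return rng.choice(persona)
    raise ValueError(f"unknown query mode: {mode}")


persona = ["I am a mechanic.", "I love antique cars."]
dialogue = ["Hi, what do you do for a living?"]

persona_query = build_query(persona, dialogue, "persona")
joint_query = build_query(persona, dialogue, "persona+dialogue")
single_query = build_query(persona, dialogue, "one_persona", rng=random.Random(0))
```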

A.3 Evaluation Results on CONVAI2 Revised Dataset

As shown in Table 6, LAPDOG consistently enhances the performance on the revised dataset, where the persona is paraphrased to more implicit background sentences. The revised version is considered a more difficult task than the original task.

\begin{tabular}{lrrr}
\hline
Method & BLEU & ROUGE-L & F1 \\
\hline
T5$^{S}_{Sup}$ & 2.01 & 12.48 & 11.56 \\
T5$^{S}_{Sup}$+LAPDOG & 2.21 & 13.44 & 12.82 \\
\hline
\end{tabular}
Table 6: Evaluation results on the CONVAI2-Revised version, where the persona is paraphrased to more implicit background sentences. The revised version is considered a more difficult task than the original task.
Story Examples
Antique Car Show
I like fixing cars.
I have just finished repairing and restoring an antique sports car.
I proudly enter it in a local car show for antique vehicles.
I won a cash prize for my hard work! Now I have enough money to buy another antique car to restore.
Mechanic
I am a mechanic and love to work on cars.
I work in a shop three days a week.
In my spare time I fix cars for people in my garage.
I do great work at a fast pace for a small fee.
I get two incomes doing what I love.
Table 7: Two examples of stories, pre-processed from the ROCStory dataset.
Algorithm 1 Pre-processing Procedure of Story Corpus
1: Input: a story corpus $\mathcal{D}_{ori}$
2: Output: a pre-processed story corpus $\mathcal{D}$
3: Extract named entities from $\mathcal{D}_{ori}$ using BERT-BASE-NER
4: Filter out person-related named entities with tag B-PER
5: Replace the person-related named entities with first-person narratives within the stories in $\mathcal{D}_{ori}$
6: Process the first-person stories with a grammatical error correction model gec-t5_small
7: Output the corrected stories as $\mathcal{D}$
Algorithm 2 Complete Training Process of LAPDOG
1: Persona sentences $P$, dialogue context $U$, a ground truth response $y$, a generator $\mathcal{G}$, a dense retriever $\mathcal{R}$, and a story corpus $\mathcal{D}$
2: A tuned retriever $\mathcal{R}_{tuned}$, a tuned generator $\mathcal{G}_{tuned}$
3: Construct query $q$ from a query function $Query(P, U)$
4: Stage 1:
5: Initialize $\mathcal{G}_{sup}$ with $\mathcal{G}$
6: Train a supervised tuned generator $\mathcal{G}_{sup}$ given input $(P, U)$ and ground truth $y$
7: Train $\mathcal{G}_{sup}$ until convergence
8: Stage 2:
9: Retrieve top-$K$ stories $D_q$ given $q$
10: Apply Candidate Augmentation $D_q^{aug} = \text{CandAug}(d_i, \rho);\ d_i \in D_q$
11: Compute retriever scores $S_q^{aug}$ between query $q$ and $D_q^{aug}$
12: for Retrieved story $d_i$ in $D_q^{aug}$ do
13:     Construct augmented input by concatenation $a_i = [d_i; P; U]$
14:     Generate text by $\text{pred}_i = \mathcal{G}_{sup}(a_i)$
15:     Compute metrics as $m_i = \text{M}(\text{pred}_i, y)$
16: end for
17: Gather $M = \{\text{softmax}(m_i)\}_{i=1}^{K}$, $A = \{\text{softmax}(a_i)\}_{i=1}^{K}$
18: Compute retriever’s loss by $\mathcal{L}_{\mathcal{R}}^{aug} = \text{KL}(M, S_q^{aug})$
19: for augmented input $a_i$ in $A$ do
20:     Compute negative log-likelihood loss $\mathcal{L}_{\mathcal{G}}$ with input $a_i$ and ground truth target $y$
21:     Update $\mathcal{G}_{sup}$ and $\mathcal{R}$ by $\mathcal{L} = \mathcal{L}_{\mathcal{R}}^{aug} + \mathcal{L}_{\mathcal{G}}$ as $\mathcal{G}_{tuned}$ and $\mathcal{R}_{tuned}$
22: end for
23: Repeat the steps in Stage 2 until convergence
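
The retriever loss in steps 17 and 18 of Algorithm 2 can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the algorithm: the KL term is read as KL(M ‖ softmax(S)), the retriever scores are normalized with a softmax, and no temperature is applied. The function names and the numeric values are hypothetical and serve only as an illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def retriever_loss(metric_scores, retriever_scores):
    """Sketch of steps 17-18: the per-story metrics m_i define the target
    distribution M, and the retriever scores are pulled toward it."""
    target = softmax(metric_scores)         # M
    predicted = softmax(retriever_scores)   # assumption: S_q^aug is softmax-normalized
    return kl_divergence(target, predicted)

# Hypothetical per-story metrics (e.g., BLEU of each pred_i against y).
metrics = [0.10, 0.40, 0.20]
agree = [0.10, 0.40, 0.20]      # retriever already ranks stories like the metric
disagree = [0.40, 0.10, 0.20]   # retriever prefers the least helpful story

loss_agree = retriever_loss(metrics, agree)
loss_disagree = retriever_loss(metrics, disagree)
```

Under these assumptions the loss vanishes when the retriever's distribution matches the metric-derived target and grows as the two rankings diverge, which is the signal that steers the retriever toward stories that help the generator.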

A.4 Case Study

As shown in Tables 11, 12, and 13, we present case studies on three conversations to compare the non-retrieval-augmented results of T5$^{XL}_{sup}$, the retrieval-augmented results of T5$^{XL}_{sup}$+LAPDOG, and the ground truth.

For the conversation in Table 11, the ground truth shows that the agent is about to bring up divorce. LAPDOG retrieves several stories about hard times in the speaker's personal and family life, which may serve as cues for the decision to divorce. In this conversation, T5$^{XL}_{sup}$ complains about life but does not mention divorce. LAPDOG, reinforced by the retrieved stories and the persona, states the intention to divorce, which aligns more closely with the ground truth. The other retrieved stories describe troubles at work, which may reflect the persona sentence “I hate my job.”

In the conversation in Table 12, the dialogue opens with a simple “How are you doing today?” T5$^{XL}_{sup}$ gives a standard, safe, but bland answer, “I am good. How are you?”, while LAPDOG draws on the gym-related stories and persona to answer “I’m good. Just got back from the gym. How are you?”, which is more informative and engaging. Additionally, the story “Lifestyle Change” offers a plausible reason why the agent decided to go to the gym, and the story “Home Gym” conveys enthusiasm for working out. Such stories could give the model richer information for generating personalized responses.

As shown in Table 13, the T5$^{XL}_{sup}$ model simply replies “I hope so!” without further informative content, while LAPDOG gives a richer answer, “I hope so! I’m only in grade 3 so I’m hoping to go to Disney World soon!”, which appears to draw on both the persona and the stories. Additionally, retrieved stories such as “Dream Job” and “Disneyland” are aligned with the agent’s interests.

A.5 Comparison to Traditional Knowledge Dialogue Approaches

Evaluating LAPDOG against traditional knowledge-grounded dialogue methods is crucial. A comparison with “Low-Resource Knowledge-Grounded Dialogue Generation” Zhao et al. (2020) was considered but not possible due to the unavailability of its code. Instead, we selected ITDD (Incremental Transformer with Deliberation Decoder) Li et al. (2019) for this purpose, given its proven effectiveness and wide recognition.

A.5.1 ITDD Experimental Setup

LAPDOG and ITDD are based on different principles. LAPDOG uses an unsupervised approach to train both a retriever and a generator for extracting relevant content from an external corpus. On the other hand, ITDD is designed to merge pre-annotated document-conversation pairs. To ensure a fair comparison, we used an off-the-shelf retriever Izacard et al. (2021) to create paired data from the persona and ROC story corpus for the ITDD model.
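
The pairing step described above can be sketched as follows. This is only an illustration under stated assumptions: the bag-of-words `embed` function is a hypothetical stand-in for the off-the-shelf dense retriever, the helper names are invented for this example, and the persona and story snippets are shortened from the case studies.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in encoder: bag-of-words counts.
    Assumption: the actual setup uses a dense retriever instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pair_with_stories(persona, stories, k=1):
    """Return the k stories most similar to the persona, best first."""
    query = embed(" ".join(persona))
    ranked = sorted(stories, key=lambda s: cosine(query, embed(s)), reverse=True)
    return ranked[:k]

persona = ["I hate my job.", "I work as a car salesman."]
stories = [
    "I hate my job. Every day at work I get angry.",
    "I have never been to Disneyland.",
]
paired = pair_with_stories(persona, stories, k=1)
```

Each persona, together with its top-ranked story, then forms one document-conversation pair of the kind ITDD expects as input.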

A.5.2 Comparison Results

We compared LAPDOG and ITDD on F1, BLEU, and ROUGE-L. The results in Table 8 show that LAPDOG outperforms ITDD on all three metrics at every model size.

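\begin{tabular}{lccc}
\hline
Method & F1 & BLEU & ROUGE-L \\
\hline
ITDD & 9.71 & 0.66 & 10.90 \\
T5-S+LAPDOG & 14.62 & 3.23 & 14.44 \\
T5-B+LAPDOG & 16.08 & 3.53 & 15.33 \\
T5-L+LAPDOG & 17.11 & 3.56 & 15.64 \\
\hline
\end{tabular}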
Table 8: Comparison of LAPDOG and ITDD on key performance metrics.

These results indicate that LAPDOG is more effective than the traditional knowledge-grounded ITDD method in this setting.

A.6 Complete Training Procedure

The complete training procedure is described in Algorithm 2.

A.7 Pre-processing the ROCStory Corpus

The pre-processing procedure is described in Algorithm 1.

A.8 Extended Evaluation Metrics

In addition to the primary assessment based on F1, BLEU, and ROUGE-L, we evaluate LAPDOG with METEOR and BERT scores. The additional results are reported in Table 9.

Based on the results in Table 9, LAPDOG achieves a higher METEOR score than the baseline models (17.76 versus 16.45 and 16.21). However, the differences in BERT scores are minimal, likely because LAPDOG is optimized for traditional metrics. Tailoring the optimization to BERT scores is a promising direction for future work.

\begin{tabular}{lcccc}
\hline
Model & METEOR & BERT-F1 & BERT-Precision & BERT-Recall \\
\hline
T5$^{XL}_{Sup}$ & 16.45 & 85.95 & 87.07 & 84.88 \\
T5$^{XL}_{Sup}$+$\mathcal{R}_{fix}$ & 16.21 & 85.78 & 86.99 & 84.62 \\
T5$^{XL}_{Sup}$+LAPDOG & 17.76 & 85.99 & 87.35 & 84.70 \\
\hline
\end{tabular}
Table 9: Additional evaluation metrics, METEOR and BERT scores.
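
For readers unfamiliar with the word-overlap F1 used alongside BLEU and ROUGE-L, a minimal sketch is given here. It assumes whitespace tokenization with lowercasing; the exact normalization used in the experiments is not specified in this appendix, and the function name is hypothetical.

```python
from collections import Counter

def unigram_f1(prediction, reference):
    """Harmonic mean of unigram precision and recall between two strings.
    Assumption: tokens are obtained by lowercasing and splitting on whitespace."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = unigram_f1("I hope so!", "I hope so! See you there.")
```

Because this metric rewards only surface overlap, embedding-based measures such as BERT scores can respond differently to the same change in model output.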

A.9 Additional Related Work on Large Language Models (LLMs)

Language models compute probability distributions over text sequences. Recent advances have scaled these models from millions of parameters Radford et al. (2019) to billions Brown et al. (2020), and have extended the training corpus to include web text and instruction data Ouyang et al. (2022). These advances have significantly improved the performance of large language models (LLMs) across many NLP tasks. Notably, in conversational tasks, the quality of generated text improves as both the model size and the training corpus grow. Our proposed approach, LAPDOG, diverges from the prevailing trend of scaling: it leverages retrieval-augmented generation to yield more diverse and interpretable results with fewer model parameters and a smaller corpus. Although our experiments use a smaller model, we posit that our adaptive retrieval approach could complement existing LLMs and potentially improve their generation quality.

A.10 Human Evaluation

We conducted a human evaluation to gauge the preference between the retrieval-augmented LAPDOG and the fine-tuned T5 model. Evaluators were presented with a dialogue accompanied by two responses, one from each model, and were asked to choose their preferred response.

\begin{tabular}{lcc}
\hline
 & LAPDOG & Fine-tuned Model \\
\hline
Human Preferred & 56.32\% & 43.68\% \\
\hline
\end{tabular}
Table 10: Comparison of human preference for responses generated by LAPDOG and a fine-tuned T5 model.

Referring to Table 10, while LAPDOG was preferred by 56.32% of the evaluators, the difference between the two models is relatively narrow. This indicates that both models have their merits in certain conversational contexts. The slight edge for LAPDOG suggests that retrieval-augmented responses might align closer to human expectations in some scenarios. Delving deeper into the nuances of this preference, especially across varied dialogue topics, would provide a more comprehensive understanding.
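
Whether a 56.32% preference is distinguishable from an even split depends on the number of judgments, which is not reported here. The following sketch shows one way to check this with an exact two-sided sign test. The sample sizes in the loop are hypothetical and are not taken from the study, and the function name is introduced only for illustration.

```python
from math import comb

def two_sided_sign_test(wins, n):
    """Exact two-sided binomial p-value under the null hypothesis
    that both models are equally likely to be preferred (p = 0.5)."""
    pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
    lower = sum(pmf[: wins + 1])   # P(X <= wins)
    upper = sum(pmf[wins:])        # P(X >= wins)
    return min(1.0, 2 * min(lower, upper))

# Hypothetical numbers of judgments, NOT taken from the paper.
for n in (100, 1000):
    wins = round(0.5632 * n)
    print(n, wins, two_sided_sign_test(wins, n))
```

The same preference rate yields a smaller p-value as the number of judgments grows, so the evaluation size matters when interpreting the gap in Table 10.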

Conversation #1
Persona
I am 40 years old.
I work as a car salesman.
My wife spends all my money.
I hate my job.
I’m planning on getting a divorce.
Dialogue
Agent: I know the struggle, I volunteer at a homeless shelter.
Human: I hope not either, would not wish it on my worse enemy.
T5$^{XL}_{sup}$ I am 40, so I think I am done with it.
T5$^{XL}_{sup}$+LAPDOG Yeah, I feel you. I want to get a divorce.
Ground Truth Between wife, job with no money, I am thinking now of divorce.
Retrieval Stories
Money
I have a full-time job.
My wife stays at home.
One day, I worked an over-time shift.
I decided to spend all the extra money on myself.
This made my wife sad.
Randy leaves
I am a man who has 5 kids.
I started arguing with my wife one day.
I got so upset I promised to leave my wife.
So that’s what I did.
I left my children and wife.
Cruel Boss
I have been dealing with extra expenses in the family.
Since my wife is a stay-at-home mom, I pay for everything.
I tried to get a raise by talking to my boss the next day.
My boss understood my situation and sincerely apologized.
Although I have been my employee for years, the offer was rejected.
Paul Gets Fired
I hate my job.
Every day at work I get angry and say mean things to people.
My boss gave me a verbal warning about my attitude at work.
Currently I am on a performance plan at work.
Next month I will be fired.
Table 11: The first case study.
Conversation #2
Persona
I like to spend my money on cars.
I go to the gym most days just to socialize.
I work from home.
I watch too much tv in spare time.
I have never had a steady relationship.
Dialogue
Human: Hi, How are you doing today?
T5$^{XL}_{sup}$ I am good. How are you?
T5$^{XL}_{sup}$+LAPDOG I’m good. Just got back from the gym. How are you?
Ground Truth I’m pretty good. Just a little lonely.
Retrieval Stories
Lifestyle Change
I was overweight and always tired of not having a girlfriend.
I decided to make some lifestyle changes.
I cancelled my cable TV and joined a gym.
I exercised and ate healthily for five months.
I started to feel good about myself and started to date!
Home Gym
I am addicted to working out and I go every day.
I just started a new job that has me working long hours.
These long hours prevent me from being able to go to the gym.
By the time I get off work the gym is closed.
I went out and bought all the stuff I needed to work out at home.
Mechanic
I am a mechanic and love to work on cars.
I work in a shop three days a week.
In my spare time I fix cars for people in my garage.
I do great work at a fast pace for a small fee.
I get two incomes doing what I love.
Car Mechanic
I love to work on cars.
I work as a mechanic at a local car repair shop.
I decided to work on cars in my spare time too.
I advertised on social media that I would fix cars for cheap.
Before long I had a nice little side business going.
The Drive-Through
I ate out every day of my adult life.
I was a 27-year-old man.
I loved going to drive troughs and preferred eating in my car.
One day while I was ordering my favorite food I had a heart attack.
I was rushed to the hospital by the employees and went into a coma.
I died due to my excessive and unhealthy lifestyle later that night.
Table 12: The second case study.
Conversation #3
Persona
I love to sing songs from the movie Frozen.
I am in the third grade.
I play with my friends on the playground.
I love to go to Disney world every year.
Mickey Mouse is my favorite character.
Dialogue
Human: They’re about 5 years old.
Agent: Yeah Disney is their favorite! We always go to Disney land!
Human: That’s great, maybe we will meet there one day!
T5$^{XL}_{sup}$ I hope so!
T5$^{XL}_{sup}$+LAPDOG I hope so! I’m only in grade 3 so I’m hoping to go to Disney World soon!
Ground Truth Mickey Mouse is my favorite, and I love songs from the movie Frozen.
Retrieval Stories
Xena The Shy
My friend is very shy.
I am always being teased because of my name.
The boys at school call my warrior princess.
People are so mean.
I wish I was a warrior princess so I could make them be quiet!
Meeting Mickey
I went to Disney World for the first time when I was 7.
My parents bought me a book to collect character signatures.
I was so excited to get my signature so I didn’t notice a line.
I ran to hug I cut in front of many other kids.
The girl who was next in line yelled at me.
dream job
I have always wanted to play a character at Disney World.
I moved to Orlando and applied for a job.
Disney hired me as a customer service rep.
I worked very hard to achieve my goal.
The other day I got a promotion to play Mickey Mouse.
Pageant
Little I was a three-year-old beauty queen.
I was very good at walking and smiling for the judges.
This week it was different.
I would have to sing.
The day of the show I looked beautiful.
Too bad butterflies in my tummy flew away with the words to the song.
Cade’s Christmas Show
I am excited about my Christmas show.
I have been practicing Jingle Bells all week.
On the day of the show I was so nervous.
On stage I look out at the audience to find my mom.
I am in the first row, so now I know everything will be fine.
Disneyland
I have never been to Disneyland.
I love all the Disney characters.
My mom decided to take me to Disneyland.
I was so excited when I got there.
It was the best day of my young life.
Table 13: The third case study.