Planning with Large Language Models for Conversational Agents

Zhigen Li1,2, Jianxiang Peng1,Yanmeng Wang2,Tianhao Shen1
Minghui Zhang1, Linxi Su1, Shang Wu1, Yihang Wu1,Yuqian Wang1
Ye Wang2,
Wei Hu2,Jianfeng Li2, Shaojun Wang2, Jing Xiao2and Deyi Xiong1
1
College of Intelligence and Computing, Tianjin University, Tianjin, China
2Ping An Technology
{zhigenli,pjasonx,dyxiong}@tju.edu.cn
{wangyanmeng219,lijianfeng777,wangshaojun851}@pingan.com.cn

*Corresponding author
Abstract

Controllability and proactivity are crucial properties of autonomous conversational agents (CAs). Controllability requires the CAs to follow the standard operating procedures (SOPs), such as verifying identity before activating credit cards. Proactivity requires the CAs to guide the conversation towards the goal during user uncooperation, such as persuasive dialogue. Existing research cannot be unified with controllability, proactivity, and low manual annotation. To bridge this gap, we propose a new framework for planning-based conversational agents (PCA) powered by large language models (LLMs), which only requires humans to define tasks and goals for the LLMs. Before conversation, LLM plans the core and necessary SOP for dialogue offline. During the conversation, LLM plans the best action path online referring to the SOP, and generates responses to achieve process controllability. Subsequently, we propose a semi-automatic dialogue data creation framework and curate a high-quality dialogue dataset (PCA-D). Meanwhile, we develop multiple variants and evaluation metrics for PCA, e.g., planning with Monte Carlo Tree Search (PCA-M), which searches for the optimal dialogue action while satisfying SOP constraints and achieving the proactive of the dialogue. Experiment results show that LLMs finetuned on PCA-D can significantly improve the performance and generalize to unseen domains. PCA-M outperforms other CoT and ToT baselines in terms of conversation controllability, proactivity, task success rate, and overall logical coherence, and is applicable in industry dialogue scenarios. The dataset and codes are available at XXXX.

1 Introduction

Refer to caption
Figure 1: An example of the proposed task: planning-based LLMs for conversational agent with zero model training, satisfying both controllability and proactivity

Conversational agents (CAs) or dialogue systems are designed to offer social support or functional services to human users via natural language interactions. Existing mainstream approaches can be categorized into conversational question answering (CQA) (Anantha et al., 2021; Singhal et al., 2023; Zhuang et al., 2024), open-domain dialogue (ODD) (Li et al., 2017; Roller et al., 2021; Ouyang et al., 2022), task-oriented dialogue (TOD) (Budzianowski et al., 2018; Sun et al., 2022; Quan et al., 2020; Moradshahi et al., 2023) and conversational recommender systems (CRS) (Wu et al., 2019; Zhang et al., 2021; Wang et al., 2023a). The goal of CQA and ODD is to passively respond to users to provide knowledgeable or engaging conversations, thus lacking proactive. CRS sets goals by specifying a topic, such as “movie recommendation, ‘King of Comedy’,” and autonomously plans the best path of dialog actions to smoothly guide the conversation towards recommending “King of Comedy”. However, in complex tasks, dialogue actions frequently adhere to strict sequential constraints, as commonly found in the widely accepted standardized operating procedures (SOPs) established by humans across various domains (Akyar, 2012; Zhou et al., 2023). As shown in Figure 1, the prerequisite for persuading credit card activation is that the user must be the cardholder. CRS lacks planning globally controllable path of dialog actions. TOD aims to provide functional services according to the service process, and their controllability depends on manually drawing dialogue trees or predefined intent, slot, and action training data. It can proactively guide users to provide slot information, but cannot handle situations where users uncooperate, such as persuasive dialogue or negotiation dialogue (Wang et al., 2019; Chawla et al., 2021). We posit that dialogue’s proactivity, controllability, and reduced reliance on manual intervention are crucial to building autonomous business CAs. Recently, prompting large language models (LLMs) has become the predominant approach to zero-shot language tasks, and its applications in dialogue have received much attention. However, integrating LLMs into conversational agents while addressing the limitations of the aforementioned dialogue systems remains a significant challenge.

To address this challenge, we propose a novel dialogue framework, namely planning-based conversational agents (PCA) as shown in Figure 1. In contrast to prior research, our approach only requires humans to provide task definitions and goals for the LLMs to plan the core and necessary SOP offline before the conversation. During the conversation, LLM refers to the SOP to plan the best action path online and generate responses to achieve process controllability. For actions not in SOP, such as answering questions and persuasion, LLM can plan autonomously to enable proactive dialogue. To further this research, we must address several challenges: (1) How to construct a conversation dataset that meets both initiative and controllability? (2) What are the PCA implementation schemes? (3) How to accurately evaluate the effects of different schemes?

For the first issue, the most straightforward approach would be to manually edit flowcharts and dialogues, which are immensely costly. Recently, advanced LLMs, e.g., ChatGPT, can outperform crowd workers on many annotation and evaluation tasks (e.g. (Gilardi et al., 2023; Wang et al., 2023b; Pan et al., 2023)). Therefore, as shown in Figure 2, we design a four-step LLM role-playing system to assist manual correction in creating a dialogue dataset, which is more cost-effective and quickly scalable.

For the last two challenges, we first implemented the chain of thought (CoT) and supervised fine-tuning (SFT) for SOP prediction in offline planning. In online planning, we first attempt to encode SOP directly into the prompt of LLMs, using in-context learning (ICL) to enable them to understand and follow these constraints. Additionally, inspired by the use of the Monte Carlo Tree Search (MCTS) algorithm in the chess game such as AlphaGo which effectively harnesses human prior rules to control and enhance game AI (Silver et al., 2017), we draw an analogy between dialogue task SOPs and game rules. MCTS is a recent and strikingly successful example of a decision-time planning approach (Sutton and Barto, 2018). By utilizing these SOPs’ rules within MCTS’s expansion and simulation phases, we can better search for the optimal dialogue action that satisfies SOP. Finally, We evaluated the overall/controllable/proactive action accuracy in single-round dialogue and integrated dialogue dimensions. Additionally, manual evaluation was conducted to address the limitations of automatic evaluation regarding dialogue diversity.

Overall, we propose conversational agents based on LLMs with proactive and controllability and lower reliance on manual intervention. To the best of our knowledge, this problem has not been fully studied in prior work. Therefore, we advance the research on this issue from three aspects: task definition, data creation, and modeling. Our contributions are summarized as follows:

  • We propose and define the proactivity and controllable conversational agents with LLMs, which are promising yet unexplored.

  • We propose a semi-automatic dialogue dataset curation framework and develop a high-quality Chinese dialogue dataset (PCA-D). It not only provides insights into this task but also facilitates more research on it.

  • We develop multiple variants and evaluation metrics based on ICL, SFT, CoT, and MCTS modifications, which demonstrate superior performance in implementing offline planning of SOP and online planning of dialogue prediction.

2 Related Work

Dialogue Planning. In the process of dialogue planning, previous work focuses on subgoal generation (Zhang et al., 2021), the next round of dialogue transition strategy (Tang et al., 2019), hierarchical strategy (Kishinami et al., 2022). Recent work further involves planning dialogue paths based on basic knowledge, a goal-oriented dialogue planning framework to plan a series of dialogue actions, and guiding the system to proactively transition between different dialogue stages (Wang et al., 2022a). Existing research employs greedy, single-turn prediction strategies and overlooks the interdependencies of global policies in dialogue processes, resulting in global dialogue uncontrollable. We propose a SOP-based planning approach to address this issue.
Planning and Reasoning of LLMs. LLMs show prowess in planning and reasoning. Examples include Chain-of-Thought (Wei et al., 2022), its variants (Kojima et al., 2022) , Self-Consistency (Wang et al., 2022b), and Least-to-most prompting (Zhou et al., 2022). self-assessment  (Paul et al., 2023; Welleck et al., 2022; Shinn et al., 2023). Recent research uses more complex reasoning processes, offering new avenues for improving and optimizing the reasoning process. Yao et al. (2023) apply heuristic-based searches, such as depth-first and breadth-first search, to discover optimized reasoning pathways. Zhu et al. (2022); Hao et al. (2023) introduce MCTS to reasoning steps for complex math or logical reasoning. We focus on applying MCTS to implement controllable dialogue systems, with encoding SOPs’ constraints into the MCTS’s expansion and simulation steps.

3 Task Definition

Given a dialogue corpus {(Ti,Ci,Di)}i=1,2,,Nsubscriptsubscript𝑇𝑖subscript𝐶𝑖subscript𝐷𝑖𝑖12𝑁\{(T_{i},C_{i},D_{i})\}_{i=1,2,...,N}{ ( italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 , 2 , … , italic_N end_POSTSUBSCRIPT, where N𝑁Nitalic_N is the number of dialogues. In the i𝑖iitalic_i-th dialogue, Ti(Pi,Ai,Si)subscript𝑇𝑖subscript𝑃𝑖subscript𝐴𝑖subscript𝑆𝑖T_{i}\triangleq(P_{i},A_{i},S_{i})italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≜ ( italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) represents the task definition, which encompasses task and user profile Pisubscript𝑃𝑖P_{i}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT defining the task goal, relevant knowledge, and user personal profile; Agent actions Aisubscript𝐴𝑖A_{i}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT contains nodes from SOP and proactive actions outside of SOP; User states Sisubscript𝑆𝑖S_{i}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT reflecting the user’s state in completing the task. Ci{dir(np,mq)}np,mqAiSisubscript𝐶𝑖subscriptdirsubscript𝑛𝑝subscript𝑚𝑞subscript𝑛𝑝subscript𝑚𝑞subscript𝐴𝑖subscript𝑆𝑖C_{i}\triangleq\{\mathrm{dir}(n_{p},m_{q})\}_{n_{p},m_{q}\in A_{i}\cup S_{i}}italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≜ { roman_dir ( italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_m start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_m start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ∈ italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∪ italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT represents SOP, is a directed graph with agent actions and user states as nodes, and dir{,,,}dir\mathrm{dir}\in\{\text{--},\rightarrow,\leftarrow,\leftrightarrow{}\}roman_dir ∈ { – , → , ← , ↔ } representing direction between node n𝑛nitalic_n and m𝑚mitalic_m. The field descriptions and examples for task definitions can be found in A.1. Di{(uit,sit,ait,rit)}t=1,2,,Tsubscript𝐷𝑖subscriptsuperscriptsubscript𝑢𝑖𝑡superscriptsubscript𝑠𝑖𝑡superscriptsubscript𝑎𝑖𝑡superscriptsubscript𝑟𝑖𝑡𝑡12𝑇D_{i}\triangleq\{(u_{i}^{t},s_{i}^{t},a_{i}^{t},r_{i}^{t})\}_{t=1,2,...,T}italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≜ { ( italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_t = 1 , 2 , … , italic_T end_POSTSUBSCRIPT represents a dialogue containing T𝑇Titalic_T turns of interaction between agent and user. Each dialogue turn t𝑡titalic_t comprises user utterance uitsuperscriptsubscript𝑢𝑖𝑡u_{i}^{t}italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and corresponding agent response ritsuperscriptsubscript𝑟𝑖𝑡r_{i}^{t}italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. Additionally, we annotate the updated user state sitsuperscriptsubscript𝑠𝑖𝑡s_{i}^{t}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT resulting from the user utterance (e.g., the user is the cardholder) and the corresponding agent action aitsuperscriptsubscript𝑎𝑖𝑡a_{i}^{t}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT for response to this status (e.g., introduce activation activity).

The PCA comprises two tasks.

Task 1 is SOP Prediction, defined as

dir^=argmaxdir{,,,}p(dir|Ti,np,mq)^dirsubscriptdir𝑝conditionaldirsubscript𝑇𝑖subscript𝑛𝑝subscript𝑚𝑞\displaystyle\hat{\mathrm{dir}}=\mathop{\arg\max}\limits_{\mathrm{dir}\in\{% \text{--},\rightarrow,\leftarrow,\leftrightarrow\}}p(\mathrm{dir}|T_{i},n_{p},% m_{q})over^ start_ARG roman_dir end_ARG = start_BIGOP roman_arg roman_max end_BIGOP start_POSTSUBSCRIPT roman_dir ∈ { – , → , ← , ↔ } end_POSTSUBSCRIPT italic_p ( roman_dir | italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_m start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) (1)

which means that given i𝑖iitalic_ith dialogue’s task definition Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, the model needs to predict the constraint relationship Cisubscript𝐶𝑖C_{i}italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT between agent actions Aisubscript𝐴𝑖A_{i}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and user states Sisubscript𝑆𝑖S_{i}italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT by classifying the direction of any two nodes into four categories.

Task 2 is Dialogue Generation, defined as

s^it1=superscriptsubscript^𝑠𝑖𝑡1absent\displaystyle\hat{s}_{i}^{t-1}=over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT = argmaxsit1Sip(sit1|Ti,Hit1)subscriptsuperscriptsubscript𝑠𝑖𝑡1subscript𝑆𝑖𝑝conditionalsuperscriptsubscript𝑠𝑖𝑡1subscript𝑇𝑖superscriptsubscript𝐻𝑖𝑡1\displaystyle\mathop{\arg\max}\limits_{s_{i}^{t-1}\in S_{i}}p(s_{i}^{t-1}|T_{i% },H_{i}^{t-1})start_BIGOP roman_arg roman_max end_BIGOP start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ∈ italic_S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT | italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ) (2)
a^it=superscriptsubscript^𝑎𝑖𝑡absent\displaystyle\hat{a}_{i}^{t}=over^ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = argmaxaitAip(ait|Ti,Hit1,Ci,sit1)subscriptsuperscriptsubscript𝑎𝑖𝑡subscript𝐴𝑖𝑝conditionalsuperscriptsubscript𝑎𝑖𝑡subscript𝑇𝑖superscriptsubscript𝐻𝑖𝑡1subscript𝐶𝑖superscriptsubscript𝑠𝑖𝑡1\displaystyle\mathop{\arg\max}\limits_{a_{i}^{t}\in A_{i}}p(a_{i}^{t}|T_{i},H_% {i}^{t-1},C_{i},s_{i}^{t-1})start_BIGOP roman_arg roman_max end_BIGOP start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ∈ italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT | italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT , italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT ) (3)
r^it=superscriptsubscript^𝑟𝑖𝑡absent\displaystyle\hat{r}_{i}^{t}=over^ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = [w^1,w^2,w^n]w^nWsubscriptsubscript^𝑤1subscript^𝑤2subscript^𝑤𝑛subscript^𝑤𝑛𝑊\displaystyle[\hat{w}_{1},\hat{w}_{2}...,\hat{w}_{n}]_{\hat{w}_{n}\in W}[ over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT … , over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ italic_W end_POSTSUBSCRIPT (4)
w^n=subscript^𝑤𝑛absent\displaystyle\hat{w}_{n}=over^ start_ARG italic_w end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = p(wn|Ti,Ci,Hit1,ait,w<n)𝑝conditionalsubscript𝑤𝑛subscript𝑇𝑖subscript𝐶𝑖superscriptsubscript𝐻𝑖𝑡1superscriptsubscript𝑎𝑖𝑡subscript𝑤absent𝑛\displaystyle p(w_{n}|T_{i},C_{i},H_{i}^{t-1},a_{i}^{t},w_{<n})italic_p ( italic_w start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT | italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , italic_w start_POSTSUBSCRIPT < italic_n end_POSTSUBSCRIPT ) (5)

giving the i𝑖iitalic_i-th dialogue’s task definition Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, constraint relationship Cisubscript𝐶𝑖C_{i}italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, previous (t1)𝑡1(t-1)( italic_t - 1 ) turns’ history Hit1=Dik=1,,tsuperscriptsubscript𝐻𝑖𝑡1superscriptsubscript𝐷𝑖𝑘1𝑡H_{i}^{t-1}=D_{i}^{k=1,...,t}italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT = italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k = 1 , … , italic_t end_POSTSUPERSCRIPT and vocabulary W𝑊Witalic_W, the model first predicts the (t1)𝑡1(t-1)( italic_t - 1 )-th turn’s user state s^it1superscriptsubscript^𝑠𝑖𝑡1\hat{s}_{i}^{t-1}over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT and then selects the next turn’s agent action a^itsuperscriptsubscript^𝑎𝑖𝑡\hat{a}_{i}^{t}over^ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT from Aisubscript𝐴𝑖A_{i}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT that not only helps to achieve the goal but also follows the task constraint Cisubscript𝐶𝑖C_{i}italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, and generates the corresponding response r^itsuperscriptsubscript^𝑟𝑖𝑡\hat{r}_{i}^{t}over^ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT.

4 Dataset Construction

4.1 Dataset Curation Framework

Recent studies (Wang et al., 2023a; Sandler et al., 2024) have demonstrated that the deployment of LLMs in role-playing scenarios is effective for the development of high-quality dialogue datasets. To efficiently obtain high-quality data at a low cost, we devise a method where LLMs are utilized to simulate various agents’ roles in the creation of dialogue data, with human intervention for subsequent revisions. Figure 2 illustrates the whole framework, which involves a four-step process: defining the task, planning the task SOP, creating dialogue scenarios, and generating the actual dialogues.

In the first step, we design 53 unique tasks within 45 diverse domains, an example shown in Appendix A.2. In step 2 (i.e. Task 1), we implement prompting techniques that enable the LLMs to assume a planner to plan the SOP by the defined task. In Step 3, we first sample a dialogue path from the manually corrected SOP, which consists of a sequence of agent actions and user states, and ensure that the subsequent dialogue process satisfies the SOP process, thus ensuring controllability. However, this sequence lacks proactive actions typical in natural dialogues, such as answering questions and persuasion. We specify the LLM as a screenwriter, instructing it to insert additional agent actions and user states based on a given dialogue path, to create personalized and comprehensive dialogue scenarios. In the last step, We assigned the LLM as a scriptwriter to write dialogues for each agent action and user state in the sequence of dialogue paths, to create a complete conversation. In this step, we use a simulator to generate user profile for the task definition, ensuring that each conversation involves independent and non-repetitive users. This enhances the diversity and authenticity of the conversations. All prompts are detailed in Appendix A.2.

4.2 Data Quality Control

To ensure the quality and consistency of the PCA-D, we recruite 7 outstanding annotators. Before annotation, we developed a training manual111Training Manual: https://, organize annotation training, and conduct pre-annotations. During the annotation process, we sample the data in batches at a rate of 20% and assign it for cross-annotation by three annotators. Any instances with an inter-annotator agreement rate below 95% are removed.

4.3 Data Statistics

Table 1 presents a comparison between PCA-D and pertinent datasets in the field. To our knowledge, PCA-D is the inaugural dataset dedicated to controllable and proactive dialogues. Despite the larger volume of dialogues in previous proactive dialogue datasets, they are limited to the singular task of topic recommendation. In contrast, our dataset spans 45 domains and encompasses 53 distinct professional tasks, thereby offering greater versatility. In addition, previous models heavily rely on training data, and the high cost of data annotation hinders further research and application promotion. The advent of LLMs has greatly reduced reliance on training data (Kojima et al., 2022). Similar to the research of  Yu et al. (2023), our research focuses on evaluating the performance of unsupervised LLMs in-context-learning, to explore the prospects of low-cost, rapid application.

Table 2 summarizes the statistics of PCA-D. We can see that the scenes and dialogues generated have a manually corrected edit distance of only 3 to 6, which illustrates the high quality and efficiency of our framework. In addition, We conduct a human evaluation of data quality following previous research (Liu et al., 2020, 2021). Dialogues are rated as 1 if they followed the instructions in the task and the utterances were fluent and grammatical, otherwise 0. Three evaluators independently rated a random sample of 300 dialogues, resulting in an average score of 0.98 for PCA-D.

Refer to caption
Figure 2: Overview of our role-playing framework for dialogue dataset curation.
Dataset Participants Formed Targets PA CT LNG #Domains #Tasks #Dialogues
DSTC2 (Henderson et al., 2014) Crowd N/A EN 1 1 1,612
DSTC4 (Kim et al., 2017) Experts N/A EN 1 1 35
CrossWOZ (Zhu et al., 2020) Rules N/A CH 5 5 5,012
OTTers (Sevegnani et al., 2021) Crowd Topics EN Open 1 4,316
TOPDIAL (Wang et al., 2023a) LLM Act-topic pairs EN 3 1 9,939
PCA-D (ours) LLM,Experts Open Definition CH 45 53 3,114
Table 1: Comparison between PCA-D and other related datasets (PA: proactive, CT: controllable, LNG: language, CH: Chinese, EN: English).
Quantity Statistics
Total # domains/tasks 45/53
Total # SOP graph vertexes/Edges 899/1,058
Total # dialogues 3,114
Total # turns 23,897
Total # utterances 47,795
Total # tokens 119,5736
Total # goals 70
Avg. # turns per dialogue 7.67
Avg. # utterances per dialogue 15.34
Avg. # words per utterances 25.01
Quality Statistics
Avg. # scene(dialogue path) accuracy 94.23%
Avg. # edit distance of scene 0.41
Avg. # utterances accuracy 97.66%
Avg. # edit distance of utterances 0.34
Table 2: Statistics of the PCA-D. Edit distance is calculated between LLM pre-annotated and corresponding manually corrected.

5 Our Approach

Refer to caption
Figure 3: The framework of PCA: before dialogue, the tasks’ SOPs are planned by CoT offline planner; In dialogue, the best agent action is planned by MCTS online planner ground on SOP with the four phases in an iteration.

To implement the PCA system, we need to complete two tasks: 1. Design and implement a comprehensive framework for end-to-end multi-turn dialogue. 2. Implement the two algorithm tasks as described in Section 3: SOP prediction and dialogue generation.

5.1 PCA Framework

PCA framework, depicted in Figure 3, comprises a CA as controller and five scheduled components. Before the dialogue, CA reads user-configured task information and passes SOP vertexes to the offline planner to obtain an adjacency list of the SOP graph. During the dialogue, CA first invokes the working memory module to assemble the prompts required for the dialogue. Subsequently, CA inputs the prompt into the online planning module to obtain the dialogue response. The LLMs used by each module are uniformly managed and accessed in the LLMs module. Next, we describe the implementation algorithms for the offline and online planning modules.

5.2 SOP Prediction

We establish three effective baselines: adjacency list (AL), translation CoT (TCoT), and SFT. For AL, we directly prompt the LLMs to output an adjacency list in JSON format. For TCoT, we prompt the LLMs first to describe each vertex and its children vertexes in natural language along with the reasons, and then prompt the LLMs to translate this description into an adjacency list in JSON format. Finally, we implemented full-parameter SFT on different sizes of Llama and Qwen by the task of generating one vertex’s adjacency vertexes every time. Prompts’ detail is shown in Appendix A.2.

5.3 Dialogue Prediction

CoT and ToT (Wei et al., 2022; Yao et al., 2024) significantly improve LLM’s complex reasoning ability. Therefore, we use them as our baseline. In CoT, we prompt LLM to infer user state, agent actions, and response step by step. In ToT, these steps are split into 3 times calling LLM, with each step corresponding to a layer of tree nodes. We obtain multiple nodes through multiple sampling, and finally, prompt LLM vote to choose a decision path to obtain agent action and response. To add SOP, we calculate the edit distance (ED) between the generated dialogue path and each subpath of the SOP. The smallest ED subpath’s children were added to the prompt to guide the agent in adhering to the SOP. Hyperparameters and prompts are in Appendix A.3.
PCA-M is an extension of the MCTS with SOP for dialogue path online planning, designed to optimally leverage SOP’s constraint during the expansion and simulation phases of future dialogues. MCTS strategically explores the space of dialogue trees and strikes a proper balance between exploration and exploitation to find high-reward dialogue traces efficiently while satisfying constraints.

5.4 MCTS planner with SOP

Specifically, as shown in Figure 3, PCA-M constructs a dialogue tree, where each node represents a turn of dialogue state (i.e., working memory) generated by the agent choosing an action to interact with the user, and each edge represents an agent action and the transition from the current state to the next state after applying that action. Given a dialogue state s0subscript𝑠0s_{0}italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, PCA-M searches for the next best action by iteratively performing node selection, node expansion, dialogue simulation, and back-propagation to update the tree statistics. After n𝑛nitalic_n iterations, PCA-M predicts the next best action for s0subscript𝑠0s_{0}italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. This process continues until the specified computational budget is reached (e.g., number of iterations), at which point the resultant trajectory can be extracted from the tree. Prompts and algorithms about PCA-M are shown in Appendix A.3 and 1. Below we describe each stage of the algorithm.

Selection Starting from the root node (i.e., the initial state s0subscript𝑠0s_{0}italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT), at each level of the tree, we select a child node as the next node. This stage ends when a leaf node of the current tree is reached. PCA-M applies the Upper Confidence bounds applied to Trees (UCT) (Kocsis and Szepesvári, 2006) to select each child node to balance exploration (less visited nodes) and exploitation (high-value nodes):

a=argmaxaA(s)[Q(s,a)+wlnN(s)N(c(s,a))]superscript𝑎subscript𝑎𝐴𝑠𝑄𝑠𝑎𝑤𝑁𝑠𝑁𝑐𝑠𝑎a^{*}=\arg\max_{a\in A(s)}\left[Q(s,a)+w\sqrt{\frac{\ln N(s)}{N(c(s,a))}}\right]italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_arg roman_max start_POSTSUBSCRIPT italic_a ∈ italic_A ( italic_s ) end_POSTSUBSCRIPT [ italic_Q ( italic_s , italic_a ) + italic_w square-root start_ARG divide start_ARG roman_ln italic_N ( italic_s ) end_ARG start_ARG italic_N ( italic_c ( italic_s , italic_a ) ) end_ARG end_ARG ] (6)

where N(s)𝑁𝑠N(s)italic_N ( italic_s ) is the number of times node s𝑠sitalic_s has been visited in previous iterations, and c(s,a)𝑐𝑠𝑎c(s,a)italic_c ( italic_s , italic_a ) is the child node of applying a𝑎aitalic_a in state s𝑠sitalic_s. The less a child node has been visited (i.e., the more uncertain the child node is), the higher the second term. The state-action value function Q(s,a)𝑄𝑠𝑎Q(s,a)italic_Q ( italic_s , italic_a ) estimates the expected future reward of taking action a𝑎aitalic_a in the state s𝑠sitalic_s:

Q(s,a)=0.5×(LLM(s,a)+0.5×state(s,a))𝑄𝑠𝑎0.5LLM𝑠𝑎0.5state𝑠𝑎Q(s,a)=0.5\times\left(\mathrm{LLM}(s,a)+0.5\times\mathrm{state}(s,a)\right)italic_Q ( italic_s , italic_a ) = 0.5 × ( roman_LLM ( italic_s , italic_a ) + 0.5 × roman_state ( italic_s , italic_a ) ) (7)
0LLM(s,a)10LLM𝑠𝑎10\leq\mathrm{LLM}(s,a)\leq 1\\ 0 ≤ roman_LLM ( italic_s , italic_a ) ≤ 1 (8)
state(s,a)={0.3if s is a task terminal state0.7if s is a task success state0otherwisestate𝑠𝑎cases0.3if 𝑠 is a task terminal state0.7if 𝑠 is a task success state0otherwise\mathrm{state}(s,a)=\begin{cases}0.3&\text{if }s\text{ is a task terminal % state}\\ 0.7&\text{if }s\text{ is a task success state}\\ 0&\text{otherwise}\end{cases}roman_state ( italic_s , italic_a ) = { start_ROW start_CELL 0.3 end_CELL start_CELL if italic_s is a task terminal state end_CELL end_ROW start_ROW start_CELL 0.7 end_CELL start_CELL if italic_s is a task success state end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL otherwise end_CELL end_ROW (9)

where LLM(s,a)LLM𝑠𝑎\mathrm{LLM}(s,a)roman_LLM ( italic_s , italic_a ) function represents the logical rationality of the current action, as assessed by the LLMs. It is computed as the mean of several binary (0 or 1) evaluations derived from prompt-based sampling by the LLMs. The state(s,a)state𝑠𝑎\mathrm{state}(s,a)roman_state ( italic_s , italic_a ) function assigns discrete values to measure task completion: 0.3 for the defined termination state, and 0.7 for the success state.

Expansion When the selected leaf node is not a terminal one, we employ the LLM as a dialogue action prior, using prompt to sample d𝑑ditalic_d possible dialogue actions for expansion. In addition, PCA-M first searches for a local subgraph in the SOP constraint graph based on the dialogue path, and adds the next two levels of child nodes pointed to by the current state node for expansion. This not only utilizes the constraint relations but also retains the autonomy of the LLM. When the selected leaf node is already a terminal node (either a dialogue end node or the maximum search depth has been reached), we will skip the expansion phase and proceed to back-propagation.

Simulation To estimate the reward generated by future dialogue, this phase involves simulating, also known as roll-out, future dialogue for each expanded state node. To improve efficiency, we reduce the randomness of the simulation. Specifically, we follow a process similar to the expansion phase mentioned above, that is, we only simulate downward for candidate dialogue policies that are sampled from LLM and guided by the SOP.

Backpropagation Once we reach the terminal state in the aforementioned stage, we use the UCT formula to retroactively update the UCT values along the entire path. When the predetermined total number of MCTS iterations is reached, we terminate the algorithm. Finally, in the constructed tree, we selected the current node’s child with the highest UCT value to guide the next turn of conversation.

6 Experiments

Train Valid Test
SOP Prediction #tasks 31 5 17
SOP Prediction #examples 539 90 270
Dialogue Generation #dialogues 1,859 309 931
Dialogue Generation #turns 10,416 1,762 5,288
Table 3: Statistics of the dataset split from the task dimension. The 5-fold cross-validation is employed for SOP Prediction.

6.1 Data Split

The dataset split is shown in Table 3. To assess the generalization ability of SFT on unseen tasks, we split the data from the task dimension, not the dialogue to ensure no task overlap among the three sets. Due to the limited number of samples for task 1, we employed 5-fold cross-validation in the experiments.

6.2 Automatic Evaluations

Task 1 Metrics We evaluated the predicted SOP from two perspectives: the SOP graph structure and the usability for dialogues. We used the graph edit distance (GED) function from Python library networkx to calculate the number of operations as well as the ratio (GEDR) required to manually edit the predicted SOP into the ground-truth SOP. For the dialogue dimension, we evaluated the precision, recall, and F1 score of the dialogue paths between predict and ground-truth SOP. The dialogue path is all traversable paths from the starting vertex to the end vertex, and the loops in the path are only iterated once.

Task 2 Metrics We calculated the accuracy of single-turn dialogue actions (Acc T), SOP actions (Acc C), and proactive actions (Acc P), to measure the controllability and proactive effects of the dialogue. The accuracy of the entire dialogue (acc D) is computed as the number of correct dialogues in each turn divided by the total. We employed BLEU to evaluate the relevance between the generated responses and ground truth.

Model Method Path P/R/F1↑ GED/GEDR↓
GPT-4o AL 78.11/66.51/71.85 2.01/5.40
TCoT 69.20/73.34/71.22 4.70/12.46
GPT-3.5-turbo AL 41.39/21.94/28.68 7.61/22.48
TCoT 50.95/50.03/50.48 6.91/18.28
Qwen1.5-14b-chat AL 42.84/37.81/40.17 10.23/25.77
TCoT 38.79/42.69/40.65 11.36/26.91
SFT 68.58/62.73/65.52 3.94/10.30
Qwen1.5-72b-chat AL 46.74/31.57/37.68 10.75/28.32
TCoT 48.29/51.94/50.04 7.38/18.72
SFT 80.25/74.01/77.00 2.86/7.12
Llama3-8b-chat AL 35.19/28.04/31.21 8.31/24.05
TCoT 44.5/40.35/42.32 10.92/30.10
SFT 72.19/68.33/70.21 3.54/9.33
Llama3-70b-chat AL 64.14/56.05/59.82 3.85/10.72
TCoT 60.81/67.34/63.91 5.43/14.38
SFT 74.10/73.03/73.56 2.81/7.27
Table 4: Automatic evaluation results of SOP prediction.
Model Method Acc T/C/P/D↑ Bleu 2/4↑
GPT-4o CoT 69.64/74.34/51.30/22.72 27.26/16.04
CoT+SOP 82.09/86.73/57.14/43.39 29.30/17.89
MCTS+SOP 86.37/91.19/60.42/46.29 31.56/18.43
GPT-3.5-turbo CoT 48.57/55.75/18.52/4.83 22.09/10.56
CoT+SOP 63.24/69.49/22.22/28.57 23.36/11.81
MCTS+SOP 76.52/80.42/33.10/38.45 25.17/12.54
Qwen1.5-14b-chat CoT 50.47/55.75/25.93/2.15 17.21/5.85
CoT+SFT 78.93/85.84/50.36/26.75 41.56/26.88
CoT+SOP 55.64/61.57/25.67/2.69 17.88/7.30
MCTS+SOP 61.76/67.11/28.75/3.33 18.43/8.66
Llama3-8b-chat CoT 31.79/36.73/11.11/0.32 16.01/7.07
CoT+SFT 77.86/84.96/48.15/30.08 42.76/28.18
CoT+SOP 38.72/43.28/17.14/2.26 18.81/8.51
MCTS+SOP 46.85/56.30/22.31/3.76 19.47/9.34
Qwen1.5-72b-chat CoT 68.57/73.89/46.3/12.57 23.75/10.69
CoT+SFT 82.5/90.27/50.46/33.19 44.03/29.97
CoT+SOP 74.25/79.65/45.24/30.72 21.2/9.05
ToT 69.03/75.06/47.82/27.07 23.28/11.02
ToT+SOP 76.55/81.47/45.75/28.89 23.92/11.59
MCTS 70.43/75.37/46.79/28.79 23.82/10.76
MCTS+SOP 77.83/83.54/47.32/29.32 24.48/12.03
Llama3-70b-chat CoT 65.43/72.52/44.22/11.71 22.59/12.59
CoT+SFT 83.57/89.82/57.41/35.98 47.03/31.65
CoT+SOP 52.24/54.42/40.48/12.24 25.98/14.61
ToT 68.53/74.02/44.51/27.18 22.68/12.81
ToT+SOP 70.40/78.24/44.39/27.82 22.81/12.93
MCTS 70.10/75.79/45.17/28.46 23.53/13.29
MCTS+SOP 78.35/82.86/49.18/31.87 24.06/14.13
Table 5: Automatic evaluation results of dialogue generation. T, C, P, and D denote turn, controllable, proactive, and dialogue, respectively.

Task 1 Results Table 4 reports the results for the prediction of SOPs. We can find that with the powerful reasoning capabilities of GPT-4o, the availability of guided dialogue has reached 71.85, and the required manual correction proportion is only 5.4%. Secondly, our proposed TCoT method can effectively improve poorer models’ performance by 10 to 20 points. Thirdly, SFT based on PCA-D significantly enhances the performance on unseen tasks, comparable to the strongest GPT-4o. This suggests that our dataset effectively enhances the ability of LLM to plan SOP. The rapid development of LLM will enable the direct use of the planned SOP in conversations without human intervention.

Model Method Turn-level results Dialogue-level results
Controllability Knowledge Proactivity Goal succ. rate Coherence Tokens
GPT-4o CoT 75.41 100.00 53.83 28.00 67.34 2.14k
CoT+SOP 88.67 100.00 59.07 42.00 74.38 2.21k
MCTS+SOP 94.72 100.00 64.33 58.00 80.36 18.62k
Llama3-70b-chat CoT+SFT 90.49 100.00 60.17 46.00 75.31 1.93k
CoT+SOP 55.61 97.36 47.05 22.00 38.62 2.63k
ToT 75.50 97.28 42.87 35.00 63.14 32.33k
ToT+SOP 82.36 97.15 46.36 39.00 64.32 32.41k
MCTS+SOP 84.68 97.21 50.33 41.00 64.28 21.70k
Table 6: Human evaluation results. Tokens are the average of turn, measured using tiktoken.
Refer to caption
Figure 4: Examples of the selected action and generated response by different methods with GPT-3.5-turbo.

Task 2 Results were shown in Table 5, we can observe that adding SOP guidance can effectively improve the controllability and success rate of dialogue for any model and basic method. In addition, the SFT results show a marked performance improvement, approaching that of the commercial GPT-4o, which indicates the effectiveness of PCA-D in enhancing the model’s conversational abilities. Thirdly, the tree search algorithm based on MCTS is more effective than ToT, and can further improve performance. Especially in open-source models, it can exceed the CoT baseline of GPT-4o without SFT.

6.3 Human Evaluations

Metrics Considering that is difficult to unify a task’s SOPs formulated by different users and is also language diversity in response, we manually evaluated the accuracy of controllable, knowledge correctness, and proactivity from a single-turn dialogue perspective which is the proportion of the total number of these kinds of responses respectively. Furthermore, from a dialogue perspective, we manually evaluated the task completion rate and average coherence score (0 to 1) to measure end-to-end task performance. Finally, we organized a team of 4 evaluators to manually assess the dialogues of different models in 100 different dialogue scenarios. Results are shown in Table 6, from which we see that SOP guidance can significantly improve controllability, while bringing better dialogue proactive, improving task success rate and overall logical coherence. The MCTS based on SOP can further improve the overall performance. Furthermore, the model obtained from SFT on PCA-D successfully generalized the dialogue ability to real-world testing. Compared to CoT, MCTS spent 8 to 9 times more tokens and improved 0.38 times goal success, but the cost-effectiveness was still better than ToT. Meanwhile, we note that due to the provision of task-related knowledge, LLMs rarely produce knowledge errors or hallucinations, which provides hope for enterprise applications.

6.4 Case Study

Figure 4 shows the dialogue action selected and the responses generated by different models when given the same context. As shown in the first case, before checking in, it is necessary to verify identity. After SOP was applied, both CoT and MCTS chose an action that complied with the SOP, making the dialogue more accurate and controllable. However, when the best action was not in SOP (e.g., try to persuade), as shown in the right case, CoT+SOP did not choose to continue proactively persuading. In contrast, MCTS+SOP, through simulating and searching deeper dialogue paths, selected a more goal-favorable persuasion attempt action. This also reflects the planning nature of the MCTS algorithm.

7 Conclusion

In this work, we have explored a new CAs framework and created a high-quality, controllable dialogue dataset. In addition, we proposed an unsupervised algorithm that encodes SOP constraints into the expansion and simulation steps of Monte Carlo Tree Search. Benefiting from our approach developed based on the advanced large language models’ in-context learning, it holds promise to become the next generation of enterprise-level dialogue systems with low cost and high controllability.

Limitations

Hallucinations Our approach is based on the context learning of LLMs, such as ChatGPT and GPT-4. As LLMs may produce outputs containing hallucinations (Bang et al., 2023), our system might provide information beyond the task definition. We intend to enhance the veracity of responses through post-processing steps, such as training a dedicated safety model and incorporating checks and revisions into the post-processing phase.

Runtime One significant limitation of PCA-M is the runtime. The more exhaustive the tree search is (e.g. increasing n or k), the more likely the algorithm is to find the optimal dialogue policy. However, this comes at the cost of longer simulation times, which may impact the overall user experience. We believe that parallelizing the tree search or reusing portions of the simulated subtrees could help to speed up the runtime. We anticipate that with the advancement of LLMs research, the speed of inference will continue to improve.

Ethics Statement

Given the independent behavior of agents in goal-oriented dialogue, it’s imperative to scrutinize ethical implications. Our approach does not force the agent to achieve a specified goal, nor does it force the user to accept the agent’s request. Instead, our work highlights the criticality of directing agents to adhere to human-defined limitations. While our measures are potent, we advocate for the stringent regulation of goal signals, particularly when implementing goal-oriented dialogue systems in specialized fields. Currently, the targeting process must uphold factual accuracy, respect user privacy norms, and comply with societal laws.

References

  • Akyar (2012) Isin Akyar. 2012. Standard operating procedures (what are they good for?). Latest research into quality control, 12:367–91.
  • Anantha et al. (2021) Raviteja Anantha, Svitlana Vakulenko, Zhucheng Tu, Shayne Longpre, Stephen Pulman, and Srinivas Chappidi. 2021. Open-domain question answering goes conversational via question rewriting. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 520–534, Online. Association for Computational Linguistics.
  • Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. ArXiv preprint, abs/2302.04023.
  • Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium. Association for Computational Linguistics.
  • Chawla et al. (2021) Kushal Chawla, Jaysa Ramirez, Rene Clever, Gale Lucas, Jonathan May, and Jonathan Gratch. 2021. CaSiNo: A corpus of campsite negotiation dialogues for automatic negotiation systems. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3167–3185, Online. Association for Computational Linguistics.
  • Gilardi et al. (2023) Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. Chatgpt outperforms crowd-workers for text-annotation tasks. ArXiv preprint, abs/2303.15056.
  • Hao et al. (2023) Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. ArXiv preprint, abs/2305.14992.
  • Henderson et al. (2014) Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272, Philadelphia, PA, U.S.A. Association for Computational Linguistics.
  • Kim et al. (2017) Seokhwan Kim, Luis Fernando D’Haro, Rafael E Banchs, Jason D Williams, and Matthew Henderson. 2017. The fourth dialog state tracking challenge. Dialogues with Social Robots: Enablements, Analyses, and Evaluation, pages 435–449.
  • Kishinami et al. (2022) Yosuke Kishinami, Reina Akama, Shiki Sato, Ryoko Tokuhisa, Jun Suzuki, and Kentaro Inui. 2022. Target-guided open-domain conversation planning. In Proceedings of the 29th International Conference on Computational Linguistics, pages 660–668, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  • Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. 2006. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
  • Li et al. (2017) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.
  • Liu et al. (2021) Zeming Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, and Wanxiang Che. 2021. DuRecDial 2.0: A bilingual parallel corpus for conversational recommendation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4335–4347, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Liu et al. (2020) Zeming Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, and Ting Liu. 2020. Towards conversational recommendation over multi-type dialogs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1036–1049, Online. Association for Computational Linguistics.
  • Moradshahi et al. (2023) Mehrad Moradshahi, Tianhao Shen, Kalika Bali, Monojit Choudhury, Gaël de Chalendar, Anmol Goel, Sungkyun Kim, Prashant Kodali, Ponnurangam Kumaraguru, Nasredine Semmar, et al. 2023. X-risawoz: High-quality end-to-end multilingual dialogue datasets and few-shot agents. ArXiv preprint, abs/2306.17674.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  • Pan et al. (2023) Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Hanlin Zhang, Scott Emmons, and Dan Hendrycks. 2023. Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark. In International Conference on Machine Learning, pages 26837–26867. PMLR.
  • Paul et al. (2023) Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. 2023. Refiner: Reasoning feedback on intermediate representations. ArXiv preprint, abs/2304.01904.
  • Quan et al. (2020) Jun Quan, Shian Zhang, Qian Cao, Zizhong Li, and Deyi Xiong. 2020. RiSAWOZ: A large-scale multi-domain Wizard-of-Oz dataset with rich semantic annotations for task-oriented dialogue modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 930–940, Online. Association for Computational Linguistics.
  • Roller et al. (2021) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association for Computational Linguistics.
  • Sandler et al. (2024) Morgan Sandler, Hyesun Choung, Arun Ross, and Prabu David. 2024. A linguistic comparison between human and chatgpt-generated conversations. ArXiv preprint, abs/2401.16587.
  • Sevegnani et al. (2021) Karin Sevegnani, David M. Howcroft, Ioannis Konstas, and Verena Rieser. 2021. OTTers: One-turn topic transitions for open-domain dialogue. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2492–2504, Online. Association for Computational Linguistics.
  • Shinn et al. (2023) Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: an autonomous agent with dynamic memory and self-reflection. ArXiv preprint, abs/2303.11366.
  • Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of go without human knowledge. nature, 550(7676):354–359.
  • Singhal et al. (2023) Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. 2023. Towards expert-level medical question answering with large language models. ArXiv preprint, abs/2305.09617.
  • Sun et al. (2022) Hao Sun, Guangxuan Xu, Jiawen Deng, Jiale Cheng, Chujie Zheng, Hao Zhou, Nanyun Peng, Xiaoyan Zhu, and Minlie Huang. 2022. On the safety of conversational models: Taxonomy, dataset, and benchmark. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3906–3923, Dublin, Ireland. Association for Computational Linguistics.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
  • Tang et al. (2019) Jianheng Tang, Tiancheng Zhao, Chenyan Xiong, Xiaodan Liang, Eric Xing, and Zhiting Hu. 2019. Target-guided open-domain conversation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5624–5634, Florence, Italy. Association for Computational Linguistics.
  • Wang et al. (2023a) Jian Wang, Yi Cheng, Dongding Lin, Chak Tou Leong, and Wenjie Li. 2023a. Target-oriented proactive dialogue systems with personalization: Problem formulation and dataset curation. ArXiv preprint, abs/2310.07397.
  • Wang et al. (2022a) Jian Wang, Dongding Lin, and Wenjie Li. 2022a. Follow me: Conversation planning for target-driven recommendation dialogue systems. ArXiv preprint, abs/2208.03516.
  • Wang et al. (2019) Xuewei Wang, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and Zhou Yu. 2019. Persuasion for good: Towards a personalized persuasive dialogue system for social good. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5635–5649, Florence, Italy. Association for Computational Linguistics.
  • Wang et al. (2022b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022b. Self-consistency improves chain of thought reasoning in language models. ArXiv preprint, abs/2203.11171.
  • Wang et al. (2023b) Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. 2023b. How far can camels go? exploring the state of instruction tuning on open resources. ArXiv preprint, abs/2306.04751.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  • Welleck et al. (2022) Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. 2022. Generating sequences by learning to self-correct. ArXiv preprint, abs/2211.00053.
  • Wu et al. (2019) Wenquan Wu, Zhen Guo, Xiangyang Zhou, Hua Wu, Xiyuan Zhang, Rongzhong Lian, and Haifeng Wang. 2019. Proactive human-machine conversation with explicit conversation goals. ArXiv preprint, abs/1906.05572.
  • Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models, may 2023. ArXiv preprint, abs/2305.10601.
  • Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.
  • Yu et al. (2023) Xiao Yu, Maximillian Chen, and Zhou Yu. 2023. Prompt-based monte-carlo tree search for goal-oriented dialogue policy planning. ArXiv preprint, abs/2305.13660.
  • Zhang et al. (2021) Jun Zhang, Yan Yang, Chencai Chen, Liang He, and Zhou Yu. 2021. KERS: A knowledge-enhanced framework for recommendation dialog systems with multiple subgoals. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1092–1101, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. ArXiv preprint, abs/2205.10625.
  • Zhou et al. (2023) Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, et al. 2023. Agents: An open-source framework for autonomous language agents. ArXiv preprint, abs/2309.07870.
  • Zhu et al. (2020) Qi Zhu, Kaili Huang, Zheng Zhang, Xiaoyan Zhu, and Minlie Huang. 2020. CrossWOZ: A large-scale Chinese cross-domain task-oriented dialogue dataset. Transactions of the Association for Computational Linguistics, 8:281–295.
  • Zhu et al. (2022) Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. 2022. Solving math word problem via cooperative reasoning induced language models. ArXiv preprint, abs/2210.16257.
  • Zhuang et al. (2024) Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. 2024. Toolqa: A dataset for llm question answering with external tools. Advances in Neural Information Processing Systems, 36.
Refer to caption
Figure 5: An example of SOP definition.

Appendix A Appendix

A.1 Task Definition and Example

A task definition for dialogue should include the definitions of fields user_profile𝑢𝑠𝑒𝑟_𝑝𝑟𝑜𝑓𝑖𝑙𝑒user\_profileitalic_u italic_s italic_e italic_r _ italic_p italic_r italic_o italic_f italic_i italic_l italic_e, conversation_profile𝑐𝑜𝑛𝑣𝑒𝑟𝑠𝑎𝑡𝑖𝑜𝑛_𝑝𝑟𝑜𝑓𝑖𝑙𝑒conversation\_profileitalic_c italic_o italic_n italic_v italic_e italic_r italic_s italic_a italic_t italic_i italic_o italic_n _ italic_p italic_r italic_o italic_f italic_i italic_l italic_e, agent_action𝑎𝑔𝑒𝑛𝑡_𝑎𝑐𝑡𝑖𝑜𝑛agent\_actionitalic_a italic_g italic_e italic_n italic_t _ italic_a italic_c italic_t italic_i italic_o italic_n, user_state𝑢𝑠𝑒𝑟_𝑠𝑡𝑎𝑡𝑒user\_stateitalic_u italic_s italic_e italic_r _ italic_s italic_t italic_a italic_t italic_e and sop𝑠𝑜𝑝sopitalic_s italic_o italic_p. The definitions of these fields are as follows:

  • user_profile𝑢𝑠𝑒𝑟_𝑝𝑟𝑜𝑓𝑖𝑙𝑒user\_profileitalic_u italic_s italic_e italic_r _ italic_p italic_r italic_o italic_f italic_i italic_l italic_e: The information about user business and personal information held by agents generally comes from the company’s user management system. This information is used for identity verification or providing personalized services in conversations.

  • conversation_profile𝑐𝑜𝑛𝑣𝑒𝑟𝑠𝑎𝑡𝑖𝑜𝑛_𝑝𝑟𝑜𝑓𝑖𝑙𝑒conversation\_profileitalic_c italic_o italic_n italic_v italic_e italic_r italic_s italic_a italic_t italic_i italic_o italic_n _ italic_p italic_r italic_o italic_f italic_i italic_l italic_e: Task information for providing business content, accomplishment goals, and relevant background knowledge to Agents.

  • agent_action𝑎𝑔𝑒𝑛𝑡_𝑎𝑐𝑡𝑖𝑜𝑛agent\_actionitalic_a italic_g italic_e italic_n italic_t _ italic_a italic_c italic_t italic_i italic_o italic_n: Summary of key actions to be carried out during the process of agent dialogue, intended to guide and constrain the content of the agent’s dialogue in accordance with business regulations. Typically corresponds to the node names in the SOP or the strategy names of proactive dialogue.

  • user_state𝑢𝑠𝑒𝑟_𝑠𝑡𝑎𝑡𝑒user\_stateitalic_u italic_s italic_e italic_r _ italic_s italic_t italic_a italic_t italic_e: The status of the task summary from the user’s final response combined with the preceding dialogue, serves as a prompt for the agent to select the optimal next action. This typically corresponds to the user node in the SOP or the proactive dialogue state of the user.

  • sop𝑠𝑜𝑝sopitalic_s italic_o italic_p: Standardized operating procedures established by business experts to standardize the business processing process. SOP usually includes key nodes and sequential dependencies between nodes.

An example of SOP definition is shown in 5. An example of task definition is shown in 1.

A.2 Dataset Curation Details

The prompt for step 2 (i.e. Task 1) is shown in Table 7. The prompt for step 3 is shown in Table 8. The prompt for step 4 is shown in Table 9. The prompt for TCoT is shown in Table 16 and 17. Picture 5 is an example of a SOP definition.

A.3 Experiment Details

All open-source models’ experiments were completed on 4 Nvidia A800 GPUs. For SFT, we fine-tuned all parameters useing 5 epochs, 50 warm steps, and 1e-6 learning rate. In the experiment of task 1, the temperature was uniformly set to 0.1 and the top-p was set to 0.1. For task 2, the temperature was uniformly set to 1 and the top-p was set to 0.95. The prompt for CoT and CoT+SOP is shown in Table 14 and 15.

In MCTS, the number of generated actions d𝑑ditalic_d is set to 3, depth limit L𝐿Litalic_L is set to 8, the number of roll-outs N𝑁Nitalic_N is set to 3, and exploration weight w𝑤witalic_w is set to 1. The prompt for PCA-M sample action is shown in Table 10. The prompt for PCA-M generates response is shown in Table 11. The prompt for the PCA-M reward function is shown in Table 12. The prompt for PCA-M predicts the user state is shown in Table 13.

To maintain a search scale similar to MCTS, in ToT, the number of generated actions d𝑑ditalic_d is set to 3, depth limit L𝐿Litalic_L is set to 8. The prompt for ToT sample action is shown in Table 10. The prompt for ToT generates response is shown in Table 11. The prompt for ToT predicts the user state is shown in Table 13.

Listing 1: An Example of Task Definition
{
"user_profile": {
"Name": "LiZhenghao",
"Title": "Mr.",
"customer_type": "LargeDeposit",
"Age": "30",
"Occupation": "ExecutiveofaListedCompany",
"Profile": ""
},
"conversation_profile": {
"agent_identity": "CustomerServiceofZhonglianBankWealthCenter",
"agent_goal": "Invitetheusertoagolfexperienceevent",
"success_mark": [
"Agent.InformBookingSuccess"
],
"event_time": "1stto31stofthecurrentmonth",
"event_location": "ShenzhenGolfClub",
"participation_times": "Onlyonce",
"maximum_registration_number": "3",
"event_cost": "Free",
"event_target": "PremiumCustomers",
"other_knowledge": "EventNotificationDocument.RespectfulBankCustomers,Wesincerelyinviteyoutoparticipateintheupcominggolfexperienceevent.Herearethedetailedinformationoftheevent:DateandTime:Theeventwillbeheldfromthe1sttothe31stofthismonth,from9amto5pm.Location:TheeventwillbeheldatShenzhenGolfClub,withthespecificaddressbeingNo.7890ShennanAvenue,FutianDistrict,Shenzhen,GuangdongProvince.ParticipationConditions:Allofourbank'slargedepositcustomerscanparticipateinthiseventforfree,withouttheneedforgolfskills.EquipmentProvided:Noneedtobringyourowngolfequipment,theorganizerwillprovidenecessaryequipmentforallparticipants.EventProcess:Theeventofthedayincludesgolfexperience,beginnerteaching,freepractice,lunch,andleisurecommunication,etc.CateringServices:Wewillprovidelunchandall-dayteaservice.RegistrationMethod:Youcanregisterbyphone,email,orinpersonatanyofourbranches.ParticipantLimit:Duetovenuerestrictions,themaximumnumberofregistrantsis3people.RegistrationDeadline:Pleasecompletetheregistrationbeforethe28thofthismonth.DressCode:Itisrecommendedtowearcasualsportswear,suchassportsshoes,comfortabletops,andtrousers.Insurance:Activityinsuranceispreparedforeachparticipant.WeatherResponse:Ifthereisbadweather,theeventdateorprocessmaybechanged,pleasepayattentiontoournotice.AgeandHealthRestrictions:Participantsmustbeover18yearsoldandhavenoseriousheartdiseaseandotherhealthproblems.TransportationServices:Wewillsetupacollectionpointinthecitycentertoprovidefreepick-upservice.Follow-upActivities:Participantswillbeinvitedtojoinourgolfenthusiastclubtoenjoyrelatedactivitiesinthefuture.ProfessionalCoach:Wewillinviteprofessionalgolfcoachestoprovideguidanceforbeginners.SafetyNotice:Safetynoticeswillbedetailedduringtheevent,includingswingsafety,coursebehaviornorms,etc.Welookforwardtoyourparticipationandareconfidentthatthiswillbeapleasantandmemorableexperience.Ifyouhaveanyquestions,pleasefeelfreetocontactusatanytime.Bestregards,ZhonglianBankWealthCenter"
},
"agent_action": [
"Start",
"VerifyIdentity",
"PoliteEnd",
"InviteToGolfExperienceEvent",
"InquireAboutParticipationNumberOrTime",
"InformBookingSuccess",
"Greeting",
"EmpathizeAndSoothe",
"EstablishTrust",
"RelieveDoubts",
"AttemptPersuasion",
"Chat",
"Thank",
"OtherActions"
],
"user_state": [
"NotThemselves",
"IsThemselves",
"ClearAgreement",
"Inconvenient",
"OnlyProvideParticipationNumberOrTime",
"ProvidedParticipationNumberAndTime",
"RefuseToAnswer",
"Greeting",
"HabitualResponseAndContinue",
"DoNotUnderstand",
"WorryAndDoubt",
"Complaint",
"Impoliteness",
"NotInterested",
"DelayDecision",
"Chat",
"OtherIntentions",
"Thank",
"Ending"
],
"sop": {
"vertex": [
"Agent.Start",
"Agent.VerifyIdentity",
"User.IsThemselves",
"Agent.PoliteEnd",
"User.NotThemselves",
"Agent.InviteToGolfExperienceEvent",
"User.ClearAgreement",
"Agent.InquireAboutParticipationNumberOrTime",
"User.Inconvenient",
"User.OnlyProvideParticipationNumberOrTime",
"User.ProvidedParticipationNumberAndTime",
"Agent.InformBookingSuccess",
"User.RefuseToAnswer"
],
"adjacency_list": {
"Agent.Start": [
"Agent.VerifyIdentity"
],
"Agent.VerifyIdentity": [
"User.NotThemselves",
"User.IsThemselves"
],
"User.NotThemselves": [
"Agent.PoliteEnd"
],
"Agent.PoliteEnd": [],
"User.IsThemselves": [
"Agent.InviteToGolfExperienceEvent"
],
"Agent.InviteToGolfExperienceEvent": [
"User.Inconvenient",
"User.ClearAgreement"
],
"User.ClearAgreement": [
"Agent.InquireAboutParticipationNumberOrTime"
],
"Agent.InquireAboutParticipationNumberOrTime": [
"User.OnlyProvideParticipationNumberOrTime",
"User.ProvidedParticipationNumberAndTime",
"User.RefuseToAnswer"
],
"User.Inconvenient": [
"Agent.PoliteEnd"
],
"User.OnlyProvideParticipationNumberOrTime": [
"Agent.InquireAboutParticipationNumberOrTime"
],
"User.ProvidedParticipationNumberAndTime": [
"Agent.InformBookingSuccess"
],
"Agent.InformBookingSuccess": [
"Agent.PoliteEnd"
],
"User.RefuseToAnswer": [
"Agent.PoliteEnd"
]
}
},
"a_id": "06a14",
"domain": "bank",
"task": "golf_invitation"
}

A.4 Algorithm Details

Algorithm 1 PCA-M: Planning for Conversational Agents with MCTS
1:Initial state s0subscript𝑠0s_{0}italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, state transition probability function pθsubscript𝑝𝜃p_{\theta}italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT, reward function rθsubscript𝑟𝜃r_{\theta}italic_r start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT, action generator pϕsubscript𝑝italic-ϕp_{\phi}italic_p start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT
2:Number of expand actions d𝑑ditalic_d, depth limit L𝐿Litalic_L, number of roll-outs N𝑁Nitalic_N, and exploration weight w𝑤witalic_w
3:Initialize memory of actions P:SP:𝑃𝑆𝑃P:S\to Pitalic_P : italic_S → italic_P, children c:S×PS:𝑐𝑆𝑃𝑆c:S\times P\to Sitalic_c : italic_S × italic_P → italic_S and rewards r:S×P:𝑟𝑆𝑃r:S\times P\to\mathbb{R}italic_r : italic_S × italic_P → blackboard_R
4:Initialize the state-action value function Q:S×P:𝑄𝑆𝑃Q:S\times P\to\mathbb{R}italic_Q : italic_S × italic_P → blackboard_R and visit counter N:S:𝑁𝑆N:S\to\mathbb{N}italic_N : italic_S → blackboard_N
5:for k0,,N1𝑘0𝑁1k\leftarrow 0,\dots,N-1italic_k ← 0 , … , italic_N - 1 do
6:     t0𝑡0t\leftarrow 0italic_t ← 0
7:     while N(st)>0𝑁subscript𝑠𝑡0N(s_{t})>0italic_N ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) > 0 do \triangleright {Selection}
8:         N(st)N(st)+1𝑁subscript𝑠𝑡𝑁subscript𝑠𝑡1N(s_{t})\leftarrow N(s_{t})+1italic_N ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ← italic_N ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) + 1
9:         ptarg maxpP(st)[Q(st,p)+wlnN(st)N(c(st,p))]subscript𝑝𝑡subscriptarg max𝑝𝑃subscript𝑠𝑡delimited-[]𝑄subscript𝑠𝑡𝑝𝑤𝑁subscript𝑠𝑡𝑁𝑐subscript𝑠𝑡𝑝p_{t}\leftarrow\text{arg max}_{p\in P(s_{t})}\left[Q(s_{t},p)+w\sqrt{\frac{\ln N% (s_{t})}{N(c(s_{t},p))}}\right]italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← arg max start_POSTSUBSCRIPT italic_p ∈ italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT [ italic_Q ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_p ) + italic_w square-root start_ARG divide start_ARG roman_ln italic_N ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG italic_N ( italic_c ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_p ) ) end_ARG end_ARG ]
10:         rt=r(st,pt),st+1c(st,pt)formulae-sequencesubscript𝑟𝑡𝑟subscript𝑠𝑡subscript𝑝𝑡subscript𝑠𝑡1𝑐subscript𝑠𝑡subscript𝑝𝑡r_{t}=r(s_{t},p_{t}),s_{t+1}\leftarrow c(s_{t},p_{t})italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_r ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← italic_c ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
11:         tt+1𝑡𝑡1t\leftarrow t+1italic_t ← italic_t + 1
12:     end while
13:     while stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is not a terminal state tL𝑡𝐿\wedge t\leq L∧ italic_t ≤ italic_L do \triangleright Expansion
14:         for i1,,d𝑖1𝑑i\leftarrow 1,\dots,ditalic_i ← 1 , … , italic_d do
15:              Sample pt(i)pϕ(p|st),st+1(i)pθ(st,pt(i)),rt(i)rθ(st,pt(i))formulae-sequencesimilar-tosuperscriptsubscript𝑝𝑡𝑖subscript𝑝italic-ϕconditional𝑝subscript𝑠𝑡formulae-sequencesimilar-tosuperscriptsubscript𝑠𝑡1𝑖subscript𝑝𝜃subscript𝑠𝑡superscriptsubscript𝑝𝑡𝑖similar-tosuperscriptsubscript𝑟𝑡𝑖subscript𝑟𝜃subscript𝑠𝑡superscriptsubscript𝑝𝑡𝑖p_{t}^{(i)}\sim p_{\phi}(p|s_{t}),s_{t+1}^{(i)}\sim p_{\theta}(s_{t},p_{t}^{(i% )}),r_{t}^{(i)}\sim r_{\theta}(s_{t},p_{t}^{(i)})italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∼ italic_p start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( italic_p | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∼ italic_p start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) , italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∼ italic_r start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT )
16:              Update P(st){pt(i)}i=1d,c(st,pt(i))st+1(i),r(st,pt)rt(i)formulae-sequence𝑃subscript𝑠𝑡superscriptsubscriptsuperscriptsubscript𝑝𝑡𝑖𝑖1𝑑formulae-sequence𝑐subscript𝑠𝑡superscriptsubscript𝑝𝑡𝑖superscriptsubscript𝑠𝑡1𝑖𝑟subscript𝑠𝑡subscript𝑝𝑡superscriptsubscript𝑟𝑡𝑖P(s_{t})\leftarrow\left\{p_{t}^{(i)}\right\}_{i=1}^{d},c(s_{t},p_{t}^{(i)})% \leftarrow s_{t+1}^{(i)},r(s_{t},p_{t})\leftarrow r_{t}^{(i)}italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ← { italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT , italic_c ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ) ← italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_r ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ← italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT
17:         end for
18:         pt+1arg maxaP(st)r(st,pt)subscript𝑝𝑡1subscriptarg max𝑎𝑃subscript𝑠𝑡𝑟subscript𝑠𝑡subscript𝑝𝑡p_{t+1}\leftarrow\text{arg max}_{a\in P(s_{t})}r(s_{t},p_{t})italic_p start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← arg max start_POSTSUBSCRIPT italic_a ∈ italic_P ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_r ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) \triangleright Simulation
19:         rtr(st,pt),st+1c(st,pt)formulae-sequencesubscript𝑟𝑡𝑟subscript𝑠𝑡subscript𝑝𝑡subscript𝑠𝑡1𝑐subscript𝑠𝑡subscript𝑝𝑡r_{t}\leftarrow r(s_{t},p_{t}),s_{t+1}\leftarrow c(s_{t},p_{t})italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← italic_r ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) , italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ← italic_c ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )
20:         tt+1𝑡𝑡1t\leftarrow t+1italic_t ← italic_t + 1
21:     end while
22:     for tt,,0superscript𝑡𝑡0t^{\prime}\leftarrow t,\dots,0italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ← italic_t , … , 0 do \triangleright Back propagation
23:         Update Q(st,pt)𝑄subscript𝑠superscript𝑡subscript𝑝superscript𝑡Q(s_{t^{\prime}},p_{t^{\prime}})italic_Q ( italic_s start_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) with {rt,rt+1,,rt}subscript𝑟superscript𝑡subscript𝑟superscript𝑡1subscript𝑟𝑡\{r_{t^{\prime}},r_{t^{\prime}+1},\dots,r_{t}\}{ italic_r start_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + 1 end_POSTSUBSCRIPT , … , italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT }
24:     end for
25:end for

A.5 Example of Dialogue

Listing 2: An Example of Dialogue
Agent: Hello, this is the customer service center of Zhonglian Bank Credit Card Department. May I speak with Ms. Hao Xiurong?
User: Hello, this is Hao Xiurong.
Agent: Alright, Ms. Hao, could you please confirm that your ID number ends with 3720 and your phone number is 15335220319, is that correct?
User: Yes, thats correct.
Agent: Ms. Hao, thank you for applying for our platinum credit card. To reward our new customers, we currently have an activation promotion that offers a 12\% discount on fuel for one year. Once you activate your credit card, you can enjoy this benefit.
User: It sounds good, but Ive never used a credit card before and Im a bit worried if it will be troublesome.
Agent: Ms. Hao, you dont have to worry at all. Activating a credit card is very simple. You can choose to handle it at the bank counter, or you can activate it through the mobile banking APP or by calling our credit card customer service hotline for self-activation. The whole process is very convenient and fast.
User: Oh, I see, thats good then.
Agent: Yes, and after activation, you can also set a transaction password to ensure the security of your card usage. Our platinum credit card is free of annual fees for the first year, and you can waive the annual fee by swiping the card three times a year. This is not only safe and convenient but also comes with many promotional activities.
User: It sounds like a great deal, please activate it for me.
Agent: Thank you very much for your trust. For the security of your card usage, please provide a transaction password for verification when withdrawing cash at an ATM and when making purchases at merchants.
User: Sure, my transaction password is 234139.
Agent: Ms. Hao, your credit card has been successfully activated, and the transaction password has been set. You can now use your credit card normally.
User: Thank you, its really convenient.
Agent: Youre welcome, Ms. Hao. If you have any questions, you can call our customer service hotline at any time. I wish you a happy life, goodbye!
User: Okay, goodbye!

A.6 Prompt Details

Utterance
Prompt You are a professional business manager, having accumulated extensive knowledge of business processes and being adept at conversing with clients to complete business tasks.
Now, please create the standard operating procedure (SOP) diagram for this task based on the following “Task Knowledge”.
All nodes of this SOP diagram are as follows “SOP Diagram Nodes”, and you need to output the list of adjacent nodes for each node, that is, the adjacency list of the entire SOP diagram.
# Please strictly follow the requirements below:
1. The output format should be in the form of a JSON string array.
2. The JSON fields must include all nodes. If a node has no adjacent nodes, represent it with [].
# Task Knowledge
{
     “agent_identity”: “Zhonglian Bank Credit Card Center Customer Service”,
     “agent_goal”: “To persuade the user to explicitly agree to the credit card”,
     “success_mark”: [
         “User.ClearAgreement”
     ],
     …
     “other_knowledge”: “Credit Card Opening Guide I. Card Activation …”
}
# SOP Diagram Nodes
[
      “Agent.Start”,
      “Agent.VerifyIdentity”,
      …
      “Agent.PoliteEnd”,
      “Agent.NotifyActivationSuccess”
]
The adjacency list of all nodes:
Generation {
      “Agent.Start”:[
          “Agent.VerifyIdentity”
      ],
      “Agent.VerifyIdentity”:[
          “User.IsThemselves”,
          “User.NotThemselves”
      ],
      …
      “User.ClearRejection”:[
          “Agent.PoliteEnd”
      ],
      “Agent.PoliteEnd”:[],
      “Agent.NotifyActivationSuccess”:[
          “Agent.PoliteEnd”
      ]
}
Table 7: The prompt for dialogue policy constraints prediction.
Utterance
Prompt You are a professional business manager, assisting customers in handling their business through dialogue.
Now, based on the provided “Conversation Task Information” and “User Information”, in conjunction with the “Standard Operation Process” diagram used to guide the dialogue for business handling, add more “User State” and “Agent Actions” to the “Main Dialogue Path” to better simulate the complete dialogue process between this user and the agent.
Below is the task-related information:
“Conversation Task Information”: { “agent_identity”: “Zhonglian Bank Credit Card Center Customer Service”, … “other_knowledge”: “Credit Card Activation Guide I. Card Activation …” },
“User Information”: {“Name”:“ZhangSan”, “Gender”:“Male”, … “CardLastFourDigits”:“6789”, “CreditLimit”:“50,000Yuan”, “AnnualFee”:“…” },
“Agent Actions”: [“Start”, “VerifyIdentity”, “IntroductionOfActivationActivity”, … ],
“User State”: [“User.IsThemselves”, “User.NotThemselves”, … ],
“Standard Operation Process”: [“Agent.Start”, “Agent.VerifyIdentity”, … ]
“Main Dialogue Path”: [ “Agent.Start”, “Agent.VerifyIdentity”, … ]
Please insert more “User State” and “Agent Actions” into the above “Main Dialogue Path” to enrich the possible reactions of the user and the corresponding decision-making actions that the agent takes in response to the user’s state, forming a complete dialogue.
Please follow the requirements below strictly:
1. A complete dialogue path should start with “Agent.Greeting” and end with “User.Ending”.
2. The complete dialogue path requires “Agent Action” and “User State” to be spoken once alternately, representing a round of dialogue. It is forbidden for one party to appear twice or more consecutively. Please add a “–” after each round of dialogue to separate the previous and next rounds. For example, the [“Agent.Greeting”, “Agent.Introduce Activity”] in the “Main Dialogue Path” does not meet this requirement, and you need to add a “User.Greeting” to supplement it to [“Agent.Greeting”, “User.Greeting”, “–”, “Agent.IntroduceActivity”].
3. Only add more rounds to the “Main Dialogue Path” mentioned above.
4. Do not delete any nodes in the “Main Dialogue Path”.
5. Do not adjust the order of any nodes in the “Main Dialogue Path”.
6. To control the number of insertions, please only insert 2 to 5 reasonable rounds of dialogue.
7. The inserted dialogue path nodes can only be selected from the “User State” and “Agent Action” defined in the above text, and do not create your own names.
8. When the preceding node is a node in the “Standard Operation Process”, please try to choose one from the subsequent nodes of the current node in the adjacency list.
9. The output format should be consistent with the above text, that is, output with a JSON string array.
Here are two correct examples:
“Main Dialogue Path”: [“Agent.Greeting”, “Agent.VerifyIdentity”, “User.NotThemselves”, “Agent.PoliteEnd”]
Supplemented complete dialogue path in JSON format:
Table 8: Prompting LLM to generate a scene description based on the task information.
Utterance
Prompt ```json
[“Agent.Greeting”, “User.Greeting”, “–”, “Agent.VerifyIdentity”, “User.NotThemselves”, “–”, “Agent.PoliteEnd”, “User.Ending”]
```
“Main Dialogue Path”: [“Agent.Greeting”, “User.RequestBooking”, “Agent.VerifyIdentity”, “User.NotThemselves”, “Agent.PoliteEnd”]
Supplemented complete dialogue path in JSON format:
```json
[“Agent.Greeting”, “User.RequestBooking”, “–”, “Agent.VerifyIdentity”, “User.NotThemselves”, “–”, “Agent.PoliteEnd”, “User.Ending”]
```
The above two examples are only for illustrating the output logic and format requirements of this task. Please ignore their textual meaning, and let’s officially start now:
“Main Dialogue Path”:
[ “Agent.Start”, “Agent.VerifyIdentity”, … ]
Supplemented complete dialogue path in JSON format:
Generation ```json
[
      “Agent.Greeting”,
      “User.Greeting”,
      “Agent.VerifyIdentity”,
      “User.IsThemselves”,
      “Agent.IntroduceActivationActivity”,
      “User.ClearAgreement”,
      “Agent.RequestSetPassword”,
      “User.ProvideInvalidPassword”,
      “Agent.RequestSetPassword”,
      “User.ProvideValidPassword”,
      “Agent.NotifyActivationSuccess”,
      “Agent.PoliteEnd”
]
```
Table 8: Prompting LLM to generate a scene description based on the task information. (continued)
Utterance
Prompt You are a professional business manager, assisting customers with their business through dialogues.
Now, due to the need for training new business representatives, you are asked to act as a scriptwriter. Based on the “Conversation Task Information” and “User Information” provided below, and the “Complete Dialogue Path” between the agent and the user for this task, write dialogue scripts that fit the roles of both parties.
Here are the task-related information:
“Conversation Task Information”: { “AgentIdentity”: “Zhonglian Bank Credit Card Center Customer Service”, … “OtherKnowledge”: “Credit Card Opening Guide I. Card Activation …” },
“User Information”:{“Name”: “ZhangSan”, “Gender”: “Male”, … },
“Complete Dialogue Path”: [“Agent.Greeting”, “User.Greeting”, “Agent.VerifyIdentity”, “User.IsThemselves”, … ],
Please write dialogue scripts for the “Complete Dialogue Path” based on the “Conversation Task Information” and “User Information”, forming a complete conversation.
Please strictly follow the requirements below:
1.Please directly append the “script” you have written after the name of each node in the “Complete Dialogue Path”, separated by the “|” symbol. For example: “User.Consider|I understand, I will think about it.”
2.Do not add, delete, or change any original node.
3.The complete output format should be consistent with the “Complete Dialogue Path”, that is, output as a JSON array of strings, such as [“Agent.PoliteEnd|Thank you for your cooperation, I wish you a happy life!”, “User.Ending|Okay, goodbye!”]
Please output the “Complete Dialogue Path” after adding the dialogue scripts:
Generation [“Agent.Greeting|Hello, this is the credit card center of Zhonglian Bank.”, “User.Greeting|Hello, this is Zhang San. ”,“Agent.VerifyIdentity|To confirm your identity, could you please confirm that the last four digits of your platinum credit card are 6789? ”,“User.IsThemselves|Yes, that’s correct. ”,“Agent.IntroduceActivationActivity|Alright, Mr.Zhang we have noticed that your platinum credit card has not been activated yet. To express our gratitude for your support, we currently have a promotion where you can enjoy a 12% discount on fuel for one year just by activating your card.”, … ]
Table 9: The Prompt for LLM generates dialogue according to task information.
Utterance
Prompt You are a business expert agent in the financial field, and you are now given the following “User Information”, “Task Knowledge” and “Dialogue Context”. Please select the next round of agent action for the agent that can best encourage the user to complete this dialogue task from the “Optional Agent Actions”. When the task is completed, please actively choose “Agent.PoliteEnd” to end the conversation. The name of the agent action must be exactly the same as those in the candidate list.
User Information:
{“Name”: “Li Zhenghao”, “Title”: “Mr.”, “customer_type”: …}
Task Knowledge:
{“agent_identity”: “Customer Service of Zhonglian Bank Wealth Center”, “agent_goal”: “Invite the user to a golf experience event”, “success_mark”: [“Agent.InformBookingSuccess”], “event_time”: “1st to 31st of the current month”, “event_location”: “Shenzhen Golf Club”, “participation_times”: “Only once”, “maximum_registration_number”: “3”, “event_cost”: “Free”, “event_target”: “Premium Customers”, “other_knowledge”: “Event Notification Document …”}
Optional Agent Actions:
[“Start”, “VerifyIdentity”, “PoliteEnd”, “InviteToGolfExperienceEvent”, “InquireAboutParticipationNumberOrTime”, “InformBookingSuccess”, “Greeting”, “EmpathizeAndSoothe”, “EstablishTrust”, “RelieveDoubts”, “AttemptPersuasion”, “Chat”, “Thank”, “OtherActions”]
Dialogue Context:
Agent Action: Greeting
Agent Response: Hello!
User Response: Hello, who is this?
User State: Greeting
Agent Action:
Based on the standard procedure for this task, please prioritize the following agent actions to continue the dialogue above:
[“Start”, “VerifyIdentity”, “PoliteEnd”, “InviteToGolfExperienceEvent”, “InquireAboutParticipationNumberOrTime”, “InformBookingSuccess”, “Greeting”, “EmpathizeAndSoothe”, “EstablishTrust”, “RelieveDoubts”, “AttemptPersuasion”, “Chat”, “Thank”, “OtherActions”]
Start with “Analysis:” and deduct step by step the best agent actions that should be used by the subsequent dialogue agent, concluding with an output similar to “Therefore, the best agent action is: Greeting”, with a colon required before the name of the agent action.
Generation Analysis: Based on the previous text, the user has expressed an attitude of needing to consider. According to task knowledge, the activity is aimed at premium customers, therefore an attempt should be made to persuade the user to participate in the event.
Therefore, the best agent action is: AttemptPersuasion
Table 10: The prompt for sampling candidate policy to expand.
Utterance
Prompt You are an agent serving as a business expert in the financial sector. Given the “User Information”, “Task Knowledge”, and “Dialogue Context” for the dialogue task below, please generate an appropriate agent response by following the chosen agent action. Start your response with a new line and “Agent Response:”.
User Information:
{“Name”: “Li Zhenghao”, “Title”: “Mr.”, “customer_type”: …}
Task Knowledge:
{“agent_identity”: “Customer Service of Zhonglian Bank Wealth Center”, “agent_goal”: “Invite the user to a golf experience event”, “success_mark”: [“Agent.InformBookingSuccess”], “event_time”: “1st to 31st of the current month”, “event_location”: “Shenzhen Golf Club”, “participation_times”: “Only once”, “maximum_registration_number”: “3”, “event_cost”: “Free”, “event_target”: “Premium Customers”, “other_knowledge”: “Event Notification Document …”}
Dialogue Context:
Agent Action: Greeting
Agent Response: Hello!
User Response: Hello, who is this?
User State: Greeting
Agent Action: InviteToGolfExperienceEvent
Agent Response: We at Zhonglian Bank Wealth Center are hosting a golf experience event from the 1st to the 31st of this month at the Shenzhen Golf Club. This is specially prepared for our premium customers, and as one of our significant deposit clients, we would be very pleased if you could join us.
User Reply: I need to think about it.
User State: DelayDecision
Agent Action: RelieveDoubts
Agent Response: The event is free of charge. It will take place from the 1st to the 31st of this month at the Shenzhen Golf Club. It is aimed at our premium customers, each of whom can only participate once, and you can sign up for a maximum of three people. If you have any questions or need more information, I am here to help you.
User Reply: I will consider it.
Generation The event will be taking place from the 1st to the 31st of this month, and you can choose a suitable time to participate. If you decide to attend, please provide me with your name and contact information so I can make a reservation for you.
Table 11: The prompt for MCTS to generate a response from the agent.
Utterance
Prompt You are a business expert manager at a company in the financial sector. The company’s business agent is contacting the customer below to handle affairs, see “Dialogue Context” for the content of the conversation. Now, please carefully read “Task Knowledge” and judge whether the agent’s chosen “Agent action” conforms to the task process. If it meets the requirements, it is 1; otherwise, it is 0.
User Information:
{“Name”: “Li Zhenghao”, “Title”: “Mr.”, “customer_type”: …}
Task Knowledge:
{“agent_identity”: “Customer Service of Zhonglian Bank Wealth Center”, “agent_goal”: “Invite the user to a golf experience event”, “success_mark”: [“Agent.InformBookingSuccess”], “event_time”: “1st to 31st of the current month”, “event_location”: “Shenzhen Golf Club”, “participation_times”: “Only once”, “maximum_registration_number”: “3”, “event_cost”: “Free”, “event_target”: “Premium Customers”, “other_knowledge”: “Event Notification Document …”}
Dialogue Context:
Agent Action: Greeting
Agent Response: Hello!
User Response: Hello, who is this?
User State: Greeting
Agent Action: AttemptPersuasion
Optional Agent Actions:
[“Start”, “VerifyIdentity”, “PoliteEnd”, “InviteToGolfExperienceEvent”, “InquireAboutParticipationNumberOrTime”, “InformBookingSuccess”, “Greeting”, “EmpathizeAndSoothe”, “EstablishTrust”, “RelieveDoubts”, “AttemptPersuasion”, “Chat”, “Thank”, “OtherActions”]
Based on the above “Task Knowledge” and “Dialogue Context” as well as the “Optional Agent Actions”, if the agent chooses the “Optional Agent Actions: AttemptPersuasion”, analyze whether it is consistent with the above process, and whether it is the optimal agent actions among the listed optional agent actions. Start your analysis with “Analysis:” and after your conclusion, summarize with “Therefore, the answer is: 1” if it is logical or “Therefore, the answer is: 0” if not.
Generation According to the task knowledge, the objective of the agent’s task is to invite users to participate in a golf experience event. The agent action chosen is AttemptPersuasion, which aligns with the task objective. Therefore, it is consistent with the above handling process and the best action within the optional agent actions.
Therefore, the answer is: 1
Table 12: The prompt for MCTS to get a reward score.
Utterance
Prompt You are an agent specializing in lifestyle services. Given the following “User Information”, “Task Knowledge”, and “Dialogue Context”, please select an option from the “Optional User State” that best reflects the user’s current task status in the dialogue context. Start with “User State:” and the name of the status must be exactly the same as one in the list.
User Information:
“Name”: “Li Si”, “Title”: “Mr.”, “ID number”: “1231205001011234”, “Order number”: “C190125”
Task Knowledge:
“agent_identity”’: “Dongxia Airlines Telephone Check-in Operator”, “agent_goal”: “To check in and seat select for customers needing to check in”, “success_mark”: [“InformOfSuccessfulCheck-in”, “InformOfSuccessfulSeatSelection”, “InformofSuccessfulPayment”], “Seat Selection": …
Optional User State:
[“Complaint”, “Impoliteness”, “RequestsCheck-in”, “IdentityInformationCorrect”, “IdentityInformationIncorrect”, “OrderNumberCorrect”, … ]
Dialogue context:
Agent Action: Greeting
Agent Response: Hello! Dongxia Airlines at your service.
User reply: Hello.
User State: Greeting
Agent Action: SelfIntroduction
Agent Response: I am a check-in service agent from Dongxia Airlines, pleased to be at your service.
User reply: Okay, thank you.
User State: Thank
Agent Action: InquireNeeds
Agent Response: May I ask if you require check-in services?
User reply: Yes, I need to check in.
Generation User State: RequestsCheck-in
Table 13: The prompt for obtaining user states.
Utterance
Prompt You are a business expert agent in the financial field. Now, given the following dialogue task “User Information”, “Task Knowledge” and “Dialogue Context” please output the following three items:
1. Please choose one from the “Optional Agent Action” that best reflects the agent’s next round of action that can prompt the user to complete the dialogue task. Start with a new line and “Agent Action:” The name of the agent action must match exactly with one on the candidate list.
2. Please generate an appropriate Agent Response combined with the agent’s chosen dialogue strategy. Start with a new line and “Agent Response:”
User Information:
{“Name”: “Zhang San”, … }
Task Knowledge:
{“agent_identity”: “Customer Service of Zhonglian Bank Wealth Center”,“agent_goal”: “Invite the user to a golf experience event”,“success_mark”: [“Agent.InformBookingSuccess”],“event_time”: “1st to 31st of the current month”, “event_location”: “Shenzhen Golf Club”, “participation_times”: “Only once”, “maximum_registration_number”: “3”, “event_cost”: “Free”, “event_target”: “Premium Customers”, “other_knowledge”: “Event Notification Document …”}
Optional Agent Action:
[“Start”, “VerifyIdentity”, “IntroductionOfActivationActivity”, … ]
Optional User State:
[“IsThemselves”, “NotThemselves”, “ClearAgreement”, … ]
Agent Response: I apologize for the inconvenience, but I am a customer service representative for the Credit Card Center at Zhonglian Bank, in charge of assisting you with matters related to your credit card.
User Response: Uh, I understand, but it’s really not a convenient time for me right now.
User State: Inconvenient
Agent Action: Yes, I am Zhang San
Generation User State: Inconvenient
Agent Action: VerifyIdentity
Agent Response: I understand that you may be busy right now, but to ensure the security of your account, I need to verify your identity information. Are you Mr. Zhang San, the holder of the Platinum credit card ending in 6789?
Table 14: The prompt for CoT.
Utterance
Prompt You are a business expert agent in the financial industry. You are now given the following dialogue tasks of “User Information”, “Task Knowledge” and “Dialogue Context”. Please output the following three items:
1. Please select the most appropriate option from “Optional User State” that reflects the current user task status in the dialogue context, starting with “User State:”, and the name of the task status must be exactly the same as the one in the available list.
2. Please select the next round agent action of the agent that is most likely to prompt the user to complete the dialogue task from “Optional Agent Action”. First, start a new line with "Agent Action:", and the name of the agent action must be exactly the same as the candidates in the list.
3. Please generate a suitable agent response based on the agent’s selected agent action. First, start a new line with “Agent Response:”.
User Information:
{“Name”: “Zhang San”, … }
Task Knowledge:
{“agent_identity”: “Customer Service of Zhonglian Bank Wealth Center”,“agent_goal”: “Invite the user to a golf experience event”,“success_mark”: [“Agent.InformBookingSuccess”],“event_time”: “1st to 31st of the current month”, “event_location”: “Shenzhen Golf Club”, “participation_times”: “Only once”, “maximum_registration_number”: “3”, “event_cost”: “Free”, “event_target”: “Premium Customers”, “other_knowledge”: “Event Notification Document …”}
Optional Agent Action:
[“Start”, “VerifyIdentity”, “IntroductionOfActivationActivity”, … ]
Optional User State:
[“IsThemselves”, “NotThemselves”, “ClearAgreement”, … ]
Dialogue context:
User State:
Agent Action: Greeting
Agent response: Hello!
User response: Hello, who is this?
User State: Greeting
Agent Action: SelfIntroduction
Agent response: I am customer service from the Wealth Center of Zhonglian Bank, responsible for inviting you to our golf experience event.
User response: Oh, is that so? What’s the matter?
Based on the standard process of handling this task and the dialogue context, please prioritize the following agent action that meets the Standard Operating Procedures (SOP):
["VerifyIdentity"]
Generation User State: Greeting
Agent Action: VerifyIdentity
Agent’s reply: Sir, I am a customer service representative from the Zhonglian Bank Wealth Center. May I ask if you are Mr. Li Zhenghao?
Table 15: The prompt for CoT+SOPs.
Utterance
Prompt You are a professional business manager, having accumulated a wealth of business process knowledge and being adept at conversing with customers to complete transactions.
Now, based on the nodes of the Standard Operating Procedure (SOP) diagram, please analyze all the nodes. Write a passage analyzing the entire process of interaction and communication between the ‘User’ and ‘Business Representative Agent’, covering all nodes and all possible scenarios, such as what happens if the user ‘Agree’ or ‘Disagree’.
# Here is an example:
SOP diagram nodes: [‘Agent.Start’, ‘Agent.VerifyIdentity’, ‘User.IsThemselves’, ‘Agent.IntroduceGiftDelivery’, ‘User.NotThemselves’, ‘User.ClearAgreement’, ‘Agent.AskForDeliveryAddress’, ‘User.ProvideDeliveryAddress’, ‘User.DoNotProvideDeliveryAddress’, ‘User.ClearRejection’, ‘Agent.PoliteEnd’, ‘Agent.NotifyRegistrationSuccess’]
Start analysis: "After ‘Agent.Start’, the first step should be ‘Agent.VerifyIdentity’. After confirming the identity, it could be ‘User.IsThemselves’ or ‘User.NotThemselves’. If not the person, the agent will ‘Agent.PoliteEnd’. If the user is the person, the agent will introduce, that is ‘Agent.IntroduceGiftDelivery’. If ‘User.ClearAgreement’, the process can continue. Since it is a gift delivery, after the user agrees, ‘Agent.AskForDeliveryAddress’, if ‘User.ProvideDeliveryAddress’ then the agent ‘Agent.NotifyRegistrationSuccess’, if ‘User.DoNotProvideDeliveryAddress’, the agent may ask again ‘Agent.AskForDeliveryAddress’, or directly ‘Agent.PoliteEnd’. After ‘Agent.NotifyRegistrationSuccess’, it can ‘Agent.PoliteEnd’."
# SOP diagram nodes:
[
“Agent.Start”,
“Agent.VerifyIdentity”,
“Agent.PoliteEnd”,
“Agent.NotifyActivationSuccess”
]
Start analysis:
Generation After ‘Agent.Start’, the first step should be ‘Agent.VerifyIdentity’. After confirming the identity, it could be ‘User.IsThemselves’ or ‘User.NotThemselves’. If not the person, the agent will ‘Agent.PoliteEnd’. If the user is the person, then the agent will introduce the event, that is ‘Agent.InvitationToGolfExperienceActivity’. If ‘User.ClearAgreement’, the process can continue, and the agent will inquire about the number of participants or the time of participation from the user. The user might ‘User.Inconvenience’, ‘User.ProvideOnlyParticipationNumberOrTime’, or ‘User.ProvideBothParticipationNumberAndTime’. If the user is not convenient or refuses to answer, the agent might ‘Agent.PoliteEnd’. If the user provides both the number of participants and the time, the agent will ‘Agent.NotifyBookingSuccess’, and then ‘Agent.PoliteEnd’.
Table 16: The prompt of TCoT to describe the vertex in nature language.
Utterance
Prompt You are a seasoned business representative, having accumulated a wealth of business process knowledge, and are adept at conversing with customers to complete transactions.
Now please formulate the Standard Operating Procedure (SOP) diagram for this task.
You can refer to the following “Task Knowledge”, but it may not be comprehensive or accurate, and you do not need to strictly follow it.
All nodes of the SOP diagram are as follows “SOP Diagram Nodes”, and you need to output the list of adjacent nodes for each node, that is, the adjacency list of the entire SOP diagram.
# Hint
The SOP diagram starts with ‘Agent.Start’ and ends with ‘Agent.PoliteEnd’.
The subsequent nodes of the “User” node are usually “Agent” nodes, and the subsequent nodes of the “Agent” node mostly require a response from the “User” node. For some actions that do not require a response, the subsequent node of “Agent” can also be “Agent”.
If there is a ‘VerifyIdentity’ node in the task, it indicates that only the person themselves can continue, and the conversation ends if it is not the person themselves.
# Please strictly follow the requirements below:
The output format should be as follows, for example:
```json
{
“Agent.Start”: [“Agent.Node1”],
“Agent.Node1”: [“User.Node1”, “User.Node2”],
“User.Node1”: …
}
```
The JSON fields must include all nodes.
# Task Knowledge
After ‘Agent.Start’, the first step should be ‘Agent.VerifyIdentity’. After confirming the identity, it could be ‘User.IsThemselves’ or ‘User.NotThemselves’. If not the person, the agent will ‘Agent.PoliteEnd’. If the user is the person, then the agent will introduce the event, that is ‘Agent.InvitationToGolfExperienceActivity’…
# SOP Diagram Nodes [ “Agent.Start”, “Agent.VerifyIdentity”, … ]
Adjacency list for all nodes:
Generation ```json
{
“Agent.Start”:[“Agent.VerifyIdentity”],
“Agent.VerifyIdentity”:[“User.IsThemselves”,“User.NotThemselves”],
“User.NotThemselves”:[“Agent.PoliteEnd”],
“Agent.PoliteEnd”:[],
“User.IsThemselves”:[“Agent.InviteToGolfExperienceEvent”],
“Agent.InviteToGolfExperienceEvent”:[“User.ClearAgreement”],
}
```
Table 17: The prompt of TCoT to translate the description into an adjacency list.