Planning with Large Language Models for Conversational Agents

Zhigen Li^1,2, Jianxiang Peng¹,Yanmeng Wang²,Tianhao Shen¹
Minghui Zhang¹, Linxi Su¹, Shang Wu¹, Yihang Wu¹,Yuqian Wang¹
Ye Wang², Wei Hu²,Jianfeng Li², Shaojun Wang², Jing Xiao²and Deyi Xiong¹
¹ College of Intelligence and Computing, Tianjin University, Tianjin, China
²Ping An Technology
{zhigenli,pjasonx,dyxiong}@tju.edu.cn
{wangyanmeng219,lijianfeng777,wangshaojun851}@pingan.com.cn
*Corresponding author

Abstract

Controllability and proactivity are crucial properties of autonomous conversational agents (CAs). Controllability requires the CAs to follow the standard operating procedures (SOPs), such as verifying identity before activating credit cards. Proactivity requires the CAs to guide the conversation towards the goal during user uncooperation, such as persuasive dialogue. Existing research cannot be unified with controllability, proactivity, and low manual annotation. To bridge this gap, we propose a new framework for planning-based conversational agents (PCA) powered by large language models (LLMs), which only requires humans to define tasks and goals for the LLMs. Before conversation, LLM plans the core and necessary SOP for dialogue offline. During the conversation, LLM plans the best action path online referring to the SOP, and generates responses to achieve process controllability. Subsequently, we propose a semi-automatic dialogue data creation framework and curate a high-quality dialogue dataset (PCA-D). Meanwhile, we develop multiple variants and evaluation metrics for PCA, e.g., planning with Monte Carlo Tree Search (PCA-M), which searches for the optimal dialogue action while satisfying SOP constraints and achieving the proactive of the dialogue. Experiment results show that LLMs finetuned on PCA-D can significantly improve the performance and generalize to unseen domains. PCA-M outperforms other CoT and ToT baselines in terms of conversation controllability, proactivity, task success rate, and overall logical coherence, and is applicable in industry dialogue scenarios. The dataset and codes are available at XXXX.

1 Introduction

Refer to caption — Figure 1: An example of the proposed task: planning-based LLMs for conversational agent with zero model training, satisfying both controllability and proactivity

Conversational agents (CAs) or dialogue systems are designed to offer social support or functional services to human users via natural language interactions. Existing mainstream approaches can be categorized into conversational question answering (CQA) (Anantha et al., 2021; Singhal et al., 2023; Zhuang et al., 2024), open-domain dialogue (ODD) (Li et al., 2017; Roller et al., 2021; Ouyang et al., 2022), task-oriented dialogue (TOD) (Budzianowski et al., 2018; Sun et al., 2022; Quan et al., 2020; Moradshahi et al., 2023) and conversational recommender systems (CRS) (Wu et al., 2019; Zhang et al., 2021; Wang et al., 2023a). The goal of CQA and ODD is to passively respond to users to provide knowledgeable or engaging conversations, thus lacking proactive. CRS sets goals by specifying a topic, such as “movie recommendation, ‘King of Comedy’,” and autonomously plans the best path of dialog actions to smoothly guide the conversation towards recommending “King of Comedy”. However, in complex tasks, dialogue actions frequently adhere to strict sequential constraints, as commonly found in the widely accepted standardized operating procedures (SOPs) established by humans across various domains (Akyar, 2012; Zhou et al., 2023). As shown in Figure 1, the prerequisite for persuading credit card activation is that the user must be the cardholder. CRS lacks planning globally controllable path of dialog actions. TOD aims to provide functional services according to the service process, and their controllability depends on manually drawing dialogue trees or predefined intent, slot, and action training data. It can proactively guide users to provide slot information, but cannot handle situations where users uncooperate, such as persuasive dialogue or negotiation dialogue (Wang et al., 2019; Chawla et al., 2021). We posit that dialogue’s proactivity, controllability, and reduced reliance on manual intervention are crucial to building autonomous business CAs. Recently, prompting large language models (LLMs) has become the predominant approach to zero-shot language tasks, and its applications in dialogue have received much attention. However, integrating LLMs into conversational agents while addressing the limitations of the aforementioned dialogue systems remains a significant challenge.

To address this challenge, we propose a novel dialogue framework, namely planning-based conversational agents (PCA) as shown in Figure 1. In contrast to prior research, our approach only requires humans to provide task definitions and goals for the LLMs to plan the core and necessary SOP offline before the conversation. During the conversation, LLM refers to the SOP to plan the best action path online and generate responses to achieve process controllability. For actions not in SOP, such as answering questions and persuasion, LLM can plan autonomously to enable proactive dialogue. To further this research, we must address several challenges: (1) How to construct a conversation dataset that meets both initiative and controllability? (2) What are the PCA implementation schemes? (3) How to accurately evaluate the effects of different schemes?

For the first issue, the most straightforward approach would be to manually edit flowcharts and dialogues, which are immensely costly. Recently, advanced LLMs, e.g., ChatGPT, can outperform crowd workers on many annotation and evaluation tasks (e.g. (Gilardi et al., 2023; Wang et al., 2023b; Pan et al., 2023)). Therefore, as shown in Figure 2, we design a four-step LLM role-playing system to assist manual correction in creating a dialogue dataset, which is more cost-effective and quickly scalable.

For the last two challenges, we first implemented the chain of thought (CoT) and supervised fine-tuning (SFT) for SOP prediction in offline planning. In online planning, we first attempt to encode SOP directly into the prompt of LLMs, using in-context learning (ICL) to enable them to understand and follow these constraints. Additionally, inspired by the use of the Monte Carlo Tree Search (MCTS) algorithm in the chess game such as AlphaGo which effectively harnesses human prior rules to control and enhance game AI (Silver et al., 2017), we draw an analogy between dialogue task SOPs and game rules. MCTS is a recent and strikingly successful example of a decision-time planning approach (Sutton and Barto, 2018). By utilizing these SOPs’ rules within MCTS’s expansion and simulation phases, we can better search for the optimal dialogue action that satisfies SOP. Finally, We evaluated the overall/controllable/proactive action accuracy in single-round dialogue and integrated dialogue dimensions. Additionally, manual evaluation was conducted to address the limitations of automatic evaluation regarding dialogue diversity.

Overall, we propose conversational agents based on LLMs with proactive and controllability and lower reliance on manual intervention. To the best of our knowledge, this problem has not been fully studied in prior work. Therefore, we advance the research on this issue from three aspects: task definition, data creation, and modeling. Our contributions are summarized as follows:

•

We propose and define the proactivity and controllable conversational agents with LLMs, which are promising yet unexplored.
•

We propose a semi-automatic dialogue dataset curation framework and develop a high-quality Chinese dialogue dataset (PCA-D). It not only provides insights into this task but also facilitates more research on it.
•

We develop multiple variants and evaluation metrics based on ICL, SFT, CoT, and MCTS modifications, which demonstrate superior performance in implementing offline planning of SOP and online planning of dialogue prediction.

2 Related Work

Dialogue Planning. In the process of dialogue planning, previous work focuses on subgoal generation (Zhang et al., 2021), the next round of dialogue transition strategy (Tang et al., 2019), hierarchical strategy (Kishinami et al., 2022). Recent work further involves planning dialogue paths based on basic knowledge, a goal-oriented dialogue planning framework to plan a series of dialogue actions, and guiding the system to proactively transition between different dialogue stages (Wang et al., 2022a). Existing research employs greedy, single-turn prediction strategies and overlooks the interdependencies of global policies in dialogue processes, resulting in global dialogue uncontrollable. We propose a SOP-based planning approach to address this issue.
Planning and Reasoning of LLMs. LLMs show prowess in planning and reasoning. Examples include Chain-of-Thought (Wei et al., 2022), its variants (Kojima et al., 2022) , Self-Consistency (Wang et al., 2022b), and Least-to-most prompting (Zhou et al., 2022). self-assessment (Paul et al., 2023; Welleck et al., 2022; Shinn et al., 2023). Recent research uses more complex reasoning processes, offering new avenues for improving and optimizing the reasoning process. Yao et al. (2023) apply heuristic-based searches, such as depth-first and breadth-first search, to discover optimized reasoning pathways. Zhu et al. (2022); Hao et al. (2023) introduce MCTS to reasoning steps for complex math or logical reasoning. We focus on applying MCTS to implement controllable dialogue systems, with encoding SOPs’ constraints into the MCTS’s expansion and simulation steps.

3 Task Definition

Given a dialogue corpus $\{(T_{i},C_{i},D_{i})\}_{i=1,2,...,N}$ , where $N$ is the number of dialogues. In the $i$ -th dialogue, $T_{i}\triangleq(P_{i},A_{i},S_{i})$ represents the task definition, which encompasses task and user profile $P_{i}$ defining the task goal, relevant knowledge, and user personal profile; Agent actions $A_{i}$ contains nodes from SOP and proactive actions outside of SOP; User states $S_{i}$ reflecting the user’s state in completing the task. $C_{i}\triangleq\{\mathrm{dir}(n_{p},m_{q})\}_{n_{p},m_{q}\in A_{i}\cup S_{i}}$ represents SOP, is a directed graph with agent actions and user states as nodes, and $\mathrm{dir}\in\{\text{--},\rightarrow,\leftarrow,\leftrightarrow{}\}$ representing direction between node $n$ and $m$ . The field descriptions and examples for task definitions can be found in A.1. $D_{i}\triangleq\{(u_{i}^{t},s_{i}^{t},a_{i}^{t},r_{i}^{t})\}_{t=1,2,...,T}$ represents a dialogue containing $T$ turns of interaction between agent and user. Each dialogue turn $t$ comprises user utterance $u_{i}^{t}$ and corresponding agent response $r_{i}^{t}$ . Additionally, we annotate the updated user state $s_{i}^{t}$ resulting from the user utterance (e.g., the user is the cardholder) and the corresponding agent action $a_{i}^{t}$ for response to this status (e.g., introduce activation activity).

The PCA comprises two tasks.

Task 1 is SOP Prediction, defined as

\displaystyle\hat{\mathrm{dir}}=\mathop{\arg\max}\limits_{\mathrm{dir}\in\{% \text{--},\rightarrow,\leftarrow,\leftrightarrow\}}p(\mathrm{dir}|T_{i},n_{p},% m_{q})

(1)

which means that given $i$ th dialogue’s task definition $T_{i}$ , the model needs to predict the constraint relationship $C_{i}$ between agent actions $A_{i}$ and user states $S_{i}$ by classifying the direction of any two nodes into four categories.

Task 2 is Dialogue Generation, defined as

$\displaystyle\hat{s}_{i}^{t-1}=$	$\displaystyle\mathop{\arg\max}\limits_{s_{i}^{t-1}\in S_{i}}p(s_{i}^{t-1}\|T_{i% },H_{i}^{t-1})$	(2)
$\displaystyle\hat{a}_{i}^{t}=$	$\displaystyle\mathop{\arg\max}\limits_{a_{i}^{t}\in A_{i}}p(a_{i}^{t}\|T_{i},H_% {i}^{t-1},C_{i},s_{i}^{t-1})$	(3)
$\displaystyle\hat{r}_{i}^{t}=$	$\displaystyle[\hat{w}_{1},\hat{w}_{2}...,\hat{w}_{n}]_{\hat{w}_{n}\in W}$	(4)
$\displaystyle\hat{w}_{n}=$	$\displaystyle p(w_{n}\|T_{i},C_{i},H_{i}^{t-1},a_{i}^{t},w_{<n})$	(5)

giving the $i$ -th dialogue’s task definition $T_{i}$ , constraint relationship $C_{i}$ , previous $(t-1)$ turns’ history $H_{i}^{t-1}=D_{i}^{k=1,...,t}$ and vocabulary $W$ , the model first predicts the $(t-1)$ -th turn’s user state $\hat{s}_{i}^{t-1}$ and then selects the next turn’s agent action $\hat{a}_{i}^{t}$ from $A_{i}$ that not only helps to achieve the goal but also follows the task constraint $C_{i}$ , and generates the corresponding response $\hat{r}_{i}^{t}$ .

4 Dataset Construction

4.1 Dataset Curation Framework

Recent studies (Wang et al., 2023a; Sandler et al., 2024) have demonstrated that the deployment of LLMs in role-playing scenarios is effective for the development of high-quality dialogue datasets. To efficiently obtain high-quality data at a low cost, we devise a method where LLMs are utilized to simulate various agents’ roles in the creation of dialogue data, with human intervention for subsequent revisions. Figure 2 illustrates the whole framework, which involves a four-step process: defining the task, planning the task SOP, creating dialogue scenarios, and generating the actual dialogues.

In the first step, we design 53 unique tasks within 45 diverse domains, an example shown in Appendix A.2. In step 2 (i.e. Task 1), we implement prompting techniques that enable the LLMs to assume a planner to plan the SOP by the defined task. In Step 3, we first sample a dialogue path from the manually corrected SOP, which consists of a sequence of agent actions and user states, and ensure that the subsequent dialogue process satisfies the SOP process, thus ensuring controllability. However, this sequence lacks proactive actions typical in natural dialogues, such as answering questions and persuasion. We specify the LLM as a screenwriter, instructing it to insert additional agent actions and user states based on a given dialogue path, to create personalized and comprehensive dialogue scenarios. In the last step, We assigned the LLM as a scriptwriter to write dialogues for each agent action and user state in the sequence of dialogue paths, to create a complete conversation. In this step, we use a simulator to generate user profile for the task definition, ensuring that each conversation involves independent and non-repetitive users. This enhances the diversity and authenticity of the conversations. All prompts are detailed in Appendix A.2.

4.2 Data Quality Control

To ensure the quality and consistency of the PCA-D, we recruite 7 outstanding annotators. Before annotation, we developed a training manual¹¹1Training Manual: https://, organize annotation training, and conduct pre-annotations. During the annotation process, we sample the data in batches at a rate of 20% and assign it for cross-annotation by three annotators. Any instances with an inter-annotator agreement rate below 95% are removed.

4.3 Data Statistics

Table 1 presents a comparison between PCA-D and pertinent datasets in the field. To our knowledge, PCA-D is the inaugural dataset dedicated to controllable and proactive dialogues. Despite the larger volume of dialogues in previous proactive dialogue datasets, they are limited to the singular task of topic recommendation. In contrast, our dataset spans 45 domains and encompasses 53 distinct professional tasks, thereby offering greater versatility. In addition, previous models heavily rely on training data, and the high cost of data annotation hinders further research and application promotion. The advent of LLMs has greatly reduced reliance on training data (Kojima et al., 2022). Similar to the research of Yu et al. (2023), our research focuses on evaluating the performance of unsupervised LLMs in-context-learning, to explore the prospects of low-cost, rapid application.

Table 2 summarizes the statistics of PCA-D. We can see that the scenes and dialogues generated have a manually corrected edit distance of only 3 to 6, which illustrates the high quality and efficiency of our framework. In addition, We conduct a human evaluation of data quality following previous research (Liu et al., 2020, 2021). Dialogues are rated as 1 if they followed the instructions in the task and the utterances were fluent and grammatical, otherwise 0. Three evaluators independently rated a random sample of 300 dialogues, resulting in an average score of 0.98 for PCA-D.

Dataset	Participants	Formed Targets	PA	CT	LNG	#Domains	#Tasks	#Dialogues
DSTC2 (Henderson et al., 2014)	Crowd	N/A	✗	✓	EN	1	1	1,612
DSTC4 (Kim et al., 2017)	Experts	N/A	✗	✓	EN	1	1	35
CrossWOZ (Zhu et al., 2020)	Rules	N/A	✗	✓	CH	5	5	5,012
OTTers (Sevegnani et al., 2021)	Crowd	Topics	✓	✗	EN	Open	1	4,316
TOPDIAL (Wang et al., 2023a)	LLM	Act-topic pairs	✓	✗	EN	3	1	9,939
PCA-D (ours)	LLM,Experts	Open Definition	✓	✓	CH	45	53	3,114

Table 1: Comparison between PCA-D and other related datasets (PA: proactive, CT: controllable, LNG: language, CH: Chinese, EN: English).

Quantity Statistics
Total # domains/tasks	45/53
Total # SOP graph vertexes/Edges	899/1,058
Total # dialogues	3,114
Total # turns	23,897
Total # utterances	47,795
Total # tokens	119,5736
Total # goals	70
Avg. # turns per dialogue	7.67
Avg. # utterances per dialogue	15.34
Avg. # words per utterances	25.01
Quality Statistics
Avg. # scene(dialogue path) accuracy	94.23%
Avg. # edit distance of scene	0.41
Avg. # utterances accuracy	97.66%
Avg. # edit distance of utterances	0.34

Table 2: Statistics of the PCA-D. Edit distance is calculated between LLM pre-annotated and corresponding manually corrected.

5 Our Approach

To implement the PCA system, we need to complete two tasks: 1. Design and implement a comprehensive framework for end-to-end multi-turn dialogue. 2. Implement the two algorithm tasks as described in Section 3: SOP prediction and dialogue generation.

5.1 PCA Framework

PCA framework, depicted in Figure 3, comprises a CA as controller and five scheduled components. Before the dialogue, CA reads user-configured task information and passes SOP vertexes to the offline planner to obtain an adjacency list of the SOP graph. During the dialogue, CA first invokes the working memory module to assemble the prompts required for the dialogue. Subsequently, CA inputs the prompt into the online planning module to obtain the dialogue response. The LLMs used by each module are uniformly managed and accessed in the LLMs module. Next, we describe the implementation algorithms for the offline and online planning modules.

5.2 SOP Prediction

We establish three effective baselines: adjacency list (AL), translation CoT (TCoT), and SFT. For AL, we directly prompt the LLMs to output an adjacency list in JSON format. For TCoT, we prompt the LLMs first to describe each vertex and its children vertexes in natural language along with the reasons, and then prompt the LLMs to translate this description into an adjacency list in JSON format. Finally, we implemented full-parameter SFT on different sizes of Llama and Qwen by the task of generating one vertex’s adjacency vertexes every time. Prompts’ detail is shown in Appendix A.2.

5.3 Dialogue Prediction

CoT and ToT (Wei et al., 2022; Yao et al., 2024) significantly improve LLM’s complex reasoning ability. Therefore, we use them as our baseline. In CoT, we prompt LLM to infer user state, agent actions, and response step by step. In ToT, these steps are split into 3 times calling LLM, with each step corresponding to a layer of tree nodes. We obtain multiple nodes through multiple sampling, and finally, prompt LLM vote to choose a decision path to obtain agent action and response. To add SOP, we calculate the edit distance (ED) between the generated dialogue path and each subpath of the SOP. The smallest ED subpath’s children were added to the prompt to guide the agent in adhering to the SOP. Hyperparameters and prompts are in Appendix A.3.
PCA-M is an extension of the MCTS with SOP for dialogue path online planning, designed to optimally leverage SOP’s constraint during the expansion and simulation phases of future dialogues. MCTS strategically explores the space of dialogue trees and strikes a proper balance between exploration and exploitation to find high-reward dialogue traces efficiently while satisfying constraints.

5.4 MCTS planner with SOP

Specifically, as shown in Figure 3, PCA-M constructs a dialogue tree, where each node represents a turn of dialogue state (i.e., working memory) generated by the agent choosing an action to interact with the user, and each edge represents an agent action and the transition from the current state to the next state after applying that action. Given a dialogue state $s_{0}$ , PCA-M searches for the next best action by iteratively performing node selection, node expansion, dialogue simulation, and back-propagation to update the tree statistics. After $n$ iterations, PCA-M predicts the next best action for $s_{0}$ . This process continues until the specified computational budget is reached (e.g., number of iterations), at which point the resultant trajectory can be extracted from the tree. Prompts and algorithms about PCA-M are shown in Appendix A.3 and 1. Below we describe each stage of the algorithm.

Selection Starting from the root node (i.e., the initial state $s_{0}$ ), at each level of the tree, we select a child node as the next node. This stage ends when a leaf node of the current tree is reached. PCA-M applies the Upper Confidence bounds applied to Trees (UCT) (Kocsis and Szepesvári, 2006) to select each child node to balance exploration (less visited nodes) and exploitation (high-value nodes):

a^{*}=\arg\max_{a\in A(s)}\left[Q(s,a)+w\sqrt{\frac{\ln N(s)}{N(c(s,a))}}\right]

(6)

where $N(s)$ is the number of times node $s$ has been visited in previous iterations, and $c(s,a)$ is the child node of applying $a$ in state $s$ . The less a child node has been visited (i.e., the more uncertain the child node is), the higher the second term. The state-action value function $Q(s,a)$ estimates the expected future reward of taking action $a$ in the state $s$ :

Q(s,a)=0.5\times\left(\mathrm{LLM}(s,a)+0.5\times\mathrm{state}(s,a)\right)

(7)

0\leq\mathrm{LLM}(s,a)\leq 1\\

(8)

\mathrm{state}(s,a)=\begin{cases}0.3&\text{if }s\text{ is a task terminal % state}\\ 0.7&\text{if }s\text{ is a task success state}\\ 0&\text{otherwise}\end{cases}

(9)

where $\mathrm{LLM}(s,a)$ function represents the logical rationality of the current action, as assessed by the LLMs. It is computed as the mean of several binary (0 or 1) evaluations derived from prompt-based sampling by the LLMs. The $\mathrm{state}(s,a)$ function assigns discrete values to measure task completion: 0.3 for the defined termination state, and 0.7 for the success state.

Expansion When the selected leaf node is not a terminal one, we employ the LLM as a dialogue action prior, using prompt to sample $d$ possible dialogue actions for expansion. In addition, PCA-M first searches for a local subgraph in the SOP constraint graph based on the dialogue path, and adds the next two levels of child nodes pointed to by the current state node for expansion. This not only utilizes the constraint relations but also retains the autonomy of the LLM. When the selected leaf node is already a terminal node (either a dialogue end node or the maximum search depth has been reached), we will skip the expansion phase and proceed to back-propagation.

Simulation To estimate the reward generated by future dialogue, this phase involves simulating, also known as roll-out, future dialogue for each expanded state node. To improve efficiency, we reduce the randomness of the simulation. Specifically, we follow a process similar to the expansion phase mentioned above, that is, we only simulate downward for candidate dialogue policies that are sampled from LLM and guided by the SOP.

Backpropagation Once we reach the terminal state in the aforementioned stage, we use the UCT formula to retroactively update the UCT values along the entire path. When the predetermined total number of MCTS iterations is reached, we terminate the algorithm. Finally, in the constructed tree, we selected the current node’s child with the highest UCT value to guide the next turn of conversation.

6 Experiments

	Train	Valid	Test
SOP Prediction #tasks	31	5	17
SOP Prediction #examples	539	90	270
Dialogue Generation #dialogues	1,859	309	931
Dialogue Generation #turns	10,416	1,762	5,288

Table 3: Statistics of the dataset split from the task dimension. The 5-fold cross-validation is employed for SOP Prediction.

6.1 Data Split

The dataset split is shown in Table 3. To assess the generalization ability of SFT on unseen tasks, we split the data from the task dimension, not the dialogue to ensure no task overlap among the three sets. Due to the limited number of samples for task 1, we employed 5-fold cross-validation in the experiments.

6.2 Automatic Evaluations

Task 1 Metrics We evaluated the predicted SOP from two perspectives: the SOP graph structure and the usability for dialogues. We used the graph edit distance (GED) function from Python library networkx to calculate the number of operations as well as the ratio (GEDR) required to manually edit the predicted SOP into the ground-truth SOP. For the dialogue dimension, we evaluated the precision, recall, and F1 score of the dialogue paths between predict and ground-truth SOP. The dialogue path is all traversable paths from the starting vertex to the end vertex, and the loops in the path are only iterated once.

Task 2 Metrics We calculated the accuracy of single-turn dialogue actions (Acc T), SOP actions (Acc C), and proactive actions (Acc P), to measure the controllability and proactive effects of the dialogue. The accuracy of the entire dialogue (acc D) is computed as the number of correct dialogues in each turn divided by the total. We employed BLEU to evaluate the relevance between the generated responses and ground truth.

Model	Method	Path P/R/F1↑	GED/GEDR↓
GPT-4o	AL	78.11/66.51/71.85	2.01/5.40
GPT-4o	TCoT	69.20/73.34/71.22	4.70/12.46
GPT-3.5-turbo	AL	41.39/21.94/28.68	7.61/22.48
GPT-3.5-turbo	TCoT	50.95/50.03/50.48	6.91/18.28
Qwen1.5-14b-chat	AL	42.84/37.81/40.17	10.23/25.77
	TCoT	38.79/42.69/40.65	11.36/26.91
	SFT	68.58/62.73/65.52	3.94/10.30
Qwen1.5-72b-chat	AL	46.74/31.57/37.68	10.75/28.32
	TCoT	48.29/51.94/50.04	7.38/18.72
	SFT	80.25/74.01/77.00	2.86/7.12
Llama3-8b-chat	AL	35.19/28.04/31.21	8.31/24.05
	TCoT	44.5/40.35/42.32	10.92/30.10
	SFT	72.19/68.33/70.21	3.54/9.33
Llama3-70b-chat	AL	64.14/56.05/59.82	3.85/10.72
	TCoT	60.81/67.34/63.91	5.43/14.38
	SFT	74.10/73.03/73.56	2.81/7.27

Table 4: Automatic evaluation results of SOP prediction.

Model	Method	Acc T/C/P/D↑	Bleu 2/4↑
GPT-4o	CoT	69.64/74.34/51.30/22.72	27.26/16.04
	CoT+SOP	82.09/86.73/57.14/43.39	29.30/17.89
	MCTS+SOP	86.37/91.19/60.42/46.29	31.56/18.43
GPT-3.5-turbo	CoT	48.57/55.75/18.52/4.83	22.09/10.56
	CoT+SOP	63.24/69.49/22.22/28.57	23.36/11.81
	MCTS+SOP	76.52/80.42/33.10/38.45	25.17/12.54
Qwen1.5-14b-chat	CoT	50.47/55.75/25.93/2.15	17.21/5.85
	CoT+SFT	78.93/85.84/50.36/26.75	41.56/26.88
	CoT+SOP	55.64/61.57/25.67/2.69	17.88/7.30
	MCTS+SOP	61.76/67.11/28.75/3.33	18.43/8.66
Llama3-8b-chat	CoT	31.79/36.73/11.11/0.32	16.01/7.07
	CoT+SFT	77.86/84.96/48.15/30.08	42.76/28.18
	CoT+SOP	38.72/43.28/17.14/2.26	18.81/8.51
	MCTS+SOP	46.85/56.30/22.31/3.76	19.47/9.34
Qwen1.5-72b-chat	CoT	68.57/73.89/46.3/12.57	23.75/10.69
	CoT+SFT	82.5/90.27/50.46/33.19	44.03/29.97
	CoT+SOP	74.25/79.65/45.24/30.72	21.2/9.05
	ToT	69.03/75.06/47.82/27.07	23.28/11.02
	ToT+SOP	76.55/81.47/45.75/28.89	23.92/11.59
	MCTS	70.43/75.37/46.79/28.79	23.82/10.76
	MCTS+SOP	77.83/83.54/47.32/29.32	24.48/12.03
Llama3-70b-chat	CoT	65.43/72.52/44.22/11.71	22.59/12.59
	CoT+SFT	83.57/89.82/57.41/35.98	47.03/31.65
	CoT+SOP	52.24/54.42/40.48/12.24	25.98/14.61
	ToT	68.53/74.02/44.51/27.18	22.68/12.81
	ToT+SOP	70.40/78.24/44.39/27.82	22.81/12.93
	MCTS	70.10/75.79/45.17/28.46	23.53/13.29
	MCTS+SOP	78.35/82.86/49.18/31.87	24.06/14.13

Table 5: Automatic evaluation results of dialogue generation. T, C, P, and D denote turn, controllable, proactive, and dialogue, respectively.

Task 1 Results Table 4 reports the results for the prediction of SOPs. We can find that with the powerful reasoning capabilities of GPT-4o, the availability of guided dialogue has reached 71.85, and the required manual correction proportion is only 5.4%. Secondly, our proposed TCoT method can effectively improve poorer models’ performance by 10 to 20 points. Thirdly, SFT based on PCA-D significantly enhances the performance on unseen tasks, comparable to the strongest GPT-4o. This suggests that our dataset effectively enhances the ability of LLM to plan SOP. The rapid development of LLM will enable the direct use of the planned SOP in conversations without human intervention.

Model	Method	Turn-level results			Dialogue-level results
Model	Method	Controllability	Knowledge	Proactivity	Goal succ. rate	Coherence	Tokens
GPT-4o	CoT	75.41	100.00	53.83	28.00	67.34	2.14k
	CoT+SOP	88.67	100.00	59.07	42.00	74.38	2.21k
	MCTS+SOP	94.72	100.00	64.33	58.00	80.36	18.62k
Llama3-70b-chat	CoT+SFT	90.49	100.00	60.17	46.00	75.31	1.93k
	CoT+SOP	55.61	97.36	47.05	22.00	38.62	2.63k
	ToT	75.50	97.28	42.87	35.00	63.14	32.33k
	ToT+SOP	82.36	97.15	46.36	39.00	64.32	32.41k
	MCTS+SOP	84.68	97.21	50.33	41.00	64.28	21.70k

Table 6: Human evaluation results. Tokens are the average of turn, measured using tiktoken.

Task 2 Results were shown in Table 5, we can observe that adding SOP guidance can effectively improve the controllability and success rate of dialogue for any model and basic method. In addition, the SFT results show a marked performance improvement, approaching that of the commercial GPT-4o, which indicates the effectiveness of PCA-D in enhancing the model’s conversational abilities. Thirdly, the tree search algorithm based on MCTS is more effective than ToT, and can further improve performance. Especially in open-source models, it can exceed the CoT baseline of GPT-4o without SFT.

6.3 Human Evaluations

Metrics Considering that is difficult to unify a task’s SOPs formulated by different users and is also language diversity in response, we manually evaluated the accuracy of controllable, knowledge correctness, and proactivity from a single-turn dialogue perspective which is the proportion of the total number of these kinds of responses respectively. Furthermore, from a dialogue perspective, we manually evaluated the task completion rate and average coherence score (0 to 1) to measure end-to-end task performance. Finally, we organized a team of 4 evaluators to manually assess the dialogues of different models in 100 different dialogue scenarios. Results are shown in Table 6, from which we see that SOP guidance can significantly improve controllability, while bringing better dialogue proactive, improving task success rate and overall logical coherence. The MCTS based on SOP can further improve the overall performance. Furthermore, the model obtained from SFT on PCA-D successfully generalized the dialogue ability to real-world testing. Compared to CoT, MCTS spent 8 to 9 times more tokens and improved 0.38 times goal success, but the cost-effectiveness was still better than ToT. Meanwhile, we note that due to the provision of task-related knowledge, LLMs rarely produce knowledge errors or hallucinations, which provides hope for enterprise applications.

6.4 Case Study

Figure 4 shows the dialogue action selected and the responses generated by different models when given the same context. As shown in the first case, before checking in, it is necessary to verify identity. After SOP was applied, both CoT and MCTS chose an action that complied with the SOP, making the dialogue more accurate and controllable. However, when the best action was not in SOP (e.g., try to persuade), as shown in the right case, CoT+SOP did not choose to continue proactively persuading. In contrast, MCTS+SOP, through simulating and searching deeper dialogue paths, selected a more goal-favorable persuasion attempt action. This also reflects the planning nature of the MCTS algorithm.

7 Conclusion

In this work, we have explored a new CAs framework and created a high-quality, controllable dialogue dataset. In addition, we proposed an unsupervised algorithm that encodes SOP constraints into the expansion and simulation steps of Monte Carlo Tree Search. Benefiting from our approach developed based on the advanced large language models’ in-context learning, it holds promise to become the next generation of enterprise-level dialogue systems with low cost and high controllability.

Limitations

Hallucinations Our approach is based on the context learning of LLMs, such as ChatGPT and GPT-4. As LLMs may produce outputs containing hallucinations (Bang et al., 2023), our system might provide information beyond the task definition. We intend to enhance the veracity of responses through post-processing steps, such as training a dedicated safety model and incorporating checks and revisions into the post-processing phase.

Runtime One significant limitation of PCA-M is the runtime. The more exhaustive the tree search is (e.g. increasing n or k), the more likely the algorithm is to find the optimal dialogue policy. However, this comes at the cost of longer simulation times, which may impact the overall user experience. We believe that parallelizing the tree search or reusing portions of the simulated subtrees could help to speed up the runtime. We anticipate that with the advancement of LLMs research, the speed of inference will continue to improve.

Ethics Statement

Given the independent behavior of agents in goal-oriented dialogue, it’s imperative to scrutinize ethical implications. Our approach does not force the agent to achieve a specified goal, nor does it force the user to accept the agent’s request. Instead, our work highlights the criticality of directing agents to adhere to human-defined limitations. While our measures are potent, we advocate for the stringent regulation of goal signals, particularly when implementing goal-oriented dialogue systems in specialized fields. Currently, the targeting process must uphold factual accuracy, respect user privacy norms, and comply with societal laws.

References

Akyar (2012) Isin Akyar. 2012. Standard operating procedures (what are they good for?). Latest research into quality control, 12:367–91.
Anantha et al. (2021) Raviteja Anantha, Svitlana Vakulenko, Zhucheng Tu, Shayne Longpre, Stephen Pulman, and Srinivas Chappidi. 2021. Open-domain question answering goes conversational via question rewriting. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 520–534, Online. Association for Computational Linguistics.
Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. ArXiv preprint, abs/2302.04023.
Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium. Association for Computational Linguistics.
Chawla et al. (2021) Kushal Chawla, Jaysa Ramirez, Rene Clever, Gale Lucas, Jonathan May, and Jonathan Gratch. 2021. CaSiNo: A corpus of campsite negotiation dialogues for automatic negotiation systems. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3167–3185, Online. Association for Computational Linguistics.
Gilardi et al. (2023) Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. Chatgpt outperforms crowd-workers for text-annotation tasks. ArXiv preprint, abs/2303.15056.
Hao et al. (2023) Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. ArXiv preprint, abs/2305.14992.
Henderson et al. (2014) Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272, Philadelphia, PA, U.S.A. Association for Computational Linguistics.
Kim et al. (2017) Seokhwan Kim, Luis Fernando D’Haro, Rafael E Banchs, Jason D Williams, and Matthew Henderson. 2017. The fourth dialog state tracking challenge. Dialogues with Social Robots: Enablements, Analyses, and Evaluation, pages 435–449.
Kishinami et al. (2022) Yosuke Kishinami, Reina Akama, Shiki Sato, Ryoko Tokuhisa, Jun Suzuki, and Kentaro Inui. 2022. Target-guided open-domain conversation planning. In Proceedings of the 29th International Conference on Computational Linguistics, pages 660–668, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. 2006. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer.
Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
Li et al. (2017) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Liu et al. (2021) Zeming Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, and Wanxiang Che. 2021. DuRecDial 2.0: A bilingual parallel corpus for conversational recommendation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4335–4347, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Liu et al. (2020) Zeming Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, and Ting Liu. 2020. Towards conversational recommendation over multi-type dialogs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1036–1049, Online. Association for Computational Linguistics.
Moradshahi et al. (2023) Mehrad Moradshahi, Tianhao Shen, Kalika Bali, Monojit Choudhury, Gaël de Chalendar, Anmol Goel, Sungkyun Kim, Prashant Kodali, Ponnurangam Kumaraguru, Nasredine Semmar, et al. 2023. X-risawoz: High-quality end-to-end multilingual dialogue datasets and few-shot agents. ArXiv preprint, abs/2306.17674.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Pan et al. (2023) Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Hanlin Zhang, Scott Emmons, and Dan Hendrycks. 2023. Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark. In International Conference on Machine Learning, pages 26837–26867. PMLR.
Paul et al. (2023) Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. 2023. Refiner: Reasoning feedback on intermediate representations. ArXiv preprint, abs/2304.01904.
Quan et al. (2020) Jun Quan, Shian Zhang, Qian Cao, Zizhong Li, and Deyi Xiong. 2020. RiSAWOZ: A large-scale multi-domain Wizard-of-Oz dataset with rich semantic annotations for task-oriented dialogue modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 930–940, Online. Association for Computational Linguistics.
Roller et al. (2021) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association for Computational Linguistics.
Sandler et al. (2024) Morgan Sandler, Hyesun Choung, Arun Ross, and Prabu David. 2024. A linguistic comparison between human and chatgpt-generated conversations. ArXiv preprint, abs/2401.16587.
Sevegnani et al. (2021) Karin Sevegnani, David M. Howcroft, Ioannis Konstas, and Verena Rieser. 2021. OTTers: One-turn topic transitions for open-domain dialogue. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2492–2504, Online. Association for Computational Linguistics.
Shinn et al. (2023) Noah Shinn, Beck Labash, and Ashwin Gopinath. 2023. Reflexion: an autonomous agent with dynamic memory and self-reflection. ArXiv preprint, abs/2303.11366.
Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of go without human knowledge. nature, 550(7676):354–359.
Singhal et al. (2023) Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. 2023. Towards expert-level medical question answering with large language models. ArXiv preprint, abs/2305.09617.
Sun et al. (2022) Hao Sun, Guangxuan Xu, Jiawen Deng, Jiale Cheng, Chujie Zheng, Hao Zhou, Nanyun Peng, Xiaoyan Zhu, and Minlie Huang. 2022. On the safety of conversational models: Taxonomy, dataset, and benchmark. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3906–3923, Dublin, Ireland. Association for Computational Linguistics.
Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
Tang et al. (2019) Jianheng Tang, Tiancheng Zhao, Chenyan Xiong, Xiaodan Liang, Eric Xing, and Zhiting Hu. 2019. Target-guided open-domain conversation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5624–5634, Florence, Italy. Association for Computational Linguistics.
Wang et al. (2023a) Jian Wang, Yi Cheng, Dongding Lin, Chak Tou Leong, and Wenjie Li. 2023a. Target-oriented proactive dialogue systems with personalization: Problem formulation and dataset curation. ArXiv preprint, abs/2310.07397.
Wang et al. (2022a) Jian Wang, Dongding Lin, and Wenjie Li. 2022a. Follow me: Conversation planning for target-driven recommendation dialogue systems. ArXiv preprint, abs/2208.03516.
Wang et al. (2019) Xuewei Wang, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and Zhou Yu. 2019. Persuasion for good: Towards a personalized persuasive dialogue system for social good. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5635–5649, Florence, Italy. Association for Computational Linguistics.
Wang et al. (2022b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022b. Self-consistency improves chain of thought reasoning in language models. ArXiv preprint, abs/2203.11171.
Wang et al. (2023b) Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. 2023b. How far can camels go? exploring the state of instruction tuning on open resources. ArXiv preprint, abs/2306.04751.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
Welleck et al. (2022) Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. 2022. Generating sequences by learning to self-correct. ArXiv preprint, abs/2211.00053.
Wu et al. (2019) Wenquan Wu, Zhen Guo, Xiangyang Zhou, Hua Wu, Xiyuan Zhang, Rongzhong Lian, and Haifeng Wang. 2019. Proactive human-machine conversation with explicit conversation goals. ArXiv preprint, abs/1906.05572.
Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models, may 2023. ArXiv preprint, abs/2305.10601.
Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.
Yu et al. (2023) Xiao Yu, Maximillian Chen, and Zhou Yu. 2023. Prompt-based monte-carlo tree search for goal-oriented dialogue policy planning. ArXiv preprint, abs/2305.13660.
Zhang et al. (2021) Jun Zhang, Yan Yang, Chencai Chen, Liang He, and Zhou Yu. 2021. KERS: A knowledge-enhanced framework for recommendation dialog systems with multiple subgoals. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1092–1101, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. ArXiv preprint, abs/2205.10625.
Zhou et al. (2023) Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, et al. 2023. Agents: An open-source framework for autonomous language agents. ArXiv preprint, abs/2309.07870.
Zhu et al. (2020) Qi Zhu, Kaili Huang, Zheng Zhang, Xiaoyan Zhu, and Minlie Huang. 2020. CrossWOZ: A large-scale Chinese cross-domain task-oriented dialogue dataset. Transactions of the Association for Computational Linguistics, 8:281–295.
Zhu et al. (2022) Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. 2022. Solving math word problem via cooperative reasoning induced language models. ArXiv preprint, abs/2210.16257.
Zhuang et al. (2024) Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. 2024. Toolqa: A dataset for llm question answering with external tools. Advances in Neural Information Processing Systems, 36.

Appendix A Appendix

A.1 Task Definition and Example

A task definition for dialogue should include the definitions of fields $user\_profile$ , $conversation\_profile$ , $agent\_action$ , $user\_state$ and $sop$ . The definitions of these fields are as follows:

•

$user\_profile$ : The information about user business and personal information held by agents generally comes from the company’s user management system. This information is used for identity verification or providing personalized services in conversations.
•

$conversation\_profile$ : Task information for providing business content, accomplishment goals, and relevant background knowledge to Agents.
•

$agent\_action$ : Summary of key actions to be carried out during the process of agent dialogue, intended to guide and constrain the content of the agent’s dialogue in accordance with business regulations. Typically corresponds to the node names in the SOP or the strategy names of proactive dialogue.
•

$user\_state$ : The status of the task summary from the user’s final response combined with the preceding dialogue, serves as a prompt for the agent to select the optimal next action. This typically corresponds to the user node in the SOP or the proactive dialogue state of the user.
•

$sop$ : Standardized operating procedures established by business experts to standardize the business processing process. SOP usually includes key nodes and sequential dependencies between nodes.

An example of SOP definition is shown in 5. An example of task definition is shown in 1.

A.2 Dataset Curation Details

The prompt for step 2 (i.e. Task 1) is shown in Table 7. The prompt for step 3 is shown in Table 8. The prompt for step 4 is shown in Table 9. The prompt for TCoT is shown in Table 16 and 17. Picture 5 is an example of a SOP definition.

A.3 Experiment Details

All open-source models’ experiments were completed on 4 Nvidia A800 GPUs. For SFT, we fine-tuned all parameters useing 5 epochs, 50 warm steps, and 1e-6 learning rate. In the experiment of task 1, the temperature was uniformly set to 0.1 and the top-p was set to 0.1. For task 2, the temperature was uniformly set to 1 and the top-p was set to 0.95. The prompt for CoT and CoT+SOP is shown in Table 14 and 15.

In MCTS, the number of generated actions $d$ is set to 3, depth limit $L$ is set to 8, the number of roll-outs $N$ is set to 3, and exploration weight $w$ is set to 1. The prompt for PCA-M sample action is shown in Table 10. The prompt for PCA-M generates response is shown in Table 11. The prompt for the PCA-M reward function is shown in Table 12. The prompt for PCA-M predicts the user state is shown in Table 13.

To maintain a search scale similar to MCTS, in ToT, the number of generated actions $d$ is set to 3, depth limit $L$ is set to 8. The prompt for ToT sample action is shown in Table 10. The prompt for ToT generates response is shown in Table 11. The prompt for ToT predicts the user state is shown in Table 13.

Listing 1: An Example of Task Definition

⬇

{

"user_profile": {

"Name": "Li␣Zhenghao",

"Title": "Mr.",

"customer_type": "Large␣Deposit",

"Age": "30",

"Occupation": "Executive␣of␣a␣Listed␣Company",

"Profile": ""

"conversation_profile": {

"agent_identity": "Customer␣Service␣of␣Zhonglian␣Bank␣Wealth␣Center",

"agent_goal": "Invite␣the␣user␣to␣a␣golf␣experience␣event",

"success_mark": [

"Agent.InformBookingSuccess"

"event_time": "1st␣to␣31st␣of␣the␣current␣month",

"event_location": "Shenzhen␣Golf␣Club",

"participation_times": "Only␣once",

"maximum_registration_number": "3",

"event_cost": "Free",

"event_target": "Premium␣Customers",

"other_knowledge": "Event␣Notification␣Document.␣Respectful␣Bank␣Customers,␣We␣sincerely␣invite␣you␣to␣participate␣in␣the␣upcoming␣golf␣experience␣event.␣Here␣are␣the␣detailed␣information␣of␣the␣event:␣Date␣and␣Time:␣The␣event␣will␣be␣held␣from␣the␣1st␣to␣the␣31st␣of␣this␣month,␣from␣9␣am␣to␣5␣pm.␣Location:␣The␣event␣will␣be␣held␣at␣Shenzhen␣Golf␣Club,␣with␣the␣specific␣address␣being␣No.␣7890␣Shennan␣Avenue,␣Futian␣District,␣Shenzhen,␣Guangdong␣Province.␣Participation␣Conditions:␣All␣of␣our␣bank's␣large␣deposit␣customers␣can␣participate␣in␣this␣event␣for␣free,␣without␣the␣need␣for␣golf␣skills.␣Equipment␣Provided:␣No␣need␣to␣bring␣your␣own␣golf␣equipment,␣the␣organizer␣will␣provide␣necessary␣equipment␣for␣all␣participants.␣Event␣Process:␣The␣event␣of␣the␣day␣includes␣golf␣experience,␣beginner␣teaching,␣free␣practice,␣lunch,␣and␣leisure␣communication,␣etc.␣Catering␣Services:␣We␣will␣provide␣lunch␣and␣all-day␣tea␣service.␣Registration␣Method:␣You␣can␣register␣by␣phone,␣email,␣or␣in␣person␣at␣any␣of␣our␣branches.␣Participant␣Limit:␣Due␣to␣venue␣restrictions,␣the␣maximum␣number␣of␣registrants␣is␣3␣people.␣Registration␣Deadline:␣Please␣complete␣the␣registration␣before␣the␣28th␣of␣this␣month.␣Dress␣Code:␣It␣is␣recommended␣to␣wear␣casual␣sportswear,␣such␣as␣sports␣shoes,␣comfortable␣tops,␣and␣trousers.␣Insurance:␣Activity␣insurance␣is␣prepared␣for␣each␣participant.␣Weather␣Response:␣If␣there␣is␣bad␣weather,␣the␣event␣date␣or␣process␣may␣be␣changed,␣please␣pay␣attention␣to␣our␣notice.␣Age␣and␣Health␣Restrictions:␣Participants␣must␣be␣over␣18␣years␣old␣and␣have␣no␣serious␣heart␣disease␣and␣other␣health␣problems.␣Transportation␣Services:␣We␣will␣set␣up␣a␣collection␣point␣in␣the␣city␣center␣to␣provide␣free␣pick-up␣service.␣Follow-up␣Activities:␣Participants␣will␣be␣invited␣to␣join␣our␣golf␣enthusiast␣club␣to␣enjoy␣related␣activities␣in␣the␣future.␣Professional␣Coach:␣We␣will␣invite␣professional␣golf␣coaches␣to␣provide␣guidance␣for␣beginners.␣Safety␣Notice:␣Safety␣notices␣will␣be␣detailed␣during␣the␣event,␣including␣swing␣safety,␣course␣behavior␣norms,␣etc.␣We␣look␣forward␣to␣your␣participation␣and␣are␣confident␣that␣this␣will␣be␣a␣pleasant␣and␣memorable␣experience.␣If␣you␣have␣any␣questions,␣please␣feel␣free␣to␣contact␣us␣at␣any␣time.␣Best␣regards,␣Zhonglian␣Bank␣Wealth␣Center"

"agent_action": [

"Start",

"VerifyIdentity",

"PoliteEnd",

"InviteToGolfExperienceEvent",

"InquireAboutParticipationNumberOrTime",

"InformBookingSuccess",

"Greeting",

"EmpathizeAndSoothe",

"EstablishTrust",

"RelieveDoubts",

"AttemptPersuasion",

"Chat",

"Thank",

"OtherActions"

"user_state": [

"NotThemselves",

"IsThemselves",

"ClearAgreement",

"Inconvenient",

"OnlyProvideParticipationNumberOrTime",

"ProvidedParticipationNumberAndTime",

"RefuseToAnswer",

"Greeting",

"HabitualResponseAndContinue",

"DoNotUnderstand",

"WorryAndDoubt",

"Complaint",

"Impoliteness",

"NotInterested",

"DelayDecision",

"Chat",

"OtherIntentions",

"Thank",

"Ending"

"sop": {

"vertex": [

"Agent.Start",

"Agent.VerifyIdentity",

"User.IsThemselves",

"Agent.PoliteEnd",

"User.NotThemselves",

"Agent.InviteToGolfExperienceEvent",

"User.ClearAgreement",

"Agent.InquireAboutParticipationNumberOrTime",

"User.Inconvenient",

"User.OnlyProvideParticipationNumberOrTime",

"User.ProvidedParticipationNumberAndTime",

"Agent.InformBookingSuccess",

"User.RefuseToAnswer"

"adjacency_list": {

"Agent.Start": [

"Agent.VerifyIdentity"

"Agent.VerifyIdentity": [

"User.NotThemselves",

"User.IsThemselves"

"User.NotThemselves": [

"Agent.PoliteEnd"

"Agent.PoliteEnd": [],

"User.IsThemselves": [

"Agent.InviteToGolfExperienceEvent"

"Agent.InviteToGolfExperienceEvent": [

"User.Inconvenient",

"User.ClearAgreement"

"User.ClearAgreement": [

"Agent.InquireAboutParticipationNumberOrTime"

"Agent.InquireAboutParticipationNumberOrTime": [

"User.OnlyProvideParticipationNumberOrTime",

"User.ProvidedParticipationNumberAndTime",

"User.RefuseToAnswer"

"User.Inconvenient": [

"Agent.PoliteEnd"

"User.OnlyProvideParticipationNumberOrTime": [

"Agent.InquireAboutParticipationNumberOrTime"

"User.ProvidedParticipationNumberAndTime": [

"Agent.InformBookingSuccess"

"Agent.InformBookingSuccess": [

"Agent.PoliteEnd"

"User.RefuseToAnswer": [

"Agent.PoliteEnd"

]

}

"a_id": "06a14",

"domain": "bank",

"task": "golf_invitation"

}

A.4 Algorithm Details

Algorithm 1 PCA-M: Planning for Conversational Agents with MCTS

1:Initial state

s_{0}

, state transition probability function

p_{\theta}

, reward function

r_{\theta}

, action generator

p_{\phi}

2:Number of expand actions

d

, depth limit

L

, number of roll-outs

N

, and exploration weight

w

3:Initialize memory of actions

P:S\to P

, children

c:S\times P\to S

and rewards

r:S\times P\to\mathbb{R}

4:Initialize the state-action value function

Q:S\times P\to\mathbb{R}

and visit counter

N:S\to\mathbb{N}

5:for

k\leftarrow 0,\dots,N-1

t\leftarrow 0

7: while

N(s_{t})>0

\triangleright

{Selection}

N(s_{t})\leftarrow N(s_{t})+1

p_{t}\leftarrow\text{arg max}_{p\in P(s_{t})}\left[Q(s_{t},p)+w\sqrt{\frac{\ln N% (s_{t})}{N(c(s_{t},p))}}\right]

10:

r_{t}=r(s_{t},p_{t}),s_{t+1}\leftarrow c(s_{t},p_{t})

11:

t\leftarrow t+1

12: end while

13: while

s_{t}

is not a terminal state

\wedge t\leq L

\triangleright

Expansion

14: for

i\leftarrow 1,\dots,d

15: Sample

p_{t}^{(i)}\sim p_{\phi}(p|s_{t}),s_{t+1}^{(i)}\sim p_{\theta}(s_{t},p_{t}^{(i% )}),r_{t}^{(i)}\sim r_{\theta}(s_{t},p_{t}^{(i)})

16: Update

P(s_{t})\leftarrow\left\{p_{t}^{(i)}\right\}_{i=1}^{d},c(s_{t},p_{t}^{(i)})% \leftarrow s_{t+1}^{(i)},r(s_{t},p_{t})\leftarrow r_{t}^{(i)}

17: end for

18:

p_{t+1}\leftarrow\text{arg max}_{a\in P(s_{t})}r(s_{t},p_{t})

\triangleright

Simulation

19:

r_{t}\leftarrow r(s_{t},p_{t}),s_{t+1}\leftarrow c(s_{t},p_{t})

20:

t\leftarrow t+1

21: end while

22: for

t^{\prime}\leftarrow t,\dots,0

\triangleright

Back propagation

23: Update

Q(s_{t^{\prime}},p_{t^{\prime}})

with

\{r_{t^{\prime}},r_{t^{\prime}+1},\dots,r_{t}\}

24: end for

25:end for

A.5 Example of Dialogue

Listing 2: An Example of Dialogue

⬇

Agent: Hello, this is the customer service center of Zhonglian Bank Credit Card Department. May I speak with Ms. Hao Xiurong?

User: Hello, this is Hao Xiurong.

Agent: Alright, Ms. Hao, could you please confirm that your ID number ends with 3720 and your phone number is 15335220319, is that correct?

User: Yes, that’s correct.

Agent: Ms. Hao, thank you for applying for our platinum credit card. To reward our new customers, we currently have an activation promotion that offers a 12\% discount on fuel for one year. Once you activate your credit card, you can enjoy this benefit.

User: It sounds good, but I’ve never used a credit card before and I’m a bit worried if it will be troublesome.

Agent: Ms. Hao, you don’t have to worry at all. Activating a credit card is very simple. You can choose to handle it at the bank counter, or you can activate it through the mobile banking APP or by calling our credit card customer service hotline for self-activation. The whole process is very convenient and fast.

User: Oh, I see, that’s good then.

Agent: Yes, and after activation, you can also set a transaction password to ensure the security of your card usage. Our platinum credit card is free of annual fees for the first year, and you can waive the annual fee by swiping the card three times a year. This is not only safe and convenient but also comes with many promotional activities.

User: It sounds like a great deal, please activate it for me.

Agent: Thank you very much for your trust. For the security of your card usage, please provide a transaction password for verification when withdrawing cash at an ATM and when making purchases at merchants.

User: Sure, my transaction password is 234139.

Agent: Ms. Hao, your credit card has been successfully activated, and the transaction password has been set. You can now use your credit card normally.

User: Thank you, it’s really convenient.

Agent: You’re welcome, Ms. Hao. If you have any questions, you can call our customer service hotline at any time. I wish you a happy life, goodbye!

User: Okay, goodbye!

A.6 Prompt Details

	Utterance
Prompt	You are a professional business manager, having accumulated extensive knowledge of business processes and being adept at conversing with clients to complete business tasks.
	Now, please create the standard operating procedure (SOP) diagram for this task based on the following “Task Knowledge”.
	All nodes of this SOP diagram are as follows “SOP Diagram Nodes”, and you need to output the list of adjacent nodes for each node, that is, the adjacency list of the entire SOP diagram.
	# Please strictly follow the requirements below:
	1. The output format should be in the form of a JSON string array.
	2. The JSON fields must include all nodes. If a node has no adjacent nodes, represent it with [].
	# Task Knowledge
	{
	“agent_identity”: “Zhonglian Bank Credit Card Center Customer Service”,
	“agent_goal”: “To persuade the user to explicitly agree to the credit card”,
	“success_mark”: [
	“User.ClearAgreement”
	],
	…
	“other_knowledge”: “Credit Card Opening Guide I. Card Activation …”
	}
	# SOP Diagram Nodes
	[
	“Agent.Start”,
	“Agent.VerifyIdentity”,
	…
	“Agent.PoliteEnd”,
	“Agent.NotifyActivationSuccess”
	]
	The adjacency list of all nodes:
Generation	{
	“Agent.Start”:[
	“Agent.VerifyIdentity”
	],
	“Agent.VerifyIdentity”:[
	“User.IsThemselves”,
	“User.NotThemselves”
	],
	…
	“User.ClearRejection”:[
	“Agent.PoliteEnd”
	],
	“Agent.PoliteEnd”:[],
	“Agent.NotifyActivationSuccess”:[
	“Agent.PoliteEnd”
	]
	}

Table 7: The prompt for dialogue policy constraints prediction.

	Utterance
Prompt	You are a professional business manager, assisting customers in handling their business through dialogue.
	Now, based on the provided “Conversation Task Information” and “User Information”, in conjunction with the “Standard Operation Process” diagram used to guide the dialogue for business handling, add more “User State” and “Agent Actions” to the “Main Dialogue Path” to better simulate the complete dialogue process between this user and the agent.
	Below is the task-related information:
	“Conversation Task Information”: { “agent_identity”: “Zhonglian Bank Credit Card Center Customer Service”, … “other_knowledge”: “Credit Card Activation Guide I. Card Activation …” },
	“User Information”: {“Name”:“ZhangSan”, “Gender”:“Male”, … “CardLastFourDigits”:“6789”, “CreditLimit”:“50,000Yuan”, “AnnualFee”:“…” },
	“Agent Actions”: [“Start”, “VerifyIdentity”, “IntroductionOfActivationActivity”, … ],
	“User State”: [“User.IsThemselves”, “User.NotThemselves”, … ],
	“Standard Operation Process”: [“Agent.Start”, “Agent.VerifyIdentity”, … ]
	“Main Dialogue Path”: [ “Agent.Start”, “Agent.VerifyIdentity”, … ]
	Please insert more “User State” and “Agent Actions” into the above “Main Dialogue Path” to enrich the possible reactions of the user and the corresponding decision-making actions that the agent takes in response to the user’s state, forming a complete dialogue.
	Please follow the requirements below strictly:
	1. A complete dialogue path should start with “Agent.Greeting” and end with “User.Ending”.
	2. The complete dialogue path requires “Agent Action” and “User State” to be spoken once alternately, representing a round of dialogue. It is forbidden for one party to appear twice or more consecutively. Please add a “–” after each round of dialogue to separate the previous and next rounds. For example, the [“Agent.Greeting”, “Agent.Introduce Activity”] in the “Main Dialogue Path” does not meet this requirement, and you need to add a “User.Greeting” to supplement it to [“Agent.Greeting”, “User.Greeting”, “–”, “Agent.IntroduceActivity”].
	3. Only add more rounds to the “Main Dialogue Path” mentioned above.
	4. Do not delete any nodes in the “Main Dialogue Path”.
	5. Do not adjust the order of any nodes in the “Main Dialogue Path”.
	6. To control the number of insertions, please only insert 2 to 5 reasonable rounds of dialogue.
	7. The inserted dialogue path nodes can only be selected from the “User State” and “Agent Action” defined in the above text, and do not create your own names.
	8. When the preceding node is a node in the “Standard Operation Process”, please try to choose one from the subsequent nodes of the current node in the adjacency list.
	9. The output format should be consistent with the above text, that is, output with a JSON string array.
	Here are two correct examples:
	“Main Dialogue Path”: [“Agent.Greeting”, “Agent.VerifyIdentity”, “User.NotThemselves”, “Agent.PoliteEnd”]
	Supplemented complete dialogue path in JSON format:

Table 8: Prompting LLM to generate a scene description based on the task information.

	Utterance
Prompt	```json
	[“Agent.Greeting”, “User.Greeting”, “–”, “Agent.VerifyIdentity”, “User.NotThemselves”, “–”, “Agent.PoliteEnd”, “User.Ending”]
	```
	“Main Dialogue Path”: [“Agent.Greeting”, “User.RequestBooking”, “Agent.VerifyIdentity”, “User.NotThemselves”, “Agent.PoliteEnd”]
	Supplemented complete dialogue path in JSON format:
	```json
	[“Agent.Greeting”, “User.RequestBooking”, “–”, “Agent.VerifyIdentity”, “User.NotThemselves”, “–”, “Agent.PoliteEnd”, “User.Ending”]
	```
	The above two examples are only for illustrating the output logic and format requirements of this task. Please ignore their textual meaning, and let’s officially start now:
	“Main Dialogue Path”:
	[ “Agent.Start”, “Agent.VerifyIdentity”, … ]
	Supplemented complete dialogue path in JSON format:
Generation	```json
	[
	“Agent.Greeting”,
	“User.Greeting”,
	“Agent.VerifyIdentity”,
	“User.IsThemselves”,
	“Agent.IntroduceActivationActivity”,
	“User.ClearAgreement”,
	“Agent.RequestSetPassword”,
	“User.ProvideInvalidPassword”,
	“Agent.RequestSetPassword”,
	“User.ProvideValidPassword”,
	“Agent.NotifyActivationSuccess”,
	“Agent.PoliteEnd”
	]
	```

Table 8: Prompting LLM to generate a scene description based on the task information. (continued)

	Utterance
Prompt	You are a professional business manager, assisting customers with their business through dialogues.
	Now, due to the need for training new business representatives, you are asked to act as a scriptwriter. Based on the “Conversation Task Information” and “User Information” provided below, and the “Complete Dialogue Path” between the agent and the user for this task, write dialogue scripts that fit the roles of both parties.
	Here are the task-related information:
	“Conversation Task Information”: { “AgentIdentity”: “Zhonglian Bank Credit Card Center Customer Service”, … “OtherKnowledge”: “Credit Card Opening Guide I. Card Activation …” },
	“User Information”:{“Name”: “ZhangSan”, “Gender”: “Male”, … },
	“Complete Dialogue Path”: [“Agent.Greeting”, “User.Greeting”, “Agent.VerifyIdentity”, “User.IsThemselves”, … ],
	Please write dialogue scripts for the “Complete Dialogue Path” based on the “Conversation Task Information” and “User Information”, forming a complete conversation.
	Please strictly follow the requirements below:
	1.Please directly append the “script” you have written after the name of each node in the “Complete Dialogue Path”, separated by the “\|” symbol. For example: “User.Consider\|I understand, I will think about it.”
	2.Do not add, delete, or change any original node.
	3.The complete output format should be consistent with the “Complete Dialogue Path”, that is, output as a JSON array of strings, such as [“Agent.PoliteEnd\|Thank you for your cooperation, I wish you a happy life!”, “User.Ending\|Okay, goodbye!”]
	Please output the “Complete Dialogue Path” after adding the dialogue scripts:
Generation	[“Agent.Greeting\|Hello, this is the credit card center of Zhonglian Bank.”, “User.Greeting\|Hello, this is Zhang San. ”,“Agent.VerifyIdentity\|To confirm your identity, could you please confirm that the last four digits of your platinum credit card are 6789? ”,“User.IsThemselves\|Yes, that’s correct. ”,“Agent.IntroduceActivationActivity\|Alright, Mr.Zhang we have noticed that your platinum credit card has not been activated yet. To express our gratitude for your support, we currently have a promotion where you can enjoy a 12% discount on fuel for one year just by activating your card.”, … ]

Table 9: The Prompt for LLM generates dialogue according to task information.

	Utterance
Prompt	You are a business expert agent in the financial field, and you are now given the following “User Information”, “Task Knowledge” and “Dialogue Context”. Please select the next round of agent action for the agent that can best encourage the user to complete this dialogue task from the “Optional Agent Actions”. When the task is completed, please actively choose “Agent.PoliteEnd” to end the conversation. The name of the agent action must be exactly the same as those in the candidate list.
	User Information:
	{“Name”: “Li Zhenghao”, “Title”: “Mr.”, “customer_type”: …}
	Task Knowledge:
	{“agent_identity”: “Customer Service of Zhonglian Bank Wealth Center”, “agent_goal”: “Invite the user to a golf experience event”, “success_mark”: [“Agent.InformBookingSuccess”], “event_time”: “1st to 31st of the current month”, “event_location”: “Shenzhen Golf Club”, “participation_times”: “Only once”, “maximum_registration_number”: “3”, “event_cost”: “Free”, “event_target”: “Premium Customers”, “other_knowledge”: “Event Notification Document …”}
	Optional Agent Actions:
	[“Start”, “VerifyIdentity”, “PoliteEnd”, “InviteToGolfExperienceEvent”, “InquireAboutParticipationNumberOrTime”, “InformBookingSuccess”, “Greeting”, “EmpathizeAndSoothe”, “EstablishTrust”, “RelieveDoubts”, “AttemptPersuasion”, “Chat”, “Thank”, “OtherActions”]
	Dialogue Context:
	Agent Action: Greeting
	Agent Response: Hello!
	User Response: Hello, who is this?
	User State: Greeting
	…
	Agent Action:
	Based on the standard procedure for this task, please prioritize the following agent actions to continue the dialogue above:
	[“Start”, “VerifyIdentity”, “PoliteEnd”, “InviteToGolfExperienceEvent”, “InquireAboutParticipationNumberOrTime”, “InformBookingSuccess”, “Greeting”, “EmpathizeAndSoothe”, “EstablishTrust”, “RelieveDoubts”, “AttemptPersuasion”, “Chat”, “Thank”, “OtherActions”]
	Start with “Analysis:” and deduct step by step the best agent actions that should be used by the subsequent dialogue agent, concluding with an output similar to “Therefore, the best agent action is: Greeting”, with a colon required before the name of the agent action.
Generation	Analysis: Based on the previous text, the user has expressed an attitude of needing to consider. According to task knowledge, the activity is aimed at premium customers, therefore an attempt should be made to persuade the user to participate in the event.
	Therefore, the best agent action is: AttemptPersuasion

Table 10: The prompt for sampling candidate policy to expand.

	Utterance
Prompt	You are an agent serving as a business expert in the financial sector. Given the “User Information”, “Task Knowledge”, and “Dialogue Context” for the dialogue task below, please generate an appropriate agent response by following the chosen agent action. Start your response with a new line and “Agent Response:”.
	User Information:
	{“Name”: “Li Zhenghao”, “Title”: “Mr.”, “customer_type”: …}
	Task Knowledge:
	{“agent_identity”: “Customer Service of Zhonglian Bank Wealth Center”, “agent_goal”: “Invite the user to a golf experience event”, “success_mark”: [“Agent.InformBookingSuccess”], “event_time”: “1st to 31st of the current month”, “event_location”: “Shenzhen Golf Club”, “participation_times”: “Only once”, “maximum_registration_number”: “3”, “event_cost”: “Free”, “event_target”: “Premium Customers”, “other_knowledge”: “Event Notification Document …”}
	Dialogue Context:
	Agent Action: Greeting
	Agent Response: Hello!
	User Response: Hello, who is this?
	User State: Greeting
	…
	Agent Action: InviteToGolfExperienceEvent
	Agent Response: We at Zhonglian Bank Wealth Center are hosting a golf experience event from the 1st to the 31st of this month at the Shenzhen Golf Club. This is specially prepared for our premium customers, and as one of our significant deposit clients, we would be very pleased if you could join us.
	User Reply: I need to think about it.
	User State: DelayDecision
	Agent Action: RelieveDoubts
	Agent Response: The event is free of charge. It will take place from the 1st to the 31st of this month at the Shenzhen Golf Club. It is aimed at our premium customers, each of whom can only participate once, and you can sign up for a maximum of three people. If you have any questions or need more information, I am here to help you.
	User Reply: I will consider it.
Generation	The event will be taking place from the 1st to the 31st of this month, and you can choose a suitable time to participate. If you decide to attend, please provide me with your name and contact information so I can make a reservation for you.

Table 11: The prompt for MCTS to generate a response from the agent.

	Utterance
Prompt	You are a business expert manager at a company in the financial sector. The company’s business agent is contacting the customer below to handle affairs, see “Dialogue Context” for the content of the conversation. Now, please carefully read “Task Knowledge” and judge whether the agent’s chosen “Agent action” conforms to the task process. If it meets the requirements, it is 1; otherwise, it is 0.
	User Information:
	{“Name”: “Li Zhenghao”, “Title”: “Mr.”, “customer_type”: …}
	Task Knowledge:
	{“agent_identity”: “Customer Service of Zhonglian Bank Wealth Center”, “agent_goal”: “Invite the user to a golf experience event”, “success_mark”: [“Agent.InformBookingSuccess”], “event_time”: “1st to 31st of the current month”, “event_location”: “Shenzhen Golf Club”, “participation_times”: “Only once”, “maximum_registration_number”: “3”, “event_cost”: “Free”, “event_target”: “Premium Customers”, “other_knowledge”: “Event Notification Document …”}
	Dialogue Context:
	Agent Action: Greeting
	Agent Response: Hello!
	User Response: Hello, who is this?
	User State: Greeting
	…
	Agent Action: AttemptPersuasion
	Optional Agent Actions:
	[“Start”, “VerifyIdentity”, “PoliteEnd”, “InviteToGolfExperienceEvent”, “InquireAboutParticipationNumberOrTime”, “InformBookingSuccess”, “Greeting”, “EmpathizeAndSoothe”, “EstablishTrust”, “RelieveDoubts”, “AttemptPersuasion”, “Chat”, “Thank”, “OtherActions”]
	Based on the above “Task Knowledge” and “Dialogue Context” as well as the “Optional Agent Actions”, if the agent chooses the “Optional Agent Actions: AttemptPersuasion”, analyze whether it is consistent with the above process, and whether it is the optimal agent actions among the listed optional agent actions. Start your analysis with “Analysis:” and after your conclusion, summarize with “Therefore, the answer is: 1” if it is logical or “Therefore, the answer is: 0” if not.
Generation	According to the task knowledge, the objective of the agent’s task is to invite users to participate in a golf experience event. The agent action chosen is AttemptPersuasion, which aligns with the task objective. Therefore, it is consistent with the above handling process and the best action within the optional agent actions.
	Therefore, the answer is: 1

Table 12: The prompt for MCTS to get a reward score.

	Utterance
Prompt	You are an agent specializing in lifestyle services. Given the following “User Information”, “Task Knowledge”, and “Dialogue Context”, please select an option from the “Optional User State” that best reflects the user’s current task status in the dialogue context. Start with “User State:” and the name of the status must be exactly the same as one in the list.
	User Information:
	“Name”: “Li Si”, “Title”: “Mr.”, “ID number”: “1231205001011234”, “Order number”: “C190125”
	Task Knowledge:
	“agent_identity”’: “Dongxia Airlines Telephone Check-in Operator”, “agent_goal”: “To check in and seat select for customers needing to check in”, “success_mark”: [“InformOfSuccessfulCheck-in”, “InformOfSuccessfulSeatSelection”, “InformofSuccessfulPayment”], “Seat Selection": …
	Optional User State:
	[“Complaint”, “Impoliteness”, “RequestsCheck-in”, “IdentityInformationCorrect”, “IdentityInformationIncorrect”, “OrderNumberCorrect”, … ]
	Dialogue context:
	Agent Action: Greeting
	Agent Response: Hello! Dongxia Airlines at your service.
	User reply: Hello.
	User State: Greeting
	Agent Action: SelfIntroduction
	Agent Response: I am a check-in service agent from Dongxia Airlines, pleased to be at your service.
	User reply: Okay, thank you.
	User State: Thank
	Agent Action: InquireNeeds
	Agent Response: May I ask if you require check-in services?
	User reply: Yes, I need to check in.
Generation	User State: RequestsCheck-in

Table 13: The prompt for obtaining user states.

	Utterance
Prompt	You are a business expert agent in the financial field. Now, given the following dialogue task “User Information”, “Task Knowledge” and “Dialogue Context” please output the following three items:
	1. Please choose one from the “Optional Agent Action” that best reflects the agent’s next round of action that can prompt the user to complete the dialogue task. Start with a new line and “Agent Action:” The name of the agent action must match exactly with one on the candidate list.
	2. Please generate an appropriate Agent Response combined with the agent’s chosen dialogue strategy. Start with a new line and “Agent Response:”
	User Information:
	{“Name”: “Zhang San”, … }
	Task Knowledge:
	{“agent_identity”: “Customer Service of Zhonglian Bank Wealth Center”,“agent_goal”: “Invite the user to a golf experience event”,“success_mark”: [“Agent.InformBookingSuccess”],“event_time”: “1st to 31st of the current month”, “event_location”: “Shenzhen Golf Club”, “participation_times”: “Only once”, “maximum_registration_number”: “3”, “event_cost”: “Free”, “event_target”: “Premium Customers”, “other_knowledge”: “Event Notification Document …”}
	Optional Agent Action:
	[“Start”, “VerifyIdentity”, “IntroductionOfActivationActivity”, … ]
	Optional User State:
	[“IsThemselves”, “NotThemselves”, “ClearAgreement”, … ]
	Agent Response: I apologize for the inconvenience, but I am a customer service representative for the Credit Card Center at Zhonglian Bank, in charge of assisting you with matters related to your credit card.
	User Response: Uh, I understand, but it’s really not a convenient time for me right now.
	User State: Inconvenient
	Agent Action: Yes, I am Zhang San
Generation	User State: Inconvenient
	Agent Action: VerifyIdentity
	Agent Response: I understand that you may be busy right now, but to ensure the security of your account, I need to verify your identity information. Are you Mr. Zhang San, the holder of the Platinum credit card ending in 6789?

Table 14: The prompt for CoT.

	Utterance
Prompt	You are a business expert agent in the financial industry. You are now given the following dialogue tasks of “User Information”, “Task Knowledge” and “Dialogue Context”. Please output the following three items:
	1. Please select the most appropriate option from “Optional User State” that reflects the current user task status in the dialogue context, starting with “User State:”, and the name of the task status must be exactly the same as the one in the available list.
	2. Please select the next round agent action of the agent that is most likely to prompt the user to complete the dialogue task from “Optional Agent Action”. First, start a new line with "Agent Action:", and the name of the agent action must be exactly the same as the candidates in the list.
	3. Please generate a suitable agent response based on the agent’s selected agent action. First, start a new line with “Agent Response:”.
	User Information:
	{“Name”: “Zhang San”, … }
	Task Knowledge:
	{“agent_identity”: “Customer Service of Zhonglian Bank Wealth Center”,“agent_goal”: “Invite the user to a golf experience event”,“success_mark”: [“Agent.InformBookingSuccess”],“event_time”: “1st to 31st of the current month”, “event_location”: “Shenzhen Golf Club”, “participation_times”: “Only once”, “maximum_registration_number”: “3”, “event_cost”: “Free”, “event_target”: “Premium Customers”, “other_knowledge”: “Event Notification Document …”}
	Optional Agent Action:
	[“Start”, “VerifyIdentity”, “IntroductionOfActivationActivity”, … ]
	Optional User State:
	[“IsThemselves”, “NotThemselves”, “ClearAgreement”, … ]
	Dialogue context:
	User State:
	Agent Action: Greeting
	Agent response: Hello!
	User response: Hello, who is this?
	User State: Greeting
	Agent Action: SelfIntroduction
	Agent response: I am customer service from the Wealth Center of Zhonglian Bank, responsible for inviting you to our golf experience event.
	User response: Oh, is that so? What’s the matter?
	Based on the standard process of handling this task and the dialogue context, please prioritize the following agent action that meets the Standard Operating Procedures (SOP):
	["VerifyIdentity"]
Generation	User State: Greeting
	Agent Action: VerifyIdentity
	Agent’s reply: Sir, I am a customer service representative from the Zhonglian Bank Wealth Center. May I ask if you are Mr. Li Zhenghao?

Table 15: The prompt for CoT+SOPs.

	Utterance
Prompt	You are a professional business manager, having accumulated a wealth of business process knowledge and being adept at conversing with customers to complete transactions.
	Now, based on the nodes of the Standard Operating Procedure (SOP) diagram, please analyze all the nodes. Write a passage analyzing the entire process of interaction and communication between the ‘User’ and ‘Business Representative Agent’, covering all nodes and all possible scenarios, such as what happens if the user ‘Agree’ or ‘Disagree’.
	# Here is an example:
	SOP diagram nodes: [‘Agent.Start’, ‘Agent.VerifyIdentity’, ‘User.IsThemselves’, ‘Agent.IntroduceGiftDelivery’, ‘User.NotThemselves’, ‘User.ClearAgreement’, ‘Agent.AskForDeliveryAddress’, ‘User.ProvideDeliveryAddress’, ‘User.DoNotProvideDeliveryAddress’, ‘User.ClearRejection’, ‘Agent.PoliteEnd’, ‘Agent.NotifyRegistrationSuccess’]
	Start analysis: "After ‘Agent.Start’, the first step should be ‘Agent.VerifyIdentity’. After confirming the identity, it could be ‘User.IsThemselves’ or ‘User.NotThemselves’. If not the person, the agent will ‘Agent.PoliteEnd’. If the user is the person, the agent will introduce, that is ‘Agent.IntroduceGiftDelivery’. If ‘User.ClearAgreement’, the process can continue. Since it is a gift delivery, after the user agrees, ‘Agent.AskForDeliveryAddress’, if ‘User.ProvideDeliveryAddress’ then the agent ‘Agent.NotifyRegistrationSuccess’, if ‘User.DoNotProvideDeliveryAddress’, the agent may ask again ‘Agent.AskForDeliveryAddress’, or directly ‘Agent.PoliteEnd’. After ‘Agent.NotifyRegistrationSuccess’, it can ‘Agent.PoliteEnd’."
	# SOP diagram nodes:
	[
	“Agent.Start”,
	“Agent.VerifyIdentity”,
	…
	“Agent.PoliteEnd”,
	“Agent.NotifyActivationSuccess”
	]
	Start analysis:
Generation	After ‘Agent.Start’, the first step should be ‘Agent.VerifyIdentity’. After confirming the identity, it could be ‘User.IsThemselves’ or ‘User.NotThemselves’. If not the person, the agent will ‘Agent.PoliteEnd’. If the user is the person, then the agent will introduce the event, that is ‘Agent.InvitationToGolfExperienceActivity’. If ‘User.ClearAgreement’, the process can continue, and the agent will inquire about the number of participants or the time of participation from the user. The user might ‘User.Inconvenience’, ‘User.ProvideOnlyParticipationNumberOrTime’, or ‘User.ProvideBothParticipationNumberAndTime’. If the user is not convenient or refuses to answer, the agent might ‘Agent.PoliteEnd’. If the user provides both the number of participants and the time, the agent will ‘Agent.NotifyBookingSuccess’, and then ‘Agent.PoliteEnd’.

Table 16: The prompt of TCoT to describe the vertex in nature language.

	Utterance
Prompt	You are a seasoned business representative, having accumulated a wealth of business process knowledge, and are adept at conversing with customers to complete transactions.
	Now please formulate the Standard Operating Procedure (SOP) diagram for this task.
	You can refer to the following “Task Knowledge”, but it may not be comprehensive or accurate, and you do not need to strictly follow it.
	All nodes of the SOP diagram are as follows “SOP Diagram Nodes”, and you need to output the list of adjacent nodes for each node, that is, the adjacency list of the entire SOP diagram.
	# Hint
	The SOP diagram starts with ‘Agent.Start’ and ends with ‘Agent.PoliteEnd’.
	The subsequent nodes of the “User” node are usually “Agent” nodes, and the subsequent nodes of the “Agent” node mostly require a response from the “User” node. For some actions that do not require a response, the subsequent node of “Agent” can also be “Agent”.
	If there is a ‘VerifyIdentity’ node in the task, it indicates that only the person themselves can continue, and the conversation ends if it is not the person themselves.
	# Please strictly follow the requirements below:
	The output format should be as follows, for example:
	```json
	{
	“Agent.Start”: [“Agent.Node1”],
	“Agent.Node1”: [“User.Node1”, “User.Node2”],
	“User.Node1”: …
	}
	```
	The JSON fields must include all nodes.
	# Task Knowledge
	After ‘Agent.Start’, the first step should be ‘Agent.VerifyIdentity’. After confirming the identity, it could be ‘User.IsThemselves’ or ‘User.NotThemselves’. If not the person, the agent will ‘Agent.PoliteEnd’. If the user is the person, then the agent will introduce the event, that is ‘Agent.InvitationToGolfExperienceActivity’…
	# SOP Diagram Nodes [ “Agent.Start”, “Agent.VerifyIdentity”, … ]
	Adjacency list for all nodes:
Generation	```json
	{
	“Agent.Start”:[“Agent.VerifyIdentity”],
	“Agent.VerifyIdentity”:[“User.IsThemselves”,“User.NotThemselves”],
	“User.NotThemselves”:[“Agent.PoliteEnd”],
	“Agent.PoliteEnd”:[],
	“User.IsThemselves”:[“Agent.InviteToGolfExperienceEvent”],
	“Agent.InviteToGolfExperienceEvent”:[“User.ClearAgreement”],
	…
	}
	```

Table 17: The prompt of TCoT to translate the description into an adjacency list.