LLM Roleplay: Simulating Human-Chatbot Interaction

Hovhannes Tamoyan Ubiquitous Knowledge Processing Lab (UKP Lab)
Department of Computer Science and Hessian Center for AI (hessian.AI)
Technical University of Darmstadt
www.ukp.tu-darmstadt.de
Hendrik Schuff Ubiquitous Knowledge Processing Lab (UKP Lab)
Department of Computer Science and Hessian Center for AI (hessian.AI)
Technical University of Darmstadt
www.ukp.tu-darmstadt.de
Iryna Gurevych Ubiquitous Knowledge Processing Lab (UKP Lab)
Department of Computer Science and Hessian Center for AI (hessian.AI)
Technical University of Darmstadt
www.ukp.tu-darmstadt.de
Abstract

The development of chatbots requires collecting a large number of human-chatbot dialogues to reflect the breadth of users’ sociodemographic backgrounds and conversational goals. However, the resource requirements to conduct the respective user studies can be prohibitively high and often only allow for a narrow analysis of specific dialogue goals and participant demographics. In this paper, we propose LLM-Roleplay: a goal-oriented, persona-based method to automatically generate diverse multi-turn dialogues simulating human-chatbot interaction. LLM-Roleplay can be applied to generate dialogues with any type of chatbot and uses large language models (LLMs) to play the role of textually described personas. To validate our method we collect natural human-chatbot dialogues from different sociodemographic groups and conduct a human evaluation to compare real human-chatbot dialogues with our generated dialogues. We compare the abilities of state-of-the-art LLMs in embodying personas and holding a conversation and find that our method can simulate human-chatbot dialogues with a high indistinguishability rate.¹

¹ https://github.com/UKPLab/llm-roleplay



1 Introduction

Collecting human-chatbot dialogues requires recruiting and managing a large number of human annotators, which can pose prohibitive obstacles to researchers who aim to develop conversational AI agents (i.e., chatbots). To circumvent these obstacles and the limitations of publicly available data, numerous methods that employ chatbots to generate dialogues have recently been introduced Xu et al. (2023b); Kim et al. (2023); Zhu et al. (2023); Ding et al. (2023); Zhao et al. (2024b). These methods have the potential to generate synthetic data more quickly and cost-effectively, offering a level of quality and variety that can closely match what human annotators can achieve Zhang et al. (2024).

Recently, several studies have leveraged such synthetic data generated through distillation and self-improvement techniques to enhance the capabilities of pre-trained large language models (LLMs) Zhang et al. (2024). By employing instruction tuning and fine-tuning techniques, these efforts aim to ensure that LLMs deliver useful and safe responses that adhere to provided instructions Xu et al. (2023a); Taori et al. (2023); Mukherjee et al. (2023); Li et al. (2023b); Peng et al. (2023); Almazrouei et al. (2023).

Figure 1: Schematic illustration of our method: A textual description of a persona and a goal (top) is used to instruct the inquirer ($\mathcal{S_{I}}$) model to embody the given persona (left) and engage in a dialogue with the responder ($\mathcal{S_{R}}$) chatbot (right). We show that dialogues simulated by the inquirer LLM and the responder chatbot can effectively simulate human-chatbot interaction.

However, today’s dialogue generation methods have two critical limitations. First, the set of possible user responses for a given chatbot utterance is large while the existing datasets typically only cover a single or few annotated user responses. The previously generated datasets are constrained by the number of dialogues and turns, making it difficult to extend them to novel domains or conversational goals. Second, the existing methods rarely account for subjective response behavior and implicitly assume an average user, which can create a gap in representation Rottger et al. (2022) and poses the danger of evaluating and ultimately improving chatbots for a specific sociodemographic group while ignoring others.

The quality of existing datasets is primarily evaluated with the performance in a supervised fine-tuning (SFT) setup, which is influenced by various properties of the generated dataset. For example, Chen et al. (2023) argue that the initial prompt plays a crucial role in the quality of the generated dialogue. On the other hand, Zhao et al. (2024a) show that the length of the response is the key factor and with significantly fewer samples outperforms methods that solely focus on quality. Meanwhile, Shen (2024) emphasizes the importance of mimicking human-style interactions in the dialogues.

To mitigate the shortcomings of existing dialogue generation methods and to account for the desiderata identified in prior work for SFT, we introduce LLM Roleplay: a goal-oriented, persona-based, plug-and-play method to generate diverse, multi-turn, and long dialogues simulating human-chatbot interactions using LLMs. Our method is the first to instruct an LLM (inquirer) to mimic a given persona and prompt a chatbot (responder) to achieve a given conversational goal and thereby elicit realistic human-AI interactions with a specific LLM. Figure 1 shows an exemplary application of our method.

In this work, we address the following two research questions: (i) To what extent can we simulate real human-chatbot dialogues using LLM-chatbot dialogues? (ii) How do various LLMs compare as inquirers within this setup? To answer those research questions, we conduct two user studies. First, we collect real human-chatbot dialogues by asking participants to reach various conversational goals and link the collected dialogues with participants’ sociodemographic background. Next, we use LLM-Roleplay to simulate dialogues with the same set of personas and goals. Second, we conduct a human evaluation with another pool of participants to assess how well the generated dialogues mimic the collected ones. Specifically, we present two dialogues: one collected from the previous study and one simulated using LLM Roleplay, using the same persona and the conversation goal. We then ask the participants to try to identify the simulated one similar to a Turing test Turing (1950).

Overall, this paper contributes: (i) a novel method to simulate human-chatbot dialogues with an arbitrary set of personas and conversational goals, (ii) a human validation study that confirms our method’s potential to simulate human-chatbot interaction with high resemblance to real dialogues, (iii) a dataset of goal-oriented human-chatbot and model-chatbot dialogues using our hand-crafted multi-hop goals involving four state-of-the-art LLMs, (iv) an in-depth comparison of open-source and proprietary LLMs regarding their ability to embody a persona and hold a conversation, (v) an open-source implementation of our plug-and-play method that can be directly applied to simulate dialogues with any combination of models and chatbot systems using any set of conversational goals and personas.

2 Large Language Model Roleplay

In the following, we introduce the notation used throughout this paper. Refer to Figure 1 for an annotated example.

Persona ($\mathcal{P}$)

is the composition of sociodemographic features. We conceptualize persona as a written description of an individual from a specific sociodemographic group. We follow Kumar et al. (2021) to define the sociodemographic features and the options of the latter.

Goal ($\mathcal{G}$)

is the textual representation of the conversational goal.

Subject ($\mathcal{S}$)

is a dialogue participant. We denote the inquirer as $\mathcal{S_{I}}$, the entity that asks questions, and the responder as $\mathcal{S_{R}}$, the entity that answers the given questions. In our setup, the inquirer ($\mathcal{S_{I}}$) is an LLM and the responder ($\mathcal{S_{R}}$) is a conversational agent or chatbot (not necessarily an LLM). The human inquirer will be denoted as $\mathcal{S_{I}}^{h}$.

Utterance ($\mathcal{U}$)

is the output of a Subject ($\mathcal{S}$), i.e. either the output of the inquirer ($\mathcal{U_{I}}$) or the responder ($\mathcal{U_{R}}$). For example, $\mathcal{U_{I}}^{i}$ will denote the inquirer’s $i$-th utterance. We refer to two consecutive utterances of two different subjects as a turn: $\mathcal{T}^{i}=[\mathcal{U_{I}}^{i};\mathcal{U_{R}}^{i}]$.

Dialogue ($\mathcal{D}$)

is a sequence of one or more turns. The dialogue of $t$ turns is denoted as: $\mathcal{D}^{t}:=\{\mathcal{T}^{0},\mathcal{T}^{1},...,\mathcal{T}^{t}\}$. We denote the maximum number of turns by $max\_t$, i.e. $t\leq max\_t$, which we discuss in more detail in Section 3.1.

At a high level, LLM-Roleplay consists of three main steps: (i) Initial conditioning of the inquirer model with persona-specific text alongside the goal description. (ii) Subsequently, the inquirer’s output is provided to the responder model. (iii) Lastly, the output of the responder is returned to the inquirer by asking it to either output a follow-up question or a pre-defined token to terminate the dialogue.

We begin with assembling prompting templates for both of the subjects. We create a system prompt template (SYS_I) for the inquirer LLM. This template prompts the LLM to embody the given persona, provide a prompt to address the designated goal, and output the specified termination token if it considers the goal accomplished. As for the responder, we use a default system prompt (SYS_R) to promote it to be a helpful and honest assistant. Additionally, we develop a response forwarder template (INTER_I) for the inquirer. This template prompts the inquirer to assess the conclusiveness of the answer of the responder, determining whether to output a subsequent question or the termination token. We provide the system and forwarder prompt templates in Table 5 in Appendix B. The algorithm for the LLM Roleplay is shown in Algorithm 1.
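As a rough illustration of how the three templates fit together, the following Python sketch builds SYS_I, SYS_R, and INTER_I as plain strings. The wording of every template is a hypothetical placeholder (the actual templates are those in Table 5 in Appendix B), and the function names `sys_i` and `inter_i` are assumptions made for this example. Only the termination token "FINISH" corresponds to the token used in our experiments.

```python
# Hypothetical template wording; the actual templates are listed in Table 5.
TERMINATION_TOKEN = "FINISH"

# Default responder system prompt (SYS_R); the exact wording is an assumption.
SYS_R = "You are a helpful and honest assistant."


def sys_i(persona: str, goal: str) -> str:
    """Build the inquirer system prompt SYS_I(P, G)."""
    return (
        f"You are the following person: {persona}\n"
        f"Your goal is: {goal}\n"
        "Write the prompt you would send to a chatbot, enclosed in double quotes. "
        f"If you consider the goal accomplished, output {TERMINATION_TOKEN}."
    )


def inter_i(responder_output: str) -> str:
    """Build the response forwarder INTER_I(U_R) for the inquirer."""
    return (
        f"The chatbot answered: {responder_output}\n"
        f"If this answer is conclusive, output {TERMINATION_TOKEN}; "
        "otherwise write a follow-up prompt enclosed in double quotes."
    )
```

Keeping the templates as functions of the persona, the goal, and the responder output mirrors the notation SYS_I(P, G) and INTER_I(U_R) and makes it easy to swap in different wording.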

To obtain a dialogue for a given persona and a goal we create a prompt by passing the persona ($\mathcal{P}$) and the goal ($\mathcal{G}$) to the inquirer system prompt template and generate an output based on inquirer LLM distribution:

$$\mathcal{U_{I}}^{0}\sim\mathcal{S_{I}}(\texttt{SYS\_I}(\mathcal{P},\mathcal{G})).$$

A deterministic prompt extraction function, extract_prompt, which looks for a string in double quotes, is then applied to this output to extract the prompt:

$$\mathcal{U_{I}}^{0}:=\texttt{extract\_prompt}(\mathcal{U_{I}}^{0}).$$
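A minimal Python sketch of such an extraction function is given below. It assumes that the first non-empty double-quoted string is the prompt and that curly quotes should also be accepted; both choices, as well as the return value `None` for a missing prompt, are assumptions of this example rather than a description of the released implementation.

```python
import re
from typing import Optional


def extract_prompt(text: str) -> Optional[str]:
    """Return the first non-empty string enclosed in double quotes, or None."""
    # Accept straight and curly double quotes (the curly variants are an assumption).
    match = re.search(r'["“]([^"“”]+)["”]', text)
    if match is None:
        return None
    prompt = match.group(1).strip()
    # Treat a quoted string that contains only whitespace as "no prompt".
    return prompt or None
```

Returning `None` when no quoted string is found gives the caller an explicit signal corresponding to the empty-prompt check in the algorithm.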

The responder receives the prompt of the inquirer, and generates an output:

$$\mathcal{U_{R}}^{0}\sim\mathcal{S_{R}}(\mathcal{U_{I}}^{0}).$$

Given the output of the responder, we condition the inquirer using the output forwarder template:

$$\mathcal{U_{I}}^{1}\sim\mathcal{S_{I}}([\mathcal{U_{I}}^{0};\texttt{INTER\_I}(\mathcal{U_{R}}^{0})]).$$

If the output of the inquirer begins or ends with the termination token, the stopping condition stop is met and the algorithm terminates. Otherwise, the prompt is extracted using extract_prompt, and the process continues for the subsequent turns until it reaches the maximum number of turns: $t=max\_t$. For the $t$-th turn, the utterances will be:

$$\mathcal{U_{I}}^{t}\sim\mathcal{S_{I}}(\mathcal{T}^{0},\mathcal{T}^{1},\dots,\mathcal{T}^{t-1})$$
$$\mathcal{U_{R}}^{t}\sim\mathcal{S_{R}}(\mathcal{T}^{0},\mathcal{T}^{1},\dots,\mathcal{T}^{t-1},\mathcal{U_{I}}^{t})$$
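The stopping condition can be sketched as a small Python predicate. The token "FINISH" is the one used in our experiments; stripping surrounding whitespace, quotes, and trailing punctuation before the check is an assumption of this sketch.

```python
def stop(utterance: str, token: str = "FINISH") -> bool:
    """Return True if the inquirer output begins or ends with the termination token."""
    # Removing surrounding whitespace, quotes, and punctuation is an assumption.
    cleaned = utterance.strip().strip('"\'.! ')
    return cleaned.startswith(token) or cleaned.endswith(token)
```

Checking only the beginning and the end of the output avoids terminating a dialogue in which the token merely occurs somewhere in the middle of a longer follow-up prompt.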
Input: $\mathcal{P},\mathcal{G}$: persona and goal
Output: $\mathcal{D}(\mathcal{S_{I}},\mathcal{S_{R}})$

$\mathcal{D}\leftarrow\{\}$
$\mathcal{U_{I}}^{0}\leftarrow\mathcal{S_{I}}(\texttt{SYS\_I}(\mathcal{P},\mathcal{G}))$
$\mathcal{U_{I}}^{0}:=\texttt{extract\_prompt}(\mathcal{U_{I}}^{0})$
if $\mathcal{U_{I}}^{0}=\emptyset$ then return $\mathcal{D}$
$\mathcal{U_{R}}^{0}\leftarrow\mathcal{S_{R}}(\mathcal{U_{I}}^{0})$
$\mathcal{D}\leftarrow\{[\mathcal{U_{I}}^{0};\mathcal{U_{R}}^{0}]\}$
for $t=1\rightarrow max\_t$ do
    $\mathcal{U_{I}}^{t}\leftarrow\mathcal{S_{I}}([\mathcal{U_{I}}^{t-1};\texttt{INTER\_I}(\mathcal{U_{R}}^{t-1})])$
    if $\texttt{stop}(\mathcal{U_{I}}^{t})$ then break
    $\mathcal{U_{I}}^{t}:=\texttt{extract\_prompt}(\mathcal{U_{I}}^{t})$
    if $\mathcal{U_{I}}^{t}=\emptyset$ then break
    $\mathcal{U_{R}}^{t}\leftarrow\mathcal{S_{R}}(\mathcal{D},\mathcal{U_{I}}^{t})$
    $\mathcal{D}\leftarrow\{\mathcal{D},[\mathcal{U_{I}}^{t};\mathcal{U_{R}}^{t}]\}$
end for
return $\mathcal{D}$

Algorithm 1: LLM Roleplay
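Algorithm 1 can be approximated in Python as follows. This is a sketch under several assumptions: the inquirer and responder are passed in as plain callables (hypothetical interfaces, not the API of the released code), the persona/goal and forwarder strings are simple placeholders for SYS_I and INTER_I, and the helper functions use simplified matching rules.

```python
import re
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (inquirer prompt, responder output)
TOKEN = "FINISH"        # termination token used in our experiments


def extract_prompt(text: str) -> Optional[str]:
    """Return the first double-quoted string, or None if there is none."""
    match = re.search(r'"([^"]+)"', text)
    return match.group(1) if match else None


def stop(text: str) -> bool:
    """True if the output begins or ends with the termination token."""
    cleaned = text.strip()
    return cleaned.startswith(TOKEN) or cleaned.endswith(TOKEN)


def roleplay(
    persona: str,
    goal: str,
    inquirer: Callable[[List[str]], str],          # hypothetical interface
    responder: Callable[[List[Turn], str], str],   # hypothetical interface
    max_t: int = 10,
) -> List[Turn]:
    """Generate a dialogue of at most max_t + 1 turns (turns 0..max_t)."""
    # Placeholder for SYS_I(P, G); the real template wording differs.
    history = [f"Persona: {persona}\nGoal: {goal}"]
    dialogue: List[Turn] = []
    for _ in range(max_t + 1):
        raw = inquirer(history)
        # As in Algorithm 1, the stopping condition is checked from turn 1 on.
        if dialogue and stop(raw):
            break
        prompt = extract_prompt(raw)
        if prompt is None:  # failure case: no prompt in double quotes
            break
        answer = responder(dialogue, prompt)
        dialogue.append((prompt, answer))
        # Placeholder for INTER_I(U_R): forward the response to the inquirer.
        history += [raw, f"Response: {answer}"]
    return dialogue
```

Passing the models as callables keeps the loop independent of any particular inference backend, which reflects the plug-and-play intent of the method.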
| Metric | Llama-2 | Mixtral | Vicuna | GPT4 |
|---|---|---|---|---|
| Avg. # Turns per Dialogue | 2.62 (1.54) | 3.86 (2.19) | 7.12 (3.79) | 7.60 (3.08) |
| Avg. # Tokens per Prompt | 77.77 (46.20) | 50.82 (26.47) | 75.19 (92.43) | 68.13 (60.37) |
| Avg. # Tokens per Response | 347.50 (142.14) | 302.47 (151.30) | 228.29 (163.92) | 267.69 (147.26) |
| No-prompt | 6.82% (1.48%) | 0.97% (0.57%) | 7.90% (0.54%) | 0.17% (0.04%) |
| Multiple Prompts | 8.79% (0.95%) | 8.40% (1.52%) | 6.08% (0.41%) | 39.21% (2.38%) |
| Incoherent Response | 3.12% (0.30%) | 0.13% (0.09%) | 0.79% (0.10%) | 0.03% (0.04%) |
| Number of Self-Replies | 5.50% (3.25%) | 5.99% (1.43%) | 69.39% (5.77%) | 5.48% (0.09%) |
| Incoherent Response (Responder) | 0.56% (0.57%) | 1.16% (0.19%) | 8.01% (0.93%) | 7.50% (0.35%) |

Table 1: Analysis of persona-specific dialogue collection (top three rows) and failure cases (remaining rows) conducted for Llama-2, Mixtral, Vicuna, and GPT4. The results are averaged over runs with three different seeds. We show that GPT4 is successful at holding longer dialogues and has relatively fewer failure cases. Meanwhile, Mixtral is better at providing short, on-point prompts. A larger number of turns per dialogue is preferred, while for the other metrics smaller values are better. The standard deviation is indicated in parentheses.

3 Experiments

In the first study, we collect dialogues between human inquirers and model responders ($\mathcal{D}(\mathcal{S_{I}}^{h},\mathcal{S_{R}})$) by engaging participants with various personas ($\mathcal{P}$) in interactions with a chat-tuned LLM intending to achieve a given goal ($\mathcal{G}$). Utilizing the same set of personas ($\mathcal{P}$) and goals ($\mathcal{G}$) we generate dialogues between model inquirers and the same responder ($\mathcal{D}(\mathcal{S_{I}},\mathcal{S_{R}})$) by employing the LLM Roleplay method (Section 3.1). We report the statistics of the generated dialogues such as the average number of turns per dialogue, the number of tokens per prompt (inquirer output), and the number of tokens per response (responder output). Please refer to Appendix B for details on the generation parameters.

Furthermore, we investigate the cases in which the inquirer LLMs fail to follow the given instruction. We describe the most common failure cases, report their statistics, and explain their detection mechanisms (Section 3.2).

In the second study, we conduct a human evaluation wherein another set of participants compare the natural $\mathcal{D}(\mathcal{S_{I}}^{h},\mathcal{S_{R}})$ and the simulated $\mathcal{D}(\mathcal{S_{I}},\mathcal{S_{R}})$ dialogues to discern the simulated counterpart $\mathcal{D}(\mathcal{S_{I}},\mathcal{S_{R}})$ (Section 3.3). We report the total and per-model undetectability rates, the utterance number on which the dialogue was detected, and the distribution of the duration and confidence choices.

Moreover, we use generalized linear models to analyze the detection rate of the simulated dialogues, the utterance number at which the simulated dialogue was detected, and the time participants spent making a choice, in order to compare how different inquirer LLMs behave.

For our experiments, we used a single NVIDIA A100 GPU with 80GB memory for Llama-2 and Vicuna. We utilized up to 92% of the memory.

3.1 Persona-Specific Dialogue Collection

For this study, participants were instructed to interact with Llama-2 ($\mathcal{S_{R}}$) to accomplish a designated goal.² We additionally ask participants to provide sociodemographic information ($\mathcal{P}$) such as age group, gender, race, level of education, and whether they identify as a native English speaker. We base the choice of sociodemographic features and the options of the features on previous work by Kumar et al. (2021). We provide a detailed screenshot of the persona information form interface in Figure 9, and a screenshot of our chat interface in Figure 10 in Appendix B. In order to cover different conversational goals, we design ten handcrafted multi-hop goals ($\mathcal{G}$) spanning three domains: "Math", "Coding", and "General Knowledge".

² Due to limited deployment resources, quantized Llama-2-13B-chat-GGUF is utilized Touvron et al. (2023), employing the Q5_K_M quantization method with 5 bits Frantar et al. (2023).

We conduct a study involving 20 participants each engaged in tackling 10 goals, resulting in the generation of 200 natural human-chatbot interaction dialogues ($\mathcal{D}(\mathcal{S_{I}}^{h},\mathcal{S_{R}})$). We provide the full sociodemographic distribution of the participants in Figure 3 to Figure 7 in Appendix B.

Subsequently, we generate dialogues utilizing our LLM Roleplay method with a set of three state-of-the-art LLMs and one proprietary conversational agent as inquirers ($\mathcal{S_{I}}$): llama-2-13B-Chat (Llama-2) Touvron et al. (2023), Mixtral-8x7B-Instruct-v0.1 (Mixtral) Jiang et al. (2024), vicuna-13b-v1.5-16k (Vicuna) Peng et al. (2023), and GPT4 OpenAI et al. (2024). See sample natural and generated dialogues using Mixtral inquirer and Llama-2 responder from 14 to 18 in Appendix E.

Eventually, the dialogue collection and generation results in the creation of 200 natural human-chatbot ($\mathcal{D}(\mathcal{S_{I}}^{h},\mathcal{S_{R}})$) and 800 (with 4 inquirers) simulated ($\mathcal{D}(\mathcal{S_{I}},\mathcal{S_{R}})$) dialogue pairs.

We observe that GPT-4 OpenAI et al. (2024) holds the longest dialogues, with 7.60 turns on average (Table 1). Mixtral Jiang et al. (2024), on the other hand, is better at generating shorter and on-point (based on manual analysis) prompts, with 50.82 tokens per prompt. The generated dialogues have an average of 5.30 turns, each consisting of an average of 67.97 tokens per prompt, resulting in longer dialogues than those from the previous work (Table 4).
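The aggregate figures quoted above appear consistent with the unweighted means of the per-model values in Table 1. The short Python check below reproduces them under that assumption; treating each inquirer with equal weight is our reading of the numbers, not a documented aggregation procedure.

```python
# Per-model averages copied from Table 1.
turns = {"Llama-2": 2.62, "Mixtral": 3.86, "Vicuna": 7.12, "GPT4": 7.60}
prompt_tokens = {"Llama-2": 77.77, "Mixtral": 50.82, "Vicuna": 75.19, "GPT4": 68.13}


def mean(values):
    """Unweighted arithmetic mean (equal weight per inquirer is an assumption)."""
    values = list(values)
    return sum(values) / len(values)


avg_turns = mean(turns.values())                  # 21.20 / 4, about 5.30
avg_prompt_tokens = mean(prompt_tokens.values())  # 271.91 / 4, about 67.98

print(f"{avg_turns:.2f} turns, {avg_prompt_tokens:.4f} tokens per prompt")
```

The mean number of tokens per prompt evaluates to 67.9775, which matches the reported 67.97 when truncated rather than rounded to two decimals.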

3.2 Failure Cases

Since the outputs of the LLMs are free-form, applying deterministic functions to their output proves challenging, resulting in a theoretically infinite number of turns. Therefore, we set a limit on the number of turns: $max\_t$. Nevertheless, we try to capture the algorithm failure cases to have an automatic assessment of the inquirer LLM failure cases. We expect the inquirer LLM to provide the intended prompt enclosed in double quotes as we explicitly request this in our instructions. See Table 7 in Appendix C for a sample expected output and examples of the failure cases.

All of the LLMs in our experiments face the following issues.

Prompt not in double-quotes.

The model fails to provide a prompt within double quotes, thus causing the dialogue to terminate.

Incoherent output.

The inquirer produces a repetitive token sequence. We identify such outputs and preemptively end the dialogue. We utilize the incoherent function to analyze text for incoherent strings (see Algorithm 2 in Appendix C).

Incoherent output of responder.

The inquirer model fails to detect the case when the responder outputs incoherent text, leading to unsuccessful dialogues. We spot such outputs and stop the dialogue, using the same incoherent function.
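To give an idea of what a repetition check of this kind could look like, the following Python sketch flags text in which a short token n-gram repeats back to back. It is a hypothetical heuristic written for illustration and is not the incoherent function of Algorithm 2; the whitespace tokenization and the thresholds `max_repeats` and `max_ngram` are assumptions.

```python
def incoherent(text: str, max_repeats: int = 5, max_ngram: int = 4) -> bool:
    """Flag text in which an n-gram (n <= max_ngram) occurs max_repeats times in a row."""
    tokens = text.split()  # whitespace tokenization is an assumption
    for n in range(1, max_ngram + 1):
        # Only start positions that leave room for max_repeats consecutive copies.
        for start in range(len(tokens) - n * max_repeats + 1):
            gram = tokens[start:start + n]
            if all(
                tokens[start + k * n:start + (k + 1) * n] == gram
                for k in range(1, max_repeats)
            ):
                return True
    return False
```

Because the same check is applied to inquirer and responder outputs, a single function of this shape would cover both of the incoherence failure cases.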

Inquirer self-reply.

The inquirer model fails to maintain its intended role and answers its own question, resulting in a fully generated dialogue in a single utterance. We detect this deterministically by examining the presence of any special tokens of the responder model within the output. For instance, in the case of Llama-2, this token appears as "[INST", while for Vicuna, it manifests as "### Human:".
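This deterministic check can be sketched in a few lines of Python. The two marker strings are the ones named above; the dictionary keys, the function name `is_self_reply`, and the plain substring test are assumptions of this example.

```python
# Marker strings named in the text; the lookup keys are hypothetical labels.
SPECIAL_TOKENS = {
    "llama-2": ["[INST"],
    "vicuna": ["### Human:"],
}


def is_self_reply(inquirer_output: str, model: str) -> bool:
    """True if the output contains any chat-format marker listed for the model."""
    return any(token in inquirer_output for token in SPECIAL_TOKENS[model])
```

A substring test suffices here because the markers are unlikely to occur in an ordinary user prompt.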

Multiple prompts.

The inquirer outputs multiple strings enclosed in double quotes. As sometimes this overlaps with the previous case (inquirer self-reply), we select the first as the prompt.

Dialogue-stopping criterion failure.

In the intermediate prompts, we ask the inquirer to output a pre-defined token when it "thinks" the goal assigned to it is achieved. Nonetheless, the model adheres to only a limited set of such tokens; for instance, "FINISH" was utilized in our experiments.

| Subset | Metric | Llama-2 | Mixtral | Vicuna | GPT4 |
|---|---|---|---|---|---|
| total | Undetectability Rate | 33.5% | 44.0% | 22.5% | 35.0% |
| total | Confidence: "very confident" | 33 (16.50%) | 32 (16.00%) | 75 (37.50%) | 43 (21.50%) |
| total | Confidence: "confident" | 108 (54.00%) | 83 (41.50%) | 75 (37.50%) | 97 (48.50%) |
| total | Confidence: "somewhat confident" | 59 (29.50%) | 85 (42.50%) | 50 (25.00%) | 60 (30.00%) |
| detected | Duration | 86.10 (157.14) | 99.45 (98.81) | 94.63 (150.70) | 124.88 (227.36) |
| detected | Utterance Number | 1.72 (0.78) | 2.38 (1.35) | 2.50 (1.84) | 3.72 (2.35) |
| detected | Confidence: "very confident" | 24 (12.00%) | 19 (9.50%) | 65 (32.50%) | 32 (16.00%) |
| detected | Confidence: "confident" | 80 (40.00%) | 51 (25.50%) | 59 (29.50%) | 66 (33.00%) |
| detected | Confidence: "somewhat confident" | 29 (14.50%) | 42 (21.00%) | 31 (15.50%) | 32 (16.00%) |
| undetected | Duration | 120.44 (139.93) | 114.93 (157.46) | 62.24 (62.37) | 94.52 (124.03) |
| undetected | Confidence: "very confident" | 9 (4.50%) | 13 (6.50%) | 10 (5.00%) | 11 (5.50%) |
| undetected | Confidence: "confident" | 28 (14.00%) | 32 (16.00%) | 16 (8.00%) | 31 (15.50%) |
| undetected | Confidence: "somewhat confident" | 30 (15.00%) | 43 (21.50%) | 19 (9.50%) | 28 (14.00%) |

Table 2: Analysis of human-evaluation results for detected, undetected, and total dialogues for Llama-2, Mixtral, Vicuna, and GPT4, showing confidence statistics as occurrences (percentages in parentheses). We show that Mixtral has the highest undetectability rate of 44% (50% reflecting absolute indistinguishability). Moreover, it has the lowest confidence choice statistics for not confidently detected and the highest statistics for confidently undetected dialogues. GPT4 has the highest utterance number: 3.72, showing that it is identified in later utterances. The duration (in seconds) and the utterance number standard deviations are in parentheses.
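The undetectability rates in Table 2 can be recomputed from the confidence counts of the undetected dialogues. The Python check below does this, assuming 200 evaluated dialogue pairs per inquirer model (800 pairs across 4 inquirers); the variable names are chosen for this example only.

```python
# Counts of undetected dialogues per confidence level
# ("very confident", "confident", "somewhat confident"), copied from Table 2.
undetected = {
    "Llama-2": (9, 28, 30),
    "Mixtral": (13, 32, 43),
    "Vicuna": (10, 16, 19),
    "GPT4": (11, 31, 28),
}
PAIRS_PER_MODEL = 200  # assumption: 800 pairs split evenly over 4 inquirers

# Undetectability rate in percent: undetected pairs / all pairs for that model.
rates = {
    model: 100 * sum(counts) / PAIRS_PER_MODEL
    for model, counts in undetected.items()
}

print(rates)  # {'Llama-2': 33.5, 'Mixtral': 44.0, 'Vicuna': 22.5, 'GPT4': 35.0}
```

The recomputed values agree with the first row of the table, which also confirms that the detected and undetected counts of each model sum to 200.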
| Model | Detection Prob.* | Utterance Num.* | Duration |
|---|---|---|---|
| Llama-2 | B | C | A |
| Mixtral | A | B | A |
| Vicuna | C | B | A |
| GPT4 | AB | A | A |

Table 3: Statistical results of the human-evaluation for 800 dialogue pairs. The asterisk marks dependent variables on which a significant effect of the choice of model was observed (Wald test). Pairwise differences between conditions (Post hoc Wald comparison of contrasts) are reported as compact letter display codings. For example, the detection probability feature shows that the post hoc test detected a significantly lower (i.e., better) detection probability for Mixtral compared to Llama-2 as well as Vicuna, but no significant difference between Mixtral and GPT-4 could be observed.
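A compact letter display is read as follows: two conditions differ significantly if and only if their codes share no letter. The small Python helper below applies this rule to the detection probability column of Table 3; the function name `differ_significantly` is an assumption of this example.

```python
# Compact letter display codes for the detection probability column of Table 3.
cld = {"Llama-2": "B", "Mixtral": "A", "Vicuna": "C", "GPT4": "AB"}


def differ_significantly(a: str, b: str, codes=cld) -> bool:
    """Two conditions differ significantly iff their codes share no letter."""
    return not (set(codes[a]) & set(codes[b]))


print(differ_significantly("Mixtral", "Llama-2"))  # True
print(differ_significantly("Mixtral", "GPT4"))     # False
```

Under this reading, GPT4 (code AB) is not significantly different from either Mixtral (A) or Llama-2 (B), even though those two differ from each other.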

We attribute these failures to two common issues in LLMs: limited context length and a restricted set of fine-tuned instructions Kaddour et al. (2023).

In most cases, GPT4 outperforms other models, exhibiting lower occurrences of failure cases: 0.17% for responses without prompts, 0.03% for incoherent responses, and 5.48% for self-replies (Table 1). However, it struggles with providing a single prompt, failing 39.21% of the time. In contrast, Vicuna excels in generating single prompts, achieving a failure rate of 6.08%. Llama-2 seems to receive fewer incoherent outputs from the responder (also Llama-2) with 0.56%. Mixtral’s performance lies somewhere in between, with more balanced results across different metrics. Refer to Figure 8 in Appendix B for a radar chart and to Table 6 in Appendix B for the detailed data.

3.3 Human-Evaluation

To answer the question of how well LLM inquirer and responder dialogues $\mathcal{D}(\mathcal{S_{I}},\mathcal{S_{R}})$ approximate dialogues between human inquirer and responder $\mathcal{D}(\mathcal{S_{I}}^{h},\mathcal{S_{R}})$, we conduct a human evaluation study. In each round, participants are shown two dialogues side by side, both featuring the same persona ($\mathcal{P}$) and solving the same goal ($\mathcal{G}$).

We allocate a different set of dialogue pairs to each participant, ensuring that no participant encounters multiple dialogues from the same user (from Section 3.1) solving the same goal. We select a new group of 20 participants, each tasked with reviewing 40 dialogue pairs. The participants selected for this study represent a wide range of occupational backgrounds. Some have no prior experience with chatbots, while others are industry professionals.

Participants are required to answer three questions for each dialogue pair: (i) Identify which dialogue they perceive as artificial (simulated): choices are "1st (left)", "2nd (right)", or "Not sure", where the last is considered to be a tie. (ii) Express the level of confidence in their selection: options are "Somewhat Confident", "Confident", and "Very Confident". (iii) Identify the specific utterance number that signifies the artificiality within the dialogue: options depend on the number of utterances in the dialogue pair. We provide a detailed view of the instructions provided to users in Figure 11 and of the interfaces of the applications used for the study in Figure 12 in Appendix D.

In conducting the human evaluation, we also track the amount of time participants take to respond to the questions. We use this duration as a proxy for the complexity of the dialogue pairs. We assume that the longer it takes for a participant to make a decision, the more challenging the pair is, indicating that the simulated dialogue is difficult to discern.

The study shows that among the 800 samples, the simulated dialogues remained undetected in 33.75% of cases (50% reflecting absolute indistinguishability). Per model, Mixtral has the highest undetectability rate of 44.0%, followed by GPT4 with 35.0%, Llama-2 with 33.5%, and Vicuna with 22.5% (Figure 2). See the per-confidence choice distribution in Figure 13 in Appendix D.
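
As a sanity check, the reported rates can be recomputed from the undetected counts in Table 2. The sketch below assumes 200 dialogue pairs per model, which is implied by the percentages in Table 2 (for example, 24 occurrences correspond to 12.00%); the variable and function names are illustrative.

```python
# Undetected counts per confidence level, copied from Table 2
# (order: "very confident", "confident", "somewhat confident").
UNDETECTED = {
    "Llama-2": (9, 28, 30),
    "Mixtral": (13, 32, 43),
    "Vicuna": (10, 16, 19),
    "GPT4": (11, 31, 28),
}

# Assumption: the 800 pairs are split evenly across the four models,
# as implied by the percentages in Table 2 (24 occurrences = 12.00%).
PAIRS_PER_MODEL = 200


def undetectability_rates(counts, pairs_per_model):
    """Percentage of dialogue pairs per model in which the
    simulated dialogue was not detected."""
    return {m: 100.0 * sum(c) / pairs_per_model for m, c in counts.items()}


def overall_rate(counts, pairs_per_model):
    """Undetectability rate pooled over all models."""
    total_undetected = sum(sum(c) for c in counts.values())
    return 100.0 * total_undetected / (pairs_per_model * len(counts))


if __name__ == "__main__":
    print(undetectability_rates(UNDETECTED, PAIRS_PER_MODEL))
    print(overall_rate(UNDETECTED, PAIRS_PER_MODEL))
```

The pooled figure is the mean of the four per-model rates only because each model contributes the same number of pairs.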

Figure 2: The distribution of detectability (left) and undetectability rates (right) per model for Llama-2, Mixtral, Vicuna, and GPT4. Each bar is stacked with the confidence levels "Somewhat confident", "Confident", and "Very Confident". We show that Mixtral has a relatively high undetectability rate of 44%, followed by GPT4 at 35%, Llama-2 at 33.5%, and Vicuna at 22.5%. The total (un)detectability rates for each model are shown in gray.

GPT4 has the highest utterance number on the detected set of dialogues, meaning that it took more utterances for participants to recognize the simulated nature of the inquirer’s responses (Table 2). For the detected subset of dialogues, Mixtral has the lowest percentages for the "very confident" and "confident" choices (9.50% and 25.50%), while its 21.00% for "somewhat confident" is the highest among the models. It also has the highest percentages at every confidence level for the undetected dialogue pairs, with 6.50%, 16.00%, and 21.50% respectively. Llama-2 has the shortest mean duration for the detected dialogue pairs (86.10 seconds) and the longest for the undetected ones (120.44 seconds). However, it should be noted that the variation in duration is high, indicating that the participants did not complete the study at a consistent pace.

To further investigate how various instruction-tuned LLMs behave as inquirers within this setup, we analyze the simulated dialogue detection probability, the detection utterance number, and the duration participants spent, using generalized linear mixed models (GLMMs) with the choice of LLM as the independent variable. We additionally include random effects to account for potential confounding effects of individual participants’ detection abilities and of the users from Section 3.1.

We find significant effects of the choice of the model on the detection probability and the utterance position at which the participants formed their decision. We summarize the results of our statistical analyses in Table 3 using CLD codings (Piepho, 2004) and discuss our key findings in the following.

Effects on Detection Probability.

First, we fit a binomial model (logit link) to predict the detection probability depending on the model. Concretely, we estimate a GLMM specified by: $detection\_rate \sim model + (1|participant) + (1|generator\_user)$. We observe a significant effect of the choice of the model on the detection probability ($\chi^{2}(3)=21.49$, $p<0.001$). A post hoc Wald comparison of the contrasts for model types revealed significant differences between Vicuna and all other models (Vicuna being most likely to be detected), and between Mixtral and Llama-2 (Mixtral being significantly less likely to be detected). We do not find a significant difference between GPT4 and Mixtral.
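
For readers less familiar with the formula notation, the model specification above can be written out as a random-intercept logistic regression. This is a sketch under the standard GLMM assumption of normally distributed random intercepts; the symbols $y_{ijk}$, $\pi_{ijk}$, $\beta$, $u_j$, and $v_k$ are introduced here for illustration and do not appear in the original specification.

```latex
\begin{aligned}
y_{ijk} &\sim \mathrm{Bernoulli}(\pi_{ijk}), \\
\operatorname{logit}(\pi_{ijk}) &= \beta_0 + \beta_{m(i)} + u_j + v_k, \\
u_j &\sim \mathcal{N}(0, \sigma_u^2), \qquad v_k \sim \mathcal{N}(0, \sigma_v^2).
\end{aligned}
```

Here $y_{ijk}=1$ if the simulated dialogue in pair $i$ was detected by participant $j$, $m(i)$ is the LLM that generated the simulated dialogue, $u_j$ is the random intercept for participant $j$, and $v_k$ is the random intercept for the user from Section 3.1 whose dialogue is shown in the pair. With four models and one of them serving as the reference level, the fixed effect of the model has three free coefficients, which matches the three degrees of freedom of the reported $\chi^{2}$ statistic.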

Effects on Utterance Positions.

For the position of decision-forming utterances, we fit a respective GLMM and find a significant effect of the choice of the model ($F=31.73$, $p<0.001$). A post hoc Wald comparison of the contrasts for model types revealed significantly higher utterance positions for GPT4 than for the other models and that Llama-2 received significantly lower utterance position ratings. We do not observe a significant difference between Mixtral and Vicuna.

Overall, our findings suggest that Mixtral, an open-source LLM, demonstrates better performance in embodying a specified persona and simulating human-chatbot interactions compared to GPT4 in terms of detection probabilities. Nonetheless, when considering the detection of simulated dialogues based on the detection utterance number, GPT4 shows better performance which is likely to be attributable to GPT4 generating dialogues with more utterances.

4 Related Work

| Type | Dataset Name | # Dialogues | Avg. # Turns/Dialogue | Avg. # Tokens/Prompt | Avg. # Tokens/Response | Topics | Personalized |
|---|---|---|---|---|---|---|---|
| natural | OpenAssistant Köpf et al. (2023) | 3k | 2.12 | 28.28 | 171.34 | human-crafted | yes |
| natural | LMSYS-Chat-1M Zheng et al. (2024) | 777k | 1.92 | 55.23 | 163.66 | human-crafted | yes |
| natural | WildChat Zhao et al. (2024b) | 360k | 2.46 | 164.32 | 276.50 | open | no |
| synthetic | UltraChat Ding et al. (2023) | 1.5M | 3.85 | 52.54 | 249.41 | model-generated | no |
| synthetic | LLM-Roleplay (Ours) | any | 5.30 (2.11) | 67.97 (10.51) | 286.48 (151.15) | any | yes |
Table 4: Statistics of key human-crafted (natural) and model-generated (synthetic) datasets (English subsets) relevant to our study. Our method can generate unlimited dialogues across various topics, featuring persona-based prompts and longer utterances. The natural datasets include personalized information reflecting the inquirer’s subjectivity. Standard deviations are shown in parentheses. For a comprehensive dataset list, see Table 8 in Appendix F.

Conversational Datasets.

Approximating the true distribution of human-chatbot dialogues is challenging and requires a significant number of diverse human annotators. Incorporating participants from different sociodemographic profiles addressing a given conversational goal can substantially enhance the richness of the dataset. However, the associated time and resource requirements render the large-scale studies this would need infeasible for many researchers. Alternatively, a less resource-intensive and faster method for collecting human-chatbot conversations is to generate them using large language models (LLMs) Zhang et al. (2024). This method has been utilized in various setups for dialogue generation, such as human-human Adiwardana et al. (2020); Kim et al. (2023); Chen et al. (2023); Li et al. (2023a), teacher-student Macina et al. (2023), and patient-physician Wang et al. (2023).

Prior work proposed a large number of human-crafted and synthetically generated datasets, each trying to collect more dialogue pairs revolving around various topics (Table 4). OpenAssistant Köpf et al. (2023) collects human-crafted conversational dialogue trees, with prompts and answers generated by different humans. LMSYS-Chat-1M Zheng et al. (2024) contains the refined version of dialogue logs collected via an online chat interface and predominantly contains dialogues between users and Vicuna-13b Peng et al. (2023). UltraChat Ding et al. (2023) and GLAN Li et al. (2024) synthetically generate dialogues using proprietary conversational AI systems (ChatGPT Turbo), revolving around topics generated by the same systems.

Personas in Large Language Models.

Large language models (LLMs) have demonstrated distinct behaviors and personas Andreas (2022); Wolf et al. (2024). According to Andreas (2022), LLMs can interpret behaviors from text prompts, influencing the content they generate. Additionally, Wolf et al. (2024) argue that LLMs function as mixture decompositions, where prompts can shift the balance of these components, thus triggering persona-specific responses. Furthermore, Beck et al. (2024) investigate the effects of sociodemographic prompts on model responses, finding that while such prompts can be beneficial in certain settings, they produce varied effects across different models.

Building on these findings, our work introduces a novel method that is goal-oriented and persona-centric for generating diverse and multi-turn dialogues. This method allows for the simulation of dialogues across various combinations of conversational goals, personas, and LLMs, facilitating the creation of numerous simulated dialogues without theoretical limits.

5 Discussion and Future Work

In this work, we propose our novel LLM Roleplay method: an automatic, model-agnostic approach to elicit multi-turn, goal-oriented, persona-based simulated human-chatbot dialogues. We develop and validate our method in two user studies with a total of 40 participants. Our findings indicate that our method can produce dialogues that, in up to 44% of cases (where 50% corresponds to perfect indistinguishability), are indistinguishable from real human-chatbot dialogues and that the respective dialogues cover a higher number of turns than previously proposed datasets.

Building on the findings of this work, several avenues for future research warrant exploration. First, our method can be used to generate user-specific conversational datasets for more targeted alignment and domain adaptation (e.g., using RLHF Christiano et al. (2017)). Second, our method can improve dialogue evaluation by providing large amounts of realistic conversational data.

While this paper presents initial findings on LLM Roleplay and provides the first evidence that our method can approximate human-chatbot dialogues, the validity of our method regarding its sociodemographic representativeness remains to be assessed and advanced. We release our collected dialogue dataset, the code to apply our method, as well as our synthesized dialogues, to facilitate future work towards this promising field of research.

6 Conclusion

We present our novel LLM Roleplay method for simulating human-chatbot interaction. In a series of two user studies, we collect real human-chatbot dialogues and demonstrate that LLM Roleplay can generate diverse multi-turn conversations that approximate natural human-chatbot dialogues with a high level of indistinguishability. Our findings highlight the potential of LLMs in embodying specified personas and simulating human-chatbot interactions to synthesize realistic dialogues that create new opportunities for real-time model evaluation and training data generation for model fine-tuning.

7 Acknowledgments

This work has been funded by the German Research Foundation (DFG) as part of the UKP-SQuARE project (grant GU 798/29-1). This work has been funded by the LOEWE Distinguished Chair “Ubiquitous Knowledge Processing”, LOEWE initiative, Hesse, Germany (Grant Number: LOEWE/4a//519/05/00.002(0002)/81). We gratefully acknowledge the support of Microsoft with a grant for access to OpenAI GPT models via the Azure cloud (Accelerate Foundation Model Academic Research).

8 Limitations

In this section, we explore the inherent limitations of this research study. We further note that all our experiments have been approved by the local ethics reviewing board at Technical University of Darmstadt.

Sociodemographic Representativeness

An important limitation of our study lies in the sociodemographic distribution of the participants involved in the dialogue collection process. The pool of our studies’ participants serves as an initial investigation into the promising direction of interaction simulation and cannot represent the full spectrum of sociodemographic backgrounds. To address this limitation, future research needs to broaden the scope of participants’ sociodemographic groups and assess the replicability of our findings within large user studies, encompassing a more comprehensive range of sociodemographic groups. Studies should be conducted with specific sociodemographic groups, ensuring that participants representing a broader set of combinations of the persona features described in the paper are covered (e.g., via crowdsourcing on platforms like Prolific). Doing so would yield a more nuanced understanding of natural dialogue dynamics across diverse populations and would help uncover weaknesses of our initial method specification and the improvements it needs.

Controllability.

In this work, we have endeavored to detect and prevent failure cases of the proposed method. Despite our efforts, it is important to acknowledge that achieving absolute coverage in detecting all potential failure cases remains elusive. For example, instances where multiple turns of appreciation occur bilaterally present a challenge that we have not fully addressed. Moreover, we attempt to recover all detected failure cases by re-generating with different parameters. However, this approach has not proven to be effective. Future research should concentrate on exploring these scenarios further to detect and prevent failure cases more efficiently.

Additionally, although our LLM Roleplay shows promising results in our experimental settings, it is not immune to the common challenges associated with large language models. Problems like hallucinations and associated LLM behavior issues can still arise, even though we have not encountered any during our experiments.

References

  • Adiwardana et al. (2020) Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a human-like open-domain chatbot. Preprint, arXiv:2001.09977.
  • Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. The falcon series of open language models. Preprint, arXiv:2311.16867.
  • Andreas (2022) Jacob Andreas. 2022. Language models as agent models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5769–5779, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Beck et al. (2024) Tilman Beck, Hendrik Schuff, Anne Lauscher, and Iryna Gurevych. 2024. Sensitivity, performance, robustness: Deconstructing the effect of sociodemographic prompting. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2589–2615, St. Julian’s, Malta. Association for Computational Linguistics.
  • Chen et al. (2023) Maximillian Chen, Alexandros Papangelis, Chenyang Tao, Seokhwan Kim, Andy Rosenbaum, Yang Liu, Zhou Yu, and Dilek Hakkani-Tur. 2023. PLACES: Prompting language models for social conversation synthesis. In Findings of the Association for Computational Linguistics: EACL 2023, pages 844–868, Dubrovnik, Croatia. Association for Computational Linguistics.
  • Christiano et al. (2017) Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 4299–4307.
  • Ding et al. (2023) Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, Singapore. Association for Computational Linguistics.
  • Frantar et al. (2023) Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. Gptq: Accurate post-training quantization for generative pre-trained transformers. Preprint, arXiv:2210.17323.
  • Gopalakrishnan et al. (2019) Karthik Gopalakrishnan, Behnam Hedayatnia, Qinglang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-chat: Towards knowledge-grounded open-domain conversations. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pages 1891–1895. ISCA.
  • Gunasekar et al. (2024) Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Conti Kauffmann, Gustavo Henrique de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Behl, Xin Wang, Sebastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. 2024. Textbooks are all you need.
  • Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of experts. Preprint, arXiv:2401.04088.
  • Kaddour et al. (2023) Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. 2023. Challenges and applications of large language models. Preprint, arXiv:2307.10169.
  • Kim et al. (2023) Hyunwoo Kim, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, Ronan Bras, Malihe Alikhani, Gunhee Kim, Maarten Sap, and Yejin Choi. 2023. SODA: Million-scale dialogue distillation with social commonsense contextualization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12930–12949, Singapore. Association for Computational Linguistics.
  • Köpf et al. (2023) Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Minh Nguyen, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Alexandrovich Glushkov, Arnav Varma Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Julian Mattick. 2023. Openassistant conversations - democratizing large language model alignment. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Kumar et al. (2021) Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, and Michael Bailey. 2021. Designing toxic content classification for a diversity of perspectives. In Proceedings of the Seventeenth USENIX Conference on Usable Privacy and Security, SOUPS’21, USA. USENIX Association.
  • Li et al. (2023a) Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023a. CAMEL: Communicative agents for "mind" exploration of large language model society. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Li et al. (2024) Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, and Furu Wei. 2024. Synthetic data (almost) from scratch: Generalized instruction tuning for language models. ArXiv preprint, abs/2402.13064.
  • Li et al. (2017) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.
  • Li et al. (2023b) Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023b. Textbooks are all you need ii: phi-1.5 technical report. Preprint, arXiv:2309.05463.
  • Macina et al. (2023) Jakub Macina, Nico Daheim, Sankalan Chowdhury, Tanmay Sinha, Manu Kapur, Iryna Gurevych, and Mrinmaya Sachan. 2023. MathDial: A dialogue tutoring dataset with rich pedagogical properties grounded in math reasoning problems. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5602–5621, Singapore. Association for Computational Linguistics.
  • Mukherjee et al. (2023) Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. 2023. Orca: Progressive learning from complex explanation traces of gpt-4. Preprint, arXiv:2306.02707.
  • OpenAI et al. (2024) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel 
Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. 
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
  • Peng et al. (2023) Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with gpt-4. Preprint, arXiv:2304.03277.
  • Perez et al. (2022) Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Piepho (2004) Hans-Peter Piepho. 2004. An algorithm for a letter-based representation of all-pairwise comparisons. Journal of Computational and Graphical Statistics, 13:456–466.
  • Rashkin et al. (2019) Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy. Association for Computational Linguistics.
  • Rottger et al. (2022) Paul Rottger, Bertie Vidgen, Dirk Hovy, and Janet Pierrehumbert. 2022. Two contrasting data annotation paradigms for subjective NLP tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 175–190, Seattle, United States. Association for Computational Linguistics.
  • Shao et al. (2023) Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu. 2023. Character-LLM: A trainable agent for role-playing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13153–13187, Singapore. Association for Computational Linguistics.
  • Shen (2024) Ming Shen. 2024. Rethinking data selection for supervised fine-tuning. ArXiv preprint, abs/2402.06094.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
  • Turing (1950) A. M. Turing. 1950. I.—Computing Machinery and Intelligence. Mind, LIX(236):433–460.
  • Wang et al. (2023) Junda Wang, Zonghai Yao, Zhichao Yang, Huixue Zhou, Rumeng Li, Xun Wang, Yucheng Xu, and Hong Yu. 2023. Notechat: A dataset of synthetic doctor-patient conversations conditioned on clinical notes. Preprint, arXiv:2310.15959.
  • Wolf et al. (2024) Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, and Amnon Shashua. 2024. Fundamental limitation of alignment in large language models.
  • Xu et al. (2023a) Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023a. Wizardlm: Empowering large language models to follow complex instructions. Preprint, arXiv:2304.12244.
  • Xu et al. (2023b) Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. 2023b. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6268–6278, Singapore. Association for Computational Linguistics.
  • Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.
  • Zhang et al. (2024) Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, and Guoyin Wang. 2024. Instruction tuning for large language models: A survey. Preprint, arXiv:2308.10792.
  • Zhao et al. (2024a) Hao Zhao, Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. 2024a. Long is more for alignment: A simple but tough-to-beat baseline for instruction fine-tuning. ArXiv preprint, abs/2402.04833.
  • Zhao et al. (2024b) Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024b. (InThe)WildChat: 570K ChatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations.
  • Zheng et al. (2024) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. 2024. LMSYS-chat-1m: A large-scale real-world LLM conversation dataset. In The Twelfth International Conference on Learning Representations.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Zhu et al. (2023) Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. 2023. Starling-7b: Improving llm helpfulness & harmlessness with rlaif.

Appendix A Hand-Crafted Conversational Goals

In our preliminary experiments, we used one-hop conversational goals, each containing a single question. We noticed a pattern where the inquirer copies the primary question with subtle modifications and presents it as the prompt. To make the interactions longer and more engaging, we handcraft multi-hop conversational goals. Specifically, we define 10 goals across three categories:

Math

  • "You want to know how fast you run different distances. You use a stopwatch to measure the time it takes you to complete a 50-meter, 100-meter, and 200-meter race. You want to know how can you calculate your speed for each race? Based on that, you also want to calculate how many calories you burned during each race."

  • "You can run at a rate of speed four times faster than you can walk, but you can skip at a rate of speed that is half as fast as you can run. You want to know If you can skip at 3 miles per hour, and how many miles can you travel in six hours if you spend one-third of the time and two-thirds of the time running and walking, respectively. Also, you are curious about the other way around (one-third of the time walking and two-thirds for running)."

  • "Every day, you feed each of your chickens three cups of mixed chicken feed, containing seeds, mealworms, and vegetables to help keep them healthy. You give the chickens their feed in three separate meals. In the morning, you give your flock of chickens 15 cups of feed. In the afternoon, you give your chickens another 25 cups of feed. You want to know how many cups of feed you need to give your chickens in the final meal of the day if the size of your flock is 20 chickens. Also, you want to know how much the chicken egg production rate depends on the feed you give, and if you provide enough feed to your chickens for high-rate egg production."

Coding

  • "You want to make this function better. You want the chatbot to make it recursive to have memory optimal function, but make sure that it doesn’t enter into an infinite loop. After that, you want to plug a CLI (command line interface) into this function, so the user can insert a number and get the factorial of it as output: ’The factorial of the <NUMBER>, is <FACTORIAL>’. “‘ def factorialize(num): factorial = 1 for i in range(1, num): factorial *= i return factorial “‘"

  • "You have a little project where you need to use JavaScript, a language you don’t use every day. You have a subtask to write a function that counts how many vowels are in a given string. And you need this functionality in OOP. Also, you want the chatbot to develop the snippet it provided by getting the function input string via an API call. If the chatbot uses functions or operators you are not familiar with feel free to ask follow-up questions about it."

  • "You want to draw a unicorn in Python using the ’turtle’ module. (There should be multiple lines of short function calls). After that substitute the 10th line, which includes number argument(s), with the value 73(s)."

General Knowledge

  • "You want to know what are the world’s 10 oldest continuously inhabited cities. Pick the 3rd in that list find out who established the city, in which region it is located and what was the highest population."

  • "You have written content that disagrees with the following statement: ’Technology is the cause of all societal problems’ And you want the chatbot to generate a response that agrees with the statement, to make your claims stronger."

  • "You plan a trip to France and would like to do a walking tour. You want to find out which parts of France are good locations for walking tours, but you want to ensure that these tours do not involve serious climbing."

  • "You want to use the chatbot to create a poem about cats. Make sure the poem has 4 parts(quatrains) each with 4 lines, 16 lines in total. Refine the poem until you are satisfied and it is coherent. Also, you want to change the style of one of the quatrains to reflect the distinctive style of your favourite poet."

Appendix B Persona-specific Dialogues Collection

For persona-specific dialogue collection, we conducted a human study where participants were given the following instruction: "Below is your defined goal. What will you prompt the chatbot to accomplish your goal? Feel free to ask follow-up questions on the related topic of the question and clarify things in the response."

Participants for this study were selected from different sociodemographic groups: four age groups ("18 to 24", "25 to 34", "35 to 44", and "45 to 54"), two races ("Asian or Pacific Islander" and "White"), "female" and "male" genders, holders of "Doctoral" and "Master’s" degrees, and both "native" and "non-native" English speakers. Figures 3 to 7 show the distribution of participants by each feature.

Our initial experiments included falcon-40b-instruct Almazrouei et al. (2023); however, we excluded it due to its difficulty in following instructions.

Figure 3: Age distribution of participants for persona-specific dialogue collection study
Figure 4: Race distribution of participants for persona-specific dialogue collection study
Figure 5: Gender distribution of participants for persona-specific dialogue collection study
Figure 6: Education distribution of participants for persona-specific dialogue collection study
Figure 7: Native English speaker status distribution of participants for persona-specific dialogue collection study

We use the default generation settings for all our models. By setting "do_sample=True" in the Hugging Face Transformers "generate()" method, we enable multinomial sampling. For the inquirer model, we set "max_new_tokens" to 1k, and for the responder model, we set it to 4k.
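To make the sampling setting concrete, the following minimal sketch contrasts greedy decoding with multinomial sampling over a toy next-token distribution. It does not call the Transformers library: the function names (softmax, pick_next_token) and the toy logits are hypothetical and only illustrate, conceptually, what enabling sampling changes.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(logits, do_sample, rng):
    """Pick a token index greedily or by multinomial sampling (illustrative only)."""
    probs = softmax(logits)
    if not do_sample:
        # Greedy decoding: always take the most probable token.
        return max(range(len(probs)), key=probs.__getitem__)
    # Multinomial sampling: draw one index in proportion to its probability.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
rng = random.Random(0)  # fixed seed so the run is repeatable
greedy = pick_next_token(toy_logits, do_sample=False, rng=rng)
samples = [pick_next_token(toy_logits, do_sample=True, rng=rng) for _ in range(1000)]
```

With sampling enabled, repeated runs with different seeds can yield different continuations, which is consistent with averaging the reported statistics over several seeds.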

For the Llama-2 and Vicuna inquirers, we used a single NVIDIA A100 GPU; for Mixtral, we employed two NVIDIA A100 GPUs, with memory usage reaching a maximum of 92% and 66%, respectively. The experiments for the Llama-2, Vicuna, Mixtral, and GPT4 inquirers took 2 hours and 45 minutes, 13 hours and 25 minutes, 17 hours and 38 minutes, and 9 hours and 6 minutes, respectively.

Figure 8: Normalized statistics of Llama-2, Mixtral, Vicuna and GPT4 for dialogue collection. A smaller plot area is better for every metric except "# turns/dialog".

Figure 9: The screenshot of the person form from the application used for conducting the persona-specific dialogue collection, containing the following fields: "Age Range", "Race", "Gender", "Education" and "Is English your native language".

Figure 10: The screenshot of the chat interface from the application used for conducting the persona-specific dialogue collection, containing a simple chat interface with a "Next Goal" button to finish the current conversation and get the next conversational goal.
Llama-2
SYS_I [INST] <<SYS>> You are <PERSONA>. You will start a conversation with an assistant. If you accomplish your ultimate goal during the conversation only say "<CONV_STOP>". <</SYS>> Your ultimate goal is as follows: <GOAL>. What prompt will you use to direct the assistant toward achieving your goal? Please provide the prompt within double quotes. Use simple language, keep the prompts brief, and be on point. Do not greet the assistant. Maintain a casual style; avoid being overly friendly, don’t say thank you. [/INST]
INTER_I If the assistant didn’t help you achieve your goal, ask follow-up or clarification questions within double quotes. Be suspicious, curious, and demanding. Keep it simple, brief, and to the point. Stay casual; avoid being overly friendly. Assistant response: "<RESPONSE>".
Mixtral
SYS_I [INST] You are <PERSONA>. You will start a conversation with an assistant. If you accomplish your final goal during the conversation only say "<CONV_STOP>". Your ultimate goal is as follows: <GOAL>. What prompt will you use to direct the assistant toward achieving your goal? Please provide the prompt within double quotes. Use simple language, keep the prompts brief, and be on point. Do not greet the assistant. Maintain a casual style; avoid being overly friendly, don’t say thank you. [/INST]
INTER_I If the assistant didn’t help you achieve your goal, ask follow-up or clarification questions within double quotes. Be suspicious, curious, and demanding. Keep it simple, brief, and to the point. Stay casual; avoid being overly friendly. Assistant response: "<RESPONSE>".
Vicuna
SYS_I ### Human: You are <PERSONA>. You will start a conversation with an assistant. If you accomplish your final goal during the conversation only say "<CONV_STOP>". Question: Your ultimate goal is as follows: <GOAL>. What prompt will you use to direct the assistant toward achieving your goal? Please provide the prompt within double quotes. Use simple language, keep the prompts brief, and be on point. Do not greet the assistant. Maintain a casual style; avoid being overly friendly, don’t say thank you. ### Assistant:
INTER_I If the assistant didn’t help you achieve your goal, ask follow-up or clarification questions within double quotes. Be suspicious, curious, and demanding. Keep it simple, brief, and to the point. Stay casual; avoid being overly friendly. If you accomplish your final goal during the conversation only say "<CONV_STOP>". Assistant response: "<RESPONSE>".
GPT4
SYS_I You are <PERSONA>. You will start a conversation with an assistant. If you accomplish your ultimate goal during the conversation only say "<CONV_STOP>". Your ultimate goal is as follows: <OBJECTIVE>. What prompt will you use to direct the assistant toward achieving your goal? Please provide the prompt within double quotes. Use simple language, keep the prompts brief, and be on point. Do not greet the assistant. Maintain a casual style; avoid being overly friendly, don’t say thank you.
INTER_I If the assistant didn’t help you achieve your goal, ask follow-up or clarification questions within double quotes. Be suspicious, curious, and demanding. Keep it simple, brief, and to the point. Stay casual; avoid being overly friendly. Assistant response: "<RESPONSE>".
Llama-2 (Responder)
SYS_R [INST] <<SYS>> You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense or is not factually coherent, explain why instead of answering something not correct. If you don’t know the answer to a question, please don’t share false information. <</SYS>> <QUESTION> [/INST]
Table 5: System (SYS) and intermediate (INTER) prompts used for the inquirer models (Llama-2, Mixtral, Vicuna, GPT4) and the responder (Llama-2) in the LLM-Roleplay setup. Here, <PERSONA> is the textual representation of a persona, <CONV_STOP> is the stopping condition token (e.g. "FINISH"), <GOAL> is the textual representation of the goal, <RESPONSE> is the output of the responder, and <QUESTION> is the prompt of the inquirer given to the responder.
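The placeholder substitution described in Table 5 can be sketched as follows. This is a minimal illustration, not the released implementation: the helper fill_template, the shortened template strings, and the example persona, goal, and response are assumptions made for this sketch.

```python
def fill_template(template: str, values: dict) -> str:
    """Replace each <NAME> placeholder with its value (hypothetical helper)."""
    for name, value in values.items():
        template = template.replace(f"<{name}>", value)
    return template


# Shortened versions of the Mixtral-style prompts from Table 5.
SYS_I = (
    "[INST] You are <PERSONA>. You will start a conversation with an assistant. "
    'If you accomplish your final goal during the conversation only say "<CONV_STOP>". '
    "Your ultimate goal is as follows: <GOAL>. [/INST]"
)
INTER_I = 'Stay casual; avoid being overly friendly. Assistant response: "<RESPONSE>".'

# Hypothetical persona and goal texts, used only for illustration.
persona = "a 25 to 34-year-old person with a Master's degree"
goal = "You plan a trip to France and would like to do a walking tour"

# First inquirer turn: persona, goal, and stop token are filled in.
first_prompt = fill_template(
    SYS_I, {"PERSONA": persona, "GOAL": goal, "CONV_STOP": "FINISH"}
)
# Later inquirer turns: the responder output is inserted into the intermediate prompt.
follow_up = fill_template(INTER_I, {"RESPONSE": "Speed is distance divided by time."})
```

Keeping the templates as plain strings with named placeholders makes it straightforward to swap in the model-specific wrappers (for example "[INST]" or "### Human:") without touching the substitution logic.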
Metric | Llama-2 | Mixtral | Vicuna | GPT4
Number of utterances in dialogues | 5.24(3.09) | 7.72(4.38) | 14.24(7.58) | 15.2(6.17)
Number of tokens in the prompt | 77.77(46.20) | 50.82(26.47) | 75.19(92.43) | 68.13(60.37)
No-prompt in the response | 27.67(6.02)/405.67 | 5.00(2.94)/511.67 | 59.33(4.11)/750.34 | 1.67(0.47)/972.34
Multiple prompts in the response | 35.67(3.86)/405.67 | 43.00(7.79)/511.67 | 45.67(3.09)/750.34 | 381.33(23.21)/972.34
Incoherent response | 12.67(1.25)/405.67 | 0.67(0.47)/511.67 | 6.00(0.82)/750.34 | 0.33(0.47)/972.34
Number of self-replies | 22.33(13.20)/405.67 | 30.67(7.32)/511.67 | 520.67(43.32)/750.34 | 53.33(0.94)/972.34
Incoherent response (Responder) | 1.67(1.70)/296.67 | 5.00(0.82)/428.34 | 56.33(6.60)/702.67 | 70.00(3.27)/932.34
Table 6: Full numerical values of the analysis of the persona-specific dialogue collection conducted for Llama-2, Mixtral, Vicuna, and GPT4. The results are averaged over runs with three different seeds. For the metric "Number of utterances in dialogues", larger values are preferred, while for the other metrics, smaller values are better. The standard deviation is presented in parentheses, followed by a slash and the total number of outputs.
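Because the total number of outputs differs per model, the counts in Table 6 are easier to compare as rates. The short sketch below divides the "Number of self-replies" counts by the corresponding totals; the variable names are ours, and the numbers are copied from the table. By this measure, Vicuna self-replies in roughly 69% of its outputs, compared with roughly 5 to 6% for the other inquirers.

```python
# (mean count, total outputs) for "Number of self-replies", copied from Table 6.
self_replies = {
    "Llama-2": (22.33, 405.67),
    "Mixtral": (30.67, 511.67),
    "Vicuna": (520.67, 750.34),
    "GPT4": (53.33, 972.34),
}

# Rate = count / total outputs, so models with different output volumes are comparable.
rates = {model: count / total for model, (count, total) in self_replies.items()}

for model, rate in rates.items():
    print(f"{model}: {rate:.1%} of outputs are self-replies")
```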

Appendix C Failure Cases

The incoherent-text detection function scans the text with n-grams of increasing size, from 2 up to a specified maximum (incoherent_max_n), and looks for consecutive repetitions of the same n-gram. If it finds a repetitive pattern that reaches a specified threshold (incoherent_r), it returns True; otherwise, it returns False. See Algorithm 2.

In our experiments with Llama-2, the parameters incoherent_max_n and incoherent_r are set to 8 and 2, respectively. For Vicuna, these values are 5 and 2. For Mixtral and GPT4, they are 4 and 2.

Input: Text
Output: Boolean indicating incoherence
Parameters: incoherent_max_n, incoherent_r

words ← split text into words
for n ← 2 to incoherent_max_n do
    n_grams ← empty list
    for i ← 0 to length(words) - n do
        n_gram ← tuple(words[i : i + n])
        if n_grams is not empty and length(n_grams) ≥ max(incoherent_r, n) then
            if n_grams[-1] equals n_gram or n_grams[-n] equals n_gram then
                last_rs ← last incoherent_r elements of n_grams
                if length(set(last_rs)) equals 1 then
                    return True
                end if
                last_rs ← select last incoherent_r elements of n_grams, skipping every n elements
                if length(set(last_rs)) equals 1 then
                    return True
                end if
            end if
        end if
        add n_gram to n_grams
    end for
end for
return False
Algorithm 2: Incoherence detection
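Algorithm 2 can be sketched in Python as follows. This is one possible reading of the pseudocode, not the released implementation. In particular, the strided selection ("skipping every n elements") is interpreted here as taking the n-grams at offsets -n, -2n, and so on, and the sketch additionally requires that incoherent_r such n-grams exist before comparing them; both choices are assumptions. The function name is_incoherent is also ours.

```python
def is_incoherent(text: str, incoherent_max_n: int = 4, incoherent_r: int = 2) -> bool:
    """Return True if the text contains consecutively repeated n-grams."""
    words = text.split()
    for n in range(2, incoherent_max_n + 1):
        n_grams = []
        for i in range(len(words) - n + 1):
            n_gram = tuple(words[i : i + n])
            if len(n_grams) >= max(incoherent_r, n):
                if n_grams[-1] == n_gram or n_grams[-n] == n_gram:
                    # Case 1: the last r overlapping n-grams are identical,
                    # e.g. the same word repeated over and over.
                    last_rs = n_grams[-incoherent_r:]
                    if len(set(last_rs)) == 1:
                        return True
                    # Case 2 (assumed reading): the last r non-overlapping
                    # n-grams, taken with stride n, are identical,
                    # e.g. the same phrase repeated back to back.
                    strided = n_grams[-n::-n][:incoherent_r]
                    if len(strided) == incoherent_r and len(set(strided)) == 1:
                        return True
            n_grams.append(n_gram)
    return False


repetitive = "Okay, great! Let's a great idea!" + " Let's a great!" * 4
normal = "Can you help me measure my time for each race?"
flag_repetitive = is_incoherent(repetitive)
flag_normal = is_incoherent(normal)
```

Under this reading, a phrase has to recur several times in a row before the text is flagged, so a single echoed phrase does not by itself trigger the check.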
Case | Example output
Expected output | Sure, I can help you with that! Here’s the prompt: "Hey, I want to know how fast I can run different distances. Can you help me measure my time for a 50-meter, 100-meter, and 200-meter race? Oh, and also help me calculate how many calories I burned during each race?".
Prompt not in the double-quotes | Hey Assistant! I need help with my chickens. So, I feed them three times a day, morning, afternoon, and evening. In the morning, I give them 15 cups of feed, and in the afternoon, I give them 25 cups of feed. I want to know …
Incoherent output | "Okay, great! Let’s a great idea! Let’s a great! Let’s a great! Let’s a great! Let’s a great! Let’s a great! …
Inquirer self-reply | "Hey assistant, I’m still unsure about how to calculate my calories burned and determine my fitness level. Can you give me an example of how to calculate my calories burned for a 50-meter race, taking into account my weight, age, and fitness level? And how do I know if my fitness level is good enough to run a 50-meter race?"[INST IN] "Of course! To calculate your calories burned for a 50-meter race, we need to take into account your weight, …
Stopping-criterion failure | "Haha, Thanks, big assistant! You’re a lifesaver! I’m glad I could practice my goals with you. I feel like I’ve accomplished something big today!"
Table 7: The first row shows a normal inquirer output, as expected by the algorithm. The remaining rows show example cases in which Llama-2, as the inquirer model, fails to follow the given instructions.
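The cases in Table 7 suggest simple checks on the raw inquirer output. The sketch below is a hypothetical heuristic, not the authors' detection code: the function classify_inquirer_output, its labels, the "[INST" marker used to spot self-replies, and the default stop token "FINISH" are all assumptions made for illustration.

```python
import re


def classify_inquirer_output(output: str, stop_token: str = "FINISH") -> str:
    """Assign a coarse label to one inquirer output (illustrative heuristic)."""
    # The inquirer is asked to say only the stop token once the goal is reached.
    if output.strip().strip('"') == stop_token:
        return "stop"
    # Assumed marker: an instruction tag inside the output hints at a self-reply.
    if "[INST" in output:
        return "self_reply"
    # Prompts are expected within straight double quotes.
    quoted = re.findall(r'"([^"]+)"', output)
    if not quoted:
        return "no_prompt"
    if len(quoted) > 1:
        return "multiple_prompts"
    return "ok"


expected = 'Sure, I can help you with that! Here is the prompt: "Can you help me measure my time?".'
unquoted = "Hey Assistant! I need help with my chickens."
self_reply = '"How do I calculate my calories burned?"[INST IN] "Of course! To calculate ..."'
labels = [classify_inquirer_output(t) for t in (expected, unquoted, self_reply, '"FINISH"')]
```

A stopping-criterion failure such as the last row of Table 7 would be labeled "ok" by this heuristic, since the output is a single quoted utterance; catching it requires comparing the dialogue against the goal rather than inspecting the format.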
Type | Dataset Name | # Dialogues | Avg. # Turns/Dialogue | Avg. # Tokens/Prompt | Style | Topics | Persona
Human-Crafted | DailyDialog Li et al. (2017) | 13k | 7.84 | 17.19 | chit-chat | daily | no
 | PersonaChat Zhang et al. (2018) | 10k | 7.35 | 11.43 | chit-chat | daily | yes
 | EmpatheticDialogues Rashkin et al. (2019) | 25k | 4.3 | 20.11 | chit-chat | daily | yes
 | Character-LLM Shao et al. (2023) | 1k | 13.26 | - | chit-chat | LLM-generated | no
 | Topical Chat Gopalakrishnan et al. (2019) | 10k | 5.63 | 22.23 | chit-chat | daily | yes
 | OpenAssistant Köpf et al. (2023) | 3k | 2.12 | 28.28 | human-chatbot | human-crafted | yes
Synthetic Data | Anthropic HH Perez et al. (2022) | 338k | 2.3 | 18.9 | human-chatbot | human-crafted | yes
 | Chatbot Arena Zheng et al. (2023) | 33k | 1.2 | 52.3 | human-chatbot | human-crafted | yes
 | LMSYS-Chat-1M Zheng et al. (2024) | 777k | 1.92 | 55.23 | human-chatbot | human-crafted | yes
 | Meena Adiwardana et al. (2020) | 867M | - | - | chit-chat | daily | yes
 | Phi-1 Gunasekar et al. (2024) | 7B tokens | - | - | human-chatbot | code (textbooks) | no
 | SODA Kim et al. (2023) | 1.5M | 3.6 | 21.04 | human-human | daily | no
 | WildChat Zhao et al. (2024b) | 360k | 2.46 | 160.31 | human-chatbot | open | no
 | CAMEL Li et al. (2023a) | 115k | - | - | human-human | open | yes
 | Baize Xu et al. (2023b) | 210k | 3.1 | - | human-chatbot | Quora and Stack Overflow | no
 | Nectar Zhu et al. (2023) | 182k | 1.54 | 51.76 | human-chatbot | daily | no
 | UltraChat Ding et al. (2023) | 1.5M | 3.85 | 52.54 | human-chatbot | LLM-generated | no
 | LLM-Roleplay (Ours) | any | 5.30(2.11) | 67.97(10.51) | human-chatbot | open | yes
Table 8: Datasets most relevant to our work, comparing Human-Crafted and Synthetic datasets. Persona indicates whether the dataset reflects the inquirer’s personality. Some of the datasets are multilingual; we only report statistics on their English subsets.

Appendix D Human Evaluation

For the human evaluation, we give the participants the following instruction: "Here, you’ll find two dialogues: one is a conversation between a human and an AI, and the other is between AI and AI. Choose the dialogue you believe is the artificial one, and point out the specific statement that tipped you off to its artificial origin. Utterances with a green background are human or AI prompts, and utterances with grey backgrounds are AI responses." Participants are shown two dialogues that share the same persona and aim to achieve the same conversational goal. One dialogue is natural and the other is synthetic, and the two are presented in random order. After reviewing the dialogues, participants fill out a form for each dialogue pair with the following questions: "Which dialogue is artificial?", "How confident are you about your choice?", and "Which utterance reveals the artificial nature of the dialogue?"

Figure 11: The screenshot of the starting page from the application used for conducting the human evaluation, with the following instruction for the participants: "Here, you’ll find two dialogues: one is a conversation between a human and an AI, and the other is between AI and AI. Choose the dialogue you believe is the artificial one, and point out the specific statement that tipped you off to its artificial origin. Utterances with a green background are human or AI prompts, and utterances with grey backgrounds are AI responses."

Figure 12: The screenshot of the dialogue comparison page from the application used for conducting the human evaluation, consisting of the following questions: "Which dialogue is artificial?", "How confident are you about your choice?", and "Which utterance reveals the artificial nature of the dialogue?"

Figure 13: Cumulative results of human evaluation choices and confidences for all models. Simulated dialogues are spotted 66.25% of the time. Choices are shown as Simulated on the left, Natural in the middle, and "Not Sure" on the right, each split by confidence level: "Somewhat confident", "Confident" and "Very confident".

Appendix E Sample Dialogues

We demonstrate how generated dialogues can vary based on different personas and a specific feature in persona (e.g. "age range", "education") when aiming for the same conversational goal: "You plan a trip to France and would like to do a walking tour. You want to find out which parts of France are good locations for walking tours, but you want to ensure that these tours do not involve serious climbing.". Additionally, we present the natural counterparts of the dialogues generated by participants in the natural dialogue collection study alongside the synthetic ones. The inquirer model used for generating the dialogues is Mixtral-8x7B-Instruct-v0, while the responder model is Llama-2-13B-Chat, both for the natural and synthetic dialogues.

Figure 14: Example dialogues: one sourced from a dialogue collection on the left and the other generated using the LLM Roleplay method on the right, both utilizing the same persona and goal descriptions.
Figure 15: Example dialogues: one sourced from a dialogue collection on the left and the other generated using the LLM Roleplay method on the right, both utilizing the same persona and goal descriptions.
Figure 16: Example dialogues: one sourced from a dialogue collection on the left and the other generated using the LLM Roleplay method on the right, both utilizing the same persona and goal descriptions.
Figure 17: Example dialogues generated using the LLM Roleplay method, showcasing how dialogue style varies based on age range. On the left, the dialogue uses the feature "18 to 24", and on the right, it uses "65 or older-year-old".
Figure 18: Example dialogues generated using the LLM Roleplay method, showcasing how dialogue style varies based on educational background. On the left, the dialogue uses the feature "less than high school degree", and on the right, it uses "Doctoral degree".

Appendix F More Related Work

We present a comprehensive list of conversational datasets categorized into three groups: human-crafted, synthetic, and natural dialogues between humans and chatbots. Refer to Table 8 for detailed comparisons. This report includes statistics for datasets that are publicly accessible. However, sources for some datasets could not be located.