Stephanie: Step-by-Step Dialogues for Mimicking Human Interactions
in Social Conversations

Hao Yang1, Hongyuan Lu2, Xinhua Zeng1, Yang Liu1,3, Xiang Zhang4,
Haoran Yang2, Yumeng Zhang5, Yiran Wei4, Wai Lam2
1Fudan University, 2The Chinese University of Hong Kong, 3University of Toronto,
4FaceMind Corporation, 5Tsinghua University
[email protected], [email protected]
  Work done during internship at FaceMind Corporation  Corresponding author and co-first author
Abstract

In the rapidly evolving field of natural language processing, dialogue systems primarily employ a single-step dialogue paradigm. Although this paradigm is efficient, it lacks the depth and fluidity of human interactions and does not appear natural. We introduce a novel Step-by-Step Dialogue Paradigm (Stephanie), designed to mimic the ongoing, dynamic nature of human conversations. By employing a dual learning strategy and a further-split post-editing method, we generate a high-quality step-by-step dialogue dataset and use it to fine-tune existing large language models, enabling them to perform step-by-step dialogues. We describe Stephanie in detail and conduct tailored automatic and human evaluations to assess its effectiveness against the traditional single-step dialogue paradigm. We will release code, Stephanie datasets, and Stephanie LLMs to facilitate future research on chatbots.

1 Introduction

In the field of natural language processing, the research and development of dialogue systems continue to advance. Some systems, such as https://nijigen.com.cn and https://chatgpt.com/?model=gpt-4o, aim to mimic human communication in daily life, providing a more natural user experience. However, these systems predominantly employ a Single-Step Dialogue Paradigm (Abbas et al., ; Touvron et al., 2023; Du et al., 2022; Abdin et al., 2024; Achiam et al., 2023), where the system provides a comprehensive, one-time response to each user input, quickly addressing user questions or needs. While this approach can deliver a wealth of information in a single response, it falls short in simulating the fluidity and emotional exchange of real human conversations. In reality, daily human conversations are ongoing, dynamically evolving processes involving multiple topics and emotional exchanges (Song et al., 2022; Butler, 2011; Nie et al., 2024; Poria et al., 2019), and the current single-step dialogue paradigm fails to adequately capture this complexity and richness of emotions.

Figure 1: A single-step dialogue system and Stephanie. Stephanie constructs a dialogue composed of multiple dispersed yet coherent responses.

To better emulate the style of human social conversations, this paper introduces an innovative dialogue paradigm named step-by-step dialogue (Stephanie), as shown in Figure 1. Unlike single-step dialogue, Stephanie mimics casual chats in instant messaging applications, creating a more natural and continuous dialogue flow with the power of in-context learning (Min et al., 2022; Chen et al., 2023a; Wang et al., 2024). Under this paradigm, the dialogue system does not just provide a one-time response to each input but constructs a conversation composed of multiple dispersed yet coherent responses. This design allows the system to gradually develop the conversation, with each response focusing on different aspects of the dialogue, making the conversation more detailed, rich, and emotionally nuanced. We found that it provides a better conversation experience, evoking greater user engagement. For example, in the step-by-step chat mode, the system can address various aspects of a user’s emotional expression step by step, first by supporting users through empathetic language and understanding, and then by asking questions or expanding the topic, gradually building deeper and more continuous emotional communication.

If the one-time response of a single-step dialogue system is simply divided into multiple responses by punctuation, the overall logic and integrity of the one-time response itself will result in an unnatural and stiff step-by-step dialogue, which does not resemble a real social interaction with people. In order to fully consider the semantic similarities and differences between sentences, as well as naturalness and anthropomorphism when generating step-by-step dialogue and implementing a dialogue system with a step-by-step chat function, we introduced a comprehensive prompting framework that employs a dual learning strategy and a Further-Split post-editing method to generate and optimize step-by-step dialogue datasets. We then used this dataset with a specific fine-tuning strategy to be compatible with existing large models, thereby establishing a step-by-step dialogue system. The step-by-step dialogue paradigm demonstrates significant academic and practical value in enhancing the naturalness and emotional depth of chat systems. By simulating real social interactions, this research not only advances the technology of dialogue systems but also provides new insights and approaches for achieving more natural and human-like communication between machines and humans.

The main contributions of this paper include:

  • We innovatively propose a step-by-step dialogue paradigm that utilizes a series of dispersed yet coherent responses to more closely mirror the style of real human communication interactions, thereby enhancing the emotional depth and human-likeness of the dialogue.

  • We introduce a dual learning strategy and a Further-Split post-editing method to generate and optimize step-by-step dialogue datasets, and we then fine-tune existing large models to develop a step-by-step dialogue system. To facilitate future research, we will release code, Stephanie datasets, and Stephanie LLMs in the near future.

  • Finally, we comprehensively compare single-step dialogues with step-by-step dialogues through both human and automated evaluations, demonstrating the significant advantages of step-by-step dialogue systems over traditional single-step dialogue systems.

2 Related Work

Large Language Models for Dialogue Systems Dialogue systems have traditionally been finetuned on publicly available dialogue datasets (Zhang et al., 2019; Adiwardana et al., 2020; Roller et al., 2020; Thoppilan et al., 2022). Motivated by ChatGPT’s success, developers are now conducting supervised finetuning on open-source large language models like LLaMA (Touvron et al., 2023) to develop dialogue systems. This process involves finetuning with constructed instruction-following examples (Taori et al., 2023) and using dialogue data distilled from ChatGPT (Ulmer et al., 2024; Chiang et al., 2023). Furthermore, some studies prompt dialogue systems built on large pre-trained models to elicit the knowledge embedded in these language models. Areas of focus include task-oriented dialogues (Labruna et al., 2023; Swamy et al., 2023; Mi et al., 2022), knowledge-supported dialogues (Semnani et al., 2023; Rogers et al., 2023; hongru2023large), and open-domain dialogues (Chen et al., 2023a; Lee et al., 2023; Hongru et al., 2023).

Emotional Support in Dialogue Systems The role of emotions in building attractive dialogue systems has been thoroughly investigated (Zhou and Wang, 2017; Huber et al., 2018; Huang et al., 2020). Emotional chatting refers to dialogue systems expressing emotions such as happiness or sadness, while emotional support goes further, aiming to alleviate users’ emotional distress in emotional chatting by proactively guiding the conversation and employing appropriate support techniques (Ratican and Hutson, 2023; Chen et al., 2023b; Liu et al., 2023). An empathetic response is a key element in providing effective emotional support, focusing on understanding the user’s emotions and making suitable replies, with the goal of creating more personalized and engaging responses(Liao et al., 2021; Sun et al., 2021; Majumder et al., 2020). Enhancing the empathetic response capability of LLMs through context learning with semantic similarity, bi-directional co-generation, and integration with knowledge bases has been proposed (Qian et al., 2023). Additionally, intermediate reasoning steps can be adopted, using language clues in the conversation to determine the user’s emotional state, personality traits, and psychological characteristics, and then generating empathetic responses (Hongru et al., 2023). Emotional support dialogue also requires the ability to devise dialogue strategies, formulate appropriate response strategies for various emotional problems of users, and achieve complex dialogue objectives such as exploration, comfort, and action (Rogers et al., 2023; Peng et al., 2022; Tu et al., 2022). A multi-agent framework can coordinate multiple specialized agents, each responsible for a specific aspect of complex dialogue objectives in emotional support, such as exploration, comfort, and action, making complex dialogue objectives more approachable and stimulating greater intelligence through collaboration (Cheng et al., 2023). 
Another dialogue strategy involves breaking down the ultimate goal into a sequence of sub-goals, selecting actions for sub-goals, and filtering valuable sub-goals to efficiently achieve the ultimate goal (Chua, 2024). Currently, emotional support based on LLMs faces the challenge of data scarcity. One approach is to use dialogues as generative seeds and exploit the contextual learning potential of ChatGPT to recursively generate scalable emotional support dialogue datasets (Zheng et al., 2023).

In the current field of natural language processing, most dialogue systems based on large language models primarily adopt a Single-Step Chat Paradigm(Wu et al., 2023; Touvron et al., 2023; Mai et al., 2023; Yamazaki et al., 2023). Within this paradigm, the system responds to each user input with a comprehensive and complete one-time reply to promote interaction. Such interactions provide information-dense responses to handle complex inquiries, focusing on the informational density and completeness of each response, which is suitable for directly resolving specific questions or providing detailed information in a single interaction. However, this paradigm exhibits certain limitations in emulating the natural fluidity and emotional expression found in human daily dialogues. While it can identify and respond to users’ emotional inclinations, the interaction pattern often sticks to a question-and-single-answer format, lacking the emotional continuity and interaction depth present in real conversations.

3 Methodology

Figure 2: In the process of step-by-step dialogue generation, we adopted a dual learning strategy to enhance the model’s ability to generate natural dialogues through the Step-by-Step Dialogue Prompt Framework. This strategy combines positive and negative learning objectives. The positive objective includes high-quality step-by-step dialogue examples selected from real social interactions, while the negative objective comprises designed high-quality single-step dialogue examples. Through contrastive learning, this approach helps the model distinguish between step-by-step dialogues and single-step dialogues, thus generating more natural and emotionally rich step-by-step dialogues.

In this section, we will delve into the process of generating and optimizing step-by-step dialogues, and based on this, create a high-quality step-by-step dialogue dataset. We further fine-tuned and built a dialogue system capable of step-by-step interactions to simulate the step-by-step dialogue paradigms found in real human social exchanges.

3.1 Dual Learning Strategy for Step-by-Step Dialogue Generation

To efficiently generate step-by-step dialogues that mimic real human social interactions, inspired by contrastive learning, we propose a dual learning strategy combining both positive and negative learning objectives within a comprehensive prompt framework named the step-by-step dialogue prompt framework. As illustrated in Figure 2, the framework consists of three elements: background information $D$, positive learning objectives $P$, and negative learning objectives $N$, aiming to enhance the model’s ability to generate dialogues that are both rich and natural. The comprehensive prompt framework can be formulated as:

$$p(r \mid D, P, N) \tag{1}$$

where $r$ is the response output of the model, and the design of the three elements is as follows:

  • Background Information: We use an LLM to summarize the theme $T$ of each dialogue segment from the PERSONA-CHAT dataset and the characteristics $C$ of the dialogue participants, which together form the background information $D = \{T, C\}$. This information guides the model’s generation, covering common topics such as family, work, and leisure activities, while considering the diverse personalities of the dialogue participants; for example, one might be described as optimistic and active, while another might be portrayed as having recently faced setbacks but remaining diligent and academically inclined.

  • Positive Learning Objectives: To help the model understand the step-by-step dialogue paradigm, we created five high-quality step-by-step dialogue examples as the positive objectives $P$. These examples simulate everyday social exchanges between two individuals and serve as a basis for few-shot learning, guiding the model to generate coherent and emotionally rich step-by-step dialogues in different background contexts.

  • Negative Learning Objectives: Simultaneously, we designed five high-quality single-step dialogue examples as the negative objectives $N$. Through contrastive learning, these examples enable the model to discern the differences between single-step and step-by-step dialogues. This negative learning approach helps the model better understand the step-by-step dialogue paradigm by pushing away dissimilar single-step dialogue examples.

This dual learning strategy is a robust prompting framework. Through this structured approach, the model considers both positive and negative learning objectives during dialogue generation, enhancing its ability to understand and generate step-by-step dialogues while ensuring that the generated dialogues align with the themes and character traits in the background information.

3.2 Optimizing Step-by-Step Dialogues Using the Further-Split Post-Editing Method

Although the method described in Section 3.1 enabled the model to make some progress in generating step-by-step dialogues, our evaluation showed that some generated dialogues still exhibited characteristics of single-step dialogues, such as dense, one-time responses. To address this issue and further enhance the coherence and naturalness of emotional expression in dialogues, we designed a post-editing optimization method called "further-split."

In this process, we selected five initial step-by-step dialogues generated by the model for detailed analysis and manual restructuring. We further split these dialogues according to the natural flow and emotional progression of actual conversations, reorganizing and optimizing the content. The optimized step-by-step dialogue examples were paired with the original examples, serving as rewritten examples to guide the model in learning how to further split and rewrite dialogues, thereby generating more natural and human-like step-by-step dialogues to closely mimic real social interactions.

3.3 Dataset Generation and Finetuning Strategy for Stephanie

Based on the aforementioned comprehensive prompt framework and the further-split post-editing method, we generated a high-quality step-by-step dialogue dataset. To effectively utilize this dataset for finetuning existing large language models, we designed a specific finetuning strategy.

During the finetuning process, we introduced delimiters to format the dataset, providing structured input and output for the model, where the content between each pair of delimiters represents a single exchange between the two dialogue participants. We then used this newly formatted step-by-step dialogue dataset to finetune the model. After finetuning, the model’s output also adopted the same delimiter-separated step-by-step dialogue format. Finally, we used a series of scripts and software tools to build a user-friendly interface that renders the input and output as message bubbles, similar to those in social messaging software, allowing users to interact with the large language model using the step-by-step dialogue paradigm.

This plug-and-play finetuning strategy enables our step-by-step dialogue dataset to be compatible with various existing language models, thereby constructing a dialogue system capable of step-by-step interactions to provide a coherent and emotionally rich dialogue experience in practical applications.
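The delimiter formatting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not name its delimiter, so the `<sep>` token and the helper names `format_turn`, `split_turn`, and `to_training_pair` are hypothetical, and the sketch assumes one delimiter between consecutive messages of a turn.

```python
from typing import Dict, List

# Hypothetical delimiter; the paper does not state which token it uses.
DELIM = "<sep>"


def format_turn(messages: List[str]) -> str:
    """Join the consecutive messages of one speaker's turn with the delimiter."""
    return DELIM.join(m.strip() for m in messages)


def split_turn(text: str) -> List[str]:
    """Split delimiter-separated model output back into individual messages."""
    return [part.strip() for part in text.split(DELIM) if part.strip()]


def to_training_pair(user_turn: List[str], reply_turn: List[str]) -> Dict[str, str]:
    """Build one finetuning example with delimiter-formatted input and output."""
    return {"input": format_turn(user_turn), "output": format_turn(reply_turn)}


example = to_training_pair(
    ["I failed my driving test today", "I feel awful"],
    ["Oh no", "That sounds rough", "What part went wrong?"],
)
# Each element returned by split_turn can be rendered as one message bubble.
bubbles = split_turn(example["output"])
```

Because the same delimiter is used for training targets and for parsing generated text, a front end only needs `split_turn` to turn one model output into several chat bubbles.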

4 Experiment Setup

4.1 Dataset

Our step-by-step dialogue dataset originates from the PERSONA-CHAT dataset, a well-known multi-turn dialogue dataset grounded in character personas. Each dialogue instance typically comprises around 8 turns, and each self and partner character is described by roughly 4 traits.

From the training set of PERSONA-CHAT, we curated 5,457 high-quality dialogues. Initially, we employed the Llama3-70b model to summarize the theme of each dialogue, with summaries of roughly 50 to 100 words. Subsequently, we adopted the Stephanie dialogue generation approach described herein, incorporating these dialogue themes and approximately 4 traits of both characters as background information. We used the Llama3-70b model to generate a step-by-step dialogue dataset consisting of 5,457 dialogues, where each character involved exhibits roughly 4 traits.

4.2 Prompt

We describe the prompt generated for step-by-step dialogue with Stephanie as follows:

<five examples of single-step dialogues>.
<five examples of step-by-step dialogues>.
In single-step dialogues, each role sends only one message per turn. In contrast, step-by-step dialogues allow multiple messages to be sent consecutively before the other role replies, simulating the style of human daily chit-chat. Please generate a step-by-step dialogue and a single-step dialogue based on the background information:
<background information>.

We can also describe the prompt optimized for generating step-by-step dialogue using the Further-Split method as follows:

<five examples of single-step dialogues and corresponding Stephanie>.
Please assess whether each message reply in the following step-by-step dialogue can be further rewritten into multiple replies to make the conversation more natural, interesting, engaging, and closer to human interaction. Then, provide a new version of the step-by-step dialogue:
<the single-step dialogue to be further-split into Stephanie>.
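The two prompt templates above can be assembled programmatically. The sketch below is illustrative only: the instruction strings are copied from the templates, while the function names, the "Theme:/Traits:" layout of the background information, and the "Original:/Rewritten:" labels on the rewrite pairs are assumptions rather than details given in the paper.

```python
from typing import List, Tuple

GENERATION_INSTRUCTION = (
    "In single-step dialogues, each role sends only one message per turn. "
    "In contrast, step-by-step dialogues allow multiple messages to be sent "
    "consecutively before the other role replies, simulating the style of "
    "human daily chit-chat. Please generate a step-by-step dialogue and a "
    "single-step dialogue based on the background information:"
)

FURTHER_SPLIT_INSTRUCTION = (
    "Please assess whether each message reply in the following step-by-step "
    "dialogue can be further rewritten into multiple replies to make the "
    "conversation more natural, interesting, engaging, and closer to human "
    "interaction. Then, provide a new version of the step-by-step dialogue:"
)


def build_generation_prompt(
    single_step_examples: List[str],
    step_by_step_examples: List[str],
    theme: str,
    traits: List[str],
) -> str:
    """Negative examples, positive examples, instruction, then background D = {T, C}."""
    background = f"Theme: {theme}\nTraits: " + "; ".join(traits)
    parts = [
        "\n\n".join(single_step_examples),
        "\n\n".join(step_by_step_examples),
        GENERATION_INSTRUCTION,
        background,
    ]
    return "\n\n".join(parts)


def build_further_split_prompt(
    rewrite_pairs: List[Tuple[str, str]], dialogue: str
) -> str:
    """Few-shot rewrite pairs, instruction, then the dialogue to be split further."""
    shots = "\n\n".join(
        f"Original:\n{original}\n\nRewritten:\n{rewritten}"
        for original, rewritten in rewrite_pairs
    )
    return "\n\n".join([shots, FURTHER_SPLIT_INSTRUCTION, dialogue])
```

In the setup described in the paper, each example list would hold five dialogues; the functions accept any number so that smaller prompts can be tested.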

4.3 Baselines and Comparison Models

In evaluating the performance of our model, we consider several leading models in the field of language processing. These models are used as benchmarks due to their significant capabilities in various tasks within natural language processing. Each model is briefly described as follows:

  • GPT-4: Developed by OpenAI, GPT-4 represents the latest advancement in the Generative Pre-trained Transformer series. Renowned for its vast knowledge base and flexibility across multiple tasks, GPT-4 is a critical benchmark for assessing advanced language understanding and generation capabilities.

  • Llama3-70b: From Meta’s Llama series, the Llama3-70b model, with its 70 billion parameters, is aimed at deep contextual understanding and complex reasoning tasks. It serves as a high-end model for performance comparison.

  • Llama3-8b: A model from Meta’s Llama family, Llama3-8b is designed to provide a balance between performance and efficiency with its 8 billion parameters. It is optimized for rapid response and lower resource usage, making it suitable for real-time applications.

  • Phi3-3.8b: Phi3-3.8b from Microsoft’s Phi-3 series of small language models excels in performance while being highly efficient in terms of computational resource usage. These models are designed for flexible deployment across cloud, on-device, and edge computing scenarios, ensuring effectiveness even with limited connectivity. Phi3-3.8b uses high-quality, curated training data to achieve results comparable to larger models.

4.4 Evaluation Metrics

To comprehensively assess the performance of our step-by-step dialogue, we have utilized a series of evaluation metrics aimed at thoroughly measuring various aspects such as the diversity, naturalness, and effectiveness of the dialogues, among others. These metrics include: Dialogue Experience Metrics (suitable for both automated and human evaluations), Lexical Diversity Metrics (Distinct-N), and statistical features of the dialogue data, such as the average number of words per message and the Average Consecutive Message Counts (ACMC).

  • Dialogue Experience Metrics: Interesting: The degree of interest in the dialogue. If the dialogue carries a negative sentiment, the score is 0. Informative: The amount of information contained in the dialogue. Natural: Whether the dialogue is natural and human-like. Engaging: Whether the dialogue is engaging, meaning if what is said by both roles makes them want to continue the dialogue. On-topic: Whether the dialogue stays on the topic described in the dialogue topic. On-persona: Whether the dialogue matches the personas of role1 and role2.

  • Distinct-N: To quantify the lexical diversity of the dialogues, we utilize the Distinct-N metric. This metric calculates the diversity of n-grams in the generated responses across all possible values of $N$, showing the system’s capability to produce varied and engaging content. The Distinct-N is defined as:

    $$\text{Distinct-N} = \frac{\text{Total unique n-grams}}{\text{Total n-grams}}$$

  • Words/Message: Calculates the average number of words per message, providing insight into the verbosity or conciseness of the dialogues. This helps in determining the efficiency and clarity of the communication. The formula for Words/Message is defined as:

    $$\text{Words/Message} = \frac{\sum_{i=1}^{n} w_i}{n}$$

    where $w_i$ is the number of words in the $i$-th message, and $n$ represents the total number of messages.

  • ACMC (Average Consecutive Message Counts): This metric measures the average number of consecutive messages sent by one participant before receiving a response. It is calculated as:

    $$\text{ACMC} = \frac{\sum_{i=1}^{n} c_i}{m}$$

    where $c_i$ is the number of consecutive messages in the $i$-th turn without interruption by the other participant, $n$ is the total number of such turns, and $m$ is the total number of messages sent by the participant.
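The three statistical metrics above can be computed with a few lines of Python. This is a sketch under stated assumptions, not the authors' evaluation script: tokens are obtained by whitespace splitting, n-grams are counted within each message, and ACMC is read as the mean number of consecutive messages per uninterrupted turn (total messages divided by number of turns), which is the reading that yields values above 1 such as those reported in Table 4. The function names are illustrative.

```python
from itertools import groupby
from typing import List


def distinct_n(messages: List[str], n: int) -> float:
    """Unique n-grams divided by total n-grams, pooled over all messages."""
    unique = set()
    total = 0
    for message in messages:
        tokens = message.split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def words_per_message(messages: List[str]) -> float:
    """Average number of whitespace-separated words per message."""
    return sum(len(m.split()) for m in messages) / len(messages)


def acmc(speakers: List[str]) -> float:
    """Mean length of runs of consecutive messages by the same speaker.

    `speakers` holds the speaker label of every message, in dialogue order.
    """
    run_lengths = [len(list(group)) for _, group in groupby(speakers)]
    return sum(run_lengths) / len(run_lengths)


# Usage: role A sends 2 messages, B replies once, A sends 3, B replies once.
speaker_sequence = ["A", "A", "B", "A", "A", "A", "B"]
average_run = acmc(speaker_sequence)  # 7 messages over 4 turns
```

A strictly single-step dialogue alternates speakers on every message, so its run lengths are all 1 and the function returns exactly 1.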

5 Results

GPT4 Llama3-70b
Metrics $\alpha$ $\beta$ $\gamma$ Stephanie $\alpha$ $\beta$ $\gamma$ Stephanie
Interesting 82.00 80.00 84.66 88.35 80.20 75.25 88.74 91.63
Informative 83.20 80.35 85.03 88.19 79.24 72.93 86.67 88.29
Natural 87.25 88.19 91.89 94.72 86.82 84.36 95.00 97.61
Engaging 85.74 84.84 89.42 92.64 83.64 78.74 93.25 95.86
On-topic 91.54 93.53 96.35 87.78 95.88 97.10
On-persona 92.93 94.25 96.0 90.05 96.53 98.10
Table 1: Automatic evaluation on GPT-4 and Llama3-70b. The values represent the percentage scores for each metric, used to evaluate the performance of different dialogues ($\alpha$, $\beta$, $\gamma$, Stephanie) generated by GPT-4 and Llama3-70b. These scores indicate how interesting, engaging, informative, and natural each dialogue is, as well as its adherence to the given topic and persona. Bold values indicate the highest scores among comparable models, highlighting exceptional performance in specific metrics.
Metrics $\alpha$ $\beta$ Stephanie
Interesting 79.73 72.14 83.53
Informative 75.48 74.56 79.37
Natural 79.79 75.87 87.41
Engaging 83.55 78.38 86.41
On-topic 78.87 82.57
On-persona 77.24 80.04
Table 2: Automatic evaluation on phi3-3.8b. The table shows the percentage scores of the performance of three dialogues ($\alpha$, $\beta$, $\gamma$) across multiple metrics. Bold values represent the best performance in each metric for the phi3-3.8b evaluation.
Metrics $\alpha$ $\beta$ $\gamma$ Stephanie
Interesting 2.93 2.85 3.53 3.68
Informative 3.71 3.13 3.78 3.91
Natural 2.97 2.89 3.65 3.97
Engaging 3.13 2.96 3.72 4.06
On-topic 3.30 3.79 3.99
On-persona 3.17 3.73 3.89
Table 3: Human evaluation on GPT-4. The table presents human evaluation scores for different dialogues ($\alpha$, $\beta$, $\gamma$, Stephanie) generated by GPT-4. Scores range from 1 to 5, with higher scores indicating better performance.
Metrics $\alpha$ $\beta$ $\gamma$ Stephanie
words/message 11.77 13.67 8.12 5.87
ACMC 1.07 1.08 1.99 2.51
Table 4: Words/message and ACMC on dialogues. The table compares the average words per message and the Average Consecutive Message Counts (ACMC) across different dialogues ($\alpha$, $\beta$, $\gamma$, Stephanie). This helps in evaluating the verbosity and interaction depth of each dialogue.
Dialogues one two three four five
$\alpha$ 92.65 7.35 0 0 0
$\beta$ 91.26 8.74 0 0 0
$\gamma$ 20.50 60.10 17.98 1.21 0.1
Stephanie 11.17 39.24 34.33 10.86 2.97
Table 5: The proportion of consecutive message counts. The table shows the percentage of turns with a given number of consecutive messages (one, two, three, four, five) for different dialogues ($\alpha$, $\beta$, $\gamma$, Stephanie). Higher counts indicate a greater tendency for step-by-step dialogues within an interaction.
Metrics Stephanie-Llama3-8b Llama3-8b
Interesting 3.67 3.01
Informative 3.81 3.22
Natural 4.13 3.57
Engaging 3.89 3.31
Table 6: Human evaluation on the Stephanie-Llama3-8b dialogue system. The table presents human evaluation scores for the Stephanie-Llama3-8b and Llama3-8b dialogue systems across four metrics: Interesting, Informative, Natural, and Engaging. Scores range from 1 to 5, with higher scores indicating better performance. The fine-tuned Stephanie-Llama3-8b model outperforms the Llama3-8b model across all metrics.
Figure 3: Distinct-N results for different dialogues. This graph displays the lexical diversity of the dialogues, measured by the Distinct-N metric for n-grams from N=2 to N=6. Each colour represents a different dialogue type ($\alpha$, $\beta$, $\gamma$, Stephanie), highlighting variations in linguistic complexity and diversity.

5.1 Evaluation of Conversation Quality

We selected 100 conversations from the PERSONA-CHAT dataset as the Original Single-Step Dialogue $\alpha$. First, we used GPT-4 to summarize the themes of these 100 dialogues. Then, combining these themes with the personas of the dialogue participants as background information, we wrote prompts using the Step-by-Step Dialogue Prompt Framework proposed in this paper. These prompts were fed into GPT-4 to generate the Generated Single-Step Dialogue $\beta$ and the Generated Step-by-Step Dialogue $\gamma$. We then applied the further-split method to optimize $\gamma$, resulting in the Further-Split Step-by-Step Dialogue (Stephanie). Additionally, we conducted corresponding experiments with the Llama3-70b and phi3-3.8b models, generating their respective $\alpha$, $\beta$, $\gamma$, and Stephanie.

Subsequently, we conducted automatic assessments of the dialogues from the three models on six metrics: Interesting, Informative, Natural, Engaging, On-topic, and On-persona, with Claude-3-sonnet as the assessment expert providing scores from 0 to 100, as shown in Tables 1 and 2. The $\beta$ dialogues generated by the large models were generally weaker than the original dialogues on most metrics, with the exception of the ‘Natural’ metric for GPT-4, where $\beta$ performed better than $\alpha$. This indicates that single-step dialogues generated by large models are generally inferior to original human dialogues. The $\gamma$ dialogues were significantly superior to $\alpha$ on all six metrics, demonstrating the superiority of the step-by-step dialogue paradigm. Stephanie showed further improvement over $\gamma$, highlighting the effectiveness of the further-split method. Additionally, we conducted a human evaluation of the GPT-4 dialogues, inviting three advanced graduate students majoring in English to score on a 0-5 scale. The results, reported in Table 3, were consistent with the automatic evaluation.

We computed further statistics on the $\beta$, $\gamma$, and Stephanie dialogues generated by GPT-4. Table 4 presents the statistics for $\alpha$, $\beta$, $\gamma$, and Stephanie, including the average number of words per message (Words/Message) and the Average Consecutive Message Counts (ACMC). The results show that $\beta$ is similar to $\alpha$, with $\beta$’s Words/Message being slightly higher than $\alpha$’s. Compared to $\alpha$ and $\beta$, $\gamma$ has fewer Words/Message and a higher ACMC, indicating that messages in step-by-step dialogues tend to be shorter and that each turn contains more messages. Notably, Stephanie, in comparison to $\gamma$, further reduces Words/Message and increases ACMC, demonstrating the effectiveness of the further-split method. Table 5 displays the proportion of consecutive message counts, where it is also evident that $\gamma$, compared to $\alpha$ and $\beta$, has more consecutive replies. Furthermore, Table 5 shows that Stephanie shifts the distribution of the number of consecutive messages to the right relative to $\gamma$.
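As a quick consistency check between Table 5 and Table 4, the mean number of consecutive messages can be recomputed from the Table 5 percentages. The sketch below treats each percentage as a weight out of 100 and ignores any turns with more than five messages, which is an assumption made here (some rows do not sum to exactly 100); under it, the recomputed means land within about 0.01 of the ACMC values in Table 4.

```python
# Percentages of turns with one..five consecutive messages, copied from Table 5.
table5 = {
    "alpha": [92.65, 7.35, 0, 0, 0],
    "beta": [91.26, 8.74, 0, 0, 0],
    "gamma": [20.50, 60.10, 17.98, 1.21, 0.1],
    "Stephanie": [11.17, 39.24, 34.33, 10.86, 2.97],
}

# ACMC values copied from Table 4, for comparison.
table4_acmc = {"alpha": 1.07, "beta": 1.08, "gamma": 1.99, "Stephanie": 2.51}


def mean_consecutive_messages(percentages):
    """Weighted mean of the message count k, with Table 5 percentages as weights."""
    return sum(k * p for k, p in enumerate(percentages, start=1)) / 100


recomputed = {name: mean_consecutive_messages(row) for name, row in table5.items()}
for name, value in recomputed.items():
    print(f"{name}: recomputed {value:.4f}, Table 4 reports {table4_acmc[name]}")
```

For Stephanie, for example, the weighted sum is 11.17·1 + 39.24·2 + 34.33·3 + 10.86·4 + 2.97·5 = 250.93, giving about 2.51 consecutive messages per turn.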

To assess lexical diversity among the dialogues, Figure 3 provides a comparative analysis using the Distinct-N metric for n-grams ranging from N=2 to N=6. As shown in Figure 3, the Original Single-Step Dialogue $\alpha$ maintains high diversity, which increases with the length of the n-grams, reflecting typical human dialogue characteristics. However, the Generated Single-Step Dialogue $\beta$ exhibits lower diversity scores, especially for higher-order n-grams, indicating limitations in linguistic variability. Notably, the Generated Step-by-Step Dialogue $\gamma$ and the Further-Split Step-by-Step Dialogue (Stephanie) show superior performance, with Stephanie achieving the highest diversity across most categories. The strong performance of Stephanie on longer n-grams highlights the effectiveness of the further-split method in producing dialogues that are diverse and closely mimic the complex linguistic structures of human communication. This demonstrates that our proposed generation methods and prompting framework can significantly enhance the quality and human-likeness of machine-generated text.

5.2 Fine-Tuning with Step-by-Step Dialogue

Having demonstrated the effectiveness of our proposed paradigm, generation methods, and prompting framework, we aimed to provide a high-quality dataset for fine-tuning existing large language models. To this end, we generated a high-quality step-by-step dialogue dataset consisting of 5,457 segments using the Llama3-70b model. We then fine-tuned the Llama3-8b model on this dataset to create the Stephanie-Llama3-8b model. We engaged five testers to interact with both the Llama3-8b and Stephanie-Llama3-8b dialogue systems and to assess them on four metrics. The results show that the model fine-tuned on the step-by-step dialogue dataset exhibited superior step-by-step dialogue capabilities, outperforming the Llama3-8b model on all four metrics, as presented in Table 6.

6 Conclusion

The step-by-step dialogue paradigm introduced in this paper enhances human-like interaction in dialogue systems. By integrating a dual learning strategy and a further-split post-editing method, we have generated dialogue data with Stephanie that is more interesting, natural, engaging, and emotionally nuanced. Our evaluations demonstrate that Stephanie systems significantly outperform traditional single-step dialogue systems across various metrics. We plan to release our code, the Stephanie dataset, and the Stephanie systems in the near future to facilitate future chatbot development.

Limitations

Our manual testing was conducted with limited human resources. We look forward to seeing how effective this technique is when applied to larger-scale consumer products.

Ethics Statement

We honour and support the ACL Code of Ethics. The datasets used in this work are well-known and widely used, and the dataset pre-processing does not make use of any external textual resource. In our view, there is no known ethical issue. End-to-end pre-trained generators are also used, and these are susceptible to generating offensive content. However, this issue is widely known to be common among such models. Any content generated does not reflect the views of the authors.

References

  • Abbas et al. (n.d.) Rizwan Abbas, Bingnan Ni, Ruhui Ma, Teng Li, Yehao Lu, and Xi Li. Context-based emotion recognition: A survey. Available at SSRN 4657124.
  • Abdin et al. (2024) Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219.
  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Adiwardana et al. (2020) Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.
  • Butler (2011) Emily A Butler. 2011. Temporal interpersonal emotion systems: The “ties” that form relationships. Personality and Social Psychology Review, 15(4):367–393.
  • Chen et al. (2023a) Maximillian Chen, Alexandros Papangelis, Chenyang Tao, Seokhwan Kim, Andy Rosenbaum, Yang Liu, Zhou Yu, and Dilek Hakkani-Tur. 2023a. Places: Prompting language models for social conversation synthesis. arXiv preprint arXiv:2302.03269.
  • Chen et al. (2023b) Siyuan Chen, Mengyue Wu, Kenny Q Zhu, Kunyao Lan, Zhiling Zhang, and Lyuchun Cui. 2023b. Llm-empowered chatbots for psychiatrist and patient simulation: application and evaluation. arXiv preprint arXiv:2305.13614.
  • Cheng et al. (2023) Yi Cheng, Wenge Liu, Jian Wang, Chak Tou Leong, Yi Ouyang, Wenjie Li, Xian Wu, and Yefeng Zheng. 2023. Cooper: Coordinating specialized agents towards a complex dialogue goal. arXiv preprint arXiv:2312.11792.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6.
  • Chua (2024) Tat-Seng Chua. 2024. Towards generative search and recommendation: A keynote at recsys 2023. In ACM SIGIR Forum, volume 57, pages 1–14. ACM New York, NY, USA.
  • Du et al. (2022) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
  • Hongru et al. (2023) WANG Hongru, Rui Wang, Fei Mi, Yang Deng, WANG Zezhong, Bin Liang, Ruifeng Xu, and Kam-Fai Wong. 2023. Cue-cot: Chain-of-thought prompting for responding to in-depth dialogue questions with llms. In The 2023 Conference on Empirical Methods in Natural Language Processing.
  • Huang et al. (2020) Minlie Huang, Xiaoyan Zhu, and Jianfeng Gao. 2020. Challenges in building intelligent open-domain dialog systems. ACM Transactions on Information Systems (TOIS), 38(3):1–32.
  • Huber et al. (2018) Bernd Huber, Daniel McDuff, Chris Brockett, Michel Galley, and Bill Dolan. 2018. Emotional dialogue generation using image-grounded language models. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–12.
  • Labruna et al. (2023) Tiziano Labruna, Sofia Brenna, Andrea Zaninello, and Bernardo Magnini. 2023. Unraveling chatgpt: A critical analysis of ai-generated goal-oriented dialogues and annotations. In International Conference of the Italian Association for Artificial Intelligence, pages 151–171. Springer.
  • Lee et al. (2023) Gibbeum Lee, Volker Hartmann, Jongho Park, Dimitris Papailiopoulos, and Kangwook Lee. 2023. Prompted llms as chatbot modules for long open-domain conversation. arXiv preprint arXiv:2305.04533.
  • Liao et al. (2021) Lizi Liao, Le Hong Long, Yunshan Ma, Wenqiang Lei, and Tat-Seng Chua. 2021. Dialogue state tracking with incremental reasoning. Transactions of the Association for Computational Linguistics, 9:557–569.
  • Liu et al. (2023) June M Liu, Donghao Li, He Cao, Tianhe Ren, Zeyi Liao, and Jiamin Wu. 2023. Chatcounselor: A large language models for mental health support. arXiv preprint arXiv:2309.15461.
  • Mai et al. (2023) Jinjie Mai, Jun Chen, Bing Li, Guocheng Qian, Mohamed Elhoseiny, and Bernard Ghanem. 2023. Llm as a robotic brain: Unifying egocentric memory and control. arXiv preprint arXiv:2304.09349.
  • Majumder et al. (2020) Navonil Majumder, Pengfei Hong, Shanshan Peng, Jiankun Lu, Deepanway Ghosal, Alexander Gelbukh, Rada Mihalcea, and Soujanya Poria. 2020. Mime: Mimicking emotions for empathetic response generation. arXiv preprint arXiv:2010.01454.
  • Mi et al. (2022) Fei Mi, Yasheng Wang, and Yitong Li. 2022. Cins: Comprehensive instruction for few-shot learning in task-oriented dialog systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11076–11084.
  • Min et al. (2022) Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2022. MetaICL: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, Seattle, United States. Association for Computational Linguistics.
  • Nie et al. (2024) Weizhi Nie, Yuru Bao, Yue Zhao, and Anan Liu. 2024. Long dialogue emotion detection based on commonsense knowledge graph guidance. IEEE Transactions on Multimedia, 26:514–528.
  • Peng et al. (2022) Wei Peng, Yue Hu, Luxi Xing, Yuqiang Xie, Yajing Sun, and Yunpeng Li. 2022. Control globally, understand locally: A global-to-local hierarchical graph network for emotional support conversation. CoRR, abs/2204.12749.
  • Poria et al. (2019) Soujanya Poria, Navonil Majumder, Rada Mihalcea, and Eduard Hovy. 2019. Emotion recognition in conversation: Research challenges, datasets, and recent advances. IEEE access, 7:100943–100953.
  • Qian et al. (2023) Yushan Qian, Wei-Nan Zhang, and Ting Liu. 2023. Harnessing the power of large language models for empathetic response generation: Empirical investigations and improvements. arXiv preprint arXiv:2310.05140.
  • Ratican and Hutson (2023) Jay Ratican and James Hutson. 2023. The six emotional dimension (6de) model: A multidimensional approach to analyzing human emotions and unlocking the potential of emotionally intelligent artificial intelligence (ai) via large language models (llm). Journal of Artificial Intelligence and Robotics, 1(1).
  • Rogers et al. (2023) Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki. 2023. Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers). In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • Roller et al. (2020) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M Smith, et al. 2020. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637.
  • Semnani et al. (2023) Sina J Semnani, Violet Z Yao, Heidi C Zhang, and Monica S Lam. 2023. Wikichat: A few-shot llm-based chatbot grounded with wikipedia. arXiv preprint arXiv:2305.14292.
  • Song et al. (2022) Xiaohui Song, Liangjun Zang, Rong Zhang, Songlin Hu, and Longtao Huang. 2022. Emotionflow: Capture the dialogue level emotion transitions. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8542–8546.
  • Sun et al. (2021) Hao Sun, Zhenru Lin, Chujie Zheng, Siyang Liu, and Minlie Huang. 2021. Psyqa: A chinese dataset for generating long counseling text for mental health support. arXiv preprint arXiv:2106.01702.
  • Swamy et al. (2023) Sandesh Swamy, Narges Tabari, Chacha Chen, and Rashmi Gangadharaiah. 2023. Contextual dynamic prompting for response generation in task-oriented dialog systems. arXiv preprint arXiv:2301.13268.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: An instruction-following llama model.
  • Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Tu et al. (2022) Quan Tu, Yanran Li, Jianwei Cui, Bin Wang, Ji-Rong Wen, and Rui Yan. 2022. Misc: a mixed strategy-aware model integrating comet for emotional support conversation. arXiv preprint arXiv:2203.13560.
  • Ulmer et al. (2024) Dennis Ulmer, Elman Mansimov, Kaixiang Lin, Justin Sun, Xibin Gao, and Yi Zhang. 2024. Bootstrapping llm-based task-oriented dialogue agents via self-talk. arXiv preprint arXiv:2401.05033.
  • Wang et al. (2024) Liang Wang, Nan Yang, and Furu Wei. 2024. Learning to retrieve in-context examples for large language models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1752–1767, St. Julian’s, Malta. Association for Computational Linguistics.
  • Wu et al. (2023) Tianyu Wu, Shizhu He, Jingping Liu, Siqi Sun, Kang Liu, Qing-Long Han, and Yang Tang. 2023. A brief overview of chatgpt: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica, 10(5):1122–1136.
  • Yamazaki et al. (2023) Takato Yamazaki, Katsumasa Yoshikawa, Toshiki Kawamoto, Tomoya Mizumoto, Masaya Ohagi, and Toshinori Sato. 2023. Building a hospitable and reliable dialogue system for android robots: a scenario-based approach with large language models. Advanced Robotics, 37(21):1364–1381.
  • Zhang et al. (2019) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.
  • Zheng et al. (2023) Zhonghua Zheng, Lizi Liao, Yang Deng, and Liqiang Nie. 2023. Building emotional support chatbots in the era of llms. arXiv preprint arXiv:2308.11584.
  • Zhou and Wang (2017) Xianda Zhou and William Yang Wang. 2017. Mojitalk: Generating emotional responses at scale. arXiv preprint arXiv:1711.04090.