Cohesive Conversations: Enhancing Authenticity in Multi-Agent Simulated Dialogues

KuanChao Chu*
The University of Tokyo
Yi-Pei Chen*
The University of Tokyo
Hideki Nakayama
The University of Tokyo
* Joint first authors. Contact: {kcchu, ypc}@nlab.ci.i.u-tokyo.ac.jp
Abstract

This paper investigates the quality of multi-agent dialogues in simulations powered by Large Language Models (LLMs), focusing on a case study from Park et al. (2023), where 25 agents engage in day-long simulations of life, showcasing complex behaviors and interactions. Analyzing dialogues and memory over multiple sessions revealed significant issues such as repetition, inconsistency, and hallucination, exacerbated by the propagation of erroneous information. To combat these challenges, we propose a novel Screening, Diagnosis, and Regeneration (SDR) framework that detects and corrects utterance errors through a comprehensive process involving immediate issue identification, evidence gathering from past dialogues, and LLM analysis for utterance revision. The effectiveness of the SDR framework is validated through GPT-4 assessments and human evaluations, demonstrating marked improvements in dialogue consistency, diversity, and the reduction of false information. This work presents a pioneering approach to enhancing dialogue quality in multi-agent simulations, establishing a new standard for future research in the field.

1 Introduction

Recent research has leveraged Large Language Models (LLMs) (OpenAI, 2023; Touvron et al., 2023) to power multi-agent simulations, aiming to model complex human behaviors (Park et al., 2023) or enhance multi-agent communication and collaboration (Chen et al., 2023b; Hong et al., 2023). In these simulations, agents are equipped with an LLM as their core, are augmented with additional components such as memory, and communicate iteratively (Cheng et al., 2024). A notable example fully orchestrated by ChatGPT, Generative Agents (Park et al., 2023), simulates 25 agents “living” in a village and sometimes conversing with each other. For more details about how the simulation works, please refer to Appendix A and the original paper.

Despite the advances in multi-agent simulation, there remains a significant gap in evaluating the quality of multi-agent dialogues produced in these simulations, especially at the multi-session level. Park et al. (2023) was evaluated through interviewing the agents, while other LLM-generated multi-session dialogues limit their evaluations to sessions involving the same pair of speakers (Xu et al., 2022; Bae et al., 2022; Jang et al., 2023). None of them have thoroughly scrutinized the multi-agent communication over time, leaving a critical area of research unaddressed.

Our work delves into the simulation log of Park et al. (2023) (hereafter referred to as OneDayLife; available at https://reverie.herokuapp.com/arXiv_Demo/), examining the dialogue content and memory of the participating agents across multiple sessions. We found that, while individual dialogues may appear flawless on the surface, a holistic review of multi-session, multi-agent dialogues identifies significant issues. Specifically, we identified three main problems: repetition, inconsistency, and hallucination, as depicted in Fig. 1 (further examples are available in Appendix E.3). Notably, these problems escalate with the duration of the simulation, exacerbated by the propagation of erroneous information through dialogues among agents.

(a) Eddy is a music-major student working on a music composition project, yet he is inconsistent about what he is working on and repeats others’ utterances after conversing with them.
(b) Even though Jennifer has told John that she is not interested in politics, John hallucinates, in his dialogue with Giorgio, that Jennifer is running in the local mayoral election.
Figure 1: Example dialogues from OneDayLife showing problems of repetition, inconsistency, and hallucination.

To address these challenges, we introduce a novel multi-agent simulation framework featuring a Screening, Diagnosis, and Regeneration (SDR) mechanism. This approach enables the immediate detection and correction of errors in utterances generated by agents. During the Screening stage, potential issues in the candidate utterance are identified, and relevant evidence from past dialogues is gathered. Subsequently, in the Diagnosis stage, an LLM analyzes the evidence and the current dialogue, providing detailed feedback and a score on the authenticity of the candidate utterance. If the utterance is deemed problematic, it is revised during the Regeneration stage, incorporating insights from the preceding stages. Both GPT-4 assessments and human evaluations indicate that our multi-agent, multi-session dialogues are more consistent and contain fewer hallucinations, while automatic metrics show better diversity, supporting the effectiveness of our SDR framework.

In summary, our contributions are threefold: (1) We illuminate the inherent problems within multi-agent, multi-session dialogues. To the best of our knowledge, we are among the first to investigate this particular problem. (2) We propose the SDR framework, a pioneering approach for on-the-fly detection and correction of utterance errors in multi-agent simulations. (3) Through extensive evaluations and analyses, we demonstrate the efficacy of our framework, setting a new standard for dialogue quality in multi-agent simulations.

2 Multi-agent Dialogue Simulation

We propose a Screening, Diagnosis, and Regeneration (SDR) framework for simulating authentic dialogues between multiple agents over a period of time. For each candidate utterance $U_c$ generated by an agent, we examine whether potential errors occur, and re-generate a new $U_c$ if any problem is found. Figure 2 illustrates our SDR system overview.

Figure 2: Overview of the proposed Screening, Diagnosis, Re-generation (SDR) framework, an instant error correction method for multi-agent simulated dialogues. The modules in green are run by the LLM.

2.1 Screening

Due to the limitations of LLMs, including input length restrictions and associated costs, it is impractical to compare each $U_c$ against all past dialogues. Consequently, we begin with a Screening stage to identify potential issues and gather pertinent evidence for each identified issue.

Figure 3: The spread of the keyword “collaboration” in OneDayLife. Left: The number of dialogues and the ratio that includes the keyword in each time span. Middle: Number of dialogues with the keyword in the first 20% of time. Each line represents a dialogue between two agents and the line color indicates the identity of the agent who first mentions the keyword. Right: Number of dialogues with the keyword in all time.
Repetition

Agents often display similar speech patterns, reducing their character distinctiveness. As shown in Fig. 1(a), Eddy tends to replicate phrases from other agents following their conversations. Figure 3 illustrates that as the day progresses, most agents engage in highly similar topics related to collaboration. More examples can be found in Appendix E.3.

We build a dialogue database that stores all utterances prior to the candidate utterance $U_c$, including utterances from previous dialogues and from the current dialogue context. For each $U_c$, we first query the database to retrieve the top $K_{sim}$ most similar utterances. To identify excessively repetitive utterances, we apply a dynamic similarity threshold $\theta_{sim}$, whose value depends on whether the retrieved utterance originates from the same speaker or the same dialogue as $U_c$. More details are given in Appendix B.1. If more than one retrieved utterance surpasses $\theta_{sim}$, all dialogues associated with these utterances are marked as evidence for the subsequent Diagnosis stage. When $U_c$ is nearly identical to a retrieved utterance (similarity greater than $\theta_{force}$), we bypass the remaining pipelines and proceed directly to the Re-generation stage.
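
The repetition screening described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words cosine similarity stands in for whatever sentence-similarity model is actually used, and the names `Utterance`, `dynamic_threshold`, `screen_repetition`, as well as all numeric threshold values, are hypothetical (the actual rule is described in Appendix B.1).

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    dialogue_id: int
    text: str


def bow(text):
    """Bag-of-words counts; an assumed stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def dynamic_threshold(cand, past):
    """Hypothetical theta_sim rule: the values and their ordering are assumptions."""
    if cand.dialogue_id == past.dialogue_id:
        return 0.90
    if cand.speaker == past.speaker:
        return 0.80
    return 0.70


def screen_repetition(cand, database, k_sim=5, theta_force=0.98):
    """Return (force_regeneration, evidence_dialogue_ids) for a candidate utterance."""
    cand_vec = bow(cand.text)
    # Retrieve the top-K_sim most similar past utterances.
    scored = sorted(
        ((cosine(cand_vec, bow(u.text)), u) for u in database),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k_sim]
    # Near-identical match: skip diagnosis and go straight to re-generation.
    if any(sim > theta_force for sim, _ in scored):
        return True, set()
    hits = [u for sim, u in scored if sim > dynamic_threshold(cand, u)]
    # Evidence is collected only when more than one retrieved utterance passes theta_sim.
    evidence = {u.dialogue_id for u in hits} if len(hits) > 1 else set()
    return False, evidence
```

Returning a flag plus a set of dialogue ids keeps the two outcomes (forced re-generation versus evidence for diagnosis) separate for the caller.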

Inconsistency

Factual or logical inconsistencies are an issue across multiple dialogues. For instance, Fig. 1(b) illustrates how John’s statement contradicts Jennifer’s earlier words. Other examples include sudden shifts in opinions, forgetting past statements, and invitations to conflict.

We propose a Natural Language Inference-Graph (NLI-G) module for inconsistency screening. NLI-G consists of three steps. First, we employ the LLM to extract personal information as a list of (Subject, Relation, Object) triplets from each previous dialogue of the involved agents, as well as from the candidate utterance $U_c$. An example is shown in Appendix B.2. Second, after transforming the triplets into text form, we adapt an NLI model to predict potential contradictions by comparing the triplets from previous utterances with those from $U_c$. Utilizing such a graph format helps the NLI model focus on key information about the agents and reduces the negative impact of style discrepancies between the pretraining data and raw dialogue utterances. Finally, triplets with a contradiction score above $\theta_{nlig}$ are considered suspicious and forwarded to the LLM, which selects the top $K_{nlig}$ corresponding dialogues for the subsequent Diagnosis stage.
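
As a rough sketch of the second step and the thresholding in NLI-G, the snippet below compares textualized triplets pairwise. The NLI model is passed in as a callable, so no specific model or library is assumed; `toy_nli`, `screen_inconsistency`, and the threshold value are hypothetical, and the triplet extraction and final LLM selection of the top $K_{nlig}$ dialogues are omitted.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (Subject, Relation, Object)


def triplet_to_text(triplet: Triplet) -> str:
    """Flatten a triplet into a short sentence-like string for the NLI model."""
    return " ".join(triplet)


def screen_inconsistency(
    past: List[Tuple[int, Triplet]],
    cand_triplets: List[Triplet],
    nli_contradiction: Callable[[str, str], float],
    theta_nlig: float = 0.5,  # hypothetical value
):
    """Return suspicious (dialogue_id, past_triplet, cand_triplet, score), highest score first."""
    suspicious = []
    for dialogue_id, old in past:
        for new in cand_triplets:
            score = nli_contradiction(triplet_to_text(old), triplet_to_text(new))
            if score > theta_nlig:
                suspicious.append((dialogue_id, old, new, score))
    return sorted(suspicious, key=lambda item: item[3], reverse=True)


def toy_nli(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model: flags pairs that differ only by the word 'not'."""
    diff = set(premise.lower().split()) ^ set(hypothesis.lower().split())
    return 0.9 if diff == {"not"} else 0.1


past = [
    (1, ("Jennifer", "is not interested in", "politics")),
    (2, ("Eddy", "studies", "music")),
]
candidate = [("Jennifer", "is interested in", "politics")]
flagged = screen_inconsistency(past, candidate, toy_nli)
```

In this toy run only dialogue 1 is flagged; a real NLI model would replace `toy_nli` with its contradiction probability.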

Hallucination

We focus on detecting harmful hallucination, which we define as spurious generated information in $U_c$ that concerns other agents. Such hallucinations can persist within the simulated world through memories and sometimes become accepted as truth over time, despite not aligning with the involved agent’s role. Given that agents are considered distinct individuals, preventing fabricated information about others is crucial.

Figure 4: Examples of hallucination screening. In Case 1, although Abigail is mentioned, it pertains only to Rajiv’s personal plan, not to a fact about Abigail. In Case 2, Ryan objectively describes a past event involving Carlos. However, this event could have been entirely fabricated by Ryan, representing a potential harmful hallucination.

We detect third-party agent mentions via name parsing. Upon identification, we prompt the LLM to rate the utterance on a 1 to 10 scale, indicating the likelihood of being a hallucination, and we flag $U_c$ if the score is greater than $\theta_{fact}$. Two example cases are shown in Fig. 4.

The model is explicitly guided to give a higher score if the following criteria are met: (a) Speaker-objectivity: if it is an objective statement regarding another agent from the speaker’s perspective. (b) Discernment capability of the mentioned agent: if the referenced agent can currently verify the statement’s truthfulness. (c) Impact: the statement, if fabricated but later accepted as truth, significantly impacts the agent.
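
The hallucination screening above might look roughly like the sketch below. The LLM rating is injected as a callable (`rate_fn`) so that no particular API is assumed; the function names, the word-boundary matching, and the value of `theta_fact` are all hypothetical.

```python
import re


def mentioned_third_parties(utterance, agent_names, speaker, listener):
    """Name parsing: find known agents, other than the two interlocutors, by first name."""
    found = []
    for name in agent_names:
        if name in (speaker, listener):
            continue
        first_name = name.split()[0]
        if re.search(rf"\b{re.escape(first_name)}\b", utterance):
            found.append(name)
    return found


def screen_hallucination(utterance, agent_names, speaker, listener, rate_fn, theta_fact=6):
    """Return (flagged, mentions). rate_fn is assumed to return an LLM score from 1 to 10."""
    mentions = mentioned_third_parties(utterance, agent_names, speaker, listener)
    if not mentions:
        return False, mentions  # no third party mentioned, nothing to rate
    score = rate_fn(utterance, mentions)
    return score > theta_fact, mentions


agents = ["John", "Jennifer", "Giorgio"]
always_high = lambda utterance, mentions: 9  # stub in place of the LLM rating call
result = screen_hallucination(
    "Jennifer is running for local mayor.", agents, "John", "Giorgio", always_high
)
```

Skipping the LLM call when no third party is mentioned keeps the screening cheap for the majority of utterances.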

2.2 Diagnosis

We utilize the LLM to further diagnose the authenticity of $U_c$ if evidence dialogues are provided or if $U_c$ is flagged by the preceding Screening stage. In the former case, we prompt the LLM with pipeline-specific prompts given the evidence and the current dialogue. The outputs are a score representing the severity of the issue and the reason for that score. In the latter case, we provide the information of the mentioned agent and ask the LLM to output whether the mentioned agent agrees with $U_c$, along with an explanatory comment. In practice, we repeat the Diagnosis stage for $N_{diag}$ trials and select the one with the highest score. The prompts are provided in Appendix F.
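
The repeated-trial selection can be sketched as below. `llm_diagnose` is a hypothetical callable standing in for the prompted LLM, and the `Diagnosis` record is an assumed output shape (score plus comment) rather than the paper's actual format.

```python
from typing import Callable, List, NamedTuple


class Diagnosis(NamedTuple):
    score: int    # severity of the issue
    comment: str  # reason given for the score


def diagnose(
    candidate: str,
    evidence: List[str],
    llm_diagnose: Callable[[str, List[str]], Diagnosis],
    n_diag: int = 3,  # hypothetical number of trials
) -> Diagnosis:
    """Run the diagnosis n_diag times and keep the trial with the highest score."""
    trials = [llm_diagnose(candidate, evidence) for _ in range(n_diag)]
    return max(trials, key=lambda d: d.score)


# Stubbed LLM outputs for three trials.
outputs = iter([
    Diagnosis(3, "minor overlap"),
    Diagnosis(8, "contradicts an earlier statement"),
    Diagnosis(5, "somewhat repetitive"),
])
best = diagnose("candidate utterance", ["evidence dialogue"], lambda c, e: next(outputs))
```

Keeping the highest-scoring trial biases the stage toward catching problems, at the price of more false alarms.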

2.3 Re-generation

We collect results from all pipelines and retain only comments with a score above $\theta_{regen}$ or those indicating disagreement. If no comments remain, the correction process terminates, $U_c$ is saved to the dialogue database, and the model continues to generate the next utterance of the other agent. Otherwise, the LLM is used to integrate all comments and provide suggestions for improvement, alleviating the vagueness of simple feedback (Liang et al., 2023). The prompt for re-generation is enriched by appending the comments to the original prompt that was used to generate $U_c$.

Our SDR procedure continues until either of two conditions is met: it completes $R$ rounds of iteration, or it reaches a point where no further comments are provided, indicating that the identified issues have been resolved.
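
Putting the stopping rule together, one possible shape of the correction loop is sketched below. The callables `generate` and `review` are hypothetical stand-ins for utterance generation and for the combined Screening and Diagnosis stages; the feedback formatting and the value of `theta_regen` are assumptions, and the LLM step that integrates comments into suggestions is omitted.

```python
def sdr_generate(generate, review, base_prompt, max_rounds=3, theta_regen=5):
    """Generate an utterance and revise it for at most max_rounds (R) rounds.

    generate(prompt) -> str
    review(utterance) -> list of (score, comment, disagrees) from all pipelines
    """
    utterance = generate(base_prompt)
    for _ in range(max_rounds):
        # Keep only comments above the threshold or those signalling disagreement.
        comments = [c for score, c, disagrees in review(utterance)
                    if score > theta_regen or disagrees]
        if not comments:
            break  # no remaining issues: accept the utterance
        # Append the feedback to the original generation prompt.
        prompt = base_prompt + "\nFeedback:\n" + "\n".join(f"- {c}" for c in comments)
        utterance = generate(prompt)
    return utterance


calls = []


def fake_generate(prompt):
    calls.append(prompt)
    return "revised line" if "Feedback" in prompt else "repeated line"


def fake_review(utterance):
    if utterance == "repeated line":
        return [(9, "repeats an earlier utterance", False)]
    return []


final = sdr_generate(fake_generate, fake_review, "You are Eddy. Reply to Maria.")
```

With the stubs above, the first candidate is rejected once and the revised one is accepted, so generation is called twice.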

2.4 Prompt Design

We develop multiple prompt variants to increase diversity when re-generating the revised $U_c$, inspired by the finding that varied linguistic prompts induce output variance (Leidinger et al., 2023). The LLM often re-generates an identical or very similar response because most of the prompt content is the same as in the initial response generation prompt, despite the additional feedback provided to guide the generation and the frequency and presence penalties being set (see https://platform.openai.com/docs/guides/text-generation/parameter-details). We design two types of utterance generation prompts: a persona-based narrative prompt and a structured task-oriented prompt. The former asks the model to play the role of the given persona and engage in a conversation, which is more narrative and immersive. The latter breaks down all information into clear components, which is less about storytelling and more about providing structured data for a specific task (in this case, generating a response in a conversation). The prompts of these two types are shown in Appendix F.4.

3 Experiment

3.1 Data and Settings

We conduct the simulation using data from OneDayLife. After removing dialogues with only one utterance, there are 290 dialogues in total between 25 agents. We regenerate the whole dialogue $D_{ij}^{t}$ between two agents $A_i$ and $A_j$ at time point $t$. $A_i$ and $A_j$ take turns generating utterances until one has no reply or the dialogue reaches 16 turns. At each turn, the LLM is provided with the speaker’s persona $P_i$, memories $M_i^{t}$, location, statuses $S_i^{t}, S_j^{t}$ at time $t$, and the dialogue histories between the two agents $D_{ij}^{k}$, $k \in \{1, 2, \ldots, K^{t-1}\}$.
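
The turn-taking rule can be illustrated with a short sketch. It assumes that one turn equals one utterance and that an agent signals "no reply" by returning `None`; both are assumptions made for illustration, and `simulate_dialogue` is a hypothetical name.

```python
def simulate_dialogue(reply_fns, max_turns=16):
    """Alternate between two agents until one has no reply or max_turns is reached.

    reply_fns: pair of callables, each mapping the dialogue history to a reply or None.
    """
    history = []
    for turn in range(max_turns):
        reply = reply_fns[turn % 2](history)
        if reply is None:  # the current speaker has nothing more to say
            break
        history.append(reply)
    return history


# Scripted stand-ins for the two LLM-backed agents.
script_a = iter(["Hi Maria", "How is your project?", None])
script_b = iter(["Hello Eddy", "It is going well"])
dialogue = simulate_dialogue([lambda h: next(script_a), lambda h: next(script_b)])
```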

We use GPT-3.5-turbo, the model used by Park et al. (2023) (denoted as Origin), as the backbone LLM throughout the SDR framework. To compare against a stronger baseline, we rerun the simulation from Origin but generate three candidates for each $U_c$ and select the best one as judged by the LLM; this setting is denoted as Baseline. More detailed settings are provided in Appendix D.1.

In the simulation framework, each generated dialogue at time $t$ can alter the memory, location, and status of the agents, and thus may change the dialogues that follow $t$. To ensure a fair comparison with Origin, we opted not to regenerate new memories, locations, or statuses for agents following their conversations. Instead, we treated each dialogue as a distinct, standalone example.

3.2 Evaluation

We conduct corpus-level evaluation on three key aspects: diversity, consistency, and factualness, corresponding to the three error types - repetition, inconsistency, and hallucination - we aim to address.

Diversity

We employ widely used metrics Distinct-N (Li et al., 2016) and Semantic Distance (Distance) (Dziri et al., 2019) for diversity evaluation. Please refer to Appendix D.2.1 and the original paper for more details.
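
For reference, a common formulation of Distinct-N is the number of unique n-grams divided by the total number of n-grams. The sketch below pools n-grams over the whole corpus and uses whitespace tokenization; both choices are assumptions, and the exact setup used in the paper is given in Appendix D.2.1.

```python
def distinct_n(utterances, n):
    """Ratio of unique n-grams to all n-grams, pooled over the given utterances."""
    ngrams = []
    for text in utterances:
        tokens = text.lower().split()
        # zip over n shifted copies of the token list yields consecutive n-grams.
        ngrams.extend(zip(*(tokens[i:] for i in range(n))))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


sample = ["a b a b"]
d1 = distinct_n(sample, 1)  # 2 unique unigrams out of 4
d2 = distinct_n(sample, 2)  # 2 unique bigrams out of 3
```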

To further analyze diversity from the perspective of each individual agent, we propose an agent-based metric, Agent Diversity (Agent Div). This metric is based on the idea that an agent should adjust the conversation content according to the identity of the listener and not always talk about the same thing with everyone. The corpus-level Agent Div is the average of the per-agent Agent Div values, each of which is one minus the normalized similarity between the dialogues of that agent. The algorithm is provided in Appendix D.2.1.
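
A minimal sketch of the idea behind Agent Div follows. The actual algorithm is in Appendix D.2.1 and is not reproduced here; this version assumes the per-agent similarity is the mean pairwise similarity between that agent's dialogues and uses Jaccard token overlap (already in [0, 1]) as a stand-in similarity function.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set overlap in [0, 1]; an assumed stand-in for the paper's similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def agent_div(dialogues_by_agent, similarity=jaccard):
    """Return (corpus-level Agent Div, per-agent values)."""
    per_agent = {}
    for agent, dialogues in dialogues_by_agent.items():
        pairs = list(combinations(dialogues, 2))
        if not pairs:
            continue  # an agent with fewer than two dialogues has no pairwise similarity
        mean_sim = sum(similarity(a, b) for a, b in pairs) / len(pairs)
        per_agent[agent] = 1.0 - mean_sim
    overall = sum(per_agent.values()) / len(per_agent) if per_agent else 0.0
    return overall, per_agent


overall, per_agent = agent_div({
    "Eddy": ["let us collaborate on music", "let us collaborate on music"],
    "Maria": ["physics homework today", "coffee with friends"],
})
```

In this toy input, Eddy says the same thing in both dialogues (diversity 0) while Maria's dialogues share no tokens (diversity 1).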

Consistency and Factualness

We assess the corpus-level consistency and factualness by GPT-4 and human evaluation.

- GPT-4: Recent works (Liu et al., 2023; Mendonça et al., 2023; Luo et al., 2023; Gao et al., 2023) have shown that GPT-4 evaluation correlates closely with human judgements. In addition, it is scalable and reproducible. Thus, we select GPT-4 for this challenging task. The process to retrieve evidence dialogues is the same as in the Screening stage described in Section 2.1. GPT-4 is then used to score the consistency and factualness of the current dialogue on a scale from 1 to 10. The error rate is the ratio of dialogues receiving a score below a threshold of 8, which we determined based on our empirical observations.

- Human Evaluation: We conduct human evaluation on the most challenging final 10% of the conversations. We recruited two annotators and provided them with exactly the same prompt as GPT-4 to rate consistency and factualness. The average scores are reported in this paper. The intraclass correlation coefficients (Shrout & Fleiss, 1979; McGraw & Wong, 1996) for factualness and consistency are 0.47 and 0.44, respectively.
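
The error rate used above is simple to state in code. In this small sketch, a score equal to the threshold does not count as an error, following the "below a threshold of 8" wording; the function name and the example scores are illustrative only.

```python
def error_rate(scores, threshold=8):
    """Fraction of dialogues whose score falls strictly below the threshold."""
    return sum(score < threshold for score in scores) / len(scores)


rate = error_rate([9, 7, 8, 10])  # only the score of 7 counts as an error
```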

Fluency

We use the perplexity derived from GPT-2 as the fluency metric. We do not emphasize fluency evaluation, as our observations indicate that all generated dialogues are highly fluent and grammatically correct.

4 Result and Discussion

| Method | Diversity: Distinct-1 / 2 / 3 | Diversity: Distance | Diversity: Agent Div | Factualness: Score | Factualness: Error (↓) | Consistency: Score | Consistency: Error (↓) | Fluency: PPL (↓) | Turns | Words |
|---|---|---|---|---|---|---|---|---|---|---|
| Origin | 0.117 / 0.473 / 0.726 | 0.234 | 0.454 | 8.58 | 24.5% | 8.17 | 37.2% | 20.37 | 9.6 | 25.4 |
| Baseline | 0.124 / 0.469 / 0.718 | 0.274 | 0.475 | 8.77 | 25.5% | 8.10 | 39.7% | 20.18 | 15.5 | 29.3 |
| SDR (Ours) | 0.132 / 0.521 / 0.773 | 0.311 | 0.502 | 8.89 | 19.0% | 8.27 | 32.4% | 19.73 | 10.3 | 42.5 |

Table 1: Corpus-level (multi-dialogues) evaluation. Turns and Words refer to the average number of turns per dialogue and words per turn.

SDR Achieves the Best Diversity, Factualness, Consistency, and Fluency in Multi-dialogue Contexts

As illustrated in Table 1, our SDR framework (Ours) achieves the best corpus-level dialogue diversity, factualness, consistency, and fluency. SDR maintains an average number of turns similar to Origin, in contrast to the Baseline, which tends to continue until reaching the predetermined maximum turn count (16). Although SDR dialogues do not have many turns, each utterance conveys more comprehensive information, as evidenced by the higher word count per turn in SDR.

(a) The number of dialogues containing the 6 most repetitive words from TF-IDF.
(b) Comparison of Agent Diversity and the number of dialogues each agent is involved in.
Figure 5: Diversity Analysis
SDR Significantly Reduces Keyword Repetition

Figure 5(a) demonstrates how our approach effectively reduces the repetition of the most frequently used keywords. To delve deeper into the occurrence of repetition across all dialogues, we conducted an analysis focusing on keyword frequency. Specifically, we determined the keywords by TF-IDF scores and counted the number of dialogues containing each of the top 6 noun keywords. (Since “collaboration” and “election” have multiple variations, we use the root forms of these words, “collabora” and “elect,” to ensure a more accurate representation of their usage across all dialogues.) Figure 5(a) shows that our method substantially decreases the number of dialogues mentioning key terms, particularly “creativity” and “collaboration.” Compared to Origin, the number of dialogues featuring these keywords is reduced by up to 47% and 44%, respectively.
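
Counting dialogues by root form can be sketched as below. The case-insensitive substring match is an assumption about how the roots are applied, the TF-IDF keyword selection is omitted, and the example dialogues are invented purely for illustration.

```python
def count_dialogues_with_root(dialogues, root):
    """Number of dialogues whose text contains the root form (case-insensitive)."""
    root = root.lower()
    return sum(root in dialogue.lower() for dialogue in dialogues)


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


toy_dialogues = [
    "We should collaborate on this piece.",
    "Our collaboration went well yesterday.",
    "Nice weather today.",
]
n_collab = count_dialogues_with_root(toy_dialogues, "collabora")
```

Matching on the root "collabora" catches both "collaborate" and "collaboration", which is the stated reason for using root forms.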

Agent Div Negatively Correlated to the Number of Initiated Dialogues

We analyze the relation between Agent Diversity and dialogue volume in Fig. 5(b). We found that Agent Div is negatively correlated with the number of dialogues an agent is involved in. Although there are a few exceptions, the figure generally shows that the more an agent talks, the lower its diversity.

| Method | Factualness: GPT-4 | Factualness: Human | Consistency: GPT-4 | Consistency: Human |
|---|---|---|---|---|
| Origin | 8.34 | 6.90 | 7.41 | 5.00 |
| SDR (Ours) | 8.96 | 8.57 | 8.18 | 6.14 |

Table 2: Human evaluation on the last 10% of dialogues.

Figure 6: Error rate trends over time.
SDR Enhances Consistency and Reduces Hallucination

As illustrated in Table 2, the human evaluation reflects the same trend as the GPT-4 assessment, showing that SDR is superior to Origin in both factualness and consistency. Additionally, the scores for factualness are consistently higher than those for consistency in both the human and GPT-4 evaluations. Notably, the GPT-4 scores are higher than the human scores in all cases, which might indicate that there are errors not detected by GPT-4.

Furthermore, we plot the error rate across percentiles of the number of dialogues, as shown in Fig. 6. We observed that the errors gradually increase toward higher percentiles, especially in the last one, suggesting that inconsistencies or contradictions become more frequent as the simulation progresses. The analysis of error rate trends shows that SDR generally exhibits lower error rates than Origin in both factualness and consistency. Additionally, the slope of the Consistency error rate for Origin is 1.5 times that of SDR. This disparity indicates more pronounced error propagation in Origin, highlighting the effectiveness of our method in maintaining dialogue integrity over time.

| Prompt Info | Prompt Type | Diversity: Distinct-1 / 2 / 3 | Diversity: Distance | Factualness: Score | Factualness: Error (↓) | Consistency: Score | Consistency: Error (↓) | Fluency: PPL (↓) | Turns | Words |
|---|---|---|---|---|---|---|---|---|---|---|
| Origin | | 0.445 / 0.724 / 0.886 | 0.212 | 8.34 | 31.0% | 7.41 | 55.2% | 22.2 | 8.1 | 24.4 |
| Baseline | | 0.323 / 0.709 / 0.869 | 0.238 | 8.07 | 44.8% | 7.72 | 41.4% | 20.1 | 15.0 | 31.2 |
| All | Task | 0.278 / 0.742 / 0.918 | 0.306 | 8.45 | 31.0% | 8.21 | 34.5% | 21.3 | 10.6 | 36.0 |
| All | Persona | 0.286 / 0.751 / 0.917 | 0.288 | 8.52 | 27.6% | 7.79 | 41.4% | 20.3 | 9.9 | 43.9 |
| All | Mixed | 0.292 / 0.744 / 0.919 | 0.303 | 8.66 | 27.6% | 8.21 | 44.8% | 19.0 | 11.0 | 41.8 |
| - background | Mixed | 0.303 / 0.751 / 0.921 | 0.338 | 8.79 | 32.1% | 8.14 | 32.1% | 20.4 | 9.4 | 33.7 |
| - memory | Mixed | 0.349 / 0.778 / 0.931 | 0.305 | 8.96 | 17.9% | 8.18 | 35.7% | 19.2 | 10.6 | 44.1 |
| - history | Mixed | 0.319 / 0.774 / 0.926 | 0.292 | 8.38 | 31.0% | 8.69 | 31.0% | 20.2 | 9.9 | 42.8 |
| - status | Mixed | 0.271 / 0.717 / 0.898 | 0.257 | 8.39 | 25.0% | 8.21 | 35.7% | 19.6 | 9.9 | 49.0 |

Table 3: Ablation study on the last 10% conversations in OneDayLife.

SDR can Balance Diversity and Faithfulness

Table 3 shows the ablation study for various prompt designs. The ablation study is conducted on the last percentile of conversations, where Origin has the worst factualness and consistency scores. We first identify the benefit of using diverse prompt types. Randomly picking between the structured task-oriented prompt and the persona-based narrative prompt yields results that are better than or comparable to using either of them alone. We also investigate whether all information in the original prompt is necessary. Surprisingly, we found that prompts excluding memory outperformed the others in most aspects. This outcome seems counterintuitive, as memory is generally considered crucial for preventing hallucinations and ensuring consistency. However, memory can impose a strong constraint that may reduce conversational diversity. By opting for a no-memory prompt, we open the door to more diverse conversational content, while our SDR framework ensures that the utterance $U_c$ remains consistent with previous dialogues and free from critical factual errors. This approach allows us to strike a balance between diversity and faithfulness in multi-agent multi-dialogue generation.

Effectiveness of Screening and Diagnosis Stages

We evaluate whether our SDR framework can successfully detect potential problems by calculating the precision (P), recall (R), and F1 score for the output from the Screening and Diagnosis stages, using a subset of OneDayLife. We place particular emphasis on recall, as our primary objective is to identify potential issues at these stages. We manually annotated 50 utterances as either problematic or non-problematic and processed them through our framework up to the Diagnosis stage. The results validate the effectiveness of the procedures in the Screening and Diagnosis stages, with P, R, and F1 scores of 64.7%, 88.0%, and 74.6%, respectively. We also show an example which fails to pass all three pipelines in Table 5 in Appendix E.1.
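
As a quick arithmetic check, the reported F1 score follows from the reported precision and recall through the standard harmonic-mean formula $F_1 = 2PR / (P + R)$. The snippet below only reproduces that calculation; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported values: P = 64.7%, R = 88.0%.
f1 = round(f1_score(0.647, 0.880), 3)  # 0.746, matching the reported 74.6%
```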

NLI-G Study

We assess the effectiveness of NLI-G by using evaluation scores on original dialogues. We compare the factualness and consistency scores of dialogues based on two sources of evidence: evidence retrieved by NLI-G and evidence from the agent’s previous dialogues (denoted as Prev). In both cases, the number of pieces of evidence is five. In Fig. 7, the left figure shows that the scores with NLI-G are generally lower than with Prev (average differences of -0.82 for factualness and -0.62 for consistency), indicating NLI-G’s superior ability to capture crucial evidence for GPT-4 in identifying issues. Additionally, we examine NLI-G’s retrieval variance by evaluating the last 10% of dialogues five times, counting the frequency of evidence dialogues. The right figure illustrates the composition of each dialogue’s evidence count, with colored sections representing the proportions of the top five evidence (average colored area: 61.8% of the bar) and gray for the others. This demonstrates NLI-G’s consistency in retrieving similar dialogues across different trials, even when the number of available evidence dialogues - equal to the quantity of past dialogues involving either of the agents - exceeds 50 in later dialogues.

Figure 7: The performance study of NLI-G. Left: Score difference between evidence retrievals using NLI-G and prev-K. Right: Frequency distribution of dialogues retrieved by NLI-G. Each bar represents a dialogue in the last 10% of OneDayLife, sorted by the number of available evidence dialogues.

5 Related Work

LLM-powered Agents

A typical LLM-powered agent encompasses predefined or dynamically generated prompt templates to utilize the LLM’s capabilities and achieve specific functionalities (Zhao et al., 2023). The agent of Wang et al. (2023) can autonomously navigate the game world, maintaining a skill library to expand its problem-solving scope. Richards et al. (2023) and Nakajima (2023) employ a chain-of-thought (Wei et al., 2022) approach to provide automated solutions for designated tasks. Nevertheless, a single agent has its limitations. The collective intelligence formed by multiple agents can yield results greater than the sum of its parts. Hong et al. (2023) integrate specialized human SOP expertise to accomplish intricate software development. Chen et al. (2023b) focus on designing communication mechanisms among agents to enhance decision-making efficacy.

Revision Strategy

While Madaan et al. (2023); Skreta et al. (2023) employ self-feedback to improve LLM performance and task success rates, our approach prioritizes a comprehensive analysis across multiple dialogues. Other strategies, such as prompt engineering, may extend the length of prompts without addressing the challenge of extracting essential information from extended inputs. SDR adheres to the divide-and-conquer philosophy, breaking down a problem into manageable parts for more effective resolution.

Machine-Generated Dialogues

Our work is distinct from other machine-generated multi-session dialogue datasets in several aspects. First, the utterances in OneDayLife are generated iteratively by each agent, rather than through full conversation synthesis as in SODA (Kim et al., 2023), PLACES (Chen et al., 2023a), and Dialog Inpainting (Dai et al., 2022). This approach more realistically simulates agent behaviors but presents challenges in maintaining global coherence. Second, the conversations in OneDayLife exhibit causal relationships, unlike the independent dialogue pieces in other datasets. These properties make OneDayLife particularly suitable for our study of the authenticity of LLM agents and for uncovering issues inherent in multi-agent, multi-session dialogues.

6 Conclusion

In conclusion, our study introduces and validates a novel multi-agent dialogue simulation framework equipped with a Screening, Diagnosis, and Regeneration (SDR) mechanism, addressing the inherent challenges in multi-agent, multi-session dialogues such as repetition, inconsistency, and hallucination. By meticulously analyzing the dialogues generated in the OneDayLife simulation, we identified significant issues that compound over time, negatively impacting dialogue quality. Our proposed SDR framework effectively reduces these problems and enhances dialogue diversity, consistency, and factualness, as evidenced by automatic metrics, GPT-4 assessments, and human evaluations. We successfully reduce the keyword repetition, maintain dialogue integrity over multiple sessions, and balance conversational diversity with faithfulness of multi-agent dialogue simulation. Our work not only highlights the previously unaddressed challenges in multi-agent simulations but also sets a new standard for dialogue quality, paving the way for more advanced and realistic simulations in future research.

References

  • Bae et al. (2022) Sanghwan Bae, Donghyun Kwak, Soyoung Kang, Min Young Lee, Sungdong Kim, Yuin Jeong, Hyeri Kim, Sang-Woo Lee, Woomyoung Park, and Nako Sung. Keep me updated! memory management in long-term conversations. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp.  3769–3787, 2022.
  • Chen et al. (2023a) Maximillian Chen, Alexandros Papangelis, Chenyang Tao, Seokhwan Kim, Andy Rosenbaum, Yang Liu, Zhou Yu, and Dilek Hakkani-Tur. Places: Prompting language models for social conversation synthesis. In Findings of the Association for Computational Linguistics: EACL 2023, pp.  844–868, 2023a.
  • Chen et al. (2023b) Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arXiv preprint arXiv:2308.10848, 2023b.
  • Cheng et al. (2024) Yuheng Cheng, Ceyao Zhang, Zhengwen Zhang, Xiangrui Meng, Sirui Hong, Wenhao Li, Zihao Wang, Zekai Wang, Feng Yin, Junhua Zhao, et al. Exploring large language model based intelligent agents: Definitions, methods, and prospects. arXiv preprint arXiv:2401.03428, 2024.
  • Dai et al. (2022) Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Y Zhao, Aida Amini, Qazi Mamunur Rashid, Mike Green, and Kelvin Guu. Dialog inpainting: Turning documents into dialogs. In International conference on machine learning, pp.  4558–4586. PMLR, 2022.
  • Dziri et al. (2019) Nouha Dziri, Ehsan Kamalloo, Kory Mathewson, and Osmar Zaiane. Evaluating coherence in dialogue systems using entailment. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  3806–3812, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1381. URL https://aclanthology.org/N19-1381.
  • Gao et al. (2023) Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. Human-like summarization evaluation with chatgpt. arXiv preprint arXiv:2304.02554, 2023.
  • Hong et al. (2023) Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
  • Jang et al. (2023) Jihyoung Jang, Minseong Boo, and Hyounghun Kim. Conversation chronicles: Towards diverse temporal and relational dynamics in multi-session conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  13584–13606, 2023.
  • Kim et al. (2023) Hyunwoo Kim, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, Ronan Bras, Malihe Alikhani, Gunhee Kim, et al. Soda: Million-scale dialogue distillation with social commonsense contextualization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  12930–12949, 2023.
  • Laurer et al. (2023) Moritz Laurer, Wouter Van Atteveldt, Andreu Casas, and Kasper Welbers. Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI. Political Analysis, pp.  1–33, June 2023. ISSN 1047-1987, 1476-4989. doi: 10.1017/pan.2023.20. URL https://www.cambridge.org/core/product/identifier/S1047198723000207/type/journal_article.
  • Leidinger et al. (2023) Alina Leidinger, Robert van Rooij, and Ekaterina Shutova. The language of prompting: What linguistic properties make a prompt successful? arXiv preprint arXiv:2311.01967, 2023.
  • Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  110–119, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1014. URL https://aclanthology.org/N16-1014.
  • Liang et al. (2023) Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. arXiv preprint arXiv:2310.01783, 2023.
  • Liu et al. (2023) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  2511–2522, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.153. URL https://aclanthology.org/2023.emnlp-main.153.
  • Luo et al. (2023) Zheheng Luo, Qianqian Xie, and Sophia Ananiadou. Chatgpt as a factual inconsistency evaluator for abstractive text summarization. arXiv preprint arXiv:2303.15621, 2023.
  • Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
  • McGraw & Wong (1996) Kenneth O McGraw and Seok P Wong. Forming inferences about some intraclass correlation coefficients. Psychological methods, 1(1):30, 1996.
  • Mendonça et al. (2023) John Mendonça, Patrícia Pereira, Helena Moniz, Joao Paulo Carvalho, Alon Lavie, and Isabel Trancoso. Simple LLM prompting is state-of-the-art for robust and multilingual dialogue evaluation. In Yun-Nung Chen, Paul Crook, Michel Galley, Sarik Ghazarian, Chulaka Gunasekara, Raghav Gupta, Behnam Hedayatnia, Satwik Kottur, Seungwhan Moon, and Chen Zhang (eds.), Proceedings of The Eleventh Dialog System Technology Challenge, pp.  133–143, Prague, Czech Republic, September 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.dstc-1.16.
  • Nakajima (2023) Yohei Nakajima. Babyagi, 2023. URL https://github.com/yoheinakajima/babyagi.
  • OpenAI (2023) OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
  • Park et al. (2023) Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023.
  • Richards & et al. (2023) Toran Bruce Richards and et al. Auto-gpt: An autonomous gpt-4 experiment, 2023. URL https://github.com/Significant-Gravitas/Auto-GPT.
  • Rodgers & Nicewander (1988) Joseph Lee Rodgers and W. Alan Nicewander. Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1):59–66, 1988. doi: 10.1080/00031305.1988.10475524. URL https://doi.org/10.1080/00031305.1988.10475524.
  • Shrout & Fleiss (1979) Patrick E Shrout and Joseph L Fleiss. Intraclass correlations: uses in assessing rater reliability. Psychological bulletin, 86(2):420, 1979.
  • Skreta et al. (2023) Marta Skreta, Naruki Yoshikawa, Sebastian Arellano-Rubach, Zhi Ji, Lasse Bjørn Kristensen, Kourosh Darvish, Alán Aspuru-Guzik, Florian Shkurti, and Animesh Garg. Errors are useful prompts: Instruction guided task programming with verifier-assisted iterative prompting. arXiv preprint arXiv:2303.14100, 2023.
  • Spearman (1961) Charles Spearman. The proof and measurement of association between two things. 1961.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Wang et al. (2023) Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • Xu et al. (2022) Jing Xu, Arthur Szlam, and Jason Weston. Beyond goldfish memory: Long-term open-domain conversation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  5180–5197, 2022.
  • Zhao et al. (2023) Pengyu Zhao, Zijian Jin, and Ning Cheng. An in-depth survey of large language model-based artificial intelligence agents. arXiv preprint arXiv:2309.14365, 2023.

Appendix A Human-Like Multi-Agents

A.1 From General LLM to Individual Persona

Generative Agents (Park et al., 2023) introduces a two-component architecture for creating personalized, dynamic human-like agents: a string-based memory base and an LLM-driven cognitive function set. The memory base stores memories over time, aiding in the development of diverse agents, while the LLM-centric cognitive functions simulate human capabilities like reflection, planning, and reaction. Combining these, the LLM uses memory-derived context to tailor knowledge extraction and response formulation, ensuring agent-specific behaviors.

In a scenario where two agents initiate a dialogue, each iteratively produces utterances informed by context such as location, status, and memories. The dialogue function uses a specific prompt such as: “Based on the […] information, what will [name] say next?” For more details on Generative Agents and the simulation of 25 agents in a village, please see the original paper.

A.2 Conversations and Transmission

The memory capabilities enable the transmission of information to both agents involved in a dialogue session. However, this also means that undesirable dialogue content can spread in the same manner.

Figure 3 illustrates this using keyword spreading as an example. The bar chart shows the proportion of dialogues containing the keyword (in red) compared to the total number of conversations (in blue), highlighting a swift escalation, sometimes reaching 100%. Chord diagrams further reveal that initially, only a few agents act as propagators, but as the day progresses, the majority become involved in similar actions, as indicated by the variety of line colors. Consequently, the dialogue topics become repetitive and less believable. This example underscores the scale of the spreading and the associated risks, emphasizing the need for a dynamic correction approach.
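The proportion plotted in the bar chart can be reproduced with a few lines of code. The following is a minimal sketch, assuming each dialogue is available as an (hour, text) pair; the function name `keyword_share_by_hour`, the hourly binning, and the toy data are illustrative assumptions rather than part of the original simulation code.

```python
from collections import defaultdict


def keyword_share_by_hour(dialogues, keyword):
    """Return {hour: fraction of dialogues in that hour containing keyword}.

    `dialogues` is an iterable of (hour, text) pairs (a hypothetical layout).
    Matching is a case-insensitive substring test, so a stem such as
    "collaborat" also matches "collaboration" and "collaborating".
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    needle = keyword.lower()
    for hour, text in dialogues:
        total[hour] += 1
        if needle in text.lower():
            hits[hour] += 1
    return {hour: hits[hour] / total[hour] for hour in sorted(total)}


# Toy data, invented for illustration only.
toy = [
    (9, "Good morning! How is the shop?"),
    (9, "Let's collaborate on a poetry project."),
    (14, "I am collaborating with Latoya on artwork."),
    (14, "Any collaboration ideas for the mural?"),
]
shares = keyword_share_by_hour(toy, "collaborat")
```

In this toy example the share rises from half of the morning dialogues to all of the afternoon dialogues, which mirrors the kind of escalation described above.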

Appendix B SDR Details

B.1 Dynamic Similarity

We employ a dynamic similarity threshold to determine whether an utterance is repetitive. The value of $\theta_{sim}$ is adjusted according to the origin of the similar utterance: it is decreased if the similar utterance originates from the same agent $A$ as $U_c$ and is in the current dialogue $D_t$, which indicates repetition. Conversely, a higher threshold is acceptable if the similar utterance is from $A$ but in a different dialogue. This process is formalized in Eq. 1.

$$\theta_{sim}=\begin{cases}\theta+\alpha & \text{if same $A$, different $D$}\\ \theta-\alpha & \text{if same $A$, same $D$}\\ \theta & \text{otherwise,}\end{cases} \tag{1}$$
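Eq. 1 translates directly into a small function. The sketch below is illustrative: the function names are hypothetical, the values 0.85 and 0.05 are the $\theta$ and $\alpha$ reported in Appendix D.1, and the use of `>=` for the comparison is an assumption, since the text does not state whether the boundary is inclusive.

```python
def dynamic_threshold(theta, alpha, same_agent, same_dialogue):
    """Return theta_sim following the three cases of Eq. 1."""
    if same_agent and not same_dialogue:
        return theta + alpha  # same A, different D: more tolerant
    if same_agent and same_dialogue:
        return theta - alpha  # same A, same D: stricter
    return theta              # otherwise


def is_repetition(similarity, theta, alpha, same_agent, same_dialogue):
    """Flag a candidate utterance whose similarity reaches theta_sim.

    The inclusive comparison is an assumption made for this sketch.
    """
    return similarity >= dynamic_threshold(theta, alpha, same_agent, same_dialogue)


THETA, ALPHA = 0.85, 0.05  # values reported in Appendix D.1
flag_same_dialogue = is_repetition(0.82, THETA, ALPHA, True, True)
flag_other_dialogue = is_repetition(0.82, THETA, ALPHA, True, False)
```

With these settings a similarity of 0.82 is flagged when the earlier utterance is from the same agent in the same dialogue, but not when it comes from a different dialogue.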

B.2 The Extracted Graphs in NLI-G

We present an example of the personal information graph extracted from a dialogue. Table 4 shows a dialogue paired with its extracted graph.

Raw Dialogue
Giorgio Rossi: Excuse me, are you John Lin?
John Lin: Yes, that’s me. How may I help you?
Giorgio Rossi: I noticed that you’re very knowledgeable about medications and treatments. I’ve been curious about some of the latest developments in that field. Would you mind sharing some of your insights with me?
John Lin: Of course, I’m happy to share what I know. What specifically are you interested in?
Giorgio Rossi: Well, I’m particularly intrigued by how some medications are being developed based on mathematical patterns found in nature. Do you have any knowledge about that?
John Lin: Hmm, that’s not really my area of expertise. But I do know that Jennifer Moore, the artist who’s running for local mayor, is also a scientist who incorporates mathematical patterns into her work. Maybe she would have more information on that.

Extracted Graph
[
    ["Giorgio Rossi", "is interested in", "medications and treatments"],
    ["John Lin", "is", "John Lin"],
    ["Giorgio Rossi", "discusses interests with", "John Lin"],
    ["John Lin", "is knowledgeable about", "medications and treatments"],
    ["Giorgio Rossi", "is interested in", "mathematical patterns found in nature"],
    ["John Lin", "does not have expertise in", "medications developed based on mathematical patterns found in nature"],
    ["Jennifer Moore", "is mentioned by", "John Lin"],
    ["Jennifer Moore", "has role", "artist"],
    ["Jennifer Moore", "is running for", "local mayor"],
    ["Jennifer Moore", "has profession", "scientist"],
    ["Jennifer Moore", "incorporates", "mathematical patterns into work"],
]
Table 4: Example of a graph extracted from raw dialogue in the NLI-G module.
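To make the triple format concrete, the following sketch groups a few of the triples from Table 4 by subject and verbalizes each one as a short sentence. The helper `facts_by_subject` is hypothetical, and turning triples into sentences (for instance, as premises for an NLI model) is an assumption about one plausible use, not a description of the NLI-G implementation.

```python
from collections import defaultdict

# A subset of the triples shown in Table 4.
triples = [
    ["Giorgio Rossi", "is interested in", "medications and treatments"],
    ["John Lin", "is knowledgeable about", "medications and treatments"],
    ["Jennifer Moore", "has role", "artist"],
    ["Jennifer Moore", "is running for", "local mayor"],
    ["Jennifer Moore", "has profession", "scientist"],
]


def facts_by_subject(triples):
    """Group (subject, relation, object) triples by subject as sentences."""
    grouped = defaultdict(list)
    for subject, relation, obj in triples:
        grouped[subject].append(f"{subject} {relation} {obj}.")
    return dict(grouped)


facts = facts_by_subject(triples)
```

Grouping by subject makes it easy to collect everything asserted about one person across dialogues before comparing it with a new utterance.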

Appendix C Limitations and Social Impact

C.1 Limitations

1. The primary limitation of our method is the cost, as we utilize GPT-3.5-turbo as the LLM backbone via OpenAI API calls.

2. Our results are limited to the OneDayLife dataset, and the evaluations, including automatic, LLM, and human evaluation metrics, may not cover all aspects since the dataset is not annotated. However, we suggest that our methodology can be applied to scenarios involving similar LLM agents.

3. Instead of reproducing a complete full-day simulation, which incurs thousands of dollars in costs as reported in the original paper, we rerun the conversations individually using context (e.g., time, location, involved agents, past dialogues, past memories, etc.) identical to the original data with our framework. In this case, if the past context from the original data contains inherent problems, these are beyond the reach of our framework and thus contribute to the decrease in evaluation scores. We anticipate that the error rate of our method could be further reduced when running a complete simulation.

C.2 Social Impact

Our work focuses on conversations involving only LLM agents. However, future applications are expected to feature environments containing both LLM agents and human users. In these scenarios, human users can interact with the agents, and there is a possibility of malicious intervention. On the other hand, the influence of these interactions on human users remains unexplored. The best practices and effects of deploying our framework in real-world applications could be subjects for future study.

Appendix D Experiment Details

D.1 Hyper-Parameters

In the Screening phase, we set $K_{sim}$ to 5, the threshold $\theta$ to 0.85, $\alpha$ to 0.05, and $\theta_{force}$ to 0.95 for repetition detection. Utterances shorter than 10 words are excluded from the repetition screening, which likely preserves the natural “social glue turns”. For inconsistency detection, we adapt the DeBERTa-based NLI model pretrained on multiple NLI datasets (Laurer et al., 2023). $\theta_{nlig}$ is set as high as 0.98 because there are many false positives, and we select the top $K_{nlig}=3$ dialogues as potentially contradictory dialogue evidence. The threshold for harmful hallucination likelihood, $\theta_{fact}$, is set to 6. The number of diagnosis trials $N_{diag}$ is 3, and we select the LLM feedback with the highest score. If more than one feedback shares the highest score, we choose the longer one. The regeneration threshold $\theta_{regen}$ is 8. The SDR procedure terminates if no comments are found or after $R=2$ rounds. We use GPT-3.5-turbo-0613 to run the baseline and SDR. Note that OneDayLife was generated by GPT-3.5-turbo before April 2023.

For GPT-4 evaluation, $\theta_{nlig}$ is set to 0.99 and $K_{nlig}$ is 5. We use the model GPT-4-0613.
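For readers who wish to reproduce the setup, the reported hyper-parameters can be gathered in a single configuration object. The sketch below is a hypothetical layout, not the authors' code: the class and field names are invented, word counting by whitespace is an assumption, and "longer" is interpreted here as character length of the feedback text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SDRConfig:
    """Hyper-parameters as reported in D.1 (field names are hypothetical)."""
    k_sim: int = 5
    theta: float = 0.85
    alpha: float = 0.05
    theta_force: float = 0.95
    min_words: int = 10       # shorter utterances skip repetition screening
    theta_nlig: float = 0.98
    k_nlig: int = 3
    theta_fact: int = 6
    n_diag: int = 3
    theta_regen: int = 8
    max_rounds: int = 2       # R


def screened_for_repetition(utterance, cfg):
    """Return True if the utterance is long enough to be screened."""
    return len(utterance.split()) >= cfg.min_words


def select_feedback(trials):
    """Pick the highest-scoring (score, feedback) pair; ties go to the longer text."""
    return max(trials, key=lambda t: (t[0], len(t[1])))


cfg = SDRConfig()
best = select_feedback([(7, "short"), (8, "ok"), (8, "a longer reason")])
```

Freezing the dataclass keeps the settings immutable during a run, which is a design choice of this sketch rather than a requirement of the method.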

D.2 Evaluation

D.2.1 Diversity

Distinct-N (Li et al., 2016)

Distinct-N calculates the ratio of unique N-grams in a given text. However, it may not fully capture the corpus-level dialogue diversity, particularly when each dialogue has longer utterances, since individual dialogues typically revolve around a single topic. To address this, we apply Distinct-N to summaries of dialogues, generated by a pretrained dialogue summarization model. This approach allows us to more effectively gauge the thematic diversity of dialogues at the corpus level across multiple conversations.
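As an illustration, Distinct-N over a set of summaries can be sketched as follows. Lowercasing and whitespace tokenization are simplifying assumptions, as is pooling the n-grams of all summaries before taking the ratio; the function name `distinct_n` is chosen for this example.

```python
def distinct_n(texts, n):
    """Ratio of unique n-grams to total n-grams, pooled over all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()  # naive tokenization (assumption)
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Two identical toy summaries: every unigram appears twice.
summaries = ["they plan a poetry project", "they plan a poetry project"]
d1 = distinct_n(summaries, 1)
```

Identical summaries halve the score, which is why the metric, applied to summaries, reflects how often the same theme recurs across dialogues.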

Semantic Distance (Dziri et al., 2019)

To complement the word-based Distinct-N metric, we measure the Semantic Distance in the embedding space. Specifically, we calculate the cosine similarity between dialogue embeddings. Semantic Distance is then determined as $1 - similarity$.
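A minimal sketch of this computation is given below. It takes embeddings as plain lists of floats and assumes they are non-zero vectors; averaging the distance over all pairs of dialogues is an assumption made for illustration, since the aggregation is not spelled out here.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_distance(u, v):
    """Semantic Distance defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


def mean_pairwise_distance(embeddings):
    """Average distance over all unordered pairs (aggregation is an assumption)."""
    pairs = list(combinations(embeddings, 2))
    return sum(semantic_distance(u, v) for u, v in pairs) / len(pairs)


toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
avg_distance = mean_pairwise_distance(toy_embeddings)
```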

Agent Diversity

The algorithm of Agent Diversity is shown in Algorithm 1. $Emb(\cdot)$ is a speaker dialogue embedding calculated from length-weighted utterance embeddings in the dialogue.

Data: Agents $A_i$ for $i \in \{1, 2, \ldots, N\}$,
Dialogues $D_{ij}^{k}$ for $j \neq i$ and $0 < k \leq K_{ij}$

// Calculate $AgentDiv_i$ for each agent $A_i$;
targets $\leftarrow \{j \mid K_{ij} \neq 0\}$;
sims $\leftarrow 0$;       pairs $\leftarrow 0$;
for each unique pair (p, q) in targets do
     $E_p \leftarrow \{Emb(D_{ip}^{k}) \mid k \in \{1, 2, \ldots, K_{ip}\}\}$;
     $E_q \leftarrow \{Emb(D_{iq}^{k}) \mid k \in \{1, 2, \ldots, K_{iq}\}\}$;
     $s_{pq} \leftarrow \frac{1}{K_{ip} K_{iq}} \sum_{a=1}^{K_{ip}} \sum_{b=1}^{K_{iq}} CosSim(E_{p_a}, E_{q_b})$;
     sims $\leftarrow$ sims + $s_{pq}$;       pairs $\leftarrow$ pairs + 1;
end for

$AgentDiv_i \leftarrow 1 - \frac{sims}{pairs}$
// Averaging $AgentDiv_i$ for all agents
$AgentDiv \leftarrow \frac{1}{N} \sum_{i=1}^{N} AgentDiv_i$;
Algorithm 1 Agent Diversity (Agent Div)
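Algorithm 1 can be sketched in Python as follows. The data layout (a dictionary mapping each agent to its partners and the embeddings of their dialogues) and the function names are hypothetical. Algorithm 1 does not say what happens when an agent has fewer than two partners, in which case pairs is zero; skipping such agents in the final average is an assumption of this sketch.

```python
import math
from itertools import combinations


def cos_sim(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def agent_div_i(emb_by_partner):
    """AgentDiv_i for one agent; emb_by_partner maps partner j to [Emb(D_ij^k)]."""
    targets = [j for j, embs in emb_by_partner.items() if embs]  # K_ij != 0
    sims, pairs = 0.0, 0
    for p, q in combinations(targets, 2):
        e_p, e_q = emb_by_partner[p], emb_by_partner[q]
        # Mean cosine similarity over all cross-partner dialogue pairs.
        s_pq = sum(cos_sim(a, b) for a in e_p for b in e_q) / (len(e_p) * len(e_q))
        sims += s_pq
        pairs += 1
    if pairs == 0:
        return None  # undefined with fewer than two partners (assumption)
    return 1.0 - sims / pairs


def agent_div(all_agents):
    """Average AgentDiv_i over the agents for which it is defined."""
    values = [agent_div_i(e) for e in all_agents.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


toy = {
    "A": {"B": [[1.0, 0.0]], "C": [[0.0, 1.0]]},  # talks differently to B and C
    "B": {"A": [[1.0, 0.0]], "C": [[1.0, 0.0]]},  # says the same thing to both
}
score = agent_div(toy)
```

In the toy data, agent A scores 1 and agent B scores 0, so the averaged Agent Diversity is 0.5.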

Appendix E Additional Analysis and Examples

E.1 Output of Diagnosis Stage

Table 5 shows the output of the Diagnosis stage.

Context
Time: 2023-02-13 11:55:40
Location: store in The Willows Market and Pharmacy
Status:
    John: performing special tasks and handling customer queries (checking the inventory)
    Yuriko: taking a break and having lunch (researching the local mayor election)
Dialogue Context:
    John: Hey there, Yuriko. Looks like we’re both keeping busy with work and classes.
    Yuriko: Yes, it seems like we’re both trying to stay on top of things. I’m actually taking classes on new tax laws.
    John: That’s interesting. I’m taking classes on new medications and treatments for my pharmacy.
    Yuriko: I’ve heard about some exciting new research on tax incentives for small businesses. Have you heard anything interesting lately?
    John: Actually, I’ve been asking around about who’s running for mayor next month. Have you heard anything?
    Yuriko: Yes, I’ve heard that Sam Moore is running. He has a vision for investing in the local economy and fostering creativity and innovation. What do you think about that?

$U_c$
    John: I actually have a friend who’s running for mayor, Jennifer Moore. She’s an artist and is interested in developing medications based on mathematical patterns found in nature.

| | Score | Reason |
|---|---|---|
| Repetition | 8 | The response contains some repetitive statements that are already known from the previous conversation. For example, John Lin already mentioned that he has a friend named Jennifer Moore who is running for mayor and is interested in medication development based on mathematical patterns. Restating this information in the response is redundant. Additionally, John asking for more information from Jennifer Moore is already implied by him stating that she is his friend. These repetitive statements make the response less natural. |
| Consistency | 8 | John Lin’s candidate response refers to his friend Jennifer Moore running for mayor, while in the past dialogue, he mentioned Sam Moore as a candidate. |
| Hallucination | false | Jennifer Moore mentioned that she is not interested in politics and prefers to focus on her art and her relationships with the people she cares about, indicating that she would not agree with the statement that John Lin’s friend is running for mayor. |
Table 5: Diagnosis stage output.

E.2 GPT-4 Score and Dialogue Length

We investigate whether the scores given by GPT-4 are biased by dialogue length. We use the Pearson Correlation Coefficient (Rodgers & Nicewander, 1988) and the Spearman Rank-Order Correlation Coefficient (Spearman, 1961) to examine the correlation between the scores (factualness and consistency) and the dialogue statistics (number of utterances and words). The results in Table 6 show little to no correlation between them.

| | Pearson Coef. (# Turn) | Pearson Coef. (# Word) | Spearman Coef. (# Turn) | Spearman Coef. (# Word) |
|---|---|---|---|---|
| Factualness | 0.054 | 0.086 | 0.026 | 0.039 |
| Consistency | -0.025 | -0.021 | -0.024 | -0.014 |
Table 6: The correlation coefficients between GPT-4 scores and dialogue statistics on Origin.
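Both coefficients can be computed with the standard library alone. The sketch below is a generic implementation for illustration (the function names are chosen here, and inputs are assumed to have non-zero variance); Spearman's coefficient is obtained as the Pearson coefficient of the average ranks, which handles ties.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation: Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy values, invented for illustration: a monotone but non-linear relation.
turns = [1, 2, 3, 4]
scores = [1, 4, 9, 16]
rho = spearman(turns, scores)
```

For a monotone but non-linear relation such as the toy data, Spearman's coefficient reaches 1 while Pearson's stays below it, which is why reporting both gives a fuller picture.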

E.3 Problematic Dialogues in OneDayLife

Below are four dialogue snippets from OneDayLife involving the four keywords “math”, “poetry”, “artwork”, and “collaborat”.

One unnatural point is that agents from diverse backgrounds all show interest in collaborating on poetry and artwork (and mathematical patterns): Carmen (a shopkeeper) [D1,D2,D3], Latoya (a photographer) [D1,D2,D4], Hailey (a writer) [D2], Arthur (a bartender) [D2], Tamara (a children’s book author) [D3], Giorgio Rossi (a mathematician) [D3], Abigail Chen (a digital artist and animator) [D3], Francisco Lopez (an actor and comedian) [D3,D4], Ryan Park (a software engineer) [D4], Rajiv (a painter) [D4].

Another unnatural point is the variation in content and participants involved in the collaborations. For example, the people collaborating on poetry and artwork in each dialogue are:

[D1] Carmen, Latoya

[D2] Carmen, Latoya | Hailey, Arthur

[D3] Carmen, Tamara, Giorgio, Abigail, Francisco

[D4] Latoya, Ryan | Rajiv, Francisco

Furthermore, while Carmen does not have other conversations between D1 and D2, the collaborations she mentions differ in the two dialogues.

[D1]

  1. Rajiv - art projects and medicinal development

  2. Latoya - poetry and artwork | mobile app

  3. Arthur - mixology

[D2]

  1. Tamara and Rajiv - creating job opportunities and expanding book sales online

  2. Latoya - poetry and artwork

  3. Giorgio and Arthur - mixology and medicinal development

Dialogue Snippets:

[D1] Time: 2023-02-13 14:43:20

Arthur (a bartender): Excuse me, Carmen. I couldn’t help but overhear your conversation about potential collaborations. … Additionally, I’m passionate about creating job opportunities for young people in the community and would love to brainstorm with you on how we could work together on that. What do you think?

Carmen (a shopkeeper): That sounds great, Arthur! I’m actually meeting Rajiv Patel (a painter) here later to discuss potential collaborations on art projects and medicinal development using mathematical patterns found in nature. … I’m also discussing potential collaborations with various individuals about promoting art in low-income communities and supporting local businesses. And we have plans for a poetry and artwork project with Latoya Williams (a photographer). Additionally, we’re discussing potential collaborations on creative projects such as a mobile app for local businesses and community events. I would love to hear more about your mixology ideas and how we might be able to collaborate on that as well.

[D2] Time: 2023-02-13 14:57:20

Hailey (a writer): I was actually just talking to Arthur Burton (a bartender) about potential collaborations in mixology, art, poetry, improv, and medication development based on mathematical patterns in nature. I invited him to be a guest on my podcast to contribute unique cocktail recipes. And then I saw you and thought it would be great to catch up.

Carmen (a shopkeeper): Wow, that sounds really interesting! I’m actually discussing potential collaborations with Tamara Taylor (a children’s book author) and Rajiv Patel (a painter) on creating job opportunities for young people in the community, as well as expanding book sales online. I’m also working with Latoya Williams (a photographer) on a poetry and artwork project, and considering collaborations with Giorgio Rossi (a mathematician) and Arthur Burton (a bartender) on mixology ideas and medicinal development using mathematical patterns found in nature. We all plan to grab a drink here after Rajiv’s first solo show.

[D3] Time: 2023-02-13 15:05:20

Jennifer (a painter): That sounds like a great idea! I’d love to collaborate with you and other local artists. Do you have any other projects or collaborations in mind?

Tamara (a children’s book author): Well, I’m considering collaborations with Carmen Ortiz (a shopkeeper), Giorgio Rossi (a mathematician), Abigail Chen (a digital artist and animator), and Francisco Lopez (an actor and comedian) for projects involving poetry, artwork, and potentially other fields like science and math. I’m also interested in attending political discussions and expanding my book sales online. What about you?

[D4] Time: 2023-02-13 18:16:20

Latoya (a photographer): That sounds like a great idea. And I’m also open to collaborating with you on exploring connections between math, nature, and art. I’m actually discussing a similar project with Ryan Park (a software engineer).

Rajiv (a painter): That’s great to hear. And if you’re interested, Francisco Lopez (an actor and comedian) and I are also exploring a project involving poetry, artwork, and mathematical patterns. So there’s definitely a lot of potential for collaboration and creativity in this space.

Appendix F Prompt Examples

F.1 Repetition Check Prompt

Context for the task:
[Speaker’s background]

Here are some conversation histories between various people:
[Speaker]:**
Time: t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT
[Evidence Dialogue Dt0subscript𝐷subscript𝑡0D_{t_{0}}italic_D start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ]

Time: tksubscript𝑡𝑘t_{k}italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT
[Evidence Dialogue Dtksubscript𝐷subscript𝑡𝑘D_{t_{k}}italic_D start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ]

[Speaker] is about to say the following sentence (’the response’) next in the latest session:
[ Ucsubscript𝑈𝑐U_{c}italic_U start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ]

---
# Task: Please identify any "unnatural points" in 'the response'.
An "unnatural point" refers to redundancies or repetitive statements made in 'the response' when considering the context of the previous conversations.
On a scale of 1 (no unnatural point) to 10 (the most significant of unnatural point), rate the likely significant score of 'the response'. And explain the reason for the score.

Output format: Output a json of the following format:
{
    "reason": "point out the unnatural point and your reason for the score",
    "score": "<json integer>"
}
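
As an illustration of how this prompt might be used in practice, the following minimal Python sketch fills an abbreviated copy of the template and parses the JSON reply. The names `REPETITION_TEMPLATE`, `build_repetition_prompt`, and `parse_repetition_reply` are hypothetical, and the flagging threshold of 5 is an assumed value chosen for the example, not one taken from the paper.

```python
import json

# Hypothetical, abbreviated copy of the repetition check prompt layout.
REPETITION_TEMPLATE = """Context for the task:
{background}

Here are some conversation histories between various people:
{evidence}

{speaker} is about to say the following sentence ('the response') next in the latest session:
{candidate}

---
# Task: Please identify any "unnatural points" in 'the response'.
"""


def build_repetition_prompt(speaker, background, evidence, candidate):
    """Fill the template; `evidence` is a list of (time, dialogue) pairs."""
    blocks = [f"Time: {t}\n{d}" for t, d in evidence]
    return REPETITION_TEMPLATE.format(
        speaker=speaker,
        background=background,
        evidence="\n\n".join(blocks),
        candidate=candidate,
    )


def parse_repetition_reply(reply, threshold=5):
    """Return (score, reason, flagged); `threshold` is an assumed value."""
    data = json.loads(reply)
    score = int(data["score"])  # the format quotes the integer, so accept "7" or 7
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return score, data["reason"], score >= threshold
```

Converting the score with `int` is a deliberate choice here: the output format wraps the integer in quotes, so a reply may carry it either as a JSON string or as a JSON number.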

F.2 Consistency Check Prompt

# Context
**Background:**
[Speaker’s background]

**Past Dialogues involving [Speaker]:**
Time: $t_0$
[Evidence Dialogue $D_{t_0}$]

Time: $t_k$
[Evidence Dialogue $D_{t_k}$]

**Current Dialogue between [Speaker] and [Listener]:**
Time: $t$
[Current Dialogue $D_t$]

**Candidate Response:**
[Speaker] is planning to say: [$U_c$]

---
# Task
Determine if there is any contradiction between the candidate response and the past dialogue/character background. Do not consider the absence of a repeated mention as an inconsistency. Ignore statements that are situational or not meant to be taken literally. Let’s think step by step.

**Output a JSON object:**
{
    "Contradiction?": <true/false>,
    "Details": "<Specify any contradictions, if any>"
}

## Example
[An example including Past Dialogues, Current Dialogue, Candidate Response, and Output]
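
Because this prompt asks the model to think step by step, a reply may contain free-form reasoning before the JSON object. The sketch below shows one possible way to recover the verdict in that case by scanning for the last decodable JSON object in the reply. The function names are hypothetical and the approach is an assumption for illustration, not the authors' implementation.

```python
import json


def extract_last_json_object(reply):
    """Return the last decodable JSON object in `reply`, or None if there is none."""
    decoder = json.JSONDecoder()
    found = None
    pos = 0
    while True:
        start = reply.find("{", pos)
        if start == -1:
            return found
        try:
            obj, end = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            pos = start + 1  # this brace does not open valid JSON; keep scanning
            continue
        found = obj
        pos = end  # continue after the object just decoded


def parse_consistency_reply(reply):
    """Return (contradiction, details) from a consistency check reply."""
    data = extract_last_json_object(reply)
    if data is None or "Contradiction?" not in data:
        raise ValueError("no consistency verdict found in reply")
    verdict = data["Contradiction?"]
    if not isinstance(verdict, bool):
        raise ValueError(f"expected a JSON Boolean, got {verdict!r}")
    return verdict, data.get("Details", "")
```

Taking the last object rather than the first means that braces appearing inside the reasoning text do not mask the final verdict.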

F.3 Agent Agreement Prompt

Context for the task:

Here is a brief description of [Mentioned Agent].
[Background of the Mentioned Agent]

Here is the memory that is in [Mentioned Agent]’s head:
[Memory of the Mentioned Agent]

Here is the previous conversation between [Mentioned Agent] and [Speaker]:
Time: $t_{-1}$
[Evidence Dialogue $D_{t_{-1}}$]

[Speaker] just mentioned the following statement about [Mentioned Agent]:
[$U_c$]

---
# Task: Based on the information provided above, would [Mentioned Agent] agree with the statement?

Output format: Output a json of the following format:
{
    "agreed": "<json Boolean>",
    "reason": "the reason that led [Mentioned Agent] to make the judgment"
}
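
The output format above places the Boolean inside quotes, so a reply may carry `agreed` either as a JSON Boolean or as the string "true"/"false". The following sketch normalizes both forms and, under the assumption that the prompt is issued once per mentioned agent, collects the agents who disagree with the statement. The function names and the per-agent aggregation are hypothetical and serve only as an illustration.

```python
import json


def parse_agreement_reply(reply):
    """Return (agreed, reason); accepts a JSON Boolean or the string "true"/"false"."""
    data = json.loads(reply)
    value = data["agreed"]
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered not in ("true", "false"):
            raise ValueError(f"unexpected value for 'agreed': {value!r}")
        value = lowered == "true"
    elif not isinstance(value, bool):
        raise ValueError(f"unexpected value for 'agreed': {value!r}")
    return value, data.get("reason", "")


def disagreeing_agents(replies):
    """Map each disagreeing agent to its reason.

    `replies` maps a mentioned agent's name to the raw reply obtained for that
    agent (an assumed calling convention for this sketch).
    """
    result = {}
    for agent, reply in replies.items():
        agreed, reason = parse_agreement_reply(reply)
        if not agreed:
            result[agent] = reason
    return result
```

A non-empty result from `disagreeing_agents` could then be passed on as the reason for revising the utterance.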

F.4 Revise Utterance Prompts

F.4.1 Persona-based Narrative

Your name is [Speaker].
Your background is as follows:
[Background of the Speaker]

You are engaged in a conversation with [Listener], and here is the content of the dialogue so far:
[Current Dialogue $D_t$]

# Task:
Consider whether you would respond to [Listener]. If you choose to reply, what would you say? Would your response aim to conclude the conversation?
You might consider saying “$U_c$”, but it has some issues, for instance:
[Reason]
Here are some suggestions for your reference:
[Suggestion]
If the response is redundant or repetitive, you can end the current dialogue.

**Output a JSON object:**
{
    "Response": "<your reply as [Speaker] (if any)>",
    "The conversation ends with [Speaker]'s utterance": <true/false>
}

F.4.2 Structured Task-oriented

# Contextual Information:
**Introduction:**
[Background of the Speaker]

**Current Dialogue between [Speaker] and [Listener]:**
[Current Dialogue $D_t$]

# Task:
Assuming the role of [Speaker], consider whether you would respond to [Listener]. If you choose to reply, what would you say? Would your response aim to conclude the conversation?
You might be considering saying something that has some issues, such as:
[Reason]
Here are some suggestions for your reference:
[Suggestion]
If the response is redundant or repetitive, you can end the current dialogue.

**Output a JSON object:**
{
    "Response": "<your reply as [Speaker] (if any)>",
    "The conversation ends with [Speaker]'s utterance": <true/false>
}
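
Both revision prompts share the same output format, so a single parser can serve either variant. The sketch below is a minimal illustration with hypothetical names (`Revision`, `parse_revision_reply`). It matches the second key by its fixed prefix, since the speaker's name varies, and it assumes that an empty response means the speaker chose not to reply, which ends the dialogue.

```python
import json
from dataclasses import dataclass

END_KEY_PREFIX = "The conversation ends with"


@dataclass
class Revision:
    response: str            # the regenerated utterance, possibly empty
    ends_conversation: bool  # whether the dialogue stops after this turn


def parse_revision_reply(reply):
    """Parse the JSON reply produced by either revise-utterance prompt."""
    data = json.loads(reply)
    response = (data.get("Response") or "").strip()
    # The key embeds the speaker's name, so match it by its fixed prefix.
    ends = any(
        value is True
        for key, value in data.items()
        if key.startswith(END_KEY_PREFIX)
    )
    # Assumption: no reply at all is treated as the end of the dialogue.
    return Revision(response, ends or not response)
```

Treating a missing reply as the end of the dialogue follows the prompt's instruction that a redundant or repetitive response may simply end the conversation.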