Cohesive Conversations: Enhancing Authenticity in Multi-Agent Simulated Dialogues

KuanChao Chu* (The University of Tokyo)
Yi-Pei Chen* (The University of Tokyo)
Hideki Nakayama (The University of Tokyo)
* Joint first authors. Contact: {kcchu, ypc}@nlab.ci.i.u-tokyo.ac.jp

Abstract

This paper investigates the quality of multi-agent dialogues in simulations powered by Large Language Models (LLMs), focusing on a case study from Park et al. (2023), where 25 agents engage in day-long simulations of life, showcasing complex behaviors and interactions. Analyzing dialogues and memory over multiple sessions revealed significant issues such as repetition, inconsistency, and hallucination, exacerbated by the propagation of erroneous information. To combat these challenges, we propose a novel Screening, Diagnosis, and Regeneration (SDR) framework that detects and corrects utterance errors through a comprehensive process involving immediate issue identification, evidence gathering from past dialogues, and LLM analysis for utterance revision. The effectiveness of the SDR framework is validated through GPT-4 assessments and human evaluations, demonstrating marked improvements in dialogue consistency, diversity, and the reduction of false information. This work presents a pioneering approach to enhancing dialogue quality in multi-agent simulations, establishing a new standard for future research in the field.

1 Introduction

Recent research has leveraged Large Language Models (LLMs) (OpenAI, 2023; Touvron et al., 2023) to power multi-agent simulations, aiming to model complex human behaviors (Park et al., 2023) or enhance multi-agent communication and collaboration (Chen et al., 2023b; Hong et al., 2023). In these simulations, each agent has an LLM at its core, is augmented with additional components such as memory, and communicates iteratively with other agents (Cheng et al., 2024). A notable example fully orchestrated by ChatGPT, Generative Agents (Park et al., 2023), simulates 25 agents “living” in a village and sometimes conversing with each other. For more details about how the simulation works, please refer to Appendix A and the original paper.

Despite the advances in multi-agent simulation, there remains a significant gap in evaluating the quality of the dialogues produced in these simulations, especially at the multi-session level. Park et al. (2023) evaluated their simulation by interviewing the agents, while other studies of LLM-generated multi-session dialogues limit their evaluations to sessions involving the same pair of speakers (Xu et al., 2022; Bae et al., 2022; Jang et al., 2023). None of them has thoroughly scrutinized multi-agent communication over time, leaving a critical area of research unaddressed.

Our work delves into the simulation log of Park et al. (2023) (hereafter referred to as OneDayLife; the log is available at https://reverie.herokuapp.com/arXiv_Demo/), examining the dialogue content and memory of the participating agents across multiple sessions. We found that, while individual dialogues may appear flawless on the surface, a holistic review of multi-session, multi-agent dialogues identifies significant issues. Specifically, we identified three main problems: repetition, inconsistency, and hallucination, as depicted in Fig. 1 (further examples are available in Appendix E.3). Notably, these problems escalate with the duration of the simulation, exacerbated by the propagation of erroneous information through dialogues among agents.

Figure 1(a): Eddy is a music-major student working on a music composition project, yet he is inconsistent about what he is working on and repeats others’ utterances after conversing with them.

To address these challenges, we introduce a novel multi-agent simulation framework featuring a Screening, Diagnosis, and Regeneration (SDR) mechanism. This approach enables the immediate detection and correction of errors in utterances generated by agents. During the Screening stage, potential issues in the candidate utterance are identified, and relevant evidence from past dialogues is gathered. Subsequently, in the Diagnosis stage, an LLM analyzes the evidence and the current dialogue, providing detailed feedback and a score on the authenticity of the candidate utterance. If the utterance is deemed problematic, it is revised during the Regeneration stage, incorporating insights from the preceding stages. Both GPT-4 assessments and human evaluations show that our multi-agent, multi-session dialogues are more consistent and contain fewer hallucinations, while automatic metrics show better diversity, confirming the effectiveness of our SDR framework.

In summary, our contributions are threefold: (1) We illuminate the inherent problems within multi-agent, multi-session dialogues. To the best of our knowledge, we are among the first to investigate this particular problem. (2) We propose the SDR framework, a pioneering approach for on-the-fly detection and correction of utterance errors in multi-agent simulations. (3) Through extensive evaluations and analyses, we demonstrate the efficacy of our framework, setting a new standard for dialogue quality in multi-agent simulations.

2 Multi-agent Dialogue Simulation

We propose a Screening, Diagnosis, and Regeneration (SDR) framework for simulating authentic dialogues between multiple agents over a period of time. For each candidate utterance $U_{c}$ generated by an agent, we examine whether potential errors occur, and re-generate a new $U_{c}$ if any problem is found. Figure 2 illustrates our SDR system overview.

2.1 Screening

Due to the limitations of LLMs, including input length restrictions and associated costs, it is impractical to compare each $U_{c}$ against all past dialogues. Consequently, we begin with a Screening stage that identifies potential issues and gathers pertinent evidence for each identified issue.

Repetition

Agents often display similar speech patterns, reducing their character distinctiveness. As shown in Fig. 1(a), Eddy tends to replicate phrases from other agents following their conversations. Figure 3 illustrates that as the day progresses, most agents engage in highly similar topics related to collaboration. More examples can be found in Appendix E.3.

We build a dialogue database that stores all utterances prior to the candidate utterance $U_{c}$, including utterances from previous dialogues and from the current dialogue context. For each $U_{c}$, we first query the database to retrieve the top $K_{sim}$ most similar utterances. We then apply a dynamic similarity threshold $\theta_{sim}$ to identify excessively repetitive utterances; the threshold depends on whether the retrieved utterance originates from the same speaker or the same dialogue. More details are given in Appendix B.1. If more than one retrieved utterance surpasses $\theta_{sim}$, all dialogues associated with these utterances are marked as evidence for the subsequent Diagnosis stage. When $U_{c}$ is nearly identical to a retrieved utterance (similarity greater than $\theta_{force}$), we bypass all pipelines and proceed directly to the Re-generation stage.
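The retrieval-and-threshold logic above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: a bag-of-words cosine similarity stands in for whatever utterance encoder is actually used, a single fixed `theta_sim` replaces the dynamic threshold of Appendix B.1, and the function names and default values (`k_sim=3`, `theta_sim=0.6`, `theta_force=0.95`) are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity (assumed stand-in for a sentence encoder)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def screen_repetition(candidate, database, k_sim=3, theta_sim=0.6, theta_force=0.95):
    """database: list of (dialogue_id, utterance) pairs preceding the candidate.

    Returns (force_regeneration, evidence_dialogue_ids).
    """
    scored = [(cosine(candidate, utt), did) for did, utt in database]
    top = sorted(scored, key=lambda x: x[0], reverse=True)[:k_sim]
    # Near-identical match: skip the other pipelines and regenerate directly.
    if top and top[0][0] > theta_force:
        return True, []
    hits = [did for score, did in top if score > theta_sim]
    # Evidence is collected only when more than one retrieved utterance passes.
    evidence = sorted(set(hits)) if len(hits) > 1 else []
    return False, evidence

db = [
    (1, "let us collaborate on the music project"),
    (2, "let us collaborate on the art project"),
    (3, "the weather is nice today"),
]
print(screen_repetition("let us collaborate on the poetry project", db))
```

In this toy database the candidate overlaps heavily with two earlier utterances, so both dialogues would be forwarded as evidence.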

Inconsistency

Factual or logical inconsistencies are an issue across multiple dialogues. For instance, Fig. 1(b) illustrates how John’s statement contradicts Jennifer’s earlier words. Other examples include sudden shifts in opinions, forgetting past statements, and invitations to conflict.

We propose a Natural Language Inference-Graph (NLI-G) module for inconsistency screening. NLI-G consists of three steps. First, we employ the LLM to extract personal information as a list of (Subject, Relation, Object) triplets from each previous dialogue of the involved agents, as well as from the candidate utterance $U_{c}$. An example is shown in Appendix B.2. Second, after transforming the triplets into text form, we use an NLI model to predict potential contradictions by comparing the triplets from previous utterances with those from $U_{c}$. Using such a graph format helps the NLI model focus on key information about the agents and reduces the negative impact of style discrepancies between its pretraining data and raw dialogue utterances. Finally, triplets whose contradiction score is above $\theta_{nlig}$ are considered suspicious and forwarded to the LLM, which selects the top $K_{nlig}$ corresponding dialogues for the subsequent Diagnosis stage.
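The second and third NLI-G steps can be sketched as below. This is an assumption-laden illustration: `toy_nli` is a hypothetical stand-in for a real NLI model (it merely flags triplets that share a subject and relation but differ in object), the threshold value is arbitrary, and the final LLM-based selection of the top $K_{nlig}$ dialogues is omitted.

```python
def triplet_to_text(triplet):
    """Flatten a (subject, relation, object) triplet into a short sentence."""
    return " ".join(triplet)

def toy_nli(premise_triplet, hypothesis_triplet):
    """Hypothetical contradiction scorer; a real system would call an NLI model
    on triplet_to_text(...) of both inputs instead."""
    s1, r1, o1 = premise_triplet
    s2, r2, o2 = hypothesis_triplet
    return 0.9 if (s1, r1) == (s2, r2) and o1 != o2 else 0.0

def screen_inconsistency(past, candidate, nli=toy_nli, theta_nlig=0.5):
    """past: list of (dialogue_id, triplet); candidate: triplets from U_c.

    Returns suspicious (dialogue_id, old_text, new_text, score) tuples.
    """
    suspicious = []
    for dialogue_id, old in past:
        for new in candidate:
            score = nli(old, new)
            if score > theta_nlig:
                suspicious.append(
                    (dialogue_id, triplet_to_text(old), triplet_to_text(new), score)
                )
    return suspicious

past = [
    (7, ("Eddy", "is working on", "a music composition")),
    (9, ("Eddy", "studies", "music")),
]
candidate = [("Eddy", "is working on", "a poetry collection")]
print(screen_inconsistency(past, candidate))
```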

Hallucination

We focus on detecting harmful hallucination, which we define as spurious generated information in $U_{c}$ that relates to other agents. Such hallucinations can persist within the simulated world through memories and sometimes become accepted as truth over time, despite not aligning with the involved agent’s role. Given that agents are considered distinct individuals, preventing fabricated information about others is crucial.

We detect third-party agent mentions via name parsing. Upon identification, we prompt the LLM to rate the utterance on a 1 to 10 scale, indicating the likelihood of being a hallucination, and we flag $U_{c}$ if the score is greater than $\theta_{fact}$ . Two example cases are shown in Fig. 4.

The model is explicitly guided to give a higher score if the following criteria are met: (a) Speaker-objectivity: if it is an objective statement regarding another agent from the speaker’s perspective. (b) Discernment capability of the mentioned agent: if the referenced agent can currently verify the statement’s truthfulness. (c) Impact: the statement, if fabricated but later accepted as truth, significantly impacts the agent.
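A minimal sketch of this screening step follows. The name matching by whole-word regular expression, the `rate_fn` callable standing in for the LLM rating prompt, and the default `theta_fact=6` are all assumptions made for illustration.

```python
import re

def mentioned_third_parties(utterance, speaker, listener, all_agents):
    """Return agents named in the utterance other than the two interlocutors."""
    found = []
    for name in all_agents:
        if name in (speaker, listener):
            continue
        if re.search(rf"\b{re.escape(name)}\b", utterance):
            found.append(name)
    return found

def screen_hallucination(utterance, speaker, listener, all_agents, rate_fn, theta_fact=6):
    """rate_fn(utterance, names) -> score in 1..10 (hypothetical LLM call).

    Returns (flagged, mentioned_names).
    """
    names = mentioned_third_parties(utterance, speaker, listener, all_agents)
    if not names:
        return False, names  # nothing to verify, so the utterance is not flagged
    return rate_fn(utterance, names) > theta_fact, names

agents = ["John", "Jennifer", "Sam", "Eddy"]
print(screen_hallucination(
    "I heard Sam is dropping out of the election.",
    "John", "Jennifer", agents,
    rate_fn=lambda utt, names: 8,  # fixed score in place of an LLM rating
))
```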

2.2 Diagnosis

We utilize the LLM to further diagnose the authenticity of $U_{c}$ if evidence dialogues are provided or if $U_{c}$ is flagged by the previous Screening stage. In the former case, we prompt the LLM with pipeline-specific prompts given the evidence and current dialogues. The outputs are a score representing the severity of the issue and the reason for that score. In the latter case, we provide the information of the mentioned agent and ask the LLM to output whether the mentioned agent agrees with $U_{c}$, along with an explanatory comment. In practice, we repeat the Diagnosis stage for $N_{diag}$ trials and select the result with the highest score. The prompts are provided in Appendix F.
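The repeated-trial selection can be sketched in a few lines. Here `run_trial` is a hypothetical callable wrapping one LLM diagnosis call that returns a `(score, reason)` pair; the default `n_diag=3` is an assumption.

```python
def diagnose(run_trial, n_diag=3):
    """Run the diagnosis n_diag times and keep the highest-scoring (most severe) result."""
    results = [run_trial() for _ in range(n_diag)]
    return max(results, key=lambda r: r[0])

# Canned outputs in place of real LLM calls.
canned = iter([(3, "minor overlap"), (9, "contradicts earlier claim"), (5, "vague")])
print(diagnose(lambda: next(canned), n_diag=3))
```

Keeping the most severe of several trials favors recall: an issue only needs to be caught once to be passed on.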

2.3 Re-generation

We collect results from all pipelines and retain only comments with a score above $\theta_{regen}$ or those indicating disagreement. If no comments remain, the correction process terminates, $U_{c}$ is saved to the dialogue database, and the model continues to generate the next utterance of the other agent. Otherwise, the LLM is used to integrate all comments and provide suggestions for improvement, alleviating the vagueness of simple feedback (Liang et al., 2023). The prompt for re-generation is enriched by appending the comments to the original prompt that was used to generate $U_{c}$.

Our SDR procedure continues until either of two conditions is met: it completes $R$ rounds of iteration, or it reaches a point where no further comments are provided, indicating that the identified issues have been resolved.
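Putting the stages together, the control flow of the correction loop might look like the sketch below. The three callables (`generate`, `run_pipelines`, `summarize`) are hypothetical stand-ins for the LLM-backed components, and the default values of `max_rounds` and `theta_regen` are assumptions.

```python
def sdr_loop(generate, run_pipelines, summarize, max_rounds=3, theta_regen=5):
    """generate(feedback) -> utterance (feedback is None on the first call).
    run_pipelines(utterance) -> list of (score, comment, disagrees) tuples.
    summarize(comments) -> integrated feedback appended to the generation prompt.
    """
    utterance = generate(None)
    for _ in range(max_rounds):
        results = run_pipelines(utterance)
        kept = [c for score, c, disagrees in results if score > theta_regen or disagrees]
        if not kept:
            break  # no remaining comments: accept the utterance
        utterance = generate(summarize(kept))
    return utterance

# Stubs: the first draft is repetitive, the revised one passes all pipelines.
gen = lambda fb: "Let's collaborate!" if fb is None else "How is your composition going?"
pipes = lambda u: [(8, "repeats earlier dialogue", False)] if "collaborate" in u else []
print(sdr_loop(gen, pipes, summarize=" ".join))
```

In this sketch the utterance produced in the final round is returned without a further check, which is one possible reading of the stopping rule.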

2.4 Prompt Design

We develop multiple prompt variants to increase diversity when re-generating the revised $U_{c}$, inspired by the finding that linguistically varied prompts induce output variance (Leidinger et al., 2023). The LLM often re-generates an identical or very similar response because most of the prompt content is the same as in the initial response generation prompt, despite the additional feedback provided to guide the generation and the frequency and presence penalties being set (see https://platform.openai.com/docs/guides/text-generation/parameter-details). We design two types of utterance generation prompts: a persona-based narrative prompt and a structured task-oriented prompt. The former asks the model to play the role of the given persona and engage in a conversation, which is more narrative and immersive. The latter breaks down all information into clear components, which is less about storytelling and more about providing structured data for a specific task (in this case, generating a response in a conversation). The prompts of these two types are shown in Appendix F.4.

3 Experiment

3.1 Data and Settings

We conduct the simulation using data from OneDayLife. After removing dialogues with only one utterance, there are 290 dialogues in total between 25 agents. We regenerate the whole dialogue $D_{ij}^{t}$ between two agents $A_{i}$ and $A_{j}$ at time point $t$. $A_{i}$ and $A_{j}$ take turns generating utterances until one has no reply or the dialogue reaches 16 turns. At each turn, the LLM is provided with the speaker’s persona $P_{i}$, memories $M_{i}^{t}$, location, statuses $S_{i}^{t},S_{j}^{t}$ at time $t$, and the dialogue histories between the two agents $D_{ij}^{k},\ k\in\{1,2,\ldots,K^{t-1}\}$.
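The turn-taking procedure can be sketched as follows. The reply functions are hypothetical stand-ins for the per-agent LLM calls (which would also receive persona, memory, location, and status); returning `None` models an agent that has no reply.

```python
MAX_TURNS = 16  # turn cap stated in the text

def simulate_dialogue(reply_fns, max_turns=MAX_TURNS):
    """reply_fns: (fn_i, fn_j), each mapping the dialogue history to an utterance or None."""
    history = []
    for turn in range(max_turns):
        speaker = turn % 2  # agents alternate, starting with A_i
        utterance = reply_fns[speaker](history)
        if utterance is None:
            break  # the current speaker has no reply, so the dialogue ends
        history.append((speaker, utterance))
    return history

always = lambda history: f"turn {len(history)}"
stops_early = lambda history: None if len(history) >= 3 else "ok"
print(len(simulate_dialogue((always, always))))       # capped by MAX_TURNS
print(len(simulate_dialogue((always, stops_early))))  # ends when A_j has no reply
```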

We use GPT-3.5-turbo as the backbone LLM throughout the SDR framework, the same model used by Park et al. (2023) (Origin). To compare with a stronger baseline, we rerun the simulation from Origin but generate three candidates for each $U_{c}$ and select the best one as judged by the LLM; we denote this setting as Baseline. More detailed settings are provided in Appendix D.1.

In the simulation framework, each dialogue generated at time $t$ can alter the memory, location, and status of the agents, and thus might change the dialogues that follow $t$. To ensure a fair comparison with Origin, we opted not to regenerate new memories, locations, or statuses for agents following their conversations. Instead, we treated each dialogue as a distinct, standalone example.

3.2 Evaluation

We conduct corpus-level evaluation on three key aspects: diversity, consistency, and factualness, corresponding to the three error types - repetition, inconsistency, and hallucination - we aim to address.

Diversity

We employ the widely used metrics Distinct-N (Li et al., 2016) and Semantic Distance (Distance) (Dziri et al., 2019) for diversity evaluation. Please refer to Appendix D.2.1 and the original papers for more details.
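For reference, Distinct-N is the ratio of unique n-grams to the total number of n-grams in a corpus. The sketch below assumes lowercased whitespace tokenization, which may differ from the tokenization used in our experiments.

```python
def distinct_n(utterances, n):
    """Corpus-level Distinct-N: unique n-grams divided by total n-grams."""
    ngrams = []
    for utterance in utterances:
        tokens = utterance.lower().split()  # assumed tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

corpus = ["a b a b", "a b"]
print(distinct_n(corpus, 1), distinct_n(corpus, 2))
```

Higher values indicate less repeated wording across the corpus.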

To further analyze diversity from the perspective of each individual agent, we propose an agent-based metric, Agent Diversity (Agent Div). This metric is based on the idea that an agent should adjust the conversation content according to the identity of the listener and not always talk about the same thing with everyone. The corpus-level Agent Div is the average of the per-agent values, each of which is one minus the normalized similarity between the dialogues of that agent. The algorithm is provided in Appendix D.2.1.
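One plausible reading of this definition is sketched below; the exact algorithm is the one in Appendix D.2.1 and may differ. The sketch assumes the per-agent similarity is the mean pairwise similarity between that agent's dialogues, already normalized to $[0, 1]$, and uses a Jaccard word overlap purely as a placeholder similarity function.

```python
from itertools import combinations

def jaccard(a, b):
    """Placeholder similarity in [0, 1]; an embedding-based similarity could be used instead."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def agent_div(dialogues_by_agent, similarity=jaccard):
    """Average over agents of (1 - mean pairwise similarity of the agent's dialogues)."""
    per_agent = []
    for dialogues in dialogues_by_agent.values():
        pairs = list(combinations(dialogues, 2))
        if not pairs:
            continue  # an agent with fewer than two dialogues is skipped (assumption)
        mean_sim = sum(similarity(a, b) for a, b in pairs) / len(pairs)
        per_agent.append(1.0 - mean_sim)
    return sum(per_agent) / len(per_agent) if per_agent else 0.0

toy = {
    "A": ["let us collaborate", "let us collaborate"],  # always the same topic
    "B": ["how is your poem", "see you tomorrow"],      # different topics
}
print(agent_div(toy))
```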

Consistency and Factualness

We assess the corpus-level consistency and factualness by GPT-4 and human evaluation.

- GPT-4: Recent works (Liu et al., 2023; Mendonça et al., 2023; Luo et al., 2023; Gao et al., 2023) have shown that GPT-4 evaluation correlates more closely with human judgements. It is also scalable and reproducible. Thus, we select GPT-4 for this challenging task. Evidence dialogues are retrieved in the same way as in the Screening stage described in Section 2.1. GPT-4 then scores the consistency and factualness of the current dialogue on a scale from 1 to 10. The error rate is the ratio of dialogues receiving a score below a threshold of 8, which we set based on empirical observations.

- Human Evaluation: We conduct human evaluation on the most challenging final 10% of the conversations. We recruited two annotators and provided them with exactly the same prompt as GPT-4 to rate consistency and factualness. The average scores are reported in this paper. The intraclass correlation coefficients (Shrout & Fleiss, 1979; McGraw & Wong, 1996) for factualness and consistency are 0.47 and 0.44, respectively.
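The error rate used in the GPT-4 evaluation above reduces to a simple ratio, sketched here with made-up scores for illustration only.

```python
def error_rate(scores, threshold=8):
    """Fraction of dialogues whose score falls strictly below the threshold."""
    return sum(s < threshold for s in scores) / len(scores) if scores else 0.0

# Hypothetical scores on the 1-10 scale; a score of exactly 8 is not an error.
print(error_rate([9, 8, 7, 10]))
```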

Fluency

We use perplexity computed with GPT-2 as the fluency metric. We do not emphasize fluency evaluation, as our observations indicate that all generated dialogues are highly fluent and grammatically correct.

4 Result and Discussion

Method	Distinct-1 / 2 / 3	Distance	Agent Div	Factualness Score	Factualness Error ($\downarrow$)	Consistency Score	Consistency Error ($\downarrow$)	Fluency PPL ($\downarrow$)	Turns	Words
Origin	0.117 / 0.473 / 0.726	0.234	0.454	8.58	24.5%	8.17	37.2%	20.37	9.6	25.4
Baseline	0.124 / 0.469 / 0.718	0.274	0.475	8.77	25.5%	8.10	39.7%	20.18	15.5	29.3
SDR (Ours)	0.132 / 0.521 / 0.773	0.311	0.502	8.89	19.0%	8.27	32.4%	19.73	10.3	42.5

Table 1: Corpus-level (multi-dialogues) evaluation. Turns and Words refer to the average number of turns per dialogue and words per turn.

SDR Achieves the Best Diversity, Factualness, Consistency, and Fluency in Multi-dialogue Contexts

As illustrated in Table 1, our SDR framework (Ours) achieves the best corpus-level dialogue diversity, factualness, consistency, and fluency. SDR maintains an average number of turns similar to Origin, in contrast with the Baseline, which tends to continue until reaching the predetermined maximum turn count (16). Although SDR dialogues do not have many turns, each utterance conveys more comprehensive information, as evidenced by the higher word count per turn in SDR.

SDR Significantly Reduces Keyword Repetition

Figure 5(a) demonstrates how our approach effectively reduces the repetition of the most frequently used keywords. To delve deeper into the occurrence of repetition across all dialogues, we conducted an analysis focusing on keyword frequency. Specifically, we determined the keywords by TF-IDF scores and counted the number of dialogues containing each of the top 6 noun keywords. (Since “collaboration” and “election” have multiple variations, we use the root forms of these words, “collabora” and “elect,” to ensure a more accurate representation of their usage across all dialogues.) Figure 5(a) shows that our method substantially decreases the number of dialogues mentioning key terms, particularly “creativity” and “collaboration.” Compared to Origin, the number of dialogues featuring these keywords is reduced by up to 47% and 44%, respectively.
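The dialogue-level keyword count and the relative reduction can be sketched as follows. The toy dialogues are invented for illustration and do not come from OneDayLife; only the root form “collabora” is taken from the text.

```python
def dialogues_mentioning(dialogues, stem):
    """Count dialogues in which at least one utterance contains the stem (case-insensitive)."""
    return sum(any(stem in utt.lower() for utt in dialogue) for dialogue in dialogues)

def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before if before else 0.0

# Invented examples: each dialogue is a list of utterances.
origin = [["Let's collaborate!"], ["Collaboration is key."], ["Nice weather."], ["We should collaborate."]]
revised = [["Let's collaborate!"], ["How is your poem?"], ["Nice weather."], ["See you at the cafe."]]

before = dialogues_mentioning(origin, "collabora")
after = dialogues_mentioning(revised, "collabora")
print(before, after, relative_reduction(before, after))
```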

Agent Div Negatively Correlated to the Number of Initiated Dialogues

We analyze the relation between Agent Diversity and dialogue volume in Fig. 5(b). Despite a few exceptions, Agent Div is negatively correlated with the number of dialogues in which the agent is involved: the more an agent talks, the lower its diversity tends to be.

Method	Factualness (GPT-4)	Factualness (human)	Consistency (GPT-4)	Consistency (human)
Origin	8.34	6.90	7.41	5.00
SDR (Ours)	8.96	8.57	8.18	6.14

Table 2: Human evaluation on the last 10% of dialogues.

Figure 6: Error rate trends over time.

SDR Enhances Consistency and Reduces Hallucination

As illustrated in Table 2, the human evaluation reflects the same trend as the GPT-4 assessment, showing that SDR is superior to Origin in both factualness and consistency. Additionally, the scores for factualness are consistently higher than those for consistency in both the human and GPT-4 evaluations. Notably, the GPT-4 scores are higher than the human scores in all cases, which might indicate that some errors are not detected by GPT-4.

Furthermore, we plot how the error rate changes across percentiles of the dialogue sequence, as shown in Fig. 6. We observed that errors gradually increase toward higher percentiles, especially in the last one, suggesting that inconsistencies or contradictions become more frequent later in the simulation. The analysis of error rate trends shows that SDR generally exhibits lower error rates than Origin in both factualness and consistency. Additionally, the slope of the Consistency error rate for Origin is 1.5 times that of SDR. This disparity indicates more pronounced error propagation in Origin, highlighting the effectiveness of our method in maintaining dialogue integrity over time.

Prompt Info	Prompt Type	Distinct-1 / 2 / 3	Distance	Factualness Score	Factualness Error ($\downarrow$)	Consistency Score	Consistency Error ($\downarrow$)	Fluency PPL ($\downarrow$)	Turns	Words
Origin		0.445 / 0.724 / 0.886	0.212	8.34	31.0%	7.41	55.2%	22.2	8.1	24.4
Baseline		0.323 / 0.709 / 0.869	0.238	8.07	44.8%	7.72	41.4%	20.1	15.0	31.2
All	Task	0.278 / 0.742 / 0.918	0.306	8.45	31.0%	8.21	34.5%	21.3	10.6	36.0
All	Persona	0.286 / 0.751 / 0.917	0.288	8.52	27.6%	7.79	41.4%	20.3	9.9	43.9
All	Mixed	0.292 / 0.744 / 0.919	0.303	8.66	27.6%	8.21	44.8%	19.0	11.0	41.8
- background	Mixed	0.303 / 0.751 / 0.921	0.338	8.79	32.1%	8.14	32.1%	20.4	9.4	33.7
- memory	Mixed	0.349 / 0.778 / 0.931	0.305	8.96	17.9%	8.18	35.7%	19.2	10.6	44.1
- history	Mixed	0.319 / 0.774 / 0.926	0.292	8.38	31.0%	8.69	31.0%	20.2	9.9	42.8
- status	Mixed	0.271 / 0.717 / 0.898	0.257	8.39	25.0%	8.21	35.7%	19.6	9.9	49.0

Table 3: Ablation study on the last 10% of conversations in OneDayLife.

SDR can Balance Diversity and Faithfulness

Table 3 shows the ablation study for various prompt designs. The ablation study is conducted on the last percentile of conversations, where Origin has the worst factualness and consistency scores. We first identify the benefit of using diverse prompt types: randomly picking between the structured task-oriented prompt and the persona-based narrative prompt yields results better than or comparable to using either of them alone. We also investigate whether all information in the original prompt is necessary. Surprisingly, we found that prompts excluding memory outperformed others in most aspects. This outcome seems counterintuitive, as memory is generally considered crucial for preventing hallucinations and ensuring consistency. However, memory can impose a strong constraint that may reduce conversational diversity. By opting for a no-memory prompt, we open the door to more diverse conversational content, while our SDR framework ensures that the utterance $U_{c}$ remains consistent with previous dialogues and free from critical factual errors. This approach allows us to strike a balance between diversity and faithfulness in multi-agent, multi-dialogue generation.

Effectiveness of Screening and Diagnosis Stages

We evaluate whether our SDR framework can successfully detect potential problems by calculating the precision (P), recall (R), and F1 score for the output from the Screening and Diagnosis stages, using a subset of OneDayLife. We place particular emphasis on recall, as our primary objective is to identify potential issues at these stages. We manually annotated 50 utterances as either problematic or non-problematic and processed them through our framework up to the Diagnosis stage. The results validate the effectiveness of the procedures in the Screening and Diagnosis stages, with P, R, and F1 scores of 64.7%, 88.0%, and 74.6%, respectively. We also show an example which fails to pass all three pipelines in Table 5 in Appendix E.1.
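As a consistency check on these figures, the sketch below recomputes P, R, and F1 from a confusion-matrix split. The counts (22 true positives, 12 false positives, 3 false negatives, leaving 13 true negatives out of 50) are an assumption: they are one split that reproduces the reported percentages, not numbers stated in the text.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Assumed counts, chosen to be consistent with the reported scores.
p, r, f1 = precision_recall_f1(tp=22, fp=12, fn=3)
print(f"P={p:.1%}  R={r:.1%}  F1={f1:.1%}")
```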

NLI-G Study

We assess the effectiveness of NLI-G by using evaluation scores on original dialogues. We compare the factualness and consistency scores of dialogues based on two sources of evidence: evidence retrieved by NLI-G and evidence from the agent’s previous dialogues (denoted as Prev). In both cases, the number of pieces of evidence is five. In Fig. 7, the left figure shows that the scores with NLI-G are generally lower than with Prev (average differences of -0.82 for factualness and -0.62 for consistency), indicating NLI-G’s superior ability to capture crucial evidence for GPT-4 in identifying issues. Additionally, we examine NLI-G’s retrieval variance by evaluating the last 10% of dialogues five times, counting the frequency of evidence dialogues. The right figure illustrates the composition of each dialogue’s evidence count, with colored sections representing the proportions of the top five evidence (average colored area: 61.8% of the bar) and gray for the others. This demonstrates NLI-G’s consistency in retrieving similar dialogues across different trials, even when the number of available evidence dialogues - equal to the quantity of past dialogues involving either of the agents - exceeds 50 in later dialogues.

5 Related Work

LLM-powered Agents

A typical LLM-powered agent uses predefined or dynamically generated prompt templates to harness the LLM’s capabilities and achieve specific functionalities (Zhao et al., 2023). The agent of Wang et al. (2023) can autonomously navigate a game world, maintaining a skill library to expand its problem-solving scope. Richards et al. (2023) and Nakajima (2023) employ a chain-of-thought (Wei et al., 2022) approach to provide automated solutions for designated tasks. Nevertheless, a single agent has its limitations, and the collective intelligence formed by multiple agents can yield results greater than the sum of its parts. Hong et al. (2023) integrate specialized human SOP expertise to accomplish intricate software development. Chen et al. (2023b) focus on designing communication mechanisms among agents to enhance decision-making efficacy.

Revision Strategy

While Madaan et al. (2023); Skreta et al. (2023) employ self-feedback to improve LLM performance and task success rates, our approach prioritizes a comprehensive analysis across multiple dialogues. Other strategies, such as prompt engineering, may extend the length of prompts without addressing the challenge of extracting essential information from extended inputs. SDR adheres to the divide-and-conquer philosophy, breaking down a problem into manageable parts for more effective resolution.

Machine-Generated Dialogues

Our work is distinct from other machine-generated multi-session dialogue datasets in several aspects. First, the utterances in OneDayLife are generated iteratively by each agent, rather than through full conversation synthesis as in SODA (Kim et al., 2023), PLACES (Chen et al., 2023a), and Dialog Inpainting (Dai et al., 2022). This approach more realistically simulates agent behaviors but presents challenges in maintaining global coherence. Second, the conversations in OneDayLife exhibit causal relationships, unlike the independent dialogue pieces in other datasets. These specific properties make OneDayLife particularly suitable for our study on the authenticity of LLM agents and for uncovering issues inherent in multi-agent, multi-session dialogues.

6 Conclusion

In conclusion, our study introduces and validates a novel multi-agent dialogue simulation framework equipped with a Screening, Diagnosis, and Regeneration (SDR) mechanism, addressing the inherent challenges in multi-agent, multi-session dialogues such as repetition, inconsistency, and hallucination. By meticulously analyzing the dialogues generated in the OneDayLife simulation, we identified significant issues that compound over time, negatively impacting dialogue quality. Our proposed SDR framework effectively reduces these problems and enhances dialogue diversity, consistency, and factualness, as evidenced by automatic metrics, GPT-4 assessments, and human evaluations. We successfully reduce the keyword repetition, maintain dialogue integrity over multiple sessions, and balance conversational diversity with faithfulness of multi-agent dialogue simulation. Our work not only highlights the previously unaddressed challenges in multi-agent simulations but also sets a new standard for dialogue quality, paving the way for more advanced and realistic simulations in future research.

References

Bae et al. (2022) Sanghwan Bae, Donghyun Kwak, Soyoung Kang, Min Young Lee, Sungdong Kim, Yuin Jeong, Hyeri Kim, Sang-Woo Lee, Woomyoung Park, and Nako Sung. Keep me updated! memory management in long-term conversations. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 3769–3787, 2022.
Chen et al. (2023a) Maximillian Chen, Alexandros Papangelis, Chenyang Tao, Seokhwan Kim, Andy Rosenbaum, Yang Liu, Zhou Yu, and Dilek Hakkani-Tur. Places: Prompting language models for social conversation synthesis. In Findings of the Association for Computational Linguistics: EACL 2023, pp. 844–868, 2023a.
Chen et al. (2023b) Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents. arXiv preprint arXiv:2308.10848, 2023b.
Cheng et al. (2024) Yuheng Cheng, Ceyao Zhang, Zhengwen Zhang, Xiangrui Meng, Sirui Hong, Wenhao Li, Zihao Wang, Zekai Wang, Feng Yin, Junhua Zhao, et al. Exploring large language model based intelligent agents: Definitions, methods, and prospects. arXiv preprint arXiv:2401.03428, 2024.
Dai et al. (2022) Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Y Zhao, Aida Amini, Qazi Mamunur Rashid, Mike Green, and Kelvin Guu. Dialog inpainting: Turning documents into dialogs. In International conference on machine learning, pp. 4558–4586. PMLR, 2022.
Dziri et al. (2019) Nouha Dziri, Ehsan Kamalloo, Kory Mathewson, and Osmar Zaiane. Evaluating coherence in dialogue systems using entailment. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3806–3812, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1381. URL https://aclanthology.org/N19-1381.
Gao et al. (2023) Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. Human-like summarization evaluation with chatgpt. arXiv preprint arXiv:2304.02554, 2023.
Hong et al. (2023) Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
Jang et al. (2023) Jihyoung Jang, Minseong Boo, and Hyounghun Kim. Conversation chronicles: Towards diverse temporal and relational dynamics in multi-session conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 13584–13606, 2023.
Kim et al. (2023) Hyunwoo Kim, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, Ronan Bras, Malihe Alikhani, Gunhee Kim, et al. Soda: Million-scale dialogue distillation with social commonsense contextualization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12930–12949, 2023.
Laurer et al. (2023) Moritz Laurer, Wouter Van Atteveldt, Andreu Casas, and Kasper Welbers. Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI. Political Analysis, pp. 1–33, June 2023. ISSN 1047-1987, 1476-4989. doi: 10.1017/pan.2023.20. URL https://www.cambridge.org/core/product/identifier/S1047198723000207/type/journal_article.
Leidinger et al. (2023) Alina Leidinger, Robert van Rooij, and Ekaterina Shutova. The language of prompting: What linguistic properties make a prompt successful? arXiv preprint arXiv:2311.01967, 2023.
Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1014. URL https://aclanthology.org/N16-1014.
Liang et al. (2023) Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. arXiv preprint arXiv:2310.01783, 2023.
Liu et al. (2023) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2511–2522, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.153. URL https://aclanthology.org/2023.emnlp-main.153.
Luo et al. (2023) Zheheng Luo, Qianqian Xie, and Sophia Ananiadou. Chatgpt as a factual inconsistency evaluator for abstractive text summarization. arXiv preprint arXiv:2303.15621, 2023.
Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
McGraw & Wong (1996) Kenneth O McGraw and Seok P Wong. Forming inferences about some intraclass correlation coefficients. Psychological methods, 1(1):30, 1996.
Mendonça et al. (2023) John Mendonça, Patrícia Pereira, Helena Moniz, Joao Paulo Carvalho, Alon Lavie, and Isabel Trancoso. Simple LLM prompting is state-of-the-art for robust and multilingual dialogue evaluation. In Yun-Nung Chen, Paul Crook, Michel Galley, Sarik Ghazarian, Chulaka Gunasekara, Raghav Gupta, Behnam Hedayatnia, Satwik Kottur, Seungwhan Moon, and Chen Zhang (eds.), Proceedings of The Eleventh Dialog System Technology Challenge, pp. 133–143, Prague, Czech Republic, September 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.dstc-1.16.
Nakajima (2023) Yohei Nakajima. Babyagi, 2023. URL https://github.com/yoheinakajima/babyagi.
OpenAI (2023) OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
Park et al. (2023) Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023.
Richards & et al. (2023) Toran Bruce Richards et al. Auto-gpt: An autonomous gpt-4 experiment, 2023. URL https://github.com/Significant-Gravitas/Auto-GPT.
Rodgers & Nicewander (1988) Joseph Lee Rodgers and W. Alan Nicewander. Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1):59–66, 1988. doi: 10.1080/00031305.1988.10475524. URL https://doi.org/10.1080/00031305.1988.10475524.
Shrout & Fleiss (1979) Patrick E Shrout and Joseph L Fleiss. Intraclass correlations: uses in assessing rater reliability. Psychological bulletin, 86(2):420, 1979.
Skreta et al. (2023) Marta Skreta, Naruki Yoshikawa, Sebastian Arellano-Rubach, Zhi Ji, Lasse Bjørn Kristensen, Kourosh Darvish, Alán Aspuru-Guzik, Florian Shkurti, and Animesh Garg. Errors are useful prompts: Instruction guided task programming with verifier-assisted iterative prompting. arXiv preprint arXiv:2303.14100, 2023.
Spearman (1961) Charles Spearman. The proof and measurement of association between two things. 1961.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Wang et al. (2023) Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Xu et al. (2022) Jing Xu, Arthur Szlam, and Jason Weston. Beyond goldfish memory: Long-term open-domain conversation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5180–5197, 2022.
Zhao et al. (2023) Pengyu Zhao, Zijian Jin, and Ning Cheng. An in-depth survey of large language model-based artificial intelligence agents. arXiv preprint arXiv:2309.14365, 2023.

Appendix A Human-Like Multi-Agents

A.1 From General LLM to Individual Persona

Generative Agents (Park et al., 2023) introduces a two-component architecture for creating personalized, dynamic human-like agents: a string-based memory base and an LLM-driven cognitive function set. The memory base stores memories over time, aiding in the development of diverse agents, while the LLM-centric cognitive functions simulate human capabilities like reflection, planning, and reaction. Combining these, the LLM uses memory-derived context to tailor knowledge extraction and response formulation, ensuring agent-specific behaviors.

In a scenario where two agents initiate a dialogue, each iteratively produces utterances informed by context such as location, status, and memories. The dialogue function uses a specific prompt such as: “Based on the […] information, what will [name] say next?” For more details on Generative Agents and the simulation of 25 agents in a village, please see the original paper.

A.2 Conversations and Transmission

Because each agent stores its dialogues in memory, information exchanged in a dialogue session is transmitted to both agents involved. However, this also means that undesirable dialogue content can spread in the same manner.

Figure 3 illustrates this using keyword spreading as an example. The bar chart shows the proportion of dialogues containing the keyword (in red) compared to the total number of conversations (in blue), highlighting a swift escalation, sometimes reaching 100%. Chord diagrams further reveal that initially, only a few agents act as propagators, but as the day progresses, the majority become involved in similar actions, as indicated by the variety of line colors. Consequently, the dialogue topics become repetitive and less believable. This example underscores the scale of the spreading and the associated risks, emphasizing the need for a dynamic correction approach.

Appendix B SDR Details

B.1 Dynamic Similarity

We employ a dynamic similarity threshold to determine whether an utterance is a repetition. The value of $\theta_{sim}$ is adjusted according to where the similar utterance comes from: it is decreased if the similar utterance originates from the same agent $A$ as $U_{c}$ and appears in the current dialogue $D_{t}$, since this indicates repetition. Conversely, a higher threshold is acceptable if the similar utterance is from $A$ but in a different dialogue. This process is formalized in Eq. 1.

$$\theta_{sim}=\begin{cases}\theta+\alpha&\text{if same $A$, different $D$}\\ \theta-\alpha&\text{if same $A$, same $D$}\\ \theta&\text{otherwise}\end{cases}\tag{1}$$
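The case analysis in Eq. 1 can be sketched in a few lines of Python. The function and argument names below are illustrative and not taken from the authors' code, and the comparison direction in `is_repetition` (flag when similarity reaches the threshold) is an assumption.

```python
def dynamic_threshold(theta: float, alpha: float,
                      same_agent: bool, same_dialogue: bool) -> float:
    """Return theta_sim following the three cases of Eq. 1."""
    if same_agent and not same_dialogue:
        return theta + alpha   # same A, different D: stricter threshold
    if same_agent and same_dialogue:
        return theta - alpha   # same A, same D: looser threshold
    return theta               # otherwise: base threshold


def is_repetition(similarity: float, theta_sim: float) -> bool:
    # Assumption: an utterance is flagged once its similarity to a past
    # utterance reaches the (dynamic) threshold.
    return similarity >= theta_sim
```

With the values reported in Appendix D.1 ($\theta=0.85$, $\alpha=0.05$), a similarity of 0.82 would be flagged against an utterance by the same agent in the current dialogue but not against one from a different dialogue.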

B.2 The Extracted Graphs in NLI-G

We show an example of a personal information graph extracted from a dialogue. Table 4 lists a dialogue and the graph extracted from it.

Raw Dialogue

Giorgio Rossi: Excuse me, are you John Lin? John Lin: Yes, that’s me. How may I help you? Giorgio Rossi: I noticed that you’re very knowledgeable about medications and treatments. I’ve been curious about some of the latest developments in that field. Would you mind sharing some of your insights with me? John Lin: Of course, I’m happy to share what I know. What specifically are you interested in? Giorgio Rossi: Well, I’m particularly intrigued by how some medications are being developed based on mathematical patterns found in nature. Do you have any knowledge about that? John Lin: Hmm, that’s not really my area of expertise. But I do know that Jennifer Moore, the artist who’s running for local mayor, is also a scientist who incorporates mathematical patterns into her work. Maybe she would have more information on that.

Extracted Graph

[
[“Giorgio Rossi”, “is interested in”, “medications and treatments”],
[“John Lin”, “is”, “John Lin”],
[“Giorgio Rossi”, “discusses interests with”, “John Lin”],
[“John Lin”, “is knowledgeable about”, “medications and treatments”],
[“Giorgio Rossi”, “is interested in”, “mathematical patterns found in nature”],
[“John Lin”, “does not have expertise in”, “medications developed based on mathematical patterns found in nature”],
[“Jennifer Moore”, “is mentioned by”, “John Lin”],
[“Jennifer Moore”, “has role”, “artist”],
[“Jennifer Moore”, “is running for”, “local mayor”],
[“Jennifer Moore”, “has profession”, “scientist”],
[“Jennifer Moore”, “incorporates”, “mathematical patterns into work”],
]

Table 4: Example of a graph extracted from raw dialogue in the NLI-G module.
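As a rough illustration of how such a graph might be held in memory, the sketch below stores (subject, relation, object) triples and indexes them by subject so that all extracted facts about one agent can be looked up together. The helper name `index_by_subject` and the indexing scheme are assumptions for illustration; this appendix does not describe the data structure used by NLI-G.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def index_by_subject(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (relation, object) pairs under their subject, preserving order."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, relation, obj in triples:
        index[subject].append((relation, obj))
    return dict(index)


# A few triples copied from Table 4.
triples = [
    ("Jennifer Moore", "has role", "artist"),
    ("Jennifer Moore", "is running for", "local mayor"),
    ("John Lin", "is knowledgeable about", "medications and treatments"),
]
facts = index_by_subject(triples)
```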

Appendix C Limitations and Social Impact

C.1 Limitations

1. The primary limitation of our method is the cost, as we utilize GPT-3.5-turbo as the LLM backbone via OpenAI API calls.

2. Our results are limited to the OneDayLife dataset, and the evaluations, including automatic, LLM, and human evaluation metrics, may not cover all aspects since the dataset is not annotated. However, we expect that our methodology can be applied to scenarios involving similar LLM agents.

3. Instead of reproducing a complete full-day simulation, which incurs thousands of dollars in costs as reported in the original paper, we rerun the conversations individually using context (e.g., time, location, involved agents, past dialogues, past memories, etc.) identical to the original data with our framework. In this case, if the past context from the original data contains inherent problems, these are beyond the reach of our framework and thus contribute to the decrease in evaluation scores. We anticipate that the error rate of our method could be further reduced when running a complete simulation.

C.2 Social Impact

Our work focuses on conversations involving only LLM agents. However, future applications are expected to feature environments containing both LLM agents and human users. In these scenarios, human users can interact with the agents, and there is a possibility of malicious intervention. On the other hand, the influence of these interactions on human users remains unexplored. The best practices and effects of deploying our framework in real-world applications could be subjects for future study.

Appendix D Experiment Details

D.1 Hyper-Parameters

At the Screening phase, we set $K_{sim}$ to 5, the threshold $\theta$ to 0.85, $\alpha$ to 0.05, and $\theta_{force}$ to 0.95 for repetition detection. Utterances shorter than 10 words are excluded from the repetition screening, which likely preserves the natural “social glue turns”. For inconsistency detection, we use the DeBERTa-based NLI model pretrained on multiple NLI datasets (Laurer et al., 2023). $\theta_{nlig}$ is set as high as 0.98 because there are many false positives, and we select the top $K_{nlig}=3$ dialogues as potential contradictory dialogue evidence. The threshold for harmful hallucination likelihood $\theta_{fact}$ is set to 6. The number of diagnosis trials $N_{diag}$ is 3, and we select the LLM feedback with the highest score. If more than one feedback shares the highest score, we choose the longer one. The regeneration threshold $\theta_{regen}$ is 8. The SDR procedure terminates when no comments are found or after reaching $R=2$ rounds. We use GPT-3.5-turbo-0613 to run the baseline and SDR. Note that OneDayLife was generated by GPT-3.5-turbo before April 2023.

For GPT-4 evaluation, $\theta_{nlig}$ is set to 0.99 and $K_{nlig}$ is 5. We use the model GPT-4-0613.
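For reference, the SDR settings above can be collected in one configuration object, together with two of the stated rules: the 10-word filter for repetition screening and the tie-break among diagnosis trials. This is a minimal sketch; the class and function names are hypothetical, words are counted by whitespace splitting, and "longer" feedback is measured in characters, both of which are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SDRConfig:
    k_sim: int = 5            # K_sim
    theta: float = 0.85       # base similarity threshold
    alpha: float = 0.05       # dynamic adjustment in Eq. 1
    theta_force: float = 0.95
    theta_nlig: float = 0.98
    k_nlig: int = 3
    theta_fact: int = 6
    n_diag: int = 3           # diagnosis trials
    theta_regen: int = 8
    max_rounds: int = 2       # R
    min_words: int = 10       # shorter utterances skip repetition screening


def needs_repetition_screening(utterance: str, cfg: SDRConfig) -> bool:
    # Assumption: words are counted by splitting on whitespace.
    return len(utterance.split()) >= cfg.min_words


def pick_feedback(trials: List[Tuple[int, str]]) -> Tuple[int, str]:
    """Pick the highest-scoring feedback; ties go to the longer text."""
    return max(trials, key=lambda t: (t[0], len(t[1])))
```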

D.2 Evaluation

D.2.1 Diversity

Distinct-N (Li et al., 2016)

Distinct-N calculates the ratio of unique N-grams in a given text. However, it may not fully capture the corpus-level dialogue diversity, particularly when each dialogue has longer utterances, since individual dialogues typically revolve around a single topic. To address this, we apply Distinct-N to summaries of dialogues, generated by a pretrained dialogue summarization model. This approach allows us to more effectively gauge the thematic diversity of dialogues at the corpus level across multiple conversations.
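A minimal sketch of corpus-level Distinct-N over a list of dialogue summaries is given below. Lowercasing, whitespace tokenization, and building N-grams within each summary (never across summary boundaries) are assumptions; the summarization model itself is outside the scope of this example.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(texts: List[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams, pooled over all texts."""
    pooled: List[Tuple[str, ...]] = []
    for text in texts:
        pooled.extend(ngrams(text.lower().split(), n))
    if not pooled:
        return 0.0
    return len(set(pooled)) / len(pooled)
```

For the two toy summaries "they discuss the mayor election" and "they discuss a poetry project", 8 of the 10 unigrams and 7 of the 8 bigrams are unique.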

Semantic Distance (Dziri et al., 2019)

To complement the word-based Distinct-N metric, we measure the Semantic Distance in the embedding space. Specifically, we calculate the cosine similarity between dialogue embeddings. Semantic Distance is then determined as $1-\text{similarity}$.
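The computation can be sketched as follows. Averaging the cosine similarity over all unordered pairs of dialogue embeddings is an assumption about how the corpus-level value is aggregated, and embeddings are assumed to be non-zero vectors produced by some sentence encoder.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_distance(embeddings: List[Sequence[float]]) -> float:
    """1 - mean pairwise cosine similarity (assumed aggregation)."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    mean_sim = sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_sim
```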

Agent Diversity

The algorithm of Agent Diversity is shown in Algorithm 1. $Emb(\cdot)$ is a speaker dialogue embedding calculated from length-weighted utterance embeddings in the dialogue.

Data: Agents $A_{i}$ for $i\in\{1,2,...,N\}$; Dialogues $D_{ij}^{k}$ for $j\neq i$ and $0<k\leq K_{ij}$

// Calculate $AgentDiv_{i}$ for each agent $A_{i}$
targets $\leftarrow$ { $j$ | $K_{ij}\neq 0$ };
sims $\leftarrow$ 0; pairs $\leftarrow$ 0;
for each unique pair (p, q) in targets do
    $E_{p}\leftarrow\{\text{Emb}(D_{ip}^{k})\,|\,k\in\{1,2,\ldots,K_{ip}\}\}$;
    $E_{q}\leftarrow\{\text{Emb}(D_{iq}^{k})\,|\,k\in\{1,2,\ldots,K_{iq}\}\}$;
    $s_{pq}\leftarrow\frac{1}{K_{ip}K_{iq}}\sum_{a=1}^{K_{ip}}\sum_{b=1}^{K_{iq}}\text{CosSim}(E_{p_{a}},E_{q_{b}})$;
    sims $\leftarrow$ sims + $s_{pq}$; pairs $\leftarrow$ pairs + 1;
end for
$AgentDiv_{i}\leftarrow 1-\frac{sims}{pairs}$;

// Averaging $AgentDiv_{i}$ for all agents
$AgentDiv\leftarrow\frac{1}{N}\sum_{i=1}^{N}AgentDiv_{i}$;

Algorithm 1 Agent Diversity (Agent Div)
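A runnable sketch of Algorithm 1 is given below, assuming the speaker dialogue embeddings $Emb(\cdot)$ have already been computed. The nested-dictionary input format and function names are illustrative. Algorithm 1 does not specify what happens when an agent has fewer than two conversation partners (so that pairs is 0); skipping such agents in the average is an assumption made here.

```python
import math
from itertools import combinations
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def cos_sim(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def agent_div_i(embs_by_partner: Dict[str, List[Vector]]) -> Optional[float]:
    """AgentDiv_i for one agent; maps partner j -> [Emb(D_ij^k) for each k]."""
    targets = [j for j, embs in embs_by_partner.items() if embs]  # K_ij != 0
    sims, pairs = 0.0, 0
    for p, q in combinations(targets, 2):
        e_p, e_q = embs_by_partner[p], embs_by_partner[q]
        # Mean cosine similarity over all cross pairs of dialogues.
        s_pq = sum(cos_sim(a, b) for a in e_p for b in e_q) / (len(e_p) * len(e_q))
        sims += s_pq
        pairs += 1
    if pairs == 0:
        return None  # assumption: undefined for agents with < 2 partners
    return 1.0 - sims / pairs


def agent_div(all_agents: Dict[str, Dict[str, List[Vector]]]) -> float:
    """Average AgentDiv_i over agents for which it is defined."""
    scores = [s for s in map(agent_div_i, all_agents.values()) if s is not None]
    return sum(scores) / len(scores) if scores else 0.0
```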

Appendix E Additional Analysis and Examples

E.1 Output of Diagnosis Stage

We show the output of Diagnosis stage in Table 5.

Context	Time: 2023-02-13 11:55:40 Location: store in The Willows Market and Pharmacy Status: John: performing special tasks and handling customer queries (checking the inventory) Yuriko: taking a break and having lunch (researching the local mayor election) Dialogue Context: John: Hey there, Yuriko. Looks like we’re both keeping busy with work and classes. Yuriko: Yes, it seems like we’re both trying to stay on top of things. I’m actually taking classes on new tax laws. John: That’s interesting. I’m taking classes on new medications and treatments for my pharmacy. Yuriko: I’ve heard about some exciting new research on tax incentives for small businesses. Have you heard anything interesting lately? John: Actually, I’ve been asking around about who’s running for mayor next month. Have you heard anything? Yuriko: Yes, I’ve heard that Sam Moore is running. He has a vision for investing in the local economy and fostering creativity and innovation. What do you think about that?
$U_{c}$	John: I actually have a friend who’s running for mayor, Jennifer Moore. She’s an artist and is interested in developing medications based on mathematical patterns found in nature.
	Score	Reason
Repetition	8	The response contains some repetitive statements that are already known from the previous conversation. For example, John Lin already mentioned that he has a friend named Jennifer Moore who is running for mayor and is interested in medication development based on mathematical patterns. Restating this information in the response is redundant. Additionally, John asking for more information from Jennifer Moore is already implied by him stating that she is his friend. These repetitive statements make the response less natural.
Consistency	8	John Lin’s candidate response refers to his friend Jennifer Moore running for mayor, while in the past dialogue, he mentioned Sam Moore as a candidate.
Hallucination	false	Jennifer Moore mentioned that she is not interested in politics and prefers to focus on her art and her relationships with the people she cares about, indicating that she would not agree with the statement that John Lin’s friend is running for mayor.

Table 5: Diagnosis stage output.

E.2 GPT4 Score and Dialogue Length

We investigate whether the scores given by GPT-4 are biased by dialogue length. We use the Pearson Correlation Coefficient (Rodgers & Nicewander, 1988) and the Spearman Rank-Order Correlation Coefficient (Spearman, 1961) to examine the correlation between the scores (factualness and consistency) and the dialogue statistics (number of utterances and words). The results in Table 6 show little to no correlation between them: all coefficients have a magnitude below 0.1.

	Pearson Coef.		Spearman Coef.
	# Turn	# Word	# Turn	# Word
Factualness	0.054	0.086	0.026	0.039
Consistency	-0.025	-0.021	-0.024	-0.014

Table 6: The correlation coefficients between GPT-4 scores and dialogue statistics on Origin.
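For readers who want to run the same kind of check on their own data, the two coefficients can be computed with the standard library alone. This is a sketch, not the authors' evaluation script: Spearman is computed as Pearson on average ranks (ties share the mean rank), and any input lists are placeholders rather than data from the paper.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(average_ranks(x), average_ranks(y))
```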

E.3 Problematic Dialogues in OneDayLife

Below are four dialogue snippets from OneDayLife involving the four keywords “math”, “poetry”, “artwork”, and “collaborat”.

One unnatural point is that agents from diverse backgrounds all show interest in collaborating on poetry and artwork (and mathematical patterns): Carmen (a shopkeeper) [D1,D2,D3], Latoya (a photographer) [D1,D2,D4], Hailey (a writer) [D2], Arthur (a bartender) [D2], Tamara (a children’s book author) [D3], Giorgio Rossi (a mathematician) [D3], Abigail Chen (a digital artist and animator) [D3], Francisco Lopez (an actor and comedian) [D3,D4], Ryan Park (a software engineer) [D4], Rajiv (a painter) [D4].

Another unnatural point is the variation in content and participants involved in the collaborations. For example, the people collaborating on poetry and artwork in each dialogue are:

[D1] Carmen, Latoya

[D2] Carmen, Latoya $|$ Hailey, Arthur

[D3] Carmen, Tamara, Giorgio, Abigail, Francisco

[D4] Latoya, Ryan $|$ Rajiv, Francisco

Furthermore, while Carmen does not have other conversations between D1 and D2, the collaborations she mentions differ in the two dialogues.

[D1]

1. Rajiv - art projects and medicinal development
2. Latoya - poetry and artwork $|$ mobile app
3. Arthur - mixology

[D2]

1. Tamara and Rajiv - creating job opportunities and expanding book sales online
2. Latoya - poetry and artwork
3. Giorgio and Arthur - mixology and medicinal development

Dialogue Snippets:

[D1] Time: 2023-02-13 14:43:20

Arthur (a bartender): Excuse me, Carmen. I couldn’t help but overhear your conversation about potential collaborations. … Additionally, I’m passionate about creating job opportunities for young people in the community and would love to brainstorm with you on how we could work together on that. What do you think?

Carmen (a shopkeeper): That sounds great, Arthur! I’m actually meeting Rajiv Patel (a painter) here later to discuss potential collaborations on art projects and medicinal development using mathematical patterns found in nature. … I’m also discussing potential collaborations with various individuals about promoting art in low-income communities and supporting local businesses. And we have plans for a poetry and artwork project with Latoya Williams (a photographer). Additionally, we’re discussing potential collaborations on creative projects such as a mobile app for local businesses and community events. I would love to hear more about your mixology ideas and how we might be able to collaborate on that as well.

[D2] Time: 2023-02-13 14:57:20

Hailey (a writer): I was actually just talking to Arthur Burton (a bartender) about potential collaborations in mixology, art, poetry, improv, and medication development based on mathematical patterns in nature. I invited him to be a guest on my podcast to contribute unique cocktail recipes. And then I saw you and thought it would be great to catch up.

Carmen (a shopkeeper): Wow, that sounds really interesting! I’m actually discussing potential collaborations with Tamara Taylor (a children’s book author) and Rajiv Patel (a painter) on creating job opportunities for young people in the community, as well as expanding book sales online. I’m also working with Latoya Williams (a photographer) on a poetry and artwork project, and considering collaborations with Giorgio Rossi (a mathematician) and Arthur Burton (a bartender) on mixology ideas and medicinal development using mathematical patterns found in nature. We all plan to grab a drink here after Rajiv’s first solo show.

[D3] Time: 2023-02-13 15:05:20

Jennifer (a painter): That sounds like a great idea! I’d love to collaborate with you and other local artists. Do you have any other projects or collaborations in mind?

Tamara (a children’s book author): Well, I’m considering collaborations with Carmen Ortiz (a shopkeeper), Giorgio Rossi (a mathematician), Abigail Chen (a digital artist and animator), and Francisco Lopez (an actor and comedian) for projects involving poetry, artwork, and potentially other fields like science and math. I’m also interested in attending political discussions and expanding my book sales online. What about you?

[D4] Time: 2023-02-13 18:16:20

Latoya (a photographer): That sounds like a great idea. And I’m also open to collaborating with you on exploring connections between math, nature, and art. I’m actually discussing a similar project with Ryan Park (a software engineer).

Rajiv (a painter): That’s great to hear. And if you’re interested, Francisco Lopez (an actor and comedian) and I are also exploring a project involving poetry, artwork, and mathematical patterns. So there’s definitely a lot of potential for collaboration and creativity in this space.

Appendix F Prompt Examples

F.1 Repetition Check Prompt

Context for the task:
[Speaker’s background]

Here are some conversation histories between various people:
[Speaker]:**
Time: $t_{0}$
[Evidence Dialogue $D_{t_{0}}$ ]
…
Time: $t_{k}$
[Evidence Dialogue $D_{t_{k}}$ ]

[Speaker] is about to say the following sentence (’the response’) next in the latest session:
[ $U_{c}$ ]

---
# Task: Please identify any ”unnatural points” in ’the response’.
An ”unnatural point” refers to redundancies or repetitive statements made in ’the response’ when considering the context of the previous conversations.
On a scale of 1 (no unnatural point) to 10 (the most significant of unnatural point), rate the likely significant score of ’the response’. And explain the reason for the score.

Output format: Output a json of the following format:
{
“reason”: “point out the unnatural point and your reason for the score”, “score”: ” $<$ json integer $>$ ”
}
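A reply in the output format above could be parsed along the following lines. This is a hedged sketch: the function name is hypothetical, and accepting the score either as a JSON integer or as a quoted integer string is an assumption made because the template shows the score placeholder inside quotes.

```python
import json
from typing import Tuple


def parse_repetition_reply(reply: str) -> Tuple[str, int]:
    """Extract (reason, score) from a JSON reply to the repetition check."""
    data = json.loads(reply)
    score = int(data["score"])  # tolerates both 8 and "8"
    if not 1 <= score <= 10:
        raise ValueError(f"score out of the 1-10 range: {score}")
    return data["reason"], score
```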

F.2 Consistency Check Prompt

# Context
**Background:**
[Speaker’s background]

**Past Dialogues involving [Speaker]:**
Time: $t_{0}$
[Evidence Dialogue $D_{t_{0}}$ ]
…
Time: $t_{k}$
[Evidence Dialogue $D_{t_{k}}$ ]

**Current Dialogue between [Speaker] and [Listener]:**
Time: $t$
[Current Dialogue $D_{t}$ ]

**Candidate Response:**
[Speaker] is planning to say: [ $U_{c}$ ]

---
# Task
Determine if there is any contradiction between the candidate response and the past dialogue/character background. Do not consider the absence of a repeated mention as an inconsistency. Ignore statements that are situational or not meant to be taken literally. Let’s think step by step.

**Output a JSON object:**
{
“Contradiction?”: $<$ true/false $>$ ,
“Details”: “ $<$ Specify any contradictions, if any $>$ ”
}

## Example
[An example including Past Dialogues, Current Dialogue, Candidate Response, and Output]

F.3 Agent Agreement Prompt

Context for the task:

Here is a brief description of [Mentioned Agent].
[Background of the Mentioned Agent]

Here is the memory that is in [Mentioned Agent]’s head:
[Memory of the Mentioned Agent]

Here is the previous conversation between [Mentioned Agent] and [Speaker]:
Time: $t_{-1}$
[Evidence Dialogue $D_{t_{-1}}$ ]

[Speaker] just mentioned the following statement about [Mentioned Agent]:
[ $U_{c}$ ]

---
# Task: Based on the information provided above, would [Mentioned Agent] agree with the statement?

Output format: Output a json of the following format:
{
“agreed”: “ $<$ json Boolean $>$ ”,
“reason”: “the reason that led [Mentioned Agent] to make the judgment”
}

F.4 Revise Utterance Prompts

F.4.1 Persona-based Narrative

Your name is [Speaker].
Your background is as follows:
[Background of the Speaker]

You are engaged in a conversation with [Listener], and here is the content of the dialogue so far:
[Current Dialogue $D_{t}$ ]

# Task:
Consider whether you would respond to [Listener]. If you choose to reply, what would you say? Would your response aim to conclude the conversation?
You might consider saying “ $U_{c}$ ”, but it has some issues, for instance:
[Reason]
Here are some suggestions for your reference:
[Suggestion]
If the response is redundant or repetitive, you can end the current dialogue.

**Output a JSON object:**
{
“Response”: “ $<$ your reply as [Speaker] (if any) $>$ ”,
“The conversation ends with [Speaker]’s utterance”: $<$ true/false $>$
}

F.4.2 Structured Task-oriented

# Contextual Information:
**Introduction:**
[Background of the Speaker]

**Current Dialogue between [Speaker] and [Listener]:**
[Current Dialogue $D_{t}$ ]

# Task:
Assuming the role of [Speaker], consider whether you would respond to [Listener]. If you choose to reply, what would you say? Would your response aim to conclude the conversation?
You might be considering saying something that has some issues, such as:
[Reason]
Here are some suggestions for your reference:
[Suggestion]
If the response is redundant or repetitive, you can end the current dialogue.

**Output a JSON object:**
{
“Response”: “ $<$ your reply as [Speaker] (if any) $>$ ”,
“The conversation ends with [Speaker]’s utterance”: $<$ true/false $>$
}