Memory Sharing for Large Language Model based Agents

Hang Gao
Department of Computer Science
Rutgers University
New Brunswick, NJ, US
[email protected]
Yongfeng Zhang
Department of Computer Science
Rutgers University
New Brunswick, NJ, US
[email protected]

Abstract

Adapting Large Language Model (LLM)-based agents to execute tasks via natural language prompts represents a significant advancement, notably eliminating the need for explicit retraining or fine-tuning. However, such agents are constrained by the comprehensiveness and diversity of the provided examples, leading to outputs that often diverge significantly from expected results, especially for open-ended questions. This paper introduces Memory Sharing (MS), a framework that integrates real-time memory filtering, storage, and retrieval to enhance the In-Context Learning process. The framework allows memories to be shared among multiple agents, whereby the interactions and shared memories between different agents effectively enhance the diversity of the memories. The collective self-enhancement through interactive learning among multiple agents facilitates the evolution from individual intelligence to collective intelligence. In addition, the dynamically growing memory pool is used not only to improve the quality of responses but also to train and enhance the retriever. We evaluated our framework across three distinct domains involving specialized agent tasks. The experimental results demonstrate that the MS framework significantly improves the agents’ performance in addressing open-ended questions.


1 Introduction

The emergence of Large Language Models (LLMs) has brought about significant transformations in machine learning and conversational AI, while the advent of In-Context Learning (ICL) (Brown et al., 2020) signifies a more subtle evolution. ICL facilitates dynamic and intuitive interactions between LLMs and users, enabling LLMs to perform tasks using few-shot examples without necessitating any updates to the model parameters. This greatly expands the capabilities of LLM-based agents.

Initially, ICL was introduced to help LLM-based agents achieve better performance with few-shot examples (Brown et al., 2020), and it was subsequently extended across various domains (Ahmed and Devanbu, 2022; Izacard et al., 2023). Following this, the proposal of Chain-of-Thought (CoT) prompting significantly augmented the proficiency of LLM-based agents in executing arithmetic tasks (Wei et al., 2022). Building upon this foundation, innovative methodologies such as PAL (Gao et al., 2023) and the integration of LLMs with symbolic solvers (He-Yueya et al., 2023) have been developed to further enhance agent capabilities in tackling reasoning tasks. Recent work has also developed agents that can continuously acquire diverse skills and make novel discoveries (Wang et al., 2023). However, as the range of questions continues to expand, especially for open-ended questions, enriching the diversity of examples becomes particularly important for enabling agents to produce the desired answers through ICL, since agents can learn more information from different angles.

Figure 1: The Memory Sharing framework. Whenever a new Prompt-Answer (PA) pair is generated, it is considered for addition to the memory pool and for training the retriever.

Combining Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) with ICL significantly increases the number of relevant examples available to agents for a particular question, which in turn facilitates more effective generation for open-domain queries (Mao et al., 2021). In recent developments, self-learning techniques have also been integrated with the retrieval mechanism within ICL to refine model performance in text generation tasks through the retrieval of examples with the most analogous patterns (Rubin et al., 2022; Wang et al., 2024). Although RAG increases the number of relevant examples available for a given question, it also makes performance heavily dependent on the quality of the external database. Moreover, for some problems a suitable external database may not be available at all.

Consequently, given the diversity of examples needed to help agents better answer open-ended questions through ICL, and the goal of diminishing the dependence on external databases, there is a critical need for a continuous supply of high-quality examples. To further minimize the dependence of LLM-based agents on external data and enhance their performance, this paper introduces the Memory Sharing (MS) framework. The MS framework is designed to enable multiple agents to share memories, where the interaction and shared memories among multiple agents enhance memory diversity. The collective self-enhancement achieved through multi-agent interaction represents a progression from individual intelligence to collective intelligence. Additionally, we have developed an interactive learning method that facilitates rapid growth and dynamic updating of memory through multi-agent interactions. Consequently, the diversity and rapid expansion of memory effectively improve the agents’ ability to respond to open-ended questions.

Specifically, within the MS framework, the input and output of an agent in a single interaction are conceptualized as a Prompt-Answer (PA) pair, also considered as a memory, and the shared memory pool is composed of memories from different agents. This framework introduces an innovative real-time memory storage and retrieval mechanism, aimed at enhancing the shared memory pool by receiving PA pairs from different agents. During the storage phase, each PA pair undergoes rigorous evaluation by a dedicated LLM evaluator to determine its suitability for inclusion in the memory pool. The newly added memories then serve as references for the agents’ subsequent performance. The retrieval phase is coordinated by an autonomous learning retriever, calibrated to ensure the inclusion of particularly relevant memories in prompts, thereby enhancing the agents’ understanding of the query’s essence. Figure.1 illustrates the MS framework. Similar to human self-learning mechanisms, incorporating self-generated memories into prompts significantly improves the agents’ comprehension of query meanings. Moreover, continuously adding new memories to the pool not only enriches it but also refines the retriever’s performance in selecting relevant memories. Our empirical evidence demonstrates that this approach greatly assists LLM-based agents in generating outputs that better align with user expectations.

We evaluate the MS framework in three distinct domains, each involving three agents, and our findings suggest that incremental additions to the memory pool lead to improvements in the precision and relevance of outputs. This research delineates the MS framework’s capacity to mitigate the inherent constraints associated with ICL, thereby underscoring its potential applicability and effectiveness.

Overall, our main contributions can be summarized as follows:

1. Constructing PA pairs from the answers generated by multiple agents and storing them as "memories" in a shared memory pool. The diversity of group memories from different agents, along with the real-time dynamic growth of memories, significantly aids in improving the subsequent behavior of the agents. Additionally, the shared memories in the memory pool are used to enhance the performance of the retriever.

2. Addressing the problem of memory scarcity (data scarcity) by proposing an interactive learning method that allows different agents to rapidly grow their memories through interactive prompt and answer, thereby achieving collective enhancement quickly.

3. Conducting extensive experiments on various types of open-ended tasks to verify the effectiveness of the proposed MS framework. The experimental results show that MS not only helps the agents produce answers closer to those expected, but also continuously feeds high-quality memories into the memory pool, establishing a reliable database for the agents.

In the following, Section 2 reviews related work. Section 3 describes the MS framework in detail, including its conceptual underpinnings and operational methods. Section 4 provides empirical validation of the framework’s enhanced capability to address open-ended questions. The conclusion in Section 5 summarizes the findings and explores prospective avenues for future development of the MS framework, which may further improve LLM-based agents.

2 Related Work

2.1 The Memory Mechanism in Agents

Equipping agents with memory mechanisms to enhance their abilities has attracted the attention of researchers. Memory can play an important role in helping agents remember conversation information, maintain behavioral consistency, and accumulate experience. Generative agents enhanced with memory features can store vast experience records, facilitating deeper self-understanding (Park et al., 2023), while VOYAGER has developed a skill library that evolves by incorporating successful action programs, optimizing task resolution (Wang et al., 2023). In the case of Ghost in the Minecraft, a text-based memory system supports LLMs in maintaining reference plans for efficient plan formulation when similar objectives arise (Zhu et al., 2023). Later, based on the concept of “memorization-retrieval-response”, Memochat was proposed for maintaining consistent long-range open-domain conversation (Lu et al., 2023). With the emergence of MemGPT (Packer et al., 2023), a new memory hierarchy was developed to process long texts and maintain long-term memory. TiM enables LLMs to maintain an evolving memory that stores historical thoughts along the conversation stream (Liu et al., 2023a). In addition, by maintaining agents’ own reflective text in an episodic memory buffer and by implementing an exemplar memory, Reflexion (Shinn et al., 2023) and SYNAPSE (Zheng et al., 2023) respectively induce better decision-making and generalize successful trajectories to new tasks. In these previously mentioned agents, the memory mainly ensures conversational consistency and stores past experiences to achieve individual enhancement. In contrast, our MS framework is primarily designed to enable multiple agents to share memories, facilitating collective self-enhancement through inter-agent interactions and providing a pathway for the evolution from individual intelligence to collective intelligence.

2.2 In-Context Learning

ICL enhances the problem-solving capabilities of LLMs by incorporating few-shot examples into prompts (Brown et al., 2020; Levine et al., 2021; Zhou et al., 2022; Liu et al., 2023b; White et al., 2023). Research has demonstrated that ICL can foster creative learning in LLMs to a certain extent (Swanson et al., 2021). By redesigning inputs, LLMs become more adept at handling logical challenges (Wiegreffe et al., 2022; Wu et al., 2022). Crowdsourced instructions also contribute to improved performance in LLMs (Mishra et al., 2022). Additionally, elucidating the relationship between examples and tasks has been shown to be highly beneficial for LLMs (Lampinen et al., 2022), while CoT (Wei et al., 2022) and PAL (Gao et al., 2023) enhance LLMs’ performance in complex reasoning tasks by introducing intermediate reasoning steps. However, when dealing with open-ended questions, agents still face two primary challenges: insufficient problem descriptions, which impair the agents’ comprehension, and the lack of external knowledge bases and available reference materials. Our MS framework addresses these challenges by converting high-quality content generated by various LLM-based agents into shared memories, providing agents with useful reference examples and thereby improving their performance on open-ended questions.

2.3 Retrieval Augmented Generation

RAG (Lewis et al., 2020; Ram et al., 2023; Shi et al., 2023) is a method that enhances LLMs’ ability to generate accurate and timely content by integrating retrieval techniques such as BM25 (Luo et al., 2023; Liu et al., 2022) or SBERT (Reimers and Gurevych, 2019). Using dense retrievers combined with contrastive learning for feedback can effectively enhance performance (Rubin et al., 2022). Furthermore, iteratively training the retriever with contrastive learning can further improve its performance (Wang et al., 2024). However, the retrievers in the aforementioned studies are typically trained only once before deployment, making it challenging to adapt to newly generated data. In contrast, the retriever in our MS framework undergoes continuous training: whenever new memories are added to the memory pool, they are used for further training of the retriever. This continuous updating ensures that the quality of the retrieved memories gradually improves over time. Most importantly, the MS framework greatly reduces the dependence on external databases.

3 The Memory Sharing framework

In this section, we provide an in-depth description of our Memory Sharing (MS) framework. MS is a framework designed to enhance the performance of multiple LLM-based agents through shared memories, while preserving the original creativity and versatility of the agents. Fig.2 shows how an agent works within the MS framework. Different agents interact through a Prompt-Answer format, where the results of these interactions are evaluated. High-quality interactions are transformed into memories and stored in a shared memory pool accessible by all agents. Additionally, new memories are used to train and improve the memory retriever. During subsequent interactions, these stored memories are retrieved to enhance the performance of agents. The underlying idea is intuitive: it represents a crucial step from individual intelligence towards a more powerful collective intelligence. As with a group of individuals engaging in prompt-and-answer interactions, the valuable content is recorded and shared, serving as a reference for future questions. Consequently, after several rounds of interactions, the quality of answers to related questions typically improves. This collective memory-sharing mechanism, enriched by the diversity of memories contributed by different agents, provides more effective assistance in addressing open-ended questions. The main principles and technologies of the MS framework are detailed in the three subsections below.

3.1 Memory Generation

A memory is essentially a Prompt-Answer (PA) pair. In some special cases, typically in initial scenarios, a PA pair is permitted to lack a prompt. These PA pairs are stored in natural language and serve as the shared memories. On the one hand, these shared memories can be used to improve the response quality of different agents; on the other hand, they are generated by various agents and stored in the shared memory pool. The dynamic expansion of the shared memory pool ensures a continuous influx of new memories, thereby enriching the knowledge base of all agents. In addressing open-ended questions, these shared memories provide agents with a broader perspective and deeper understanding, which is crucial for generating high-quality answers. Fig.3 shows how Agent-Puzzle uses a memory to build a prompt and obtain an answer through one-shot learning.

After each interaction, the PA pair is scored. If the score of the PA pair exceeds a preset threshold, the answer and its corresponding prompt are packaged as a useful memory and stored in the memory pool. When scoring the answers, we established different scoring criteria for various domains and topics, delegating the scoring tasks to the LLM itself. To facilitate the LLM’s understanding of these criteria, they were autonomously generated by the LLM, based on the reasonable assumption that the LLM can better grasp criteria it designed. These scoring criteria are established prior to the deployment of the framework to ensure consistency in the LLM’s scoring process, thereby guaranteeing fair evaluation of memories from different agents. Before implementation, the criteria undergo a manual review phase, which assesses the relevance of potential memories to the current focal task and their relevance within the domain to ensure their utility. Manual review, rather than delegation to the LLM, is employed to provide additional precision and special consideration to align with the agents’ specific needs, particularly in potential or special application scenarios. The manual review of these scoring criteria requires minimal workload.
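The storage step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Memory` and `MemoryPool` classes, the `toy_scorer` function, and the threshold value of 0.7 are hypothetical names and values chosen for illustration, and the scorer is a placeholder for the LLM evaluator that applies the domain-specific criteria.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Memory:
    prompt: str  # may be empty for seed memories that lack a prompt
    answer: str


@dataclass
class MemoryPool:
    # `scorer` stands in for the LLM evaluator; `threshold` is a hypothetical preset value.
    scorer: Callable[[str, str], float]
    threshold: float = 0.7
    memories: List[Memory] = field(default_factory=list)

    def consider(self, prompt: str, answer: str) -> bool:
        """Score a PA pair and store it only if the score exceeds the threshold."""
        if self.scorer(prompt, answer) > self.threshold:
            self.memories.append(Memory(prompt, answer))
            return True
        return False


def toy_scorer(prompt: str, answer: str) -> float:
    # Placeholder heuristic (answer length); a real system would query an LLM here.
    return min(len(answer.split()) / 10.0, 1.0)


pool = MemoryPool(scorer=toy_scorer, threshold=0.7)
stored_long = pool.consider(
    "Write a limerick about cats.",
    "There once was a cat from Peru who dreamed of a mouse in a shoe",
)
stored_short = pool.consider("Write a limerick about dogs.", "A dog.")
```

Only the first pair passes the toy threshold, so the pool ends up holding a single memory; in the framework itself the decision would rest on the LLM-generated scoring criteria rather than on answer length.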

3.2 Memory Retrieval and Training

Prior to the operational deployment of MS, a small subset of instances is manually archived within the memory pool. These instances fulfill a dual purpose: first, they provide a diversified array of memories with which each agent may experiment with novel prompts when facing new queries; second, they constitute the preliminary training corpus for our retriever. This foundational training regimen mirrors the way subsequently archived memories are assimilated into our model in real time, thereby facilitating the model’s ongoing adaptive learning and optimization. During the answering phase, an agent uses a dense retriever to fetch from the shared memory pool the memories that are most similar to the target question in terms of cosine similarity. These retrieved memories, combined with the question, form a prompt that is submitted to the agent, which then generates an answer. The memories extracted from the shared memory pool are used as context to enhance the quality of the agent’s response, a typical ICL method that usually improves answer quality.
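The answering phase can be sketched as follows. This is a simplified illustration under stated assumptions: a bag-of-words vector replaces the learned dense embeddings, and the function names `embed`, `cosine`, `retrieve`, and `build_prompt`, as well as the prompt layout, are hypothetical. Only the ranking by cosine similarity and the assembly of retrieved memories plus question into a prompt follow the description above.

```python
import math
from collections import Counter
from typing import List, Tuple

PAPair = Tuple[str, str]  # (prompt, answer)


def embed(text: str) -> Counter:
    # Assumption: a bag-of-words vector stands in for a dense retriever embedding.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, pool: List[PAPair], k: int) -> List[PAPair]:
    """Return the k memories whose prompts are most similar to the question."""
    q = embed(question)
    ranked = sorted(pool, key=lambda m: cosine(q, embed(m[0])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, memories: List[PAPair]) -> str:
    """Place the retrieved memories before the question as few-shot context."""
    parts = [f"Example prompt: {p}\nExample answer: {a}" for p, a in memories]
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


pool = [
    ("write a limerick about a cat", "A limerick about a cat."),
    ("plan a three day trip to rome", "A three day itinerary."),
    ("write a limerick about a dog", "A limerick about a dog."),
]
question = "write a limerick about a bird"
top = retrieve(question, pool, k=2)
prompt = build_prompt(question, top)
```

With `k=2` this corresponds to the two-shot setting: the two limerick memories are selected and the travel memory is left out.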

Memory Training. Whenever a new memory, denoted as $(X,Y)$ , is added to the memory pool, it is also used to train our retriever, which helps the retriever continuously update itself and adapt to new memories. Based on the newly generated memory $(X,Y)$ , the classical BM25 method selects the top- $n$ most pertinent candidate pairs $\{(x_{i},y_{i})\}_{i=1}^{n}$ from the diverse and extensive memory pool; this candidate set is labeled $C$ . Each candidate within $C$ then undergoes an evaluation process that uses the scoring capabilities of LLMs. The scoring mechanism is defined by the following equation:

$$p(x_{i},y_{i})=\text{P}(\neg Y\mid(x_{i},y_{i}),\,X),\quad i\in\{1,\dots,n\}\tag{1}$$

This equation seeks to determine, given an input-output pair $(x_{i},y_{i})$ in $C$ as a condition, the probability that the response generated for the input $X$ of the new memory contradicts the output $Y$ of the new memory. This grading step prepares for the subsequent labeling of each candidate example. Notably, using $\neg Y$ as the outcome is intended to ensure that the memories the retriever obtains from other agents have reference value without having to be the most relevant to the current question, so that the current agent can learn from new examples. This approach diverges from a simplistic reliance on $Y$ as the outcome, which tends to restrict the retrieval process to memories previously stored by the current agent.
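The candidate-selection step that precedes this scoring, in which BM25 picks the top- $n$ pairs for a new memory, can be sketched in plain Python. The sketch makes several assumptions that the text does not specify: whitespace tokenization, the prompt $X$ as the query, the commonly used parameter values `k1=1.5` and `b=0.75`, and a non-negative IDF variant. The function name `bm25_top_n` is hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def bm25_top_n(query: str, docs: List[str], n: int,
               k1: float = 1.5, b: float = 0.75) -> List[Tuple[int, float]]:
    """Return (document index, BM25 score) for the n best-matching documents."""
    tokenized = [tokenize(d) for d in docs]
    num_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / num_docs

    # Document frequency: number of documents containing each term.
    doc_freq: Counter = Counter()
    for d in tokenized:
        doc_freq.update(set(d))

    query_terms = set(tokenize(query))
    scored = []
    for idx, d in enumerate(tokenized):
        term_freq = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log((num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avg_len))
        scored.append((idx, score))

    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


memory_prompts = [
    "a riddle about time",
    "a pun about bread",
    "travel plan for rome",
]
candidates = bm25_top_n("riddle about time", memory_prompts, n=2)
```

The returned indices identify the candidate set $C$ that would then be passed to the LLM for scoring with Equation (1).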

Within the defined set $C=\{(x_{i},y_{i})\}_{i=1}^{n}$ , each candidate is now assigned a score. We sort the candidates from the lowest to the highest score and select $v$ memories in total to label. The top $\frac{v}{2}$ candidates (lowest scores) in $C$ are identified as the pairs with reference value to $(X,Y)$ , and their labels are accordingly set to positive. Conversely, the bottom $\frac{v}{2}$ candidates (highest scores) are deemed to have the least reference value to $(X,Y)$ , and their labels are designated as negative. The labeled data are used to minimize the following function:
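The labeling rule just described can be written as a short sketch. The function name `label_candidates` and the toy scores are hypothetical; the only behavior taken from the text is the ascending sort, with the $\frac{v}{2}$ lowest-scored candidates labeled positive and the $\frac{v}{2}$ highest-scored candidates labeled negative.

```python
from typing import List, Tuple


def label_candidates(scored: List[Tuple[str, float]], v: int) -> List[Tuple[str, int]]:
    """Label v candidates: the v/2 lowest scores get 1, the v/2 highest scores get 0."""
    if v < 2 or v % 2 != 0 or v > len(scored):
        raise ValueError("v must be an even number between 2 and the number of candidates")
    ranked = sorted(scored, key=lambda item: item[1])  # ascending by score
    half = v // 2
    positives = [(cand, 1) for cand, _ in ranked[:half]]
    negatives = [(cand, 0) for cand, _ in ranked[-half:]]
    return positives + negatives


# Toy scores standing in for the LLM-assigned values p(x_i, y_i).
scored = [("m1", 0.9), ("m2", 0.1), ("m3", 0.5), ("m4", 0.3)]
labeled = label_candidates(scored, v=2)
```

With `v=2`, the lowest-scored candidate `m2` becomes the positive example and the highest-scored candidate `m1` the negative one.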

$$\text{loss}(x,y)=-\frac{1}{v}\sum_{i=1}^{v}\left[y_{i}\cdot\log\left(\frac{1}{1+e^{-x_{i}}}\right)+(1-y_{i})\cdot\log\left(1-\frac{1}{1+e^{-x_{i}}}\right)\right]\tag{2}$$

Minimizing this binary cross-entropy loss improves the retriever’s predictive accuracy, which is especially critical when handling an imbalanced memory pool. This choice prepares our model to extract meaningful insights from varied memories, advancing our overarching goal of developing a robust and adaptable MS framework.
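As a numerical illustration of Equation (2), the sketch below evaluates the loss on toy values. It assumes that, within the loss, $x_{i}$ denotes the retriever’s raw score (logit) for the $i$-th labeled candidate and $y_{i}\in\{0,1\}$ its label; the function names are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy over v labeled candidates, as in Equation (2)."""
    if len(logits) != len(labels) or not logits:
        raise ValueError("logits and labels must be non-empty and of equal length")
    total = 0.0
    for x, y in zip(logits, labels):
        p = sigmoid(x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(logits)


uninformed = bce_loss([0.0, 0.0], [1, 0])   # retriever cannot separate the two candidates
confident = bce_loss([3.0, -3.0], [1, 0])   # positive scored high, negative scored low
```

An uninformed retriever that outputs a logit of 0 for both candidates yields a loss of $\log 2$, while a retriever that separates the positive from the negative candidate yields a smaller loss. A production implementation would typically compute the loss directly from logits in a numerically stable form.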

3.3 Interactive Learning

As described in Sections 3.1 and 3.2, the memories stored within the memory pool effectively help agents improve their response quality. Moreover, the number of memories in the memory pool expands dynamically, and over time the growing pool provides increasing assistance to the agents. However, a newly created multi-agent system and its corresponding memory pool initially lack any memories. How can the pool assist the agents from the outset? And how can we validate the effectiveness of this mechanism? To address this, we propose a rapid interactive learning method that enables multiple agents to engage in interactive prompt-and-answer exchanges. High-quality PA pairs are stored as memories in the memory pool, facilitating self-learning and self-enhancement within the multi-agent system. Initially, a small set of answers (e.g., 100 records; theoretically even one record can initiate the process) is placed in the memory pool as the initial set. Agents then engage in prompt-and-answer exchanges based on this initial memory set, rapidly expanding the memory pool. Specifically, given a standard answer, we ask the agent to produce a corresponding question based on that answer. We then give this question back to the agents and ask them to answer it. This is also how we construct our dataset and our initial memory pool. We measure the effectiveness of the MS framework by calculating the average quality of the answers generated by the agents at different stages.
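One round of this interactive learning procedure can be sketched as follows. All callables are placeholders for LLM calls, and the names `interactive_round`, `ask_question`, `answer_question`, and `score`, along with the threshold value, are hypothetical; the sketch only mirrors the order of steps described above (answer to question, question to new answer, scoring, storage).

```python
from typing import Callable, List, Tuple

PAPair = Tuple[str, str]  # (prompt, answer)


def interactive_round(
    standard_answers: List[str],
    ask_question: Callable[[str], str],     # agent: standard answer -> matching question
    answer_question: Callable[[str], str],  # agent: question -> new answer
    score: Callable[[str, str], float],     # evaluator: (question, answer) -> score
    threshold: float,
) -> List[PAPair]:
    """Generate PA pairs from standard answers and keep those scoring above the threshold."""
    new_memories: List[PAPair] = []
    for standard in standard_answers:
        question = ask_question(standard)
        answer = answer_question(question)
        if score(question, answer) > threshold:
            new_memories.append((question, answer))
    return new_memories


# Stub agents and evaluator for demonstration only.
seed_answers = ["a poem about the sea", "a plan for a short trip"]
memories = interactive_round(
    seed_answers,
    ask_question=lambda ans: f"Please write {ans}.",
    answer_question=lambda q: f"Response to: {q}",
    score=lambda q, a: 1.0 if "poem" in q else 0.0,
    threshold=0.5,
)
```

Here only the poem-related pair passes the stub evaluator; repeating such rounds with real agents is what grows the memory pool from a small initial set.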

4 Experiments

4.1 Experiment Details

	Zero			One			Two			Three
Agent	gpt-3.5-turbo	gpt-4o	open-mistral-7b	gpt-3.5-turbo	gpt-4o	open-mistral-7b	gpt-3.5-turbo	gpt-4o	open-mistral-7b	gpt-3.5-turbo	gpt-4o	open-mistral-7b
Limerick	0.50	0.50	0.49	0.69	0.56	0.54	0.76	0.56	0.88	0.87	0.59	0.93
Wuyanlvshi	0.66	0.73	0.56	0.72	0.75	0.59	0.71	0.75	0.61	0.72	0.76	0.66
Sonnet	0.48	0.55	0.50	0.53	0.55	0.52	0.53	0.54	0.53	0.53	0.54	0.53
Puzzle	0.53	0.51	0.49	0.51	0.53	0.48	0.56	0.52	0.48	0.60	0.52	0.50
Pun	0.61	0.47	0.37	0.64	0.57	0.35	0.67	0.64	0.36	0.70	0.67	0.39
Riddle	0.86	0.40	0.36	0.64	0.42	0.36	0.70	0.48	0.35	0.88	0.52	0.37
Fitness	0.46	0.42	0.47	0.61	0.57	0.48	0.64	0.52	0.50	0.65	0.52	0.54
Study	0.44	0.41	0.45	0.65	0.56	0.46	0.60	0.53	0.44	0.63	0.51	0.46
Travel	0.45	0.41	0.44	0.55	0.54	0.48	0.71	0.53	0.50	0.71	0.53	0.53

Table 1: Performance of agents using different numbers of retrieved memories (zero to three) when answering open-ended questions. Each domain has its own Domain-pool shared among its three agents.

Pool	Model	Limerick	Wuyanlvshi	Sonnet	Puzzle	Pun	Riddle	Fitness	Study	Travel
Domain-pool	gpt-3.5-turbo	0.87	0.72	0.53	0.60	0.70	0.88	0.65	0.63	0.71
Domain-pool	gpt-4o	0.59	0.76	0.54	0.52	0.67	0.52	0.52	0.51	0.53
Domain-pool	open-mistral-7b	0.93	0.66	0.53	0.50	0.39	0.37	0.54	0.46	0.53
Single-pool	gpt-3.5-turbo	0.60	0.68	0.49	0.54	0.70	0.80	0.62	0.63	0.58
Single-pool	gpt-4o	0.54	0.76	0.54	0.50	0.61	0.51	0.52	0.56	0.52
Single-pool	open-mistral-7b	0.64	0.63	0.52	0.48	0.38	0.35	0.54	0.49	0.50

Table 2: Agent performance with the Domain-pool vs. the Single-pool when using three retrieved memories for open-ended questions.

We aim to assess the efficacy of the MS framework in processing open-ended questions across three domains: Literary Creation, Unconventional Logic Problem-solving, and Plan Generation. Within the Literary Creation domain, we appoint three specialized agents responsible for generating Wuyanlvshi (a form of classical Chinese poetry), Limericks, and Sonnets, respectively. In the Logic Problem-solving domain, dedicated agents are tasked with addressing Puzzles, Riddles, and Puns. For Plan Generation, we develop agents to create Study Plans, Travel Plans, and Fitness Plans. We use a total of nine datasets, one per agent, for evaluation. There are 1000 question-answer pairs in total; details are given in Appendix A.3. For each agent, a consistent, small subset of pre-provided, complete PA pairs was selected and incorporated into the memory pool for the initial phase of retriever training and prompt refinement. Subsequently, an identical number of queries is introduced for each agent to increase the volume of real-time memory within the memory pool. For our scoring LLM, we use gpt-3.5-turbo. As the backbones of our agents, we consider three LLMs: two closed-source LLMs (gpt-3.5-turbo and gpt-4o) and one open-source LLM (open-mistral-7b). We use BERTScore (Zhang et al., 2019) as our metric to measure the performance of each agent.

The evaluation of memory impact begins with the implementation of different retrieval strategies, including zero-shot, one-shot, two-shot, and three-shot learning. Subsequently, quantitative and qualitative analyses are conducted. The qualitative analysis addresses the following question: memories from agents in the same domain can make up for a lack of diversity, so placing the memories of agents from all domains in a single memory pool should increase diversity further, but will such a pool still benefit the agents’ answers? To answer this, our study uses two distinct types of memory pools, Domain-pool and Single-pool. With the Domain-pool, a dedicated memory pool is allocated to each domain and shared by all agents within that domain, with the aim of enhancing the integration of domain-specific memories. Conversely, the Single-pool integrates agents from all domains into a unified memory pool, facilitating the analysis of cross-domain memory utilization. Although not all memories are closely related to a given query, memories that come from unrelated agents in other domains may help an agent understand the query from a different angle.
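The difference between the two pool types amounts to how an agent is mapped to a memory pool, which can be sketched as below. The agent-to-domain assignment follows the experimental setup, while the `pool_key` function, the mode strings, and the key `"all"` are hypothetical names introduced for illustration.

```python
from typing import Dict, List

# Agent-to-domain assignment taken from the experimental setup.
DOMAINS: Dict[str, List[str]] = {
    "Literary Creation": ["Limerick", "Wuyanlvshi", "Sonnet"],
    "Logic Problem-solving": ["Puzzle", "Pun", "Riddle"],
    "Plan Generation": ["Fitness", "Study", "Travel"],
}


def pool_key(agent: str, mode: str) -> str:
    """Return the identifier of the memory pool an agent reads from and writes to."""
    if mode == "single":
        return "all"  # one unified pool shared by every agent
    if mode == "domain":
        for domain, agents in DOMAINS.items():
            if agent in agents:
                return domain  # one pool per domain
        raise KeyError(f"unknown agent: {agent}")
    raise ValueError(f"unknown mode: {mode}")


same_domain = pool_key("Pun", "domain") == pool_key("Riddle", "domain")
cross_domain = pool_key("Pun", "domain") == pool_key("Travel", "domain")
single_shared = pool_key("Pun", "single") == pool_key("Travel", "single")
```

Under the Domain-pool setting, Agent-Pun and Agent-Riddle share a pool while Agent-Travel does not; under the Single-pool setting all nine agents share one pool.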

For the quantitative analysis, we measure the performance of each agent each time the same proportion of new memories is added to the memory pool. There are five phases in total: 20%, 40%, 60%, 80%, and 100%. At each phase, agent performance is evaluated to identify improvements or regressions. This two-pronged approach enables a thorough exploration of the nuanced impacts and applicability of authentic memories across diverse domains.
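The phased evaluation can be sketched as a loop that adds memories in equal increments and evaluates after each increment. The `run_phases` function and the `evaluate` callable are hypothetical; in the actual experiments the evaluation step would be the BERTScore-based measurement of agent answers, which is replaced here by a stub.

```python
from typing import Callable, Dict, List, Sequence

PHASES: Sequence[int] = (20, 40, 60, 80, 100)  # cumulative percentage of new memories added


def run_phases(new_memories: List[str],
               evaluate: Callable[[List[str]], float]) -> Dict[int, float]:
    """Grow the pool in five equal steps and record an evaluation result after each step."""
    pool: List[str] = []
    results: Dict[int, float] = {}
    start = 0
    for pct in PHASES:
        end = len(new_memories) * pct // 100  # integer arithmetic avoids float rounding
        pool.extend(new_memories[start:end])
        results[pct] = evaluate(pool)
        start = end
    return results


# Stub evaluation: report the pool size instead of an answer-quality metric.
memories = [f"memory-{i}" for i in range(10)]
results = run_phases(memories, evaluate=lambda pool: float(len(pool)))
```

With ten new memories, the pool holds 2, 4, 6, 8, and 10 memories at the five checkpoints.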

4.2 Experiment Analysis

Before the experiment, none of the agents has a suitable database for reference. After the interactive learning stage, however, a continuously expanding memory pool of high-quality memories successfully serves as a database for the agents to refer to. The MS framework helps the agents reduce their dependence on external databases, and multiple agents can interactively expand the memory pool. The principal outcomes of our experiments are presented in Table.1, which shows the performance of each agent under various learning strategies within the MS framework. Compared to zero-shot learning, the other three learning strategies help agents achieve better performance in most cases, which indicates that memories from other agents can help the current agent produce the desired answers rather than interfering with its learning ability. Given that the performance of nearly all agents improves after utilizing the shared memories, the results support our hypothesis that the MS framework can enhance collective intelligence through multi-agent interactions, thereby advancing from individual to collective intelligence. Most agents achieve their best performance when they retrieve and utilize three memories from the memory pool.

It is worth noting that, for all agents in the Literary Creation and Plan Generation domains, the performance under three-shot learning with the open-source LLM matches or surpasses the zero-shot performance with the closed-source gpt-3.5-turbo (Table.1), which demonstrates the potential of the shared memories. Besides, under the same learning strategy, the superior performance of the closed-source LLMs over the open-source LLM can be attributed to the former’s enhanced understanding and reasoning capabilities. In the Literary Creation domain specifically, the improvement in the three agents’ performance is not as significant as for other agents, which may result from differences in the language used when storing memories. This deserves further study in future work.

Table.2 compares the scenarios in which all agents use the Domain-pool or the Single-pool under the three-shot learning strategy, since the prior experiments showed that most agents achieve their best performance under three-shot learning. With the exception of Agent-Study, all agents exhibited equal or diminished performance with the Single-pool. Although the Single-pool can further enrich the diversity of shared memories, homologous memories better help the agents produce reliable answers, regardless of the agent’s backbone.

One follow-up question is whether, even when memory types are homogeneous, an excessive accumulation of memories impedes the agent’s output quality. Figure.4 shows the variations in performance of individual agents as different ratios of newly generated memories are integrated into the pool. As more high-quality memories are added to the memory pool, the performance of most agents improves steadily, especially for Agent-Limerick. For several agents, performance shows no further change as the shared memories increase in the later stages. We hypothesize that this is because the newly added shared memories are no more suitable than the previous ones. If the memory pool can be expanded further, this stagnation may be broken.

5 Conclusions

We introduce a novel MS framework that enables real-time memory sharing among multiple agents through memory storage and retrieval. The findings indicate that the continuously growing shared memory enhances the ability of LLM-based agents to understand the nuances of problems, leading to higher-quality responses to open-ended queries. Furthermore, these shared memories are utilized for iterative training and improving the retriever, ensuring that the retriever consistently identifies and selects the most relevant memories as the memory pool dynamically expands. Regarding future research directions, there is substantial work to be done in the areas of memory sharing and interactive learning among multiple agents. Introducing more agents based on different LLMs (e.g., GPT-4, LLaMA-3, Claude-2) to comprehensively explore and evaluate the benefits of memory sharing and interactive learning for agents is a topic worthy of long-term investigation, as it will guide us from individual intelligence to collective intelligence. Additionally, in some open-ended problem domains, exploring the use of self-generated memories by these agents for the retraining and fine-tuning of LLMs is also an interesting area to study.

Limitations

In our work, the shared memory collected and used by agents is generated from a single interaction, which means that the question part is the query and the answer part is the output. However, users sometimes first ask seemingly unrelated questions, which may not be very helpful on their own but can serve as preliminary preparation for answering the questions that follow. How to integrate those "unrelated" questions and answers with the final query to form a more informative memory could be a new direction for developing the MS framework.

References

Ahmed and Devanbu (2022) Toufique Ahmed and Premkumar Devanbu. 2022. Few-shot training llms for project-specific code-summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–5.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Pal: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR.
He-Yueya et al. (2023) Joy He-Yueya, Gabriel Poesia, Rose E Wang, and Noah D Goodman. 2023. Solving math word problems by combining language models with symbolic solvers. arXiv preprint arXiv:2304.09102.
Izacard et al. (2023) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research, 24(251):1–43.
Lampinen et al. (2022) Andrew Lampinen, Ishita Dasgupta, Stephanie Chan, Kory Mathewson, Mh Tessler, Antonia Creswell, James McClelland, Jane Wang, and Felix Hill. 2022. Can language models learn from explanations in context? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 537–563, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Levine et al. (2021) Yoav Levine, Noam Wies, Daniel Jannai, Dan Navon, Yedid Hoshen, and Amnon Shashua. 2021. The inductive bias of in-context learning: Rethinking pretraining example design. arXiv preprint arXiv:2110.04541.
Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
Liu et al. (2022) Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. 2022. What makes good in-context examples for gpt-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114.
Liu et al. (2023a) Lei Liu, Xiaoyan Yang, Yue Shen, Binbin Hu, Zhiqiang Zhang, Jinjie Gu, and Guannan Zhang. 2023a. Think-in-memory: Recalling and post-thinking enable llms with long-term memory. arXiv preprint arXiv:2311.08719.
Liu et al. (2023b) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023b. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.
Liu et al. (2023c) Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2023c. Agentbench: Evaluating llms as agents. In The Twelfth International Conference on Learning Representations.
Lu et al. (2023) Junru Lu, Siyu An, Mingbao Lin, Gabriele Pergola, Yulan He, Di Yin, Xing Sun, and Yunsheng Wu. 2023. Memochat: Tuning llms to use memos for consistent long-range open-domain conversation. arXiv preprint arXiv:2308.08239.
Luo et al. (2023) Man Luo, Xin Xu, Zhuyun Dai, Panupong Pasupat, Mehran Kazemi, Chitta Baral, Vaiva Imbrasaite, and Vincent Y Zhao. 2023. Dr. icl: Demonstration-retrieved in-context learning. arXiv preprint arXiv:2305.14128.
Mao et al. (2021) Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. 2021. Generation-augmented retrieval for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4089–4100, Online. Association for Computational Linguistics.
Mishra et al. (2022) Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland. Association for Computational Linguistics.
Packer et al. (2023) Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. 2023. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560.
Park et al. (2023) Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22.
Ram et al. (2023) Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331.
Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
Rubin et al. (2022) Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, Seattle, United States. Association for Computational Linguistics.
Shi et al. (2023) Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652.
Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems.
Swanson et al. (2021) Ben Swanson, Kory Mathewson, Ben Pietrzak, Sherol Chen, and Monica Dinalescu. 2021. Story centaur: Large language model few shot learning as a creative writing tool. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 244–256, Online. Association for Computational Linguistics.
Wang et al. (2023) Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models. In NeurIPS 2023 Foundation Models for Decision Making Workshop.
Wang et al. (2024) Liang Wang, Nan Yang, and Furu Wei. 2024. Learning to retrieve in-context examples for large language models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1752–1767, St. Julian’s, Malta. Association for Computational Linguistics.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
White et al. (2023) Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. 2023. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382.
Wiegreffe et al. (2022) Sarah Wiegreffe, Jack Hessel, Swabha Swayamdipta, Mark Riedl, and Yejin Choi. 2022. Reframing human-ai collaboration for generating free-text explanations. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 632–658.
Wu et al. (2022) Tongshuang Wu, Michael Terry, and Carrie Jun Cai. 2022. Ai chains: Transparent and controllable human-ai interaction by chaining large language model prompts. In Proceedings of the 2022 CHI conference on human factors in computing systems, pages 1–22.
Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
Zheng et al. (2023) Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. 2023. Synapse: Trajectory-as-exemplar prompting with memory for computer control. In The Twelfth International Conference on Learning Representations.
Zhou et al. (2022) Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations.
Zhu et al. (2023) Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, et al. 2023. Ghost in the minecraft: Generally capable agents for open-world enviroments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144.

Appendix A Appendix

A.1 Rubrics and Prompt for scoring Memory

To judge whether a memory can be added to the memory pool, we define one scoring rubric for each of the three domains. For the Single Pool, we define a single rubric from a global perspective.

A.1.1 Rubrics for domain - Literary Creation

General Evaluation Criteria (Total: 100)
Literary Quality (0-5 points): Assesses creativity, use of language, and emotional impact. High-quality examples should demonstrate mastery of language and evoke a strong reader response.
Authenticity (0-10 points): Evaluates adherence to the form’s traditional standards, including structure, rhythm, and themes. High scores indicate that the poem respects genre conventions creatively.
Clarity and Cohesion (0-10 points): Considers the poem’s clarity of expression and the cohesion of its parts. A high score indicates that the poem communicates effectively and its elements are well integrated.
Innovativeness (0-5 points): Rewards originality in theme, structure, or language use. High scores reflect a notable degree of creativity and the introduction of novel ideas or techniques.
Educational Value (0-10 points): Assesses the example’s potential to teach about poetic forms, literary devices, and thematic exploration. High-scoring examples are rich in analyzable and teachable elements.
Metric Precision (0-10 points): Evaluates the adherence to the five-syllable structure per line, including rhythm and flow, emphasizing the importance of metric accuracy.
Imagery and Symbolism (0-10 points): Assesses the effectiveness of imagery and symbolism in conveying the poem’s themes, highlighting the depth and sophistication of language use.
Humor and Wit (0-10 points): Rates the poem’s humor, wit, and wordplay. High scores reflect effective use of language to entertain and amuse.
Rhyme Scheme Adherence (0-10 points): Assesses the AABBA rhyme scheme’s quality and creativity, including how well the rhymes enhance the humor and effectiveness of the poem.
Structural Integrity (0-10 points): Evaluates adherence to sonnet structure, including rhyme scheme and division into octaves/sestets or quatrains/couplet, stressing formal precision.
Thematic Development (0-10 points): Looks at theme or argument development, especially through the volta, reflecting the poem’s ability to engage with complex ideas persuasively.

A.1.2 Rubrics for domain - Unconventional Logic Problem-solving

Clarity and Understandability (20 points)
Question Clarity (10 points): The question should be clearly stated, without ambiguity, and understandable without requiring additional context.
Answer Clarity (10 points): The answer should be directly related to the question, clear, and easily understandable.
Creativity and Originality (30 points)
Question Creativity (15 points): The question should demonstrate creativity, originality, and should not be a common or easily found problem.
Answer Creativity (15 points): The answer should be innovative and not just a straightforward or commonly known response. It should also add a layer of depth or a surprising twist to the question.
Logical Consistency and Correctness (20 points)
Logical Consistency (10 points): The question and answer together should form a logically consistent pair where the answer correctly follows from the question.
Correctness (10 points): The answer must be factually correct and provide a true solution or conclusion to the puzzle, riddle, or pun presented in the question.
Relevance and Engagement (20 points)
Relevance (10 points): The question and answer should be relevant to the domain of Logic Problems, demonstrating an understanding of puzzles, riddles, or puns.
Engagement (10 points): The pair should be engaging and interesting, capable of capturing attention and sparking curiosity or amusement.
Difficulty Level (10 points)
The difficulty level of the question should be appropriate for the intended audience. It should neither be too easy to solve without any thought nor too difficult to be practically unsolvable. This criterion requires a balanced approach to ensure the content is intellectually stimulating but accessible.

A.1.3 Rubrics for domain - Plan Generation

Specificity and Detail (20 points)
Question Specificity (10 points): The question should be specific, providing enough detail to guide the generation of a relevant and tailored plan.
Plan Detail (10 points): The plan should include specific activities, steps, or recommendations that are clearly defined and actionable.
Feasibility and Practicality (20 points)
Plan Feasibility (20 points): The plan should be realistic and practical, considering available resources (time, money, equipment) and constraints. It should propose actions that can be realistically implemented by the user.
Comprehensiveness and Scope (20 points)
Coverage of Key Components (20 points): The plan should comprehensively address all relevant aspects of the goal. For a study plan, this might include study sessions, breaks, and topics covered; for a fitness plan, workouts, rest days, and nutrition; and for a travel plan, transportation, accommodations, and activities.
Personalization and Relevance (20 points)
Alignment with User Needs and Preferences (20 points): The plan should reflect an understanding of the user’s specific needs, preferences, goals, and limitations. It should feel customized and directly applicable to the user, rather than being a generic template.
Plan Clarity (10 points): The plan should be articulated in a clear, organized, and easy-to-follow manner. It should avoid jargon or overly complex language, making it accessible to the user.
Rationale Clarity (10 points): The plan should include clear reasoning or justification for the recommendations made, helping the user understand why specific actions or steps are suggested.

A.1.4 Rubrics for Single Pool

Accuracy (25 Points)
25 points: The output is entirely accurate, with no factual errors or inaccuracies.
15-24 points: The output is mostly accurate, with minor errors that do not significantly impact the overall understanding.
5-14 points: The output contains several inaccuracies that could lead to misunderstandings.
0-4 points: The output is largely inaccurate, misleading, or irrelevant.
Relevance (20 Points)
20 points: The output is highly relevant to the input question, directly addressing the query without diverging from the topic.
10-19 points: The output is relevant but includes some unnecessary or slightly off-topic information.
1-9 points: The output partially addresses the question but is significantly off-topic or tangential.
0 points: The output is completely irrelevant to the input question.
Completeness (20 Points)
20 points: The output provides a complete answer to the question, covering all essential aspects implied or directly asked.
10-19 points: The output covers most of the necessary information but lacks one or two minor details or aspects.
1-9 points: The output provides a partial answer, missing significant portions of the information needed to fully answer the question.
0 points: The output fails to provide any meaningful answer to the question.
Clarity and Coherence (20 Points)
20 points: The output is exceptionally clear and well-structured, making it easy to follow and understand.
10-19 points: The output is clear but may have minor issues with structure or coherence that slightly hinder understanding.
1-9 points: The output has significant clarity or coherence issues, making it difficult to understand without effort.
0 points: The output is incoherent or so poorly structured that it is unintelligible.
Creativity and Insight (15 Points)
15 points: The output demonstrates high levels of creativity or provides insights that add substantial value beyond the explicit question.
8-14 points: The output shows some creativity or insights but to a lesser extent, offering added value to the answer.
1-7 points: The output is standard, with minimal to no creativity or insightful additions.
0 points: The output is entirely generic, with no attempt at creativity or providing additional insights.
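
The banded rubric above can be written down as data, which makes it easy to check that the criteria add up to 100 and to see which band a given point value falls into. The following is only a minimal sketch: the names `SINGLE_POOL_RUBRIC`, `max_points`, `band_index`, and `total_score` are hypothetical and are not part of the MS framework, where the LLM itself assigns the score.

```python
# Hypothetical encoding of the Single Pool rubric (illustrative names only).
# Each band is an inclusive (low, high) point range, listed from best to worst.
SINGLE_POOL_RUBRIC = {
    "Accuracy": [(25, 25), (15, 24), (5, 14), (0, 4)],
    "Relevance": [(20, 20), (10, 19), (1, 9), (0, 0)],
    "Completeness": [(20, 20), (10, 19), (1, 9), (0, 0)],
    "Clarity and Coherence": [(20, 20), (10, 19), (1, 9), (0, 0)],
    "Creativity and Insight": [(15, 15), (8, 14), (1, 7), (0, 0)],
}


def max_points(criterion: str) -> int:
    """Return the highest point value available for a criterion."""
    return max(high for _, high in SINGLE_POOL_RUBRIC[criterion])


def band_index(criterion: str, points: int) -> int:
    """Return the band (0 = best) that contains the given point value."""
    for index, (low, high) in enumerate(SINGLE_POOL_RUBRIC[criterion]):
        if low <= points <= high:
            return index
    raise ValueError(f"{points} is outside the range for {criterion!r}")


def total_score(points_by_criterion: dict[str, int]) -> int:
    """Validate one point value per criterion and return their sum."""
    if set(points_by_criterion) != set(SINGLE_POOL_RUBRIC):
        raise ValueError("expected exactly one value per rubric criterion")
    for criterion, points in points_by_criterion.items():
        band_index(criterion, points)  # raises if the value is out of range
    return sum(points_by_criterion.values())
```

With this encoding, the maximum points over all five criteria sum to 100, matching the 0-100 score requested in the scoring prompt.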

A.1.5 Prompt for scoring Memory

For scoring a memory, the following instructions were provided to the LLM:

Here is the rubrics for grading an example of In-context learning-"grading_rubric". According to the rubric, for the Question-"question" and Answer-"answer", give me a score of pair (question, answer) if I want to use it as a prompt in In-context learning later. Your answer should be a number between 0-100.
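
As an illustration of how this prompt could be used, the sketch below fills the three placeholders and parses a numeric score from the model's reply. It is a sketch under stated assumptions, not the authors' implementation: the function names, the reply-parsing rule, and the `threshold` parameter are hypothetical, and the call to the LLM itself is deliberately left out.

```python
import re
from typing import Optional

# Prompt wording copied from the paper; treating "grading_rubric", "question",
# and "answer" as format placeholders is an assumption.
SCORING_TEMPLATE = (
    'Here is the rubrics for grading an example of In-context learning-"{grading_rubric}". '
    'According to the rubric, for the Question-"{question}" and Answer-"{answer}", '
    "give me a score of pair (question, answer) if I want to use it as a prompt "
    "in In-context learning later. Your answer should be a number between 0-100."
)


def build_scoring_prompt(grading_rubric: str, question: str, answer: str) -> str:
    """Fill the scoring template for one candidate (question, answer) memory."""
    return SCORING_TEMPLATE.format(
        grading_rubric=grading_rubric, question=question, answer=answer
    )


def parse_score(reply: str) -> Optional[float]:
    """Return the first number in the reply if it lies in [0, 100], else None."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    score = float(match.group())
    return score if 0.0 <= score <= 100.0 else None


def is_accepted(reply: str, threshold: float) -> bool:
    """Hypothetical filter: keep a memory whose parsed score reaches the threshold."""
    score = parse_score(reply)
    return score is not None and score >= threshold
```

Returning `None` for unparseable or out-of-range replies keeps malformed LLM outputs from silently entering the memory pool; the caller decides whether to retry or discard.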

A.2 Memories

This section gives a detailed example of how an agent answers a question with the help of shared memories, and of what the newly generated prompt and memory look like. Here is the shared memory from the memory pool of the domain Literary Creation:

Elucidate on a mirror that reflects the soul that revealing truths, making one whole?->There once was a mirror that reflects the soul; With a tale both old and new; revealing truths, making one whole; A legend through and through; In a world where magic grew. | | Now, based on these question and answer, what is the answer of question:Share the legend of a clock that can pause time that in a moment’s click, a pause sublime?->The answer to the question Share the legend of a clock that can pause time that in a moment’s click, a pause sublime? would be the tale of a mystical clock rumored to have the power to pause time with just a simple click. In this legend, the clock is said to possess the ability to grant individuals a moment of pause in their lives, allowing them to reflect, reset, and find serenity in the chaos of the world. Just like the mirror that reflects the soul, this clock is believed to reveal truths and make one feel whole again, offering a moment of clarity and peace in a busy and fast-paced world.

Figure.5 shows a real example of how Agent-Sonnet uses this shared memory to construct a new prompt with the original query to make a sonnet.
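
The layout of the example above suggests how a prompt may be assembled from retrieved memories and a new query. The sketch below reproduces that layout; the `->` and `| |` separators and the instruction sentence are inferred from this single example, and the function names are hypothetical, so it should be read as one possible reading rather than the framework's exact format.

```python
from typing import Iterable, Tuple

# Separators and instruction sentence inferred from the example in this
# appendix; treating them as a fixed format is an assumption.
PAIR_SEPARATOR = "->"
MEMORY_SEPARATOR = " | | "
INSTRUCTION = "Now, based on these question and answer, what is the answer of question:"


def format_memory(question: str, answer: str) -> str:
    """Render one stored memory as 'question->answer'."""
    return f"{question}{PAIR_SEPARATOR}{answer}"


def build_prompt(memories: Iterable[Tuple[str, str]], query: str) -> str:
    """Concatenate the retrieved memories, then append the new query."""
    shots = MEMORY_SEPARATOR.join(format_memory(q, a) for q, a in memories)
    return f"{shots}{MEMORY_SEPARATOR}{INSTRUCTION}{query}"


def to_memory(query: str, output: str) -> Tuple[str, str]:
    """Pair the original query with the agent's output as a candidate memory."""
    return (query, output)
```

The pair returned by `to_memory` is only a candidate: it would still have to pass the scoring step in A.1.5 before being added to the pool.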

A.3 Datasets

Each of the nine agents has its own dataset for evaluation. There are 1000 items in total across all datasets, and each item contains a question and an answer. For each dataset, 20% of the items are randomly selected and manually archived in the memory pool before the experiments. For another 40%, we keep only the question part and use it to generate real-time memories in the agents. The remaining 40% is used as the test set.
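
The 20% / 40% / 40% partition described above can be sketched as follows. This is a minimal illustration, assuming each item is a `(question, answer)` tuple; the function name, the returned keys, and the fixed random seed are hypothetical choices and not details given in the paper.

```python
import random
from typing import Dict, List, Tuple

Item = Tuple[str, str]  # (question, answer)


def split_dataset(items: List[Item], rng_seed: int = 0) -> Dict[str, list]:
    """Split one agent's dataset into 20% initial memories, 40% questions
    for real-time memory generation, and 40% test items."""
    shuffled = list(items)
    random.Random(rng_seed).shuffle(shuffled)  # seeded for reproducibility

    n_initial = len(shuffled) * 20 // 100
    n_generation = len(shuffled) * 40 // 100

    initial = shuffled[:n_initial]
    generation = shuffled[n_initial:n_initial + n_generation]
    test = shuffled[n_initial + n_generation:]

    return {
        "initial_memories": initial,                       # full (question, answer) pairs
        "generation_questions": [q for q, _ in generation],  # question part only
        "test_set": test,                                  # held out for evaluation
    }
```

Only the question part is kept for the middle split because the answers there are produced by the agents themselves while the memory pool grows.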

For the agents in the domain Literary Creation, the question part of each item is a description and the answer part is a poem that fulfills the requirements in the description. A question and answer in the dataset look like:

Tell me about a star that twinkles with a secret that shining brightly, mysteries to decrypt? - There once was a star that twinkles with a secret; With a tale both old and new; shining brightly, mysteries to decrypt; A legend through and through; In a world where magic grew.

For the agents in the domain Unconventional Logic Problem-solving, the question part of each item is a logic problem and the answer part is a reasonable solution to that problem. A question and answer in the dataset look like:

Why is it better to have round manhole covers than square ones? - A square manhole cover can be turned and dropped down the diagonal of the manhole. A round manhole cover cannot be dropped down the manhole. So for safety reasons, all manhole covers should be round

For the agents in the domain Plan Generation, the question part of each item is a problem about how to achieve something and the answer part is an executable plan. A question and answer in the dataset look like:

Start learning Python for data analysis? - Learn basics, then focus on data libraries like pandas, and practice on projects.

Specifically, the Puzzle dataset comes from Liu et al. (2023c). For the other agents’ datasets, the standard "answer" was collected from the internet and then given to the LLM (gpt-3.5-turbo) to obtain the corresponding "question" part. In particular, the Wuyanlvshi, renowned and historically significant poems within Chinese literature, were chosen for their fame and widespread recognition. Furthermore, the sonnets used in our study originate from the "quarto" collection authored by Shakespeare in 1609. We use a prompt of the following form to generate the "question" part:

Here is a sonnet of shakespeare -> "From fairest creatures we desire increase, That thereby beauty’s rose might never die, But as the riper should by time decrease, His tender heir mught bear his memeory: But thou, contracted to thine own bright eyes, Feed’st thy light’st flame with self-substantial fuel, Making a famine where abundance lies, Thyself thy foe, to thy sweet self too cruel. Thou that art now the world’s fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content And, tender churl, makest waste in niggarding. Pity the world, or else this glutton be, To eat the world’s due, by the grave and thee." According to this, can you give me a problem of it(like make a sonnot followed with some description), so the possible answer can be this sonnet .

The "question" output by the LLM looks like:

Craft a sonnet that explores the tension between personal vanity and the broader imperative to contribute to the continuation of beauty and life. Your poem should address how an individual’s focus on their own allure can lead to a metaphorical barrenness, despite the inherent capacity for generational legacy. Conclude with a reflection on the choice between selfishness and altruism, using nature as a metaphor for this existential dilemma.