Towards Collaborative Intelligence: Propagating Intentions and Reasoning for Multi-Agent Coordination with Large Language Models

Xihe Qiu Haoyu Wang Xiaoyu Tan Chao Qu Yujie Xiong Yuan Cheng Yinghui Xu Wei Chu Yuan Qi
Abstract

Effective collaboration in multi-agent systems requires communicating goals and intentions between agents. Current agent frameworks often suffer from dependencies on single-agent execution and lack robust inter-module communication, frequently leading to suboptimal multi-agent reinforcement learning (MARL) policies and inadequate task coordination. To address these challenges, we present a framework for training large language models (LLMs) as collaborative agents to enable coordinated behaviors in cooperative MARL. Each agent maintains a private intention consisting of its current goal and associated sub-tasks. Agents broadcast their intentions periodically, allowing other agents to infer coordination tasks. A propagation network transforms broadcast intentions into teammate-specific communication messages, sharing relevant goals with designated teammates. The architecture of our framework is structured into planning, grounding, and execution modules. During execution, multiple agents interact in a downstream environment and communicate intentions to enable coordinated behaviors. The grounding module dynamically adapts comprehension strategies based on emerging coordination patterns, while feedback from execution agents influences the planning module, enabling the dynamic re-planning of sub-tasks. Results in simulated collaborative environments show that intention propagation reduces miscoordination errors by aligning sub-task dependencies between agents. Agents learn when to communicate intentions and which teammates require task details, resulting in emergent coordinated behaviors. This demonstrates the efficacy of intention sharing for cooperative multi-agent RL based on LLMs.

1 Introduction

With the recent advancements of large language models (LLMs), developing intelligent agents that can perform complex reasoning and long-horizon planning has attracted increasing research attention Sharan et al. (2023); Huang et al. (2022). A variety of agent frameworks have been proposed, such as ReAct Yao et al. (2022), LUMOS Yin et al. (2023), Chameleon Lu et al. (2023) and BOLT Chiu et al. (2024). These frameworks typically consist of modules for high-level planning, grounding plans into executable actions, and interacting with environments or tools to execute actions Rana et al. (2023).

Despite their initial success, existing agent frameworks have several limitations. Firstly, most of them rely on a single agent for execution Song et al. (2023); Hartmann et al. (2022). However, as tasks become more complex, the action space can grow exponentially, which makes it difficult for a single agent to handle all execution functionalities Chebotar et al. (2023); Wen et al. (2023). Secondly, existing frameworks lack inter-module communication mechanisms. Typically, the execution results are fed directly into the planning module without further analysis or coordination Zeng et al. (2023); Wang et al. (2024b). When execution failures occur, the agent may fail to adjust its strategies accordingly Chaka (2023). Thirdly, the grounding module in existing frameworks operates statically, without interacting with downstream modules. It grounds plans independently, without considering feedback from or the state of the execution module Xi et al. (2023). As a result, LLMs struggle to handle emergent coordination behaviors and lack common grounding on shared tasks. Moreover, existing multi-agent reinforcement learning (MARL) methods often converge on suboptimal policies that fail to exhibit a sufficient level of cooperation Gao et al. (2023); Yu et al. (2023).

How can LLM-based agents communicate and collaborate with each other effectively? We propose a novel approach, Recursive Multi-Agent Learning with Intention Sharing (ReMALIS; the code can be accessed at https://github.com/AnonymousBoy123/ReMALIS), to address the limitations of existing cooperative artificial intelligence (AI) multi-agent frameworks with LLMs. ReMALIS employs intention propagation between LLM agents to enable a shared understanding of goals and tasks. This common grounding allows agents to align intentions and reduce miscoordination. Additionally, we introduce bidirectional feedback loops between downstream execution agents and upstream planning and grounding modules. This enables execution coordination patterns to guide adjustments in grounding strategies and planning policies, resulting in more flexible emergent behaviors Topsakal and Akinci (2023). By integrating these mechanisms, ReMALIS significantly improves the contextual reasoning and adaptive learning capabilities of LLM agents during complex collaborative tasks. The execution module utilizes specialized agents that collaboratively execute actions, exchange information, and propagate intentions via intention networks. These propagated intentions reduce miscoordination errors and guide grounding module adjustments to enhance LLM comprehension based on coordination patterns Dong et al. (2023). Furthermore, execution agents can provide feedback to prompt collaborative re-planning in the planning module when necessary.

Compared to single-agent frameworks, the synergistic work of multiple specialized agents enhances ReMALIS’s collective intelligence and leads to emerging team-level behaviors Wang et al. (2023). The collaborative design allows for dealing with more complex tasks that require distributed knowledge and skills. We demonstrate that:

  • Intention propagation between execution agents enables emergent coordination behaviors and reduces misaligned sub-tasks.

  • Grounding module strategies adjusted by intention sharing improve LLM scene comprehension.

  • Planning module re-planning guided by execution feedback increases goal-oriented coordination.

Compared to various single-agent baselines and existing state-of-the-art LLM-based MARL methods Hu and Sadigh (2023); Zou et al. (2023), our ReMALIS framework demonstrates improved performance on complex collaborative tasks, evaluated on the publicly available large-scale traffic flow prediction (TFP) dataset and a web-based activities dataset. This demonstrates its effectiveness in deploying LLMs as collaborative agents capable of intention communication, strategic adjustments, and collaborative re-planning Du et al. (2023).

2 Preliminary

In this section, we introduce the methods of the proposed ReMALIS framework in detail. As illustrated in Figure 1, ReMALIS consists of four key components:

Figure 1: This framework introduces a multi-agent learning strategy designed to enhance the capabilities of LLMs through cooperative coordination. It enables agents to collaborate and share intentions for effective coordination, and utilizes recursive reasoning to model and adapt to each other’s strategies.

Planning Module $p_\theta$ predicts the next pending sub-goal $s_{t+1}$ given the current sub-goal $s_t$ and other inputs, $s_{t+1} = p_\theta(s_t, I_t, e_t, f_t)$, where $I_t$ is the current intention, $e_t$ is the grounded embedding, and $f_t$ is agent feedback.
$p_\theta$ first encodes this information through encoding layers, $h_t = \mathrm{Encoder}(s_t, I_t, e_t, f_t)$, and subsequently predicts the sub-goal through $s_{t+1} = \mathrm{Softmax}(T_\theta(h_t))$, where $T_\theta$ uses a graph neural network (GNN) architecture.

The module is trained to maximize the likelihood of all sub-goals along the decision sequences given the current information at time step $t$. This allows the dynamic re-planning of sub-task dependencies based on agent feedback.

$$\theta^{*} = \arg\max_{\theta} \prod_{t=1}^{T} p_\theta(s_{t+1} \mid s_t, I_t, e_t, f_t). \tag{1}$$
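As a minimal sketch of the objective in Eq. (1), the snippet below computes the log-likelihood of a sub-goal sequence under softmax-normalized scores. The per-step logits stand in for $T_\theta(h_t)$. The helper names `softmax` and `sequence_log_likelihood` and all numbers are illustrative assumptions, not part of the ReMALIS code. Maximizing the product in Eq. (1) is equivalent to maximizing this sum of log-probabilities.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over candidate sub-goals."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_log_likelihood(
    logits_per_step: Sequence[Sequence[float]],
    next_subgoals: Sequence[int],
) -> float:
    """Sum over t of log p(s_{t+1} | ...), i.e. the log of the product in Eq. (1).

    logits_per_step[t] is a stand-in for T_theta(h_t); next_subgoals[t] is the
    index of the observed next sub-goal s_{t+1}.
    """
    if len(logits_per_step) != len(next_subgoals):
        raise ValueError("one target sub-goal is needed per time step")
    return sum(
        math.log(softmax(logits)[target])
        for logits, target in zip(logits_per_step, next_subgoals)
    )


# Toy example: two time steps, three candidate sub-goals (made-up scores).
toy_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 3.0]]
toy_targets = [0, 2]
log_lik = sequence_log_likelihood(toy_logits, toy_targets)
```

In practice the logits would come from the encoder and GNN layers, and the negative of this quantity would be minimized by gradient descent.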

Grounding Module $g_\phi$ contextualizes symbol embeddings $e_t = g_\phi(s_t, I_t, f_{1:t})$, where $s_t$, $I_t$, and $f_{1:t}$ represent the states, intention, and feedback up to time step $t$, respectively. These inputs are processed by encoders, $h_t = \mathrm{Encoder}(s_t, I_t, f_{1:t})$, and then by cross-attention layers and convolutional feature extractors over the vocabulary $V$: $e_t = \mathrm{Conv}(\mathrm{Attn}(h_t, V)) + P_t$. Here, $P_t$ incorporates agent feedback so that coordination signals improve grounding accuracy and contextual understanding.
The module maps language symbols to physical environment representations through:

$$g(x) = f_\theta\left(\sum_{i=1}^{N} w_i \, g(x_i)\right), \tag{2}$$

where $g(x)$ is the grounded embedding of the policy set $x$, $g(x_i)$ is the individual action embedding for agent $i$, and $w_i$ are learnable weights. The grounding function $f_\theta$ utilizes a GNN architecture for structural composition. Additionally, we employ an uncertainty modeling module that represents ambiguities in grounding:

$$q_\phi(z \mid x) = \mathrm{Normal}\big(z; \mu_\phi(x), \sigma^2_\phi(x)\big), \tag{3}$$

where $z$ is a normally distributed latent variable, enabling the capture of multimodal uncertainties in grounding.
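The two grounding equations can be illustrated with a small sketch: a weighted sum of per-agent embeddings passed through a function (Eq. 2), followed by a draw from the Gaussian in Eq. (3). The `tanh_layer` function is a hypothetical stand-in for the GNN-based $f_\theta$, the diagonal covariance is an assumption, and all numbers are made up.

```python
import math
import random
from typing import Callable, List, Sequence


def aggregate_grounding(
    agent_embeddings: Sequence[Sequence[float]],
    weights: Sequence[float],
    f_theta: Callable[[List[float]], List[float]],
) -> List[float]:
    """Eq. (2): apply f_theta to the weighted sum of per-agent embeddings."""
    if len(agent_embeddings) != len(weights):
        raise ValueError("one weight is needed per agent embedding")
    dim = len(agent_embeddings[0])
    pooled = [
        sum(w * emb[k] for w, emb in zip(weights, agent_embeddings))
        for k in range(dim)
    ]
    return f_theta(pooled)


def sample_latent(
    mu: Sequence[float], sigma: Sequence[float], rng: random.Random
) -> List[float]:
    """Eq. (3): draw z ~ Normal(mu, sigma^2), assuming a diagonal covariance."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def tanh_layer(v: List[float]) -> List[float]:
    """Hypothetical stand-in for the GNN-based grounding function f_theta."""
    return [math.tanh(x) for x in v]


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # g(x_i) for three agents
weights = [0.5, 0.25, 0.25]  # learnable w_i, held fixed in this sketch
grounded = aggregate_grounding(embeddings, weights, tanh_layer)

rng = random.Random(0)
z = sample_latent(mu=grounded, sigma=[0.1, 0.1], rng=rng)
```

Writing the sample as `mu + sigma * noise` mirrors the reparameterization commonly used to train such latent-variable models, although the source does not state which estimator is used.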

Cooperative Execution Module comprises $N$ specialized agents $\{A_1, \ldots, A_N\}$. This architecture avoids using a single agent to handle all tasks. Instead, each agent is dedicated to a distinct semantic domain, cultivating expertise specific to that domain. For instance, agents $A_1$, $A_2$, and $A_3$ may be dedicated to query processing, information retrieval, and arithmetic operations, respectively. This specialization promotes an efficient distribution of tasks and reduces overlap in capabilities.

Decomposing skills into specialized agents risks creating isolated capabilities that lack coordination. To address this, it is essential that agents not only excel individually but also comprehend the capacities and limitations of their peers. We propose an integrated training approach where specialized agents are trained simultaneously to foster collaboration and collective intelligence. We represent the parameters of agent $A_i$ as $\theta_i$. Each agent's policy, denoted as $y_i \sim \pi_{\theta_i}(\cdot \mid s)$, samples an output $y_i$ from a given input state $s$. The training objective for our system is defined by the following equation:

$$L_{exe} = \sum_{i=1}^{N} \mathbb{E}_{(s, y^{\star}) \sim \mathcal{D}} \, \ell\big(\pi_{\theta_i}(y_i \mid s), y^{\star}\big), \tag{4}$$

where $\ell(\cdot)$ represents the task-specific loss function, comparing the agent-generated output $y_i$ with the ground-truth label $y^{\star}$. $\mathcal{D}$ denotes the distribution of training data. By optimizing this objective collectively across all agents, each agent not only improves its own output accuracy but also enhances the overall team's ability to produce coherent and well-coordinated results.
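A minimal sketch of a Monte Carlo estimate of Eq. (4) follows. The source only says that $\ell$ is task-specific; cross-entropy is used here as an assumed example, and the dictionary-based policy outputs, label names, and probabilities are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One agent's policy output: a probability for each candidate label.
Distribution = Dict[str, float]


def cross_entropy(predicted: Distribution, target: str) -> float:
    """Assumed task loss: negative log-probability of the ground-truth label."""
    return -math.log(max(predicted.get(target, 0.0), 1e-12))


def execution_loss(batch: Sequence[Tuple[List[Distribution], str]]) -> float:
    """Monte Carlo estimate of Eq. (4).

    Each batch element holds the N agents' output distributions for one state s
    together with the ground-truth label y*. Per-agent losses are summed over
    agents and averaged over the batch.
    """
    total = 0.0
    for agent_outputs, label in batch:
        total += sum(cross_entropy(dist, label) for dist in agent_outputs)
    return total / len(batch)


# Toy batch: two agents, two samples (all probabilities are made up).
toy_batch = [
    ([{"left": 0.9, "right": 0.1}, {"left": 0.6, "right": 0.4}], "left"),
    ([{"left": 0.2, "right": 0.8}, {"left": 0.5, "right": 0.5}], "right"),
]
loss = execution_loss(toy_batch)
```

Because the sum runs over all agents before any parameter update, a gradient step on this quantity trains the agents jointly rather than one at a time.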

During training, we adjust the decomposition of grounding tasks to enhance collaboration, which is represented by the soft module weights $\{w_1, \ldots, w_N\}$. These weights indicate how the distribution of grounding commands can be optimized to better utilize the capabilities of different agents. The objective of this training is defined by the loss function $L_{com} = \ell(d, w^{\star})$, where $\ell$ represents the loss function, $d$ denotes the sub-goal task instruction data, and $w^{\star}$ signifies the optimal set of weights.

Figure 2: Overview of the proposed ReMALIS: This framework comprises a planning module, grounding module, cooperative execution module, and intention coordination channels.

3 Approach

The collaborative MARL of ReMALIS focuses on three key points: intention propagation for grounding, bidirectional coordination channels, and integration with recursive reasoning agents. Additional parameter details and pseudocode can be found in Appendix C and Appendix F.

3.1 Planning with Intention Propagation

We formulate a decentralized, partially observable Markov game for multi-agent collaboration. Each agent $i$ maintains a private intention $\mathcal{I}_i$ encoded as a tuple $\mathcal{I}_i = (\gamma_i, \Sigma_i, \pi_i, \delta_i)$, where $\gamma_i$ is the current goal, $\Sigma_i = \{\sigma_{i1}, \sigma_{i2}, \ldots\}$ is a set of related sub-goals, $\pi_i(\sigma)$ is a probability distribution over possible next sub-goals, and $\delta_i(\sigma)$ is the desired teammate assignment for sub-goal $\sigma$.
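One way to picture the intention tuple is as a small record type. The sketch below is an illustration only: the field names, the use of strings for goals and integer indices for teammates, and the traffic-themed example values are assumptions rather than the authors' data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Intention:
    """Private intention I_i = (gamma_i, Sigma_i, pi_i, delta_i)."""

    goal: str  # gamma_i: current goal
    sub_goals: List[str]  # Sigma_i: related sub-goals
    next_sub_goal_probs: Dict[str, float]  # pi_i(sigma): next sub-goal distribution
    teammate_assignment: Dict[str, int] = field(default_factory=dict)  # delta_i(sigma)

    def validate(self) -> None:
        """Check that pi_i is a distribution over the listed sub-goals."""
        unknown = set(self.next_sub_goal_probs) - set(self.sub_goals)
        if unknown:
            raise ValueError(f"probabilities for unknown sub-goals: {unknown}")
        if abs(sum(self.next_sub_goal_probs.values()) - 1.0) > 1e-9:
            raise ValueError("next-sub-goal probabilities must sum to 1")

    def most_likely_next(self) -> str:
        """Return the sub-goal with the highest probability under pi_i."""
        return max(self.next_sub_goal_probs, key=self.next_sub_goal_probs.get)


# Hypothetical example values.
example = Intention(
    goal="clear intersection A",
    sub_goals=["extend green phase", "reroute traffic"],
    next_sub_goal_probs={"extend green phase": 0.7, "reroute traffic": 0.3},
    teammate_assignment={"reroute traffic": 2},
)
example.validate()
```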

Intentions are propagated through a communication channel $f_\Lambda$ parameterized by $\Lambda$. For a received message $m_{ij}$ from agent $j$, agent $i$ infers a belief over teammate $j$'s intention, $b_i(\mathcal{I}_j \mid m_{ij}) = f_\Lambda(m_{ij})$, where $f_\Lambda$ is a recurrent neural network. The channel $f_\Lambda$ is trained in an end-to-end manner to maximize the coordination reward function $R_c$. This propagates relevant sub-task dependencies to enhance common grounding on collaborative goals.

$$\Lambda^{*} = \arg\max_{\Lambda} \mathbb{E}_{\mathcal{I}, m \sim f_\Lambda}\big[R_c(\mathcal{I}, m)\big]. \tag{5}$$
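To make the belief computation concrete, the sketch below implements a toy Elman-style recurrent cell that reads a message as a sequence of vectors and outputs a softmax belief over a small discrete set of candidate intentions. The class name `RecurrentChannel`, the discrete intention set, the dimensions, and the random untrained weights are all assumptions; training against $R_c$ is not shown.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(row: Sequence[float], vec: Sequence[float]) -> float:
    return sum(r * v for r, v in zip(row, vec))


class RecurrentChannel:
    """Toy recurrent channel: message (sequence of vectors) -> belief."""

    def __init__(self, input_dim: int, hidden_dim: int,
                 num_intentions: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def matrix(rows: int, cols: int) -> List[List[float]]:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.hidden_dim = hidden_dim
        self.w_in = matrix(hidden_dim, input_dim)    # input-to-hidden weights
        self.w_rec = matrix(hidden_dim, hidden_dim)  # hidden-to-hidden weights
        self.w_out = matrix(num_intentions, hidden_dim)  # hidden-to-logit weights

    def belief(self, message: Sequence[Sequence[float]]) -> List[float]:
        """Return b_i(I_j | m_ij) as a distribution over candidate intentions."""
        h = [0.0] * self.hidden_dim
        for x in message:
            # The comprehension reads the previous hidden state before rebinding h.
            h = [
                math.tanh(dot(self.w_in[k], x) + dot(self.w_rec[k], h))
                for k in range(self.hidden_dim)
            ]
        return softmax([dot(row, h) for row in self.w_out])


channel = RecurrentChannel(input_dim=2, hidden_dim=4, num_intentions=3)
message_from_j = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
belief_over_j = channel.belief(message_from_j)
```

With an empty message the hidden state stays at zero, so the belief is uniform, which is a reasonable default when a teammate has sent nothing.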

At each time step $t$, the LLM processes inputs comprising the agent's state $s_t$, the intention $\mathcal{I}_t$, and the feedback $f_{1:t}$.

3.2 Grounding with Bidirectional Coordination Channels

The execution agent policies, denoted by $\pi_\xi(a_i \mid s_i, \mathcal{I}_i)$, are parameterized by $\xi$ and conditioned on the agent's state $s_i$ and intention $\mathcal{I}_i$. Emergent coordination patterns are encoded in a summary statistic $c_t$ and passed to upstream modules to guide planning and grounding adjustments. For example, frequent miscoordination on sub-goal $\sigma$ indicates the necessity to re-plan the dependencies of $\sigma$ in $\mathcal{I}$.

This bidirectional feedback aligns low-level execution with high-level comprehension strategies. In addition to the downstream propagation of intents, execution layers provide bidirectional feedback signals $\psi(t)$ to upstream modules, $\psi(t) = \Phi(h^{\text{exec}}_t)$:

$$h^{\text{exec}}_t = [\phi_1(o_1), \ldots, \phi_N(o_N)], \tag{6}$$

where $\Phi(\cdot)$ aggregates agent encodings to summarize emergent coordination, and $\phi_i(\cdot)$ encodes the observation $o_i$ for agent $i$.
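The aggregation in Eq. (6) can be sketched as follows. The source does not specify the form of $\Phi$ or $\phi_i$, so the element-wise mean (`mean_pool`) and the scaling encoders (`scale_by`) are hypothetical choices made only to keep the example runnable.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def execution_summary(
    observations: Sequence[Vector],
    encoders: Sequence[Callable[[Vector], Vector]],
    aggregate: Callable[[List[Vector]], Vector],
) -> Vector:
    """psi(t) = Phi(h_exec_t) with h_exec_t = [phi_1(o_1), ..., phi_N(o_N)]."""
    if len(observations) != len(encoders):
        raise ValueError("one encoder is needed per agent observation")
    h_exec = [phi(o) for phi, o in zip(encoders, observations)]
    return aggregate(h_exec)


def mean_pool(encodings: List[Vector]) -> Vector:
    """Hypothetical choice of Phi: element-wise mean over agents."""
    n = len(encodings)
    return [sum(column) / n for column in zip(*encodings)]


def scale_by(factor: float) -> Callable[[Vector], Vector]:
    """Hypothetical per-agent encoder phi_i: scale every feature."""
    return lambda o: [factor * v for v in o]


observations = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # o_1, o_2, o_3
encoders = [scale_by(1.0), scale_by(2.0), scale_by(0.0)]
psi_t = execution_summary(observations, encoders, mean_pool)
```

A permutation-invariant aggregator such as the mean keeps the summary independent of agent ordering. Attention-based pooling would be another plausible choice.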

Execution agents generate feedback $f_t$ to guide upstream LLM modules through $f_t = g_\theta(\tau_{1:t})$, where $g_\theta$ processes the action-observation history $\tau_{1:t}$. These signals include coordination errors $\mathcal{E}_t$, which indicate misalignment of sub-tasks; grounding uncertainty $\mathcal{U}_t$, measured as entropy over grounded symbol embeddings; and re-planning triggers $\mathcal{R}_t$, which flag the need for sub-task reordering. These signals can reflect inconsistencies between sub-task objectives, the ambiguity of symbols in different contexts, and the need to adjust previous sub-task sequencing.
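The three feedback signals can be sketched as a small record, with the grounding uncertainty computed as Shannon entropy. The `Feedback` container, the integer error count, and the two thresholds used to raise the re-planning flag are illustrative assumptions; the source does not state how the trigger is computed.

```python
import math
from dataclasses import dataclass
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


@dataclass
class Feedback:
    coordination_errors: int  # E_t: number of misaligned sub-tasks
    grounding_uncertainty: float  # U_t: entropy over grounded symbols
    replan: bool  # R_t: whether sub-tasks should be reordered


def make_feedback(
    misaligned_sub_tasks: int,
    symbol_probs: Sequence[float],
    error_threshold: int = 2,  # assumed threshold, not from the source
    entropy_threshold: float = 1.0,  # assumed threshold, not from the source
) -> Feedback:
    """Bundle the three signals and raise the re-planning flag if either is high."""
    uncertainty = entropy(symbol_probs)
    replan = (
        misaligned_sub_tasks >= error_threshold
        or uncertainty > entropy_threshold
    )
    return Feedback(misaligned_sub_tasks, uncertainty, replan)


confident = make_feedback(0, [1.0])
ambiguous = make_feedback(0, [0.25, 0.25, 0.25, 0.25])
miscoordinated = make_feedback(3, [1.0])
```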

Algorithm 1 ReMALIS: Recursive Multi-Agent Learning with Intention Sharing
1:  Initialize LLM parameters $\theta, \phi, \omega$
2:  Initialize agent policies $\pi_\xi$, communication channel $f_\Lambda$
3:  Initialize grounding confusion matrix $C$, memory $M$
4:  for each episode do
5:     for each time step $t$ do
6:        Observe states $s_t$ and feedback $f_{1:t}$ for all agents
7:        Infer intentions $\mathcal{I}_t$ from $s_t, f_{1:t}$ using $\text{LLM}_\theta$
8:        Propagate intentions $\mathcal{I}_t$ through channel $f_\Lambda$
9:        Compute grounded embeddings $e_t = g_\phi(s_t, \mathcal{I}_t, f_{1:t})$
10:        Predict sub-tasks $\Sigma_{t+1} = p_\theta(\mathcal{I}_t, e_t, f_{1:t})$
11:        Generate actions $a_t = a_\omega(e_t, \Sigma_{t+1}, f_{1:t})$
12:        Execute actions $a_t$ and observe rewards $r_t$, new states $s_{t+1}$
13:        Encode coordination patterns $c_t = \Phi(h^{\text{exec}}_t)$
14:        Update grounding confusion $C_t$ and memory $M_t$ using $c_t$
15:        Update policies $\pi_\xi$ using $R$ and auxiliary loss $\mathcal{L}_{\text{aux}}$
16:        Update LLM parameters $\theta, \phi, \omega$ using $\mathcal{L}_{\text{RL}}, \mathcal{L}_{\text{confusion}}$
17:     end for
18:  end for
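The inner loop of Algorithm 1 can be sketched as a control-flow skeleton. Every learned component is passed in as a plain callable, so the sketch shows only the order of operations. The `Modules` container, the `run_episode` signature, and the toy counter environment are assumptions, and the confusion-matrix and parameter updates are omitted.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Modules:
    """Callables standing in for the learned components (all hypothetical)."""

    infer_intention: Callable[[Any, List[Any]], Any]  # LLM-based intention inference
    propagate: Callable[[Any], Any]  # communication channel
    ground: Callable[[Any, Any, List[Any]], Any]  # grounding module
    plan: Callable[[Any, Any, List[Any]], Any]  # planning module
    act: Callable[[Any, Any, List[Any]], Any]  # action generation
    summarize: Callable[[Any, Any], Any]  # coordination summary


def run_episode(
    step_env: Callable[[Any, Any], Tuple[Any, float]],
    modules: Modules,
    initial_state: Any,
    horizon: int,
) -> Tuple[List[float], List[Any]]:
    state, feedback, rewards = initial_state, [], []
    for _ in range(horizon):
        intention = modules.infer_intention(state, feedback)
        shared = modules.propagate(intention)  # intention as seen by teammates
        embedding = modules.ground(state, shared, feedback)
        sub_tasks = modules.plan(shared, embedding, feedback)
        action = modules.act(embedding, sub_tasks, feedback)
        state, reward = step_env(state, action)
        feedback.append(modules.summarize(state, action))
        rewards.append(reward)
        # Confusion-matrix, policy, and LLM parameter updates are omitted here.
    return rewards, feedback


# Toy instantiation: a counter that should reach the value 3.
toy = Modules(
    infer_intention=lambda s, f: {"goal": 3},
    propagate=lambda intention: intention,
    ground=lambda s, intention, f: intention["goal"] - s,
    plan=lambda intention, e, f: ["increment"] if e > 0 else ["wait"],
    act=lambda e, sub_tasks, f: 1 if sub_tasks[0] == "increment" else 0,
    summarize=lambda s, a: {"state": s, "action": a},
)


def toy_env(state: int, action: int) -> Tuple[int, float]:
    new_state = state + action
    return new_state, 1.0 if new_state == 3 else 0.0


rewards, feedback = run_episode(toy_env, toy, initial_state=0, horizon=5)
```

Passing the components as callables keeps the loop independent of any particular model. A real system would replace the lambdas with LLM-backed modules and add the update steps.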

3.3 Execution: Integration with Reasoning Agents

3.3.1 Agent Policy Generation

We parameterize agent policies $\pi_\theta(a_t \mid s_t, \mathcal{I}_t, c_{1:t})$ using an LLM with weights $\theta$. At each time step, the LLM takes as input the agent's state $s_t$, intention $\mathcal{I}_t$, and coordination feedback $c_{1:t}$. The output is a distribution over the next actions $a_t$:

$$\pi_\theta(a_t \mid s_t, \mathcal{I}_t, c_{1:t}) = \text{LLM}_\theta(s_t, \mathcal{I}_t, c_{1:t}). \tag{7}$$

To leverage agent feedback $f_{1:t}$, we employ an auxiliary regularization model $\hat{\pi}_{\phi}(a_{t}|s_{t},f_{1:t})$:

$$\mathcal{L}_{\text{aux}}(\theta;s_{t},f_{1:t})=\text{MSE}(\pi_{\theta}(s_{t}),\hat{\pi}_{\phi}(s_{t},f_{1:t})), \tag{8}$$

where $\hat{\pi}_{\phi}$ is a feedback-conditioned policy approximation. The training loss to optimize $\theta$ is:

$$\mathcal{L}(\theta)=\mathcal{L}_{\text{RL}}(\theta)+\lambda\mathcal{L}_{\text{aux}}(\theta), \tag{9}$$

where $\mathcal{L}_{\text{RL}}$ is the reinforcement learning objective and $\lambda$ a weighting factor.
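As a minimal sketch of how Eqs. (8) and (9) combine, the snippet below treats both policies as plain probability vectors over a small discrete action set. The function names, the toy distributions, and the value of $\lambda$ are illustrative assumptions, not the paper's implementation.

```python
from typing import Sequence


def mse(p: Sequence[float], q: Sequence[float]) -> float:
    """Mean squared error between two equally sized action distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def total_loss(rl_loss: float,
               policy_probs: Sequence[float],
               feedback_probs: Sequence[float],
               lam: float) -> float:
    """L(theta) = L_RL(theta) + lambda * L_aux(theta), following Eqs. (8)-(9)."""
    return rl_loss + lam * mse(policy_probs, feedback_probs)


# Toy values (assumed): pi_theta(s_t) and the feedback-conditioned pi_hat_phi(s_t, f_{1:t}).
pi_theta = [0.7, 0.2, 0.1]
pi_hat = [0.5, 0.3, 0.2]
loss = total_loss(rl_loss=1.0, policy_probs=pi_theta, feedback_probs=pi_hat, lam=0.5)
```

With $\lambda = 0$ the objective reduces to the RL loss alone, so $\lambda$ directly sets how strongly the feedback-conditioned approximation pulls on $\pi_{\theta}$.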

3.3.2 Grounding Strategy Adjustment

We model action dependencies using a graph neural policy module $h_{t}^{a}=\text{GNN}(s_{t},a)$, where $h_{t}^{a}$ models interactions between action $a$ and the state $s_{t}$. The policy is then given by $\pi_{\theta}(a_{t}|s_{t})=\prod_{i=1}^{|A|}h_{t}^{a_{i}}$. This captures the relational structure in the action space, enabling coordinated action generation conditioned on agent communication.

The coordination feedback $c_{t}$ is used to guide adjustments in the grounding module’s strategies. We define a grounding confusion matrix $C_{t}$, where $C_{t}(i,j)$ represents grounding errors between concepts $i$ and $j$. The confusion matrix constrains LLM grounding as:

$$f_{\phi}(s_{t},\mathcal{I}_{t})=\text{LLM}_{\phi}(s_{t},\mathcal{I}_{t})\odot\lambda C_{t} \tag{10}$$

where $\odot$ is element-wise multiplication and $\lambda$ controls the influence of $C_{t}$, reducing uncertainty on error-prone concept pairs.
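The element-wise product in Eq. (10) can be sketched as follows. The sketch assumes, purely for illustration, that the grounding LLM emits a matrix of pairwise concept scores with the same shape as $C_{t}$; the paper does not specify the output shape, and the function name is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def modulate_grounding(scores: Matrix, confusion: Matrix, lam: float) -> Matrix:
    """Apply Eq. (10) literally: scores (Hadamard product) lam * confusion."""
    if len(scores) != len(confusion) or any(
        len(s_row) != len(c_row) for s_row, c_row in zip(scores, confusion)
    ):
        raise ValueError("scores and confusion matrix must have the same shape")
    return [
        [s * lam * c for s, c in zip(s_row, c_row)]
        for s_row, c_row in zip(scores, confusion)
    ]


# Toy 2x2 example with assumed values.
grounded = modulate_grounding(
    scores=[[0.9, 0.4], [0.4, 0.8]],
    confusion=[[1.0, 0.5], [0.5, 1.0]],
    lam=1.0,
)
```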

We propose a modular regularization approach, with the grounding module $g_{\phi}$ regularized by a coordination confusion estimator:

$$\mathcal{L}_{\text{confusion}}=\frac{1}{N}\sum_{i,j}A_{\psi}(c_{i},c_{j})\cdot\text{Conf}(c_{i},c_{j}) \tag{11}$$

where $\mathcal{L}_{\text{task}}$ is the task reward, $\text{Conf}(c_{i},c_{j})$ measures confusion between concepts $c_{i}$ and $c_{j}$, and $A_{\psi}(c_{i},c_{j})$ are attention weights assigning importance based on grounding sensitivity.
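A small sketch of the regularizer in Eq. (11), under the assumption that $N$ is the number of $(i,j)$ concept pairs in the sum (the text does not define $N$). The attention and confusion matrices here are toy values and the function name is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def confusion_loss(attention: Matrix, confusion: Matrix) -> float:
    """Attention-weighted mean confusion over all (i, j) concept pairs, as in Eq. (11)."""
    n_pairs = sum(len(row) for row in confusion)  # assumed meaning of N
    if n_pairs == 0:
        raise ValueError("confusion matrix must not be empty")
    total = sum(
        a * c
        for a_row, c_row in zip(attention, confusion)
        for a, c in zip(a_row, c_row)
    )
    return total / n_pairs


# Off-diagonal pairs carry all the attention in this toy example.
attention = [[0.0, 1.0], [1.0, 0.0]]
confusion = [[0.0, 0.2], [0.4, 0.0]]
reg = confusion_loss(attention, confusion)
```

Pairs with high attention weight and high measured confusion dominate the penalty, which is the behavior the weighting is meant to provide.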

An episodic confusion memory $M_{t}$ accumulates long-term grounding uncertainty statistics:

$$M_{t}(i,j)=M_{t-1}(i,j)+\mathbb{I}(\text{Confuse}(c_{i},c_{j})_{t}), \tag{12}$$

where $\mathbb{I}(\cdot)$ are indicator functions tracking confusion events. By regularizing with a coordination-focused confusion estimator and episodic memory, the grounding module adapts to avoid miscoordination.
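The update in Eq. (12) is a running count of confusion events per concept pair. The class below is one hypothetical way to hold that state; the concept names are made up for the example, and how a confusion event is detected is left outside the sketch.

```python
from collections import Counter
from typing import Iterable, Tuple

Pair = Tuple[str, str]


class ConfusionMemory:
    """Episodic memory M_t: counts confusion events for each concept pair."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def update(self, confused_pairs: Iterable[Pair]) -> None:
        """One time step of Eq. (12): add 1 for every pair confused at step t."""
        # The indicator contributes at most 1 per pair per step, hence the set.
        for pair in set(confused_pairs):
            self.counts[pair] += 1

    def get(self, i: str, j: str) -> int:
        """Return M_t(i, j); pairs never confused have count 0."""
        return self.counts[(i, j)]


memory = ConfusionMemory()
memory.update([("lane", "road")])
memory.update([("lane", "road"), ("stop", "yield")])
```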

3.4 Collective Learning and Adaptation

The coordination feedback signals $c_{t}$ and interpretability signals $\mathcal{E}_{t},\mathcal{U}_{t},\mathcal{R}_{t}$ play a crucial role in enabling the LLM agents to adapt and learn collectively. By incorporating these signals into the training process, the agents can adjust their strategies and policies to better align with the emerging coordination patterns and requirements of the collaborative tasks.

The collective learning process can be formalized as an optimization problem, where the goal is to minimize the following objective function:

$$\mathcal{L}(\eta,\gamma,\zeta,\xi)=\mathbb{E}_{s_{t},\mathcal{I}_{t},f_{1:t}}\left[\alpha\mathcal{U}_{t}+\beta\mathcal{E}_{t}-\mathcal{R}\right]+\Omega(\eta,\gamma,\zeta,\xi).$$

Here, $\alpha$ and $\beta$ are weighting factors that balance the contributions of the grounding uncertainty $\mathcal{U}_{t}$ and coordination errors $\mathcal{E}_{t}$, respectively. The team reward $\mathcal{R}$ is maximized to encourage collaborative behavior. The term $\Omega(\eta,\gamma,\zeta,\xi)$ represents regularization terms or constraints on the model parameters to ensure stable and robust learning.

The objective function $\mathcal{L}$ is defined over the current state $s_{t}$, the interpretability signals $\mathcal{I}_{t}=\{\mathcal{E}_{t},\mathcal{U}_{t},\mathcal{R}_{t}\}$, and the trajectory of feedback signals $f_{1:t}=\{c_{1},\mathcal{I}_{1},\ldots,c_{t},\mathcal{I}_{t}\}$ up to the current time step $t$. The expectation $\mathbb{E}_{s_{t},\mathcal{I}_{t},f_{1:t}}[\cdot]$ is taken over the distribution of states, interpretability signals, and feedback signal trajectories encountered during training.
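In practice an expectation of this kind is estimated from sampled training steps. The sketch below averages $\alpha\mathcal{U}_{t}+\beta\mathcal{E}_{t}-\mathcal{R}$ over a batch and, as an assumption made only for this example, instantiates $\Omega$ as an L2 penalty on a flat parameter list; the paper leaves the form of $\Omega$ open.

```python
from typing import Sequence, Tuple

# Each sample is (grounding uncertainty U_t, coordination error E_t, team reward R).
Sample = Tuple[float, float, float]


def collective_objective(samples: Sequence[Sample],
                         alpha: float,
                         beta: float,
                         params: Sequence[float],
                         weight_decay: float) -> float:
    """Monte Carlo estimate of E[alpha*U + beta*E - R] plus an assumed L2 regularizer."""
    if not samples:
        raise ValueError("need at least one sample")
    data_term = sum(alpha * u + beta * e - r for u, e, r in samples) / len(samples)
    omega = weight_decay * sum(p * p for p in params)  # hypothetical choice of Omega
    return data_term + omega


batch = [(0.2, 0.1, 1.0), (0.4, 0.3, 0.5)]
objective = collective_objective(batch, alpha=1.0, beta=2.0,
                                 params=[1.0, -2.0], weight_decay=0.01)
```

Because the reward enters with a negative sign, lowering the objective corresponds to raising the team reward while reducing uncertainty and coordination errors.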

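| Backbone | Method | Web Easy | Web Medium | Web Hard | Web All | TFP Easy | TFP Medium | TFP Hard | TFP Hell |
|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5-Turbo | CoT | 65.77 | 51.62 | 32.45 | 17.36 | 81.27 | 68.92 | 59.81 | 41.27 |
| GPT-3.5-Turbo | Zero-Shot Plan | 57.61 | 52.73 | 28.92 | 14.58 | 82.29 | 63.77 | 55.39 | 42.38 |
| Llama2-7B | CoT | 59.83 | 54.92 | 30.38 | 15.62 | 82.73 | 65.81 | 57.19 | 44.58 |
| Llama2-7B | ReAct | 56.95 | 41.86 | 27.59 | 13.48 | 81.15 | 61.65 | 53.97 | 43.25 |
| Llama2-7B | ART | 62.51 | 52.34 | 33.81 | 18.53 | 81.98 | 63.23 | 51.78 | 46.83 |
| Llama2-7B | ReWOO | 63.92 | 53.17 | 34.95 | 19.37 | 82.12 | 71.38 | 61.23 | 47.06 |
| Llama2-7B | AgentLM | 62.14 | 46.75 | 30.84 | 15.98 | 82.96 | 66.03 | 57.16 | 43.91 |
| Llama2-7B | FireAct | 64.03 | 50.68 | 32.78 | 17.49 | 83.78 | 68.19 | 58.94 | 45.06 |
| Llama2-7B | LUMOS | 66.27 | 53.81 | 35.37 | 19.53 | 84.03 | 71.75 | 62.57 | 51.49 |
| Llama3-8B | Code-Llama (PoT) | 64.85 | 49.49 | 32.16 | 17.03 | 83.34 | 68.47 | 59.15 | 52.64 |
| Llama3-8B | AgentLM | 66.77 | 51.45 | 31.59 | 16.58 | 85.26 | 71.81 | 58.68 | 53.39 |
| Llama3-8B | FireAct | 68.92 | 53.27 | 32.95 | 17.64 | 84.11 | 72.15 | 58.63 | 51.65 |
| Llama3-8B | DGN | 69.15 | 54.78 | 33.63 | 18.17 | 83.42 | 71.08 | 62.34 | 53.57 |
| Llama3-8B | LToS | 68.48 | 55.03 | 33.06 | 17.71 | 85.77 | 74.61 | 59.37 | 54.81 |
| Llama3-8B | AUTOACT | 67.62 | 56.25 | 31.84 | 16.79 | 87.89 | 76.29 | 58.94 | 52.87 |
| Llama3-8B | ReMALIS (Ours) | 73.92 | 58.64 | 38.37 | 21.42 | 89.15 | 77.62 | 64.53 | 55.37 |

Table 1: Comparative analysis of the ReMALIS framework against single-agent baselines and contemporary methods across two datasets (Web: web activities; TFP: traffic flow prediction).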

4 Experiments

4.1 Datasets

To assess the performance of our models, we conducted evaluations using two large-scale real-world datasets: the traffic flow prediction (TFP) dataset and the web-based activities dataset.

The TFP dataset comprises 100,000 traffic scenarios, each accompanied by corresponding flow outcomes. Each example is detailed with descriptions of road conditions, vehicle count, weather, and traffic control measures, and is labeled with one of three traffic flow classes: smooth, congested, or jammed. The raw data was sourced from traffic cameras, incident reports, and simulations, and underwent preprocessing to normalize entities and eliminate duplicates.

The web activities dataset contains over 500,000 examples of structured web interactions such as booking flights, scheduling appointments, and making reservations. Each activity follows a template with multiple steps like searching, selecting, filling forms, and confirming. User utterances and system responses were extracted to form the input-output pairs across 150 domains, originating from real anonymized interactions with chatbots, virtual assistants, and website frontends.

4.2 Implementation Details

To handle the computational demands of training our framework with LLMs, we employ 8 NVIDIA A800-80G GPUs (Chen et al., 2024) under the DeepSpeed (Aminabadi et al., 2022) training framework, which can effectively accommodate the extensive parameter spaces and activations required by our framework’s LLM components and multi-agent architecture (Rasley et al., 2020).

For the TFP dataset, we classified the examples into four difficulty levels: “Easy”, “Medium”, “Hard”, and “Hell”. The “Easy” level comprises small grid networks with low, stable vehicle arrival rates. The “Medium” level includes larger grids with variable arrival rates. “Hard” tasks feature large, irregular networks with highly dynamic arrival rates and complex intersection configurations. The “Hell” level introduces challenges such as partially observable states, changing road conditions, and fully decentralized environments.

For the web activities dataset, we divided the tasks into “Easy”, “Medium”, “Hard”, and “All” levels. “Easy” tasks required basic single-click or short phrase interactions. “Medium” involved complex multi-page sequences like form submissions. “Hard” tasks demanded significant reasoning through ambiguous, dense websites. The “All” level combined tasks across the full difficulty spectrum.

The dataset was divided into 80% for training, 10% for validation, and 10% for testing, with examples shuffled. These large-scale datasets offer a challenging and naturalistic benchmark to evaluate our multi-agent framework on complex, real-world prediction and interaction tasks.
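A shuffled 80/10/10 split of this kind can be sketched as below. The fixed seed and the use of integer indices as stand-ins for examples are assumptions made for reproducibility of the illustration; the paper does not describe its shuffling procedure.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(examples: Iterable[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle examples and split them 80% / 10% / 10% into train, validation, test."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # local RNG keeps the split reproducible
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder, so no example is dropped
    return train, val, test


train_set, val_set, test_set = split_dataset(range(1000))
```

Integer arithmetic is used for the split sizes so that rounding never loses an example; any remainder falls into the test portion.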

4.3 Results and Analysis

Table 1 reports the principal experimental results of our ReMALIS framework in comparison with various single-agent baselines and contemporary methods on both datasets. We first consider the web activities dataset, on which we evaluated the models across four levels of task difficulty: “Easy”, “Medium”, “Hard”, and “All”.

The results indicate that ReMALIS (7B), equipped with a 7B parameter LLM backbone, outperforms competing methods. On the comprehensive “All” difficulty level, which aggregates tasks across a range of complexities, ReMALIS achieved a score of 21.42% in Table 1, surpassing the second-highest scoring method, LUMOS, which scored 19.53%. ReMALIS (7B) also outperformed AUTOACT, which utilizes a larger 13B parameter model and scored 16.79%, by over 4 percentage points. These findings highlight the efficacy of ReMALIS’s parameter-efficient design and its multi-agent collaborative training approach, which allow it to outperform larger single-agent LLMs.

Notably, ReMALIS (7B) also exceeded the performance of GPT-3.5-Turbo, a substantially larger foundation model, across all difficulty levels. On the “All” level in Table 1, ReMALIS’s 21.42% surpassed the 17.36% of GPT-3.5-Turbo with CoT by over 4 points. This indicates that ReMALIS’s coordination mechanisms can turn relatively modest LLMs into capable collaborative agents.

Despite their larger sizes, single-agent approaches like GPT-3.5 CoT, ReAct, and AgentLM significantly underperformed. Notably, even the advanced single-agent method LUMOS (13B) could not rival the performance of ReMALIS (7B). We attribute the advantage of ReMALIS to its specialized multi-agent design and to features such as intention propagation, bidirectional feedback, and recursive reasoning. On complex “Hard” tasks that required extensive reasoning, ReMALIS achieved 38.37% in Table 1, surpassing LUMOS (35.37%) by 3.00 percentage points, which highlights the benefits of its multi-agent architecture and collaborative learning mechanisms.

ReMALIS also leads on the Traffic Flow Prediction (TFP) dataset. On the “Easy” difficulty level, ReMALIS achieved an accuracy of 89.15%, outperforming the second-best method, AUTOACT (87.89%), by 1.26 percentage points. In the “Medium” category, ReMALIS reached 77.62%, surpassing AUTOACT’s 76.29% by 1.33 points. On the most challenging “Hard” and “Hell” levels, ReMALIS maintained its lead with accuracies of 64.53% and 55.37%, respectively, outperforming the strongest Llama3-8B baselines, DGN (62.34%) and LToS (54.81%), by 2.19 and 0.56 points.
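The TFP margins quoted here are simple differences of Table 1 entries, which the short check below reproduces. The dictionary layout is only an illustrative convenience; the accuracies are copied from Table 1, paired with the comparison method named in the text for each level.

```python
# (ReMALIS accuracy, comparison method, comparison accuracy) per TFP level, from Table 1.
tfp_results = {
    "Easy": (89.15, "AUTOACT", 87.89),
    "Medium": (77.62, "AUTOACT", 76.29),
    "Hard": (64.53, "DGN", 62.34),
    "Hell": (55.37, "LToS", 54.81),
}

# Margin in percentage points, rounded to two decimals to hide float noise.
margins = {
    level: round(ours - baseline, 2)
    for level, (ours, _method, baseline) in tfp_results.items()
}

for level, margin in margins.items():
    print(f"{level}: +{margin:.2f} points over {tfp_results[level][1]}")
```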

4.4 Ablation Studies

1) The Impact on Improving Multi-Agent Coordination Accuracy. We conduct ablation studies to evaluate the impact of each component within the ReMALIS framework; the results are reported in Table 2. Excluding intention propagation decreases accuracy by over 6% across both datasets, reflecting the difficulty of achieving common grounding among agents without shared local beliefs. This underscores the importance of intention sharing for emergent team behaviors.

The absence of bidirectional coordination channels leads to a 4.37% decline in performance across various metrics, illustrating the importance of execution-level signals in shaping planning and grounding strategies. Without feedback coordination, agents become less responsive to new scenarios that require re-planning.

Table 2: Ablation studies on Traffic and Web datasets

| Dataset | Method | Accuracy | BLEU | ROUGE |
|---|---|---|---|---|
| Traffic | Single Agent Baseline | 42.5% | 0.217 | 0.384 |
| Traffic | Intention Propagation | 47.3% | 0.251 | 0.425 |
| Traffic | Bidirectional Feedback | 49.8% | 0.278 | 0.461 |
| Traffic | Recursive Reasoning | 53.2% | 0.311 | 0.503 |
| Traffic | ReMALIS (Full) | 58.7% | 0.342 | 0.538 |
| Web | Single Agent Baseline | 38.9% | 0.255 | 0.416 |
| Web | Intention Propagation | 42.7% | 0.283 | 0.453 |
| Web | Bidirectional Feedback | 46.3% | 0.311 | 0.492 |
| Web | Recursive Reasoning | 50.6% | 0.345 | 0.531 |
| Web | ReMALIS (Full) | 55.4% | 0.379 | 0.567 |

Substituting recursive reasoning with convolutional and recurrent neural networks reduces contextual inference accuracy by 5.86%. Non-recursive agents display short-sighted behavior compared to the holistic reasoning enabled by recursive transformer modeling. This emphasizes that recursive architectures are vital for complex temporal dependencies.

Figure 3: Comparative performance evaluation across varying task difficulty levels for the web activities dataset, which indicates the accuracy scores achieved by ReMALIS and several state-of-the-art baselines.
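
Table 3: Ablation on agent coordination capabilities

| Method | Aligned sub-tasks (%): Easy | Aligned sub-tasks (%): Medium | Aligned sub-tasks (%): Hard | Coordination time (ms): Easy | Coordination time (ms): Medium | Coordination time (ms): Hard |
|---|---|---|---|---|---|---|
| No Communication | 31 | 23 | 17 | 592 | 873 | 1198 |
| ReAct | 42 | 34 | 29 | 497 | 732 | 984 |
| AgentLM | 48 | 39 | 32 | 438 | 691 | 876 |
| FireAct | 58 | 47 | 37 | 382 | 569 | 745 |
| Basic Propagation | 68 | 53 | 41 | 314 | 512 | 691 |
| Selective Propagation | 79 | 62 | 51 | 279 | 438 | 602 |
| Full Intention Sharing | 91 | 71 | 62 | 248 | 386 | 521 |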

2) The Impact on Improving Multi-Agent Coordination Capability. As presented in Table 3, on aligned sub-task percentage, the proposed Basic Propagation, Selective Propagation, and Full Intention Sharing methods consistently outperform baseline models like ReAct and AgentLM across varying difficulty levels (“Easy”, “Medium”, and “Hard”). For example, Full Intention Sharing achieves alignment of 91%, 71%, and 62% across these levels, respectively. These results are substantially higher than in scenarios with no communication (31%, 23%, and 17%).

Similarly, coordination time metrics exhibit major efficiency gains from intention propagation. On “Hard” tasks, Full Intention Sharing reduces coordination time to 521 ms, a reduction of about 57% from the 1198 ms required with No Communication. As task complexity increases from easy to hard, the coordination time saved relative to No Communication grows from 344 ms to 677 ms in Table 3. This indicates that intention sharing mitigates the growth of coordination delays in difficult scenarios.
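The coordination time comparison between No Communication and Full Intention Sharing can be recomputed from the Table 3 entries. The snippet below is a small arithmetic check with the values copied from Table 3; the variable names are illustrative.

```python
# Coordination time in milliseconds, copied from Table 3.
no_communication = {"Easy": 592, "Medium": 873, "Hard": 1198}
full_sharing = {"Easy": 248, "Medium": 386, "Hard": 521}

# Absolute saving (ms) and relative reduction (%) per difficulty level.
savings_ms = {
    level: no_communication[level] - full_sharing[level]
    for level in no_communication
}
reduction_pct = {
    level: round(100 * savings_ms[level] / no_communication[level], 1)
    for level in no_communication
}

# On "Hard": 1198 - 521 = 677 ms saved, roughly a 56.5% reduction.
print(savings_ms["Hard"], reduction_pct["Hard"])
```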

The propagation mechanisms also show clear incremental improvements as information sharing becomes more targeted. As agents propagate more precise intentions to relevant teammates, both sub-task alignment and coordination efficiency improve: each step from Basic Propagation to Selective Propagation to Full Intention Sharing adds further gains in Table 3.

5 Conclusion

In this paper, we introduce a novel framework, ReMALIS, designed to enhance collaborative capabilities within multi-agent systems using LLMs. Our approach incorporates three principal innovations: intention propagation for establishing a shared understanding among agents, bidirectional coordination channels to adapt reasoning processes in response to team dynamics, and recursive reasoning architectures that provide agents with advanced contextual grounding and planning capabilities necessary for complex coordination tasks. Experimental results indicate that ReMALIS significantly outperforms several baseline methods, underscoring the efficacy of cooperative multi-agent AI systems. By developing frameworks that enable LLMs to acquire cooperative skills analogous to human team members, we advance the potential for LLM agents to manage flexible coordination in complex collaborative environments effectively.

6 Limitations

While ReMALIS demonstrates promising results in collaborative multi-agent tasks, our framework relies on a centralized training paradigm, which may hinder scalability in fully decentralized environments. The current implementation does not explicitly handle dynamic agent arrival or departure during execution, which could impact coordination in real-world applications. In addition, the recursive reasoning component may struggle with long-term dependencies and planning horizons beyond a certain time frame.

References

  • Aminabadi et al. (2022) Reza Yazdani Aminabadi et al. 2022. Deepspeed-inference: Enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis.
  • Chaka (2023) Chaka Chaka. 2023. Generative ai chatbots-chatgpt versus youchat versus chatsonic: Use cases of selected areas of applied english language studies. International Journal of Learning, Teaching and Educational Research, 22(6):1–19.
  • Chebotar et al. (2023) Yevgen Chebotar et al. 2023. Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions. In Conference on Robot Learning. PMLR.
  • Chen et al. (2023) Baian Chen et al. 2023. Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915.
  • Chen et al. (2024) Yushuo Chen et al. 2024. Towards coarse-to-fine evaluation of inference efficiency for large language models. arXiv preprint arXiv:2404.11502.
  • Chiu et al. (2024) Yu Ying Chiu et al. 2024. A computational framework for behavioral assessment of llm therapists. arXiv preprint arXiv:2401.00820.
  • Dong et al. (2023) Yihong Dong et al. 2023. Codescore: Evaluating code generation by learning code execution. arXiv preprint arXiv:2301.09043.
  • Du et al. (2023) Yali Du et al. 2023. A review of cooperation in multi-agent learning. arXiv preprint arXiv:2312.05162.
  • Fan et al. (2020) Cheng Fan et al. 2020. Statistical investigations of transfer learning-based methodology for short-term building energy predictions. Applied Energy, 262:114499.
  • Foerster et al. (2018) Jakob Foerster et al. 2018. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI conference on artificial intelligence, volume 32.
  • Gao et al. (2023) Yunfan Gao et al. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.
  • Hartmann et al. (2022) Valentin N. Hartmann et al. 2022. Long-horizon multi-robot rearrangement planning for construction assembly. IEEE Transactions on Robotics, 39(1):239–252.
  • He et al. (2021) Junxian He et al. 2021. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366.
  • Hu and Sadigh (2023) Hengyuan Hu and Dorsa Sadigh. 2023. Language instructed reinforcement learning for human-ai coordination. arXiv preprint arXiv:2304.07297.
  • Huang et al. (2022) Baichuan Huang, Abdeslam Boularias, and Jingjin Yu. 2022. Parallel monte carlo tree search with batched rigid-body simulations for speeding up long-horizon episodic robot planning. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE.
  • Khamparia et al. (2021) Aditya Khamparia et al. 2021. An internet of health things-driven deep learning framework for detection and classification of skin cancer using transfer learning. Transactions on Emerging Telecommunications Technologies, 32(7):e3963.
  • Lee and Perret (2022) Irene Lee and Beatriz Perret. 2022. Preparing high school teachers to integrate ai methods into stem classrooms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36.
  • Li et al. (2020) Chuan Li et al. 2020. A systematic review of deep transfer learning for machinery fault diagnosis. Neurocomputing, 407:121–135.
  • Li et al. (2022) Weihua Li et al. 2022. A perspective survey on deep transfer learning for fault diagnosis in industrial scenarios: Theories, applications and challenges. Mechanical Systems and Signal Processing, 167:108487.
  • Loey et al. (2021) Mohamed Loey et al. 2021. A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the covid-19 pandemic. Measurement, 167:108288.
  • Lotfollahi et al. (2022) Mohammad Lotfollahi et al. 2022. Mapping single-cell data to reference atlases by transfer learning. Nature biotechnology, 40(1):121–130.
  • Lu et al. (2023) Pan Lu et al. 2023. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842.
  • Lyu et al. (2021) Xueguang Lyu et al. 2021. Contrasting centralized and decentralized critics in multi-agent reinforcement learning. arXiv preprint arXiv:2102.04402.
  • Mao et al. (2022) Weichao Mao et al. 2022. On improving model-free algorithms for decentralized multi-agent reinforcement learning. In International Conference on Machine Learning. PMLR.
  • Martini et al. (2021) Franziska Martini et al. 2021. Bot, or not? comparing three methods for detecting social bots in five political discourses. Big data & society, 8(2):20539517211033566.
  • Miao et al. (2023) Ning Miao, Yee Whye Teh, and Tom Rainforth. 2023. Selfcheck: Using llms to zero-shot check their own step-by-step reasoning. arXiv preprint arXiv:2308.00436.
  • Qiu et al. (2024) Xihe Qiu et al. 2024. Chain-of-lora: Enhancing the instruction fine-tuning performance of low-rank adaptation on diverse instruction set. IEEE Signal Processing Letters.
  • Raman et al. (2022) Shreyas Sundara Raman et al. 2022. Planning with large language models via corrective re-prompting. In NeurIPS 2022 Foundation Models for Decision Making Workshop.
  • Rana et al. (2023) Krishan Rana et al. 2023. Sayplan: Grounding large language models using 3d scene graphs for scalable task planning. arXiv preprint arXiv:2307.06135.
  • Rashid et al. (2020) Tabish Rashid et al. 2020. Weighted qmix: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. In Advances in neural information processing systems 33, pages 10199–10210.
  • Rasley et al. (2020) Jeff Rasley et al. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
  • Saber et al. (2021) Abeer Saber et al. 2021. A novel deep-learning model for automatic detection and classification of breast cancer using the transfer-learning technique. IEEE Access, 9:71194–71209.
  • Schroeder de Witt et al. (2019) Christian Schroeder de Witt et al. 2019. Multi-agent common knowledge reinforcement learning. In Advances in Neural Information Processing Systems 32.
  • Schuchard and Crooks (2021) Ross J. Schuchard and Andrew T. Crooks. 2021. Insights into elections: An ensemble bot detection coverage framework applied to the 2018 us midterm elections. Plos one, 16(1):e0244309.
  • Schumann et al. (2024) Raphael Schumann et al. 2024. Velma: Verbalization embodiment of llm agents for vision and language navigation in street view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38.
  • Shanahan et al. (2023) Murray Shanahan, Kyle McDonell, and Laria Reynolds. 2023. Role play with large language models. Nature, 623(7987):493–498.
  • Sharan et al. (2023) S. P. Sharan, Francesco Pittaluga, and Manmohan Chandraker. 2023. Llm-assist: Enhancing closed-loop planning with language-based reasoning. arXiv preprint arXiv:2401.00125.
  • Shen et al. (2020) Sheng Shen et al. 2020. Deep convolutional neural networks with ensemble learning and transfer learning for capacity estimation of lithium-ion batteries. Applied Energy, 260:114296.
  • Singh et al. (2023) Ishika Singh et al. 2023. Progprompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE.
  • Song et al. (2023) Chan Hee Song et al. 2023. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Topsakal and Akinci (2023) Oguzhan Topsakal and Tahir Cetin Akinci. 2023. Creating large language model applications utilizing langchain: A primer on developing llm apps fast. In International Conference on Applied Engineering and Natural Sciences, volume 1.
  • Valmeekam et al. (2022) Karthik Valmeekam et al. 2022. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498.
  • Wang et al. (2024a) Haoyu Wang et al. 2024a. Carbon-based molecular properties efficiently predicted by deep learning-based quantum chemical simulation with large language models. Computers in Biology and Medicine, page 108531.
  • Wang et al. (2024b) Haoyu Wang et al. 2024b. Subequivariant reinforcement learning framework for coordinated motion control. arXiv preprint arXiv:2403.15100.
  • Wang et al. (2023) Lei Wang et al. 2023. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432.
  • Wang et al. (2020) Tonghan Wang et al. 2020. Roma: Multi-agent reinforcement learning with emergent roles. arXiv preprint arXiv:2003.08039.
  • Wen et al. (2023) Hao Wen et al. 2023. Empowering llm to use smartphone for intelligent task automation. arXiv preprint arXiv:2308.15272.
  • Xi et al. (2023) Zhiheng Xi et al. 2023. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864.
  • Yao et al. (2022) Shunyu Yao et al. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
  • Yin et al. (2023) Da Yin et al. 2023. Lumos: Learning agents with unified data, modular design, and open-source llms. arXiv preprint arXiv:2311.05657.
  • Yu et al. (2023) Shengcheng Yu et al. 2023. Llm for test script generation and migration: Challenges, capabilities, and opportunities. In 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS). IEEE.
  • Zeng et al. (2023) Fanlong Zeng et al. 2023. Large language models for robotics: A survey. arXiv preprint arXiv:2311.07226.
  • Zhang and Gao (2023) Xuan Zhang and Wei Gao. 2023. Towards llm-based fact verification on news claims with a hierarchical step-by-step prompting method. arXiv preprint arXiv:2310.00305.
  • Zhao et al. (2024) Andrew Zhao et al. 2024. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38.
  • Zhu et al. (2023) Zhuangdi Zhu et al. 2023. Transfer learning in deep reinforcement learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Zhuang et al. (2020) Fuzhen Zhuang et al. 2020. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76.
  • Zimmer et al. (2021a) Matthieu Zimmer et al. 2021a. Learning fair policies in decentralized cooperative multi-agent reinforcement learning. In International Conference on Machine Learning. PMLR.
  • Zimmer et al. (2021b) Matthieu Zimmer et al. 2021b. Learning fair policies in decentralized cooperative multi-agent reinforcement learning. In International Conference on Machine Learning. PMLR.
  • Zou et al. (2023) Hang Zou et al. 2023. Wireless multi-agent generative ai: From connected intelligence to collective intelligence. arXiv preprint arXiv:2307.02757.

Appendix A Related Work

A.1 Single Agent Frameworks

Early agent frameworks such as Progprompt Singh et al. (2023) directly prompt large language models (LLMs) to plan, execute actions, and process feedback in a chained manner within one model Song et al. (2023). Despite its conceptual simplicity Valmeekam et al. (2022), an integrated framework imposes a substantial burden on a single LLM, leading to challenges in managing complex tasks Raman et al. (2022); Wang et al. (2024a).

To reduce the reasoning burden, recent works explore modular designs that separate high-level planning from low-level execution. For example, LUMOS Yin et al. (2023) consists of a planning module, a grounding module, and an execution module. The planning and grounding modules break down complex tasks into interpretable sub-goals and executable actions. FiReAct Chen et al. (2023) introduces a similar hierarchical structure, with a focus on providing step-by-step explanations Zhang and Gao (2023). Although partitioning a framework into modules that specialize in different skills is reasonable, existing modular frameworks still rely on a single agent for final action execution Miao et al. (2023); Qiu et al. (2024). Our work pushes this idea further by replacing the single execution agent with a cooperative team of multiple agents.

A.2 Multi-Agent Reinforcement Learning

Collaborative multi-agent reinforcement learning has been studied to solve complex control or game-playing tasks. Representative algorithms include COMA Foerster et al. (2018), QMIX Rashid et al. (2020) and ROMA Wang et al. (2020). These methods enable decentralized execution of different agents but allow centralized training by sharing experiences or parameters Lyu et al. (2021). Drawing on this concept, our ReMALIS framework places greater emphasis on integrating modular LLMs to address complex language tasks. In ReMALIS, each execution agent specializes in specific semantic domains such as query, computation, or retrieval, and is coordinated through a communication module Mao et al. (2022).

The concept of multi-agent RL has recently influenced the design of conversational agents Zimmer et al. (2021a); Schumann et al. (2024). EnsembleBot Schuchard and Crooks (2021) utilizes multiple bots trained on distinct topics, coordinated by a routing model. However, this approach primarily employs a divide-and-conquer strategy with independent skills Martini et al. (2021), and communication within EnsembleBot predominantly involves one-way dispatching rather than bidirectional coordination. In contrast, our work focuses on fostering a more tightly integrated collaborative system for addressing complex problems Schroeder de Witt et al. (2019); Zimmer et al. (2021b).

A.3 Integrated & Collaborative Learning

Integrated learning techniques originate from transfer learning Zhuang et al. (2020); Zhu et al. (2023), aiming to improve a target model by incorporating additional signals from other modalities Lotfollahi et al. (2022); Shanahan et al. (2023). For multi-agent systems, Li et al. (2022) and Zhao et al. (2024) find that jointly training multiple agents boosts performance over training independent agents separately Lee and Perret (2022). Recently, integrated learning has been used in single-agent frameworks such as Shen et al. (2020) and Loey et al. (2021), where auxiliary losses on interpretable outputs facilitate training of the main model through multi-task learning Khamparia et al. (2021); Saber et al. (2021).

Our work adopts integrated learning to train specialized execution agents that are semantically consistent. At the team level, a communication module learns to attentively aggregate and propagate messages across agents, which indirectly coordinates their strategies and behaviors Fan et al. (2020). Integrated and collaborative learning combines individual skills and gives rise to emergent collective intelligence, enhancing overall reasoning and planning capabilities on complex tasks He et al. (2021); Li et al. (2020).

Appendix B Methodology and Contributions

Based on the motivations above, we propose the recursive multi-agent learning with intention sharing (ReMALIS) framework, a multi-agent framework that uses integrated learning for communication and collaboration. The main contributions are:

1. We design a cooperative execution module with multiple agents trained by integrated learning. Different execution agents specialize in different semantic domains while understanding peer abilities, which reduces redundant capabilities and yields a more efficient division of labor.

2. We propose an attentive communication module that propagates informative cues across specialized agents. The module coordinates agent execution strategies without explicit supervision, acting as a team leader.

3. The collaborative design allows ReMALIS to handle more complex tasks than single-agent counterparts. Each agent focuses on its own domain knowledge while collaborating closely through communicative coordination, leading to strong emergent team intelligence.

4. We enable dynamic feedback loops from communication to the grounding module and re-planning in the planning module, which increases adaptability when execution difficulties arise.

We expect the idea of integrating specialized collaborative agents with dynamic coordination mechanisms to inspire more future research toward developing intelligent collaborative systems beyond conversational agents.

Appendix C Key variables and symbols

Table 4: Key variables and symbols in the proposed recursive multi-agent learning framework.

| Symbol | Description |
| --- | --- |
| $p_{\theta}$ | Planning module parameterized by $\theta$ |
| $s_t$ | Current sub-goal at time $t$ |
| $I_t$ | Current intention at time $t$ |
| $e_t$ | Grounded embedding at time $t$ |
| $f_t$ | Agent feedback at time $t$ |
| $g_{\phi}$ | Grounding module parameterized by $\phi$ |
| $\pi_{\xi_i}$ | Execution policy of agent $i$ parameterized by $\xi_i$ |
| $f_{\Lambda}$ | Intention propagation channel parameterized by $\Lambda$ |
| $m_{ij}$ | Message sent from agent $j$ to agent $i$ |
| $b_i(I_j \mid m_{ij})$ | Agent $i$’s belief over teammate $j$’s intention $I_j$ given message $m_{ij}$ |
| $R_c$ | Coordination reward |
| $\pi_{\xi}(a_i \mid s_i, I_i)$ | Execution agent policy conditioned on state $s_i$ and intention $I_i$ |
| $a_i$ | Action of agent $i$ |
| $s_i$ | State of agent $i$ |
| $I_i = (\gamma_i, \Sigma_i, \pi_i, \delta_i)$ | Intention of agent $i$ |
| $\gamma_i$ | Current goal of agent $i$ |
| $\Sigma_i = \{\sigma_{i1}, \sigma_{i2}, \ldots\}$ | Set of sub-goals for agent $i$ |
| $\pi_i(\sigma)$ | Probability distribution over possible next sub-goals for agent $i$ |
| $\delta_i(\sigma)$ | Desired teammate assignment for sub-goal $\sigma$ of agent $i$ |

Table 4 summarizes the key variables and symbols used in the proposed recursive multi-agent learning framework called ReMALIS. It includes symbols representing various components like the planning module, grounding module, execution policies, intentions, goals, sub-goals, and the intention propagation channel.
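To make the intention tuple $I_i = (\gamma_i, \Sigma_i, \pi_i, \delta_i)$ from Table 4 concrete, a minimal Python sketch of one possible container is given below. The class name, field names, and example values are illustrative assumptions and are not taken from the ReMALIS implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Intention:
    """Hypothetical container for I_i = (gamma_i, Sigma_i, pi_i, delta_i)."""
    goal: str                              # gamma_i: current goal of the agent
    sub_goals: List[str]                   # Sigma_i: set of sub-goals
    next_sub_goal_probs: Dict[str, float]  # pi_i(sigma): distribution over next sub-goals
    teammate_assignment: Dict[str, int] = field(default_factory=dict)  # delta_i(sigma)

    def validate(self) -> None:
        """Check that pi_i is a probability distribution over known sub-goals."""
        unknown = set(self.next_sub_goal_probs) - set(self.sub_goals)
        if unknown:
            raise ValueError(f"probabilities given for unknown sub-goals: {unknown}")
        total = sum(self.next_sub_goal_probs.values())
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"pi_i must sum to 1, got {total}")

    def most_likely_sub_goal(self) -> str:
        """Return the sub-goal with the highest probability under pi_i."""
        return max(self.next_sub_goal_probs, key=self.next_sub_goal_probs.get)


# Illustrative values only; they do not come from the paper's experiments.
intention = Intention(
    goal="clear congestion at intersection A",
    sub_goals=["extend green phase", "notify neighbour"],
    next_sub_goal_probs={"extend green phase": 0.7, "notify neighbour": 0.3},
    teammate_assignment={"notify neighbour": 2},
)
intention.validate()
```

Keeping $\delta_i$ as a mapping from sub-goal to teammate identifier is one simple reading of "desired teammate assignment"; other encodings are equally compatible with the table.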

Table 5: Comparison of Traffic Network Complexity Levels

| Difficulty Level | Grid Size | Intersections | Arrival Rates | Phases per Intersection |
| --- | --- | --- | --- | --- |
| Easy | 3x3 | 9 | Low and stable (0.5 vehicles/s) | Less than 10 |
| Medium | 5x5 | 25 | Fluctuating (0.5-2 vehicles/s) | 10-15 |
| Hard | 8x8 | 64 | Highly dynamic (0.1 to 3 vehicles/s) | More than 15 |
| Hell | Irregular | 100+ | Extremely dynamic with spikes | >25 |

Table 6: Training hyperparameters and configurations

| Hyperparameter/Configuration | ReMALIS | LUMOS | AgentLM | GPT-3.5 |
| --- | --- | --- | --- | --- |
| Language Model Size | 7B | 13B | 6B | 175B |
| Optimizer | AdamW | Adam | AdamW | Adam |
| Learning Rate | 1e-4 | 2e-5 | 1e-4 | 2e-5 |
| Batch Size | 32 | 64 | 32 | 64 |
| Dropout | 0 | 0.1 | 0 | 0.1 |
| Number of Layers | 12 | 8 | 6 | 48 |
| Model Dimension | 768 | 512 | 768 | 1024 |
| Number of Heads | 12 | 8 | 12 | 16 |
| Training Epochs | 15 | 20 | 10 | 20 |
| Warmup Epochs | 1 | 2 | 1 | 2 |
| Weight Decay | 0.01 | 0.001 | 0.01 | 0.001 |
| Network Architecture | GNN | Transformer | Transformer | Transformer |
| Planning Module | GNN, 4 layers, 512 hidden size | 2-layer GNN, 1024 hidden size | - | - |
| Grounding Module | 6-layer Transformer, $d_{\text{model}}=768$ | 4-layer Transformer, $d_{\text{model}}=512$ | - | - |
| Execution Agents | 7 specialized, integrated training | Single agent | 8 agent | 4 agent |
| Intention Propagation | 4-layer GRU, 256 hidden size | - | - | - |
| Coordination Feedback | GAT, 2 heads, $\alpha=0.2$ | - | - | - |
| Trainable Parameters | 5.37B | 6.65B | 4.61B | 17.75B |

Appendix D Tasks Setup

D.1 Traffic Control

We define four levels of difficulty for our traffic control tasks (Easy, Medium, Hard, and Hell), as summarized in Table 5.
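For readers who want to script against these levels, the rows of Table 5 could be held in a small lookup structure such as the sketch below. The names `TrafficLevel` and `TRAFFIC_LEVELS` are hypothetical, and fields are left as `None` wherever Table 5 gives no number (for example, the arrival rate of the Hell level).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TrafficLevel:
    grid_size: Optional[Tuple[int, int]]               # None for the irregular Hell network
    min_intersections: int                             # Table 5 lists "100+" for Hell
    arrival_rate_range: Optional[Tuple[float, float]]  # vehicles/s, None if not stated
    phases: str                                        # phases per intersection, as stated


TRAFFIC_LEVELS = {
    "Easy": TrafficLevel((3, 3), 9, (0.5, 0.5), "<10"),
    "Medium": TrafficLevel((5, 5), 25, (0.5, 2.0), "10-15"),
    "Hard": TrafficLevel((8, 8), 64, (0.1, 3.0), ">15"),
    "Hell": TrafficLevel(None, 100, None, ">25"),
}


def grid_matches_intersections(level: TrafficLevel) -> bool:
    """For regular grids, rows * cols should equal the intersection count."""
    if level.grid_size is None:
        return True
    rows, cols = level.grid_size
    return rows * cols == level.min_intersections
```

The consistency check simply confirms that each regular grid in Table 5 has one intersection per grid cell.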

D.2 Web Tasks

Similarly, we categorize the web tasks in our dataset into four levels of difficulty: Easy, Medium, Hard, and All.

Easy: The easy web tasks involve basic interactions like clicking on a single link or typing a short phrase. They require navigating simple interfaces with clear options to reach the goal.

Medium: The medium-difficulty tasks demand more complex sequences of actions across multiple pages, such as selecting filters or submitting forms. They test the agent’s ability to understand the site structure and flow.

Hard: The hard web tasks feature more open-ended exploration through dense sites with ambiguity. Significant reasoning is needed to chain obscure links and controls to achieve aims.

All: The All level combines tasks across the full spectrum of difficulty. Simple and complex interactions are blended to assess generalized web agent skills. Performance at this level correlates with readiness for real-world web use cases.

Appendix E Experimental Setups

In this study, we compare the performance of several state-of-the-art language models, including ReMALIS, LUMOS, AgentLM, and GPT-3.5. These models vary in size, architecture, and training configuration, reflecting the diversity of approaches in the field of natural language processing. Table 6 lists the details.

ReMALIS is a 7 billion parameter model trained using the AdamW optimizer with a learning rate of 1e-4, a batch size of 32, and no dropout. It has 12 layers, a model dimension of 768, and 12 attention heads. The model was trained for 15 epochs with a warmup period of 1 epoch and a weight decay of 0.01. ReMALIS employs a Graph Neural Network (GNN) architecture, which is particularly suited for modeling complex relationships and structures.

LUMOS, a larger model with 13 billion parameters, was trained using the Adam optimizer with a learning rate of 2e-5, a batch size of 64, and a dropout rate of 0.1. It has 8 layers, a model dimension of 512, and 8 attention heads. The model was trained for 20 epochs with a warmup period of 2 epochs and a weight decay of 0.001. LUMOS follows a Transformer architecture, which has proven effective in capturing long-range dependencies in sequential data.

AgentLM, a 6 billion parameter model, was trained using the AdamW optimizer with a learning rate of 1e-4, a batch size of 32, and no dropout. It has 6 layers, a model dimension of 768, and 12 attention heads. The model was trained for 10 epochs with a warmup period of 1 epoch and a weight decay of 0.01. AgentLM also uses a Transformer architecture.

GPT-3.5, the largest model in this study with 175 billion parameters, was trained using the Adam optimizer with a learning rate of 2e-5, a batch size of 64, and a dropout rate of 0.1. It has 48 layers, a model dimension of 1024, and 16 attention heads. The model was trained for 20 epochs with a warmup period of 2 epochs and a weight decay of 0.001. GPT-3.5 follows the Transformer architecture, which has been widely adopted for large language models.

In addition to the base language models, the table provides details on the specialized modules and configurations employed by ReMALIS and LUMOS. ReMALIS incorporates a planning module with a 4-layer GNN and a 512 hidden size, a grounding module with a 6-layer Transformer and a model dimension of 768, 7 specialized and integrated execution agents, a 4-layer Gated Recurrent Unit (GRU) with a 256 hidden size for intention propagation, and a Graph Attention Network (GAT) with 2 heads and an alpha value of 0.2 for coordination feedback.

LUMOS, on the other hand, employs a 2-layer GNN with a 1024 hidden size for planning, a 4-layer Transformer with a model dimension of 512 for grounding, and a single integrated execution agent.
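The base-model settings described above can be gathered into a configuration object, as in the following sketch for the ReMALIS and LUMOS columns of Table 6. The `TrainingConfig` class and its field names are assumptions made for illustration. The `head_dim` helper applies the usual multi-head attention convention (model dimension divided by number of heads), which the source does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    optimizer: str
    learning_rate: float
    batch_size: int
    dropout: float
    num_layers: int
    model_dim: int
    num_heads: int
    epochs: int
    warmup_epochs: int
    weight_decay: float

    def head_dim(self) -> int:
        """Per-head dimension under the standard multi-head convention (an assumption)."""
        if self.model_dim % self.num_heads != 0:
            raise ValueError("model_dim must be divisible by num_heads")
        return self.model_dim // self.num_heads


# Values copied from the ReMALIS and LUMOS columns of Table 6.
REMALIS = TrainingConfig(
    optimizer="AdamW", learning_rate=1e-4, batch_size=32, dropout=0.0,
    num_layers=12, model_dim=768, num_heads=12,
    epochs=15, warmup_epochs=1, weight_decay=0.01,
)
LUMOS = TrainingConfig(
    optimizer="Adam", learning_rate=2e-5, batch_size=64, dropout=0.1,
    num_layers=8, model_dim=512, num_heads=8,
    epochs=20, warmup_epochs=2, weight_decay=0.001,
)
```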

Appendix F Pseudo-code

Algorithm 2 presents the hierarchical planning and grounding processes in the proposed recursive multi-agent learning framework. The planning module $p_{\theta}$ takes the current sub-goal $s_t$, intention $I_t$, grounded embedding $e_t$, and feedback $f_t$ as inputs, and predicts the next sub-goal $s_{t+1}$. It first encodes the inputs using an encoder, and then passes the encoded representation through a graph neural network $T_{\theta}$ parameterized by $\theta$. The output of $T_{\theta}$ is passed through a softmax layer to obtain the probability distribution over the next sub-goal.
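The planning step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the encoder is replaced by simple concatenation, the graph neural network $T_{\theta}$ by a single linear layer with random weights, and all function names and input values are hypothetical.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def encode(sub_goal: Sequence[float], intention: Sequence[float],
           embedding: Sequence[float], feedback: Sequence[float]) -> List[float]:
    """Stand-in encoder: concatenate (s_t, I_t, e_t, f_t) into h_t."""
    return [*sub_goal, *intention, *embedding, *feedback]


def plan_next_sub_goal(h_t: Sequence[float],
                       weights: Sequence[Sequence[float]]) -> List[float]:
    """Stand-in for Softmax(T_theta(h_t)); one linear layer replaces the GNN."""
    logits = [sum(w * h for w, h in zip(row, h_t)) for row in weights]
    return softmax(logits)


rng = random.Random(0)
h_t = encode([1.0, 0.0], [0.5], [0.2, 0.1], [0.0])
# One weight row per candidate sub-goal (three candidates here).
weights = [[rng.uniform(-1.0, 1.0) for _ in h_t] for _ in range(3)]
probs = plan_next_sub_goal(h_t, weights)
next_sub_goal = max(range(len(probs)), key=probs.__getitem__)
```

Here `probs` plays the role of the distribution over the next sub-goal, and `next_sub_goal` is the index of its most likely entry.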

The grounding module $g_{\phi}$ takes the current state $s_t$, intention $I_t$, and feedback trajectory $f_{1:t}$ as inputs, and produces the grounded embedding $e_t$. It encodes the inputs using an encoder, and then applies cross-attention over the vocabulary $V$, followed by a convolutional feature extractor. The output is combined with agent feedback $P_t$ to enhance the grounding accuracy. The grounding module is parameterized by $\phi$.
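The computation $e_t = \text{Conv}(\text{Attn}(h_t, V)) + P_t$ can be sketched in pure Python as follows. The scaled dot-product form of the attention, the kernel size, and the zero padding are assumptions chosen for a small runnable example, and the toy vocabulary and feedback vectors are made up.

```python
import math
from typing import List, Sequence


def cross_attention(query: Sequence[float],
                    vocab: Sequence[Sequence[float]]) -> List[float]:
    """Attend from h_t (query) over vocabulary embeddings V (keys and values)."""
    d = len(query)
    scores = [sum(q * v for q, v in zip(query, row)) / math.sqrt(d) for row in vocab]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    return [sum(a * row[k] for a, row in zip(attn, vocab)) for k in range(d)]


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1-D cross-correlation with zero padding; expects an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]


def ground(h_t: Sequence[float], vocab: Sequence[Sequence[float]],
           kernel: Sequence[float], p_t: Sequence[float]) -> List[float]:
    """e_t = Conv(Attn(h_t, V)) + P_t."""
    context = cross_attention(h_t, vocab)
    features = conv1d_same(context, kernel)
    return [f + p for f, p in zip(features, p_t)]


h_t = [1.0, 0.0, 0.0, 0.0]
vocab = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
identity_kernel = [0.0, 1.0, 0.0]  # leaves the attended context unchanged
p_t = [0.1, 0.1, 0.1, 0.1]
e_t = ground(h_t, vocab, identity_kernel, p_t)
```

With the identity kernel, `e_t` is simply the attended context shifted by the feedback term, which makes the contribution of each stage easy to inspect.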

Algorithm 3 describes the intention propagation mechanism in the proposed recursive multi-agent learning framework. The goal is for each agent $i$ to infer a belief $b_i(I_j \mid m_{ij})$ over the intention $I_j$ of a teammate $j$, given a message $m_{ij}$ received from $j$.

Algorithm 2 Hierarchical Planning and Grounding
1:  Input: Current sub-goal $s_t$, intention $I_t$, grounded embedding $e_t$, feedback $f_t$
2:  Output: Next sub-goal $s_{t+1}$
3:  $h_t = \text{Encoder}(s_t, I_t, e_t, f_t)$ {Encode inputs}
4:  $s_{t+1} = \text{Softmax}(T_{\theta}(h_t))$ {Predict next sub-goal}
5:  $T_{\theta}$ is a graph neural network parameterized by $\theta$ {Planning module $p_{\theta}$}
6:  Input: Current state $s_t$, intention $I_t$, feedback $f_{1:t}$
7:  Output: Grounded embedding $e_t$
8:  $h_t = \text{Encoder}(s_t, I_t, f_{1:t})$ {Encode inputs}
9:  $e_t = \text{Conv}(\text{Attn}(h_t, V)) + P_t$ {Grounded embedding}
10:  $\text{Attn}(\cdot,\cdot)$ is a cross-attention layer over vocabulary $V$
11:  $\text{Conv}(\cdot)$ is a convolutional feature extractor
12:  $P_t$ includes agent feedback to enhance grounding accuracy
13:  $g_{\phi}$ is the grounding module parameterized by $\phi$

Algorithm 3 initializes an intention propagation channel $f_{\Lambda}$, parameterized by $\Lambda$, which is implemented as a recurrent neural network.

The intention inference process works as follows:

  1. The received message $m_{ij}$ is encoded using an encoder to obtain a representation $h_{ij}$.

  2. The encoded message $h_{ij}$ is passed through the propagation channel $f_{\Lambda}$ to infer the belief $b_i(I_j \mid m_{ij})$ over teammate $j$’s intention $I_j$.

The objective is to train the parameters $\Lambda$ of the propagation channel $f_{\Lambda}$ to maximize the coordination reward $R_c$ over intentions $I$ and messages $m$ sampled from the distribution defined by $f_{\Lambda}$.
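The two inference steps can be sketched as below. This is an illustrative stand-in rather than the ReMALIS channel: a plain Elman-style recurrent cell replaces the GRU listed in Table 6, the encoder is a hypothetical one-hot embedding of token ids, the weights are random and untrained, and the belief is taken over a fixed set of candidate intentions.

```python
import math
import random
from typing import List, Sequence


def matvec(matrix: Sequence[Sequence[float]], vec: Sequence[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def encode_message(tokens: Sequence[int], vocab_size: int) -> List[List[float]]:
    """Hypothetical encoder: h_ij is a sequence of one-hot token vectors."""
    return [[1.0 if k == t else 0.0 for k in range(vocab_size)] for t in tokens]


class PropagationChannel:
    """Elman-style recurrent stand-in for f_Lambda (untrained, random weights)."""

    def __init__(self, input_dim: int, hidden_dim: int,
                 num_intentions: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> List[List[float]]:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_in = init(hidden_dim, input_dim)
        self.w_rec = init(hidden_dim, hidden_dim)
        self.w_out = init(num_intentions, hidden_dim)
        self.hidden_dim = hidden_dim

    def belief(self, encoded_message: Sequence[Sequence[float]]) -> List[float]:
        """Return b_i(I_j | m_ij) as a distribution over candidate intentions."""
        state = [0.0] * self.hidden_dim
        for token in encoded_message:
            pre = [a + b for a, b in zip(matvec(self.w_in, token),
                                         matvec(self.w_rec, state))]
            state = [math.tanh(x) for x in pre]
        return softmax(matvec(self.w_out, state))


channel = PropagationChannel(input_dim=5, hidden_dim=4, num_intentions=3)
h_ij = encode_message([0, 3, 1], vocab_size=5)
belief = channel.belief(h_ij)
```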

Algorithm 3 Intention Propagation Mechanism
Input: Current intention $I_i$ of agent $i$, message $m_{ij}$ from teammate $j$
Output: Belief $b_i(I_j \mid m_{ij})$ over teammate $j$’s intention $I_j$
1:  Initialization:
2:  Intention propagation channel $f_{\Lambda}$ parameterized by $\Lambda$
3:  $f_{\Lambda}$ is a recurrent neural network
4:  Intention Inference:
5:  Encode message: $h_{ij} \leftarrow \text{Encoder}(m_{ij})$
6:  Infer intention belief: $b_i(I_j \mid m_{ij}) \leftarrow f_{\Lambda}(h_{ij})$
7:  Objective:
8:  Sample intentions $I$ and messages $m$ from $f_{\Lambda}$
9:  Maximize coordination reward $R_c$ over intentions and messages:
10:  $\Lambda^{*} \leftarrow \arg\max_{\Lambda} \mathbb{E}_{I, m \sim f_{\Lambda}}[R_c(I, m)]$
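The source does not specify how the maximization over $\Lambda$ is carried out. One common option, given here only as an assumption, is the score-function (REINFORCE) estimator. It requires that $f_{\Lambda}$ define a differentiable density over $(I, m)$ and that $R_c$ not depend on $\Lambda$ directly. The number of samples $K$ is a hypothetical quantity introduced for this illustration.

```latex
\nabla_{\Lambda}\, \mathbb{E}_{I, m \sim f_{\Lambda}}\big[ R_c(I, m) \big]
  = \mathbb{E}_{I, m \sim f_{\Lambda}}\big[ R_c(I, m)\, \nabla_{\Lambda} \log f_{\Lambda}(I, m) \big]
  \approx \frac{1}{K} \sum_{k=1}^{K} R_c\big(I^{(k)}, m^{(k)}\big)\,
          \nabla_{\Lambda} \log f_{\Lambda}\big(I^{(k)}, m^{(k)}\big),
  \qquad (I^{(k)}, m^{(k)}) \sim f_{\Lambda}.
```

Under these assumptions, gradient ascent on $\Lambda$ with this estimate moves toward the maximizer $\Lambda^{*}$ in line 10 of Algorithm 3.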

Algorithm 4 Bidirectional Coordination
Input: Experience tuples $(s_t, a_t, r_t, s_{t+1})$ for all agents
Output: Execution policies $\pi_{\xi_i}(a_i \mid s_i, I_i)$ and coordination feedback $c_t$
1:  Execution Policy:
2:  for each agent $i$ do
3:     Get agent state $s_{i,t}$ and intention $I_{i,t}$
4:     $a_{i,t} \sim \pi_{\xi_i}(a_i \mid s_{i,t}, I_{i,t})$ {Execution policy}
5:  end for
6:  Coordination Feedback:
7:  Collect execution encodings $h^{exec}_t = [\phi_1(o_1), \ldots, \phi_N(o_N)]$ {Encode observations}
8:  $c_t \leftarrow \Phi(h^{exec}_t)$ {Summarize coordination patterns}
9:  Objective:
10:  Maximize team reward $R$ and auxiliary loss $L_{aux}$:
11:  $\xi^{*} \leftarrow \arg\max_{\xi} \mathbb{E}_{(s,a) \sim \pi_{\xi}}[R + \lambda L_{aux}]$

Algorithm 4 describes the bidirectional coordination mechanism in the proposed recursive multi-agent learning framework. It involves executing actions based on the agents’ policies and generating coordination feedback from the execution experiences.

The algorithm takes experience tuples $(s_t, a_t, r_t, s_{t+1})$ for all agents as input, where $s_t$ is the state, $a_t$ is the action taken, $r_t$ is the reward received, and $s_{t+1}$ is the next state.

The execution policy part works as follows:

  1. For each agent $i$, get the agent’s state $s_{i,t}$ and intention $I_{i,t}$.

  2. Sample an action $a_{i,t}$ from the execution policy $\pi_{\xi_i}(a_i | s_{i,t}, I_{i,t})$, parameterized by $\xi_i$.
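The two execution steps can be illustrated with a minimal sketch. The paper does not specify the policy's functional form, so the linear softmax policy, the helper names (`policy_probs`, `sample_action`), and all numeric values below are assumptions made purely for illustration.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(xi_i, state, intention):
    """Hypothetical linear policy: one weight row per action over [state; intention]."""
    features = list(state) + list(intention)
    logits = [sum(w * f for w, f in zip(row, features)) for row in xi_i]
    return softmax(logits)


def sample_action(xi_i, state, intention, rng):
    """Sample a_{i,t} from pi_{xi_i}(. | s_{i,t}, I_{i,t})."""
    probs = policy_probs(xi_i, state, intention)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)

# Two agents, 2-dim state, 1-dim intention, 3 actions each (illustrative values).
xi = [
    [[0.5, -0.2, 1.0], [0.1, 0.3, -0.5], [0.0, 0.0, 0.0]],
    [[-1.0, 0.4, 0.2], [0.6, 0.6, 0.6], [0.2, -0.3, 0.9]],
]
states = [[1.0, 0.0], [0.5, 0.5]]
intentions = [[1.0], [0.0]]

# Step 1: read each agent's state and intention; step 2: sample its action.
actions = [sample_action(xi[i], states[i], intentions[i], rng) for i in range(2)]
```

Conditioning on the intention simply means the intention vector is part of the policy input, so a change in $I_{i,t}$ shifts the action distribution even when the state is unchanged.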

The coordination feedback part works as follows:

  1. Collect execution encodings $h^{exec}_{t} = [\phi_1(o_1), \ldots, \phi_N(o_N)]$ by encoding the observations $o_i$ of each agent $i$ using an encoder $\phi_i$.

  2. Summarize the coordination patterns $c_t$ from the execution encodings $h^{exec}_{t}$ using a function $\Phi$.
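As a rough sketch of these two feedback steps: the paper leaves the encoders $\phi_i$ and the summary function $\Phi$ unspecified, so the linear encoder and the element-wise mean used here are stand-ins chosen for simplicity, not the authors' implementation.

```python
def encode(observation, weights):
    """Hypothetical encoder phi_i: a linear map from an observation to an embedding."""
    return [sum(w * o for w, o in zip(row, observation)) for row in weights]


def summarize(encodings):
    """Stand-in for Phi: element-wise mean over all agents' encodings."""
    n = len(encodings)
    return [sum(dim) / n for dim in zip(*encodings)]


# Three agents with 2-dim observations (illustrative values).
observations = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Identity weights keep the example easy to check by hand.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoders = [identity, identity, identity]

# Step 1: h_exec = [phi_1(o_1), ..., phi_N(o_N)]
h_exec = [encode(o, w) for o, w in zip(observations, encoders)]

# Step 2: c_t = Phi(h_exec)
c_t = summarize(h_exec)
print(c_t)  # [3.0, 4.0]
```

Mean pooling is permutation-invariant and independent of team size, which is one reason it is a common default for aggregating per-agent embeddings; a learned attention or recurrent summary would fit the same interface.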

The objective is to maximize the team reward $R$ and an auxiliary loss $L_{aux}$ by optimizing the execution policy parameters $\xi$. The auxiliary loss $L_{aux}$, weighted by the coefficient $\lambda$, incorporates additional regularization or constraints.
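To make the role of $\lambda$ concrete, the sketch below estimates $\mathbb{E}[R + \lambda L_{aux}]$ from sampled episodes and picks the best of a small set of candidate parameter settings. The paper does not state which optimizer is used; the exhaustive comparison over a finite candidate set is an assumption that stands in for whatever gradient-based procedure is actually applied, and the rollout numbers are invented.

```python
def empirical_objective(episodes, lam):
    """Monte Carlo estimate of E[R + lam * L_aux] over sampled episodes."""
    return sum(R + lam * L_aux for R, L_aux in episodes) / len(episodes)


def select_parameters(rollouts_by_candidate, lam):
    """Return the candidate label with the highest empirical objective."""
    return max(
        rollouts_by_candidate,
        key=lambda name: empirical_objective(rollouts_by_candidate[name], lam),
    )


# Hypothetical rollouts: each episode is a (team reward R, auxiliary term L_aux) pair.
rollouts = {
    "xi_a": [(1.0, 0.5), (2.0, 0.5)],  # higher reward, lower auxiliary term
    "xi_b": [(1.0, 2.0), (1.5, 2.0)],  # lower reward, higher auxiliary term
}

best_small_lam = select_parameters(rollouts, lam=0.1)
best_large_lam = select_parameters(rollouts, lam=1.0)
print(best_small_lam, best_large_lam)  # xi_a xi_b
```

With a small $\lambda$ the team reward dominates the choice, while a larger $\lambda$ lets the auxiliary term decide it, which is the trade-off the coefficient controls.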

The bidirectional coordination mechanism allows execution agents to act based on their policies and intentions, while also generating coordination feedback $c_t$ that summarizes the emerging coordination patterns. This feedback can be used to guide the planning and grounding modules in the recursive multi-agent learning framework.

Appendix G Discussion

The results demonstrate the efficacy of the proposed ReMALIS framework in enabling coordinated multi-agent collaboration for complex tasks. By propagating intentions between agents, establishing bidirectional feedback channels, and integrating recursive reasoning architectures, ReMALIS outperformed single-agent baselines and concurrent methods across difficulty levels on both the traffic flow prediction and web activities datasets.

The performance gains highlight the importance of fostering a shared understanding of goals and sub-tasks among agents through intention propagation. Communicating local beliefs allows agents to align their actions towards common objectives, leading to emergent coordinated behaviors that reduce misaligned sub-tasks and miscoordination errors. Furthermore, the bidirectional feedback channels play a crucial role in shaping the reasoning strategies of the planning and grounding modules based on the coordination patterns observed during execution. This adaptability enables the agents to adjust their comprehension and planning policies dynamically, resulting in more flexible and responsive behaviors.

The integration of recursive reasoning architectures also contributes to the superior performance of ReMALIS. By modeling the intentions and strategies of other agents, the execution agents can engage in more contextual and holistic reasoning, enhancing their ability to handle complex temporal dependencies and long-term planning horizons. This recursive reasoning capability further amplifies the benefits of intention propagation and bidirectional feedback, as agents can better interpret and leverage the shared information and coordination signals.

It is important to note that while ReMALIS demonstrates substantial improvements over single-agent frameworks, there are still limitations and potential areas for further research. For instance, the current implementation relies on a centralized training paradigm, which may hinder scalability in fully decentralized environments. Additionally, the framework does not explicitly handle dynamic agent arrival or departure during execution, which could impact coordination in real-world applications with fluid team compositions.

Future work could explore decentralized training approaches that maintain the benefits of multi-agent collaboration while addressing scalability concerns. Moreover, developing mechanisms to adaptively handle changes in the agent team during execution could enhance the robustness and flexibility of the framework in dynamic environments.

Appendix H Supplementary application description of the overall framework

To further illustrate the practical applicability and versatility of our proposed ReMALIS framework, we present a supplementary application scenario. Figure 2 depicts a high-level overview of how ReMALIS can be employed in a real-world setting to tackle complex, multi-step tasks that require orchestrating multiple agents with diverse capabilities. This exemplary use case demonstrates the framework’s ability to decompose intricate problems into manageable sub-tasks, dynamically allocate appropriate agents, and seamlessly coordinate their actions to achieve the overarching goal efficiently and effectively.

Planning Module (Figure 4):

  1. Analyze the current traffic conditions, including vehicle counts, road incidents, and construction zones.

  2. Identify intersections experiencing congestion and potential bottlenecks.

  3. Formulate high-level goals to alleviate congestion and optimize traffic flow.

  4. Break down the goals into a sequence of subgoals and subtasks.

  5. Determine the dependencies and coordination needs between subtasks.

  6. Plan the assignment of subtasks to specialized execution agents based on their expertise.
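Steps 4 to 6 above amount to ordering subtasks by their dependencies and then assigning each one to an agent. The following sketch uses Python's standard `graphlib.TopologicalSorter` for the ordering; the subtask names, the dependency graph, and the expertise table are hypothetical, and in ReMALIS the planning module is a learned LLM component rather than a fixed graph routine.

```python
from graphlib import TopologicalSorter

# Hypothetical subtasks, each mapped to the set of subtasks it depends on.
dependencies = {
    "count_vehicles": set(),
    "detect_incidents": set(),
    "identify_bottlenecks": {"count_vehicles", "detect_incidents"},
    "retime_signals": {"identify_bottlenecks"},
    "reroute_traffic": {"identify_bottlenecks"},
}

# Hypothetical mapping from subtask to the specialized agent that handles it.
expertise = {
    "count_vehicles": "monitor_agent",
    "detect_incidents": "monitor_agent",
    "identify_bottlenecks": "analysis_agent",
    "retime_signals": "signal_agent",
    "reroute_traffic": "routing_agent",
}

# Order subtasks so that every dependency comes before the subtask that needs it.
order = list(TopologicalSorter(dependencies).static_order())

# Assign each subtask, in execution order, to an agent based on expertise.
assignment = [(task, expertise[task]) for task in order]
```

Any valid topological order works here; subtasks with no mutual dependency, such as `retime_signals` and `reroute_traffic`, could also be executed in parallel by different agents.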
Figure 4: Overview of the proposed ReMALIS Planning Module for predicting sub-goals based on current goals, intentions, grounded embeddings, and agent feedback.
Figure 5: Framework of the proposed ReMALIS Grounding Module that contextualizes symbol embeddings using the current state, intentions, and feedback signals.
Grounding Module (Figure 5):

  1. Contextualize the abstract traffic concepts and symbols into grounded representations.

  2. Map entities like intersections, vehicles, and signal phases to their physical counterparts.

  3. Resolve ambiguities and uncertainties in grounding based on the current traffic context.

  4. Adjust grounding strategies based on feedback from execution agents and emerging coordination patterns.

  5. Provide grounded embeddings to inform the execution agents’ decision-making.
Figure 6: Overview of our ReMALIS Cooperative Execution Module consisting of specialized agents that collaboratively execute actions and propagate intentions.
Execution Module (Figures 6 and 7):

  1. Specialized agents monitor their respective domains (vehicle counts, road conditions, signal timings, etc.).

  2. Agents communicate their local intentions and goals to relevant teammates.

  3. Agents align their actions based on shared intentions and the coordinated plans.

  4. Agents execute their assigned subtasks (adjusting signal phases, routing emergency vehicles, etc.).

  5. Agents observe the impact of their actions and provide feedback on emerging coordination patterns.

  6. Agents adapt their strategies dynamically based on the feedback and changing traffic conditions.

  7. Agents continuously monitor and respond to fluctuations in vehicle arrival rates and traffic patterns.

  8. Agents collaborate and coordinate their efforts to collectively alleviate congestion and optimize traffic flow.
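Step 2 above, sending an intention only to the teammates for whom it is relevant, can be sketched as follows. In the paper a learned propagation network decides who receives which message; the hand-written relevance set, the `Agent` class, and the agent names here are illustrative assumptions only.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    domain: str
    inbox: list = field(default_factory=list)


def propagate(sender, intention, relevant_domains, team):
    """Deliver the sender's intention only to teammates whose domain is relevant."""
    recipients = []
    for agent in team:
        if agent is not sender and agent.domain in relevant_domains:
            agent.inbox.append((sender.name, intention))
            recipients.append(agent.name)
    return recipients


monitor = Agent("monitor_agent", "counts")
signals = Agent("signal_agent", "signals")
routing = Agent("routing_agent", "routing")
team = [monitor, signals, routing]

# The monitoring agent shares a congestion-related intention with the signal agent only.
recipients = propagate(
    monitor,
    "congestion at intersection A; request longer green phase",
    relevant_domains={"signals"},
    team=team,
)
print(recipients)  # ['signal_agent']
```

Restricting delivery to relevant teammates keeps each agent's inbox small, which mirrors the paper's point that agents learn which teammates actually require task details.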
Figure 7: Overview of the collaborative evaluation setup in the proposed ReMALIS framework.