License: arXiv.org perpetual non-exclusive license
arXiv:2304.10351v2 [cs.MA] 11 Dec 2023

Inducing Stackelberg Equilibrium through Spatio-Temporal Sequential Decision-Making in Multi-Agent Reinforcement Learning

Bin Zhang$^{1,2}$    Lijuan Li$^{1,2,*}$    Zhiwei Xu$^{1,2}$    Dapeng Li$^{1,2}$    Guoliang Fan$^{1,2}$
$^{1}$Institute of Automation, Chinese Academy of Sciences
$^{2}$School of Artificial Intelligence, University of Chinese Academy of Sciences
{zhangbin2020, lijuan.li, xuzhiwei2019, lidapeng2020, guoliang.fan}@ia.ac.cn
$^{*}$corresponding author
Abstract

In multi-agent reinforcement learning (MARL), self-interested agents attempt to establish equilibrium and achieve coordination depending on the game structure. However, existing MARL approaches are mostly bound by the simultaneous actions of all agents in the Markov game (MG) framework, and few works consider the formation of equilibrium strategies via asynchronous action coordination. In view of the advantages of Stackelberg equilibrium (SE) over Nash equilibrium, we construct a spatio-temporal sequential decision-making structure derived from the MG and propose an N-level policy model based on a conditional hypernetwork shared by all agents. This approach allows for asymmetric training with symmetric execution, with each agent responding optimally conditioned on the decisions made by superior agents. Experiments demonstrate that our method effectively converges to the SE policies in repeated matrix game scenarios, and performs well in more complex settings, including cooperative tasks and mixed tasks.

1 Introduction

Reinforcement learning (RL) is a popular method for solving sequential decision-making problems, but it faces several challenges when applied to a multi-agent system (MAS). In a constantly changing environment, agents’ rewards are often closely tied to the actions of others. No agent exists in isolation, and agents’ actions are often interdependent. To achieve optimal outcomes, agents must learn to find a stable equilibrium rather than simply maximizing their own returns. Learning how to coordinate among agents to achieve the optimal joint policy is therefore a long-standing topic in multi-agent reinforcement learning (MARL).

Currently, the majority of prevalent MARL approaches adhere to the centralized training with decentralized execution (CTDE) architecture Foerster et al. (2016). In this paradigm, each agent’s policy model is trained centrally without communication constraints but executed in a distributed manner. Nevertheless, current CTDE approaches prioritize acquiring a comprehensive view of the environment and formulating cooperative policies by directly accessing the global state Rashid et al. (2018) or the local observations and actions of all agents Lowe et al. (2017). Some works based on game abstraction Zhang et al. (2023a) model the relationships among agents from their observational data, which is essentially state representation learning. As a result, centralized training is not leveraged to its full potential. In fact, it is feasible to construct an interaction mechanism that effectively guides agents towards mandatory coordination.

Moreover, game theory offers the fundamental framework for the interaction of multiple intelligent agents, and several MARL algorithms, such as Nash Q-Learning Hu and Wellman (2003), Mean Field Q-learning Yang et al. (2018) and HATRPO Kuba et al. (2022), seek convergence to Nash Equilibrium (NE). However, many game problems feature more than one NE, and the diverse NE policies chosen by different individuals frequently lead to undesirable outcomes. In addition, in a two-player zero-sum game, the MaxMin operator Littman (1994) can be used to calculate NE strategies for each agent, such that neither player can gain an advantage by deviating from their chosen strategy. But in more complex general-sum games, finding NE strategies is typically rather difficult. Consequently, we concentrate on the Stackelberg equilibrium (SE) Von Stackelberg (2010), in which agents make decisions in a leader-follower framework, with leaders deciding first and enforcing their policies on followers, who respond rationally to this enforcement. It has been shown that SE is a superior convergence objective for MARL compared to NE, in terms of both the certainty of the equilibrium solution and Pareto superiority Zhang et al. (2020). When applying SE to MARL, we aim to address the following challenges: (a) How to converge to SE policies that rely on asynchronous action coordination under the Markov game framework where agents act simultaneously? (b) How to extend the method to scenarios with more than two agents?

To address the aforementioned issues, we draw inspiration from techniques used in single-agent RL for high-dimensional continuous control tasks Metz et al. (2017). Utilizing significant benefits of centralized training, we model MARL as a sequential decision-making problem in both temporal and spatial dimensions, and then train the heterogeneous SE policies asymmetrically. Furthermore, to facilitate the execution of SE policies in a communication-free and symmetrical environment, we introduce an N-level policy model based on a conditional hypernetwork shared by all agents, with synthetic targets of generating the weights of target agents’ policy networks according to their priority attributes. This approach allows us to bridge the gap between environmental communication restrictions and the requirements that SE policy needs to access superior agents’ policies. It mitigates the problem of suboptimal solutions induced by parameter sharing, and overcomes the issue of learning and storage costs caused by heterogeneous policy learning. Our primary contributions are summarized as follows:

  • We develop a spatio-temporal sequential Markov game framework based on agent priority that enables agents to establish an efficient interaction mechanism during centralized training.

  • We construct an N-level policy model to assist agents in executing SE policies in a fully decentralized setting without imposing a limit on the number of agents.

  • We establish the asymmetric training with symmetric execution paradigm, a significant augmentation of CTDE.

  • Our method is demonstrated to converge to the SE through repeated matrix game experiments, and the results in more complicated scenarios also illustrate its superiority over powerful benchmarks in terms of sample efficiency and overall performance.

2 Preliminaries

2.1 Markov Game

The multi-agent decision-making problem is typically described as a Markov game (MG), which can be defined by a tuple $\Gamma \triangleq \langle \mathcal{I}, \mathcal{S}, \{\mathcal{A}^{i}\}_{i\in\mathcal{I}}, P, \{r^{i}\}_{i\in\mathcal{I}}, \gamma \rangle$. $\mathcal{I} = \{1, 2, \dots, n\}$ denotes the set of agents and $\mathcal{S}$ is the global state space of the environment, with $s \in \mathcal{S}$ a global state. $\mathcal{A}^{i}$ denotes the action space of agent $i$, and the joint action space $\mathcal{A} = \prod_{i=1}^{n} \mathcal{A}^{i}$ is the product of the action spaces of all agents. $P: \mathcal{S} \times \mathcal{A} \to \Omega(\mathcal{S})$ denotes the state transition function, where $\Omega(X)$ is the set of probability distributions over the space $X$. $r^{i}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function of agent $i$ and $\gamma$ is the discount factor.
At time step $t$, each agent chooses an action $a^{i}_{t} \in \mathcal{A}^{i}$ at state $s_{t} \in \mathcal{S}$ based on its own policy $\pi^{i}: \mathcal{S} \to \Omega(\mathcal{A}^{i})$ and receives feedback in the form of $r^{i}(s_{t}, \boldsymbol{a}_{t})$, where $\boldsymbol{a}_{t} = (a_{t}^{1}, \dots, a_{t}^{n}) \in \mathcal{A}$. The environment moves to a new state $s_{t+1} \sim P(s_{t+1} \mid s_{t}, \boldsymbol{a}_{t})$ as a result of the joint action $\boldsymbol{a}_{t}$.
The joint policy of all agents is expressed as $\boldsymbol{\pi}(s_{t}) = \prod_{i=1}^{n} \pi^{i}(s_{t})$. Under the MG framework, the state value function of agent $i$ is defined as:

$$
V^{i}_{\boldsymbol{\pi}}(s) = \mathbb{E}_{s \sim P,\, a^{-i} \sim \boldsymbol{\pi}^{-i}} \left[ \sum_{t=0}^{\infty} \gamma^{t} r^{i}_{t}(s_{t}, \boldsymbol{a}_{t}) \,\Big|\, s_{0} = s,\ a_{t}^{i} \sim \pi^{i}(s_{t}) \right], \tag{1}
$$

where $-i$ represents all agents except $i$. According to the Bellman equation, the action-state value function is given by:

$$
Q^{i}_{\boldsymbol{\pi}}(s, \mathbf{a}) = r^{i}(s, \mathbf{a}) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, \mathbf{a}) \cdot V^{i}_{\boldsymbol{\pi}}(s'). \tag{2}
$$
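As an illustration of the one-step backup in Eq. (2), the following minimal sketch evaluates $Q^{i}$ from a given $V^{i}$ in a tabular setting. The array layout, the function name `q_from_v`, and the toy numbers are assumptions made for this example only; they are not taken from the paper's implementation.

```python
import numpy as np

def q_from_v(reward, transition, value, gamma):
    """Eq. (2): Q^i(s, a) = r^i(s, a) + gamma * sum_{s'} P(s' | s, a) * V^i(s').

    reward:     shape (S, A)    -- r^i(s, a) for one agent i; a indexes joint actions
    transition: shape (S, A, S) -- P(s' | s, a)
    value:      shape (S,)      -- V^i(s') under the current joint policy
    """
    return reward + gamma * (transition @ value)

# Hypothetical toy game: 2 states, 2 agents with 2 actions each -> 4 joint actions.
num_states, num_joint_actions = 2, 4
reward = np.array([[1.0, 0.0, 0.0, 2.0],
                   [0.0, 1.0, 1.0, 0.0]])
# Every joint action leads to either state with probability 0.5.
transition = np.full((num_states, num_joint_actions, num_states), 0.5)
value = np.array([10.0, 20.0])

q_values = q_from_v(reward, transition, value, gamma=0.9)
```

Here the expected next-state value is 15 for every state-action pair, so each entry of `q_values` equals the immediate reward plus 13.5.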

2.2 Multi-Agent Reinforcement Learning

Figure 1: The STMG state transition procedure. It is an extensive game version of MG, which specifies the decision-making sequence of agents simultaneously.

In a multi-agent system, each decision-maker strives to maximize its own expected utility, denoted as $\mathcal{J}^{i}(\theta) = \mathbb{E}_{s \sim P,\, a \sim \pi_{\theta}} \left[ \sum_{t=0}^{\infty} \gamma^{t} r^{i}_{t} \right]$, where $\theta = \{\theta^{i}\}_{i \in \mathcal{I}}$ represents the policy parameters of the agents. However, using gradient ascent to directly update the strategy, i.e., $\theta^{i} \leftarrow \theta^{i} + \alpha \nabla_{\theta^{i}} \mathcal{J}^{i}(\theta)$, usually fails because $\mathcal{J}^{i}(\theta)$ is affected by all agents and the gradient update directions among agents may conflict.
Consequently, it is vital to employ the concept of game equilibrium solutions in order to develop effective coordination strategies. One common solution objective is NE. Nash-Q learning, for example, computes agent $i$’s value function $V_{\mathcal{N}ash}^{i}(s)$ when all agents follow the NE strategy in each stage of the game, and updates the action-state value function $Q_{\mathcal{N}ash}^{i}(s, \mathbf{a})$ through Eq. (2). The algorithm can converge to NE strategies under the strong assumption that an equilibrium exists at every game stage. All agents adopt equilibrium policies as their convergence objective rather than selfishly maximizing their own individual utilities. With the emergence of deep learning, current popular MARL approaches frequently employ the CTDE paradigm to enable coordination via credit assignment Rashid et al. (2018) or centralized critics Lowe et al. (2017).

Figure 2: Exemplification of the execution mechanism under the STMG framework. In the temporal domain, all agents continue to adhere to the MG settings. Simultaneously, agents make decisions according to the Stackelberg leadership model, and subgame states $s^{i-1}_{t}$ are introduced to maintain each agent’s policy inputs.

2.3 Stackelberg Leadership Model

The Stackelberg leadership model dictates the order of actions among agents. Consider a two-player game in which the leader can enforce its own strategy $\pi^{1}$ on the follower, who reacts after observing the leader’s behavior, i.e., $\pi^{2}: \mathcal{S} \times \mathcal{A}^{1} \to \Omega(\mathcal{A}^{2})$, where $\pi^{i} \in \Pi^{i}$ and $\Pi^{i}$ represents the policy space. In this model, the leader optimizes its objective on the premise that the follower will provide the best response $BR(\pi^{1})$, while the follower maximizes its expected utility given the leader’s action. It can be formalized as:

$$
\begin{aligned}
&\max_{\pi^{1} \in \Pi^{1}} \left\{ \mathcal{J}^{1}(\pi^{1}, \pi^{2}) \mid \pi^{2} = BR(\pi^{1}) \right\}, \\
&\max_{\pi^{2} \in \Pi^{2}} \left\{ \mathcal{J}^{2}(a^{1}, \pi^{2}) \mid a^{1} \sim \pi^{1} \right\}.
\end{aligned} \tag{3}
$$

Equilibrium signifies that in a multi-player game, all players have adopted the optimal strategy and none can improve their performance by altering their own strategy. Correspondingly, Stackelberg equilibrium signifies that both agents adhere to the optimal solution strategies $(\pi^{1}_{SE}, \pi^{2}_{SE})$ of the above optimization problem. It satisfies:

$$
\begin{aligned}
V^{1}_{\pi^{1}_{SE},\, BR(\pi^{1}_{SE})}(s) &\geq V^{1}_{\pi^{1},\, BR(\pi^{1})}(s), \\
\pi^{2}_{SE} &= BR(\pi^{1}_{SE}).
\end{aligned} \tag{4}
$$

As a mandatory equilibrium, SE offers more benefits than NE in terms of the stability, certainty of equilibrium points and Pareto superiority Zhang et al. (2020). Consequently, it is a more suitable learning objective.
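To make the leader-follower logic of Eqs. (3) and (4) concrete, the sketch below solves a one-shot bimatrix game restricted to pure strategies: the follower best-responds to each leader action, and the leader then picks the action whose induced outcome it prefers. The payoff matrices, the function name `pure_stackelberg`, and the tie-breaking rule are illustrative assumptions, not part of the paper.

```python
import numpy as np

def pure_stackelberg(leader_payoff, follower_payoff):
    """Return (leader_action, follower_action) of a pure-strategy Stackelberg solution.

    Rows index the leader's action, columns the follower's action.
    Follower ties are broken by the lowest column index (an assumption).
    """
    # Follower's best response BR(a^1) for every leader action a^1.
    best_responses = np.argmax(follower_payoff, axis=1)
    # Leader's payoff when the follower best-responds to each leader action.
    rows = np.arange(leader_payoff.shape[0])
    leader_values = leader_payoff[rows, best_responses]
    leader_action = int(np.argmax(leader_values))
    return leader_action, int(best_responses[leader_action])

# Hypothetical coordination game with two pure Nash equilibria: (0, 0) and (1, 1).
leader_payoff = np.array([[2.0, 0.0],
                          [0.0, 1.0]])
follower_payoff = np.array([[1.0, 0.0],
                            [0.0, 2.0]])

solution = pure_stackelberg(leader_payoff, follower_payoff)  # (0, 0)
```

In this toy game both (0, 0) and (1, 1) are Nash equilibria, whereas the leader-follower ordering singles out (0, 0), which is one way to read the "certainty of equilibrium points" argument.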

Figure 3: The overall architecture of STEP. Left: The workflow of STEP for a comprehensive decision in a time step. Agents make their decisions based on the current situation $s_{t}$, their self-positioning Priority ID, and the prerequisite actions $a^{1:i-1}_{t}$ of superior agents. Right: The structure of the N-level policy model. It allows for the implementation of heterogeneous policies under parameter sharing and of the Stackelberg equilibrium policies under symmetric conditions.

3 Methodology

In this section, we propose a new method called Spatio-Temporal sequence Equilibrium Policy optimization (STEP). This method leverages the advantages of centralized training to establish an interaction mechanism and employs a more efficient policy network to facilitate coordinated policies during decentralized execution. The following parts provide a detailed description of the procedure for implementing STEP.

3.1 Spatio-Temporal Sequential Markov Game

Current CTDE approaches, especially policy-based methods, tend to center around enhancing the richness of input information rather than boosting the effectiveness of collaboration. In fact, centralized training liberates us from restrictive circumstances: with centralized training, we can create an interaction mechanism among agents for completing their tasks. Furthermore, the concept of SE is more in line with the goals of RL, as it requires merely additional constraints, which can be easily imposed by setting a specific interaction mechanism. We therefore adopt the multi-player Stackelberg leadership model and construct the spatio-temporal sequential Markov game (STMG) in the training phase to direct agents’ coordination and facilitate the implementation of SE policies.

Definition 1.

STMG can be formalized as a tuple $\langle \mathcal{I}, \mathcal{S}, \{\mathcal{A}^{i}\}_{i\in\mathcal{I}}, P, \{r^{i}\}_{i\in\mathcal{I}}, \gamma, \{o^{i}\}_{i\in\mathcal{I}} \rangle$. In addition to the MG elements defined in Section 2.1, STMG adds the term $o^{i}$, which denotes the action order of agent $i$. $\mathcal{O} = (o^{1}, \dots, o^{n})$ represents the action order of all agents, indicating the priority/importance of agents at the decision-making stage.

Figure 1 shows the state transition process in STMG. For the sake of simplicity, we assume that the agent ID $i$ is assigned by priority $o^{i}$. Compared with MG, STMG takes the form of a sequential decision process in both the temporal and spatial domains. Agents with higher priorities have greater initiative, whereas agents with lower priorities must respond to the actions of those with higher priorities. Correspondingly, the policy of each agent changes to $\pi^{i}: \mathcal{S} \times \mathcal{A}^{1} \times \cdots \times \mathcal{A}^{i-1} \to \Omega(\mathcal{A}^{i})$. The action-state value function of agent $i$ can be written as:

$$
Q^{i}_{\boldsymbol{\pi}}(s, a^{1:i-1}, a^{i}) = \mathbb{E}_{s \sim P,\, a^{i+1:n} \sim \boldsymbol{\pi}^{i+1:n}} \bigg[ \sum_{t=0}^{\infty} \gamma^{t} r^{i}_{t}(s_{t}, \boldsymbol{a}_{t}) \,\Big|\, s_{0} = s,\ \mathbf{a}_{0} = \boldsymbol{a},\ a_{t}^{i} \sim \pi^{i}(s_{t}, a_{t}^{1}, \dots, a_{t}^{i-1}) \bigg]. \tag{5}
$$

The state value function is denoted as:

$$
V^{i}_{\boldsymbol{\pi}}(s, a^{1:i-1}) = \sum_{a^{i} \in \mathcal{A}^{i}} \pi^{i}(a^{i} \mid s, a^{1:i-1})\, Q^{i}_{\boldsymbol{\pi}}(s, a^{1:i-1}, a^{i}). \tag{6}
$$

And then we have the advantage function:

$$
A^{i}_{\boldsymbol{\pi}}(s, a^{1:i-1}, a^{i}) = Q^{i}_{\boldsymbol{\pi}}(s, a^{1:i-1}, a^{i}) - V^{i}_{\boldsymbol{\pi}}(s, a^{1:i-1}). \tag{7}
$$
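The relation between Eqs. (6) and (7) can be checked numerically for a single subgame state $(s, a^{1:i-1})$. The sketch below assumes a discrete action space for agent $i$ and uses made-up numbers; the function name `value_and_advantage` is hypothetical.

```python
import numpy as np

def value_and_advantage(q_values, policy_probs):
    """Compute V^i and A^i for one fixed subgame state (s, a^{1:i-1}).

    q_values:     shape (|A^i|,) -- Q^i(s, a^{1:i-1}, a^i) for each action a^i
    policy_probs: shape (|A^i|,) -- pi^i(a^i | s, a^{1:i-1})
    """
    value = float(np.dot(policy_probs, q_values))  # Eq. (6): expectation of Q under pi^i
    advantage = q_values - value                   # Eq. (7): Q minus the baseline V
    return value, advantage

# Illustrative numbers only.
q_values = np.array([1.0, 3.0, 2.0])
policy_probs = np.array([0.5, 0.25, 0.25])

value, advantage = value_and_advantage(q_values, policy_probs)
# value == 1.75, advantage == [-0.75, 1.25, 0.25]
```

By construction, the advantage has zero mean under $\pi^{i}$, so it measures how much better an action is than the agent's average behavior given the superior agents' actions.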

Within the STMG framework, the agent with priority $i$ can treat the actions of the preceding $i-1$ superior agents as prerequisites and maximize its own expected return $\mathcal{J}^{i}(\theta) = \mathbb{E}_{s \sim P,\, a^{i+1:n} \sim \boldsymbol{\pi}^{i+1:n}_{\theta}} \left[ \sum_{t=0}^{\infty} \gamma^{t} r^{i}_{t} \right]$ according to the current subgame state $s^{i-1} = (s, a^{1}, \dots, a^{i-1})$, which is equivalent to learning the best response to the strategies of superior agents. In summary, during the training phase, the agents’ execution mechanism can be divided into the Markov game process in the time domain and the leader-follower model in the space domain. The specific execution process is depicted in Figure 2.
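The spatial part of this execution mechanism can be sketched as a loop over agents in priority order, where each agent's policy receives the subgame state formed by the environment state and the actions already chosen. The policy interface `(state, prior_actions) -> probabilities` and the toy policies below are assumptions made for illustration; the paper does not prescribe this interface.

```python
import numpy as np

def stmg_joint_action(state, policies, rng):
    """Sample one joint action in priority order (highest priority first).

    policies[i] is a hypothetical callable (state, prior_actions) -> probability
    vector over agent i's actions, where prior_actions = (a^1, ..., a^{i-1}).
    """
    actions = []
    for policy in policies:
        # Subgame state s^{i-1} = (s, a^1, ..., a^{i-1}).
        probs = policy(state, tuple(actions))
        actions.append(int(rng.choice(len(probs), p=probs)))
    return tuple(actions)

# Toy deterministic policies: the leader picks action 1,
# and each follower copies the action of its immediate predecessor.
def leader(state, prior_actions):
    return np.array([0.0, 1.0])

def follower(state, prior_actions):
    probs = np.zeros(2)
    probs[prior_actions[-1]] = 1.0
    return probs

rng = np.random.default_rng(0)
joint_action = stmg_joint_action(state=0, policies=[leader, follower, follower], rng=rng)
# joint_action == (1, 1, 1)
```

After the joint action is assembled, the environment would step once in the temporal domain, exactly as in the standard MG.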

3.2 Parameterized N-level Policy Model Using A Conditional Hypernetwork

Within the STMG framework, agents are able to acquire the actions of superior agents directly. This asymmetric access arises naturally in the centralized training phase, where there are no communication restrictions, but it is not available in the decentralized execution phase. Several approaches can address this. We can train a policy network that is shared by all agents, allowing them to calculate the actions of other agents using the shared parameters. Alternatively, we can design a communication module that enables superior agents to broadcast their decisions to inferior agents. Agents can also keep copies of other players' policy or value networks to independently calculate their own policies. Nonetheless, parameter sharing can result in suboptimal solutions, and communication may not be feasible in fully decentralized execution settings or under communication bandwidth constraints. Additionally, storing model copies limits the scalability of the algorithm, since the volume of saved copies grows with the number of agents.

In contrast to the aforementioned methods, we develop an N-level policy model using a conditional hypernetwork. It consists of a parameter generator, a state embedding, and a target policy module. Figure 3 illustrates its structure in detail. MARL is essentially a multi-task learning (multi-objective regression) problem Iqbal and Sha (2019). Instead of explicitly training and storing individual policy modules for each agent, we maintain a set of policy parameters $\{\theta_{tar}^{i}\}_{i\in\mathcal{I}}$ by training a meta-model (conditional hypernetwork) $\mathcal{H}(e^{i};\theta_{h})=\theta^{i}_{tar}$ with the weights $\theta_{h}$ shared by all agents. It maps the embedding of each agent's priority ID $e^{i}$ to the parameter configuration of its policy network. It has been demonstrated that hypernetworks can handle continual learning tasks von Oswald et al. (2020), where a model is trained on a series of tasks in sequence. This fits our setting, in which a single shared model is trained on the data of every agent.
In addition, the input $s^{i-1}$ to the target policy module is encoded by the state embedding network $\mathcal{E}(s,a^{1},\ldots,a^{i-1};\theta_{s})$, which receives as input the current state of the environment and the actions of superior agents. Specifically, each component of the model is described as follows:

\begin{align}
\text{State Embedding:}\quad s^{i-1} &= \mathcal{E}(s,a^{1},\ldots,a^{i-1};\theta_{s}), \tag{8}\\
\text{Parameter Generator:}\quad \theta^{i}_{tar} &= \mathcal{H}(e^{i};\theta_{h}), \tag{9}\\
\text{Target Policy:}\quad a^{i} &\sim \pi^{i}(s^{i-1};\theta^{i}_{tar}). \tag{10}
\end{align}

It is important to note that the parameters of the policy model consist of $\theta^{i}=(\theta_{s},\theta_{h},\theta^{i}_{tar})$, where $\theta_{s},\theta_{h}$ are learnable parameters shared by all agents and $\theta^{i}_{tar}$ can be accessed by all agents through the parameter generator. By using the N-level policy model, we are able to share policy parameters while avoiding the suboptimal solutions and storage issues previously mentioned, resulting in efficient decentralized coordination.
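The three components in Eqs. (8)-(10) can be sketched as a toy forward pass. This is only an illustration under stated assumptions, not the authors' architecture: the layers are single linear maps, actions are discrete and one-hot encoded, missing superior actions are zero-padded, the priority embeddings $e^{i}$ are stored as a table, priorities are 0-indexed, and all names (`NLevelPolicy`, `embed_state`, `generate_params`, `action_probs`) and sizes are hypothetical.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, x: Sequence[float]) -> Vector:
    """Matrix-vector product: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class NLevelPolicy:
    """Toy N-level policy: shared theta_s and theta_h, generated theta_tar^i."""

    def __init__(self, n_agents: int, state_dim: int, n_actions: int,
                 embed_dim: int = 4, hidden_dim: int = 8, seed: int = 0) -> None:
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> Matrix:
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.n_agents, self.n_actions, self.hidden_dim = n_agents, n_actions, hidden_dim
        # Input = state plus one-hot slots for up to n_agents - 1 superior actions.
        in_dim = state_dim + (n_agents - 1) * n_actions
        self.theta_s = init(hidden_dim, in_dim)                 # state embedding E
        self.theta_h = init(n_actions * hidden_dim, embed_dim)  # hypernetwork H
        self.priority_embed = init(n_agents, embed_dim)         # e^i (assumed table)

    def embed_state(self, state: Sequence[float],
                    superior_actions: Sequence[int]) -> Vector:
        """Eq. (8): encode (s, a^1, ..., a^{i-1}); absent actions stay zero."""
        x = list(state)
        for j in range(self.n_agents - 1):
            one_hot = [0.0] * self.n_actions
            if j < len(superior_actions):
                one_hot[superior_actions[j]] = 1.0
            x.extend(one_hot)
        return [math.tanh(h) for h in linear(self.theta_s, x)]

    def generate_params(self, priority: int) -> Matrix:
        """Eq. (9): map the priority embedding e^i to the target policy weights."""
        flat = linear(self.theta_h, self.priority_embed[priority])
        h = self.hidden_dim
        return [flat[k * h:(k + 1) * h] for k in range(self.n_actions)]

    def action_probs(self, priority: int, state: Sequence[float],
                     superior_actions: Sequence[int]) -> Vector:
        """Eq. (10): action distribution of the agent with the given priority."""
        theta_tar = self.generate_params(priority)
        return softmax(linear(theta_tar, self.embed_state(state, superior_actions)))


model = NLevelPolicy(n_agents=3, state_dim=2, n_actions=3)
# Second agent (priority 1) conditions on the action of the first agent.
probs = model.action_probs(priority=1, state=[0.5, -0.2], superior_actions=[2])
print(probs)
```

Only `theta_s`, `theta_h`, and the embedding table are stored, so the number of stored parameters in this sketch grows with the number of agents only through the small embedding table, while each agent still receives its own generated policy weights.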

3.3 Implementation

The N-level policy model allows for the implementation of the asymmetric training with symmetric/decentralized execution (ATSE) paradigm in STEP. We select Proximal Policy Optimization (PPO) Schulman et al. (2017) as the underlying algorithm due to its superior performance. To balance exploration and exploitation during the training phase, agents sample actions from the categorical distribution or multivariate Gaussian distribution generated by the policy model. However, because sampling is stochastic, agents cannot recompute the sampled actions of superior agents. We therefore transmit the actions of superior agents directly to inferior agents during training. During execution, each agent selects the action with the highest probability, so inferior agents can compute the actions of superior agents through the shared policy model and coordinate accordingly. Additionally, in each training epoch, every agent trains the same policy module using its own data, which can result in catastrophic forgetting. To address this issue, we incorporate a regularization term to ensure that the policy network can train the current agent's policy parameters while retaining the capacity to fit the previously updated policies of other agents. The objective function of actor $i$ is expressed as:

\begin{align}
\mathcal{L}^{i}(\theta) &= \mathbb{E}_{s\sim P,\,\mathbf{a}\sim\boldsymbol{\pi}}\Big[\mathcal{L}_{clip}-\mathcal{L}_{h}(\theta_{h},\theta_{h_{old}},e^{1:i-1})+\eta S\big(\pi_{\theta}^{i}(s,a^{1:i-1})\big)\Big], \tag{11}\\
\mathcal{L}_{clip} &= \min\big(r^{i}_{\theta}A^{i}_{\pi},\ \operatorname{clip}(r^{i}_{\theta},1\pm\epsilon)A^{i}_{\pi}\big), \notag\\
\mathcal{L}_{h}(\cdot) &= \frac{\beta}{i-1}\sum_{j=1}^{i-1}\big\|\mathcal{H}(e^{j};\theta_{h})-\mathcal{H}(e^{j};\theta_{h_{old}})\big\|^{2}_{2}, \notag
\end{align}

where $S(\cdot)$ is the Shannon entropy used to strengthen exploration, $r^{i}_{\theta}=\frac{\pi^{i}_{\theta}(a^{i}|s,a^{1:i-1})}{\pi^{i}_{\theta_{old}}(a^{i}|s,a^{1:i-1})}$ is the likelihood ratio between the current and previous policies, $\epsilon$ is the clipping ratio, and $\eta$ and $\beta$ are the coefficients of the entropy and regularization terms, respectively. A critic network is used to fit the value function, and its loss function is expressed as:

$$\mathcal{L}(\phi^{i})=\max\Big[\big(V_{\phi}(s,a^{1:i-1})-R^{i}\big)^{2},\ \big(\operatorname{clip}\big(V_{\phi}(s,a^{1:i-1}),V_{\phi_{old}}(s,a^{1:i-1})\pm\varepsilon\big)-R^{i}\big)^{2}\Big], \tag{12}$$

where $R^{i}$ is the cumulative return and $\varepsilon$ is the clipping ratio.
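To make the terms of Eqs. (11) and (12) concrete, the sketch below evaluates them on plain Python numbers. It is a minimal illustration, not the authors' implementation: the function names, the default coefficients, the batch averaging, and the choice to return zero for $\mathcal{L}_{h}$ when the agent has no superior agents (where $\frac{\beta}{i-1}$ is undefined) are all assumptions.

```python
import math
from typing import List, Sequence


def clipped_surrogate(ratio: float, advantage: float, eps: float) -> float:
    """L_clip for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy S(pi) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def hyper_regularizer(new_params: Sequence[Sequence[float]],
                      old_params: Sequence[Sequence[float]],
                      beta: float) -> float:
    """L_h: beta / (i - 1) * sum_j ||H(e^j; theta_h) - H(e^j; theta_h_old)||^2.

    Each entry holds the flattened generated parameters of one superior agent.
    Returns 0.0 when there are no superior agents (an assumption).
    """
    if not new_params:
        return 0.0
    total = sum(sum((n - o) ** 2 for n, o in zip(new, old))
                for new, old in zip(new_params, old_params))
    return beta * total / len(new_params)


def actor_objective(ratios: Sequence[float], advantages: Sequence[float],
                    action_probs: Sequence[Sequence[float]],
                    new_params: Sequence[Sequence[float]],
                    old_params: Sequence[Sequence[float]],
                    eps: float = 0.2, eta: float = 0.01, beta: float = 1.0) -> float:
    """Batch estimate of Eq. (11); this quantity is maximized."""
    n = len(ratios)
    clip_term = sum(clipped_surrogate(r, a, eps)
                    for r, a in zip(ratios, advantages)) / n
    entropy_term = sum(entropy(p) for p in action_probs) / n
    return clip_term - hyper_regularizer(new_params, old_params, beta) + eta * entropy_term


def clipped_value_loss(value: float, old_value: float, ret: float, eps: float) -> float:
    """Eq. (12) for one sample: max of unclipped and clipped squared errors."""
    clipped_value = old_value + max(-eps, min(value - old_value, eps))
    return max((value - ret) ** 2, (clipped_value - ret) ** 2)


def critic_loss(values: Sequence[float], old_values: Sequence[float],
                returns: Sequence[float], eps: float = 0.2) -> float:
    """Batch mean of the clipped value loss; this quantity is minimized."""
    losses: List[float] = [clipped_value_loss(v, ov, r, eps)
                           for v, ov, r in zip(values, old_values, returns)]
    return sum(losses) / len(losses)
```

For example, with ratio 1.5, advantage 2.0 and $\epsilon = 0.2$, the clipped branch gives $1.2 \times 2.0 = 2.4$, which is smaller than the unclipped $3.0$, so the surrogate stops rewarding further increases of the ratio.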

The pseudo-code of STEP can be found in the Appendix. Notably, our method offers several advantages over currently popular MARL techniques. Unlike prior work based on SE Könönen (2004); Zhang et al. (2020), STEP is easily extensible to additional players. Unlike prior work based on parameter sharing Rashid et al. (2018), STEP is able to learn heterogeneous policies. In contrast to previous work on learning heterogeneous policies Kuba et al. (2022), STEP does not increase the learning cost as the number of agents increases. In contrast to previous CTDE work Lowe et al. (2017); Yu et al. (2021), we no longer focus on merely expanding the information space, but on specific forms of coordination.
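The difference between training and execution described in this section can be sketched as follows. During training an agent samples its action and receives the superior actions directly; during execution it recomputes the greedy actions of all superior agents from the shared model before choosing its own. The policy `toy_shared_policy` is a hand-written stand-in for the shared N-level model, and every name here is hypothetical; priorities are 0-indexed.

```python
import random
from typing import Callable, List, Sequence

# A shared policy maps (priority, state, superior actions) to action probabilities.
SharedPolicy = Callable[[int, Sequence[float], Sequence[int]], List[float]]


def toy_shared_policy(priority: int, state: Sequence[float],
                      superior_actions: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for the shared policy model (3 discrete actions)."""
    preferred = (int(state[0]) + priority + sum(superior_actions)) % 3
    return [0.8 if a == preferred else 0.1 for a in range(3)]


def act_training(policy: SharedPolicy, priority: int, state: Sequence[float],
                 superior_actions: Sequence[int], rng: random.Random) -> int:
    """Training: sample a^i; the sampled superior actions are passed in directly."""
    probs = policy(priority, state, superior_actions)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def act_execution(policy: SharedPolicy, priority: int,
                  state: Sequence[float]) -> int:
    """Execution: recompute superior agents' greedy actions, then act greedily."""
    actions: List[int] = []
    for j in range(priority + 1):
        probs = policy(j, state, actions)
        actions.append(max(range(len(probs)), key=lambda a: probs[a]))
    return actions[-1]


state = [1.0]
# Each agent acts on its own, without receiving any message from the others.
joint_action = [act_execution(toy_shared_policy, i, state) for i in range(3)]
print(joint_action)  # -> [1, 0, 1]
```

Because the greedy choice is deterministic, every agent reconstructs the same prefix of superior actions, so the joint action is consistent without communication. In this sketch the agent with priority $i$ pays for that with one forward pass per superior agent in addition to its own.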

4 Related Work

Our study, inspired by the Stackelberg leadership model, builds an asymmetric cooperative connection within the MG framework and applies RL to discover the Stackelberg equilibrium solution. SE strategies also serve as the convergence objective in several earlier works. Similar to Nash-Q learning, Asymmetric Q-learning (AQL) Könönen (2004) updates the action-state value function in an asymmetric setting by calculating the SE of the stage game at each iteration. Leaders are able to access followers' reward information and save copies of each follower's value function. Bi-AC Zhang et al. (2020) presents a bi-level actor-critic method based on the CTDE paradigm that employs a Q-learning-based leader Mnih et al. (2013) and a DDPG-based follower Lillicrap et al. (2016). Both the leader and follower must save the leader's critic network and the follower's actor network during the execution process in order to compute and execute their policies. Additionally, Bully Littman and Stone (2001) and DeDOL Tian et al. (2019) are also dedicated to the solution of SE. However, all of these works only divide the agents into two levels, and extending them to more agents is not straightforward. Furthermore, HATRPO Kuba et al. (2022) uses random sequence updating to ensure monotonic policy improvement in multi-agent settings, but it is only suitable for cooperative environments: the algorithm requires a joint advantage function to be defined for all agents, which limits its application to mixed tasks.

When dealing with structured combinatorial action spaces, single-agent RL also uses similar spatio-temporal sequence learning approaches. SQL Metz et al. (2017) imitates the sequence-to-sequence model of the structured prediction problem, predicts the policy of each dimension simultaneously, and combines them into a comprehensive high-dimensional policy. Our technique can be considered an extension of SQL to multi-agent systems. The actions of each dimension in SQL correspond to the actions of each agent in STEP, and the complete high-dimensional policy corresponds to the joint policy of all agents. In addition, the action branching structure used in BQD Tavakoli et al. (2018) is also an effective way to decompose and separately control high-dimensional actions. Single-agent algorithms of this form face no communication restrictions and can freely access each dimension's policy, which reduces the complexity of the problem. In MARL, however, extra mechanisms are required to address the issues caused by communication restrictions.

Figure 4: Matrix games. Left: The common-payoff matrix of the Penalty game, where $k=[0,-25,-50,-75,-100]$. $(a^{1}_{1},a^{2}_{3})$, $(a^{1}_{2},a^{2}_{2})$ and $(a^{1}_{3},a^{2}_{1})$ are NE points, and $(a^{1}_{1},a^{2}_{3})$ is the only SE point. Right: The payoff matrix of the Mixing game. It has only one NE point $(a^{1}_{2},a^{2}_{2})$ and only one SE point $(a^{1}_{1},a^{2}_{1})$. The SE is Pareto superior to the NE.
Figure 5: Proportion of different convergence results in Penalty.

5 Experiments

Figure 6: Performance comparison with baselines on Multi-Agent MuJoCo tasks. Error bars are a 95% confidence interval across 5 runs.
| Method | Agent 1: $a^{1}_{1}$ | Agent 1: $a^{1}_{2}$ | Agent 1: $a^{1}_{3}$ | Agent 2: $a^{2}_{1}$ | Agent 2: $a^{2}_{2}$ | Agent 2: $a^{2}_{3}$ | Reward: agent 1 | Reward: agent 2 | Reward: Avg. |
|---|---|---|---|---|---|---|---|---|---|
| STEP | $\boldsymbol{99.99_{(0.01)}}$ | $0.00_{(0.00)}$ | $0.01_{(0.01)}$ | $\boldsymbol{99.99_{(0.01)}}$ | $0.00_{(0.00)}$ | $0.01_{(0.01)}$ | $0.00_{(0.00)}$ | $5.00_{(0.00)}$ | $\boldsymbol{2.50_{(0.00)}}$ |
| MAPPO | $32.93_{(46.20)}$ | $0.03_{(0.20)}$ | $67.04_{(46.21)}$ | $61.98_{(47.31)}$ | $0.04_{(0.22)}$ | $37.97_{(47.31)}$ | $2.72_{(2.38)}$ | $-1.29_{(6.43)}$ | $0.72_{(2.33)}$ |
| MADDPG | $21.39_{(37.14)}$ | $5.62_{(20.52)}$ | $72.99_{(39.82)}$ | $14.46_{(31.81)}$ | $7.86_{(22.51)}$ | $77.68_{(37.30)}$ | $3.20_{(3.26)}$ | $-7.9_{(5.12)}$ | $-2.35_{(2.84)}$ |

Table 1: Results of different methods in the Mixing game. The first two column groups show the average probability of the two agents choosing each action. The third column group displays the individual rewards earned by each agent and the average reward of the two agents. The values in parentheses correspond to a single standard deviation over trials.

We evaluate the performance of our proposed algorithm, STEP, in three benchmark environments: the Repeated Matrix Game, Multi-Agent MuJoCo Peng et al. (2021), and Highway On-Ramp Merging Chen et al. (2021). The Appendix contains thorough descriptions of the three environments. Based on the reward settings (shared or individual rewards) and control types (discrete or continuous action space) of each environment, we compare STEP with various state-of-the-art MARL algorithms. The main objectives of these experiments are to: (a) validate STEP's ability to identify the Stackelberg equilibrium policies; (b) evaluate its performance in more challenging cooperative and mixed tasks; and (c) assess the effect of the regularization coefficient on its performance.

5.1 Repeated Matrix Game

We test STEP using the cooperative matrix game Penalty proposed by Claus and Boutilier (1998) and the non-cooperative matrix game Mixing, both shown in Figure 4. Agents receive constant observations at each time step in both games, and the length of each repeated matrix game is set to 25. To train strong critics, the popular CTDE paradigm uses joint observations and actions to enlarge the information space. However, there is no noticeable performance difference between independent learning (IL) and CTDE in stateless matrix games. Therefore, only MAPPO Yu et al. (2021) and MADDPG Lowe et al. (2017) are considered as baselines. We present the experimental results over 100 trials of 10000 time steps in Figure 5 and Table 1, which show that STEP outperforms the baselines in both types of matrix games. More detailed results can be found in the Appendix.

In Penalty, there are three NE points, and the only SE point corresponds to the optimal NE point. The challenge is that, during exploration, any deviation from the optimal NE policy by either player results in a severe penalty for all players. The likelihood that agents choose the optimal NE action decreases as the magnitude of the penalty increases. Consequently, MAPPO always falls into the penalty trap and converges to the sub-optimal NE point $(a^{1}_{2},a^{2}_{2})$ at a rate of $100\%$, unless $k=0$. Similarly, as $k$ decreases, MADDPG has a greater probability of converging to the sub-optimal NE point. In contrast to the baselines, STEP takes the SE as its convergence objective. By computing the leader's action, the follower is always able to select the best-response action, allowing STEP to arrive efficiently at the optimal solution regardless of the configuration. Even under a high penalty of $k=-100$, there is a $93\%$ probability of convergence to the optimal value. Because the policies are approximated by neural networks, small probability errors are to be expected.
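The penalty trap can be illustrated with a small numerical sketch. The matrix returned by `penalty_game` is a hypothetical common-payoff game built only to share the qualitative structure stated in the Figure 4 caption (three pure NE points, one of them optimal, with the penalty $k$ on miscoordination); it is not the matrix used in the paper, and actions are 0-indexed. The sketch enumerates the pure NE points and shows how the value of each row action changes when the partner explores uniformly at random.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def penalty_game(k: float) -> Matrix:
    """Hypothetical common-payoff matrix with penalty k <= 0 (assumption)."""
    return [[k, 0.0, 10.0],
            [0.0, 2.0, 0.0],
            [5.0, 0.0, k]]


def pure_nash_equilibria(payoff: Matrix) -> List[Tuple[int, int]]:
    """Cells from which neither player gains by deviating unilaterally."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    equilibria = []
    for r in range(n_rows):
        for c in range(n_cols):
            row_ok = all(payoff[r][c] >= payoff[r2][c] for r2 in range(n_rows))
            col_ok = all(payoff[r][c] >= payoff[r][c2] for c2 in range(n_cols))
            if row_ok and col_ok:
                equilibria.append((r, c))
    return equilibria


def value_under_uniform_partner(payoff: Matrix) -> List[float]:
    """Expected payoff of each row action if the partner acts uniformly."""
    return [sum(row) / len(row) for row in payoff]


for k in (0.0, -100.0):
    game = penalty_game(k)
    print(k, pure_nash_equilibria(game), value_under_uniform_partner(game))
```

With $k=0$ the row action of the optimal NE has the highest expected value against an exploring partner, but with $k=-100$ the safe middle action does, which is consistent with simultaneous learners drifting to the sub-optimal NE. A follower that observes the leader's action does not face this averaging, since it can respond to the action actually taken.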

Figure 7: Results of STEP with different regularization coefficients. Note that a larger coefficient produces better results in the case of more agents.
Figure 8: Learning curves for the Highway On-Ramp Merging tasks. Error bars are a 95% confidence interval across 5 runs.

Table 1 displays the results of the different methods in the Mixing game. Since each agent has its own reward, we evaluate the agents' coordination using the average return proposed by Zhang et al. (2018). It is clear that there are two optimal solutions in this scenario, with $(a^{1}_{1},a^{2}_{1})$ corresponding to the SE. In addition, the SE is Pareto superior to the only NE. As expected, STEP converges rapidly to the SE with a probability of nearly $100\%$. In contrast, MAPPO and MADDPG fail to converge to a consistent joint action, as the large standard deviations in Table 1 indicate. This also demonstrates that, owing to the determinacy of the SE, STEP can to some extent resolve the problem of selecting among multiple optimal solutions.

5.2 Multi-Agent MuJoCo

Figure 9: Comparing the testing performance of the average speed among different methods.

The Multi-Agent MuJoCo environment, which divides the robot into parts that are each controlled by a separate agent to achieve optimal motion, is used to assess the performance of the algorithm in a fully cooperative setting, as well as the impact of the regularization coefficient on STEP's performance. Specifically, to measure the collaboration quality of the agents, we choose representative cooperation scenarios such as Ant and HalfCheetah. Due to space limits, the results for the Many-Agent Ant scenario are reported in the Appendix.

Figure 6 demonstrates that STEP outperforms heterogeneous and homogeneous methods in all cases, which we attribute to the hierarchical structure among agents. Notably, the performance gap between STEP and HAPPO narrows as the number of agents increases. We believe this is because STEP achieves the effect of heterogeneous policies through parameter sharing: the number of STEP parameters remains constant regardless of the number of agents, whereas HAPPO with heterogeneous policies builds an individual policy network for each agent. In this sense, STEP shows a higher capacity to capture and represent diverse and intricate policies.

Figure 7 depicts the influence of the regularization coefficient on the experimental results. When the number of agents is small, different coefficients perform similarly. When there are more agents, however, larger coefficients lead to better results. Removing the regularization term, on the other hand, degrades performance, which demonstrates the effectiveness of the regularization term. This outcome is expected and intuitive.

5.3 Highway On-Ramp Merging

Lane merging in high-density, high-speed traffic presents a significant challenge for both autonomous vehicles and human drivers. To guarantee safe and efficient merging, vehicles must apply coordination mechanisms that allow them to merge at precise speeds and times. In this scenario, each agent has its own goal, and all of them seek to pass through the main road quickly. Because HAPPO is built for fully cooperative environments, we compare STEP only with MAPPO, IPPO, and MADDPG in this mixed task scenario. Road conditions vary across the different task levels.

Results in Figure 8 show that STEP outperforms the other approaches in the mixed task scenario, with the performance gap between STEP and the other algorithms growing as traffic density and complexity rise. Additionally, as illustrated in Figure 9, the average speed achieved by STEP after convergence is higher than that of the other algorithms.

6 Conclusion

In this study, we propose a multi-agent reinforcement learning method called STEP, which is inspired by the Stackelberg leadership model, multi-task learning, and high-dimensional continuous action decomposition for single agents. The core insight behind STEP is to make sequential decisions in both the temporal and spatial domains, which drives MARL to converge to the Stackelberg equilibrium. During the centralized training phase, we construct an efficient interaction mechanism, namely the spatio-temporal sequential Markov game, and utilize a shared N-level policy model to learn policies that help the agents coordinate efficiently during the decentralized execution phase. The experimental results demonstrate that STEP achieves state-of-the-art performance. We believe that asynchronous action coordination and the spatio-temporal sequential decision-making model have further development potential, and that how to leverage the advantages of the game structure is a topic worthy of further research.
