Cooperative Multi-Agent Deep Reinforcement Learning Methods for UAV-aided Mobile Edge Computing Networks

Mintae Kim, Hoon Lee, Sangwon Hwang, Mérouane Debbah, and Inkyu Lee

M. Kim, S. Hwang, and I. Lee are with the School of Electrical Engineering, Korea University, Seoul 02841, Korea (e-mail: {wkd2749, tkddnjs3510, inkyu}@korea.ac.kr). H. Lee is with the Department of Electrical Engineering and the AI Graduate School, Ulsan National Institute of Science and Technology (UNIST), Ulsan 44919, Korea (e-mail: [email protected]). M. Debbah is with Khalifa University of Science and Technology, P.O. Box 127788, Abu Dhabi, UAE (e-mail: [email protected]).

Abstract

This paper presents a cooperative multi-agent deep reinforcement learning (MADRL) approach for unmanned aerial vehicle (UAV)-aided mobile edge computing (MEC) networks. A UAV with computing capability can provide task offloading services to ground internet-of-things devices (IDs). With partial observation of the entire network state, the UAV and the IDs individually determine their MEC strategies, i.e., UAV trajectory, resource allocation, and task offloading policy. This requires joint optimization of the decision-making process and the coordination strategies among the UAV and the IDs. To address this difficulty, the proposed cooperative MADRL approach computes two types of action variables, namely message actions and solution actions, each of which is generated by dedicated actor neural networks (NNs). As a result, each agent can automatically encapsulate its coordination messages to enhance the MEC performance in a decentralized manner. The proposed actor structure is designed based on graph attention networks such that it can operate regardless of the number of IDs. A scalable training algorithm is also proposed to train a group of NNs for arbitrary network configurations. Numerical results demonstrate the superiority of the proposed cooperative MADRL approach over conventional methods.

Index Terms:
Reinforcement learning, Graph attention network, UAV mobile edge computing.
978-1-6654-3540-6/22 © 2022 IEEE

I Introduction

Mobile edge computing (MEC) systems have been regarded as promising solutions to provide remote computation services for internet-of-things (IoT) networks with the aid of edge servers [1, 2, 3, 4, 5, 6, 7, 8, 9]. Edge servers mounted on unmanned aerial vehicles (UAVs) can further enhance the MEC performance by decreasing the access distance between servers and IoT devices (IDs) [10, 11, 12]. The integration of IoT devices with the MEC enables more responsive and efficient handling of data, addressing latency-sensitive and bandwidth-intensive applications such as smart cities, healthcare monitoring, and industrial automation [13, 14, 15]. On the other hand, the mobility of UAVs incurs time-varying system dynamics, e.g., highly fluctuating propagation statistics. To tackle this difficulty, a number of studies have utilized deep reinforcement learning (DRL) approaches [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35].

Centralized DRL frameworks were developed for optimizing the trajectory of UAV servers and resource allocation strategies [16, 20, 19, 18, 21, 17, 22, 23, 24, 25]. Deep Q-network (DQN) methods in [16, 17, 19, 20, 18, 21] characterized these optimization variables as actions at an agent, i.e., UAV servers and IDs, which is realized by neural networks (NNs). Since the DQN is confined to discrete action spaces, it fails to identify continuous-valued UAV trajectory. Thus, suitable DRL schemes which can handle continuous action space were introduced such as deep deterministic policy gradient (DDPG), twin delayed DDPG (TD3), and proximal policy optimization (PPO) [22, 23, 24, 25] based on an actor-critic architecture. The actor NN generates actions of UAV agents and ID agents, e.g., trajectory and resource allocation solutions, whereas the critic NN evaluates the effectiveness of the actor NN. By doing so, continuous-valued optimization variables can be successfully obtained for the UAV-based MEC systems.

I-A Motivations

In practical UAV-aided MEC networks, distributed operation is desirable so that observing environment states and inferring actions can be split across multiple UAVs and IDs. Existing centralized DRL methods [22, 23, 24, 25] combine all UAVs and IDs into a single agent that handles all states and actions. Such a single-agent deep reinforcement learning (SADRL) approach performs state collection and decision making in a centralized manner, which is infeasible for supporting a massive number of IDs. Also, a single actor-critic NN can only be trained for a specific MEC network with a fixed number of UAVs and IDs, and thus it cannot be straightforwardly applied to larger MEC systems.

One viable approach for distributed UAV-aided MEC systems is multi-agent DRL (MADRL) [26, 27, 28, 29, 30, 31, 32, 33, 34, 35], which adopts multi-agent partially observable Markov decision processes (POMDP). Computing nodes at UAV servers and ground IDs are interpreted as agents which identify their actions based on partial observations of the overall MEC systems, e.g., channel state information, task volume, and locations. Various MADRL techniques such as the MAPPO [26, 27, 28], MADDPG [29, 30, 31, 32, 33] and MATD3 [34, 35] have shown their effectiveness for the decentralized management of UAVs and IDs.

As the multi-agent POMDP formalism lacks full knowledge of the entire MEC system, a coordination strategy among agents should be employed so that they can estimate the overall state from partial observations. Two popular methods include implicit coordination through reward shaping [26, 27, 28, 29, 30, 31, 32, 33] and explicit coordination through observation shaping [26, 27, 28, 29, 30, 31, 32, 33, 34, 35]. The former trains all the agents jointly using carefully designed reward functions such that individual agents can implicitly infer the knowledge of other agents during training. Since agents cooperate only in the training step via long-term statistics, the trained actor NNs are not able to accommodate highly fluctuating environments. In contrast, the latter allows agents to share messages, which are normally given as subsets or manipulations of partial observations. These messages can be conveyed through reliable control channels. Such an explicit message exchange mechanism can adapt to immediate environment changes at the expense of increased communication overhead.

Both approaches have proven to succeed in various configurations of UAV-based MEC networks. However, these approaches generally require time-consuming trial-and-error validation processes to check the feasibility of all possible combinations of partial observations and rewards. This challenge may become prohibitive as the dimension of observation statistics increases. Moreover, man-made agent coordination policies are not guaranteed to achieve good performance since input features of actor NNs normally resort to manual optimization. For these reasons, it is necessary to develop a new cooperative MADRL framework that can autonomously determine interaction strategies among agents, in particular, communication messages by leveraging NNs.

I-B Contributions and Organization

This paper investigates a cooperative MADRL framework for UAV-aided MEC networks where NNs at multiple agents determine their decision-making and coordination policies autonomously. We aim at minimizing the total energy consumption of ground IDs by offloading their computational tasks to a mobile edge server mounted on a UAV. This poses a joint optimization of the trajectory and computing resource allocation of the UAV along with the offloading decisions at the IDs. These optimization variables need to be computed by the UAV and the IDs individually. The considered problem is classified as a multi-agent POMDP where the UAV and ID agents collaboratively identify their action variables with only partial observations. Compared to existing MADRL frameworks which require a handcrafted design of UAV-ID coordination, the proposed scheme can handle highly fluctuating and heterogeneous network dynamics.

In this paper, we propose a novel cooperative multi-agent DDPG (C-MADDPG) framework where NNs at UAV and ID agents determine their policies by cooperating with other agents. Our system is first formulated as a cooperative multi-agent POMDP task which constructs two different actions for individual agents, namely, solution actions and message actions. The solution actions include optimization variables at agents, e.g., trajectory and resource allocation variables at the UAV agent and offloading variables at the ID agents. In addition, the message actions indicate communication messages to be exchanged among the UAV and ID agents. For effective information exchange, the proposed design establishes interactions in uplink (ID-to-UAV) and downlink (UAV-to-ID). Thus, along with the solution actions, the uplink message actions at IDs and the downlink message actions at the UAV are regarded as actions taken by actor NNs. This is a distinct feature of the proposed framework compared to conventional MADDPG approaches [36] where communication messages should be designed synthetically in advance and are fixed in all episodes of training steps.

In the proposed C-MADDPG, we build a solution actor NN and a message actor NN. In the inference stage, the UAV and ID agents calculate their solution actions by using the solution actor NN. Cooperative inference among these actor NNs establishes uplink-downlink coordination among the UAV and ID agents. To send encoded statistics of partial observations to the UAV agent, the ID agents utilize their message actor NNs that generate uplink message actions. These uplink message actions become an input to the message actor NN at the UAV agent, which creates the downlink message actions intended for individual ID agents. Once the coordination is completed, the UAV and ID agents calculate their solution actions by using the solution actor NNs. In this decision-making process, the message actions are leveraged as side inputs to the solution actor NNs so that the UAV and ID agents can collaboratively decide their optimization variables. Since the proposed cooperative inference does not need any centralized operations, we can construct decentralized optimization solutions for practical UAV-aided MEC networks.
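The message flow described above can be summarized in a few lines of Python. This is only an illustrative sketch: the four actor arguments are hypothetical placeholder callables standing in for the trained actor NNs, and the exact inputs of the UAV solution actor (here, its own observation plus all uplink messages) are an assumption rather than the architecture specified in this paper.

```python
def cooperative_inference(id_obs, uav_obs, id_msg_actor, uav_msg_actor,
                          id_sol_actor, uav_sol_actor):
    """Sketch of one decentralized inference round (hypothetical interfaces)."""
    # 1) Uplink: every ID encodes its partial observation into a message.
    uplink = [id_msg_actor(obs) for obs in id_obs]
    # 2) Downlink: the UAV maps the collected uplink messages (and its own
    #    observation) to one message per ID.
    downlink = uav_msg_actor(uav_obs, uplink)
    # 3) Solution actions: messages act as side inputs to the solution actors.
    #    Assumption: the UAV solution actor sees the uplink messages.
    uav_action = uav_sol_actor(uav_obs, uplink)
    id_actions = [id_sol_actor(obs, msg) for obs, msg in zip(id_obs, downlink)]
    return uav_action, id_actions
```

Note that no step requires a central node holding all observations: each ID only uses its own observation and the downlink message addressed to it.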

In the proposed cooperative actor NN architecture, the UAV agent aggregates the uplink message actions sent by all ID agents. Therefore, the input dimension of the actor NNs at the UAV agent, in particular, the message actor NN, scales with the number of IDs. For this reason, a naive NN structure whose input and output dimensions are fixed generalizes poorly with respect to the ID population. To achieve scalability, we exploit the concept of the graph attention network (GAT) [37], and adopt the parameter sharing technique where all ID agents utilize the identical actor NN. By doing so, the entire inference becomes independent of the number of IDs.
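To illustrate why attention-based aggregation removes the dependence on the ID population, the following sketch pools a variable number of uplink messages into a fixed-size vector. It uses a simplified dot-product attention with a hypothetical query vector; it is neither the GAT scoring function of [37] nor the actor structure proposed in this paper.

```python
import math

def attention_pool(messages, query):
    """Pool N message vectors into one vector of the same dimension.

    Simplified dot-product attention (an assumption for illustration):
    the output size does not depend on N, and the result is invariant
    to the ordering of the messages.
    """
    # Attention score of each message with respect to the query.
    scores = [sum(m_k * q_k for m_k, q_k in zip(m, query)) for m in messages]
    # Numerically stable softmax over the N scores.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the messages.
    dim = len(messages[0])
    return [sum(w * m[k] for w, m in zip(weights, messages)) for k in range(dim)]
```

Because the softmax weights are computed per message with shared parameters, the same function can be applied to any number of IDs without retraining-dependent input sizes.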

The training process of the proposed C-MADDPG requires joint optimization of solution actor NNs, message actor NNs, and critic NNs. To this end, we adopt the centralized training and decentralized execution (CTDE) strategy where all actor NNs are trained in an end-to-end manner under the supervision of the critic NN. The trained actor NNs are then deployed to the UAV and IDs for real-time decentralized inference. Consequently, the proposed parameter sharing policy can be implemented without additional communication overhead in the inference step. To further improve the scalability of the actor NNs, we develop a joint training process which leverages several episodes with an arbitrary number of ID agents. Also, we employ a random masking strategy that stochastically prunes input features of the critic NN. Numerical results validate the generalization ability of the proposed scheme and demonstrate the effectiveness of the proposed C-MADDPG framework over existing approaches.

The contributions of this paper are summarized as follows:

  • We propose a novel C-MADDPG framework which establishes self-organizing coordination strategies among UAV and ID agents. Compared to existing MADRL methods [26, 27, 28, 29, 30, 31, 32, 33, 34, 35] which design agent coordination messages manually, the proposed approach exploits message actor NNs to allow autonomous RL operations. This framework generates task-oriented agent interaction protocols that are optimized to enhance the expected reward function. Consequently, the proposed method does not require a handcrafted design of observations and rewards.

  • In practical UAV-aided MEC networks, the number of IDs may vary from time to time. This requires actor NNs whose inference calculations can be performed independently of the ID population. To this end, we develop scalable actor NNs based on the parameter sharing strategy where all IDs leverage identical NNs. However, a straightforward extension of the parameter sharing entails indistinguishable messages at all IDs. To address this difficulty, the GAT mechanism is employed to evaluate the importance of individual IDs, and the resulting actor NNs successfully scale with the number of IDs.

  • To further enhance the scalability, the training mechanism of the proposed C-MADDPG should be carefully designed such that shared actor NNs can observe various MEC configurations with arbitrary ID populations. Such randomized samples can be generated by using masking operations. We randomly prune IDs in the training step so that the actor NNs can be optimized over different MEC networks. This strategy helps the NNs to learn an efficient decentralized decision-making policy for arbitrary given ID populations.

The rest of the paper is organized as follows: Section II offers an overview of recent works on MADRL-based UAV MEC systems. In Section III, we describe a system model and formulate an optimization problem. Section IV provides a cooperative POMDP formulation. The proposed actor structure and its cooperative inference are presented in Section V. In Section VI, a joint training policy is introduced, and the performance evaluations are shown in Section VII. Finally, Section VIII concludes the paper.

II Related Works

Centralized single-agent DRL (SADRL) has been developed to tackle various optimization problems in UAV-aided MEC networks [16, 20, 19, 18, 21, 17, 22, 23, 24, 25]. In [16], the utility maximization problem of a UAV server was studied based on the DQN framework. The energy consumption was minimized by scheduling the offloading and the position of the UAV using DQN [17]. To obtain continuous-valued UAV trajectories, the DDPG and PPO were taken into account [22, 23, 24, 25]. The DDPG method [22] determined the UAV trajectory to minimize the energy consumption of IDs. The minimization of task completion latency and energy consumption was considered in [25].

To reduce the complexity of the SADRL, there has been a recent paradigm shift towards the MADRL [26, 27, 28, 29, 30, 31, 32, 33, 34, 35]. These methods offer an effective mechanism for dealing with cooperative or competitive interactions within intricate environments. In [26], the deployment of MEC-powered UAVs was optimized for sub-THz communication. A modified MAPPO was proposed in [27] to handle the energy consumption minimization problem effectively. A joint optimization problem of precoder, trajectory, and ID association was solved in an integrated sensing and communication network [28]. A fairness maximization problem was examined in [30] by optimizing trajectory and offloading decisions. In [31], total delay and energy consumption were minimized by adopting a stochastic game. The ratio of the transmission rate to the energy consumption of a UAV was optimized in [32] by combining game theory and MADDPG. The authors in [33] maximized the number of offloaded tasks while meeting heterogeneous quality-of-service requirements. The multi-UAV multi-cloud task offloading problems were addressed in [34].

These MADRL approaches [26, 27, 28, 29, 30, 31, 32, 33, 34, 35] generally require exhaustive search processes for identifying efficient rewards and observations heuristically. Weighted sum reward functions were considered in [27, 28, 32] where the optimized weights should be found numerically. The work in [31] designed the reward as the difference between the local and edge computing costs, whereas [33] employed the smoothed objective function as the reward. Also, in [34] and [35], the UAV agents exchange their current locations, and these communication messages contribute to the partial observation inputs of the actor NNs. These man-made agent coordination policies are computationally inefficient and even become infeasible for practical UAV MEC systems with a large number of heterogeneous observations and actions.

III Network Model

Figure 1: UAV-assisted MEC system model

As illustrated in Fig. 1, we consider a UAV-assisted MEC system where the UAV server flies over the network area to offer computation offloading services for $N$ mobile IDs on the ground. A time-slotted MEC protocol is adopted where the system block is divided into $T$ time slots. Let $\mathcal{N}\triangleq\{1,\cdots,N\}$ be the index set of IDs. ID $j$ ($\forall j\in\mathcal{N}$) desires to handle its computational task of size $I_j$ bits within one system block consisting of $T$ time slots.

III-A Mobility Model

Let $\mathbf{q}_j^{(t)}=(q_{x,j}^{(t)},q_{y,j}^{(t)},0)\in\mathbb{R}^{3}$ be the 3D Cartesian coordinate of ID $j$ at time slot $t$ ($t=1,\cdots,T$). Mobile IDs change their positions from time to time according to predefined missions. The randomness in ID positions can be modeled by the Gauss-Markov process [38]. At time slot $t$, the location of ID $j$ is written as

$$q_{x,j}^{(t)} = q_{x,j}^{(t-1)} + \tau v_j^{(t)} \cos o_j^{(t)}, \tag{1a}$$
$$q_{y,j}^{(t)} = q_{y,j}^{(t-1)} + \tau v_j^{(t)} \sin o_j^{(t)}, \tag{1b}$$

where $\tau$ represents the duration of a time slot, and $v_j^{(t)}$ and $o_j^{(t)}$ indicate the speed and moving direction of ID $j$, respectively. Here, $v_j^{(t)}$ and $o_j^{(t)}$ are updated as

$$v_j^{(t)} = \kappa_v v_j^{(t-1)} + (1-\kappa_v)\bar{v}_j + \sqrt{1-\kappa_v^{2}}\,\Phi_v, \tag{2a}$$
$$o_j^{(t)} = \kappa_o o_j^{(t-1)} + (1-\kappa_o)\bar{o}_j + \sqrt{1-\kappa_o^{2}}\,\Phi_o, \tag{2b}$$

where $\kappa_v\in[0,1]$ and $\kappa_o\in[0,1]$ stand for the memory factors and $\bar{v}_j$ and $\bar{o}_j$ are the average speed and direction of ID $j$, respectively. The independent Gaussian random variables $\Phi_v\sim N(0,\varsigma_v^{2})$ and $\Phi_o\sim N(0,\varsigma_o^{2})$ characterize the randomness of the ID mobility. In the meantime, the UAV trajectory is optimized to enhance the MEC performance. Let us define $\beta^{(t)}\in[0,2\pi]$ and $\eta^{(t)}\in[0,\pi]$ respectively as the azimuth angle and the elevation angle of the UAV at time slot $t$. Then, the moving direction of the UAV $\mathbf{\Delta}^{(t)}\in\mathbb{R}^{3}$ can be expressed as

$$\mathbf{\Delta}^{(t)} = (\sin\beta^{(t)}\cos\eta^{(t)},\ \sin\beta^{(t)}\sin\eta^{(t)},\ \cos\beta^{(t)}). \tag{3}$$

As a result, the 3D location vector of the UAV is obtained as

$$\mathbf{u}^{(t)} = \mathbf{u}^{(t-1)} + \tau v^{(t)} \mathbf{\Delta}^{(t)}, \tag{4}$$

where $v^{(t)}\in[0,v_{\max}]$ denotes the UAV speed with $v_{\max}$ being the maximum speed.
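The mobility updates in (1a)-(4) can be sketched with the Python standard library as follows. The function and argument names are illustrative, the random generator `rng` is assumed to be a `random.Random` instance, and enforcing the speed bound by clipping is an assumption made here for the sketch, not a mechanism stated in this paper.

```python
import math
import random

def gauss_markov_step(q_xy, v_prev, o_prev, v_bar, o_bar, tau,
                      kappa_v, kappa_o, sigma_v, sigma_o, rng):
    """One Gauss-Markov update of an ID position, following (1a)-(2b)."""
    # (2a)-(2b): memory term + mean-reverting term + Gaussian innovation.
    v = (kappa_v * v_prev + (1 - kappa_v) * v_bar
         + math.sqrt(1 - kappa_v ** 2) * rng.gauss(0.0, sigma_v))
    o = (kappa_o * o_prev + (1 - kappa_o) * o_bar
         + math.sqrt(1 - kappa_o ** 2) * rng.gauss(0.0, sigma_o))
    # (1a)-(1b): move for one slot of duration tau along direction o.
    x = q_xy[0] + tau * v * math.cos(o)
    y = q_xy[1] + tau * v * math.sin(o)
    return (x, y), v, o

def uav_step(u_prev, speed, beta, eta, tau, v_max):
    """UAV position update following (3)-(4)."""
    # Assumption: the constraint 0 <= v <= v_max is enforced by clipping.
    speed = min(max(speed, 0.0), v_max)
    # (3): unit-norm moving direction built from the two angles.
    delta = (math.sin(beta) * math.cos(eta),
             math.sin(beta) * math.sin(eta),
             math.cos(beta))
    # (4): displacement over one slot.
    return tuple(u + tau * speed * d for u, d in zip(u_prev, delta))
```

With memory factors equal to one, the ID keeps its previous speed and direction exactly, whereas memory factors equal to zero yield a memoryless random motion around the average speed and direction.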

III-B Channel Model

We define $P_j^{(t)}$ as the line-of-sight (LoS) probability given by [39]

$$P_j^{(t)} = \frac{1}{1 + K_1 \exp\big(-K_2[\nu_j^{(t)} - K_1]\big)}, \tag{5}$$

where $K_1$ and $K_2$ are constants depending on the propagation environment and $\nu_j^{(t)}$ equals the elevation angle between the UAV and ID $j$. According to the air-to-ground propagation model [40, 41], the large-scale channel gain between the UAV and ID $j$ can be written as

$$h_j^{(t)} = \frac{\|\mathbf{u}^{(t)} - \mathbf{q}_j^{(t)}\|^{-\alpha}}{\rho_0\big(P_j^{(t)}\chi_{\text{LoS}} + (1-P_j^{(t)})\chi_{\text{NLoS}}\big)}, \tag{6}$$

where $\mathbf{q}_j^{(t)}$ is the 3D location vector of ID $j$, $\alpha$ represents the path loss exponent, $\rho_0$ indicates the reference path loss, and $\chi_{\text{LoS}}$ and $\chi_{\text{NLoS}}$ ($\chi_{\text{NLoS}}>\chi_{\text{LoS}}>1$) respectively account for the path loss of the LoS and non-LoS cases.
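As a minimal sketch, (5) and (6) can be evaluated as follows. The helper names are illustrative, and measuring the elevation angle $\nu_j^{(t)}$ in degrees is an assumption, since the unit is not stated here and depends on how $K_1$ and $K_2$ are defined.

```python
import math

def los_probability(nu, k1, k2):
    """LoS probability in (5); nu must use the unit assumed by k1 and k2."""
    return 1.0 / (1.0 + k1 * math.exp(-k2 * (nu - k1)))

def elevation_angle_deg(u, q):
    """Elevation angle of the UAV seen from the ID (assumption: degrees)."""
    horizontal = math.hypot(u[0] - q[0], u[1] - q[1])
    return math.degrees(math.atan2(u[2] - q[2], horizontal))

def channel_gain(u, q, alpha, rho0, chi_los, chi_nlos, k1, k2):
    """Large-scale channel gain in (6) for UAV position u and ID position q."""
    dist = math.dist(u, q)
    p_los = los_probability(elevation_angle_deg(u, q), k1, k2)
    # Average excess path loss over the LoS and non-LoS cases.
    excess = p_los * chi_los + (1.0 - p_los) * chi_nlos
    return dist ** (-alpha) / (rho0 * excess)
```

Since $\chi_{\text{NLoS}}>\chi_{\text{LoS}}$, a larger elevation angle raises the LoS probability and lowers the average excess loss, which is one reason the UAV position affects the gain beyond the distance term alone.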

We employ the time division duplexing protocol where uplink and downlink communication are realized over the reciprocal channel. The uplink and downlink rates are respectively expressed by

$$R_{u,j}^{(t)} = \frac{B}{N}\log_2\bigg(1 + \frac{N p_u h_j^{(t)}}{B N_0}\bigg), \tag{7}$$
$$R_{d,j}^{(t)} = \frac{B}{N}\log_2\bigg(1 + \frac{N p_d h_j^{(t)}}{B N_0}\bigg), \tag{8}$$

where $p_u$ and $p_d$ equal the uplink and downlink transmit power at the IDs and the UAV, respectively, $B$ denotes the total bandwidth and $N_0$ stands for the noise power.
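The rate expressions (7) and (8) share the same form and differ only in the transmit power, so a single helper suffices. The sketch below uses illustrative argument names; the total bandwidth is split equally among the $N$ IDs as in the expressions above.

```python
import math

def link_rate(bandwidth, num_ids, tx_power, gain, n0):
    """Achievable rate in (7)-(8) with an equal bandwidth share B/N per ID."""
    # Signal-to-noise ratio over the B/N sub-band allocated to one ID.
    snr = num_ids * tx_power * gain / (bandwidth * n0)
    return bandwidth / num_ids * math.log2(1.0 + snr)
```

Calling `link_rate` with `tx_power` set to $p_u$ gives the uplink rate $R_{u,j}^{(t)}$, and setting it to $p_d$ gives the downlink rate $R_{d,j}^{(t)}$.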

III-C Offloading Process

ID $j$ splits its task into $T$ subtasks each having $I_j/T$ bits. Each subtask must be completed within one time slot of duration $\tau$. At the beginning of each time slot, the IDs determine their task offloading policies based on the partial offloading protocol. ID $j$ offloads a portion $\lambda_j^{(t)}\in[0,1]$ of the $I_j/T$ bits to the UAV server, whereas the remaining portion $1-\lambda_j^{(t)}$ is processed locally. The energy consumption $E_{l_j}^{(t)}$ required for the local processing of $(1-\lambda_j^{(t)})I_j/T$ bits is written as [42]

$$E_{l_j}^{(t)} = \vartheta\frac{\left(C(1-\lambda_j^{(t)}) I_j\right)^3}{\tau^2 T^3}, \tag{9}$$

where the constants $\vartheta$ and $C$ account for the hardware efficiency and the computational complexity, respectively.
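The cubic form of (9) can be checked numerically. The sketch below assumes the common dynamic-power reading of (9), in which the ID runs its CPU at the lowest frequency $C(1-\lambda_j^{(t)})I_j/(T\tau)$ that finishes the local bits within one slot and consumes $\vartheta f^3$ watts while doing so. This interpretation, the function names, and the numerical values are assumptions made for illustration, not statements taken from [42].

```python
def local_energy(theta, c, lam, task_bits, tau, num_slots):
    """Local computing energy of (9) for one ID in one time slot."""
    cycles_total = c * (1.0 - lam) * task_bits
    return theta * cycles_total ** 3 / (tau ** 2 * num_slots ** 3)


def local_energy_from_frequency(theta, c, lam, task_bits, tau, num_slots):
    """Same quantity written as power (theta * f^3) times the slot length tau."""
    freq = c * (1.0 - lam) * task_bits / (num_slots * tau)  # cycles/sec
    return theta * freq ** 3 * tau


# Hypothetical parameters: theta = 1e-27, C = 1000 cycles/bit, I_j = 1 Mbit.
args = dict(theta=1e-27, c=1000.0, task_bits=1e6, tau=1.0, num_slots=10)
for lam in (0.0, 0.5, 1.0):
    print(lam, local_energy(lam=lam, **args))
```

Because the energy is cubic in the locally processed bits, halving the local portion reduces the local energy by a factor of eight, which is why even a moderate offloading ratio can save a large share of the ID energy.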

To offload $\lambda_j^{(t)}\frac{I_j}{T}$ bits to the UAV, the communication energy $E_{o_j}^{(t)}$ of ID $j$ is given by

$$E_{o_j}^{(t)} = \frac{p_u \lambda_j^{(t)} I_j}{R_{u,j}^{(t)} T}. \tag{10}$$

The computation capacity of the UAV is limited by the maximum CPU frequency $f_{\max}$. For parallel computations, a virtual machine (VM) [43] with the CPU frequency $f_j^{(t)}$ is dedicated to processing the task offloaded from ID $j$. This imposes the sum CPU frequency constraint as

$$\sum_{j=1}^{N} f_j^{(t)} \leq f_{\max},\ \forall t. \tag{11}$$

The latency of the offloading procedure at ID $j$ comprises the delays of the uplink task offloading from ID $j$ to the UAV, the task computation at the UAV, and the downlink transmission from the UAV to ID $j$ as

$$L_{o_j}^{(t)} = \frac{\lambda_j^{(t)} I_j}{R_{u,j}^{(t)} T} + \frac{C\lambda_j^{(t)} I_j}{f_j^{(t)} T} + \frac{\delta\lambda_j^{(t)} I_j}{R_{d,j}^{(t)} T}, \tag{12}$$

where the first term represents the uplink transmission delay of the offloaded task of size $\lambda_j^{(t)} I_j/T$ at the rate of $R_{u,j}^{(t)}$ bits/sec, the second term indicates the computation latency with the CPU frequency of $f_j^{(t)}$ cycles/sec, and the third term quantifies the downlink transmission delay for returning the computation output of size $\delta\lambda_j^{(t)} I_j/T$ at the rate of $R_{d,j}^{(t)}$ bits/sec, with the constant $\delta$ being the ratio of output to input task sizes. Finally, the latency constraint is imposed as

$$L_{o_j}^{(t)} \leq \tau,\ \forall j, t. \tag{13}$$
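The offloading quantities in (10)-(13) can be evaluated together as in the following minimal sketch. All function names and numerical values are hypothetical. Returning a zero latency when nothing is offloaded is an assumption made here to avoid dividing by a zero CPU frequency; the text does not specify this case.

```python
def offload_energy(p_u, lam, task_bits, rate_up, num_slots):
    """Uplink transmission energy of (10)."""
    return p_u * lam * task_bits / (rate_up * num_slots)


def offload_latency(lam, task_bits, num_slots, rate_up, rate_down, freq, c, delta):
    """Offloading latency of (12): uplink + UAV computation + downlink."""
    bits = lam * task_bits / num_slots  # offloaded bits in this slot
    if bits == 0.0:
        return 0.0  # assumption: nothing offloaded means no offloading delay
    return bits / rate_up + c * bits / freq + delta * bits / rate_down


def satisfies_constraints(latencies, freqs, tau, f_max):
    """Check the sum-frequency constraint (11) and the latency constraint (13)."""
    return sum(freqs) <= f_max and all(lat <= tau for lat in latencies)


# Hypothetical single-ID example.
lat = offload_latency(lam=0.5, task_bits=8e5, num_slots=10, rate_up=1e6,
                      rate_down=2e6, freq=1e8, c=100.0, delta=0.5)
energy = offload_energy(p_u=0.1, lam=0.5, task_bits=8e5, rate_up=1e6, num_slots=10)
print(lat, energy, satisfies_constraints([lat], [1e8], tau=0.1, f_max=1e9))
```

Note that (10) equals the uplink transmit power multiplied by the first term of (12), so a higher uplink rate reduces both the offloading energy and the offloading latency.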

III-D Problem Description

We aim at minimizing the total energy consumption of all IDs through the joint optimization of the UAV trajectory $\mathbf{U} = \{v^{(t)}, \eta^{(t)}, \beta^{(t)}, \forall t\}$, the computing resource allocation $\mathbf{F} = \{f_j^{(t)}, \forall j, t\}$, and the offloading ratio $\mathbf{\Lambda} = \{\lambda_j^{(t)}, \forall j, t\}$. The total energy minimization problem, referred to as $(\mathbf{P})$ in the sequel, can be formulated as

$$\begin{aligned}
\min_{\mathbf{U},\mathbf{F},\mathbf{\Lambda}} \quad & \frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{N}\left(E_{l_j}^{(t)} + E_{o_j}^{(t)}\right) \\
\text{s.t.} \quad & v^{(t)} \in [0, v_{\max}],\ \eta^{(t)} \in [0, 2\pi),\ \beta^{(t)} \in [0, \pi], && \text{(14a)} \\
& f_j^{(t)} \geq 0, && \text{(14b)} \\
& \lambda_j^{(t)} \in [0, 1], && \text{(14c)} \\
& \text{(11) and (13)}. && \text{(14d)}
\end{aligned}$$

The above problem is nonconvex due to the latency constraint and the objective function. Existing MADRL methods [26, 27, 28, 29, 30, 31, 32, 33, 34, 35] resort to computationally demanding exhaustive search processes for designing agent interaction mechanisms. To overcome this difficulty, we propose a novel cooperative MADRL framework that identifies optimization variables and coordination strategies autonomously by using NNs.

IV Cooperative Multi-Agent POMDP Formulation

We introduce a C-MADDPG scheme for addressing $(\mathbf{P})$, where the UAV and the IDs act as individual agents that determine their own decision variables. The UAV and the IDs interact with the environment, i.e., the MEC network, independently. This leads to a cooperative POMDP formulation where each agent can access only partial knowledge of the current environment. In what follows, we transform $(\mathbf{P})$ into a multi-agent POMDP task consisting of states, observations, actions, and rewards.

1) Observations and States

We denote the UAV as agent 0 and ID $j$ as agent $j$. Let $\tilde{\mathcal{N}} \triangleq \{0\} \cup \mathcal{N}$ be the set of all agents. The partial observation $o_j^{(t)}$ of agent $j \in \tilde{\mathcal{N}}$ at time slot $t$ consists of the information about the MEC network that can be observed by agent $j$. For the UAV agent, the observation $o_0^{(t)}$ is set to its previous location $\mathbf{u}^{(t-1)}$ as

$$o_0^{(t)} = \mathbf{u}^{(t-1)}. \tag{15}$$

In contrast, ID agent $j$ forms its observation $o_j^{(t)}$ as

$$o_j^{(t)} = \left\{\mathbf{q}_j^{(t-1)},\ (1-\lambda_j^{(t-1)})\frac{I_j}{T},\ \lambda_j^{(t-1)}\frac{I_j}{T},\ I_j,\ R_{u,j}^{(t-1)}\right\}. \tag{16}$$

As a result, the state $s^{(t)}$ of the MEC network collects all observations as

$$s^{(t)} \triangleq \{o_j^{(t)} : \forall j \in \tilde{\mathcal{N}}\}. \tag{17}$$

2) Solution Actions

The solution action $x_j^{(t)}$ contains the set of optimization variables determined by agent $j$. As discussed, the solution action $x_0^{(t)}$ of the UAV agent comprises the trajectory and computing resource allocation variables as

$$x_0^{(t)} = \{\mathbf{v}^{(t)}, \mathbf{f}^{(t)}\}, \tag{18}$$

where $\mathbf{v}^{(t)} = \{v^{(t)}, \eta^{(t)}, \beta^{(t)}\}$ is the trajectory variable and $\mathbf{f}^{(t)} \triangleq \{f_j^{(t)} : \forall j \in \mathcal{N}\}$ stands for the collection of the CPU frequencies allocated to all IDs. Also, ID agent $j$ determines its own offloading decision variable $\lambda_j^{(t)}$. Thus, the solution action $x_j^{(t)}$ of ID agent $j$ becomes

$$x_j^{(t)} = \lambda_j^{(t)}. \tag{19}$$

3) Reward

The reward function $r^{(t)}$ evaluates the performance of the MEC network at time slot $t$. Since our aim is to minimize the energy consumption of all IDs, the reward $r^{(t)}$ is set to

$$r^{(t)} = -\sum_{j=1}^{N}\left(E_{l_j}^{(t)} + E_{o_j}^{(t)}\right). \tag{20}$$
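The link between the reward (20) and the objective of the energy minimization problem can be made explicit: the negative average of the per-slot rewards over the $T$ time slots equals the average total energy in (14). The following minimal sketch illustrates this with hypothetical energy values; the function names are also assumptions introduced for illustration.

```python
def reward(local_energies, offload_energies):
    """Shared reward of (20): negative total energy of all IDs in one slot."""
    if len(local_energies) != len(offload_energies):
        raise ValueError("need one local and one offloading energy per ID")
    return -sum(e_l + e_o for e_l, e_o in zip(local_energies, offload_energies))


def average_energy(rewards):
    """Objective value of (14) recovered from the per-slot rewards."""
    return -sum(rewards) / len(rewards)


# Hypothetical energies (in joules) of N = 2 IDs over T = 2 time slots.
r1 = reward([0.5, 0.25], [0.125, 0.125])
r2 = reward([1.0, 0.5], [0.25, 0.25])
print(r1, r2, average_energy([r1, r2]))
```

Since all agents receive the same scalar reward, maximizing the accumulated reward and minimizing the total ID energy are aligned objectives.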

4) Message Actions

The considered POMDP formulation can be addressed by the conventional MADDPG framework [36]. In this approach, agent $j$ ($\forall j \in \tilde{\mathcal{N}}$) is equipped with its own actor NN, which produces the solution action $x_j^{(t)}$ from the partial observation $o_j^{(t)}$. The major drawback of the conventional MADDPG comes from the limited agent interaction. Since the solution actions of all agents are highly coupled in the UAV-aided MEC networks, coordination among the UAV and IDs is essential for identifying the optimal solution to $(\mathbf{P})$. Nevertheless, the actor NN only accepts the partial observation $o_j^{(t)}$ as an input, and thus the resulting solution action $x_j^{(t)}$ is determined without knowing the observations of other agents.

To cope with this issue, along with the decision processes of the solution actions, we develop a coordination policy among the UAV and ID agents, which is realized by additional message actions $m_j^{(t)}$ ($\forall j \in \tilde{\mathcal{N}}$). The message action of agent $j$ should be designed to encapsulate the sufficient statistics of agent $j$ needed for the individual decision-making processes at the other agents. Messages of ID agents are shared with the UAV through uplink control channels. Similarly, the UAV multicasts its message action $m_0^{(t)}$ to all IDs via downlink control channels. As will be explained, the message actions are determined using additional actor NNs, and the resulting message actions are leveraged as side information to decide the solution actions. Thus, the overall action of agent $j$ consists of both the solution action and the message action as

$$a_j^{(t)} = \{x_j^{(t)}, m_j^{(t)}\}. \tag{21}$$

V Cooperative Actor Design

Figure 2: Proposed cooperative actor architecture

The cooperative multi-agent POMDP formulation presented in Section IV establishes the C-MADDPG framework, which determines the solution actions $x_j^{(t)}$ ($\forall j \in \tilde{\mathcal{N}}$) through decentralized coordination among the UAV and ID agents. In the C-MADDPG method, this is achieved by employing actor NNs at individual agents. To determine the message actions $m_j^{(t)}$ ($\forall j \in \tilde{\mathcal{N}}$), we employ additional actor NNs that produce the message action variables. As illustrated in Fig. 2, the proposed architecture deploys two types of actors: message actors and solution actors. The message actors determine the agent coordination, whereas the solution actors compute appropriate solution variables based on the received message actions. Such a cooperative actor architecture leads to the joint optimization of the two types of actions in a goal-oriented manner for maximizing the expected reward.

The proposed actor design raises a scalability issue with respect to the network size, in particular, the number of IDs $N$. If dedicated message and solution actors are trained for every ID, the number of trainable parameters grows in proportion to $2N$. Such dedicated actor NNs also lack the ability to generalize to an arbitrary $N$. Moreover, the sizes of the state $s^{(t)}$ and the message actions $\{m_j^{(t)} : \forall j \in \mathcal{N}\}$ grow with $N$, which increases the model complexity required for the critic and actor NNs to handle high-dimensional states and actions. To address these difficulties, in this section, we develop a cooperative and scalable actor structure whose message-generating and solution-optimizing computations are independent of the number of IDs. In what follows, we discuss the inference steps of the message actors and solution actors.

V-A Message Actors at ID agents

ID agent $j$ ($\forall j \in \mathcal{N}$) first obtains its message action $m_j^{(t)}$ from the message actor NN $\mu_I(\cdot;\varphi_I)$ with trainable parameter $\varphi_I$ based on its partial observation $o_j^{(t)}$. The message action $m_j^{(t)}$ of ID agent $j$ ($\forall j \in \mathcal{N}$) is then expressed as

$$m_j^{(t)} = \mu_I(o_j^{(t)}; \varphi_I), \tag{22}$$

where the identical message actor $\mu_I(\cdot;\varphi_I)$ is employed for all ID agents. Such a parameter sharing policy leads to a scalable structure in which a single message actor can be applied to an arbitrary ID population.
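The parameter sharing in (22) can be sketched as follows. This toy example replaces the message actor NN with a single tanh layer and flattens each observation into six numbers (assuming a two-dimensional ID position), with a message dimension of two. The layer structure, the dimensions, and all names are hypothetical and serve only to show that one parameter set $\varphi_I$ handles any number of IDs.

```python
import math
import random


def init_message_actor(obs_dim, msg_dim, seed=0):
    """Create the shared parameters phi_I of a toy one-layer message actor."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(obs_dim)] for _ in range(msg_dim)]
    bias = [0.0] * msg_dim
    return weights, bias


def message_actor(obs, phi):
    """Map one observation o_j to a message m_j, as in (22)."""
    weights, bias = phi
    return [math.tanh(sum(w * o for w, o in zip(row, obs)) + b)
            for row, b in zip(weights, bias)]


def num_parameters(phi):
    """Count the trainable scalars in phi."""
    weights, bias = phi
    return sum(len(row) for row in weights) + len(bias)


phi_I = init_message_actor(obs_dim=6, msg_dim=2)  # one parameter set for all IDs
for num_ids in (3, 50):
    observations = [[0.01 * (j + k) for k in range(6)] for j in range(num_ids)]
    messages = [message_actor(o, phi_I) for o in observations]
    print(num_ids, len(messages), num_parameters(phi_I))
```

The parameter count stays the same for 3 and 50 IDs, whereas dedicated per-ID actors would require a separate parameter set for every ID.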

The set of ID messages $\{m_j^{(t)} : \forall j \in \mathcal{N}\}$ is conveyed to the UAV agent through orthogonal uplink control channels. Without loss of generality, each ID is assumed to be assigned $M$ frequency resource blocks (RBs) to transmit its message action. To accommodate such a resource constraint, we design $m_j^{(t)}$ as an $M$-dimensional vector where the transmission of each element occupies one RB. As a result, a total of $NM$ RBs are needed for the uplink coordination.

V-B Message Actor at UAV agent

After receiving the messages $\mathbf{m}_I^{(t)} \triangleq \{m_j^{(t)} : \forall j \in \mathcal{N}\}$, the UAV agent computes its message action $m_0^{(t)}$ to be multicast to all IDs through the downlink control channels. This UAV message action encodes the knowledge required for the decision processes of the solution action $x_j^{(t)}$ at ID agent $j \in \mathcal{N}$. To this end, the UAV agent combines its observation $o_0^{(t)}$ and the group of ID message actions $\mathbf{m}_I^{(t)}$. By doing so, the UAV propagates the local information of individual IDs to the entire MEC network. Similar to the ID message actors, the UAV message actor $\mu_U(\cdot;\varphi_U)$ with trainable parameter $\varphi_U$ is adopted to produce the UAV message action $m_0^{(t)}$ as

$$m_0^{(t)} = \mu_U(o_0^{(t)}, \mathbf{m}_I^{(t)}; \varphi_U). \tag{23}$$

The dimension of the ID message actions $\mathbf{m}_I^{(t)} \in \mathbb{R}^{NM}$ scales with the number of IDs $N$. For this reason, a naive NN architecture, in particular one built on fully-connected layers whose input size is fixed at training time, fails to achieve the scalability with respect to $N$.

To overcome this issue, we develop a scalable UAV message actor $\mu_U(\cdot;\varphi_U)$ based on the GAT [37]. This framework modifies the node interaction policies of the graph neural network (GNN) [44] such that each node can measure the importance of its neighbors, which is referred to as an attention score. As a result, the generalization capability can be improved without sacrificing the scalability to the node population. Thus, the design goal of the message actor (23) is to aggregate the observation $o_0^{(t)}$ of the UAV agent and the ID messages $\mathbf{m}_I^{(t)}$ based on the importance of individual IDs.
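The attention idea behind the GAT can be illustrated with a generic sketch that is not the exact actor of this paper. It scores each ID feature against a UAV feature with the LeakyReLU scoring rule of the original GAT, normalizes the scores with a softmax, and returns the weighted sum. The weight vector `a`, the feature values, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(query, neighbors, a):
    """Score every neighbor against the query, then return the weighted sum."""
    logits = []
    for feat in neighbors:
        s = sum(w * x for w, x in zip(a, query + feat))  # a^T [query || feat]
        logits.append(s if s > 0 else 0.2 * s)           # LeakyReLU, slope 0.2
    scores = softmax(logits)
    pooled = [sum(z * feat[d] for z, feat in zip(scores, neighbors))
              for d in range(len(neighbors[0]))]
    return scores, pooled


a = [0.3, -0.1, 0.2, 0.4]   # hypothetical weights (learned in practice)
uav_feature = [1.0, 0.5]    # hypothetical feature of the UAV node
for num_ids in (3, 7):
    id_features = [[0.1 * j, 1.0 - 0.1 * j] for j in range(num_ids)]
    scores, pooled = attention_aggregate(uav_feature, id_features, a)
    print(num_ids, [round(z, 3) for z in scores], pooled)
```

The same weight vector is reused for 3 and 7 IDs, and the aggregated output keeps a fixed length, which is the property that makes attention-based aggregation independent of the node population.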

In general, the GAT stacks multiple layers to extract useful features of the input data. Realizing such a layered GAT architecture requires several communication rounds among the UAV and IDs to share the results of each GAT iteration. To avoid this overhead, we reduce conventional multi-iteration GAT architectures to a single-iteration GAT that needs only one round of uplink-downlink coordination. Also, since the DRL involves temporal connections of the actor NNs over consecutive time slots, our single-iteration GAT design becomes more powerful in the UAV-aided MEC systems.

A key enabler of the proposed GAT approach is that the UAV agent hosts the NN modules for computing the attention scores of all ID agents. By doing so, we can straightforwardly implement the GAT mechanism without propagating latent vectors of hidden layers multiple times. The UAV agent first extracts a hidden feature of agent $j \in \tilde{\mathcal{N}}$, denoted by $\mathbf{e}_j^{(t)}$ of length $E$, as

$$\mathbf{e}_j^{(t)} = \begin{cases} \epsilon_U(o_0^{(t)}; \delta_U) & \text{for } j = 0, \\ \epsilon_I(m_j^{(t)}; \delta_I) & \text{for } j \in \mathcal{N}, \end{cases} \tag{24}$$

where $\epsilon_{U}(\cdot;\delta_{U})$ and $\epsilon_{I}(\cdot;\delta_{I})$ indicate the feature extractor NNs of the UAV agent and ID agents, respectively, which are responsible for generating $\mathbf{e}_{j}^{(t)}$ utilized for the computation of the attention scores. The group of feature vectors $\{\mathbf{e}_{j}^{(t)}:\forall j\in\tilde{\mathcal{N}}\}$ is adopted as the input to the GAT operation. A scalar attention score $z_{j,k}^{(t)}\in[0,1]$ about agent $k$ measured by agent $j$ is calculated as

$$z_{j,k}^{(t)}=\frac{\exp\big(\epsilon_{A}(\mathbf{e}_{j}^{(t)},\mathbf{e}_{k}^{(t)};\delta_{A})\big)}{\sum_{i\in\tilde{\mathcal{N}}}\exp\big(\epsilon_{A}(\mathbf{e}_{j}^{(t)},\mathbf{e}_{i}^{(t)};\delta_{A})\big)}, \tag{25}$$

where $\epsilon_{A}(\cdot;\delta_{A})$ stands for the attention NN that evaluates the affinity of two different agents. The attention score $z_{j,k}^{(t)}$ is interpreted as the importance of agent $k$ for the decision-making process at agent $j$. Finally, the output of the GAT for agent $j$ becomes the weighted average of the feature vectors $\{\mathbf{e}_{k}^{(t)}:\forall k\in\tilde{\mathcal{N}}\}$ with coefficients $\{z_{j,k}^{(t)}:\forall k\in\tilde{\mathcal{N}}\}$ as

$$\mathbf{w}_{j}^{(t)}=\sum_{k\in\tilde{\mathcal{N}}}z_{j,k}^{(t)}\mathbf{e}_{k}^{(t)}. \tag{26}$$

Notice that in conventional GAT, each agent has its dedicated feature extractor NN. Thus, to obtain the attention score $z_{j,k}^{(t)}$ in (25), ID agent $j$ should know all the feature vectors $\mathbf{e}_{k}^{(t)}$ ($\forall k\in\tilde{\mathcal{N}}$), which is not viable with sole uplink cooperation from the IDs to the UAV. This can be addressed by allowing the UAV agent to reuse the feature extractor NN $\epsilon_{I}(\cdot;\delta_{I})$ for all IDs.
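The single-iteration GAT of (24)-(26) can be illustrated with a minimal pure-Python sketch. It assumes the feature vectors $\mathbf{e}_{j}^{(t)}$ have already been extracted and replaces the attention NN $\epsilon_{A}$ with a hypothetical inner-product stand-in (`dot_logit`). The function names and numeric values are illustrative assumptions, not the architecture used in the paper.

```python
import math

def attention_scores(features, logit_fn):
    """Row-wise softmax over pairwise logits, following the form of (25).

    features: list of feature vectors e_j (index 0 plays the role of the UAV).
    logit_fn: stand-in for the attention NN eps_A (an assumption of this sketch).
    """
    scores = []
    for e_j in features:
        logits = [logit_fn(e_j, e_k) for e_k in features]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        scores.append([x / total for x in exps])
    return scores

def gat_outputs(features, scores):
    """Weighted averages w_j = sum_k z_{j,k} e_k, following (26)."""
    dim = len(features[0])
    return [
        [sum(z_jk * e_k[d] for z_jk, e_k in zip(row, features)) for d in range(dim)]
        for row in scores
    ]

def dot_logit(e_j, e_k):
    # Hypothetical attention logit: a plain inner product, not a trained NN.
    return sum(a * b for a, b in zip(e_j, e_k))

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # e_0 (UAV), e_1, e_2
z = attention_scores(features, dot_logit)
w = gat_outputs(features, z)
```

Each row of `z` sums to one, so every `w[j]` is a convex combination of the feature vectors. Because all steps run on the UAV side, no latent vectors need to be exchanged between agents within this computation.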

Thanks to the GAT mechanism, $\mathbf{w}_{j}^{(t)}$ encodes sufficient statistics required at agent $j$ to take its solution action. Thus, it can be utilized as an input to the solution actor NNs. As will be discussed, $\mathbf{w}_{0}^{(t)}$ is leveraged internally to find the solution action of the UAV agent $x_{0}^{(t)}$. In contrast, the remaining vectors $\mathbf{w}_{j}^{(t)}$ ($\forall j\in\mathcal{N}$) need to be sent to the associated ID agents. To this end, the UAV message action $m_{0}^{(t)}$ is designed as

$$m_{0}^{(t)}=\{\mathbf{w}_{j}^{(t)}:\forall j\in\mathcal{N}\}. \tag{27}$$

The UAV multicasts $m_{0}^{(t)}\in\mathbb{R}^{NE}$ to the IDs, which occupies $NE$ RBs for the downlink coordination.

V-C Solution Actor at UAV agent

The UAV message $m_{0}^{(t)}$ encapsulates the network state $s^{(t)}$, i.e., the set of all observations of the UAV and ID agents. Therefore, it is sufficient for all agents to determine their solution actions by leveraging $m_{0}^{(t)}$ only. Let $\pi_{U}(\cdot;\theta_{U})$ be the solution actor NN at the UAV agent with the learnable parameter $\theta_{U}$. The UAV solution action $x_{0}^{(t)}$ is obtained as

$$x_{0}^{(t)}=\pi_{U}(m_{0}^{(t)};\theta_{U}). \tag{28}$$

As shown in (18), the solution action of the UAV agent $x_{0}^{(t)}$ contains two types of optimization variables, i.e., the trajectory $\mathbf{v}^{(t)}$ and the computing resource allocation $\mathbf{f}^{(t)}$. To yield such heterogeneous actions, the solution actor $\pi_{U}(\cdot;\theta_{U})$ comprises two component NNs $\gamma_{V}(\cdot;\zeta_{V})$ and $\gamma_{F}(\cdot;\zeta_{F})$ for calculating $\mathbf{v}^{(t)}$ and $\mathbf{f}^{(t)}$, respectively. Then, the trainable parameter set of the solution actor NN of the UAV becomes $\theta_{U}=\{\zeta_{V},\zeta_{F}\}$.

The trajectory variable is computed as

$$\mathbf{v}^{(t)}=\gamma_{V}\left(\mathbf{w}_{0}^{(t)},\sum_{j\in\mathcal{N}}\mathbf{w}_{j}^{(t)};\zeta_{V}\right), \tag{29}$$

where the input is a concatenation of the UAV information vector $\mathbf{w}_{0}^{(t)}\in\mathbb{R}^{E}$ obtained from the GAT output (26) and the sum of the ID information vectors $\sum_{j\in\mathcal{N}}\mathbf{w}_{j}^{(t)}\in\mathbb{R}^{E}$. Since the dimension of these input vectors is independent of $N$, (29) preserves the scalability with respect to the ID population. The distinct input $\mathbf{w}_{0}^{(t)}$ helps the NN $\gamma_{V}(\cdot;\zeta_{V})$ distinguish the UAV information vector from the aggregate of the IDs $\sum_{j\in\mathcal{N}}\mathbf{w}_{j}^{(t)}$. Consequently, we can produce the UAV-specific trajectory action $\mathbf{v}^{(t)}$ based on the aggregated ID information $\sum_{j\in\mathcal{N}}\mathbf{w}_{j}^{(t)}$.
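To make the scalability argument concrete, the following sketch builds the input of $\gamma_{V}$ in (29) for two different ID populations. The helper name `trajectory_input` and the numeric vectors are assumptions for illustration; the NN $\gamma_{V}$ itself is not modeled.

```python
def trajectory_input(w_uav, w_ids):
    """Concatenate w_0 with the element-wise sum of the ID vectors, as in (29)."""
    dim = len(w_uav)
    pooled = [sum(w[d] for w in w_ids) for d in range(dim)]
    return list(w_uav) + pooled

# Two IDs and five IDs yield inputs of the same length 2E (here E = 2).
x2 = trajectory_input([1.0, 2.0], [[0.5, 0.5], [1.5, -0.5]])
x5 = trajectory_input([1.0, 2.0], [[0.1, 0.2]] * 5)
```

Since the sum is invariant to the ordering of the IDs, the resulting input also does not depend on how the ID agents are indexed.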

Next, to determine the computing resource allocation $\mathbf{f}^{(t)}$, a single NN produces each CPU frequency variable $f_{j}^{(t)}$ based on the associated ID agent information vector $\mathbf{w}_{j}^{(t)}$. To this end, we first calculate an intermediate value $\tilde{f}_{j}^{(t)}$ as

$$\tilde{f}_{j}^{(t)}=\gamma_{F}(\mathbf{w}_{j}^{(t)};\zeta_{F}), \tag{30}$$

where the output activation of $\gamma_{F}(\cdot;\zeta_{F})$ is set to the rectified linear unit (ReLU) to yield a nonnegative number $\tilde{f}_{j}^{(t)}$. Upon obtaining all $\tilde{f}_{j}^{(t)}$, the CPU frequency action $f_{j}^{(t)}$ is retrieved as

$$f_{j}^{(t)}=\frac{\tilde{f}_{j}^{(t)}}{\sum_{k\in\mathcal{N}}\tilde{f}_{k}^{(t)}}f_{\max}, \tag{31}$$

which forces $\mathbf{f}^{(t)}$ to be feasible for the computing resource constraint (11).
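The normalization in (30)-(31) can be sketched as follows. The raw inputs stand in for the pre-activation outputs of $\gamma_{F}$, and the even split used when every ReLU output is zero is an assumption of this sketch, since the text does not discuss that edge case.

```python
def allocate_cpu(raw_outputs, f_max):
    """Map raw NN outputs to CPU frequencies that sum to f_max, as in (31)."""
    relu = [max(0.0, v) for v in raw_outputs]  # ReLU output activation of (30)
    total = sum(relu)
    if total == 0.0:
        # Edge case not covered in the text: split evenly (an assumption).
        return [f_max / len(relu)] * len(relu)
    return [f_max * v / total for v in relu]

freqs = allocate_cpu([0.5, -1.0, 1.5], f_max=4.0)
```

Every returned frequency is nonnegative and the entries add up to `f_max`, so the total budget is used exactly regardless of the number of IDs.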

V-D Solution Actor at ID agent

From the UAV message action $m_{0}^{(t)}$, ID agent $j$ first recovers its corresponding information vector $\mathbf{w}_{j}^{(t)}$. Let $\pi_{I}(\cdot;\theta_{I})$ be the solution actor NN of the ID agent with parameter $\theta_{I}$. Then, the ID solution action $x_{j}^{(t)}$ in (19), which equals the offloading ratio $\lambda_{j}^{(t)}$, is expressed as

$$x_{j}^{(t)}=\pi_{I}(\mathbf{w}_{j}^{(t)};\theta_{I}). \tag{32}$$

The latency constraint in (13) is always satisfied if $x_{j}^{(t)}$ lies within the bounded range $[0,\lambda_{\max,j}^{(t)}]$, where the upper bound $\lambda_{\max,j}^{(t)}$ is given by

$$\lambda_{\max,j}^{(t)}=\min\left\{1,\ \frac{\tau T}{I_{j}}\Big/\left(\frac{1}{R_{u,j}^{(t)}}+\frac{\delta}{R_{d,j}^{(t)}}+\frac{C}{f_{j}^{(t)}}\right)\right\}. \tag{33}$$

To guarantee $x_{j}^{(t)}\in[0,\lambda_{\max,j}^{(t)}]$, the output activation of $\pi_{I}(\cdot;\theta_{I})$ is set to the sigmoid function multiplied by $\lambda_{\max,j}^{(t)}$. To calculate $\lambda_{\max,j}^{(t)}$, ID agent $j$ needs to know its CPU frequency allocation $f_{j}^{(t)}$, which is, in fact, determined at the UAV agent. To this end, the IDs are assumed to be equipped with a copy of the UAV actor NN $\gamma_{F}(\cdot;\zeta_{F})$ in (30). This can be achieved by an offline training procedure which optimizes all actor NNs jointly before the real-time inference step, as will be discussed in Section VI.
Together with the UAV message $m_{0}^{(t)}$ and the normalization operation in (31), each ID agent readily obtains its CPU frequency $f_{j}^{(t)}$ without incurring additional communication with the UAV and other ID agents.
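The bound in (33) and the scaled sigmoid activation can be sketched as below. The argument names mirror the symbols of (33), while the numeric values are arbitrary and chosen only to make the arithmetic easy to follow. The scalar `logit` is a stand-in for the pre-activation output of $\pi_{I}$.

```python
import math

def offloading_upper_bound(tau, T, I_j, R_u, R_d, delta, C, f_j):
    """Upper bound lambda_max of (33) on the offloading ratio."""
    denominator = 1.0 / R_u + delta / R_d + C / f_j
    return min(1.0, (tau * T / I_j) / denominator)

def bounded_offloading_ratio(logit, lam_max):
    """Sigmoid output scaled by lambda_max, so the result lies in [0, lam_max]."""
    if logit >= 0.0:
        sig = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)  # avoids overflow for very negative logits
        sig = e / (1.0 + e)
    return lam_max * sig

# Arbitrary illustrative values: the denominator equals 0.25 + 0.25 + 0.5 = 1.0.
lam = offloading_upper_bound(tau=1.0, T=1.0, I_j=2.0, R_u=4.0, R_d=4.0,
                             delta=1.0, C=1.0, f_j=2.0)
x = bounded_offloading_ratio(0.0, lam)
```

Because the sigmoid maps any real input into $(0,1)$, the scaled output cannot exceed `lam` for any logit, which is how the activation enforces the bound without a separate projection step.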

Algorithm 1 Cooperative inference of the proposed C-MADDPG
1. Uplink coordination
 ID agent $j$ ($\forall j\in\mathcal{N}$) creates $m_{j}^{(t)}$ from (22) and sends it to the UAV agent.
2. Downlink coordination
 The UAV agent computes $m_{0}^{(t)}$ from (23)-(27) and multicasts it to all ID agents.
3. Decision at UAV agent
 The UAV agent calculates the solution action $x_{0}^{(t)}$ from (28)-(30).
4. Decision at ID agents
 Each ID agent $j$ ($\forall j\in\mathcal{N}$) obtains the solution action $x_{j}^{(t)}$ from (32) and (33).

Algorithm 1 summarizes the cooperative inference among the UAV and ID agents of the proposed C-MADDPG framework. At the beginning of the time slot, ID agent $j$ individually generates the ID message $m_{j}^{(t)}$ using its message actor NN $\mu_{I}(\cdot;\varphi_{I})$ in (22). The resulting messages are conveyed to the UAV agent through the uplink coordination channel. Next, the UAV agent generates the UAV message $m_{0}^{(t)}$ based on its message actor NN $\mu_{U}(\cdot;\varphi_{U})$ and multicasts the output message to the ID agents. After this uplink-downlink agent interaction, each agent individually decides its solution action by leveraging the dedicated solution actor NNs $\pi_{U}(\cdot;\theta_{U})$ and $\pi_{I}(\cdot;\theta_{I})$.
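The four steps of Algorithm 1 can be outlined as a single function. This is only a structural sketch: the four callables are hypothetical stand-ins for the trained actor NNs, their signatures are assumptions, and the toy lambdas below exist solely to make the data flow traceable.

```python
def cooperative_inference(uav_obs, id_obs, mu_I, mu_U, pi_U, pi_I):
    """One time slot of the four-step cooperative inference.

    uav_obs: UAV observation; id_obs: dict mapping ID index j to its observation.
    mu_I, mu_U, pi_U, pi_I: stand-ins for the message and solution actors.
    """
    # 1. Uplink coordination: each ID builds a message from its own observation.
    id_msgs = {j: mu_I(o_j) for j, o_j in id_obs.items()}
    # 2. Downlink coordination: the UAV forms w_0 and m_0 = {w_j}, then multicasts m_0.
    w_uav, m0 = mu_U(uav_obs, id_msgs)
    # 3. Decision at the UAV agent.
    x_uav = pi_U(w_uav, m0)
    # 4. Decision at each ID agent, using only the multicast message.
    x_ids = {j: pi_I(j, m0) for j in id_obs}
    return x_uav, x_ids

# Toy stand-ins (assumptions, not trained networks).
mu_I = lambda o: [2.0 * v for v in o]
def mu_U(o0, msgs):
    return list(o0), {j: [a + b for a, b in zip(o0, m)] for j, m in msgs.items()}
pi_U = lambda w0, m0: sum(w0) + sum(sum(w) for w in m0.values())
pi_I = lambda j, m0: sum(m0[j])

x_uav, x_ids = cooperative_inference(
    [1.0, 0.0], {1: [0.5, 0.5], 2: [1.0, 0.0]}, mu_I, mu_U, pi_U, pi_I
)
```

Each ID-side call receives only its own index and the multicast message, which mirrors the requirement that no further exchange is needed after the uplink-downlink round.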

It is important to note that the proposed cooperative inference can be realized in a decentralized manner. Both the message and solution actions can be determined at individual agents based only on their local information, e.g., observation $o_{j}^{(t)}$ and messages $m_{j}^{(t)}$. As a consequence, neither central information collection steps nor central computing units are necessary to implement Algorithm 1. So far, we have designed cooperative actor NNs and their decentralized execution mechanisms for establishing the proposed MADDPG framework. In the following section, we discuss the MADDPG training algorithm for the proposed actor NNs.

VI Joint Training Strategy

We present a joint training policy which optimizes the actor NNs in an end-to-end manner. The CTDE strategy is adopted, which trains all NNs centrally in an offline manner and dispatches the trained actor NNs to the intended nodes for online and decentralized decisions. This approach guarantees that identical actor NNs are shared across all IDs in the training step. To train the actor NNs, we employ a critic NN $Q(s^{(t)},a^{(t)};\phi)$ with parameter $\phi$ which estimates the Q-value of the state-action pair $(s^{(t)},a^{(t)})$, where $a^{(t)}$ stands for the global action collecting all actions of the UAV and ID agents as

$$a^{(t)}=\{a_{j}^{(t)}:\forall j\in\tilde{\mathcal{N}}\}. \tag{34}$$

The training step leverages a replay buffer $\mathcal{M}$ given as

$$\mathcal{M}=\{(s^{(t)},a^{(t)},r^{(t)},s^{(t+1)}):\forall t\}. \tag{35}$$

To improve the scalability to the ID population $N$, transition samples in (35) are generated over random $N$ uniformly distributed within $[N_{\min},N_{\max}]$. This requires the critic NN $Q(\cdot;\phi)$ to handle variable-length state and action inputs, which can be realized by a simple masking operation on the input. More precisely, for each transition sample, we draw the ID population $N$ uniformly from $[N_{\min},N_{\max}]$ and sample the index set of ID agents randomly, i.e., $\mathcal{N}\subset\mathcal{N}_{\max}\triangleq\{1,\cdots,N_{\max}\}$ with $|\mathcal{N}|=N$. The observations $o_{j}^{(t)}$ for inactive ID agents $j\in\mathcal{N}_{\max}\backslash\mathcal{N}$ are fixed as zero vectors. Then, the state $s^{(t)}$ in (17) concatenates all observations $o_{j}^{(t)}$ ($\forall j\in\{0\}\cup\mathcal{N}_{\max}$), thereby resulting in a masked vector. A similar operation is applied to construct the global action $a^{(t)}$ in (34). As a consequence, the critic NN $Q(s^{(t)},a^{(t)};\phi)$ designed to process the maximum ID population $N_{\max}$ preserves the scalability to an arbitrary $N$.
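The masking operation described above can be sketched with the standard library alone. The helper names, the observation layout (one fixed-length vector per ID slot), and the seed are assumptions made for illustration.

```python
import random

def sample_active_ids(n_min, n_max, rng):
    """Draw N uniformly from [n_min, n_max], then pick N distinct ID indices."""
    n = rng.randint(n_min, n_max)
    return set(rng.sample(range(1, n_max + 1), n))

def masked_state(uav_obs, id_obs, n_max, obs_dim, active):
    """Concatenate the UAV observation with n_max ID slots; inactive slots are zeros."""
    state = list(uav_obs)
    for j in range(1, n_max + 1):
        state.extend(id_obs[j] if j in active else [0.0] * obs_dim)
    return state

rng = random.Random(0)  # fixed seed only for reproducibility of the sketch
active = sample_active_ids(2, 4, rng)
id_obs = {j: [float(j), 1.0] for j in active}  # made-up observations
state = masked_state([9.0], id_obs, n_max=4, obs_dim=2, active=active)
```

The state length depends only on `n_max` and `obs_dim`, so a critic with a fixed input size can consume samples generated with any number of active IDs.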

At each training iteration, we randomly sample a mini-batch set $\mathcal{B}\subset\mathcal{M}$ from the replay buffer $\mathcal{M}$. Let $b=(s,a,r,s^{\prime})\in\mathcal{B}$ be a particular batch sample where $s^{\prime}$ represents the one-step forward state originating from the current state-action pair $(s,a)$. Also, we define $\psi\triangleq\{\varphi_{U},\theta_{U},\varphi_{I},\theta_{I}\}$ as the parameter set of all actor NNs. The actor NNs are optimized to maximize the policy objective function $J(\psi)$ which measures the expected Q-value over the mini-batch samples $b\in\mathcal{B}$ as

$$J(\psi)=\frac{1}{|\mathcal{B}|}\sum_{b\in\mathcal{B}}Q(s,A(s;\psi);\phi), \tag{36}$$

where $A(\cdot;\psi)$ indicates a group of actor NNs that produce the global action $a$ from the current state $s$. As a result, the mini-batch stochastic gradient descent (SGD) update strategy of the actor NNs, which takes an ascent step on $J(\psi)$, becomes

$$\psi \leftarrow \psi + \eta_A \nabla_\psi J(\psi), \tag{37}$$

where $\eta_A > 0$ denotes the learning rate and $\nabla_z$ is the gradient operator with respect to the variable $z$.
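For reference, the gradient in (37) can be expanded with the chain rule, as in the generic deterministic policy gradient used by DDPG-type methods. This is a sketch under the assumption that $Q$ is differentiable in its action input; the Jacobian notation $\partial A / \partial \psi$ is introduced here for illustration and does not appear in the original derivation.

```latex
\nabla_\psi J(\psi)
  = \frac{1}{|\mathcal{B}|} \sum_{b \in \mathcal{B}}
    \left( \frac{\partial A(s;\psi)}{\partial \psi} \right)^{\top}
    \nabla_a Q(s, a; \phi) \Big|_{a = A(s;\psi)}
```

The critic parameter $\phi$ is held fixed in this step, so the critic only supplies the direction $\nabla_a Q$ along which the actor output should move.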

The critic NN $Q(s, a; \phi)$ is trained to yield an accurate Q-value for a given state-action pair $(s, a)$. The corresponding critic loss function $L(\phi)$ is written as

$$L(\phi) = \frac{1}{|\mathcal{B}|} \sum_{b \in \mathcal{B}} \big( y - Q(s, a; \phi) \big)^2, \tag{38}$$

where $y$ stands for the target of the estimated Q-value $Q(s, a; \phi)$, given by

$$y = r + \alpha Q(s', A(s'; \psi'); \phi'), \tag{39}$$

with the discount factor $\alpha \in (0, 1]$. Then, the critic NN is updated as

$$\phi \leftarrow \phi - \eta_C \nabla_\phi L(\phi), \tag{40}$$

where $\eta_C$ is the learning rate for the critic NN. In addition, we adopt the soft update strategy for the target actor NN $A(\cdot;\psi')$ and the target critic NN $Q(\cdot;\phi')$, expressed by

$$\psi' \leftarrow \kappa_A \psi + (1 - \kappa_A) \psi', \tag{41}$$
$$\phi' \leftarrow \kappa_C \phi + (1 - \kappa_C) \phi', \tag{42}$$

where $\kappa_C$ and $\kappa_A$ are the target update rates.
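The scalar arithmetic behind (38), (39), (41), and (42) can be illustrated with a minimal Python sketch. The function names, the list-based parameter representation, and all numeric values are hypothetical; a real implementation would operate on NN weight tensors and obtain the Q-values from the critic.

```python
def td_target(r, q_next, alpha):
    """Target y = r + alpha * Q(s', A(s'; psi'); phi'), as in (39)."""
    return r + alpha * q_next


def critic_loss(targets, q_values):
    """Mean squared error between targets and Q-estimates over the mini-batch, as in (38)."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def soft_update(target_params, online_params, kappa):
    """Soft update of (41)-(42): target <- kappa * online + (1 - kappa) * target."""
    return [kappa * w + (1.0 - kappa) * w_t
            for w, w_t in zip(online_params, target_params)]


# Toy mini-batch with two samples (all numbers are illustrative assumptions).
rewards = [1.0, 0.0]
q_next = [2.0, 4.0]   # target-critic outputs at the next states
q_now = [2.0, 3.0]    # critic outputs at the sampled state-action pairs
alpha = 0.5           # discount factor

targets = [td_target(r, qn, alpha) for r, qn in zip(rewards, q_next)]
loss = critic_loss(targets, q_now)
phi_target = soft_update([0.0, 1.0], [1.0, 1.0], kappa=0.25)
```

With a small update rate, the target parameters track the online parameters slowly, which keeps the target $y$ from shifting abruptly between iterations.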

Algorithm 2 Joint training strategy of the proposed C-MADDPG
Initialize $\psi$, $\phi$, $\psi'$, $\phi'$, $\sigma^2$, and $\rho$.
for episode $e = 1, \cdots, E$
     Initialize the number of IDs $N \in [N_{\min}, N_{\max}]$.
     Sample the set of active IDs $\mathcal{N} \subset \mathcal{N}_{\max}$.
     Generate an initial state $s^{(1)}$ by masking local observations of inactive IDs $\mathcal{N}_{\max} \backslash \mathcal{N}$.
     Update the exploration noise variance as $\sigma^2 \leftarrow \rho \sigma^2$.
     for time slot $t = 1, \cdots, T$
          Calculate $a^{(t)} = A(s^{(t)}; \psi)$ from Algorithm 1.
          Add the exploration noise as $a^{(t)} \leftarrow a^{(t)} + \mathcal{N}(0, \sigma^2)$.
          Mask local actions of inactive IDs in $a^{(t)}$.
          Obtain reward $r^{(t)}$ and the next state $s^{(t+1)}$.
          Store $(s^{(t)}, a^{(t)}, r^{(t)}, s^{(t+1)})$ in the replay buffer $\mathcal{M}$.
          Sample the mini-batch set $\mathcal{B} \subset \mathcal{M}$.
          Update the actor NN $\psi$ and the critic NN $\phi$ from (37) and (40).
          Update the target actor NN $\psi'$ and the target critic NN $\phi'$ from (41) and (42).
     end
end
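The episode initialization of Algorithm 2 (drawing $N$, sampling the active IDs, and masking inactive observations) might look as follows in Python. This is a minimal sketch: the helper names, the dictionary layout of the observations, and the two-dimensional toy observations are assumptions made for illustration only.

```python
import random


def sample_active_ids(n_min, n_max, rng):
    """Draw N uniformly from [n_min, n_max], then pick N active IDs out of {1, ..., n_max}."""
    n = rng.randint(n_min, n_max)
    return sorted(rng.sample(range(1, n_max + 1), n))


def build_masked_state(uav_obs, id_obs, active_ids):
    """Concatenate the UAV observation with all ID observations, zeroing inactive IDs."""
    state = list(uav_obs)
    for j in sorted(id_obs):
        obs = id_obs[j]
        # Inactive IDs contribute a zero vector of the same length.
        state.extend(obs if j in active_ids else [0.0] * len(obs))
    return state


rng = random.Random(0)
n_max = 4
active = sample_active_ids(2, n_max, rng)
# Hypothetical 2-dimensional local observations for IDs 1..n_max.
id_obs = {j: [float(j), float(j)] for j in range(1, n_max + 1)}
state = build_masked_state([9.0], id_obs, active)
```

Because the state always has the length associated with $N_{\max}$, a critic with a fixed input dimension can process any active population.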

Algorithm 2 presents the joint training process of the proposed C-MADDPG framework. Unlike the existing MADDPG training algorithm, the proposed training procedure involves scalable learning strategies to enhance the generalization ability of the actor and critic NNs. In the initialization phase, the number of IDs $N$ is generated uniformly within $[N_{\min}, N_{\max}]$. Then, we randomly sample the set of active IDs as $\mathcal{N} \subset \mathcal{N}_{\max} \triangleq \{1, \cdots, N_{\max}\}$ with $|\mathcal{N}| = N$. A masking operation is employed for inactive IDs, where the local observations $o_j^{(t)}$ for $j \in \mathcal{N}_{\max} \backslash \mathcal{N}$ are replaced with zero vectors. Consequently, the state $s^{(t)}$ contains the observations of the UAV and the active IDs only. This masking operation is similarly applied to the action $a^{(t)}$, and it can be viewed as an extension of the dropout operation [45] developed for improving the generalization ability. We discard the local observations and local actions of randomly selected inactive IDs. By doing so, ensemble training of the actor and critic NNs is achieved by producing a number of randomized MEC configurations in the training. This is also beneficial for the actor NNs to learn a generic decentralized decision policy for any given $N$. The hyperparameters $\sigma^2$ and $\rho$ are utilized to improve the exploration capability of the actor NNs by adding the Gaussian noise as

$$a^{(t)} = A(s^{(t)}; \psi) + n, \tag{43}$$

where $n \sim \mathcal{N}(0, \sigma^2)$ stands for the Gaussian random variable. The exploration noise helps the actor NN $A(\cdot;\psi)$ choose new actions that have not been experienced in the training, thereby improving the generalization capability. At the beginning of each episode, we draw $N$ uniformly over $[N_{\min}, N_{\max}]$. An initial state $s$ is randomly generated according to predefined UAV and ID deployment scenarios. Also, we decrease the exploration noise variance $\sigma^2$ by the decaying ratio $\rho$. Thus, the power of the exploration noise in (43) is gradually reduced as the training continues.
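The noise schedule described above can be sketched in Python as follows. Since the variance is multiplied by $\rho$ once per episode, it equals the initial variance times $\rho^e$ during episode $e$. The function names and the numeric values are hypothetical and are not the settings of Table I.

```python
import random


def decayed_variance(sigma2_init, rho, episode):
    """Variance after `episode` applications of sigma^2 <- rho * sigma^2."""
    return sigma2_init * rho ** episode


def add_exploration_noise(action, sigma2, rng):
    """Perturb each action entry with zero-mean Gaussian noise of variance sigma2, as in (43)."""
    std = sigma2 ** 0.5
    return [a + rng.gauss(0.0, std) for a in action]


sigma2_init, rho = 1.0, 0.5  # illustrative values only
variances = [decayed_variance(sigma2_init, rho, e) for e in range(1, 4)]
noisy = add_exploration_noise([0.0, 0.1], variances[-1], random.Random(0))
```

In practice the noisy action would also need to be clipped or otherwise mapped back to the feasible action range; that step is omitted here because the text does not specify it.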

Each episode consists of $T$ time slots. At each time slot, we execute Algorithm 1 to yield the action $a^{(t)}$ for the current state $s^{(t)}$. Then, the exploration noise is injected into the action $a^{(t)}$ as in (43). The interaction with the MEC network produces the reward $r^{(t)}$ for the state-action pair $(s^{(t)}, a^{(t)})$ as well as the next state $s^{(t+1)}$. The resulting transition sample $(s^{(t)}, a^{(t)}, r^{(t)}, s^{(t+1)})$ is stored in the replay buffer $\mathcal{M}$. Next, we uniformly sample the mini-batch set $\mathcal{B}$ from the replay buffer $\mathcal{M}$, which is utilized for the training of the actor NNs and the critic NN based on their update rules in (37) and (40). This is followed by the update steps of the target NNs in (41) and (42). These procedures are repeated for $T$ time slots, which completes one training episode. The entire training process runs for a total of $E$ episodes.
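The store-then-sample pattern described above can be sketched with a small replay buffer in Python. The class name, the finite capacity with first-in-first-out eviction, and the toy transitions are assumptions for illustration; the text only states that transitions are stored and sampled uniformly.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer of (s, a, r, s_next) transitions with uniform sampling."""

    def __init__(self, capacity, seed=None):
        # Assumption: the oldest transitions are dropped once capacity is reached.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement; returns fewer samples while the buffer is small.
        k = min(batch_size, len(self.memory))
        return self.rng.sample(list(self.memory), k)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.store(t, 0.0, -1.0, t + 1)  # placeholder scalar states and actions
batch = buffer.sample(2)
```

Sampling uniformly from past transitions decorrelates consecutive time slots within an episode before the updates in (37) and (40) are applied.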

The proposed joint training algorithm is conducted centrally in an offline manner. This is due to the critic NN $Q(s, a; \phi)$, which estimates the Q-value of the global state-action pair $(s, a)$, thereby incurring the centralized information collection process. A network cloud can be employed to train the actor NNs and critic NN jointly. By doing so, the SGD updates in (37) and (40) can be realized with the aid of the shared reward function (20). Notice that the critic NN is employed only in the training phase for the optimization of the actor NNs. Therefore, once the training is completed, the critic NN can be discarded, and only the optimized actor NNs are dispatched to their desired agents for real-time task offloading and positioning decisions.

Since the proposed C-MADDPG framework facilitates individual computation architectures of the actor NNs, the decentralized execution of the UAV and ID agents is guaranteed based on Algorithm 1. In the decentralized inference step, the agents share the coordination messages generated from the trained message actor NNs, whereas no parameter exchanges are required for the parameter sharing policy. In fact, this can be easily ensured in the training step since they are optimized under the identical GD update rule (37). This offline training process incurs no additional communication overheads in the online inference step.

TABLE I: Simulation Parameters
| Symbol | Settings | Symbol | Settings |
|---|---|---|---|
| $\tau, T$ | $0.2\ \text{s}, 10$ | $\chi_{\text{LoS}}, \chi_{\text{NLoS}}$ | $3\ \text{dB}, 23\ \text{dB}$ |
| $I_j, \delta$ | $[2, 20]\ \text{Gbits}, 0.2$ | $K_1, K_2$ | $11.95, 0.14$ |
| $C, \vartheta$ | $1550, 10^{-28}$ | $B, N_0$ | $10\ \text{MHz}, -130\ \text{dBm}$ |
| $\rho_0, \alpha$ | $-38\ \text{dB}, 2$ | $\alpha_{\text{C}}, \alpha_{\text{A}}$ | $1 \times 10^{-3}, 1 \times 10^{-4}$ |
| $p_U, p_D$ | $1\ \text{W}, 10\ \text{W}$ | $v_{\max}, f_{\max}$ | $50\ \text{m/s}, 40\ \text{GHz}$ |
| $\rho, \sigma^2$ | $0.45, 0.9995$ | | |

VII Numerical Results

We present numerical results validating the proposed C-MADDPG. Unless otherwise stated, simulation parameters are fixed as in Table I. The critic NN has four fully-connected hidden layers with $512$, $256$, $128$, and $64$ neurons, respectively. The message actor NNs $\mu_U(\cdot;\varphi_U)$ and $\mu_I(\cdot;\varphi_I)$ are built with three layers, each having $128$ neurons. Also, we leverage four fully-connected layers with $128$ neurons for constructing the solution actor NNs $\pi_U(\cdot;\theta_U)$ and $\pi_I(\cdot;\theta_I)$. Output layers of the message actor NNs adopt the ReLU activation function, whereas those of the solution actor NNs are set to the hyperbolic tangent function. A total of $E = 10^5$ episodes with $T = 10$ time slots each are considered in the training, where a mini-batch of $|\mathcal{B}| = 256$ samples is drawn at each time slot. For the initialization, the UAV and IDs are uniformly distributed in a $100$ m-by-$100$ m square area, and the altitude of the UAV is restricted to a bounded range $[0\ \text{m}, 60\ \text{m}]$. The trained C-MADDPG is tested with $10^4$ episodes.

We consider the following benchmark schemes.

  • Vanilla MADDPG [36]: Message exchanges among the agents are not allowed. Thus, each agent is equipped with the solution actor NN only and produces $x_j^{(t)}$ based on the partial observation $o_j^{(t)}$.

  • Single agent DDPG (SADDPG): An ideal centralized DRL scheme is adopted where a super actor NN decides the solution actions of all the UAV and ID agents jointly based on the state input $s^{(t)}$.

  • C-MADDPG with GraphSage (C-MADDPG-GS): Instead of the GAT, the actor NN of the UAV agent is implemented with the GraphSage [46].

  • Naive: The UAV position is simply determined as a centroid of the IDs. Then, all IDs offload their tasks to the UAV, and the computing resources are equally allocated.
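The naive baseline in the last item can be sketched in a few lines of Python. This sketch assumes two-dimensional horizontal ID coordinates and interprets "equally allocated" as splitting the maximum CPU frequency evenly among the IDs; both are assumptions, and the function names are hypothetical.

```python
def naive_uav_position(id_positions):
    """Centroid of the ID coordinates (assumed 2-D horizontal positions)."""
    n = len(id_positions)
    dims = len(id_positions[0])
    return tuple(sum(p[d] for p in id_positions) / n for d in range(dims))


def equal_cpu_allocation(f_max, n_ids):
    """Split the available CPU frequency evenly among all offloading IDs."""
    return [f_max / n_ids] * n_ids


# Four IDs at the corners of a 100 m-by-100 m area (illustrative layout).
ids = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
uav_xy = naive_uav_position(ids)
shares = equal_cpu_allocation(40e9, len(ids))
```

This baseline uses no learning and ignores task sizes and channel conditions, which is why it serves as a lower reference point in the comparisons.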

The vanilla MADDPG is a special case of the proposed C-MADDPG with no messages exchanged among agents. The SADDPG assumes an ideal centralized system where the UAV and IDs can share their observations perfectly. Unlike the proposed C-MADDPG, the single actor NN architecture of the SADDPG is not scalable to varying $N$. Therefore, the SADDPG needs to be trained separately for each given $N$. For this reason, the SADDPG baseline provides an unachievable upper bound on the performance of the proposed C-MADDPG approach. In the C-MADDPG-GS, the intermediate vector $\mathbf{w}_j^{(t)}$ of agent $j$ in (26) is computed as the concatenation of the corresponding feature vector $\mathbf{e}_j^{(t)}$ in (24) and the aggregation of the others $\mathbf{e}_k^{(t)}$ ($\forall k \in \tilde{\mathcal{N}} \backslash \{j\}$) as

$$\mathbf{w}_j^{(t)} = \left\{ \mathbf{e}_j^{(t)}, \sum_{k \in \tilde{\mathcal{N}} \setminus \{j\}} \mathbf{e}_k^{(t)} \right\}. \tag{44}$$

In the C-MADDPG-GS, we halve the dimension of the feature vector $\mathbf{e}_j^{(t)}$ so that the concatenation $\mathbf{w}_j^{(t)}$ occupies the same number of RBs as in the proposed C-MADDPG.
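The sum-pooling aggregation in (44) can be illustrated with a short Python sketch. The function name, the dictionary of feature vectors keyed by agent index, and the toy values are hypothetical.

```python
def graphsage_aggregate(features, j):
    """Return w_j = concat(e_j, sum of e_k over all k != j), following (44)."""
    dim = len(features[j])
    pooled = [0.0] * dim
    for k, e_k in features.items():
        if k != j:
            pooled = [p + x for p, x in zip(pooled, e_k)]
    return list(features[j]) + pooled


# Three agents with 2-dimensional feature vectors (illustrative values).
features = {0: [1.0, 2.0], 1: [0.5, 0.5], 2: [1.5, -0.5]}
w0 = graphsage_aggregate(features, 0)
```

The output has twice the dimension of a single feature vector, which is why the feature dimension is halved to keep the message size comparable. Every neighbor contributes with equal weight here, in contrast to the attention weights of the GAT.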

Figure 3: Convergence of the training process for different $N_{\max}$

Fig. 3 depicts the convergence behavior of the training process of the proposed C-MADDPG method in terms of the total energy consumption performance. We plot the moving average of the energy consumption over $10^4$ episodes. From the figure, we can see that the proposed C-MADDPG converges within $10^5$ episodes for all simulated $N_{\max}$. This implies the effectiveness of the proposed training strategy for handling a number of IDs. Since a small $N_{\max}$ results in simple training processes, it is beneficial to set $N_{\max} = 10$.
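A moving average of the per-episode energy, as used for the convergence curve, might be computed as in the following Python sketch. It assumes a trailing window, which is one common choice; the text does not state whether the window is trailing or centered, and the function name and toy data are hypothetical.

```python
def moving_average(values, window):
    """Trailing moving average; early entries average over the samples seen so far."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the sample that just left the window.
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


energy = [4.0, 2.0, 6.0, 0.0]  # illustrative per-episode energy values
smoothed = moving_average(energy, window=2)
```

A larger window smooths the curve more strongly but delays the visible response to changes in the learned policy.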

Figure 4: Average sum energy consumption with respect to $N$

Fig. 4 compares the energy consumption performance of various schemes by changing $N$. We set the ID population regime in the training to $N \in [5, 10]$, and the test performance of the trained C-MADDPG is evaluated over $N \in [5, 30]$. Thus, the actor NNs cannot observe training samples with $N > 10$. Nevertheless, the proposed C-MADDPG shows negligible loss compared with the upper bound performance generated by the ideal centralized SADDPG method, which is trained at each given $N$. This demonstrates the scalability of the proposed approach, where the actor NNs optimized at a small $N$ can be readily applied to larger networks. The gap between the vanilla MADDPG and the naive benchmark scheme decreases as $N$ grows. Without a proper agent interaction mechanism, the solution actor NNs of the vanilla MADDPG cannot provide efficient MEC management solutions, and they simply converge to a suboptimal policy close to the naive baseline. Based on this result, we can conclude that the message actor NNs play crucial roles in controlling the UAV and IDs in a decentralized manner. Also, the performance of the C-MADDPG-GS is degraded compared to the ideal SADDPG method. The simple sum pooling operation of the GS method fails to capture the importance of each ID agent in generating the downlink coordination message at the UAV agent. For this reason, the C-MADDPG-GS presents a large performance gap to the C-MADDPG. Thus, it is concluded that the GAT-based UAV message actor design is essential to control separate ID agents in a decentralized manner.

Figure 5: Average sum energy consumption with respect to $f_{\max}$

Fig. 5 exhibits the energy consumption performance of various schemes by changing the maximum CPU frequency $f_{\max}$ with $N = 10$. The UAV-aided MEC network can save the operating energy of the IDs as $f_{\max}$ increases. As expected, the proposed C-MADDPG outperforms other baseline schemes and provides almost identical performance to the upper bound SADDPG method. This validates the effectiveness of the C-MADDPG for optimizing appropriate MEC management solutions regardless of $f_{\max}$.

Figure 6: Average sum energy consumption with varying $N$

The adaptability to a time-varying ID population $N$ is investigated in Fig. 6, which depicts the energy consumption as a function of the time slot. The number of IDs is assumed to change every 10 time slots. We set $N = 5$ for the first 10 time slots and then change it to $N = 15$ for the next 10 time slots. For the last 10 time slots, the ID population is fixed as $N = 10$. The energy consumption of all schemes fluctuates highly at the transition time slots and is then gradually reduced as the actor NNs yield convergent policies. Since the vanilla MADDPG relies on long-term collaboration through rewards rather than real-time cooperation based on observations, it adapts slowly to environment changes. In contrast, the proposed C-MADDPG shows fast convergence by means of online information exchange among agents. Thanks to its scalability, the proposed C-MADDPG can handle such highly fluctuating MEC configurations with only a single training process. On the contrary, the non-scalable baselines such as the SADDPG and vanilla MADDPG resort to several trained actor NNs, each dedicated to one possible $N$. Also, the proposed approach with the GAT architecture achieves the performance of the ideal SADDPG method, demonstrating the effectiveness of the proposed scheme.

TABLE II: Average CPU running time [msec]
| $N$ | 10 | 15 | 20 | 25 | 30 |
|---|---|---|---|---|---|
| SADDPG | 2.30 | 2.33 | 2.35 | 2.37 | 2.39 |
| C-MADDPG | 2.19 | 2.19 | 2.19 | 2.19 | 2.19 |

Table II compares the inference time complexity of the trained actor NNs by evaluating the average CPU running time for executing $10^4$ episodes. The computational complexity of the proposed C-MADDPG remains unchanged with the ID population due to its decentralized and parallel architecture. For this reason, the inference complexity of the proposed C-MADDPG is slightly lower than that of the centralized SADDPG without incurring performance degradation.

VIII Conclusion

This work has proposed a novel C-MADDPG approach for the decentralized control of UAV-aided MEC networks where the UAV and IDs can access only their local observations. To build a valid decentralized decision-making policy, it is necessary to develop a proper coordination protocol, in particular, interaction messages bearing sufficient statistics of the optimal solutions of others. The considered problem has been formalized into a cooperative multi-agent POMDP which includes interaction messages as well as solutions of individual agents as action variables. Such dual actions require two different actor NNs, the message actor NN and the solution actor NN, which account for the agent coordination and the solution optimization, respectively. For effective message aggregation, the message actor NN at the UAV adopts the GAT architecture. Along with the parameter sharing policy, this graph-inspired structure leads to versatile operations that do not depend on the network size. Also, the joint training algorithm of all actor NNs has been proposed. To achieve the scalability to the ID population, the proposed C-MADDPG has been optimized over a random number of ID agents. Numerical results have demonstrated the superiority of the proposed scheme over existing schemes.

References

  • [1] N. Abbas, Y. Zhang, A. Taherkordi, and T. Skeie, “Mobile edge computing: A survey,” IEEE Internet Things J., vol. 5, no. 1, pp. 450–465, Feb. 2018.
  • [2] C. L. Chen, C. G. Brinton, and V. Aggarwal, “Latency minimization for mobile edge computing networks,” IEEE Trans. Mobile Comput., vol. 22, no. 4, pp. 2233–2247, Apr. 2023.
  • [3] L. Zhang, Y. Sun, Z. Chen, and S. Roy, “Communications-caching-computing resource allocation for bidirectional data computation in mobile edge networks,” IEEE Trans. Commun., vol. 69, no. 3, pp. 1496–1509, Nov. 2021.
  • [4] M. Masoudi and C. Cavdar, “Device vs edge computing for mobile services: Delay-aware decision making to minimize power consumption,” IEEE Trans. Mobile Comput., vol. 20, no. 12, pp. 3324–3337, Dec. 2021.
  • [5] K. Zhang, S. Leng, Y. He, S. Maharjan, and Y. Zhang, “Mobile edge computing and networking for green and low-latency internet of things,” IEEE Commun. Mag., vol. 56, no. 5, pp. 39–45, May 2018.
  • [6] M. Kim, H. Lee, S. Hwang, M. Kim, M. Debbah, and I. Lee, “Decentralized learning framework for hierarchical wireless networks: A tree neural network approach,” IEEE Internet Things J., vol. 11, no. 10, pp. 17 780–17 796, May 2024.
  • [7] J. Park, S. Solanki, S. Baek, and I. Lee, “Latency minimization for wireless powered mobile edge computing networks with nonlinear rectifiers,” IEEE Trans. Veh. Technol., vol. 70, no. 8, pp. 8320–8324, Aug. 2021.
  • [8] M. Wu, W. Qi, J. Park, P. Lin, L. Guo, and I. Lee, “Residual energy maximization for wireless powered mobile edge computing systems with mixed-offloading,” IEEE Trans. Veh. Technol., vol. 71, no. 4, pp. 4523–4528, Apr. 2022.
  • [9] N. Kiran, C. Pan, S. Wang, and C. Yin, “Joint resource allocation and computation offloading in mobile edge computing for SDN based wireless networks,” J. Commun. Netw., vol. 22, no. 1, pp. 1–11, Feb. 2020.
  • [10] Y. Zeng, R. Zhang, and T. J. Lim, “Wireless communications with unmanned aerial vehicles: opportunities and challenges,” IEEE Commun. Mag., vol. 54, no. 5, pp. 36–42, May 2016.
  • [11] N. H. Motlagh, T. Taleb, and O. Arouk, “Low-altitude unmanned aerial vehicles-based internet of things services: Comprehensive survey and future perspectives,” IEEE Internet Things J., vol. 3, no. 6, pp. 899–922, Dec. 2016.
  • [12] B. Li, Z. Fei, and Y. Zhang, “UAV communications for 5G and beyond: Recent advances and future trends,” IEEE Internet Things J., vol. 6, no. 2, pp. 2241–2263, Apr. 2019.
  • [13] Z. Liu, J. Qi, Y. Shen, K. Ma, and X. Guan, “Maximizing energy efficiency in UAV-assisted NOMA–MEC networks,” IEEE Internet Things J., vol. 10, no. 24, pp. 22 208–22 222, Dec. 2023.
  • [14] H. Zhou, Z. Wang, G. Min, and H. Zhang, “UAV-aided computation offloading in mobile-edge computing networks: A Stackelberg game approach,” IEEE Internet Things J., vol. 10, no. 8, pp. 6622–6633, Apr. 2023.
  • [15] H. Xie, T. Zhang, X. Xu, D. Yang, and Y. Liu, “Joint sensing, communication and computation in UAV-assisted systems,” IEEE Internet Things J., Mar. 2024, to be published.
  • [16] Y. Liu, S. Xie, and Y. Zhang, “Cooperative offloading and resource management for UAV-enabled mobile edge computing in power IoT system,” IEEE Trans. Veh. Technol., vol. 69, no. 10, pp. 12 229–12 239, Oct. 2020.
  • [17] L. Zhang et al., “Task offloading and trajectory control for UAV-assisted mobile edge computing using deep reinforcement learning,” IEEE Access, vol. 9, pp. 53 708–53 719, Apr. 2021.
  • [18] Q. Liu, L. Shi, L. Sun, J. Li, M. Ding, and F. Shu, “Path planning for UAV-mounted mobile edge computing with deep reinforcement learning,” IEEE Trans. Veh. Technol., vol. 69, no. 5, pp. 5723–5728, May 2020.
  • [19] L. Wang, P. Huang, K. Wang, G. Zhang, L. Zhang, N. Aslam, and K. Yang, “RL-based user association and resource allocation for multi-UAV enabled MEC,” in Proc. Int. Wireless Commun. Mobile Comput. Conf., pp. 741–746, Jun. 2019.
  • [20] H. Wang, H. Ke, and W. Sun, “Unmanned-aerial-vehicle-assisted computation offloading for mobile edge computing based on deep reinforcement learning,” IEEE Access, vol. 8, pp. 180 784–180 798, Oct. 2020.
  • [21] Y. Peng, Y. Liu, and H. Zhang, “Deep reinforcement learning based path planning for UAV-assisted edge computing networks,” in Proc. IEEE Wireless Commun. Netw. Conf., pp. 1–6, Mar. 2021.
  • [22] L. Wang, K. Wang, C. Pan, W. Xu, N. Aslam, and A. Nallanathan, “Deep reinforcement learning based dynamic trajectory control for UAV-assisted mobile edge computing,” IEEE Trans. Mobile Comput., vol. 21, no. 10, pp. 3536–3550, Oct. 2022.
  • [23] S. Hwang, J. Park, H. Lee, M. Kim, and I. Lee, “Deep reinforcement learning approach for UAV-assisted mobile edge computing networks,” in Proc. IEEE Global Commun. Conf., Dec. 2022, pp. 3839–3844.
  • [24] Z. Chen and X. Wang, “Decentralized computation offloading for multi-user mobile edge computing: A deep reinforcement learning approach,” EURASIP J. Wireless Commun. Netw., vol. 2020, no. 188, pp. 1–21, Sep. 2020.
  • [25] F. Song, H. Xing, X. Wang, S. Luo, P. Dai, Z. Xiao, and B. Zhao, “Evolutionary multi-objective reinforcement learning based trajectory control and task offloading in UAV-assisted mobile edge computing,” IEEE Trans. Mobile Comput., vol. 22, no. 12, pp. 7387–7405, Dec. 2023.
  • [26] Y. M. Park, S. S. Hassan, Y. K. Tun, Z. Han, and C. S. Hong, “Joint trajectory and resource optimization of MEC-assisted UAVs in sub-THz networks: A resources-based multi-agent proximal policy optimization DRL with attention mechanism,” IEEE Trans. Veh. Technol., early access, doi: 10.1109/TVT.2023.3311537.
  • [27] B. Zhang, B. Tang, and F. Xiao, “Robust computation offloading and trajectory optimization for multi-UAV-assisted MEC: A multi-agent DRL approach,” IEEE Internet Things J., early access, doi: 10.1109/JIOT.2023.3300718.
  • [28] B. Li, W. Liu, W. Xie, N. Zhang, and Y. Zhang, “Adaptive digital twin for UAV-assisted integrated sensing, communication, and computation networks,” IEEE Trans. Green Commun. Netw., vol. 7, no. 4, pp. 10 497–10 509, Dec. 2023.
  • [29] S. Hwang, H. Lee, J. Park, and I. Lee, “Decentralized computation offloading with cooperative UAVs: multi-agent deep reinforcement learning perspective,” IEEE Wireless Commun., vol. 29, no. 4, pp. 24–31, Aug. 2022.
  • [30] L. Wang et al., “Multi-agent deep reinforcement learning-based trajectory planning for multi-UAV assisted mobile edge computing,” IEEE Trans. Cogn. Commun. Netw., vol. 7, no. 1, pp. 73–84, Mar. 2021.
  • [31] A. M. Seid, G. O. Boateng, B. Mareri, G. Sun, and W. Jiang, “Multi-agent DRL for task offloading and resource allocation in multi-UAV enabled IoT edge network,” IEEE Trans. Netw. Service Manage., vol. 18, no. 4, pp. 4534–4547, Dec. 2021.
  • [32] A. Gao, Q. Wang, W. Liang, and Z. Ding, “Game combined multi-agent reinforcement learning approach for UAV assisted offloading,” IEEE Trans. Veh. Technol., vol. 70, no. 12, pp. 12 888–12 901, Dec. 2021.
  • [33] H. Peng and X. Shen, “Multi-agent reinforcement learning based resource management in MEC- and UAV-assisted vehicular networks,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 131–141, Jan. 2021.
  • [34] N. Zhao, Z. Ye, Y. Pei, Y. C. Liang, and D. Niyato, “Multi-agent deep reinforcement learning for task offloading in UAV-assisted mobile edge computing,” IEEE Trans. Wireless Commun., vol. 21, no. 9, pp. 6949–6960, Sep. 2022.
  • [35] Z. Ji, S. Wu, and C. Jiang, “Cooperative multi-agent deep reinforcement learning for computation offloading in digital twin satellite edge networks,” IEEE J. Sel. Areas Commun., early access, doi: 10.1109/JSAC.2023.3313595.
  • [36] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multiagent actor-critic for mixed cooperative-competitive environments,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), pp. 6382–6393, Jun. 2017.
  • [37] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in Proc. Int. Conf. Learn. Represent., Apr. 2018, pp. 1–13.
  • [38] S. Batabyal and P. Bhaumik, “Mobility models, traces and impact of mobility on opportunistic routing algorithms: A survey,” IEEE Commun. Surveys Tuts., vol. 17, no. 3, pp. 1679–1707, 3rd Quart., 2015.
  • [39] A. Al-Hourani, S. Kandeepan, and S. Lardner, “Optimal LAP altitude for maximum coverage,” IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–572, Dec. 2014.
  • [40] M. Mozaffari et al., “Unmanned aerial vehicle with underlaid device-to-device communications: Performance and tradeoffs,” IEEE Trans. Wireless Commun., vol. 15, no. 6, pp. 3949–3963, Jun. 2016.
  • [41] Y. Zeng, J. Xu, and R. Zhang, “Energy minimization for wireless communication with rotary-wing UAV,” IEEE Trans. Wireless Commun., vol. 18, no. 4, pp. 2329–2345, Apr. 2019.
  • [42] Y. Wang, M. Sheng, X. Wang, L. Wang, and J. Li, “Mobile-edge computing: Partial computation offloading using dynamic voltage scaling,” IEEE Trans. Commun., vol. 64, no. 10, p. 4268, Oct. 2016.
  • [43] R. Goldberg, “Survey of virtual machine research,” IEEE Computer Mag., vol. 7, no. 6, pp. 34–45, Jun. 1974.
  • [44] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 61–80, Jan. 2009.
  • [45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jun. 2014.
  • [46] S. Zhang, H. Tong, J. Xu, and R. Maciejewski, “Graph convolutional networks: A comprehensive review,” Comput. Social Netw., vol. 6, no. 1, pp. 1–23, Dec. 2019.