Article

P-DRL: A Framework for Multi-UAVs Dynamic Formation Control under Operational Uncertainty and Unknown Environment

1 College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
2 College of General Aviation and Flight, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(9), 475; https://doi.org/10.3390/drones8090475
Submission received: 5 July 2024 / Revised: 1 September 2024 / Accepted: 5 September 2024 / Published: 10 September 2024
(This article belongs to the Section Innovative Urban Mobility)

Abstract

Unmanned aerial vehicle (UAV) formation flying is an efficient and economical operation mode for air transportation systems. To improve the effectiveness of synergetic formation control for UAVs, this paper proposes a pairwise conflict resolution approach for UAV formation through mathematical analysis and designs a dynamic pairing and deep reinforcement learning framework (P-DRL formation control framework). Firstly, a new pairwise UAV formation control theorem is proposed, which breaks down the multi-UAVs formation control problem into multiple sequential control problems involving UAV pairs through a dynamic pairing algorithm. The training difficulty of Agents that only control each pair (two UAVs) is lower compared to controlling all UAVs directly, resulting in better and more stable formation control performance. Then, a deep reinforcement learning model for a UAV pair based on the Environment–Agent interaction is built, where segmented reward functions are designed to reduce the collision possibility of UAVs. Finally, P-DRL completes the formation control task of the UAV fleet through continuous pairing and Agent-based pairwise formation control. The simulations used the dynamic pairing algorithm combined with the DRL architectures of asynchronous advantage actor–critic (P-A3C), actor–critic (P-AC), and double deep q-value network (P-DDQN) to achieve synergetic formation control. This approach yielded effective control results with a strong generalization ability. The success rate of controlling dense, fast, and multi-UAV (10–20) formations reached 96.3%, with good real-time performance (17.14 Hz).

1. Introduction

The formation of UAVs refers to the process of multi-UAVs arranging or maintaining a specific spatial configuration in the airspace while performing flight tasks [1,2]. In transportation systems, UAV formation aims to make the fleet more efficient by receiving better navigation/position signals, improving communication stability and reducing conflicts during transportation tasks. It is also widely used in stunt performance, geographic exploration, and joint positioning in other fields. In recent years, with the continuous growth of logistics demand in urban air transportation systems, UAV formation flying has become increasingly widespread due to its efficient and flexible transportation advantages.
The primary task of UAV formation is to arrange or maintain a spatial configuration, which differs from the primary task of trajectory planning, namely arriving at a destination [3,4]. Therefore, UAV formation focuses on “trajectory control”, while UAV trajectory planning focuses on “trajectory determination”. Synergetic formation control is the basic technology behind the formation process, which generally has high requirements regarding the ground surface, obstacles, and other environmental conditions. Therefore, UAV formation is usually conducted at an open and flat site, which facilitates the operation and implementation of the formation control algorithm [5]. However, when the operating requirements of no obstacles and no interference are difficult to meet, synergetic formation control algorithms designed for complex operating environments emerge [6,7]. At the same time, some complex tasks such as terrain exploration and multi-resource fusion positioning require the synergetic operation of multiple UAVs, which makes controlling multiple UAVs simultaneously a new research direction [8,9].
Generally, thanks to advanced information transmission systems such as 5G and on-board sensors, as well as continuous improvements in low-complexity control algorithms, the real-time performance and accuracy of single-UAV control are already good [10]. However, compared with real-time performance and control accuracy, the synergetic capability for controlling many UAVs simultaneously and the environmental adaptability of multi-UAV systems, such as formation control under operational uncertainty and in unknown environments, remain drawbacks of current research in the field of UAV formation control [11]. Fortunately, as theories of deep reinforcement learning and systems control become more complete, there are new opportunities for improvements in the synergetic control scale of UAVs as well as their adaptability to various complex scenarios.
This paper proposes a P-DRL multi-UAV synergetic formation control framework, which combines mathematical analysis, heuristic algorithms, and deep reinforcement learning theory to achieve formation control of multiple UAVs with high safety and real-time performance in unknown environments. The paper's main contributions are listed as follows.
  • A new theorem for UAV pairwise formation control is proposed based on an analysis of the conflict–collision relationship between multi-UAVs. Based on this, the task of multi-UAV synergetic formation control is broken down into multiple pairwise UAV synergetic formation control tasks using the dynamic pairing algorithm we designed, thus reducing the training difficulty for the Agent in the DRL model.
  • A detailed deep reinforcement learning model of synergetic formation control for a UAV pair is proposed, including the reward function with collision-avoidance intensified, state transform, state–action space, etc.
  • A general framework, P-DRL, is proposed to solve the problem of multiple UAV (10–20) dynamic formation control, which can be used for simultaneous real-time dynamic multi-UAV formation control in complex environments with operational uncertainty and unknown obstacles.
This paper is structured as follows: Section 2 categorizes the advanced methods used for UAV formation tasks and their contributions, as well as the motivations for this paper. Section 3 introduces the formation control problem addressed in this study, the solving theorem, the basic control system of a single UAV, and the DRL model for UAV pair control. Section 4 provides the framework and P-DRL as an approach for UAV formation control. Section 5 conducts simulations of P-A3C, P-AC, and P-DDQN, constructed using the P-DRL framework. Section 6 presents the conclusions and discussion.

2. Research Basis

2.1. Related Work

Multi-UAV formation is essentially a control problem [8]. Unlike “stability” control, formation control belongs to the category of “synergetic” control, the former generally focusing on its own control system, while the latter focuses on the process of “synergetic decision-making” [12,13,14]. The process of UAV formation control is usually based on UAV dynamic models/systems as opposed to the design control algorithms used for different applications [15]. There have been various excellent works for UAV formation control theory, from the independent control method [16,17] to the “leader–follower” [18,19,20] and other heuristic control algorithms, as well as artificial intelligence (AI) control algorithms [21,22] in recent years. The existing research can be classified into two main categories from the perspective of UAV formation control algorithms: cybernetics and artificial intelligence.
The method based on cybernetics constructs the posture and dynamic model of the UAV and then controls the operation of the UAV through linear or nonlinear mapping between signals and displacement [23]. See, for example, classical proportional–integral–derivative (PID) control [2,15,24], backstepping control based on Lyapunov stability theory [25], and sliding mode control [26,27]. The formation result of this type of method largely depends on the accuracy of the parameters in the mathematical model and the ability of the UAV to adjust according to feedback, thus allowing precise control of UAVs with good adaptability. In recent years, many improved theoretical models for drone formation control have achieved very good practical results, such as multi-level switching control [28], back-stepping [29], and the energetic reference generator [19].
The method based on AI extracts key features from a UAV's historical trajectory data and combines them with the UAV motion model, robust control [23], and DRL methods [30,31,32,33] to achieve UAV control. For example, operators can fit the UAV control signal and nonlinear control effect through a neural network; learn from the control experience of UAV pilots/experts and train the AI model using input and feedback to control the UAVs [34]; use a graph convolutional neural network (GCNN) to evaluate operational uncertainty and make control decisions [27]; or construct distributed control systems to train the control-decision Agent [3,35]. These methods often have good robustness, fault tolerance, environmental fitness, and the ability to extract key information when dealing with the control of nonlinear complex systems, usually benefiting from adaptive deep reinforcement learning neural networks.
Although the above two categories of methods have their advantages in different aspects, there are still certain drawbacks when facing complex environments with dynamic, unknown conditions and operational interference. For example, the cybernetic model generally needs to be supported by trajectory planning technology: it is necessary to determine the four-dimensional trajectory (x, y, z position, timestamp) of the UAV first and then convert it into UAV control instructions through integration [36]. The implementation of this “two-stage” method is relatively cumbersome, since the process of solving a feasible 4D trajectory for the UAVs can consume a lot of computing resources. The AI-based UAV formation control algorithm usually controls the UAV directly through action allocation [37,38,39]. Its solution process is more direct, but the application scenarios are limited: the input state in the training process grows exponentially with the number of UAVs and obstacles, making the convergence of the neural network difficult [3,14]. Therefore, AI control methods often struggle to extend to or improve on more complex or different scenarios due to the upper limit of AI training ability. Based on further reflection on the above difficulties, we made improvements and propose the P-DRL framework for multi-UAV formation control.

2.2. Motivation

Generally, for the formation control scenarios within unknown environments dealing with operational uncertainty, adopting a deep reinforcement learning architecture may be more advantageous compared to some deterministic control approaches. However, when using deep reinforcement learning for UAV formation control, two challenging problems must be considered:
(1)
Learning difficulty increases with the number of UAVs being controlled simultaneously.
For deep reinforcement learning methods to solve the UAV formation control problem, there is always a maximum number of UAVs that the Agent can control simultaneously. This phenomenon is mainly caused by the input dimension limit of the deep q-value/v-value network. For example, one quadcopter has three position parameters and four action variables, which compose a basic state matrix with seven dimensions. Therefore, training an Agent for 2–5 UAV control is achievable, but when the input dimensions exceed 50 (these input parameters are completely independent in theory, such as the relationship between latitude and longitude, or between longitude and one of the rotor speeds), the training of the Agent becomes difficult due to the exponential growth of the continuous search space.
(2)
Training difficulty in improving the non-collision success rate.
Due to insufficient training and estimation errors in the q-value, the trained Agents may experience phenomena such as collisions between drones and obstacles, which should be avoided as much as possible. Thus, reward functions with a new structure should be designed to maximize conflict perception ability and ensure sufficient value iteration.
Therefore, the motivation to compress training input dimensions emerges. The P-DRL framework makes improvements from two main perspectives: (1) breaking down the multi-UAV formation control problem into multiple UAV-pair formation control problems, so that a DRL Agent only needs to focus on the synergetic control task of two UAVs, thereby reducing training difficulty; and (2) constructing a new DRL model for UAV pairwise formation control with better reward functions and interaction modes to improve collision avoidance capabilities.
From a macro perspective, the transformation from multi-UAVs formation control to UAV pairwise formation control enables the P-DRL framework to have better synergetic control capabilities. From a micro perspective, the pairwise UAV control DRL model has a smaller state–action space, allowing the trained Agents to achieve more effective training and better decision-making performance. These two perspectives are the fundamental reasons why the P-DRL framework can result in better formation control performance.

3. Problem Formation

3.1. Problem Definition

The problem addressed in this study is dynamic multi-UAV formation control under operational uncertainty in unknown environments. To further clarify this problem, we describe it from four perspectives: objectives, decision variables, constraints, and assumptions.
(1)
Objectives: A group of UAVs (≥10) needs to arrange or maintain a specific configuration synergetically, where drones have a random initial state, which may be stationary or in motion.
(2)
Decision variables: The only decision variable is the rotor speed of the UAV. For example, a hexacopter with six propeller rotors has six decision variables, because it can theoretically control the speed of every single rotor. Similarly, a quadcopter (quadrotor UAV) has four decision variables. Drones achieve most postures such as climb, descent, and roll based on the adjustment of their rotor speed.
(3)
Constraints: Firstly, the maneuver of UAVs must meet their performance constraints. Secondly, UAVs cannot collide with each other during their formation process, and they cannot collide with other obstacles in the airspace.
(4)
Assumptions: Some communication factors such as signal transmission interference/delay are not considered, but some external interference like wind and inner control errors made by UAV systems need to be considered. All of the obstacles’ positions in the airspace are generated randomly, and the positions are unknown before the formation task begins. We further express the assumption as:
  • Operational uncertainty: The next state of the UAV is not formed completely according to the current state and control action; instead, it follows a normal distribution with a specific variance, i.e., $\Pr[S^{t+1} \mid S^{t}, A^{t}] \neq 1$.
  • Unknown environment: Obstacles in the operating environment cannot be predicted before the formation control process begins; they only become known when a UAV approaches them during the formation control process.

3.2. Pairwise Control Theorem

Since our approach is to break down the problem of multiple UAV formation control into multiple UAV pair formation control problems, it is necessary to demonstrate the feasibility of this mode. This means proving that conducting multiple UAV pair formation control can yield results equivalent to those obtained by directly conducting multiple UAVs formation control using the same assumptions and decision variables as Section 3.1.
(1)
Same objectives.
The formation objectives/results of the UAV formation control are equivalent regardless of the control algorithms used. A group of UAVs will arrange or maintain a specific configuration synergistically at the end if the approach is implemented successfully.
(2)
Same constraints.
There are three main constraints in the formation control model for multi-UAVs: (a) meet the performance of the UAV dynamic model; (b) avoid collision between any UAV and obstacles in the environment; and (c) avoid collision between any UAVs. For constraint (a), it is easy to deduce that “all UAVs in the fleet meet the performance constraints” and “any UAV pair in the UAV fleet meets the performance constraints” are necessary and sufficient conditions for each other. Similarly, if there is no collision with obstacles for the whole UAV fleet, then any UAV pair in the UAV fleet will naturally not collide with obstacles, and vice versa.
However, for constraint (c), the equivalence between a group of UAVs not colliding and multiple UAV pairs not colliding requires more discussion. The relationships between UAVs regarding conflicts/collisions are more complex than those between UAVs and obstacles. In brief, a UAV may have a conflict/collision trend with more than two surrounding UAVs at the same time, and whether the pairwise conflict relief control method can solve multiple conflict relief problems in a UAV fleet requires deduction.
For the equivalence proof of constraint (c), we need to clarify two definitions, which are conflict and collision. Conflict refers to the trend of collision between UAVs due to position, heading, and speed relationships [2,6]. Specifically, several UAVs will have a situation where horizontal positions and vertical height differences are less than a certain value in the future according to the current operation state. Collision refers to UAVs being in the same spatial position at a certain time and colliding with each other. Conflict occurs before a collision and there must be at least one conflict before a collision can happen. Therefore, if we want to avoid UAV collisions, we can realize this purpose by solving all of the UAV conflicts (a sufficient condition for collision avoidance).
We assume $U$ is the set of all operating UAVs, composed of $n$ UAV elements $A_1, A_2, \dots, A_n$, with $n \in \mathbb{N}^*$, where $\mathbb{N}^*$ is the set of positive natural numbers. For each UAV $A_i$, $A_i = \{P_i, v_i(t)\}$, where $P_i$ is the 3D position vector $[x, y, z]$ of the UAV ($x$, $y$ is the lateral position and $z$ is the height of the UAV), and $v_i(t)$ represents the mapping from time $t$ to the 3D position vector $P_i$ of UAV $i$ by velocity $v_i$. Define the safety interval between UAVs as $d$, which means the interval between two UAVs should be larger than $d$ to ensure safe operation. $U'$ indicates a conflicting set, $U' \subseteq U$, where UAVs in $U'$ have conflicts with other UAVs in $U'$.
The mathematical description of conflicts in a UAV fleet can be represented as ①.
①: $U = \{A_1, A_2, \dots, A_n\}$, $n \in \mathbb{N}^*$, $\exists\, U' \subseteq U$, $|U'| \ge 3$, $\forall A_i = \{P_i, v_i(t)\} \in U'$, $\forall A_j = \{P_j, v_j(t)\} \in U'$, $i \neq j$, such that $\exists\, t \in (0, +\infty)$: $P_i \leftarrow v_i(t)$, $P_j \leftarrow v_j(t)$, $\|P_i - P_j\| \le d$.
The mathematical description of conflicts between a UAV pair can be represented as ②.
②: $\exists A_i = \{P_i, v_i(t)\} \in U$, $A_j = \{P_j, v_j(t)\} \in U$, $i, j \in \mathbb{N}^*$, $i \neq j$, such that $\exists\, t \in (0, +\infty)$: $P_i \leftarrow v_i(t)$, $P_j \leftarrow v_j(t)$, $\|P_i - P_j\| \le d$.
Prove Theorem 1 (① ⇒ ②).
The existence of conflicts in a UAV fleet is a sufficient condition for the existence of UAV pair conflicts.  
Randomly select $A_i = \{P_i, v_i(t)\} \in U'$ and $A_j = \{P_j, v_j(t)\} \in U'$.
According to the description of ①: for a certain $A_i \in U'$, $\exists A_j \in U'$ such that $P_i \leftarrow v_i(t)$, $P_j \leftarrow v_j(t)$, $\|P_i - P_j\| \le d$.
∵ $|U'| \ge 3$; ∴ $\exists A_k \in U' \setminus \{A_j\}$ such that $P_i \leftarrow v_i(t)$, $P_k \leftarrow v_k(t)$, $\|P_i - P_k\| \le d$, $t \in (0, +\infty)$.
∵ $U' \subseteq U$; ∴ $A_i, A_j, A_k \in U$.
∴ When ① is established, $\exists A_i, A_j, A_k \in U$, $i \neq j \neq k$, such that $\exists\, t \in (0, +\infty)$: $P_i \leftarrow v_i(t)$, $P_j \leftarrow v_j(t)$, $\|P_i - P_j\| \le d$, and $\exists A_k \in U \setminus \{A_j\}$ such that $P_i \leftarrow v_i(t)$, $P_k \leftarrow v_k(t)$, $\|P_i - P_k\| \le d$, $t \in (0, +\infty)$; hence ② is established.
∴ ① ⇒ ② is proven.
Now, we have the contrapositive of Theorem 1, recorded as Theorem 2:
Theorem 2 (¬② ⇒ ¬①).
Non-conflict among all UAV pairs is a sufficient condition for non-conflict among all UAVs in the fleet, which is:
¬①: $\forall\, U = \{A_1, A_2, \dots, A_n\}$, $n \in \mathbb{N}^*$, $\forall\, U' \subseteq U$ with $|U'| \ge 3$, $\forall A_i = \{P_i, v_i(t)\} \in U'$, $\forall A_j = \{P_j, v_j(t)\} \in U'$, $i \neq j$: $\forall\, t \in (0, +\infty)$, $P_i \leftarrow v_i(t)$, $P_j \leftarrow v_j(t)$, $\|P_i - P_j\| > d$.
¬②: $\forall A_i = \{P_i, v_i(t)\} \in U$, $\forall A_j = \{P_j, v_j(t)\} \in U$, $i, j \in \{1, 2, \dots, n\}$, $i \neq j$: $\forall\, t \in (0, +\infty)$, $P_i \leftarrow v_i(t)$, $P_j \leftarrow v_j(t)$, $\|P_i - P_j\| > d$.
∵ ① ⇒ ② has been proven, and ¬② ⇒ ¬① is the contrapositive of ① ⇒ ②;
∴ ¬② ⇒ ¬① is proven.
The meaning of ¬② ⇒ ¬① (Theorem 2) is that if all conflicts between UAV pairs are resolved, it can be ensured that the UAV fleet (all of the drones) is also operating conflict-free. In summary, it can be proven that the pairwise conflict relief control method can indeed solve multiple conflict relief problems in a UAV fleet. Therefore, constraint (c) is equivalent for both pairwise formation control and directly conducting multi-UAV formation control.
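To make the pairwise conflict check concrete, the following is a minimal Python sketch of the definitions above, under two illustrative assumptions: each UAV's future position is extrapolated linearly from its current velocity (a simplification of $v_i(t)$), and the safety interval $d$ and prediction horizon are example values.

```python
import numpy as np

def pairwise_conflict(p_i, v_i, p_j, v_j, d=0.5, horizon=10.0, dt=0.1):
    """Return True if UAVs i and j violate the safety interval d (m) at some
    time in (0, horizon], assuming straight-line extrapolation of the current
    velocities (an illustrative simplification of v_i(t))."""
    for t in np.arange(dt, horizon + dt, dt):
        if np.linalg.norm((p_i + v_i * t) - (p_j + v_j * t)) <= d:
            return True          # a conflict (collision trend) exists
    return False

def fleet_conflict_free(positions, velocities, d=0.5):
    """Theorem 2 in practice: the fleet is conflict-free iff every UAV pair is."""
    n = len(positions)
    return all(not pairwise_conflict(positions[i], velocities[i],
                                     positions[j], velocities[j], d)
               for i in range(n) for j in range(i + 1, n))
```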

3.3. Single UAV Control Model

The single UAV control model is a basic element in the multi-UAV formation process, as well as a basic unit in the deep reinforcement learning model of UAV pairwise formation control.
In this section, we use a type of quadrotor UAV as an example to build the UAV control model. The movement of the quadrotor UAV can be regarded as a combination of the translation of the UAV center and the rotation of the UAV body. The rotation matrix R B  is:
$$R_B = \begin{bmatrix} \cos\psi & \sin\psi & 0 \\ -\sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}^{T} \times \begin{bmatrix} \cos\theta & 0 & -\sin\theta \\ 0 & 1 & 0 \\ \sin\theta & 0 & \cos\theta \end{bmatrix}^{T} \times \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\phi & \sin\phi \\ 0 & -\sin\phi & \cos\phi \end{bmatrix}^{T}, \quad \theta \neq \pm\frac{\pi}{2}$$
where ϕ is the roll angle, θ is the pitch angle, and  ψ is the yaw angle of the UAV.
The control method for quadrotor UAVs involves adjusting the speed of the propeller rotors. By maneuvering between the different speeds of the propeller rotors, complex actions such as roll, yaw, climb, and descent can be realized. The lift U 1 generated by the propellers is:
$$U_1 = \frac{1}{2}\cdot\sum_{i=1}^{4} \rho \cdot v_i^{2} \cdot S \cdot C_L$$
where ρ is the atmospheric density, v i is the i-th propeller rotor speed of the UAV, S is the propeller area of the UAV, C L is the lift coefficient of the propeller, and  U 1 is the lift force of the UAV. v i = ω i ·r/2, where r is the length of the UAV propeller; therefore, Equation (2) can be simplified as:
$$U_1 = b \cdot \sum_{i=1}^{4} \omega_i^{2}$$
where b is a composite lift parameter constituted by ρ ,S, C L . ω i is the angular velocity (r/min, RPM) of the i-th propeller rotor, ω is a vector composed of the four angular velocities of the UAV propeller rotors, ω = [ ω 1 , ω 2 , ω 3 , ω 4 ] T . The balance equation of the UAV is as follows:
$$m\begin{bmatrix} \ddot{x} \\ \ddot{y} \\ \ddot{z} \end{bmatrix} = U_1 R_B^{T} \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} - k_a \begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{z} \end{bmatrix} - mg \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$
$$\begin{bmatrix} \ddot{x} \\ \ddot{y} \\ \ddot{z} \end{bmatrix} = \begin{bmatrix} \dfrac{\sum_{i=1}^{4} b\cdot\omega_i^{2}\,(\cos\phi\sin\theta\cos\psi + \sin\phi\sin\psi) - k_a\dot{x}}{m} \\ \dfrac{\sum_{i=1}^{4} b\cdot\omega_i^{2}\,(\cos\phi\sin\theta\sin\psi + \sin\phi\cos\psi) - k_a\dot{y}}{m} \\ \dfrac{\sum_{i=1}^{4} b\cdot\omega_i^{2}\,\cos\theta\cos\phi - k_a\dot{z}}{m} - g \end{bmatrix}$$
where m is the mass of the UAV, g is the acceleration of gravity, k a is the air resistance coefficient, x ¨ , y ¨ , z ¨ T is the acceleration on the x, y, z axes, and  x ˙ , y ˙ , z ˙ T is the speed on the x, y, z axes of UAV.
Then, construct the rotation model of the UAV. Since the structure of the quadrotor UAV is almost symmetrical, the inertia matrix I of the UAV can be defined as diag ( I x , I y , I z ), which reflects the difficulty of rotating the UAV along axes x, y, z. The rotation moment matrix M 1 of the UAV is:
$$M_1 = \begin{bmatrix} U_2 \\ U_3 \\ U_4 \end{bmatrix} = \begin{bmatrix} C_L \cdot l \cdot \left(\omega_4^{2} - \omega_2^{2}\right) \\ C_L \cdot l \cdot \left(\omega_3^{2} - \omega_1^{2}\right) \\ k_d \cdot \left(\omega_2^{2} + \omega_4^{2} - \omega_1^{2} - \omega_3^{2}\right) \end{bmatrix}$$
where U 2 , U 3 , U 4 are the rolling moment, pitching moment, and yaw moment of the UAV, l is the distance from the UAV propeller center to the UAV geometric center, and  k d is the inverse torque proportional coefficient of the rotor. Therefore, the rotational motion of the quadrotor UAV is:
$$\begin{bmatrix} \ddot{\phi} \\ \ddot{\theta} \\ \ddot{\psi} \end{bmatrix} = \begin{bmatrix} \dfrac{(I_x - I_z)\cdot\dot{\theta}\cdot\dot{\psi} + I_b\cdot\dot{\theta}\cdot\dot{\psi} + I_b\cdot\dot{\theta}\cdot\Omega + U_2}{I_x} \\ \dfrac{(I_z - I_x)\cdot\dot{\phi}\cdot\dot{\psi} + I_b\cdot\dot{\phi}\cdot\dot{\psi} + I_b\cdot\dot{\phi}\cdot\Omega + U_3}{I_y} \\ \dfrac{(I_x - I_y)\cdot\dot{\theta}\cdot\dot{\phi} + U_4}{I_z} \end{bmatrix}$$
where $\Omega = \omega_2 + \omega_4 - \omega_1 - \omega_3$, $I_b$ is the moment of inertia of the propeller around the body shaft axis, $[\ddot{\phi}, \ddot{\theta}, \ddot{\psi}]^{T}$ are the angular accelerations of the roll, pitch, and yaw angles, and $[\dot{\phi}, \dot{\theta}, \dot{\psi}]^{T}$ are the angular velocities of the roll, pitch, and yaw angles.
Now, the UAV dynamic model is built, and the four control variables are the speeds of the UAV's four propeller rotors. The state $S^{t}(\boldsymbol{\omega})$ after $t$ seconds can be described as:
$$S^{t}(\boldsymbol{\omega}) = S_0 + \begin{bmatrix} x(\boldsymbol{\omega}) & y(\boldsymbol{\omega}) & z(\boldsymbol{\omega}) & \phi(\boldsymbol{\omega}) & \theta(\boldsymbol{\omega}) & \psi(\boldsymbol{\omega}) \end{bmatrix}^{T} = \begin{bmatrix} x_t \\ y_t \\ z_t \\ \phi_t \\ \theta_t \\ \psi_t \end{bmatrix} = \begin{bmatrix} x_0 \\ y_0 \\ z_0 \\ \phi_0 \\ \theta_0 \\ \psi_0 \end{bmatrix} + \begin{bmatrix} \iint \dfrac{\sum_{i=1}^{4} b\cdot\omega_i^{2}(\cos\phi\sin\theta\cos\psi + \sin\phi\sin\psi) - k_a\dot{x}}{m}\,\mathrm{d}t\,\mathrm{d}t \\ \iint \dfrac{\sum_{i=1}^{4} b\cdot\omega_i^{2}(\cos\phi\sin\theta\sin\psi + \sin\phi\cos\psi) - k_a\dot{y}}{m}\,\mathrm{d}t\,\mathrm{d}t \\ \iint \left(\dfrac{\sum_{i=1}^{4} b\cdot\omega_i^{2}\cos\theta\cos\phi - k_a\dot{z}}{m} - g\right)\mathrm{d}t\,\mathrm{d}t \\ \iint \dfrac{(I_x - I_z)\dot{\theta}\dot{\psi} + I_b\dot{\theta}\dot{\psi} + I_b\dot{\theta}\Omega + U_2}{I_x}\,\mathrm{d}t\,\mathrm{d}t \\ \iint \dfrac{(I_z - I_x)\dot{\phi}\dot{\psi} + I_b\dot{\phi}\dot{\psi} + I_b\dot{\phi}\Omega + U_3}{I_y}\,\mathrm{d}t\,\mathrm{d}t \\ \iint \dfrac{(I_x - I_y)\dot{\theta}\dot{\phi} + U_4}{I_z}\,\mathrm{d}t\,\mathrm{d}t \end{bmatrix}$$
where S 0 = [ x 0 , y 0 , z 0 , ϕ 0 , θ 0 , ψ 0 ] is the original state of the UAV and S t = [ x t , y t , z t , ϕ t , θ t , ψ t ] is the new state at timestamp t of the UAV.
Finally, select the geometric center of the UAV and the centers of the four propellers as the collision points (Figure 1). Assuming that the geometric center of the UAV is also the Euler rotation center, the position of the collision points in the ground reference system, $[x_c, y_c, z_c]^{T}$, can be described as:
$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = R_B \times \begin{bmatrix} x_c' \\ y_c' \\ z_c' \end{bmatrix} + \begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix}$$
where $[x_c', y_c', z_c']^{T}$ is the relative position of the key collision point with respect to the geometric center of the UAV and $[x_t, y_t, z_t]^{T}$ is the displacement of the UAV in the ground reference system.
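For reference, the following is a minimal numerical sketch of one integration step of the dynamic model above (Equations (3), (5), and (7)); the parameter values are illustrative placeholders only, not the Phantom 4 parameters used in Section 5, and simple Euler integration replaces the integrals of Equation (8).

```python
import numpy as np

# Placeholder physical parameters (illustrative only, not the Phantom 4 values)
m, g, b, k_a = 1.38, 9.81, 3.2e-7, 0.10          # mass, gravity, lift coeff., drag coeff.
Ix, Iy, Iz, Ib = 0.016, 0.016, 0.031, 6.0e-5     # body and propeller inertias
CL, l, k_d = 1.0, 0.175, 7.5e-9                  # lift coeff., arm length, torque coeff.

def step(state, omega, dt=0.02):
    """One Euler-integration step of Equations (5) and (7).
    state = [x, y, z, phi, theta, psi, dx, dy, dz, dphi, dtheta, dpsi],
    omega = rotor speeds [w1, w2, w3, w4]."""
    x, y, z, phi, th, psi, dx, dy, dz, dphi, dth, dpsi = state
    U1 = b * np.sum(np.square(omega))                                   # total lift, Eq. (3)
    U2 = CL * l * (omega[3]**2 - omega[1]**2)                           # rolling moment
    U3 = CL * l * (omega[2]**2 - omega[0]**2)                           # pitching moment
    U4 = k_d * (omega[1]**2 + omega[3]**2 - omega[0]**2 - omega[2]**2)  # yaw moment
    Om = omega[1] + omega[3] - omega[0] - omega[2]
    # Translational accelerations, Eq. (5)
    ddx = (U1 * (np.cos(phi)*np.sin(th)*np.cos(psi) + np.sin(phi)*np.sin(psi)) - k_a*dx) / m
    ddy = (U1 * (np.cos(phi)*np.sin(th)*np.sin(psi) + np.sin(phi)*np.cos(psi)) - k_a*dy) / m
    ddz = (U1 * np.cos(th)*np.cos(phi) - k_a*dz) / m - g
    # Rotational accelerations, Eq. (7)
    ddphi = ((Ix - Iz)*dth*dpsi + Ib*dth*dpsi + Ib*dth*Om + U2) / Ix
    ddth  = ((Iz - Ix)*dphi*dpsi + Ib*dphi*dpsi + Ib*dphi*Om + U3) / Iy
    ddpsi = ((Ix - Iy)*dth*dphi + U4) / Iz
    acc = np.array([ddx, ddy, ddz, ddphi, ddth, ddpsi])
    vel = np.array([dx, dy, dz, dphi, dth, dpsi]) + acc * dt
    pos = np.array([x, y, z, phi, th, psi]) + vel * dt
    return np.concatenate([pos, vel])
```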

3.4. DRL Model for UAV Pairwise Formation Control

Construct the DRL model for UAV pairwise formation control based on the objectives, constraints, and UAV performance of this formation control problem (Section 3.1). It is worth noting that this DRL model is designed for the synergistic formation control of two UAVs, rather than the synergetic formation control of the UAV fleet directly, which is designed to reduce the training difficulty of the Agent. It is essential to integrate the dynamic pairing algorithm in Section 4.2 to realize the synergetic control of the UAV fleet. We use the classical structure of Environment–Agent to construct this DRL model.

3.4.1. Environment

(a)
State of the UAV pair.
A quadrotor UAV has the attributes of three position parameters ($x$, $y$, $z$), three rotation parameters ($\psi$, $\theta$, $\phi$), and four rotor speed parameters ($\omega_1$, $\omega_2$, $\omega_3$, $\omega_4$). The state matrix $S_{\mathrm{UAV}}^{t}$, composed of two UAVs, has a total of 20 elements. Add the position parameters ($x_{\mathrm{ob}}$, $y_{\mathrm{ob}}$), the height of the bottom surface $z_{\mathrm{ob1}}$, and the height of the top surface $z_{\mathrm{ob2}}$ of the obstacle. Therefore, the state vector $S^{t}$ of a UAV pair can be described as:
$$S_{\mathrm{UAV}}^{t} = \begin{bmatrix} x_1^{t}, y_1^{t}, z_1^{t}, \psi_1^{t}, \theta_1^{t}, \phi_1^{t}, \omega_{11}^{t}, \omega_{12}^{t}, \omega_{13}^{t}, \omega_{14}^{t} \\ x_2^{t}, y_2^{t}, z_2^{t}, \psi_2^{t}, \theta_2^{t}, \phi_2^{t}, \omega_{21}^{t}, \omega_{22}^{t}, \omega_{23}^{t}, \omega_{24}^{t} \end{bmatrix}_{2\times 10}, \quad S_{\mathrm{obstacle}}^{t} = \begin{bmatrix} x_{\mathrm{ob}}^{t}, y_{\mathrm{ob}}^{t}, z_{\mathrm{ob1}}^{t}, z_{\mathrm{ob2}}^{t} \end{bmatrix}_{1\times 4}, \quad S^{t} = \begin{bmatrix} S_{\mathrm{UAV}}^{t}, S_{\mathrm{obstacle}}^{t} \end{bmatrix}_{1\times 24}$$
The limitation of UAV performance can be described as:
$$\psi \in [0, 2\pi], \quad \theta \in (\theta_1, \theta_2), \quad \phi \in (\phi_1, \phi_2), \quad \omega_{ij} \in [0, \omega_{\max}), \quad S^{t} \in \mathbf{S}$$
where θ 1 , θ 2 are the minimum and maximum pitch angle of the UAV, ϕ 1 and ϕ 2 are the minimum and maximum roll angle of the UAV, ω max means the maximum rotating speed of rotors, and S is the state space of the UAV pair. These parameters are set to simulate the upper and lower limits of UAV performance and ensure it operates in a reasonable posture.
(b)
Action of the UAV pair.
The quadrotor UAV achieves roll, yaw, climb, and descend by controlling the speed of its four rotors. Therefore, the action A t of two UAVs at timestamp t can be defined as:
$$A^{t} = \begin{bmatrix} \Delta\omega_{11}, \Delta\omega_{12}, \Delta\omega_{13}, \Delta\omega_{14}, H_1 \\ \Delta\omega_{21}, \Delta\omega_{22}, \Delta\omega_{23}, \Delta\omega_{24}, H_2 \end{bmatrix}_{2\times 5}, \quad A^{t} \in \mathbf{A}$$
where $\Delta\omega_{ij}$ is the adjustment of rotational speed for the $j$-th rotor of the $i$-th UAV, and $H_i$ is the 0–1 control variable for UAV hovering. When $H_i = 1$, the $i$-th UAV hovers at the current position immediately and adjusts the speeds of its four rotors to balance gravity. Due to the constraints of rotor acceleration and deceleration, the space of $\Delta\omega_{ij}$ (adjustment of rotational speed, r/min, RPM) is chosen as:
$$\Delta\omega_{ij} \in \left\{ \pm 10\ \mathrm{RPM},\ \pm 50\ \mathrm{RPM},\ \pm 100\ \mathrm{RPM} \right\}, \quad A^{t} \in \mathbf{A}$$
The value of Δ ω i j should also meet Equation (11), which means satisfying the performance constraints of the UAV, and the action space composed of Δ ω i j is denoted as  A .
(c)
State transforming.
The state transforming function trans(·) is used to calculate the state of the UAV at timestamp $t+1$ after the UAV takes an action $A^{t}$ in state $S^{t}$. The calculation of the UAV's position ($x$, $y$, $z$) and rotation ($\psi$, $\theta$, $\phi$) in this function follows Equation (8), and the rotor speed parameters are calculated as:
$$\omega_{ij}(t+1) = \omega_{ij}(t) + \Delta\omega_{ij}(t), \quad \omega_{ij}(t) \in S^{t},\ \Delta\omega_{ij}(t) \in A^{t}$$
In addition, the obstacle parameters ($x_{\mathrm{ob}}$, $y_{\mathrm{ob}}$, $z_{\mathrm{ob1}}$, $z_{\mathrm{ob2}}$) in $S_{\mathrm{obstacle}}^{t}$ are those of the obstacle closest to either UAV of the UAV pair, which are derived directly from the Environment, as:
$$S_{\mathrm{obstacle}}^{t} = \begin{cases} \arg\min\limits_{(x_{\mathrm{ob}},\, y_{\mathrm{ob}},\, z_{\mathrm{ob1}},\, z_{\mathrm{ob2}})} \left\| [x_{\mathrm{ob}}, y_{\mathrm{ob}}] - [x_i^{t}, y_i^{t}] \right\|_2, & z_{\mathrm{ob1}} \le z_i^{t} \le z_{\mathrm{ob2}} \\ \mathrm{None}, & z_i^{t} < z_{\mathrm{ob1}} \ \text{or} \ z_i^{t} > z_{\mathrm{ob2}} \end{cases}, \quad i \in \{1, 2\}$$
Then, a function of the UAV state transforming from timestamp t to t+1 is obtained, denoted as:
$$S^{t+1} = \mathrm{trans}(S^{t}, A^{t})$$
To further simulate the influence of the external environment, such as wind, a random bias $\boldsymbol{b}$ is added to $S^{t+1}$, which only affects the three position parameters ($x$, $y$, $z$); the final state at time $t+1$ is denoted as:
$$S^{t+1} = \mathrm{trans}(S^{t}, A^{t}) + \boldsymbol{b}, \quad \boldsymbol{b} = [b_x, b_y, b_z, 0, 0, \dots, 0]^{T}$$
The expectations and variances of the parameters in the random bias $\boldsymbol{b}$ are denoted as $\mu_x$, $\mu_y$, $\mu_z$ and $\sigma_x$, $\sigma_y$, $\sigma_z$.
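A short sketch of how the operational uncertainty of Equation (17) can be injected after the deterministic transition is shown below; the trans argument stands for the state-transition function built from the dynamic model, and applying an independent Gaussian bias to each UAV's position entries is an assumption of this sketch.

```python
import numpy as np

def noisy_transition(state, action, trans, mu=(0.0, 0.0, 0.0), sigma=(0.05, 0.05, 0.05)):
    """S_{t+1} = trans(S_t, A_t) + b (Eq. (17)), with b a Gaussian bias acting
    only on position entries; mu/sigma are the bias statistics defined above."""
    next_state = np.asarray(trans(state, action), dtype=float)
    bias = np.zeros_like(next_state)
    # Eq. (17) marks only position entries; giving each UAV in the pair an
    # independent (x, y, z) bias (indices follow Eq. (10)) is an assumption here.
    bias[0:3] = np.random.normal(mu, sigma)
    bias[10:13] = np.random.normal(mu, sigma)
    return next_state + bias
```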

3.4.2. Agent

We use the A3C architecture to build the Agent as an example for implementing the P-DRL framework. There are two basic networks in each Agent: the Actor network and the Critic network. The same structure of backpropagation (BP) neural networks [40] is used for the Actor and Critic to evaluate the q-value and v-value, whose estimated values are denoted as $Q_\pi(S^{t}, A^{t}; W)$ and $V_\pi(S^{t}; \theta)$, where $W$ is the weight matrix of the Actor network and $\theta$ is the weight matrix of the Critic network. The input is the state of the UAV pair and an action from its action space, i.e., $S^{t}$ and $A^{t}$.
The $\varepsilon$-greedy criterion is selected as the policy for the Agent to select the action $A^{t}$ in the state $S^{t}$; the policy $\pi(S^{t})$ is denoted as:
$$\pi(S^{t}) = \begin{cases} \arg\max\limits_{a} Q_\pi(S^{t}, a; W), & e \ge 2\varepsilon, \ \text{by the Actor} \\ \arg\max\limits_{a} V_\pi(\mathrm{trans}(S^{t}, a); \theta), & 2\varepsilon > e > \varepsilon, \ \text{by the Critic} \\ \mathrm{random}\{\mathbf{A}(S^{t})\}, & e \le \varepsilon \end{cases}$$
$$a \in \mathbf{A}(S^{t}), \quad \pi(S^{t}) = A^{t}, \quad S^{t+1} = \mathrm{trans}(S^{t}, A^{t}), \quad e = \mathrm{random}(0, 1)$$
where $e$ is a random decimal between 0 and 1, obtained by the function random(·), and $\varepsilon$ is a small positive constant.
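A compact sketch of this action-selection policy is given below. The exact branch thresholds in Equation (18) are difficult to recover from the typesetting, so the split used here (Actor when e ≥ 2ε, Critic when ε < e < 2ε, random otherwise) is one consistent reading and should be treated as an assumption; q_value, v_value, and trans stand for the Actor network, the Critic network, and the state-transition function.

```python
import random

def select_action(state, actions, q_value, v_value, trans, eps=0.1):
    """epsilon-greedy style policy over the pairwise action space A(S^t)."""
    e = random.random()
    if e >= 2 * eps:                                   # exploit the Actor's q-value
        return max(actions, key=lambda a: q_value(state, a))
    if e > eps:                                        # exploit the Critic's v-value
        return max(actions, key=lambda a: v_value(trans(state, a)))
    return random.choice(list(actions))                # explore at random
```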

3.4.3. Reward

Synergetic formation control of the UAV fleet needs to achieve the following goals: Firstly, the UAV must avoid obstacles and other UAVs. Secondly, the aircraft must reach the formation destination under this premise. Based on these drone formation control targets, the absolute value of the reward function can be designed according to the following principles: for every UAV pair composed of two UAVs, UAV safety reward ≈ Obstacle safety reward >> Formation reward. This enables the Agent to perform formation control while ensuring safe operation.
(1)
UAV safety reward.
Set the UAV safety reward $r_t^{\mathrm{uav}}$ at timestamp $t$ as:
$$r_t^{\mathrm{uav}} = \begin{cases} R_1, & \left\| [x_{1,i}^{t}, y_{1,i}^{t}, z_{1,i}^{t}] - [x_{2,j}^{t}, y_{2,j}^{t}, z_{2,j}^{t}] \right\|_2 \le d,\ t = t_1 \\ R_2, & \left\| [x_{1,i}^{t}, y_{1,i}^{t}, z_{1,i}^{t}] - [x_{2,j}^{t}, y_{2,j}^{t}, z_{2,j}^{t}] \right\|_2 \le d,\ t = t_2 \\ R_3, & \left\| [x_{1,i}^{t}, y_{1,i}^{t}, z_{1,i}^{t}] - [x_{2,j}^{t}, y_{2,j}^{t}, z_{2,j}^{t}] \right\|_2 \le d,\ t = t_3 \\ 0, & \text{otherwise} \end{cases}, \quad i, j \in \{1, 2, 3, 4, 5\}$$
where $[x_{1,i}^{t}, y_{1,i}^{t}, z_{1,i}^{t}]$ is the position of the $i$-th key collision point of UAV1 at timestamp $t$, and the suffix $j$ refers to UAV2; for every single UAV, there are five key collision points, as seen in Figure 1. $R_1$, $R_2$, $R_3$ are the reward values, set to $R_1$ = −400, $R_2$ = −20, $R_3$ = −4. $t_1$, $t_2$, $t_3$ are the prediction time spans according to the current state of the two UAVs. For example, it is assumed that $t_1$ = 1 s, $t_2$ = 5 s, $t_3$ = 10 s in this paper, which means the positions of the two UAVs after 1 s, 5 s, and 10 s are calculated based on their current speed and attitude; the reward value $r_t^{\mathrm{uav}}$ is then received according to whether there is a collision after 1 s, 5 s, or 10 s. $d$ is the safety distance used to judge whether UAVs collide, set to $d$ = 0.5 m.
(2)
Obstacle safety reward.
Set the Obstacle safety reward $r_t^{\mathrm{ob}}$ at timestamp $t$ as follows:
$$r_t^{\mathrm{ob}} = \sum_{i \in \{1, 2\}} \sum_{k \in \{t_1, t_2, t_3\}} \mathrm{ob\_reward}_{i,k}$$
$$\mathrm{ob\_reward}_{i,k} = \begin{cases} R_1/2, & \left\| [x_i^{k}, y_i^{k}] - [x_{\mathrm{ob}}^{k}, y_{\mathrm{ob}}^{k}] \right\|_2 \le d_{\mathrm{ob}},\ k = t_1 \\ R_2/2, & \left\| [x_i^{k}, y_i^{k}] - [x_{\mathrm{ob}}^{k}, y_{\mathrm{ob}}^{k}] \right\|_2 \le d_{\mathrm{ob}},\ k = t_2 \\ R_3/2, & \left\| [x_i^{k}, y_i^{k}] - [x_{\mathrm{ob}}^{k}, y_{\mathrm{ob}}^{k}] \right\|_2 \le d_{\mathrm{ob}},\ k = t_3 \\ 0, & z_i^{t} \notin [z_{\mathrm{ob1}}, z_{\mathrm{ob2}}] \ \text{or} \ \left\| [x_i^{k}, y_i^{k}] - [x_{\mathrm{ob}}^{k}, y_{\mathrm{ob}}^{k}] \right\|_2 > d_{\mathrm{ob}} \end{cases}$$
where $[x_i^{k}, y_i^{k}, z_i^{k}]$ is the position of the geometric center of the $i$-th UAV at timestamp $k$, $[x_{\mathrm{ob}}^{k}, y_{\mathrm{ob}}^{k}]$ is the position of the obstacle, and $[z_{\mathrm{ob1}}, z_{\mathrm{ob2}}]$ are the heights of the bottom and top of the obstacle. $d_{\mathrm{ob}}$ is the radius of the obstacle. The values of $R_1$, $R_2$, $R_3$ and $t_1$, $t_2$, $t_3$ are set in the same manner as in Equation (19) and summarized in Section 4.1.
(3)
Formation reward.
Set the Formation reward $r_t^{\mathrm{fm}}$ at timestamp $t$ as follows:
$$r_t^{\mathrm{fm}} = \sum_{i \in \{1, 2\}} \mathrm{fm\_reward}_i$$
$$\mathrm{fm\_reward}_i = \frac{D \cdot \left( \left\| P_i^{t} - D_i \right\|_2 - \left\| P_i^{t+1} - D_i \right\|_2 \right)}{2 \cdot \left\| P_i^{t} - D_i \right\|_2} - \frac{D}{2}, \quad i \in \{1, 2\}$$
$$P_i^{t} = \left[ x_i^{t}, y_i^{t}, z_i^{t} \right], \quad P_i^{t+1} = \left[ x_i^{t+1}, y_i^{t+1}, z_i^{t+1} \right], \quad D_i = \left[ x_i^{d}, y_i^{d}, z_i^{d} \right]$$
where $P_i^{t} = [x_i^{t}, y_i^{t}, z_i^{t}]$ is the position of the $i$-th UAV at timestamp $t$, $P_i^{t+1} = [x_i^{t+1}, y_i^{t+1}, z_i^{t+1}]$ is the position of the $i$-th UAV at timestamp $t+1$, calculated from the transformation of the state $S^{t}$ under the action $A^{t}$, and $D_i = [x_i^{d}, y_i^{d}, z_i^{d}]$ is the destination of the formation task. This mapping of the Formation reward makes the value for each UAV at a single timestamp fall within the interval $-D$ to 0. The value of $D$ is set to 10 in this paper.
Finally, the reward $R_t$ of the Environment–Agent interaction of two UAVs at timestamp $t$ is:
$$R_t = r_t^{\mathrm{uav}} + r_t^{\mathrm{ob}} + r_t^{\mathrm{fm}}$$
There are some tips in reward shaping: Firstly, all of the rewards should be set as negative numbers to prevent the aircraft from circling in the airspace to get more rewards. Secondly, it is better to split the UAV safety reward and Obstacle safety reward into multiple rewards at different timestamps, which will speed up the convergence of the q-value and v-value, as well as play a role in conflict detection, as seen in Figure 2.
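To make the segmented structure concrete, a minimal Python sketch of the three reward terms follows. The $R_1$–$R_3$, $t_1$–$t_3$, $d$, and $D$ values are those stated above; the obstacle radius d_ob, the linear position prediction, and the use of geometric centres instead of all five key collision points are simplifying assumptions, and the formation-reward expression follows one reading of Equation (23) that is consistent with its stated $[-D, 0]$ range.

```python
import numpy as np

R = {1.0: -400.0, 5.0: -20.0, 10.0: -4.0}    # {t1: R1, t2: R2, t3: R3} from Section 3.4.3
d_uav, d_ob, D = 0.5, 1.0, 10.0              # safety distance, assumed obstacle radius, scale D

def predict(p, v, t):
    """Illustrative linear extrapolation of a position after t seconds."""
    return np.asarray(p, dtype=float) + np.asarray(v, dtype=float) * t

def uav_safety_reward(p1, v1, p2, v2):
    """Segmented UAV safety reward r_t^uav (Eq. (19)), geometric centres only."""
    return sum(r for t, r in R.items()
               if np.linalg.norm(predict(p1, v1, t) - predict(p2, v2, t)) <= d_uav)

def obstacle_safety_reward(p, v, ob_xy, ob_z1, ob_z2):
    """Segmented obstacle safety reward for one UAV (Eqs. (20) and (21)), halved values."""
    total = 0.0
    for t, r in R.items():
        q = predict(p, v, t)
        if ob_z1 <= q[2] <= ob_z2 and np.linalg.norm(q[:2] - np.asarray(ob_xy)) <= d_ob:
            total += r / 2.0
    return total

def formation_reward(p_t, p_t1, dest):
    """Progress reward fm_reward_i, one reading of Eq. (23) mapped into [-D, 0]."""
    before = np.linalg.norm(np.asarray(p_t, dtype=float) - np.asarray(dest, dtype=float))
    after = np.linalg.norm(np.asarray(p_t1, dtype=float) - np.asarray(dest, dtype=float))
    return D * (before - after) / before / 2.0 - D / 2.0
```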

3.4.4. Interaction

This is similar to the architecture of most DRL models; for each timestamp t, the Agent selects the action A t based on the policy and the q/v-value of the state S t evaluated by the neural networks, then obtains the reward R t , and the state of UAV pair becomes S t + 1 . In the multi-agent learning model of A3C, it can be regarded that n Agents are performing the above steps and updating the weights of their common neural network, as seen in Figure 3 and Figure 4.
As a result, the DRL model for UAV pairwise formation control can be represented as:
$$\max Z = E_\pi\left[ \sum_{t=1}^{T} R_t(S^{t}, A^{t}) \right] \quad \text{s.t.} \quad S^{t+1} = \mathrm{trans}(S^{t}, A^{t}), \quad A^{t} = \pi(S^{t}), \quad A^{t} \in \mathbf{A}(S^{t}), \quad S^{t} \in \mathbf{S}$$
where $E_\pi(\cdot)$ represents the mathematical expectation of the total reward under the policy $\pi$, and $T$ is the maximum timestamp in each round of formation control.

4. Approach

4.1. P-DRL Framework

By integrating the dynamic pairing algorithm with the deep reinforcement learning method, we propose P-DRL, a real-time formation control framework in which the Agent in the DRL model can adopt many deep neural network learning architectures, such as A3C, AC, DDQN, etc.
The main module in the framework and its functions are described below:
  • Single UAV control model: this model is used to define the performance of UAVs, such as the maximum rotor speed, the range of rolling angle, the body configuration used for collision detection, and the state transforming function used in the DRL model.
  • DRL model for pairwise formation control: this is the model used for training the Agent in the synergistic control of two UAVs, including the state and action space settings and reward shaping in the Environment, the decision policy and the architecture of the deep neural networks in the Agent, and the Environment–Agent interaction mode.
  • Algorithm 1—dynamic pairing: converts the formation control problem of a UAV fleet into a synergetic formation control problem involving multiple UAV pairs. This reduces the difficulty of Agent training and allows the control scenario to be solved by the Agent.
  • Algorithm 2—Agent training: trains the Agent based on the reward returned by the Environment–Agent interaction.
  • Implement: for every timestamp, the dynamic pairing algorithm chooses a UAV pair composed of two UAVs, then, the Agent allocates an action for each of them, loops until all of the UAVs have been paired, then turns to the next timestamp.
To briefly describe this framework: we use the dynamic pairing algorithm to break down the original scenario into multiple paired scenarios, then use the trained Agent to assign an action to the two UAVs in each pair, loop over all the UAVs in the airspace, and iterate to the next timestamp. The Agent is trained based on the DRL model for UAV pairwise control, in which multiple DRL architectures such as AC, A3C, and DDQN can be adopted. Therefore, we denote this framework as dynamic Pairing and Deep Reinforcement Learning (P-DRL); when the DRL uses the architecture of A3C, we denote the method as P-A3C. By analogy, formation control methods such as P-DDQN and P-AC can also be built based on this architecture, as seen in Figure 5.

4.2. Dynamic Pairing

The core idea of dynamic pairing is to pair the UAVs according to the severity of the conflict or potential conflict. The operational situation of the UAVs varies over time, resulting in different UAV pairing results. This method simplifies the multi-UAV synergetic formation control problem to a two-UAV synergetic formation control problem, which can then be addressed by the trained Agent, as seen in Figure 6.
A dynamic pairing algorithm is used to convert the synergetic formation control of n UAVs into multiple instances of two-UAV synergetic formation control. The main process is as follows:
Step 1: Build a set $K$ composed of UAVs waiting for formation at timestamp $t$. Assuming there are $n$ UAVs in the airspace ($\mathrm{UAV}_1$–$\mathrm{UAV}_n$), at timestamp $t$, initialize $K$ as:
$$K = \{\mathrm{UAV}_1, \mathrm{UAV}_2, \mathrm{UAV}_3, \ldots, \mathrm{UAV}_n\}$$
Step 2: Build the distance matrix $M_{(n \times n)}$, as follows:
$$M_{i,j} = \left[ \mathrm{Distance}(\mathrm{UAV}_i, \mathrm{UAV}_j) \right]_{(n \times n)}$$
Step 3: Pick the UAV pair $(i, j)$ with the highest priority, as:
$$(i, j) = \arg\min_{(i, j)} (d_{ij}), \quad \mathrm{UAV}_i, \mathrm{UAV}_j \in K$$
Step 4: If a UAV in the selected pair is in set $K$, assign it an action through the Agent; otherwise, skip it;
Step 5: Delete the UAVs that have been assigned an action from $K$, and return to Step 3 until $K = \emptyset$.
The pseudo-code of the dynamic pairing is shown below:
Algorithm 1: Dynamic pairing algorithm for n UAVs.
Drones 08 00475 i001
    This algorithm is designed based on the prioritization of conflict severity and pairwise action allocation. For the action assignment task at timestamp t, the number of UAVs is finite, so this pairing algorithm can traverse them. It is ensured that at least one UAV receives its action from the Agent in each dynamic pairing and action-allocating process. By repeatedly performing pairwise pairing and action allocation, the algorithm can traverse all of these UAVs at each timestamp t, finally completing the action assignment of multiple UAVs in the airspace.
It can be found that the complexity of this pairing method is linear, O(n − 1), and the complexity of the q-value and v-value evaluation by the trained Agent is also O(n), where n is the number of UAVs in the airspace. Therefore, the P-DRL-based method has a complexity of O(2n − 1), which is low enough, from the perspective of time complexity, to realize real-time formation control in a complex environment.
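The published pseudo-code of Algorithm 1 is given as a figure; a minimal Python sketch following Steps 1–5 is shown below. The assign_action argument is a placeholder for querying the trained Agent with a UAV pair, and how the final unpaired UAV is handled when n is odd is an assumption of this sketch.

```python
import numpy as np
from itertools import combinations

def dynamic_pairing(positions, assign_action):
    """Steps 1-5: repeatedly pick the closest (most conflict-prone) pair of UAVs
    still waiting in K, let the Agent assign their actions, and stop when K is empty.
    positions maps UAV id -> np.array([x, y, z]); assign_action(i, j) returns the
    pair of actions decided by the trained Agent."""
    K = set(positions)                                   # Step 1: waiting set
    actions = {}
    while K:
        if len(K) == 1:                                  # odd leftover UAV: pair it with its
            i = K.pop()                                  # nearest neighbour (assumption)
            j = min((k for k in positions if k != i),
                    key=lambda k: np.linalg.norm(positions[i] - positions[k]))
            actions[i] = assign_action(i, j)[0]
            break
        # Steps 2-3: distance matrix / highest-priority (closest) pair still in K
        i, j = min(combinations(K, 2),
                   key=lambda p: np.linalg.norm(positions[p[0]] - positions[p[1]]))
        actions[i], actions[j] = assign_action(i, j)     # Step 4: Agent allocates actions
        K -= {i, j}                                      # Step 5: remove assigned UAVs
    return actions
```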

4.3. Agent Training

This paper uses the A3C agent training architecture as an example to illustrate how to train an Agent using the constructed DRL model. For each instance of Environment–Agent interaction, the Actor network and the Critic network in the Agent will choose an action for two UAVs independently and assign the iteration parameters to the common network. For the Actor network, the Agent selects the action A t based on the value of Q π ( S t , A t ; W) and the strategy ( π ). For the Critic network, the Agent selects the action A t based on the value of V π ( S t ; θ ), and executes the same processes as for the Actor.
Firstly, initialize the action-value (q-value) evaluation neural network weight matrix $W$ and the state-value (v-value) evaluation neural network weight matrix $\theta$. Initialize the learning rates $\alpha^{(W)}$ and $\alpha^{(\theta)}$ for the q-value and v-value networks, and the discount rate $\gamma$.
Define the estimated value $Q_\pi(S^{t}, A^{t}; W)$ of the q-value under policy $\pi$ and weights $W$ as:
$$E_q(S^{t}, A^{t}) = Q_\pi(S^{t}, A^{t}; W) = R_t + \gamma \cdot \max_{A^{t+1} \in \mathbf{A}} Q_\pi(S^{t+1}, A^{t+1}; W)$$
Define the estimated value $V_\pi(S^{t}; \theta)$ of the v-value under policy $\pi$ and weights $\theta$ as:
$$E_v(S^{t}) = V_\pi(S^{t}; \theta) = R_t + \max V_\pi(S^{t+1} \mid S^{t}; \theta), \quad S^{t+1} \in \mathbf{S}(A^{t} \mid S^{t}),\ A^{t} \in \mathbf{A}$$
The Agent decides an action $A^{t}$ based on the value of $Q_\pi(S^{t}, A^{t}; W)$ or $V_\pi(S^{t}; \theta)$ and the policy $\pi$, then receives the reward feedback $R_t$ and calculates the iteration error $U \leftarrow R_t + \gamma \cdot Q_\pi(S^{t}, A^{t}; W)$ or $U \leftarrow R_t + \gamma \cdot V_\pi(S^{t}; \theta)$.
Now, there are two weight matrices for the neural networks, $W$ and $\theta$. Define the mappings $f_1: W \mapsto q(S^{t}, A^{t})$ and $f_2: \theta \mapsto v(S^{t})$; the corresponding Fisher matrices of $f_1$ and $f_2$ are $F_W$ and $F_\theta$, as follows:
$$F_W = E\left[ \nabla_W \ln \pi(A^{t} \mid S^{t}; W) \times \left( \nabla_W \ln \pi(A^{t} \mid S^{t}; W) \right)^{T} \right]$$
$$F_\theta = E\left[ \nabla_\theta \ln \pi(A^{t} \mid S^{t}; \theta) \times \left( \nabla_\theta \ln \pi(A^{t} \mid S^{t}; \theta) \right)^{T} \right]$$
Update $W \leftarrow W + \alpha^{(W)} \cdot \nabla_W Q_\pi(S^{t}, A^{t}; W) \cdot F_W^{-1}$ to decrease the value of $[U - Q_\pi(S^{t}, A^{t}; W)] \cdot \pi(A^{t} \mid S^{t}; W)$, and update $\theta \leftarrow \theta + \alpha^{(\theta)} \cdot \nabla_\theta V_\pi(S^{t}; \theta) \cdot F_\theta^{-1}$ to decrease the value of $[U - V_\pi(S^{t}; \theta)] \cdot \pi(A^{t} \mid S^{t}; \theta)$. The A3C training algorithm for synergetic control is Algorithm 2.
Algorithm 2: Agent (A3C structure) training algorithm for UAV pairwise control.
Drones 08 00475 i002
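As a reference for how such an Agent can be trained, the following is a minimal, synchronous actor–critic update sketch written with PyTorch and consistent with the 2 × 128 network structure used in Section 5.2.1. The asynchronous worker/global-network synchronization of A3C, the ε-greedy policy of Equation (18), and the natural-gradient (Fisher matrix) correction are omitted; N_ACTIONS, GAMMA, and the single-transition update are assumptions for illustration, not the exact training configuration of this paper.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 24, 64, 0.9    # 24-dim pair state (Eq. (10)); action count
                                             # and discount rate are placeholder values

class ActorCritic(nn.Module):
    """Actor (policy/q head) and Critic (v head), each a 2 x 128 fully connected network."""
    def __init__(self):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                   nn.Linear(128, 128), nn.ReLU(),
                                   nn.Linear(128, N_ACTIONS))
        self.critic = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                    nn.Linear(128, 128), nn.ReLU(),
                                    nn.Linear(128, 1))

    def forward(self, s):
        return self.actor(s), self.critic(s)

def actor_critic_update(net, optimizer, s, a, r, s_next, done):
    """One advantage actor-critic update on a single transition (s, a, r, s_next)."""
    logits, v = net(s)
    with torch.no_grad():
        _, v_next = net(s_next)
        target = r + GAMMA * v_next * (1.0 - done)       # bootstrapped v-value target
    advantage = target - v
    log_prob = torch.log_softmax(logits, dim=-1)[a]
    actor_loss = -(log_prob * advantage.detach())        # policy-gradient term
    critic_loss = advantage.pow(2)                       # v-value regression term
    optimizer.zero_grad()
    (actor_loss + critic_loss).sum().backward()
    optimizer.step()

# Example usage: net = ActorCritic(); opt = torch.optim.Adam(net.parameters(), lr=1e-3)
```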

4.4. Implement

After the training process of the Agent is completed, we can use the Agent combined with the dynamic pairing algorithm. The implementation process of the P-DRL method can be demonstrated as seen in Figure 7.
Step 1: Execute the process of dynamic pairing according to the situation of the UAV fleet at timestamp t. Then, we will obtain multiple UAV pairs using the pairing algorithm.
Step 2: For every UAV pair, obtain the obstacle that is closest to either of them, as:
$$S_{\mathrm{obstacle}}^{t} = \left[ x_{\mathrm{ob}}^{t}, y_{\mathrm{ob}}^{t}, z_{\mathrm{ob1}}^{t}, z_{\mathrm{ob2}}^{t} \right]_{1 \times 4} = \arg\min_{S_{\mathrm{obstacle}}} \left\| S_{\mathrm{obstacle}} - S_{\mathrm{UAV}}^{t} \right\|$$
Step 3: Then, we obtain the state of the UAV pair and the obstacle most relevant to their operation, as:
$$S^{t} = \left[ S_{\mathrm{UAV}}^{t}, S_{\mathrm{obstacle}}^{t} \right]_{1 \times 24}$$
For every UAV pair, input this state vector into the trained Agent, then receive the action from the Agent, as seen in Equation (18).
Step 4: Every UAV executes the action from the Agent and receives a new state under the operational uncertainty.
Step 5: If the formation task has been completed, end this process, or else return to Step 1 with timestamp t + 1.
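Steps 1–5 can be summarized in a short control-loop sketch. All helpers are passed in as arguments, so the sketch stays independent of any specific implementation: pair_fn stands for the dynamic pairing of Algorithm 1, obstacle_fn and state_fn for Steps 2 and 3, agent_act for the trained Agent, env_step for the transition with operational uncertainty (Equation (17)), and the default control period of 0.2 s corresponds to the 5 Hz setting used in Section 5.3.1.

```python
def formation_control_loop(uav_states, obstacles, pair_fn, obstacle_fn,
                           state_fn, agent_act, env_step, done_fn, control_period=0.2):
    """Implementation loop of Figure 7 (Steps 1-5) for one formation task."""
    t = 0.0
    while not done_fn(uav_states):
        for i, j in pair_fn(uav_states):                                 # Step 1: dynamic pairing
            ob = obstacle_fn(uav_states[i], uav_states[j], obstacles)    # Step 2: closest obstacle
            s = state_fn(uav_states[i], uav_states[j], ob)               # Step 3: pair state
            a_i, a_j = agent_act(s)                                      # Agent decision
            uav_states[i] = env_step(i, a_i)                             # Step 4: execute under
            uav_states[j] = env_step(j, a_j)                             # operational uncertainty
        t += control_period                                              # Step 5: next timestamp
    return uav_states
```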

5. Simulation and Results

5.1. Background

We use the Da–Jiang Innovations (DJI) Phantom 4-type UAV as our object for synergetic formation control experiments, and the key parameters are shown in Table 1. These parameters are set based on the official parameters of the drone manufacturer to keep the simulation experiments realistic [41].
The hyperparameter settings in the DRL model are summarized in Table 2.
The simulations are based on a 64-bit operating system with 8 GB of RAM and an i7-6700 CPU. The simulation software is Spyder 5 with Python 3.8, and the hardware is an STM32F303 microcontroller unit (MCU) programmed in C.
In the simulation experiments, we use the deep reinforcement learning architecture of A3C, Actor–Critic, and DDQN combined with the dynamic pairing algorithm, thus forming a P-DRL-based method (denoted as P-A3C, P-AC, P-DDQN) and we achieve synergetic formation control.

5.2. Simulations

5.2.1. Training the Agent

The architecture of the DRL model and the training algorithm have been described in Section 3.4 and Section 4.3. The P-DRL UAV formation control framework is a method that is compatible with multiple DRL models. In this paper, we choose three typical DRL models for experiments to demonstrate the excellent compatibility of the P-DRL model, and these models are A3C, DDQN, and AC. Taking A3C as an example, for both the Actor and Critic networks, we use the structure of two layers of BP neural networks with 128 neurons per layer (denoted as 2 × 128), with 3 × 128 and 2 × 64 as the basic q-value and v-value estimation networks for ablation experiments. The double deep q-value network (DDQN) [42] and Actor–Critic [43] are used as extra experiments to demonstrate the universality of this control framework, where the Environment–Agent interaction mode, the hyper-parameter settings, and the structure of the neural networks are the same as for A3C. The parameters in the reward function are set according to the principle of UAV safety reward ≈ Obstacle safety reward >> Formation reward, as described in Section 3.4.3. $t_1$, $t_2$, $t_3$ are set for collision detection over short-term, medium-term, and long-term periods. This approach enables the trained Agent to have more comprehensive conflict resolution capabilities. Also, the values of the learning rate and discount rate can be set as general deep reinforcement learning parameters. Then, we train an Agent using the DRL model of A3C, AC, and DDQN, as shown in Figure 8 and Figure 9.
The value of the average reward per second of a UAV reflects the synergetic control ability of the Agent. The training process shows that all of these methods have significant learning effects, but compared with the DDQN and Actor–Critic training algorithms, the Agent trained by the A3C algorithm has a more stable learning efficiency and better ultimate UAV pairwise synergetic formation control ability.

5.2.2. Formation Control

From the perspective of the UAV formation phase [44], UAV formation is mainly composed of four parts: formation shaping, keeping, reconfiguration, and dissolution. Formation shaping refers to the process from takeoff to achieving the predetermined spatial configuration, typically starting from a motionless state of the UAVs. Formation keeping mainly refers to the flying process of a UAV fleet in motion in a certain conformation in the airspace, which needs to counteract interference that will affect the formation [15]. Formation reconfiguration refers to the process of changing the formation arrangement during the flying process. Formation dissolution is the process of the formation dissolving.
Generally, it is easier to execute the formation-keeping and dissolution processes than the shaping and reconfiguration processes. Therefore, we choose the scenarios of shaping and reconfiguration to demonstrate the performance of the P-DRL-based method, in which UAVs should avoid multiple obstacles, counteract interference, and change the formation's arrangements. The information on UAVs and obstacles in the scenarios is shown in Table 3.
(1)
Scenario 1: Formation Shaping Control
After training the Agent to complete the formation control tasks for two UAVs, the dynamic pairing algorithm in Section 4.2 is used to convert the synergetic formation control of the UAV fleet into the synergetic formation control of multiple UAV pairs. Then, the Agent is used to solve the formation control task for the UAV pair at each timestamp, and the formation control task of the UAV fleet is completed by repeating the above process.
The process of taking 20 UAVs from a static state to formation arrangement based on the P-A3C method is shown in Figure 10 and Figure 11.
It can be seen from Figure 10 and Figure 11 that this synergetic formation control method for the UAV fleet achieved good results. The 20 UAVs avoided obstacles and avoided trajectory conflicts with other UAVs while reaching the formation destination at a relatively fast speed. In Figure 11, due to the requirement of adjusting altitude and avoiding surrounding drones, some repeated adjustments in a horizontal direction were made in the region of x = 80–160. Taking four propeller rotor speeds of UAV1 changing with time as an example (gray bold dotted line in Figure 10 and Figure 11), it can be seen in Figure 12 that the control of the UAV by the Agent is smooth and accurate, which conforms to the general aerodynamics.
(2)
Scenario 2: Formation Reconfiguration Control
Scenario 2 is an experiment starting from the formation arrangement results of Scenario 1. The 20 UAVs descended from a star formation and re-arranged into a diamond formation while avoiding collisions between UAVs and obstacles. The process of formation reconfiguration for 20 UAVs based on the P-A3C method is shown in Figure 13 and Figure 14.
It can be seen from Figure 13 and Figure 14 that these 20 UAVs can avoid obstacles and trajectory conflicts with others while reaching the new formation destination at a fast speed under the control of the Agent. In this simulation, we set an initial speed for the drones, therefore, the formation flying speed is relatively fast compared to Scenario 1. Four propeller rotor speeds for UAV1 (gray bold dotted line in Figure 13 and Figure 14) change with time, as shown in Figure 15.

5.3. Performance Analysis

5.3.1. Success Rate of Non-Collision Formation Control

(1)
Success Rate sensitivity analysis by changing control frequency.
The P-A3C formation control framework can adapt to multi-UAV formation tasks in the presence of multiple random obstacles, which means this method can be used in other scenarios without retraining the DRL model, even when the obstacles are unknown before the formation task and the UAVs are controlled in real time by the trained Agent. There are 49 other scenarios composed of 10–20 UAVs designed to test the success rate of non-collision synergetic formation control by P-A3C, P-DDQN, and P-AC. During this process, we attempted to change the frequency at which the Agent sends control commands to the UAVs. The non-collision success rates of P-A3C, P-DDQN, and P-AC as functions of the control frequency are shown in Figure 16, Figure 17 and Figure 18.
Control frequency by the Agent represents the update frequency of the drone’s actions, for example, a control frequency of 1 Hz means the Agent controls the drones’ actions every second, and a control frequency of 10 Hz means the Agent controls the drones’ actions every 0.1 seconds. Due to the execution time requirement of the P-DRL formation control algorithm, the maximum control frequency is 17.14 Hz.
Firstly, Figure 16 reflects the excellent compatibility of the P-DRL formation control framework with multiple types of deep reinforcement learning methods. Although there were some differences in specific performance and stability, the A3C, DDQN, and AC deep reinforcement learning methods all performed well in this framework (A3C: 91.7–96.2%, DDQN: 90.0–96.2%, AC: 90.1–96.3%).
While continuously increasing the frequency of control decisions by the trained Agent, the success rate in the 50 formation control scenarios shows a clear trend of “growing, then stabilizing”, which is a point worth paying attention to. In the P-DRL framework, increasing the control frequency of the Agent does not increase the success rate to 100%. The reason is that there is a performance limit on the UAV's control. Sometimes, even if the Agent makes the correct decision, a collision may still occur due to the high speed of the UAVs and the control lag. An obvious and effective solution is to limit the maximum operating speed of the UAV (such as setting v max = 5 m/s) to improve the success rate. Certainly, the experimental value of this measure is limited: inefficient and time-consuming formation control naturally improves operational safety, but the gain does not come from an improvement of the method itself.
(2)
Success rate with collision avoidance measure.
In this section, we introduce an emergency braking collision avoidance measure into the P-DRL formation control framework. Assuming that the surrounding sensors of the UAV detect that the distance from an obstacle to itself is less than $D_1$ = 0.5 m, the UAV immediately brakes with maximum braking performance to ensure operational safety. In this experiment, emergency braking is not always safe: when the UAV flies towards an obstacle at high speed, emergency braking may fail, resulting in the UAV colliding with the obstacle. See the diagram of emergency braking in Figure 19.
We set the control frequency of the Agent to 5 Hz and conducted UAV control experiments using the P-A3C, P-DDQN, and P-AC methods. The scenarios were the same as in Section 5.3.1, and the success rate distribution results are shown in Figure 20.
Compared to the P-DRL formation control method without collision avoidance measures (P-A3C: 95.32%, P-DDQN: 94.51%, P-AC: 94.49%), the advantage of the P-DRL formation control method with avoidance measures in place is apparent (about P-A3C: 95.97%, P-DDQN: 94.98%, P-AC: 95.55%). Limiting the maximum operating speed of the UAV (such as setting v max = 5 m/s) is also an efficient measure to improve the non-collision success rate by sacrificing formation speed.

5.3.2. Average Formatting Speed of UAVs

The average formation speed of UAVs reflects the efficiency of UAV formation operations. The value of the average formation speed can be influenced by the radius of the obstacles and the performance of the UAV. Therefore, we use the same scenario as described in Section 5.2 to make a preliminary comparison. The results of the average formation speed are 9.24 m/s for P-A3C, 9.01 m/s for P-DDQN, and 10.11 m/s for P-AC, as in Table 4.
The average speeds of UAVs controlled using P-A3C, P-DDQN, and P-AC show no significant difference in this specific scenario. It can be speculated that when the structure and the reward of the DRL model are the same, the average speed of the UAVs controlled by the Agent will not change significantly.

5.3.3. Real-Time Performance

(1)
Only on software (without communication).
In this section, we only consider the P-DRL control framework operating in a computer software environment, implemented and executed using the Python programming language. The indicator of trajectory output time is used to reflect the real-time performance of the UAV formation control methods: the faster the control algorithm outputs control actions, the faster the trajectory is generated. Because the running time of a single control action for the drones is very short, we use the calculation time needed to complete the formation control and trajectory output of 20 UAVs as the indicator (120 control-action outputs for every single UAV). The artificial potential field (APF) method [44,45], a common trajectory planning algorithm, is used as a comparison.
The trajectory output times in the software environment for 20 UAVs using different methods are shown in Table 4.
(2)
On software and hardware (with communication).
Next, we consider the deployment and operational mode of P-DRL in real multi-UAV systems. A typical mode involves achieving multi-UAV formation control through a ground computing center by building an air–ground communication link. In this mode, the UAV sends real-time operational information such as its position and attitude to the ground computing center through the air–ground communication link and requests action instructions from the ground center. Then, the ground responds and sends action instructions based on the information from multiple UAVs, as shown in Figure 21.
In this experiment, we used the control chip of a DJI Phantom 4 as an example. The DJI Phantom 4 uses an STM32 microcontroller unit (MCU), which is relatively easy to purchase and obtain, as seen in Figure 22. Therefore, we established serial communication between a PC and an STM32 MCU to simulate the operation of P-DRL in a real unmanned aerial vehicle system.
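A minimal sketch of the PC side of such a serial link is shown below, assuming the pyserial package. The port name, baud rate, and JSON message format are illustrative assumptions rather than the protocol actually used with the STM32 MCU.

```python
import json
import serial  # pyserial

def serve_actions(agent, port="COM3", baudrate=115200):
    """Answer UAV state reports arriving over the serial link with action
    instructions computed by the Agent (hypothetical message format)."""
    with serial.Serial(port, baudrate, timeout=1.0) as link:
        while True:
            raw = link.readline()                # one state report per line
            if not raw:
                continue
            state = json.loads(raw.decode())     # e.g., {"id": 3, "pos": [...], "att": [...]}
            action = agent.select_action(state)  # hypothetical Agent interface
            link.write((json.dumps(action) + "\n").encode())
```

In such a loop, the measured trajectory output time includes both the Agent computation and the serialization and transmission overhead, which is why the software-and-hardware times in Table 4 are slightly larger than the software-only times.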
The trajectory output times for 20 UAVs in both software and hardware environments (with communication) using different methods are shown in Table 4. It can be seen that when using the P-DRL implementation method, as in Figure 21, the additional running time due to hardware information processing and communication is approximately 0.1 s. For the relevant code for the simulation in this paper, please refer to the link: https://github.com/jinlun8823/P-DRL_formation_control (accessed on 4 July 2024).

6. Discussion and Conclusions

The major contribution of this paper is a new UAV formation control framework that combines a DRL model with a heuristic algorithm. This framework transforms the synergetic formation control problem of the UAV fleet into a sequence of formation control problems for UAV pairs, limiting the number of UAVs the Agent needs to control at each step and making the Agent's training process easier. The feasibility of the P-DRL framework has been demonstrated through theoretical analysis and simulations using the A3C, AC, and DDQN architectures. The framework can complete real-time synergetic formation control tasks for 10–20 aircraft with nearly no collisions. Although this is a relatively large number compared to current research, it is still not the upper limit of the number of UAVs that can be controlled simultaneously by this framework. Moreover, the time complexity of the P-DRL method for formation control is O(2n − 1), which makes it suitable for real-time formation control of n (n ≥ 10) UAVs in complex environments with random obstacles and operational uncertainty.
The P-DRL formation control framework has good environmental adaptability and real-time performance, but it still has deficiencies. For example, collisions can still occur during the formation process, so finding a suitable collision avoidance control method as a supplement and safeguard to the P-DRL formation framework remains important. Meanwhile, although this paper presents a simple experiment for deploying P-DRL in UAV systems, it integrates only part of the information exchange system. How to implement the P-DRL framework on existing UAV sensor, collision avoidance, and control systems, and how to further improve its operational efficiency, success rate, and compatibility, are important directions for future studies.

Author Contributions

Conceptualization, J.Z.; methodology, J.Z. and H.Z.; software, M.H.; Validation, J.Z. and M.H.; formal analysis, F.W. and J.Y.; investigation, H.Z.; resources, M.H.; writing—original draft preparation, J.Z.; writing—review and editing, J.Z. and M.H.; visualization, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Social Science Foundation of China, grant number 22&ZD169, and the National Natural Science Foundation of China, grant number U2333214.

Data Availability Statement

Data will be made available on request.

Acknowledgments

The authors would like to thank Shi Zongbei, Gang Zhong, and Hao Liu from Nanjing University of Aeronautics and Astronautics for their academic guidance and experimental resources.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wu, J.; Luo, C.; Luo, Y.; Li, K. Distributed UAV Swarm Formation and Collision Avoidance Strategies Over Fixed and Switching Topologies. IEEE Trans. Cybern. 2022, 52, 10969–10979. [Google Scholar] [CrossRef] [PubMed]
  2. Wu, Y.; Gou, J.; Ji, H.; Deng, J. Hierarchical Mission Replanning for Multiple UAV Formations Performing Tasks in Dynamic Situation. Comput. Commun. 2023, 200, 132–148. [Google Scholar] [CrossRef]
  3. Du, W.; Guo, T.; Chen, J.; Li, B.; Zhu, G.; Cao, X. Cooperative Pursuit of Unauthorized UAVs in Urban Airspace via Multi-agent Reinforcement Learning. Transp. Res. Part C Emerg. Technol. 2021, 128, 103122. [Google Scholar] [CrossRef]
  4. Meng, Q.; Qu, Q.; Chen, K.; Yi, T. Multi-UAV Path Planning Based on Cooperative Co-Evolutionary Algorithms with Adaptive Decision Variable Selection. Drones 2024, 8, 435. [Google Scholar] [CrossRef]
  5. Zhang, J.; Zhang, H.; Zhou, J.; Hua, M.; Zhong, G.; Liu, H. Adaptive Collision Avoidance for Multiple UAVs in Urban Environments. Drones 2023, 7, 2024050715. [Google Scholar] [CrossRef]
  6. Felix, B.; Stratis, K.; Madeline, C.; Roberto, P.; Mike, B. A Taxonomy of Validation Strategies to Ensure the Safe Operation of Highly Automated Vehicles. J. Intell. Transp. Syst. 2022, 26, 14–33. [Google Scholar] [CrossRef]
  7. Guanetti, J.; Kim, Y.; Borrelli, F. Control of Connected and Automated Vehicles: State of the Art and Future Challenges. Annu. Rev. Control 2018, 45, 18–40. [Google Scholar] [CrossRef]
  8. Pan, Z.; Zhang, C.; Xia, Y.; Xiong, H.; Shao, X. An Improved Artificial Potential Field Method for Path Planning and Formation Control of the Multi-UAV Systems. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 1129–1133. [Google Scholar] [CrossRef]
  9. Zhang, X.; Li, H.; Zhu, G.; Zhang, Y.; Wang, C.; Wang, Y.; Su, C.Y. Finite-Time Adaptive Quantized Control for Quadrotor Aerial Vehicle with Full States Constraints and Validation on QDrone Experimental Platform. Drones 2024, 8, 264. [Google Scholar] [CrossRef]
  10. Yu, Y.; Chen, J.; Zheng, Z.; Yuan, J. Distributed Finite-Time ESO-Based Consensus Control for Multiple Fixed-Wing UAVs Subjected to External Disturbances. Drones 2024, 8, 260. [Google Scholar] [CrossRef]
  11. Patiño, D.; Mayya, S.; Calderon, J.; Daniilidis, K.; Saldaña, D. Learning to Navigate in Turbulent Flows with Aerial Robot Swarms: A Cooperative Deep Reinforcement Learning Approach. IEEE Robot. Autom. Lett. 2023, 8, 4219–4226. [Google Scholar] [CrossRef]
  12. Qi, Z.; Ziyang, Z.; Huajun, G.; Hongbo, C.; Rong, L.; Jicheng, L. UAV Formation Control based on Dueling Double DQN. J. Beijing Univ. Aeronaut. Astronaut. 2023, 49, 2137–2146. [Google Scholar] [CrossRef]
  13. La, H.M.; Lim, R.; Sheng, W. Multirobot Cooperative Learning for Predator Avoidance. IEEE Trans. Control Syst. Technol. 2015, 23, 52–63. [Google Scholar] [CrossRef]
  14. Xiang, X.; Yan, C.; Wang, C.; Yin, D. Coordination Control Method for Fixed-wing UAV Formation Through Deep Reinforcement Learning. Acta Aeronaut. Astronaut. Sin. 2021, 42, 524009. [Google Scholar] [CrossRef]
  15. Lombaerts, T.; Looye, G.; Chu, Q.; Mulder, J. Design and Simulation of Fault Tolerant Flight Control Based on a Physical Approach. Aerosp. Sci. Technol. 2012, 23, 151–171. [Google Scholar] [CrossRef]
  16. Liao, F.; Teo, R.; Wang, J.L.; Dong, X.; Lin, F.; Peng, K. Distributed Formation and Reconfiguration Control of VTOL UAVs. IEEE Trans. Control Syst. Technol. 2017, 25, 270–277. [Google Scholar] [CrossRef]
  17. Gu, Z.; Song, B.; Fan, Y.; Chen, X. Design and Verification of UAV Formation Controller based on Leader-Follower Method. In Proceedings of the 2022 7th International Conference on Automation, Control and Robotics Engineering (CACRE), Virtual, 15–16 July 2022; pp. 38–44. [Google Scholar] [CrossRef]
  18. Liu, C.; Wu, X.; Mao, B. Formation Tracking of Second-Order Multi-Agent Systems with Multiple Leaders Based on Sampled Data. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 331–335. [Google Scholar] [CrossRef]
  19. Bianchi, D.; Borri, A.; Cappuzzo, F.; Di Gennaro, S. Quadrotor Trajectory Control Based on Energy-Optimal Reference Generator. Drones 2024, 8, 29. [Google Scholar] [CrossRef]
  20. Liu, S.; Huang, F.; Yan, B.; Zhang, T.; Liu, R.; Liu, W. Optimal Design of Multimissile Formation Based on an Adaptive SA-PSO Algorithm. Aerospace 2022, 9, 21. [Google Scholar] [CrossRef]
  21. Kada, B.; Khalid, M.; Shaikh, M.S. Distributed cooperative control of autonomous multi-agent UAV systems using smooth control. J. Syst. Eng. Electron. 2020, 31, 1297–1307. [Google Scholar] [CrossRef]
  22. Kang, C.; Xu, J.; Bian, Y. Affine Formation Maneuver Control for Multi-Agent Based on Optimal Flight System. Appl. Sci. 2024, 14, 2292. [Google Scholar] [CrossRef]
  23. Brodecki, M.; Subbarao, K. Autonomous Formation Flight Control System Using In-Flight Sweet-Spot Estimation. J. Guid. Control Dyn. 2015, 38, 1083–1096. [Google Scholar] [CrossRef]
  24. Sun, G.; Zhou, R.; Xu, K.; Weng, Z.; Zhang, Y.; Dong, Z.; Wang, Y. Cooperative formation control of multiple aerial vehicles based on guidance route in a complex task environment. Chin. J. Aeronaut. 2020, 33, 701–720. [Google Scholar] [CrossRef]
  25. Zhang, Q.; Liu, H.H.T. Robust Nonlinear Close Formation Control of Multiple Fixed-Wing Aircraft. J. Guid. Control Dyn. 2021, 44, 572–586. [Google Scholar] [CrossRef]
  26. Dogan, A.; Venkataramanan, S. Nonlinear Control for Reconfiguration of Unmanned-Aerial-Vehicle Formation. J. Guid. Control Dyn. 2005, 28, 667–678. [Google Scholar] [CrossRef]
  27. Yu, Y.; Guo, J.; Ahn, C.K.; Xiang, Z. Neural Adaptive Distributed Formation Control of Nonlinear Multi-UAVs with Unmodeled Dynamics. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 9555–9561. [Google Scholar] [CrossRef]
  28. Lin, Z.; Yan, B.; Zhang, T.; Li, S.; Meng, Z.; Liu, S. Multi-Level Switching Control Scheme for Folding Wing VTOL UAV Based on Dynamic Allocation. Drones 2024, 8, 303. [Google Scholar] [CrossRef]
  29. Zhang, J.; Yan, J.; Zhang, P. Multi-UAV Formation Control Based on a Novel Back-Stepping Approach. IEEE Trans. Veh. Technol. 2020, 69, 2437–2448. [Google Scholar] [CrossRef]
  30. Hung, S.M.; Givigi, S.N. A Q-Learning Approach to Flocking with UAVs in a Stochastic Environment. IEEE Trans. Cybern. 2017, 47, 186–197. [Google Scholar] [CrossRef]
  31. Li, B.; Gan, Z.; Chen, D.; Sergey Aleksandrovich, D. UAV Maneuvering Target Tracking in Uncertain Environments Based on Deep Reinforcement Learning and Meta-Learning. Remote Sens. 2020, 12, 3789. [Google Scholar] [CrossRef]
  32. Li, R.; Zhang, L.; Han, L.; Wang, J. Multiple Vehicle Formation Control Based on Robust Adaptive Control Algorithm. IEEE Intell. Transp. Syst. Mag. 2017, 9, 41–51. [Google Scholar] [CrossRef]
  33. Xu, L.; Wang, T.; Cai, W.; Sun, C. UAV target following in complex occluded environments with adaptive multi-modal fusion. Appl. Intell. 2022, 53, 16998–17014. [Google Scholar] [CrossRef]
  34. Chen, H.; Duan, H. Multiple Unmanned Aerial Vehicle Autonomous Formation via Wolf Packs Mechanism. In Proceedings of the 2016 IEEE International Conference on Aircraft Utility Systems (AUS), Beijing, China, 10–12 October 2016; pp. 606–610. [Google Scholar] [CrossRef]
  35. Shi, G.; Hönig, W.; Shi, X.; Yue, Y.; Chung, S.J. Neural-Swarm2: Planning and Control of Heterogeneous Multirotor Swarms Using Learned Interactions. IEEE Trans. Robot. 2022, 38, 1063–1079. [Google Scholar] [CrossRef]
  36. Hu, H.; Wang, Q.l. Proximal Policy Optimization with an Integral Compensator for Quadrotor Control. Front. Inf. Technol. Electron. Eng. 2020, 21, 777–795. [Google Scholar] [CrossRef]
  37. Duan, H.; Luo, Q.; Shi, Y.; Ma, G. Hybrid Particle Swarm Optimization and Genetic Algorithm for Multi-UAV Formation Reconfiguration. IEEE Comput. Intell. Mag. 2013, 8, 16–27. [Google Scholar] [CrossRef]
  38. Wang, C.; Wang, J.; Shen, Y.; Zhang, X. Autonomous Navigation of UAVs in Large-Scale Complex Environments: A Deep Reinforcement Learning Approach. IEEE Trans. Veh. Technol. 2019, 68, 2124–2136. [Google Scholar] [CrossRef]
  39. Xu, G.; Jiang, W.; Wang, Z.; Wang, Y. Autonomous Obstacle Avoidance and Target Tracking of UAV Based on Deep Reinforcement Learning. J. Intell. Robot. Syst. 2022, 104, 60. [Google Scholar] [CrossRef]
  40. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; A Bradford Book: Cambridge, MA, USA, 2018. [Google Scholar]
  41. DJI. Parameters of DJI phantom4 pro. Available online: https://www.dji.com/cn/phantom-4-pro-v2/specs (accessed on 4 July 2024).
  42. Hasselt, H.v.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100. [Google Scholar]
  43. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.M.O.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  44. Szczepanski, R. Safe Artificial Potential Field - Novel Local Path Planning Algorithm Maintaining Safe Distance from Obstacles. IEEE Robot. Autom. Lett. 2023, 8, 4823–4830. [Google Scholar] [CrossRef]
  45. Ju, C.; Luo, Q.; Yan, X. Path Planning Using Artificial Potential Field Method And A-star Fusion Algorithm. In Proceedings of the 2020 Global Reliability and Prognostics and Health Management (PHM-Shanghai), Shanghai, China, 16–18 October 2020; pp. 1–7. [Google Scholar] [CrossRef]
Figure 1. The collision points of a quadrotor UAV.
Figure 2. Reward shaping with collision avoidance enhanced.
Figure 3. Diagram of Environment–Agent interaction.
Figure 4. Basic architecture of the A3C Agent.
Figure 5. Framework of dynamic pairing and deep reinforcement learning (P-DRL).
Figure 6. Diagram of the multi-UAVs formation control task simplified to multiple UAV pair formation control tasks.
Figure 7. Implementation of the method based on the P-DRL framework.
Figure 8. The Agent training process of A3C and DDQN.
Figure 9. The Agent training process of A3C and Actor–Critic.
Figure 10. 3D diagram of UAV formation arrangement at different timestamps.
Figure 11. 2D diagram of UAVs’ formation shaping control.
Figure 12. Four propeller rotor speeds of UAV1 in the process of formation arrangement.
Figure 13. 3D diagram of formation reconfiguration of UAVs at different timestamps.
Figure 14. 2D diagram of formation reconfiguration control for UAVs.
Figure 15. Four propeller rotor speeds of UAV1 in the process of formation reconfiguration.
Figure 16. Non-collision success rate of P-A3C changing with the control frequency.
Figure 17. Non-collision success rate of P-DDQN changing with the control frequency.
Figure 18. Non-collision success rate of P-AC changing with the control frequency.
Figure 19. Diagram of the emergency braking collision avoidance measure.
Figure 20. Non-collision success rate with emergency braking.
Figure 21. A classic implementation diagram for P-DRL in multi-UAV systems.
Figure 22. STM32 microcontroller unit for DJI Phantom 4 drones.
Table 1. Hyper-parameters of the UAV (Type: Phantom 4).
Parameter | Meaning | Value (Unit)
m | Mass of UAV | 1.375 (kg)
θ1 | Minimum pitch angle | −π/6
θ2 | Maximum pitch angle | +π/6
ϕ1 | Minimum roll angle | −π/6
ϕ2 | Maximum roll angle | +π/6
ωmax | Maximum rotating speed of rotors | 8100 (RPM)
CL | Lift coefficient | 0.484
b | Composite lift parameter | 2.232 × 10⁻⁴ (N/RPS²)
l | Length of the rotor stick | 0.350 (m)
Ix | x-axis moment of inertia | 0.152 (N·m²)
Iy | y-axis moment of inertia | 0.152 (N·m²)
Iz | z-axis moment of inertia | 0.0842 (N·m²)
g | Acceleration of gravity | 9.807 (m/s²)
ρ | Density of the atmosphere | 1.29 (kg/m³)
r | Length of the propeller | 0.0850 (m)
ka | Air resistance coefficient | 0.0427
kd | Inverse torque coefficient | 0.021
μx, μy, μz | Expectations of the position interference | 0
σx, σy, σz | Variances of the position interference | 0.5
Table 2. Hyper-parameter settings in the DRL model.
Parameter | Meaning | Value (Unit)
ε | Hyper-parameter in the policy π | 0.03
R1 | Hyper-parameter in the UAV and obstacle safety reward | −400
R2 | Hyper-parameter in the UAV and obstacle safety reward | −20
R3 | Hyper-parameter in the UAV and obstacle safety reward | −4
t1 | Prediction time for the future state based on the current state | 1 (s)
t2 | Prediction time for the future state based on the current state | 5 (s)
t3 | Prediction time for the future state based on the current state | 10 (s)
d | The safety distance for the key collision point of the UAV | 0.5 (m)
D | Hyper-parameter in the formation reward | 10
αW | Learning rate of the Actor | 0.01
αθ | Learning rate of the Critic | 0.01
γ | Discount rate of the reward | 0.95
Table 3. Synergetic formation parameters for 20 UAVs.
UAV ID | Original Destination | Shaping Destination | Flying Destination
UAV1 | [1, 3, 0] | [20.83, 3.33, 80] | [35, 2.67, 60]
UAV2 | [3, 3, 0] | [21.67, 3.33, 80] | [34, 1.33, 60]
UAV3 | [1, 1, 0] | [23.33, 1.67, 80] | [36, 1.33, 60]
UAV4 | [3, 1, 0] | [25, 0, 80] | [38, 0, 60]
UAV5 | [1, −3, 0] | [20, −5, 80] | [34, −4, 60]
UAV6 | [3, −3, 0] | [23.33, −0.83, 80] | [36, −4/3, 60]
UAV7 | [3, −1, 0] | [21.67, −3.33, 80] | [36, 0, 60]
UAV8 | [1, −1, 0] | [20.83, −3.33, 80] | [35, −2.67, 60]
UAV9 | [−3, 1, 0] | [16.67, 0.83, 80] | [32, 1.33, 60]
UAV10 | [−3, 3, 0] | [0.33, 1.67, 80] | [33, 2.67, 60]
UAV11 | [−1, 3, 0] | [20, 5, 80] | [34, 4, 60]
UAV12 | [−1, 1, 0] | [19.17, 3.33, 80] | [34, −1.33, 60]
UAV13 | [−3, −1, 0] | [15, 0, 80] | [30, 0, 60]
UAV14 | [−3, −3, 0] | [16.67, −0.83, 80] | [32, −1.33, 60]
UAV15 | [−1, −1, 0] | [18.33, −1.67, 80] | [32, 0, 60]
UAV16 | [−1, −3, 0] | [19.17, −3.33, 80] | [33, −2.67, 60]
UAV17 | [−4, −3, 0] | [15, −5, 80] | [38, 5.67, 60]
UAV18 | [−4, −1, 0] | [25, −5, 80] | [30, 5.67, 60]
UAV19 | [−4, 1, 0] | [15, 5, 80] | [38, −5.67, 60]
UAV20 | [−4, 3, 0] | [25, 5, 80] | [30, −5, 60]
Obstacle1 | [6, 0, 0, 50, 0.8] | [28, 2.5, 60, 85, 0.9] | —
Obstacle2 | [10, 2, 0, 60, 0.6] | [27, −2.5, 60, 85, 0.8] | —
Obstacle3 | [12, −2, 0, 60, 0.8] | — | —
PS: Data format, UAV: [x, y, z]; Obstacle: [x, y, bottom, top, radius].
Table 4. Performance comparison of various methods based on the P-DRL framework.
Method | Non-Collision Success Rate (%), with Collision Avoidance Measure | Non-Collision Success Rate (%), without Collision Avoidance Measure (Frequency: 5 Hz) | Average Speed (m/s) | Average Trajectory Output Time (s), Only Software | Average Trajectory Output Time (s), Software and Hardware
P-A3C | 91.7–96.2 | 95.32 | 9.24 | 7.104 | 7.171
P-DDQN | 90.0–96.2 | 94.51 | 9.11 | 6.992 | 7.070
P-AC | 90.1–96.3 | 94.49 | 10.11 | 7.104 | 7.170
APF | 100 (theoretically) | — | 7.76 | 1227.6 | —
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
