Navigation in a simplified Urban Flow through Deep Reinforcement Learning

Federica Tonti
FLOW, Engineering Mechanics,
KTH Royal Institute of Technology,
Osquars backe 18, 11428, Stockholm, Sweden
[email protected]
Jean Rabault
Independent Researcher,
0854 Oslo, Norway
[email protected]
Ricardo Vinuesa
FLOW, Engineering Mechanics,
KTH Royal Institute of Technology,
Osquars backe 18, 11428, Stockholm, Sweden
[email protected]
Abstract

The increasing number of unmanned aerial vehicles (UAVs) in urban environments requires a strategy to minimize their environmental impact, both in terms of energy efficiency and noise reduction. To address these concerns, novel strategies for developing prediction models and optimizing flight planning, for instance through deep reinforcement learning (DRL), are needed. Our goal is to develop DRL algorithms capable of enabling the autonomous navigation of UAVs in urban environments, taking into account the presence of buildings and other UAVs, and optimizing the trajectories in order to reduce both energy consumption and noise. This is achieved using fluid-flow simulations which represent the environment in which UAVs navigate and training the UAV as an agent interacting with an urban environment. In this work, we consider a domain represented by a two-dimensional flow field with obstacles, ideally representing buildings, extracted from a three-dimensional high-fidelity numerical simulation. The presented methodology, using PPO+LSTM cells, was validated by reproducing a simple but fundamental problem in navigation, namely Zermelo’s problem, which deals with a vessel navigating in a turbulent flow, travelling from a starting point to a target location while optimizing the trajectory. The current method shows a significant improvement with respect to both a simple PPO and a TD3 algorithm, with a success rate (SR) of the PPO+LSTM trained policy of 98.7% and a crash rate (CR) of 0.1%, outperforming both PPO (SR = 75.6%, CR = 18.6%) and TD3 (SR = 77.4%, CR = 14.5%). This is the first step towards DRL strategies which will guide UAVs in a three-dimensional flow field using real-time signals, making the navigation efficient in terms of flight time and avoiding damage to the vehicle.

1 Introduction

The presence of unmanned aerial vehicles (UAVs) is constantly increasing in urban environments due to the variety of tasks they can accomplish, from package delivery to surveillance and traffic monitoring, with the advantage of being able to access areas which would be difficult to reach by ground transportation or using bigger aerial vehicles, such as helicopters ([17, 15, 1]). On the other hand, the increasing number of UAVs brings new challenges which have not been faced until now because of their relatively recent use in cities, such as acoustic pollution or increasing risk of accidents ([61, 48, 58, 41, 45]). Due to these challenges, developing an efficient strategy to allow UAVs to navigate autonomously in complex environments is becoming crucial, not only to accomplish the aforementioned tasks but also to satisfy security constraints and ideally reduce their environmental impact, in particular acoustic pollution. Under these requirements, path planning, i.e. finding an optimal path between a starting point and a target point while avoiding any obstacles, becomes essential.
For UAV navigation problems, path planning, obstacle detection and avoidance methods can typically be divided into non-learning-based and learning-based methods. Non-learning-based methods need a good knowledge and understanding of the problem domain, which makes it difficult to generalize to unseen environments, but they provide good interpretability of the process since the decision-making task is based on well-defined algorithms ([34]). Dijkstra’s algorithm ([6], [18]), A*, which can be seen as an extension of Dijkstra’s algorithm ([60, 11]), and the rapidly exploring random tree (RRT) ([21, 65, 26]) are popular non-learning-based path-planning algorithms which have proven successful in environments which do not exhibit uncertainties, but perform poorly when the environment is dynamic. When dealing with obstacles, sensing and avoidance methods steer the vehicle away from the obstacles and navigate through the environment by means of path-planning algorithms ([4]). Another class of non-learning-based methods is simultaneous localization and mapping (SLAM), which deals not only with path planning, as the aforementioned works, but also with obstacle detection and avoidance by constructing a map of the environment ([5, 23, 35]). The drawback of SLAM-based methods is that their efficiency degrades in large-scale environments, since building a map of the whole environment is practically infeasible.
Urban environments can be described as large-scale, complex geometries, with buildings representing dense obstacles and flow fields characterized by turbulence. These features describe a problem with many sources of uncertainty, to which the characteristics of the UAVs also have to be added, such as sensor noise for navigation, calibration and control errors. This leads to the necessity of using learning-based methods, which can mostly overcome and handle these critical aspects of the task of navigation and obstacle avoidance. Supervised deep learning (DL)-based methods have received significant attention in recent years. DL can significantly enhance obstacle avoidance and path planning for UAVs by leveraging neural networks to process and interpret vast amounts of sensory data, such as images from cameras or signals from LiDAR ([46, 28, 36]). This approach allows UAVs to detect and navigate around obstacles more efficiently by recognizing patterns and predicting potential collisions in real time. Supervised DL methods are extremely efficient for environments exhibiting small variations, as they are based on labels and are sensitive to environmental changes. These features make them suitable for closed environments, but less reliable for urban environments, where the conditions typically change very rapidly. This brings the necessity to develop reinforcement-learning (RL) methods, which do not rely on labelled data, for understanding and automating decision-making processes, in which the agent learns based on the given goals ([42]).
The optimal policy is obtained by learning how to map states to actions, and the agent learns through a trial-and-error procedure to select the actions which will ideally yield the quantitatively highest reward and the qualitatively best result depending on the task it has to execute. With respect to other methodologies, RL makes an agent learn by letting it directly interact with the environment. Deep reinforcement learning (DRL) combines the advantages of both DL and RL. In particular, DRL leverages the neural-network (NN) architectures of DL to approximate the value functions or policies used in RL in order to solve problems where the action space, the observation space or both are high dimensional ([24], [33]). Moreover, DRL agents are able to generalize from raw input data coming from sensors or images to learn representations that capture the underlying structure of the environment, leading to more robust decision-making ([59], [64]). Observations are directly mapped into actions to be taken by the agent, and this results in an end-to-end decision-making strategy which drastically reduces uncertainties due to sensor noise or low-quality inputs. The model of the environment is important to correctly guide the agent towards the goal, but the algorithm does not strictly depend on it ([9], [27]). Indeed, since DRL aims to map the optimal relations between observations and actions, even if the environment changes the agent can still take suitable actions related to the received observations. Another advantage with respect to supervised DL is that DRL does not need labels because the agent directly interacts with the environment and generates the reward signal used for learning on the fly.
Machine learning (ML) has developed rapidly in recent years and has transformed the state-of-the-art capabilities for many tasks in engineering and computer science. In particular, it has been exploited to enhance fluid-flow simulations for a variety of applications, from turbulence modelling to the development of boundary conditions ([52], [53]). In this context, DRL has proven particularly suitable for non-linear and high-dimensional problems, such as turbulence and flow control ([50]).
Active flow control is an extremely interesting topic in the field of DRL applications in fluid mechanics. [30] applied DRL to a two-dimensional (2D) simulation of the flow around a cylinder to learn an active control strategy from varying mass flow rates of two jets on the sides of said cylinder, achieving a considerable drag reduction. This approach has been extended to a multi-environment configuration, which considerably sped up the execution by adapting the DRL algorithm for parallelization ([31]). The active-flow-control problem around a cylinder has been extended to three dimensions (3D) by Suarez et al. ([44], [43]), using a Multi-Agent Reinforcement Learning (MARL) approach coupled with a CFD solver, which led to a considerable drag reduction after applying the DRL control on three different configurations. MARL has also been successfully applied to a 2D Rayleigh–Bénard convection problem, making it possible to control this multiple-input multiple-output problem ([51]), and to drag reduction in fully developed turbulent channels ([12, 40]).
These successful applications of DRL in flow control have led to its application in a variety of engineering tasks. Among them, path planning, trajectory optimization and obstacle avoidance using DRL have received massive interest in recent years, dealing with different models. [13] applied a DRL algorithm based on Remember-and-Forget-Experience-Replay to steer a fixed-speed swimmer through an unsteady 2D flow field. A modified twin delayed DDPG (TD3) model was used to execute a navigation task in multi-obstacle environments with moving obstacles ([66]). Here, Zhang et al. aimed to predict the impact of the environment on the UAV, and the change of observations was added in the actor-critic (AC) network. Then, a two-stream AC network structure was proposed to extract features of the observations provided in the environment. Another approach to the same multi-obstacle problem uses a modified recurrent deterministic policy gradient (RDPG) algorithm, named Fast-RDPG, in which the parameters of the policy are updated step by step, without the necessity to wait until the end of an episode to update them ([54]). A variation of double deep Q-Network (DDQN), the Autonomous Navigation and Obstacle Avoidance (ANOA) algorithm, was developed by Wu et al. ([57]). Here, the network was divided into two parts, one outputting a state-value function, the other outputting an advantage-value function, which quantifies the additional reward of choosing one action rather than another. The action-advantage value is independent of state and environment noise, which are instead taken into account in the state-value function. Wang et al. developed an algorithm for military applications, in which the main focus was collision avoidance by including a Faster Region-based Convolutional Neural Network (R-CNN) model and a Data Deposit Mechanism to extract information about the obstacles from images, based on a Deep Q-Network (DQN) algorithm ([55]). Jin et al. proposed a multi-input attention prioritized deep deterministic policy gradient algorithm (MAPDDPG), which introduces an attention mechanism to help the UAV focus on environmental information relevant for the navigation task ([19]). Despite the differences and enhancements tailored to each of these algorithms, they are all off-policy learning algorithms and use target networks, meaning that they learn from experiences collected using a different policy than the one being optimized. The experiences are stored in a replay buffer to break the correlation between consecutive experiences and stabilize training. Target networks are used to stabilize the training and are periodically updated with the weights of the main networks to reduce the variance of the updates.
A more robust algorithm is Proximal Policy Optimization (PPO). This is a policy-gradient method, primarily designed to balance exploration and exploitation while keeping the training stable and efficient. PPO is often considered more robust than TD3, DDPG and DQN and their variants because it uses a clipped surrogate objective function, ensuring that the policy updates are not too drastic. By clipping the probability ratios, PPO limits the change in the policy at each update step, avoiding large deviations that could make the training unstable. PPO is an on-policy algorithm: the policy is updated based on data collected from the current policy, making the updates more stable because they are based on the most recent interactions with the environment. In addition, PPO adjusts the step size dynamically based on the performance of the policy and it features a reduced sensitivity to hyperparameter settings. The drawback is that it may require more samples overall, compared to an off-policy method.
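The clipping mechanism described above can be illustrated with a minimal sketch of the standard PPO clipped surrogate objective. The function name `clipped_surrogate` and the default clipping range `eps=0.2` are assumptions made for illustration only; they are not taken from the implementation used in this work.

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Mean clipped surrogate objective (a quantity to be maximized).

    ratios[i]     : new-policy / old-policy probability of the sampled action
    advantages[i] : advantage estimate for that action
    eps           : clipping range (0.2 is an illustrative choice)
    """
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        # Restrict the probability ratio to the interval [1 - eps, 1 + eps].
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)
```

With a positive advantage, a ratio above `1 + eps` brings no extra gain, so the update has no incentive to move the policy further in that direction. With a negative advantage, the unclipped term is kept whenever it is the more pessimistic one.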
When PPO is combined with recurrent neural networks (RNNs), it becomes highly efficient for many engineering problems, in particular in trajectory optimization. Federici et al. ([10]) used this architecture with long short-term memory (LSTM) cells to build a meta-RL algorithm to achieve autonomous waypoint guidance for a six-rotor UAV in Mars’ atmosphere, showing a substantial improvement with respect to the simple PPO algorithm. Hu et al. ([16]) used a PPO+RNN architecture to design an escape flight vehicle against multiple pursuit flight vehicles, demonstrating that the use of an RNN enhances the capability of PPO to train the agent to accomplish the given task.
In this work, we aim to develop a method for trajectory optimization and obstacle avoidance for a UAV in a 2D domain characterized by a complex flow field. The approach to this problem can be described as Zermelo’s problem ([63]) with obstacles added to the field. A formulation of the solution of Zermelo’s problem by means of RL was given by Biferale et al. ([3]), where a 2D velocity flow field derived from numerical simulations was given, comparing the efficiency of the RL approach with that of a classic optimal navigation method. This configuration was used as a validation step for the methods used here, with substantial modifications in the algorithm. The problem was approached with PPO and the complexity was increased by having random starting and target areas. PPO resulted in a more stable training, reaching a reward at convergence slightly higher than using an AC algorithm.
The problem stated here can be considered as a partially observable Markov decision process (POMDP). A POMDP is substantially different from a classical Markov decision process (MDP). The main feature of MDPs is that the future state depends only on the current state and action, and not on the sequence of events that preceded it. POMDPs extend MDPs to cases where the agent cannot directly observe the true state of the environment and must instead rely solely on observations that provide partial information about that state. In this context, we have only a partial representation of the environment and the state of the UAV, since it would be too complex and computationally infeasible to map all the points and variables of the domain, considering that we are using results from a high-fidelity simulation of a turbulent flow field. This makes the problem even more challenging, since the agent has to deal with uncertainties not only of the environment and the turbulent flow field, but also of the observability of the state itself.
To the best of the authors’ knowledge, there are no existing studies that specifically address UAV trajectory optimization using PPO and LSTM in a turbulent flow field generated from high-fidelity numerical simulations, particularly in scenarios involving obstacle avoidance. While LSTM-based models have been previously used in reduced-order modeling (ROM) of turbulent flows, capturing temporal dynamics from direct numerical simulation (DNS) data, these models have primarily focused on learning and predicting flow behavior without integrating them into UAV navigation tasks ([25]). Additionally, recent work on UAV obstacle avoidance and deep reinforcement learning typically utilizes partially observable environments with LSTMs, but these efforts have not yet extended to scenarios involving complex CFD-generated turbulent flow fields ([39]). Current works in UAV navigation that use PPO generally focus on simplified or structured environments, such as urban spaces involving static obstacles or moving entities, without incorporating the complexities of turbulent flow fields, or on multi-agent obstacle avoidance systems ([7], [8]). The combination of PPO with LSTM networks has been applied to multi-agent cooperative systems and collision avoidance, but these studies do not incorporate complex turbulent environments ([32]).
The method proposed in this work, which integrates snapshots of DNS-generated flow fields into a reinforcement learning framework, addresses this gap. It allows for the precomputed turbulence data to inform the UAV’s navigation decisions, offering a higher-fidelity representation of real-world fluid dynamics compared to the methods used in these prior studies. This approach enhances the realism and complexity of the environment without the computational burden of real-time CFD interaction, setting it apart from previous works. The present work is structured as follows: In Section 2, the difference between MDP and POMDP is explained; the problem addressed here is also described, including the environment and the architecture of the chosen algorithm. In Section 3, the results of the application of the algorithm to the problem are shown and compared with the results of PPO and TD3 algorithms. In Section 4, the main results are summarized and future work is proposed.

2 Problem statement

This work is focused on the navigation of a UAV in a 2D slice of a 3D turbulent flow field, with obstacles representing buildings which the UAV has to avoid. The main goals are obstacle avoidance (spatial problem) and trajectory optimization (temporal problem), so that the UAV can reach a target along the safest and quickest path. The task is designed to be addressed using DRL, in particular using a PPO+LSTM architecture.

2.1 MDPs and POMDPs

The main features of an MDP are a state space $S$, an initial state space $S_0$ with an initial state distribution $p(s_0)$, an action space $A$, a state transition probability distribution $p(s_{t+1}|s_t,a_t)$ which satisfies the Markov property, and a reward function $r(s_t,a_t): S \times A \rightarrow \mathbb{R}$. The reward function provides the feedback of the environment when executing the action $a_t$ in a state $s_t$. State and action spaces can be continuous or discrete. These give a first indication about how to build the architecture and the network associated with the DRL algorithm. RL is generally used to solve an MDP. An optimal policy is learned through a trial-and-error process, and the policy can be stochastic or deterministic. A stochastic policy $a \sim \pi(\cdot|s): S \rightarrow P(A)$ returns the probability density of available state and action pairs $(s,a)$, where $P(A)$ is the set of probability measures on $A$. Note that a deterministic policy $\mu(S): S \rightarrow A$ only projects states into actions.
A POMDP is characterized by the fact that the agent cannot directly observe the state $s_t$, but receives a set of observations $o_t$ with a distribution $p(o_t|s_t)$. The sequence of observations does not satisfy the Markov property, since $p(o_{t+1}|a_t,o_t,a_{t-1},o_{t-1},\dots,o_0) \neq p(o_{t+1}|o_t,a_t)$. Consequently, the agent has to infer the current state $s_t$ based on the history of observations and actions.
In this work, we exploit the ability of LSTMs to capture temporal dependencies in a POMDP, in combination with PPO, in an environment characterized by partial observability, in order to design an algorithm that can be extended to a 3D problem and to even more uncertain environments. Policy-gradient methods, PPO, LSTMs and their combination are described in detail in the Appendix.

2.2 Fluid-flow data

The flow field is represented by a 2D data set extracted from 3D high-fidelity simulations performed with the spectral-element code Nek5000, following the same numerical setup as in Atzori et al. ([2]) and Zampino et al. ([62]). The flow field is extracted at the centerplane $z=0$ (where $x$, $y$ and $z$ are the streamwise, vertical and spanwise coordinates, respectively). The domain coordinates are $x\in[-2.0,4.0]$, $y\in[0,3.0]$, where all distances are scaled by the obstacle height $h$. The obstacle coordinates are $x_{o1}\in[-0.25,0.25]$, $y_{o1}\in[0.0,1.0]$, $x_{o2}\in[1.25,1.75]$, $y_{o2}\in[0.0,0.5]$. The dataset used for training consists of 300 snapshots separated by 0.08750 time units, giving a total time span of 26.25 time units (time is normalized with $h$ and the freestream velocity $U_\infty$). Figure 1 shows the streamwise velocity $U$ of a 2D slice of an instantaneous flow field. The flow field shows both recirculation zones and areas where $U$ is much greater than in the surroundings, indicating the paths which could be more challenging for the safe navigation of the UAV.

Figure 1: A snapshot showing the streamwise velocity $U$ of a 2D slice of the instantaneous flow field with obstacles. Obstacles are highlighted as red rectangles.

2.3 UAV dynamics

UAV navigation is typically described with a set of non-linear differential equations in a three-dimensional space. Since in the present work the flow field is two dimensional, the UAV is modelled as a point mass and the set of non-linear equations is reduced to a 2D space. The variables describing the UAV motion here are: position vector ($\boldsymbol{x}$), velocity vector ($\boldsymbol{v}_{\text{global}}$), orientation (heading angle, $\theta$) and angular velocity ($\omega$). The dynamics and state of the UAV are significantly influenced by the underlying flow field. The equations of motion take into account the presence of the surrounding flow field with velocity components $u_{\text{flow}}$ and $v_{\text{flow}}$, and are given by:

$$
\begin{cases}
\dfrac{\mathrm{d}\boldsymbol{x}}{\mathrm{d}t} = \boldsymbol{v}_{\text{global}} + \boldsymbol{v}_{\text{flow}}, \\[4pt]
\dfrac{\mathrm{d}\boldsymbol{v}_{\text{global}}}{\mathrm{d}t} = a \begin{pmatrix} \cos(\theta) \\ \sin(\theta) \end{pmatrix}, \\[4pt]
\dfrac{\mathrm{d}\theta}{\mathrm{d}t} = \omega, \\[4pt]
\dfrac{\mathrm{d}\omega}{\mathrm{d}t} = \dot{\omega},
\end{cases}
\tag{1}
$$

where $\boldsymbol{v}_{\text{flow}} = \begin{pmatrix} u_{\text{flow}} \\ v_{\text{flow}} \end{pmatrix}$ represents the velocity of the flow field at the UAV’s position, while $a$ and $\dot{\omega}$ represent the linear and angular accelerations, respectively. The quantities $a$ and $\dot{\omega}$ are also the controls, i.e. the actions taken by the UAV when interacting with the two-dimensional flow field.
To solve the equations of motion, the fourth-order Runge–Kutta method is used, yielding the state of the UAV at the next time step.
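As an illustration of this integration step, the following is a minimal sketch of one classical fourth-order Runge–Kutta update for the system in Eq. (1). It rests on several assumptions that are not stated in the text: the state is stored as the tuple `(x, y, vx, vy, theta, omega)`, the controls are held constant over the step, the flow is supplied as a hypothetical callable `flow(x, y)` returning the local velocity components and is frozen in time during the step, and no velocity bound is enforced.

```python
import math


def derivatives(state, a, omega_dot, flow):
    """Right-hand side of Eq. (1) for state = (x, y, vx, vy, theta, omega)."""
    x, y, vx, vy, theta, omega = state
    u_flow, v_flow = flow(x, y)  # flow velocity at the UAV position
    return (
        vx + u_flow,          # dx/dt
        vy + v_flow,          # dy/dt
        a * math.cos(theta),  # dvx/dt
        a * math.sin(theta),  # dvy/dt
        omega,                # dtheta/dt
        omega_dot,            # domega/dt
    )


def rk4_step(state, a, omega_dot, flow, dt):
    """Advance the state by one time step dt with the classical RK4 scheme."""

    def advance(k, h):
        # State shifted by h along the slope k.
        return tuple(s + h * k_i for s, k_i in zip(state, k))

    k1 = derivatives(state, a, omega_dot, flow)
    k2 = derivatives(advance(k1, 0.5 * dt), a, omega_dot, flow)
    k3 = derivatives(advance(k2, 0.5 * dt), a, omega_dot, flow)
    k4 = derivatives(advance(k3, dt), a, omega_dot, flow)
    return tuple(
        s + dt / 6.0 * (d1 + 2.0 * d2 + 2.0 * d3 + d4)
        for s, d1, d2, d3, d4 in zip(state, k1, k2, k3, k4)
    )
```

In practice the callable `flow` would interpolate the current snapshot at the UAV position; here it is left abstract so that the integrator can be checked against simple cases such as a uniform flow.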

2.4 Environment

Both observation and action spaces are continuous. The observation space is designed as follows:

$$
o = \left\{\theta, \phi, d_0\right\} + \left\{\beta_i\right\},
\tag{2}
$$

where $\theta$ is the heading angle of the UAV, $\phi$ is the relative angle of the UAV with respect to the target, $d_0$ is the distance between the UAV and the target and $\left\{\beta_i\right\}$ are the angles associated with the sensors for obstacle detection, with $i\in[0,8]$ spanning the angles between $-\pi$ and $\pi$. Figure 2 shows graphically the definition of the observation space.

Figure 2: Schematic representation of the observation space described in Eq. (2).
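A minimal sketch of how the observation vector of Eq. (2) could be assembled is given below. Several details are assumptions made for illustration, since the text does not specify them: the relative angle to the target is taken as the bearing of the target minus the heading, wrapped to the interval from -pi to pi; the nine sensor angles are taken as evenly spaced over that same interval, end points included; and the names `build_observation` and `wrap_angle` are hypothetical.

```python
import math

N_SENSORS = 9  # sensor index i runs from 0 to 8


def wrap_angle(angle):
    """Map an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def build_observation(position, theta, target):
    """Return [theta, phi, d0, beta_0, ..., beta_8] as a flat list."""
    dx = target[0] - position[0]
    dy = target[1] - position[1]
    d0 = math.hypot(dx, dy)                       # distance to the target
    phi = wrap_angle(math.atan2(dy, dx) - theta)  # assumed definition of phi
    # Assumed layout: evenly spaced sensor directions from -pi to pi.
    betas = [
        -math.pi + 2.0 * math.pi * i / (N_SENSORS - 1)
        for i in range(N_SENSORS)
    ]
    return [theta, phi, d0] + betas
```

The resulting vector has 12 entries. Note that it contains neither the position nor the velocity of the UAV, which is consistent with the partially observable setting discussed in Section 2.1.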

The action space is designed to include the linear and angular accelerations of the UAV, where $a\in[-3.0,3.0]$ and $\dot{\omega}\in[-\pi/4,\pi/4]$, while setting constraints on the velocity, which is bounded in magnitude, $v_{\text{UAV}}\in[-1.4v_{\text{max}},1.4v_{\text{max}}]$, where $v_{\text{max}}$ is the magnitude of the maximum velocity of the flow field across all the frames used by the algorithm.
From the formulation of the observation space, it can be observed that obstacle detection is achieved by providing to the agent a set of directions, since in UAVs the relation with the surroundings is typically given by images from cameras, radar signals or range finders. In this work, we take as inputs for the observation space the angles which represent the orientation of rays sent by range finders mounted on the UAV. Obstacle detection is achieved by implementing a ray-tracing technique ([47]). First of all, the UAV has to check for free space in its perspective. The input is the position of the UAV and the output is a boolean variable which indicates whether the path is free from obstacles or not. Then, if the obstacle is present, the intersection with the traced rays is computed. First, the direction of the ray is calculated, based on the ray origin and final point, as well as the coordinates of the obstacles. Then, it is verified whether parallel directions to the obstacles are present. If the detected directions are not parallel to the obstacles, the intersection point between the ray and the obstacle is calculated and the distance to the intersection is returned. Figure 3 sketches the process.

Figure 3: Sketch of the obstacle detection procedure.
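The obstacle-detection procedure described above (ray direction, parallelism check, intersection distance) can be sketched for axis-aligned rectangular obstacles with the standard slab method. This is an illustrative stand-in under stated assumptions, not the implementation of [47] or of this work: the ray is defined by an origin and an angle rather than by two points, the function names are hypothetical, and the obstacle bounds are those listed in Section 2.2.

```python
import math

# (x_min, x_max, y_min, y_max) of the two obstacles, from Section 2.2.
OBSTACLES = [(-0.25, 0.25, 0.0, 1.0), (1.25, 1.75, 0.0, 0.5)]


def ray_box_distance(origin, angle, box, max_range=math.inf):
    """Distance along a ray to an axis-aligned box, or None if it misses."""
    direction = (math.cos(angle), math.sin(angle))  # unit vector
    bounds = ((box[0], box[1]), (box[2], box[3]))
    t_near, t_far = 0.0, max_range
    for o, d, (low, high) in zip(origin, direction, bounds):
        if abs(d) < 1e-12:
            # Ray parallel to this pair of edges: it can only hit the box
            # if the origin already lies between them.
            if o < low or o > high:
                return None
        else:
            t1, t2 = (low - o) / d, (high - o) / d
            if t1 > t2:
                t1, t2 = t2, t1
            t_near = max(t_near, t1)
            t_far = min(t_far, t2)
            if t_near > t_far:
                return None
    return t_near


def nearest_obstacle_distance(origin, angle, obstacles=OBSTACLES):
    """Smallest hit distance over all obstacles, or None if the ray is free."""
    hits = [ray_box_distance(origin, angle, box) for box in obstacles]
    hits = [h for h in hits if h is not None]
    return min(hits) if hits else None
```

Because the direction is a unit vector, the ray parameter returned by `ray_box_distance` is directly the distance to the intersection point. A return value of `None` plays the role of the boolean flag indicating that the path is free of obstacles in that direction.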

The starting and target areas are chosen randomly before the first obstacle and after the second obstacle, respectively. The agent is allowed to take a maximum of 80 steps in the environment for each episode. The starting frame of the simulation is random, meaning that the navigation task does not always start from the same simulation data, but changes randomly for every episode, so that the initial conditions of the flow field themselves exhibit uncertainties.
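The episode initialization described above can be sketched as follows. The domain bounds, obstacle positions, number of snapshots and step limit are taken from the text; everything else is an assumption for illustration, in particular the uniform sampling of a single point for each area (the size and shape of the areas are not specified here) and the name `reset_episode`.

```python
import random

X_MIN, X_MAX = -2.0, 4.0        # streamwise extent of the domain
Y_MIN, Y_MAX = 0.0, 3.0         # vertical extent of the domain
FIRST_OBSTACLE_X_MIN = -0.25    # upstream face of the first obstacle
SECOND_OBSTACLE_X_MAX = 1.75    # downstream face of the second obstacle
N_SNAPSHOTS = 300               # flow-field frames in the dataset
MAX_STEPS = 80                  # maximum number of agent steps per episode


def reset_episode(rng):
    """Draw a random start, target and initial flow frame for one episode."""
    # Assumed: the start lies anywhere upstream of the first obstacle.
    start = (rng.uniform(X_MIN, FIRST_OBSTACLE_X_MIN),
             rng.uniform(Y_MIN, Y_MAX))
    # Assumed: the target lies anywhere downstream of the second obstacle.
    target = (rng.uniform(SECOND_OBSTACLE_X_MAX, X_MAX),
              rng.uniform(Y_MIN, Y_MAX))
    # Random starting frame, so the initial flow conditions vary per episode.
    frame = rng.randrange(N_SNAPSHOTS)
    return {"start": start, "target": target,
            "frame": frame, "steps_left": MAX_STEPS}
```

Passing an explicit `random.Random` instance makes the sampling reproducible, which is convenient when comparing different algorithms on the same sequence of episodes.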
The UAV state is described as follows:

$$
s = \left\{\boldsymbol{x}, \theta, \boldsymbol{v}, \omega\right\},
\tag{3}
$$

where 𝒙𝒙\boldsymbol{x}bold_italic_x is the position vector, θ𝜃\thetaitalic_θ is the heading angle, 𝒗𝒗\boldsymbol{v}bold_italic_v is the velocity vector and ω𝜔\omegaitalic_ω is the angular velocity.
The state of the UAV is inferred from the observations, which are given as input to the NN and described in Equation (2). The environment has been implemented using the format of OpenAI Gymnasium ([49]).

2.5 Reward function

As mentioned in Section 1, DRL encourages learning by trial and error, and this process is driven by a reward which is given to the agent when it takes the right actions to complete the assigned task. The structure of the reward is crucial because it guides the agent towards more effective learning, so this component of the algorithm has to be carefully designed and tuned for the specific task. The reward structure of this work is inspired by other obstacle-avoidance and navigation problems ([66], [54], [19]).
The reward structure is designed to guide the UAV towards the target while minimizing collisions with obstacles, reducing energy consumption and preventing the UAV from leaving the designated operational bounds. The reward function is constructed from several components, each one addressing a different aspect of the UAV’s performance. The final reward is a sum of the following components:

  • Transition reward $r_{\textrm{trans}}$: it describes the progression of the UAV towards the target, and is defined as:

    $$r_{\textrm{trans}}=\sigma d_{\textrm{dist}}, \tag{4}$$

    where $\sigma\in\mathbb{R}$ is a scaling factor that weights the influence of this term in the total reward, and $d_{\textrm{dist}}$ is the reduction in the distance to the target achieved over the last step:

    $$d_{\textrm{dist}}=\left\|x_{t-1}-x_{\textrm{target}}\right\|-\left\|x_{t}-x_{\textrm{target}}\right\|, \tag{5}$$

    where $x_{t-1}$ is the position of the UAV at the previous time step, $x_{\textrm{target}}$ is the position of the target point and $x_{t}$ is the current position of the UAV;

  • Obstacle penalty $r_{\textrm{obs}}$: it accounts for how close the UAV is to the obstacles. The closer the UAV is to an obstacle, the larger the value of the penalty, described here as:

    $$r_{\textrm{obs}}=-\alpha e^{-\psi d_{\textrm{min}}}, \tag{6}$$

    where $\alpha,\psi\in\mathbb{R}$ are constants, and $d_{\textrm{min}}=\min\left\{d_{1},\ldots,d_{n}\right\}$, with $n\in(0,8)$, is the minimum of the distances between the UAV and the obstacles provided by the sensor measurements;

  • Free-space reward $r_{\textrm{free}}$: it is the reward assigned for navigating in a direction free of obstacles:

    $$r_{\textrm{free}}=\begin{cases}r_{\textrm{free}}&\text{if free space ahead is detected}\\ 0&\text{otherwise},\end{cases} \tag{7}$$

    where $r_{\textrm{free}}\in\mathbb{R}$ is a constant;

  • Best-direction free-space reward $r_{\textrm{best}}$: if the path ahead is not free of obstacles, the agent chooses an alternative direction:

    $$r_{\textrm{best}}=\begin{cases}\zeta\beta_{\textrm{best}}&\text{if no free space ahead}\\ 0&\text{otherwise},\end{cases} \tag{8}$$

    where $\zeta\in\mathbb{R}$ is a constant and $\beta_{\textrm{best}}$ is the angle which returns a direction representing the maximum distance from the obstacle in the set detected by the sensors;

  • Step penalty $r_{\textrm{step}}$: it is the penalty given for each step taken by the agent, which encourages the UAV to find the shortest path to the target, $r_{\textrm{step}}\in\mathbb{R}$;

  • Energy penalty $r_{\textrm{energy}}$: it accounts for the UAV’s propulsion energy consumption. It is calculated as:

    $$r_{\textrm{energy}}=-\kappa\left\|\mathbf{v}_{\textrm{prop}}\right\|, \tag{9}$$

    where $\mathbf{v}_{\textrm{prop}}$ is the propulsion velocity, defined as the difference between the UAV’s current velocity and the flow velocity. This penalty discourages excessive energy use by reducing the reward proportionally to the effort required to move relative to the surrounding flow.

The total reward is then given by:

$$R_{\textrm{tot}}=\sum_{i=0}^{m}\left(r_{\textrm{trans},i}+r_{\textrm{obs},i}+r_{\textrm{free},i}+r_{\textrm{best},i}+r_{\textrm{step},i}+r_{\textrm{energy},i}\right), \tag{10}$$

where $m\in\mathbb{N}$ is the number of steps taken by the agent in the environment. At the end of each episode of $m$ steps, additional terms are considered: if the agent reaches the target it gets an extra reward, whereas if it hits an obstacle or the bounds of the domain it receives an extra penalty.
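
The per-step terms of Equations (4) to (9) can be combined as in the following sketch. It is an illustration rather than the code used in this work: the function name `step_reward`, the 2D tuple representation of positions and velocities, and all default coefficient values are assumptions chosen only to make the example runnable; the terminal bonus and penalties are not included.

```python
import math

def step_reward(pos_prev, pos, target, distances, free_ahead, beta_best,
                v_uav, v_flow, sigma=1.0, alpha=1.0, psi=1.0, r_free=0.1,
                zeta=0.1, r_step=-0.01, kappa=0.1):
    """Sum of the per-step reward terms, Eqs. (4)-(9).
    All coefficient defaults are placeholders, not the values of the paper."""
    # Eq. (5): reduction of the distance to the target over the last step.
    d_dist = math.dist(pos_prev, target) - math.dist(pos, target)
    r_trans = sigma * d_dist                                   # Eq. (4)
    r_obs = -alpha * math.exp(-psi * min(distances))           # Eq. (6)
    r_fr = r_free if free_ahead else 0.0                       # Eq. (7)
    r_best = 0.0 if free_ahead else zeta * beta_best           # Eq. (8)
    # Propulsion velocity: UAV velocity relative to the local flow velocity.
    v_prop = (v_uav[0] - v_flow[0], v_uav[1] - v_flow[1])
    r_energy = -kappa * math.hypot(*v_prop)                    # Eq. (9)
    # r_step is a (negative) constant added at every step.
    return r_trans + r_obs + r_fr + r_best + r_step + r_energy
```

Summing the returned value over the steps of an episode gives the total reward of Equation (10). With these placeholder coefficients, a step towards the target scores higher than the same step away from it, and the score drops as the smallest measured obstacle distance shrinks.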

2.6 PPO+LSTM

The algorithm was developed entirely within this work and is written in PyTorch ([29]). Figure 4 shows a sketch of the network structure.

Figure 4: Architecture of the PPO+LSTM network. $n_{i}$ denotes the number of observations fed into the NN, equivalent to the dimensions of the observation space. The number of outputs $n_{\textrm{out}}$ corresponds to the mean and standard deviation of each of the actions ($\boldsymbol{\mu}_{\theta,k}$ and $\boldsymbol{\sigma}_{\theta,k}$), which are two, and the value function, estimating the expected return from the current state ($V_{\theta,k}$).

The first layer takes as input the vector of observations ($\boldsymbol{o}_{k}$) described in Section 2.4 and feeds the raw observations into the first hidden layer of the neural network. The first hidden layer is a fully connected layer with 64 neurons and a hyperbolic tangent (tanh) activation function, which allows the network to capture the non-linear relationships in the data. The second hidden layer is again a fully connected layer, with a reduced size of 32 neurons and a tanh activation function. The LSTM layer, with 16 neurons, follows the two dense layers and models the temporal dynamics of the problem. In addition to the extracted features, the LSTM layer takes as inputs the last reward ($R_{k-1}$) and the last actions ($\boldsymbol{a}_{k-1}$), thus benefiting from the memory of past experiences. Finally, the network outputs are divided into two distinct streams. The first stream produces the parameters for each action in the action space, specifically the mean ($\mu$) and standard deviation ($\sigma$), which are used for sampling actions in a stochastic policy. The second stream outputs a single scalar value representing the value function, which estimates the expected return from the current state. The output layer uses a linear activation function, appropriate for both continuous action distribution parameters and value estimation. Note that $\boldsymbol{c}$ and $\boldsymbol{h}$ denote the cell state and the hidden state of the LSTM layer, which are fed to the layer as input and returned as its output.
The hyperparameters of the algorithm were tuned by a trial-and-error procedure and are summarized in Table 1. The Adam optimizer ([20]) is chosen because of its robustness and adaptive property.

Table 1: Summary of hyperparameters.
| Hyperparameter | Value | Description |
| --- | --- | --- |
| Clip ratio | 0.2 | Controls the range of policy updates (clipping) |
| Stochastic gradient descent (SGD) steps | 30 | Number of SGD steps per epoch |
| Discount factor ($\gamma$) | 0.99 | Discount factor for future rewards |
| GAE lambda ($\lambda$) | 0.95 | Balances bias and variance in advantage estimation |
| Batch size | 256 | Number of samples per batch |
| Mini-batch size | 128 | Number of samples per mini-batch |
| Policy update epochs | 4 | Number of epochs to update the policy |
| Learning rate | $10^{-4}$ | Initial learning rate of Adam optimizer |
| Evaluation interval | 5 | Frequency of model evaluation during training |

The results obtained with a PPO+LSTM approach are compared with the results obtained applying a TD3 and a traditional PPO algorithm, also custom implemented, which are widely used for similar tasks in less complex environments. The structures of these networks are described in Figure 5.

(a) Schematic of PPO
(b) General structure of TD3
Figure 5: Schematic representations of the PPO and TD3 architectures. The sketch of the TD3 network in panel (b) is extracted from [22].

For Figure 5(a), the same nomenclature as for Figure 4 applies. Figure 5(b) describes the classical structure of a TD3 architecture. In our implementation, the actor network comprises 5 fully connected layers with 256, 128, 64, 32, and 2 nodes, respectively. Rectified-linear-unit (ReLU) activations are applied after each of the first four layers, with the final layer using a tanh activation scaled by the maximum action value. The critic network also has 5 fully connected layers with 256, 128, 64, 32, and 1 node. It takes both the state and action as input, concatenates them, and processes the combined input through ReLU activations after each layer to output a single Q value. The learning rates for the actor and critic networks are set to $10^{-4}$ and $10^{-3}$, respectively, for both PPO and TD3.

3 Results and discussion

In this section, the main results obtained with the present methodology are presented. Figure 6 shows the comparison of the rewards of PPO+LSTM, the TD3 and the PPO algorithms. The reward is represented with an exponential moving average (EMA) for visualization purposes, together with the standard deviation of the reward. EMA is a technique used to smooth data by applying exponentially decreasing weights to past observations. This approach prioritizes recent data points, making the EMA more responsive to recent changes while still considering the entire history of the data. This exponentially smoothed value effectively captures underlying trends by reducing the impact of short-term fluctuations, making it ideal for visualizing noisy data in applications like model training. The EMA is calculated as follows:

$$\textrm{EMA}_{t}=\alpha\nu_{t}+(1-\alpha)\textrm{EMA}_{t-1}, \tag{11}$$

where $\nu_{t}$ represents the value of the data at time $t$ and $\alpha\in[0,1]$ is the smoothing factor, with values close to 0 corresponding to maximum smoothing and 1 corresponding to raw data.
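
A minimal sketch of the recursion in Equation (11) is given below. The function name `ema` is hypothetical, and the initialization $\textrm{EMA}_{0}=\nu_{0}$ is an assumption, since the initial value is not specified in the text.

```python
def ema(values, alpha):
    """Exponential moving average following Eq. (11).
    Assumption: the recursion is seeded with the first data point."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    smoothed = []
    for v in values:
        prev = smoothed[-1] if smoothed else v   # seed with the first value
        smoothed.append(alpha * v + (1.0 - alpha) * prev)
    return smoothed
```

With `alpha = 1` the input is returned unchanged, whereas with `alpha = 0` the output stays at the first value, which matches the two limits described above.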

Figure 6: Comparison of the total rewards against episode number for PPO+LSTM (red), TD3 (blue) and PPO (green).

It can be observed that the PPO+LSTM combination outperforms both the PPO and TD3 approaches. Both the final value of the reward and the slope of the curve during training show that PPO+LSTM converges to a higher reward, and much faster, than the other two algorithms. This is because the LSTM cells retain memory of recent events in the current episode trajectory, which makes learning much more effective than in architectures without such memory capabilities. Another aspect to take into account is the partial observability of the environment. In particular, the observation space does not include any information about the flow field, which is taken into account only when integrating the equations of the dynamics. The only information about the environment is given by the relative orientation to the target and the calculated distances from the obstacles. The state of the UAV is described by its position and orientation with respect to the target, its velocity and calculated distances from the obstacles, but the UAV state is not fully visible in the observations passed as an input to the NN. These results suggest that the PPO and TD3 algorithms without the contribution of RNNs are not well suited to POMDPs and to problems where keeping track of the temporal sequence of the actions is crucial. A recent work by Wang et al. ([56]) developed a dynamic feature-based DRL (DF-DRL) method for flow control, which may provide a suitable alternative to the use of RNNs and could be adapted and tested in trajectory-optimization problems.
Several trajectories produced by the PPO+LSTM policy during evaluation are shown in Figure 7.

Figure 7: Various trajectories of the UAV for different starting and target points, panels (a) to (d), produced during the evaluation phase of the final policy. The surrounding flow field is displayed by the streamwise velocity.

It can be observed that the mass point which represents the UAV not only avoids the obstacles, but also manages to properly exploit the flow-field regions where the velocity is higher, and avoids getting trapped in regions with high recirculation. The success rate (SR) of the PPO+LSTM-trained policy reached 98.7%, and the crash rate (CR) was 0.1%. These results are significantly better than the ones obtained with PPO (SR = 75.6%, CR=18.6%) and TD3 (SR=77.4% and CR=14.5%), and highlight the importance of the memory cells, which are essential in navigation problems in complex environments. Moreover, the PPO+LSTM model requires fewer neurons to achieve much better performance than the other two networks, which are double the size.
The instantaneous mean ($\mu$) and standard deviation ($\sigma$) obtained from the last layer of the NN for each of the actions help us understand the agent’s behavior and its exploration-exploitation trade-off during the learning process. Figure 8 shows the evolution of $\mu$ and $\sigma$ for each action at each step taken by the agent in the environment for two different episodes, one at an early stage of the training and one at the converged-policy stage.

(a) Evolution of $\mu$
(b) Evolution of $\sigma$
Figure 8: Evolution of mean ($\mu$) (a) and standard deviation ($\sigma$) (b) of each action for each step in the environment during an episode in the early stages of training and at the converged policy.

The figure clearly shows that in the early stages of the training the agent is still in the exploration phase. This is visible from the blue lines in Figures 8(a) and 8(b): the values of $\mu$ change significantly from step to step, and the standard deviation $\sigma$ fluctuates around high values, which indicates that the policy has not converged. On the other hand, at the final stages of the training (green lines of Figures 8(a) and 8(b)), both $\mu$ and $\sigma$ stabilize around lower values. The $\mu$ values remain essentially constant, indicating that the agent effectively manages to save energy when navigating through the flow by applying only minimal corrections to the linear and angular accelerations. Similarly, $\sigma$ fluctuates around values close to zero, indicating that the policy is essentially converged and that the exploration phase occurring in the early stages of the training has ended; the agent is thus exploiting the most effective actions learned during the training process. This is consistent with the trajectory visualizations in Figure 7, where the UAV effectively exploits the flow field to navigate through the environment, following the vortical structures of the flow to reach the target. Overall, these results provide insights into the agent’s learned dynamics, highlighting the transition from a broad exploratory phase to a more focused exploitation of learned strategies. The increased concentration of actions in later episodes reflects the agent’s improved understanding of the environment and its ability to identify and repeat actions that contribute to achieving its objectives.
With respect to other works ([66, 55, 54, 57]) where RNNs are used in combination with off-policy methods, there is a substantial difference in how actions are related to the observation space. In this work, the actions taken by the agent are linear and angular accelerations, and the action space is continuous. The observation space, which is fed to the input layer of the NN, includes only the heading angle, giving extremely limited information about the state of the UAV. In the cited works, the action space has a simpler structure, designed with only one physical variable, in particular the heading angle or the acceleration, and sometimes considering a discrete action space. This implies that our algorithm not only has to find the right actions separately, but also has to combine them in the most efficient way to reach the target, while maintaining a reasonable distance from the obstacles and minimizing energy consumption. Moreover, here the UAV is significantly affected by the surrounding flow field, which is turbulent and contains several recirculation zones. In the previously cited works, a flow field from a numerical simulation is not present, thus requiring fewer variables to describe the behavior of the UAV in a quiescent environment. These aspects show that the present method has the potential to be extended to more complex problems, such as navigation in a complex 3D environment, and to be adapted to multiple tasks, for example goods delivery, with a small development effort.

4 Conclusions and outlook

Previous studies on UAV navigation tasks with RL focus on simplified or static environments, where the flow dynamics are not explicitly modeled through high-fidelity numerical simulations. The method presented in this work incorporates a flow database obtained from high-fidelity simulations of a turbulent flow field for a navigation task in the presence of two obstacles, using a PPO algorithm enhanced with an LSTM architecture. The proposed architecture places the LSTM cell not at the input layer of the network, but after a fully connected layer which performs a first feature extraction. The first layer focuses on pattern extraction, while the LSTM focuses on modeling temporal dependencies. Moreover, this makes the architecture suitable to learn from relatively sparse data and to deal with the stochasticity of the environment and observation space.
The algorithm was compared with a simple PPO and a TD3 architecture, showing a significant improvement in the final reward and in the success rate in reaching the target. The PPO+LSTM architecture reached the highest reward (40.15), whereas TD3 and PPO reached values of 18.99 and 21.15, respectively. Moreover, the agent learned not only to reach the target, but to reach it safely while exploiting the features of the flow field, saving energy and avoiding unnecessary linear and angular accelerations.
The next step will be the testing of the algorithm in a 3D environment, possibly extending the model of the UAV to a real-body problem, including the forces acting on it during navigation and introducing obstacles of different heights and distances. We will also focus on reducing the noise produced by the drone in a real application, which would otherwise increase the level of acoustic pollution in cities.

Appendix

Policy-gradient Methods

Policy-gradient methods parameterise and optimize policies by maximizing the expected returns explicitly, unlike value-based methods, which derive policies indirectly from a value function.
In policy-gradient methods, the objective to be maximized is the expected cumulative discounted reward:

$$J(\theta)=\mathbb{E}_{\pi_{\theta}}\Bigg[\sum_{t=0}^{T}\gamma^{t}r(s_{t},a_{t})\Bigg], \tag{12}$$

where $\gamma^{t}$ is the discount factor raised to the power of $t$, which discounts the reward $r(s_{t},a_{t})$ received at time $t$.
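
For a single sampled trajectory, the sum inside the expectation of Equation (12) can be evaluated as in the sketch below. The function name `discounted_return` is hypothetical; the expectation itself would be approximated by averaging this quantity over many trajectories.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative reward of one trajectory: sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For instance, three unit rewards with `gamma = 0.5` give 1 + 0.5 + 0.25 = 1.75.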
The policy-gradient theorem yields the gradient of the objective function calculated with respect to the policy characterized by the set of weights of the NN $\theta$ ([42]):

$$\nabla_{\theta}J_{\theta}=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(a|s)\,Q^{\pi_{\theta}}(s,a)\right], \tag{13}$$

with $Q^{\pi_{\theta}}(s,a)$ being the action-value function under the policy $\pi_{\theta}$.
In this work we use a stochastic policy in the training steps. The outputs of the policy are the mean value $\mu_{\theta}$ and the standard deviation $\sigma_{\theta}$ of the Gaussian distribution from which the actions are sampled.

Proximal-Policy Optimization

Proximal-policy optimization (PPO) is an advanced policy-gradient method which uses a first-order trust region to limit the policy update step size and prevent too large variations of the policy, while balancing exploration and exploitation ([38]). This is achieved by clipping the probability ratio between the new policy $\pi_{\theta}$ and the previous policy $\pi_{\theta_{\textrm{old}}}$ to the interval $[1-\epsilon,1+\epsilon]$, so that updates which push the ratio outside this range yield no additional gain in the objective:

$$J^{\textrm{clipped}}(\theta)=\mathbb{E}_{\tau\sim\pi_{\theta}}\Bigg[\frac{1}{K}\sum_{k=0}^{K-1}\min\Big(\tilde{r}_{k}(\theta)A^{\pi_{\theta}}(\boldsymbol{y}_{k},\boldsymbol{u}_{k}),\ \textrm{clip}\big(\tilde{r}_{k}(\theta),1-\epsilon,1+\epsilon\big)A^{\pi_{\theta}}(\boldsymbol{y}_{k},\boldsymbol{u}_{k})\Big)\Bigg], \tag{14}$$

with:

$$\tilde{r}_{k}(\theta)=\frac{\pi_{\theta}(\boldsymbol{u}_{k}|\boldsymbol{y}_{k})}{\pi_{\theta_{\textrm{old}}}(\boldsymbol{u}_{k}|\boldsymbol{y}_{k})}, \tag{15}$$

being the ratio between the new and the old policy and:

$$A^{\pi_{\theta}}(\boldsymbol{y}_{k},\boldsymbol{u}_{k})=Q^{\pi_{\theta}}(\boldsymbol{y}_{k},\boldsymbol{u}_{k})-V^{\pi_{\theta}}(\boldsymbol{y}_{k}), \tag{16}$$

the advantage function. $\epsilon$ is the clip range, which prevents dramatic changes in the policy.
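
The clipped objective of Equation (14) can be illustrated for a single trajectory as follows. This is a plain-Python sketch, not the PyTorch implementation of this work: the function name `clipped_surrogate` is hypothetical, and the probability ratios and advantages are assumed to have been computed beforehand.

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Sample mean of min(r*A, clip(r, 1-eps, 1+eps)*A), as in Eq. (14).
    `ratios` are the policy probability ratios of Eq. (15)."""
    terms = []
    for r, adv in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(r, 1.0 + eps))   # clip to [1-eps, 1+eps]
        terms.append(min(r * adv, clipped_r * adv))     # pessimistic bound
    return sum(terms) / len(terms)
```

With `eps = 0.2`, a ratio of 1.5 combined with a positive advantage of 1 contributes 1.2 instead of 1.5, so increasing the ratio further brings no benefit. For a negative advantage the same ratio is left unclipped, because the minimum keeps the more pessimistic of the two terms.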
The advantage function represents the average improvement in the cumulative reward by selecting a specific action. It is usually computed using an estimate of the value function Vϕsubscript𝑉italic-ϕV_{\phi}italic_V start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT by the generalized advantage estimator (GAE) ([37]):

Aϕ,k=k=kK1(γλ)kkδϕ,k,subscript𝐴italic-ϕ𝑘superscriptsubscriptsuperscript𝑘𝑘𝐾1superscript𝛾𝜆superscript𝑘𝑘subscript𝛿italic-ϕsuperscript𝑘A_{\phi,k}=\sum_{k^{\prime}=k}^{K-1}(\gamma\lambda)^{k^{\prime}-k}\delta_{\phi% ,k^{\prime}},italic_A start_POSTSUBSCRIPT italic_ϕ , italic_k end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_k start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K - 1 end_POSTSUPERSCRIPT ( italic_γ italic_λ ) start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT - italic_k end_POSTSUPERSCRIPT italic_δ start_POSTSUBSCRIPT italic_ϕ , italic_k start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT , (17)

where

$$\delta_{\phi,k}=R_{k}+\gamma V_{\phi,k+1}-V_{\phi,k} \tag{18}$$

is the temporal-difference residual of $V_{\phi,k}$ with discount factor $\gamma$, and $\lambda\in[0,1]$ is the GAE factor.
The mean-squared error (MSE) $L^{\textrm{value}}(\phi)$ between the value-function estimate $V_{\phi}$ and the cumulative reward calculated from step $k$ is the loss function used to update the critic network:
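Equations (17) and (18) can be evaluated in a single backward pass over a trajectory, since $A_{\phi,k}=\delta_{\phi,k}+\gamma\lambda A_{\phi,k+1}$. The following is a minimal sketch in plain Python, not the implementation used in this work; the function name `gae_advantages` and the default values of `gamma` and `lam` are illustrative assumptions.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory of length K.

    rewards: R_0 ... R_{K-1}
    values:  V_0 ... V_K (one extra entry to bootstrap the final state)
    gamma, lam: illustrative defaults, not values taken from the paper.
    """
    K = len(rewards)
    if len(values) != K + 1:
        raise ValueError("values must hold K + 1 entries")
    advantages = [0.0] * K
    running = 0.0
    # Backward recursion: A_k = delta_k + gamma * lam * A_{k+1}, with A_K = 0
    for k in reversed(range(K)):
        delta = rewards[k] + gamma * values[k + 1] - values[k]  # Eq. (18)
        running = delta + gamma * lam * running                 # Eq. (17)
        advantages[k] = running
    return advantages
```

Setting `lam=0` reduces each estimate to the one-step residual $\delta_{\phi,k}$, while `lam=1` accumulates the discounted sum of all remaining residuals in the trajectory.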

$$L^{\textrm{value}}(\theta,\phi)=\mathbb{E}_{\tau\sim\pi_{\theta}}\Bigg[\frac{1}{K}\sum_{k=0}^{K-1}\Bigg(V_{\phi,k}-\sum_{k^{\prime}=k}^{K}R_{k^{\prime}}\Bigg)^{2}\Bigg]. \tag{19}$$

If the actor and the critic share a single network, so that the value-function estimate is an additional output of the policy network, then $\phi=\theta$ and the weighted MSE is subtracted from the clipped objective function to obtain the final PPO objective:

$$J^{\textrm{tot}}(\theta)=J^{\textrm{clipped}}(\theta)-\lambda_{v}L^{\textrm{value}}(\theta), \tag{20}$$

with $\lambda_{v}$ being the value-function coefficient, which sets the weight of the last term in the total objective function.
The steps of the algorithm are the following: first, an initial policy $\pi_{\theta}$ and a value function $V_{\phi}$ are initialized. The agent then interacts with the environment to gather a set of trajectories $\left\{s_{t},a_{t},r_{t},s_{t+1}\right\}$, and the advantage function is estimated using GAE, which improves stability and reduces variance throughout the training. The clipped objective function is then used to update the policy parameters, and the value-function parameters are adjusted to minimize the MSE. This process is repeated until the policy converges, meaning that the reward no longer oscillates, so that its standard deviation stays within acceptable bounds and the variance decreases over time.
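As an illustration of how Eqs. (19) and (20) combine over a batch of samples, the sketch below evaluates the reward-to-go targets, the value loss and the total objective in plain Python. It assumes the standard clipped surrogate of [38] for $J^{\textrm{clipped}}$; all function names and the defaults `eps=0.2` and `lambda_v=0.5` are hypothetical choices made for this example, not settings reported in this work.

```python
def rewards_to_go(rewards):
    """Undiscounted sum of rewards from each step to the end of the trajectory."""
    targets = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        targets.append(running)
    targets.reverse()
    return targets


def value_loss(values, returns):
    """Mean-squared error between value estimates and reward-to-go targets."""
    return sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)


def clipped_surrogate(ratios, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the samples.

    ratios are the probability ratios pi_theta / pi_theta_old (assumed given).
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        r_clipped = max(1.0 - eps, min(r, 1.0 + eps))
        total += min(r * a, r_clipped * a)
    return total / len(ratios)


def total_objective(ratios, advantages, values, returns, eps=0.2, lambda_v=0.5):
    """Clipped objective minus the weighted value loss, as in Eq. (20)."""
    return clipped_surrogate(ratios, advantages, eps) - lambda_v * value_loss(values, returns)
```

In a gradient-based implementation the same expression would be built from differentiable tensors and maximized with respect to $\theta$; the scalar version here only shows how the terms are assembled.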
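The convergence criterion described above can be made concrete with a simple check on the recent episode rewards. This is only a sketch under stated assumptions: the function name `has_converged`, the window length and the tolerance `std_tol` are hypothetical and would have to be chosen for the problem at hand.

```python
from statistics import pstdev


def has_converged(episode_rewards, window=100, std_tol=1.0):
    """Return True when the last `window` episode rewards fluctuate little.

    window and std_tol are illustrative values, not taken from the paper.
    """
    if len(episode_rewards) < window:
        return False  # not enough history to judge
    recent = episode_rewards[-window:]
    # Population standard deviation of the most recent rewards
    return pstdev(recent) <= std_tol
```

A training loop could call this check after each policy update and stop once it returns `True` for several consecutive updates.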

Integration of LSTM networks with PPO

The effectiveness of PPO can be enhanced by implementing LSTM cells, especially in environments where sequential data and temporal dependencies are crucial, as is the case in trajectory-optimization problems ([14]). LSTMs are recurrent neural networks (RNNs) that handle sequences and capture long-term dependencies, making them suitable for problems which require memory of previous states.
LSTM cells address the drawbacks of traditional RNNs by retaining information over long periods. An LSTM cell has three main components: a forget gate, an input gate and an output gate. The forget gate decides which information has to be discarded from the cell state, and it is defined as:

$$f_{t}=\sigma(W_{f}\cdot[h_{t-1},x_{t}]+b_{f}), \tag{21}$$

where $\sigma$ is the sigmoid function, $W_{f}$ are the weights, $h_{t-1}$ is the previous hidden state, $x_{t}$ is the current input and $b_{f}$ is the bias.
The input gate provides the new information which has to be added to the cell state. It is defined as:

$$i_{t}=\sigma(W_{i}\cdot[h_{t-1},x_{t}]+b_{i}), \tag{22}$$

and the candidate cell state $\tilde{C}_{t}$ can be expressed as:

$$\tilde{C}_{t}=\textrm{tanh}(W_{C}\cdot[h_{t-1},x_{t}]+b_{C}). \tag{23}$$

The new cell state is then updated as:

$$C_{t}=f_{t}\odot C_{t-1}+i_{t}\odot\tilde{C}_{t}, \tag{24}$$

where $\odot$ indicates element-wise multiplication.
The output gate controls the output based on the cell state:

$$o_{t}=\sigma(W_{o}\cdot[h_{t-1},x_{t}]+b_{o}), \tag{25}$$
$$h_{t}=o_{t}\odot\textrm{tanh}(C_{t}), \tag{26}$$

where $o_{t}$ denotes the output of the output gate and $h_{t}$ the new hidden state.
The combination of these components enables LSTMs to effectively capture temporal dependencies and mitigate the vanishing-gradient problem common in standard RNNs.
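To make the data flow through Eqs. (21) to (26) explicit, the following sketch performs a single LSTM update in plain Python. It is meant for illustration only; a practical implementation would rely on a deep-learning library. The helper names `sigmoid`, `affine` and `lstm_step`, as well as the layout of the `params` dictionary, are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W @ v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update following Eqs. (21)-(26).

    x_t, h_prev, c_prev are lists of floats. params maps the keys
    'f', 'i', 'C', 'o' to (W, b) pairs (hypothetical layout); each W has
    len(h_prev) rows and len(h_prev) + len(x_t) columns.
    """
    W_f, b_f = params["f"]
    W_i, b_i = params["i"]
    W_C, b_C = params["C"]
    W_o, b_o = params["o"]
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(W_f, z, b_f)]        # forget gate, Eq. (21)
    i_t = [sigmoid(a) for a in affine(W_i, z, b_i)]        # input gate, Eq. (22)
    c_tilde = [math.tanh(a) for a in affine(W_C, z, b_C)]  # candidate, Eq. (23)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]  # Eq. (24)
    o_t = [sigmoid(a) for a in affine(W_o, z, b_o)]        # output gate, Eq. (25)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]     # hidden state, Eq. (26)
    return h_t, c_t
```

Calling `lstm_step` repeatedly, feeding the returned `h_t` and `c_t` back in as `h_prev` and `c_prev`, processes a whole sequence; the additive update of the cell state in Eq. (24) is what lets information persist across many steps.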
When applying LSTMs to PPO, these cells have to be integrated both in the policy and value networks. This allows the agent to learn also from past experiences, which is crucial when dealing with POMDPs. In fact, many real-world scenarios are not represented by a fully observable state of the environment. LSTMs enhance the agents’ behavior by keeping information from past observations and the agents can then infer missing information and take actions based on more informed decisions.
Combining LSTM networks with PPO gives several advantages:

  • The agent can learn from temporal sequences, leading to a better-informed decision-making process;

  • LSTMs make the whole process more suitable for complex environments with partial observability, leveraging the long-term dependencies they are able to learn;

  • The agent can store and reuse past experiences, which is a key feature for problems characterized by temporal dependencies.

On the other hand, LSTM networks exhibit some drawbacks. In particular, the computational cost increases, careful hyperparameter tuning is required, and stabilizing the training can be more challenging due to the interaction between PPO and the LSTM networks.

Acknowledgments

Federica Tonti and Ricardo Vinuesa acknowledge funding from the European Union’s HORIZON Research and Innovation Program, project REFMAP, under Grant Agreement number 101096698. The computations were carried out at the supercomputer Dardel at PDC, KTH, and the computer time was provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS). The authors also want to thank Luca Biferale and Michele Buzzicotti for providing the data for the reproduction and validation of Zermelo’s problem, which laid the foundations for the present work.

Data Availability Statement

All the codes and data used in this work will be made available open access at https://github.com/KTH-FlowAI when the article is published.

References

  • [1] João Antunes. UAVs Across Europe: Commercial Drone Applications in the Netherlands. https://www.commercialuavnews.com/infrastructure/uavs-across-europe-commercial-drone-applications-in-the-netherlands, 2024.
  • [2] Marco Atzori, Pablo Torres, Alvaro Vidal, Soledad Le Clainche, Sergio Hoyas, and Ricardo Vinuesa. High-resolution simulations of a turbulent boundary layer impacting two obstacles in tandem. Phys. Rev. Fluids, 8:063801, Jun 2023.
  • [3] Luca Biferale, Fabio Bonaccorso, Michele Buzzicotti, Patricio Clark Di Leoni, and Kristian Gustavsson. Zermelo’s problem: Optimal point-to-point navigation in 2D turbulent flows using Reinforcement Learning. Chaos, 29, 2019.
  • [4] Navaneetha Krishna Chandran, Mohammed Thariq Hameed Sultan, Andrzej Łukaszewicz, Farah Syazwani Shahar, Andriy Holovatyy, and Wojciech Giernacki. Review on Type of Sensors and Detection Method of Anti-Collision System of Unmanned Aerial Vehicle. Sensors, 23(15), 2023.
  • [5] Fethi Demim, Abdelkrim NEMRA, Kahina Louadj, and Mustapha Hamerlain. Simultaneous localization, mapping, and path planning for unmanned vehicle using optimal control. Advances in Mechanical Engineering, 10:168781401773665, 01 2018.
  • [6] Edsger Wybe Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, 1959.
  • [7] Hu Duoxiu, Dong Wenhan, and Xie Wujie. Proximal policy optimization for UAV autonomous guidance, tracking and obstacle avoidance. Journal of Beijing University of Aeronautics and Astronautics, 49(1):195–205, 2023.
  • [8] Hu Duoxiu, Dong Wenhan, Xie Wujie, and He Lei. Proximal Policy Optimization for Multi-rotor UAV Autonomous Guidance, Tracking and Obstacle Avoidance. International Journal of Aeronautical and Space Sciences, 23(2):339–353, 2022.
  • [9] Mónika Farsang and Luca Szegletes. Importance of Environment Design in Reinforcement Learning: A Study of a Robotic Environment. https://arxiv.org/abs/2102.10447, 2021.
  • [10] Lorenzo Federici, Roberto Furfaro, Alessandro Zavoli, and Guido De Matteis. Robust Waypoint Guidance of a Hexacopter on Mars using Meta-Reinforcement Learning. AIAA SCITECH 2023 Forum, 2023.
  • [11] Daniel Foead, Alifio Ghifari, Marchel Budi Kusuma, Novita Hanafiah, and Eric Gunawan. A Systematic Literature Review of A* Pathfinding. Procedia Computer Science, 179:507–514, 2021. 5th International Conference on Computer Science and Computational Intelligence 2020.
  • [12] Luca Guastoni, J. Rabault, Philipp Schlatter, Hossein Azizpour, and Ricardo Vinuesa. Deep reinforcement learning for turbulent drag reduction in channel flows. The European physical journal. E, Soft matter, 46:27, 04 2023.
  • [13] Peter Gunnarson, Ioannis Mandralis, Guido Novati, Petros Koumoutsakos, and John O. Dabiri. Learning efficient navigation in vortical flow fields. Nature Communications, 12(1), 2021.
  • [14] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-term Memory. Neural computation, 9:1735–80, 12 1997.
  • [15] Zehua Hong and Qiuping Lyu. Drones soar into wider application in China. https://english.news.cn/20240718/380bfb7077d24df2a4edae4bb2040ad2/c.html, 2024.
  • [16] Xiao Hu, Hongbo Wang, Min Gong, and Tianshu Wang. Guidance Design for Escape Flight Vehicle against Multiple Pursuit Flight Vehicles Using the RNN-Based Proximal Policy Optimization Algorithm. Aerospace, 11(5), 2024.
  • [17] Wonjune Hwang. Development of Peoples’ Republic of China’s Unmanned Aerial Vehicles (UAVs) and Its Impact on the East China Sea. International Journal of China Studies, 11:121–144, 06 2020.
  • [18] Luay Jabbar, Eyad Abass, and Sundus Hasan. A Modification of Shortest Path Algorithm According to Adjustable Weights Based on Dijkstra Algorithm. Engineering and Technology Journal, 41(2):359–374, 2023.
  • [19] Yanliang Jin, Qianhong Liu, Liquan Shen, and Leiji Zhu. Deep Deterministic Policy Gradient Algorithm Based on Convolutional Block Attention for Autonomous Driving. Symmetry, 13(6), 2021.
  • [20] Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, 12 2014.
  • [21] Steven M. LaValle. Rapidly-exploring random trees: a new tool for path planning. The annual research report, 1998.
  • [22] Zhenyu Liang, Xingru Qu, Zhao Zhang, and Cong Chen. Three-Dimensional Path-Following Control of an Autonomous Underwater Vehicle Based on Deep Reinforcement Learning. Polish Maritime Research, 29:36–44, 12 2022.
  • [23] Zixiang Liu. Implementation of SLAM and path planning for mobile robots under ROS framework. In 2021 6th International Conference on Intelligent Computing and Signal Processing (ICSP), pages 1096–1100, 2021.
  • [24] Yutaka Matsuo, Yann LeCun, Maneesh Sahani, Doina Precup, David Silver, Masashi Sugiyama, Eiji Uchibe, and Jun Morimoto. Deep learning, reinforcement learning, and world models. Neural Networks, 152:267–275, 2022.
  • [25] Arvind T. Mohan and Datta V. Gaitonde. A Deep Learning based Approach to Reduced Order Modeling for Turbulent Flow Control using LSTM Neural Networks. https://arxiv.org/abs/1804.09269, 2018.
  • [26] Iram Noreen, Amna Khan, and Zulfiqar Habib. Optimal Path Planning using RRT* based Approaches: A Survey and Future Directions. International Journal of Advanced Computer Science and Applications, 7, 11 2016.
  • [27] Jim Martin Catacora Ocana, Roberto Capobianco, and Daniele Nardi. An Overview of Environmental Features that Impact Deep Reinforcement Learning in Sparse-Reward Domains. Journal of Artificial Intelligence Research, 76, April 2023.
  • [28] Lucas Prado Osco, José Marcato Junior, Ana Paula Marques Ramos, Lúcio André de Castro Jorge, Sarah Narges Fatholahi, Jonathan de Andrade Silva, Edson Takashi Matsubara, Hemerson Pistori, Wesley Nunes Gonçalves, and Jonathan Li. A review on deep learning in UAV remote sensing. International Journal of Applied Earth Observation and Geoinformation, 102:102456, 2021.
  • [29] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: an imperative style, high-performance deep learning library. Curran Associates Inc., Red Hook, NY, USA, 2019.
  • [30] Jean Rabault, Miroslav Kuchta, Atle Jensen, Ulysse Réglade, and Nicolas Cerardi. Artificial neural networks trained through deep reinforcement learning discover control strategies for active flow control. Journal of Fluid Mechanics, 865:281–302, 2019.
  • [31] Jean Rabault and Alexander Kuhnle. Accelerating deep reinforcement learning strategies of flow control through a multi-environment approach. Physics of Fluids, 31(9):094105, 09 2019.
  • [32] D. M. K. K. Venkateswara Rao, Hamed Habibi, Jose Luis Sanchez-Lopez, and Holger Voos. An Integrated Real-time UAV Trajectory Optimization with Potential Field Approach for Dynamic Collision Avoidance. https://arxiv.org/abs/2303.02043, 2023.
  • [33] P. Venkateswara Rao, Vybhavi B., Manjeet Manjeet, Arhath Kumar, Manisha Mittal, Amit Verma, and Dharmesh Dhabliya. Deep Reinforcement Learning: Bridging the Gap with Neural Networks. International Journal of Intelligent Systems and Applications in Engineering, 12(15s), Feb. 2024.
  • [34] Mohamed Reda, Ahmed Onsy, Amira Y. Haikal, and Ali Ghanbari. Path planning algorithms in the autonomous driving system: A comprehensive review. Robotics and Autonomous Systems, 174, 2024.
  • [35] Jianxin Ren, Tao Wu, Xiaohua Zhou, Congcong Yang, Jiahui Sun, Mingshuo Li, Huayang Jiang, and Anfeng Zhang. SLAM, Path Planning Algorithm and Application Research of an Indoor Substation Wheeled Robot Navigation System. Electronics, 11(12), 2022.
  • [36] Jeremy Roghair, Kyungtae Ko, Amir Ehsan Niaraki Asli, and Ali Jannesari. A Vision Based Deep Reinforcement Learning Algorithm for UAV Obstacle Avoidance. https://arxiv.org/abs/2103.06403, 2021.
  • [37] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-Dimensional Continuous Control Using Generalized Advantage Estimation. https://arxiv.org/abs/1506.02438, 2018.
  • [38] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. https://arxiv.org/abs/1707.06347, 2017.
  • [39] Abhik Singla, Sindhu Padakandla, and Shalabh Bhatnagar. Memory-based Deep Reinforcement Learning for Obstacle Avoidance in UAV with Limited Environment Knowledge. https://arxiv.org/abs/1811.03307, 2018.
  • [40] Takahiro Sonoda, Zhuchen Liu, Toshitaka Itoh, and Yosuke Hasegawa. Reinforcement learning of control strategies for reducing skin friction drag in a fully developed turbulent channel flow. Journal of Fluid Mechanics, 960, 2023.
  • [41] Xuting Sun, Yue Hu, Yichen Qin, and Yuan Zhang. Risk assessment of unmanned aerial vehicle accidents based on data-driven Bayesian networks. Reliability Engineering & System Safety, 248:110185, 2024.
  • [42] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
  • [43] P. Suárez, F. Álcantara Ávila, A. Miró, J. Rabault, B. Font, O. Lehmkuhl, and R. Vinuesa. Active flow control for drag reduction through multi-agent reinforcement learning on a turbulent cylinder at $Re_{D}=3900$. https://arxiv.org/abs/2405.17655, 2024.
  • [44] Pol Suárez, Francisco Alcántara-Ávila, Arnau Miró, Jean Rabault, Bernat Font, Oriol Lehmkuhl, and R. Vinuesa. Active flow control for three-dimensional cylinders through deep reinforcement learning. https://arxiv.org/abs/2309.02462, 2023.
  • [45] Z. Svatý, L. Nouzovský, T. Mičunek, and M. Frydrýn. Evaluation of the drone-human collision consequences. Heliyon, 8(11):e11677, 2022.
  • [46] Guangyi Tang, Jianjun Ni, Yonghao Zhao, Yang Gu, and Weidong Cao. A Survey of Object Detection for UAVs Based on Deep Learning. Remote Sensing, 16(1), 2024.
  • [47] Federica Tonti, Jaka Perovšek, Jose’ Zapata Usandivaras, Sebastian Karl, Justin S. Hardi, Youhi Morii, and Michael Oschwald. Obtaining pseudo-OH* radiation images from cfd solutions of transcritical flames. Combustion and Flame, 233:111614, 2021.
  • [48] Antonio J. Torija, Zhengguang Li, and Rod H. Self. Effects of a hovering unmanned aerial vehicle on urban soundscapes perception. Transportation Research Part D: Transport and Environment, 78:102195, 2020.
  • [49] Mark Towers, Jordan K. Terry, Ariel Kwiatkowski, John U. Balis, Gianluca de Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Arjun KG, Markus Krimmel, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Andrew Tan Jin Shen, and Omar G. Younis. Gymnasium. https://zenodo.org/record/8127025, March 2023.
  • [50] C. Vignon, J. Rabault, and R. Vinuesa. Recent advances in applying deep reinforcement learning for flow control: Perspectives and future directions. Physics of Fluids, 35(3), March 2023.
  • [51] Colin Vignon, Jean Rabault, Joel Vasanth, Francisco Alcántara-Ávila, Mikael Mortensen, and Ricardo Vinuesa. Effective control of two-dimensional Rayleigh–Bénard convection: Invariant multi-agent reinforcement learning is all you need. Physics of Fluids, 35(6):065146, 06 2023.
  • [52] Ricardo Vinuesa and Steve L. Brunton. Enhancing computational fluid dynamics with machine learning. Nature Computational Science, 2(6):358–366, June 2022.
  • [53] Ricardo Vinuesa and Steven Brunton. Emerging Trends in Machine Learning for Computational Fluid Dynamics. Computing in Science & Engineering, 24:33–41, 09 2022.
  • [54] Chao Wang, Jian Wang, Yuan Shen, and Xudong Zhang. Autonomous Navigation of UAVs in Large-Scale Complex Environments: A Deep Reinforcement Learning Approach. IEEE Transactions on Vehicular Technology, 68(3):2124–2136, 2019.
  • [55] Fei Wang, Xiaoping ZHU, Zhou ZHOU, and Yang TANG. Deep-reinforcement-learning-based UAV autonomous navigation and collision avoidance in unknown environments. Chinese Journal of Aeronautics, 37(3):237–257, 2024.
  • [56] Qiulei Wang, Lei Yan, Gang Hu, Wenli Chen, Jean Rabault, and Bernd R. Noack. Dynamic feature-based deep reinforcement learning for flow control of circular cylinder with sparse surface pressure sensing. Journal of Fluid Mechanics, 988, May 2024.
  • [57] Xing Wu, Haolei Chen, Changgu Chen, Mingyu Zhong, Shaorong Xie, Yike Guo, and Hamido Fujita. The autonomous navigation and obstacle avoidance for USVs with ANOA deep reinforcement learning method. Knowledge-Based Systems, 196:105201, 03 2020.
  • [58] Xray. Innovations and Advancements: Patents in Noise Reduction in Drones. https://xray.greyb.com/drones/noise-reduction-in-drones, 2024.
  • [59] Fangkai Yang, Daoming Lyu, Bo Liu, and Steven Gustafson. PEORL: Integrating Symbolic Planning and Hierarchical Reinforcement Learning for Robust Decision-Making. https://arxiv.org/pdf/1804.07779, 2018.
  • [60] Zhiyu You, Keyu Shen, Tao Huang, Yongxin Liu, and Xiaofeng Zhang. Application of A* Algorithm Based on Extended Neighborhood Priority Search in Multi-Scenario Maps. Electronics, 12(4), 2023.
  • [61] Qian Yujie, Wei Yuliang, Kon Deyi, and Xu He. Experimental investigation on motor noise reduction of Unmanned Aerial Vehicles. Applied Acoustics, 176:107873, 2021.
  • [62] Gerardo Zampino, Marco Atzori, Elias Zea, Evelyn Otero, and Ricardo Vinuesa. Aspect-ratio effect on the wake of a wall-mounted square cylinder immersed in a turbulent boundary layer. https://arxiv.org/abs/2401.11793, 2024.
  • [63] Ernst Friedrich Ferdinand Zermelo. Über das Navigationsproblem bei ruhender oder veränderlicher Windverteilung. ZAMM - Zeitschrift für Angewandte Mathematik und Mechanik, 11:114–124, 1931.
  • [64] Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations. https://arxiv.org/abs/2003.08938, 2021.
  • [65] QiongWei Zhang, LunXing Li, LiaoMo Zheng, and BeiBei Li. An Improved Path Planning Algorithm Based on RRT. In 2022 11th International Conference of Information and Communication Technology (ICTech), pages 149–152, 2022.
  • [66] Sitong Zhang, Yibing Li, and Qianhui Dong. Autonomous navigation of UAV in multi-obstacle environments based on a Deep Reinforcement Learning approach. Applied Soft Computing, 115, 2022.