-
Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms
Authors:
Vaneet Aggarwal,
Washim Uddin Mondal,
Qinbo Bai
Abstract:
Reinforcement Learning (RL) serves as a versatile framework for sequential decision-making, finding applications across diverse domains such as robotics, autonomous driving, recommendation systems, supply chain optimization, biology, mechanics, and finance. The primary objective in these applications is to maximize the average reward. Real-world scenarios often necessitate adherence to specific constraints during the learning process.
This monograph focuses on the exploration of various model-based and model-free approaches for Constrained RL within the context of average reward Markov Decision Processes (MDPs). The investigation commences with an examination of model-based strategies, delving into two foundational methods - optimism in the face of uncertainty and posterior sampling. Subsequently, the discussion transitions to parametrized model-free approaches, where the primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs. The monograph provides regret guarantees and analyzes constraint violation for each of the discussed setups.
For the above exploration, we assume the underlying MDP to be ergodic. Further, this monograph extends its discussion to encompass results tailored for weakly communicating MDPs, thereby broadening the scope of its findings and their relevance to a wider range of practical scenarios.
Submitted 17 July, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Sample-Efficient Constrained Reinforcement Learning with General Parameterization
Authors:
Washim Uddin Mondal,
Vaneet Aggarwal
Abstract:
We consider a constrained Markov Decision Problem (CMDP) where the goal of an agent is to maximize the expected discounted sum of rewards over an infinite horizon while ensuring that the expected discounted sum of costs exceeds a certain threshold. Building on the idea of momentum-based acceleration, we develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm that guarantees an $ε$ global optimality gap and $ε$ constraint violation with $\mathcal{O}(ε^{-3})$ sample complexity. This improves the state-of-the-art sample complexity in CMDP by a factor of $\mathcal{O}(ε^{-1})$.
Submitted 17 May, 2024;
originally announced May 2024.
-
Variance-Reduced Policy Gradient Approaches for Infinite Horizon Average Reward Markov Decision Processes
Authors:
Swetha Ganesh,
Washim Uddin Mondal,
Vaneet Aggarwal
Abstract:
We present two Policy Gradient-based methods with general parameterization in the context of infinite horizon average reward Markov Decision Processes. The first approach employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of the order $\tilde{\mathcal{O}}(T^{3/5})$. The second approach, rooted in Hessian-based techniques, ensures an expected regret of the order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve the state of the art of the problem, which achieves a regret of $\tilde{\mathcal{O}}(T^{3/4})$.
Submitted 2 April, 2024;
originally announced April 2024.
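To put the three regret orders quoted in the abstract above side by side, the short Python sketch below evaluates $T^{3/4}$, $T^{3/5}$, and $\sqrt{T}$ at a hypothetical horizon. The function name `regret_order` and the horizon value are illustrative assumptions, not taken from the paper, and the constants and logarithmic factors hidden by the $\tilde{\mathcal{O}}$ notation are ignored, so only the relative growth is meaningful.

```python
def regret_order(horizon: float, exponent: float) -> float:
    """Return horizon**exponent, ignoring constants and log factors (assumption)."""
    return horizon ** exponent


# Hypothetical horizon chosen only for illustration.
HORIZON = 10**6

# Exponents as stated in the abstract: prior work, IGT-based, Hessian-based.
EXPONENTS = {
    "prior state of the art, T^(3/4)": 0.75,
    "Implicit Gradient Transport, T^(3/5)": 0.6,
    "Hessian-based, T^(1/2)": 0.5,
}

if __name__ == "__main__":
    for label, exponent in EXPONENTS.items():
        total = regret_order(HORIZON, exponent)
        # Dividing by the horizon gives the average per-step regret order.
        print(f"{label}: total ~ {total:.0f}, per-step ~ {total / HORIZON:.2e}")
```

A smaller exponent means the per-step regret shrinks faster as the horizon grows, which is the sense in which the two proposed bounds improve on $T^{3/4}$.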
-
Near-perfect Coverage Manifold Estimation in Cellular Networks via conditional GAN
Authors:
Washim Uddin Mondal,
Veni Goyal,
Satish V. Ukkusuri,
Goutam Das,
Di Wang,
Mohamed-Slim Alouini,
Vaneet Aggarwal
Abstract:
This paper presents a conditional generative adversarial network (cGAN) that translates base station location (BSL) information of any Region-of-Interest (RoI) to location-dependent coverage probability values within a subset of that region, called the region-of-evaluation (RoE). We train our network utilizing the BSL data of India, the USA, Germany, and Brazil. In comparison to the state-of-the-art convolutional neural networks (CNNs), our model improves the prediction error ($L_1$ difference between the coverage manifold generated by the network under consideration and that generated via simulation) by two orders of magnitude. Moreover, the cGAN-generated coverage manifolds appear to be almost visually indistinguishable from the ground truth.
Submitted 10 February, 2024;
originally announced February 2024.
-
Learning General Parameterized Policies for Infinite Horizon Average Reward Constrained MDPs via Primal-Dual Policy Gradient Algorithm
Authors:
Qinbo Bai,
Washim Uddin Mondal,
Vaneet Aggarwal
Abstract:
This paper explores the realm of infinite horizon average reward Constrained Markov Decision Processes (CMDP). To the best of our knowledge, this work is the first to delve into the regret and constraint violation analysis of average reward CMDPs with a general policy parametrization. To address this challenge, we propose a primal dual based policy gradient algorithm that adeptly manages the constraints while ensuring a low regret guarantee toward achieving a global optimal policy. In particular, we demonstrate that our proposed algorithm achieves $\tilde{\mathcal{O}}({T}^{4/5})$ objective regret and $\tilde{\mathcal{O}}({T}^{4/5})$ constraint violation bounds.
Submitted 3 March, 2024; v1 submitted 3 February, 2024;
originally announced February 2024.
-
Terrain-based Coverage Manifold Estimation: Machine Learning, Stochastic Geometry, or Simulation?
Authors:
Ruibo Wang,
Washim Uddin Mondal,
Mustafa A. Kishk,
Vaneet Aggarwal,
Mohamed-Slim Alouini
Abstract:
Given the necessity of connecting the unconnected, covering blind spots has emerged as a critical task in the next-generation wireless communication network. A direct solution involves obtaining a coverage manifold that visually showcases network coverage performance at each position. Our goal is to devise different methods that minimize the absolute error between the estimated coverage manifold and the actual coverage manifold (referred to as accuracy), while simultaneously maximizing the reduction in computational complexity (measured by computational latency). Simulation is a common method for acquiring coverage manifolds. Although accurate, it is computationally expensive, making it challenging to extend to large-scale networks. In this paper, we expedite traditional simulation methods by introducing a statistical model termed line-of-sight probability-based accelerated simulation. Stochastic geometry is suitable for evaluating the performance of large-scale networks, albeit in a coarse-grained manner. Therefore, we propose a second method wherein a model training approach is applied to the stochastic geometry framework to enhance accuracy and reduce complexity. Additionally, we propose a machine learning-based method that ensures both low complexity and high accuracy, albeit with a significant demand for the size and quality of the dataset. Furthermore, we describe the relationships between these three methods, compare their complexity and accuracy as performance verification, and discuss their application scenarios.
Submitted 11 December, 2023; v1 submitted 4 December, 2023;
originally announced December 2023.
-
Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes
Authors:
Washim Uddin Mondal,
Vaneet Aggarwal
Abstract:
We consider the problem of designing sample efficient learning algorithms for infinite horizon discounted reward Markov Decision Processes. Specifically, we propose the Accelerated Natural Policy Gradient (ANPG) algorithm that utilizes an accelerated stochastic gradient descent process to obtain the natural policy gradient. ANPG achieves $\mathcal{O}({ε^{-2}})$ sample complexity and $\mathcal{O}(ε^{-1})$ iteration complexity with general parameterization where $ε$ defines the optimality error. This improves the state-of-the-art sample complexity by a $\log(\frac{1}{ε})$ factor. ANPG is a first-order algorithm and, unlike some existing literature, does not require the unverifiable assumption that the variance of importance sampling (IS) weights is upper bounded. In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(ε^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.
Submitted 5 February, 2024; v1 submitted 17 October, 2023;
originally announced October 2023.
-
Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes
Authors:
Qinbo Bai,
Washim Uddin Mondal,
Vaneet Aggarwal
Abstract:
In this paper, we consider an infinite horizon average reward Markov Decision Process (MDP). Distinguishing itself from existing works within this context, our approach harnesses the power of the general policy gradient-based algorithm, liberating it from the constraints of assuming a linear MDP structure. We propose a policy gradient-based algorithm and show its global convergence property. We then prove that the proposed algorithm has $\tilde{\mathcal{O}}({T}^{3/4})$ regret. Remarkably, this paper marks a pioneering effort by presenting the first exploration into regret-bound computation for the general parameterized policy gradient algorithm in the context of average reward scenarios.
Submitted 2 February, 2024; v1 submitted 4 September, 2023;
originally announced September 2023.
-
Supporting Post-disaster Recovery with Agent-based Modeling in Multilayer Socio-physical Networks
Authors:
Jiawei Xue,
Sangung Park,
Washim Uddin Mondal,
Sandro Martinelli Reia,
Tong Yao,
Satish V. Ukkusuri
Abstract:
The examination of post-disaster recovery (PDR) in a socio-physical system enables us to elucidate the complex relationships between humans and infrastructures. Although existing studies have identified many patterns in the PDR process, they fall short of describing how individual recoveries contribute to the overall recovery of the system. To enhance the understanding of individual return behavior and the recovery of point-of-interests (POIs), we propose an agent-based model (ABM), called PostDisasterSim. We apply the model to analyze the recovery of five counties in Texas following Hurricane Harvey in 2017. Specifically, we construct a three-layer network comprising the human layer, the social infrastructure layer, and the physical infrastructure layer, using mobile phone location data and POI data. Based on prior studies and a household survey, we develop the ABM to simulate how evacuated individuals return to their homes, and social and physical infrastructures recover. By implementing the ABM, we unveil the heterogeneity in recovery dynamics in terms of agent types, housing types, household income levels, and geographical locations. Moreover, simulation results across nine scenarios quantitatively demonstrate the positive effects of social and physical infrastructure improvement plans. This study can assist disaster scientists in uncovering nuanced recovery patterns and policymakers in translating policies like resource allocation into practice.
Submitted 21 July, 2023;
originally announced July 2023.
-
Cooperating Graph Neural Networks with Deep Reinforcement Learning for Vaccine Prioritization
Authors:
Lu Ling,
Washim Uddin Mondal,
Satish V. Ukkusuri
Abstract:
This study explores the vaccine prioritization strategy to reduce the overall burden of the pandemic when the supply is limited. Existing methods conduct macro-level or simplified micro-level vaccine distribution by assuming homogeneous behavior within subgroup populations and without integrating mobility dynamics. Directly applying these models for micro-level vaccine allocation leads to sub-optimal solutions due to the lack of behavior-related details. To address the issue, we first incorporate the mobility heterogeneity in disease dynamics modeling and mimic the disease evolution process using a Trans-vaccine-SEIR model. Then we develop a novel deep reinforcement learning framework to seek the optimal vaccine allocation strategy for the high-degree spatial-temporal disease evolution system. The graph neural network is used to effectively capture the structural properties of the mobility contact network and extract the dynamic disease features. In our evaluation, the proposed framework reduces infections and deaths by 7% to 10% relative to the baseline strategies. Extensive evaluation shows that the proposed framework robustly finds the optimal vaccine allocation under diverse mobility patterns in the micro-level disease evolution system. In particular, we find the optimal vaccine allocation strategy in the transit usage restriction scenario is significantly more effective than restricting cross-zone mobility for the top 10% age-based and income-based zones. These results provide valuable insights for areas with limited vaccines and low logistic efficacy.
Submitted 9 May, 2023;
originally announced May 2023.
-
Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward
Authors:
Washim Uddin Mondal,
Vaneet Aggarwal
Abstract:
We investigate an infinite-horizon average reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. The delay and compositeness of rewards mean that rewards generated as a result of taking an action at a given state are fragmented into different components, and they are sequentially realized at delayed time instances. The partial anonymity attribute implies that a learner, for each state, only observes the aggregate of past reward components generated as a result of different actions taken at that state, but realized at the observation instance. We propose an algorithm named $\mathrm{DUCRL2}$ to obtain a near-optimal policy for this setting and show that it achieves a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d (SA)^3\right)$ where $S$ and $A$ are the sizes of the state and action spaces, respectively, $D$ is the diameter of the MDP, $d$ is a parameter upper bounded by the maximum reward delay, and $T$ denotes the time horizon. This demonstrates the optimality of the bound in the order of $T$, and an additive impact of the delay.
Submitted 28 August, 2023; v1 submitted 3 May, 2023;
originally announced May 2023.
-
Mean-Field Control based Approximation of Multi-Agent Reinforcement Learning in Presence of a Non-decomposable Shared Global State
Authors:
Washim Uddin Mondal,
Vaneet Aggarwal,
Satish V. Ukkusuri
Abstract:
Mean Field Control (MFC) is a powerful approximation tool to solve large-scale Multi-Agent Reinforcement Learning (MARL) problems. However, the success of MFC relies on the presumption that given the local states and actions of all the agents, the next (local) states of the agents evolve conditionally independent of each other. Here we demonstrate that even in a MARL setting where agents share a common global state in addition to their local states evolving conditionally independently (thus introducing a correlation between the state transition processes of individual agents), the MFC can still be applied as a good approximation tool. The global state is assumed to be non-decomposable i.e., it cannot be expressed as a collection of local states of the agents. We compute the approximation error as $\mathcal{O}(e)$ where $e=\frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|} +\sqrt{|\mathcal{U}|}\right]$. The size of the agent population is denoted by the term $N$, and $|\mathcal{X}|, |\mathcal{U}|$ respectively indicate the sizes of (local) state and action spaces of individual agents. The approximation error is found to be independent of the size of the shared global state space. We further demonstrate that in a special case if the reward and state transition functions are independent of the action distribution of the population, then the error can be improved to $e=\frac{\sqrt{|\mathcal{X}|}}{\sqrt{N}}$. Finally, we devise a Natural Policy Gradient based algorithm that solves the MFC problem with $\mathcal{O}(ε^{-3})$ sample complexity and obtains a policy that is within $\mathcal{O}(\max\{e,ε\})$ error of the optimal MARL policy for any $ε>0$.
Submitted 26 May, 2023; v1 submitted 13 January, 2023;
originally announced January 2023.
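The approximation error in the abstract above, $e=\frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]$, and its special case $e=\frac{\sqrt{|\mathcal{X}|}}{\sqrt{N}}$ are easy to evaluate numerically. The Python sketch below does so for hypothetical sizes; the function name `mfc_error`, its parameter names, and the example numbers are assumptions made for illustration, and the constant hidden by $\mathcal{O}(\cdot)$ is taken to be 1.

```python
import math


def mfc_error(n_agents: int, n_states: int, n_actions: int,
              depends_on_action_dist: bool = True) -> float:
    """Evaluate the error term e from the abstract with the O() constant set to 1.

    If the reward and transition functions do not depend on the population's
    action distribution, the sqrt(|U|) term is dropped (the special case).
    """
    state_term = math.sqrt(n_states)
    action_term = math.sqrt(n_actions) if depends_on_action_dist else 0.0
    return (state_term + action_term) / math.sqrt(n_agents)


if __name__ == "__main__":
    # Hypothetical sizes: |X| = 4 local states, |U| = 9 local actions.
    for n in (100, 400, 1600):
        general = mfc_error(n, 4, 9)
        special = mfc_error(n, 4, 9, depends_on_action_dist=False)
        print(f"N={n}: general e={general:.4f}, special-case e={special:.4f}")
```

Quadrupling the population size halves the error term, reflecting the $1/\sqrt{N}$ dependence; the size of the shared global state space does not appear at all.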
-
Mean-Field Approximation of Cooperative Constrained Multi-Agent Reinforcement Learning (CMARL)
Authors:
Washim Uddin Mondal,
Vaneet Aggarwal,
Satish V. Ukkusuri
Abstract:
Mean-Field Control (MFC) has recently been proven to be a scalable tool to approximately solve large-scale multi-agent reinforcement learning (MARL) problems. However, these studies are typically limited to the unconstrained cumulative reward maximization framework. In this paper, we show that one can use the MFC approach to approximate the MARL problem even in the presence of constraints. Specifically, we prove that an $N$-agent constrained MARL problem, with the state and action spaces of each individual agent being of sizes $|\mathcal{X}|$ and $|\mathcal{U}|$ respectively, can be approximated by an associated constrained MFC problem with an error $e\triangleq \mathcal{O}\left([\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}]/\sqrt{N}\right)$. In a special case where the reward, cost, and state transition functions are independent of the action distribution of the population, we prove that the error can be improved to $e=\mathcal{O}(\sqrt{|\mathcal{X}|}/\sqrt{N})$. Also, we provide a Natural Policy Gradient based algorithm and prove that it can solve the constrained MARL problem within an error of $\mathcal{O}(e)$ with a sample complexity of $\mathcal{O}(e^{-6})$.
Submitted 15 September, 2022;
originally announced September 2022.
-
On the Near-Optimality of Local Policies in Large Cooperative Multi-Agent Reinforcement Learning
Authors:
Washim Uddin Mondal,
Vaneet Aggarwal,
Satish V. Ukkusuri
Abstract:
We show that in a cooperative $N$-agent network, one can design locally executable policies for the agents such that the resulting discounted sum of average rewards (value) well approximates the optimal value computed over all (including non-local) policies. Specifically, we prove that, if $|\mathcal{X}|, |\mathcal{U}|$ denote the sizes of the state and action spaces of individual agents, then for a sufficiently small discount factor, the approximation error is given by $\mathcal{O}(e)$ where $e\triangleq \frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]$. Moreover, in a special case where the reward and state transition functions are independent of the action distribution of the population, the error improves to $\mathcal{O}(e)$ where $e\triangleq \frac{1}{\sqrt{N}}\sqrt{|\mathcal{X}|}$. Finally, we also devise an algorithm to explicitly construct a local policy. With the help of our approximation results, we further establish that the constructed local policy is within $\mathcal{O}(\max\{e,ε\})$ distance of the optimal policy, and the sample complexity to achieve such a local policy is $\mathcal{O}(ε^{-3})$, for any $ε>0$.
Submitted 7 September, 2022;
originally announced September 2022.
-
Can Mean Field Control (MFC) Approximate Cooperative Multi Agent Reinforcement Learning (MARL) with Non-Uniform Interaction?
Authors:
Washim Uddin Mondal,
Vaneet Aggarwal,
Satish V. Ukkusuri
Abstract:
Mean-Field Control (MFC) is a powerful tool to solve Multi-Agent Reinforcement Learning (MARL) problems. Recent studies have shown that MFC can well-approximate MARL when the population size is large and the agents are exchangeable. Unfortunately, the presumption of exchangeability implies that all agents uniformly interact with one another, which is not true in many practical scenarios. In this article, we relax the assumption of exchangeability and model the interaction between agents via an arbitrary doubly stochastic matrix. As a result, in our framework, the mean-fields `seen' by different agents are different. We prove that, if the reward of each agent is an affine function of the mean-field seen by that agent, then one can approximate such a non-uniform MARL problem via its associated MFC problem within an error of $e=\mathcal{O}(\frac{1}{\sqrt{N}}[\sqrt{|\mathcal{X}|} + \sqrt{|\mathcal{U}|}])$ where $N$ is the population size and $|\mathcal{X}|$, $|\mathcal{U}|$ are the sizes of state and action spaces respectively. Finally, we develop a Natural Policy Gradient (NPG) algorithm that can provide a solution to the non-uniform MARL with an error $\mathcal{O}(\max\{e,ε\})$ and a sample complexity of $\mathcal{O}(ε^{-3})$ for any $ε>0$.
Submitted 1 June, 2022; v1 submitted 28 February, 2022;
originally announced March 2022.
-
Deep Learning based Coverage and Rate Manifold Estimation in Cellular Networks
Authors:
Washim Uddin Mondal,
Praful D. Mankar,
Goutam Das,
Vaneet Aggarwal,
Satish V. Ukkusuri
Abstract:
This article proposes a Convolutional Neural Network-based Auto Encoder (CNN-AE) to predict the location-dependent rate and coverage probability of a network from its topology. We train the CNN utilising BS location data of India, Brazil, Germany, and the USA and compare its performance with stochastic geometry (SG) based analytical models. In comparison to the best-fitted SG-based model, CNN-AE improves the coverage and rate prediction errors by margins as large as $40\%$ and $25\%$ respectively. As an application, we propose a low complexity, provably convergent algorithm that, using the trained CNN-AE, can compute locations of new BSs that need to be deployed in a network in order to satisfy pre-defined spatially heterogeneous performance goals.
Submitted 21 August, 2022; v1 submitted 13 February, 2022;
originally announced February 2022.
-
On the Approximation of Cooperative Heterogeneous Multi-Agent Reinforcement Learning (MARL) using Mean Field Control (MFC)
Authors:
Washim Uddin Mondal,
Mridul Agarwal,
Vaneet Aggarwal,
Satish V. Ukkusuri
Abstract:
Mean field control (MFC) is an effective way to mitigate the curse of dimensionality of cooperative multi-agent reinforcement learning (MARL) problems. This work considers a collection of $N_{\mathrm{pop}}$ heterogeneous agents that can be segregated into $K$ classes such that the $k$-th class contains $N_k$ homogeneous agents. We aim to prove approximation guarantees of the MARL problem for this heterogeneous system by its corresponding MFC problem. We consider three scenarios where the reward and transition dynamics of all agents are respectively taken to be functions of $(1)$ joint state and action distributions across all classes, $(2)$ individual distributions of each class, and $(3)$ marginal distributions of the entire population. We show that, in these cases, the $K$-class MARL problem can be approximated by MFC with errors given as $e_1=\mathcal{O}(\frac{\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}}{N_{\mathrm{pop}}}\sum_{k}\sqrt{N_k})$, $e_2=\mathcal{O}(\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\sum_{k}\frac{1}{\sqrt{N_k}})$ and $e_3=\mathcal{O}\left(\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\left[\frac{A}{N_{\mathrm{pop}}}\sum_{k\in[K]}\sqrt{N_k}+\frac{B}{\sqrt{N_{\mathrm{pop}}}}\right]\right)$, respectively, where $A, B$ are some constants and $|\mathcal{X}|,|\mathcal{U}|$ are the sizes of state and action spaces of each agent. Finally, we design a Natural Policy Gradient (NPG) based algorithm that, in the three cases stated above, can converge to an optimal MARL policy within $\mathcal{O}(e_j)$ error with a sample complexity of $\mathcal{O}(e_j^{-3})$, $j\in\{1,2,3\}$, respectively.
Submitted 8 May, 2022; v1 submitted 8 September, 2021;
originally announced September 2021.
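As a rough numerical illustration of the first two error expressions in the abstract above, the Python sketch below evaluates $e_1$ and $e_2$ for a hypothetical two-class population. The function name `heterogeneous_errors` and the example class sizes are assumptions made for illustration, the constants hidden by $\mathcal{O}(\cdot)$ are set to 1, and $e_3$ is omitted because its constants $A, B$ are not specified in the abstract.

```python
import math
from typing import Sequence, Tuple


def heterogeneous_errors(class_sizes: Sequence[int], n_states: int,
                         n_actions: int) -> Tuple[float, float]:
    """Evaluate e_1 and e_2 from the abstract with the O() constants set to 1."""
    n_pop = sum(class_sizes)
    scale = math.sqrt(n_states) + math.sqrt(n_actions)
    # e_1: dynamics depend on joint distributions across all classes.
    e1 = scale / n_pop * sum(math.sqrt(n_k) for n_k in class_sizes)
    # e_2: dynamics depend on the individual distribution of each class.
    e2 = scale * sum(1.0 / math.sqrt(n_k) for n_k in class_sizes)
    return e1, e2


if __name__ == "__main__":
    # Hypothetical setting: K = 2 classes, |X| = 4 states, |U| = 9 actions.
    e1, e2 = heterogeneous_errors([100, 400], n_states=4, n_actions=9)
    print(f"e_1 = {e1:.3f}, e_2 = {e2:.3f}")
```

In the $e_2$ expression every class contributes a $1/\sqrt{N_k}$ term, so a single small class can dominate that error even when the total population is large.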
-
Queuing Analysis of Opportunistic Cognitive Radio IoT Network with Imperfect Sensing
Authors:
Asif Ahmed Sardar,
Dibbendu Roy,
Washim Uddin Mondal,
Goutam Das
Abstract:
In this paper, we analyze a Cognitive Radio-based Internet-of-Things (CR-IoT) network comprising a Primary Network Provider (PNP) and an IoT operator. The PNP uses its licensed spectrum to serve its users. The IoT operator identifies the white-space in the licensed band at regular intervals and opportunistically exploits it to serve the IoT nodes under its coverage. IoT nodes are battery-operated devices that require periodic energy replenishment. We employ the Microwave Power Transfer (MPT) technique for its superior energy transfer efficiency over long distances. The white-space detection process is not always perfect and the IoT operator may jeopardize the PNP's transmissions due to misdetection. To reduce the possibility of such interference, some of the spectrum holes must remain unutilized, even when the IoT nodes have data to transmit. The IoT operator needs to decide what percentage of the white-space to keep unutilized and how to judiciously use the rest for data transmission and energy replenishment to maintain an optimal balance between the average interference inflicted on the PNP's users and the Quality-of-Service (QoS) experienced by IoT nodes. Due to the periodic nature of the spectrum-sensing process, a Discrete Time Markov Chain (DTMC) can realistically model this framework. In the literature, activities of the PNP and IoT operator are assumed to be mutually exclusive, for ease of analysis. Our model incorporates possible overlaps between these activities, making the analysis more realistic. Using our model, the sustainability region of the CR-IoT network can be obtained. The accuracy of our analysis is demonstrated via extensive simulation.
Submitted 16 March, 2021;
originally announced March 2021.
-
On Exact Distribution of Poisson-Voronoi Area in $K$-tier HetNets with Generalized Association Rule
Authors:
Washim Uddin Mondal,
Goutam Das
Abstract:
This letter characterizes the exact distribution function of a typical Voronoi area in a $K$-tier Poisson network. The users obey a generalized association (GA) rule, which is a superset of nearest base station association and maximum received power based association (with arbitrary fading) rules that are commonly adopted in the literature. Combining Robbins' theorem and the probability generating functional of a Poisson point process, we obtain the exact moments of a typical $k$-th tier Voronoi area, $k \in \{1,...,K\}$ under the GA rule. We apply this result in several special cases. For example, we prove that in multi-tier networks with the GA rule, the mean of the $k$-th tier Voronoi area can be expressed exactly in closed form. We also obtain simplified expressions of its higher-order moments for both average and instantaneous received power based user association. In single-tier networks with exponential fading, the latter association rule provides a closed-form expression of the second-order moment of a typical Voronoi area. We numerically evaluate this exact expression and compare it with an approximated result.
Submitted 1 July, 2020;
originally announced July 2020.