Search | arXiv e-print repository

Lyapunov Robust Constrained-MDPs: Soft-Constrained Robustly Stable Policy Optimization under Model Uncertainty

Authors: Reazul Hasan Russel, Mouhacine Benosman, Jeroen Van Baar, Radu Corcodel

Abstract: Safety and robustness are two desired properties for any reinforcement learning algorithm. CMDPs can handle additional safety constraints and RMDPs can perform well under model uncertainties. In this paper, we propose to unite these two frameworks resulting in robust constrained MDPs (RCMDPs). The motivation is to develop a framework that can satisfy safety constraints while also simultaneously offer robustness to model uncertainties. We develop the RCMDP objective, derive gradient update formula to optimize this objective and then propose policy gradient based algorithms. We also independently propose Lyapunov based reward shaping for RCMDPs, yielding better stability and convergence properties.

Submitted 5 August, 2021; originally announced August 2021.

Comments: arXiv admin note: text overlap with arXiv:2010.04870

arXiv:2010.04870 [pdf, other]

Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty

Authors: Reazul Hasan Russel, Mouhacine Benosman, Jeroen Van Baar

Abstract: In this paper, we focus on the problem of robustifying reinforcement learning (RL) algorithms with respect to model uncertainties. Indeed, in the framework of model-based RL, we propose to merge the theory of constrained Markov decision process (CMDP), with the theory of robust Markov decision process (RMDP), leading to a formulation of robust constrained-MDPs (RCMDP). This formulation, simple in essence, allows us to design RL algorithms that are robust in performance, and provides constraint satisfaction guarantees, with respect to uncertainties in the system's states transition probabilities. The need for RCMDPs is important for real-life applications of RL. For instance, such formulation can play an important role for policy transfer from simulation to real world (Sim2Real) in safety critical applications, which would benefit from performance and safety guarantees which are robust w.r.t. model uncertainty. We first propose the general problem formulation under the concept of RCMDP, and then propose a Lagrangian formulation of the optimal problem, leading to a robust-constrained policy gradient RL algorithm. We finally validate this concept on the inventory management problem.

Submitted 9 October, 2020; originally announced October 2020.

arXiv:2006.11679 [pdf, other]

Entropic Risk Constrained Soft-Robust Policy Optimization

Authors: Reazul Hasan Russel, Bahram Behzadian, Marek Petrik

Abstract: Having a perfect model to compute the optimal policy is often infeasible in reinforcement learning. It is important in high-stakes domains to quantify and manage risk induced by model uncertainties. Entropic risk measure is an exponential utility-based convex risk measure that satisfies many reasonable properties. In this paper, we propose an entropic risk constrained policy gradient and actor-critic algorithms that are risk-averse to the model uncertainty. We demonstrate the usefulness of our algorithms on several problem domains.

Submitted 20 June, 2020; originally announced June 2020.

arXiv:1912.02696 [pdf, other]

Optimizing Norm-Bounded Weighted Ambiguity Sets for Robust MDPs

Authors: Reazul Hasan Russel, Bahram Behzadian, Marek Petrik

Abstract: Optimal policies in Markov decision processes (MDPs) are very sensitive to model misspecification. This raises serious concerns about deploying them in high-stake domains. Robust MDPs (RMDP) provide a promising framework to mitigate vulnerabilities by computing policies with worst-case guarantees in reinforcement learning. The solution quality of an RMDP depends on the ambiguity set, which is a quantification of model uncertainties. In this paper, we propose a new approach for optimizing the shape of the ambiguity sets for RMDPs. Our method departs from the conventional idea of constructing a norm-bounded uniform and symmetric ambiguity set. We instead argue that the structure of a near-optimal ambiguity set is problem specific. Our proposed method computes a weight parameter from the value functions, and these weights then drive the shape of the ambiguity sets. Our theoretical analysis demonstrates the rationale of the proposed idea. We apply our method to several different problem domains, and the empirical results further furnish the practical promise of weighted near-optimal ambiguity sets.

Submitted 4 December, 2019; originally announced December 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1910.10786

arXiv:1912.02150 [pdf, other]

A Probabilistic Approach to Satisfiability of Propositional Logic Formulae

Authors: Reazul Hasan Russel

Abstract: We propose a version of WalkSAT algorithm, named as BetaWalkSAT. This method uses probabilistic reasoning for biasing the starting state of the local search algorithm. Beta distribution is used to model the belief over boolean values of the literals. Our results suggest that, the proposed BetaWalkSAT algorithm can outperform other uninformed local search approaches for complex boolean satisfiability problems.

Submitted 4 December, 2019; originally announced December 2019.

arXiv:1910.10786 [pdf, other]

Optimizing Percentile Criterion Using Robust MDPs

Authors: Bahram Behzadian, Reazul Hasan Russel, Marek Petrik, Chin Pang Ho

Abstract: We address the problem of computing reliable policies in reinforcement learning problems with limited data. In particular, we compute policies that achieve good returns with high confidence when deployed. This objective, known as the percentile criterion, can be optimized using Robust MDPs (RMDPs). RMDPs generalize MDPs to allow for uncertain transition probabilities chosen adversarially from given ambiguity sets. We show that the RMDP solution's sub-optimality depends on the spans of the ambiguity sets along the value function. We then propose new algorithms that minimize the span of ambiguity sets defined by weighted $L_1$ and $L_\infty$ norms. Our primary focus is on Bayesian guarantees, but we also describe how our methods apply to frequentist guarantees and derive new concentration inequalities for weighted $L_1$ and $L_\infty$ norms. Experimental results indicate that our optimized ambiguity sets improve significantly on prior construction methods.

Submitted 25 February, 2021; v1 submitted 23 October, 2019; originally announced October 2019.

arXiv:1904.08528 [pdf, other]

Robust Exploration with Tight Bayesian Plausibility Sets

Authors: Reazul H. Russel, Tianyi Gu, Marek Petrik

Abstract: Optimism about the poorly understood states and actions is the main driving force of exploration for many provably-efficient reinforcement learning algorithms. We propose optimism in the face of sensible value functions (OFVF)- a novel data-driven Bayesian algorithm to constructing Plausibility sets for MDPs to explore robustly minimizing the worst case exploration cost. The method computes policies with tighter optimistic estimates for exploration by introducing two new ideas. First, it is based on Bayesian posterior distributions rather than distribution-free bounds. Second, OFVF does not construct plausibility sets as simple confidence intervals. Confidence intervals as plausibility sets are a sufficient but not a necessary condition. OFVF uses the structure of the value function to optimize the location and shape of the plausibility set to guarantee upper bounds directly without necessarily enforcing the requirement for the set to be a confidence interval. OFVF proceeds in an episodic manner, where the duration of the episode is fixed and known. Our algorithm is inherently Bayesian and can leverage prior information. Our theoretical analysis shows the robustness of OFVF, and the empirical results demonstrate its practical promise.

Submitted 17 April, 2019; originally announced April 2019.

arXiv:1901.07010 [pdf, other]

A Short Survey on Probabilistic Reinforcement Learning

Authors: Reazul Hasan Russel

Abstract: A reinforcement learning agent tries to maximize its cumulative payoff by interacting in an unknown environment. It is important for the agent to explore suboptimal actions as well as to pick actions with highest known rewards. Yet, in sensitive domains, collecting more data with exploration is not always possible, but it is important to find a policy with a certain performance guaranty. In this paper, we present a brief survey of methods available in the literature for balancing exploration-exploitation trade off and computing robust solutions from fixed samples in reinforcement learning.

Submitted 21 January, 2019; originally announced January 2019.

Comments: 7 pages, originally written as a literature survey for PhD candidacy exam

arXiv:1811.06512 [pdf, other]

Tight Bayesian Ambiguity Sets for Robust MDPs

Authors: Reazul Hasan Russel, Marek Petrik

Abstract: Robustness is important for sequential decision making in a stochastic dynamic environment with uncertain probabilistic parameters. We address the problem of using robust MDPs (RMDPs) to compute policies with provable worst-case guarantees in reinforcement learning. The quality and robustness of an RMDP solution is determined by its ambiguity set. Existing methods construct ambiguity sets that lead to impractically conservative solutions. In this paper, we propose RSVF, which achieves less conservative solutions with the same worst-case guarantees by 1) leveraging a Bayesian prior, 2) optimizing the size and location of the ambiguity set, and, most importantly, 3) relaxing the requirement that the set is a confidence interval. Our theoretical analysis shows the safety of RSVF, and the empirical results demonstrate its practical promise.

Submitted 15 November, 2018; originally announced November 2018.

Comments: 5 pages. Accepted at Infer to Control Workshop at Neural Information Processing Systems (NIPS) 2018

arXiv:1704.03926 [pdf, other]

Value Directed Exploration in Multi-Armed Bandits with Structured Priors

Authors: Bence Cserna, Marek Petrik, Reazul Hasan Russel, Wheeler Ruml

Abstract: Multi-armed bandits are a quintessential machine learning problem requiring the balancing of exploration and exploitation. While there has been progress in developing algorithms with strong theoretical guarantees, there has been less focus on practical near-optimal finite-time performance. In this paper, we propose an algorithm for Bayesian multi-armed bandits that utilizes value-function-driven online planning techniques. Building on previous work on UCB and Gittins index, we introduce linearly-separable value functions that take both the expected return and the benefit of exploration into consideration to perform n-step lookahead. The algorithm enjoys a sub-linear performance guarantee and we present simulation results that confirm its strength in problems with structured priors. The simplicity and generality of our approach makes it a strong candidate for analyzing more complex multi-armed bandit problems.

Submitted 17 May, 2017; v1 submitted 12 April, 2017; originally announced April 2017.

Showing 1–10 of 10 results for author: Russel, R H