Search | arXiv e-print repository

Inception: Efficiently Computable Misinformation Attacks on Markov Games

Authors: Jeremy McMahan, Young Wu, Yudong Chen, Xiaojin Zhu, Qiaomin Xie

Abstract: We study security threats to Markov games due to information asymmetry and misinformation. We consider an attacker player who can spread misinformation about its reward function to influence the robust victim player's behavior. Given a fixed fake reward function, we derive the victim's policy under worst-case rationality and present polynomial-time algorithms to compute the attacker's optimal wors… ▽ More We study security threats to Markov games due to information asymmetry and misinformation. We consider an attacker player who can spread misinformation about its reward function to influence the robust victim player's behavior. Given a fixed fake reward function, we derive the victim's policy under worst-case rationality and present polynomial-time algorithms to compute the attacker's optimal worst-case policy based on linear programming and backward induction. Then, we provide an efficient inception ("planting an idea in someone's mind") attack algorithm to find the optimal fake reward function within a restricted set of reward functions with dominant strategies. Importantly, our methods exploit the universal assumption of rationality to compute attacks efficiently. Thus, our work exposes a security vulnerability arising from standard game assumptions under misinformation. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted to Reinforcement Learning Conference (RLC) 2024

arXiv:2406.08847 [pdf, other]

Roping in Uncertainty: Robustness and Regularization in Markov Games

Authors: Jeremy McMahan, Giovanni Artiglio, Qiaomin Xie

Abstract: We study robust Markov games (RMG) with $s$-rectangular uncertainty. We show a general equivalence between computing a robust Nash equilibrium (RNE) of a $s$-rectangular RMG and computing a Nash equilibrium (NE) of an appropriately constructed regularized MG. The equivalence result yields a planning algorithm for solving $s$-rectangular RMGs, as well as provable robustness guarantees for policies… ▽ More We study robust Markov games (RMG) with $s$-rectangular uncertainty. We show a general equivalence between computing a robust Nash equilibrium (RNE) of a $s$-rectangular RMG and computing a Nash equilibrium (NE) of an appropriately constructed regularized MG. The equivalence result yields a planning algorithm for solving $s$-rectangular RMGs, as well as provable robustness guarantees for policies computed using regularized methods. However, we show that even for just reward-uncertain two-player zero-sum matrix games, computing an RNE is PPAD-hard. Consequently, we derive a special uncertainty structure called efficient player-decomposability and show that RNE for two-player zero-sum RMG in this class can be provably solved in polynomial time. This class includes commonly used uncertainty sets such as $L_1$ and $L_\infty$ ball uncertainty sets. △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted to ICML 2024

arXiv:2405.14183 [pdf, other]

Deterministic Policies for Constrained Reinforcement Learning in Polynomial-Time

Authors: Jeremy McMahan

Abstract: We present a novel algorithm that efficiently computes near-optimal deterministic policies for constrained reinforcement learning (CRL) problems. Our approach combines three key ideas: (1) value-demand augmentation, (2) action-space approximate dynamic programming, and (3) time-space rounding. Under mild reward assumptions, our algorithm constitutes a fully polynomial-time approximation scheme (FP… ▽ More We present a novel algorithm that efficiently computes near-optimal deterministic policies for constrained reinforcement learning (CRL) problems. Our approach combines three key ideas: (1) value-demand augmentation, (2) action-space approximate dynamic programming, and (3) time-space rounding. Under mild reward assumptions, our algorithm constitutes a fully polynomial-time approximation scheme (FPTAS) for a diverse class of cost criteria. This class requires that the cost of a policy can be computed recursively over both time and (state) space, which includes classical expectation, almost sure, and anytime constraints. Our work not only provides provably efficient algorithms to address real-world challenges in decision-making but also offers a unifying theory for the efficient computation of constrained deterministic policies. △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2312.00198 [pdf, other]

doi 10.1609/aaai.v38i13.29346

Optimal Attack and Defense for Reinforcement Learning

Authors: Jeremy McMahan, Young Wu, Xiaojin Zhu, Qiaomin Xie

Abstract: To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a gener… ▽ More To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP since it is not the true environment but a higher level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, thus such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios. △ Less

Submitted 17 June, 2024; v1 submitted 30 November, 2023; originally announced December 2023.

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, 38(13), 14332-14340. 2024

arXiv:2311.05511 [pdf, other]

Anytime-Constrained Reinforcement Learning

Authors: Jeremy McMahan, Xiaojin Zhu

Abstract: We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints. An anytime constraint requires the agent to never violate its budget at any point in time, almost surely. Although Markovian policies are no longer sufficient, we show that there exist optimal deterministic policies augmented with cumulative costs. In fact, we present a fixed-parameter tractable reductio… ▽ More We introduce and study constrained Markov Decision Processes (cMDPs) with anytime constraints. An anytime constraint requires the agent to never violate its budget at any point in time, almost surely. Although Markovian policies are no longer sufficient, we show that there exist optimal deterministic policies augmented with cumulative costs. In fact, we present a fixed-parameter tractable reduction from anytime-constrained cMDPs to unconstrained MDPs. Our reduction yields planning and learning algorithms that are time and sample-efficient for tabular cMDPs so long as the precision of the costs is logarithmic in the size of the cMDP. However, we also show that computing non-trivial approximately optimal policies is NP-hard in general. To circumvent this bottleneck, we design provable approximation algorithms that efficiently compute or learn an arbitrarily accurate approximately feasible policy with optimal value so long as the maximum supported cost is bounded by a polynomial in the cMDP or the absolute budget. Given our hardness results, our approximation guarantees are the best possible under worst-case analysis. △ Less

Submitted 13 June, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

Journal ref: Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:4321-4329, 2024

arXiv:2311.00582 [pdf, other]

Minimally Modifying a Markov Game to Achieve Any Nash Equilibrium and Value

Authors: Young Wu, Jeremy McMahan, Yiding Chen, Yudong Chen, Xiaojin Zhu, Qiaomin Xie

Abstract: We study the game modification problem, where a benevolent game designer or a malevolent adversary modifies the reward function of a zero-sum Markov game so that a target deterministic or stochastic policy profile becomes the unique Markov perfect Nash equilibrium and has a value within a target range, in a way that minimizes the modification cost. We characterize the set of policy profiles that c… ▽ More We study the game modification problem, where a benevolent game designer or a malevolent adversary modifies the reward function of a zero-sum Markov game so that a target deterministic or stochastic policy profile becomes the unique Markov perfect Nash equilibrium and has a value within a target range, in a way that minimizes the modification cost. We characterize the set of policy profiles that can be installed as the unique equilibrium of a game and establish sufficient and necessary conditions for successful installation. We propose an efficient algorithm that solves a convex optimization problem with linear constraints and then performs random perturbation to obtain a modification plan with a near-optimal cost. The code for our algorithm is available at https://github.com/YoungWu559/game-modification . △ Less

Submitted 24 August, 2024; v1 submitted 1 November, 2023; originally announced November 2023.

Comments: Accepted by ICML 2024 Conference

arXiv:2307.09652 [pdf, other]

VISER: A Tractable Solution Concept for Games with Information Asymmetry

Authors: Jeremy McMahan, Young Wu, Yudong Chen, Xiaojin Zhu, Qiaomin Xie

Abstract: Many real-world games suffer from information asymmetry: one player is only aware of their own payoffs while the other player has the full game information. Examples include the critical domain of security games and adversarial multi-agent reinforcement learning. Information asymmetry renders traditional solution concepts such as Strong Stackelberg Equilibrium (SSE) and Robust-Optimization Equilib… ▽ More Many real-world games suffer from information asymmetry: one player is only aware of their own payoffs while the other player has the full game information. Examples include the critical domain of security games and adversarial multi-agent reinforcement learning. Information asymmetry renders traditional solution concepts such as Strong Stackelberg Equilibrium (SSE) and Robust-Optimization Equilibrium (ROE) inoperative. We propose a novel solution concept called VISER (Victim Is Secure, Exploiter best-Responds). VISER enables an external observer to predict the outcome of such games. In particular, for security applications, VISER allows the victim to better defend itself while characterizing the most damaging attacks available to the attacker. We show that each player's VISER strategy can be computed independently in polynomial time using linear programming (LP). We also extend VISER to its Markov-perfect counterpart for Markov games, which can be solved efficiently using a series of LPs. △ Less

Submitted 18 July, 2023; originally announced July 2023.

Comments: 17 pages, 6 figures

MSC Class: 91A27 (Primary); 93E20 (Secondary) ACM Class: F.2.1; G.3; I.2.8

arXiv:2306.08041 [pdf, ps, other]

Data Poisoning to Fake a Nash Equilibrium in Markov Games

Authors: Young Wu, Jeremy McMahan, Xiaojin Zhu, Qiaomin Xie

Abstract: We characterize offline data poisoning attacks on Multi-Agent Reinforcement Learning (MARL), where an attacker may change a data set in an attempt to install a (potentially fictitious) unique Markov-perfect Nash equilibrium for a two-player zero-sum Markov game. We propose the unique Nash set, namely the set of games, specified by their Q functions, with a specific joint policy being the unique Na… ▽ More We characterize offline data poisoning attacks on Multi-Agent Reinforcement Learning (MARL), where an attacker may change a data set in an attempt to install a (potentially fictitious) unique Markov-perfect Nash equilibrium for a two-player zero-sum Markov game. We propose the unique Nash set, namely the set of games, specified by their Q functions, with a specific joint policy being the unique Nash equilibrium. The unique Nash set is central to poisoning attacks because the attack is successful if and only if data poisoning pushes all plausible games inside the set. The unique Nash set generalizes the reward polytope commonly used in inverse reinforcement learning to MARL. For zero-sum Markov games, both the inverse Nash set and the set of plausible games induced by data are polytopes in the Q function space. We exhibit a linear program to efficiently compute the optimal poisoning attack. Our work sheds light on the structure of data poisoning attacks on offline MARL, a necessary step before one can design more robust MARL algorithms. △ Less

Submitted 18 June, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

arXiv:2206.01888 [pdf, ps, other]

Reward Poisoning Attacks on Offline Multi-Agent Reinforcement Learning

Authors: Young Wu, Jeremy McMahan, Xiaojin Zhu, Qiaomin Xie

Abstract: In offline multi-agent reinforcement learning (MARL), agents estimate policies from a given dataset. We study reward-poisoning attacks in this setting where an exogenous attacker modifies the rewards in the dataset before the agents see the dataset. The attacker wants to guide each agent into a nefarious target policy while minimizing the $L^p$ norm of the reward modification. Unlike attacks on si… ▽ More In offline multi-agent reinforcement learning (MARL), agents estimate policies from a given dataset. We study reward-poisoning attacks in this setting where an exogenous attacker modifies the rewards in the dataset before the agents see the dataset. The attacker wants to guide each agent into a nefarious target policy while minimizing the $L^p$ norm of the reward modification. Unlike attacks on single-agent RL, we show that the attacker can install the target policy as a Markov Perfect Dominant Strategy Equilibrium (MPDSE), which rational agents are guaranteed to follow. This attack can be significantly cheaper than separate single-agent attacks. We show that the attack works on various MARL agents including uncertainty-aware learners, and we exhibit linear programs to efficiently solve the attack problem. We also study the relationship between the structure of the datasets and the minimal attack cost. Our work paves the way for studying defense in offline MARL. △ Less

Submitted 7 March, 2023; v1 submitted 3 June, 2022; originally announced June 2022.

arXiv:2108.12976 [pdf, ps, other]

Approximating Pandora's Box with Correlations

Authors: Shuchi Chawla, Evangelia Gergatsouli, Jeremy McMahan, Christos Tzamos

Abstract: We revisit the classic Pandora's Box (PB) problem under correlated distributions on the box values. Recent work of arXiv:1911.01632 obtained constant approximate algorithms for a restricted class of policies for the problem that visit boxes in a fixed order. In this work, we study the complexity of approximating the optimal policy which may adaptively choose which box to visit next based on the va… ▽ More We revisit the classic Pandora's Box (PB) problem under correlated distributions on the box values. Recent work of arXiv:1911.01632 obtained constant approximate algorithms for a restricted class of policies for the problem that visit boxes in a fixed order. In this work, we study the complexity of approximating the optimal policy which may adaptively choose which box to visit next based on the values seen so far. Our main result establishes an approximation-preserving equivalence of PB to the well studied Uniform Decision Tree (UDT) problem from stochastic optimization and a variant of the Min-Sum Set Cover ($\text{MSSC}_f$) problem. For distributions of support $m$, UDT admits a $\log m$ approximation, and while a constant factor approximation in polynomial time is a long-standing open problem, constant factor approximations are achievable in subexponential time (arXiv:1906.11385). Our main result implies that the same properties hold for PB and $\text{MSSC}_f$. We also study the case where the distribution over values is given more succinctly as a mixture of $m$ product distributions. This problem is again related to a noisy variant of the Optimal Decision Tree which is significantly more challenging. We give a constant-factor approximation that runs in time $n^{ \tilde O( m^2/\varepsilon^2 ) }$ when the mixture components on every box are either identical or separated in TV distance by $\varepsilon$. △ Less

Submitted 21 July, 2023; v1 submitted 29 August, 2021; originally announced August 2021.

arXiv:2108.04422 [pdf, other]

A $D$-competitive algorithm for the Multilevel Aggregation Problem with Deadlines

Authors: Jeremy McMahan

Abstract: In this paper, we consider the multi-level aggregation problem with deadlines (MLAPD) previously studied by Bienkowski et al. (2015), Buchbinder et al. (2017), and Azar and Touitou (2019). This is an online problem where the algorithm services requests arriving over time and can save costs by aggregating similar requests. Costs are structured in the form of a rooted tree. This problem has applicat… ▽ More In this paper, we consider the multi-level aggregation problem with deadlines (MLAPD) previously studied by Bienkowski et al. (2015), Buchbinder et al. (2017), and Azar and Touitou (2019). This is an online problem where the algorithm services requests arriving over time and can save costs by aggregating similar requests. Costs are structured in the form of a rooted tree. This problem has applications to many important areas such as multicasting, sensor networks, and supply-chain management. In particular, the TCP-acknowledgment problem, joint-replenishment problem, and assembly problem are all special cases of the delay version of the problem. We present a $D$-competitive algorithm for MLAPD. This beats the $6(D+1)$-competitive algorithm given in Buchbinder et al. (2017). Our approach illuminates key structural aspects of the problem and provides an algorithm that is simpler to implement than previous approaches. We also give improved competitive ratios for special cases of the problem. △ Less

Submitted 9 August, 2021; originally announced August 2021.

arXiv:2105.09377 [pdf, other]

doi 10.1145/3460945.3464953

Pure Tensor Program Rewriting via Access Patterns (Representation Pearl)

Authors: Gus Henry Smith, Andrew Liu, Steven Lyubomirsky, Scott Davidson, Joseph McMahan, Michael Taylor, Luis Ceze, Zachary Tatlock

Abstract: Tensor kernels in machine learning (ML) often correspond to pure mathematical expressions, making term rewriting an attractive strategy for optimization and mapping to specialized hardware accelerators. However, existing ML intermediate representations (IRs) tend to either be \textit{pure but high-level}, making low-level rewrites to hardware targets inexpressible, or \textit{low-level but impure}… ▽ More Tensor kernels in machine learning (ML) often correspond to pure mathematical expressions, making term rewriting an attractive strategy for optimization and mapping to specialized hardware accelerators. However, existing ML intermediate representations (IRs) tend to either be \textit{pure but high-level}, making low-level rewrites to hardware targets inexpressible, or \textit{low-level but impure}, hampering the use of term rewriting altogether. This paper introduces Glenside, a pure IR whose core abstraction -- the \textit{access pattern} -- enables low-level, layout-aware, hardware-centric program rewrites. We demonstrate how term rewriting in Glenside can be used to map program fragments to hardware accelerator invocations and automatically discover classic data layout transformations like \texttt{im2col}. Glenside establishes a new foundation for exploring further term rewriting techniques in optimizing low-level tensor programs. △ Less

Submitted 19 May, 2021; originally announced May 2021.

Comments: To be published at MAPS 2021

Showing 1–12 of 12 results for author: McMahan, J