
Finite-Time Convergence and Sample Complexity of Actor-Critic Multi-Objective Reinforcement Learning

Authors: Tianchen Zhou, FNU Hairi, Haibo Yang, Jia Liu, Tian Tong, Fan Yang, Michinari Momma, Yan Gao

Abstract: Reinforcement learning with multiple, potentially conflicting objectives is pervasive in real-world applications, while this problem remains theoretically under-explored. This paper tackles the multi-objective reinforcement learning (MORL) problem and introduces an innovative actor-critic algorithm named MOAC which finds a policy by iteratively making trade-offs among conflicting reward signals. Notably, we provide the first analysis of finite-time Pareto-stationary convergence and corresponding sample complexity in both discounted and average reward settings. Our approach has two salient features: (a) MOAC mitigates the cumulative estimation bias resulting from finding an optimal common gradient descent direction out of stochastic samples. This enables provable convergence rate and sample complexity guarantees independent of the number of objectives; (b) With proper momentum coefficient, MOAC initializes the weights of individual policy gradients using samples from the environment, instead of manual initialization. This enhances the practicality and robustness of our algorithm. Finally, experiments conducted on a real-world dataset validate the effectiveness of our proposed method.

Submitted 9 May, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

Comments: Accepted in ICML 2024

arXiv:2403.15935

Sample and Communication Efficient Fully Decentralized MARL Policy Evaluation via a New Approach: Local TD update

Authors: Fnu Hairi, Zifan Zhang, Jia Liu

Abstract: In the actor-critic framework for fully decentralized multi-agent reinforcement learning (MARL), one of the key components is the MARL policy evaluation (PE) problem, where a set of $N$ agents work cooperatively to evaluate the value function of the global states for a given policy through communicating with their neighbors. In MARL-PE, a critical challenge is how to lower the sample and communication complexities, which are defined as the number of training samples and communication rounds needed to converge to some $ε$-stationary point. To lower communication complexity in MARL-PE, a "natural" idea is to perform multiple local TD-update steps between consecutive rounds of communication to reduce the communication frequency. However, the validity of the local TD-update approach remains unclear due to the potential "agent-drift" phenomenon resulting from heterogeneous rewards across agents in general. This leads to an interesting open question: Can the local TD-update approach entail low sample and communication complexities? In this paper, we make the first attempt to answer this fundamental question. We focus on the setting of MARL-PE with average reward, which is motivated by many multi-agent network optimization problems. Our theoretical and experimental results confirm that allowing multiple local TD-update steps is indeed an effective approach in lowering the sample and communication complexities of MARL-PE compared to consensus-based MARL-PE algorithms. Specifically, the number of local TD-update steps between two consecutive communication rounds can be as large as $\mathcal{O}(1/ε^{1/2}\log{(1/ε)})$ in order to converge to an $ε$-stationary point of MARL-PE. Moreover, we show theoretically that in order to reach the optimal sample complexity, the communication complexity of the local TD-update approach is $\mathcal{O}(1/ε^{1/2}\log{(1/ε)})$.

Submitted 23 March, 2024; originally announced March 2024.

Comments: Main body of the paper appeared in AAMAS24

arXiv:2012.06613

Beyond Scaling: Calculable Error Bounds of the Power-of-Two-Choices Mean-Field Model in Heavy-Traffic

Authors: Fnu Hairi, Xin Liu, Lei Ying

Abstract: This paper provides a recipe for deriving calculable approximation errors of mean-field models in heavy traffic, with a focus on the well-known load balancing algorithm power-of-two-choices (Po2). The recipe combines Stein's method for linearized mean-field models and State Space Concentration (SSC) based on geometric tail bounds. In particular, we divide the state space into two regions, a neighborhood near the mean-field equilibrium and the complement of that. We first use a tail bound to show that the steady-state probability of being outside the neighborhood is small. Then, we use a linearized mean-field model and Stein's method to characterize the generator difference, which provides the dominant term of the approximation error. From the dominant term, we are able to obtain an asymptotically tight bound and a nonasymptotic upper bound, both of which are calculable bounds rather than order-wise scaling results like most results in the literature. Finally, we compare the theoretical bounds with numerical evaluations to show the effectiveness of our results. We note that the simulation results show that both bounds are valid even for small systems, such as a system with only ten servers.

Submitted 29 October, 2021; v1 submitted 11 December, 2020; originally announced December 2020.

Showing 1–3 of 3 results for author: Hairi, F