Diversifying Policies With Non-Markov Dispersion to Expand the Solution Space

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):11392-11408. doi: 10.1109/TPAMI.2024.3455257. Epub 2024 Nov 6.

Abstract

Policy diversity, encompassing the variety of policies an agent can adopt, enhances reinforcement learning (RL) success by fostering more robust, adaptable, and innovative problem-solving in the environment. The environment in which standard RL operates is usually modeled with a Markov Decision Process (MDP) as the theoretical foundation. However, in many real-world scenarios, the rewards depend on an agent's history of states and actions leading to a non-MDP. Under the premise of policy diffusion initialization, non-MDPs may have unstructured expanding solution space due to varying historical information and temporal dependencies. This results in solutions having non-equivalent closed forms in non-MDPs. In this paper, deriving diverse solutions for non-MDPs requires policies to break through the boundaries of the current solution space through gradual dispersion. The goal is to expand the solution space, thereby obtaining more diverse policies. Specifically, we first model the sequences of states and actions by a transformer-based method to learn policy embeddings for dispersion in the solution space, since the transformer has advantages in handling sequential data and capturing long-range dependencies for non-MDP. Then, we stack the policy embeddings to construct a dispersion matrix as the policy diversity measure to induce the policy dispersion in the solution space and obtain a set of diverse policies. Finally, we prove that if the dispersion matrix is positive definite, the dispersed embeddings can effectively enlarge the disagreements across policies, yielding a diverse expression for the original policy embedding distribution. Experimental results of both non-MDP and MDP environments show that this dispersion scheme can obtain more expressive diverse policies via expanding the solution space, showing more robust performance than the recent learning baselines.