TIMAR: Transition-informed representation for sample-efficient multi-agent reinforcement learning

Neural Netw. 2024 Dec 31:184:107081. doi: 10.1016/j.neunet.2024.107081. Online ahead of print.

Abstract

In multi-agent reinforcement learning (MARL), the trial-and-error learning paradigm requires massive interactions across multiple agents to produce training samples, which significantly increases both the cost and the difficulty of training. Improving data efficiency is therefore a core issue in MARL. However, because each agent only partially observes the environment, existing world-model approaches fail to account for agent interactions and coordination from an ego perspective, and this remains the main obstacle to improving the data efficiency of current MARL methods. To address this, motivated by the success of world models in single-agent RL and by findings in cognitive science, we devise a world-model-driven learning paradigm that enables agents to form a more holistic representation of their individual observations of the environment. Specifically, we present the Transition-Informed Multi-Agent Representations (TIMAR) framework, which leverages a joint transition model, i.e., a surrogate world model that captures the dynamics of the multi-agent system, to learn effective representations among agents through a self-supervised learning objective. This objective encourages consistency between predicted and actual future observations, allowing the model to learn without explicit labels. TIMAR incorporates an auxiliary module that predicts future transitions from sequences of observations and actions, allowing agents to infer the latent state of the system and to account for the influence of other agents. Unlike traditional MARL approaches that focus primarily on efficient policy improvement, TIMAR is designed to learn useful semantic representations from high-dimensional observations, enabling the MARL algorithm built on these representations to achieve improved data efficiency. Experimental evaluation of TIMAR in various MARL environments demonstrates significantly improved performance and data efficiency over strong baselines such as MAPPO, HAPPO, fine-tuned QMIX, MAT, and MA2CL. In addition, we find that TIMAR also improves the generalization of Transformer-based MARL algorithms such as MAT.
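To make the described objective concrete, below is a minimal PyTorch sketch of the kind of auxiliary transition-prediction module the abstract describes: per-agent observations are encoded into latents, a joint transition model predicts every agent's next latent from the joint latents and actions, and a self-supervised consistency loss aligns predictions with the encoding of the actual next observations. All names (ObsEncoder, JointTransitionModel), network sizes, and the specific cosine-consistency loss form are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; the actual TIMAR architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ObsEncoder(nn.Module):
    """Maps each agent's raw observation to a latent representation."""
    def __init__(self, obs_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, obs):  # obs: (batch, n_agents, obs_dim)
        return self.net(obs)


class JointTransitionModel(nn.Module):
    """Surrogate world model: predicts every agent's next latent from the
    joint latents and joint actions, so each prediction can account for
    the influence of the other agents."""
    def __init__(self, latent_dim: int, act_dim: int, n_agents: int):
        super().__init__()
        in_dim = n_agents * (latent_dim + act_dim)
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, n_agents * latent_dim),
        )
        self.n_agents, self.latent_dim = n_agents, latent_dim

    def forward(self, z, a):  # z: (B, N, latent_dim), a: (B, N, act_dim)
        x = torch.cat([z, a], dim=-1).flatten(start_dim=1)
        return self.net(x).view(-1, self.n_agents, self.latent_dim)


def transition_consistency_loss(encoder, target_encoder, transition,
                                obs, act, next_obs):
    """Self-supervised objective: predicted next latents should match the
    (stop-gradient) encoding of the actual next observations, so no
    explicit labels are needed."""
    z = encoder(obs)
    z_pred = transition(z, act)
    with torch.no_grad():  # target branch supplies the regression target
        z_next = target_encoder(next_obs)
    # Negative cosine similarity, a common consistency-loss choice
    # (assumed here; the paper may use a different formulation).
    return -F.cosine_similarity(z_pred, z_next, dim=-1).mean()
```

In practice, a loss of this kind would be added as an auxiliary term alongside the policy objective of the underlying MARL algorithm (e.g., MAPPO or MAT), with the target encoder typically maintained as a slowly updated copy of the online encoder; these training details are likewise assumptions based on standard self-supervised practice rather than specifics stated in the abstract.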

Keywords: Multi-agent reinforcement learning; Representation learning; Self-supervised learning; Transformers.