In online reinforcement learning, exploration strategies that rely on intrinsic rewards tend to perform well in environments with deceptive or sparse rewards. Counting state visitations is an efficient count-based exploration method for deriving an appropriate intrinsic reward. However, this approach considers only the novelty of the states the agent encounters, which can lead to over-exploration of certain state-action pairs and convergence to a locally optimal solution. In this paper, a count-based method called the Visitation Count of State-Action Pairs (VCSAP) is proposed, which builds on the strong error-correction ability of online reinforcement learning. VCSAP counts the visitations of both individual states and state-action pairs, which not only drives the agent to visit novel states but also motivates it to select novel actions. MuJoCo is an advanced multi-joint dynamics simulator, and MuJoCo environments with sparse rewards are more challenging and closer to real-world settings. VCSAP is applied to Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO), and comparative experiments against exploration baselines are conducted on multiple tasks from the MuJoCo and sparse MuJoCo benchmarks. The experimental results show that, compared with the Random Network Distillation method, PPO-VCSAP and TRPO-VCSAP improve performance by 18% and 8%, respectively, across 8 environments.
Keywords: Count-based exploration method; Intrinsic reward; Online reinforcement learning; Visitation count.
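As a rough illustration of the count-based idea described above, the sketch below combines a state-visitation bonus with a state-action-visitation bonus into a single intrinsic reward. The class name `CountBasedBonus`, the inverse-square-root bonus form, the rounding-based discretization of continuous observations, and the weighting coefficients are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: count-based intrinsic reward from state and state-action
# visitation counts. Discretization, bonus form, and weights are assumptions.
from collections import defaultdict
import numpy as np


class CountBasedBonus:
    def __init__(self, beta_state=0.1, beta_state_action=0.1, precision=1):
        self.state_counts = defaultdict(int)          # N(s)
        self.state_action_counts = defaultdict(int)   # N(s, a)
        self.beta_state = beta_state
        self.beta_state_action = beta_state_action
        self.precision = precision                    # rounding used to hash continuous vectors

    def _key(self, x):
        # Discretize a continuous vector so it can serve as a dictionary key.
        return tuple(np.round(np.asarray(x, dtype=float), self.precision))

    def intrinsic_reward(self, state, action):
        s = self._key(state)
        sa = (s, self._key(action))
        self.state_counts[s] += 1
        self.state_action_counts[sa] += 1
        # Bonuses decay as visitation counts grow: novel states and novel
        # actions in familiar states both receive larger intrinsic rewards.
        state_bonus = self.beta_state / np.sqrt(self.state_counts[s])
        action_bonus = self.beta_state_action / np.sqrt(self.state_action_counts[sa])
        return state_bonus + action_bonus


# Usage inside a rollout: the agent is trained on extrinsic reward plus bonus.
bonus = CountBasedBonus()
state, action, extrinsic_reward = np.zeros(3), np.array([0.5]), 1.0
total_reward = extrinsic_reward + bonus.intrinsic_reward(state, action)
```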