In online reinforcement learning, exploration strategies that rely on intrinsic rewards tend to perform well in environments with deceptive or sparse rewards. Counting state visitations is an efficient count-based exploration method for deriving an appropriate intrinsic reward. However, this approach considers only the novelty of the states the agent encounters, which can lead to over-exploration of certain state-action pairs and convergence to a locally optimal solution. In this paper, a count-based method called the Visitation Count of State-Action Pairs (VCSAP) is proposed, which builds on the strong error-correction ability of online reinforcement learning. VCSAP counts the visitations of both individual states and state-action pairs, which not only drives the agent to visit novel states but also motivates it to select novel actions. MuJoCo is an advanced multi-joint dynamics simulator, and MuJoCo environments with sparse rewards are more challenging and closer to real-world settings. VCSAP is applied to Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO), and comparative experiments against exploration baselines are conducted on multiple tasks from the MuJoCo and sparse MuJoCo benchmarks. The experimental results show that, compared with the Random Network Distillation method, PPO-VCSAP and TRPO-VCSAP improve performance by 18% and 8%, respectively, across 8 environments.
Keywords: Count-based exploration method; Intrinsic reward; Online reinforcement learning; Visitation count.
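As a rough illustration of the count-based idea described above, the sketch below combines a state-visitation bonus with a state-action-visitation bonus into a single intrinsic reward. The class name `CountBasedBonus`, the inverse-square-root bonus form, the rounding-based discretization of continuous observations, and the weighting coefficients are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: count-based intrinsic reward from state and state-action
# visitation counts. Discretization, bonus form, and weights are assumptions.
from collections import defaultdict
import numpy as np


class CountBasedBonus:
    def __init__(self, beta_state=0.1, beta_state_action=0.1, precision=1):
        self.state_counts = defaultdict(int)          # N(s)
        self.state_action_counts = defaultdict(int)   # N(s, a)
        self.beta_state = beta_state
        self.beta_state_action = beta_state_action
        self.precision = precision                    # rounding used to hash continuous vectors

    def _key(self, x):
        # Discretize a continuous vector so it can serve as a dictionary key.
        return tuple(np.round(np.asarray(x, dtype=float), self.precision))

    def intrinsic_reward(self, state, action):
        s = self._key(state)
        sa = (s, self._key(action))
        self.state_counts[s] += 1
        self.state_action_counts[sa] += 1
        # Bonuses decay as visitation counts grow: novel states and novel
        # actions in familiar states both receive larger intrinsic rewards.
        state_bonus = self.beta_state / np.sqrt(self.state_counts[s])
        action_bonus = self.beta_state_action / np.sqrt(self.state_action_counts[sa])
        return state_bonus + action_bonus


# Usage inside a rollout: the agent is trained on extrinsic reward plus bonus.
bonus = CountBasedBonus()
state, action, extrinsic_reward = np.zeros(3), np.array([0.5]), 1.0
total_reward = extrinsic_reward + bonus.intrinsic_reward(state, action)
```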