Improving Sample Efficiency of Reinforcement Learning with Background Knowledge from Large Language Models

Fuxiang Zhang, Junyou Li, Yi-Chen Li, Zongzhang Zhang, Yang Yu, Deheng Ye

Fuxiang Zhang, Yi-Chen Li, Zongzhang Zhang, and Yang Yu are with National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China and School of Artificial Intelligence, Nanjing University, Nanjing 210023, China. Junyou Li and Deheng Ye are with Tencent, Shenzhen 518054, China.
Abstract

Low sample efficiency is an enduring challenge of reinforcement learning (RL). With the advent of versatile large language models (LLMs), recent works impart common-sense knowledge to accelerate policy learning for RL processes. However, we note that such guidance is often tailored for one specific task but loses generalizability. In this paper, we introduce a framework that harnesses LLMs to extract background knowledge of an environment, which contains general understandings of the entire environment, making various downstream RL tasks benefit from one-time knowledge representation. We ground LLMs by feeding a few pre-collected experiences and requesting them to delineate background knowledge of the environment. Afterward, we represent the output knowledge as potential functions for potential-based reward shaping, which has a good property for maintaining policy optimality from task rewards. We instantiate three variants to prompt LLMs for background knowledge, including writing code, annotating preferences, and assigning goals. Our experiments show that these methods achieve significant sample efficiency improvements in a spectrum of downstream tasks from Minigrid and Crafter domains.

Index Terms:
Reinforcement learning, reward shaping, knowledge representation

I Introduction

Figure 1: An illustration of our framework to extract background knowledge from LLMs for reward shaping in downstream RL tasks. We sample experiences from pre-collected data and request LLM feedback in different forms including code, preference, or goals. The obtained feedback is represented as potential functions for potential-based reward shaping in downstream RL tasks.

Reinforcement learning (RL) [1] has achieved notable success in various domains including game AI [2, 3, 4], robotics [5], and natural language processing [6]. Conventionally, the success of RL hinges on extensive interactions with the environment, making low sample efficiency a major challenge in the RL literature [7, 8]. This issue is particularly acute in environments with sparse rewards, inspiring efforts to design auxiliary rewards and enhance exploration. Researchers commonly employ various mechanisms to improve the sample efficiency of RL, such as intrinsic motivations [9] based on novelty, curiosity, or uncertainty [10, 11, 12, 13], and external knowledge sources including human annotations, knowledge bases, or foundation models [14, 15, 16].

As highlighted in recent research [17, 18], large language models (LLMs) such as GPT-4 [19] have shown remarkable ability in instruction following with common-sense knowledge. This capability has led to innovative uses of LLMs in the field of RL, where recent studies develop approaches such as goal decomposition [20, 21, 22] and code writing [23, 24, 25]. Unlike exploration with general human-designed metrics such as curiosity and uncertainty, LLM-assisted RL approaches provide training signals tailored to specific RL tasks. However, this specificity also comes with a downside. The heavy reliance on task-specific prompting may lead to an inability to produce reusable knowledge, particularly in open-ended domains where solving each task with its own prompting process can be both costly and time-consuming.

In this paper, we utilize LLMs to represent the background knowledge of an environment, thereby offering a more general and effective means to guide RL without repetitive LLM calls for different tasks. The proposed background knowledge, which is independent of specific tasks, serves as the preliminary knowledge of an environment. For instance, the knowledge to avoid walls and obstacles in a grid world or to grab food in a survival game is generally useful in related environments regardless of the task being executed. We believe that the background knowledge can also be expressed as reward signals and propose a framework to extract and reuse such knowledge for downstream RL tasks based on three desiderata: being interaction-free, task-agnostic, and optimality-invariant.

As depicted in Figure 1, we leverage a dataset containing experiences from multiple tasks to ground LLMs for decision-making problems, which avoids extensive interactions with the environment. Afterward, we prompt the LLM to provide feedback on data samples based on its general understanding of the environment, forming task-agnostic background knowledge. The obtained knowledge is represented as potential functions for potential-based reward shaping [26], a technique that shapes RL processes without changing policy optimality, to accelerate RL in downstream tasks. As different potential functions may influence RL to different extents, we adopt three variants for background knowledge representation inspired by previous research on harnessing LLMs: writing code, annotating preferences, and suggesting goals. Experimental results in the Minigrid and Crafter domains show that these variants all yield substantial sample-efficiency improvements. Furthermore, we find that background knowledge can be reused for new task types or larger task scales, demonstrating the generalization ability of the extracted background knowledge. We also discuss the sensitivity of our proposed variants to different choices of language models and data. Our contributions can be summarized as follows:

  • We propose a framework that harnesses LLMs to provide background knowledge of an environment and thereby accelerates downstream RL tasks via potential-based reward shaping.

  • Based on the proposed framework, we develop three variants to represent background knowledge from LLM feedback for the reward-shaping procedure.

  • We show that acquired background knowledge can significantly improve sample efficiency and generalize well to previously unseen tasks.

II Related Work

Reward Shaping in RL. RL commonly faces poor sample efficiency [7], especially in sparse-reward tasks. Researchers establish theories and methods to enhance sample efficiency by improving agent exploration and exploitation [8, 27]. The most common approach to improving sample efficiency is reshaping the training signals of RL processes, i.e., reward shaping [26]. Recently, there has been a surge in shaping rewards with language by utilizing human annotations [14, 28], deducing state novelty from language abstractions [29, 30], or setting language goals [31, 32]. Unlike these methods, our work aims to extract the underlying background knowledge of the environment, and it adopts language only as an interface to acquire knowledge rather than exploiting language structures for RL. Although some prior works also try to integrate external knowledge for sample-efficient RL [15, 33], they commonly rely on underlying structures of tasks such as symbolic input. In contrast, our work does not assume specific task designs but proposes a general framework to harness general-purpose LLMs for reward shaping.

LLM-assisted RL. Combined with proper techniques such as in-context learning [34] and chain-of-thought prompting [35], recent works show that LLMs are knowledgeable enough to master decision-making tasks [36, 37]. However, directly using LLMs to solve complicated tasks can be difficult, whereas leveraging their powerful capabilities to guide RL processes can be a remedy. Some prior works directly use pretrained language models as policies and fine-tune their parameters with RL [38, 39, 40]. Though effective, these methods require text-based states and actions for making decisions and usually incur high computational cost. Another line of research tries to guide RL with the common-sense knowledge of LLMs, which is the domain our framework falls into. Some prior works utilize LLMs as an intermediary to provide language instructions [41, 42, 21] and train language-conditioned policies [43]. Other works focus on prompting LLMs to write the code of reward functions [24, 23, 25] or to check the completion of task goals [44]. However, these approaches are often tailored for specific RL tasks, so the particular mechanism may be both difficult to apply and considerably costly when deployed to different tasks. In our paper, we propose a general framework to utilize background knowledge for reward shaping, thereby improving sample efficiency for various downstream RL tasks within an environment. Different from previous works on harnessing LLMs for policy pretraining [21], our framework provides a lightweight reward-shaping process to integrate the knowledge of LLMs without querying them during RL processes.

III Framework

III-A Problem Statement

We consider sequential decision-making problems in an open environment $\mathcal{E}$, in which a specific task $\mathcal{T}$ corresponds to a reinforcement learning (RL) problem. An agent interacts with the environment in discrete time to realize specific requirements defined by the task. Typically, an RL task can be modeled as a Markov decision process (MDP) [1] $\mathcal{T}=(\mathcal{S},\mathcal{A},r,P,\rho,\gamma)$. An agent starting at initial state $s_0\sim\rho(\cdot)$ selects action $a_t\in\mathcal{A}$, which leads to a transition to $s_{t+1}\sim P(\cdot\mid s_t,a_t)$ and yields reward $r_t=r(s_t,a_t)$. An RL algorithm usually learns a policy $\pi(a\mid s)$ by maximizing the discounted cumulative reward, a.k.a. return, $R=\sum_{t=0}^{\infty}\gamma^{t}r_t$. Since an agent often perceives only a partial view of the open environment, the received input in our experiments may be a partial observation. With a slight abuse of notation, we do not explicitly distinguish states and observations in our methodology for simplicity, since we do not delve into the theoretical properties of partial observability. A practical approach is to use the trajectory of agent history $\tau_t=(s_0,a_0,\dots,s_{t-1},a_{t-1},s_t)$ to represent the agent state. Our work focuses on training policies $\pi(a\mid s)$ in any task $\mathcal{T}$ from an environment, where the variation across tasks commonly comes from different targets and maps in our experiments.

To tackle the problem of poor sample efficiency in RL, potential-based reward shaping [26, 45] is a useful technique for altering RL processes. Typically, this technique adds an additional reward term $F(s,s')$ to the original environmental reward $r$, whose value is computed with a defined potential function $\phi(s)$:

$$F(s,s')=\gamma\phi(s')-\phi(s), \tag{1}$$

where $s$ and $s'$ are the current state and the next state, respectively, and $\gamma$ is the discount factor of the MDP. An advantage of applying potential-based reward shaping is that it maintains the policy optimality of the original problem [26]. Consider an MDP $\mathcal{T}$ where $r_t$ is the extrinsic reward received from the environment at time step $t$. We denote $G=\sum_{t=0}^{\infty}\gamma^{t}r_t$ as the discounted return of an episode in the original MDP. Reward shaping refers to adding a shaped reward $F(s,s')$ to the extrinsic reward $r$:

$$r'=r+F(s,s'). \tag{2}$$
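As a minimal sketch of Equations (1) and (2), the shaped reward can be computed from any potential function. The names `shaping_term`, `shaped_reward`, and the toy potential `neg_manhattan_to_origin` are illustrative assumptions and not part of the paper's implementation.

```python
from typing import Callable, Hashable

Potential = Callable[[Hashable], float]


def shaping_term(phi: Potential, s, s_next, gamma: float) -> float:
    """F(s, s') = gamma * phi(s') - phi(s), as in Equation (1)."""
    return gamma * phi(s_next) - phi(s)


def shaped_reward(r: float, phi: Potential, s, s_next, gamma: float) -> float:
    """r' = r + F(s, s'), as in Equation (2)."""
    return r + shaping_term(phi, s, s_next, gamma)


# Hypothetical potential on integer grid cells: cells closer to (0, 0) score higher.
def neg_manhattan_to_origin(state) -> float:
    x, y = state
    return -float(abs(x) + abs(y))


# Moving from (2, 1) to (1, 1) with zero task reward and gamma = 0.5:
# F = 0.5 * (-2) - (-3) = 2.0, so the shaped reward is 2.0.
example = shaped_reward(0.0, neg_manhattan_to_origin, (2, 1), (1, 1), gamma=0.5)
```

The environment reward `r` is passed in unchanged, so the shaping stays a purely additive term on top of the task reward.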

Note that reward shaping does not modify the reward function of the environment but supplements it with additional rewards during learning. The purpose of the shaping function $F$ is often to provide heuristic domain knowledge to the problem when the agent transitions from state $s$ to $s'$. We thus define the shaped return of an episode by

$$G'=\sum_{t=0}^{\infty}\gamma^{t}r'_t=\sum_{t=0}^{\infty}\gamma^{t}\bigl(r_t+F(s_t,s_{t+1})\bigr). \tag{3}$$

Potential-based reward shaping is a particular type of reward shaping. For potential-based reward shaping, the shaping function is in the form of

$$F(s,s')=\gamma\phi(s')-\phi(s), \tag{4}$$

where $\gamma$ is the same discount factor as in the MDP. We call $\phi$ the potential function and $\phi(s)$ the potential of state $s$. Therefore, we can define the potential function $\phi:\mathcal{S}\rightarrow\mathbb{R}$ instead of defining $F:\mathcal{S}\times\mathcal{S}\rightarrow\mathbb{R}$. The reason for choosing this form of reward shaping is that it preserves the optimal policy. Consider the return for one episode:

$$
\begin{aligned}
G' &= \sum_{t=0}^{\infty}\gamma^{t}\bigl(r_t+F(s_t,s_{t+1})\bigr) \\
&= \sum_{t=0}^{\infty}\gamma^{t}\bigl(r_t+\gamma\phi(s_{t+1})-\phi(s_t)\bigr) \\
&= \sum_{t=0}^{\infty}\gamma^{t}r_t+\sum_{t=0}^{\infty}\gamma^{t+1}\phi(s_{t+1})-\sum_{t=0}^{\infty}\gamma^{t}\phi(s_t) \\
&= \sum_{t=0}^{\infty}\gamma^{t}r_t+\sum_{t=1}^{\infty}\gamma^{t}\phi(s_t)-\phi(s_0)-\sum_{t=1}^{\infty}\gamma^{t}\phi(s_t) \\
&= G-\phi(s_0),
\end{aligned}
\tag{5}
$$

where we can decompose $G'$ into the cumulative reward on the original MDP and the potential of the initial state $\phi(s_0)$. As the initial state $s_0$ can be any arbitrary state, we can easily extend this equality to a shaped Q-function, for example, $Q'(s,a)=Q(s,a)-\phi(s)$. Therefore, any RL algorithm that maximizes the cumulative reward or the Q-values will derive the same optimal policies after reward shaping, since the term $\phi(s)$ is not related to action selection.
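The identity $G'=G-\phi(s_0)$ can be checked numerically on a finite episode. The sketch below assumes the episode ends in a terminal state whose potential is zero, so that the leftover term $\gamma^{T}\phi(s_T)$ of the finite sum vanishes; the rewards and potentials are random illustrative values, and the helper names are hypothetical.

```python
import random


def discounted_return(rewards, gamma):
    """Sum over t of gamma^t * r_t for a finite episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def shape(rewards, potentials, gamma):
    """Add F(s_t, s_{t+1}) = gamma * phi(s_{t+1}) - phi(s_t) to every reward.

    potentials[t] holds phi(s_t), so it has one more entry than rewards.
    """
    return [
        r + gamma * potentials[t + 1] - potentials[t]
        for t, r in enumerate(rewards)
    ]


rng = random.Random(0)
gamma, horizon = 0.99, 50
rewards = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
# Assumption: the terminal state has potential 0.
potentials = [rng.uniform(-5.0, 5.0) for _ in range(horizon)] + [0.0]

g = discounted_return(rewards, gamma)
g_shaped = discounted_return(shape(rewards, potentials, gamma), gamma)

# The two sides of Equation (5) should agree up to floating-point error.
gap = abs(g_shaped - (g - potentials[0]))
```

Because the offset $\phi(s_0)$ depends only on the start state, ranking action sequences by `g_shaped` gives the same order as ranking them by `g`.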

The optimality invariance of potential-based reward shaping was established in early research [26, 45] and is widely applied in recent RL works [14, 16]. We note that although potential-based reward shaping guarantees the final result, it may alter the optimization dynamics in a way that either accelerates or slows down policy learning. A well-designed potential function $\phi(s)$ can decrease the time to convergence, while a poorly designed one may increase it. In our work, we propose three approaches to design potential functions that reflect the background knowledge of the environment from LLM feedback. We show in our experiments that our designs of the potential function can significantly improve sample efficiency, leading to a shorter convergence time.

To ground LLMs for decision-making tasks, it is conventional to transform trajectories into text captions. Though this requirement may be relaxed by further improvements in multimodal LLMs, we assume that text captions are available in our pre-collected data for querying LLMs. Typically, the description of a trajectory can be derived from state captions and action names.
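One simple way to build such a description is to interleave per-step state captions with action names. The sketch below is illustrative only: `caption_trajectory`, the step format, and the example captions are assumptions, not the prompt format used in the paper.

```python
from typing import Sequence


def caption_trajectory(state_captions: Sequence[str], action_names: Sequence[str]) -> str:
    """Interleave state captions and action names into one text description.

    Expects one more state caption than action names: s_0, a_0, ..., a_{T-1}, s_T.
    """
    if len(state_captions) != len(action_names) + 1:
        raise ValueError("need exactly one more state caption than actions")
    lines = []
    for t, action in enumerate(action_names):
        lines.append(f"Step {t}: {state_captions[t]} Action: {action}.")
    # The final state has no following action.
    lines.append(f"Step {len(action_names)}: {state_captions[-1]}")
    return "\n".join(lines)


text = caption_trajectory(
    ["You see a wall ahead.", "You see a key on your left.", "You are next to a key."],
    ["turn left", "move forward"],
)
```

The resulting string can be placed directly into an LLM prompt; richer environments would likely need more structured captions than this.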

Figure 2: The proposed three variants of background knowledge representation from pre-collected data. (a) We query an LLM to write code that returns high values for behaviors with desired background knowledge. We ask the LLM to iteratively improve the code from sampled results. (b) We prompt an LLM to annotate its preference over two trajectories and then learn the potential function $\phi(s)$ that decomposes preferences. (c) We sample trajectories from the dataset and ask the LLM to suggest potential goals. The pairs of captions and goals are stored in a text-based goal library. To use the goal library for downstream RL, we retrieve results whose trajectories are similar to the agent history and compute goal similarity with the current state.

III-B Reward Shaping with Background Knowledge

In this section, we describe our framework, which harnesses LLMs to obtain background knowledge of an environment and then uses it to accelerate downstream RL tasks. As there exist various ways to guide RL with LLM feedback, we summarize the desiderata of our framework below to better illustrate our requirements:

  • Interaction-free. The process of acquiring background knowledge from LLMs should not be attached to an online RL process, which would involve a considerable amount of interaction with the environment.

  • Task-agnostic. The background knowledge acquired in our framework should represent a general understanding of the environment, rather than involving behaviors about specific tasks.

  • Optimality-invariant. Injecting background knowledge into RL should not change policy optimality; the knowledge serves only to improve sample efficiency.

Based on these three principles, we design a framework to extract background knowledge for reward shaping as previously shown in Figure 1. To satisfy the desiderata, we highlight three key procedures in the framework:

Data collection. To satisfy the interaction-free property, we pre-collect a dataset $\mathcal{D}=\{(s,a)\}$ by directly deploying an RL algorithm in an open environment without specifying a task goal. Therefore, the dataset contains only states and actions that delineate possible interactions in the environment. Specifically, we adopt RND [12] and save its interaction data periodically, following the standard practice in the offline RL literature [46]. As an intrinsically motivated RL approach, RND tends to explore novel states in the open environment. Therefore, the collected data can contain diverse and useful behaviors for deriving background knowledge.

Background knowledge representation. To ground LLMs for the environment, we iteratively sample trajectories from the pre-collected data and ask LLMs for their feedback. During this procedure, the LLM makes assessments based on captions of agent trajectories but remains unaware of the task goal, producing task-agnostic background knowledge. The knowledge is then represented as a potential function $\phi(s)$ by gathering LLM feedback over data samples; we defer the different variants of background knowledge representation to the next section.

RL with reward shaping. Using the potential function $\phi(s)$, we can adopt potential-based reward shaping for downstream RL tasks according to Equation (1), which shapes RL processes while preserving policy optimality. We adopt the PPO algorithm [47] based on the implementation from CleanRL [48] to train policies in different tasks.

The above three phases satisfy our proposed desiderata, composing a common framework to extract and reuse background knowledge for downstream RL tasks. We note that data collection and the RL process can be realized by different approaches. In our experiments, we use the CleanRL [48] implementation of PPO [47], a stable and succinct version that readily accommodates our reward-shaping technique. The only modification to the training process is to add the auxiliary reward to the original environmental reward. To adapt to our selected environments, we use convolutional neural networks to extract features from the input observation. The features are then fed into separate multi-layer perceptrons (MLPs) to compute the action logits and the value function. Due to the different observation spaces of our evaluated environments, the feature processing networks are slightly different. We show the network structure of the RL agents and report the hyperparameters of the algorithms in our appendix; most of the hyperparameters are taken directly from the CleanRL implementation, except for some coefficients related to environment sampling and auxiliary rewards.
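Since the only change to training is the added auxiliary reward, the shaping can live in a thin wrapper around the environment. The following is a minimal sketch and not the CleanRL-based code used in the paper: the `reset`/`step` interface returning `(obs, reward, done)`, the `ChainEnv` toy environment, and the choice to treat terminal potentials as zero are all assumptions made for illustration.

```python
class ChainEnv:
    """Toy 1-D chain: start at 0, reach position `goal` for a sparse reward of 1."""

    def __init__(self, goal=3):
        self.goal = goal
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is +1 (right) or -1 (left); the position is clipped at 0.
        self.pos = max(0, self.pos + action)
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done


class PotentialShaping:
    """Adds F(s, s') = gamma * phi(s') - phi(s) to the environment reward."""

    def __init__(self, env, phi, gamma):
        self.env, self.phi, self.gamma = env, phi, gamma
        self._last_obs = None

    def reset(self):
        self._last_obs = self.env.reset()
        return self._last_obs

    def step(self, action):
        obs, reward, done = self.env.step(action)
        # Assumption: terminal states get potential 0 so the shaping terms telescope.
        next_potential = 0.0 if done else self.phi(obs)
        shaped = reward + self.gamma * next_potential - self.phi(self._last_obs)
        self._last_obs = obs
        return obs, shaped, done


# Hypothetical potential: positions further to the right are more promising.
env = PotentialShaping(ChainEnv(goal=3), phi=lambda pos: float(pos), gamma=0.9)
env.reset()
shaped_rewards = []
done = False
while not done:
    _, shaped, done = env.step(+1)
    shaped_rewards.append(shaped)
```

With this wrapper the learner sees a dense signal on every step, while the underlying task reward stays sparse and untouched inside `ChainEnv`.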

TABLE I: Comparisons of the proposed three variants on background knowledge representations from LLM feedback.

| Variant | Sources of background knowledge | Additional model during RL | Text captions during RL | Unstructured input support (e.g., images) |
|---|---|---|---|---|
| BK-Code | LLM-coded functions | | | |
| BK-Pref | LLM’s preference from data | parameterized $\phi(s)$ | | |
| BK-Goal | goals suggested by LLMs | a sentence encoder | | |

IV Background Knowledge Representation

In this section, we introduce three different variants to request LLMs for background knowledge. As previous works have made several endeavors to ground LLMs for decision-making tasks, we find that relevant approaches on prompting LLMs are also effective in representing background knowledge and propose three variants including code programming (Section IV-A), preference annotation (Section IV-B), and goal suggestion (Section IV-C). We design effective yet simple approaches to query LLMs for background knowledge and transform results into required potential functions.

IV-A Code Programming (BK-Code)

With pretraining on code data, LLMs exhibit powerful abilities in programming and reasoning [49, 50, 51], enabling successful applications that write code for guiding RL [24, 25, 23]. Most of these works prompt an LLM to write the code of reward functions that depict desired targets. As coding is a direct approach to conveying LLM knowledge, we introduce an iterative prompting procedure in which LLMs write and then improve their code. We design straightforward prompts to ask LLMs to provide a Python function whose arguments come from the current observation and agent histories. By providing agent trajectories, we allow LLMs to analyze agent behaviors more thoroughly. Our prompt contains three parts:

  • Environmental information. We attach the environment description and necessary constants and variables of the environment, serving as a header of the code.

  • Trajectory samples. We sample trajectory data from the pre-collected data and caption them to better ground LLMs for the environment.

  • Feedback from samples. We use the current code to compute results from sampled data and append the results as an evaluation.

To be more specific, we prompt LLMs to write code for a function def compute_reward(obs, past_obs, past_actions). Although we use the computation results as the potential values $\phi(s)$, we name the function compute_reward, a slight misnomer, since this name is easier for LLMs to understand. The parameters of the function include the current observation, historical observations, and historical actions, where we truncate the length of the historical sequence to 5 for efficiency. The LLM is supposed to write a function that returns the expected value of the potential function and also a dictionary that describes different portions of the result. This additional dictionary gives the LLM diverse feedback and thus helps improve the code; it is also attached to the next query. The code provided by the LLM may fail to run in the Python interpreter, especially for weaker LLMs. In this case, we attach the error log reported by the interpreter and ask the LLM to fix potential issues. The LLM has multiple retries to write runnable code; if all of them fail, we stop the iteration and use the latest successful code as the final result. For each iteration, we sample trajectories from the dataset and feed them into the written code to obtain the output values and the information dictionary. Afterward, we attach the text captions of the trajectories along with the dictionary information and ask the LLM to improve the code. We repeat this process for 20 rounds, a moderate number, since we find that additional iterations may result in too much unnecessary or inaccurate code.
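The iterative procedure described above can be sketched as follows. This is a hedged illustration rather than the paper's implementation: `refine_code`, `load_function`, `evaluate`, the `query_llm` callable and its arguments, the retry budget `max_retries`, and the toy stubs are hypothetical, and observations are simplified to plain dictionaries. Only the signature `compute_reward(obs, past_obs, past_actions)`, the (value, dictionary) return convention, the history length of 5, and the 20 rounds come from the text.

```python
HISTORY_LEN = 5   # history truncation stated in the text
NUM_ROUNDS = 20   # number of refinement rounds stated in the text


def load_function(source):
    """Execute LLM-written source and return its compute_reward function."""
    namespace = {}
    exec(source, namespace)  # NOTE: run untrusted code only inside a sandbox
    return namespace["compute_reward"]


def evaluate(fn, trajectories):
    """Run the candidate on sampled trajectories; collect values and info dicts."""
    results = []
    for observations, actions in trajectories:
        t = len(actions)
        value, info = fn(
            observations[t],
            observations[max(0, t - HISTORY_LEN):t],
            actions[max(0, t - HISTORY_LEN):t],
        )
        results.append((float(value), dict(info)))
    return results


def refine_code(query_llm, sample_trajectories, rounds=NUM_ROUNDS, max_retries=3):
    """Iteratively ask the LLM to write and improve compute_reward.

    Returns the latest source that ran without error, or None if none did.
    """
    best_source, feedback = None, "no feedback yet"
    for _ in range(rounds):
        trajectories = sample_trajectories()
        error_log = None
        for _ in range(max_retries):
            source = query_llm(best_source, trajectories, feedback, error_log)
            try:
                results = evaluate(load_function(source), trajectories)
            except Exception as exc:  # attach the error and let the LLM retry
                error_log = repr(exc)
                continue
            best_source, feedback = source, repr(results)
            break
        else:
            break  # retries exhausted: keep the latest successful code
    return best_source


# Toy stand-ins so the sketch runs end to end (both are hypothetical).
STUB_SOURCE = '''
def compute_reward(obs, past_obs, past_actions):
    near_wall = 1.0 if obs.get("facing_wall") else 0.0
    info = {"wall_penalty": -near_wall}
    return -near_wall, info
'''


def stub_llm(previous_source, trajectories, feedback, error_log):
    return STUB_SOURCE


def stub_sampler():
    observations = [{"facing_wall": False}, {"facing_wall": True}]
    return [(observations, ["move forward"])]


final_source = refine_code(stub_llm, stub_sampler, rounds=2)
```

In practice `query_llm` would assemble the environment header, the trajectory captions, and the previous feedback into a prompt; the stub here simply returns a fixed function so that the control flow can be exercised.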

IV-B Preference Annotation (BK-Pref)

As LLMs show the potential to be a helpful evaluation tool [52, 53], many prior works utilize the feedback of LLMs to align learned models [54, 55, 56]. We note that the preference over trajectories can also be a useful metric to delineate background knowledge and thus develop an approach that learns the potential function from LLM feedback. We sample two trajectories $\tau^{0},\tau^{1}$ from pre-collected data and ask LLMs to provide preferences over the given pair. The sampled trajectory sequences are truncated to $H$ steps $\tau=(s_1,a_1,\dots,s_H,a_H)$ for fair comparisons. The preference label can be $y=0$ for $\tau^{0}>\tau^{1}$ or $y=1$ for $\tau^{1}>\tau^{0}$, resulting in a preference dataset $\mathcal{D}_{\text{pref}}=\{(\tau^{0},\tau^{1},y)\}$. Following previous preference-based RL studies [57, 58], we can define a preference predictor following the Bradley-Terry model [59]:

P[τ1>τ0]=exp(tϕ(st1))exp(tϕ(st0))+exp(tϕ(st1)),𝑃delimited-[]superscript𝜏1superscript𝜏0subscript𝑡italic-ϕsuperscriptsubscript𝑠𝑡1subscript𝑡italic-ϕsuperscriptsubscript𝑠𝑡0subscript𝑡italic-ϕsuperscriptsubscript𝑠𝑡1P[\tau^{1}>\tau^{0}]=\frac{\exp\left(\sum_{t}\phi(s_{t}^{1})\right)}{\exp(\sum% _{t}\phi(s_{t}^{0}))+\exp(\sum_{t}\phi(s_{t}^{1}))},italic_P [ italic_τ start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT > italic_τ start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ] = divide start_ARG roman_exp ( ∑ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_ϕ ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ) ) end_ARG start_ARG roman_exp ( ∑ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_ϕ ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ) ) + roman_exp ( ∑ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT italic_ϕ ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ) ) end_ARG , (6)

where $\phi(s)$ is the potential function. Therefore, we can learn $\phi(s)$ by maximizing the likelihood of Equation (6) on our annotated dataset $\mathcal{D}_{\text{pref}}$ with the following objective:

$$\mathcal{L}(\theta) = -\underset{(\tau^0, \tau^1, y) \in \mathcal{D}_{\text{pref}}}{\mathbb{E}} \big[ y \log P[\tau^1 > \tau^0; \phi] + (1 - y) \log P[\tau^0 > \tau^1; \phi] \big], \qquad (7)$$

where $\theta$ denotes the parameters of $\phi$. Following [60], we adopt the Transformer structure to compute $\phi(s)$ with historical information. For each preference label, we sample two trajectories $\tau^0, \tau^1$ from pre-collected data and ask LLMs to provide preferences over the given pair. The trajectory sequence is truncated to a length of $H = 5$ and is then captioned to compose the input prompt. We annotate labels $y = 0$ for $\tau^0 > \tau^1$ and $y = 1$ for $\tau^1 > \tau^0$, forming a dataset $\mathcal{D}$ of labeled data. In addition, we find that LLMs may decline to rank a pair because they consider the two trajectories equally preferable. To utilize this portion of the data, we add a log-likelihood term to the original loss on the unlabeled data $\mathcal{D}'$:

$$\mathcal{L}'(\theta) = -\underset{(\tau^0, \tau^1) \in \mathcal{D}'}{\mathbb{E}} \big[ \log P[\tau^1 > \tau^0; \phi] + \log P[\tau^0 > \tau^1; \phi] \big]. \qquad (8)$$

We mix $\mathcal{L}(\theta)$ and $\mathcal{L}'(\theta)$ when optimizing the preference predictor. To parameterize the potential function $\phi(s)$, it is conventional to use a neural network that takes the agent state $s$ as input. However, following Preference Transformer [60], we find it more helpful to make $\phi$ a non-Markovian function, since previous observations and actions contain useful information for judging the potential of the current state. Therefore, we adopt the Transformer structure to capture sequential information. We feed the input sequence $(s_1, a_1, \dots, s_H, a_H)$ into the Transformer and aggregate the output representations at each state $s_t$ to compute the logits with the additional preference attention layer proposed by [60]. The Transformer has an embedding dimension of 128, 1 layer, and 1 attention head. We adopt the same feature extractor as the RL policy to process input observations and a simple embedding layer to process actions.
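The preference objective in Equations (6) to (8) can be illustrated with a small, dependency-free sketch. It assumes the per-step potentials are already available as plain lists of floats (in the paper they come from a Transformer, which is omitted here); the function names and the `tie_weight` mixing coefficient are hypothetical, since the text does not state how the two losses are weighted.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def log_prob_prefer_1(phi_0, phi_1):
    """log P[tau^1 > tau^0] from Eq. (6), given per-step potentials."""
    return log_sigmoid(sum(phi_1) - sum(phi_0))


def labeled_loss(pairs):
    """Eq. (7): pairs is a list of (phi_0, phi_1, y) with y in {0, 1}."""
    total = 0.0
    for phi_0, phi_1, y in pairs:
        total += y * log_prob_prefer_1(phi_0, phi_1)
        total += (1 - y) * log_prob_prefer_1(phi_1, phi_0)
    return -total / len(pairs)


def tie_loss(pairs):
    """Eq. (8): pairs is a list of (phi_0, phi_1) the LLM declined to rank."""
    total = 0.0
    for phi_0, phi_1 in pairs:
        total += log_prob_prefer_1(phi_0, phi_1) + log_prob_prefer_1(phi_1, phi_0)
    return -total / len(pairs)


def mixed_loss(labeled, ties, tie_weight=1.0):
    """Mix of both terms; tie_weight is a hypothetical knob, not from the paper."""
    return labeled_loss(labeled) + tie_weight * tie_loss(ties)


# Toy per-step potentials for H = 5 steps (made-up numbers).
good = [0.2, 0.4, 0.6, 0.8, 1.0]
bad = [0.1, 0.1, 0.1, 0.1, 0.1]
loss_correct = labeled_loss([(bad, good, 1)])  # label agrees with the potentials
loss_wrong = labeled_loss([(bad, good, 0)])    # label contradicts them
loss_tie_equal = tie_loss([(bad, bad)])        # equal summed potentials
loss_tie_far = tie_loss([(bad, good)])         # very different summed potentials
```

Because the two probabilities in Equation (8) sum to one, the tie term is smallest when both trajectories have equal summed potentials, where it equals $2 \log 2$; it therefore pulls the potentials of equally preferred trajectories toward each other.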

IV-C Goal Suggestion (BK-Goal)

Another way to involve background knowledge is to guide the agent with potential goals of the environment. Recent works [21, 41, 61] propose several techniques to query LLMs for helpful goals and thus guide the agent to visit states of interest. In our approach, we discover such potential goals in an offline manner. We first caption sampled trajectories $\tau$ from the dataset and ask LLMs for potential goals the agent can reach. Then we store the output goals $g$ along with the trajectory caption, forming a text-based goal library for further use. During the RL process, we introduce a retrieval procedure to select appropriate goals from the goal library according to the agent history. When the agent is in state $s_t$, we use the text form of the agent history $\tau_t$ to retrieve an exhaustive goal list $g_1, \dots, g_k$ from the top-$K$ most similar trajectories. The similarity is calculated using embeddings provided by a pretrained language model serving as the sentence encoder [62]. Then we compute the cosine similarity $\sigma(s_t, g_i)$ between the current trajectory $\tau_t$ and the goal $g_i$ as the degree of achievement of each goal. The value of the potential function in $s$ is the maximal similarity over the potential goals, $\phi(s) = \max_i \sigma(s, g_i)$, since it measures how well the agent has completed the most promising goal.

To harness LLMs for goal suggestion, we also caption the trajectory sequence and add the caption to the prompt. We then ask the LLM to provide potential goals based on the history of the agent. We ask the LLM to list all possible goals as an unordered list and store all the goals along with the trajectory caption. The pairs of trajectory captions and corresponding goals compose our goal library. For downstream RL, we retrieve the top-$K$ pairs according to the similarity between the query text and the text captions in the library, where $K$ is set to 3. We deploy a naive method that computes the cosine similarity against each text caption; approaches to accelerating this retrieval process are well studied but beyond our scope. We adopt a small-scale pretrained BERT model [63] from HuggingFace (https://huggingface.co/prajjwal1/bert-small) as our sentence encoder.
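The retrieval step and the potential $\phi(s) = \max_i \sigma(s, g_i)$ can be sketched as below. To stay self-contained, the sketch replaces the pretrained sentence encoder with a toy bag-of-words embedding, so the similarity values are only illustrative; the library entries and the helper names `embed`, `retrieve_goals`, and `potential` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a pretrained sentence encoder."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_goals(history_caption, goal_library, k=3):
    """Collect goals attached to the k captions most similar to the history."""
    query = embed(history_caption)
    ranked = sorted(
        goal_library,
        key=lambda entry: cosine(query, embed(entry["caption"])),
        reverse=True,
    )
    return [goal for entry in ranked[:k] for goal in entry["goals"]]


def potential(history_caption, goal_library, k=3):
    """phi(s) = max_i sigma(s, g_i) over the retrieved goals."""
    query = embed(history_caption)
    goals = retrieve_goals(history_caption, goal_library, k)
    return max((cosine(query, embed(g)) for g in goals), default=0.0)


# Hypothetical goal library entries (captions and goals are made up).
library = [
    {"caption": "you see a red ball two steps ahead", "goals": ["go to the red ball"]},
    {"caption": "you see a closed door to the left", "goals": ["open the door"]},
    {"caption": "you see a blue box behind a wall",
     "goals": ["go around the wall", "go to the blue box"]},
    {"caption": "you carry nothing and see an empty room", "goals": ["explore the room"]},
]
phi_near = potential("you are next to the red ball", library)
phi_far = potential("darkness", library)
```

The linear scan over all captions mirrors the naive retrieval described in the text; an approximate nearest-neighbor index could replace it without changing the definition of the potential.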


Figure 3: Average episodic returns of compared methods in different BabyAI goto tasks of the Minigrid environment. Task goals containing the color purple and the object type key do not appear in the collected datasets.
Figure 4: Average success rates of compared methods in different downstream tasks of the Crafter environment. For each task, the agent only acquires a reward when completing the corresponding achievement.

We summarize the properties of our proposed three variants on background knowledge representation in Table I. Although these methods all leverage LLMs to express background knowledge, the differences in representing potential functions result in different behaviors during downstream RL tasks. BK-Code does not require an additional parameterized model but calculates auxiliary rewards solely using written code. In contrast, BK-Pref and BK-Goal require pretrained models to express potential functions. The similarity computation in BK-Goal requires text-form environmental descriptions, limiting its applications to environments that can provide a text captioner. The other two methods only require text captions in collected data samples, making it practical to only caption data samples rather than implementing a general captioner. Besides, BK-Code is infeasible for RL tasks with unstructured input such as images and texts since we cannot directly parse such input using code.

Principles on prompt design. The prompts in background knowledge representation typically contain two portions, the environmental information and the mission description. An exception comes from BK-Code, which requires additional code snippets for programming. We make the designed prompts modularized and succinct, applicable to different environments with moderate efforts in replacing the environment-related prompt. In our supplementary materials, we list all used prompts, discuss how to adapt to other domains, and provide some example responses from different environments and LLMs.

V Experiments

In this section, we aim to evaluate the effectiveness of using background knowledge in downstream RL tasks from the following aspects (code available at https://github.com/mansicer/background-knowledge-rl): (1) Sample efficiency improvement. Is reward shaping with background knowledge more efficient than prior sample-efficient RL approaches? (2) Generalizability of background knowledge. When the data is from a subset of tasks, can the derived background knowledge help a broader range of tasks in the environment? (3) Sensitivity of background knowledge. We analyze two key factors that affect the quality of knowledge, the language model and the data size, to evaluate the sensitivity of the proposed three variants. We conduct experiments in two popular environments: (1) Minigrid [64], an environment with easily configurable grid-world tasks frequently used to evaluate exploration methods, and (2) Crafter [65], an open-ended environment where agents should collect achievements while discovering survival strategies. As stated in Table I, only BK-Goal among our variants requires text captions during downstream RL training. The other two methods only require captions on the pre-collected data, making it possible to caption only the dataset in environments for which a rule-based captioner is difficult to write. Although previous research [21] also trains captioners based on vision language models, here we simply adopt existing captioning functions from prior works. For Minigrid, we adopt the text captioner from GLAM [40] to caption states. For the Crafter environment, we use the text captioner from the SmartPlay benchmark [66]. We borrow captioners directly from existing works to avoid the effectiveness and reproducibility issues of designing a specific captioner for our framework.

Figure 5: Rendered game frames from two used environments: (a) Minigrid and (b) Crafter.
Figure 6: Average episodic returns of our methods compared with ELLM and Motif in different BabyAI goto tasks of the Minigrid environment.

We introduce baselines, including classic exploration approaches and sample-efficient RL methods using language abstraction. We adopt RND [12] and NovelD [13] as popular exploration baselines that generally show better sample efficiency than PPO. We also introduce Lang-ND [30] and L-NovelD [29] as methods that enhance sample efficiency by providing intrinsic rewards through language. Similar to ours, these methods do not require specific task information but usually train language models to provide features during the RL process. In contrast, our approaches either require no text input or only use pretrained sentence encoders for inference, which is more efficient. Our reported results are averaged over 5 runs, with error bars denoting the standard deviation.
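As a small illustration of the reporting convention, the snippet below computes a mean and standard deviation over 5 runs using only the standard library. The return values are made up, and the text does not say whether the population or the sample standard deviation is used, so both are shown.

```python
import statistics

# Made-up final returns from 5 independent runs (seeds).
returns = [0.80, 0.90, 0.70, 0.85, 0.75]

mean_return = statistics.mean(returns)
std_population = statistics.pstdev(returns)  # divides by n
std_sample = statistics.stdev(returns)       # divides by n - 1
```

With only 5 runs the two estimates differ noticeably, so it is worth stating which one the error bars use when reproducing such plots.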

Figure 7: Average episodic returns of compared methods in different emerging BabyAI tasks of the Minigrid environment including (a, b) goto-seq tasks and (c, d) pickup tasks.

V-A Performance on Downstream Tasks

Minigrid [64] contains a spectrum of configurable tasks in a grid environment. We adopt the Minigrid environment [64] (Figure 5(a)) from its official GitHub repository (https://github.com/Farama-Foundation/Minigrid). For this environment, we mainly use the goto and pickup task types from the BabyAI domain [67] to build tasks. To specify the domain knowledge for a comprehensive analysis, we mainly focus on the goto tasks from the BabyAI series [67], which is a subset of the original Minigrid tasks. A goto task typically requires the agent to navigate to a specific object, e.g., a red ball, in a map with multiple distractors. The target object type can be a ball, a box, or a key, with a color from red, blue, green, or purple. As an extension, the goto-seq task requires the agent to navigate sequentially to two different objects in one episode, which makes exploration more difficult. A pickup task additionally requires the agent to perform a pickup action after navigating to the object. The original environment registration in the code repository is text-conditioned, so the agent may receive different text instructions in separate episodes. To unify the task goal, we fix the environment specification so that it generates consistent goals during RL training. We create a series of goto and pickup tasks for our experiments. The original BabyAI environments usually use map sizes ranging from 5 to 8, which can be simple for exploration methods. To create more challenging benchmarks, we scale the map size to between 20 and 30 in our experiments. To introduce diverse downstream tasks for evaluation, we deliberately design different kinds of unseen tasks. The data for background knowledge representation is collected from tasks without targets of the object type key or the color purple. During downstream RL training, however, we also train policies in tasks with these targets, using the same one-time knowledge representation. Therefore, we can evaluate the effectiveness of the acquired background knowledge across a spectrum of seen and unseen tasks.
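The split between seen and unseen targets described above can be enumerated directly. The sketch below only encodes what the text states (three object types, four colors, with key and purple held out of the collected data); the variable names are illustrative.

```python
from itertools import product

OBJECT_TYPES = ["ball", "box", "key"]
COLORS = ["red", "blue", "green", "purple"]
HELD_OUT_TYPES = {"key"}       # object type absent from the collected data
HELD_OUT_COLORS = {"purple"}   # color absent from the collected data

all_targets = list(product(COLORS, OBJECT_TYPES))
seen = [
    (color, obj)
    for color, obj in all_targets
    if color not in HELD_OUT_COLORS and obj not in HELD_OUT_TYPES
]
unseen = [target for target in all_targets if target not in seen]
```

Under this enumeration, half of the twelve color and type combinations never appear as targets in the data, which is what makes the unseen-task evaluation meaningful.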

In Figure 3, we show the average episodic returns of different methods in four downstream tasks with different targets. We find that our proposed three variants generally outperform the compared baselines in all tasks. Notably, when conducting experiments in tasks with unseen object types and colors, the improvement of sample efficiency of our methods is still obvious. The results indicate that our way of representing background knowledge does acquire task-agnostic background knowledge of the environment, thus accelerating policy learning in unseen tasks. We also find that Lang-ND and L-NovelD, which both adopt state captions to provide intrinsic motivation, perform better than classic exploration baselines, demonstrating that leveraging text features can be useful to improve sample efficiency.

Crafter [65] is a 2D survival game (Figure 5(b)) drawing inspiration from the popular Minecraft game. The environment is procedurally generated and partially observable, where the agent needs to complete achievements such as collecting and crafting. The original Crafter game rewards agents based on the accomplishment of achievements and health changes. Based on the achievement list, we create downstream tasks corresponding to each achievement, where the agent only gets a reward when completing it. We adopt the environment from its official repository (https://github.com/danijar/crafter). However, the original action space in Crafter merges different operations like eating, collecting, and attacking into a single ‘do’ action. Although this reduces the action space for exploration, it may hinder understanding of the environmental logic. Following the solution in [21], we split this action into several actions including ‘eat’, ‘drink’, ‘attack’, and ‘collect’, making it better aligned with actual agent behaviors. The transformation of the action space also helps explain actions during text captioning. We evaluate our proposed methods in these tasks except for BK-Code, since the observation space in Crafter is image-based. As shown in Figure 4, we find that the two applicable approaches, BK-Pref and BK-Goal, still exhibit superior performance in most downstream tasks. For crafting tasks that require specific action sequences, like making a wood pickaxe, introducing background knowledge can significantly improve sample efficiency.

Some recent works also consider learning RL policies with LLM knowledge besides our compared baselines, such as Motif [56] and ELLM [21]. However, these methods are not directly comparable to ours due to their different training paradigms and the difficulty of designing prompts for each task. Motif also learns from LLM-labeled preferences, an idea known as RL from artificial intelligence feedback (RLAIF) [54, 55, 68], and directly uses the preference data to augment rewards. Unlike Motif, our framework focuses on more general background knowledge of a domain to avoid complex prompt design and costly LLM queries. In addition, ELLM adopts a different paradigm that generates possible goals during policy pretraining. In contrast, our framework does not require engaging LLMs during interactions with the environment but extracts background knowledge from LLMs in an interaction-free manner. Although these methods cannot realize an efficient paradigm like ours, we can compare with them in each single task. In Figure 6, we show the results in a map of size 30 to increase the difficulty, where these baselines extract knowledge from the data of each single task while our methods still use the previously acquired knowledge. We find that our three methods still perform well compared to these baselines, indicating the effectiveness of integrating LLM knowledge via reward shaping.

Figure 8: Average episodic return at the 2M-th time step in goto tasks with different map sizes ranging from 20 to 30.

V-B Generalizability of Background Knowledge

In this section, we further test the generalizability of acquired background knowledge, aiming to find whether our framework of using background knowledge can accelerate policy learning for more distinct tasks. We conduct experiments in tasks with different task types and map sizes from the highly configurable Minigrid environment to evaluate the performance of our methods.

Effectiveness on emerging task types. The data used for acquiring background knowledge contains experiences from the goto tasks, which only require the agent to navigate to one object. Here we further introduce two additional task types to examine whether the background knowledge can provide more general training signals. The extra task types include (1) the goto-seq task, where the agent should navigate to two distinct objects sequentially, and (2) the pickup task, where the agent needs to execute a “pickup” action after successfully navigating to the target. As shown in Figure 7, our methods still exhibit superior performance compared to other baselines. Specifically, the goto-seq task type is relatively difficult as it requires exhaustive exploration to find two desired objects. Our compared baselines can hardly solve this kind of task while our methods can solve the problem without previously encountering these tasks. We find that BK-Code is less stable and shows less promising performance than the other two variants in some tasks, indicating that this form of knowledge representation may not be as generalizable as other variants.

Scaling to larger map sizes. When the map becomes larger, the agent needs to execute more actions precisely to acquire the reward, resulting in increasing difficulty in exploring and exploiting such reward signals. To this end, we configure a series of goto tasks with different map sizes to examine whether reward shaping with background knowledge can scale well to larger maps. We evaluate our methods in this task series and plot the average episodic returns at the 2M-th time step in Figure 8. Notably, our proposed methods can all maintain high sample efficiency with increasing task difficulty. In contrast, the compared baselines mostly exhibit significant performance drops when the map becomes larger. The results indicate that our three variants can scale to larger maps well.

Figure 9: The performance of proposed methods using different GPT-series models for background knowledge representation.
Figure 10: The performance of proposed methods using GPT-4 and Llama-2 language models for background knowledge representation.

V-C Sensitivity of Background Knowledge

In this section, we conduct experiments to examine the sensitivity of learned background knowledge to different language models and data. In Figure 9, we evaluate our three variants with two LLMs, gpt-3.5-turbo and gpt-4, separately. The results show that BK-Code is the most sensitive approach when we turn to a weaker LLM. Since the quality of code strongly correlates with the capability of LLMs, we also find that weaker LLMs have a higher probability of generating code with runtime errors. The performance drop when using gpt-3.5-turbo in BK-Pref and BK-Goal is hardly observed in Minigrid but is more evident in the crafting task from Crafter, where an LLM with lower capability may have a more superficial understanding of the environment. To further investigate the effectiveness of our framework under much weaker open-source language models, we also test the performance of BK-Pref and BK-Goal when using the 7B version of the Llama-2 chat model in Figure 10. We omit the BK-Code method with Llama-2 since we find that the Llama-2 chat model is not capable of generating valid code in our case. We find that BK-Pref and BK-Goal using the Llama-2 model perform slightly worse than with GPT models, indicating that the capability of LLMs may affect the quality of extracted knowledge. However, thanks to our effective knowledge representations from preferences or goals, the performance loss is not significant, and these algorithms can still solve the task with knowledge from a weak Llama-2 7B model.

Figure 11: The performance of proposed methods using different data sources for background knowledge representation.

Besides, we try to discover the connection between the data used and RL performance, as LLMs may not be able to derive sufficient knowledge from low-quality data. We present the policy performance when using data collected by random policies for background knowledge representation, abbreviated as the random data approach. As shown in Figure 11, we still observe significant sample efficiency improvements for these variants. For Minigrid, we find that background knowledge derived from random data is sufficient to accelerate policy learning in goto tasks. For the more difficult task of making a wood pickaxe in Crafter, random policies may not be able to provide useful trajectories, limiting the knowledge LLMs can provide.

VI Conclusions and Limitations

In this paper, we propose a novel framework to extract and reuse background knowledge of an environment by harnessing LLMs. Leveraging a pre-collected dataset, we design three variants, BK-Code, BK-Pref, and BK-Goal, to represent background knowledge from LLM feedback as helpful potential functions. With a unified reward-shaping framework, we adopt the derived potential functions for different downstream tasks. Our experiments in two different environments show that our framework significantly improves sample efficiency in downstream RL tasks and the LLM-concluded knowledge can even generalize to unseen tasks beyond the scope of provided data. We also present a detailed analysis of the generalizability and sensitivity of our framework under different conditions.

Our work has a few limitations. First, we did not optimize the prompting mechanisms for the proposed variants. It is important to note that our current prompting methods for LLMs may not be the most effective, and future research could explore more sophisticated approaches. Additionally, although we employ a modular and concise prompt design, creating a prompt with environmental information for new domains still requires a moderate effort. Future studies that aim to simplify and automate these procedures could be highly beneficial.

Acknowledgments

This work is supported by the National Science Foundation of China (62276126) and the Tencent AI Lab (RBFR2023011).


TABLE II: Typical time cost of all compared methods.
Method | BK-Code | BK-Pref | BK-Goal | RND | NovelD | Lang-ND | L-NovelD
Time cost | 6.57h | 6.67h | 25.42h | 13.17h | 11.51h | 14.68h | 17.87h
Figure 12: The network structures of RL agents trained for the Minigrid and Crafter tasks.
TABLE III: Hyperparameters for downstream RL training
Hyperparameter | Value
Discount factor $\gamma$ | 0.99
Learning rate | 0.0003
Extrinsic reward coefficient | 10.0
GAE factor $\lambda$ | 0.95
Parallel workers | 8 (Minigrid), 16 (Crafter)
Batch size | 1024 (Minigrid), 4096 (Crafter)
Importance sampling clipping | 0.1
Entropy loss coefficient | 0.01
Value loss coefficient | 0.5
Gradient clipping factor | 0.5
Figure 13: The histograms of episodic returns and trajectory lengths in the used Minigrid dataset.
TABLE IV: Typical LLM query cost of BK-Code, BK-Pref, and BK-Goal.
Method | # LLM calls | Cost (gpt-3.5-turbo) | Cost (gpt-4)
BK-Code | 50 | $0.15 | $18
BK-Pref | 2000 | $1.5 | $90
BK-Goal | 500 | $0.9 | $54

-A Details of Data Collection

As stated in our main paper, we deploy an RND algorithm in the environment and collect data periodically. To be specific, we train RND for 5M steps and evaluate the learned policy every 500,000 steps. During each evaluation, we run the policy for 50 episodes and save the data as our dataset. For Minigrid, we run the RND algorithm simultaneously in multiple tasks without fixing the task types, which means that the collected unlabeled trajectories may come from a distribution of tasks. We specify the task range by restricting the target types as mentioned in our experiments section, where the object type key and the color purple do not appear in the pre-collected data. For the Crafter environment, we directly deploy an RND algorithm in the original environment. The agent tries to maximize the raw Crafter reward, which is based on completing achievements and staying healthy. We present the properties of the collected datasets in Figure 13.
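The size of the resulting dataset follows from simple arithmetic on the numbers above. The snippet assumes that an evaluation happens after each full interval and not at step 0, which the text does not specify; it also counts episodes for a single RND run.

```python
TOTAL_STEPS = 5_000_000     # RND training budget
EVAL_INTERVAL = 500_000     # steps between evaluations
EPISODES_PER_EVAL = 50      # episodes saved at each evaluation

# Assumption: one evaluation after each full interval, none at step 0.
num_evals = TOTAL_STEPS // EVAL_INTERVAL
total_episodes = num_evals * EPISODES_PER_EVAL
```

Because the snapshots span the whole training run, the saved episodes mix early, near-random behavior with later, more competent behavior.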

-B Details of Downstream RL training

We use the PPO algorithm [47] with the CleanRL [48] implementation, a stable and succinct PPO version that makes it easy to add our reward-shaping technique. The only modification to the training process is to add the auxiliary reward to the original environmental reward. To adapt to our selected environments, we use convolutional neural networks to extract features from the input observation. The features are then fed into different multi-layer perceptrons (MLPs) to compute the action logits or value functions. We show the network structure of the RL agents in Figure 12. Due to the different observation spaces of the two environments, the feature processing networks are slightly different. We also report the hyperparameters of the RL algorithms in Table III, where most of the hyperparameters are taken directly from the CleanRL implementation, except for some coefficients related to environment sampling and auxiliary rewards.
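The reward modification can be sketched as follows, assuming the standard potential-based shaping term $F(s, s') = \gamma \phi(s') - \phi(s)$. How the extrinsic reward coefficient from Table III enters the sum is an assumption made here for illustration (the environment reward is scaled and the shaping term is added unscaled), and the potential values are made up.

```python
def shaping_term(phi_s, phi_next, gamma=0.99):
    """Potential-based shaping term F = gamma * phi(s') - phi(s)."""
    return gamma * phi_next - phi_s


def shaped_reward(env_reward, phi_s, phi_next, gamma=0.99, ext_coef=10.0):
    """Scaled environment reward plus shaping term (use of ext_coef is assumed)."""
    return ext_coef * env_reward + shaping_term(phi_s, phi_next, gamma)


# Made-up potentials along one trajectory s_0, ..., s_4.
gamma = 0.99
phis = [0.0, 0.2, 0.5, 0.4, 0.9]

# Discounted sum of shaping terms over the trajectory.
discounted_sum = sum(
    gamma ** t * shaping_term(phis[t], phis[t + 1], gamma)
    for t in range(len(phis) - 1)
)
# The same quantity after telescoping: only the endpoints remain.
telescoped = gamma ** (len(phis) - 1) * phis[-1] - phis[0]
```

The discounted shaping terms telescope to a quantity that depends only on the first and last potentials, which is the reason potential-based shaping leaves the ranking of policies under the task reward unchanged.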

-C Cost of Our Methods

Our method is less time-consuming since it requires only one-time knowledge extraction and does not call LLMs during RL. We show the time cost of all methods in the BabyAI-GoToRedBall-S30 task as an example. As shown in Table II, our methods are time-efficient, except for BK-Goal, which needs to compute text embeddings during training. Generally, the time cost of our methods is comparable with that of the baselines. In addition, we present typical values of LLM calls and costs for our three approaches in Table IV. Our framework is cost-efficient since it does not involve LLM calls during RL. In contrast, some prior works like ELLM typically require millions of LLM calls since they need interactions with LLMs in the RL pretraining process.
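The figures in Table IV can be turned into per-call costs with a few lines of arithmetic. The snippet only restates the table's numbers; the dictionary layout and variable names are illustrative.

```python
# (number of LLM calls, cost with gpt-3.5-turbo in USD, cost with gpt-4 in USD)
TABLE_IV = {
    "BK-Code": (50, 0.15, 18.0),
    "BK-Pref": (2000, 1.5, 90.0),
    "BK-Goal": (500, 0.9, 54.0),
}

# Average cost of a single call for each model.
per_call = {
    method: (cost_35 / calls, cost_4 / calls)
    for method, (calls, cost_35, cost_4) in TABLE_IV.items()
}
# How much more expensive gpt-4 is than gpt-3.5-turbo for the same queries.
price_ratio = {
    method: cost_4 / cost_35
    for method, (calls, cost_35, cost_4) in TABLE_IV.items()
}
```

On these numbers, a BK-Code call is the most expensive per query (about $0.36 with gpt-4), while BK-Pref is the cheapest per query (about $0.045) but issues the most calls.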

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
  • [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
  • [4] D. Ye, G. Chen, W. Zhang, S. Chen, B. Yuan, B. Liu, J. Chen, Z. Liu, F. Qiu, H. Yu, Y. Yin, B. Shi, L. Wang, T. Shi, Q. Fu, W. Yang, L. Huang, and W. Liu, “Towards playing full MOBA games with deep reinforcement learning,” in Advances in Neural Information Processing Systems, 2020, pp. 621–632.
  • [5] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang, “Solving Rubik’s cube with a robot hand,” arXiv preprint arXiv:1910.07113, 2019.
  • [6] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems, 2022, pp. 27 730–27 744.
  • [7] Y. Yu, “Towards sample efficient reinforcement learning,” in International Joint Conference on Artificial Intelligence, 2018, pp. 5739–5743.
  • [8] P. Ladosz, L. Weng, M. Kim, and H. Oh, “Exploration in deep reinforcement learning: A survey,” Information Fusion, vol. 85, pp. 1–22, 2022.
  • [9] A. Aubret, L. Matignon, and S. Hassas, “A survey on intrinsic motivation in reinforcement learning,” arXiv preprint arXiv:1908.06976, 2019.
  • [10] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” in Advances in Neural Information Processing Systems, 2016, pp. 1471–1479.
  • [11] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in International Conference on Machine Learning, 2017, pp. 2778–2787.
  • [12] Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov, “Exploration by random network distillation,” in International Conference on Learning Representations, 2019.
  • [13] T. Zhang, H. Xu, X. Wang, Y. Wu, K. Keutzer, J. E. Gonzalez, and Y. Tian, “NovelD: A simple yet effective exploration criterion,” in Advances in Neural Information Processing Systems, 2021, pp. 25 217–25 230.
  • [14] P. Goyal, S. Niekum, and R. J. Mooney, “Using natural language for reward shaping in reinforcement learning,” in International Joint Conference on Artificial Intelligence, 2019, pp. 2385–2391.
  • [15] S. Mazumder, B. Liu, S. Wang, Y. Zhu, X. Yin, L. Liu, and J. Li, “Knowledge-guided exploration in deep reinforcement learning,” arXiv preprint arXiv:2210.15670, 2022.
  • [16] W. Ye, Y. Zhang, M. Wang, S. Wang, X. Gu, P. Abbeel, and Y. Gao, “Foundation reinforcement learning: towards embodied generalist agents with foundation prior assistance,” arXiv preprint arXiv:2310.02635, 2023.
  • [17] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Nie, and J. Wen, “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
  • [18] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. M. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang, “Sparks of artificial general intelligence: Early experiments with GPT-4,” arXiv preprint arXiv:2303.12712, 2023.
  • [19] OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  • [20] N. D. Palo, A. Byravan, L. Hasenclever, M. Wulfmeier, N. Heess, and M. Riedmiller, “Towards a unified agent with foundation models,” in Workshop on Reincarnating Reinforcement Learning at ICLR, 2023.
  • [21] Y. Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel, A. Gupta, and J. Andreas, “Guiding pretraining in reinforcement learning with large language models,” in International Conference on Machine Learning, 2023, pp. 8657–8677.
  • [22] Z. Zhao, W. S. Lee, and D. Hsu, “Large language models as commonsense knowledge for large-scale task planning,” in Advances in Neural Information Processing Systems, 2023, pp. 31 967–31 987.
  • [23] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” arXiv preprint arXiv:2310.12931, 2023.
  • [24] W. Yu, N. Gileadi, C. Fu, S. Kirmani, K. Lee, M. G. Arenas, H. L. Chiang, T. Erez, L. Hasenclever, J. Humplik, B. Ichter, T. Xiao, P. Xu, A. Zeng, T. Zhang, N. Heess, D. Sadigh, J. Tan, Y. Tassa, and F. Xia, “Language to rewards for robotic skill synthesis,” in Conference on Robot Learning, 2023, pp. 374–404.
  • [25] T. Xie, S. Zhao, C. H. Wu, Y. Liu, Q. Luo, V. Zhong, Y. Yang, and T. Yu, “Text2Reward: Automated dense reward function generation for reinforcement learning,” arXiv preprint arXiv:2309.11489, 2023.
  • [26] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in International Conference on Machine Learning, 1999, pp. 278–287.
  • [27] A. C. Li, L. Pinto, and P. Abbeel, “Generalized hindsight for reinforcement learning,” in Advances in Neural Information Processing Systems, 2020, pp. 7754–7767.
  • [28] N. Waytowich, S. L. Barton, V. Lawhern, and G. Warnell, “A narration-based reward shaping approach using grounded natural language commands,” arXiv preprint arXiv:1911.00497, 2019.
  • [29] J. Mu, V. Zhong, R. Raileanu, M. Jiang, N. D. Goodman, T. Rocktäschel, and E. Grefenstette, “Improving intrinsic exploration with language abstractions,” in Advances in Neural Information Processing Systems, 2022, pp. 33 947–33 960.
  • [30] A. C. Tam, N. C. Rabinowitz, A. K. Lampinen, N. A. Roy, S. C. Y. Chan, D. Strouse, J. Wang, A. Banino, and F. Hill, “Semantic exploration from language abstractions and pretrained representations,” in Advances in Neural Information Processing Systems, 2022, pp. 25 377–25 389.
  • [31] S. Mirchandani, S. Karamcheti, and D. Sadigh, “ELLA: Exploration through learned language abstraction,” in Advances in Neural Information Processing Systems, 2021, pp. 29 529–29 540.
  • [32] T. Carta, P. Oudeyer, O. Sigaud, and S. Lamprier, “EAGER: Asking and answering questions for automatic reward shaping in language-guided RL,” in Advances in Neural Information Processing Systems, 2022, pp. 12 478–12 490.
  • [33] J. Xu, C. Chen, F. Zhang, L. Yuan, Z. Zhang, and Y. Yu, “Internal logical induction for pixel-symbolic reinforcement learning,” in ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 2825–2837.
  • [34] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, 2020, pp. 1877–1901.
  • [35] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems, 2022, pp. 24 824–24 837.
  • [36] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291, 2023.
  • [37] Y. Wu, S. Y. Min, S. Prabhumoye, Y. Bisk, R. Salakhutdinov, A. Azaria, T. Mitchell, and Y. Li, “SPRING: GPT-4 out-performs RL algorithms by studying papers and reasoning,” arXiv preprint arXiv:2305.15486, 2023.
  • [38] B. Ichter, A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, D. Kalashnikov, S. Levine, Y. Lu, C. Parada, K. Rao, P. Sermanet, A. Toshev, V. Vanhoucke, F. Xia, T. Xiao, P. Xu, M. Yan, N. Brown, M. Ahn, O. Cortes, N. Sievers, C. Tan, S. Xu, D. Reyes, J. Rettinghouse, J. Quiambao, P. Pastor, L. Luu, K. Lee, Y. Kuang, S. Jesmonth, N. J. Joshi, K. Jeffrey, R. J. Ruano, J. Hsu, K. Gopalakrishnan, B. David, A. Zeng, and C. K. Fu, “Do as I can, not as I say: Grounding language in robotic affordances,” in Conference on Robot Learning, 2022, pp. 287–318.
  • [39] S. Li, X. Puig, C. Paxton, Y. Du, C. Wang, L. Fan, T. Chen, D. Huang, E. Akyürek, A. Anandkumar, J. Andreas, I. Mordatch, A. Torralba, and Y. Zhu, “Pre-trained language models for interactive decision-making,” in Advances in Neural Information Processing Systems, 2022, pp. 31 199–31 212.
  • [40] T. Carta, C. Romac, T. Wolf, S. Lamprier, O. Sigaud, and P. Oudeyer, “Grounding large language models in interactive environments with online reinforcement learning,” in International Conference on Machine Learning, 2023, pp. 3676–3713.
  • [41] B. Hu, C. Zhao, P. Zhang, Z. Zhou, Y. Yang, Z. Xu, and B. Liu, “Enabling intelligent interactions between an agent and an LLM: A reinforcement learning approach,” arXiv preprint arXiv:2306.03604, 2023.
  • [42] T. G. Karimpanal, L. B. Semage, S. Rana, H. Le, T. Tran, S. Gupta, and S. Venkatesh, “LaGR-SEQ: Language-guided reinforcement learning with sample-efficient querying,” arXiv preprint arXiv:2308.13542, 2023.
  • [43] J. Luketina, N. Nardelli, G. Farquhar, J. N. Foerster, J. Andreas, E. Grefenstette, S. Whiteson, and T. Rocktäschel, “A survey of reinforcement learning informed by natural language,” in International Joint Conference on Artificial Intelligence, 2019, pp. 6309–6317.
  • [44] M. Kwon, S. M. Xie, K. Bullard, and D. Sadigh, “Reward design with language models,” in International Conference on Learning Representations, 2023.
  • [45] J. Randløv and P. Alstrøm, “Learning to drive a bicycle using reinforcement learning and shaping,” in International Conference on Machine Learning, 1998, pp. 463–471.
  • [46] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine, “D4RL: Datasets for deep data-driven reinforcement learning,” arXiv preprint arXiv:2004.07219, 2020.
  • [47] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [48] S. Huang, R. F. J. Dossa, C. Ye, J. Braga, D. Chakraborty, K. Mehta, and J. G. M. Araújo, “CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms,” Journal of Machine Learning Research, vol. 23, pp. 274:1–274:18, 2022.
  • [49] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
  • [50] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “CodeGen: An open large language model for code with multi-turn program synthesis,” arXiv preprint arXiv:2203.13474, 2022.
  • [51] K. Yang, J. Liu, J. Wu, C. Yang, Y. R. Fung, S. Li, Z. Huang, X. Cao, X. Wang, Y. Wang, H. Ji, and C. Zhai, “If LLM is the wizard, then code is the wand: A survey on how code empowers large language models to serve as intelligent agents,” arXiv preprint arXiv:2401.00812, 2024.
  • [52] Y.-T. Lin and Y.-N. Chen, “LLM-Eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models,” arXiv preprint arXiv:2305.13711, 2023.
  • [53] C.-H. Chiang and H.-y. Lee, “Can large language models be an alternative to human evaluations?” arXiv preprint arXiv:2305.01937, 2023.
  • [54] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosiute, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. Lasenby, R. Larson, S. Ringer, S. Johnston, S. Kravec, S. E. Showk, S. Fort, T. Lanham, T. Telleen-Lawton, T. Conerly, T. Henighan, T. Hume, S. R. Bowman, Z. Hatfield-Dodds, B. Mann, D. Amodei, N. Joseph, S. McCandlish, T. Brown, and J. Kaplan, “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, 2022.
  • [55] H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard, C. Bishop, V. Carbune, and A. Rastogi, “RLAIF: Scaling reinforcement learning from human feedback with AI feedback,” arXiv preprint arXiv:2309.00267, 2023.
  • [56] M. Klissarov, P. D’Oro, S. Sodhani, R. Raileanu, P.-L. Bacon, P. Vincent, A. Zhang, and M. Henaff, “Motif: Intrinsic motivation from artificial intelligence feedback,” arXiv preprint arXiv:2310.00166, 2023.
  • [57] A. Wilson, A. Fern, and P. Tadepalli, “A Bayesian approach for policy learning from trajectory preference queries,” in Advances in Neural Information Processing Systems, 2012, pp. 1142–1150.
  • [58] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems, 2017, pp. 4299–4307.
  • [59] R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952.
  • [60] C. Kim, J. Park, J. Shin, H. Lee, P. Abbeel, and K. Lee, “Preference transformer: Modeling human preferences using transformers for RL,” in International Conference on Learning Representations, 2023.
  • [61] W. Li, D. Qiao, B. Wang, X. Wang, B. Jin, and H. Zha, “Semantically aligned task decomposition in multi-agent reinforcement learning,” arXiv preprint arXiv:2305.10865, 2023.
  • [62] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” arXiv preprint arXiv:1908.10084, 2019.
  • [63] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Conference of the North American Chapter of the Association for Computational Linguistics, 2019, pp. 4171–4186.
  • [64] M. Chevalier-Boisvert, B. Dai, M. Towers, R. de Lazcano, L. Willems, S. Lahlou, S. Pal, P. S. Castro, and J. Terry, “Minigrid & Miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks,” arXiv preprint arXiv:2306.13831, 2023.
  • [65] D. Hafner, “Benchmarking the spectrum of agent capabilities,” in International Conference on Learning Representations, 2022.
  • [66] Y. Wu, X. Tang, T. M. Mitchell, and Y. Li, “SmartPlay: A benchmark for LLMs as intelligent agents,” arXiv preprint arXiv:2310.01557, 2023.
  • [67] M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio, “BabyAI: A platform to study the sample efficiency of grounded language learning,” in International Conference on Learning Representations, 2019.
  • [68] J.-C. Pang, P. Wang, K. Li, X.-H. Chen, J. Xu, Z. Zhang, and Y. Yu, “Language model self-improvement by reinforcement learning contemplation,” arXiv preprint arXiv:2305.14483, 2023.