Online Policy Learning from Offline Preferences

Guoxi Zhang
Kyoto University
Kyoto, Japan
[email protected]
Han Bao
Kyoto University
Kyoto, Japan
[email protected]
Hisashi Kashima
Kyoto University
Kyoto, Japan
[email protected]
This work was done when the author was a student at Kyoto University.

Abstract

In preference-based reinforcement learning (PbRL), a reward function is learned from a type of human feedback called preference. To expedite preference collection, recent works have leveraged offline preferences, which are preferences collected for some offline data. In this scenario, the learned reward function is fitted on the offline data. If a learning agent exhibits behaviors that do not overlap with the offline data, the learned reward function may encounter generalizability issues. To address this problem, the present study introduces a framework that consolidates offline preferences and virtual preferences for PbRL, which are comparisons between the agent’s behaviors and the offline data. Critically, the reward function can track the agent’s behaviors using the virtual preferences, thereby offering well-aligned guidance to the agent. Through experiments on continuous control tasks, this study demonstrates the effectiveness of incorporating the virtual preferences in PbRL.

1 Introduction

Preference-based reinforcement learning (PbRL) is a setting for developing agents using human preferences (Akrour et al., 2011). A preference can be the outcome of a comparison between a pair of actions (Fürnkranz et al., 2012), states (Wirth and Fürnkranz, 2015), or trajectories (Christiano et al., 2017), according to the extent to which they meet task specifications. PbRL is intriguing for two primary reasons. First, as pointed out by Thurstone (1927), pairwise comparisons are less subjective than absolute scoring, allowing preferences to be collected from people who cannot quantitatively evaluate agents. This is a desirable property for human-involved tasks such as value alignment (Fisac et al., 2017) and shared autonomy (Reddy et al., 2018). Second, preference collection scales well, as a typical preference query requires comparing videos for only a few seconds (Christiano et al., 2017), which allows a large number of preferences to be collected economically using crowdsourcing.

In its online formulation (Figure 1(a)), PbRL requires on-demand assessments from humans. This is inefficient in terms of human time, because annotators are entirely occupied during policy learning, mostly waiting for trajectories. A recent idea is to adopt offline preferences for better efficiency (Shin and Brown, 2021), that is, to collect preferences for existing trajectories (called offline data), as illustrated in Figure 1(b). One caveat, however, is that there may be a distribution shift between the offline data and the agent’s behaviors. For example, consider the Pusher task from the gym library (Brockman et al., 2016), where the agent controls a robot arm to push a white cylinder to a red spot. The states in offline data (Figure 1(c) upper left) are clustered, likely due to the limited number of feasible configurations for the robot’s joints. Meanwhile, the states of the agent’s initial behaviors (Figure 1(c) middle right) are distributed uniformly. When such a distribution shift exists, we may face a generalizability problem. As the agent’s reward function is only trained for behaviors in the offline data, it might not generalize to the agent’s behaviors. In our experiments for Pusher, the learned reward function cannot predict the ranking of the agent’s behaviors (“PbRL” in Figure 2), leading to poor task performance (“PbRL” in Table 3). Since we cannot control the distribution of the offline data in practice, this generalizability problem hinders the practical use of offline preferences.

(a) A diagram for online PbRL. Annotators must wait for behaviors to be generated during policy learning.

To address the generalizability problem, we propose a framework called preference-based adversarial imitation learning (PbAIL). The key idea is to generate virtual preferences that favor offline data over the agent’s behaviors. By jointly maximizing the likelihood of offline and virtual preferences, PbAIL can learn a reward function that aligns with the agent’s behaviors. In the meantime, it also improves the agent’s policy using the learned reward function. The reward learning step and the policy learning step are alternated to ensure that the reward function always aligns with the agent’s behaviors, which is equivalent to solving a max-min objective. Since offline data can be imperfect in practice, we extend PbAIL to model the reliability of virtual preferences to handle imperfect data.

This work evaluated PbAIL from three perspectives: (i) its task performance on imperfect offline data and preferences, (ii) the individual effect of learning from virtual preferences and modeling their reliability, and (iii) how its performance changes with preference size and offline data quality. In our experiments for seven Mujoco tasks, PbAIL consistently achieves good performance when compared to existing approaches that use both offline data and preferences (Ibarz et al., 2018; Zhang et al., 2021). In particular, PbAIL can achieve better performance when compared to only using offline preferences in six of the seven tasks. In an ablation study, we confirm that it matches the return of offline data when using virtual preferences, and it achieves better performance on imperfect offline data when modeling the reliability of virtual preferences. Lastly, our results for the effect of preference size and data quality highlight that PbAIL is suitable when preferences are more accessible than high-fidelity trajectories. In summary, our contributions are as follows:

• We propose PbAIL to overcome the generalizability problem that arises when learning reward functions from offline preferences.

• We extend PbAIL to handle imperfect offline data.

• We extensively evaluate PbAIL for non-optimal offline data and limited preferences, clarifying its strength, the effects of its components, and its limitations.

The rest of this paper is organized as follows. Section 2 reviews related literature, while Section 3 provides background knowledge. Section 4 formulates the learning problem and introduces the proposed PbAIL. Section 5 presents empirical results, and Section 6 concludes the paper.

2 Related Work

PbRL has been studied for over a decade (Akrour et al., 2011; Fürnkranz et al., 2012) and applied to Atari games (Christiano et al., 2017), locomotion tasks (Lee et al., 2021a), navigation tasks (Shin and Brown, 2021), and fine-tuning language models (Ouyang et al., 2022). Recent advancements include enhancing exploration (Lee et al., 2021b), adaptive query selection (Wilde et al., 2020; Biyik et al., 2020), and improving feedback efficiency (Park et al., 2022; Liu et al., 2022). Specifically, utilizing offline preferences (Shin and Brown, 2021; Zhang and Kashima, 2023) allows for more efficient use of annotator’s time. What remains a question is its impact on reward learning, especially if there is a mismatch between the distributions of offline data and agents’ behaviors.

In the PbRL literature, the approach proposed by Ibarz et al. (2018) is related to this work, as it combines PbRL and behavior cloning (BC). However, this approach cannot handle imperfect offline data. Meanwhile, the confidence-aware imitation learning (CAIL) algorithm (Zhang et al., 2021) accepts both offline data and preferences as input, but it does not maximize the likelihood of preferences. We empirically compare PbAIL with these two approaches in Section 5.

3 Preliminaries

3.1 Reinforcement Learning

MDP

Reinforcement learning (RL) uses the Markov decision process $\langle\mathcal{S},\mathcal{A},P,r,\gamma,\mu\rangle$ (Sutton and Barto, 2018) to model sequential decision-making tasks. Here, $\mathcal{S}$ is the set of states, and $\mathcal{A}$ is the set of actions; they represent information and options available to an agent for decision-making, respectively. The transition probability $P:\mathcal{S}\times\mathcal{A}\to\mathit{\Delta}(\mathcal{S})$ governs how states transition, where $\mathit{\Delta}(\cdot)$ is the set of distributions over a set. The reward function $r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is a function that evaluates agents’ decisions, which is unknown and to be inferred in PbRL. The discount factor $\gamma\in(0,1)$ is later used to define the value function, and $\mu$ is the distribution for initial states. An MDP prescribes a protocol for sequential interaction between an agent and a hypothetical entity called the environment. Starting from an initial state sampled from $\mu$ , the agent observes a state $s\in\mathcal{S}$ from the environment and selects an action $a$ according to a stochastic policy $\pi:\mathcal{S}\to\mathit{\Delta}(\mathcal{A})$ associated with the agent. It then receives the next state from the environment, which is sampled from $P(\cdot|s,a)$ . A sequence of states and actions generated in interaction $(s_{1},a_{1},s_{2},a_{2},\dots)\overset{\mathrm{def}}{=}\tau$ is defined as a trajectory. To simplify notation, this work occasionally uses $x$ as a shorthand for a state-action pair $(s,a)$ .

RL Objective

The return of a trajectory is the sum of rewards assigned to its state-action pairs. For state $s\in\mathcal{S}$ , the value function $v^{\pi}(s)$ of a policy $\pi$ specifies the expectation of $\gamma$ -discounted return starting from an initial state $s$ and following a policy $\pi$ . It is defined as follows:

$$v^{\pi}(s)\overset{\mathrm{def}}{=}\mathbb{E}\left[\sum_{t=1}^{\infty}\gamma^{t-1}r(s_{t},a_{t})\,\middle|\,\pi,s_{1}=s\right],$$

where the expectation is taken over states (following the transition probability) and actions (following the policy). The value of a policy $\pi$ is the expectation of $v^{\pi}(s)$ over initial-state distribution $\mu$ : $v^{\pi}=\mathbb{E}_{s\sim\mu}[v^{\pi}(s)]$ . The goal of RL is to find an optimal policy $\pi^{*}$ such that $v^{\pi^{*}}\geq v^{\pi}$ for any policy $\pi$ under a given reward function $r$ .
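As a small illustration of the definition above, the sketch below computes the $\gamma$-discounted return of one finite reward sequence, which can be read as a single-sample Monte Carlo estimate of $v^{\pi}(s)$. The function name `discounted_return` and the example rewards are hypothetical and are not part of the original work.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^(t-1) * r_t for t = 1..T over one sampled trajectory."""
    total = 0.0
    # Iterate backwards so that each step applies one multiplication by gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # Three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25.
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

Averaging this quantity over many trajectories that start from $s$ and follow $\pi$ approximates the expectation in the definition.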

A useful quantity is the discounted occupancy measure, which is defined as follows.

Definition 3.1 (Puterman (1994)).

The discounted occupancy measure $\rho^{\pi}$ for policy $\pi$ is defined as

$$\rho^{\pi}(s,a)=\sum_{t=1}^{\infty}\gamma^{t-1}\Pr(s_{t}=s,a_{t}=a;\pi),$$

where $\Pr(s_{t}=s,a_{t}=a;\pi)$ is the probability of the joint event $s_{t}=s$ and $a_{t}=a$ when following the transition probability and policy $\pi$ .

The discounted occupancy measure $\rho^{\pi}$ can be interpreted as an unnormalized measure for state-action pairs generated by $\pi$ . It is straightforward to show that the normalizer of occupancy measures is $\frac{1}{1-\gamma}$ .

Corollary 3.2.

$\sum_{x}\rho^{\pi}(x)=\frac{1}{1-\gamma}$ for any policy $\pi$ .

For function $f:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ , we write $\mathbb{E}_{x\sim\rho^{\pi}}[f(x)]$ as the sum of $f(x)$ over all state-action pairs weighted by $\rho^{\pi}(x)$ with slight abuse of notation. The values can be expressed using $\rho^{\pi}$ alternatively.

Corollary 3.3.

$v^{\pi}=\mathbb{E}_{x\sim\rho^{\pi}}[r(x)]$ .

As a final remark, the following theorem relates policies with occupancy measures, which allows one to regard policies as samplers for state-action pairs.

Theorem 3.4 (Syed et al. (2008)).

Suppose $\rho$ is an occupancy measure and $\pi\overset{\text{def}}{=}\frac{\rho(s,a)}{\sum_{a}\rho(s,a)}$ . Then, $\rho$ is the occupancy measure for $\pi$ , and $\pi$ is the only policy whose occupancy measure is $\rho$ .
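The two corollaries and the theorem above can be checked numerically on a small tabular MDP. The sketch below is illustrative only: the two-state, two-action MDP, the truncation horizon, and all names are assumptions made for this example.

```python
def occupancy_measure(P, pi, mu, gamma, horizon=2000):
    """Truncated discounted occupancy measure rho[s][a] for a tabular MDP.

    P[s][a][s2] is the transition probability, pi[s][a] the policy,
    and mu[s] the initial-state distribution.
    """
    n_s, n_a = len(pi), len(pi[0])
    rho = [[0.0] * n_a for _ in range(n_s)]
    d = list(mu)      # state distribution at the current time step
    discount = 1.0    # gamma^(t-1)
    for _ in range(horizon):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                p_sa = d[s] * pi[s][a]  # Pr(s_t = s, a_t = a)
                rho[s][a] += discount * p_sa
                for s2 in range(n_s):
                    nxt[s2] += p_sa * P[s][a][s2]
        d, discount = nxt, discount * gamma
    return rho


# Hypothetical two-state, two-action MDP used only for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
pi = [[0.7, 0.3], [0.4, 0.6]]
mu = [1.0, 0.0]
r = [[0.0, 1.0], [2.0, -1.0]]
gamma = 0.9

rho = occupancy_measure(P, pi, mu, gamma)
# Corollary 3.2: the total mass is 1 / (1 - gamma).
total_mass = sum(sum(row) for row in rho)
# Corollary 3.3: the policy value is the rho-weighted sum of rewards.
value = sum(rho[s][a] * r[s][a] for s in range(2) for a in range(2))
# Theorem 3.4: normalizing rho over actions recovers the policy.
recovered = [[rho[s][a] / sum(rho[s]) for a in range(2)] for s in range(2)]
```

The recovery step requires every state to have positive occupancy, which holds here because both states are reachable from the initial state.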

3.2 Preference-based Reinforcement Learning

Thurstone’s model (Thurstone, 1927)

Preferences are formally defined using Thurstone’s model (Thurstone, 1927). Suppose there is a general object space $\mathcal{O}$ . For each object $o\in\mathcal{O}$ , this model assumes a “true” utility score $G(o)\in\mathbb{R}$ . Annotators compare objects using disturbed utility $G(o)+\epsilon$ , where $\epsilon$ is sampled from some noise distribution. Then, the probability of object $o$ being preferred over object $o^{\prime}$ follows $\Pr(o\succ o^{\prime})=\Pr(G(o)-G(o^{\prime})>\epsilon^{\prime}-\epsilon)$ , where $\epsilon^{\prime}$ is the noise associated with $o^{\prime}$ . Subsequently, we work with a special case of Thurstone’s model called the Bradley–Terry model (BT model) (Bradley and Terry, 1952). Assuming $\epsilon$ and $\epsilon^{\prime}$ are independently sampled from $\mathrm{Gumbel}(0,1)$ , we have $\Pr(o\succ o^{\prime})=\sigma(G(o)-G(o^{\prime}))$ , where $\sigma(\cdot)$ is the sigmoid function. Under this model, the log-likelihood of a preference $o\succ o^{\prime}$ is denoted by $\ell(o,o^{\prime})=\log\sigma(G(o)-G(o^{\prime}))$ .
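The BT preference probability and its log-likelihood are simple to compute. The following is a minimal sketch with hypothetical function names; it evaluates the log-sigmoid in a numerically stable form instead of taking the logarithm of the sigmoid directly.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def bt_preference_prob(g_o, g_o_prime):
    """Pr(o preferred over o') under the BT model, given utilities G(o), G(o')."""
    return sigmoid(g_o - g_o_prime)


def bt_log_likelihood(g_o, g_o_prime):
    """log sigma(G(o) - G(o')) for the preference 'o over o''."""
    z = g_o - g_o_prime
    # log(sigmoid(z)) = min(z, 0) - log1p(exp(-|z|)) avoids overflow.
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))
```

Equal utilities give a probability of 0.5, and the probabilities of the two possible outcomes of a comparison sum to one.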

Preference-based Reward Learning

We assume that the offline preferences are specified between trajectories, as they provide annotators with more information than states or state-action pairs. Under the BT model, a trajectory $\tau$ is preferred over another trajectory $\tau^{\prime}$ , denoted as $\tau\succ\tau^{\prime}$ , if the utility of $\tau$ is higher than that of $\tau^{\prime}$ . This means that $\tau$ corresponds to a more desirable outcome from an annotator’s perspective. We will use preferences between state-action pairs to formulate the proposed virtual preferences. In this case, a state-action pair $x$ is preferred over another state-action pair $x^{\prime}$ if $r(x)>r(x^{\prime})$ .

Reward functions can be learned by maximizing the likelihood of preferences under the BT model. A canonical assumption for parameterizing trajectory utility is the sum of rewards in a trajectory: $\sum_{x\in\tau}r(x)$ (Christiano et al., 2017; Ibarz et al., 2018). We use the reward of a state-action pair as its utility. A sample for trajectory preferences is written as $(\tau_{1},\tau_{2},c)$ , where $\tau_{1}$ and $\tau_{2}$ are the trajectories being compared, $c=1$ if $\tau_{1}\succ\tau_{2}$ , and $c=0$ otherwise. Suppose the reward function is parameterized by $\theta_{r}$ , and let the sum of estimated rewards for state-action pairs in a trajectory $\tau$ be $G(\tau;\theta_{r})$ . Given a set of $M$ preferences $\mathcal{Y}$ , we can learn $\theta_{r}$ by minimizing:

$$L_{\mathrm{pref}}(\theta_{r})=-\frac{1}{M}\sum_{(\tau_{1},\tau_{2},c)\in\mathcal{Y}}\left[c\log(\sigma(G(\tau_{1};\theta_{r})-G(\tau_{2};\theta_{r})))+(1-c)\log(\sigma(G(\tau_{2};\theta_{r})-G(\tau_{1};\theta_{r})))\right]. \tag{1}$$
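The preference loss is the average BT negative log-likelihood over the labeled trajectory pairs. The sketch below evaluates it for an arbitrary reward function; `preference_loss`, `reward_fn`, and the data layout are assumptions made for illustration, and gradient-based training of $\theta_{r}$ is deliberately left out.

```python
import math


def log_sigmoid(z):
    """Stable evaluation of log(sigmoid(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def preference_loss(preferences, reward_fn):
    """Negative mean log-likelihood of trajectory preferences.

    preferences: list of (tau_1, tau_2, c), where each tau is a list of
        state-action pairs and c = 1 if tau_1 is preferred, 0 otherwise.
    reward_fn: maps a state-action pair to a scalar estimated reward.
    """
    total = 0.0
    for tau_1, tau_2, c in preferences:
        g1 = sum(reward_fn(x) for x in tau_1)  # G(tau_1; theta_r)
        g2 = sum(reward_fn(x) for x in tau_2)  # G(tau_2; theta_r)
        total += c * log_sigmoid(g1 - g2) + (1 - c) * log_sigmoid(g2 - g1)
    return -total / len(preferences)
```

A reward function that ranks the preferred trajectory higher yields a lower loss than one that ranks it lower, and a constant reward function yields $\log 2$.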

4 Preference-based Adversarial Imitation Learning

This section presents the proposed PbAIL framework. After formulating the learning problem, we describe how PbAIL leverages offline data for reward learning in Section 4.2 and how it handles imperfect data in Section 4.3.

4.1 Problem Setup

An agent is provided with a set of $M$ offline trajectory preferences $\mathcal{Y}$ , which are collected for $N$ trajectories $\mathcal{D}$ generated by behavior policy $b$ . To align the learned reward function with the agent’s behaviors, we assume the agent has access to $\mathcal{D}$ . The agent is supposed to learn a reward function $r$ and a policy $\pi$ . Similar to online PbRL, new trajectories can be generated during policy learning; however, no new real preferences may be collected.

4.2 Reward Learning from Offline Trajectories

First, we borrow the notion of policy preferences (Busa-Fekete et al., 2014) to draw a connection between the behavior policy and reward learning. A policy $\pi$ is preferred over another policy $\pi^{\prime}$ , denoted as $\pi\succ\pi^{\prime}$ , if the disturbed utility of $\pi$ is larger than that of $\pi^{\prime}$ under the BT model. PbAIL assumes that $b$ is preferred over any other policy, i.e., $b\succ\pi,\forall\pi$ . Assuming the utility of a policy $\pi$ is its value $v^{\pi}$ , the log-likelihood of $b\succ\pi$ is given by $\log\Pr(b\succ\pi)=\log\sigma(v^{b}-v^{\pi})$ . To ensure $b\succ\pi$ holds for all $\pi$ , PbAIL maximizes the worst-case log-likelihood of policy preference as follows:

$$\max_{r}\min_{\pi}\log\sigma(v^{b}-v^{\pi}). \tag{2}$$

This objective involves learning a reward function $r$ and a policy $\pi$ . The maximization over $r$ enlarges the difference between $v^{b}$ and $v^{\pi}$ to make $b$ more preferable. The minimization over $\pi$ is equivalent to maximizing $v^{\pi}$ with respect to $r$ , which can be solved by RL. This work uses soft actor-critic (SAC) (Haarnoja et al., 2018) as the policy learning algorithm, since it employs the principle of maximum entropy to enhance exploration.

While the maximization over reward functions in Equation 2 can be solved by approximating values using sampled trajectories, such a direct approach is subject to high variance. This work thus proposes an efficient optimization based on the following tight lower bound. By noting that $\log\sigma(\cdot)$ is concave and the value $v^{\pi}$ can be expressed as $v^{\pi}=\mathbb{E}_{x\sim\rho^{\pi}}[r(x)]$ (Corollary 3.3), the following inequality holds from Jensen’s inequality:

$$\log\mathrm{Pr}(b\succ\pi)\geq\mathbb{E}_{x\sim\rho^{b},x^{\prime}\sim\rho^{\pi}}\ell(x,x^{\prime})\overset{\text{def}}{=}U_{b}(\pi,r). \tag{3}$$
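The individual steps behind this bound can be spelled out as follows. This is a sketch that treats $\rho^{b}$ and $\rho^{\pi}$ as normalized distributions from which $x$ and $x^{\prime}$ are drawn independently; that is an assumption made here, under which the constant normalizer $\frac{1}{1-\gamma}$ of Corollary 3.2 is taken to be absorbed into the scale of $r$.

```latex
\begin{aligned}
\log\Pr(b\succ\pi)
&= \log\sigma\left(v^{b}-v^{\pi}\right) \\
&= \log\sigma\left(\mathbb{E}_{x\sim\rho^{b}}[r(x)]-\mathbb{E}_{x'\sim\rho^{\pi}}[r(x')]\right)
  && \text{(Corollary 3.3)} \\
&= \log\sigma\left(\mathbb{E}_{x\sim\rho^{b},\,x'\sim\rho^{\pi}}\left[r(x)-r(x')\right]\right)
  && \text{(linearity of expectation)} \\
&\geq \mathbb{E}_{x\sim\rho^{b},\,x'\sim\rho^{\pi}}\left[\log\sigma\left(r(x)-r(x')\right)\right]
  && \text{(Jensen, } \log\sigma \text{ is concave)} \\
&= \mathbb{E}_{x\sim\rho^{b},\,x'\sim\rho^{\pi}}\,\ell(x,x') = U_{b}(\pi,r).
\end{aligned}
```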

The likelihood lower bound $U_{b}(\pi,r)$ can be efficiently approximated using state-action pairs. We can use samples in $\mathcal{D}$ to estimate the expectation over $\rho^{b}$ . When using off-policy RL backbones such as SAC, we can estimate the expectation over $\rho^{\pi}$ using state-action pairs in the replay buffer of the RL backbone. Let the replay buffer be $\mathcal{D}_{\text{RL}}$ . Then, the reward maximization in Equation 2 is approximated by minimizing the following loss function:

$$L_{\text{virtual}}(\theta_{r})=-\mathbb{E}_{\begin{subarray}{c}x\in\mathcal{D}\\ x^{\prime}\in\mathcal{D}_{\text{RL}}\end{subarray}}\log(\sigma(r(x;\theta_{r})-r(x^{\prime};\theta_{r}))). \tag{4}$$

Given both offline preferences $\mathcal{Y}$ and trajectories $\mathcal{D}$ , we can minimize $L_{\text{pref}}(\theta_{r})+L_{\text{virtual}}(\theta_{r})$ to ensure that the reward function learned from offline preferences aligns with the agent’s behaviors, thereby overcoming the generalizability issue.
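A Monte Carlo estimate of the loss in Equation 4 can be sketched as follows. The function name, the batch layout, and the choice to pair samples by position are assumptions made for illustration; the original implementation may sample and pair differently.

```python
import math


def log_sigmoid(z):
    """Stable evaluation of log(sigmoid(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def virtual_preference_loss(offline_batch, replay_batch, reward_fn):
    """Estimate of Equation 4 from paired samples.

    offline_batch: state-action pairs x drawn from the offline data D.
    replay_batch: state-action pairs x' drawn from the replay buffer D_RL.
    Pairs are matched by position, so both batches must have equal length.
    """
    assert len(offline_batch) == len(replay_batch)
    terms = [log_sigmoid(reward_fn(x) - reward_fn(xp))
             for x, xp in zip(offline_batch, replay_batch)]
    return -sum(terms) / len(terms)
```

The loss decreases as the reward function assigns higher rewards to offline samples than to the agent's samples, which is the adversarial pressure that the policy learning step counteracts.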

Virtual Preferences

Note that Equation 4 coincides with learning from state-action preferences $x\succ x^{\prime}$ such that $x\in\mathcal{D}$ and $x^{\prime}\in\mathcal{D}_{\text{RL}}$ . By minimizing Equation 4, we are maximizing the log-likelihood that state-action pairs in offline data $\mathcal{D}$ are preferred over state-action pairs generated by the agent. Since such preferences are not collected from annotators, we consider them to be generated by a virtual annotator and call them virtual preferences. We use $\succ_{v}$ for virtual comparisons and $\mathcal{Y}_{v}$ for virtual preferences. The use of virtual state-action preferences is a key idea of PbAIL. As will be discussed in Section 4.3, it leads to a straightforward approach for handling non-optimal offline data.

As a final remark, it is straightforward to combine recent ideas of PbRL, such as the use of data augmentation (Park et al., 2022) or active query generation (Biyik et al., 2020), with PbAIL. We leave these extensions as future work and focus solely on showing the effectiveness of PbAIL.

4.3 Handling Imperfect Data

In practice, we may encounter imperfect offline data. In this case, adapting the reward function using offline data by minimizing Equation 4 becomes problematic. Based on the interpretation of virtual preferences, we propose to handle imperfect data by inferring the reliability of virtual preferences and the reward function simultaneously. Our approach is based on a probabilistic model for noisy preferences collected from multiple annotators (Zhang and Kashima, 2023). We simplify the model since the virtual preferences are not collected from multiple annotators.

Specifically, we assume that to generate a preference for $x\in\mathcal{D}$ and $x^{\prime}\in\mathcal{D}_{\text{RL}}$ , the virtual annotator first generates a temporary label using the BT model and the ground-truth reward function. Then, it reports the label correctly with probability $\alpha(x,x^{\prime})$ and incorrectly with probability $1-\alpha(x,x^{\prime})$ . By the law of total probability, the probability for $x\succ_{v}x^{\prime}$ under this model, denoted by $\text{Pr}_{\text{imperfect}}(x\succ_{v}x^{\prime})$ , is given by

$$\text{Pr}_{\text{imperfect}}(x\succ_{v}x^{\prime})=\alpha(x,x^{\prime})\sigma(r(x)-r(x^{\prime}))+(1-\alpha(x,x^{\prime}))\sigma(r(x^{\prime})-r(x)). \tag{5}$$
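Equation 5 is a two-component mixture of the BT probability and its complement. A minimal sketch, with hypothetical names and scalar inputs, is given below.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def imperfect_preference_prob(r_x, r_xp, alpha):
    """Probability of the virtual preference 'x over x'' from Equation 5.

    r_x, r_xp: rewards r(x) and r(x'); alpha: reliability in [0, 1].
    """
    p = sigmoid(r_x - r_xp)  # BT probability of the temporary label
    # sigma(r(x') - r(x)) equals 1 - p, the probability of the flipped label.
    return alpha * p + (1.0 - alpha) * (1.0 - p)
```

With $\alpha=1$ the expression reduces to the BT probability, and with $\alpha=0.5$ it equals 0.5 regardless of the rewards, so the preference carries no information about them.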

Suppose $\alpha$ is parameterized by a neural network with parameters $\theta_{\alpha}$ . Under this model, the objective for learning from virtual preferences becomes

$$L_{\begin{subarray}{c}\text{imperfect}\\ \text{virtual}\end{subarray}}(\theta_{r},\theta_{\alpha})=-\mathbb{E}_{\begin{subarray}{c}x\in\mathcal{D}\\ x^{\prime}\in\mathcal{D}_{\text{RL}}\end{subarray}}\log(\text{Pr}_{\text{imperfect}}(x\succ_{v}x^{\prime})). \tag{6}$$

Essentially, we model the reliability of virtual preferences using $\alpha(x,x^{\prime})$ . We use the reward difference $r(x)-r(x^{\prime})$ and the index of the trajectory from which $x^{\prime}$ is drawn, denoted as $I(x^{\prime})$ , as features for this reliability network. The reward difference $r(x)-r(x^{\prime})$ is informative as it suggests the temporary label in the generative process of this probabilistic model. As for $I(x^{\prime})$ , recall that we sample state-action pairs from the replay buffer $\mathcal{D}_{\text{RL}}$ for better sample efficiency. This buffer stores state-action pairs generated in the entire policy learning process. Since the policy is generally improving during this process, state-action pairs generated during the late stage of policy learning are more likely to be better than state-action pairs in $\mathcal{D}$ than those generated earlier.

Initialization

As suggested by Zhang and Kashima (2023), this model needs an initialization phase. The intuition is that, since the agent’s policy is initialized from scratch, in the initial phase of policy learning the offline data are likely to be better than the agent’s behaviors. The objective for this phase is Equation 7:

$$L_{\begin{subarray}{c}\text{imperfect}\\ \text{virtual,init}\end{subarray}}(\theta_{r},\theta_{\alpha})=L_{\text{virtual}}(\theta_{r})-\mathbb{E}_{\begin{subarray}{c}x\in\mathcal{D}\\ x^{\prime}\sim\rho^{\pi}\end{subarray}}\log(\alpha(x,x^{\prime})). \tag{7}$$

In addition, to stabilize training, we opt to initialize the agent’s policy using BC.
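The reliability-aware objective (Equation 6) and the initialization objective (Equation 7) can be sketched side by side. In this illustration, `alpha_fn` is a hypothetical stand-in for the reliability network, and the same sampled pairs are used for both terms of the initialization objective; both choices are simplifying assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def imperfect_virtual_loss(pairs, reward_fn, alpha_fn):
    """Equation 6: negative mean log-probability of the virtual preferences."""
    total = 0.0
    for x, xp in pairs:
        p = sigmoid(reward_fn(x) - reward_fn(xp))
        a = alpha_fn(x, xp)
        total += math.log(a * p + (1.0 - a) * (1.0 - p))
    return -total / len(pairs)


def init_virtual_loss(pairs, reward_fn, alpha_fn):
    """Equation 7: virtual-preference loss plus a term that pushes alpha toward 1."""
    total = 0.0
    for x, xp in pairs:
        p = sigmoid(reward_fn(x) - reward_fn(xp))
        total += math.log(p) + math.log(alpha_fn(x, xp))
    return -total / len(pairs)
```

When the reliability is fixed to 1, both objectives coincide with the virtual-preference loss of Equation 4; the extra term in the initialization objective penalizes reliabilities below 1.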

Remark

The root cause of unreliable virtual preferences here is that our assumption $b\succ\pi$ for any $\pi$ is no longer justified. Given preferences $\mathcal{Y}$ , we expect the agent to outperform the behavior policy of $\mathcal{D}$ . Our approach reduces the problem of handling an imperfect behavior policy to inferring the reliability of state-action preferences in $\mathcal{Y}_{v}$ .

In summary, when offline trajectories are not perfect, PbAIL uses a probabilistic model for virtual preferences that can simultaneously infer the reliability of virtual preferences and learn a reward function. Algorithm 1 summarizes the algorithm for this case.

Algorithm 1 PbAIL using off-policy RL backbone

Input: Offline trajectories $\mathcal{D}$ , preferences $\mathcal{Y}$ , number of initialization steps $K_{\text{init}}$
Initialize parameters $\theta_{r}$ , $\theta_{\pi}$ , $\theta_{\alpha}$ and a replay buffer $\mathcal{D}_{\text{RL}}$
Initialize a counter for gradient steps $k$
repeat
    Interact with the environment and store a transition in $\mathcal{D}_{\text{RL}}$
    Sample a batch of data from $\mathcal{D}$ and $\mathcal{D}_{\text{RL}}$
    if $k\leq K_{\text{init}}$ then
        Take a minimization step for $L_{\text{pref}}+L_{\begin{subarray}{c}\text{imperfect}\\ \text{virtual,init}\end{subarray}}(\theta_{r},\theta_{\alpha})$
        Update the actor using the loss of BC
    else
        Take a minimization step for $L_{\text{pref}}+L_{\begin{subarray}{c}\text{imperfect}\\ \text{virtual}\end{subarray}}(\theta_{r},\theta_{\alpha})$
    end if
    Take a policy improvement step using $r$ and transitions from $\mathcal{D}_{\text{RL}}$
    Increment $k$
until $k$ exceeds the intended number of training steps.
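The control flow of Algorithm 1 can be sketched as a plain training loop. Every callable below is a hypothetical user-supplied stub, and the sketch assumes that new transitions are stored in the replay buffer $\mathcal{D}_{\text{RL}}$; it shows only how the initialization phase and the main phase alternate, not how any loss is optimized.

```python
def pbail_training_loop(num_steps, k_init, collect_transition, sample_batches,
                        reward_step_init, reward_step, bc_step, policy_step):
    """Skeleton of Algorithm 1; all callables are user-supplied stubs."""
    replay_buffer = []
    for k in range(1, num_steps + 1):
        # Interact with the environment and store the new transition.
        replay_buffer.append(collect_transition())
        offline_batch, replay_batch = sample_batches(replay_buffer)
        if k <= k_init:
            # Initialization phase: L_pref plus the objective of Equation 7,
            # with a behavior cloning update for the actor.
            reward_step_init(offline_batch, replay_batch)
            bc_step(offline_batch)
        else:
            # Main phase: L_pref plus the objective of Equation 6.
            reward_step(offline_batch, replay_batch)
        # Policy improvement with the current learned reward.
        policy_step(replay_batch)
    return replay_buffer
```

In a real implementation the stubs would wrap the reward network, the reliability network, and an off-policy learner such as SAC.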

5 Experiments

We evaluate PbAIL from three aspects: its task performance, the individual effect of virtual preferences and preference reliability modeling, and its sensitivity against trajectory quality and preference size. Details regarding experiment design and implementations can be found in the appendices.

5.1 Experiment Design

Data

We considered seven MuJoCo tasks: Ant, Walker2d, Hopper, HalfCheetah (HC for short), HumanoidStandup (HS for short), Pusher, and Swimmer. To reveal algorithms’ performance in practice, we evaluated them using imperfect offline trajectories generated by multiple policies. For each task, we used two versions of offline data: novice and mixture. We first trained five SAC agents for five million steps using different random seeds and considered their final performance as the final performance for SAC. The novice version was generated using five model checkpoints reaching 20% of the final performance, while the mixture version was generated using these checkpoints together with five checkpoints reaching 50% of the final performance. For preferences, we considered two sizes of $\mathcal{Y}$ : $|\mathcal{Y}|=1225$ and $|\mathcal{Y}|=300$ . Following prior practice (Christiano et al., 2017), preferences were generated for trajectory clips of length 60 using the ground-truth rewards.
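Generating a synthetic preference from ground-truth rewards can be sketched as below. The clip sampling scheme, the deterministic labeling rule, and the tie handling are assumptions made for illustration; the text above only states that clips of length 60 and ground-truth rewards were used.

```python
import random


def sample_clip(rewards, clip_len, rng):
    """Pick a contiguous clip of per-step ground-truth rewards."""
    start = rng.randrange(len(rewards) - clip_len + 1)
    return rewards[start:start + clip_len]


def label_pair(clip_1, clip_2):
    """Return c = 1 if clip 1 has the larger summed reward, else 0."""
    return 1 if sum(clip_1) > sum(clip_2) else 0


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical per-step rewards of two trajectories.
    traj_a = [rng.random() for _ in range(200)]
    traj_b = [rng.random() for _ in range(200)]
    clip_a = sample_clip(traj_a, 60, rng)
    clip_b = sample_clip(traj_b, 60, rng)
    print(label_pair(clip_a, clip_b))
```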

Alternative Methods

We considered the method proposed by Christiano et al. (2017) (referred to as PbRL), which is an algorithm for online PbRL. We also included two other baselines, SACfD and CAIL (Zhang et al., 2021). SACfD is an extension to the method proposed by Ibarz et al. (2018) that combines BC with PbRL. CAIL is an imitation learning algorithm that encourages the ranking of offline trajectories induced by the inferred returns to align with the given preferences. In addition, we present the results for an ablation of the reliability estimation by using Equation 4 instead of Equation 6, denoted as PbAIL^-. For reference, this paper also reports the results for SAC trained with the ground-truth rewards (denoted as GT) and the return of offline trajectories (denoted as Data).

Evaluation Metrics

The algorithms were compared for test returns after being trained for one million steps. Returns were normalized to a 0–1 scale, such that 0 corresponds to a random policy and 1 corresponds to the final performance; a negative number indicates a method is worse than random policy. We report the mean and standard deviation of results for five random seeds.
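The normalization described above is a linear rescaling between two reference returns. A minimal sketch follows; the function name and the example numbers are hypothetical.

```python
def normalized_return(ret, random_return, final_return):
    """Map a raw return to a scale where 0 is random and 1 is final SAC."""
    return (ret - random_return) / (final_return - random_return)


if __name__ == "__main__":
    # A raw return below the random policy's return maps to a negative score.
    print(normalized_return(-50.0, random_return=0.0, final_return=100.0))
```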

Implementation Details

We used SAC as the policy learner. The reward functions of PbRL, SACfD, and PbAIL, as well as the discriminator of CAIL, were parameterized with a feedforward network (FFN) of 64 units with spectral normalization (Miyato et al., 2018). We also applied weight decay to the reward functions of PbRL, SACfD, and PbAIL, and the reliability network of PbAIL. All experiments were performed on six NVIDIA A6000 GPUs. Our code is available at https://www.dropbox.com/s/g0m3qwng4cnltmt/PbAIL.zip?dl=0.

5.2 Results

Task Performance

Table 1 shows the results on the novice datasets with $|\mathcal{Y}|=1225$ . First, we compare the performance of PbRL, SACfD, and PbAIL. Except in the case of Walker2d and Swimmer, PbRL does not demonstrate competitive performance. SACfD outperforms PbRL in Ant, Walker2d, and HC, but it does not perform well for the rest. PbAIL surpasses PbRL in six of the seven tasks, achieving the best performance in five tasks and even outperforming GT in three tasks. In addition, PbAIL presents significant advantages over CAIL. These results corroborate PbAIL’s efficacy for learning from offline trajectories and offline preferences.

Ablation Study

Table 1 also shows the individual effect of using virtual preferences and inferring their reliability. The advantage of PbAIL^- over PbRL, particularly in Ant, HS, and Pusher, confirms the efficacy of using virtual preferences. The efficacy of inferring preference reliability is supported by comparing data return (denoted as Data) and the performance of PbAIL^- accompanied with the comparison between PbAIL^- and PbAIL. Except in Walker2d, PbAIL^- only matches the return of data, which means that information in offline preferences is not properly utilized. PbAIL’s substantial improvement over data return confirms the necessity of modeling virtual preference reliability when dealing with imperfect offline data.

Table 1: Algorithms’ normalized returns for novice datasets with $|\mathcal{Y}|=1225$ . We show the normalized returns of training data in “Data” column and the performance obtained with the ground-truth rewards in “GT” column. The proposed PbAIL outperforms SACfD in five tasks and CAIL in six tasks. Notably, it even outperforms GT in three tasks. PbAIL^- is a variant of PbAIL that does not handle the non-optimality of data, and it is significantly outperformed by PbAIL. These results confirm the efficacy of learning from virtual preferences and modeling their reliability.

Task	PbRL	SACfD	CAIL	PbAIL^-	PbAIL	Data	GT
Ant	$-0.35 \pm 0.02$	$0.57 \pm 0.09$	$0.59 \pm 0.09$	$0.30 \pm 0.04$	$\mathbf{0.72 \pm 0.03}$	$0.23$	$0.66$
Walker2d	$0.53 \pm 0.17$	$\mathbf{0.61 \pm 0.25}$	$0.46 \pm 0.14$	$-0.00 \pm 0.00$	$0.58 \pm 0.13$	$0.28$	$0.89$
Hopper	$0.73 \pm 0.28$	$0.66 \pm 0.25$	$0.83 \pm 0.20$	$0.31 \pm 0.07$	$\mathbf{1.13 \pm 0.01}$	$0.29$	$0.95$
HC	$0.22 \pm 0.23$	$0.44 \pm 0.20$	$0.74 \pm 0.12$	$0.25 \pm 0.02$	$\mathbf{0.76 \pm 0.05}$	$0.22$	$0.69$
HS	$0.05 \pm 0.30$	$0.05 \pm 0.17$	$0.47 \pm 0.15$	$0.55 \pm 0.16$	$\mathbf{0.64 \pm 0.15}$	$0.51$	$0.97$
Pusher	$-4.23 \pm 0.72$	$-4.55 \pm 0.85$	$0.22 \pm 0.34$	$0.39 \pm 0.06$	$\mathbf{0.71 \pm 0.09}$	$0.37$	$0.93$
Swimmer	$0.19 \pm 0.07$	$0.13 \pm 0.10$	$0.19 \pm 0.10$	$\mathbf{0.30 \pm 0.05}$	$0.12 \pm 0.19$	$0.28$	$0.64$

Effect of Preference Size

Table 2 shows the results for $|\mathcal{Y}|=300$ on the novice datasets. Compared to the results presented in Table 1, preference-based methods (PbRL, SACfD, and PbAIL) worsen significantly, which is expected as less information is available in preferences. Interestingly, with fewer preferences, CAIL tends to perform better—it improves in five tasks. This is probably due to the fact that it uses preferences to update the weights of samples in $\mathcal{D}$ for imitation learning, using a max-margin loss function for preferences. With fewer preferences, the weights of samples can be determined more easily, so its adversarial learning process is more stable.

Table 2: Performance for novice datasets with $|\mathcal{Y}|=300$ . Compared to results in Table 1, with fewer preferences, preference-based methods (PbRL, SACfD, and PbAIL) become worse, while CAIL performs better.

Task	PbRL	SACfD	CAIL	PbAIL	Data	GT
Ant	$-0.37 \pm 0.02$	$-0.37 \pm 0.01$	$\mathbf{0.61 \pm 0.07}$	$0.56 \pm 0.03$	$0.23$	$0.66$
Walker2d	$0.39 \pm 0.15$	$0.34 \pm 0.16$	$\mathbf{0.49 \pm 0.07}$	$0.15 \pm 0.13$	$0.28$	$0.89$
Hopper	$0.63 \pm 0.32$	$0.38 \pm 0.06$	$0.74 \pm 0.21$	$\mathbf{0.79 \pm 0.28}$	$0.29$	$0.95$
HC	$0.35 \pm 0.20$	$0.12 \pm 0.22$	$\mathbf{0.82 \pm 0.08}$	$0.14 \pm 0.26$	$0.22$	$0.69$
HS	$0.12 \pm 0.13$	$0.18 \pm 0.21$	$0.32 \pm 0.34$	$\mathbf{0.64 \pm 0.12}$	$0.51$	$0.97$
Pusher	$-4.55 \pm 0.76$	$-4.44 \pm 0.75$	$0.40 \pm 0.11$	$\mathbf{0.58 \pm 0.08}$	$0.37$	$0.93$
Swimmer	$\mathbf{0.32 \pm 0.10}$	$0.26 \pm 0.15$	$0.25 \pm 0.14$	$0.16 \pm 0.27$	$0.28$	$0.64$

Effect of Trajectory Quality

Table 3 shows the results for the mixture datasets with $|\mathcal{Y}|=1225$. The quality of offline data increases by 0.1 to 0.2 when compared with the case in Table 1. PbRL and PbAIL cannot benefit from the improved quality, but PbAIL is the only method that attains the best performance in three tasks. SACfD improves in Ant and Walker2d but fails in HS. CAIL benefits the most from the improved quality of offline trajectories. The results in Table 1 and Table 3 imply that PbAIL is suitable for scenarios where preferences are accessible, whereas imitation-based methods (SACfD and CAIL) can be better when good demonstrations can be collected.

Table 3: Results on mixture datasets with $|\mathcal{Y}|=1225$. Compared to results in Table 1, PbRL and PbAIL cannot benefit from the improved quality of offline trajectories. SACfD improves in Ant and Walker2d, but fails in HS. CAIL has significant improvement in Ant, Walker2d, and Hopper.

Task	PbRL	SACfD	CAIL	PbAIL	Data	GT
Ant	$-0.33 \pm 0.05$	$\mathbf{0.74 \pm 0.02}$	$0.71 \pm 0.02$	$0.59 \pm 0.03$	$0.34$	$0.66$
Walker2d	$0.53 \pm 0.25$	$\mathbf{0.71 \pm 0.23}$	$0.61 \pm 0.02$	$0.63 \pm 0.10$	$0.38$	$0.89$
Hopper	$0.76 \pm 0.24$	$0.49 \pm 0.16$	$0.90 \pm 0.05$	$\mathbf{1.13 \pm 0.03}$	$0.49$	$0.95$
HC	$0.34 \pm 0.24$	$0.25 \pm 0.28$	$\mathbf{0.74 \pm 0.04}$	$0.69 \pm 0.03$	$0.37$	$0.69$
HS	$0.06 \pm 0.23$	NaN	$0.44 \pm 0.17$	$\mathbf{0.62 \pm 0.14}$	$0.52$	$0.97$
Pusher	$-5.26 \pm 0.45$	$-5.05 \pm 0.52$	$-0.10 \pm 0.83$	$\mathbf{0.66 \pm 0.04}$	$0.45$	$0.93$
Swimmer	$\mathbf{0.38 \pm 0.05}$	$0.33 \pm 0.04$	$0.10 \pm 0.19$	$0.14 \pm 0.05$	$0.40$	$0.64$

Reward Generalizability

Finally, let us discuss our claim on the generalizability problem. We analyzed the generalizability of reward functions using the mixture datasets with 1225 preferences. For each method, we took 10 model checkpoints in the early stage of policy learning. We then rolled out trajectories using the actor of each checkpoint and inferred their returns using the corresponding reward function. Kendall's rank correlation coefficient between the inferred and ground-truth returns reflects how well the ranking of trajectories by inferred returns matches the ranking by ground-truth returns. The higher it is, the better the two rankings match, which means that the reward function generalizes better to the learning agent's behaviors.
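The rank-agreement measure used here can be sketched in a few lines. This is a minimal pure-Python sketch of Kendall's tau-a (no tie correction), not the evaluation code used for the experiments; the function name `kendall_tau` and the example returns are hypothetical.

```python
from itertools import combinations


def kendall_tau(inferred_returns, true_returns):
    """Kendall's tau-a between two equally long lists of trajectory returns."""
    if len(inferred_returns) != len(true_returns):
        raise ValueError("both lists must have the same length")
    n = len(inferred_returns)
    if n < 2:
        raise ValueError("need at least two trajectories")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether both returns order the pair alike.
        product = (inferred_returns[i] - inferred_returns[j]) * (
            true_returns[i] - true_returns[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical returns for four rollouts (illustrative values only).
inferred = [0.2, 1.5, 0.9, 3.1]
true = [10.0, 40.0, 55.0, 90.0]
tau = kendall_tau(inferred, true)
```

In this toy example five of the six trajectory pairs are ordered consistently and one is not, so the coefficient is (5 - 1) / 6. A value of 1 means the learned reward ranks rollouts exactly as the ground-truth reward does, and -1 means the ranking is fully reversed.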

Figure 2 shows the results for Ant, HC, and Pusher. The reward function of PbRL fails to generalize for Ant and Pusher, which explains PbRL’s poor performance for the two tasks. Compared to PbRL, the reward function of SACfD generalizes better for Ant but fails for Pusher. Meanwhile, for all of these tasks, PbAIL’s reward functions demonstrate good generalizability, which explains its performance.

In summary, our takeaways are threefold.

• The proposed PbAIL is competitive for all three cases. Both the use of virtual preferences and modeling their reliability work as expected.

• Preference-based methods tend to degenerate with fewer preferences. Imitation-based approaches better enjoy the improved quality of offline trajectories.

• Our claim for the generalizability issue of reward functions is supported by rank correlation between the inferred and true trajectory returns.

These findings support the efficacy of our proposals. Moreover, they shed light on how to select algorithms for complex real-world tasks. If collecting preferences is more viable than collecting good trajectories, then PbAIL is the most suitable method. Otherwise, approaches that imitate trajectories directly are better.

6 Conclusion

PbRL is a setting that learns a reward function from preferences, which represent human evaluation of agents’ behaviors. Recently, the use of offline preferences was proposed to make better use of human time. In this case, the preferences are collected for certain offline data. However, as the offline data may follow a different distribution from a learning agent’s behaviors, reward functions learned from offline preferences may fail to generalize to the agent’s behaviors. In response to this issue, the present study proposes PbAIL, a framework that overcomes this drawback by using virtual preferences generated from offline data. PbAIL learns a reward function by jointly maximizing the likelihood of offline preferences and virtual preferences, which aligns the learned reward function with the agent’s behaviors. Furthermore, this work extends PbAIL to imperfect offline data, thus broadening its applicability. Through experiments on continuous control tasks, we verified the efficacy of using virtual preferences and handling data imperfection, and we also discussed the advantages and limitations of PbAIL. As for future work, it would be interesting to consider extensions that do not explicitly require offline trajectories.

References

Akrour et al. (2011) R. Akrour, M. Schoenauer, and M. Sebag. Preference-based policy learning. In Machine Learning and Knowledge Discovery in Databases, pages 12–27. Springer Berlin Heidelberg, 2011.
Biyik et al. (2020) E. Biyik, N. Huynh, M. J. Kochenderfer, and D. Sadigh. Active preference-based Gaussian process regression for reward learning. In Robotics: Science and Systems XVII, 2020.
Bradley and Terry (1952) R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
Brockman et al. (2016) G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
Busa-Fekete et al. (2014) R. Busa-Fekete, B. Szörényi, P. Weng, W. Cheng, and E. Hüllermeier. Preference-based reinforcement learning: Evolutionary direct policy search using a preference-based racing algorithm. Machine Learning, 97(3):327–351, 2014.
Christiano et al. (2017) P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4302–4310. Curran Associates, Inc., 2017.
Fisac et al. (2017) J. F. Fisac, M. A. Gates, J. B. Hamrick, C. Liu, D. Hadfield-Menell, M. Palaniappan, D. Malik, S. S. Sastry, T. L. Griffiths, and A. D. Dragan. Pragmatic-pedagogic value alignment. In Proceedings of the Eighteenth International Symposium on Robotics Research, pages 49–57. Springer International Publishing, 2017.
Fürnkranz et al. (2012) J. Fürnkranz, E. Hüllermeier, W. Cheng, and S.-H. Park. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine Learning, 89(1):123–156, 2012.
Haarnoja et al. (2018) T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the Thirty-Fifth International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
Ibarz et al. (2018) B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei. Reward learning from human preferences and demonstrations in Atari. In Advances in Neural Information Processing Systems, pages 8022–8034. Curran Associates Inc., 2018.
Lee et al. (2021a) K. Lee, L. Smith, A. Dragan, and P. Abbeel. B-pref: Benchmarking preference-based reinforcement learning. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021a.
Lee et al. (2021b) K. Lee, L. M. Smith, and P. Abbeel. Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. In Proceedings of the Thirty-Eighth International Conference on Machine Learning, pages 6152–6163. PMLR, 2021b.
Liu et al. (2022) R. Liu, F. Bai, Y. Du, and Y. Yang. Meta-reward-net: Implicitly differentiable reward learning for preference-based reinforcement learning. In Advances in Neural Information Processing Systems, 2022.
Miyato et al. (2018) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In Proceedings of the Sixth International Conference on Learning Representations, 2018.
Orsini et al. (2021) M. Orsini, A. Raichuk, L. Hussenot, D. Vincent, R. Dadashi, S. Girgin, M. Geist, O. Bachem, O. Pietquin, and M. Andrychowicz. What matters for adversarial imitation learning? In Advances in Neural Information Processing Systems, pages 14656–14668. Curran Associates, Inc., 2021.
Ouyang et al. (2022) L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback, 2022.
Park et al. (2022) J. Park, Y. Seo, J. Shin, H. Lee, P. Abbeel, and K. Lee. SURF: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning. In Proceedings of the Tenth International Conference on Learning Representations, 2022.
Puterman (1994) M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994. ISBN 0471619779.
Reddy et al. (2018) S. Reddy, A. D. Dragan, and S. Levine. Shared autonomy via deep reinforcement learning. In Proceedings of the Robotics: Science and Systems XIV. MIT Press, 2018.
Shin and Brown (2021) D. Shin and D. Brown. Offline preference-based apprenticeship learning. In Proceedings of the Workshop on Human-AI Collaboration in Sequential Decision-Making at the Thirty-Eighth International Conference on Machine Learning, 2021.
Sutton and Barto (2018) R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2018.
Syed et al. (2008) U. Syed, M. Bowling, and R. E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the Twenty-Fifth International Conference on Machine Learning, pages 1032–1039. Association for Computing Machinery, 2008.
Thurstone (1927) L. L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273–286, 1927.
Wilde et al. (2020) N. Wilde, D. Kulic, and S. L. Smith. Active preference learning using maximum regret. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 10952–10959. IEEE, 2020.
Wirth and Fürnkranz (2015) C. Wirth and J. Fürnkranz. On learning from game annotations. IEEE Transactions on Computational Intelligence and AI in Games, 7(3):304–316, 2015.
Zhang and Kashima (2023) G. Zhang and H. Kashima. Batch reinforcement learning from crowds. In Machine Learning and Knowledge Discovery in Databases, pages 38–51. Springer Cham, 2023.
Zhang et al. (2021) S. Zhang, Z. Cao, D. Sadigh, and Y. Sui. Confidence-aware imitation learning from demonstrations with varying optimality. In Advances in Neural Information Processing Systems, pages 12340–12350. Curran Associates, Inc., 2021.

Appendix A Data

Offline Data

To generate the offline data used in the experiments, we first trained five SAC agents for five million steps using different random seeds and the hyperparameters reported in Table 5. As shown in Table 4, their final performance matches the results reported for SAC by Haarnoja et al. [2018]. To compute the normalized returns of algorithms, we treat policies initialized from scratch as the random policies. To evaluate algorithms in realistic settings, this work considers two versions of offline data: novice and mixture. The novice version was generated using five model checkpoints for different random seeds that reached 20% of the final performance. For the mixture version, besides the five model checkpoints of the novice version, we also used five model checkpoints for different random seeds that reached 20% of the final performance. Both versions contain 50 trajectories, and each model checkpoint contributed the same number of trajectories.

Preferences

Following prior practice for PbRL [Christiano et al., 2017], we consider preferences between short clips of trajectories for better efficiency. From each trajectory, we sampled one clip of length 30 in Pusher and 60 in other tasks, which yields 50 trajectory clips. The case of $|\mathcal{Y}|=1225$ involves all of the paired comparisons for these 50 clips, and the case of $|\mathcal{Y}|=300$ involves paired comparisons for 25 of the 50 clips. The preference labels were generated using the ground-truth rewards. In other words, for two trajectory clips $\eta_{1}$ and $\eta_{2}$, $\eta_{1}\succ\eta_{2}$ if $\sum_{x\in\eta_{1}}r(x)>\sum_{x\in\eta_{2}}r(x)$ under the ground-truth reward function $r$.
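The labeling rule above can be sketched as follows. This is a minimal illustration rather than the script used to build the datasets: the function name `label_preferences`, the tuple layout, and the toy reward values are hypothetical, and skipping tied pairs is an assumption because the text does not say how ties are handled.

```python
from itertools import combinations


def label_preferences(clip_rewards):
    """Return (i, j, label) for every pair of clips.

    clip_rewards[i] holds the ground-truth rewards along clip i.
    label is 1 if clip i is preferred and 0 if clip j is preferred.
    """
    returns = [sum(rewards) for rewards in clip_rewards]
    preferences = []
    for i, j in combinations(range(len(returns)), 2):
        if returns[i] == returns[j]:
            continue  # assumption: tied pairs are skipped
        preferences.append((i, j, 1 if returns[i] > returns[j] else 0))
    return preferences


# Toy data: 50 clips of length 60 with distinct returns (illustrative only).
clips = [[float(k)] * 60 for k in range(50)]
all_pairs = label_preferences(clips)        # 50 * 49 / 2 = 1225 comparisons
subset_pairs = label_preferences(clips[:25])  # 25 * 24 / 2 = 300 comparisons
```

The two counts match the preference sizes used in the experiments: all pairs among 50 clips give 1225 comparisons, and all pairs among 25 clips give 300.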

Table 4: The final performance and performance of random policies.

Task	Ant	Walker2d	Hopper	HC	HS	Pusher	Swimmer
Final Performance	6983.98	4656.05	3097.87	14646.87	157117.92	-20.50	125.57
Random Policy	989.65	17.08	46.98	-1.44	42788.13	-58.73	7.01
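This appendix does not restate how normalized returns are computed from the values in Table 4. A minimal sketch, assuming the common convention that maps the random policy to 0 and the final SAC performance to 1, is given below. The function name `normalized_return` and the dictionary names are hypothetical; only the numbers are copied from Table 4.

```python
# Values copied from Table 4 for two of the tasks.
FINAL_PERFORMANCE = {"Hopper": 3097.87, "Pusher": -20.50}
RANDOM_POLICY = {"Hopper": 46.98, "Pusher": -58.73}


def normalized_return(task, raw_return):
    """Assumed convention: 0 for the random policy, 1 for the final SAC policy."""
    low = RANDOM_POLICY[task]
    high = FINAL_PERFORMANCE[task]
    return (raw_return - low) / (high - low)


# A raw Pusher return halfway between the two reference values.
halfway = normalized_return("Pusher", (-58.73 + -20.50) / 2.0)
```

Under this assumed convention, a negative normalized score means the policy performs worse than the random policy, and a score above 1 means it exceeds the final SAC performance.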

Table 5: Hyperparameters for the behavior policies.

Parameter	Value
optimizer	Adam
learning rate for the actor	$3\times 10^{-4}$
learning rate for the critic	$3\times 10^{-4}$
whether to train entropy weight	Yes
initial value for entropy weight	0.1
learning rate for entropy weight	$3\times 10^{-4}$
#hidden layers (all networks)	2
#units per layer for the actor	256
#units per layer for the critic	256
# training steps	1
# environment steps per training step	1
activation function (all networks)	ReLU
interval for updating the target network	1
smoothing coefficient for target updates	$5\times 10^{-3}$
size of replay buffer	$1\times 10^{6}$
discount factor	0.99

Appendix B Implementation Details

As suggested by Orsini et al. [2021], we used SAC as the policy learner for all algorithms, due to its good sample efficiency. The hyperparameters of the policy learner are the same as those for behavior policies (reported in Table 5), except for the smoothing coefficient for target updates, which is changed to 0.001 for better stability. During training, we used the squashed Gaussian policy to enhance exploration [Haarnoja et al., 2018]; at test time we used the mode of the policy, which is deterministic.
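The difference between training-time and test-time action selection can be sketched as below. This is a minimal pure-Python sketch, assuming the common practice of squashing the Gaussian mean with tanh as the deterministic action; the function name `squashed_gaussian_action` and its arguments are hypothetical and do not come from the authors' code.

```python
import math
import random


def squashed_gaussian_action(mean, log_std, deterministic=False, rng=random):
    """Per-dimension action of a tanh-squashed Gaussian policy.

    mean and log_std are lists with one entry per action dimension.
    """
    action = []
    for mu, ls in zip(mean, log_std):
        if deterministic:
            # Test time: squash the Gaussian mean (assumed stand-in for the mode).
            pre_tanh = mu
        else:
            # Training: sample from the Gaussian before squashing.
            pre_tanh = mu + math.exp(ls) * rng.gauss(0.0, 1.0)
        action.append(math.tanh(pre_tanh))
    return action


# Hypothetical two-dimensional policy output.
train_action = squashed_gaussian_action([0.0, 1.0], [-1.0, -1.0])
test_action = squashed_gaussian_action([0.0, 1.0], [-1.0, -1.0], deterministic=True)
```

The tanh keeps every action component within [-1, 1], and the deterministic branch returns the same action for the same state on every call.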

We used the same parameterization for the reward functions of PbRL, SACfD, and PbAIL, as well as the discriminator of CAIL, so the differences in their performance could show the efficacy of our proposals. The reward functions (discriminators) were parameterized with an FFN of 64 units with spectral normalization [Miyato et al., 2018]. For PbRL, SACfD, and PbAIL, we also applied weight decay to their reward functions. The coefficient is task-specific but shared across the three methods, and its values are presented in Table 6. These values were selected from $\{0,0.0025,0.005\}$ via grid search. The batch size was 256 for all algorithms. We now report additional details of these methods.

Table 6: The coefficients of weight decay for reward functions.

Task	Ant	Walker2d	Hopper	HC	HS	Pusher	Swimmer
Coefficient	0.0025	0	0	0.0025	0.0025	0	0

Details for PbRL

PbRL learns a reward function from preferences before policy learning and uses the learned function to infer rewards of state-action pairs generated during policy learning. When learning the reward function, we used the Adam optimizer with a learning rate of $0.0001$ and ran optimization for 10,000 steps.

Details for SACfD

SACfD is an extension of the method proposed by Ibarz et al. [2018]. The method was originally proposed for discrete control tasks, so we adapted it to use SAC. Compared to PbRL, it adds the objective of BC to the objective of the actor. The reward function of SACfD was learned in the same way as the reward function of PbRL. As described by Ibarz et al. [2018], the actor of the policy learner was pre-trained using the objective of BC and the provided offline data before policy learning. The pre-training took 10,000 steps.

Let $L_{\text{actor}}$ be the objective function for learning the actor of SAC. During policy learning, SACfD minimizes $L_{\text{actor}}-\mathbb{E}_{(s,a)\in\mathcal{D}}\left[\mathbb{I}(Q^{\pi}(s,a)>Q^{\pi}(s,\pi_{\text{mode}}(s)))\log(\pi(a|s))\right]$, where $\mathbb{I}$ is the indicator function and $\pi_{\text{mode}}$ is the mode of $\pi$. In other words, it regularizes the actor using BC only for states at which the action in the offline data has a larger Q-value than the action induced by $\pi$.
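The filtered BC term can be sketched on a batch of offline state-action pairs as follows. This is a minimal sketch that takes precomputed Q-values and log-probabilities as plain lists of numbers and replaces the expectation with a batch average; the function names `filtered_bc_term` and `sacfd_actor_loss` and the example values are hypothetical.

```python
def filtered_bc_term(q_data, q_mode, log_prob_data):
    """Batch estimate of E[ I(Q(s,a) > Q(s, pi_mode(s))) * log pi(a|s) ].

    q_data[k]        : Q-value of the offline action at state k
    q_mode[k]        : Q-value of the policy's mode action at state k
    log_prob_data[k] : log pi(a|s) of the offline action at state k
    """
    total = 0.0
    for qd, qm, lp in zip(q_data, q_mode, log_prob_data):
        if qd > qm:  # indicator: the offline action looks better than the mode
            total += lp
    return total / len(q_data)


def sacfd_actor_loss(actor_loss, q_data, q_mode, log_prob_data):
    """L_actor minus the filtered behavior-cloning term."""
    return actor_loss - filtered_bc_term(q_data, q_mode, log_prob_data)


# Hypothetical batch of three offline pairs; only the first and third pass the filter.
loss = sacfd_actor_loss(
    actor_loss=0.4,
    q_data=[1.0, 0.5, 2.0],
    q_mode=[0.8, 0.9, 1.0],
    log_prob_data=[-1.0, -3.0, -2.0],
)
```

Pairs whose offline action has a lower Q-value than the mode action contribute nothing, so the actor is pulled toward the offline data only where the critic judges the data to be better than the current policy.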

Details for PbAIL

As described in Algorithm 1, PbAIL needs an initialization phase. In all tasks we initialized PbAIL for 10,000 steps. We applied weight decay to the reliability network in addition to the reward network. The coefficients for the reliability network were selected from $\{0,0.0025,0.005\}$ via grid search and are reported in Table 7.

Table 7: The coefficients of weight decay for the reliability network of PbAIL.

Task	Ant	Walker2d	Hopper	HC	HS	Pusher	Swimmer
Coefficient	0	0.005	0	0.005	0.005	0.0025	0.0025

Table 8: Results on real human preferences.

Task	PbRL	SACfD	CAIL	PbAIL	Data	GT
Hopper-medium-expert	$0.57 \pm 0.23$	$0.44 \pm 0.12$	$\mathbf{1.07 \pm 0.10}$	$0.65 \pm 0.36$	$0.67$	$0.66$
Hopper-medium-replay	$0.28 \pm 0.19$	$0.28 \pm 0.26$	$0.43 \pm 0.14$	$\mathbf{0.45 \pm 0.13}$	$0.38$	$0.14$
Walker2d-medium-expert	$0.06 \pm 0.07$	$0.21 \pm 0.05$	$\mathbf{0.90 \pm 0.05}$	$0.73 \pm 0.35$	$0.81$	$0.95$
Walker2d-medium-replay	$0.10 \pm 0.10$	$0.08 \pm 0.12$	$\mathbf{0.44 \pm 0.15}$	$-0.01 \pm 0.08$	$0.14$	$0.69$