License: CC BY-NC-ND 4.0
arXiv:2403.10160v1 [cs.LG] 15 Mar 2024

Online Policy Learning from Offline Preferences

Guoxi Zhang
Kyoto University
Kyoto, Japan
[email protected]
Han Bao
Kyoto University
Kyoto, Japan
[email protected]
Hisashi Kashima
Kyoto University
Kyoto, Japan
[email protected]
This work was done when the author was a student at Kyoto University.
Abstract

In preference-based reinforcement learning (PbRL), a reward function is learned from a type of human feedback called preference. To expedite preference collection, recent works have leveraged offline preferences, which are preferences collected for some offline data. In this scenario, the learned reward function is fitted on the offline data. If a learning agent exhibits behaviors that do not overlap with the offline data, the learned reward function may encounter generalizability issues. To address this problem, the present study introduces a framework that consolidates offline preferences and virtual preferences for PbRL, which are comparisons between the agent’s behaviors and the offline data. Critically, the reward function can track the agent’s behaviors using the virtual preferences, thereby offering well-aligned guidance to the agent. Through experiments on continuous control tasks, this study demonstrates the effectiveness of incorporating the virtual preferences in PbRL.

1 Introduction

Preference-based reinforcement learning (PbRL) is a setting for developing agents using human preferences (Akrour et al., 2011). A preference can be the outcome of a comparison between a pair of actions (Fürnkranz et al., 2012), states (Wirth and Fürnkranz, 2015), or trajectories (Christiano et al., 2017), according to the extent to which they meet task specifications. PbRL is intriguing for two primary reasons. First, as pointed out by Thurstone (1927), pairwise comparisons are less subjective than absolute scoring, allowing preferences to be collected from people who cannot quantitatively evaluate agents. This is a desirable property for human-involved tasks such as value alignment (Fisac et al., 2017) and shared autonomy (Reddy et al., 2018). Second, preference collection scales well, as a typical preference query requires comparing videos for only a few seconds (Christiano et al., 2017), which allows for collecting a large number of preferences economically using crowdsourcing.

In its online formulation (Figure 1(a)), PbRL requires on-demand assessments from humans. This is inefficient in terms of human time, because annotators are entirely occupied during policy learning, mostly waiting for trajectories. A recent idea is to adopt offline preferences for better efficiency (Shin and Brown, 2021), that is, to collect preferences for certain existing trajectories (called offline data), as illustrated in Figure 1(b). However, one caveat is that there may be a distribution shift between the offline data and the agent’s behaviors. For example, consider the Pusher task from the gym library (Brockman et al., 2016), where the agent controls a robot arm to push a white cylinder to a red spot. The states in the offline data (Figure 1(c), upper left) are clustered, likely due to the limited number of feasible configurations for the robot’s joints. Meanwhile, the states of the agent’s initial behaviors (Figure 1(c), middle right) are distributed uniformly. When such a distribution shift exists, we may face a generalizability problem: as the reward function is only trained on behaviors in the offline data, it might not generalize to the agent’s behaviors. In our experiments for Pusher, the learned reward function cannot predict the ranking of the agent’s behaviors (“PbRL” in Figure 2), leading to poor task performance (“PbRL” in Table 8). Since we cannot control the distribution of the offline data in practice, this generalizability problem hinders the practical use of offline preferences.

(a) A diagram for online PbRL. Annotators must wait for behaviors to be generated during policy learning.
(b) A diagram for using offline preferences. All preferences are collected before policy learning.
(c) Upper left: An example of behaviors in the offline data. The robot arm is pushing the white cylinder to the red spot. Middle right: An example of the agent’s behaviors in the initial phase of policy learning. The robot arm is far away from the white cylinder. There is little overlap between the state distribution of the offline data and that of the agent’s initial behaviors, so reward functions learned from offline preferences may not align with the agent’s performance.
Figure 1: A diagram for online PbRL (a), learning from offline preferences (b), and the generalizability problem of using offline preferences (c).

To address the generalizability problem, we propose a framework called preference-based adversarial imitation learning (PbAIL). The key idea is to generate virtual preferences that favor offline data over the agent’s behaviors. By jointly maximizing the likelihood of offline and virtual preferences, PbAIL can learn a reward function that aligns with the agent’s behaviors. In the meantime, it also improves the agent’s policy using the learned reward function. The reward learning step and the policy learning step are alternated to ensure that the reward function always aligns with the agent’s behaviors, which is equivalent to solving a max-min objective. Since offline data can be imperfect in practice, we extend PbAIL to model the reliability of virtual preferences.

This work evaluates PbAIL from three perspectives: (i) its task performance on imperfect offline data and preferences, (ii) the individual effect of learning from virtual preferences and modeling their reliability, and (iii) how its performance changes with preference size and offline data quality. In our experiments on seven MuJoCo tasks, PbAIL consistently achieves good performance when compared to existing approaches that use both offline data and preferences (Ibarz et al., 2018; Zhang et al., 2021). In particular, PbAIL achieves better performance than using only offline preferences in six of the seven tasks. In an ablation study, we confirm that it matches the return of offline data when using virtual preferences, and it achieves better performance on imperfect offline data when modeling the reliability of virtual preferences. Lastly, our results for the effect of preference size and data quality highlight that PbAIL is suitable when preferences are more accessible than high-fidelity trajectories. In summary, our contributions are as follows:

  • We propose PbAIL to overcome the generalizability problem that arises when learning reward functions from offline preferences.

  • We extend PbAIL to handle imperfect offline data.

  • We extensively evaluate PbAIL for non-optimal offline data and limited preferences, clarifying its strength, the effects of its components, and its limitations.

The rest of this paper is organized as follows. Section 2 reviews related literature, while Section 3 provides background knowledge. Section 4 formulates the learning problem and introduces the proposed PbAIL. Section 5 presents empirical results, and Section 6 concludes the paper.

2 Related Work

PbRL has been studied for over a decade (Akrour et al., 2011; Fürnkranz et al., 2012) and applied to Atari games (Christiano et al., 2017), locomotion tasks (Lee et al., 2021a), navigation tasks (Shin and Brown, 2021), and fine-tuning language models (Ouyang et al., 2022). Recent advancements include enhancing exploration (Lee et al., 2021b), adaptive query selection (Wilde et al., 2020; Biyik et al., 2020), and improving feedback efficiency (Park et al., 2022; Liu et al., 2022). Specifically, utilizing offline preferences (Shin and Brown, 2021; Zhang and Kashima, 2023) allows for more efficient use of annotators’ time. What remains an open question is its impact on reward learning, especially if there is a mismatch between the distributions of offline data and agents’ behaviors.

In the PbRL literature, the approach proposed by Ibarz et al. (2018) is related to this work, as it combines PbRL and behavior cloning (BC). However, this approach cannot handle imperfect offline data. Meanwhile, the confidence-aware imitation learning (CAIL) algorithm (Zhang et al., 2021) accepts both offline data and preferences as input, but it does not maximize the likelihood of preferences. We empirically compare PbAIL with these two approaches in Section 5.

3 Preliminaries

3.1 Reinforcement Learning

MDP

Reinforcement learning (RL) uses the Markov decision process $\langle\mathcal{S},\mathcal{A},P,r,\gamma,\mu\rangle$ (Sutton and Barto, 2018) to model sequential decision-making tasks. Here, $\mathcal{S}$ is the set of states, and $\mathcal{A}$ is the set of actions; they represent information and options available to an agent for decision-making, respectively. The transition probability $P:\mathcal{S}\times\mathcal{A}\to\mathit{\Delta}(\mathcal{S})$ governs how states transit, where $\mathit{\Delta}(\cdot)$ is the set of distributions over a set. The reward function $r:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ is a function that evaluates agents’ decisions, which is unknown and to be inferred in PbRL. The discount factor $\gamma\in(0,1)$ is later used to define the value function, and $\mu$ is the distribution for initial states. An MDP prescribes a protocol for sequential interaction between an agent and a hypothetical entity called the environment. Starting from an initial state sampled from $\mu$, the agent observes a state $s\in\mathcal{S}$ from the environment and selects an action $a$ according to a stochastic policy $\pi:\mathcal{S}\to\mathit{\Delta}(\mathcal{A})$ associated with the agent. It then receives the next state from the environment, which is sampled from $P(\cdot|s,a)$.
A sequence of states and actions generated in interaction, $(s_{1},a_{1},s_{2},a_{2},\dots)\overset{\mathrm{def}}{=}\tau$, is defined as a trajectory. To simplify notations, this work occasionally uses $x$ as a shorthand notation for a state-action pair $(s,a)$.
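The interaction protocol above can be sketched in a few lines of Python. This is only an illustration: the two-state, two-action MDP, its transition table `P`, the initial distribution `MU`, and the policy table `POLICY` are hypothetical values chosen for the example and do not come from the experiments in this paper.

```python
import random

# Hypothetical two-state, two-action MDP used only for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
# P[s][a] lists the next-state probabilities P(. | s, a).
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.1, 0.9]},
}
MU = [1.0, 0.0]                           # initial-state distribution mu
POLICY = {0: [0.5, 0.5], 1: [0.3, 0.7]}   # stochastic policy pi(. | s)


def sample_trajectory(horizon, rng):
    """Roll out (s_1, a_1, ..., s_T, a_T) by following the MDP protocol."""
    s = rng.choices(STATES, weights=MU)[0]          # s_1 ~ mu
    trajectory = []
    for _ in range(horizon):
        a = rng.choices(ACTIONS, weights=POLICY[s])[0]   # a_t ~ pi(. | s_t)
        trajectory.append((s, a))
        s = rng.choices(STATES, weights=P[s][a])[0]      # s_{t+1} ~ P(. | s_t, a_t)
    return trajectory


tau = sample_trajectory(horizon=5, rng=random.Random(0))
```

Each element of `tau` is a state-action pair $x=(s,a)$; a finite horizon is used here only because an infinite trajectory cannot be materialized.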

RL Objective

The return of a trajectory is the sum of rewards assigned to its state-action pairs. For state $s\in\mathcal{S}$, the value function $v^{\pi}(s)$ of a policy $\pi$ specifies the expectation of $\gamma$-discounted return starting from an initial state $s$ and following a policy $\pi$. It is defined as follows:

$$v^{\pi}(s)\overset{\mathrm{def}}{=}\mathbb{E}\left[\sum_{t=1}^{\infty}\gamma^{t-1}r(s_{t},a_{t})\,\middle|\,\pi,s_{1}=s\right],$$

where the expectation is taken over states (following the transition probability) and actions (following the policy). The value of a policy $\pi$ is the expectation of $v^{\pi}(s)$ over initial-state distribution $\mu$: $v^{\pi}=\mathbb{E}_{s\sim\mu}[v^{\pi}(s)]$. The goal of RL is to find an optimal policy $\pi^{*}$ such that $v^{\pi^{*}}\geq v^{\pi}$ for any policy $\pi$ under a given reward function $r$.
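As a small illustration of the quantity inside the expectation, the following sketch computes the $\gamma$-discounted return of a finite reward sequence. The function name `discounted_return` is introduced here for illustration; averaging it over many sampled trajectories that start from $s$ would give a Monte Carlo estimate of $v^{\pi}(s)$, truncated at the trajectory length.

```python
def discounted_return(rewards, gamma):
    """Return sum_{t=1}^{T} gamma^(t-1) * r_t for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```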

A useful quantity is the discounted occupancy measure, which is defined as follows.

Definition 3.1 (Puterman (1994)).

The discounted occupancy measure $\rho^{\pi}$ for policy $\pi$ is defined as

$$\rho^{\pi}(s,a)=\sum_{t=1}^{\infty}\gamma^{t-1}\Pr(s_{t}=s,a_{t}=a;\pi),$$

where $\Pr(s_{t}=s,a_{t}=a;\pi)$ is the probability of the joint event $s_{t}=s$ and $a_{t}=a$ when following the transition probability and policy $\pi$.

The discounted occupancy measure $\rho^{\pi}$ can be interpreted as an unnormalized measure for state-action pairs generated by $\pi$. It is straightforward to show that the normalizer of occupancy measures is $\frac{1}{1-\gamma}$.

Corollary 3.2.

$\sum_{x}\rho^{\pi}(x)=\frac{1}{1-\gamma}$ for any policy $\pi$.

For function $f:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$, we write $\mathbb{E}_{x\sim\rho^{\pi}}[f(x)]$ as the sum of $f(x)$ over all state-action pairs weighted by $\rho^{\pi}(x)$ with slight abuse of notation. The values can be expressed using $\rho^{\pi}$ alternatively.

Corollary 3.3.

$v^{\pi}=\mathbb{E}_{x\sim\rho^{\pi}}[r(x)]$.

As a final remark, the following theorem relates policies with occupancy measures, which allows one to regard policies as samplers for state-action pairs.

Theorem 3.4 (Syed et al. (2008)).

Suppose $\rho$ is an occupancy measure and $\pi\overset{\text{def}}{=}\frac{\rho(s,a)}{\sum_{a}\rho(s,a)}$. Then, $\rho$ is the occupancy measure for $\pi$, and $\pi$ is the only policy whose occupancy measure is $\rho$.
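The statements of this subsection can be checked numerically on a small example. The sketch below uses a hypothetical tabular MDP (all numbers are made up for illustration), approximates $\rho^{\pi}$ by truncating the infinite sum in Definition 3.1, and then compares the results with Corollary 3.2, Corollary 3.3, and Theorem 3.4. It is a sanity check under these assumptions, not part of the proposed method.

```python
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# Hypothetical toy MDP; P[s][a] lists the next-state probabilities.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.5, 0.5], 1: [0.1, 0.9]},
}
MU = [0.6, 0.4]
POLICY = {0: [0.5, 0.5], 1: [0.3, 0.7]}
REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}


def occupancy_measure(policy, num_steps=2000):
    """Truncated version of rho(s, a) = sum_t gamma^(t-1) Pr(s_t = s, a_t = a)."""
    rho = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    d = list(MU)          # distribution of s_t, starting with s_1 ~ mu
    discount = 1.0
    for _ in range(num_steps):
        next_d = [0.0 for _ in STATES]
        for s in STATES:
            for a in ACTIONS:
                joint = d[s] * policy[s][a]          # Pr(s_t = s, a_t = a)
                rho[(s, a)] += discount * joint
                for s_next in STATES:
                    next_d[s_next] += joint * P[s][a][s_next]
        d = next_d
        discount *= GAMMA
    return rho


def value_by_policy_evaluation(policy, num_sweeps=2000):
    """Iterative policy evaluation, used as an independent reference for v."""
    v = [0.0 for _ in STATES]
    for _ in range(num_sweeps):
        v = [
            sum(
                policy[s][a]
                * (REWARD[(s, a)] + GAMMA * sum(P[s][a][n] * v[n] for n in STATES))
                for a in ACTIONS
            )
            for s in STATES
        ]
    return sum(MU[s] * v[s] for s in STATES)


def policy_from_occupancy(rho):
    """Theorem 3.4: pi(a | s) = rho(s, a) / sum_a rho(s, a)."""
    policy = {}
    for s in STATES:
        state_mass = sum(rho[(s, a)] for a in ACTIONS)
        policy[s] = [rho[(s, a)] / state_mass for a in ACTIONS]
    return policy


rho = occupancy_measure(POLICY)
total_mass = sum(rho.values())                          # close to 1 / (1 - GAMMA)
value_from_rho = sum(rho[x] * REWARD[x] for x in rho)   # Corollary 3.3
recovered_policy = policy_from_occupancy(rho)           # matches POLICY
```

The division in `policy_from_occupancy` assumes every state has positive occupancy, which holds here because `MU` puts mass on both states.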

3.2 Preference-based Reinforcement Learning

Thurstone’s model (Thurstone, 1927)

Preferences are formally defined using Thurstone’s model (Thurstone, 1927). Suppose there is a general object space $\mathcal{O}$. For each object $o\in\mathcal{O}$, this model assumes a “true” utility score $G(o)\in\mathbb{R}$. Annotators compare objects using disturbed utility $G(o)+\epsilon$, where $\epsilon$ is sampled from some distribution for noise. Then, the probability of object $o$ being preferred over object $o'$ follows $\Pr(o\succ o')=\Pr(G(o)-G(o')>\epsilon'-\epsilon)$, where $\epsilon'$ is the noise associated with $o'$. Subsequently, we work with a special case of Thurstone’s model called the Bradley–Terry model (BT model) (Bradley and Terry, 1952). Assuming $\epsilon\sim\mathrm{Gumbel}(0,1)$, we have $\Pr(o\succ o')=\sigma(G(o)-G(o'))$, where $\sigma(\cdot)$ is the sigmoid function.
Under this model, the log-likelihood of a preference $o\succ o'$ is denoted by $\ell(o,o')=\log\sigma(G(o)-G(o'))$.
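The link between Gumbel noise and the sigmoid can be illustrated with a short simulation. The sketch below compares the closed-form BT probability with the empirical frequency obtained by perturbing two utilities with independent $\mathrm{Gumbel}(0,1)$ noise. The function names and the utility values are assumptions made for this example only.

```python
import math
import random


def log_sigmoid(z):
    """Numerically stable log(sigma(z)) = min(z, 0) - log(1 + exp(-|z|))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def bt_probability(utility_o, utility_o_prime):
    """Pr(o > o') = sigma(G(o) - G(o')) under the BT model."""
    return math.exp(log_sigmoid(utility_o - utility_o_prime))


def sample_gumbel(rng):
    """Draw from Gumbel(0, 1) by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def empirical_preference_rate(utility_o, utility_o_prime, num_samples, rng):
    """Fraction of comparisons won by o when both utilities are disturbed."""
    wins = 0
    for _ in range(num_samples):
        if utility_o + sample_gumbel(rng) > utility_o_prime + sample_gumbel(rng):
            wins += 1
    return wins / num_samples


closed_form = bt_probability(1.0, 0.0)
simulated = empirical_preference_rate(1.0, 0.0, 20000, random.Random(0))
```

The two numbers agree up to sampling noise because the difference of two independent $\mathrm{Gumbel}(0,1)$ variables follows a logistic distribution.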

Preference-based Reward Learning

We assume that the offline preferences are specified between trajectories, as they provide annotators with more information than states or state-action pairs. Under the BT model, a trajectory $\tau$ is preferred over another trajectory $\tau'$, denoted as $\tau\succ\tau'$, if the utility of $\tau$ is higher than that of $\tau'$. This means that $\tau$ corresponds to a more desirable outcome from an annotator’s perspective. We will use preferences between state-action pairs to formulate the proposed virtual preferences. In this case, a state-action pair $x$ is preferred over another state-action pair $x'$ if $r(x)>r(x')$.

Reward functions can be learned by maximizing the likelihood of preferences under the BT model. A canonical assumption for parameterizing trajectory utility is the sum of rewards in a trajectory: $\sum_{x\in\tau}r(x)$ (Christiano et al., 2017; Ibarz et al., 2018). We use the reward of a state-action pair as its utility. A sample for trajectory preferences is written as $(\tau_{1},\tau_{2},c)$, where $\tau_{1}$ and $\tau_{2}$ are the trajectories being compared. $c=1$ if $\tau_{1}\succ\tau_{2}$, and $c=0$ otherwise. Suppose the reward function is parameterized by $\theta_{r}$, and let the sum of estimated rewards for state-action pairs in a trajectory $\tau$ be $G(\tau;\theta_{r})$. Given a set of $M$ preferences $\mathcal{Y}$, we can learn $\theta_{r}$ by minimizing:

$$L_{\mathrm{pref}}(\theta_{r})=-\frac{1}{M}\sum_{(\tau_{1},\tau_{2},c)\in\mathcal{Y}}\left[c\log\sigma(G(\tau_{1};\theta_{r})-G(\tau_{2};\theta_{r}))+(1-c)\log\sigma(G(\tau_{2};\theta_{r})-G(\tau_{1};\theta_{r}))\right]. \tag{1}$$
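A minimal sketch of evaluating this preference loss is given below, written in the usual cross-entropy form with a logarithm on both terms. It only computes the loss value for a given reward function; in practice $\theta_{r}$ would be the weights of a neural network updated by gradient descent. The callable `reward_fn`, the toy trajectories, and the two example reward functions are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigma(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def trajectory_utility(trajectory, reward_fn):
    """G(tau): sum of estimated rewards over the state-action pairs in tau."""
    return sum(reward_fn(s, a) for s, a in trajectory)


def preference_loss(preferences, reward_fn):
    """Mean negative log-likelihood of (tau_1, tau_2, c) under the BT model."""
    total = 0.0
    for tau_1, tau_2, c in preferences:
        diff = trajectory_utility(tau_1, reward_fn) - trajectory_utility(tau_2, reward_fn)
        # c = 1 means tau_1 is preferred; c = 0 means tau_2 is preferred.
        total += c * log_sigmoid(diff) + (1 - c) * log_sigmoid(-diff)
    return -total / len(preferences)


# Hypothetical data: action 1 is "good", action 0 is "bad".
good = [(0, 1), (1, 1)]
bad = [(0, 0), (1, 0)]
preferences = [(good, bad, 1), (bad, good, 0)]

loss_uninformed = preference_loss(preferences, lambda s, a: 0.0)   # log(2)
loss_aligned = preference_loss(preferences, lambda s, a: float(a))
```

A reward function that ranks the preferred trajectories higher yields a lower loss than the uninformed constant reward.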

4 Preference-based Adversarial Imitation Learning

This section presents the proposed PbAIL framework. After formulating the learning problem, we describe how PbAIL leverages offline data for reward learning in Section 4.2 and how it handles imperfect data in Section 4.3.

4.1 Problem Setup

An agent is provided with a set of $M$ offline trajectory preferences $\mathcal{Y}$, which are collected for $N$ trajectories $\mathcal{D}$ generated by behavior policy $b$. To align the learned reward function with the agent’s behaviors, we assume the agent has access to $\mathcal{D}$. The agent is supposed to learn a reward function $r$ and a policy $\pi$. Similar to online PbRL, new trajectories can be generated during policy learning; however, no new real preferences may be collected.

4.2 Reward Learning from Offline Trajectories

First, we borrow the notion of policy preferences (Busa-Fekete et al., 2014) to draw a connection between the behavior policy and reward learning. A policy $\pi$ is preferred over another policy $\pi'$, denoted as $\pi\succ\pi'$, if the disturbed utility of $\pi$ is larger than that of $\pi'$ under the BT model. PbAIL assumes that $b$ is preferred over any other policy, i.e., $b\succ\pi,\forall\pi$. Assuming the utility of a policy $\pi$ is its value $v^{\pi}$, the log-likelihood of $b\succ\pi$ is given by $\log(\Pr(b\succ\pi))=\log(\sigma(v^{b}-v^{\pi}))$. To ensure $b\succ\pi$ holds for all $\pi$, PbAIL maximizes the worst-case log-likelihood of policy preference as follows:

$$\max_{r}\min_{\pi}\log\sigma(v^{b}-v^{\pi}). \tag{2}$$

This objective involves learning a reward function $r$ and a policy $\pi$. The maximization over $r$ enlarges the difference between $v^{b}$ and $v^{\pi}$ to make $b$ more preferable. The minimization over $\pi$ is equivalent to maximizing $v^{\pi}$ with respect to $r$, which can be solved by RL. This work uses soft actor-critic (SAC) (Haarnoja et al., 2018) as the policy learning algorithm, since it employs the principle of maximum entropy to enhance exploration.

While the maximization over reward functions in Equation 2 can be solved by approximating values using sampled trajectories, such a direct approach is subject to high variance. This work thus proposes an efficient optimization based on the following tight lower bound. By noting that $\log\sigma(\cdot)$ is concave and the value $v^{\pi}$ can be expressed as $v^{\pi}=\mathbb{E}_{x\sim\rho^{\pi}}[r(x)]$ (Corollary 3.3), the following inequality holds from Jensen’s inequality:

$$\log\mathrm{Pr}(b\succ\pi)\geq\mathbb{E}_{x\sim\rho^{b},\,x'\sim\rho^{\pi}}\,\ell(x,x')\overset{\text{def}}{=}U_{b}(\pi,r). \tag{3}$$
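The steps behind this bound can be spelled out as follows. The sketch assumes that $\rho^{b}$ and $\rho^{\pi}$ are treated as normalized distributions, so that the two expectations can be merged into one expectation over independent pairs $(x,x')$; the normalizer $\frac{1}{1-\gamma}$ is then regarded as absorbed into the scale of the reward. This normalization is an assumption of the sketch rather than a statement from the derivation above.

```latex
\begin{align*}
\log \Pr(b \succ \pi)
  &= \log \sigma\left(v^{b} - v^{\pi}\right) \\
  &= \log \sigma\left(\mathbb{E}_{x\sim\rho^{b}}[r(x)] - \mathbb{E}_{x'\sim\rho^{\pi}}[r(x')]\right)
     && \text{(Corollary 3.3)} \\
  &= \log \sigma\left(\mathbb{E}_{x\sim\rho^{b},\,x'\sim\rho^{\pi}}\left[r(x) - r(x')\right]\right) \\
  &\geq \mathbb{E}_{x\sim\rho^{b},\,x'\sim\rho^{\pi}}\left[\log \sigma\left(r(x) - r(x')\right)\right]
     && \text{(Jensen, } \log\sigma \text{ concave)} \\
  &= \mathbb{E}_{x\sim\rho^{b},\,x'\sim\rho^{\pi}}\,\ell(x, x') = U_{b}(\pi, r).
\end{align*}
```

The last line uses $\ell(x,x')=\log\sigma(r(x)-r(x'))$, since the utility of a state-action pair is its reward.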

The likelihood lower bound $U_{b}(\pi,r)$ can be efficiently approximated using state-action pairs. We can use samples in $\mathcal{D}$ to estimate the expectation over $\rho^{b}$. When using off-policy RL backbones such as SAC, we can estimate the expectation over $\rho^{\pi}$ using state-action pairs in the replay buffer of the RL backbone. Let the replay buffer be $\mathcal{D}_{\text{RL}}$. Then, the reward maximization in Equation 2 is approximated by minimizing the following loss function:

$$L_{\text{virtual}}(\theta_{r})=-\mathbb{E}_{\substack{x\in\mathcal{D}\\ x'\in\mathcal{D}_{\text{RL}}}}\log(\sigma(r(x;\theta_{r})-r(x';\theta_{r}))). \tag{4}$$

Given both offline preferences $\mathcal{Y}$ and trajectories $\mathcal{D}$, we can minimize $L_{\text{pref}}(\theta_{r})+L_{\text{virtual}}(\theta_{r})$ to ensure that the reward function learned from offline preferences aligns with the agent’s behaviors, thereby overcoming the generalizability issue.
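A Monte Carlo estimate of the virtual-preference loss can be sketched as follows. The estimator samples one state-action pair from the offline data and one from the replay buffer per term, as in Equation 4. The function name, the batch size, and the toy data are assumptions for illustration; an actual implementation would differentiate this quantity with respect to the reward parameters and add it to the preference loss.

```python
import math
import random


def log_sigmoid(z):
    """Numerically stable log(sigma(z))."""
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def virtual_preference_loss(offline_pairs, replay_pairs, reward_fn, batch_size, rng):
    """Estimate -E[log sigma(r(x) - r(x'))] with x from D and x' from D_RL."""
    total = 0.0
    for _ in range(batch_size):
        x = rng.choice(offline_pairs)        # state-action pair from offline data
        x_prime = rng.choice(replay_pairs)   # state-action pair from the replay buffer
        total += log_sigmoid(reward_fn(*x) - reward_fn(*x_prime))
    return -total / batch_size


# Hypothetical data: offline pairs use action 1.0, agent pairs use action 0.0.
offline_pairs = [(0.0, 1.0), (1.0, 1.0)]
replay_pairs = [(0.0, 0.0), (1.0, 0.0)]
rng = random.Random(0)

loss_constant = virtual_preference_loss(
    offline_pairs, replay_pairs, lambda s, a: 0.0, 64, rng)
loss_favoring_offline = virtual_preference_loss(
    offline_pairs, replay_pairs, lambda s, a: a, 64, rng)
```

A reward function that scores offline pairs above the agent's pairs lowers this loss, which is the mechanism by which the reward function tracks the agent's current behaviors as the replay buffer changes.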

Virtual Preferences

Note that Equation 4 coincides with learning from state-action preferences $x\succ x'$ such that $x\in\mathcal{D}$ and $x'\in\mathcal{D}_{\text{RL}}$. By minimizing Equation 4, we are maximizing the log-likelihood that state-action pairs in offline data $\mathcal{D}$ are preferred over state-action pairs generated by the agent. Since such preferences are not collected from annotators, we consider them to be generated by a virtual annotator and call them virtual preferences. We use $\succ_{v}$ for virtual comparisons and $\mathcal{Y}_{v}$ for virtual preferences. The use of virtual state-action preferences is a key idea of PbAIL. As will be discussed in Section 4.3, it leads to a straightforward approach for handling non-optimal offline data.

As a final remark, it is straightforward to combine recent ideas of PbRL, such as the use of data augmentation (Park et al., 2022) or active query generation (Biyik et al., 2020), with PbAIL. We leave these extensions as future work and focus solely on showing the effectiveness of PbAIL.

4.3 Handling Imperfect Data

In practice, we may encounter imperfect offline data. In this case, adapting the reward function using offline data by minimizing Equation 4 becomes problematic. Based on the interpretation of virtual preferences, we propose to handle imperfect data by inferring the reliability of virtual preferences and the reward function simultaneously. Our approach is based on a probabilistic model for noisy preferences collected from multiple annotators (Zhang and Kashima, 2023). We simplify the model since the virtual preferences are not collected from multiple annotators.

Specifically, we assume that to generate a preference for $x \in \mathcal{D}$ and $x' \in \mathcal{D}_{\text{RL}}$, the virtual annotator first generates a temporary label using the BT model and the ground-truth reward function. Then, it reports the label correctly with probability $\alpha(x, x')$ and incorrectly with probability $1 - \alpha(x, x')$. By the law of total probability, the probability of $x \succ_v x'$ under this model, denoted by $\text{Pr}_{\text{imperfect}}(x \succ_v x')$, is given by

$$\text{Pr}_{\text{imperfect}}(x \succ_v x') = \alpha(x, x')\,\sigma(r(x) - r(x')) + (1 - \alpha(x, x'))\,\sigma(r(x') - r(x)). \tag{5}$$

Suppose $\alpha$ is parameterized by a neural network with parameters $\theta_\alpha$. Under this model, the objective for learning from virtual preferences becomes

$$L_{\begin{subarray}{c} \text{imperfect} \\ \text{virtual} \end{subarray}}(\theta_r, \theta_\alpha) = -\mathbb{E}_{\begin{subarray}{c} x \in \mathcal{D} \\ x' \in \mathcal{D}_{\text{RL}} \end{subarray}} \log\left(\text{Pr}_{\text{imperfect}}(x \succ_v x')\right). \tag{6}$$
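As a hedged illustration of Equations 5 and 6, the sketch below evaluates the noisy-label probability and the resulting loss for scalar inputs. The function names are hypothetical, the reliability $\alpha(x, x')$ is passed in as a number rather than produced by a network, and the small clamp inside the logarithm is an added numerical safeguard that is not part of the equations.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pr_imperfect(r_x: float, r_xp: float, alpha: float) -> float:
    """Equation 5: probability that the virtual annotator reports x over x'."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Correct report of the BT label plus incorrect report of the flipped label.
    return alpha * sigmoid(r_x - r_xp) + (1.0 - alpha) * sigmoid(r_xp - r_x)


def imperfect_virtual_loss(r_offline, r_agent, alphas, eps: float = 1e-12):
    """Monte Carlo estimate of Equation 6 over samples paired by index."""
    probs = [pr_imperfect(a, b, al) for a, b, al in zip(r_offline, r_agent, alphas)]
    if not probs:
        raise ValueError("empty batch")
    # eps guards against log(0); it is an assumption, not part of Equation 6.
    return -sum(math.log(max(p, eps)) for p in probs) / len(probs)
```

Two limiting cases help interpret the model: with $\alpha = 1$ the probability reduces to the BT term used in Equation 4, and with $\alpha = 0.5$ it equals $0.5$ regardless of the rewards, because $\sigma(z) + \sigma(-z) = 1$, so such a pair carries no information about the reward.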

Essentially, we model the reliability of virtual preferences using $\alpha(x, x')$. We use the reward difference $r(x) - r(x')$ and the index of the trajectory from which $x'$ originates, denoted as $I(x')$, as features for this reliability network. The difference $r(x) - r(x')$ is informative because it suggests the temporary label in the generative process of this probabilistic model. As for $I(x')$, recall that we sample state-action pairs from the replay buffer $\mathcal{D}_{\text{RL}}$ for better sample efficiency. This buffer stores state-action pairs generated throughout the entire policy learning process. Since the policy generally improves during this process, state-action pairs generated in the late stage of policy learning are more likely than those generated earlier to be better than state-action pairs in $\mathcal{D}$.

Initialization

As suggested by Zhang and Kashima (2023), this model needs an initialization phase. The intuition is that, since the agent’s policy is initialized from scratch, in the initial phase of policy learning the offline data are likely to be better than the agent’s behaviors. The objective for this phase is given by Equation 7:

$$L_{\begin{subarray}{c} \text{imperfect} \\ \text{virtual,init} \end{subarray}}(\theta_r, \theta_\alpha) = L_{\text{virtual}}(\theta_r) - \mathbb{E}_{\begin{subarray}{c} x \in \mathcal{D} \\ x' \sim \rho^{\pi} \end{subarray}} \log\left(\alpha(x, x')\right). \tag{7}$$
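Equation 7 can be read as the loss of Equation 4 plus a term that pushes the reliability toward 1 while the agent is still weak. The sketch below computes both terms for scalar inputs. It is a hedged illustration: `init_loss` is a hypothetical name, the reliabilities are given as numbers in $(0, 1]$ instead of network outputs, and samples are paired by index.

```python
import math


def init_loss(r_offline, r_agent, alphas):
    """Estimate of Equation 7: L_virtual minus the mean of log(alpha)."""
    n = len(r_offline)
    if n == 0 or len(r_agent) != n or len(alphas) != n:
        raise ValueError("need three non-empty batches of equal length")

    def log_sigmoid(d):
        # Stable log(sigmoid(d)), branching on the sign to avoid overflow.
        return -math.log1p(math.exp(-d)) if d >= 0 else d - math.log1p(math.exp(d))

    # First term (Equation 4): offline pairs should receive higher reward.
    l_virtual = -sum(log_sigmoid(a - b) for a, b in zip(r_offline, r_agent)) / n
    # Second term: encourages alpha(x, x') to be close to 1 during initialization.
    l_alpha = -sum(math.log(al) for al in alphas) / n
    return l_virtual + l_alpha
```

With all reliabilities equal to 1 the second term vanishes and the value coincides with $L_{\text{virtual}}$; any reliability below 1 increases the loss.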

In addition, to stabilize training, we opt to initialize the agent’s policy using BC.

Remark

The root cause of unreliable virtual preferences here is that our assumption $b \succ \pi$ for any $\pi$ is no longer justified. Given preferences $\mathcal{Y}$, we expect the agent to outperform the behavior policy of $\mathcal{D}$. Our approach reduces the problem of handling an imperfect behavior policy to inferring the reliability of state-action preferences in $\mathcal{Y}_v$.

In summary, when offline trajectories are not perfect, PbAIL uses a probabilistic model for virtual preferences that can simultaneously infer the reliability of virtual preferences and learn a reward function. Algorithm 1 summarizes the algorithm for this case.

Algorithm 1 PbAIL using off-policy RL backbone
  Input: Offline trajectories $\mathcal{D}$, preferences $\mathcal{Y}$, number of initialization steps $K_{\text{init}}$.
  Initialize $\theta_r$, $\theta_\pi$, $\theta_\alpha$, and a replay buffer $\mathcal{D}_{\text{RL}}$.
  Initialize a counter for gradient steps $k$.
  repeat
     Interact with the environment and store a transition in $\mathcal{D}_{\text{RL}}$.
     Sample a batch of data from $\mathcal{D}$ and $\mathcal{D}_{\text{RL}}$.
     if $k \leq K_{\text{init}}$ then
        Take a minimization step for $L_{\text{pref}} + L_{\begin{subarray}{c} \text{imperfect} \\ \text{virtual,init} \end{subarray}}(\theta_r, \theta_\alpha)$.
        Update the actor using the loss of BC.
     else
        Take a minimization step for $L_{\text{pref}} + L_{\begin{subarray}{c} \text{imperfect} \\ \text{virtual} \end{subarray}}(\theta_r, \theta_\alpha)$.
     end if
     Take a policy improvement step using $r$ and transitions from $\mathcal{D}_{\text{RL}}$.
  until $k$ exceeds the intended number of training steps.
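The control flow of Algorithm 1 can be sketched as follows. This is only a schematic under explicit assumptions: the actual gradient updates are replaced by string labels, the function name `run_pbail_schedule` is hypothetical, the counter is assumed to start at 1, and its increment once per iteration is an assumption, since the listing leaves it implicit.

```python
def run_pbail_schedule(total_steps: int, k_init: int):
    """Return, for each gradient step, which reward and actor updates are taken.

    Only the two-phase schedule of Algorithm 1 is reproduced; the updates
    themselves are represented by labels (an assumption for illustration).
    """
    log = []
    k = 1  # counter for gradient steps (assumed to start at 1)
    while k <= total_steps:
        # Environment interaction and batch sampling would happen here.
        if k <= k_init:
            # Initialization phase: Equation 7 for the reward, BC for the actor.
            log.append((k, "L_pref + L_imperfect_virtual_init", "BC"))
        else:
            # Main phase: Equation 6 for the reward, no extra actor loss.
            log.append((k, "L_pref + L_imperfect_virtual", None))
        # A policy improvement step with the learned reward follows in both phases.
        k += 1  # assumed increment; it is implicit in Algorithm 1
    return log


for step in run_pbail_schedule(total_steps=4, k_init=2):
    print(step)
```

The sketch makes the switch at $K_{\text{init}}$ explicit: the first $K_{\text{init}}$ steps use the initialization objective together with BC, and all later steps use the full reliability-aware objective.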

5 Experiments

We evaluate PbAIL from three aspects: its task performance, the individual effects of virtual preferences and preference reliability modeling, and its sensitivity to trajectory quality and preference size. Details regarding experiment design and implementations can be found in the appendices.

5.1 Experiment Design

Data

We considered seven MuJoCo tasks: Ant, Walker2d, Hopper, HalfCheetah (HC for short), HumanoidStandup (HS for short), Pusher, and Swimmer. To reveal the algorithms’ performance in practice, we evaluated them using imperfect offline trajectories generated by multiple policies. For each task, we used two versions of offline data: novice and mixture. We first trained five SAC agents for five million steps using different random seeds and took their final performance as the final performance of SAC. The novice version was generated using five model checkpoints reaching 20% of the final performance, while the mixture version was generated using these checkpoints together with five checkpoints reaching 50% of the final performance. For preferences, we considered two sizes of $\mathcal{Y}$: $|\mathcal{Y}| = 1225$ and $|\mathcal{Y}| = 300$. Following prior practice (Christiano et al., 2017), preferences were generated for trajectory clips of length 60 using the ground-truth rewards.

Alternative Methods

We considered the method proposed by Christiano et al. (2017) (referred to as PbRL), which is an algorithm for online PbRL. We also included two other baselines, SACfD and CAIL (Zhang et al., 2021). SACfD is an extension of the method proposed by Ibarz et al. (2018) that combines BC with PbRL. CAIL is an imitation learning algorithm that encourages the ranking of offline trajectories induced by the inferred returns to align with the given preferences. In addition, we present the results for an ablation of the reliability estimation that uses Equation 4 instead of Equation 6, denoted as PbAIL-. For reference, this paper also reports the results for SAC trained with the ground-truth rewards (denoted as GT) and the return of offline trajectories (denoted as Data).

Evaluation Metrics

The algorithms were compared on test returns after being trained for one million steps. Returns were normalized to a 0–1 scale, such that 0 corresponds to a random policy and 1 corresponds to the final performance of SAC; a negative number indicates that a method is worse than a random policy. We report the mean and standard deviation of results for five random seeds.
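The normalization described above amounts to a linear rescaling of raw returns. The sketch below is one way to write it, under stated assumptions: the function names are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def normalized_return(ret: float, random_ret: float, final_ret: float) -> float:
    """Map a raw return so that a random policy scores 0 and the final SAC policy 1."""
    if final_ret == random_ret:
        raise ValueError("final and random returns must differ")
    return (ret - random_ret) / (final_ret - random_ret)


def summarize(returns, random_ret: float, final_ret: float):
    """Mean and sample standard deviation of normalized returns across seeds."""
    scores = [normalized_return(r, random_ret, final_ret) for r in returns]
    return mean(scores), stdev(scores)
```

Under this mapping, a raw return below that of the random policy yields a negative score, which is how entries such as the negative values for Pusher arise.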

Implementation Details

We used SAC as the policy learner. The reward functions of PbRL, SACfD, and PbAIL, as well as the discriminator of CAIL, were parameterized with an FFN of 64 units with spectral normalization (Miyato et al., 2018). We also applied weight decay to the reward functions of PbRL, SACfD, and PbAIL, and to the reliability network of PbAIL. All experiments were performed on six NVIDIA A6000 GPUs. Our code is available at https://www.dropbox.com/s/g0m3qwng4cnltmt/PbAIL.zip?dl=0.

5.2 Results

Task Performance

Table 1 shows the results on the novice datasets with $|\mathcal{Y}| = 1225$. First, we compare the performance of PbRL, SACfD, and PbAIL. Except in the case of Walker2d and Swimmer, PbRL does not demonstrate competitive performance. SACfD outperforms PbRL in Ant, Walker2d, and HC, but it does not perform well for the rest. PbAIL surpasses PbRL in six of the seven tasks, achieving the best performance in five tasks and even outperforming GT in three tasks. In addition, PbAIL presents significant advantages over CAIL. These results corroborate PbAIL’s efficacy for learning from offline trajectories and offline preferences.

Ablation Study

Table 1 also shows the individual effects of using virtual preferences and inferring their reliability. The advantage of PbAIL- over PbRL, particularly in Ant, HS, and Pusher, confirms the efficacy of using virtual preferences. The efficacy of inferring preference reliability is supported by two comparisons: PbAIL- against the return of the offline data (denoted as Data), and PbAIL- against PbAIL. Except in Walker2d, PbAIL- only matches the return of the data, which means that the information in offline preferences is not properly utilized. PbAIL’s substantial improvement over the data return confirms the necessity of modeling virtual preference reliability when dealing with imperfect offline data.

Table 1: Algorithms’ normalized returns for novice datasets with $|\mathcal{Y}| = 1225$. We show the normalized returns of training data in the “Data” column and the performance obtained with the ground-truth rewards in the “GT” column. The proposed PbAIL outperforms SACfD in five tasks and CAIL in six tasks. Notably, it even outperforms GT in three tasks. PbAIL- is a variant of PbAIL that does not handle the non-optimality of data, and it is significantly outperformed by PbAIL. These results confirm the efficacy of learning from virtual preferences and modeling their reliability.

| Task | PbRL | SACfD | CAIL | PbAIL- | PbAIL | Data | GT |
|---|---|---|---|---|---|---|---|
| Ant | -0.35 ± 0.02 | 0.57 ± 0.09 | 0.59 ± 0.09 | 0.30 ± 0.04 | **0.72 ± 0.03** | 0.23 | 0.66 |
| Walker2d | 0.53 ± 0.17 | **0.61 ± 0.25** | 0.46 ± 0.14 | -0.00 ± 0.00 | 0.58 ± 0.13 | 0.28 | 0.89 |
| Hopper | 0.73 ± 0.28 | 0.66 ± 0.25 | 0.83 ± 0.20 | 0.31 ± 0.07 | **1.13 ± 0.01** | 0.29 | 0.95 |
| HC | 0.22 ± 0.23 | 0.44 ± 0.20 | 0.74 ± 0.12 | 0.25 ± 0.02 | **0.76 ± 0.05** | 0.22 | 0.69 |
| HS | 0.05 ± 0.30 | 0.05 ± 0.17 | 0.47 ± 0.15 | 0.55 ± 0.16 | **0.64 ± 0.15** | 0.51 | 0.97 |
| Pusher | -4.23 ± 0.72 | -4.55 ± 0.85 | 0.22 ± 0.34 | 0.39 ± 0.06 | **0.71 ± 0.09** | 0.37 | 0.93 |
| Swimmer | 0.19 ± 0.07 | 0.13 ± 0.10 | 0.19 ± 0.10 | **0.30 ± 0.05** | 0.12 ± 0.19 | 0.28 | 0.64 |

Effect of Preference Size

Table 2 shows the results for $|\mathcal{Y}| = 300$ on the novice datasets. Compared to the results presented in Table 1, preference-based methods (PbRL, SACfD, and PbAIL) worsen significantly, which is expected as less information is available in preferences. Interestingly, with fewer preferences, CAIL tends to perform better: it improves in five tasks. This is probably because it uses preferences to update the weights of samples in $\mathcal{D}$ for imitation learning, using a max-margin loss function for preferences. With fewer preferences, the weights of samples can be determined more easily, so its adversarial learning process is more stable.

Table 2: Performance for novice datasets with $|\mathcal{Y}| = 300$. Compared to results in Table 1, with fewer preferences, preference-based methods (PbRL, SACfD, and PbAIL) become worse, while CAIL performs better.

| Task | PbRL | SACfD | CAIL | PbAIL | Data | GT |
|---|---|---|---|---|---|---|
| Ant | -0.37 ± 0.02 | -0.37 ± 0.01 | **0.61 ± 0.07** | 0.56 ± 0.03 | 0.23 | 0.66 |
| Walker2d | 0.39 ± 0.15 | 0.34 ± 0.16 | **0.49 ± 0.07** | 0.15 ± 0.13 | 0.28 | 0.89 |
| Hopper | 0.63 ± 0.32 | 0.38 ± 0.06 | 0.74 ± 0.21 | **0.79 ± 0.28** | 0.29 | 0.95 |
| HC | 0.35 ± 0.20 | 0.12 ± 0.22 | **0.82 ± 0.08** | 0.14 ± 0.26 | 0.22 | 0.69 |
| HS | 0.12 ± 0.13 | 0.18 ± 0.21 | 0.32 ± 0.34 | **0.64 ± 0.12** | 0.51 | 0.97 |
| Pusher | -4.55 ± 0.76 | -4.44 ± 0.75 | 0.40 ± 0.11 | **0.58 ± 0.08** | 0.37 | 0.93 |
| Swimmer | **0.32 ± 0.10** | 0.26 ± 0.15 | 0.25 ± 0.14 | 0.16 ± 0.27 | 0.28 | 0.64 |

Effect of Trajectory Quality

Table 3 shows the results for the mixture datasets with $|\mathcal{Y}| = 1225$. The quality of offline data increases by 0.1 to 0.2 when compared with the case in Table 1. PbRL and PbAIL do not benefit from the improved quality, but PbAIL is the only method that attains the best performance in three tasks. SACfD improves in Ant and Walker2d but fails in HS. CAIL benefits the most from the improved quality of offline trajectories. The results in Table 1 and Table 3 imply that PbAIL is suitable for scenarios where preferences are accessible, and imitation-based methods (SACfD and CAIL) can be better when good demonstrations can be collected.

Table 3: Results on mixture datasets with $|\mathcal{Y}| = 1225$. Compared to results in Table 1, PbRL and PbAIL cannot benefit from the improved quality of offline trajectories. SACfD improves in Ant and Walker2d, but fails in HS. CAIL has significant improvement in Ant, Walker2d, and Hopper.

| Task | PbRL | SACfD | CAIL | PbAIL | Data | GT |
|---|---|---|---|---|---|---|
| Ant | -0.33 ± 0.05 | **0.74 ± 0.02** | 0.71 ± 0.02 | 0.59 ± 0.03 | 0.34 | 0.66 |
| Walker2d | 0.53 ± 0.25 | **0.71 ± 0.23** | 0.61 ± 0.02 | 0.63 ± 0.10 | 0.38 | 0.89 |
| Hopper | 0.76 ± 0.24 | 0.49 ± 0.16 | 0.90 ± 0.05 | **1.13 ± 0.03** | 0.49 | 0.95 |
| HC | 0.34 ± 0.24 | 0.25 ± 0.28 | **0.74 ± 0.04** | 0.69 ± 0.03 | 0.37 | 0.69 |
| HS | 0.06 ± 0.23 | NaN | 0.44 ± 0.17 | **0.62 ± 0.14** | 0.52 | 0.97 |
| Pusher | -5.26 ± 0.45 | -5.05 ± 0.52 | -0.10 ± 0.83 | **0.66 ± 0.04** | 0.45 | 0.93 |
| Swimmer | **0.38 ± 0.05** | 0.33 ± 0.04 | 0.10 ± 0.19 | 0.14 ± 0.05 | 0.40 | 0.64 |

Reward Generalizability

Finally, let us discuss our claim on the generalizability problem. We analyzed the generalizability of reward functions using the mixture datasets with 1225 preferences. For each method, we took 10 model checkpoints in the early stage of policy learning. We then rolled out trajectories using the actor of each checkpoint and inferred their returns using the corresponding reward function. Kendall’s rank correlation coefficient reflects how well the ranking of trajectories computed using inferred returns matches the ranking computed using ground-truth returns. The higher it is, the better the two rankings match, which means that the reward function generalizes better to the learning agent’s behaviors.
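For readers unfamiliar with the statistic, the sketch below computes Kendall’s rank correlation in its simplest form (tau-a) from two lists of trajectory returns. It is a hedged illustration: the text does not state which variant or library was used, the function name is hypothetical, and tie-corrected variants such as tau-b give different values when ties are present.

```python
from itertools import combinations


def kendall_tau(inferred, true):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    inferred[i] and true[i] are the inferred and ground-truth returns of
    trajectory i. Tied pairs count as neither concordant nor discordant.
    """
    n = len(inferred)
    if n != len(true) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (inferred[i] - inferred[j]) * (true[i] - true[j])
        if sign > 0:
            concordant += 1  # both returns order the pair the same way
        elif sign < 0:
            discordant += 1  # the two returns disagree on the order
    return (concordant - discordant) / (n * (n - 1) // 2)
```

A value of 1 means the inferred returns rank the trajectories exactly as the true returns do, a value of -1 means the ranking is reversed, and values near 0 indicate that the learned reward carries little ranking information for the agent’s own trajectories.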

Figure 2 shows the results for Ant, HC, and Pusher. The reward function of PbRL fails to generalize for Ant and Pusher, which explains PbRL’s poor performance for the two tasks. Compared to PbRL, the reward function of SACfD generalizes better for Ant but fails for Pusher. Meanwhile, for all three tasks, PbAIL’s reward functions demonstrate good generalizability, which explains its performance.

Figure 2: Kendall’s rank correlation coefficient between the inferred returns and the true returns of agents’ trajectories during policy learning, shown in three panels: (a) Ant, (b) HC, and (c) Pusher. This coefficient reflects the generalizability of reward functions to agents’ behaviors. Agents were trained on the mixture datasets with $|\mathcal{Y}| = 1225$. The reward function of PbRL does not generalize well for Ant and Pusher, which explains its poor performance presented in Table 3. These observations support our claim for the generalizability issue.

In summary, our takeaways are threefold.

  • The proposed PbAIL is competitive for all three cases. Both the use of virtual preferences and modeling their reliability work as expected.

  • Preference-based methods tend to degenerate with fewer preferences. Imitation-based approaches better enjoy the improved quality of offline trajectories.

  • Our claim for the generalizability issue of reward functions is supported by rank correlation between the inferred and true trajectory returns.

These findings confirm the efficacy of our proposals. Moreover, they shed light on how to select algorithms for complex real-world tasks. If collecting preferences is more viable than collecting good trajectories, then PbAIL is the most suitable method. Otherwise, approaches that directly imitate from trajectories are better.

6 Conclusion

PbRL is a setting that learns a reward function from preferences, which represent human evaluation of agents’ behaviors. Recently, the use of offline preferences was proposed to make better use of human time. In this case, the preferences are collected for certain offline data. However, as the offline data may follow a distribution different from that of a learning agent’s behaviors, reward functions learned from offline preferences may fail to generalize to the agent’s behaviors. In response to this issue, the present study proposes PbAIL, a framework that overcomes this drawback by using virtual preferences generated from offline data. PbAIL learns a reward function by jointly maximizing the likelihood of offline preferences and virtual preferences, which aligns the learned reward functions with the agent’s behaviors. Furthermore, this work extends PbAIL to imperfect offline data, thus broadening its applicability. From experiments on continuous control tasks, we verified the efficacy of using virtual preferences and handling data imperfection, and we also discussed the advantages and limitations of PbAIL. As for future work, it would be interesting to consider extensions that do not explicitly require offline trajectories.

References

  • Akrour et al. (2011) R. Akrour, M. Schoenauer, and M. Sebag. Preference-based policy learning. In Machine Learning and Knowledge Discovery in Databases, pages 12–27. Springer Berlin Heidelberg, 2011.
  • Biyik et al. (2020) E. Biyik, N. Huynh, M. J. Kochenderfer, and D. Sadigh. Active preference-based Gaussian process regression for reward learning. In Robotics: Science and Systems XVII, 2020.
  • Bradley and Terry (1952) R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • Brockman et al. (2016) G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
  • Busa-Fekete et al. (2014) R. Busa-Fekete, B. Szörényi, P. Weng, W. Cheng, and E. Hüllermeier. Preference-based reinforcement learning: Evolutionary direct policy search using a preference-based racing algorithm. Machine Learning, 97(3):327–351, 2014.
  • Christiano et al. (2017) P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4302–4310. Curran Associates, Inc., 2017.
  • Fisac et al. (2017) J. F. Fisac, M. A. Gates, J. B. Hamrick, C. Liu, D. Hadfield-Menell, M. Palaniappan, D. Malik, S. S. Sastry, T. L. Griffiths, and A. D. Dragan. Pragmatic-pedagogic value alignment. In Proceedings of the Eighteenth International Symposium on Robotics Research, pages 49–57. Springer International Publishing, 2017.
  • Fürnkranz et al. (2012) J. Fürnkranz, E. Hüllermeier, W. Cheng, and S.-H. Park. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine Learning, 89(1):123–156, 2012.
  • Haarnoja et al. (2018) T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the Thirty-Fifth International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
  • Ibarz et al. (2018) B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei. Reward learning from human preferences and demonstrations in Atari. In Advances in Neural Information Processing Systems, pages 8022–8034. Curran Associates Inc., 2018.
  • Lee et al. (2021a) K. Lee, L. Smith, A. Dragan, and P. Abbeel. B-pref: Benchmarking preference-based reinforcement learning. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021a.
  • Lee et al. (2021b) K. Lee, L. M. Smith, and P. Abbeel. Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. In Proceedings of the Thirty-Eighth International Conference on Machine Learning, pages 6152–6163. PMLR, 2021b.
  • Liu et al. (2022) R. Liu, F. Bai, Y. Du, and Y. Yang. Meta-reward-net: Implicitly differentiable reward learning for preference-based reinforcement learning. In Advances in Neural Information Processing Systems, 2022.
  • Miyato et al. (2018) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In Proceedings of the Sixth International Conference on Learning Representations, 2018.
  • Orsini et al. (2021) M. Orsini, A. Raichuk, L. Hussenot, D. Vincent, R. Dadashi, S. Girgin, M. Geist, O. Bachem, O. Pietquin, and M. Andrychowicz. What matters for adversarial imitation learning? In Advances in Neural Information Processing Systems, pages 14656–14668. Curran Associates, Inc., 2021.
  • Ouyang et al. (2022) L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback, 2022.
  • Park et al. (2022) J. Park, Y. Seo, J. Shin, H. Lee, P. Abbeel, and K. Lee. SURF: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning. In Proceedings of the Tenth International Conference on Learning Representations, 2022.
  • Puterman (1994) M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994. ISBN 0471619779.
  • Reddy et al. (2018) S. Reddy, A. D. Dragan, and S. Levine. Shared autonomy via deep reinforcement learning. In Proceedings of the Robotics: Science and Systems XIV. MIT Press, 2018.
  • Shin and Brown (2021) D. Shin and D. Brown. Offline preference-based apprenticeship learning. In Proceedings of the Workshop on Human-AI Collaboration in Sequential Decision-Making at the Thirty-Eighth International Conference on Machine Learning, 2021.
  • Sutton and Barto (2018) R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, 2018.
  • Syed et al. (2008) U. Syed, M. Bowling, and R. E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the Twenty-Fifth International Conference on Machine Learning, pages 1032–1039. Association for Computing Machinery, 2008.
  • Thurstone (1927) L. L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273–286, 1927.
  • Wilde et al. (2020) N. Wilde, D. Kulic, and S. L. Smith. Active preference learning using maximum regret. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 10952–10959. IEEE, 2020.
  • Wirth and Fürnkranz (2015) C. Wirth and J. Fürnkranz. On learning from game annotations. IEEE Transactions on Computational Intelligence and AI in Games, 7(3):304–316, 2015.
  • Zhang and Kashima (2023) G. Zhang and H. Kashima. Batch reinforcement learning from crowds. In Machine Learning and Knowledge Discovery in Databases, pages 38–51. Springer Cham, 2023.
  • Zhang et al. (2021) S. Zhang, Z. Cao, D. Sadigh, and Y. Sui. Confidence-aware imitation learning from demonstrations with varying optimality. In Advances in Neural Information Processing Systems, pages 12340–12350. Curran Associates, Inc., 2021.

Appendix A Data

Offline Data

To generate the offline data used in experiments, we first trained five SAC agents for five million steps using different random seeds and the hyperparameters reported in Table 5. As shown in Table 4, their final performance matches the results reported for SAC by Haarnoja et al. [2018]. To compute the normalized returns of algorithms, we treat the policies initialized from scratch as the random policies. To evaluate algorithms in realistic settings, this work considers two versions of offline data: novice and mixture. The novice version was generated using five model checkpoints for different random seeds that reached 20% of the final performance. For the mixture version, besides the five model checkpoints of the novice version, we also used five model checkpoints for different random seeds that reached 20% of the final performance. Both versions contain 50 trajectories, and each model checkpoint contributed the same number of trajectories.
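The text does not spell out the normalization formula. A minimal sketch, assuming the common convention in which the random policy maps to 0 and the final SAC performance maps to 1, could look as follows. The function name `normalized_return` is hypothetical, and the constants are the Ant and Pusher entries of Table 4.

```python
def normalized_return(ret, random_ret, final_ret):
    """Rescale a raw return so that the random policy scores 0 and the
    final SAC policy scores 1 (assumed convention, not stated in the text)."""
    return (ret - random_ret) / (final_ret - random_ret)


# Values copied from Table 4.
ANT_RANDOM, ANT_FINAL = 989.65, 6983.98
PUSHER_RANDOM, PUSHER_FINAL = -58.73, -20.50

# A raw Ant return halfway between the two reference points.
ant_midpoint = normalized_return((ANT_RANDOM + ANT_FINAL) / 2, ANT_RANDOM, ANT_FINAL)
```

Because the formula only uses differences, it also handles tasks with negative returns such as Pusher.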

Preferences

Following prior practice for PbRL [Christiano et al., 2017], we consider preferences between short clips of trajectories for better efficiency. From each trajectory, we sampled one clip of length 30 in Pusher and 60 in other tasks, which gives 50 trajectory clips. The case of $|\mathcal{Y}|=1225$ involves all $\binom{50}{2}$ paired comparisons for these 50 clips, and the case of $|\mathcal{Y}|=300$ involves all $\binom{25}{2}$ paired comparisons for 25 of the 50 clips. The preference labels were generated using the ground-truth rewards. In other words, for two trajectory clips $\eta_1$ and $\eta_2$, $\eta_1 \succ \eta_2$ if $\sum_{x\in\eta_1} r(x) > \sum_{x\in\eta_2} r(x)$ under the ground-truth reward function $r$.
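The labeling rule above can be sketched as follows. This is an illustration rather than the authors' code: the function name `label_preferences`, the toy clips, and the choice of the first 25 clips as the subset are hypothetical, and skipping ties is an assumption because the text does not say how equal returns are handled.

```python
from itertools import combinations


def label_preferences(clip_rewards):
    """Return (winner, loser) index pairs for every pair of clips.

    clip_rewards[i] holds the ground-truth rewards r(x) along clip i.
    """
    returns = [sum(rewards) for rewards in clip_rewards]
    preferences = []
    for i, j in combinations(range(len(returns)), 2):
        if returns[i] > returns[j]:
            preferences.append((i, j))
        elif returns[j] > returns[i]:
            preferences.append((j, i))
        # Ties are skipped (assumption: the text does not cover this case).
    return preferences


# Toy data: 50 clips of length 60, each with a distinct constant reward.
clips = [[float(k)] * 60 for k in range(50)]
all_prefs = label_preferences(clips)          # every pair among 50 clips
subset_prefs = label_preferences(clips[:25])  # every pair among 25 clips
```

With distinct clip returns, the two calls produce 1225 and 300 labeled pairs, matching the two sizes of $\mathcal{Y}$ used in the experiments.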

Table 4: The final performance and performance of random policies.

| Task | Ant | Walker2d | Hopper | HC | HS | Pusher | Swimmer |
|---|---|---|---|---|---|---|---|
| Final Performance | 6983.98 | 4656.05 | 3097.87 | 14646.87 | 157117.92 | -20.50 | 125.57 |
| Random Policy | 989.65 | 17.08 | 46.98 | -1.44 | 42788.13 | -58.73 | 7.01 |

Table 5: Hyperparameters for the behavior policies.

| Parameter | Value |
|---|---|
| optimizer | Adam |
| learning rate for the actor | $3 \times 10^{-4}$ |
| learning rate for the critic | $3 \times 10^{-4}$ |
| whether to train entropy weight | Yes |
| initial value for entropy weight | 0.1 |
| learning rate for entropy weight | $3 \times 10^{-4}$ |
| #hidden layers (all networks) | 2 |
| #units per layer for the actor | 256 |
| #units per layer for the critic | 256 |
| # training steps | 1 |
| # environment steps per training step | 1 |
| activation function (all networks) | ReLU |
| interval for updating the target network | 1 |
| smoothing coefficient for target updates | $5 \times 10^{-3}$ |
| size of replay buffer | $1 \times 10^{6}$ |
| discount factor | 0.99 |

Appendix B Implementation Details

As suggested by Orsini et al. [2021], we used SAC as the policy learner for all algorithms because of its good sample efficiency. The hyperparameters of the policy learner are the same as those for the behavior policies (reported in Table 5), except for the smoothing coefficient for target updates, which is changed to 0.001 for better stability. During training, we sampled actions from the squashed Gaussian policy to enhance exploration [Haarnoja et al., 2018]; at test time, we used the mode of the policy, which is deterministic.
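The difference between training-time and test-time action selection can be sketched as below. This is a minimal illustration that assumes the common SAC convention of taking $\tanh$ of the Gaussian mean as the deterministic action. The function `select_action` and its arguments are hypothetical names, and `mean` and `log_std` stand in for the actor network's outputs.

```python
import numpy as np


def select_action(mean, log_std, rng=None, deterministic=False):
    """Pick an action from a tanh-squashed Gaussian policy.

    deterministic=True mimics test time: tanh of the Gaussian mean
    (assumed convention). Otherwise a sample is drawn, as in training.
    """
    mean = np.asarray(mean, dtype=float)
    if deterministic:
        return np.tanh(mean)
    rng = np.random.default_rng() if rng is None else rng
    std = np.exp(np.asarray(log_std, dtype=float))
    noise = rng.standard_normal(mean.shape)
    # Squashing keeps every action component inside [-1, 1].
    return np.tanh(mean + std * noise)


mean = np.array([0.3, -0.5])
log_std = np.array([-1.0, -1.0])
train_action = select_action(mean, log_std, rng=np.random.default_rng(0))
test_action = select_action(mean, log_std, deterministic=True)
```

Repeated calls with `deterministic=True` return the same action for the same state, which is what makes evaluation reproducible.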

We used the same parameterization for the reward functions of PbRL, SACfD, and PbAIL, as well as the discriminator of CAIL, so that differences in their performance reflect the efficacy of our proposals. The reward functions (discriminators) were parameterized with an FFN of 64 units with spectral normalization [Miyato et al., 2018]. For PbRL, SACfD, and PbAIL, we also applied weight decay to their reward functions, using the same coefficient for all three methods on a given task; the task-specific values are presented in Table 6. These values were selected from $\{0, 0.0025, 0.005\}$ via grid search. The batch size was 256 for all algorithms. We now report additional details of these methods.
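As a reminder of what spectral normalization does to a weight matrix, the following sketch divides a matrix by a power-iteration estimate of its largest singular value. It is an illustration under simplifying assumptions, not the training-time procedure: the function name `spectral_normalize`, the fixed initial vector, and the number of iterations are hypothetical choices, whereas practical implementations typically keep a persistent vector and run one iteration per update.

```python
import numpy as np


def spectral_normalize(weight, n_iters=20):
    """Divide `weight` by a power-iteration estimate of its largest singular value."""
    u = np.ones(weight.shape[0]) / np.sqrt(weight.shape[0])
    for _ in range(n_iters):
        v = weight.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = weight @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ weight @ v  # estimate of the spectral norm
    return weight / sigma


# Toy weight matrix whose largest singular value is 3.
W = np.array([[3.0, 0.0], [0.0, 1.0]])
W_sn = spectral_normalize(W)
top_singular_value = np.linalg.svd(W_sn, compute_uv=False)[0]
```

After normalization the largest singular value is close to 1, which bounds the Lipschitz constant of the corresponding linear layer.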

Table 6: The coefficients of weight decay for reward functions.

| Task | Ant | Walker2d | Hopper | HC | HS | Pusher | Swimmer |
|---|---|---|---|---|---|---|---|
| Coefficient | 0.0025 | 0 | 0 | 0.0025 | 0.0025 | 0 | 0 |

Details for PbRL

PbRL learns a reward function from preferences before policy learning and uses the learned function to infer rewards for state-action pairs generated during policy learning. When learning the reward function, we used the Adam optimizer with a learning rate of 0.0001 and ran optimization for 10,000 steps.

Details for SACfD

SACfD is an extension of the method proposed by Ibarz et al. [2018]. The method was originally proposed for discrete control tasks, so we extended it to use SAC. Compared to PbRL, it adds the objective of BC to the objective of the actor. The reward function of SACfD was learned in the same way as the reward function of PbRL. As described by Ibarz et al. [2018], the actor of the policy learner was pre-trained using the objective of BC and the provided offline data before policy learning. The pre-training took 10,000 steps.

Let $L_{\text{actor}}$ be the objective function for learning the actor of SAC. During policy learning, SACfD minimizes $L_{\text{actor}}-\mathbb{E}_{(s,a)\in\mathcal{D}}\left[\mathbb{I}(Q^{\pi}(s,a)>Q^{\pi}(s,\pi_{\text{mode}}(s)))\log(\pi(a|s))\right]$, where $\mathbb{I}$ is the indicator function and $\pi_{\text{mode}}$ is the mode of $\pi$. In other words, it regularizes the actor using BC only for states at which the action in the offline data has larger Q-value than the action induced by $\pi$.
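The filtered BC term can be sketched on a batch of offline state-action pairs as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names and the toy numbers are hypothetical, and the inputs stand in for the critic's Q-values and the actor's log-probabilities, which would come from the networks in practice.

```python
import numpy as np


def filtered_bc_term(q_data, q_mode, log_prob_data):
    """Negative mean log-likelihood of offline actions, counted only where
    the offline action has a larger Q-value than the policy's mode action."""
    mask = (np.asarray(q_data) > np.asarray(q_mode)).astype(float)
    return -np.mean(mask * np.asarray(log_prob_data))


def sacfd_actor_loss(l_actor, q_data, q_mode, log_prob_data):
    """SAC actor loss plus the filtered BC regularizer."""
    return l_actor + filtered_bc_term(q_data, q_mode, log_prob_data)


# Toy batch of two offline pairs: only the first passes the Q-value filter.
q_data = np.array([1.0, 0.0])      # Q(s, a) for offline actions
q_mode = np.array([0.0, 1.0])      # Q(s, pi_mode(s))
log_prob = np.array([-2.0, -3.0])  # log pi(a | s) for offline actions
loss = sacfd_actor_loss(0.5, q_data, q_mode, log_prob)
```

In this toy batch the second pair is masked out, so only the first log-probability contributes to the regularizer.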

Details for PbAIL

As mentioned in Algorithm 1, PbAIL needs an initialization phase. In all tasks we initialized PbAIL for 10,000 steps. We applied weight decay to the reliability network in addition to its reward network. The coefficients for the reliability network were selected from $\{0, 0.0025, 0.005\}$ via grid search and are reported in Table 7.

Table 7: The coefficients of weight decay for the reliability network of PbAIL.

| Task | Ant | Walker2d | Hopper | HC | HS | Pusher | Swimmer |
|---|---|---|---|---|---|---|---|
| Coefficient | 0 | 0.005 | 0 | 0.005 | 0.005 | 0.0025 | 0.0025 |

Table 8: Results on real human preferences.

| Task | PbRL | SACfD | CAIL | PbAIL | Data | GT |
|---|---|---|---|---|---|---|
| Hopper-medium-expert | 0.57 ± 0.23 | 0.44 ± 0.12 | **1.07 ± 0.10** | 0.65 ± 0.36 | 0.67 | 0.66 |
| Hopper-medium-replay | 0.28 ± 0.19 | 0.28 ± 0.26 | 0.43 ± 0.14 | **0.45 ± 0.13** | 0.38 | 0.14 |
| Walker2d-medium-expert | 0.06 ± 0.07 | 0.21 ± 0.05 | **0.90 ± 0.05** | 0.73 ± 0.35 | 0.81 | 0.95 |
| Walker2d-medium-replay | 0.10 ± 0.10 | 0.08 ± 0.12 | **0.44 ± 0.15** | -0.01 ± 0.08 | 0.14 | 0.69 |