Learning Action-based Representations Using Invariance

Max Rudolph∗1, Caleb Chuck∗1, Kevin Black∗2, Misha Lvovsky3,
Scott Niekum3, Amy Zhang1
The University of Texas at Austin1
University of California, Berkeley2
University of Massachusetts Amherst3
Abstract

Robust reinforcement learning agents using high-dimensional observations must be able to identify relevant state features amidst many exogenous distractors. A representation that captures controllability identifies these state elements by determining what affects agent control. While methods such as inverse dynamics and mutual information capture controllability for a limited number of timesteps, capturing long-horizon elements remains a challenging problem. Myopic controllability can capture the moment right before an agent crashes into a wall, but not the control-relevance of the wall while the agent is still some distance away. To address this, we introduce action-bisimulation encoding, a method inspired by the bisimulation invariance pseudometric, that extends single-step controllability with a recursive invariance constraint. By doing this, action-bisimulation learns a multi-step controllability metric that smoothly discounts distant state features that are relevant for control. We demonstrate that action-bisimulation pretraining on reward-free, uniformly random data improves sample efficiency in several environments, including a photorealistic 3D simulation domain, Habitat. Additionally, we provide theoretical analysis and qualitative results demonstrating the information captured by action-bisimulation. Code and video: https://maxrudolph1.github.io/action-bisimulation-site/

* Authors contributed equally, corresponding author [email protected]

1 Introduction

Learning control for complex decision-making from high-dimensional observation spaces such as video and depth is vital for real-world applications of reinforcement learning (RL). To do this, agents need a representation of the observation space that allows them to reason about the environment and take intelligent actions. However, learning these representations is often sample inefficient. One reason for this is that real-world scenarios often contain many irrelevant and distracting features embedded in a high-dimensional space. Correlating reward with the relevant state elements, rather than with causally confusing distractors, is challenging in this setting, especially since reward signals are often sparse.

Representation learning has emerged as a promising approach to address this challenge by extracting a compressed and informative representation of the observation space that is useful for learning (Bengio et al., 2013). Representation learning removes irrelevant distractors from the state space used to learn the policy, which improves sample efficiency and performance. In RL, task-specific representation learning uses reward or expert behavioral similarity (Ferns et al., 2011; Agarwal et al., 2021) to discover the compressed representation, only describing task-specific elements. This has the advantage of capturing only information that is either useful for solving the task or relevant to the demonstrations while being limited by requiring either expert behavior or task-achieving policies, both of which can be difficult to obtain prior to learning. On the other hand, task-free methods use unsupervised signals like reconstruction (Lange & Riedmiller, 2010) and contrastive objectives (Laskin et al., 2020) and can be pre-trained on any data, including random actions. However, these methods are trained without action information. As a result they can capture exogenous distractors that are not useful for improving RL policy performance.

One promising direction of task-agnostic methods utilizes controllability to learn a behavior-relevant representation that is not task-specific (Lamb et al., 2022). These representations can avoid capturing task-irrelevant information while not requiring expert or reward-achieving behavior. Recent work in action-based representation learning for RL has shown promising results (Zhang et al., 2022) by utilizing inverse dynamics models to extract representations (Islam et al., 2022). These representations rely on a window of information by predicting the first action between two states separated by $k$ steps. If $k$ is small, this representation is myopic, but when $k$ is large, the prediction problem is underspecified. This underspecification restricts large $k$ to offline datasets with correlated action data, such as expert trajectories.

We investigate utilizing a novel invariant metric to learn a multi-step control-based representation instead of directly applying $k$-step prediction. Our action-bisimulation metric offers a novel framework for controllability metrics that takes a myopic dynamics encoding and extends it to multi-step representations. This formulation is inspired by reward bisimulation (Zhang et al., 2020b), which utilizes single-step reward information to learn multi-step return-capturing representations. Action-bisimulation applies bootstrapping on the myopic $k=1$ controllability representations to enforce multi-step invariance in an action-bisimulation encoding. Since the base case uses single-step prediction, the encoding can be trained with any offline data, even fully random. At the same time, bootstrapping extends the action-bisimulation encoding to capture long-term controllability. LABEL:fig:bisim_main captures how action-bisimulation maps control-irrelevant states together, while not doing the same for control-relevant states.

This work offers an empirical analysis and theoretical formulation of the novel control-based invariant metric for representation learning. We demonstrate empirically that in scenarios where complex, long-horizon, sparse-reward decision-making is required, the metric improves sample efficiency compared to RL agents trained directly from pixels, or pre-trained with existing representation learning methods in multiple domains. Next, we provide qualitative results demonstrating the robustness of the learned representation to uncontrollable distractors, as well as sensitivity to control-relevant state features.

2 Related Works

Representation Learning in RL. Learned representations have been widely applied to RL, formalized (Li et al., 2006) through hierarchical symbolic representations (Konidaris et al., 2014; Andre & Russell, 2002), skill abstractions (Dietterich, 2000), policy optimality (Auer et al., 2008; Jong & Stone, 2005; Abel et al., 2016), selective attention (Jones & Canas, 2010) and contingency awareness (Bellemare et al., 2012). One effective strategy is to use the representation to learn a model that is effective for planning (Hafner et al., 2019; Koul et al., 2023). These methods learn world models (Ha & Schmidhuber, 2018) and other representations that can be used for prediction (Singh et al., 2012), data generation and planning. Alternatively, other methods apply representation learning for filtering (Krishnan et al., 2015; Karl et al., 2016) or for reduced-complexity representations (Higgins et al., 2016; Oord et al., 2018; Laskin et al., 2020). Action-bisimulation is a novel encoder that learns controllability-based representations to improve RL performance. Unlike other representation learning methods, action-bisimulation uses a soft invariance pseudometric to capture action information through time.

Action-based Representations. RL methods have directly leveraged action-relevant representations in several ways. This includes contingency awareness (Bellemare et al., 2012; Choi et al., 2018; Chuck et al., 2020; 2023), which is closely related to action controllability (Zhong et al., 2020) and control information measures like empowerment (channel capacity between actions and state) (Jung et al., 2011; Mohamed & Jimenez Rezende, 2015; Levy et al., 2023) or affordances (Cruz et al., 2016; Khetarpal et al., 2020; Nagarajan et al., 2020). Multi-step inverse models are most similar to action-bisimulation, but common multi-step inverse methods (Lamb et al., 2022; Islam et al., 2022; Koul et al., 2023) require selecting a specific $k$ for the multi-step horizon, potentially leaving critical control information on the table. Further, it has been shown that multi-step inverse models can be insufficient when the dynamics are periodic (Levine et al., 2024). Action-bisimulation uses a soft invariance metric to extend single-step models, which better preserves long-term controllability.

Bisimulation methods. Bisimulation describes future invariant state representations, originally applied to stationary representations (Larsen & Skou, 1989; Dean et al., 1997; Ferns et al., 2004), before being extended to continuous state MDPs (Ferns et al., 2011). Reward-based bisimulation methods have gained popularity through learned deep representations (Zhang et al., 2020b). This has been extended to non-optimal policies (Castro et al., 2021), with generalized value function bounds (Kemertas & Aumentado-Armstrong, 2021) and augmented with state discretization (Kemertas & Jepson, 2022) and clustering (Liu et al., 2023). Bisimulation-based methods have also been applied in different contexts: expert policy similarity (Agarwal et al., 2021; Bertran et al., 2022; Mazoure et al., 2021), goal-conditioned RL (Hansen-Estruch et al., 2022) and reward-action policy equivalence (Liao et al., 2023; Castro, 2020). While this work draws on reward-bisimulation, action-bisimulation is fundamentally offline and task-agnostic because it takes an expectation over actions, removing its dependence on any policy.

3 Preliminaries

A Markov decision process is defined by the tuple $\mathcal{M} \coloneqq (\mathcal{S}, \mathcal{A}, p, R)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and $s \in \mathcal{S}, a \in \mathcal{A}$ are states and actions respectively. $p(s'|s,a)$ is the transition function that gives the probability of the next state $s'$ given the current state and action $(s,a)$. The reward function $R(s,a)$ maps state and action to a scalar reward. A policy $\pi(a|s)$ is the probability of an action given the current state.

This work utilizes the following two-phase paradigm. In the first phase, the agent takes actions without access to the reward function $R(s,a)$ to generate a dataset of ordered state-action tuples $\mathcal{D} \coloneqq \{(s^{(0)}, a^{(0)}), \ldots, (s^{(|\mathcal{D}|-1)}, a^{(|\mathcal{D}|-1)})\}$. Then, a representation $\phi: \mathcal{S} \rightarrow \mathcal{Z}$ is learned from $\mathcal{S}$. In the second phase, the agent learns from extrinsic reward utilizing the learned representation.
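The first phase can be sketched in a few lines. This is a minimal illustration rather than the authors' data pipeline: the corridor environment, the function names, and the assumption of a single uninterrupted trajectory (no episode resets) are all hypothetical.

```python
import random

def collect_random_data(step, s0, n_actions, n_steps, seed=0):
    """Phase 1: roll out uniformly random actions; no reward is ever queried."""
    rng = random.Random(seed)
    state, data = s0, []
    for _ in range(n_steps):
        action = rng.randrange(n_actions)  # uniform over discrete actions
        data.append((state, action))       # ordered (s, a) tuple
        state = step(state, action)
    return data

def to_transitions(data):
    """Pair consecutive entries into (s, a, s') transitions."""
    return [(s, a, s_next) for (s, a), (s_next, _) in zip(data, data[1:])]

# Hypothetical toy environment: a 5-cell corridor, action 0 = left, 1 = right.
def corridor_step(s, a):
    return max(s - 1, 0) if a == 0 else min(s + 1, 4)

dataset = collect_random_data(corridor_step, s0=2, n_actions=2, n_steps=20)
transitions = to_transitions(dataset)
```

The resulting `(s, a, s')` triples are the form of data that the inverse and forward dynamics losses consume.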

The action-bisimulation representation method is inspired by reward bisimulation (Dean et al., 1997). In RL, bisimulation is a state abstraction that groups reward-equivalent states:

Definition 3.1 (Bisimulation Relations (Givan et al., 2003)).

In MDP $\mathcal{M}$, an equivalence relation $B$ between states is a bisimulation relation if, for all $s_i, s_j \in \mathcal{S}$ that are equivalent under $B$ ($s_i \equiv_B s_j$), the following conditions hold:

$$R(s_i, a) = R(s_j, a) \quad \forall a \in \mathcal{A} \tag{1}$$
$$P(\mathcal{G}|s_i, a) = P(\mathcal{G}|s_j, a) \quad \forall a \in \mathcal{A}, \forall \mathcal{G} \in \mathcal{S}_B \tag{2}$$

where $\mathcal{S}_B$ is the partition of $\mathcal{S}$ under the relation $B$ (the set of all groups $\mathcal{G}$ of equivalent states), and $P(\mathcal{G}|s,a) = \sum_{s' \in \mathcal{G}} p(s'|s,a)$.
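For a small tabular MDP, the two conditions of Definition 3.1 can be checked directly. The sketch below is illustrative only: the function name, the nested-list layout of `R` and `P`, and the toy MDP are assumptions, not part of the original work.

```python
def is_bisimulation(partition, R, P, tol=1e-9):
    """Check Eqs. 1-2 for a candidate partition of a tabular MDP.

    partition:  list of groups, each a list of state indices
    R[s][a]:    reward for taking action a in state s
    P[s][a][t]: probability of moving from s to t under action a
    """
    n_actions = len(R[0])
    for group in partition:
        ref = group[0]
        for s in group[1:]:
            for a in range(n_actions):
                # Eq. 1: equal rewards for every action.
                if abs(R[s][a] - R[ref][a]) > tol:
                    return False
                # Eq. 2: equal probability mass on every group.
                for g in partition:
                    mass_s = sum(P[s][a][t] for t in g)
                    mass_ref = sum(P[ref][a][t] for t in g)
                    if abs(mass_s - mass_ref) > tol:
                        return False
    return True

# Toy MDP with one action: states 0 and 1 both move to state 2 with reward 0;
# state 2 loops on itself with reward 1.
R = [[0.0], [0.0], [1.0]]
P = [[[0.0, 0.0, 1.0]], [[0.0, 0.0, 1.0]], [[0.0, 0.0, 1.0]]]

merged_ok = is_bisimulation([[0, 1], [2]], R, P)   # states 0 and 1 are equivalent
merged_bad = is_bisimulation([[0, 2], [1]], R, P)  # rewards of 0 and 2 differ
```

Grouping states 0 and 1 satisfies both conditions, while grouping states 0 and 2 violates the reward condition.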

Bisimulation metrics (Ferns et al., 2011; Castro, 2020) soften the notion of state partitions with a pseudometric space $(\mathcal{S}, d)$, where the distance function $d: \mathcal{S} \times \mathcal{S} \rightarrow \mathbb{R}_{\geq 0}$ measures the similarity between two states. (Because $d$ is a pseudometric, two different states can have distance $0$.) The on-policy bisimulation metric (Kemertas & Aumentado-Armstrong, 2021) is:

$$d_{\text{r-bisim}}(s_i, s_j) = \max_a \underbrace{(1-c) \cdot |R(s_i, a) - R(s_j, a)|}_{\text{base case}} + \underbrace{c W_1(d)(p(\cdot|s_i, a), p(\cdot|s_j, a))}_{\text{recursive step}}, \tag{3}$$

where $W_1(d)$ is the 1-Wasserstein distance and $c$ is a scalar hyperparameter that weights the multi-step sensitivity of the distance. The 1-Wasserstein metric measures the distance between next-state distributions in the latent bisimulation space. We propose a novel controllability-based relation, which replaces reward equivalence with single-step control equivalence. Because it replaces rewards in the equivalence, the relation is task-agnostic.

Definition 3.2 (Action-Bisimulation Relations).

Let $\psi: \mathcal{S} \rightarrow \mathcal{Z}_{ss}$ be a single-step controllability encoder such that $p(a|\psi(s), \psi(s')) = p(a|s, s')$ for all $s, a, s'$. In MDP $\mathcal{M}$, an equivalence relation $AB$ between states is an action-bisimulation relation according to $\psi$ if, for all $s_i, s_j \in \mathcal{S}$ that are equivalent under $AB$ ($s_i \equiv_{AB} s_j$), the following conditions hold:

$$\psi(s_i) = \psi(s_j) \tag{4}$$
$$P(\mathcal{G}|s_i, a) = P(\mathcal{G}|s_j, a) \quad \forall a \in \mathcal{A}, \forall \mathcal{G} \in \mathcal{S}_{AB} \tag{5}$$

where $\mathcal{S}_{AB}$ is the partition of $\mathcal{S}$ under the relation $AB$ (the set of all groups $\mathcal{G}$ of equivalent states), and $P(\mathcal{G}|s,a) = \sum_{s' \in \mathcal{G}} p(s'|s,a)$.

This equivalence can be similarly relaxed into a pseudometric. However, in the off-policy setting, we are not interested in a particular policy, but all policies. Thus, action-bisimulation uses the expectation over uniform actions to encode all possible policies.

$$d_{\text{a-bisim}}(s_i, s_j, \psi) = \underbrace{(1-c) \cdot \|\psi(s_i) - \psi(s_j)\|_1}_{\text{base case}} + \underbrace{c \cdot \mathbb{E}_{a \sim U(\mathcal{A})}\left[W_1(p(\cdot|s_i, a), p(\cdot|s_j, a))\right]}_{\text{recursive step}} \tag{6}$$
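To make the recursion concrete, the metric can be computed by fixed-point iteration on a small tabular problem. The sketch below is an illustration under stated assumptions, not the authors' procedure: dynamics are deterministic, so the 1-Wasserstein term between two point masses is taken to be the current distance between the two next states; the corridor, the hand-written embeddings `psi`, and all names are hypothetical.

```python
def action_bisim_metric(psi, step, n_actions, c=0.5, iters=200):
    """Fixed-point iteration of Eq. 6 for deterministic tabular dynamics.

    psi:  one single-step embedding (list of floats) per state
    step: step[s][a] is the next state index
    """
    n = len(psi)
    # Base case: l1 distance between single-step embeddings.
    base = [[sum(abs(x - y) for x, y in zip(psi[i], psi[j])) for j in range(n)]
            for i in range(n)]
    d = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        # Synchronous update: the new table is built from the old one.
        d = [[(1 - c) * base[i][j]
              + c * sum(d[step[i][a]][step[j][a]] for a in range(n_actions)) / n_actions
              for j in range(n)] for i in range(n)]
    return d

# Hypothetical 5-cell corridor with walls at both ends; action 0 = left, 1 = right.
step = [[max(s - 1, 0), min(s + 1, 4)] for s in range(5)]
# Hand-written single-step embedding: which action is blocked in each cell.
psi = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]

d = action_bisim_metric(psi, step, n_actions=2, c=0.5)
```

In this toy problem, cells 2 and 3 share the same single-step embedding, yet `d[2][3]` is positive because cell 3 is one step closer to a wall. The value is smaller than `d[3][4]`, which reflects the discounting of more distant control-relevant features.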

In the next section, we describe how $\psi(\cdot)$ can be learned from data, and how to use $d_{\text{a-bisim}}$ to learn an action-bisimulation encoder.

4 Methods

This section describes the algorithm for training an action-bisimulation encoder. First, the single-step encoder is learned, then the distance in single step space is used as the “base case” for the recursive step. The training flow and inputs are visualized in LABEL:fig:bisim_mainb.

4.1 Single-Step Controllability

Inverse dynamics describes the probability of an action given two sequential states $(s, s')$: $P(a|s, s')$. To get a single-step encoding of the action-relevant state features, we define the single-step state encoder $\psi_\theta(s): \mathcal{S} \rightarrow \mathcal{Z}_{ss}$, where $\mathcal{Z}_{ss}$ is the embedded single-step space (Lamb et al., 2022), and $\psi_\theta$ is parameterized by $\theta$. Then, for a dataset $\mathcal{D}$ of $(s, a, s')$ tuples, the regularized single-step representation is learned by optimizing the single-step (ss) inverse dynamics loss:

$$L_{ss}(\mathcal{D}, \theta, \nu) = -\sum_{(s, a, s') \sim \mathcal{D}} \log f_{\nu,\text{inverse}}(a|\psi_\theta(s), \psi_\theta(s')) + \beta\left(\|\psi_\theta(s)\|_1 + \|\psi_\theta(s')\|_1\right), \tag{7}$$

where $f_{\nu,\text{inverse}}$ is a learned inverse dynamics model parameterized by $\nu$. The regularization ensures that the learned representation includes the minimum information necessary to capture the action-dependent inverse dynamics.

This inverse model is optimized to predict a distribution over actions $P(\cdot|\psi_\theta(s), \psi_\theta(s'))$ using the single-step embeddings as inputs. In this work, we represent the parameters of the distribution as a function of $[\psi_\theta(s), \psi_\theta(s')]$. Intuitively, $\psi(\cdot)$ embeds control-relevant features by embedding action-relevant components of the state. We use a relatively weak inverse model under the intuition that the simpler the model used to capture inverse dynamics, the more information is forced into the embedding rather than the inverse dynamics model.
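The single-step loss of Equation 7 can be evaluated as in the sketch below. Several choices here are assumptions made for illustration: the encoder and the inverse model are both plain linear maps, actions are discrete with a softmax over logits, and the $l_1$ penalty is applied per transition inside the sum. Only the loss value is computed; no optimizer is shown.

```python
import math

def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def single_step_loss(data, theta, nu, beta):
    """Inverse-dynamics negative log-likelihood plus l1 penalty on embeddings."""
    total = 0.0
    for s, a, s_next in data:
        z, z_next = linear(theta, s), linear(theta, s_next)  # psi_theta(s), psi_theta(s')
        logits = linear(nu, z + z_next)                       # inverse model on [z, z']
        nll = -log_softmax(logits)[a]
        penalty = beta * (sum(abs(v) for v in z) + sum(abs(v) for v in z_next))
        total += nll + penalty
    return total

# Hypothetical toy data: 2-D states, two discrete actions.
data = [([0.0, 0.0], 1, [1.0, 0.0]), ([1.0, 0.0], 0, [0.0, 0.0])]
theta = [[1.0, 0.0], [0.0, 1.0]]                       # identity encoder
nu = [[1.0, 0.0, -1.0, 0.0], [-1.0, 0.0, 1.0, 0.0]]    # logits from the change in z[0]

loss_plain = single_step_loss(data, theta, nu, beta=0.0)
loss_reg = single_step_loss(data, theta, nu, beta=0.1)
```

With `beta > 0`, any embedding dimension that does not lower the inverse-dynamics term only adds to the penalty, which is the pressure toward a minimal action-relevant embedding.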

4.2 Action-Bisimulation Metric

This section describes how the action-bisimulation metric (Equation 6) is used to learn an encoder $\phi_\eta(s): \mathcal{S} \rightarrow \mathcal{Z}$, where $\mathcal{Z}$ is the representation space and $\phi_\eta$ is parameterized by $\eta$. This definition uses the single-step representation space $\mathcal{Z}_{ss}$ to define the multi-step representation space $\mathcal{Z}$.

The recursive step $\mathbb{E}[c W_1(d)(p(\cdot|s_i, a), p(\cdot|s_j, a))]$ requires computing $p(\cdot|s_i, a)$ and $p(\cdot|s_j, a)$. This can be done by learning a forward model parameterized by $\upsilon$: $f_\upsilon(\phi_\eta(s_i), a): \mathcal{Z} \times \mathcal{A} \rightarrow P(\cdot|\phi_\eta(s_i), a)$ that takes in the state embedding and action and outputs a probability distribution over the next embedded state. We model this by outputting the parameters of a conditional Gaussian model $\mathcal{N}(\mu, \Sigma)$ following the practice of Zhang et al. (2020b). Using the notation $f(\phi_\eta(s_i), a)[s']$ to denote the probability of state $s'$ under the distribution $f_\upsilon(\phi_\eta(s_i), a)$, we train the forward model by minimizing the negative log-likelihood of the observed data in $\mathcal{D}$:

$$L_{\text{forward}}(\mathcal{D}) = -\sum_{s, a, s' \sim \mathcal{D}} \log f_\upsilon(\phi_\eta(s), a)[\phi_\eta(s')]. \tag{8}$$

In deterministic dynamics, the 1-Wasserstein distance equals the $l_1$ distance between the means. The forward model $f_\upsilon(\cdot)$ is a function of the encoded state $\phi(s)$ rather than the observation $s$ because forward dynamics over the observations is more costly due to the inherent reconstruction objectives they minimize; this reconstruction could bring in uncontrollable elements and does not inherently include control-centric components.
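The forward-model objective of Equation 8 and the deterministic shortcut for the Wasserstein term can be sketched as follows. The diagonal covariance, the callable interfaces for `phi` and `forward_model`, and the toy corridor model are assumptions made for this illustration; the source only specifies a conditional Gaussian.

```python
import math

def diag_gaussian_nll(mu, sigma, target):
    """Negative log-density of `target` under N(mu, diag(sigma^2))."""
    return sum(
        0.5 * math.log(2.0 * math.pi * s * s) + (t - m) ** 2 / (2.0 * s * s)
        for m, s, t in zip(mu, sigma, target)
    )

def forward_loss(transitions, phi, forward_model):
    """Summed NLL of the encoded next state under the predicted Gaussian."""
    total = 0.0
    for s, a, s_next in transitions:
        mu, sigma = forward_model(phi(s), a)
        total += diag_gaussian_nll(mu, sigma, phi(s_next))
    return total

def l1_between_means(mu_i, mu_j):
    """Stand-in for the 1-Wasserstein term when dynamics are deterministic."""
    return sum(abs(x - y) for x, y in zip(mu_i, mu_j))

# Hypothetical toy model: 1-D embedding, action 1 moves right, action 0 moves left.
phi = lambda s: [float(s)]
forward_model = lambda z, a: ([z[0] + (1.0 if a == 1 else -1.0)], [1.0])

loss = forward_loss([(2, 1, 3), (3, 0, 2)], phi, forward_model)
gap = l1_between_means([1.0, 2.0], [0.0, 4.0])
```

Because both toy predictions are exact and the standard deviation is 1, each transition contributes only the Gaussian normalization constant to the loss.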

In the off-policy setting, we propose using one of two expectations for the recursive step: over the uniform distribution of actions, $\mathbb{E}_{a \sim U(\mathcal{A})}$, or over the behavior distribution, $\mathbb{E}_{a \sim \pi_b(s_i)}$. The use of the behavioral distribution applies to settings where random actions might restrict the distribution of observed states. In practice, these are computed using the empirical mean. Then, the action-controllability bisimulation metric using learned models is:

$$d_{\text{a-bisim}}(s_i, s_j, \psi_\theta, \phi_\eta) = (1-c) \cdot \|\psi_\theta(s_i) - \psi_\theta(s_j)\|_1 + c \cdot \mathbb{E}_{a \sim U(\mathcal{A})}\left[W_1(f(\phi_\eta(s_i), a), f(\phi_\eta(s_j), a))\right] \tag{9}$$
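With learned models, the metric above can be evaluated for one pair of states as in the following sketch. It assumes a finite action set averaged uniformly (the empirical mean) and deterministic dynamics, so that the Wasserstein term reduces to the $l_1$ distance between predicted means. The callables `psi`, `phi`, and `forward_mean` and the toy definitions are hypothetical stand-ins for the learned networks.

```python
def l1(u, v):
    return sum(abs(x - y) for x, y in zip(u, v))

def a_bisim_target(s_i, s_j, psi, phi, forward_mean, actions, c):
    """Metric for one state pair, with W1 replaced by l1 between predicted means."""
    base = l1(psi(s_i), psi(s_j))
    recursive = sum(
        l1(forward_mean(phi(s_i), a), forward_mean(phi(s_j), a)) for a in actions
    ) / len(actions)  # empirical mean over uniformly weighted actions
    return (1.0 - c) * base + c * recursive

# Hypothetical toy models on scalar states.
psi = lambda s: [1.0 if s == 0.0 else 0.0]   # single-step embedding
phi = lambda s: [s]                          # multi-step embedding
forward_mean = lambda z, a: [z[0] + a]       # predicted mean of the next embedding
actions = [-1.0, 1.0]

far = a_bisim_target(0.0, 2.0, psi, phi, forward_mean, actions, c=0.5)
same = a_bisim_target(1.0, 1.0, psi, phi, forward_mean, actions, c=0.5)
```

Here `far` combines a base-case gap of 1 with a recursive gap of 2, and `same` is zero because both terms vanish for identical states.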

To train the encoder, we match the $l_{1}$ distance between the embedded representations $\phi(s_{i}),\phi(s_{j})$ to the metric distance:

$$L(\mathcal{D})=\frac{1}{N}\sum_{s_{i},s_{j}\sim\mathcal{D}}\left|\|\phi_{\eta}(s_{i})-\phi_{\eta}(s_{j})\|_{1}-d_{\text{a-bisim}}(s_{i},s_{j},\psi,\phi)\right|. \tag{10}$$

In practice, the parameters of $\phi$ used to calculate $d_{\text{a-bisim}}$ trail behind $\phi_{\eta}$ via an exponential moving average: $\phi=\tau\phi_{\eta}+(1-\tau)\phi$.
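As a rough illustration of how Equations 9 and 10 fit together, the sketch below computes the metric target and the encoder loss in NumPy. It is not the authors' implementation. It assumes a deterministic forward model with an $l_1$ ground metric on the latent space, so that the $W_1$ term between two predicted next latents reduces to their $l_1$ distance, and it takes all embeddings as precomputed arrays. The function names `l1`, `a_bisim_target`, and `encoder_loss` are hypothetical.

```python
import numpy as np


def l1(x, y):
    """l1 distance along the last axis of two equally shaped arrays."""
    return np.abs(x - y).sum(axis=-1)


def a_bisim_target(psi_i, psi_j, next_i, next_j, c):
    """Equation 9, assuming a deterministic forward model (assumption).

    psi_i, psi_j:   (B, d_ss) single-step embeddings psi(s_i), psi(s_j).
    next_i, next_j: (B, num_actions, d) predicted next latents f(phi(s), a),
                    one row per action, so the mean is the uniform expectation.
    c:              tradeoff parameter in [0, 1).
    """
    single_step = l1(psi_i, psi_j)               # shape (B,)
    recursive = l1(next_i, next_j).mean(axis=1)  # empirical mean over actions
    return (1.0 - c) * single_step + c * recursive


def encoder_loss(phi_i, phi_j, target):
    """Equation 10: mean absolute gap between latent l1 distance and target."""
    return np.abs(l1(phi_i, phi_j) - target).mean()
```

In a gradient-based implementation, the target would presumably be treated as a constant computed from the trailing copy of $\phi$, so that only the first term of the loss carries gradients; that detail is an assumption of this sketch.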

Algorithm 1: Action-bisimulation Encoder Learning

  Input: Dataset without reward $(s,a,s^{\prime})\sim\mathcal{D}$, initial encoder $\phi^{\bar{\theta}}(s)$
  Single-step Training: Train $\psi$ with $\mathcal{D}$ and Equation 7.
  repeat
     Forward Model Update: Update the forward model $f_{\upsilon}(\cdot)$ according to the current multi-step encoder $\phi$ using Equation 8.
     Multi-step Update: Sample pairs $s_{i},s_{j}\sim\mathcal{D}$ and minimize the loss (Equation 10), with the metric defined in Equation 9, to update the encoder parameters $\theta$.
     Momentum Update: Update the parameters: $\bar{\theta}=\tau\theta+(1-\tau)\bar{\theta}$
  until $\bar{\theta}$ converges
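The momentum update in Algorithm 1 is an exponential moving average over parameters. A minimal sketch follows, assuming parameters are stored as a dictionary of NumPy arrays; the name `momentum_update` and the dictionary layout are hypothetical and chosen only for illustration.

```python
import numpy as np


def momentum_update(theta_bar, theta, tau):
    """Return tau * theta + (1 - tau) * theta_bar for every named parameter.

    theta_bar: dict of trailing (target) parameters.
    theta:     dict of online encoder parameters with the same keys.
    tau:       step size in (0, 1]; smaller values make theta_bar trail more slowly.
    """
    return {
        name: tau * theta[name] + (1.0 - tau) * theta_bar[name]
        for name in theta_bar
    }
```

If the online parameters are held fixed, repeated application shrinks the gap between the two parameter sets by a factor of $(1-\tau)$ per step, which is why the trailing copy changes smoothly.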

5 Experiments

In this section, we aim to answer the following questions: (1) Does pretraining with the action-bisimulation objective learn representations useful for arbitrary downstream tasks? (2) How does this pretraining compare with existing methods, especially single-step action controllability? (3) Are the learned representations robust to background distractors? (4) How well does the action-bisimulation procedure capture multi-step relationships between state elements?

We run experiments in three domains, illustrated in Figure LABEL:fig:downstream_rl. Nav2D is a 15x15 grid environment where the agent navigates using cardinal directions to the center of the grid, avoiding randomly generated 2x2 obstacles. Pointmaze (Fu et al., 2020) is a 2D MuJoCo control environment where the agent takes actions to reach a goal location while being impeded by obstacles. We also investigate Distractor Pointmaze, where the background in Pointmaze has been replaced with photorealistic distractions in the form of video clips. Finally, Habitat (Savva et al., 2019b) is a complex 3D environment where the agent must navigate through scans of human environments to reach a goal location. Additional environment details (number of obstacles, goal/grid size, randomization, etc.) are in Appendix H, and all other relevant hyperparameters are in Appendix I.

5.1 Baselines

We compare the performance of our method against representation learning pretraining methods used in prior RL works that utilize control-, contrastive- and reconstruction-based objectives.

Single-Step Inverse (SSI): This baseline uses the single-step objective learned using Equation 7 with $k=1$ to learn a state representation. This demonstrates whether simply learning a myopic action-centric inverse dynamics representation is sufficient for good performance. In general, this representation performs surprisingly well.

Agent Centric Representations for Offline RL (ACRO) (Lamb et al., 2022): This method is equivalent to SSI with $k\neq 1$. When $k>1$, this means that the model must learn to identify the first action taken from a pair of states. While this allows the model to capture longer-term relationships, it also limits how effective it can be when trained with random actions.

$\beta$-Variational Autoencoder (bVAE) (Higgins et al., 2016): This method evaluates a classic compressed state reconstruction method for representation learning. While popularized with video, it has been applied to RL with marginal success. In general, reconstruction can struggle to pick up fine-grained changes such as the movement of the agent.

Contrastive Unsupervised Representations for Reinforcement Learning (CURL) (Laskin et al., 2020): This method uses data augmentation with a contrastive objective to learn a representation. In this work, we used random noise augmentations because of the importance of identifying small features (the location of the agent).

Vanilla RL (Schulman et al., 2017; Mnih et al., 2013): Trains either Deep Q-networks (DQN) in Gridworld, or Proximal Policy Optimization (PPO) in the remaining domains, from scratch.
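The difference between the SSI and ACRO prediction targets can be sketched as follows. Both baselines predict the first action $a_t$ from a pair of states; they differ only in how far apart the two states are. The helper name `inverse_dynamics_examples` is hypothetical, and the sketch assumes a single trajectory stored as a list of states with one more entry than the list of actions.

```python
def inverse_dynamics_examples(states, actions, k):
    """Build (s_t, s_{t+k}, a_t) training tuples from one trajectory.

    states:  sequence of length T + 1.
    actions: sequence of length T, where actions[t] was taken in states[t].
    k:       number of steps separating the two states (k = 1 gives the
             single-step inverse setting; k > 1 gives the multi-step setting).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(states) != len(actions) + 1:
        raise ValueError("expected one more state than actions")
    return [
        (states[t], states[t + k], actions[t])
        for t in range(len(states) - k)
    ]
```

With random actions and larger $k$, many different first actions can lead to the same pair of states, which is one way to see why the multi-step target becomes harder to predict.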

5.2 Downstream Learning

To evaluate downstream learning, we first gather an offline dataset of random-action state transitions, with sizes recorded in Table 2. The state encoder $\phi(s)$ is trained with Equation 10 and is then used to initialize the policy $\pi_{\theta}(\cdot|\phi(s))$. This fine-tuning strategy proved to be the best performing empirically, though future work could investigate freezing the encoder, the technique used in Lamb et al. (2022), as we discuss in Appendix E. Figure LABEL:fig:downstream_rl illustrates the comparison of the action-bisimulation encoder to baseline encodings.

As we can see in Figure LABEL:fig:downstream_rl, learning with the action-bisimulation representation outperforms other methods in terms of sample efficiency, even $k$-step controllability (ACRO), by a substantial margin. This provides evidence for hypotheses 1 and 2, that action-bisimulation learns useful representations which compare well with other methods. That reconstruction and data augmentation-based methods bVAE and CURL perform poorly is not unexpected: in this domain, the agent is often small, so these methods achieve low reconstruction loss even when they omit the most important element: the agent position. On the other hand, SSI captures the agent position but is highly myopic, limiting transfer to downstream tasks. We hypothesize ACRO struggles because it relies on predicting action from two states separated by $k$ timesteps, which is ill-posed, especially when using a dataset of random actions. Additional details on baselines can be found in Appendix G.

5.3 Background Distractors

In this section we evaluate hypothesis 3: whether the action-bisimulation encoding is robust to distractors. We assess this through a modified Pointmaze environment with a photorealistic visual background. The foreground, that is, the agent, goal position, and obstacles, remains the same. We visualize the distractor environments in Figure LABEL:paired_images, where the agent has been exaggerated.

Figure LABEL:fig:downstream_rlc shows that adding background distractors dramatically widens the gap between action-bisimulation and other methods. These backgrounds cause vanilla RL, reconstruction-based, and data-augmentation-based methods to struggle, since these methods have no built-in robustness to distractors. The backgrounds also have a significant effect on the fixed-step models, ACRO and SSI. For single-step models, we hypothesize this is because pretraining causes the agent to mostly ignore obstacles since they have a limited myopic effect. For ACRO, the correlated background images appear to confuse the $k$-step prediction. For action-bisimulation, by contrast, there is only a marginal difference.

We also illustrate how a few representative methods map together states in Figure LABEL:paired_images. In these plots, two nearby states are sampled and visualized. As we can see, action-bisimulation and single-step encodings encode the agent position, but action-bisimulation also maps regions of similar local obstacles together. Beta-VAE (bVAE) encodings are trained with reconstruction; the encodings largely ignore the agent in favor of matching similar backgrounds. Interestingly, ACRO also maps similar backgrounds together. We think this is because of the correlation between subsequent frames in the video, though this is worth further investigation.

5.4 Captured Representations

To investigate hypothesis 4: how well the action-bisimulation encodings capture multi-step relationships, we provide qualitative visualizations comparing the multi-step and single-step encodings.

Figure LABEL:fig:perturbation_maps shows a perturbation map, which visualizes how much the representation changes when a single obstacle is placed at a particular location, compared with the base representation. It illustrates the contrast between the myopia of the single-step encoder and the range of the multi-step encoder.
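The computation behind a perturbation map can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' plotting code: it assumes the observation is a 2D occupancy grid in which the value 1 marks an obstacle, that a perturbation places a single-cell obstacle, and that the change in representation is measured with the $l_1$ distance. The names `perturbation_map` and `encode` are hypothetical; `encode` stands for any function mapping a grid to a 1D embedding.

```python
import numpy as np


def perturbation_map(encode, base_grid):
    """Embedding change caused by adding one obstacle at each grid cell.

    encode:    callable mapping a 2D grid to a 1D embedding (hypothetical).
    base_grid: 2D array, with 1 marking obstacles (assumed encoding).
    Returns an array shaped like base_grid whose (r, c) entry is the l1
    distance between the base embedding and the perturbed embedding.
    """
    base = encode(base_grid)
    out = np.zeros(base_grid.shape, dtype=float)
    for r, c in np.ndindex(*base_grid.shape):
        perturbed = base_grid.copy()
        perturbed[r, c] = 1  # place a single obstacle cell at (r, c)
        out[r, c] = np.abs(encode(perturbed) - base).sum()
    return out
```

Under these assumptions, a myopic encoder would produce large values only in cells adjacent to the agent, while a multi-step encoder would produce values that decay gradually with distance.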

In Appendix D, we provide several additional qualitative results demonstrating how the action-bisimulation representation captures multi-step relations, including perturbation plots of how the sensitivity changes with $c$, the tradeoff parameter, and the representation difference from near-vs-far perturbations. Furthermore, sensitivity to perturbations is environment-dependent: if the environment has a fixed structure such as a corridor or maze, unreachable obstacle perturbations will be mapped close together in the action-bisimulation space.

6 Conclusion

Controllability-capturing encodings for reinforcement learning are a promising direction for representation pretraining since they can be learned without reward but are still able to filter out uncontrollable distractors. However, existing methods either only capture short-term controllability or are dependent on demonstration data, which has implicit task bias. We introduce the action-bisimulation encoding, which builds on myopic representations by enforcing recursive invariance to learn a supervision-free multi-step controllability representation. The empirical results in this work demonstrate how these encodings can be used to improve sample efficiency, especially in domains with significant background distractors. The primary limitation of this method is the inverse dynamics single-step model, which might capture only a subset of the controllable features. This can result in the representation being agnostic to important task elements. A more in-depth discussion of limitations is included in Appendix F. Altogether, action-bisimulation is a novel invariance relation for capturing controllability from offline data that removes expert performance requirements and smoothly handles long-horizon controllability.

7 Acknowledgements

This work has taken place in part in the Safe, Correct, and Aligned Learning and Robotics Lab (SCALAR) at The University of Massachusetts Amherst. SCALAR research is supported in part by the NSF (IIS-2323384), AFOSR (FA9550-20-1-0077), and the Center for AI Safety (CAIS). The work was supported by the National Defense Science & Engineering Graduate (NDSEG) Fellowship sponsored by the Air Force Office of Scientific Research (AFOSR). Special thanks to collaborators Stephen Guigere, Harshit Sikchi, Alex Levine, Siddhant Agarwal, Rudolf Lioutikov, Yuchen Cui, Akanksha Saran, Wonjoon Goo, Daniel Brown, Prasoon Goyal, Christina Yuan, and Ajinkya Jain for their fruitful conversations and timely help.

References

  • Abel et al. (2016) David Abel, David Hershkowitz, and Michael Littman. Near optimal behavior via approximate state abstraction. In International Conference on Machine Learning, pp.  2915–2923. PMLR, 2016.
  • Agarwal et al. (2021) Rishabh Agarwal, Marlos C Machado, Pablo Samuel Castro, and Marc G Bellemare. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. arXiv preprint arXiv:2101.05265, 2021.
  • Andre & Russell (2002) David Andre and Stuart J Russell. State abstraction for programmable reinforcement learning agents. In Aaai/iaai, pp.  119–125, 2002.
  • Auer et al. (2008) Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21, 2008.
  • Bellemare et al. (2012) Marc Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using atari 2600 games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pp.  864–871, 2012.
  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • Bertran et al. (2022) Martin Bertran, Walter Talbott, Nitish Srivastava, and Joshua Susskind. Efficient embedding of semantic similarity in control policies via entangled bisimulation. arXiv preprint arXiv:2201.12300, 2022.
  • Castro (2020) Pablo Samuel Castro. Scalable methods for computing state similarity in deterministic markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  10069–10076, 2020.
  • Castro et al. (2021) Pablo Samuel Castro, Tyler Kastner, Prakash Panangaden, and Mark Rowland. Mico: Improved representations via sampling-based state similarity for markov decision processes. Advances in Neural Information Processing Systems, 34:30113–30126, 2021.
  • Choi et al. (2018) Jongwook Choi, Yijie Guo, Marcin Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi, and Honglak Lee. Contingency-aware exploration in reinforcement learning. arXiv preprint arXiv:1811.01483, 2018.
  • Chuck et al. (2020) Caleb Chuck, Supawit Chockchowwat, and Scott Niekum. Hypothesis-driven skill discovery for hierarchical deep reinforcement learning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  5572–5579. IEEE, 2020.
  • Chuck et al. (2023) Caleb Chuck, Kevin Black, Aditya Arjun, Yuke Zhu, and Scott Niekum. Granger-causal hierarchical skill discovery. arXiv preprint arXiv:2306.09509, 2023.
  • Cruz et al. (2016) Francisco Cruz, Sven Magg, Cornelius Weber, and Stefan Wermter. Training agents with interactive reinforcement learning and contextual affordances. IEEE Transactions on Cognitive and Developmental Systems, 8(4):271–284, 2016.
  • Dean et al. (1997) Thomas Dean, Robert Givan, and Sonia Leach. Model reduction techniques for computing approximately optimal solutions for markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 13, pp.  124–131, 1997.
  • Dietterich (2000) Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of artificial intelligence research, 13:227–303, 2000.
  • Ferns et al. (2004) Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite markov decision processes. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, volume 20, pp.  162–169, 2004.
  • Ferns et al. (2011) Norm Ferns, Prakash Panangaden, and Doina Precup. Bisimulation metrics for continuous markov decision processes. SIAM Journal on Computing, 40(6):1662–1714, 2011.
  • Fu et al. (2020) Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Givan et al. (2003) Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model minimization in markov decision processes. Artificial Intelligence, 147(1-2):163–223, 2003.
  • Gutmann & Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp.  297–304. JMLR Workshop and Conference Proceedings, 2010.
  • Ha & Schmidhuber (2018) David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
  • Hafner et al. (2019) Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.
  • Hansen-Estruch et al. (2022) Philippe Hansen-Estruch, Amy Zhang, Ashvin Nair, Patrick Yin, and Sergey Levine. Bisimulation makes analogies in goal-conditioned reinforcement learning. In International Conference on Machine Learning, pp.  8407–8426. PMLR, 2022.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
  • Higgins et al. (2016) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations, 2016.
  • Islam et al. (2022) Riashat Islam, Manan Tomar, Alex Lamb, Yonathan Efroni, Hongyu Zang, Aniket Didolkar, Dipendra Misra, Xin Li, Harm van Seijen, Remi Tachet des Combes, et al. Agent-controller representations: Principled offline rl with rich exogenous information. arXiv preprint arXiv:2211.00164, 2022.
  • Jones & Canas (2010) Matt Jones and Fabián Canas. Integrating reinforcement learning with models of representation learning. In Proceedings of the annual meeting of the cognitive science society, volume 32, 2010.
  • Jong & Stone (2005) Nicholas K Jong and Peter Stone. State abstraction discovery from irrelevant state variables. In IJCAI, volume 8, pp.  752–757, 2005.
  • Jung et al. (2011) Tobias Jung, Daniel Polani, and Peter Stone. Empowerment for continuous agent—environment systems. Adaptive Behavior, 19(1):16–39, 2011.
  • Karl et al. (2016) Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick Van der Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.
  • Kemertas & Aumentado-Armstrong (2021) Mete Kemertas and Tristan Aumentado-Armstrong. Towards robust bisimulation metric learning. Advances in Neural Information Processing Systems, 34:4764–4777, 2021.
  • Kemertas & Jepson (2022) Mete Kemertas and Allan Douglas Jepson. Approximate policy iteration with bisimulation metrics. Transactions on Machine Learning Research, 2022.
  • Khetarpal et al. (2020) Khimya Khetarpal, Zafarali Ahmed, Gheorghe Comanici, David Abel, and Doina Precup. What can i do here? a theory of affordances in reinforcement learning. In International Conference on Machine Learning, pp.  5243–5253. PMLR, 2020.
  • Konidaris et al. (2014) George Konidaris, Leslie Kaelbling, and Tomas Lozano-Perez. Constructing symbolic representations for high-level planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
  • Koul et al. (2023) Anurag Koul, Shivakanth Sujit, Shaoru Chen, Ben Evans, Lili Wu, Byron Xu, Rajan Chari, Riashat Islam, Raihan Seraj, Yonathan Efroni, Lekan Molu, Miro Dudik, John Langford, and Alex Lamb. Pclast: Discovering plannable continuous latent states, 2023.
  • Krishnan et al. (2015) Rahul G Krishnan, Uri Shalit, and David Sontag. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.
  • Lamb et al. (2022) Alex Lamb, Riashat Islam, Yonathan Efroni, Aniket Rajiv Didolkar, Dipendra Misra, Dylan J Foster, Lekan P Molu, Rajan Chari, Akshay Krishnamurthy, and John Langford. Guaranteed discovery of control-endogenous latent states with multi-step inverse models. Transactions on Machine Learning Research, 2022.
  • Lange & Riedmiller (2010) Sascha Lange and Martin Riedmiller. Deep auto-encoder neural networks in reinforcement learning. In The 2010 international joint conference on neural networks (IJCNN), pp.  1–8. IEEE, 2010.
  • Larsen & Skou (1989) Kim G Larsen and Arne Skou. Bisimulation through probabilistic testing (preliminary report). In Proceedings of the 16th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp.  344–352, 1989.
  • Laskin et al. (2020) Michael Laskin, Aravind Srinivas, and Pieter Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pp.  5639–5650. PMLR, 2020.
  • Levine et al. (2024) Alexander Levine, Peter Stone, and Amy Zhang. Multistep inverse is not all you need, 2024.
  • Levy et al. (2023) Andrew Levy, Sreehari Rammohan, Alessandro Allievi, Scott Niekum, and George Konidaris. Hierarchical empowerment: Towards tractable empowerment-based skill-learning. arXiv preprint arXiv:2307.02728, 2023.
  • Li et al. (2006) Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for mdps. AI&M, 1(2):3, 2006.
  • Liao et al. (2023) Weijian Liao, Zongzhang Zhang, and Yang Yu. Policy-independent behavioral metric-based representation for deep reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7):8746–8754, Jun. 2023. doi: 10.1609/aaai.v37i7.26052. URL https://ojs.aaai.org/index.php/AAAI/article/view/26052.
  • Liu et al. (2023) Qiyuan Liu, Qi Zhou, Rui Yang, and Jie Wang. Robust representation learning by clustering with bisimulation metrics for visual reinforcement learning with distractions. arXiv preprint arXiv:2302.12003, 2023.
  • Mazoure et al. (2021) Bogdan Mazoure, Ahmed M Ahmed, Patrick MacAlpine, R Devon Hjelm, and Andrey Kolobov. Cross-trajectory representation learning for zero-shot generalization in rl. arXiv preprint arXiv:2106.02193, 2021.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning, 2013.
  • Mohamed & Jimenez Rezende (2015) Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. Advances in neural information processing systems, 28, 2015.
  • Nagarajan et al. (2020) Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, and Kristen Grauman. Ego-topo: Environment affordances from egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  163–172, 2020.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Savva et al. (2019a) Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019a.
  • Savva et al. (2019b) Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  9339–9347, 2019b.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
  • Singh et al. (2012) Satinder Singh, Michael James, and Matthew Rudary. Predictive state representations: A new theory for modeling dynamical systems. arXiv preprint arXiv:1207.4167, 2012.
  • Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp.  5026–5033. IEEE, 2012.
  • Xia et al. (2018) Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: real-world perception for embodied agents. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018.
  • Zhang et al. (2020a) Amy Zhang, Clare Lyle, Shagun Sodhani, Angelos Filos, Marta Kwiatkowska, Joelle Pineau, Yarin Gal, and Doina Precup. Invariant causal prediction for block mdps. In International Conference on Machine Learning, pp.  11214–11224. PMLR, 2020a.
  • Zhang et al. (2020b) Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. arXiv preprint arXiv:2006.10742, 2020b.
  • Zhang et al. (2022) Qihang Zhang, Zhenghao Peng, and Bolei Zhou. Learning to drive by watching youtube videos: Action-conditioned contrastive policy pretraining. European Conference on Computer Vision (ECCV), 2022.
  • Zhong et al. (2020) Yuanyi Zhong, Alexander Schwing, and Jian Peng. Disentangling controllable object through video prediction improves visual reinforcement learning. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  3672–3676. IEEE, 2020.

Appendix A Convergence and Causal Properties

In this section, we extend some of the convergence properties that apply to reward-based bisimulation metrics to the action-based bisimulation metrics. Then, we prove that an optimized representation is agnostic to causally irrelevant components: elements that do not affect control and cannot be affected by control.

A.1 Fixed point convergence

First, we demonstrate that our action bisimulation metric converges to a fixed point. This proof follows a similar pattern to that found in Agarwal et al. (2021).

Theorem A.1.

Let $\mathcal{M}$ be the space of bounded pseudometrics on $\mathcal{S},\mathcal{A}$. Define the operator $\mathcal{F}:\mathcal{M}\rightarrow\mathcal{M}$ based on the action-bisim distance metric in Theorem 6:

$$\mathcal{F}(d)(s_{i},s_{j})=d_{\text{ss}}(s_{i},s_{j})+c\cdot E_{a\sim U(\mathcal{A})}\left[W_{1}(d)(P(\cdot|s_{i},a),P(\cdot|s_{j},a))\right].$$

Then $\mathcal{F}$ is a contraction mapping and has a unique fixed point for a bounded dist.

Proof: See Appendix B.1. $\blacksquare$
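The contraction property can be checked numerically on a small example. The sketch below is illustrative only and is not part of the proof. It assumes a finite, deterministic MDP, in which case the Wasserstein term between two point-mass next-state distributions reduces to $d$ evaluated at the two next states, and it assumes $c<1$. The names `apply_F`, `fixed_point`, and the transition table `T` are hypothetical.

```python
import numpy as np


def apply_F(d, d_ss, T, c):
    """One application of the operator F for a deterministic finite MDP.

    d, d_ss: (S, S) arrays holding the current metric and single-step distance.
    T:       (S, A) integer array; T[s, a] is the next state (assumption:
             deterministic transitions, so W_1(d) reduces to d(T[i,a], T[j,a])).
    c:       discount-like tradeoff parameter, assumed to be in [0, 1).
    """
    num_actions = T.shape[1]
    # For each action, gather d at the pair of next states for every (i, j).
    next_d = np.mean(
        [d[np.ix_(T[:, a], T[:, a])] for a in range(num_actions)], axis=0
    )
    return d_ss + c * next_d


def fixed_point(d_ss, T, c, iters=200):
    """Iterate F from the zero metric; converges geometrically when c < 1."""
    d = np.zeros_like(d_ss, dtype=float)
    for _ in range(iters):
        d = apply_F(d, d_ss, T, c)
    return d
```

For any two candidate metrics $d_1,d_2$, the single-step term cancels in $\mathcal{F}(d_1)-\mathcal{F}(d_2)$, leaving $c$ times an average of entries of $d_1-d_2$, so the sup-norm gap shrinks by at least a factor of $c$ per application in this deterministic setting.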

A.2 Agnostic to Behavior Irrelevant Components

Just because there is an optimal fixed point does not imply that this optimal fixed point is useful. Even a trivial single-step embedding $\psi$ that maps all states to zero still satisfies the convergence result. However, if we assume that $\psi(s)$, the single-step representation, captures only action-relevant information between $S$ and $S^{\prime}$, the myopic state information, then we can show that the learned representation captures only a subset of the control-relevant state features.

First, we assume a uniform behavior policy:

Assumption A.2.

The distribution of $\pi(a|s)$ is uniform (uniform distribution denoted $U(\mathcal{A})$), and therefore not conditioned on $S$:

$$P(a|S)=\frac{1}{|\mathcal{A}|}\quad\forall a.$$

This assumption is needed because otherwise the behavior policy could introduce relationships between states and actions that are not present as a result of control. Now we turn to the properties of the single-step encoder. Using the abuse of notation where $\psi(S)$ is the random variable representing state, we make the following assumption about the single-step model:

Assumption A.3.

$\psi:\mathcal{S}\rightarrow\mathcal{Z}_{\text{ss}}$ captures a minimum sufficient representation between $S,S^{\prime}$ and $A$:

$$\psi\coloneqq\operatorname*{arg\,min}_{\psi}\quad I\left(S;\psi(S)\right)\quad\text{s.t.}\quad d_{KL}\left(P\left(A|[\psi(S),\psi(S^{\prime})]\right)\|P\left(A|[S,S^{\prime}]\right)\right)=0, \tag{11}$$

where $d_{KL}(\cdot\|\cdot)$ is the KL divergence between two distributions. This equation states that $\psi(s)$ captures as little information about the current state as possible (the objective), subject to the constraint that the conditional distribution over $A$ given $[\psi(S),\psi(S^{\prime})]$ is the same as that given $[S,S^{\prime}]$. Notice that the terms in this assumption are approximated in single-step encoder training (Equation 7): the inverse dynamics prediction approximates the KL constraint, and the encoding regularization ensures minimal remaining information.

Before using this assumption, we first define what kind of information our representation should be agnostic to. Suppose that there is a partitioning of the state features (analogous to the causal feature sets in Zhang et al., 2020a) where one set, $S^c$, is controllable, and any feature not part of that set belongs to $S^u$. The sets can be imagined as sets of causal variables, where the concatenation of these sets produces the complete state space $S$. These sets can be defined as follows:

Definition A.4.

State $S$ can be decomposed into a controllable feature set $S^c$ and an uncontrollable feature set $S^u$ that completely describe $S$ (bidirectional entropy is $1$). These partitions have the property that the transition dynamics of $S^c$ are independent of the transition dynamics of $S^u$, and the transition dynamics of $S^u$ are independent of $S^c$ and $A$:

$$
\begin{aligned}
P(S^{u'} \mid S, A) &= P(S^{u'} \mid S^u) \\
P(S^{c'} \mid S, A) &= P(S^{c'} \mid S^c, A) \\
H(S \mid S^c, S^u) &= H(S^c, S^u \mid S) = 1.
\end{aligned}
\tag{12}
$$

The encoder will compress action-irrelevant components (elements of $S^u$), which are components with no undirected path in the causal graph connected to actions. By compression, we mean that states that vary only according to these elements will share the same encoding.

Theorem A.5.

Action-Bisimulation Control Relevance: Suppose that $\phi:\mathcal{S}\rightarrow\mathcal{Z}$ maps observations to a latent action bisimulation representation where $\|\phi(s_i)-\phi(s_j)\|_1=d_{\text{a-bisim}}(s_i,s_j,\psi,\phi)$, using a $\psi$ as described in Assumption A.3. Then $Z$, the distribution of encodings, has no information about action-irrelevant components: $I(Z;S^u)=0$.

Proof: See Appendix B.3. $\blacksquare$

Appendix B Proofs

B.1 Fixed point proof

Theorem (reproduced).

Let $\mathcal{M}$ be the space of bounded pseudometrics on $\mathcal{S},\mathcal{A}$. Define the operator $\mathcal{F}:\mathcal{M}\rightarrow\mathcal{M}$ based on the action-bisim distance metric in Theorem 6:

$$
\mathcal{F}(d)(s_i,s_j) = d_{\text{ss}}(s_i,s_j) + c\cdot E_{a\sim U(\mathcal{A})}\left[W_1(d)\left(P(\cdot \mid s_i,a),P(\cdot \mid s_j,a)\right)\right].
$$

Then $\mathcal{F}$ is a contraction mapping and has a unique fixed point for a bounded distance.

Proof:
First, we use a lemma proved in Agarwal et al. (2021), which allows us to apply a powerful inequality to the bisimulation-esque pseudometric defined in Equation 6 in Appendix A.

Lemma B.1.

Inequality for two pseudometrics $d, d'$ and probability distributions $P_X, P_Y$:

$$
W_1(d)(P_X,P_Y) \leq \|d-d'\| + W_1(d')(P_X,P_Y). \tag{13}
$$

See Lemma B.1 proof in Agarwal et al. (2021).
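For intuition, the inequality follows in two lines under two readings that are assumptions here rather than statements from the text: that $\|d-d'\|$ is the supremum norm over state pairs, and that $W_1(d)(P_X,P_Y)$ is the infimum, over couplings $\gamma$ of $P_X$ and $P_Y$, of the expected distance under $d$. The full proof is in Agarwal et al. (2021).

```latex
\begin{align*}
\mathbb{E}_{(x,y)\sim\gamma}\left[d(x,y)\right]
  &= \mathbb{E}_{(x,y)\sim\gamma}\left[d'(x,y)\right]
   + \mathbb{E}_{(x,y)\sim\gamma}\left[d(x,y)-d'(x,y)\right] \\
  &\le \mathbb{E}_{(x,y)\sim\gamma}\left[d'(x,y)\right] + \|d-d'\|_\infty .
\end{align*}
% Taking the infimum over couplings \gamma on both sides gives
% W_1(d)(P_X,P_Y) \le \|d-d'\|_\infty + W_1(d')(P_X,P_Y).
```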

Then, we use the Banach fixed-point theorem. The $d_{\text{ss}}$ terms cancel, so for any pair of states $(s_i,s_j)$:

$$
\begin{aligned}
\mathcal{F}(d)(s_i,s_j)-\mathcal{F}(d')(s_i,s_j)
&= c\cdot E_{a\sim U(\mathcal{A})}\left[W_1(d)(P(\cdot \mid s_i,a),P(\cdot \mid s_j,a))\right] - c\cdot E_{a\sim U(\mathcal{A})}\left[W_1(d')(P(\cdot \mid s_i,a),P(\cdot \mid s_j,a))\right] \\
&= c\cdot E_{a\sim U(\mathcal{A})}\left[W_1(d)(P(\cdot \mid s_i,a),P(\cdot \mid s_j,a)) - W_1(d')(P(\cdot \mid s_i,a),P(\cdot \mid s_j,a))\right] \\
&\leq c\cdot E_{a\sim U(\mathcal{A})}\left[\|d-d'\| + W_1(d')(P(\cdot \mid s_i,a),P(\cdot \mid s_j,a)) - W_1(d')(P(\cdot \mid s_i,a),P(\cdot \mid s_j,a))\right] && \text{(applying Lemma B.1, Equation 13)} \\
&= c\cdot E_{a\sim U(\mathcal{A})}\left[\|d-d'\|\right] \\
&= c\cdot\|d-d'\|.
\end{aligned}
$$

Since $\mathcal{F}(d)(s_i,s_j)-\mathcal{F}(d')(s_i,s_j)\leq c\cdot\|d-d'\|$ for every pair of states, and the same bound holds with $d$ and $d'$ exchanged, $\mathcal{F}$ is a contraction mapping for $c<1$ and has a unique fixed point $d^{*}$. $\blacksquare$
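The contraction property can be checked numerically on a small example. The sketch below is illustrative only: it assumes a hypothetical tabular MDP with deterministic transitions (so that $W_1(d)$ between two point masses reduces to $d$ evaluated at the next states), the supremum norm for $\|d-d'\|$, and a single-step distance built from random encodings. None of these choices come from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, c = 6, 3, 0.9  # hypothetical sizes and discount

# Hypothetical deterministic dynamics: T[s, a] is the next state.
T = rng.integers(0, n_states, size=(n_states, n_actions))

def random_pseudometric(dim=4):
    """L1 distance between random encodings (symmetric, zero diagonal)."""
    z = rng.normal(size=(n_states, dim))
    return np.abs(z[:, None, :] - z[None, :, :]).sum(-1)

d_ss = random_pseudometric()  # stand-in for the single-step distance

def F(d):
    """F(d)(s_i, s_j) = d_ss(s_i, s_j) + c * E_{a~U(A)}[d(T[s_i, a], T[s_j, a])]."""
    # With deterministic transitions, W_1(d) between point masses is d itself.
    next_dist = d[T[:, None, :], T[None, :, :]]  # shape (n_states, n_states, n_actions)
    return d_ss + c * next_dist.mean(axis=-1)

def sup_norm(d):
    return np.abs(d).max()

# Contraction check: ||F(d1) - F(d2)|| <= c * ||d1 - d2||.
d1, d2 = random_pseudometric(), random_pseudometric()
ratio = sup_norm(F(d1) - F(d2)) / sup_norm(d1 - d2)

# Fixed-point iteration from d = 0.
d = np.zeros((n_states, n_states))
for _ in range(500):
    d = F(d)
residual = sup_norm(F(d) - d)
```

Here `ratio` is bounded by `c`, and `residual` shrinks toward zero as the iteration approaches the fixed point $d^{*}$.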

B.2 Causal Partition proof

Assumption B.2.

$\psi$ captures the information bottleneck representation between $S^t, S^{t+k}$ and $A_k$:

$$
\operatorname*{arg\,min}_{\psi}\; I\left(S^t,S^{t+k};\psi(S),\psi(S^{t+k})\right)-\beta I\left(\psi(S),\psi(S^{t+k});A_k\right) \tag{14}
$$

Then, the following theorem holds:

Theorem B.3.

Action Bisimulation Partitions: If we partition observations using the action bisimulation metric where the single-step representation optimizes Equation 14, then the action bisimulation partitions correspond to a subset of the causal feature set for current and future actions.

Proof:
Suppose $u$ is a feature along which action bisimulation partitions, but is not part of the causal feature set for current and future actions.

First, consider the case of current actions. Encoding $u$ increases $I(S^t,S^{t+k};\psi(S),\psi(S^{t+k}))$, because $u$ becomes a component of the state captured by the embedding. However, since $u$ is not part of the causal feature set, it does not increase $\beta I(\psi(S),\psi(S^{t+k});A_k)$. An embedding that encodes $u$ is therefore not optimal for Equation 14, and it increases the base-case term of the loss in Equation 10, $\|\psi(s_i)-\psi(s_j)\|$.

Second, consider the case where $u$ encodes information about future actions, say at time horizon $k$. This increases the second term of the loss in Equation 10, which can be seen by unrolling the distance across $k$ steps, where the $\ell_1$ loss is used in the Wasserstein distance.

Thus, $u$ cannot exist in an optimal solution, meaning it cannot be a feature along which action bisimulation partitions. $\blacksquare$

This connection allows us to make statements about what information the encoder compresses. The encoder will compress action-irrelevant components, which are components with no undirected path in the causal graph connected to actions, so that states differing only in these components are encoded together.

B.3 Action-Bisimulation Control Relevance Proof

Theorem (reproduced).

Action-Bisimulation Control Relevance: Suppose that $\phi:\mathcal{S}\rightarrow\mathcal{Z}$ maps observations to a latent action bisimulation representation where $\|\phi(s_i)-\phi(s_j)\|_1=d_{\text{a-bisim}}(s_i,s_j)$. Then $Z$, the distribution of encodings, has no information about action-irrelevant components: $I(Z;S^u)=0$.

Proof:
Let $s_i$ and $s_j$ be two states that differ only according to features in $S^u$. We show that $\phi(s_i)=\phi(s_j)$ for any such pair.

Lemma B.4.

$I(\psi(S);S^u)=0$ for any $\psi(\cdot)$ that satisfies Assumption A.3.

Proof of Lemma B.4:
We start by demonstrating that a $\psi(S)$ with zero mutual information with $S^u$ can still satisfy $d_{KL}\left(P\left(A \mid [\psi(S),\psi(S')]\right)\|P\left(A \mid [S,S']\right)\right)=0$, the distribution matching property of Equation 11, by showing that the distributions have no dependence on $S^u$:

$$
\begin{aligned}
P(A \mid S,S') &= \frac{P(S' \mid S,A)\,P(A \mid S)}{P(S' \mid S)} \\
&= \frac{P(S^{u'} \mid S,A)\,P(S^{c'} \mid S,S^{u'},A)\,P(A \mid S)}{P(S^{u'} \mid S)\,P(S^{c'} \mid S,S^{u'})} \\
&= \frac{P(S^{u'} \mid S^u)\,P(S^{c'} \mid S^c,A)\,P(A \mid S)}{P(S^{u'} \mid S^u)\,P(S^{c'} \mid S^c)} && \text{(applying Definition A.4)} \\
&= \frac{P(S^{c'} \mid S^c,A)\,U(\mathcal{A})}{P(S^{c'} \mid S^c)} && \text{(applying A.2)}
\end{aligned}
\tag{15}
$$

where $U(\mathcal{A})$ is the uniform distribution over actions. Since $P(A \mid S,S')$ has no dependence on $S^u$, $d_{KL}(P(A \mid S,S')\|P(A \mid \psi(S),\psi(S')))=0$ for all $\psi(S)$ where the distribution differs only according to $S^u$.

Now, consider any $\tilde{\psi}(S)$ where $I(\tilde{\psi}(S);S^u)=\alpha>0$. We have already shown that the dependence of $\tilde{\psi}(S)$ on $S^u$ is unnecessary to satisfy the KL constraint. Thus, any dependence on $S^u$ only increases the mutual information $I(S;\tilde{\psi}(S))$. This means that for any such single-step encoding $\tilde{\psi}(\cdot)$, there exists a lower-cost $\psi(\cdot)$ which has no dependence on $S^u$. Thus any $\psi(\cdot)$ that satisfies Assumption A.3 has the property $I(\psi(S);S^u)=0$.
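The argument of Lemma B.4 can be illustrated on a toy factored environment. The sketch below is a hypothetical example, not one of the paper's domains: it assumes small discrete feature sets, a uniform state distribution, a uniform policy, and dynamics that factor as in Definition A.4. It checks that the inverse-dynamics posterior $P(A \mid S,S')$ ignores $S^u$ (Equation 15), and compares $I(S;\cdot)$ for an encoder that keeps $S^u$ against one that drops it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, n_u, n_a = 4, 3, 3  # sizes of S^c, S^u, and the action set (all hypothetical)

# Controllable dynamics P(c' | c, a): shift by a - 1 on a ring, with 20% uniform slip.
P_c = np.full((n_c, n_a, n_c), 0.2 / n_c)
for c in range(n_c):
    for a in range(n_a):
        P_c[c, a, (c + a - 1) % n_c] += 0.8

# Uncontrollable dynamics P(u' | u): a random Markov chain that ignores c and a.
P_u = rng.dirichlet(np.ones(n_u), size=n_u)

# Joint P(c, u, a, c', u') under a uniform state distribution and a uniform policy.
joint = np.einsum("cax,uy->cuaxy", P_c, P_u) / (n_c * n_u * n_a)

# Inverse-dynamics posterior P(a | s, s'), normalizing over the action axis.
posterior = joint / joint.sum(axis=2, keepdims=True)

# Equation 15 check: the posterior is identical for every (u, u') pair.
posterior_ignores_u = np.allclose(posterior, posterior[:, :1, :, :, :1])

def mutual_information(p_xy):
    """I(X;Y) in nats for a joint probability table p_xy[x, y]."""
    outer = p_xy.sum(1, keepdims=True) * p_xy.sum(0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / outer[mask])))

# Deterministic encoders on the flat state index s = c * n_u + u.
n_s = n_c * n_u
p_s = np.full(n_s, 1.0 / n_s)
ctrl = np.repeat(np.arange(n_c), n_u)

joint_identity = np.diag(p_s)          # encoder that keeps S^u: psi~(s) = s
joint_ctrl = np.zeros((n_s, n_c))      # encoder that drops S^u: psi(s) = c
joint_ctrl[np.arange(n_s), ctrl] = p_s

mi_identity = mutual_information(joint_identity)  # I(S; psi~(S))
mi_ctrl = mutual_information(joint_ctrl)          # I(S; psi(S))
```

In this toy setting both encoders preserve the posterior over actions, since it depends only on $(S^c, S^{c'})$, but the encoder that drops $S^u$ has strictly lower mutual information with the state, so it is the one preferred by the objective in Equation 11.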

Lemma B.5.

For any $s_i,s_j\in S$ which differ only according to features in $S^u$,

$$
\|\psi(s_i)-\psi(s_j)\|=0
$$

Proof of Lemma B.5:
The consequence of Lemma B.4 is that the zero mutual information indicates:

$$
\psi(s_i)=\psi(s_j)\quad\quad\forall s_i,s_j\quad\text{s.t.}\quad s_i \text{ and } s_j \text{ differ only according to features in } S^u
$$

This follows from the definition of mutual information, where $I(X;Y)=0$ implies that $X$ is independent of $Y$. If two variables are independent, then any change of one variable will not change the other variable. As a result, $\|\psi(s_i)-\psi(s_j)\|=0$. $\blacksquare$

Finally, we can complete the proof by unrolling the multi-step objective for any two states $s_i,s_j$ which differ only according to $S^u$:

$$
\begin{aligned}
d_{\text{a-bisim}}(s_i,s_j,\psi,\phi) &= (1-c)\cdot\|\psi(s_i)-\psi(s_j)\|_1 + c\cdot\mathbb{E}_{a\sim U(\mathcal{A})}\left[W_1\left(p(\phi(s_i),a),p(\phi(s_j),a)\right)\right] \\
&= c\cdot\mathbb{E}_{a\sim U(\mathcal{A})}\left[W_1\left(p(\phi(s_i),a),p(\phi(s_j),a)\right)\right] \\
&= c\cdot\mathbb{E}_{a\sim U(\mathcal{A})}\left[\int_{s_i'\sim p(\phi(s_i),a),\, s_j'\sim p(\phi(s_j),a)} d_{\text{a-bisim}}(s_i',s_j')\,\delta s_i'\,\delta s_j'\right]
\end{aligned}
$$

Notice that unrolling $d_{\text{a-bisim}}(s_i',s_j')$ gives $(1-c)\cdot\|\psi(s_i')-\psi(s_j')\|_1 + c\cdot\mathbb{E}_{a\sim U(\mathcal{A})}\left[W_1\left(p(\phi(s_i'),a),p(\phi(s_j'),a)\right)\right]$. Using Definition A.4 demonstrates that $s_i'$ and $s_j'$ must also only differ according to features in $S^u$.
By induction, this difference holds for all timesteps, which demonstrates that $d_{\text{a-bisim}}(s_i,s_j,\psi,\phi)=0$. Since $\phi(\cdot)$ is defined as matching $d_{\text{a-bisim}}(s_i,s_j,\psi,\phi)$, this implies that $\phi(s_i)=\phi(s_j)$ for all $s_i,s_j\in S$ that differ only according to $S^u$. $\blacksquare$
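The induction can also be observed numerically. The following sketch is a hypothetical toy setup, not the paper's implementation: it assumes deterministic factored dynamics (so the Wasserstein term reduces to the distance between next states), a one-hot single-step encoder $\psi$ that reads only the controllable feature, and the true transitions in place of the latent dynamics $p$. It iterates the recursion to its fixed point and inspects distances between states that differ only in the uncontrollable feature.

```python
import numpy as np

n_c, n_u, n_a, c_discount = 4, 3, 3, 0.9  # hypothetical sizes and discount c
n_s = n_c * n_u

# Flat state index s = c * n_u + u.
ctrl = np.repeat(np.arange(n_c), n_u)  # controllable feature of each state
unc = np.tile(np.arange(n_u), n_c)     # uncontrollable feature of each state

# Deterministic factored dynamics: c' depends on (c, a), u' depends on u only.
T = np.zeros((n_s, n_a), dtype=int)
for s in range(n_s):
    for a in range(n_a):
        next_c = (ctrl[s] + a - 1) % n_c
        next_u = (unc[s] + 1) % n_u
        T[s, a] = next_c * n_u + next_u

# Single-step encoder that ignores S^u (as in Lemma B.5): one-hot of c.
psi = np.eye(n_c)[ctrl]
d_ss = np.abs(psi[:, None, :] - psi[None, :, :]).sum(-1)  # L1 distance

# Iterate d <- (1 - c) * d_ss + c * E_{a~U(A)}[d(s_i', s_j')] to its fixed point.
d = np.zeros((n_s, n_s))
for _ in range(300):
    next_dist = d[T[:, None, :], T[None, :, :]].mean(axis=-1)
    d = (1 - c_discount) * d_ss + c_discount * next_dist

same_c = ctrl[:, None] == ctrl[None, :]
max_dist_same_c = d[same_c].max()   # pairs differing only in S^u
min_dist_diff_c = d[~same_c].min()  # pairs differing in S^c
```

In this toy setting, pairs of states that share the controllable feature stay at distance zero throughout the iteration, while pairs that differ in the controllable feature are assigned a positive distance.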

Appendix C Alternative Base Case Representations

This section introduces the single-step contrastive alternative to the encoder introduced in Section 4.1, as well as the $k$-step generalization, of which the existing methods are the special case $k=1$.

C.1 Contrastive Representations

Contrastive representations approximate a lower bound on the mutual information between two signals, in this case the state transition $(s,s')$ and the action $a$. Mutual information is the degree to which knowledge about $(s,s')$ encodes information about $a$, defined by $I((s,s');a)=H((s,s'))-H((s,s')\mid a)$, where $H$ is the Shannon entropy. InfoNCE (Oord et al., 2018) is a popular contrastive method for computing a lower bound of this statistic based on Noise Contrastive Estimation (Gutmann & Hyvärinen, 2010). As with inverse dynamics, define the learned state encoder as $\psi_\theta(s):\mathcal{S}\rightarrow\mathcal{Z}_{ss}$.
Define the action encoder to map to the concatenated space of state encodings $[\mathcal{Z}_{ss},\mathcal{Z}_{ss}]$, where square brackets represent concatenation: $\psi_{\eta,\mathcal{A}}(a):\mathcal{A}\rightarrow[\mathcal{Z}_{ss},\mathcal{Z}_{ss}]$. Finally, define a pairwise distance operator $d(z_1,z_2):\mathcal{Z}_{ss}\times\mathcal{Z}_{ss}\rightarrow\mathbb{R}$. In our experiments $d(\cdot,\cdot)$ was the $\ell_2$ distance. The InfoNCE objective is as follows:

$$L_{\text{infoNCE}}(\mathcal{D},\theta,\eta)=E_{(s,a^+,s')\sim\mathcal{D}}\left[\frac{e^{d([\psi_\theta(s),\psi_\theta(s')],\,\psi_{\eta,\mathcal{A}}(a^+))}}{\sum_{\tilde{a}\in\{a^-,a^+\}}e^{d([\psi_\theta(s),\psi_\theta(s')],\,\psi_{\eta,\mathcal{A}}(\tilde{a}))}}\right]. \tag{16}$$

Here $a^+$ denotes the positive sample, which is the actual action taken in state $s$, and $a^-$ represents the negative samples, which are the alternative actions not taken in $s$. Optimizing the loss in Equation 16 learns a representation encoding action-relevant components. In practice, the contrastive representations did not perform as well as the inverse dynamics-based ones, and future work is investigating the reason for this in detail.
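To make the structure of Equation 16 concrete, the following is a minimal sketch of the ratio inside the expectation for a single transition. It is not the authors' implementation: the function names, the toy embeddings, and in particular the choice of the negative $\ell_2$ distance as the score (so that closer embeddings receive a larger share) are assumptions made for illustration.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def neg_l2(u, v):
    # Assumption: score embeddings by negative distance, so closer means higher.
    return -l2_distance(u, v)

def infonce_ratio(pair_embedding, action_embeddings, positive_index, score):
    """Ratio inside the expectation of Equation 16 for one transition.

    pair_embedding:    the concatenation [psi(s), psi(s')].
    action_embeddings: embeddings of all candidate actions (positive and negatives).
    score:             plays the role of d(., .) in Equation 16.
    """
    scores = [score(pair_embedding, a) for a in action_embeddings]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scores]
    return exps[positive_index] / sum(exps)

# Toy 2-D state embeddings, concatenated into a 4-D pair embedding.
pair = [0.0, 1.0] + [1.0, 1.0]
# Hypothetical embeddings of four discrete actions; index 3 is the action taken.
actions = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [2.0, 2.0, 2.0, 2.0],
    [0.0, 1.0, 1.0, 1.0],
]
ratios = [infonce_ratio(pair, actions, i, neg_l2) for i in range(len(actions))]
```

Because the denominator sums over the positive and all negatives, the ratios over the candidate actions form a probability distribution, and training pushes the share of the positive action up.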

C.2 K-step Base Cases for Action-Bisimulation

In this work, we primarily investigate a base encoder $\psi_\theta(\cdot)$ trained using $s,a,s'$, where $s$ and $s'$ are consecutive states. Prior work (Lamb et al., 2022) has investigated training encoders on two states $k$ steps apart and predicting the first action. While it may seem like longer-term controllability can be captured by simply increasing $k$, choosing a fixed horizon introduces a clear limitation: determining the inverse dynamics between states is well-defined but myopic when $k$ is small, but when $k$ gets large, the states at $t$ and $t+k$ may no longer provide meaningful information about the action. For this to be well defined, there must be a meaningful correlation between the current action and the state $k$ steps into the future. This correlation does not exist if the actions are random and the agent can return to states that it has been to before. As a result, in practice $k$-step controllability is limited to the offline RL setting, where some meaningful trajectories are provided to the agent (Islam et al., 2022). This means that in practice $k$-step methods are not fully unsupervised.

Depending on the nature of the offline data, the $k$-step extension can be combined with action-bisimulation: instead of choosing a large $c$ (i.e. $c>0.9$), the single-step encoders can be replaced with $k$-step encoders. This has the potential to significantly increase the degree to which the action-bisimulation encoder $\phi_\eta$ can capture long-term controllability.

Formally, instead of the tuple $(s,a,s')$, we use the tuple $(s^{(t)},a^{(t)},s^{(t+k)})$. Then, we can represent the $k$-step regularized inverse dynamics loss (adapting from Equation 7) with:

$$L_{ssr}(\mathcal{D},\theta,\eta)=-\sum_{(s^{(t)},a^{(t)},s^{(t+k)})\sim\mathcal{D}}\log f_{\eta,\text{forward}}\left(a^{(t)}\mid\psi_\theta(s^{(t)}),\psi_\theta(s^{(t+k)})\right)+\beta\left(\|\psi_\theta(s^{(t)})\|_1+\|\psi_\theta(s^{(t+k)})\|_1\right).$$

Notice that if a large $k$ is chosen, this can run into the same issues as other fixed-$k$ methods, where the distribution of actions can affect the features captured by the single-step model.
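The $k$-step regularized inverse dynamics loss can be sketched as a cross-entropy term plus an $L_1$ penalty on both encodings. The snippet below is an illustrative sketch, not the authors' code: it assumes the inverse model outputs unnormalized logits over discrete actions, and that the $L_1$ penalty is applied once per sampled tuple. All function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def l1_norm(z):
    """L1 norm of an encoding."""
    return sum(abs(x) for x in z)

def k_step_inverse_dynamics_loss(batch, beta):
    """Sum over tuples of -log f(a_t | z_t, z_{t+k}) + beta * (|z_t|_1 + |z_{t+k}|_1).

    batch: list of (action_logits, action_index, z_t, z_tk), where action_logits
           is the inverse model output for the pair (z_t, z_tk).
    """
    total = 0.0
    for logits, action, z_t, z_tk in batch:
        nll = -log_softmax(logits)[action]            # action prediction term
        penalty = beta * (l1_norm(z_t) + l1_norm(z_tk))  # sparsity term
        total += nll + penalty
    return total

# One tuple with uniform logits over 4 actions and small toy encodings.
example = [([0.0, 0.0, 0.0, 0.0], 2, [1.0, -1.0], [0.5, 0.0])]
loss = k_step_inverse_dynamics_loss(example, beta=0.1)
```

With uniform logits the prediction term equals $\log 4$, so the example isolates how $\beta$ trades prediction accuracy against the size of the encodings.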

Similarly, we can extend the InfoNCE representation by replacing $a^+$ with $\mathbf{a}^+$, which is the actual sequence of actions between $s^{(t)}$ and $s^{(t+k)}$, instead of just the first action. We can also replace $a^-$ with $\mathbf{a}^-$, which is a sequence of actions different from the actual one. This gives the $k$-step representation of Equation 16:

$$L_{\text{infoNCE}}(\mathcal{D},\theta,\eta)=E_{(s^{(t)},\mathbf{a}^+,s^{(t+k)})\sim\mathcal{D}}\left[\frac{e^{d([\psi_\theta(s^{(t)}),\psi_\theta(s^{(t+k)})],\,\psi_{\eta,\mathcal{A}}(\mathbf{a}^+))}}{\sum_{\tilde{\mathbf{a}}\in\{\mathbf{a}^-,\mathbf{a}^+\}}e^{d([\psi_\theta(s^{(t)}),\psi_\theta(s^{(t+k)})],\,\psi_{\eta,\mathcal{A}}(\tilde{\mathbf{a}}))}}\right]. \tag{17}$$
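Both $k$-step variants consume tuples extracted from a trajectory: the inverse dynamics loss uses the first action $a^{(t)}$, while the contrastive loss uses the full action sequence between $s^{(t)}$ and $s^{(t+k)}$. A minimal sketch of that extraction follows; the function name and the trajectory layout (one more state than actions) are assumptions for illustration.

```python
def k_step_tuples(states, actions, k):
    """Extract (s_t, a_t, action_sequence, s_{t+k}) tuples from one trajectory.

    states:  list of T + 1 states; actions[t] leads from states[t] to states[t + 1].
    actions: list of T actions.
    k:       number of steps between the two states (k = 1 is the single-step case).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(states) != len(actions) + 1:
        raise ValueError("expected one more state than actions")
    tuples = []
    for t in range(len(actions) - k + 1):
        first_action = actions[t]                 # used by the inverse dynamics loss
        sequence = tuple(actions[t:t + k])        # used by the k-step contrastive loss
        tuples.append((states[t], first_action, sequence, states[t + k]))
    return tuples

# Toy trajectory with 5 states and 4 actions.
traj_states = ["s0", "s1", "s2", "s3", "s4"]
traj_actions = ["a0", "a1", "a2", "a3"]
two_step = k_step_tuples(traj_states, traj_actions, k=2)
```

Setting `k=1` recovers the ordinary $(s,a,s')$ tuples, with an action sequence of length one.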

C.3 Adaptive Regularization for Minimal Representation

To train an encoder with the loss described in Eq. 7, it is necessary to choose a regularizing constant $\beta$ beforehand. We found it was possible (and sometimes easier from a hyper-parameter search perspective) to adapt the $\beta$ parameter to the current performance of the encoder $\psi_\theta$. We changed $\beta$ throughout training according to the accuracy of the inverse dynamics predictions, lowering the regularization constant when the accuracy was low and raising it when the accuracy was high. The intuition is that if accuracy is low, then the representation needs to be less minimal and so we need to regularize less heavily. We calculated the regularization constant $\beta_i$, where $i$ is the training iteration, with:

$$\beta_i=\beta_{\text{max}}\left(1-\exp(-4\alpha_{i-1}^2)\right),$$

where $\beta_{\text{max}}$ is the maximum regularization constant and $\alpha_{i-1}$ is the action prediction accuracy during the previous iteration. This trick did not significantly impact our results, but reduced the hyper-parameter search effort.
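The schedule above is a one-line function of the previous accuracy. The sketch below simply evaluates the formula; the function and argument names are hypothetical.

```python
import math

def adaptive_beta(beta_max, prev_accuracy):
    """Regularization constant beta_i = beta_max * (1 - exp(-4 * alpha_{i-1}^2)).

    prev_accuracy is the action prediction accuracy alpha_{i-1} in [0, 1].
    """
    return beta_max * (1.0 - math.exp(-4.0 * prev_accuracy ** 2))

# Fraction of beta_max applied at a few accuracy levels.
schedule = [adaptive_beta(1.0, alpha) for alpha in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

The constant is zero when the inverse model predicts nothing correctly, reaches roughly 63% of $\beta_{\text{max}}$ at an accuracy of 0.5, and roughly 98% at an accuracy of 1.0.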

Appendix D Additional Qualitative Results

This section describes several other qualitative results that demonstrate the properties of the encodings learned using action-bisimulation as compared to other encoding methods. We first provide qualitative results describing how the representation is sensitive not only to the agent's location but also to the local obstacles. This distinction is valuable since encoding agent position alone can often be sufficient to significantly improve downstream RL performance. To generate the plot, we randomly generate obstacles either near the agent (left) or far from the agent (right), where near and far are described below. When the changes are near the agent, there is a large variation in representation distance. On the other hand, distant perturbations make little difference to the representation of the agent.

Figure LABEL:fig:gamma_variance shows how this dropoff varies as the value of $c$, the discount factor in Equation 6, changes. The multi-step encoder gracefully increases in sensitivity with greater $c$, though a very large $c$ can make it unstable. Fundamentally, the number of possible action sequences grows exponentially with the horizon, especially when trained with random actions, which is why selecting a value of $c$ that ensures some dropoff keeps the action-bisimulation representation from becoming too off-policy.
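One way to build intuition for the role of $c$ is to unroll the recursion $d = (1-c)\,\|\psi(s_i)-\psi(s_j)\|_1 + c\,\mathbb{E}_a[W_1(\cdot)]$. Under the simplifying assumption of deterministic transitions (so the Wasserstein term reduces to the distance between next states), the single-step difference $t$ steps ahead receives weight $(1-c)\,c^t$. The sketch below computes these weights and a hypothetical "effective horizon", defined here as the number of steps needed to accumulate a given share of the total weight; neither helper comes from the paper.

```python
def horizon_weights(c, horizon):
    """Weights (1 - c) * c**t on the single-step difference t steps ahead."""
    if not 0.0 <= c < 1.0:
        raise ValueError("c must lie in [0, 1)")
    return [(1.0 - c) * c ** t for t in range(horizon)]

def effective_horizon(c, mass=0.5):
    """Smallest number of steps whose weights sum to at least `mass`."""
    if not 0.0 <= c < 1.0:
        raise ValueError("c must lie in [0, 1)")
    if not 0.0 < mass < 1.0:
        raise ValueError("mass must lie in (0, 1)")
    total, steps = 0.0, 0
    while total < mass:
        total += (1.0 - c) * c ** steps
        steps += 1
    return steps

# Effective horizons for a few values of c.
horizons = {c: effective_horizon(c) for c in (0.0, 0.75, 0.95, 0.99)}
```

Under this reading, $c=0.75$ places half of the weight on the first 3 steps, while $c=0.99$ spreads it over 69 steps, which is consistent with the observation that large $c$ increases sensitivity to distant features.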

In Figure LABEL:fig:maze_corridor, we demonstrate how the action-bisimulation metric can learn representations that ignore control-irrelevant information. In the Corridor environment, the agent is never able to leave the interior of the corridor but can always observe the obstacles on the exterior. We see that the representation is sensitive only to changes within the corridor. These results are echoed in the more complex Maze environment where the unreachable obstacles inside the maze's walls have little to no effect on the agent's representation. In fact, we can see that the representation's region of sensitivity almost exactly matches the agent's reachable locations.

Appendix E Additional RL Results

The Pointmaze environment we evaluated on used a set of discrete actions. In general, evaluating the expectation over $U(\mathcal{A})$, the uniform distribution over actions, is easier with discrete actions. Action-bisimulation can be approximated in continuous contexts simply by sampling some representative number of actions. We demonstrate this in a continuous pointmaze environment, where the agent takes continuous 2D directions.
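When actions are continuous, an expectation over the uniform action distribution can be approximated by Monte Carlo sampling. The sketch below assumes, for illustration only, that an action is a unit-length 2D direction parameterized by an angle; the helper names and the sample count are hypothetical and not taken from the paper.

```python
import math
import random

def sample_directions(n, rng):
    """Draw n unit-length 2D directions with uniformly distributed angle."""
    angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return [(math.cos(a), math.sin(a)) for a in angles]

def expected_over_actions(fn, n, rng):
    """Monte Carlo estimate of E_{a ~ U(A)}[fn(a)] from n sampled actions."""
    samples = sample_directions(n, rng)
    return sum(fn(a) for a in samples) / n

# Example: the expected squared x-component of a uniform direction is 0.5.
rng = random.Random(0)
estimate = expected_over_actions(lambda a: a[0] ** 2, 20000, rng)
```

In the action-bisimulation setting, `fn` would be replaced by the per-action term inside the expectation, evaluated for a pair of states.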

Figure 6: Continuous Pointmaze performance: because the action space is more challenging, many of the baselines struggle, especially ACRO, where action ambiguity is heightened.

As we can see, Figure 6 provides evidence that action-bisimulation encodings are not limited to discrete actions.

| Environment | 2D Navigation | Point-Mass | Habitat |
|---|---|---|---|
| Action-Bisimulation | $-14.266\pm 0.509$ | $-30.8\pm 3.7$ | $0.7754\pm 0.005049$ |
| Single-step | $-12.613\pm 1.992$ | $-29.7\pm 3.9$ | $0.7716\pm 0.1629$ |
| ACRO | $-49.436\pm 0.342$ | $-127.9\pm 0.61$ | $0.7374\pm 0.0065$ |
| $\beta$-VAE | $-13.880\pm 2.263$ | $-128.4\pm 0.52$ | $0.138\pm 0.0220$ |
| CURL | $-44.791\pm 7.472$ | $-128.4\pm 0.45$ | $0.7657\pm 0.0193$ |
| Vanilla | $-14.523\pm 1.560$ | $-128.4\pm 0.052$ | $0.728\pm 0.0294$ |

Table 1: Final Performance Evaluation

Appendix F Limitations

The four primary limitations we observed in this work for implementing action-bisimulation are as follows; we go into further detail in the subsequent subsections:

  1. The minimum controllable single-step representation, especially when using learned inverse dynamics, can omit controllable information if the action is overrepresented in the state.

  2. Uncontrollable but reward-relevant elements must be incorporated into the representation after it is learned.

  3. Both the forward model for transitions and the action-bisimulation encoding are bootstrapped over the expectation over all actions. This can result in unstable training.

  4. When a task does not require much lookahead, action-bisimulation will only provide a marginal benefit.

F.1 Limitations of Minimum Controllable Representation

While Appendix A demonstrates that at convergence the action-bisimulation encoding will capture only action-relevant information, it does not guarantee that it will capture all action-relevant information. If using the regularized single-step loss in Equation 7, the method is regularized to capture the minimal sufficient information to predict actions. In practice, this can be quite limiting.

For example, consider the scenario where in the top corner of the screen there is a small display of the last action that the agent took. In this scenario, the inverse dynamics model is likely to learn to only pay attention to this part of the screen, ignoring other components such as the state of the agent. This is because paying attention to this part is a sufficient representation of actions, even though it does not capture all action-relevant information. Information-based methods such as Equation 16 or generative controllability representations are a possible solution for this, but we have found empirically that they do not seem to learn representations useful for RL. As a result, future work should investigate action-controllable components that are not necessary for inverse dynamics (action prediction).

F.2 Uncontrollable Reward-relevant components

Another possible limitation applies to all controllability-based representations: a representation that captures controllable elements may fail to capture uncontrollable reward-relevant components. For example, consider a goal-based task where the goal is part of the state. The goal itself is not controllable and thus the representation of the goal will not be encoded in an action-bisimulation representation. While the action-bisimulation method is task-agnostic, at least insofar as the initial offline dataset is task-agnostic, it is not a sufficient representation in every task.

This issue can be mitigated by a variety of approaches. The simplest is to allow the representation to be modified to be task-specific during RL training, and we employ this strategy in this work. However, more complex strategies might add an additional channel for task-relevant information, or integrate classic reward-bisimulation to learn the task-relevant components on top of the pre-trained action-bisimulation ones.

F.3 Training Instability of Bootstrapping

One of the challenges when learning an action-bisimulation representation is the inherent bootstrapping: the forward model is trained with $f(\phi(s),a)\rightarrow\phi'(s)$ while the encodings themselves are being updated with Equation 10. Since the expectation in Equation 9 is taken over the uniform distribution over actions, this can result in instability because of the combinatorial complexity of actions. One way we mitigated this is through the adaptive learning rate of the forward model, but future work should investigate stabilizing the convergence, especially if action-bisimulation is applied to online data.

F.4 Tasks without Lookahead

Finally, while multi-step controllability is a powerful property, not all tasks require this kind of lookahead, and it is not clear that multi-step pretraining would outperform single-step or other baselines in these cases. For example, in the popular MuJoCo locomotion domains (Todorov et al., 2012), knowing about future control can often be distracting to the agent, since all the relevant information is captured by determining how the current action will affect future actions. Domains where long-term control is useful, such as manipulation, can also be challenging for the current form of action-bisimulation because of the minimal sufficient information property of the single-step losses. Future work is aimed at investigating this in greater detail.

Appendix G Baseline Details

A detailed description of each of the baselines and the limitations of each. Also a mention of reward-based bisimulation.

Appendix H Environment Details

| Environment | Pretrain dataset | Evaluation Steps |
|---|---|---|
| 2D Navigation | 1M samples | 2M steps |
| Point-Mass | 0.25M samples | 7M steps |
| Habitat | 100k samples | 2M steps |

Table 2: Amount of data used for the two phases. The pretrain dataset uses random actions and is used to train the encoders, and the evaluation steps are the number of environment steps used to train RL.

H.1 2D Navigation with Obstacles

This environment is a 2D gridworld consisting of a $15\times 15$ grid. The agent has 4 actions: up, down, left, and right, $[(0,-1),(0,1),(-1,0),(1,0)]$. If the agent moves into an obstacle or the edge of the screen, its location does not change; otherwise, the direction is added to the current position. The agent takes as observation a 3-channel $15\times 15$ image. The first channel encodes the agent position with $1$ at the location of the agent and $-1$ elsewhere. The second channel encodes obstacles as $1$ where there is an obstacle and $-1$ otherwise, and the last channel encodes the goal. In this version, the goal is always located at the center of the image, $(7,7)$. This is because otherwise a task-agnostic encoding would have to re-learn the goal location. The agent receives a reward of $-1$ everywhere except the goal, where it receives a reward of $0$.

Initialization of the environment is as follows: the agent is initialized at a random location. Then $20$ obstacles of size $2\times 2$ are generated at random locations. The obstacles can overlap, but they cannot be initialized on top of the agent. Finally, the environment checks that there exists a path from the agent to the goal. If there is not, the environment is reinitialized until there is. Each episode is $50$ timesteps, after which a new environment is initialized.
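The environment description above can be sketched in a few functions. This is a reconstruction from the text, not the authors' code: the function names, the `[y][x]` image layout, the $\pm 1$ encoding of the goal channel, resampling an obstacle that lands on the agent, and computing the reward from the position after the move are all assumptions.

```python
import random
from collections import deque

SIZE = 15
GOAL = (7, 7)
ACTIONS = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # up, down, left, right

def in_bounds(pos):
    return 0 <= pos[0] < SIZE and 0 <= pos[1] < SIZE

def reachable(start, goal, obstacles):
    """Breadth-first search over free cells from start to goal."""
    seen, queue = {start}, deque([start])
    while queue:
        pos = queue.popleft()
        if pos == goal:
            return True
        for dx, dy in ACTIONS:
            nxt = (pos[0] + dx, pos[1] + dy)
            if in_bounds(nxt) and nxt not in obstacles and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def reset(rng, num_obstacles=20):
    """Sample an agent and 2x2 obstacles, retrying until the goal is reachable."""
    while True:
        agent = (rng.randrange(SIZE), rng.randrange(SIZE))
        obstacles, placed = set(), 0
        while placed < num_obstacles:
            x, y = rng.randrange(SIZE - 1), rng.randrange(SIZE - 1)
            block = {(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)}
            if agent in block:  # assumption: resample blocks that cover the agent
                continue
            obstacles |= block  # overlapping blocks are allowed
            placed += 1
        if reachable(agent, GOAL, obstacles):
            return agent, obstacles

def step(agent, action_index, obstacles):
    """Move unless blocked by an obstacle or the grid edge; reward is 0 at the goal, else -1."""
    dx, dy = ACTIONS[action_index]
    nxt = (agent[0] + dx, agent[1] + dy)
    new_agent = nxt if in_bounds(nxt) and nxt not in obstacles else agent
    return new_agent, (0 if new_agent == GOAL else -1)

def observe(agent, obstacles):
    """Three SIZE x SIZE channels with values in {-1, 1}: agent, obstacles, goal."""
    def channel(cells):
        return [[1 if (x, y) in cells else -1 for x in range(SIZE)] for y in range(SIZE)]
    return [channel({agent}), channel(obstacles), channel({GOAL})]
```

An episode under this sketch would call `reset` once, then alternate `observe` and `step` for 50 timesteps.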

H.2 Pointmass

This environment is a modification of the MuJoCo Pointmass environment, in which a pointmass with the dynamics of a damped linear $x$ and $y$ joint, with damping coefficient $1$ and friction coefficient $0.5$, navigates through the environment using a set of four discrete actions: up, down, left, and right. The original environment only included a small number of pre-defined mazes in a $15\text{m}\times 15\text{m}$ world. Additionally, the original environment directly gives observations of the position of the goal and agent, while this version gives pixel data from a fixed top-down camera. The original environment lacked the complexity of controllability in the dynamics we are interested in investigating in this work. Instead, we modified the environment so that $20$ obstacles of size $2\text{m}\times 2\text{m}$ are randomly arranged in the environment, and added walls to prevent the point from leaving the field of view. The goal is always located in the center of the image. In this environment, the extrinsic reward function is a sparse 0/1 reward for being within $1\text{m}$ of the goal, which is the distance traversed by the agent in $1$ timestep. Each episode lasts $128$ time steps.

H.3 Habitat

Habitat (Savva et al., 2019a) is a photorealistic 3D simulator for training embodied agents. The experiments in this paper use five scenes from the Tiny partition of the Gibson dataset (Xia et al., 2018): Andover, Azusa, Anaheim, Ballou, and Spotswood. These scenes were chosen for their high navigational complexity. The observation space is a visually rich RGB+Depth image. Unlike the original Habitat environment, we use an orthographic (as opposed to pinhole) camera placed above the goal in each episode so that the goal location is always at the center of the image observation; using a consistent goal location with respect to the camera is critical as we do not include any other goal information in the observation (in contrast with the traditional PointNav task in Habitat, which includes a distance+compass heading sensor to the goal). In the RGB observation, we place a yellow box on top of the agent to indicate its location because the default rendered agent is sometimes the same color as the floor below it; the depth image remains unchanged. The agent and goal are spawned in new locations every episode such that the agent is always in view of the camera; this means that each episode looks at a different part of the scene.

Appendix I Hyperparameters

I.1 Nav2D

The network dimensions and architectures used for Nav2D.

Inverse Dynamics Model

| Layer Type | Layer Size |
|---|---|
| Input | 1152 |
| Linear & ReLU | 256 |
| Linear | 256 |

Action-Bisimulation Parameters

| Parameter | Value |
|---|---|
| Single Step $L_1$ Penalty | $0.0001$ |
| Multi Step $c$ | $0.99$ |
| Learning Rate | $0.0001$ |

Reinforcement Learning Parameters

| Parameter | Value |
|---|---|
| Algorithm | DQN |
| Batch Size $L_1$ Penalty | $32$ |
| $\epsilon_{\text{end}}$ | 0.2 |
| $\epsilon_{\text{start}}$ | 0.9 |
| $\gamma$ | $0.99$ |
| Learning Rate | 0.0001 |

I.2 Pointmass

The network dimensions and architectures used for the environment.

Encoder parameters

| Layer Type | Kernel Size | Layer Size |
|---|---|---|
| Input | N/A | 64x64x3 |
| Conv2D & ReLU | 3x3 | 32x32x8 |
| Conv2D & ReLU | 3x3 | 16x16x16 |
| Conv2D | 8x8 | 1x1x32 |

Inverse Dynamics Model

| Layer Type | Layer Size |
|---|---|
| Input | 64 |
| Linear & ReLU | 256 |
| Linear | 32 |

Actor/Critic Models

| Layer Type | Layer Size |
|---|---|
| Input | 32 |
| Linear & ReLU | 256 |
| Linear | 4/1 |

Action-Bisimulation Parameters

| Parameter | Value |
|---|---|
| Single Step $L_1$ Penalty | $1.0$ |
| Multi Step $c$ | $0.75$ |
| K Steps | $5$ |
| Learning Rate | $0.0001$ |

Reinforcement Learning Parameters

| Parameter | Value |
|---|---|
| Algorithm | PPO |
| Batch Size $L_1$ Penalty | $256$ |
| Steps Per Rollout | $65536$ |
| Steps Per Eval | $16384$ |
| Learning Rate | $0.000025$ |

The $L_1$ penalty is a particularly sensitive parameter: when it is set incorrectly, the single-step model fails to identify relevant features. To train this effectively, an adaptive term was used to scale the $L_1$ regularization term to the listed value as the encoder approached convergence.

I.3 Habitat

Encoder hyperparameters and PPO. The network dimensions and architectures used for the Habitat experiments are exact copies of the ResNet18 (He et al., 2015) networks used in the original Habitat PointGoal navigation task (Savva et al., 2019a). For pretraining the encoders, we only trained the visual feature encoder of the ResNet18 policy used in Habitat. Further, we used the vanilla implementation of PPO written in Habitat with all default parameters.

Inverse Dynamics Model

| Layer Type | Layer Size |
|---|---|
| Input | 2048 |
| Linear & ReLU | 256 |
| Linear | 256 |

Action-Bisimulation Parameters

| Parameter | Value |
|---|---|
| Single Step $L_1$ Penalty | $0.0$ |
| Multi Step $c$ | $0.95$ |
| Learning Rate | $0.0001$ |

PPO Parameters

| Parameter | Value |
|---|---|
| clip_param | 0.2 |
| ppo_epoch | 4 |
| num_mini_batch | 2 |
| value_loss_coef | 0.5 |
| entropy_coef | 0.01 |
| lr | 0.00025 |
| eps | .00001 |
| max_grad_norm | 0.5 |
| num_steps | 128 |
| hidden_size | 512 |
| gamma | 0.99 |
| tau | 0.95 |