Disentangled (Un)Controllable Features

1st Jacob E. Kooi
Quantitative Data Analytics
Vrije Universiteit Amsterdam
Amsterdam, Netherlands

2nd Mark Hoogendoorn
Quantitative Data Analytics
Vrije Universiteit Amsterdam
Amsterdam, Netherlands

3rd Vincent Francois-Lavet
Quantitative Data Analytics
Vrije Universiteit Amsterdam
Amsterdam, Netherlands

Abstract

In the context of MDPs with high-dimensional states, downstream tasks are predominantly applied on a compressed, low-dimensional representation of the original input space. A variety of learning objectives have therefore been used to attain useful representations. However, these representations usually lack interpretability of the different features. We present a novel approach that is able to disentangle latent features into a controllable and an uncontrollable partition. We illustrate that the resulting partitioned representations are easily interpretable on three types of environments and show that, in a distribution of procedurally generated maze environments, it is feasible to interpretably employ a planning algorithm in the isolated controllable latent partition.

I Introduction

Learning from high-dimensional data remains a challenging task. Particularly for reinforcement learning (RL), the complexity and high dimensionality of the Markov Decision Process (MDP) [1] states often lead to complex or intractable solutions. In order to facilitate learning from high-dimensional input data, an encoder architecture can be used to compress the inputs into a lower-dimensional latent representation. To this end, a plethora of work has successfully focused on discovering a compressed encoded representation that accommodates the underlying features for the task at hand [2, 3, 4, 5, 6, 7, 8].

The resulting low-dimensional representations, however, seldom contain specific disentangled features, which leads to disorganized latent information. This means that the individual latent states can represent the information from the state in any arbitrary way. The result is a representation with poor interpretability, as the latent states cannot be connected to certain attributes of the original observation space (e.g., the x-y coordinates of the agent). Prior work in structuring a latent representation has shown notions and use of interpretability in MDP representations [9]. When expanding this notion of interpretability to be compatible with RL, it has been argued that the controllable features should be an important element of a latent representation, since they generally represent what is directly influenced by the policy. In this light, [10] have introduced the concept of isolating and disentangling controllable features in a low-dimensional maze environment, by means of a selectivity loss. Furthermore, [11] took an object-centric approach to isolate distinct objects in MDPs and [12] showed theoretical foundations for this isolation in a weakly-supervised controllable setting. Controllable features, however, only represent a fragment of an environment, whereas in many cases the uncontrollable features are of equal importance. For example, in the context of a distribution of mazes, for the prediction of the next controllable (agent) state following an action, the information about the wall structure is crucial (see Fig. 1). We therefore hypothesize that a thorough representation should incorporate controllable and uncontrollable features, ideally in a disentangled, interpretable arrangement. Interpretability is crucial for future real-world deployment [13], while an additional benefit would be that the separation of the controllable and uncontrollable features can be exploited in downstream algorithms such as planning.

Our contribution consists of an algorithm that explicitly disentangles the latent representation into a controllable and an uncontrollable latent partition. We showcase it on three types of environments, each with a different class of controllable and uncontrollable elements. This allows for a precise and visible separation of the latent features, improving interpretability and representation quality, and possibly moving towards a basis for building causal relationships between an agent and its environment. The unsupervised learning algorithm consists of both an action-conditioned and a state-only forward predictor, along with a contrastive and an adversarial loss, which isolate and disentangle the controllable versus the uncontrollable features. Furthermore, we show an application of learning and planning on the human-interpretable disentangled latent representation, where the properties of disentanglement allow the planning algorithm to operate solely in the controllable partition of the latent representation.

Figure 1: Visualization in a maze environment of four random pixel observations $s\in\mathbb{R}^{48\times 48}$ (left) and the encoded observations $z=f(s;\theta_{enc})\;\forall s\in\mathcal{S}$ (right). On the right, we can see the disentanglement of the controllable latent $z^{c}\in\mathbb{R}^{2}$ on the horizontal axes, and the uncontrollable latent $z^{u}\in\mathbb{R}^{1}$ on the vertical axis. The encoder is trained on high-dimensional tuples $(s_{t},a_{t},r_{t},s_{t+1})$ , sampled from a replay buffer $\mathcal{B}$ , gathered from random trajectories in the four maze environments shown on the left. All possible states in all four mazes are encoded and plotted with the transition prediction for each possible action, revealing a clear disentanglement between the controllable latents (agent x-y position) and the uncontrollable latent (wall architecture). Note that all samples are taken from the same buffer, filled with samples from all four mazes.

II Related Work

General Representation Learning

Many works have focused on converting high-dimensional inputs to a compact, abstract latent representation. Learning this representation can make use of auxiliary, unsupervised tasks in addition to the pure RL objectives [3]. One way to ensure a meaningful latent space is to implement architectures that require a pixel reconstruction loss such as a variational [14, 15] or a deterministic [6] autoencoder. Other approaches combined reward reconstruction with latent prediction [16], pixel reconstruction with planning [17, 18] or used latent predictive losses without pixel reconstruction [5, 7].

Representing controllable features

In representation learning for RL, a focus on controllable features can be beneficial as these features are strongly influenced by the policy [10]. This can be done using generative methods [19], but is most commonly pursued using an auxiliary inverse-prediction loss, i.e., predicting the action that was taken in the MDP [2]. The work in [20, 21] builds a latent representation with an emphasis on the controllable features of an environment with inverse-prediction losses, and uses these features to guide exploratory behavior. Furthermore, [22] and concurrent work by [23] employ multi-step inverse prediction to successfully encompass controllable features in their representation. However, these works do not focus on also retaining the uncontrollable features in their representation, which is a key aspect of our work.

Partitioning a latent representation

Sharing similarity in terms of the separation of the latent representation, [24] disentangle the latent representation in the domain adaptation setting into a task-relevant and a context partition, by means of adversarial predictions with gradient reversals and cyclic reconstruction. [25] use a reconstruction-based adversarial architecture that divides their latent representation into reward-relevant and irrelevant features. Related work by [26] further divides the latent representation of Dreamer [17], using action-conditioned and state-only forward predictors, into controllable, uncontrollable and their respective reward relevant and irrelevant features. As compared to [26], who focus on distraction-efficient RL, we purely focus on the representational learning aspect of these predictors, and show notions of separation in low-dimensional, structured representations of MDPs, leaning towards enhanced interpretability. Furthermore, we use an adversarial loss to enforce disentanglement between $z^{c}$ and $z^{u}$ , and apply a contrastive loss instead of pixel reconstruction to avoid representation collapse due to latent forward prediction.

Interpretable representations in MDPs

More closely related to our research is the work by [10], which connects individual latent dimensions to independently controllable states in a maze using a reconstruction loss and a selectivity loss. The work by [9] visualizes the representation of an agent and its transitions in a maze environment, but does not disentangle the agent state into its controllable and uncontrollable parts, which limits the interpretability analysis and does not allow simplifications during planning. The work by [11] uses an object-oriented approach to isolate different (controllable) features, using graph neural networks (GNNs) and a contrastive forward prediction loss, but does not discriminate between controllable and uncontrollable features. Further work in this direction by [12] focuses on theoretical foundations for an encoder to structurally represent a distinct controllable object. We aim to progress the aforementioned lines of research by using a representation learning architecture that disentangles an MDP’s latent representation into interpretable controllable and uncontrollable features. Finally, we show that having separate partitions of controllable and uncontrollable features can be exploited in a planning algorithm. Exploitations like these are done in combination with prior knowledge of a certain MDP, as in [27].

III Preliminaries

We consider an agent acting within an environment, where the environment is modeled as a discrete Markov Decision Process (MDP) defined as a tuple $(\mathcal{S},\mathcal{A},T,R,\gamma)$ . Here, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $T:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}$ is the environment’s transition function, $R:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{R}$ is the environment’s reward mapping and $\gamma$ is the discount factor. We consider the setting where we have access to a replay buffer ( $\mathcal{B}$ ) of visited states $s_{t}\in\mathcal{S}$ that were followed by actions $a_{t}\in\mathcal{A}$ and resulted in the rewards $r_{t}\in\mathcal{R}$ and the next states $s_{t+1}$ . One entry in $\mathcal{B}$ contains a tuple of past experience $(s_{t},a_{t},r_{t},s_{t+1})$ . The agent’s goal is to learn a policy $\pi:\mathcal{S}\rightarrow\mathcal{A}$ that maximizes the expectation of the discounted return $V^{\pi}(s)=\operatorname{\mathbb{E}}_{\tau}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\mid s_{t}=s]$ , where $\tau$ is a trajectory following the policy $\pi$ .

Furthermore, we examine the setting where a high-dimensional state ( $s_{t}\in\mathbb{R}^{v}$ ) is compressed into a lower-dimensional latent state $z_{t}\in\mathcal{Z}=\mathbb{R}^{w}$ where $\mathcal{Z}$ represents the latent space with $w\leq v$ . This is done by means of a neural network encoding $f:\mathcal{S}\rightarrow\mathcal{Z}$ where $f$ represents the encoder.

IV Algorithm

We aim for an interpretable and disentangled representation of the controllable and uncontrollable latent features. We define controllable features as the characteristics of the MDP that are predominantly affected by any action $a\in\mathcal{A}$ , such as the position of the agent in the context of a maze environment. The uncontrollable features are those attributes that are not or only marginally affected by the actions. We show that the proposed disentanglement is possible by designing losses and gradient propagation through two separate parts of the latent representation. Specifically, to assign controllable information to the controllable latent partition, the gradient from an action-conditioned forward predictor is propagated through it. To assign uncontrollable information to the uncontrollable latent partition, the gradient from a state-only forward predictor is propagated through it. The remaining details will be provided in the rest of this Section.

We consider environments with high-dimensional states, represented as pixel inputs. These pixel inputs are subsequently encoded into a latent representation $z_{t}=(z^{c}_{t},z^{u}_{t})\in\mathcal{Z}=\mathbb{R}^{n_{c}+n_{u}}$ , with the superscripts $c$ and $u$ denoting the controllable and uncontrollable features, and $n_{c}$ and $n_{u}$ denoting their respective dimensions. The compression into a latent representation $\mathcal{S}\rightarrow\mathcal{Z}$ is done by means of a convolutional encoder, parameterized by a set of learnable parameters $\theta_{enc}$ according to:

$$z_{t}=(z^{c}_{t},z^{u}_{t})=f(s_{t};\theta_{enc}). \tag{1}$$

An overview of the proposed algorithm is illustrated in Fig. 2 and the details are provided hereafter. In this section, all losses and transitions are given under the assumption of a continuous abstract representation and a deterministic transition function. The algorithm could be adapted by replacing the losses related to the internal transitions with generative approaches (in the context of continuous and stochastic transitions) or a log-likelihood loss (in the context of stochastic but discrete representations).

IV-A Controllable Features

To isolate controllable features in the latent representation, $z^{c}_{t}$ is used to make an action-conditioned forward prediction in latent space. In the context of a continuous latent space and deterministic transitions, $z^{c}$ is updated using a mean squared error (MSE) forward prediction loss $\mathcal{L}_{c}=\big{|}\hat{z}^{c}_{t+1}-z^{c}_{t+1}\big{|}^{2}$ , where $\hat{z}^{c}_{t+1}$ is the action-conditioned residual forward prediction of the parameterized function $T_{c}(z,a;\theta_{c}):\mathcal{Z}\times\mathcal{A}\rightarrow\mathcal{Z}$ :

$$\hat{z}^{c}_{t+1}=T_{c}(z_{t},a_{t};\theta_{c})+z^{c}_{t} \tag{2}$$

and the prediction target $z^{c}_{t+1}$ is part of the encoder output $f(s_{t+1};\theta_{enc})$ . Note that the full latent state $z_{t}$ is necessary in order to predict $\hat{z}^{c}_{t+1}$ (e.g. the uncontrollable features could represent a wall or other static structure that is necessary for the prediction of the controllable features). Furthermore, the uncontrollable latent partition input $z^{u}_{t}$ is accompanied by a stop gradient to discourage the presence of controllable features in $z^{u}$ . When minimizing $\mathcal{L}_{c}$ , both the encoder ( $\theta_{enc}$ ) as well as the predictor ( $\theta_{c}$ ) are updated, which allows shaping the representation $z^{c}$ as well as learning the internal dynamics.

IV-B Uncontrollable Features

To express uncontrollable features in the latent space, $z^{u}_{t}$ is used to make a state-only (not conditioned on the action $a_{t}$ ) forward prediction in latent space. This enforces uncontrollable features within the uncontrollable latent partition $z^{u}$ , since features that are action-dependent cannot be accurately predicted with the preceding state only. Following a residual prediction, $z^{u}$ is then updated using an MSE forward prediction loss $\mathcal{L}_{u}=\big{|}\hat{z}^{u}_{t+1}-z^{u}_{t+1}\big{|}^{2}$ , with $\hat{z}^{u}_{t+1}$ defined as:

$$\hat{z}^{u}_{t+1}=T_{u}(z^{u}_{t};\theta_{u})+z^{u}_{t} \tag{3}$$

and $T_{u}(z^{u};\theta_{u}):\mathcal{Z}\rightarrow\mathcal{Z}$ representing the parameterized prediction function. The target $z^{u}_{t+1}$ is part of the output of the encoder $f(s_{t+1};\theta_{enc})$ . When minimizing $\mathcal{L}_{u}$ , both $\theta_{enc}$ and $\theta_{u}$ are updated. In this way the loss $\mathcal{L}_{u}$ drives the latent representation $z^{u}$ , which is conditioned on $\theta_{enc}$ according to $(z^{c}_{t},z^{u}_{t})=f(s_{t};\theta_{enc})$ , to only represent the features of $s_{t}$ that are not conditioned on the action $a_{t}$ .
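The argument that action-dependent features cannot be predicted from the preceding state alone can be illustrated with a small numerical sketch. It is not the implementation used in this work: the two toy features, the helper names `mse` and `best_constant_residual`, and the one-dimensional setting are assumptions made purely for illustration. A state-only predictor has to emit one residual for all transitions leaving the same state, so its least-squares optimum is the mean change; an action-conditioned predictor can emit one residual per action.

```python
import random

def mse(preds, targets):
    """Mean squared error between two equally long lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def best_constant_residual(deltas):
    """Least-squares optimal residual when the action is not observed: the mean change."""
    return sum(deltas) / len(deltas)

random.seed(0)
actions = [random.choice([-1, +1]) for _ in range(1000)]

# Hypothetical toy features (assumptions for illustration only):
# the controllable feature moves by the action, the uncontrollable one always drops by 1.
delta_controllable = [float(a) for a in actions]
delta_uncontrollable = [-1.0 for _ in actions]

# State-only predictor: a single residual shared by all transitions from this state.
r_c = best_constant_residual(delta_controllable)
r_u = best_constant_residual(delta_uncontrollable)
loss_state_only_c = mse([r_c] * len(actions), delta_controllable)
loss_state_only_u = mse([r_u] * len(actions), delta_uncontrollable)

# Action-conditioned predictor: one residual per action.
per_action = {
    a: best_constant_residual([d for d, b in zip(delta_controllable, actions) if b == a])
    for a in (-1, +1)
}
loss_action_cond_c = mse([per_action[a] for a in actions], delta_controllable)

print(loss_state_only_c, loss_state_only_u, loss_action_cond_c)
```

In this toy setting the state-only loss stays close to the variance of the action effect for the controllable feature, while it vanishes for the uncontrollable one, which is the asymmetry that $\mathcal{L}_{u}$ relies on.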

IV-C Avoiding Predictive Representation Collapse

Minimizing a forward prediction loss in latent space $\mathcal{Z}$ is prone to collapse [9, 16], because $\mathcal{L}_{c}$ and $\mathcal{L}_{u}$ are trivially minimized when $f(s_{t};\theta_{enc})$ is a constant $\forall\,s_{t}\in\mathcal{S}$ . To avoid representation collapse when using forward predictors, a contrastive loss is used to enforce sufficient diversity in the latent representation:

$$\mathcal{L}_{H_{1}}=\exp\big(-C_{d}\big\|z_{t}-\bar{z}_{t}\big\|_{2}\big) \tag{4}$$

where $C_{d}$ represents a constant hyperparameter and $\bar{z}_{t}$ is a ‘negative’ batch of latent states $z_{t}$ , which is obtained by shifting each position of latent states in the batch by a random number between 0 and the batch size. In the random maze environment, an additional contrastive loss is added to further diversify the controllable representation:

$$\mathcal{L}_{H_{2}}=\exp\big(-C_{d}\big\|z^{c}_{t}-\bar{z}^{c}_{t}\big\|_{2}\big) \tag{5}$$

where $z^{c}_{t}$ is obtained from randomly sampled trajectories. This additional regularizer proved necessary to avoid collapse of $z^{c}$ when moving to a near-infinite number of possible mazes. More information on this subject can be found in Appendix A-D. The resulting contrastive loss $\mathcal{L}_{H}$ for the random maze environment then consists of $0.5\mathcal{L}_{H_{1}}+0.5\mathcal{L}_{H_{2}}$ . The total loss used to update the encoder’s parameters now consists of $\mathcal{L}_{enc}=\mathcal{L}_{c}+\mathcal{L}_{u}+\mathcal{L}_{H}$ .
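As a minimal sketch of the contrastive term in Eq. 4, the snippet below builds the negative batch by rolling the batch by a random offset and compares a collapsed batch with a spread-out one. Several details are assumptions rather than the settings used in this work: the value $C_{d}=5$, the averaging over the batch, and the choice to exclude a zero offset so that no latent state is paired with itself. The function names are hypothetical.

```python
import math
import random

def l2_distance(u, v):
    """Euclidean distance between two latent vectors given as lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(batch, c_d, shift):
    """Batch mean of exp(-C_d * ||z_t - z_bar_t||_2), negatives = batch rolled by `shift`."""
    n = len(batch)
    negatives = [batch[(i + shift) % n] for i in range(n)]
    terms = [math.exp(-c_d * l2_distance(z, zb)) for z, zb in zip(batch, negatives)]
    return sum(terms) / n

random.seed(0)
batch_size = 8
# Assumption: offsets are drawn from 1..batch_size-1 so a state is never its own negative.
shift = random.randint(1, batch_size - 1)

collapsed = [[0.5, 0.5, 0.5] for _ in range(batch_size)]          # constant encoder output
spread = [[float(i), float(-i), 0.0] for i in range(batch_size)]  # well separated latents

loss_collapsed = contrastive_loss(collapsed, c_d=5.0, shift=shift)
loss_spread = contrastive_loss(spread, c_d=5.0, shift=shift)
print(loss_collapsed, loss_spread)
```

The collapsed batch attains the maximum value of the loss, since every distance is zero, whereas the separated batch is close to zero. This is why adding $\mathcal{L}_{H}$ to the forward prediction losses penalizes a constant encoder.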

IV-D Guiding Feature Disentanglement with Adversarial Loss

When using a controllable latent space $z^{c}\in\mathbb{R}^{x},x\in\mathbb{N}$ , where $x>g$ , with $g$ representing the number of dimensions needed to portray the controllable features, some information about the uncontrollable features in the controllable latent representation might be present (see Appendix C-B). This is due to the non-enforcing nature of $\mathcal{L}_{c}$ , as the uncontrollable features are equally predictable with or without the action. To ensure that no information about the uncontrollable features is kept in the controllable latent representation, an adversarial component is added to the architecture in Fig. 2. This is done by updating the encoder with an adversarial loss $\mathcal{L}_{adv}$ and reversing the gradient [28]. The adversarial loss is defined as

$$\mathcal{L}_{adv}=\big|\hat{z}^{u}_{t}-z^{u}_{t}\big|^{2}, \tag{6}$$

with $\hat{z}^{u}_{t}=T_{adv}(z^{c}_{t};\theta_{adv})$ , where $\hat{z}^{u}_{t}$ is the uncontrollable prediction of the parameterized function $T_{adv}(z^{c};\theta_{adv}):\mathcal{Z}\rightarrow\mathcal{Z}$ and $z^{u}_{t}$ is the target. Intuitively, since the parameters of $T_{adv}(z^{c};\theta_{adv})$ are being updated with $\mathcal{L}_{adv}$ and the parameters of $f(s;\theta_{enc})$ are being updated with $-\mathcal{L}_{adv}$ , the prediction function can be seen as the discriminator and the encoder can be seen as the generator [29]. The discriminator tries to give an accurate prediction of the uncontrollable latent $z^{u}$ given the controllable latent $z^{c}$ , while the generator tries to counteract the discriminator by removing any uncontrollable features from the controllable representation. In our case, the predictor is a multi-layer perceptron (MLP), which means that minimizing $\mathcal{L}_{adv}$ enforces that no nonlinear relation between $z^{c}$ and $z^{u}$ can be learned. We hypothesize that this is a deterministic approximation of minimizing the Mutual Information (MI) between $z^{u}$ and $z^{c}$ . When using the adversarial loss, the combined loss propagating through the encoder consists of $\mathcal{L}_{enc}=\mathcal{L}_{c}+\mathcal{L}_{u}+\mathcal{L}_{H}-\mathcal{L}_{adv}$ . Here the minus sign in $-\mathcal{L}_{adv}$ represents a gradient reversal to the encoder. Note that the losses are not scaled, as this did not prove to be necessary for the experiments conducted.
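Using only the symbols defined above, the two opposing updates can be written side by side. This is a restatement for readability under the deterministic setting of this section, not an additional loss term; the notation with two separate minimizations is our shorthand for the gradient steps taken on each parameter set.

```latex
\begin{aligned}
\text{predictor:}\quad & \min_{\theta_{adv}} \; \mathcal{L}_{adv}
   = \big| T_{adv}(z^{c}_{t};\theta_{adv}) - z^{u}_{t} \big|^{2}, \\
\text{encoder:}\quad & \min_{\theta_{enc}} \; \mathcal{L}_{enc}
   = \mathcal{L}_{c} + \mathcal{L}_{u} + \mathcal{L}_{H} - \mathcal{L}_{adv}.
\end{aligned}
```

The predictor is rewarded for recovering $z^{u}$ from $z^{c}$, while the encoder is rewarded when that recovery fails, which is the effect the gradient reversal produces.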

Algorithm 1 Disentangled (Un)Controllable Features

1: Initialize $\theta_{enc}$ , $\theta_{c}$ , $\theta_{u}$ , $\theta_{adv}$
2: for $iteration=1,2,\ldots,N$
3:    Sample batch of tuples $\{s_{t},a_{t},s_{t+1}\}$
4:    Encode observations: $f(s;\theta_{enc})=\{z^{c},z^{u}\}$
5:    Predict $\hat{z}^{c}_{t+1}=T_{c}(z^{c}_{t},z^{u}_{t},a;\theta_{c})+z^{c}_{t}$    // detach $z^{u}_{t}$
6:    Predict $\hat{z}^{u}_{t+1}=T_{u}(z^{u}_{t};\theta_{u})+z^{u}_{t}$
7:    Predict $\hat{z}^{u}_{t}=T_{adv}(z^{c}_{t};\theta_{adv})$
8:    Compute losses $\mathcal{L}_{c},\mathcal{L}_{u},-\mathcal{L}_{adv},\mathcal{L}_{H}$
9:    Update parameters $\theta_{enc}$ , $\theta_{c}$ , $\theta_{u}$ , $\theta_{adv}$
10: end for

IV-E Downstream Tasks

By disentangling a latent representation into a controllable and an uncontrollable part, one can more readily obtain human-interpretable features. While interpretability is generally an important aspect, it is also important to test how a notion of human interpretability affects downstream performance, as it is generally desired to strike a good balance between interpretability and performance. This is examined by training an RL agent on the learned and subsequently frozen latent representation. The action $a_{t}$ is chosen following an $\epsilon$ -greedy policy, where a random action is taken with a probability $\epsilon$ , and with $(1-\epsilon)$ probability the policy $\pi(z)=\underset{a\in\mathcal{A}}{\operatorname*{arg\,max}}\,Q(z,a;\theta)$ is evaluated, where $Q(z,a;\theta)$ is the Q-network trained by Deep Double Q-Learning (DDQN) [30, 31]. The Q-network is trained with respect to a target $Y_{t}$ :

$$Y_{t}=r_{t}+\gamma Q(z_{t+1},\operatorname*{arg\,max}_{a\in\mathcal{A}}Q(z_{t+1},a;\theta);\theta^{-})\,. \tag{7}$$

Here, $\gamma$ represents the environment’s discount factor and $\theta^{-}$ the target Q-network’s parameters. The target Q-network’s parameters are updated as an exponential moving average of the original parameters $\theta$ according to: $\theta^{-}_{k+1}=(1-\tau)\theta^{-}_{k}+\tau\theta_{k}$ , where subscript $k$ represents a training iteration and $\tau$ represents a hyperparameter controlling the speed of the parameter update. The resulting DDQN loss is defined as $\mathcal{L}_{Q}=\big{|}Y_{t}-Q(z_{t},a;\theta)\big{|}^{2}$ . The full computation of all losses is shown in pseudocode in Algorithm 1.
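The three ingredients of this downstream setup ($\epsilon$-greedy action selection, the DDQN target of Eq. 7, and the moving-average target update) are sketched below on plain lists of numbers. The Q-values, $\gamma=0.9$, $\tau=0.1$ and all function names are assumptions chosen for illustration; in practice the Q-values would come from the Q-network evaluated on the frozen latent state.

```python
import random

def argmax_action(q_values):
    """Index of the largest Q-value (ties resolved to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return argmax_action(q_values)

def ddqn_target(reward, gamma, q_online_next, q_target_next):
    """Y_t = r_t + gamma * Q(z_{t+1}, argmax_a Q(z_{t+1}, a; theta); theta^-)."""
    a_star = argmax_action(q_online_next)           # action chosen by the online network
    return reward + gamma * q_target_next[a_star]   # value read from the target network

def ema_update(target_params, online_params, tau):
    """theta^-_{k+1} = (1 - tau) * theta^-_k + tau * theta_k, element-wise."""
    return [(1.0 - tau) * tp + tau * p for tp, p in zip(target_params, online_params)]

# Hypothetical Q-values for the four maze actions at z_{t+1}.
q_online_next = [0.2, 0.9, 0.1, 0.4]
q_target_next = [0.5, 0.3, 0.8, 0.6]

y = ddqn_target(reward=-0.1, gamma=0.9,
                q_online_next=q_online_next, q_target_next=q_target_next)
greedy_action = epsilon_greedy(q_online_next, epsilon=0.0, rng=random.Random(0))
new_target = ema_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
print(y, greedy_action, new_target)
```

Note that the target uses the target network's value for the online network's preferred action (0.3 here), not the target network's own maximum (0.8), which is the decoupling that distinguishes the double estimator from a standard Q-learning target.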

V Experiments

In this section, we showcase the disentanglement of controllable and uncontrollable features on three different environments, the complexity of which is in line with prior work on structured representations [10, 32, 9, 11, 12]: (i) a quadruple maze environment, (ii) the catcher environment and (iii) a randomly generated maze environment. The first environment yields a state space of 119 different observations, and is used to showcase the algorithm’s ability to disentangle a low-dimensional latent representation. The catcher environment examines a setting where the uncontrollable features are not static, and the random maze environment is used to showcase disentanglement in a more complex distribution of over 25 million possible environments, followed by two downstream tasks: reinforcement learning (DDQN) and a latent planning algorithm running in the controllable latent partition. The base of the encoder is derived from [33] and consists of two convolutional layers, followed by a fully connected layer for low-dimensional latent representations or an additional CNN for a higher-dimensional latent representation such as a feature map. For the full network architectures, we refer the reader to Appendix C. In all environments, the encoder $f(s;\theta_{enc})$ is trained from a buffer $\mathcal{B}$ filled with transition tuples $(s_{t},a_{t},r_{t},s_{t+1})$ from random trajectories. Note that, in interpretability, there is generally not a specific metric to optimize for. In order to produce interpretable representations, finding the right hyperparameters required manual (human) inspection of the plotted latent representations. An ablation of the hyperparameters used can be found in Appendices A1-A3.

V-A Quadruple Maze Environment

The maze environment consists of an agent and a selection of four distinct, handpicked wall architectures. The environment’s state is provided as pixel observations $s_{t}\in\mathbb{R}^{48\times 48}$ , where an action moves the agent by 6 pixels in the chosen direction (up, down, left, right), except if this direction is obstructed by a wall. We consider the context where there is no reward ( $r_{t}=0\;\forall\;(s_{t},a_{t})\in(\mathcal{S},\mathcal{A})$ ) and there is no terminal state.

We select a two-dimensional controllable representation ( $z^{c}\in\mathbb{R}^{2}$ ) and a one-dimensional uncontrollable representation ( $z^{u}\in\mathbb{R}^{1}$ ). The remaining hyperparameters and details can be found in Appendix B. The experiments are conducted using a buffer $\mathcal{B}$ filled with random trajectories from the four different basic maze architectures. The encoder’s parameters are updated using $\mathcal{L}_{enc}$ in Section IV-C with $\mathcal{L}_{H}=\mathcal{L}_{H_{1}}$ . After 50k training iterations, a clear disentanglement between the controllable ( $z^{c}$ ) and uncontrollable ( $z^{u}$ ) latent representation can be seen in Fig. 1. One can observe that the encoder is updated so that the one-dimensional latent representation $z^{u}$ learns different values that define the type of wall architecture. A progression to this representation is provided in Appendix C-A.

V-B Catcher Environment

As opposed to the maze environment, the catcher environment encompasses uncontrollable features that are non-stationary. The ball is dropped randomly at the top of the environment and falls irrespective of the actions, while the paddle position is directly modified by the actions. The environment’s states are defined as pixel observations $s_{t}\in\mathbb{R}^{51\times 51}$ . At each time step, the paddle moves left or right by 3 pixels. Since we are only doing unsupervised learning, we consider the context where there is no reward ( $r_{t}=0\;\forall\;(s_{t},a_{t})\in(\mathcal{S},\mathcal{A})$ ) and an episode ends whenever the ball reaches the paddle or the bottom.

We take $z^{c}\in\mathbb{R}^{2}$ and $z^{u}\in\mathbb{R}^{6\times 6}$ . To test disentanglement, $z^{c}$ is of a higher dimension than needed, since the paddle (agent) only moves on the x-axis and would therefore require only one feature (see Appendix C-B for the simpler setting with $z^{c}\in\mathbb{R}^{1}$ ). To show disentanglement, the redundant dimension of $z^{c}$ should carry no, or only negligible, information about $z^{u}$ . The encoder’s parameters are updated using $\mathcal{L}_{enc}$ in Section IV-D with $\mathcal{L}_{H}=\mathcal{L}_{H_{1}}$ . After training the encoder for 200k iterations, a selection of state observations $s_{t}$ and their encoding into the latent representation $z=(z^{c},z^{u})$ can be seen in Fig. 3. A clear distinction between the ball and paddle representations can be observed, with the former residing in $z^{u}$ and the latter in $z^{c}$ .

V-C Random Maze Environment

The random maze environment is similar to the maze environment from Section V-A, but consists of a large distribution of randomly generated mazes with complex wall structures. The environment’s state is provided as pixel observations $s_{t}\in\mathbb{R}^{48\times 48}$ , where an action moves the agent by 6 pixels in the chosen direction. We consider $z^{c}\in\mathbb{R}^{2}$ and $z^{u}\in\mathbb{R}^{6\times 6}$ . This environment tests the generalization properties of a disentangled latent representation, as there are over $25$ million possible maze architectures, corresponding to a probability of less than $4\cdot 10^{-8}$ to sample the same maze twice. Note that because $z^{c}$ is 2-dimensional, results with and without adversarial loss are in practice extremely close. After 50k training iterations, the latent representation $z=(z^{c},z^{u})$ shows an interpretable disentanglement between the controllable and the uncontrollable features (see Fig. 3(a)). A clear distinction between the agent and the wall structure can be found inside $z^{c}$ and $z^{u}$ . Note that instead of using a single dimension to ‘describe’ the uncontrollable features $z^{u}$ (see Fig. 1), using a feature map for $z^{u}$ allows training an encoding that provides a more interpretable representation of the actual wall architecture.
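The quoted probability follows from a one-line calculation, under the assumption (ours; the text only states the count) that mazes are drawn independently and uniformly from $N>2.5\cdot 10^{7}$ architectures. The probability that a second draw coincides with a given first draw is then:

```latex
P(\text{same maze}) \;=\; \frac{1}{N} \;<\; \frac{1}{2.5\cdot 10^{7}} \;=\; 4\cdot 10^{-8}.
```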

Using an Inverse Predictor

An alternative to the state-action forward prediction method used throughout the paper is the inverse (action) prediction loss. An inverse prediction loss is often referred to in previous work that focuses on controllable features [2, 20, 21]. A single-step inverse prediction is defined as:

$$\hat{a}_{t}=I(z^{c}_{t},z^{c}_{t+1},z^{u}_{t};\theta_{inv}). \tag{8}$$

Here, $\hat{a}_{t}$ is the predicted action and $I(z^{c}_{t},z^{c}_{t+1},z^{u}_{t};\theta_{inv}):\mathcal{Z}\rightarrow\mathcal{A}$ is the inverse prediction network. To see whether an inverse predictor can generate structured, controllable representations in the random maze environment, we replace the action-conditioned forward predictor with an inverse predictor, so that $z^{c}$ is no longer updated with $\mathcal{L}_{c}$ but with $\mathcal{L}_{inv}$ (see Appendix A-F for details on $\mathcal{L}_{inv}$ ).

The resulting representation can be seen in Fig. 3(b). It seems that using $\mathcal{L}_{inv}$ causes an absence of interpretable structure in the controllable latent representation $z^{c}_{t}$ . Furthermore, there is a less precise disentanglement between the controllable and uncontrollable features, as differences can be observed in $z^{c}_{t}$ when encoding equal agent positions as pixel states $s_{t}$ . In addition, an inverse predictor does not allow forward prediction in latent space, which can be used for planning as shown hereafter. It thus seems that in some environments, an inverse prediction loss might be insufficient to isolate the controllable features. Take for example the maze agent in the top-right maze of Fig. 4, where the agent can only move in the left direction. Even when using the wall information ( $z^{u}_{t}$ ), an inverse predictor will not be able to predict the action taken when the agent does not go left. However, an action-conditioned forward predictor is able to predict the next state correctly regardless of which action was taken.
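The dead-end example can be reproduced on a toy grid. The sketch below is an illustration under assumptions, not the environment used in the experiments: the grid coordinates, the wall layout and the `step` function are hypothetical, and the agent moves by one cell instead of 6 pixels. It groups the four actions by the transition they produce, which is exactly the information an inverse predictor receives.

```python
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(position, action, walls):
    """Deterministic toy maze transition: move one cell unless the target cell is a wall."""
    dx, dy = ACTIONS[action]
    target = (position[0] + dx, position[1] + dy)
    return position if target in walls else target

# Hypothetical dead end: the agent at (1, 1) can only move left.
walls = {(1, 0), (1, 2), (2, 1)}
position = (1, 1)

# Forward view: every (state, action) pair has exactly one successor.
successors = {a: step(position, a, walls) for a in ACTIONS}

# Inverse view: several actions can explain the same observed transition.
actions_by_successor = {}
for action, nxt in successors.items():
    actions_by_successor.setdefault(nxt, []).append(action)

ambiguous = actions_by_successor[position]  # actions that leave the agent in place
print(successors)
print(ambiguous)
```

Three of the four actions yield the identical transition, so no function of $(z^{c}_{t},z^{c}_{t+1},z^{u}_{t})$ can recover which one was taken, whereas the forward mapping from each action to its successor remains well defined.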

Reinforcement Learning

In order to verify whether a human-interpretable disentangled latent encoding is informative enough for downstream tasks, we formalize the random maze environment into an MDP with rewards. The agent acquires a reward $r_{t}$ of -0.1 at every time step, except when it finds the key in the top right part, in which case it acquires a positive reward of 1. The episode ends whenever a positive reward is obtained or a total of 50 environment steps have been taken. For each new episode, a random wall structure is generated, and the agent starts over in the bottom left section of the maze (see Fig. 5). To see whether an interpretable disentangled latent representation is useful for RL, we compare different scenarios of (pre)training: (i) an encoder pretrained for 50k iterations to attain the representation in Fig. 3(a) and subsequently trained with DDQN for 500k iterations; (ii) an encoder identical to the aforementioned but trained with DDQN and a planning algorithm; (iii) an encoder pretrained for 50k iterations with $\mathcal{L}_{inv}$ instead of $\mathcal{L}_{c}$ and subsequently trained with DDQN for 500k iterations; and (iv) an encoder purely trained with DDQN gradients for 500k iterations. The resulting performances are compared in Fig. 5. We find that a disentangled structured representation is suitable for downstream tasks, as it achieves comparable performance to training an encoder end-to-end with DDQN for 500k iterations. Although performance is similar, Fig. 3(c) shows that an encoder updated solely with the DDQN gradient can lose any form of interpretability. Moreover, we show in Fig. 5 that a representation trained with an inverse prediction loss instead of a state-action forward prediction loss leads to poor downstream performance in the random maze environment.

Planning

As seen in Fig. 3(a), pre-training with the unsupervised losses yields an interpretable disentangled representation with the corresponding agent transitions. Because the controllable and uncontrollable features are disentangled, we can for instance use the prior knowledge that the uncontrollable features in the maze environment are static, and plan in the controllable latent space only (see Fig. 6). The planning algorithm is derived from [34]; it plans only in the controllable partition $z^{c}$ of the latent representation, while the input for $z^{u}$ stays frozen regardless of planning depth. More details on the planning algorithm can be found in Appendix A-E. Even when planning with a relatively small depth of 3, we achieve better performance than both the pre-trained representation with an $\epsilon$-greedy policy and the purely DDQN-updated encoder.

VI Limitations

While the work presented here provides a step towards a better understanding of disentangling controllable and uncontrollable features within an encoder architecture, there remain some limitations that we must acknowledge, and which can provide a basis for future research.

First, our method’s effectiveness was predominantly demonstrated on environments with relatively simple underlying dynamics. In these environments, disentanglement was easier to achieve due to the limited complexity of the internal dynamics. When transferring our approach to more complex environments with more extensive internal dynamics, two problems can arise. The first is that the separation of controllable from uncontrollable features may not be as clear-cut in more complex MDPs and may instead lie on a spectrum, which blurs the fundamental difference between a state-only and a state-action forward predictor. The second is that interpretability will be harder to enforce when there is a large number of underlying factors of variation. As distinct seeds can give different orderings and signs of the neurons in the final layer of the encoder, identifying a factor of variation can become exponentially harder for more complex environments.

Lastly, while our work showed that an action-conditioned forward predictor can be preferable to an inverse predictor for isolating controllable features in some environments, this may not hold for all scenarios. The inherent properties of different environments might necessitate different predictors. Consequently, there could very well be MDPs where our current approach does not provide the same level of disentanglement shown in the MDPs used in this paper.

Despite these limitations, we believe our work provides a strong foundation upon which future research can build and further extend the possibilities of achieving a highly interpretable latent representation through disentanglement of controllable and uncontrollable features.

VII Conclusion and Future Work

We have shown the possibility of disentangling controllable and uncontrollable features in an encoder architecture, strongly increasing the interpretability of the latent representation while also showing the potential use of this for downstream learning and planning, even in a single latent partition. This disentanglement of controllable and uncontrollable features in the latent representation of high-dimensional MDPs was achieved by propagating an action-conditioned forward prediction loss and a state-only forward prediction loss through distinct sections of the latent representation. Additionally, a contrastive loss and an adversarial loss were used to respectively avoid collapse and further disentangle the latent representation. Furthermore, we showed that an action-conditioned forward predictor can, in some environments, be preferred as compared to an inverse predictor in terms of isolating controllable features in the representation. Finally, by employing forward prediction in latent space, we were able to successfully run a planning algorithm while leveraging the properties of the environment. In particular, the disentanglement of controllable and uncontrollable features allowed us to keep $z^{u}$ frozen regardless of planning depth in the context of a distribution of randomly generated mazes, i.e. we only do forward prediction in $z^{c}$ .

Future work could focus on gradually transferring our notion of disentanglement and interpretability to environments with more extensive underlying internal dynamics. Further work could also look at the ordering of the latent dimensions, as a latent representation is often arbitrarily ordered. This means that distinct seeds will lead to a different ordering and sign of the neurons in the final layer of the encoder. For example, if seed one would give agent position +x and +y for neurons 1 and 2 respectively, then seed two could give agent position -y and +x to the same neurons. As we are additionally using a contrastive loss while learning our representation, these results are compliant with the theory that a contrastive loss can recover the original latent information up to an orthogonal linear transformation [35].
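As a worked illustration of the two-seed example above, write the agent position as $(x,y)$ and the two learned representations as $z^{(1)}$ and $z^{(2)}$ (notation introduced only for this illustration). The two encodings are then related by a signed permutation, which is a special case of an orthogonal linear transformation:

```latex
z^{(1)}=\begin{pmatrix}x\\ y\end{pmatrix},\qquad
z^{(2)}=\begin{pmatrix}-y\\ x\end{pmatrix}=R\,z^{(1)},\qquad
R=\begin{pmatrix}0&-1\\ 1&0\end{pmatrix},\qquad
R^{\top}R=\begin{pmatrix}0&1\\ -1&0\end{pmatrix}\begin{pmatrix}0&-1\\ 1&0\end{pmatrix}=\begin{pmatrix}1&0\\ 0&1\end{pmatrix}
```

Since $R^{\top}R=I$, distances between latent points are identical under both seeds; only the assignment of factors to neurons and their signs differ.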

Certain benefits can be obtained as well with a particular design of the encoder architecture, as we have done in this paper using estimates of the necessary dimensions of $z^{c}$ and $z^{u}$ for the different MDP environments. This can be seen as an inductive bias to aid disentanglement, as mentioned by [36]. Succeeding work could also focus on finding more algorithmic benefits of this disentanglement of controllable/uncontrollable features in more complex environments. For example, in the context of safety, a disentangled interpretable representation could allow incorporating latent state constraints in a planning algorithm. Lastly, as discussed by [13, 36], an interesting avenue could be to further investigate the trade-off between interpretability and downstream performance. Black-box representations such as the one in Fig. 3(c) still seem to have excellent downstream performance with DDQN, whereas for the task of maze navigation a human would perform substantially better using the representation portrayed in Fig. 3(a) than using the representation in Fig. 3(c).

References

[1] R. Bellman, “A markovian decision process,” in Journal of Mathematics and Mechanics, vol. 6, no. 5, 1957.
[2] R. Jonschkowski and O. Brock, “Learning state representations with robotic priors,” Autonomous Robots, vol. 39, no. 3, 2015.
[3] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement Learning with Unsupervised Auxiliary Tasks,” in International Conference on Learning Representations, ICLR, 2017.
[4] M. Laskin, A. Srinivas, and P. Abbeel, “CURL: Contrastive unsupervised representations for reinforcement learning,” in 37th International Conference on Machine Learning, ICML, 2020.
[5] K. H. Lee, I. Fischer, A. Z. Liu, Y. Guo, H. Lee, J. Canny, and S. Guadarrama, “Predictive information accelerates learning in RL,” in Advances in Neural Information Processing Systems, NIPS, 2020.
[6] D. Yarats, A. Zhang, I. Kostrikov, B. Amos, J. Pineau, and R. Fergus, “Improving Sample Efficiency in Model-Free Reinforcement Learning from Images,” in Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, 2021.
[7] M. Schwarzer, A. Anand, R. Goel, R. D. Hjelm, A. Courville, and P. Bachman, “Data-Efficient Reinforcement Learning with Self-Predictive Representations,” in International Conference on Learning Representations, ICLR, 2021.
[8] I. Kostrikov, D. Yarats, and R. Fergus, “Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels,” in International Conference on Learning Representations, ICLR, 2021.
[9] V. Francois-Lavet, Y. Bengio, D. Precup, and J. Pineau, “Combined Reinforcement Learning via Abstract Representations,” in Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, 2019.
[10] V. Thomas, J. Pondard, E. Bengio, M. Sarfati, P. Beaudoin, M.-J. Meurs, J. Pineau, D. Precup, and Y. Bengio, “Independently Controllable Factors,” arXiv preprint arXiv:1708.01289, 2017.
[11] T. Kipf, E. van der Pol, and M. Welling, “Contrastive Learning of Structured World Models,” in International Conference on Learning Representations, ICLR, 2020.
[12] K. Ahuja, J. Hartford, and Y. Bengio, “Weakly supervised representation learning with sparse perturbations,” in Advances in Neural Information Processing Systems, NIPS, 2022.
[13] C. Glanois, P. Weng, M. Zimmer, D. Li, T. Yang, J. Hao, and W. Liu, “A survey on interpretable reinforcement learning,” arXiv preprint arXiv:2112.13112, 2021.
[14] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in International Conference on Learning Representations, ICLR, 2014.
[15] I. Higgins, A. Pal, A. A. Rusu, L. Matthey, C. P. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, “DARLA: Improving Zero-Shot Transfer in Reinforcement Learning,” in International Conference on Machine Learning, ICML, 2017.
[16] C. Gelada, S. Kumar, J. Buckman, O. Nachum, and M. G. Bellemare, “DeepMDP: Learning continuous latent space models for representation learning,” in International Conference on Machine Learning, ICML, 2019.
[17] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson, “Learning latent dynamics for planning from pixels,” in International Conference on Machine Learning, ICML, 2019.
[18] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba, “Mastering Atari with Discrete World Models,” in International Conference on Learning Representations, ICLR, 2021.
[19] A. Laversanne-Finot, A. Pere, and P.-Y. Oudeyer, “Curiosity driven exploration of learned disentangled goal spaces,” in Proceedings of The 2nd Conference on Robot Learning. PMLR, 2018.
[20] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven Exploration by Self-supervised Prediction,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW, 2017.
[21] A. P. Badia, P. Sprechmann, A. Vitvitskyi, D. Guo, B. Piot, S. Kapturowski, O. Tieleman, M. Arjovsky, A. Pritzel, A. Bolt, and C. Blundell, “Never Give Up: Learning Directed Exploration Strategies,” in International Conference on Learning Representations, ICLR, 2020.
[22] Y. Efroni, D. Misra, A. Krishnamurthy, A. Agarwal, and J. Langford, “Provable RL with exogenous distractors via multistep inverse dynamics,” in International Conference on Machine Learning, ICML, 2021.
[23] A. Lamb, R. Islam, Y. Efroni, A. Didolkar, D. Misra, D. Foster, L. Molu, R. Chari, A. Krishnamurthy, and J. Langford, “Guaranteed discovery of controllable latent states with multi-step inverse models,” arXiv preprint arXiv:2207.08229, 2022.
[24] D. Bertoin and E. Rachelson, “Disentanglement by cyclic reconstruction,” in IEEE Transactions on Neural Networks and Learning Systems, 2022.
[25] X. Fu, G. Yang, P. Agrawal, and T. Jaakkola, “Learning task informed abstractions,” in International Conference on Machine Learning, ICML, 2021.
[26] T. Wang, S. Du, A. Torralba, P. Isola, A. Zhang, and Y. Tian, “Denoised MDPs: Learning world models better than the world itself,” in Proceedings of the 39th International Conference on Machine Learning, PMLR, 2022.
[27] E. van der Pol, D. Worrall, H. van Hoof, F. Oliehoek, and M. Welling, “Mdp homomorphic networks: Group symmetries in reinforcement learning,” in Advances in Neural Information Processing Systems, NIPS, 2020.
[28] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-Adversarial Training of Neural Networks,” in Journal of Machine Learning Research, vol. 17, 2016.
[29] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” in Advances in Neural Information Processing Systems, NIPS, 2014.
[30] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, 2015.
[31] H. van Hasselt, A. Guez, and D. Silver, “Deep Reinforcement Learning with Double Q-learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, 2016.
[32] I. Higgins, D. Amos, D. Pfau, S. Racanière, L. Matthey, D. J. Rezende, and A. Lerchner, “Towards a definition of disentangled representations,” arXiv preprint arXiv:1812.02230, 2018.
[33] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller, “DeepMind Control Suite,” arXiv preprint arXiv:1801.00690, 2018.
[34] J. Oh, S. Singh, and H. Lee, “Value Prediction Network,” in Advances in Neural Information Processing Systems, NIPS, 2017.
[35] R. S. Zimmermann, Y. Sharma, S. Schneider, M. Bethge, and W. Brendel, “Contrastive learning inverts the data generating process,” in Proceedings of the 38th International Conference on Machine Learning, PMLR, vol. 139, 2021, pp. 12979–12990.
[36] F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem, “Challenging common assumptions in the unsupervised learning of disentangled representations,” in Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019.
[37] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in International Conference on Learning Representations, ICLR, 2015.

Appendix A Additional Material

A-A Ablation of the contrastive scalar

Without a pixel reconstruction loss, the contrastive loss $\mathcal{L}_{H}$ is crucial in avoiding the trivial solution for any latent forward predictor [9, 16]. The contrastive scalar $C_{d}$ that regulates $\mathcal{L}_{H}$, however, remains the most influential hyperparameter. When $C_{d}$ is chosen too high, the representation remains in a compact cluster. On the other hand, when $C_{d}$ is chosen too low, unnecessary inter-sample distances are formed to enforce large individual latent distances. Two ablations of the contrastive scalar $C_{d}$ are shown in Fig. 7.

A-B Ablation of learning rates

We show experiments in Fig. 8 and Fig. 9 where we employ different learning rates for the encoder and the action-conditioned forward predictor, respectively.

A-C Ablation of the detachment of $z^{u}$ and ablation of the residual prediction

As seen in Figure 2 of the main paper, we detach the uncontrollable representation $z^{u}$ from $\mathcal{L}_{c}$, as we do not want controllable features to be present in $z^{u}$. We can see in Figure 10 that updating $z^{u}$ with $\mathcal{L}_{c}$ leads to slightly better transition predictions in $z^{c}$, but also results in a less interpretable encoding of $z^{u}$. Furthermore, we can also see in Figure 10 that, when using normal forward predictions instead of residual forward predictions, we lose almost all of our interpretable structure in $z^{u}$.

A-D Ablation of the entropy loss $\mathcal{L}_{H2}$

As the number of possible encoded maze architectures goes to infinity due to the procedural generation, a collapse in the controllable features $z^{c}$ can be noticed when using only $\mathcal{L}_{H1}$ as the contrastive loss (see Fig. 11). On the other hand, when using only $\mathcal{L}_{H2}$ as the contrastive loss, there is no longer a clear distinction in the uncontrollable representation $z^{u}$. The best results were obtained using a combination of the aforementioned losses.

A-E Planning

We use a planning algorithm derived from [34, 9], where we employ d-step planning as:

\hat{Q}^{d}((\hat{z}^{c}_{t},z^{u}),a)=\begin{cases}P((\hat{z}^{c}_{t},z^{u}),a;\theta_{r})+\Gamma((\hat{z}^{c}_{t},z^{u}),a;\theta_{\gamma})\,\underset{a^{\prime}\in\mathcal{A}^{*}}{\operatorname{max}}\ \hat{Q}^{d-1}((\hat{z}^{c}_{t+1},z^{u}),a^{\prime}),&\text{if }d>0\\ Q((\hat{z}^{c}_{t},z^{u}),a;\theta),&\text{if }d=0\end{cases}

(9)

Q_{plan}^{D}((\hat{z}^{c}_{t},z^{u}),a)=\sum_{d=0}^{D}\hat{Q}^{d}((\hat{z}^{c}_{t},z^{u}),a)

(10)

where $P(z_{t},a;\theta_{r}):\mathcal{Z}\times\mathcal{A}\rightarrow\mathcal{R}$ represents the reward predictor and $\Gamma(z_{t},a;\theta_{\gamma}):\mathcal{Z}\times\mathcal{A}\rightarrow\gamma$ represents the discount value predictor, both taking the latent state $z_{t}=(\hat{z}^{c}_{t},z^{u})$ as input. The action is chosen by taking the argmax of $Q_{plan}^{D}((\hat{z}^{c}_{t},z^{u}),a)$ over the actions $a$. Note that in the results from Section 5.3, we are only forward predicting in the controllable latent space $z^{c}$, and $z^{u}$ remains a fixed value regardless of planning depth. This is possible by making use of the prior knowledge of the maze environments together with a disentangled controllable and uncontrollable latent representation.
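The recursion of Eq. (9) and the sum of Eq. (10) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used for the experiments: the `Models` container and the toy corridor are hypothetical, the learned networks are replaced by plain callables, and the maximization runs over the full action set rather than a pruned set $\mathcal{A}^{*}$.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Latent = Tuple[float, ...]
ValueFn = Callable[[Latent, Latent, int], float]


@dataclass
class Models:
    """Hypothetical stand-ins for the learned networks."""
    reward: ValueFn     # P(., a; theta_r)
    discount: ValueFn   # Gamma(., a; theta_gamma)
    q_value: ValueFn    # Q(., a; theta)
    transition: Callable[[Latent, Latent, int], Latent]  # predicts next z^c
    actions: Sequence[int]  # assumption: full action set, not a pruned A*


def q_hat(d: int, z_c: Latent, z_u: Latent, a: int, m: Models) -> float:
    """d-step estimate of Eq. (9); z_u is reused unchanged at every depth."""
    if d == 0:
        return m.q_value(z_c, z_u, a)
    z_c_next = m.transition(z_c, z_u, a)  # forward prediction in z^c only
    best_next = max(q_hat(d - 1, z_c_next, z_u, a2, m) for a2 in m.actions)
    return m.reward(z_c, z_u, a) + m.discount(z_c, z_u, a) * best_next


def q_plan(depth: int, z_c: Latent, z_u: Latent, a: int, m: Models) -> float:
    """Sum of the d-step estimates for d = 0..depth, as in Eq. (10)."""
    return sum(q_hat(d, z_c, z_u, a, m) for d in range(depth + 1))


def plan_action(depth: int, z_c: Latent, z_u: Latent, m: Models) -> int:
    """Greedy action with respect to the planning estimate."""
    return max(m.actions, key=lambda a: q_plan(depth, z_c, z_u, a, m))


# Toy 1-D corridor (hypothetical): action 1 moves right, action 0 moves
# left, and reaching position 3 yields the positive reward.
def _move(z_c: Latent, z_u: Latent, a: int) -> Latent:
    return (z_c[0] + (1.0 if a == 1 else -1.0),)


toy = Models(
    reward=lambda z_c, z_u, a: 1.0 if _move(z_c, z_u, a)[0] == 3.0 else -0.1,
    discount=lambda z_c, z_u, a: 0.9,
    q_value=lambda z_c, z_u, a: 0.0,  # stand-in for an untrained Q-network
    transition=_move,
    actions=(0, 1),
)

best = plan_action(2, (1.0,), (0.0,), toy)  # plan two steps ahead from position 1
```

Because `z_u` is passed through every recursive call unchanged, the cost of planning depends only on the forward model for $z^{c}$, which mirrors the frozen $z^{u}$ described above.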

A-F Inverse Prediction

A common single-step inverse prediction is defined as:

\hat{a}_{t}=f(s_{t},s_{t+1})

(11)

where $\hat{a}_{t}$ is the predicted action and $f(s_{t},s_{t+1})$ represents an arbitrarily structured function. In the random maze environment, we use a parameterized inverse predictor which predicts in latent space:

\hat{a}_{t}=I(z^{c}_{t},z^{c}_{t+1},z^{u}_{t},z^{u}_{t+1};\theta_{inv})

(12)

where $I(\cdot;\theta_{inv})\in\mathcal{I}:\mathcal{Z}\rightarrow\mathcal{A}$ is a parameterized inverse prediction function. Since we have 4 actions, we use the 4-dimensional logit output $\hat{a}_{t}$ to calculate the inverse prediction loss $\mathcal{L}_{inv}$ as:

S(\hat{a}_{i})=\frac{\exp(\hat{a}_{i})}{\sum_{j=1}^{n_{a}}\exp(\hat{a}_{j})},\quad\mathcal{L}_{inv}=-\sum_{i=1}^{n_{a}}a_{i}\log(S(\hat{a}_{i}))

(13)

Here, $n_{a}$ is the number of actions, $S(\hat{a}_{i})$ represents the softmax operator and $a_{i}$ is the $i$-th entry of the one-hot encoding of the action actually taken (1 for that action, 0 otherwise). This is more commonly known as the cross-entropy loss.
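For a single transition, Eq. (13) can be computed as in the following sketch. The function names are hypothetical, and the code uses plain Python lists instead of tensors to stay self-contained. With a one-hot label, the sum in Eq. (13) reduces to the negative log-probability of the action that was taken.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Softmax S(a_hat_i) of Eq. (13)."""
    m = max(logits)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_prediction_loss(logits: Sequence[float], action_index: int) -> float:
    """Cross-entropy of Eq. (13) for a one-hot action label.

    `action_index` is the index of the action actually taken.
    """
    probs = softmax(logits)
    return -math.log(probs[action_index])


# With 4 actions and uninformative (equal) logits, the loss is log(4).
uniform_loss = inverse_prediction_loss([0.0, 0.0, 0.0, 0.0], 2)
```

The uniform case gives a useful reference value: an inverse predictor that cannot distinguish between the 4 actions stays near $\log 4$.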

A-G Reconstruction

We run an additional ablation on the four mazes environment, where the contrastive loss $\mathcal{L}_{H}$ is replaced with a pixel reconstruction loss. The resulting representation comparison can be seen in Fig. 12.

A-H T-SNE

We conduct an additional experiment in the random maze environment where we use a latent dimension of 32, partition it in half to form $z^{c}\in\mathbb{R}^{16}$ and $z^{u}\in\mathbb{R}^{16}$, and show a t-SNE visualization of 6 different trajectories in random mazes in Fig. 13. Note that, because the trajectories are random, only a subset of the possible agent positions in every random maze is present.

Appendix B Experiment details

The PyTorch framework was used for all experiments, as well as the Adam optimizer [37]. We employ a batch size of 32 tuples $(s_{t},a_{t},r_{t},s_{t+1})$ for every update. In all experiments, we detach $z^{c}_{t}$ in the calculation of $\mathcal{L}_{c}$, as it allowed us to use a larger learning rate for $T_{c}$ without causing instabilities.

Simple Maze

The replay buffer $\mathcal{B}$ is filled with 5k transitions from each of the four wall architectures. The transitions are collected by the agent following a random policy. The learning rate for the encoder is $5\cdot 10^{-5}$ , for the action-conditioned forward predictor $1\cdot 10^{-3}$ and for the uncontrollable forward predictor $5\cdot 10^{-5}$ . The contrastive scalar $C_{d}$ is set to 15.

Catcher

The replay buffer $\mathcal{B}$ is filled with 25k transitions. The transitions are collected by the agent following a random policy. A new random maze is created after 50 time steps or when the reward is acquired. The learning rate for the encoder is $2\cdot 10^{-5}$ , for the action-conditioned forward predictor $4\cdot 10^{-5}$ and for the uncontrollable forward predictor $1\cdot 10^{-5}$ . When using the adversarial loss, we use a learning rate of $1\cdot 10^{-3}$ for the adversarial predictor. The contrastive scalar $C_{d}$ is set to 5.

Random Maze

The replay buffer $\mathcal{B}$ is filled with 50k transitions, representing around 1000 maze architectures. The transitions are collected by the agent following a random policy. The learning rates used are equal to those of the catcher environment; for the encoder $2\cdot 10^{-5}$ , for the action-conditioned forward predictor $4\cdot 10^{-5}$ and for the uncontrollable forward predictor $1\cdot 10^{-5}$ . After freezing the encoder, we train the action-conditioned forward predictor for an additional 250k iterations on the same 50k transitions in the buffer $\mathcal{B}$ . For updating the Q-network with DDQN, we use a learning rate of $1\cdot 10^{-4}$ , and a $\tau$ of 0.02. The contrastive scalar $C_{d}$ is set to 13. When using planning, we employ a learning rate of $5\cdot 10^{-5}$ for the reward and discount prediction networks.
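For reference, the Double DQN target of [31] and a soft target-network update can be sketched as below. Two assumptions are made here that the text does not state: that $\tau$ denotes the soft (Polyak) target-update coefficient, and that parameters can be treated as flat lists of floats. The function names and the discount argument are hypothetical.

```python
from typing import List, Sequence


def ddqn_target(reward: float, discount: float, done: bool,
                q_online_next: Sequence[float],
                q_target_next: Sequence[float]) -> float:
    """Double DQN target: the online network selects the next action,
    the target network evaluates it."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=lambda i: q_online_next[i])
    return reward + discount * q_target_next[best]


def soft_update(target_params: Sequence[float],
                online_params: Sequence[float],
                tau: float = 0.02) -> List[float]:
    """Move each target parameter a fraction tau toward the online one.

    Assumption: tau = 0.02 is used as a soft-update coefficient.
    """
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

Decoupling action selection from action evaluation is what distinguishes the Double DQN target from the standard DQN target, which takes the maximum of the target network's values directly.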

Contrastive Loss

For the catcher and random maze environment, given that $z^{c}$ is 1 or 2-dimensional, and $z^{u}$ is a 36-dimensional feature map, we alleviate dimensional mismatch when calculating the contrastive loss in Equation 4 in the main paper. This is done by taking a random subset of 15 out of 36 feature values in $z^{u}$ for every batch.
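The subsampling step can be sketched as follows. This is a minimal illustration under two assumptions not stated in the text: $z^{u}$ is flattened to a list of 36 values, and one index subset is drawn per batch and shared by all samples in it. The function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def sample_feature_indices(n_features: int = 36, n_keep: int = 15,
                           rng: Optional[random.Random] = None) -> List[int]:
    """Draw n_keep distinct feature indices, without replacement."""
    if rng is None:
        rng = random.Random()
    return sorted(rng.sample(range(n_features), n_keep))


def subsample_batch(z_u_batch: Sequence[Sequence[float]],
                    indices: Sequence[int]) -> List[List[float]]:
    """Keep the same index subset for every sample in the batch."""
    return [[z_u[i] for i in indices] for z_u in z_u_batch]


# Example: a batch of 32 flattened z^u vectors with 36 values each.
rng = random.Random(0)
batch = [[float(j) for j in range(36)] for _ in range(32)]
indices = sample_feature_indices(rng=rng)
reduced = subsample_batch(batch, indices)
```

Drawing a fresh subset for every batch means that all 36 feature values still contribute to the contrastive loss over the course of training.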

Appendix C Network Architecture

We use the same base encoder for all experiments, made up of 2 convolutional layers of 32 channels each, with a kernel size of 3 and stride 2, except for the final layer which has stride 1. Both convolutional layers have a Rectified Linear Unit (ReLU) nonlinear activation.

In the quadruple maze environment, the output of the base convolutional encoder is flattened and used as an input to a single linear layer with 3 outputs ( $z^{c}+z^{u}$ ) and a hyperbolic tangent (tanh) activation function.

In the catcher and random maze environments, we use the following encoder head to extract the uncontrollable features: the base convolutional layers are followed by a single convolutional layer with 32 channels, a kernel size of 4 and a stride of 1. This layer is followed by a ReLU activation function and an AveragePool layer with an output size of 6. For the controllable features, we flatten the output of the base convolutional encoder and use this as an input to a linear layer with 200 neurons and a tanh activation function. This layer is followed by another linear layer with $n_{c}$ neurons and a tanh activation function.

The transition and prediction models all have the same structure, with linear layers of 32-128-128-32-x neurons where x is the output dimension in line with the predicted feature’s dimension. The linear layers all have tanh activation functions except for the final output. Only the action-conditioned transition predictor of the random maze environment has larger layer sizes, with linear layers of 128-512-512-128-2, to account for slightly more complicated transitions. The DQN network used is of size 128-512-512-128-4, with an output value corresponding to each possible action.