In-Context Decision Transformer: Reinforcement Learning via Hierarchical Chain-of-Thought

Sili Huang    Jifeng Hu    Hechang Chen    Lichao Sun    Bo Yang
Abstract

In-context learning is a promising approach for offline reinforcement learning (RL) to handle online tasks, which can be achieved by providing task prompts. Recent works have demonstrated that in-context RL can emerge with self-improvement in a trial-and-error manner when RL tasks are treated as an across-episodic sequential prediction problem. Although this self-improvement requires no gradient updates, current works still suffer from high computational costs because the across-episodic sequence grows with the task horizon. To this end, we propose an In-context Decision Transformer (IDT) to achieve self-improvement in a high-level trial-and-error manner. Specifically, IDT is inspired by the efficient hierarchical structure of human decision-making and thus reconstructs the sequence to consist of high-level decisions instead of low-level actions that interact with environments. As one high-level decision can guide multi-step low-level actions, IDT naturally avoids excessively long sequences and solves online tasks more efficiently. Experimental results show that IDT achieves state-of-the-art results in long-horizon tasks compared with current in-context RL methods. In particular, the online evaluation of IDT is 36× faster than baselines in the D4RL benchmark and 27× faster in the Grid World benchmark.

Machine Learning, ICML

1 Introduction

Large transformer models (Vaswani et al., 2017) have shown impressive abilities across a variety of domains, including text (Brown et al., 2020b), image (Dosovitskiy et al., 2020), and audio (Alayrac et al., 2022). In the field of reinforcement learning (RL), large transformer models can treat RL tasks as a type of sequential prediction problem, which has proven successful using solely offline training (Lee et al., 2022; Reed et al., 2022). A notable shortcoming of these methods is their inability to self-improve when deployed in online environments. To overcome this, in-context RL methods have been introduced, which enable continued policy improvement (Laskin et al., 2023).

Recent works demonstrated that in-context RL can automatically improve its performance in a trial-and-error manner when across-episodic contexts serve as prompt conditions (Lee et al., 2023a). The construction of the across-episodic context is flexible and easy to implement, such as a chain of experience that consists of multiple historical trajectories arranged in ascending order of returns (Hao Liu, 2023). Despite the progress made, current methods are mostly limited to short-horizon tasks with fewer than 100 timesteps (Laskin et al., 2023). This limitation arises from (1) the quadratic complexity of the self-attention mechanism and (2) the significant increase in sequence length caused by across-episodic contexts. Such high computational costs severely limit the application of in-context trial-and-error to traditional RL tasks, which often reach 1000 timesteps, such as MuJoCo (Fu et al., 2020) and Atari (Bellemare et al., 2013).

In fact, trial-and-error is the central idea of modern RL algorithms (Sutton & Barto, 2018). It describes an animal behavior first characterized by the psychologist Thorndike (1927), who considers trial-and-error an elementary way of combining search and memory. Correspondingly, the across-episodic contexts provide memory, and the self-attention mechanism reviews historical actions in the memory to search for better actions. However, human decision-making is more complex and operates on multiple levels of temporal abstraction (Sutton et al., 1999b). For example, travelers tend to decide on their budget first, then their mode of transportation, and so on down to the smallest action. Inspired by this idea, a natural question emerges:

“Can human multi-level decision-making bring out a more efficient trial-and-error?”

Figure 1: Trial-and-error comparison of minimal actions and high-level decisions, where * denotes better results. (a) In the trial-and-error process, the memory consists of the smallest actions from experiences and serves as context to search for better action. (b) In the high-level trial-and-error process, the memory and search act on high-level decisions. Since one high-level decision controls multiple actions, we can use smaller memory to preserve experiences and search for better decisions with less computational costs.

As one high-level decision can guide multi-step low-level actions, it can considerably shorten across-episodic contexts and, therefore, significantly reduce the computational costs. In this view, we aim to explore a high-level trial-and-error process, as shown in Figure 1. However, since the model generates high-level decisions rather than low-level actions that interact with the environment, our first challenge is ensuring that better high-level decisions lead to better low-level actions. In addition, in-context RL is trained with a supervised loss that predicts each next step conditioned on the past steps in the sequence. Unlike low-level actions, the high-level decision is an abstract concept that is usually not directly observable from training data, which poses our second challenge.

To this end, we propose an efficient in-context RL method called In-context Decision Transformer (IDT). Specifically, IDT consists of Making Decisions, Decisions to Go, and Reviewing Decisions modules to mimic the high-level trial-and-error process. First, the Making Decisions module is a decoder-only transformer that generates high-level decisions autoregressively, where the high-level decision is represented by a vector sampled from a multivariate Gaussian distribution. Then, the generated high-level decisions are fed to the Decisions to Go module, which is also a decoder-only transformer to generate low-level actions autoregressively. The output of the Making Decisions module serves as a conditional input to the Decisions to Go module, ensuring high-level decisions correctly guide low-level actions. To fill in the missing high-level decisions in the training data, we designed the Reviewing Decisions module to encode high-level decisions from sequences of low-level actions. All three modules are learned end-to-end by predicting the low-level actions from the training data.

Our contributions are as follows: (1) We propose IDT, an in-context RL method that emerges with high-level trial-and-error ability. IDT can learn directly from combined sub-optimal data and efficiently improve itself through multiple trials at test time. (2) Compared to contexts consisting of the smallest actions, IDT makes evaluation 36× faster on the D4RL benchmark (Fu et al., 2020) and 27× faster on the Large Grid World environments (Lee et al., 2022). (3) IDT achieves state-of-the-art results with lower training costs and is especially strong in long-horizon tasks.

2 Related Work

Transformer for Decision-Making.   In general, reinforcement learning was proposed as a fundamentally online paradigm (Sutton et al., 1999a). Online learning is limiting in applications where it is impossible to gather data and learn simultaneously, such as autonomous driving. To this end, offline RL proposes that the agent learn from a fixed dataset of previously collected data without gathering new data during learning (Fujimoto et al., 2019; Kumar et al., 2020; Yu et al., 2021; Kumar et al., 2019). In the context of offline RL, recent works explored transformer-based policies by treating RL tasks as a type of sequential prediction problem. Among them, the Decision Transformer (Chen et al., 2021) models trajectories as sequences and autoregressively predicts actions conditioned on the desired return-to-go, past states, and past actions. The Trajectory Transformer (Janner et al., 2021) demonstrated that transformers can learn single-task policies from offline data. Subsequently, the Multi-Game Decision Transformer (Lee et al., 2022) and Gato (Reed et al., 2022) further showed that transformer-based policies can address multiple tasks in the same domain as well as cross-domain tasks. However, these works focus on distilling expert policies from offline data and do not enable self-improvement as IDT does. When the offline data are sub-optimal, or the agent is required to adapt to new tasks, the Multi-Game Decision Transformer needs to fine-tune the model parameters, while Gato must be prompted with expert demonstrations.

Meta RL.   IDT falls into the category of learning-to-learn methods, also known as meta-learning. More precisely, recent in-context RL methods can be categorized as in-context meta-RL methods. The general idea of learning self-improvement has a long history in RL but was limited to hyper-parameters in the early stages (Ishii et al., 2002). In-context meta-RL methods (Wang et al., 2016; Duan et al., 2016) are commonly trained in the online setting by maximizing multi-episodic value functions with memory-based architectures through environment interactions. Another line of online meta-RL attempts to find good network parameter initializations and then adapts quickly through additional gradient updates (Finn et al., 2017; Nichol et al., 2018). More recently, meta-RL has seen substantial breakthroughs, from performance gains on popular benchmarks to offline settings, such as Bayesian RL (Dorfman et al., 2021) and optimization-based meta-RL (Mitchell et al., 2021). Considering the difficulty of a completely offline setting, recent work has explored hybrid offline-online settings (Zahavy et al., 2020; Pong et al., 2022). IDT is similar to the hybrid offline-online setting, but its online phase does not involve gradient updates.

In-Context RL.   In-context RL addresses tasks by providing prompts or demonstrations (Chen et al., 2021; Janner et al., 2021). When trained at a large scale, transformer-based policies usually have the ability to learn in context (Lee et al., 2022; Reed et al., 2022). The learning process is performed entirely in context and does not involve parameter updates of neural networks. In this work, we consider incremental in-context RL, which involves learning from one's own behaviors in a trial-and-error manner. Laskin et al. (2023) proposed Algorithm Distillation (AD), which automatically improves its performance in a trial-and-error manner when provided with multiple historical trajectories. Subsequently, Lee et al. (2023b) proposed the Decision-Pretrained Transformer, which trains the agent to find optimal behaviors faster by only predicting the optimal trajectory. More recently, Hao Liu (2023) further demonstrated that across-episodic contexts encourage trial-and-error behaviors to emerge in large transformer models. However, these methods operate at the level of the smallest actions, so across-episodic contexts induce overly long sequences and incur huge computational costs. In contrast, IDT explores the trial-and-error ability of high-level decisions, which can significantly shorten the length of across-episodic contexts and, therefore, reduce the computational costs arising from the self-attention mechanism.

3 Preliminaries

Partially Observable Markov Decision Process.   We consider learning problems in the context of Partially Observable Markov Decision Processes (POMDPs) represented by the tuple $\mathcal{M}=(\mathcal{S},\mathcal{O},\mathcal{A},P,\mathcal{R})$. The POMDP tuple consists of states $s\in\mathcal{S}$, observations $o\in\mathcal{O}$, actions $a\in\mathcal{A}$, rewards $r\in\mathcal{R}$, and a transition probability function $P(s_{t+1}\mid s_{t},a_{t})$, where $t$ is an integer denoting the timestep. In environments described by a POMDP, at each timestep $t$ the agent receives the observation $o_{t}$, selects an action $a_{t}\sim\pi(\cdot\mid o_{t})$ from its policy, and then receives the next observation $o_{t+1}$. A trajectory is a sequence of observations, actions, and rewards and is denoted by $\tau=(o_{0},a_{0},r_{0},\dots,o_{T},a_{T},r_{T})$. The return of a trajectory at timestep $t$, $R_{t}=\sum_{t'=t}^{T}r_{t'}$, is calculated as the sum of future rewards from that timestep. In addition, a completion token $d_{t}$, a binary identifier, is used to indicate whether a trajectory ends at time $t$.
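
As an illustration of the return and completion-token definitions above, the following minimal Python sketch computes both for a single trajectory. The function names `returns_to_go` and `completion_tokens` are hypothetical and are not taken from the IDT implementation; rewards are assumed to be given as a plain list.

```python
from typing import List


def returns_to_go(rewards: List[float]) -> List[float]:
    """R_t: sum of the rewards from timestep t to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last timestep backwards so each R_t costs O(1).
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        out[t] = running
    return out


def completion_tokens(length: int) -> List[int]:
    """d_t: 1 at the final timestep of the trajectory, 0 elsewhere."""
    return [1 if t == length - 1 else 0 for t in range(length)]
```

For example, a reward sequence `[1.0, 0.0, 2.0]` yields returns `[3.0, 2.0, 2.0]` and completion tokens `[0, 0, 1]`.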

Hierarchical Reinforcement Learning.   RL algorithms aim to maximize the expected return $\mathrm{E}[\sum_{t=0}^{T}r_{t}]$ throughout an agent's lifetime or training episodes. In long-horizon tasks, standard RL methods suffer from poor performance due to the exponentially growing exploration space. Hierarchical RL decomposes the long-horizon task into subproblems or subtasks such that a high-level policy learns to perform the task by choosing optimal subtasks as the high-level decisions (Pateria et al., 2021). High-level decisions can be designed as discrete or continuous forms. The discrete form can select multiple independent low-level policy models (Bacon et al., 2017), while the continuous form usually serves as additional conditions to control a general low-level policy model (Nachum et al., 2018). Since the transformer-based policy is a conditional generative model, it is naturally adapted to high-level decisions in the continuous form, such as the return-to-go condition in the decision transformer (Chen et al., 2021). In this work, we use a vector $\mathbf{z}$ to represent high-level decisions and assume that it is sampled from a multivariate Gaussian distribution.

Transformers.   The Transformer (Vaswani et al., 2017) architecture consists of multiple layers of self-attention operation and MLP. The self-attention begins by projecting input data $X$ with three separate matrices onto $D$-dimensional vectors called queries $Q$, keys $K$, and values $V$. These vectors are then passed through the attention function:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(QK^{T}/\sqrt{D}\right)V. \tag{1}$$

The $QK^{T}$ term computes an inner product between two projections of the input data $X$. The inner product is then normalized and projected back to a $D$-dimensional vector with the scaling term $V$. Transformers utilize self-attention as a core part of the architecture to process sequential data (Devlin et al., 2018; Brown et al., 2020a). In this work, we use the GPT (Radford et al., 2018) architecture, which modifies the transformer with a causal self-attention mask so that the token at position $i$ attends only to the current and previous tokens in the sequence ($j\in[1,i]$), enabling autoregressive generation at test time.
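
To make the attention function in Equation (1) and the causal mask concrete, the following sketch computes single-head causal attention in plain Python. It is an illustrative toy rather than the GPT implementation used in this work: the names `softmax` and `causal_attention` are hypothetical, $Q$, $K$, and $V$ are assumed to be already projected, and nested lists stand in for tensors.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """softmax(Q K^T / sqrt(D)) V, where token i only attends to tokens j <= i."""
    D = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Causal mask: score the query only against the current and earlier keys.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(D)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out.append(
            [sum(w * V[j][d] for j, w in enumerate(weights)) for d in range(len(V[0]))]
        )
    return out
```

Because each of the $L$ tokens is scored against up to $L$ earlier tokens, the number of scores grows quadratically with the sequence length, which is the cost that long across-episodic contexts amplify.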

Figure 2: The architecture of IDT is organized into three modules to simulate the high-level trial-and-error process. First, the (1) Making Decisions module predicts a high-level decision given across-episodic contexts, where the across-episodic contexts contain multiple trajectories arranged in ascending order of total reward. Then, the (2) Decisions to Go module predicts actions for $c$ steps conditioned on the predicted high-level decision. Finally, the (3) Reviewing Decisions module reviews the executed actions to serve as experience for the next cycle. Note that the Reviewing Decisions module encodes the true label of high-level decisions from offline data during training, whereas it encodes them from the executed actions at test time.

4 Method

In this section, we present IDT, which models a high-level trial-and-error process through a hierarchical chain of experience, as summarized in Figure 2.

4.1 Chain of Experience

The key factors that influence how we represent trajectories are (1) the ability of transformers to uncover meaningful patterns from multiple trajectories and (2) the capacity to improve itself conditioned on experience. The basic elements of trajectories are observations $o$, actions $a$, rewards $r$, and completion token $d$. As modeling rewards is a nontrivial task, we aim to have the model generate actions based on the target returns $\hat{R}_{0}$ (Chen et al., 2021), which can be updated using rewards: $\hat{R}_{t}=\hat{R}_{0}-\sum_{j=0}^{t}r_{j}$. Therefore, the following trajectory representation is amenable to autoregressive training and generation:

$$\tau=(\hat{R}_{0},o_{0},a_{0},r_{0},d_{0},\dots,\hat{R}_{T},o_{T},a_{T},r_{T},d_{T}). \tag{2}$$
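
The trajectory representation in Equation (2) can be sketched as a small Python helper that pairs every step with its running target return and completion token. This is an illustration under stated assumptions, not the authors' code: the name `build_trajectory` and the tuple layout are hypothetical, observations and actions are placeholder strings, and one indexing convention is assumed, namely that the target paired with step $t$ subtracts only the rewards received before step $t$ (as in the Decision Transformer), so that the first token equals the initial target return.

```python
from typing import List, Sequence, Tuple

# (target return, observation, action, reward, completion token)
Step = Tuple[float, str, str, float, int]


def build_trajectory(
    target_return: float,
    observations: Sequence[str],
    actions: Sequence[str],
    rewards: Sequence[float],
) -> List[Step]:
    """Arrange one episode as (R_hat_t, o_t, a_t, r_t, d_t) steps."""
    steps: List[Step] = []
    remaining = target_return
    last = len(rewards) - 1
    for t, (o, a, r) in enumerate(zip(observations, actions, rewards)):
        steps.append((remaining, o, a, r, int(t == last)))
        # Assumed convention: decrement the target after the reward is received.
        remaining -= r
    return steps
```

With a target return of `3.0` and rewards `[1.0, 0.0, 2.0]`, the targets attached to the three steps are `3.0`, `2.0`, and `2.0`, and only the final step carries a completion token of `1`.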

To help the model achieve the target return, we construct across-episodic contexts that consist of multiple trajectories for self-improvement during test time (Hao Liu, 2023). This idea arises from the approach called chain of hindsight (Liu et al., 2023), which trains language models from human feedback by conditioning on positive indicators and negative-rated examples to predict corresponding positive-rated examples. In RL tasks, the positive indicator is the target return, and previous trajectories serve as negative-rated examples to predict the trajectory with higher returns.

Specifically, the chain of experience is represented by $n$ trajectories: $s=(\tau^{1},\tau^{2},\dots,\tau^{n})$, where

$$\tau^{i}=(\hat{R}_{0}^{i},o_{0}^{i},a_{0}^{i},r_{0}^{i},d_{0}^{i},\dots,\hat{R}_{T}^{i},o_{T}^{i},a_{T}^{i},r_{T}^{i},d_{T}^{i}). \tag{3}$$

The trajectories are sorted in ascending order of their total rewards, i.e., $\sum_{t=0}^{T}r_{t}^{1}\leq\sum_{t=0}^{T}r_{t}^{2}\leq\dots\leq\sum_{t=0}^{T}r_{t}^{n}$. For all $n$ trajectories, the initial target return $\hat{R}_{0}^{i}$ equals the maximum total reward, i.e., that of the last trajectory, $\hat{R}_{0}^{n}=\sum_{t=0}^{T}r_{t}^{n}$.
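
A minimal sketch of this construction is given below, assuming each trajectory is summarized only by its list of rewards. The function name `chain_of_experience` and the dictionary keys are hypothetical and chosen for illustration; they do not come from the IDT code base.

```python
from typing import Dict, List


def chain_of_experience(trajectories: List[List[float]]) -> List[Dict]:
    """Sort trajectories by total reward and share one initial target return."""
    # Ascending order of total reward: the best trajectory comes last.
    ordered = sorted(trajectories, key=sum)
    # Every trajectory in the chain is conditioned on the maximum total reward.
    target = sum(ordered[-1])
    return [{"target_return": target, "rewards": rs} for rs in ordered]
```

For instance, three trajectories with total rewards 2, 0, and 3 are reordered as 0, 2, 3, and all three receive the initial target return 3.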

4.2 Hierarchical Chain of Experience

After building across-episodic contexts based on the chain of experience, in-context RL can automatically improve its performance at evaluation time by rolling out trajectories in a trial-and-error manner. However, this approach incurs substantial computational costs as the task horizon increases.

Since total rewards are obtained at the end of episodes, it is more difficult to evaluate and improve a policy model in long-horizon tasks. In traditional RL methods, an effective solution is to decompose complex tasks into several sub-problems by incorporating hierarchical structures (Nachum et al., 2018). The high-level policy only needs to generate a signal once to control the low-level policy to generate multi-step actions. This allows (1) the high-level policy to receive feedback faster, as if working on a short-horizon task, and (2) the low-level policy only needs to consider how to better implement the sub-tasks generated by the high-level decision. Although the hierarchical structure can be optimized end-to-end by reward signals, the trial-and-error process in in-context RL is more complicated.

As psychologist Edward Thorndike mentioned, the trial-and-error process includes two parts (Sutton & Barto, 2018), memory and search. The high-level decision plays an important role that is closely connected with memory and search. A high-level decision is generated from the search process and directly affects the quality of low-level executed actions. Subsequently, it also serves as the memory for future searches. Therefore, we designed three modules to realize a high-level trial-and-error process: Making Decisions, Decisions to Go, and Reviewing Decisions.

Making Decisions.   The purpose of the Making Decisions module is to generate high-level decisions autoregressively, where the high-level decision is represented by a vector $\mathbf{z}$ sampled from a multivariate Gaussian distribution. As the quality of $\mathbf{z}$ directly relates to low-level actions $a$, a better high-level decision $\mathbf{z}$ is critical for inducing better low-level actions $a$. Therefore, we reconstruct across-episodic contexts represented as a high-level chain of experience $s_{h}=(\tau_{h}^{1},\tau_{h}^{2},\dots,\tau_{h}^{n})$. Each $\tau_{h}^{i}$ is denoted as:

$$\tau_{h}^{i}=(\hat{R}_{0}^{i},o_{0}^{i},\mathbf{z}_{0}^{i},\hat{r}_{0}^{i},d_{0}^{i},\hat{R}_{c}^{i},o_{c}^{i},\mathbf{z}_{c}^{i},\hat{r}_{c}^{i},d_{c}^{i},\dots,\hat{R}_{kc}^{i},o_{kc}^{i},\mathbf{z}_{kc}^{i},\hat{r}_{kc}^{i},d_{kc}^{i}), \tag{4}$$

where each high-level decision $\mathbf{z}$ is generated every $c$ steps, $T-c\leq kc\leq T$, and $\hat{r}_{c}^{i}=\sum_{t=c}^{2c-1}r_{t}^{i}$ is the sum of the rewards over $c$ steps. Compared with Equation (3), the high-level chain of experience considerably shortens the length of contexts and, therefore, significantly reduces the computational cost of the self-attention mechanism.
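
The effect of the high-level representation on context length can be sketched with a short calculation. The helpers `segment_rewards` and `context_tokens` are hypothetical names, five tokens per step are assumed from the five elements per step in Equations (3) and (4), and the horizon, number of trajectories, and $c$ used in the example are illustrative values rather than settings reported for IDT.

```python
from typing import List, Sequence


def segment_rewards(rewards: Sequence[float], c: int) -> List[float]:
    """r_hat at steps 0, c, 2c, ...: sum of the rewards of each c-step segment."""
    return [sum(rewards[j:j + c]) for j in range(0, len(rewards), c)]


def context_tokens(horizon: int, n_trajectories: int,
                   tokens_per_step: int = 5, c: int = 1) -> int:
    """Tokens in an across-episodic context with one entry every c timesteps."""
    steps = -(-horizon // c)  # ceiling division: entries at steps 0, c, 2c, ...
    return n_trajectories * steps * tokens_per_step


low = context_tokens(horizon=1000, n_trajectories=4)         # low-level chain
high = context_tokens(horizon=1000, n_trajectories=4, c=10)  # high-level chain
```

In this hypothetical setting the context shrinks from 20000 to 2000 tokens. Since self-attention scores every pair of tokens, a context that is $c$ times shorter needs roughly $c^{2}$ times fewer pairwise scores.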

Decisions to Go.   Based on high-level decisions, the Decisions to Go module is designed to generate low-level actions that can interact with environments. Since the transformer-based policy is a conditional generative model, we can build a low-level context that contains high-level decisions to control the low-level actions. The low-level context is represented as:

$$\tau_{l}^{i,j}=(\mathbf{z}_{j}^{i},o_{j}^{i},a_{j}^{i},r_{j}^{i},\mathbf{z}_{j}^{i},o_{j+1}^{i},a_{j+1}^{i},r_{j+1}^{i},\dots,\mathbf{z}_{j}^{i},o_{j+c-1}^{i},a_{j+c-1}^{i},r_{j+c-1}^{i}), \tag{5}$$

where each $\tau_{l}^{i,j}$ starts from the generation step $j \in \{0, c, \dots, kc\}$ of high-level decisions and completes $c$ steps of low-level actions in the trajectory $\tau^{i}$. In particular, we apply the reparameterization trick (Jang et al., 2016) to the high-level decisions so that gradients can backpropagate through the Decisions to Go module to the Making Decisions module.
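The structure of Equation (5) can be illustrated with a short Python sketch that repeats one high-level decision before each of the $c$ low-level steps. The function name `build_low_level_context` and the use of string placeholders instead of embeddings are assumptions for illustration only.

```python
def build_low_level_context(z, observations, actions, rewards, j, c):
    """Interleave one high-level decision with c steps of (o, a, r), as in Equation (5).

    The same decision z is repeated before every step of the segment.
    """
    context = []
    for t in range(j, j + c):
        context.extend([z, observations[t], actions[t], rewards[t]])
    return context


# Toy episode with string placeholders standing in for real token embeddings.
observations = [f"o{t}" for t in range(6)]
actions = [f"a{t}" for t in range(6)]
rewards = [f"r{t}" for t in range(6)]
segment = build_low_level_context("z3", observations, actions, rewards, j=3, c=3)
```

The resulting list has $4 \times c$ entries, and every low-level action is conditioned on the same decision generated at step $j$.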

Reviewing Decisions.   The autoregressive training of the conditional generation model is achieved by predicting each token in the sequence. For example, when the transformer model is trained to generate $a_{t}$, we need to provide the action label at time $t$ and condition it on the historical actions $\{a_{0}, \dots, a_{t-1}\}$. However, the supervisory signals about high-level decisions $\mathbf{z}$ are not directly observable from the sequence, as shown in Equation (4). For instance, when the transformer model is trained to generate $\mathbf{z}_{j}$ ($j \in \{0, c, \dots, kc\}$), we have neither the true label of $\mathbf{z}_{j}$ nor the previous high-level decisions $\{\mathbf{z}_{0}, \mathbf{z}_{c}, \dots, \mathbf{z}_{j-c}\}$.

To this end, we replace the true label of $\mathbf{z}_{j}$ with the gradients from the Decisions to Go module, which is trained to generate the following $c$ steps of actions $\{a_{j}, a_{j+1}, \dots, a_{j+c-1}\}$. For the previous high-level decisions $\{\mathbf{z}_{0}, \mathbf{z}_{c}, \dots, \mathbf{z}_{j-c}\}$, we introduce the Reviewing Decisions module to encode the label from low-level actions. As the high-level decisions induce low-level actions, the low-level actions should in turn allow the high-level decisions to be inferred. Specifically, to infer a previous high-level decision $\mathbf{z}_{t} \in \{\mathbf{z}_{0}, \mathbf{z}_{c}, \dots, \mathbf{z}_{j-c}\}$, we first utilize the self-attention operation to aggregate the information of $a_{t+c-1}$ and $\{o_{t}, a_{t}, \dots, o_{t+c-1}, a_{t+c-1}\}$. Then, we apply a linear layer to encode $\mathbf{z}_{t}$ from the aggregated information. Note that the Reviewing Decisions module is not required to perform autoregressive generation, so any sequence model, such as an LSTM, can replace it.
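One way to picture the two steps of the Reviewing Decisions module (attention-based aggregation followed by a linear layer) is the dependency-free Python sketch below. It makes several assumptions that the text does not specify: the embedding of the last action $a_{t+c-1}$ serves as the query, no learned query/key/value projections are used, and the embeddings and the weights `W` and `b` are hypothetical values. It is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_from_last(tokens):
    """Scaled dot-product attention with the last token as the query.

    tokens: list of equal-length embedding vectors for (o_t, a_t, ..., a_{t+c-1}).
    Returns the attention weights and the aggregated vector.
    """
    query = tokens[-1]
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
    weights = softmax(scores)
    aggregated = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return weights, aggregated


def linear(vector, weight, bias):
    """Apply y = W x + b, with weight given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]


# Hypothetical 2-dimensional embeddings for a segment with c = 2: o_t, a_t, o_{t+1}, a_{t+1}.
segment_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
weights, aggregated = attend_from_last(segment_tokens)

# Hypothetical linear layer mapping the aggregated vector to a 3-dimensional decision z_t.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
z_t = linear(aggregated, W, b)
```

Because the module only encodes a finished segment and never generates tokens, the attention step here could be swapped for any other sequence encoder without changing the interface.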

By combining the above three modules, IDT can automatically improve its performance at evaluation time by rolling trajectories in a high-level trial-and-error manner. We now introduce the implementation details of IDT, including architecture, training, and testing.

Figure 3: Results for (a) testing and (b) training times. We report the training time per 10k gradient updates and the testing time for 50 episodes on Grid World and 10 episodes on D4RL. Note that we measure the context size in steps here; the number of tokens per step varies by algorithm. Each step in AD contains 4 tokens: observation, action, reward, and completion. IDT’s Making Decisions module and AT have an extra return-to-go token. For AT and AD, the across-episodic context grows in proportion to the task length, resulting in a quadratic increase in computational costs. In contrast, IDT reconstructs the sequence to consist of high-level decisions. Therefore, its context is shorter than one episode, significantly reducing computational costs.

4.3 Implementation of IDT

Architecture.   We feed $n$ trajectories into the Making Decisions module, which results in $5 \times n \times T/c$ tokens, with one token for each of the five modalities: desired target return, observation, high-level decision, reward, and completion. In the Decisions to Go module, we feed $4 \times c$ tokens, with one token for each of the four modalities: high-level decision, observation, action, and reward. In the Reviewing Decisions module, we feed $2 \times c$ tokens, with one token for each of the two modalities: observation and action. To create the token embeddings, we train a linear layer for each modality, which transforms the raw inputs into the desired embedding dimension, followed by layer normalization (Ba et al., 2016). Finally, the tokens are processed by a GPT model that predicts future high-level decision and low-level action tokens through autoregressive modeling.
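To make the token counts concrete, the sketch below evaluates the three expressions from this paragraph. The helper name `idt_token_counts` is hypothetical, the value $c = 10$ is chosen only for illustration, and the sketch assumes that $T$ is divisible by $c$.

```python
def idt_token_counts(n, T, c):
    """Token counts per module, following the Architecture paragraph.

    n: episodes in context, T: episode length, c: low-level steps per decision.
    Assumes T is divisible by c so that T / c is an integer.
    """
    if T % c != 0:
        raise ValueError("this sketch assumes T is divisible by c")
    return {
        "making_decisions": 5 * n * T // c,
        "decisions_to_go": 4 * c,
        "reviewing_decisions": 2 * c,
    }


# n = 4 and T = 1000 match the D4RL setting; c = 10 is a hypothetical choice.
counts = idt_token_counts(n=4, T=1000, c=10)
```

Under these assumed values, the longest context belongs to the Making Decisions module, and it shrinks in inverse proportion to $c$.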

Training and Testing.   During training, we are given a dataset of offline trajectories, which can be suboptimal. In each iteration, we sample minibatches of trajectories from the dataset. The Reviewing Decisions module first encodes each true high-level decision $\mathbf{z}$ from the minibatch every $c$ steps. Then, the Making Decisions module predicts the high-level decision $\mathbf{z}_{t}$ given the input token $o_{t}$ and past trajectories. Finally, the Decisions to Go module autoregressively predicts $c$ steps of low-level actions $\{a_{t}, \dots, a_{t+c-1}\}$ given $\mathbf{z}_{t}$ and $\{o_{t}, \dots, o_{t+c-1}\}$. The low-level actions are evaluated with either cross-entropy loss or mean-squared error, depending on whether the actions are discrete or continuous. The losses from each timestep are averaged and used to update all three modules end-to-end. At test time, we roll out IDT for multiple trajectories and report the largest return among them. Following the configuration of related works (Hao Liu, 2023; Laskin et al., 2023), we set the context size to span $n=4$ episodes. Note that the task horizons $T$ used in this work range from 20 to 1000 steps, and the maximum context size reaches 20000 tokens. The pseudocode for IDT is summarized in Appendix A. Source code and more hyperparameters are described in Appendix B.
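The choice between the two losses can be sketched in plain Python as follows. This is an illustrative sketch rather than the training code: the function names are hypothetical, discrete predictions are assumed to be unnormalized logits, and a real implementation would use a differentiable framework so that the averaged loss can update all three modules.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of one discrete action, computed from unnormalized logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def mean_squared_error(predicted, target):
    """Mean-squared error of one continuous action vector."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def action_loss(predictions, targets, discrete):
    """Average the per-timestep action loss over a segment."""
    per_step = cross_entropy if discrete else mean_squared_error
    losses = [per_step(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)


# Discrete case: two timesteps, uniform logits over 5 actions.
discrete_loss = action_loss([[0.0] * 5, [0.0] * 5], [2, 4], discrete=True)

# Continuous case: two timesteps of 2-dimensional actions.
continuous_loss = action_loss([[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]], discrete=False)
```

With uniform logits over 5 actions the discrete loss equals $\ln 5$, which is a convenient sanity check for an untrained policy.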

5 Experiments

Dataset: Grid World.   In this section, we first consider the discrete control environments from the Grid World (Lee et al., 2022), which is a commonly used benchmark for recent in-context RL methods. The environments support many tasks that cannot be solved through zero-shot generalization after pre-training because these tasks cannot be inferred easily from the observation. The episode of each task is short enough to train a transformer-based policy with across-episodic contexts feasibly. Specifically, we consider the four evaluation environments: Darkroom, Darkroom Hard, Darkroom Dynamic, and Dark Key-to-Door.

The evaluation environments provide a 2D discrete POMDP where an agent spawns in a room and must find a goal location. The agent observes only its own $(x, y)$ coordinates and does not know the goal location, which it must deduce from the rewards received. The room dimensions are $9 \times 9$, and the agent’s possible actions are moving one step left, right, up, or down, or staying idle. In Darkroom, an episode lasts 20 steps, and the agent obtains a reward ($r=1$) each time the goal is reached. Darkroom Hard and Darkroom Dynamic are two variants of Darkroom. In Darkroom Hard, the agent obtains a reward only the first time the goal is reached. In Darkroom Dynamic, the goal is fixed to a corner, but the action space is randomly permuted. In Dark Key-to-Door, an episode lasts 50 steps, and the agent must first locate an invisible key to receive a one-time reward and then find an invisible door to obtain another one-time reward.
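The basic Darkroom dynamics described above can be sketched as a small Python class. The class below is a hypothetical illustration, not the benchmark's code: the start position at $(0, 0)$, the clipping of moves at the walls, and the ordering of the action indices are all assumptions.

```python
class Darkroom:
    """Minimal Darkroom-style grid: the agent sees only its own (x, y)."""

    # Actions: left, right, up, down, stay (the index order is an assumption).
    MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1), (0, 0)]

    def __init__(self, goal, size=9, horizon=20, start=(0, 0)):
        self.goal = goal
        self.size = size
        self.horizon = horizon
        self.start = start
        self.reset()

    def reset(self):
        self.position = self.start
        self.steps = 0
        return self.position

    def step(self, action):
        dx, dy = self.MOVES[action]
        # Clip to the room boundaries (an assumption about wall behaviour).
        x = min(max(self.position[0] + dx, 0), self.size - 1)
        y = min(max(self.position[1] + dy, 0), self.size - 1)
        self.position = (x, y)
        self.steps += 1
        reward = 1 if self.position == self.goal else 0
        done = self.steps >= self.horizon
        return self.position, reward, done


env = Darkroom(goal=(2, 0))
trajectory = [env.step(1) for _ in range(3)]  # move right three times
```

The observation returned by `step` never contains the goal, so the only signal about the goal location is the reward, which is what makes the task partially observable.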

In addition, we create large variants of these environments, namely Large Darkroom, Large Darkroom Hard, Large Darkroom Dynamic, and Large Darkroom Key-to-Door, where the coordinate space of each environment is expanded to $40 \times 40$ and the episode length is increased tenfold. The dataset is collected from learning histories that are generated by training gradient-based RL algorithms, such as Deep Q-Network (Mnih et al., 2013). For each environment, we randomly create 60 tasks from the coordinate space and collect data for 1 million timesteps.

Figure 4: Results for Grid World. An agent is expected to solve a new task by interacting with the environments for 50 episodes without online model updates. Based on high-level decisions, our method outperforms both AT and AD, which rely on across-episodic contexts with the smallest actions. In particular, IDT has significant advantages in handling long-horizon tasks.

Dataset: D4RL.   D4RL (Fu et al., 2020) is a commonly used offline RL benchmark that includes continuous control tasks. The different dataset settings are described below.

  • Medium: 1 million timesteps generated by a “medium” policy that performs approximately one-third as well as an expert policy.

  • Medium-Replay: 1 million timesteps collected from the replay buffer of an agent trained to the performance of a “medium” policy.

  • Medium-Expert: It consists of 1 million timesteps generated by the “medium” policy and another 1 million timesteps generated by the expert policy.

The dataset is collected from MuJoCo environments, including HalfCheetah, Hopper, and Walker. The episode length in D4RL is 1000, far longer than in Grid World. Therefore, current in-context RL methods incur huge computational costs in D4RL, even though it is a commonly used benchmark for conventional RL algorithms.

Baselines.   In this section, we investigate the performance and efficiency of IDT relative to in-context RL, dedicated offline RL, and imitation learning algorithms. Our baselines can be categorized as follows:

  • In-context RL: These methods use the transformer to model trajectory sequences and predict actions autoregressively. We compare with recent methods, Agentic Transformer (AT) (Hao Liu, 2023) and Algorithm Distillation (AD) (Laskin et al., 2023), which proposed across-episodic contexts with the smallest actions.

  • Temporal-difference learning: Most temporal-difference (TD) learning methods use an action space constraint or value pessimism and will serve as faithful comparisons to IDT, representing standard RL methods. We consider state-of-the-art TD3+BC (Fujimoto & Gu, 2021) that is demonstrated to be effective on D4RL.

  • Imitation learning: Imitation learning methods similarly utilize supervised losses for training, such as Behavior Cloning (BC) (Torabi et al., 2018) and Decision Transformer (DT) (Chen et al., 2021). Following AT, we compare with BC-10%, which is shown to be competitive with state-of-the-art on D4RL. DT also uses a transformer to predict actions autoregressively but is limited to a single episode context.

For all comparison methods, we adhere closely to the original hyper-parameter settings. To evaluate IDT and other in-context RL algorithms, we roll out 10 episodes in D4RL and 50 episodes in Grid World. For each result, we report mean and standard error across 5 random seeds.
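For reference, a mean and standard error across seeds can be computed as in the following sketch. The seed returns are made-up numbers, and the use of the sample standard deviation (with $n-1$ in the denominator) is an assumption, since the text does not state which estimator is used.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return the sample mean and the standard error of the mean."""
    mean = statistics.mean(values)
    # Standard error = sample standard deviation / sqrt(number of seeds).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical returns from 5 random seeds (not results from the paper).
seed_returns = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, sem = mean_and_standard_error(seed_returns)
```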

5.1 Evaluation of Computing Costs

An important property of in-context RL is that it can improve itself without expensive gradient updates. However, the computational costs of forward propagation are hidden in short-horizon tasks. Therefore, we report the training time per 10k gradient updates and the evaluation time for 50 episodes on Grid World and 10 episodes on D4RL. As shown in Figure 3, IDT trains efficiently and significantly reduces the testing time compared to the baselines, running approximately 36× faster in D4RL and 27× faster in large Grid World. More detailed results for each task are described in Appendix C.

As the task length increases, the evaluation time of AT and AD grows quadratically. This is because the across-episodic contexts multiply the sequence length, leading to intolerable computational costs in the self-attention mechanism. The AT algorithm requires four episodes for trial-and-error, where each episode reaches 1000 steps in D4RL and each step contains 5 tokens. Therefore, each generation step of AT requires scanning 20k tokens. Since the AD algorithm omits the return-to-go token at each step, its training and testing times are both lower than those of AT. In contrast, IDT reconstructs the sequence to consist of high-level decisions, so its context is shorter than one episode. As a result, the training and testing times of IDT are significantly lower than those of the baselines.
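The token count quoted above follows from a short calculation. The AD line assumes the 4 tokens per step stated in the caption of Figure 3, and the IDT line uses the token count from the Architecture paragraph with $c$ left symbolic; neither line is an additional measurement.

```latex
\begin{align*}
\text{AT context} &= 4 \text{ episodes} \times 1000 \text{ steps} \times 5 \text{ tokens} = 20{,}000 \text{ tokens},\\
\text{AD context} &= 4 \times 1000 \times 4 = 16{,}000 \text{ tokens},\\
\text{IDT Making Decisions context} &= 5 \times 4 \times \frac{1000}{c} = \frac{20{,}000}{c} \text{ tokens}.
\end{align*}
```

Because the cost of self-attention scales with the square of the context length, shortening the context by a factor of $c$ reduces the number of attended token pairs in the Making Decisions module by roughly $c^{2}$. This estimate ignores the comparatively short contexts of the Decisions to Go and Reviewing Decisions modules.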

Table 1: Results for D4RL datasets. IDT outperforms both in-context RL (AT and AD) and supervised learning (BC) and performs competitively with conventional RL algorithms (TD3+BC and TD3) on almost all tasks.
| Dataset | Environment | BC-10% | TD3+BC | TD3 | DT | AT | AD | Ours |
|---|---|---|---|---|---|---|---|---|
| Medium-Expert | HalfCheetah | 94.11 | 96.59 | 87.60 | 93.40 | 95.81 ± 0.25 | 94.21 ± 0.46 | 96.12 ± 0.18 |
| Medium-Expert | Hopper | 113.13 | 113.22 | 98.41 | 111.18 | 115.92 ± 1.26 | 108.32 ± 0.95 | 118.39 ± 0.75 |
| Medium-Expert | Walker | 109.90 | 112.21 | 100.52 | 108.71 | 114.87 ± 0.56 | 111.36 ± 0.46 | 118.51 ± 0.48 |
| Medium | HalfCheetah | 43.90 | 48.93 | 34.60 | 42.73 | 45.12 ± 0.34 | 42.28 ± 1.18 | 45.51 ± 0.26 |
| Medium | Hopper | 73.84 | 70.44 | 56.98 | 69.42 | 70.45 ± 0.45 | 72.58 ± 0.54 | 83.24 ± 0.33 |
| Medium | Walker | 82.05 | 86.91 | 70.95 | 74.70 | 88.71 ± 0.55 | 85.96 ± 0.46 | 88.94 ± 0.61 |
| Medium-Replay | HalfCheetah | 42.27 | 45.84 | 38.81 | 40.31 | 46.86 ± 0.33 | 41.28 ± 0.21 | 45.58 ± 0.36 |
| Medium-Replay | Hopper | 90.57 | 98.12 | 78.90 | 88.74 | 96.85 ± 0.41 | 91.32 ± 0.66 | 98.59 ± 0.26 |
| Medium-Replay | Walker | 76.09 | 91.17 | 65.94 | 68.22 | 92.32 ± 1.21 | 89.21 ± 1.42 | 96.22 ± 1.06 |
| Total Average | | 80.65 ± 1.34 | 84.83 ± 1.10 | 70.28 ± 1.20 | 77.49 ± 1.45 | 85.21 ± 1.12 | 82.84 ± 0.70 | 87.90 ± 1.06 |
Figure 5: Results for IDT conditioned on partial demonstrations. IDT can accelerate self-improvement by using the Reviewing Decisions module to encode external data prompts.

5.2 Grid World Results

To evaluate IDT’s self-improvement capabilities in unseen tasks, we compare it with recent in-context RL methods in the Grid World environments. The agent is required to solve an unseen task by interacting with the environments for 50 episodes without online model updates. As shown in Figure 4, IDT achieves state-of-the-art performance in a wide range of tasks.

In the large variant tasks, IDT significantly surpasses the baselines in both efficiency and performance. In contrast, neither AT nor AD shows an obvious self-improvement trend, especially in Large Darkroom Hard. This is because Large Darkroom Hard is a task with sparse rewards, which makes it difficult for AT and AD to capture the goal position in long sequences. IDT, by comparison, explores tasks in a high-level trial-and-error manner, which makes it easier to receive positive reward feedback. Overall, these results demonstrate that trial-and-error over high-level decisions is feasible and that self-improvement need not be limited to the smallest actions.

5.3 D4RL Results

In addition to the short-horizon tasks specific to in-context RL methods, we also test the performance of IDT on the D4RL dataset, which is commonly used by conventional RL methods. Following Fu et al. (2020), the results on D4RL are normalized so that 100 denotes an expert policy. Baseline numbers are taken from the AT paper and the D4RL paper. As shown in Table 1, IDT outperforms baselines in a majority of the tasks and is competitive with the state-of-the-art in the remaining tasks.
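The normalization used by D4RL rescales a raw return linearly between the return of a random policy (mapped to 0) and that of an expert policy (mapped to 100). The sketch below shows this rescaling; the function name is hypothetical and the reference returns are illustrative values, not the per-environment constants shipped with the benchmark.

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so that 0 is a random policy and 100 is an expert."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, chosen only to illustrate the rescaling.
score = d4rl_normalized_score(raw_return=2500.0, random_return=0.0, expert_return=5000.0)
```

Scores above 100, such as several entries in the Medium-Expert rows of Table 1, therefore indicate returns higher than the expert reference.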

In the TD learning and imitation learning categories, TD3+BC is generally the strongest algorithm. Compared with these methods, the superior performance of IDT demonstrates the advantages of using high-level trial-and-error.

5.4 Case Study on the Reviewing Decisions Module

A notable ability of transformer-based policies is to address tasks when provided with demonstration prompts. Although in-context RL can improve itself without relying on demonstrations, external prompts can speed up the process. Therefore, we want to investigate whether IDT can benefit from this setting. To answer this question, we design an $\epsilon$-greedy policy to collect external data in the Darkroom and Large Darkroom tasks, ranging from nearly random to optimal.
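A data-collection policy of this kind can be sketched as follows. The function name and the assumption that an optimal action is available from some oracle are illustrative; the text does not describe how the optimal action is obtained.

```python
import random


def epsilon_greedy_action(optimal_action, num_actions, epsilon, rng):
    """With probability epsilon act uniformly at random, otherwise act optimally."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return optimal_action


rng = random.Random(0)  # fixed seed for reproducibility
# epsilon = 0 always follows the optimal action; epsilon = 1 is fully random.
greedy_actions = [epsilon_greedy_action(3, 5, 0.0, rng) for _ in range(10)]
random_actions = [epsilon_greedy_action(3, 5, 1.0, rng) for _ in range(10)]
```

Sweeping `epsilon` from 1 down to 0 yields demonstrations that range from nearly random to optimal, matching the range of prompt quality described above.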

As shown in Figure 5, IDT improves each policy in context until it is near-optimal. Notably, the closer the input policy is to optimal, the faster IDT improves it. Although high-level decisions cannot be directly observed from the external demonstrations, IDT can extract experts’ intentions through the Reviewing Decisions module. In addition, we perform parameter sensitivity analyses on the high-level decision frequency ($c$) and the context size ($n$ episodes), as shown in Appendix C.

6 Conclusion

In this work, we propose IDT, an efficient in-context RL method that treats RL tasks as an across-episodic sequence problem and can improve itself at test time. IDT is inspired by human multi-level decision-making and introduces high-level decisions into the sequence prediction process. Unlike current in-context RL methods, which are limited to short-horizon tasks, IDT also performs well on standard RL benchmarks, which typically have longer task horizons. On the Grid World and D4RL benchmarks, we show that IDT can outperform baselines in both efficiency and performance.

Impact Statement

In terms of the potential broader impact, our work provides a new idea of incorporating high-level decisions in the design of context, which may promote the development of the in-context RL community. Besides, we do not see any negative ethical and societal impacts of our work while using our method in practice.

Acknowledgments

This work was supported by the National Key R&D Program of China under Grant No. 2021ZD0112500; the National Natural Science Foundation of China under Grant Nos. U22A2098, U19A2065, 62172185, 61976102, 62206105 and 62202200; the International Cooperation Project of Jilin Province under Grant Nos. 20220402009GH, U2341229.

References

  • Alayrac et al. (2022) Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  • Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Bacon et al. (2017) Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017.
  • Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Brown et al. (2020a) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020a.
  • Brown et al. (2020b) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020b.
  • Chen et al. (2021) Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021.
  • Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dorfman et al. (2021) Dorfman, R., Shenfeld, I., and Tamar, A. Offline meta reinforcement learning–identifiability challenges and effective data collection strategies. Advances in Neural Information Processing Systems, 34:4607–4618, 2021.
  • Dosovitskiy et al. (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Duan et al. (2016) Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp.  1126–1135. PMLR, 2017.
  • Fu et al. (2020) Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Fujimoto & Gu (2021) Fujimoto, S. and Gu, S. S. A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145, 2021.
  • Fujimoto et al. (2019) Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pp.  2052–2062. PMLR, 2019.
  • Hao Liu (2023) Liu, H. and Abbeel, P. Emergent agentic transformer from chain of hindsight experience. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), 2023.
  • Ishii et al. (2002) Ishii, S., Yoshida, W., and Yoshimoto, J. Control of exploitation–exploration meta-parameter in reinforcement learning. Neural networks, 15(4-6):665–687, 2002.
  • Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  • Janner et al. (2021) Janner, M., Li, Q., and Levine, S. Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34:1273–1286, 2021.
  • Kumar et al. (2019) Kumar, A., Fu, J., Soh, M., Tucker, G., and Levine, S. Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.
  • Kumar et al. (2020) Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  • Laskin et al. (2023) Laskin, M., Wang, L., Oh, J., Parisotto, E., Spencer, S., Steigerwald, R., Strouse, D., Hansen, S. S., Filos, A., Brooks, E., et al. In-context reinforcement learning with algorithm distillation. In The Eleventh International Conference on Learning Representations, 2023.
  • Lee et al. (2023a) Lee, J. N., Xie, A., Pacchiano, A., Chandak, Y., Finn, C., Nachum, O., and Brunskill, E. Supervised pretraining can learn in-context reinforcement learning. arXiv preprint arXiv:2306.14892, 2023a.
  • Lee et al. (2023b) Lee, J. N., Xie, A., Pacchiano, A., Chandak, Y., Finn, C., Nachum, O., and Brunskill, E. Supervised pretraining can learn in-context reinforcement learning. arXiv preprint arXiv:2306.14892, 2023b.
  • Lee et al. (2022) Lee, K.-H., Nachum, O., Yang, M. S., Lee, L., Freeman, D., Guadarrama, S., Fischer, I., Xu, W., Jang, E., Michalewski, H., et al. Multi-game decision transformers. Advances in Neural Information Processing Systems, 35:27921–27936, 2022.
  • Liu et al. (2023) Liu, H., Sferrazza, C., and Abbeel, P. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676, 2023.
  • Mitchell et al. (2021) Mitchell, E., Rafailov, R., Peng, X. B., Levine, S., and Finn, C. Offline meta-reinforcement learning with advantage weighting. In International Conference on Machine Learning, pp.  7780–7791. PMLR, 2021.
  • Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Nachum et al. (2018) Nachum, O., Gu, S. S., Lee, H., and Levine, S. Data-efficient hierarchical reinforcement learning. Advances in neural information processing systems, 31, 2018.
  • Nichol et al. (2018) Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
  • Pateria et al. (2021) Pateria, S., Subagdja, B., Tan, A.-h., and Quek, C. Hierarchical reinforcement learning: A comprehensive survey. ACM Computing Surveys (CSUR), 54(5):1–35, 2021.
  • Pong et al. (2022) Pong, V. H., Nair, A. V., Smith, L. M., Huang, C., and Levine, S. Offline meta-reinforcement learning with online self-supervision. In International Conference on Machine Learning, pp.  17811–17829. PMLR, 2022.
  • Radford et al. (2018) Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. Improving language understanding by generative pre-training. 2018.
  • Reed et al. (2022) Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. 2018.
  • Sutton et al. (1999a) Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12, 1999a.
  • Sutton et al. (1999b) Sutton, R. S., Precup, D., and Singh, S. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999b.
  • Thorndike (1927) Thorndike, E. L. The law of effect. The American journal of psychology, 39(1/4):212–222, 1927.
  • Torabi et al. (2018) Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. (2016) Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
  • Yu et al. (2021) Yu, T., Kumar, A., Rafailov, R., Rajeswaran, A., Levine, S., and Finn, C. Combo: Conservative offline model-based policy optimization. Advances in neural information processing systems, 34:28954–28967, 2021.
  • Zahavy et al. (2020) Zahavy, T., Xu, Z., Veeriah, V., Hessel, M., Oh, J., van Hasselt, H. P., Silver, D., and Singh, S. A self-tuning actor-critic algorithm. Advances in neural information processing systems, 33:20913–20924, 2020.

Appendix of paper “In-Context Decision Transformer: Reinforcement Learning via Hierarchical Chain-of-Thought”

Appendix A Pseudocode of In-context Decision Transformer

Algorithm 1 In-context Decision Transformer.
1:  Input: A dataset of trajectories, max iterations $M$ at training phase, max episodes $m$ at testing phase, number of trajectories $n$ in the hierarchical chain of experience, number of low-level action steps $c$ per high-level decision
2:  Output: The generated low-level actions
3:  // Training
4:  for $i=1$ to $M$ do
5:     Randomly sample $n$ episodes from the dataset, $s=(\tau^{1}, \tau^{2}, \dots, \tau^{n})$
6:     Sort the $n$ episodes in ascending order of return, $\sum_{t=0}^{T} r_{t}^{1} \leq \sum_{t=0}^{T} r_{t}^{2} \leq \dots \leq \sum_{t=0}^{T} r_{t}^{n}$
7:     Compute returns-to-go $\hat{R}_{t}=\hat{R}_{0}-\sum_{j=0}^{t} r_{j}$ for all steps of each episode, where $\hat{R}_{0}=\sum_{t=0}^{T} r_{t}^{n}$
8:     The Reviewing Decisions module encodes a high-level decision $\mathbf{z}$ every $c$ steps
9:     Concatenate the $n$ episodes as a high-level sequence $s_{h}=(\tau_{h}^{1}, \tau_{h}^{2}, \dots, \tau_{h}^{n})$ based on Equation (4)
10:     Build a low-level sequence every $c$ steps based on Equation (5)
11:     The Making Decisions module predicts the next high-level decision tokens, and then the Decisions to Go module predicts the next $c$ steps of low-level action tokens for each predicted high-level decision token
12:     Train the Reviewing Decisions, Making Decisions, and Decisions to Go modules end-to-end based on the loss of the predicted low-level actions
13:  end for
14:  // Testing
15:  for i=1𝑖1i=1italic_i = 1 to m𝑚mitalic_m do
16:     Start a new episode i𝑖iitalic_i and reset the timestep t=0𝑡0t=0italic_t = 0
17:     while tT𝑡𝑇t\leq Titalic_t ≤ italic_T do
18:        The Making Decisions model generates next high-level decision token ztisuperscriptsubscriptz𝑡𝑖\textbf{z}_{t}^{i}z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT based on the across-episodic context (τhi3,τhi2,τhi1,,R^ti,oti)superscriptsubscript𝜏𝑖3superscriptsubscript𝜏𝑖2superscriptsubscript𝜏𝑖1superscriptsubscript^𝑅𝑡𝑖superscriptsubscript𝑜𝑡𝑖(\tau_{h}^{i-3},\tau_{h}^{i-2},\tau_{h}^{i-1},\dots,\hat{R}_{t}^{i},o_{t}^{i})( italic_τ start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i - 3 end_POSTSUPERSCRIPT , italic_τ start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i - 2 end_POSTSUPERSCRIPT , italic_τ start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i - 1 end_POSTSUPERSCRIPT , … , over^ start_ARG italic_R end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ), where τhisuperscriptsubscript𝜏𝑖\tau_{h}^{i}italic_τ start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT is expressed as Equation (4)
19:        for k=0𝑘0k=0italic_k = 0 to c1𝑐1c-1italic_c - 1 do
20:           The Decisions to Go model generates next low-level action at+kisuperscriptsubscript𝑎𝑡𝑘𝑖a_{t+k}^{i}italic_a start_POSTSUBSCRIPT italic_t + italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT based on the previous context (zti,oti,ati,rti,,zti,ot+ki)superscriptsubscriptz𝑡𝑖superscriptsubscript𝑜𝑡𝑖superscriptsubscript𝑎𝑡𝑖superscriptsubscript𝑟𝑡𝑖superscriptsubscriptz𝑡𝑖superscriptsubscript𝑜𝑡𝑘𝑖(\textbf{z}_{t}^{i},o_{t}^{i},a_{t}^{i},r_{t}^{i},\dots,\textbf{z}_{t}^{i},o_{% t+k}^{i})( z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , … , z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_o start_POSTSUBSCRIPT italic_t + italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT )
21:        end for
22:        The Review Decisions model encode the executed decision ztisuperscriptsubscriptz𝑡𝑖\textbf{z}_{t}^{i}z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT from the c𝑐citalic_c steps (oti,ati,,ot+c1i,at+c1i)superscriptsubscript𝑜𝑡𝑖superscriptsubscript𝑎𝑡𝑖superscriptsubscript𝑜𝑡𝑐1𝑖superscriptsubscript𝑎𝑡𝑐1𝑖(o_{t}^{i},a_{t}^{i},\dots,o_{t+c-1}^{i},a_{t+c-1}^{i})( italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , … , italic_o start_POSTSUBSCRIPT italic_t + italic_c - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_t + italic_c - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT )
23:        Compute the sum of c𝑐citalic_c steps rewards r^tisuperscriptsubscript^𝑟𝑡𝑖\hat{r}_{t}^{i}over^ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT and the next returns-to-go R^t+cisuperscriptsubscript^𝑅𝑡𝑐𝑖\hat{R}_{t+c}^{i}over^ start_ARG italic_R end_ARG start_POSTSUBSCRIPT italic_t + italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT
24:        Receive the next observation ot+cisuperscriptsubscript𝑜𝑡𝑐𝑖o_{t+c}^{i}italic_o start_POSTSUBSCRIPT italic_t + italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT
25:        Update the across-episodic context ((τhi3,τhi2,τhi1,,R^ti,oti,zti,r^ti,dti,R^t+ci,ot+ci)((\tau_{h}^{i-3},\tau_{h}^{i-2},\tau_{h}^{i-1},\dots,\hat{R}_{t}^{i},o_{t}^{i}% ,\textbf{z}_{t}^{i},\hat{r}_{t}^{i},d_{t}^{i},\hat{R}_{t+c}^{i},o_{t+c}^{i})( ( italic_τ start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i - 3 end_POSTSUPERSCRIPT , italic_τ start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i - 2 end_POSTSUPERSCRIPT , italic_τ start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i - 1 end_POSTSUPERSCRIPT , … , over^ start_ARG italic_R end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_o start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , over^ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , over^ start_ARG italic_R end_ARG start_POSTSUBSCRIPT italic_t + italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_o start_POSTSUBSCRIPT italic_t + italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT )
26:        Update time step t=t+c𝑡𝑡𝑐t=t+citalic_t = italic_t + italic_c
27:     end while
28:  end for

In Algorithm 1, we introduce the training and testing process of IDT. At each iteration, we first construct a sequence consisting of high-level decisions, as described in lines 5-9. Importantly, high-level decisions in the dataset are encoded by the Reviewing Decisions module (line 8). In addition, each high-level decision corresponds to a short sequence of $c$ low-level action steps, as described in line 10. Based on the constructed sequences, the Making Decisions and Decisions to Go modules predict high-level decisions and low-level actions, respectively (line 11). Finally, the low-level actions are evaluated with either cross-entropy loss or mean-squared error, depending on whether the actions are discrete or continuous. The losses from all time steps are averaged and used to update all three modules end-to-end, as described in line 12.
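The data preparation in lines 5-7 and the $c$-step grouping behind lines 8-10 can be illustrated with a minimal sketch. It is not the authors' implementation: episodes are reduced to lists of per-step rewards (observations, actions, and the learned encoder are omitted), and the helper names `sort_by_return`, `returns_to_go`, `segment_rewards`, and `build_chain` are hypothetical.

```python
from typing import List, Tuple

Rewards = List[float]  # per-step rewards of one episode (assumed simplification)


def sort_by_return(episodes: List[Rewards]) -> List[Rewards]:
    """Order the sampled episodes by ascending total return (line 6)."""
    return sorted(episodes, key=sum)


def returns_to_go(rewards: Rewards, target: float) -> List[float]:
    """R_hat_t = R_hat_0 - sum_{j=0}^{t} r_j, following line 7 as written."""
    remaining, out = target, []
    for r in rewards:
        remaining -= r
        out.append(remaining)
    return out


def segment_rewards(rewards: Rewards, c: int) -> List[float]:
    """Sum rewards over consecutive c-step windows, one value per high-level decision."""
    return [sum(rewards[i:i + c]) for i in range(0, len(rewards), c)]


def build_chain(episodes: List[Rewards], c: int) -> List[Tuple[List[float], List[float]]]:
    """Return (returns-to-go, c-step reward sums) for each episode, worst to best."""
    ordered = sort_by_return(episodes)
    target = sum(ordered[-1])  # R_hat_0 is the return of the best sampled episode
    return [(returns_to_go(ep, target), segment_rewards(ep, c)) for ep in ordered]
```

For example, `build_chain([[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 1.0, 0.0]], c=2)` places the lower-return episode first, so the sequence presents improving behavior from left to right.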

During testing, IDT needs to generate low-level actions autoregressively and interact with the environment for $m$ episodes. At step $t$ of episode $i$ (line 18), the Making Decisions module first generates a high-level decision token $\mathbf{z}_t^{i}$ conditioned on the across-episodic context $(\tau_h^{i-3},\tau_h^{i-2},\tau_h^{i-1},\dots,\hat{R}_t^{i},o_t^{i})$, where $\tau_h^{i}$ is expressed as Equation (4). Then, the Decisions to Go module generates the following $c$ steps of low-level actions $(a_t,\dots,a_{t+c-1})$ autoregressively, as described in lines 19-21. Unlike training, the Reviewing Decisions module encodes the executed decision from the actions generated by the Decisions to Go module. The encoded decision then serves as a condition for generating the next high-level decision $\mathbf{z}_{t+c}^{i}$, as described in lines 22-26.
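The control flow of one test episode (lines 16-27 of Algorithm 1) can be sketched as follows. This is an illustrative skeleton under stated assumptions rather than the released code: the three modules are passed in as plain callables, `CountdownEnv` is a hypothetical toy environment that pays a reward of 1.0 per step, and the loop ends on the environment's `done` flag instead of the `t <= T` test.

```python
from typing import Any, Callable, List, Tuple


class CountdownEnv:
    """Hypothetical toy environment: reward 1.0 per step, fixed horizon."""

    def __init__(self, horizon: int) -> None:
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: int) -> Tuple[int, float, bool]:
        self.t += 1
        return self.t, 1.0, self.t >= self.horizon


def run_episode(
    env: CountdownEnv,
    make_decision: Callable[[List[Any], float, int], Any],
    decision_to_go: Callable[[Any, List[Any], int], int],
    review: Callable[[List[Any]], Any],
    context: List[Any],
    target_return: float,
    c: int,
) -> List[Any]:
    """Roll out one episode, appending one high-level entry per c low-level steps."""
    obs, rtg, done = env.reset(), target_return, False
    while not done:
        z = make_decision(context, rtg, obs)            # line 18: one high-level call
        start_obs, segment, seg_reward = obs, [], 0.0
        for _ in range(c):                              # lines 19-21: up to c low-level steps
            action = decision_to_go(z, segment, obs)
            next_obs, reward, done = env.step(action)
            segment.append((obs, action, reward))
            seg_reward += reward
            obs = next_obs
            if done:
                break
        z_executed = review(segment)                    # line 22: encode what was executed
        context.append((rtg, start_obs, z_executed, seg_reward, done))  # line 25
        rtg -= seg_reward                               # line 23: next returns-to-go
    return context
```

With a 25-step horizon and `c=10`, the high-level module is queried three times (at observations 0, 10, and 20) while 25 low-level actions are executed, which is the source of the shorter across-episodic context.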

Appendix B Experimental Details

Source code is available here.

Compute. Experiments are carried out on NVIDIA GeForce RTX 3090 GPUs and NVIDIA A10 GPUs, with Intel(R) Xeon(R) Gold 6230 CPUs @ 2.10GHz. Because our memory is insufficient to support AT training on D4RL tasks, we refer to the results of the original paper. In contrast, our method has lower memory requirements because it naturally shortens the across-episodic contexts.

Hyperparameters. The default across-episodic context consists of four trajectories unless mentioned otherwise. In D4RL and Large Grid World, the Decisions to Go module generates $c=10$ steps of low-level actions for each high-level decision generated by the Making Decisions module. In conventional Grid World, we set $c=5$ because the tasks are too short. Apart from having independent parameters and different input and output dimensions, the three modules in IDT follow the same architecture. Table 2 summarizes the hyperparameters used in our IDT model.

Table 2: Hyperparameters of IDT.
| Phase | Hyperparameter | Value |
|---|---|---|
| | Number of layers | 3 |
| | Number of attention heads | 3 |
| | Embedding dimension | 128 |
| | Activation function | ReLU |
| | $c$ steps controlled by one high-level decision | 10 (D4RL and Large Grid World); 5 (Grid World) |
| Training | Batch size | 64 |
| Training | Dropout | 0.1 |
| Training | Learning rate | 1e-4 |
| Training | Learning rate decay | Linear warmup for 1e5 steps |
| Training | Grad norm clip | 0.25 |
| Training | Weight decay | 1e-4 |
| Training | Number of trajectories to form across-episodic contexts $n$ | 4 ((Large) Dark Key-to-Door); 10 (other tasks in Grid World); 4 (D4RL) |
| Testing | Target return for HalfCheetah | 12000 |
| Testing | Target return for Hopper | 3600 |
| Testing | Target return for Walker | 5000 |
| Testing | Target return for Darkroom | 20 |
| Testing | Target return for Darkroom Hard | 1 |
| Testing | Target return for Darkroom Dynamic | 20 |
| Testing | Target return for Darkroom Key-to-Door | 2 |
| Testing | Target return for Large Darkroom | 15 |
| Testing | Target return for Large Darkroom Hard | 1 |
| Testing | Target return for Large Darkroom Dynamic | 15 |
| Testing | Target return for Large Darkroom Key-to-Door | 2 |
| Testing | Number of trajectories to form across-episodic contexts $n$ | 4 ((Large) Dark Key-to-Door); 10 (other tasks in Grid World); 4 (D4RL) |
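The task-dependent entries of Table 2 can be collected in a small lookup, sketched below. The `TaskConfig` dataclass and `task_config` helper are hypothetical names introduced only for illustration; the values are copied from the table, and the task keys follow the "Target return for ..." rows.

```python
from dataclasses import dataclass

D4RL_TASKS = ("HalfCheetah", "Hopper", "Walker")

# Target returns copied from Table 2.
TARGET_RETURN = {
    "HalfCheetah": 12000, "Hopper": 3600, "Walker": 5000,
    "Darkroom": 20, "Darkroom Hard": 1, "Darkroom Dynamic": 20,
    "Darkroom Key-to-Door": 2,
    "Large Darkroom": 15, "Large Darkroom Hard": 1, "Large Darkroom Dynamic": 15,
    "Large Darkroom Key-to-Door": 2,
}


@dataclass(frozen=True)
class TaskConfig:
    task: str
    c: int              # low-level steps per high-level decision
    n: int              # trajectories in the across-episodic context
    target_return: int


def task_config(task: str) -> TaskConfig:
    """Look up c, n, and the target return for a task listed in Table 2."""
    if task not in TARGET_RETURN:
        raise KeyError(f"unknown task: {task}")
    is_d4rl = task in D4RL_TASKS
    is_large = task.startswith("Large ")
    c = 10 if (is_d4rl or is_large) else 5                       # 5 only for small Grid World
    n = 4 if (is_d4rl or task.endswith("Key-to-Door")) else 10   # 10 for other Grid World tasks
    return TaskConfig(task, c, n, TARGET_RETURN[task])
```

For instance, `task_config("Darkroom")` yields `c=5` and `n=10`, while `task_config("Hopper")` yields `c=10` and `n=4`.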

Appendix C Additional Experimental Results

Figure 6: Parameter sensitivity analysis of $c$. (a) IDT maintains stable performance under changes in $c$ in dense-reward D4RL tasks. (b) As $c$ increases, it becomes easier for the model to receive positive feedback and discover the target location in sparse-reward Grid World tasks.
Figure 7: Context size: IDT in Darkroom with different context sizes. IDT emerges with trial-and-error ability once the context size is large enough and across-episodic.
Table 3: Results for training and testing times. We report the training time per 10k gradient updates and the testing time for 50 episodes on Grid World and 10 episodes on D4RL. Multipliers in parentheses give the baseline testing time relative to ours. As the task length increases, the context length of the baselines is forced to grow with it, resulting in a quadratic increase in computational costs. In contrast, IDT completes trial-and-error on high-level decisions with contexts smaller than one episode length, significantly reducing computational costs.
| Context size (steps) | Task | Training AT (h) | Training AD (h) | Training Ours (h) | Testing AT (min) | Testing AD (min) | Testing Ours (min) |
|---|---|---|---|---|---|---|---|
| 200 | Darkroom | 0.27 | 0.23 | 0.21 | 0.62 | 0.61 | 0.65 |
| 200 | Darkroom Hard | 0.29 | 0.28 | 0.20 | 0.59 | 0.56 | 0.58 |
| 200 | Darkroom Dynamic | 0.33 | 0.31 | 0.21 | 0.65 | 0.62 | 0.67 |
| 200 | Dark Key-to-Door | 1.12 | 1.01 | 0.44 | 1.89 | 1.50 | 1.52 |
| 2000 | Large Darkroom | 5.09 | 4.70 | 2.49 | 67.22 (13×) | 45.08 (9×) | 5.27 |
| 2000 | Large Darkroom Hard | 6.48 | 6.69 | 2.93 | 66.81 (11×) | 44.96 (7×) | 6.09 |
| 2000 | Large Darkroom Dynamic | 5.71 | 5.84 | 2.73 | 62.06 (11×) | 42.12 (8×) | 5.51 |
| 2000 | Large Dark Key-to-Door | 18.87 | 18.23 | 3.06 | 167.07 (27×) | 76.79 (12×) | 6.18 |
| 4000 | HalfCheetah | 36.18 | 37.10 | 21.90 | 234.20 (37×) | 173.11 (28×) | 6.29 |
| 4000 | Walker | 32.82 | 33.77 | 20.08 | 233.18 (36×) | 172.34 (26×) | 6.51 |
| 4000 | Hopper | 24.08 | 22.23 | 12.99 | 232.82 (35×) | 172.92 (26×) | 6.56 |
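The multipliers in parentheses in Table 3 can be reproduced by dividing each baseline testing time by ours and rounding to the nearest integer. The short check below copies the testing times from the table; the names `TEST_MINUTES` and `rounded_speedups` are introduced here only for illustration.

```python
from typing import Dict, Tuple

# Testing times in minutes copied from Table 3: (AT, AD, Ours).
TEST_MINUTES: Dict[str, Tuple[float, float, float]] = {
    "Large Darkroom": (67.22, 45.08, 5.27),
    "Large Darkroom Hard": (66.81, 44.96, 6.09),
    "Large Darkroom Dynamic": (62.06, 42.12, 5.51),
    "Large Dark Key-to-Door": (167.07, 76.79, 6.18),
    "HalfCheetah": (234.20, 173.11, 6.29),
    "Walker": (233.18, 172.34, 6.51),
    "Hopper": (232.82, 172.92, 6.56),
}


def rounded_speedups(times: Dict[str, Tuple[float, float, float]]) -> Dict[str, Tuple[int, int]]:
    """Return (AT / Ours, AD / Ours) per task, rounded to the nearest integer."""
    return {
        task: (round(at / ours), round(ad / ours))
        for task, (at, ad, ours) in times.items()
    }


if __name__ == "__main__":
    for task, (at_x, ad_x) in rounded_speedups(TEST_MINUTES).items():
        print(f"{task}: AT {at_x}x, AD {ad_x}x")
```

The largest ratios over AT are 27 on Grid World (Large Dark Key-to-Door) and 35 to 37 on D4RL.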

Parameter Sensitivity Analysis of $c$.   An important insight of IDT is that one high-level decision can guide $c$ steps of low-level actions. We investigate whether the size of $c$ affects the performance of IDT by testing $c=1,5,10,15,20$ in both D4RL and Grid World. As shown in Figure 6, IDT maintains stable performance under changes in $c$ in D4RL tasks. In contrast, larger $c$ achieves better performance in Grid World. This is because Grid World is designed for tasks with sparse rewards, where the agent needs to rely on rewards to reason about the target location. As $c$ increases, it becomes easier for the model to receive positive feedback and discover the target location.
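Besides performance, $c$ also determines how many high-level decisions the across-episodic context must hold. The sketch below counts decisions only, as a rough illustration: it assumes 1000-step D4RL episodes (the 4000-step context in Table 3 divided by $n=4$ trajectories), ignores that each step or decision expands into several tokens, and uses the quadratic scaling of self-attention with sequence length as a crude cost proxy. The function names are hypothetical.

```python
import math
from typing import Tuple


def context_lengths(episode_len: int, n_episodes: int, c: int) -> Tuple[int, int]:
    """Return (low-level steps, high-level decisions) held in the context."""
    low_level = n_episodes * episode_len
    high_level = n_episodes * math.ceil(episode_len / c)
    return low_level, high_level


def attention_cost_ratio(episode_len: int, n_episodes: int, c: int) -> float:
    """Rough ratio of quadratic attention cost, low-level versus high-level context."""
    low_level, high_level = context_lengths(episode_len, n_episodes, c)
    return (low_level / high_level) ** 2


if __name__ == "__main__":
    for c in (1, 5, 10, 15, 20):
        low, high = context_lengths(1000, 4, c)
        print(f"c={c}: {low} low-level steps vs {high} high-level decisions")
```

Under these assumptions, the default $c=10$ shrinks the context from 4000 steps to 400 decisions. The measured speedups in Table 3 are smaller than the resulting quadratic proxy because the low-level module still runs at every step.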

What Context Size is Required for IDT? Similar to other in-context RL methods, we also test what context size is required for IDT to exhibit trial-and-error ability. As shown in Figure 7, multi-episodic contexts of 4 episodes are necessary to learn a near-optimal IDT. When the context size is roughly the length of an episode, IDT begins to show self-improvement. The reason for this is likely that the context is large enough to retain across-episodic information: for example, at the start of a new episode, the context will be filled with transitions from most of the previous episode.