-
Self-Consistent Models and Values
Authors:
Gregory Farquhar,
Kate Baumli,
Zita Marinho,
Angelos Filos,
Matteo Hessel,
Hado van Hasselt,
David Silver
Abstract:
Learned models of the environment provide reinforcement learning (RL) agents with flexible ways of making predictions about the environment. In particular, models enable planning, i.e. using more computation to improve value functions or policies, without requiring additional environment interactions. In this work, we investigate a way of augmenting model-based RL, by additionally encouraging a learned model and value function to be jointly \emph{self-consistent}. Our approach differs from classic planning methods such as Dyna, which only update values to be consistent with the model. We propose multiple self-consistency updates, evaluate these in both tabular and function approximation settings, and find that, with appropriate choices, self-consistency helps both policy evaluation and control.
Submitted 25 October, 2021;
originally announced October 2021.
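The Dyna-style direction the abstract contrasts against, updating values to be consistent with a model, can be sketched in the tabular case. This is only a minimal illustration: the two-state model `P`, `r`, the function name `plan_values`, and the discount are hypothetical, and the sketch does not implement the paper's self-consistency updates, which also adjust the model.

```python
def plan_values(P, r, gamma=0.9, sweeps=100):
    """Repeat the backup v <- r + gamma * P v using only the model.

    P[s][t] is the model's probability of moving from state s to state t,
    r[s] is the model's expected reward in state s. No environment
    interaction is needed: this is planning in the abstract's sense.
    """
    n = len(r)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n))
             for s in range(n)]
    return v


# Hypothetical learned model: state 0 always moves to state 1 with
# expected reward 1; state 1 is absorbing with reward 0.
P = [[0.0, 1.0],
     [0.0, 1.0]]
r = [1.0, 0.0]
values = plan_values(P, r)
```

Here the values are changed to fit a fixed model; a jointly self-consistent scheme would let the mismatch between the two drive updates to both.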
-
Emphatic Algorithms for Deep Reinforcement Learning
Authors:
Ray Jiang,
Tom Zahavy,
Zhongwen Xu,
Adam White,
Matteo Hessel,
Charles Blundell,
Hado van Hasselt
Abstract:
Off-policy learning allows us to learn about possible policies of behavior from experience generated by a different behavior policy. Temporal difference (TD) learning algorithms can become unstable when combined with function approximation and off-policy sampling - this is known as the "deadly triad". The emphatic temporal difference (ETD($λ$)) algorithm ensures convergence in the linear case by appropriately weighting the TD($λ$) updates. In this paper, we extend the use of emphatic methods to deep reinforcement learning agents. We show that naively adapting ETD($λ$) to popular deep reinforcement learning algorithms, which use forward view multi-step returns, results in poor performance. We then derive new emphatic algorithms for use in the context of such algorithms, and we demonstrate that they provide noticeable benefits in small problems designed to highlight the instability of TD methods. Finally, we observed improved performance when applying these algorithms at scale on classic Atari games from the Arcade Learning Environment.
Submitted 21 June, 2021;
originally announced June 2021.
-
Switchable induced-transmission filters enabled by vanadium dioxide
Authors:
Chenghao Wan,
David Woolf,
Colin M. Hessel,
Jad Salman,
Yuzhe Xiao,
Chunhui Yao,
Albert Wright,
Joel M. Hensley,
Mikhail A. Kats
Abstract:
An induced-transmission filter (ITF) uses an ultrathin layer of metal positioned at an electric-field node within a dielectric thin-film bandpass filter to select one transmission band while suppressing other transmission bands that would have been present without the metal layer. Here, we introduce a switchable mid-infrared ITF where the metal film can be "switched on and off", enabling the modulation of the filter response from single-band to multiband. The switching is enabled by a deeply subwavelength film of vanadium dioxide (VO2), which undergoes a reversible insulator-to-metal phase transition. We designed and experimentally demonstrated an ITF that can switch between two states: one broad passband across the long-wave infrared (LWIR, 8 - 12 μm) and one narrow passband at ~8.8 μm. Our work generalizes the ITF -- previously a niche type of bandpass filter -- into a new class of tunable devices. Furthermore, our unique fabrication process -- which begins with thin-film VO2 on a suspended membrane -- enables the integration of VO2 into any thin-film assembly that is compatible with physical vapor deposition (PVD) processes, and is thus a new platform for realizing tunable thin-film filters.
Submitted 13 June, 2021;
originally announced June 2021.
-
Podracer architectures for scalable Reinforcement Learning
Authors:
Matteo Hessel,
Manuel Kroiss,
Aidan Clark,
Iurii Kemaev,
John Quan,
Thomas Keck,
Fabio Viola,
Hado van Hasselt
Abstract:
Supporting state-of-the-art AI research requires balancing rapid prototyping, ease of use, and quick iteration, with the ability to deploy experiments at a scale traditionally associated with production systems. Deep learning frameworks such as TensorFlow, PyTorch and JAX allow users to transparently make use of accelerators, such as TPUs and GPUs, to offload the more computationally intensive parts of training and inference in modern deep learning systems. Popular training pipelines that use these frameworks for deep learning typically focus on (un-)supervised learning. How to best train reinforcement learning (RL) agents at scale is still an active research area. In this report we argue that TPUs are particularly well suited for training RL agents in a scalable, efficient and reproducible way. Specifically we describe two architectures designed to make the best use of the resources available on a TPU Pod (a special configuration in a Google data center that features multiple TPU devices connected to each other by extremely low latency communication channels).
Submitted 13 April, 2021;
originally announced April 2021.
-
Muesli: Combining Improvements in Policy Optimization
Authors:
Matteo Hessel,
Ivo Danihelka,
Fabio Viola,
Arthur Guez,
Simon Schmitt,
Laurent Sifre,
Theophane Weber,
David Silver,
Hado van Hasselt
Abstract:
We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero's state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by extensive ablations, and by additional results on continuous control and 9x9 Go.
Submitted 31 March, 2022; v1 submitted 13 April, 2021;
originally announced April 2021.
-
Discovery of Options via Meta-Learned Subgoals
Authors:
Vivek Veeriah,
Tom Zahavy,
Matteo Hessel,
Zhongwen Xu,
Junhyuk Oh,
Iurii Kemaev,
Hado van Hasselt,
David Silver,
Satinder Singh
Abstract:
Temporal abstractions in the form of options have been shown to help reinforcement learning (RL) agents learn faster. However, despite prior work on this topic, the problem of discovering options through interaction with an environment remains a challenge. In this paper, we introduce a novel meta-gradient approach for discovering useful options in multi-task RL environments. Our approach is based on a manager-worker decomposition of the RL agent, in which a manager maximises rewards from the environment by learning a task-dependent policy over both a set of task-independent discovered-options and primitive actions. The option-reward and termination functions that define a subgoal for each option are parameterised as neural networks and trained via meta-gradients to maximise their usefulness. Empirical analysis on gridworld and DeepMind Lab tasks shows that: (1) our approach can discover meaningful and diverse temporally-extended options in multi-task RL domains, (2) the discovered options are frequently used by the agent while learning to solve the training tasks, and (3) the discovered options help a randomly initialised manager learn faster in completely new tasks.
Submitted 12 February, 2021;
originally announced February 2021.
-
Discovering Reinforcement Learning Algorithms
Authors:
Junhyuk Oh,
Matteo Hessel,
Wojciech M. Czarnecki,
Zhongwen Xu,
Hado van Hasselt,
Satinder Singh,
David Silver
Abstract:
Reinforcement learning (RL) algorithms update an agent's parameters according to one of several possible rules, discovered manually through years of research. Automating the discovery of update rules from data could lead to more efficient algorithms, or algorithms that are better adapted to specific environments. Although there have been prior attempts at addressing this significant scientific challenge, it remains an open question whether it is feasible to discover alternatives to fundamental concepts of RL such as value functions and temporal-difference learning. This paper introduces a new meta-learning approach that discovers an entire update rule which includes both 'what to predict' (e.g. value functions) and 'how to learn from it' (e.g. bootstrapping) by interacting with a set of environments. The output of this method is an RL algorithm that we call Learned Policy Gradient (LPG). Empirical results show that our method discovers its own alternative to the concept of value functions. Furthermore it discovers a bootstrapping mechanism to maintain and use its predictions. Surprisingly, when trained solely on toy environments, LPG generalises effectively to complex Atari games and achieves non-trivial performance. This shows the potential to discover general RL algorithms from data.
Submitted 5 January, 2021; v1 submitted 17 July, 2020;
originally announced July 2020.
-
Meta-Gradient Reinforcement Learning with an Objective Discovered Online
Authors:
Zhongwen Xu,
Hado van Hasselt,
Matteo Hessel,
Junhyuk Oh,
Satinder Singh,
David Silver
Abstract:
Deep reinforcement learning includes a broad family of algorithms that parameterise an internal representation, such as a value function or policy, by a deep neural network. Each algorithm optimises its parameters with respect to an objective, such as Q-learning or policy gradient, that defines its semantics. In this work, we propose an algorithm based on meta-gradient descent that discovers its own objective, flexibly parameterised by a deep neural network, solely from interactive experience with its environment. Over time, this allows the agent to learn how to learn increasingly effectively. Furthermore, because the objective is discovered online, it can adapt to changes over time. We demonstrate that the algorithm discovers how to address several important issues in RL, such as bootstrapping, non-stationarity, and off-policy learning. On the Atari Learning Environment, the meta-gradient algorithm adapts over time to learn with greater efficiency, eventually outperforming the median score of a strong actor-critic baseline.
Submitted 16 July, 2020;
originally announced July 2020.
-
Expected Eligibility Traces
Authors:
Hado van Hasselt,
Sephora Madjiheurem,
Matteo Hessel,
David Silver,
André Barreto,
Diana Borsa
Abstract:
The question of how to determine which states and actions are responsible for a certain outcome is known as the credit assignment problem and remains a central research question in reinforcement learning and artificial intelligence. Eligibility traces enable efficient credit assignment to the recent sequence of states and actions experienced by the agent, but not to counterfactual sequences that could also have led to the current state. In this work, we introduce expected eligibility traces. Expected traces allow, with a single update, to update states and actions that could have preceded the current state, even if they did not do so on this occasion. We discuss when expected traces provide benefits over classic (instantaneous) traces in temporal-difference learning, and show that sometimes substantial improvements can be attained. We provide a way to smoothly interpolate between instantaneous and expected traces by a mechanism similar to bootstrapping, which ensures that the resulting algorithm is a strict generalisation of TD($λ$). Finally, we discuss possible extensions and connections to related ideas, such as successor features.
Submitted 8 February, 2021; v1 submitted 3 July, 2020;
originally announced July 2020.
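For orientation, the classic (instantaneous) accumulating trace that the abstract contrasts with expected traces can be sketched for tabular TD($λ$) as follows. This is the standard textbook baseline, not the paper's expected-trace algorithm; the function name `td_lambda`, the episode format, and the step-size values are assumptions made for illustration.

```python
def td_lambda(episode, n_states, alpha=0.1, gamma=1.0, lam=0.9):
    """Tabular TD(lambda) with accumulating traces over one episode.

    episode is a list of (state, reward, next_state) tuples, with
    next_state set to None on the terminal transition (assumed format).
    """
    v = [0.0] * n_states   # value estimates
    e = [0.0] * n_states   # instantaneous eligibility trace
    for s, reward, s_next in episode:
        # Decay all traces, then mark the state that was just visited.
        e = [gamma * lam * x for x in e]
        e[s] += 1.0
        bootstrap = 0.0 if s_next is None else v[s_next]
        delta = reward + gamma * bootstrap - v[s]
        # A single TD error updates every recently visited state.
        v = [v_i + alpha * delta * e_i for v_i, e_i in zip(v, e)]
    return v


# Three-state chain with a reward of 1 on the final transition.
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]
values = td_lambda(episode, n_states=3)
```

Only states actually visited in this episode receive credit; states that could have led to the same outcome but were not visited keep a zero trace, which is the limitation the abstract describes.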
-
A Self-Tuning Actor-Critic Algorithm
Authors:
Tom Zahavy,
Zhongwen Xu,
Vivek Veeriah,
Matteo Hessel,
Junhyuk Oh,
Hado van Hasselt,
David Silver,
Satinder Singh
Abstract:
Reinforcement learning algorithms are highly sensitive to the choice of hyperparameters, typically requiring significant manual effort to identify hyperparameters that perform well on a new domain. In this paper, we take a step towards addressing this issue by using metagradients to automatically adapt hyperparameters online by meta-gradient descent (Xu et al., 2018). We apply our algorithm, Self-Tuning Actor-Critic (STAC), to self-tune all the differentiable hyperparameters of an actor-critic loss function, to discover auxiliary tasks, and to improve off-policy learning using a novel leaky V-trace operator. STAC is simple to use, sample efficient and does not require a significant increase in compute. Ablative studies show that the overall performance of STAC improved as we adapted more hyperparameters. When applied to the Arcade Learning Environment (Bellemare et al., 2012), STAC improved the median human normalized score in 200M steps from 243% to 364%. When applied to the DM Control suite (Tassa et al., 2018), STAC improved the mean score in 30M steps from 217 to 389 when learning with features, from 108 to 202 when learning from pixels, and from 195 to 295 in the Real-World Reinforcement Learning Challenge (Dulac-Arnold et al., 2020).
Submitted 14 April, 2021; v1 submitted 28 February, 2020;
originally announced February 2020.
-
What Can Learned Intrinsic Rewards Capture?
Authors:
Zeyu Zheng,
Junhyuk Oh,
Matteo Hessel,
Zhongwen Xu,
Manuel Kroiss,
Hado van Hasselt,
David Silver,
Satinder Singh
Abstract:
The objective of a reinforcement learning agent is to behave so as to maximise the sum of a suitable scalar function of state: the reward. These rewards are typically given and immutable. In this paper, we instead consider the proposition that the reward function itself can be a good locus of learned knowledge. To investigate this, we propose a scalable meta-gradient framework for learning useful intrinsic reward functions across multiple lifetimes of experience. Through several proof-of-concept experiments, we show that it is feasible to learn and capture knowledge about long-term exploration and exploitation into a reward function. Furthermore, we show that unlike policy transfer methods that capture "how" the agent should behave, the learned reward functions can generalise to other kinds of agents and to changes in the dynamics of the environment by capturing "what" the agent should strive to do.
Submitted 21 August, 2020; v1 submitted 11 December, 2019;
originally announced December 2019.
-
Off-Policy Actor-Critic with Shared Experience Replay
Authors:
Simon Schmitt,
Matteo Hessel,
Karen Simonyan
Abstract:
We investigate the combination of actor-critic reinforcement learning algorithms with uniform large-scale experience replay and propose solutions for two challenges: (a) efficient actor-critic learning with experience replay, and (b) stability of off-policy learning where agents learn from other agents' behaviour. We employ those insights to accelerate hyper-parameter sweeps in which all participating agents run concurrently and share their experience via a common replay module. To this end we analyze the bias-variance tradeoffs in V-trace, a form of importance sampling for actor-critic methods. Based on our analysis, we then argue for mixing experience sampled from replay with on-policy experience, and propose a new trust region scheme that scales effectively to data distributions where V-trace becomes unstable. We provide extensive empirical validation of the proposed solution. We further show the benefits of this setup by demonstrating state-of-the-art data efficiency on Atari among agents trained up until 200M environment frames.
Submitted 18 November, 2019; v1 submitted 25 September, 2019;
originally announced September 2019.
-
Discovery of Useful Questions as Auxiliary Tasks
Authors:
Vivek Veeriah,
Matteo Hessel,
Zhongwen Xu,
Richard Lewis,
Janarthanan Rajendran,
Junhyuk Oh,
Hado van Hasselt,
David Silver,
Satinder Singh
Abstract:
Arguably, intelligent agents ought to be able to discover their own questions so that in learning answers for them they learn unanticipated useful knowledge and skills; this departs from the focus in much of machine learning on agents learning answers to externally defined questions. We present a novel method for a reinforcement learning (RL) agent to discover questions formulated as general value functions or GVFs, a fairly rich form of knowledge representation. Specifically, our method uses non-myopic meta-gradients to learn GVF-questions such that learning answers to them, as an auxiliary task, induces useful representations for the main task faced by the RL agent. We demonstrate that auxiliary tasks based on the discovered GVFs are sufficient, on their own, to build representations that support main task learning, and that they do so better than popular hand-designed auxiliary tasks from the literature. Furthermore, we show, in the context of Atari 2600 videogames, how such auxiliary tasks, meta-learned alongside the main task, can improve the data efficiency of an actor-critic agent.
Submitted 10 September, 2019;
originally announced September 2019.
-
Behaviour Suite for Reinforcement Learning
Authors:
Ian Osband,
Yotam Doron,
Matteo Hessel,
John Aslanides,
Eren Sezener,
Andre Saraiva,
Katrina McKinney,
Tor Lattimore,
Csaba Szepesvari,
Satinder Singh,
Benjamin Van Roy,
Richard Sutton,
David Silver,
Hado Van Hasselt
Abstract:
This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives. First, to collect clear, informative and scalable problems that capture key issues in the design of general and efficient learning algorithms. Second, to study agent behaviour through their performance on these shared benchmarks. To complement this effort, we open source github.com/deepmind/bsuite, which automates evaluation and analysis of any agent on bsuite. This library facilitates reproducible and accessible research on the core issues in RL, and ultimately the design of superior learning algorithms. Our code is Python, and easy to use within existing projects. We include examples with OpenAI Baselines, Dopamine as well as new reference implementations. Going forward, we hope to incorporate more excellent experiments from the research community, and commit to a periodic review of bsuite from a committee of prominent researchers.
Submitted 14 February, 2020; v1 submitted 9 August, 2019;
originally announced August 2019.
-
General non-linear Bellman equations
Authors:
Hado van Hasselt,
John Quan,
Matteo Hessel,
Zhongwen Xu,
Diana Borsa,
Andre Barreto
Abstract:
We consider a general class of non-linear Bellman equations. These open up a design space of algorithms that have interesting properties, which has two potential advantages. First, we can perhaps better model natural phenomena. For instance, hyperbolic discounting has been proposed as a mathematical model that matches human and animal data well, and can therefore be used to explain preference orderings. We present a different mathematical model that matches the same data, but that makes very different predictions under other circumstances. Second, the larger design space can perhaps lead to algorithms that perform better, similar to how discount factors are often used in practice even when the true objective is undiscounted. We show that many of the resulting Bellman operators still converge to a fixed point, and therefore that the resulting algorithms are reasonable and inherit many beneficial properties of their linear counterparts.
Submitted 8 July, 2019;
originally announced July 2019.
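The remark that hyperbolic discounting can explain preference orderings can be made concrete with a small worked comparison. The sketch below uses the common hyperbolic form value / (1 + k * delay) against exponential discounting; the reward sizes, delays, and the parameters `k` and `gamma` are assumptions chosen for illustration and are not taken from the paper, and the paper's own alternative model is not shown.

```python
def hyperbolic(value, delay, k=1.0):
    """Hyperbolically discounted value: value / (1 + k * delay)."""
    return value / (1.0 + k * delay)


def exponential(value, delay, gamma=0.8):
    """Exponentially discounted value: value * gamma ** delay."""
    return value * gamma ** delay


def prefers_later(discount, delay):
    """True if a reward of 15 at delay + 2 is valued above 10 at delay."""
    return discount(15.0, delay + 2) > discount(10.0, delay)


# Compare the same choice when it is immediate and when both options
# are pushed 10 steps into the future.
hyper_now, hyper_far = prefers_later(hyperbolic, 0), prefers_later(hyperbolic, 10)
exp_now, exp_far = prefers_later(exponential, 0), prefers_later(exponential, 10)
```

Under the hyperbolic form the preference flips from the smaller-sooner reward to the larger-later one as both are delayed, whereas exponential discounting gives the same ordering at every delay because the ratio of the two discounted values does not depend on the common delay.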
-
On Inductive Biases in Deep Reinforcement Learning
Authors:
Matteo Hessel,
Hado van Hasselt,
Joseph Modayil,
David Silver
Abstract:
Many deep reinforcement learning algorithms contain inductive biases that sculpt the agent's objective and its interface to the environment. These inductive biases can take many forms, including domain knowledge and pretuned hyper-parameters. In general, there is a trade-off between generality and performance when algorithms use such biases. Stronger biases can lead to faster learning, but weaker biases can potentially lead to more general algorithms. This trade-off is important because inductive biases are not free; substantial effort may be required to obtain relevant domain knowledge or to tune hyper-parameters effectively. In this paper, we re-examine several domain-specific components that bias the objective and the environmental interface of common deep reinforcement learning agents. We investigated whether the performance deteriorates when these components are replaced with adaptive solutions from the literature. In our experiments, performance sometimes decreased with the adaptive components, as one might expect when comparing to components crafted for the domain, but sometimes the adaptive components performed better. We investigated the main benefit of having fewer domain-specific components, by comparing the learning performance of the two systems on a different set of continuous control problems, without additional tuning of either system. As hypothesized, the system with adaptive components performed better on many of the new tasks.
Submitted 5 July, 2019;
originally announced July 2019.
-
When to use parametric models in reinforcement learning?
Authors:
Hado van Hasselt,
Matteo Hessel,
John Aslanides
Abstract:
We examine the question of when and how parametric models are most useful in reinforcement learning. In particular, we look at commonalities and differences between parametric models and experience replay. Replay-based learning algorithms share important traits with model-based approaches, including the ability to plan: to use more computation without additional data to improve predictions and behaviour. We discuss when to expect benefits from either approach, and interpret prior work in this context. We hypothesise that, under suitable conditions, replay-based algorithms should be competitive with or better than model-based algorithms if the model is used only to generate fictional transitions from observed states for an update rule that is otherwise model-free. We validated this hypothesis on Atari 2600 video games. The replay-based algorithm attained state-of-the-art data efficiency, improving over prior results with parametric models.
Submitted 12 June, 2019;
originally announced June 2019.
-
Transfer in Deep Reinforcement Learning Using Successor Features and Generalised Policy Improvement
Authors:
André Barreto,
Diana Borsa,
John Quan,
Tom Schaul,
David Silver,
Matteo Hessel,
Daniel Mankowitz,
Augustin Žídek,
Rémi Munos
Abstract:
The ability to transfer skills across tasks has the potential to scale up reinforcement learning (RL) agents to environments currently out of reach. Recently, a framework based on two ideas, successor features (SFs) and generalised policy improvement (GPI), has been introduced as a principled way of transferring skills. In this paper we extend the SFs & GPI framework in two ways. One of the basic assumptions underlying the original formulation of SFs & GPI is that rewards for all tasks of interest can be computed as linear combinations of a fixed set of features. We relax this constraint and show that the theoretical guarantees supporting the framework can be extended to any set of tasks that only differ in the reward function. Our second contribution is to show that one can use the reward functions themselves as features for future tasks, without any loss of expressiveness, thus removing the need to specify a set of features beforehand. This makes it possible to combine SFs & GPI with deep learning in a more stable way. We empirically verify this claim on a complex 3D environment where observations are images from a first-person perspective. We show that the transfer promoted by SFs & GPI leads to very good policies on unseen tasks almost instantaneously. We also describe how to learn policies specialised to the new tasks in a way that allows them to be added to the agent's set of skills, and thus be reused in the future.
Submitted 30 January, 2019;
originally announced January 2019.
-
Optical properties of thin-film vanadium dioxide from the visible to the far infrared
Authors:
Chenghao Wan,
Zhen Zhang,
David Woolf,
Colin M. Hessel,
Jura Rensberg,
Joel M. Hensley,
Yuzhe Xiao,
Alireza Shahsafi,
Jad Salman,
Steffen Richter,
Yifei Sun,
M. Mumtaz Qazilbash,
Rüdiger Schmidt-Grund,
Carsten Ronning,
Shriram Ramanathan,
Mikhail A. Kats
Abstract:
The insulator-to-metal transition (IMT) in vanadium dioxide (VO2) can enable a variety of optics applications, including switching and modulation, optical limiting, and tuning of optical resonators. Despite the widespread interest in optics, the optical properties of VO2 across its IMT are scattered throughout the literature, and are not available in some wavelength regions. We characterized the complex refractive index of VO2 thin films across the IMT for free-space wavelengths from 300 nm to 30 μm, using broadband spectroscopic ellipsometry, reflection spectroscopy, and the application of effective-medium theory.
We studied VO2 thin films of different thickness, on two different substrates (silicon and sapphire), and grown using different synthesis methods (sputtering and sol gel). While there are differences in the optical properties of VO2 synthesized under different conditions, they are relatively minor compared to the change resulting from the IMT, most notably in the ~2 - 11 μm range where the insulating phase of VO2 has relatively low optical loss. We found that the macroscopic optical properties of VO2 are much more robust to sample-to-sample variation compared to the electrical properties, making the refractive-index datasets from this article broadly useful for modeling and design of VO2-based optical and optoelectronic components.
Submitted 8 January, 2019;
originally announced January 2019.
-
Scaling shared model governance via model splitting
Authors:
Miljan Martic,
Jan Leike,
Andrew Trask,
Matteo Hessel,
Shane Legg,
Pushmeet Kohli
Abstract:
Currently the only techniques for sharing governance of a deep learning model are homomorphic encryption and secure multiparty computation. Unfortunately, neither of these techniques is applicable to the training of large neural networks due to their large computational and communication overheads. As a scalable technique for shared model governance, we propose splitting a deep learning model between multiple parties. This paper empirically investigates the security guarantee of this technique, which is introduced as the problem of model completion: Given the entire training data set or an environment simulator, and a subset of the parameters of a trained deep learning model, how much training is required to recover the model's original performance? We define a metric for evaluating the hardness of the model completion problem and study it empirically in both supervised learning on ImageNet and reinforcement learning on Atari and DeepMind Lab. Our experiments show that (1) the model completion problem is harder in reinforcement learning than in supervised learning because of the unavailability of the trained agent's trajectories, and (2) its hardness depends not primarily on the number of parameters of the missing part, but more so on their type and location. Our results suggest that model splitting might be a feasible technique for shared model governance in some settings where training is very expensive.
Submitted 14 December, 2018;
originally announced December 2018.
-
Deep Reinforcement Learning and the Deadly Triad
Authors:
Hado van Hasselt,
Yotam Doron,
Florian Strub,
Matteo Hessel,
Nicolas Sonnerat,
Joseph Modayil
Abstract:
We know from reinforcement learning theory that temporal difference learning can fail in certain cases. Sutton and Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning. When these three properties are combined, learning can diverge with the value estimates becoming unbounded. However, several algorithms successfully combine these three properties, which indicates that there is at least a partial gap in our understanding. In this work, we investigate the impact of the deadly triad in practice, in the context of a family of popular deep reinforcement learning models - deep Q-networks trained with experience replay - analysing how the components of this system play a role in the emergence of the deadly triad, and in the agent's performance.
Submitted 6 December, 2018;
originally announced December 2018.
-
Multi-task Deep Reinforcement Learning with PopArt
Authors:
Matteo Hessel,
Hubert Soyer,
Lasse Espeholt,
Wojciech Czarnecki,
Simon Schmitt,
Hado van Hasselt
Abstract:
The reinforcement learning community has made great strides in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained one task at a time, with each new task requiring a brand new agent instance to be trained. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this work, we study the problem of learning to master not one but multiple sequential-decision tasks at once. A general issue in multi-task learning is that a balance must be found between the needs of multiple tasks competing for the limited resources of a single learning system. Many learning algorithms can get distracted by certain tasks in the set of tasks to solve. Such tasks appear more salient to the learning process, for instance because of the density or magnitude of the in-task rewards. This causes the algorithm to focus on those salient tasks at the expense of generality. We propose to automatically adapt the contribution of each task to the agent's updates, so that all tasks have a similar impact on the learning dynamics. This resulted in state-of-the-art performance on learning to play all games in a set of 57 diverse Atari games. Excitingly, our method learned a single trained policy - with a single set of weights - that exceeds median human performance. To our knowledge, this was the first time a single agent surpassed human-level performance on this multi-task domain. The same approach also demonstrated state-of-the-art performance on a set of 30 tasks in the 3D reinforcement learning platform DeepMind Lab.
Submitted 12 September, 2018;
originally announced September 2018.
-
Observe and Look Further: Achieving Consistent Performance on Atari
Authors:
Tobias Pohlen,
Bilal Piot,
Todd Hester,
Mohammad Gheshlaghi Azar,
Dan Horgan,
David Budden,
Gabriel Barth-Maron,
Hado van Hasselt,
John Quan,
Mel Večerík,
Matteo Hessel,
Rémi Munos,
Olivier Pietquin
Abstract:
Despite significant advances in the field of deep Reinforcement Learning (RL), today's algorithms still fail to learn human-level policies consistently over a set of diverse tasks such as Atari 2600 games. We identify three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently. In this paper, we propose an algorithm that addresses each of these challenges and is able to learn human-level policies on nearly all Atari games. A new transformed Bellman operator allows our algorithm to process rewards of varying densities and scales; an auxiliary temporal consistency loss allows us to train stably using a discount factor of $γ = 0.999$ (instead of $γ = 0.99$), extending the effective planning horizon by an order of magnitude; and we ease the exploration problem by using human demonstrations that guide the agent towards rewarding states. When tested on a set of 42 Atari games, our algorithm exceeds the performance of an average human on 40 games using a common set of hyperparameters. Furthermore, it is the first deep RL algorithm to solve the first level of Montezuma's Revenge.
Submitted 29 May, 2018;
originally announced May 2018.
-
Distributed Prioritized Experience Replay
Authors:
Dan Horgan,
John Quan,
David Budden,
Gabriel Barth-Maron,
Matteo Hessel,
Hado van Hasselt,
David Silver
Abstract:
We propose a distributed architecture for deep reinforcement learning at scale, that enables agents to learn effectively from orders of magnitude more data than previously possible. The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared experience replay memory; the learner replays samples of experience and updates the neural network. The architecture relies on prioritized experience replay to focus only on the most significant data generated by the actors. Our architecture substantially improves the state of the art on the Arcade Learning Environment, achieving better final performance in a fraction of the wall-clock training time.
Submitted 2 March, 2018;
originally announced March 2018.
-
Unicorn: Continual Learning with a Universal, Off-policy Agent
Authors:
Daniel J. Mankowitz,
Augustin Žídek,
André Barreto,
Dan Horgan,
Matteo Hessel,
John Quan,
Junhyuk Oh,
Hado van Hasselt,
David Silver,
Tom Schaul
Abstract:
Some real-world domains are best characterized as a single task, but for others this perspective is limiting. Instead, some tasks continually grow in complexity, in tandem with the agent's competence. In continual learning, also referred to as lifelong learning, there are no explicit task boundaries or curricula. As learning agents have become more powerful, continual learning remains one of the frontiers that has resisted quick progress. To test continual learning capabilities we consider a challenging 3D domain with an implicit sequence of tasks and sparse rewards. We propose a novel agent architecture called Unicorn, which demonstrates strong continual learning and outperforms several baseline agents on the proposed domain. The agent achieves this by jointly representing and learning multiple policies efficiently, using a parallel off-policy learning setup.
Submitted 3 July, 2018; v1 submitted 22 February, 2018;
originally announced February 2018.
-
Rainbow: Combining Improvements in Deep Reinforcement Learning
Authors:
Matteo Hessel,
Joseph Modayil,
Hado van Hasselt,
Tom Schaul,
Georg Ostrovski,
Will Dabney,
Dan Horgan,
Bilal Piot,
Mohammad Azar,
David Silver
Abstract:
The deep reinforcement learning community has made several independent improvements to the DQN algorithm. However, it is unclear which of these extensions are complementary and can be fruitfully combined. This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance. We also provide results from a detailed ablation study that shows the contribution of each component to overall performance.
Submitted 6 October, 2017;
originally announced October 2017.
-
The Predictron: End-To-End Learning and Planning
Authors:
David Silver,
Hado van Hasselt,
Matteo Hessel,
Tom Schaul,
Arthur Guez,
Tim Harley,
Gabriel Dulac-Arnold,
David Reichert,
Neil Rabinowitz,
Andre Barreto,
Thomas Degris
Abstract:
One of the key challenges of artificial intelligence is to learn models that are effective in the context of planning. In this document we introduce the predictron architecture. The predictron consists of a fully abstract model, represented by a Markov reward process, that can be rolled forward multiple "imagined" planning steps. Each forward pass of the predictron accumulates internal rewards and values over multiple planning depths. The predictron is trained end-to-end so as to make these accumulated values accurately approximate the true value function. We applied the predictron to procedurally generated random mazes and a simulator for the game of pool. The predictron yielded significantly more accurate predictions than conventional deep neural network architectures.
Submitted 20 July, 2017; v1 submitted 28 December, 2016;
originally announced December 2016.
-
Learning values across many orders of magnitude
Authors:
Hado van Hasselt,
Arthur Guez,
Matteo Hessel,
Volodymyr Mnih,
David Silver
Abstract:
Most learning algorithms are not invariant to the scale of the function that is being approximated. We propose to adaptively normalize the targets used in learning. This is useful in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games, where the rewards were all clipped to a predetermined range. This clipping facilitates learning across many different games with a single learning algorithm, but a clipped reward function can result in qualitatively different behavior. Using the adaptive normalization we can remove this domain-specific heuristic without diminishing overall performance.
Submitted 16 August, 2016; v1 submitted 24 February, 2016;
originally announced February 2016.
-
Dueling Network Architectures for Deep Reinforcement Learning
Authors:
Ziyu Wang,
Tom Schaul,
Matteo Hessel,
Hado van Hasselt,
Marc Lanctot,
Nando de Freitas
Abstract:
In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.
Submitted 5 April, 2016; v1 submitted 20 November, 2015;
originally announced November 2015.