When is Offline Policy Selection Sample Efficient for Reinforcement Learning?

Authors: Vincent Liu, Prabhat Nagarajan, Andrew Patterson, Martha White

Abstract: Offline reinforcement learning algorithms often require careful hyperparameter tuning. Consequently, before deployment, we need to select amongst a set of candidate policies. As yet, however, there is little understanding about the fundamental limits of this offline policy selection (OPS) problem. In this work we aim to provide clarity on when sample efficient OPS is possible, primarily by connecting OPS to off-policy policy evaluation (OPE) and Bellman error (BE) estimation. We first show a hardness result, that in the worst case, OPS is just as hard as OPE, by proving a reduction of OPE to OPS. As a result, no OPS method can be more sample efficient than OPE in the worst case. We then propose a BE method for OPS, called Identifiable BE Selection (IBES), that has a straightforward method for selecting its own hyperparameters. We highlight that using IBES for OPS generally has more requirements than OPE methods, but if satisfied, can be more sample efficient. We conclude with an empirical study comparing OPE and IBES, and by showing the difficulty of OPS on an offline Atari benchmark dataset.

Submitted 4 December, 2023; originally announced December 2023.

arXiv:2010.06505 [pdf, other]

A Lean and Highly-automated Model-Based Software Development Process Based on DO-178C/DO-331

Authors: Konstantin Dmitriev, Shanza Ali Zafar, Kevin Schmiechen, Yi Lai, Micheal Saleab, Pranav Nagarajan, Daniel Dollinger, Markus Hochstrasser, Stephan Myschik, Florian Holzapfel

Abstract: The emergence of a global market for urban air mobility and unmanned aerial systems has attracted many startups across the world. These organizations have little training or experience in the traditional processes used in civil aviation for the development of software and electronic hardware. They are also constrained in the resources they can allocate for dedicated teams of professionals to follow these standardized processes. To fill this gap, this paper presents a custom workflow based on a subset of objectives derived from the foundational standards for safety critical software DO-178C/DO-331. The selection of objectives from the standards is based on the importance, degree of automation, and reusability of specific objectives. This custom workflow is intended to establish a lean and highly automated development life cycle resulting in higher quality software with better maintainability characteristics for research and prototype aircraft. It can also be proposed as means of compliance for software of certain applications such as unmanned aircraft systems, urban air mobility and general aviation. By producing the essential set of development and verification artifacts, the custom workflow also provides a scalable basis for potential future certification in compliance with DO-178C/DO-331. The custom workflow is demonstrated in a case study of an Autopilot Manual Disconnection System.

Submitted 13 October, 2020; originally announced October 2020.

arXiv:2007.08082 [pdf, other]

Distributed Reinforcement Learning of Targeted Grasping with Active Vision for Mobile Manipulators

Authors: Yasuhiro Fujita, Kota Uenishi, Avinash Ummadisingu, Prabhat Nagarajan, Shimpei Masuda, Mario Ynocente Castro

Abstract: Developing personal robots that can perform a diverse range of manipulation tasks in unstructured environments necessitates solving several challenges for robotic grasping systems. We take a step towards this broader goal by presenting the first RL-based system, to our knowledge, for a mobile manipulator that can (a) achieve targeted grasping generalizing to unseen target objects, (b) learn complex grasping strategies for cluttered scenes with occluded objects, and (c) perform active vision through its movable wrist camera to better locate objects. The system is informed of the desired target object in the form of a single, arbitrary-pose RGB image of that object, enabling the system to generalize to unseen objects without retraining. To achieve such a system, we combine several advances in deep reinforcement learning and present a large-scale distributed training system using synchronous SGD that seamlessly scales to multi-node, multi-GPU infrastructure to make rapid prototyping easier. We train and evaluate our system in a simulated environment, identify key components for improving performance, analyze its behaviors, and transfer to a real-world setup.

Submitted 14 October, 2020; v1 submitted 15 July, 2020; originally announced July 2020.

Comments: Accepted at IROS 2020

arXiv:2002.00149 [pdf, other]

Periodic Intra-Ensemble Knowledge Distillation for Reinforcement Learning

Authors: Zhang-Wei Hong, Prabhat Nagarajan, Guilherme Maeda

Abstract: Off-policy ensemble reinforcement learning (RL) methods have demonstrated impressive results across a range of RL benchmark tasks. Recent works suggest that directly imitating experts' policies in a supervised manner before or during the course of training enables faster policy improvement for an RL agent. Motivated by these recent insights, we propose Periodic Intra-Ensemble Knowledge Distillation (PIEKD). PIEKD is a learning framework that uses an ensemble of policies to act in the environment while periodically sharing knowledge amongst policies in the ensemble through knowledge distillation. Our experiments demonstrate that PIEKD improves upon a state-of-the-art RL method in sample efficiency on several challenging MuJoCo benchmark tasks. Additionally, we perform ablation studies to better understand PIEKD.

Submitted 1 February, 2020; originally announced February 2020.

Comments: 8 pages

arXiv:1912.04201 [pdf, other]

Learning Latent State Spaces for Planning through Reward Prediction

Authors: Aaron Havens, Yi Ouyang, Prabhat Nagarajan, Yasuhiro Fujita

Abstract: Model-based reinforcement learning methods typically learn models for high-dimensional state spaces by aiming to reconstruct and predict the original observations. However, drawing inspiration from model-free reinforcement learning, we propose learning a latent dynamics model directly from rewards. In this work, we introduce a model-based planning framework which learns a latent reward prediction model and then plans in the latent state-space. The latent representation is learned exclusively from multi-step reward prediction which we show to be the only necessary information for successful planning. With this framework, we are able to benefit from the concise model-free representation, while still enjoying the data-efficiency of model-based algorithms. We demonstrate our framework in multi-pendulum and multi-cheetah environments where several pendulums or cheetahs are shown to the agent but only one of which produces rewards. In these environments, it is important for the agent to construct a concise latent representation to filter out irrelevant observations. We find that our method can successfully learn an accurate latent reward prediction model in the presence of the irrelevant information while existing model-based methods fail. Planning in the learned latent state-space shows strong performance and high sample efficiency over model-free and model-based baselines.

Submitted 9 December, 2019; originally announced December 2019.

Comments: Deep RL Workshop, NeurIPS 2019, Vancouver

arXiv:1912.03905 [pdf, other]

ChainerRL: A Deep Reinforcement Learning Library

Authors: Yasuhiro Fujita, Prabhat Nagarajan, Toshiki Kataoka, Takahiro Ishikawa

Abstract: In this paper, we introduce ChainerRL, an open-source deep reinforcement learning (DRL) library built using Python and the Chainer deep learning framework. ChainerRL implements a comprehensive set of DRL algorithms and techniques drawn from state-of-the-art research in the field. To foster reproducible research, and for instructional purposes, ChainerRL provides scripts that closely replicate the original papers' experimental settings and reproduce published benchmark results for several algorithms. Lastly, ChainerRL offers a visualization tool that enables the qualitative inspection of trained agents. The ChainerRL source code can be found on GitHub: https://github.com/chainer/chainerrl.

Submitted 11 April, 2021; v1 submitted 9 December, 2019; originally announced December 2019.

Comments: Journal of Machine Learning Research

Journal ref: Journal of Machine Learning Research 22(77) (2021) 1-14

arXiv:1904.06387 [pdf, other]

Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations

Authors: Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum

Abstract: A critical flaw of existing inverse reinforcement learning (IRL) methods is their inability to significantly outperform the demonstrator. This is because IRL typically seeks a reward function that makes the demonstrator appear near-optimal, rather than inferring the underlying intentions of the demonstrator that may have been poorly executed in practice. In this paper, we introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately) ranked demonstrations in order to infer high-quality reward functions from a set of potentially poor demonstrations. When combined with deep reinforcement learning, T-REX outperforms state-of-the-art imitation learning and IRL methods on multiple Atari and MuJoCo benchmark tasks and achieves performance that is often more than twice the performance of the best demonstration. We also demonstrate that T-REX is robust to ranking noise and can accurately extrapolate intention by simply watching a learner noisily improve at a task over time.

Submitted 8 July, 2019; v1 submitted 12 April, 2019; originally announced April 2019.

Comments: In proceedings of Thirty-sixth International Conference on Machine Learning (ICML 2019)

arXiv:1809.05676 [pdf, other]

Deterministic Implementations for Reproducibility in Deep Reinforcement Learning

Authors: Prabhat Nagarajan, Garrett Warnell, Peter Stone

Abstract: While deep reinforcement learning (DRL) has led to numerous successes in recent years, reproducing these successes can be extremely challenging. One reproducibility challenge particularly relevant to DRL is nondeterminism in the training process, which can substantially affect the results. Motivated by this challenge, we study the positive impacts of deterministic implementations in eliminating nondeterminism in training. To do so, we consider the particular case of the deep Q-learning algorithm, for which we produce a deterministic implementation by identifying and controlling all sources of nondeterminism in the training process. One by one, we then allow individual sources of nondeterminism to affect our otherwise deterministic implementation, and measure the impact of each source on the variance in performance. We find that individual sources of nondeterminism can substantially impact the performance of agent, illustrating the benefits of deterministic implementations. In addition, we also discuss the important role of deterministic implementations in achieving exact replicability of results.

Submitted 9 June, 2019; v1 submitted 15 September, 2018; originally announced September 2018.

Comments: 17 Pages

Showing 1–8 of 8 results for author: Nagarajan, P