Showing 1–8 of 8 results for author: Cobbe, K

  1. arXiv:2305.20050

    cs.LG cs.AI cs.CL

    Let's Verify Step by Step

    Authors: Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

    Abstract: In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning ste…

    Submitted 31 May, 2023; originally announced May 2023.

  2. arXiv:2112.09332

    cs.CL cs.AI cs.LG

    WebGPT: Browser-assisted question-answering with human feedback

    Authors: Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, John Schulman

    Abstract: We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must coll…

    Submitted 1 June, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

    Comments: 32 pages

  3. arXiv:2110.14168

    cs.LG cs.CL

    Training Verifiers to Solve Math Word Problems

    Authors: Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman

    Abstract: State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high tes…

    Submitted 17 November, 2021; v1 submitted 27 October, 2021; originally announced October 2021.

  4. arXiv:2110.00641

    cs.LG stat.ML

    Batch size-invariance for policy optimization

    Authors: Jacob Hilton, Karl Cobbe, John Schulman

    Abstract: We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well-known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this…

    Submitted 24 September, 2022; v1 submitted 1 October, 2021; originally announced October 2021.

    Comments: 32 pages. Code is available at https://github.com/openai/ppo-ewma

    Journal ref: Advances in Neural Information Processing Systems 35 (2022) 17086-17098

  5. arXiv:2103.15332

    cs.LG cs.AI

    Measuring Sample Efficiency and Generalization in Reinforcement Learning Benchmarks: NeurIPS 2020 Procgen Benchmark

    Authors: Sharada Mohanty, Jyotish Poonganam, Adrien Gaidon, Andrey Kolobov, Blake Wulfe, Dipam Chakraborty, Gražvydas Šemetulskis, João Schapke, Jonas Kubilius, Jurgis Pašukonis, Linas Klimas, Matthew Hausknecht, Patrick MacAlpine, Quang Nhat Tran, Thomas Tumiel, Xiaocheng Tang, Xinwei Chen, Christopher Hesse, Jacob Hilton, William Hebgen Guss, Sahika Genc, John Schulman, Karl Cobbe

    Abstract: The NeurIPS 2020 Procgen Competition was designed as a centralized benchmark with clearly defined tasks for measuring Sample Efficiency and Generalization in Reinforcement Learning. Generalization remains one of the most fundamental challenges in deep reinforcement learning, and yet we do not have enough benchmarks to measure the progress of the community on Generalization in Reinforcement Learnin…

    Submitted 29 March, 2021; originally announced March 2021.

  6. arXiv:2009.04416

    cs.LG stat.ML

    Phasic Policy Gradient

    Authors: Karl Cobbe, Jacob Hilton, Oleg Klimov, John Schulman

    Abstract: We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases. In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function. Using separate networks avoids interference between objectives,…

    Submitted 9 September, 2020; originally announced September 2020.

  7. arXiv:1912.01588

    cs.LG stat.ML

    Leveraging Procedural Generation to Benchmark Reinforcement Learning

    Authors: Karl Cobbe, Christopher Hesse, Jacob Hilton, John Schulman

    Abstract: We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. We believe that the community will benefit from increased access to high quality training environments, and we provide detailed experimental protocols for using this benchmark. We empirically demonstrate that diverse…

    Submitted 26 July, 2020; v1 submitted 3 December, 2019; originally announced December 2019.

  8. arXiv:1812.02341

    cs.LG stat.ML

    Quantifying Generalization in Reinforcement Learning

    Authors: Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, John Schulman

    Abstract: In this paper, we investigate the problem of overfitting in deep reinforcement learning. Among the most common benchmarks in RL, it is customary to use the same environments for both training and testing. This practice offers relatively little insight into an agent's ability to generalize. We address this issue by using procedurally generated environments to construct distinct training and test se…

    Submitted 14 July, 2019; v1 submitted 5 December, 2018; originally announced December 2018.