-
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Authors:
Ansh Radhakrishnan,
Karina Nguyen,
Anna Chen,
Carol Chen,
Carson Denison,
Danny Hernandez,
Esin Durmus,
Evan Hubinger,
Jackson Kernion,
Kamilė Lukošiūtė,
Newton Cheng,
Nicholas Joseph,
Nicholas Schiefer,
Oliver Rausch,
Sam McCandlish,
Sheer El Showk,
Tamera Lanham,
Tim Maxwell,
Venkatesa Chandrasekaran,
Zac Hatfield-Dodds,
Jared Kaplan,
Jan Brauner,
Samuel R. Bowman,
Ethan Perez
Abstract:
As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perfo…
▽ More
As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve over the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model's stated reasoning on several recently-proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior.
△ Less
Submitted 25 July, 2023; v1 submitted 16 July, 2023;
originally announced July 2023.
-
The Capacity for Moral Self-Correction in Large Language Models
Authors:
Deep Ganguli,
Amanda Askell,
Nicholas Schiefer,
Thomas I. Liao,
Kamilė Lukošiūtė,
Anna Chen,
Anna Goldie,
Azalia Mirhoseini,
Catherine Olsson,
Danny Hernandez,
Dawn Drain,
Dustin Li,
Eli Tran-Johnson,
Ethan Perez,
Jackson Kernion,
Jamie Kerr,
Jared Mueller,
Joshua Landau,
Kamal Ndousse,
Karina Nguyen,
Liane Lovitt,
Michael Sellitto,
Nelson Elhage,
Noemi Mercado,
Nova DasSarma
, et al. (24 additional authors not shown)
Abstract:
We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability…
▽ More
We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.
△ Less
Submitted 18 February, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
Discovering Language Model Behaviors with Model-Written Evaluations
Authors:
Ethan Perez,
Sam Ringer,
Kamilė Lukošiūtė,
Karina Nguyen,
Edwin Chen,
Scott Heiner,
Craig Pettit,
Catherine Olsson,
Sandipan Kundu,
Saurav Kadavath,
Andy Jones,
Anna Chen,
Ben Mann,
Brian Israel,
Bryan Seethor,
Cameron McKinnon,
Christopher Olah,
Da Yan,
Daniela Amodei,
Dario Amodei,
Dawn Drain,
Dustin Li,
Eli Tran-Johnson,
Guro Khundadze,
Jackson Kernion
, et al. (38 additional authors not shown)
Abstract:
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from inst…
▽ More
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
△ Less
Submitted 19 December, 2022;
originally announced December 2022.
-
Constitutional AI: Harmlessness from AI Feedback
Authors:
Yuntao Bai,
Saurav Kadavath,
Sandipan Kundu,
Amanda Askell,
Jackson Kernion,
Andy Jones,
Anna Chen,
Anna Goldie,
Azalia Mirhoseini,
Cameron McKinnon,
Carol Chen,
Catherine Olsson,
Christopher Olah,
Danny Hernandez,
Dawn Drain,
Deep Ganguli,
Dustin Li,
Eli Tran-Johnson,
Ethan Perez,
Jamie Kerr,
Jared Mueller,
Jeffrey Ladish,
Joshua Landau,
Kamal Ndousse,
Kamile Lukosuite
, et al. (26 additional authors not shown)
Abstract:
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supe…
▽ More
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
△ Less
Submitted 15 December, 2022;
originally announced December 2022.
-
Measuring Progress on Scalable Oversight for Large Language Models
Authors:
Samuel R. Bowman,
Jeeyoon Hyun,
Ethan Perez,
Edwin Chen,
Craig Pettit,
Scott Heiner,
Kamilė Lukošiūtė,
Amanda Askell,
Andy Jones,
Anna Chen,
Anna Goldie,
Azalia Mirhoseini,
Cameron McKinnon,
Christopher Olah,
Daniela Amodei,
Dario Amodei,
Dawn Drain,
Dustin Li,
Eli Tran-Johnson,
Jackson Kernion,
Jamie Kerr,
Jared Mueller,
Jeffrey Ladish,
Joshua Landau,
Kamal Ndousse
, et al. (21 additional authors not shown)
Abstract:
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think abou…
▽ More
Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.
△ Less
Submitted 11 November, 2022; v1 submitted 4 November, 2022;
originally announced November 2022.
-
Predictability and Surprise in Large Generative Models
Authors:
Deep Ganguli,
Danny Hernandez,
Liane Lovitt,
Nova DasSarma,
Tom Henighan,
Andy Jones,
Nicholas Joseph,
Jackson Kernion,
Ben Mann,
Amanda Askell,
Yuntao Bai,
Anna Chen,
Tom Conerly,
Dawn Drain,
Nelson Elhage,
Sheer El Showk,
Stanislav Fort,
Zac Hatfield-Dodds,
Scott Johnston,
Shauna Kravec,
Neel Nanda,
Kamal Ndousse,
Catherine Olsson,
Daniela Amodei,
Dario Amodei
, et al. (5 additional authors not shown)
Abstract:
Large-scale pre-training has recently emerged as a technique for creating capable, general purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. Namely, these generative models have an unusual combination of predictable loss on a broad train…
▽ More
Large-scale pre-training has recently emerged as a technique for creating capable, general purpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. Namely, these generative models have an unusual combination of predictable loss on a broad training distribution (as embodied in their "scaling laws"), and unpredictable specific capabilities, inputs, and outputs. We believe that the high-level predictability and appearance of useful capabilities drives rapid development of such models, while the unpredictable qualities make it difficult to anticipate the consequences of model deployment. We go through examples of how this combination can lead to socially harmful behavior with examples from the literature and real world observations, and we also perform two novel experiments to illustrate our point about harms from unpredictability. Furthermore, we analyze how these conflicting properties combine to give model developers various motivations for deploying these models, and challenges that can hinder deployment. We conclude with a list of possible interventions the AI community may take to increase the chance of these models having a beneficial impact. We intend this paper to be useful to policymakers who want to understand and regulate AI systems, technologists who care about the potential policy impact of their work, and academics who want to analyze, critique, and potentially develop large generative models.
△ Less
Submitted 3 October, 2022; v1 submitted 15 February, 2022;
originally announced February 2022.