Showing 1–6 of 6 results for author: Roger, F

Searching in archive cs.

  1. arXiv:2405.19550  [pdf, other]

    cs.LG cs.CL

    Stress-Testing Capability Elicitation With Password-Locked Models

    Authors: Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, David Krueger

    Abstract: To determine the safety of large language models (LLMs), AI developers must be able to assess their dangerous capabilities. But simple prompting strategies often fail to elicit an LLM's full capabilities. One way to elicit capabilities more robustly is to fine-tune the LLM to complete the task. In this paper, we investigate the conditions under which fine-tuning-based elicitation suffices to elici…

    Submitted 29 May, 2024; originally announced May 2024.

  2. arXiv:2312.06942  [pdf, other]

    cs.LG

    AI Control: Improving Safety Despite Intentional Subversion

    Authors: Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger

    Abstract: As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evalu…

    Submitted 23 July, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: Edit: Fix minor typos, clarify abstract, add glossary, expand related work. ICML version: https://openreview.net/pdf?id=KviM5k8pcP

  3. arXiv:2310.18512  [pdf, other]

    cs.LG

    Preventing Language Models From Hiding Their Reasoning

    Authors: Fabien Roger, Ryan Greenblatt

    Abstract: Large language models (LLMs) often benefit from intermediate steps of reasoning to generate answers to complex problems. When these intermediate steps of reasoning are used to monitor the activity of the model, it is essential that this explicit reasoning is faithful, i.e. that it reflects what the model is actually reasoning about. In this work, we focus on one potential way intermediate steps of…

    Submitted 31 October, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

    Comments: Edit: Fix typos

  4. arXiv:2308.15605  [pdf, other]

    cs.LG

    Benchmarks for Detecting Measurement Tampering

    Authors: Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas

    Abstract: When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization. One concern is measurement tampering, where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. In this work, we build four new text-based datasets to evaluate measuremen…

    Submitted 29 September, 2023; v1 submitted 29 August, 2023; originally announced August 2023.

    Comments: Edits: extended and improved appendices, fixed references, figures, and typos

  5. arXiv:2306.07567  [pdf, other]

    cs.LG cs.CL

    Large Language Models Sometimes Generate Purely Negatively-Reinforced Text

    Authors: Fabien Roger

    Abstract: When using adversarial training, it is common practice to train against the most egregious failures. However, this might imply using examples with sensitive information (such as leaked passwords or security vulnerabilities) as training data. One might assume that language models trained with gradient descent never generate text snippets which were only present in examples associated with the lowes…

    Submitted 16 June, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

    Comments: 6 pages, 5 figures, LaTeX; added a related work section

  6. arXiv:2212.11281  [pdf, other]

    cs.CL cs.AI cs.LG

    Language models are better than humans at next-token prediction

    Authors: Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean

    Abstract: Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code. However, language models are not trained to perform well at these tasks, they are trained to accurately predict the next token given previous tokens in tokenized text. It is not clear whether language models are better or worse than humans at next token prediction…

    Submitted 15 July, 2024; v1 submitted 21 December, 2022; originally announced December 2022.

    Comments: Edit: TMLR 2024, more analysis of the results was added