Showing 1–5 of 5 results for author: Clymer, J

  1. arXiv:2406.15371

    cs.CY cs.AI

    Affirmative safety: An approach to risk management for high-risk AI

    Authors: Akash R. Wasil, Joshua Clymer, David Krueger, Emily Dardaman, Simeon Campos, Evan R. Murphy

    Abstract: Prominent AI experts have suggested that companies developing high-risk AI systems should be required to show that such systems are safe before they can be developed or deployed. The goal of this paper is to expand on this idea and explore its implications for risk management. We argue that entities developing or deploying high-risk AI systems should be required to present evidence of affirmative…

    Submitted 14 April, 2024; originally announced June 2024.

  2. arXiv:2406.06613

    cs.CL cs.AI

    GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

    Authors: Anthony Costarelli, Mat Allen, Roman Hauksson, Grace Sodunke, Suhas Hariharan, Carlson Cheng, Wenjie Li, Joshua Clymer, Arjun Yadav

    Abstract: Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of using large language models in complex, strategic scenarios, there lacks a comprehensive framework for evaluating agents' performance across various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benc…

    Submitted 22 July, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

  3. arXiv:2405.05466

    cs.CL cs.AI

    Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

    Authors: Joshua Clymer, Caden Juang, Severin Field

    Abstract: Like a criminal under investigation, Large Language Models (LLMs) might pretend to be aligned while evaluated and misbehave when they have a good opportunity. Can current interpretability methods catch these 'alignment fakers?' To answer this question, we introduce a benchmark that consists of 324 pairs of LLMs fine-tuned to select actions in role-play scenarios. One model in each pair is consiste…

    Submitted 11 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

  4. arXiv:2403.10462

    cs.CY cs.AI

    Safety Cases: How to Justify the Safety of Advanced AI Systems

    Authors: Joshua Clymer, Nick Gabrieli, David Krueger, Thomas Larsen

    Abstract: As AI systems become more advanced, companies and regulators will make difficult decisions about whether it is safe to train and deploy them. To prepare for these decisions, we investigate how developers could make a 'safety case,' which is a structured rationale that AI systems are unlikely to cause a catastrophe. We propose a framework for organizing a safety case and discuss four categories of…

    Submitted 18 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

  5. arXiv:2311.07723

    cs.AI cs.CL cs.LG

    Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains

    Authors: Joshua Clymer, Garrett Baker, Rohan Subramani, Sam Wang

    Abstract: As AI systems become more intelligent and their behavior becomes more challenging to assess, they may learn to game the flaws of human feedback instead of genuinely striving to follow instructions; however, this risk can be mitigated by controlling how LLMs generalize human feedback to situations where it is unreliable. To better understand how reward models generalize, we craft 69 distribution sh…

    Submitted 17 December, 2023; v1 submitted 13 November, 2023; originally announced November 2023.

    Comments: Code: https://github.com/Joshuaclymer/GENIES Website: https://joshuaclymer.github.io/generalization-analogies-website/