Showing 1–9 of 9 results for author: Turner, A M

  1. arXiv:2312.06681

    cs.CL cs.AI cs.LG

    Steering Llama 2 via Contrastive Activation Addition

    Authors: Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

    Abstract: We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steerin…

    Submitted 5 July, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

  2. arXiv:2310.08043

    cs.AI

    Understanding and Controlling a Maze-Solving Policy Network

    Authors: Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, Alexander Matt Turner

    Abstract: To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares. We find this network pursues multiple context-dependent goals, and we further identify circuits within the network that correspond to one of these goals. In particular, we identified eleven channels that track th…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: 46 pages

  3. arXiv:2308.10248

    cs.CL cs.LG

    Activation Addition: Steering Language Models Without Optimization

    Authors: Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

    Abstract: Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering and guided decoding. We instead investigate activation engineering: modifying activations at inference-time to predictably alter model behavior. We bias the forward pass with a 'steering vector' implici…

    Submitted 4 June, 2024; v1 submitted 20 August, 2023; originally announced August 2023.

  4. arXiv:2206.13477

    cs.AI

    Parametrically Retargetable Decision-Makers Tend To Seek Power

    Authors: Alexander Matt Turner, Prasad Tadepalli

    Abstract: If capable AI agents are generally incentivized to seek power in service of the objectives we specify for them, then these systems will pose enormous risks, in addition to enormous benefits. In fully observable environments, most reward functions have an optimal policy which seeks power by keeping options open and staying alive. However, the real world is neither fully observable, nor must trained…

    Submitted 11 October, 2022; v1 submitted 27 June, 2022; originally announced June 2022.

    Comments: 10-page main paper, 36 pages total, poster at NeurIPS 2022

  5. arXiv:2206.11831

    cs.AI

    On Avoiding Power-Seeking by Artificial Intelligence

    Authors: Alexander Matt Turner

    Abstract: We do not know how to align a very intelligent AI agent's behavior with human interests. I investigate whether -- absent a full solution to this AI alignment problem -- we can build smart AI agents which have limited impact on the world, and which do not autonomously seek power. In this thesis, I introduce the attainable utility preservation (AUP) method. I demonstrate that AUP produces conservati…

    Submitted 23 June, 2022; originally announced June 2022.

    Comments: 287 pages, PhD thesis

  6. arXiv:2206.11812

    cs.AI

    Formalizing the Problem of Side Effect Regularization

    Authors: Alexander Matt Turner, Aseem Saxena, Prasad Tadepalli

    Abstract: AI objectives are often hard to specify properly. Some approaches tackle this problem by regularizing the AI's side effects: Agents must weigh off "how much of a mess they make" with an imperfectly specified proxy objective. We propose a formal criterion for side effect regularization via the assistance game framework. In these games, the agent solves a partially observable Markov decision process…

    Submitted 8 November, 2022; v1 submitted 23 June, 2022; originally announced June 2022.

    Comments: 14 pages, accepted to ML Safety Workshop at NeurIPS 2022. Alexander Turner and Aseem Saxena contributed equally

  7. arXiv:2006.06547

    cs.AI

    Avoiding Side Effects in Complex Environments

    Authors: Alexander Matt Turner, Neale Ratzlaff, Prasad Tadepalli

    Abstract: Reward function specification can be difficult. Rewarding the agent for making a widget may be easy, but penalizing the multitude of possible negative side effects is hard. In toy environments, Attainable Utility Preservation (AUP) avoided side effects by penalizing shifts in the ability to achieve randomly generated goals. We scale this approach to large, randomly generated environments based on…

    Submitted 22 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

    Comments: Accepted as spotlight paper at NeurIPS 2020. 10 pages main paper; 19 pages with appendices

  8. arXiv:1912.01683

    cs.AI

    Optimal Policies Tend to Seek Power

    Authors: Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli

    Abstract: Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of their objectives. Other researchers point out that RL agents need not have human-like power-seeking instincts. To clarify this discussion, we develop the first formal theory of the statistical tendencies of optimal policies. In the context of Markov decisio…

    Submitted 28 January, 2023; v1 submitted 3 December, 2019; originally announced December 2019.

    Comments: Accepted to NeurIPS 2021 as spotlight paper. 12 pages, 44 pages with appendices. Since the 2021 acceptance, we updated the paper to point out that optimal policies can be qualitatively divorced from real-world learned policies

  9. Conservative Agency via Attainable Utility Preservation

    Authors: Alexander Matt Turner, Dylan Hadfield-Menell, Prasad Tadepalli

    Abstract: Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment. If that change precludes optimization of the correctly specified reward function, then correction is futile. For example, a robotic factory assistant could break expensive equipment due to a…

    Submitted 10 June, 2020; v1 submitted 25 February, 2019; originally announced February 2019.

    Comments: Published in AI, Ethics, and Society 2020