-
Circuit Breaking: Removing Model Behaviors with Targeted Ablation
Authors:
Maximilian Li,
Xander Davies,
Max Nadeau
Abstract:
Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. In the setting of reducing GPT-2 toxic language generation, we find ablating just 12 of the 11.6K causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
Submitted 29 January, 2024; v1 submitted 12 September, 2023;
originally announced September 2023.
-
Benchmarks for Detecting Measurement Tampering
Authors:
Fabien Roger,
Ryan Greenblatt,
Max Nadeau,
Buck Shlegeris,
Nate Thomas
Abstract:
When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization. One concern is measurement tampering, where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. In this work, we build four new text-based datasets to evaluate measurement tampering detection techniques on large language models. Concretely, given sets of text inputs and measurements aimed at determining if some outcome occurred, as well as a base model able to accurately predict measurements, the goal is to determine if examples where all measurements indicate the outcome occurred actually had the outcome occur, or if this was caused by measurement tampering. We demonstrate techniques that outperform simple baselines on most datasets, but don't achieve maximum performance. We believe there is significant room for improvement for both techniques and datasets, and we are excited for future work tackling measurement tampering.
Submitted 29 September, 2023; v1 submitted 29 August, 2023;
originally announced August 2023.
-
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Authors:
Stephen Casper,
Xander Davies,
Claudia Shi,
Thomas Krendl Gilbert,
Jérémy Scheurer,
Javier Rando,
Rachel Freedman,
Tomasz Korbak,
David Lindner,
Pedro Freire,
Tony Wang,
Samuel Marks,
Charbel-Raphaël Segerie,
Micah Carroll,
Andi Peng,
Phillip Christoffersen,
Mehul Damani,
Stewart Slocum,
Usman Anwar,
Anand Siththaranjan,
Max Nadeau,
Eric J. Michaud,
Jacob Pfau,
Dmitrii Krasheninnikov,
Xin Chen,
et al. (7 additional authors not shown)
Abstract:
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.
Submitted 11 September, 2023; v1 submitted 27 July, 2023;
originally announced July 2023.
-
Discovering Variable Binding Circuitry with Desiderata
Authors:
Xander Davies,
Max Nadeau,
Nikhil Prakash,
Tamar Rott Shaham,
David Bau
Abstract:
Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach which extends causal mediation experiments to automatically identify model components responsible for performing a specific subtask by solely specifying a set of desiderata, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.
Submitted 7 July, 2023;
originally announced July 2023.
-
Controller design and experimental evaluation of a motorised assistance for a patient transfer floor lift
Authors:
Donatien Callon,
Ian Lalonde,
Mathieu Nadeau,
Alexandre Girard
Abstract:
Patient transfer is a challenging, critical task because it exposes caregivers to injury risks. Available transfer devices, like floor lifts, lead to improvements but are far from perfect. They do not eliminate the caregivers' risk of musculoskeletal disorders, and they can be burdensome to use due to their poor maneuverability. This paper presents a new motorized floor lift with a single central motorized wheel connected to an instrumented handle. Admittance controllers are designed to 1) improve the device maneuverability, 2) reduce the required caregiver effort, and 3) ensure the security and comfort of patients. Two controller designs, one with a linear admittance law and one with a non-linear admittance law with variable damping, were developed and implemented on a prototype. Tests were performed on seven participants to evaluate the performance of the assistance system and the controllers. The experimental results show that the motorized assistance with the variable damping controller 1) improves maneuverability by 28%, 2) reduces the amount of effort required to push the lift by 66%, and 3) provides the same level of patient comfort compared to a standard unassisted floor lift.
Submitted 23 May, 2024; v1 submitted 30 May, 2022;
originally announced May 2022.
-
Robust Feature-Level Adversaries are Interpretability Tools
Authors:
Stephen Casper,
Max Nadeau,
Dylan Hadfield-Menell,
Gabriel Kreiman
Abstract:
The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying representations in models. Second, we show that these adversaries are uniquely versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research. They support the design of tools to better understand what a model has learned and diagnose brittle feature associations. Code is available at https://github.com/thestephencasper/feature_level_adv
Submitted 11 September, 2023; v1 submitted 7 October, 2021;
originally announced October 2021.
-
A Low-Power Dual-Factor Authentication Unit for Secure Implantable Devices
Authors:
Saurav Maji,
Utsav Banerjee,
Samuel H Fuller,
Mohamed R Abdelhamid,
Phillip M Nadeau,
Rabia Tugce Yazicigil,
Anantha P Chandrakasan
Abstract:
This paper presents a dual-factor authentication protocol and its low-power implementation for security of implantable medical devices (IMDs). The protocol incorporates traditional cryptographic first-factor authentication using Datagram Transport Layer Security - Pre-Shared Key (DTLS-PSK) followed by the user's touch-based voluntary second-factor authentication for enhanced security. With a low-power compact always-on wake-up timer and touch-based wake-up circuitry, our test chip consumes only 735 pW idle state power at 20.15 Hz and 2.5 V. The hardware accelerated dual-factor authentication unit consumes 8 μW at 660 kHz and 0.87 V. Our test chip was coupled with a commercial Bluetooth Low Energy (BLE) transceiver, DC-DC converter, touch sensor and coin cell battery to demonstrate standalone implantable operation, and was also tested using an in-vitro measurement setup.
Submitted 27 April, 2020;
originally announced April 2020.