Showing 1–2 of 2 results for author: Rushing, C

  1. arXiv:2402.15390

    cs.LG cs.AI cs.CL

    Explorations of Self-Repair in Language Models

    Authors: Cody Rushing, Neel Nanda

    Abstract: Prior interpretability research studying narrow distributions has preliminarily identified self-repair, a phenomenon in which, if components in large language models are ablated, later components change their behavior to compensate. Our work builds on this past literature, demonstrating that self-repair exists across a variety of model families and sizes when ablating individual attention heads on t…

    Submitted 26 May, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: ICML 2024

  2. arXiv:2310.04625

    cs.LG cs.AI cs.CL

    Copy Suppression: Comprehensively Understanding an Attention Head

    Authors: Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

    Abstract: We present a single attention head in GPT-2 Small that has one main role across the entire training distribution. If components in earlier layers predict a certain token, and this token appears earlier in the context, the head suppresses it: we call this copy suppression. Attention Head 10.7 (L10H7) suppresses naive copying behavior, which improves overall model calibration. This explains why multi…

    Submitted 6 October, 2023; originally announced October 2023.