Showing 1–5 of 5 results for author: Chughtai, B

  1. arXiv:2407.08734  [pdf, other]

    cs.LG cs.AI cs.CL

    Transformer Circuit Faithfulness Metrics are not Robust

    Authors: Joseph Miller, Bilal Chughtai, William Saunders

    Abstract: Mechanistic interpretability work attempts to reverse engineer the learned algorithms present inside neural networks. One focus of this work has been to discover 'circuits' -- subgraphs of the full model that explain behaviour on specific tasks. But how do we measure the performance of such circuits? Prior work has attempted to measure circuit 'faithfulness' -- the degree to which the circuit repl…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: CoLM 2024 Conference Paper. 11 page main body. 11 page appendix. 12 figures

  2. arXiv:2407.04694  [pdf, other]

    cs.CL cs.AI cs.LG

    Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

    Authors: Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jeremy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans

    Abstract: AI assistants such as ChatGPT are trained to respond to users by saying, "I am a large language model". This raises questions. Do such models know that they are LLMs and reliably act on this knowledge? Are they aware of their current circumstances, such as being deployed to the public? We refer to a model's knowledge of itself and its circumstances as situational awareness. To quantify situational…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: 11 page main body, 98 page appendix, 58 figures

  3. arXiv:2405.07436  [pdf, other]

    cs.LG cs.AI

    Can Language Models Explain Their Own Classification Behavior?

    Authors: Dane Sherburn, Bilal Chughtai, Owain Evans

    Abstract: Large language models (LLMs) perform well at a myriad of tasks, but explaining the processes behind this performance is a challenge. This paper investigates whether LLMs can give faithful high-level explanations of their own internal processes. To explore this, we introduce a dataset, ArticulateRules, of few-shot text-based classification tasks generated by simple rules. Each rule is associated wi…

    Submitted 12 May, 2024; originally announced May 2024.

  4. arXiv:2402.07321  [pdf, other]

    cs.LG cs.CL

    Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

    Authors: Bilal Chughtai, Alan Cooney, Neel Nanda

    Abstract: How do transformer-based large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task -- factual recall, where the model is tasked with explicitly surfacing stored facts in prompts of the form 'Fact: The Colosseum is in the country of'. We find that the mechanistic story behind factual recall is more complex than previously thought. It comprises several disti…

    Submitted 11 February, 2024; originally announced February 2024.

    Comments: NeurIPS 2023 Attributing Model Behaviour at Scale Workshop

  5. arXiv:2302.03025  [pdf, other]

    cs.LG cs.AI math.RT

    A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

    Authors: Bilal Chughtai, Lawrence Chan, Neel Nanda

    Abstract: Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks. In this work, we study the universality hypothesis by examining how small neural networks learn to implement group composition. We present a novel algorithm by which neural networks may implement composition for any finite group via mathematic…

    Submitted 24 May, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

    Comments: 9 page main body, 1 page references, 12 page appendix