-
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Authors:
Aaron Mueller,
Jannik Brinkmann,
Millicent Li,
Samuel Marks,
Koyena Pal,
Nikhil Prakash,
Can Rager,
Aruna Sankaranarayanan,
Arnab Sen Sharma,
Jiuding Sun,
Eric Todd,
David Bau,
Yonatan Belinkov
Abstract:
Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the…
▽ More
Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the basic causal units underlying these mechanisms are often not explicitly defined. In this paper, we propose a perspective on interpretability research grounded in causal mediation analysis. Specifically, we describe the history and current state of interpretability taxonomized according to the types of causal units (mediators) employed, as well as methods used to search over mediators. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate depending on the goals of a given study. We argue that this framing yields a more cohesive narrative of the field, as well as actionable insights for future work. Specifically, we recommend a focus on discovering new mediators with better trade-offs between human-interpretability and compute-efficiency, and which can uncover more sophisticated abstractions from neural networks than the primarily linear mediators employed in current work. We also argue for more standardized evaluations that enable principled comparisons across mediator types, such that we can better understand when particular causal units are better suited to particular use cases.
△ Less
Submitted 2 August, 2024;
originally announced August 2024.
-
NNsight and NDIF: Democratizing Access to Foundation Model Internals
Authors:
Jaden Fiotto-Kaufman,
Alexander R Loftus,
Eric Todd,
Jannik Brinkmann,
Caden Juang,
Koyena Pal,
Can Rager,
Aaron Mueller,
Samuel Marks,
Arnab Sen Sharma,
Francesca Lucchetti,
Michael Ripa,
Adam Belfki,
Nikhil Prakash,
Sumeet Multani,
Carla Brodley,
Arjun Guha,
Jonathan Bell,
Byron Wallace,
David Bau
Abstract:
The enormous scale of state-of-the-art foundation models has limited their accessibility to scientists, because customized experiments at large model sizes require costly hardware and complex engineering that is impractical for most researchers. To alleviate these problems, we introduce NNsight, an open-source Python package with a simple, flexible API that can express interventions on any PyTorch…
▽ More
The enormous scale of state-of-the-art foundation models has limited their accessibility to scientists, because customized experiments at large model sizes require costly hardware and complex engineering that is impractical for most researchers. To alleviate these problems, we introduce NNsight, an open-source Python package with a simple, flexible API that can express interventions on any PyTorch model by building computation graphs. We also introduce NDIF, a collaborative research platform providing researchers access to foundation-scale LLMs via the NNsight API. Code, documentation, and tutorials are available at https://www.nnsight.net.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
Function Vectors in Large Language Models
Authors:
Eric Todd,
Millicent L. Li,
Arnab Sen Sharma,
Aaron Mueller,
Byron C. Wallace,
David Bau
Abstract:
We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs). Using causal mediation analysis on a diverse range of in-context-learning (ICL) tasks, we find that a small number attention heads transport a compact representation of the demonstrated task, which we call a function vector (FV). FVs are…
▽ More
We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs). Using causal mediation analysis on a diverse range of in-context-learning (ICL) tasks, we find that a small number attention heads transport a compact representation of the demonstrated task, which we call a function vector (FV). FVs are robust to changes in context, i.e., they trigger execution of the task on inputs such as zero-shot and natural text settings that do not resemble the ICL contexts from which they are collected. We test FVs across a range of tasks, models, and layers and find strong causal effects across settings in middle layers. We investigate the internal structure of FVs and find while that they often contain information that encodes the output space of the function, this information alone is not sufficient to reconstruct an FV. Finally, we test semantic vector composition in FVs, and find that to some extent they can be summed to create vectors that trigger new complex tasks. Our findings show that compact, causal internal vector representations of function abstractions can be explicitly extracted from LLMs. Our code and data are available at https://functions.baulab.info.
△ Less
Submitted 25 February, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
CUREE: A Curious Underwater Robot for Ecosystem Exploration
Authors:
Yogesh Girdhar,
Nathan McGuire,
Levi Cai,
Stewart Jamieson,
Seth McCammon,
Brian Claus,
John E. San Soucie,
Jessica E. Todd,
T. Aran Mooney
Abstract:
The current approach to exploring and monitoring complex underwater ecosystems, such as coral reefs, is to conduct surveys using diver-held or static cameras, or deploying sensor buoys. These approaches often fail to capture the full variation and complexity of interactions between different reef organisms and their habitat. The CUREE platform presented in this paper provides a unique set of capab…
▽ More
The current approach to exploring and monitoring complex underwater ecosystems, such as coral reefs, is to conduct surveys using diver-held or static cameras, or deploying sensor buoys. These approaches often fail to capture the full variation and complexity of interactions between different reef organisms and their habitat. The CUREE platform presented in this paper provides a unique set of capabilities in the form of robot behaviors and perception algorithms to enable scientists to explore different aspects of an ecosystem. Examples of these capabilities include low-altitude visual surveys, soundscape surveys, habitat characterization, and animal following. We demonstrate these capabilities by describing two field deployments on coral reefs in the US Virgin Islands. In the first deployment, we show that CUREE can identify the preferred habitat type of snapping shrimp in a reef through a combination of a visual survey, habitat characterization, and a soundscape survey. In the second deployment, we demonstrate CUREE's ability to follow arbitrary animals by separately following a barracuda and stingray for several minutes each in midwater and benthic environments, respectively.
△ Less
Submitted 20 April, 2023; v1 submitted 7 March, 2023;
originally announced March 2023.