Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning

Lei Yu Jingcheng Niu Zining Zhu Gerald Penn

Abstract

In this paper, we introduce a comprehensive reformulation of the task known as Circuit Discovery, along with DiscoGP, a novel and effective algorithm based on differentiable masking for discovering circuits. Circuit discovery is the task of interpreting the computational mechanisms of language models (LMs) by dissecting their functions and capabilities into sparse subnetworks (circuits). We identified two major limitations in existing circuit discovery efforts: (1) a dichotomy between weight-based and connection-edge-based approaches forces researchers to choose between pruning connections or weights, thereby limiting the scope of mechanistic interpretation of LMs; (2) algorithms based on activation patching tend to identify circuits that are neither functionally faithful nor complete. The performance of these identified circuits is substantially reduced, often falling to near-random levels when they are evaluated in isolation. Furthermore, the complement of the circuit (i.e., the original LM with the identified circuit removed) still retains adequate performance, indicating that essential components of a complete circuit are missed by existing methods.

DiscoGP successfully addresses the two aforementioned issues and demonstrates state-of-the-art faithfulness, completeness, and sparsity. The effectiveness of the algorithm and its novel structure open up new avenues of gathering new insights into the internal workings of generative AI.

Machine Learning, ICML

1 Introduction

Figure 1: An illustration of different circuit discovery methods. The large blocks symbolize transformer modules, such as attention heads, MLP weights, or input/output nodes. Each small block represents an individual weight parameter. Our DiscoGP algorithm enables the joint pruning of model weight parameters and connection edges, achieving state-of-the-art performance.

Large-scale Transformer language models (LMs) (Vaswani et al., 2017; Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2020; OpenAI, 2023; Touvron et al., 2023) have demonstrated their incredible capabilities in solving various natural language tasks across different fields. Yet, the exact mechanisms by which these models solve tasks remain enigmatic. Researchers in the field of interpretability therefore aim to provide human-understandable explanations of the computational mechanisms of these “black-boxed” LMs. Should the interpretation of LMs become possible, it could lead to the improvement of LMs with better controllability and performance, and even germinate the next generation of explainable artificial intelligence (XAI) systems.

Most recently, a nascent and promising thread of research known as circuit discovery has emerged (Elhage et al., 2021; Nanda et al., 2022; Conmy et al., 2023). This method views language models as computation graphs and interprets the internal workings of LMs by identifying subnetworks (circuits) that explain the original model’s capability on solving certain tasks. In our opinion, circuit-based LM interpretation can serve as the basic unit of LM interpretation, and provides a new direction towards XAI by dissecting LMs.

Nevertheless, existing circuit discovery efforts face two major challenges. First, researchers must choose between identifying important model weights or recognizing essential connections between model components, which cannot be localized together by prior circuit discovery methods. As previous interpretability research has highlighted, components such as attention heads often exhibit high polysemy in terms of storing parametric knowledge (Gurnee et al., 2023; Huang et al., 2024; Black et al., 2022) and interacting with other modules (Geva et al., 2023; Chan et al., 2022; Neo et al., 2024). Therefore, it is crucial to selectively reduce both the weight and connection edge complexities of language models to disentangle task-specific functionalities from general model capabilities (Cunningham et al., 2023).

Second, algorithms using activation patching tend to identify circuits that are not functionally faithful. We conducted a thorough reappraisal of activation patching methods (Geiger et al., 2021; Meng et al., 2022; Nanda, 2023; Wu et al., 2024) on a wider range of tasks and settings. Our evaluation shows that the so-called “canonical circuits” identified by previous work (Wang et al., 2022; Conmy et al., 2023) have very low functional faithfulness: their performance drops drastically to near-random levels when they are evaluated in isolation.

These difficulties underscore the inherent challenges in circuit discovery and necessitate a careful reconsideration of its definition and relevant concepts. Therefore, we redefine the two primary objectives of circuit discovery: (functional) faithfulness and (functional) completeness. Furthermore, we argue that computation graph sparsification, the common practical implementation of circuit discovery, should be conducted through pruning rather than patching during evaluation. Our comprehensive reappraisal indicates that the currently adopted notions of circuit faithfulness and completeness are excessively relaxed.

To address the two aforementioned issues, we introduce DiscoGP, a differentiable circuit discovery algorithm with joint weight and edge computation graph pruning (code and data will be publicly available online soon). Figure 1 shows an illustration of different circuit discovery methods. We place a set of learnable binary mask parameters at the weights of LM components and their interconnecting edges, and these masks can be trained together in an end-to-end manner to discover an LM computation subgraph (i.e., a circuit) with desired properties. As a result, DiscoGP solves issue (1) by having two compatible modes of computation graph pruning: weight pruning and edge pruning. DiscoGP enables joint computational graph sparsification by iterating between the two modes. Our evaluation shows DiscoGP is not only functionally faithful (it retains near-perfect performance with only a fraction of the original model’s weights and edges when performing inference in isolation) but also functionally complete (when the identified circuits are removed from the model, the performance of the model drops substantially to random levels).

Moreover, the better functional faithfulness of the algorithm and its novel structure pave the way for new approaches in LM interpretation. Here, we highlight two significant findings. First, the attention heads in the lower layers play a crucial role in executing LM functions, a role that has been largely overlooked in prior research of circuit analysis. Second, there are distinct localization characteristics between connection edges and model weight parameters. Specifically, attention weights are predominantly found in the lower layers, while connection edges are more prominent in the upper layers. Our results support the hypothesis that LM functions occur in two distinct stages (Meng et al., 2022; Geva et al., 2023; Niu et al., 2024; Hernandez et al., 2024), offering a more detailed and nuanced understanding of the process due to the comprehensive nature of DiscoGP.

In summary, our main contributions are: (1) We provide a thorough reappraisal of previous activation-patching-based circuit discovery methods and identify their limitations under a more stringent evaluation; (2) We provide a reformulation of circuit discovery through the redefinition of its primary objectives: faithfulness and completeness; (3) We propose DiscoGP, a novel and effective circuit discovery algorithm that jointly prunes the weights and edges of the computation graph and achieves state-of-the-art circuit discovery performance; (4) Using circuits discovered by DiscoGP, we enable new modes of LM interpretation, uncovering novel insights into how functions and capabilities are mechanistically implemented by LMs.

2 Circuit Discovery

We define circuit discovery and the relevant concepts in this section. Additionally, we enumerate several issues present in previous definitions and practices. Furthermore, we address how these issues are resolved in our new formulation.

2.1 Task Formulation

Computation Graph & Circuit Discovery

We can model the computational process of a neural LM as a directed acyclic graph (DAG) $G=\{E,V\}$. Following Wang et al. (2022); Goldowsky-Dill et al. (2023); Li et al. (2024b), we define each vertex $v\in V$ in the graph as an LM component node such as the input, the output, an MLP, or an attention head. Each attention head is further split into query, key, value and output nodes (Figure 2) to better capture their interactions. Each edge $e\in E$ in the graph symbolizes the information flow from one component to another, and it is implemented through residual rewrite (Nanda & Bloom, 2022). The attention head $A_{j}^{(i)}$ (the $j$th attention head on layer $i$) takes the residual $R_{i-1}$ from the previous layer as input. Since $R_{0}=I$ and $R_{i}=R_{i-1}+\sum_{j}A_{j}^{(i)}+M_{i}$, where $M_{i}$ is the $i$th layer MLP node output, we can unroll the residual term and consider attention heads to operate on the sum $S_{A}^{(i)}=I+\sum_{l<i}(M_{l}+\sum_{j}A^{(l)}_{j})$ and MLPs on a similar $S_{M}^{(i)}=I+\sum_{l<i}M_{l}+\sum_{l<i}\sum_{j}A^{(l)}_{j}$. An edge $e=(u,v)\in E$ therefore symbolizes that the node $u$ operates on the output of $v$. Under this framework, the task of circuit discovery is to identify a subnetwork $G_{T}=\{E_{T},V_{T}\},E_{T}\subseteq E,V_{T}\subseteq V$ for a particular task $T$. We evaluate tasks and functions based on the LMs’ ability to assign the highest probability to the next token candidate that is syntactically well-formed, or semantically and factually coherent.
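The unrolled residual view can be made concrete with a small sketch that enumerates the edges of a toy computation graph. The node names (`I`, `O`, `A{i}.{j}`, `M{i}`) and the function `build_edges` are hypothetical, and the sketch follows the sums exactly as written above, where both node types in layer $i$ read only from layers $l<i$. It also leaves out the query/key/value split of attention heads for brevity.

```python
def build_edges(n_layers, n_heads):
    """Return edges (u, v), meaning: node u operates on the output of node v."""

    def upstream(layer):
        # Nodes whose outputs appear in the unrolled sum for `layer`:
        # the input I plus every MLP and attention head from layers l < layer.
        nodes = ["I"]
        for l in range(1, layer):
            nodes.append(f"M{l}")
            nodes.extend(f"A{l}.{j}" for j in range(n_heads))
        return nodes

    edges = []
    for i in range(1, n_layers + 1):
        readers = [f"A{i}.{j}" for j in range(n_heads)] + [f"M{i}"]
        for u in readers:
            edges.extend((u, v) for v in upstream(i))
    # Assumption: the output node reads the final residual, i.e. every node.
    edges.extend(("O", v) for v in upstream(n_layers + 1))
    return edges


edges = build_edges(n_layers=2, n_heads=2)
print(len(edges), "edges, e.g.", edges[:3])
```

A circuit $G_{T}$ is then any subset of these edges together with the nodes they touch. An architecture in which the MLP of layer $i$ also reads the heads of the same layer would add those edges to `upstream` for MLP readers.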

Computation Graph Knockout

What happens to an LM component or a connection edge when it is “turned off”? This is not a trivial question, and previous work has adopted drastically different definitions that lead to different results. Previous work coined the term knockout to refer to the action of removing and disabling an LM module or connection (Wang et al., 2022; Chan et al., 2022; Geva et al., 2023). Generally speaking, there are two distinct types of computation graph knockout: patching-based knockout and pruning-based knockout.

Activation patching, also known as path patching or simply patching, has been widely applied by mechanistic interpretability researchers to identify circuits (Wang et al., 2022; Olsson et al., 2022; Goldowsky-Dill et al., 2023; Heimersheim & Janiak, 2023; Hanna et al., 2024), and has also demonstrated its potential in better aligning language models with human values (Li et al., 2024a; Hernandez et al., 2024; Turner et al., 2023). This method was later automated by ACDC (Conmy et al., 2023). Patching involves replacing part of a model’s forward pass with activations from a different input to observe the resulting influence on the model’s final output. While sometimes referred to as a “causal intervention” (Vig et al., 2020; Finlayson et al., 2021; Geiger et al., 2021), this method is more accurately described as influence attribution. Various strategies can be employed for patching: Mean ablation sets the activation to an average activation output value across a reference distribution obtained from feeding a sample dataset through the model; Interchange ablation overwrites the activation with its value from another data point within the dataset; and Random ablation replaces the input with a random value.

Another line of interpretability research utilizes network pruning (Louizos et al., 2018; Csordás et al., 2021) to locate functional modules by masking out component weights that do not contribute to model outputs. Pruning has also been widely used to find small, parameter-efficient subnetworks for NLP applications (Voita et al., 2019; Ren & Zhu, 2022; Zhang et al., 2021; Zhao et al., 2020; Wang et al., 2020; Bayazit et al., 2023). We view pruning as the most stringent method for performing knockout, as it completely eliminates the informational content of the targeted module and is equivalent to activation patching with zero values. In comparison, intervention-based patching does not comprehensively remove the influence of model components, as the patched activation values often still retain vestiges of information about the original input. We therefore disagree with the criticism of zero ablation by Chan et al. (2022), Wang et al. (2022), and Conmy et al. (2023), namely that it takes the model too far away from actually possible activation distributions and hence should not be used. To the best of our knowledge, we are the first contemporary circuit discovery study to utilize a strict pruning-based analysis of the LM computational graph.
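The knockout variants discussed above differ only in what value replaces an activation. The following is a minimal sketch on plain Python lists. All function names are hypothetical, and the Gaussian used for the random variant is an assumption, since the text does not specify the random distribution.

```python
import random
from statistics import fmean


def mean_ablation(reference_acts):
    """Per-dimension mean of the activation over a reference dataset."""
    return [fmean(dim) for dim in zip(*reference_acts)]


def interchange_ablation(reference_acts, other_index):
    """Activation recorded on another data point of the same dataset."""
    return list(reference_acts[other_index])


def random_ablation(dim, rng):
    """Random replacement values (standard Gaussian is an assumption here)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def zero_ablation(dim):
    """Pruning-style knockout: the edge carries no information at all."""
    return [0.0] * dim


acts = [[1.0, 2.0], [3.0, 4.0]]  # activations of one edge on two prompts
print(mean_ablation(acts), interchange_ablation(acts, 1), zero_ablation(2))
```

Only `zero_ablation` is independent of any model input, which is the sense in which pruning removes all informational content of the knocked-out component.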

2.2 Evaluation and Objective of Circuit Discovery

Wang et al. (2022) proposed three criteria to evaluate a circuit: faithfulness (the circuit can perform the task as well as the original model), completeness (the circuit contains all the nodes used to perform the task) and minimality (sparsity) (the circuit should be as small as possible). While we strongly appreciate Wang et al.’s (2022) theoretical framing of these criteria, we have identified several limitations in their practical implementations of faithfulness and completeness. The same issue persists in most patching-based interpretability research that adopted the same empirical framework (Chan et al., 2022; Conmy et al., 2023; Ghandeharioun et al., 2024). In this section, we elaborate on these limitations and propose a new, more comprehensive reformulation.

Functional Faithfulness

Again, faithfulness refers to the circuit’s ability to perform the task ( $T$ ) in isolation. However, in practice, Wang et al. (2022) measured faithfulness by computing the average difference in the unnormalized output logits between the correct token and an incorrect option in the test set. This evaluation method has two major issues. First, raw logit differences without normalization (typically softmax) do not accurately represent how language models are used in practice. Second, without normalization, the difference measure can be misleading due to the influence of outliers, and this issue is exacerbated when the average is taken.

Conmy et al. (2023) compute the KL divergence across the entire output vocabulary between the circuit and the original model, using this metric to assess faithfulness. While this practice is standard in model compression and knowledge distillation (Kim et al., 2021), we argue that it is not appropriate for circuit discovery. A circuit is expected to retain its performance only on the specified task, with other irrelevant capabilities and behaviours potentially differing from the original model. Computing KL divergence across the entire output vocabulary diminishes the focus on the primary task and unfairly penalizes the circuit for reduced performance in unrelated areas.

We emphasize that both Wang et al. (2022) and Conmy et al. (2023) conducted their main evaluations using patching rather than pruning. Therefore, their circuit discovery methods have only been assessed in this less stringent manner. As previously discussed and as we will demonstrate later, patching merely introduces perturbations to the model, retaining the majority of the original model’s information. Although precise statistics were not provided, Wang et al. (2022) noted that the use of pruning yielded “noisy results in practice.” We believe this outcome reflects the challenges inherent in their circuit discovery method when subjected to more rigorous testing conditions, rather than an inherent flaw in the pruning technique itself.

We propose using the original task evaluation to measure faithfulness, specifically by directly computing task accuracy. To avoid any confusion in terminology, we term this measure functional faithfulness. This measure best reflects the original definition of faithfulness: whether the circuit can perform the task. Our evaluation suggests that previous patching-based circuit discovery methods exhibit poor functional faithfulness, as we will demonstrate in Section 5.

Functional Completeness

In practice, Wang et al. (2022) defined completeness as $|F(C\setminus K)-F(M\setminus K)|$ for every subset $K\subseteq C$. However, calculating this metric directly is intractable, leading to the use of random sampling of subsets, which provides an unreliable approximation. In our work, we follow De Cao et al. (2022) and Bayazit et al. (2023) and compute completeness as the task performance of the complement of the circuit. That is, we want the model to perform poorly on the task when the circuit is removed.
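Both measures reduce to task accuracy, computed once for the isolated circuit and once for its complement. The sketch below assumes each model is exposed as a callable that maps a prompt to a predicted label. The predictor stubs and the helper names are hypothetical.

```python
def accuracy(predict, dataset):
    """Fraction of (prompt, label) pairs on which `predict` is correct."""
    correct = sum(1 for prompt, label in dataset if predict(prompt) == label)
    return correct / len(dataset)


def functional_scores(circuit_predict, complement_predict, dataset):
    """Functional faithfulness (higher is better) and completeness (lower is better)."""
    return {
        "faithfulness": accuracy(circuit_predict, dataset),
        "completeness": accuracy(complement_predict, dataset),
    }


# Toy stand-ins: a circuit that always answers correctly and a complement
# that always outputs the same label on a balanced binary task.
toy_data = [("a", 1), ("b", 0), ("c", 1), ("d", 0)]
answers = dict(toy_data)
scores = functional_scores(lambda p: answers[p], lambda p: 1, toy_data)
print(scores)
```

On a balanced binary task, a complement accuracy near 0.5 is the chance level that a complete circuit should leave behind.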

2.3 Survey: Previous Circuit Discovery Methods

SP: Subnetwork Pruning through Weight Masking

SP methods (Louizos et al., 2018; Cao et al., 2021; Sanh et al., 2020; De Cao et al., 2022) learn a binary mask for every weight parameter of the internal model components (such as attention heads and MLPs), using an objective that encourages high accuracy and high sparsity of the pruned subnetwork consisting of unmasked weights after training. However, SP cannot remove a node component from the LM computational graph unless all of its weights are masked out. Moreover, as each component is still connected to its neighboring nodes after pruning, SP cannot discover a circuit with low edge-level complexity either.

HISP: Head Importance Score for Attention Head Pruning

HISP methods (Voita et al., 2019; Michel et al., 2019) assign a shared binary mask to all weights in an attention head, so that a head is either fully preserved or completely removed from the computational graph after mask learning. In our framework, we extend HISP to enable masking of MLP nodes in each layer as well. However, HISP reduces edge-level complexity only by eliminating all edges pointing to a masked component; it cannot selectively remove connections of a preserved node.

ACDC: Greedy Patching-based Circuit Discovery

Conmy et al. (2023) systematize a common workflow of recent mechanistic interpretability research for finding task-specific circuits (Nanda et al., 2022; Wang et al., 2022; Conmy et al., 2023; Olsson et al., 2022; Hanna et al., 2024; Heimersheim & Janiak, 2023; Goldowsky-Dill et al., 2023). ACDC starts from the last transformer layer and iteratively searches for key model components with the highest influence on nodes in upper layers through edge connections. The influence of an edge is measured through the aforementioned patching practice. As we have explained in Section 2.1 and shall demonstrate in the results sections, patching-based methods such as ACDC fail to identify a circuit that functions as the original model when isolated, and therefore do not guarantee functional faithfulness.

3 DiscoGP: Differentiable Computational Graph Pruning

Given the limitations of existing circuit discovery methods introduced above, we propose DiscoGP, an algorithm that finds, from the transformer computational graph, a sparse set of nodes and edges that forms a task-specific subnetwork behaving similarly to the full model.

Computational graph pruning

Consider the computational graph $G_{f}(V_{\bm{\theta}},E)$ of a neural network $f(x)$ that takes $x$ as input (e.g., a sequence of tokens) and returns a probability distribution $p_{f}(y|x)$ over discrete output labels $y$ (e.g., the vocabulary index of the predicted next token). DiscoGP learns two sets of binary mask parameters $\bm{m}=(\bm{m}_{\bm{\theta}},\bm{m}_{E})\in\{0,1\}^{|\bm{\theta}|+|E|}$ that are element-wise multiplied with node component weights and edge connection parameters, resulting in a circuit that represents a function ${c}_{\bm{m}}(x)$ with a computational graph $G_{f}(V_{\bm{\theta}\odot\bm{m}_{\bm{\theta}}},E\odot\bm{m}_{E})$. Similar to existing work on differentiable mask learning (Louizos et al., 2018; Csordás et al., 2021; Cao et al., 2021; De Cao et al., 2022; Bayazit et al., 2023), DiscoGP models each mask $m_{i}\in\bm{m}$ as a random variable with a hard-concrete or Gumbel-sigmoid distribution. In particular, we first compute a continuous score $s_{i}\in[0,1]$ in the following way:

$$s_{i}=\sigma\Big(\frac{l_{i}-\log\frac{\log\mathcal{U}_{1}}{\log\mathcal{U}_{2}}}{\tau}\Big);\quad\mathcal{U}_{1},\mathcal{U}_{2}\sim\text{Uniform}(0,1), \tag{1}$$

where $\tau\in(0,\infty)$ is a temperature hyperparameter, $l_{i}$ a learnable logit parameter of a sigmoid distribution $\sigma(\cdot)$, and $\mathcal{U}_{1},\mathcal{U}_{2}$ are variables drawn from a uniform distribution. We then apply the straight-through estimator (Bengio et al., 2013) to cast the sampled $s_{i}$ into a binary mask variable:

$$m_{i}=[\mathds{1}_{s_{i}>0.5}-s_{i}]_{\text{detach}}+s_{i}, \tag{2}$$

where $\mathds{1}$ is the indicator function and $[\cdot]_{\text{detach}}$ is an operator that prevents backward gradient flow. In this way, the resulting binary mask $m_{i}$ is a differentiable function of the logit $l_{i}$, which can then be optimized through backpropagation on certain objectives. We can therefore implement the SP baseline as a special case of DiscoGP by setting $m=1$ for all $m\in\bm{m}_{E}$. To implement HISP, we can simply force all weight masks $\bm{m}_{v}$ of a node $v\in V$ to have the same value.
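Equations 1 and 2 can be traced numerically without an autograd framework. The sketch below samples a score as in Equation 1 and then reports the two quantities that the straight-through estimator produces: the hard forward value of the mask and the gradient that flows back to the logit, which is $\partial s_{i}/\partial l_{i}=s_{i}(1-s_{i})/\tau$ because the detached term contributes nothing. The function names and the clamping constant are illustrative assumptions. In an autograd framework the same effect is obtained by detaching the bracketed term.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_score(logit, tau, rng):
    """Equation 1: relaxed mask score for one logit and temperature tau."""
    # Clamp the uniform draws away from 0 and 1 so both logs are finite and nonzero.
    u1 = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    u2 = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(math.log(u1) / math.log(u2))
    return sigmoid((logit - noise) / tau)


def straight_through(score, tau):
    """Equation 2: hard forward value, with the gradient of the soft score."""
    hard = 1.0 if score > 0.5 else 0.0
    forward = (hard - score) + score  # bracketed term is treated as a constant
    grad_wrt_logit = score * (1.0 - score) / tau  # d s_i / d l_i
    return forward, grad_wrt_logit


rng = random.Random(0)
s = sample_score(logit=0.0, tau=0.5, rng=rng)
print(s, straight_through(s, tau=0.5))
```

A strongly positive logit yields a mask that is open for essentially every noise draw, while the gradient stays nonzero whenever the score is strictly between 0 and 1.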

Differentiable circuit search objectives

Given a task dataset $\mathcal{D}=\{x^{(i)},\hat{y}^{(i)}\}$, where $x^{(i)}$ is the input and $\hat{y}^{(i)}=\text{argmax}_{y}p_{f}(y|x^{(i)})$ is the model predicted label, we wish to find a set of masks $\bm{m}$ on the weight parameters of transformer computational graph nodes and their connections, so that the predicted label of the masked subnetwork on each $x^{(i)}\in\mathcal{D}$ is identical to the full model prediction $\hat{y}^{(i)}$. We therefore define the following (functional) faithfulness loss as the negative log-likelihood of the full model predicted label in the output distribution of the pruned circuit:

$$\mathcal{L}_{\text{faith}}=-\sum_{i}\log p_{c_{\bm{m}}}(\hat{y}^{(i)}|x^{(i)}). \tag{3}$$

Moreover, we would like to ensure that we have located all task-specific node components and edge connections: if we sever the identified circuit from $G$, the remaining computational graph should yield near-random performance on $\mathcal{D}$. In particular, let $\tilde{\bm{m}}=1-\bm{m}$ be the reverse mask of $\bm{m}$, and let $c_{\tilde{\bm{m}}}$ be the resulting complementary circuit of $c$ after applying $\tilde{\bm{m}}$ on $G$. We define the following completeness loss as the cross entropy between the complementary circuit's output distribution and a uniform distribution over the label space $\{y_{k}\}_{k=1}^{K}$:

$$\mathcal{L}_{\text{complete}}=-\sum_{i}\sum_{k=1}^{K}\frac{1}{K}\log p_{c_{\tilde{\bm{m}}}}(y_{k}|x^{(i)}). \tag{4}$$

Lastly, we introduce the following sparsity loss as the density of node weights and edge connections to remove as many task-irrelevant computational graph components as possible:

$$\begin{aligned}\mathcal{L}_{\text{sparse}}&=\mathcal{L}_{\text{sparse}-\bm{\theta}}+\mathcal{L}_{\text{sparse}-E}\\ &=\frac{1}{|\bm{m}_{\bm{\theta}}|}\sum_{i=1}^{|\bm{m}_{\bm{\theta}}|}\sigma(l_{i})+\frac{1}{|E|}\sum_{i=1}^{|\bm{m}_{E}|}\sigma(l_{i}).\end{aligned} \tag{5}$$

The final objective function is then a weighted mixture of the three loss terms:

$$\mathcal{L}_{\text{GP}}=\mathcal{L}_{\text{faith}}+\lambda_{c}\mathcal{L}_{\text{complete}}+\lambda_{s}\mathcal{L}_{\text{sparse}}, \tag{6}$$

where $\lambda_{c},\lambda_{s}$ are hyperparameters that regulate relative loss importance.
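The three terms of Equations 3 to 6 can be evaluated directly once the output distributions of the circuit and of its complement are available. The following sketch does so on plain Python lists. It assumes each distribution is given over the $K$ candidate labels only, and all function and argument names are hypothetical.

```python
import math


def faithfulness_loss(circuit_probs, model_labels):
    """Equation 3: negative log-likelihood of the full model's predicted labels."""
    return -sum(math.log(p[y]) for p, y in zip(circuit_probs, model_labels))


def completeness_loss(complement_probs):
    """Equation 4: cross entropy between a uniform target and the complement output."""
    total = 0.0
    for p in complement_probs:
        total -= sum(math.log(p_k) for p_k in p) / len(p)
    return total


def sparsity_loss(weight_logits, edge_logits):
    """Equation 5: mean sigmoid(logit) over weight masks plus the same over edge masks."""
    def density(logits):
        return sum(1.0 / (1.0 + math.exp(-l)) for l in logits) / len(logits)
    return density(weight_logits) + density(edge_logits)


def discogp_loss(circuit_probs, model_labels, complement_probs,
                 weight_logits, edge_logits, lambda_c, lambda_s):
    """Equation 6: weighted mixture of the three terms."""
    return (faithfulness_loss(circuit_probs, model_labels)
            + lambda_c * completeness_loss(complement_probs)
            + lambda_s * sparsity_loss(weight_logits, edge_logits))


total = discogp_loss([[0.9, 0.1]], [0], [[0.5, 0.5]], [0.0], [0.0],
                     lambda_c=1.0, lambda_s=1.0)
print(total)
```

For a fixed $K$, the completeness term is minimized when the complement's output is uniform, where it equals $\log K$ per example.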

Graph pruning after mask search

Since our training objective does not include graph connectivity, after learning a set of masks by optimizing Equation 6, we can further reduce the number of circuit components by running a graph search to remove all unmasked nodes and edges that are not reachable from either the output or the input node, as these graph components no longer affect the predictions of the pruned circuit. Similarly, we can also perform post-hoc graph pruning for a weight-only pruning method (e.g., Subnetwork Probing) by removing nodes (and their associated edges) whose weight density drops to zero after applying the learned weight masks.
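One way to realize this post-hoc step is a pair of breadth-first searches. The sketch below keeps only the edges whose endpoints lie on some path from the input node to the output node. This is one reading of the reachability criterion and is an assumption, as are the node names and the convention that edges are listed in the direction of information flow, as (producer, consumer) pairs.

```python
from collections import deque


def reachable(start, adjacency):
    """All nodes reachable from `start` by following `adjacency`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def prune_disconnected(edges, source="I", sink="O"):
    """Drop unmasked edges that do not lie on any source-to-sink path."""
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
    keep = reachable(source, forward) & reachable(sink, backward)
    return [(s, d) for s, d in edges if s in keep and d in keep]


# B receives input but never reaches the output; C feeds the output but
# receives nothing from the input. Both are removed.
unmasked = [("I", "A"), ("A", "O"), ("I", "B"), ("C", "O")]
print(prune_disconnected(unmasked))
```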

Table 1: An overview of the tasks and datasets.

Dataset	LM Function	Example Prompt	Correct Solution	Incorrect Solution
BLiMP	Syntactic	Many girls insulted	themselves	herself
IOI	Semantic	When Mary and John went to the store, John gave a drink to	Mary	John
OQA	Factual	The capital city of Canada is	Ottawa	*not unique

4 Experimental Setup of Circuit Discovery

We use GPT-2 small as our LM for circuit discovery, as it has been extensively studied by the mechanistic interpretability research community on similar application tasks. For each circuit discovery task, we ran DiscoGP to find a set of weight and edge mask logits that optimize the learning objective defined in Equation 6. We also evaluated the following baseline methods under the same protocol: (1) HISP, (2) SP, (3) ACDC, and (4) an “edge-only” version of DiscoGP that only learns masks for computational graph edges without pruning node weights. We reproduce HISP and SP following the formulations introduced in Section 3. For ACDC, we use the implementation released by Conmy et al. (2023). We learn masks for all weights, edges and nodes except for the input embedding and the output unembedding layers. For each method, we report test set results of the discovered circuit with the highest validation set accuracy among all saved checkpoints. For ACDC, we identify circuits with various pruning thresholds $\tau$ and select the one with an edge density comparable to that of DiscoGP.

We evaluate DiscoGP and the baselines across three tasks covering various linguistic and world knowledge. The tasks are well-established within the mechanistic interpretability community. Table 1 provides an overview of the tasks and datasets. See Appendix A for additional details of the datasets.

1. BLiMP Syntactic Agreement: We use the minimal pair data from BLiMP (Warstadt et al., 2020) to study how syntactic phenomena are expressed by LMs. There has been an extensive line of research on understanding how neural LMs represent and apply syntactic knowledge (Lakretz et al., 2019; Hu et al., 2020; Finlayson et al., 2021). Since we focus on decoder-only LMs (GPT-2 small), we select syntactic agreement paradigms where the target words are located at the end of the sequence.

2. Indirect Object Identification (IOI): The task of IOI (Wang et al., 2022) is one of the most well-known mechanistic interpretability benchmarks. As shown by the example in Table 1, the model should complete the main clause by generating one of the two names (the indirect object (IO) or the subject (S1)). We follow Wang et al. (2022) and use the “BABA” templates to generate 1,280 examples for each evaluation.

3. Open-Domain Question Answering (OQA): The PARAREL dataset (Elazar et al., 2021) was created for studies investigating how LMs encode and process world knowledge (Petroni et al., 2019; Safavi & Koutra, 2021; Carlini et al., 2022; Bayazit et al., 2023). The dataset contains facts formulated as a fill-in-the-blank cloze task, as demonstrated by the example in Table 1. We generate prompts using 9,543 triples taken from 12 out of 38 PARAREL relations that have a unique object answer at the end of the generated prompt and can therefore be predicted by an autoregressive LM. See Appendix A for additional details.

Table 2: Circuit discovery results.

Syntactic Agreement (SA)
Circuit Discovery Method	Weight Density (%)	Node Density (%)	Edge Density (%)	KL Div.	Func. Faith.	Func. Comp.
Weight pruning (HISP)	68.09	38.5	100	0.1132	95.7	43.9
Weight pruning (SP)	3.21	72.7	100	0.0782	96.5	39.0
Edge patching (ACDC)	100	28.3	3.02	0.1331	85.2	54.6
Edge pruning (DiscoGP)	100	17.2	2.87	0.0757	95.5	21.0
Joint pruning (DiscoGP)	3.08	16.9	2.98	0.0569	98.0	26.0
Indirect Object Identification (IOI)
Weight pruning (HISP)	71.05	83.9	100	0.0502	99.2	57.8
Weight pruning (SP)	1.87	93.6	100	0.0438	98.4	50.0
Edge patching (ACDC)	100	28.8	2.45	0.7305	51.6	50.6
Edge pruning (DiscoGP)	100	19.0	2.97	0.0322	100	57.5
Joint pruning (DiscoGP)	1.79	17.3	2.03	0.0204	100	49.2
Open-Domain Question Answering (OQA)
Weight pruning (HISP)	73.20	86.3	100	0.0105	96.1	0.61
Weight pruning (SP)	3.58	96.4	100	0.0041	98.2	0.68
Edge patching (ACDC)	100	85.0	2.55	0.0500	0.1	5.00
Edge pruning (DiscoGP)	100	84.4	3.15	0.0046	96.6	0.34
Joint pruning (DiscoGP)	3.25	80.8	2.09	0.0026	98.5	0.31

5 Experiment Results

Table 2 summarizes our experiment results. We evaluated the functional faithfulness of each circuit by measuring its test accuracy on the specified task. Functional completeness was assessed by determining the test accuracy of the “complementary circuit,” which was derived using the reverse node and edge masks. Additionally, we included the KL divergence between the output label distributions (not the full output vocabulary) to facilitate comparison with previous studies. Furthermore, we provided the weight, node, and edge density metrics for each discovered circuit, which are defined as the percentage of weights, nodes, or edges with assigned masks that remain open in the circuit.
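The density and KL columns can be illustrated with a short sketch. Density is the percentage of mask entries that remain open. For the KL term, the sketch restricts both distributions to the candidate labels and renormalizes before computing KL(model || circuit). The renormalization and the direction of the divergence are assumptions, since the text only states that the label distributions are compared, and the function names are hypothetical.

```python
import math


def density(mask):
    """Percentage of binary mask entries that remain open (equal to 1)."""
    return 100.0 * sum(mask) / len(mask)


def label_kl(model_probs, circuit_probs, label_ids):
    """KL(model || circuit) over the candidate labels only."""
    def restrict(probs):
        z = sum(probs[i] for i in label_ids)
        return [probs[i] / z for i in label_ids]

    p, q = restrict(model_probs), restrict(circuit_probs)
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


# Toy vocabulary of four tokens; tokens 0 and 1 are the two candidate labels.
model = [0.6, 0.2, 0.1, 0.1]
circuit = [0.3, 0.3, 0.2, 0.2]
print(density([1, 0, 0, 1]), label_kl(model, circuit, [0, 1]))
```

Restricting to the candidate labels means that differences on unrelated vocabulary items do not count against the circuit.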

Key findings from our study include: (1) Subnetwork Pruning fails to reduce the node or edge complexity of GPT-2 small, indicating that weight-only pruning methods cannot uncover “circuits” with simplified computational graph structures. (2) HISP is less effective than DiscoGP in reducing weight and edge density, as it only assigns node-level masks and cannot eliminate task-irrelevant model parameters and edge connections. (3) ACDC identifies circuits with low complexity but at the expense of functional faithfulness, resulting in the lowest test accuracy among all baselines. For the IOI and OQA tasks, ACDC’s circuit performance is comparable to random baselines. Moreover, it produces significantly higher KL-divergence compared to other methods. (4) In contrast, DiscoGP not only discovers structurally sparse task-specific circuits with low weight density but also maintains model performance, achieving near-perfect test accuracy across all three datasets.

DiscoGP optimizes the complexity-faithfulness trade-off

DiscoGP demonstrates superior efficiency compared to baseline circuit discovery methods by effectively balancing the trade-off between circuit complexity and functional faithfulness. Figure 3 illustrates the IOI test accuracy of circuits discovered by each method, plotted against varying numbers of nodes and edges. To generate a series of pruned subnetworks with increasing complexity in nodes and edges, we adjusted the hyperparameters controlling circuit sparsity for each method. Our findings indicate that the functional faithfulness of circuits identified by baseline methods deteriorates substantially faster than those discovered by DiscoGP as the subgraphs become sparser. In contrast, the accuracy of circuits identified by DiscoGP remains nearly perfect until the edge density reaches approximately 1% and the node density about 15%.

Edge patching is far from edge pruning

To better understand why patching-based methods such as ACDC fail to discover functionally faithful circuits, we run GPT-2 small on the corrupted versions of each of the three evaluation benchmarks by ablating the key input tokens that directly affect the answer (see Appendix B for how we construct corrupted inputs).

Table 3: Mean cosine similarity between clean and corrupted edge hidden representations on three application datasets.

Task	Mean ablation	Interchange ablation	Random ablation
Agreement	0.878	0.907	0.582
IOI	0.943	0.996	0.597
OQA	0.951	0.960	0.556

We then compute the average cosine similarity between the original and corrupted hidden representations propagated through each edge. For comparison, we also compute the average similarity for a random ablation, where the input prompt is replaced with a completely different sentence. The results are shown in Table 3. For both the mean and interchange ablations used in existing patching studies, the hidden representation on each edge remains very similar to its clean counterpart (cosine similarity of at least 0.878, versus at most 0.597 under random ablation). Consequently, edges “knocked out” of a circuit by patching-based methods may still propagate a substantial amount of task-related information to the output node. As previously discussed, evaluation results based on patching rather than pruning can therefore be misleading, as they tend to be overly lenient.
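The similarity measurement described above can be illustrated with a minimal sketch. It assumes that the clean and corrupted hidden representations have already been extracted as one vector per edge; the function names and the toy vectors are hypothetical and are not taken from our implementation.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_edge_similarity(clean: Sequence[Vector], corrupted: Sequence[Vector]) -> float:
    """Average cosine similarity over edges; clean[i] and corrupted[i] share an edge."""
    if len(clean) != len(corrupted) or len(clean) == 0:
        raise ValueError("need one clean and one corrupted vector per edge")
    sims = [cosine_similarity(c, k) for c, k in zip(clean, corrupted)]
    return sum(sims) / len(sims)


# Toy example with two edges (hypothetical values, not real activations).
clean_edges = [[1.0, 0.0], [0.0, 2.0]]
corrupted_edges = [[2.0, 0.0], [1.0, 1.0]]
mean_similarity = mean_edge_similarity(clean_edges, corrupted_edges)
```

A value close to 1 means that the corrupted run still carries almost the same signal along the edge as the clean run, which is the situation Table 3 reports for mean and interchange ablation.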

6 Analysis and Findings

Table 4: Edge and weight overlap across different circuits. The results indicate a trend where similar tasks exhibit higher circuit overlaps. In the table, each overlap percentage is followed by the exact number of overlapping elements in parentheses.

Circuit 1	Circuit 2	Edge Overlap	Weight Overlap
AGA	DNA	14.86% (251)	2.69% (8020)
ANA	DNA	16.19% (277)	1.12% (14816)
ANA	AGA	18.32% (266)	0.91% (17693)
DNA	DNA irr	21.07% (317)	4.72% (69364)
DNA	DNA adj	18.46% (332)	4.96% (74782)
DNA	DNA irr adj	18.24% (323)	6.06% (96727)

Circuit similarity reflects functional similarity

Table 4 reports the overlap between different circuits. The overlap percentages are calculated by dividing the number of overlapping elements by the size of the union of the two masks. In this analysis, we only considered the agreement tasks, as their task similarity is easier to perceive. BLiMP offers several variants of the Determiner-Noun Agreement (DNA) task, and we observed a relatively high level of circuit overlap in terms of weights and edges among them. The Anaphor Number Agreement (ANA) and Anaphor Gender Agreement (AGA) tasks are more similar to each other than to the DNA tasks, as ANA and AGA follow similar templates (see Appendix A). This similarity is reflected in the level of edge overlap. Curiously, the weight overlap between the AGA and the ANA circuits is low. We conjecture that this distinction between weight and edge overlap is due to the different roles they play: weights store information, while edges guide the function of the task. While ANA and AGA share similar templates (and therefore exhibit higher edge overlap), performing the task requires distinct parametrized information (resulting in lower weight overlap).
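The overlap measure used in Table 4 (intersection size divided by union size, i.e., the Jaccard index) can be sketched as follows. The sketch assumes that each circuit mask is represented as a set of retained edge or weight identifiers; the identifiers shown are hypothetical placeholders.

```python
from typing import Hashable, Iterable, Tuple


def mask_overlap(mask_a: Iterable[Hashable], mask_b: Iterable[Hashable]) -> Tuple[float, int]:
    """Return (overlap ratio, overlap count) for two circuit masks.

    The ratio is |A intersect B| / |A union B|; it is 0.0 when both masks are empty.
    """
    a, b = set(mask_a), set(mask_b)
    shared = a & b
    union = a | b
    if not union:
        return 0.0, 0
    return len(shared) / len(union), len(shared)


# Toy edge masks for two circuits (hypothetical edge names).
edges_task1 = {("a0.1", "m0"), ("m0", "a1.3"), ("a1.3", "out")}
edges_task2 = {("a0.1", "m0"), ("a0.2", "m1"), ("a1.3", "out"), ("m1", "out")}
ratio, count = mask_overlap(edges_task1, edges_task2)
```

In this toy case two edges are shared out of five distinct edges, so an entry in the style of Table 4 would read 40.00% (2).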

Unfaithful IOI circuits “in the wild”

Our experimental results indicate that the previously identified IOI circuits are not functionally faithful. This discrepancy may arise because patching-based methods overlook critical components of the computation graph. Figure 4 displays the attention head distributions of IOI circuits identified by edge-only DiscoGP and ACDC, the latter corresponding to the “canonical” IOI circuit (Wang et al., 2022). We observe that DiscoGP retains more attention heads in the lower layers while excluding many middle and upper layer heads that were identified by ACDC. Additionally, we found that all MLP sublayers were selected by DiscoGP, indicating their significant functional roles, which have been largely neglected in previous IOI circuit analyses.

To elaborate on this analysis, we computed the accuracy drops of the model on the IOI dataset after removing the circuit attention heads discovered by DiscoGP and ACDC in each transformer layer, as illustrated in Figure 5. Our findings indicate that across most layers, ablating the circuit heads identified by DiscoGP results in a significantly greater accuracy drop than removing those discovered by ACDC. This effect is particularly pronounced in the lower transformer layers, which are most influential to model prediction due to residual connections. These results suggest that the “canonical” IOI circuit identified by ACDC and Wang et al. (2022) has overlooked functionally essential components and is therefore not functionally faithful.

Unveiling the factual recall pipeline in GPT

By analyzing DiscoGP circuits, we confirm the theory that factual knowledge recall occurs in two distinct stages (Meng et al., 2022; Geva et al., 2023; Niu et al., 2024; Hernandez et al., 2024), i.e., the factual recall pipeline. The left panel of Figure 6 illustrates the layer-wise average number of MLP and attention weight parameters retained in the 12 relation-specific DiscoGP circuits learned from PARAREL. We observe that MLPs retain substantially more weights in the OQA circuits than attention heads do, especially in the lower transformer layers. This finding aligns with recent interpretability analyses indicating that MLP sublayers function as key-value memory for factual knowledge extraction (Geva et al., 2021). Conversely, the right panel of Figure 6 shows the number of circuit edges at each layer, detailing connections from lower-layer attention heads to current-layer MLPs (Attention to MLP) and from preceding MLPs to current-layer attention heads (MLP to Attention). Notably, the set of connections in upper layers is dominated by MLP-to-attention edges. This observation supports recent findings in mechanistic interpretability research, which suggest that attention heads play a major role in propagating the retrieved factual knowledge from early-site MLPs to upper transformer layers, thereby selecting the most relevant information for answering questions (Geva et al., 2023).

7 Conclusion

Through a comprehensive re-evaluation of prior research on circuit discovery, we have pinpointed a significant shortcoming of existing methods: the inability to provide explanations for LM capabilities that are both functionally faithful and structurally simple. To address this deficiency, we introduce DiscoGP, a differentiable algorithm that jointly prunes the weights and connections of a neural network's computation graph and achieves state-of-the-art circuit discovery results on multiple NLP tasks. Our analyses showcase how DiscoGP opens new avenues for language model interpretability, thereby enriching our understanding of the inner workings of powerful yet black-boxed state-of-the-art generative AI systems.

References

Bayazit et al. (2023) Bayazit, D., Foroutan, N., Chen, Z., Weiss, G., and Bosselut, A. Discovering knowledge-critical subnetworks in pretrained language models. arXiv preprint arXiv:2310.03084, 2023.
Bengio et al. (2013) Bengio, Y., Léonard, N., and Courville, A. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation, August 2013.
Black et al. (2022) Black, S., Sharkey, L., Grinsztajn, L., Winsor, E., Braun, D., Merizian, J., Parker, K., Guevara, C. R., Millidge, B., Alfour, G., et al. Interpreting neural networks through the polytope lens. arXiv preprint arXiv:2211.12312, 2022.
Cao et al. (2021) Cao, S., Sanh, V., and Rush, A. Low-Complexity Probing via Finding Subnetworks. In Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., and Zhou, Y. (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 960–966, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.74.
Carlini et al. (2022) Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., and Zhang, C. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2022.
Chan et al. (2022) Chan, L., Garriga-Alonso, A., Goldowsky-Dill, N., Greenblatt, R., Nitishinskaya, J., Radhakrishnan, A., Shlegeris, B., and Thomas, N. Causal scrubbing, a method for rigorously testing interpretability hypotheses. AI Alignment Forum, 2022. https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.
Conmy et al. (2023) Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. Towards automated circuit discovery for mechanistic interpretability. In Thirty-Seventh Conference on Neural Information Processing Systems, 2023.
Csordás et al. (2021) Csordás, R., van Steenkiste, S., and Schmidhuber, J. Are neural nets modular? Inspecting functional modularity through differentiable weight masks. In International Conference on Learning Representations, 2021.
Cunningham et al. (2023) Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
Dai et al. (2022) Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., and Wei, F. Knowledge Neurons in Pretrained Transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8493–8502, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.581.
De Cao et al. (2022) De Cao, N., Schmid, L., Hupkes, D., and Titov, I. Sparse interventions in language models with differentiable masking. In Bastings, J., Belinkov, Y., Elazar, Y., Hupkes, D., Saphra, N., and Wiegreffe, S. (eds.), Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 16–27, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.blackboxnlp-1.2. URL https://aclanthology.org/2022.blackboxnlp-1.2.
Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423.
Elazar et al. (2021) Elazar, Y., Kassner, N., Ravfogel, S., Ravichander, A., Hovy, E., Schütze, H., and Goldberg, Y. Measuring and Improving Consistency in Pretrained Language Models. Transactions of the Association for Computational Linguistics, 9:1012–1031, 2021. doi: 10.1162/tacl_a_00410.
Elhage et al. (2021) Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
Finlayson et al. (2021) Finlayson, M., Mueller, A., Gehrmann, S., Shieber, S., Linzen, T., and Belinkov, Y. Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1828–1843, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.144.
Geiger et al. (2021) Geiger, A., Lu, H., Icard, T., and Potts, C. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021.
Geva et al. (2021) Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495, 2021.
Geva et al. (2023) Geva, M., Bastings, J., Filippova, K., and Globerson, A. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12216–12235, 2023.
Ghandeharioun et al. (2024) Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L., and Geva, M. Patchscope: A unifying framework for inspecting hidden representations of language models. arXiv preprint arXiv:2401.06102, 2024.
Goldowsky-Dill et al. (2023) Goldowsky-Dill, N., MacLeod, C., Sato, L., and Arora, A. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.
Gurnee et al. (2023) Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D., and Bertsimas, D. Finding neurons in a haystack: Case studies with sparse probing. Transactions on Machine Learning Research, 2023.
Hanna et al. (2024) Hanna, M., Liu, O., and Variengien, A. How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. Advances in Neural Information Processing Systems, 36, 2024.
Heimersheim & Janiak (2023) Heimersheim, S. and Janiak, J. A circuit for python docstrings in a 4-layer attention-only transformer. AI Alignment Forum, 2023. https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only.
Hernandez et al. (2024) Hernandez, E., Sharma, A. S., Haklay, T., Meng, K., Wattenberg, M., Andreas, J., Belinkov, Y., and Bau, D. Linearity of relation decoding in transformer language models. In Proceedings of the 2024 International Conference on Learning Representations, 2024.
Hu et al. (2020) Hu, J., Gauthier, J., Qian, P., Wilcox, E., and Levy, R. A systematic assessment of syntactic generalization in neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1725–1744, 2020.
Huang et al. (2024) Huang, J., Wu, Z., Potts, C., Geva, M., and Geiger, A. Ravel: Evaluating interpretability methods on disentangling language model representations. arXiv preprint arXiv:2402.17700, 2024.
Kim et al. (2021) Kim, T., Oh, J., Kim, N. Y., Cho, S., and Yun, S.-Y. Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pp. 2628–2635, Montreal, Canada, August 2021. International Joint Conferences on Artificial Intelligence Organization. ISBN 978-0-9992411-9-6. doi: 10.24963/ijcai.2021/362.
Lakretz et al. (2019) Lakretz, Y., Unit, C. N., Kruszewski, G., Desbordes, T., Hupkes, D., Dehaene, S., and Baroni, M. The emergence of number and syntax units in lstm language models. In Proceedings of NAACL-HLT, pp. 11–20, 2019.
Li et al. (2024a) Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36, 2024a.
Li et al. (2024b) Li, M., Davies, X., and Nadeau, M. Circuit Breaking: Removing Model Behaviors with Targeted Ablation, January 2024b.
Louizos et al. (2018) Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through $L_0$ regularization. In International Conference on Learning Representations, 2018.
Meng et al. (2022) Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35, 2022.
Michel et al. (2019) Michel, P., Levy, O., and Neubig, G. Are sixteen heads really better than one? Advances in neural information processing systems, 32, 2019.
Nanda (2023) Nanda, N. Attribution patching: Activation patching at industrial scale. 2023. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching.
Nanda & Bloom (2022) Nanda, N. and Bloom, J. Transformerlens. https://github.com/TransformerLensOrg/TransformerLens, 2022.
Nanda et al. (2022) Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2022.
Neo et al. (2024) Neo, C., Cohen, S. B., and Barez, F. Interpreting context look-ups in transformers: Investigating attention-mlp interactions. arXiv preprint arXiv:2402.15055, 2024.
Niu et al. (2024) Niu, J., Liu, A., Zhu, Z., and Penn, G. What does the knowledge neuron thesis have to do with knowledge? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=2HJRwwbV3G.
Olsson et al. (2022) Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
OpenAI (2023) OpenAI. GPT-4 Technical Report, March 2023.
Petroni et al. (2019) Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., and Miller, A. Language Models as Knowledge Bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2463–2473, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1250.
Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog, pp. 24, 2019.
Raffel et al. (2020) Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. ISSN 1533-7928.
Ren & Zhu (2022) Ren, S. and Zhu, K. Specializing pre-trained language models for better relational reasoning via network pruning. In Findings of the Association for Computational Linguistics: NAACL 2022, pp. 2195–2207, 2022.
Safavi & Koutra (2021) Safavi, T. and Koutra, D. Relational world knowledge representation in contextual language models: A review. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1053–1067, 2021.
Sanh et al. (2020) Sanh, V., Wolf, T., and Rush, A. Movement pruning: Adaptive sparsity by fine-tuning. Advances in neural information processing systems, 33:20378–20389, 2020.
Touvron et al. (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open Foundation and Fine-Tuned Chat Models, July 2023.
Turner et al. (2023) Turner, A., Monte, M., Udell, D., Thiergart, L., and Mini, U. Steering gpt-2-xl by adding an activation vector. In AI Alignment Forum, 2023.
Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
Vig et al. (2020) Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33:12388–12401, 2020.
Voita et al. (2019) Voita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5797–5808, 2019.
Wang et al. (2022) Wang, K. R., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. In The Eleventh International Conference on Learning Representations, September 2022.
Wang et al. (2020) Wang, Z., Wohlwend, J., and Lei, T. Structured pruning of large language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6151–6162, 2020.
Warstadt et al. (2020) Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., and Bowman, S. R. BLiMP: The Benchmark of Linguistic Minimal Pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392, July 2020. ISSN 2307-387X. doi: 10.1162/tacl_a_00321.
Wu et al. (2024) Wu, Z., Geiger, A., Icard, T., Potts, C., and Goodman, N. Interpretability at scale: Identifying causal mechanisms in alpaca. Advances in Neural Information Processing Systems, 36, 2024.
Zhang et al. (2021) Zhang, X., van de Meent, J.-W., and Wallace, B. C. Disentangling representations of text by masking transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 778–791, 2021.
Zhao et al. (2020) Zhao, M., Lin, T., Mi, F., Jaggi, M., and Schütze, H. Masking as an efficient alternative to finetuning for pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2226–2241, 2020.

Table 5: Examples of the anaphor syntactic agreement datasets in BLiMP and their converted circuit discovery data.

Agreement Phenomenon	Good sentence	Bad sentence	Converted input query	True answer	False answer
Anaphor Gender Agreement	Katherine can’t help herself.	Katherine can’t help himself.	Katherine can’t help	herself	himself
Anaphor Number Agreement	Susan revealed herself.	Susan revealed themselves.	Susan revealed	herself	themselves

Table 6: Sentence templates for generating the IOI dataset.

Templates
Then, [B] and [A] went to the [PLACE]. [B] gave a [OBJECT] to [A]
Then, [B] and [A] had a lot of fun at the [PLACE]. [B] gave a [OBJECT] to [A]
Then, [B] and [A] were working at the [PLACE]. [B] decided to give a [OBJECT] to [A]
Then, [B] and [A] were thinking about going to the [PLACE]. [B] wanted to give a [OBJECT] to [A]
Then, [B] and [A] had a long argument, and afterwards [B] said to [A]
After [B] and [A] went to the [PLACE], [B] gave a [OBJECT] to [A]
When [B] and [A] got a [OBJECT] at the [PLACE], [B] decided to give it to [A]
When [B] and [A] got a [OBJECT] at the [PLACE], [B] decided to give the [OBJECT] to [A]
While [B] and [A] were working at the [PLACE], [B] gave a [OBJECT] to [A]
While [B] and [A] were commuting to the [PLACE], [B] gave a [OBJECT] to [A]
After the lunch, [B] and [A] went to the [PLACE]. [B] gave a [OBJECT] to [A]
Afterwards, [B] and [A] went to the [PLACE]. [B] gave a [OBJECT] to [A]
Then, [B] and [A] had a long argument. Afterwards [B] said to [A]
The [PLACE] [B] and [A] went to had a [OBJECT]. [B] gave it to [A]
Friends [B] and [A] found a [OBJECT] at the [PLACE]. [B] gave it to [A]

Appendix A Additional details of circuit discovery datasets

Syntactic agreement

We distill syntax-related circuits from GPT-2 small using the Benchmark of Linguistic Minimal Pairs (BLiMP) by Warstadt et al. (2020). BLiMP consists of 67 individual datasets of minimally different sentence pairs that contrast in grammatical acceptability and isolate specific phenomena in syntax, morphology, or semantics. We use the two BLiMP datasets that assess whether a language model recognizes the English requirement of anaphor agreement, i.e., that reflexive pronouns such as himself (a.k.a. anaphors) agree with their antecedents in gender and number. Each contrasting sentence pair in the two anaphor agreement datasets differs only in the final word, the reflexive pronoun, so we convert every pair into a binary classification problem of choosing one of the two pronouns as the continuation of the shared prefix. See Table 5 for example sentence pairs for anaphor gender agreement and anaphor number agreement, together with their corresponding query prompts for circuit discovery.
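The conversion of a minimal pair into a binary classification query can be sketched as below. This is an illustrative sketch only: it assumes whitespace-separated words and a trailing period, and the names `AgreementQuery` and `convert_minimal_pair` are hypothetical.

```python
from typing import NamedTuple


class AgreementQuery(NamedTuple):
    prompt: str        # shared prefix of the two sentences
    true_answer: str   # final word of the grammatical sentence
    false_answer: str  # final word of the ungrammatical sentence


def convert_minimal_pair(good: str, bad: str) -> AgreementQuery:
    """Turn a (good, bad) sentence pair that differs only in its last word into a query."""
    good_words = good.strip().rstrip(".").split()
    bad_words = bad.strip().rstrip(".").split()
    if len(good_words) < 2 or len(bad_words) < 2:
        raise ValueError("each sentence needs a prefix and a final word")
    if good_words[:-1] != bad_words[:-1] or good_words[-1] == bad_words[-1]:
        raise ValueError("sentences must differ only in the final word")
    return AgreementQuery(" ".join(good_words[:-1]), good_words[-1], bad_words[-1])


# Example pair in the style of Table 5.
query = convert_minimal_pair("Katherine can't help herself.", "Katherine can't help himself.")
```

The resulting query has the prompt "Katherine can't help" with "herself" as the true answer and "himself" as the false answer, matching the first row of Table 5.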

Indirect object identification

Wang et al. (2022) create dataset samples for IOI using templates with random single-token names, places, and items. We follow their data curation pipeline, using the same set of 15 templates and candidate infilling words to generate our circuit discovery dataset. At each trial, we randomly draw a template and a set of infilling tokens to construct a full sentence. We then convert the generated sentence into a binary classification question, where the input prompt is the sentence prefix without the final indirect object, and the two candidate next tokens are the indirect object and the subject. See Tables 6 and 7 for the complete lists of IOI sentence templates and candidate infilling words.
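The sampling procedure can be sketched as follows. The sketch uses only a small subset of the templates and infilling words from Tables 6 and 7, and the function and class names are hypothetical; it is not the original data generation script.

```python
import random
from typing import NamedTuple

# Small subsets of Tables 6 and 7, for illustration only.
TEMPLATES = [
    "Then, [B] and [A] went to the [PLACE]. [B] gave a [OBJECT] to [A]",
    "After [B] and [A] went to the [PLACE], [B] gave a [OBJECT] to [A]",
    "Then, [B] and [A] had a long argument, and afterwards [B] said to [A]",
]
NAMES = ["Michael", "Jessica", "Matthew", "Ashley", "Jennifer", "David"]
PLACES = ["store", "garden", "restaurant", "school"]
OBJECTS = ["ring", "kiss", "bone", "basketball"]


class IOIQuery(NamedTuple):
    prompt: str           # sentence prefix without the final indirect object
    indirect_object: str  # correct next token ([A])
    subject: str          # distractor next token ([B])


def sample_ioi_query(rng: random.Random) -> IOIQuery:
    """Draw a template and infilling words, then strip the final [A] to form the prompt."""
    template = rng.choice(TEMPLATES)
    name_a, name_b = rng.sample(NAMES, 2)  # two distinct names
    sentence = (
        template.replace("[B]", name_b)
        .replace("[PLACE]", rng.choice(PLACES))
        .replace("[OBJECT]", rng.choice(OBJECTS))
    )
    # Every template ends with [A]; cut at its last occurrence.
    prefix, _, _ = sentence.rpartition("[A]")
    prompt = prefix.replace("[A]", name_a).rstrip()
    return IOIQuery(prompt, name_a, name_b)


example = sample_ioi_query(random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampled dataset reproducible across runs.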

Table 7: Candidate infilling words of IOI sentence templates.

Placeholder Type	Candidate Infilling Words
[A] and [B] (names)	Michael, Christopher, Jessica, Matthew, Ashley, Jennifer, Joshua Daniel, David, James, Robert, John, Joseph, Andrew, Ryan, Bran Justin, Sarah, William, Jonathan, Stephanie, Brian, Nicole, Nicho Heather, Eric, Elizabeth, Adam, Megan, Melissa, Kevin, Steven, Timothy, Christina, Kyle, Rachel, Laura, Lauren, Amber, Brittan Richard, Kimberly, Jeffrey, Amy, Crystal, Michelle, Tiffany, Jere Mark, Emily, Aaron, Charles, Rebecca, Jacob, Stephen, Patrick, Kelly, Samantha, Nathan, Sara, Dustin, Paul, Angela, Tyler, Scot Andrea, Gregory, Erica, Mary, Travis, Lisa, Kenneth, Bryan, Lin Jose, Alexander, Jesse, Katie, Lindsay, Shannon, Vanessa, Court Alicia, Cody, Allison, Bradley, Samuel.
[PLACE]	store, garden, restaurant, school, hospital, office, house, station.
[OBJECT]	ring, kiss, bone, basketball, computer, necklace, drink, snack.

Open-domain question answering

We use the PARAREL dataset by Elazar et al. (2021), which consists of 38 relation types and 27,738 (subject, relation, object) fact triples such as (Canada, capital city, Ottawa). We then use the templates created by Dai et al. (2022) to convert each fact triple into multiple query prompts (e.g., “The capital city of Canada is __”). We keep prompts generated from triples of the 12 out of 38 PARAREL relations that satisfy the following two conditions: 1) there is a unique object entity answer for each (subject, relation) pair; and 2) the object word always comes at the end of the template-generated sentence, so that it can be predicted by an autoregressive language model. This yields a total of 9,543 queries as our open-domain question answering dataset, and for every circuit discovery method we learn one circuit per relation-specific dataset. See Table 8 for a list of the 12 relations we used, together with exemplar fact triples and queries.
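The two filtering conditions and the prompt construction can be sketched as below. The sketch assumes that templates mark the subject and object with `[X]` and `[Y]` placeholders and that a qualifying template ends exactly with `[Y]`; these conventions, the function names, and the toy relation "speaks" are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)
SUBJECT_SLOT = "[X]"  # assumed subject placeholder
OBJECT_SLOT = "[Y]"   # assumed object placeholder


def has_unique_objects(triples: List[Triple]) -> bool:
    """Condition 1: every (subject, relation) pair maps to exactly one object."""
    objects = defaultdict(set)
    for subj, rel, obj in triples:
        objects[(subj, rel)].add(obj)
    return all(len(objs) == 1 for objs in objects.values())


def object_is_final(template: str) -> bool:
    """Condition 2: the object slot is the last element of the template."""
    return template.rstrip().endswith(OBJECT_SLOT)


def build_queries(triples: List[Triple], templates: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (prompt, answer) pairs for relations that satisfy both conditions."""
    by_relation: Dict[str, List[Triple]] = defaultdict(list)
    for triple in triples:
        by_relation[triple[1]].append(triple)

    queries = []
    for rel, rel_triples in by_relation.items():
        rel_templates = templates.get(rel, [])
        if not rel_templates or not has_unique_objects(rel_triples):
            continue
        if not all(object_is_final(t) for t in rel_templates):
            continue
        for template in rel_templates:
            prefix = template.rstrip()[: -len(OBJECT_SLOT)].rstrip()
            for subj, _, obj in rel_triples:
                queries.append((prefix.replace(SUBJECT_SLOT, subj), obj))
    return queries


# Toy data: "capital" passes both conditions, "speaks" fails condition 1.
toy_triples = [
    ("Canada", "capital", "Ottawa"),
    ("Porto District", "capital", "Porto"),
    ("Ann", "speaks", "English"),
    ("Ann", "speaks", "French"),
]
toy_templates = {
    "capital": ["The capital city of [X] is [Y]"],
    "speaks": ["[X] speaks [Y]"],
}
toy_queries = build_queries(toy_triples, toy_templates)
```

Applying the second condition to all templates of a relation, rather than to individual templates, follows the relation-level wording of the filter above.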

Table 8: PARAREL relations and sample queries used for circuit discovery.

Relation ID	Relation	No. of queries	Sample Query	True answer
P103	native language	977	The mother tongue of Victor Horta is	Dutch
P138	named after	645	Rawlings Gold Glove Award, which is named for	glove
P159	headquarters location	967	The headquarter of Strait Shipping is located in	Wellington
P176	manufacturer	982	Honda RA272 is produced by	Honda
P264	record label	429	Johnny Carroll’s record label is	Decca
P279	subclass of	964	Nucleoporin 62, a type of	protein
P30	continent	975	Romulus Glacier is located in	Antarctica
P407	language of work or name	877	Ten Years Gone is a work written in	English
P449	original network	881	Himalaya with Michael Palin was originally aired on	BBC
P495	country of origin	909	Mundo Obrero was from	Spain
P1376	capital of	234	Guangzhou is the capital of	Guangdong
P36	capital	703	The capital city of Porto District is	Porto

Appendix B Constructing corrupted inputs for edge intervention

For the Agreement dataset, we construct corrupted inputs by interchange intervention, replacing the subject token of each prompt with a pronoun of the opposite gender (e.g., “Katherine” is replaced with “He”). For the IOI dataset, we construct the interchange intervention inputs following Wang et al. (2022): the second mention of the subject token is replaced with a third, randomly sampled name, so that both the subject and the indirect object are feasible next words for the corrupted prompt (e.g., “When Mary and John went to the store, John gave a drink to” $\rightarrow$ “When Mary and John went to the store, Katie gave a drink to”). For OQA, we replace the subject tokens in each prompt with another subject entity drawn at random from the same relational subset of factual triples (e.g., “The capital city of Canada is” $\rightarrow$ “The capital city of China is”).
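The IOI interchange intervention can be sketched as follows. The sketch assumes whitespace tokenization and that both subject mentions appear as standalone tokens; the function name is hypothetical, and a real pipeline would operate on model tokens instead.

```python
import random
from typing import List, Sequence


def corrupt_ioi_prompt(
    prompt: str,
    subject: str,
    indirect_object: str,
    names: Sequence[str],
    rng: random.Random,
) -> str:
    """Replace the second mention of the subject with a third, randomly drawn name."""
    tokens: List[str] = prompt.split()
    positions = [i for i, tok in enumerate(tokens) if tok == subject]
    if len(positions) < 2:
        raise ValueError("the subject must be mentioned at least twice")
    candidates = [n for n in names if n not in (subject, indirect_object)]
    if not candidates:
        raise ValueError("no third name available for the replacement")
    tokens[positions[1]] = rng.choice(candidates)
    return " ".join(tokens)


clean_prompt = "When Mary and John went to the store, John gave a drink to"
corrupted_prompt = corrupt_ioi_prompt(
    clean_prompt, "John", "Mary", ["Mary", "John", "Katie"], random.Random(0)
)
```

With only one eligible third name in the candidate list, the call above reproduces the corrupted example given in the text.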