Abstraction Alignment: Comparing Model and Human Conceptual Relationships

Angie Boggust
MIT CSAIL
Cambridge, MA, USA
[email protected]
Hyemin Bang
MIT CSAIL
Cambridge, MA, USA
Hendrik Strobelt
MIT–IBM Watson AI Lab
IBM Research
Cambridge, MA, USA
Arvind Satyanarayan
MIT CSAIL
Cambridge, MA, USA

Abstract

Abstraction — the process of generalizing specific examples into broad reusable patterns — is central to how people efficiently process and store information and apply their knowledge to new data. Promisingly, research has shown that ML models learn representations that span levels of abstraction, from specific concepts like bolo tie and car tire to more general concepts like ceo and model. However, existing techniques analyze these representations in isolation, treating learned concepts as independent artifacts rather than an interconnected web of abstraction. As a result, although we can identify the concepts a model uses to produce its output, it is difficult to assess if it has learned a human-aligned abstraction of the concepts that will generalize to new data. To address this gap, we introduce abstraction alignment, a methodology to measure the agreement between a model’s learned abstraction and the expected human abstraction. We quantify abstraction alignment by comparing model outputs against a human abstraction graph, such as linguistic relationships or medical disease hierarchies. In evaluation tasks interpreting image models, benchmarking language models, and analyzing medical datasets, abstraction alignment provides a deeper understanding of model behavior and dataset content, differentiating errors based on their agreement with human knowledge, expanding the verbosity of current model quality metrics, and revealing ways to improve existing human abstractions.

1 Introduction

Abstraction is the process of distilling many individual data instances into a set of fundamental concepts and relationships that capture essential characteristics of the data (Yee, 2019; Alexander, 2018; Liskov et al., 1986). The result is an interconnected web of concepts, ranging from specific ideas, like schnauzer, to progressively more abstract notions, like dog or animal. Abstraction is a central characteristic of human cognition as it allows us to flexibly reason at the level of abstraction appropriate for our task; for example, in computer science, abstraction helps hide low-level implementation concerns from clients (Liskov et al., 1986). Moreover, abstraction allows us to generalize our knowledge by fitting our abstracted patterns to new, unseen data (Yee, 2019). For instance, over their careers, clinicians learn abstractions of disease symptoms which they use to diagnose and treat new patients, even those with rare or atypical diseases (Eva, 2005).

Promisingly, existing interpretability and alignment research has shown that machine learning (ML) models learn concepts at varying levels of abstraction. Concept-based interpretability methods have demonstrated model sensitivity to concepts, ranging from specific ideas, like car tire, to higher-level concepts, like model (Kim et al., 2018; Ghorbani et al., 2019). Similarly, research on neuron activations has found that models encode human-like concepts across levels of abstraction, such as stone wall, sky, and graduate (Hernandez et al., 2021; Bau et al., 2017; Oikarinen and Weng, 2022). Together, these results suggest that ML models extract human-like concepts from their training data and use them to make inferences on new data.

However, existing techniques analyze a model’s learned concepts in isolation, ignoring the relationships between concepts that make up its abstraction. Typically, interpretability methods curate a set of human concepts and quantify the model’s sensitivity to each concept independently. While this testing procedure quantifies the importance of a concept to the model’s decision, it does not measure the model’s reliance on multiple concepts, ability to generalize concepts, or what relationships it has learned between these concepts. As a result, while we can verify that a model uses human-aligned concepts to make its decision, we lack tools to test if it has learned a human-aligned abstraction of those concepts. Yet, testing a model’s abstraction is important because, even with the correct set of concepts, a model using a misaligned abstraction can result in an inability to generalize to new data. For instance, a model that has only been exposed to images of cooked crabs may learn an abstraction that crabs are food and fail to generalize to images of beach crabs.

To address this gap, we introduce abstraction alignment, a methodology to measure the agreement between a model’s learned abstraction and the expected human abstraction. To quantify abstraction alignment, we represent human abstractions as directed acyclic graphs (DAGs), such as medical taxonomies (World Health Organization, 1978) or lexical graphs (Miller, 1995). Then, we compare model outputs against the human abstraction graph to measure how well the abstraction accounts for the model’s uncertainty. Through this process, we define metrics of abstraction alignment including uncertainty alignment (Equation 2) and concept confusion (Equation 4), to surface various aspects of a model’s abstraction.

We demonstrate how abstraction alignment can be used for model interpretability, model benchmarking, and dataset analysis tasks. (Code is available at https://github.com/mitvis/abstraction-alignment, and an interactive interface to explore experimental results is available at https://vis.mit.edu/abstraction-alignment/.) When interpreting an image model, we show that abstraction alignment allows us to differentiate errors based on their agreement with human knowledge, revealing when seemingly problematic errors are actually a more benign lack of granularity. Next, we use abstraction alignment to expand existing language model quality benchmarks by quantifying model specificity across a breadth of linguistic markers. Finally, in a medical domain, abstraction alignment exposes data quality issues between how diseases are categorized in formal guidelines and encoded in the dataset, and identifies opportunities to improve existing human medical abstractions.

2 Related Work

Abstractions in ML datasets Abstractions allow humans to efficiently process information and form the bases for information encodings in linguistics (Miller, 1995; Dewey, 2011), biology (Hinchliff et al., 2014; Linnaeus, 1758), and medicine (World Health Organization, 1978). In machine learning, abstractions are built into many tasks, such as image classification (Krizhevsky et al., 2009; Deng et al., 2009), medical diagnostics (Johnson et al., 2016a, b), and text prediction (Miller, 1995). Even datasets that do not include an abstraction can be linked to existing abstractions by matching their output classes with corresponding concept nodes (Redmon and Farhadi, 2017). We apply abstraction alignment to interpret image classification models using the CIFAR-100 hierarchy (Krizhevsky et al., 2009), benchmark language models using WordNet language abstractions (Miller, 1995; Fellbaum, 1998), and analyze dataset abstractions using the ICD-9 disease abstraction (Johnson et al., 2016a, b; World Health Organization, 1978).

Concepts in model interpretability Aligned with our goal of understanding machine learning model behavior, interpretability research focuses on measuring model reliance on known human concepts (Doshi-Velez and Kim, 2017; Rai, 2020). For instance, saliency methods reveal input features important to the model’s prediction that humans compare to their expectations (Selvaraju et al., 2017; Carter et al., 2019a; Boggust et al., 2022). Feature visualization methods help identify concepts, like patterns or object parts, that activate model layers (Olah et al., 2017; Erhan et al., 2009; Bau et al., 2017). Concept-based methods like TCAV (Kim et al., 2018; Ghorbani et al., 2019) identify and test for human concepts encoded in a model’s latent space. Neuron activation analysis identifies human concepts that activate particular model neurons (Hernandez et al., 2021; Bau et al., 2017; Oikarinen and Weng, 2022). Recently, work in mechanistic interpretability discovered that small networks contain state machines that transition between concepts in human-meaningful ways (Bricken et al., 2023). Together, these methods have identified problematic model correlations (Carter et al., 2021), made sense of complex model activations (Olah et al., 2018; Carter et al., 2019b), and discovered novel concepts that advance human knowledge (Schut et al., 2023). Building on their success, we expand interpretability from independent concepts to the relationships between them, to ensure that models learn human-aligned concepts and human-aligned abstractions.

Knowledge graphs Related to the abstractions we work with are knowledge graphs: semantic data networks that represent entities and their relationships (Ji et al., 2020). While abstractions encode levels of conceptual abstraction, knowledge graphs are directed graphs that encode any form of relationship between the nodes, like familial relationships between people and associations between people and institutions. Machine learning research on knowledge graphs has focused on training models to encode (Lin et al., 2018), complete (Yao et al., 2019), and even replace (Sun et al., 2023) knowledge graphs. Like abstraction alignment, these models suggest ways to update existing human taxonomies; however, abstraction alignment focuses on interpreting the human-alignment of existing machine learning models.

3 Methodology

The goal of abstraction alignment is to measure how well the model’s learned abstraction aligns with a given human abstraction. Our methodology is based on the assumption that the model’s confusion is a reflection of its learned abstraction — i.e., concepts the model commonly confuses are more similar in the model’s abstraction than concepts the model perfectly separates.

3.1 Representing human abstractions

To compute abstraction alignment, we first represent the human abstraction as a directed acyclic graph (DAG), where nodes represent concepts and edges represent child-to-parent relationships between concepts. For example, in the medical abstraction in Section 4.3, nodes represent medical diagnoses and edges map from specific diagnoses, like frontal sinusitis, to broader diagnostic categories, like respiratory infections (World Health Organization, 1978). Every node in the DAG exists at a level of abstraction, ranging from the leaf level to the root level, computed based on its shortest path from a leaf node.

The DAG data structure is well suited to representing abstractions because it efficiently encodes both the abstraction’s concepts and conceptual relationships. We can easily access a concept’s level of abstraction by measuring its height and move up and down levels of abstraction by getting its ancestors or descendants. Since the graph is acyclic, it guarantees the hierarchical structure that underpins abstraction relationships. Further, DAGs are commonly used to represent abstractions (World Health Organization, 1978; Miller, 1995) and are built into many ML datasets (Krizhevsky et al., 2009; Deng et al., 2009; Johnson et al., 2016a, b), allowing abstraction alignment to apply to a wide variety of domains.
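The operations described above (levels, ancestors, descendants) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the `AbstractionDAG` class, its method names, and the toy edge list are hypothetical and are not taken from the released code.

```python
from collections import defaultdict, deque


class AbstractionDAG:
    """Toy abstraction graph built from (child, parent) edges (hypothetical helper)."""

    def __init__(self, edges):
        self.parents = defaultdict(set)
        self.children = defaultdict(set)
        self.nodes = set()
        for child, parent in edges:
            self.parents[child].add(parent)
            self.children[parent].add(child)
            self.nodes.update((child, parent))

    def heights(self):
        """Level of each node: length of its shortest path from any leaf."""
        height = {n: 0 for n in self.nodes if not self.children[n]}
        queue = deque(height)
        while queue:  # multi-source breadth-first search upward from the leaves
            node = queue.popleft()
            for parent in self.parents[node]:
                if parent not in height:
                    height[parent] = height[node] + 1
                    queue.append(parent)
        return height

    def _reach(self, node, edges):
        """All nodes reachable from `node` by repeatedly following `edges`."""
        seen, stack = set(), list(edges[node])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(edges[current])
        return seen

    def ancestors(self, node):
        """Concepts more abstract than `node`."""
        return self._reach(node, self.parents)

    def descendants(self, node):
        """Concepts more specific than `node`."""
        return self._reach(node, self.children)


# Toy example using the concepts from the introduction.
dag = AbstractionDAG([("schnauzer", "dog"), ("dog", "animal"), ("cat", "animal")])
levels = dag.heights()
```

In this toy graph, animal sits at level 1 because its shortest path from a leaf (cat) has length 1, which matches the shortest-path definition of levels in Section 3.1.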

3.2 Integrating model outputs with human abstractions

The next step in computing abstraction alignment is to compare the model’s behavior to the given human abstraction DAG. To do so, we map the model’s output space (e.g., classes or tokens) to nodes in the DAG. Often the human abstraction is built into the modeling task, so this mapping is straightforward; e.g., CIFAR-100 includes a human abstraction mapping classes to higher-level superclasses (Krizhevsky et al., 2009). However, even when the human abstraction is separate from the modeling task, the model’s output space can often be easily computationally mapped to the DAG. For instance, in Section 4.2, we map words in the model’s vocabulary to nodes in the WordNet DAG (Miller, 1995).

We use this mapping to analyze the model’s behavior based on the human abstraction. Following Algorithm 1, we create a weighted DAG for each dataset instance, where nodes have a value and an aggregated value. The value corresponds to the model’s output probability. If a node corresponds to a model’s output, then the value is the model’s predicted probability for that output. Otherwise, the value is zero. The aggregated value is the model’s propagated probability and is computed as the sum of the node’s own value and the values of its descendants. For example, in CIFAR-100 (Krizhevsky et al., 2009), the aggregated value of the flower node is the sum of the model’s probabilities that the image is an orchid, poppy, rose, sunflower, or tulip. By propagating the model’s confidences through the abstraction, the aggregated value provides a measure of the model’s confidence in non-output nodes, including high-level concepts.

Algorithm 1 Abstraction Alignment Propagation: create a weighted DAG for a dataset instance

Inputs:
    instance ← the dataset instance
    model ← the model to evaluate
    classes ← the model’s output space
    abstraction ← the human abstraction DAG

for node in abstraction do    ▷ Initialize the values of the DAG
    node.value = 0    ▷ value is the model’s assigned probability
    node.aggregated_value = 0    ▷ aggregated_value is the propagated probability
end for
probabilities = model(instance)
for i, prob in enumerate(probabilities) do    ▷ Set node values based on model outputs
    node = abstraction.get_node(classes[i])
    node.value = prob
end for
for level in abstraction.levels do    ▷ Propagate node values through the DAG starting at the leaves
    for node in level do
        node.aggregated_value = node.value
        for child in node.children do
            node.aggregated_value += child.aggregated_value
        end for
    end for
end for
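A runnable rendering of Algorithm 1 might look like the sketch below. It is not the released implementation: the function name `propagate`, the dictionary-based graph encoding, and the toy probabilities are assumptions made for illustration. It also assumes that `levels` is ordered so that every child is processed before its parents, which holds for balanced hierarchies such as CIFAR-100.

```python
def propagate(probabilities, classes, children, levels):
    """Return (value, aggregated_value) dicts for one dataset instance.

    probabilities: model output probabilities, one per entry of `classes`
    classes:       node name for each model output
    children:      dict mapping a node name to a list of its child node names
    levels:        lists of node names, ordered from the leaf level upward
    """
    # Initialize every node in the DAG to zero.
    value = {node: 0.0 for level in levels for node in level}
    aggregated = dict(value)

    # Set node values based on model outputs.
    for name, prob in zip(classes, probabilities):
        value[name] = prob

    # Propagate values upward, starting at the leaves.
    for level in levels:
        for node in level:
            aggregated[node] = value[node] + sum(
                aggregated[child] for child in children.get(node, [])
            )
    return value, aggregated


# Toy slice of the CIFAR-100 hierarchy with made-up probabilities.
classes = ["orchid", "poppy", "rose", "maple"]
children = {
    "flower": ["orchid", "poppy", "rose"],
    "tree": ["maple"],
    "root": ["flower", "tree"],
}
levels = [classes, ["flower", "tree"], ["root"]]
value, aggregated = propagate([0.5, 0.25, 0.125, 0.125], classes, children, levels)
```

Here the flower node receives an aggregated value of 0.875 even though the model never outputs flower directly, which is the propagated confidence that the metrics in Section 3.3 operate on.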

3.3 Measuring abstraction alignment

Using the weighted DAG, we can measure the abstraction alignment of a model’s decision for a specific instance or an entire dataset. While there are potentially many metrics one could use to analyze abstraction alignment patterns, we define an initial set of four metrics that we have found useful for downstream tasks. During analysis we assume we have a dataset $X$, a model trained to predict classes or tokens $c$, and an abstraction DAG containing nodes $n$ across levels of abstraction $l$. We use the model’s outputs (e.g., $y_{ij}$ for instance $x_{i}$ and class $c_{j}$) to compute the aggregated value $v_{ki}$ of node $n_{k}$.

Accuracy abstraction alignment

One way to measure abstraction alignment is to measure how well the human abstraction accounts for the model’s errors. If a model’s mistakes are substantially reduced by moving up a level of abstraction, then the model’s behavior is more abstraction aligned than if it continues to make errors at higher levels of abstraction. While there are cases when the model’s errors may acceptably not fit the abstraction, such as misclassifying an image containing multiple objects, in aggregate we expect the model’s errors to reflect its abstractions: it will confuse output classes or tokens that it considers similar.

We measure accuracy alignment as the proportion of errors that are resolved by moving from level $l_{i}$ to $l_{j}$. First, we compute the number of correct predictions at each level by comparing the node with the highest aggregated value in that level to the expected prediction at the level. Then, we compute the proportion of errors that are mitigated by moving up in the abstraction. If accuracy alignment is high, then the abstraction accounts for a large share of the model’s mistakes, suggesting the model is using a similar abstraction.

$$\texttt{accuracy}\>\texttt{alignment}=\Delta A_{l_{i},l_{j}}=\frac{\sum_{k=1}^{|X|}\mathbf{1}[\text{argmax}([v_{a,k}\;\forall\;n_{a}\in l_{j}])=y_{k,j}]-\sum_{k=1}^{|X|}\mathbf{1}[\text{argmax}([v_{a,k}\;\forall\;n_{a}\in l_{i}])=y_{k,i}]}{|X|-\sum_{k=1}^{|X|}\mathbf{1}[\text{argmax}([v_{a,k}\;\forall\;n_{a}\in l_{i}])=y_{k,i}]}\qquad(1)$$
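As a hedged illustration of Equation 1, the sketch below computes accuracy alignment from per-instance aggregated values. The function names, the list-of-dicts input format, and the toy numbers are assumptions; returning NaN when there are no errors at the lower level is also a convention chosen here, since the text does not specify that case.

```python
def level_correct(aggregated, level_nodes, expected):
    """Count instances whose top-scoring node in `level_nodes` matches the label."""
    correct = 0
    for agg, label in zip(aggregated, expected):
        predicted = max(level_nodes, key=lambda node: agg[node])
        correct += predicted == label
    return correct


def accuracy_alignment(aggregated, nodes_i, labels_i, nodes_j, labels_j):
    """Share of level-i errors that are resolved at level j (Equation 1)."""
    correct_i = level_correct(aggregated, nodes_i, labels_i)
    correct_j = level_correct(aggregated, nodes_j, labels_j)
    errors_i = len(aggregated) - correct_i
    if errors_i == 0:
        return float("nan")  # assumed convention: nothing to resolve
    return (correct_j - correct_i) / errors_i


# Two toy instances with made-up aggregated values.
aggregated = [
    {"orchid": 0.25, "rose": 0.5, "maple": 0.25, "flower": 0.75, "tree": 0.25},
    {"orchid": 0.125, "rose": 0.125, "maple": 0.75, "flower": 0.25, "tree": 0.75},
]
alignment = accuracy_alignment(
    aggregated,
    nodes_i=["orchid", "rose", "maple"], labels_i=["orchid", "rose"],
    nodes_j=["flower", "tree"], labels_j=["flower", "flower"],
)
```

Both instances are misclassified at the class level, and one of the two errors disappears at the superclass level, so the toy accuracy alignment is 0.5.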

Uncertainty abstraction alignment

Similarly, we can measure abstraction alignment by quantifying how well the abstraction accounts for the model’s uncertainty. A model whose confusion is contained within a small portion of the DAG is more abstraction aligned than a model whose confusion spans the DAG. As with accuracy alignment, uncertainty alignment applies in aggregate — e.g., a model that regularly confuses types of fruit is more abstraction aligned than a model that regularly confuses fruits and birds.

We measure uncertainty alignment by testing the difference in entropy between levels of the DAG. First, we compute the Shannon entropy ($H$) of the node aggregate values for every level in the DAG. The larger the entropy for a given level, the more confused the model is across concepts at that level of abstraction. Then we compute the mean difference in entropy between two levels ($l_{i}$ and $l_{j}$) across a set of data instances, $X$. If the entropy decreases substantially, then the model’s behavior aligns with the abstraction mapping the low-level nodes to the higher-level nodes.

$$\texttt{uncertainty}\>\texttt{alignment}=\Delta H_{l_{i},l_{j}}=\frac{1}{|X|}\sum_{k=1}^{|X|}H([v_{a,k}\;\forall\;n_{a}\in l_{j}])-\frac{1}{|X|}\sum_{k=1}^{|X|}H([v_{a,k}\;\forall\;n_{a}\in l_{i}])\qquad(2)$$
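Equation 2 can be sketched as follows. The helper names and toy values are hypothetical, entropy is measured in bits, and the aggregated values within each level are assumed to already sum to one (as they do when all model outputs map to leaf nodes).

```python
import math


def entropy(values):
    """Shannon entropy in bits of a list of probabilities (zeros are skipped)."""
    return -sum(v * math.log2(v) for v in values if v > 0)


def uncertainty_alignment(aggregated, nodes_i, nodes_j):
    """Mean entropy at level j minus mean entropy at level i (Equation 2)."""
    differences = [
        entropy([agg[n] for n in nodes_j]) - entropy([agg[n] for n in nodes_i])
        for agg in aggregated
    ]
    return sum(differences) / len(differences)


# One toy instance: the model is split between two flowers.
aggregated = [
    {"orchid": 0.5, "rose": 0.5, "maple": 0.0, "flower": 1.0, "tree": 0.0},
]
delta_h = uncertainty_alignment(
    aggregated, nodes_i=["orchid", "rose", "maple"], nodes_j=["flower", "tree"]
)
```

With the sign convention of Equation 2 as printed, a drop in entropy from $l_{i}$ to $l_{j}$ gives a negative number; in this toy case the full bit of class-level uncertainty is resolved at the superclass level.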

Subgraph preference

Another useful metric when using abstractions to analyze model behavior is to compare subgraphs within the abstraction DAG. For instance, in Section 4.2, we compare regions of the DAG that represent different concepts (e.g., any location concept vs. canadian location concepts) and different levels of abstraction (e.g., concepts more specific than journalist to concepts more general than journalist). In aggregate, these comparisons help us quantify and compare abstractions the model uses and prefers.

We compute subgraph preference by measuring how often the maximum aggregate value of a node in one subgraph, $s_{i}$, is larger than the maximum aggregate value of a node in another subgraph, $s_{j}$. This is an extension of the specificity testing metric, $p_{r}$, proposed by Huang et al. (2023), where $s_{i}$ is the specific concept and $s_{j}$ is the general concept. However, unlike $p_{r}$, which was designed to test two output tokens of a model, abstraction alignment allows us to test a breadth of concepts, including different levels of abstraction, multiple similar concepts, and concepts related to different abstractions. If our model’s outputs span many nodes in the abstraction DAG (as in Section 4.2), we can also compute this metric using the node’s value as opposed to its aggregate value.

$$\texttt{subgraph}\>\texttt{preference}=P(s_{i},s_{j})=\frac{1}{|X|}\sum_{k=1}^{|X|}\mathbf{1}[\max([v_{a,k}\;\forall\;n_{a}\in s_{i}])>\max([v_{b,k}\;\forall\;n_{b}\in s_{j}])]\qquad(3)$$
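A minimal sketch of Equation 3 is shown below. The function name, the per-instance score dictionaries, and the numbers are illustrative assumptions; `scores` can hold either node values or aggregated values, as discussed above.

```python
def subgraph_preference(scores, subgraph_i, subgraph_j):
    """Fraction of instances where the best node in s_i outscores the best in s_j."""
    wins = sum(
        max(s[n] for n in subgraph_i) > max(s[n] for n in subgraph_j)
        for s in scores
    )
    return wins / len(scores)


# Two toy instances scored over three candidate answers (made-up values).
scores = [
    {"poet": 0.5, "writer": 0.25, "person": 0.125},
    {"poet": 0.125, "writer": 0.125, "person": 0.5},
]
preference = subgraph_preference(scores, {"poet"}, {"writer", "person"})
```

The model prefers the specific answer in the first instance only, so the toy preference is 0.5. Passing single-node subgraphs recovers a pairwise comparison of one specific and one general answer.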

Concept confusion

Finally, the concept confusion metric allows us to measure how often a model assigns probability to pairs of concepts. Identifying these concepts can reveal concepts that the model considers similar in its abstraction despite being different in the human abstraction. While concept pairs that are direct ancestors or descendants of each other will definitionally have high concept confusion, unrelated concept pairs with high concept confusion indicate unrelated human concepts that the model’s abstraction deems similar.

To compute concept confusion for a pair of nodes, we compute the Shannon entropy ( $H$ ) of their aggregate values divided by the maximum possible entropy for a pair of nodes. By computing the entropy, we weight the concept confusion by how confused the two nodes are. We compute concept confusion over an entire dataset to identify concepts that the model repeatedly confuses.

$$\texttt{concept}\>\texttt{confusion}=C(n_{i},n_{j})=\frac{\sum_{k=1}^{|X|}H([v_{i,k},v_{j,k}])}{\sum_{k=1}^{|X|}H([0.5,0.5])}\qquad(4)$$
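The following sketch illustrates Equation 4. Normalizing each pair of aggregate values to sum to one before taking the entropy is an assumption of this sketch (the text does not state how unnormalized pairs are handled), and the function names and toy values are hypothetical.

```python
import math


def pair_entropy(v_i, v_j):
    """Entropy in bits of two values after normalizing them to sum to one.

    The normalization step is an assumption; a pair of zeros scores 0.
    """
    total = v_i + v_j
    if total == 0:
        return 0.0
    h = 0.0
    for v in (v_i, v_j):
        p = v / total
        if p > 0:
            h -= p * math.log2(p)
    return h


def concept_confusion(aggregated, node_i, node_j):
    """Summed pair entropy divided by the maximum possible entropy (Equation 4)."""
    max_entropy = pair_entropy(0.5, 0.5)  # 1 bit for a pair of nodes
    total = sum(pair_entropy(agg[node_i], agg[node_j]) for agg in aggregated)
    return total / (len(aggregated) * max_entropy)


# Two toy instances: one evenly confused, one fully certain.
aggregated = [{"a": 0.5, "b": 0.5}, {"a": 1.0, "b": 0.0}]
confusion = concept_confusion(aggregated, "a", "b")
```

The first instance contributes the maximum of one bit and the second contributes nothing, giving a concept confusion of 0.5 over the toy dataset.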

4 Experiments

4.1 Interpreting model behavior with abstraction alignment

Figure 1: We use abstraction alignment to interpret the behavior of a ResNet20 model (He et al., 2016a) on the CIFAR-100 test set (Krizhevsky et al., 2009) using accuracy alignment (left) and uncertainty alignment (right). High values of each metric indicate that the model has learned that aspect of the human abstraction and that the majority of its errors and uncertainty are contained within the abstraction (e.g., people).

Abstraction alignment improves model interpretability by expanding the number and complexity of concepts we can use to characterize model decisions and comparing them to accepted human abstractions. A common interpretability task is understanding a model’s mistakes; however, not all mistakes are equally problematic. For example, we would be more likely to forgive a model that regularly mistakes cars for trucks than a model that consistently mistakes cars for stop signs. The former aligns with our human abstractions that treat vehicles similarly while driving, whereas the latter suggests model abstractions do not follow accepted human reasoning, with potentially dangerous consequences. In these cases, abstraction alignment helps differentiate the severity of a model’s mistakes, distinguishing benign low-level errors from problematic higher-level misalignment.

To demonstrate abstraction alignment’s ability to characterize model behavior, we use it to interpret a ResNet20 (He et al., 2016a) trained on CIFAR-100 (Krizhevsky et al., 2009). We use the CIFAR-100 class and superclass structure as the human abstraction, resulting in a DAG with 121 nodes across 3 levels (see Section A.1) (Krizhevsky et al., 2009). We compute each test image’s weighted DAG by applying a softmax to the model’s outputs and mapping the output probabilities to the corresponding class nodes (see Section 3.2).

Quantifying abstraction alignment reveals abstraction aligned errors. In Figure 1, we report the model’s accuracy alignment (Equation 1) and uncertainty alignment (Equation 2) across the CIFAR-100 test set. The people, tree, and flower abstractions resolve a large proportion of the model’s errors with their subclasses, indicating that the model has learned those abstractions. For instance, $76\%$ of prediction errors for images of maple, oak, palm, pine, and willow are resolved by predicting at the level 2 concept tree. Similarly, uncertainty for images of baby, boy, girl, man, and woman is almost entirely resolved by moving up one level of abstraction to people. While our model only achieves $67.7\%$ test accuracy, seeing that many of its errors align with our human abstraction may increase our trust that it will behave acceptably on unseen data in these categories.

Our abstraction alignment metrics also reveal areas where the model is misaligned with the human abstraction. For instance, the model’s errors are not accounted for by abstractions like vehicles 2, nor does it appear to learn animal categorizations like medium mammals and large carnivores. In both cases, we might consider these results to be acceptable model performance in light of poorly designed or ill-fitting human abstractions, respectively. In particular, the CIFAR-100 hierarchy artificially restricts each superclass to contain exactly 5 subclasses, a constraint that produces two abstract nodes for vehicles that arbitrarily distinguish their children rather than meaningfully capture abstracted patterns. In contrast, although the higher-level animal categories are semantically meaningful, they reflect abstract biological concepts like size (medium, large), reproduction (mammals), and diet (carnivores) that are seemingly hard for a model to learn visually from 32x32 images. If learning accurate biological abstractions is important for our task, then we may prefer to train on an alternate dataset or modality that more precisely expresses these characteristics; on the other hand, if learning visual abstractions is acceptable, we may update our human abstractions to better reflect what can be learned from the data (i.e., categorizing animals based on visual similarity).

Abstraction alignment also provides a structure to explore types of model behavior. In Figure 2, we use the weighted DAGs to query for particular types of abstraction alignment. We can define types of abstraction alignment based on the number of nodes the model considers at each level and how it distributes its confidence across nodes. To explore instances where the model’s decision aligns and misaligns with human abstractions, we compare images where the model’s confusion resolves at the level 2 concept against images where the model’s uncertainty is split over four level 2 concepts. We find that while $15.5\%$ of instances are harmless low-level confusion, $25\%$ of images result from confusion at higher levels of abstraction. We can also look at particular types of model behaviors, querying for instances where the model is confused between two distinct concepts. Validating our quantitative analysis, we see instances where model confusion is split between vehicles 1 and vehicles 2. By measuring instance similarity based on the pattern of model decision making, as opposed to semantic similarity, abstraction alignment enables qualitative analysis of model alignment.

4.2 Benchmarking language models’ abstraction alignment

Benchmarking the specificity of language models helps us distinguish valuable models that output precise answers from those that output correct but meaningless text. For instance, while “Dante is a person” and “Dante is a poet” are both correct, we would prefer a language model that outputs the latter since it is operating at the correct level of abstraction (Huang et al., 2023). Metrics for benchmarking language model specificity use a dataset of language prompts to test the model’s preference between a specific and general response (Huang et al., 2023). However, these metrics are limited to testing only two human-defined responses at two levels of abstraction, even though generative models may output a variety of correct answers spanning many levels of abstraction (e.g., “writer” or “artist”).

With abstraction alignment, we expand existing specificity benchmarks to more thoroughly test models against a variety of correct answers spanning multiple levels of abstraction. Instead of testing one specific and one general answer, we can use the abstraction DAG to compare many possible answers, such as all answers more specific or more general than the specific answer. Using the expressivity of the abstraction DAG, we can also test the model preference for particular topics, such as testing whether the model prefers a correct answer over an incorrect answer that is still related to the topic. By leveraging an existing linguistic abstraction, like WordNet (Miller, 1995; Fellbaum, 1998), we can test these additional aspects of model behavior without the need for additional human labeling.

We apply abstraction alignment to benchmark BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and GPT-2 (Radford et al., 2019) language models. We test the models using the S-TEST dataset, which contains sentence prompts for masked token prediction of the sentence’s subject’s occupation, location, and birthplace (Huang et al., 2023). Each of the prompts is labeled with a corresponding specific answer and general answer. For instance, the prompt “Lake Louise Ski Resort is in” is paired with the specific answer “Alberta” and the general answer “Canada” (Huang et al., 2023). For each of the three tasks, we create a human abstraction DAG by mapping the S-TEST specific answers to nodes in the WordNet abstraction DAG (Miller, 1995; Fellbaum, 1998). We compute edges between nodes using WordNet’s hypernym/hyponym and holonym/meronym functions, creating an abstraction graph of precise and general answers related to the task.

To quantify the model’s specificity, we use subgraph preference to compare the model’s preference for answers in different regions of the abstraction DAG. As a baseline, we recreate Huang et al. (2023)’s specificity metric. Their metric compares the model’s probability in the specific answer to its probability in the general answer. We replicate this metric using subgraph preference by comparing the values of the specific answer node and the general answer node ($P(s_{s},s_{g})$). Next, we extend this metric to test specificity across additional words and levels of abstraction. Instead of testing one specific and one general answer, we compare all answers more specific than the specific answer (specific answer and its children) to all answers more general than the specific answer (specific answer’s parents) ($P(s_{s\downarrow},s_{s\uparrow})$). We extend these metrics further, testing whether the model prefers a correct answer at any level of abstraction to an incorrect answer by comparing all answers related to the specific answer to all answers related to the task (e.g., all occupation words) ($P(s_{s\downarrow\uparrow},s_{t})$).
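The node sets behind these three comparisons can be sketched as below. This is one reading of the text, not the released code: the helper names are hypothetical, "children" and "parents" are interpreted transitively (all descendants and all ancestors), the task set excludes answers related to the specific answer, and the toy occupation hierarchy is invented rather than taken from WordNet.

```python
def reachable(start, edges):
    """All nodes reachable from `start` by following `edges` (node -> list of nodes)."""
    seen, stack = set(), list(edges.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen


def comparison_sets(specific, task_nodes, parents, children):
    """Build the node sets s_{s-down}, s_{s-up}, s_{s-down-up}, and s_t."""
    below = {specific} | reachable(specific, children)  # specific answer and below
    above = reachable(specific, parents)                # more general answers
    related = below | above                             # correct at any level
    other = set(task_nodes) - related                   # other task answers (assumed)
    return below, above, related, other


# Invented toy hierarchy over occupation words mentioned in the text.
parents = {
    "painter": ["artist"], "artist": ["person"],
    "poet": ["writer"], "writer": ["person"],
}
children = {
    "artist": ["painter"], "writer": ["poet"], "person": ["artist", "writer"],
}
task_nodes = ["painter", "artist", "poet", "writer", "person"]
below, above, related, other = comparison_sets("painter", task_nodes, parents, children)
```

Each pair of sets can then be passed to a subgraph preference computation (Equation 3) to obtain $P(s_{s\downarrow},s_{s\uparrow})$ and $P(s_{s\downarrow\uparrow},s_{t})$.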

	Task
	Occupation				Location				Birthplace
Model	Acc@10	$\boldsymbol{P(s_{s},s_{g})}$	$\boldsymbol{P(s_{s\downarrow},s_{s\uparrow})}$	$\boldsymbol{P(s_{s\downarrow\uparrow},s_{t})}$	Acc@10	$\boldsymbol{P(s_{s},s_{g})}$	$\boldsymbol{P(s_{s\downarrow},s_{s\uparrow})}$	$\boldsymbol{P(s_{s\downarrow\uparrow},s_{t})}$	Acc@10	$\boldsymbol{P(s_{s},s_{g})}$	$\boldsymbol{P(s_{s\downarrow},s_{s\uparrow})}$	$\boldsymbol{P(s_{s\downarrow\uparrow},s_{t})}$
bert-base (Devlin et al., 2019)	0.2844	0.7046	0.7902	0.0068	0.4316	0.4909	0.9752	0.2304	0.4142	0.6068	0.9994	0.2450
bert-large (Devlin et al., 2019)	0.2214	0.7176	0.8240	0.0116	0.4564	0.4236	0.9821	0.2744	0.4214	0.5652	0.9975	0.2592
roberta-base (Liu et al., 2019)	0.2450	0.6180	0.7898	0.0751	0.3659	0.4999	0.9854	0.1790	0.2897	0.5448	1.000	0.2042
roberta-large (Liu et al., 2019)	0.2244	0.7144	0.8238	0.0797	0.3905	0.4328	0.9869	0.2242	0.2321	0.4216	0.9992	0.2248
gpt-2 (Radford et al., 2019)	0.1610	0.5728	0.5193	0.1682	0.1702	0.4825	0.6659	0.1348	0.3327	0.5972	0.9879	0.1959

Table 1: Abstraction alignment expands existing language model specificity benchmarks. We compute the subgraph preference of language models (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019) on the S-TEST dataset’s occupation, location, and birthplace tasks (Huang et al., 2023). We compare existing metrics that test model preference between a specific and general answer ($P(s_{s},s_{g})$) (Huang et al., 2023) to abstraction alignment metrics measuring the model’s preference for any specific answer to any general answer ($P(s_{s\downarrow},s_{s\uparrow})$) and its preference for a correct answer at any level of abstraction to an incorrect answer on the same task ($P(s_{s\downarrow\uparrow},s_{t})$).

Benchmarking models with abstraction alignment reveals aspects of model behavior overlooked by prior metrics. Existing metrics indicate that language models only have a slight preference for specific answers, with most $P(s_{s},s_{g})$ near $50\%$ (Huang et al., 2023). However, by expanding to a larger set of possible answers, abstraction alignment reveals that language models have a strong preference for specific answers. For example, bert-large prefers a specific answer on over $80\%$ of instances across all tasks. This result suggests that prior metrics are too strict and do not account for the variety of model preferences, whereas abstraction alignment more accurately reflects model specificity.

Beyond making specificity testing more accurate, abstraction alignment also allows us to test other aspects of specificity. Using $P(s_{s\downarrow\uparrow},s_{t})$, we can test the model’s preference for a correct answer at any level of abstraction to an incorrect answer related to the task. For instance, when predicting the occupation for “Enrico Castellani is a” we compare all answers that are direct ancestors or descendants of the correct answer painter to all other answers that are ancestors or descendants of any other occupation in the dataset. While previously we found models prefer a specific correct answer to a general correct answer, here we find that models often prefer an incorrect answer to any correct answer. This is not always correlated with accuracy or other specificity metrics; for instance, gpt-2 has the lowest accuracy and specificity on occupation prediction but the highest preference for correctness. By using the abstraction alignment methodology, we have expanded traditional benchmarks, exposing otherwise hidden aspects of model behavior.

4.3 Analyzing datasets using abstraction alignment

Abstraction alignment can also be valuable in dataset analysis by revealing differences in the abstractions we expect our models to learn and those codified in the dataset. Models learn correlations between input features and output decisions from their training data. However, the correlations in the dataset are not always the ones a model developer expects their model to learn. Often, dataset issues are only identified after models trained on them produce problematic outputs (Zech et al., 2018; Caliskan et al., 2022). Applying abstraction alignment to datasets can help us understand how the abstractions they implicitly encode correspond with expected human abstractions before the datasets are released as training data.

To demonstrate abstraction alignment as a dataset analysis tool, we use it to compare the medical abstractions encoded in the MIMIC-III dataset to medical hierarchy standards set by global health authorities. The MIMIC-III dataset contains patients’ medical notes labeled with a set of ICD-9 codes representing the patient’s diseases and procedures (Johnson et al., 2016a, b). The codes are part of the ICD-9 medical hierarchy used by hospitals to justify healthcare costs to insurance (Alexander et al., 2003). However, discrepancies between clinical code application and the ICD-9 guidelines are known to occur due to lack of coder experience, complexity of the coding system, and intentional misuse to increase insurance payout (O’Malley et al., 2005). Since MIMIC-III contains real-world patient records that could be affected by clinical misuse, the code labels in the dataset may not reflect ICD-9’s intended use. Abstraction alignment can reveal how well MIMIC-III aligns with the ICD-9 abstraction to inform model developers of the abstractions their models may learn and perpetuate in deployment.

To apply abstraction alignment in this setting, we use the ICD-9 hierarchy as the abstraction DAG and the dataset’s ICD-9 code labels to represent the dataset’s encodings. The ICD-9 hierarchy (Mullenbach et al., 2018) contains 21,116 nodes over 7 levels of abstraction. Each node represents an ICD-9 code (e.g., frontal sinusitis) or a higher-level code grouping (e.g., respiratory infections). To compute the abstraction alignment of each dataset instance, we map the dataset’s labels to the ICD-9 nodes (Section 3.2) to create a weighted DAG where the aggregate value of a node represents how many labels were assigned to its descendants. To understand possible discrepancies between the dataset’s abstractions and the ICD-9 abstraction, we use concept confusion to analyze pairs of nodes that co-occur across dataset instances. We can think of pairs with high concept confusion as concepts the dataset represents similarly because both concepts often apply to the same medical note.
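As a rough illustration of the co-occurrence idea described above, the following Python sketch counts how often two nodes both carry a nonzero aggregate value on the same instance. It is a simplified stand-in and not the concept confusion metric defined in Section 3; the function name `cooccurrence_counts`, the input format, and the toy values are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(instances):
    """Count, per unordered node pair, the instances where both nodes are active.

    `instances` is a list of dicts mapping node name -> aggregate value for
    one dataset instance (an assumed input format, for illustration only).
    """
    counts = Counter()
    for aggregate in instances:
        # A node is "active" if any label was assigned at or below it.
        active = sorted(node for node, value in aggregate.items() if value > 0)
        for pair in combinations(active, 2):
            counts[pair] += 1
    return counts


# Toy aggregate values for three medical notes (not real MIMIC-III data).
notes = [
    {"diseases and injuries": 3, "procedures": 1, "v codes": 0},
    {"diseases and injuries": 2, "procedures": 0, "v codes": 1},
    {"diseases and injuries": 1, "procedures": 2, "v codes": 1},
]
pair_counts = cooccurrence_counts(notes)
```

Pairs with high counts are candidates for closer inspection; a production analysis would normalize these counts and use the metric from Section 3 instead.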

We begin our analysis by filtering to nodes representing high-level code groupings (i.e., direct descendants of the root). In ICD-9, there are four top-level groupings: procedures, diseases and injuries, v supplementary health factors, and e supplementary causes of injury and poisoning. In Figure 3, we see it is common for the dataset to contain code labels from several of these high-level code groupings. Assigning procedure codes with disease and injury codes makes sense because a disease defines a treatment procedure. However, the dataset also frequently pairs disease and injury codes with v supplementary health factors codes. In the ICD-9 hierarchy, v supplementary health factors codes are “provided to deal with occasions when circumstances other than a disease or injury are recorded as diagnosis or problems” (American Speech-Language-Hearing Association, 2015). The fact that patient notes in the MIMIC-III dataset often contain both disease and injury and v supplementary health factors codes, when the ICD-9 hierarchy expects them to be used disjointly, suggests a misalignment in the dataset’s abstractions.

Next, we analyze confusion between lower-level nodes to understand how specific dataset labels may be misaligned with the ICD-9 abstraction. Diseases often share a medical correlation, so it is expected that the dataset commonly contains both labels for metabolic factors like disorders of lipid metabolism and essential hypertension. However, we also see frequent co-labeling between “other” codes, like other disease of lung and other and unspecified hypertension. In ICD-9, a code grouping often contains sibling codes representing specific variants of that code followed by an “other” catchall code. The frequent occurrence of “other” code labels in the MIMIC-III dataset could cause models to learn to overapply “other” codes when they are unwarranted. It could also suggest that there are common diseases and procedures missing from ICD-9 or medical issues that have arisen since its development in 1977.

Our abstraction alignment analysis of MIMIC-III reveals discrepancies between how ICD-9 codes are applied in the dataset and the ICD-9 abstraction expectations for disease classification. These discrepancies suggest that even models that achieve high performance on the dataset may not align with medical standards and, if deployed to label patient clinical notes in hospitals, could perpetuate code misapplication, leading to inaccurate insurance billing. Further, our abstraction alignment analysis suggests ways the ICD-9 abstraction does not support real-world coding. In fact, the overuse of “other” codes and joint coding between disease and injury and v supplementary codes that we found via abstraction alignment correspond to real-world changes made during the transition from ICD-9 to ICD-10, such as increasing code specificity and incorporating supplementary codes into the main hierarchy (Cartwright, 2013; World Health Organization, 2022). This result suggests that beyond dataset quality analysis, abstraction alignment can also identify opportunities for improving ground truth human abstractions.

5 Discussion and Future Work

In this paper, we study abstraction alignment — the agreement between a model’s learned conceptual relationships and established human abstractions. In interpretability tasks, abstraction alignment identifies misalignments in model reasoning; in model benchmarking, abstraction alignment expands the expressiveness of evaluation metrics; and, in dataset analysis, abstraction alignment reveals differences between the abstractions we want models to learn and those codified in the dataset.

We consider abstraction alignment to be a paradigm for understanding datasets and ML model behavior and, just as there are many ways to measure models’ representational alignment (Sucholutsky et al., 2023; Terry et al., 2023), we expect there are many techniques to measure abstraction alignment. For instance, following research methods that reveal and edit models’ representation of state (Hernandez et al., 2021; Li et al., 2021; Reif et al., 2019; Hewitt and Manning, 2019), future work could test whether models’ internal representations encode human abstractions. Internal abstraction alignment metrics could study how abstractions change across model layers, how they evolve during training, and whether modifying a model’s internal abstraction improves its performance.

Future work on abstraction alignment should also consider concept theories from cognitive psychology. Our current approach applies Aristotelian concept theory where concepts define exact membership conditions to determine whether an instance is part of a given concept (Rosch, 2011). This means our concepts are discrete: dogs are animals from the species canis lupus, so schnauzer and wolf are both dogs. However, if we were to use graded concept theory (Rosch and Lloyd, 1978), concepts would define a continuous degree of membership. In this setting, a common dog like schnauzer is a strong example of a dog whereas wolf is a weak member because, while technically still a dog, we perceive wolves differently from domesticated dogs. This may suggest a more continuous measurement of abstraction alignment where conceptual relationships are weighted based on the degree of membership.

Finally, in many cases, there may not exist a universal human abstraction that applies to a given task. For example, individual doctors often develop slightly differing medical abstractions as a function of their medical training and clinical experiences (Cai et al., 2019). Thus, besides developing clinical models that agree with medical standards, abstraction alignment can help us develop models that are more personalized to a particular clinician. For instance, abstraction alignment could be used to improve human-AI collaboration by ensuring both humans and models are reasoning with the same abstractions. More interestingly, by adapting abstraction alignment, we could specifically train models to learn abstractions that complement a doctor’s — acting as valuable collaborators with additional expertise and alternate perspectives.

References

Yee [2019] Eiling Yee. Abstraction and concepts: when, how, where, what and why? Language, Cognition and Neuroscience, 34(10):1257–1265, 2019.
Alexander [2018] Christopher Alexander. A pattern language: towns, buildings, construction. Oxford university press, 2018.
Liskov et al. [1986] Barbara Liskov, John Guttag, et al. Abstraction and specification in program development, volume 20. MIT press Cambridge, 1986.
Eva [2005] Kevin W Eva. What every teacher needs to know about clinical reasoning. Medical education, 39(1):98–106, 2005.
Kim et al. [2018] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning (ICML), pages 2668–2677, Stockholm, Sweden, 2018. PMLR.
Ghorbani et al. [2019] Amirata Ghorbani, James Wexler, and Been Kim. Automating interpretability: Discovering and testing visual concepts learned by neural networks. ArXiv, abs/1902.03129, 2019. URL https://api.semanticscholar.org/CorpusID:59842921.
Hernandez et al. [2021] Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations, 2021.
Bau et al. [2017] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3319–3327. IEEE Computer Society, 2017.
Oikarinen and Weng [2022] Tuomas Oikarinen and Tsui-Wei Weng. Clip-dissect: Automatic description of neuron representations in deep vision networks. arXiv preprint arXiv:2204.10965, 2022.
World Health Organization [1978] World Health Organization. International Classification of Diseases, Ninth Revision (ICD-9). World Health Organization, Geneva, 1978.
Miller [1995] George A. Miller. Wordnet: A lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
Dewey [2011] Melvil Dewey. Dewey Decimal Classification and Relative Index. OCLC Online Computer Library Center, Inc., 23 edition, 2011.
Hinchliff et al. [2014] Cody E. Hinchliff, Stephen A. Smith, James F. Allman, John Gordon Burleigh, Ruchi Chaudhary, Lyndon M. Coghill, Keith A. Crandall, Jia bin Deng, Bryan Thomas Drew, Romina Gazis, Karl Gude, David S. Hibbett, Laura A. Katz, H. Dail Laughinghouse Iv, Emily Jane McTavish, Peter E. Midford, Christopher L. Owen, Richard H. Ree, Jonathan A. Rees, Douglas E. Soltis, Tiffani L. Williams, and Karen A. Cranston. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proceedings of the National Academy of Sciences, 112:12764 – 12769, 2014.
Linnaeus [1758] Carl Linnaeus. Systema Naturae per Regna Tria Naturae, Secundum Classes, Ordines, Genera, Species, cum Characteribus, Differentiis, Synonymis, Locis, volume 10. Laurentius Salvius, 1758.
Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE Computer Society, 2009.
Johnson et al. [2016a] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016a.
Johnson et al. [2016b] Alistair Johnson, Tom Pollard, and Roger Mark. Mimic-iii clinical database (version 1.4), 2016b.
Redmon and Farhadi [2017] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
Fellbaum [1998] Christiane Fellbaum. WordNet: An electronic lexical database. MIT press, 1998.
Doshi-Velez and Kim [2017] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning, 2017.
Rai [2020] Arun Rai. Explainable AI: From black box to glass box. Journal of the Academy of Marketing Science, 48(1):137–141, 2020.
Selvaraju et al. [2017] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In International Conference on Computer Vision (ICCV), pages 618–626. IEEE, 2017.
Carter et al. [2019a] Brandon Carter, Jonas Mueller, Siddhartha Jain, and David K. Gifford. What made you do this? Understanding black-box decisions with sufficient input subsets. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 567–576. PMLR, 2019a.
Boggust et al. [2022] Angie Boggust, Benjamin Hoover, Arvind Satyanarayan, and Hendrik Strobelt. Shared interest: Measuring human-ai alignment to identify recurring patterns in model behavior. In Simone D. J. Barbosa, Cliff Lampe, Caroline Appert, David A. Shamma, Steven Mark Drucker, Julie R. Williamson, and Koji Yatani, editors, Conference on Human Factors in Computing Systems (CHI), pages 10:1–10:17. ACM, 2022.
Olah et al. [2017] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017.
Erhan et al. [2009] D. Erhan, Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. 2009.
Bricken et al. [2023] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, page 2, 2023.
Carter et al. [2021] Brandon Carter, Siddhartha Jain, Jonas W Mueller, and David Gifford. Overinterpretation reveals image classification model pathologies. Advances in Neural Information Processing Systems, 34:15395–15407, 2021.
Olah et al. [2018] Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. The building blocks of interpretability. Distill, 3(3):e10, 2018.
Carter et al. [2019b] Shan Carter, Zan Armstrong, Ludwig Schubert, Ian Johnson, and Chris Olah. Activation atlas. Distill, 2019b. doi:10.23915/distill.00015. https://distill.pub/2019/activation-atlas.
Schut et al. [2023] Lisa Schut, Nenad Tomasev, Tom McGrath, Demis Hassabis, Ulrich Paquet, and Been Kim. Bridging the human-ai knowledge gap: Concept discovery and transfer in alphazero. arXiv preprint arXiv:2310.16410, 2023.
Ji et al. [2020] Shaoxiong Ji, Shirui Pan, E. Cambria, Pekka Marttinen, and Philip S. Yu. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems, 33:494–514, 2020. URL https://api.semanticscholar.org/CorpusID:211010433.
Lin et al. [2018] Yankai Lin, Xu Han, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. Knowledge representation learning: A quantitative review. arXiv preprint arXiv:1812.10901, 2018.
Yao et al. [2019] Liang Yao, Chengsheng Mao, and Yuan Luo. Kg-bert: Bert for knowledge graph completion. arXiv preprint arXiv:1909.03193, 2019.
Sun et al. [2023] Kai Sun, Yifan Ethan Xu, Hanwen Zha, Yue Liu, and Xin Luna Dong. Head-to-tail: How knowledgeable are large language models (llm)? aka will llms replace knowledge graphs? arXiv preprint arXiv:2308.10168, 2023.
Huang et al. [2023] Jie Huang, Kevin Chen-Chuan Chang, Jinjun Xiong, and Wen-Mei Hwu. Can language models be specific? how? In Findings of the Association for Computational Linguistics (ACL), pages 716–727. ACL, 2023.
He et al. [2016a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), pages 770–778. IEEE Computer Society, 2016a.
Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186. Association for Computational Linguistics, 2019.
Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
Radford et al. [2019] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
Zech et al. [2018] John R Zech, Marcus A Badgeley, Manway Liu, Anthony B Costa, Joseph J Titano, and Eric Karl Oermann. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS medicine, 15(11):e1002683, 2018.
Caliskan et al. [2022] Aylin Caliskan, Pimparkar Parth Ajay, Tessa Charlesworth, Robert Wolfe, and Mahzarin R. Banaji. Gender bias in word embeddings: A comprehensive analysis of frequency, syntax, and semantics. In Vincent Conitzer, John Tasioulas, Matthias Scheutz, Ryan Calo, Martina Mara, and Annette Zimmermann, editors, AAAI/ACM Conference on AI, Ethics, and Society (AIES), pages 156–170. ACM, 2022.
Alexander et al. [2003] Sherri Alexander, Therese Conner, and Teresa Slaughter. Overview of inpatient coding. American journal of health-system pharmacy : AJHP : official journal of the American Society of Health-System Pharmacists, 60 21 Suppl 6:S11–4, 2003.
O’Malley et al. [2005] Kimberly O’Malley, Karon F. Cook, Matt D. Price, Kimberly Raiford Wildes, John F. Hurdle, and Carol M. Ashton. Measuring diagnoses: Icd code accuracy. Health services research, 40 5 Pt 2:1620–39, 2005.
Mullenbach et al. [2018] James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. Explainable prediction of medical codes from clinical text. In Marilyn A. Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1101–1111. Association for Computational Linguistics, 2018.
American Speech-Language-Hearing Association [2015] American Speech-Language-Hearing Association. Module Two: International Classification of Diseases–9th Revision–Clinical Modification (ICD-9-CM). https://www.asha.org/practice/reimbursement/module-two/, 2015. Accessed: 2024-01-29.
Cartwright [2013] Donna J Cartwright. Icd-9-cm to icd-10-cm codes: what? why? how?, 2013.
World Health Organization [2022] World Health Organization. International Classification of Diseases, 10th Revision. World Health Organization, Geneva, Switzerland, 2022.
Sucholutsky et al. [2023] Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C. Love, Erin Grant, Jascha Achterberg, Joshua B. Tenenbaum, Katherine M. Collins, Katherine L. Hermann, Kerem Oktar, Klaus Greff, Martin N. Hebart, Nori Jacoby, Qiuyi Zhang, Raja Marjieh, Robert Geirhos, Sherol Chen, Simon Kornblith, Sunayana Rane, Talia Konkle, Thomas P. O’Connell, Thomas Unterthiner, Andrew K. Lampinen, Klaus-Robert Müller, Mariya Toneva, and Thomas L. Griffiths. Getting aligned on representational alignment. CoRR, abs/2310.13018, 2023. doi:10.48550/ARXIV.2310.13018. URL https://doi.org/10.48550/arXiv.2310.13018.
Terry et al. [2023] Michael Terry, Chinmay Kulkarni, Martin Wattenberg, Lucas Dixon, and Meredith Ringel Morris. AI alignment in the design of interactive AI: specification alignment, process alignment, and evaluation support. CoRR, abs/2311.00710, 2023. doi:10.48550/ARXIV.2311.00710. URL https://doi.org/10.48550/arXiv.2311.00710.
Li et al. [2021] Belinda Z. Li, Maxwell I. Nye, and Jacob Andreas. Implicit representations of meaning in neural language models. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 1813–1827. Association for Computational Linguistics, 2021.
Reif et al. [2019] Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viégas, Andy Coenen, Adam Pearce, and Been Kim. Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8592–8600, 2019.
Hewitt and Manning [2019] John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (NAACL-HLT), pages 4129–4138. Association for Computational Linguistics, 2019.
Rosch [2011] Eleanor Rosch. Slow lettuce: Categories, concepts, fuzzy sets, and logical deduction. Concepts and fuzzy logic, 8:89–120, 2011.
Rosch and Lloyd [1978] Eleanor Rosch and Barbara B Lloyd. Principles of categorization. 1978.
Cai et al. [2019] Carrie J. Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. "hello ai": Uncovering the onboarding needs of medical practitioners for human-ai collaborative decision-making. Proc. ACM Hum. Comput. Interact., 3(CSCW):104:1–104:24, 2019.
Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
He et al. [2016b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 630–645. Springer, 2016b.
Sutskever et al. [2013] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147. PMLR, 2013.
Petroni et al. [2020] Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. How context affects language models’ factual predictions. In Automated Knowledge Base Construction, 2020. URL https://openreview.net/forum?id=025X0zPfn.
Petroni et al. [2019] F. Petroni, T. Rocktäschel, A. H. Miller, P. Lewis, A. Bakhtin, Y. Wu, and S. Riedel. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.

A Appendix

A.1 Experimental details

Here we describe the experimental details for each experiment in Section 4. Code to recreate our experiments can be found at: https://github.com/mitvis/abstraction-alignment. An interactive interface for exploring experimental results is provided at: https://vis.mit.edu/abstraction-alignment/.

A.1.1 Interpreting model behavior with abstraction alignment

In Section 4.1, we use abstraction alignment to interpret a CIFAR-100 image classification model. We train a PyTorch [Paszke et al., 2019] ResNet20 model [He et al., 2016a] on the CIFAR-100 training set [Krizhevsky et al., 2009] for 200 epochs with a batch size of 128. We apply random crop and horizontal flip data augmentations to the images following He et al. [2016b]. We use cross-entropy loss optimized via stochastic gradient descent and Nesterov momentum [Sutskever et al., 2013] (momentum = 0.9; weight decay = 5e-4). We use a learning rate of 0.1 and reduce it at epochs 60, 120, and 160 using a gamma of 0.2. The trained model achieves $67.7\%$ accuracy on the CIFAR-100 test set.
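To make the learning rate schedule concrete, here is a minimal plain-Python sketch of a step schedule with the hyperparameters stated above. It assumes zero-indexed epochs and that the rate drops at the start of each milestone epoch; the function name `learning_rate` is hypothetical and this is not the training code from the repository.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(60, 120, 160), gamma=0.2):
    """Step schedule: multiply base_lr by gamma once per milestone reached.

    Assumes epochs are zero-indexed and the drop applies from the
    milestone epoch onward (an assumption, not confirmed by the paper).
    """
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops


# Learning rate at the first epoch of each phase of the 200-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 60, 120, 160)}
```

Under these assumptions the rate falls by a factor of five at each milestone, ending at roughly 0.0008 for the final 40 epochs.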

To apply abstraction alignment we use the CIFAR-100 class/superclass mapping [Krizhevsky et al., 2009] to form an abstraction DAG. The DAG contains 121 nodes across 3 levels: 100 class nodes (level 1), 20 superclass nodes (level 2), and a root node (level 3). We create a weighted DAG for every dataset instance in the CIFAR-100 test set, representing the model’s abstraction alignment for that instance. To do so, for a given image, we compute the model’s softmax output probabilities over the classes. Following Algorithm 1, we assign class nodes in the DAG a value equal to the model’s output probability for that class. All other nodes receive a value of zero. We compute every node’s aggregate value as the sum of its own value and the values of all of its descendants. For instance, the node tulip’s aggregate value is the model’s output probability that the image is a tulip, whereas the node flower’s aggregate value is the sum of the model’s output probability for orchid, rose, tulip, sunflower, and poppy.
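The aggregation step can be sketched in a few lines of Python. This is an illustrative reading of the description above, not a reproduction of Algorithm 1: it assumes every node has a single parent (true for the CIFAR-100 class/superclass tree), and the function name, the truncated toy hierarchy, and the probabilities are hypothetical.

```python
def aggregate_values(children, values):
    """Return each node's aggregate: its own value plus all descendants' values.

    `children` maps a node to its list of child nodes; `values` maps a node
    to its directly assigned value (nodes without a value default to zero).
    Assumes a tree, so no descendant is reachable by two different paths.
    """
    totals = {}

    def visit(node):
        if node not in totals:
            total = values.get(node, 0.0)
            for child in children.get(node, []):
                total += visit(child)
            totals[node] = total
        return totals[node]

    for node in children:
        visit(node)
    return totals


# Toy slice of the hierarchy; the "trees" superclass is truncated for brevity.
children = {
    "root": ["flowers", "trees"],
    "flowers": ["orchid", "poppy", "rose", "sunflower", "tulip"],
    "trees": ["maple", "oak"],
}
# Hypothetical softmax outputs for one image (only nonzero classes listed).
softmax = {"tulip": 0.6, "rose": 0.2, "poppy": 0.1, "oak": 0.1}
node_totals = aggregate_values(children, softmax)
```

Here the flowers node accumulates the probability mass of its five classes, and the root accumulates the full distribution.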

A.1.2 Benchmarking language models

In Section 4.2, we apply abstraction alignment to benchmark language models. Following the benchmarking procedure in Huang et al. [2023], we compare pretrained bert-base [Devlin et al., 2019], bert-large [Devlin et al., 2019], roberta-base [Liu et al., 2019], roberta-large [Liu et al., 2019], and gpt-2 [Radford et al., 2019] models from the LAMA benchmark (https://github.com/facebookresearch/LAMA) [Petroni et al., 2020, 2019]. We test each model on the occupation, location, and birthplace tasks from the S-TEST dataset (https://github.com/jeffhj/S-TEST) [Huang et al., 2023]. Each data instance in the S-TEST dataset is a text query paired with one specific and one general answer label. For each model, we compute its top-10 accuracy, measured as the proportion of instances where the specific answer was in the model’s top 10 predicted tokens.
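The top-10 accuracy described above can be sketched as follows. The function name `top_k_accuracy`, the dictionary-based input format, and the toy probabilities are assumptions for illustration; the example uses k = 2 only to keep the toy vocabulary small.

```python
def top_k_accuracy(predictions, labels, k=10):
    """Fraction of instances whose label is among the k most probable tokens.

    `predictions` is a list of dicts mapping token -> probability, one per
    instance; `labels` holds the specific answer label for each instance.
    """
    hits = 0
    for probs, label in zip(predictions, labels):
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Two hypothetical queries whose specific answer is "painter".
predictions = [
    {"painter": 0.5, "artist": 0.3, "person": 0.2},
    {"painter": 0.1, "artist": 0.3, "person": 0.6},
]
labels = ["painter", "painter"]
accuracy = top_k_accuracy(predictions, labels, k=2)
```

In the second query the specific answer falls outside the top two tokens, so only one of the two instances counts as correct.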

To measure abstraction alignment, we create an abstraction DAG for each of the occupation, location, and birthplace tasks. For a task, we map each of its specific answer labels to its corresponding node (i.e., synset) in WordNet [Miller, 1995, Fellbaum, 1998]. We do this by searching for the specific answer label in the NLTK WordNet corpus (https://www.nltk.org/howto/wordnet.html). If a search returns multiple WordNet nodes, we select the most appropriate node by manually inspecting their WordNet definitions. Then, we expand the DAG by including all direct ancestors and descendants of any specific answer nodes. We only consider ancestors and descendants that exist in the model’s vocabulary. The result is a DAG containing all the vocabulary words related to any of the data instances’ specific answer labels.
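The expansion step can be illustrated without WordNet itself. In the sketch below, a small hand-written `parents` dictionary stands in for WordNet's hypernym relations, and the function names, toy hierarchy, and vocabulary are all hypothetical. It collects only the node set; how edges are reconnected when an out-of-vocabulary ancestor is dropped is left open.

```python
def walk(start, edges):
    """Return every node reachable from `start` by following `edges`."""
    seen, stack = set(), list(edges.get(start, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(edges.get(current, []))
    return seen


def build_task_nodes(answer_nodes, parents, children, vocabulary):
    """Answer nodes plus their in-vocabulary ancestors and descendants."""
    nodes = set(answer_nodes)
    for answer in answer_nodes:
        related = walk(answer, parents) | walk(answer, children)
        nodes |= {node for node in related if node in vocabulary}
    return nodes


# Toy stand-in for WordNet hypernym links (child -> list of parents).
parents = {
    "muralist": ["painter"],
    "painter": ["artist"],
    "sculptor": ["artist"],
    "artist": ["creator"],
    "creator": ["person"],
}
# Invert the parent links to obtain child links.
children = {}
for child, parent_list in parents.items():
    for parent in parent_list:
        children.setdefault(parent, []).append(child)

vocabulary = {"painter", "artist", "person", "muralist", "sculptor"}
task_nodes = build_task_nodes({"painter"}, parents, children, vocabulary)
```

In this toy case, sculptor is excluded because it is neither an ancestor nor a descendant of painter, and creator is excluded because it is not in the assumed vocabulary.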

To create weighted DAGs, we compute the model’s output probability across every word in its vocabulary for every data instance. For each data instance, we assign the model’s output probabilities to their corresponding nodes in the DAG. We use the weighted DAGs to compute three specificity metrics using the subgraph preference function (Equation 3). In each metric, we use the node values corresponding to the model’s predicted probability outputs. First, we replicate the specificity testing metric from Huang et al. [2023] (originally called $p_{r}$). We compute it as $P(s_{s},s_{g})$, where $s_{s}$ is the single-node graph containing the specific answer label and $s_{g}$ is the single-node graph containing the general answer label. Next, we compute $P(s_{s\downarrow},s_{s\uparrow})$ to compare all words at the specific label’s level of abstraction and lower, $s_{s\downarrow}$ (the specific label and its descendants), to all words at a higher level of abstraction than the specific label, $s_{s\uparrow}$ (the specific label’s ancestors). Finally, we compute $P(s_{s\downarrow\uparrow},s_{t})$ to compare the ancestors and descendants of the specific label, $s_{s\downarrow\uparrow}$, to any other word in the task DAG, $s_{t}$.
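The three metrics share one computation applied to different node sets, which the sketch below tries to convey. Equation 3 is not reproduced in this appendix, so the comparison rule used here (a subgraph is preferred on an instance when its highest-valued node outscores the other subgraph's highest-valued node) is an assumption, as are the function names and toy values; consult Equation 3 for the actual definition.

```python
def prefers(values, nodes_a, nodes_b):
    """True if the best-scoring node in nodes_a beats the best in nodes_b.

    ASSUMPTION: "preference" is decided by the maximum node value in each
    subgraph; the paper's Equation 3 may define it differently.
    """
    best_a = max((values.get(node, 0.0) for node in nodes_a), default=0.0)
    best_b = max((values.get(node, 0.0) for node in nodes_b), default=0.0)
    return best_a > best_b


def subgraph_preference(instances):
    """Fraction of (values, nodes_a, nodes_b) instances that prefer nodes_a."""
    return sum(prefers(*instance) for instance in instances) / len(instances)


# Specific label "painter": descendants-or-self versus ancestors (toy values).
s_down = {"painter", "muralist"}
s_up = {"artist", "person"}
instances = [
    ({"painter": 0.4, "muralist": 0.1, "artist": 0.3, "person": 0.2}, s_down, s_up),
    ({"painter": 0.1, "muralist": 0.1, "artist": 0.5, "person": 0.3}, s_down, s_up),
]
preference = subgraph_preference(instances)
```

Swapping in single-node sets for the specific and general labels would give the analogue of $P(s_{s},s_{g})$ under the same assumed rule.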

A.1.3 Analyzing datasets using abstraction alignment

In Section 4.3, we apply abstraction alignment to analyze the abstractions in the MIMIC-III dataset [Johnson et al., 2016a, b]. The dataset contains textual medical notes paired with a set of ICD-9 code labels. We use the ICD-9 medical hierarchy as the abstraction DAG [World Health Organization, 1978]. We pair the dataset’s ICD-9 code labels with their corresponding code in the ICD-9 abstraction DAG. To compute weighted DAGs for every dataset instance, we set a code node’s value equal to one if the code was labeled on that instance and zero otherwise. For all other nodes (e.g., non-codable node groupings), we assign their aggregate value as the sum of their children’s aggregate values. As a result, the aggregate value of a node is equivalent to the number of times it or one of its descendants labeled the medical note. In this task, non-leaf nodes can be codable. For instance, both sickle-cell anemia and its direct parent hereditary hemolytic anemias can be applied to the same medical note.

A.2 Compute resources and efficiency

All abstraction alignment analysis is performed on CPU. Time to build the DAG and compute abstraction alignment depends on the number of nodes in the DAG and the abstraction alignment metric. On the CIFAR-100 DAG (121 nodes) [Krizhevsky et al., 2009], computing the weighted DAG for 10,000 CIFAR-100 test images takes approximately 2 minutes, computing accuracy alignment and uncertainty alignment takes under a minute, and computing concept confusion takes on the order of 15 minutes. On the MIMIC-III DAG (21,166 nodes), creating the weighted abstraction DAGs and computing concept confusion takes around 30 minutes.

We train and evaluate the models used in the experiments on 1 NVIDIA V100 GPU with 1TB of memory. Training the CIFAR-100 ResNet20 model takes approximately 30 minutes. Running inference on the S-TEST dataset takes roughly 10 minutes per language model.

A.3 Additional abstraction alignment examples

Additional examples of the abstraction alignment of data instances from Section 4.1 and Section 4.3 are available in an exploratory interface at https://vis.mit.edu/abstraction-alignment/.