Codebook LLMs: Adapting Political Science Codebooks for
LLM Use and Adapting LLMs to Follow Codebooks

Andrew Halterman
Department of Political Science
Michigan State University
[email protected]

Katherine A. Keith
Department of Computer Science
Williams College
[email protected]

Abstract

Codebooks—documents that operationalize constructs and outline annotation procedures—are used almost universally by social scientists when coding unstructured political texts. Recently, to reduce manual annotation costs, political scientists have looked to generative large language models (LLMs) to label and analyze text data. However, previous work using LLMs for classification has implicitly relied on the universal label assumption—correct classification of documents is possible using only a class label or minimal definition and the information that the LLM inductively learns during its pre-training. In contrast, we argue that political scientists who care about valid measurement should instead make a codebook-construct label assumption—an LLM should follow the definition and exclusion criteria of a construct/label provided in a codebook. In this work, we collect and curate three political science datasets and their original codebooks and conduct a set of experiments to understand whether LLMs comply with codebook instructions, whether rewriting codebooks improves performance, and whether instruction-tuning LLMs on codebook-document-label tuples improves performance over zero-shot classification. Using Mistral 7B Instruct as our LLM, we find re-structuring the original codebooks gives modest gains in zero-shot performance but the model still struggles to comply with the constraints of the codebooks. Optimistically, instruction-tuning Mistral on one of our datasets gives significant gains over zero-shot inference (0.76 versus 0.53 micro F1). 
We hope our conceptualization of the codebook-specific task, assumptions, and instruction-tuning pipeline, as well as our semi-structured LLM codebook format, will help political scientists readily adapt to the LLM era.¹

¹ To keep our evaluation data out of LLM training data, we are taking a “walled garden” approach and are not releasing the data with a public link; please email the authors if you would like access to the data. Code will be available in a public repository upon publication.

1 Introduction

Measurement is essential to social science: how can broad concepts be expressed as precise constructs, raw data be turned into valid measurements, and the resulting measurements incorporated into downstream analyses? For many political science research projects, text is the best source of data, either as the object of interest in itself (e.g., political speeches, press releases, or political manifestos), or as a useful source of events in the world (e.g., news reports on protests, political violence, etc.). In order to transform unstructured text data into valid quantitative measurements of social science constructs, many researchers rely on human annotation via codebooks, documents that define concepts and outline annotation procedures (Krippendorff 2018).

Manually labeling all texts in large-scale projects according to codebook instructions is extremely costly. For well over a decade, analysts have attempted to automate or scale up manual content analysis via supervised or unsupervised machine learning (Grimmer and Stewart 2013). Recently, the promise of better classification performance with fewer (or no) training examples has led analysts to turn to generative large language models (LLMs) to annotate text (Egami et al. 2024; Gilardi, Alizadeh and Kubli 2023; Peskoff and Stewart 2023; Ziems et al. 2024; Wu et al. 2023; Atreja et al. 2024). LLMs are transformer-based deep learning models trained with a language modeling loss on large corpora (typically scraped from the web). These models have grown dramatically in their number of parameters; for instance, there are several hundred times more parameters in GPT-3 ($2\times 10^{11}$ parameters; Brown et al. (2020)) than in BERT ($3\times 10^{8}$ parameters; Devlin et al. (2019)).

Much of the excitement for LLMs from social scientists comes from LLMs’ potential for zero-shot classification, meaning one can obtain classification predictions with zero human-labeled training examples. Under this zero-shot paradigm, LLMs take as input a sequence of tokens (word pieces) consisting of the instance of text one wishes to annotate and annotation instructions. Then the LLM outputs a sequence of tokens which can be parsed to obtain the predicted class.
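The zero-shot loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in this paper: `build_prompt`, `parse_label`, `zero_shot_classify`, and `stub_llm` are hypothetical names, and `stub_llm` is a stand-in for a real model call that always returns the same string.

```python
from typing import Callable, Optional, Sequence


def build_prompt(instructions: str, text: str) -> str:
    """Join the text to annotate and the annotation instructions into one prompt."""
    return f"{text.strip()}\n\n{instructions.strip()}"


def parse_label(generated: str, labels: Sequence[str]) -> Optional[str]:
    """Return the allowed label that appears earliest in the output, else None.

    Picking the earliest match means "not a protest" wins over the "protest"
    it contains; ties at the same position go to the longer label.
    """
    lowered = generated.lower()
    hits = [
        (lowered.find(label.lower()), -len(label), label)
        for label in labels
        if label.lower() in lowered
    ]
    return min(hits)[2] if hits else None


def zero_shot_classify(
    generate: Callable[[str], str],
    instructions: str,
    text: str,
    labels: Sequence[str],
) -> Optional[str]:
    """Prompt the model once and parse its generated tokens into a class."""
    return parse_label(generate(build_prompt(instructions, text)), labels)


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; a real model would go here."""
    return "A. a protest"
```

Returning `None` when no allowed label is found keeps unparseable generations visible instead of silently mapping them to a class.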

In an ideal world with perfect LLM accuracy, a political scientist working on a specialized corpus could specify the exact construct they wanted to measure from text and use LLM predictions in downstream analyses. For example, suppose a political scientist wanted to measure the number of protests from decades of news articles. She might start with prompt templates from prior work using LLMs for social science annotation (e.g., Ziems et al. (2024); Figure 1) and input to an LLM: “A hundred people marched through the capital in opposition to the proposed law. Is this A. a protest or B. not a protest?” Ideally, the LLM would output “A. a protest” and the instance of text would be recorded as a positive instance of the protest class. However, because the ultimate goal is to measure a complex social construct (here, a “protest”), this kind of definition-agnostic LLM prompt can have negative empirical consequences due to the following oversimplified assumption.

Universal label assumption. We define the universal label assumption as the assumption that natural-language classification labels (that an LLM predicts) have one universal meaning. This assumption implies that class labels, for example, “protest” or “liberal,” mean only what the LLM has inductively learned from its pre-training data, and that this meaning matches the concept a social scientist wishes to code. This assumption is related to the “background concept” term used by Adcock and Collier (2001, p. 530), in which there is a “constellation of potentially diverse meanings associated with a given concept”. In many natural language processing (NLP) tasks, such as sentiment classification for product or movie reviews (Pang, Lee and Vaithyanathan 2002), the “background concepts” cued by the “positive” or “negative” sentiment labels are likely to be sufficient. However, we provide the following three real-world examples in which the universal label assumption breaks down in social science applications.

First, Ziems et al. (2024) show that an LLM can accurately categorize political statements into “Liberal”, “Conservative”, or “Neutral”. In a specific instance (Figure 1), the LLM correctly predicts that a pro-labor union statement is a liberal political position. It does so, however, by adopting the universal label assumption we describe above, drawing on the large pre-training corpus to apply a standard definition of liberal and conservative. In specific contexts, this assumption is likely to break down. In Australian politics, for example, the Liberal Party is a conservative, center-right/right-wing party that opposes labor unions, so asking whether a statement is liberal invites ambiguity. Does the LLM predict “liberal” in an ideological sense or “liberal” in the sense of the (conservative) party in Australia?

Figure 1: Utterance ideology example provided in Ziems et al. (2024)’s online appendix. This illustrates an (implicit) universal label assumption: no further instructions are provided and the LLM must rely on the background concepts of “liberal” and “conservative”.

A second example comes from the extensive study of protests in political science and sociology. While all definitions of protests include a shared core (claim-making or demands for political or social change outside of formal political channels), specific projects use different definitions of protests, categorize protests at varying levels of granularity, and apply different exclusion criteria. These specific definitions depend on the empirical project and are a crucial aspect of high-quality social science measurement. For example, the BFRS dataset on political violence in Pakistan (Bueno de Mesquita et al. 2015) excludes all non-violent protests, while making a distinction between a “riot” and a “violent political demonstration”. In contrast, the common automated event data ontology CAMEO (Gerner et al. 2002) collapses these two into a single “riot” category, while also coding other non-violent protest tactics. The MPEDS project (Hanna 2017, Appendix A) applies much more thoroughly defined requirements about the identity of those protesting than the CAMEO ontology does. CAMEO excludes resignations and complaints by elites as a form of protest, while the Electoral Contestation and Violence (ECAV) dataset (Daxecker, Amicarelli and Jung 2019), which focuses exclusively on subnational protests in the context of elections, includes these as protests. Applying these different operationalizations of “protest” to the same corpus should not produce the same measurements.

As a final example, the empirical study of populism in politics has employed several conceptions of populism. Without providing a precise definition to an LLM labeling populism in text, a researcher cannot be sure if the LLM is labeling populism by rhetoric (e.g., anti-elite rhetoric, rhetoric about the “general will” of the people, or a Manichean worldview), by populist policies (e.g., welfare chauvinism), or conflating ideology and populism (e.g., conflating right-wing nativism with populism) (Meijers and Zaslove 2021). A measurement strategy that conflates these varying definitions of populism will prevent researchers from answering questions such as the relationship between populist rhetoric and populist policies.

Figure 2: Codebook task and pipeline overview (top) with different operationalizations of the “protest” concept in the CCC (Crowd Counting Consortium 2024) and CAMEO (Gerner et al. 2002) datasets (bottom). The *universal label assumption* relies on a “background” concept represented by a label (e.g., a “protest”), while the *codebook-construct label assumption* applies a specific, precisely defined construct from a codebook.

Codebook-construct label assumption. In contrast with the universal label assumption, we define the codebook-construct label assumption as the assumption that a natural-language classification label is a placeholder for a rich operationalization of a social science concept that analysts define, constrain, and clarify in a specific codebook document. The goal is to use the LLM as a social measurement instrument, and as such the LLM should comprehend and comply with the definition of a class label from the context of the codebook. This assumption aligns with a more general assumption made by Raji et al. (2021) about datasets used in machine learning and AI: “all datasets come with an embedded perspective – there is no neutral or universal dataset.” Similarly, Adcock and Collier (2001) argue that broad “background” social science concepts must be “systematized” with an explicit definition, and then operationalized into measures before they can be used to generate useful data.

In this work, we argue that social scientists using LLMs for measurement should make the codebook-construct label assumption. We describe conceptual work that prescribes how to incorporate this assumption into the codebook-based annotation task and pipeline (see Figure 2 for an overview), then assess an LLM’s performance on the task using empirical experiments on real-world political science datasets. In doing so, we aim to address the following research questions:

RQ1. What is the zero-shot classification performance of LLMs on real-world political science datasets under the universal label assumption?

Specifically, if we provide only a label name without any other definition or clarification and zero training examples, can an LLM accurately predict the gold-standard label for a document? This question has been addressed by previous work (Ziems et al. 2024; Atreja et al. 2024), but we apply the RQ to three new datasets: the Crowd Counting Consortium (CCC) dataset on protests in the United States (Crowd Counting Consortium 2024), the BFRS dataset on violence in Pakistan (Bueno de Mesquita et al. 2015), and the Manifesto Project corpus (Lehmann et al. 2017); see Section 4 for details. We chose these datasets because they have rich class labels and codebook definitions that allow us to explore this RQ. For example, in the CCC dataset on protests, does the LLM pre-training data include enough information for a model to reliably distinguish between a protest, rally, demonstration, and march?

We contrast this question about the viability of LLMs under the universal label assumption with our second research question:

RQ2a. What is the zero-shot classification performance of LLMs on real-world political science datasets under the codebook-construct label assumption?

Given that social scientists put a great deal of effort into conceptualizing categories and writing detailed codebooks that guided the gold-standard human annotations, we hypothesize LLMs will perform better if they are provided with the entire set of codebook instructions in their input prompt. However, the greatly expanded size of the prompt may also pose challenges for LLMs if they struggle to attend to all parts of the document-codebook prompt (Levy, Jacoby and Goldberg 2024). Additionally, under the codebook-construct label assumption and zero-shot paradigm, we seek to understand how the format and composition of the codebook affect LLM performance, specifically:

RQ2b. What components of the codebook impact classification performance?

While zero-shot classification is efficient and straightforward, social scientists typically want the best possible accuracy during the measurement phase since these measurements will be incorporated into inference. Thus, we also explore improving performance by further training LLMs via a paradigm called instruction-tuning (Wei et al. 2022; Sanh et al. 2022; Wang et al. 2022; Longpre et al. 2023; Wang et al. 2023). When instruction-tuning, we take a classification dataset with gold-standard classification labels (i.e., the texts labeled by domain experts) and further train the LLM on these gold-standard labels via a supervised language modeling objective. This paradigm has proven extremely promising in boosting LLMs’ ability to follow instructions and has increased zero-shot performance on suites of “traditional” natural language processing tasks (Wei et al. 2022). Promisingly, instruction-tuning has recently been shown to scale to a local GPU using parameter-efficient training methods such as quantization and/or training low-rank adapter weights (Hu et al. 2022; Dettmers et al. 2024). Our hope is that if we eventually gather enough codebooks and their corresponding datasets, we might be able to train the LLM to follow the codebook-construct label assumption and read, comprehend, and comply with the codebook definition of class labels. This leads to:

RQ3a. How does instruction-tuning an LLM on a single codebook and supervised examples compare to zero-shot classification at inference time?

Ideally, an LLM trained on the broad task of document classification using social science codebooks would perform well on the task of annotating documents according to codebook instructions, even on codebooks it had not seen during instruction-tuning. Thus, our final research question asks about the generalizability of the task:

RQ3b. How does instruction-tuning an LLM on multiple codebooks and supervised examples affect classification performance at inference time on a new codebook?

In subsequent sections, we examine these research questions using one of the best-performing “local” models (i.e., models whose weights can be downloaded to a local machine for reproducibility), Mistral 7B Instruct (Jiang et al. 2023). We also describe related work (Section 2), formalize the codebook measurement task and assumptions (Section 3), describe our data collection and curation (Section 4), and walk through our empirical findings (Section 5).

Will this work stand the test of time? Given the current rapid pace of development and improvement in performance of LLMs, it is reasonable to ask whether this work will stand the test of time and be relevant in 5, 10, or even 20 years. We keep this in mind and intentionally compare conditions within a single LLM, rather than comparing across multiple LLMs. This helps isolate our research questions, which focus on codebook component ablations and on comparing zero-shot and instruction-tuning conditions. We believe our conceptualization of the codebook-specific task, assumptions, and instruction-tuning pipeline, as well as our semi-structured LLM codebook format, should be agnostic to improvements in the base models.

2 Related Work

Our work sits at the intersection of a rapidly growing methodological literature that uses LLMs for social science and political science text analysis, technical improvements in NLP machine reading comprehension, and a mature literature on measurement in the social sciences.

2.1 LLMs in social science annotation, classification, and inference

A number of researchers have attempted to use LLMs for zero-shot or few-shot classification (implicitly under the universal label assumption). Halterman et al. (2021) use a BERT model fine-tuned for a natural language inference (NLI) task for zero-shot classification of police events and achieve up to 0.74 F1 on some classes. Ziems et al. (2024) collect 25 representative English computational social science (CSS) benchmarks and evaluate LLMs on utterance-level classification tasks (e.g., classifying persuasion or misinformation), conversation-level tasks (e.g., classifying persuasion or politeness) and document-level tasks (e.g., classifying event arguments or ideology); they find that although zero-shot results rarely exceed the carefully tuned supervised RoBERTa baseline, some tasks show respectable zero-shot performance (in terms of F1 score). Gilardi, Alizadeh and Kubli (2023) find zero-shot ChatGPT outperforms crowd workers on relevance, stance, topics, and frame detection tasks. Mellon et al. (2024) use LLMs to code open-text survey responses and find similar accuracy to human annotators. Pangakis and Wolken (2024) gather 14 classification tasks from recently published CSS articles; they generate labels using LLMs and then use these labels in (more classical) supervised ML pipelines and find the classifiers perform comparably to models with labels from human annotators. Atreja et al. (2024) examine four social science tasks (toxicity, sentiment analysis, rumor stance detection, and news frame classification) and use a factorial design to vary aspects of the prompt: definition inclusion, output type, explanation, and prompt length; they find results are very task and model dependent. Halterman (2023) shows that careful prompting with social science concepts can generate useful synthetic training data for document classifiers.

Recent work, which is complementary to our work here, examines the consequences of using noisy, biased, “surrogate” or “proxy” labels in downstream inference (Fong and Grimmer 2019; Knox, Lucas and Cho 2022; Egami et al. 2024; Chen, Bhattacharya and Keith 2024). Specifically, Egami et al. (2024) find that even when LLMs are quite accurate in labeling documents, using these “surrogate” labels in downstream analysis as independent variables can induce quite severe bias and coverage issues. They suggest a model-based adjustment that re-weights the LLM surrogate labels using a set of expert hand-labeled documents. This strand of research emphasizes that even if LLMs can accurately apply labels using codebooks in a zero-shot setting, valid downstream analysis may require adjustments to the outputs.

2.2 Measurement in political science

Our work also builds on a more mature literature on measurement in political science. Adcock and Collier (2001) articulate the “basic question of measurement validity: Do the observations meaningfully capture the ideas contained in the concepts?” (529). Or, in terms of Fariss, Kenwick and Reuning (2020), measurement involves a link between a theoretical concept, a specific and defined construct for generating observations, and the resulting observable data. These distinctions map onto the two assumptions we articulate above. Using a universal label is analogous to asking the LLM to classify via a non-systematized “background” concept. A codebook is a form of a concretized construct, laying out precise, context-specific definitions. In the past, generating observable data from text without formalizing a systematized construct was difficult. Zero-shot LLM-based annotation, however, makes it possible to bypass some of the operationalization that hand-annotation or supervised learning required. In subsequent sections, we discuss the link between codebooks and social science constructs and the pitfalls inherent in using LLMs without careful construct creation.

2.3 Machine reading comprehension

In NLP, the closest task to our codebook task is machine reading comprehension, also known as question answering. The task of machine reading comprehension poses text passages to NLP models and then has the models answer questions about the passages.

Popular machine reading comprehension datasets include the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al. 2016) and SQuAD 2.0 (Rajpurkar, Jia and Liang 2018), which consist of hundreds of thousands of Wikipedia-based question-answer pairs. The answer to every question is an extractive span (a segment of text from the text passage). Other variants of machine reading comprehension datasets require models to answer questions about news articles (Trischler et al. (2017)’s NewsQA), find and reason over multiple supporting documents to answer questions (Yang et al. (2018)’s HotPotQA), or contain questions derived from Google search users’ queries (Kwiatkowski et al. (2019)’s Natural Questions).

In contrast with our codebook task, machine reading comprehension questions are usually factoid questions (general world knowledge) or extractive questions (a simple question whose answer is a direct quote from the provided passage), not questions constrained by context-specific definitions crafted by domain experts in codebooks. Our work builds on these NLP literatures, but is distinguished from previous work by our focus on codebook-specific constructs and requiring reasoning over both long documents and long codebook instructions with many categories.

3 Formalization of the codebook measurement task

In this section, we formally define the codebook measurement task as well as the universal label assumption and the codebook-construct label assumption. For now, we only focus on multi-class, single-label classification; in other words, the classifier predicts from multiple classes but the classes must be mutually exclusive and only one class label can be applied to each instance.

Let $\mathcal{Z}$ be a finite set of $L$ concepts/labels described in natural-language words or phrases. For example, the BFRS dataset employs $L=8$ labels, $\mathcal{Z}=$ {assassination, drone assassination, attack on state, conventional attack on government forces, guerilla attack on government forces, government attack on civilians, riot, terrorism, threat of violence, violent political demonstration, other}.²

² We simplify these labels slightly from the original BFRS labels. See Appendix B.

Let $C$ be the codebook, a sequence of tokens (words or word pieces) that operationalize $\mathcal{Z}$. The codebook is a concatenation of the natural-language operationalizations of each of the labels $z_{1},z_{2},\dots,z_{L}\in\mathcal{Z}$, namely $C=[C_{z_{1}},C_{z_{2}},\dots,C_{z_{L}}]$. For example, in BFRS, the (abbreviated) description of $C_{\textsc{terrorism}}$ is “Premeditated, politically motivated violence perpetrated against noncombatant targets by subnational groups or clandestine agents.” Each $C_{z_{j}}$ can consist of a longer or shorter operationalization, with or without clarifications or examples (see Section 4.3).

The goal of the codebook measurement task is to infer the label $\hat{Y}_{i}\in\mathcal{Z}$ given the tuple $(C,X_{i})$ for all $i$ in the inference dataset where $X_{i}$ is the text of interest. This input text could be a full document, one or more sentences, or a quasi-sentence (e.g., the partial sentences labeled in the Manifestos dataset).
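The objects in this formalization map naturally onto simple data structures. The sketch below is illustrative only: `LabelEntry`, `render_codebook`, `Classifier`, and `measure` are hypothetical names, the rendering template is an assumption rather than the prompt format in Figure 3, and the RIOT definition is a placeholder, not text from the BFRS codebook.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class LabelEntry:
    label: str       # z_j, the natural-language label
    definition: str  # C_{z_j}, the operationalization of that label


def render_codebook(entries: List[LabelEntry]) -> str:
    """C = [C_{z_1}, ..., C_{z_L}]: concatenate the per-label operationalizations."""
    return "\n\n".join(f"{e.label}: {e.definition}" for e in entries)


# A classifier maps the tuple (C, X_i) to a predicted label in Z.
Classifier = Callable[[str, str], str]


def measure(classify: Classifier, codebook: str, documents: List[str]) -> List[str]:
    """Infer one label per document X_i in the inference dataset."""
    return [classify(codebook, x) for x in documents]


ENTRIES = [
    LabelEntry(
        "TERRORISM",
        "Premeditated, politically motivated violence perpetrated against "
        "noncombatant targets by subnational groups or clandestine agents.",
    ),
    # Placeholder definition for illustration; not taken from the BFRS codebook.
    LabelEntry("RIOT", "(placeholder definition)"),
]
```

Keeping each $C_{z_{j}}$ as a separate record, rather than one long string, is what later makes it possible to drop or reorder parts of the codebook.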

In practice, there are several measurement paradigms: (a) humans can manually assign $\hat{Y}_{i}$ by reading $X_{i}$ while adhering to $C$, (b) models can infer $\hat{Y}_{i}$ zero-shot (with no training examples) for each $X_{i}$, or (c) supervised models can be trained via a training dataset of examples: $\{(C,X_{i},Y_{i}\in\mathcal{Z})\}_{i=1}^{N_{\textsc{train}}}$.
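For paradigm (c), each training tuple $(C,X_{i},Y_{i})$ has to be serialized into text before a language model can be trained on it. The following is a rough sketch of that step under our own assumptions: the prompt template, the `prompt`/`completion` field names, and the function names are hypothetical and are not taken from the pipeline evaluated in this paper.

```python
from typing import Dict, List, Sequence, Tuple


def to_instruction_example(codebook: str, text: str, gold_label: str) -> Dict[str, str]:
    """Serialize one (C, X_i, Y_i) tuple as a prompt/completion pair."""
    prompt = f"{codebook.strip()}\n\nDocument: {text.strip()}\n\nLabel:"
    return {"prompt": prompt, "completion": " " + gold_label}


def build_training_set(
    codebook: str,
    labeled: Sequence[Tuple[str, str]],
    labels: Sequence[str],
) -> List[Dict[str, str]]:
    """Convert labeled documents into training examples, rejecting unknown labels."""
    examples = []
    for text, gold in labeled:
        if gold not in labels:
            raise ValueError(f"gold label {gold!r} is not in the label set")
        examples.append(to_instruction_example(codebook, text, gold))
    return examples
```

Checking that every gold label belongs to $\mathcal{Z}$ before training guards against the model being taught a label string that the codebook never defines.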

Formally, the universal label assumption is

$$E[Y_{i}\in\mathcal{Z}\mid C,X_{i}]=E[Y_{i}\in\mathcal{Z}\mid X_{i}]$$

or, in other words, the measurement of labels is independent of their exact definition in the codebook $C$. In contrast, the codebook-construct label assumption is

$$\exists\, C,C^{\prime}\text{ s.t. }E[Y_{i}\mid\mathcal{Z},C,X_{i}]\neq E[Y_{i}\mid\mathcal{Z},C^{\prime},X_{i}]$$

or, in other words, changing the codebook from $C$ to $C^{\prime}$ could result in a different label $Y_{i}$ for the same document $X_{i}$, even when holding $\mathcal{Z}$ constant. We argue applied researchers should adopt the codebook-construct label assumption, given that their projects typically center around precisely defining the constructs that are important to their substantive research question.
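One rough way to see the difference between the two assumptions in code is to check whether a classifier's predictions move when only the codebook changes. The sketch below is a toy diagnostic, not the evaluation used in this paper: `codebook_sensitivity` and both rule-based classifiers are hypothetical, and the two codebook strings are invented for illustration.

```python
from typing import Callable, Sequence

Classifier = Callable[[str, str], str]


def codebook_sensitivity(
    classify: Classifier, codebook_a: str, codebook_b: str, documents: Sequence[str]
) -> float:
    """Share of documents whose predicted label changes when C is swapped for C'."""
    if not documents:
        raise ValueError("need at least one document")
    changed = sum(classify(codebook_a, x) != classify(codebook_b, x) for x in documents)
    return changed / len(documents)


def ignores_codebook(codebook: str, text: str) -> str:
    """Toy classifier that behaves as if labels had one universal meaning."""
    return "PROTEST" if "march" in text.lower() else "OTHER"


def follows_codebook(codebook: str, text: str) -> str:
    """Toy classifier that honors one exclusion criterion written in the codebook."""
    if "march" not in text.lower():
        return "OTHER"
    if "exclude non-violent" in codebook.lower() and "violent" not in text.lower():
        return "OTHER"
    return "PROTEST"


# Invented codebooks: same label set, different operationalization.
CODEBOOK_A = "PROTEST: Any public demonstration."
CODEBOOK_B = "PROTEST: Any public demonstration. Exclude non-violent events."
DOCS = [
    "A hundred people marched peacefully through the capital.",
    "Marchers clashed violently with police.",
]
```

With these toy inputs, `ignores_codebook` gives the same labels under both codebooks, while `follows_codebook` relabels the peaceful march once the exclusion criterion is added, which is the behavior the codebook-construct label assumption asks for.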

4 Data & codebooks

Our empirical analysis examines how well LLMs follow the constructs defined in codebooks and predict gold-standard labels from these codebooks and input texts. To do this, we collect three separate political science datasets, each of which provides document- or (quasi-) sentence-level labels according to a codebook-defined schema: the Crowd Counting Consortium (CCC) dataset on protests in the United States (Crowd Counting Consortium 2024), the BFRS dataset on violence in Pakistan (Bueno de Mesquita et al. 2015), and the Manifesto Project corpus (Lehmann et al. 2017). We obtain their original codebooks and the text that the data was coded from (see Section 4.1 for additional details).

We focus on single-label, multi-class classification: for each document, we seek to recover a single gold-standard label. For the Manifestos dataset, this label is the Manifesto Project policy/ideology label for a (quasi) sentence in a political party’s manifesto. The BFRS and CCC datasets each code a set of variables for events from news articles, including information on the actors and motivations involved in each event. We focus solely on the event categories provided in each dataset.

We chose these three datasets because they represent a range of political science topics: political violence, contentious politics in the United States, and comparative party politics. The datasets also reflect wide variation in the number of classes and the length of input text (Table 1). We also hypothesized that the partial overlap between some of the protest-related categories in BFRS and CCC would let us test whether an LLM can draw codebook-specific distinctions between similar categories. In contrast to many social-science-related tasks in NLP, each of these datasets was constructed by social scientists with a focus on measuring specific political concepts and phenomena. This provides a point of departure from previous NLP work on reading comprehension and previous work on LLM-based annotation, in that we focus on categories that are carefully defined in the context of a specific research question, not “universal” labels that the LLM would have been trained on. How well LLMs perform on these three codebooks and datasets is thus interesting both for applied social science researchers, who would like to use LLMs to annotate text according to carefully specified instructions, and for the NLP community, in providing much more difficult tasks than previous benchmarks.

Dataset	CCC	BFRS	Manifestos
Num. classes	8	12	142
Codebook median toks.	370	375	291
Input text median toks.	437	28	16
Num. train instances	4,710	20,978	8,081
Num. dev. instances	1,009	4,495	1,732
Num. test instances	1,010	4,496	1,732

Table 1: Descriptive statistics for the datasets used in our experiments. Codebook length includes label definitions and positive clarifications. Tokens are whitespace-delimited words. The Manifesto input texts are quasi-sentences, not full manifestos.
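The token counts and split sizes in Table 1 are simple to reproduce. The sketch below shows whitespace token counting and computes each split's share of the instances; the function names are our own for illustration, and the instance counts are copied from Table 1. The shares work out to roughly 70/15/15 for all three datasets, an observation derived from the table rather than a design choice stated in the text.

```python
from statistics import median
from typing import Iterable, Tuple


def whitespace_token_count(text: str) -> int:
    """Count tokens as whitespace-delimited words."""
    return len(text.split())


def median_token_count(texts: Iterable[str]) -> float:
    """Median whitespace token count over a collection of texts."""
    return median(whitespace_token_count(t) for t in texts)


def split_shares(train: int, dev: int, test: int) -> Tuple[float, float, float]:
    """Fraction of instances in each split, rounded to two decimals."""
    total = train + dev + test
    return (round(train / total, 2), round(dev / total, 2), round(test / total, 2))


# (train, dev, test) instance counts copied from Table 1.
SPLITS = {
    "CCC": (4710, 1009, 1010),
    "BFRS": (20978, 4495, 4496),
    "Manifestos": (8081, 1732, 1732),
}
```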

4.1 Dataset preprocessing

We make several changes to the original datasets as part of our preprocessing pipeline. The original news stories used to code the BFRS dataset, from the English-language Dawn newspaper in Pakistan, were not obtainable. Instead, we rely on the short, free-text English-language summary of the event written by annotators that is provided in the structured BFRS dataset. In most cases, using the hand-written summary makes the task easier, as the summary focuses on the main event corresponding to the gold-standard label.³

³ In some cases, the summary omits details that are necessary for determining the true label. However, in Section 5.4 and Table 7, we find this is not a major concern in practice; only 4% of our BFRS sample lacked the necessary context.

The CCC dataset provides hyperlinks to the sources used to code each event.⁴ We obtain the original English-language source text using a scraper provided by Halterman and Radford (2021). To make the dataset comparable to the other two datasets, we limit the corpus to events that have (1) a single event type, and (2) a single news source. Omitting stories with multiple event types yields a single-label classification task that matches the other two datasets, while limiting the corpus to events with a single source avoids the need for the LLM to perform a challenging cross-document inference task. We also omit events where the original text is not obtainable via the scraper.

⁴ https://github.com/nonviolent-action-lab/crowd-counting-consortium
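The three filtering rules above amount to a short predicate over event records. This is a minimal sketch assuming a hypothetical record layout: the `event_types` and `source_urls` field names and the `scraped` URL-to-text mapping are our own inventions for illustration, not the actual CCC column names or the scraper's output format.

```python
from typing import Dict, List, Optional


def filter_events(
    events: List[dict], scraped: Dict[str, Optional[str]]
) -> List[Dict[str, str]]:
    """Keep events with one event type, one source, and retrievable source text."""
    kept = []
    for event in events:
        # Rules (1) and (2): exactly one event type and exactly one news source.
        if len(event["event_types"]) != 1 or len(event["source_urls"]) != 1:
            continue
        # Rule (3): drop events whose source text could not be scraped.
        text = scraped.get(event["source_urls"][0]) or ""
        if not text.strip():
            continue
        kept.append({"text": text, "label": event["event_types"][0]})
    return kept
```

Each surviving record pairs one document with one label, which is the single-label, single-document setup described above.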

We obtain the labeled Manifesto quasi-sentences from the Manifesto Project through their API.⁵ In total, the Manifesto Project provides more than 1 million labeled quasi-sentences across several dozen languages. The Manifesto Project employs a large team of extensively trained native language speakers to annotate each quasi-sentence in each manifesto (Merz, Regel and Lewandowski 2016). For this work, we use a sample of 13 manifestos from English-speaking countries.

⁵ https://manifesto-project.wzb.eu/information/documents/api

BFRS and CCC both label standalone, complete news stories, while Manifesto passages are often sub-sentence fragments (quasi-sentences). A preliminary manual analysis of LLM errors (see Section 5.4) revealed that identifying the correct label for some Manifesto passages was impossible without context from the preceding passages. Thus, in our empirical pipeline, we input the Manifestos’ target quasi-sentence (which is labeled) as well as the quasi-sentence just prior to the target one as context.
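The context construction for Manifesto inputs can be sketched as follows. This is an illustration under our own assumptions: the input list is taken to hold the quasi-sentences of a single manifesto in document order, the first quasi-sentence is given no context, and the `Context:`/`Target:` template and the function names are hypothetical.

```python
from typing import List, Tuple


def with_previous_context(quasi_sentences: List[str]) -> List[Tuple[str, str]]:
    """Pair each target quasi-sentence with the one just before it ('' for the first)."""
    return [
        (quasi_sentences[i - 1] if i > 0 else "", target)
        for i, target in enumerate(quasi_sentences)
    ]


def format_input(context: str, target: str) -> str:
    """Render the context and the labeled target as one input text."""
    if context:
        return f"Context: {context}\nTarget: {target}"
    return f"Target: {target}"
```

Only the target carries a gold label; the preceding quasi-sentence is supplied purely as context.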

4.2 Contamination insulation

Contamination of evaluation datasets, which occurs when an evaluation dataset is part of the LLM’s training data and thus does not adequately represent generalization performance on an unseen set of examples, is a major concern in NLP research (Chang et al. 2023; Li and Flanigan 2024; Balloccu et al. 2024; Golchin and Surdeanu 2024, inter alia). We believe that our datasets are insulated from contamination, that is, they are unlikely to appear in their complete form in LLM pre-training or instruction-tuning data. The BFRS dataset is not available online and was provided directly to us by the authors of the dataset. The structured CCC dataset includes hyperlinks to the sources used, but not the raw text. Instead, we scrape the original news stories, where available, and combine them with the structured data. The Manifesto corpus is the most at risk of training contamination, though we note that it has the lowest performance (see below). The original manifestos and the Manifesto Project codebook (Version 2020a from July 22, 2020) are available on the internet and likely to be present in the LLM pre-training data, but the sentence-level annotations are provided through an API that requires registration.

4.3 Codebook to LLM prompt

Figure 3: Example prompt from the BFRS dataset.

Figure 4: Zero-shot Mistral generative output given the BFRS prompt in Figure 3.

Adhering to the codebook-construct label assumption, we were careful and deliberate with how we formatted the codebook, $C=[C_{z_{1}},C_{z_{2}},\dots,C_{z_{L}}]$ , for input into the LLM. We re-structure the original codebooks for three reasons. First, there is some evidence from previous instruction-tuning research that explicitly providing a definition, positive example, and negative example can improve zero shot performance on a diverse set of NLP tasks (Wang et al. 2022). By reformatting the codebooks, we hope to guide the LLM in the zero shot setup. Second, by separating parts of each codebook definition into components, we can experimentally isolate and ablate components to evaluate changes in performance. Finally, we hypothesize that formatting all codebooks in a consistent format could improve cross-codebook performance when we train an LLM on many codebooks.

Our new semi-structured codebook format. To answer RQ2b about the impact of each codebook component on classification accuracy and to answer RQ3 about instruction-tuning performance, we propose a new generalizable and consistent format for codebooks that is both human- and machine-readable. Converging on a consistent format for future codebooks developed by social scientists will make it possible to instruction-tune on a large number of codebooks (and hopefully thus result in better shared model performance across tasks). Our format consists of the following codebook components (see Figure 3):

1. Label: The exact label that the LLM should return. We write the label in capital letters after early ad-hoc experiments showed that lower-case labels yielded more hallucinated or inconsistent labels generated by the LLM.

2. Label Definition. The inclusion of a label definition is necessary under the codebook-construct label assumption. We provide a succinct definition of the category, which is generally a single sentence.

3. Clarification & Negative Clarification. Most existing work on LLMs for social science document labeling includes labels alone or a single-sentence definition of each label. For example, Atreja et al. (2024) use a one-sentence definition of the “toxicity” class in the prompt, “A rude, disrespectful, or unreasonable comment that is likely to make individuals leave a discussion.” However, many social science concepts require further explanation, including details on what falls outside the category. For example, the Manifesto Project category for welfare_positive, which includes calls or support for maintaining/expanding social service provision, specifically excludes education spending, which is coded under a separate category (education_positive). Similarly, the BFRS category for riot does not apply if one of the groups involved is a police force.

4. Positive & Negative Examples. “Few shot” or “in-context learning”, where an LLM is provided example input–output pairs in the prompt, is a well established technique for improving LLM performance on new tasks (Brown et al. 2020). We include both positive and negative examples in the codebook, along with explanations for the correct or incorrect coding. [Footnote 6: We add positive and negative examples only for BFRS: the length of the CCC news stories and the large number of categories for Manifestos mean that including examples would increase the prompts for these two datasets by around two thousand words.]

An excerpt of the restructured BFRS codebook is shown in Figure 3, and a full prompt with Mistral special tokens is shown in Appendix B.
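The four components listed above map naturally onto a small data structure. The following is a minimal sketch, assuming one record per label and a plain-text rendering; the class name `CodebookEntry`, the function names, the field names, and the placeholder definition are all hypothetical and are not taken from our released format. Only the riot exclusion is drawn from the BFRS discussion above. The `exclude` argument illustrates how separating components makes it possible to drop them one at a time.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


@dataclass
class CodebookEntry:
    """One class in a semi-structured codebook (field names are illustrative)."""
    label: str                       # exact string the LLM should return
    definition: str = ""
    clarification: str = ""
    negative_clarification: str = ""
    positive_examples: List[str] = field(default_factory=list)
    negative_examples: List[str] = field(default_factory=list)


def render_entry(entry: CodebookEntry, exclude: Optional[Set[str]] = None) -> str:
    """Render one entry as prompt text, skipping empty or excluded components."""
    exclude = exclude or set()
    lines = [f"Label: {entry.label.upper()}"]  # the label itself is always kept
    single = [
        ("definition", "Label Definition", entry.definition),
        ("clarification", "Clarification", entry.clarification),
        ("negative_clarification", "Negative Clarification",
         entry.negative_clarification),
    ]
    for key, title, text in single:
        if text and key not in exclude:
            lines.append(f"{title}: {text}")
    multi = [
        ("positive_examples", "Positive Example", entry.positive_examples),
        ("negative_examples", "Negative Example", entry.negative_examples),
    ]
    for key, title, examples in multi:
        if key not in exclude:
            lines.extend(f"{title}: {example}" for example in examples)
    return "\n".join(lines)


def render_codebook(entries: Iterable[CodebookEntry],
                    exclude: Optional[Set[str]] = None) -> str:
    """Join all rendered entries with blank lines between classes."""
    return "\n\n".join(render_entry(entry, exclude) for entry in entries)


# Hypothetical entry: the definition is a placeholder, not the BFRS wording.
riot = CodebookEntry(
    label="riot",
    definition="(placeholder) A violent clash between groups.",
    negative_clarification=(
        "Does not apply if one of the groups involved is a police force."
    ),
)
print(render_entry(riot))
print(render_entry(riot, exclude={"negative_clarification"}))
```

Keeping each component in its own field means the same records can produce both the full prompt and any reduced variant without editing the codebook text by hand.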

5 Empirical analysis

We evaluate an LLM’s performance on our codebook task for the BFRS, CCC, and Manifestos datasets across two primary experimental conditions: zero-shot and instruction-tuning. We use one of the best “local” base models available at the time we began our experiments, Mistral 7B Instruct (Jiang et al. 2023).

In a “zero-shot” approach, the LLM is not given any training examples, only a task description consisting of the entire codebook and the documents to be labeled, $\{C,X_{i}\}_{i=1}^{N_{\textsc{inf}}}$. The LLM’s task is to provide the label from the codebook that best matches the document provided, $\hat{Y}_{i}\in\mathcal{Z}|X_{i}$. [Footnote 7: In preliminary experiments, the earlier LLM we used had a context window shorter than our codebook+text input, so we posed the task as a set of binary classification tasks: given a document and a single label and definition from the codebook, infer whether the provided label applies to the document. However, this was computationally expensive, and rapid improvements in LLM context window size now make it possible to provide the entire codebook at once.] “Instruction-tuning” refers to updating the LLM’s weights via a language model loss function and gold-standard training labels (the natural-language word or phrase that is the label). In our instruction-tuning condition, we further train Mistral 7B Instruct on training examples, $\{C,X_{i},Y_{i}\in\mathcal{Z}\}_{i=1}^{N_{\textsc{train}}}$, drawn from each original project’s dataset and codebook. For computational efficiency, we employ QLoRA (Dettmers et al. 2024), a method that combines quantization (lower-precision weights) with low-rank updates to the weight matrices. Our instruction fine-tuning set-up with Mistral closely matches the set-up described in Jindal, Rajpoot and Parikh (2024), which won the 2023 NeurIPS LLM Efficiency Challenge for instruction fine-tuning an LLM on a single GPU in 24 hours.

At inference time, the generative-language output of the LLM is parsed via deterministic pattern matching to extract a label. We select $\hat{Y}_{i}$ as the first sequence of tokens in the LLM output that matches a label in $\mathcal{Z}$. [Footnote 8: Other work has attempted to constrain the output vocabulary of the LLM to only consist of $\mathcal{Z}$; however, in preliminary experiments we found this decreased performance.]
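The first-match rule above can be sketched as follows. This is an illustrative implementation and not necessarily the parser used in our pipeline: the function name `parse_label` and the word-boundary handling are assumptions. In particular, this sketch refuses to match a label that is embedded in a longer token, so an output such as COUNTER_PROTEST is not read as the label PROTEST.

```python
import re
from typing import Iterable, Optional


def parse_label(output: str, labels: Iterable[str]) -> Optional[str]:
    """Return the valid label whose match starts earliest in the output.

    Returns None when no valid label appears. If two labels match at the
    same position (e.g. a multi-word label and its first word), the longer
    label wins.
    """
    best = None  # tuple of (start position, negative label length, label)
    for label in labels:
        # Lookarounds keep a label from matching inside a longer identifier.
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(label) + r"(?![A-Za-z0-9_])"
        match = re.search(pattern, output)
        if match:
            candidate = (match.start(), -len(label), label)
            if best is None or candidate < best:
                best = candidate
    return best[2] if best else None


# Hypothetical label set and outputs, for illustration only.
labels = ["PROTEST", "RALLY", "MARCH", "DEMONSTRATION"]
print(parse_label("RALLY. The text describes a rally, not a MARCH.", labels))
print(parse_label("COUNTER_PROTEST", labels))
```

Returning `None` for outputs without a valid label keeps hallucinated labels visible as failures instead of silently mapping them onto a valid class.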

5.1 LLM choice: Mistral

Since the landscape and performance of LLMs are constantly changing, we fix the base LLM and compare conditions within a single LLM rather than across multiple LLMs. This helps isolate our research questions, which focus on codebook component ablations and on comparing zero-shot and instruction-tuning conditions.

Constraint: Local LLMs. We intentionally choose a “local” LLM, an LLM whose weights can be downloaded to a local machine and updated. Others have emphasized the importance of local models for scientific research and how local models are reproducible in a way that “closed”, API-based models which are constantly changing and do not have version control (e.g., ChatGPT) are not (Rogers 2023; Spirling 2023; Palmer, Smith and Spirling 2024). We note that local models are still not “open” models, in that local model developers may still withhold details about the model’s training data or training conditions (Groeneveld et al. 2024). Nevertheless, we believe the performance advantage of “open weight” models over fully open source models is worth the lack of full information about training details.

Mistral. We use Mistral 7B Instruct v0.2 for both the zero-shot and instruction-tuning conditions (Jiang et al. 2023). Mistral is available for researchers to download and can be run on a commodity GPU, fulfilling the requirements we set for reproducibility and model access. At the time we began running our experiments, Mistral 7B Instruct was considered one of the “best” local LLMs available. It outperformed both Llama2-7b and Llama2-13b on most important benchmarks, including MMLU, HellaSwag, and Winograd and outperformed all other 7b models on MT-Bench (Jiang et al. 2023). [Footnote 9: We experimented with Llama 2 and Phi-3 earlier in the project but had preliminary evidence that they were less accurate than Mistral.]

Note, we still have limited information on the nature of the training data in Mistral 7B Instruct. The most detailed description from Jiang et al. (2023) is “we fine-tuned [Mistral 7B] on instruction datasets publicly available on the Hugging Face repository. No proprietary data or training tricks were utilized […]”

5.2 Zero-shot analysis

Our zero-shot experiments seek to address RQs 1 and 2: how well does an off-the-shelf LLM comply with codebook instructions when labeling documents, and which components of the codebook are helpful for improving LLM performance? For our first experimental condition, we used the codebooks as they were originally written. The only change we made was to prepend “Label:” to each label in the codebook to indicate what the LLM should predict in its output.

We compare this performance to our new codebook format, described in Section 4.3. Table 2 shows the performance of Mistral on the development set of each of our three datasets, under both the original and re-written codebook conditions. All F1 scores shown are computed per-class and weighted by the prevalence of each class in the development set. We find modest gains from our proposed codebook restructuring for BFRS and CCC, but a decline in performance on the Manifestos dataset.

Dataset	Codebook format	Dev F1
BFRS	Original	0.49
BFRS	Ours	0.54
CCC	Original	0.61
CCC	Ours	0.66
Manifestos	Original	0.22
Manifestos	Ours	0.14

Table 2: Original versus new codebook format. Micro F1 scores for zero-shot Mistral predictions on each development dataset with changes in the codebook format.
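A per-class F1 score weighted by class prevalence, as described in the text above, can be sketched in a few lines. This is a minimal illustration and not our evaluation code: the function names and the toy labels are hypothetical, and it assumes exactly one gold label and one prediction per document.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_f1(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """F1 for every class that appears in the gold labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class receives a false positive
            fn[g] += 1  # gold class receives a false negative
    scores = {}
    for c in set(gold):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def weighted_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Average of per-class F1, weighted by each class's share of gold labels."""
    support = Counter(gold)
    scores = per_class_f1(gold, pred)
    return sum(support[c] * scores[c] for c in support) / len(gold)


# Toy data: one of two A documents is mislabeled as B.
gold = ["A", "A", "B", "C"]
pred = ["A", "B", "B", "C"]
print(per_class_f1(gold, pred))
print(weighted_f1(gold, pred))
```

An output with no parseable label can be passed as any sentinel string that is not a class name; it then counts as a false negative for the gold class and carries no weight of its own.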

5.2.1 Zero shot ablation results

Next, we conduct an ablation experiment on BFRS to examine how dropping components of the codebook affects the LLM’s classification accuracy. We choose BFRS for this analysis because the original codebook was the most similar to our proposed new format and included some positive and negative examples already, making it straightforward to convert the old format into a complete new format. The ablation consists of dropping components of the codebook (e.g., “Positive Example” and “Negative Example”) and re-running the zero shot classification pipeline on the complete development set. Table 3 shows the results of six ablation experiments. It shows that dropping Positive Example, Negative Example, Clarification, and Negative Clarification each reduce classification performance modestly. Remarkably, ablating all components of the codebook aside from the label produces an F1 score similar to the full codebook results (0.51 vs. 0.54). On the one hand, it is encouraging that the codebook-construct label assumption, which we argue is the right assumption conceptually for social science measurement, modestly helps performance (rather than hurts it). On the other hand, this result seems to suggest the LLM is not attending to the codebook definitions of labels in the prompt; we elaborate on this discussion in Section 6.

Excluded Codebook Components
Label Definition	Output Reminder	Positive Example	Negative Example	Clarification	Negative Clarification	Dev F1
						0.54
	✗					0.54
		✗	✗			0.46
		✗	✗	✗	✗	0.46
	✗	✗	✗	✗	✗	0.47
✗	✗	✗	✗	✗	✗	0.51*

Table 3: Codebook ablations. On the BFRS development dataset, zero-shot Mistral predictions and micro F1 scores with codebook/prompt ablations. *Dropping all codebook components besides the label is equivalent to the universal label assumption.

5.3 Instruction-tuning

After conducting our zero shot experiments, we then turn to instruction-tuning to answer RQ3 about whether LLMs can improve their performance on codebook-based classification with additional training.

We follow the definition of instruction-tuning in which an LLM’s weights are further updated on supervised input/output pairs via a language model objective function (Wei et al. 2022; Sanh et al. 2022; Wang et al. 2022; Longpre et al. 2023; Wang et al. 2023). This is also called “supervised learning using demonstrations” (Wang et al. 2023). Note that instruction-tuning is different from the previous NLP paradigm of fine-tuning BERT (Devlin et al. 2019) or BERT-like models with a supervised classification head (logistic regression) on top of a base encoder model. Instead, for instruction-tuning, the supervised input is formatted as a “natural language instruction” and the output is natural-language tokens from a generative model.

According to Wang et al. (2023), “Instruction datasets targeted at specific domains and/or capabilities are extremely effective at improving model performance in those aspects.” If this statement holds true, it lends support to the hypothesis that instruction-tuning on many social science codebook datasets will also lead to improved performance.

5.3.1 Instruction-tuning loss function

For our instruction-tuning set-up, a training dataset—BFRS, CCC, or Manifestos—consists of $D$ tuples $\{(m_{i},y_{i})\}_{i=1}^{D}$ where $m_{i}$ is the entire “prompt” (consisting of the concatenation of $C$ and $X_{i}$ as well as any other instructions) and $y_{i}$ is the desired model output (the natural-language word or phrase for the concept/label).

When instruction fine-tuning on these input-output pairs, a model’s language modeling loss function can either be computed from both the input and output or from the output only. Some research shows that computing the loss on both input and output improves performance (Shi et al. 2024), while others compute the loss on the output only (Wang et al. 2023). We opt for the latter: we mask the loss for the input and only update weights via the loss computed on the output. Doing so speeds up training, and we hypothesize that it may improve cross-domain generalizability. For each example $i=1,2,\ldots,D$, let $N_{m_{i}}$ be the total number of tokens for the example’s prompt $m_{i}$ and $N_{y_{i}}$ be the total number of tokens for the corresponding output $y_{i}$. Then the negative log-likelihood loss function (masking inputs) is:

$$L_{\theta}\equiv-\sum_{i=1}^{D}\sum_{j\in[N_{m_{i}},N_{y_{i}}]}\log p_{\theta}(t_{i,j}\mid t_{i,<j})\times\begin{cases}0&\text{if }j\in N_{m_{i}}\\ 1&\text{otherwise}\end{cases}\tag{1}$$

where $t_{i,j}$ is the gold-standard $j$th token in either the input $m_{i}$ or output $y_{i}$. [Footnote 10: Unlike Wang et al. (2023), our training does not include teacher forcing, when a model generating a sequence of tokens conditions on the gold-standard tokens instead of the token generated by the model from the previous time steps. Teacher forcing has been shown to be important on long generative texts, but our desired gold-standard labels are just a few tokens long and an error on the first generated token means the LLM has already mis-predicted the label.] The model weights $\theta$ are then updated to minimize this loss function via the QLoRA techniques that we describe in the next section.
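To make the input masking in Equation 1 concrete, the sketch below computes the loss for one example from a list of per-token log-probabilities. It is a toy illustration under stated assumptions: the function names are hypothetical, the log-probabilities are made up, and a real training loop would obtain them from the model and apply the mask inside the training framework.

```python
import math
from typing import Iterable, Sequence, Tuple


def masked_nll(token_logprobs: Sequence[float], n_prompt_tokens: int) -> float:
    """Negative log-likelihood summed over output tokens only.

    token_logprobs[j] is log p(t_j | t_<j) for the j-th token of the
    concatenated prompt and gold output. Prompt positions get a mask of 0,
    so they contribute nothing to the loss.
    """
    loss = 0.0
    for j, logp in enumerate(token_logprobs):
        mask = 0.0 if j < n_prompt_tokens else 1.0
        loss -= mask * logp
    return loss


def dataset_loss(examples: Iterable[Tuple[Sequence[float], int]]) -> float:
    """Sum the masked loss over (token_logprobs, n_prompt_tokens) pairs."""
    return sum(masked_nll(logprobs, n) for logprobs, n in examples)


# Toy example: 3 prompt tokens followed by a 2-token gold label.
logprobs = [math.log(0.1), math.log(0.2), math.log(0.3),
            math.log(0.5), math.log(0.25)]
loss = masked_nll(logprobs, n_prompt_tokens=3)
print(round(loss, 4))  # -(ln 0.5 + ln 0.25) = ln 8, about 2.0794
```

Because the three prompt positions are masked, however poorly the model predicts the codebook text has no effect on the gradient; only the two label tokens matter.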

5.3.2 QLoRA training: quantization + low-rank adapters

To conduct the instruction fine-tuning, we use the quantized low-rank adaptation (QLoRA) technique (Hu et al. 2022; Dettmers et al. 2024), similar to the model that won the 2023 NeurIPS LLM Efficiency Challenge for instruction fine-tuning an LLM (Jindal, Rajpoot and Parikh 2024). Low-rank adaptation is an efficient technique for fine-tuning LLMs. Rather than fine-tuning all weights in the LLM, LoRA learns a low-rank update to the weights. [Footnote 11: That is, rather than updating the entire weight matrix $W\in\mathbb{R}^{d\times d}$, LoRA keeps $W$ frozen and learns an additive update formed as the product of a $d\times r$ matrix and an $r\times d$ matrix, which together have significantly fewer weights than the full weight matrix.] To allow training to fit on a commodity GPU, we apply LoRA to a 4-bit quantized version of Mistral.

We experiment with two sets of hyperparameters in fitting LoRA. First, we vary the rank of the low-rank weight matrices. Lower rank approximations are more efficient and involve tuning fewer parameters, while higher rank LoRAs allow for more flexibility, but at greater computational cost and risk of overfitting. We also experiment with which components of the weights we fine-tune. The original LoRA paper tunes only the query and value components of the attention (Hu et al. 2022). However, recent research has shown that tuning all weights in the LLM can improve performance over only tuning part of the attention weights, albeit at increased computational cost (Raschka 2024). Thus, we experiment with two sets of parameters: the “simple” case of query and value weights, and all weights. LoRA also includes a hyperparameter $\alpha$ , which controls how much weight to place on the LoRA parameters when combining them with the original weights. We do not tune this parameter, setting it to the same value as the rank.
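The trade-off between rank and parameter count, and the role of $\alpha$, can be illustrated with a small pure-Python sketch. It follows the standard LoRA formulation of Hu et al. (2022), in which the low-rank update is scaled by $\alpha/r$; the function names and the dimension 4096 are illustrative assumptions, and this is not our training code.

```python
from typing import List

Matrix = List[List[float]]


def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in a dense d_out x d_in matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights in the two low-rank factors B (d_out x r), A (r x d_in)."""
    return rank * (d_out + d_in)


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update before it is added to W."""
    return alpha / rank


def lora_effective_weight(W: Matrix, B: Matrix, A: Matrix, alpha: float) -> Matrix:
    """Compute W + (alpha / r) * B @ A with plain nested lists."""
    rank = len(A)
    scale = lora_scaling(alpha, rank)
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]


d = 4096  # illustrative layer width
for r in (8, 16):
    share = lora_params(d, d, r) / full_params(d, d)
    print(f"rank={r}: {lora_params(d, d, r)} trainable weights ({share:.4%} of full)")

# Setting alpha equal to the rank gives a scaling factor of 1.
print(lora_scaling(alpha=16, rank=16))
```

Doubling the rank doubles the number of trainable weights per adapted matrix, and choosing "all" modules over "simple" multiplies that count by the number of additional matrices that receive adapters.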

5.3.3 Results

Instruction-tuning Mistral provides mixed benefits when evaluated within-codebook (Table 4). It provides very large improvements for BFRS (up to +0.22 F1 over zero-shot), modest benefits for CCC, and no benefits for Manifestos. The variability of performance across LoRA modules and rank suggests that further hyperparameter tuning could improve performance. We discuss this possibility in Section 7.

Cross-codebook and mixed codebook training performance, that is, models evaluated on unseen codebooks or trained on multiple codebooks, provide no or negative benefit (Table 5). We hypothesized (RQ3b) that instruction-tuned LLMs trained on the codebook classification task, especially those trained on multiple codebooks, would show greater classification performance than zero-shot performance. Table 5 shows that this is not the case. Cross-codebook performance is no better, and often worse, than zero shot performance. We elaborate on hypotheses for these results in Section 6.

Both the zero-shot and instruction-tuned results on the development set are very similar to the results on the test set, the latter of which are still significantly better than the baseline of predicting the majority class (Table 6).

Dataset	LoRA modules	LoRA rank	Dev F1	$\Delta$ vs. zero shot
BFRS	all	8	0.71	+0.17
	simple	8	0.67	+0.13
	all	16	0.76	+0.22
CCC	all	8	0.62	-0.04
	simple	8	0.70	+0.04
	all	16	0.56	-0.10
Manifestos	all	8	0.08	-0.06
	simple	8	0.06	-0.06
	all	16	0.14	+0.0

Table 4: In-codebook instruction-tuned Mistral predictions and micro F1 scores on each development set with LoRA hyperparameters. Here, “simple” means only the query and value parameters are updated with low-rank matrices whereas “all” means the linear layers are updated with low-rank matrices as well.

Training Data	Dev data	LoRA modules	LoRA rank	Dev F1	$\Delta$ vs. zero shot
CCC	BFRS	all	8	0.48	-0.06
		simple	8	0.47	-0.06
		all	16	0.44	-0.10
BFRS	CCC	all	8	0.66	0.0
		simple	8	0.66	0.0
		all	16	0.66	0.0
BFRS + CCC	CCC	all	16	0.57	-0.09
BFRS + CCC + Manifestos	CCC	all	16	0.58	-0.08

Table 5: Cross-codebook instruction-tuned Mistral results and micro F1 scores on each development set with LoRA hyperparameters.

Dataset	Baseline: Majority Class	Zero-shot	Instruction-tuned
BFRS	0.16	0.53	0.76
CCC	0.51	0.68	0.70
Manifestos	0.03	0.14	0.14

Table 6: Test set micro F1 results of best zero-shot and instruction-tuned models. Zero-shot models include all codebook components. We use the in-domain instruction-tuned models that performed best on the development set.

5.4 Manual analysis of zero-shot generative outputs

The zero-shot results above reveal that Mistral struggles to correctly classify documents using codebook instructions, but do not illuminate why and how it fails. To better understand our quantitative results, we sampled our LLM’s zero-shot generative outputs given the BFRS, CCC, and Manifestos development sets as inputs. Then we (the two authors) manually analyzed the model inputs, model outputs, and gold-standard labels. Although we recognize that explanations generated by an LLM are not necessarily always logically consistent or faithful to their internal representations and predictions (Ye and Durrett 2022; Turpin et al. 2024; Zhao et al. 2024; Zhu et al. 2024; Lyu, Apidianaki and Callison-Burch 2024) [Footnote 12: “A faithful interpretation is one that accurately represents the reasoning process behind the model’s prediction” (Jacovi and Goldberg 2020).], we found it insightful to manually inspect these explanations in order to hypothesize whether and how LLMs are actually “reading” codebooks.

After one exploratory prototype round [Footnote 13: Our pilot annotation round sampled roughly 50 errors on the Manifestos development set.], we developed the following six mutually exclusive categories, which we subsequently used to categorize the BFRS, CCC, and Manifestos samples:

A. LLM correct. We assigned this category if the LLM generative output (including the LLM explanation) and the parsed label from the output were correct.

B. Incorrect gold standard. We assigned this category if, after inspection of the text, gold-standard label (from the original annotators), and codebook, we disagreed with the gold-standard label.

C. Document error. We assigned this category if we determined something was wrong with the text itself. For example, sometimes scraping errors meant part of the text was missing or the chunk of text given as input needed more context (e.g., the Manifestos’ sub-sentences required more context than we provided).

D. LLM non-compliance. We assigned this category if the LLM did not comply with the output reminder in the prompt: “Write the name of the label that fits best, with no other text. For example, […]” We borrow the compliance/non-compliance language from Atreja et al. (2024), who also analyzed LLM non-compliance. In our work, this category includes instances for which an LLM hallucinated a label or the LLM had more than one label in its output. We did not assign this category when the LLM output explanations despite the prompt specifying “no other text”, since Mistral almost always included an explanation.

E. LLM semantics/reasoning mistake. We assigned this category if the LLM was compliant with the output reminder (i.e., output only a single label from the valid label set), but the label and/or explanation were incorrect. Almost always, the mistaken label was due to semantics, reasoning, and/or world knowledge. This was one of the most important categories for us, as it reveals significant weaknesses in LLMs complying with the codebook-construct label assumption.

F. Other. This was our catch-all category for any other manual analyses that did not fit within the categories above. Examples include sufficient ambiguity in the codebook descriptions of the classes; a parsing error on the model output [Footnote 14: For example, on a CCC instance the model output was “COUNTER_PROTEST” and our deterministic parser then marked “PROTEST”, but “PROTEST” was still the incorrect label.]; ambiguity in the text itself; or ambiguity in the label precedence [Footnote 15: For example, in CCC does march take precedence over rally?].

	BFRS	CCC	Manifestos
Sample: Num. Examples	24	50	62
Sample: Num. of Unique Classes	12	5	50
A. LLM correct	0.38	0.48	0.11
B. Incorrect gold standard	0.04	0.10	0.08
C. Document error (scraping/lack context)	0.04	0.02	0.03
D. LLM non-compliance	0.00	0.02	0.45
E. LLM semantics/reasoning mistake	0.50	0.26	0.29
F. Other	0.04	0.10	0.02

Table 7: Manual error analysis on a sample of each of the zero-shot development set predictions for each dataset. For each dataset, we report the number of examples and the number of unique classes in that sample as well as the proportion in categories A-F as judged by the authors.

5.4.1 Manual analysis results

We present the aggregated results of our manual analysis in Table 7. In Table 9, we provide some “lemon-picked” examples of LLM failures for categories D and E (above). [Footnote 16: The NLP community has started to use “lemon-picked” to mean selecting choice negative examples, in contrast to “cherry-picking” good examples.]

LLM compliance with prompt instructions related to allowable labels was excellent for BFRS (0% non-compliance) and CCC (2% non-compliance). However, for Manifestos, 45% of the sampled LLM outputs were non-compliant. These included errors such as outputting multiple labels or more egregious errors such as hallucinating labels. For example, in the second Manifestos example in Table 9, the LLM takes the noun phrase from the input text “immediate permanent residency to new spouse” and hallucinates the label immediate permanent residency for new spouses positive, which does not exist in the codebook. [Footnote 17: We experimented earlier in the project with constrained generation, biasing the LLM’s token probabilities to tokens that appear in the prompt, to reduce label hallucination. We found very slight gains on Manifestos and no gains on CCC and BFRS, and leave a deeper exploration of this technique for future work.]

We have some preliminary evidence that the LLM is relying heavily on the universal label assumption and not actually reading the codebooks. For the Manifestos-1 example in Table 9, the LLM predicts “WELFARE_POSITIVE (or EDUCATION_POSITIVE)”. However, an explicit constraint in the Manifesto’s codebook for the welfare positive label is “This category excludes education.” Ideally, the LLM would comply with this constraint. In BFRS-1 in Table 9, the LLM does not follow the explicit codebook instruction that an attack on state must be “conducted by the armed forces of another state.” In its explanation, it seems to correctly identify the Taliban as a “non-state entity” but does not make this reasoning jump.

We have some evidence that Mistral is using lexical overlap heuristics, that is, selecting label words that appear in the text even if the label is incorrect. Other work has documented LLMs using this heuristic (Levy, Ravfogel and Goldberg 2023), and overreliance on heuristics (also called short-cuts or spurious correlations in the literature) in other NLP models besides LLMs has been a major concern of past work in NLP (Gururangan et al. 2018; McCoy, Pavlick and Linzen 2019; Gardner et al. 2021, inter alia). In our empirical results, lexical overlap heuristics occur both in correct and incorrect predictions. For example, in CCC-1 in Table 9, the word “rally” occurs in the first sentence of the text and the LLM predicts the rally label even though the text clearly aligns with the codebook definition of a demonstration: “A crowd gathering to demonstrate their support for a set of political demands or claims to an external audience, typically with things like signs and banners and flags displayed to passers-by in a public space.” In our manual analysis, we found that several of the LLM’s correct predictions on Manifestos came from heavy lexical overlap between the input text and label. For example, the LLM correctly predicted the label PEACE for the text “Plaid Cymru supports peaceful negotiations over destructive warfare” and correctly predicted the label political corruption for the text “The EFF Government will end government corruption and incompetence, which currently cost the State more than R30 billion per year.”

Original annotation and codebook quality remains a concern; we (the authors) disagreed with the gold-standard label in 4%, 10% and 8% of instances in our samples of BFRS, CCC, and Manifestos respectively. We also identified some mismatches between LLM predictions and gold-standard labels that result from ambiguity in the codebook, especially in the CCC dataset. For instance, an event that involves a march near the target of the action could be considered either a protest (a crowd gathering near the target of their claims) or a march. In practice, the march label seems to take precedence over rally, protest, or demonstration, but the codebook is not explicitly clear on this point. [Footnote 18: In practice, CCC allows multiple labels per story. Our sample of CCC stories includes only stories for which a single event was identified.] Using LLMs as part of an iterative process of codebook development is a promising direction for future work (Section 7).

5.5 Comparison with closed weight LLM

Model	F1
Mistral 7B zero shot (best)	0.54
Mistral 7B instruction-tuned (best)	0.76
GPT-4o zero shot	0.74

Table 8: Comparison of local versus closed LLMs. These results are the micro F1 score on the BFRS development set.

Despite our focus on local open-weight LLMs, we also conduct an experiment using a state-of-the-art closed weight LLM, GPT-4o. [Footnote 19: GPT-4o experiments were run July 8-11, 2024.] On the BFRS development set with the full structured codebook, GPT-4o obtained a zero-shot F1 of 0.74 (Table 8). This performance is significantly higher than Mistral 7B’s zero shot performance, but slightly below the Mistral model instruction-tuned on BFRS (Table 8). Running the 4,495 examples and codebook through GPT-4o cost around $60. These findings leave us optimistic that other researchers could similarly instruction-tune local LLMs and have similar performance to closed LLMs, reducing cost and increasing scientific reproducibility.

6 Discussion

Addressing RQ1 and RQ2a about the viability of using LLMs as zero-shot classifiers, from Table 6, we see that test-set zero-shot results range from very poor with 0.14 (micro) F1 on Manifestos, to marginal with 0.53 and 0.68 on BFRS and CCC respectively. However, we note that this poor zero-shot performance is about on par with what previous work has found on adjacent (but possibly slightly easier) tasks. The best zero-shot LLM models from Atreja et al. (2024) achieve 0.73 (macro) F1 on predicting toxicity, 0.48 F1 on sentiment, 0.44 F1 on stance, and 0.54 F1 on frames. Ziems et al. (2024)’s zero-shot results on document-level tasks are 0.23 F1 on event argument classification, 0.52 F1 on ideology, and 0.45 F1 on tropes. This weak performance suggests that zero-shot LLMs on complex codebook tasks are unlikely to be useful to applied analysts without further modification (e.g., instruction-tuning); inference-specific assumptions and design, e.g., Egami et al. (2024) or Chen, Bhattacharya and Keith (2024); or fundamental improvements in the base LLMs to read, comprehend, and comply with codebook instructions.

Shifting from the old to new codebook format results in marginal performance gains. We find mixed evidence that restructuring a codebook into a consistent, structured format improves zero-shot LLM performance (RQ2b). For BFRS and CCC, zero-shot performance improves slightly (an increase in F1 of 0.05) compared to the original codebook, while performance on the Manifesto dataset declines slightly. Despite very modest performance changes, pragmatically, our new semi-structured format made ablation experiments much more feasible, and we still believe that converging to a consistent format for future codebooks developed by social scientists would be beneficial to the field as a whole.

The ablation study on BFRS using the structured codebook shows relatively small changes as we exclude more sections from the codebook. For example, dropping the Output Reminder that emphasizes the importance of answering only with an allowed label does not change performance, while dropping positive and negative examples hurts performance somewhat. We take this as further evidence that LLMs are not attending to the information provided in the codebook.

In-codebook instruction-tuning results in performance gains over zero-shot classification. We find that instruction-tuning an LLM on a specific codebook greatly improves performance for BFRS and modestly improves it for CCC, but does not help for Manifestos (RQ3a). On BFRS, an instruction-tuned Mistral also slightly beats the zero-shot performance of GPT-4o (0.76 versus 0.74 F1). However, we see no performance gains and some losses compared to zero-shot results when we train on multiple codebooks or evaluate on out-of-sample codebooks (RQ3b). We speculate we would need to gather dozens more codebooks (with high quality gold-standard labels) to achieve the kinds of performance gains seen in suites of instruction-tuned NLP tasks, akin to Longpre et al. (2023).

The Manifestos dataset is too challenging for LLMs in their present form. Across both zero-shot and instruction-tuning approaches, we find very poor performance on the Manifesto corpus. We hypothesize that the poor performance is due to the large number of labels (142) and the very long codebook that results. While LLM context windows are rapidly increasing in size, some research suggests that attention is focused at the beginning and ends of prompts, not on the middle (Liu et al. 2024). The long codebook and extreme number of categories make this a good challenge dataset for future work on LLMs.

7 Future work

We see several avenues for future research building on our work here. First, different engineering approaches could possibly result in improved classification performance. We limit our work to a single, relatively small model (Mistral 7B Instruct), which we quantize. Using a larger or more recent local LLM could possibly improve zero-shot classification performance, but at the potential cost of forgoing use on a commodity GPU. Instruction-tuning involves a larger set of hyperparameters than we were able to fully investigate. In addition to standard training hyperparameters (e.g., optimizer and learning rate), LoRA also provides other hyperparameters, including the rank and $\alpha$ values.
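As a reminder of what the rank and $\alpha$ hyperparameters control, the standard LoRA formulation (Hu et al. 2022) can be written as follows. The symbols $W_0$, $A$, $B$, $d$, $k$, $r$, $x$, and $h$ are notation introduced here for illustration only.

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Only $A$ and $B$ are trained while the pretrained weight $W_0$ stays frozen, so the rank $r$ determines the number of trainable parameters and the ratio $\alpha / r$ scales the learned update.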

We see great potential in expanding the set and quality of codebooks and datasets used for evaluation and training. We use only three datasets, which limits our ability to instruction-tune an LLM on a diverse set of instructions. Re-structuring codebooks into a consistent machine- and human-readable format would allow new codebooks and datasets to be added to our collection, similar to work in NLP on representing a diverse set of tasks in a consistent format (Wang et al. 2022). Additionally, via our manual error analysis, we found challenging examples, mistakes in the gold labels, and places where the codebooks could be refined or clarified. This naturally leads to our next direction.

LLMs could be used in-the-loop for codebook development. Codebooks are often developed iteratively during annotation as edge cases arise or annotators raise questions. Updating the codebook during annotation requires retraining annotators and potentially re-annotating existing documents, which can slow the annotation process. LLMs have the potential to reduce the number of changes a codebook requires after annotation begins by identifying gaps in the codebook. Our manual analysis of CCC, for instance, found a case where Mistral labeled an event as a rally rather than the more precise march. If the march label takes precedence over the rally label, the codebook could be refined to state this explicitly. Relatedly, does the semi-structured format we propose improve annotator accuracy, as well as LLM accuracy? Both of these questions could be explored through experiments with human annotators.

Finally, applied researchers need tests for codebook “compliance” to assure themselves that LLMs can follow the instructions provided in their codebook. When used as a measurement tool, LLMs have the task of inferring a latent concept (a document’s true label) from the observable text provided to them. All measurement hinges on the validity of this translation (Fariss, Kenwick and Reuning 2020), but social scientists do not yet have the diagnostic tools for LLMs that they have for other measurement techniques to understand when LLMs fail. Future work can begin to formalize tests for LLM compliance. For instance, the extreme non-compliance of Mistral on the Manifesto dataset, where it returns hallucinated labels for almost half of the documents (Table 5.4), is an immediate indication that the model is not able to follow the most basic instructions in the prompt, raising a red flag for its use on this dataset. We see further work formalizing compliance checks as a prerequisite for applied researchers’ widespread adoption of codebook-based LLM classification.
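One of the simplest compliance checks of this kind, the share of model outputs that are not allowed labels, can be sketched as follows. This is an illustrative example rather than the procedure used in the paper; the function name `compliance_report`, the whitespace and case normalization, and the toy outputs are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List


def compliance_report(outputs: Iterable[str], allowed_labels: Iterable[str]) -> Dict:
    """Summarize how often raw model outputs fall outside the allowed label set.

    Outputs are stripped and upper-cased before comparison (an assumption);
    anything that is not exactly an allowed label counts as non-compliant.
    """
    allowed = {label.strip().upper() for label in allowed_labels}
    cleaned: List[str] = [out.strip().upper() for out in outputs]
    invalid = [out for out in cleaned if out not in allowed]
    n = len(cleaned)
    return {
        "n": n,
        "n_invalid": len(invalid),
        "invalid_rate": len(invalid) / n if n else 0.0,
        # The most frequent invalid strings hint at what the model is inventing.
        "most_common_invalid": Counter(invalid).most_common(5),
    }


# Toy outputs: two valid labels, one invented label, one verbose answer.
report = compliance_report(
    ["RIOT", " terrorism\n", "BOMBING", "The label is RIOT"],
    allowed_labels=["RIOT", "TERRORISM"],
)
```

A high `invalid_rate` on a sample of documents would be an early warning, available before any gold-standard labels are consulted, that the model is not following the most basic output instruction.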

8 Conclusion

As LLMs continue to improve, we believe they will be an increasingly attractive tool to measure social science constructs from text. In this work, we argue that applied researchers should make a codebook-construct label assumption, as opposed to a universal label assumption, and provide LLMs with full codebooks containing precise operationalization of labels. Empirically, we find that providing an LLM with a full codebook, either in its original or re-structured form, does not meaningfully improve (or harm) the zero-shot classification performance of one of the best “local” LLMs, Mistral, on the three real-world political science datasets. We suggest avenues for research to improve LLM “compliance” with social science codebooks, a necessary prerequisite for using LLMs in applied settings.

We see two points of optimism from our work. First, instruction-tuning LLMs with parameter-efficient methods, a task not much more demanding than the previous NLP paradigm of fine-tuning BERT-based models, increased LLM performance on classifying documents for two of our datasets. The performance of this (relatively small) open-weight instruction-tuned LLM matches the performance of a closed-weight commercial LLM. This suggests that instruction-tuning local models, which offer significantly greater scientific reproducibility than closed models, is a viable path forward for applied researchers. Second, we are optimistic that if social scientists developing new codebooks write them for both humans and LLMs in our semi-structured format, we can systematically gather these codebooks and datasets, train LLMs on them, and hopefully improve the models for everyone’s benefit.

9 Ethics & limitations

Our research has several limitations. First, this project focuses only on multi-class document classification. Future work could explore the use of generative LLMs to annotate other information, including extracting spans of text from documents (i.e., information extraction). We also acknowledge that all our results focus on English-language codebooks and text documents (Bender 2019). Our conclusions may not hold for non-English texts.

We acknowledge the real-world harm associated with the events reported in the documents in the datasets we use. We believe quantitative study of these events is important to mitigate future harm, but we also acknowledge that it should be paired with qualitative work, ideally work that also connects with and advocates for the people affected by the events (D’Ignazio and Klein 2020).

References

Adcock and Collier (2001) Adcock, Robert and David Collier. 2001. “Measurement validity: A shared standard for qualitative and quantitative research.” American Political Science Review 95(03):529–546.
Atreja et al. (2024) Atreja, Shubham, Joshua Ashkinaze, Lingyao Li, Julia Mendelsohn and Libby Hemphill. 2024. “Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways.” arXiv preprint arXiv:2406.11980 .
Balloccu et al. (2024) Balloccu, Simone, Patrícia Schmidtová, Mateusz Lango and Ondřej Dušek. 2024. Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 67–93.
Bender (2019) Bender, Emily. 2019. “The #benderrule: On naming the languages we study and why it matters.” The Gradient 14:34.
Brown et al. (2020) Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell et al. 2020. “Language models are few-shot learners.” Advances in Neural Information Processing Systems .
Bueno de Mesquita et al. (2015) Bueno de Mesquita, Ethan, C Christine Fair, Jenna Jordan, Rasul Bakhsh Rais and Jacob N Shapiro. 2015. “Measuring political violence in Pakistan: Insights from the BFRS Dataset.” Conflict Management and Peace Science 32(5):536–558.
Chang et al. (2023) Chang, Kent, Mackenzie Cramer, Sandeep Soni and David Bamman. 2023. Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 7312–7327.
Chen, Bhattacharya and Keith (2024) Chen, Jacob M, Rohit Bhattacharya and Katherine A Keith. 2024. “Proximal Causal Inference With Text Data.” arXiv preprint arXiv:2401.06687 .
Crowd Counting Consortium (2024) Crowd Counting Consortium. 2024. crowdcounting.org .
Daxecker, Amicarelli and Jung (2019) Daxecker, Ursula, Elio Amicarelli and Alexander Jung. 2019. “Electoral Contention and Violence (ECAV): A new dataset.” Journal of Peace Research 56(5):714–723.
Dettmers et al. (2024) Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman and Luke Zettlemoyer. 2024. “QLoRA: Efficient finetuning of quantized LLMs.” Advances in Neural Information Processing Systems .
Devlin et al. (2019) Devlin, Jacob, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
D’Ignazio and Klein (2020) D’Ignazio, Catherine and Lauren F Klein. 2020. Data feminism. MIT press.
Egami et al. (2024) Egami, Naoki, Musashi Hinck, Brandon Stewart and Hanying Wei. 2024. “Using imperfect surrogates for downstream inference: Design-based supervised learning for social science applications of large language models.” Advances in Neural Information Processing Systems 36.
Fariss, Kenwick and Reuning (2020) Fariss, Christopher J, Michael R Kenwick and Kevin Reuning. 2020. “Measurement models.” The SAGE handbook of research methods in political science and international relations pp. 353–370.
Fong and Grimmer (2019) Fong, Christian and Justin Grimmer. 2019. “Causal inference with latent treatments.” American Journal of Political Science .
Gardner et al. (2021) Gardner, Matt, William Merrill, Jesse Dodge, Matthew E Peters, Alexis Ross, Sameer Singh and Noah A Smith. 2021. Competency Problems: On Finding and Removing Artifacts in Language Data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 1801–1813.
Gerner et al. (2002) Gerner, Deborah J., Philip A Schrodt, Omür Yilmaz and Rajaa Abu-Jabr. 2002. “Conflict and Mediation Event Observations (CAMEO): A new event data framework for the analysis of foreign policy interactions.” International Studies Association, New Orleans .
Gilardi, Alizadeh and Kubli (2023) Gilardi, Fabrizio, Meysam Alizadeh and Maël Kubli. 2023. “ChatGPT outperforms crowd workers for text-annotation tasks.” Proceedings of the National Academy of Sciences .
Golchin and Surdeanu (2024) Golchin, Shahriar and Mihai Surdeanu. 2024. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. In The Twelfth International Conference on Learning Representations.
Grimmer and Stewart (2013) Grimmer, Justin and Brandon M Stewart. 2013. “Text as data: The promise and pitfalls of automatic content analysis methods for political texts.” Political analysis 21(3):267–297.
Groeneveld et al. (2024) Groeneveld, Dirk, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang et al. 2024. “OLMo: Accelerating the science of language models.” arXiv preprint arXiv:2402.00838 .
Gururangan et al. (2018) Gururangan, Suchin, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman and Noah A Smith. 2018. Annotation Artifacts in Natural Language Inference Data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). pp. 107–112.
Halterman (2023) Halterman, Andrew. 2023. “Synthetically generated text for supervised text analysis.” arXiv preprint arXiv:2303.16028 .
Halterman and Radford (2021) Halterman, Andrew and Benjamin J Radford. 2021. Few-Shot Upsampling for Protest Size Detection. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 3713–3720.
Halterman et al. (2021) Halterman, Andrew, Katherine Keith, Sheikh Sarwar and Brendan O’Connor. 2021. Corpus-Level Evaluation for Event QA: The IndiaPoliceEvents Corpus Covering the 2002 Gujarat Violence. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 4240–4253.
Hanna (2017) Hanna, Alex. 2017. “MPEDS: Automating the Generation of Protest Event Data.” SocArXiv https://osf.io/preprints/socarxiv/xuqmv .
Hu et al. (2022) Hu, Edward J, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
Jacovi and Goldberg (2020) Jacovi, Alon and Yoav Goldberg. 2020. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 4198–4205.
Jiang et al. (2023) Jiang, Albert Q., Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix and William El Sayed. 2023. “Mistral 7B.” arXiv preprint arXiv:2310.06825 .
Jindal, Rajpoot and Parikh (2024) Jindal, Ashvini Kumar, Pawan Kumar Rajpoot and Ankur Parikh. 2024. “Birbal: An efficient 7B instruct-model fine-tuned with curated datasets.” arXiv preprint arXiv:2403.02247 .
Knox, Lucas and Cho (2022) Knox, Dean, Christopher Lucas and Wendy K Tam Cho. 2022. “Testing causal theories with learned proxies.” Annual Review of Political Science 25:419–441.
Krippendorff (2018) Krippendorff, Klaus. 2018. Content analysis: An introduction to its methodology. Sage publications.
Kwiatkowski et al. (2019) Kwiatkowski, Tom, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le and Slav Petrov. 2019. “Natural Questions: a Benchmark for Question Answering Research.” Transactions of the Association for Computational Linguistics .
Lehmann et al. (2017) Lehmann, Pola, Theres Matthieß, Nicolas Merz, Sven Regel and Annika Werner. 2017. Manifesto Corpus. Version: 2017b. Technical report Berlin: WZB Berlin Social Science Center.
Levy, Jacoby and Goldberg (2024) Levy, Mosh, Alon Jacoby and Yoav Goldberg. 2024. “Same task, more tokens: the impact of input length on the reasoning performance of large language models.” arXiv preprint arXiv:2402.14848 .
Levy, Ravfogel and Goldberg (2023) Levy, Mosh, Shauli Ravfogel and Yoav Goldberg. 2023. Guiding LLM to Fool Itself: Automatically Manipulating Machine Reading Comprehension Shortcut Triggers. In Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 8495–8505.
Li and Flanigan (2024) Li, Changmao and Jeffrey Flanigan. 2024. Task contamination: Language models may not be few-shot anymore. In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38 pp. 18471–18480.
Liu et al. (2024) Liu, Nelson F, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni and Percy Liang. 2024. “Lost in the middle: How language models use long contexts.” Transactions of the Association for Computational Linguistics 12:157–173.
Longpre et al. (2023) Longpre, Shayne, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei et al. 2023. The Flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning. PMLR pp. 22631–22648.
Lyu, Apidianaki and Callison-Burch (2024) Lyu, Qing, Marianna Apidianaki and Chris Callison-Burch. 2024. “Towards faithful model explanation in NLP: A survey.” Computational Linguistics pp. 1–67.
McCoy, Pavlick and Linzen (2019) McCoy, Tom, Ellie Pavlick and Tal Linzen. 2019. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 3428–3448.
Meijers and Zaslove (2021) Meijers, Maurits J and Andrej Zaslove. 2021. “Measuring populism in political parties: Appraisal of a new approach.” Comparative political studies 54(2):372–407.
Mellon et al. (2024) Mellon, Jonathan, Jack Bailey, Ralph Scott, James Breckwoldt, Marta Miori and Phillip Schmedeman. 2024. “Do AIs know what the most important issue is? Using language models to code open-text social survey responses at scale.” Research & Politics 11(1):20531680241231468.
Merz, Regel and Lewandowski (2016) Merz, Nicolas, Sven Regel and Jirka Lewandowski. 2016. “The Manifesto Corpus: A new resource for research on political parties and quantitative text analysis.” Research & Politics 3(2):2053168016643346.
Palmer, Smith and Spirling (2024) Palmer, Alexis, Noah A Smith and Arthur Spirling. 2024. “Using proprietary language models in academic research requires explicit justification.” Nature Computational Science 4(1):2–3.
Pang, Lee and Vaithyanathan (2002) Pang, Bo, Lillian Lee and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). Association for Computational Linguistics.
Pangakis and Wolken (2024) Pangakis, Nicholas and Sam Wolken. 2024. Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels. In Proceedings of the Sixth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS 2024).
Peskoff and Stewart (2023) Peskoff, Denis and Brandon M Stewart. 2023. Credible without Credit: Domain Experts Assess Generative Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 427–438.
Raji et al. (2021) Raji, Deborah, Emily Denton, Emily M. Bender, Alex Hanna and Amandalynne Paullada. 2021. AI and the Everything in the Whole Wide World Benchmark. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, ed. J. Vanschoren and S. Yeung. Vol. 1.
Rajpurkar et al. (2016) Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pp. 2383–2392.
Rajpurkar, Jia and Liang (2018) Rajpurkar, Pranav, Robin Jia and Percy Liang. 2018. Know What You Don’t Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 784–789.
Raschka (2024) Raschka, Sebastian. 2024. “Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation).” Blog: Ahead of AI https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms.
Rogers (2023) Rogers, Anna. 2023. “Closed AI models make bad baselines.” https://hackingsemantics.xyz/2023/closed-baselines/ . Accessed: 2024-07-10.
Sanh et al. (2022) Sanh, Victor, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey et al. 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization. In International Conference on Learning Representations.
Shi et al. (2024) Shi, Zhengyan, Adam X Yang, Bin Wu, Laurence Aitchison, Emine Yilmaz and Aldo Lipani. 2024. “Instruction Tuning With Loss Over Instructions.” arXiv preprint arXiv:2405.14394 .
Spirling (2023) Spirling, Arthur. 2023. “Why open-source generative AI models are an ethical way forward for science.” Nature 616(7957):413–413.
Trischler et al. (2017) Trischler, Adam, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman and Kaheer Suleman. 2017. NewsQA: A Machine Comprehension Dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP. pp. 191–200.
Turpin et al. (2024) Turpin, Miles, Julian Michael, Ethan Perez and Samuel Bowman. 2024. “Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting.” Advances in Neural Information Processing Systems 36.
Wang et al. (2023) Wang, Yizhong, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy et al. 2023. “How far can camels go? exploring the state of instruction tuning on open resources.” Advances in Neural Information Processing Systems 36:74764–74786.
Wang et al. (2022) Wang, Yizhong, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap et al. 2022. Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. pp. 5085–5109.
Wei et al. (2022) Wei, Jason, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai and Quoc V Le. 2022. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations.
Wu et al. (2023) Wu, Patrick Y, Joshua A Tucker, Jonathan Nagler and Solomon Messing. 2023. “Large language models can be used to estimate the ideologies of politicians in a zero-shot learning setting.” arXiv preprint arXiv:2303.12057 .
Yang et al. (2018) Yang, Zhilin, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov and Christopher D Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 2369–2380.
Ye and Durrett (2022) Ye, Xi and Greg Durrett. 2022. “The unreliability of explanations in few-shot prompting for textual reasoning.” Advances in neural information processing systems 35:30378–30392.
Zhao et al. (2024) Zhao, Haiyan, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin and Mengnan Du. 2024. “Explainability for large language models: A survey.” ACM Transactions on Intelligent Systems and Technology 15(2):1–38.
Zhu et al. (2024) Zhu, Zining, Hanjie Chen, Xi Ye, Qing Lyu, Chenhao Tan, Ana Marasović and Sarah Wiegreffe. 2024. Explanation in the Era of Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 5: Tutorial Abstracts). pp. 19–25.
Ziems et al. (2024) Ziems, Caleb, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang and Diyi Yang. 2024. “Can large language models transform computational social science?” Computational Linguistics 50(1):237–291.

Appendix A Compute Resources

Zero-shot experiments were conducted on a server with a single Nvidia RTX 4080 GPU with 16GB of VRAM. Instruction-tuning experiments were run on a server with four Nvidia RTX A6000 GPUs, each with 48GB of VRAM.

Appendix B Prompt Example

Example of a complete zero-shot prompt for BFRS, showing the Llama2-style special instruction tokens for Mistral 7B Instruct. The special <s> token is applied during tokenization.

[INST]

<<SYS>>

Instructions: You’re an expert political scientist categorizing news stories from Pakistan into categories. Carefully read the definitions below, read the story, and write the Label that best matches the story. Use only the provided labels.

Task Type: single label multiclass

<</SYS>>

Categories:

Category: assassination

Label: ASSASSINATION

Definition: An attempt (successful or failed) by a non-state entity aimed to kill a specific individual.

Clarification: Targets may include military, political, civil society or civilian state or federal leadership. In some case, there may be a suspected government agency behind the assassination. This Label applies when a body is found with bullet wounds or other marks of violence and/or torture.

Negative Clarification: Does not include assassination using drone attacks (coded separately) or deaths from attacks on military, paramilitary, police or intelligence targets, or state violence against a group of civilians.

Positive Example: ""An Imam of a mosque was brutally murdered by unknown assailants in Jehanian on Friday."" This is an ASSASSINATION because the imam appears to have been deliberately targeted.

Negative Example: ""A woman and her son were injured by unidentified assailants in Mominabad on Sunday."" This is an example of TERRORISM, not ASSASSINATION, because the attack does not appear to be specifically targeted at the victims.

Category: assassination (drone attack)

Label: DRONE_ASSASSINATION

Definition: An assassination (failed or successful attempt) carried out using an unmanned aerial vehicle (drone strike).

Negative Clarification: Non-drone assassination attempts are coded under the ‘Assassination’ category, not under this category.

Positive Example: ""Drone fired missiles at a heavily fortified mud-house in Asoray Village in the Mirali sub-division kiliing 5 people and injuring 1.;a senior Al Qaeda commander, Hamza Rabia, was killed in the attack""

Negative Example: ""Two soldiers were injured in a bomb blast near an army outpost in the Hamrang area."" No mention of a drone strike.

Category: attack on state

Label: ATTACK_ON_STATE

Definition: An attack on Pakistani territory targeted at the state of Pakistan or its representatives that was conducted by the armed forces of another state. This includes shelling across the Line of Control, attacks on Pakistani forces mistaken for militants near the Afghan border.

Clarification: All these incidents were attributed to the government of India or United States and NATO forces.

Positive Example: ""Indian forces resorted to machine gun firing across the LoC in what the Pakistani officials termed a clear violation of the ceasefire."" This is an example of another state’s military conducting an attack on Pakistan.

Negative Example: ""Five rockets were fired at a Frontier Corps checkpost in Sangsilla Chashma area of Pir Koh."" Not attributable to the armed forces of another state.

Category: conventional attack on military/paramilitary/police/intelligence

Label: CONVENTIONAL_ATTACK_ON_GOV_FORCES

Definition: Incidents that involve a direct conventional attack on a military, police, paramilitary, or intelligence target, such that violence has the potential to be exchanged between the attackers.

Clarification: Conventional tactics are those that would be employed by state forces. Conventional attacks include ambushes, direct fire, artillery, pitched battle, and troop captures. This includes but is not limited to: attacks against military checkposts, police headquarters, the Federal Investigative Agency (FIA), the ISI.

Negative Clarification: ‘‘Guerilla" tactics are described in a separate category. Attacks by state actors (e.g. India) are coded as ‘‘attack on state"

Positive Example: ""At least two security personnel were injured in skirmishes between security forces and armed tribesmen. The attackers escaped after security forces retaliated using gunfire. No arrest was reported."" Skirmishes involving gunfire is a conventional attack.

Negative Example: ""A vehicle carrying security personnel hit a landmine in Loti gas field area resulting in injuries to three soldiers."" This is a guerilla attack because it involves a road-side bomb.

Category: guerilla attack on military/paramilitary/police/intelligence

Label: GUERILLA_ATTACK_ON_GOV_FORCES

Definition: Incidents that involve a guerilla-like attack (e.g., a car or suicide-bombing, a kidnapping) on a military, police, paramilitary, or intelligence target.

Clarification: Guerilla attacks employ tactics that conventional forces do not. For example, guerilla attacks will employ road-side bombs, improvised explosive devices (IEDs), suicide attacks (e.g. on a checkpoint), car bombs and the like.

Negative Clarification: Gun battles between soldiers and militants are coded as conventional attacks.

Positive Example: ""A suicide bomber blew himself up in an army training area in Guliana near Kharian cantonment, where the soldiers were busy training."" Suicide attacks are examples of guerilla attacks.

Negative Example: ""Mewind Rifles received information about the presence of militants in the area. When security forces reached there, militants started firing."" This describes a conventional battle between militants and state forces and is thus not a guerilla attack.

Category: military/paramilitary/police attack on non-state combatants

Label: GOV_ATTACK_ON_NONSTATE_COMBATANTS

Definition: Violence initiated by state (federal or provincial) combatants against combatant non-state, sub-national groups, or clandestine agents.

Clarification: Includes attacks by the US and NATO on militants in Pakistan that involve manned aircraft or artillery systems. For example, these may be brief episodes such as the Pakistan Army and Police assault on the Lal Masjid OR they may be sustained as the army offensive in South Waziristan in 2011 or the Frontier Corps in Aurakzai.

Negative Clarification: drone attacks have their own category

Positive Example: ""In a continuation of the crackdown on militants by the army, four more militants were killed."" State forces attacked non-state armed actors.

Negative Example: ""A personnel of the Balochistan Reserve Police was injured when some armed men opened indiscriminate fire on them which they later returned."" This is a non-state attack on state forces, so this label does not apply.

Category: military/paramilitary/police selective violence

Label: GOV_ATTACK_ON_CIVILIANS

Definition: Violence initiated by state (federal or provincial) combatants against civilians.

Clarification: These may include incidents where a newspaper reports that a political activist is picked up by the intelligence services and detained and beaten overnight.

Negative Clarification: Does not include attacks on combatants or violence in the context of a protest.

Positive Example: ""Advocate Ghulam Nabi of the Peshawer High Court was allegedly picked up by personnel of an intelligence agency in three vehicles near the NWFP assembly, and released after a severe over night thrashing. He is associated with Jamaat-i-Islami and a critic of the army and intelligence agencies.""

Negative Example: ""The police baton charged lawyers and other political activists as they tried to enter the city in order to join the throng of people outside the Supreme Court who were there supporting the sacked Chief Justice. Hundereds of people tried entering Islamabad for the purpose, which the police tried to stop by baton-charging the crowd."" This violence takes place in the context of a protest, and thus should get the label ""violent political demonstration"".

Category: riot

Label: RIOT

Definition: A riot is a violent clash between two or more sizeable groups or when a single informally assembled crowd becomes violent.

Clarification: Neither group can be a state force. However, if there are two or more clashes around a single event in which the police are involved, then it would be a classified as a riot.

Negative Clarification: For example, the students of the Lal Masjid versus the police would not be considered a riot, it would be a Violent Political Demonstration.

Positive Example: ""Five workers of the Muslim Students Federation, the student wing of the ruling Muslim conference, sustained bullet wounds in a clash with the activists of pro-independence National Students Federation.""

Negative Example: ""An angry mob thrashed police personnel, accusing them of manhandling people and using obscene language when they (the police) raided a gambling den."" This incident involves police, so it is coded as a ""violent political demonstration.""

Category: terrorism

Label: TERRORISM

Definition: Premeditated, politically motivated violence perpetrated against noncombatant targets by subnational groups or clandestine agents (22 U.S.C. \S 2656f(d)(2)).

Positive Example: ""Security forces foiled an attempt to blow up a 12-inch diameter gas pipeline in Sui tehsil."" Even though this attack was unsuccessful, it still involved an attack by non-state groups on non-combatant targets.

Negative Clarification: In contrast, an attack on a police check point is not considered to be terrorism even if civilians are harmed in the conduct of the attack provided that the primary target was the check point. This is coded as an attack on police, paramilitary or military target. The intended targets of the attacks is key. A targeted attack on a specific figure is an instance of assassination, not terrorism.

Negative Example: ""The grandson of Mohmand Agency MNA Maulana Ghulam Mohammad Sadiq was killed when gunmen entered his house and opened fire."" The attack seems directed at a specific prominent person, so even though it’s carried out by non-state actors, it should be coded as an assassination, not terrorism.

Category: threat of violence

Label: THREAT_OF_VIOLENCE

Definition: Incidents in which the threat of violence is reported as being used for political purposes.

Clarification: A typical example is when a spokesman for a banned militant group threatens attacks if certain policy changes were not made, or when a party spokesman threatens violent protest if certain changes are not made. This also includes incidents in which a particular individual is threatened.

Positive Example: ""The President of the All Pakistan Clerks Association...alleged that he had received a phone call warning him of an explosion at his house.""

Negative Example: ""A man killed his cousin, accusing her of having illicit relations with a youth of the same village."" This is an instance of violence actually being carried out, not a threat of violence.

Category: violent political demonstration

Label: VIOLENT_POLITICAL_DEMONSTRATION

Definition: A violent political demonstration is characterized as mobilization of crowds in response to a political event that at some point becomes violent.

Clarification: Violence may be initiated by the crowd or by the police upon the crowd.

Negative Clarification: A violent clash between two non-state groups is a riot, not a violent political demonstration.

Positive Example: ""Hundreds of school children and armed supporters of Lashkar-i-Islam stormed the Bara bazaar and set the main tehsil, Qambaraband, Al Haaj Market, Sheikhan and Qasaban checkposts of the Khasadar force on fire. The Lashkar-i-Islam had warned of protests if their demands were not met.""

Negative Example: ""A man killed and his brother injured when they were attacked by their rivals over a land dispute."" The motivation here is not a political event, thus this event is a riot, not a violent political demonstration.

Category: other

Label: OTHER

Definition: Incidents that are not easily categorized such as people being hurt by unexploded ordnance or when the source of the violence is unclear.

Clarification: An event that does not fit into the other categories (assassination, attack on state, conventional attack on military/paramilitary/police/intelligence, guerilla attack on military/paramilitary/police/intelligence, military/paramilitary/police selective violence) should be coded as OTHER.

Positive Example: ""A man allegedly opened fire on his wife while she was offering prayers, and later proceeded to kill his brother on the pretext that his wife and his brother had illicit relations. He then escaped.""

Negative Example: ""A suicide bomber blew himself up in an army training area in Guliana near Kharian cantonment, where the soldiers were busy training.""

The following is a news story from Pakistan.

Document: A presiding officer was kidnapped allgedly by the supporters of a PML Q candidate.

Write the name of the Label that fits best, with no other text. For example, ""Label: ASSASSINATION"", ""Label: VIOLENT_POLITICAL_DEMONSTRATION"", etc.

---------

[/INST]

Label: "
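The prompt above follows a regular pattern: one block per codebook class with semi-structured fields, then the document, the output instruction, and the closing tokens. As a minimal sketch of how such a prompt could be assembled programmatically (not the authors' released code), the Python below renders codebook entries into that layout. The names `CodebookEntry`, `render_entry`, and `build_prompt` are hypothetical, the field order is fixed for simplicity (the prompt above occasionally orders fields differently), and any text that precedes the codebook in the full prompt is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CodebookEntry:
    """One class in a semi-structured codebook (hypothetical container)."""
    category: str
    label: str
    definition: str
    clarification: Optional[str] = None
    negative_clarification: Optional[str] = None
    positive_example: Optional[str] = None
    negative_example: Optional[str] = None


# Heading text as it appears in the prompt, paired with the attribute name.
FIELD_ORDER = [
    ("Category", "category"),
    ("Label", "label"),
    ("Definition", "definition"),
    ("Clarification", "clarification"),
    ("Negative Clarification", "negative_clarification"),
    ("Positive Example", "positive_example"),
    ("Negative Example", "negative_example"),
]

INSTRUCTION = (
    "Write the name of the Label that fits best, with no other text. "
    'For example, "Label: ASSASSINATION", '
    '"Label: VIOLENT_POLITICAL_DEMONSTRATION", etc.'
)


def render_entry(entry: CodebookEntry) -> str:
    """Render one codebook class, skipping fields that are not provided."""
    parts = []
    for heading, attr in FIELD_ORDER:
        value = getattr(entry, attr)
        if value:
            parts.append(f"{heading}: {value}")
    return "\n\n".join(parts)


def build_prompt(entries: List[CodebookEntry], document: str, doc_intro: str) -> str:
    """Concatenate codebook, document, instruction, and the closing tokens."""
    codebook = "\n\n".join(render_entry(e) for e in entries)
    return "\n\n".join([
        codebook,
        doc_intro,
        f"Document: {document}",
        INSTRUCTION,
        "---------",
        "[/INST]",
        "Label: ",  # the model continues from here with the label name
    ])
```

For example, `build_prompt([riot_entry], document, "The following is a news story from Pakistan.")` would produce a single-class version of the prompt shown above. Keeping the codebook as structured records also makes ablations, such as dropping all negative examples, a matter of clearing one field.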

Appendix C Lemon-picked examples from manual error analysis

D. LLM non-compliance
Dataset-Ex.	Text	Gold Label	LLM output
Manif.-1	Context: will promote a learning and knowledge society by advancing technology, enhancing research and creating a society with love of learning and engaged in societal development; work with the Universities and other Institutes of Higher learning to improve their coordination with the FET system and the business sector to ensure that their graduates have the kind of education that is relevant to the needs of the economy and society; Text: further improve the levels of financial support especially to university students who come from poor families , and by encouraging work/study opportunities for students;	EDUCATION_POSITIVE	WELFARE_POSITIVE (or EDUCATION_POSITIVE, depending on the specific focus of the party’s manifesto) Explanation: The statement from the South African political party’s manifesto indicates a commitment to improving financial support for university students from poor families and encouraging work/study opportunities for students. This aligns with the “Welfare” or “Education” categories, as the party is focusing on enhancing the quality of life for students by addressing their financial needs and providing opportunities for practical experience. The “Positive” label is used to indicate that the party’s stance is supportive and constructive in this regard. Note: Depending on
Manif.-2	Context: We will provide more opportunities for applicants who have Canadian siblings by giving additional points under the Express Entry system, and we will restore the maximum age for dependents to 22 from 19, to allow more Canadians to bring their children to Canada. Text: We will also grant immediate permanent residency to new spouses entering Canada, eliminating the two-year waiting period.	MINORITY_GROUPS	IMMEDIATE_PERMANENT_RESIDENCY_FOR_NEW_SPOUSES_POSITIVE Explanation: This statement falls under the category “immigration” and specifically “permanent residency” within the “External Relations” aggregation. The label “IMMEDIATE_PERMANENT_RESIDENCY_FOR_NEW_SPOUSES_POSITIVE” is the best match for this statement as it accurately reflects the positive nature of the party’s stance on granting immediate permanent residency to new spouses entering Canada. Additionally, it’s important to note that the provided passage does not contain any negative statements, so no negative labels
E. LLM semantics/reasoning mistakes
Dataset-Ex.	Text	Gold Label	LLM output
BFRS-1	The army picket came under heavy gunfire and the military countered the attack and fired artillery shells. This attack took place two days after the expiry of the deadline given by Taliban that the military should halt the Swat operation or face retaliation.	GUERILLA_ATTACK_ON_GOV_FORCES	ATTACK_ON_STATE Explanation: The news story describes an attack on a military checkpoint, which fits the definition of an attack on Pakistani territory targeted at the state of Pakistan or its representatives. The attack was carried out by non-state entities, specifically the Taliban, and was in response to a deadline given by the Taliban for the military to halt an operation. Therefore, the best label for this news story is “ATTACK_ON_STATE”. Note: It’s important to remember that the labels provided are based on the given definitions and the information in the news story. The actual classification of an event can depend on additional context and information that
CCC-1	Holding a sign and wearing a pink hat, Mary Counter looked around and lamented the fact that only about three dozen people showed up Friday afternoon outside the Statehouse for a rally supporting the impeachment and removal of President Donald Trump. “We’re like frogs in boiling water,” she said. The ever-increasing pile of wrongdoing by Trump has simply numbed many Americans into inaction despite the threat to their nation, said Counter, a retired communications professional from the Northwest Side. [… truncated for space …]	DEMONSTRATION	RALLY Explanation: The news story describes a crowd gathering to demonstrate their support for the impeachment and removal of President Donald Trump. The crowd is holding signs and marching to a building, which fits the definition of a rally. Additional context: The crowd is sponsoring the event and is drawing support from passersby, indicating that it is a deliberate and organized demonstration of support for a political cause. The fact that the event includes a short march to a building also supports the label of rally, as marches are a separate category. Therefore, the label for this news story is “RALLY”. Note: The label “PROTEST” could
Manif.-3	Context: Tax Breaks for Green Buildings From 1 July 2011, businesses that invest in eligible assets or capital works to improve the energy efficiency of their existing buildings – from 2 stars or lower to 4 stars or higher – will be able to apply for a one-off bonus tax deduction. Text: The incentive will enable businesses to claim a bonus tax deduction of 50 per cent of the cost of the eligible assets or capital works.	ENVIRONMENTALISM_POSITIVE	INCENTIVES Explanation: The text describes a tax incentive for businesses to invest in energy-efficient buildings. The incentive is a bonus tax deduction, which falls under the category of incentives in the provided list. Therefore, the label for this excerpt is “INCENTIVES”. Additionally, the label for the aggregation level for this text would be “Economy”. So, the complete label for this excerpt would be “Economy: Incentives”. This label indicates that the text discusses a policy related to incentives within the economic context of the political party’s manifesto.

Table 9: Lemon-picked examples of LLM errors discovered during manual error analysis.
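The examples in Table 9 suggest a simple automatic pre-screen that separates format problems from semantic ones: outputs that name a label outside the codebook, outputs that name a valid label but append text the prompt asked the model to omit, and outputs that comply exactly. The sketch below is one way such a check could be written; it is illustrative rather than the procedure used for the manual analysis. The function name `check_output` and the three status strings are hypothetical, and the label sets in the usage note are small assumed subsets, not the full codebooks.

```python
import re
from typing import Optional, Set, Tuple


def check_output(output: str, valid_labels: Set[str]) -> Tuple[str, Optional[str]]:
    """Classify a raw LLM output as compliant, extra_text, or invalid_label.

    Returns (status, label); label is None when no valid label leads the output.
    """
    text = output.strip()
    # Tolerate an echoed "Label:" prefix, since the prompt ends with it.
    text = re.sub(r"^label:\s*", "", text, flags=re.IGNORECASE)
    # Take the leading run of letters and underscores as the candidate label.
    match = re.match(r"[A-Za-z_]+", text)
    candidate = match.group(0) if match else None
    if candidate not in valid_labels:
        return ("invalid_label", None)
    remainder = text[match.end():].strip()
    if remainder:
        return ("extra_text", candidate)
    return ("compliant", candidate)
```

With an assumed label subset `{"DEMONSTRATION", "RALLY", "MARCH"}`, the CCC-1 output would be flagged `extra_text` with label `RALLY`: well-formed apart from the unrequested explanation, but still wrong against the gold label. The Manif.-2 and Manif.-3 outputs would be flagged `invalid_label`, because their leading tokens do not appear in the codebook. A check of this kind only detects format non-compliance; semantic errors such as BFRS-1 still require comparison against gold labels.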
