Can Unconfident LLM Annotations Be Used for Confident Conclusions?

    Kristina Gligorić    Tijana Zrnic    Cinoo Lee    Emmanuel J. Candès     Dan Jurafsky
Stanford University
{gligoric, tijana.zrnic, cinoolee, candes, jurafsky}@stanford.edu
Abstract

Large language models (LLMs) have shown high agreement with human raters across a variety of tasks, demonstrating potential to ease the challenges of human data collection. In computational social science (CSS), researchers are increasingly leveraging LLM annotations to complement slow and expensive human annotations. Still, guidelines for collecting and using LLM annotations, without compromising the validity of downstream conclusions, remain limited. We introduce Confidence-Driven Inference: a method that combines LLM annotations and LLM confidence indicators to strategically select which human annotations should be collected, with the goal of producing accurate statistical estimates and provably valid confidence intervals while reducing the number of human annotations needed. Our approach comes with safeguards against LLM annotations of poor quality, guaranteeing that the conclusions will be both valid and no less accurate than if we only relied on human annotations. We demonstrate the effectiveness of Confidence-Driven Inference over baselines in statistical estimation tasks across three CSS settings—text politeness, stance, and bias—reducing the needed number of human annotations by over 25% in each. Although we use CSS settings for demonstration, Confidence-Driven Inference can be used to estimate most standard quantities across a broad range of NLP problems.

Can Unconfident LLM Annotations Be Used for Confident Conclusions?


    Kristina Gligorić    Tijana Zrnic    Cinoo Lee    Emmanuel J. Candès     Dan Jurafsky Stanford University {gligoric, tijana.zrnic, cinoolee, candes, jurafsky}@stanford.edu


Equal contribution.

1 Introduction

Large language models (LLMs) have shown strong zero-shot performance across tasks Kojima et al. (2022), making them a promising tool for generating annotations, particularly when they align closely with human judgments Ziems et al. (2024). Given this potential, LLM annotations of textual data may be effectively leveraged for statistical estimation, hypothesis testing, and theory development Park et al. (2023), as well as informing policy decisions Wei et al. (2023).

Refer to caption
Figure 1: Illustration of Confidence-Driven Inference. Given a text corpus and a quantity of interest θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, (1) we collect LLM annotations and indicators of LLM confidence, based on which we strategically choose a small number of human annotations; (2) we then produce an unbiased estimate θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT and a valid confidence interval, allowing valid downstream conclusions.

Computational Social Science (CSS) research typically focuses not on the annotations themselves but on the social-science insights and conclusions they enable. Thus, understanding how LLM annotations could be used for downstream inferences is crucial in CSS. For example, stance annotations facilitate the study of linguistic differences between media affirming or denying global warming Luo et al. (2020), while politeness annotations can help examine racial disparities in verbal interactions with law enforcement Voigt et al. (2017), the relationship between politeness and social power Danescu-Niculescu-Mizil et al. (2013), and politeness and gender Newman et al. (2008). Similarly, annotating political leanings in text allows studying the bias of search engines Robertson et al. (2018), social media Ribeiro et al. (2018), and political discourse Sim et al. (2013). Precise statistical estimation, such as prevalence or regression coefficient estimation, is essential for drawing valid conclusions in such studies.

However, whether LLM annotations can be effectively leveraged without compromising the validity of statistical estimation remains uncertain. LLMs exhibit demographic biases Weidinger et al. (2022); Cheng et al. (2023) and may lack factual accuracy Gunjal et al. (2024); Li et al. (2023b) and consistency Sclar et al. (2023); Atreja et al. (2024). Given these limitations, using LLMs without caution may lead to inaccurate conclusions and potential societal harms, especially when such conclusions influence policy or have tangible impacts on peoples’ outcomes Landers and Behrend (2023). A potential solution is to rely solely on human annotations; however, human annotations are costly.

Here, we present Confidence-Driven Inference, a method for valid statistical inference using LLM annotations. Given a text corpus and a quantity of interest, our approach builds on active inference Zrnic and Candès (2024) to: (1) strategically choose a small number of human annotations, guided by LLM annotations and the LLM’s verbalized confidence scores, and (2) combine the human and LLM annotations into an accurate estimate of the quantity of interest (Fig. 1). The resulting estimate is statistically valid, while reducing reliance on expensive human annotations.

Our task is statistical estimation of a quantity of interest. We evaluate our approach on five estimation tasks in three CSS settings (politeness, stance, and media bias) in terms of confidence interval coverage and effective sample size, which measures the increase in accuracy due to augmenting human with LLM annotations (Sec. 3.4). We find that naively treating LLM annotations as human data can lead to highly inaccurate estimates and poor coverage. At the same time, our method maintains the target coverage, while outperforming the baselines (defined in Sec. 3.3) in terms of the effective sample size. The latter is enabled partially by the fact that in all tested settings the confidence scores are reflective of LLM accuracy.

Confidence-Driven Inference can be used to estimate a wide range of standard targets (such as regression coefficients, means, and prevalences) across various NLP problems. Our code and data are available at https://github.com/kristinagligoric/confidence-driven-inference.

2 Background

2.1 LLMs for Data Annotation Tasks

LLMs have shown great potential in handling text-annotation tasks without prior task-specific training, sometimes even outperforming crowd workers Gilardi et al. (2023). NLP, LLMs offer transformative opportunities for any discipline that relies on text as data. Fields such as psychology, political science, sociology, communications, and economics recognize this emerging technology’s potential to enhance simulation-based research Bail (2024), and facilitate tasks such as text analysis, concept induction Lam et al. (2024), and topic modeling Pham et al. (2024).

However, despite their promise, limited research has explored how to harness the potential of LLMs in ways that are both cost-effective and statistically reliable. Our work addresses this gap.

2.2 Collaborative Annotation Paradigms

Much of past work frames human and LLM annotations as competing alternatives, with a focus on determining which is superior Thapa et al. (2023). More recent work increasingly calls for a collaborative approach that leverages the complementary strengths of both Allen et al. (1999). These collaborative paradigms aim to balance annotation quality and cost by combining human expertise and LLM efficiency Li et al. (2023c); Kim et al. (2024).

In the spirit of these collaborative paradigms, our work uses LLM confidence to efficiently and cost-effectively allocate annotation tasks, while also ensuring that the statistical inferences derived from the annotated data are valid.

2.3 Valid Statistical Inferences in NLP

Statistical inference is vital in NLP research. For example, model evaluation requires determining whether a model performs better than a baseline Card et al. (2020), which in turn relies on making valid conclusions about whether one is observing meaningful model improvements or noise Dodge et al. (2019). Chatzi et al. (2024) and Boyeau et al. (2024) leverage prediction-powered inference Angelopoulos et al. (2023a, b) for valid ranking of LLMs. A similar approach is adopted by Saad-Falcon et al. (2024) to evaluate Retrieval-Augmented Generation (RAG) systems.

Beyond model evaluation, NLP applications involve producing measurements, descriptive statistics, and causal effect estimates Feder et al. (2022); Card and Smith (2018). Notably, Keith and O’Connor (2018) introduced the problem of scientifically valid prevalence estimation. They construct Bayesian confidence intervals by proposing a generative model for text documents. We contribute to the existing literature by proposing an entirely model-free approach that is applicable to a broad range of target quantities.

Lastly, Egami et al. (2024) consider the problem of valid statistical inference when combining human and LLM annotations. However, they collect the human annotations for uniformly sampled instances, without adapting to the difficulty of annotation. Given the promise of active learning Zhang et al. (2023); Margatina et al. (2021), we develop an adaptive approach that samples a limited number of human annotations strategically. At a technical level, our approach builds on active inference Zrnic and Candès (2024), which can be seen as a refinement of prediction-powered inference Angelopoulos et al. (2023a, b) that uses active data collection for improved efficiency. Furthermore, we make use of power tuning Angelopoulos et al. (2023b), a technique that ensures that incorporating LLM annotations into the estimation can never be worse than ignoring them completely.

3 Methods

3.1 Problem Setup

We have a text corpus consisting of n𝑛nitalic_n independent and identically distributed (i.i.d.) instances T1,,Tnsubscript𝑇1subscript𝑇𝑛T_{1},\dots,T_{n}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT. We wish to estimate a quantity of interest θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, such as the prevalence of political bias in the corpus or the causal effect of using certain linguistic markers on the perceived sentiment. To perform the estimation, we require human annotations H1,,Hnsubscript𝐻1subscript𝐻𝑛H_{1},\dots,H_{n}italic_H start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_H start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT corresponding to T1,,Tnsubscript𝑇1subscript𝑇𝑛T_{1},\dots,T_{n}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT. For example, Hisubscript𝐻𝑖H_{i}italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT might indicate whether Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT contains political bias, or assess the perceived politeness of Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. In addition to human annotations, we may also have other readily-available information about Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT—covariates Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPTsuch as the source of Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT or indicators of whether Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT contains certain linguistic markers, computed via a lexicon. Note that Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is available automatically, without needing human annotation. We use the short-hand notation T=(T1,,Tn)𝑇subscript𝑇1subscript𝑇𝑛T=(T_{1},\dots,T_{n})italic_T = ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) and define X𝑋Xitalic_X and H𝐻Hitalic_H similarly.

The quantity θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT can be estimated via an estimator θ^(X,H)^𝜃𝑋𝐻\hat{\theta}(X,H)over^ start_ARG italic_θ end_ARG ( italic_X , italic_H ), which we will denote by θ^^𝜃\hat{\theta}over^ start_ARG italic_θ end_ARG for short. The accuracy of θ^^𝜃\hat{\theta}over^ start_ARG italic_θ end_ARG improves as the number of samples n𝑛nitalic_n increases (θ^^𝜃\hat{\theta}over^ start_ARG italic_θ end_ARG recovers θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT as n𝑛nitalic_n approaches infinity). We assume that θ^^𝜃\hat{\theta}over^ start_ARG italic_θ end_ARG is an M-estimator Van der Vaart (2000), meaning it can be written as

θ^=argminθ1ni=1nθ(Xi,Hi),^𝜃subscriptargmin𝜃1𝑛superscriptsubscript𝑖1𝑛subscript𝜃subscript𝑋𝑖subscript𝐻𝑖\hat{\theta}=\operatorname*{arg\,min}_{\theta}\frac{1}{n}\sum_{i=1}^{n}\ell_{% \theta}(X_{i},H_{i}),over^ start_ARG italic_θ end_ARG = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT roman_ℓ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , (1)

for a loss function θsubscript𝜃\ell_{\theta}roman_ℓ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT that is convex in θ𝜃\thetaitalic_θ. Important special cases include the mean label, θ^=1ni=1nHi^𝜃1𝑛superscriptsubscript𝑖1𝑛subscript𝐻𝑖\hat{\theta}=\frac{1}{n}\sum_{i=1}^{n}H_{i}over^ start_ARG italic_θ end_ARG = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, and linear regression coefficients, which are pervasive in CSS. Other examples include quantiles, logistic, and other regression coefficients. Notice that in some cases, like calculating the mean, the loss function only depends on Hisubscript𝐻𝑖H_{i}italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

Our goal is to produce an estimate of θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT with uncertainty—by providing a confidence interval at a pre-specified level (1α)1𝛼(1-\alpha)( 1 - italic_α )—with limited access to human annotations. Specifically, we can only collect nhumannmuch-less-thansubscript𝑛human𝑛n_{\mathrm{human}}\ll nitalic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT ≪ italic_n annotations (on average). This means that the “ideal estimate” (1) is out of reach.

To supplement the costly human annotations, we assume access to LLM annotations H^isubscript^𝐻𝑖\hat{H}_{i}over^ start_ARG italic_H end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for all n𝑛nitalic_n instances. However, we make no assumption that the LLM annotations are good: we want to produce a valid confidence interval no matter the quality of the LLM, though we anticipate better gains when their quality is high (i.e., lower mean squared error and a smaller confidence interval).

3.2 Confidence-Driven Inference

We combine LLM annotations with strategically chosen human annotations to produce an unbiased estimate θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT that lends itself to a confidence interval that is both valid and tight around θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. In particular, in the large-sample limit, the mean of the estimate is exactly θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, no matter how biased the LLM annotations are.

We first explain how to choose the set of instances to be human-annotated, which is crucial for producing an accurate estimate. We collect a human annotation Hisubscript𝐻𝑖H_{i}italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for instance Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with probability πisubscript𝜋𝑖\pi_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. We let ξi=𝟏{Hi collected}subscript𝜉𝑖1subscript𝐻𝑖 collected\xi_{i}=\mathbf{1}\{H_{i}\text{ collected}\}italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = bold_1 { italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT collected } denote the indicator of whether Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT has been human-annotated. Zrnic and Candès (2024) show that the optimal choice of πisubscript𝜋𝑖\pi_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is to sample according to the uncertainty of the predicted annotation; roughly speaking, for most estimation problems the optimal rule is

πi𝔼[(H^iHi)2|Ti],proportional-tosuperscriptsubscript𝜋𝑖𝔼delimited-[]conditionalsuperscriptsubscript^𝐻𝑖subscript𝐻𝑖2subscript𝑇𝑖\pi_{i}^{*}\propto\sqrt{\mathbb{E}[(\hat{H}_{i}-H_{i})^{2}|T_{i}]},italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∝ square-root start_ARG blackboard_E [ ( over^ start_ARG italic_H end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT | italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] end_ARG ,

where proportional-to\propto hides the normalization required to meet the budget, 𝔼[i=1nξi]=i=1nπi=nhuman𝔼delimited-[]superscriptsubscript𝑖1𝑛subscript𝜉𝑖superscriptsubscript𝑖1𝑛superscriptsubscript𝜋𝑖subscript𝑛human\mathbb{E}[\sum_{i=1}^{n}\xi_{i}]=\sum_{i=1}^{n}\pi_{i}^{*}=n_{\mathrm{human}}blackboard_E [ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT. Of course, since Hisubscript𝐻𝑖H_{i}italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is unknown, πisuperscriptsubscript𝜋𝑖\pi_{i}^{*}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is unattainable.

A key idea behind our method is to approximate πisubscriptsuperscript𝜋𝑖\pi^{*}_{i}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT by querying the LLM for verbalized confidence. Since RLHF may cause overconfidence Geng et al. (2024); Zhou et al. (2024) and miscalibration Band et al. (2024); Achiam et al. (2023) of the LLM’s conditional token probabilities, verbalized probabilities, i.e., expressions of confidence in token-space, are better-calibrated Tian et al. (2023). Therefore, to collect confidence scores, we adopt the verbalized two-stage prompting approach introduced by Tian et al. (2023), where the model is first asked to provide an answer via zero-shooting and afterward asked to assign a probability to the correctness of the answer. This gives us a confidence score Ci[0,1]subscript𝐶𝑖01C_{i}\in[0,1]italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ [ 0 , 1 ] for each instance Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. In our applications, we find that the verbalized confidence scores are calibrated (Fig. 3 (right)), meaning that higher confidence scores correspond to higher accuracy with respect to human annotations.

As we collect human annotations, we use {(Cj,(H^jHj)2)}j<i,ξj=1subscriptsubscript𝐶𝑗superscriptsubscript^𝐻𝑗subscript𝐻𝑗2formulae-sequence𝑗𝑖subscript𝜉𝑗1\{(C_{j},(\hat{H}_{j}-H_{j})^{2})\}_{j<i,\xi_{j}=1}{ ( italic_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , ( over^ start_ARG italic_H end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - italic_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_j < italic_i , italic_ξ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = 1 end_POSTSUBSCRIPT as feature–label pairs to train a black-box predictor err^isubscript^err𝑖\widehat{\texttt{err}}_{i}over^ start_ARG err end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. In other words, we train a model to predict the LLM error from its confidence. Finally, we set

πierr^i(Ci),proportional-tosubscript𝜋𝑖subscript^err𝑖subscript𝐶𝑖\pi_{i}\propto\sqrt{\widehat{\texttt{err}}_{i}(C_{i})},italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∝ square-root start_ARG over^ start_ARG err end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG ,

normalized so that 𝔼[i=1nξi]=i=1nπi=nhuman𝔼delimited-[]superscriptsubscript𝑖1𝑛subscript𝜉𝑖superscriptsubscript𝑖1𝑛subscript𝜋𝑖subscript𝑛human\mathbb{E}[\sum_{i=1}^{n}\xi_{i}]=\sum_{i=1}^{n}\pi_{i}=n_{\mathrm{human}}blackboard_E [ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT. In practice we do not fine-tune err^isubscript^err𝑖\widehat{\texttt{err}}_{i}over^ start_ARG err end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT at every step i𝑖iitalic_i, but we do so periodically, after reasonably large batches of data (say, every 50 or 100 data points). See App. A.3 for further details behind the sampling and Table 2 for prompt texts.

After we have collected the human annotations according to πisubscript𝜋𝑖\pi_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, building on active inference Zrnic and Candès (2024) we compute a confidence-driven estimate of θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT:

θ^conf=argminθ1ni=1n(λ^θ,i+(θ,iλ^θ,i)ξiπi),superscript^𝜃confsubscriptargmin𝜃1𝑛superscriptsubscript𝑖1𝑛𝜆subscript^𝜃𝑖subscript𝜃𝑖𝜆subscript^𝜃𝑖subscript𝜉𝑖subscript𝜋𝑖\hat{\theta}^{\mathrm{conf}}=\operatorname*{arg\,min}_{\theta}\frac{1}{n}\sum_% {i=1}^{n}\left(\lambda\hat{\ell}_{\theta,i}+(\ell_{\theta,i}-\lambda\hat{\ell}% _{\theta,i})\frac{\xi_{i}}{\pi_{i}}\right),over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT = start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( italic_λ over^ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT italic_θ , italic_i end_POSTSUBSCRIPT + ( roman_ℓ start_POSTSUBSCRIPT italic_θ , italic_i end_POSTSUBSCRIPT - italic_λ over^ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT italic_θ , italic_i end_POSTSUBSCRIPT ) divide start_ARG italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ) , (2)

where we denote θ,i=θ(Xi,Hi)subscript𝜃𝑖subscript𝜃subscript𝑋𝑖subscript𝐻𝑖\ell_{\theta,i}=\ell_{\theta}(X_{i},H_{i})roman_ℓ start_POSTSUBSCRIPT italic_θ , italic_i end_POSTSUBSCRIPT = roman_ℓ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) and ^θ,i=θ(Xi,H^i)subscript^𝜃𝑖subscript𝜃subscript𝑋𝑖subscript^𝐻𝑖\hat{\ell}_{\theta,i}=\ell_{\theta}(X_{i},\hat{H}_{i})over^ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT italic_θ , italic_i end_POSTSUBSCRIPT = roman_ℓ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over^ start_ARG italic_H end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), and λ[0,1]𝜆01\lambda\in[0,1]italic_λ ∈ [ 0 , 1 ] is a carefully chosen tuning parameter. Notice that every summand in (2) is in expectation over ξisubscript𝜉𝑖\xi_{i}italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT equal to θ(Xi,Hi)subscript𝜃subscript𝑋𝑖subscript𝐻𝑖\ell_{\theta}(X_{i},H_{i})roman_ℓ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), and thus the loss (2) is on average equal to “ideal” loss (1). This allows showing that, in the limit, θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT is on average exactly equal to θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, no matter the bias in the LLM annotations. To give one example, if we want to estimate the mean of Hisubscript𝐻𝑖H_{i}italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT reduces to

θ^conf=1ni=1n(λH^i+(HiλH^i)ξiπi).superscript^𝜃conf1𝑛superscriptsubscript𝑖1𝑛𝜆subscript^𝐻𝑖subscript𝐻𝑖𝜆subscript^𝐻𝑖subscript𝜉𝑖subscript𝜋𝑖\hat{\theta}^{\mathrm{conf}}=\frac{1}{n}\sum_{i=1}^{n}\left(\lambda\hat{H}_{i}% +(H_{i}-\lambda\hat{H}_{i})\frac{\xi_{i}}{\pi_{i}}\right).over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( italic_λ over^ start_ARG italic_H end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + ( italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_λ over^ start_ARG italic_H end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) divide start_ARG italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ) .

Notice that 𝔼[θ^conf]=𝔼[Hi]=θ𝔼delimited-[]superscript^𝜃conf𝔼delimited-[]subscript𝐻𝑖superscript𝜃\mathbb{E}[\hat{\theta}^{\mathrm{conf}}]=\mathbb{E}[H_{i}]=\theta^{*}blackboard_E [ over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT ] = blackboard_E [ italic_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] = italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. The parameter λ𝜆\lambdaitalic_λ is called a power-tuning parameter Angelopoulos et al. (2023b), and it interpolates between ignoring the LLM annotations (λ=0𝜆0\lambda=0italic_λ = 0) and utilizing them fully (λ=1𝜆1\lambda=1italic_λ = 1). We set λ𝜆\lambdaitalic_λ optimally, so that the mean squared error (MSE) of θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT is minimized over λ𝜆\lambdaitalic_λ. This means that, given any sampling rule πisubscript𝜋𝑖\pi_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, the confidence-driven estimator can never be hurt by leveraging erroneous LLM annotations oder miscalibrated confidence scores. The estimator is at least as good as when λ=0𝜆0\lambda=0italic_λ = 0. Details behind the optimization of λ𝜆\lambdaitalic_λ are in App. A.2.

Finally, applying the theoretical guarantees of Zrnic and Candès (2024), we form a valid confidence interval at level 1α1𝛼1-\alpha1 - italic_α as

C1α=(θ^conf±z1α/2σ^se),subscript𝐶1𝛼plus-or-minussuperscript^𝜃confsubscript𝑧1𝛼2subscript^𝜎seC_{1-\alpha}=(\hat{\theta}^{\mathrm{conf}}\pm z_{1-\alpha/2}\hat{\sigma}_{% \mathrm{se}}),italic_C start_POSTSUBSCRIPT 1 - italic_α end_POSTSUBSCRIPT = ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT ± italic_z start_POSTSUBSCRIPT 1 - italic_α / 2 end_POSTSUBSCRIPT over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT roman_se end_POSTSUBSCRIPT ) ,

where z1α/2subscript𝑧1𝛼2z_{1-\alpha/2}italic_z start_POSTSUBSCRIPT 1 - italic_α / 2 end_POSTSUBSCRIPT is the 1α/21𝛼21-\alpha/21 - italic_α / 2 quantile of the standard normal distribution and σ^sesubscript^𝜎se\hat{\sigma}_{\mathrm{se}}over^ start_ARG italic_σ end_ARG start_POSTSUBSCRIPT roman_se end_POSTSUBSCRIPT is a standard error estimate that has a closed form, stated in App. A.1.

3.3 Baselines

Human + LLM (non-adaptive).

The first baseline incorporates LLM annotations but does not adapt to the per-instance confidence or accuracy of the LLM—it equally trusts all LLM annotations. In particular, this baseline is a special case of θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT with λ=1𝜆1\lambda=1italic_λ = 1 and uniform sampling probabilities πi=nhumannsubscript𝜋𝑖subscript𝑛human𝑛\pi_{i}=\frac{n_{\mathrm{human}}}{n}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT end_ARG start_ARG italic_n end_ARG. This is the method evaluated and studied by Egami et al. (2024).

Human only.

The second baseline ignores LLM annotations and simply applies the standard estimator to human annotations. It collects each human annotation with equal probability, nhumannsubscript𝑛human𝑛\frac{n_{\mathrm{human}}}{n}divide start_ARG italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT end_ARG start_ARG italic_n end_ARG, so that nhumansubscript𝑛humann_{\mathrm{human}}italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT annotations are collected on average. This is the “classical” approach, and it can be thought of as erring on the side of caution and ignoring potentially biased LLM outputs. Since the baseline only collects human annotations, it allows forming a valid confidence interval via classical statistics. This approach is equivalent to θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT with λ=0𝜆0\lambda=0italic_λ = 0.

LLM only.

Finally, we consider the naive baseline which treats LLM annotations as human annotations, applying the standard estimator to those annotations and naively forming a confidence interval. This baseline does not suffer from a budget constraint, since LLM annotations are assumed to be cheap and available for all n𝑛nitalic_n instances, but it may be biased if the LLM produces biased outputs.

3.4 Evaluation Metrics

We evaluate our approach and the baselines in terms of effective sample size and coverage. The effective sample size measures the increase in accuracy achieved by incorporating LLM annotations alongside human annotations. This is akin to getting more value out of each human annotation. For instance, if one has only 100 human annotations but combines them effectively with a larger pool of LLM annotations, the resulting accuracy could be comparable to having 150 human annotations. The latter metric, coverage, evaluates the statistical validity of the approaches by capturing how often the true value θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT falls within the produced confidence interval. In the following we elaborate on the two metrics, deferring further details behind their computation to App. A.4.

Effective sample size.

Given an estimate θ^methodsuperscript^𝜃method\hat{\theta}^{\mathrm{method}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_method end_POSTSUPERSCRIPT produced by a method, we define the effective sample size as the hypothetical value neffectivesubscript𝑛effectiven_{\mathrm{effective}}italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT such that MSE(θ^method)=MSE(θ^neffectivehuman)MSEsuperscript^𝜃methodMSEsubscriptsuperscript^𝜃humansubscript𝑛effective\mathrm{MSE}(\hat{\theta}^{\mathrm{method}})=\mathrm{MSE}(\hat{\theta}^{% \mathrm{human}}_{n_{\mathrm{effective}}})roman_MSE ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_method end_POSTSUPERSCRIPT ) = roman_MSE ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_human end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT end_POSTSUBSCRIPT ), where θ^neffectivehumansubscriptsuperscript^𝜃humansubscript𝑛effective\hat{\theta}^{\mathrm{human}}_{n_{\mathrm{effective}}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_human end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT end_POSTSUBSCRIPT is obtained via the human-only approach with neffectivesubscript𝑛effectiven_{\mathrm{effective}}italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT annotations. In other words, θ^methodsuperscript^𝜃method\hat{\theta}^{\mathrm{method}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_method end_POSTSUPERSCRIPT is as accurate as the “classical” estimate with neffectivesubscript𝑛effectiven_{\mathrm{effective}}italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT human annotations. An equivalent definition says that neffectivesubscript𝑛effectiven_{\mathrm{effective}}italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT is the sample size for which the confidence interval around θ^methodsuperscript^𝜃method\hat{\theta}^{\mathrm{method}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_method end_POSTSUPERSCRIPT is of equal width as the classical confidence interval around θ^neffectivehumansubscriptsuperscript^𝜃humansubscript𝑛effective\hat{\theta}^{\mathrm{human}}_{n_{\mathrm{effective}}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_human end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT end_POSTSUBSCRIPT. We thus have that neffectivenhumansubscript𝑛effectivesubscript𝑛humann_{\mathrm{effective}}-n_{\mathrm{human}}italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT - italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT is the benefit (if positive) or harm (if negative) of using LLM annotations. We also report the gain in effective sample size, defined as (neffectivenhuman)/nhuman100%subscript𝑛effectivesubscript𝑛humansubscript𝑛humanpercent100(n_{\mathrm{effective}}-n_{\mathrm{human}})/n_{\mathrm{human}}\cdot 100\%( italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT - italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT ) / italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT ⋅ 100 %. The effective sample size of the human-only approach is always nhumansubscript𝑛humann_{\mathrm{human}}italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT. We only report the effective sample size for approaches that use human annotations, i.e. all but LLM only, because the effective sample size measures the increase in value of the human annotations.

Coverage.

Coverage is defined as the rate at which the confidence intervals produced by each method cover θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. Since θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is an ideal estimate that would require infinite data, we cannot know θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT exactly in our applications. Instead, as a proxy, we compute coverage with respect to the estimate (1) on the full dataset. We compute the intervals with a target coverage rate of 90%percent9090\%90 %. Note that, following the theory of Zrnic and Candès (2024), the coverage of our method is provably equal to 90%percent9090\%90 %, and the same is true of the other two statistically valid baselines (our numbers will be slightly upward biased due to the fact that we use a proxy for θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT). With this in mind, the main purpose of reporting coverage is to evaluate the performance of the LLM only approach; for all other methods, we show coverage as a proof of concept.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2: Confidence intervals, effective sample size, and coverage. Rows correspond to different estimation tasks. The first column shows the confidence intervals in five random trials. The vertical dashed line corresponds to the estimate produced on the full dataset. A method is valid if its confidence interval includes this estimate (in about 90% of the trials), and tighter intervals around θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT indicates better performance. The second and third columns display the effective sample size neffectivesubscript𝑛effectiven_{\mathrm{effective}}italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT and coverage, respectively, for different values of the human annotation budget nhumansubscript𝑛humann_{\mathrm{human}}italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT. Results are estimated over 100 trials.

4 Results

We evaluate our approach on a set of CSS problems that rely on statistical estimation. We aim to include settings that (1) allow addressing important downstream social-science questions, (2) rely on a human-labeled corpus of text instances (possibly with relevant additional covariates), and (3) have a publicly available dataset. We selected three settings that meet these criteria—politeness, stance, and political bias. For stance and politeness, we leverage publicly available datasets and the corresponding human annotations in their entirety. Given the large size, for political leaning, we randomly sample a smaller subset of texts.

4.1 Estimation tasks

Estimation task Metric Method
confidence-driven human + LLM (non-adaptive) LLM only
Politeness devices (hedge) Gain in eff. sample size ( 30.02 ±plus-or-minus\pm± 7.82)%  ( -16.76 ±plus-or-minus\pm± 8.08)%
Coverage 95% 89% 52%
Politeness devices (1st person pl.) Gain in eff. sample size  ( 319.44 ±plus-or-minus\pm± 22.09)% ( -8.05 ±plus-or-minus\pm± 30.09)%
Coverage 94% 94% 69%
Stance on global warming Gain in eff. sample size ( 33.77 ±plus-or-minus\pm± 25.06)% ( 17.84 ±plus-or-minus\pm± 25.67)%
Coverage 88% 94% 0%
Political bias (left-leaning) Gain in eff. sample size ( 29.94 ±plus-or-minus\pm± 8.19)% ( 20.56 ±plus-or-minus\pm± 6.33)%
Coverage 97% 94% 2%
Political bias (right-leaning) Gain in eff. sample size ( 63.73 ±plus-or-minus\pm± 11.48)% ( 61.15 ±plus-or-minus\pm± 8.75)%
Coverage 91% 95% 90%
Table 1: Results summary. Gain in effective sample size and coverage across the five estimation tasks for nhuman=500subscript𝑛human500n_{\mathrm{human}}=500italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT = 500, estimated over 100 trials. In each task, the confidence-driven approach achieves a higher gain in effective sample size (bolded) than the non-adaptive approach. Confidence-driven approach always achieves a  positive gain, while the non-adaptive approach sometimes achieves a  negative gain. Confidence-driven and non-adaptive approaches achieve near 90% coverage, or higher. In contrast, LLM-only coverage is often poor. Gain in effective sample size is not estimated for the LLM-only approach as it does not leverage human annotations. Errors show a standard deviation over 100 trials.

Politeness.

Texts from online requests posted on Stack Exchange and Wikipedia (n=5,480𝑛5480n=5,480italic_n = 5 , 480) can be seen as polite or impolite. Politeness annotations help understand how linguistic devices impact perceived politeness Danescu-Niculescu-Mizil et al. (2013). In this estimation task, θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT corresponds to the logistic regression coefficient βhedgesubscript𝛽hedge\beta_{\mathrm{hedge}}italic_β start_POSTSUBSCRIPT roman_hedge end_POSTSUBSCRIPT measuring the impact of a linguistic feature such as hedging on the perceived politeness, logit(P(Hpolite=1|Xhedge))=β0+βhedgeXhedgelogit𝑃subscript𝐻politeconditional1subscript𝑋hedgesubscript𝛽0subscript𝛽hedgesubscript𝑋hedge\mathrm{logit}(P(H_{\mathrm{polite}}=~{}1|X_{\mathrm{hedge}}))=\beta_{0}+\beta% _{\mathrm{hedge}}X_{\mathrm{hedge}}roman_logit ( italic_P ( italic_H start_POSTSUBSCRIPT roman_polite end_POSTSUBSCRIPT = 1 | italic_X start_POSTSUBSCRIPT roman_hedge end_POSTSUBSCRIPT ) ) = italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_β start_POSTSUBSCRIPT roman_hedge end_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT roman_hedge end_POSTSUBSCRIPT, where Xhedge=1subscript𝑋hedge1X_{\mathrm{hedge}}=1italic_X start_POSTSUBSCRIPT roman_hedge end_POSTSUBSCRIPT = 1 indicates the presence of the hedge marker and Hpolite=1subscript𝐻polite1H_{\mathrm{polite}}=1italic_H start_POSTSUBSCRIPT roman_polite end_POSTSUBSCRIPT = 1 indicates annotation as polite. We similarly estimate β1ppsubscript𝛽1pp\beta_{\mathrm{1pp}}italic_β start_POSTSUBSCRIPT 1 roman_p roman_p end_POSTSUBSCRIPT, the impact of the use of the first person plural pronouns on the perceived politeness.

Stance.

News headlines (n=2,300𝑛2300n=2,300italic_n = 2 , 300) are agreeing, neutral, or disagreeing with the stance that global warming is a serious concern Luo et al. (2020). Stance annotations facilitate the study of linguistic differences between media supporting or rejecting global warming, which have important implications for communication and policy Hmielowski et al. (2014). In this task, we estimate θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT corresponding to Oagreementsubscript𝑂agreementO_{\mathrm{agreement}}italic_O start_POSTSUBSCRIPT roman_agreement end_POSTSUBSCRIPT, the odds ratio of agreement given the presence of affirming devices such as “expert,” “proven,” “renowned,” and so on. Formally, denoting by Xaffirm{0,1}subscript𝑋affirm01X_{\mathrm{affirm}}\in\{0,1\}italic_X start_POSTSUBSCRIPT roman_affirm end_POSTSUBSCRIPT ∈ { 0 , 1 } the presence of an affirming device and Hagree{0,1}subscript𝐻agree01H_{\mathrm{agree}}\in\{0,1\}italic_H start_POSTSUBSCRIPT roman_agree end_POSTSUBSCRIPT ∈ { 0 , 1 } the annotation of agreement, we have

Oagreement=μagree|affirm/(1μagree|affirm)μagree|¬affirm/(1μagree|¬affirm),subscript𝑂agreementsubscript𝜇conditionalagreeaffirm1subscript𝜇conditionalagreeaffirmsubscript𝜇conditionalagreeaffirm1subscript𝜇conditionalagreeaffirmO_{\mathrm{agreement}}=\frac{\mu_{\mathrm{agree|affirm}}/(1-\mu_{\mathrm{agree% |affirm}})}{\mu_{\mathrm{agree|\neg affirm}}/(1-\mu_{\mathrm{agree|\neg affirm% }})},italic_O start_POSTSUBSCRIPT roman_agreement end_POSTSUBSCRIPT = divide start_ARG italic_μ start_POSTSUBSCRIPT roman_agree | roman_affirm end_POSTSUBSCRIPT / ( 1 - italic_μ start_POSTSUBSCRIPT roman_agree | roman_affirm end_POSTSUBSCRIPT ) end_ARG start_ARG italic_μ start_POSTSUBSCRIPT roman_agree | ¬ roman_affirm end_POSTSUBSCRIPT / ( 1 - italic_μ start_POSTSUBSCRIPT roman_agree | ¬ roman_affirm end_POSTSUBSCRIPT ) end_ARG ,

where μagree|affirm=P(Hagree=1|Xaffirm=1)subscript𝜇conditionalagreeaffirm𝑃subscript𝐻agreeconditional1subscript𝑋affirm1\mu_{\mathrm{agree|affirm}}=P(H_{\mathrm{agree}}=1|X_{\mathrm{affirm}}=~{}1)italic_μ start_POSTSUBSCRIPT roman_agree | roman_affirm end_POSTSUBSCRIPT = italic_P ( italic_H start_POSTSUBSCRIPT roman_agree end_POSTSUBSCRIPT = 1 | italic_X start_POSTSUBSCRIPT roman_affirm end_POSTSUBSCRIPT = 1 ) and μagree|¬affirm=P(Hagree=1|Xaffirm=0)subscript𝜇conditionalagreeaffirm𝑃subscript𝐻agreeconditional1subscript𝑋affirm0\mu_{\mathrm{agree|\neg affirm}}=P(H_{\mathrm{agree}}=1|X_{\mathrm{affirm}}=0)italic_μ start_POSTSUBSCRIPT roman_agree | ¬ roman_affirm end_POSTSUBSCRIPT = italic_P ( italic_H start_POSTSUBSCRIPT roman_agree end_POSTSUBSCRIPT = 1 | italic_X start_POSTSUBSCRIPT roman_affirm end_POSTSUBSCRIPT = 0 ). Indicators for affirming devices were extracted using a lexicon derived by Luo et al. (2020).

Political bias.

News texts (randomly sampled n=2,000𝑛2000n=2,000italic_n = 2 , 000) are either leaning left, center, or right Baly et al. (2020). Annotating political leanings in text allows studying the bias in media outlets, socio-technical systems, or historical and contemporary public discourse. Such biases are often reported in terms of prevalence statistics. Thus, in this setting θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT corresponds to the prevalence of a leaning, i.e., plean=P(Hlean=1)subscript𝑝lean𝑃subscript𝐻lean1p_{\mathrm{lean}}=P(H_{\mathrm{lean}}=1)italic_p start_POSTSUBSCRIPT roman_lean end_POSTSUBSCRIPT = italic_P ( italic_H start_POSTSUBSCRIPT roman_lean end_POSTSUBSCRIPT = 1 ), where Hlean{0,1}subscript𝐻lean01H_{\mathrm{lean}}\in\{0,1\}italic_H start_POSTSUBSCRIPT roman_lean end_POSTSUBSCRIPT ∈ { 0 , 1 } denotes the presence of a leaning. We estimate pleftsubscript𝑝leftp_{\mathrm{left}}italic_p start_POSTSUBSCRIPT roman_left end_POSTSUBSCRIPT and prightsubscript𝑝rightp_{\mathrm{right}}italic_p start_POSTSUBSCRIPT roman_right end_POSTSUBSCRIPT, the prevalences of left- and right-leaning articles in the corpus.

4.2 Evaluation

Our main evaluation is based on LLM annotations collected with GPT-4o; analogous results with GPT-3.5 can be found in App. B.1. Table 2 in App. A.3 lists prompt texts and parameters. To test LLM performance out of the box, all annotations are collected using zero-shot prompting. Overall, the confidence scores are calibrated with accuracy, but the annotations are only in moderate agreement with human annotations in all three settings (see App. A.3). This is aligned with our lack of assumption that the LLM annotations are good: we want to produce a valid confidence interval no matter the quality of the LLM annotations.

We report the two key metrics (effective sample size and coverage), for the three selected settings (the study of politeness, stance, and bias), where the task is to estimate the five target quantities βhedgesubscript𝛽hedge\beta_{\mathrm{hedge}}italic_β start_POSTSUBSCRIPT roman_hedge end_POSTSUBSCRIPT, β1ppsubscript𝛽1pp\beta_{\mathrm{1pp}}italic_β start_POSTSUBSCRIPT 1 roman_p roman_p end_POSTSUBSCRIPT, Oagreementsubscript𝑂agreementO_{\mathrm{agreement}}italic_O start_POSTSUBSCRIPT roman_agreement end_POSTSUBSCRIPT, pleftsubscript𝑝leftp_{\mathrm{left}}italic_p start_POSTSUBSCRIPT roman_left end_POSTSUBSCRIPT, and prightsubscript𝑝rightp_{\mathrm{right}}italic_p start_POSTSUBSCRIPT roman_right end_POSTSUBSCRIPT. Both metrics are estimated over 100 trials for varying nhumansubscript𝑛humann_{\mathrm{human}}italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT, the budget for human annotations.

Our main findings are reported in Figure 2 and summarized in Table 1. Figure 2 also depicts the computed confidence intervals in five randomly chosen trials across the five target quantities, for the lowest value of nhumansubscript𝑛humann_{\mathrm{human}}italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT in the considered range.

Effective sample size.

First, across the five target quantities, we find that Confidence-Driven Inference increases the effective sample size compared to the human-only baseline. For a given budget of nhumansubscript𝑛humann_{\mathrm{human}}italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT annotations, e.g., nhuman=1000subscript𝑛human1000n_{\mathrm{human}}=1000italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT = 1000, the confidence-driven approach achieves the effective sample size at minimum 1250 (when estimating pleftsubscript𝑝leftp_{\mathrm{left}}italic_p start_POSTSUBSCRIPT roman_left end_POSTSUBSCRIPT). This means that the confidence interval around the estimated statistic is of equal width as the confidence interval produced with a larger number of human-only annotations.

Similarly, it is informative to consider the necessary budget of human annotations nhumansubscript𝑛humann_{\mathrm{human}}italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT given a desired effective sample size neffectivesubscript𝑛effectiven_{\mathrm{effective}}italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT. To achieve a desired fixed effective sample size, the confidence-driven approach reduces the needed number of human annotations across all five target quantities. For instance, to achieve neffective=1000subscript𝑛effective1000n_{\mathrm{effective}}=1000italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT = 1000, only between around 250 (β1ppsubscript𝛽1pp\beta_{\mathrm{1pp}}italic_β start_POSTSUBSCRIPT 1 roman_p roman_p end_POSTSUBSCRIPT) and 750 (pleftsubscript𝑝leftp_{\mathrm{left}}italic_p start_POSTSUBSCRIPT roman_left end_POSTSUBSCRIPT) human annotations are needed, thus reducing the number of human annotations needed to achieve equally accurate estimates by at least 25% for all tasks.

Moreover, we also find that the confidence-driven approach increases the effective sample size compared to the human + LLM (non-adaptive) baseline. For example, to achieve neffective=1000subscript𝑛effective1000n_{\mathrm{effective}}=1000italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT = 1000, the confidence-driven approach requires 200 (respectively, 750) fewer human annotations than the non-adaptive baseline for Oagreementsubscript𝑂agreementO_{\mathrm{agreement}}italic_O start_POSTSUBSCRIPT roman_agreement end_POSTSUBSCRIPT (respectively, β1ppsubscript𝛽1pp\beta_{\mathrm{1pp}}italic_β start_POSTSUBSCRIPT 1 roman_p roman_p end_POSTSUBSCRIPT). The confidence-driven approach therefore leads to a further reduction in the required number of human annotations compared to an approach that leverages LLMs, but does so non-adaptively. Moreover, notice that the non-adaptive approach can sometimes even hurt compared to the human-only baseline: in the two politeness tasks, using LLMs actually reduces the effective sample size.

Table 1 summarizes the gain in effective sample size for nhuman=500subscript𝑛human500n_{\mathrm{human}}=500italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT = 500. Across the five tasks, the confidence-driven approach achieves a substantial gain in the effective sample size, providing at minimum around +30% gain (when estimating pleftsubscript𝑝leftp_{\mathrm{left}}italic_p start_POSTSUBSCRIPT roman_left end_POSTSUBSCRIPT), going even over +300% (when estimating β1ppsubscript𝛽1pp\beta_{\mathrm{1pp}}italic_β start_POSTSUBSCRIPT 1 roman_p roman_p end_POSTSUBSCRIPT). Again, the confidence-driven approach achieves a higher gain than the non-adaptive approach for each task, which can even be negative (when estimating βhedgesubscript𝛽hedge\beta_{\mathrm{hedge}}italic_β start_POSTSUBSCRIPT roman_hedge end_POSTSUBSCRIPT and β1ppsubscript𝛽1pp\beta_{\mathrm{1pp}}italic_β start_POSTSUBSCRIPT 1 roman_p roman_p end_POSTSUBSCRIPT).

Coverage.

The save in human annotations does not come at the cost of diminished validity. As expected, across the five target quantities, the confidence-driven approach has coverage around or over 90%, as do the non-adaptive and human-only baselines (Fig. 2). However, LLM-only intervals have a much lower coverage, only being around 90% for prightsubscript𝑝rightp_{\mathrm{right}}italic_p start_POSTSUBSCRIPT roman_right end_POSTSUBSCRIPT, and otherwise ranging between 0% (Oagreementsubscript𝑂agreementO_{\mathrm{agreement}}italic_O start_POSTSUBSCRIPT roman_agreement end_POSTSUBSCRIPT) and 70% (β1ppsubscript𝛽1pp\beta_{\mathrm{1pp}}italic_β start_POSTSUBSCRIPT 1 roman_p roman_p end_POSTSUBSCRIPT). This emphasizes how estimates only relying on LLM annotations can be misleading. Notably, when estimating Oagreementsubscript𝑂agreementO_{\mathrm{agreement}}italic_O start_POSTSUBSCRIPT roman_agreement end_POSTSUBSCRIPT using LLM annotations only, the odds-ratio estimate points in the wrong direction (Oagreement>1subscript𝑂agreement1O_{\mathrm{agreement}}>1italic_O start_POSTSUBSCRIPT roman_agreement end_POSTSUBSCRIPT > 1 while Oagreement<1subscript𝑂agreement1O_{\mathrm{agreement}}<1italic_O start_POSTSUBSCRIPT roman_agreement end_POSTSUBSCRIPT < 1 is true), as illustrated in Fig. 2. Interestingly, the overall inter-annotator agreement between human and LLM annotations is the highest in this setting (Cohen inter-rater agreement κstance=0.57subscript𝜅stance0.57\kappa_{\mathrm{stance}}=0.57italic_κ start_POSTSUBSCRIPT roman_stance end_POSTSUBSCRIPT = 0.57). This suggests that even when LLM annotations overall agree with human annotations, downstream statistical estimates relying on LLM annotations only can be biased.

Table 1 summarizes the achieved coverage for nhuman=500subscript𝑛human500n_{\mathrm{human}}=500italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT = 500. Across the five tasks, the confidence-driven and non-adaptive approaches achieve around or over 90% coverage (note that small deviations are possible due to only 100 simulation trials). In contrast, the LLM-only approach only meets the requirement for prightsubscript𝑝rightp_{\mathrm{right}}italic_p start_POSTSUBSCRIPT roman_right end_POSTSUBSCRIPT and otherwise severely undercovers.

In summary, our method increases the effective sample size given a fixed budget of human annotations, leading to a substantial save in budget, while maintaining the target coverage.

5 Discussion

In this work, we introduce Confidence-Driven Inference, a method that integrates verbalized confidence of LLMs with active inference to optimally combine human and LLM annotations. Across three distinct CSS settings, results demonstrate that the proposed method consistently outperforms baseline methods (human-only and non-adaptive approaches) in effective sample size. Moreover, the increase in the effective sample size is achieved without a decrease in coverage. In contrast, the LLM-only approach yields invalid estimates and considerably lower coverage.

Thus, Confidence-Driven Inference allows for researchers to allocate human and LLM annotations in a cost-effective manner while maintaining confidence in the statistical validity of their results. Furthermore, Confidence-Driven Inference also addresses the challenges posed by the variable quality of LLM annotation, by providing validity guarantees when leveraging imperfect LLM annotations.

Although overall LLM annotations moderately agree with human annotations in the tested settings, relying on LLM annotations only can lead to wrong conclusions, as shown in the example of estimating the odds ratio in the stance setting. In contrast, despite the fact that LLM annotations are imperfect, our approach allows carefully combining them with a limited set of human annotations in order to reduce the human annotation budget, without sacrificing the validity.

Finally, given the growing interest in harnessing the capabilities of LLMs across disciplines, the accessibility of a method is an important consideration. Researchers can simply prompt the LLM for its confidence level via API access and leverage Confidence-Driven Inference to combine LLM confidence with LLM and human annotations to produce a valid statistical estimate. This approach can be applied to a wide range of tasks, across fields.

6 Limitations

The external validity of our findings is contingent upon two key assumptions: that the text instances are i.i.d. from a relevant distribution, and that the researcher has full control of the annotation process. The first may be violated if the distribution of texts shifts over time, and the collected instances are no longer representative of the current quantity of interest. For example, it is possible that relationships between linguistic devices and perceived politeness evolve over time. The second assumption may be violated in situations where certain annotations are difficult to obtain (e.g., for low-resource languages). Our approach may lead to inaccurate or misleading conclusions under either violation. We thus caution against generalizing to settings where text instances exhibit time-varying shifts or the researcher is not in control over the data collection process.

If the adaptive sampling probabilities πisubscript𝜋𝑖\pi_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are poorly chosen—potentially due to inaccurate verbalized confidence scores—the resulting estimates could have a higher mean squared error (MSE) than if uniform, non-adaptive sampling were used. This could even result in an estimate with a larger MSE than the human-only baseline (for sensitivity to miscalibration, see App B.2). However, by using power tuning, as detailed in Section 3.2, we ensure that incorporating LLM annotation into the estimation process does not hurt the estimate (i.e., does not increase the MSE) regardless of the sampling method used for human annotations (whether uniform or adaptive). That said, if the confidence scores are not reliable (i.e., they are poorly calibrated), the estimate may become too noisy, leading to overly wide confidence intervals.

We tested only a limited number of LLMs. We note that establishing a comprehensive benchmark is beyond the scope of this work (see App. B.1 for performance details using a different model).

Additionally, while we treat human annotations as the gold standard in our study, we acknowledge that human annotations are biased, and that reasonable annotators can disagree. Future work could explore ways to account for variability and bias in human annotations.

Human annotations are often obtained through crowdsourcing, which may itself be influenced by LLMs, as crowd workers might use LLMs to increase productivity Veselovsky et al. (2023). Although we use datasets collected before the widespread availability of LLMs, detecting AI-generated text remains a challenge Verma et al. (2024); Weber-Wulff et al. (2023).

This work only conducted experiments on estimation tasks within CSS datasets and only in English. However, Confidence-Driven Inference is generalizable to other types of text-based datasets, and it would be valuable to see more diverse applications in future research.

Lastly, the presented experiments do not address causal effects. For instance, in the context of politeness, to identify the causal effect of hedging on perceived politeness, it would be necessary to compare texts that are otherwise identical but differ only in their use of hedging. Nevertheless, while these evaluations are not causal, our method is still applicable for use in causal estimation.

7 Ethical Implications

Our work assumes that the existing human annotations within the leveraged datasets serve as the gold standard. However, we caution against interpreting human annotations as definitive judgments, given the subjective nature of many tasks Fleisig et al. (2023), the potential for annotator disagreement Weerasooriya et al. (2023), and the influence of annotator positionality Santy et al. (2023), beliefs, biases Sap et al. (2022), as well as variance in cultural Huang and Yang (2023) and social norms Ziems et al. (2023).

In addition to their use in text analysis, LLMs may hold potential for simulating human behavior in social science research, including applications such as pretesting surveys and imputing missing data Bail (2024). Our work contributes to establishing reliable principles for doing so. At the same time, we do not advocate for using LLMs as substitutes for human data beyond the constraints of our assumptions, especially seeing that prior studies have shown that LLMs tend to reflect the perspectives of some demographic groups more accurately than others Santurkar et al. (2023) and may propagate stereotypical portrayals  Cheng et al. (2023).

We also caution against fully replacing human annotators with LLM surrogates, which can not only be harmful for the economy Cazzaniga et al. (2024), but also exacerbate the exploitation of human labor Li et al. (2023a). Instead, our work highlights the benefits of human-AI collaboration, showing that a combined approach can yield more accurate and valid outcomes.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  • Allen et al. (1999) James E Allen, Curry I Guinn, and Eric Horvitz. 1999. Mixed-initiative interaction. IEEE Intelligent Systems and their Applications, 14(5):14–23.
  • Angelopoulos et al. (2023a) Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. 2023a. Prediction-powered inference. Science, 382(6671):669–674.
  • Angelopoulos et al. (2023b) Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. 2023b. PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453.
  • Atreja et al. (2024) Shubham Atreja, Joshua Ashkinaze, Lingyao Li, Julia Mendelsohn, and Libby Hemphill. 2024. Prompt design matters for computational social science tasks but in unpredictable ways. arXiv preprint arXiv:2406.11980.
  • Bail (2024) Christopher A Bail. 2024. Can generative AI improve social science? Proceedings of the National Academy of Sciences, 121(21):e2314021121.
  • Baly et al. (2020) Ramy Baly, Giovanni Da San Martino, James Glass, and Preslav Nakov. 2020. We can detect your bias: Predicting the political ideology of news articles. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4982–4991.
  • Band et al. (2024) Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto. 2024. Linguistic calibration of long-form generations. In Proceedings of the Forty-first International Conference on Machine Learning.
  • Boyeau et al. (2024) Pierre Boyeau, Anastasios N Angelopoulos, Nir Yosef, Jitendra Malik, and Michael I Jordan. 2024. Autoeval done right: Using synthetic data for model evaluation. arXiv preprint arXiv:2403.07008.
  • Card et al. (2020) Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9263–9274.
  • Card and Smith (2018) Dallas Card and Noah A Smith. 2018. The importance of calibration for estimating proportions from annotations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1636–1646.
  • Cazzaniga et al. (2024) Mauro Cazzaniga, Ms Florence Jaumotte, Longji Li, Mr Giovanni Melina, Augustus J Panton, Carlo Pizzinelli, Emma J Rockall, and Ms Marina Mendes Tavares. 2024. Gen-AI: Artificial intelligence and the future of work. International Monetary Fund.
  • Chatzi et al. (2024) Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, and Manuel Gomez Rodriguez. 2024. Prediction-powered ranking of large language models. arXiv preprint arXiv:2402.17826.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794.
  • Cheng et al. (2023) Myra Cheng, Esin Durmus, and Dan Jurafsky. 2023. Marked personas: Using natural language prompts to measure stereotypes in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1504–1532.
  • Danescu-Niculescu-Mizil et al. (2013) Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. A computational approach to politeness with application to social factors. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 250–259.
  • Dodge et al. (2019) Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2185–2194.
  • Egami et al. (2024) Naoki Egami, Musashi Hinck, Brandon Stewart, and Hanying Wei. 2024. Using imperfect surrogates for downstream inference: Design-based supervised learning for social science applications of large language models. Advances in Neural Information Processing Systems, 36.
  • Feder et al. (2022) Amir Feder, Katherine A Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E Roberts, et al. 2022. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. Transactions of the Association for Computational Linguistics, 10:1138–1158.
  • Fleisig et al. (2023) Eve Fleisig, Rediet Abebe, and Dan Klein. 2023. When the majority is wrong: Modeling annotator disagreement for subjective tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6715–6726.
  • Geng et al. (2024) Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. 2024. A survey of confidence estimation and calibration in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6577–6595, Mexico City, Mexico. Association for Computational Linguistics.
  • Gilardi et al. (2023) Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30):e2305016120.
  • Gunjal et al. (2024) Anisha Gunjal, Jihan Yin, and Erhan Bas. 2024. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18135–18143.
  • Hmielowski et al. (2014) Jay D Hmielowski, Lauren Feldman, Teresa A Myers, Anthony Leiserowitz, and Edward Maibach. 2014. An attack on science? media use, trust in scientists, and perceptions of global warming. Public Understanding of Science, 23(7):866–883.
  • Huang and Yang (2023) Jing Huang and Diyi Yang. 2023. Culturally aware natural language inference. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7591–7609, Singapore. Association for Computational Linguistics.
  • Keith and O’Connor (2018) Katherine Keith and Brendan O’Connor. 2018. Uncertainty-aware generative models for inferring document class prevalence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4575–4585, Brussels, Belgium. Association for Computational Linguistics.
  • Kim et al. (2024) Hannah Kim, Kushan Mitra, Rafael Li Chen, Sajjadur Rahman, and Dan Zhang. 2024. MEGAnno+: A human-LLM collaborative annotation system. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 168–176, St. Julians, Malta. Association for Computational Linguistics.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.
  • Lam et al. (2024) Michelle S Lam, Janice Teoh, James A Landay, Jeffrey Heer, and Michael S Bernstein. 2024. Concept induction: Analyzing unstructured text with high-level concepts using lloom. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–28.
  • Landers and Behrend (2023) Richard N Landers and Tara S Behrend. 2023. Auditing the AI auditors: A framework for evaluating fairness and bias in high stakes AI predictive models. American Psychologist, 78(1):36.
  • Li et al. (2023a) Hanlin Li, Nicholas Vincent, Stevie Chancellor, and Brent Hecht. 2023a. The dimensions of data labor: A road map for researchers, activists, and policymakers to empower data producers. In Proceedings of the 2023 ACM conference on fairness, accountability, and transparency, pages 1151–1161.
  • Li et al. (2023b) Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023b. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464, Singapore. Association for Computational Linguistics.
  • Li et al. (2023c) Minzhi Li, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy Chen, Zhengyuan Liu, and Diyi Yang. 2023c. Coannotating: Uncertainty-guided work allocation between human and large language models for data annotation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1487–1505.
  • Luo et al. (2020) Yiwei Luo, Dallas Card, and Dan Jurafsky. 2020. Detecting stance in media on global warming. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3296–3315.
  • Margatina et al. (2021) Katerina Margatina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. 2021. Active learning by acquiring contrastive examples. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 650–663, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Newman et al. (2008) Matthew L Newman, Carla J Groom, Lori D Handelman, and James W Pennebaker. 2008. Gender differences in language use: An analysis of 14,000 text samples. Discourse processes, 45(3):211–236.
  • Park et al. (2023) Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22.
  • Pham et al. (2024) Chau Pham, Alexander Hoyle, Simeng Sun, Philip Resnik, and Mohit Iyyer. 2024. TopicGPT: A prompt-based topic modeling framework. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2956–2984.
  • Ribeiro et al. (2018) Filipe Ribeiro, Lucas Henrique, Fabricio Benevenuto, Abhijnan Chakraborty, Juhi Kulshrestha, Mahmoudreza Babaei, and Krishna Gummadi. 2018. Media bias monitor: Quantifying biases of social media news outlets at large-scale. In Proceedings of the International AAAI Conference on Web and Social Media, volume 12.
  • Robertson et al. (2018) Ronald E Robertson, Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. 2018. Auditing partisan audience bias within google search. Proceedings of the ACM on human-computer interaction, 2(CSCW):1–22.
  • Saad-Falcon et al. (2024) Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. Ares: An automated evaluation framework for retrieval-augmented generation systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 338–354.
  • Santurkar et al. (2023) Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. Whose opinions do language models reflect? In International Conference on Machine Learning, pages 29971–30004. PMLR.
  • Santy et al. (2023) Sebastin Santy, Jenny Liang, Ronan Le Bras, Katharina Reinecke, and Maarten Sap. 2023. NLPositionality: Characterizing design biases of datasets and models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9080–9102, Toronto, Canada. Association for Computational Linguistics.
  • Sap et al. (2022) Maarten Sap, Swabha Swayamdipta, Laura Vianna, Xuhui Zhou, Yejin Choi, and Noah A. Smith. 2022. Annotators with attitudes: How annotator beliefs and identities bias toxic language detection. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5884–5906, Seattle, United States. Association for Computational Linguistics.
  • Sclar et al. (2023) Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2023. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations.
  • Sim et al. (2013) Yanchuan Sim, Brice D. L. Acree, Justin H. Gross, and Noah A. Smith. 2013. Measuring ideological proportions in political speeches. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 91–101, Seattle, Washington, USA. Association for Computational Linguistics.
  • Thapa et al. (2023) Surendrabikram Thapa, Usman Naseem, and Mehwish Nasim. 2023. From humans to machines: Can chatgpt-like llms effectively replace human annotators in nlp tasks. In Workshop Proceedings of the 17th International AAAI Conference on Web and Social Media.
  • Tian et al. (2023) Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5433–5442.
  • Van der Vaart (2000) Aad W Van der Vaart. 2000. Asymptotic statistics, volume 3. Cambridge university press.
  • Verma et al. (2024) Vivek Verma, Eve Fleisig, Nicholas Tomlin, and Dan Klein. 2024. Ghostbuster: Detecting text ghostwritten by large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1702–1717.
  • Veselovsky et al. (2023) Veniamin Veselovsky, Manoel Horta Ribeiro, and Robert West. 2023. Artificial artificial artificial intelligence: Crowd workers widely use large language models for text production tasks. arXiv preprint arXiv:2306.07899.
  • Voigt et al. (2017) Rob Voigt, Nicholas P Camp, Vinodkumar Prabhakaran, William L Hamilton, Rebecca C Hetey, Camilla M Griffiths, David Jurgens, Dan Jurafsky, and Jennifer L Eberhardt. 2017. Language from police body camera footage shows racial disparities in officer respect. Proceedings of the National Academy of Sciences, 114(25):6521–6526.
  • Weber-Wulff et al. (2023) Debora Weber-Wulff, Alla Anohina-Naumeca, Sonja Bjelobaba, Tomáš Foltỳnek, Jean Guerrero-Dib, Olumide Popoola, Petr Šigut, and Lorna Waddington. 2023. Testing of detection tools for AI-generated text. International Journal for Educational Integrity, 19(1):26.
  • Weerasooriya et al. (2023) Tharindu Cyril Weerasooriya, Alexander Ororbia, Raj Bhensadadia, Ashiqur KhudaBukhsh, and Christopher Homan. 2023. Disagreement matters: Preserving label diversity by jointly modeling item and annotator label distributions with DisCo. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4679–4695, Toronto, Canada. Association for Computational Linguistics.
  • Wei et al. (2023) Johnny Tian-Zheng Wei, Frederike Zufall, and Robin Jia. 2023. Operationalizing content moderation" accuracy" in the digital services act. arXiv preprint arXiv:2305.09601.
  • Weidinger et al. (2022) Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, Atoosa Kasirzadeh, et al. 2022. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 214–229.
  • Zhang et al. (2023) Ruoyu Zhang, Yanzeng Li, Yongliang Ma, Ming Zhou, and Lei Zou. 2023. LLMaAA: Making large language models as active annotators. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13088–13103, Singapore. Association for Computational Linguistics.
  • Zhou et al. (2024) Kaitlyn Zhou, Jena D Hwang, Xiang Ren, and Maarten Sap. 2024. Relying on the unreliable: The impact of language models’ reluctance to express uncertainty. arXiv preprint arXiv:2401.06730.
  • Ziems et al. (2023) Caleb Ziems, Jane Dwivedi-Yu, Yi-Chia Wang, Alon Halevy, and Diyi Yang. 2023. Normbank: A knowledge bank of situational social norms. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7756–7776.
  • Ziems et al. (2024) Caleb Ziems, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, and Diyi Yang. 2024. Can large language models transform computational social science? Computational Linguistics, 50(1):237–291.
  • Zrnic and Candès (2024) Tijana Zrnic and Emmanuel J Candès. 2024. Active statistical inference. In Proceedings of the Forty-first International Conference on Machine Learning.

Appendix A Further Details on the Method

A.1 Confidence Intervals

We compute the confidence intervals following the approach in Zrnic and Candès (2024). Suppose that θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT is possibly d𝑑ditalic_d-dimensional (such as in, for example, linear or logistic regression), and we are interested in coefficient j𝑗jitalic_j. If d=1𝑑1d=1italic_d = 1, such as in the case of prevalence estimation, then j𝑗jitalic_j is always equal to 1. We compute the confidence interval as:

C1α=(θ^jconf±z1α/2Σ^jjn),subscript𝐶1𝛼plus-or-minussubscriptsuperscript^𝜃conf𝑗subscript𝑧1𝛼2subscript^Σ𝑗𝑗𝑛C_{1-\alpha}=\left(\hat{\theta}^{\mathrm{conf}}_{j}\pm z_{1-\alpha/2}\sqrt{% \frac{\widehat{\Sigma}_{jj}}{n}}\right),italic_C start_POSTSUBSCRIPT 1 - italic_α end_POSTSUBSCRIPT = ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ± italic_z start_POSTSUBSCRIPT 1 - italic_α / 2 end_POSTSUBSCRIPT square-root start_ARG divide start_ARG over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_j italic_j end_POSTSUBSCRIPT end_ARG start_ARG italic_n end_ARG end_ARG ) ,

where z1α/2subscript𝑧1𝛼2z_{1-\alpha/2}italic_z start_POSTSUBSCRIPT 1 - italic_α / 2 end_POSTSUBSCRIPT is the 1α/21𝛼21-\alpha/21 - italic_α / 2 quantile of the standard normal distribution. The matrix Σ^^Σ\widehat{\Sigma}over^ start_ARG roman_Σ end_ARG is an estimate of the covariance of θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT, given by:

Σ^=H^1Var^(λ^θ^conf+(θ^confλ^θ^conf)ξπ)H^1,^Σsuperscript^𝐻1^Var𝜆subscript^superscript^𝜃confsubscriptsuperscript^𝜃conf𝜆subscript^superscript^𝜃conf𝜉𝜋superscript^𝐻1\widehat{\Sigma}=\hat{H}^{-1}\widehat{\mathrm{Var}}\left(\lambda\nabla\hat{% \ell}_{\hat{\theta}^{\mathrm{conf}}}+(\nabla\ell_{\hat{\theta}^{\mathrm{conf}}% }-\lambda\nabla\hat{\ell}_{\hat{\theta}^{\mathrm{conf}}})\frac{\xi}{\pi}\right% )\hat{H}^{-1},over^ start_ARG roman_Σ end_ARG = over^ start_ARG italic_H end_ARG start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over^ start_ARG roman_Var end_ARG ( italic_λ ∇ over^ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT end_POSTSUBSCRIPT + ( ∇ roman_ℓ start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT end_POSTSUBSCRIPT - italic_λ ∇ over^ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) divide start_ARG italic_ξ end_ARG start_ARG italic_π end_ARG ) over^ start_ARG italic_H end_ARG start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ,

where H^=𝔼^[2θ^conf]^𝐻^𝔼delimited-[]superscript2subscriptsuperscript^𝜃conf\hat{H}=\hat{\mathbb{E}}[\nabla^{2}\ell_{\hat{\theta}^{\mathrm{conf}}}]over^ start_ARG italic_H end_ARG = over^ start_ARG blackboard_E end_ARG [ ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_ℓ start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ] is the empirical estimate of the Hessian at θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT and Var^^Var\widehat{\mathrm{Var}}over^ start_ARG roman_Var end_ARG denotes the empirical variance. Recall also the short-hand notation θ=θ(X,H)subscript𝜃subscript𝜃𝑋𝐻\ell_{\theta}=\ell_{\theta}(X,H)roman_ℓ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT = roman_ℓ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_X , italic_H ) and ^θ=θ(X,H^)subscript^𝜃subscript𝜃𝑋^𝐻\hat{\ell}_{\theta}=\ell_{\theta}(X,\hat{H})over^ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT = roman_ℓ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_X , over^ start_ARG italic_H end_ARG ). This is a generalization of the usual “sandwich” covariance used in linear regression.

Some estimation targets, such as the odds ratio, are not M-estimators but are functions of M-estimators. In those cases a confidence interval is obtained by additionally applying the delta method.

See Zrnic and Candès (2024) for further details.

A.2 Power Tuning

Power tuning, introduced by Angelopoulos et al. (2023b), refers to choosing λ𝜆\lambdaitalic_λ so that the MSE of θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT, or equivalently its variance, is minimized over λ𝜆\lambdaitalic_λ. Since Σ^jjsubscript^Σ𝑗𝑗\widehat{\Sigma}_{jj}over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_j italic_j end_POSTSUBSCRIPT is a quadratic in λ𝜆\lambdaitalic_λ, the optimal λ𝜆\lambdaitalic_λ has a closed-form analytical expression. As before, suppose we are interesting in estimating coordinate j𝑗jitalic_j of θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT. Let hjsubscript𝑗h_{j}italic_h start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT denote the j𝑗jitalic_j-th column of H^1superscript^𝐻1\hat{H}^{-1}over^ start_ARG italic_H end_ARG start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT. Then, we set λ𝜆\lambdaitalic_λ according to:

λ=hCov^h2hVar^h,𝜆superscripttop^Cov2superscripttop^Var\lambda=\frac{h^{\top}~{}\widehat{\mathrm{Cov}}~{}h}{2h^{\top}~{}\widehat{% \mathrm{Var}}~{}h},italic_λ = divide start_ARG italic_h start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT over^ start_ARG roman_Cov end_ARG italic_h end_ARG start_ARG 2 italic_h start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT over^ start_ARG roman_Var end_ARG italic_h end_ARG ,

where Cov^:=Cov^(^θ^conf(ξπ1),θ^confξπ)+Cov^(θ^confξπ,^θ^conf(ξπ1))assign^Cov^Covsubscript^superscript^𝜃conf𝜉𝜋1subscriptsuperscript^𝜃conf𝜉𝜋^Covsubscriptsuperscript^𝜃conf𝜉𝜋subscript^superscript^𝜃conf𝜉𝜋1\widehat{\mathrm{Cov}}:=\widehat{\mathrm{Cov}}(\nabla\hat{\ell}_{\hat{\theta}^% {\mathrm{conf}}}(\frac{\xi}{\pi}-1),\nabla\ell_{\hat{\theta}^{\mathrm{conf}}}% \frac{\xi}{\pi})+\widehat{\mathrm{Cov}}(\nabla\ell_{\hat{\theta}^{\mathrm{conf% }}}\frac{\xi}{\pi},\nabla\hat{\ell}_{\hat{\theta}^{\mathrm{conf}}}(\frac{\xi}{% \pi}-1))over^ start_ARG roman_Cov end_ARG := over^ start_ARG roman_Cov end_ARG ( ∇ over^ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( divide start_ARG italic_ξ end_ARG start_ARG italic_π end_ARG - 1 ) , ∇ roman_ℓ start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT end_POSTSUBSCRIPT divide start_ARG italic_ξ end_ARG start_ARG italic_π end_ARG ) + over^ start_ARG roman_Cov end_ARG ( ∇ roman_ℓ start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT end_POSTSUBSCRIPT divide start_ARG italic_ξ end_ARG start_ARG italic_π end_ARG , ∇ over^ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( divide start_ARG italic_ξ end_ARG start_ARG italic_π end_ARG - 1 ) ) and Var^:=Var^(^θ^conf(ξπ1))assign^Var^Varsubscript^superscript^𝜃conf𝜉𝜋1\widehat{\mathrm{Var}}:=\widehat{\mathrm{Var}}(\nabla\hat{\ell}_{\hat{\theta}^% {\mathrm{conf}}}(\frac{\xi}{\pi}-1))over^ start_ARG roman_Var end_ARG := over^ start_ARG roman_Var end_ARG ( ∇ over^ start_ARG roman_ℓ end_ARG start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( divide start_ARG italic_ξ end_ARG start_ARG italic_π end_ARG - 1 ) ) are empirical (co)variances. See Angelopoulos et al. (2023b) for further details.

A.3 LLM and Human Annotation Details

For data annotation, we use GPT-4o (gpt-4o-2024-05-13 version) and GPT-3.5-turbo (gpt-3.5-turbo-0125 version). Prompt texts in both stages are listed in Table 2. To test LLM performance out-of-the-box, all annotations are collected using zero-shot prompting. We set the max_tokens parameter to 5, use default temperature (1), and the default system prompt and the other prompting parameters.

Stage 1 GPT-4o annotations are in moderate agreement with human annotations in all three settings: κpoliteness=0.39subscript𝜅politeness0.39\kappa_{\mathrm{politeness}}=0.39italic_κ start_POSTSUBSCRIPT roman_politeness end_POSTSUBSCRIPT = 0.39, κstance=0.57subscript𝜅stance0.57\kappa_{\mathrm{stance}}=0.57italic_κ start_POSTSUBSCRIPT roman_stance end_POSTSUBSCRIPT = 0.57, and κbias=0.43subscript𝜅bias0.43\kappa_{\mathrm{bias}}=0.43italic_κ start_POSTSUBSCRIPT roman_bias end_POSTSUBSCRIPT = 0.43.

In Stage 2, we find that the collected verbalized confidence scores are calibrated with the Stage 1 accuracy (Fig. 3 (right)), such that higher confidence scores correspond to higher accuracy with respect to human annotations. This implies that verbalized confidence is indeed an informative signal to leverage in estimation tasks. Histograms of the collected verbalized confidence scores are illustrated in Fig. 3 (left). We also observe a variance in the verbalized confidence within each setting, and a relative lack of overconfident responses (where the model is 100% certain).

We choose the sampling probabilities πisubscript𝜋𝑖\pi_{i}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT according to the theory of Zrnic and Candès (2024). For estimating the prevalences pleftsubscript𝑝leftp_{\mathrm{left}}italic_p start_POSTSUBSCRIPT roman_left end_POSTSUBSCRIPT and prightsubscript𝑝rightp_{\mathrm{right}}italic_p start_POSTSUBSCRIPT roman_right end_POSTSUBSCRIPT, as well as the odds ratio Oagreementsubscript𝑂agreementO_{\mathrm{agreement}}italic_O start_POSTSUBSCRIPT roman_agreement end_POSTSUBSCRIPT, we choose πierr^i(Ci)proportional-tosubscript𝜋𝑖subscript^err𝑖subscript𝐶𝑖\pi_{i}\propto\sqrt{\widehat{\texttt{err}}_{i}(C_{i})}italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∝ square-root start_ARG over^ start_ARG err end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG, as described in Section 3.2. For the logistic regression coefficient βhedgesubscript𝛽hedge\beta_{\mathrm{hedge}}italic_β start_POSTSUBSCRIPT roman_hedge end_POSTSUBSCRIPT (respectively, β1ppsubscript𝛽1pp\beta_{\mathrm{1pp}}italic_β start_POSTSUBSCRIPT 1 roman_p roman_p end_POSTSUBSCRIPT), we set πierr^i(Ci)|Xih|proportional-tosubscript𝜋𝑖subscript^err𝑖subscript𝐶𝑖superscriptsubscript𝑋𝑖top\pi_{i}\propto\sqrt{\widehat{\texttt{err}}_{i}(C_{i})}\cdot|X_{i}^{\top}h|italic_π start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∝ square-root start_ARG over^ start_ARG err end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG ⋅ | italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_h |, where hhitalic_h is the column of H^^𝐻\widehat{H}over^ start_ARG italic_H end_ARG (defined in App. A.1) corresponding to Xhedgesubscript𝑋hedgeX_{\mathrm{hedge}}italic_X start_POSTSUBSCRIPT roman_hedge end_POSTSUBSCRIPT (respectively, X1ppsubscript𝑋1ppX_{\mathrm{1pp}}italic_X start_POSTSUBSCRIPT 1 roman_p roman_p end_POSTSUBSCRIPT).

To fit err^isubscript^err𝑖\widehat{\texttt{err}}_{i}over^ start_ARG err end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, we train an XGBoost Chen and Guestrin (2016) model. For all problem settings, we use the same training parameters: number of boosting rounds 2000, step size 0.001, maximum depth 3, and squared-error objective.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 3: Histograms and calibration curves of verbalized confidence scores. (Left) Confidence score histograms across the three settings (GPT-4o). (Right) LLM annotation accuracy with respect to human annotations (y-axis), among instances where the confidence score is greater than C (x-axis) across the three settings (GPT-4o).

A.4 Computation of Evaluation Metrics

We provide further details behind the computation of our two main metrics, effective sample size and coverage. For all problem settings, we run 100100100100 simulation trials. All experiments were run on a single CPU.

Effective sample size.

Recall that we define the effective sample size of a method as the hypothetical value neffectivesubscript𝑛effectiven_{\mathrm{effective}}italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT such that MSE(θ^method)=MSE(θ^neffectivehuman)MSEsuperscript^𝜃methodMSEsubscriptsuperscript^𝜃humansubscript𝑛effective\mathrm{MSE}(\hat{\theta}^{\mathrm{method}})=\mathrm{MSE}(\hat{\theta}^{% \mathrm{human}}_{n_{\mathrm{effective}}})roman_MSE ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_method end_POSTSUPERSCRIPT ) = roman_MSE ( over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_human end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT end_POSTSUBSCRIPT ), where θ^neffectivehumansubscriptsuperscript^𝜃humansubscript𝑛effective\hat{\theta}^{\mathrm{human}}_{n_{\mathrm{effective}}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_human end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT end_POSTSUBSCRIPT is obtained via the human-only approach with neffectivesubscript𝑛effectiven_{\mathrm{effective}}italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT annotations. Since all approaches but the LLM-only approach are unbiased in the large-sample limit, meaning their estimate has mean exactly equal to θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, the MSE is simply equal to the estimator variance. Estimator variance is used in the confidence interval construction and is estimated as Σ^/n^Σ𝑛\widehat{\Sigma}/nover^ start_ARG roman_Σ end_ARG / italic_n, as explained in App. A.1. The different baselines differ in their choice of λ𝜆\lambdaitalic_λ and π𝜋\piitalic_π in the definition of Σ^^Σ\widehat{\Sigma}over^ start_ARG roman_Σ end_ARG. We thus compute the effective sample size as Σ^jjhuman/Σ^jjn,subscriptsuperscript^Σhuman𝑗𝑗subscript^Σ𝑗𝑗𝑛\widehat{\Sigma}^{\mathrm{human}}_{jj}/\widehat{\Sigma}_{jj}\cdot n,over^ start_ARG roman_Σ end_ARG start_POSTSUPERSCRIPT roman_human end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j italic_j end_POSTSUBSCRIPT / over^ start_ARG roman_Σ end_ARG start_POSTSUBSCRIPT italic_j italic_j end_POSTSUBSCRIPT ⋅ italic_n , where j𝑗jitalic_j indexes the coordinate of θ^confsuperscript^𝜃conf\hat{\theta}^{\mathrm{conf}}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT roman_conf end_POSTSUPERSCRIPT when the estimate has more than one dimension. The final reported effective sample size is the mean of these values over 100100100100 trials.

Coverage.

We estimate coverage over 100100100100 trials. For all methods but LLM only, the trials differ in the random annotation decisions ξisubscript𝜉𝑖\xi_{i}italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT that determine which points get human-annotated, and we average 0/1 indicators of coverage over those trials. For LLM only, since we only have one fixed dataset of n𝑛nitalic_n LLM annotations, in order to estimate coverage we simulate random draws from a population via the bootstrap. In other words, in each trial we draw n𝑛nitalic_n LLM annotations with replacement, form a classical confidence interval using those points, and record a 0/1 indicator of coverage.

Setting Stage Prompt text
Politeness Stage 1 Is the following text polite? Output either A or B. Output a letter only.
A) Polite
B) Impolite
Text: <text>
Answer:
Politeness Stage 2 How likely is it that the following text is <previously provided answer: polite or impolite>?
Output the probability only (a number between 0 and 1).
Text: <text>
Answer:
Stance Stage 1 A statement can agree, be neutral, or disagree with the statement: “Climate change/global warming is a serious concern”. Classify the following statement into one of the three categories. Output either A, B, or C. Output a letter only.
A) Agree
B) Neutral
C) Disagree
Statement: <text>
Answer:
Stance Stage 2 How likely is it that the following text <previously provided answer: agrees, neither agrees nor disagrees, or disagrees> with the statement: “Climate change/global warming is a serious concern”?
Output the probability only (a number between 0 and 1).
Text: <text>
Probability:
Bias Stage 1 What is the political bias of the following article? Output either A,B, or C. Output a letter only.
A) Left
B) Center
C) Right
Article: <text>
Answer:
Bias Stage 2 How likely is it that the following article has a <previously provided answer: left-leaning, centrist, or right-leaning> political bias? Output the probability only (a number between 0 and 1).
Text: <text>
Probability:
Table 2: Complete prompt texts. LLM annotation prompts across the three settings, for Stages 1 and 2.

Appendix B Supplementary Results

B.1 An Alternative LLM

Figure 4 and Table 3 summarize the results using GPT-3.5. For nhuman=500subscript𝑛human500n_{\mathrm{human}}=500italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT = 500, in each estimation task the confidence-driven approach again achieves a higher gain in effective sample size than the non-adaptive approach. Moreover, it always achieves a positive gain. In contrast, the non-adaptive approach achieves a negative gain in three out of the five estimation tasks (both politeness estimates and the stance estimate). The confidence-driven and non-adaptive approaches always achieve over 90% coverage. In contrast, LLM-only coverage is always poor using GPT-3.5 (while using GPT-4o it was poor on four out of the five estimation tasks).

B.2 Sensitivity to Miscalibrated LLM Confidence Scores

A calibrated LLM will produce higher confidence scores when annotations are in agreement with human annotations, compared to when annotations are in disagreement with human annotations. However, calibration of confidence scores across tasks is not guaranteed.

To understand how the performance of Confidence-Driven Inference is affected by the calibration of confidence scores, we conducted a robustness test, adding noise to confidence scores to simulate miscalibration. In particular, for illustration we consider the task of analyzing stance on global warming. We add a varying amount of normally distributed noise 𝒩(0,σ2)𝒩0superscript𝜎2\mathcal{N}(0,\sigma^{2})caligraphic_N ( 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) to the collected confidence scores Cisubscript𝐶𝑖C_{i}italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, and truncate the sum to [0,1]01[0,1][ 0 , 1 ] to obtain a probability.

We use a t-test to test the difference in calibration score means when LLM and human annotations agree, vs when LLM and human annotations disagree. If the t-statistic is large (equivalently, the corresponding p-value is small), that suggests that the two means differ significantly. As the random noise added to the confidence scores increases, the scores become less calibrated (Table 4). We expect that our method performs worse in terms of neffectivesubscript𝑛effectiven_{\mathrm{effective}}italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT when the confidence scores are miscalibrated, although coverage should be maintained.

As predicted, as the amount of miscalibration in the confidence scores increases, the gain in the effective sample size decreases (Table 4). The confidence-driven approach achieves the highest gain for the smallest amount of noise, though it always achieves a positive gain. This suggests that the approach is robust to poor confidence scores. Furthermore, Confidence-Driven Inference achieves near 90% coverage or higher in each setting, regardless of the amount of miscalibration. Finally, we observe that Confidence-Driven Inference achieves a higher gain than the non-adaptive approach regardless of the extent of miscalibration. This can be explained through the power tuning parameter λ𝜆\lambdaitalic_λ; even when the confidence scores provide no signal, power tuning makes sure that LLM annotations are leveraged effectively.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 4: Confidence intervals, effective sample size, and coverage (GPT-3.5). Rows correspond to different estimation tasks. The first column shows the confidence intervals in five random trials. The vertical dashed line corresponds to the estimate produced on the full dataset. A method is valid if its confidence interval includes this estimate (in about 90% of the trials), and tighter intervals around θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT indicates better performance. The second and third columns display the effective sample size neffectivesubscript𝑛effectiven_{\mathrm{effective}}italic_n start_POSTSUBSCRIPT roman_effective end_POSTSUBSCRIPT and coverage, respectively, for different values of the human annotation budget nhumansubscript𝑛humann_{\mathrm{human}}italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT. Results are estimated over 100 trials.
Estimation task Metric Method
confidence-driven human + LLM (non-adaptive) LLM only
Politeness devices (hedge) Gain in eff. sample size ( 31.51 ±plus-or-minus\pm± 7.81)% ( -12.23 ±plus-or-minus\pm± 9.18)%
Coverage 92% 92% 39%
Politeness devices (1st person plural) Gain in eff. sample size ( 321.00 ±plus-or-minus\pm± 19.01)% ( -5.77±plus-or-minus\pm± 30.83)%
Coverage 97% 92% 67%
Stance on global warming Gain in eff. sample size ( 24.66 ±plus-or-minus\pm± 16.24)% ( -12.13 ±plus-or-minus\pm± 16.95)%
Coverage 92% 94% 0%
Political bias (left-leaning) Gain in eff. sample size ( 17.08 ±plus-or-minus\pm± 6.30)% ( 3.83 ±plus-or-minus\pm± 4.86)%
Coverage 93% 98% 18%
Political bias (right-leaning) Gain in eff. sample size ( 23.10 ±plus-or-minus\pm± 7.50)% ( 11.09 ±plus-or-minus\pm± 5.73)%
Coverage 95% 94% 11%
Table 3: Results summary (GPT-3.5). Gain in effective sample size and coverage across the five estimation tasks for nhuman=500subscript𝑛human500n_{\mathrm{human}}=500italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT = 500, estimated over 100 trials. In each task, the confidence-driven approach achieves a higher gain in effective sample size (bolded) than the non-adaptive approach. Confidence-driven approach always achieves a  positive gain, while the non-adaptive approach sometimes achieves a  negative gain. Confidence-driven and non-adaptive approaches achieve near 90% coverage, or higher. In contrast, LLM-only coverage is poor. Gain in effective sample size is not estimated for the LLM-only approach as it does not leverage human annotations. Errors show a standard deviation over 100 trials.
Method Confidence score calibration t-test Metric
Gain in eff. sample size Coverage
nhuman=500subscript𝑛human500n_{\mathrm{human}}=500italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT = 500
LLM only 0%
human + LLM (non-adaptive) -3.46% 100%
confidence-driven (σ2=0superscript𝜎20\sigma^{2}=0italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0) t=9.08𝑡9.08t=9.08italic_t = 9.08, p=2.19×1019𝑝2.19superscript1019p=2.19\times 10^{-19}italic_p = 2.19 × 10 start_POSTSUPERSCRIPT - 19 end_POSTSUPERSCRIPT 43.48% 94%
confidence-driven (σ2=0.2superscript𝜎20.2\sigma^{2}=0.2italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.2) t=5.53𝑡5.53t=5.53italic_t = 5.53, p=3.65×108𝑝3.65superscript108p=3.65\times 10^{-8}italic_p = 3.65 × 10 start_POSTSUPERSCRIPT - 8 end_POSTSUPERSCRIPT 40.70% 95%
confidence-driven (σ2=0.4superscript𝜎20.4\sigma^{2}=0.4italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.4) t=3.30𝑡3.30t=3.30italic_t = 3.30, p=0.000994𝑝0.000994p=0.000994italic_p = 0.000994 42.82% 94%
confidence-driven (σ2=0.6superscript𝜎20.6\sigma^{2}=0.6italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.6) t=2.34𝑡2.34t=2.34italic_t = 2.34, p=0.0192𝑝0.0192p=0.0192italic_p = 0.0192 39.57% 93%
confidence-driven (σ2=0.8superscript𝜎20.8\sigma^{2}=0.8italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.8) t=1.88𝑡1.88t=1.88italic_t = 1.88, p=0.0606𝑝0.0606p=0.0606italic_p = 0.0606 40.87% 94%
nhuman=1150subscript𝑛human1150n_{\mathrm{human}}=1150italic_n start_POSTSUBSCRIPT roman_human end_POSTSUBSCRIPT = 1150
LLM only 0%
human + LLM (non-adaptive) 17.02% 100%
confidence-driven (σ2=0superscript𝜎20\sigma^{2}=0italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0) t=9.08𝑡9.08t=9.08italic_t = 9.08, p=2.19×1019𝑝2.19superscript1019p=2.19\times 10^{-19}italic_p = 2.19 × 10 start_POSTSUPERSCRIPT - 19 end_POSTSUPERSCRIPT 28.14% 99%
confidence-driven (σ2=0.2superscript𝜎20.2\sigma^{2}=0.2italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.2) t=5.29𝑡5.29t=5.29italic_t = 5.29, p=1.36×107𝑝1.36superscript107p=1.36\times 10^{-7}italic_p = 1.36 × 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT 25.30% 100%
confidence-driven (σ2=0.4superscript𝜎20.4\sigma^{2}=0.4italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.4) t=3.22𝑡3.22t=3.22italic_t = 3.22, p=0.00130𝑝0.00130p=0.00130italic_p = 0.00130 25.84% 97%
confidence-driven (σ2=0.6superscript𝜎20.6\sigma^{2}=0.6italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.6) t=2.23𝑡2.23t=2.23italic_t = 2.23, p=0.0257𝑝0.0257p=0.0257italic_p = 0.0257 26.81% 100%
confidence-driven (σ2=0.8superscript𝜎20.8\sigma^{2}=0.8italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.8) t=1.83𝑡1.83t=1.83italic_t = 1.83, p=0.0831𝑝0.0831p=0.0831italic_p = 0.0831 25.71% 99%
Table 4: Sensitivity to confidence score calibration. Gain in effective sample size and coverage for the LLM only, human + LLM (non-adaptive), and confidence-driven approaches, given varying amounts of miscalibration in confidence scores (σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT). Results are presented for the task of analyzing stance on global warming, estimated over 100 trials. The t-test tests for the difference in calibration score means when LLM and human annotations agree, vs when LLM and human annotations disagree (larger t𝑡titalic_t means difference is more significant). The confidence-driven approach achieves the largest gain for the smallest amount of noise (bolded), and it always achieves a  positive gain. For each amount of miscalibration, the confidence-driven approach achieves a higher gain in effective sample size than the non-adaptive approach; it also achieves near 90% coverage or higher in each setting. In contrast, LLM-only coverage is poor. Gain in effective sample size is not estimated for the LLM-only approach as it does not leverage human annotations.