Can Unconfident LLM Annotations Be Used for Confident Conclusions?

Kristina Gligorić^⋆ Tijana Zrnic^⋆ Cinoo Lee^⋆ Emmanuel J. Candès Dan Jurafsky
Stanford University
{gligoric, tijana.zrnic, cinoolee, candes, jurafsky}@stanford.edu

Abstract

Large language models (LLMs) have shown high agreement with human raters across a variety of tasks, demonstrating potential to ease the challenges of human data collection. In computational social science (CSS), researchers are increasingly leveraging LLM annotations to complement slow and expensive human annotations. Still, guidelines for collecting and using LLM annotations, without compromising the validity of downstream conclusions, remain limited. We introduce Confidence-Driven Inference: a method that combines LLM annotations and LLM confidence indicators to strategically select which human annotations should be collected, with the goal of producing accurate statistical estimates and provably valid confidence intervals while reducing the number of human annotations needed. Our approach comes with safeguards against LLM annotations of poor quality, guaranteeing that the conclusions will be both valid and no less accurate than if we only relied on human annotations. We demonstrate the effectiveness of Confidence-Driven Inference over baselines in statistical estimation tasks across three CSS settings—text politeness, stance, and bias—reducing the needed number of human annotations by over 25% in each. Although we use CSS settings for demonstration, Confidence-Driven Inference can be used to estimate most standard quantities across a broad range of NLP problems.

Kristina Gligorić^⋆ Tijana Zrnic^⋆ Cinoo Lee^⋆ Emmanuel J. Candès Dan Jurafsky Stanford University {gligoric, tijana.zrnic, cinoolee, candes, jurafsky}@stanford.edu

^†^†^⋆Equal contribution.

1 Introduction

Large language models (LLMs) have shown strong zero-shot performance across tasks Kojima et al. (2022), making them a promising tool for generating annotations, particularly when they align closely with human judgments Ziems et al. (2024). Given this potential, LLM annotations of textual data may be effectively leveraged for statistical estimation, hypothesis testing, and theory development Park et al. (2023), as well as informing policy decisions Wei et al. (2023).

Refer to caption — Figure 1: Illustration of Confidence-Driven Inference. Given a text corpus and a quantity of interest $\theta^{*}$ , (1) we collect LLM annotations and indicators of LLM confidence, based on which we strategically choose a small number of human annotations; (2) we then produce an unbiased estimate $\hat{\theta}^{\mathrm{conf}}$ and a valid confidence interval, allowing valid downstream conclusions.

Computational Social Science (CSS) research typically focuses not on the annotations themselves but on the social-science insights and conclusions they enable. Thus, understanding how LLM annotations could be used for downstream inferences is crucial in CSS. For example, stance annotations facilitate the study of linguistic differences between media affirming or denying global warming Luo et al. (2020), while politeness annotations can help examine racial disparities in verbal interactions with law enforcement Voigt et al. (2017), the relationship between politeness and social power Danescu-Niculescu-Mizil et al. (2013), and politeness and gender Newman et al. (2008). Similarly, annotating political leanings in text allows studying the bias of search engines Robertson et al. (2018), social media Ribeiro et al. (2018), and political discourse Sim et al. (2013). Precise statistical estimation, such as prevalence or regression coefficient estimation, is essential for drawing valid conclusions in such studies.

However, whether LLM annotations can be effectively leveraged without compromising the validity of statistical estimation remains uncertain. LLMs exhibit demographic biases Weidinger et al. (2022); Cheng et al. (2023) and may lack factual accuracy Gunjal et al. (2024); Li et al. (2023b) and consistency Sclar et al. (2023); Atreja et al. (2024). Given these limitations, using LLMs without caution may lead to inaccurate conclusions and potential societal harms, especially when such conclusions influence policy or have tangible impacts on peoples’ outcomes Landers and Behrend (2023). A potential solution is to rely solely on human annotations; however, human annotations are costly.

Here, we present Confidence-Driven Inference, a method for valid statistical inference using LLM annotations. Given a text corpus and a quantity of interest, our approach builds on active inference Zrnic and Candès (2024) to: (1) strategically choose a small number of human annotations, guided by LLM annotations and the LLM’s verbalized confidence scores, and (2) combine the human and LLM annotations into an accurate estimate of the quantity of interest (Fig. 1). The resulting estimate is statistically valid, while reducing reliance on expensive human annotations.

Our task is statistical estimation of a quantity of interest. We evaluate our approach on five estimation tasks in three CSS settings (politeness, stance, and media bias) in terms of confidence interval coverage and effective sample size, which measures the increase in accuracy due to augmenting human with LLM annotations (Sec. 3.4). We find that naively treating LLM annotations as human data can lead to highly inaccurate estimates and poor coverage. At the same time, our method maintains the target coverage, while outperforming the baselines (defined in Sec. 3.3) in terms of the effective sample size. The latter is enabled partially by the fact that in all tested settings the confidence scores are reflective of LLM accuracy.

Confidence-Driven Inference can be used to estimate a wide range of standard targets (such as regression coefficients, means, and prevalences) across various NLP problems. Our code and data are available at https://github.com/kristinagligoric/confidence-driven-inference.

2 Background

2.1 LLMs for Data Annotation Tasks

LLMs have shown great potential in handling text-annotation tasks without prior task-specific training, sometimes even outperforming crowd workers Gilardi et al. (2023). NLP, LLMs offer transformative opportunities for any discipline that relies on text as data. Fields such as psychology, political science, sociology, communications, and economics recognize this emerging technology’s potential to enhance simulation-based research Bail (2024), and facilitate tasks such as text analysis, concept induction Lam et al. (2024), and topic modeling Pham et al. (2024).

However, despite their promise, limited research has explored how to harness the potential of LLMs in ways that are both cost-effective and statistically reliable. Our work addresses this gap.

2.2 Collaborative Annotation Paradigms

Much of past work frames human and LLM annotations as competing alternatives, with a focus on determining which is superior Thapa et al. (2023). More recent work increasingly calls for a collaborative approach that leverages the complementary strengths of both Allen et al. (1999). These collaborative paradigms aim to balance annotation quality and cost by combining human expertise and LLM efficiency Li et al. (2023c); Kim et al. (2024).

In the spirit of these collaborative paradigms, our work uses LLM confidence to efficiently and cost-effectively allocate annotation tasks, while also ensuring that the statistical inferences derived from the annotated data are valid.

2.3 Valid Statistical Inferences in NLP

Statistical inference is vital in NLP research. For example, model evaluation requires determining whether a model performs better than a baseline Card et al. (2020), which in turn relies on making valid conclusions about whether one is observing meaningful model improvements or noise Dodge et al. (2019). Chatzi et al. (2024) and Boyeau et al. (2024) leverage prediction-powered inference Angelopoulos et al. (2023a, b) for valid ranking of LLMs. A similar approach is adopted by Saad-Falcon et al. (2024) to evaluate Retrieval-Augmented Generation (RAG) systems.

Beyond model evaluation, NLP applications involve producing measurements, descriptive statistics, and causal effect estimates Feder et al. (2022); Card and Smith (2018). Notably, Keith and O’Connor (2018) introduced the problem of scientifically valid prevalence estimation. They construct Bayesian confidence intervals by proposing a generative model for text documents. We contribute to the existing literature by proposing an entirely model-free approach that is applicable to a broad range of target quantities.

Lastly, Egami et al. (2024) consider the problem of valid statistical inference when combining human and LLM annotations. However, they collect the human annotations for uniformly sampled instances, without adapting to the difficulty of annotation. Given the promise of active learning Zhang et al. (2023); Margatina et al. (2021), we develop an adaptive approach that samples a limited number of human annotations strategically. At a technical level, our approach builds on active inference Zrnic and Candès (2024), which can be seen as a refinement of prediction-powered inference Angelopoulos et al. (2023a, b) that uses active data collection for improved efficiency. Furthermore, we make use of power tuning Angelopoulos et al. (2023b), a technique that ensures that incorporating LLM annotations into the estimation can never be worse than ignoring them completely.

3 Methods

3.1 Problem Setup

We have a text corpus consisting of $n$ independent and identically distributed (i.i.d.) instances $T_{1},\dots,T_{n}$ . We wish to estimate a quantity of interest $\theta^{*}$ , such as the prevalence of political bias in the corpus or the causal effect of using certain linguistic markers on the perceived sentiment. To perform the estimation, we require human annotations $H_{1},\dots,H_{n}$ corresponding to $T_{1},\dots,T_{n}$ . For example, $H_{i}$ might indicate whether $T_{i}$ contains political bias, or assess the perceived politeness of $T_{i}$ . In addition to human annotations, we may also have other readily-available information about $T_{i}$ —covariates $X_{i}$ such as the source of $T_{i}$ or indicators of whether $T_{i}$ contains certain linguistic markers, computed via a lexicon. Note that $X_{i}$ is available automatically, without needing human annotation. We use the short-hand notation $T=(T_{1},\dots,T_{n})$ and define $X$ and $H$ similarly.

The quantity $\theta^{*}$ can be estimated via an estimator $\hat{\theta}(X,H)$ , which we will denote by $\hat{\theta}$ for short. The accuracy of $\hat{\theta}$ improves as the number of samples $n$ increases ( $\hat{\theta}$ recovers $\theta^{*}$ as $n$ approaches infinity). We assume that $\hat{\theta}$ is an M-estimator Van der Vaart (2000), meaning it can be written as

\hat{\theta}=\operatorname*{arg\,min}_{\theta}\frac{1}{n}\sum_{i=1}^{n}\ell_{% \theta}(X_{i},H_{i}),

(1)

for a loss function $\ell_{\theta}$ that is convex in $\theta$ . Important special cases include the mean label, $\hat{\theta}=\frac{1}{n}\sum_{i=1}^{n}H_{i}$ , and linear regression coefficients, which are pervasive in CSS. Other examples include quantiles, logistic, and other regression coefficients. Notice that in some cases, like calculating the mean, the loss function only depends on $H_{i}$ .

Our goal is to produce an estimate of $\theta^{*}$ with uncertainty—by providing a confidence interval at a pre-specified level $(1-\alpha)$ —with limited access to human annotations. Specifically, we can only collect $n_{\mathrm{human}}\ll n$ annotations (on average). This means that the “ideal estimate” (1) is out of reach.

To supplement the costly human annotations, we assume access to LLM annotations $\hat{H}_{i}$ for all $n$ instances. However, we make no assumption that the LLM annotations are good: we want to produce a valid confidence interval no matter the quality of the LLM, though we anticipate better gains when their quality is high (i.e., lower mean squared error and a smaller confidence interval).

3.2 Confidence-Driven Inference

We combine LLM annotations with strategically chosen human annotations to produce an unbiased estimate $\hat{\theta}^{\mathrm{conf}}$ that lends itself to a confidence interval that is both valid and tight around $\theta^{*}$ . In particular, in the large-sample limit, the mean of the estimate is exactly $\theta^{*}$ , no matter how biased the LLM annotations are.

We first explain how to choose the set of instances to be human-annotated, which is crucial for producing an accurate estimate. We collect a human annotation $H_{i}$ for instance $T_{i}$ with probability $\pi_{i}$ . We let $\xi_{i}=\mathbf{1}\{H_{i}\text{ collected}\}$ denote the indicator of whether $T_{i}$ has been human-annotated. Zrnic and Candès (2024) show that the optimal choice of $\pi_{i}$ is to sample according to the uncertainty of the predicted annotation; roughly speaking, for most estimation problems the optimal rule is

\pi_{i}^{*}\propto\sqrt{\mathbb{E}[(\hat{H}_{i}-H_{i})^{2}|T_{i}]},

where $\propto$ hides the normalization required to meet the budget, $\mathbb{E}[\sum_{i=1}^{n}\xi_{i}]=\sum_{i=1}^{n}\pi_{i}^{*}=n_{\mathrm{human}}$ . Of course, since $H_{i}$ is unknown, $\pi_{i}^{*}$ is unattainable.

A key idea behind our method is to approximate $\pi^{*}_{i}$ by querying the LLM for verbalized confidence. Since RLHF may cause overconfidence Geng et al. (2024); Zhou et al. (2024) and miscalibration Band et al. (2024); Achiam et al. (2023) of the LLM’s conditional token probabilities, verbalized probabilities, i.e., expressions of confidence in token-space, are better-calibrated Tian et al. (2023). Therefore, to collect confidence scores, we adopt the verbalized two-stage prompting approach introduced by Tian et al. (2023), where the model is first asked to provide an answer via zero-shooting and afterward asked to assign a probability to the correctness of the answer. This gives us a confidence score $C_{i}\in[0,1]$ for each instance $T_{i}$ . In our applications, we find that the verbalized confidence scores are calibrated (Fig. 3 (right)), meaning that higher confidence scores correspond to higher accuracy with respect to human annotations.

As we collect human annotations, we use $\{(C_{j},(\hat{H}_{j}-H_{j})^{2})\}_{j<i,\xi_{j}=1}$ as feature–label pairs to train a black-box predictor $\widehat{\texttt{err}}_{i}$ . In other words, we train a model to predict the LLM error from its confidence. Finally, we set

\pi_{i}\propto\sqrt{\widehat{\texttt{err}}_{i}(C_{i})},

normalized so that $\mathbb{E}[\sum_{i=1}^{n}\xi_{i}]=\sum_{i=1}^{n}\pi_{i}=n_{\mathrm{human}}$ . In practice we do not fine-tune $\widehat{\texttt{err}}_{i}$ at every step $i$ , but we do so periodically, after reasonably large batches of data (say, every 50 or 100 data points). See App. A.3 for further details behind the sampling and Table 2 for prompt texts.

After we have collected the human annotations according to $\pi_{i}$ , building on active inference Zrnic and Candès (2024) we compute a confidence-driven estimate of $\theta^{*}$ :

\hat{\theta}^{\mathrm{conf}}=\operatorname*{arg\,min}_{\theta}\frac{1}{n}\sum_% {i=1}^{n}\left(\lambda\hat{\ell}_{\theta,i}+(\ell_{\theta,i}-\lambda\hat{\ell}% _{\theta,i})\frac{\xi_{i}}{\pi_{i}}\right),

(2)

where we denote $\ell_{\theta,i}=\ell_{\theta}(X_{i},H_{i})$ and $\hat{\ell}_{\theta,i}=\ell_{\theta}(X_{i},\hat{H}_{i})$ , and $\lambda\in[0,1]$ is a carefully chosen tuning parameter. Notice that every summand in (2) is in expectation over $\xi_{i}$ equal to $\ell_{\theta}(X_{i},H_{i})$ , and thus the loss (2) is on average equal to “ideal” loss (1). This allows showing that, in the limit, $\hat{\theta}^{\mathrm{conf}}$ is on average exactly equal to $\theta^{*}$ , no matter the bias in the LLM annotations. To give one example, if we want to estimate the mean of $H_{i}$ , $\hat{\theta}^{\mathrm{conf}}$ reduces to

\hat{\theta}^{\mathrm{conf}}=\frac{1}{n}\sum_{i=1}^{n}\left(\lambda\hat{H}_{i}% +(H_{i}-\lambda\hat{H}_{i})\frac{\xi_{i}}{\pi_{i}}\right).

Notice that $\mathbb{E}[\hat{\theta}^{\mathrm{conf}}]=\mathbb{E}[H_{i}]=\theta^{*}$ . The parameter $\lambda$ is called a power-tuning parameter Angelopoulos et al. (2023b), and it interpolates between ignoring the LLM annotations ( $\lambda=0$ ) and utilizing them fully ( $\lambda=1$ ). We set $\lambda$ optimally, so that the mean squared error (MSE) of $\hat{\theta}^{\mathrm{conf}}$ is minimized over $\lambda$ . This means that, given any sampling rule $\pi_{i}$ , the confidence-driven estimator can never be hurt by leveraging erroneous LLM annotations oder miscalibrated confidence scores. The estimator is at least as good as when $\lambda=0$ . Details behind the optimization of $\lambda$ are in App. A.2.

Finally, applying the theoretical guarantees of Zrnic and Candès (2024), we form a valid confidence interval at level $1-\alpha$ as

C_{1-\alpha}=(\hat{\theta}^{\mathrm{conf}}\pm z_{1-\alpha/2}\hat{\sigma}_{% \mathrm{se}}),

where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution and $\hat{\sigma}_{\mathrm{se}}$ is a standard error estimate that has a closed form, stated in App. A.1.

3.3 Baselines

Human + LLM (non-adaptive).

The first baseline incorporates LLM annotations but does not adapt to the per-instance confidence or accuracy of the LLM—it equally trusts all LLM annotations. In particular, this baseline is a special case of $\hat{\theta}^{\mathrm{conf}}$ with $\lambda=1$ and uniform sampling probabilities $\pi_{i}=\frac{n_{\mathrm{human}}}{n}$ . This is the method evaluated and studied by Egami et al. (2024).

Human only.

The second baseline ignores LLM annotations and simply applies the standard estimator to human annotations. It collects each human annotation with equal probability, $\frac{n_{\mathrm{human}}}{n}$ , so that $n_{\mathrm{human}}$ annotations are collected on average. This is the “classical” approach, and it can be thought of as erring on the side of caution and ignoring potentially biased LLM outputs. Since the baseline only collects human annotations, it allows forming a valid confidence interval via classical statistics. This approach is equivalent to $\hat{\theta}^{\mathrm{conf}}$ with $\lambda=0$ .

LLM only.

Finally, we consider the naive baseline which treats LLM annotations as human annotations, applying the standard estimator to those annotations and naively forming a confidence interval. This baseline does not suffer from a budget constraint, since LLM annotations are assumed to be cheap and available for all $n$ instances, but it may be biased if the LLM produces biased outputs.

3.4 Evaluation Metrics

We evaluate our approach and the baselines in terms of effective sample size and coverage. The effective sample size measures the increase in accuracy achieved by incorporating LLM annotations alongside human annotations. This is akin to getting more value out of each human annotation. For instance, if one has only 100 human annotations but combines them effectively with a larger pool of LLM annotations, the resulting accuracy could be comparable to having 150 human annotations. The latter metric, coverage, evaluates the statistical validity of the approaches by capturing how often the true value $\theta^{*}$ falls within the produced confidence interval. In the following we elaborate on the two metrics, deferring further details behind their computation to App. A.4.

Effective sample size.

Given an estimate $\hat{\theta}^{\mathrm{method}}$ produced by a method, we define the effective sample size as the hypothetical value $n_{\mathrm{effective}}$ such that $\mathrm{MSE}(\hat{\theta}^{\mathrm{method}})=\mathrm{MSE}(\hat{\theta}^{% \mathrm{human}}_{n_{\mathrm{effective}}})$ , where $\hat{\theta}^{\mathrm{human}}_{n_{\mathrm{effective}}}$ is obtained via the human-only approach with $n_{\mathrm{effective}}$ annotations. In other words, $\hat{\theta}^{\mathrm{method}}$ is as accurate as the “classical” estimate with $n_{\mathrm{effective}}$ human annotations. An equivalent definition says that $n_{\mathrm{effective}}$ is the sample size for which the confidence interval around $\hat{\theta}^{\mathrm{method}}$ is of equal width as the classical confidence interval around $\hat{\theta}^{\mathrm{human}}_{n_{\mathrm{effective}}}$ . We thus have that $n_{\mathrm{effective}}-n_{\mathrm{human}}$ is the benefit (if positive) or harm (if negative) of using LLM annotations. We also report the gain in effective sample size, defined as $(n_{\mathrm{effective}}-n_{\mathrm{human}})/n_{\mathrm{human}}\cdot 100\%$ . The effective sample size of the human-only approach is always $n_{\mathrm{human}}$ . We only report the effective sample size for approaches that use human annotations, i.e. all but LLM only, because the effective sample size measures the increase in value of the human annotations.

Coverage.

Coverage is defined as the rate at which the confidence intervals produced by each method cover $\theta^{*}$ . Since $\theta^{*}$ is an ideal estimate that would require infinite data, we cannot know $\theta^{*}$ exactly in our applications. Instead, as a proxy, we compute coverage with respect to the estimate (1) on the full dataset. We compute the intervals with a target coverage rate of $90\%$ . Note that, following the theory of Zrnic and Candès (2024), the coverage of our method is provably equal to $90\%$ , and the same is true of the other two statistically valid baselines (our numbers will be slightly upward biased due to the fact that we use a proxy for $\theta^{*}$ ). With this in mind, the main purpose of reporting coverage is to evaluate the performance of the LLM only approach; for all other methods, we show coverage as a proof of concept.

4 Results

We evaluate our approach on a set of CSS problems that rely on statistical estimation. We aim to include settings that (1) allow addressing important downstream social-science questions, (2) rely on a human-labeled corpus of text instances (possibly with relevant additional covariates), and (3) have a publicly available dataset. We selected three settings that meet these criteria—politeness, stance, and political bias. For stance and politeness, we leverage publicly available datasets and the corresponding human annotations in their entirety. Given the large size, for political leaning, we randomly sample a smaller subset of texts.

4.1 Estimation tasks

Estimation task	Metric	Method
Estimation task	Metric	confidence-driven	human + LLM (non-adaptive)	LLM only
Politeness devices (hedge)	Gain in eff. sample size	( $\pm$ 7.82)%	( $\pm$ 8.08)%	—
Politeness devices (hedge)	Coverage
Politeness devices (1st person pl.)	Gain in eff. sample size	( $\pm$ 22.09)%	( $\pm$ 30.09)%	—
Politeness devices (1st person pl.)	Coverage		%
Stance on global warming	Gain in eff. sample size	( $\pm$ 25.06)%	( $\pm$ 25.67)%	—
Stance on global warming	Coverage
Political bias (left-leaning)	Gain in eff. sample size	( $\pm$ 8.19)%	( $\pm$ 6.33)%	—
Political bias (left-leaning)	Coverage
Political bias (right-leaning)	Gain in eff. sample size	( $\pm$ 11.48)%	( $\pm$ 8.75)%	—
Political bias (right-leaning)	Coverage

Table 1: Results summary. Gain in effective sample size and coverage across the five estimation tasks for

n_{\mathrm{human}}=500

, estimated over 100 trials. In each task, the confidence-driven approach achieves a higher gain in effective sample size (bolded) than the non-adaptive approach. Confidence-driven approach always achieves a , while the non-adaptive approach sometimes achieves a . Confidence-driven and non-adaptive approaches achieve , or higher. In contrast, LLM-only coverage is often . Gain in effective sample size is not estimated for the LLM-only approach as it does not leverage human annotations. Errors show a standard deviation over 100 trials.

Politeness.

Texts from online requests posted on Stack Exchange and Wikipedia ( $n=5,480$ ) can be seen as polite or impolite. Politeness annotations help understand how linguistic devices impact perceived politeness Danescu-Niculescu-Mizil et al. (2013). In this estimation task, $\theta^{*}$ corresponds to the logistic regression coefficient $\beta_{\mathrm{hedge}}$ measuring the impact of a linguistic feature such as hedging on the perceived politeness, $\mathrm{logit}(P(H_{\mathrm{polite}}=~{}1|X_{\mathrm{hedge}}))=\beta_{0}+\beta% _{\mathrm{hedge}}X_{\mathrm{hedge}}$ , where $X_{\mathrm{hedge}}=1$ indicates the presence of the hedge marker and $H_{\mathrm{polite}}=1$ indicates annotation as polite. We similarly estimate $\beta_{\mathrm{1pp}}$ , the impact of the use of the first person plural pronouns on the perceived politeness.

Stance.

News headlines ( $n=2,300$ ) are agreeing, neutral, or disagreeing with the stance that global warming is a serious concern Luo et al. (2020). Stance annotations facilitate the study of linguistic differences between media supporting or rejecting global warming, which have important implications for communication and policy Hmielowski et al. (2014). In this task, we estimate $\theta^{*}$ corresponding to $O_{\mathrm{agreement}}$ , the odds ratio of agreement given the presence of affirming devices such as “expert,” “proven,” “renowned,” and so on. Formally, denoting by $X_{\mathrm{affirm}}\in\{0,1\}$ the presence of an affirming device and $H_{\mathrm{agree}}\in\{0,1\}$ the annotation of agreement, we have

O_{\mathrm{agreement}}=\frac{\mu_{\mathrm{agree|affirm}}/(1-\mu_{\mathrm{agree% |affirm}})}{\mu_{\mathrm{agree|\neg affirm}}/(1-\mu_{\mathrm{agree|\neg affirm% }})},

where $\mu_{\mathrm{agree|affirm}}=P(H_{\mathrm{agree}}=1|X_{\mathrm{affirm}}=~{}1)$ and $\mu_{\mathrm{agree|\neg affirm}}=P(H_{\mathrm{agree}}=1|X_{\mathrm{affirm}}=0)$ . Indicators for affirming devices were extracted using a lexicon derived by Luo et al. (2020).

Political bias.

News texts (randomly sampled $n=2,000$ ) are either leaning left, center, or right Baly et al. (2020). Annotating political leanings in text allows studying the bias in media outlets, socio-technical systems, or historical and contemporary public discourse. Such biases are often reported in terms of prevalence statistics. Thus, in this setting $\theta^{*}$ corresponds to the prevalence of a leaning, i.e., $p_{\mathrm{lean}}=P(H_{\mathrm{lean}}=1)$ , where $H_{\mathrm{lean}}\in\{0,1\}$ denotes the presence of a leaning. We estimate $p_{\mathrm{left}}$ and $p_{\mathrm{right}}$ , the prevalences of left- and right-leaning articles in the corpus.

4.2 Evaluation

Our main evaluation is based on LLM annotations collected with GPT-4o; analogous results with GPT-3.5 can be found in App. B.1. Table 2 in App. A.3 lists prompt texts and parameters. To test LLM performance out of the box, all annotations are collected using zero-shot prompting. Overall, the confidence scores are calibrated with accuracy, but the annotations are only in moderate agreement with human annotations in all three settings (see App. A.3). This is aligned with our lack of assumption that the LLM annotations are good: we want to produce a valid confidence interval no matter the quality of the LLM annotations.

We report the two key metrics (effective sample size and coverage), for the three selected settings (the study of politeness, stance, and bias), where the task is to estimate the five target quantities $\beta_{\mathrm{hedge}}$ , $\beta_{\mathrm{1pp}}$ , $O_{\mathrm{agreement}}$ , $p_{\mathrm{left}}$ , and $p_{\mathrm{right}}$ . Both metrics are estimated over 100 trials for varying $n_{\mathrm{human}}$ , the budget for human annotations.

Our main findings are reported in Figure 2 and summarized in Table 1. Figure 2 also depicts the computed confidence intervals in five randomly chosen trials across the five target quantities, for the lowest value of $n_{\mathrm{human}}$ in the considered range.

Effective sample size.

First, across the five target quantities, we find that Confidence-Driven Inference increases the effective sample size compared to the human-only baseline. For a given budget of $n_{\mathrm{human}}$ annotations, e.g., $n_{\mathrm{human}}=1000$ , the confidence-driven approach achieves the effective sample size at minimum 1250 (when estimating $p_{\mathrm{left}}$ ). This means that the confidence interval around the estimated statistic is of equal width as the confidence interval produced with a larger number of human-only annotations.

Similarly, it is informative to consider the necessary budget of human annotations $n_{\mathrm{human}}$ given a desired effective sample size $n_{\mathrm{effective}}$ . To achieve a desired fixed effective sample size, the confidence-driven approach reduces the needed number of human annotations across all five target quantities. For instance, to achieve $n_{\mathrm{effective}}=1000$ , only between around 250 ( $\beta_{\mathrm{1pp}}$ ) and 750 ( $p_{\mathrm{left}}$ ) human annotations are needed, thus reducing the number of human annotations needed to achieve equally accurate estimates by at least 25% for all tasks.

Moreover, we also find that the confidence-driven approach increases the effective sample size compared to the human + LLM (non-adaptive) baseline. For example, to achieve $n_{\mathrm{effective}}=1000$ , the confidence-driven approach requires 200 (respectively, 750) fewer human annotations than the non-adaptive baseline for $O_{\mathrm{agreement}}$ (respectively, $\beta_{\mathrm{1pp}}$ ). The confidence-driven approach therefore leads to a further reduction in the required number of human annotations compared to an approach that leverages LLMs, but does so non-adaptively. Moreover, notice that the non-adaptive approach can sometimes even hurt compared to the human-only baseline: in the two politeness tasks, using LLMs actually reduces the effective sample size.

Table 1 summarizes the gain in effective sample size for $n_{\mathrm{human}}=500$ . Across the five tasks, the confidence-driven approach achieves a substantial gain in the effective sample size, providing at minimum around +30% gain (when estimating $p_{\mathrm{left}}$ ), going even over +300% (when estimating $\beta_{\mathrm{1pp}}$ ). Again, the confidence-driven approach achieves a higher gain than the non-adaptive approach for each task, which can even be negative (when estimating $\beta_{\mathrm{hedge}}$ and $\beta_{\mathrm{1pp}}$ ).

Coverage.

The save in human annotations does not come at the cost of diminished validity. As expected, across the five target quantities, the confidence-driven approach has coverage around or over 90%, as do the non-adaptive and human-only baselines (Fig. 2). However, LLM-only intervals have a much lower coverage, only being around 90% for $p_{\mathrm{right}}$ , and otherwise ranging between 0% ( $O_{\mathrm{agreement}}$ ) and 70% ( $\beta_{\mathrm{1pp}}$ ). This emphasizes how estimates only relying on LLM annotations can be misleading. Notably, when estimating $O_{\mathrm{agreement}}$ using LLM annotations only, the odds-ratio estimate points in the wrong direction ( $O_{\mathrm{agreement}}>1$ while $O_{\mathrm{agreement}}<1$ is true), as illustrated in Fig. 2. Interestingly, the overall inter-annotator agreement between human and LLM annotations is the highest in this setting (Cohen inter-rater agreement $\kappa_{\mathrm{stance}}=0.57$ ). This suggests that even when LLM annotations overall agree with human annotations, downstream statistical estimates relying on LLM annotations only can be biased.

Table 1 summarizes the achieved coverage for $n_{\mathrm{human}}=500$ . Across the five tasks, the confidence-driven and non-adaptive approaches achieve around or over 90% coverage (note that small deviations are possible due to only 100 simulation trials). In contrast, the LLM-only approach only meets the requirement for $p_{\mathrm{right}}$ and otherwise severely undercovers.

In summary, our method increases the effective sample size given a fixed budget of human annotations, leading to a substantial save in budget, while maintaining the target coverage.

5 Discussion

In this work, we introduce Confidence-Driven Inference, a method that integrates verbalized confidence of LLMs with active inference to optimally combine human and LLM annotations. Across three distinct CSS settings, results demonstrate that the proposed method consistently outperforms baseline methods (human-only and non-adaptive approaches) in effective sample size. Moreover, the increase in the effective sample size is achieved without a decrease in coverage. In contrast, the LLM-only approach yields invalid estimates and considerably lower coverage.

Thus, Confidence-Driven Inference allows for researchers to allocate human and LLM annotations in a cost-effective manner while maintaining confidence in the statistical validity of their results. Furthermore, Confidence-Driven Inference also addresses the challenges posed by the variable quality of LLM annotation, by providing validity guarantees when leveraging imperfect LLM annotations.

Although overall LLM annotations moderately agree with human annotations in the tested settings, relying on LLM annotations only can lead to wrong conclusions, as shown in the example of estimating the odds ratio in the stance setting. In contrast, despite the fact that LLM annotations are imperfect, our approach allows carefully combining them with a limited set of human annotations in order to reduce the human annotation budget, without sacrificing the validity.

Finally, given the growing interest in harnessing the capabilities of LLMs across disciplines, the accessibility of a method is an important consideration. Researchers can simply prompt the LLM for its confidence level via API access and leverage Confidence-Driven Inference to combine LLM confidence with LLM and human annotations to produce a valid statistical estimate. This approach can be applied to a wide range of tasks, across fields.

6 Limitations

The external validity of our findings is contingent upon two key assumptions: that the text instances are i.i.d. from a relevant distribution, and that the researcher has full control of the annotation process. The first may be violated if the distribution of texts shifts over time, and the collected instances are no longer representative of the current quantity of interest. For example, it is possible that relationships between linguistic devices and perceived politeness evolve over time. The second assumption may be violated in situations where certain annotations are difficult to obtain (e.g., for low-resource languages). Our approach may lead to inaccurate or misleading conclusions under either violation. We thus caution against generalizing to settings where text instances exhibit time-varying shifts or the researcher is not in control over the data collection process.

If the adaptive sampling probabilities $\pi_{i}$ are poorly chosen—potentially due to inaccurate verbalized confidence scores—the resulting estimates could have a higher mean squared error (MSE) than if uniform, non-adaptive sampling were used. This could even result in an estimate with a larger MSE than the human-only baseline (for sensitivity to miscalibration, see App B.2). However, by using power tuning, as detailed in Section 3.2, we ensure that incorporating LLM annotation into the estimation process does not hurt the estimate (i.e., does not increase the MSE) regardless of the sampling method used for human annotations (whether uniform or adaptive). That said, if the confidence scores are not reliable (i.e., they are poorly calibrated), the estimate may become too noisy, leading to overly wide confidence intervals.

We tested only a limited number of LLMs. We note that establishing a comprehensive benchmark is beyond the scope of this work (see App. B.1 for performance details using a different model).

Additionally, while we treat human annotations as the gold standard in our study, we acknowledge that human annotations are biased, and that reasonable annotators can disagree. Future work could explore ways to account for variability and bias in human annotations.

Human annotations are often obtained through crowdsourcing, which may itself be influenced by LLMs, as crowd workers might use LLMs to increase productivity Veselovsky et al. (2023). Although we use datasets collected before the widespread availability of LLMs, detecting AI-generated text remains a challenge Verma et al. (2024); Weber-Wulff et al. (2023).

This work only conducted experiments on estimation tasks within CSS datasets and only in English. However, Confidence-Driven Inference is generalizable to other types of text-based datasets, and it would be valuable to see more diverse applications in future research.

Lastly, the presented experiments do not address causal effects. For instance, in the context of politeness, to identify the causal effect of hedging on perceived politeness, it would be necessary to compare texts that are otherwise identical but differ only in their use of hedging. Nevertheless, while these evaluations are not causal, our method is still applicable for use in causal estimation.

7 Ethical Implications

Our work assumes that the existing human annotations within the leveraged datasets serve as the gold standard. However, we caution against interpreting human annotations as definitive judgments, given the subjective nature of many tasks Fleisig et al. (2023), the potential for annotator disagreement Weerasooriya et al. (2023), and the influence of annotator positionality Santy et al. (2023), beliefs, biases Sap et al. (2022), as well as variance in cultural Huang and Yang (2023) and social norms Ziems et al. (2023).

In addition to their use in text analysis, LLMs may hold potential for simulating human behavior in social science research, including applications such as pretesting surveys and imputing missing data Bail (2024). Our work contributes to establishing reliable principles for doing so. At the same time, we do not advocate for using LLMs as substitutes for human data beyond the constraints of our assumptions, especially seeing that prior studies have shown that LLMs tend to reflect the perspectives of some demographic groups more accurately than others Santurkar et al. (2023) and may propagate stereotypical portrayals Cheng et al. (2023).

We also caution against fully replacing human annotators with LLM surrogates, which can not only be harmful for the economy Cazzaniga et al. (2024), but also exacerbate the exploitation of human labor Li et al. (2023a). Instead, our work highlights the benefits of human-AI collaboration, showing that a combined approach can yield more accurate and valid outcomes.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Allen et al. (1999) James E Allen, Curry I Guinn, and Eric Horvitz. 1999. Mixed-initiative interaction. IEEE Intelligent Systems and their Applications, 14(5):14–23.
Angelopoulos et al. (2023a) Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. 2023a. Prediction-powered inference. Science, 382(6671):669–674.
Angelopoulos et al. (2023b) Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. 2023b. PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453.
Atreja et al. (2024) Shubham Atreja, Joshua Ashkinaze, Lingyao Li, Julia Mendelsohn, and Libby Hemphill. 2024. Prompt design matters for computational social science tasks but in unpredictable ways. arXiv preprint arXiv:2406.11980.
Bail (2024) Christopher A Bail. 2024. Can generative AI improve social science? Proceedings of the National Academy of Sciences, 121(21):e2314021121.
Baly et al. (2020) Ramy Baly, Giovanni Da San Martino, James Glass, and Preslav Nakov. 2020. We can detect your bias: Predicting the political ideology of news articles. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4982–4991.
Band et al. (2024) Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto. 2024. Linguistic calibration of long-form generations. In Proceedings of the Forty-first International Conference on Machine Learning.
Boyeau et al. (2024) Pierre Boyeau, Anastasios N Angelopoulos, Nir Yosef, Jitendra Malik, and Michael I Jordan. 2024. Autoeval done right: Using synthetic data for model evaluation. arXiv preprint arXiv:2403.07008.
Card et al. (2020) Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9263–9274.
Card and Smith (2018) Dallas Card and Noah A Smith. 2018. The importance of calibration for estimating proportions from annotations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1636–1646.
Cazzaniga et al. (2024) Mauro Cazzaniga, Ms Florence Jaumotte, Longji Li, Mr Giovanni Melina, Augustus J Panton, Carlo Pizzinelli, Emma J Rockall, and Ms Marina Mendes Tavares. 2024. Gen-AI: Artificial intelligence and the future of work. International Monetary Fund.
Chatzi et al. (2024) Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, and Manuel Gomez Rodriguez. 2024. Prediction-powered ranking of large language models. arXiv preprint arXiv:2402.17826.
Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794.
Cheng et al. (2023) Myra Cheng, Esin Durmus, and Dan Jurafsky. 2023. Marked personas: Using natural language prompts to measure stereotypes in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1504–1532.
Danescu-Niculescu-Mizil et al. (2013) Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. A computational approach to politeness with application to social factors. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 250–259.
Dodge et al. (2019) Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2185–2194.
Egami et al. (2024) Naoki Egami, Musashi Hinck, Brandon Stewart, and Hanying Wei. 2024. Using imperfect surrogates for downstream inference: Design-based supervised learning for social science applications of large language models. Advances in Neural Information Processing Systems, 36.
Feder et al. (2022) Amir Feder, Katherine A Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E Roberts, et al. 2022. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. Transactions of the Association for Computational Linguistics, 10:1138–1158.
Fleisig et al. (2023) Eve Fleisig, Rediet Abebe, and Dan Klein. 2023. When the majority is wrong: Modeling annotator disagreement for subjective tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6715–6726.
Geng et al. (2024) Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. 2024. A survey of confidence estimation and calibration in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6577–6595, Mexico City, Mexico. Association for Computational Linguistics.
Gilardi et al. (2023) Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30):e2305016120.
Gunjal et al. (2024) Anisha Gunjal, Jihan Yin, and Erhan Bas. 2024. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18135–18143.
Hmielowski et al. (2014) Jay D Hmielowski, Lauren Feldman, Teresa A Myers, Anthony Leiserowitz, and Edward Maibach. 2014. An attack on science? media use, trust in scientists, and perceptions of global warming. Public Understanding of Science, 23(7):866–883.
Huang and Yang (2023) Jing Huang and Diyi Yang. 2023. Culturally aware natural language inference. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7591–7609, Singapore. Association for Computational Linguistics.
Keith and O’Connor (2018) Katherine Keith and Brendan O’Connor. 2018. Uncertainty-aware generative models for inferring document class prevalence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4575–4585, Brussels, Belgium. Association for Computational Linguistics.
Kim et al. (2024) Hannah Kim, Kushan Mitra, Rafael Li Chen, Sajjadur Rahman, and Dan Zhang. 2024. MEGAnno+: A human-LLM collaborative annotation system. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 168–176, St. Julians, Malta. Association for Computational Linguistics.
Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.
Lam et al. (2024) Michelle S Lam, Janice Teoh, James A Landay, Jeffrey Heer, and Michael S Bernstein. 2024. Concept induction: Analyzing unstructured text with high-level concepts using lloom. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–28.
Landers and Behrend (2023) Richard N Landers and Tara S Behrend. 2023. Auditing the AI auditors: A framework for evaluating fairness and bias in high stakes AI predictive models. American Psychologist, 78(1):36.
Li et al. (2023a) Hanlin Li, Nicholas Vincent, Stevie Chancellor, and Brent Hecht. 2023a. The dimensions of data labor: A road map for researchers, activists, and policymakers to empower data producers. In Proceedings of the 2023 ACM conference on fairness, accountability, and transparency, pages 1151–1161.
Li et al. (2023b) Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023b. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464, Singapore. Association for Computational Linguistics.
Li et al. (2023c) Minzhi Li, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy Chen, Zhengyuan Liu, and Diyi Yang. 2023c. Coannotating: Uncertainty-guided work allocation between human and large language models for data annotation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1487–1505.
Luo et al. (2020) Yiwei Luo, Dallas Card, and Dan Jurafsky. 2020. Detecting stance in media on global warming. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3296–3315.
Margatina et al. (2021) Katerina Margatina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. 2021. Active learning by acquiring contrastive examples. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 650–663, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Newman et al. (2008) Matthew L Newman, Carla J Groom, Lori D Handelman, and James W Pennebaker. 2008. Gender differences in language use: An analysis of 14,000 text samples. Discourse processes, 45(3):211–236.
Park et al. (2023) Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22.
Pham et al. (2024) Chau Pham, Alexander Hoyle, Simeng Sun, Philip Resnik, and Mohit Iyyer. 2024. TopicGPT: A prompt-based topic modeling framework. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2956–2984.
Ribeiro et al. (2018) Filipe Ribeiro, Lucas Henrique, Fabricio Benevenuto, Abhijnan Chakraborty, Juhi Kulshrestha, Mahmoudreza Babaei, and Krishna Gummadi. 2018. Media bias monitor: Quantifying biases of social media news outlets at large-scale. In Proceedings of the International AAAI Conference on Web and Social Media, volume 12.
Robertson et al. (2018) Ronald E Robertson, Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. 2018. Auditing partisan audience bias within google search. Proceedings of the ACM on human-computer interaction, 2(CSCW):1–22.
Saad-Falcon et al. (2024) Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. 2024. Ares: An automated evaluation framework for retrieval-augmented generation systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 338–354.
Santurkar et al. (2023) Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. Whose opinions do language models reflect? In International Conference on Machine Learning, pages 29971–30004. PMLR.
Santy et al. (2023) Sebastin Santy, Jenny Liang, Ronan Le Bras, Katharina Reinecke, and Maarten Sap. 2023. NLPositionality: Characterizing design biases of datasets and models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9080–9102, Toronto, Canada. Association for Computational Linguistics.
Sap et al. (2022) Maarten Sap, Swabha Swayamdipta, Laura Vianna, Xuhui Zhou, Yejin Choi, and Noah A. Smith. 2022. Annotators with attitudes: How annotator beliefs and identities bias toxic language detection. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5884–5906, Seattle, United States. Association for Computational Linguistics.
Sclar et al. (2023) Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2023. Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations.
Sim et al. (2013) Yanchuan Sim, Brice D. L. Acree, Justin H. Gross, and Noah A. Smith. 2013. Measuring ideological proportions in political speeches. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 91–101, Seattle, Washington, USA. Association for Computational Linguistics.
Thapa et al. (2023) Surendrabikram Thapa, Usman Naseem, and Mehwish Nasim. 2023. From humans to machines: Can chatgpt-like llms effectively replace human annotators in nlp tasks. In Workshop Proceedings of the 17th International AAAI Conference on Web and Social Media.
Tian et al. (2023) Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5433–5442.
Van der Vaart (2000) Aad W Van der Vaart. 2000. Asymptotic statistics, volume 3. Cambridge university press.
Verma et al. (2024) Vivek Verma, Eve Fleisig, Nicholas Tomlin, and Dan Klein. 2024. Ghostbuster: Detecting text ghostwritten by large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1702–1717.
Veselovsky et al. (2023) Veniamin Veselovsky, Manoel Horta Ribeiro, and Robert West. 2023. Artificial artificial artificial intelligence: Crowd workers widely use large language models for text production tasks. arXiv preprint arXiv:2306.07899.
Voigt et al. (2017) Rob Voigt, Nicholas P Camp, Vinodkumar Prabhakaran, William L Hamilton, Rebecca C Hetey, Camilla M Griffiths, David Jurgens, Dan Jurafsky, and Jennifer L Eberhardt. 2017. Language from police body camera footage shows racial disparities in officer respect. Proceedings of the National Academy of Sciences, 114(25):6521–6526.
Weber-Wulff et al. (2023) Debora Weber-Wulff, Alla Anohina-Naumeca, Sonja Bjelobaba, Tomáš Foltỳnek, Jean Guerrero-Dib, Olumide Popoola, Petr Šigut, and Lorna Waddington. 2023. Testing of detection tools for AI-generated text. International Journal for Educational Integrity, 19(1):26.
Weerasooriya et al. (2023) Tharindu Cyril Weerasooriya, Alexander Ororbia, Raj Bhensadadia, Ashiqur KhudaBukhsh, and Christopher Homan. 2023. Disagreement matters: Preserving label diversity by jointly modeling item and annotator label distributions with DisCo. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4679–4695, Toronto, Canada. Association for Computational Linguistics.
Wei et al. (2023) Johnny Tian-Zheng Wei, Frederike Zufall, and Robin Jia. 2023. Operationalizing content moderation" accuracy" in the digital services act. arXiv preprint arXiv:2305.09601.
Weidinger et al. (2022) Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, Atoosa Kasirzadeh, et al. 2022. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 214–229.
Zhang et al. (2023) Ruoyu Zhang, Yanzeng Li, Yongliang Ma, Ming Zhou, and Lei Zou. 2023. LLMaAA: Making large language models as active annotators. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13088–13103, Singapore. Association for Computational Linguistics.
Zhou et al. (2024) Kaitlyn Zhou, Jena D Hwang, Xiang Ren, and Maarten Sap. 2024. Relying on the unreliable: The impact of language models’ reluctance to express uncertainty. arXiv preprint arXiv:2401.06730.
Ziems et al. (2023) Caleb Ziems, Jane Dwivedi-Yu, Yi-Chia Wang, Alon Halevy, and Diyi Yang. 2023. Normbank: A knowledge bank of situational social norms. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7756–7776.
Ziems et al. (2024) Caleb Ziems, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, and Diyi Yang. 2024. Can large language models transform computational social science? Computational Linguistics, 50(1):237–291.
Zrnic and Candès (2024) Tijana Zrnic and Emmanuel J Candès. 2024. Active statistical inference. In Proceedings of the Forty-first International Conference on Machine Learning.

Appendix A Further Details on the Method

A.1 Confidence Intervals

We compute the confidence intervals following the approach in Zrnic and Candès (2024). Suppose that $\hat{\theta}^{\mathrm{conf}}$ is possibly $d$ -dimensional (such as in, for example, linear or logistic regression), and we are interested in coefficient $j$ . If $d=1$ , such as in the case of prevalence estimation, then $j$ is always equal to 1. We compute the confidence interval as:

C_{1-\alpha}=\left(\hat{\theta}^{\mathrm{conf}}_{j}\pm z_{1-\alpha/2}\sqrt{% \frac{\widehat{\Sigma}_{jj}}{n}}\right),

where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution. The matrix $\widehat{\Sigma}$ is an estimate of the covariance of $\hat{\theta}^{\mathrm{conf}}$ , given by:

\widehat{\Sigma}=\hat{H}^{-1}\widehat{\mathrm{Var}}\left(\lambda\nabla\hat{% \ell}_{\hat{\theta}^{\mathrm{conf}}}+(\nabla\ell_{\hat{\theta}^{\mathrm{conf}}% }-\lambda\nabla\hat{\ell}_{\hat{\theta}^{\mathrm{conf}}})\frac{\xi}{\pi}\right% )\hat{H}^{-1},

where $\hat{H}=\hat{\mathbb{E}}[\nabla^{2}\ell_{\hat{\theta}^{\mathrm{conf}}}]$ is the empirical estimate of the Hessian at $\hat{\theta}^{\mathrm{conf}}$ and $\widehat{\mathrm{Var}}$ denotes the empirical variance. Recall also the short-hand notation $\ell_{\theta}=\ell_{\theta}(X,H)$ and $\hat{\ell}_{\theta}=\ell_{\theta}(X,\hat{H})$ . This is a generalization of the usual “sandwich” covariance used in linear regression.

Some estimation targets, such as the odds ratio, are not M-estimators but are functions of M-estimators. In those cases a confidence interval is obtained by additionally applying the delta method.

See Zrnic and Candès (2024) for further details.

A.2 Power Tuning

Power tuning, introduced by Angelopoulos et al. (2023b), refers to choosing $\lambda$ so that the MSE of $\hat{\theta}^{\mathrm{conf}}$ , or equivalently its variance, is minimized over $\lambda$ . Since $\widehat{\Sigma}_{jj}$ is a quadratic in $\lambda$ , the optimal $\lambda$ has a closed-form analytical expression. As before, suppose we are interesting in estimating coordinate $j$ of $\hat{\theta}^{\mathrm{conf}}$ . Let $h_{j}$ denote the $j$ -th column of $\hat{H}^{-1}$ . Then, we set $\lambda$ according to:

\lambda=\frac{h^{\top}~{}\widehat{\mathrm{Cov}}~{}h}{2h^{\top}~{}\widehat{% \mathrm{Var}}~{}h},

where $\widehat{\mathrm{Cov}}:=\widehat{\mathrm{Cov}}(\nabla\hat{\ell}_{\hat{\theta}^% {\mathrm{conf}}}(\frac{\xi}{\pi}-1),\nabla\ell_{\hat{\theta}^{\mathrm{conf}}}% \frac{\xi}{\pi})+\widehat{\mathrm{Cov}}(\nabla\ell_{\hat{\theta}^{\mathrm{conf% }}}\frac{\xi}{\pi},\nabla\hat{\ell}_{\hat{\theta}^{\mathrm{conf}}}(\frac{\xi}{% \pi}-1))$ and $\widehat{\mathrm{Var}}:=\widehat{\mathrm{Var}}(\nabla\hat{\ell}_{\hat{\theta}^% {\mathrm{conf}}}(\frac{\xi}{\pi}-1))$ are empirical (co)variances. See Angelopoulos et al. (2023b) for further details.

A.3 LLM and Human Annotation Details

For data annotation, we use GPT-4o (gpt-4o-2024-05-13 version) and GPT-3.5-turbo (gpt-3.5-turbo-0125 version). Prompt texts in both stages are listed in Table 2. To test LLM performance out-of-the-box, all annotations are collected using zero-shot prompting. We set the max_tokens parameter to 5, use default temperature (1), and the default system prompt and the other prompting parameters.

Stage 1 GPT-4o annotations are in moderate agreement with human annotations in all three settings: $\kappa_{\mathrm{politeness}}=0.39$ , $\kappa_{\mathrm{stance}}=0.57$ , and $\kappa_{\mathrm{bias}}=0.43$ .

In Stage 2, we find that the collected verbalized confidence scores are calibrated with the Stage 1 accuracy (Fig. 3 (right)), such that higher confidence scores correspond to higher accuracy with respect to human annotations. This implies that verbalized confidence is indeed an informative signal to leverage in estimation tasks. Histograms of the collected verbalized confidence scores are illustrated in Fig. 3 (left). We also observe a variance in the verbalized confidence within each setting, and a relative lack of overconfident responses (where the model is 100% certain).

We choose the sampling probabilities $\pi_{i}$ according to the theory of Zrnic and Candès (2024). For estimating the prevalences $p_{\mathrm{left}}$ and $p_{\mathrm{right}}$ , as well as the odds ratio $O_{\mathrm{agreement}}$ , we choose $\pi_{i}\propto\sqrt{\widehat{\texttt{err}}_{i}(C_{i})}$ , as described in Section 3.2. For the logistic regression coefficient $\beta_{\mathrm{hedge}}$ (respectively, $\beta_{\mathrm{1pp}}$ ), we set $\pi_{i}\propto\sqrt{\widehat{\texttt{err}}_{i}(C_{i})}\cdot|X_{i}^{\top}h|$ , where $h$ is the column of $\widehat{H}$ (defined in App. A.1) corresponding to $X_{\mathrm{hedge}}$ (respectively, $X_{\mathrm{1pp}}$ ).

To fit $\widehat{\texttt{err}}_{i}$ , we train an XGBoost Chen and Guestrin (2016) model. For all problem settings, we use the same training parameters: number of boosting rounds 2000, step size 0.001, maximum depth 3, and squared-error objective.

A.4 Computation of Evaluation Metrics

We provide further details behind the computation of our two main metrics, effective sample size and coverage. For all problem settings, we run $100$ simulation trials. All experiments were run on a single CPU.

Effective sample size.

Recall that we define the effective sample size of a method as the hypothetical value $n_{\mathrm{effective}}$ such that $\mathrm{MSE}(\hat{\theta}^{\mathrm{method}})=\mathrm{MSE}(\hat{\theta}^{% \mathrm{human}}_{n_{\mathrm{effective}}})$ , where $\hat{\theta}^{\mathrm{human}}_{n_{\mathrm{effective}}}$ is obtained via the human-only approach with $n_{\mathrm{effective}}$ annotations. Since all approaches but the LLM-only approach are unbiased in the large-sample limit, meaning their estimate has mean exactly equal to $\theta^{*}$ , the MSE is simply equal to the estimator variance. Estimator variance is used in the confidence interval construction and is estimated as $\widehat{\Sigma}/n$ , as explained in App. A.1. The different baselines differ in their choice of $\lambda$ and $\pi$ in the definition of $\widehat{\Sigma}$ . We thus compute the effective sample size as $\widehat{\Sigma}^{\mathrm{human}}_{jj}/\widehat{\Sigma}_{jj}\cdot n,$ where $j$ indexes the coordinate of $\hat{\theta}^{\mathrm{conf}}$ when the estimate has more than one dimension. The final reported effective sample size is the mean of these values over $100$ trials.

Coverage.

We estimate coverage over $100$ trials. For all methods but LLM only, the trials differ in the random annotation decisions $\xi_{i}$ that determine which points get human-annotated, and we average 0/1 indicators of coverage over those trials. For LLM only, since we only have one fixed dataset of $n$ LLM annotations, in order to estimate coverage we simulate random draws from a population via the bootstrap. In other words, in each trial we draw $n$ LLM annotations with replacement, form a classical confidence interval using those points, and record a 0/1 indicator of coverage.

Setting	Stage	Prompt text
Politeness	Stage 1	Is the following text polite? Output either A or B. Output a letter only.
		A) Polite
		B) Impolite
		Text: <text>
		Answer:
Politeness	Stage 2	How likely is it that the following text is <previously provided answer: polite or impolite>?
		Output the probability only (a number between 0 and 1).
		Text: <text>
		Answer:
Stance	Stage 1	A statement can agree, be neutral, or disagree with the statement: “Climate change/global warming is a serious concern”. Classify the following statement into one of the three categories. Output either A, B, or C. Output a letter only.
		A) Agree
		B) Neutral
		C) Disagree
		Statement: <text>
		Answer:
Stance	Stage 2	How likely is it that the following text <previously provided answer: agrees, neither agrees nor disagrees, or disagrees> with the statement: “Climate change/global warming is a serious concern”?
		Output the probability only (a number between 0 and 1).
		Text: <text>
		Probability:
Bias	Stage 1	What is the political bias of the following article? Output either A,B, or C. Output a letter only.
		A) Left
		B) Center
		C) Right
		Article: <text>
		Answer:
Bias	Stage 2	How likely is it that the following article has a <previously provided answer: left-leaning, centrist, or right-leaning> political bias? Output the probability only (a number between 0 and 1).
		Text: <text>
		Probability:

Table 2: Complete prompt texts. LLM annotation prompts across the three settings, for Stages 1 and 2.

Appendix B Supplementary Results

B.1 An Alternative LLM

Figure 4 and Table 3 summarize the results using GPT-3.5. For $n_{\mathrm{human}}=500$ , in each estimation task the confidence-driven approach again achieves a higher gain in effective sample size than the non-adaptive approach. Moreover, it always achieves a positive gain. In contrast, the non-adaptive approach achieves a negative gain in three out of the five estimation tasks (both politeness estimates and the stance estimate). The confidence-driven and non-adaptive approaches always achieve over 90% coverage. In contrast, LLM-only coverage is always poor using GPT-3.5 (while using GPT-4o it was poor on four out of the five estimation tasks).

B.2 Sensitivity to Miscalibrated LLM Confidence Scores

A calibrated LLM will produce higher confidence scores when annotations are in agreement with human annotations, compared to when annotations are in disagreement with human annotations. However, calibration of confidence scores across tasks is not guaranteed.

To understand how the performance of Confidence-Driven Inference is affected by the calibration of confidence scores, we conducted a robustness test, adding noise to confidence scores to simulate miscalibration. In particular, for illustration we consider the task of analyzing stance on global warming. We add a varying amount of normally distributed noise $\mathcal{N}(0,\sigma^{2})$ to the collected confidence scores $C_{i}$ , and truncate the sum to $[0,1]$ to obtain a probability.

We use a t-test to test the difference in calibration score means when LLM and human annotations agree, vs when LLM and human annotations disagree. If the t-statistic is large (equivalently, the corresponding p-value is small), that suggests that the two means differ significantly. As the random noise added to the confidence scores increases, the scores become less calibrated (Table 4). We expect that our method performs worse in terms of $n_{\mathrm{effective}}$ when the confidence scores are miscalibrated, although coverage should be maintained.

As predicted, as the amount of miscalibration in the confidence scores increases, the gain in the effective sample size decreases (Table 4). The confidence-driven approach achieves the highest gain for the smallest amount of noise, though it always achieves a positive gain. This suggests that the approach is robust to poor confidence scores. Furthermore, Confidence-Driven Inference achieves near 90% coverage or higher in each setting, regardless of the amount of miscalibration. Finally, we observe that Confidence-Driven Inference achieves a higher gain than the non-adaptive approach regardless of the extent of miscalibration. This can be explained through the power tuning parameter $\lambda$ ; even when the confidence scores provide no signal, power tuning makes sure that LLM annotations are leveraged effectively.

Estimation task	Metric	Method
Estimation task	Metric	confidence-driven	human + LLM (non-adaptive)	LLM only
Politeness devices (hedge)	Gain in eff. sample size	( $\pm$ 7.81)%	( $\pm$ 9.18)%	—
Politeness devices (hedge)	Coverage
Politeness devices (1st person plural)	Gain in eff. sample size	( $\pm$ 19.01)%	( $\pm$ 30.83)%	—
Politeness devices (1st person plural)	Coverage
Stance on global warming	Gain in eff. sample size	( $\pm$ 16.24)%	( $\pm$ 16.95)%	—
Stance on global warming	Coverage
Political bias (left-leaning)	Gain in eff. sample size	( $\pm$ 6.30)%	( $\pm$ 4.86)%	—
Political bias (left-leaning)	Coverage
Political bias (right-leaning)	Gain in eff. sample size	( $\pm$ 7.50)%	( $\pm$ 5.73)%	—
Political bias (right-leaning)	Coverage

Table 3: Results summary (GPT-3.5). Gain in effective sample size and coverage across the five estimation tasks for

n_{\mathrm{human}}=500

, estimated over 100 trials. In each task, the confidence-driven approach achieves a higher gain in effective sample size (bolded) than the non-adaptive approach. Confidence-driven approach always achieves a , while the non-adaptive approach sometimes achieves a . Confidence-driven and non-adaptive approaches achieve , or higher. In contrast, LLM-only coverage is . Gain in effective sample size is not estimated for the LLM-only approach as it does not leverage human annotations. Errors show a standard deviation over 100 trials.

$n_{\mathrm{human}}=500$
Method	Confidence score calibration t-test	Metric
Method	Confidence score calibration t-test	Gain in eff. sample size	Coverage
LLM only	—	—
human + LLM (non-adaptive)	—
confidence-driven ( $\sigma^{2}=0$ )	$t=9.08$ , $p=2.19\times 10^{-19}$
confidence-driven ( $\sigma^{2}=0.2$ )	$t=5.53$ , $p=3.65\times 10^{-8}$
confidence-driven ( $\sigma^{2}=0.4$ )	$t=3.30$ , $p=0.000994$
confidence-driven ( $\sigma^{2}=0.6$ )	$t=2.34$ , $p=0.0192$
confidence-driven ( $\sigma^{2}=0.8$ )	$t=1.88$ , $p=0.0606$
$n_{\mathrm{human}}=1150$
LLM only	—	—
human + LLM (non-adaptive)	—
confidence-driven ( $\sigma^{2}=0$ )	$t=9.08$ , $p=2.19\times 10^{-19}$
confidence-driven ( $\sigma^{2}=0.2$ )	$t=5.29$ , $p=1.36\times 10^{-7}$
confidence-driven ( $\sigma^{2}=0.4$ )	$t=3.22$ , $p=0.00130$
confidence-driven ( $\sigma^{2}=0.6$ )	$t=2.23$ , $p=0.0257$
confidence-driven ( $\sigma^{2}=0.8$ )	$t=1.83$ , $p=0.0831$

Table 4: Sensitivity to confidence score calibration. Gain in effective sample size and coverage for the LLM only, human + LLM (non-adaptive), and confidence-driven approaches, given varying amounts of miscalibration in confidence scores (

\sigma^{2}

). Results are presented for the task of analyzing stance on global warming, estimated over 100 trials. The t-test tests for the difference in calibration score means when LLM and human annotations agree, vs when LLM and human annotations disagree (larger

t

means difference is more significant). The confidence-driven approach achieves the largest gain for the smallest amount of noise (bolded), and it always achieves a . For each amount of miscalibration, the confidence-driven approach achieves a higher gain in effective sample size than the non-adaptive approach; it also achieves or higher in each setting. In contrast, LLM-only coverage is . Gain in effective sample size is not estimated for the LLM-only approach as it does not leverage human annotations.