Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms

Joshua Ashkinaze, Ruijia Guan, Laura Kurek, Eytan Adar, Ceren Budak, and Eric Gilbert
University of Michigan, United States
(2024)
Abstract.

Large language models (LLMs) are trained on broad corpora and then used in communities with specialized norms. Is providing LLMs with community rules enough for models to follow these norms? We evaluate LLMs’ capacity to detect (Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia’s Neutral Point of View (NPOV) policy. LLMs struggled with bias detection, achieving only 64% accuracy on a balanced dataset. Models exhibited contrasting biases (some under- and others over-predicted bias), suggesting distinct priors about neutrality. LLMs performed better at generation, removing 79% of words removed by Wikipedia editors. However, LLMs made additional changes beyond Wikipedia editors’ simpler neutralizations, resulting in high-recall but low-precision editing. Interestingly, crowdworkers rated AI rewrites as more neutral (70%) and fluent (61%) than Wikipedia-editor rewrites. Qualitative analysis found LLMs sometimes applied NPOV more comprehensively than Wikipedia editors but often made extraneous non-NPOV-related changes (such as grammar). LLMs may apply rules in ways that resonate with the public but diverge from community experts. While potentially effective for generation, LLMs may reduce editor agency and increase moderation workload (e.g., verifying additions). Even when rules are easy to articulate, having LLMs apply them like community members may still be difficult.

artificial intelligence, Wikipedia, large language models, content moderation
Preprint, July 3, 2024.
CCS Concepts: Human-centered computing; Human-centered computing → Human computer interaction (HCI); Human-centered computing → Empirical studies in HCI; Human-centered computing → Empirical studies in collaborative and social computing; Human-centered computing → Wikis; Computing methodologies → Artificial intelligence; Computing methodologies → Natural language processing

1. Introduction

Large language models (LLMs) are trained on large corpora and then used within communities that have their own norms. To steer models towards specific norms and values, there is a growing trend of stating high-level rules as prompts. For example, Constitutional AI (bai_constitutional_2022, ) involves providing the model with a set of rules that it uses to critique and then revise its own outputs. But is providing high-level rules sufficient to steer models toward community norms?

The challenge of going from high-level rules—like Wikipedia’s Neutral Point of View (NPOV) policy—to specific cases mirrors earlier debates in human-AI interaction and beyond. For instance, Lucy Suchman’s 1987 “Plans and Situated Actions” framework (suchman_plans_1987, ) contrasted predetermined procedures derived from universal principles (“plans”) with context-dependent actions based on concrete circumstances (“situated actions”). She argued that AI systems execute plans, while humans perform situated actions. This distinction relates to James Scott’s  (scott_seeing_1998, ) concept of “seeing like a state”—the idea that the large-scale plans of centralized authorities often break down when faced with complex, local realities. The tension between universal plans and situated, local actions is relevant to deploying general-purpose LLMs in communities like Wikipedia. On Wikipedia, many decisions involve messy and contextual judgments (hansen_wikipedia_2009, ; swarts_collaborative_2009, ) rather than straightforward applications of high-level rules. Is providing high-level rules to trillion-parameter models also insufficient for navigating particular NPOV cases?

We test how well general-purpose LLMs can be steered to apply NPOV policy (https://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view). The broader NPOV policy encompasses various complex assessments (e.g., covering viewpoints in proportion to their prominence in reliable sources). Here, we focus on one specific subtask: debiasing language. Specifically, we evaluate large language models on their ability to: (1) detect edits that include non-neutral language and (2) generate NPOV-compliant edits from biased ones. Evaluating LLMs on Wikipedia’s NPOV policy is an interesting test case of model abilities for three reasons. First, LLMs are used in widely different communities. An open question is whether these models can be steered, with no task-specific fine-tuning, toward following community policies or guidelines. Second, detecting and generating non-neutral language requires nuance that goes beyond literal meaning. This task probes how well LLMs can apply natural language pragmatics and mirror the nuanced decisions of community members. Third, Wikipedia’s NPOV has a unique tension: while clearly articulated and well documented in theory, it is complex in practice (reagle_is_2007, ; swarts_collaborative_2009, ). Adjudicating neutrality is nuanced. To what extent can LLMs, absent fine-tuning, apply clearly articulated but nuanced community policies?

1.1. Findings

To test whether LLMs could apply Wikipedia’s NPOV policy, we conducted detection and generation experiments. In both experiments, LLM decisions and edits were compared against an existing corpus of NPOV edits by Wikipedia editors, also known as Wikipedians. In the detection experiment, we used ChatGPT 3.5, Mistral-Medium, and GPT-4 to classify whether edits violated NPOV, varying the amount of policy detail and example decisions provided. In the first generation experiment, we used GPT-4 (using both Zero-Shot and Constitutional AI approaches) to neutralize NPOV-violating edits. We used computational metrics to compare the LLM-neutralized edits to the Wikipedian-neutralized edits. In the second generation experiment, humans gave masked evaluations of LLM and Wikipedian rewrites. We also conducted a qualitative analysis of LLM vs. Wikipedian rewrites to better understand differences in how each neutralizes text. We summarize our main findings and implications below:

  1. Large language models largely failed at neutrality detection. Across models, prompts, and optimization strategies, performance was low (64% accuracy for the best prompt; the dataset was balanced so random chance accuracy is 50%). Analysis of LLM rationales and errors suggested LLMs relied (sometimes to a fault) on simple heuristics like the presence of a highly subjective adjective. Perhaps modern models still struggle with applying subtle rules to real-world cases without task-specific training.

  2. Models exhibited biases that persisted across prompts. ChatGPT 3.5 over-predicted edits as biased, and Mistral-Medium did the opposite. GPT-4 was balanced. Pretrained LLMs may internalize distinct priors about what constitutes neutrality that persist across prompts. Understanding model idiosyncrasies is crucial. It is not clear that, when a new model appears on the scene, we will know what its priors are.

  3. LLMs applied rules in different ways than expert editors. Computational experiments showed LLMs generally did remove words that Wikipedia editors removed in the generation phase. But LLMs also made many other changes. Crowdworkers also rated Constitutional AI (CAI)-generated rewrites as adding unnecessary information more frequently than Wikipedian rewrites. In other words, LLM editors are high-recall but low-precision. And while Wikipedians make more removals than additions, LLM editors do the opposite. One risk for large-scale deployment of models within communities is that they may follow rules in different ways than community members would.

  4. Crowdworkers preferred LLM neutralizations to Wikipedia-editor neutralizations. Given the finding that LLMs generally made more changes than Wikipedia editors, we conducted human experiments to understand whether these generations are in fact preferred to Wikipedian neutralizations. We find that crowdworkers prefer AI edits over human edits on both fluency (61% of choices) and bias reduction (70% of choices). Instruction-tuned models may be highly capable of applying rules in ways that resonate with a broader public, even if these applications differ from those of community experts.

  5. Qualitative analysis showed LLMs are ‘NPOV+’. To reconcile LLMs’ low detection performance with high generation evaluations, we conducted a qualitative analysis. Despite being instructed to make only minimal (NPOV) changes, we find LLMs made many additional grammatical and stylistic edits that may have influenced participants’ judgments. This maximalist editing approach could (1) increase labor costs as moderators may need to check for AI hallucinations and (2) reduce editor agency due to extensive rewrites. But in other cases, AI models arguably applied NPOV more faithfully than human editors (as judged by the authors), possibly due to varying community norms on what would constitute neutrality. These findings highlight tradeoffs in general-purpose LLM generations.

Our study has practical implications for Wikipedia, model builders, and perhaps for policy. For Wikipedia, LLMs have low NPOV-violation detection accuracy but we see potential in generating initial drafts of neutral content, with the caveats discussed above. We also found that LLMs applied rules differently from Wikipedians yet were preferred by laypeople, indicating a potential tension between reader and contributor satisfaction. For model builders, we suggest several actionable steps to improve detection and generation abilities: Retrieval-augmented generation (e.g., incorporating Wikipedia Talk page conversations to simulate editors’ community knowledge) or multi-agent systems (irving_ai_2018, ; khan_debating_2024, ) (simulating Wikipedia-style debates) may enhance detection capabilities. Fine-tuning models on expert feedback can increase alignment with domain specialists. Refining Constitutional AI to emphasize preserving original text may improve revision precision. For policy, our findings highlight the limitations of a “set and forget” approach for rules and AI systems. Human oversight and stakeholder input are required to prevent LLMs from overstepping instructions, as LLMs may generate content appealing to the general public but not to the intended community.

2. Related Work

We position our work in the context of Wikipedia neutrality research (2.1), automated moderation of Wikipedia (2.2), and broader conversations around the role of general-purpose LLMs in community moderation (2.3). Wikipedia’s Neutral Point of View (NPOV) policy, though clearly defined, is often debated in practice (swarts_collaborative_2009, ; viegas_studying_2004, ; reagle_is_2007, ; nagar_what_2012, ), making it a compelling test case for LLMs. While prior work tested task-specific models on neutralizing NPOV violations (pryzant_automatically_2020, ; recasens_linguistic_2013, ), we examine whether general-purpose LLMs can neutralize NPOV violations. This task is timely given two trends. First, LLMs excel at zero-shot annotation (gilardi_chatgpt_2023, ), motivating work on their ability to apply community policies (cao_toxicity_2024, ; barbarestani_content_2024, ; kumar_watch_2024, ; kolla_llm-mod_2024, ). We complement this work by testing LLMs on generation in addition to detection, specifically in the Wikipedia context. Second, Wikipedia has historically used automated moderation tools (wikipedia_wikipedia_2024-1, ), but there are contentious internal debates about the use of ChatGPT (harrison_should_2023, ). Our study can inform these debates. More broadly, our work speaks to the opportunities and limitations of using general-purpose LLMs to uphold nuanced community norms.

2.1. Wikipedia’s Neutral Point of View (NPOV) Policy

Wikipedia is one of the most visited websites in the world (semrush_top_nodate, ). It is a widely-regarded success in peer production. It is well studied in CSCW and beyond (nagar_what_2012, ; matei_wikipedias_2011, ; bryant_becoming_2005, ; reagle_is_2007, ; kittur_harnessing_2008, ; swarts_collaborative_2009, ; viegas_studying_2004, ). To support a crowd-sourced encyclopedia editable by anyone, Wikipedia developed many policies, processes, and norms (butler_dont_2008, ). “Neutral Point of View” (NPOV) is one of the most central. For example, NPOV is listed first among the three core content policies (wikipedia_core_2023, ). According to Wikipedia.org (wikipedia_wikipedianpov_2024, ):

“All encyclopedic content on Wikipedia must be written from a neutral point of view (NPOV), which means representing fairly, proportionately, and, as far as possible, without editorial bias, all the significant views that have been published by reliable sources on a topic.”

Jimmy Wales (a co-founder of Wikipedia) has said (wales_jimmy_2006, ) (nagar_what_2012, , p. 5):

“So how do we do this? …How does it work? …So the biggest and the most important thing is our neutral point of view policy. This is something that I set down from the very beginning, as a core principle of the community that’s completely not debatable.”

NPOV is crucial for Wikipedia’s content quality. Wikipedia’s iterative revision process improves neutrality over time (greenstein_experts_2018, ). This is due in part to the role of NPOV in defending against disinformation, low-quality information (steinsson_rule_2024, ; mcdowell_it_2020, ), and delineating fringe points of view (steinsson_rule_2024, ). The result is (1) a public perception that Wikipedia editors are committed to unbiased information and (2) increased reader trust (elmimouni_why_2022, ).

Because of NPOV’s importance, Wikipedia goes to great lengths to codify exactly what a “neutral point of view” means. Specifically, NPOV is described by a set of principles (wikipedia_wikipedianeutral_2024-1, ): avoid stating opinions as facts; avoid stating seriously contested assertions as facts; avoid stating facts as opinions; prefer nonjudgmental language; and indicate the relative prominence of opposing views. These principles are expounded and clarified in numerous FAQs, tutorials, and examples. Editors are trained in NPOV before they can edit. And if an article violates NPOV, then it is flagged as “NPOV Disputed” so that an editor can bring it into compliance. As a community norm, NPOV is clearly defined and often invoked in day-to-day community activity. The ‘rules’ are clear.

However, despite its clear articulation, NPOV has always been complicated in practice. Reagle claims that “in the Wikipedia culture, the notion of ‘neutrality’ is not understood so much as an end result, but as a process” (reagle_is_2007, ). Wikipedia articles often go through many rewrites, with debates occurring as to what constitutes neutrality. For example, over three years the Wikipedia entry for ‘clean coal technology’ had 39 distinct facts and 142 different rewrites of these facts (swarts_collaborative_2009, ). Norm-consistent Wikipedia content is often the result of conflict, coordination, and deliberation, a process more complex than lone editors applying rules in isolation (kittur_he_2007, ; swarts_collaborative_2009, ; nagar_what_2012, ; kittur_harnessing_2008, ; viegas_studying_2004, ). Past work (kittur_harnessing_2008, ) estimated that 40% of Wikipedia edits are dedicated to coordination between Wikipedians (e.g., achieving consensus). And although NPOV comprises only five principles, there are many pages and debates dedicated to this topic (matei_wikipedias_2011, ).

Wikipedia’s NPOV policy is an interesting test case of general-purpose LLM abilities because of a tension: It is both clear and nuanced. On one hand, NPOV is clearly defined and documented. In theory, augmenting LLMs with the text of NPOV guidelines should enable LLMs to apply these guidelines effectively. But even though NPOV is clear in theory, it is complex in practice. To what extent, then, can general-purpose LLMs—absent any task-specific training—be steered to follow nuanced community guidelines such as NPOV?

2.2. Automated Approaches to NPOV and Wikipedia Moderation

Researchers have developed algorithms to understand and ameliorate problems with Wikipedia article content (e.g., (nicholson_measuring_2021, ; berube_wiki-gendersort_2020, ; wang_assessing_2020, ; sathe_automated_2020, ; halfaker_ores_2020, ; schmahl_is_2020, )). Most relevant to the present work, several papers specifically tested whether models can neutralize NPOV-violating edits. Recasens et al. (recasens_linguistic_2013, ) experimented on a subset of NPOV-violating edits where the bias was due to one word. Both humans and their system struggled to guess the biased word (37% and 34% accuracy, respectively). Building on this result, Pryzant et al. (pryzant_automatically_2020, ) collected a similar dataset of NPOV-violating edits and their corrections called the Wikipedia Neutrality Corpus (WNC). We use this dataset in our work. Both projects (pryzant_automatically_2020, ; recasens_linguistic_2013, ) conducted experiments on a subset of edits in which the bias was attributable to only one word. With the WNC corpus, a sequence-to-sequence system was trained to detect the NPOV-violating word and then predict its replacement, achieving 46% generation accuracy (measured as the proportion of generations that matched Wikipedia-editor rewrites).

Our study departs from these two studies in two ways. First, we conduct experiments on a more representative set of NPOV-violating edits (i.e., not the one-word subset), which is arguably a harder task. Second, we test whether general-purpose models, absent any task-specific training, can perform this task. Both Recasens et al. and Pryzant et al. show that applying NPOV in a way that matches how expert editors behave is difficult for both laypeople and task-specific NLP systems. In our work, we can answer the question: How well can new generations of general-purpose LLMs neutralize text?

Wikipedia already uses automated moderation tools (wikipedia_wikipedia_2024, ), and our results can inform the benefits and risks of using general-purpose LLMs. Past and current services include WikiTrust (wikipedia_wikipedia_2024-1, ), an extension that helped editors detect untrustworthy edits. The ORES project used machine learning to rank the usefulness of edits (halfaker_ores_2020, ). The Automod project was an abuse detection model (rawat_automatic_2019, ). More recently, Wikipedia editors have debated the utility of ChatGPT for generating Wikipedia edits (harrison_should_2023, ). By systematically evaluating NPOV detection and generation from general-purpose LLMs, our results can add evidence to this debate, possibly informing community practice.

2.3. Pre-Trained LLMs for Community Content Moderation

Given how much content there is to moderate, platforms often employ automatic content moderation tools such as Reddit’s AutoMod (jhaver_human-machine_2019, ). Because general-purpose LLMs show remarkable zero-shot (gilardi_chatgpt_2023, ) and few-shot (brown_language_2020, ) capabilities even without fine-tuning (i.e., training the model on task-specific data), it is reasonable to think general-purpose LLMs would be effective content moderators. The zero-shot and few-shot capabilities of LLMs are useful, since fine-tuning requires resources (compute, instances) that some communities may not have. Hence, online communities may rightfully wonder how well these off-the-shelf LLMs can apply their rules for online moderation. Our work engages with this question.

Several studies have explored whether general-purpose LLMs can apply highly specific sets of online community rules without fine-tuning. One study (kolla_llm-mod_2024, ) tested whether LLM moderators can detect rule-violating posts across nine subreddits when the rules of these subreddits are provided to the LLM. Their system achieved a high true-negative rate (92%) but a low true-positive rate (43%). A similar study found that ChatGPT 3.5 was 64% accurate in predicting subreddit moderation decisions (kumar_using_2023, ). Cao et al. tested how well LLMs could detect rule violations of r/AskHistorians, with precision and recall showing variance depending on the specific rule (cao_toxicity_2024, ). They also asked moderators what would be an acceptable precision and recall of models for the models to be useful. In general, moderators indicated models should have recall and precision above 70%. Although not subreddit-specific, Barbarestani et al. found ChatGPT largely agrees with both crowds and experts when detecting inappropriate words in Reddit comments (barbarestani_content_2024, ). However, when considering errors, they found that differences in subjective interpretations of appropriateness accounted for 41% of the disagreement amongst experts but 69% of the disagreement between experts and ChatGPT. This may suggest that some familiarity with the topic is required to apply nuanced guidelines reliably.

Our study complements and extends existing research on using general-purpose LLMs for online community content moderation. It complements prior work by testing if general-purpose LLMs can effectively apply Wikipedia NPOV. NPOV is both intrinsically important (being a central norm to one of the most visited websites) and theoretically interesting (since it is highly documented while also requiring nuanced reasoning to apply). Our study extends these works, too. In addition to detection, we focus on how well LLMs can generate content consistent with a community’s rules. As LLM usage grows, understanding the implications for generation within communities is crucial. Because LLMs are explicitly trained for natural language generation, they may be more effective at generation than detection. We conduct both computational and human subject experiments to measure generative quality.

3. LLM Bias Detection

To understand whether general-purpose LLMs can classify edits as neutral or biased, we conducted classification experiments on a balanced sample of NPOV-violating edits and NPOV-compliant rewrites. We find that across models and prompts, models largely failed to distinguish between neutral and biased edits. Models exhibited contrasting failure modes, and model explanations suggest they appeared to rely on heuristics such as the presence of a highly subjective adjective.

3.1. Dataset

The data for our detection and generation experiments come from the Wikipedia Neutrality Corpus (WNC) (pryzant_automatically_2020, ). The WNC is a collection of edit pairs: A biased edit (an edit flagged by an editor for violating the NPOV) and a neutral edit (the rewrite an editor made to ensure NPOV compliance). While NPOV is a set of multiple rules, the WNC consists of edits that violated a particular subset of NPOV: biased language (framing, epistemological, and demographic bias). This stipulation is useful for testing the abilities of LLMs. Unlike other NPOV rules (e.g., those concerning citations), a violation of neutral language does not require access to external information. For detection experiments, we use the subset of these edits where the data also has tags for which topic an edit belongs to (see Appendix Table 6 for topic counts).

The WNC was collected by crawling Wikipedia revisions between 2004 and 2019. In an initial pass, Pryzant et al. (pryzant_automatically_2020, ) filtered for revisions where editors provided a justification indicating an NPOV violation. In subsequent filtering passes, Pryzant et al. (pryzant_automatically_2020, ) excluded revisions below a minimal and above a maximal number of characters and applied additional rules to maximize the probability that the change was related to bias in particular (e.g., excluding revisions that involved references).

3.2. Experiment Setup

We conducted a multi-model prompt experiment (N = 5,348 annotations) where the task was to classify if a given Wikipedia edit was biased or neutral. Our two experimental factors were (Factor 1) the definition given to an LLM on what constitutes neutrality, and (Factor 2) whether or not we provided examples (i.e., few-shot or zero-shot). The rationale for varying definitions is that we are interested in whether augmenting LLMs with community-specific definitions of neutrality increases adherence to community-specific norms. The rationale for our second factor is to test whether providing example decisions from a community helps models adhere to these community norms. Appendix B has all the prompts we used in this paper.

  • Factor 1: Definitions

    • Minimal definition (Minimal)

    • Wikipedia’s Neutral Point of View definition (NPOV)

    • Wikipedia’s definition focused on neutral language (NPOV-Scoped)

  • Factor 2: Examples

    • No examples: Zero-shot (ZeroShot)

    • With examples: 10-shot examples from the same topic (10-Shot)

We employed three models: gpt-4-0125-preview, gpt-3.5-turbo, and mistral-medium-latest. We refer to these as GPT-4, ChatGPT 3.5, and Mistral-Medium from here on. We chose these on the basis that, at the time of this study, the first is a state-of-the-art LLM, the second is commonly used (largely due to speed and cost), and the third is from a popular open-source developer. For each of the 18 (model × definition × example) combinations, we classify a balanced (between biased edits and neutralized rewrites) sample of 300 edits stratified by topic, yielding 5,400 attempted classifications. For each classification, we instruct the model to return a label (‘neutral’ or ‘biased’), a rationale, and the policy violated (if the prompt contains policies). Of 5,400 attempted annotations, 5,358 yielded compliant annotations (i.e., one of the two classes). We limited the sample to this size because of API costs: we wanted to include more expensive state-of-the-art models such as GPT-4. Confidence intervals for quantities of interest are nonetheless narrow.
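As a quick check on the design arithmetic, the full factorial grid can be enumerated directly. This is only an illustrative sketch: the condition labels are taken from the factor list above, while the variable and function names are hypothetical and not from the authors' code.

```python
from itertools import product

# Model identifiers and condition labels as named in the text.
MODELS = ["gpt-4-0125-preview", "gpt-3.5-turbo", "mistral-medium-latest"]
DEFINITIONS = ["Minimal", "NPOV", "NPOV-Scoped"]
EXAMPLES = ["ZeroShot", "10-Shot"]
EDITS_PER_CONDITION = 300  # balanced sample size stated in the text


def build_conditions():
    """Return every (model, definition, examples) combination."""
    return list(product(MODELS, DEFINITIONS, EXAMPLES))


conditions = build_conditions()
total_attempted = len(conditions) * EDITS_PER_CONDITION
print(len(conditions), total_attempted)  # 18 5400
```

The 18 combinations times 300 edits reproduce the 5,400 attempted classifications reported above.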

3.2.1. Factor 1: Definitions Provided

The minimal prompt relies on LLMs’ learned knowledge only, providing no Wikipedia-specific definitions or examples. This condition reveals how well general-purpose LLMs apply NPOV without additional context. The NPOV prompt provides LLMs with Wikipedia’s verbatim NPOV guidelines. Comparing the performance between the minimal and NPOV conditions estimates how much simple prompting can align general-purpose notions of neutrality with community-specific norms. Since the WNC focuses on neutral language violations, a subset of NPOV violations, we created a third prompt condition: NPOV-Scoped. This prompt gives LLMs Wikipedia’s NPOV guidelines and additional language from Pryzant et al. (pryzant_automatically_2020, ) on the specific types of non-neutral language (framing, epistemological, and demographic bias) in the dataset. Comparing NPOV-Scoped to NPOV reveals how much guideline specificity improves classification accuracy.

3.2.2. Factor 2: Examples Provided

Few-shot learning can improve model performance (brown_language_2020, ). We test whether augmenting the model with editors’ prior decisions increases accuracy. In the few-shot condition, we select a random sample of 10 edits and their labels from the same topic, as neutrality norms may differ by topic due to factors like similar editors or content.
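The same-topic sampling step could look roughly like the sketch below. The record keys (`id`, `topic`, `text`, `label`), the function name, and the choice to exclude the edit being classified from its own examples are all assumptions made for illustration; the text above only states that 10 labeled edits are drawn at random from the same topic.

```python
import random


def sample_few_shot(pool, topic, k=10, exclude_id=None, seed=0):
    """Draw k labeled example edits that share `topic`.

    pool: list of dicts with hypothetical keys 'id', 'topic', 'text', 'label'.
    exclude_id: id of the edit being classified, left out of its own examples
                (an assumption, not something the paper states).
    """
    candidates = [
        edit for edit in pool
        if edit["topic"] == topic and edit["id"] != exclude_id
    ]
    if len(candidates) < k:
        raise ValueError(f"only {len(candidates)} candidates for topic {topic!r}")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(candidates, k)
```

Sampling without replacement keeps the ten examples distinct; a fixed seed is used here only so that repeated runs build the same prompt.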

3.3. Experiment Results

Figure 1. Graphical summaries of model performance and predictions.
(a) Accuracy by models and conditions. Error bars are 95% CIs.
(b) Proportion of edits predicted to be biased. Error bars are 95% CIs.

All models performed poorly and prompts made little difference (Table 1; Figure 1(a)). Across all conditions, the accuracy was 0.58. Averaged across prompt conditions, ChatGPT 3.5 performed the worst with accuracy at 0.55 and GPT-4 performed the best at 0.61. The top combination was GPT-4 with a zero-shot NPOV prompt (0.63). For each of the models, there were no statistically significant differences in accuracy between prompt conditions, based on p-values from two-tailed permutation tests. Models were more accurate for biased edits (0.63, 95% CI = [0.61, 0.65]) than neutral edits (0.53, 95% CI = [0.51, 0.55]), p < 0.001 from two-tailed permutation tests.
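For readers unfamiliar with permutation tests, the following is a minimal sketch of a two-tailed test for a difference in accuracy between two groups of classifications. The test statistic (absolute difference in mean accuracy), the number of permutations, and the function name are assumptions; the paper does not specify these details in this section.

```python
import random


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-tailed permutation test for a difference in mean accuracy.

    a, b: sequences of 0/1 indicators (1 = correct classification).
    Returns the share of label shufflings whose absolute difference in
    means is at least as large as the observed one (with +1 smoothing).
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group membership at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / len(b))
        if diff >= observed - 1e-12:  # tolerance for float comparison
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)
```

Because the null distribution is built by shuffling the observed outcomes, the test makes no distributional assumptions beyond exchangeability of the two groups under the null.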

Table 1. Accuracy of models and prompts. SEs are in parentheses.

| Model | Minimal, ZeroShot | Minimal, FewShot | NPOV, ZeroShot | NPOV, FewShot | NPOV-Scoped, ZeroShot | NPOV-Scoped, FewShot |
| ChatGPT 3.5 | 0.57 (0.029) | 0.56 (0.029) | 0.53 (0.029) | 0.55 (0.029) | 0.53 (0.029) | 0.57 (0.029) |
| Mistral-Medium | 0.57 (0.029) | 0.61 (0.028) | 0.60 (0.029) | 0.57 (0.029) | 0.59 (0.029) | 0.58 (0.029) |
| GPT-4 | 0.59 (0.028) | 0.61 (0.028) | 0.63 (0.028) | 0.59 (0.028) | 0.61 (0.028) | 0.60 (0.028) |
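The standard errors in Table 1 are consistent with the usual binomial standard error for a proportion estimated from roughly 300 classifications per cell. The sketch below assumes that formula and n = 300; the paper does not state how the SEs were computed, and cells with a few non-compliant annotations would have slightly smaller n.

```python
import math


def binomial_se(accuracy, n):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


# Best cell in Table 1: GPT-4, NPOV definition, zero-shot.
print(round(binomial_se(0.63, 300), 3))  # 0.028
```

With accuracies near 0.5 to 0.6, the SE barely changes across cells, which is why every entry in the table rounds to 0.028 or 0.029.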

We ran a logistic regression (Appendix Table 7) modeling accuracy as a function of definitions, few-shot or zero-shot, model, topic, edit word count, and normalized edit distance tertile. Normalized edit distance is the edit distance between NPOV-violating and compliant edits, divided by the longer string’s length. Higher values indicate that it took more characters to bring the NPOV-violating edit into compliance, so we may expect AI systems to find these edits easier to detect. Compared to ChatGPT 3.5, GPT-4 (OR = 1.25, 95% CI = [1.09, 1.43]) and Mistral-Medium (OR = 1.15, 95% CI = [1.01, 1.31]) were more accurate. Definitions and few-shot examples were not significant. Edit word count was weakly associated with higher accuracy (OR = 1.10, 95% CI = [1.02, 1.18]). Edits with medium (OR = 1.86, 95% CI = [1.60, 2.16]) and high (OR = 1.43, 95% CI = [1.21, 1.70]) edit distance were classified more accurately than edits with low edit distance, suggesting models were better at detecting violations that required more characters to correct. We also found that providing examples tempered ChatGPT 3.5’s predictions. Across zero-shot conditions, ChatGPT 3.5 labeled 83% of edits as biased, but across few-shot conditions it labeled 59% as biased (two-tailed permutation test, p < 0.001).
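The normalized edit distance defined above can be sketched as follows. The paper specifies only "edit distance" divided by the longer string's length; using character-level Levenshtein distance is an assumption here, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Character-level Levenshtein distance via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(biased, neutral):
    """Edit distance divided by the length of the longer string (0.0 to 1.0)."""
    longest = max(len(biased), len(neutral))
    if longest == 0:
        return 0.0
    return levenshtein(biased, neutral) / longest
```

Dividing by the longer string's length bounds the measure between 0 and 1, so short and long edits can be binned into the same tertiles.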

3.4. LLM Self-Optimizations

We experimented with more advanced reasoning techniques and LLM self-optimizations (see Appendix B for these LLM-assisted prompts). First, we selected our top model (GPT-4) and the NPOV and NPOV-Scoped prompts to use with chain-of-thought (CoT) reasoning and DSPy’s ‘BootstrapFewShot’ module. In CoT, a model is instructed to reason step by step to get to its answer, which can improve performance (wei_chain–thought_2023, ). DSPy (khattab_dspy_2023, ) is a state-of-the-art framework for self-improving prompt pipelines. In DSPy’s BootstrapFewShot module, the LLM learns rationales for labeled instances in a training phase. Then on the test set, instead of simply seeing an example and a label (standard few-shot), the LLM also sees these ‘bootstrapped’ rationales when making classifications. For both prompts, we tested 300 edits with a 70-30 train-test split and 20-shot bootstrapped rationales. (We doubled the number of examples from our 10-shot experiment to amplify the effect of LLM-generated rationales.) The 20-shot augmented NPOV-Scoped prompt achieved 64% accuracy on the held-out set (1 percentage point higher than, but statistically indistinguishable from, the best unaugmented prompt); the 20-shot augmented NPOV prompt achieved 62% accuracy. We also experimented with an LLM optimizing its prompt on its own. To do this, we used the COPRO module from DSPy (khattab_dspy_2023, ). Briefly, a model is initialized with a minimal prompt, and then in a training phase the model repeatedly generates and refines the prompt based on performance on a subset of examples. We similarly used 300 examples with a 70-30 train-test split. The top AI-generated prompt from the training phase achieved a test-set accuracy of 61%. For the rest of the detection analysis, we analyze the results from our original experiments.

Figure 2. Comparison of model performance using confusion matrices and binomial distribution tests.
(a) Confusion matrices of models. Models showed different tendencies in the errors they made.
(b) Testing whether models have unbalanced errors. We simulated a null hypothesis of evenly divided errors using a binomial distribution, and then compared each model’s observed error balance to this distribution. We rejected the null hypothesis of balanced errors for ChatGPT 3.5 and Mistral-Medium, but not for GPT-4. For more details, see Appendix C.1.

3.5. Model-Level Analysis

We based our evaluation on a balanced dataset of neutral and biased edits, but in real-world applications models would encounter unbalanced data, so we examined the error distribution of each model. Not all models failed the same way (Figures 2(a) and 2(b)). ChatGPT 3.5 was far more likely than other models to predict that edits were biased when they were neutral. Mistral-Medium erred in the opposite direction, over-predicting neutral edits. GPT-4 was balanced. We assessed the statistical significance of these patterns with two-tailed binomial tests (Appendix C.1; Figure 2(b)), rejecting a null hypothesis of balanced errors for ChatGPT 3.5 and Mistral-Medium but not GPT-4. Note that prompts and specific edits were held constant, so these differences in model predictions arise from the model itself. Our results suggest that prompting may not be enough: if models have idiosyncrasies, these may persist through different prompts.
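The balanced-error test can be sketched as an exact two-tailed binomial test on the split between false positives and false negatives. This is an illustrative reimplementation under that assumption, not the code behind Appendix C.1, and the counts in the usage line are hypothetical.

```python
from math import comb


def binomial_two_tailed_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-tailed binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than the
    observed count k under Binomial(n, p). Suitable for n up to roughly a
    thousand; larger n would need log-space arithmetic.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small relative tolerance so ties are not lost to floating-point noise.
    total = sum(x for x in pmf if x <= observed * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical example: 70 of a model's 100 errors are false positives.
p_value = binomial_two_tailed_p(k=70, n=100)
```

A small p-value here would indicate that a model's errors lean toward one class more than an even split would predict.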

3.6. Edit-Level Analysis

Figure 3. Edit difficulty was bimodal and models were more accurate for biased edits.

We calculated the edit-level probability of correct classification (POCC) to identify which edits are easier to detect (Figure 3). POCC is the probability of a model correctly classifying an edit across all models and prompts. POCCs were bimodal (Figure 3), indicating that edits tend to be either easy or hard to classify.
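As a minimal sketch of the POCC computation, the snippet below averages correctness for each edit over every (model, prompt) run. The record layout and identifiers are assumptions made for illustration.

```python
from collections import defaultdict


def pocc(records):
    """Return {edit_id: share of correct classifications}.

    `records` is an iterable of (edit_id, model, prompt, is_correct) tuples,
    one per classification attempt. The tuple layout is hypothetical.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for edit_id, _model, _prompt, is_correct in records:
        total[edit_id] += 1
        correct[edit_id] += int(is_correct)
    return {edit_id: correct[edit_id] / total[edit_id] for edit_id in total}


# Hypothetical runs: edit "e1" is easy, edit "e2" is hard.
example = [
    ("e1", "GPT-4", "NPOV", True),
    ("e1", "ChatGPT 3.5", "NPOV", True),
    ("e2", "GPT-4", "NPOV", False),
    ("e2", "ChatGPT 3.5", "NPOV", True),
]
scores = pocc(example)
```

Quartiles of these per-edit scores would then separate the 'easy' and 'hard' edits discussed next.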

We conducted a qualitative analysis of ‘easy’ vs. ‘hard’ biased edits (i.e., top or bottom POCC quartile). Top-POCC biased edits tend to have some highly subjective word that alerts models these edits are biased. See below (emphasis added).

  • “one of the central characters of the novel, akili kuwale, provides a brilliant demonstration of this change and its implications, together with excellent characterization.”

  • “environmentalists complain that before shipbreaking began there in june 1983 the beach at alang was pristine and unspoiled.”

  • “colchester has a proud tradition of its citizen volunteers serving in the territorial army.”

By contrast, low-POCC biased edits generally do not contain an overtly subjective adjective.

  • “the bill protects americans against discrimination based on their genetic information when it comes to health insurance and employment.”

  • “confucianism (; pinyin: rxu ; literally means ”the school of the scholars”, see also names for confucianism for details) is an east asian ethical and philosophical system derived from the teachings of the early chinese sage confucius.”

  • “they can see far over the great plains of illinois and across lake michigan on a clear day.”

3.7. Explanation-Level Analysis

We examined if interpretable features of LLM explanations correlate with accuracy (Appendix C.2 for variables and standardized logistic regression coefficients). We focused on three features:

Increased word count (OR = 0.80, 95% CI = [0.74, 0.86]) and more conjunctions (OR = 0.91, 95% CI = [0.85, 0.96]) correlate with lower accuracy. Higher semantic similarity to the prompt (OR = 1.15, 95% CI = [1.07, 1.23]), NPOV-related words (OR = 1.12, 95% CI = [1.05, 1.20]), and emotion words (OR = 1.09, 95% CI = [1.03, 1.15]) correlate with higher accuracy. These results suggest that the depth of reasoning alone does not predict accuracy. Instead, LLMs performed better when their rationales incorporated specific instructions and referenced relevant content. However, due to a low pseudo-$R^2$, we consider these findings preliminary.

Figure 4. Words in an explanation with the most negative and most positive logit coefficients after a TF-IDF logistic regression predicting accuracy. Positive coefficients are associated with accuracy.

We next conducted a TF-IDF logistic regression (Figure 4) to understand what specific words in an explanation correlate with accuracy. We used an 80-20 train-test split over {Explanation, Is Accurate} tuples and 5-fold cross-validated grid search to tune hyperparameters. The best model achieved 0.8 accuracy, 0.8 precision, and 0.79 recall. Explanations that referenced highly subjective words, such as ‘sadly’, were associated with accuracy. Occasionally, models also flagged edits as biased when they contained a subjective-sounding word even though the edit itself was not biased. As an example, models misclassified the following edit as biased (emphasis added):

the advertisement called obama a hypocrite for not supporting armed guards in schools while noting that the children of the us president receive special protection by armed agents of the us secret service.

In this edit, the writer is referencing the advertisement—but LLMs flagged the edit as biased even though it was factually recalling what an ad was saying. Here is GPT-4’s explanation:

The edit uses loaded language by calling Obama a ‘hypocrite’ which is an opinion stated as a fact, and also uses judgmental language which violates the neutral point of view policy.

These dynamics suggest LLMs relied (sometimes to a fault) on simple heuristics like the presence of a highly subjective adjective. It is worth noting that LLMs did effectively identify the subjective-sounding part of the text in many of these false positive cases. This suggests that general-purpose LLMs may be able to neutralize biased text—but possibly at the cost of precision. That is what we find in the next section.

4. LLM Neutrality Generation: Computational Evaluation

We evaluated LLMs’ ability to generate unbiased edits from biased ones using both automated computational metrics and human experiments. Our computational evaluation revealed that AI editors are high-recall but low-precision compared to Wikipedia editors. LLMs generally neutralize non-neutral words but also make many more changes than humans do. We also found that models apply NPOV differently from human editors. While human editors make more removals than additions to neutralize edits, general-purpose LLMs have the opposite tendency.

4.1. Experiment Setup

We used our top-performing model (GPT-4) and the NPOV-Scoped prompt to neutralize biased Wikipedia edits. (We used this prompt because GPT-4 performance did not differ significantly between prompt conditions, and the dataset contains violations of neutral language in particular.) Specifically, we instructed the model to revise an NPOV-violating edit, varying two factors. We varied (Factor 1) whether we conducted this generation using zero-shot reasoning or Constitutional AI (CAI). In the latter, the model first critiqued why an edit violated NPOV and then used that critique to revise the edit. Since initial results suggested LLMs changed more words than human editors, we experimented with (Factor 2) instructions to only edit what was necessary. We conducted this generation on 200 edits, varying both prompt type (CAI or Zero-Shot) and whether to add constraining instructions (Yes or No), yielding 800 generations. For all conditions, we set the temperature to zero to further reduce extraneous changes.
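The 2 x 2 design over 200 edits can be enumerated as follows. This sketch only builds the condition grid; the dictionary keys are hypothetical, and it does not call any model API.

```python
from itertools import product

PROMPT_TYPES = ("CAI", "ZeroShot")  # Factor 1
CONSTRAINED = (False, True)         # Factor 2: "only edit what is necessary"


def build_conditions(edit_ids):
    """Cross every edit with both factors; temperature is fixed at zero."""
    return [
        {"edit_id": edit_id, "prompt_type": prompt_type,
         "constrained": constrained, "temperature": 0.0}
        for edit_id, prompt_type, constrained
        in product(edit_ids, PROMPT_TYPES, CONSTRAINED)
    ]


runs = build_conditions(range(200))  # 200 edits x 2 x 2 conditions
```

Each edit appears once per condition, which is what allows the paired comparisons reported in the results.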

4.2. Measures

We evaluated both the intensity and accuracy of AI changes. To measure intensity, we computed (1) ‘AI Edit Distance’ as the normalized edit distance between the AI-neutralized edit and the original NPOV-violating edit, (2) the word count of AI edits, and (3) the number of changed words (excluding stopwords) in each edit, obtained by summing additions and deletions (this process is discussed further below). We compared these intensity metrics to analogous metrics for human-modified Wikipedia edits to determine if AI makes more or fewer changes than human editors.

We evaluated the accuracy of AI edits by comparing them to Wikipedia editors’ changes, which we treated as the gold standard. Our approach is similar to that of Pryzant et al. (pryzant_automatically_2020, ). We first computed differences (diffs) between the original biased edits ($w_i^{\text{Bias}}$), human-modified edits ($w_i^{\text{ModH}}$), and AI-modified edits ($w_i^{\text{ModAI}}$) using Python’s difflib library. ‘Diffs’ index substrings that were removed or added. We focus on removals as a measure of neutrality adjustment. After preprocessing diffs (lowercasing, stripping punctuation, removing stopwords), we defined sets $A$ (words removed by AI) and $B$ (words removed by humans). From these, we define true positives (TP) as words removed by both AI and humans ($|A \cap B|$); false positives (FP) as words removed by the AI that were not removed by humans ($|A - B|$); and false negatives (FN) as words removed by humans that were not removed by AI ($|B - A|$). We calculated various metrics based on these values (see Table 2). We also computed the BLEU score (a standard machine translation metric) of AI edits, using human edits as the reference text. We treated undefined precision metrics as 0, undefined recall metrics as missing, and used BLEU-score smoothing from (lin_orange_2004, ). For robustness, we re-analyzed the data under 32 analysis decisions (Appendix Figure 10), yielding results similar to those in Table 4.
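The removal-based metrics can be sketched with difflib as follows. This is a simplified illustration rather than our pipeline: it tokenizes before diffing instead of preprocessing the diffs afterwards, the stopword list is a small placeholder, and the human and AI rewrites in the example are hypothetical.

```python
import difflib
import string

# Placeholder stopword list for illustration only.
STOPWORDS = frozenset({"a", "an", "the", "of", "and", "to", "in", "is", "its", "that"})


def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()


def removed_words(original, revised):
    """Set of non-stopword tokens present in `original` but dropped in `revised`."""
    orig_tokens, rev_tokens = tokenize(original), tokenize(revised)
    matcher = difflib.SequenceMatcher(a=orig_tokens, b=rev_tokens, autojunk=False)
    removed = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            removed.update(orig_tokens[i1:i2])
    return removed - STOPWORDS


def removal_metrics(biased, human, ai):
    """Compare AI removals (A) against human removals (B)."""
    a, b = removed_words(biased, ai), removed_words(biased, human)
    tp, fp, fn = len(a & b), len(a - b), len(b - a)
    union = len(a | b)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # undefined -> 0
        "recall": tp / (tp + fn) if tp + fn else None,    # undefined -> missing
        "similarity": tp / union if union else None,
        "non_disjoint": int(tp > 0),
    }


biased = "colchester has a proud tradition of its citizen volunteers serving in the territorial army."
human = "colchester has a tradition of its citizen volunteers serving in the territorial army."  # hypothetical
ai = "colchester has a history of citizen volunteers serving in the territorial army."  # hypothetical
metrics = removal_metrics(biased, human, ai)
```

In this toy case the AI rewrite removes the word the human removed plus one more, so recall is perfect while precision is halved, which mirrors the high-recall, low-precision pattern the metrics are designed to expose.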

Table 2. Description of natural language generation accuracy metrics. These metrics compare AI neutralizations to neutralizations from Wikipedia editors, treating the latter as the gold standard. To distinguish: ‘Similarity’ compares changes from the original edit, while ‘BLEU Score’ compares the final edits’ similarity.
| Metric | Practical Interpretation | Operationalization |
|---|---|---|
| Precision | Does AI change only what is needed? | $\text{TP}/(\text{TP}+\text{FP})$ |
| Recall | Does AI ‘catch’ words that human editors flag? | $\text{TP}/(\text{TP}+\text{FN})$ |
| F1 | A balance (harmonic mean) of precision and recall | $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ |
| Similarity | How similar are AI changes to human ones? | $\lvert A \cap B \rvert / \lvert A \cup B \rvert$ |
| BLEU Score | How similar is AI-generated text to human text? | Common machine-translation metric (papineni_bleu_2002, ) |
| Non Disjoint | Are AI changes at least somewhat accurate? | $1$ if $\lvert A \cap B \rvert > 0$; $0$ otherwise |
Table 3. Statistics of AI edit intensity, reported as mean with SD in parentheses. Edit distance is the normalized edit distance between the NPOV-violating text and the neutralization. ‘N Changes’ is the number of words (excluding stopwords) that the edit changed (i.e., additions plus removals).
| Prompt Type | Is Constrained | Edit Distance | N Changes | Word Count |
|---|---|---|---|---|
| CAI | N | 0.56 (0.17) | 25.88 (16.75) | 47.27 (28.97) |
| CAI | Y | 0.35 (0.17) | 12.94 (9.35) | 30.17 (16.03) |
| ZeroShot | N | 0.29 (0.15) | 10.52 (7.67) | 28.65 (15.26) |
| ZeroShot | Y | 0.18 (0.12) | 6.18 (5.07) | 28.05 (14.79) |
Figure 5. Automated metrics regarding the intensity of edits. The horizontal line is the average for Wikipedian rewrites. Error bars are 95% CIs.

4.3. Experiment Results

Finding 1: AI changes more than humans. LLMs changed more than humans (Figure 5). AI edit distance ($M = 0.34$, $SD = 0.21$) was larger than human edit distance ($M = 0.15$, $SD = 0.13$), $t(799) = 24.06$, $p < 0.001$, Cohen’s $d = 0.85$. Considering both additions and removals, AI made more changes ($M = 13.88$, $SD = 12.91$) than humans ($M = 3.79$, $SD = 4.01$), $t(799) = 22.10$, $p < 0.001$, Cohen’s $d = 0.78$. Driven by CAI, AI edits were longer ($M = 33.53$, $SD = 21.19$) than human edits ($M = 27.53$, $SD = 14.98$), $t(799) = 10.87$, $p < 0.001$, Cohen’s $d = 0.38$. See Appendix Table 9 for regressions and Table 3 for descriptive statistics. CAI makes more changes than zero-shot, and constraining instructions are more effective in reducing changes for CAI than for zero-shot.

Table 4. Comparing the accuracy of AI edits to edits by Wikipedia editors. In general, conditions had higher recall than precision.
| Prompt Type | Is Constrained | Precision | Recall | F1 | Similarity | Non Disjoint | BLEU |
|---|---|---|---|---|---|---|---|
| CAI | N | 0.29 | 0.89 | 0.44 | 0.27 | 0.90 | 0.23 |
| CAI | Y | 0.36 | 0.82 | 0.50 | 0.33 | 0.86 | 0.44 |
| ZeroShot | N | 0.37 | 0.77 | 0.50 | 0.34 | 0.82 | 0.48 |
| ZeroShot | Y | 0.47 | 0.67 | 0.56 | 0.41 | 0.76 | 0.64 |

Finding 2: Humans are more likely to remove while AI is more likely to add. We find different editing patterns for AI vs human editors (Figure 6). AI additions ($M = 8.04$, $SD = 9.33$) were higher than removals ($M = 5.83$, $SD = 4.61$), $t(799) = 8.86$, $p < 0.001$, Cohen’s $d = 0.31$. Meanwhile, human removals ($M = 2.40$, $SD = 2.31$) were higher than additions ($M = 1.39$, $SD = 2.14$), $t(199) = 7.34$, $p < 0.001$, Cohen’s $d = 0.52$. Condition-level dynamics of additions and removals (Appendix Figure 9; Appendix Table 11) are similar to Finding 1.
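As a quick arithmetic check, the effect sizes reported in Findings 1 and 2 are numerically consistent with the paired-samples form of Cohen’s $d$, $d_z = t / \sqrt{n}$, taking $n$ as the degrees of freedom plus one. The sketch below assumes that form, which the text does not state explicitly.

```python
from math import sqrt


def cohens_dz(t_stat: float, n_pairs: int) -> float:
    """Paired-samples effect size recovered from a t statistic: d_z = t / sqrt(n)."""
    return t_stat / sqrt(n_pairs)


# (t statistic, n = df + 1, reported d) taken from Findings 1 and 2.
reported = [
    (24.06, 800, 0.85),  # edit distance, AI vs human
    (22.10, 800, 0.78),  # number of changes, AI vs human
    (10.87, 800, 0.38),  # word count, AI vs human
    (8.86, 800, 0.31),   # AI additions vs AI removals
    (7.34, 200, 0.52),   # human removals vs human additions
]
recovered = [round(cohens_dz(t, n), 2) for t, n, _ in reported]
```

Each recovered value matches the reported one to two decimals, which supports reading the $t$-tests as paired comparisons over the same edits.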

Figure 6. AI tends to neutralize edits via adding words and humans tend to neutralize edits via removing words. Error bars are 95% CIs.

Finding 3: AI changes are high recall but low precision. Across AI conditions, recall (0.79; 95% CI = [0.76, 0.81]) was roughly twice as high as precision (0.37; 95% CI = [0.35, 0.40]). While AI editors remove the content that human editors remove, they also make many other changes that human editors do not. See Table 4 for condition-level breakdowns and Appendix Table 10 for regressions.

5. LLM Neutrality Generation: Human Evaluation

AI editors make more changes than human editors. But what do people prefer? We conducted a controlled experiment to address this question. (This experiment was approved by our university’s IRB.) Participants were shown one edit that initially violated NPOV and two sets of (masked) annotated revisions: one from a Wikipedia editor, and one from an AI condition (Zero-Shot and CAI, both with constraints). Participants rated rewrites on: (1) bias reduction; (2) adding/removing information from the original; (3) fluency. Participants chose AI edits as being more neutral and fluent than human ones.

5.1. Experiment Setup

Participants

We recruited 147 participants through Prolific, a crowdsourcing platform. Each participant completed 10 trials. Participants received $3.33. Pilot tests indicated the experiment would take 11 minutes. This sample size was larger than that required by Orme’s rule of thumb (orme_sample_1998, ) for discrete choice experiments. We restricted our study to users who: were over 18 years old, lived in the United States, completed more than 100 Prolific tasks, and had a 98%+ approval rating. An ideal sample would be a large group of Wikipedians, but this is quite difficult to assemble. Instead, we turn to a sample approximating Wikipedia’s readership—which can inform some central questions of this work but has limitations around others. These limitations are discussed further in subsequent sections.

Procedure

Participants began on a page that described the study. After providing informed consent, they read an introduction to neutral language on Wikipedia. After this introduction, they completed a training module where they rated the neutrality of 3 edits on a scale from -2 to 2 and received feedback and an explanation after each of these examples. After the training module, participants continued to the main task. For each trial, participants saw one original edit and two visually annotated revisions: one from a Wikipedia editor, and one from an AI condition (in a randomized order and masked). 44 sets of the form {Original Biased Edit, Human-Modified Edit, Zero-Shot Modified Edit, CAI-Modified Edit} were used for the experiment. (Before the experiment, we standardized the capitalization of edits and fixed the spacing of punctuation marks in WNC edits. This was informed by participant feedback from smaller pilot studies.) First, participants selected which rewrite most increased neutrality. Second, participants answered whether each rewrite (A) removed information from the original edit (yes/no) and (B) added information to the original edit (yes/no), excluding changes that increased the neutrality of the edit. Third, participants picked which rewrite was most fluent. At the end of the 10 trials, participants were fully debriefed and asked to guess what proportion of the time other participants said the AI edit increased neutrality more than the human edit (to see if AI outperformed lay expectations).

Analysis

For answers involving choosing between two edits (fluency and neutrality), we first conducted binomial tests on whether the proportion of cases a participant chose the AI edit differed from 0.5. We also modeled participants’ selections with a conditional logistic regression (clogit function from the R package, survival), as is common for choice data (shang_dce_2023, ). Analysis approaches agreed. We compared the frequencies of an edit adding or removing information via $\chi^2$ tests.
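For the addition and removal comparisons, a $\chi^2$ test on a 2 x 2 table (edit source by yes/no answer) can be sketched as below. The sketch assumes a Pearson test without continuity correction, which the text does not specify, and the counts in the example are hypothetical.

```python
from math import erfc, sqrt


def chi2_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-squared test for the table [[a, b], [c, d]] (1 degree of freedom).

    Returns (statistic, p_value). With one degree of freedom the statistic is a
    squared standard normal, so the p-value is erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column must have a nonzero total")
    stat = n * (a * d - b * c) ** 2 / denom
    return stat, erfc(sqrt(stat / 2))


# Hypothetical counts: rows = AI vs human edits, columns = "added info" yes vs no.
stat, p_value = chi2_2x2(20, 10, 10, 20)
```

The same function applies to the removal question by swapping in the corresponding yes/no counts.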

5.2. Experiment Results

Figure 7. Experiment results for neutrality and fluency variables.
(a) Participants selected AI edits as being more neutral and fluent than human edits. Odds ratios are from conditional logistic regression models (Appendix Table 12).
(b) Distribution of AI Delta, where AI Delta is the matchup-level [number of raters who selected the AI edit] minus [the number of raters who selected the human edit].
Table 5. Experiment results. For fluency and neutrality, participants chose between an AI and a human edit. P-values are computed using two-tailed binomial tests for whether the probability of picking an AI edit differs from 0.5. For additions and removals, participants evaluated each of the AI and human edits separately but were shown both at the same time. This table reports human addition and removal data aggregated over both matchups. P-values are from chi-squared tests on whether human vs AI edits differed in frequencies of adding or removing information. ‘Delta’ is the AI proportion minus the human proportion. Stars: n.s. p > 0.05, *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001.
| Question Type | Comparison | AI proportion | Human proportion | Delta |
|---|---|---|---|---|
| Neutrality | Human vs ZeroShot | 0.70 | 0.30 | 0.40**** |
| Neutrality | Human vs CAI | 0.69 | 0.31 | 0.38**** |
| Fluency | Human vs ZeroShot | 0.65 | 0.35 | 0.30**** |
| Fluency | Human vs CAI | 0.57 | 0.43 | 0.14*** |
| Add | Human vs ZeroShot | 0.27 | 0.28 | -0.02 (n.s.) |
| Add | Human vs CAI | 0.34 | 0.28 | 0.06** |
| Remove | Human vs ZeroShot | 0.41 | 0.40 | 0.01 (n.s.) |
| Remove | Human vs CAI | 0.42 | 0.40 | 0.02 (n.s.) |

Label Quality. Participants completed the task in a median of 19 minutes ($M = 21$, $SD = 9$). The median correctness for the three training questions was 2 ($M = 2$, $SD = 1$). Agreement was generally high. The proportion of respondents agreeing with the majority answer for each question was as follows: neutrality ($Mdn = 0.75$, $M = 0.74$, $SD = 0.13$), fluency ($Mdn = 0.68$, $M = 0.69$, $SD = 0.13$), additions ($Mdn = 0.72$, $M = 0.72$, $SD = 0.12$), and removals ($Mdn = 0.64$, $M = 0.66$, $SD = 0.11$). We conducted robustness checks (Appendix Table 15) where we re-analyzed data after excluding participants with duration below the first quartile and including only those participants who got every training module question correct. Main results are highly similar to those from the full dataset. See Appendix E.1 for participant demographics.

Neutrality. See Table 5 and Figure 7(b) for raw data and Figure 7(a) for logistic regression odds ratios. Participants rated AI edits as more neutral than human edits in 70% of zero-shot choices (95% CI = [67, 73]) and 69% of CAI choices (95% CI = [66, 72]), $p < 0.0001$ for both by two-tailed binomial tests and conditional logistic regressions. After our initial (surprising) results, we conducted a pilot study to rule out the possibility that the preference for AI edits was simply an artifact of forcing participants to choose between edits rather than allowing a “both equal” option. We find the proportional gap (probability of choosing the AI-generated edit over the human one) replicated, but in 21% of cases, the options were rated as comparable (Appendix Table 14). At the end of our experiment, participants guessed how often other participants selected the AI edit as more neutral. The actual rate of AI neutrality preference (70%) was generally higher than participants’ forecasts ($M = 58\%$, $SD = 16$), but 75% of participants predicted that AI edits would be chosen over 50% of the time (Appendix Figure 11).

Fluency. Participants rated AI edits as more fluent (Table 5; Figure 7) than human edits in 65% of zero-shot choices (95% CI = [61, 68]) and 57% of CAI choices (95% CI = [53, 60]), $p < 0.001$ for both by two-tailed binomial tests and conditional logistic regression.

Additions and removals. Participants evaluated whether each rewrite added or removed information from the original edit, excluding changes that increased NPOV compliance (Table 5). Participants were shown human and AI edits simultaneously but answered questions for each separately. We analyzed judgments of human edits in two ways. First, aggregating across both AI matchups, we found CAI edits added information more frequently (34%, 95% CI = [31, 38]) than human edits (28%; 95% CI = [26, 31]), $\chi^2$ test, $p < 0.001$. However, when evaluating human edits alongside CAI edits, participants reported a higher information addition rate (33%, 95% CI = [30, 37]) than when evaluating human edits alongside zero-shot edits (23%; 95% CI = [20, 26]). This led to a directionally similar but not statistically significant difference between human and CAI additions at the matchup level (Appendix Table 13). High-addition CAI edits may have heightened participants’ awareness or changed perceptions of additions in human edits.

6. LLM Neutrality Generation: Qualitative Analysis

To reconcile LLMs’ low detection performance with crowdworkers’ high preference for AI edits, we conduct a qualitative analysis of AI-generated vs human edits to better understand differences. We discuss cases (Figure 8) that exemplify recurring patterns and significant implications that emerged upon inspection.

AI is high-recall and low-precision compared to Wikipedia editors; models generally change the text that editors change, and much more. See Cases 4 (Figure 8(e)), 5 (Figure 8(b)), and 9 (Figure 8(c)). In each case, the human editor made a few changes while both AI editors made many. Many of the AI changes went beyond NPOV-violating words: LLMs also made various grammatical and stylistic tweaks that may have influenced participants’ judgments. We call the sum of these changes ‘NPOV+’.

In some cases, the AI model correctly found the NPOV-violating string and then replaced it with one that was no better. In Case 1 (Figure 8(a)), Zero-Shot and the human editor essentially agree. However, CAI removed the word “noted” and added a long text string that is just as NPOV-violating as “noted”.

In other cases, AI models arguably applied NPOV more faithfully than human editors, possibly due to varying community norms on what would constitute neutrality. Case 10 (Figure 8(d)) is particularly interesting. The original edit said herbal medicine “can treat” conditions. A Wikipedia editor, possibly from the herbal medicine community, changed the edit to say the clinic “treats” (implying more certainty in herbal medicine’s abilities). In contrast, the LLM edits reflected a degree of doubt in herbal medicine, with CAI changing the edit to say that the clinic “claims to treat” different health conditions and Zero-Shot saying the clinic “provides services” (with the possible implication that such services may not be effectively rendered). Both AI edits stop short of saying that the herbal medicine clinic actually does treat health conditions, which is what the human NPOV editor ‘corrected’ the edit to. Another example of LLM edits removing more biased content can be seen in Case 9 (Figure 8(c)), where the AI editor removes certain words like “gimmick” that human editors retained.

Figure 8. Comparing AI neutralizations to Wikipedia editor neutralizations. Panels: (a) Case 1, (b) Case 5, (c) Case 9, (d) Case 10, (e) Case 4.

7. Discussion

LLMs are increasingly used in communities with their own rules and guidelines. Is providing these rules to LLMs enough for them to replicate community moderation decisions? We evaluated general-purpose LLMs on their ability to (1) detect edits that violated Wikipedia’s Neutral Point of View (NPOV) and (2) generate NPOV-compliant edits from NPOV-violating edits.

Takeaway 1: Large language models largely failed at detection.

Our results provide insight into the capabilities and limitations of LLMs in applying community-specific neutrality norms. Across models, prompts, reasoning strategies, and self-optimizations, performance was low, even with targeted prompts that directly incorporated Wikipedia’s guidelines. LLMs are trained on Wikipedia data, so this poor performance occurred despite potential data leakage. Low detection performance suggests that applying subtle guidelines to real-world cases is difficult for today’s general-purpose models. Notably, the largest model did the best. Perhaps even larger models would do better. Alternatively, low performance may be due to a more fundamental aspect of LLMs. For example, it may be that LLMs effectively ‘over-learn’ a notion (e.g., neutrality) from broad web corpora, and specializing this notion to a community’s norms requires changing a model’s parameters. In any case, our study suggests two future directions. First, fine-tuning may improve performance, although it may also risk overfitting (ma_adapting_2024, ). Second, more advanced use of retrieval-augmented generation (RAG) may help. Can incorporating conversations from Wikipedia Talk pages or other Wikipedia data simulate the community knowledge that editors have?

However, we must view LLM performance in the context of what is, essentially, a hard task. NPOV edits are typically made by senior editors (pryzant_automatically_2020, ). And crowdworkers could only guess the biased word in a Wikipedia edit 37% of the time (recasens_linguistic_2013, ). General-purpose LLMs might be better than random individuals at NPOV detection but worse than expert editors.

Takeaway 2: Different models had their own biases.

Different models exhibited distinct biases. ChatGPT 3.5 was far more likely to over-predict edits as biased, while Mistral-Medium had the opposite tendency. Contrasting failure modes suggest that pre-trained LLMs may internalize distinct priors about what constitutes neutrality. These distinctions highlight one other shortcoming of relying on LLMs for such tasks. Models get introduced and updated rapidly, and our understanding of how LLMs work can get outdated just as rapidly. As language models become more ubiquitous, understanding these idiosyncratic biases will be crucial. Beyond understanding these biases, they may also be utilized. Considering Wikipedia NPOV specifically, editors engaging with opposing views play a role in maintaining neutrality (greenstein_ideology_2021, ). Perhaps a multi-agent system simulating ‘debates’ (irving_ai_2018, ; khan_debating_2024, ) can better apply nuanced rules than an individual agent in isolation.

Takeaway 3: LLMs may apply rules in different ways from humans.

AI editors neutralize text differently from human editors. Computational metrics indicate that LLMs typically remove the NPOV-violating words that human editors remove (79% average recall), but they also make many other changes (37% average precision). Furthermore, while Wikipedians were more likely to neutralize text through deletions than additions, AI editors showed the opposite pattern. This divergence suggests LLMs may internalize and operationalize rules like NPOV in fundamentally different ways compared to a community’s human experts. Different, however, is not necessarily better or worse. One clear risk is that AI moderators can reduce the cohesiveness of community content. Then again, LLMs may also inspire new practices. The fact that LLMs did not apply rules like human experts represents both a risk (discord) and an opportunity (innovation). Future work can explore innovative uses of LLMs for content moderation that go beyond attempting to ‘mimic’ human content moderators.

Takeaway 4: Crowd-workers preferred LLM neutralizations to Wikipedia-editor neutralizations.

We find that crowdworkers prefer AI edits over human edits on both fluency (61% of choices) and neutrality (70% of choices). We emphasize that participants were not Wikipedia editors. Their judgments may not match those of the people who would actually be using these systems for content moderation. We view crowdworkers as more representative of Wikipedia readership than editors, and our results should be interpreted in this context.

We hypothesize that our human evaluation findings are driven by both model size and reinforcement learning from human feedback (RLHF). First, LLMs have been trained on large corpora to predict the highest-likelihood next token (zheng_secrets_2023, ), and so their output is often highly grammatically correct or readable—which is not always the case for human Wikipedia edits. This tendency to produce “expected” text may have caused participants to evaluate LLM generations more favorably. Relatedly, through RLHF, LLMs have been trained to produce text that people—in general—will like (zheng_secrets_2023, ). And when evaluated by crowdworkers, their generations are preferred to Wikipedia-editor generations. Instruction-tuned models may be highly capable of applying rules in ways that resonate with a broader public, even if these applications differ from community experts. A promising area of future research is instruction-tuning models on expert feedback.

Takeaway 5: Qualitative analysis showed LLMs are ‘NPOV+’.

To reconcile LLMs’ low detection performance with high generation evaluations, we conducted a qualitative analysis of LLM generations. Qualitative analysis confirmed what computational experiments found: LLMs are high-recall neutralizers. In most cases, they did change the words that human editors changed. And we also found cases in which LLMs changed words human editors arguably should have changed. If a community prioritizes recall over precision, then LLM generations are highly valuable.

But LLMs also make many changes, and these changes are not always related to NPOV (e.g., grammar or readability). Participants rated CAI-generated rewrites as adding unnecessary information more frequently (34%) than human rewrites (28%). However, this difference was relatively small compared to the large AI-human gap in changes measured by edit distance and diffs, suggesting that LLMs typically make many small and unnecessary tweaks rather than add new information. If deployed, we do not know how volunteer editors would react to AI systems making unnecessary edits to their content. This risks engendering a loss of agency, decreasing stylistic variation, and increasing moderator burden if moderators need to constantly check that the LLM changes are not hallucinations.

Mixed-Initiative Content Moderation Systems

Our findings suggest promise for LLMs in mixed-initiative community moderation systems. LLMs performed better at generation than detection, so they can create ‘first drafts’ of texts flagged by human moderators as norm-violating. However, LLMs’ poor detection performance limits their usability. Generating norm-compliant text once a violation is identified is valuable, but it is impractical to have LLMs rewrite every text in a community. Effective detection would therefore amplify the usefulness of effective generation. Future work could explore hybrid systems that leverage LLM generation abilities while reducing reliance on costly human judgment for detection. For instance, models could monitor Wikipedia Talk pages for comments suggesting NPOV violations. Based on automated triggers, these models could then offer neutralizations. This approach would use indirect human input to address LLMs’ detection shortfalls.

Implications for Stakeholders

Here, we summarize some of the practical implications of our study.

For Wikipedia
  • Poor NPOV Detection: Do not expect current LLMs to accurately detect NPOV violations on their own, though LLM predictions may help moderators prioritize what to check.

  • High Recall, Low Precision: LLMs are high-recall but low-precision rule-followers. Communities should be aware of the increased labor burden to verify LLM additions and the potential loss of agency due to automated rewrites.

  • LLM Strengths: Models performed better at generation than detection, suggesting possible use in brainstorming ‘first drafts’.

  • Rule Application: LLMs applied rules differently than experts but were preferred by laypeople, indicating potential trade-offs between satisfying Wikipedia readers and contributors.

For Model Builders
  • Improving Detection: Retrieval-augmented generation (integrating Wikipedia’s Talk pages) that provides models with additional context for labels or multi-agent systems (simulating Wikipedia-style debates) may enhance norm violation detection capabilities.

  • More Precise Constitutional AI: Constitutional AI, in particular, made many extraneous changes, indicating a need for more refined setups or multi-objective functions that put more weight on content preservation.

  • Expert Alignment: Fine-tuning models on expert feedback, rather than lay judgments, can increase alignment with domain specialists.

For Platform & Tech Policy
  • Cannot ‘Set and Forget’: LLMs’ tendency to make extraneous edits necessitates human oversight to ensure they do not overstep their instructions.

  • Stakeholder Evaluations Essential: LLMs may generate content that appeals to the general public but not the intended community, underlining the importance of stakeholder input in evaluation.

8. Limitations

Our study has limitations that can inform future work. First, we focused on Wikipedia’s neutrality guidelines, which are an important example of community-specific neutrality norms given Wikipedia’s wide readership and clearly articulated norms. However, Wikipedia is just one community. Second, the Wikipedia Neutrality Corpus, the dataset we used, may not represent all types of NPOV violations. It focuses on violations of neutral language, which is useful for our purposes since detecting these violations does not require access to external information. Future research can explore LLM performance on other types of NPOV violations. Third, we tested three language models (GPT-4, ChatGPT 3.5, and Mistral-Medium), so we cannot generalize our results to all language models. We chose these models because GPT-4 is state-of-the-art, ChatGPT 3.5 is widely used, and Mistral-Medium is from an open-source developer. Fourth, we did not explore the impact of fine-tuning on the results, as our goal was to understand to what extent these models can be steered without additional resources. Fifth, no prompt experiment can rule out that some untested prompts may have worked better, though we explored a diverse range of prompting strategies. Our main experiment centered on providing LLMs with the text of NPOV, which is a reasonable prompt strategy. We then systematically ablated examples and definitions. Subsequent experiments incorporated chain-of-thought reasoning, LLM-generated rationales, and LLM self-generated prompts. We encourage future research to explore additional techniques, such as more sophisticated few-shot example selection and retrieval-augmented generation. Sixth, our dataset is noisy: edits appear without the full article context, and we use the judgments of a single editor to label data as NPOV-violating (another editor may disagree). This noise likely deflates LLM performance.
Seventh, future work can better optimize generations to match human behavior, for example by generating multiple neutralizations and then selecting the one with the fewest changes. Eighth, participants in our human subject experiments were not Wikipedia editors, though their judgments likely approximate those of Wikipedia’s readership. Future work could explore these dynamics with Wikipedia editors. Despite these limitations, our work offers a detailed analysis of the NPOV detection and generation ability of general-purpose LLMs.
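
The candidate-selection idea raised in the seventh limitation can be sketched as follows. This is an illustrative assumption about how such a selection step might look, not something we implemented: it uses a word-level Levenshtein distance as the measure of change, the function names are hypothetical, and it presumes every candidate has already been judged neutral by some other means.

```python
from typing import List


def word_edit_distance(a: str, b: str) -> int:
    """Word-level Levenshtein distance between two strings."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            # Cheapest of deletion, insertion, or substitution/match.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def least_changed(original: str, candidates: List[str]) -> str:
    """Return the candidate rewrite that changes the fewest words."""
    return min(candidates, key=lambda c: word_edit_distance(original, c))


original = "hydrogen pipelines are unfortunately more expensive than even long-distance electric lines."
# Hypothetical model outputs, for illustration only.
candidates = [
    "hydrogen pipelines cost more than long-distance electric lines.",
    "hydrogen pipelines are more expensive than even long-distance electric lines.",
]
best = least_changed(original, candidates)
```

Here the second candidate is selected because it differs from the original by a single deleted word, whereas the first requires four word-level edits.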

9. Conclusion

On the one hand, simply exposing models to high-level principles was insufficient for models to reproduce the judgments of community members in particular cases. Even when models correctly classified edits, this was correlated with the presence of a single ‘giveaway’ adjective. On the other hand, LLMs were able to apply NPOV for generation. LLMs generally changed words that Wikipedia editors changed, and LLM edits were also preferred over Wikipedian edits by laypeople. This may suggest a divergence between following instructions in general and applying them to particular cases like community members do. We suggest that even when articulating principles is easy, having LLMs apply them to specific cases—like community members do—may still be hard.

References

  • (1) Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N., Mercado, N., DasSarma, N., Lasenby, R., Larson, R., Ringer, S., Johnston, S., Kravec, S., Showk, S. E., Fort, S., Lanham, T., Telleen-Lawton, T., Conerly, T., Henighan, T., Hume, T., Bowman, S. R., Hatfield-Dodds, Z., Mann, B., Amodei, D., Joseph, N., McCandlish, S., Brown, T., and Kaplan, J. Constitutional AI: Harmlessness from AI Feedback, Dec. 2022. arXiv:2212.08073 [cs].
  • (2) Barbarestani, B., Maks, I., and Vossen, P. T. Content Moderation in Online Platforms: A Study of Annotation Methods for Inappropriate Language. In Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying LREC-COLING-2024 (Torino, Italia, May 2024), R. Kumar, A. K. Ojha, S. Malmasi, B. R. Chakravarthi, B. Lahiri, S. Singh, and S. Ratan, Eds., ELRA and ICCL, pp. 96–104.
  • (3) Bérubé, N., Ghiasi, G., Sainte-Marie, M., and others. Wiki-Gendersort: Automatic gender detection using first names in Wikipedia. Publisher: OSF.
  • (4) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems (2020), vol. 33, Curran Associates, Inc., pp. 1877–1901.
  • (5) Bryant, S. L., Forte, A., and Bruckman, A. Becoming Wikipedian: transformation of participation in a collaborative online encyclopedia. In Proceedings of the 2005 ACM International Conference on Supporting Group Work (New York, NY, USA, Nov. 2005), GROUP ’05, Association for Computing Machinery, pp. 1–10.
  • (6) Butler, B., Joyce, E., and Pike, J. Don’t look now, but we’ve created a bureaucracy: the nature and roles of policies and rules in wikipedia. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (New York, NY, USA, Apr. 2008), CHI ’08, Association for Computing Machinery, pp. 1101–1110.
  • (7) Cao, Y. T., Domingo, L.-F., Gilbert, S. A., Mazurek, M., Shilton, K., and Daumé III, H. Toxicity Detection is NOT all you Need: Measuring the Gaps to Supporting Volunteer Content Moderators, Feb. 2024. arXiv:2311.07879 [cs].
  • (8) Elmimouni, H., Forte, A., and Morgan, J. Why People Trust Wikipedia Articles: Credibility Assessment Strategies Used by Readers. In Proceedings of the 18th International Symposium on Open Collaboration (Madrid Spain, Sept. 2022), ACM, pp. 1–10.
  • (9) Gilardi, F., Alizadeh, M., and Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences 120, 30 (July 2023), e2305016120. Publisher: Proceedings of the National Academy of Sciences.
  • (10) Greenstein, S., Gu, G., and Zhu, F. Ideology and Composition Among an Online Crowd: Evidence from Wikipedians. Management Science 67, 5 (May 2021), 3067–3086. Publisher: INFORMS.
  • (11) Greenstein, S., and Zhu, F. Do Experts or Crowd-Based Models Produce More Bias? Evidence from Encyclopedia Britannica and Wikipedia. MIS Quarterly 42, 3 (Mar. 2018), 945–959.
  • (12) Halfaker, A., and Geiger, R. S. ORES: Lowering barriers with participatory machine learning in Wikipedia. Proceedings of the ACM on Human-Computer Interaction 4, CSCW2 (2020), 1–37. Publisher: ACM New York, NY, USA.
  • (13) Hansen, S., Berente, N., and Lyytinen, K. Wikipedia, Critical Social Theory, and the Possibility of Rational Discourse. The Information Society 25, 1 (Jan. 2009), 38–59.
  • (14) Harrison, S. Should ChatGPT Be Used to Write Wikipedia Articles? Slate (Jan. 2023).
  • (15) Irving, G., Christiano, P., and Amodei, D. AI safety via debate, Oct. 2018. arXiv:1805.00899 [cs, stat].
  • (16) Jhaver, S., Birman, I., Gilbert, E., and Bruckman, A. Human-Machine Collaboration for Content Regulation: The Case of Reddit Automoderator. ACM Transactions on Computer-Human Interaction 26, 5 (July 2019), 31:1–31:35.
  • (17) Khan, A., Hughes, J., Valentine, D., Ruis, L., Sachan, K., Radhakrishnan, A., Grefenstette, E., Bowman, S. R., Rocktäschel, T., and Perez, E. Debating with More Persuasive LLMs Leads to More Truthful Answers, May 2024. arXiv:2402.06782 [cs].
  • (18) Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., Miller, H., Zaharia, M., and Potts, C. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, Oct. 2023. arXiv:2310.03714 [cs].
  • (19) Kittur, A., and Kraut, R. E. Harnessing the wisdom of crowds in wikipedia: quality through coordination. In Proceedings of the 2008 ACM conference on Computer supported cooperative work (San Diego CA USA, Nov. 2008), CSCW ’08, ACM, pp. 37–46.
  • (20) Kittur, A., Suh, B., Pendleton, B. A., and Chi, E. H. He says, she says: conflict and coordination in Wikipedia. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (San Jose California USA, Apr. 2007), ACM, pp. 453–462.
  • (21) Kolla, M., Salunkhe, S., Chandrasekharan, E., and Saha, K. LLM-Mod: Can Large Language Models Assist Content Moderation? In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (New York, NY, USA, May 2024), CHI EA ’24, Association for Computing Machinery, pp. 1–8.
  • (22) Kumar, D., AbuHashem, Y. A., and Durumeric, Z. Watch Your Language: Investigating Content Moderation with Large Language Models. Proceedings of the International AAAI Conference on Web and Social Media 18 (May 2024), 865–878.
  • (23) Kumar, S., Correa, C. G., Dasgupta, I., Marjieh, R., Hu, M. Y., Hawkins, R. D., Daw, N. D., Cohen, J. D., Narasimhan, K., and Griffiths, T. L. Using Natural Language and Program Abstractions to Instill Human Inductive Biases in Machines, Feb. 2023.
  • (24) Li, J., Ye, Z., and Xiao, L. Detection of Propaganda Using Logistic Regression. In Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda (Hong Kong, China, Nov. 2019), A. Feldman, G. Da San Martino, A. Barrón-Cedeño, C. Brew, C. Leberknight, and P. Nakov, Eds., Association for Computational Linguistics, pp. 119–124.
  • (25) Lin, C.-Y., and Och, F. J. ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics (Geneva, Switzerland, Aug. 2004), COLING, pp. 501–507.
  • (26) Ma, H., Zhang, C., Fu, H., Zhao, P., and Wu, B. Adapting Large Language Models for Content Moderation: Pitfalls in Data Engineering and Supervised Fine-tuning, Mar. 2024. arXiv:2310.03400 [cs].
  • (27) Matei, S. A., and Dobrescu, C. Wikipedia’s “Neutral Point of View”: Settling Conflict through Ambiguity. The Information Society 27, 1 (Jan. 2011), 40–51. Publisher: Routledge. https://doi.org/10.1080/01972243.2011.534368.
  • (28) McDowell, Z. J., and Vetter, M. A. It Takes a Village to Combat a Fake News Army: Wikipedia’s Community and Policies for Information Literacy. Social Media + Society 6, 3 (July 2020), 2056305120937309. Publisher: SAGE Publications Ltd.
  • (29) Nagar, Y. What do you think? the structuring of an online community as a collective-sensemaking process. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work (New York, NY, USA, Feb. 2012), CSCW ’12, Association for Computing Machinery, pp. 393–402.
  • (30) Nicholson, J. M., Uppala, A., Sieber, M., Grabitz, P., Mordaunt, M., and Rife, S. C. Measuring the quality of scientific references in Wikipedia: an analysis of more than 115M citations to over 800 000 scientific articles. The FEBS journal 288, 14 (2021), 4242–4248. Publisher: Wiley Online Library.
  • (31) Orme, B. Sample size issues for conjoint analysis studies. Sequim: Sawtooth Software Technical Paper (1998).
  • (32) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (Philadelphia, Pennsylvania, USA, July 2002), P. Isabelle, E. Charniak, and D. Lin, Eds., Association for Computational Linguistics, pp. 311–318.
  • (33) Pennebaker, J. W., Chung, C. K., Ireland, M., Gonzales, A., and Booth, R. J. The Development and Psychometric Properties of LIWC2007. Tech. rep., 2007.
  • (34) Pryzant, R., Martinez, R. D., Dass, N., Kurohashi, S., Jurafsky, D., and Yang, D. Automatically neutralizing subjective bias in text. In Proceedings of the AAAI Conference on Artificial Intelligence (Dec. 2020), vol. 34, arXiv, pp. 480–489. Issue: 01.
  • (35) Rawat, C., Sarkar, A., Singh, S., Alvarado, R., and Rasberry, L. Automatic detection of online abuse and analysis of problematic users in wikipedia. In 2019 Systems and Information Engineering Design Symposium (SIEDS) (2019), IEEE, pp. 1–6.
  • (36) Reagle, J. Is the Wikipedia Neutral?, Apr. 2007.
  • (37) Recasens, M., Danescu-Niculescu-Mizil, C., and Jurafsky, D. Linguistic models for analyzing and detecting biased language. In Proceedings of the 51st annual meeting of the Association for Computational Linguistics (volume 1: long papers) (Sofia, Bulgaria, Aug. 2013), H. Schuetze, P. Fung, and M. Poesio, Eds., Association for Computational Linguistics, pp. 1650–1659.
  • (38) Reimers, N., and Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Aug. 2019.
  • (39) Sathe, A., Ather, S., Le, T. M., Perry, N., and Park, J. Automated fact-checking of claims from Wikipedia. In Proceedings of the Twelfth Language Resources and Evaluation Conference (2020), pp. 6874–6882.
  • (40) Schmahl, K. G., Viering, T. J., Makrodimitris, S., Jahfari, A. N., Tax, D., and Loog, M. Is Wikipedia succeeding in reducing gender bias? Assessing changes in gender bias in Wikipedia using word embeddings. In Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science (2020), pp. 94–103.
  • (41) Scott, J. C. Seeing like a state : how certain schemes to improve the human condition have failed. Yale agrarian studies. 1998. Publisher: Yale University Press.
  • (42) Semrush. Top Websites in the World - April 2024 Most Visited & Popular Rankings.
  • (43) Shang, L., and Chandra, Y. DCE Data Analysis Using R. In Discrete Choice Experiments Using R: A How-To Guide for Social and Managerial Sciences, L. Shang and Y. Chandra, Eds. Springer Nature, Singapore, 2023, pp. 157–181.
  • (44) Steinsson, S. Rule Ambiguity, Institutional Clashes, and Population Loss: How Wikipedia Became the Last Good Place on the Internet. American Political Science Review 118, 1 (Feb. 2024), 235–251.
  • (45) Suchman, L. A. Plans and situated actions: The problem of human-machine communication. Cambridge University Press, USA, Nov. 1987.
  • (46) Swarts, J. The collaborative construction of ”fact” on Wikipedia. In Proceedings of the 27th ACM international conference on Design of communication (Bloomington Indiana USA, Oct. 2009), ACM, pp. 281–288.
  • (47) Viégas, F. B., Wattenberg, M., and Dave, K. Studying cooperation and conflict between authors with history flow visualizations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (New York, NY, USA, Apr. 2004), CHI ’04, Association for Computing Machinery, pp. 575–582.
  • (48) Wales, J. Jimmy Wales: The birth of Wikipedia | TED Talk, 2006.
  • (49) Wang, P., and Li, X. Assessing the quality of information on wikipedia: A deep-learning approach. Journal of the Association for Information Science and Technology 71, 1 (2020), 16–28. Publisher: Wiley Online Library.
  • (50) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Jan. 2023.
  • (51) Wikipedia. Core content policies, Dec. 2023. Page Version ID: 1187927074.
  • (52) Wikipedia. Wikipedia: Automated Moderation, 2024.
  • (53) Wikipedia. Wikipedia: WikiTrust, 2024.
  • (54) Wikipedia. Wikipedia:Neutral point of view, June 2024. Page Version ID: 1226843190.
  • (55) Wikipedia. Wikipedia:NPOV tutorial, May 2024. Page Version ID: 1222446643.
  • (56) Zheng, R., Dou, S., Gao, S., Hua, Y., Shen, W., Wang, B., Liu, Y., Jin, S., Liu, Q., Zhou, Y., Xiong, L., Chen, L., Xi, Z., Xu, N., Lai, W., Zhu, M., Chang, C., Yin, Z., Weng, R., Cheng, W., Huang, H., Sun, T., Yan, H., Gui, T., Zhang, Q., Qiu, X., and Huang, X. Secrets of RLHF in Large Language Models Part I: PPO, July 2023.

Appendix A Dataset

Table 6. Number of edits for each topic. The data consists of rewrites of biased edits.
Topic                 Total Edits   Biased Edits   Neutral Edits
Culture                     56296          28148           28148
Geography                   28348          14174           14174
History And Society         20986          10493           10493
STEM                         8722           4361            4361
Assistance                    120             60              60

Appendix B Prompts

B.1. Detection: NPOV

Note: The EXAMPLES part of the prompt is only included for the few-shot conditions.

INSTRUCTIONS
Act like an expert Wikipedia editor. Classify if this Wikipedia edit is neutral or biased.

A Wikipedia edit is biased if it violates Wikipedia’s neutral point of view (NPOV) policy.

Neutral point of view policy:
- Policy 1: Avoid stating opinions as facts.
- Policy 2: Avoid stating seriously contested assertions as facts.
- Policy 3: Avoid stating facts as opinions.
- Policy 4: Prefer nonjudgmental language.
- Policy 5: Indicate the relative prominence of opposing views.

Note that even if it slightly violates the NPOV policy, it is considered biased and not neutral.

EXAMPLES
{example_str}

EDIT
{edit_here}

RETURN
a json like {"response": "biased" or "neutral", "policy_violated": "none" if neutral else policy violated,
"reasoning": reasoning} and nothing else

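As a rough illustration of how a template such as B.1 can be filled and its JSON reply read back, consider the sketch below. The shortened `TEMPLATE`, the helper names, and the zero-shot handling are assumptions made for this example; they are not the code used in our experiments. Note that `str.replace` is used instead of `str.format` because the RETURN line contains literal braces.

```python
import json

# Shortened stand-in for the B.1 template (policy text omitted for brevity).
TEMPLATE = (
    "INSTRUCTIONS\n"
    "Act like an expert Wikipedia editor. Classify if this Wikipedia edit is neutral or biased.\n\n"
    "EXAMPLES\n{example_str}\n\n"
    "EDIT\n{edit_here}\n\n"
    "RETURN\n"
    'a json like {"response": "biased" or "neutral", "policy_violated": "none" if neutral '
    'else policy violated, "reasoning": reasoning} and nothing else'
)


def build_prompt(edit: str, examples: str = "") -> str:
    """Fill the placeholders; drop the EXAMPLES section in the zero-shot case."""
    prompt = TEMPLATE.replace("{edit_here}", edit)
    if examples:
        return prompt.replace("{example_str}", examples)
    return prompt.replace("EXAMPLES\n{example_str}\n\n", "")


def parse_label(raw: str) -> str:
    """Read the 'response' field and reject anything other than the two labels."""
    label = json.loads(raw)["response"].strip().lower()
    if label not in {"biased", "neutral"}:
        raise ValueError(f"unexpected label: {label!r}")
    return label


zero_shot = build_prompt("hydrogen pipelines are unfortunately more expensive.")
# Hypothetical model reply, for illustration only.
label = parse_label('{"response": "Biased", "policy_violated": "Policy 4", "reasoning": "..."}')
```

In practice a model may wrap its JSON in extra text, so a production parser would likely need more forgiving extraction than `json.loads` on the raw reply.
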
B.2. Detection: NPOV Scoped

Note: The EXAMPLES part of the prompt is only included for the few-shot conditions.

INSTRUCTIONS
Act like an expert Wikipedia editor. Classify if this Wikipedia edit is neutral or biased.

A Wikipedia edit is biased if it violates Wikipedia’s neutral point of view (NPOV) policy.

Neutral point of view policy:
- Policy 1: Avoid subjective and judgmental language. Word choice should use a disinterested,
  impartial tone and avoid language that sympathizes with or disparages the subject.
- Policy 2: Avoid phrasing that implies how ‘believable’ a statement is. Word choice can imply
  how believable the author considers a statement to be.
- Policy 3: Avoid stereotypes. Word choice can imply stereotypes about gender, race or other
  demographic categories.
- Policy 4: Avoid presenting opinions as facts. Opinions should be attributed to a person, a
  group of people who hold the opinion, or where appropriate, described as a widespread view.

Note that even if it slightly violates the NPOV policy, it is considered biased and not neutral.

EXAMPLES
{example_str}

EDIT
{edit_here}

RETURN
a json like {"response": "biased" or "neutral", "policy_violated": "none" if neutral else policy violated,
"reasoning": reasoning} and nothing else

B.3. Detection: Minimal

Note: The EXAMPLES part of the prompt is only included for the few-shot conditions.

INSTRUCTIONS
Act like an expert encyclopedia editor. Classify if a given statement is neutral or biased.

Note that even if it slightly violates the NPOV policy, it is considered biased and not neutral.

EXAMPLES
{example_str}

STATEMENT
{edit_here}

RETURN
a json like {"response": "biased" or "neutral", "reasoning": reasoning} and nothing else

B.4. Detection: LLM Bootstrapped Prompt (NPOV-Scoped)

Note: This prompt format follows DSPy’s Chain of Thought implementation.

Act like an expert Wikipedia editor. Classify if this Wikipedia edit is neutral or biased.
A Wikipedia edit is biased if it violates Wikipedia’s neutral point of view (NPOV) policy.
Neutral point of view policy:
- Policy 1: Avoid subjective and judgmental language. Word choice should use a disinterested,
  impartial tone and avoid language that sympathizes with or disparages the subject.
- Policy 2: Avoid phrasing that implies how ’believable’ a statement is. Word choice can imply
  how believable the author considers a statement to be.
- Policy 3: Avoid stereotypes. Word choice can imply stereotypes about gender, race or other
  demographic categories.
- Policy 4: Avoid presenting opinions as facts. Opinions should be attributed to a person, a group
  of people who hold the opinion, or where appropriate, described as a widespread view.
Note that even if it slightly violates the NPOV policy, it is considered biased and not neutral.

---

Follow the following format.

Edit: edit
Reasoning: Let’s think step by step in order to ${produce the answer}. We ...
Answer: neutral or biased

---

Edit: hydrogen pipelines are unfortunately more expensive than even long-distance electric lines.
Reasoning: Let’s think step by step in order to produce the answer. We first identify if the edit uses subjective or
judgmental language. The word "unfortunately" introduces a subjective viewpoint, implying a negative aspect of
hydrogen pipelines compared to electric lines. This word choice violates Policy 1 by not maintaining an impartial tone.
It also subtly presents an opinion
as a fact, which is against Policy 4, as it does not attribute this viewpoint to a specific source or
acknowledge it as a perspective rather than an objective truth. Therefore, the edit does not adhere to the neutral
point of view policy.
Answer: biased

---


(...Omitting more examples for space...)

---

Edit: {EDIT_HERE}
Reasoning: Let’s think step by step in order to


B.5. Detection: LLM Bootstrapped Prompt (NPOV)

Note: This prompt format follows DSPy’s Chain of Thought implementation.

Act like an expert Wikipedia editor. Classify if this Wikipedia edit is neutral or biased.
A Wikipedia edit is biased if it violates Wikipedia’s neutral point of view (NPOV) policy.
Neutral point of view policy:
- Policy 1: Avoid stating opinions as facts
- Policy 2: Avoid stating seriously contested assertions as facts
- Policy 3: Avoid stating facts as opinions
- Policy 4: Prefer nonjudgmental language.
- Policy 5: Indicate the relative prominence of opposing views.
Note that even if it slightly violates the NPOV policy, it is considered biased and not neutral.

---

Follow the following format.

Edit: edit
Reasoning: Let’s think step by step in order to ${produce the answer}. We ...
Answer: neutral or biased

---

Edit: he developed a complex story-line with theological themes, as adam warlock, who had emerged from a kind of virgin
birth, confronted a dictatorial universal church of truth founded by his evil future self.
Reasoning: Let’s think step by step in order to produce the answer. First, we examine if the edit states
opinions as facts. The edit provides a summary of a storyline involving Adam Warlock, a character from
comic books, and describes the plot and themes involved. It does not present opinions as facts but rather summarizes
the content of a fictional work. Second, we check if it states seriously contested assertions as facts. The edit does not
make assertions about real-world events or figures but discusses a fictional narrative. Third, we consider
if it states facts as opinions. The edit does not frame factual information as opinion. Fourth, we evaluate the
use of nonjudgmental language. The language used to describe the storyline and themes is neutral and descriptive
without passing judgment. Fifth, we assess if it indicates the relative prominence of opposing views. Since this edit is
about a specific storyline within a comic book, the concept of opposing views is not applicable in the same way it would
be for a real-world issue. Therefore, based on these considerations, the edit does not violate the NPOV policy.
Answer: neutral

---


(...Omitting more examples for space...)

---

Edit: {EDIT_HERE}
Reasoning: Let’s think step by step in order to


B.6. Detection: LLM Auto-Generated Prompt

Note: This prompt format follows DSPy’s Chain of Thought implementation.

Please meticulously evaluate the content of the indicated Wikipedia edit. Our aim is to determine its alignment with the
principles of neutrality and objective information presentation. An edit falls under "neutral" if it maintains or improves
the essential impartial perspective of the topic by offering well-referenced data, balanced interpretation, and avoids
espousing a specific ideology, viewpoint, or favoritism towards any subject matter or entity portrayed. Conversely,
categorize an edit as "biased" when it skews towards advocating a singular perspective: conspicuously highlighted
by the use or omission of particular terminologies, incorporation or deletion of facts that possess agenda-driven subtexts,
or through nuanced alteration playing to subjective opinions, ideologies, or non-verifiable anecdotes that conspicuously
colour the narrative away from evidentiary neutrality. In making your categorization, lavish detail on linguistic nuances,
presentation styles, and contextual frame shifts evoked by the edit, analyzing its departure from or congruity to
non-partisan, factual, information dissemination aimed at fostering an informed readership.


---

Follow the following format.

Edit: edit
Reasoning: Let’s think step by step in order to ${produce the answer}. We ...
After thorough inspection, the Wikipedia edit’s alignment with neutral-objective editing standards is: neutral or biased

---

Edit: {EDIT_HERE}
Reasoning: Let’s think step by step in order to


B.7. Generation: NPOV Scoped

Note: The CONSTRAINTS section of the prompt is only included in the ‘Is Constrained’ condition.


Given a biased Wikipedia edit, make minimal changes to make this edit neutral according to Wikipedia’s neutral point of view
(NPOV) policy.

Neutral point of view policy:
- Policy 1: Avoid subjective and judgmental language. Word choice should use a disinterested, impartial tone and avoid language
that sympathizes with or disparages the subject.
- Policy 2: Avoid phrasing that implies how ’believable’ a statement is. Word choice can imply how believable the author
considers a statement to be.
- Policy 3: Avoid stereotypes. Word choice can imply stereotypes about gender, race or other demographic categories.
- Policy 4: Avoid presenting opinions as facts. Opinions should be attributed to a person, a group of people who hold the opinion,
or where appropriate, described as a widespread view.

CONSTRAINTS
- Change as few words as possible
- Make minimal changes

EDIT
{EDIT}

RETURN an extractable json like {’neutral_edit’:EDIT}

B.8. Generation: CAI

Note: The CONSTRAINTS section of the prompt is only included in the ‘Is Constrained’ condition.

B.8.1. Critique request

Identify specific ways, writing in one paragraph, in which this Wikipedia edit
violates Wikipedia’s neutral point of view (NPOV) policy.

Neutral point of view policy:
- Policy 1: Avoid subjective and judgmental language. Word choice should use a
disinterested, impartial tone and avoid language that sympathizes with or
disparages the subject.
- Policy 2: Avoid phrasing that implies how ‘believable’ a statement is. Word
choice can imply how believable the author considers a statement to be.
- Policy 3: Avoid stereotypes. Word choice can imply stereotypes about gender,
race or other demographic categories.
- Policy 4: Avoid presenting opinions as facts. Opinions should be attributed
to a person, a group of people who hold the opinion, or where appropriate,
described as a widespread view.

EDIT
{EDIT}

RETURN a json like {’critique’:critique}

B.8.2. Revision request

Please rewrite the wikipedia edit based on this critique.

CRITIQUE
{CRITIQUE}

EDIT
{EDIT}

CONSTRAINTS
- Change as few words as possible
- Make minimal changes

RETURN a json like {’new_edit’:new_edit}
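
The two CAI prompts above chain together as a critique step followed by a revision step. The sketch below shows one way to wire them, under stated assumptions: `call_llm` is a hypothetical callable standing in for whatever model API is used, the prompt strings are shortened stand-ins for B.8.1 and B.8.2, and `fake_llm` returns canned replies purely so the example runs without network access.

```python
import json
from typing import Callable

# Shortened stand-ins for the B.8.1 and B.8.2 prompts.
CRITIQUE_PROMPT = (
    "Identify specific ways, writing in one paragraph, in which this Wikipedia edit "
    "violates Wikipedia's neutral point of view (NPOV) policy.\n\n"
    "EDIT\n{EDIT}\n\nRETURN a json like {'critique':critique}"
)
REVISION_PROMPT = (
    "Please rewrite the wikipedia edit based on this critique.\n\n"
    "CRITIQUE\n{CRITIQUE}\n\nEDIT\n{EDIT}\n\nRETURN a json like {'new_edit':new_edit}"
)


def critique_then_revise(edit: str, call_llm: Callable[[str], str]) -> str:
    """Run one critique-revision pass and return the rewritten edit."""
    # Step 1: request a critique of the edit.
    raw_critique = call_llm(CRITIQUE_PROMPT.replace("{EDIT}", edit))
    critique = json.loads(raw_critique)["critique"]
    # Step 2: request a rewrite conditioned on that critique.
    filled = REVISION_PROMPT.replace("{CRITIQUE}", critique).replace("{EDIT}", edit)
    return json.loads(call_llm(filled))["new_edit"]


def fake_llm(prompt: str) -> str:
    """Canned replies for illustration; a real system would call a model here."""
    if "CRITIQUE\n" in prompt:
        return json.dumps({"new_edit": "hydrogen pipelines are more expensive."})
    return json.dumps({"critique": "The word 'unfortunately' is judgmental."})


revised = critique_then_revise("hydrogen pipelines are unfortunately more expensive.", fake_llm)
```

Because the revision step sees only the critique and the edit, any over-broad critique propagates directly into the rewrite, which is one plausible route to the extraneous changes discussed in the main text.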

Appendix C Detection Results

Table 7. Logistic regression predicting accuracy. The reference level for edit distance is ‘low’, the reference model is ChatGPT 3.5, the reference prompt is ‘minimal’, FewShot is compared to ZeroShot, and the reference topic is ‘Assistance’. The word count of the edit is z-scored. Odds ratios (ORs) are shown with 95% CIs in parentheses.
Dependent variable: Accuracy
(1)
EditDistance[high] 1.433∗∗∗
(1.211 , 1.696)
EditDistance[med] 1.863∗∗∗
(1.605 , 2.162)
Examples[FewShot] 1.005
(0.901 , 1.122)
Intercept 0.795∗∗
(0.651 , 0.971)
Model[GPT-4] 1.248∗∗∗
(1.091 , 1.427)
Model[MistralMedium] 1.150∗∗
(1.006 , 1.315)
Prompt[NPOVScoped] 0.974
(0.851 , 1.114)
Prompt[NPOV] 0.962
(0.842 , 1.101)
RootTopic[Culture] 1.447∗∗∗
(1.215 , 1.724)
RootTopic[Geography] 1.011
(0.850 , 1.203)
RootTopic[HistoryAndSociety] 1.074
(0.903 , 1.277)
RootTopic[STEM] 1.314∗∗∗
(1.103 , 1.565)
ZScoreEditWordCount 1.098∗∗∗
(1.023 , 1.177)
Observations 5358
Pseudo $R^{2}$ 0.016
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

C.1. Tests for Model Error Imbalance

We assessed the statistical significance of model error patterns. Under the null hypothesis ($H_0$), a model’s errors are equally likely to be false positives (FPs) or false negatives (FNs). (Here, an FP means falsely predicting that an edit is biased and an FN means falsely predicting that an edit is neutral.) Under this null hypothesis, the proportion of errors that are FPs should be 0.5. We tested whether a model’s errors significantly deviated from this proportion using a two-tailed binomial test. In Figure 2(b) we visualize the distribution of FPs under the null hypothesis that $FP = FN$ against the actual proportion of errors that are FPs. We reject the null hypothesis for ChatGPT 3.5 ($p < 0.001$) and Mistral-Medium ($p < 0.001$), but not for GPT-4 ($p = 0.68$), indicating that the first two models are more prone to specific types of errors.
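
The test above can be illustrated with a short standard-library sketch of an exact two-tailed binomial test. The function names and the error counts in the usage lines are hypothetical (they are not the study's counts), and the two-sided rule used here (summing all outcomes no more likely than the observed one) is one common convention that may differ in detail from the software actually used.

```python
from math import exp, lgamma, log


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the Binomial(n, p) probability of exactly k successes."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)


def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-tailed p-value: total probability of outcomes that are
    at most as likely as the observed count k under the null."""
    log_observed = log_binom_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        log_prob = log_binom_pmf(i, n, p)
        if log_prob <= log_observed + 1e-9:  # small tolerance for float ties
            total += exp(log_prob)
    return min(1.0, total)


# Hypothetical example: 70 false positives among 100 errors.
false_positives, false_negatives = 70, 30
p_value = binom_two_sided_p(false_positives, false_positives + false_negatives)
print(f"share of errors that are FPs = {false_positives / 100:.2f}, p = {p_value:.4g}")
```

Working in log space avoids overflow when the number of errors runs into the thousands.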

C.2. Modeling Whether Explanations Predict Accuracy

Table 8. Logistic regression predicting accuracy of classifications by explanation features. Non-categorical variables are z-scored so odds ratios are interpreted as the change in odds per standard deviation increase in the predictor. Odds ratios are shown with 95% CIs in parentheses.
Dependent variable: Accuracy
(1)
Examples[FewShot] 1.051
(0.938 , 1.177)
Intercept 1.098
(0.906 , 1.330)
Model[GPT-4] 1.351∗∗∗
(1.155 , 1.578)
Model[MistralMedium] 1.295∗∗∗
(1.118 , 1.499)
Prompt[NPOVScoped] 0.873
(0.736 , 1.037)
Prompt[NPOV] 0.893
(0.760 , 1.048)
avgwordlength 0.958
(0.887 , 1.034)
cognitivewords 0.970
(0.906 , 1.038)
conjunctions 0.905∗∗∗
(0.850 , 0.964)
emotionwords 1.089∗∗∗
(1.026 , 1.155)
maxoverlap 0.983
(0.914 , 1.057)
policywords 1.124∗∗∗
(1.051 , 1.201)
readinggradelevel 1.056
(0.981 , 1.137)
roottopic[Culture] 1.385∗∗∗
(1.155 , 1.660)
roottopic[Geography] 0.911
(0.762 , 1.088)
roottopic[HistoryAndSociety] 1.109
(0.928 , 1.325)
roottopic[STEM] 1.250∗∗
(1.044 , 1.498)
similaritytoprompt 1.149∗∗∗
(1.070 , 1.233)
wordcount 0.799∗∗∗
(0.740 , 0.863)
Observations 5085
Pseudo $R^{2}$ 0.022
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

We analyzed whether the explanation that the LLM gives predicts accuracy. One set of features captures the complexity of the explanation: word count, average word length, LIWC cognitive reasoning words, Flesch-Kincaid reading grade level, and the count of conjunctions (as used in (li_detection_2019)) from LIWC. Another set of features captures rule reference (the degree to which the rationale integrated prompt and NPOV words): the maximum common substring between the explanation and the prompt, the count of NPOV-related words (footnote 6), and the semantic similarity (using SBERT (reimers_sentence-bert_2019)) between the prompt and the rationale. Finally, we also measured the amount of emotion words in the explanation using LIWC’s emotion category. For cognitive reasoning words, conjunction words, emotion words, and NPOV-related words, we divided the count by the total number of tokens in the explanation. Before entering variables into the logistic regression, we z-scored all non-categorical variables so they can be interpreted on a similar scale, as a change in odds associated with a +1 SD increase. See Table 8 for results.

Footnote 6: The list of words was: ['wikipedia', 'contested', 'contest', 'opinion', 'policy', 'neutral', 'subjective', 'subjectivity', 'judgemental', 'judgment', 'judge', 'believe', 'stereotype', 'opinion', 'fact', 'factual']. These words were based on the specific prompts we used and Wikipedia’s NPOV guidelines.
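
As a rough illustration of the feature preparation described above, the sketch below computes the NPOV-word rate of an explanation and z-scores a feature column. The regex tokenizer, the use of the population standard deviation, and the function names are assumptions made for this example; the LIWC and SBERT features are not reproduced.

```python
import re
from statistics import mean, pstdev
from typing import List

# NPOV-related word list from the footnote (duplicates collapse in a set).
NPOV_WORDS = {
    "wikipedia", "contested", "contest", "opinion", "policy", "neutral",
    "subjective", "subjectivity", "judgemental", "judgment", "judge",
    "believe", "stereotype", "fact", "factual",
}


def policy_word_rate(explanation: str) -> float:
    """Count of NPOV-related words divided by the number of tokens.
    Tokenization by a simple regex is an assumption of this sketch."""
    tokens = re.findall(r"[a-z']+", explanation.lower())
    if not tokens:
        return 0.0
    return sum(token in NPOV_WORDS for token in tokens) / len(tokens)


def z_score(values: List[float]) -> List[float]:
    """Standardize a feature so that one unit equals one standard deviation."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - mu) / sd for v in values]


# Hypothetical explanations, for illustration only.
explanations = [
    "The edit presents a subjective opinion as fact.",
    "Nothing in the wording stands out.",
    "It uses a stereotype and is not neutral.",
]
rates = [policy_word_rate(text) for text in explanations]
print(z_score(rates))
```

After standardization, an odds ratio such as the one reported for policywords in Table 8 is read as the multiplicative change in the odds of a correct classification per one standard deviation increase in that feature.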

Appendix D Generation Results

Table 9. OLS regressions of intensity metrics. 95% CIs in parentheses.
Edit Distance Word Count N Changes
(1) (2) (3)
Intercept 0.556∗∗∗ 47.270∗∗∗ 25.880∗∗∗
(0.535 , 0.577) (44.539 , 50.001) (24.404 , 27.356)
IsConstrained[Y] -0.209∗∗∗ -17.100∗∗∗ -12.940∗∗∗
(-0.239 , -0.179) (-20.962 , -13.238) (-15.028 , -10.852)
IsConstrained[Y]:PromptType[ZeroShot] 0.102∗∗∗ 16.500∗∗∗ 8.610∗∗∗
(0.059 , 0.144) (11.039 , 21.961) (5.657 , 11.563)
PromptType[ZeroShot] -0.270∗∗∗ -18.620∗∗∗ -15.365∗∗∗
(-0.300 , -0.240) (-22.482 , -14.758) (-17.453 , -13.277)
Observations 800 800 800
$R^{2}$ 0.447 0.142 0.324
Adjusted $R^{2}$ 0.445 0.138 0.321
Residual Std. Error 0.153 (df=796) 19.673 (df=796) 10.637 (df=796)
F Statistic 214.783∗∗∗ (df=3; 796) 43.738∗∗∗ (df=3; 796) 126.924∗∗∗ (df=3; 796)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
Table 10. OLS regressions of edit-level accuracy NLG metrics, which are computed based on comparing the sets of words that AI and humans removed from edits. P,R,S,N,B are precision, recall, similarity, non-disjoint, and BLEU score. 95% CIs in parentheses.
P R S N B
(1) (2) (3) (4) (5)
Intercept 0.289∗∗∗ 0.886∗∗∗ 0.275∗∗∗ 0.900∗∗∗ 0.230∗∗∗
(0.246 , 0.332) (0.837 , 0.934) (0.234 , 0.315) (0.849 , 0.951) (0.200 , 0.260)
IsConstrained[Y] 0.070∗∗ -0.070∗∗ 0.053 -0.040 0.213∗∗∗
(0.009 , 0.131) (-0.139 , -0.001) (-0.004 , 0.110) (-0.112 , 0.032) (0.170 , 0.255)
IsConstrained[Y]:PromptType[ZeroShot] 0.029 -0.031 0.015 -0.025 -0.059
(-0.057 , 0.115) (-0.129 , 0.066) (-0.066 , 0.096) (-0.127 , 0.077) (-0.118 , 0.001)
PromptType[ZeroShot] 0.085∗∗∗ -0.113∗∗∗ 0.070∗∗ -0.075∗∗ 0.253∗∗∗
(0.024 , 0.146) (-0.182 , -0.044) (0.012 , 0.127) (-0.147 , -0.003) (0.210 , 0.295)
Observations 800 752 796 800 800
$R^{2}$ 0.043 0.050 0.028 0.019 0.313
Adjusted $R^{2}$ 0.040 0.046 0.024 0.016 0.311
Residual Std. Error 0.310 (df=796) 0.340 (df=748) 0.291 (df=792) 0.367 (df=796) 0.216 (df=796)
F Statistic 11.972∗∗∗ (df=3; 796) 13.033∗∗∗ (df=3; 748) 7.567∗∗∗ (df=3; 792) 5.220∗∗∗ (df=3; 796) 121.098∗∗∗ (df=3; 796)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
Table 11. OLS regressions of additions and removals, excluding stopwords. 95% CIs in parentheses.
Count of Added Words Count of Removed Words
(1) (2)
Intercept 17.455∗∗∗ 8.425∗∗∗
(16.421 , 18.489) (7.834 , 9.016)
IsConstrained[Y] -10.630∗∗∗ -2.310∗∗∗
(-12.093 , -9.167) (-3.146 , -1.474)
IsConstrained[Y]:PromptType[ZeroShot] 8.250∗∗∗ 0.360
(6.181 , 10.319) (-0.822 , 1.542)
PromptType[ZeroShot] -12.315∗∗∗ -3.050∗∗∗
(-13.778 , -10.852) (-3.886 , -2.214)
Observations 800 800
$R^{2}$ 0.364 0.151
Adjusted $R^{2}$ 0.361 0.148
Residual Std. Error 7.452 (df=796) 4.257 (df=796)
F Statistic 151.733∗∗∗ (df=3; 796) 47.118∗∗∗ (df=3; 796)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
Figure 9. AI had both more removals and more additions than human editors. Error bars are 95% CIs.
Figure 10. To rule out that our generation findings were dependent on seemingly minor analytical choices we made (such as stop-word removal), we re-ran this analysis 32 times under different analytic decisions. We varied (1) whether undefined precision calculations were treated as zero or missing, (2) whether undefined recall calculations were treated as zero or missing, (3) whether or not diff-based metrics stripped punctuation, (4) whether or not diff-based metrics stripped stop-words, and (5) whether BLEU scores were calculated with smoothing. Condition-level metrics (colors) are broadly similar regardless of analytical decision set.
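
The word-set precision and recall reported in Table 10, together with the zero-versus-missing choice varied in Figure 10, can be sketched as below. This is a simplified illustration: whitespace tokenization and a plain set difference stand in for the diff-based procedure, and the `undefined` parameter and function names are introduced here for the example only.

```python
from typing import Optional, Set, Tuple


def removed_words(before: str, after: str) -> Set[str]:
    """Words present in the original edit but absent from the rewrite.
    Lowercased whitespace tokens are an assumption of this sketch."""
    return set(before.lower().split()) - set(after.lower().split())


def precision_recall(
    ai_removed: Set[str],
    human_removed: Set[str],
    undefined: Optional[float] = None,
) -> Tuple[Optional[float], Optional[float]]:
    """Precision: share of AI removals that the human editor also removed.
    Recall: share of human removals that the AI also removed.
    `undefined` is returned when a denominator is zero (None = missing, 0.0 = zero)."""
    overlap = len(ai_removed & human_removed)
    precision = overlap / len(ai_removed) if ai_removed else undefined
    recall = overlap / len(human_removed) if human_removed else undefined
    return precision, recall


# Hypothetical example sentences, not drawn from the dataset.
original = "He wrote the brilliant and famous book"
human_rewrite = "He wrote the famous book"
ai_rewrite = "He wrote the book"

precision, recall = precision_recall(
    removed_words(original, ai_rewrite),
    removed_words(original, human_rewrite),
)
print(precision, recall)  # the AI removes everything the human did, plus one extra word
```

In this toy case the AI rewrite has full recall but lower precision, the same qualitative pattern as the high-recall, low-precision editing described in the main text.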

Appendix E Human Experiment

E.1. Participant Demographics

Participants’ genders were woman (57.1%), man (39.5%), non-binary (1.4%), other (1.4%), and prefer not to disclose (0.7%). Participants’ ages were 25-34 (32.7%), 35-44 (26.5%), 45-54 (17.0%), 18-24 (12.9%), 55-64 (6.8%), and 65 or over (4.1%). Participants’ levels of educational attainment were bachelor’s degree (42.2%), some college but no degree (22.4%), high school degree or equivalent (e.g., GED) (12.2%), graduate degree (e.g., Masters, PhD, M.D.) (12.2%), associate degree (10.2%), and less than high school degree (0.7%).

Table 12. Conditional logistic regression models of the odds of choosing an AI edit over the human edit, for both neutrality and fluency. Results are on the log odds scale, with SEs in parentheses.
Dependent variable:
Neutrality Fluency
(1) (2)
conditionZeroShot 0.854∗∗∗ 0.613∗∗∗
(0.080) (0.077)
conditionCAI 0.802∗∗∗ 0.276∗∗∗
(0.080) (0.075)
Observations 2,940 2,940
$R^{2}$ 0.076 0.027
Max. Possible $R^{2}$ 0.500 0.500
Log Likelihood -902.892 -978.853
Wald Test (df = 2) 213.350∗∗∗ 77.010∗∗∗
LR Test (df = 2) 232.068∗∗∗ 80.146∗∗∗
Score (Logrank) Test (df = 2) 225.874∗∗∗ 79.104∗∗∗
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
Table 13. Experiment results. For fluency and neutrality, participants chose between an AI and a human edit. P-values are computed using two-tailed binomial tests for whether the probability of picking an AI edit differs from 0.5. For additions and removals, participants evaluated each of the AI and human edits separately but were shown both at the same time. This table reports human addition and removal judgements at the matchup level. P-values are computed using chi-squared tests on whether human vs AI edits differed in frequencies of adding or removing information. ‘Delta’ is the AI proportion minus the human proportion. Stars: n.s. p>0.05, *p<0.05, **p<0.01, ***p<0.001, ****p<0.0001
Question Type Comparison AI Proportion Human Proportion Delta
Neutrality Human vs ZeroShot 0.70 0.30 0.40****
Neutrality Human vs CAI 0.69 0.31 0.38****
Fluency Human vs ZeroShot 0.65 0.35 0.30****
Fluency Human vs CAI 0.57 0.43 0.14***
Add Human vs ZeroShot 0.27 0.23 0.03 (n.s.)
Add Human vs CAI 0.34 0.33 0.01 (n.s.)
Remove Human vs ZeroShot 0.41 0.40 0.01 (n.s.)
Remove Human vs CAI 0.42 0.40 0.02 (n.s.)
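
As a reading aid, the log-odds coefficients in Table 12 can be converted to odds ratios and choice probabilities with the inverse-logit transform. The sketch assumes each coefficient can be read as the log odds of picking the AI rewrite over the human rewrite in that condition; the dictionary layout and function name are illustrative.

```python
from math import exp


def log_odds_to_probability(beta: float) -> float:
    """Inverse logit: convert log odds to a probability."""
    odds = exp(beta)
    return odds / (1.0 + odds)


# Coefficients copied from Table 12 (log odds scale).
coefficients = {
    ("Neutrality", "ZeroShot"): 0.854,
    ("Neutrality", "CAI"): 0.802,
    ("Fluency", "ZeroShot"): 0.613,
    ("Fluency", "CAI"): 0.276,
}

for (question, condition), beta in coefficients.items():
    probability = log_odds_to_probability(beta)
    print(f"{question:<10} {condition:<8} OR = {exp(beta):.2f}  P(AI chosen) = {probability:.2f}")
```

Under this reading, the implied probabilities round to 0.70, 0.69, 0.65, and 0.57, in line with the AI proportions for neutrality and fluency in Table 13.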
Table 14. Neutrality results from a ‘both-equal’ pilot. Following our main experiment, we ran a pilot study (n=20; 15 trials each) where we used the same setup but gave participants an option to select ‘No substantial difference’. We only ran this pilot study for neutrality. The risk ratio is the ratio of the proportion of times an answer was chosen relative to the proportion of times the human rewrite was chosen.
Comparison Choice Proportion Chosen (95% CI) Risk Ratio (95% CI)
Human vs CAI CAI 0.61 ([0.53, 0.69]) 2.49 ([1.88, 3.48])
Human vs CAI Human 0.25 ([0.18, 0.31])
Human vs CAI Both Equal 0.15 ([0.09, 0.20]) 0.59 ([0.35, 0.94])
Human vs ZeroShot ZeroShot 0.56 ([0.48, 0.64]) 3.65 ([2.54, 5.85])
Human vs ZeroShot Human 0.15 ([0.10, 0.21])
Human vs ZeroShot Both Equal 0.28 ([0.21, 0.35]) 1.83 ([1.18, 3.00])
Table 15. Robustness checks of main results. ‘Human Prop.’ and ‘AI Prop.’ denote the proportion of times the human or AI rewrite was chosen. P-values are computed using two-tailed binomial tests for whether the probability of picking an AI edit differs from 0.5. ‘Delta’ is the AI proportion minus the human proportion. Subsets: all data; participants who were above the 25th percentile in study duration; participants who got all 3 training module questions correct. Stars: n.s. p>0.05, *p<0.05, **p<0.01, ***p<0.001, ****p<0.0001
Subset Question Type Comparison AI Prop. Human Prop. Delta
Full data Fluency Human vs CAI 0.57 0.43 0.14***
3 of 3 Correct Fluency Human vs CAI 0.64 0.36 0.27****
>25th Percentile Duration Fluency Human vs CAI 0.55 0.45 0.11*
Full data Fluency Human vs ZeroShot 0.65 0.35 0.30****
3 of 3 Correct Fluency Human vs ZeroShot 0.67 0.33 0.34****
>25th Percentile Duration Fluency Human vs ZeroShot 0.64 0.36 0.29****
Full data Neutrality Human vs CAI 0.69 0.31 0.38****
3 of 3 Correct Neutrality Human vs CAI 0.75 0.25 0.50****
>25th Percentile Duration Neutrality Human vs CAI 0.68 0.32 0.37****
Full data Neutrality Human vs ZeroShot 0.70 0.30 0.40****
3 of 3 Correct Neutrality Human vs ZeroShot 0.75 0.25 0.49****
>25th Percentile Duration Neutrality Human vs ZeroShot 0.70 0.30 0.39****
Figure 11. At the end of the experiment, participants guessed (via 0-100 slider) how often others chose the AI edit as more neutral than the human edit. The question was: “For each pair of rewrites you saw, one rewrite was made by a Wikipedia editor and the other rewrite was made by a large language model (LLM) such as ChatGPT. Like you, other participants were not told which was which. What percent of the time do you think participants said the LLM rewrite increased neutrality more than the Wikipedia-editor rewrite?”