Towards Automating Text Annotation: A Case Study on Semantic Proximity Annotation using GPT-4

Sachin Yadav Tejaswi Choppa Dominik Schlechtweg Institute for Natural Language Processing, University of Stuttgart

Abstract

This paper explores using GPT-3.5 and GPT-4 to automate the data annotation process with automatic prompting techniques. The main aim of this paper is to reuse human annotation guidelines along with some annotated data to design automatic prompts for LLMs, focusing on the semantic proximity annotation task. Automatic prompts are compared to customized prompts. We further implement the prompting strategies into an open-source text annotation tool, enabling easy online use via the OpenAI API. Our study reveals the crucial role of accurate prompt design and suggests that prompting GPT-4 with human-like instructions is not straightforwardly possible for the semantic proximity task. We show that small modifications to the human guidelines already improve the performance, suggesting possible ways for future research.

keywords:

LLM\sepPhiTag\sepComputational Annotator\sepPrompt Engineering\sepAnnotation Automation\sepSemantic Proximity

, and [email protected], [email protected], [email protected]

1 Introduction

Data annotation, the process of labeling data plays an important role for training machine learning models. High-quality data annotations ensure the model grasps the relationship between input data and desired output [1]. However, data annotation is complex, expensive, and time-consuming. Data can be ambiguous, subjective, and exist in diverse formats, often requiring domain expertise for accurate annotation [2]. Studies have shown that advanced Large Language Models (LLMs) such as GPT-4 [3], Gemini [4] and Llama-2 [5] offer a promising opportunity to revolutionize data annotation. [6] discovered that LLMs are approximately thirty times more cost-effective than human annotators, with a cost per annotation of less than $0.003. Additionally, they exhibit a remarkable speed advantage over human annotators. They further offer human-like performance in various downstream tasks. To get the most out of the system, users provide instructions in natural language (prompts), and the system responds accordingly [7, 8]. A prompt serves as a set of instructions guiding LLMs, refining their capabilities to suit the specific task [9, 10]. Prompting has shown promising results in low-resource settings and can effectively extract knowledge from pre-trained language models [11]. Recent research suggests that prompting techniques can be effectively used with LLMs to automate or assist with annotation tasks [12, 1]. However, writing an effective prompt is crucial as quality prompts generate quality output, as discussed in [10]. Manual prompt creation is further time-consuming, sophistic, and can lead to inconsistent, volatile or counter-intuitive results [13, 14, 15]. We aim to utilize existing resources, such as human annotation guidelines and small sets of well-annotated data ("gold data"), to automatically prompt LLMs in order to make the annotation process as easy as possible. An essential part of our study is the implementation of the automatic prompting strategies into an open-source annotation tool exploiting the OpenAI API to make it possible to do automatic prompting with few clicks in an online user interface.

2 Related Work

2.1 Text Annotation

Human language is complex, requiring machines to understand linguistic elements. Text annotation bridges this gap by labeling text data, providing additional information like highlighting specific elements, assigning categories, or adding comments. This labeled data is essential for training accurate machine learning models that can analyze and understand human language effectively. Without text annotation, machines would struggle with the nuances of language, leading to inaccurate or meaningless results. Several general-purpose text annotation tools are available, offering flexibility for defining various tasks. These tools include CATMA [16], INCEpTION [17], MTurk¹¹1https://www.mturk.com/, PhiTag²²2https://phitag.ims.uni-stuttgart.de/, Toloka³³3https://toloka.ai/docs/ or POTATO [18].

2.2 Large Language Models

LLMs have demonstrated impressive capabilities in understanding and processing natural language tasks. These models are pre-trained on massive amounts of text data that enables them to learn the patterns, structures, and relationships within language. Also, LLMs have a large number of parameters that help it to learn the complex linguistic patterns to generate responses with remarkable speed and accuracy [19].

Prompting LLMs

Prompting refers to the process of interacting with LLMs through textual content. A prompt typically serves as a crucial medium for engaging with LLMs and guiding them to perform specific tasks or generate desired outputs [20]. A very simple example of prompting can be seen in Figure 1.

Refer to caption — Figure 1: Prompting LLM for use pair meaning task.

LLMs as Annotators

LLMs demonstrate potential as data annotators for various textual tasks [21, 1, 12]. In the evaluations conducted by [22], key differences between ChatGPT and human experts emerged. ChatGPT’s responses exhibit a strict focus on given questions, objectivity, formality, and minimal emotional expression, while human responses tend to diverge, including subjective expressions, exhibit colloquialism, and convey emotions through punctuation and language features. These distinctions highlight the unique strengths and limitations of ChatGPT in comparison to human expertise.

[12] evaluated ChatGPT’s proficiency in detecting and explaining implicit hate speech using the Latent Hatred data-set [23]. The study revealed that ChatGPT’s ability to detect implicit hate speech was on par with human capabilities, and the quality of its generated explanations closely mirrored human-written ones. This emphasizes LLMs effectiveness as a valuable tool for data annotation. This finding is particularly interesting for lexical semantic tasks like Lexical Semantic Change Detection [24] or Use Pair Semantic Proximity Annotation [25].

[21] and [26] demonstrated that GPT-4 and GPT-3.5 can handle challenging deep semantic tasks like detecting semantic shifts in words over time (Lexical Semantic Change Detection) and meaning distinction between word uses (Use Pair Semantic Proximity Annotation). Semantic analysis, crucial for NLP, delves into the meaning and context of text, analyzing word relationships to understand the overall message. This includes further tasks like sentiment analysis and information extraction, aiming to achieve accurate text interpretation.

We choose semantic proximity annotation as a target task in this study because it is simple, gold data and high-quality annotation guidelines are available in the annotation interface and it is very relevant for semantic applications of NLP.

3 PhiTag

3.1 Idea

PhiTag, an open-source text annotation platform, addresses the critical need for high-quality training data in machine learning, particularly for NLP tasks.⁴⁴4https://github.com/Garrafao/phitag Recognizing the importance of custom data labeling and annotation quality, PhiTag offers a user-friendly environment designed for flexibility and efficient data management. The platform provides support for various general NLP tasks such as text pair, text ranking or text labeling annotation. The platform provides annotation guidelines, data for annotator checks (tutorial), an annotation interface and agreement calculation to ensure that users understand the specific requirements and can become proficient annotators.⁵⁵5https://phitag.ims.uni-stuttgart.de/guide/explained-annotation-task-urel We utilize PhiTag’s text pair annotation task type for our study. A screenshot showcasing some text pair instances can be found in Figure 2.

3.2 Computational Annotator

We implement a computational annotator in PhiTag which automatically annotates text data using the power of LLMs. We use the API provided by OpenAI.⁶⁶6https://openai.com/ This allows us to use different types of models (gpt-4-0125-preview used in this study) available in OpenAI and to define different types of prompts including (i) the upload of a custom prompt instruction (ii) automatic prompting features combining the annotation guidelines and tutorial data already available in PhiTag. This allows us to automate the annotation process as far as possible by reusing existing instructions and data. Find a screenshot of the computational annotator interface in PhiTag in Figure 3.

4 Task Description

Given the use pairs (often two sentences) of a word, our task of choice (Use Pair Semantic Proximity Annotation) involves evaluating the semantic relatedness between the two uses. Annotators judge pairs of sentences containing a highlighted target word, rating the meaning relatedness on a 4-point scale: 1 (unrelated), 2 (distantly related), 3 (closely related), and 4 (identical) [27]. They first determine the most likely meaning for the target word in each sentence independently, then assess the connection between those meanings.

As examples, consider (4) (relatedness of "bank") and (4) (relatedness of "eat"). {examples}

Sentence 1: His parents had left a lot of money in the bank and now it was all Measle’s, but a judge had said that Measle was too young to get it.

Sentence 2: Sherrell, is is said, was sitting on the bank of the river close by, and as soon as the men had disappeared from sight he jumped on board the schooner.

Target word: bank

Judgement: 1 (Unrelated) {examples}

Sentence 1: Speaking of bread and butter reminds me that we’d better eat ours before the coffee gets cold.

Sentence 2: When the meal was over and they had finished their tea after they ate, wang the Second took the trusty man to his elder brother’s gate.

Target word: eat

Judgement: 4 (Closely related)

5 Data

Our study is based on the publicly available DWUG EN "Use Pair" dataset (V2.0.0) [28].⁷⁷7https://github.com/ChangeIsKey/annotation_standardization/tree/main/use_pair/urel/english/data The dataset contains 46,000 human judgments of use pairs according to the semantic relatedness scale described above.

Filtering, Cleaning and Splitting

To ensure high-quality data, we employ a strict filtering condition keeping only instances without "Cannot decide" judgment and annotated by at least 2 human annotators. We include only data points where all annotators agreed. This filtering process resulted in a final dataset of approximately 930 instances. We employ a random sampling technique to split the gold data into three distinct sets: training, testing, and development. The specific distribution of instances can be found in Table 1, and the distribution of labels of each data set can be found in Figure 4.

Dev	Train	Test
46	140	744

Table 1: Test data split statistics

6 Experiments

We now describe our experiments on the data described in Section 5. We start out by fixing model parameters at their default value and optimizing the custom prompt on the development data. We then test the influence of model parameters using the optimized custom prompt. We continue by using these optimal model parameters to optimize prompts for the fine-tuned model and automatic prompting on the development data. We then test the optimal configurations for each strategy on the test data and report performance. We provide the data, code and prompts in a public repository.⁸⁸8https://github.com/sachin1022/phitag_gpt_datasets

Performance Metrics

Performance between the gold data and the computationally annotated data in our experiment is measured by two metrics: ordinal Krippendorff’s $\alpha$ [29] and percentage agreement. Percentage agreement measures the normalized share of agreements among the total number of annotations. However, it does not account for chance agreement and graded deviation from gold. In contrast, Krippendorff’s $\alpha$ offers a more robust assessment. This versatile statistic considers both observed agreement and agreement expected by chance, providing a more reliable picture of inter-annotator agreement. Krippendorff’s alpha can handle various data types (nominal, ordinal, interval, and ratio), and a score of 1 indicates perfect agreement [30].

6.1 Customized Prompt

Prompt

The customized prompt consists of an instruction message to the system, a custom message consisting of the two uses to be judged along with the target lemma and the message to obtain the judgment. Find an example in Appendix A. We explore prompting strategies to optimize performance of the model. Initial attempts were done by using basic prompts such as "Your task is to rate the degree of semantic relatedness between two uses of a target word in the given sentences" inspired by the human task guidelines. See Appendix A for complete prompt. Inspired by the work of [21], we also adopted and refined their prompt structure. The instruction was modified by adding the following sentence towards the end of prompt: "Your response should align with a human’s succinct judgment." This modification aimed to guide the model towards human-like evaluations of semantic relatedness. To align the generated annotation with the semantic relatedness scale, we further polish the prompt by adding the sentence "Please provide a judgment as a single integer. For example, if your judgment is Identical, then provide 4. If your judgment is Unrelated, provide 1." See Appendix A Prompt 2 for the final prompt.

Model Development

After fixing the prompt, we experiment with gpt-4-0125-preview from the GPT-4 class, the latest version specifically designed to address "laziness" issues. This typically arises when the model fails to complete tasks adequately, either by providing incomplete or insufficient responses.⁹⁹9https://platform.openai.com/docs/models/continuous-model-upgrades The model’s performance is affected by several parameters such as temperature, top-p, stop-condition, max-token, etc. We examine the impact of the temperature and top-p parameters on the performance of GPT-4 model outputs using the development data. These parameters influence the diversity and randomness of the model’s output which in turn indicate the quality of the generated responses [31, 32].

Higher top-p values allow for more diverse and creative output and vice-versa. Likewise, higher temperatures introduce more randomness and exploration, resulting in potentially more diverse but less coherent output. We explore the impact of temperature values ranging from min 0.1 to max 1.0 at steps of 0.1, and top-p values (0.1 to 1.0) in steps of 0.1. The primary objective is to determine the optimal model configuration by adjusting these parameters. We found that a temperature setting of 0.9 and a top-p value of 0.9 yielded the best performance for our study with Prompt 2 (see Appendix A).

Upon various trials conducted on the above-described model configuration and selected optimized prompt, we find the optimal performance of the model at mean $\alpha$ of 0.74 and percentage agreement of 0.80 achieved across 5 different trials as shown in Table 2.

Model Testing

After finalizing model development, we evaluate performances on a test dataset (gold data) of 744 instances as displayed in Table 1 using both Krippendorff’s $\alpha$ and percentage agreement. In the test data, the instances labelled with 4 are most frequent and instances labelled with 3 are least frequent (see Table 4 for a detailed distribution). We achieve consistent results across multiple trials. Throughout five trials, we get consistent results with mean $\alpha$ of 0.54 and 0.72 for percentage agreement, as shown in Table 2.

Trial	dev		test
Trial	$\alpha$	$\%$	$\alpha$	$\%$
1	0.74	0.80	0.56	0.73
2	0.75	0.81	0.54	0.71
3	0.75	0.80	0.54	0.72
4	0.74	0.80	0.54	0.72
5	0.74	0.80	0.53	0.73
Mean	0.74	0.80	0.54	0.72

Table 2: Custom prompting results.

\alpha

: Krippendorff’s

\alpha

\%

: Percentage agreement.

6.2 Fine-Tuning

Fine-tuning a large language model involves training the model further on a smaller, targeted dataset that is relevant to the desired task or subject matter. This can also be done through prompting. In the study conducted by [33], fine-tuning has shown to give a better result in autocomplete and classification tasks. In this study, we fine-tune the gpt-3.5-turbo model using the training data set from Section 5 structuring it in the fine-tuning data format as described in the OpenAI documentation.¹⁰¹⁰10https://platform.openai.com/docs/guides/fine-tuning