Towards Automating Text Annotation: A Case Study on Semantic Proximity Annotation using GPT-4

Sachin Yadav    Tejaswi Choppa    Dominik Schlechtweg Institute for Natural Language Processing, University of Stuttgart
Abstract

This paper explores using GPT-3.5 and GPT-4 to automate the data annotation process with automatic prompting techniques. The main aim of this paper is to reuse human annotation guidelines along with some annotated data to design automatic prompts for LLMs, focusing on the semantic proximity annotation task. Automatic prompts are compared to customized prompts. We further implement the prompting strategies into an open-source text annotation tool, enabling easy online use via the OpenAI API. Our study reveals the crucial role of accurate prompt design and suggests that prompting GPT-4 with human-like instructions is not straightforwardly possible for the semantic proximity task. We show that small modifications to the human guidelines already improve the performance, suggesting possible ways for future research.

keywords:
LLM\sepPhiTag\sepComputational Annotator\sepPrompt Engineering\sepAnnotation Automation\sepSemantic Proximity

1 Introduction

Data annotation, the process of labeling data plays an important role for training machine learning models. High-quality data annotations ensure the model grasps the relationship between input data and desired output [1]. However, data annotation is complex, expensive, and time-consuming. Data can be ambiguous, subjective, and exist in diverse formats, often requiring domain expertise for accurate annotation [2]. Studies have shown that advanced Large Language Models (LLMs) such as GPT-4 [3], Gemini [4] and Llama-2 [5] offer a promising opportunity to revolutionize data annotation. [6] discovered that LLMs are approximately thirty times more cost-effective than human annotators, with a cost per annotation of less than $0.003. Additionally, they exhibit a remarkable speed advantage over human annotators. They further offer human-like performance in various downstream tasks. To get the most out of the system, users provide instructions in natural language (prompts), and the system responds accordingly [7, 8]. A prompt serves as a set of instructions guiding LLMs, refining their capabilities to suit the specific task [9, 10]. Prompting has shown promising results in low-resource settings and can effectively extract knowledge from pre-trained language models [11]. Recent research suggests that prompting techniques can be effectively used with LLMs to automate or assist with annotation tasks [12, 1]. However, writing an effective prompt is crucial as quality prompts generate quality output, as discussed in [10]. Manual prompt creation is further time-consuming, sophistic, and can lead to inconsistent, volatile or counter-intuitive results [13, 14, 15]. We aim to utilize existing resources, such as human annotation guidelines and small sets of well-annotated data ("gold data"), to automatically prompt LLMs in order to make the annotation process as easy as possible. An essential part of our study is the implementation of the automatic prompting strategies into an open-source annotation tool exploiting the OpenAI API to make it possible to do automatic prompting with few clicks in an online user interface.

2 Related Work

2.1 Text Annotation

Human language is complex, requiring machines to understand linguistic elements. Text annotation bridges this gap by labeling text data, providing additional information like highlighting specific elements, assigning categories, or adding comments. This labeled data is essential for training accurate machine learning models that can analyze and understand human language effectively. Without text annotation, machines would struggle with the nuances of language, leading to inaccurate or meaningless results. Several general-purpose text annotation tools are available, offering flexibility for defining various tasks. These tools include CATMA [16], INCEpTION [17], MTurk111https://www.mturk.com/, PhiTag222https://phitag.ims.uni-stuttgart.de/, Toloka333https://toloka.ai/docs/ or POTATO [18].

2.2 Large Language Models

LLMs have demonstrated impressive capabilities in understanding and processing natural language tasks. These models are pre-trained on massive amounts of text data that enables them to learn the patterns, structures, and relationships within language. Also, LLMs have a large number of parameters that help it to learn the complex linguistic patterns to generate responses with remarkable speed and accuracy [19].

Prompting LLMs

Prompting refers to the process of interacting with LLMs through textual content. A prompt typically serves as a crucial medium for engaging with LLMs and guiding them to perform specific tasks or generate desired outputs [20]. A very simple example of prompting can be seen in Figure 1.

Refer to caption
Figure 1: Prompting LLM for use pair meaning task.

LLMs as Annotators

LLMs demonstrate potential as data annotators for various textual tasks [21, 1, 12]. In the evaluations conducted by [22], key differences between ChatGPT and human experts emerged. ChatGPT’s responses exhibit a strict focus on given questions, objectivity, formality, and minimal emotional expression, while human responses tend to diverge, including subjective expressions, exhibit colloquialism, and convey emotions through punctuation and language features. These distinctions highlight the unique strengths and limitations of ChatGPT in comparison to human expertise.

[12] evaluated ChatGPT’s proficiency in detecting and explaining implicit hate speech using the Latent Hatred data-set [23]. The study revealed that ChatGPT’s ability to detect implicit hate speech was on par with human capabilities, and the quality of its generated explanations closely mirrored human-written ones. This emphasizes LLMs effectiveness as a valuable tool for data annotation. This finding is particularly interesting for lexical semantic tasks like Lexical Semantic Change Detection [24] or Use Pair Semantic Proximity Annotation [25].

[21] and [26] demonstrated that GPT-4 and GPT-3.5 can handle challenging deep semantic tasks like detecting semantic shifts in words over time (Lexical Semantic Change Detection) and meaning distinction between word uses (Use Pair Semantic Proximity Annotation). Semantic analysis, crucial for NLP, delves into the meaning and context of text, analyzing word relationships to understand the overall message. This includes further tasks like sentiment analysis and information extraction, aiming to achieve accurate text interpretation.

We choose semantic proximity annotation as a target task in this study because it is simple, gold data and high-quality annotation guidelines are available in the annotation interface and it is very relevant for semantic applications of NLP.

3 PhiTag

3.1 Idea

PhiTag, an open-source text annotation platform, addresses the critical need for high-quality training data in machine learning, particularly for NLP tasks.444https://github.com/Garrafao/phitag Recognizing the importance of custom data labeling and annotation quality, PhiTag offers a user-friendly environment designed for flexibility and efficient data management. The platform provides support for various general NLP tasks such as text pair, text ranking or text labeling annotation. The platform provides annotation guidelines, data for annotator checks (tutorial), an annotation interface and agreement calculation to ensure that users understand the specific requirements and can become proficient annotators.555https://phitag.ims.uni-stuttgart.de/guide/explained-annotation-task-urel We utilize PhiTag’s text pair annotation task type for our study. A screenshot showcasing some text pair instances can be found in Figure 2.

Refer to caption
Figure 2: PhiTag text pair annotation page.

3.2 Computational Annotator

We implement a computational annotator in PhiTag which automatically annotates text data using the power of LLMs. We use the API provided by OpenAI.666https://openai.com/ This allows us to use different types of models (gpt-4-0125-preview used in this study) available in OpenAI and to define different types of prompts including (i) the upload of a custom prompt instruction (ii) automatic prompting features combining the annotation guidelines and tutorial data already available in PhiTag. This allows us to automate the annotation process as far as possible by reusing existing instructions and data. Find a screenshot of the computational annotator interface in PhiTag in Figure 3.

Refer to caption
Figure 3: PhiTag computational annotator interface.

4 Task Description

Given the use pairs (often two sentences) of a word, our task of choice (Use Pair Semantic Proximity Annotation) involves evaluating the semantic relatedness between the two uses. Annotators judge pairs of sentences containing a highlighted target word, rating the meaning relatedness on a 4-point scale: 1 (unrelated), 2 (distantly related), 3 (closely related), and 4 (identical) [27]. They first determine the most likely meaning for the target word in each sentence independently, then assess the connection between those meanings.

As examples, consider (4) (relatedness of "bank") and (4) (relatedness of "eat"). {examples}

Sentence 1: His parents had left a lot of money in the bank and now it was all Measle’s, but a judge had said that Measle was too young to get it.

Sentence 2: Sherrell, is is said, was sitting on the bank of the river close by, and as soon as the men had disappeared from sight he jumped on board the schooner.

Target word: bank

Judgement: 1 (Unrelated) {examples}

Sentence 1: Speaking of bread and butter reminds me that we’d better eat ours before the coffee gets cold.

Sentence 2: When the meal was over and they had finished their tea after they ate, wang the Second took the trusty man to his elder brother’s gate.

Target word: eat

Judgement: 4 (Closely related)

5 Data

Our study is based on the publicly available DWUG EN "Use Pair" dataset (V2.0.0) [28].777https://github.com/ChangeIsKey/annotation_standardization/tree/main/use_pair/urel/english/data The dataset contains 46,000 human judgments of use pairs according to the semantic relatedness scale described above.

Filtering, Cleaning and Splitting

To ensure high-quality data, we employ a strict filtering condition keeping only instances without "Cannot decide" judgment and annotated by at least 2 human annotators. We include only data points where all annotators agreed. This filtering process resulted in a final dataset of approximately 930 instances. We employ a random sampling technique to split the gold data into three distinct sets: training, testing, and development. The specific distribution of instances can be found in Table 1, and the distribution of labels of each data set can be found in Figure 4.

Dev Train Test
46 140 744
Table 1: Test data split statistics

6 Experiments

We now describe our experiments on the data described in Section 5. We start out by fixing model parameters at their default value and optimizing the custom prompt on the development data. We then test the influence of model parameters using the optimized custom prompt. We continue by using these optimal model parameters to optimize prompts for the fine-tuned model and automatic prompting on the development data. We then test the optimal configurations for each strategy on the test data and report performance. We provide the data, code and prompts in a public repository.888https://github.com/sachin1022/phitag_gpt_datasets

Performance Metrics

Performance between the gold data and the computationally annotated data in our experiment is measured by two metrics: ordinal Krippendorff’s α𝛼\alphaitalic_α [29] and percentage agreement. Percentage agreement measures the normalized share of agreements among the total number of annotations. However, it does not account for chance agreement and graded deviation from gold. In contrast, Krippendorff’s α𝛼\alphaitalic_α offers a more robust assessment. This versatile statistic considers both observed agreement and agreement expected by chance, providing a more reliable picture of inter-annotator agreement. Krippendorff’s alpha can handle various data types (nominal, ordinal, interval, and ratio), and a score of 1 indicates perfect agreement [30].

6.1 Customized Prompt

Prompt

The customized prompt consists of an instruction message to the system, a custom message consisting of the two uses to be judged along with the target lemma and the message to obtain the judgment. Find an example in Appendix A. We explore prompting strategies to optimize performance of the model. Initial attempts were done by using basic prompts such as "Your task is to rate the degree of semantic relatedness between two uses of a target word in the given sentences" inspired by the human task guidelines. See Appendix A for complete prompt. Inspired by the work of [21], we also adopted and refined their prompt structure. The instruction was modified by adding the following sentence towards the end of prompt: "Your response should align with a human’s succinct judgment." This modification aimed to guide the model towards human-like evaluations of semantic relatedness. To align the generated annotation with the semantic relatedness scale, we further polish the prompt by adding the sentence "Please provide a judgment as a single integer. For example, if your judgment is Identical, then provide 4. If your judgment is Unrelated, provide 1." See Appendix A Prompt 2 for the final prompt.

Refer to caption
(a) Train Data
Refer to caption
(b) Test Data
Refer to caption
(c) Dev Data
Figure 4: Distribution of labels in gold data.

Model Development

After fixing the prompt, we experiment with gpt-4-0125-preview from the GPT-4 class, the latest version specifically designed to address "laziness" issues. This typically arises when the model fails to complete tasks adequately, either by providing incomplete or insufficient responses.999https://platform.openai.com/docs/models/continuous-model-upgrades The model’s performance is affected by several parameters such as temperature, top-p, stop-condition, max-token, etc. We examine the impact of the temperature and top-p parameters on the performance of GPT-4 model outputs using the development data. These parameters influence the diversity and randomness of the model’s output which in turn indicate the quality of the generated responses [31, 32].

Higher top-p values allow for more diverse and creative output and vice-versa. Likewise, higher temperatures introduce more randomness and exploration, resulting in potentially more diverse but less coherent output. We explore the impact of temperature values ranging from min 0.1 to max 1.0 at steps of 0.1, and top-p values (0.1 to 1.0) in steps of 0.1. The primary objective is to determine the optimal model configuration by adjusting these parameters. We found that a temperature setting of 0.9 and a top-p value of 0.9 yielded the best performance for our study with Prompt 2 (see Appendix A).

Upon various trials conducted on the above-described model configuration and selected optimized prompt, we find the optimal performance of the model at mean α𝛼\alphaitalic_α of 0.74 and percentage agreement of 0.80 achieved across 5 different trials as shown in Table 2.

Model Testing

After finalizing model development, we evaluate performances on a test dataset (gold data) of 744 instances as displayed in Table 1 using both Krippendorff’s α𝛼\alphaitalic_α and percentage agreement. In the test data, the instances labelled with 4 are most frequent and instances labelled with 3 are least frequent (see Table 4 for a detailed distribution). We achieve consistent results across multiple trials. Throughout five trials, we get consistent results with mean α𝛼\alphaitalic_α of 0.54 and 0.72 for percentage agreement, as shown in Table 2.

Trial dev test
α𝛼\alphaitalic_α %percent\%% α𝛼\alphaitalic_α %percent\%%
1 0.74 0.80 0.56 0.73
2 0.75 0.81 0.54 0.71
3 0.75 0.80 0.54 0.72
4 0.74 0.80 0.54 0.72
5 0.74 0.80 0.53 0.73
Mean 0.74 0.80 0.54 0.72
Table 2: Custom prompting results. α𝛼\alphaitalic_α: Krippendorff’s α𝛼\alphaitalic_α, %percent\%%: Percentage agreement.

6.2 Fine-Tuning

Fine-tuning a large language model involves training the model further on a smaller, targeted dataset that is relevant to the desired task or subject matter. This can also be done through prompting. In the study conducted by [33], fine-tuning has shown to give a better result in autocomplete and classification tasks. In this study, we fine-tune the gpt-3.5-turbo model using the training data set from Section 5 structuring it in the fine-tuning data format as described in the OpenAI documentation.101010https://platform.openai.com/docs/guides/fine-tuning

Prompt

We use customized prompt 2 in the fine-tuning setting for providing the training data. We use a second simplistic prompt after training to query the model for input data. Find the final prompt in Appendix A.

Model Development

We use a similar fine-tuning parameter settings [34] to fine tuned the model.

Model Testing

The result of the fine-tuned model can be seen in the Table 3. As we can see, performance is much lower than for the non-fine-tuned setting. However, α𝛼\alphaitalic_α is still slightly above chance performance.

Trial dev test
α𝛼\alphaitalic_α %percent\%% α𝛼\alphaitalic_α %percent\%%
1 0.16 0.24 0.02 0.12
2 0.16 0.24 0.02 0.12
3 0.16 0.24 0.02 0.12
4 0.16 0.24 0.02 0.12
5 0.16 0.24 0.02 0.12
Mean 0.16 0.24 0.02 0.12
Table 3: Fine-tuned model results. α𝛼\alphaitalic_α: Krippendorff’s α𝛼\alphaitalic_α, %percent\%%: Percentage agreement.

6.3 Automatic PhiTag prompt

Prompt

By using the automatic prompting options we implemented into PhiTag, we create prompts by using the guidelines and guidelines + tutorial examples available for human annotators. We aim to see how the model performance varies by changing the prompting strategy from a manual and highly optimized to an automatic one. So, we design our prompt by concatenating an initial instruction along with the guidelines and a connecting sentence to link the tutorial examples, and a final instruction sentence. Additionally, we refine the guidelines by converting the data available in tables into a machine-readable format of word usages and target word. We also vary the prompt structure by changing the initial instructions message as well as the connecting sentence between guideline and tutorial examples. Further, we remove the "cannot decide" examples from the guidelines since our data does not include such instances. Similar to the customized prompt, the final instruction sentence ask the model to return a single integer value for judgment. Find the final prompts in Appendix A and the tested prompts in the paper repository.

Model Development

We set the top-p and temperature parameters to 0.9 each, as this setting has been found to yield optimal performance in case of the customized prompt, see Section 6.1. On five different trials, the model shows a below satisfactory performance on the development data, achieving a mean Krippendorff’s α𝛼\alphaitalic_α of -0.07 and a mean percentage agreement of 0.25 in the guidelines-only setting, and Krippendorff’s α𝛼\alphaitalic_α of 0.01 and a percentage agreement of 0.23 in the guidelines+tutorial setting, as shown in Tables 4 and 5. We notice a consistent but slight improvement of performance with modified guidelines over raw guidelines.

Trial dev test
α𝛼\alphaitalic_α %percent\%% α𝛼\alphaitalic_α %percent\%%
1 -0.07 0.26 0.03 0.30
2 -0.07 0.25 0.03 0.29
3 -0.07 0.24 0.03 0.28
4 -0.07 0.25 0.03 0.28
5 -0.07 0.25 0.03 0.29
Mean -0.07 0.25 0.03 0.28
Table 4: Automatic prompting results (guidelines-only). α𝛼\alphaitalic_α: Krippendorff’s α𝛼\alphaitalic_α, %percent\%%: Percentage agreement.
Trial dev test
α𝛼\alphaitalic_α %percent\%% α𝛼\alphaitalic_α %percent\%%
1 0.01 0.23 0.06 0.33
2 0.01 0.23 0.05 0.31
3 0.01 0.23 0.06 0.32
4 0.01 0.23 0.06 0.32
5 0.01 0.23 0.06 0.32
Mean 0.01 0.23 0.06 0.32
Table 5: Automatic prompting results (guidelines and tutorials). α𝛼\alphaitalic_α: Krippendorff’s α𝛼\alphaitalic_α, %percent\%%: Percentage agreement.

Model Testing

The optimal settings for model-development are applied to the test data. In five successive trials the model’s performance is below satisfactory, achieving a Krippendorff’s α𝛼\alphaitalic_α of 0.03 and a percentage agreement of 0.28 in the guidelines-only setting a mean Krippendorff’s α𝛼\alphaitalic_α of 0.06 and a mean percentage agreement of 0.32 in the guidelines+tutorial setting. See Table 4 and 5.

7 Discussion

We find that the customized prompt yields a reasonable performance while the automatic PhiTag prompting strategies show a considerably lower performance. While we were able to slightly improve the performance of the automatic prompting strategies through adjustments of the guidelines, the best model still shows low performance. A possible reason is the extensive length of the guidelines, as compared to usual prompts. In the future, it may be interesting to use only the task introduction part or the main task description part of human guidelines to prompt the model. Also, automatic summarization techniques may be used to condense the information for the model. Fine-tuning did not improve the performance of the customized prompt either, but rather led to a significant performance drop.

In Figure 5, we plot the label distribution for the best model of each strategy (customized, fine-tuned, guidelines-only, guidelines+tutorial). In comparison to the distribution of the gold data (Table 4), we see that, despite partly low performance, all models recover the overall gold label distribution quite well. This holds even for the zero-shot customized prompt model having no access to the overall label distribution from the prompt in the testing phase.

We observed a significant impact of the parameters temperature and top-p on the performance of the GPT-4 model. Lower temperatures and led to poor performance on the development data while a temperature setting of 0.9 yielded optimal results.

Refer to caption
(a) Customized prompt
Refer to caption
(b) Fine-Tuned Model
Refer to caption
(c) Automatic prompt (guidelines-only)
Refer to caption
(d) Automatic prompt (guidelines+tutorial)
Figure 5: Distribution of predicted labels on test data.

8 Conclusion

In conclusion, our study reveals the crucial role of accurate prompt design and careful parameter selection in optimizing the performance of GPT for the Use Pair Semantic Proximity task. Through our experiments, we observed the significant impact of model configurations, prompting strategies, and fine-tuning techniques on enhancing the efficiency and accuracy of annotation processes. Fully automating the annotation process reusing human-tailored instructions remains a major challenge. We were not able to obtain good performance with human-like automatic prompting strategies.

9 Limitations

One limitation of our study is the dependence on a specific language model and annotation platform, which may not generalize to all language models or annotation tools. Additionally, the effectiveness of our fine-tuning approach may vary depending on the task and dataset used, and further investigation with different models and datasets is needed. Furthermore, while our experiments provide insights into the effectiveness of various model configurations and prompting strategies, they may not capture the full complexity of real-world annotation scenarios.

Automatic prompts containing complete guidelines or tutorial data can get quite large and often running them costs a lot of money. Additionally, the current implementation of the automatic prompting strategy in PhiTag is rather inefficient as it annotates one instance at a time instead of creating one prompt for all instances.

An additional limitation arising with the use of closed-source source LLMs is the question how much of the testing data the model has already seen in its training process as the data is publicly available and also used in other studies for prompting.

References

  • [1] Ding B, Qin C, Liu L, Bing L, Joty SR, Li BA. Is GPT-3 a Good Data Annotator? In: Annual Meeting of the Association for Computational Linguistics; 2022. Available from: https://api.semanticscholar.org/CorpusID:254877171.
  • [2] Tan Z, Beigi A, Wang S, Guo R, Bhattacharjee A, Jiang B, et al. Large Language Models for Data Annotation: A Survey. ArXiv. 2024;abs/2402.13446. Available from: https://api.semanticscholar.org/CorpusID:267770019.
  • [3] OpenAI. GPT-4 Technical Report. ArXiv. 2023;abs/2303.08774. Available from: https://api.semanticscholar.org/CorpusID:257532815.
  • [4] Gemini Team et al . Gemini: A Family of Highly Capable Multimodal Models. arXiv e-prints. 2023 Dec:arXiv:2312.11805.
  • [5] Touvron H, Martin L, Stone KR, Albert P, Almahairi A, Babaei Y, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. ArXiv. 2023;abs/2307.09288. Available from: https://api.semanticscholar.org/CorpusID:259950998.
  • [6] Gilardi F, Alizadeh M, Kubli M. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences of the United States of America. 2023;120. Available from: https://api.semanticscholar.org/CorpusID:257766307.
  • [7] Brown SW. Choosing Sense Distinctions for WSD: Psycholinguistic Evidence. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. Stroudsburg, PA, USA; 2008. p. 249-52.
  • [8] Li J, Tang T, Nie J, rong Wen J, Zhao WX. Learning to Transfer Prompts for Text Generation. In: North American Chapter of the Association for Computational Linguistics; 2022. Available from: https://api.semanticscholar.org/CorpusID:248506201.
  • [9] Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys. 2021;55:1 35. Available from: https://api.semanticscholar.org/CorpusID:236493269.
  • [10] White J, Fu Q, Hays S, Sandborn M, Olea C, Gilbert H, et al. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. ArXiv. 2023;abs/2302.11382. Available from: https://api.semanticscholar.org/CorpusID:257079092.
  • [11] Goswami K, Lange L, Araki J, Adel H. SwitchPrompt: Learning Domain-Specific Gated Soft Prompts for Classification in Low-Resource Domains. In: Conference of the European Chapter of the Association for Computational Linguistics; 2023. Available from: https://api.semanticscholar.org/CorpusID:256846498.
  • [12] Huang F, Kwak H, An J. Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech. Companion Proceedings of the ACM Web Conference 2023. 2023. Available from: https://api.semanticscholar.org/CorpusID:256868854.
  • [13] Shin T, Razeghi Y, IV RLL, Wallace E, Singh S. Eliciting Knowledge from Language Models Using Automatically Generated Prompts. ArXiv. 2020;abs/2010.15980. Available from: https://api.semanticscholar.org/CorpusID:226222232.
  • [14] Sclar M, Choi Y, Tsvetkov Y, Suhr A. Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting; 2023.
  • [15] Li C, Wang J, Zhang Y, Zhu K, Hou W, Lian J, et al.. Large Language Models Understand and Can be Enhanced by Emotional Stimuli; 2023.
  • [16] Gius E, Meister JC, Meister M, Petris M, Bruck C, Jacke J, et al.. CATMA. Zenodo; 2022. Available from: https://doi.org/10.5281/zenodo.6419805.
  • [17] Klie JC, Bugert M, Boullosa B, Eckart de Castilho R, Gurevych I. The INCEpTION Platform: Machine-Assisted and Knowledge-Oriented Interactive Annotation. In: Zhao D, editor. Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations. Santa Fe, New Mexico: Association for Computational Linguistics; 2018. p. 5-9. Available from: https://aclanthology.org/C18-2002.
  • [18] Pei J, Ananthasubramaniam A, Wang X, Zhou N, Dedeloudis A, Sargent J, et al. POTATO: The Portable Text Annotation Tool. In: Che W, Shutova E, editors. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Abu Dhabi, UAE: Association for Computational Linguistics; 2022. p. 327-37. Available from: https://aclanthology.org/2022.emnlp-demos.33.
  • [19] Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language Models Are Few-Shot Learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS’20. Red Hook, NY, USA: Curran Associates Inc.; 2020. .
  • [20] Liu X, Wang J, Sun J, Yuan X, Dong G, Di P, et al. Prompting Frameworks for Large Language Models: A Survey. ArXiv. 2023;abs/2311.12785. Available from: https://api.semanticscholar.org/CorpusID:265308881.
  • [21] Karjus A. Machine-assisted mixed methods: augmenting humanities and social sciences with artificial intelligence. ArXiv. 2023;abs/2309.14379. Available from: https://api.semanticscholar.org/CorpusID:262825994.
  • [22] Guo B, Zhang X, Wang Z, Jiang M, Nie J, Ding Y, et al. How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection. ArXiv. 2023;abs/2301.07597. Available from: https://api.semanticscholar.org/CorpusID:255998637.
  • [23] ElSherief M, Ziems C, Muchlinski D, Anupindi V, Seybolt J, De Choudhury M, et al. Latent Hatred: A Benchmark for Understanding Implicit Hate Speech. In: Moens MF, Huang X, Specia L, Yih SWt, editors. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics; 2021. p. 345-63. Available from: https://aclanthology.org/2021.emnlp-main.29.
  • [24] Schlechtweg D, McGillivray B, Hengchen S, Dubossarsky H, Tahmasebi N. SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In: Proceedings of the 14th International Workshop on Semantic Evaluation. Barcelona, Spain: Association for Computational Linguistics; 2020. Available from: https://www.aclweb.org/anthology/2020.semeval-1.1/.
  • [25] Erk K, McCarthy D, Gaylord N. Measuring Word Meaning in Context. Computational Linguistics. 2013;39(3):511-54.
  • [26] Periti F, Dubossarsky H, Tahmasebi N. (Chat)GPT v BERT: Dawn of Justice for Semantic Change Detection; 2024. Available from: https://arxiv.org/abs/2401.14040.
  • [27] Schlechtweg D, Schulte im Walde S, Eckmann S. Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New Orleans, Louisiana; 2018. p. 169-74. Available from: https://www.aclweb.org/anthology/N18-2027/.
  • [28] Schlechtweg D, Tahmasebi N, Hengchen S, Dubossarsky H, McGillivray B. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics; 2021. p. 7079-91. Available from: https://aclanthology.org/2021.emnlp-main.567.
  • [29] Krippendorff K. Content Analysis: An Introduction to Its Methodology. SAGE Publications; 2018.
  • [30] Artstein R, Poesio M. Inter-Coder Agreement for Computational Linguistics. Computational Linguistics. 2008 12;34(4):555-96. Available from: https://doi.org/10.1162/coli.07-034-R2.
  • [31] Peng K, Ding L, Zhong Q, Shen L, Liu X, Zhang M, et al.. Towards Making the Most of ChatGPT for Machine Translation; 2023.
  • [32] Wang C, Wang J, Qiu M, Huang J, Gao M. TransPrompt: Towards an Automatic Transferable Prompting Framework for Few-shot Text Classification. In: Moens MF, Huang X, Specia L, Yih SWt, editors. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics; 2021. p. 2792-802. Available from: https://aclanthology.org/2021.emnlp-main.221.
  • [33] Sun AY, Zemour E, Saxena A, Vaidyanathan U, Lin E, Lau C, et al. Does fine-tuning GPT-3 with the OpenAI API leak personally-identifiable information? ArXiv. 2023;abs/2307.16382. Available from: https://api.semanticscholar.org/CorpusID:260334454.
  • [34] Kim J, Lee JH, Kim S, Park J, Yoo KM, Kwon SJ, et al.. Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization; 2023.

Appendix A Prompts

Customized Prompt 1

You are a highly trained text data annotation tool capable of providing subjective responses.
Your task is to rate the degree of semantic relatedness between two uses of a target word in the given sentences.
Sentence 1: [SENTENCE 1]
Sentence 2: [SENTENCE 2]
Target word: [TARGET WORD]
Please provide a judgment as a single integer. For example, if your judgment is Identical, then provide 4. If your judgment is Unrelated, provide 1.

Customized Prompt 2

You are a highly trained text data annotation tool capable of providing subjective responses.
Rate the semantic similarity of the target word in these sentences 1 and 2. Consider only the objects/concepts the word forms refer to: ignore any common etymology and metaphorical similarity! Ignore case! Ignore number (cat/Cats = identical meaning). If target is emoji then rate by its contextual function. Homonyms (like bat the animal vs bat in baseball) count as unrelated. Output numeric rating: 1 is unrelated; 2 is distantly related; 3 is closely related; 4 is identical meaning.Your response should align with a human’s succinct judgment.
Sentence 1:[SENTENCE 1]
Sentence 2: [SENTENCE 2]
Target word: [TARGET WORD]
Please provide a judgment as a single integer. For example, if your judgment is Identical, then provide 4. If your judgment is Unrelated, provide 1.

Fine-tuned model prompt

You are a highly trained text data annotation tool capable of providing subjective responses.
Annotate this pair of given sentences
Sentence 1: [SENTENCE 1]
Sentence 2: [SENTENCE 2]
Target word: [TARGET WORD]

Automatic PhiTag prompt with guidelines-only

You are a highly trained text data annotation tool capable of providing subjective responses.
[MODIFIED GUIDELINES]
Sentence 1: [SENTENCE 1]
Sentence 2: [SENTENCE 2]
Target word: [TARGET WORD]
Please provide a judgment as a single integer for Sentence 1 and Sentence 2 above. For example, if your judgment is Identical, then provide 4. If your judgment is Unrelated, provide 1.

Automatic PhiTag prompt with guidelines and tutorial

You are a highly trained text data annotation tool capable of providing judgments based on contexts provided to you. [MODIFIED GUIDELINES]
Here are few sample instances and their corresponding judgements:
Example sentences
Sentence 1: [SENTENCE 1]
Sentence 2: [SENTENCE 2]
Target word: [TARGET WORD]
Please provide a judgment as a single integer for Sentence 1 and Sentence 2 above. For example, if your judgment is Identical, then provide 4. If your judgment is Unrelated, provide 1.