Data Generation using Large Language Models for Text Classification:
An Empirical Case Study

Yinheng Li    Rogerio Bonatti    Sara Abdali    Justin Wagle    Kazuhito Koishida
Abstract

Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is influenced by various factors, including the choice of prompt, task complexity, and the quality, quantity, and diversity of the generated data. In this work, we focus exclusively on using synthetic data for text classification tasks. Specifically, we use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches. This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.

Synthetic Data Generation, Large Language Model, Data Augmentation, Text Classification

1 Introduction

Data augmentation is a method that uses existing data to generate additional training data without collecting more data (Feng et al., 2021). It is an effective way to improve model performance when limited data is available (Xie et al., 2020). With the emergence of large language models, data augmentation has become even more accessible and has been successfully applied in training language models (Gunasekar et al., 2023; Liu et al., 2024).

Using LLMs to generate or annotate data is a cost-efficient alternative to human labeling. While human-labeled data tends to have higher quality, an LLM with well-designed prompts can generate data that achieves comparable model performance at a much lower cost. As estimated in (Ding et al., 2023), labeling 3,000 samples for the SST-2 task (Socher et al., 2013) would cost between 221 and 300 USD and take around 1,000 minutes. In contrast, generating the same amount of data using GPT-3 costs only 14.37 USD and takes 46 minutes. With only 6,000 samples generated by GPT-3, the model is able to achieve 76% accuracy, compared to 88% from human-curated data.

Our research focuses on synthetic data generation using large language models (LLMs) for text classification tasks, specifically tasks that use natural language understanding (NLU) models (transformer encoder models). In the scope of this study, we use the terms data augmentation and data generation interchangeably, as LLMs often require a few in-context samples to generate data, and the data produced in this way can be considered augmented from these in-context samples. We focus solely on tasks that have limited or no data, as our experiments have shown that tasks with sufficient data receive minimal improvements from additional synthetic data. Numerous studies have proposed frameworks to improve the quality of synthetic data generation (Wang et al., 2023; Gao et al., 2023; Gupta et al., 2023). However, to the best of our knowledge, few works have addressed the fundamental questions associated with using LLMs for data generation. These questions include:

  • What is the optimal amount of data to generate, and does increasing the volume of synthetic data improve model performance?

  • Can in-context learning (generation) enhance the quality of synthetic data? That is, would providing a few examples lead to higher-quality data than zero-shot generation?

  • Does the LLM’s performance on a particular task directly influence the quality of the generated synthetic data for this task?

  • Is combining synthetic data with raw data beneficial for model training?

  • Is synthetic data diversity an important factor for model performance?

We experimented with six common NLP tasks (Table 1) using different data generation methods. We found it very challenging to pinpoint a definitive answer to the questions above that applies universally to all NLP tasks, due to their inherent differences. Nevertheless, the findings from the six tasks offer valuable insights into practical data generation techniques.

2 Related Work

Data Augmentation

The goal of data augmentation is to increase the diversity of existing data by exposing the model to unseen data. This method has been applied to many domains in computer vision (Yang et al., 2023) and natural language processing (Li et al., 2022). In (Feng et al., 2021), augmentation techniques are categorized into rule-based generation and model-based generation. Rule-based generation is used in computer vision problems and includes image transformations such as rotation, flipping, and cropping (Mikołajczyk & Grochowski, 2018), while model-based generation, such as rephrasing and back translation, has been widely used in natural language processing (Kumar et al., 2019; Yang et al., 2020; Cai et al., 2020; Ye et al., 2022; Okur et al., 2022b).

Large Language models (LLMs)

With the development of large language models, model-based data augmentation for NLP has become much easier (Zhou et al., 2024). Given a proper prompt, an LLM can generate new examples in human-like text. While this is easy to implement, the synthetic data generated by an LLM is usually noisy and follows a different distribution from the raw data, which hampers training performance. Many works have explored ways to deal with this issue. The work from (Veselovsky et al., 2023) uses techniques such as grounding, providing a taxonomy, and filtering to ensure the quality of LLM-generated synthetic data. Synthesis Step by Step (Wang et al., 2023) iteratively creates prompts based on misclassified gold data to reduce the gap between the synthesized data distribution and the gold distribution. SunGen (Gao et al., 2023) uses a weighted loss to reduce the impact of noise from synthetic data during training.

3 Methods

Figure 1: Pipeline for Data Augmentation using LLM

We follow the workflow in Figure 1 for our experiment. We explore the following in-context data generation methods. The term "in-context generation" refers to using an LLM to generate data for training given a specific context, similar to in-context learning (Brown et al., 2020). The methods we investigate can be categorized as follows:

  • Zero-shot in-context generation: Provide the task description in the prompt and ask the LLM to generate a similar example.

  • One-shot in-context generation: Provide the task description and one example, prompting the LLM to generate a similar example.

  • Few-shot in-context generation: Provide the task description and a few examples, prompting the LLM to generate a similar example.

Inspired by the work from (Yu et al., 2023), we also experiment with an additional method called zero-shot topic in-context generation:

  • Zero-shot topic in-context generation: Use the LLM to generate a list of topics (see Appendix A). Provide the task description and sample one topic from the list to prompt the LLM to generate a similar example.
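The four prompting styles above can be sketched as a small prompt builder. This is only an illustration under stated assumptions: the function name `build_prompt`, the template wording, the sample reviews, and the topic list are all hypothetical and are not the prompts used in the experiments (those are given in Appendix C).

```python
import random

def build_prompt(task_description, examples=None, topic=None):
    """Assemble a generation prompt (hypothetical template, not the paper's).

    examples: list of in-context example strings (None or empty for zero-shot).
    topic:    optional topic string for zero-shot topic generation.
    """
    parts = [f"Task: {task_description}"]
    if topic is not None:
        parts.append(f"Topic: {topic}")
    for i, example in enumerate(examples or [], start=1):
        parts.append(f"Example {i}: {example}")
    parts.append("Generate one new example similar to the above.")
    return "\n".join(parts)

task = "Classify the sentiment of a movie review as positive or negative."
# Made-up raw examples and topics, used only to exercise the builder.
raw_pool = ["A moving, well-acted drama.", "Dull and far too long.",
            "A clever script wasted on flat direction."]
topics = ["science fiction", "romantic comedy", "documentary"]

rng = random.Random(0)
zero_shot = build_prompt(task)
zero_shot_topic = build_prompt(task, topic=rng.choice(topics))
one_shot = build_prompt(task, examples=rng.sample(raw_pool, 1))
few_shot = build_prompt(task, examples=rng.sample(raw_pool, 3))
```

Because the zero-shot prompt is identical on every call, any variety in its output has to come from the LLM's sampling alone, whereas the topic and in-context variants change the prompt itself between calls.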

To evaluate the success of synthetic data generation, we train an NLU model on the synthetic data and assess its performance on the task's validation set. We then compare the performance of the model trained on synthetic data with that of the model trained on the original data. Following the practice established in previous work (Li et al., 2023), we consider generated data to be better if it results in better model performance.

4 Experiments

In our experiment, GPT-3.5 turbo (version 2024-02-15 preview, accessed from Azure OpenAI Studio) is used for all data generation except topic generation (see Appendix A). Although more powerful models such as GPT-4 are available, we decided to use GPT-3.5 turbo due to resource constraints, especially because our data generation experiments require a large number of inference calls. Overall, GPT-3.5 turbo is a well-rounded model with competitive performance across multiple benchmarks (Liang et al., 2023). In the future, it would be interesting to compare the quality of synthetic data generated by different LLMs, which we plan to explore further.

Existing work has either used common NLP benchmarks, such as SuperGLUE (Wang et al., 2019), as evaluation tasks (Gupta et al., 2023) or employed a customized selection of tasks (Gao et al., 2023; Ye et al., 2022).

We select six common tasks for evaluation: SST-2 (Socher et al., 2013; Wang et al., 2019), Twitter Emotion Classification (EMO) (Saravia et al., 2018), New York Times News Classification (NYT) (Stefano, 2021), Amazon Review Classification (Review) (Keung et al., 2020), Recognizing Textual Entailment (RTE) (Bentivogli et al., 2009; Wang et al., 2019) and BoolQ (Clark et al., 2019; Wang et al., 2019). The goal is to select diverse tasks that represent a wide range of popular NLP corpora (Table 1). Additionally, we try to include challenging tasks for which current NLU models do not perform well when provided with limited training data. Therefore, we do not use the entire GLUE benchmark, as models like BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019) can easily achieve high accuracy on such tasks. We also do not use the complete SuperGLUE task collection, since some of its tasks require token-level classification. In this work, we focus on single-sequence and sequence-pair classification tasks. The six selected tasks cover common web data, such as news and Wikipedia, as well as popular user data, such as Twitter posts, movie reviews, and product reviews. They cover binary classification, multi-class classification, and question-answering tasks.

Although accuracy is the default metric for these tasks, we report F1 or Macro-F1, since these metrics account for both precision and recall and therefore give a more balanced assessment of classification performance, especially for imbalanced class distributions or multi-class classification tasks. In our experiment, we use RoBERTa as the NLU model for all tasks, as it is commonly used for benchmarking on these tasks.
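To make the difference between accuracy and Macro-F1 concrete, the following is a minimal pure-Python sketch of Macro-F1 as the unweighted mean of per-class F1 scores. The function name `macro_f1` and the toy labels are assumptions made for illustration; the evaluation code actually used in the experiments is not specified in the text.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)

# Imbalanced toy labels: a degenerate classifier that always predicts "neg".
y_true = ["neg"] * 9 + ["pos"]
y_pred = ["neg"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

On this toy input the always-"neg" classifier reaches 0.9 accuracy but a Macro-F1 below 0.5, because the F1 of the ignored "pos" class is zero and counts equally in the average.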

We experiment with five in-context generation methods for each task: zero-shot, zero-shot topic, one-shot, few-shot with 3 examples, and few-shot with 5 examples. The prompts used for generation can be found in Appendix C.


Corpus | Training Size | Test Size | Task | Metric | Domain
SST-2 | 67k | 1.8k | Binary Classification | F1 | Movie Reviews
EMO | 16k | 2k | Multi-class Classification | Macro-F1 | Twitter
NYT | 256k | 3k (*) | Multi-class Classification | Macro-F1 | News
Review | 200k | 5k | Multi-class, Ordinal Regression | Macro-F1 | Amazon Review
RTE | 2.5k | 3k | Pair Classification, Question Answering | Macro-F1 | News, Wikipedia
BoolQ | 16k | 3.2k | Pair Classification, Question Answering | Macro-F1 | News, Wikipedia, Web Query
(*) The official dataset does not provide a test set; we randomly sample 3k examples as the test set.
Table 1: Summary of datasets and tasks.

In our experiment, we generate 1,000 synthetic data points per task, as we found the benefit of additional synthetic data diminishes after that. To simulate a low-resource setting, we allow only 100 raw examples to be used for one-shot and few-shot generation. For zero-shot topic generation, we generate 500 random topics related to the task domain. Details can be found in Appendix A.

5 Key Findings

In this section, we present the key findings from our experiments.

5.1 Mixing Raw Data is Necessary

Figure 2: Performance of different prompting methods with and without augmentation. Synthetic only: use 1000 synthetic data only. Augmented: 1000 synthetic data plus 100 raw data

To assess the effectiveness of data augmentation, we train models with pure synthetic data and with augmented data. For the augmented setting, 100 raw data points are mixed with 1,000 synthetic data points. These 100 raw data points are the same ones used as in-context examples during generation, so the model never has access to additional raw data. As shown in Figure 2, we observe significant improvements across all tasks for most prompting methods when incorporating raw data into training. Even as few as 100 raw data points can boost performance compared to using only synthetic data.
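The two training settings can be sketched as follows. This is a minimal illustration, assuming examples are stored as (text, label) tuples; the function name `build_training_sets` and the placeholder records are hypothetical, and the sizes follow the 100 raw plus 1,000 synthetic setup described above.

```python
import random

def build_training_sets(raw_pool, synthetic, n_raw=100, n_synth=1000, seed=0):
    """Return (raw_subset, synthetic_only, augmented) training lists.

    raw_subset is the only raw data visible to the pipeline: it is meant to be
    reused both as in-context examples for generation and in the augmented set.
    """
    rng = random.Random(seed)
    raw_subset = rng.sample(raw_pool, n_raw)
    synthetic_only = list(synthetic[:n_synth])
    augmented = raw_subset + synthetic_only
    rng.shuffle(augmented)  # avoid presenting all raw examples first
    return raw_subset, synthetic_only, augmented

# Placeholder (text, label) records stand in for real and generated data.
raw_pool = [(f"raw text {i}", i % 2) for i in range(5000)]
synthetic = [(f"synthetic text {i}", i % 2) for i in range(1000)]
raw_subset, synthetic_only, augmented = build_training_sets(raw_pool, synthetic)
```

Drawing the raw subset once and reusing it for both generation and training is what keeps the comparison fair: the augmented model sees no raw example that the generation step did not also see.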

5.2 Impact of Bias

In the BoolQ task, we found that the zero-shot generation method outperforms other methods, which contrasts with the results obtained for the rest of the tasks. This finding is intriguing since zero-shot data exhibits the highest repetition rate, which is detrimental to model training. Upon further examination, we noticed that terms like "not," "significant," "only," "just," "few," and "little" frequently appear in the questions generated with the zero-shot topic, one-shot, and few-shot methods, but rarely in those generated with the plain zero-shot method. These terms create a tone that implies the answer to the question (which is often False). Table 2 provides examples of such trivial questions. Table 4 provides statistics for such questions across the different prompting methods.
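A count of such trivial questions can be approximated with a whole-word keyword match. The sketch below is an assumption about how the count might be implemented, since the exact matching rule is not described; the names `CUE_WORDS`, `is_trivial`, and `count_trivial` are hypothetical, and the third question is a made-up non-trivial example.

```python
import re

CUE_WORDS = ("not", "significant", "only", "just", "few", "little")
# \b anchors restrict matches to whole words, so "nothing" does not match "not".
CUE_PATTERN = re.compile(r"\b(" + "|".join(CUE_WORDS) + r")\b", re.IGNORECASE)

def is_trivial(question):
    """True if the question contains any cue word as a whole word."""
    return CUE_PATTERN.search(question) is not None

def count_trivial(questions):
    """Number of questions that contain at least one cue word."""
    return sum(is_trivial(q) for q in questions)

questions = [
    "Did the Mars Exploration Rover mission only involve one rover?",
    "Did scientists in the 20th century make no significant discoveries or advancements?",
    "Is the Eiffel Tower located in Paris?",  # made-up example without cue words
]
n_trivial = count_trivial(questions)
```

A keyword match of this kind is only a proxy: it flags surface cues, not whether a question can actually be answered without reading the passage.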

We hypothesize that this pattern introduces bias in model training by encouraging the model to search for specific keywords in the question rather than reading the passage. To test this hypothesis, we instruct the LLM to rephrase the question in each synthetic example "like what people would search online" (see Appendix B). We found that performance improved significantly for the zero-shot topic and one-shot methods after rephrasing. The work of (Okur et al., 2022a) has also shown the effectiveness of paraphrasing in other data augmentation techniques.

Although we only detected synthetic bias in the BoolQ task, it remains an important factor to consider during data generation. The technique of rephrasing might not be applicable to other cases, but ensuring that synthetic data does not contain unwanted patterns is necessary.

For all remaining experiments, the results for the BoolQ task use the question rephrasing setting unless otherwise specified.


Did the Mars Exploration Rover mission only involve one rover? – False
Did scientists in the 20th century make no significant discoveries or advancements? – False
Table 2: Examples of trivial questions, i.e., questions that contain terms such as "not," "significant," "only," "just," "few," and "little".
Table 3: BoolQ Trivial Questions and F1 Score Comparison.

Method | Trivial Q. Count (Raw) | Trivial Q. Count (Rephrased) | F1 Raw (SD) | F1 Rephrased (SD) | F1 Raw (AD) | F1 Rephrased (AD)
Zero-Shot Topic | 230 | 208 | 0.19 | 0.77 | 0.75 | 0.77
One-Shot | 131 | 74 | 0.38 | 0.74 | 0.76 | 0.77
Few-Shot (3 ex.) | 90 | 30 | 0.55 | 0.51 | 0.70 | 0.72
Few-Shot (5 ex.) | 57 | 28 | 0.53 | 0.48 | 0.75 | 0.73
Zero-Shot | 11 | - | 0.71 | - | 0.73 | -
Raw Data | 31 | - | - | 0.768 | - | -
Table 4: BoolQ trivial questions and F1 score comparison. SD: trained on 1,000 synthetic data points. AD: trained on 100 raw data points plus 1,000 synthetic data points. Raw Data: trained on 1,000 raw data points only, without question rephrasing; this score is used as a baseline.

5.3 Relationship between LLM Performance and Data Quality

While it may seem intuitive that the effectiveness of using LLMs to generate data for model training depends on the LLM’s knowledge of a specific task, our research has shown that this is not always the case. The zero-shot or few-shot performance of an LLM on a task does not necessarily determine the performance of a model (specifically, the RoBERTa model used in our experiment) trained with data generated by the LLM. In other words, the fact that an LLM performs well on a task does not guarantee that models finetuned with data generated by the LLM will also perform well. Additionally, for tasks where the LLM performs poorly, models finetuned on the synthetic data generated by the LLM could actually outperform the LLM itself. The former scenario could be due to the fact that the ability of an LLM to generate good examples for a task does not always correspond to its ability to solve the task itself. The latter scenario is also plausible, as an LLM may be proficient at generating examples with a given label, but not as good at predicting the label given the task itself.

The results of our experiment can be found in Table 5. For each task, we prompted the LLM (GPT3.5-turbo) with zero/one/three/five-shot learning and reported the best performance achieved across all in-context learning methods. We did not optimize the prompt or use any advanced prompting methods in our evaluation of the LLM. It is possible that the LLM could achieve better performance with more advanced prompting techniques. However, the results obtained from the most basic in-context learning method (see Appendix D) do provide valuable insights into this problem.

For SST-2, BoolQ, NYT, and Review tasks, the LLM's in-context learning performance exceeds that of the language model (RoBERTa) fine-tuned on synthetic data by roughly 0.11 to 0.23 F1 (Table 5). For RTE and EMO tasks, the LLM itself does not perform well, but RoBERTa fine-tuned on the data generated by the LLM performs much better. Therefore, even for tasks that LLMs struggle to solve, using LLM-generated synthetic data can still achieve better results.

Task | GPT3.5-turbo | RoBERTa on Synthetic Data | RoBERTa on Augmented Data
SST-2 | 0.956 | 0.845 | 0.874
BoolQ | 0.870 | 0.641 | 0.742
NYT | 0.729 | 0.604 | 0.742
Review | 0.603 | 0.475 | 0.527
RTE | 0.345 | 0.574 | 0.653
EMO | 0.300 | 0.404 | 0.568
Table 5: LLM performance vs. model trained on synthetic data for the 6 tasks. RoBERTa scores are the average F1 over the 5 prompting methods under (1) Synthetic Data (1,000 synthetic data points) and (2) Augmented Data (1,000 synthetic data points plus 100 raw data points).
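The per-task differences can be read directly off Table 5. The short sketch below recomputes them; the dictionary simply restates the table's values, and the variable names are chosen here for illustration.

```python
# F1 scores copied from Table 5:
# (GPT3.5-turbo, RoBERTa on synthetic data, RoBERTa on augmented data).
table5 = {
    "SST-2":  (0.956, 0.845, 0.874),
    "BoolQ":  (0.870, 0.641, 0.742),
    "NYT":    (0.729, 0.604, 0.742),
    "Review": (0.603, 0.475, 0.527),
    "RTE":    (0.345, 0.574, 0.653),
    "EMO":    (0.300, 0.404, 0.568),
}

# Positive gap: the LLM itself beats RoBERTa trained on synthetic data only.
gaps = {task: round(llm - synth, 3) for task, (llm, synth, _) in table5.items()}
llm_wins = sorted(task for task, gap in gaps.items() if gap > 0)
roberta_wins = sorted(task for task, gap in gaps.items() if gap < 0)
```

The split into `llm_wins` and `roberta_wins` mirrors the two scenarios discussed in this section: four tasks where the LLM outperforms the model trained on its data, and two (RTE and EMO) where the trained model outperforms the LLM.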

5.4 Synthetic Data is Helpful Mostly in Low-Resource Settings

Figure 3: Improvement at different raw data amounts. raw data (X): trained on X raw data points only. augmented (X): trained on X raw data points plus 100 synthetic data. The augmented F1 score is the average model performance over the data generated by the 5 prompting methods.
Figure 4: Synthetic Data Similarity
Figure 5: Impact on Synthetic Data Quantity

Previous work has shown that it is challenging for models trained with synthetic data to perform as well as models trained with the same amount of original data (Li et al., 2023; Ding et al., 2023). However, when human-annotated data is limited, synthetic data augmentation can improve model performance. In fact, this technique is most effective in low-resource settings. For all tasks with 100 raw data points, we found that synthetic data augmentation yields improvements ranging from 3% to 26%. When the raw training data increases to 1,000 data points, only four tasks show improvements, all of which are below 5% (Figure 3). There is no universal rule for determining how much raw data counts as low-resource. It is worth noting that 1,000 data points still represent a small portion of the training data for all six tasks, and the model continues to improve as we increase the amount of raw training data. However, the performance gain obtained from increasing training data also depends on other factors, such as task and model complexity. Based on this observation, we treat 100 raw data points as the low-resource setting and use it as the default augmented setting in all experiments.

5.5 A Comparison Between Different Prompting Methods

In the synthetic data only setting, one-shot or zero-shot topic methods rank in the top two for all tasks except the Review task (Figure 2).

In the augmented setting, few-shot generation and zero-shot topic generation methods demonstrate good performance across all tasks. In BoolQ, EMO, and RTE tasks, zero-shot topic methods outperform other prompting methods. In SST-2 and NYT tasks, few-shot generation methods perform best. The performance of zero-shot methods is sub-optimal across all tasks.

Among the five prompting methods we experimented with, zero-shot topic generation typically produces the most diverse dataset because a different topic is sampled for each generation. Pure zero-shot methods generate the least diverse dataset, as the prompt remains the same for each generation. One-shot and few-shot methods also generate repeated examples due to the limited number of in-context examples. We found that for most tasks, a diverse dataset tends to benefit model training. As shown in Figure 2, in the non-augmented setting pure zero-shot generation shows the worst performance for RTE, EMO, Review and SST-2, while zero-shot topic generation outperforms (or is close to) other methods for the BoolQ, NYT, RTE and EMO tasks. This effect does not appear on all tasks, as there might be other factors that impact model performance. Meanwhile, the effect of diversity diminishes when we mix synthetic data with raw data. Therefore, training with both raw data and synthetic data could help when synthetic data is not diverse.

While they do not generate the most diverse dataset, one-shot and few-shot generation methods typically help LLMs better understand the task description and generate examples similar to the original examples (Li, 2023; Song et al., 2022). In the EMO and Review tasks, we observe the advantage of few-shot learning over other prompting methods. We suspect this is because both tasks are more subjective than the rest of the tasks, as the EMO task contains Twitter posts and the Review task is made up of customer reviews and ratings.

5.6 Synthetic Data Diversity and Similarity to Raw Data

In this section, we examine the diversity of our training data using inter-sample semantic similarity. To calculate this similarity, we use the vector embeddings proposed in (Reimers & Gurevych, 2019) and average the similarity score across all example pairs following (Yu et al., 2023). Figure 4 displays the inter-sample similarity for each task, comparing data generated by the five prompting methods. On the x-axis, we show the performance of the finetuned model using the 1,000 synthetic data points only. Figure 4 shows that for BoolQ, NYT, and SST-2, a lower inter-sample similarity (i.e., higher diversity) results in a better F1 score. However, for other tasks, the correlation is weak due to the existence of outliers, especially for RTE, and the possible impact of other factors, such as task complexity. We also calculated the similarity between the synthetic data and the actual raw data using the same method and found that the synthetic data generated by the five prompting methods had similar similarity scores with the raw data. However, it is not clear whether synthetic data that closely resembles the raw data would lead to better model performance. This could be due to the limitations of our similarity measure, which only considers semantic similarity, as discussed in (Steck et al., 2024). Many NLP tasks rely on subtle contextual cues and nuanced wording, such as the SST-2 task, where changes to wording can affect the sentiment of the text more than contextual semantics. Our measurement does not account for other aspects of similarity, such as structural or lexical similarity, as discussed in (Wang et al., 2020; Ayeldeen et al., 2014). Lastly, due to the limited number of data points and the potential variation in synthetic data, caution is needed when generalizing our findings to other tasks or domains.
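The inter-sample similarity described above amounts to averaging cosine similarity over all pairs of example embeddings. The following is a minimal pure-Python sketch under the assumption that embeddings have already been computed; the toy 2-d vectors and the function names are hypothetical stand-ins, not the sentence embeddings or code used in the study.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def inter_sample_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs; lower means more diverse."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy 2-d vectors stand in for sentence embeddings (assumed precomputed).
repetitive = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
diverse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
sim_repetitive = inter_sample_similarity(repetitive)
sim_diverse = inter_sample_similarity(diverse)
```

The number of pairs grows quadratically with dataset size, so for 1,000 examples this loop visits roughly half a million pairs; a vectorized matrix product would be the usual choice at larger scale.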

5.7 Synthetic Data Quantity

We found that increasing the amount of synthetic data used in model training improves performance. Figure 5 shows the relationship between the model's performance (measured by the F1 score) on the y-axis and the total number of training data points on the x-axis. In the augmented scenario, we mixed 100 raw data points with varying amounts of synthetic data. The performance is the average of the model's F1 score over the 5 prompting methods for each data size. For the raw data scenario, only real-world data was used in model training. Our graph indicates that raw data serves as an upper bound for the augmented setting in almost all tasks. Moreover, we observed diminishing marginal gains from increasing training data for both raw and synthetic data. For the BoolQ and SST-2 tasks, the gains level off at the same data size in both cases. As such, the raw data size at which the marginal improvement in model performance levels off can be used as a reference point when increasing the amount of synthetic data.
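One simple way to locate such a reference point is to scan a learning curve for the first size after which the next step adds less than a chosen gain. The sketch below is an assumption about how this could be done, not a procedure from the study; the function name `saturation_size`, the `min_gain` threshold, and the learning-curve numbers are all made up for illustration.

```python
def saturation_size(sizes, scores, min_gain=0.01):
    """Return the first training size after which the next increase in size
    improves the score by less than min_gain, or None if that never happens."""
    for i in range(len(sizes) - 1):
        if scores[i + 1] - scores[i] < min_gain:
            return sizes[i]
    return None

# Made-up learning curve for illustration only (not results from this study).
sizes = [100, 200, 500, 1000, 2000]
raw_f1 = [0.60, 0.70, 0.78, 0.80, 0.805]
reference_point = saturation_size(sizes, raw_f1)
```

The threshold is a judgment call: a smaller `min_gain` pushes the reference point to larger sizes, and noisy curves may need averaging over several runs before the scan is meaningful.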

6 Data Generation Techniques in Practice

In the process of using LLM to generate data for this study, we identified several useful techniques. These practices lack theoretical support and the effectiveness of these techniques can be subject to the choice of large language models or the requirements of a specific task.

6.1 Condition on Label

There are two typical ways to generate a classification dataset: Condition on the Label and Left-to-Right (see Table 6). We recommend conditioning on the label for each generation, as it saves the effort of parsing the label and prevents the LLM from generating unknown labels. It also gives the user control over the label distribution in the synthetic dataset.

Table 6: Left-to-right prompt vs. class-conditioned prompt.

Left-to-right prompt: generate an example text first and then generate its class label.
Class-conditioned prompt: generate an example text where the label must be Class X.

It is worth noting that class-conditioned generation is more likely to introduce bias and reduce the difficulty of the synthetic examples. When the class label is visible, the LLM could leak label information during content generation. In the BoolQ example, the LLM hints at the answer "FALSE" via certain words in the question it generates (e.g., the word "only"). In this case, rephrasing the question with the class label hidden essentially amounts to left-to-right generation, which improves performance.
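The practical difference between the two prompt styles in Table 6 can be sketched as follows. The prompt wording, the `Label:` output convention, and all function names are assumptions made for this illustration; they are not the prompts used in the experiments.

```python
from collections import Counter

def class_conditioned_prompts(task_description, label_counts):
    """One prompt per requested example; the label is fixed before generation,
    so no label parsing is needed and the label distribution is exact."""
    prompts = []
    for label, count in label_counts.items():
        for _ in range(count):
            prompt = (f"Task: {task_description}\n"
                      f"Generate one example whose label must be {label}.")
            prompts.append((prompt, label))
    return prompts

def left_to_right_prompt(task_description):
    """The model picks the label itself, so it must be parsed and validated."""
    return (f"Task: {task_description}\n"
            "Generate one example, then write its label on a new line "
            "as 'Label: <label>'.")

def parse_label(generation, allowed_labels):
    """Return the parsed label, or None if it is missing or not allowed."""
    for line in generation.splitlines():
        if line.startswith("Label:"):
            label = line[len("Label:"):].strip()
            return label if label in allowed_labels else None
    return None

task = "Answer a yes/no question about a passage."
prompts = class_conditioned_prompts(task, {"True": 3, "False": 3})
label_distribution = Counter(label for _, label in prompts)
```

With class-conditioned prompts the label travels alongside the prompt and is never parsed from model output, whereas the left-to-right path needs `parse_label` to discard generations with missing or unknown labels.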

6.2 Generation on Target Corpus

It is essential to provide topics or descriptions closely related to the use case when generating examples. Ensuring that the topics are relevant to the use case significantly improves the quality of generated data. For example, when creating examples from Twitter, it is beneficial to first generate common topics found on Twitter. On the other hand, when generating Amazon customer reviews, it is effective to generate an Amazon product catalog as a list of potential topics. This approach ensures that the synthetic data is more closely aligned with the target corpus, leading to better performance in classification tasks.

6.3 Iterative Data Generation and Prompt Refinement

Generating synthetic data can be both time-consuming and resource-intensive. To maximize efficiency and ensure high-quality data, it is recommended to adopt an iterative approach. Initially, generate a small number of examples and evaluate their quality. If the quality of these initial data points is low, refine the prompt before generating more data. It is unlikely that simply generating more data points with the same prompt will magically produce high quality data.
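The iterative approach can be sketched as a loop that tests a prompt on a small pilot batch before committing to large-scale generation. Everything in this sketch is an assumption for illustration: the function `iterative_generation`, its parameters, and the three stub callables are hypothetical, and in practice the quality check might be a manual review or a small trained classifier.

```python
def iterative_generation(generate, score_quality, refine, prompt,
                         pilot_size=20, threshold=0.8, max_rounds=5):
    """Refine the prompt on small pilot batches before generating at scale.

    generate(prompt, n)      -> list of examples
    score_quality(examples)  -> float in [0, 1]
    refine(prompt, examples) -> improved prompt
    All three callables are placeholders supplied by the caller.
    Returns (prompt, accepted) where accepted says whether the threshold was met.
    """
    for _ in range(max_rounds):
        pilot = generate(prompt, pilot_size)
        if score_quality(pilot) >= threshold:
            return prompt, True
        prompt = refine(prompt, pilot)
    return prompt, False

# Stubs so the loop runs without an LLM: quality rises as hints are added.
def fake_generate(prompt, n):
    return [prompt] * n

def fake_quality(examples):
    return min(1.0, 0.5 + 0.2 * examples[0].count("[hint]"))

def fake_refine(prompt, examples):
    return prompt + " [hint]"

final_prompt, accepted = iterative_generation(
    fake_generate, fake_quality, fake_refine, "Write a product review.")
```

Capping the number of rounds keeps the pilot phase cheap; if the threshold is never reached, the caller gets `accepted=False` and can decide whether to rethink the prompt more substantially.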

7 Conclusion

In this work, we analyzed different factors that influence data generation using LLMs. We found that data generation is most effective in low-resource settings. Increasing the amount of synthetic data does not necessarily lead to continuous improvements in model performance. It is beneficial to combine synthetic data with raw data during training. Additionally, it is crucial to be vigilant for patterns or biases in synthetic data that may hinder model training. Overall, using LLMs for data augmentation has great potential in model training. With a carefully tuned prompt, the data generated by an LLM can yield performance comparable to human-annotated data, but at a much lower cost.

The domain of data generation for classification tasks is highly complex. Due to the diversity of NLP tasks, it is challenging to find rules that generalize well across all tasks. However, our findings could still serve as valuable resources for researchers and practitioners looking to use synthetic data for training classification models. For future work, it would be valuable to study the effects of more advanced prompting methods, such as the Chain of Thought (Wei et al., 2023), or LLM hyperparameters, such as temperature, on the quality of synthetic data.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

  • Ayeldeen et al. (2014) Ayeldeen, H., Hassanien, A. E., and Fahmy, A. A. Lexical similarity using fuzzy euclidean distance. In 2014 International Conference on Engineering and Technology (ICET), pp.  1–6, 2014. doi: 10.1109/ICEngTechnol.2014.7016801.
  • Bentivogli et al. (2009) Bentivogli, L., Dagan, I., Dang, H. T., Giampiccolo, D., and Magnini, B. The fifth PASCAL recognizing textual entailment challenge. 2009.
  • Brown et al. (2020) Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020.
  • Cai et al. (2020) Cai, H., Chen, H., Song, Y., Zhang, C., Zhao, X., and Yin, D. Data manipulation: Towards effective instance learning for neural dialogue generation via learning to augment and reweight. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  6334–6343, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.564. URL https://aclanthology.org/2020.acl-main.564.
  • Clark et al. (2019) Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT 2019, 2019.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
  • Ding et al. (2023) Ding, B., Qin, C., Liu, L., Chia, Y. K., Joty, S., Li, B., and Bing, L. Is gpt-3 a good data annotator?, 2023.
  • Feng et al. (2021) Feng, S. Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., and Hovy, E. A survey of data augmentation approaches for nlp, 2021.
  • Gao et al. (2023) Gao, J., Pi, R., Lin, Y., Xu, H., Ye, J., Wu, Z., Zhang, W., Liang, X., Li, Z., and Kong, L. Self-guided noise-free data generation for efficient zero-shot learning, 2023.
  • Gunasekar et al. (2023) Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Giorno, A. D., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Behl, H. S., Wang, X., Bubeck, S., Eldan, R., Kalai, A. T., Lee, Y. T., and Li, Y. Textbooks are all you need, 2023.
  • Gupta et al. (2023) Gupta, H., Scaria, K., Anantheswaran, U., Verma, S., Parmar, M., Sawant, S. A., Baral, C., and Mishra, S. Targen: Targeted data generation with large language models, 2023.
  • Keung et al. (2020) Keung, P., Lu, Y., Szarvas, G., and Smith, N. A. The multilingual amazon reviews corpus. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.
  • Kumar et al. (2019) Kumar, A., Bhattamishra, S., Bhandari, M., and Talukdar, P. Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  3609–3619, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1363. URL https://aclanthology.org/N19-1363.
  • Li et al. (2022) Li, B., Hou, Y., and Che, W. Data augmentation approaches in natural language processing: A survey. AI Open, 3:71–90, 2022. ISSN 2666-6510. doi: 10.1016/j.aiopen.2022.03.001. URL https://www.sciencedirect.com/science/article/pii/S2666651022000080.
  • Li (2023) Li, Y. A practical survey on zero-shot prompt design for in-context learning. In Proceedings of the Conference Recent Advances in Natural Language Processing - Large Language Models for Natural Language Processings, RANLP. INCOMA Ltd., Shoumen, Bulgaria, 2023. doi: 10.26615/978-954-452-092-2_069. URL http://dx.doi.org/10.26615/978-954-452-092-2_069.
  • Li et al. (2023) Li, Z., Zhu, H., Lu, Z., and Yin, M. Synthetic data generation with large language models for text classification: Potential and limitations, 2023.
  • Liang et al. (2023) Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul, M., Suzgun, M., Kim, N., Guha, N., Chatterji, N., Khattab, O., Henderson, P., Huang, Q., Chi, R., Xie, S. M., Santurkar, S., Ganguli, S., Hashimoto, T., Icard, T., Zhang, T., Chaudhary, V., Wang, W., Li, X., Mai, Y., Zhang, Y., and Koreeda, Y. Holistic evaluation of language models, 2023.
  • Liu et al. (2024) Liu, R., Wei, J., Liu, F., Si, C., Zhang, Y., Rao, J., Zheng, S., Peng, D., Yang, D., Zhou, D., and Dai, A. M. Best practices and lessons learned on synthetic data for language models, 2024.
  • Liu et al. (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach, 2019.
  • Mikołajczyk & Grochowski (2018) Mikołajczyk, A. and Grochowski, M. Data augmentation for improving deep learning in image classification problem. In 2018 International Interdisciplinary PhD Workshop (IIPhDW), pp.  117–122, 2018. doi: 10.1109/IIPHDW.2018.8388338.
  • Okur et al. (2022a) Okur, E., Sahay, S., and Nachman, L. Data augmentation with paraphrase generation and entity extraction for multimodal dialogue system. In Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Odijk, J., and Piperidis, S. (eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp.  4114–4125, Marseille, France, June 2022a. European Language Resources Association. URL https://aclanthology.org/2022.lrec-1.437.
  • Okur et al. (2022b) Okur, E., Sahay, S., and Nachman, L. Data augmentation with paraphrase generation and entity extraction for multimodal dialogue system, 2022b.
  • Reimers & Gurevych (2019) Reimers, N. and Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks, 2019.
  • Saravia et al. (2018) Saravia, E., Liu, H.-C. T., Huang, Y.-H., Wu, J., and Chen, Y.-S. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.  3687–3697, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1404. URL https://www.aclweb.org/anthology/D18-1404.
  • Socher et al. (2013) Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., and Bethard, S. (eds.), Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp.  1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1170.
  • Song et al. (2022) Song, Y., Wang, T., Mondal, S. K., and Sahoo, J. P. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities, 2022.
  • Steck et al. (2024) Steck, H., Ekanadham, C., and Kallus, N. Is cosine-similarity of embeddings really about similarity? arXiv preprint arXiv:2403.05440v1, 2024.
  • Stefano (2021) Stefano, D. D. New York Times topics. https://huggingface.co/datasets/dstefa/New_York_Times_Topics, 2021.
  • Veselovsky et al. (2023) Veselovsky, V., Ribeiro, M. H., Arora, A., Josifoski, M., Anderson, A., and West, R. Generating faithful synthetic data with large language models: A case study in computational social science, 2023.
  • Wang et al. (2019) Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537, 2019.
  • Wang et al. (2023) Wang, R., Zhou, W., and Sachan, M. Let’s synthesize step by step: Iterative dataset synthesis with large language models by extrapolating errors from small models, 2023.
  • Wang et al. (2020) Wang, Z., Zhang, Y., and Wu, H. Structural-aware sentence similarity with recursive optimal transport, 2020.
  • Wei et al. (2023) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models, 2023.
  • Xie et al. (2020) Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation for consistency training, 2020.
  • Yang et al. (2023) Yang, S., Xiao, W., Zhang, M., Guo, S., Zhao, J., and Shen, F. Image data augmentation for deep learning: A survey, 2023.
  • Yang et al. (2020) Yang, Y., Malaviya, C., Fernandez, J., Swayamdipta, S., Le Bras, R., Wang, J.-P., Bhagavatula, C., Choi, Y., and Downey, D. Generative data augmentation for commonsense reasoning. In Cohn, T., He, Y., and Liu, Y. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp.  1008–1025, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.90. URL https://aclanthology.org/2020.findings-emnlp.90.
  • Ye et al. (2022) Ye, J., Gao, J., Li, Q., Xu, H., Feng, J., Wu, Z., Yu, T., and Kong, L. ZeroGen: Efficient zero-shot learning via dataset generation, 2022.
  • Yu et al. (2023) Yu, Y., Zhuang, Y., Zhang, J., Meng, Y., Ratner, A., Krishna, R., Shen, J., and Zhang, C. Large language model as attributed training data generator: A tale of diversity and bias, 2023.
  • Zhou et al. (2024) Zhou, Y., Guo, C., Wang, X., Chang, Y., and Wu, Y. A survey on data augmentation in large model era, 2024.

Appendix A Appendix

Prompts used to generate the topic lists for the zero-shot topic setting, with examples of LLM output. GPT-4 is used to generate 500 random topics per task:

Task Role Message
BoolQ, RTE, NYT, SST-2, EMO System You are an AI assistant that generates random topics. There is no limit on the number of topics you can generate.
BoolQ, RTE, NYT User Please generate 500 topics
BoolQ, RTE, NYT LLM Output example: The world’s most beautiful sculptures, The role of technology in modern education …
SST-2, EMO User Please generate 500 twitter post topics
SST-2, EMO LLM Output example: Lunch break, Online dating …
Review System You are an AI assistant that knows Amazon product categories. The user will ask you to generate a list of categories. It is your responsibility to generate the entire list of categories.
Review User Please generate 500 amazon different product categories
Review LLM Output example: Baby Products, Clothing, Jewelry …
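
The step that connects these topic lists to the zero-shot topic prompts can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the LLM returns the topics as one comma-separated string (as the output examples above suggest), the label strings "positive" and "negative" are assumed, and the helper names `parse_topics` and `sst2_topic_prompt` are hypothetical. The prompt wording follows the SST-2 zero-shot topic prompt in Appendix C.

```python
import random


def parse_topics(llm_output: str) -> list[str]:
    """Split a comma-separated topic string, dropping blanks and duplicates.

    Assumption: topics are separated by commas, as in the output examples.
    """
    seen = set()
    topics = []
    for raw in llm_output.split(","):
        topic = raw.strip()
        key = topic.lower()
        if topic and key not in seen:
            seen.add(key)
            topics.append(topic)
    return topics


def sst2_topic_prompt(topics, labels=("positive", "negative"), rng=random):
    """Sample one topic and one label, then fill the zero-shot topic prompt.

    Returns the prompt together with the sampled label so the label can be
    stored alongside the generated sentence.
    """
    topic = rng.choice(topics)
    label = rng.choice(labels)
    prompt = (
        f"Please consider this topic for generation: {topic}. "
        f"Please generate a sentence that contains a {label} sentiment. "
        "Sentence:"
    )
    return prompt, label
```

Passing a seeded `random.Random` instance as `rng` makes the sampled topic and label reproducible across runs.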

Appendix B Appendix

Prompt for Question Rephrasing in Section 5.2


Please rephrase the question as if you are typing it in a search engine. Make sure the answer can only be true or false, Input: question Output:

Appendix C Appendix

Prompts used for data generation for each task:

Task Prompt Type Prompt
BoolQ zero-shot

Step 1

Please generate a random short passage. Passage:

Step 2

Please generate a True or False question based on the passage. The answer to the question must be [random([True, False])] Passage: [passage from step 1] Question:
BoolQ zero-shot topic

Step 1

Please generate a short passage about this topic: [topic sampled from a topic list] Passage:

Step 2

Please generate a True or False question based on the passage. The answer to the question must be [random([True, False])] Passage: [passage from step 1] Question:
BoolQ one-shot

Step 1

Please generate a Passage, a Question and the Label to the question following this example: [example from raw data: Passage, Question, Label] Please generate a similar passage. Passage:

Step 2

Please generate a True or False question based on the passage. The answer to the question must be [label from example in Step 1] Passage: [passage generated in Step 1] Question:
BoolQ few-shot (3 or 5)

Step 1

Please generate a Passage, a Question and the Label to the question. Here are some examples: [examples from raw data: Passage, Question, Label] Please generate a similar example. Make sure the question is a True or False question and the answer to the question is [random([True, False])]. Passage:
EMO zero-shot

Step 1

Please generate a twitter post with the emotion of [random(label)]. Text:
EMO zero-shot topic

Step 1

Please consider this topic for generation: [topic sampled from a topic list]. Please generate a twitter post with the emotion of [random(label)]. Text:
EMO one-shot

Step 1

The task is to predict the emotion of a twitter post. The emotion contains six categories: sadness, joy, love, anger, fear, surprise. Here is an example. Text: [example from raw data] Emotion: [example label from raw data] Please generate another example for the same emotion. Text:
EMO few-shot (3 or 5)

Step 1

The task is to predict the emotion of a twitter post. The emotion contains six categories: sadness, joy, love, anger, fear, surprise. Here are some examples: [examples: Text, Emotion] Please generate a twitter post with the emotion of [first label from examples]. Text:
NYT zero-shot

Step 1

Please generate a news title for [random(label)] category. Headline:
NYT zero-shot topic

Step 1

Please consider this sentence for generation: [topic sampled from topic list]. Please generate a news headline for [random(label)] category. Headline:
NYT one-shot

Step 1

The task is to predict the topic of a news headline. The topics contain ’sports’, ’arts, culture and entertainment’, ’business and finance’, ’health and wellness’, ’lifestyle and fashion’, ’science and technology’, ’politics’, ’crime’. Here is an example News: [example news] Topic: [example topic] Please generate another news on [example topic]. Headline:
NYT few-shot (3 or 5)

Step 1

The task is to predict the topic of a news headline. The topics contain ’sports’, ’arts, culture and entertainment’, ’business and finance’, ’health and wellness’, ’lifestyle and fashion’, ’science and technology’, ’politics’, ’crime’. Here are some examples: [examples: Headline, Topic] Please generate a news headline for [first topic from examples] category. News:
Review zero-shot

Step 1

The Amazon customer review has a rating ranges from 1 to 5, 1 being the lowest and 5 being the highest. Please generate a customer review with a rating of [random(label)]. Content:
Review zero-shot topic

Step 1

The Amazon customer review has a rating ranges from 1 to 5, 1 being the lowest and 5 being the highest. Please generate a customer review with a rating of [random(label)] for a specific product under [a product category sampled from topic list]. Please use a fake product name. Content:
Review one-shot

Step 1

The task is to predict the rating of an Amazon customer review based on the content. The rating ranges from 1 to 5, 1 being the lowest and 5 being the highest. Here is a review example. Content: [example content] Rating: [example rating] Please generate another example for a similar product. Make sure the rating for the review is [example rating]. Content:
Review few-shot (3 or 5)

Step 1

The Amazon customer review has a rating ranges from 1 to 5, 1 being the lowest and 5 being the highest. Here are some examples Content: [examples: Content, Rating] Please generate a customer review with a rating [first rating from examples]. Content:
RTE zero-shot

Step 1

Given a premise and a hypothesis, a model needs to predict whether the hypothesis can be logically inferred from the premise. The response should be either True if the hypothesis can be inferred from the premise, or False if it cannot be inferred. Here is the output format: Premise: Hypothesis: Label: True or False Please generate an example where the Label is [random(label)]. Premise:
RTE zero-shot topic

Step 1

Given a premise and a hypothesis, a model needs to predict whether the hypothesis can be logically inferred from the premise. The response should be either True if the hypothesis can be inferred from the premise, or False if it cannot be inferred. Here is the output format: Premise: Hypothesis: Label: True or False Please generate an example about [premise] where the Label is [random(label)]. Premise:
RTE one-shot

Step 1

Given a premise and a hypothesis, a model needs to predict whether the hypothesis can be logically inferred from the premise. The response should be either True if the hypothesis can be inferred from the premise, or False if it cannot be inferred. Here is an example: Premise: [example premise] Hypothesis: [example hypothesis] Label: [example label] Please generate another similar example where the Label is [example label]. Premise:
RTE few-shot (3 or 5)

Step 1

Given a premise and a hypothesis, a model needs to predict whether the hypothesis can be logically inferred from the premise. The response should be either True if the hypothesis can be inferred from the premise, or False if it cannot be inferred. Here are some examples: [examples: Premise, Hypothesis, Label] Please generate a similar example. Make sure the label is [first label from examples]. Premise:
SST-2 zero-shot

Step 1

Please generate a sentence that contains a [random(label)] sentiment. Sentence:
SST-2 zero-shot topic

Step 1

Please consider this topic for generation: [topic from the topic list]. Please generate a sentence that contains a [random(label)] sentiment. Sentence:
SST-2 one-shot

Step 1

The task is to predict whether the following sentence is positive or negative sentiment. Sentence: [example sentence] Label:[example label] Please generate a similar example on the same topic, including a Sentence and a Label. Sentence:
SST-2 few-shot (3 or 5)

Step 1

The task is to predict whether the following sentence is positive or negative sentiment. [examples: Sentence, Label] Please generate a similar example, including a Sentence and a Label. Sentence:
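
The two-step BoolQ zero-shot procedure listed above can be sketched as below. This is a hedged illustration, not the authors' implementation: `generate` stands for any function that sends a prompt to an LLM and returns its text response (a hypothetical placeholder, not a real library call), and `generate_boolq_example` is a name introduced only for this sketch. The prompt strings are copied from the table; the target label is drawn at random before the question is generated, matching the `[random([True, False])]` placeholder.

```python
import random
from typing import Callable

PASSAGE_PROMPT = "Please generate a random short passage. Passage:"
QUESTION_PROMPT = (
    "Please generate a True or False question based on the passage. "
    "The answer to the question must be {label} "
    "Passage: {passage} Question:"
)


def generate_boolq_example(generate: Callable[[str], str],
                           rng: random.Random) -> dict:
    """Build one synthetic BoolQ example in two LLM calls.

    Step 1 asks for a passage; Step 2 asks for a question about that passage
    whose answer must equal a label sampled up front.
    """
    label = rng.choice([True, False])
    passage = generate(PASSAGE_PROMPT).strip()
    question = generate(
        QUESTION_PROMPT.format(label=label, passage=passage)
    ).strip()
    return {"passage": passage, "question": question, "label": label}
```

For the one-shot variant, the same structure applies with the label taken from the seed example instead of being sampled, as the table specifies.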

Appendix D Appendix

Prompts used to evaluate LLM performance on each task.

Task Prompt Type Prompt
RTE zero-shot

Step 1

Given a premise and a hypothesis, a model needs to predict whether the hypothesis can be logically inferred from the premise. The response should be either True if the hypothesis can be inferred from the premise, or False if it cannot be inferred. Premise: [premise], Hypothesis: [hypothesis], Label:
RTE 0/1/3/5-shot

Step 1

Given a premise and a hypothesis, a model needs to predict whether the hypothesis can be logically inferred from the premise. The response should be either True if the hypothesis can be inferred from the premise, or False if it cannot be inferred. Here are some examples: [example premise, hypothesis, label] Premise: [premise], Hypothesis: [hypothesis], Label:
BoolQ zero-shot

Step 1

The task is to answer a question which is solely based on the content provided. Passage: [passage] , Question: [question], Label:
BoolQ 0/1/3/5-shot

Step 1

The task is to answer a question which is solely based on the content provided. Here are some examples: [example passage, question, label] Passage: [passage], Question: [question], Label:
Review zero-shot

Step 1

The task is to predict the rating of an Amazon customer review based on the content. The rating ranges from 1 to 5, with 1 being the lowest and 5 being the highest. Text: [text] , Label:
Review 0/1/3/5-shot

Step 1

The task is to predict the rating of an Amazon customer review based on the content. The rating ranges from 1 to 5, with 1 being the lowest and 5 being the highest. Here are some examples: [example text, label] Text: [text], Label:
NYT zero-shot

Step 1

The task is to predict the topic of a news headline. The topics include: ’sports’, ’arts, culture and entertainment’, ’business and finance’, ’health and wellness’, ’lifestyle and fashion’, ’science and technology’, ’politics’, ’crime’. Text:[text], Label:
NYT 0/1/3/5-shot

Step 1

The task is to predict the topic of a news headline. The topics include: ’sports’, ’arts, culture and entertainment’, ’business and finance’, ’health and wellness’, ’lifestyle and fashion’, ’science and technology’, ’politics’, ’crime’. Here are some examples: [example text, label] Text: [text], Label:
EMO zero-shot

Step 1

The task is to predict the emotion of a Twitter text. The emotions include six categories: sadness, joy, love, anger, fear, surprise. Text: [text], Label:
EMO 0/1/3/5-shot

Step 1

The task is to predict the emotion of a Twitter text. The emotions include six categories: sadness, joy, love, anger, fear, surprise. Here are some examples: [example text, label] Text: [text], Label:
SST-2 zero-shot

Step 1

The task is to predict whether the given sentence has a positive or negative sentiment. Sentence: [sentence], Label:
SST-2 0/1/3/5-shot

Step 1

The task is to predict whether the given sentence has a positive or negative sentiment. Here are some examples: [example sentence, label], Sentence: [sentence], Label:
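
The evaluation prompts above share one pattern: a task instruction, an optional block of in-context examples, and the query ending in "Label:". A minimal sketch for the EMO task is shown below. The instruction text is taken from the table; how each in-context example is serialized ("Text: ..., Label: ...") is an assumption, since the table only gives the placeholder `[example text, label]`, and `build_emo_eval_prompt` is a hypothetical helper name.

```python
EMO_INSTRUCTION = (
    "The task is to predict the emotion of a Twitter text. "
    "The emotions include six categories: "
    "sadness, joy, love, anger, fear, surprise."
)


def build_emo_eval_prompt(text: str, examples=()) -> str:
    """Assemble a zero-shot or k-shot evaluation prompt for EMO.

    `examples` is a sequence of (text, label) pairs; an empty sequence
    yields the zero-shot prompt.
    """
    parts = [EMO_INSTRUCTION]
    if examples:
        # Assumed serialization of the in-context examples.
        shots = " ".join(f"Text: {t}, Label: {l}" for t, l in examples)
        parts.append(f"Here are some examples: {shots}")
    parts.append(f"Text: {text}, Label:")
    return " ".join(parts)
```

With no examples the function reproduces the zero-shot prompt; passing 1, 3, or 5 labeled pairs gives the corresponding few-shot prompt.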